Visual Representation Learning from Synthetic Data
Representation learning is crucial for developing robust vision systems. The effectiveness of this learning process largely depends on the quality and quantity of data. Synthetic data presents unique advantages in terms of flexibility, scalability, and controllability. Recent advances in generative modeling have enabled the synthesis of photorealistic images and high-quality text, drastically increasing the viability of synthetic data. Despite these advancements, the application of synthetic data for representation learning and visual recognition tasks lags behind, with a noticeable performance gap between models trained on synthetic versus real data. In this thesis we demonstrate our recent efforts to close this gap and utilize synthetic data to train state-of-the-art representation models. We begin by utilizing synthetic texts from large language models to enhance the training of vision-language models. Next, we explore synthetic images generated by text-to-image models, examining the scaling laws applicable to these images when used for supervised model training. We also introduce a multi-positive contrastive loss specifically designed for synthetic images, demonstrating their advantages over real images in representation learning. Finally, we propose a novel framework for training vision models exclusively with synthetic texts and images, which achieves superior performance, surpassing state-of-the-art models trained on real images in tasks including fine-grained classification and semantic segmentation. These works establish a robust foundation for advancing generative models in representation learning and solving key computer vision tasks, and mark an advance in utilizing synthetic data for improved representation learning across the data-centric AI ecosystem.
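The abstract mentions a multi-positive contrastive loss for synthetic images, where several images generated from the same caption act as positives for one another. Below is a minimal PyTorch sketch of how such a loss is commonly structured; it is an illustration under assumptions, not the thesis's exact formulation, and the function name, the 0.1 temperature, and the label-based positive grouping are all choices made for this example.

```python
# Illustrative sketch of a multi-positive contrastive loss (assumed form,
# not the thesis's exact implementation). Rows of `embeddings` that share
# a label (e.g. images generated from the same caption) are treated as
# mutual positives, and the cross-entropy target is spread uniformly over
# all positives instead of a single one.
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings: torch.Tensor,
                                    labels: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (N, D) features; labels: (N,) ints grouping positives."""
    z = F.normalize(embeddings, dim=1)
    logits = z @ z.t() / temperature              # (N, N) cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)  # a sample is not its own positive
    # Target distribution: uniform over all same-label (positive) samples.
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    target = pos.float()
    target = target / target.sum(dim=1, keepdim=True).clamp(min=1.0)
    # Cross-entropy between the softmax over similarities and the target.
    log_prob = F.log_softmax(logits, dim=1)
    return -(target * log_prob).sum(dim=1).mean()
```

In this sketch, a batch built from, say, four generated images per caption would assign one label id per caption, giving each image three positives; samples with no positives in the batch contribute zero loss.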
Main Author: | Fan, Lijie |
---|---|
Other Authors: | Katabi, Dina |
Format: | Thesis |
Published: | Massachusetts Institute of Technology, 2024 |
Online Access: | https://hdl.handle.net/1721.1/156315 |
Department: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Degree: | Ph.D. |
Rights: | In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/ |