Visual Representation Learning from Synthetic Data

Full description

Representation learning is crucial for developing robust vision systems. The effectiveness of this learning process largely depends on the quality and quantity of data. Synthetic data presents unique advantages in terms of flexibility, scalability, and controllability. Recent advances in generative modeling have enabled the synthesis of photorealistic images and high-quality text, drastically increasing the viability of synthetic data. Despite these advancements, the application of synthetic data to representation learning and visual recognition tasks lags behind, with a noticeable performance gap between models trained on synthetic versus real data. In this thesis, we present our recent efforts to close this gap and use synthetic data to train state-of-the-art representation models. We begin by using synthetic texts from large language models to enhance the training of vision-language models. Next, we explore synthetic images generated by text-to-image models, examining the scaling laws that apply to these images when they are used for supervised model training. We also introduce a multi-positive contrastive loss specifically designed for synthetic images, demonstrating their advantages over real images in representation learning. Finally, we propose a novel framework for training vision models exclusively with synthetic texts and images, which surpasses state-of-the-art models trained on real images on tasks including fine-grained classification and semantic segmentation. These works establish a robust foundation for advancing generative models in representation learning and key computer vision tasks, and mark an advance in utilizing synthetic data across the data-centric AI ecosystem.
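The multi-positive contrastive loss mentioned in the abstract is the kind of objective a short sketch can make concrete. Below is a minimal PyTorch-style illustration of one standard way to write such a loss, where multiple images generated from the same caption are treated as positives for one another; the function name, argument layout, and temperature value are illustrative assumptions, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Contrastive loss with multiple positives per anchor (illustrative sketch).

    Assumption: images synthesized from the same caption are positives
    for one another, as described in the abstract.

    embeddings:  (N, D) features, one per synthetic image.
    caption_ids: (N,) id of the caption each image was generated from;
                 equal ids mark positive pairs.
    """
    z = F.normalize(embeddings, dim=1)        # unit-norm features
    logits = z @ z.t() / temperature          # (N, N) pairwise similarities

    # Mask out self-similarity so an image is never its own positive.
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits.masked_fill_(self_mask, float('-inf'))

    # Target distribution: uniform over all *other* images sharing the
    # anchor's caption -- the "multi-positive" part.
    pos = caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)
    pos = pos & ~self_mask
    target = pos.float() / pos.float().sum(dim=1, keepdim=True).clamp(min=1)

    # Cross-entropy between the target distribution and the softmax of logits.
    log_prob = F.log_softmax(logits, dim=1)
    return -(target * log_prob).sum(dim=1).mean()

# Example: a batch of 8 images generated from 4 captions (2 images each).
feats = torch.randn(8, 128)
caps = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(multi_positive_contrastive_loss(feats, caps))
```

The design point worth noting: instead of the one-hot target of standard InfoNCE, the target here is a distribution spread uniformly over all positives, which is what lets a single anchor exploit several synthetic images per caption.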

Bibliographic Details
Main Author: Fan, Lijie
Other Authors: Katabi, Dina
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: Ph.D.
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
Online Access: https://hdl.handle.net/1721.1/156315
Rights: In Copyright - Educational Use Permitted; copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/