SynthCLIP: are we ready for a fully synthetic CLIP training?
We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLM), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images. Our code, trained models, and generated data are released at: https://github.com/hammoudhasan/SynthCLIP

Main authors: | Hammoud, HAAK; Itani, H; Pizzati, F; Torr, P; Bibi, A; Ghanem, B |
---|---|
Format: | Conference item |
Language: | English |
Published: | IEEE, 2024 |
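The abstract describes a pipeline in which an LLM generates captions, a text-to-image model renders a matching image for each caption, and a CLIP model is trained on the resulting synthetic pairs with the usual contrastive objective. The snippet below is only a minimal, illustrative sketch of that symmetric contrastive (InfoNCE) loss; it is not the authors' released implementation (linked above), and the function name, batch shapes, and temperature value are assumptions made for the example.

```python
# Minimal sketch of the symmetric CLIP contrastive loss, as would be applied
# to a batch of synthetic (image, caption) pairs. Not the SynthCLIP codebase;
# see https://github.com/hammoudhasan/SynthCLIP for the authors' release.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    In the fully synthetic setting, the i-th caption is produced by an LLM
    and the i-th image by a text-to-image model from that caption, so the
    diagonal entries are the only positive pairs in the batch.
    """
    # L2-normalise so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature

    # Ground truth: matching pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average of image->text and text->image cross-entropy.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Random stand-in embeddings; a real run would encode TTI-generated
    # images and their LLM-generated captions with the two CLIP encoders.
    batch, dim = 8, 512
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(clip_contrastive_loss(img, txt).item())
```

In an actual training run, the random tensors would be replaced by the outputs of the image and text encoders over batches drawn from the generated dataset.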
author | Hammoud, HAAK; Itani, H; Pizzati, F; Torr, P; Bibi, A; Ghanem, B |
collection | OXFORD |
description | We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLM), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images. Our code, trained models, and generated data are released at: https://github.com/hammoudhasan/SynthCLIP |
format | Conference item |
id | oxford-uuid:bc525760-1577-4403-acfc-4507320f528e |
institution | University of Oxford |
language | English |
publishDate | 2024 |
publisher | IEEE |
title | SynthCLIP: are we ready for a fully synthetic CLIP training? |