SynthCLIP: are we ready for a fully synthetic CLIP training?

We present SynthCLIP, a novel framework for training CLIP models with entirely synthetic text-image pairs, significantly departing from previous methods relying on real data. Leveraging recent text-to-image (TTI) generative networks and large language models (LLMs), we are able to generate synthetic datasets of images and corresponding captions at any scale, with no human intervention. With training at scale, SynthCLIP achieves performance comparable to CLIP models trained on real datasets. We also introduce SynthCI-30M, a purely synthetic dataset comprising 30 million captioned images. Our code, trained models, and generated data are released at: https://github.com/hammoudhasan/SynthCLIP.
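
The abstract describes a two-stage generation pipeline: a large language model proposes captions, a text-to-image model renders an image for each caption, and the resulting pairs are then used to train CLIP. The sketch below is purely illustrative and is not taken from the authors' released code; the callable interfaces (llm, tti), the prompt wording, and the file layout are assumptions made here for clarity.

    # Illustrative sketch of the caption -> image -> pair loop described in the
    # abstract. The `llm` and `tti` interfaces and the prompt are hypothetical
    # assumptions, not the released pipeline (see
    # https://github.com/hammoudhasan/SynthCLIP for the actual implementation).
    import os
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class SyntheticPair:
        caption: str
        image_path: str

    def generate_captions(llm: Callable[[str], str],
                          concepts: List[str],
                          per_concept: int) -> List[str]:
        """Ask an LLM (any prompt -> text callable) for short, image-like captions."""
        captions = []
        for concept in concepts:
            for _ in range(per_concept):
                prompt = f"Write one short caption describing a photo of {concept}."
                captions.append(llm(prompt).strip())
        return captions

    def render_pairs(tti: Callable[[str], "PIL.Image.Image"],
                     captions: List[str],
                     out_dir: str) -> List[SyntheticPair]:
        """Render one image per caption with a text-to-image model and save each pair."""
        os.makedirs(out_dir, exist_ok=True)
        pairs = []
        for i, caption in enumerate(captions):
            image = tti(caption)  # e.g. a diffusion-model call returning a PIL image
            path = os.path.join(out_dir, f"{i:08d}.png")
            image.save(path)
            pairs.append(SyntheticPair(caption=caption, image_path=path))
        return pairs

The resulting caption-image pairs would then feed a standard CLIP contrastive training run, which is what the paper evaluates at scale.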

Detailed specification

Bibliographic description
Main authors: Hammoud, HAAK; Itani, H; Pizzati, F; Torr, P; Bibi, A; Ghanem, B
Format: Conference item
Language: English
Published: IEEE, 2024
Institution: University of Oxford
Record ID: oxford-uuid:bc525760-1577-4403-acfc-4507320f528e