Pre-training concept frequency is predictive of CLIP zero-shot performance
Web-crawled pre-training datasets are speculated to be key drivers of the zero-shot generalization abilities of Vision-Language Models (VLMs) like CLIP, across a range of downstream classification and retrieval tasks spanning diverse visual concepts. However, it is unclear how meaningful the term “zero-shot” generalization is for CLIP, as its pre-training datasets (e.g., YFCC-15M, LAION-2B) likely contain many samples of the “zero-shot” concept. To study this, for the first time, we analyze the composition of concepts in the pre-training datasets of CLIP. We robustly demonstrate that, far from being “zero-shot”, CLIP’s zero-shot classification performance is strongly predictable from the frequency with which a concept is seen during pre-training. Precisely, downstream zero-shot performance improves linearly as the pre-training concept frequency grows exponentially, i.e., the two follow a log-linear scaling trend. Our data-centric investigation further highlights two key findings: (1) the extreme “data hunger” of CLIP, i.e., its growing inability to make “zero-shot” predictions on long-tailed concepts, and (2) a surprising degree of misalignment across image-text pairs in the pre-training datasets.
Main Authors: Udandarao, V; Prabhu, A; Torr, PHS; Bibi, A; Albanie, S; Bethge, M
Format: Conference item
Language: English
Published: OpenReview, 2024
Record ID: oxford-uuid:b1d94a36-389e-491a-a21e-4e7a7f27790f
Institution: University of Oxford
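The abstract's central claim is a log-linear relationship: zero-shot accuracy improves roughly linearly in the logarithm of pre-training concept frequency. Below is a minimal sketch of how such a trend could be fit, assuming per-concept frequency counts and downstream accuracies are already in hand; the function name and the numbers in the usage example are hypothetical illustrations, not values from the paper.

```python
import numpy as np

def fit_log_linear(concept_freqs, zero_shot_accs):
    """Fit accuracy ~ a * log(frequency) + b, the log-linear trend
    described in the abstract. Inputs are per-concept pre-training
    frequencies and downstream zero-shot accuracies."""
    log_freqs = np.log(np.asarray(concept_freqs, dtype=float))
    accs = np.asarray(zero_shot_accs, dtype=float)
    # Least-squares line in (log-frequency, accuracy) space.
    slope, intercept = np.polyfit(log_freqs, accs, deg=1)
    predicted = slope * log_freqs + intercept
    # Coefficient of determination as a rough goodness-of-fit check.
    ss_res = np.sum((accs - predicted) ** 2)
    ss_tot = np.sum((accs - accs.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical placeholder values, for illustration only: frequencies
# spanning several orders of magnitude, with accuracy rising roughly
# linearly per tenfold increase in frequency (the log-linear pattern).
freqs = [10, 100, 1_000, 10_000, 100_000]
accs = [0.05, 0.18, 0.33, 0.46, 0.61]
slope, intercept, r2 = fit_log_linear(freqs, accs)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

Under this reading, each tenfold increase in a concept's pre-training frequency buys a roughly constant additive gain in zero-shot accuracy, which is why long-tailed (rarely seen) concepts fare poorly: the "data hunger" finding in the abstract.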