Pre-training concept frequency is predictive of CLIP zero-shot performance

Web-crawled pre-training datasets are speculated to be key drivers of zero-shot generalization abilities of Vision-Language Models (VLMs) like CLIP, across a range of downstream classification and retrieval tasks, spanning diverse visual concepts. However, it is unclear how meaningful the term “zero...

Bibliographic Details
Main Authors: Udandarao, V, Prabhu, A, Torr, PHS, Bibi, A, Albanie, S, Bethge, M
Format: Conference item
Language: English
Published: OpenReview 2024
_version_ 1811139642522599424
author Udandarao, V
Prabhu, A
Torr, PHS
Bibi, A
Albanie, S
Bethge, M
collection OXFORD
description Web-crawled pre-training datasets are speculated to be key drivers of the zero-shot generalization abilities of Vision-Language Models (VLMs) like CLIP across a range of downstream classification and retrieval tasks spanning diverse visual concepts. However, it is unclear how meaningful the term “zero-shot” generalization is for CLIP, as its pre-training datasets (e.g., YFCC-15M, LAION-2B) likely contain many samples of the “zero-shot” concept. To study this, for the first time, we analyze the composition of concepts in the pre-training datasets of CLIP. We robustly demonstrate that, far from being “zero-shot”, CLIP’s zero-shot classification performance is strongly predictable from the frequency of a concept seen during pre-training. Precisely, the downstream zero-shot performance improves linearly as the pre-training concept frequency grows exponentially, i.e., they follow a log-linear scaling trend. Our data-centric investigation further highlights two key findings: (1) the extreme “data-hunger” of CLIP, i.e., a growing inability to make “zero-shot” predictions on long-tailed concepts, and (2) a surprising degree of misalignment across image-text pairs in the pre-training datasets.
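The log-linear trend named in the description can be illustrated with a minimal sketch. It assumes the relationship takes the form accuracy = intercept + slope * log10(frequency). The function name fit_log_linear and all numbers below are hypothetical and synthetic; they are not the authors' code or results.

```python
import math


def fit_log_linear(frequencies, accuracies):
    """Ordinary least-squares fit of accuracy = intercept + slope * log10(frequency)."""
    xs = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic values for illustration only (assumption, not data from the paper):
# each 10x increase in concept frequency adds the same amount of accuracy.
freqs = [10, 100, 1_000, 10_000, 100_000]
accs = [0.10, 0.25, 0.40, 0.55, 0.70]

slope, intercept = fit_log_linear(freqs, accs)
print(f"accuracy gain per 10x more pre-training examples: {slope:.2f}")
```

Under this form, a constant gain in accuracy requires multiplying the concept frequency by a constant factor, which is one way to read the "data-hunger" finding above.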
first_indexed 2024-09-25T04:09:20Z
format Conference item
id oxford-uuid:b1d94a36-389e-491a-a21e-4e7a7f27790f
institution University of Oxford
language English
last_indexed 2024-09-25T04:09:20Z
publishDate 2024
publisher OpenReview
record_format dspace
spelling oxford-uuid:b1d94a36-389e-491a-a21e-4e7a7f27790f; 2024-06-10T16:50:06Z; Pre-training concept frequency is predictive of CLIP zero-shot performance; Conference item; http://purl.org/coar/resource_type/c_5794; uuid:b1d94a36-389e-491a-a21e-4e7a7f27790f; English; Symplectic Elements; OpenReview; 2024; Udandarao, V; Prabhu, A; Torr, PHS; Bibi, A; Albanie, S; Bethge, M
title Pre-training concept frequency is predictive of CLIP zero-shot performance