Pre-training concept frequency is predictive of CLIP zero-shot performance
Web-crawled pre-training datasets are speculated to be key drivers of the zero-shot generalization abilities of Vision-Language Models (VLMs) like CLIP, across a range of downstream classification and retrieval tasks spanning diverse visual concepts. However, it is unclear how meaningful the term “zero-shot” generalization is for CLIP, as its pre-training datasets (e.g., YFCC-15M, LAION-2B) likely contain many samples of the “zero-shot” concept. To study this, for the first time, we analyze the composition of concepts in the pre-training datasets of CLIP. We robustly demonstrate that, far from being “zero-shot”, CLIP’s zero-shot classification performance is strongly predictable from the frequency with which a concept is seen during pre-training. Precisely, downstream zero-shot performance improves linearly as the pre-training concept frequency grows exponentially, i.e., the two follow a log-linear scaling trend. Our data-centric investigation further highlights two key findings: (1) the extreme “data hunger” of CLIP, i.e., its growing inability to make “zero-shot” predictions on long-tailed concepts, and (2) a surprising degree of misalignment across image-text pairs in the pre-training datasets.
Main Authors: Udandarao, V; Prabhu, A; Torr, PHS; Bibi, A; Albanie, S; Bethge, M
Format: Conference item
Language: English
Published: OpenReview, 2024
Record ID: oxford-uuid:b1d94a36-389e-491a-a21e-4e7a7f27790f
Institution: University of Oxford
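The abstract's central claim is a log-linear relationship: zero-shot accuracy improves roughly linearly in the logarithm of pre-training concept frequency. Below is a minimal sketch of how such a trend could be fit, assuming per-concept frequency counts and downstream accuracies are already in hand; the function name and the numbers in the usage example are hypothetical illustrations, not values from the paper.

```python
import numpy as np

def fit_log_linear(concept_freqs, zero_shot_accs):
    """Fit accuracy ~ a * log(frequency) + b, the log-linear trend
    described in the abstract. Inputs are per-concept pre-training
    frequencies and downstream zero-shot accuracies."""
    log_freqs = np.log(np.asarray(concept_freqs, dtype=float))
    accs = np.asarray(zero_shot_accs, dtype=float)
    # Least-squares line in (log-frequency, accuracy) space.
    slope, intercept = np.polyfit(log_freqs, accs, deg=1)
    predicted = slope * log_freqs + intercept
    # Coefficient of determination as a rough goodness-of-fit check.
    ss_res = np.sum((accs - predicted) ** 2)
    ss_tot = np.sum((accs - accs.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared

# Hypothetical placeholder values, for illustration only: frequencies
# spanning several orders of magnitude, with accuracy rising roughly
# linearly per tenfold increase in frequency (the log-linear pattern).
freqs = [10, 100, 1_000, 10_000, 100_000]
accs = [0.05, 0.18, 0.33, 0.46, 0.61]
slope, intercept, r2 = fit_log_linear(freqs, accs)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r2:.3f}")
```

Under this reading, each tenfold increase in a concept's pre-training frequency buys a roughly constant additive gain in zero-shot accuracy, which is why long-tailed (rarely seen) concepts fare poorly: the "data hunger" finding in the abstract.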