Pre-training concept frequency is predictive of CLIP zero-shot performance

Web-crawled pre-training datasets are speculated to be key drivers of zero-shot generalization abilities of Vision-Language Models (VLMs) like CLIP, across a range of downstream classification and retrieval tasks, spanning diverse visual concepts. However, it is unclear how meaningful the term “zero...

Bibliographic Details
Main Authors: Udandarao, V; Prabhu, A; Torr, PHS; Bibi, A; Albanie, S; Bethge, M
Format: Conference item
Language: English
Published: OpenReview 2024