Pre-training concept frequency is predictive of CLIP zero-shot performance
Web-crawled pre-training datasets are speculated to be key drivers of zero-shot generalization abilities of Vision-Language Models (VLMs) like CLIP, across a range of downstream classification and retrieval tasks, spanning diverse visual concepts. However, it is unclear how meaningful the term “zero...
| Main Authors | , , , , , |
|---|---|
| Format | Conference item |
| Language | English |
| Published | OpenReview, 2024 |