No "zero-shot" without exponential data: pretraining concept frequency determines multimodal model performance

Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal...

Bibliographic details
Main authors: Udandarao, V; Prabhu, A; Ghosh, A; Sharma, Y; Torr, PHS; Bibi, A; Albanie, S; Bethge, M
Format: Conference item
Language: English
Published: Neural Information Processing Systems Foundation 2024