No "zero-shot" without exponential data: pretraining concept frequency determines multimodal model performance
Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal...
Main authors: | , , , , , , , |
---|---|
Format: | Conference item |
Language: | English |
Published: | Neural Information Processing Systems Foundation, 2024 |