No "zero-shot" without exponential data: pretraining concept frequency determines multimodal model performance
Web-crawled pretraining datasets underlie the impressive “zero-shot” evaluation performance of multimodal models, such as CLIP for classification/retrieval and Stable-Diffusion for image generation. However, it is unclear how meaningful the notion of “zero-shot” generalization is for such multimodal...
Main authors: | |
---|---|
Material type: | Conference item |
Language: | English |
Published: | Neural Information Processing Systems Foundation, 2024 |