Flamingo: a visual language model for few-shot learning


Bibliographic Details
Main Authors: Alayrac, J-B, Donahue, J, Luc, P, Miech, A, Barr, I, Hasson, Y, Lenc, K, Mensch, A, Millican, K, Reynolds, M, Ring, R, Rutherford, E, Cabi, S, Han, T, Gong, Z, Samangooei, S, Monteiro, M, Menick, J, Borgeaud, S, Brock, A, Nematzadeh, A, Sharifzadeh, S, Binkowski, M, Barreira, R, Vinyals, O, Zisserman, A, Simonyan, K
Format: Conference item
Language: English
Published: NeurIPS Proceedings 2022
Description: Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
Record ID: oxford-uuid:72fd0848-5ee3-43fa-bdab-d777270d7a58
Institution: University of Oxford
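The "bridging" innovation (i) in the description refers to Flamingo's method of inserting new tanh-gated cross-attention layers between the blocks of a frozen pretrained language model, with the gate initialized at zero so that training starts from the unmodified language model. The following is a minimal NumPy sketch of that gating idea only; the function names, shapes, and single-head attention are illustrative simplifications, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, vision, Wq, Wk, Wv):
    # Text tokens attend to visual tokens (single head, no output proj).
    q, k, v = text @ Wq, vision @ Wk, vision @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_xattn_block(text, vision, params, alpha):
    # tanh gate: with alpha initialized at 0, tanh(0) = 0 and the block
    # is an identity, preserving the frozen language model's behaviour.
    return text + np.tanh(alpha) * cross_attention(text, vision, *params)

rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(5, d))    # 5 text tokens
vision = rng.normal(size=(8, d))  # 8 visual tokens (e.g. resampler output)
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]

out_init = gated_xattn_block(text, vision, params, alpha=0.0)
assert np.allclose(out_init, text)  # zero gate: identity at initialization
```

Once training moves `alpha` away from zero, visual information flows into the text stream while the pretrained language-model weights stay frozen; this is how a sequence of interleaved images and text can condition generation without retraining the base model.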