Flamingo: a visual language model for few-shot learning


Bibliographic Details
Main Authors: Alayrac, J-B, Donahue, J, Luc, P, Miech, A, Barr, I, Hasson, Y, Lenc, K, Mensch, A, Millican, K, Reynolds, M, Ring, R, Rutherford, E, Cabi, S, Han, T, Gong, Z, Samangooei, S, Monteiro, M, Menick, J, Borgeaud, S, Brock, A, Nematzadeh, A, Sharifzadeh, S, Binkowski, M, Barreira, R, Vinyals, O, Zisserman, A, Simonyan, K
Format: Conference item
Language: English
Published: NeurIPS Proceedings 2022
Description: Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer, captioning tasks, which evaluate the ability to describe a scene or an event, and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data.
Record ID: oxford-uuid:72fd0848-5ee3-43fa-bdab-d777270d7a58
Institution: University of Oxford
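The "bridging" innovation (i) in the description refers to Flamingo's method of inserting new tanh-gated cross-attention layers between the blocks of a frozen pretrained language model, with the gate initialized at zero so that training starts from the unmodified language model. The following is a minimal NumPy sketch of that gating idea only; the function names, shapes, and single-head attention are illustrative simplifications, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text, vision, Wq, Wk, Wv):
    # Text tokens attend to visual tokens (single head, no output proj).
    q, k, v = text @ Wq, vision @ Wk, vision @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def gated_xattn_block(text, vision, params, alpha):
    # tanh gate: with alpha initialized at 0, tanh(0) = 0 and the block
    # is an identity, preserving the frozen language model's behaviour.
    return text + np.tanh(alpha) * cross_attention(text, vision, *params)

rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(5, d))    # 5 text tokens
vision = rng.normal(size=(8, d))  # 8 visual tokens (e.g. resampler output)
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]

out_init = gated_xattn_block(text, vision, params, alpha=0.0)
assert np.allclose(out_init, text)  # zero gate: identity at initialization
```

Once training moves `alpha` away from zero, visual information flows into the text stream while the pretrained language-model weights stay frozen; this is how a sequence of interleaved images and text can condition generation without retraining the base model.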