AI models collapse when trained on recursively generated data

Stable Diffusion revolutionized image creation from descriptive text. GPT-2 (ref. 1), GPT-3(.5) (ref. 2) and GPT-4 (ref. 3) demonstrated high performance across a variety of language tasks. ChatGPT introduced such language models to the public. It is now clear that generative artificial intelligence (AI) such as large language models (LLMs) is here to stay and will substantially change the ecosystem of online text and images. Here we consider what may happen to GPT-{n} once LLMs contribute much of the text found online. We find that indiscriminate use of model-generated content in training causes irreversible defects in the resulting models, in which tails of the original content distribution disappear. We refer to this effect as ‘model collapse’ and show that it can occur in LLMs as well as in variational autoencoders (VAEs) and Gaussian mixture models (GMMs). We build theoretical intuition behind the phenomenon and portray its ubiquity among all learned generative models. We demonstrate that it must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web. Indeed, data collected from genuine human interactions with systems will become increasingly valuable in the presence of LLM-generated content in data crawled from the Internet.
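
The tail-loss mechanism described in the abstract is easy to reproduce in miniature. The sketch below is illustrative only and is not the authors' code: the 20-symbol Zipf-like distribution and the 100-sample "training set" per generation are assumptions made for the example. Each generation refits a categorical model to samples drawn from its predecessor; any symbol that ever receives zero counts can never reappear, which is the disappearance of distribution tails that the paper calls model collapse.

```python
# Toy sketch of 'model collapse' via recursive training on generated data.
# Not the paper's experimental setup; distribution and sample size are assumed.
import numpy as np

rng = np.random.default_rng(0)

# "Human" data distribution over 20 symbols with a long tail (Zipf-like).
true_probs = 1.0 / np.arange(1, 21)
true_probs /= true_probs.sum()

probs = true_probs.copy()
n_samples = 100            # training-set size per generation (illustrative)

for gen in range(1, 31):
    counts = rng.multinomial(n_samples, probs)   # generate synthetic data
    probs = counts / n_samples                   # "retrain" on that data only
    if gen in (1, 10, 30):
        survivors = int((probs > 0).sum())
        print(f"generation {gen:2d}: {survivors}/20 symbols still generated")

# Symbols lost to sampling noise are gone for good: training exclusively on
# model-generated data collapses the model onto the most common modes.
```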

Bibliographic Details
Main Authors: Shumailov, I; Shumaylov, Z; Zhao, Y; Papernot, N; Anderson, R; Gal, Y
Format: Journal article
Language: English
Published: Nature Research, 2024