An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

Abstract: NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.
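The abstract names several augmentation families (token-level, sentence-level, adversarial, and hidden-space). As a point of reference only, below is a minimal, self-contained sketch of what a token-level augmentation can look like (random swap and random deletion, in the spirit of EDA-style methods); the function names and parameters are illustrative assumptions and are not taken from the paper or its code.

import random

# Hypothetical helpers illustrating token-level augmentation; not the survey's implementation.

def random_swap(tokens, n_swaps=1, rng=None):
    # Return a copy of `tokens` with `n_swaps` randomly chosen position pairs swapped.
    rng = rng or random.Random()
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_deletion(tokens, p=0.1, rng=None):
    # Drop each token independently with probability `p`, keeping at least one token.
    if not tokens:
        return []
    rng = rng or random.Random()
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]

if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "data augmentation helps models trained on limited labeled data".split()
    print(" ".join(random_swap(sentence, n_swaps=2, rng=rng)))
    print(" ".join(random_deletion(sentence, p=0.2, rng=rng)))

Augmented copies produced this way are typically added to the original training set so the model sees more varied inputs per labeled example.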

Bibliographic Details
Main Authors: Jiaao Chen (Georgia Institute of Technology), Derek Tam (UNC Chapel Hill), Colin Raffel (UNC Chapel Hill), Mohit Bansal (UNC Chapel Hill), Diyi Yang (Georgia Institute of Technology)
Format: Article
Language: English
Published: The MIT Press, 2023-01-01
Series: Transactions of the Association for Computational Linguistics, Vol. 11, pp. 191-211
ISSN: 2307-387X
DOI: 10.1162/tacl_a_00542
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00542/115238/An-Empirical-Survey-of-Data-Augmentation-for