An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
Abstract: NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.
Main Authors: Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, Diyi Yang
Format: Article
Language: English
Published: The MIT Press, 2023-01-01
Series: Transactions of the Association for Computational Linguistics
Online Access: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00542/115238/An-Empirical-Survey-of-Data-Augmentation-for
author | Jiaao Chen, Derek Tam, Colin Raffel, Mohit Bansal, Diyi Yang |
collection | DOAJ |
description |
NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP. |
format | Article |
id | doaj.art-b26ac7ab1b3a4593bc9c80298cc4e58e |
institution | Directory Open Access Journal |
issn | 2307-387X |
language | English |
publishDate | 2023-01-01 |
publisher | The MIT Press |
record_format | Article |
series | Transactions of the Association for Computational Linguistics |
doi | 10.1162/tacl_a_00542 |
volume | 11 |
pages | 191-211 |
author affiliations | Jiaao Chen (Georgia Institute of Technology, USA; jchen896@gatech.edu); Derek Tam (UNC Chapel Hill, USA; dtredsox@cs.unc.edu); Colin Raffel (UNC Chapel Hill, USA; craffel@cs.unc.edu); Mohit Bansal (UNC Chapel Hill, USA; mbansal@cs.unc.edu); Diyi Yang (Georgia Institute of Technology, USA; dyang888@gatech.edu) |
title | An Empirical Survey of Data Augmentation for Limited Data Learning in NLP |
url | https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00542/115238/An-Empirical-Survey-of-Data-Augmentation-for |