Text Data Augmentation Techniques for Word Embeddings in Fake News Classification
Contemporary language models heavily rely on large corpora for their training. The larger the corpus, the better a model can capture various semantic relationships. The issue at hand appears to be the limited scope of the corpora used. One potential solution to this problem is the application of data augmentation techniques to expand the existing corpus.
Main Authors: | Jozef Kapusta, David Drzik, Kirsten Steflovic, Kitti Szabo Nagy |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2024-01-01 |
Series: | IEEE Access |
Subjects: | Back translation; function word deletion; synonym replacement; text data augmentation; Word2Vec; word embeddings |
Online Access: | https://ieeexplore.ieee.org/document/10445169/ |
_version_ | 1797272681662382080 |
---|---|
author | Jozef Kapusta David Drzik Kirsten Steflovic Kitti Szabo Nagy |
author_facet | Jozef Kapusta David Drzik Kirsten Steflovic Kitti Szabo Nagy |
author_sort | Jozef Kapusta |
collection | DOAJ |
description | Contemporary language models heavily rely on large corpora for their training. The larger the corpus, the better a model can capture various semantic relationships. The issue at hand appears to be the limited scope of the corpora used. One potential solution to this problem is the application of data augmentation techniques to expand the existing corpus. Data augmentation encompasses several techniques for corpus augmentation. In this article, we delve deeper into the analysis of three techniques: Synonym Replacement, Back Translation, and Reduction of Function Words. Utilizing these three techniques, we prepared diverse versions of the corpus employed for training Word2Vec Skip-gram models. These techniques were validated through extrinsic evaluation, wherein Word2Vec Skip-gram models were used to generate word vectors for classifying fake news articles. Performance measures of the generated classifiers were analyzed. The study highlights significant statistical differences in classifier outcomes between augmented and original corpora. Specifically, Back Translation significantly enhances accuracy, notably with Support Vector and Bernoulli Naive Bayes models. Conversely, the Reduction of Function Words (FWD) improves Logistic Regression, while the original corpus excels in Random Forest classification. The article also includes an intrinsic evaluation involving lexical semantic relations between word pairs. The intrinsic evaluation highlights nuanced differences in semantic relations across augmented corpora. Notably, the Back Translation (BT) corpus better aligns with established lexical resources, showcasing promising improvements in understanding specific semantic relationships. |
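The description names three corpus-expansion techniques: Synonym Replacement, Back Translation, and Reduction of Function Words (FWD). A minimal sketch of two of them is shown below; the `SYNONYMS` map and `FUNCTION_WORDS` set are toy stand-ins for illustration only (the article's actual resources are not specified here), and Back Translation would additionally require a translation model, which is omitted.

```python
import random

# Toy lexical resources for illustration; a real pipeline would draw
# synonyms from a thesaurus such as WordNet and use a translation
# model for the Back Translation technique.
SYNONYMS = {"fake": ["false", "bogus"], "news": ["reports", "stories"]}
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "is", "are"}

def synonym_replacement(tokens, n=1, rng=random.Random(0)):
    """Replace up to n tokens that have an entry in SYNONYMS."""
    out = list(tokens)
    candidates = [i for i, t in enumerate(out) if t in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out

def function_word_deletion(tokens):
    """Drop function words, keeping only content words (FWD)."""
    return [t for t in tokens if t not in FUNCTION_WORDS]

text = "the fake news is in the report".split()
print(function_word_deletion(text))  # ['fake', 'news', 'report']
```

The augmented token sequences produced this way would then be fed, alongside the original corpus, into Word2Vec Skip-gram training, as the described evaluation does.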
first_indexed | 2024-03-07T14:32:50Z |
format | Article |
id | doaj.art-4c989e39e15146088571517ce29cf030 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-07T14:32:50Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-4c989e39e15146088571517ce29cf0302024-03-06T00:00:35ZengIEEEIEEE Access2169-35362024-01-0112315383155010.1109/ACCESS.2024.336991810445169Text Data Augmentation Techniques for Word Embeddings in Fake News ClassificationJozef Kapusta0https://orcid.org/0000-0002-8285-2404David Drzik1https://orcid.org/0000-0002-1878-577XKirsten Steflovic2https://orcid.org/0000-0003-4327-5971Kitti Szabo Nagy3https://orcid.org/0000-0001-8449-0737Faculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra, Nitra, SlovakiaFaculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra, Nitra, SlovakiaFaculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra, Nitra, SlovakiaFaculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra, Nitra, SlovakiaContemporary language models heavily rely on large corpora for their training. The larger the corpus, the better a model can capture various semantic relationships. The issue at hand appears to be the limited scope of the corpora used. One potential solution to this problem is the application of data augmentation techniques to expand the existing corpus. Data augmentation encompasses several techniques for corpus augmentation. In this article, we delve deeper into the analysis of three techniques: Synonym Replacement, Back Translation, and Reduction of Function Words. Utilizing these three techniques, we prepared diverse versions of the corpus employed for training Word2Vec Skip-gram models. These techniques were validated through extrinsic evaluation, wherein Word2Vec Skip-gram models were used to generate word vectors for classifying fake news articles. Performance measures of the generated classifiers were analyzed. The study highlights significant statistical differences in classifier outcomes between augmented and original corpora. Specifically, Back Translation significantly enhances accuracy, notably with Support Vector and Bernoulli Naive Bayes models. Conversely, the Reduction of Function Words (FWD) improves Logistic Regression, while the original corpus excels in Random Forest classification. The article also includes an intrinsic evaluation involving lexical semantic relations between word pairs. The intrinsic evaluation highlights nuanced differences in semantic relations across augmented corpora. Notably, the Back Translation (BT) corpus better aligns with established lexical resources, showcasing promising improvements in understanding specific semantic relationships.https://ieeexplore.ieee.org/document/10445169/Back translationfunction word deletionsynonym replacementtext data augmentationWord2Vecword embeddings |
spellingShingle | Jozef Kapusta David Drzik Kirsten Steflovic Kitti Szabo Nagy Text Data Augmentation Techniques for Word Embeddings in Fake News Classification IEEE Access Back translation function word deletion synonym replacement text data augmentation Word2Vec word embeddings |
title | Text Data Augmentation Techniques for Word Embeddings in Fake News Classification |
title_full | Text Data Augmentation Techniques for Word Embeddings in Fake News Classification |
title_fullStr | Text Data Augmentation Techniques for Word Embeddings in Fake News Classification |
title_full_unstemmed | Text Data Augmentation Techniques for Word Embeddings in Fake News Classification |
title_short | Text Data Augmentation Techniques for Word Embeddings in Fake News Classification |
title_sort | text data augmentation techniques for word embeddings in fake news classification |
topic | Back translation function word deletion synonym replacement text data augmentation Word2Vec word embeddings |
url | https://ieeexplore.ieee.org/document/10445169/ |
work_keys_str_mv | AT jozefkapusta textdataaugmentationtechniquesforwordembeddingsinfakenewsclassification AT daviddrzik textdataaugmentationtechniquesforwordembeddingsinfakenewsclassification AT kirstensteflovic textdataaugmentationtechniquesforwordembeddingsinfakenewsclassification AT kittiszabonagy textdataaugmentationtechniquesforwordembeddingsinfakenewsclassification |