Text Data Augmentation Techniques for Word Embeddings in Fake News Classification

Full description

Contemporary language models rely heavily on large corpora for their training: the larger the corpus, the better a model can capture semantic relationships. A frequent limitation is the restricted scope of the corpora available, and one way to address it is data augmentation, i.e., expanding the existing corpus. In this article, we analyze three augmentation techniques: Synonym Replacement, Back Translation, and Function Word Deletion (FWD). Using these techniques, we prepared several versions of the corpus used to train Word2Vec Skip-gram models. The techniques were validated extrinsically: the trained Skip-gram models produced word vectors that were used to classify fake news articles, and the performance measures of the resulting classifiers were analyzed. The study shows statistically significant differences in classifier outcomes between the augmented and original corpora. Specifically, Back Translation significantly improves accuracy, most notably with Support Vector Machine and Bernoulli Naive Bayes models; FWD improves Logistic Regression; and the original corpus performs best with Random Forest classification. The article also includes an intrinsic evaluation based on lexical semantic relations between word pairs, which reveals nuanced differences in semantic relations across the augmented corpora. Notably, the Back Translation (BT) corpus aligns better with established lexical resources, showing promising improvements in capturing specific semantic relationships.

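The abstract describes a pipeline: augment the corpus, train a Word2Vec Skip-gram model on it, and evaluate extrinsically by classifying fake news with the resulting word vectors. The short Python sketch below illustrates that pipeline under stated assumptions; it is not the authors' implementation. It uses function word deletion as the augmentation step (back translation would instead round-trip the text through a machine-translation system, and synonym replacement would substitute words from a thesaurus), gensim for the Skip-gram model, and scikit-learn for a toy classifier. The documents, labels, and function-word list are invented placeholders.

    # Illustrative sketch only: FWD augmentation -> Word2Vec Skip-gram -> classifier.
    from gensim.models import Word2Vec
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Placeholder corpus: tokenised documents with fake/real labels.
    documents = [
        ["the", "president", "signed", "the", "new", "law", "in", "parliament", "today"],
        ["officials", "confirmed", "the", "report", "on", "the", "economy"],
        ["aliens", "secretly", "control", "the", "world", "banks"],
        ["miracle", "cure", "hidden", "by", "doctors", "revealed"],
    ]
    labels = [0, 0, 1, 1]  # 0 = real, 1 = fake

    # Function Word Deletion: drop function words (tiny illustrative stop list).
    FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "by", "and"}
    fwd_documents = [[w for w in doc if w not in FUNCTION_WORDS] for doc in documents]

    # Train a Word2Vec Skip-gram model (sg=1) on the augmented corpus.
    w2v = Word2Vec(sentences=fwd_documents, vector_size=100, window=5,
                   min_count=1, sg=1, epochs=50, seed=42)

    # Represent each document as the mean of its word vectors.
    def doc_vector(tokens, model):
        vecs = [model.wv[w] for w in tokens if w in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    X = np.array([doc_vector(doc, w2v) for doc in fwd_documents])
    y = np.array(labels)

    # Extrinsic evaluation: train a classifier on the document vectors and score it.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

In the study itself, the same kind of document vectors would be fed to the Support Vector Machine, Bernoulli Naive Bayes, Logistic Regression, and Random Forest classifiers being compared, trained on a realistically sized corpus rather than this toy example.
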
Bibliographic Details
Main Authors: Jozef Kapusta, David Drzik, Kirsten Steflovic, Kitti Szabo Nagy
Format: Article
Language: English
Published: IEEE, 2024-01-01
Series: IEEE Access, Volume 12, pp. 31538-31550
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3369918
Author Affiliations: Faculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra, Nitra, Slovakia
Subjects: Back translation; function word deletion; synonym replacement; text data augmentation; Word2Vec; word embeddings
Online Access: https://ieeexplore.ieee.org/document/10445169/