Text Data Augmentation Techniques for Word Embeddings in Fake News Classification

Full description

Contemporary language models rely heavily on large corpora for their training: the larger the corpus, the better a model can capture semantic relationships. A frequent limitation is the restricted scope of the corpora available, and one way to address it is data augmentation, i.e., expanding the existing corpus. In this article, we analyze three augmentation techniques: Synonym Replacement, Back Translation, and Function Word Deletion (FWD). Using these techniques, we prepared several versions of the corpus used to train Word2Vec Skip-gram models. The techniques were validated extrinsically: the trained Skip-gram models produced word vectors that were used to classify fake news articles, and the performance measures of the resulting classifiers were analyzed. The study shows statistically significant differences in classifier outcomes between the augmented and original corpora. Specifically, Back Translation significantly improves accuracy, most notably with Support Vector Machine and Bernoulli Naive Bayes models; FWD improves Logistic Regression; and the original corpus performs best with Random Forest classification. The article also includes an intrinsic evaluation based on lexical semantic relations between word pairs, which reveals nuanced differences in semantic relations across the augmented corpora. Notably, the Back Translation (BT) corpus aligns better with established lexical resources, showing promising improvements in capturing specific semantic relationships.

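The abstract describes a pipeline: augment the corpus, train a Word2Vec Skip-gram model on it, and evaluate extrinsically by classifying fake news with the resulting word vectors. The short Python sketch below illustrates that pipeline under stated assumptions; it is not the authors' implementation. It uses function word deletion as the augmentation step (back translation would instead round-trip the text through a machine-translation system, and synonym replacement would substitute words from a thesaurus), gensim for the Skip-gram model, and scikit-learn for a toy classifier. The documents, labels, and function-word list are invented placeholders.

    # Illustrative sketch only: FWD augmentation -> Word2Vec Skip-gram -> classifier.
    from gensim.models import Word2Vec
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Placeholder corpus: tokenised documents with fake/real labels.
    documents = [
        ["the", "president", "signed", "the", "new", "law", "in", "parliament", "today"],
        ["officials", "confirmed", "the", "report", "on", "the", "economy"],
        ["aliens", "secretly", "control", "the", "world", "banks"],
        ["miracle", "cure", "hidden", "by", "doctors", "revealed"],
    ]
    labels = [0, 0, 1, 1]  # 0 = real, 1 = fake

    # Function Word Deletion: drop function words (tiny illustrative stop list).
    FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "on", "by", "and"}
    fwd_documents = [[w for w in doc if w not in FUNCTION_WORDS] for doc in documents]

    # Train a Word2Vec Skip-gram model (sg=1) on the augmented corpus.
    w2v = Word2Vec(sentences=fwd_documents, vector_size=100, window=5,
                   min_count=1, sg=1, epochs=50, seed=42)

    # Represent each document as the mean of its word vectors.
    def doc_vector(tokens, model):
        vecs = [model.wv[w] for w in tokens if w in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    X = np.array([doc_vector(doc, w2v) for doc in fwd_documents])
    y = np.array(labels)

    # Extrinsic evaluation: train a classifier on the document vectors and score it.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))

In the study itself, the same kind of document vectors would be fed to the Support Vector Machine, Bernoulli Naive Bayes, Logistic Regression, and Random Forest classifiers being compared, trained on a realistically sized corpus rather than this toy example.
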
Bibliographic Details
Main Authors: Jozef Kapusta, David Drzik, Kirsten Steflovic, Kitti Szabo Nagy
Format: Article
Language: English
Published: IEEE, 2024-01-01
Series: IEEE Access, Volume 12, pp. 31538-31550
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3369918
Author Affiliations: Faculty of Natural Sciences and Informatics, Constantine the Philosopher University in Nitra, Nitra, Slovakia
Subjects: Back translation; function word deletion; synonym replacement; text data augmentation; Word2Vec; word embeddings
Online Access: https://ieeexplore.ieee.org/document/10445169/