The impact of synthetic text generation for sentiment analysis using GAN based models

Data imbalance is a common issue in which the number of instances in one or more categories far exceeds the others, and the educational domain is no exception. The difficulty of collecting course feedback at scale and the lack of publicly available datasets in this domain limit models’ performa...


Bibliographic Details
Main Authors: Ali Shariq Imran, Ru Yang, Zenun Kastrati, Sher Muhammad Daudpota, Sarang Shaikh
Format: Article
Language: English
Published: Elsevier 2022-09-01
Series: Egyptian Informatics Journal
Subjects: Text generation, Sentiment analysis, SentiGAN, CatGAN, Deep learning, Language modeling
Online Access: http://www.sciencedirect.com/science/article/pii/S1110866522000342
author Ali Shariq Imran
Ru Yang
Zenun Kastrati
Sher Muhammad Daudpota
Sarang Shaikh
author_sort Ali Shariq Imran
collection DOAJ
description Data imbalance is a common issue in which the number of instances in one or more categories far exceeds the others, and the educational domain is no exception. The difficulty of collecting course feedback at scale and the lack of publicly available datasets in this domain limit models’ performance, especially for deep neural network based models, which are data hungry. A model trained on such an imbalanced dataset would naturally favor the majority class. However, the minority class could be critical for decision-making in prediction systems, and therefore it is usually desirable to train a model with equally high class-level accuracy. This paper addresses the data imbalance issue for sentiment analysis of users’ opinions on two educational feedback datasets, using deep learning models for synthetic text generation. Two state-of-the-art text generation GAN models, namely CatGAN and SentiGAN, are employed to synthesize text that balances the highly imbalanced datasets in this study. Particular emphasis is given to the diversity of synthetically generated samples for populating minority classes. Experimental results show significant improvement in models’ performance on the highly imbalanced CR23K and CR100K datasets after balancing with synthetic data for the sentiment classification task.
first_indexed 2024-04-14T00:12:04Z
format Article
id doaj.art-1d79718d85f7457b9617e3fb1669b219
institution Directory Open Access Journal
issn 1110-8665
language English
last_indexed 2024-04-14T00:12:04Z
publishDate 2022-09-01
publisher Elsevier
record_format Article
series Egyptian Informatics Journal
spelling doaj.art-1d79718d85f7457b9617e3fb1669b219
2022-12-22T02:23:16Z
eng
Elsevier
Egyptian Informatics Journal
1110-8665
2022-09-01
23
3
547
557
The impact of synthetic text generation for sentiment analysis using GAN based models
Ali Shariq Imran, Department of Computer Science (IDI), Norwegian University of Science & Technology (NTNU), 2815 Gjøvik, Norway; Corresponding author.
Ru Yang, Department of Computer Science (IDI), Norwegian University of Science & Technology (NTNU), 2815 Gjøvik, Norway
Zenun Kastrati, Department of Informatics, Linnaeus University, 35195 Växjö, Sweden
Sher Muhammad Daudpota, Department of Computer Science, Sukkur IBA University, Sukkur 65200, Pakistan
Sarang Shaikh, Department of Information Security and Communication Technology (IIK), Norwegian University of Science & Technology (NTNU), 2815 Gjøvik, Norway
http://www.sciencedirect.com/science/article/pii/S1110866522000342
Text generation
Sentiment analysis
SentiGAN
CatGAN
Deep learning
Language modeling
title The impact of synthetic text generation for sentiment analysis using GAN based models
topic Text generation
Sentiment analysis
SentiGAN
CatGAN
Deep learning
Language modeling
url http://www.sciencedirect.com/science/article/pii/S1110866522000342