The impact of synthetic text generation for sentiment analysis using GAN based models
Data imbalance in datasets is a common issue where the number of instances in one or more categories far exceeds the others, so is the case with the educational domain. Collecting feedback on a course on a large scale and the lack of publicly available datasets in this domain limits models’ performa...
Main Authors: | Ali Shariq Imran, Ru Yang, Zenun Kastrati, Sher Muhammad Daudpota, Sarang Shaikh |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2022-09-01 |
Series: | Egyptian Informatics Journal |
Subjects: | Text generation, Sentiment analysis, SentiGAN, CatGAN, Deep learning, Language modeling |
Online Access: | http://www.sciencedirect.com/science/article/pii/S1110866522000342 |
_version_ | 1828345507104161792 |
---|---|
author | Ali Shariq Imran; Ru Yang; Zenun Kastrati; Sher Muhammad Daudpota; Sarang Shaikh |
collection | DOAJ |
description | Data imbalance in datasets is a common issue where the number of instances in one or more categories far exceeds the others, so is the case with the educational domain. Collecting feedback on a course on a large scale and the lack of publicly available datasets in this domain limits models’ performance, especially for deep neural network based models which are data hungry. A model trained on such an imbalanced dataset would naturally favor the majority class. However, the minority class could be critical for decision-making in prediction systems, and therefore it is usually desirable to train a model with equally high class-level accuracy. This paper addresses the data imbalance issue for the sentiment analysis of users’ opinions task on two educational feedback datasets utilizing synthetic text generation deep learning models. Two state-of-the-art text generation GAN models namely CatGAN and SentiGAN, are employed for synthesizing text used to balance the highly imbalanced datasets in this study. Particular emphasis is given to the diversity of synthetically generated samples for populating minority classes. Experimental results on highly imbalanced datasets show significant improvement in models’ performance on CR23K and CR100K after balancing with synthetic data for the sentiment classification task. |
first_indexed | 2024-04-14T00:12:04Z |
format | Article |
id | doaj.art-1d79718d85f7457b9617e3fb1669b219 |
institution | Directory Open Access Journal |
issn | 1110-8665 |
language | English |
last_indexed | 2024-04-14T00:12:04Z |
publishDate | 2022-09-01 |
publisher | Elsevier |
record_format | Article |
series | Egyptian Informatics Journal |
spelling | doaj.art-1d79718d85f7457b9617e3fb1669b219; 2022-12-22T02:23:16Z; eng; Elsevier; Egyptian Informatics Journal; ISSN 1110-8665; 2022-09-01; vol. 23, no. 3, pp. 547-557; The impact of synthetic text generation for sentiment analysis using GAN based models; Ali Shariq Imran (Department of Computer Science (IDI), Norwegian University of Science & Technology (NTNU), 2815 Gjøvik, Norway; corresponding author); Ru Yang (Department of Computer Science (IDI), Norwegian University of Science & Technology (NTNU), 2815 Gjøvik, Norway); Zenun Kastrati (Department of Informatics, Linnaeus University, 35195 Växjö, Sweden); Sher Muhammad Daudpota (Department of Computer Science, Sukkur IBA University, Sukkur 65200, Pakistan); Sarang Shaikh (Department of Information Security and Communication Technology (IIK), Norwegian University of Science & Technology (NTNU), 2815 Gjøvik, Norway); http://www.sciencedirect.com/science/article/pii/S1110866522000342; Text generation; Sentiment analysis; SentiGAN; CatGAN; Deep learning; Language modeling |
title | The impact of synthetic text generation for sentiment analysis using GAN based models |
topic | Text generation; Sentiment analysis; SentiGAN; CatGAN; Deep learning; Language modeling |
url | http://www.sciencedirect.com/science/article/pii/S1110866522000342 |