A Novel Classification Method Based on a Two-Phase Technique for Learning Imbalanced Text Data

The problem of imbalanced data has a heavy impact on the performance of learning models. In the case of an imbalanced text dataset, minority class data are often classified to the majority class, resulting in a loss of minority information and low accuracy. Thus, it is a serious challenge to determi...

Bibliographic Details
Main Authors:	Der-Chiang Li, Szu-Chou Chen, Yao-San Lin, Wen-Yen Hsu
Format:	Article
Language:	English
Published:	MDPI AG 2022-03-01
Series:	Symmetry
Subjects:	imbalanced data, sentiment analysis, text mining, support vector machine
Online Access:	https://www.mdpi.com/2073-8994/14/3/567

_version_	1797441538232418304
author	Der-Chiang Li, Szu-Chou Chen, Yao-San Lin, Wen-Yen Hsu
collection	DOAJ
description	The problem of imbalanced data has a heavy impact on the performance of learning models. In the case of an imbalanced text dataset, minority class data are often classified to the majority class, resulting in a loss of minority information and low accuracy. Thus, it is a serious challenge to determine how to tackle the high imbalance ratio distribution of datasets. Here, we propose a novel classification method for learning tasks with imbalanced text data. It aims to provide a data preprocessing method that researchers can apply to their learning tasks with imbalanced text data, sparing them the effort of searching for more dedicated learning tools. The proposed method has two core stages. In stage one, balanced datasets are generated using an asymmetric cost-sensitive support vector machine; in stage two, the balanced dataset is classified using a symmetric cost-sensitive support vector machine. In addition, the learning parameters in both stages are tuned with a genetic algorithm to create an optimal model. A Yelp review dataset was used to validate the effectiveness of the proposed method. The experimental results showed that the proposed method performed better on the targeted dataset, with at least 75% accuracy, and that it significantly improved the learning approach.
first_indexed	2024-03-09T12:24:36Z
format	Article
id	doaj.art-333520de20df45d59e04bfad3ab22b0b
institution	Directory of Open Access Journals
issn	2073-8994
language	English
last_indexed	2024-03-09T12:24:36Z
publishDate	2022-03-01
publisher	MDPI AG
record_format	Article
series	Symmetry
doi	10.3390/sym14030567
volume	14
issue	3
article_number	567
author_affiliation	Der-Chiang Li: Department of Industrial and Information Management, National Cheng Kung University, Tainan City 70101, Taiwan
author_affiliation	Szu-Chou Chen: Institute of Information Management, National Cheng Kung University, Tainan City 70101, Taiwan
author_affiliation	Yao-San Lin: Singapore Centre for Chinese Language, Nanyang Technological University, Singapore 279623, Singapore
author_affiliation	Wen-Yen Hsu: Institute of Information Management, National Cheng Kung University, Tainan City 70101, Taiwan
title	A Novel Classification Method Based on a Two-Phase Technique for Learning Imbalanced Text Data
topic	imbalanced data, sentiment analysis, text mining, support vector machine
url	https://www.mdpi.com/2073-8994/14/3/567

Similar Items