A Hybrid Undersampling-SMOTE Method for Imbalanced Big Data Classification

Imbalanced data is one of the important challenges in data classification. It leads to poor performance of binary classifiers, because classification is biased in favour of the majority class and the influence of the minority class is poorly understood, while the minority cla...

Bibliographic Details
Main Authors:	Shaymaa Ahmed Razoqi, Ghayda Al-Talib
Format:	Article
Language:	Arabic
Published:	College of Education for Pure Sciences 2023-12-01
Series:	مجلة التربية والعلم
Subjects:	big data, classification, imbalanced problem, resampling, clustering
Online Access:	https://edusj.mosuljournals.com/article_180971_2e8e268f4bdbc7d511ffb728d5b70f64.pdf

collection	DOAJ
description	Imbalanced data is one of the important challenges in data classification. It leads to poor performance of binary classifiers, because classification is biased in favour of the majority class and the influence of the minority class is poorly understood, even though the minority class is usually the most important in the classification process. To find a compromise between limiting information loss and balancing the data set before classification, the research proposed a hybrid algorithm. In the first phase, clustering methods divide the majority class into subgroups and an encoding method is applied to the majority class. The algorithm uses the code to group samples that are similar to each other and so reduce the majority class count. In the next phase, the Synthetic Minority Oversampling Technique (SMOTE) is used to increase the number of minority class samples. The study examined the impact of the proposed algorithm on five classifiers, using AUC and F-score as performance measures on benchmark datasets with different sizes and imbalance factors. The results showed that the proposed algorithm significantly improved the performance of the classifiers when applied to the resampled data.
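The two-phase idea in the description can be illustrated with a minimal sketch. This is not the authors' algorithm: the record does not specify their clustering or encoding method, so the sketch assumes plain k-means with proportional random sampling per cluster as a stand-in for the undersampling phase, followed by a simplified SMOTE that interpolates between a minority sample and one of its nearest minority neighbours. All function names and parameters (`kmeans`, `cluster_undersample`, `smote`, `hybrid_resample`, `target`, `k`) are hypothetical.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, rng=None):
    """Plain Lloyd's k-means; returns a cluster index for every point."""
    rng = rng or random.Random(0)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels

def cluster_undersample(majority, target, k=3, rng=None):
    """Keep roughly `target` majority samples, drawn proportionally per cluster.

    Assumption: random sampling inside each cluster replaces the paper's
    encoding step, which the record does not describe.
    """
    rng = rng or random.Random(0)
    labels = kmeans(majority, k, rng=rng)
    kept = []
    for c in range(k):
        members = [p for p, l in zip(majority, labels) if l == c]
        quota = round(target * len(members) / len(majority))
        kept.extend(rng.sample(members, min(quota, len(members))))
    return kept

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points between minority samples and neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist2(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

def hybrid_resample(majority, minority, target, rng=None):
    """Shrink the majority class and grow the minority class towards `target`."""
    rng = rng or random.Random(0)
    reduced = cluster_undersample(majority, target, rng=rng)
    n_new = max(0, target - len(minority))
    grown = minority + smote(minority, n_new, rng=rng)
    return reduced, grown

# Toy data: 200 majority points and 20 minority points in two dimensions.
data_rng = random.Random(42)
majority = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
minority = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(20)]
reduced, grown = hybrid_resample(majority, minority, target=60)
print(len(reduced), len(grown))
```

Because per-cluster quotas are rounded, the reduced majority set lands close to `target` rather than exactly on it, while the minority set is grown to exactly `target`. Meeting in the middle like this avoids both discarding most of the majority class and generating a very large number of synthetic minority points.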
id	doaj.art-1b712fa763314e958bbab8f2dd1794cd
institution	Directory of Open Access Journals
issn	1812-125X 2664-2530
spelling	doaj.art-1b712fa763314e958bbab8f2dd1794cd; 2023-11-30T18:30:02Z; ara; College of Education for Pure Sciences; مجلة التربية والعلم; 1812-125X; 2664-2530; 2023-12-01; 32; 4; 81; 90; 10.33899/edusj.2023.143612.1393; 180971; A Hybrid Undersampling-SMOTE Method for Imbalanced Big Data Classification; Shaymaa Ahmed Razoqi (Department of Computer Science, College of Education for Pure Science, University of Mosul, Mosul, Iraq); Ghayda Al-Talib (Department of Computer Science, College of Computer Science and Mathematics, University of Mosul, Mosul, Iraq)

Similar Items