A Combined Priori and Purity Gaussian OverSampling Algorithm for Imbalanced Data Classification
Imbalanced data classification presents pervasive challenges in real-world data mining. Sampling techniques have emerged as effective ways to tackle them. However, the prevailing technique, SMOTE (Synthetic Minority Over-sampling Technique), and its derivatives make...
Main Authors: | Liangliang Tao, Huping Zhu, Qingya Wang, Yage Liang, Xiaozheng Deng |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2023-01-01 |
Series: | IEEE Access |
Subjects: | Imbalanced learning; priori; purity; Gaussian oversampling |
Online Access: | https://ieeexplore.ieee.org/document/10322726/ |
_version_ | 1827631111659847680 |
---|---|
author | Liangliang Tao; Huping Zhu; Qingya Wang; Yage Liang; Xiaozheng Deng |
author_sort | Liangliang Tao |
collection | DOAJ |
description | Imbalanced data classification presents pervasive challenges in real-world data mining. Sampling techniques have emerged as effective ways to tackle them. However, the prevailing technique, SMOTE (Synthetic Minority Over-sampling Technique), and its derivatives assume that each minority class observation carries an equal amount of information, neglecting the distribution of minority class observations and their relationship with neighboring majority class observations. Consequently, the synthetic samples these methods generate deviate from the original data distribution and overlap more with the majority samples. To address this limitation, this paper introduces a sampling technique called Combined Priori and Purity Gaussian OverSampling (PPGO). The method uses prior probabilities and sample purity to calculate a weight for each minority class sample. This weight determines both the quantity of synthetic samples to be generated for each minority class and the level of dispersion during Gaussian sampling. The approach aims to restore the original distribution of the observations and minimize overlap with the majority class regions. In experiments on 32 datasets from the KEEL repository, the proposed method significantly improved the G-mean and AUC measures compared to conventional approaches. |
first_indexed | 2024-03-09T14:16:27Z |
format | Article |
id | doaj.art-2a607d42012d47848ce16424577ada86 |
institution | Directory of Open Access Journals |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-09T14:16:27Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-2a607d42012d47848ce16424577ada86; 2023-11-29T00:01:38Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; 11; 130688; 130696; 10.1109/ACCESS.2023.3334272; 10322726; A Combined Priori and Purity Gaussian OverSampling Algorithm for Imbalanced Data Classification; Liangliang Tao (https://orcid.org/0009-0000-2346-9504); Huping Zhu (https://orcid.org/0009-0000-1695-5619); Qingya Wang (https://orcid.org/0000-0001-5636-8785); Yage Liang; Xiaozheng Deng; College of Information Engineering, Jiujiang Vocational and Technical College, Jiujiang, China (all five authors); The imbalanced data classification presents pervasive challenges in real-world data mining scenarios. To tackle these challenges, sampling techniques have emerged as effective approaches. However, the prevailing technique, SMOTE (Synthetic Minority Over-sampling Technique), and its derivatives make the assumption that each minority class observation carries an equal amount of information, neglecting the distribution of minority class observations and their relationship with neighboring majority class observations. Consequently, the synthetic samples generated by these methods deviate from the original data distribution, resulting in an increased overlap with the majority samples. To address this limitation, we introduce a novel sampling technique called Combined Priori and Purity Gaussian OverSampling (PPGO) in this paper. The proposed method incorporates prior probabilities and sample purity to calculate the weight assigned to each minority class sample. This weight is used to determine the quantity of synthetic samples to be generated for each minority class, as well as the level of dispersion during the Gaussian sampling process. This approach aims to restore the original distribution of the observations and minimize the overlap with the majority class regions. The experimental results demonstrate a significant improvement in the G-mean and AUC measures when using the proposed method compared to conventional approaches. These results were obtained through experiments conducted on 32 datasets obtained from the KEEL repository.; https://ieeexplore.ieee.org/document/10322726/; Imbalanced learning; priori; purity; Gaussian oversampling |
title | A Combined Priori and Purity Gaussian OverSampling Algorithm for Imbalanced Data Classification |
topic | Imbalanced learning; priori; purity; Gaussian oversampling |
url | https://ieeexplore.ieee.org/document/10322726/ |
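
The description field says that PPGO weights each minority sample using prior probabilities and sample purity, then uses that weight to set both the number of synthetic samples and the dispersion of the Gaussian sampling. The record does not give the paper's formulas, so the following is only a rough sketch of that general idea, not the authors' algorithm. The weight definition (k-nearest-neighbour purity smoothed by the class prior), the choice to scale the standard deviation linearly with the weight, and all names (`sample_weights`, `allocate`, `gaussian_oversample`, `base_sigma`) are assumptions made for illustration.

```python
import math
import random


def knn_indices(X, i, k):
    """Indices of the k nearest neighbours of X[i] (Euclidean, excluding i)."""
    others = (j for j in range(len(X)) if j != i)
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]


def sample_weights(X, y, minority, k=5):
    """Assumed weight: share of minority neighbours, smoothed by the class prior."""
    prior = sum(1 for label in y if label == minority) / len(y)
    weights = {}
    for i, label in enumerate(y):
        if label != minority:
            continue
        neigh = knn_indices(X, i, k)
        same = sum(1 for j in neigh if y[j] == minority)
        # The prior counts as one pseudo-neighbour, so no weight is exactly zero.
        weights[i] = (same + prior) / (len(neigh) + 1)
    return weights


def allocate(weights, n_new):
    """Split n_new across samples in proportion to weight (largest remainder)."""
    total = sum(weights.values())
    raw = {i: n_new * w / total for i, w in weights.items()}
    counts = {i: int(r) for i, r in raw.items()}
    leftover = n_new - sum(counts.values())
    by_remainder = sorted(raw, key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def gaussian_oversample(X, y, minority, n_new, k=5, base_sigma=0.1, seed=0):
    """Draw synthetic points around each minority sample from an isotropic Gaussian."""
    rng = random.Random(seed)
    weights = sample_weights(X, y, minority, k)
    counts = allocate(weights, n_new)
    synthetic = []
    for i, count in counts.items():
        # Assumption: purer neighbourhoods tolerate a wider spread.
        sigma = base_sigma * weights[i]
        for _ in range(count):
            synthetic.append([rng.gauss(v, sigma) for v in X[i]])
    return synthetic


if __name__ == "__main__":
    X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
         [1.0, 1.0], [1.1, 1.0], [1.0, 1.1], [0.9, 1.0], [1.0, 0.9]]
    y = [1, 1, 1, 0, 0, 0, 0, 0]
    for point in gaussian_oversample(X, y, minority=1, n_new=4, k=2):
        print(point)
```

In this sketch a minority sample surrounded by majority neighbours receives a small weight, so it gets fewer synthetic points and they stay close to it, which is one way to limit overlap with majority regions. The paper itself may combine the prior and purity terms differently.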