A Combined Priori and Purity Gaussian OverSampling Algorithm for Imbalanced Data Classification

Bibliographic Details
Main Authors: Liangliang Tao, Huping Zhu, Qingya Wang, Yage Liang, Xiaozheng Deng
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: Imbalanced learning; priori; purity; Gaussian oversampling
Online Access: https://ieeexplore.ieee.org/document/10322726/
author Liangliang Tao
Huping Zhu
Qingya Wang
Yage Liang
Xiaozheng Deng
collection DOAJ
description Imbalanced data classification poses pervasive challenges in real-world data mining. Sampling techniques have emerged as effective ways to address them. However, the prevailing technique, SMOTE (Synthetic Minority Over-sampling Technique), and its derivatives assume that every minority class observation carries an equal amount of information, neglecting both the distribution of minority class observations and their relationship to neighboring majority class observations. Consequently, the synthetic samples these methods generate deviate from the original data distribution and overlap more with the majority samples. To address this limitation, this paper introduces a novel sampling technique called Combined Priori and Purity Gaussian OverSampling (PPGO). The proposed method uses prior probabilities and sample purity to calculate a weight for each minority class sample. This weight determines both the quantity of synthetic samples to be generated for each minority class and the level of dispersion during Gaussian sampling. The approach aims to restore the original distribution of the observations and minimize overlap with majority class regions. In experiments on 32 datasets from the KEEL repository, the proposed method significantly improves the G-mean and AUC measures compared with conventional approaches.
first_indexed 2024-03-09T14:16:27Z
format Article
id doaj.art-2a607d42012d47848ce16424577ada86
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-09T14:16:27Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-2a607d42012d47848ce16424577ada86
2023-11-29T00:01:38Z
eng
IEEE
IEEE Access, ISSN 2169-3536
2023-01-01, vol. 11, pp. 130688-130696
doi:10.1109/ACCESS.2023.3334272 (document 10322726)
Liangliang Tao (0), https://orcid.org/0009-0000-2346-9504
Huping Zhu (1), https://orcid.org/0009-0000-1695-5619
Qingya Wang (2), https://orcid.org/0000-0001-5636-8785
Yage Liang (3)
Xiaozheng Deng (4)
Affiliation (all five authors): College of Information Engineering, Jiujiang Vocational and Technical College, Jiujiang, China
title A Combined Priori and Purity Gaussian OverSampling Algorithm for Imbalanced Data Classification
topic Imbalanced learning
priori
purity
Gaussian oversampling
url https://ieeexplore.ieee.org/document/10322726/