Tackling Class Imbalance Problem in Software Defect Prediction Through Cluster-Based Over-Sampling With Filtering

In practice, Software Defect Prediction (SDP) models often suffer from highly imbalanced data, which makes it difficult for classifiers to identify defective instances. Many techniques have recently been proposed to tackle this problem; over-sampling is one of the best-known methods to address c...

Bibliographic Details
Main Authors:	Lina Gong, Shujuan Jiang, Li Jiang
Format:	Article
Language:	English
Published:	IEEE 2019-01-01
Series:	IEEE Access
Subjects:	Software defect prediction, over-sampling, class imbalance, K-means, noise filtering
Online Access:	https://ieeexplore.ieee.org/document/8861051/

_version_	1818479948946472960
author	Lina Gong, Shujuan Jiang, Li Jiang
author_facet	Lina Gong, Shujuan Jiang, Li Jiang
author_sort	Lina Gong
collection	DOAJ
description	In practice, Software Defect Prediction (SDP) models often suffer from highly imbalanced data, which makes it difficult for classifiers to identify defective instances. Many techniques have recently been proposed to tackle this problem, and over-sampling is one of the best-known methods for addressing class imbalance. It balances the number of defective and non-defective instances by generating new defective instances. However, existing over-sampling approaches tend to generate non-diverse synthetic instances along with many unnecessary noise instances. Motivated by this, we propose a cluster-based over-sampling approach with noise filtering (KMFOS) to tackle the class imbalance problem in SDP. KMFOS first divides the defective instances into $K$ clusters and generates new defective instances by interpolating between instances from each pair of clusters, so that the new defective instances spread diversely across the space of the defective dataset. We then extend this cluster-based over-sampling with Closest List Noise Identification (CLNI) to remove noise instances. We conduct extensive experiments on 24 projects, comparing KMFOS with over-sampling approaches such as SMOTE, Borderline-SMOTE, ADASYN, random over-sampling (ROS), K-means SMOTE, SMOTE + IPF, SMOTE + ENN and SMOTE + Tomek Links using five prediction classifiers. We also compare KMFOS with other state-of-the-art class-imbalance methods, including BalancedBaggingClassifier, RUSBoostClassifier, InstanceHardnessThreshold and cost-sensitive methods. Experimental results indicate that KMFOS obtains better Recall and bal values than the other over-sampling methods and the other compared class-imbalance methods. Hence, KMFOS is an efficient approach for generating balanced data for SDP and improves the performance of prediction models.
first_indexed	2024-12-10T11:16:43Z
format	Article
id	doaj.art-41677458655e4b57bda29dfcca319bfc
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-12-10T11:16:43Z
publishDate	2019-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-41677458655e4b57bda29dfcca319bfc; 2022-12-22T01:51:09Z; eng; IEEE; IEEE Access; 2169-3536; 2019-01-01; vol. 7, pp. 145725-145737; 10.1109/ACCESS.2019.2945858; 8861051; Tackling Class Imbalance Problem in Software Defect Prediction Through Cluster-Based Over-Sampling With Filtering; Lina Gong (https://orcid.org/0000-0002-5272-6706), Shujuan Jiang (https://orcid.org/0000-0003-0643-0565), Li Jiang; School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China (all three authors); https://ieeexplore.ieee.org/document/8861051/; Software defect prediction, over-sampling, class imbalance, K-means, noise filtering
title	Tackling Class Imbalance Problem in Software Defect Prediction Through Cluster-Based Over-Sampling With Filtering
topic	Software defect prediction, over-sampling, class imbalance, K-means, noise filtering
url	https://ieeexplore.ieee.org/document/8861051/
work_keys_str_mv	AT linagong tacklingclassimbalanceprobleminsoftwaredefectpredictionthroughclusterbasedoversamplingwithfiltering AT shujuanjiang tacklingclassimbalanceprobleminsoftwaredefectpredictionthroughclusterbasedoversamplingwithfiltering AT lijiang tacklingclassimbalanceprobleminsoftwaredefectpredictionthroughclusterbasedoversamplingwithfiltering
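
The description field of this record outlines KMFOS in three steps: cluster the defective instances, interpolate between instances drawn from two different clusters, then filter noise. The following is only a rough sketch of that idea, not the authors' implementation. It assumes plain Lloyd's k-means, a uniformly random interpolation gap, and a single-pass simplification of CLNI; the function names and the parameters `k`, `n_neighbors` and `theta` are illustrative and do not come from the record.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # New center = mean of the members; an empty cluster keeps its old center.
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return [cl for cl in clusters if cl]


def oversample(defective, n_new, k=3, seed=0):
    """Create n_new synthetic defective instances by cross-cluster interpolation."""
    rng = random.Random(seed)
    clusters = kmeans(defective, k, rng)
    synthetic = []
    for _ in range(n_new):
        if len(clusters) >= 2:
            c1, c2 = rng.sample(clusters, 2)  # two different clusters
        else:
            c1 = c2 = clusters[0]  # degenerate case: only one cluster found
        a, b = rng.choice(c1), rng.choice(c2)
        gap = rng.random()  # position on the segment from a to b
        synthetic.append(tuple(x + gap * (y - x) for x, y in zip(a, b)))
    return synthetic


def filter_noise(X, y, n_neighbors=5, theta=0.6):
    """Simplified, single-pass CLNI-style filter (assumes at least two instances).

    An instance is dropped when at least a fraction theta of its
    n_neighbors nearest neighbours carry a different label.
    """
    kept_X, kept_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )[:n_neighbors]
        disagree = sum(y[j] != yi for j in order) / len(order)
        if disagree < theta:
            kept_X.append(xi)
            kept_y.append(yi)
    return kept_X, kept_y
```

In use, the synthetic instances from `oversample` would be appended to the training set with the defective label before `filter_noise` is applied to the combined data. Because each synthetic point is a convex combination of two existing defective instances, it always lies on the segment between them.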

Similar Items