Predicting defects in imbalanced data using resampling methods: an empirical investigation

The development of correct and effective software defect prediction (SDP) models is one of the most pressing needs of the software industry. Statistics from many defect-related open-source data sets show that object-oriented projects suffer from the class imbalance problem. Models trained on imbalanced data lead to inaccurate predictions owing to biased learning and, hence, to ineffective defect prediction. In addition, a large number of software metrics degrades model performance. This study aims at (1) identifying useful software metrics using correlation-based feature selection, (2) an extensive comparative analysis of 10 resampling methods for building effective machine learning models from imbalanced data, (3) the inclusion of stable performance evaluators (AUC, GMean, and Balance), and (4) the statistical validation of results. The impact of the 10 resampling methods is analyzed on selected features of 12 object-oriented Apache datasets using 15 machine learning techniques. The performance of the developed models is assessed using AUC, GMean, Balance, and sensitivity. Statistical results advocate the use of resampling methods to improve SDP, with random oversampling yielding the best predictive capability among the developed defect prediction models. The study also provides a guideline for identifying metrics that are influential for SDP, and the oversampling methods outperform the undersampling methods.
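
The abstract refers to random oversampling and to the stable evaluators AUC, GMean, and Balance. The sketch below is a rough illustration only, not the authors' pipeline: it assumes imbalanced-learn and scikit-learn are available, uses a synthetic imbalanced dataset in place of the Apache defect data, picks RandomForest as a stand-in classifier, and computes GMean and Balance as they are commonly defined in the defect-prediction literature.

# Minimal sketch (assumptions: synthetic data, RandomForest as stand-in classifier)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Synthetic stand-in for an imbalanced defect dataset (~10% defective modules).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Random oversampling: duplicate minority-class (defective) instances
# until the training set is balanced; the test set is left untouched.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X_train, y_train)

model = RandomForestClassifier(random_state=42).fit(X_res, y_res)
y_prob = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

# Evaluators mentioned in the abstract: AUC, GMean, Balance, sensitivity.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
pd_rate = tp / (tp + fn)   # sensitivity (probability of detection)
pf_rate = fp / (fp + tn)   # probability of false alarm
auc = roc_auc_score(y_test, y_prob)
gmean = np.sqrt(pd_rate * (1 - pf_rate))
balance = 1 - np.sqrt((0 - pf_rate) ** 2 + (1 - pd_rate) ** 2) / np.sqrt(2)
print(f"AUC={auc:.3f}  GMean={gmean:.3f}  Balance={balance:.3f}  Sensitivity={pd_rate:.3f}")

Fitting the oversampler on the training split only, as above, keeps duplicated minority instances out of the evaluation data.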

Bibliographic Details
Main Authors: Ruchika Malhotra, Juhi Jain
Author Affiliations: Department of Software Engineering and Department of Computer Science and Engineering, Delhi Technological University (formerly Delhi College of Engineering), Shahbad Daulatpur, Delhi, India
Format: Article
Language: English
Published: PeerJ Inc., 2022-04-01
Series: PeerJ Computer Science
ISSN: 2376-5992
DOI: 10.7717/peerj-cs.573
Subjects: Software defect prediction; Machine learning; Class imbalance problem; Resampling methods; Statistical validation
Online Access: https://peerj.com/articles/cs-573.pdf