Early Software Defects Density Prediction: Training the International Software Benchmarking Cross Projects Data Using Supervised Learning

Recent reviews of the literature indicate the need for empirical studies on cross-project defect prediction (CPDP) that would allow aggregation of the evidence and improve predictive performance. Most empirical studies predict defects at granularity levels of method, class, file, and module/package...

Bibliographic Details
Main Authors: Touseef Tahir, Cigdem Gencel, Ghulam Rasool, Tariq Umer, Jawad Rasheed, Sook Fern Yeo, Taner Cevik
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10345541/
collection DOAJ
description Recent reviews of the literature indicate the need for empirical studies on cross-project defect prediction (CPDP) that would allow aggregation of the evidence and improve predictive performance. Most empirical studies predict defects at the granularity levels of method, class, file, and module/package during the coding phase, and thereby avoid external failure costs. The main goal of this study is to predict defects early, at the beginning of a project and at the product level of granularity, so that the predictions can serve as input for planning the project's quality activities. Hence, both internal and external failure costs could be avoided as much as possible through proper quality planning. We first conducted a systematic mapping study (SMS) of secondary studies (literature reviews) on defect prediction to identify the most used datasets, the project attributes and metrics used as estimators, and the supervised learning methods employed for training. We then conducted an empirical study on defect density prediction using cross-project data. We collected data on 760 projects from the International Software Benchmarking Standards Group (ISBSG) dataset version 11 that reported both defects and functional size attributes. We trained the prediction models using: i) the complete set of project attributes, ii) the individual attributes, and iii) multiple subsets of attributes. We employed both classification and regression approaches of machine learning. The models were trained on the original values of the dataset and on z-score and log transformations of those values, to explore the effects of data normalization on prediction. Most machine learning models trained on the z-score transformation of the dataset performed best for classifying defects. The Multilayer Perceptron (neural network) model trained on the z-score transformation of the complete dataset predicted defects with the highest F1-score, 0.89, using binary classification. The log transformation and feature selection methods improved the results for multivariable regression. The multivariable regression achieved its best results, a Root Mean Squared Error (RMSE) of 0.4 and an R² (r-squared) of 0.9, with a subset of 11 features using the log transformation. The results of the classification and regression approaches indicate that defects can be predicted with reasonable accuracy at the software product level using cross-project data.
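The transformations and evaluation measures named in the description can be illustrated with a minimal Python sketch. This is not the authors' code and uses no ISBSG data: the function names are hypothetical, the use of log(1 + x) for the log transformation is an assumption made here so that zero values stay defined, and defect density is taken as defects divided by functional size. It only shows what z-score scaling, a log transformation, the F1-score, RMSE, and R² compute.

```python
import math
from statistics import mean, pstdev


def z_score(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Apply log(1 + v); the +1 offset is an assumption, not from the record."""
    return [math.log1p(v) for v in values]


def defect_density(defects, size):
    """Defects per unit of functional size (assumed definition)."""
    return defects / size


def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 marks the positive (defective) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Root mean squared error; lower values indicate a better fit."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def r_squared(y_true, y_pred):
    """Coefficient of determination; 1.0 indicates a perfect fit."""
    mu = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mu) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up project sizes and defect counts, for illustration only.
    sizes = [100, 250, 400, 1000]
    defects = [2, 5, 4, 30]
    densities = [defect_density(d, s) for d, s in zip(defects, sizes)]
    print("z-scored sizes:", z_score(sizes))
    print("log-transformed densities:", log_transform(densities))
```

Because RMSE is an error measure and R² a goodness-of-fit measure, a better regression model has a lower RMSE and a higher R².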
id doaj.art-f382c28ac7ef48408f55e5a9b6a822fa
institution Directory Open Access Journal
issn 2169-3536
spelling doaj.art-f382c28ac7ef48408f55e5a9b6a822fa (2023-12-26T00:07:27Z)
Journal: IEEE Access, vol. 11 (2023-01-01), pp. 141965-141986
DOI: 10.1109/ACCESS.2023.3339994
IEEE Xplore document number: 10345541
Authors and affiliations:
Touseef Tahir (https://orcid.org/0000-0003-3347-4918), Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
Cigdem Gencel (https://orcid.org/0000-0003-0115-8902), Department of Management Information Systems, Ankara Medipol University, Ankara, Turkey
Ghulam Rasool (https://orcid.org/0000-0001-5408-0550), Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
Tariq Umer (https://orcid.org/0000-0002-3333-8142), Department of Computer Science, COMSATS University Islamabad, Lahore Campus, Lahore, Pakistan
Jawad Rasheed (https://orcid.org/0000-0003-3761-1641), Department of Computer Engineering, Istanbul Sabahattin Zaim University, İstanbul, Turkey
Sook Fern Yeo (https://orcid.org/0000-0002-8060-5872), Faculty of Business, Multimedia University, Malacca, Malaysia
Taner Cevik, Department of Computer Engineering, Istanbul Arel University, İstanbul, Turkey
topic Cross projects dataset
defect prediction
feature selection
fault prediction
ISBSG dataset
machine learning