Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities?

Modern systems produce and handle a large volume of sensitive enterprise data. Therefore, security vulnerabilities in the software systems must be identified and resolved early to prevent security breaches and failures. Predicting security vulnerabilities is an alternative to identifying them as dev...


Bibliographic Details
Main Authors:	Sundarakrishnan Ganesh, Francis Palma, Tobias Olsson
Format:	Article
Language:	English
Published:	MDPI AG 2022-09-01
Series:	Data
Subjects:	prediction; security vulnerabilities; machine learning; source code; software metrics
Online Access:	https://www.mdpi.com/2306-5729/7/9/127

_version_	1797489555168821248
author	Sundarakrishnan Ganesh, Francis Palma, Tobias Olsson
author_facet	Sundarakrishnan Ganesh, Francis Palma, Tobias Olsson
author_sort	Sundarakrishnan Ganesh
collection	DOAJ
description	Modern systems produce and handle a large volume of sensitive enterprise data. Therefore, security vulnerabilities in the software systems must be identified and resolved early to prevent security breaches and failures. Predicting security vulnerabilities is an alternative to identifying them as developers write code. In this study, we studied the ability of several machine learning algorithms to predict security vulnerabilities. We created two datasets containing security vulnerability information from two open-source systems: (1) Apache Tomcat (versions 4.x and five 2.5.x minor versions). We also computed source code metrics for these versions of both systems. We examined four classifiers, including Naive Bayes, Decision Tree, XGBoost Classifier, and Logistic Regression, to show their ability to predict security vulnerabilities. Moreover, an ensemble learner was introduced using a stacking classifier to see whether the prediction performance could be improved. We performed cross-version and cross-project predictions to assess the effectiveness of the best-performing model. Our results showed that the XGBoost classifier performed best compared to other learners, i.e., with an average accuracy of 97% in both datasets. The stacking classifier performed with an average accuracy of 92% in Struts and 71% in Tomcat. Our best-performing model—XGBoost—could predict with an average accuracy of 87% in Tomcat and 99% in Struts in a cross-version setup.
first_indexed	2024-03-10T00:19:16Z
format	Article
id	doaj.art-c6fa0d0da4994c66b4b82777959fe724
institution	Directory Open Access Journal
issn	2306-5729
language	English
last_indexed	2024-03-10T00:19:16Z
publishDate	2022-09-01
publisher	MDPI AG
record_format	Article
series	Data
spelling	doaj.art-c6fa0d0da4994c66b4b82777959fe724; 2023-11-23T15:46:47Z; eng; MDPI AG; Data; 2306-5729; 2022-09-01; 7; 9; 127; 10.3390/data7090127; Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities?; Sundarakrishnan Ganesh (0); Francis Palma (1); Tobias Olsson (2); Department of Computer Science and Media Technology, Linnaeus University, 351 95 Växjö, Sweden; Department of Computer Science and Media Technology, Linnaeus University, 351 95 Växjö, Sweden; Department of Computer Science and Media Technology, Linnaeus University, 351 95 Växjö, Sweden; Modern systems produce and handle a large volume of sensitive enterprise data. Therefore, security vulnerabilities in the software systems must be identified and resolved early to prevent security breaches and failures. Predicting security vulnerabilities is an alternative to identifying them as developers write code. In this study, we studied the ability of several machine learning algorithms to predict security vulnerabilities. We created two datasets containing security vulnerability information from two open-source systems: (1) Apache Tomcat (versions 4.x and five 2.5.x minor versions). We also computed source code metrics for these versions of both systems. We examined four classifiers, including Naive Bayes, Decision Tree, XGBoost Classifier, and Logistic Regression, to show their ability to predict security vulnerabilities. Moreover, an ensemble learner was introduced using a stacking classifier to see whether the prediction performance could be improved. We performed cross-version and cross-project predictions to assess the effectiveness of the best-performing model. Our results showed that the XGBoost classifier performed best compared to other learners, i.e., with an average accuracy of 97% in both datasets. The stacking classifier performed with an average accuracy of 92% in Struts and 71% in Tomcat. Our best-performing model—XGBoost—could predict with an average accuracy of 87% in Tomcat and 99% in Struts in a cross-version setup.; https://www.mdpi.com/2306-5729/7/9/127; prediction; security vulnerabilities; machine learning; source code; software metrics
spellingShingle	Sundarakrishnan Ganesh Francis Palma Tobias Olsson Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities? Data prediction security vulnerabilities machine learning source code software metrics
title	Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities?
title_full	Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities?
title_fullStr	Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities?
title_full_unstemmed	Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities?
title_short	Are Source Code Metrics “Good Enough” in Predicting Security Vulnerabilities?
title_sort	are source code metrics good enough in predicting security vulnerabilities
topic	prediction; security vulnerabilities; machine learning; source code; software metrics
url	https://www.mdpi.com/2306-5729/7/9/127
work_keys_str_mv	AT sundarakrishnanganesh aresourcecodemetricsgoodenoughinpredictingsecurityvulnerabilities AT francispalma aresourcecodemetricsgoodenoughinpredictingsecurityvulnerabilities AT tobiasolsson aresourcecodemetricsgoodenoughinpredictingsecurityvulnerabilities


Similar Items