Investigating the impact of vulnerability datasets on deep learning-based vulnerability detectors

Bibliographic Details
Main Authors: Lili Liu, Zhen Li, Yu Wen, Penglong Chen
Format: Article
Language: English
Published: PeerJ Inc. 2022-05-01
Series: PeerJ Computer Science
Subjects: Vulnerability dataset; Deep learning; Vulnerability detection
Online Access: https://peerj.com/articles/cs-975.pdf
collection DOAJ
description Software vulnerabilities have led to system attacks and data leakage incidents and have gradually attracted attention, making vulnerability detection an important research direction. In recent years, Deep Learning (DL)-based methods have been applied to vulnerability detection. DL-based methods do not require manually defined features and achieve low false-negative and false-positive rates. DL-based vulnerability detectors rely on vulnerability datasets. Recent studies found that DL-based vulnerability detectors perform differently on different vulnerability datasets, and that the authenticity, imbalance, and repetition rate of a dataset affect detector effectiveness. However, existing research reports only simple statistics; it neither characterizes vulnerability datasets nor systematically studies their impact on DL-based vulnerability detectors. To address these problems, we propose methods to characterize sample similarity and code features. We use sample granularity, sample similarity, and code features to characterize vulnerability datasets. We then analyze the correlation between these dataset characteristics and the results of DL-based vulnerability detectors. Finally, we systematically study the impact of vulnerability datasets on DL-based vulnerability detectors in terms of sample granularity, sample similarity, and code features. We draw the following insights: (1) Fine-grained samples are conducive to detecting vulnerabilities. (2) Vulnerability datasets with lower inter-class similarity, higher intra-class similarity, and simple structure help detect vulnerabilities in the original test set. (3) Vulnerability datasets with higher inter-class similarity, lower intra-class similarity, and complex structure can better detect vulnerabilities in other datasets.
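The inter-class and intra-class similarity mentioned in the description can be illustrated with a minimal sketch. This record does not state which similarity measure the authors use, so the example below assumes Jaccard similarity over token sets as a stand-in; the function names (`tokens`, `jaccard`, `class_similarity`), the crude tokenizer, and the toy samples are all hypothetical and are not taken from the article.

```python
from __future__ import annotations

import re
from itertools import combinations


def tokens(code: str) -> set[str]:
    """Split a code sample into a set of identifier and number tokens (crude tokenizer)."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", code))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _mean(values: list[float]) -> float:
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def class_similarity(samples: list[tuple[str, int]]) -> tuple[float, float]:
    """Return (mean intra-class, mean inter-class) similarity over all sample pairs."""
    intra: list[float] = []
    inter: list[float] = []
    for (code_a, label_a), (code_b, label_b) in combinations(samples, 2):
        score = jaccard(tokens(code_a), tokens(code_b))
        # Pairs with the same label are intra-class; differing labels are inter-class.
        (intra if label_a == label_b else inter).append(score)
    return _mean(intra), _mean(inter)


# Toy dataset (assumed): label 1 = vulnerable, label 0 = not vulnerable.
toy = [
    ("strcpy(buf, src);", 1),
    ("strcpy(dst, src);", 1),
    ("strncpy(buf, src, n);", 0),
    ("strncpy(dst, src, n);", 0),
]
intra, inter = class_similarity(toy)
```

On a dataset where `intra` is high and `inter` is low, the two classes are well separated under this assumed measure, which corresponds to the setting that insight (2) associates with better results on the original test set.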
id doaj.art-51818635a0b64e4c95d6305f127bdb0a
institution Directory Open Access Journal
issn 2376-5992
citation PeerJ Computer Science, vol. 8, e975 (2022-05-01). doi:10.7717/peerj-cs.975
topic Vulnerability dataset
Deep learning
Vulnerability detection