Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection


Bibliographic Details
Main Authors: Guanjun Lin, Heming Jia, Di Wu
Format: Article
Language: English
Published: MDPI AG, 2022-11-01
Series: Mathematics
Subjects: pre-trained contextualized embedding; function-level; vulnerability detection; model compression; knowledge distillation
Online Access: https://www.mdpi.com/2227-7390/10/23/4482
_version_ 1827643001255493632
author Guanjun Lin
Heming Jia
Di Wu
author_facet Guanjun Lin
Heming Jia
Di Wu
author_sort Guanjun Lin
collection DOAJ
description Detecting vulnerabilities in programs is an important yet challenging problem in cybersecurity. Recent advances in natural language understanding have enabled data-driven research on automated code analysis to embrace Pre-trained Contextualized Models (PCMs). These models are pre-trained on large corpora and can be fine-tuned for various downstream tasks, but their feasibility and effectiveness for software vulnerability detection have not been systematically studied. In this paper, we explore six prevalent PCMs and compare them with three mainstream Non-Contextualized Models (NCMs) in terms of generating effective function-level representations for vulnerability detection. We found that, although the PCMs outperformed the NCMs in detection performance, training and fine-tuning the PCMs were computationally expensive; their deployment and inference costs are also considerable in practice, which may prevent the wide adoption of PCMs in this field. However, we discovered that, when the PCMs were compressed using knowledge distillation, they achieved similar detection performance with significantly improved efficiency compared with their uncompressed counterparts, when fine-tuned on 40,000 synthetic C functions and trained on approximately 79,200 real-world C functions. Among the distilled PCMs, the distilled CodeBERT achieved the most cost-effective performance. Therefore, we propose a framework encapsulating the distilled CodeBERT for end-to-end Vulnerable function Detection, named DistilVD. To examine the performance of the proposed framework in real-world scenarios, DistilVD was tested on four open-source real-world projects with a small amount of training data. Results showed that DistilVD outperformed the five baseline approaches. Further evaluations on multi-class vulnerability detection also confirmed the effectiveness of DistilVD in detecting various vulnerability types.
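The abstract describes fine-tuning pre-trained contextualized models, compressed via knowledge distillation, to classify individual C functions as vulnerable or not. As a rough illustration of that pipeline (not the authors' implementation), the sketch below fine-tunes a distilled Transformer checkpoint for binary function-level classification using the Hugging Face transformers API. The checkpoint name, hyperparameters, and toy data are assumptions/placeholders; in practice one would substitute the paper's distilled CodeBERT checkpoint and the synthetic/real-world C corpora it describes.

```python
# Minimal sketch of fine-tuning a distilled PCM for function-level
# vulnerability detection (assumptions: checkpoint name, hyperparameters,
# and toy data are placeholders, not the paper's setup).
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "distilroberta-base"  # placeholder stand-in for a distilled CodeBERT


class FunctionDataset(Dataset):
    """Pairs of (C function source code, 0/1 vulnerability label)."""

    def __init__(self, functions, labels, tokenizer, max_len=512):
        # Tokenize whole functions; long functions are truncated to max_len tokens.
        self.enc = tokenizer(functions, truncation=True, max_length=max_len,
                             padding="max_length", return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item


def fine_tune(functions, labels, epochs=3, lr=2e-5, batch_size=16):
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    # Two labels: vulnerable vs. non-vulnerable.
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
    loader = DataLoader(FunctionDataset(functions, labels, tokenizer),
                        batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            optim.zero_grad()
            loss = model(**batch).loss  # cross-entropy over the two classes
            loss.backward()
            optim.step()
    return tokenizer, model


if __name__ == "__main__":
    # Toy example only; real training would use the labeled C function corpora.
    funcs = ["int f(char *s){ char b[8]; strcpy(b, s); return 0; }",
             "int g(int a, int b){ return a + b; }"]
    labels = [1, 0]
    fine_tune(funcs, labels, epochs=1, batch_size=2)
```

For multi-class detection of vulnerability types, the same sketch applies with num_labels set to the number of vulnerability categories instead of 2.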
first_indexed 2024-03-09T17:40:41Z
format Article
id doaj.art-624856c0dcc446a8aab5ff71f2672208
institution Directory Open Access Journal
issn 2227-7390
language English
last_indexed 2024-03-09T17:40:41Z
publishDate 2022-11-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj.art-624856c0dcc446a8aab5ff71f2672208 2023-11-24T11:34:17Z
eng | MDPI AG | Mathematics | ISSN 2227-7390 | 2022-11-01 | vol. 10, iss. 23, art. 4482 | doi:10.3390/math10234482
Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
Guanjun Lin (School of Information Engineering, Sanming University, Sanming 365004, China)
Heming Jia (School of Information Engineering, Sanming University, Sanming 365004, China)
Di Wu (School of Education and Music, Sanming University, Sanming 365004, China)
https://www.mdpi.com/2227-7390/10/23/4482
pre-trained contextualized embedding; function-level; vulnerability detection; model compression; knowledge distillation
spellingShingle Guanjun Lin
Heming Jia
Di Wu
Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
Mathematics
pre-trained contextualized embedding
function-level
vulnerability detection
model compression
knowledge distillation
title Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_full Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_fullStr Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_full_unstemmed Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_short Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_sort distilled and contextualized neural models benchmarked for vulnerable function detection
topic pre-trained contextualized embedding
function-level
vulnerability detection
model compression
knowledge distillation
url https://www.mdpi.com/2227-7390/10/23/4482
work_keys_str_mv AT guanjunlin distilledandcontextualizedneuralmodelsbenchmarkedforvulnerablefunctiondetection
AT hemingjia distilledandcontextualizedneuralmodelsbenchmarkedforvulnerablefunctiondetection
AT diwu distilledandcontextualizedneuralmodelsbenchmarkedforvulnerablefunctiondetection