Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection


Bibliographic Details
Main Authors: Guanjun Lin, Heming Jia, Di Wu
Format: Article
Language: English
Published: MDPI AG, 2022-11-01
Series: Mathematics
Subjects: pre-trained contextualized embedding; function-level; vulnerability detection; model compression; knowledge distillation
Online Access: https://www.mdpi.com/2227-7390/10/23/4482
_version_ 1827643001255493632
author Guanjun Lin
Heming Jia
Di Wu
author_facet Guanjun Lin
Heming Jia
Di Wu
author_sort Guanjun Lin
collection DOAJ
description Detecting vulnerabilities in programs is an important yet challenging problem in cybersecurity. Recent advances in natural language understanding have enabled data-driven research on automated code analysis to embrace Pre-trained Contextualized Models (PCMs). These models are pre-trained on large corpora and can be fine-tuned for various downstream tasks, but their feasibility and effectiveness for software vulnerability detection have not been systematically studied. In this paper, we explore six prevalent PCMs and compare them with three mainstream Non-Contextualized Models (NCMs) in terms of generating effective function-level representations for vulnerability detection. We found that, although the PCMs outperformed the NCMs in detection performance, training and fine-tuning the PCMs were computationally expensive; their deployment and inference costs are also considerable in practice, which may prevent the wide adoption of PCMs in this field. However, we discovered that, when the PCMs were compressed using knowledge distillation, they achieved similar detection performance with significantly improved efficiency compared with their uncompressed counterparts, when fine-tuned on 40,000 synthetic C functions and trained on approximately 79,200 real-world C functions. Among the distilled PCMs, the distilled CodeBERT achieved the most cost-effective performance. Therefore, we propose a framework encapsulating the distilled CodeBERT for end-to-end Vulnerable function Detection, named DistilVD. To examine the performance of the proposed framework in real-world scenarios, DistilVD was tested on four open-source real-world projects with a small amount of training data. Results showed that DistilVD outperformed the five baseline approaches. Further evaluations on multi-class vulnerability detection also confirmed the effectiveness of DistilVD in detecting various vulnerability types.
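The abstract describes fine-tuning pre-trained contextualized models, compressed via knowledge distillation, to classify individual C functions as vulnerable or not. As a rough illustration of that pipeline (not the authors' implementation), the sketch below fine-tunes a distilled Transformer checkpoint for binary function-level classification using the Hugging Face transformers API. The checkpoint name, hyperparameters, and toy data are assumptions/placeholders; in practice one would substitute the paper's distilled CodeBERT checkpoint and the synthetic/real-world C corpora it describes.

```python
# Minimal sketch of fine-tuning a distilled PCM for function-level
# vulnerability detection (assumptions: checkpoint name, hyperparameters,
# and toy data are placeholders, not the paper's setup).
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "distilroberta-base"  # placeholder stand-in for a distilled CodeBERT


class FunctionDataset(Dataset):
    """Pairs of (C function source code, 0/1 vulnerability label)."""

    def __init__(self, functions, labels, tokenizer, max_len=512):
        # Tokenize whole functions; long functions are truncated to max_len tokens.
        self.enc = tokenizer(functions, truncation=True, max_length=max_len,
                             padding="max_length", return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item


def fine_tune(functions, labels, epochs=3, lr=2e-5, batch_size=16):
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    # Two labels: vulnerable vs. non-vulnerable.
    model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
    loader = DataLoader(FunctionDataset(functions, labels, tokenizer),
                        batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            optim.zero_grad()
            loss = model(**batch).loss  # cross-entropy over the two classes
            loss.backward()
            optim.step()
    return tokenizer, model


if __name__ == "__main__":
    # Toy example only; real training would use the labeled C function corpora.
    funcs = ["int f(char *s){ char b[8]; strcpy(b, s); return 0; }",
             "int g(int a, int b){ return a + b; }"]
    labels = [1, 0]
    fine_tune(funcs, labels, epochs=1, batch_size=2)
```

For multi-class detection of vulnerability types, the same sketch applies with num_labels set to the number of vulnerability categories instead of 2.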
first_indexed 2024-03-09T17:40:41Z
format Article
id doaj.art-624856c0dcc446a8aab5ff71f2672208
institution Directory Open Access Journal
issn 2227-7390
language English
last_indexed 2024-03-09T17:40:41Z
publishDate 2022-11-01
publisher MDPI AG
record_format Article
series Mathematics
spelling doaj.art-624856c0dcc446a8aab5ff71f2672208 2023-11-24T11:34:17Z
eng | MDPI AG | Mathematics | ISSN 2227-7390 | 2022-11-01 | vol. 10, iss. 23, art. 4482 | doi:10.3390/math10234482
Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
Guanjun Lin (School of Information Engineering, Sanming University, Sanming 365004, China)
Heming Jia (School of Information Engineering, Sanming University, Sanming 365004, China)
Di Wu (School of Education and Music, Sanming University, Sanming 365004, China)
https://www.mdpi.com/2227-7390/10/23/4482
pre-trained contextualized embedding; function-level; vulnerability detection; model compression; knowledge distillation
spellingShingle Guanjun Lin
Heming Jia
Di Wu
Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
Mathematics
pre-trained contextualized embedding
function-level
vulnerability detection
model compression
knowledge distillation
title Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_full Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_fullStr Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_full_unstemmed Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_short Distilled and Contextualized Neural Models Benchmarked for Vulnerable Function Detection
title_sort distilled and contextualized neural models benchmarked for vulnerable function detection
topic pre-trained contextualized embedding
function-level
vulnerability detection
model compression
knowledge distillation
url https://www.mdpi.com/2227-7390/10/23/4482
work_keys_str_mv AT guanjunlin distilledandcontextualizedneuralmodelsbenchmarkedforvulnerablefunctiondetection
AT hemingjia distilledandcontextualizedneuralmodelsbenchmarkedforvulnerablefunctiondetection
AT diwu distilledandcontextualizedneuralmodelsbenchmarkedforvulnerablefunctiondetection