Binary code similarity analysis based on naming function and common vector space

Abstract: Binary code similarity analysis, which detects whether two binary functions are similar, is widely used in vulnerability search, where source code may not be available. Based on deep learning and natural language processing techniques, several approaches have been proposed to perform cr...

Bibliographic Details
Main Authors:	Bing Xia, Jianmin Pang, Xin Zhou, Zheng Shan, Junchao Wang, Feng Yue
Format:	Article
Language:	English
Published:	Nature Portfolio 2023-09-01
Series:	Scientific Reports
Online Access:	https://doi.org/10.1038/s41598-023-42769-9

author	Bing Xia Jianmin Pang Xin Zhou Zheng Shan Junchao Wang Feng Yue
collection	DOAJ
description	Abstract: Binary code similarity analysis, which detects whether two binary functions are similar, is widely used in vulnerability search, where source code may not be available. Based on deep learning and natural language processing techniques, several approaches have been proposed to perform cross-platform binary code similarity analysis using control flow graphs. However, existing schemes suffer from large differences in instruction syntax across target platforms, an inability to align control flow graph nodes, and limited use of stable high-level semantics, all of which make it hard to identify similar computations between binary functions compiled for different platforms from the same source code. We argue that extracting stable, platform-independent semantics can improve model accuracy, and we propose N_Match, a cross-platform binary function similarity comparison model. The model lifts instructions from different platforms into the same semantic space to shield their platform-specific differences, uses graph embedding to learn the stable semantics of neighbors, extracts high-level knowledge of the naming function to alleviate the differences introduced by different platforms and optimization levels, and combines the stable graph structure with the stable, platform-independent API knowledge of the naming function to represent the final semantics of functions. The experimental results show that N_Match is more accurate than the baseline model in cross-platform, cross-optimization-level, and industrial scenarios. In the vulnerability search experiment, N_Match significantly improves hit@N, and its mAP exceeds that of the current graph embedding model by 66%. We also report several interesting observations from the experiments. The code and model are publicly available at https://www.github.com/CSecurityZhongYuan/Binary-Name_Match .
first_indexed	2024-03-09T15:19:14Z
format	Article
id	doaj.art-1e730217667a4854aca178af59e1a6a4
institution	Directory of Open Access Journals
issn	2045-2322
language	English
last_indexed	2024-03-09T15:19:14Z
publishDate	2023-09-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj.art-1e730217667a4854aca178af59e1a6a4 | 2023-11-26T12:54:04Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2023-09-01 | 131120 | 10.1038/s41598-023-42769-9 | Binary code similarity analysis based on naming function and common vector space | Bing Xia, Jianmin Pang, Xin Zhou, Zheng Shan, Junchao Wang, Feng Yue (all: State Key Laboratory of Mathematical Engineering and Advanced Computing) | https://doi.org/10.1038/s41598-023-42769-9
title	Binary code similarity analysis based on naming function and common vector space
url	https://doi.org/10.1038/s41598-023-42769-9

Similar Items