BinCC: Scalable Function Similarity Detection in Multiple Cross-Architectural Binaries

With the undeniable increase in the popularity of open source software, the availability and reuse of source code have also increased. While code clone detection helps track reuse and evolution in source code, little prior work exists that can be used on binary code. This is c...

Bibliographic Details
Main Authors: Davide Pizzolotto, Katsuro Inoue
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Code clones; static code analysis; reverse engineering; compilers
Online Access: https://ieeexplore.ieee.org/document/9964192/
author Davide Pizzolotto
Katsuro Inoue
collection DOAJ
description With the undeniable increase in the popularity of open source software, the availability and reuse of source code have also increased. While code clone detection helps track reuse and evolution in source code, little prior work exists that can be used on binary code. This is complicated by the increased difficulty posed by compilation transformations. In this paper, we present a CFG refinement that finds function-level clones in a fast and scalable way by comparing the high-level structure of multiple disassembled binaries together. We can determine whether functions belonging to other programs have been copied or reused, even when the processor architecture is different. Specifically, our algorithm extracts the flow of each function and reconstructs a higher-level structure, leveraging architectural differences and allowing efficient comparison in linear time with structural hashing. We implemented our idea in a tool called BinCC and analyzed 24 million functions spanning different architectures and optimization levels. Results show that our approach achieves precision between 91% and 99% within the same architecture and 75% when detecting clones across different architectures, and can also detect the presence of specific library functions inside an executable. Our approach reaches precision comparable to current state-of-the-art learning approaches while being three orders of magnitude faster.
first_indexed 2024-04-12T03:27:06Z
format Article
id doaj.art-4ae01fb0d613479b90cf8ef9f6f2609c
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-04-12T03:27:06Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-4ae01fb0d613479b90cf8ef9f6f2609c; 2022-12-22T03:49:40Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2022-01-01; vol. 10, pp. 124491-124506; DOI 10.1109/ACCESS.2022.3225100; IEEE document 9964192
BinCC: Scalable Function Similarity Detection in Multiple Cross-Architectural Binaries
Davide Pizzolotto (https://orcid.org/0000-0002-7690-6592), Osaka University, Osaka, Japan
Katsuro Inoue (https://orcid.org/0000-0001-5424-0614), Nanzan University, Nagoya, Japan
https://ieeexplore.ieee.org/document/9964192/
Code clones; static code analysis; reverse engineering; compilers
title BinCC: Scalable Function Similarity Detection in Multiple Cross-Architectural Binaries
topic Code clones
static code analysis
reverse engineering
compilers
url https://ieeexplore.ieee.org/document/9964192/