Binary Code Representation With Well-Balanced Instruction Normalization

Bibliographic Details
Main Authors: Hyungjoon Koo, Soyeon Park, Daejin Choi, Taesoo Kim
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10077368/
collection DOAJ
description The recovery of contextual meaning from machine code is required by a wide range of binary analysis applications, such as bug discovery, malware analysis, and code clone detection. To accomplish this, advances in binary code analysis borrow techniques from natural language processing to automatically infer the underlying semantics of a binary, rather than relying on manual analysis. One of the crucial pipelines in this process is instruction normalization, which helps to reduce the number of tokens and to avoid the out-of-vocabulary (OOV) problem. However, existing approaches often substitute the operand(s) of an instruction with a common token (e.g., callee target → FOO), inevitably resulting in the loss of important information. In this paper, we introduce well-balanced instruction normalization (WIN), a novel approach that retains rich code information while minimizing the downsides of code normalization. With large swaths of binary code, our findings show that the instruction distribution follows Zipf's Law like a natural language, that a function conveys contextually meaningful information, and that the same instruction at different positions may require diverse code representations. To show the effectiveness of WIN, we present DeepSemantic, which harnesses the BERT architecture with two training phases: pre-training for a generic assembly code representation, and fine-tuning for building a model tailored to a specialized task. We define a downstream task of binary code similarity detection, which requires underlying code semantics. Our experimental results show that our binary similarity model with WIN outperforms two state-of-the-art binary similarity tools, DeepBinDiff and SAFE, with average improvements of 49.8% and 15.8%, respectively.
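The coarse normalization the abstract criticizes (replacing every operand with a common token, e.g., callee target → FOO) can be illustrated with a minimal sketch. The token names FOO/IMM/REG/MEM, the register set, and the parsing rules below are illustrative assumptions, not the paper's actual WIN scheme; the point is only to show how operand substitution shrinks the vocabulary while discarding information:

```python
# Illustrative coarse instruction normalizer: every operand collapses to a
# generic token, so "call 0x401000" and "call 0x402ab0" map to the same
# vocabulary entry (reducing OOV tokens, but losing the callee identity).
# Token names and the register list are hypothetical, for illustration only.
REGISTERS = {"rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"}

def normalize(instruction: str) -> str:
    mnemonic, _, rest = instruction.partition(" ")
    tokens = []
    for op in filter(None, (o.strip() for o in rest.split(","))):
        if mnemonic == "call":
            tokens.append("FOO")   # callee target -> FOO
        elif op.lstrip("-").startswith("0x") or op.lstrip("-").isdigit():
            tokens.append("IMM")   # immediate value -> IMM
        elif op in REGISTERS:
            tokens.append("REG")   # register operand -> REG
        else:
            tokens.append("MEM")   # anything else -> MEM
    return " ".join([mnemonic] + tokens)

print(normalize("call 0x401000"))  # -> call FOO
print(normalize("mov rax, 0x10"))  # -> mov REG IMM
```

Note that two calls to different functions become indistinguishable after this step; WIN, as described above, is designed to avoid exactly that kind of information loss.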
first_indexed 2024-04-09T21:09:37Z
id doaj.art-495f09b0a004403fb247f3cc2cb15a59
institution Directory Open Access Journal
issn 2169-3536
last_indexed 2024-04-09T21:09:37Z
spelling doaj.art-495f09b0a004403fb247f3cc2cb15a59 (record updated 2023-03-28T23:00:17Z)
IEEE Access, vol. 11, pp. 29183–29198, 2023-01-01, ISSN 2169-3536, doi: 10.1109/ACCESS.2023.3259481, IEEE document 10077368
Binary Code Representation With Well-Balanced Instruction Normalization
Hyungjoon Koo (https://orcid.org/0000-0003-0799-0230), Department of Computer Science and Engineering, Sungkyunkwan University, Suwon, South Korea
Soyeon Park (https://orcid.org/0000-0002-6550-0474), School of Computer Science, Georgia Institute of Technology, Atlanta, GA, USA
Daejin Choi (https://orcid.org/0000-0001-5070-360X), Department of Computer Science and Engineering, Incheon National University, Incheon, South Korea
Taesoo Kim (https://orcid.org/0000-0002-7440-2067), School of Computer Science, Georgia Institute of Technology, Atlanta, GA, USA
topic Binary code
code representation
BERT
well-balanced instruction normalization
binary code similarity detection