Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification

Over the last few years, malware propagation on PC platforms, especially on Windows, has become increasingly severe. To resist the large number of malware variants, machine learning (ML) classifiers for malicious Portable Executable (PE) files have been proposed to achieve automated clas...

Bibliographic Details
Main Authors:	Yipin Zhang, Xiaolin Chang, Yuzhou Lin, Jelena Misic, Vojislav B. Misic
Format:	Article
Language:	English
Published:	IEEE 2020-01-01
Series:	IEEE Access
Subjects:	Function call graph; machine learning; malware classification; Portable Executable; statistical features
Online Access:	https://ieeexplore.ieee.org/document/9023948/

author	Yipin Zhang, Xiaolin Chang, Yuzhou Lin, Jelena Misic, Vojislav B. Misic
collection	DOAJ
description	Over the last few years, malware propagation on PC platforms, especially on Windows, has become increasingly severe. To resist the large number of malware variants, machine learning (ML) classifiers for malicious Portable Executable (PE) files have been proposed to automate classification. Recently, function call graph (FCG) vectorization (FCGV) was explored as an input representation to achieve higher ML classification accuracy, but the FCGV representation loses some critical features of PE files because of the hashing technique it uses. This paper aims to further improve the classification accuracy of FCGV-based ML models by combining graph and non-graph features. We propose an FCGV-SF based Random Forest classification model, which uses both FCGV features (graph features) and statistical features (SF, non-graph features) extracted from disassembled PE files. Six types of effective non-graph features are chosen for our integrated vector, namely metadata, symbol, operation code, register, section, and data definition. We evaluate our model on a dataset provided by Microsoft and hosted on Kaggle, and the experimental results indicate that classification accuracy increases from 0.9851 for the existing FCGV-only model to 0.9957.
first_indexed	2024-12-20T04:43:20Z
format	Article
id	doaj.art-05e140831dfb4387b024a62291e6bcf7
institution	Directory Open Access Journal
issn	2169-3536
language	English
last_indexed	2024-12-20T04:43:20Z
publishDate	2020-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-05e140831dfb4387b024a62291e6bcf7; 2022-12-21T19:53:04Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2020-01-01; vol. 8, pp. 44652-44660; DOI 10.1109/ACCESS.2020.2978335; IEEE document 9023948
	Yipin Zhang (https://orcid.org/0000-0003-1533-6407), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China
	Xiaolin Chang (https://orcid.org/0000-0002-2975-8857), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China
	Yuzhou Lin (https://orcid.org/0000-0001-6617-9443), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China
	Jelena Misic (https://orcid.org/0000-0002-1251-3730), Computer Science Department, Ryerson University, Toronto, ON, Canada
	Vojislav B. Misic (https://orcid.org/0000-0001-7760-9920), Computer Science Department, Ryerson University, Toronto, ON, Canada
title	Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification

Similar Items