Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification

Over the last few years, malware propagation on PC platforms, especially on Windows, has become increasingly severe. To counter the large number of malware variants, machine learning (ML) classifiers for malicious Portable Executable (PE) files have been proposed to achieve automated clas...


Bibliographic Details
Main Authors: Yipin Zhang, Xiaolin Chang, Yuzhou Lin, Jelena Misic, Vojislav B. Misic
Format: Article
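Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects: Function call graph; machine learning; malware classification; Portable Executable; statistical features
Online Access: https://ieeexplore.ieee.org/document/9023948/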
author Yipin Zhang
Xiaolin Chang
Yuzhou Lin
Jelena Misic
Vojislav B. Misic
collection DOAJ
description Over the last few years, malware propagation on PC platforms, especially on Windows, has become increasingly severe. To counter the large number of malware variants, machine learning (ML) classifiers for malicious Portable Executable (PE) files have been proposed to automate classification. Recently, function call graph (FCG) vectorization (FCGV) was explored as an input representation to achieve higher ML classification accuracy, but the FCGV representation loses some critical features of PE files because of the hashing technique it relies on. This paper aims to further improve the classification accuracy of FCGV-based ML models by applying both graph and non-graph features. We propose an FCGV-SF based Random Forest classification model, which applies both FCGV features (graph features) and statistical features (SF, non-graph features) extracted from disassembled PE files. Six types of effective non-graph features are chosen for our integrated vector: metadata, symbol, operation code, register, section, and data definition. We evaluate our model on a dataset provided by Microsoft and hosted on Kaggle; the experimental results indicate that classification accuracy increases from 0.9851 to 0.9957 compared with the existing model based on FCGV only.
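The idea of an integrated vector (hashed graph features concatenated with statistical counts) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not describe their FCGV algorithm, so plain feature hashing of call-graph edges is used here as a stand-in, and every function name, the bucket count, and the sample disassembly are hypothetical. The sketch also shows why hashing can lose information: distinct edges may fall into the same bucket.

```python
import hashlib
from collections import Counter


def stable_bucket(token, n_buckets):
    """Map a token to a bucket index using a hash that is stable across runs."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_call_graph(edges, n_buckets=16):
    """Hypothetical stand-in for FCGV: count caller->callee edges per hash bucket.

    Distinct edges can collide in one bucket, which is the kind of
    information loss a hashed representation may suffer from.
    """
    vec = [0] * n_buckets
    for caller, callee in edges:
        vec[stable_bucket(f"{caller}->{callee}", n_buckets)] += 1
    return vec


def statistical_features(asm_lines, opcodes, registers):
    """Count how often the chosen opcodes and registers occur in the disassembly."""
    counts = Counter()
    for line in asm_lines:
        for tok in line.replace(",", " ").split():
            counts[tok.lower()] += 1
    return [counts[o] for o in opcodes] + [counts[r] for r in registers]


def combined_vector(edges, asm_lines, opcodes, registers, n_buckets=16):
    """Concatenate graph features and statistical features into one vector."""
    graph_part = hash_call_graph(edges, n_buckets)
    stat_part = statistical_features(asm_lines, opcodes, registers)
    return graph_part + stat_part


# Made-up sample input (assumption, not taken from the Kaggle dataset).
edges = [
    ("sub_401000", "sub_401050"),
    ("sub_401000", "CreateFileA"),
    ("sub_401050", "CreateFileA"),
]
asm_lines = ["mov eax, ebx", "push eax", "call sub_401050", "mov ecx, eax"]
opcodes = ["mov", "push", "call"]
registers = ["eax", "ebx", "ecx"]

vec = combined_vector(edges, asm_lines, opcodes, registers)
print(len(vec), vec[16:])
```

A vector of this shape, one per file, is what a Random Forest or any other tabular classifier would take as input; the remaining feature types named in the description (metadata, symbol, section, data definition) would be appended in the same way.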
first_indexed 2024-12-20T04:43:20Z
format Article
id doaj.art-05e140831dfb4387b024a62291e6bcf7
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-20T04:43:20Z
publishDate 2020-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-05e140831dfb4387b024a62291e6bcf7
2022-12-21T19:53:04Z
eng
IEEE
IEEE Access
2169-3536
2020-01-01
Volume 8, pages 44652-44660
DOI 10.1109/ACCESS.2020.2978335
IEEE Xplore document 9023948
Yipin Zhang (https://orcid.org/0000-0003-1533-6407), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China
Xiaolin Chang (https://orcid.org/0000-0002-2975-8857), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China
Yuzhou Lin (https://orcid.org/0000-0001-6617-9443), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China
Jelena Misic (https://orcid.org/0000-0002-1251-3730), Computer Science Department, Ryerson University, Toronto, ON, Canada
Vojislav B. Misic (https://orcid.org/0000-0001-7760-9920), Computer Science Department, Ryerson University, Toronto, ON, Canada
title Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification
topic Function call graph
machine learning
malware classification
Portable Executable
statistical features
url https://ieeexplore.ieee.org/document/9023948/