Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification
Over the last few years, malware propagation on PC platforms, especially on Windows OS, has become even more severe. To resist the large number of malware variants, machine learning (ML) classifiers for malicious Portable Executable (PE) files have been proposed to achieve automated clas...
Main Authors: | Yipin Zhang, Xiaolin Chang, Yuzhou Lin, Jelena Misic, Vojislav B. Misic |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2020-01-01 |
Series: | IEEE Access |
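Subjects: | Function call graph, machine learning, malware classification, Portable Executable, statistical features |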
Online Access: | https://ieeexplore.ieee.org/document/9023948/ |
_version_ | 1818933120643104768 |
---|---|
author | Yipin Zhang, Xiaolin Chang, Yuzhou Lin, Jelena Misic, Vojislav B. Misic |
author_sort | Yipin Zhang |
collection | DOAJ |
description | Over the last few years, malware propagation on PC platforms, especially on Windows OS, has become even more severe. To resist the large number of malware variants, machine learning (ML) classifiers for malicious Portable Executable (PE) files have been proposed to automate classification. Recently, the function call graph (FCG) vectorization (FCGV) representation was explored as an input feature to achieve higher ML classification accuracy, but the FCGV representation loses some critical features of PE files because of its hashing technique. This paper aims to further improve the classification accuracy of the FCGV-based ML model by applying both graph and non-graph features. We propose an FCGV-SF based Random Forest classification model, which applies both FCGV features (graph features) and statistical features (SF, non-graph features) extracted from disassembled PE files. Six types of effective non-graph features are chosen for our integrated vector: metadata, symbol, operation code, register, section, and data definition. We evaluate our model on a dataset provided by Microsoft and hosted on Kaggle, and the experimental results indicate that classification accuracy increases from 0.9851 to 0.9957 compared with the existing model based on FCGV only. |
first_indexed | 2024-12-20T04:43:20Z |
format | Article |
id | doaj.art-05e140831dfb4387b024a62291e6bcf7 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-20T04:43:20Z |
publishDate | 2020-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-05e140831dfb4387b024a62291e6bcf7; 2022-12-21T19:53:04Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2020-01-01; vol. 8, pp. 44652-44660; DOI 10.1109/ACCESS.2020.2978335; IEEE document 9023948; Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification; Yipin Zhang (https://orcid.org/0000-0003-1533-6407), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China; Xiaolin Chang (https://orcid.org/0000-0002-2975-8857), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China; Yuzhou Lin (https://orcid.org/0000-0001-6617-9443), Beijing Key Laboratory of Security and Privacy in Intelligent Transportation, Beijing Jiaotong University, Beijing, China; Jelena Misic (https://orcid.org/0000-0002-1251-3730), Computer Science Department, Ryerson University, Toronto, ON, Canada; Vojislav B. Misic (https://orcid.org/0000-0001-7760-9920), Computer Science Department, Ryerson University, Toronto, ON, Canada; https://ieeexplore.ieee.org/document/9023948/; Function call graph, machine learning, malware classification, Portable Executable, statistical features |
title | Exploring Function Call Graph Vectorization and File Statistical Features in Malicious PE File Classification |
topic | Function call graph, machine learning, malware classification, Portable Executable, statistical features |
url | https://ieeexplore.ieee.org/document/9023948/ |
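
The description field above outlines an integrated vector that joins hashed function call graph features with statistical features from the disassembly. The record does not give the paper's actual hashing scheme or feature definitions, so the following is only a minimal Python sketch of the general idea, using a generic feature-hashing trick and normalized opcode frequencies as a stand-in for one of the six statistical feature types. All function names, the bucket count, and the sample data are hypothetical.

```python
import hashlib
from collections import Counter


def hash_bucket(token: str, n_buckets: int) -> int:
    """Map a string to a stable bucket index (assumed scheme, not the paper's)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets


def vectorize_call_graph(edges, n_buckets=16):
    """Count call-graph edges per hash bucket to get a fixed-length vector.

    Distinct edges may collide in one bucket, which is the kind of
    information loss the abstract attributes to hashing.
    """
    vec = [0] * n_buckets
    for caller, callee in edges:
        vec[hash_bucket(f"{caller}->{callee}", n_buckets)] += 1
    return vec


def opcode_statistics(opcodes, vocabulary):
    """Relative frequency of each opcode in a fixed, hypothetical vocabulary."""
    counts = Counter(opcodes)
    total = len(opcodes) or 1  # avoid division by zero on empty input
    return [counts[op] / total for op in vocabulary]


def integrated_vector(edges, opcodes, vocabulary, n_buckets=16):
    """Concatenate graph features and statistical features into one vector."""
    return vectorize_call_graph(edges, n_buckets) + opcode_statistics(opcodes, vocabulary)


# Hypothetical sample taken from an imagined disassembly listing.
EDGES = [("start", "sub_401000"), ("sub_401000", "CreateFileA"), ("sub_401000", "WriteFile")]
OPCODES = ["push", "mov", "call", "mov", "call", "ret"]
VOCAB = ["mov", "push", "call", "ret", "xor"]

features = integrated_vector(EDGES, OPCODES, VOCAB)
print(len(features))
```

A vector built this way for each file could then be passed to a Random Forest implementation such as scikit-learn's `RandomForestClassifier`; that step is omitted here to keep the sketch dependency-free.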