PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis
The Portable Document Format (PDF) is one of the most widely used file types, thus fraudsters insert harmful code into victims’ PDF documents to compromise their equipment. Conventional solutions and identification techniques are often insufficient and may only partially prevent PDF malwa...
Main Authors: | G. M. Sakhawat Hossain, Kaushik Deb, Helge Janicke, Iqbal H. Sarker |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2024-01-01 |
Series: | IEEE Access |
Subjects: | Cybersecurity, PDF malware, data analytics, machine learning, decision rule, explainable AI |
Online Access: | https://ieeexplore.ieee.org/document/10412055/ |
_version_ | 1827132829308289024 |
---|---|
author | G. M. Sakhawat Hossain Kaushik Deb Helge Janicke Iqbal H. Sarker |
author_facet | G. M. Sakhawat Hossain Kaushik Deb Helge Janicke Iqbal H. Sarker |
author_sort | G. M. Sakhawat Hossain |
collection | DOAJ |
description | The Portable Document Format (PDF) is one of the most widely used file types, thus fraudsters insert harmful code into victims’ PDF documents to compromise their equipment. Conventional solutions and identification techniques are often insufficient and may only partially prevent PDF malware because of their versatile character and excessive dependence on a certain typical feature set. The primary goal of this work is to detect PDF malware efficiently in order to alleviate the current difficulties. To accomplish the goal, we first develop a comprehensive dataset of 15958 PDF samples taking into account the non-malevolent, malicious, and evasive behaviors of the PDF samples. Using three well-known PDF analysis tools (PDFiD, PDFINFO, and PDF-PARSER), we extract significant characteristics from the PDF samples of our newly created dataset. In addition, we generate a number of derivations of features that have been experimentally proven to be helpful in classifying PDF malware. We develop a method to build an efficient and explicable feature set through the proper empirical analysis of the extracted and derived features. We explore different baseline machine learning classifiers and demonstrate an accuracy improvement of approx. 2% for the Random Forest classifier utilizing the selected feature set. Furthermore, we demonstrate the model’s explainability by creating a decision tree that generates rules for human interpretation. Eventually, we make a comparison with previous studies and point out some important findings. |
first_indexed | 2024-03-08T09:32:17Z |
format | Article |
id | doaj.art-c5302ad248bb428d91cc553eafb53eaf |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2025-03-20T16:54:11Z |
publishDate | 2024-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-c5302ad248bb428d91cc553eafb53eaf; 2024-08-28T23:01:19Z; eng; IEEE; IEEE Access; 2169-3536; 2024-01-01; 12; 13833; 13859; 10.1109/ACCESS.2024.3357620; 10412055; PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis; G. M. Sakhawat Hossain [0] https://orcid.org/0000-0002-9632-0521; Kaushik Deb [1] https://orcid.org/0000-0002-7345-0999; Helge Janicke [2] https://orcid.org/0000-0002-1345-2829; Iqbal H. Sarker [3] https://orcid.org/0000-0003-1740-5517; Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh; Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh; Cyber Security Cooperative Research Centre, Joondalup, WA, Australia; Cyber Security Cooperative Research Centre, Joondalup, WA, Australia; The Portable Document Format (PDF) is one of the most widely used file types, thus fraudsters insert harmful code into victims’ PDF documents to compromise their equipment. Conventional solutions and identification techniques are often insufficient and may only partially prevent PDF malware because of their versatile character and excessive dependence on a certain typical feature set. The primary goal of this work is to detect PDF malware efficiently in order to alleviate the current difficulties. To accomplish the goal, we first develop a comprehensive dataset of 15958 PDF samples taking into account the non-malevolent, malicious, and evasive behaviors of the PDF samples. Using three well-known PDF analysis tools (PDFiD, PDFINFO, and PDF-PARSER), we extract significant characteristics from the PDF samples of our newly created dataset. In addition, we generate a number of derivations of features that have been experimentally proven to be helpful in classifying PDF malware. We develop a method to build an efficient and explicable feature set through the proper empirical analysis of the extracted and derived features. We explore different baseline machine learning classifiers and demonstrate an accuracy improvement of approx. 2% for the Random Forest classifier utilizing the selected feature set. Furthermore, we demonstrate the model’s explainability by creating a decision tree that generates rules for human interpretation. Eventually, we make a comparison with previous studies and point out some important findings.; https://ieeexplore.ieee.org/document/10412055/; Cybersecurity; PDF malware; data analytics; machine learning; decision rule; explainable AI |
spellingShingle | G. M. Sakhawat Hossain Kaushik Deb Helge Janicke Iqbal H. Sarker PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis IEEE Access Cybersecurity PDF malware data analytics machine learning decision rule explainable AI |
title | PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis |
title_full | PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis |
title_fullStr | PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis |
title_full_unstemmed | PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis |
title_short | PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis |
title_sort | pdf malware detection toward machine learning modeling with explainability analysis |
topic | Cybersecurity PDF malware data analytics machine learning decision rule explainable AI |
url | https://ieeexplore.ieee.org/document/10412055/ |
work_keys_str_mv | AT gmsakhawathossain pdfmalwaredetectiontowardmachinelearningmodelingwithexplainabilityanalysis AT kaushikdeb pdfmalwaredetectiontowardmachinelearningmodelingwithexplainabilityanalysis AT helgejanicke pdfmalwaredetectiontowardmachinelearningmodelingwithexplainabilityanalysis AT iqbalhsarker pdfmalwaredetectiontowardmachinelearningmodelingwithexplainabilityanalysis |