PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

Bibliographic Details
Main Authors: G. M. Sakhawat Hossain, Kaushik Deb, Helge Janicke, Iqbal H. Sarker
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Cybersecurity, PDF malware, data analytics, machine learning, decision rule, explainable AI
Online Access: https://ieeexplore.ieee.org/document/10412055/
collection DOAJ
description The Portable Document Format (PDF) is one of the most widely used file types, so attackers embed harmful code in PDF documents to compromise victims’ devices. Conventional solutions and identification techniques are often insufficient and may only partially prevent PDF malware because of its versatile character and their excessive dependence on a particular, typical feature set. The primary goal of this work is to detect PDF malware efficiently and thereby alleviate these difficulties. To that end, we first build a comprehensive dataset of 15,958 PDF samples that covers benign, malicious, and evasive behaviors. Using three well-known PDF analysis tools (PDFiD, PDFINFO, and PDF-PARSER), we extract significant characteristics from the samples in this newly created dataset. In addition, we derive a number of features that have been shown experimentally to help classify PDF malware. Through empirical analysis of the extracted and derived features, we develop a method for building an efficient and explainable feature set. We evaluate several baseline machine learning classifiers and demonstrate an accuracy improvement of approximately 2% for the Random Forest classifier using the selected feature set. Furthermore, we demonstrate the model’s explainability by creating a decision tree that generates rules for human interpretation. Finally, we compare our results with previous studies and highlight some important findings.
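As a rough illustration of the kind of pipeline the description outlines (keyword-count features in the spirit of PDFiD, followed by a human-readable rule), here is a minimal Python sketch. It is not the authors' feature set, tool chain, or learned rules: the keyword list, the functions `count_keywords` and `looks_suspicious`, the rule itself, and the `SAMPLE` bytes are all hypothetical and chosen only for illustration. The sketch also does not handle obfuscated PDF names.

```python
import re

# Hypothetical subset of PDF keywords often inspected by PDFiD-style tools.
# This is an assumption for illustration, not the feature set from the paper.
KEYWORDS = [b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch", b"/EmbeddedFile"]


def count_keywords(data: bytes) -> dict:
    """Count occurrences of each keyword in the raw bytes of a PDF file."""
    counts = {}
    for kw in KEYWORDS:
        # Require that the keyword is not followed by another letter or digit,
        # so that a longer name sharing the same prefix is not counted.
        pattern = re.escape(kw) + rb"(?![A-Za-z0-9])"
        counts[kw.decode()] = len(re.findall(pattern, data))
    return counts


def looks_suspicious(counts: dict) -> bool:
    """Hand-written, illustrative rule in the style of a decision-tree path.

    Flags a file that contains script content AND an automatic trigger.
    A real model would learn such thresholds from labeled data.
    """
    has_script = counts["/JS"] + counts["/JavaScript"] > 0
    auto_runs = counts["/OpenAction"] + counts["/AA"] > 0
    return has_script and auto_runs


# Tiny made-up byte string standing in for a PDF body (not a valid PDF file).
SAMPLE = (
    b"1 0 obj << /Type /Catalog /OpenAction 2 0 R >> endobj\n"
    b"2 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj"
)

if __name__ == "__main__":
    features = count_keywords(SAMPLE)
    print(features)
    print("suspicious:", looks_suspicious(features))
```

In practice the counts would form one row of a feature table fed to a classifier such as Random Forest, and the rules would be read off a trained decision tree instead of being written by hand.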
first_indexed 2024-03-08T09:32:17Z
id doaj.art-c5302ad248bb428d91cc553eafb53eaf
institution Directory Open Access Journal
issn 2169-3536
last_indexed 2025-03-20T16:54:11Z
spelling
Record ID: doaj.art-c5302ad248bb428d91cc553eafb53eaf
Record timestamp: 2024-08-28T23:01:19Z
Language: eng
Publisher: IEEE
Journal: IEEE Access
ISSN: 2169-3536
Publication date: 2024-01-01
Volume: 12
Pages: 13833-13859
DOI: 10.1109/ACCESS.2024.3357620
IEEE Xplore document: 10412055
Title: PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis
Authors and affiliations:
G. M. Sakhawat Hossain (https://orcid.org/0000-0002-9632-0521), Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh
Kaushik Deb (https://orcid.org/0000-0002-7345-0999), Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh
Helge Janicke (https://orcid.org/0000-0002-1345-2829), Cyber Security Cooperative Research Centre, Joondalup, WA, Australia
Iqbal H. Sarker (https://orcid.org/0000-0003-1740-5517), Cyber Security Cooperative Research Centre, Joondalup, WA, Australia
URL: https://ieeexplore.ieee.org/document/10412055/
Keywords: Cybersecurity, PDF malware, data analytics, machine learning, decision rule, explainable AI
topic Cybersecurity
PDF malware
data analytics
machine learning
decision rule
explainable AI