PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis

Bibliographic Details
Main Authors:	G. M. Sakhawat Hossain, Kaushik Deb, Helge Janicke, Iqbal H. Sarker
Format:	Article
Language:	English
Published:	IEEE 2024-01-01
Series:	IEEE Access
Subjects:	Cybersecurity; PDF malware; data analytics; machine learning; decision rule; explainable AI
Online Access:	https://ieeexplore.ieee.org/document/10412055/

author	G. M. Sakhawat Hossain; Kaushik Deb; Helge Janicke; Iqbal H. Sarker
collection	DOAJ
description	The Portable Document Format (PDF) is one of the most widely used file types, so attackers embed harmful code in PDF documents to compromise victims’ equipment. Conventional solutions and identification techniques are often insufficient and may only partially prevent PDF malware because of their versatile character and excessive dependence on a certain typical feature set. The primary goal of this work is to detect PDF malware efficiently in order to alleviate these difficulties. To accomplish this goal, we first develop a comprehensive dataset of 15,958 PDF samples that takes into account the benign, malicious, and evasive behaviors of the samples. Using three well-known PDF analysis tools (PDFiD, PDFINFO, and PDF-PARSER), we extract significant characteristics from the samples in the newly created dataset. In addition, we generate a number of derived features that have been experimentally shown to be helpful in classifying PDF malware. We develop a method to build an efficient and explainable feature set through empirical analysis of the extracted and derived features. We explore different baseline machine learning classifiers and demonstrate an accuracy improvement of approximately 2% for the Random Forest classifier using the selected feature set. Furthermore, we demonstrate the model’s explainability by creating a decision tree that generates rules for human interpretation. Finally, we compare our results with previous studies and point out some important findings.
first_indexed	2024-03-08T09:32:17Z
format	Article
id	doaj.art-c5302ad248bb428d91cc553eafb53eaf
institution	Directory of Open Access Journals
issn	2169-3536
language	English
last_indexed	2025-03-20T16:54:11Z
publishDate	2024-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj.art-c5302ad248bb428d91cc553eafb53eaf | 2024-08-28T23:01:19Z | eng | IEEE | IEEE Access | ISSN 2169-3536 | 2024-01-01 | vol. 12 | pp. 13833-13859 | DOI 10.1109/ACCESS.2024.3357620 | IEEE Xplore document 10412055
	G. M. Sakhawat Hossain (https://orcid.org/0000-0002-9632-0521), Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh
	Kaushik Deb (https://orcid.org/0000-0002-7345-0999), Department of Computer Science and Engineering, Chittagong University of Engineering and Technology, Chattogram, Bangladesh
	Helge Janicke (https://orcid.org/0000-0002-1345-2829), Cyber Security Cooperative Research Centre, Joondalup, WA, Australia
	Iqbal H. Sarker (https://orcid.org/0000-0003-1740-5517), Cyber Security Cooperative Research Centre, Joondalup, WA, Australia
title	PDF Malware Detection: Toward Machine Learning Modeling With Explainability Analysis
topic	Cybersecurity; PDF malware; data analytics; machine learning; decision rule; explainable AI
url	https://ieeexplore.ieee.org/document/10412055/
