Detection of malware in downloaded files using various machine learning models

Malware has become an enormous risk in today’s world. There are different kinds of malware or malicious programs found on the internet. Research shows that malware has grown exponentially over the last decade, causing substantial financial losses to various organizations. Malware is a malicious prog...

Bibliographic Details
Main Authors: Akshit Kamboj, Priyanshu Kumar, Amit Kumar Bairwa, Sandeep Joshi
Format: Article
Language: English
Published: Elsevier 2023-03-01
Series: Egyptian Informatics Journal
Subjects: Cryptography, SHA 256, AES, LSB, Security
Online Access: http://www.sciencedirect.com/science/article/pii/S111086652200072X
author Akshit Kamboj
Priyanshu Kumar
Amit Kumar Bairwa
Sandeep Joshi
author_sort Akshit Kamboj
collection DOAJ
description Malware has become an enormous risk in today’s world. There are different kinds of malware or malicious programs found on the internet. Research shows that malware has grown exponentially over the last decade, causing substantial financial losses to various organizations. Malware is a malicious program or software that proves exceedingly harmful to the user’s computer. The user’s system can be affected in several ways. The proposed solution uses various machine learning techniques to detect whether a file downloaded from the internet contains malware or not. This research aims to use different machine learning algorithms to differentiate between malicious and benign files successfully. The main idea is to study different features of the downloaded file like MD5 hash, size of the Optional Header, and Load Configuration Size. Based on the analysis performed on these features, the files will be classified as malicious or non-malicious. The models are trained on these different features, which enables them to learn how to classify files. After proper training, the models are compared with one another based on various criteria. This comparison is made with the help of the Validation and Test datasets. Finally, the model with the best accuracy will be selected. This process helps in identifying all those types of malware that can have a detrimental impact on the user’s system after getting infected. The approach used here will be able to detect malware like Adware, Trojan, Backdoors, Unknown, Multidrop, Rbot, Spam, and Ransomware. After training and testing various machine learning models, the Random Forest Classifier was found to be the most accurate. Its accuracy went as high as 99.99% on the test dataset. This was closely followed by the XGBoost model with an accuracy of 99.68%. The results of five different models have been compared with those obtained in previous research. These include the Decision Tree Classifier (99.57% accuracy), Random Forest Classifier (99.99% accuracy), Gradient Boosting Model (99.09% accuracy), XGBoost Model (99.68% accuracy), and AdaBoost Model (98.87% accuracy). Four out of five of these models have been found to have accuracies greater than those obtained in previous research works.
first_indexed 2024-04-10T09:25:40Z
format Article
id doaj.art-978c31ba7b994e788c088c012b73c03f
institution Directory of Open Access Journals
issn 1110-8665
language English
last_indexed 2024-04-10T09:25:40Z
publishDate 2023-03-01
publisher Elsevier
record_format Article
series Egyptian Informatics Journal
spelling doaj.art-978c31ba7b994e788c088c012b73c03f 2023-02-20T04:08:53Z eng Elsevier Egyptian Informatics Journal 1110-8665 2023-03-01 24 1 81 94 Detection of malware in downloaded files using various machine learning models
Akshit Kamboj (Manipal University Jaipur, Rajasthan, India)
Priyanshu Kumar (Manipal University Jaipur, Rajasthan, India)
Amit Kumar Bairwa (corresponding author; Manipal University Jaipur, Rajasthan, India)
Sandeep Joshi (Manipal University Jaipur, Rajasthan, India)
http://www.sciencedirect.com/science/article/pii/S111086652200072X
Cryptography; SHA 256; AES; LSB; Security
title Detection of malware in downloaded files using various machine learning models
topic Cryptography
SHA 256
AES
LSB
Security
url http://www.sciencedirect.com/science/article/pii/S111086652200072X