System API Vectorization for Malware Detection

Data is essential to the performance of artificial intelligence (AI) based malware detection models. System APIs, which allocate operating system resources, are important for identifying malicious behaviors. However, few studies have been conducted on data in the malware detection AI model. They ove...


Bibliographic Details
Main Authors: Kyounga Shin, Yunho Lee, Jungho Lim, Honggoo Kang, Sangjin Lee
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: Malware, system API, vectorization, N-gram statistic vector, Word2Vec
Online Access: https://ieeexplore.ieee.org/document/10124968/
author Kyounga Shin
Yunho Lee
Jungho Lim
Honggoo Kang
Sangjin Lee
collection DOAJ
description Data is essential to the performance of artificial intelligence (AI) based malware detection models. System APIs, which allocate operating system resources, are important for identifying malicious behaviors. However, few studies have examined the data used by AI malware detection models. Prior work has overlooked the collection of benign data, which is as important as malware data, and the data characterization of system APIs. As an optimization method for data-driven AI, this paper studies data collection, purification, preprocessing, and vectorization for EXE files and system APIs. The objectivity of the data was ensured by using global data, and a more robust model could be created by collecting benign data from VirusTotal. By analyzing the weight distribution according to the order of system API execution, we found that major malicious behaviors occur at the beginning of execution. We identified the optimal API length and the optimal dimension (number of features). The accuracy of the N-gram model ranged from 97.62 to 95.73, and that of the Word2Vec model ranged from 97.44 to 95.89. In a generalization performance test using data from a source different from that of the training data, we confirmed that N-gram was affected by the quantity of training data and Word2Vec by data similarity. This study systematizes the entire procedure of AI data processing for malware detection and is the first to compare and analyze statistical vectors and word embeddings based on the characteristics of system APIs.
first_indexed 2024-03-13T06:48:31Z
format Article
id doaj.art-824ac3c6ef184d13bf62ca50efc83b1f
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-03-13T06:48:31Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-824ac3c6ef184d13bf62ca50efc83b1f
2023-06-07T23:00:29Z
eng
IEEE
IEEE Access
2169-3536
2023-01-01
volume 11, pages 53788-53805
doi 10.1109/ACCESS.2023.3276902
IEEE Xplore document 10124968
System API Vectorization for Malware Detection
Kyounga Shin, https://orcid.org/0000-0002-3585-102X, School of Cybersecurity, Korea University, Seoul, South Korea
Yunho Lee, https://orcid.org/0000-0001-5247-9146, School of Cybersecurity, Korea University, Seoul, South Korea
Jungho Lim, School of Cybersecurity, Korea University, Seoul, South Korea
Honggoo Kang, School of Cybersecurity, Korea University, Seoul, South Korea
Sangjin Lee, https://orcid.org/0000-0002-6809-5179, School of Cybersecurity, Korea University, Seoul, South Korea
title System API Vectorization for Malware Detection
topic Malware
system API
vectorization
N-gram statistic vector
Word2Vec
url https://ieeexplore.ieee.org/document/10124968/