System API Vectorization for Malware Detection

Bibliographic Details
Main Authors:	Kyounga Shin, Yunho Lee, Jungho Lim, Honggoo Kang, Sangjin Lee
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Malware; system API; vectorization; N-gram statistic vector; Word2Vec
Online Access:	https://ieeexplore.ieee.org/document/10124968/

collection	DOAJ
description	Data is essential to the performance of artificial intelligence (AI)-based malware detection models. System APIs, which allocate operating system resources, are important for identifying malicious behavior. However, few studies have examined the data used by AI malware detection models. Prior work has overlooked the collection of benign data, which is as important as malware data, and the data characteristics of system APIs. As an optimization method for data-driven AI, this paper studies data collection, purification, preprocessing, and vectorization for EXE files and system APIs. The objectivity of the data was ensured by using global data, and a more robust model could be created by collecting benign data from VirusTotal. By analyzing the weight distribution according to the order of system API execution, we found that major malicious behaviors occur at the beginning of execution. We also determined the optimal API length and the optimal dimension (number of features). The accuracy of the N-gram model ranged from 97.62% to 95.73%, and that of the Word2Vec model ranged from 97.44% to 95.89%. In a generalization test using data from a source different from that of the training data, we confirmed that the N-gram model was affected by the quantity of training data and the Word2Vec model by data similarity. This study systematizes the entire AI data-processing procedure for malware detection and is the first to compare and analyze statistical vectors and word embeddings based on the characteristics of system APIs.
first_indexed	2024-03-13T06:48:31Z
id	doaj.art-824ac3c6ef184d13bf62ca50efc83b1f
institution	Directory of Open Access Journals
issn	2169-3536
last_indexed	2024-03-13T06:48:31Z
doi	10.1109/ACCESS.2023.3276902
volume	11
pages	53788-53805
record_timestamp	2023-06-07T23:00:29Z
orcid	Kyounga Shin: https://orcid.org/0000-0002-3585-102X; Yunho Lee: https://orcid.org/0000-0001-5247-9146; Sangjin Lee: https://orcid.org/0000-0002-6809-5179
affiliation	School of Cybersecurity, Korea University, Seoul, South Korea (all authors)
title	System API Vectorization for Malware Detection
