Android malware dataset construction methodology to minimize bias–variance tradeoff

Recently, research on Android malware categorization and detection has been increasingly directed toward proposing learned models based on various features of Android apps and machine learning algorithms. For the implementation of such modeling, properly constructing a dataset is no less importa...

Bibliographic Details
Main Authors: Shinho Lee, Wookhyun Jung, Wonrak Lee, Hyung Geun Oh, Eui Tak Kim
Format: Article
Language: English
Published: Elsevier 2022-09-01
Series: ICT Express
Subjects: Android; Malware; Dataset; Bias; Variance; Underfitting
Online Access: http://www.sciencedirect.com/science/article/pii/S2405959521001351
author Shinho Lee
Wookhyun Jung
Wonrak Lee
Hyung Geun Oh
Eui Tak Kim
collection DOAJ
description Recently, research on Android malware categorization and detection has been increasingly directed toward proposing learned models based on various features of Android apps and machine learning algorithms. When implementing such models, properly constructing a dataset is no less important than selecting a suitable algorithm. The present study examines dataset construction using Dexofuzzy and proposes methods to determine the degree of bias and variance in the process and to minimize noise in sample labeling, where even identical samples may be labeled differently. The proposed method goes beyond existing dataset construction methods that rely on label data provided by antivirus vendors: it includes an effective approach to constructing new types of datasets built on unified labels combined with opcode morphology. From the newly constructed datasets, a flexible dataset, which allows overfitting and underfitting to be taken into account, was obtained via N-Gram and M-Partial Matching. This flexible dataset was then clustered, and the resulting clustering performance was evaluated.
first_indexed 2024-04-11T21:27:00Z
format Article
id doaj.art-9ebb684fcf5640ba8001d9749616204b
institution Directory Open Access Journal
issn 2405-9595
language English
last_indexed 2024-04-11T21:27:00Z
publishDate 2022-09-01
publisher Elsevier
record_format Article
series ICT Express
spelling doaj.art-9ebb684fcf5640ba8001d9749616204b 2022-12-22T04:02:21Z eng Elsevier ICT Express 2405-9595 2022-09-01 8 3 444 462
Android malware dataset construction methodology to minimize bias–variance tradeoff
Shinho Lee (Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea)
Wookhyun Jung (Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea)
Wonrak Lee (Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea)
Hyung Geun Oh (National Security Research Institute, Daejeon, Republic of Korea)
Eui Tak Kim (Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea; corresponding author)
title Android malware dataset construction methodology to minimize bias–variance tradeoff
topic Android
Malware
Dataset
Bias
Variance
Underfitting
url http://www.sciencedirect.com/science/article/pii/S2405959521001351
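
The description above mentions obtaining a flexible dataset via N-Gram and M-Partial Matching. The record does not say how the paper defines either step, so the following is only a minimal sketch under assumptions: each sample is represented by a plain signature string (toy strings here, not real Dexofuzzy output), "N-Gram" is read as contiguous character n-grams of that string, and "M-Partial Matching" is read as "two samples match when they share at least m distinct n-grams". The function names `ngrams`, `partial_match`, and `group_samples` are hypothetical and do not come from the paper or from any library.

```python
from typing import Dict, List, Set


def ngrams(signature: str, n: int) -> Set[str]:
    """Return the set of contiguous character n-grams of a signature string."""
    if n <= 0:
        raise ValueError("n must be positive")
    return {signature[i:i + n] for i in range(len(signature) - n + 1)}


def partial_match(a: str, b: str, n: int, m: int) -> bool:
    """Assumed matching rule: True when a and b share at least m distinct n-grams."""
    return len(ngrams(a, n) & ngrams(b, n)) >= m


def group_samples(signatures: List[str], n: int, m: int) -> List[List[int]]:
    """Group sample indices whose signatures are linked by partial_match.

    Uses a small union-find so that matches are transitive (single-link grouping).
    """
    parent = list(range(len(signatures)))

    def find(x: int) -> int:
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(signatures)):
        for j in range(i + 1, len(signatures)):
            if partial_match(signatures[i], signatures[j], n, m):
                parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(signatures)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    toy = ["abcdefgh", "abcdefxx", "zzzzzzzz"]  # toy signatures, not real hashes
    print(group_samples(toy, n=4, m=2))  # looser threshold
    print(group_samples(toy, n=4, m=4))  # stricter threshold
```

In this sketch, raising `m` (or `n`) makes matching stricter, so groups become smaller and more numerous, while lowering it merges more samples together. That is one way to picture the "flexible" trade-off between overfitting and underfitting that the description refers to, though the paper's actual procedure may differ.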