Android malware dataset construction methodology to minimize bias–variance tradeoff

Research on Android malware categorization and detection is increasingly directed toward learned models built on various features of Android apps and machine learning algorithms. For such modeling, properly constructing a dataset is no less importa...


Bibliographic Details
Main Authors:	Shinho Lee, Wookhyun Jung, Wonrak Lee, Hyung Geun Oh, Eui Tak Kim
Format:	Article
Language:	English
Published:	Elsevier 2022-09-01
Series:	ICT Express
Subjects:	Android; Malware; Dataset; Bias; Variance; Underfitting
Online Access:	http://www.sciencedirect.com/science/article/pii/S2405959521001351

collection	DOAJ
description	Research on Android malware categorization and detection is increasingly directed toward learned models built on various features of Android apps and machine learning algorithms. For such modeling, properly constructing a dataset is no less important than selecting a suitable algorithm. The present study examines dataset construction using Dexofuzzy and proposes methods to determine the degree of bias and variance in the process and to minimize noise in sample labeling, where even identical samples may be labeled differently. The proposed method goes beyond existing dataset construction methods, which rely on label data provided by antivirus vendors, and offers an effective approach to constructing new types of datasets built on unified labels combined with opcode morphology. From the newly constructed datasets, a flexible dataset that allows overfitting and underfitting to be considered was obtained via N-Gram and M-Partial Matching. This flexible dataset was then subjected to clustering, and the resulting clustering performance was evaluated.
id	doaj.art-9ebb684fcf5640ba8001d9749616204b
institution	Directory Open Access Journal
issn	2405-9595
spelling	doaj.art-9ebb684fcf5640ba8001d9749616204b; 2022-12-22T04:02:21Z; eng; Elsevier; ICT Express; ISSN 2405-9595; 2022-09-01; vol. 8, no. 3, pp. 444–462; Android malware dataset construction methodology to minimize bias–variance tradeoff; Shinho Lee (Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea); Wookhyun Jung (Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea); Wonrak Lee (Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea); Hyung Geun Oh (National Security Research Institute, Daejeon, Republic of Korea); Eui Tak Kim (Data Intelligence Lab, ESTsecurity, Seoul, Republic of Korea; corresponding author); http://www.sciencedirect.com/science/article/pii/S2405959521001351; Android; Malware; Dataset; Bias; Variance; Underfitting


Similar Items

Android malware dataset construction methodology to minimize bias–variance tradeoff