Android malware detection with MH-100K: An innovative dataset for advanced research

High-quality datasets are crucial for building realistic and high-performance supervised malware detection models. Currently, one of the major challenges of machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and pr...

Bibliographic Details
Main Authors: Hendrio Bragança, Vanderson Rocha, Lucas Barcellos, Eduardo Souto, Diego Kreutz, Eduardo Feitosa
Format: Article
Language: English
Published: Elsevier 2023-12-01
Series: Data in Brief
Subjects: Android Malware, Android security, Malware detection, Machine learning
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340923008193
author Hendrio Bragança
Vanderson Rocha
Lucas Barcellos
Eduardo Souto
Diego Kreutz
Eduardo Feitosa
collection DOAJ
description High-quality datasets are crucial for building realistic, high-performance supervised malware detection models. Currently, one of the major challenges for machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and provide up-to-date, public data for comprehensive evaluation and comparison of existing classifiers, we introduce the MH-100K dataset [1], an extensive collection of Android malware information comprising 101,975 samples. It includes a main CSV file with valuable metadata, including the SHA256 hash (the APK's signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. Moreover, the MH-100K dataset features an extensive collection of files containing useful metadata from the VirusTotal analysis. This repository of information can serve future research by enabling the analysis of antivirus scan result patterns to discern the prevalence and behaviour of various malware families. Such analysis can help to extend existing malware taxonomies, identify novel variants, and explore malware evolution over time.
first_indexed 2024-03-09T09:21:21Z
format Article
id doaj.art-9add417b5abf4a979446ae75022f5fcf
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2024-03-09T09:21:21Z
publishDate 2023-12-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-9add417b5abf4a979446ae75022f5fcf
2023-12-02T07:00:19Z
eng
Elsevier
Data in Brief
2352-3409
2023-12-01
51
109750
Android malware detection with MH-100K: An innovative dataset for advanced research
Hendrio Bragança (Institute of Computing, Federal University of Amazonas, Amazonas, Brazil; corresponding author)
Vanderson Rocha (Institute of Computing, Federal University of Amazonas, Amazonas, Brazil)
Lucas Barcellos (Federal University of Pampa, Rio Grande do Sul, Brazil)
Eduardo Souto (Institute of Computing, Federal University of Amazonas, Amazonas, Brazil)
Diego Kreutz (Federal University of Pampa, Rio Grande do Sul, Brazil)
Eduardo Feitosa (Institute of Computing, Federal University of Amazonas, Amazonas, Brazil)
http://www.sciencedirect.com/science/article/pii/S2352340923008193
Android Malware
Android security
Malware detection
Machine learning
title Android malware detection with MH-100K: An innovative dataset for advanced research
topic Android Malware
Android security
Malware detection
Machine learning
url http://www.sciencedirect.com/science/article/pii/S2352340923008193