Android malware detection with MH-100K: An innovative dataset for advanced research
High-quality datasets are crucial for building realistic and high-performance supervised malware detection models. Currently, one of the major challenges of machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and pr...
Main Authors: | Hendrio Bragança, Vanderson Rocha, Lucas Barcellos, Eduardo Souto, Diego Kreutz, Eduardo Feitosa |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2023-12-01 |
Series: | Data in Brief |
Subjects: | Android Malware, Android security, Malware detection, Machine learning |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340923008193 |
_version_ | 1797429997721354240 |
---|---|
author | Hendrio Bragança Vanderson Rocha Lucas Barcellos Eduardo Souto Diego Kreutz Eduardo Feitosa |
author_sort | Hendrio Bragança |
collection | DOAJ |
description | High-quality datasets are crucial for building realistic, high-performance supervised malware detection models. One of the major challenges currently facing machine learning-based solutions is the scarcity of datasets that are both representative and of high quality. To foster future research and provide up-to-date, public data for comprehensive evaluation and comparison of existing classifiers, we introduce the MH-100K dataset [1], an extensive collection of Android malware information comprising 101,975 samples. It includes a main CSV file with valuable metadata, including the SHA256 hash (the APK's signature), file name, package name, Android's official compilation API, 166 permissions, 24,417 API calls, and 250 intents. The MH-100K dataset also features an extensive collection of files containing useful metadata from the VirusTotal analysis. This repository of information can serve future research by enabling the analysis of antivirus scan result patterns to discern the prevalence and behaviour of various malware families. Such analysis can help extend existing malware taxonomies, identify novel variants, and explore malware evolution over time. |
first_indexed | 2024-03-09T09:21:21Z |
format | Article |
id | doaj.art-9add417b5abf4a979446ae75022f5fcf |
institution | Directory Open Access Journal |
issn | 2352-3409 |
language | English |
last_indexed | 2024-03-09T09:21:21Z |
publishDate | 2023-12-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj.art-9add417b5abf4a979446ae75022f5fcf; 2023-12-02T07:00:19Z; eng; Elsevier; Data in Brief; 2352-3409; 2023-12-01; 51; 109750; Android malware detection with MH-100K: An innovative dataset for advanced research; Hendrio Bragança (0), Vanderson Rocha (1), Lucas Barcellos (2), Eduardo Souto (3), Diego Kreutz (4), Eduardo Feitosa (5); (0) Institute of Computing, Federal University of Amazonas, Amazonas, Brazil, corresponding author; (1) Institute of Computing, Federal University of Amazonas, Amazonas, Brazil; (2) Federal University of Pampa, Rio Grande do Sul, Brazil; (3) Institute of Computing, Federal University of Amazonas, Amazonas, Brazil; (4) Federal University of Pampa, Rio Grande do Sul, Brazil; (5) Institute of Computing, Federal University of Amazonas, Amazonas, Brazil; http://www.sciencedirect.com/science/article/pii/S2352340923008193; Android Malware; Android security; Malware detection; Machine learning |
title | Android malware detection with MH-100K: An innovative dataset for advanced research |
topic | Android Malware Android security Malware detection Machine learning |
url | http://www.sciencedirect.com/science/article/pii/S2352340923008193 |
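
The description above says the main CSV file combines per-sample metadata with three feature groups (permissions, API calls, and intents). The following is a minimal sketch of how such a file might be split into those groups. The column-name prefixes (`Permission::`, `APICall::`, `Intent::`), the metadata column names, and the two sample rows are all assumptions made for illustration; the record does not specify the actual CSV schema, so check the real header before relying on any of them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column-name prefixes; the record does not specify the CSV schema.
FEATURE_PREFIXES = {
    "permission": "Permission::",
    "api_call": "APICall::",
    "intent": "Intent::",
}


def group_feature_columns(header):
    """Map each feature group to the column names that carry its prefix.

    Columns that match no prefix are collected under "metadata".
    """
    groups = defaultdict(list)
    for name in header:
        for group, prefix in FEATURE_PREFIXES.items():
            if name.startswith(prefix):
                groups[group].append(name)
                break
        else:
            groups["metadata"].append(name)
    return dict(groups)


def count_active_features(row, columns):
    """Count how many of the given binary feature columns equal "1" in one row."""
    return sum(1 for col in columns if row[col] == "1")


# Tiny synthetic stand-in for the main CSV file (made-up values, not real data).
SAMPLE_CSV = """SHA256,NAME,PACKAGE,API_MIN,Permission::INTERNET,Permission::SEND_SMS,APICall::getDeviceId,Intent::BOOT_COMPLETED
aaaa,app_one.apk,com.example.one,21,1,0,1,0
bbbb,app_two.apk,com.example.two,23,1,1,1,1
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
groups = group_feature_columns(reader.fieldnames)
rows = list(reader)

for row in rows:
    counts = {
        group: count_active_features(row, cols)
        for group, cols in groups.items()
        if group != "metadata"
    }
    print(row["SHA256"], counts)
```

If the real file followed this naming convention, the three groups would be expected to contain 166, 24,417, and 250 columns respectively, which gives a quick sanity check after loading.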