UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection
In computer security, botnets still represent a significant cyber threat. Concealment techniques such as dynamic addressing and domain generation algorithms (DGAs) call for a more effective detection process. To this end, this data descriptor presents a collection of over 30 m...
Main Authors: | Mattia Zago, Manuel Gil Pérez, Gregorio Martínez Pérez |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2020-06-01 |
Series: | Data in Brief |
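Subjects: | Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security |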
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340920302948 |
_version_ | 1818068390811533312 |
---|---|
author | Mattia Zago; Manuel Gil Pérez; Gregorio Martínez Pérez |
author_facet | Mattia Zago; Manuel Gil Pérez; Gregorio Martínez Pérez |
author_sort | Mattia Zago |
collection | DOAJ |
description | In computer security, botnets still represent a significant cyber threat. Concealment techniques such as dynamic addressing and domain generation algorithms (DGAs) call for a more effective detection process. To this end, this data descriptor presents a collection of over 30 million manually labeled, algorithmically generated domain names, each decorated with a feature set that is ready to use for machine learning (ML) analysis. The dataset has been co-submitted with the research article "UMUDGA: a dataset for profiling DGA-based botnet" [1]. It aims to let researchers move past the data collection, organization, and pre-processing phases so that they can focus on the analysis and production of ML-powered solutions for network intrusion detection. To be as exhaustive as possible, we selected 50 of the most notorious malware variants. Each family is available both as a list of domains (generated by executing the malware DGAs in a controlled environment with fixed parameters) and as a collection of features (generated by extracting a combination of statistical and natural language processing metrics). |
first_indexed | 2024-12-10T15:38:49Z |
format | Article |
id | doaj.art-4d883dfcec8d44dd9deec8d8fb7b723e |
institution | Directory Open Access Journal |
issn | 2352-3409 |
language | English |
last_indexed | 2024-12-10T15:38:49Z |
publishDate | 2020-06-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj.art-4d883dfcec8d44dd9deec8d8fb7b723e; 2022-12-22T01:43:10Z; eng; Elsevier; Data in Brief; 2352-3409; 2020-06-01; 30; 105400; UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection; Mattia Zago (0), Manuel Gil Pérez (1), Gregorio Martínez Pérez (2); (0) Corresponding author; Department of Information Engineering and Communications, University of Murcia, Campus Espinardo Murcia 30100 Spain; (1) Department of Information Engineering and Communications, University of Murcia, Campus Espinardo Murcia 30100 Spain; (2) Department of Information Engineering and Communications, University of Murcia, Campus Espinardo Murcia 30100 Spain; In computer security, botnets still represent a significant cyber threat. Concealment techniques such as dynamic addressing and domain generation algorithms (DGAs) call for a more effective detection process. To this end, this data descriptor presents a collection of over 30 million manually labeled, algorithmically generated domain names, each decorated with a feature set that is ready to use for machine learning (ML) analysis. The dataset has been co-submitted with the research article "UMUDGA: a dataset for profiling DGA-based botnet" [1]. It aims to let researchers move past the data collection, organization, and pre-processing phases so that they can focus on the analysis and production of ML-powered solutions for network intrusion detection. To be as exhaustive as possible, we selected 50 of the most notorious malware variants. Each family is available both as a list of domains (generated by executing the malware DGAs in a controlled environment with fixed parameters) and as a collection of features (generated by extracting a combination of statistical and natural language processing metrics).; http://www.sciencedirect.com/science/article/pii/S2352340920302948; Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security |
spellingShingle | Mattia Zago; Manuel Gil Pérez; Gregorio Martínez Pérez; UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection; Data in Brief; Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security |
title | UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection |
topic | Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security |
url | http://www.sciencedirect.com/science/article/pii/S2352340920302948 |