UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection

In computer security, botnets still represent a significant cyber threat. Concealment techniques such as dynamic addressing and domain generation algorithms (DGAs) call for a more effective detection process. To this end, this data descriptor presents a collection of over 30 m...

Bibliographic Details
Main Authors: Mattia Zago, Manuel Gil Pérez, Gregorio Martínez Pérez
Format: Article
Language: English
Published: Elsevier 2020-06-01
Series: Data in Brief
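Subjects: Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security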
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340920302948
author Mattia Zago
Manuel Gil Pérez
Gregorio Martínez Pérez
collection DOAJ
description In computer security, botnets still represent a significant cyber threat. Concealment techniques such as dynamic addressing and domain generation algorithms (DGAs) call for a more effective detection process. To this end, this data descriptor presents a collection of over 30 million manually labeled, algorithmically generated domain names, annotated with a ready-to-use feature set for machine learning (ML) analysis. The dataset has been co-submitted with the research article “UMUDGA: a dataset for profiling DGA-based botnet” [1]. It aims to let researchers move past the data collection, organization, and pre-processing phases so that they can focus on the analysis and on producing ML-powered solutions for network intrusion detection. To be as exhaustive as possible, we selected 50 of the most notorious malware variants. Each family is available both as a list of domains (generated by executing the malware DGAs in a controlled environment with fixed parameters) and as a collection of features (generated by extracting a combination of statistical and natural language processing metrics).
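The description states that each domain comes with statistical and natural language processing features. As a rough illustration only, the sketch below computes a handful of lexical metrics of that general kind in Python. The function name `domain_features`, the choice of metrics, and the decision to analyse only the left-most label are assumptions made for this example; they are not taken from the dataset's actual feature set.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def domain_features(fqdn: str) -> dict:
    """Compute a few illustrative lexical features for one domain name.

    Hypothetical helper: the metrics below are examples, not the
    feature set shipped with the dataset.
    """
    # Assumption: only the left-most label is analysed.
    label = fqdn.lower().rstrip(".").split(".")[0]
    n = len(label)
    counts = Counter(label)
    # Shannon entropy (bits per character) of the character distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    return {
        "length": n,
        "entropy": entropy,
        "vowel_ratio": sum(ch in VOWELS for ch in label) / n if n else 0.0,
        "digit_ratio": sum(ch.isdigit() for ch in label) / n if n else 0.0,
        # Number of distinct character bigrams in the label.
        "distinct_bigrams": len({label[i:i + 2] for i in range(n - 1)}),
    }


if __name__ == "__main__":
    print(domain_features("example.com"))
```

Feature dictionaries of this shape can be stacked into a table with one row per domain, which is the usual input format for ML classifiers.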
first_indexed 2024-12-10T15:38:49Z
format Article
id doaj.art-4d883dfcec8d44dd9deec8d8fb7b723e
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2024-12-10T15:38:49Z
publishDate 2020-06-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-4d883dfcec8d44dd9deec8d8fb7b723e
2022-12-22T01:43:10Z
eng
Elsevier
Data in Brief
2352-3409
2020-06-01
Volume 30, article 105400
UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection
Mattia Zago (corresponding author), Department of Information Engineering and Communications, University of Murcia, Campus Espinardo, Murcia 30100, Spain
Manuel Gil Pérez, Department of Information Engineering and Communications, University of Murcia, Campus Espinardo, Murcia 30100, Spain
Gregorio Martínez Pérez, Department of Information Engineering and Communications, University of Murcia, Campus Espinardo, Murcia 30100, Spain
http://www.sciencedirect.com/science/article/pii/S2352340920302948
Domain Generation Algorithm (DGA)
Natural Language Processing (NLP)
Machine learning
Data
Network security
title UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection
topic Domain Generation Algorithm (DGA)
Natural Language Processing (NLP)
Machine learning
Data
Network security
url http://www.sciencedirect.com/science/article/pii/S2352340920302948