UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection

In computer security, botnets still represent a significant cyber threat. Concealment techniques such as dynamic addressing and domain generation algorithms (DGAs) call for a more effective detection process. To this end, this data descriptor presents a collection of over 30 m...

Bibliographic Details
Main Authors:	Mattia Zago, Manuel Gil Pérez, Gregorio Martínez Pérez
Format:	Article
Language:	English
Published:	Elsevier 2020-06-01
Series:	Data in Brief
Subjects:	Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340920302948

_version_	1818068390811533312
author	Mattia Zago, Manuel Gil Pérez, Gregorio Martínez Pérez
author_facet	Mattia Zago, Manuel Gil Pérez, Gregorio Martínez Pérez
author_sort	Mattia Zago
collection	DOAJ
description	In computer security, botnets still represent a significant cyber threat. Concealment techniques such as dynamic addressing and domain generation algorithms (DGAs) call for a more effective detection process. To this end, this data descriptor presents a collection of over 30 million manually labeled algorithmically generated domain names, accompanied by a feature set ready to use for machine learning (ML) analysis. The dataset has been co-submitted with the research article “UMUDGA: a dataset for profiling DGA-based botnet” [1], and it aims to let researchers move past the data collection, organization, and pre-processing phases so that they can focus on the analysis and the production of ML-powered solutions for network intrusion detection. To be as exhaustive as possible, we selected 50 of the most notorious malware variants. Each family is available both as a list of domains (generated by executing the malware DGAs in a controlled environment with fixed parameters) and as a collection of features (generated by extracting a combination of statistical and natural language processing metrics).
first_indexed	2024-12-10T15:38:49Z
format	Article
id	doaj.art-4d883dfcec8d44dd9deec8d8fb7b723e
institution	Directory of Open Access Journals
issn	2352-3409
language	English
last_indexed	2024-12-10T15:38:49Z
publishDate	2020-06-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-4d883dfcec8d44dd9deec8d8fb7b723e; 2022-12-22T01:43:10Z; eng; Elsevier; Data in Brief; 2352-3409; 2020-06-01; 30; 105400; UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection; Mattia Zago (corresponding author), Manuel Gil Pérez, Gregorio Martínez Pérez; all authors: Department of Information Engineering and Communications, University of Murcia, Campus Espinardo, Murcia 30100, Spain; http://www.sciencedirect.com/science/article/pii/S2352340920302948; Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security
spellingShingle	Mattia Zago; Manuel Gil Pérez; Gregorio Martínez Pérez; UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection; Data in Brief; Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security
title	UMUDGA: A dataset for profiling algorithmically generated domain names in botnet detection
title_sort	umudga a dataset for profiling algorithmically generated domain names in botnet detection
topic	Domain Generation Algorithm (DGA); Natural Language Processing (NLP); Machine learning; Data; Network security
url	http://www.sciencedirect.com/science/article/pii/S2352340920302948
work_keys_str_mv	AT mattiazago umudgaadatasetforprofilingalgorithmicallygenerateddomainnamesinbotnetdetection AT manuelgilperez umudgaadatasetforprofilingalgorithmicallygenerateddomainnamesinbotnetdetection AT gregoriomartinezperez umudgaadatasetforprofilingalgorithmicallygenerateddomainnamesinbotnetdetection

Similar Items