SELID: Selective Event Labeling for Intrusion Detection Datasets

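A large volume of security events, generally collected by distributed monitoring sensors, overwhelms human analysts at security operations centers and raises an alert fatigue problem. Machine learning is expected to mitigate this problem by automatically distinguishing between true alerts, or attacks, and falsely reported ones. Machine learning models should first be trained on datasets having correct labels, but the labeling process itself requires considerable human resources. In this paper, we present a new selective sampling scheme for efficient data labeling via unsupervised clustering. The new scheme transforms the byte sequence of an event into a fixed-size vector through content-defined chunking and feature hashing. Then, a clustering algorithm is applied to the vectors, and only a few samples from each cluster are selected for manual labeling. The experimental results demonstrate that the new scheme can select only 2% of the data for labeling without degrading the F1-score of the machine learning model. Two datasets, a private dataset from a real security operations center and a public dataset from the Internet for experimental reproducibility, are used.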
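
The pipeline named in the abstract (content-defined chunking, feature hashing, clustering, then picking a few samples per cluster) can be illustrated with a minimal Python sketch. This is not the authors' implementation: this record does not give the paper's chunking parameters, hash functions, or clustering algorithm, so the window size, boundary mask, vector dimension, and the greedy cosine-similarity "leader" clustering used here are all assumptions, and every function name is hypothetical.

```python
import hashlib
import math
import random
from collections import defaultdict


def cdc_chunks(data: bytes, window: int = 4, mask: int = 0x07, min_size: int = 4) -> list[bytes]:
    """Split bytes at content-defined boundaries (assumed parameters).

    A boundary is placed after byte i when the hash of the trailing
    `window` bytes has its low bits (selected by `mask`) all zero.
    The window hash is recomputed each step for clarity; a real
    implementation would normally use a rolling hash instead.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        if i + 1 - start < min_size:
            continue
        win = data[max(0, i + 1 - window): i + 1]
        h = int.from_bytes(hashlib.blake2b(win, digest_size=4).digest(), "big")
        if h & mask == 0:
            chunks.append(data[start: i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def feature_hash(chunks: list[bytes], dim: int = 64) -> list[float]:
    """Map a variable number of chunks to a fixed-size vector (signed hashing trick)."""
    vec = [0.0] * dim
    for chunk in chunks:
        h = int.from_bytes(hashlib.blake2b(chunk, digest_size=8).digest(), "big")
        sign = 1.0 if (h >> 63) == 0 else -1.0
        vec[h % dim] += sign
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def leader_cluster(vectors: list[list[float]], threshold: float = 0.8) -> list[int]:
    """Greedy stand-in for the paper's clustering step (an assumption).

    Each vector joins the first cluster whose leader is similar enough,
    otherwise it becomes the leader of a new cluster.
    """
    leaders, labels = [], []
    for v in vectors:
        for cid, lead in enumerate(leaders):
            if cosine(v, lead) >= threshold:
                labels.append(cid)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels


def select_for_labeling(labels: list[int], per_cluster: int = 1, seed: int = 0) -> list[int]:
    """Pick at most `per_cluster` event indices from every cluster."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, cid in enumerate(labels):
        members[cid].append(idx)
    picked = []
    for idxs in members.values():
        picked.extend(rng.sample(idxs, min(per_cluster, len(idxs))))
    return sorted(picked)


if __name__ == "__main__":
    # Made-up event payloads, for illustration only.
    events = [
        b"GET /index.php?id=1 HTTP/1.1",
        b"GET /index.php?id=2 HTTP/1.1",
        b"POST /login.php user=admin' OR '1'='1",
    ]
    vectors = [feature_hash(cdc_chunks(e)) for e in events]
    labels = leader_cluster(vectors)
    print("send to analyst:", select_for_labeling(labels))
```

Only the indices returned by `select_for_labeling` would go to a human analyst. How those labels are then used to train the detection model is not described in this record, so the sketch stops at selection.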

Bibliographic Details
Main Authors:	Woohyuk Jang, Hyunmin Kim, Hyungbin Seo, Minsong Kim, Myungkeun Yoon
Format:	Article
Language:	English
Published:	MDPI AG 2023-07-01
Series:	Sensors
Subjects:	security operations center; intrusion detection; unsupervised learning; alert fatigue; cyber security
Online Access:	https://www.mdpi.com/1424-8220/23/13/6105

author	Woohyuk Jang, Hyunmin Kim, Hyungbin Seo, Minsong Kim, Myungkeun Yoon
collection	DOAJ
description	A large volume of security events, generally collected by distributed monitoring sensors, overwhelms human analysts at security operations centers and raises an alert fatigue problem. Machine learning is expected to mitigate this problem by automatically distinguishing between true alerts, or attacks, and falsely reported ones. Machine learning models should first be trained on datasets having correct labels, but the labeling process itself requires considerable human resources. In this paper, we present a new selective sampling scheme for efficient data labeling via unsupervised clustering. The new scheme transforms the byte sequence of an event into a fixed-size vector through content-defined chunking and feature hashing. Then, a clustering algorithm is applied to the vectors, and only a few samples from each cluster are selected for manual labeling. The experimental results demonstrate that the new scheme can select only 2% of the data for labeling without degrading the F1-score of the machine learning model. Two datasets, a private dataset from a real security operations center and a public dataset from the Internet for experimental reproducibility, are used.
first_indexed	2024-03-09T10:55:53Z
format	Article
id	doaj.art-03af722752334d49938a3ff90183e29f
institution	Directory of Open Access Journals
issn	1424-8220
language	English
last_indexed	2024-03-09T10:55:53Z
publishDate	2023-07-01
publisher	MDPI AG
record_format	Article
series	Sensors
spelling	doaj.art-03af722752334d49938a3ff90183e29f; 2023-12-01T01:37:20Z; eng; MDPI AG; Sensors; ISSN 1424-8220; 2023-07-01; vol. 23, no. 13, art. 6105; doi:10.3390/s23136105; SELID: Selective Event Labeling for Intrusion Detection Datasets; Woohyuk Jang, Hyunmin Kim, Hyungbin Seo, Minsong Kim, Myungkeun Yoon (all: Department of Computer Science, Kookmin University, 77, Jeongneung-ro, Seongbuk-gu, Seoul 02707, Republic of Korea); https://www.mdpi.com/1424-8220/23/13/6105
title	SELID: Selective Event Labeling for Intrusion Detection Datasets
topic	security operations center; intrusion detection; unsupervised learning; alert fatigue; cyber security
url	https://www.mdpi.com/1424-8220/23/13/6105

Similar Items