SELID: Selective Event Labeling for Intrusion Detection Datasets

Bibliographic Details
Main Authors: Woohyuk Jang, Hyunmin Kim, Hyungbin Seo, Minsong Kim, Myungkeun Yoon
Format: Article
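Language: English
Published: MDPI AG 2023-07-01
Series: Sensors
Subjects: security operations center, intrusion detection, unsupervised learning, alert fatigue, cyber security
Online Access: https://www.mdpi.com/1424-8220/23/13/6105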
collection DOAJ
description A large volume of security events, generally collected by distributed monitoring sensors, overwhelms human analysts at security operations centers and raises an alert fatigue problem. Machine learning is expected to mitigate this problem by automatically distinguishing between true alerts, or attacks, and falsely reported ones. Machine learning models should first be trained on datasets having correct labels, but the labeling process itself requires considerable human resources. In this paper, we present a new selective sampling scheme for efficient data labeling via unsupervised clustering. The new scheme transforms the byte sequence of an event into a fixed-size vector through content-defined chunking and feature hashing. Then, a clustering algorithm is applied to the vectors, and only a few samples from each cluster are selected for manual labeling. The experimental results demonstrate that the new scheme can select only 2% of the data for labeling without degrading the F1-score of the machine learning model. Two datasets, a private dataset from a real security operations center and a public dataset from the Internet for experimental reproducibility, are used.
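The pipeline summarized in the description (content-defined chunking, feature hashing, clustering, then picking a few samples per cluster) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the hash functions, the clustering algorithm, or any parameters, so the CRC32-based boundary test, the greedy cosine-similarity clustering, and all function names and defaults below (`cdc_chunks`, `hash_vector`, `cluster`, `select_for_labeling`, `window`, `mask`, `dim`, `threshold`) are assumptions chosen only to keep the example self-contained.

```python
import zlib
from math import sqrt


def cdc_chunks(data: bytes, window: int = 4, mask: int = 0x07,
               max_len: int = 64) -> list[bytes]:
    """Content-defined chunking: cut wherever a hash of the last `window`
    bytes matches a bit pattern, so boundaries depend on local content
    rather than on absolute offsets. (Assumed scheme, not from the paper.)"""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        at_boundary = (zlib.crc32(data[i - window:i]) & mask) == 0
        if at_boundary or i - start >= max_len:
            chunks.append(data[start:i])
            start = i
    if start < len(data):          # trailing bytes after the last boundary
        chunks.append(data[start:])
    return chunks


def hash_vector(chunks: list[bytes], dim: int = 64) -> list[float]:
    """Feature hashing: each chunk increments the bucket chosen by its hash,
    giving a fixed-size vector regardless of the event length."""
    vec = [0.0] * dim
    for chunk in chunks:
        vec[zlib.crc32(chunk) % dim] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster(vectors: list[list[float]], threshold: float = 0.8) -> list[list[int]]:
    """Greedy 'leader' clustering (an assumed stand-in for the paper's
    clustering step): join the first cluster whose leader is similar
    enough, otherwise start a new cluster."""
    leaders, clusters = [], []
    for idx, vec in enumerate(vectors):
        for k, leader in enumerate(leaders):
            if cosine(vec, leader) >= threshold:
                clusters[k].append(idx)
                break
        else:
            leaders.append(vec)
            clusters.append([idx])
    return clusters


def select_for_labeling(events: list[bytes], per_cluster: int = 1,
                        dim: int = 64, threshold: float = 0.8) -> list[int]:
    """Return indices of the few events per cluster to hand to an analyst."""
    vectors = [hash_vector(cdc_chunks(e), dim) for e in events]
    return [i for members in cluster(vectors, threshold)
            for i in members[:per_cluster]]


if __name__ == "__main__":
    # Hypothetical payloads, used only to exercise the sketch.
    events = [
        b"GET /index.php?id=1 UNION SELECT password FROM users HTTP/1.1",
        b"GET /index.php?id=1 UNION SELECT password FROM users HTTP/1.1",
        b"POST /login HTTP/1.1\r\nHost: example.test\r\n\r\nuser=a&pw=b",
    ]
    print(select_for_labeling(events))
```

In this sketch the labeling budget is controlled by `per_cluster` and, indirectly, by `threshold`: a stricter threshold yields more clusters and therefore more samples to label. The 2% figure reported in the description comes from the authors' experiments and is not something this toy example reproduces.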
first_indexed 2024-03-09T10:55:53Z
id doaj.art-03af722752334d49938a3ff90183e29f
institution Directory Open Access Journal
issn 1424-8220
last_indexed 2024-03-09T10:55:53Z
doi 10.3390/s23136105
volume 23
issue 13
article_number 6105
affiliation Department of Computer Science, Kookmin University, 77, Jeongneung-ro, Seongbuk-gu, Seoul 02707, Republic of Korea (all five authors)
timestamp 2023-12-01T01:37:20Z
title SELID: Selective Event Labeling for Intrusion Detection Datasets
topic security operations center
intrusion detection
unsupervised learning
alert fatigue
cyber security
url https://www.mdpi.com/1424-8220/23/13/6105