Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection

As a result of the explosion of security attacks and the complexity of modern networks, machine learning (ML) has recently become the favored approach for intrusion detection systems (IDS). However, the ML approach usually faces three challenges: massive attack variants, imbalanced data issues, and...

Bibliographic Details
Main Authors: Ying-Dar Lin, Zi-Qiang Liu, Ren-Hung Hwang, Van-Linh Nguyen, Po-Ching Lin, Yuan-Cheng Lai
Format: Article
Language: English
Published: IEEE 2022-01-01
Series: IEEE Access
Subjects: Imbalanced dataset, machine learning, variational autoencoder, intrusion detection
Online Access: https://ieeexplore.ieee.org/document/9705580/
_version_ 1818323719090601984
author Ying-Dar Lin
Zi-Qiang Liu
Ren-Hung Hwang
Van-Linh Nguyen
Po-Ching Lin
Yuan-Cheng Lai
author_facet Ying-Dar Lin
Zi-Qiang Liu
Ren-Hung Hwang
Van-Linh Nguyen
Po-Ching Lin
Yuan-Cheng Lai
author_sort Ying-Dar Lin
collection DOAJ
description As a result of the explosion of security attacks and the complexity of modern networks, machine learning (ML) has recently become the favored approach for intrusion detection systems (IDS). However, the ML approach usually faces three challenges: massive attack variants, imbalanced data, and appropriate data segmentation. Improper handling of these issues significantly degrades ML performance, e.g., resulting in high false-negative rates and low recall. Although many efforts have been made in the literature, detecting security attacks in a complicated network environment with imperfect data collection is still an open issue. This work proposes a machine learning framework that combines a variational autoencoder and a multilayer perceptron model to deal with imbalanced datasets and detect the explosion of attack variants on the Internet. The detection engine also includes an efficient range-based sequential search algorithm to address the segmentation challenge in pre-processing data from multiple sources (network packets, system/statistic logs). Our work is the first attempt to demonstrate the effect of using an appropriate combination of ML models to boost IDS detection capability in a heterogeneous environment, where imperfect data collection is common. Experimental results on a public system log dataset (e.g., HDFS) show that our method achieves as much as approximately 97% on the F1 score and 98% on the recall rate, a promising result compared with the same measurements for other solutions. Moreover, we found that the proposed treatment of imbalanced datasets can improve the F1 score by up to 35% and the recall rate by up to 27%. The testing results also indicate that our model can detect new attack variants.
first_indexed 2024-12-13T11:17:09Z
format Article
id doaj.art-0b894e7af5fd4cab87e539ceac029d9b
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-13T11:17:09Z
publishDate 2022-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-0b894e7af5fd4cab87e539ceac029d9b
2022-12-21T23:48:35Z
eng
IEEE
IEEE Access
2169-3536
2022-01-01
vol. 10, pp. 15247-15260
10.1109/ACCESS.2022.3149295
9705580
Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection
Ying-Dar Lin [0] https://orcid.org/0000-0002-5226-4396
Zi-Qiang Liu [1]
Ren-Hung Hwang [2] https://orcid.org/0000-0001-7996-4184
Van-Linh Nguyen [3] https://orcid.org/0000-0002-3472-0108
Po-Ching Lin [4] https://orcid.org/0000-0001-8294-5857
Yuan-Cheng Lai [5] https://orcid.org/0000-0003-3695-5784
[0] Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan
[1] Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan
[2] Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan
[3] Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi County, Taiwan
[4] Department of Computer Science and Information Engineering, National Chung Cheng University, Chiayi County, Taiwan
[5] Department of Information Management, National Taiwan University of Science and Technology, Taipei City, Taiwan
As a result of the explosion of security attacks and the complexity of modern networks, machine learning (ML) has recently become the favored approach for intrusion detection systems (IDS). However, the ML approach usually faces three challenges: massive attack variants, imbalanced data, and appropriate data segmentation. Improper handling of these issues significantly degrades ML performance, e.g., resulting in high false-negative rates and low recall. Although many efforts have been made in the literature, detecting security attacks in a complicated network environment with imperfect data collection is still an open issue. This work proposes a machine learning framework that combines a variational autoencoder and a multilayer perceptron model to deal with imbalanced datasets and detect the explosion of attack variants on the Internet. The detection engine also includes an efficient range-based sequential search algorithm to address the segmentation challenge in pre-processing data from multiple sources (network packets, system/statistic logs). Our work is the first attempt to demonstrate the effect of using an appropriate combination of ML models to boost IDS detection capability in a heterogeneous environment, where imperfect data collection is common. Experimental results on a public system log dataset (e.g., HDFS) show that our method achieves as much as approximately 97% on the F1 score and 98% on the recall rate, a promising result compared with the same measurements for other solutions. Moreover, we found that the proposed treatment of imbalanced datasets can improve the F1 score by up to 35% and the recall rate by up to 27%. The testing results also indicate that our model can detect new attack variants.
https://ieeexplore.ieee.org/document/9705580/
Imbalanced dataset
machine learning
variational autoencoder
intrusion detection
spellingShingle Ying-Dar Lin
Zi-Qiang Liu
Ren-Hung Hwang
Van-Linh Nguyen
Po-Ching Lin
Yuan-Cheng Lai
Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection
IEEE Access
Imbalanced dataset
machine learning
variational autoencoder
intrusion detection
title Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection
title_full Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection
title_fullStr Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection
title_full_unstemmed Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection
title_short Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection
title_sort machine learning with variational autoencoder for imbalanced datasets in intrusion detection
topic Imbalanced dataset
machine learning
variational autoencoder
intrusion detection
url https://ieeexplore.ieee.org/document/9705580/
work_keys_str_mv AT yingdarlin machinelearningwithvariationalautoencoderforimbalanceddatasetsinintrusiondetection
AT ziqiangliu machinelearningwithvariationalautoencoderforimbalanceddatasetsinintrusiondetection
AT renhunghwang machinelearningwithvariationalautoencoderforimbalanceddatasetsinintrusiondetection
AT vanlinhnguyen machinelearningwithvariationalautoencoderforimbalanceddatasetsinintrusiondetection
AT pochinglin machinelearningwithvariationalautoencoderforimbalanceddatasetsinintrusiondetection
AT yuanchenglai machinelearningwithvariationalautoencoderforimbalanceddatasetsinintrusiondetection