Generating Synthetic Dataset for ML-Based IDS Using CTGAN and Feature Selection to Protect Smart IoT Environments

Networks within the Internet of Things (IoT) have some of the most targeted devices due to their lightweight design and the sensitive data exchanged through smart city networks. One way to protect a system from an attack is to use machine learning (ML)-based intrusion detection systems (IDSs), signi...

Bibliographic Details
Main Authors:	Saleh Alabdulwahab, Young-Tak Kim, Aria Seo, Yunsik Son
Format:	Article
Language:	English
Published:	MDPI AG 2023-10-01
Series:	Applied Sciences
Subjects:	intrusion detection system; machine learning; information security; IoT; CTGAN; advanced persistent threat
Online Access:	https://www.mdpi.com/2076-3417/13/19/10951

author	Saleh Alabdulwahab, Young-Tak Kim, Aria Seo, Yunsik Son
collection	DOAJ
description	Networks within the Internet of Things (IoT) contain some of the most frequently targeted devices because of their lightweight design and the sensitive data exchanged through smart city networks. One way to protect a system from attack is to use machine learning (ML)-based intrusion detection systems (IDSs), which significantly improve performance on classification tasks. Training ML algorithms requires a large network traffic dataset; however, capturing attacks requires large storage and months of recording, which is costly for IoT environments. This study proposes an ML pipeline that uses the conditional tabular generative adversarial network (CTGAN) model to generate a synthetic dataset. The synthetic dataset was then evaluated using several statistical and ML metrics. Using a decision tree, the accuracy obtained with the generated dataset reached 0.99, and its lower complexity yielded a training time of 0.05 s and a test time of 0.004 s. The results show that the synthetic data accurately reflect the real data and are less complex, making them suitable for IoT environments and smart city applications. Thus, the generated synthetic dataset can be used to further train models that secure IoT networks and applications.
first_indexed	2024-03-10T21:48:50Z
format	Article
id	doaj.art-a69a28b418e648ce8c1735712d74e34e
institution	Directory of Open Access Journals
issn	2076-3417
language	English
last_indexed	2024-03-10T21:48:50Z
publishDate	2023-10-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj.art-a69a28b418e648ce8c1735712d74e34e; 2023-11-19T14:06:29Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2023-10-01; vol. 13, no. 19, art. 10951; DOI 10.3390/app131910951; Saleh Alabdulwahab (0), Young-Tak Kim (1), Aria Seo (2), Yunsik Son (3); (0) Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Republic of Korea; (1) Department of Biomedical Sciences, Korea University College of Medicine, Seoul 02841, Republic of Korea; (2) Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Republic of Korea; (3) Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Republic of Korea; https://www.mdpi.com/2076-3417/13/19/10951; intrusion detection system; machine learning; information security; IoT; CTGAN; advanced persistent threat
title	Generating Synthetic Dataset for ML-Based IDS Using CTGAN and Feature Selection to Protect Smart IoT Environments
topic	intrusion detection system; machine learning; information security; IoT; CTGAN; advanced persistent threat
url	https://www.mdpi.com/2076-3417/13/19/10951