CICIDS-2017 dataset feature analysis with information gain for anomaly detection

Feature selection (FS) is one of the important tasks of data preprocessing in data analytics. Data with a large number of features increases computational complexity, resource usage, and time consumption for data analytics. The objective of this study is to analyze...


Bibliographic Details
Main Authors: Kurniabudi, Kurniabudi, Stiawan, Deris, Darmawijoyo, Darmawijoyo, Idris, Mohd. Yazid, Bamhdi, Alwi M., Budiarto, Rahmat
Format: Article
Language: English
Published: Institute of Electrical and Electronics Engineers Inc. 2020
Subjects: QA75 Electronic computers. Computer science
Online Access: http://eprints.utm.my/93850/1/CICIDS-MohdYazidIdris2020_CICIDS2017DatasetFeatureAnalysisWithInformation.pdf
description Feature selection (FS) is one of the important tasks of data preprocessing in data analytics. Data with a large number of features increases computational complexity, resource usage, and time consumption for data analytics. The objective of this study is to analyze relevant and significant features of large-scale network traffic in order to improve the accuracy of traffic anomaly detection and decrease its execution time. Information Gain is the feature selection technique most commonly used in Intrusion Detection System (IDS) research. This study uses Information Gain to rank the features and group them according to minimum weight values in order to select relevant and significant features, and then applies the Random Forest (RF), Bayes Net (BN), Random Tree (RT), Naive Bayes (NB) and J48 classifier algorithms in experiments on the CICIDS-2017 dataset. The experimental results show that the number of relevant and significant features yielded by Information Gain significantly affects detection accuracy and execution time. Specifically, the Random Forest algorithm achieves an accuracy of 99.86% using 22 relevant selected features, whereas the J48 classifier algorithm achieves an accuracy of 99.87% using 52 relevant selected features, with a longer execution time.
first_indexed 2024-03-05T21:01:08Z
format Article
id utm.eprints-93850
institution Universiti Teknologi Malaysia - ePrints
spelling utm.eprints-93850 2022-03-07T00:09:16Z http://eprints.utm.my/93850/ Article PeerReviewed application/pdf en Kurniabudi, Kurniabudi and Stiawan, Deris and Darmawijoyo, Darmawijoyo and Idris, Mohd. Yazid and Bamhdi, Alwi M. and Budiarto, Rahmat (2020) CICIDS-2017 dataset feature analysis with information gain for anomaly detection. IEEE Access, 8. pp. 132911-132921. ISSN 2169-3536 http://dx.doi.org/10.1109/ACCESS.2020.3009843
topic QA75 Electronic computers. Computer science
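
The abstract describes ranking features by Information Gain and keeping those whose weight meets a minimum value. The sketch below illustrates that general idea on discrete features only; it is not the authors' implementation. The function names, the toy feature names (`flag_a`, `flag_b`), the labels and the threshold are all hypothetical, and continuous traffic features would need to be discretized before this calculation applies.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """H(labels) minus H(labels | feature) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(columns, labels, min_weight):
    """Rank features by Information Gain (descending) and keep those
    whose weight is at least min_weight (a hypothetical threshold)."""
    ranked = sorted(
        ((information_gain(values, labels), name) for name, values in columns.items()),
        reverse=True,
    )
    return [(name, weight) for weight, name in ranked if weight >= min_weight]


# Toy data, invented for illustration only.
toy_labels = ["BENIGN", "BENIGN", "ATTACK", "ATTACK"]
toy_columns = {
    "flag_a": [0, 0, 1, 1],  # separates the two classes perfectly
    "flag_b": [0, 1, 0, 1],  # carries no information about the class
}
selected = select_features(toy_columns, toy_labels, min_weight=0.5)
print(selected)
```

In this toy case `flag_a` determines the label completely while `flag_b` is independent of it, so only `flag_a` passes the threshold. The selected subset would then be passed to a classifier such as those named in the abstract.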