The Impact of Partial Balance of Imbalanced Dataset on Classification Performance

The imbalance of network data seriously affects the classification performance of algorithms. Most studies have only used a rough description of data imbalance with less exploration of the specific factors affecting classification performance, which has resulted in difficulty putting forward targete...

Bibliographic Details
Main Authors: Qing Li, Chang Zhao, Xintai He, Kun Chen, Runze Wang
Format: Article
Language:English
Published: MDPI AG 2022-04-01
Series:Electronics
Subjects:network traffic classification; data imbalance; imbalance degree; minority class; partial balance
Online Access:https://www.mdpi.com/2079-9292/11/9/1322
_version_ 1797505164118065152
author Qing Li
Chang Zhao
Xintai He
Kun Chen
Runze Wang
collection DOAJ
description The imbalance of network data seriously affects the classification performance of algorithms. Most studies have used only a rough description of data imbalance, with little exploration of the specific factors affecting classification performance, which has made it difficult to put forward targeted solutions. In this paper, we find that the impact of medium categories on classification performance cannot be ignored, and therefore propose the concept of partial balance, consisting of Class Number of Partial Balance (β) and Balance Degree of Partial Samples (μ). Combined with Global Slope (α), a parameterized model is established to describe the differences among imbalanced datasets. Experiments are performed on the Moore Dataset and CICIDS 2017 Dataset. The experimental results on Random Forest, Decision Tree and Deep Neural Network show that increasing α is conducive to improving the performance of both minority classes and overall classes. When β of dominant categories increases, that of inferior classes decreases, which results in a decrease in the average performance of minority classes. The lower μ is, the closer the sample size of medium classes is to the minority classes, and the better the average performance is. Based on the conclusions, we propose and verify some basic strategies by various classical algorithms.
first_indexed 2024-03-10T04:14:42Z
format Article
id doaj.art-12fdf00ef428425f9140c004ba088b36
institution Directory Open Access Journal
issn 2079-9292
language English
last_indexed 2024-03-10T04:14:42Z
publishDate 2022-04-01
publisher MDPI AG
record_format Article
series Electronics
spelling doaj.art-12fdf00ef428425f9140c004ba088b36; 2023-11-23T08:01:56Z; eng; MDPI AG; Electronics; ISSN 2079-9292; 2022-04-01; vol. 11, no. 9, art. 1322; DOI 10.3390/electronics11091322
affiliation Qing Li, Chang Zhao, Xintai He, Kun Chen, Runze Wang: Department of Information System Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
title The Impact of Partial Balance of Imbalanced Dataset on Classification Performance
topic network traffic classification
data imbalance
imbalance degree
minority class
partial balance
url https://www.mdpi.com/2079-9292/11/9/1322