Assessing feature selection method performance with class imbalance data

Identifying the most informative features is a crucial step in feature selection. This paper focuses primarily on wrapper feature selection methods designed to detect important features with F1-score as the target metric. As an initial step, most wrapper methods order features according to importanc...

Full description

Bibliographic Details
Main Authors: Surani Matharaarachchi, Mike Domaratzki, Saman Muthukumarana
Format: Article
Language:English
Published: Elsevier 2021-12-01
Series:Machine Learning with Applications
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S2666827021000852
Description
Summary:Identifying the most informative features is a crucial step in feature selection. This paper focuses primarily on wrapper feature selection methods designed to detect important features with F1-score as the target metric. As an initial step, most wrapper methods order features according to importance. However, in most cases, the importance is defined according to the classification method used and varies with the characteristics of the data set. Using synthetically simulated data, we examine four existing feature ordering techniques to find the most desirable and the most effective ordering mechanism to identify informative features. Using the results, an improved method is suggested to extract the most informative feature subset from the data set. The method uses the sum of absolute values of the first k principal component loadings to order the features where k is a user-defined application-specific value. It also applies a sequential feature selection method to extract the best subset of features. We further compare the performance of the proposed feature selection method with results from the existing Recursive Feature Elimination (RFE) by simulating data for several practical scenarios with a different number of informative features and different imbalance rates. We also validate the method using a real-world application on several classification methods. The results based on the accuracy measures indicate that the proposed approach performs better than the existing feature selection methods.
ISSN:2666-8270