Assessing feature selection method performance with class imbalance data

Identifying the most informative features is a crucial step in feature selection. This paper focuses primarily on wrapper feature selection methods designed to detect important features with F1-score as the target metric. As an initial step, most wrapper methods order features according to importance. However, in most cases, the importance is defined according to the classification method used and varies with the characteristics of the data set. Using synthetically simulated data, we examine four existing feature ordering techniques to find the most desirable and effective ordering mechanism for identifying informative features. Based on the results, an improved method is suggested for extracting the most informative feature subset from the data set. The method uses the sum of absolute values of the first k principal component loadings to order the features, where k is a user-defined, application-specific value, and then applies a sequential feature selection method to extract the best subset of features. We further compare the performance of the proposed feature selection method with the existing Recursive Feature Elimination (RFE) by simulating data for several practical scenarios with different numbers of informative features and different imbalance rates. We also validate the method in a real-world application across several classification methods. The results, based on accuracy measures, indicate that the proposed approach performs better than the existing feature selection methods.
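
The ordering and search strategy described in the abstract can be illustrated with a short scikit-learn sketch. The code below is not the authors' implementation; it is a minimal reconstruction under stated assumptions: features are ranked by the sum of absolute loadings on the first k principal components, a greedy forward pass over that ordering is scored by cross-validated F1, and the selected subset is compared against a Recursive Feature Elimination baseline on an imbalanced synthetic data set. The helper names (pca_loading_order, sequential_select), the choice of logistic regression as the wrapped classifier, and the simulation settings are illustrative assumptions, not details taken from the paper.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler


def pca_loading_order(X, k):
    # Rank features by the sum of absolute loadings on the first k principal
    # components (most important first). k is user-defined and application-specific.
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=k).fit(X_std)
    # components_ has shape (k, n_features); loadings scale each component
    # by the standard deviation it explains.
    loadings = pca.components_.T * np.sqrt(pca.explained_variance_)
    importance = np.abs(loadings).sum(axis=1)
    return np.argsort(importance)[::-1]


def sequential_select(X, y, order, estimator):
    # Greedy forward pass over the given feature ordering, keeping the prefix
    # with the best cross-validated F1-score (an imbalance-aware target metric).
    best_f1, best_subset = -np.inf, []
    for i in range(1, len(order) + 1):
        subset = list(order[:i])
        f1 = cross_val_score(estimator, X[:, subset], y, cv=5, scoring="f1").mean()
        if f1 > best_f1:
            best_f1, best_subset = f1, subset
    return best_subset, best_f1


if __name__ == "__main__":
    # Imbalanced synthetic data with a handful of informative features,
    # loosely mirroring the simulation setting described in the abstract.
    X, y = make_classification(n_samples=1000, n_features=30, n_informative=5,
                               weights=[0.9, 0.1], random_state=0)
    clf = LogisticRegression(max_iter=1000)

    order = pca_loading_order(X, k=5)
    subset, f1_pca = sequential_select(X, y, order, clf)
    print(f"PCA-loading ordering + sequential search: {len(subset)} features, F1 = {f1_pca:.3f}")

    # Baseline: Recursive Feature Elimination with the same estimator and subset size.
    rfe = RFE(clf, n_features_to_select=len(subset)).fit(X, y)
    f1_rfe = cross_val_score(clf, X[:, rfe.support_], y, cv=5, scoring="f1").mean()
    print(f"RFE baseline: {len(subset)} features, F1 = {f1_rfe:.3f}")

Whether the paper's sequential step is this prefix-style forward pass or a fuller sequential search is not specified here; the sketch only demonstrates the PCA-loading ordering and an F1-based comparison against RFE.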

Bibliographic Details
Main Authors: Surani Matharaarachchi, Mike Domaratzki, Saman Muthukumarana
Author Affiliations: Department of Statistics, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada (Matharaarachchi, corresponding author; Muthukumarana); Department of Computer Science, University of Manitoba, Winnipeg, MB, R3T 2N2, Canada (Domaratzki)
Format: Article
Language: English
Published: Elsevier, 2021-12-01
Series: Machine Learning with Applications, Volume 6 (December 2021), Article 100170
ISSN: 2666-8270
Subjects: Feature selection; Informative feature; Recursive feature elimination; Principal component loading
Online Access: http://www.sciencedirect.com/science/article/pii/S2666827021000852