An efficiency-driven, correlation-based feature elimination strategy for small datasets

With big datasets and highly efficient algorithms becoming increasingly available for many problem sets, rapid advancements and recent breakthroughs achieved in the field of machine learning encourage more and more scientific fields to make use of such a computational data analysis. Still, for many...

Bibliographic Details
Main Authors: Carolin A. Rickert, Manuel Henkel, Oliver Lieleg
Format: Article
Language: English
Published: AIP Publishing LLC 2023-03-01
Series: APL Machine Learning
Online Access: http://dx.doi.org/10.1063/5.0118207
collection DOAJ
description With big datasets and highly efficient algorithms becoming increasingly available for many problem sets, rapid advances and recent breakthroughs in machine learning encourage more and more scientific fields to adopt such computational data analysis. Still, for many research problems, the amount of data available for training a machine learning (ML) model is very limited. An important strategy to combat the problems arising from data sparsity is feature elimination, a method that aims to reduce the dimensionality of an input feature space. Most such strategies focus exclusively on analyzing pairwise correlations, or they eliminate features based on their relation to a selected output label or by optimizing performance measures of a certain ML model. However, those strategies do not necessarily remove redundant information from datasets and cannot be applied in certain situations, e.g., to unsupervised learning models. Neither of these limitations applies to the network-based, correlation-driven redundancy elimination (NETCORE) algorithm introduced here, which reduces the size of a feature vector by considering both redundancy and elimination efficiency. The NETCORE algorithm is model-independent, does not require an output label, and is applicable to all kinds of correlation topographies within a dataset. Thus, this algorithm has the potential to be a highly beneficial preprocessing tool for various machine learning pipelines.
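The general idea named in the description, treating features as nodes of a correlation network and removing redundant ones without any output label, can be illustrated with a short sketch. This is not the published NETCORE procedure, whose details are not given in this record. It is a generic greedy heuristic under stated assumptions: absolute Pearson correlation as the edge weight, a fixed cutoff `threshold` (0.8 here, chosen arbitrarily), and "keep the feature with the most correlated neighbours, drop those neighbours" as a stand-in for elimination efficiency. The function name `greedy_correlation_pruning` is hypothetical.

```python
import numpy as np


def greedy_correlation_pruning(X, threshold=0.8):
    """Return sorted column indices to keep from X (n_samples, n_features).

    Illustrative only: a generic greedy heuristic, not the published
    NETCORE procedure. `threshold` is an assumed cutoff on |Pearson r|.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)
    adjacent = corr >= threshold  # edges of the correlation network
    remaining = np.ones(X.shape[1], dtype=bool)
    kept = []
    while remaining.any():
        # Degree of each feature, counting only neighbours still in play.
        degree = (adjacent & remaining).sum(axis=1)
        degree[~remaining] = -1  # never pick a feature already handled
        hub = int(np.argmax(degree))
        kept.append(hub)
        # Keeping the hub makes its neighbours redundant, so drop them.
        remaining[hub] = False
        remaining[adjacent[hub]] = False
    return sorted(kept)


if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 50)
    # Columns 0-2 are linearly related; column 3 is uncorrelated with them.
    X = np.column_stack([t, 2.0 * t + 1.0, -t, (t - 0.5) ** 2])
    print(greedy_correlation_pruning(X))
```

Because each kept feature removes all of its strongly correlated neighbours at once, one retained feature can stand in for a whole cluster, and no label or downstream model is needed. The choice of correlation measure and cutoff would need to be tuned to the dataset at hand.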
first_indexed 2024-03-11T12:22:42Z
id doaj.art-6066f5e7194043e5ba661efe024822dc
institution Directory Open Access Journal
issn 2770-9019
last_indexed 2024-03-11T12:22:42Z
spelling doaj.art-6066f5e7194043e5ba661efe024822dc; 2023-11-06T20:55:04Z; eng; AIP Publishing LLC; APL Machine Learning; 2770-9019; 2023-03-01; 1; 1; 016105; 016105-14; 10.1063/5.0118207; An efficiency-driven, correlation-based feature elimination strategy for small datasets; Carolin A. Rickert (0), Manuel Henkel (1), Oliver Lieleg (2); affiliation (0, 1, 2): School of Engineering and Design, Department of Materials Engineering, Technical University of Munich, Boltzmannstraße 15, 85748 Garching, Germany; http://dx.doi.org/10.1063/5.0118207