Efficient cross-validation traversals in feature subset selection


Bibliographic Details
Main Authors: Ludwig Lausser, Robin Szekely, Florian Schmid, Markus Maucher, Hans A. Kestler
Format: Article
Language: English
Published: Nature Portfolio, 2022-12-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-022-25942-4
Collection: DOAJ

Description: Sparse and robust classification models have the potential to reveal common predictive patterns that not only allow for categorizing objects into classes but also for generating mechanistic hypotheses. Identifying a small and informative subset of features is their main ingredient. However, the exponential search space of feature subsets and the heuristic nature of selection algorithms limit the coverage of these analyses, even for low-dimensional datasets. We present methods for reducing the computational complexity of feature selection criteria, allowing for higher efficiency and coverage of screenings. We achieve this by reducing the preparation costs of high-dimensional subsets, $\mathscr{O}(nm^2)$, to those of one-dimensional ones, $\mathscr{O}(m^2)$. Our methods are based on a tight interaction between a parallelizable cross-validation traversal strategy and distance-based classification algorithms and can be used with any product distance or kernel. We evaluate the traversal strategy in exemplary exhaustive feature subset selection experiments (perfect coverage). Its runtime, fitness landscape, and predictive performance are analyzed on publicly available datasets. Even in low-dimensional settings, we achieve approximately a 15-fold speed-up in exhaustively generating distance matrices for feature combinations, bringing a new level of evaluations into reach.
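The abstract's cost reduction rests on a property of product distances: for a metric such as the squared Euclidean distance, the m × m distance matrix of any feature subset is the element-wise sum of per-feature distance matrices. Precomputing one matrix per feature therefore makes each subset's matrix an $\mathscr{O}(m^2)$ sum rather than an $\mathscr{O}(nm^2)$ recomputation. The sketch below is my own illustration of that decomposition, not the authors' code; all function names are invented for this example.

```python
import numpy as np

def per_feature_sq_dists(X):
    """Precompute one m x m squared-distance matrix per feature.

    X has shape (m samples, n features); the result has shape (n, m, m).
    This is the one-off O(n * m^2) preparation step.
    """
    diffs = X[:, None, :] - X[None, :, :]   # pairwise differences, (m, m, n)
    return np.moveaxis(diffs ** 2, 2, 0)    # per-feature slices, (n, m, m)

def subset_sq_dist(per_feat, subset):
    """Squared Euclidean distance matrix restricted to a feature subset.

    Because the squared distance is additive over features, this is just a
    sum of precomputed slices: O(m^2) per feature in the subset, with no
    access to the raw data.
    """
    return per_feat[list(subset)].sum(axis=0)
```

When subsets are traversed so that consecutive subsets differ by one feature (e.g. in Gray-code order), the update is a single addition or subtraction of one precomputed slice, `D_new = D_old + per_feat[j]`, which is what makes exhaustive screenings of all subsets tractable.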
ISSN: 2045-2322
DOAJ record ID: doaj.art-008503c4edc848268dfd5f8c6ffd03ba
Affiliation (all authors): Institute of Medical Systems Biology, Ulm University