Efficient cross-validation traversals in feature subset selection


Bibliographic Details
Main Authors: Ludwig Lausser, Robin Szekely, Florian Schmid, Markus Maucher, Hans A. Kestler
Format: Article
Language: English
Published: Nature Portfolio, 2022-12-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-022-25942-4
Collection: DOAJ

Description: Sparse and robust classification models have the potential to reveal common predictive patterns that not only allow for categorizing objects into classes but also for generating mechanistic hypotheses. Identifying a small and informative subset of features is their main ingredient. However, the exponential search space of feature subsets and the heuristic nature of selection algorithms limit the coverage of these analyses, even for low-dimensional datasets. We present methods for reducing the computational complexity of feature selection criteria, allowing for higher efficiency and coverage of screenings. We achieve this by reducing the preparation costs of high-dimensional subsets, $\mathscr{O}(nm^2)$, to those of one-dimensional ones, $\mathscr{O}(m^2)$. Our methods are based on a tight interaction between a parallelizable cross-validation traversal strategy and distance-based classification algorithms and can be used with any product distance or kernel. We evaluate the traversal strategy in exemplary exhaustive feature subset selection experiments (perfect coverage). Its runtime, fitness landscape, and predictive performance are analyzed on publicly available datasets. Even in low-dimensional settings, we achieve approximately a 15-fold speed-up in exhaustively generating distance matrices for feature combinations, bringing a new level of evaluations into reach.
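The abstract's cost reduction rests on a property of product distances: for a metric such as the squared Euclidean distance, the m × m distance matrix of any feature subset is the element-wise sum of per-feature distance matrices. Precomputing one matrix per feature therefore makes each subset's matrix an $\mathscr{O}(m^2)$ sum rather than an $\mathscr{O}(nm^2)$ recomputation. The sketch below is my own illustration of that decomposition, not the authors' code; all function names are invented for this example.

```python
import numpy as np

def per_feature_sq_dists(X):
    """Precompute one m x m squared-distance matrix per feature.

    X has shape (m samples, n features); the result has shape (n, m, m).
    This is the one-off O(n * m^2) preparation step.
    """
    diffs = X[:, None, :] - X[None, :, :]   # pairwise differences, (m, m, n)
    return np.moveaxis(diffs ** 2, 2, 0)    # per-feature slices, (n, m, m)

def subset_sq_dist(per_feat, subset):
    """Squared Euclidean distance matrix restricted to a feature subset.

    Because the squared distance is additive over features, this is just a
    sum of precomputed slices: O(m^2) per feature in the subset, with no
    access to the raw data.
    """
    return per_feat[list(subset)].sum(axis=0)
```

When subsets are traversed so that consecutive subsets differ by one feature (e.g. in Gray-code order), the update is a single addition or subtraction of one precomputed slice, `D_new = D_old + per_feat[j]`, which is what makes exhaustive screenings of all subsets tractable.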
ISSN: 2045-2322
DOAJ record ID: doaj.art-008503c4edc848268dfd5f8c6ffd03ba
Affiliation (all authors): Institute of Medical Systems Biology, Ulm University