Efficient cross-validation traversals in feature subset selection
Abstract Sparse and robust classification models have the potential for revealing common predictive patterns that not only allow for categorizing objects into classes but also for generating mechanistic hypotheses. Identifying a small and informative subset of features is their main ingredient. However, the exponential search space of feature subsets and the heuristic nature of selection algorithms limit the coverage of these analyses, even for low-dimensional datasets. We present methods for reducing the computational complexity of feature selection criteria, allowing for higher efficiency and coverage of screenings. We achieve this by reducing the preparation costs of high-dimensional subsets $${\mathscr {O}}(nm^2)$$ to those of one-dimensional ones $${\mathscr {O}}(m^2)$$. Our methods are based on a tight interaction between a parallelizable cross-validation traversal strategy and distance-based classification algorithms and can be used with any product distance or kernel. We evaluate the traversal strategy exemplarily in exhaustive feature subset selection experiments (perfect coverage). Its runtime, fitness landscape, and predictive performance are analyzed on publicly available datasets. Even in low-dimensional settings, we achieve an approximately 15-fold speed-up in exhaustively generating distance matrices for feature combinations, bringing a new level of evaluations into reach.
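The cost reduction described in the abstract rests on the fact that product distances decompose additively over features: the pairwise distance matrix of a feature subset is the sum of per-feature matrices, so extending a subset by one feature costs only $${\mathscr {O}}(m^2)$$ instead of recomputing from scratch at $${\mathscr {O}}(nm^2)$$. A minimal sketch of this idea, assuming squared Euclidean distance as the product distance and using illustrative function names (not taken from the paper):

```python
import numpy as np

def per_feature_sq_dists(X):
    """One m x m squared-distance matrix per feature; O(m^2) each.
    X has shape (m samples, n features)."""
    m, n = X.shape
    return [(X[:, j, None] - X[None, :, j]) ** 2 for j in range(n)]

def subset_matrices(X, k):
    """Yield (subset, squared-distance matrix) for every k-subset of
    features, traversing the subset lattice depth-first so that each
    matrix is built from its (k-1)-feature prefix with a single
    O(m^2) matrix addition, rather than recomputed from scratch."""
    D = per_feature_sq_dists(X)
    m, n = X.shape

    def extend(prefix, mat, last):
        if len(prefix) == k:
            yield tuple(prefix), mat
            return
        for j in range(last + 1, n):
            # Reuse the parent's matrix: one O(m^2) addition per child.
            yield from extend(prefix + [j], mat + D[j], j)

    yield from extend([], np.zeros((m, m)), -1)
```

Each yielded matrix can then be handed to a distance-based classifier (e.g. nearest neighbors) inside a cross-validation loop, which is the setting the traversal strategy targets; the exhaustive enumeration here corresponds to the "perfect coverage" experiments mentioned in the abstract.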
Main Authors: Ludwig Lausser, Robin Szekely, Florian Schmid, Markus Maucher, Hans A. Kestler
Format: Article
Language: English
Published: Nature Portfolio, 2022-12-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-022-25942-4
author | Ludwig Lausser; Robin Szekely; Florian Schmid; Markus Maucher; Hans A. Kestler |
collection | DOAJ |
description | Abstract Sparse and robust classification models have the potential for revealing common predictive patterns that not only allow for categorizing objects into classes but also for generating mechanistic hypotheses. Identifying a small and informative subset of features is their main ingredient. However, the exponential search space of feature subsets and the heuristic nature of selection algorithms limit the coverage of these analyses, even for low-dimensional datasets. We present methods for reducing the computational complexity of feature selection criteria allowing for higher efficiency and coverage of screenings. We achieve this by reducing the preparation costs of high-dimensional subsets $${\mathscr {O}}(nm^2)$$ to those of one-dimensional ones $${\mathscr {O}}(m^2)$$. Our methods are based on a tight interaction between a parallelizable cross-validation traversal strategy and distance-based classification algorithms and can be used with any product distance or kernel. We evaluate the traversal strategy exemplarily in exhaustive feature subset selection experiments (perfect coverage). Its runtime, fitness landscape, and predictive performance are analyzed on publicly available datasets. Even in low-dimensional settings, we achieve approximately a 15-fold increase in exhaustively generating distance matrices for feature combinations bringing a new level of evaluations into reach. |
first_indexed | 2024-04-13T04:40:11Z |
format | Article |
id | doaj.art-008503c4edc848268dfd5f8c6ffd03ba |
institution | Directory Open Access Journal |
issn | 2045-2322 |
language | English |
last_indexed | 2024-04-13T04:40:11Z |
publishDate | 2022-12-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Scientific Reports |
spelling | Efficient cross-validation traversals in feature subset selection. Ludwig Lausser, Robin Szekely, Florian Schmid, Markus Maucher, Hans A. Kestler (all: Institute of Medical Systems Biology, Ulm University). Scientific Reports 12(1), 1–16 (2022-12-01), Nature Portfolio, ISSN 2045-2322, https://doi.org/10.1038/s41598-022-25942-4 |
title | Efficient cross-validation traversals in feature subset selection |
url | https://doi.org/10.1038/s41598-022-25942-4 |