Feature space reduction method for ultrahigh-dimensional, multiclass data: random forest-based multiround screening (RFMS)

In recent years, several screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features, many of which are irrelevant or redundant. However, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticat...

Bibliographic Details
Main Authors: Gergely Hanczár, Marcell Stippinger, Dávid Hanák, Marcell T Kurbucz, Olivér M Törteli, Ágnes Chripkó, Zoltán Somogyvári
Format: Article
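Language: English
Published: IOP Publishing 2023-01-01
Series: Machine Learning: Science and Technology
Subjects: feature screening, ultrahigh dimensionality, multiclass classification, random forest, biometrics
Online Access: https://doi.org/10.1088/2632-2153/ad020e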
author Gergely Hanczár
Marcell Stippinger
Dávid Hanák
Marcell T Kurbucz
Olivér M Törteli
Ágnes Chripkó
Zoltán Somogyvári
collection DOAJ
description In recent years, several screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features, many of which are irrelevant or redundant. However, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data result in this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and the selection of features based on their importance. This algorithm successfully filters irrelevant features and also discovers binary and higher-order feature interactions. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, the RFMS is on par with industry-standard feature screening methods, while simultaneously possessing many advantages over them.
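The tournament idea in the description above (split the features into small subsets, score each subset with a partial model, keep the best features, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, parameters, and defaults below are hypothetical, and the default scorer is a simple stand-in (between-class variance of class means over total variance), not a random forest. With a univariate stand-in like this, feature interactions cannot be detected; a forest-based importance, for example the feature_importances_ attribute of scikit-learn's RandomForestClassifier fitted on the subset, would be passed in as score_subset instead.

```python
import random
from statistics import mean, pvariance


def variance_ratio_score(X, y, cols):
    """Stand-in importance (NOT a random forest): for each column, the
    variance of the per-class means divided by the total variance."""
    labels = sorted(set(y))
    scores = []
    for c in cols:
        col = [row[c] for row in X]
        total = pvariance(col)
        if total == 0:
            scores.append(0.0)  # constant column carries no information
            continue
        class_means = [mean(v for v, lab in zip(col, y) if lab == k)
                       for k in labels]
        scores.append(pvariance(class_means) / total)
    return scores


def multiround_screen(X, y, subset_size=10, keep_per_subset=3, n_final=5,
                      score_subset=variance_ratio_score, seed=0):
    """Hypothetical tournament-style screening loop.

    X is a list of rows, y a list of class labels. Returns up to n_final
    column indices, best first, according to score_subset.
    """
    if keep_per_subset >= subset_size:
        raise ValueError("keep_per_subset must be smaller than subset_size")
    rng = random.Random(seed)
    survivors = list(range(len(X[0])))
    # Rounds: shuffle, split into small subsets, keep each subset's winners.
    while len(survivors) > max(n_final, subset_size):
        rng.shuffle(survivors)
        next_round = []
        for start in range(0, len(survivors), subset_size):
            cols = survivors[start:start + subset_size]
            ranked = sorted(zip(score_subset(X, y, cols), cols), reverse=True)
            next_round.extend(c for _, c in ranked[:keep_per_subset])
        survivors = next_round
    # Final round: rank the remaining features together.
    ranked = sorted(zip(score_subset(X, y, survivors), survivors), reverse=True)
    return [c for _, c in ranked[:n_final]]
```

Calling multiround_screen(X, y, n_final=2) on a dataset whose labels depend on only two columns should return those two column indices. Because each round scores at most subset_size features at a time, no single model ever sees the full feature space, which is the property the description emphasizes.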
first_indexed 2024-03-11T17:22:55Z
format Article
id doaj.art-b9e76c4766c04692a96738a53f464413
institution Directory Open Access Journal
issn 2632-2153
language English
last_indexed 2024-03-11T17:22:55Z
publishDate 2023-01-01
publisher IOP Publishing
record_format Article
series Machine Learning: Science and Technology
spelling doaj.art-b9e76c4766c04692a96738a53f464413
2023-10-19T09:24:08Z
eng
IOP Publishing
Machine Learning: Science and Technology, ISSN 2632-2153
2023-01-01, volume 4, issue 4, article 045012
10.1088/2632-2153/ad020e
Gergely Hanczár (https://orcid.org/0000-0002-0222-1400): Cursor Insight Ltd, 20-22 Wenlock Road, N17GU London, United Kingdom
Marcell Stippinger (https://orcid.org/0000-0002-9954-8089): Department of Computational Sciences, Wigner Research Centre for Physics, 29-33 Konkoly Thege Miklós Street, H-1121 Budapest, Hungary
Dávid Hanák (https://orcid.org/0000-0003-0678-9885): Cursor Insight Ltd, 20-22 Wenlock Road, N17GU London, United Kingdom
Marcell T Kurbucz (https://orcid.org/0000-0002-0121-6781): Department of Computational Sciences, Wigner Research Centre for Physics, 29-33 Konkoly Thege Miklós Street, H-1121 Budapest, Hungary; Institute of Data Analytics and Information Systems, Corvinus University of Budapest, 8 Fővám Square, H-1093 Budapest, Hungary
Olivér M Törteli (https://orcid.org/0000-0002-2148-9189): Cursor Insight Ltd, 20-22 Wenlock Road, N17GU London, United Kingdom
Ágnes Chripkó (https://orcid.org/0000-0002-2863-5257): Cursor Insight Ltd, 20-22 Wenlock Road, N17GU London, United Kingdom
Zoltán Somogyvári (https://orcid.org/0000-0002-4385-3025): Department of Computational Sciences, Wigner Research Centre for Physics, 29-33 Konkoly Thege Miklós Street, H-1121 Budapest, Hungary
title Feature space reduction method for ultrahigh-dimensional, multiclass data: random forest-based multiround screening (RFMS)
topic feature screening
ultrahigh dimensionality
multiclass classification
random forest
biometrics
url https://doi.org/10.1088/2632-2153/ad020e