Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests

Bibliographic Details
Main Author:	Barbara Pes
Format:	Article
Language:	English
Published:	MDPI AG 2021-07-01
Series:	Information
Subjects:	high-dimensional data; feature selection; class imbalance; random forest
Online Access:	https://www.mdpi.com/2078-2489/12/8/286

collection	DOAJ
description	Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has been thus far conducted on which approaches might be best suited to deal with datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work attempts to give a contribution to this challenging research area by studying the effectiveness of hybrid learning strategies that involve the integration of feature selection techniques, to reduce the data dimensionality, with proper methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach, when compared to using only feature selection or imbalance learning methods alone.
id	doaj.art-cf648c550a1f465b878764de0216e11b
institution	Directory of Open Access Journals
issn	2078-2489
doi	10.3390/info12080286
volume	12
issue	8
article_number	286
author_affiliation	Dipartimento di Matematica e Informatica, Università di Cagliari, Via Ospedale 72, 09124 Cagliari, Italy


Similar Items