Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests

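Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has been thus far conducted on which approaches might be best suited to deal with datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work attempts to give a contribution to this challenging research area by studying the effectiveness of hybrid learning strategies that involve the integration of feature selection techniques, to reduce the data dimensionality, with proper methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach, when compared to using only feature selection or imbalance learning methods alone.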

Bibliographic Details
Main Author: Barbara Pes
Format: Article
Language: English
Published: MDPI AG 2021-07-01
Series: Information
Subjects: high-dimensional data; feature selection; class imbalance; random forest
Online Access: https://www.mdpi.com/2078-2489/12/8/286
_version_ 1797523530920755200
author Barbara Pes
author_facet Barbara Pes
author_sort Barbara Pes
collection DOAJ
description Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has been thus far conducted on which approaches might be best suited to deal with datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work attempts to give a contribution to this challenging research area by studying the effectiveness of hybrid learning strategies that involve the integration of feature selection techniques, to reduce the data dimensionality, with proper methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach, when compared to using only feature selection or imbalance learning methods alone.
first_indexed 2024-03-10T08:44:19Z
format Article
id doaj.art-cf648c550a1f465b878764de0216e11b
institution Directory Open Access Journal
issn 2078-2489
language English
last_indexed 2024-03-10T08:44:19Z
publishDate 2021-07-01
publisher MDPI AG
record_format Article
series Information
spelling doaj.art-cf648c550a1f465b878764de0216e11b | 2023-11-22T08:05:33Z | eng | MDPI AG | Information | 2078-2489 | 2021-07-01 | 12 | 8 | 286 | 10.3390/info12080286 | Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests | Barbara Pes | 0 | Dipartimento di Matematica e Informatica, Università di Cagliari, Via Ospedale 72, 09124 Cagliari, Italy | Class imbalance and high dimensionality are two major issues in several real-life applications, e.g., in the fields of bioinformatics, text mining and image classification. However, while both issues have been extensively studied in the machine learning community, they have mostly been treated separately, and little research has been thus far conducted on which approaches might be best suited to deal with datasets that are class-imbalanced and high-dimensional at the same time (i.e., with a large number of features). This work attempts to give a contribution to this challenging research area by studying the effectiveness of hybrid learning strategies that involve the integration of feature selection techniques, to reduce the data dimensionality, with proper methods that cope with the adverse effects of class imbalance (in particular, data balancing and cost-sensitive methods are considered). Extensive experiments have been carried out across datasets from different domains, leveraging a well-known classifier, the Random Forest, which has proven to be effective in high-dimensional spaces and has also been successfully applied to imbalanced tasks. Our results give evidence of the benefits of such a hybrid approach, when compared to using only feature selection or imbalance learning methods alone. | https://www.mdpi.com/2078-2489/12/8/286 | high-dimensional data | feature selection | class imbalance | random forest
title Learning from High-Dimensional and Class-Imbalanced Datasets Using Random Forests
title_sort learning from high dimensional and class imbalanced datasets using random forests
topic high-dimensional data
feature selection
class imbalance
random forest
url https://www.mdpi.com/2078-2489/12/8/286
work_keys_str_mv AT barbarapes learningfromhighdimensionalandclassimbalanceddatasetsusingrandomforests