Distance Correlation-Based Feature Selection in Random Forest

The Pearson correlation coefficient (ρ) is a commonly used measure of correlation, but it has...

Bibliographic Details
Main Authors: Suthakaran Ratnasingam, Jose Muñoz-Lopez
Format: Article
Language: English
Published: MDPI AG 2023-08-01
Series: Entropy
Subjects: feature selection, random forest, Pearson correlation, distance correlation
Online Access: https://www.mdpi.com/1099-4300/25/9/1250
_version_ 1827726217754705920
author Suthakaran Ratnasingam
Jose Muñoz-Lopez
author_facet Suthakaran Ratnasingam
Jose Muñoz-Lopez
author_sort Suthakaran Ratnasingam
collection DOAJ
description The Pearson correlation coefficient (ρ) is a commonly used measure of correlation, but it has limitations as it only measures the linear relationship between two numerical variables. The distance correlation measures all types of dependencies between random vectors X and Y in arbitrary dimensions, not just the linear ones. In this paper, we propose a filter method that utilizes distance correlation as a criterion for feature selection in Random Forest regression. We conduct extensive simulation studies to evaluate its performance compared to existing methods under various data settings, in terms of the prediction mean squared error. The results show that our proposed method is competitive with existing methods and outperforms all other methods in high-dimensional (p ≥ 300) nonlinearly related data sets. The applicability of the proposed method is also illustrated by two real data applications.
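The description contrasts Pearson correlation with distance correlation and mentions a filter that ranks features by distance correlation before Random Forest regression. The following is a minimal sketch of that idea, not the authors' implementation: it computes the standard sample distance correlation for scalar variables in pure Python and keeps the top-k features. The function names `distance_correlation` and `rank_features` and the top-k rule are assumptions made for illustration; this record does not state the selection rule used in the paper.

```python
import math


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences of v."""
    n = len(v)
    d = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    # The distance matrix is symmetric, so row means equal column means.
    row = [sum(r) / n for r in d]
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def _mean_product(a, b):
    """Mean of the elementwise product of two n x n matrices."""
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(x, y):
    """Sample distance correlation of two equal-length scalar sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = _mean_product(a, b)  # squared sample distance covariance
    denom = math.sqrt(_mean_product(a, a) * _mean_product(b, b))
    if denom == 0:
        return 0.0  # a constant variable carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)


def rank_features(columns, y, k):
    """Hypothetical filter: indices of the k columns with largest score."""
    scores = [distance_correlation(col, y) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order[:k], scores


if __name__ == "__main__":
    x = [i / 10 for i in range(-20, 21)]
    y = [v * v for v in x]  # nonlinear, symmetric: Pearson correlation is 0
    print(distance_correlation(x, y))
```

The selected column indices would then be passed to any Random Forest regressor; the quadratic cost in the sample size is a property of this naive sketch, not a claim about the paper's procedure.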
first_indexed 2024-03-10T22:46:59Z
format Article
id doaj.art-197be3dd251447e597f1e185f5a52394
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-03-10T22:46:59Z
publishDate 2023-08-01
publisher MDPI AG
record_format Article
series Entropy
spelling doaj.art-197be3dd251447e597f1e185f5a52394; 2023-11-19T10:35:01Z; eng; MDPI AG; Entropy; 1099-4300; 2023-08-01; 25; 9; 1250; 10.3390/e25091250; Distance Correlation-Based Feature Selection in Random Forest; Suthakaran Ratnasingam (0); Jose Muñoz-Lopez (1); (0) Department of Mathematics, California State University, San Bernardino, CA 92407, USA; (1) Department of Mathematics, California State University, San Bernardino, CA 92407, USA; https://www.mdpi.com/1099-4300/25/9/1250; feature selection; random forest; Pearson correlation; distance correlation
spellingShingle Suthakaran Ratnasingam
Jose Muñoz-Lopez
Distance Correlation-Based Feature Selection in Random Forest
Entropy
feature selection
random forest
Pearson correlation
distance correlation
title Distance Correlation-Based Feature Selection in Random Forest
title_full Distance Correlation-Based Feature Selection in Random Forest
title_fullStr Distance Correlation-Based Feature Selection in Random Forest
title_full_unstemmed Distance Correlation-Based Feature Selection in Random Forest
title_short Distance Correlation-Based Feature Selection in Random Forest
title_sort distance correlation based feature selection in random forest
topic feature selection
random forest
Pearson correlation
distance correlation
url https://www.mdpi.com/1099-4300/25/9/1250
work_keys_str_mv AT suthakaranratnasingam distancecorrelationbasedfeatureselectioninrandomforest
AT josemunozlopez distancecorrelationbasedfeatureselectioninrandomforest