Distance Correlation-Based Feature Selection in Random Forest
The Pearson correlation coefficient (ρ) is a commonly used measure of correlation, but it has...
Main Authors: | Suthakaran Ratnasingam, Jose Muñoz-Lopez |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-08-01 |
Series: | Entropy |
Subjects: | feature selection, random forest, Pearson correlation, distance correlation |
Online Access: | https://www.mdpi.com/1099-4300/25/9/1250 |
_version_ | 1827726217754705920 |
---|---|
author | Suthakaran Ratnasingam Jose Muñoz-Lopez |
author_facet | Suthakaran Ratnasingam Jose Muñoz-Lopez |
author_sort | Suthakaran Ratnasingam |
collection | DOAJ |
description | The Pearson correlation coefficient (ρ) is a commonly used measure of correlation, but it has limitations as it only measures the linear relationship between two numerical variables. The distance correlation measures all types of dependencies between random vectors X and Y in arbitrary dimensions, not just the linear ones. In this paper, we propose a filter method that utilizes distance correlation as a criterion for feature selection in Random Forest regression. We conduct extensive simulation studies to evaluate its performance compared to existing methods under various data settings, in terms of the prediction mean squared error. The results show that our proposed method is competitive with existing methods and outperforms all other methods in high-dimensional (p ≥ 300) nonlinearly related data sets. The applicability of the proposed method is also illustrated by two real data applications. |
first_indexed | 2024-03-10T22:46:59Z |
format | Article |
id | doaj.art-197be3dd251447e597f1e185f5a52394 |
institution | Directory Open Access Journal |
issn | 1099-4300 |
language | English |
last_indexed | 2024-03-10T22:46:59Z |
publishDate | 2023-08-01 |
publisher | MDPI AG |
record_format | Article |
series | Entropy |
spelling | doaj.art-197be3dd251447e597f1e185f5a52394; 2023-11-19T10:35:01Z; eng; MDPI AG; Entropy; 1099-4300; 2023-08-01; 25; 9; 1250; 10.3390/e25091250; Distance Correlation-Based Feature Selection in Random Forest; Suthakaran Ratnasingam, Jose Muñoz-Lopez (both: Department of Mathematics, California State University, San Bernardino, CA 92407, USA); https://www.mdpi.com/1099-4300/25/9/1250; feature selection; random forest; Pearson correlation; distance correlation |
spellingShingle | Suthakaran Ratnasingam Jose Muñoz-Lopez Distance Correlation-Based Feature Selection in Random Forest Entropy feature selection random forest Pearson correlation distance correlation |
title | Distance Correlation-Based Feature Selection in Random Forest |
title_sort | distance correlation based feature selection in random forest |
topic | feature selection random forest Pearson correlation distance correlation |
url | https://www.mdpi.com/1099-4300/25/9/1250 |
work_keys_str_mv | AT suthakaranratnasingam distancecorrelationbasedfeatureselectioninrandomforest AT josemunozlopez distancecorrelationbasedfeatureselectioninrandomforest |
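
The description above refers to distance correlation as a filter criterion for feature selection. As a rough illustration only, the sketch below computes the standard sample distance correlation for two one-dimensional samples and ranks candidate features by it. It is not the authors' procedure: the function names, the `top_k` cutoff, and the toy data are assumptions made for this example, and the record does not say how the paper chooses which features to keep.

```python
import math


def _centered_distances(v):
    """Return the double-centered pairwise distance matrix of a 1-D sample."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row_mean = [sum(r) / n for r in d]
    grand_mean = sum(row_mean) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [
        [d[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def distance_correlation(x, y):
    """Sample distance correlation between two equal-length 1-D samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    a = _centered_distances(x)
    b = _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n**2
    dvar2_x = sum(a[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    dvar2_y = sum(b[i][j] ** 2 for i in range(n) for j in range(n)) / n**2
    denom = math.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0  # a constant sample carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / denom)


def rank_features(columns, y, top_k):
    """Hypothetical filter step: keep the top_k features by distance correlation."""
    scores = {name: distance_correlation(col, y) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (assumed): y depends on x quadratically, so Pearson's rho is zero
# here while the distance correlation is clearly positive.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
candidates = {"x": x, "const": [1.0] * 5, "y_copy": list(y)}
selected = rank_features(candidates, y, top_k=2)
```

The selected feature names could then be passed to any Random Forest regressor; this pure-Python version takes quadratic time and memory in the sample size, so it is meant for illustration rather than large data sets.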