Variable Importance Analysis in Imbalanced Datasets: A New Approach
Decision-making using machine learning requires a deep understanding of the model under analysis. Variable importance analysis provides the tools to assess the importance of input variables when dealing with complex interactions, making the machine learning model more interpretable and computational...
Main Authors: | Ismael Ahrazem Dfuf, Joaquin Forte Perez-Minayo, Jose Manuel Mira Mcwilliams, Camino Gonzalez Fernandez |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE 2020-01-01 |
Series: | IEEE Access |
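Subjects: | Covid-19 pandemic, electricity market, global sensitivity analysis, multiclass classification problem, multivariate response scenario, variable importance analysis |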
Online Access: | https://ieeexplore.ieee.org/document/9138401/ |
_version_ | 1818575008546422784 |
---|---|
author | Ismael Ahrazem Dfuf Joaquin Forte Perez-Minayo Jose Manuel Mira Mcwilliams Camino Gonzalez Fernandez |
author_facet | Ismael Ahrazem Dfuf Joaquin Forte Perez-Minayo Jose Manuel Mira Mcwilliams Camino Gonzalez Fernandez |
author_sort | Ismael Ahrazem Dfuf |
collection | DOAJ |
description | Decision-making using machine learning requires a deep understanding of the model under analysis. Variable importance analysis provides the tools to assess the importance of input variables when dealing with complex interactions, making the machine learning model more interpretable and computationally more efficient. In classification problems with imbalanced datasets, this task is even more challenging. In this article, we present two variable importance techniques: a nonparametric solution, called mh-χ², and a parametric method based on Global Sensitivity Analysis (GSA). The mh-χ² method employs a multivariate continuous response framework to deal with the multiclass classification problem. Based on the permutation importance framework, the proposed mh-χ² algorithm captures the dissimilarities between the distribution of misclassification errors generated by the base learner, a Conditional Inference Tree, before and after permuting the values of the input variable under analysis. The GSA solution is based on the covariance decomposition methodology for multivariate output models. Both solutions are assessed in a comparative study with several Random Forest-based techniques, with emphasis on the multiclass classification problem under different imbalance scenarios. We apply the proposed techniques to two real application cases: first, to quantify the importance of the 35 companies listed in the Spanish market index IBEX35 on the economic, political and social uncertainties reflected in economic newspapers in Spain during the first four months of 2020 due to the COVID-19 pandemic; and second, to assess the impact of energy factors on the occurrence of price spikes in the Spanish electricity market. |
first_indexed | 2024-12-15T00:34:13Z |
format | Article |
id | doaj.art-88719e4d8a4d4f23a7234c5db1f68c5f |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-15T00:34:13Z |
publishDate | 2020-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-88719e4d8a4d4f23a7234c5db1f68c5f; 2022-12-21T22:41:51Z; eng; IEEE; IEEE Access; 2169-3536; 2020-01-01; 8; 127404; 127430; 10.1109/ACCESS.2020.3008416; 9138401; Variable Importance Analysis in Imbalanced Datasets: A New Approach; Ismael Ahrazem Dfuf 0 https://orcid.org/0000-0002-3141-2142; Joaquin Forte Perez-Minayo 1 https://orcid.org/0000-0002-2693-6881; Jose Manuel Mira Mcwilliams 2; Camino Gonzalez Fernandez 3; ETSIT, Universidad Politécnica de Madrid, Madrid, Spain; ETSII, Universidad Politécnica de Madrid, Madrid, Spain; ETSII, Universidad Politécnica de Madrid, Madrid, Spain; ETSII, Universidad Politécnica de Madrid, Madrid, Spain; Decision-making using machine learning requires a deep understanding of the model under analysis. Variable importance analysis provides the tools to assess the importance of input variables when dealing with complex interactions, making the machine learning model more interpretable and computationally more efficient. In classification problems with imbalanced datasets, this task is even more challenging. In this article, we present two variable importance techniques: a nonparametric solution, called mh-χ², and a parametric method based on Global Sensitivity Analysis (GSA). The mh-χ² method employs a multivariate continuous response framework to deal with the multiclass classification problem. Based on the permutation importance framework, the proposed mh-χ² algorithm captures the dissimilarities between the distribution of misclassification errors generated by the base learner, a Conditional Inference Tree, before and after permuting the values of the input variable under analysis. The GSA solution is based on the covariance decomposition methodology for multivariate output models. Both solutions are assessed in a comparative study with several Random Forest-based techniques, with emphasis on the multiclass classification problem under different imbalance scenarios. We apply the proposed techniques to two real application cases: first, to quantify the importance of the 35 companies listed in the Spanish market index IBEX35 on the economic, political and social uncertainties reflected in economic newspapers in Spain during the first four months of 2020 due to the COVID-19 pandemic; and second, to assess the impact of energy factors on the occurrence of price spikes in the Spanish electricity market.; https://ieeexplore.ieee.org/document/9138401/; Covid-19 pandemic; electricity market; global sensitivity analysis; multiclass classification problem; multivariate response scenario; variable importance analysis |
spellingShingle | Ismael Ahrazem Dfuf Joaquin Forte Perez-Minayo Jose Manuel Mira Mcwilliams Camino Gonzalez Fernandez Variable Importance Analysis in Imbalanced Datasets: A New Approach IEEE Access Covid-19 pandemic electricity market global sensitivity analysis multiclass classification problem multivariate response scenario variable importance analysis |
title | Variable Importance Analysis in Imbalanced Datasets: A New Approach |
title_full | Variable Importance Analysis in Imbalanced Datasets: A New Approach |
title_fullStr | Variable Importance Analysis in Imbalanced Datasets: A New Approach |
title_full_unstemmed | Variable Importance Analysis in Imbalanced Datasets: A New Approach |
title_short | Variable Importance Analysis in Imbalanced Datasets: A New Approach |
title_sort | variable importance analysis in imbalanced datasets a new approach |
topic | Covid-19 pandemic electricity market global sensitivity analysis multiclass classification problem multivariate response scenario variable importance analysis |
url | https://ieeexplore.ieee.org/document/9138401/ |
work_keys_str_mv | AT ismaelahrazemdfuf variableimportanceanalysisinimbalanceddatasetsanewapproach AT joaquinforteperezminayo variableimportanceanalysisinimbalanceddatasetsanewapproach AT josemanuelmiramcwilliams variableimportanceanalysisinimbalanceddatasetsanewapproach AT caminogonzalezfernandez variableimportanceanalysisinimbalanceddatasetsanewapproach |
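
The description above outlines a permutation-based idea: compare the distribution of misclassification errors before and after permuting one input variable. The following is a minimal sketch of that general idea only, and it is not the authors' mh-χ² algorithm. It assumes a nearest-centroid classifier in place of the Conditional Inference Tree, a simple symmetric chi-square distance on per-class error counts, evaluation on the training data, and a synthetic imbalanced three-class dataset. All function names (`fit_centroids`, `error_counts`, `chi2_distance`, `permutation_importance`) are hypothetical.

```python
import random
from collections import Counter


def fit_centroids(X, y):
    """Stand-in base learner (assumption): one mean vector per class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def error_counts(centroids, X, y, classes):
    """Number of misclassified observations per true class."""
    errs = Counter()
    for row, label in zip(X, y):
        if predict(centroids, row) != label:
            errs[label] += 1
    return [errs[c] for c in classes]


def chi2_distance(before, after):
    """Symmetric chi-square distance between two count vectors (assumed metric)."""
    return sum(
        (a - b) ** 2 / (a + b) for a, b in zip(after, before) if a + b > 0
    )


def permutation_importance(X, y, feature, n_repeats=20, seed=0):
    """Average distance between error distributions before/after permuting a feature."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    model = fit_centroids(X, y)
    base = error_counts(model, X, y, classes)
    total = 0.0
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)  # break the link between this feature and the labels
        Xp = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
        total += chi2_distance(base, error_counts(model, Xp, y, classes))
    return total / n_repeats


# Synthetic imbalanced data (assumption): class sizes 200 / 40 / 10.
# Feature 0 separates the classes; feature 1 is pure noise.
rng = random.Random(42)
X, y = [], []
for label, (size, mean) in enumerate([(200, 0.0), (40, 3.0), (10, 6.0)]):
    for _ in range(size):
        X.append([rng.gauss(mean, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)

imp = [permutation_importance(X, y, j) for j in range(2)]
print("importance of informative feature:", imp[0])
print("importance of noise feature:", imp[1])
```

In this sketch the informative feature should receive a clearly larger score than the noise feature, because permuting it shifts many minority-class observations into the error counts. A faithful implementation would follow the multivariate continuous response framework described in the article.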