Variable Importance Analysis in Imbalanced Datasets: A New Approach

Decision-making using machine learning requires a deep understanding of the model under analysis. Variable importance analysis provides the tools to assess the importance of input variables when dealing with complex interactions, making the machine learning model more interpretable and computational...

Bibliographic Details
Main Authors: Ismael Ahrazem Dfuf, Joaquin Forte Perez-Minayo, Jose Manuel Mira Mcwilliams, Camino Gonzalez Fernandez
Format: Article
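Language: English
Published: IEEE 2020-01-01
Series: IEEE Access
Subjects: Covid-19 pandemic; electricity market; global sensitivity analysis; multiclass classification problem; multivariate response scenario; variable importance analysis
Online Access: https://ieeexplore.ieee.org/document/9138401/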
author Ismael Ahrazem Dfuf
Joaquin Forte Perez-Minayo
Jose Manuel Mira Mcwilliams
Camino Gonzalez Fernandez
author_sort Ismael Ahrazem Dfuf
collection DOAJ
description Decision-making with machine learning requires a deep understanding of the model under analysis. Variable importance analysis provides tools to assess the importance of input variables in the presence of complex interactions, making the machine learning model more interpretable and computationally more efficient. In classification problems with imbalanced datasets, this task is even more challenging. In this article, we present two variable importance techniques: a nonparametric solution, called mh-χ², and a parametric method based on Global Sensitivity Analysis (GSA). The mh-χ² method employs a multivariate continuous response framework to deal with the multiclass classification problem. Based on the permutation importance framework, the proposed mh-χ² algorithm captures the dissimilarities between the distributions of misclassification errors generated by the base learner, a Conditional Inference Tree, before and after permuting the values of the input variable under analysis. The GSA solution is based on the covariance decomposition methodology for multivariate output models. Both solutions are assessed in a comparative study of several Random Forest-based techniques, with emphasis on the multiclass classification problem under different imbalanced scenarios. We apply the proposed techniques to two real application cases: first, to quantify the importance of the 35 companies listed in the Spanish market index IBEX35 on the economic, political and social uncertainties reflected in economic newspapers in Spain during the first four months of 2020 due to the COVID-19 pandemic; and second, to assess the impact of energy factors on the occurrence of price spikes in the Spanish electricity market.
first_indexed 2024-12-15T00:34:13Z
format Article
id doaj.art-88719e4d8a4d4f23a7234c5db1f68c5f
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-15T00:34:13Z
publishDate 2020-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-88719e4d8a4d4f23a7234c5db1f68c5f
2022-12-21T22:41:51Z
eng
IEEE
IEEE Access
ISSN 2169-3536
2020-01-01
Volume 8, pages 127404-127430
DOI 10.1109/ACCESS.2020.3008416
IEEE document 9138401
Variable Importance Analysis in Imbalanced Datasets: A New Approach
Ismael Ahrazem Dfuf (https://orcid.org/0000-0002-3141-2142), ETSIT, Universidad Politécnica de Madrid, Madrid, Spain
Joaquin Forte Perez-Minayo (https://orcid.org/0000-0002-2693-6881), ETSII, Universidad Politécnica de Madrid, Madrid, Spain
Jose Manuel Mira Mcwilliams, ETSII, Universidad Politécnica de Madrid, Madrid, Spain
Camino Gonzalez Fernandez, ETSII, Universidad Politécnica de Madrid, Madrid, Spain
title Variable Importance Analysis in Imbalanced Datasets: A New Approach
topic Covid-19 pandemic
electricity market
global sensitivity analysis
multiclass classification problem
multivariate response scenario
variable importance analysis
url https://ieeexplore.ieee.org/document/9138401/
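
The description field above outlines a permutation-based idea: compare the misclassification errors of a base learner before and after permuting one input variable. The following is only a minimal illustrative sketch of that general idea, not the authors' mh-χ² algorithm. Everything in it is an assumption made for illustration: a toy nearest-centroid classifier stands in for the Conditional Inference Tree, per-class error counts stand in for the error distributions, a simple symmetric chi-square-style distance is used as the dissimilarity, errors are measured on the training data, and the synthetic imbalanced dataset and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_centroids(X, y):
    """Toy base learner (assumption): one mean vector per class."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def error_counts(centroids, X, y, classes):
    """Number of misclassified samples for each true class."""
    errs = Counter()
    for row, label in zip(X, y):
        if predict(centroids, row) != label:
            errs[label] += 1
    return [errs[c] for c in classes]


def chi2_distance(before, after):
    """Symmetric chi-square-style distance between two count vectors (assumption)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(after, before) if a + b > 0)


def permutation_scores(X, y, n_repeats=20, seed=0):
    """Average distance between per-class error counts before and after permuting each feature."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    model = fit_centroids(X, y)
    base = error_counts(model, X, y, classes)
    scores = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)  # break the link between feature j and the labels
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            total += chi2_distance(base, error_counts(model, Xp, y, classes))
        scores.append(total / n_repeats)
    return scores


def make_data(seed=1):
    """Hypothetical imbalanced 3-class data: feature 0 is informative, feature 1 is noise."""
    rng = random.Random(seed)
    sizes = {0: 200, 1: 40, 2: 10}
    X, y = [], []
    for label, n in sizes.items():
        for _ in range(n):
            X.append([rng.gauss(3.0 * label, 1.0), rng.gauss(0.0, 1.0)])
            y.append(label)
    return X, y


X, y = make_data()
scores = permutation_scores(X, y)
print(scores)  # one score per feature; larger means permuting it changes the errors more
```

In this sketch the informative feature should receive a much larger score than the noise feature, because permuting it shifts the per-class error counts, most visibly for the minority classes. A faithful implementation would follow the procedure given in the article itself.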