Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

Abstract In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison of model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a separate, additional computation, while models provide built-in feature importance as a byproduct of training, at no extra cost. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.


Bibliographic Details
Main Authors: Huanjing Wang, Qianxin Liang, John T. Hancock, Taghi M. Khoshgoftaar
Format: Article
Language: English
Published: SpringerOpen, 2024-03-01
Series: Journal of Big Data
Subjects: Feature selection; Class imbalance; Credit card fraud; SHAP; Feature importance
Online Access: https://doi.org/10.1186/s40537-024-00905-w
author Huanjing Wang
Qianxin Liang
John T. Hancock
Taghi M. Khoshgoftaar
author_sort Huanjing Wang
collection DOAJ
description Abstract In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison of model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and the model’s built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, it is recommended to use the model’s built-in feature importance list as the primary feature selection method over SHAP. This suggestion is based on the rationale that computing SHAP feature importance is a separate, additional computation, while models provide built-in feature importance as a byproduct of training, at no extra cost. Consequently, opting for the model’s built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.
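The comparison protocol the abstract describes (rank features, keep the top k, retrain, score with AUPRC) can be sketched as follows. This is a minimal illustration, not the authors' code: it uses a synthetic imbalanced dataset rather than the Kaggle fraud data, a single Random Forest rather than all five classifiers, and scikit-learn's permutation importance as a stand-in for the SHAP-value ranking (computing true SHAP values requires the separate `shap` package, e.g. `shap.TreeExplainer`).

```python
# Sketch of the study's protocol: rank features by two methods, keep the
# top k, retrain on that subset, and compare AUPRC (average precision).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data as a stand-in for credit card fraud.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=6,
                           weights=[0.97], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

full = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Ranking 1: the model's built-in (impurity-based) feature importance,
# available for free after training.
builtin_rank = np.argsort(full.feature_importances_)[::-1]

# Ranking 2: permutation importance, a separate post-hoc computation
# (playing the role SHAP values play in the study).
perm = permutation_importance(full, X_te, y_te, n_repeats=5, random_state=0)
perm_rank = np.argsort(perm.importances_mean)[::-1]

def auprc_topk(rank, k):
    """Retrain on the k top-ranked features; return AUPRC on the test set."""
    cols = rank[:k]
    m = RandomForestClassifier(n_estimators=100, random_state=0)
    m.fit(X_tr[:, cols], y_tr)
    return average_precision_score(y_te, m.predict_proba(X_te[:, cols])[:, 1])

for k in (5, 10, 15):
    print(f"k={k:2d}  built-in AUPRC={auprc_topk(builtin_rank, k):.3f}  "
          f"permutation AUPRC={auprc_topk(perm_rank, k):.3f}")
```

AUPRC (scikit-learn's `average_precision_score`) is the natural metric here because, under severe class imbalance, ROC AUC can look deceptively high while precision on the rare fraud class remains poor.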
first_indexed 2024-04-24T16:17:26Z
format Article
id doaj.art-4939df3b3a7e4b0a838189d2e8a7897b
institution Directory Open Access Journal
issn 2196-1115
language English
last_indexed 2024-04-24T16:17:26Z
publishDate 2024-03-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling Journal of Big Data, ISSN 2196-1115, vol. 11, no. 1, pp. 1-16, published 2024-03-01 by SpringerOpen (record doaj.art-4939df3b3a7e4b0a838189d2e8a7897b, indexed 2024-03-31T11:23:07Z); DOI: https://doi.org/10.1186/s40537-024-00905-w
Author affiliations: Huanjing Wang (Ogden College of Science and Engineering, Western Kentucky University); Qianxin Liang, John T. Hancock, and Taghi M. Khoshgoftaar (College of Engineering and Computer Science, Florida Atlantic University)
Keywords: Feature selection; Class imbalance; Credit card fraud; SHAP; Feature importance
title Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods
topic Feature selection
Class imbalance
Credit card fraud
SHAP
Feature importance
url https://doi.org/10.1186/s40537-024-00905-w