Performance discrepancy mitigation in heart disease prediction for multisensory inter-datasets

Heart disease is one of the primary causes of morbidity and death worldwide. Millions of people have heart attacks every year, and only early-stage predictions can help to reduce the number. Researchers are working on designing and developing early-stage prediction systems using different advanc...


Bibliographic Details
Main Authors:	Mahmudul Hasan, Md Abdus Sahid, Md Palash Uddin, Md Abu Marjan, Seifedine Kadry, Jungeun Kim
Format:	Article
Language:	English
Published:	PeerJ Inc. 2024-03-01
Series:	PeerJ Computer Science
Subjects:	Inter-dataset; Performance discrepancy; Dimensionality reduction; Heart disease prediction; Machine learning
Online Access:	https://peerj.com/articles/cs-1917.pdf

author	Mahmudul Hasan, Md Abdus Sahid, Md Palash Uddin, Md Abu Marjan, Seifedine Kadry, Jungeun Kim
collection	DOAJ
description	Heart disease is one of the primary causes of morbidity and death worldwide. Millions of people have heart attacks every year, and only early-stage predictions can help to reduce the number. Researchers are working on designing and developing early-stage prediction systems using different advanced technologies, and machine learning (ML) is one of them. Almost all existing ML-based works use the same dataset (intra-dataset) for the training and validation of their method. In particular, they do not consider inter-dataset performance checks, where different datasets are used in the training and testing phases. In an inter-dataset setup, existing ML models perform poorly, which is known as the inter-dataset discrepancy problem. This work focuses on mitigating the inter-dataset discrepancy problem by considering five available heart disease datasets and their combined form. All potential training and testing mode combinations are systematically executed to assess discrepancies before and after applying the proposed methods. Imbalanced data handling using SMOTE-Tomek, feature selection using random forest (RF), and feature extraction using principal component analysis (PCA), together with a long preprocessing pipeline, are used to mitigate the inter-dataset discrepancy problem. The preprocessing pipeline builds on missing value handling using RF regression, log transformation, outlier removal, normalization, and data balancing, which convert the datasets into a more ML-centric form. Support vector machine, K-nearest neighbors, decision tree, RF, eXtreme Gradient Boosting, Gaussian naive Bayes, logistic regression, and multilayer perceptron are used as classifiers. Experimental results show that feature selection and classification using RF produce better results than other combination strategies in both single- and inter-dataset setups. RF achieves 100% accuracy in certain configurations of individual datasets and 96% accuracy during the feature selection phase in an inter-dataset setup, with commendable precision, recall, F1 score, specificity, and AUC score. The results indicate that an effective preprocessing technique has the potential to improve the performance of the ML model without necessitating the development of intricate prediction models. Addressing inter-dataset discrepancies introduces a novel research avenue, enabling the amalgamation of identical features from various datasets to construct a comprehensive global dataset within a specific domain.
first_indexed	2024-04-24T21:50:12Z
format	Article
id	doaj.art-819c1deb9a8d48c9a81bb620c5cb2f91
institution	Directory of Open Access Journals
issn	2376-5992
language	English
last_indexed	2024-04-24T21:50:12Z
publishDate	2024-03-01
publisher	PeerJ Inc.
record_format	Article
series	PeerJ Computer Science
spelling	doaj.art-819c1deb9a8d48c9a81bb620c5cb2f91; 2024-03-20T15:05:10Z; eng; PeerJ Inc.; PeerJ Computer Science; ISSN 2376-5992; 2024-03-01; volume 10; article e1917; DOI 10.7717/peerj-cs.1917
	Mahmudul Hasan, Md Abdus Sahid, Md Palash Uddin, Md Abu Marjan: Department of Computer Science and Engineering, Hajee Mohammad Danesh Science and Technology University, Dinajpur, Bangladesh
	Seifedine Kadry: Department of Electrical and Computer Engineering, Lebanese American University, Byblos, Lebanon
	Jungeun Kim: Department of Software, Kongju National University, Cheonan, Republic of South Korea
title	Performance discrepancy mitigation in heart disease prediction for multisensory inter-datasets
topic	Inter-dataset; Performance discrepancy; Dimensionality reduction; Heart disease prediction; Machine learning
url	https://peerj.com/articles/cs-1917.pdf
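
The description in this record says that all training and testing combinations are run over five heart disease datasets and their combined form. As a rough illustration only, the sketch below enumerates such a grid and labels each pairing as intra- or inter-dataset. The dataset names are placeholders, since the record does not name the five datasets, and the function `evaluation_grid` is hypothetical and not taken from the article. How the authors treat pairings that involve the combined set is not stated in the abstract, so the labeling here is an assumption.

```python
from itertools import product

# Placeholder names: the record does not list the five datasets.
DATASETS = ["dataset_1", "dataset_2", "dataset_3", "dataset_4", "dataset_5"]
COMBINED = "combined"


def evaluation_grid(datasets, combined=COMBINED):
    """Return (train, test, mode) triples for every train/test pairing.

    A pairing is labeled intra-dataset when the same source is used for
    training and testing, and inter-dataset otherwise. Pairings with the
    combined set overlap with its members; this sketch still labels them
    inter-dataset, which is an assumption.
    """
    sources = list(datasets) + [combined]
    grid = []
    for train, test in product(sources, repeat=2):
        mode = "intra-dataset" if train == test else "inter-dataset"
        grid.append((train, test, mode))
    return grid


grid = evaluation_grid(DATASETS)
inter = [entry for entry in grid if entry[2] == "inter-dataset"]
print(len(grid), "pairings,", len(inter), "of them inter-dataset")
```

In a real experiment each triple would drive one train-and-evaluate run, so that scores for intra- and inter-dataset pairings can be compared before and after preprocessing.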


Similar Items