Missing data imputation with fuzzy feature selection for diabetes dataset

Missing data remain a difficulty for data analysis in many research fields, especially in medicine, where they affect the diagnosis and treatment that a patient should receive. In this research, fuzzy c-means (FCM) is used to impute the missing data. However, like i...


Bibliographic Details
Main Authors:	Dzulkalnine, Mohamad Faiz; Sallehuddin, Roselina
Format:	Article
Published:	Springer Nature Switzerland AG 2019
Subjects:	QA75 Electronic computers. Computer science

author	Dzulkalnine, Mohamad Faiz; Sallehuddin, Roselina
collection	ePrints
description	Missing data remain a difficulty for data analysis in many research fields, especially in medicine, where they affect the diagnosis and treatment that a patient should receive. In this research, fuzzy c-means (FCM) is used to impute the missing data. However, like most data imputation methods, FCM does not consider the presence of irrelevant features. Irrelevant features can increase the computational time of the imputation process and decrease the accuracy of the prediction. Feature selection techniques can alleviate this problem by selecting the most relevant features and reducing the dataset size. Fuzzy principal component analysis (FPCA) is used as the feature selection method in this study because, unlike classical PCA, it accounts for outliers, which are the main reason some features are rendered irrelevant. Therefore, an improved hybrid imputation model, FPCA–Support vector machines–FCM (FPCA–SVM–FCM), is proposed and employed in this study. The efficiency of the proposed model is investigated on one dataset, the Pima Indians Diabetes dataset. Experimental results showed that the proposed hybrid imputation model outperforms the existing methods, producing more accurate estimates in terms of accuracy, RMSE and MAE. The proposed method was also validated using the Wilcoxon rank sum test and Theil’s U test and obtained good results compared to SVM–FCM. Therefore, it can be used as an alternative tool for handling missing data in order to obtain a better-quality dataset.
first_indexed	2024-03-05T20:48:16Z
format	Article
id	utm.eprints-89605
institution	Universiti Teknologi Malaysia - ePrints
last_indexed	2024-03-05T20:48:16Z
publishDate	2019
publisher	Springer Nature Switzerland AG
record_format	dspace
spelling	utm.eprints-89605 2021-02-22T06:08:17Z http://eprints.utm.my/89605/ Springer Nature Switzerland AG 2019-04 Article PeerReviewed Dzulkalnine, Mohamad Faiz and Sallehuddin, Roselina (2019) Missing data imputation with fuzzy feature selection for diabetes dataset. SN Applied Sciences, 1 (4). pp. 1-12. ISSN 2523-3963 http://dx.doi.org/10.1007/s42452-019-0383-x DOI:10.1007/s42452-019-0383-x
title	Missing data imputation with fuzzy feature selection for diabetes dataset
topic	QA75 Electronic computers. Computer science
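
As a rough illustration of the FCM imputation step named in the description, the following is a minimal Python sketch. It is not the authors' FPCA–SVM–FCM pipeline: it omits the FPCA feature selection and the SVM stage entirely, and it assumes a generic partial-distance variant of fuzzy c-means in which distances and cluster centres are computed from observed entries only. The function names `fcm_impute`, `rmse` and `mae`, the parameter defaults, and the toy data are all hypothetical choices made for this example.

```python
import numpy as np


def fcm_impute(X, n_clusters=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Fill NaNs in X with membership-weighted cluster centres (generic FCM sketch)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)                 # True where a value is known
    X0 = np.where(observed, X, 0.0)         # zeros stand in for NaNs inside sums
    n, d = X.shape
    rng = np.random.default_rng(seed)

    def centres(U):
        # Weighted mean per feature, using observed entries only.
        W = U ** m
        num = W.T @ X0
        den = W.T @ observed.astype(float)
        return num / np.maximum(den, 1e-12)

    # Random initial memberships; each row sums to 1.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        V = centres(U)
        # Partial squared distance, rescaled by the share of observed features.
        diff = (X0[:, None, :] - V[None, :, :]) * observed[:, None, :]
        n_obs = observed.sum(axis=1)
        D2 = (diff ** 2).sum(axis=2) * (d / np.maximum(n_obs, 1))[:, None]
        D2 = np.maximum(D2, 1e-12)          # avoid division by zero
        # Standard FCM membership update written in terms of squared distances.
        inv = D2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break

    V = centres(U)
    W = U ** m
    estimates = (W @ V) / W.sum(axis=1, keepdims=True)
    return np.where(observed, X, estimates)  # known values are left untouched


def rmse(y_true, y_pred):
    """Root mean squared error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred):
    """Mean absolute error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


if __name__ == "__main__":
    # Toy data (made up for this example): two groups, one missing value in each.
    toy = np.array([[0.1, 0.2], [0.0, np.nan], [0.2, 0.1],
                    [9.9, 10.1], [np.nan, 10.0], [10.2, 9.8]])
    print(fcm_impute(toy, n_clusters=2))
```

When the true values of the masked entries are known, as in a simulated-missingness experiment, `rmse` and `mae` can be applied to the imputed versus true entries to score the imputation, which mirrors two of the error measures named in the description.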

Similar Items