A comparison of imputation methods for categorical data

Objectives: Missing data is commonplace in clinical databases, which are increasingly used for research. When missing data is ignored, analysis results may become biased and unrepresentative. Clinical databases contain mainly categorical variables. This study aims to assess t...

Bibliographic Details
Main Authors:	Shaheen MZ. Memon, Robert Wamala, Ignace H. Kabano
Format:	Article
Language:	English
Published:	Elsevier 2023-01-01
Series:	Informatics in Medicine Unlocked
Subjects:	Imputation Categorical variables Precision score Single imputation Multiple imputation
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352914823002289

collection	DOAJ
description	Objectives: Missing data is commonplace in clinical databases, which are increasingly used for research. When missing data is ignored, analysis results may become biased and unrepresentative. Clinical databases contain mainly categorical variables. This study aims to assess the methods used for imputation in categorical variables.
Materials and methods: We used data extracted from paper-based maternal health records from Kawempe National Referral Hospital, Uganda. In an empirical analysis, we compared the following imputation methods for categorical data: Mode, K-Nearest Neighbors (KNN), Random Forest (RF), Sequential Hot-Deck (SHD), and Multiple Imputation by Chained Equations (MICE). The five imputation methods were first compared by their accuracy in predicting the missing values. Next, they were compared by the predictive accuracy of the outcome variable in four classifiers. The consistency of performance of the imputation methods across different levels of missing data (5%–50%) was assessed by Kendall's W test.
Results: KNN imputation had the highest precision score across the tested levels (5%–50%) of data missing completely at random (MCAR). At lower proportions of missing data (5%, 10%, 15%, 20%), RF imputation had the second-highest precision score. SHD imputation had the worst precision at all levels of missing data. In the prediction of the outcome, the methods performed differently at all proportions of missing data in the four classifiers. Even though KNN imputation was the best method for predicting the missing values, it did not consistently enhance the predictive accuracy of the classifiers at all levels of missing data. Our findings show that a high precision score of an imputation method does not translate into higher predictive accuracy in classifiers.
Conclusions: KNN imputation is the best method for predicting missing values in categorical variables. There is no universal best imputation method that yields the highest predictive accuracy at all proportions of missing data.
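The evaluation loop described in the abstract (hide known values completely at random, impute them, score how many are recovered, then check how consistently the methods rank across missingness levels) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the integer coding of categories, the `MISSING` sentinel, the Hamming-distance neighbour search, the "share of hidden cells recovered" score, and every function name are assumptions made for the example. Only Mode and KNN are shown, and Kendall's W is computed without a tie correction.

```python
import numpy as np

MISSING = -1  # assumed sentinel for a missing category code (hypothetical)

def mask_mcar(X, rate, rng):
    """Hide a fraction `rate` of cells completely at random (MCAR)."""
    mask = rng.random(X.shape) < rate
    X_miss = X.copy()
    X_miss[mask] = MISSING
    return X_miss, mask

def mode_impute(X_miss):
    """Fill each column's missing cells with its most frequent observed code."""
    X_imp = X_miss.copy()
    for j in range(X_imp.shape[1]):
        col = X_imp[:, j]  # a view, so assignments update X_imp
        observed = col[col != MISSING]
        if observed.size == 0:
            continue  # nothing to learn from; leave the column untouched
        values, counts = np.unique(observed, return_counts=True)
        col[col == MISSING] = values[np.argmax(counts)]
    return X_imp

def knn_impute(X_miss, k=5):
    """Fill each missing cell with the majority code among the k nearest donors.

    Distance is the share of mismatching codes over columns observed in both
    rows (a Hamming-style distance; an assumption, not the paper's choice).
    """
    X_imp = X_miss.copy()
    for i in range(X_miss.shape[0]):
        for j in np.where(X_miss[i] == MISSING)[0]:
            donors = np.where(X_miss[:, j] != MISSING)[0]
            if donors.size == 0:
                continue
            both = (X_miss[donors] != MISSING) & (X_miss[i] != MISSING)
            diff = (X_miss[donors] != X_miss[i]) & both
            n_both = both.sum(axis=1)
            # Donors sharing no observed column get the maximum distance 1.0
            dist = np.where(n_both > 0, diff.sum(axis=1) / np.maximum(n_both, 1), 1.0)
            nearest = donors[np.argsort(dist, kind="stable")[:k]]
            values, counts = np.unique(X_miss[nearest, j], return_counts=True)
            X_imp[i, j] = values[np.argmax(counts)]
    return X_imp

def recovery_rate(X_true, X_imp, mask):
    """Share of hidden cells whose imputed code equals the true code."""
    return float((X_true[mask] == X_imp[mask]).mean())

def kendalls_w(scores):
    """Kendall's W for a (levels x methods) score table, higher score = better.

    Ties are broken arbitrarily; no tie correction is applied.
    """
    m, n = scores.shape
    ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - m * (n + 1) / 2) ** 2).sum()
    return float(12 * s / (m ** 2 * (n ** 3 - n)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_true = rng.integers(0, 3, size=(200, 4))  # synthetic codes, not the study data
    X_miss, mask = mask_mcar(X_true, rate=0.10, rng=rng)
    for name, impute in [("mode", mode_impute), ("knn", knn_impute)]:
        print(name, recovery_rate(X_true, impute(X_miss), mask))
```

Repeating the masking step at each missingness level and stacking the per-method scores into one row per level gives the table that `kendalls_w` expects; a value near 1 would indicate that the methods keep the same ranking across levels.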
first_indexed	2024-03-11T15:05:14Z
id	doaj.art-e09ede66a247432596c4b20e295f0850
institution	Directory of Open Access Journals
issn	2352-9148
last_indexed	2024-03-11T15:05:14Z
spelling	doaj.art-e09ede66a247432596c4b20e295f0850; 2023-10-30T06:05:17Z; eng; Elsevier; Informatics in Medicine Unlocked; 2352-9148; 2023-01-01; 42; 101382; A comparison of imputation methods for categorical data; Shaheen MZ. Memon (0), Robert Wamala (1), Ignace H. Kabano (2); (0) African Centre of Excellence in Data Science, University of Rwanda, P.O. Box 4285 Kigali-Rwanda, Kigali, Rwanda; Corresponding author. (1) Makerere University, P.O. Box 7062, Kampala, Uganda. (2) African Centre of Excellence in Data Science, University of Rwanda, P.O. Box 4285 Kigali-Rwanda, Kigali, Rwanda; http://www.sciencedirect.com/science/article/pii/S2352914823002289; Imputation, Categorical variables, Precision score, Single imputation, Multiple imputation

Similar Items