Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms

Bibliographic Details
Main Authors:	Philip Buczak, Jian-Jia Chen, Markus Pauly
Format:	Article
Language:	English
Published:	MDPI AG 2023-03-01
Series:	Entropy
Subjects:	missing values imputation MICE missForest classification machine learning
Online Access:	https://www.mdpi.com/1099-4300/25/3/521

collection	DOAJ
description	Many datasets in statistical analyses contain missing values. As omitting observations containing missing entries may lead to information loss or greatly reduce the sample size, imputation is usually preferable. However, imputation can also introduce bias and impact the quality and validity of subsequent analysis. Focusing on binary classification problems, we analyzed how missing value imputation under MCAR as well as MAR missingness with different missing patterns affects the predictive performance of subsequent classification. To this end, we compared imputation methods such as several MICE variants, missForest, Hot Deck as well as mean imputation with regard to the classification performance achieved with commonly used classifiers such as Random Forest, Extreme Gradient Boosting, Support Vector Machine and regularized logistic regression. Our simulation results showed that Random Forest based imputation (i.e., MICE Random Forest and missForest) performed particularly well in most scenarios studied. In addition to these two methods, simple mean imputation also proved to be useful, especially when many features (covariates) contained missing values.
first_indexed	2024-03-11T06:35:24Z
id	doaj.art-f05fb1b972df4192acb3f24ec398effb
institution	Directory Open Access Journal
issn	1099-4300
last_indexed	2024-03-11T06:35:24Z
doi	10.3390/e25030521
volume	25
issue	3
article_number	521
affiliation	Philip Buczak: Department of Statistics, TU Dortmund University, 44227 Dortmund, Germany
affiliation	Jian-Jia Chen: Department of Computer Science, TU Dortmund University, 44227 Dortmund, Germany
affiliation	Markus Pauly: Department of Statistics, TU Dortmund University, 44227 Dortmund, Germany

Similar Items