Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms

Bibliographic Details
Main Authors: Philip Buczak, Jian-Jia Chen, Markus Pauly
Format: Article
Language: English
Published: MDPI AG 2023-03-01
Series: Entropy
Subjects: missing values, imputation, MICE, missForest, classification, machine learning
Online Access: https://www.mdpi.com/1099-4300/25/3/521
author Philip Buczak
Jian-Jia Chen
Markus Pauly
collection DOAJ
description Many datasets in statistical analyses contain missing values. As omitting observations containing missing entries may lead to information loss or greatly reduce the sample size, imputation is usually preferable. However, imputation can also introduce bias and impact the quality and validity of subsequent analysis. Focusing on binary classification problems, we analyzed how missing value imputation under MCAR as well as MAR missingness with different missing patterns affects the predictive performance of subsequent classification. To this end, we compared imputation methods such as several MICE variants, missForest, Hot Deck as well as mean imputation with regard to the classification performance achieved with commonly used classifiers such as Random Forest, Extreme Gradient Boosting, Support Vector Machine and regularized logistic regression. Our simulation results showed that Random Forest based imputation (i.e., MICE Random Forest and missForest) performed particularly well in most scenarios studied. In addition to these two methods, simple mean imputation also proved to be useful, especially when many features (covariates) contained missing values.
first_indexed 2024-03-11T06:35:24Z
format Article
id doaj.art-f05fb1b972df4192acb3f24ec398effb
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-03-11T06:35:24Z
publishDate 2023-03-01
publisher MDPI AG
record_format Article
series Entropy
spelling doaj.art-f05fb1b972df4192acb3f24ec398effb
2023-11-17T10:57:22Z
eng
MDPI AG
Entropy
1099-4300
2023-03-01
Volume 25, Issue 3, Article 521
10.3390/e25030521
Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms
Philip Buczak (Department of Statistics, TU Dortmund University, 44227 Dortmund, Germany)
Jian-Jia Chen (Department of Computer Science, TU Dortmund University, 44227 Dortmund, Germany)
Markus Pauly (Department of Statistics, TU Dortmund University, 44227 Dortmund, Germany)
https://www.mdpi.com/1099-4300/25/3/521
missing values
imputation
MICE
missForest
classification
machine learning
title Analyzing the Effect of Imputation on Classification Performance under MCAR and MAR Missing Mechanisms
topic missing values
imputation
MICE
missForest
classification
machine learning
url https://www.mdpi.com/1099-4300/25/3/521