Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study
Data completeness is one of the most common challenges that hinder the performance of data analytics platforms. Different studies have assessed the effect of missing values on different classification models based on a single evaluation metric, namely, accuracy. However, accuracy on its own is a misleading measure of classifier performance because it does not consider unbalanced datasets. This paper presents an experimental study that assesses the effect of incomplete datasets on the performance of five classification models. The analysis was conducted with different ratios of missing values in six datasets that vary in size, type, and balance. Moreover, for unbiased analysis, the performance of the classifiers was measured using three different metrics, namely, the Matthews correlation coefficient (MCC), the F1-score, and accuracy. The results show that the sensitivity of the supervised classifiers to missing data differs according to a set of factors. The most significant factor is the missing data pattern and ratio, followed by the imputation method, and then the type, size, and balance of the dataset. The sensitivity of the classifiers when data are missing due to the Missing Completely At Random (MCAR) pattern is less than their sensitivity when data are missing due to the Missing Not At Random (MNAR) pattern. Furthermore, using the MCC as an evaluation measure better reflects the variation in the sensitivity of the classifiers to the missing data.
Main Authors: | Menna Ibrahim Gabr; Yehia Mostafa Helmy; Doaa Saad Elzanfaly
Format: | Article
Language: | English
Published: | MDPI AG, 2023-03-01
Series: | Big Data and Cognitive Computing
Subjects: | data quality; data completeness; missing patterns; imputation techniques; supervised classifiers
Online Access: | https://www.mdpi.com/2504-2289/7/1/55 |
author | Menna Ibrahim Gabr; Yehia Mostafa Helmy; Doaa Saad Elzanfaly
author_sort | Menna Ibrahim Gabr
collection | DOAJ |
description | Data completeness is one of the most common challenges that hinder the performance of data analytics platforms. Different studies have assessed the effect of missing values on different classification models based on a single evaluation metric, namely, accuracy. However, accuracy on its own is a misleading measure of classifier performance because it does not consider unbalanced datasets. This paper presents an experimental study that assesses the effect of incomplete datasets on the performance of five classification models. The analysis was conducted with different ratios of missing values in six datasets that vary in size, type, and balance. Moreover, for unbiased analysis, the performance of the classifiers was measured using three different metrics, namely, the Matthews correlation coefficient (MCC), the F1-score, and accuracy. The results show that the sensitivity of the supervised classifiers to missing data differs according to a set of factors. The most significant factor is the missing data pattern and ratio, followed by the imputation method, and then the type, size, and balance of the dataset. The sensitivity of the classifiers when data are missing due to the Missing Completely At Random (MCAR) pattern is less than their sensitivity when data are missing due to the Missing Not At Random (MNAR) pattern. Furthermore, using the MCC as an evaluation measure better reflects the variation in the sensitivity of the classifiers to the missing data. |
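The abstract argues that accuracy alone is misleading on unbalanced datasets, while the F1-score and the Matthews correlation coefficient expose the failure. A minimal, self-contained Python sketch (illustrative only, not the authors' code; the function name `binary_metrics` is an assumption) that computes all three metrics from binary confusion-matrix counts:

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute accuracy, F1-score, and MCC from two 0/1 label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when no positives exist.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, f1, mcc

# Unbalanced example: 90 negatives, 10 positives; a degenerate
# classifier that predicts the majority class for every sample.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
acc, f1, mcc = binary_metrics(y_true, y_pred)
# acc = 0.9 looks strong, but f1 = 0.0 and mcc = 0.0 expose the failure.
```

On this 90/10 split, accuracy rewards the majority-class guesser with 0.9, while both F1 and MCC collapse to 0, which is the abstract's point about unbalanced data.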
format | Article |
id | doaj.art-fa8a3367893f41e0824de55856259cd6 |
institution | Directory Open Access Journal |
issn | 2504-2289 |
language | English |
publishDate | 2023-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Big Data and Cognitive Computing |
spelling | Record ID: doaj.art-fa8a3367893f41e0824de55856259cd6 (indexed 2023-11-17T09:37:19Z). Language: English. Publisher: MDPI AG. Series: Big Data and Cognitive Computing (ISSN 2504-2289). Published: 2023-03-01, Volume 7, Issue 1, Article 55. DOI: 10.3390/bdcc7010055. Title: Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study. Authors: Menna Ibrahim Gabr and Yehia Mostafa Helmy, Department of Business Information Systems (BIS), Faculty of Commerce and Business Administration, Helwan University, Cairo 11795, Egypt; Doaa Saad Elzanfaly, Department of Information Systems, Faculty of Computer and Artificial Intelligence, Helwan University, Cairo 11795, Egypt. Abstract: as given in the description field above. Online access: https://www.mdpi.com/2504-2289/7/1/55. Keywords: data quality; data completeness; missing patterns; imputation techniques; supervised classifiers |
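The abstract distinguishes the MCAR pattern (missingness independent of the data) from the MNAR pattern (missingness depends on the unobserved value itself) and reports that classifiers are more sensitive to MNAR. A hypothetical sketch of how such patterns can be injected and then mean-imputed; the function names and the "drop the largest values" rule for MNAR are illustrative assumptions, not the paper's procedure:

```python
import random
import statistics

def inject_missing(values, ratio, pattern, rng):
    """Return a copy of `values` with `ratio` of entries set to None.

    MCAR: entries are removed uniformly at random.
    MNAR: the largest values are removed, so missingness depends on
    the (unobserved) value itself.
    """
    out = list(values)
    k = int(len(values) * ratio)
    if pattern == "MCAR":
        idx = rng.sample(range(len(values)), k)
    else:  # MNAR
        idx = sorted(range(len(values)), key=lambda i: values[i])[-k:]
    for i in idx:
        out[i] = None
    return out

def mean_impute(values):
    """Replace None entries with the mean of the observed entries."""
    mean = statistics.mean([v for v in values if v is not None])
    return [mean if v is None else v for v in values]

rng = random.Random(0)
data = [float(x) for x in range(1, 101)]  # true mean = 50.5
mcar = mean_impute(inject_missing(data, 0.3, "MCAR", rng))
mnar = mean_impute(inject_missing(data, 0.3, "MNAR", rng))
# MCAR imputation keeps the feature mean near 50.5; MNAR (dropping the
# top 30 values) biases the imputed mean downward, to exactly 35.5.
```

The design point: under MCAR the observed sample remains representative, so mean imputation roughly preserves the feature distribution, whereas under MNAR the observed sample is systematically biased, which is one reason the classifiers in the study degrade more under MNAR.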
title | Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study |
topic | data quality; data completeness; missing patterns; imputation techniques; supervised classifiers
url | https://www.mdpi.com/2504-2289/7/1/55 |