A Benchmark for Data Imputation Methods

With the increasing importance and complexity of data pipelines, data quality has become one of the key challenges in modern software applications. Its importance has been recognized beyond the field of data engineering and database management systems (DBMSs): for machine learning (ML) applications, too, high data quality standards are crucial to ensure robust predictive performance and responsible use of automated decision making. One of the most frequent data quality problems is missing values. Incomplete datasets can break data pipelines and, when undetected, can have a devastating impact on downstream ML applications. While statisticians and, more recently, ML researchers have introduced a variety of approaches to impute missing values, comprehensive benchmarks comparing classical and modern imputation approaches under fair and realistic conditions are underrepresented. Here, we aim to fill this gap. We conduct a comprehensive suite of experiments on a large number of datasets with heterogeneous data and realistic missingness conditions, comparing both novel deep learning approaches and classical ML imputation methods when either only the test data, or both the training and test data, are affected by missing values. Each imputation method is evaluated with respect to both its imputation quality and the impact the imputation has on a downstream ML task. Our results provide valuable insights into the performance of a variety of imputation methods under realistic conditions. We hope that they help researchers and engineers choose data preprocessing methods for automated data quality improvement.
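The abstract describes a two-axis evaluation protocol: inject missingness (e.g., MCAR, one of the record's subject terms), impute, and then score both the imputation quality itself and the downstream ML performance. The following Python sketch illustrates that protocol under stated assumptions; it is not the authors' benchmark code, and the dataset, imputers, classifier, and metrics are placeholder choices made for illustration only.

```python
"""Hypothetical sketch of the evaluation protocol the abstract describes
(not the paper's actual benchmark): inject MCAR missingness, impute with
two illustrative methods, and score imputation quality plus downstream
classification performance."""
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, f1_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def inject_mcar(X, fraction, rng):
    """Set a random `fraction` of cells to NaN (missing completely at random)."""
    X = X.copy()
    mask = rng.random(X.shape) < fraction
    X[mask] = np.nan
    return X, mask

# Scenario from the abstract: both training and test data are affected.
X_train_miss, _ = inject_mcar(X_train, 0.3, rng)
X_test_miss, test_mask = inject_mcar(X_test, 0.3, rng)

for name, imputer in [("mean", SimpleImputer(strategy="mean")),
                      ("iterative", IterativeImputer(random_state=0))]:
    X_train_imp = imputer.fit_transform(X_train_miss)
    X_test_imp = imputer.transform(X_test_miss)

    # Axis 1 -- imputation quality: error on the artificially removed cells.
    rmse = mean_squared_error(X_test[test_mask], X_test_imp[test_mask]) ** 0.5

    # Axis 2 -- downstream impact: fit and score a classifier on imputed data.
    clf = RandomForestClassifier(random_state=0).fit(X_train_imp, y_train)
    f1 = f1_score(y_test, clf.predict(X_test_imp))
    print(f"{name:>9}: imputation RMSE={rmse:.3f}, downstream F1={f1:.3f}")
```

Holding the injected missingness mask fixed across imputers, as above, is what makes the comparison fair: every method sees the same incomplete data and is scored on the same held-out cells.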

Bibliographic Details
Main Authors: Sebastian Jäger, Arndt Allhorn, Felix Bießmann
Format: Article
Language: English
Published: Frontiers Media S.A., 2021-07-01
Series: Frontiers in Big Data
ISSN: 2624-909X
DOI: 10.3389/fdata.2021.693674
Subjects: data quality; data cleaning; imputation; missing data; benchmark; MCAR
Online Access: https://www.frontiersin.org/articles/10.3389/fdata.2021.693674/full