Missing Data Imputation for Supervised Learning

Bibliographic Details
Main Authors: Jason Poulos, Rafael Valle
Format: Article
Language: English
Published: Taylor & Francis Group 2018-04-01
Series: Applied Artificial Intelligence
Online Access: http://dx.doi.org/10.1080/08839514.2018.1448143
collection DOAJ
description Missing data imputation can help improve the performance of prediction models in situations where missing data hide useful information. This paper compares methods for imputing missing categorical data for supervised classification tasks. We experiment on two machine learning benchmark datasets with missing categorical data, comparing classifiers trained on non-imputed (i.e., one-hot encoded) or imputed data with different levels of additional missing-data perturbation. We show imputation methods can increase predictive accuracy in the presence of missing-data perturbation, which can actually improve prediction accuracy by regularizing the classifier. We achieve results comparable to the state-of-the-art on the Adult dataset with missing-data perturbation and $k$-nearest-neighbors ($k$-NN) imputation.
first_indexed 2024-03-12T00:37:15Z
id doaj.art-84be66ac51d443ac9c475ee20c821bec
institution Directory Open Access Journal
issn 0883-9514
1087-6545
last_indexed 2024-03-12T00:37:15Z
spelling doaj.art-84be66ac51d443ac9c475ee20c821bec
2023-09-15T09:33:56Z
eng
Taylor & Francis Group
Applied Artificial Intelligence
0883-9514
1087-6545
2018-04-01
vol. 32, no. 2, pp. 186-196
10.1080/08839514.2018.1448143
1448143
Missing Data Imputation for Supervised Learning
Jason Poulos, Departments of Political Science and Electrical Engineering and Computer Sciences, University of California
Rafael Valle, Departments of Political Science and Electrical Engineering and Computer Sciences, University of California