Imputation of clinical covariates in time series

Abstract: Missing data is a common problem in longitudinal datasets, which include multiple instances of the same individual observed at different points in time. We introduce a new approach, MedImpute, for imputing missing clinical covariates in multivariate panel data. This approach i...

Bibliographic Details
Main Authors:	Bertsimas, Dimitris; Orfanoudaki, Agni; Pawlowski, Colin
Other Authors:	Massachusetts Institute of Technology. Operations Research Center
Format:	Article
Language:	English
Published:	Springer US 2021
Online Access:	https://hdl.handle.net/1721.1/131956

author	Bertsimas, Dimitris; Orfanoudaki, Agni; Pawlowski, Colin
author2	Massachusetts Institute of Technology. Operations Research Center
author_facet	Massachusetts Institute of Technology. Operations Research Center; Bertsimas, Dimitris; Orfanoudaki, Agni; Pawlowski, Colin
author_sort	Bertsimas, Dimitris
collection	MIT
description	Abstract: Missing data is a common problem in longitudinal datasets, which include multiple instances of the same individual observed at different points in time. We introduce a new approach, MedImpute, for imputing missing clinical covariates in multivariate panel data. This approach integrates patient-specific information into an optimization formulation that can be adjusted for different imputation algorithms. We present the formulation for a K-nearest neighbors model and derive a corresponding scalable first-order method, med.knn. Our algorithm provides imputations for datasets with both continuous and categorical features and with observations occurring at arbitrary points in time. In computational experiments on three real-world clinical datasets, we test its performance on imputation and downstream predictive tasks, varying the percentage of missing data, the number of observations per patient, and the mechanism of missing data. The proposed method improves upon both the imputation accuracy and the downstream predictive performance of the best of the benchmark imputation methods considered. We show that this edge is consistently present in both longitudinal and electronic health records datasets, as well as in both binary classification and regression settings. In computational experiments on synthetic data, we test the scalability of this algorithm on large datasets, and we show that an efficient method for hyperparameter tuning scales to datasets with tens of thousands of observations and hundreds of covariates while maintaining high imputation accuracy.
first_indexed	2024-09-23T09:41:11Z
format	Article
id	mit-1721.1/131956
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T09:41:11Z
publishDate	2021
publisher	Springer US
record_format	dspace
spelling	mit-1721.1/131956 2024-01-02T19:15:26Z 2021-09-20T17:41:04Z 2021-09-20T17:41:04Z 2020-11-10 2021-01-26T04:41:13Z Article http://purl.org/eprint/type/JournalArticle https://hdl.handle.net/1721.1/131956 en https://doi.org/10.1007/s10994-020-05923-2 Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature application/pdf Springer US Springer US
title	Imputation of clinical covariates in time series
url	https://hdl.handle.net/1721.1/131956

Similar Items