Evaluation of data imputation strategies in complex, deeply-phenotyped data sets: the case of the EU-AIMS Longitudinal European Autism Project

Abstract An increasing number of large-scale multi-modal research initiatives has been conducted in the typically developing population, e.g. Dev. Cogn. Neur. 32:43-54, 2018; PLoS Med. 12(3):e1001779, 2015; Elam and Van Essen, Enc. Comp. Neur., 2013, as well as in psychiatric cohorts, e.g. Trans. Ps...

Bibliographic Details
Main Authors: A. Llera, M. Brammer, B. Oakley, J. Tillmann, M. Zabihi, J. S. Amelink, T. Mei, T. Charman, C. Ecker, F. Dell’Acqua, T. Banaschewski, C. Moessnang, S. Baron-Cohen, R. Holt, S. Durston, D. Murphy, E. Loth, J. K. Buitelaar, D. L. Floris, C. F. Beckmann
Format: Article
Language: English
Published: BMC 2022-08-01
Series: BMC Medical Research Methodology
Subjects: Imputation, Clinical data, Multivariate, Machine learning
Online Access: https://doi.org/10.1186/s12874-022-01656-z
_version_ 1811283791707111424
author A. Llera
M. Brammer
B. Oakley
J. Tillmann
M. Zabihi
J. S. Amelink
T. Mei
T. Charman
C. Ecker
F. Dell’Acqua
T. Banaschewski
C. Moessnang
S. Baron-Cohen
R. Holt
S. Durston
D. Murphy
E. Loth
J. K. Buitelaar
D. L. Floris
C. F. Beckmann
author_sort A. Llera
collection DOAJ
description Abstract An increasing number of large-scale multi-modal research initiatives have been conducted in the typically developing population, e.g. Dev. Cogn. Neur. 32:43-54, 2018; PLoS Med. 12(3):e1001779, 2015; Elam and Van Essen, Enc. Comp. Neur., 2013, as well as in psychiatric cohorts, e.g. Trans. Psych. 10(1):100, 2020; Mol. Psych. 19:659–667, 2014; Mol. Aut. 8:24, 2017; Eur. Child and Adol. Psych. 24(3):265–281, 2015. Missing data is a common problem in such datasets due to the difficulty of assessing multiple measures on a large number of participants. The consequences of missing data accumulate when researchers aim to integrate relationships across multiple measures. Here we aim to evaluate different imputation strategies to fill in missing values in clinical data from a large (total N = 764) and deeply phenotyped (i.e. assessed with a range of clinical and cognitive instruments) sample of N = 453 autistic individuals and N = 311 control individuals recruited as part of the EU-AIMS Longitudinal European Autism Project (LEAP) consortium. In particular, we consider a total of 160 clinical measures divided into 15 overlapping subsets of participants. We use two simple but common univariate strategies (mean and median imputation) as well as a Round Robin regression approach involving four independent multivariate regression models: Bayesian Ridge regression and several non-linear models (Decision Trees, Extra Trees, and Nearest Neighbours regression). We evaluate the models using the traditional mean squared error on available data that were deliberately removed, and also consider the Kullback–Leibler divergence between the observed and the imputed distributions. We show that all of the multivariate approaches tested provide a substantial improvement compared to typical univariate approaches. Further, our analyses reveal that across all 15 data subsets tested, an Extra Trees regression approach provided the best global results. This not only allows the selection of a unique model to impute missing data for the LEAP project and delivers a fixed set of imputed clinical data to be used by researchers working with the LEAP dataset in the future, but also provides more general guidelines for data imputation in large-scale epidemiological studies.
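The evaluation scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses synthetic two-column data, only the Python standard library, and a simple k-nearest-neighbour regressor as a stand-in for the multivariate models named above. All function names (`impute_univariate`, `round_robin_knn`, `mse`, `kl_divergence`), the histogram-based KL estimate, and the direction KL(observed || imputed) are assumptions made for illustration.

```python
import math
import random
import statistics


def impute_univariate(column, strategy="mean"):
    """Fill None entries with the mean or median of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.mean(observed) if strategy == "mean" else statistics.median(observed)
    return [fill if v is None else v for v in column]


def round_robin_knn(rows, k=3, rounds=3):
    """Round-robin imputation: visit each column in turn and re-predict its
    missing entries from the k most similar rows (k-NN regression).
    Assumes every column has at least one observed value."""
    n_rows, n_cols = len(rows), len(rows[0])
    missing = [[v is None for v in row] for row in rows]
    # Start from a mean-imputed table so that distances are always defined.
    columns = [impute_univariate([row[j] for row in rows]) for j in range(n_cols)]
    filled = [[columns[j][i] for j in range(n_cols)] for i in range(n_rows)]
    for _ in range(rounds):
        for j in range(n_cols):
            donors = [d for d in range(n_rows) if not missing[d][j]]
            others = [c for c in range(n_cols) if c != j]
            for i in range(n_rows):
                if not missing[i][j]:
                    continue
                nearest = sorted(
                    donors,
                    key=lambda d: sum((filled[i][c] - filled[d][c]) ** 2 for c in others),
                )[:k]
                filled[i][j] = sum(filled[d][j] for d in nearest) / len(nearest)
    return filled


def mse(truth, imputed, hidden):
    """Mean squared error restricted to the deliberately hidden entries."""
    return sum((truth[i] - imputed[i]) ** 2 for i in hidden) / len(hidden)


def kl_divergence(p_samples, q_samples, bins=5, eps=1e-9):
    """KL(P || Q) between smoothed histogram estimates on shared bins."""
    lo = min(min(p_samples), min(q_samples))
    hi = max(max(p_samples), max(q_samples))
    width = (hi - lo) / bins or 1.0

    def hist(samples):
        counts = [0] * bins
        for x in samples:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [(c + eps) / (len(samples) + eps * bins) for c in counts]

    p, q = hist(p_samples), hist(q_samples)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Synthetic data (an assumption): y depends linearly on x plus noise.
rng = random.Random(0)
truth = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    truth.append([x, 2.0 * x + rng.gauss(0.0, 0.5)])

# Remove 40 available y values so the imputations can be scored against them.
hidden = rng.sample(range(len(truth)), 40)
masked = [list(row) for row in truth]
for i in hidden:
    masked[i][1] = None

true_y = [row[1] for row in truth]
observed_y = [row[1] for row in masked if row[1] is not None]
candidates = {
    "mean": impute_univariate([row[1] for row in masked], "mean"),
    "median": impute_univariate([row[1] for row in masked], "median"),
    "round_robin_knn": [row[1] for row in round_robin_knn(masked)],
}
results = {}
for name, imputed_y in candidates.items():
    results[name] = (
        mse(true_y, imputed_y, hidden),
        kl_divergence(observed_y, [imputed_y[i] for i in hidden]),
    )
    print(name, results[name])
```

In this toy setting the multivariate imputer can exploit the dependence between the two columns, whereas mean and median imputation place every imputed value at a single point, which inflates both the error and the divergence from the observed distribution. With real clinical data the ranking of methods would need to be checked per data subset, as the abstract describes.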
first_indexed 2024-04-13T02:19:01Z
format Article
id doaj.art-44d70fd44f9d4e4e81ed4988c8f1ebc6
institution Directory Open Access Journal
issn 1471-2288
language English
last_indexed 2024-04-13T02:19:01Z
publishDate 2022-08-01
publisher BMC
record_format Article
series BMC Medical Research Methodology
spelling doaj.art-44d70fd44f9d4e4e81ed4988c8f1ebc6
2022-12-22T03:07:04Z
eng
BMC
BMC Medical Research Methodology
1471-2288
2022-08-01
221115
10.1186/s12874-022-01656-z
Evaluation of data imputation strategies in complex, deeply-phenotyped data sets: the case of the EU-AIMS Longitudinal European Autism Project
A. Llera [0]: Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging
M. Brammer [1]: Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King’s College London
B. Oakley [2]: Department of Forensic and Neurodevelopmental Sciences, Institute of Psychiatry, Psychology, and Neuroscience, King’s College London
J. Tillmann [3]: Roche Pharma Research and Early Development, Neuroscience and Rare Diseases, Roche Innovation Center Basel
M. Zabihi [4]: Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging
J. S. Amelink [5]: Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging
T. Mei [6]: Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging
T. Charman [7]: Department of Psychology, Institute of Psychiatry, Psychology, and Neuroscience, King’s College London
C. Ecker [8]: Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King’s College London
F. Dell’Acqua [9]: Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King’s College London
T. Banaschewski [10]: Child and Adolescent Psychiatry, Central Institute of Mental Health, University of Heidelberg
C. Moessnang [11]: Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, University Hospital Frankfurt Am Main, Goethe University
S. Baron-Cohen [12]: Autism Research Centre, Department of Psychiatry, University of Cambridge
R. Holt [13]: Autism Research Centre, Department of Psychiatry, University of Cambridge
S. Durston [14]: Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht
D. Murphy [15]: Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King’s College London
E. Loth [16]: Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King’s College London
J. K. Buitelaar [17]: Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging
D. L. Floris [18]: Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging
C. F. Beckmann [19]: Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging
https://doi.org/10.1186/s12874-022-01656-z
Imputation
Clinical data
Multivariate
Machine learning
title Evaluation of data imputation strategies in complex, deeply-phenotyped data sets: the case of the EU-AIMS Longitudinal European Autism Project
title_sort evaluation of data imputation strategies in complex deeply phenotyped data sets the case of the eu aims longitudinal european autism project
topic Imputation
Clinical data
Multivariate
Machine learning
url https://doi.org/10.1186/s12874-022-01656-z