A real data-driven simulation strategy to select an imputation method for mixed-type trait data.

Bibliographic Details
Main Authors:	Jacqueline A May, Zeny Feng, Sarah J Adamowicz
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2023-03-01
Series:	PLoS Computational Biology
Online Access:	https://doi.org/10.1371/journal.pcbi.1010154

collection	DOAJ
description	Missing observations in trait datasets pose an obstacle for analyses in myriad biological disciplines. Considering the mixed results of imputation, the wide variety of available methods, and the varied structure of real trait datasets, a framework for selecting a suitable imputation method is advantageous. We invoked a real data-driven simulation strategy to select an imputation method for a given mixed-type (categorical, count, continuous) target dataset. Candidate methods included mean/mode imputation, k-nearest neighbour, random forests, and multivariate imputation by chained equations (MICE). Using a trait dataset of squamates (lizards and amphisbaenians; order: Squamata) as a target dataset, a complete-case dataset consisting of species with nearly complete information was formed for the imputation method selection. Missing data were induced by removing values from this dataset under different missingness mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). For each method, combinations with and without phylogenetic information from single-gene (nuclear and mitochondrial) or multigene trees were used to impute the missing values for five numerical and two categorical traits. The performances of the methods were evaluated under each missingness mechanism by determining the mean squared error and proportion falsely classified rates for numerical and categorical traits, respectively. A random forest method supplemented with a nuclear-derived phylogeny resulted in the lowest error rates for the majority of traits, and this method was used to impute missing values in the original dataset. Data with imputed values reflected the characteristics and distributions of the original data better than complete-case data did. However, caution should be taken when imputing trait data, as phylogeny did not improve performance for every trait or in every scenario. Ultimately, these results support the use of a real data-driven simulation strategy for selecting a suitable imputation method for a given mixed-type trait dataset.
id	doaj.art-e6066688b43c44bfb40649b63e223b46
institution	Directory Open Access Journal
issn	1553-734X 1553-7358
spelling	Jacqueline A May, Zeny Feng, Sarah J Adamowicz. A real data-driven simulation strategy to select an imputation method for mixed-type trait data. PLoS Computational Biology 19(3): e1010154 (2023-03-01). Public Library of Science (PLoS). ISSN 1553-734X, 1553-7358. https://doi.org/10.1371/journal.pcbi.1010154
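
The simulation strategy summarised in the description (remove values from a complete-case dataset under MCAR, MAR, and MNAR, impute them with each candidate method, then compare mean squared error for numerical traits and proportion falsely classified for categorical traits) can be illustrated with a minimal Python sketch. This is not the authors' code. The toy traits (`body_size`, `clutch`, `habitat`), the exponential weighting used to induce missingness, and the simple one-predictor k-nearest-neighbour imputer are all assumptions made for illustration, and phylogenetic information is not modelled.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical complete-case data: one fully observed predictor trait,
# one numerical target trait, and one categorical target trait.
n = 300
body_size = rng.normal(0.0, 1.0, n)
clutch = 0.8 * body_size + rng.normal(0.0, 0.5, n)
habitat = np.where(body_size + rng.normal(0.0, 0.5, n) > 0, "ground", "tree")

def mechanism_score(mechanism, target_numeric, driver):
    """Score that drives removal: nothing (MCAR), an observed trait (MAR), or the target itself (MNAR)."""
    if mechanism == "MCAR":
        return np.zeros(len(driver))
    if mechanism == "MAR":
        return driver
    if mechanism == "MNAR":
        return target_numeric
    raise ValueError(f"unknown mechanism: {mechanism}")

def missing_mask(score, frac, rng):
    """Mark a fraction of rows as missing, sampled with probability proportional to exp(score)."""
    weights = np.exp(score)
    n_miss = int(round(frac * len(score)))
    idx = rng.choice(len(score), size=n_miss, replace=False, p=weights / weights.sum())
    mask = np.zeros(len(score), dtype=bool)
    mask[idx] = True
    return mask

def mode(values):
    """Most frequent label in an array."""
    labels, counts = np.unique(values, return_counts=True)
    return labels[np.argmax(counts)]

def baseline_impute(target, mask, combine):
    """Mean (numerical) or mode (categorical) imputation from the observed values."""
    return np.full(mask.sum(), combine(target[~mask]))

def knn_impute(target, predictor, mask, combine, k=5):
    """Impute each missing value from the k observed rows closest in the predictor trait."""
    obs_x, obs_y = predictor[~mask], target[~mask]
    return np.array([combine(obs_y[np.argsort(np.abs(obs_x - x))[:k]])
                     for x in predictor[mask]])

def mse(truth, pred):
    return float(np.mean((truth - pred) ** 2))

def pfc(truth, pred):
    """Proportion falsely classified."""
    return float(np.mean(truth != pred))

# Per-trait settings: raw values, a numeric encoding (used only for MNAR),
# the aggregation used by both imputers, and the error measure.
traits = {
    "clutch": dict(values=clutch, numeric=clutch, combine=np.mean, error=mse),
    "habitat": dict(values=habitat, numeric=(habitat == "ground").astype(float),
                    combine=mode, error=pfc),
}

def run_simulation(n_reps=20, frac=0.2):
    """Average error of each method for every trait and missingness mechanism."""
    results = {}
    for trait, cfg in traits.items():
        for mech in ("MCAR", "MAR", "MNAR"):
            errs = {"baseline": [], "knn": []}
            for _ in range(n_reps):
                score = mechanism_score(mech, cfg["numeric"], body_size)
                mask = missing_mask(score, frac, rng)
                truth = cfg["values"][mask]
                errs["baseline"].append(cfg["error"](
                    truth, baseline_impute(cfg["values"], mask, cfg["combine"])))
                errs["knn"].append(cfg["error"](
                    truth, knn_impute(cfg["values"], body_size, mask, cfg["combine"])))
            results[(trait, mech)] = {m: float(np.mean(e)) for m, e in errs.items()}
    return results

results = run_simulation()
for (trait, mech), scores in results.items():
    print(trait, mech, scores, "-> lowest error:", min(scores, key=scores.get))
```

The method with the lowest average error for most trait and mechanism combinations would then be the one applied to the original, incomplete dataset. In practice the candidate list would also include methods such as random forests and MICE, which are omitted here to keep the sketch short.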

Similar Items