Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics

Abstract: When analyzing large datasets from high-throughput technologies, researchers often encounter missing quantitative measurements, which are particularly frequent in metabolomics datasets. Metabolomics, the comprehensive profiling of metabolite abundances, typically relies on mass spe...


Bibliographic Details
Main Authors:	Jonathan P. Dekermanjian, Elin Shaddox, Debmalya Nandy, Debashis Ghosh, Katerina Kechris
Format:	Article
Language:	English
Published:	BMC 2022-05-01
Series:	BMC Bioinformatics
Subjects:	Missing data; Imputation; Machine learning; Metabolomics
Online Access:	https://doi.org/10.1186/s12859-022-04659-1

author	Jonathan P. Dekermanjian, Elin Shaddox, Debmalya Nandy, Debashis Ghosh, Katerina Kechris
collection	DOAJ
description	Abstract: When analyzing large datasets from high-throughput technologies, researchers often encounter missing quantitative measurements, which are particularly frequent in metabolomics datasets. Metabolomics, the comprehensive profiling of metabolite abundances, typically relies on mass spectrometry technologies that often introduce missingness via multiple mechanisms: (1) the metabolite signal may be smaller than the instrument limit of detection; (2) the conditions under which the data are collected and processed may lead to missing values; (3) missing values can be introduced randomly. Missingness resulting from mechanism (1) would be classified as Missing Not At Random (MNAR), that from mechanism (2) as Missing At Random (MAR), and that from mechanism (3) as Missing Completely At Random (MCAR). Two common approaches for handling missing data are the following: (1) omit missing data from the analysis; (2) impute the missing values. Both approaches may introduce bias and reduce statistical power in downstream analyses such as testing metabolite associations with clinical variables. Further, standard imputation methods in metabolomics often ignore the mechanisms causing missingness and inaccurately estimate missing values within a data set. We propose a mechanism-aware imputation algorithm that leverages a two-step approach in imputing missing values. First, we use a random forest classifier to classify the missingness mechanism for each missing value in the data set. Second, we impute each missing value using imputation algorithms that are specific to the predicted missingness mechanism (i.e., MAR/MCAR or MNAR). Using complete data, we conducted simulations in which we imposed different missingness patterns within the data and tested the performance of combinations of imputation algorithms. Our proposed algorithm provided imputations closer to the original data than those using only one imputation algorithm for all the missing values. Consequently, our two-step approach was able to reduce bias for improved downstream analyses.
first_indexed	2024-04-12T15:36:05Z
format	Article
id	doaj.art-ce408c908a7843cfaa11e6d594a04f11
institution	Directory Open Access Journal
issn	1471-2105
language	English
last_indexed	2024-04-12T15:36:05Z
publishDate	2022-05-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-ce408c908a7843cfaa11e6d594a04f11; 2022-12-22T03:26:58Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2022-05-01; 23; 1; 1; 17; 10.1186/s12859-022-04659-1; Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics; Jonathan P. Dekermanjian (0), Elin Shaddox (1), Debmalya Nandy (2), Debashis Ghosh (3), Katerina Kechris (4); Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus (authors 0-4); https://doi.org/10.1186/s12859-022-04659-1; Missing data; Imputation; Machine learning; Metabolomics
title	Mechanism-aware imputation: a two-step approach in handling missing values in metabolomics
topic	Missing data; Imputation; Machine learning; Metabolomics
url	https://doi.org/10.1186/s12859-022-04659-1
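
The two-step idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the random forest classifier of step one is replaced by the known labels from a simulated mask, and the two fill rules (half of the observed minimum for MNAR, the observed mean for MAR/MCAR) are simple stand-ins chosen for illustration. All function names, the detection limit, and the missingness rate are hypothetical.

```python
import random
import statistics

MNAR = "MNAR"
MAR_MCAR = "MAR/MCAR"


def impose_missingness(values, lod, mcar_rate, rng):
    """Mask a complete vector so that the true mechanism of each gap is known.

    Values below the (hypothetical) limit of detection `lod` become MNAR;
    each remaining value is dropped with probability `mcar_rate`.
    """
    masked, labels = [], []
    for v in values:
        if v < lod:
            masked.append(None)
            labels.append(MNAR)
        elif rng.random() < mcar_rate:
            masked.append(None)
            labels.append(MAR_MCAR)
        else:
            masked.append(v)
            labels.append(None)
    return masked, labels


def impute_by_mechanism(masked, labels):
    """Step two: fill each gap with a rule that matches its mechanism label.

    Assumption: `labels` would come from a classifier in practice; here the
    simulated labels are used directly. Requires at least one observed value.
    """
    observed = [v for v in masked if v is not None]
    low_fill = min(observed) / 2          # stand-in MNAR rule (half-minimum)
    mid_fill = statistics.mean(observed)  # stand-in MAR/MCAR rule (mean)
    return [
        v if v is not None else (low_fill if lab == MNAR else mid_fill)
        for v, lab in zip(masked, labels)
    ]


def impute_single(masked):
    """Baseline: one rule (observed mean) for every gap, ignoring mechanism."""
    observed = [v for v in masked if v is not None]
    fill = statistics.mean(observed)
    return [v if v is not None else fill for v in masked]


def rmse_on_missing(truth, imputed, masked):
    """Root mean squared error computed over the masked positions only."""
    errs = [(t - i) ** 2 for t, i, m in zip(truth, imputed, masked) if m is None]
    return (sum(errs) / len(errs)) ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    truth = [rng.lognormvariate(2.0, 1.0) for _ in range(200)]
    lod = sorted(truth)[30]  # hypothetical detection limit at the 15th percentile
    masked, labels = impose_missingness(truth, lod, mcar_rate=0.1, rng=rng)
    print("two-step RMSE:", rmse_on_missing(truth, impute_by_mechanism(masked, labels), masked))
    print("single   RMSE:", rmse_on_missing(truth, impute_single(masked), masked))
```

In a fuller version, the labels passed to `impute_by_mechanism` would be predictions from a trained classifier, and each stand-in rule would be replaced by an imputation algorithm suited to that mechanism.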


Similar Items