Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data

Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values account for about 10%–20% of all data and can have various origins, including analytical, computational and biological ones. Currently, the...

Bibliographic Details
Main Authors:	Piotr S. Gromski, Yun Xu, Helen L. Kotze, Elon Correa, David I. Ellis, Emily Grace Armitage, Michael L. Turner, Royston Goodacre
Format:	Article
Language:	English
Published:	MDPI AG 2014-06-01
Series:	Metabolites
Subjects:	missing values, metabolomics, unsupervised learning, supervised learning
Online Access:	http://www.mdpi.com/2218-1989/4/2/433

_version_	1811240533557772288
author	Piotr S. Gromski, Yun Xu, Helen L. Kotze, Elon Correa, David I. Ellis, Emily Grace Armitage, Michael L. Turner, Royston Goodacre
author_facet	Piotr S. Gromski, Yun Xu, Helen L. Kotze, Elon Correa, David I. Ellis, Emily Grace Armitage, Michael L. Turner, Royston Goodacre
author_sort	Piotr S. Gromski
collection	DOAJ
description	Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values account for about 10%–20% of all data and can have various origins, including analytical, computational and biological ones. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this aspect of data analysis so routine in their metabolomics pipeline that they do not even mention using this replacement approach. However, this substitution may have a significant influence on the data analysis output(s) and might be highly sensitive to the distribution of samples between different classes. Therefore, in this study we have analysed different substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the final output(s) in terms of biological interpretation. These comparisons are demonstrated both visually and computationally (classification rate) to support our findings. The results show that the selection of the replacement method used to impute missing values may have a considerable effect on the classification accuracy; if performed incorrectly, this may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF should be favoured for imputing missing values over the other tested methods. This approach displayed excellent results in terms of classification rate for both supervised methods, namely principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.
first_indexed	2024-04-12T13:22:01Z
format	Article
id	doaj.art-a656004fcb004106be7bd332e6a09189
institution	Directory of Open Access Journals
issn	2218-1989
language	English
last_indexed	2024-04-12T13:22:01Z
publishDate	2014-06-01
publisher	MDPI AG
record_format	Article
series	Metabolites
spelling	doaj.art-a656004fcb004106be7bd332e6a09189; 2022-12-22T03:31:26Z; eng; MDPI AG; Metabolites; 2218-1989; 2014-06-01; 4; 2; 433; 452; 10.3390/metabo4020433; metabo4020433; Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data; Piotr S. Gromski 0; Yun Xu 1; Helen L. Kotze 2; Elon Correa 3; David I. Ellis 4; Emily Grace Armitage 5; Michael L. Turner 6; Royston Goodacre 7; School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; School of Chemistry, Brunswick Street, The University of Manchester, Manchester M13 9PL, UK; School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, UK; http://www.mdpi.com/2218-1989/4/2/433; missing values; metabolomics; unsupervised learning; supervised learning
spellingShingle	Piotr S. Gromski Yun Xu Helen L. Kotze Elon Correa David I. Ellis Emily Grace Armitage Michael L. Turner Royston Goodacre Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data Metabolites missing values metabolomics unsupervised learning supervised learning
title	Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data
title_full	Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data
title_fullStr	Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data
title_full_unstemmed	Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data
title_short	Influence of Missing Values Substitutes on Multivariate Analysis of Metabolomics Data
title_sort	influence of missing values substitutes on multivariate analysis of metabolomics data
topic	missing values, metabolomics, unsupervised learning, supervised learning
url	http://www.mdpi.com/2218-1989/4/2/433
work_keys_str_mv	AT piotrsgromski influenceofmissingvaluessubstitutesonmultivariateanalysisofmetabolomicsdata AT yunxu influenceofmissingvaluessubstitutesonmultivariateanalysisofmetabolomicsdata AT helenlkotze influenceofmissingvaluessubstitutesonmultivariateanalysisofmetabolomicsdata AT eloncorrea influenceofmissingvaluessubstitutesonmultivariateanalysisofmetabolomicsdata AT davidiellis influenceofmissingvaluessubstitutesonmultivariateanalysisofmetabolomicsdata AT emilygracearmitage influenceofmissingvaluessubstitutesonmultivariateanalysisofmetabolomicsdata AT michaellturner influenceofmissingvaluessubstitutesonmultivariateanalysisofmetabolomicsdata AT roystongoodacre influenceofmissingvaluessubstitutesonmultivariateanalysisofmetabolomicsdata
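
Illustrative note: the description above compares zero, mean, median, kNN and RF substitutes for missing values. The sketch below is not the authors' implementation; it only shows, under simple assumptions, what the first four substitutes do to a small samples-by-metabolites table in which None marks a missing value. The function names impute_columnwise and impute_knn, the example data and the distance definition (root mean squared difference over the columns that both samples have observed) are hypothetical choices made for this illustration. RF imputation is left out because it needs a trained forest model.

```python
import math
from statistics import mean, median


def impute_columnwise(rows, strategy="mean"):
    """Fill None entries per column with zero, or with the column mean or median."""
    if strategy not in ("zero", "mean", "median"):
        raise ValueError(f"unknown strategy: {strategy}")
    filled = [list(r) for r in rows]  # copy so the input is left untouched
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        if strategy == "zero" or not observed:
            fill = 0.0  # assumption: a fully missing column also falls back to zero
        elif strategy == "mean":
            fill = mean(observed)
        else:
            fill = median(observed)
        for r in filled:
            if r[j] is None:
                r[j] = fill
    return filled


def impute_knn(rows, k=2):
    """Fill each None with the mean of that column over the k nearest samples."""

    def distance(a, b):
        # Compare two samples only on the columns observed in both of them.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours are other samples with column j observed.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            neighbours = [v for d, v in candidates[:k] if d < math.inf]
            if neighbours:  # otherwise the entry stays None
                filled[i][j] = mean(neighbours)
    return filled


if __name__ == "__main__":
    # Hypothetical data: 5 samples x 3 metabolites.
    data = [
        [1.0, 2.0, None],
        [1.5, 2.0, 6.0],
        [5.0, 8.0, 10.0],
        [1.0, 3.0, 4.0],
        [None, 8.0, 9.0],
    ]
    for name in ("zero", "mean", "median"):
        print(name, impute_columnwise(data, name))
    print("kNN", impute_knn(data, k=2))
```

Running the script prints the four filled tables side by side. The column-wise substitutes give every sample the same fill value for a metabolite, whereas the kNN substitute depends on which samples are closest, which is one way to see why the choice of substitute can change downstream multivariate results.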

Similar Items