Probabilistic metabolite annotation using retention time prediction and meta-learned projections

Abstract Retention time information is used for metabolite annotation in metabolomic experiments, but its usefulness is hindered by the limited availability of experimental retention time data in metabolomic databases and by the lack of reproducibility between different chromatographic methods. Accurate pr...


Bibliographic Details
Main Authors: Constantino A. García, Alberto Gil-de-la-Fuente, Coral Barbas, Abraham Otero
Format: Article
Language: English
Published: BMC 2022-06-01
Series: Journal of Cheminformatics
Subjects: Metabolomics; Retention time; Machine learning; Bayesian methods; Deep learning
Online Access: https://doi.org/10.1186/s13321-022-00613-8
author Constantino A. García
Alberto Gil-de-la-Fuente
Coral Barbas
Abraham Otero
collection DOAJ
description Abstract Retention time information is used for metabolite annotation in metabolomic experiments, but its usefulness is hindered by the limited availability of experimental retention time data in metabolomic databases and by the lack of reproducibility between different chromatographic methods. Accurate prediction of retention time for a given chromatographic method would be a valuable support for metabolite annotation. We have trained state-of-the-art machine learning regressors using the 80,038 experimental retention times from the METLIN Small Molecule Retention Time (SMRT) dataset. The models included deep neural networks, deep kernel learning, several gradient boosting models, and a blending approach. A total of 5,666 molecular descriptors and 2,214 fingerprints (MACCS166, Extended Connectivity, and Path fingerprints) were generated with the alvaDesc software. The models were trained using only the descriptors, only the fingerprints, and both types of features simultaneously. Bayesian hyperparameter search was used for parameter tuning. To avoid data leakage when reporting the performance metrics, nested cross-validation was employed. The best results were obtained by a heavily regularized deep neural network trained with cosine annealing warm restarts and stochastic weight averaging, achieving mean and median absolute errors of 39.2 ± 1.2 s and 17.2 ± 0.9 s, respectively. To the best of our knowledge, these are the most accurate predictions published to date on the SMRT dataset. To project retention times between chromatographic methods, a novel Bayesian meta-learning approach that can learn from just a few molecules is proposed. By applying this projection between the deep neural network retention time predictions and a given chromatographic method, our approach can be integrated into a metabolite annotation workflow to obtain z-scores for the candidate annotations. To this end, it is enough that as few as 10 molecules of a given experiment have been identified (probably by using pure metabolite standards). The use of z-scores makes it possible to consider the uncertainty of the projection, and not only its accuracy, when ranking candidates. In this scenario, our results show that in 68% of the cases the correct molecule was among the top three candidates filtered by mass and ranked according to z-scores. This shows the usefulness of this information to support metabolite annotation. Python code is available on GitHub at https://github.com/constantino-garcia/cmmrt.
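The z-score ranking mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each candidate already comes with a projected retention time and a standard deviation describing the uncertainty of that projection. The `Candidate` class, the function names, and all numbers below are hypothetical and chosen only for illustration. They are not taken from the article or from the cmmrt package.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical candidate annotation (already filtered by mass)."""
    name: str
    projected_rt: float  # projected retention time in seconds (illustrative value)
    projected_sd: float  # standard deviation of the projection in seconds


def z_score(observed_rt: float, cand: Candidate) -> float:
    """Absolute distance between observed and projected retention time,
    measured in units of the projection's standard deviation."""
    return abs(observed_rt - cand.projected_rt) / cand.projected_sd


def rank_candidates(observed_rt: float, candidates: list[Candidate]) -> list[Candidate]:
    """Sort candidates from most to least plausible (smallest |z| first)."""
    return sorted(candidates, key=lambda c: z_score(observed_rt, c))


# Made-up example: candidate A is closest in absolute terms (10 s away),
# but its projection is very confident, so 10 s is a large deviation for it.
candidates = [
    Candidate("A", projected_rt=310.0, projected_sd=2.0),
    Candidate("B", projected_rt=340.0, projected_sd=40.0),
    Candidate("C", projected_rt=250.0, projected_sd=20.0),
]
ranking = rank_candidates(320.0, candidates)
print([c.name for c in ranking])
```

In this toy case, ranking by absolute error alone would place A first, whereas ranking by z-score places it last, which is the sense in which z-scores take the uncertainty of the projection into account.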
first_indexed 2024-04-13T21:58:36Z
format Article
id doaj.art-c52da9025fc847bb8844df85ae9b3e7d
institution Directory Open Access Journal
issn 1758-2946
language English
last_indexed 2024-04-13T21:58:36Z
publishDate 2022-06-01
publisher BMC
record_format Article
series Journal of Cheminformatics
spelling doaj.art-c52da9025fc847bb8844df85ae9b3e7d
2022-12-22T02:28:10Z
Volume 14, issue 1, pages 1-23
Constantino A. García (Department of Information Technology, Escuela Politécnica Superior, Universidad San Pablo CEU)
Alberto Gil-de-la-Fuente (Department of Information Technology, Escuela Politécnica Superior, Universidad San Pablo CEU)
Coral Barbas (Centre for Metabolomics and Bioanalysis (CEMBIO), Facultad de Farmacia, Universidad San Pablo CEU)
Abraham Otero (Department of Information Technology, Escuela Politécnica Superior, Universidad San Pablo CEU)
title Probabilistic metabolite annotation using retention time prediction and meta-learned projections
topic Metabolomics
Retention time
Machine learning
Bayesian methods
Deep learning
url https://doi.org/10.1186/s13321-022-00613-8