Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra

Small molecule metabolites mediate myriad biological and environmental phenomena across host-microbiome interactions, plant chemistry, cancer biology, and various other processes. Mass spectrometry is often used as an analytical technique to investigate the small molecules present in a sample, measu...


Bibliographic Details
Main Author: Goldman, Samuel Lucas
Other Authors: Coley, Connor W.
Format: Thesis
Published: Massachusetts Institute of Technology 2024
Online Access: https://hdl.handle.net/1721.1/154037
https://orcid.org/0000-0002-3928-6873
_version_ 1826205783466967040
author Goldman, Samuel Lucas
author2 Coley, Connor W.
author_facet Coley, Connor W.
Goldman, Samuel Lucas
author_sort Goldman, Samuel Lucas
collection MIT
description Small molecule metabolites mediate myriad biological and environmental phenomena across host-microbiome interactions, plant chemistry, cancer biology, and various other processes. Mass spectrometry is often used as an analytical technique to investigate the small molecules present in a sample, measuring both their masses and fragmentation spectra. However, the complexity and high dimensionality of spectral data make it difficult to identify unknown metabolites and their roles, with a large majority of detected metabolites remaining unidentified in public data. This thesis proposes a suite of new computational methodologies for higher accuracy annotation of small molecule metabolites from mass spectrometry data that integrate chemistry-informed priors with modern deep learning advancements. I begin by decomposing and framing the metabolite annotation pipeline into four key tasks well suited to supervised deep learning, including (A) molecular formula prediction, (B) spectrum-to-molecule property prediction, (C) molecule-to-spectrum prediction, and (D) de novo generation of molecular candidates. To address these various tasks, I first introduce the Molecular Formula Transformer to predict molecular property fingerprints from spectra by changing the tandem mass spectrum input basis from scalar mass values to plausible molecular formula annotations. This method is then extended to an energy-based-model formulation to predict the molecular formula of an unknown molecule from its tandem mass spectrum. Following these initial efforts to learn better representations of fragmentation spectra, I develop new neural networks capable of generating fragmentation spectra from small molecules through two-step autoregressive modeling. I show how this can be accomplished by generating either molecular formula peaks or molecular fragment peaks. Downstream of metabolite prediction, a separate key question is to identify the function of discovered small molecules. To this end, I study and probe the ability to model enzyme-substrate compatibility from high-throughput screens within a single enzyme family. In a final collaborative work, I further demonstrate how a new method for epistemic uncertainty quantification, evidential deep learning, can be applied to molecular property prediction. Altogether, this work outlines a path forward to a fully neuralized pipeline for the high-throughput identification of small molecule metabolites and their functions.
first_indexed 2024-09-23T13:19:03Z
format Thesis
id mit-1721.1/154037
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T13:19:03Z
publishDate 2024
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/154037 2024-04-03T03:31:57Z Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra Goldman, Samuel Lucas Coley, Connor W. Massachusetts Institute of Technology. Computational and Systems Biology Program Ph.D. 2024-04-02T14:57:46Z 2024-04-02T14:57:46Z 2024-02 2024-03-21T19:56:03.323Z Thesis https://hdl.handle.net/1721.1/154037 https://orcid.org/0000-0002-3928-6873 In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Goldman, Samuel Lucas
Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra
title Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra
title_full Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra
title_fullStr Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra
title_full_unstemmed Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra
title_short Machine Learning Methods for Discovering Metabolite Structures from Mass Spectra
title_sort machine learning methods for discovering metabolite structures from mass spectra
url https://hdl.handle.net/1721.1/154037
https://orcid.org/0000-0002-3928-6873
work_keys_str_mv AT goldmansamuellucas machinelearningmethodsfordiscoveringmetabolitestructuresfrommassspectra