Machine Learning Methods for High Throughput Biological Data

Machine learning is becoming a pivotal tool in the analysis of datasets generated from high-throughput biological omics experiments. However, omics data introduces distinctive algorithmic challenges that set it apart from other domains where machine learning is applied. These challenges encompass issues such as limited data availability, complex noise, ambiguities in representation, and the absence of definitive ground truth for validation.

Bibliographic Details
Main Author:	Murphy, Michael A.
Other Authors:	Fraenkel, Ernest
Format:	Thesis
Published:	Massachusetts Institute of Technology 2024
Online Access:	https://hdl.handle.net/1721.1/154024 https://orcid.org/0000-0002-7343-8383

_version_	1826214027461656576
author	Murphy, Michael A.
author2	Fraenkel, Ernest
author_facet	Fraenkel, Ernest Murphy, Michael A.
author_sort	Murphy, Michael A.
collection	MIT
description	Machine learning is becoming a pivotal tool in the analysis of datasets generated from high-throughput biological omics experiments. However, omics data introduces distinctive algorithmic challenges that set it apart from other domains where machine learning is applied. These challenges encompass issues such as limited data availability, complex noise, ambiguities in representation, and the absence of definitive ground truth for validation. In this thesis, I present three examples of machine learning applications to different omics modalities in which I address these challenges. In my first project, I develop an approach for contrastive representation learning with immunohistochemistry images, which suffer from complex technical and biological noise that renders generic approaches ineffective; and I demonstrate how this approach can be combined with noisy labels derived from transcriptomics to derive an effective classifier of cell-type specificity. In my second project, I consider the problem of predicting mass spectra of small molecules: previous methods suffer from a tradeoff between capturing high-resolution mass information and a tractable learning problem, which I resolve by introducing a novel representation of the output space. In my third project, I perform gene regulatory network inference using a number of different single-cell sequencing platforms, and carry out a quantitative comparison of these technologies. In summary, this thesis showcases the difficulties that arise in applying modern machine learning approaches to high-throughput biological measurements, and presents empirical case studies of how these difficulties may be overcome.
first_indexed	2024-09-23T15:58:39Z
format	Thesis
id	mit-1721.1/154024
institution	Massachusetts Institute of Technology
last_indexed	2024-09-23T15:58:39Z
publishDate	2024
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/154024 2024-04-03T03:06:02Z Machine Learning Methods for High Throughput Biological Data Murphy, Michael A. Fraenkel, Ernest Jegelka, Stefanie Massachusetts Institute of Technology. Computational and Systems Biology Program Machine learning is becoming a pivotal tool in the analysis of datasets generated from high-throughput biological omics experiments. However, omics data introduces distinctive algorithmic challenges that set it apart from other domains where machine learning is applied. These challenges encompass issues such as limited data availability, complex noise, ambiguities in representation, and the absence of definitive ground truth for validation. In this thesis, I present three examples of machine learning applications to different omics modalities in which I address these challenges. In my first project, I develop an approach for contrastive representation learning with immunohistochemistry images, which suffer from complex technical and biological noise that renders generic approaches ineffective; and I demonstrate how this approach can be combined with noisy labels derived from transcriptomics to derive an effective classifier of cell-type specificity. In my second project, I consider the problem of predicting mass spectra of small molecules: previous methods suffer from a tradeoff between capturing high-resolution mass information and a tractable learning problem, which I resolve by introducing a novel representation of the output space. In my third project, I perform gene regulatory network inference using a number of different single-cell sequencing platforms, and carry out a quantitative comparison of these technologies. In summary, this thesis showcases the difficulties that arise in applying modern machine learning approaches to high-throughput biological measurements, and presents empirical case studies of how these difficulties may be overcome. Ph.D. 2024-04-02T14:56:52Z 2024-04-02T14:56:52Z 2024-02 2024-03-21T19:56:11.095Z Thesis https://hdl.handle.net/1721.1/154024 https://orcid.org/0000-0002-7343-8383 Attribution 4.0 International (CC BY 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by/4.0/ application/pdf Massachusetts Institute of Technology
spellingShingle	Murphy, Michael A. Machine Learning Methods for High Throughput Biological Data
title	Machine Learning Methods for High Throughput Biological Data
title_full	Machine Learning Methods for High Throughput Biological Data
title_fullStr	Machine Learning Methods for High Throughput Biological Data
title_full_unstemmed	Machine Learning Methods for High Throughput Biological Data
title_short	Machine Learning Methods for High Throughput Biological Data
title_sort	machine learning methods for high throughput biological data
url	https://hdl.handle.net/1721.1/154024 https://orcid.org/0000-0002-7343-8383
work_keys_str_mv	AT murphymichaela machinelearningmethodsforhighthroughputbiologicaldata
