Similarity Metrics for Biological Data: Algorithmic developments for high-dimensional datasets

Advances in experimental methods in biology have allowed researchers to gain an unprecedentedly high-resolution view of the molecular processes within cells, using so-called single-cell technologies. Every cell in the sample can be individually profiled — the amount of each type of protein or metabo...


Bibliographic Details
Main Author: Narayan, Ashwin
Other Authors: Berger, Bonnie
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/145044
https://orcid.org/0000-0001-7024-9424
_version_ 1811068535758127104
author Narayan, Ashwin
author2 Berger, Bonnie
author_facet Berger, Bonnie
Narayan, Ashwin
author_sort Narayan, Ashwin
collection MIT
description Advances in experimental methods in biology have given researchers an unprecedentedly high-resolution view of the molecular processes within cells through so-called single-cell technologies. Every cell in a sample can be individually profiled: the amount of each type of protein, metabolite, or other molecule of interest can be counted. Understanding the molecular basis that determines the differentiation of cell fates is thus the holy grail promised by these data. However, the high-dimensional nature of the data, replete with correlations between features, noise, and heterogeneity, means that the computational work required to draw insights is significant. In particular, understanding the differences between cells requires a quantitative measure of similarity between the single-cell feature vectors of those cells. A vast array of existing methods, from those that cluster a given dataset to those that attempt to integrate multiple datasets or learn causal effects of perturbation, are built on this foundational notion of similarity. In this dissertation, we examine similarity metrics for high-dimensional biological data generally, and single-cell RNA-seq data specifically. We work from a global perspective, where we find a distance function that applies across the entire dataset, to a local perspective, where each cell can learn its own similarity function. In particular, we first present Schema, a method for combining similarity information encoded by several types of data, which has proven useful in analyzing the growing number of datasets that contain multiple modalities of information. We also present DensVis, a package of algorithms for visualizing single-cell data, which improves upon existing dimensionality-reduction methods that focus on local structure by accounting for density in high-dimensional space. Lastly, we zoom in on each datapoint and present a new method for learning 𝑘-nearest-neighbor graphs based on local decompositions. Altogether, these works demonstrate, through extensive validation on existing datasets, the importance of understanding high-dimensional similarity.
first_indexed 2024-09-23T07:57:26Z
format Thesis
id mit-1721.1/145044
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T07:57:26Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/145044
2022-08-30T04:09:23Z
Similarity Metrics for Biological Data: Algorithmic developments for high-dimensional datasets
Narayan, Ashwin
Berger, Bonnie
Massachusetts Institute of Technology. Department of Mathematics
Ph.D.
2022-08-29T16:29:13Z
2022-08-29T16:29:13Z
2022-05
2022-06-07T15:33:57.923Z
Thesis
https://hdl.handle.net/1721.1/145044
https://orcid.org/0000-0001-7024-9424
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Copyright retained by author(s)
https://creativecommons.org/licenses/by-nc/4.0/
application/pdf
Massachusetts Institute of Technology
title Similarity Metrics for Biological Data: Algorithmic developments for high-dimensional datasets
title_sort similarity metrics for biological data algorithmic developments for high dimensional datasets
url https://hdl.handle.net/1721.1/145044
https://orcid.org/0000-0001-7024-9424