Similarity Metrics for Biological Data: Algorithmic developments for high-dimensional datasets

Advances in experimental methods in biology have allowed researchers to gain an unprecedentedly high-resolution view of the molecular processes within cells, using so-called single-cell technologies. Every cell in the sample can be individually profiled — the amount of each type of protein or metabo...

Bibliographic Details
Main Author:	Narayan, Ashwin
Other Authors:	Berger, Bonnie
Format:	Thesis
Published:	Massachusetts Institute of Technology 2022
Online Access:	https://hdl.handle.net/1721.1/145044 https://orcid.org/0000-0001-7024-9424

author	Narayan, Ashwin
author2	Berger, Bonnie
collection	MIT
description	Advances in experimental methods in biology have allowed researchers to gain an unprecedentedly high-resolution view of the molecular processes within cells, using so-called single-cell technologies. Every cell in the sample can be individually profiled: the amount of each type of protein, metabolite, or other molecule of interest can be counted. Understanding the molecular basis that determines the differentiation of cell fates is thus the holy grail promised by these data. However, the high-dimensional nature of the data, replete with correlations between features, noise, and heterogeneity, means that the computational work required to draw insights is significant. In particular, understanding the differences between cells requires a quantitative measure of similarity between the single-cell feature vectors of those cells. A vast array of existing methods, from those that cluster a given dataset to those that attempt to integrate multiple datasets or learn causal effects of perturbation, is built on this foundational notion of similarity. In this dissertation, we delve into the question of similarity metrics for high-dimensional biological data generally, and single-cell RNA-seq data specifically. We work from a global perspective, where we find a distance function that applies across the entire dataset, to a local perspective, where each cell can learn its own similarity function. In particular, we first present Schema, a method for combining similarity information encoded by several types of data, which has proven useful in analyzing the burgeoning number of datasets that contain multiple modalities of information. We also present DensVis, a package of algorithms for visualizing single-cell data, which improve upon existing dimensionality-reduction methods that focus on local structure by accounting for density in high-dimensional space. Lastly, we zoom in on each datapoint and show a new method for learning 𝑘-nearest-neighbor graphs based on local decompositions. Altogether, the works demonstrate the importance of understanding high-dimensional similarity, through extensive validation on existing datasets.
first_indexed	2024-09-23T07:57:26Z
format	Thesis
id	mit-1721.1/145044
institution	Massachusetts Institute of Technology
last_indexed	2024-09-23T07:57:26Z
publishDate	2022
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/145044 2022-08-30T04:09:23Z Similarity Metrics for Biological Data: Algorithmic developments for high-dimensional datasets Narayan, Ashwin Berger, Bonnie Massachusetts Institute of Technology. Department of Mathematics Ph.D. 2022-08-29T16:29:13Z 2022-08-29T16:29:13Z 2022-05 2022-06-07T15:33:57.923Z Thesis https://hdl.handle.net/1721.1/145044 https://orcid.org/0000-0001-7024-9424 Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Copyright retained by author(s) https://creativecommons.org/licenses/by-nc/4.0/ application/pdf Massachusetts Institute of Technology
title	Similarity Metrics for Biological Data: Algorithmic developments for high-dimensional datasets
url	https://hdl.handle.net/1721.1/145044 https://orcid.org/0000-0001-7024-9424
