Theory and Applications of Matrix Completion in Genomics Datasets

Bibliographic Details
Main Author: Stefanakis, George
Other Authors: Uhler, Caroline
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/144547
Description
Summary: The advent of rapid and efficient biological screening and sequencing technologies has enabled high-throughput data collection, opening the door to improvements in drug discovery, disease identification, and personalized medicine, among other areas. The size and scope of such datasets are unprecedented, and their increased availability over the past decade, in conjunction with rapid advancements in statistical inference and machine learning, has paved the way for an explosion in research. Still, many problems in this space remain unexplored or in their infancy, owing either to limited data availability or to a lack of computationally efficient or high-accuracy methods for modeling and prediction. In this work, we develop theory and demonstrate empirical results for the use of the novel Neural Tangent Kernel (NTK) in matrix completion. We derive the functional form of the NTK for a single-hidden-layer, infinite-width neural network with ReLU activation, and develop a framework for applying the NTK to matrix completion. We explore a specific application of this framework, using the Connectivity Map dataset of gene expression data for various cells and perturbations, and demonstrate results competitive with other methods. Additionally, we analyze our contributions through the auxiliary lens of performance engineering and develop concrete algorithms for accurate, performant, and intuitive biological imputation.
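As a rough illustration of the approach the summary describes, the sketch below computes a commonly used closed form of the NTK for a single-hidden-layer, infinite-width ReLU network and uses it for kernel regression over the observed entries of a matrix. It is not taken from the thesis: the normalization (no bias terms, both layers trained, sqrt(2/width) scaling), the function names `relu_ntk` and `complete_matrix`, and the choice to represent entry (i, j) by concatenating a row feature vector with a column feature vector are all assumptions made for this example.

```python
import numpy as np

def relu_ntk(X, Z):
    """NTK between rows of X and rows of Z for a one-hidden-layer,
    infinite-width ReLU network (assumed: no bias, both layers trained,
    sqrt(2/width) output scaling; the thesis may normalize differently)."""
    nx = np.linalg.norm(X, axis=1, keepdims=True)
    nz = np.linalg.norm(Z, axis=1, keepdims=True)
    norms = nx * nz.T
    # Cosine of the angle between inputs, clipped for numerical safety.
    u = np.clip((X @ Z.T) / np.maximum(norms, 1e-12), -1.0, 1.0)
    theta = np.arccos(u)
    k0 = (np.pi - theta) / np.pi                                   # hidden-layer gradient term
    k1 = (u * (np.pi - theta) + np.sqrt(1.0 - u**2)) / np.pi       # output-layer gradient term
    return norms * (u * k0 + k1)

def complete_matrix(M, mask, row_feats, col_feats, ridge=1e-6):
    """Impute entries of M where mask is False via NTK kernel ridge regression.
    Entry (i, j) is featurized as concat(row_feats[i], col_feats[j])
    (a hypothetical featurization chosen for illustration)."""
    rows, cols = np.indices(M.shape)
    feats = np.hstack([row_feats[rows.ravel()], col_feats[cols.ravel()]])
    obs = mask.ravel()
    K = relu_ntk(feats[obs], feats[obs])
    # Small ridge term keeps the linear solve well conditioned.
    alpha = np.linalg.solve(K + ridge * np.eye(K.shape[0]), M.ravel()[obs])
    pred = relu_ntk(feats, feats[obs]) @ alpha
    # Keep observed values as given; fill only the missing ones.
    return np.where(obs, M.ravel(), pred).reshape(M.shape)
```

With no side information, one-hot vectors such as `np.eye(M.shape[0])` and `np.eye(M.shape[1])` can stand in for the row and column features. For a gene expression matrix, the features could instead encode the cell type and the perturbation.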