Modeling structured biological processes with machine learning

Bibliographic Details
Main Author: Shen, Max Walt
Other Authors: Liu, David R.
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/139524
Description
Summary: Models of natural phenomena have played a fundamental role in scientific progress. In modern biology, we seek to model ever more complex phenomena, driven by advances in high-throughput measurement technology and machine learning. These advances motivate a top-down, data-driven modeling approach, but directly applying such methods to model complex biological processes can fail to yield models with causal understanding. It would be desirable to build models that combine the rich bodies of causal knowledge built over decades of research with modern flexible machine learning methods that scale to large and rich datasets. Here, I present deep data-driven models that incorporate biological and causal prior knowledge to model fundamental biological processes in genome editing and directed evolution. I first consider a model of DNA repair following CRISPR/Cas9 cleavage, which was generally thought to be unpredictable. In a large-scale dataset, I find signatures implicating an alternative and more predictable DNA repair pathway. I describe a model that accurately predicts genome editing outcomes by representing these competing but mechanistically independent repair pathways while flexibly learning unknown relationships from data. I use the model to discover a new genome editing strategy for efficiently and precisely correcting a class of disease-causing genetic mutations. Next, I consider a model for base editing, where I decompose a complex prediction problem into simpler subproblems and solve one with an autoregressive sequence-to-distribution-of-sequences model. The models enable the design of genome editing strategies with optimized outcomes for disease-causing mutations and enabled the first demonstration of transversion base editing by cytosine base editors, broadening the scope of base editing to potentially correct new classes of mutations. These models also broaden the scope of C-to-G base editors with restrictive sequence preferences.
Finally, I propose a method for reconstructing sequence-to-function datasets from directed evolution that can help increase the availability of datasets for machine learning for protein engineering. This method exploits the structure of a differential equation governing natural selection for efficient inference and is capable of proposing variants with higher activity than conventional methods. Incorporating prior knowledge and structure into models of natural phenomena can support scientific discovery.