Modeling structured biological processes with machine learning

Bibliographic Details
Main Author: Shen, Max Walt
Other Authors: Liu, David R.
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/139524
Description: Models of natural phenomena have played a fundamental role in scientific progress. In modern biology, we seek to model ever more complex phenomena, driven by advances in high-throughput measurement technology and machine learning. These advances motivate a top-down, data-driven modeling approach, but directly applying such methods to complex biological processes can fail to yield models that offer causal understanding. It would be desirable to build models that combine the rich bodies of causal knowledge built over decades of research with modern, flexible machine learning methods that scale to large and rich datasets. Here, I present deep data-driven models that incorporate biological and causal prior knowledge to model fundamental biological processes in genome editing and directed evolution. I first consider a model of DNA repair following CRISPR/Cas9 cleavage, which was generally thought to be unpredictable. In a large-scale dataset, I find signatures implicating an alternative and more predictable DNA repair pathway. I describe a model that accurately predicts genome editing outcomes by representing these competing but mechanistically independent repair pathways while flexibly learning unknown relationships from data. I use the model to discover a new genome editing strategy for efficiently and precisely correcting a class of disease-causing genetic mutations. Next, I consider a model for base editing, where I decompose a complex prediction problem into simpler subproblems and solve one of them with an autoregressive sequence-to-distribution-of-sequences model. These models enable the design of genome editing strategies with optimized outcomes for disease-causing mutations, and they enabled the first demonstration of transversion base editing by cytosine base editors, broadening the scope of base editing to potentially correct new classes of mutations. These models also broaden the scope of C-to-G base editors with restrictive sequence preferences.
Finally, I propose a method for reconstructing sequence-to-function datasets from directed evolution, which can help increase the availability of datasets for machine learning in protein engineering. This method exploits the structure of a differential equation governing natural selection for efficient inference and is capable of proposing variants with higher activity than conventional methods. Incorporating prior knowledge and structure into models of natural phenomena can support scientific discovery.
Department: Massachusetts Institute of Technology. Computational and Systems Biology Program
Degree: Ph.D.
Date: 2021-06
Rights: In Copyright - Educational Use Permitted. Copyright MIT. http://rightsstatements.org/page/InC-EDU/1.0/
File Format: application/pdf