Modeling structured biological processes with machine learning
Models of natural phenomena have played a fundamental role in scientific progress. In modern biology, we seek to model ever more complex phenomena, driven by advances in highthroughput measurement technology and machine learning. These advances motivate a topdown data-driven modeling approach, but d...
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis |
Published: |
Massachusetts Institute of Technology
2022
|
Online Access: | https://hdl.handle.net/1721.1/139524 |
_version_ | 1826191629523877888 |
---|---|
author | Shen, Max Walt |
author2 | Liu, David R. |
author_facet | Liu, David R. Shen, Max Walt |
author_sort | Shen, Max Walt |
collection | MIT |
description | Models of natural phenomena have played a fundamental role in scientific progress. In modern biology, we seek to model ever more complex phenomena, driven by advances in highthroughput measurement technology and machine learning. These advances motivate a topdown data-driven modeling approach, but directly applying such methods to model complex biological processes can fail to yield models with causal understanding. It would be desirable to build models that combine the rich bodies of causal knowledge built over decades of research with modern flexible machine learning methods that scale to large and rich datasets.
Here, I present deep data-driven models that incorporate biological and causal prior knowledge to model fundamental biological processes in genome editing and directed evolution. I first consider a model of DNA repair following CRISPR/Cas9 cleavage, which was generally thought to be unpredictable. In a large-scale dataset, I find signatures implicating an alternative and more predictable DNA repair pathway. I describe a model that accurately predicts genome editing outcomes by representing these competing but mechanistically independent repair pathways while flexibly learning unknown relationships from data. I use the model to discover a new genome editing strategy for efficiently and precisely correcting a class of disease-causing genetic mutations. Next, I consider a model for base editing, where I decompose a complex prediction problem into simpler subproblems and solve one with an autoregressive sequence-todistribution of sequences model. The models enable designing genome editing strategies with optimized outcomes for disease-causing mutation and enabled the first demonstration of transversion base editing by cytosine base editors, broadening the scope of base editing to potentially correcting new classes of mutations. These models also broaden the scope of C to G base editors with restrictive sequence preferences. Finally, I propose a method for reconstructing sequence-to-function datasets from directed evolution that can help increase the availability of datasets for machine learning for protein engineering. This method exploits the structure of a differential equation governing natural selection for efficient inference and is capable of proposing variants with higher activity than conventional methods. Incorporating prior knowledge and structure into models of natural phenomena can support scientific discovery. |
first_indexed | 2024-09-23T08:58:51Z |
format | Thesis |
id | mit-1721.1/139524 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T08:58:51Z |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/1395242022-01-15T04:03:10Z Modeling structured biological processes with machine learning Shen, Max Walt Liu, David R. Massachusetts Institute of Technology. Computational and Systems Biology Program Models of natural phenomena have played a fundamental role in scientific progress. In modern biology, we seek to model ever more complex phenomena, driven by advances in highthroughput measurement technology and machine learning. These advances motivate a topdown data-driven modeling approach, but directly applying such methods to model complex biological processes can fail to yield models with causal understanding. It would be desirable to build models that combine the rich bodies of causal knowledge built over decades of research with modern flexible machine learning methods that scale to large and rich datasets. Here, I present deep data-driven models that incorporate biological and causal prior knowledge to model fundamental biological processes in genome editing and directed evolution. I first consider a model of DNA repair following CRISPR/Cas9 cleavage, which was generally thought to be unpredictable. In a large-scale dataset, I find signatures implicating an alternative and more predictable DNA repair pathway. I describe a model that accurately predicts genome editing outcomes by representing these competing but mechanistically independent repair pathways while flexibly learning unknown relationships from data. I use the model to discover a new genome editing strategy for efficiently and precisely correcting a class of disease-causing genetic mutations. Next, I consider a model for base editing, where I decompose a complex prediction problem into simpler subproblems and solve one with an autoregressive sequence-todistribution of sequences model. The models enable designing genome editing strategies with optimized outcomes for disease-causing mutation and enabled the first demonstration of transversion base editing by cytosine base editors, broadening the scope of base editing to potentially correcting new classes of mutations. These models also broaden the scope of C to G base editors with restrictive sequence preferences. Finally, I propose a method for reconstructing sequence-to-function datasets from directed evolution that can help increase the availability of datasets for machine learning for protein engineering. This method exploits the structure of a differential equation governing natural selection for efficient inference and is capable of proposing variants with higher activity than conventional methods. Incorporating prior knowledge and structure into models of natural phenomena can support scientific discovery. Ph.D. 2022-01-14T15:17:32Z 2022-01-14T15:17:32Z 2021-06 2021-07-17T01:34:41.134Z Thesis https://hdl.handle.net/1721.1/139524 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Shen, Max Walt Modeling structured biological processes with machine learning |
title | Modeling structured biological processes with machine learning |
title_full | Modeling structured biological processes with machine learning |
title_fullStr | Modeling structured biological processes with machine learning |
title_full_unstemmed | Modeling structured biological processes with machine learning |
title_short | Modeling structured biological processes with machine learning |
title_sort | modeling structured biological processes with machine learning |
url | https://hdl.handle.net/1721.1/139524 |
work_keys_str_mv | AT shenmaxwalt modelingstructuredbiologicalprocesseswithmachinelearning |