Protein language representation learning to predict SARS-CoV-2 mutational landscape

<p>As the SARS-CoV-2 pandemic has spread globally, numerous variants have emerged, each with distinct transmission and infection rates and differing capacities to evade antibody neutralisation. Early discovery of high-risk mutations is critical for data-informed therapeutic design and effective pandemic management. This dissertation explores the application of language models, commonly used for text processing, to SARS-CoV-2 spike protein sequences, which are strings of amino acids written as letters. Deep protein language models are revolutionising protein biology, and this work introduces two novel models: <em>CoVBERT</em>, a sequence-only transformer encoder for predicting point mutations, and <em>MuFormer</em>, which leverages both sequence and structural space to design mutational protein sequences iteratively. CoVBERT predicts highly transmissible mutations, including <em>D614G</em>, with a masked marginal log-likelihood of 0.95, surpassing state-of-the-art large protein language models, which suggests that large language models can capture in vitro mutagenesis by learning the language of evolution.</p>

<p>MuFormer generates <em>de novo</em> protein sequences, using AlphaFold2 for fixed backbone design, and produces evolutionarily novel mutational sequences by injecting representations derived from state-of-the-art protein language models. The generated sequences were validated against historical data, demonstrating MuFormer's ability to capture phylogenetic properties, for example generating Omicron- and Delta-like mutations when given the Alpha variant as input. MuFormer conditions not only on the sequence but also on the structure, generating protein sequences and structures end to end under two optimisation strategies: fixed backbone design (MuFormer-fixbb) and backbone atom optimisation (MuFormer-bba). Both variants outperformed AlphaFold2 on the mutational sequence generation task across several structure and sequence likelihood metrics. These results underline the potential of large language models, often termed foundation models, to learn the representational language of biology and to help control pandemics by predicting more infectious mutations in advance.</p>
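To make the masked-marginal idea concrete, the sketch below shows how a masked protein language model can score a candidate point mutation such as D614G. CoVBERT itself is not reproduced here, so a small public ESM-2 checkpoint (facebook/esm2_t6_8M_UR50D) and the Hugging Face transformers API stand in; the score shown is the standard mutant-minus-wild-type masked log-likelihood ratio, and the thesis's exact scoring procedure and model may differ.

```python
# Minimal sketch of masked-marginal scoring for a single point mutation.
# CoVBERT is not publicly reproduced here, so a small ESM-2 checkpoint
# stands in for "a masked protein language model" (an assumption).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # stand-in masked protein LM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

def masked_marginal_score(sequence: str, pos: int, wt: str, mut: str) -> float:
    """log p(mut | masked context) - log p(wt | masked context) at 0-based pos."""
    assert sequence[pos] == wt, "wild-type residue does not match the sequence"
    enc = tokenizer(sequence, return_tensors="pt")
    input_ids = enc["input_ids"].clone()
    input_ids[0, pos + 1] = tokenizer.mask_token_id  # +1 skips the <cls> token
    with torch.no_grad():
        logits = model(input_ids=input_ids,
                       attention_mask=enc["attention_mask"]).logits
    log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
    wt_id = tokenizer.convert_tokens_to_ids(wt)
    mut_id = tokenizer.convert_tokens_to_ids(mut)
    return (log_probs[mut_id] - log_probs[wt_id]).item()

# Example: score the D614G substitution in a spike sequence `spike_seq`
# (residue 614 is index 613 in 0-based coordinates).
# print(masked_marginal_score(spike_seq, 613, "D", "G"))
```

A strongly positive score means the model assigns the mutant residue a higher probability than the wild type at the masked position, flagging the substitution as evolutionarily plausible.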


Bibliographic Details
Main Author: Batra, H
Other Authors: Minary, P
Format: Thesis
Language: English
Published: 2022
Subjects: Machine Learning, Natural Language Processing, Computational Biology
Identifier: oxford-uuid:a02856b7-8318-40f1-8478-366a47264ab0
Institution: University of Oxford