PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning.

Bibliographic Details
Main Authors: Anand Ramachandran, Steven S Lumetta, Deming Chen
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2024-01-01
Series: PLoS Computational Biology
Online Access: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011790&type=printable
author Anand Ramachandran
Steven S Lumetta
Deming Chen
collection DOAJ
description One of the challenges in a viral pandemic is the emergence of novel variants with different phenotypical characteristics. An ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this article, we explore, in the context of a viral pandemic, the problem of generating complete instances of undiscovered viral protein sequences, which have a high likelihood of being discovered in the future using protein language models. Current approaches to training these models fit model parameters to a known sequence set, which does not suit pandemic forecasting as future sequences differ from known sequences in some respects. To address this, we develop a novel method, called PandoGen, to train protein language models towards the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences, with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30× larger. Our method forecasts unseen lineages months in advance, whereas models 4× and 30× larger forecast almost no new lineages. When trained on data available up to a month before the onset of important Variants of Concern, our method consistently forecasts sequences belonging to those variants within tight sequence budgets.
first_indexed 2024-03-08T06:20:07Z
format Article
id doaj.art-26ff7579ec0d47c5b2adeb9fbbe90a07
institution Directory Open Access Journal
issn 1553-734X
1553-7358
language English
last_indexed 2024-03-08T06:20:07Z
publishDate 2024-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Computational Biology
spelling doaj.art-26ff7579ec0d47c5b2adeb9fbbe90a07; 2024-02-04T05:30:48Z; eng; Public Library of Science (PLoS); PLoS Computational Biology; 1553-734X; 1553-7358; 2024-01-01; 20(1): e1011790; 10.1371/journal.pcbi.1011790; PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning.; Anand Ramachandran; Steven S Lumetta; Deming Chen; https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011790&type=printable
title PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning.
url https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1011790&type=printable