Sequence signatures within the genome of SARS-CoV-2 can be used to predict host source

ABSTRACT We conducted an in silico analysis to better understand the potential factors impacting host adaptation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in white-tailed deer, humans, and mink due to the strong evidence of sustained transmission within these hosts. Classificati...

Bibliographic Details
Main Authors: Josip Rudar, Peter Kruczkiewicz, Oksana Vernygora, G. Brian Golding, Mehrdad Hajibabaei, Oliver Lung
Format: Article
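Language: English
Published: American Society for Microbiology 2024-04-01
Series: Microbiology Spectrum
Subjects: machine learning, COVID-19, viral host adaptation, selection pressure, metric learning, biomarker discovery
Online Access: https://journals.asm.org/doi/10.1128/spectrum.03584-23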
author Josip Rudar
Peter Kruczkiewicz
Oksana Vernygora
G. Brian Golding
Mehrdad Hajibabaei
Oliver Lung
author_sort Josip Rudar
collection DOAJ
description ABSTRACT We conducted an in silico analysis to better understand the potential factors impacting host adaptation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in white-tailed deer, humans, and mink due to the strong evidence of sustained transmission within these hosts. Classification models trained on single nucleotide and amino acid differences between samples effectively identified white-tailed deer-, human-, and mink-derived SARS-CoV-2. For example, the balanced accuracy score of Extremely Randomized Trees classifiers was 0.984 ± 0.006. Eighty-eight commonly identified predictive mutations are found at sites under strong positive and negative selective pressure. A large fraction of sites under selection (86.9%) or identified by machine learning (87.1%) are found in genes other than the spike. Some locations encoded by these gene regions are predicted to be B- and T-cell epitopes or are implicated in modulating the immune response suggesting that host adaptation may involve the evasion of the host immune system, modulation of the class-I major-histocompatibility complex, and the diminished recognition of immune epitopes by CD8+ T cells. Our selection and machine learning analysis also identified that silent mutations, such as C7303T and C9430T, play an important role in discriminating deer-derived samples across multiple clades. Finally, our investigation into the origin of the B.1.641 lineage from white-tailed deer in Canada discovered an additional human sequence from Michigan related to the B.1.641 lineage sampled near the emergence of this lineage. These findings demonstrate that machine-learning approaches can be used in combination with evolutionary genomics to identify factors possibly involved in the cross-species transmission of viruses and the emergence of novel viral lineages.
IMPORTANCE Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly transmissible virus capable of infecting and establishing itself in human and wildlife populations, such as white-tailed deer. This fact highlights the importance of developing novel ways to identify genetic factors that contribute to its spread and adaptation to new host species. This is especially important since these populations can serve as reservoirs that potentially facilitate the re-introduction of new variants into human populations. In this study, we apply machine learning and phylogenetic methods to uncover biomarkers of SARS-CoV-2 adaptation in mink and white-tailed deer. We find evidence demonstrating that both non-synonymous and silent mutations can be used to differentiate animal-derived sequences from human-derived ones and each other. This evidence also suggests that host adaptation involves the evasion of the immune system and the suppression of antigen presentation. Finally, the methods developed here are general and can be used to investigate host adaptation in viruses other than SARS-CoV-2.
first_indexed 2024-04-24T15:03:39Z
format Article
id doaj.art-e059194ef17d4e6896be8a038f62e660
institution Directory Open Access Journal
issn 2165-0497
language English
last_indexed 2024-04-24T15:03:39Z
publishDate 2024-04-01
publisher American Society for Microbiology
record_format Article
series Microbiology Spectrum
spelling doaj.art-e059194ef17d4e6896be8a038f62e660
2024-04-02T14:16:18Z
eng
American Society for Microbiology
Microbiology Spectrum
2165-0497
2024-04-01
volume 12, issue 4
10.1128/spectrum.03584-23
Sequence signatures within the genome of SARS-CoV-2 can be used to predict host source
Josip Rudar, National Centre for Foreign Animal Disease, Canadian Food Inspection Agency, Winnipeg, Manitoba, Canada
Peter Kruczkiewicz, National Centre for Foreign Animal Disease, Canadian Food Inspection Agency, Winnipeg, Manitoba, Canada
Oksana Vernygora, National Centre for Foreign Animal Disease, Canadian Food Inspection Agency, Winnipeg, Manitoba, Canada
G. Brian Golding, Department of Biology, McMaster University, Hamilton, Ontario, Canada
Mehrdad Hajibabaei, Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada
Oliver Lung, National Centre for Foreign Animal Disease, Canadian Food Inspection Agency, Winnipeg, Manitoba, Canada
https://journals.asm.org/doi/10.1128/spectrum.03584-23
machine learning
COVID-19
viral host adaptation
selection pressure
metric learning
biomarker discovery
title Sequence signatures within the genome of SARS-CoV-2 can be used to predict host source
topic machine learning
COVID-19
viral host adaptation
selection pressure
metric learning
biomarker discovery
url https://journals.asm.org/doi/10.1128/spectrum.03584-23