Sequence signatures within the genome of SARS-CoV-2 can be used to predict host source
ABSTRACT: We conducted an in silico analysis to better understand the potential factors impacting host adaptation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in white-tailed deer, humans, and mink due to the strong evidence of sustained transmission within these hosts. Classificati...
Main Authors: | Josip Rudar, Peter Kruczkiewicz, Oksana Vernygora, G. Brian Golding, Mehrdad Hajibabaei, Oliver Lung |
---|---|
Format: | Article |
Language: | English |
Published: | American Society for Microbiology, 2024-04-01 |
Series: | Microbiology Spectrum |
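Subjects: | machine learning; COVID-19; viral host adaptation; selection pressure; metric learning; biomarker discovery |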
Online Access: | https://journals.asm.org/doi/10.1128/spectrum.03584-23 |
_version_ | 1797228876204605440 |
---|---|
author | Josip Rudar Peter Kruczkiewicz Oksana Vernygora G. Brian Golding Mehrdad Hajibabaei Oliver Lung |
author_sort | Josip Rudar |
collection | DOAJ |
description | ABSTRACT: We conducted an in silico analysis to better understand the potential factors impacting host adaptation of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in white-tailed deer, humans, and mink due to the strong evidence of sustained transmission within these hosts. Classification models trained on single nucleotide and amino acid differences between samples effectively identified white-tailed deer-, human-, and mink-derived SARS-CoV-2. For example, the balanced accuracy score of Extremely Randomized Trees classifiers was 0.984 ± 0.006. Eighty-eight commonly identified predictive mutations are found at sites under strong positive and negative selective pressure. A large fraction of sites under selection (86.9%) or identified by machine learning (87.1%) are found in genes other than the spike. Some locations encoded by these gene regions are predicted to be B- and T-cell epitopes or are implicated in modulating the immune response suggesting that host adaptation may involve the evasion of the host immune system, modulation of the class-I major-histocompatibility complex, and the diminished recognition of immune epitopes by CD8+ T cells. Our selection and machine learning analysis also identified that silent mutations, such as C7303T and C9430T, play an important role in discriminating deer-derived samples across multiple clades. Finally, our investigation into the origin of the B.1.641 lineage from white-tailed deer in Canada discovered an additional human sequence from Michigan related to the B.1.641 lineage sampled near the emergence of this lineage. These findings demonstrate that machine-learning approaches can be used in combination with evolutionary genomics to identify factors possibly involved in the cross-species transmission of viruses and the emergence of novel viral lineages. IMPORTANCE: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly transmissible virus capable of infecting and establishing itself in human and wildlife populations, such as white-tailed deer. This fact highlights the importance of developing novel ways to identify genetic factors that contribute to its spread and adaptation to new host species. This is especially important since these populations can serve as reservoirs that potentially facilitate the re-introduction of new variants into human populations. In this study, we apply machine learning and phylogenetic methods to uncover biomarkers of SARS-CoV-2 adaptation in mink and white-tailed deer. We find evidence demonstrating that both non-synonymous and silent mutations can be used to differentiate animal-derived sequences from human-derived ones and each other. This evidence also suggests that host adaptation involves the evasion of the immune system and the suppression of antigen presentation. Finally, the methods developed here are general and can be used to investigate host adaptation in viruses other than SARS-CoV-2. |
first_indexed | 2024-04-24T15:03:39Z |
format | Article |
id | doaj.art-e059194ef17d4e6896be8a038f62e660 |
institution | Directory Open Access Journal |
issn | 2165-0497 |
language | English |
last_indexed | 2024-04-24T15:03:39Z |
publishDate | 2024-04-01 |
publisher | American Society for Microbiology |
record_format | Article |
series | Microbiology Spectrum |
spelling | doaj.art-e059194ef17d4e6896be8a038f62e660; 2024-04-02T14:16:18Z; eng; American Society for Microbiology; Microbiology Spectrum; 2165-0497; 2024-04-01; 12; 4; 10.1128/spectrum.03584-23; Sequence signatures within the genome of SARS-CoV-2 can be used to predict host source; Josip Rudar (0), Peter Kruczkiewicz (1), Oksana Vernygora (2), G. Brian Golding (3), Mehrdad Hajibabaei (4), Oliver Lung (5); (0) National Centre for Foreign Animal Disease, Canadian Food Inspection Agency, Winnipeg, Manitoba, Canada; (1) National Centre for Foreign Animal Disease, Canadian Food Inspection Agency, Winnipeg, Manitoba, Canada; (2) National Centre for Foreign Animal Disease, Canadian Food Inspection Agency, Winnipeg, Manitoba, Canada; (3) Department of Biology, McMaster University, Hamilton, Ontario, Canada; (4) Department of Integrative Biology & Centre for Biodiversity Genomics, University of Guelph, Guelph, Ontario, Canada; (5) National Centre for Foreign Animal Disease, Canadian Food Inspection Agency, Winnipeg, Manitoba, Canada; https://journals.asm.org/doi/10.1128/spectrum.03584-23; machine learning; COVID-19; viral host adaptation; selection pressure; metric learning; biomarker discovery |
title | Sequence signatures within the genome of SARS-CoV-2 can be used to predict host source |
topic | machine learning COVID-19 viral host adaptation selection pressure metric learning biomarker discovery |
url | https://journals.asm.org/doi/10.1128/spectrum.03584-23 |