Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest

Bibliographic Details
Main Authors: Georg Hahn, Sanghun Lee, Dmitry Prokopenko, Jonathan Abraham, Tanya Novak, Julian Hecker, Michael Cho, Surender Khurana, Lindsey R. Baden, Adrienne G. Randolph, Scott T. Weiss, Christoph Lange
Format: Article
Language: English
Published: BMC 2022-12-01
Series: BMC Bioinformatics
Subjects: SARS-CoV-2; Nucleotide sequences; Outlier detection; Variants of interest; Machine learning
Online Access: https://doi.org/10.1186/s12859-022-05105-y
author Georg Hahn
Sanghun Lee
Dmitry Prokopenko
Jonathan Abraham
Tanya Novak
Julian Hecker
Michael Cho
Surender Khurana
Lindsey R. Baden
Adrienne G. Randolph
Scott T. Weiss
Christoph Lange
collection DOAJ
description Abstract As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemic. We start by assessing the similarity of all pairs of nucleotide sequences using the Jaccard index and principal component analysis. As shown previously in the literature, an unsupervised cluster analysis applied to the SARS-CoV-2 genomes results in clusters of sequences according to certain characteristics such as their strain or their clade. Importantly, we observe that nucleotide sequences of common variants are often outliers in clusters of sequences stemming from variants identified earlier on during the pandemic. Motivated by this finding, we are interested in applying outlier detection to nucleotide sequences. We demonstrate that nucleotide sequences of common variants (such as alpha, delta, or omicron) can be identified solely based on a statistical outlier criterion. We argue that outlier detection might be a useful surveillance tool to identify emerging variants in real time as the pandemic progresses.
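The abstract's pipeline (pairwise Jaccard similarity followed by a statistical outlier criterion) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: each sequence is represented as a set of positions at which it differs from a reference, the function names `jaccard`, `jaccard_matrix`, and `flag_outliers` are hypothetical, and the median/MAD rule stands in for whatever criterion the authors actually use. The principal component step mentioned in the abstract is omitted.

```python
from statistics import median


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def jaccard_matrix(mutation_sets):
    """Symmetric matrix of pairwise Jaccard similarities."""
    n = len(mutation_sets)
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim[i][j] = sim[j][i] = jaccard(mutation_sets[i], mutation_sets[j])
    return sim


def flag_outliers(sim, k=3.0):
    """Indices whose mean similarity to all other sequences lies more than
    k median absolute deviations below the median (assumed criterion).
    If the MAD is zero, every score strictly below the median is flagged."""
    n = len(sim)
    # Mean similarity to the others, excluding the diagonal entry.
    scores = [(sum(row) - row[i]) / (n - 1) for i, row in enumerate(sim)]
    med = median(scores)
    mad = median([abs(s - med) for s in scores])
    threshold = med - k * mad
    return [i for i, s in enumerate(scores) if s < threshold]


# Made-up toy data: sets of positions that differ from a reference.
toy_sets = [
    {1, 2, 3, 4},
    {1, 2, 3, 4, 5},
    {1, 2, 3, 4},
    {1, 2, 3, 4, 6},
    {1, 2, 3, 4, 5},
    {1, 2, 10, 11, 12, 13, 14, 15},  # shares little with the others
]
flagged = flag_outliers(jaccard_matrix(toy_sets))
```

On this toy input only the last set, which shares just two positions with the rest, falls below the threshold. The pairwise matrix is quadratic in the number of sequences, so a sketch like this would need subsampling or an approximate similarity measure before it could be applied to a collection of the size the abstract describes.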
first_indexed 2024-04-11T05:02:58Z
format Article
id doaj.art-428c807d66334f48a8d811a6960611da
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-04-11T05:02:58Z
publishDate 2022-12-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-428c807d66334f48a8d811a6960611da
Record updated: 2022-12-25T12:32:06Z
Language: eng
Publisher: BMC
Journal: BMC Bioinformatics
ISSN: 1471-2105
Published: 2022-12-01
Volume 23, issue 1, pages 1-18
DOI: 10.1186/s12859-022-05105-y
Title: Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest
Authors and affiliations (numbered as in the record):
0 Georg Hahn: Department of Biostatistics, T.H. Chan School of Public Health, Harvard University
1 Sanghun Lee: Department of Biostatistics, T.H. Chan School of Public Health, Harvard University
2 Dmitry Prokopenko: Genetics and Aging Research Unit, Department of Neurology, McCance Center for Brain Health, Massachusetts General Hospital
3 Jonathan Abraham: Department of Microbiology, Harvard Medical School, Blavatnik Institute
4 Tanya Novak: Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children’s Hospital
5 Julian Hecker: Harvard Medical School, Harvard University
6 Michael Cho: Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital
7 Surender Khurana: Food and Drug Administration
8 Lindsey R. Baden: Division of Infectious Diseases, Harvard Medical School, Brigham and Women’s Hospital
9 Adrienne G. Randolph: Department of Anesthesiology, Critical Care and Pain Medicine, Boston Children’s Hospital
10 Scott T. Weiss: Harvard Medical School, Harvard University
11 Christoph Lange: Department of Biostatistics, T.H. Chan School of Public Health, Harvard University
Keywords: SARS-CoV-2; Nucleotide sequences; Outlier detection; Variants of interest; Machine learning
title Unsupervised outlier detection applied to SARS-CoV-2 nucleotide sequences can identify sequences of common variants and other variants of interest
topic SARS-CoV-2
Nucleotide sequences
Outlier detection
Variants of interest
Machine learning
url https://doi.org/10.1186/s12859-022-05105-y