Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants

The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail, unlike any virus before it. On the one hand, this will help biologists, policymakers, an...

Bibliographic Details
Main Authors:	Zahra Tayebi, Sarwan Ali, Murray Patterson
Format:	Article
Language:	English
Published:	MDPI AG 2021-11-01
Series:	Algorithms
Subjects:	COVID-19; SARS-CoV-2; spike protein sequences; cluster analysis; feature selection; k-mers
Online Access:	https://www.mdpi.com/1999-4893/14/12/348

_version_	1827674524088270848
author	Zahra Tayebi, Sarwan Ali, Murray Patterson
author_facet	Zahra Tayebi, Sarwan Ali, Murray Patterson
author_sort	Zahra Tayebi
collection	DOAJ
description	The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unlike that for any virus before it. On the one hand, this will help biologists, policymakers, and other authorities to make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help to deal more effectively with any possible future pandemic. Since the SARS-CoV-2 virus has different variants, each with different mutations, performing any analysis on such data is a difficult task, given the size of the data. It is well known that variation in the SARS-CoV-2 genome occurs disproportionately in the spike region of the genome sequence, the relatively short region that codes for the spike protein(s). In this paper, we propose a robust feature-vector representation of biological sequences that, when combined with the appropriate feature selection method, allows different downstream clustering approaches to perform well on a variety of measures. We use this approach with an array of clustering techniques to cluster spike protein sequences in order to study the behavior of the known variants that are increasing at a very high rate throughout the world. We first use a k-mers based approach to generate a fixed-length feature vector representation of the spike sequences. We then show that, with the appropriate feature selection, we can efficiently and effectively cluster the spike sequences by variant. Using a publicly available set of SARS-CoV-2 spike sequences, we cluster these sequences using both hard and soft clustering methods and show that, with our feature selection methods, we can achieve higher F1 scores for the clusters and better clustering quality metrics than the baselines.
first_indexed	2024-03-10T04:40:39Z
format	Article
id	doaj.art-8dbe01dce3f84168a1d8452725532457
institution	Directory of Open Access Journals
issn	1999-4893
language	English
last_indexed	2024-03-10T04:40:39Z
publishDate	2021-11-01
publisher	MDPI AG
record_format	Article
series	Algorithms
spelling	doaj.art-8dbe01dce3f84168a1d8452725532457 | 2023-11-23T03:24:47Z | eng | MDPI AG | Algorithms | 1999-4893 | 2021-11-01 | vol. 14, no. 12, art. 348 | 10.3390/a14120348 | Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants | Zahra Tayebi, Sarwan Ali, Murray Patterson (all: Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA) | https://www.mdpi.com/1999-4893/14/12/348 | COVID-19; SARS-CoV-2; spike protein sequences; cluster analysis; feature selection; k-mers
title	Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants
topic	COVID-19; SARS-CoV-2; spike protein sequences; cluster analysis; feature selection; k-mers
url	https://www.mdpi.com/1999-4893/14/12/348

Similar Items