NGS data vectorization, clustering, and finding key codons in SARS-CoV-2 variations

Abstract The rapid global spread and dissemination of SARS-CoV-2 has provided the virus with numerous opportunities to develop several variants. Thus, it is critical to determine the degree of the variations and in which part of the virus those variations occurred. Therefore, in this study, methods...


Bibliographic Details
Main Authors:	Juhyeon Kim, Saeyeon Cheon, Insung Ahn
Format:	Article
Language:	English
Published:	BMC 2022-05-01
Series:	BMC Bioinformatics
Subjects:	SARS-CoV-2; Protein sequence analysis; Sequence data pre-process; t-Stochastic neighbour embedding; Density based spatial clustering of applications with noise; Clustering
Online Access:	https://doi.org/10.1186/s12859-022-04718-7

collection	DOAJ
description	Abstract: The rapid global spread of SARS-CoV-2 has given the virus numerous opportunities to develop variants. It is therefore critical to determine the degree of variation and the part of the virus in which the variations occurred. This study proposes machine learning methods to vectorize sequence data, perform clustering analysis, and visualize the results. A total of 224,073 SARS-CoV-2 sequences were collected from NCBI and GISAID, and the data were visualized using dimensionality reduction and clustering models such as t-SNE and DBSCAN. In the clustering results, the originally detected SARS-CoV-2 virus was distinguished from later variants, including Omicron and Delta. Furthermore, feature importance extraction methods such as Random Forest or Shapley values made it possible to examine which codon changes in the spike protein caused the variants to be distinguished. Compared with existing tree-based sequence analysis, the proposed method has the advantage of analysing and visualizing a large amount of data at once. It identified and visualized significant changes between the SARS-CoV-2 virus first detected in Wuhan, China, in December 2019 and the newly formed mutant virus groups. Clustering of the sequence data confirmed the formation of clusters among variants in a two-dimensional graph, and extracting variable importance confirmed which codon changes played a major role in distinguishing variants. Because the proposed method can handle a variety of sequence data, it can be used for a wide range of diseases, including influenza and SARS-CoV-2. The proposed method therefore has the potential to become widely used for the effective analysis of disease variations.
first_indexed	2024-04-13T18:51:38Z
format	Article
id	doaj.art-477c2ad67965430784e81f6ec1fdfc2f
institution	Directory of Open Access Journals
issn	1471-2105
language	English
last_indexed	2024-04-13T18:51:38Z
publishDate	2022-05-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-477c2ad67965430784e81f6ec1fdfc2f; 2022-12-22T02:34:24Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2022-05-01; 231124; 10.1186/s12859-022-04718-7; NGS data vectorization, clustering, and finding key codons in SARS-CoV-2 variations; Juhyeon Kim (0), Saeyeon Cheon (1), Insung Ahn (2); (0) Department of Data-Centric Problem Solving Research, Korea Institute of Science and Technology Information; (1) Applied Artificial Intelligence Major, University of Science & Technology; (2) Department of Data-Centric Problem Solving Research, Korea Institute of Science and Technology Information; https://doi.org/10.1186/s12859-022-04718-7; SARS-CoV-2; Protein sequence analysis; Sequence data pre-process; t-Stochastic neighbour embedding; Density based spatial clustering of applications with noise; Clustering
title	NGS data vectorization, clustering, and finding key codons in SARS-CoV-2 variations
url	https://doi.org/10.1186/s12859-022-04718-7
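
The workflow summarized in the abstract (vectorize sequences, reduce to two dimensions, cluster, then rank codon positions by importance) can be sketched roughly as follows. This is a minimal illustration on hypothetical toy data, not the authors' implementation: the one-hot codon encoding, the helper names `vectorize` and `make_toy_sequences`, the toy key positions, and all parameter values (`perplexity`, `eps`, `min_samples`, `n_estimators`) are assumptions chosen for the example. It uses scikit-learn's `TSNE`, `DBSCAN`, and `RandomForestClassifier`, and leaves out the Shapley value step.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import TSNE

BASES = "ACGT"
# All 64 codons in a fixed order, so each codon maps to a stable column offset.
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def vectorize(sequences):
    """One-hot encode aligned, equal-length coding sequences codon by codon."""
    n_codons = len(sequences[0]) // 3
    X = np.zeros((len(sequences), n_codons * 64), dtype=np.float32)
    for row, seq in enumerate(sequences):
        for pos in range(n_codons):
            X[row, pos * 64 + CODON_INDEX[seq[3 * pos:3 * pos + 3]]] = 1.0
    return X


def make_toy_sequences(n_per_group=30, n_codons=20, key_positions=(2, 7, 15), seed=0):
    """Hypothetical data: a random reference group and a variant group that
    differs from it at the given key codon positions."""
    rng = np.random.default_rng(seed)
    reference = [CODONS[i] for i in rng.integers(0, 64, size=n_codons)]
    variant = list(reference)
    for pos in key_positions:
        # Shift to the next codon in the table so the variant always differs here.
        variant[pos] = CODONS[(CODON_INDEX[variant[pos]] + 1) % 64]
    background = [p for p in range(n_codons) if p not in key_positions]
    sequences = []
    for template in (reference, variant):
        for _ in range(n_per_group):
            codons = list(template)
            # Redraw one non-key codon at random to mimic background variation.
            codons[rng.choice(background)] = CODONS[rng.integers(0, 64)]
            sequences.append("".join(codons))
    return sequences


sequences = make_toy_sequences()
X = vectorize(sequences)

# Dimensionality reduction to two dimensions for plotting and clustering.
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

# Density-based clustering on the 2-D embedding; eps is a guess that needs
# tuning, because the scale of a t-SNE embedding varies from run to run.
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(embedding)

# Fit a random forest to predict the cluster label (noise, -1, is treated as
# its own class), then sum the importances of the 64 columns of each codon.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
importance_per_codon = forest.feature_importances_.reshape(-1, 64).sum(axis=1)

top_positions = np.argsort(importance_per_codon)[::-1][:3]
print("clusters found:", sorted(set(labels.tolist())))
print("highest-importance codon positions:", top_positions.tolist())
```

If the clusters separate the two toy groups, the highest-importance positions would be expected to coincide with the planted key positions; on real data, the encoding and the clustering parameters would need to be chosen and validated for the dataset at hand.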


Similar Items