Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder

Abstract: SARS-CoV-2’s population structure, if it can be identified, might have a substantial impact on public health management and diagnostics. It is critical to rapidly monitor and characterize the lineages circulating globally for more accurate diagnosis, improved care, and faster treatment. F...


Bibliographic Details
Main Authors:	Fayroz F. Sherif, Khaled S. Ahmed
Format:	Article
Language:	English
Published:	SpringerOpen 2022-08-01
Series:	Journal of Engineering and Applied Science
Subjects:	SARS-CoV-2, Unsupervised clustering, Deep learning, Convolutional autoencoder, Spike protein, Lineages
Online Access:	https://doi.org/10.1186/s44147-022-00125-0

_version_	1811340142550450176
author	Fayroz F. Sherif, Khaled S. Ahmed
author_facet	Fayroz F. Sherif, Khaled S. Ahmed
author_sort	Fayroz F. Sherif
collection	DOAJ
description	Abstract: SARS-CoV-2’s population structure, if it can be identified, might have a substantial impact on public health management and diagnostics. It is critical to rapidly monitor and characterize the lineages circulating globally for more accurate diagnosis, improved care, and faster treatment. For a clearer picture of the SARS-CoV-2 population structure, clustering the sequencing data is essential. Here, deep clustering techniques were used to automatically group 29,017 different strains of SARS-CoV-2 into clusters. We aim to identify the main clusters of the SARS-CoV-2 population structure using a convolutional autoencoder (CAE) trained on numerical feature vectors mapped from coronavirus Spike peptide sequences. Our clustering revealed six large SARS-CoV-2 population clusters (C1, C2, C3, C4, C5, C6). These clusters contained 43 unique lineages, across which the 29,017 publicly accessible strains were distributed. In all six clusters, the genetic distances within a cluster (intra-cluster distances) are smaller than the distances between clusters (inter-cluster distances) (P-value 0.0019, Wilcoxon rank-sum test). This indicates substantial evidence of a connection between the lineages within each cluster. Furthermore, the proposed deep learning clustering method was compared against K-means and hierarchical clustering. The intra-cluster genetic distances of the proposed method were smaller than those of K-means alone and of hierarchical clustering. We used t-distributed stochastic neighbor embedding (t-SNE) to visualize the outcomes of the deep learning clustering. The strains were separated correctly between clusters in the t-SNE plot. Our results showed that cluster C5 consists exclusively of the Gamma lineage (P.1), suggesting that strains of P.1 in C5 are more diversified than those in the other clusters. Our study indicates that the genetic similarity between strains in the same cluster enables a better understanding of the major features of the unknown population lineages when compared to some of the more prevalent viral isolates. This information helps researchers figure out how the virus changed over time and spread to people all over the world.
first_indexed	2024-04-13T18:37:45Z
format	Article
id	doaj.art-2a22e124233e43e884ff0513ecf5ee9e
institution	Directory of Open Access Journals
issn	1110-1903 2536-9512
language	English
last_indexed	2024-04-13T18:37:45Z
publishDate	2022-08-01
publisher	SpringerOpen
record_format	Article
series	Journal of Engineering and Applied Science
spelling	doaj.art-2a22e124233e43e884ff0513ecf5ee9e 2022-12-22T02:34:50Z eng SpringerOpen Journal of Engineering and Applied Science 1110-1903 2536-9512 2022-08-01 691122 10.1186/s44147-022-00125-0 Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder Fayroz F. Sherif (Computers and Systems Department, Electronics Research Institute), Khaled S. Ahmed (Biomedical Department, Faculty of Engineering, Benha University) https://doi.org/10.1186/s44147-022-00125-0 SARS-CoV-2, Unsupervised clustering, Deep learning, Convolutional autoencoder, Spike protein, Lineages
title	Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder
title_sort	unsupervised clustering of sars cov 2 using deep convolutional autoencoder
topic	SARS-CoV-2, Unsupervised clustering, Deep learning, Convolutional autoencoder, Spike protein, Lineages
url	https://doi.org/10.1186/s44147-022-00125-0
work_keys_str_mv	AT fayrozfsherif unsupervisedclusteringofsarscov2usingdeepconvolutionalautoencoder AT khaledsahmed unsupervisedclusteringofsarscov2usingdeepconvolutionalautoencoder
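
The abstract in this record reports that intra-cluster genetic distances were smaller than inter-cluster distances and that the difference was assessed with a Wilcoxon rank-sum test. The record does not say how the distances were computed, so the following is only a minimal sketch of that kind of comparison under stated assumptions: the toy peptide fragments, the cluster labels, the function names, and the use of Hamming distance as a stand-in for genetic distance are all hypothetical and are not taken from the article.

```python
import math
from itertools import combinations


def hamming(a, b):
    """Count differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def split_distances(seqs, labels):
    """Return (intra, inter) lists of pairwise distances given cluster labels."""
    intra, inter = [], []
    for i, j in combinations(range(len(seqs)), 2):
        d = hamming(seqs[i], seqs[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return intra, inter


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Ties receive midranks; the variance is NOT tie-corrected, so the
    p-value is only approximate for heavily tied or very small samples.
    """
    pooled = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    n1, n2 = len(x), len(y)
    w = sum(rank_of[v] for v in x)            # rank sum of the first sample
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided p-value
    return z, p


# Hypothetical toy data: six short peptide fragments in two assumed clusters.
seqs = ["MFVFLVLL", "MFVFLVLL", "MFVFLVLP",
        "MFIYLALL", "MFIYLALL", "MFIYLALP"]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = split_distances(seqs, labels)
z, p = rank_sum_test(intra, inter)
print(f"intra={intra} inter={inter} z={z:.2f} p={p:.4f}")
```

A negative z here means the intra-cluster distances rank lower than the inter-cluster ones. With real data, a library routine such as `scipy.stats.ranksums` would be the more usual choice than this hand-rolled approximation.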


Similar Items