Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder

Bibliographic Details
Main Authors: Fayroz F. Sherif, Khaled S. Ahmed
Format: Article
Language: English
Published: SpringerOpen 2022-08-01
Series: Journal of Engineering and Applied Science
Subjects:
Online Access: https://doi.org/10.1186/s44147-022-00125-0
collection DOAJ
description Abstract: Identifying the population structure of SARS-CoV-2 could substantially benefit public health management and diagnostics. Rapid monitoring and characterization of the lineages circulating globally is critical for more accurate diagnosis, improved care, and faster treatment. Clustering the sequencing data is essential for a clearer picture of the SARS-CoV-2 population structure. Here, deep clustering techniques were used to automatically group 29,017 strains of SARS-CoV-2 into clusters. We aim to identify the main clusters of the SARS-CoV-2 population structure using a convolutional autoencoder (CAE) trained on numerical feature vectors mapped from coronavirus Spike peptide sequences. The clustering revealed six large SARS-CoV-2 population clusters (C1, C2, C3, C4, C5, C6). The 29,017 publicly accessible strains in these clusters span 43 unique lineages. In all six clusters, the genetic distances within a cluster (intra-cluster distances) are smaller than the distances between clusters (inter-cluster distances) (P = 0.0019, Wilcoxon rank-sum test), which provides substantial evidence of a relationship among the lineages within each cluster. The proposed deep learning clustering method was also compared with K-means and hierarchical clustering; its intra-cluster genetic distances were smaller than those of K-means alone and of hierarchical clustering. We used t-distributed stochastic neighbor embedding (t-SNE) to visualize the outcomes of the deep learning clustering, and the strains were well separated between clusters in the t-SNE plot. Cluster C5 contains only the Gamma lineage (P.1), suggesting that P.1 strains in C5 are more diversified than those in the other clusters. Our study indicates that the genetic similarity between strains in the same cluster enables a better understanding of the major features of unknown population lineages when they are compared with some of the more prevalent viral isolates. This information helps researchers understand how the virus changed over time and spread worldwide.
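The intra- versus inter-cluster comparison described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' pipeline: Hamming distance on aligned fragments stands in for the unspecified "genetic distance", the six short peptide strings and their cluster labels are invented toy data, and the helper names (`hamming`, `split_distances`, `rank_sum_test`) are hypothetical. The test is the two-sided Wilcoxon rank-sum test under the normal approximation, without tie correction.

```python
import math
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def split_distances(seqs, labels):
    """Split all pairwise distances into intra-cluster and inter-cluster lists."""
    intra, inter = [], []
    for i, j in combinations(range(len(seqs)), 2):
        d = hamming(seqs[i], seqs[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return intra, inter


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum_x - expected) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p


# Toy data (invented): two clusters of three aligned peptide fragments each.
seqs = ["MFVFLVLL", "MFVFLVLP", "MFVFLVPP", "KRSNQDEH", "KRSNQDEY", "KRSNQDYY"]
labels = [0, 0, 0, 1, 1, 1]

intra, inter = split_distances(seqs, labels)
z, p = rank_sum_test(intra, inter)
print(f"intra={sorted(intra)} inter={sorted(inter)} z={z:.3f} p={p:.4f}")
```

A negative `z` with a small `p` indicates that intra-cluster distances tend to be smaller than inter-cluster distances, which is the pattern the abstract reports for its six clusters. On real data, a library routine such as `scipy.stats.ranksums` would be the more usual choice.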
first_indexed 2024-04-13T18:37:45Z
format Article
id doaj.art-2a22e124233e43e884ff0513ecf5ee9e
institution Directory Open Access Journal
issn 1110-1903
2536-9512
language English
last_indexed 2024-04-13T18:37:45Z
publishDate 2022-08-01
publisher SpringerOpen
record_format Article
series Journal of Engineering and Applied Science
spelling doaj.art-2a22e124233e43e884ff0513ecf5ee9e
2022-12-22T02:34:50Z
eng
SpringerOpen
Journal of Engineering and Applied Science
1110-1903
2536-9512
2022-08-01
691122
10.1186/s44147-022-00125-0
Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder
Fayroz F. Sherif (Computers and Systems Department, Electronics Research Institute)
Khaled S. Ahmed (Biomedical Department, Faculty of Engineering, Benha University)
https://doi.org/10.1186/s44147-022-00125-0
topic SARS-CoV-2
Unsupervised clustering
Deep learning
Convolutional autoencoder
Spike protein
Lineages
url https://doi.org/10.1186/s44147-022-00125-0