Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder
Abstract SARS-CoV-2’s population structure might have a substantial impact on public health management and diagnostics if it can be identified. It is critical to rapidly monitor and characterize its lineages circulating globally for more accurate diagnosis, improved care, and faster treatment. F...
Main Authors: | Fayroz F. Sherif, Khaled S. Ahmed |
---|---|
Format: | Article |
Language: | English |
Published: | SpringerOpen 2022-08-01 |
Series: | Journal of Engineering and Applied Science |
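Subjects: | SARS-CoV-2; Unsupervised clustering; Deep learning; Convolutional autoencoder; Spike protein; Lineages |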
Online Access: | https://doi.org/10.1186/s44147-022-00125-0 |
_version_ | 1811340142550450176 |
---|---|
author | Fayroz F. Sherif; Khaled S. Ahmed |
author_facet | Fayroz F. Sherif; Khaled S. Ahmed |
author_sort | Fayroz F. Sherif |
collection | DOAJ |
description | Abstract SARS-CoV-2’s population structure might have a substantial impact on public health management and diagnostics if it can be identified. It is critical to rapidly monitor and characterize its lineages circulating globally for more accurate diagnosis, improved care, and faster treatment. For a clearer picture of the SARS-CoV-2 population structure, clustering the sequencing data is essential. Here, deep clustering techniques were used to automatically group 29,017 different strains of SARS-CoV-2 into clusters. We aim to identify the main clusters of the SARS-CoV-2 population structure using a convolutional autoencoder (CAE) trained on numerical feature vectors mapped from coronavirus Spike peptide sequences. Our clustering revealed six large SARS-CoV-2 population clusters (C1, C2, C3, C4, C5, C6). These clusters contained 43 unique lineages, across which the 29,017 publicly accessible strains were distributed. In all six resulting clusters, the genetic distances within a cluster (intra-cluster distances) are smaller than the distances between clusters (inter-cluster distances) (P-value 0.0019, Wilcoxon rank-sum test). This indicates substantial evidence of a connection between the lineages within each cluster. Furthermore, the proposed deep learning clustering method was compared against K-means and hierarchical clustering. The intra-cluster genetic distances of the proposed method were smaller than those of K-means alone and of hierarchical clustering. We used t-distributed stochastic neighbor embedding (t-SNE) to visualize the outcomes of the deep learning clustering. The strains were correctly separated between clusters in the t-SNE plot. Our results showed that cluster C5 contains only the Gamma lineage (P.1), suggesting that strains of P.1 in C5 are more diversified than those in the other clusters.
Our study indicates that the genetic similarity between strains in the same cluster enables a better understanding of the major features of the unknown population lineages when compared to some of the more prevalent viral isolates. This information helps researchers understand how the virus changed over time and spread worldwide. |
first_indexed | 2024-04-13T18:37:45Z |
format | Article |
id | doaj.art-2a22e124233e43e884ff0513ecf5ee9e |
institution | Directory Open Access Journal |
issn | 1110-1903 2536-9512 |
language | English |
last_indexed | 2024-04-13T18:37:45Z |
publishDate | 2022-08-01 |
publisher | SpringerOpen |
record_format | Article |
series | Journal of Engineering and Applied Science |
spelling | doaj.art-2a22e124233e43e884ff0513ecf5ee9e; 2022-12-22T02:34:50Z; eng; SpringerOpen; Journal of Engineering and Applied Science; 1110-1903; 2536-9512; 2022-08-01; 691122; 10.1186/s44147-022-00125-0; Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder; Fayroz F. Sherif (0); Khaled S. Ahmed (1); (0) Computers and Systems Department, Electronics Research Institute; (1) Biomedical Department, Faculty of Engineering, Benha University; https://doi.org/10.1186/s44147-022-00125-0; SARS-CoV-2; Unsupervised clustering; Deep learning; Convolutional autoencoder; Spike protein; Lineages |
title | Unsupervised clustering of SARS-CoV-2 using deep convolutional autoencoder |
title_sort | unsupervised clustering of sars cov 2 using deep convolutional autoencoder |
topic | SARS-CoV-2; Unsupervised clustering; Deep learning; Convolutional autoencoder; Spike protein; Lineages |
url | https://doi.org/10.1186/s44147-022-00125-0 |
work_keys_str_mv | AT fayrozfsherif unsupervisedclusteringofsarscov2usingdeepconvolutionalautoencoder AT khaledsahmed unsupervisedclusteringofsarscov2usingdeepconvolutionalautoencoder |
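
The description above reports that intra-cluster genetic distances are smaller than inter-cluster distances, assessed with a Wilcoxon rank-sum test. A minimal sketch of that kind of comparison follows. It is an illustration under stated assumptions, not the authors' pipeline: the record does not say which genetic distance was used, so Hamming distance between aligned peptides stands in for it; the six short sequences and their cluster labels are invented toy data; and the helper names (`hamming`, `split_distances`, `rank_sum_p`) are hypothetical. The p-value uses a normal approximation without a tie correction, so it is only approximate.

```python
from itertools import combinations
from statistics import NormalDist, mean


def hamming(a, b):
    """Count mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def split_distances(seqs, labels):
    """Return (intra, inter): pairwise distances within and between clusters."""
    intra, inter = [], []
    for i, j in combinations(range(len(seqs)), 2):
        d = hamming(seqs[i], seqs[j])
        (intra if labels[i] == labels[j] else inter).append(d)
    return intra, inter


def rank_sum_p(x, y):
    """One-sided rank-sum p-value that x tends to be smaller than y.

    Uses midranks for ties and a normal approximation with no tie
    correction in the variance (an approximation, assumed adequate here).
    """
    pooled = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    n, m = len(x), len(y)
    mu = n * (n + m + 1) / 2
    sigma = (n * m * (n + m + 1) / 12) ** 0.5
    return NormalDist().cdf((w - mu) / sigma)


# Toy data (invented for illustration, not real Spike sequences).
seqs = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHIRL",
        "MNPQRSTVWY", "MNPQRSTVWF", "MNPQKSTVWY"]
labels = ["A", "A", "A", "B", "B", "B"]

intra, inter = split_distances(seqs, labels)
p = rank_sum_p(intra, inter)
print(f"mean intra={mean(intra):.2f}, mean inter={mean(inter):.2f}, p={p:.4g}")
```

In practice a library routine such as SciPy's rank-sum test would likely replace the hand-written helper; the standard-library version is used here only to keep the sketch self-contained.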