Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are kept together rather than split across multiple clusters. Most algorithms are conservative...
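The trade-off described above can be made concrete with a small sketch. The code below computes homogeneity, completeness, and their harmonic mean (the V-measure in its standard equal-weight form) from scratch. The labels `truth`, `over_split`, and `merged` are hypothetical toy data, and the merge step is done by hand; this is only an illustration of the metrics, not the Complet+ pipeline or its code.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(truth, predicted):
    """Return (homogeneity, completeness, V-measure), equally weighted."""
    h_truth, h_pred = entropy(truth), entropy(predicted)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, predicted) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, truth) / h_pred
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2.0 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Hypothetical toy data: six sequences from two reference families.
truth = ["famA", "famA", "famA", "famA", "famB", "famB"]

# A conservative clustering splits famA into two pure clusters (0 and 1).
over_split = [0, 0, 1, 1, 2, 2]

# Merging clusters 0 and 1 by hand, standing in for a post-processing step.
merged = [0, 0, 0, 0, 2, 2]

for name, predicted in (("over-split", over_split), ("merged", merged)):
    h, c, v = v_measure(truth, predicted)
    print(f"{name}: homogeneity={h:.3f} completeness={c:.3f} V={v:.3f}")
```

In this toy case the over-split clustering is perfectly homogeneous but incomplete, and merging the two famA clusters raises completeness to 1.0 without lowering homogeneity. Real data will generally show a smaller gain and some homogeneity cost.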
Main Authors: | Rachel Nguyen, Bahrad A. Sokhansanj, Robi Polikar, Gail L. Rosen |
---|---|
Format: | Article |
Language: | English |
Published: | PeerJ Inc. 2023-02-01 |
Series: | PeerJ |
Subjects: | Protein clustering, Protein families, Homology |
Online Access: | https://peerj.com/articles/14779.pdf |
_version_ | 1827610854199132160 |
---|---|
author | Rachel Nguyen Bahrad A. Sokhansanj Robi Polikar Gail L. Rosen |
collection | DOAJ |
description | A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are kept together rather than split across multiple clusters. Most algorithms are conservative in grouping sequences with other sequences. Remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters. The resulting clusters have high homogeneity but completeness that is too low. We propose Complet+, a computationally scalable post-processing method to increase the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related clusters of proteins that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2’s clusterupdate increases the V-measure by 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by a substantial increase in Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with respect to the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus. |
first_indexed | 2024-03-09T07:56:48Z |
format | Article |
id | doaj.art-964e4597d0f8411aaef97cf7919a6a94 |
institution | Directory Open Access Journal |
issn | 2167-8359 |
language | English |
last_indexed | 2024-03-09T07:56:48Z |
publishDate | 2023-02-01 |
publisher | PeerJ Inc. |
record_format | Article |
series | PeerJ |
spelling | doaj.art-964e4597d0f8411aaef97cf7919a6a94; 2023-12-03T00:57:02Z; eng; PeerJ Inc.; PeerJ; 2167-8359; 2023-02-01; 11; e14779; 10.7717/peerj.14779; Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering; Rachel Nguyen (0), Bahrad A. Sokhansanj (1), Robi Polikar (2), Gail L. Rosen (3); (0) Drexel University, Philadelphia, United States of America; (1) Drexel University, Philadelphia, United States of America; (2) Rowan University, Glassboro, NJ, United States of America; (3) Drexel University, Philadelphia, United States of America; https://peerj.com/articles/14779.pdf; Protein clustering; Protein families; Homology |
title | Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering |
topic | Protein clustering Protein families Homology |
url | https://peerj.com/articles/14779.pdf |