Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering

A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are grouped together rather than broken up into multiple clusters. Most algorithms are conservative...


Bibliographic Details
Main Authors: Rachel Nguyen, Bahrad A. Sokhansanj, Robi Polikar, Gail L. Rosen
Format: Article
Language: English
Published: PeerJ Inc. 2023-02-01
Series: PeerJ
Subjects: Protein clustering; Protein families; Homology
Online Access: https://peerj.com/articles/14779.pdf
author Rachel Nguyen
Bahrad A. Sokhansanj
Robi Polikar
Gail L. Rosen
collection DOAJ
description A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are grouped together rather than broken up into multiple clusters. Most algorithms are conservative in grouping sequences with other sequences. As a result, remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters, so the resulting clusters have high homogeneity but completeness that is too low. We propose Complet+, a computationally scalable post-processing method to increase the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related clusters of proteins that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2’s clusterupdate achieves an increase in V-measure of 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by a substantial increase in the Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus.
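The homogeneity/completeness trade-off named in the description can be illustrated with a small sketch. It is not the Complet+ method or its code; it only computes homogeneity, completeness, and V-measure from their standard entropy-based definitions. The family labels and the two clusterings (`families`, `conservative`, `merged`) are hypothetical toy data chosen for illustration.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given), computed from two paired label lists."""
    n = len(target)
    joint = Counter(zip(target, given))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (_, g), c in joint.items())


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, V-measure) for a clustering."""
    h_classes, h_clusters = entropy(classes), entropy(clusters)
    homogeneity = (
        1.0 if h_classes == 0
        else 1.0 - conditional_entropy(classes, clusters) / h_classes
    )
    completeness = (
        1.0 if h_clusters == 0
        else 1.0 - conditional_entropy(clusters, classes) / h_clusters
    )
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Hypothetical toy data: six sequences from two true families, A and B.
families = ["A", "A", "A", "A", "B", "B"]
# A conservative clustering splits family A into two clusters (0 and 1).
conservative = [0, 0, 1, 1, 2, 2]
# Merging clusters 0 and 1 afterwards reunites family A.
merged = [0, 0, 0, 0, 2, 2]

h_cons, c_cons, v_cons = v_measure(families, conservative)
h_merged, c_merged, v_merged = v_measure(families, merged)
print(f"conservative: h={h_cons:.3f} c={c_cons:.3f} v={v_cons:.3f}")
print(f"merged:       h={h_merged:.3f} c={c_merged:.3f} v={v_merged:.3f}")
```

In this toy case the conservative clustering is perfectly homogeneous but incomplete, because family A is spread over two clusters; merging those two clusters raises completeness without lowering homogeneity.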
first_indexed 2024-03-09T07:56:48Z
format Article
id doaj.art-964e4597d0f8411aaef97cf7919a6a94
institution Directory of Open Access Journals
issn 2167-8359
language English
last_indexed 2024-03-09T07:56:48Z
publishDate 2023-02-01
publisher PeerJ Inc.
record_format Article
series PeerJ
spelling doaj.art-964e4597d0f8411aaef97cf7919a6a94
2023-12-03T00:57:02Z
eng
PeerJ Inc.
PeerJ
2167-8359
2023-02-01
11
e14779
10.7717/peerj.14779
Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
Rachel Nguyen (Drexel University, Philadelphia, United States of America)
Bahrad A. Sokhansanj (Drexel University, Philadelphia, United States of America)
Robi Polikar (Rowan University, Glassboro, NJ, United States of America)
Gail L. Rosen (Drexel University, Philadelphia, United States of America)
https://peerj.com/articles/14779.pdf
Protein clustering
Protein families
Homology
title Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
topic Protein clustering
Protein families
Homology
url https://peerj.com/articles/14779.pdf