Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering

A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are kept together rather than broken up into multiple clusters. Most algorithms are conservative...

Bibliographic Details
Main Authors:	Rachel Nguyen, Bahrad A. Sokhansanj, Robi Polikar, Gail L. Rosen
Format:	Article
Language:	English
Published:	PeerJ Inc. 2023-02-01
Series:	PeerJ
Subjects:	Protein clustering; Protein families; Homology
Online Access:	https://peerj.com/articles/14779.pdf

_version_	1827610854199132160
author	Rachel Nguyen Bahrad A. Sokhansanj Robi Polikar Gail L. Rosen
author_facet	Rachel Nguyen Bahrad A. Sokhansanj Robi Polikar Gail L. Rosen
author_sort	Rachel Nguyen
collection	DOAJ
description	A major challenge for clustering algorithms is to balance the trade-off between homogeneity, i.e., the degree to which an individual cluster includes only related sequences, and completeness, i.e., the degree to which related sequences are kept together rather than broken up into multiple clusters. Most algorithms are conservative in grouping sequences with other sequences. Remote homologs may fail to be clustered together and instead form unnecessarily distinct clusters. The resulting clusters have high homogeneity but completeness that is too low. We propose Complet+, a computationally scalable post-processing method to increase the completeness of clusters without an undue cost in homogeneity. Complet+ effectively merges closely related clusters of proteins that have verified structural relationships in the SCOPe classification scheme, improving the completeness of clustering results at little cost to homogeneity. Applying Complet+ to clusters obtained using MMseqs2’s clusterupdate increases the V-measure by 0.09 and 0.05 at the SCOPe superfamily and family levels, respectively. Complet+ also creates more biologically representative clusters, as shown by a substantial increase in Adjusted Mutual Information (AMI) and Adjusted Rand Index (ARI) metrics when comparing predicted clusters to biological classifications. Complet+ similarly improves clustering metrics when applied to other methods, such as CD-HIT and linclust. Finally, we show that Complet+ runtime scales linearly with respect to the number of clusters being post-processed on a COG dataset of over 3 million sequences. Code and supplementary information are available on GitHub: https://github.com/EESI/Complet-Plus.
first_indexed	2024-03-09T07:56:48Z
format	Article
id	doaj.art-964e4597d0f8411aaef97cf7919a6a94
institution	Directory of Open Access Journals
issn	2167-8359
language	English
last_indexed	2024-03-09T07:56:48Z
publishDate	2023-02-01
publisher	PeerJ Inc.
record_format	Article
series	PeerJ
spelling	doaj.art-964e4597d0f8411aaef97cf7919a6a94; 2023-12-03T00:57:02Z; eng; PeerJ Inc.; PeerJ; 2167-8359; 2023-02-01; 11:e14779; 10.7717/peerj.14779; Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering; Rachel Nguyen (Drexel University, Philadelphia, United States of America); Bahrad A. Sokhansanj (Drexel University, Philadelphia, United States of America); Robi Polikar (Rowan University, Glassboro, NJ, United States of America); Gail L. Rosen (Drexel University, Philadelphia, United States of America); https://peerj.com/articles/14779.pdf; Protein clustering; Protein families; Homology
title	Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
title_full	Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
title_fullStr	Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
title_full_unstemmed	Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
title_short	Complet+: a computationally scalable method to improve completeness of large-scale protein sequence clustering
title_sort	complet a computationally scalable method to improve completeness of large scale protein sequence clustering
topic	Protein clustering; Protein families; Homology
url	https://peerj.com/articles/14779.pdf
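
The homogeneity/completeness trade-off named in the description can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the Complet+ code: it computes homogeneity, completeness, and V-measure from their standard entropy-based definitions, and the function names (entropy, conditional_entropy, v_measure) and the toy family labels are assumptions made for this example.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels) of a list of labels (natural log)."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    marginal = Counter(given)
    return -sum((c / n) * log(c / marginal[g]) for (g, _), c in joint.items())


def v_measure(truth, predicted):
    """Return (homogeneity, completeness, V-measure) for a clustering."""
    h_truth, h_pred = entropy(truth), entropy(predicted)
    # Homogeneity: each predicted cluster holds members of a single true class.
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, predicted) / h_truth
    # Completeness: all members of a true class land in the same predicted cluster.
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, truth) / h_pred
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2.0 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Toy data (hypothetical): six sequences from two true families.
truth = ["famA", "famA", "famA", "famA", "famB", "famB"]
# A conservative clustering splits famA into two clusters.
conservative = [0, 0, 1, 1, 2, 2]
# Merging the two famA clusters restores completeness.
merged = [0, 0, 0, 0, 2, 2]

if __name__ == "__main__":
    print("conservative:", v_measure(truth, conservative))
    print("merged:      ", v_measure(truth, merged))
```

In this toy case the conservative clustering is perfectly homogeneous but incomplete, because one family is split across two clusters; merging those clusters raises completeness without lowering homogeneity, which is the kind of improvement the description attributes to post-processing.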

Similar Items