Clustering biological sequences with dynamic sequence similarity threshold

Abstract Background Biological sequence clustering is a complicated data clustering problem owing to the high computation costs incurred for pairwise sequence distance calculations through sequence alignments, as well as difficulties in determining parameters for deriving robust clusters. While curr...

Bibliographic Details
Main Authors:	Jimmy Ka Ho Chiu, Rick Twee-Hee Ong
Format:	Article
Language:	English
Published:	BMC 2022-03-01
Series:	BMC Bioinformatics
Subjects:	Sequence clustering, Graph clustering, Homologous sequences, Metagenomics
Online Access:	https://doi.org/10.1186/s12859-022-04643-9

collection	DOAJ
description	Background: Biological sequence clustering is a difficult data clustering problem, owing to the high computational cost of calculating pairwise sequence distances through sequence alignment and to the difficulty of choosing parameters that yield robust clusters. While current approaches succeed in reducing the number of sequence alignments performed, the clusters they generate are based on a single sequence identity threshold applied to every cluster. A poor choice of this threshold therefore leads to low-quality clusters, yet users are given little support in selecting a threshold well matched to the input sequences. Results: We present a novel sequence clustering approach called ALFATClust that exploits rapid pairwise alignment-free sequence distance calculations and graph community detection to generate clusters. Instead of applying a single threshold to every cluster, ALFATClust dynamically determines the cut-off threshold for each individual cluster by considering both cluster separation and intra-cluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches by simultaneously maintaining cluster robustness and substantial cluster separation on the benchmark datasets. The software also provides an evaluation report for verifying the quality of the non-singleton clusters obtained. Conclusions: ALFATClust generates sequence clusters with high intra-cluster sequence similarity and substantial separation between clusters, without requiring users to decide on precise similarity cut-off thresholds.
id	doaj.art-5d98cbec090d4d1aba0275437470eca7
institution	Directory of Open Access Journals
issn	1471-2105
spelling	doaj.art-5d98cbec090d4d1aba0275437470eca7; 2022-12-21T21:10:47Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2022-03-01; 23 1 1 20; 10.1186/s12859-022-04643-9; Clustering biological sequences with dynamic sequence similarity threshold; Jimmy Ka Ho Chiu (0), Rick Twee-Hee Ong (1); Saw Swee Hock School of Public Health, National University of Singapore and National University Health System (0, 1); https://doi.org/10.1186/s12859-022-04643-9; Sequence clustering; Graph clustering; Homologous sequences; Metagenomics

Similar Items