CLUSTOM: a novel method for clustering 16S rRNA next generation sequences by overlap minimization.

Bibliographic Details
Main Authors: Kyuin Hwang, Jeongsu Oh, Tae-Kyung Kim, Byung Kwon Kim, Dong Su Yu, Bo Kyeng Hou, Gustavo Caetano-Anollés, Soon Gyu Hong, Kyung Mo Kim
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2013-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC3641076?pdf=render
collection DOAJ
description The recent nucleic acid sequencing revolution driven by shotgun and high-throughput technologies has led to a rapid increase in the number of sequences for microbial communities. The availability of 16S ribosomal RNA (rRNA) gene sequences from a multitude of natural environments now offers a unique opportunity to study microbial diversity and community structure. The large volume of sequencing data, however, makes it time-consuming to assign individual sequences to phylotypes by searching them against public databases. Since ribosomal sequences have diverged across prokaryotic species, they can be grouped into clusters that represent operational taxonomic units. However, available clustering programs suffer from overlap of sequence spaces in adjacent clusters. In natural environments, gene sequences are homogeneous within species but divergent between species. This evolutionary constraint results in an uneven distribution of genetic distances of genes in sequence space. To cluster 16S rRNA sequences more accurately, it is therefore essential to select core sequences that are located at the centers of the distributions represented by the genetic distance of sequences in taxonomic units. Based on this idea, we here describe a novel sequence clustering algorithm named CLUSTOM that minimizes the overlaps between adjacent clusters. The performance of this algorithm was evaluated in a comparative exercise with existing programs, using the reference sequences of the SILVA database as well as published pyrosequencing datasets. The test revealed that our algorithm achieves higher accuracy than ESPRIT-Tree and mothur, two of the best clustering algorithms. Results indicate that exploiting the uneven distribution of sequence distances allows 16S rRNA gene sequences to be clustered effectively. The CLUSTOM algorithm has been implemented both as a web application and as a standalone command-line application, which are available at http://clustom.kribb.re.kr.
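The notions of "core sequences" and "overlap between adjacent clusters" mentioned in the description can be illustrated with a small sketch. This is not the CLUSTOM implementation, whose details are not given in this record. It assumes pre-aligned sequences of equal length, uses the plain fraction of mismatched positions as the genetic distance, and treats the medoid of a group as its core sequence. The function names `p_distance`, `pick_core`, and `overlap_count` are hypothetical and chosen only for illustration.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of mismatched positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def pick_core(members):
    """Return the medoid: the member with the smallest total distance to all members.

    Assumption: the medoid stands in for a 'core sequence at the center of the
    distance distribution'; the published method may select cores differently.
    """
    return min(members, key=lambda s: sum(p_distance(s, t) for t in members))


def overlap_count(sequences, cores, threshold):
    """Count sequences that lie within `threshold` of more than one core.

    Such sequences sit in the overlapping region of adjacent clusters.
    """
    count = 0
    for s in sequences:
        near = sum(1 for c in cores if p_distance(s, c) <= threshold)
        if near > 1:
            count += 1
    return count


if __name__ == "__main__":
    # Toy data (invented for illustration, not real 16S rRNA sequences).
    group_a = ["AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC"]
    group_b = ["GGGGGGGGGG", "GGGGGGGGGT", "GGGGGGGGTT"]
    everything = group_a + group_b

    central_cores = [pick_core(group_a), pick_core(group_b)]
    edge_cores = [group_a[0], group_a[2]]  # two poorly chosen cores from one group

    print("central cores:", central_cores)
    print("overlap with central cores:", overlap_count(everything, central_cores, 0.1))
    print("overlap with edge cores:", overlap_count(everything, edge_cores, 0.1))
```

In this toy setting, cores taken from the center of each group leave no sequence shared between clusters, whereas two cores taken from the edges of the same group both claim the sequence lying between them.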
first_indexed 2024-12-10T23:18:59Z
format Article
id doaj.art-741e941b393d4f0c95e910f4e69d3298
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-12-10T23:18:59Z
publishDate 2013-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-741e941b393d4f0c95e910f4e69d3298; 2022-12-22T01:29:47Z; eng; Public Library of Science (PLoS); PLoS ONE; ISSN 1932-6203; 2013-01-01; vol. 8, no. 5, e62623; doi:10.1371/journal.pone.0062623