Open-Source Sequence Clustering Methods Improve the State Of the Art

ABSTRACT Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-th...

Bibliographic Details
Main Authors: Evguenia Kopylova, Jose A. Navas-Molina, Céline Mercier, Zhenjiang Zech Xu, Frédéric Mahé, Yan He, Hong-Wei Zhou, Torbjørn Rognes, J. Gregory Caporaso, Rob Knight
Format: Article
Language: English
Published: American Society for Microbiology 2016-02-01
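Series: mSystems
Subjects: sequence clustering; operational taxonomic units; microbial community analysis; amplicon sequencing
Online Access: https://journals.asm.org/doi/10.1128/mSystems.00003-15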
collection DOAJ
description ABSTRACT Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release. IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014, http://dx.doi.org/10.1186/s12915-014-0069-1).
first_indexed 2024-12-14T09:03:36Z
id doaj.art-75b0acec23d8460984a172ced1fd9c07
institution Directory Open Access Journal
issn 2379-5077
last_indexed 2024-12-14T09:03:36Z
spelling doaj.art-75b0acec23d8460984a172ced1fd9c07 2022-12-21T23:08:46Z eng
American Society for Microbiology, mSystems, ISSN 2379-5077, 2016-02-01, volume 1, issue 1, DOI 10.1128/mSystems.00003-15
Evguenia Kopylova: Department of Pediatrics, UCSD School of Medicine, La Jolla, California, USA
Jose A. Navas-Molina: Department of Pediatrics, UCSD School of Medicine, La Jolla, California, USA
Céline Mercier: Laboratoire d'Ecologie Alpine (LECA), CNRS UMR 5553, Université Grenoble Alpes, Grenoble, France
Zhenjiang Zech Xu: Department of Pediatrics, UCSD School of Medicine, La Jolla, California, USA
Frédéric Mahé: Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany
Yan He: Department of Environmental Health, State Key Laboratory of Organ Failure Research, Guangdong Provincial Key Laboratory of Tropical Disease Research, School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China
Hong-Wei Zhou: Department of Environmental Health, State Key Laboratory of Organ Failure Research, Guangdong Provincial Key Laboratory of Tropical Disease Research, School of Public Health and Tropical Medicine, Southern Medical University, Guangzhou, Guangdong, China
Torbjørn Rognes: Department of Informatics, University of Oslo, Oslo, Norway
J. Gregory Caporaso: Department of Biological Sciences, Northern Arizona University, Flagstaff, Arizona, USA
Rob Knight: Department of Pediatrics, UCSD School of Medicine, La Jolla, California, USA
topic sequence clustering
operational taxonomic units
microbial community analysis
amplicon sequencing