RESCRIPt: Reproducible sequence taxonomy reference database management.

Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleoti...


Bibliographic record details
Main authors:	Michael S Robeson, Devon R O'Rourke, Benjamin D Kaehler, Michal Ziemski, Matthew R Dillon, Jeffrey T Foster, Nicholas A Bokulich
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2021-11-01
Series:	PLoS Computational Biology
Available online:	https://doi.org/10.1371/journal.pcbi.1009581

_version_	1831696620433965056
author	Michael S Robeson Devon R O'Rourke Benjamin D Kaehler Michal Ziemski Matthew R Dillon Jeffrey T Foster Nicholas A Bokulich
author_sort	Michael S Robeson
collection	DOAJ
description	Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.
first_indexed	2024-12-20T13:16:41Z
format	Article
id	doaj.art-3b3ee68d2d0a49df902bccb26c1813c2
institution	Directory Open Access Journal
issn	1553-734X 1553-7358
language	English
last_indexed	2024-12-20T13:16:41Z
publishDate	2021-11-01
publisher	Public Library of Science (PLoS)
record_format	Article
series	PLoS Computational Biology
title	RESCRIPt: Reproducible sequence taxonomy reference database management.
url	https://doi.org/10.1371/journal.pcbi.1009581
