ReprDB and panDB: minimalist databases with maximal microbial representation

Abstract. Background: Profiling of shotgun metagenomic samples is hindered by a lack of unified microbial reference genome databases that (i) assemble genomic information from all open access microbial genomes, (ii) have relatively small sizes, and (iii) are compatible with various metagenomic read mapp...


Bibliographic Details
Main Authors: Wei Zhou, Nicole Gay, Julia Oh
Format: Article
Language: English
Published: BMC 2018-01-01
Series: Microbiome
Subjects: Reference database; Shotgun metagenomics; Pan-genome; Whole-genome alignment
Online Access: http://link.springer.com/article/10.1186/s40168-018-0399-2
_version_ 1819055588028448768
author Wei Zhou
Nicole Gay
Julia Oh
collection DOAJ
description Abstract. Background: Profiling of shotgun metagenomic samples is hindered by a lack of unified microbial reference genome databases that (i) assemble genomic information from all open access microbial genomes, (ii) have relatively small sizes, and (iii) are compatible with various metagenomic read mapping tools. Moreover, computational tools to rapidly compile and update such databases to accommodate the rapid increase in new reference genomes do not exist. As a result, database-guided analyses often fail to profile a substantial fraction of metagenomic shotgun sequencing reads from complex microbiomes. Results: We report pipelines that efficiently traverse all open access microbial genomes and assemble non-redundant genomic information. The pipelines produce two relatively small, species-resolution microbial reference databases: reprDB, which assembles microbial representative or reference genomes, and panDB, for which we developed a novel iterative alignment algorithm to identify and assemble non-redundant genomic regions in multiple sequenced strains. With the databases, we assigned taxonomic labels and genome positions to the majority of metagenomic reads from human skin and gut microbiomes, demonstrating a significant improvement over a previous database-guided analysis on the same datasets. Conclusions: reprDB and panDB leverage the rapid increase in the number of open access microbial genomes to more fully profile metagenomic samples. Additionally, the databases exclude redundant sequence information to avoid inflated storage or memory requirements and longer indexing or analysis times. Finally, the novel iterative alignment algorithm significantly increases efficiency in pan-genome identification and can be useful in comparative genomic analyses.
first_indexed 2024-12-21T13:09:54Z
format Article
id doaj.art-73c6b98d2eba4bc9a3d5fd51fc4f7c6a
institution Directory Open Access Journal
issn 2049-2618
language English
last_indexed 2024-12-21T13:09:54Z
publishDate 2018-01-01
publisher BMC
record_format Article
series Microbiome
spelling doaj.art-73c6b98d2eba4bc9a3d5fd51fc4f7c6a; 2022-12-21T19:02:55Z; eng; BMC; Microbiome; 2049-2618; 2018-01-01; vol. 6, iss. 1, pp. 1-15; 10.1186/s40168-018-0399-2; ReprDB and panDB: minimalist databases with maximal microbial representation; Wei Zhou, Nicole Gay, Julia Oh (all: The Jackson Laboratory for Genomic Medicine); http://link.springer.com/article/10.1186/s40168-018-0399-2; Reference database; Shotgun metagenomics; Pan-genome; Whole-genome alignment
title ReprDB and panDB: minimalist databases with maximal microbial representation
topic Reference database
Shotgun metagenomics
Pan-genome
Whole-genome alignment
url http://link.springer.com/article/10.1186/s40168-018-0399-2