Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences.

Bibliographic Details
Main Authors: Grace A Blackwell, Martin Hunt, Kerri M Malone, Leandro Lima, Gal Horesh, Blaise T F Alako, Nicholas R Thomson, Zamin Iqbal
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2021-11-01
Series: PLoS Biology
Online Access: https://doi.org/10.1371/journal.pbio.3001421
author Grace A Blackwell
Martin Hunt
Kerri M Malone
Leandro Lima
Gal Horesh
Blaise T F Alako
Nicholas R Thomson
Zamin Iqbal
author_sort Grace A Blackwell
collection DOAJ
description The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies.
first_indexed 2024-12-24T11:16:54Z
format Article
id doaj.art-abce2d8928ac4d6098a60f51ffb57a54
institution Directory of Open Access Journals
issn 1544-9173
1545-7885
language English
last_indexed 2024-12-24T11:16:54Z
publishDate 2021-11-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Biology
spelling doaj.art-abce2d8928ac4d6098a60f51ffb57a54 | 2022-12-21T16:58:22Z | eng | Public Library of Science (PLoS) | PLoS Biology | 1544-9173 | 1545-7885 | 2021-11-01 | 19(11): e3001421 | 10.1371/journal.pbio.3001421 | Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences. | Grace A Blackwell; Martin Hunt; Kerri M Malone; Leandro Lima; Gal Horesh; Blaise T F Alako; Nicholas R Thomson; Zamin Iqbal | The open sharing of genomic data provides an incredibly rich resource for the study of bacterial evolution and function and even anthropogenic activities such as the widespread use of antimicrobials. However, these data consist of genomes assembled with different tools and levels of quality checking, and of large volumes of completely unprocessed raw sequence data. In both cases, considerable computational effort is required before biological questions can be addressed. Here, we assembled and characterised 661,405 bacterial genomes retrieved from the European Nucleotide Archive (ENA) in November of 2018 using a uniform standardised approach. Of these, 311,006 did not previously have an assembly. We produced a searchable COmpact Bit-sliced Signature (COBS) index, facilitating the easy interrogation of the entire dataset for a specific sequence (e.g., gene, mutation, or plasmid). Additional MinHash and pp-sketch indices support genome-wide comparisons and estimations of genomic distance. Combined, this resource will allow data to be easily subset and searched, phylogenetic relationships between genomes to be quickly elucidated, and hypotheses rapidly generated and tested. We believe that this combination of uniform processing and variety of search/filter functionalities will make this a resource of very wide utility. In terms of diversity within the data, a breakdown of the 639,981 high-quality genomes emphasised the uneven species composition of the ENA/public databases, with just 20 of the total 2,336 species making up 90% of the genomes. The overrepresented species tend to be acute/common human pathogens, aligning with research priorities at different levels from individual interests to funding bodies and national and global public health agencies. | https://doi.org/10.1371/journal.pbio.3001421
title Exploring bacterial diversity via a curated and searchable snapshot of archived DNA sequences.
url https://doi.org/10.1371/journal.pbio.3001421