Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

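Abstract: Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator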
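
The runtime figure quoted in the abstract (3.3 TB in 12 days on a 32-core computer, with linear scaling in input size) can be turned into a rough throughput estimate. The sketch below is only back-of-the-envelope arithmetic, not part of Conterminator: the function names are hypothetical, 1 TB is assumed to mean 10^12 bytes, and the extrapolation to other input sizes assumes the stated linear scaling holds exactly.

```python
# Back-of-the-envelope throughput implied by the abstract's figures.
# Assumptions (not from the source): 1 TB = 10**12 bytes, and the stated
# linear scaling is extrapolated unchanged to other input sizes.

SECONDS_PER_DAY = 86_400


def throughput_bytes_per_second(total_bytes: float, days: float) -> float:
    """Average number of bytes processed per second over the whole run."""
    return total_bytes / (days * SECONDS_PER_DAY)


def estimated_days(input_tb: float, ref_tb: float = 3.3, ref_days: float = 12.0) -> float:
    """Runtime for input_tb terabytes, scaled linearly from the reference run."""
    return ref_days * input_tb / ref_tb


if __name__ == "__main__":
    rate = throughput_bytes_per_second(3.3e12, 12)
    print(f"aggregate: {rate / 1e6:.2f} MB/s")
    print(f"per core:  {rate / 32 / 1e3:.1f} kB/s")
    print(f"1 TB would take about {estimated_days(1.0):.1f} days")
```

Under these assumptions the reported run corresponds to about 3.2 MB/s in aggregate, or roughly 100 kB/s per core.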


Bibliographic Details
Main Authors: Martin Steinegger, Steven L. Salzberg
Format: Article
Language: English
Published: BMC 2020-05-01
Series: Genome Biology
Subjects: Genomes, Contamination, Software, RefSeq, GenBank
Online Access: http://link.springer.com/article/10.1186/s13059-020-02023-1
author Martin Steinegger
Steven L. Salzberg
author_facet Martin Steinegger
Steven L. Salzberg
author_sort Martin Steinegger
collection DOAJ
description Abstract Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator
first_indexed 2024-12-21T19:02:31Z
format Article
id doaj.art-0c58db75fd8a487eb02d1ef019ea16a4
institution Directory Open Access Journal
issn 1474-760X
language English
last_indexed 2024-12-21T19:02:31Z
publishDate 2020-05-01
publisher BMC
record_format Article
series Genome Biology
spelling doaj.art-0c58db75fd8a487eb02d1ef019ea16a4; 2022-12-21T18:53:27Z; eng; BMC; Genome Biology; 1474-760X; 2020-05-01; volume 21, issue 1, pages 1-12; 10.1186/s13059-020-02023-1; Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank; Martin Steinegger (School of Biological Sciences, Seoul National University); Steven L. Salzberg (Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University); http://link.springer.com/article/10.1186/s13059-020-02023-1; Genomes; Contamination; Software; RefSeq; GenBank
title Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
title_sort terminating contamination large scale search identifies more than 2 000 000 contaminated entries in genbank
topic Genomes
Contamination
Software
RefSeq
GenBank
url http://link.springer.com/article/10.1186/s13059-020-02023-1