Efficient counting of <it>k</it>-mers in DNA sequences using a bloom filter

Abstract Background Counting <it>k</it>-mers (substrings of length <it>k </it>in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic seq...

Full description

Bibliographic Details
Main Authors:	Melsted Páll, Pritchard Jonathan K
Format:	Article
Language:	English
Published:	BMC 2011-08-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/12/333

_version_	1818382132912848896
author	Melsted Páll Pritchard Jonathan K
author_facet	Melsted Páll Pritchard Jonathan K
author_sort	Melsted Páll
collection	DOAJ
description	<p>Abstract</p> <p>Background</p> <p>Counting <it>k</it>-mers (substrings of length <it>k </it>in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting <it>k</it>-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing <it>k</it>-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton <it>k</it>-mers are uninformative for many algorithms without some kind of error correction.</p> <p>Results</p> <p>We present a new method that identifies all the <it>k</it>-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed <it>k</it>-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique <it>k</it>-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting <it>k</it>-mers in sequence data with errors.</p> <p>Conclusions</p> <p>A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at <url>http://pritch.bsd.uchicago.edu/bfcounter.html</url></p>
first_indexed	2024-12-14T02:45:37Z
format	Article
id	doaj.art-be88730b1f314a4791d46ce3b0793eaa
institution	Directory Open Access Journal
issn	1471-2105
language	English
last_indexed	2024-12-14T02:45:37Z
publishDate	2011-08-01
publisher	BMC
record_format	Article
series	BMC Bioinformatics
spelling	doaj.art-be88730b1f314a4791d46ce3b0793eaa2022-12-21T23:19:53ZengBMCBMC Bioinformatics1471-21052011-08-0112133310.1186/1471-2105-12-333Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filterMelsted PállPritchard Jonathan K<p>Abstract</p> <p>Background</p> <p>Counting <it>k</it>-mers (substrings of length <it>k </it>in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting <it>k</it>-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing <it>k</it>-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton <it>k</it>-mers are uninformative for many algorithms without some kind of error correction.</p> <p>Results</p> <p>We present a new method that identifies all the <it>k</it>-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed <it>k</it>-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique <it>k</it>-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting <it>k</it>-mers in sequence data with errors.</p> <p>Conclusions</p> <p>A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at <url>http://pritch.bsd.uchicago.edu/bfcounter.html</url></p>http://www.biomedcentral.com/1471-2105/12/333
spellingShingle	Melsted Páll Pritchard Jonathan K Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter BMC Bioinformatics
title	Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_full	Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_fullStr	Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_full_unstemmed	Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_short	Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_sort	efficient counting of b it k it b mers in dna sequences using a bloom filter
url	http://www.biomedcentral.com/1471-2105/12/333
work_keys_str_mv	AT melstedpall efficientcountingofbitkitbmersindnasequencesusingabloomfilter AT pritchardjonathank efficientcountingofbitkitbmersindnasequencesusingabloomfilter

Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter

Similar Items