Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter

<p>Abstract</p> <p>Background</p> <p>Counting <it>k</it>-mers (substrings of length <it>k </it>in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic seq...

Full description

Bibliographic Details
Main Authors: Melsted Páll, Pritchard Jonathan K
Format: Article
Language:English
Published: BMC 2011-08-01
Series:BMC Bioinformatics
Online Access:http://www.biomedcentral.com/1471-2105/12/333
_version_ 1818382132912848896
author Melsted Páll
Pritchard Jonathan K
author_facet Melsted Páll
Pritchard Jonathan K
author_sort Melsted Páll
collection DOAJ
description <p>Abstract</p> <p>Background</p> <p>Counting <it>k</it>-mers (substrings of length <it>k </it>in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting <it>k</it>-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing <it>k</it>-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton <it>k</it>-mers are uninformative for many algorithms without some kind of error correction.</p> <p>Results</p> <p>We present a new method that identifies all the <it>k</it>-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed <it>k</it>-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique <it>k</it>-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting <it>k</it>-mers in sequence data with errors.</p> <p>Conclusions</p> <p>A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at <url>http://pritch.bsd.uchicago.edu/bfcounter.html</url></p>
first_indexed 2024-12-14T02:45:37Z
format Article
id doaj.art-be88730b1f314a4791d46ce3b0793eaa
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-14T02:45:37Z
publishDate 2011-08-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-be88730b1f314a4791d46ce3b0793eaa2022-12-21T23:19:53ZengBMCBMC Bioinformatics1471-21052011-08-0112133310.1186/1471-2105-12-333Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filterMelsted PállPritchard Jonathan K<p>Abstract</p> <p>Background</p> <p>Counting <it>k</it>-mers (substrings of length <it>k </it>in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting <it>k</it>-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing <it>k</it>-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton <it>k</it>-mers are uninformative for many algorithms without some kind of error correction.</p> <p>Results</p> <p>We present a new method that identifies all the <it>k</it>-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed <it>k</it>-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique <it>k</it>-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting <it>k</it>-mers in sequence data with errors.</p> <p>Conclusions</p> <p>A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at <url>http://pritch.bsd.uchicago.edu/bfcounter.html</url></p>http://www.biomedcentral.com/1471-2105/12/333
spellingShingle Melsted Páll
Pritchard Jonathan K
Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
BMC Bioinformatics
title Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_full Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_fullStr Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_full_unstemmed Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_short Efficient counting of <b><it>k</it></b>-mers in DNA sequences using a bloom filter
title_sort efficient counting of b it k it b mers in dna sequences using a bloom filter
url http://www.biomedcentral.com/1471-2105/12/333
work_keys_str_mv AT melstedpall efficientcountingofbitkitbmersindnasequencesusingabloomfilter
AT pritchardjonathank efficientcountingofbitkitbmersindnasequencesusingabloomfilter