Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches.

Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fi...

Bibliographic Details
Main Authors: Meznah Almutairy, Eric Torng
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2018-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC5794061?pdf=render
collection DOAJ
description Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.
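The two sampling schemes named in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the parameters `k`, `step`, and `w`, and the choice of lexicographic order for picking a minimizer are all assumptions made for illustration.

```python
def fixed_sample(seq, k, step):
    """Fixed sampling (sketch): keep the k-mer that starts at every
    `step`-th position of the sequence. `step` is an assumed parameter name."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, step)]


def minimizer_sample(seq, k, w):
    """Minimizer sampling (sketch): slide a window over `w` consecutive
    k-mers and keep the smallest k-mer of each window.

    Assumption: "smallest" means lexicographically smallest, with the
    leftmost occurrence winning ties. Real tools may use other orderings.
    """
    all_kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(all_kmers) - w + 1):
        window = range(start, start + w)
        # Compare by k-mer text first, then by position to break ties.
        best = min(window, key=lambda i: (all_kmers[i], i))
        chosen.add(best)
    return [(i, all_kmers[i]) for i in sorted(chosen)]


if __name__ == "__main__":
    database = "ACGTACGTTGCA"  # toy sequence, not real genome data
    print("fixed:    ", fixed_sample(database, k=3, step=4))
    print("minimizer:", minimizer_sample(database, k=3, w=4))
```

On this toy sequence, fixed sampling keeps three k-mers and minimizer sampling keeps four. Because a minimizer is chosen from the content of a window rather than from its position, the same rule can be applied to a query sequence, which is why the description notes that query k-mers can also be sampled under that scheme.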
id doaj.art-06ad1f3b524b488b9d724583a2b93019
institution Directory Open Access Journal
issn 1932-6203
citation PLoS ONE 13(2): e0189960 (2018)
doi 10.1371/journal.pone.0189960