An optimized FM-index library for nucleotide and amino acid search
Abstract Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably compl...
Main Authors: | , |
---|---|
Format: | Article |
Language: | English |
Published: |
BMC
2021-12-01
|
Series: | Algorithms for Molecular Biology |
Subjects: | |
Online Access: | https://doi.org/10.1186/s13015-021-00204-6 |
_version_ | 1819094887799193600 |
---|---|
author | Tim Anderson Travis J. Wheeler |
author_facet | Tim Anderson Travis J. Wheeler |
author_sort | Tim Anderson |
collection | DOAJ |
description | Abstract Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is $$\sim $$ ∼ 2–4x faster than SeqAn3 for nucleotide search, and $$\sim $$ ∼ 2–6x faster for amino acid search; it is also $$\sim $$ ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex. |
first_indexed | 2024-12-21T23:34:33Z |
format | Article |
id | doaj.art-cce5c4452d344971908b8d72a19c95e1 |
institution | Directory Open Access Journal |
issn | 1748-7188 |
language | English |
last_indexed | 2024-12-21T23:34:33Z |
publishDate | 2021-12-01 |
publisher | BMC |
record_format | Article |
series | Algorithms for Molecular Biology |
spelling | doaj.art-cce5c4452d344971908b8d72a19c95e12022-12-21T18:46:23ZengBMCAlgorithms for Molecular Biology1748-71882021-12-0116111310.1186/s13015-021-00204-6An optimized FM-index library for nucleotide and amino acid searchTim Anderson0Travis J. Wheeler1Department of Computer Science, University of MontanaDepartment of Computer Science, University of MontanaAbstract Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is $$\sim $$ ∼ 2–4x faster than SeqAn3 for nucleotide search, and $$\sim $$ ∼ 2–6x faster for amino acid search; it is also $$\sim $$ ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.https://doi.org/10.1186/s13015-021-00204-6FM-indexString matchingSIMD vectorization |
spellingShingle | Tim Anderson Travis J. Wheeler An optimized FM-index library for nucleotide and amino acid search Algorithms for Molecular Biology FM-index String matching SIMD vectorization |
title | An optimized FM-index library for nucleotide and amino acid search |
title_full | An optimized FM-index library for nucleotide and amino acid search |
title_fullStr | An optimized FM-index library for nucleotide and amino acid search |
title_full_unstemmed | An optimized FM-index library for nucleotide and amino acid search |
title_short | An optimized FM-index library for nucleotide and amino acid search |
title_sort | optimized fm index library for nucleotide and amino acid search |
topic | FM-index String matching SIMD vectorization |
url | https://doi.org/10.1186/s13015-021-00204-6 |
work_keys_str_mv | AT timanderson anoptimizedfmindexlibraryfornucleotideandaminoacidsearch AT travisjwheeler anoptimizedfmindexlibraryfornucleotideandaminoacidsearch AT timanderson optimizedfmindexlibraryfornucleotideandaminoacidsearch AT travisjwheeler optimizedfmindexlibraryfornucleotideandaminoacidsearch |