An optimized FM-index library for nucleotide and amino acid search

Abstract Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably compl...

Full description

Bibliographic Details
Main Authors: Tim Anderson, Travis J. Wheeler
Format: Article
Language:English
Published: BMC 2021-12-01
Series:Algorithms for Molecular Biology
Subjects:
Online Access:https://doi.org/10.1186/s13015-021-00204-6
_version_ 1819094887799193600
author Tim Anderson
Travis J. Wheeler
author_facet Tim Anderson
Travis J. Wheeler
author_sort Tim Anderson
collection DOAJ
description Abstract Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is $$\sim $$ ∼ 2–4x faster than SeqAn3 for nucleotide search, and $$\sim $$ ∼ 2–6x faster for amino acid search; it is also $$\sim $$ ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.
first_indexed 2024-12-21T23:34:33Z
format Article
id doaj.art-cce5c4452d344971908b8d72a19c95e1
institution Directory Open Access Journal
issn 1748-7188
language English
last_indexed 2024-12-21T23:34:33Z
publishDate 2021-12-01
publisher BMC
record_format Article
series Algorithms for Molecular Biology
spelling doaj.art-cce5c4452d344971908b8d72a19c95e12022-12-21T18:46:23ZengBMCAlgorithms for Molecular Biology1748-71882021-12-0116111310.1186/s13015-021-00204-6An optimized FM-index library for nucleotide and amino acid searchTim Anderson0Travis J. Wheeler1Department of Computer Science, University of MontanaDepartment of Computer Science, University of MontanaAbstract Background Pattern matching is a key step in a variety of biological sequence analysis pipelines. The FM-index is a compressed data structure for pattern matching, with search run time that is independent of the length of the database text. Implementation of the FM-index is reasonably complicated, so that increased adoption will be aided by the availability of a fast and flexible FM-index library. Results We present AvxWindowedFMindex (AWFM-index), a lightweight, open-source, thread-parallel FM-index library written in C that is optimized for indexing nucleotide and amino acid sequences. AWFM-index introduces a new approach to storing FM-index data in a strided bit-vector format that enables extremely efficient computation of the FM-index occurrence function via AVX2 bitwise instructions, and combines this with optional on-disk storage of the index’s suffix array and a cache-efficient lookup table for partial k-mer searches. The AWFM-index performs exact match count and locate queries faster than SeqAn3’s FM-index implementation across a range of comparable memory footprints. When optimized for speed, AWFM-index is $$\sim $$ ∼ 2–4x faster than SeqAn3 for nucleotide search, and $$\sim $$ ∼ 2–6x faster for amino acid search; it is also $$\sim $$ ∼ 4x faster with similar memory footprint when storing the suffix array in on-disk SSD storage. Conclusions AWFM-index is easy to incorporate into bioinformatics software, offers run-time performance parameterization, and provides clients with FM-index functionality at both a high-level (count or locate all instances of a query string) and low-level (step-wise control of the FM-index backward-search process). The open-source library is available for download at https://github.com/TravisWheelerLab/AvxWindowFmIndex.https://doi.org/10.1186/s13015-021-00204-6FM-indexString matchingSIMD vectorization
spellingShingle Tim Anderson
Travis J. Wheeler
An optimized FM-index library for nucleotide and amino acid search
Algorithms for Molecular Biology
FM-index
String matching
SIMD vectorization
title An optimized FM-index library for nucleotide and amino acid search
title_full An optimized FM-index library for nucleotide and amino acid search
title_fullStr An optimized FM-index library for nucleotide and amino acid search
title_full_unstemmed An optimized FM-index library for nucleotide and amino acid search
title_short An optimized FM-index library for nucleotide and amino acid search
title_sort optimized fm index library for nucleotide and amino acid search
topic FM-index
String matching
SIMD vectorization
url https://doi.org/10.1186/s13015-021-00204-6
work_keys_str_mv AT timanderson anoptimizedfmindexlibraryfornucleotideandaminoacidsearch
AT travisjwheeler anoptimizedfmindexlibraryfornucleotideandaminoacidsearch
AT timanderson optimizedfmindexlibraryfornucleotideandaminoacidsearch
AT travisjwheeler optimizedfmindexlibraryfornucleotideandaminoacidsearch