Disk compression of k-mer sets

Abstract K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpo...

Full description

Bibliographic Details
Main Authors: Amatur Rahman, Rayan Chikhi, Paul Medvedev
Format: Article
Language:English
Published: BMC 2021-06-01
Series:Algorithms for Molecular Biology
Subjects:
Online Access:https://doi.org/10.1186/s13015-021-00192-7
_version_ 1830084570951712768
author Amatur Rahman
Rayan Chikhi
Paul Medvedev
author_facet Amatur Rahman
Rayan Chikhi
Paul Medvedev
author_sort Amatur Rahman
collection DOAJ
description Abstract K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
first_indexed 2024-12-14T16:44:48Z
format Article
id doaj.art-7126c47cc903415d9ed68f00ae1e8d99
institution Directory Open Access Journal
issn 1748-7188
language English
last_indexed 2024-12-14T16:44:48Z
publishDate 2021-06-01
publisher BMC
record_format Article
series Algorithms for Molecular Biology
spelling doaj.art-7126c47cc903415d9ed68f00ae1e8d992022-12-21T22:54:13ZengBMCAlgorithms for Molecular Biology1748-71882021-06-0116111410.1186/s13015-021-00192-7Disk compression of k-mer setsAmatur Rahman0Rayan Chikhi1Paul Medvedev2Penn State UniversityDepartment of Computational Biology, C3BI USR 3756 CNRS, Institut PasteurPenn State UniversityAbstract K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.https://doi.org/10.1186/s13015-021-00192-7De Bruijn graphsCompressionk-mer setsSpectrum-preserving string sets
spellingShingle Amatur Rahman
Rayan Chikhi
Paul Medvedev
Disk compression of k-mer sets
Algorithms for Molecular Biology
De Bruijn graphs
Compression
k-mer sets
Spectrum-preserving string sets
title Disk compression of k-mer sets
title_full Disk compression of k-mer sets
title_fullStr Disk compression of k-mer sets
title_full_unstemmed Disk compression of k-mer sets
title_short Disk compression of k-mer sets
title_sort disk compression of k mer sets
topic De Bruijn graphs
Compression
k-mer sets
Spectrum-preserving string sets
url https://doi.org/10.1186/s13015-021-00192-7
work_keys_str_mv AT amaturrahman diskcompressionofkmersets
AT rayanchikhi diskcompressionofkmersets
AT paulmedvedev diskcompressionofkmersets