Disk compression of k-mer sets
Abstract K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpo...
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Published: |
BMC
2021-06-01
|
Series: | Algorithms for Molecular Biology |
Subjects: | |
Online Access: | https://doi.org/10.1186/s13015-021-00192-7 |
_version_ | 1830084570951712768 |
---|---|
author | Amatur Rahman Rayan Chikhi Paul Medvedev |
author_facet | Amatur Rahman Rayan Chikhi Paul Medvedev |
author_sort | Amatur Rahman |
collection | DOAJ |
description | Abstract K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do. |
first_indexed | 2024-12-14T16:44:48Z |
format | Article |
id | doaj.art-7126c47cc903415d9ed68f00ae1e8d99 |
institution | Directory Open Access Journal |
issn | 1748-7188 |
language | English |
last_indexed | 2024-12-14T16:44:48Z |
publishDate | 2021-06-01 |
publisher | BMC |
record_format | Article |
series | Algorithms for Molecular Biology |
spelling | doaj.art-7126c47cc903415d9ed68f00ae1e8d992022-12-21T22:54:13ZengBMCAlgorithms for Molecular Biology1748-71882021-06-0116111410.1186/s13015-021-00192-7Disk compression of k-mer setsAmatur Rahman0Rayan Chikhi1Paul Medvedev2Penn State UniversityDepartment of Computational Biology, C3BI USR 3756 CNRS, Institut PasteurPenn State UniversityAbstract K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.https://doi.org/10.1186/s13015-021-00192-7De Bruijn graphsCompressionk-mer setsSpectrum-preserving string sets |
spellingShingle | Amatur Rahman Rayan Chikhi Paul Medvedev Disk compression of k-mer sets Algorithms for Molecular Biology De Bruijn graphs Compression k-mer sets Spectrum-preserving string sets |
title | Disk compression of k-mer sets |
title_full | Disk compression of k-mer sets |
title_fullStr | Disk compression of k-mer sets |
title_full_unstemmed | Disk compression of k-mer sets |
title_short | Disk compression of k-mer sets |
title_sort | disk compression of k mer sets |
topic | De Bruijn graphs Compression k-mer sets Spectrum-preserving string sets |
url | https://doi.org/10.1186/s13015-021-00192-7 |
work_keys_str_mv | AT amaturrahman diskcompressionofkmersets AT rayanchikhi diskcompressionofkmersets AT paulmedvedev diskcompressionofkmersets |