LCQS: an efficient lossless compression tool of quality scores with random access functionality

Abstract Background Advanced sequencing machines dramatically speed up the generation of genomic data, which makes the demand for efficient compression of sequencing data urgent. As the most difficult part of the standard sequencing data format FASTQ, compression of the qual...


Bibliographic Details
Main Authors: Jiabing Fu, Bixin Ke, Shoubin Dong
Format: Article
Language: English
Published: BMC 2020-03-01
Series: BMC Bioinformatics
Subjects:
Online Access: http://link.springer.com/article/10.1186/s12859-020-3428-7
_version_ 1819170217203335168
author Jiabing Fu
Bixin Ke
Shoubin Dong
author_facet Jiabing Fu
Bixin Ke
Shoubin Dong
author_sort Jiabing Fu
collection DOAJ
description Abstract Background Advanced sequencing machines dramatically speed up the generation of genomic data, which makes the demand for efficient compression of sequencing data urgent. Quality scores are the most difficult part of the standard sequencing data format FASTQ to compress, and their compression has become a persistent problem in the development of FASTQ compression. Existing lossless compressors of quality scores mainly rely on patterns specific to particular sequencers and on complex context modeling techniques to improve low compression ratios. However, these compressors suffer from weak robustness, meaning unstable or even unavailable results on some sequencing files, and from slow compression speed. Meanwhile, some compressors construct a fine-grained index structure to speed up random access decompression, but they do so at the cost of compression speed and large index files, which makes them inefficient and impractical. An efficient lossless compressor of quality scores with strong robustness, a high compression ratio, fast compression, and fast random access decompression is therefore needed. Results This paper proposes LCQS, a lossless compression tool specialized for quality scores and built on the idea of maximizing the use of hardware resources. It consists of four sequential processing steps: partitioning, indexing, packing and parallelizing. Experimental results show that LCQS outperforms all the other state-of-the-art compressors on all criteria except for compression speed on the dataset SRR1284073. Furthermore, LCQS is robust on all the test datasets, with compression speed-ups of up to 29.1x, file size reductions of up to 28.78%, and random access decompression speed-ups of up to 2.1x.
Additionally, LCQS exhibits strong scalability: the compression speed increases almost linearly as the size of the input dataset increases. Conclusion Its ability to handle all kinds of quality scores, its superior compression ratio and compression speed, and its fast random access decompression make LCQS an efficient and advanced lossless quality score compressor. LCQS can be downloaded from https://github.com/SCUT-CCNL/LCQS and is freely available for non-commercial use.
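The abstract names four steps (partitioning, indexing, packing and parallelizing) without giving details. The following is only a minimal sketch of the general idea of block-wise compression with an index for random access. It is not the LCQS implementation: the function names, the fixed block size, and the use of `zlib` as the packer are all assumptions made for illustration, and the order of the steps is simplified.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_quality_lines(lines, block_size=4, workers=2):
    """Hypothetical helper: compress quality-score lines in independent blocks."""
    # Partition: group consecutive quality lines into fixed-size blocks.
    blocks = [lines[i:i + block_size] for i in range(0, len(lines), block_size)]

    # Pack: zlib stands in for a real quality-score packer (an assumption).
    def pack(block):
        return zlib.compress("\n".join(block).encode("ascii"), 9)

    # Parallelize: blocks are independent, so they can be packed concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        packed = list(pool.map(pack, blocks))

    # Index: remember (offset, length) of every compressed block.
    index, offset = [], 0
    for chunk in packed:
        index.append((offset, len(chunk)))
        offset += len(chunk)
    return b"".join(packed), index


def read_quality_line(payload, index, n, block_size=4):
    """Random access: decompress only the block that holds line n."""
    offset, length = index[n // block_size]
    block = zlib.decompress(payload[offset:offset + length]).decode("ascii")
    return block.split("\n")[n % block_size]
```

In this sketch the index holds one entry per block rather than per line, so a smaller block size gives cheaper random access but a larger index and usually a lower compression ratio. That is the trade-off the abstract attributes to fine-grained index structures.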
first_indexed 2024-12-22T19:31:53Z
format Article
id doaj.art-dc544ddf575b4580b82619516e9af57e
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-22T19:31:53Z
publishDate 2020-03-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-dc544ddf575b4580b82619516e9af57e
2022-12-21T18:15:05Z
eng
BMC
BMC Bioinformatics
1471-2105
2020-03-01
211112
10.1186/s12859-020-3428-7
LCQS: an efficient lossless compression tool of quality scores with random access functionality
Jiabing Fu (0)
Bixin Ke (1)
Shoubin Dong (2)
School of Computer Science & Engineering, South China University of Technology
School of Computer Science & Engineering, South China University of Technology
School of Computer Science & Engineering, South China University of Technology
http://link.springer.com/article/10.1186/s12859-020-3428-7
Quality score
Lossless compression
Random access
Robust
Efficient
Parallelization
title LCQS: an efficient lossless compression tool of quality scores with random access functionality
topic Quality score
Lossless compression
Random access
Robust
Efficient
Parallelization
url http://link.springer.com/article/10.1186/s12859-020-3428-7