Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification
It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de nov...
Main Authors: | Berger Leighton, Bonnie, Yu, Yun William, Yorukoglu, Deniz |
---|---|
Other Authors: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Format: | Article |
Published: | Springer Nature 2018 |
Online Access: | http://hdl.handle.net/1721.1/116308 https://orcid.org/0000-0002-2724-7228 https://orcid.org/0000-0002-8275-9576 https://orcid.org/0000-0003-2315-0768 |
_version_ | 1826193688796069888 |
---|---|
author | Berger Leighton, Bonnie Yu, Yun William Yorukoglu, Deniz |
author2 | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
author_facet | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Berger Leighton, Bonnie Yu, Yun William Yorukoglu, Deniz |
author_sort | Berger Leighton, Bonnie |
collection | MIT |
description | It is becoming increasingly impractical to indefinitely store raw sequencing data for later processing in an uncompressed state. In this paper, we describe a scalable compressive framework, Read-Quality-Sparsifier (RQS), which substantially outperforms the compression ratio and speed of other de novo quality score compression methods while maintaining SNP-calling accuracy. Surprisingly, RQS also improves the SNP-calling accuracy on a gold-standard, real-life sequencing dataset (NA12878) using a k-mer density profile constructed from 77 other individuals from the 1000 Genomes Project. This improvement in downstream accuracy emerges from the observation that quality score values within NGS datasets are inherently encoded in the k-mer landscape of the genomic sequences. To our knowledge, RQS is the first scalable sequence-based quality compression method that can efficiently compress quality scores of terabyte-sized and larger sequencing datasets. Availability: An implementation of our method, RQS, is available for download at: http://rqs.csail.mit.edu/. © 2014 Springer International Publishing Switzerland. Keywords: RQS; quality score; sparsification; compression; accuracy; variant calling |
first_indexed | 2024-09-23T09:43:25Z |
format | Article |
id | mit-1721.1/116308 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T09:43:25Z |
publishDate | 2018 |
publisher | Springer Nature |
record_format | dspace |
spelling | mit-1721.1/116308 2022-09-30T16:27:34Z Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification Berger Leighton, Bonnie Yu, Yun William Yorukoglu, Deniz Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology. Department of Mathematics Hertz Foundation National Institutes of Health (U.S.) (R01GM108348) 2018-06-14T14:38:29Z 2018-06-14T14:38:29Z 2014-04 2018-05-16T17:18:42Z Article http://purl.org/eprint/type/ConferencePaper 978-3-319-05268-7 978-3-319-05269-4 0302-9743 1611-3349 http://hdl.handle.net/1721.1/116308 Yu, Y. William, et al. “Traversing the K-Mer Landscape of NGS Read Datasets for Quality Score Sparsification.” Research in Computational Molecular Biology, edited by Roded Sharan, vol. 8394, Springer International Publishing, 2014, pp. 385–99. https://orcid.org/0000-0002-2724-7228 https://orcid.org/0000-0002-8275-9576 https://orcid.org/0000-0003-2315-0768 http://dx.doi.org/10.1007/978-3-319-05269-4_31 Research in Computational Molecular Biology Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf Springer Nature PMC |
title | Traversing the k-mer Landscape of NGS Read Datasets for Quality Score Sparsification |
url | http://hdl.handle.net/1721.1/116308 https://orcid.org/0000-0002-2724-7228 https://orcid.org/0000-0002-8275-9576 https://orcid.org/0000-0003-2315-0768 |
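The description above says that quality scores are largely encoded in the k-mer landscape of the reads, and that a k-mer density profile built from other individuals can guide sparsification. The following is a minimal illustrative sketch of that general idea, not the RQS implementation: the record does not specify the algorithm, so the function names (`build_kmer_profile`, `sparsify_qualities`), the `min_count` threshold, and the `default_q` replacement score are all assumptions made for illustration.

```python
from collections import Counter


def build_kmer_profile(corpus_reads, k):
    """Count every k-mer that occurs in a corpus of reads (hypothetical helper)."""
    counts = Counter()
    for read in corpus_reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sparsify_qualities(seq, quals, profile, k, min_count=2, default_q=40):
    """Replace the quality at every position covered by a frequent k-mer.

    A position is "covered" when at least one k-mer overlapping it appears
    at least min_count times in the profile. Covered positions receive the
    single value default_q (so they compress well); all other positions
    keep their original quality score. Both thresholds are assumptions.
    """
    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        if profile[seq[i:i + k]] >= min_count:
            for j in range(i, i + k):
                covered[j] = True
    return [default_q if c else q for c, q in zip(covered, quals)]


# Toy corpus standing in for reads from other individuals.
corpus = ["ACGTACGT", "ACGTACGA"]
profile = build_kmer_profile(corpus, k=4)

# Only the leading ACGT of this read is frequent in the corpus.
sparse = sparsify_qualities("ACGTTTTT", [10] * 8, profile, k=4)
print(sparse)
```

In this toy run only the first four positions are covered by a frequent k-mer, so only their scores are replaced; the remaining positions, where the read departs from the corpus, keep their original values.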