CHuff: Conditional Huffman String Compression
Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding i...
Main Author: | |
---|---|
Other Authors: | |
Format: | Thesis |
Published: |
Massachusetts Institute of Technology
2022
|
Online Access: | https://hdl.handle.net/1721.1/139926 |
_version_ | 1826189676273205248 |
---|---|
author | Nagda, Bhavik |
author2 | Kraska, Tim |
author_facet | Kraska, Tim Nagda, Bhavik |
author_sort | Nagda, Bhavik |
collection | MIT |
description | Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding is a light-weight compression scheme that replaces string tokens with fixed-size integers. In performing dictionary compression on a given column, database systems initially build a table of distinct values and then compress tokens into their corresponding table indices. This work focuses on optimizing compression for strings in columnar database stores. We introduce Conditional Huffman (CHuff) compression, a novel approach leveraging longstanding Huffman encoding and recent advances in hashing and storage paradigms. CHuff relies on low-entropy conditional relationships between consecutive characters in textual data to construct and apply Huffman-based compression models. The system additionally auto-tunes parameters for various corpus workloads, optimizing the compression rate while avoiding over-fitting. We demonstrate that on real-world data, CHuff performs favorably compared to similar string compressors, achieving an average 24% improvement in compression rate on our diverse experimental corpora. |
first_indexed | 2024-09-23T08:19:21Z |
format | Thesis |
id | mit-1721.1/139926 |
institution | Massachusetts Institute of Technology |
last_indexed | 2024-09-23T08:19:21Z |
publishDate | 2022 |
publisher | Massachusetts Institute of Technology |
record_format | dspace |
spelling | mit-1721.1/1399262022-02-08T04:05:20Z CHuff: Conditional Huffman String Compression Nagda, Bhavik Kraska, Tim Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding is a light-weight compression scheme that replaces string tokens with fixed-size integers. In performing dictionary compression on a given column, database systems initially build a table of distinct values and then compress tokens into their corresponding table indices. This work focuses on optimizing compression for strings in columnar database stores. We introduce Conditional Huffman (CHuff) compression, a novel approach leveraging longstanding Huffman encoding and recent advances in hashing and storage paradigms. CHuff relies on low-entropy conditional relationships between consecutive characters in textual data to construct and apply Huffman-based compression models. The system additionally auto-tunes parameters for various corpus workloads, optimizing the compression rate while avoiding over-fitting. We demonstrate that on real-world data, CHuff performs favorably compared to similar string compressors, achieving an average 24% improvement in compression rate on our diverse experimental corpora. M.Eng. 2022-02-07T15:13:10Z 2022-02-07T15:13:10Z 2021-09 2021-11-03T19:25:32.845Z Thesis https://hdl.handle.net/1721.1/139926 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology |
spellingShingle | Nagda, Bhavik CHuff: Conditional Huffman String Compression |
title | CHuff: Conditional Huffman String Compression |
title_full | CHuff: Conditional Huffman String Compression |
title_fullStr | CHuff: Conditional Huffman String Compression |
title_full_unstemmed | CHuff: Conditional Huffman String Compression |
title_short | CHuff: Conditional Huffman String Compression |
title_sort | chuff conditional huffman string compression |
url | https://hdl.handle.net/1721.1/139926 |
work_keys_str_mv | AT nagdabhavik chuffconditionalhuffmanstringcompression |