CHuff: Conditional Huffman String Compression

Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding i...

Full description

Bibliographic Details
Main Author: Nagda, Bhavik
Other Authors: Kraska, Tim
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access:https://hdl.handle.net/1721.1/139926
_version_ 1826189676273205248
author Nagda, Bhavik
author2 Kraska, Tim
author_facet Kraska, Tim
Nagda, Bhavik
author_sort Nagda, Bhavik
collection MIT
description Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding is a light-weight compression scheme that replaces string tokens with fixed-size integers. In performing dictionary compression on a given column, database systems initially build a table of distinct values and then compress tokens into their corresponding table indices. This work focuses on optimizing compression for strings in columnar database stores. We introduce Conditional Huffman (CHuff) compression, a novel approach leveraging longstanding Huffman encoding and recent advances in hashing and storage paradigms. CHuff relies on low-entropy conditional relationships between consecutive characters in textual data to construct and apply Huffman-based compression models. The system additionally auto-tunes parameters for various corpus workloads, optimizing the compression rate while avoiding over-fitting. We demonstrate that on real-world data, CHuff performs favorably compared to similar string compressors, achieving an average 24% improvement in compression rate on our diverse experimental corpora.
first_indexed 2024-09-23T08:19:21Z
format Thesis
id mit-1721.1/139926
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T08:19:21Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/1399262022-02-08T04:05:20Z CHuff: Conditional Huffman String Compression Nagda, Bhavik Kraska, Tim Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Columnar databases have become ubiquitous in recent years due to their performance for analytical processing applications. Data storage in columnar form benefits from opportunities for improved compression performance as compared to row-oriented systems. For common string data, dictionary encoding is a light-weight compression scheme that replaces string tokens with fixed-size integers. In performing dictionary compression on a given column, database systems initially build a table of distinct values and then compress tokens into their corresponding table indices. This work focuses on optimizing compression for strings in columnar database stores. We introduce Conditional Huffman (CHuff) compression, a novel approach leveraging longstanding Huffman encoding and recent advances in hashing and storage paradigms. CHuff relies on low-entropy conditional relationships between consecutive characters in textual data to construct and apply Huffman-based compression models. The system additionally auto-tunes parameters for various corpus workloads, optimizing the compression rate while avoiding over-fitting. We demonstrate that on real-world data, CHuff performs favorably compared to similar string compressors, achieving an average 24% improvement in compression rate on our diverse experimental corpora. M.Eng. 2022-02-07T15:13:10Z 2022-02-07T15:13:10Z 2021-09 2021-11-03T19:25:32.845Z Thesis https://hdl.handle.net/1721.1/139926 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Nagda, Bhavik
CHuff: Conditional Huffman String Compression
title CHuff: Conditional Huffman String Compression
title_full CHuff: Conditional Huffman String Compression
title_fullStr CHuff: Conditional Huffman String Compression
title_full_unstemmed CHuff: Conditional Huffman String Compression
title_short CHuff: Conditional Huffman String Compression
title_sort chuff conditional huffman string compression
url https://hdl.handle.net/1721.1/139926
work_keys_str_mv AT nagdabhavik chuffconditionalhuffmanstringcompression