Rethinking Update-in-Place Key-Value Stores for Modern Storage

Bibliographic Details
Main Author: Markakis, Markos
Other Authors: Kraska, Tim
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access: https://hdl.handle.net/1721.1/144764
https://orcid.org/0000-0003-2851-8840
Description
Summary: Several widely used key-value stores, like RocksDB, are designed around log-structured merge trees (LSMs). Optimized for the performance characteristics of HDDs, LSMs provide good write performance by emphasizing sequential access to storage. However, this approach negatively impacts read performance: LSMs must employ expensive compaction jobs and memory-consuming Bloom filters to achieve reasonably fast reads. In the era of NVMe SSDs, we argue that this trade-off between read performance and write performance is sub-optimal. With enough parallelism, modern storage media have comparable random and sequential access performance, making update-in-place designs, which traditionally provide high read performance, a viable alternative to LSMs. In this thesis, based on a research paper currently under submission, we close the gap between log-structured and update-in-place designs on modern SSDs by taking advantage of data and workload patterns. Specifically, we explore three key ideas: (A) record caching for efficient point operations, (B) page grouping for high-performance range scans, and (C) insert forecasting to reduce the reorganization costs of accommodating new records. We evaluate these ideas by implementing them in a prototype update-in-place key-value store called TreeLine. On YCSB, we find that TreeLine outperforms RocksDB and LeanStore by 2.18× and 2.05× respectively on average across the point workloads, and by up to 10.87× and 7.78× overall.