Learned String Index Structures for In-Memory Databases

Within the field of machine learning for systems, learning-based methods have brought new perspective to indexing by reframing it as a cumulative distribution function (CDF) modeling problem. The burgeoning field, despite its nascence, has brought with it many opportunities and efficiencies. However...

Full description

Bibliographic Details
Main Author: Spector, Benjamin
Other Authors: Kraska, Tim
Format: Thesis
Published: Massachusetts Institute of Technology 2022
Online Access:https://hdl.handle.net/1721.1/144902
_version_ 1826202193214046208
author Spector, Benjamin
author2 Kraska, Tim
author_facet Kraska, Tim
Spector, Benjamin
author_sort Spector, Benjamin
collection MIT
description Within the field of machine learning for systems, learning-based methods have brought new perspective to indexing by reframing it as a cumulative distribution function (CDF) modeling problem. The burgeoning field, despite its nascence, has brought with it many opportunities and efficiencies. However, most work in this area has focused on efficiently indexing numerical keys, as the additional challenges posed by indexing strings have prevented the effective application of these techniques to string domains. We hypothesize that the machine learning approaches which have, in recent years, made significant strides in scalar indexing applications can also be effectively adapted to string applications. First, we introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of learned radix splines each indexing a fixed number of bytes. RSS achieves better performance than other structures by first using the minimal string prefix to sufficiently distinguish the data, followed by a contextual learned model to predict its location. Additionally, the bounded-error nature of RSS accelerates the last mile search and also enables a memory-efficient hash-table lookup accelerator. Second, we benchmark RSS against existing algorithms on several real-world string datasets and study its performance in-depth. RSS approaches or exceeds the performance of traditional string indexes while using up to 300× less memory, suggesting this line of research may be promising for future memory-intensive database applications.
first_indexed 2024-09-23T12:03:39Z
format Thesis
id mit-1721.1/144902
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T12:03:39Z
publishDate 2022
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/1449022022-08-30T03:36:44Z Learned String Index Structures for In-Memory Databases Spector, Benjamin Kraska, Tim Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Within the field of machine learning for systems, learning-based methods have brought new perspective to indexing by reframing it as a cumulative distribution function (CDF) modeling problem. The burgeoning field, despite its nascence, has brought with it many opportunities and efficiencies. However, most work in this area has focused on efficiently indexing numerical keys, as the additional challenges posed by indexing strings have prevented the effective application of these techniques to string domains. We hypothesize that the machine learning approaches which have, in recent years, made significant strides in scalar indexing applications can also be effectively adapted to string applications. First, we introduce the RadixStringSpline (RSS) learned index structure for efficiently indexing strings. RSS is a tree of learned radix splines each indexing a fixed number of bytes. RSS achieves better performance than other structures by first using the minimal string prefix to sufficiently distinguish the data, followed by a contextual learned model to predict its location. Additionally, the bounded-error nature of RSS accelerates the last mile search and also enables a memory-efficient hash-table lookup accelerator. Second, we benchmark RSS against existing algorithms on several real-world string datasets and study its performance in-depth. RSS approaches or exceeds the performance of traditional string indexes while using up to 300× less memory, suggesting this line of research may be promising for future memory-intensive database applications. M.Eng. 2022-08-29T16:19:49Z 2022-08-29T16:19:49Z 2022-05 2022-05-27T16:18:30.301Z Thesis https://hdl.handle.net/1721.1/144902 In Copyright - Educational Use Permitted Copyright MIT http://rightsstatements.org/page/InC-EDU/1.0/ application/pdf Massachusetts Institute of Technology
spellingShingle Spector, Benjamin
Learned String Index Structures for In-Memory Databases
title Learned String Index Structures for In-Memory Databases
title_full Learned String Index Structures for In-Memory Databases
title_fullStr Learned String Index Structures for In-Memory Databases
title_full_unstemmed Learned String Index Structures for In-Memory Databases
title_short Learned String Index Structures for In-Memory Databases
title_sort learned string index structures for in memory databases
url https://hdl.handle.net/1721.1/144902
work_keys_str_mv AT spectorbenjamin learnedstringindexstructuresforinmemorydatabases