Summary: | Cloud-based disaggregated database systems that divide data across a data layer and a storage layer connected by network calls are popular for analytical query workloads. This thesis explores two topics critical to building performant systems of this type: space optimization and latency minimization.
First, I propose ColumnConstruct, a general-purpose machine learning compression method that uses a novel information-maximizing technique for building input features. ColumnConstruct is competitive with existing ML compression methods for categorical data but cannot perform lossless compression on arbitrary tabular data. This limitation, together with the additional compression and decompression latency, makes it insufficient to improve query latency within a database management system. Next, I investigate whether a workload-aware data layout combined with caching can improve query times without the need for ML-based compression or storage-layer computation pushdown. I show that for small cache sizes and homogeneous query sets, a workload-aware layout combined with existing compression methods can be more effective than computation pushdown, without relying on particular features of the data storage layer.
|