Machine Learning for Out of Distribution Database Workloads

DBMS query optimizers rely on several heuristics to make decisions, such as simplifying assumptions in cardinality estimation or cost-model assumptions for predicting query latencies. With the rise of cloud-first DBMS architectures, it is now possible to collect massive amounts of data on executed queries, which provides a way to improve these heuristics with models that exploit the execution history. In particular, such models can be specialized to a given workload; by learning patterns, for example that certain joins are always unexpectedly slow or that certain tables are always much larger than expected, they can do much better than the average case. This can be very beneficial for performance. However, deploying ML systems in the real world has a catch: it is hard to avoid Out-of-Distribution (OoD) scenarios in real workloads. ML models often fail in surprising ways in OoD scenarios, and this is an active area of research in the broader ML community. In this thesis, we introduce several such OoD scenarios in the context of database workloads and show that ML models can easily fail catastrophically in such cases. These range from new query patterns, such as a new column or a new join, to execution-time variance across different hardware and system loads. In each case, we use database-specific knowledge to develop techniques that yield ML models with more reliable and robust performance in OoD settings.
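To make the cardinality-estimation point concrete, the sketch below is illustrative only and not code from the thesis; the table columns, selectivity numbers, and the LearnedCorrection class are all invented. It contrasts the classical independence assumption, which multiplies per-column selectivities, with a toy workload-specialized model that memorizes corrections for predicate combinations it has seen and silently falls back to the naive estimate on an unseen combination. That silent fallback is the kind of out-of-distribution failure the thesis is concerned with.

```python
# Hypothetical illustration (not from the thesis): a classical estimator under
# the independence assumption vs. a toy workload-specialized correction model
# that only helps on predicate combinations it has seen during training.

def independence_estimate(per_column_selectivity, predicates):
    """Classical estimate: assume predicates are independent and multiply."""
    est = 1.0
    for column in predicates:
        est *= per_column_selectivity[column]
    return est

class LearnedCorrection:
    """Toy 'learned' model: memorizes a multiplicative correction per predicate set."""
    def __init__(self):
        self.corrections = {}  # frozenset of columns -> correction factor

    def fit(self, training_examples, per_column_selectivity):
        for predicates, true_selectivity in training_examples:
            base = independence_estimate(per_column_selectivity, predicates)
            self.corrections[frozenset(predicates)] = true_selectivity / base

    def estimate(self, per_column_selectivity, predicates):
        base = independence_estimate(per_column_selectivity, predicates)
        # Unseen (out-of-distribution) combination: no correction is stored,
        # so the model silently falls back to the naive independence estimate.
        return base * self.corrections.get(frozenset(predicates), 1.0)

if __name__ == "__main__":
    sel = {"city": 0.01, "country": 0.02, "age": 0.1}
    # Correlated columns: city implies country, so the true joint selectivity
    # is far from the product of the individual selectivities.
    train = [(("city", "country"), 0.01)]
    model = LearnedCorrection()
    model.fit(train, sel)

    print(independence_estimate(sel, ("city", "country")))  # ~0.0002, ~50x underestimate
    print(model.estimate(sel, ("city", "country")))         # ~0.01, corrected in-distribution
    print(model.estimate(sel, ("city", "age")))             # unseen combination: naive product again
```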


Bibliographic Details
Main Author: Negi, Parimarjan
Other Authors: Alizadeh, Mohammad
Format: Thesis
Published: Massachusetts Institute of Technology, 2024
Online Access: https://hdl.handle.net/1721.1/153835
https://orcid.org/0000-0002-8442-9159
Department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Degree: Ph.D.
Thesis Date: 2024-02
File Format: application/pdf
Rights: In Copyright - Educational Use Permitted; Copyright retained by author(s); https://rightsstatements.org/page/InC-EDU/1.0/