Some Cardinality Estimates are More Equal than Others

Recently there has been significant interest in using machine learning to improve the accuracy of cardinality estimation. This work has focused on improving average estimation error, but not all estimates matter equally for downstream tasks like query optimization. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, for learning cardinality estimation models. Flow-Loss approximates the optimizer's cost model and search algorithm with analytical functions, which it uses to optimize explicitly for better query plans. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain "plan graph", in which different paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark (CEB), which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the plan costs and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, models trained with both loss functions perform well. However, the Q-Error-trained model degrades significantly when evaluated on slightly different queries (e.g., similar but unseen query templates), while the Flow-Loss-trained model generalizes better to such situations, achieving 4–8× better 99th percentile runtimes on unseen templates with the same model architecture and training data.
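For context on the baseline named in the abstract: Q-Error is the standard accuracy metric for cardinality estimates, penalizing over- and underestimates symmetrically by their multiplicative error. A minimal illustrative sketch (not code from the thesis):

```python
def q_error(estimate: float, truth: float) -> float:
    """Q-Error = max(est/true, true/est); 1.0 means a perfect estimate."""
    if estimate <= 0 or truth <= 0:
        raise ValueError("cardinalities must be positive")
    return max(estimate / truth, truth / estimate)

# A 10x underestimate and a 10x overestimate score identically,
# regardless of how each error would affect the chosen query plan:
assert q_error(10, 100) == q_error(1000, 100) == 10.0
```

This plan-agnostic symmetry is precisely what the thesis argues against: two estimates with the same Q-Error can have very different consequences for the optimizer, which is what Flow-Loss is designed to capture.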


Bibliographic Details
Main Author: Negi, Parimarjan
Other Authors: Alizadeh, Mohammad
Format: Thesis
Published: Massachusetts Institute of Technology, 2022
Online Access: https://hdl.handle.net/1721.1/143367
ORCID: https://orcid.org/0000-0002-8442-9159
Department: Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
Degree: S.M.
Thesis Date: February 2022
File Format: application/pdf
Rights: In Copyright - Educational Use Permitted; Copyright MIT (http://rightsstatements.org/page/InC-EDU/1.0/)