Accurate sampling-based cardinality estimation for complex graph queries
Accurately estimating the cardinality (i.e., the number of answers) of complex queries plays a central role in database systems. This problem is particularly difficult in graph databases, where queries often involve a large number of joins and self-joins. Recently, Park et al. [54] surveyed seven st...
Main Authors: | , |
---|---|
Format: | Journal article |
Language: | English |
Published: |
Association for Computing Machinery
2024
|
_version_ | 1817930619714273280 |
---|---|
author | Hu, P Motik, B |
author_facet | Hu, P Motik, B |
author_sort | Hu, P |
collection | OXFORD |
description | Accurately estimating the cardinality (i.e., the number of answers) of complex queries plays a central role in
database systems. This problem is particularly difficult in graph databases, where queries often involve a large
number of joins and self-joins. Recently, Park et al. [54] surveyed seven state-of-the-art cardinality estimation
approaches for graph queries. The results of their extensive empirical evaluation show that a sampling method
based on the WanderJoin online aggregation algorithm [46] consistently offers superior accuracy.
We extended the framework by Park et al. [54] with three additional datasets and repeated their experiments.
Our results showed that WanderJoin is indeed very accurate, but it can often take a large number of samples
and thus be very slow. Moreover, when queries are complex and data distributions are skewed, it often fails
to find valid samples and estimates the cardinality as zero. Finally, complex graph queries often go beyond
simple graph matching and involve arbitrary nesting of relational operators such as disjunction, difference,
and duplicate elimination. Neither of the methods considered by Park et al. [54] is applicable to such queries.
In this paper we present a novel approach for estimating the cardinality of complex graph queries. Our
approach is inspired by WanderJoin, but, unlike all approaches known to us, it can process complex queries with
arbitrary operator nesting. Our estimator is strongly consistent, meaning that the average of repeated estimates
converges with probability one to the actual cardinality. We present optimisations of the basic algorithm
that aim to reduce the chance of producing zero estimates and improve accuracy. We show empirically that
our approach is both accurate and quick on complex queries and large datasets. Finally, we discuss how to
integrate our approach into a simple dynamic programming query planner, and we confirm empirically that
our planner produces high-quality plans that can significantly reduce end-to-end query evaluation times. |
first_indexed | 2024-09-25T04:18:21Z |
format | Journal article |
id | oxford-uuid:182efdfd-ea36-46c6-9837-4add560594a2 |
institution | University of Oxford |
language | English |
last_indexed | 2024-12-09T03:09:01Z |
publishDate | 2024 |
publisher | Association for Computing Machinery |
record_format | dspace |
spelling | oxford-uuid:182efdfd-ea36-46c6-9837-4add560594a22024-10-01T09:34:07ZAccurate sampling-based cardinality estimation for complex graph queriesJournal articlehttp://purl.org/coar/resource_type/c_dcae04bcuuid:182efdfd-ea36-46c6-9837-4add560594a2EnglishSymplectic ElementsAssociation for Computing Machinery2024Hu, PMotik, BAccurately estimating the cardinality (i.e., the number of answers) of complex queries plays a central role in database systems. This problem is particularly difficult in graph databases, where queries often involve a large number of joins and self-joins. Recently, Park et al. [54] surveyed seven state-of-the-art cardinality estimation approaches for graph queries. The results of their extensive empirical evaluation show that a sampling method based on the WanderJoin online aggregation algorithm [46] consistently offers superior accuracy. We extended the framework by Park et al. [54] with three additional datasets and repeated their experiments. Our results showed that WanderJoin is indeed very accurate, but it can often take a large number of samples and thus be very slow. Moreover, when queries are complex and data distributions are skewed, it often fails to find valid samples and estimates the cardinality as zero. Finally, complex graph queries often go beyond simple graph matching and involve arbitrary nesting of relational operators such as disjunction, difference, and duplicate elimination. Neither of the methods considered by Park et al. [54] is applicable to such queries. In this paper we present a novel approach for estimating the cardinality of complex graph queries. Our approach is inspired by WanderJoin, but, unlike all approaches known to us, it can process complex queries with arbitrary operator nesting. Our estimator is strongly consistent, meaning that the average of repeated estimates converges with probability one to the actual cardinality. We present optimisations of the basic algorithm that aim to reduce the chance of producing zero estimates and improve accuracy. We show empirically that our approach is both accurate and quick on complex queries and large datasets. Finally, we discuss how to integrate our approach into a simple dynamic programming query planner, and we confirm empirically that our planner produces high-quality plans that can significantly reduce end-to-end query evaluation times. |
spellingShingle | Hu, P Motik, B Accurate sampling-based cardinality estimation for complex graph queries |
title | Accurate sampling-based cardinality estimation for complex graph queries |
title_full | Accurate sampling-based cardinality estimation for complex graph queries |
title_fullStr | Accurate sampling-based cardinality estimation for complex graph queries |
title_full_unstemmed | Accurate sampling-based cardinality estimation for complex graph queries |
title_short | Accurate sampling-based cardinality estimation for complex graph queries |
title_sort | accurate sampling based cardinality estimation for complex graph queries |
work_keys_str_mv | AT hup accuratesamplingbasedcardinalityestimationforcomplexgraphqueries AT motikb accuratesamplingbasedcardinalityestimationforcomplexgraphqueries |