A robust partitioning scheme for ad-hoc query workloads

© 2017 Association for Computing Machinery. Data partitioning is crucial to improving query performance and several workload-based partitioning techniques have been proposed in database literature. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not ha...


Bibliographic Details
Main Authors:	Shanbhag, Anil, Jindal, Alekh, Madden, Samuel, Quiane, Jorge, Elmore, Aaron J.
Other Authors:	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format:	Article
Language:	English
Published:	ACM 2021
Online Access:	https://hdl.handle.net/1721.1/137858

_version_	1826218153820028928
author	Shanbhag, Anil Jindal, Alekh Madden, Samuel Quiane, Jorge Elmore, Aaron J.
author2	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
author_facet	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Shanbhag, Anil Jindal, Alekh Madden, Samuel Quiane, Jorge Elmore, Aaron J.
author_sort	Shanbhag, Anil
collection	MIT
description	© 2017 Association for Computing Machinery. Data partitioning is crucial to improving query performance and several workload-based partitioning techniques have been proposed in database literature. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload a priori. Static workload-based data partitioning techniques are therefore not suitable for such settings. In this paper, we propose Amoeba, a distributed storage system that uses adaptive multi-attribute data partitioning to efficiently support ad-hoc as well as recurring queries. Amoeba requires zero set-up and tuning effort, allowing analysts to get the benefits of partitioning without requiring an upfront query workload. The key idea is to build and maintain a partitioning tree on top of the dataset. The partitioning tree allows us to answer queries with predicates by reading a subset of the data. The initial partitioning tree is created without requiring an upfront query workload and Amoeba adapts it over time by incrementally modifying subtrees based on user queries using repartitioning. A prototype of Amoeba running on top of Apache Spark improves query performance by up to 7x over full scans and up to 2x over range-based partitioning techniques on TPC-H as well as a real-world workload.
first_indexed	2024-09-23T17:15:13Z
format	Article
id	mit-1721.1/137858
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T17:15:13Z
publishDate	2021
publisher	ACM
record_format	dspace
spelling	mit-1721.1/137858 2023-04-18T18:29:55Z A robust partitioning scheme for ad-hoc query workloads Shanbhag, Anil Jindal, Alekh Madden, Samuel Quiane, Jorge Elmore, Aaron J. Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory © 2017 Association for Computing Machinery. Data partitioning is crucial to improving query performance and several workload-based partitioning techniques have been proposed in database literature. However, many modern analytic applications involve ad-hoc or exploratory analysis where users do not have a representative query workload a priori. Static workload-based data partitioning techniques are therefore not suitable for such settings. In this paper, we propose Amoeba, a distributed storage system that uses adaptive multi-attribute data partitioning to efficiently support ad-hoc as well as recurring queries. Amoeba requires zero set-up and tuning effort, allowing analysts to get the benefits of partitioning without requiring an upfront query workload. The key idea is to build and maintain a partitioning tree on top of the dataset. The partitioning tree allows us to answer queries with predicates by reading a subset of the data. The initial partitioning tree is created without requiring an upfront query workload and Amoeba adapts it over time by incrementally modifying subtrees based on user queries using repartitioning. A prototype of Amoeba running on top of Apache Spark improves query performance by up to 7x over full scans and up to 2x over range-based partitioning techniques on TPC-H as well as a real-world workload. 2021-11-09T13:28:47Z 2021-11-09T13:28:47Z 2017-09-24 2019-06-18T13:56:03Z Article http://purl.org/eprint/type/ConferencePaper https://hdl.handle.net/1721.1/137858 Shanbhag, Anil, Jindal, Alekh, Madden, Samuel, Quiane, Jorge and Elmore, Aaron J. 2017. "A robust partitioning scheme for ad-hoc query workloads."
en 10.1145/3127479.3131613 Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/ application/pdf ACM website
spellingShingle	Shanbhag, Anil Jindal, Alekh Madden, Samuel Quiane, Jorge Elmore, Aaron J. A robust partitioning scheme for ad-hoc query workloads
title	A robust partitioning scheme for ad-hoc query workloads
title_sort	robust partitioning scheme for ad hoc query workloads
url	https://hdl.handle.net/1721.1/137858
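
The description field above says the key idea is a partitioning tree built over the dataset without an upfront workload, so that queries with predicates read only a subset of the data. The following is a minimal sketch of that idea only, not the Amoeba implementation: the names `Node`, `build`, and `blocks_for`, the round-robin choice of split attribute, the median cut, and the in-memory lists standing in for data blocks are all assumptions made for illustration. The adaptive repartitioning of subtrees mentioned in the description is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Row = Dict[str, float]


@dataclass
class Node:
    """Hypothetical tree node: an inner split or a leaf holding one block."""
    attr: Optional[str] = None        # split attribute (None for a leaf)
    cut: Optional[float] = None       # rows with row[attr] <= cut go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    rows: Optional[List[Row]] = None  # leaf payload, standing in for a block


def build(rows: List[Row], attrs: List[str], block_size: int, depth: int = 0) -> Node:
    """Build a tree with no query workload: cycle through attributes
    (an assumed heuristic) and cut at the median value of each."""
    if len(rows) <= block_size:
        return Node(rows=rows)
    attr = attrs[depth % len(attrs)]
    ordered = sorted(rows, key=lambda r: r[attr])
    cut = ordered[(len(ordered) - 1) // 2][attr]
    left = [r for r in ordered if r[attr] <= cut]
    right = [r for r in ordered if r[attr] > cut]
    if not right:
        # The median cut cannot separate these rows on this attribute;
        # this sketch simply keeps them together as one block.
        return Node(rows=rows)
    return Node(
        attr=attr,
        cut=cut,
        left=build(left, attrs, block_size, depth + 1),
        right=build(right, attrs, block_size, depth + 1),
    )


def blocks_for(node: Node, attr: str, lo: float, hi: float) -> List[List[Row]]:
    """Return the blocks that may hold rows with lo <= row[attr] <= hi."""
    if node.rows is not None:
        return [node.rows]
    if node.attr != attr:
        # This split says nothing about the predicate: visit both sides.
        return blocks_for(node.left, attr, lo, hi) + blocks_for(node.right, attr, lo, hi)
    out: List[List[Row]] = []
    if lo <= node.cut:   # the range overlaps the left side (values <= cut)
        out += blocks_for(node.left, attr, lo, hi)
    if hi > node.cut:    # the range overlaps the right side (values > cut)
        out += blocks_for(node.right, attr, lo, hi)
    return out
```

In this sketch a range predicate on an attribute used for a split lets the lookup skip the whole subtree on the other side of the cut, while splits on other attributes force both children to be visited. The rows returned still need to be filtered against the predicate, since a block is only guaranteed to possibly contain matches.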


Similar Items