AdaptDB: Adaptive Partitioning for Distributed Joins

Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk, and because of data shuffling across the network. Many techniques based on data partitioning have b...

Full description

Bibliographic Details
Main Authors:	Jundal, Alekh, Lu, Yi, Shanbhag, Anil Atmanand, Madden, Samuel R
Other Authors:	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format:	Article
Language:	en_US
Published:	Association for Computing Machinery (ACM) 2018
Online Access:	http://hdl.handle.net/1721.1/116354 https://orcid.org/0000-0002-2718-9443 https://orcid.org/0000-0002-0925-1354 https://orcid.org/0000-0002-7470-3265

_version_	1826190134232481792
author	Jundal, Alekh Lu, Yi Shanbhag, Anil Atmanand Madden, Samuel R
author2	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
author_facet	Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Jundal, Alekh Lu, Yi Shanbhag, Anil Atmanand Madden, Samuel R
author_sort	Jundal, Alekh
collection	MIT
description	Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk, and because of data shuffling across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best partitioning scheme for a particular workload, rather than adapting to changes in the workload over time. In this paper, we present AdaptDB, an adaptive storage manager for analytical database workloads in a distributed setting. It works by partitioning datasets across a cluster and incrementally refining data partitioning as queries are run. AdaptDB introduces a novel hyper-join that avoids expensive data shuffling by identifying storage blocks of the joining tables that overlap on the join attribute, and only joining those blocks. Hyper-join performs well when each block in one table overlaps with few blocks in the other table, since that will minimize the number of blocks that have to be accessed. To minimize the number of overlapping blocks for common join queries, AdaptDB users smooth repartitioning to repartition small portions of the tables on join attributes as queries run. A prototype of AdaptDB running on top of Spark improves query performance by 2-3x on TPC-H as well as real-world dataset, versus a system that employs scans and shuffle-joins.
first_indexed	2024-09-23T08:35:33Z
format	Article
id	mit-1721.1/116354
institution	Massachusetts Institute of Technology
language	en_US
last_indexed	2024-09-23T08:35:33Z
publishDate	2018
publisher	Association for Computing Machinery (ACM)
record_format	dspace
spelling	mit-1721.1/1163542022-09-23T13:07:45Z AdaptDB: Adaptive Partitioning for Distributed Joins Jundal, Alekh Lu, Yi Shanbhag, Anil Atmanand Madden, Samuel R Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Lu, Yi Shanbhag, Anil Atmanand Madden, Samuel R Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk, and because of data shuffling across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best partitioning scheme for a particular workload, rather than adapting to changes in the workload over time. In this paper, we present AdaptDB, an adaptive storage manager for analytical database workloads in a distributed setting. It works by partitioning datasets across a cluster and incrementally refining data partitioning as queries are run. AdaptDB introduces a novel hyper-join that avoids expensive data shuffling by identifying storage blocks of the joining tables that overlap on the join attribute, and only joining those blocks. Hyper-join performs well when each block in one table overlaps with few blocks in the other table, since that will minimize the number of blocks that have to be accessed. To minimize the number of overlapping blocks for common join queries, AdaptDB users smooth repartitioning to repartition small portions of the tables on join attributes as queries run. A prototype of AdaptDB running on top of Spark improves query performance by 2-3x on TPC-H as well as real-world dataset, versus a system that employs scans and shuffle-joins. 2018-06-18T13:28:40Z 2018-06-18T13:28:40Z 2017-01 Article http://purl.org/eprint/type/JournalArticle 2150-8097 http://hdl.handle.net/1721.1/116354 Lu, Yi, Anil Shanbhag, Alekh Jindal and Samuel Madden. "AdaptDB: Adaptive Partitioning for Distributed Joins." Proceedings of the VLDB Endowment 10, no. 5 (2017): 589-600. https://orcid.org/0000-0002-2718-9443 https://orcid.org/0000-0002-0925-1354 https://orcid.org/0000-0002-7470-3265 en_US http://www.vldb.org/pvldb/vol10.html Proceedings of the VLDB Endowment Creative Commons Attribution-NonCommercial-NoDerivs License http://creativecommons.org/licenses/by-nc-nd/4.0/ application/pdf Association for Computing Machinery (ACM) Proceedings of the Vldb Endowment
spellingShingle	Jundal, Alekh Lu, Yi Shanbhag, Anil Atmanand Madden, Samuel R AdaptDB: Adaptive Partitioning for Distributed Joins
title	AdaptDB: Adaptive Partitioning for Distributed Joins
title_full	AdaptDB: Adaptive Partitioning for Distributed Joins
title_fullStr	AdaptDB: Adaptive Partitioning for Distributed Joins
title_full_unstemmed	AdaptDB: Adaptive Partitioning for Distributed Joins
title_short	AdaptDB: Adaptive Partitioning for Distributed Joins
title_sort	adaptdb adaptive partitioning for distributed joins
url	http://hdl.handle.net/1721.1/116354 https://orcid.org/0000-0002-2718-9443 https://orcid.org/0000-0002-0925-1354 https://orcid.org/0000-0002-7470-3265
work_keys_str_mv	AT jundalalekh adaptdbadaptivepartitioningfordistributedjoins AT luyi adaptdbadaptivepartitioningfordistributedjoins AT shanbhaganilatmanand adaptdbadaptivepartitioningfordistributedjoins AT maddensamuelr adaptdbadaptivepartitioningfordistributedjoins

AdaptDB: Adaptive Partitioning for Distributed Joins

Similar Items