Starling: A Scalable Query Engine on Cloud Functions

© 2020 Association for Computing Machinery. Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle much of the time, meaning customers pay for compute resources even when underutilized. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine-granularity tasks makes them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny, stateless, resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1 TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads in which queries arrive 1 minute apart or more. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.

Bibliographic Details

Main Authors: Perron, Matthew; Castro Fernandez, Raul; DeWitt, David; Madden, Samuel
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format: Article
Language: English
Published: ACM, 2021
Online Access: https://hdl.handle.net/1721.1/136620
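The abstract's central technique — stateless cloud-function workers exchanging intermediate results through an opaque object store rather than direct network connections — can be sketched as follows. This is a minimal illustrative sketch, not Starling's actual implementation: the in-memory dict standing in for a cloud object store (such as S3), the key layout, and the hash-partitioning scheme are all assumptions made for illustration.

```python
# Sketch of a shuffle through object storage: producers hash-partition rows
# and write one object per partition; each consumer reads its partition's
# object from every producer. A dict stands in for the cloud object store.

object_store = {}  # stand-in for a cloud object store: key -> list of rows


def producer_task(task_id, rows, num_partitions):
    """Map-side worker: hash-partition rows and write one object per partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[0]) % num_partitions].append(row)
    for p, part in enumerate(partitions):
        object_store[f"shuffle/task{task_id}/part{p}"] = part


def consumer_task(partition, num_producers):
    """Reduce-side worker: collect its partition from every producer's output."""
    rows = []
    for t in range(num_producers):
        rows.extend(object_store.get(f"shuffle/task{t}/part{partition}", []))
    return rows


# Two producer tasks shuffle rows to two consumer partitions; all rows
# sharing a join key end up at the same consumer.
producer_task(0, [("a", 1), ("b", 2)], num_partitions=2)
producer_task(1, [("a", 3), ("c", 4)], num_partitions=2)
merged = consumer_task(hash("a") % 2, num_producers=2)
```

Because workers are stateless and short-lived, the object store is the only rendezvous point between query stages; the paper's challenges (stragglers, opaque storage latency) arise precisely because this indirection replaces direct worker-to-worker communication.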
DOI: 10.1145/3318464.3380609
Published in: Proceedings of the ACM SIGMOD International Conference on Management of Data (2020)
License: Creative Commons Attribution-Noncommercial-Share Alike, http://creativecommons.org/licenses/by-nc-sa/4.0/