Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS

SoCC ’24, November 20–22, 2024, Redmond, WA, USA

Bibliographic Details
Main Authors: Singh, Vikramank, Song, Zhao, Narayanaswamy, Balakrishnan (Murali), Vaidya, Kapil Eknath, Kraska, Tim
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Article
Language: English
Published: ACM|ACM Symposium on Cloud Computing 2024
Online Access: https://hdl.handle.net/1721.1/157897
collection MIT
description SoCC ’24, November 20–22, 2024, Redmond, WA, USA
first_indexed 2025-02-19T04:23:14Z
format Article
id mit-1721.1/157897
institution Massachusetts Institute of Technology
language English
last_indexed 2025-02-19T04:23:14Z
publishDate 2024
publisher ACM|ACM Symposium on Cloud Computing
record_format dspace
spelling mit-1721.1/157897 2025-01-04T04:36:44Z
Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS
Singh, Vikramank; Song, Zhao; Narayanaswamy, Balakrishnan (Murali); Vaidya, Kapil Eknath; Kraska, Tim
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
SoCC ’24, November 20–22, 2024, Redmond, WA, USA
Database performance troubleshooting is a complex multi-step process that broadly involves three key stages: (a) Detection: determining what is wrong and when; (b) Root Cause Analysis (RCA): reasoning about why performance is poor; (c) Resolution: identifying a fix. A plethora of techniques exist to address each of these problems, but they rarely work in the real world at scale. First, real-world customer workloads are noisy, non-stationary, and quasi-periodic, rendering traditional detectors ineffective. Second, real-world production databases execute a highly diverse set of queries that skew the database statistics into long-tail distributions, causing traditional RCA methods to fail. Third, these databases typically execute millions of such diverse queries every minute, rendering traditional methods inefficient when deployed at scale. In this paper we describe Vista, a machine learning based performance troubleshooting framework for databases, and dive deep into how it addresses the three real-world problems outlined above. Vista deploys a deep auto-regressive model, trained on a large and diverse Amazon Relational Database Service (RDS) fleet, with custom skip connections and periodicity alignment features to model long-range and varying periodicity in customer workloads, and detects performance bottlenecks in the form of outliers. Furthermore, it efficiently filters only the top few dominating SQL queries from millions in a problematic workload, and uses a robust causal inference framework to identify the culprit queries and their statistics, leading to low false-positive and false-negative rates. Currently, Vista runs on hundreds of thousands of RDS databases and analyzes millions of workloads every day, bringing down the troubleshooting time for RDS customers from hours to seconds. Finally, we describe several challenges and learnings from implementing and deploying Vista at Amazon scale.
2024-12-19T17:16:42Z 2024-12-19T17:16:42Z 2024-11-20 2024-12-01T08:54:23Z
Article http://purl.org/eprint/type/ConferencePaper
979-8-4007-1286-9
https://hdl.handle.net/1721.1/157897
Singh, Vikramank, Song, Zhao, Narayanaswamy, Balakrishnan (Murali), Vaidya, Kapil Eknath and Kraska, Tim. 2024. "Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS."
PUBLISHER_CC en
https://doi.org/10.1145/3698038.3698519
Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/
The author(s)
application/pdf
ACM|ACM Symposium on Cloud Computing
Association for Computing Machinery