Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS
SoCC ’24, November 20–22, 2024, Redmond, WA, USA
Main Authors: | Singh, Vikramank; Song, Zhao; Narayanaswamy, Balakrishnan (Murali); Vaidya, Kapil Eknath; Kraska, Tim |
---|---|
Other Authors: | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
Format: | Article |
Language: | English |
Published: | ACM|ACM Symposium on Cloud Computing 2024 |
Online Access: | https://hdl.handle.net/1721.1/157897 |
_version_ | 1824458271591759872 |
---|---|
author | Singh, Vikramank Song, Zhao Narayanaswamy, Balakrishnan (Murali) Vaidya, Kapil Eknath Kraska, Tim |
author2 | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science |
author_facet | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science Singh, Vikramank Song, Zhao Narayanaswamy, Balakrishnan (Murali) Vaidya, Kapil Eknath Kraska, Tim |
author_sort | Singh, Vikramank |
collection | MIT |
description | SoCC ’24, November 20–22, 2024, Redmond, WA, USA |
first_indexed | 2025-02-19T04:23:14Z |
format | Article |
id | mit-1721.1/157897 |
institution | Massachusetts Institute of Technology |
language | English |
last_indexed | 2025-02-19T04:23:14Z |
publishDate | 2024 |
publisher | ACM|ACM Symposium on Cloud Computing |
record_format | dspace |
spelling | mit-1721.1/157897 2025-01-04T04:36:44Z Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS Singh, Vikramank Song, Zhao Narayanaswamy, Balakrishnan (Murali) Vaidya, Kapil Eknath Kraska, Tim Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science SoCC ’24, November 20–22, 2024, Redmond, WA, USA Database performance troubleshooting is a complex multi-step process that broadly involves three key stages: (a) Detection: determining what's wrong and when; (b) Root Cause Analysis (RCA): reasoning about why the performance is poor; (c) Resolution: identifying a fix. A plethora of techniques exist to address each of these problems, but they rarely work in the real world at scale. First, real-world customer workloads are noisy, non-stationary, and quasi-periodic in nature, rendering traditional detectors ineffective. Second, real-world production databases execute a highly diverse set of queries that skew the database statistics into long-tail distributions, causing traditional RCA methods to fail. Third, these databases typically execute millions of such diverse queries every minute, rendering traditional methods inefficient when deployed at scale. In this paper we describe Vista, a machine learning based performance troubleshooting framework for databases, and dive deep into how it addresses the three real-world problems outlined above. Vista deploys a deep auto-regressive model, trained on a large and diverse Amazon Relational Database Service (RDS) fleet, with custom skip connections and periodicity alignment features to model long-range and varying periodicity in customer workloads, and detects performance bottlenecks in the form of outliers. Furthermore, it efficiently filters only the top few dominating SQL queries from millions in a problematic workload, and uses a robust causal inference framework to identify the culprit queries and their statistics, leading to low false-positive and false-negative rates. Currently, Vista runs on hundreds of thousands of RDS databases and analyzes millions of workloads every day, bringing down the troubleshooting time for RDS customers from hours to seconds. Finally, we describe several challenges and learnings from implementing and deploying Vista at Amazon scale. 2024-12-19T17:16:42Z 2024-12-19T17:16:42Z 2024-11-20 2024-12-01T08:54:23Z Article http://purl.org/eprint/type/ConferencePaper 979-8-4007-1286-9 https://hdl.handle.net/1721.1/157897 Singh, Vikramank, Song, Zhao, Narayanaswamy, Balakrishnan (Murali), Vaidya, Kapil Eknath and Kraska, Tim. 2024. "Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS." PUBLISHER_CC en https://doi.org/10.1145/3698038.3698519 Creative Commons Attribution https://creativecommons.org/licenses/by/4.0/ The author(s) application/pdf ACM|ACM Symposium on Cloud Computing Association for Computing Machinery |
title | Vista: Machine Learning based Database Performance Troubleshooting Framework in Amazon RDS |
url | https://hdl.handle.net/1721.1/157897 |