A 2-Phase Strategy for Intelligent Cloud Operations

When operating large cloud computing infrastructures, ensuring healthiness of physical resources and software components is of paramount importance to meet the demanding service levels expected by customers. This is only possible using automations that can detect anomalies and alert the on-call pers...

Bibliographic Details
Main Authors:	Giacomo Lanciano, Remo Andreoli, Tommaso Cucinotta, Davide Bacciu, Andrea Passarella
Format:	Article
Language:	English
Published:	IEEE 2023-01-01
Series:	IEEE Access
Subjects:	Cloud operations; fault management; machine learning; monitoring; OpenStack
Online Access:	https://ieeexplore.ieee.org/document/10239346/

collection	DOAJ
description	When operating large cloud computing infrastructures, ensuring the health of physical resources and software components is of paramount importance to meet the demanding service levels expected by customers. This is only possible using automation that can detect anomalies and either alert the on-call personnel or trigger healing procedures. In production-grade deployments, such automation is generally based on static thresholds or predefined pattern-matching rules, checked against relevant metrics and logs. Defining and maintaining these rules is cumbersome and, as the infrastructure grows, they need continuous adjustment. To tackle this problem, we propose an intelligent automation system for cloud operations that learns, from what operators have done in the past, what actions should be applied in response to the observed anomalies. The system is designed to operate elastic groups of cloud instances realizing typical (replicated) cloud services. The mechanism is based on a 2-phase machine learning pipeline: a first, lighter model automatically detects anomalous patterns, based on past observations of normal behavior, and activates a second, more involved model that recommends specific corrective actions, based on historical operational data reporting the actions applied to heal faulty components. The approach was validated on an OpenStack deployment, where we deployed both a synthetic application and a multi-node Cassandra NoSQL data-store, and injected different types of anomalies while these systems were exercised using synthetic workloads. For both applications, we obtained high accuracy (mostly above 90%, and above 95% in some cases) on the anomaly detection and corrective action recommendation tasks when applying the models to the respective test sets. This allows us to conclude that the presented mechanism constitutes an efficient and effective technique to help operate cloud services in the presence of a number of faults, although the types and heterogeneity of faulty conditions might be expanded in future evolutions of the framework. The implementation and the material needed to reproduce our results are available under an open-source license.
first_indexed	2024-03-12T00:44:24Z
id	doaj.art-a2076add282e41f5b4f45e35934f3474
institution	Directory Open Access Journal
issn	2169-3536
last_indexed	2024-03-12T00:44:24Z
spelling	doaj.art-a2076add282e41f5b4f45e35934f3474; 2023-09-14T23:00:36Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; vol. 11, pp. 96841-96853; DOI 10.1109/ACCESS.2023.3312218; document 10239346
	Title: A 2-Phase Strategy for Intelligent Cloud Operations
	Authors: Giacomo Lanciano [0] (https://orcid.org/0000-0002-7431-8041); Remo Andreoli [1] (https://orcid.org/0000-0002-3268-4289); Tommaso Cucinotta [2] (https://orcid.org/0000-0002-0362-0657); Davide Bacciu [3] (https://orcid.org/0000-0001-5213-2468); Andrea Passarella [4] (https://orcid.org/0000-0002-1694-612X)
	Affiliations (in the order listed in the record): Faculty of Sciences, Scuola Normale Superiore, Pisa, Italy; Real-Time Systems Laboratory (RETIS), TECIP Institute, Scuola Superiore Sant’Anna, Pisa, Italy; Real-Time Systems Laboratory (RETIS), TECIP Institute, Scuola Superiore Sant’Anna, Pisa, Italy; Department of Computer Science, University of Pisa, Pisa, Italy; Institute of Informatics and Telematics, National Research Council of Italy, Pisa, Italy
	Online access: https://ieeexplore.ieee.org/document/10239346/
	Subjects: Cloud operations; fault management; machine learning; monitoring; OpenStack

Similar Items