HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving

To accelerate machine-learning (ML) inference, model-serving clusters rely on expensive hardware accelerators (e.g., GPUs) to reduce execution time. Advanced inference serving systems are needed to satisfy latency service-level objectives (SLOs) in a cost-effective manner. Autoscaling mechanisms that greedily minimize the number of service instances while ensuring SLO compliance help, but we find that minimizing instance count alone is not adequate to guarantee cost effectiveness across heterogeneous GPU hardware, nor does it maximize resource utilization. In this paper, we propose HetSev, which addresses these challenges through heterogeneity-aware autoscaling and resource-efficient scheduling. We develop an autoscaling mechanism that accounts for both SLO compliance and GPU heterogeneity, provisioning the appropriate type and number of instances to guarantee cost effectiveness. We leverage multi-tenant inference to improve GPU resource utilization, while alleviating inter-tenant interference by avoiding the co-location of identical ML instances on the same GPU during placement decisions. HetSev is integrated into Kubernetes and deployed onto a heterogeneous GPU cluster. We evaluated its performance using several representative ML models; compared with default Kubernetes, HetSev reduces resource cost by up to 2.15× while meeting SLO requirements.
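The autoscaling idea described in the abstract, choosing both the type and the number of GPU instances so that the SLO is met at minimum cost, can be illustrated with a short sketch. Everything below (the GpuProfile fields, the sample throughput and price numbers, and the greedy minimum-cost rule) is a hypothetical illustration under assumed per-GPU-type profiles, not the paper's actual mechanism.

```python
# A minimal sketch of heterogeneity-aware provisioning, assuming each GPU type
# has been profiled for SLO-compliant throughput and hourly price. The profile
# numbers and the minimum-cost rule are illustrative, not taken from HetSev.
import math
from dataclasses import dataclass

@dataclass
class GpuProfile:
    name: str
    throughput_rps: float  # max request rate one instance sustains within the SLO
    price_per_hour: float  # cost of one instance of this GPU type

def provision(load_rps: float, profiles: list[GpuProfile]) -> tuple[GpuProfile, int]:
    """Pick the (GPU type, instance count) pair that covers the offered
    load at minimum total cost while staying SLO-compliant."""
    best = None
    for p in profiles:
        count = max(1, math.ceil(load_rps / p.throughput_rps))
        cost = count * p.price_per_hour
        if best is None or cost < best[2]:
            best = (p, count, cost)
    return best[0], best[1]

# Hypothetical profiles: a cheaper GPU can be the cost-effective choice even
# though it needs more instances to cover the same load.
profiles = [
    GpuProfile("V100", throughput_rps=220.0, price_per_hour=3.06),
    GpuProfile("T4", throughput_rps=95.0, price_per_hour=0.95),
]
gpu, n = provision(load_rps=500.0, profiles=profiles)
print(f"provision {n} x {gpu.name}")  # -> 6 x T4 ($5.70/h) beats 3 x V100 ($9.18/h)
```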

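The scheduling policy, multi-tenant GPUs with no two identical model instances co-located, can likewise be sketched as a first-fit packer. The slot-based capacity model, the first-fit strategy, and the model names are assumptions for illustration only, not HetSev's actual scheduler.

```python
# A minimal sketch of the placement constraint from the abstract: co-locate
# instances of *different* models on one GPU for utilization, but never place
# two instances of the same model together, to limit inter-tenant interference.

def place(instances: list[str], num_gpus: int, slots_per_gpu: int = 2) -> list[list[str]]:
    """First-fit placement: an instance goes on the first GPU that has a
    free slot and does not already host the same model."""
    gpus: list[list[str]] = [[] for _ in range(num_gpus)]
    for model in instances:
        for gpu in gpus:
            if len(gpu) < slots_per_gpu and model not in gpu:
                gpu.append(model)
                break
        else:
            raise RuntimeError(f"no interference-safe slot for {model}")
    return gpus

# Two instances each of two (hypothetical) models pack pairwise onto two GPUs,
# each GPU hosting one instance of each model.
print(place(["resnet", "resnet", "bert", "bert"], num_gpus=2))
# -> [['resnet', 'bert'], ['resnet', 'bert']]
```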
Bibliographic Details
Main Authors: Hao Mo, Ligu Zhu, Lei Shi, Songfu Tan, Suping Wang
Affiliation: State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
Format: Article
Language: English
Published: MDPI AG, 2023-01-01
Series: Electronics, Vol. 12, Iss. 1, Article 240
DOI: 10.3390/electronics12010240
ISSN: 2079-9292
Subjects: inference serving; autoscaling; cost effectiveness; multi-tenant inference
Online Access: https://www.mdpi.com/2079-9292/12/1/240