HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving

To accelerate machine-learning (ML) inference, model-serving clusters rely on expensive hardware accelerators (e.g., GPUs) to reduce execution time. Advanced inference serving systems are needed to satisfy latency service-level objectives (SLOs) in a cost-effective manner. Autoscaling mechanisms that greedily minimize the number of service instances while ensuring SLO compliance help, but we find that minimizing instance count alone is not adequate to guarantee cost effectiveness across heterogeneous GPU hardware, nor does it maximize resource utilization. In this paper, we propose HetSev, which addresses these challenges through heterogeneity-aware autoscaling and resource-efficient scheduling. We develop an autoscaling mechanism that accounts for both SLO compliance and GPU heterogeneity, provisioning the appropriate type and number of instances to guarantee cost effectiveness. We leverage multi-tenant inference to improve GPU resource utilization, while alleviating inter-tenant interference by avoiding the co-location of identical ML instances on the same GPU during placement decisions. HetSev is integrated into Kubernetes and deployed onto a heterogeneous GPU cluster. We evaluated its performance using several representative ML models; compared with default Kubernetes, HetSev reduces resource cost by up to 2.15× while meeting SLO requirements.
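The autoscaling idea described in the abstract, choosing both the type and the number of GPU instances so that the SLO is met at minimum cost, can be illustrated with a short sketch. Everything below (the GpuProfile fields, the sample throughput and price numbers, and the greedy minimum-cost rule) is a hypothetical illustration under assumed per-GPU-type profiles, not the paper's actual mechanism.

```python
# A minimal sketch of heterogeneity-aware provisioning, assuming each GPU type
# has been profiled for SLO-compliant throughput and hourly price. The profile
# numbers and the minimum-cost rule are illustrative, not taken from HetSev.
import math
from dataclasses import dataclass

@dataclass
class GpuProfile:
    name: str
    throughput_rps: float  # max request rate one instance sustains within the SLO
    price_per_hour: float  # cost of one instance of this GPU type

def provision(load_rps: float, profiles: list[GpuProfile]) -> tuple[GpuProfile, int]:
    """Pick the (GPU type, instance count) pair that covers the offered
    load at minimum total cost while staying SLO-compliant."""
    best = None
    for p in profiles:
        count = max(1, math.ceil(load_rps / p.throughput_rps))
        cost = count * p.price_per_hour
        if best is None or cost < best[2]:
            best = (p, count, cost)
    return best[0], best[1]

# Hypothetical profiles: a cheaper GPU can be the cost-effective choice even
# though it needs more instances to cover the same load.
profiles = [
    GpuProfile("V100", throughput_rps=220.0, price_per_hour=3.06),
    GpuProfile("T4", throughput_rps=95.0, price_per_hour=0.95),
]
gpu, n = provision(load_rps=500.0, profiles=profiles)
print(f"provision {n} x {gpu.name}")  # -> 6 x T4 ($5.70/h) beats 3 x V100 ($9.18/h)
```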

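The scheduling policy, multi-tenant GPUs with no two identical model instances co-located, can likewise be sketched as a first-fit packer. The slot-based capacity model, the first-fit strategy, and the model names are assumptions for illustration only, not HetSev's actual scheduler.

```python
# A minimal sketch of the placement constraint from the abstract: co-locate
# instances of *different* models on one GPU for utilization, but never place
# two instances of the same model together, to limit inter-tenant interference.

def place(instances: list[str], num_gpus: int, slots_per_gpu: int = 2) -> list[list[str]]:
    """First-fit placement: an instance goes on the first GPU that has a
    free slot and does not already host the same model."""
    gpus: list[list[str]] = [[] for _ in range(num_gpus)]
    for model in instances:
        for gpu in gpus:
            if len(gpu) < slots_per_gpu and model not in gpu:
                gpu.append(model)
                break
        else:
            raise RuntimeError(f"no interference-safe slot for {model}")
    return gpus

# Two instances each of two (hypothetical) models pack pairwise onto two GPUs,
# each GPU hosting one instance of each model.
print(place(["resnet", "resnet", "bert", "bert"], num_gpus=2))
# -> [['resnet', 'bert'], ['resnet', 'bert']]
```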
Bibliographic Details
Main Authors: Hao Mo, Ligu Zhu, Lei Shi, Songfu Tan, Suping Wang
Affiliation: State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China
Format: Article
Language: English
Published: MDPI AG, 2023-01-01
Series: Electronics, Vol. 12, Iss. 1, Article 240
DOI: 10.3390/electronics12010240
ISSN: 2079-9292
Subjects: inference serving; autoscaling; cost effectiveness; multi-tenant inference
Online Access: https://www.mdpi.com/2079-9292/12/1/240