HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving
To accelerate the inference of machine-learning (ML) model serving, clusters of machines require the use of expensive hardware accelerators (e.g., GPUs) to reduce execution time. Advanced inference serving systems are needed to satisfy latency service-level objectives (SLOs) in a cost-effective mann...
Main Authors: | Hao Mo, Ligu Zhu, Lei Shi, Songfu Tan, Suping Wang |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2023-01-01 |
Series: | Electronics |
Subjects: | inference serving, autoscaling, cost effectiveness, multi-tenant inference |
Online Access: | https://www.mdpi.com/2079-9292/12/1/240 |
_version_ | 1797625986010841088 |
---|---|
author | Hao Mo, Ligu Zhu, Lei Shi, Songfu Tan, Suping Wang |
author_facet | Hao Mo, Ligu Zhu, Lei Shi, Songfu Tan, Suping Wang |
author_sort | Hao Mo |
collection | DOAJ |
description | To accelerate the inference of machine-learning (ML) model serving, clusters of machines require the use of expensive hardware accelerators (e.g., GPUs) to reduce execution time. Advanced inference serving systems are needed to satisfy latency service-level objectives (SLOs) in a cost-effective manner. Novel autoscaling mechanisms that greedily minimize the number of service instances while ensuring SLO compliance are helpful. However, we find that it is not adequate to guarantee cost effectiveness across heterogeneous GPU hardware, and this does not maximize resource utilization. In this paper, we propose HetSev to address these challenges by incorporating heterogeneity-aware autoscaling and resource-efficient scheduling to achieve cost effectiveness. We develop an autoscaling mechanism which accounts for SLO compliance and GPU heterogeneity, thus provisioning the appropriate type and number of instances to guarantee cost effectiveness. We leverage multi-tenant inference to improve GPU resource utilization, while alleviating inter-tenant interference by avoiding the co-location of identical ML instances on the same GPU during placement decisions. HetSev is integrated into Kubernetes and deployed onto a heterogeneous GPU cluster. We evaluated the performance of HetSev using several representative ML models. Compared with default Kubernetes, HetSev reduces resource cost by up to 2.15× while meeting SLO requirements. |
first_indexed | 2024-03-11T10:04:09Z |
format | Article |
id | doaj.art-ab10d9a5289a485da306b88211477855 |
institution | Directory Open Access Journal |
issn | 2079-9292 |
language | English |
last_indexed | 2024-03-11T10:04:09Z |
publishDate | 2023-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Electronics |
spelling | doaj.art-ab10d9a5289a485da306b88211477855 · 2023-11-16T15:13:01Z · eng · MDPI AG · Electronics · 2079-9292 · 2023-01-01 · vol. 12, no. 1, art. 240 · 10.3390/electronics12010240 · HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving · Hao Mo, Ligu Zhu, Lei Shi, Songfu Tan, Suping Wang (all: State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China) · To accelerate the inference of machine-learning (ML) model serving, clusters of machines require the use of expensive hardware accelerators (e.g., GPUs) to reduce execution time. Advanced inference serving systems are needed to satisfy latency service-level objectives (SLOs) in a cost-effective manner. Novel autoscaling mechanisms that greedily minimize the number of service instances while ensuring SLO compliance are helpful. However, we find that it is not adequate to guarantee cost effectiveness across heterogeneous GPU hardware, and this does not maximize resource utilization. In this paper, we propose HetSev to address these challenges by incorporating heterogeneity-aware autoscaling and resource-efficient scheduling to achieve cost effectiveness. We develop an autoscaling mechanism which accounts for SLO compliance and GPU heterogeneity, thus provisioning the appropriate type and number of instances to guarantee cost effectiveness. We leverage multi-tenant inference to improve GPU resource utilization, while alleviating inter-tenant interference by avoiding the co-location of identical ML instances on the same GPU during placement decisions. HetSev is integrated into Kubernetes and deployed onto a heterogeneous GPU cluster. We evaluated the performance of HetSev using several representative ML models. Compared with default Kubernetes, HetSev reduces resource cost by up to 2.15× while meeting SLO requirements. · https://www.mdpi.com/2079-9292/12/1/240 · inference serving; autoscaling; cost effectiveness; multi-tenant inference |
spellingShingle | Hao Mo, Ligu Zhu, Lei Shi, Songfu Tan, Suping Wang · HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving · Electronics · inference serving; autoscaling; cost effectiveness; multi-tenant inference |
title | HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving |
title_full | HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving |
title_fullStr | HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving |
title_full_unstemmed | HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving |
title_short | HetSev: Exploiting Heterogeneity-Aware Autoscaling and Resource-Efficient Scheduling for Cost-Effective Machine-Learning Model Serving |
title_sort | hetsev exploiting heterogeneity aware autoscaling and resource efficient scheduling for cost effective machine learning model serving |
topic | inference serving; autoscaling; cost effectiveness; multi-tenant inference |
url | https://www.mdpi.com/2079-9292/12/1/240 |
work_keys_str_mv | AT haomo hetsevexploitingheterogeneityawareautoscalingandresourceefficientschedulingforcosteffectivemachinelearningmodelserving AT liguzhu hetsevexploitingheterogeneityawareautoscalingandresourceefficientschedulingforcosteffectivemachinelearningmodelserving AT leishi hetsevexploitingheterogeneityawareautoscalingandresourceefficientschedulingforcosteffectivemachinelearningmodelserving AT songfutan hetsevexploitingheterogeneityawareautoscalingandresourceefficientschedulingforcosteffectivemachinelearningmodelserving AT supingwang hetsevexploitingheterogeneityawareautoscalingandresourceefficientschedulingforcosteffectivemachinelearningmodelserving |
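
The two ideas named in the description (choosing a GPU type and instance count that keeps cost low under the latency SLO, and not co-locating identical ML instances on one GPU) can be illustrated with a minimal sketch. This is not HetSev's algorithm, which the record does not specify. Everything below is an assumption made for illustration: the `GpuType` fields, the per-instance throughput and price figures, the single-type provisioning plan, the slot limit per GPU, and the model names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuType:
    name: str
    hourly_cost: float     # hypothetical price of one instance per hour
    slo_throughput: float  # hypothetical requests/s one instance serves within the SLO


def cheapest_plan(gpu_types, request_rate):
    """Return (gpu_type, instance_count, total_cost) with the lowest total cost.

    Simplification: the plan uses a single GPU type; mixed plans are not explored.
    """
    best = None
    for gpu in gpu_types:
        if gpu.slo_throughput <= 0:
            continue  # this type cannot meet the SLO at any scale
        count = max(1, math.ceil(request_rate / gpu.slo_throughput))
        cost = count * gpu.hourly_cost
        if best is None or cost < best[2]:
            best = (gpu, count, cost)
    return best


def place(instances, gpu_ids, slots_per_gpu):
    """Assign each model instance to a GPU that does not already host that model."""
    hosted = {gpu_id: [] for gpu_id in gpu_ids}
    placement = []
    for model in instances:
        candidates = [
            gpu_id for gpu_id in gpu_ids
            if model not in hosted[gpu_id] and len(hosted[gpu_id]) < slots_per_gpu
        ]
        if not candidates:
            raise RuntimeError(f"no GPU available for {model}")
        # Share GPUs across different models: prefer the fullest eligible GPU
        # (ties go to the first GPU in gpu_ids).
        target = max(candidates, key=lambda gpu_id: len(hosted[gpu_id]))
        hosted[target].append(model)
        placement.append((model, target))
    return placement


if __name__ == "__main__":
    types = [GpuType("small", 1.0, 100.0), GpuType("large", 3.0, 400.0)]
    gpu, count, cost = cheapest_plan(types, request_rate=350.0)
    print(gpu.name, count, cost)
    print(place(["resnet", "resnet", "bert"], ["gpu0", "gpu1"], slots_per_gpu=2))
```

With these made-up numbers, a load of 350 requests/s is cheaper on one "large" instance than on four "small" ones, while a load of 150 requests/s is cheaper on two "small" instances, which is the kind of trade-off a heterogeneity-aware autoscaler has to weigh. The placement helper puts the two instances of the same model on different GPUs and lets a different model share a GPU with one of them.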