Zeus: interpretable ML-based job scheduling in GPU datacentres

Hardware accelerators such as GPUs are essential for the development of Deep Learning (DL) models, as their training process is compute-intensive. A growing number of organisations have employed expensive multi-tenant GPU clusters to run distributed DL training jobs. Efficient job schedulers are re...

Bibliographic Details
Main Author:	Amrita, Ravishankar
Other Authors:	Zhang Tianwei
Format:	Final Year Project (FYP)
Language:	English
Published:	Nanyang Technological University 2022
Subjects:	Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Simulation and modeling
Online Access:	https://hdl.handle.net/10356/156566

author	Amrita, Ravishankar
author2	Zhang Tianwei
collection	NTU
description	Hardware accelerators such as GPUs are essential for the development of Deep Learning (DL) models, as their training process is compute-intensive. A growing number of organisations have employed expensive multi-tenant GPU clusters to run distributed DL training jobs. Efficient job schedulers are required to maximise GPU cluster utilisation and to minimise job completion time and operating cost. In this study, we develop Zeus, an interpretable, ML-based, non-intrusive job scheduler that ensures resource fairness and thereby provides a better user experience. Zeus addresses concerns about the unreliability of black-box Machine Learning (ML) models by being 100% interpretable, which avoids the related deployment concerns in practical scenarios. The interpretability of our model helps reveal dependencies between a training job’s details and its expected duration, along with associated trends. Further, our scheduler does not require users to modify their source code or the underlying DL framework, which makes it completely non-intrusive and consequently more practical. Finally, we use a GPU datacentre simulator to analyse the efficiency of our scheduler in terms of two metrics: (1) Average Job Completion Time and (2) Average Queueing Time.
first_indexed	2024-10-01T07:10:34Z
format	Final Year Project (FYP)
id	ntu-10356/156566
institution	Nanyang Technological University
language	English
last_indexed	2024-10-01T07:10:34Z
publishDate	2022
publisher	Nanyang Technological University
record_format	dspace
spelling	ntu-10356/156566 2022-04-20T07:13:17Z Zeus: interpretable ML-based job scheduling in GPU datacentres Amrita, Ravishankar Zhang Tianwei School of Computer Science and Engineering tianwei.zhang@ntu.edu.sg Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Simulation and modeling Bachelor of Engineering (Computer Science) 2022 Final Year Project (FYP) Amrita, R. (2022). Zeus: interpretable ML-based job scheduling in GPU datacentres. Final Year Project (FYP), Nanyang Technological University, Singapore. https://hdl.handle.net/10356/156566 en application/pdf Nanyang Technological University
title	Zeus: interpretable ML-based job scheduling in GPU datacentres
topic	Engineering::Computer science and engineering::Computing methodologies::Artificial intelligence; Engineering::Computer science and engineering::Computing methodologies::Simulation and modeling
url	https://hdl.handle.net/10356/156566

Similar Items