Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster


Bibliographic Details
Main Authors: Panissara Thanapol, Kittichai Lavangnananda, Franck Leprévost, Arnaud Glad, Julien Schleich, Pascal Bouvry
Format: Article
Language: English
Published: MDPI AG, 2024-03-01
Series: Applied Sciences
Subjects: deep learning; deep learning training; distributed training; GPU cluster; job packing; round-based mechanism
Online Access: https://www.mdpi.com/2076-3417/14/6/2349
collection DOAJ
description Graphics Processing Units (GPUs) are employed for their parallel processing capabilities, which are essential to train deep learning (DL) models with large datasets within a reasonable time. However, the diverse GPU architectures exhibit variability in training performance depending on DL models. Furthermore, factors such as the number of GPUs for distributed training and batch size significantly impact training efficiency. Addressing the variability in training performance and accounting for these influential factors are critical for optimising resource usage. This paper presents a scheduling policy for DL training tasks in a heterogeneous GPU cluster. It builds upon a model-similarity-based scheduling policy by implementing a round-based mechanism and job packing. The round-based mechanism allows the scheduler to adjust its scheduling decisions periodically, whereas job packing optimises GPU utilisation by fitting additional jobs into a GPU that trains a small model. Results show that implementing a round-based mechanism reduces the makespan by approximately 29%, compared to the scenario without it. Additionally, integrating job packing further decreases the makespan by 5%.
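The mechanism summarised in the abstract can be illustrated with a toy sketch: jobs are placed at round boundaries, a GPU training a small model can be packed with an additional job that fits in its remaining memory, and finished jobs free capacity for the next round. This is an illustrative assumption-laden sketch only; the job, GPU, and capacity names are invented here, and the paper's actual model-similarity-based placement policy is not reproduced.

```python
# Toy round-based scheduler with job packing (illustrative sketch only;
# all names and capacities are assumptions, not the paper's implementation).
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    mem_gb: int            # GPU memory the training job needs
    remaining_rounds: int  # rounds of training left

@dataclass
class GPU:
    name: str
    mem_gb: int
    jobs: list = field(default_factory=list)

    def free_mem(self):
        return self.mem_gb - sum(j.mem_gb for j in self.jobs)

def run_round(pending, gpus):
    """One scheduling round: place jobs, train for one round, reclaim capacity."""
    # Job packing: a job may share a GPU that still has enough free memory,
    # e.g. one already training a small model.
    for job in list(pending):
        gpu = next((g for g in gpus if g.free_mem() >= job.mem_gb), None)
        if gpu is not None:
            gpu.jobs.append(job)
            pending.remove(job)
    # Simulate one round of training; finished jobs release their share, so
    # the scheduler can revise its decisions at the next round boundary.
    for gpu in gpus:
        for job in gpu.jobs:
            job.remaining_rounds -= 1
        gpu.jobs = [j for j in gpu.jobs if j.remaining_rounds > 0]

pending = [Job("resnet", 10, 2), Job("small-cnn", 4, 1), Job("lstm", 4, 3)]
gpus = [GPU("V100", 16)]
rounds = 0
while pending or any(g.jobs for g in gpus):
    run_round(pending, gpus)
    rounds += 1
print(rounds)  # small-cnn packs alongside resnet in round 1
```

In round 1 the 10 GB and 4 GB jobs share the 16 GB GPU, which is the packing effect: without it, the small job would wait for an exclusive GPU and the makespan would grow.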
id doaj.art-e95972cfd4e84cb2a756934dca31a82c
institution Directory Open Access Journal
issn 2076-3417
doi 10.3390/app14062349
citation Applied Sciences 14(6):2349, 2024-03-01
affiliation Department of Computer Science, University of Luxembourg, 4365 Luxembourg, Luxembourg (all six authors)
topic deep learning
deep learning training
distributed training
GPU cluster
job packing
round-based mechanism