Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster
Graphics Processing Units (GPUs) are employed for their parallel processing capabilities, which are essential to train deep learning (DL) models with large datasets within a reasonable time. However, the diverse GPU architectures exhibit variability in training performance depending on DL models. Furthermore, factors such as the number of GPUs for distributed training and batch size significantly impact training efficiency. Addressing the variability in training performance and accounting for these influential factors are critical for optimising resource usage. This paper presents a scheduling policy for DL training tasks in a heterogeneous GPU cluster. It builds upon a model-similarity-based scheduling policy by implementing a round-based mechanism and job packing. The round-based mechanism allows the scheduler to adjust its scheduling decisions periodically, whereas job packing optimises GPU utilisation by fitting additional jobs into a GPU that trains a small model. Results show that implementing a round-based mechanism reduces the makespan by approximately 29%, compared to the scenario without it. Additionally, integrating job packing further decreases the makespan by 5%.
Main Authors: | Panissara Thanapol; Kittichai Lavangnananda; Franck Leprévost; Arnaud Glad; Julien Schleich; Pascal Bouvry |
Format: | Article |
Language: | English |
Published: | MDPI AG, 2024-03-01 |
Series: | Applied Sciences |
Subjects: | deep learning; deep learning training; distributed training; GPU cluster; job packing; round-based mechanism |
Online Access: | https://www.mdpi.com/2076-3417/14/6/2349 |
_version_ | 1797242244858642432 |
author | Panissara Thanapol; Kittichai Lavangnananda; Franck Leprévost; Arnaud Glad; Julien Schleich; Pascal Bouvry |
author_facet | Panissara Thanapol; Kittichai Lavangnananda; Franck Leprévost; Arnaud Glad; Julien Schleich; Pascal Bouvry |
author_sort | Panissara Thanapol |
collection | DOAJ |
description | Graphics Processing Units (GPUs) are employed for their parallel processing capabilities, which are essential to train deep learning (DL) models with large datasets within a reasonable time. However, the diverse GPU architectures exhibit variability in training performance depending on DL models. Furthermore, factors such as the number of GPUs for distributed training and batch size significantly impact training efficiency. Addressing the variability in training performance and accounting for these influential factors are critical for optimising resource usage. This paper presents a scheduling policy for DL training tasks in a heterogeneous GPU cluster. It builds upon a model-similarity-based scheduling policy by implementing a round-based mechanism and job packing. The round-based mechanism allows the scheduler to adjust its scheduling decisions periodically, whereas job packing optimises GPU utilisation by fitting additional jobs into a GPU that trains a small model. Results show that implementing a round-based mechanism reduces the makespan by approximately 29%, compared to the scenario without it. Additionally, integrating job packing further decreases the makespan by 5%. |
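The abstract above describes a scheduler that revisits its decisions in rounds and packs extra jobs onto a GPU that is training a small model. A toy sketch of that idea is given below. Everything in it is an assumption for illustration: the `Job`/`GPU` fields, the memory-based packing check, and the `similarity` function are hypothetical stand-ins, not the paper's actual model-similarity policy or experimental setup.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    model_mem_gb: float          # hypothetical memory footprint of the model

@dataclass
class GPU:
    name: str
    mem_gb: float
    jobs: list = field(default_factory=list)

    def free_mem(self):
        # memory left after the jobs already packed onto this GPU
        return self.mem_gb - sum(j.model_mem_gb for j in self.jobs)

def schedule_round(queue, gpus, similarity):
    """One scheduling round: place each waiting job on the most 'similar' GPU
    that still has room for it; a GPU training a small model can thus be
    packed with additional jobs. Returns the jobs still waiting."""
    for job in sorted(queue, key=lambda j: -j.model_mem_gb):
        fits = [g for g in gpus if g.free_mem() >= job.model_mem_gb]
        if fits:
            max(fits, key=lambda g: similarity(job, g)).jobs.append(job)
    placed = {j.name for g in gpus for j in g.jobs}
    return [j for j in queue if j.name not in placed]

# Toy similarity: prefer a GPU whose capacity roughly matches the model size.
sim = lambda j, g: -abs(g.mem_gb - 2 * j.model_mem_gb)

gpus = [GPU("gpu-large", 16.0), GPU("gpu-small", 8.0)]
queue = [Job("resnet", 10.0), Job("lstm", 4.0), Job("mlp", 2.0)]
waiting = schedule_round(queue, gpus, sim)
# "resnet" lands on gpu-large; "lstm" and "mlp" are packed together on gpu-small.
```

In a real round-based scheduler this placement would be re-run at the start of every round as jobs finish or arrive, which is the periodic revision of decisions that the paper attributes to the round-based mechanism.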
first_indexed | 2024-04-24T18:36:09Z |
format | Article |
id | doaj.art-e95972cfd4e84cb2a756934dca31a82c |
institution | Directory Open Access Journal |
issn | 2076-3417 |
language | English |
last_indexed | 2024-04-24T18:36:09Z |
publishDate | 2024-03-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-e95972cfd4e84cb2a756934dca31a82c 2024-03-27T13:19:24Z eng MDPI AG Applied Sciences 2076-3417 2024-03-01 14(6):2349 10.3390/app14062349 Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster Panissara Thanapol; Kittichai Lavangnananda; Franck Leprévost; Arnaud Glad; Julien Schleich; Pascal Bouvry (all: Department of Computer Science, University of Luxembourg, 4365 Luxembourg, Luxembourg) Graphics Processing Units (GPUs) are employed for their parallel processing capabilities, which are essential to train deep learning (DL) models with large datasets within a reasonable time. However, the diverse GPU architectures exhibit variability in training performance depending on DL models. Furthermore, factors such as the number of GPUs for distributed training and batch size significantly impact training efficiency. Addressing the variability in training performance and accounting for these influential factors are critical for optimising resource usage. This paper presents a scheduling policy for DL training tasks in a heterogeneous GPU cluster. It builds upon a model-similarity-based scheduling policy by implementing a round-based mechanism and job packing. The round-based mechanism allows the scheduler to adjust its scheduling decisions periodically, whereas job packing optimises GPU utilisation by fitting additional jobs into a GPU that trains a small model. Results show that implementing a round-based mechanism reduces the makespan by approximately 29%, compared to the scenario without it. Additionally, integrating job packing further decreases the makespan by 5%. https://www.mdpi.com/2076-3417/14/6/2349 deep learning; deep learning training; distributed training; GPU cluster; job packing; round-based mechanism |
spellingShingle | Panissara Thanapol; Kittichai Lavangnananda; Franck Leprévost; Arnaud Glad; Julien Schleich; Pascal Bouvry; Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster; Applied Sciences; deep learning; deep learning training; distributed training; GPU cluster; job packing; round-based mechanism |
title | Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster |
title_full | Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster |
title_fullStr | Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster |
title_full_unstemmed | Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster |
title_short | Round-Based Mechanism and Job Packing with Model-Similarity-Based Policy for Scheduling DL Training in GPU Cluster |
title_sort | round based mechanism and job packing with model similarity based policy for scheduling dl training in gpu cluster |
topic | deep learning; deep learning training; distributed training; GPU cluster; job packing; round-based mechanism |
url | https://www.mdpi.com/2076-3417/14/6/2349 |
work_keys_str_mv | AT panissarathanapol roundbasedmechanismandjobpackingwithmodelsimilaritybasedpolicyforschedulingdltrainingingpucluster AT kittichailavangnananda roundbasedmechanismandjobpackingwithmodelsimilaritybasedpolicyforschedulingdltrainingingpucluster AT franckleprevost roundbasedmechanismandjobpackingwithmodelsimilaritybasedpolicyforschedulingdltrainingingpucluster AT arnaudglad roundbasedmechanismandjobpackingwithmodelsimilaritybasedpolicyforschedulingdltrainingingpucluster AT julienschleich roundbasedmechanismandjobpackingwithmodelsimilaritybasedpolicyforschedulingdltrainingingpucluster AT pascalbouvry roundbasedmechanismandjobpackingwithmodelsimilaritybasedpolicyforschedulingdltrainingingpucluster |