RLSchert: An HPC Job Scheduler Using Deep Reinforcement Learning and Remaining Time Prediction


Bibliographic Details
Main Authors: Qiqi Wang, Hongjie Zhang, Cheng Qu, Yu Shen, Xiaohui Liu, Jing Li
Format: Article
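Language: English
Published: MDPI AG 2021-10-01
Series: Applied Sciences
Subjects: high-performance computing, RLSchert, scheduling, deep reinforcement learning, remaining runtime prediction
Online Access: https://www.mdpi.com/2076-3417/11/20/9448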
author Qiqi Wang
Hongjie Zhang
Cheng Qu
Yu Shen
Xiaohui Liu
Jing Li
collection DOAJ
description The job scheduler plays a vital role in high-performance computing platforms. It determines the execution order of the jobs and the allocation of resources, which in turn affect the resource utilization of the entire system. As the scale and complexity of HPC continue to grow, job scheduling is becoming increasingly important and difficult. Existing studies relied on user-specified or regression techniques to give fixed runtime prediction values and used the values in static heuristic scheduling algorithms. However, these approaches require very accurate runtime predictions to produce better results, and fixed heuristic scheduling strategies cannot adapt to changes in the workload. In this work, we propose RLSchert, a job scheduler based on deep reinforcement learning and remaining runtime prediction. Firstly, RLSchert estimates the state of the system by using a dynamic job remaining runtime predictor, thereby providing an accurate spatiotemporal view of the cluster status. Secondly, RLSchert learns the optimal policy to select or kill jobs according to the status through imitation learning and the proximal policy optimization algorithm. Extensive experiments on real-world job logs at the USTC Supercomputing Center showed that RLSchert is superior to static heuristic policies and outperforms the learning-based scheduler DeepRM. In addition, the dynamic predictor gives a more accurate remaining runtime prediction result, which is essential for most learning-based schedulers.
first_indexed 2024-03-10T06:45:50Z
format Article
id doaj.art-ee4f9ee1b3e944d487842a21c634c9e6
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T06:45:50Z
publishDate 2021-10-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-ee4f9ee1b3e944d487842a21c634c9e6
2023-11-22T17:18:47Z
eng
MDPI AG
Applied Sciences
2076-3417
2021-10-01
Volume 11, Issue 20, Article 9448
10.3390/app11209448
Qiqi Wang (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China)
Hongjie Zhang (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China)
Cheng Qu (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China)
Yu Shen (Supercomputing Center, University of Science and Technology of China, Hefei 230026, China)
Xiaohui Liu (Supercomputing Center, University of Science and Technology of China, Hefei 230026, China)
Jing Li (School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China)
title RLSchert: An HPC Job Scheduler Using Deep Reinforcement Learning and Remaining Time Prediction
topic high-performance computing
RLSchert
scheduling
deep reinforcement learning
remaining runtime prediction
url https://www.mdpi.com/2076-3417/11/20/9448