Improvements to Supercomputing Service Availability Based on Data Analysis

Bibliographic Details
Main Authors: Jae-Kook Lee, Min-Woo Kwon, Do-Sik An, Junweon Yoon, Taeyoung Hong, Joon Woo, Sung-Jun Kim, Guohua Li
Format: Article
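Language: English
Published: MDPI AG 2021-07-01
Series: Applied Sciences
Subjects: high-performance computing; supercomputing service; data analysis; service availability; resource scheduler; resource utilization
Online Access: https://www.mdpi.com/2076-3417/11/13/6166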
author Jae-Kook Lee
Min-Woo Kwon
Do-Sik An
Junweon Yoon
Taeyoung Hong
Joon Woo
Sung-Jun Kim
Guohua Li
collection DOAJ
description As the demand for high-performance computing (HPC) resources has increased in the field of computational science, an inevitable consideration is service availability in large cluster systems such as supercomputers. In particular, the factor that most affects availability in supercomputing services is the job scheduler utilized for allocating resources. Consequent to submitting user data through the job scheduler for data analysis, 25.6% of jobs failed because of program errors, scheduler errors, or I/O errors. Based on this analysis, we propose a K-hook method for scheduling to increase the success rate of job submissions and improve the availability of supercomputing services. By applying this method, the job-submission success rate was improved by 15% without negatively affecting users’ waiting time. We also achieved a mean time between interrupts (MTBI) of 24.3 days and maintained average system availability at 97%. As this research was verified on the Nurion supercomputer in a real service environment, the value of the research is expected to be found in significant service improvements.
first_indexed 2024-03-10T09:53:04Z
format Article
id doaj.art-a3fe5ad3b6be4a309b8d7b2d77a7cf3f
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-10T09:53:04Z
publishDate 2021-07-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-a3fe5ad3b6be4a309b8d7b2d77a7cf3f; 2023-11-22T02:34:34Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2021-07-01; vol. 11, no. 13, article 6166; DOI 10.3390/app11136166
Author affiliation (all eight authors): National Supercomputing Center, Korea Institute of Science and Technology Information, 245 Daehak-ro, Yuseong-gu, Daejeon 34141, Korea
title Improvements to Supercomputing Service Availability Based on Data Analysis
topic high-performance computing
supercomputing service
data analysis
service availability
resource scheduler
resource utilization
url https://www.mdpi.com/2076-3417/11/13/6166