Classifying jobs towards power-aware HPC system operation through long-term log analysis

Bibliographic Details
Main Authors: Yuichi Tsujita, Atsuya Uno, Ryuichi Sekizawa, Keiji Yamamoto, Fumichika Sueyasu
Format: Article
Language: English
Published: Elsevier 2022-09-01
Series: Array
Online Access: http://www.sciencedirect.com/science/article/pii/S2590005622000376
Description
Summary: The efficient utilization of high-performance computing (HPC) system resources under rigorous electric power budget or I/O workload constraints is among the most important goals set by system operators to deal with the demanding requirements of application users. In most HPC jobs, the utilization of CPU and memory devices, which is tightly linked to electric power consumption, is a counterpart metric to I/O activity. Towards higher utilization of HPC systems under strict electric power consumption and I/O activity management constraints, we must be careful to prevent hot-spots from developing in power consumption or I/O operations that could lead to unstable system operations by exceeding electric power supply or I/O subsystem capabilities. One feasible solution is to arrange compute node assignments so that such hot-spots do not arise in electric power or I/O operations. To address this issue, we analyzed vast amounts of log data collected from the K computer and found strong positive correlations between CPU and memory device utilization rates and electric power consumption levels. On the other hand, we also observed strong negative correlations and reduced electric power consumption in relation to file I/O activities in a specific compute node-layout, thereby indicating unique characteristics in some I/O-intensive HPC jobs in the node-layout. Our investigation revealed that HPC jobs could be divided into two groups when classified in terms of required electric power: jobs consuming high electric power levels and I/O-intensive jobs with reduced electric power levels. We then achieved high levels of accuracy when classifying jobs in terms of electric power levels using RandomForestClassifier, one of the multiple machine learning classification models provided by scikit-learn. This classification can help prevent hot-spots in electric power consumption when assigning compute nodes during job scheduling. Thus, we demonstrated efficient job classifications towards power-aware system operations in the supercomputer Fugaku, which is the successor to the K computer.
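The workflow summarized above (correlating utilization metrics with power, then classifying jobs by power level with RandomForestClassifier) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the data are synthetic and generated on the spot, the feature names (cpu, mem, io), the linear power model, its coefficients, and the median threshold used to label "high power" jobs are all hypothetical, and nothing here comes from the K computer or Fugaku logs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def make_synthetic_jobs(n_jobs=1000, seed=0):
    """Generate made-up per-job metrics. ASSUMPTION: not real log data."""
    rng = np.random.default_rng(seed)
    cpu = rng.uniform(0.0, 1.0, n_jobs)  # hypothetical CPU utilization rate
    mem = rng.uniform(0.0, 1.0, n_jobs)  # hypothetical memory utilization rate
    io = rng.uniform(0.0, 1.0, n_jobs)   # hypothetical normalized file I/O activity
    # ASSUMPTION: toy linear model in which power rises with CPU and memory
    # utilization and falls with I/O activity, plus Gaussian noise.
    power = 60.0 + 40.0 * cpu + 20.0 * mem - 15.0 * io + rng.normal(0.0, 2.0, n_jobs)
    features = np.column_stack([cpu, mem, io])
    return features, power


features, power = make_synthetic_jobs()

# Pearson correlation of each metric with power consumption.
corr_cpu = np.corrcoef(features[:, 0], power)[0, 1]
corr_io = np.corrcoef(features[:, 2], power)[0, 1]

# ASSUMPTION: label a job "high power" (1) if it exceeds the median power.
labels = (power > np.median(power)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=0, stratify=labels
)

# Train a random forest on the utilization metrics and score held-out jobs.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

print(f"corr(cpu, power) = {corr_cpu:.2f}")
print(f"corr(io, power)  = {corr_io:.2f}")
print(f"held-out accuracy = {accuracy:.2f}")
```

In a scheduler, the predicted label could in principle be used to avoid placing many high-power jobs on nodes that share a power supply; how the authors actually apply the classification is not described in this record.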
ISSN: 2590-0056