A container-based lightweight fault tolerance framework for high performance computing workloads

Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Engineering, 2019

Bibliographic Details
Main Author:	Sindi, Mohamad (Mohamad Othman)
Other Authors:	John R. Williams.
Format:	Thesis
Language:	eng
Published:	Massachusetts Institute of Technology 2020
Subjects:	Civil and Environmental Engineering.
Online Access:	https://hdl.handle.net/1721.1/124188

author	Sindi, Mohamad (Mohamad Othman)
author2	John R. Williams.
author_facet	John R. Williams. Sindi, Mohamad (Mohamad Othman)
author_sort	Sindi, Mohamad (Mohamad Othman)
collection	MIT
description	Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Engineering, 2019
first_indexed	2024-09-23T12:44:20Z
format	Thesis
id	mit-1721.1/124188
institution	Massachusetts Institute of Technology
language	eng
last_indexed	2024-09-23T12:44:20Z
publishDate	2020
publisher	Massachusetts Institute of Technology
record_format	dspace
spelling	mit-1721.1/124188 2020-03-24T03:17:01Z A container-based lightweight fault tolerance framework for high performance computing workloads Sindi, Mohamad (Mohamad Othman) John R. Williams. Massachusetts Institute of Technology. Department of Civil and Environmental Engineering. Civil and Environmental Engineering. Thesis: Ph. D., Massachusetts Institute of Technology, Department of Civil and Environmental Engineering, 2019 Cataloged from PDF version of thesis. Includes bibliographical references (pages 122-130). According to the latest TOP500 list of the world's supercomputers, ~90% of the top High Performance Computing (HPC) systems are based on commodity hardware clusters, which are typically designed for performance rather than reliability. The Mean Time Between Failures (MTBF) for some current petascale systems has been reported to be several days, while studies estimate it may be less than 60 minutes for future exascale systems. One of the largest studies on HPC system failures showed that more than 50% of failures were due to hardware, and that failure rates grew with system size. Hence, running extended workloads on such systems is becoming more challenging as system sizes grow. In this work, we design and implement a lightweight fault tolerance framework to improve the sustainability of running workloads on HPC clusters. The framework mainly includes a fault prediction component and a remedy component. The fault prediction component is implemented using a parallel algorithm that proactively predicts hardware issues with no overhead. This allows remedial actions to be taken before failures impact workloads. The algorithm uses machine learning applied to supercomputer system logs. We test it on actual system logs from Sandia National Laboratories (SNL). The logs come from three supercomputers and comprise ~750 million entries (~86 GB of data). The algorithm is also tested online on our test cluster. We demonstrate the algorithm's high accuracy and performance in predicting cluster nodes with potential issues. The remedy component is implemented using Linux container technology. Container technology has proven its success in the microservices domain. We adapt it to HPC workloads to make use of its resilience potential. By running workloads inside containers, we are able to migrate workloads from nodes predicted to have hardware issues to healthy nodes while the workloads are running. This does not introduce any major interruption or performance overhead to the workload, nor does it require application modification. We test with multiple real HPC applications that use the Message Passing Interface (MPI) standard. Tests are performed on various cluster platforms using different MPI types. Results demonstrate successful migration of HPC workloads while maintaining the integrity of the results produced. by Mohamad Sindi. Ph. D. Massachusetts Institute of Technology, Department of Civil and Environmental Engineering 2020-03-23T18:10:40Z 2019 Thesis https://hdl.handle.net/1721.1/124188 1144931624 eng MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. http://dspace.mit.edu/handle/1721.1/7582 130 pages application/pdf Massachusetts Institute of Technology
spellingShingle	Civil and Environmental Engineering. Sindi, Mohamad(Mohamad Othman) A container-based lightweight fault tolerance framework for high performance computing workloads
title	A container-based lightweight fault tolerance framework for high performance computing workloads
topic	Civil and Environmental Engineering.
url	https://hdl.handle.net/1721.1/124188

