Cooperative checkpointing for supercomputing systems

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.

Bibliographic Details
Main Author: Oliner, Adam Jamison
Other Authors: José E. Moreira.
Format: Thesis
Language: eng
Published: Massachusetts Institute of Technology 2006
Subjects: Electrical Engineering and Computer Science.
Online Access: http://hdl.handle.net/1721.1/32102
author Oliner, Adam Jamison
author2 José E. Moreira.
collection MIT
description Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.
first_indexed 2024-09-23T08:33:28Z
format Thesis
id mit-1721.1/32102
institution Massachusetts Institute of Technology
language eng
last_indexed 2024-09-23T08:33:28Z
publishDate 2006
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/32102
2019-04-09T18:27:03Z
Cooperative checkpointing for supercomputing systems
Oliner, Adam Jamison
José E. Moreira.
Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science.
Electrical Engineering and Computer Science.
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references (p. 91-94).
A system-level checkpointing mechanism, with global knowledge of the state and health of the machine, can improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. This thesis presents such a technique, called cooperative checkpointing, and models its behavior as an online algorithm. Where C is the checkpoint overhead and I is the request interval, a worst-case analysis proves a lower bound of (2 + ⌊C/I⌋)-competitiveness for deterministic cooperative checkpointing algorithms, and proves that a number of simple algorithms meet this bound. Using an expected-case analysis, this thesis proves that an optimal periodic checkpointing algorithm that assumes an exponential failure distribution may be arbitrarily bad relative to an optimal cooperative checkpointing algorithm that permits a general failure distribution. Calculations suggest that, under realistic conditions, an application using cooperative checkpointing may make progress 4 times faster than one using periodic checkpointing. Finally, the thesis suggests an embodiment of cooperative checkpointing for a large-scale high-performance computer system and presents the results of some preliminary simulations. These results show that, in extreme cases, cooperative checkpointing improved system utilization by more than 25% and reduced bounded slowdown by a factor of 9, while simultaneously reducing the amount of work lost due to failures by 30%. This thesis contributes a unique approach to providing large-scale system reliability through cooperative checkpointing, techniques for analyzing the approach, and blueprints for implementing it in practice.
by Adam Jamison Oliner.
M.Eng.
2006-03-28T19:51:36Z
2005
Thesis
http://hdl.handle.net/1721.1/32102
62323950
eng
M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission.
http://dspace.mit.edu/handle/1721.1/7582
94 p.
2455146 bytes
2616682 bytes
application/pdf
Massachusetts Institute of Technology
title Cooperative checkpointing for supercomputing systems
topic Electrical Engineering and Computer Science.
url http://hdl.handle.net/1721.1/32102