A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States

Checkpoint/restart has been an effective mechanism to achieve fault tolerance for many long-running scientific applications. The common approach is to save computation states in memory and secondary storage for execution resumption. However, as the GPU plays a much bigger role in high performance co...

Bibliographic Details
Main Authors: Hai Jiang, Yulu Zhang, Jeff Jennes, Kuan-Ching Li
Format: Article
Language: English
Published: Springer 2013-11-01
Series: International Journal of Networked and Distributed Computing (IJNDC)
Online Access: https://www.atlantis-press.com/article/9665.pdf