A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States
Checkpoint/restart has been an effective mechanism to achieve fault tolerance for many long-running scientific applications. The common approach is to save computation states in memory and secondary storage for execution resumption. However, as the GPU plays a much bigger role in high performance co...
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: | Springer 2013-11-01 |
Series: | International Journal of Networked and Distributed Computing (IJNDC) |
Online Access: | https://www.atlantis-press.com/article/9665.pdf |