Running resilient MPI applications on a Dynamic Group of Recommended Processes
Abstract: High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. Most of the existing fault-tolerant strategies for these systems assume crash faults that are permanent events are easily...
Main Authors: | |
---|---|
Format: | Article |
Language: | English |
Published: | Sociedade Brasileira de Computação, 2018-03-01 |
Series: | Journal of the Brazilian Computer Society |
Subjects: | |
Online Access: | http://link.springer.com/article/10.1186/s13173-018-0069-z |