Running resilient MPI applications on a Dynamic Group of Recommended Processes

Abstract: High-performance computing systems run applications that can take several hours to execute and must cope with a potentially large number of faults. Most existing fault-tolerant strategies for these systems assume crash faults, which are permanent events that are easily...

Bibliographic Details
Main Authors: Edson Tavares de Camargo, Elias P. Duarte
Format: Article
Language: English
Published: Sociedade Brasileira de Computação 2018-03-01
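Series: Journal of the Brazilian Computer Society
Subjects: Dynamic Group of Recommended Processes (DGRP); Resilience; Fault tolerance; MPI applications; HPC systems
Online Access: http://link.springer.com/article/10.1186/s13173-018-0069-z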
author Edson Tavares de Camargo
Elias P. Duarte
collection DOAJ
description High-performance computing systems run applications that can take several hours to execute and must cope with a potentially large number of faults. Most existing fault-tolerant strategies for these systems assume crash faults, which are permanent events that are easily detected. This is not the case in several real systems, in particular in shared clusters, in which even load variation may cause performance problems that are virtually equivalent to faults. In this work, we present a new model to deal with this problem, in which processes execute tests among themselves to determine whether the processors (or cores) on which they are running are recommended or non-recommended. Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. The DGRP is formed only by processes that have not been tested as non-recommended by all DGRP processes. A process not in the DGRP that is continuously tested as recommended can rejoin the DGRP after a round of consensus executed by the DGRP processes. Experimental results are presented from an MPI-based implementation in which the HyperQuickSort parallel sorting algorithm reconfigures itself at runtime to tolerate up to N − 1 faults (in a system with N processes) while sorting up to 1 billion integers.
first_indexed 2024-04-12T12:04:01Z
format Article
id doaj.art-dfeb2445ee1f45a6b6abf27b883aa21b
institution Directory Open Access Journal
issn 0104-6500
1678-4804
language English
last_indexed 2024-04-12T12:04:01Z
publishDate 2018-03-01
publisher Sociedade Brasileira de Computação
record_format Article
series Journal of the Brazilian Computer Society
spelling doaj.art-dfeb2445ee1f45a6b6abf27b883aa21b | 2022-12-22T03:33:47Z | eng | Sociedade Brasileira de Computação | Journal of the Brazilian Computer Society | 0104-6500 | 1678-4804 | 2018-03-01 | 241116 | 10.1186/s13173-018-0069-z | Running resilient MPI applications on a Dynamic Group of Recommended Processes | Edson Tavares de Camargo 0 | Elias P. Duarte 1 | Department of Informatics, Federal University of Paraná (UFPR) | Department of Informatics, Federal University of Paraná (UFPR) | http://link.springer.com/article/10.1186/s13173-018-0069-z | Dynamic Group of Recommended Processes (DGRP) | Resilience | Fault tolerance | MPI applications | HPC systems
title Running resilient MPI applications on a Dynamic Group of Recommended Processes
topic Dynamic Group of Recommended Processes (DGRP)
Resilience
Fault tolerance
MPI applications
HPC systems
url http://link.springer.com/article/10.1186/s13173-018-0069-z