Running resilient MPI applications on a Dynamic Group of Recommended Processes

Abstract: High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. Most of the existing fault-tolerant strategies for these systems assume crash faults that are permanent events and are easily...

Bibliographic Details
Main Authors:	Edson Tavares de Camargo, Elias P. Duarte
Format:	Article
Language:	English
Published:	Sociedade Brasileira de Computação 2018-03-01
Series:	Journal of the Brazilian Computer Society
Subjects:	Dynamic Group of Recommended Processes (DGRP); Resilience; Fault tolerance; MPI applications; HPC systems
Online Access:	http://link.springer.com/article/10.1186/s13173-018-0069-z

author	Edson Tavares de Camargo, Elias P. Duarte
collection	DOAJ
description	High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. Most of the existing fault-tolerant strategies for these systems assume crash faults that are permanent events and are easily detected. This is not the case in several real systems, in particular in shared clusters, in which even load variation may cause performance problems that are virtually equivalent to faults. In this work, we present a new model to deal with this problem, in which processes execute tests among themselves in order to determine whether the processors (or cores) on which they are running are recommended or non-recommended. Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. The DGRP is formed only by processes that have not been tested as non-recommended by all DGRP processes. A process not in the DGRP that is continuously tested as recommended can rejoin the DGRP after a round of consensus executed by DGRP processes. We present experimental results obtained from an MPI-based implementation in which the HyperQuickSort parallel sorting algorithm reconfigures itself at runtime to tolerate up to N − 1 faults (in a system with N processes) while sorting up to 1 billion integers.
first_indexed	2024-04-12T12:04:01Z
format	Article
id	doaj.art-dfeb2445ee1f45a6b6abf27b883aa21b
institution	Directory Open Access Journal
issn	0104-6500 1678-4804
language	English
last_indexed	2024-04-12T12:04:01Z
publishDate	2018-03-01
publisher	Sociedade Brasileira de Computação
record_format	Article
series	Journal of the Brazilian Computer Society
spelling	doaj.art-dfeb2445ee1f45a6b6abf27b883aa21b | 2022-12-22T03:33:47Z | eng | Sociedade Brasileira de Computação | Journal of the Brazilian Computer Society | 0104-6500 | 1678-4804 | 2018-03-01 | 24 | 1 | 1 | 16 | 10.1186/s13173-018-0069-z | Running resilient MPI applications on a Dynamic Group of Recommended Processes | Edson Tavares de Camargo (0) | Elias P. Duarte (1) | Department of Informatics, Federal University of Paraná (UFPR) | Department of Informatics, Federal University of Paraná (UFPR)
title	Running resilient MPI applications on a Dynamic Group of Recommended Processes
topic	Dynamic Group of Recommended Processes (DGRP); Resilience; Fault tolerance; MPI applications; HPC systems
url	http://link.springer.com/article/10.1186/s13173-018-0069-z

Similar Items