Running resilient MPI applications on a Dynamic Group of Recommended Processes
Abstract: High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. Most of the existing fault-tolerant strategies for these systems assume that crash faults are permanent events that are easily...
Main Authors: | Edson Tavares de Camargo, Elias P. Duarte |
---|---|
Format: | Article |
Language: | English |
Published: | Sociedade Brasileira de Computação, 2018-03-01 |
Series: | Journal of the Brazilian Computer Society |
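Subjects: | Dynamic Group of Recommended Processes (DGRP); Resilience; Fault tolerance; MPI applications; HPC systems |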
Online Access: | http://link.springer.com/article/10.1186/s13173-018-0069-z |
_version_ | 1811236087802101760 |
---|---|
author | Edson Tavares de Camargo; Elias P. Duarte |
collection | DOAJ |
description | Abstract: High-performance computing systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. Most of the existing fault-tolerant strategies for these systems assume that crash faults are permanent events that are easily detected. This is not the case in several real systems, in particular in shared clusters, in which even load variation may cause performance problems that are virtually equivalent to faults. In this work, we present a new model to deal with this problem, in which processes execute tests among themselves in order to determine whether the processors (or cores) on which they are running are recommended or non-recommended. Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. The DGRP is formed only by processes that have not been tested as non-recommended by all DGRP processes. A process not in the DGRP that is continuously tested as recommended can rejoin the DGRP after a round of consensus executed by DGRP processes. Experimental results are presented from an MPI-based implementation in which the HyperQuickSort parallel sorting algorithm reconfigures itself at runtime to tolerate up to N − 1 faults (in a system with N processes) while sorting up to 1 billion integers. |
first_indexed | 2024-04-12T12:04:01Z |
format | Article |
id | doaj.art-dfeb2445ee1f45a6b6abf27b883aa21b |
institution | Directory Open Access Journal |
issn | 0104-6500 1678-4804 |
language | English |
last_indexed | 2024-04-12T12:04:01Z |
publishDate | 2018-03-01 |
publisher | Sociedade Brasileira de Computação |
record_format | Article |
series | Journal of the Brazilian Computer Society |
spelling | doaj.art-dfeb2445ee1f45a6b6abf27b883aa21b; 2022-12-22T03:33:47Z; eng; Sociedade Brasileira de Computação; Journal of the Brazilian Computer Society; ISSN 0104-6500, 1678-4804; 2018-03-01; vol. 24, no. 1, pp. 1-16; doi:10.1186/s13173-018-0069-z; Running resilient MPI applications on a Dynamic Group of Recommended Processes; Edson Tavares de Camargo (Department of Informatics, Federal University of Paraná (UFPR)); Elias P. Duarte (Department of Informatics, Federal University of Paraná (UFPR)); http://link.springer.com/article/10.1186/s13173-018-0069-z; Dynamic Group of Recommended Processes (DGRP); Resilience; Fault tolerance; MPI applications; HPC systems |
title | Running resilient MPI applications on a Dynamic Group of Recommended Processes |
topic | Dynamic Group of Recommended Processes (DGRP); Resilience; Fault tolerance; MPI applications; HPC systems |
url | http://link.springer.com/article/10.1186/s13173-018-0069-z |
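
The membership rule stated in the abstract (the DGRP keeps only processes that have not been tested as non-recommended by all DGRP processes) can be illustrated with a small sketch. This is not the authors' MPI implementation: the function name `next_dgrp`, the `tests` dictionary layout, and the reading that a process is dropped only when every other current member that tested it reported it as non-recommended are all assumptions made for illustration. The consensus round needed to rejoin the group is not modeled.

```python
from typing import Dict, Set


def next_dgrp(dgrp: Set[int], tests: Dict[int, Dict[int, bool]]) -> Set[int]:
    """Return the members of `dgrp` that survive one round of tests.

    tests[i][j] is True when tester i classified process j as recommended.
    Assumption: a member is dropped only if every other current member
    that tested it classified it as non-recommended.
    """
    survivors = set()
    for j in dgrp:
        # Only results reported by other current DGRP members count.
        testers = [i for i in dgrp if i != j and j in tests.get(i, {})]
        rejected_by_all = bool(testers) and all(not tests[i][j] for i in testers)
        if not rejected_by_all:
            survivors.add(j)
    return survivors


# Hypothetical round with four processes: process 3 is reported as
# non-recommended by every other member, process 1 by only one of them.
example_tests = {
    0: {1: False, 2: True, 3: False},
    1: {0: True, 2: True, 3: False},
    2: {0: True, 1: True, 3: False},
    3: {0: True, 1: True, 2: True},
}
print(next_dgrp({0, 1, 2, 3}, example_tests))  # {0, 1, 2}
```

Under this reading, a single negative test (process 0 testing process 1) is not enough to exclude a process; exclusion requires agreement among all testers in the group.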