High availability for parallel computers
Fault tolerance has become an important issue for parallel applications in the last few years. Users of parallel systems want them to be reliable along two main dimensions: availability and data consistency. Availability can be provided with solutions such as RADIC, a fault-tolerant arc...
Main Authors: | Dolores Rexachs del Rosario, Emilio Luque Fadón |
---|---|
Format: | Article |
Language: | English |
Published: | Postgraduate Office, School of Computer Science, Universidad Nacional de La Plata, 2010-10-01 |
Series: | Journal of Computer Science and Technology |
Subjects: | fault tolerance, availability, RADIC, transient faults, performability |
Online Access: | https://journal.info.unlp.edu.ar/JCST/article/view/697 |
_version_ | 1819092008615018496 |
---|---|
author | Dolores Rexachs del Rosario; Emilio Luque Fadón |
author_facet | Dolores Rexachs del Rosario; Emilio Luque Fadón |
author_sort | Dolores Rexachs del Rosario |
collection | DOAJ |
description | Fault tolerance has become an important issue for parallel applications in the last few years. Users of parallel systems want them to be reliable along two main dimensions: availability and data consistency. Availability can be provided with solutions such as RADIC, a fault-tolerant architecture with different protection levels, offering high availability with transparency, decentralization, flexibility and scalability for message-passing systems. Transient faults may cause an application running in a computer system to be removed from execution. However, the biggest risk of transient faults is that they provoke undetected data corruption that changes the final result of the application without anyone knowing. To evaluate the effects of transient faults on the robustness of applications and to validate new fault detection mechanisms and strategies, we have developed a full-system simulation fault injection environment. |
first_indexed | 2024-12-21T22:48:47Z |
format | Article |
id | doaj.art-e5ed1cd7ebdb45a8b326118b84ede969 |
institution | Directory Open Access Journal |
issn | 1666-6046; 1666-6038 |
language | English |
last_indexed | 2024-12-21T22:48:47Z |
publishDate | 2010-10-01 |
publisher | Postgraduate Office, School of Computer Science, Universidad Nacional de La Plata |
record_format | Article |
series | Journal of Computer Science and Technology |
spelling | doaj.art-e5ed1cd7ebdb45a8b326118b84ede969; 2022-12-21T18:47:39Z; eng; Postgraduate Office, School of Computer Science, Universidad Nacional de La Plata; Journal of Computer Science and Technology; 1666-6046; 1666-6038; 2010-10-01; 1003110116392; High availability for parallel computers; Dolores Rexachs del Rosario 0; Emilio Luque Fadón 1; Computer Architecture and Operating System Department, Universidad Autónoma de Barcelona, Barcelona 08193, Spain; Computer Architecture and Operating System Department, Universidad Autónoma de Barcelona, Barcelona 08193, Spain; Fault tolerance has become an important issue for parallel applications in the last few years. Users of parallel systems want them to be reliable along two main dimensions: availability and data consistency. Availability can be provided with solutions such as RADIC, a fault-tolerant architecture with different protection levels, offering high availability with transparency, decentralization, flexibility and scalability for message-passing systems. Transient faults may cause an application running in a computer system to be removed from execution. However, the biggest risk of transient faults is that they provoke undetected data corruption that changes the final result of the application without anyone knowing. To evaluate the effects of transient faults on the robustness of applications and to validate new fault detection mechanisms and strategies, we have developed a full-system simulation fault injection environment.; https://journal.info.unlp.edu.ar/JCST/article/view/697; fault tolerance; availability; radic; transient faults; performability |
spellingShingle | Dolores Rexachs del Rosario; Emilio Luque Fadón; High availability for parallel computers; Journal of Computer Science and Technology; fault tolerance; availability; radic; transient faults; performability |
title | High availability for parallel computers |
title_sort | high availability for parallel computers |
topic | fault tolerance; availability; radic; transient faults; performability |
url | https://journal.info.unlp.edu.ar/JCST/article/view/697 |
work_keys_str_mv | AT doloresrexachsdelrosario highavailabilityforparallelcomputers AT emilioluquefadon highavailabilityforparallelcomputers |