Availability Study of Dynamic Voting Algorithms

Fault tolerant distributed systems often select a primary component to allow a subset of the processes to function when failures occur. The dynamic voting paradigm defines rules for selecting the primary component adaptively: when a partition occurs, if a majority of the previous primary component i...

Full description

Bibliographic Details
Main Authors: Ingols, Kyle, Keidar, Idit
Published: 2023
Online Access:https://hdl.handle.net/1721.1/149301
Description
Summary:Fault tolerant distributed systems often select a primary component to allow a subset of the processes to function when failures occur. The dynamic voting paradigm defines rules for selecting the primary component adaptively: when a partition occurs, if a majority of the previous primary component is connected, a new and possibly smaller primary is chosen. Several studies have shown that dynamic voting leads to more available solutions than other paradigms for maintaining a primary component. However, these studies have assumed that every attempt made by the algorithm to form a new primary component terminates successfully. Unfortunately, in real systems, this is not always the case: a change in connectivity can interrupt the algorithm whiel it is still attempting to form a new primary component; in such cases, algorithms typically block until processes can resolve the outcome of the interrupted attempt. This paper uses simulations to evaluate the effect of interruptions on the availability of dynamic voting algorithms. We study four dynamic voting algorithms, and identify two important characteristics that impact an algorithm's availability in runs with frequent connectivity changes. First, we show that the number of communication rounds exchanged in an algorithm plays a significant role in the availability achieved, especially in the degradation of availability as connectivity changes become more frequent. Second, we show that the number of processes that need to be present in order to resolve past attempts impacts the availability, especially during long runs with numerous connectivity changes.