Article

Computing Fault-Containment Times of Self-Stabilizing Algorithms Using Lumped Markov Chains †

Institute of Telematics, Hamburg University of Technology, 21073 Hamburg, Germany
An extended abstract of this work appeared at the 19th International Symposium on Stabilization, Safety, and Security of Distributed Systems 2017 in Boston, USA.
Algorithms 2018, 11(5), 58; https://doi.org/10.3390/a11050058
Submission received: 10 February 2018 / Revised: 28 March 2018 / Accepted: 2 May 2018 / Published: 3 May 2018

Abstract

The analysis of self-stabilizing algorithms is often limited to the worst case stabilization time starting from an arbitrary state, i.e., a state resulting from a sequence of faults. Considering the fact that these algorithms are intended to provide fault tolerance in the long run, this is not the most relevant metric. A common situation is that a running system is in a legitimate state when hit by a single fault. This event has a much higher probability than multiple concurrent faults. Therefore, the worst case time to recover from a single fault is more relevant than the recovery time from a large number of faults. This paper presents techniques to derive upper bounds for the mean time to recover from a single fault for self-stabilizing algorithms based on Markov chains in combination with lumping. To illustrate the applicability of the techniques, they are applied to a new self-stabilizing coloring algorithm.

1. Introduction

Fault tolerance aims at making distributed systems more reliable by enabling them to continue the provision of services in the presence of faults. The strongest form is masking fault tolerance, where a system continues to operate after faults without any observable impairment of functionality, i.e., safety is always guaranteed. In contrast non-masking fault tolerance does not ensure safety at all times. Users may experience incorrect system behavior, but eventually the system will fully recover. The potential of this concept lies in the fact that it can be used in cases where masking fault tolerance is too costly or even impossible to implement [1]. Self-stabilizing algorithms belong to the category of distributed algorithms that provide non-masking fault tolerance. They guarantee that systems eventually recover from transient faults of any scale such as perturbations of the state in memory or communication message corruption [2]. A critical issue is the length of the time span until full recovery. Examples are known where a memory corruption at a single process caused a vast disruption in large parts of the system and triggered a cascade of corrections to reestablish safety. Thus, an important issue for non-masking fault tolerance is the containment of the effect of faults.
A fault-containing system has the ability to contain the effects of transient faults in space and time. The goal is to keep the extent of disruption during recovery proportional to the extent of the faults. An extreme case of fault-containment with respect to space is given when the effect of faults is bounded to the set of faulty nodes. Azar et al. call this error confinement [3]. More relaxed forms of fault-containment are known as time-adaptive self-stabilization [4], scalable self-stabilization [5], strong stabilization [6], and 1-adaptive self-stabilization [7].
A configuration is called k-faulty, if in a legitimate configuration exactly k processes are hit by a fault (a configuration is called legitimate if it conforms with the specification). A large body of research focuses on fault-containment for 1-faulty configurations. Two metrics have been introduced to quantify the containment behavior in the 1-faulty case: contamination radius and containment time [8,9]. A distributed algorithm A has contamination radius r if only nodes within the r-hop neighborhood of the faulty node change their state during recovery from a 1-faulty configuration. The containment time of A denotes the worst-case number of rounds any execution of A starting at a 1-faulty configuration needs to reach a legitimate configuration. In technical terms this corresponds to the worst case time to recover in case of a single fault. For randomized algorithms the expected number of rounds to reach a legitimate configuration corresponds to the mean time to recover (MTTR).
Over the last two decades a large number of self-stabilizing algorithms have been published. Surprisingly, the analysis of the vast majority of these algorithms is confined to the worst case stabilization time starting from an arbitrary configuration. Since these algorithms are intended to provide fault tolerance in the long run, this is not the most relevant metric. From a practical point of view the recovery time from a 1-faulty configuration is more interesting, because the probability of a 1-faulty configuration is much higher than that of a k-faulty configuration with a large value of k. The reason is that a distributed system consists of independently operating computers, where transient faults such as memory faults in different computers are independent events. It therefore comes as a surprise that most papers consider only arbitrary initial states (i.e., k-faulty configurations for any k) instead of focusing on 1-faulty configurations. Fault-containment metrics have been considered in only a few cases [10,11]. This is even more surprising because the many techniques available to determine the worst case stabilization time of an algorithm, e.g., potential functions and convergence stairs, can also be used to compute the containment time.
This paper discusses a technique to analyze the containment time of randomized self-stabilizing algorithms with respect to memory and message corruption. The execution of the algorithm is modeled as a stochastic process. Let X be the random variable that represents the number of rounds the system requires to reach a legitimate configuration when started in a 1-faulty configuration. Then the MTTR of the algorithm is equal to E [ X ] , the expected value of X. Thus, we are interested in upper bounds for E [ X ] . Sometimes it is possible to derive an explicit expression for E [ X ] or use results about absorbing Markov chains for this purpose. These equations may be solvable with a software package based on symbolic mathematics. However, the state space explosion problem precludes success for many real world problems. An important technique for the reduction of the complexity of Markov chains is lumping [12]. Lumping is a method based on the aggregation of states that exhibit the same or similar behavior. It leads to a smaller Markov chain that retains the same performance characteristics as the original one. But often lumping is not immediately applicable because the structure of the Markov chain is too complex. In some of these cases a weaker form of lumping can lead to Markov chains with a simpler structure that can still be used to derive an upper bound for the absorption time.
The contribution of this paper is a discourse about the containment time of self-stabilizing algorithms. We present and apply techniques based on Markov chains to compute upper bounds for this metric. In particular we demonstrate how lumping can be applied to reduce the complexity of the Markov chains. To demonstrate the usability of the techniques we apply them to a new self-stabilizing coloring algorithm as a case study. We derive an absolute bound for the expected containment time and show that the variance is bounded by a surprisingly small constant independent of the network’s size. We believe that the techniques can also be applied to other algorithms.

2. Related Work

Self-stabilizing algorithms are analyzed with different techniques such as potential functions, convergence stairs, and Markov chains. The latter are particularly useful for randomized algorithms [13]. Their main drawback is that in order to set up the transition matrix the graph’s adjacency matrix must be known. This restricts the applicability of this method to small or highly symmetric instances. DeVille et al. apply model checking tools to Markov chains for networks of small size ($n \le 7$) to determine the expected stabilization time [14]. Ring topologies are an example of highly symmetric networks [15,16]. Fribourg et al. model randomized distributed algorithms as Markov chains using the technique of coupling to compute upper bounds for the stabilization times [15]. Yamashita uses Markov chains to model self-stabilizing probabilistic algorithms and to prove stabilization [16]. Mitton et al. consider a randomized self-stabilizing $(\Delta + 1)$-coloring algorithm and model it in terms of urns/balls using a Markov chain to get a bound for the stabilization time [17]. Their evaluation is restricted to networks of up to 1000 nodes. Crouzen et al. model faulty distributed algorithms as Markov decision processes to incorporate the effects of random faults when using a non-deterministic scheduler [18]. They used the PRISM model-checker to compute long-run average availabilities.

3. System Model

This paper uses the synchronous model of distributed computing as defined in the standard literature [2,8,19]. A distributed system is represented as an undirected graph $G = (V, E)$ where $V$ is the set of nodes and $E \subseteq V \times V$ is the set of edges. Let $n = |V|$ and let $\Delta(G)$ denote the maximal degree of $G$. The topology is assumed to be fixed. If two nodes are connected by an edge, they are called neighbors. The set of neighbors of node $v$ is denoted by $N(v) \subseteq V$, and $N[v] = N(v) \cup \{v\}$. Each node stores a set of variables. The values of all variables constitute the local state of a node. Let $\sigma$ denote the set of possible local states of a node. The configuration of a system is the tuple of all local states of all nodes. $\Sigma = \sigma^n$ denotes the set of global states. A configuration is called legitimate if it conforms with the specification. The set of all legitimate configurations is denoted by $\mathcal{L}$.
Nodes communicate either via locally shared memory or by exchanging messages. In the shared memory model each node executes a protocol consisting of a list of rules of the form $\text{guard} \longrightarrow \text{statement}$. The guard is a Boolean expression over the variables of the node and its neighbors. The statement consists of a series of commands. A node is called enabled if one of its guards evaluates to true. The execution of a statement is called a move.
In the message passing model a node performs three actions per round: receiving messages from neighbors, executing code, and sending messages to neighbors. Direct access to the state of neighboring nodes is impossible. Two nodes $u$ and $v$ communicate via two link registers: $u$ writes in its register and $v$ reads from it, and $v$ writes in its register and $u$ reads from it. In this model the state of a node also includes the states of its registers. This work assumes the CONGEST model of distributed computation [19]. Algorithms in the CONGEST model enforce an $O(\log n)$ limit on the maximum message size. Hence, with a single message only a constant number of node identifiers in the range $\{0, \ldots, n\}$ can be transmitted.
Execution of the statements is performed in a synchronous style, i.e., all enabled nodes execute their code in every round. An execution $e = \langle c_0, c_1, c_2, \ldots \rangle$, $c_i \in \Sigma$, is a sequence of configurations, where $c_0$ is called the initial configuration and $c_i$ is the configuration after the $i$-th round. In other words, if the current configuration is $c_{i-1}$ and all enabled nodes make a move, then this yields $c_i$.
Let $\mathcal{A}$ be a distributed algorithm and $\mathcal{L} \subseteq \Sigma$ a set of configurations. $\mathcal{A}$ is called self-stabilizing with respect to $\mathcal{L}$ if it satisfies the convergence and closure properties. The first property states that every execution of $\mathcal{A}$ reaches $\mathcal{L}$ after a finite number of rounds. The second property states that $\mathcal{A}(c) \in \mathcal{L}$ for all $c \in \mathcal{L}$ as long as no fault occurs. The worst case stabilization time $ST_{\mathcal{A}}(G)$ of $\mathcal{A}$ for a graph $G$ is equal to the maximal number of rounds after which $\mathcal{A}$ reaches a legitimate configuration of $G$ regardless of the initial configuration under the assumption that no errors occur.
Definition 1.
A configuration $c \in \Sigma$ of a self-stabilizing algorithm with respect to $\mathcal{L}$ is called k-faulty if a configuration $c' \in \Sigma$ satisfying $\mathcal{L}$ exists such that $c$ differs from $c'$ in the local states of at most $k$ nodes.
Note that for the message passing model this definition also covers message corruption. This paper analyzes the most common fault situation: 1-faulty configurations. They arise when a single node v is hit by a memory corruption or a single message sent by v is corrupted.
The containment behavior of a self-stabilizing algorithm is characterized by the contamination radius and the containment time.
Definition 2.
Let $\mathcal{A}$ be a self-stabilizing algorithm.
1. The containment time of $\mathcal{A}$ denotes the worst-case number of rounds any execution of $\mathcal{A}$ starting at a 1-faulty configuration needs to reach a legitimate configuration.
2. Let $R_v$ be the subgraph induced by the nodes engaged in the recovery process from a 1-faulty configuration of $\mathcal{A}$ triggered by a fault at $v$. Then $\max\{dist(v, w) \mid w \in R_v\}$ is called the contamination radius.
The stabilization time $ST_{\mathcal{A}}(R_v)$ is an obvious upper bound for the containment time.
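The contamination radius of Definition 2 can be evaluated mechanically once the set of nodes that took part in a recovery is known. The following is a minimal sketch under stated assumptions: the graph is given as adjacency lists, and the names `bfs_distances`, `contamination_radius`, and the set `changed` (the nodes of $R_v$ observed in one recovery) are hypothetical and do not appear in the paper.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source in an undirected graph given as adjacency lists."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def contamination_radius(adj, faulty, changed):
    """max dist(faulty, w) over all nodes w that changed state during recovery."""
    dist = bfs_distances(adj, faulty)
    return max((dist[w] for w in changed), default=0)

# Hypothetical example: path 0 - 1 - 2 - 3 with a fault at node 0,
# where nodes 0, 1 and 2 changed their state during recovery.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
radius = contamination_radius(adj, faulty=0, changed={0, 1, 2})
```

In this example the farthest node that changed its state is two hops away from the faulty node, so the observed radius is 2. A single recovery only gives a lower bound on the contamination radius of the algorithm, which is defined over all 1-faulty configurations.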

4. Examples

Before presenting techniques to compute these metrics we give some examples using the shared memory model to illustrate the two definitions.

4.1. Contamination Radius

Consider an algorithm in the shared memory model with contamination radius $r$. A single fault will not spread beyond the $r$-hop neighborhood of the faulty node $v$. In this case $R_v \subseteq G_v^r$, where $G_v^r$ is the subgraph induced by the nodes $w$ with $dist(v, w) \le r$. As an example consider the well known self-stabilizing algorithm $\mathcal{A}_1$ to compute a maximal independent set (see Algorithm 1). It uses a single variable $state$. A configuration is legitimate if the nodes with $state = IN$ form a maximal independent set.
Algorithm 1: Self-stabilizing algorithm $\mathcal{A}_1$ to compute a MIS.
Lemma 1.
Algorithm A 1 has contamination radius two.
Proof. 
Consider a 1-faulty configuration where node $v$ is hit by a memory corruption. First suppose the state of $v$ changed from $IN$ to $OUT$. Let $u \in N(v)$; then $u.state = OUT$. If $u$ has a neighbor $w \neq v$ with $w.state = IN$, then $u$ will not change its state during recovery. Otherwise, if all neighbors of $u$ except $v$ have state $OUT$, node $u$ may change its state during recovery. But since these neighbors of $u$ have a neighbor with state $IN$, they will not change their state. Thus, in this case only the neighbors of $v$ may change state during recovery.
Suppose that $v.state$ changed from $OUT$ to $IN$. Then $v$ and those neighbors of $v$ with state $IN$ can change to $OUT$. Arguing as in the first case, only nodes within distance two of $v$ may change their state during recovery. ☐
Next we consider another example: $(\Delta + 1)$-coloring. Most distributed algorithms for this problem follow the same pattern. A node that realizes that it has selected the same color as one of its neighbors chooses a new color from a finite color palette. This palette does not include the current colors of the node’s neighbors. To be executed under the synchronous scheduler these algorithms are either randomized or use identifiers for symmetry breaking. Variations of this idea are followed in [17,20,21]. As an example consider algorithm $\mathcal{A}_2$ from [20] (see Algorithm 2). $\mathcal{A}_2$ has a single variable $c$. A configuration is legitimate if the values of variable $c$ describe a valid $(\Delta + 1)$-coloring. Due to its choice of a new color from the palette, algorithm $\mathcal{A}_2$ has contamination radius at least $\Delta(G)$ (see Figure 1).
Algorithm 2: Self-stabilizing $(\Delta + 1)$-coloring algorithm $\mathcal{A}_2$ from [20].
A minor modification of algorithm $\mathcal{A}_2$ dramatically changes matters. Algorithm $\mathcal{A}_3$ (see Algorithm 3) has contamination radius 1 (see Lemma 2). Note that the neighbors of $v$ that change their color during recovery form an independent set.
Algorithm 3: Self-stabilizing $(\Delta + 1)$-coloring algorithm $\mathcal{A}_3$.
Lemma 2.
Algorithm A 3 has contamination radius one.
Proof. 
Consider a 1-faulty configuration where node $v$ is hit by a memory corruption changing its color to a color $c$ already chosen by at least one neighbor of $v$. Let $N_{conflict} = \{w \in N(v) \mid w.c = c\}$. In the next round the nodes in $N_{conflict} \cup \{v\}$ will get a chance to choose a new color. The choices will only lead to conflicts between $v$ and other nodes in $N_{conflict}$. Thus, the fault will not spread beyond the set $N_{conflict}$. With a positive probability the set $N_{conflict}$ will contain fewer nodes in each round. ☐

4.2. Containment Time

Like the contamination radius, the containment time strongly depends on the concrete structure of $G$. This can be illustrated with algorithm $\mathcal{A}_1$. Note that in this case $R_v$ can contain any subgraph $H$ with $\Delta(G)$ nodes. As an example let $G$ consist of $H$ and an additional node $v$ connected to each node of $H$. A legitimate configuration is given if the state of $v$ is $IN$ and all other nodes have state $OUT$ (Figure 2 left). If $v$ changes its state to $OUT$ due to a fault, then all nodes may change to state $IN$ during the next round. Thus, there is little hope for a bound below the trivial bound. Similar arguments hold for the second 1-faulty configuration of $\mathcal{A}_1$ shown on the right of Figure 2.

5. Self-Stabilizing Algorithms and Markov Chains

A self-stabilizing algorithm $\mathcal{A}$ can be regarded as a transition system on $\Sigma$. In each round the current configuration $c \in \Sigma$ is transformed into a new configuration $\mathcal{A}(c) \in \Sigma$. Each configuration of $\Sigma$ occurs with a specific probability as the new configuration $\mathcal{A}(c)$. The randomness can originate in the scheduler (a probabilistic scheduler randomly selects enabled nodes to make a move) or in a random experiment within one of the rules. We assume that the scheduler and the random experiment are memory-less. Therefore, the execution of algorithm $\mathcal{A}$ can be described by a Markov chain with state set $\Sigma$ and transition matrix $P = (p_{ij})$, where $p_{ij} = \mathrm{Prob}(\mathcal{A}(c_i) = c_j)$ gives the probability to move from configuration $c_i$ to $c_j \in \Sigma$ in one round. This work uses the notation for Markov chains as introduced in [12]. For a self-stabilizing algorithm $\mathcal{A}$ this Markov chain is denoted by $\mathcal{C}_{\mathcal{A}}$. In the following we use the terms configurations of $\mathcal{A}$ and states of $\mathcal{C}_{\mathcal{A}}$ as synonyms.
There is a close relation between the absorbing states of $\mathcal{C}_{\mathcal{A}}$ and the legitimate configurations $\mathcal{L}$ of $\mathcal{A}$. The closure property of a self-stabilizing algorithm guarantees that a configuration of $\mathcal{L}$ is always mapped to a configuration in $\mathcal{L}$, i.e., $\mathcal{A}(\mathcal{L}) \subseteq \mathcal{L}$. A state of a Markov chain, in contrast, is called absorbing if it is impossible to leave this state. Note that a non-legitimate state of $\mathcal{A}$ cannot be an absorbing state of $\mathcal{C}_{\mathcal{A}}$. In silent self-stabilizing algorithms we have $\mathcal{A}(c) = c$ for all $c \in \mathcal{L}$. For non-silent algorithms we may without loss of generality also assume that $\mathcal{A}(c) = c$ for all $c \in \mathcal{L}$. If this condition is not given, we can partition $\mathcal{L}$ into subsets $\mathcal{L}_1, \ldots, \mathcal{L}_s$ with $\mathcal{A}(\mathcal{L}_i) = \mathcal{L}_i$ and identify all configurations in $\mathcal{L}_i$ as corresponding to the same state. For the computation of the stabilization time this does not make a difference. Under this assumption the Markov chain $\mathcal{C}_{\mathcal{A}}$ is an absorbing Markov chain and the set of absorbing states corresponds to the set of legitimate states of $\mathcal{A}$. We can state the following easy to prove lemma.
Lemma 3.
Let $c$ be a configuration of $\mathcal{C}_{\mathcal{A}}$. An absorbing state of $\mathcal{C}_{\mathcal{A}}$ is reached from $c$ in expected $B$ steps if and only if $\mathcal{A}$ stabilizes in expected $B$ rounds from $c$.
The computation of the expected time to absorption for an absorbing Markov chain using the transition matrix $P$ is a simple matrix operation [12]. We assume a labeling of the states such that the $t$ non-absorbing states come before the $a$ absorbing states. Then $P$ has the following canonical form
$$P = \begin{pmatrix} Q & R \\ 0 & E \end{pmatrix}$$
Here $E$ is the $a \times a$ unit matrix and $Q$ is a $t \times t$ matrix. For an absorbing Markov chain, the matrix $N = (E - Q)^{-1}$ is called the fundamental matrix for $P$. Let $t_i$ be the expected number of steps before the chain is absorbed, given that the chain starts in the $i$-th state, and let $t$ be the column vector whose $i$-th entry is $t_i$. Then $t = N c$, where $c$ is a column vector all of whose entries are 1. The variance of these numbers of steps is given by the entries of $(2N - E)\,t - t_{sq}$, where $t_{sq}$ is derived from $t$ by squaring each entry [12]. Thus, if $N$ is the fundamental matrix of $\mathcal{C}_{\mathcal{A}}$, then the expected number of rounds after which algorithm $\mathcal{A}$ stabilizes is $\max N c$, provided the initial configuration is randomly chosen from the non-legitimate configurations.
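The formulas $t = N c$ and $(2N - E)\,t - t_{sq}$ can be evaluated without forming $N$ explicitly by solving linear systems with the matrix $E - Q$. The sketch below is an illustration only: the function names `solve` and `absorption_stats` and the $2 \times 2$ example matrix `q` are assumptions made for this example and are not taken from the paper. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]

def absorption_stats(q):
    """Return (t, var) with t = N c and var = (2N - E) t - t_sq, N = (E - Q)^-1."""
    n = len(q)
    e_minus_q = [[Fraction(int(i == j)) - q[i][j] for j in range(n)]
                 for i in range(n)]
    t = solve(e_minus_q, [Fraction(1)] * n)   # t = N c
    nt = solve(e_minus_q, t)                  # N t
    var = [2 * nt[i] - t[i] - t[i] ** 2 for i in range(n)]
    return t, var

# Hypothetical chain with two transient states:
# state 0 stays w.p. 1/2, moves to state 1 w.p. 1/4, is absorbed w.p. 1/4;
# state 1 stays w.p. 1/2 and is absorbed w.p. 1/2.
q = [[Fraction(1, 2), Fraction(1, 4)],
     [Fraction(0), Fraction(1, 2)]]
t, var = absorption_stats(q)
```

For this example the expected absorption times are 3 and 2 steps, with variances 4 and 2. The second value matches the variance of a geometric distribution with success probability 1/2.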
There is still a big obstacle to applying this procedure in practice. In order to compute the matrix $N$ the probabilities $p_{ij}$ need to be known explicitly. Without knowing the graph explicitly it is impossible to compute the probabilities $p_{ij} = \mathrm{Prob}(\mathcal{A}(c_i) = c_j)$ for all pairs of states $c_i, c_j$. This is only possible for small graphs or when the graph has a symmetric structure, e.g., a ring.
The fundamental matrix contains a lot of information which is not needed for the computation of the stabilization time. Therefore, a coarser analysis of the Markov chain would be sufficient. A common approach is to partition $\Sigma$ into subsets $\Sigma_0, \ldots, \Sigma_l$ and consider these as the states of a new Markov chain (see Figure 3). The challenge is to define the transition probabilities of the new chain in a way that allows properties of this new chain to be transferred to the original one.
A partition $\mathcal{P} = \{\Sigma_0, \ldots, \Sigma_l\}$ of $\Sigma$ is called lumpable if the subsets $\Sigma_i$ have the property that for each pair $i, j$ the probability of a state $c \in \Sigma_i$ to be transformed in one step into a state of $\Sigma_j$ is independent of the choice of $c \in \Sigma_i$ (Definition 6.3.1 [12]). This probability is then interpreted as the transition probability from $\Sigma_i$ to $\Sigma_j$. More formally, a Markov chain is lumpable with respect to a partition $\mathcal{P} = \{\Sigma_0, \ldots, \Sigma_l\}$ of $\Sigma$ if for any $\Sigma_i, \Sigma_j \in \mathcal{P}$ and any $c_1, c_2 \in \Sigma_i$
$$\sum_{c \in \Sigma_j} p(c_1, c) = \sum_{c \in \Sigma_j} p(c_2, c).$$
A lumpable Markov chain $\mathcal{C}$ defines a new Markov chain $\mathcal{C}^{\mathcal{P}}$ with state set $\mathcal{P}$ and transition probabilities
$$p(\Sigma_i, \Sigma_j) = \sum_{c \in \Sigma_j} p(c_i, c), \quad \text{where } c_i \text{ is an arbitrary state of } \Sigma_i.$$
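The lumpability condition can be checked directly on a small transition matrix: for every block, the probability mass sent into each block must be the same for all states of the block. The following sketch assumes the chain is given as a nested list of exact fractions and the partition as a list of index lists; the function name `lump` and the four-state example are hypothetical.

```python
from fractions import Fraction

def lump(p, partition):
    """Return the lumped transition matrix, or None if the partition is not lumpable.

    p         -- transition matrix as nested lists, each row summing to 1
    partition -- list of blocks, each block a list of state indices
    """
    lumped = []
    for block_i in partition:
        # Probability of moving from each state of block_i into every block.
        rows = [[sum(p[c][c2] for c2 in block_j) for block_j in partition]
                for c in block_i]
        if any(row != rows[0] for row in rows):
            return None   # the block masses depend on the choice of c
        lumped.append(rows[0])
    return lumped

half, quarter = Fraction(1, 2), Fraction(1, 4)
# Hypothetical chain on states 0..3; state 3 is absorbing.
p = [[0, half, half, 0],
     [0, quarter, quarter, half],
     [0, half, 0, half],
     [0, 0, 0, 1]]
lumped = lump(p, [[0], [1, 2], [3]])      # lumpable partition
not_lumped = lump(p, [[0, 1], [2], [3]])  # states 0 and 1 behave differently
```

States 1 and 2 have different rows in `p`, yet both send mass 1/2 into the block {1, 2} and mass 1/2 into the block {3}, so the first partition is lumpable. The second partition fails because state 0 never reaches state 3 in one step whereas state 1 does.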
The following result is proved in [12], p. 128.
Lemma 4.
Let $c \in \Sigma_i$ be a state of a lumpable Markov chain $\mathcal{C}$. Then the expected time to reach an absorbing state of $\mathcal{C}$ from $c$ is equal to the expected time to reach an absorbing state of $\mathcal{C}^{\mathcal{P}}$ from $\Sigma_i$.
The last two lemmas prove the following theorem.
Theorem 1.
Let $\mathcal{A}$ be a self-stabilizing algorithm with $\Sigma$ the set of configurations. Let $\mathcal{P} = \{\Sigma_0, \ldots, \Sigma_l\}$ be a partition of $\Sigma$ with $\Sigma_0 = \mathcal{L}$ such that $\mathcal{C}_{\mathcal{A}}$ is lumpable with respect to $\mathcal{P}$. For any $i$ an absorbing state of $\mathcal{C}_{\mathcal{A}}^{\mathcal{P}}$ is reached from $\Sigma_i$ in expected $B$ steps if and only if $\mathcal{A}$ stabilizes in expected $B$ rounds starting in any $c \in \Sigma_i$.
Unfortunately it is rather difficult to make use of Theorem 1 under general conditions. This situation changes when the theorem is used to compute the containment time of a self-stabilizing algorithm $\mathcal{A}$. Remember that the containment time of $\mathcal{A}$ denotes the worst-case number of rounds any execution of $\mathcal{A}$ starting in a 1-faulty configuration needs to reach a legitimate configuration. Thus, the containment time of $\mathcal{A}$ is $ST_{\mathcal{A}}(R_v)$, where $v$ is the faulty node. There are two aspects that ease the application of Theorem 1: either $R_v$ has a symmetric structure or $R_v$ is small.
To illustrate this approach we consider again algorithm $\mathcal{A}_3$. Let $v$ be a node whose color is changed to $c_f$ by a memory fault in a legitimate configuration (see Figure 4). Let $c_0$ be the new configuration. This causes a conflict with those neighbors of $v$ that had chosen $c_f$ as their color. After the fault only nodes contained in $R_v$ (a star graph) change their state (see Figure 4 right). Once a neighbor has chosen a color different from $c_f$ it becomes passive (at least until the next transient fault).
The set $\Sigma$ is equal to the set of all configurations of $R_v$ reachable from $c_0$. To partition this set let $d$ be the number of neighbors of $v$ that have color $c_f$ in $c_0$. Let $\mathcal{P} = \{\Sigma_0, \ldots, \Sigma_d\}$ where $\Sigma_j$ is the subset of $\Sigma$ where exactly $d - j$ neighbors of $v$ are in conflict with $v$. Then $\Sigma_0 = \{c_0\}$ and $\Sigma_d \subseteq \mathcal{L}$. Figure 5 shows some configurations belonging to the sets $\Sigma_0$, $\Sigma_2$, and $\Sigma_3$. Let $c \in \Sigma_i$. Then $\mathcal{A}_3(c) \notin \Sigma_j$ for all $j < i$. Unfortunately the partition $\mathcal{P}$ is not lumpable because the probability of a configuration $c \in \Sigma_i$ to be transformed in one round into a fixed configuration of $\Sigma_j$ is not independent of the choice of $c \in \Sigma_i$. But even in these cases Theorem 1 can lead to an upper bound of the stabilization time. This is proved in the following theorem.
Theorem 2.
Let $\mathcal{A}$ be a self-stabilizing algorithm with $\Sigma$ the set of configurations. Let $\mathcal{P} = \{\Sigma_0, \ldots, \Sigma_l\}$ be a partition of $\Sigma$ with $\Sigma_0 = \mathcal{L}$ such that for all $i \ge 0$, if $c \in \Sigma_i$ then $\mathcal{A}(c) \in \Sigma_j$ with $j \ge i$. For $i < j$ let $c_{ij} \ge 0$ be a constant such that $\mathrm{Prob}(\mathcal{A}(c) \in \Sigma_j) \ge c_{ij}$ for all $c \in \Sigma_i$. Furthermore, let $c_{ij} = 0$ for $j < i$ and $c_{ii} = 1 - \sum_{j=i+1}^{d} c_{ij}$ for $i = 0, \ldots, d$. Let $\mathcal{C}'_{\mathcal{A}}$ be the Markov chain with states $\mathcal{P}$ and transition matrix $(c_{ij})$. If an absorbing state of $\mathcal{C}'_{\mathcal{A}}$ is reached from $\Sigma_i$ in expected $B$ steps, then the expected number of rounds $\mathcal{A}$ requires to stabilize starting in any $c \in \Sigma_i$ is at most $B$.
Proof. 
Note that
$$0 \le \sum_{j=i+1}^{d} c_{ij} \le \sum_{j=i}^{d} \mathrm{Prob}(\mathcal{A}(c) \in \Sigma_j) = 1$$
for each fixed $c \in \Sigma_i$. Thus, $c_{ii} \ge 0$ and therefore the matrix $(c_{ij})$ is a stochastic matrix with $c_{dd} = 1$ that describes the new Markov chain $\mathcal{C}'_{\mathcal{A}}$. Remember that $\mathcal{C}_{\mathcal{A}}$ denotes the Markov chain corresponding to algorithm $\mathcal{A}$. The difference between $\mathcal{C}_{\mathcal{A}}$ and $\mathcal{C}'_{\mathcal{A}}$ is that the chain $\mathcal{C}'_{\mathcal{A}}$ remains with a higher probability in its current state instead of moving to a different state $\Sigma_j$. Therefore, the expected number of steps of $\mathcal{C}'_{\mathcal{A}}$ before being absorbed is an upper bound for the corresponding number in $\mathcal{C}_{\mathcal{A}}$. The choice of the probabilities implies that $\mathcal{C}'_{\mathcal{A}}$ is lumpable for partition $\mathcal{P}$. Hence, we can apply Theorem 1 to complete the proof. ☐
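Because the matrix $(c_{ij})$ of Theorem 2 is upper triangular, the expected absorption times of the bounding chain can be obtained by back substitution rather than by a matrix inversion. The sketch below is an illustration under stated assumptions: the last block $\Sigma_d$ is taken as the absorbing state (as suggested by $c_{dd} = 1$), every other block has at least one positive lower bound so that $c_{ii} < 1$, and the names `bounding_chain`, `expected_absorption`, and the example bounds in `lower` are hypothetical.

```python
from fractions import Fraction

def bounding_chain(lower):
    """Build the matrix (c_ij) from lower bounds lower[i][j], i < j.

    lower[i][j] is a lower bound on Prob(A(c) in Sigma_j) for every c in Sigma_i;
    entries with j <= i are ignored.
    """
    size = len(lower)
    c = [[Fraction(0)] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            c[i][j] = Fraction(lower[i][j])
        c[i][i] = 1 - sum(c[i][i + 1:])   # remaining mass stays in Sigma_i
    return c

def expected_absorption(c):
    """Expected number of steps to reach the last state, by back substitution."""
    d = len(c) - 1
    t = [Fraction(0)] * (d + 1)
    for i in range(d - 1, -1, -1):
        # t_i = 1 + c_ii * t_i + sum_{j > i} c_ij * t_j, solved for t_i
        tail = sum(c[i][j] * t[j] for j in range(i + 1, d + 1))
        t[i] = (1 + tail) / (1 - c[i][i])
    return t

# Hypothetical lower bounds for a partition into three blocks.
lower = [[0, Fraction(1, 4), Fraction(1, 4)],
         [0, 0, Fraction(1, 2)],
         [0, 0, 0]]
c = bounding_chain(lower)
bounds = expected_absorption(c)
```

For these bounds the chain stays in each non-absorbing block with probability 1/2, and the resulting upper bounds on the expected number of rounds are 3 from $\Sigma_0$ and 2 from $\Sigma_1$.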
In the rest of this paper the introduced techniques, including Theorem 2, are applied by way of example to a self-stabilizing $(\Delta + 1)$-coloring algorithm $\mathcal{A}_{col}$ using the message passing model.

6. Algorithm $\mathcal{A}_{col}$

This section introduces the coloring algorithm $\mathcal{A}_{col}$ (see Algorithm 4). It has long been known that a $(\Delta + 1)$-coloring can be computed in expected $O(\log n)$ rounds with a randomized algorithm [22,23]. Algorithm $\mathcal{A}_{col}$ follows the pattern sketched in Section 4.1. We derived it from an existing algorithm (Algorithm 19 [24]) by adding the self-stabilization property. The presented techniques can also be applied to other randomized coloring algorithms such as [17,20,21]. The main difference is that $\mathcal{A}_{col}$ assumes the synchronous CONGEST message passing model. Algorithm $\mathcal{A}_{col}$ stabilizes after $O(\log n)$ rounds with high probability whereas the above cited self-stabilizing algorithms all require a linear number of rounds. Since synchronous local algorithms can be converted to asynchronous self-stabilizing algorithms [25], there are self-stabilizing $(\Delta + 1)$-coloring algorithms faster than $\mathcal{A}_{col}$. However, they entail a burden on memory resources and cause high traffic costs.
At the start of each round each node broadcasts its current color to its neighbors. Based on the information received from its neighbors a node decides either to keep its color (final choice), to choose a new color, or to choose no color (value ⊥). In particular, with equal probability a node $v$ either draws uniformly at random a color from the set $\{0, 1, \ldots, \delta(v)\} \setminus tabu$ or indicates that it made no choice (see function randomColor). Here, $tabu$ is the set of colors of neighbors of $v$ that already made their final choice.
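The random choice described above can be sketched in a few lines. This is not the listing of Algorithm 4, which is not reproduced here. It is an illustration of the described behavior under assumptions: the name `random_color`, its parameters, and the use of `None` for the value ⊥ are choices made for this example only.

```python
import random

BOTTOM = None  # stands for the "no color" value ⊥

def random_color(degree, tabu, rng=random):
    """With probability 1/2 return BOTTOM, otherwise a uniformly random
    color from {0, ..., degree} without the colors in tabu."""
    if rng.random() < 0.5:
        return BOTTOM
    palette = [color for color in range(degree + 1) if color not in tabu]
    # tabu holds at most `degree` colors (one per neighbor), so the palette
    # is never empty in that case; the guard below is purely defensive.
    return rng.choice(palette) if palette else BOTTOM

# Hypothetical node of degree 3 whose neighbors fixed the colors 0 and 2.
rng = random.Random(1)
draws = [random_color(3, {0, 2}, rng) for _ in range(200)]
```

Every draw is either `None` or one of the two remaining colors 1 and 3. Passing an explicit `random.Random` instance keeps the example reproducible.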
Algorithm 4: Algorithm $\mathcal{A}_{col}$ as executed by a node $v$ in each round.
In the original algorithm a node maintains a list with the colors of those neighbors that made their final choice. A fault changing this list is difficult to contain. Furthermore, in order to notice a memory corruption at a neighbor, each node must continuously send its state to all its neighbors and cannot stop doing so. This is the well known price of self-stabilization [2]. These considerations lead to the design of Algorithm $\mathcal{A}_{col}$. Each node only maintains the chosen color and whether its choice is final (variables $c$ and $final$). In every round a node sends the values of $c$ and $final$ to all neighbors. $\mathcal{A}_{col}$ uses two additional variables $tabu$ and $occupied$. They are reset at the beginning of every round. To improve fault containment a node’s final choice of a color is only withdrawn if it coincides with the final choice of a neighbor. To achieve a $(\Delta + 1)$-coloring a node makes a new choice if its color is larger than its degree. This situation can only originate from a fault.
First we prove correctness and analyze the stabilization time of $\mathcal{A}_{col}$. A configuration is called a legal coloring if the values of variable $c$ form a $(\Delta + 1)$-coloring. It is called legitimate if it is a legal coloring and $v.final = true$ for each node $v$.
Lemma 5.
A node $v$ can change the value of variable $final$ from $true$ to $false$ only in the first round or when a fault occurred just before the start of this round.
Proof. 
Let $v.c = c_r$ at the beginning of the round. In order for $v$ to set $v.final$ to $false$ one of the following conditions must be met at the start of the round: $c_r > \delta(v)$, $c_r = \bot$, or $v$ has a neighbor $w$ with $w.final = true$ and $w.c = c_r$.
The lemma is obviously true in the first case. Suppose that $c_r = \bot$ and $v.final = true$ at the round’s start. If $v.final$ was set to $true$ during the previous round, then $v.c$ cannot be ⊥ at the start of this round. Hence, at the start of the previous round $final$ already had value $true$. But in this case $v.c$ was not changed in the previous round and thus $c_r \neq \bot$, a contradiction. Finally assume the last condition. Then $v$ and $w$ cannot have changed their value of $c$ in the previous round, because then $final = true$ would be impossible at the start of this round. Thus, $v$ sent $(c_r, true)$ in the previous round. Hence, if $w.c = c_r$ at that time, $w$ would have changed $w.final$ to $false$, again a contradiction. ☐
Lemma 6.
A node setting $final$ to $true$ will not change its variables as long as no error occurs.
Proof. 
Let $v$ be a node that executes $final := true$. If $v$ changes the value back to $false$ in a later round, then by Lemma 5 a fault must have occurred. Thus, in an error-free execution node $v$ will never change variable $final$ again. Since a node can only change variable $c$ if $final = false$, the proof is complete. ☐
Lemma 7.
If at the end of a round during which no error occurred each node $v$ satisfies $v.final = true$, then the configuration is legitimate and remains legitimate as long as no error occurs.
Proof. 
Note that no node changed its color during that round. If at the start of the round $v.final = true$ was already satisfied, then none of $v$’s neighbors also having $final = true$ had the same color as $v$. Next consider a neighbor $w$ of $v$ with $w.final = false$ at the start of the round. Since $v$ sent $(v.c, true)$ at the start of this round, node $w$ would have set $final$ to $false$ if it had chosen the same color as $v$, a contradiction. Finally consider the case that $v.final = false$ at the start of the round. Since $v$ changed $final$ to $true$, none of its neighbors had chosen the same color as $v$. Thus, the configuration is legitimate. Obviously, this property can only be changed by a fault. ☐
With these lemmas the next theorem is proved along the same lines as Lemma 10.3 in [24].
Theorem 3.
Algorithm $\mathcal{A}_{col}$ is self-stabilizing and computes a $(\Delta + 1)$-coloring within $O(\log n)$ rounds with high probability (i.e., with probability at least $1 - \frac{1}{n^c}$ for any $c \ge 1$). $\mathcal{A}_{col}$ has contamination radius 1.
Proof. 
According to Lemma 7 it suffices to prove that all nodes terminate within $O(\log n)$ time with high probability. Let $v \in V$. Lemma 6 implies that the probability that $v$ terminates in round $r > 1$ is equal to the probability that $v$ sets $v.\mathit{final}$ to $\mathit{true}$ in round $r - 1$. This is the probability that $v$ selects a color different from ⊥ and from the selections of all neighbors that chose a value different from ⊥ in round $r - 2$. Suppose that indeed $v.c \neq \bot$ at the end of round $r - 2$. Then $v.c \notin v.\mathit{tabu}$. The probability that a given neighbor $u$ of $v$ selects the same color $u.c = v.c$ in this round is at most $1 / \bigl(2(\delta(v) + 1 - |v.\mathit{tabu}|)\bigr)$. This is because the probability that $u$ selects a color different from ⊥ is $1/2$, and $v$ has $\delta(v) + 1 - |v.\mathit{tabu}|$ different colors to select from. Since $r > 1$ all nodes in $v.\mathit{tabu}$ have $\mathit{final} = \mathit{true}$ and will never change this value. Thus, at most $\delta(v) - |v.\mathit{tabu}|$ neighbors select a new color. By the union bound, the probability that $v$ selects the same color as a neighbor is at most
$$\frac{\delta(v) - |v.\mathit{tabu}|}{2\,(\delta(v) + 1 - |v.\mathit{tabu}|)} < \frac{1}{2}.$$
Thus, if $v$ selects a color $v.c \neq \bot$, it is distinct from the colors of its neighbors with probability at least $1/2$. It holds that $v.c \neq \bot$ with probability $1/2$. Hence, $v$ terminates with probability at least $1/4$.
The probability that a specific node $v$ does not terminate within $r$ rounds is at most $(3/4)^r$. By the union bound, the probability that there exists a vertex $v \in V$ that does not terminate within $r$ rounds is at most $n (3/4)^r$. Hence, $\mathcal{A}_{col}$ terminates after $r = (c + 1) \cdot 4 \log n$ rounds with probability at least $1 - n (3/4)^r \ge 1 - 1/n^c$ (note that $\log(4/3) > 1/4$). ☐
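The final step of the proof can be checked numerically. The following is a minimal sketch, assuming that $\log$ denotes the natural logarithm (with base 2 the round budget only grows, so the inequality still holds); the function names are illustrative and not part of the paper.

```python
import math

def rounds_for_whp(n: int, c: int) -> int:
    """Round budget r = (c + 1) * 4 * log n from the proof (natural log assumed)."""
    return math.ceil((c + 1) * 4 * math.log(n))

def failure_bound(n: int, r: int) -> float:
    """Union bound n * (3/4)^r on the probability that some node has not terminated."""
    return n * (3 / 4) ** r

for n in (10, 1000, 10**6):
    for c in (1, 2, 3):
        r = rounds_for_whp(n, c)
        # The proof claims n * (3/4)^r <= 1 / n^c for this choice of r.
        assert failure_bound(n, r) <= n ** (-c)
        print(n, c, r, failure_bound(n, r))
```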

6.1. Fault Containment Time of Algorithm $\mathcal{A}_{col}$

There is a significant difference between the shared memory and the message passing model when analyzing the containment time. Firstly, a 1-faulty configuration also arises when a single message sent by a node $v$ is corrupted. Secondly, this may cause $v$’s neighbors to send messages they would not send in a legitimate configuration. Even though the states of nodes outside $G_v^r$ do not change, these nodes may be forced to send messages. Thus, in general the analysis of the containment time cannot be performed by considering $G_v^r$ only. This is only possible in cases when a fault at $v$ does not force nodes at distance $r + 1$ to send messages they would not send had the fault not occurred.
In the following the fault containment behavior of $\mathcal{A}_{col}$ for 1-faulty configurations is analyzed. Two types of transient errors are considered:
  • A single broadcast message sent by $v$ is corrupted. Note that the alternative of using $\delta(v)$ unicast messages instead of a single broadcast has very good fault containment behavior but is slower due to the handling of acknowledgements.
  • Memory corruption at node $v$, i.e., the value of at least one of the two variables of $v$ is corrupted.
The first case is treated analytically, whereas for the second case Markov chains and lumping (Theorem 2) are used. The independent degree $\delta_i(v)$ of a node $v$ is the size of a maximum independent set of $N(v)$. Let $\Delta_i(G) = \max \{ \delta_i(v) \mid v \in V \}$.

6.2. Message Corruption

If a message broadcast by $v$ contains a color $c_f$ different from $v.c$ or the value $\mathit{false}$ for variable $\mathit{final}$, then a message of the form $(c_f, \mathit{false})$ has no effect on any $w \in N(v)$ regardless of the value of $c_f$, since $w.\mathit{final} = \mathit{true}$ for all $w \in N(v)$. Thus, such a corrupted message has no effect at all. In order to compute the containment time for $\mathcal{A}_{col}$ we first compute the contamination radius.
Lemma 8.
The contamination radius of algorithm $\mathcal{A}_{col}$ after a single corruption of a broadcast message sent by node $v$ is 1. At most $\delta_i(v)$ nodes change their state during recovery.
Proof. 
It suffices to consider the case that $v$ broadcasts message $(c_f, \mathit{true})$ with $c_f \neq v.c$. Let $N_{\text{conflict}}(v) = \{ w \in N(v) \mid w.c = c_f \}$. The nodes in $N_{\text{conflict}}(v)$ form an independent set, because they all have the same color. Thus $|N_{\text{conflict}}(v)| \le \delta_i(v)$.
Let $u \in V \setminus N[v]$. This node continues to send $(u.c, \mathit{true})$ after the fault. Thus, a neighbor of $u$ that changes its color will not change its color to $u.c$. This yields that no neighbor of $u$ will ever send a message with $u.c$ as the first parameter. This is also true in case $u \in N(v) \setminus N_{\text{conflict}}(v)$. Hence, no node outside $N_{\text{conflict}}(v) \cup \{v\}$ will change its state, i.e., the contamination radius is 1.
Let $w \in N_{\text{conflict}}(v)$. When $w$ receives the faulty message it sets $w.\mathit{final}$ to $\mathit{false}$. Before the faulty message was sent no neighbor of $v$ had the same color as $v$. Thus, in the worst case a node $w \in N_{\text{conflict}}(v)$ will choose $v.c$ as its new color and send $(v.c, \mathit{false})$ to all neighbors. Since $v.\mathit{final} = \mathit{true}$, this will not force a state change at $v$. Thus, $v$ keeps broadcasting $(v.c, \mathit{true})$ and no neighbor $w$ of $v$ will ever reach the state $w.c = v.c$ and $w.\mathit{final} = \mathit{true}$. Hence, $v$ will never change its state. ☐
With this result Theorem 3 implies that the containment time of this fault is $O(\log \delta_i(v))$ in expectation. The following theorem bounds the expected containment time and its variance. Since the variance of a random variable is the expected value of the squared deviation from the mean, this theorem shows that the containment time does not deviate much from its expected value. A concrete bound can be obtained from this result using Chebyshev’s inequality.
Theorem 4.
The expected containment time of algorithm $\mathcal{A}_{col}$ after a corruption of a message broadcast by node $v$ is at most $\frac{1}{\ln 2} H_{\delta_i(v)} + 1/2$ rounds ($H_i$ is the $i$th harmonic number) with a variance of at most
$$\frac{1}{\ln^2 2} \sum_{i=1}^{\delta_i(v)} \frac{1}{i^2} + \frac{1}{4} \le \frac{\pi^2}{6 \ln^2 2} + \frac{1}{4} \approx 3.6737.$$
Proof. 
After receiving message $(c_f, \mathit{true})$ all nodes $w \in N_{\text{conflict}}(v)$ set $w.\mathit{final}$ to $\mathit{false}$ and, with equal probability, $w.c$ to ⊥ or to a random color $c_w \in \{0, 1, \ldots, \delta(w)\} \setminus w.\mathit{tabu}$. Note that $|w.\mathit{tabu}| \le \delta(w)$ because $w.\mathit{tabu} = \{ u.c \mid u \in N(w) \setminus v \} \cup \{c_f\}$. At the end of the round during which the corrupted message was received, node $w$ can choose $v$’s current color, because it may not be contained in the set $\mathit{tabu}$. This cannot happen in the following rounds. Thus, if $w$ chooses a color different from ⊥ in the following rounds then this color is different from the colors of all of $w$’s neighbors. Also in this case $w$ will terminate after the following round because then it will set $\mathit{final}$ to $\mathit{true}$. Thus, after one round $w$ has chosen a color that is different from the colors of all neighbors with probability at least $1/2$. Furthermore, this color will not change again. After one additional round $w$ reaches a legitimate state.
Let the random variable $X_d$ with $d = |N_{\text{conflict}}(v)|$ denote the number of rounds until the system has reached a legal coloring. For $w \in N_{\text{conflict}}(v)$ let $Y_w$ be the random variable denoting the number of rounds until $w$ has a legal coloring. By Lemma 8, $X_d = \max \{ Y_w \mid w \in N_{\text{conflict}}(v) \}$. For $i \ge 1$ let $G(i) = \mathrm{Prob}(X_d \le i) = \mathrm{Prob}(\max \{ Y_w \mid w \in N_{\text{conflict}}(v) \} \le i)$. Since the random variables $Y_w$ are independent, $G(i) = \mathrm{Prob}(X \le i)^{|N_{\text{conflict}}(v)|}$ where $X$ is a geometric random variable with $p = 0.5$. Thus $G(0) = 0$ and
$$G(i) = \Bigl( \sum_{j=1}^{i} p (1 - p)^{j-1} \Bigr)^{d}.$$
Then $E[X_d] = \sum_{i=1}^{\infty} i\, \mathrm{Prob}(X_d = i)$. Let $q = 1 - p$. Now for $i \ge 1$
$$\mathrm{Prob}(X_d = i) = G(i) - G(i-1) = (1 - q^i)^d - (1 - q^{i-1})^d$$
$$= \sum_{j=0}^{d} \binom{d}{j} (-1)^{j+1} (1 - q^j)\, q^{j(i-1)} = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \frac{1 - q^j}{q^j} (q^j)^i.$$
This implies
$$E[X_d] = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \frac{1 - q^j}{q^j} \sum_{i=1}^{\infty} i\, (q^j)^i = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \frac{1}{1 - q^j} = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \sum_{l=0}^{\infty} (q^j)^l = \sum_{l=0}^{\infty} \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} (q^l)^j = \sum_{l=0}^{\infty} \Bigl( 1 + \sum_{j=0}^{d} \binom{d}{j} (-1)^{j+1} (q^l)^j \Bigr) = \sum_{l=0}^{\infty} \bigl( 1 - (1 - q^l)^d \bigr).$$
The result follows from Lemma 9. The expression for the variance is proved in Lemma 10. ☐
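As a worked illustration of how Chebyshev’s inequality turns the variance bound into a concrete tail bound (the threshold $t = 4$ is an arbitrary choice made here, not a value from the paper):

```latex
\[
  \mathrm{Prob}\bigl( |X_d - E[X_d]| \ge t \bigr)
  \;\le\; \frac{\mathrm{Var}[X_d]}{t^2}
  \;\le\; \frac{3.6737}{t^2},
  \qquad \text{e.g. } t = 4:\quad \frac{3.6737}{16} \approx 0.23 .
\]
```

Under this bound the containment time exceeds its expectation by four or more rounds with probability at most about 0.23, independently of $\delta_i(v)$.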
Lemma 9.
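For fixed $0 < q < 1$ and fixed $d \ge 1$
$$\sum_{l=0}^{\infty} \bigl( 1 - (1 - q^l)^d \bigr) \in \Bigl[ -\frac{1}{\ln q} H_d,\; -\frac{1}{\ln q} H_d + 1 \Bigr] \quad \text{and} \quad \sum_{l=0}^{\infty} \bigl( 1 - (1 - q^l)^d \bigr) \approx -\frac{1}{\ln q} H_d + \frac{1}{2}.$$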
Proof. 
The function $f(x) = 1 - (1 - q^x)^d$ is, for fixed $d$, decreasing for $x \ge 0$. Furthermore, $f(0) = 1$. Hence,
$$\sum_{l=0}^{\infty} \bigl( 1 - (1 - q^l)^d \bigr) \ge \int_0^{\infty} f(x)\,dx \ge \sum_{l=0}^{\infty} \bigl( 1 - (1 - q^l)^d \bigr) - 1.$$
Using the substitution $u = 1 - q^x$ the integral becomes
$$-\frac{1}{\ln q} \int_0^1 \frac{1 - u^d}{1 - u}\,du = -\frac{1}{\ln q} \int_0^1 \sum_{i=0}^{d-1} u^i\,du = -\frac{1}{\ln q} \sum_{i=1}^{d} \frac{1}{i} = -\frac{1}{\ln q} H_d.$$
Approximating $\int_i^{i+1} f(x)\,dx$ with $(f(i) + f(i+1))/2$ yields
$$\sum_{l=0}^{\infty} \bigl( 1 - (1 - q^l)^d \bigr) \approx \int_0^{\infty} f(x)\,dx + \frac{f(0)}{2} = -\frac{1}{\ln q} H_d + \frac{1}{2}.$$
 ☐
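The interval and the approximation of Lemma 9 can be checked numerically. The sketch below sums the series in floating point and compares it with the harmonic-number expressions for $q = 1/2$; the function names and the truncation tolerance are illustrative choices.

```python
import math

def expected_rounds_series(d: int, q: float = 0.5, tol: float = 1e-15) -> float:
    """Sum_{l >= 0} (1 - (1 - q^l)^d), truncated once the terms drop below tol."""
    total, l = 0.0, 0
    while True:
        term = 1.0 - (1.0 - q ** l) ** d
        if term < tol:
            return total
        total += term
        l += 1

def harmonic(d: int) -> float:
    """d-th harmonic number H_d."""
    return sum(1.0 / i for i in range(1, d + 1))

for d in (1, 2, 5, 20, 100):
    exact = expected_rounds_series(d)
    lower = -harmonic(d) / math.log(0.5)  # -(1 / ln q) * H_d with q = 1/2
    # Lemma 9: the series lies in [lower, lower + 1] and is close to lower + 1/2.
    assert lower <= exact <= lower + 1
    print(d, round(exact, 4), round(lower + 0.5, 4))
```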
Lemma 10.
For $d > 0$ the variance $\mathrm{Var}[X_d]$ of the containment time is at most
$$\frac{1}{\ln^2 2} \sum_{i=1}^{d} \frac{1}{i^2} + \frac{1}{4} \le \frac{\pi^2}{6 \ln^2 2} + \frac{1}{4} \approx 3.6737.$$
Proof. 
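$$\mathrm{Var}[X_d] = E[X_d^2] - E[X_d]^2 = \sum_{i=1}^{\infty} i^2\, \mathrm{Prob}(X = i) - E[X_d]^2.$$
By Lemma 11
$$\sum_{i=1}^{\infty} i^2\, \mathrm{Prob}(X = i) = \sum_{l=0}^{\infty} (2l + 1) \bigl( 1 - (1 - q^l)^d \bigr) = 2 \sum_{l=1}^{\infty} l \bigl( 1 - (1 - q^l)^d \bigr) + E[X_d].$$
Now Lemma 12 and Theorem 4 yield
$$\mathrm{Var}[X_d] \le \frac{2}{\ln^2 2} \sum_{i=1}^{d} \frac{H_i}{i} + E[X_d] - E[X_d]^2 \le \frac{2}{\ln^2 2} \sum_{i=1}^{d} \frac{H_i}{i} + \frac{H_d}{\ln 2} + \frac{1}{2} - \Bigl( \frac{H_d}{\ln 2} + \frac{1}{2} \Bigr)^2 = \frac{1}{\ln^2 2} \Bigl( 2 \sum_{i=1}^{d} \frac{H_i}{i} - H_d^2 \Bigr) + \frac{1}{4}$$
and
$$2 \sum_{i=1}^{d} \frac{H_i}{i} - H_d^2 = 2 \sum_{i=1}^{d} \sum_{j=1}^{i} \frac{1}{ij} - \Bigl( 1 + \frac{1}{2} + \cdots + \frac{1}{d} \Bigr)^2 = 2 \sum_{i=1}^{d} \frac{1}{i^2} + 2 \sum_{i=1}^{d} \sum_{j=1}^{i-1} \frac{1}{ij} - 2 \sum_{i=1}^{d} \sum_{j=1}^{i-1} \frac{1}{ij} - \sum_{i=1}^{d} \frac{1}{i^2} = \sum_{i=1}^{d} \frac{1}{i^2} \le \frac{\pi^2}{6}.$$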
 ☐
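The variance bound can be compared with the exact variance of the maximum of $d$ independent geometric random variables. This is a minimal sketch for $q = 1/2$, assuming the distribution $\mathrm{Prob}(X_d = i) = (1 - q^i)^d - (1 - q^{i-1})^d$ from the proof of Theorem 4; the truncation point `max_i` and all function names are illustrative.

```python
import math

def distribution(d, q=0.5, max_i=200):
    """P(X_d = i) = (1 - q^i)^d - (1 - q^(i-1))^d for i = 1..max_i."""
    return [(1 - q ** i) ** d - (1 - q ** (i - 1)) ** d for i in range(1, max_i + 1)]

def mean_and_variance(d, q=0.5):
    """Mean and variance of X_d, computed directly from the (truncated) distribution."""
    probs = distribution(d, q)
    mean = sum(i * p for i, p in enumerate(probs, start=1))
    second = sum(i * i * p for i, p in enumerate(probs, start=1))
    return mean, second - mean ** 2

def variance_bound(d):
    """(1 / ln^2 2) * sum_{i <= d} 1 / i^2 + 1/4, the bound stated for q = 1/2."""
    return sum(1 / i ** 2 for i in range(1, d + 1)) / math.log(2) ** 2 + 0.25

for d in (1, 2, 5, 20):
    mean, var = mean_and_variance(d)
    assert var <= variance_bound(d)
    print(d, round(mean, 4), round(var, 4), round(variance_bound(d), 4))
```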
Lemma 11.
Let $d > 0$, $q \in (0, 1)$ and $\mathrm{Prob}(X = i) = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \frac{1 - q^j}{q^j} (q^j)^i$. Then
$$\sum_{i=1}^{\infty} i^2\, \mathrm{Prob}(X = i) = \sum_{l=0}^{\infty} (2l + 1) \bigl( 1 - (1 - q^l)^d \bigr).$$
Proof. 
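$$\sum_{i=1}^{\infty} i^2\, \mathrm{Prob}(X = i) = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \frac{1 - q^j}{q^j} \sum_{i=1}^{\infty} i^2 (q^j)^i = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \frac{1 - q^j}{q^j} \Bigl( \frac{2 q^{2j}}{(1 - q^j)^3} + \frac{q^j}{(1 - q^j)^2} \Bigr) = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \Bigl( \frac{2 q^j}{(1 - q^j)^2} + \frac{1}{1 - q^j} \Bigr) = \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} \sum_{l=0}^{\infty} (2l + 1) (q^j)^l = \sum_{l=0}^{\infty} (2l + 1) \sum_{j=1}^{d} \binom{d}{j} (-1)^{j+1} (q^l)^j = \sum_{l=0}^{\infty} (2l + 1) \bigl( 1 - (1 - q^l)^d \bigr).$$
For the first equation we refer to the proof of Theorem 4. The second equality makes use of
$$\sum_{i=1}^{\infty} i^2 x^i = \frac{2 x^2}{(1 - x)^3} + \frac{x}{(1 - x)^2}$$
and the fourth equality uses the following two identities
$$\sum_{l=0}^{\infty} x^l = \frac{1}{1 - x} \quad \text{and} \quad \sum_{l=0}^{\infty} l\, x^l = \frac{x}{(1 - x)^2}.$$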
 ☐
Lemma 12.
Let $d > 0$ and $q \in (0, 1)$. Then
$$\sum_{l=1}^{\infty} l \bigl( 1 - (1 - q^l)^d \bigr) \approx \frac{1}{(\ln q)^2} \sum_{i=1}^{d} \frac{H_i}{i}.$$
Proof. 
We approximate $\sum_{l=1}^{\infty} l \bigl( 1 - (1 - q^l)^d \bigr)$ with $\int_0^{\infty} x \bigl( 1 - (1 - q^x)^d \bigr)\,dx$. Note that $x \bigl( 1 - (1 - q^x)^d \bigr)$ has a single local maximum in the interval $[0, \infty)$. If the local maximum is within the interval $[y, y + 1]$ with $y \in \mathbb{N}$ then the error is
$$\int_y^{y+1} x \bigl( 1 - (1 - q^x)^d \bigr)\,dx.$$
This leads to a small overestimation of the sum as Figure 6b shows.
$$\sum_{l=1}^{\infty} l \bigl( 1 - (1 - q^l)^d \bigr) \approx \int_0^{\infty} x \bigl( 1 - (1 - q^x)^d \bigr)\,dx = -\frac{1}{(\ln q)^2} \int_0^1 \frac{\ln(1 - u)\,(1 - u^d)}{1 - u}\,du = -\frac{1}{(\ln q)^2} \sum_{i=0}^{d-1} \int_0^1 \ln(1 - u)\, u^i\,du = \frac{1}{(\ln q)^2} \sum_{i=1}^{d} \frac{H_i}{i}.$$
The first equation uses the substitution $u = 1 - q^x$. The final result is based on the following identity
$$\int_0^1 \ln(1 - u)\, u^i\,du = -\frac{H_{i+1}}{i + 1}.$$
 ☐
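To get a feeling for the size of the overestimation, the sum and the closed form of the integral can be evaluated side by side. The sketch below does this for $q = 1/2$; the function names and the truncation point `max_l` are illustrative assumptions.

```python
import math

def weighted_tail_sum(d, q=0.5, max_l=400):
    """Left-hand side: sum_{l >= 1} l * (1 - (1 - q^l)^d), truncated at max_l."""
    return sum(l * (1 - (1 - q ** l) ** d) for l in range(1, max_l + 1))

def integral_value(d, q=0.5):
    """Right-hand side: (1 / ln(q)^2) * sum_{i=1}^{d} H_i / i."""
    h, total = 0.0, 0.0
    for i in range(1, d + 1):
        h += 1 / i          # h is now the harmonic number H_i
        total += h / i
    return total / math.log(q) ** 2

for d in (1, 2, 5, 20):
    lhs, rhs = weighted_tail_sum(d), integral_value(d)
    print(d, round(lhs, 4), round(rhs, 4), round(rhs - lhs, 4))
```

In these runs the integral should exceed the sum by a small amount, which matches the remark about a small overestimation.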
Theorem 4 gives an upper bound for the containment time of algorithm $\mathcal{A}_{col}$ and for its variance. To evaluate the quality of these upper bounds we modeled the behavior of this fault situation as a Markov chain and computed $E[X_d]$ and $\mathrm{Var}[X_d]$ with a software package for symbolic mathematics, using Theorem 3.3.5 from [12]. These computations showed that $\frac{1}{\ln 2} H_d + 1/2$ matches $E[X_d]$ very well and that $E[X_d]$ grows logarithmically in $d$ (see Figure 6a, which compares the computed value with $\log d$). Furthermore, the gap between $\mathrm{Var}[X_d]$ and the bound given in Theorem 4 is less than 0.2 (see Figure 6b).
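The text does not name the software package or spell out the chain. As a minimal sketch of one natural model, assume the simplified process from the proof of Theorem 4: each node of $N_{\text{conflict}}(v)$ independently finishes with probability $p = 1/2$ per round, and state $k$ counts the unfinished nodes. The recursion below is the first-step form of the row sums of the fundamental matrix; all names are illustrative.

```python
from fractions import Fraction
from math import comb

def expected_rounds_chain(d: int, p: Fraction = Fraction(1, 2)) -> Fraction:
    """Expected absorption time from state d, where state k counts unfinished nodes.

    From state k, exactly j nodes remain unfinished with probability
    C(k, j) * (1 - p)^j * p^(k - j); state 0 is absorbing.
    """
    q = 1 - p
    t = [Fraction(0)] * (d + 1)  # t[0] = 0: nothing left to repair
    for k in range(1, d + 1):
        stay = q ** k  # all k nodes fail to finish in this round
        rest = sum(comb(k, j) * q ** j * p ** (k - j) * t[j] for j in range(1, k))
        t[k] = (1 + rest) / (1 - stay)
    return t[d]

def expected_rounds_closed_form(d: int, q: Fraction = Fraction(1, 2)) -> Fraction:
    """sum_{j=1}^{d} C(d, j) * (-1)^(j+1) / (1 - q^j), as in the proof of Theorem 4."""
    return sum(comb(d, j) * (-1) ** (j + 1) / (1 - q ** j) for j in range(1, d + 1))

for d in (1, 2, 3, 8):
    assert expected_rounds_chain(d) == expected_rounds_closed_form(d)
    print(d, expected_rounds_chain(d), float(expected_rounds_chain(d)))
```

Exact rational arithmetic is used so that the chain and the closed form can be compared for equality rather than up to rounding.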

6.3. Memory Corruption

This section demonstrates the use of Markov chains and the application of Theorem 2. We consider the case that the memory of a single node $v$ is hit by a fault. The analysis breaks down the stabilizing executions into several phases and then computes the expected time for each of them. First we look at the case that the fault causes variable $v.\mathit{final}$ to change to $\mathit{false}$. If $v.c$ does not change, then a legitimate configuration is reached after one round. So assume $v.c$ also changes. Then the fault will not affect other nodes. This is because no $w \in N(v)$ will change its value of $w.c$ since $w.\mathit{final} = \mathit{true}$ and $v.\mathit{final} = \mathit{false}$. Thus, with probability at least $1/2$ node $v$ will choose in the next round a color different from the colors of all neighbors and terminate one round later. Similar to $X_d$ let random variable $Z_d$ denote the number of rounds until a legal coloring is reached ($d = |N_{\text{conflict}}(v)|$). It is easy to verify that $E[Z_d] = 3$ in this case.
The last case is that only variable $v.c$ is affected (i.e., $v.\mathit{final}$ remains $\mathit{true}$). The main difference to the case of a corrupted message is that this fault persists until $v.c$ has a legitimate value again. Let $c_f$ be the corrupted value of $v.c$ and let $N_{\text{conflict}}(v) = \{ w \in N(v) \mid w.c = c_f \}$. A node outside $S = N_{\text{conflict}}(v) \cup \{v\}$ will not change its state (cf. Lemma 8). Thus, the contamination radius is 1 and at most $\delta_i(v) + 1$ nodes change state. Let $d = |N_{\text{conflict}}(v)|$. The subgraph $G_S$ induced by $S$ is a star graph with $d + 1$ nodes and center $v$.
Lemma 13.
To bound $E[Z_d]$ we may assume that $w$ can choose a color from $\{0, 1\} \setminus \mathit{tabu}$ with $\mathit{tabu} = \emptyset$ if $v.\mathit{final} = \mathit{false}$ and $\mathit{tabu} = \{v.c\}$ otherwise, and that $v$ can choose a color from $\{0, 1, \ldots, d\} \setminus \mathit{tabu}$ with $\mathit{tabu} \subseteq \{0, 1\}$.
Proof. 
When a node $u \in S$ chooses a color with function randomColor the color is randomly selected from $C_u = \{0, 1, \ldots, \delta(v)\} \setminus \mathit{tabu}$. Thus, if $w$ and $v$ choose colors in the same round, the probability that the chosen colors coincide is $|C_w \cap C_v| / (|C_w|\,|C_v|)$. This value is maximal if $|C_w \cap C_v|$ is maximal and $|C_w|\,|C_v|$ is minimal. This is achieved when $C_w \subseteq C_v$ and $C_v$ is minimal (independent of the size of $C_w$) or vice versa. Thus, without loss of generality we can assume that $C_w \subseteq C_v$ and both sets are minimal. Thus, for $w \in N_{\text{conflict}}(v)$ the nodes in $N(w) \setminus \{v\}$ already use all colors from $\{0, 1, \ldots, \delta(v)\}$ but 0 and 1, and all nodes in $N(v) \setminus N_{\text{conflict}}(v)$ already use all colors from $\{0, 1, \ldots, \delta(v)\}$ but $0, 1, \ldots, d$. Hence, a node $w \in N_{\text{conflict}}(v)$ can choose a color from $\{0, 1\} \setminus \mathit{tabu}$ with $\mathit{tabu} = \emptyset$ if $v.\mathit{final} = \mathit{false}$ and $\mathit{tabu} = \{v.c\}$ otherwise. Furthermore, $v$ can choose a color from $\{0, 1, \ldots, d\} \setminus \mathit{tabu}$ with $\mathit{tabu} \subseteq \{0, 1\}$. In this case $\mathit{tabu} = \emptyset$ if $w.\mathit{final} = \mathit{false}$ for all $w \in N_{\text{conflict}}(v)$. ☐
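The coincidence probability used in this argument is easy to confirm by enumeration. The sketch below assumes two independent uniform picks from the candidate sets and ignores the additional choice of ⊥; the function names and the example sets are illustrative.

```python
from fractions import Fraction
from itertools import product

def collision_probability(c_w: set, c_v: set) -> Fraction:
    """|C_w & C_v| / (|C_w| * |C_v|) for independent uniform picks from both sets."""
    return Fraction(len(c_w & c_v), len(c_w) * len(c_v))

def collision_by_enumeration(c_w: set, c_v: set) -> Fraction:
    """Same probability obtained by listing all equally likely pairs of picks."""
    pairs = list(product(c_w, c_v))
    return Fraction(sum(a == b for a, b in pairs), len(pairs))

nested = collision_probability({0, 1}, {0, 1, 2})         # C_w is a subset of C_v
overlapping = collision_probability({0, 1, 5}, {0, 1, 2})  # same overlap, larger C_w
assert nested == collision_by_enumeration({0, 1}, {0, 1, 2})
assert nested > overlapping
print(nested, overlapping)
```

Shrinking $C_w$ to a subset of $C_v$ keeps the intersection while reducing the denominator, which is the worst case the proof reduces to.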
Thus, in order to bound the expected number of rounds to reach a legitimate state after a memory corruption we can assume that $G = G_S$ and $u.\mathit{final} = \mathit{true}$ and $u.c = 0$ (i.e., $c_f = 0$) for all $u \in S$. After one round $u.\mathit{final} = \mathit{false}$ for all $u \in S$. To apply Theorem 2 the set $\Sigma$ of all configurations is partitioned into $d + 4$ subsets as follows:
  • $I$: Represents the faulty state with $u.c = 0$ and $u.\mathit{final} = \mathit{true}$ for all $u \in S$.
  • $C_i$: Node $v$ and exactly $d - i$ non-center nodes will not be in a legitimate state after the following round ($0 \le i \le d$). In particular $v.\mathit{final} = \mathit{false}$ and $w.c = v.c$ or $v.c = w.c = \bot$ for exactly $d - i$ non-center nodes $w$.
  • $P$: Node $v$ has not reached a legitimate state but will do so in the next round. In particular $v.\mathit{final} = \mathit{false}$ and $v.c \neq w.c$ for all non-center nodes $w$.
  • $F$: Node $v$ is in a legitimate state, i.e., $v.\mathit{final} = \mathit{true}$ and $v.c \neq w.c$ for all non-center nodes $w$, but $w.c$ may be equal to ⊥.
Note that I is the initial and F the absorbing state of the lumped Markov chain. Also when the system is in state F, then it is not necessarily in a legitimate state. This state reflects the set of configurations considered in the last section.
Lemma 14.
Table 1 describes the transition probabilities of the lumped Markov chain.
Proof. 
We consider the transitions separately, in the order in which they are listed in Table 1. The last two cases are trivial.
  • Note that $u.\mathit{final} = \mathit{true}$ and $u.c = 0$ for all $u \in S$.
    Case 0: $v.c = \bot$. Impossible.
    Case 1: $v.c = 0$. Impossible, since non-center nodes have $c = 0$ and $\mathit{final} = \mathit{true}$.
    Case 2: $v.c = 1$. This happens with probability $1/(2d)$. All non-center nodes $w$ choose $w.c = \bot$; this happens with probability $1/2^d$.
    Case 3: $v.c > 1$. This happens with probability $(d - 1)/(2d)$. Non-center nodes can make any choice. This gives the total probability for this transition as $(d - 1)/(2d) + 1/(d\,2^{d+1})$.
  • Note that $u.\mathit{final} = \mathit{true}$ and $u.c = 0$ for all $u \in S$.
    Case 0: $v.c = \bot$. Non-center nodes choose $c = \bot$. The case has probability $1/2^{d+1}$.
    Case 1: $v.c = 0$. Impossible (see transition $I \to P$).
    Case 2: $v.c = 1$. At least one non-center node $w$ chooses $w.c = 1$, all others choose $w.c = \bot$. This case has probability $\frac{1}{2d} \sum_{l=1}^{d} \binom{d}{l} \frac{1}{2^d} = \frac{1}{2d} \cdot \frac{1}{2^d} (2^d - 1)$.
    Case 3: $v.c > 1$. This case is impossible.
  • Note that $u.\mathit{final} = \mathit{true}$ for all $u \in S$.
    Case 1: $v.c = 0$. This happens with probability $1/(2(d - i + 1))$. None of the $d - i$ non-center nodes $w$ sets $w.c = 0$; this has probability $(3/4)^{d-i}$.
    Case 2: $v.c = 1$. Similar to Case 1.
    Case 3: $v.c > 1$ (requires $d - i > 1$). This happens with probability $\frac{d - i - 1}{2(d - i + 1)}$. Non-center nodes can make any choice.
  • Note that $u.\mathit{final} = \mathit{false}$ for all $u \in S$.
    Case 1: $v.c = \bot$. This happens with probability $\frac{1}{2}$. Here $d - j$ non-center nodes choose $c = \bot$ (with probability $1/2^{d-j}$), the other $j - i$ non-center nodes choose $c \neq \bot$ (with probability $1/2^{j-i}$). The total probability for this case is $\binom{d-i}{d-j} \frac{1}{2^{d-i+1}}$.
    Case 2: $v.c = 0$. This happens with probability $1/(2(d - i + 1))$. Exactly $j - i$ non-center nodes choose $c = 1$ (with probability $1/4^{j-i}$), $1 \le l \le d - j$ non-center nodes choose $c = 0$ (with probability $1/4^{l}$) and all other non-center nodes choose $c = \bot$ (with probability $1/2^{d-j-l}$). The total probability for this case is
    $$\frac{1}{2(d-i+1)} \binom{d-i}{j-i} \frac{1}{4^{j-i}} \sum_{l=1}^{d-j} \binom{d-j}{l} \frac{1}{4^{l}} \frac{1}{2^{d-j-l}} = \frac{1}{2(d-i+1)} \binom{d-i}{j-i} \frac{1}{4^{j-i}} \sum_{l=1}^{d-j} \binom{d-j}{l} \frac{1}{2^{d-j+l}} = \frac{1}{2(d-i+1)} \binom{d-i}{j-i} \frac{1}{4^{j-i}} \frac{1}{2^{d-j}} \sum_{l=1}^{d-j} \binom{d-j}{l} \frac{1}{2^{l}} = \frac{1}{2(d-i+1)} \binom{d-i}{j-i} \frac{1}{2^{d+j-2i}} \sum_{l=1}^{d-j} \binom{d-j}{l} \frac{1}{2^{l}} = \frac{1}{2(d-i+1)} \binom{d-i}{j-i} \frac{1}{2^{d+j-2i}} \Bigl( \Bigl( \frac{3}{2} \Bigr)^{d-j} - 1 \Bigr) = \frac{1}{2(d-i+1)} \binom{d-i}{j-i} \frac{1}{4^{d-i}} \bigl( 3^{d-j} - 2^{d-j} \bigr).$$
    Case 3: $v.c = 1$. Similar to Case 2.
    Case 4: $v.c > 1$. This does not lead to $C_j$ but to $P$.
  • Note that $u.\mathit{final} = \mathit{false}$ for all $u \in S$.
    Case 1: $v.c = 0$. This happens with probability $1/(2(d - i + 1))$. None of the $d - i$ non-center nodes $w$ sets $w.c = 0$; this has probability $(3/4)^{d-i}$.
    Case 2: $v.c = 1$. Similar to Case 1.
    Case 3: $v.c > 1$. This happens with probability $\frac{d - i - 1}{2(d - i + 1)}$. Note $d > i$. Non-center nodes can make any choice.
 ☐
We first calculate the expected number $E[A_d]$ of rounds to reach the absorbing state $F$. With Theorem 4 this will enable us to compute the expected number $E[Z_d]$ of rounds required to reach a legitimate system state. To build the transition matrix of the lumped Markov chain the $d + 4$ states are ordered as $I, C_0, C_1, \ldots, C_d, P, F$. Let $Q$ be the $(d + 3) \times (d + 3)$ upper left submatrix of the transition matrix $P$. For $s = -1, 0, 1, \ldots, d + 1$ denote by $Q_s$ the $(s + 2) \times (s + 2)$ lower right submatrix of $Q$, i.e., $Q = Q_{d+1}$. Denote by $N_s$ the fundamental matrix of $Q_s$ (notation as introduced in Section 5). Let $\mathbf{1}_s$ be the column vector of length $s + 2$ whose entries are all 1 and $\epsilon_s = N_s \mathbf{1}_s$; by abuse of notation $\epsilon_s$ also denotes the first entry of this vector. For $s = 0, \ldots, d$, $\epsilon_s$ is the expected number of rounds to reach state $F$ from state $C_{d-s}$ and $\epsilon_{d+1}$ is the expected number of rounds to reach state $F$ from $I$, i.e., $\epsilon_{d+1} = E[A_d]$ (Theorem 3.3.5, [12]). Identifying $P$ with $C_{d+1}$ we have $\epsilon_{-1} = 1$.
Lemma 15.
The expected number $E[A_d]$ of rounds to reach $F$ from $I$ is at most 5 and the variance is at most 3.6.
Proof. 
Note that $Q_s$ and $N_s$ are upper triangular matrices. Let
$$E_i - Q_i = \begin{pmatrix} 1 - a_1 & -a_2 \;\cdots\; -a_{i+2} \\ \mathbf{0} & E_{i-1} - Q_{i-1} \end{pmatrix}, \qquad N_i = \begin{pmatrix} x_1 & x_2 \;\cdots\; x_{i+2} \\ \mathbf{0} & N_{i-1} \end{pmatrix}.$$
$E_i = (E_i - Q_i) N_i$ gives rise to $(i + 2)^2$ equations. Adding up the $i + 2$ equations for the first row of $E_i$ results in
$$\epsilon_i = (1 - a_1)^{-1} \Bigl( 1 + \sum_{l=2}^{i+2} a_l\, \epsilon_{i+1-l} \Bigr). \tag{1}$$
It is straightforward to verify that $\epsilon_{-1} = 1$ and $\epsilon_0 = 3$. Hence
$$\epsilon_i = (1 - a_1)^{-1} \Bigl( 1 + \sum_{l=2}^{i} a_l\, \epsilon_{i+1-l} + 3 a_{i+1} + a_{i+2} \Bigr).$$
Next we show by induction on $i$ that $\epsilon_i \le 4$ for $i = -1, 0, 1, \ldots, d$. So assume that $\epsilon_l \le 4$ for $l = -1, 0, 1, \ldots, i - 1$ with $i < d$. Then
$$\epsilon_i \le (1 - a_1)^{-1} \Bigl( 1 + 4 \sum_{l=2}^{i} a_l + 3 a_{i+1} + a_{i+2} \Bigr)$$
since $a_i \ge 0$. Using the fact $1 - a_1 = \sum_{l=2}^{i+2} a_l$ this inequality becomes
$$\epsilon_i \le (1 - a_1)^{-1} \bigl( 1 + 4 (1 - a_1 - a_{i+1} - a_{i+2}) + 3 a_{i+1} + a_{i+2} \bigr) = 4 + \frac{1 - a_{i+1} - 3 a_{i+2}}{1 - a_1}.$$
Coefficient $a_j$ denotes the transition probability from $C_{d-i}$ to $C_{d+j-(i+1)}$ for $j = 1, \ldots, i + 1$ and $a_{i+2}$ that for changing from $C_{d-i}$ to $P$. For $i \le d$ the following values from Lemma 14 are used:
$$a_l = \binom{i}{i+1-l} \frac{1}{2^{i+1}} + \frac{1}{i+1} \binom{i}{l-1} \frac{1}{4^{i}} \bigl( 3^{i+1-l} - 2^{i+1-l} \bigr), \qquad a_{i+1} = \frac{1}{2^{i+1}} \quad \text{and} \quad a_{i+2} = \frac{1}{i+1} \Bigl( \frac{3}{4} \Bigr)^{i} + \frac{i-1}{2(i+1)}.$$
Thus,
$$3 a_{i+2} = \frac{3}{i+1} \Bigl( \frac{3}{4} \Bigr)^{i} + \frac{3(i-1)}{2(i+1)} > 1$$
holds for $i \ge 2$. This yields
$$\frac{1 - a_{i+1} - 3 a_{i+2}}{1 - a_1} < 0$$
and therefore $\epsilon_i \le 4$. To bound $\epsilon_{d+1}$ we use Equation (1) with $i = d + 1$. Note that in this case $a_1 = 0$ since a transition from $I$ to itself is impossible. Hence
$$E[A_d] = \epsilon_{d+1} = 1 + \sum_{l=2}^{d+3} a_l\, \epsilon_{d+2-l} \le 1 + 4 \sum_{l=2}^{d+3} a_l = 5.$$
Thus, with the square taken entrywise (Theorem 3.3.5, [12]),
$$\mathrm{Var}[A_d] = \bigl( (2 N_{d+1} - E_{d+1})\, N_{d+1} \mathbf{1}_{d+1} - (N_{d+1} \mathbf{1}_{d+1})^{2} \bigr)[1] = 2 \sum_{i=1}^{d+3} x_i\, \epsilon_{d+2-i} - \epsilon_{d+1} - \epsilon_{d+1}^2.$$
Figure 6c shows that $\mathrm{Var}[A_d] \le 3.6$. ☐
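The lumped chain is small enough to evaluate exactly. The following is a minimal sketch that builds the transition probabilities of Table 1 with rational arithmetic and computes the mean and variance of the number of rounds until $F$ by first-step analysis, which gives the same values as the fundamental-matrix formulas. The function names and the dictionary representation are illustrative assumptions, not the author's implementation.

```python
from fractions import Fraction
from math import comb

def lumped_chain(d):
    """Transition probabilities of Table 1 as a dict: P[s][t] = Prob(s -> t)."""
    states = ["I"] + [f"C{i}" for i in range(d + 1)] + ["P", "F"]
    P = {s: {} for s in states}
    P["I"]["P"] = Fraction(d - 1, 2 * d) + Fraction(1, d * 2 ** (d + 1))
    P["I"]["C0"] = Fraction(d - 1, d * 2 ** (d + 1)) + Fraction(1, 2 * d)
    for j in range(1, d + 1):
        P["I"][f"C{j}"] = Fraction(comb(d, j), 2 ** (d + 1))
    for i in range(d + 1):
        m = d - i  # in state C_i, d - i non-center nodes are not yet legitimate
        for j in range(i, d + 1):
            P[f"C{i}"][f"C{j}"] = Fraction(comb(m, d - j), 2 ** (m + 1)) + Fraction(
                comb(m, j - i) * (3 ** (d - j) - 2 ** (d - j)), (m + 1) * 4 ** m)
        # Row 5 of Table 1; for i = d it evaluates to 1/2, which is row 6.
        P[f"C{i}"]["P"] = Fraction(3 ** m, (m + 1) * 4 ** m) + Fraction(m - 1, 2 * (m + 1))
    P["P"]["F"] = Fraction(1)
    P["F"]["F"] = Fraction(1)
    return states, P

def absorption_moments(states, P):
    """Mean and second moment of the number of rounds until F, by first-step analysis."""
    mean = {"F": Fraction(0)}
    second = {"F": Fraction(0)}
    for s in reversed(states[:-1]):  # P, C_d, ..., C_0, I: successors are solved first
        stay = P[s].get(s, Fraction(0))
        others = [(t, p) for t, p in P[s].items() if t != s]
        mean[s] = (1 + sum(p * mean[t] for t, p in others)) / (1 - stay)
        next_mean = stay * mean[s] + sum(p * mean[t] for t, p in others)
        second[s] = (1 + 2 * next_mean + sum(p * second[t] for t, p in others)) / (1 - stay)
    return mean, second

for d in (1, 2, 5, 10):
    states, P = lumped_chain(d)
    assert all(sum(P[s].values()) == 1 for s in states)  # every row is a distribution
    mean, second = absorption_moments(states, P)
    print(d, float(mean["I"]), float(second["I"] - mean["I"] ** 2))
```

Because the chain only moves forward in the state ordering (apart from self-loops), a single backward pass suffices and no matrix inversion is needed.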
Lemma 16.
The expected containment time after a memory corruption at node $v$ is at most $\frac{1}{\ln 2} H_{\delta_i(v)} + 11/2$ with variance less than 7.5.
Proof. 
For a set $X$ of configurations and a single configuration $c$ denote by $E(c, X)$ the expected number of transitions from $c$ to a state in $X$. Let $L$ be the set of legitimate system states. Then
$$\begin{aligned} E(I, L) &= \sum_{e \in T(I, L)} l(e)\, p(e) = \sum_{x \in F} \sum_{e_1 \in T(I, x)} \sum_{e_2 \in T(x, L)} \bigl( l(e_1) + l(e_2) \bigr)\, p(e_1)\, p(e_2) \\ &= \sum_{x \in F} \sum_{e_1 \in T(I, x)} \Bigl( l(e_1)\, p(e_1) \sum_{e_2 \in T(x, L)} p(e_2) + p(e_1) \sum_{e_2 \in T(x, L)} l(e_2)\, p(e_2) \Bigr) \\ &= \sum_{x \in F} \sum_{e_1 \in T(I, x)} \bigl( l(e_1)\, p(e_1) + p(e_1)\, E(x, L) \bigr) = \sum_{x \in F} \Bigl( E(I, x) + \sum_{e_1 \in T(I, x)} p(e_1)\, E(x, L) \Bigr) \\ &\le E(I, F) + \max \{ E(x, L) \mid x \in F \} \sum_{e_1 \in T(I, F)} p(e_1) = E(I, F) + \max \{ E(x, L) \mid x \in F \} \\ &\le 5 + \frac{1}{\ln 2} H_{\delta_i(v)} + \frac{1}{2}. \end{aligned}$$
The last step uses Theorem 4 and Lemma 15. The bound on the variance is proved similarly. ☐
Theorems 3 and 4 together with Lemmas 8 and 16 prove the following theorem.
Theorem 5.
$\mathcal{A}_{col}$ is a self-stabilizing algorithm for computing a $(\Delta + 1)$-coloring in the synchronous model within $O(\log n)$ time with high probability. It uses messages of size $O(\log n)$ and requires $O(\log n)$ storage per node. With respect to memory and message corruption it has contamination radius 1. The expected containment time is at most $\frac{1}{\ln 2} H_{\Delta_i} + 11/2$ with variance less than 7.5.
Corollary 1.
Algorithm $\mathcal{A}_{col}$ has expected containment time $O(1)$ for bounded-independence graphs. For unit disc graphs this time is at most 8.8.
Proof. 
For these graphs $\Delta_i \in O(1)$, in particular $\Delta_i \le 5$ for unit disc graphs. ☐
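The constant for unit disc graphs follows by inserting $\Delta_i = 5$ into the bound of Theorem 5. A minimal sketch of this arithmetic (the function name is illustrative):

```python
import math
from fractions import Fraction

def containment_bound(delta_i: int) -> float:
    """Upper bound (1 / ln 2) * H_{delta_i} + 11/2 from Theorem 5."""
    h = sum(Fraction(1, k) for k in range(1, delta_i + 1))  # harmonic number H_{delta_i}
    return float(h) / math.log(2) + 5.5

# Unit disc graphs have independent degree at most 5.
print(round(containment_bound(5), 3))
assert containment_bound(5) < 8.8
```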

7. Conclusions

The analysis of self-stabilizing algorithms is often confined to the stabilization time starting from an arbitrary configuration. In practice the time to recover from a 1-faulty configuration is much more relevant. This paper presents techniques to analyze the containment time of randomized self-stabilizing algorithms for 1-faulty configurations. The execution of an algorithm is modeled as a Markov chain, and its complexity is reduced with the lumping technique. The power of this technique is demonstrated by an application to a $(\Delta + 1)$-coloring algorithm. We believe that the technique can also be applied to other self-stabilizing algorithms. We leave the application to problems such as maximal independent sets and maximal matchings for future work.

Funding

Research was funded by Deutsche Forschungsgemeinschaft DFG (TU 221/6-2).

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Gärtner, F.C. Fundamentals of Fault-tolerant Distributed Computing in Asynchronous Environments. ACM Comput. Surv. 1999, 31, 1–26.
  2. Dolev, S. Self-Stabilization; MIT Press: Cambridge, MA, USA, 2000.
  3. Azar, Y.; Kutten, S.; Patt-Shamir, B. Distributed Error Confinement. ACM Trans. Algorithms 2010, 6.
  4. Kutten, S.; Patt-Shamir, B. Adaptive Stabilization of Reactive Protocols. In Lecture Notes in Computer Science, Proceedings of the International Conference on Foundations of Software Technology and Theoretical Computer Science, Chennai, India, 16–18 December 2004; Springer: Berlin/Heidelberg, Germany, 2004; Volume 3328, pp. 396–407.
  5. Ghosh, S.; He, X. Scalable Self-Stabilization. J. Parallel Distrib. Comput. 2002, 62, 945–960.
  6. Dubois, S.; Masuzawa, T.; Tixeuil, S. Bounding the Impact of Unbounded Attacks in Stabilization. IEEE Trans. Parallel Distrib. Syst. 2012, 23, 460–466.
  7. Beauquier, J.; Delaet, S.; Haddad, S. Necessary and sufficient conditions for 1-adaptivity. In Proceedings of the 20th International Parallel and Distributed Processing Symposium, Rhodes Island, Greece, 25–29 April 2006; pp. 10–16.
  8. Ghosh, S.; Gupta, A.; Herman, T.; Pemmaraju, S. Fault-containing self-stabilizing distributed protocols. Distrib. Comput. 2007, 20, 53–73.
  9. Köhler, S.; Turau, V. Fault-containing self-stabilization in asynchronous systems with constant fault-gap. Distrib. Comput. 2012, 25, 207–224.
  10. Ghosh, S.; Gupta, A. An exercise in fault-containment: Self-stabilizing leader election. Inf. Process. Lett. 1996, 59, 281–288.
  11. Turau, V.; Hauck, B. A fault-containing self-stabilizing (3 ‒ 2/(Delta+1))-approximation algorithm for vertex cover in anonymous networks. Theor. Comput. Sci. 2011, 412, 4361–4371.
  12. Kemeny, J.G.; Snell, J.L. Finite Markov Chains; Springer: Berlin/Heidelberg, Germany, 1976.
  13. Duflot, M.; Fribourg, L.; Picaronny, C. Randomized Finite-state Distributed Algorithms As Markov Chains. In Lecture Notes in Computer Science, Proceedings of the International Symposium on Distributed Computing, Lisbon, Portugal, 3–5 October 2001; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2180, pp. 240–254.
  14. DeVille, R.; Mitra, S. Stability of Distributed Algorithms in the Face of Incessant Faults. In Lecture Notes in Computer Science, Proceedings of the Symposium on Self-Stabilizing Systems, Lyon, France, 3–6 November 2009; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5873, pp. 224–237.
  15. Fribourg, L.; Messika, S.; Picaronny, C. Coupling and Self-Stabilization. Distrib. Comput. 2006, 18, 221–232.
  16. Yamashita, M. Probabilistic Self-Stabilization and Random Walks. In Proceedings of the 2011 Second International Conference on Computing, Networking and Communications (ICNC), Osaka, Japan, 30 November–2 December 2011; pp. 1–7.
  17. Mitton, N.; Fleury, E.; Guérin-Lassous, I.; Séricola, B.; Tixeuil, S. On Fast Randomized Colorings in Sensor Networks. In Proceedings of ICPADS; IEEE: New York, NY, USA, 2006; pp. 31–38.
  18. Crouzen, P.; Hahn, E.; Hermanns, H.; Dhama, A.; Theel, O.; Wimmer, R.; Braitling, B.; Becker, B. Bounded Fairness for Probabilistic Distributed Algorithms. In Proceedings of the 11th International Conference on Application of Concurrency to System Design (ACSD), Newcastle Upon Tyne, UK, 20–24 June 2011; pp. 89–97.
  19. Peleg, D. Distributed Computing: A Locality-Sensitive Approach; SIAM Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2000.
  20. Gradinariu, M.; Tixeuil, S. Self-stabilizing Vertex Coloring of Arbitrary Graphs. In Proceedings of the 4th International Conference on Principles of Distributed Systems (OPODIS 2000), Paris, France, 20–22 December 2000; pp. 55–70.
  21. Dolev, S.; Herman, T. Superstabilizing Protocols for Dynamic Distributed Systems. Chic. J. Theor. Comput. Sci. 1997, 4, 1–40.
  22. Luby, M. A Simple Parallel Algorithm for the Maximal Independent Set Problem. SIAM J. Comput. 1986, 15, 1036–1055.
  23. Johansson, Ö. Simple Distributed δ + 1-coloring of Graphs. Inf. Process. Lett. 1999, 70, 229–232.
  24. Barenboim, L.; Elkin, M. Distributed Graph Coloring: Fundamentals and Recent Developments; Morgan & Claypool Publishers: Williston, VT, USA, 2013.
  25. Lenzen, C.; Suomela, J.; Wattenhofer, R. Local algorithms: Self-stabilization on speed. In Proceedings of the Symposium on Self-Stabilizing Systems, Lyon, France, 3–6 November 2009; Springer: Berlin/Heidelberg, Germany, 2009; pp. 17–34.
Figure 1. The numbers indicate the nodes’ colors. If the left-most node is hit by a fault and changes its color to $\Delta - 1$, then it and its neighbor are enabled. With probability 0.5 the second node changes its color to $\Delta$. This enables the third node, which changes its color to $\Delta - 1$ with probability 0.5. This may cause a cascade of changes in which all nodes on the horizontal line change color.
Figure 2. 1-faulty configurations of $\mathcal{A}_1$ caused by a memory corruption at $v$. Nodes drawn in bold have state $\mathit{IN}$. Subgraph $H$ corresponds to $R_v$. In the left graph, if node $v$ changes to state $\mathit{OUT}$ then all nodes in $H$ are enabled, thus the worst case stabilization time is equal to that of subgraph $H$. In the right graph, if node $v$ changes to $\mathit{IN}$, then $v$ and its two neighboring nodes all change to $\mathit{OUT}$, resulting in a configuration similar to the previous example.
Figure 3. Reducing the number of states of a Markov chain.
Figure 4. A 1-faulty configuration $c_0$ for algorithm $\mathcal{A}_3$ where node $v$ was hit by a fault changing its color to $c_f$, causing a conflict. The corresponding graph $R_v$ is depicted at the right side.
Figure 5. Elements of the partition of $\Sigma$ for a 1-faulty configuration of algorithm $\mathcal{A}_3$ as described above. $\Sigma_0$ consists of legitimate configurations only.
Figure 6. Comparison of computed values with the approximations from Theorem 4 and Lemma 15. (a) Comparison of the computed value of $E[X_d]$ with $\log d$ (Theorem 4); (b) comparison of the computed value of $\mathrm{Var}[X_d]$ with its approximation (Theorem 4); (c) $E[A_d]$ and $\mathrm{Var}[A_d]$ from Lemma 15.
Table 1. This table summarizes the probabilities for all transitions.
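| Number | Transition | Probability |
|---|---|---|
| 1 | $I \to P$ | $\frac{d-1}{2d} + \frac{1}{d\,2^{d+1}}$ |
| 2 | $I \to C_0$ | $\frac{d-1}{d} \cdot \frac{1}{2^{d+1}} + \frac{1}{2d}$ |
| 3 | $I \to C_j$ | $\binom{d}{j} \frac{1}{2^{d+1}}$, $(0 < j \le d)$ |
| 4 | $C_i \to C_j$ | $\binom{d-i}{d-j} \frac{1}{2^{d-i+1}} + \frac{1}{d-i+1} \binom{d-i}{j-i} \frac{1}{4^{d-i}} \bigl( 3^{d-j} - 2^{d-j} \bigr)$, $(0 \le i \le j \le d)$ |
| 5 | $C_i \to P$ | $\frac{1}{d-i+1} \bigl( \frac{3}{4} \bigr)^{d-i} + \frac{d-i-1}{2(d-i+1)}$, $(0 \le i < d)$ |
| 6 | $C_d \to P$ | $1/2$ |
| 7 | $P \to F$ | $1$ |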

Share and Cite

MDPI and ACS Style

Turau, V. Computing Fault-Containment Times of Self-Stabilizing Algorithms Using Lumped Markov Chains. Algorithms 2018, 11, 58. https://doi.org/10.3390/a11050058
