Computing Fault-Containment Times of Self-Stabilizing Algorithms Using Lumped Markov Chains †

The analysis of self-stabilizing algorithms is often limited to the worst-case stabilization time starting from an arbitrary state, i.e., a state resulting from a sequence of faults. Considering the fact that these algorithms are intended to provide fault tolerance in the long run, this is not the most relevant metric. A common situation is that a running system is in a legitimate state when hit by a single fault. This event has a much higher probability than multiple concurrent faults. Therefore, the worst-case time to recover from a single fault is more relevant than the recovery time from a large number of faults. This paper presents techniques to derive upper bounds for the mean time to recover from a single fault for self-stabilizing algorithms based on Markov chains in combination with lumping. To illustrate the applicability of the techniques they are applied to a new self-stabilizing coloring algorithm.


Introduction
Fault tolerance aims at making distributed systems more reliable by enabling them to continue the provision of services in the presence of faults. The strongest form is masking fault tolerance, where a system continues to operate after faults without any observable impairment of functionality, i.e., safety is always guaranteed. In contrast, non-masking fault tolerance does not ensure safety at all times. Users may experience incorrect system behavior, but eventually the system will fully recover. The potential of this concept lies in the fact that it can be used in cases where masking fault tolerance is too costly or even impossible to implement [1]. Self-stabilizing algorithms belong to the category of distributed algorithms that provide non-masking fault tolerance. They guarantee that systems eventually recover from transient faults of any scale such as perturbations of the state in memory or communication message corruption [2]. A critical issue is the length of the time span until full recovery. Examples are known where a memory corruption at a single process caused a vast disruption in large parts of the system and triggered a cascade of corrections to reestablish safety. Thus, an important issue for non-masking fault tolerance is the containment of the effect of faults.
A fault-containing system has the ability to contain the effects of transient faults in space and time. The goal is to keep the extent of disruption during recovery proportional to the extent of the faults. An extreme case of fault-containment with respect to space is given when the effect of faults is bounded to the set of faulty nodes. Azar et al. call this error confinement [3]. More relaxed forms of fault-containment are known as time-adaptive self-stabilization [4], scalable self-stabilization [5], strong stabilization [6], and 1-adaptive self-stabilization [7].
A configuration is called k-faulty if, in a legitimate configuration, exactly k processes are hit by a fault (a configuration is called legitimate if it conforms with the specification). A large body of

Related Work
Self-stabilizing algorithms are analyzed with different techniques such as potential functions, convergence stairs, and Markov chains. The latter are particularly useful for randomized algorithms [13]. Their main drawback is that in order to set up the transition matrix the graph's adjacency matrix must be known. This restricts the applicability of this method to small or highly symmetric instances. DeVille et al. apply model checking tools to Markov chains for networks of small size (n ≤ 7) to determine the expected stabilization time [14]. Ring topologies are an example of highly symmetric networks [15,16]. Fribourg et al. model randomized distributed algorithms as Markov chains using the technique of coupling to compute upper bounds for the stabilization times [15]. Yamashita uses Markov chains to model self-stabilizing probabilistic algorithms and to prove stabilization [16]. Mitton et al. consider a randomized self-stabilizing ∆ + 1-coloring algorithm and model it in terms of urns/balls using a Markov chain to get a bound for the stabilization time [17]. Their evaluation is restricted to networks of up to 1000 nodes. Crouzen et al. model faulty distributed algorithms as Markov decision processes to incorporate the effects of random faults when using a non-deterministic scheduler [18]. They used the PRISM model-checker to compute long-run average availabilities.

System Model
This paper uses the synchronous model of distributed computing as defined in the standard literature [2,8,19]. A distributed system is represented as an undirected graph G(V, E) where V is the set of nodes and E ⊆ V × V is the set of edges. Let n = |V| and let ∆(G) denote the maximal degree of G. The topology is assumed to be fixed. If two nodes are connected by an edge, they are called neighbors. The set of neighbors of node v is denoted by N(v) ⊆ V and N[v] = N(v) ∪ {v}. Each node stores a set of variables. The values of all variables constitute the local state of a node. Let σ denote the set of possible local states of a node. The configuration of a system is the tuple of all local states of all nodes. $\Sigma = \sigma^n$ denotes the set of global states. A configuration is called legitimate if it conforms with the specification. The set of all legitimate configurations is denoted by L.
Nodes communicate either via locally shared memory or by exchanging messages. In the shared memory model each node executes a protocol consisting of a list of rules of the form guard −→ statement. The guard is a Boolean expression over the variables of the node and its neighbors. The statement consists of a series of commands. A node is called enabled if one of its guards evaluates to true. The execution of a statement is called a move.
In the message passing model a node performs three actions per round: receiving messages from neighbors, executing code, and sending messages to neighbors. Direct access to the state of neighboring nodes is impossible. Two nodes u and v communicate via two link registers: u writes in its register and v reads from it, and v writes in its register and u reads from it. In this model the state of a node also includes the states of its registers. This work assumes the CONGEST model of distributed computation [19]. Algorithms in the CONGEST model enforce an O(log n) limitation on the maximum message size. Hence, with a single message only a constant number of node identifiers in the range {0, . . ., n} can be transmitted.
Execution of the statements is performed in a synchronous style, i.e., all enabled nodes execute their code in every round. An execution $e = c_0, c_1, c_2, \ldots$ with $c_i \in \Sigma$ is a sequence of configurations, where $c_0$ is called the initial configuration and $c_i$ is the configuration after the i-th round. In other words, if the current configuration is $c_{i-1}$ and all enabled nodes make a move, then this yields $c_i$.
Let A be a distributed algorithm and L ⊆ Σ a set of configurations. A is called self-stabilizing with respect to L if it satisfies the convergence and closure properties. The first property states that every execution of A reaches L after a finite number of rounds. The second property states that A(c) ∈ L for all c ∈ L as long as no fault occurs. The worst-case stabilization time $ST_A(G)$ of A for a graph G is equal to the maximal number of rounds after which A reaches a legitimate configuration of G regardless of the initial configuration, under the assumption that no errors occur.

Definition 1. A configuration c ∈ Σ of a self-stabilizing algorithm with respect to L is called k-faulty if a configuration c′ ∈ L exists such that c differs from c′ in the local states of at most k nodes.

Note that for the message passing model this definition also covers message corruption. This paper analyzes the most common fault situation: 1-faulty configurations. They arise when a single node v is hit by a memory corruption or a single message sent by v is corrupted.
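Definition 1 can be made concrete with a small sketch. It assumes that a configuration is represented as a sequence holding one local state per node and that the legitimate configurations can be enumerated, which is only realistic for tiny instances; the names `differing_nodes` and `is_k_faulty` are illustrative and do not appear in the paper.

```python
from typing import Hashable, Iterable, Sequence

# Assumption: a configuration is a sequence with one local state per node.
Config = Sequence[Hashable]


def differing_nodes(c: Config, c_legit: Config) -> int:
    """Count the nodes whose local state differs between two configurations."""
    if len(c) != len(c_legit):
        raise ValueError("configurations must cover the same node set")
    return sum(1 for a, b in zip(c, c_legit) if a != b)


def is_k_faulty(c: Config, legitimate: Iterable[Config], k: int) -> bool:
    """True if some legitimate configuration differs from c in at most k nodes."""
    return any(differing_nodes(c, c_legit) <= k for c_legit in legitimate)


# Path of three nodes; here the legitimate configurations are the proper
# colorings with colors 0 and 1.
legit = [(0, 1, 0), (1, 0, 1)]
print(is_k_faulty((0, 0, 0), legit, 1))  # one node differs from (0, 1, 0)
print(is_k_faulty((2, 2, 2), legit, 1))  # all three nodes differ
```

For realistic systems the set L is far too large to enumerate, so the sketch only illustrates the definition and is not a practical test.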
The containment behavior of a self-stabilizing algorithm is characterized by the contamination radius and the containment time.

Definition 2.
Let A be a self-stabilizing algorithm.

1.
The containment time of A denotes the worst-case number of rounds any execution of A starting at a 1-faulty configuration needs to reach a legitimate configuration.

2.
Let $R_v$ be the subgraph induced by the nodes engaged in the recovery process from a 1-faulty configuration of A triggered by a fault at v. Then $\max\{dist(v, w) \mid w \in R_v\}$ is called the contamination radius.
The stabilization time $ST_A(R_v)$ is an obvious upper bound for the containment time.
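The radius in Definition 2 is a maximum over hop distances, so for one observed recovery it can be computed by breadth-first search. The sketch below assumes the graph is given as an adjacency dictionary and that the set of nodes that changed state during recovery has been recorded; both the representation and the function names are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List

Graph = Dict[Hashable, List[Hashable]]  # adjacency lists of an undirected graph


def bfs_distances(graph: Graph, source: Hashable) -> Dict[Hashable, int]:
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def contamination_radius(graph: Graph, v: Hashable,
                         recovering_nodes: Iterable[Hashable]) -> int:
    """max{dist(v, w)} over the nodes w engaged in one recovery from a fault at v."""
    dist = bfs_distances(graph, v)
    return max((dist[w] for w in recovering_nodes), default=0)


# Path 0 - 1 - 2 - 3 with a fault at node 1.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(contamination_radius(path, 1, {0, 1, 2}))
```

The value describes a single recovery; a bound for the algorithm requires the maximum over all 1-faulty configurations and all executions.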

Examples
Before presenting techniques to compute these metrics we give some examples using the shared memory model to illustrate the two definitions.

Contamination Radius
Consider an algorithm in the shared memory model with contamination radius r. A single fault will not spread beyond the r-hop neighborhood of the faulty node v. In this case $R_v \subseteq G_v^r$, where $G_v^r$ is the subgraph induced by the nodes w with dist(v, w) ≤ r. As an example consider the well-known self-stabilizing algorithm $A_1$ to compute a maximal independent set (see Algorithm 1). It uses a single variable state. A configuration is legitimate if the nodes with state = IN form a maximal independent set.

Algorithm 1: Self-stabilizing algorithm $A_1$ to compute a MIS.

Next we consider another example: ∆ + 1-coloring. Most distributed algorithms for this problem follow the same pattern. A node that realizes that it has selected the same color as one of its neighbors chooses a new color from a finite color palette. This palette does not include the current colors of the node's neighbors. To be executed under the synchronous scheduler these algorithms are either randomized or use identifiers for symmetry breaking. Variations of this idea are followed in [17,20,21]. As an example consider algorithm $A_2$ from [20] (see Algorithm 2). $A_2$ has a single variable c. A configuration is legitimate if the values of variable c describe a valid ∆ + 1-coloring.
Due to its choice of a new color from the palette, algorithm $A_2$ has contamination radius at least ∆(G) (see Figure 1).

Algorithm 2: Self-stabilizing ∆ + 1-coloring algorithm $A_2$ from [20].

A minor modification of algorithm $A_2$ dramatically changes matters. Algorithm $A_3$ (see Algorithm 3) has contamination radius 1 (see Lemma 2). Note that neighbors of v that change their color during recovery form an independent set.

Proof. Consider a 1-faulty configuration where node v is hit by a memory corruption changing its color to a color c already chosen by at least one neighbor of v.
In the next round the nodes in $N_{conflict} \cup \{v\}$ will get a chance to choose a new color. The choices will only lead to conflicts between v and other nodes in $N_{conflict}$. Thus, the fault will not spread beyond the set $N_{conflict}$. With a positive probability the set $N_{conflict}$ will contain fewer nodes in each round.

Containment Time
Like the contamination radius, the containment time strongly depends on the concrete structure of G. This can be illustrated with algorithm $A_1$. Note that in this case $R_v$ can contain any subgraph H with ∆(G) nodes. As an example let G consist of H and an additional node v connected to each node of H. A legitimate configuration is given if the state of v is IN and all other nodes have state OUT (Figure 2 left). If v changes its state to OUT due to a fault then all nodes may change to state IN during the next round. Thus, there is little hope for a bound below the trivial bound. Similar arguments hold for the second 1-faulty configuration of $A_1$ shown on the right of Figure 2.

Self-Stabilizing Algorithms and Markov Chains
A self-stabilizing algorithm A can be regarded as a transition system over Σ. In each round the current configuration c ∈ Σ is transformed into a new configuration A(c) ∈ Σ. Each configuration of Σ occurs with a specific probability as the new configuration A(c). The randomness can originate in the scheduler (a probabilistic scheduler randomly selects enabled nodes to make a move) or in a random experiment within one of the rules. We assume that the scheduler and the random experiment are memoryless. Therefore, the execution of algorithm A can be described by a Markov chain with state set Σ and transition matrix $P = (p_{ij})$, where $p_{ij} = Prob(A(c_i) = c_j)$ gives the probability to move from configuration $c_i$ to $c_j \in \Sigma$ in one round. This work uses the notation for Markov chains as introduced in [12]. For a self-stabilizing algorithm A this Markov chain is denoted by $C_A$. In the following we use the terms configurations of A and states of $C_A$ as synonyms.
There is a close relation between the absorbing states of $C_A$ and the legitimate configurations L of A. The closure property of a self-stabilizing algorithm guarantees that a configuration of L is always mapped to a configuration in L, i.e., A(L) ⊆ L. A state of a Markov chain, in turn, is called absorbing if it is impossible to leave this state. Note that a non-legitimate state of A cannot be an absorbing state of $C_A$. In silent self-stabilizing algorithms we have A(c) = c for all c ∈ L. For non-silent algorithms we may without loss of generality also assume that A(c) = c for all c ∈ L. If this condition is not given we can partition L into subsets $L_1, \ldots, L_s$ with $A(L_i) = L_i$ and identify all configurations in $L_i$ as corresponding to the same state. For the computation of the stabilization time this does not make a difference. Under this assumption the Markov chain $C_A$ is an absorbing Markov chain and the set of absorbing states corresponds to the set of legitimate states of A. We can state the following easy to prove lemma.

The computation of the expected time to absorption for an absorbing Markov chain using the transition matrix P is a simple matrix operation [12]. We assume a labeling of the states such that the t non-absorbing states come before the a absorbing states. Then P has the following canonical form

$$P = \begin{pmatrix} Q & R \\ 0 & E \end{pmatrix}$$

Here E is the a × a identity matrix, Q is a t × t matrix, R is a t × a matrix, and 0 is the a × t zero matrix. For an absorbing Markov chain, the matrix $N = (E - Q)^{-1}$ (with E now the t × t identity matrix) is called the fundamental matrix for P.
Let $t_i$ be the expected number of steps before the chain is absorbed, given that the chain starts in the i-th state, and let t be the column vector whose i-th entry is $t_i$. Then t = Nc, where c is a column vector all of whose entries are 1. The variance of these numbers of steps is given by the entries of $(2N - E)t - t_{sq}$, where $t_{sq}$ is derived from t by squaring each entry [12]. Thus, if N is the fundamental matrix of $C_A$ then the expected number of rounds after which algorithm A stabilizes is max Nc, provided the initial configuration is randomly chosen from the non-legitimate configurations.
There is still a big obstacle to applying this procedure in practice. In order to compute the matrix N the probabilities $p_{ij}$ need to be known explicitly. Without knowing the graph explicitly it is impossible to compute the probabilities $p_{ij} = Prob(A(c_i) = c_j)$ for all pairs of states $c_i$, $c_j$. This is only possible for small graphs or when the graph has a symmetric structure, e.g., a ring.
The fundamental matrix contains a lot of information which is not needed for the computation of the stabilization time. Therefore, a coarser analysis of the Markov chain would be sufficient. A common approach is to partition Σ into subsets $\Sigma_0, \ldots, \Sigma_l$ and consider these as the states of a new Markov chain (see Figure 3). The challenge is to define the transition probabilities of the new chain in a way that allows properties of this new chain to be transferred to the original one.
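The two formulas t = Nc and (2N − E)t − t_sq can be evaluated without forming N explicitly, by solving linear systems with the matrix E − Q. The following is a minimal sketch using exact rational arithmetic from the Python standard library; the helper names `solve` and `absorption_stats` are illustrative, and Q is assumed to be given as a list of rows containing integers or fractions.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple


def solve(a: Sequence[Sequence], b: Sequence) -> List[Fraction]:
    """Solve a x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        # E - Q is invertible for an absorbing chain, so a pivot always exists.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        piv = m[col][col]
        m[col] = [x / piv for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def absorption_stats(q: Sequence[Sequence]) -> Tuple[List[Fraction], List[Fraction]]:
    """Expected steps to absorption and their variance for each transient state."""
    n = len(q)
    e_minus_q = [[(1 if i == j else 0) - Fraction(q[i][j]) for j in range(n)]
                 for i in range(n)]
    t = solve(e_minus_q, [1] * n)   # t = N c with c the all-ones vector
    nt = solve(e_minus_q, t)        # N t
    var = [2 * nt[i] - t[i] - t[i] ** 2 for i in range(n)]  # (2N - E) t - t_sq
    return t, var


half = Fraction(1, 2)
# Two transient states passed in sequence, each left with probability 1/2.
t, var = absorption_stats([[half, half], [0, half]])
print(t, var)
```

In the example the time spent in each transient state is geometric with parameter 1/2, so the expected absorption times are 4 and 2 and the variances are 4 and 2, in agreement with the matrix formulas.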
Figure 3. Reducing the number of states of a Markov chain.

A partition $P = \{\Sigma_0, \ldots, \Sigma_l\}$ of Σ is called lumpable if the subsets $\Sigma_i$ have the property that for each pair i, j the probability of a state $c \in \Sigma_i$ to be transformed in one step into a state of $\Sigma_j$ is independent of the choice of $c \in \Sigma_i$ (Definition 6.3.1 [12]). This probability is then interpreted as the transition probability from $\Sigma_i$ to $\Sigma_j$. More formally, a Markov chain C with one-step map A is lumpable with respect to partition P if for all i, j and all $c, c' \in \Sigma_i$

$$Prob(A(c) \in \Sigma_j) = Prob(A(c') \in \Sigma_j).$$

A lumpable Markov chain C defines a new Markov chain $C^P$ with state set P and transition probabilities

$$\hat{p}_{ij} = Prob(A(c) \in \Sigma_j) \quad \text{for an arbitrary } c \in \Sigma_i.$$

The following result is proved in [12], p. 128.

Lemma 4. Let $c \in \Sigma_i$ be a state of a lumpable Markov chain C. Then the expected time to reach from c an absorbing state of C is equal to the expected time to reach from $\Sigma_i$ an absorbing state of $C^P$.
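The lumpability condition can be checked mechanically for a small chain: for every block, all of its states must have the same probability of entering each block. The sketch below assumes the transition matrix is a list of rows and the partition a list of non-empty lists of state indices; the function names are illustrative only.

```python
from fractions import Fraction
from typing import List, Optional, Sequence


def block_probabilities(p: Sequence[Sequence], state: int,
                        partition: Sequence[Sequence[int]]) -> List[Fraction]:
    """Probability of moving from `state` into each block of the partition."""
    return [sum((Fraction(p[state][k]) for k in block), Fraction(0))
            for block in partition]


def lump(p: Sequence[Sequence],
         partition: Sequence[Sequence[int]]) -> Optional[List[List[Fraction]]]:
    """Transition matrix of the lumped chain, or None if the partition is not lumpable."""
    lumped = []
    for block in partition:
        rows = [block_probabilities(p, s, partition) for s in block]
        if any(r != rows[0] for r in rows):
            return None  # two states of one block disagree on some block probability
        lumped.append(rows[0])
    return lumped


F = Fraction
p = [[F(1, 2), F(1, 4), F(1, 4)],
     [F(1, 3), F(1, 3), F(1, 3)],
     [F(1, 3), F(2, 3), F(0)]]
# States 1 and 2 both enter block {0} with probability 1/3, so the partition is lumpable.
print(lump(p, [[0], [1, 2]]))
```

Changing the last row to (1/2, 1/2, 0) makes states 1 and 2 disagree on the probability of entering block {0}, and `lump` then returns `None`.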
The last two lemmas prove the following theorem.

Theorem 1.
Let A be a self-stabilizing algorithm with Σ the set of configurations. Let $P = \{\Sigma_0, \ldots, \Sigma_l\}$ be a partition of Σ with $\Sigma_0 = L$ such that $C_A$ is lumpable with respect to P. For any i an absorbing state of $C_A^P$ is reached from $\Sigma_i$ in expected B steps if and only if A stabilizes in expected B rounds starting in any $c \in \Sigma_i$.
Unfortunately it is rather difficult to make use of Theorem 1 under general conditions. This situation changes when the theorem is used to compute the containment time of a self-stabilizing algorithm A. Remember that the containment time of A denotes the worst-case number of rounds any execution of A starting in a 1-faulty configuration needs to reach a legitimate configuration. Thus, the containment time of A is $ST_A(R_v)$, where v is the faulty node. There are two aspects that ease the application of Theorem 1: either $R_v$ has a symmetric structure or $R_v$ is small.
To illustrate this approach we consider again algorithm $A_3$. Let v be a node that, in a legitimate configuration, changes its color to $c_f$ due to a memory fault (see Figure 4). Let $c_0$ be the new configuration. This causes a conflict with those neighbors of v that had chosen $c_f$ as their color. After the fault only nodes contained in $R_v$ (a star graph) change their state (see Figure 4 right). Once a neighbor has chosen a color different from $c_f$ it becomes passive (at least until the next transient fault). The set Σ is equal to the set of all configurations of $R_v$ reachable from $c_0$. To partition this set let d be the number of neighbors of v that have color $c_f$ in $c_0$. Let $P = \{\Sigma_0, \ldots, \Sigma_d\}$ where $\Sigma_j$ is the subset of Σ where exactly d − j neighbors of v are in conflict with v. Then $\Sigma_0 = \{c_0\}$ and $\Sigma_d \subseteq L$. Figure 5 shows some configurations belonging to the sets $\Sigma_0$, $\Sigma_2$, and $\Sigma_3$. Let $c \in \Sigma_i$. Then $A_3(c) \notin \Sigma_j$ for all j < i. Unfortunately the partition P is not lumpable because the probability of a configuration $c \in \Sigma_i$ to be transformed in one round into a fixed configuration of $\Sigma_j$ is not independent of the choice of $c \in \Sigma_i$. But even in these cases Theorem 1 can lead to an upper bound of the stabilization time. This is proved in the following theorem.

Theorem 2. Let A be a self-stabilizing algorithm with Σ the set of configurations. Let $P = \{\Sigma_0, \ldots, \Sigma_l\}$ be a partition of

Furthermore, let $c_{ij} = 0$ for j < i and $c_{ii} = 1 - \sum_{j=i+1}^{d} c_{ij}$ for i = 0, . . ., d. Let $\hat{C}_A$ be the Markov chain with states P and transition matrix $(c_{ij})$. If an absorbing state of $\hat{C}_A$ is reached from $\Sigma_i$ in expected B steps then the expected number of rounds A requires to stabilize starting in any $c \in \Sigma_i$ is at most B.
for each fixed $c \in \Sigma_i$. Thus, $c_{ii} \ge 0$ and therefore the matrix $(c_{ij})$ is a stochastic matrix with $c_{dd} = 1$ that describes the new Markov chain $\hat{C}_A$. Remember that $C_A$ denotes the Markov chain corresponding to algorithm A. The difference between $C_A$ and $\hat{C}_A$ is that the chain $\hat{C}_A$ remains with a higher probability in its current state instead of moving to a state $\Sigma_j$ with higher index. Therefore, the expected number of steps of $\hat{C}_A$ before being absorbed is an upper bound for the corresponding number in $C_A$. The choice of the probabilities implies that $\hat{C}_A$ is lumpable for partition P. Hence, we can apply Theorem 1 to complete the proof.
In the rest of this paper the introduced techniques, including Theorem 2, are applied by way of example to a self-stabilizing (∆ + 1)-coloring algorithm $A_{col}$ using the message passing model.
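Because the matrix used in Theorem 2 is upper triangular with an absorbing last state, its expected absorption times follow from back substitution. The sketch below takes as given an assumption that is not spelled out in the surviving text: that the entries c_ij with j > i are lower bounds on the true probabilities of moving from Σ_i to Σ_j. The diagonal entry is then implied as one minus the row sum. The function name and the example numbers are hypothetical.

```python
from fractions import Fraction
from typing import List, Sequence


def pessimistic_expected_steps(c: Sequence[Sequence]) -> List[Fraction]:
    """Expected steps to reach state d in the upper-triangular chain (c_ij).

    Only entries with j > i are read; c_ii is implied as 1 - sum_{j>i} c_ij
    and state d = len(c) - 1 is absorbing.
    """
    d = len(c) - 1
    t = [Fraction(0)] * (d + 1)  # t[d] = 0 because state d is absorbing
    for i in range(d - 1, -1, -1):
        out = [Fraction(c[i][j]) for j in range(i + 1, d + 1)]
        leave = sum(out, Fraction(0))  # probability of leaving state i, i.e. 1 - c_ii
        if not 0 < leave <= 1:
            raise ValueError(f"state {i} must be left with probability in (0, 1]")
        # t_i = 1 + c_ii * t_i + sum_{j>i} c_ij * t_j, solved for t_i
        weighted = sum(cij * t[j] for cij, j in zip(out, range(i + 1, d + 1)))
        t[i] = (1 + weighted) / leave
    return t


F = Fraction
# Hypothetical bounds for d = 2: from state 0 move to state 1 w.p. >= 1/2 and
# to state 2 w.p. >= 1/4; from state 1 move to state 2 w.p. >= 1/2.
bounds = [[0, F(1, 2), F(1, 4)],
          [0, 0, F(1, 2)],
          [0, 0, 1]]
print(pessimistic_expected_steps(bounds))
```

For these numbers the expected times are 8/3, 2, and 0; by the argument of the proof they would bound the expected stabilization times of A from above, provided the stated assumption on the c_ij holds.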

Algorithm A col
This section introduces coloring algorithm $A_{col}$ (see Algorithm 4). It has long been known that a ∆ + 1-coloring can be computed in expected O(log n) rounds with a randomized algorithm [22,23]. Algorithm $A_{col}$ follows the pattern sketched in Section 4.1. We derived it from an existing algorithm (Algorithm 19 [24]) by adding the self-stabilization property. The presented techniques can also be applied to other randomized coloring algorithms such as [17,20,21]. The main difference is that $A_{col}$ assumes the synchronous CONGEST message passing model. Algorithm $A_{col}$ stabilizes after O(log n) rounds with high probability whereas the above cited self-stabilizing algorithms all require a linear number of rounds. Since synchronous local algorithms can be converted to asynchronous self-stabilizing algorithms [25], there are self-stabilizing ∆ + 1-coloring algorithms faster than $A_{col}$. However, they entail a burden on memory resources and cause high traffic costs.
At the start of each round each node broadcasts its current color to its neighbors. Based on the information received from its neighbors a node decides either to keep its color (final choice), to choose a new color, or to choose no color (value ⊥). In particular, with equal probability a node v either draws uniformly at random a color from the set {0, 1, . . ., δ(v)}\tabu or indicates that it made no choice (see function randomColor). Here, tabu is the set of colors of neighbors of v that already made their final choice. In the original algorithm a node maintains a list with the colors of those neighbors that made their final choice. A fault changing this list is difficult to contain. Furthermore, in order to notice a memory corruption at a neighbor, each node must continuously send its state to all its neighbors and cannot stop doing so. This is the well-known price of self-stabilization [2]. These considerations lead to the design of Algorithm $A_{col}$. Each node only maintains the chosen color and whether its choice is final (variables c and final). In every round a node sends the values of c and final to all neighbors. $A_{col}$ uses two additional variables tabu and occupied. They are reset at the beginning of every round. To improve fault containment a node's final choice of a color is only withdrawn if it coincides with the final choice of a neighbor. To achieve a ∆ + 1-coloring a node makes a new choice if its color is larger than its degree. This situation can only originate from a fault.
First we prove correctness and analyze the stabilization time of $A_{col}$. A configuration is called a legal coloring if the values of variable c form a ∆ + 1-coloring. It is called legitimate if it is a legal coloring and v.final = true for each node v.

Lemma 5. A node v can change the value of variable final from true to false only in the first round or when a fault occurred just before the start of this round.
Proof. Let $v.c = c_r$ at the beginning of the round. In order for v to set v.final to false one of the following conditions must be met at the start of the round: $c_r > \delta(v)$, $c_r = \bot$, or v has a neighbor w with w.final = true and $w.c = c_r$.
The lemma is obviously true in the first case. Suppose that $c_r = \bot$ and v.final = true at the round's start. If during the previous round v.final was set to true then v.c cannot be ⊥ at the start of this round. Hence, at the start of the previous round final already had value true. But in this case v.c was not changed in the previous round and thus, $c_r = \bot$, contradiction. Finally assume the last condition. Then v and w cannot have changed their value of c in the previous round, because then final = true would be impossible at the start of this round. Thus, v sent ($c_r$, true) in the previous round. Hence, if $w.c = c_r$ at that time, w would have changed w.final to false, again a contradiction.

Lemma 6. A node setting final to true will not change its variables as long as no error occurs.
Proof. Let v be a node that executes final := true. If v changes the value back to false in a later round then by Lemma 5 a fault must have occurred. Thus in an error-free execution node v will never change variable final again. Since a node can only change variable c if final = false the proof is complete.

Lemma 7. If at the end of a round during which no error occurred each node v satisfies v.final = true then the configuration is legitimate and remains legitimate as long as no error occurs.
Proof. Note that no node changed its color during that round. If at the start of the round v.final = true was already satisfied then none of v's neighbors also having final = true had the same color as v. Next consider a neighbor w of v with w.final = false at the start of the round. Since v sent (v.c, true) at the start of this round, node w would have set final to false if it had chosen the same color as v. Contradiction. Finally consider the case that v.final = false at the start of the round. Since v changed final to true, none of its neighbors had chosen the same color as v. Thus, the configuration is legitimate. Obviously, this property can only be changed by a fault.
With these lemmas the next theorem is proved along the same lines as Lemma 10.3 in [24].

Theorem 3. Algorithm $A_{col}$ is self-stabilizing and computes a ∆ + 1-coloring within O(log n) rounds with high probability (i.e., with probability at least $1 - 1/n^c$ for any c ≥ 1). $A_{col}$ has contamination radius 1.
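The random choice described above can be sketched as follows. This is only an illustration of the textual description, not the function randomColor from Algorithm 4: the value ⊥ is represented by `None`, the random source is passed in explicitly, and returning ⊥ when the palette is empty is an assumption added here for robustness.

```python
import random
from typing import Optional, Set


def random_color(degree: int, tabu: Set[int], rng: random.Random) -> Optional[int]:
    """With probability 1/2 return None (no choice), otherwise a uniformly
    random color from {0, ..., degree} that is not in tabu."""
    if rng.random() < 0.5:
        return None  # the node makes no choice in this round
    palette = [color for color in range(degree + 1) if color not in tabu]
    if not palette:
        return None  # assumption: with no free color the node also makes no choice
    return rng.choice(palette)


rng = random.Random(0)
# A node of degree 3 whose only neighbor with a final choice uses color 1.
draws = [random_color(3, {1}, rng) for _ in range(10)]
print(draws)
```

Since tabu only contains colors of neighbors, it holds at most δ(v) of the δ(v) + 1 colors, so in a fault-free round the palette is never empty.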

Proof.
According to Lemma 7 it suffices to prove that all nodes terminate within O(log n) time with high probability. Let v ∈ V. Lemma 6 implies that the probability that v terminates in round r > 1 is equal to the probability that v sets v.final to true in round r − 1. This is the probability that v selects a color different from ⊥ and from the selections of all neighbors that chose a value different from ⊥ in round r − 2. Suppose that indeed v.c ≠ ⊥ at the end of round r − 2. Then v.c ∉ v.tabu.
The probability that a given neighbor u of v selects the same color u.c = v.c in this round is at most $\frac{1}{2(\delta(v) + 1 - |v.tabu|)}$. This is because the probability that u selects a color different from ⊥ is 1/2, and v has δ(v) + 1 − |v.tabu| different colors to select from. Since r > 1 all nodes in v.tabu have final = true and will never change this value. Thus, at most δ(v) − |v.tabu| neighbors select a new color. By the union bound, the probability that v selects the same color as a neighbor is at most

$$\frac{\delta(v) - |v.tabu|}{2(\delta(v) + 1 - |v.tabu|)} < \frac{1}{2}.$$

Thus, if v selects a color v.c ≠ ⊥, it is distinct from the colors of its neighbors with probability at least 1/2. It holds that v.c ≠ ⊥ with probability 1/2. Hence, v terminates with probability at least 1/4.
The probability that a specific node v does not terminate within r rounds is at most $(3/4)^r$. By the union bound, the probability that there exists a vertex v ∈ V that does not terminate within r rounds is at most $n(3/4)^r$. Hence, $A_{col}$ terminates after r = 4(c + 1) log n rounds with probability at least $1 - n(3/4)^r \ge 1 - 1/n^c$ (note that log(4/3) > 1/4).
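The last inequality can be written out step by step. The following is a sketch that assumes log denotes the natural logarithm; with the binary logarithm the exponent only becomes more favorable.

```latex
\begin{align*}
n\left(\tfrac{3}{4}\right)^{r}
  &= n \cdot e^{-r \ln(4/3)}
   = n \cdot e^{-4(c+1)\ln(4/3)\,\ln n}
   = n \cdot n^{-4(c+1)\ln(4/3)} \\
  &\le n \cdot n^{-(c+1)}
   = n^{-c},
\end{align*}
% using r = 4(c+1) ln n and 4 ln(4/3) \approx 1.15 > 1.
```

Hence all nodes have terminated after r rounds with probability at least $1 - 1/n^c$.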

Fault Containment Time of Algorithm A col
There is a significant difference between the shared memory and the message passing model when analyzing the containment time. Firstly, a 1-faulty configuration also arises when a single message sent by a node v is corrupted. Secondly, this may cause v's neighbors to send messages they would not send in a legitimate configuration. Even though the states of nodes outside $G_v^r$ do not change, these nodes may be forced to send messages. Thus, in general the analysis of the containment time cannot be performed by considering $G_v^r$ only. This is only possible in cases when a fault at v does not force nodes at distance r + 1 to send messages they would not send had the fault not occurred.
In the following the fault containment behavior of $A_{col}$ for 1-faulty configurations is analyzed. Two types of transient errors are considered:

1.
A single broadcast message sent by v is corrupted. Note that the alternative of using δ(v) unicast messages instead of a single broadcast has very good fault containment behavior but is slower due to the handling of acknowledgements.

2.
Memory corruption at node v, i.e., the value of at least one of the two variables of v is corrupted.
The first case is analyzed analytically whereas for the second case Markov chains and lumping (Theorem 2) are used. The independent degree $\delta_i(v)$ of a node v is the size of a maximum independent set of N(v).

Message Corruption
If a message broadcast by v contains a color $c_f$ different from v.c or the value false for variable final then the message ($c_f$, false) has no effect on any w ∈ N(v) regardless of the value of $c_f$, since w.final = true for all w ∈ N(v). Thus, this corrupted message has no effect at all. In order to compute the containment time for $A_{col}$ we first compute the contamination radius.

Lemma 8. The contamination radius of algorithm $A_{col}$ after a single corruption of a broadcast message sent by node v is 1. At most $\delta_i(v)$ nodes change their state during recovery.
Proof. It suffices to consider the case that v broadcasts message ($c_f$, true). The nodes in $N_{conflict}(v)$ form an independent set, because they all have the same color.
This node continues to send (u.c, true) after the fault. Thus, a neighbor of u that changes its color will not change its color to u.c. This yields that no neighbor of u will ever send a message with u.c as the first parameter. This is also true in case $u \in N(v) \setminus N_{conflict}(v)$. Hence, no node outside $N_{conflict}(v) \cup \{v\}$ will change its state, i.e., the contamination radius is 1.
Let $w \in N_{conflict}(v)$. When w receives the faulty message it sets w.final to false. Before the faulty message was sent no neighbor of v had the same color as v. Thus, in the worst case a node $w \in N_{conflict}(v)$ will choose v.c as its new color and send (v.c, false) to all neighbors. Since v.final = true, this will not force a state change at v. Thus, v keeps broadcasting (v.c, true) and no neighbor w of v will ever reach the state w.c = v.c and w.final = true. Hence, v will never change its state.
With this result Theorem 3 implies that the containment time of this fault is $O(\log \delta_i(v))$ in expectation. The following theorem gives a bound for the expected value of the containment time including its variance. Since the variance of a random variable is the expected value of the squared deviation from the mean, this theorem shows that the containment time does not deviate much from its expected value. A concrete bound can be obtained from this result using Chebyshev's inequality.
Theorem 4. The expected containment time of algorithm $A_{col}$ after a corruption of a message broadcast by node v is at most $\frac{1}{\ln 2} H_{\delta_i(v)} + 1/2$ rounds ($H_i$ is the i-th harmonic number) with a variance of at most

Proof. After receiving message ($c_f$, true) all nodes $w \in N_{conflict}(v)$ set w.final to false and, with equal probability, w.c to ⊥ or to a random color $c_w \in \{0, 1, \ldots, \delta(w)\} \setminus w.tabu$. Note that |w.tabu| ≤ δ(w) because w.tabu = {u.c | u ∈ N(w)\v} ∪ {$c_f$}. At the end of the round during which the corrupted message was received, node w can choose v's current color, because it may not be contained in the set tabu. This cannot happen in the following rounds. Thus, if w chooses a color different from ⊥ in the following rounds then this color is different from the colors of all of w's neighbors. Also in this case w will terminate after the following round because then it will set final to true. Thus, after one round w has chosen a color that is different from the colors of all neighbors with probability at least 1/2. Furthermore, this color will not change again. After one additional round w reaches a legitimate state. Let the random variable $X_d$ with $d = |N_{conflict}(v)|$ denote the number of rounds until the system has reached a legal coloring. For $w \in N_{conflict}(v)$ let $Y_w$ be the random variable denoting the number of rounds until w has a legal coloring. By Lemma 8 where X is a geometric random variable with p = 0.5. Thus G(0) = 0 and This implies The result follows from Lemma 9. The expression for the variance is proved in Lemma 10.
Lemma 9. For fixed $0 < q < 1$ and fixed $d \ge$

Using the substitution $u = 1 - q^x$ the integral becomes Approximating

Lemma 10. For $d > 0$ the variance of the containment time is at most

Proof. By Lemma 11 Now Lemma 12 and Lemma 4 yield

Lemma 11. Let $d > 0$, $q \in (0, 1)$ and Prob(

For the first equation we refer to the proof of Theorem 4. The second equality makes use of and the fourth equality uses the following two identities

Lemma 12. Let $d > 0$ and $q \in (0, 1)$ then

This leads to a small overestimation of the sum as Figure 6b shows.
The first equation uses the substitution $u = 1 - q^x$. The final result is based on the following identity

Theorem 4 gives an upper bound for the containment time of algorithm $A_{col}$ and for its variance. To evaluate the quality of these upper bounds we modeled the behavior of this fault situation as a Markov chain and computed $E[X_d]$ and $\mathrm{Var}[X_d]$ using a software package based on symbolic mathematics. Using Theorem 3.3.5 from [12], these computations showed that $\frac{1}{\ln 2} H_d + 1/2$ matches $E[X_d]$ very well and that $E[X_d] \approx 2 \log d$ (see Figure 6a). Furthermore, the gap between $\mathrm{Var}[X_d]$ and the bound given in Theorem 4 is less than 0.2 (see Figure 6b).
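A rough numerical cross-check of the expression in Theorem 4 can be sketched as follows. This is not the symbolic Markov chain computation described above. It assumes, as a simplification that the text does not state in this form, that $X_d$ behaves like the maximum of $d$ independent geometric random variables with $p = 0.5$ taking values in $\{1, 2, \ldots\}$. The function names are illustrative only.

```python
import math


def expected_max_geometric(d, p=0.5, terms=400):
    """E[max of d iid geometric(p) variables on {1, 2, ...}].

    Uses the tail-sum formula E[M] = sum_{k >= 0} P(M > k) with
    P(M > k) = 1 - (1 - q**k)**d, truncated after `terms` summands.
    Assumption: this is only a stand-in model for X_d.
    """
    q = 1.0 - p
    return sum(1.0 - (1.0 - q ** k) ** d for k in range(terms))


def harmonic(d):
    """d-th harmonic number H_d."""
    return sum(1.0 / i for i in range(1, d + 1))


def closed_form(d):
    """The expression (1 / ln 2) * H_d + 1/2 that appears in Theorem 4."""
    return harmonic(d) / math.log(2) + 0.5


if __name__ == "__main__":
    # Print the model value next to the closed-form expression.
    for d in (1, 2, 4, 8, 16, 32, 64):
        print(d, round(expected_max_geometric(d), 4), round(closed_form(d), 4))
```

The two printed columns can be compared directly. No claim is made that they reproduce the values plotted in Figure 6a, since the model above is an assumption.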

Memory Corruption
This section demonstrates the use of Markov chains and the application of Theorem 2. We consider the case that the memory of a single node $v$ is hit by a fault. The analysis breaks the stabilizing executions down into several phases and then computes the expected time for each phase. First we look at the case that the fault causes variable $v.\mathit{final}$ to change to $\mathit{false}$. If $v.c$ does not change, then a legitimate configuration is reached after one round. So assume $v.c$ also changes. Then the fault will not affect other nodes, because no $w \in N(v)$ will change its value of $w.c$ since $w.\mathit{final} = \mathit{true}$ and $v.\mathit{final} = \mathit{false}$. Thus, with probability at least 1/2 node $v$ will choose in the next round a color different from the colors of all neighbors and terminate one round later. Similar to $X_d$, let random variable $Z_d$ denote the number of rounds until a legal coloring is reached ($d = |N_{conflict}(v)|$). It is easy to verify that $E[Z_d] = 3$ in this case.
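The value 3 can be checked with a short calculation. The sketch below assumes that the success probability per round is exactly 1/2 (the text only states "at least 1/2", so 3 is then an upper bound). It sums the series for a geometric number of color choices followed by one terminating round. The function name is illustrative.

```python
from fractions import Fraction


def expected_rounds(p=Fraction(1, 2), terms=200):
    """Truncated series for E[Z] = sum_k (k + 1) * P(first success in round k).

    k counts the rounds until node v picks a conflict-free color
    (geometric with success probability p); the extra +1 is the round
    in which v sets final to true. Assumption: p is exactly 1/2.
    """
    q = 1 - p
    return sum((k + 1) * q ** (k - 1) * p for k in range(1, terms + 1))


if __name__ == "__main__":
    # Agrees with the closed form 1/p + 1 up to the truncation error.
    print(float(expected_rounds()))
```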
The last case is that only variable $v.c$ is affected (i.e., $v.\mathit{final}$ remains $\mathit{true}$). The main difference to the case of a corrupted message is that this fault persists until $v.c$ again has a legitimate value. Let $c_f$ be the corrupted value of $v.c$ and suppose that $N_{conflict}(v) = \{w \in N(v) \mid w.c = c_f\} \neq \emptyset$. A node outside $S = N_{conflict}(v) \cup \{v\}$ will not change its state (cf. Lemma 8). Thus, the contamination radius is 1 and at most $\delta_i(v) + 1$ nodes change state.

Proof. When a node $u \in S$ chooses a color with function randomColor, the color is randomly selected from $C_u = \{0, 1, \ldots, \delta(v)\} \setminus \mathit{tabu}$. Thus, if $w$ and $v$ choose colors in the same round, the probability that the chosen colors coincide is minimal. This is achieved when $C_w \subseteq C_v$ and $C_v$ is minimal (independent of the size of $C_w$) or vice versa. Thus, without loss of generality we can assume that $C_w \subseteq C_v$ and both sets are minimal. Thus, for $w \in N_{conflict}(v)$ the nodes in $N(w) \setminus \{v\}$ already use all colors from $\{0, 1, \ldots, \delta(v)\}$ but 0 and 1, and all nodes in $N(v) \setminus N_{conflict}(v)$ already use all colors from $\{0, 1, \ldots, \delta(v)\}$ but $0, 1, \ldots, d$. Hence, a node $w \in N_{conflict}(v)$ can choose a color from $\{0, 1\} \setminus \mathit{tabu}$ with $\mathit{tabu} = \emptyset$ if $v.\mathit{final} = \mathit{false}$ and $\mathit{tabu} = \{v.c\}$ otherwise. Furthermore, $v$ can choose a color from $\{0, 1, \ldots, d\} \setminus \mathit{tabu}$ with $\mathit{tabu} \subseteq \{0, 1\}$. In this case $\mathit{tabu} = \emptyset$ if $w.\mathit{final} = \mathit{false}$ for all $w \in N_{conflict}(v)$.
Thus, in order to bound the expected number of rounds to reach a legitimate state after a memory corruption, we can assume that $G = G_S$ and that $u.\mathit{final} = \mathit{true}$ and $u.c = 0$ (i.e., $c_f = 0$) for all $u \in S$. After one round $u.\mathit{final} = \mathit{false}$ for all $u \in S$. To apply Theorem 2, the set $\Sigma$ of all configurations is partitioned into $d + 4$ subsets as follows:

$I$: Represents the faulty state with $u.c = 0$ and $u.\mathit{final} = \mathit{true}$ for all $u \in S$.

$C_i$: Node $v$ and exactly $d - i$ non-center nodes will not be in a legitimate state after the following round ($0 \le i \le d$). In particular $v.\mathit{final} = \mathit{false}$ and $w.c = v.c = \bot$ or $v.c = w.c = \bot$ for exactly $d - i$ non-center nodes $w$.

$P$: Node $v$ has not reached a legitimate state but will do so in the next round. In particular $v.\mathit{final} = \mathit{false}$ and $v.c \neq w.c$ for all non-center nodes $w$.

$F$: Node $v$ is in a legitimate state, i.e., $v.\mathit{final} = \mathit{true}$ and $v.c \neq w.c$ for all non-center nodes $w$, but $w.c$ may be equal to $\bot$.
Note that $I$ is the initial and $F$ the absorbing state of the lumped Markov chain. Also, when the system is in state $F$, it is not necessarily in a legitimate state. This state reflects the set of configurations considered in the last section.

Lemma 14. Table 1 describes the transition probabilities of the lumped Markov chain.
Proof. We consider each case separately. The last two cases are trivial.

This happens with probability $\frac{d-i-1}{2(d-i+1)}$. Non-center nodes can make any choice.

4. Note that $u.\mathit{final} = \mathit{false}$ for all $u \in S$. The total probability for this case is
Case 2: $v.c = 0$. This happens with probability $1/(2(d - i + 1))$. Exactly $j - i$ non-center nodes choose $c = 1$ (with probability $1/4^{j-i}$), $1 \le l \le d - j$ non-center nodes choose $c = 0$ (with probability $1/4^{l}$) and all other non-center nodes choose $c = \bot$ (with probability $1/2^{d-j-l}$). The total probability for this case is
Case 2: $v.c = 1$. Similar to Case 1.
Case 3: $v.c > 0$. This does not lead to $C_j$ but to $P$.

5. Note that $u.\mathit{final} = \mathit{false}$ for all $u \in S$.

For $s = -1, 0, 1, \ldots, d + 1$ denote by $Q_s$ the $(s + 2) \times (s + 2)$ lower right submatrix of $Q$, i.e., $Q = Q_{d+1}$. Denote by $N_s$ the fundamental matrix of $Q_s$ (notation as introduced in Section 5). Let $1_s$ be the column vector of length $s + 2$ whose entries are all 1 and $\epsilon_s = N_s 1_s$. For $s = 0, \ldots, d$, $\epsilon_s$ is the expected number of rounds to reach state $F$ from state $C_{d-s}$ and $\epsilon_{d+1}$ is the expected number of rounds to reach state $F$ from $I$, i.e., $\epsilon_{d+1} = E[A_d]$ (Theorem 3.3.5, [12]). Identifying $P$ with $C_{d+1}$ we have $\epsilon_{-1} = 1$.
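The way expected absorption times and their variances are obtained from a fundamental matrix can be sketched in a few lines. The code below uses the standard formulas for absorbing Markov chains, $t = N\mathbf{1}$ and $\mathrm{Var} = (2N - I)t - t \circ t$ with $N = (I - Q)^{-1}$. The three-state chain at the end is a hypothetical toy example, not the chain of Table 1, and all function names are illustrative.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination over exact fractions."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]


def absorption_stats(q):
    """Mean and variance of the number of steps until absorption.

    q is the transient-to-transient block of the transition matrix.
    Instead of inverting I - Q, two linear systems are solved:
    (I - Q) t = 1 gives t = N 1, and (I - Q) y = t gives y = N t.
    """
    n = len(q)
    i_minus_q = [[Fraction(int(i == j)) - q[i][j] for j in range(n)]
                 for i in range(n)]
    t = solve(i_minus_q, [Fraction(1)] * n)
    nt = solve(i_minus_q, t)
    var = [2 * nt[i] - t[i] - t[i] ** 2 for i in range(n)]
    return t, var


# Hypothetical toy chain with transient states (I, C, P) and absorbing F:
# I -> C with probability 1, C stays in C or moves to P with probability
# 1/2 each, and P -> F with probability 1.
half = Fraction(1, 2)
toy_q = [
    [Fraction(0), Fraction(1), Fraction(0)],
    [Fraction(0), half, half],
    [Fraction(0), Fraction(0), Fraction(0)],
]
toy_mean, toy_var = absorption_stats(toy_q)
```

Exact fractions are used so that small chains can be checked by hand; for large $d$ a floating-point linear algebra routine would be the more practical choice.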
Lemma 15.The expected number E[A d ] of rounds to reach F from I is at most 5 and the variance is at most 3.6.
Proof. Note that $Q_s$ and $N_s$ are upper triangular matrices. Let gives rise to $(i + 2)^2$ equations. Adding up the $i + 2$ equations for the first row of $E_i$ results in It is straightforward to verify that $\epsilon_{-1} = 1$ and $\epsilon_0 = 3$. Hence Next we show by induction on $i$ that $\epsilon_i \le 4$ for $i = -1, 0, 1, \ldots, d$, where $\epsilon_i$ denotes the expected number of rounds to reach $F$ from $C_{d-i}$. So assume that $\epsilon_l \le 4$ for $l = -1, 0, 1, \ldots, i - 1$ with $i < d$. Then

$$\epsilon_i \le (1 - a_1)^{-1}\left(1 + 4\sum_{l=2}^{i} a_l + 3a_{i+1} + a_{i+2}\right)$$

since $a_i \ge 0$. Using the fact $1 - a_1 = \sum_{l=2}^{i+2} a_l$ this inequality becomes

$$\epsilon_i \le 4 + \frac{1 - a_{i+1} - 3a_{i+2}}{1 - a_1}.$$

Coefficient $a_j$ denotes the transition probability from $C_{d-i}$ to $C_{d+j-(i+1)}$ for $j = 1, \ldots, i + 1$ and $a_{i+2}$ that for changing from $C_{d-i}$ to $P$. For $i \le d$ the following values from Lemma 14 are used: . Thus, holds for $i \ge 2$. This yields $\frac{1 - a_{i+1} - 3a_{i+2}}{1 - a_1} < 0$ and therefore $\epsilon_i \le 4$. To bound $\epsilon_{d+1}$ we use Equation 1 with $i = d + 1$. Note that in this case $a_1 = 0$ since a transition from $I$ to itself is impossible. Hence

Lemma 16. The expected containment time after a memory corruption at node $v$ is at most $\frac{1}{\ln 2} H_{\delta_i(v)} + 11/2$ with variance less than 7.5.
Proof. For a set $X$ of configurations and a single configuration $c$ denote by $E(c, X)$ the expected number of transitions from $c$ to a state in $X$. Let $L$ be the set of legitimate system states. Then

Proof. For these graphs $\Delta_i \in O(1)$, in particular $\Delta_i \le 5$ for unit disc graphs.
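The earlier remark that Chebyshev's inequality turns a variance bound into a concrete tail bound can be illustrated with the numbers of Lemma 16. Here $T$ denotes the containment time after a memory corruption at node $v$; the symbol $T$ is introduced only for this illustration.

```latex
\Pr\bigl(|T - E[T]| \ge t\bigr) \;\le\; \frac{\mathrm{Var}[T]}{t^{2}} \;<\; \frac{7.5}{t^{2}}
\qquad (t > 0),
\\[4pt]
\text{and since } E[T] \le \tfrac{1}{\ln 2} H_{\delta_i(v)} + \tfrac{11}{2}:
\qquad
\Pr\Bigl(T \ge \tfrac{1}{\ln 2} H_{\delta_i(v)} + \tfrac{11}{2} + t\Bigr) \;<\; \frac{7.5}{t^{2}}.
```

For example, with $t = 10$ the probability that the containment time exceeds the bound of Lemma 16 by ten or more rounds is below $0.075$.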

Conclusions
The analysis of self-stabilizing algorithms is often confined to the stabilization time starting from an arbitrary configuration. In practice the time to recover from a 1-faulty configuration is much more relevant. This paper presents techniques to analyze the containment time of randomized self-stabilizing algorithms for 1-faulty configurations. The execution of an algorithm is modeled as a Markov chain, and its complexity is reduced with the lumping technique. The power of this technique is demonstrated by an application to a $(\Delta + 1)$-coloring algorithm. We believe that the technique can also be applied to other self-stabilizing algorithms. We leave the application to problems such as maximal independent sets and maximal matchings for future work.

Conflicts of Interest:
The author declares no conflict of interest.

Lemma 3. Let $c$ be a configuration of $C_A$. An absorbing state of $C_A$ is reached from $c$ in expected $B$ steps if and only if $A$ stabilizes in expected $B$ rounds from $c$.

Figure 4. A 1-faulty configuration $c_0$ for algorithm $A_3$ where node $v$ was hit by a fault changing its color to $c_f$, causing a conflict. The corresponding graph $R_v$ is depicted on the right.

Figure 5. Elements of the partition of $\Sigma$ for a 1-faulty configuration of algorithm $A_3$ as described above. $\Sigma_0$ consists of legitimate configurations only.

Figure 6. Comparison of computed values with the approximations from Theorem 4 and Lemma 15. (a) Comparison of the computed value of $E[X_d]$ with $\log d$ (Theorem 4); (b) Comparison of the computed value of $\mathrm{Var}[X_d]$ with the approximation (Theorem 4); (c) $E[A_d]$ and $\mathrm{Var}[A_d]$ from Lemma 15.

Case 1: $v.c = 0$. This happens with probability $1/(2(d - i + 1))$. None of the $d - i$ non-center nodes $w$ sets $w.c = 0$; this has probability $(3/4)^{d-i}$.
Case 2: $v.c = 1$. Similar to Case 1.
Case 3: $v.c > 1$. This happens with probability $\frac{d-i-1}{2(d-i+1)}$. Note $d > i$. Non-center nodes can make any choice.

We first calculate the expected number $E[A_d]$ of rounds to reach the absorbing state $F$. With Theorem 4 this will enable us to compute the expected number $E[Z_d]$ of rounds required to reach a legitimate system state. To build the transition matrix of the lumped Markov chain, the $d + 4$ states are ordered as $I, C_0, C_1, \ldots, C_d, P, F$. Let $Q$ be the $(d + 3) \times (d + 3)$ upper left submatrix of $P$.
t. w.state = IN then state := OUT if state = OUT ∧ ∀w ∈ N(v) w.state = OUT then if random bit from 0,1 = 1 then state := IN

Lemma 1. Algorithm $A_1$ has contamination radius two.

Proof. Consider a 1-faulty configuration where node $v$ is hit by a memory corruption. First suppose the state of $v$ changed from IN to OUT. Let $u \in N(v)$, then u.state = OUT. If $u$ has a neighbor $w \neq v$ with w.state = IN, then $u$ will not change its state during recovery. Otherwise, if all neighbors of $u$ except $v$ had state OUT, node $u$ may change state during recovery. But since these neighbors of $u$ have a neighbor with state IN, they will not change their state. Thus, in this case only the neighbors of $v$ may change state during recovery. Now suppose that v.state changed from OUT to IN. Then $v$ and those neighbors of $v$ with state IN can change to OUT. Then, arguing as in the first case, only nodes within distance two of $v$ may change their state during recovery.

The numbers indicate the nodes' colors. If the left-most node is hit by a fault and changes its color to $\Delta - 1$, then it and its neighbor are enabled. With probability 0.5 the second node changes its color to $\Delta$. This enables the third node, which changes its color to $\Delta - 1$ with probability 0.5. This may cause a cascade of changes in which all nodes on the horizontal line change color.
1-faulty configurations of $A_1$ caused by a memory corruption at $v$. Nodes drawn in bold have state IN. Subgraph $H$ corresponds to $R_v$. In the left graph, if node $v$ changes to state OUT, then all nodes in $H$ are enabled; thus the worst-case stabilization time is equal to that of subgraph $H$. In the right graph, if node $v$ changes to IN, then $v$ and its two neighboring nodes all change to OUT, resulting in a configuration similar to the previous example.
1 nodes change state. Let $d = |N_{conflict}(v)|$. The subgraph $G_S$ induced by $S$ is a star graph with $d + 1$ nodes and center $v$. To find a lower bound for $E[Z_d]$ we may assume that $w$ can choose a color from $\{0, 1\} \setminus \mathit{tabu}$ with $\mathit{tabu} = \emptyset$ if $v.\mathit{final} = \mathit{false}$ and $\mathit{tabu} = \{v.c\}$ otherwise, and $v$ can choose a color from $\{0, 1, \ldots, d\} \setminus \mathit{tabu}$ with $\mathit{tabu} \subseteq \{0, 1\}$.

Table 1. This table summarizes the probabilities for all transitions.