Cluster-Fault Tolerant Routing in a Torus

The number of Internet-connected devices is growing very rapidly, to the point of raising fears of running out of available IP addresses. The number of sensors clearly follows this trend, thus inducing large sensor networks. It is insightful to make the comparison with the huge number of processors of modern supercomputers. In such large networks, the problem of node faults necessarily arises, with faults often happening in clusters. The tolerance to faults, and especially cluster faults, is thus critical. Furthermore, thanks to its advantageous topological properties, the torus interconnection network has been adopted by the major supercomputer manufacturers in recent years, thus proving its applicability. Acknowledging and embracing these two technological and industrial aspects, we propose in this paper a node-to-node routing algorithm in an n-dimensional k-ary torus that is tolerant to faults. This algorithm tolerates not only faulty nodes but also faulty node clusters. The described algorithm selects a fault-free path of length at most $n(2k + \lfloor k/2 \rfloor - 2)$ with an $O(n^2 k^2 |F|)$ worst-case time complexity, with $F$ the set of faulty nodes induced by the faulty clusters.


Introduction
As mentioned, for instance, in [1,2], the number of Internet-connected devices is seeing a very rapid growth, with even fears of running out of available IP addresses. It is clear that the number of sensors follows this trend, thus inducing large sensor networks. The interconnection issue of sensor networks is thus critical. Given the large number of sensors involved, it is critical that these interconnection networks come with efficient data communication algorithms to maximise the performance of the sensor network. Because hardware failure is highly probable given the scale of the network, the tolerance to faults by such routing algorithms is key to data communication efficiency and robustness.
Considering the number of network nodes involved, it is insightful to make the comparison with the number of processors of modern supercomputers. In such large networks of processors or sensors, the problem of faulty nodes necessarily arises, with faults often happening in clusters. The tolerance to faults, and especially cluster faults, is thus critical, as detailed below. Featuring advantageous topological properties, the torus network [3] has become very popular as the interconnection network of supercomputers (e.g., IBM Blue Gene/L and Blue Gene/P, Cray Titan (Gemini interconnect [4]) and Fujitsu K (Tofu interconnect [5])) [6], thus proving the applicability of the torus topology for large network interconnection. Such topological properties of an n-dimensional k-ary torus include the network degree $2n$, the diameter $n \lfloor k/2 \rfloor$ and the network order (i.e., number of nodes) $k^n$. This is to be compared with, for instance, the hypercube network [7] (n-cube) that was favoured for earlier supercomputers (e.g., the Cosmic cube [8]), of degree and diameter $n$ and network order $2^n$.
[…] paddy field. Sensors in the field are positioned according to the points of a two-dimensional lattice and each sensor has a transmission range so that it can at least communicate with its four neighbour sensors. In addition, the wrap-around edges that connect the sensors placed on the periphery of the paddy field are implemented with wires. Consequently, the sensor network forms a two-dimensional torus structure (see Figure 1b).
Another example with respect to smart agriculture is a plant factory [15,16]. In such a facility, plant pods are organised in a three-dimensional manner and cultivated under optimal growth conditions. In this situation as well, a wireless sensor network is suitable so as not to interfere with the automated operations of the factory. Precisely, in the factory the sensors are positioned according to the points of a three-dimensional lattice and each sensor has a transmission range so that it can at least communicate with its six neighbour sensors. In addition, on the floor, walls and ceiling, the wrap-around edges are implemented with wires. As a result, the sensor network forms a three-dimensional torus structure.
The rest of this paper is organised as follows. Previous and related works are discussed in Section 2. Graph notations and definitions together with lemmas and propositions are recalled and established in Section 3. The proposed algorithm is described and exemplified in Section 4. The algorithm correctness is formally established in Section 5; this is a major part of the paper. Complexity analysis, precisely the maximum path length and worst-case time complexity, is formally conducted in Section 6 to evaluate the theoretical performance of the algorithm, and the main theorem of this research is derived from it. Then, in Section 7, the average performance of the algorithm is empirically evaluated with computer experiments and compared to the theoretical values. Finally, this paper is concluded in Section 8.

Previous and Related Works
The torus topology is often presented as an extension of the mesh network [3] to which "wrap-around edges" have been added and it has been itself further extended [17], for instance to design a hierarchical interconnection network [9].
A few fault-tolerant routing algorithms have been described for a torus network. Torus fault-tolerant routing algorithms based on the simple node-fault tolerant model (i.e., not considering faulty clusters) have been described in [18] with a node-to-node and a node-to-set disjoint paths routing algorithm. In an n-dimensional k-ary torus ($n \ge 2$, $k \ge 4$), the former algorithm selects a fault-free path of length at most $n \lfloor k/2 \rfloor + 1$ in $O(n^2)$ time with a fault tolerance of at most $2n - 1$ faulty nodes, while the latter algorithm selects $f$ ($f \le 2n$) fault-free paths of lengths at most $n \lfloor k/2 \rfloor + 1$ in $O(n^3)$ time with a fault tolerance of at most $2n - f$ faulty nodes. Still based on the simple node-fault tolerant model, an adaptive node-to-node routing algorithm for a 2-dimensional torus has been given in [19].
Other routing algorithms that apply to a torus network have been proposed. A torus set-to-set disjoint paths routing algorithm has been described in [20]. In an n-dimensional k-ary torus ($n \ge 1$, $k \ge 3$), this algorithm selects $2n$ mutually node-disjoint paths between $2n$ source nodes and $2n$ destination nodes, without imposing a particular pairing. The paths are selected in $O(kn^3 + n^3 \log n)$ time and have lengths that are at most $2(k + 1)n$. Then, a torus pairwise disjoint paths routing algorithm has been presented in [21]. In an n-dimensional k-ary torus ($n < k$, $k \ge 5$), given $c$ ($c \le n$) source-destination node pairs, this algorithm finds $c$ mutually node-disjoint paths that connect the nodes of the $c$ pairs. The paths are selected in $O(nc^4)$ time and have lengths at most $2k(c - 1) + n \lfloor k/2 \rfloor$.
In other networks, fault-tolerance under the simple node-fault tolerant model has been treated, for instance, in hypercubes with a unicast algorithm [22] and a set-to-set disjoint paths routing algorithm [23], in star graphs with a set-to-set disjoint paths routing algorithm [24] and in burnt pancake graphs with a unicast algorithm [25]. Furthermore, fault-tolerant routing with an additional constraint regarding the nodes that can be selected has been discussed, for instance, in hypercubes [26]. Finally, routing algorithms based on the cluster-fault tolerant model have been proposed for burnt pancake graphs with a unicast algorithm [27], hypercubes with a unicast algorithm [28], a node-to-set and set-to-set disjoint paths routing algorithm [29] and a pairwise disjoint paths routing algorithm [30], and star graphs with a unicast algorithm [31] and a node-to-set and pairwise disjoint paths routing algorithm [32], amongst others.

Preliminaries
First, general graph theory notations and definitions are recalled; the notations and definitions that are not mentioned here are in accordance with [33].
Graphs herein are undirected. For a node $u$ in a graph $G$, let $N_G(u)$ be the set of the nodes adjacent to $u$ in $G$. A path in a graph $G$ is a connected acyclic sub-graph of $G$ of maximum degree 2. From this definition, a path necessarily has either one or two nodes of degree at most 1; they are called the end nodes of the path. For the sake of readability, a path is simply denoted by a sequence of nodes and edges as follows: $u_1 \to u_2 \to \ldots \to u_l$, and is further abbreviated to $u_1 \leadsto u_l$ when non-ambiguous. The node set $V(p)$ that includes the nodes of a path $p$ is simply denoted by $p$ when non-ambiguous. Two paths $p$, $q$ are mutually node-disjoint (or simply "disjoint") if and only if $p \cap q = \emptyset$. When the path intersection $p \cap q$ consists solely of end nodes, the two paths are said to be internally disjoint. A path is fault-free if and only if it does not include a faulty node. A path is blocked if it includes at least one faulty node. The length of a path is its number of edges.
Definition 1. An n-dimensional k-ary torus $T(n, k)$, $n \ge 1$, $k \ge 1$, consists of the $k^n$ nodes induced by the set $\{0, 1, \ldots, k - 1\}^n$. A node $u$ of a $T(n, k)$ is thus an n-tuple $(u_1, u_2, \ldots, u_n)$ with $u_i$ ($0 \le u_i \le k - 1$) the coordinate of $u$ for the dimension $i$ ($1 \le i \le n$). There is an edge between two nodes $u = (u_1, u_2, \ldots,$ […] Figure 2a,c, respectively.
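As an illustration of Definition 1, the following minimal Python sketch enumerates the nodes of a T(n, k) and the neighbours of a node. It assumes the usual torus adjacency (two nodes are adjacent when they differ by ±1 modulo k in exactly one coordinate); the function names `torus_nodes` and `torus_neighbours` are hypothetical and only serve this example.

```python
from itertools import product


def torus_nodes(n, k):
    """All k^n nodes of T(n, k), each one an n-tuple over {0, ..., k-1}."""
    return list(product(range(k), repeat=n))


def torus_neighbours(u, k):
    """Neighbours of node u in T(n, k).

    Assumption: standard torus adjacency, i.e. exactly one coordinate
    differs, and it differs by +1 or -1 modulo k.
    """
    neighbours = set()
    for i in range(len(u)):
        for step in (-1, 1):
            v = list(u)
            v[i] = (v[i] + step) % k
            neighbours.add(tuple(v))
    neighbours.discard(u)  # only matters for k = 1, where v == u
    return neighbours


nodes = torus_nodes(3, 5)
print(len(nodes))                             # network order k^n
print(len(torus_neighbours((0, 0, 0), 5)))    # network degree 2n when k >= 3
```

Under this assumption the sketch is consistent with the network order $k^n$ and the degree $2n$ quoted in the introduction.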

Definition 2.
For a node $u = (u_1, u_2, \ldots, u_n) \in T(n, k)$ and a dimension $\delta$ ($1 \le \delta \le n$), define the two paths […]
In this research, we consider faulty clusters that include at most two nodes. So, a cluster is formally defined as follows.

Definition 3.
A cluster $c$ of a graph $G$ is a connected subgraph of $G$ that is isomorphic either to a $K_1$ or to a $K_2$, where $K_n$ denotes the complete graph of order $n$. So, a cluster can be denoted simply as a node set.
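Definition 3 can be checked mechanically: a cluster is either a single node or two adjacent nodes. The sketch below is only illustrative; it assumes the usual torus adjacency (±1 modulo k in exactly one coordinate), and the helper names `are_adjacent` and `is_cluster` are hypothetical.

```python
def are_adjacent(u, v, k):
    """Assumed torus adjacency: u and v differ in exactly one coordinate,
    and that coordinate differs by +1 or -1 modulo k."""
    diffs = [(a - b) % k for a, b in zip(u, v) if a != b]
    return len(diffs) == 1 and diffs[0] in (1, k - 1)


def is_cluster(nodes, k):
    """True if the node set induces a K_1 or a K_2 in T(n, k)."""
    nodes = set(nodes)
    if len(nodes) == 1:
        return True
    if len(nodes) == 2:
        u, v = nodes
        return are_adjacent(u, v, k)
    return False


print(is_cluster({(0, 0)}, 5))            # single node: K_1
print(is_cluster({(0, 0), (4, 0)}, 5))    # wrap-around edge: K_2
print(is_cluster({(0, 0), (2, 0)}, 5))    # not adjacent: not a cluster
```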
Definition 4. For a graph $G$ and a cluster set $C$, the set $I(G, C) \subseteq C$ consists of the clusters of $C$ that have at least one node in $G$. That is, $I(G, C) = \{c \in C \mid c \cap V(G) \neq \emptyset\}$.
Next, several torus properties are recalled. First, a torus has a recursive structure; this is Proposition 1 below.
the node coordinate for the dimension j and i that for the dimension δ.
The three sub-tori

Definition 5.
For a node $u \in T(n, k)$ and a dimension $\delta$ ($1 \le \delta \le n$), $T^{\delta}_{u}$ is the sub-torus of $u$ and $t^{\delta}_{u} \in \mathbb{N}$ is such that $u \in T_{t^{\delta}_{u},\delta}(n - 1, k)$.
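To make Definitions 4 and 5 concrete, here is a small Python sketch. It assumes that the sub-torus $T_{i,\delta}(n - 1, k)$ consists of the nodes whose coordinate for the dimension $\delta$ equals $i$, so that $t^{\delta}_{u}$ is simply that coordinate of $u$. Dimensions are 1-based as in the text, clusters are represented as sets of node tuples, and the function names are hypothetical.

```python
def subtorus_index(u, delta):
    """t^delta_u: index of the sub-torus that contains u when T(n, k) is
    split along dimension delta (1-based). Assumption: it is the
    delta-th coordinate of u."""
    return u[delta - 1]


def intersecting_clusters(i, delta, clusters):
    """I(T_{i,delta}(n-1, k), C): clusters of C with at least one node
    inside the sub-torus of index i along dimension delta."""
    return [c for c in clusters
            if any(subtorus_index(v, delta) == i for v in c)]


# Three clusters in a T(2, 5), split along dimension 1.
clusters = [{(0, 0), (1, 0)}, {(0, 2), (1, 2)}, {(3, 3)}]
print(subtorus_index((3, 3), 1))
print(len(intersecting_clusters(0, 1, clusters)))
print(len(intersecting_clusters(2, 1, clusters)))
```

A 2-node cluster whose nodes lie in two neighbouring sub-tori is counted for both of them, which is the "possibly partially" inclusion used later in the correctness proof.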
Second, there exist disjoint paths between sub-tori.

Lemma 1.
For a node $u \in T(n, k)$ and a dimension $\delta$ ($1 \le \delta \le n$), there exists a set $P^{i,\delta}_{u}$ of $2n - 1$ internally disjoint paths of lengths at most $k - 1$ between $u$ and the nodes of a sub-torus $T_{i,\delta}(n - 1, k)$ with $1 \le i \le k$.

Proof.
A constructive proof is given. For the node set $\{u_1,$ […] The paths in $P^{i,\delta}_{u}$ are internally disjoint by Definition 2.

Lemma 2.
For a node $u \in T(n, k)$ and a dimension $\delta$ ($1 \le \delta \le n$), each path $p \in P^{i,\delta}_{u}$ has a variant $\tilde{p}$ of same end nodes such that the paths of the set $(P^{i,\delta}_{u} \setminus P) \cup \tilde{P}$ are internally disjoint for any subset $P \subseteq P^{i,\delta}_{u}$ with […]

Lemma 3.
For a node $u = (u_1, u_2, \ldots, u_n) \in T(n, k)$ and a dimension $\delta$ ($1 \le \delta \le n$), there exists a set $Q^{i,\delta}_{u}$ of $4n - 3$ paths of lengths at most $k$ between $u$ and the nodes of a sub-torus […]

Cluster-Fault Tolerant Routing Algorithm
Inside an n-dimensional k-ary torus $T(n, k)$ that includes a set $C$ of at most $2n - 1$ faulty clusters (which induce the set of faulty nodes $F$), we describe a routing algorithm that selects a fault-free path between any two nodes $s, d \in T(n, k)$ with $s, d \notin F$.

Algorithm Description
First, we present the assumptions made and the main idea of the proposed routing algorithm. A T(1, k) is isomorphic to a ring, and it is thus trivial to find a fault-free path between any two non-faulty nodes given that there is at most 2n − 1 = 1 faulty cluster. So, we can assume that n ≥ 2.
A $T(n, 1)$ has one single node and thus it is trivial to find a fault-free $s \leadsto d$ path ($s = d$). It is easy to show that this problem is not solvable when $k \in \{2, 3, 4\}$: see Figure 4. Hence, a torus arity $k \ge 5$ is considered hereinafter.
The main idea of this algorithm is to follow a divide-and-conquer approach by routing $s$ to a node of $T^{\delta}_{d}$ and to apply this algorithm recursively in $T^{\delta}_{d}$. Consider an arbitrary dimension $\delta$ ($1 \le \delta \le n$). We distinguish the following mutually exclusive cases.
Case 0 (base case) $T(n, k)$ is fault-free (i.e., $C = F = \emptyset$): This is simple point-to-point routing. A path between $s$ and $d$ is selected with a dimension-order routing algorithm [3].
[…]: We can apply the algorithm recursively in neither $T^{\delta}_{s}$ nor $T^{\delta}_{d}$, so we use another sub-torus. Route $s$ and $d$ to an available sub-torus $T_{i,\delta}(n - 1, k)$, that is, one satisfying $|I(T_{i,\delta}(n - 1, k), C)| \le 2n - 3$, with a fault-free path of $P^{i,\delta}_{s} \cup \tilde{P}^{i,\delta}_{s}$ and a fault-free path of $P^{i,\delta}_{d} \cup \tilde{P}^{i,\delta}_{d}$, respectively. Let $p_s : s \leadsto s' \in T_{i,\delta}(n - 1, k)$ (resp. $p_d : d \leadsto d' \in T_{i,\delta}(n - 1, k)$) be the selected path that connects $s$ (resp. $d$) to a node of $T_{i,\delta}(n - 1, k)$. If these paths are not disjoint, consider the node $u \in p_s \cap p_d$ that is the closest to $s$, discard the sub-paths $(u \leadsto s') \subset p_s$ and $(u \leadsto d') \subset p_d$ and terminate. Otherwise, apply this algorithm recursively in $T_{i,\delta}(n - 1, k)$ with $s'$ as source node, $d'$ as destination node and $\{c \cap T_{i,\delta}(n - 1, k) \mid c \in I(T_{i,\delta}(n - 1, k), C)\}$ as cluster set.
[…] the set of $8n - 6$ paths from $s$ to a node of $T^{\delta}_{d}$.
Case 2.1 (special case) $s$ not routable to $T^{\delta}_{d}$ (i.e., $\forall p \in Q$, $p \cap F \neq \emptyset$): Route $s$ and $d$ to an available sub-torus $T_{i,\delta}(n - 1, k)$, that is, one satisfying $|I(T_{i,\delta}(n - 1, k), C)| \le 2n - 3$, other than $T^{\delta}_{d}$, with a fault-free path of $P^{i,\delta}_{s} \cup \tilde{P}^{i,\delta}_{s}$ and a fault-free path of $P^{i,\delta}_{d} \cup \tilde{P}^{i,\delta}_{d}$, respectively.
Let $p_s : s \leadsto s' \in T_{i,\delta}(n - 1, k)$ (resp. $p_d : d \leadsto d' \in T_{i,\delta}(n - 1, k)$) be the selected path that connects $s$ (resp. $d$) to a node of $T_{i,\delta}(n - 1, k)$. If these paths are not disjoint, consider the node $u \in p_s \cap p_d$ that is the closest to $s$, discard the sub-paths $(u \leadsto s') \subset p_s$ and $(u \leadsto d') \subset p_d$ and terminate. Otherwise, apply this algorithm recursively in $T_{i,\delta}(n - 1, k)$ with $s'$ as source node, $d'$ as destination node and $\{c \cap T_{i,\delta}(n - 1, k) \mid c \in I(T_{i,\delta}(n - 1, k), C)\}$ as cluster set.
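The step that handles two intersecting paths can be sketched as follows. This is a minimal illustration, assuming that a path is stored as a list of nodes starting at its source (the path from the source node first, then the path from the destination node); the function name `merge_if_intersecting` is hypothetical.

```python
def merge_if_intersecting(p_s, p_d):
    """If the two paths share a node, return the source-to-destination
    path obtained by cutting both paths at the common node closest to
    the source. Return None when the paths are disjoint, in which case
    the recursion inside the sub-torus is needed."""
    position_on_p_d = {node: idx for idx, node in enumerate(p_d)}
    for idx_s, node in enumerate(p_s):          # scan from the source outwards
        if node in position_on_p_d:
            idx_d = position_on_p_d[node]
            # Keep source..u from p_s, then walk p_d backwards from u to d.
            return p_s[:idx_s + 1] + p_d[:idx_d][::-1]
    return None


p_s = [(0, 0), (1, 0), (2, 0)]
p_d = [(1, 2), (1, 1), (1, 0), (2, 0)]
print(merge_if_intersecting(p_s, p_d))
```

Because the cut is made at the first node of the source path that also lies on the destination path, the kept prefix shares no other node with the destination path, so the result is again a path.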

Proof of Correctness
In this section, the correctness of the proposed algorithm is established. Each of the distinguished cases is treated separately.

Case 0
There is no faulty node inside T(n, k), so a dimension-order routing algorithm can be applied.
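For reference, dimension-order routing in a fault-free torus can be sketched as below. This is one common formulation (correct the coordinates one dimension at a time, each time going around the ring in the shorter direction) and not necessarily the exact variant of [3]; the function name is hypothetical.

```python
def dimension_order_route(s, d, k):
    """Path (list of nodes) from s to d in a fault-free T(n, k),
    fixing the coordinates dimension by dimension."""
    path = [tuple(s)]
    current = list(s)
    for i in range(len(s)):
        forward = (d[i] - current[i]) % k       # hops needed in the +1 direction
        backward = k - forward                  # hops needed in the -1 direction
        step = 1 if forward <= backward else -1
        for _ in range(min(forward, backward)):
            current[i] = (current[i] + step) % k
            path.append(tuple(current))
    return path


route = dimension_order_route((0, 0, 0), (4, 2, 3), 5)
print(route)
print(len(route) - 1)   # path length in edges
```

Each dimension contributes at most $\lfloor k/2 \rfloor$ edges, so the selected path has length at most $n \lfloor k/2 \rfloor$, which matches the torus diameter.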

Case 1.1
The sub-tori $T^{\delta}_{s}$ and $T^{\delta}_{d}$ each include at least $2n - 2$ clusters (possibly partially, i.e., $|I(T^{\delta}_{d}, C)| > 2n - 3$) and thus cannot be used to solve the problem recursively. Here are two necessary conditions for this situation to occur: (1) $T^{\delta}_{s}$ is adjacent to $T^{\delta}_{d}$ (i.e., $t^{\delta}_{d} = t^{\delta}_{s} \pm 1 \pmod{k}$) and (2) out of the at most $2n - 1$ clusters, at least $2n - 3$ of them have two nodes, with one node in $T^{\delta}_{s}$ and the other in $T^{\delta}_{d}$.
First, we show that there exist at least three sub-tori that are available for recursion, that is, that include at most $2n - 3$ clusters. It is recalled that neither $T^{\delta}_{s}$ nor $T^{\delta}_{d}$ is available. At least $2n - 3$ clusters are included (completely contained) in $T^{\delta}_{s} \cup T^{\delta}_{d}$. Hence, for both $T^{\delta}_{s}$ and $T^{\delta}_{d}$ to satisfy $|I(T^{\delta}_{d}, C)| > 2n - 3$, there remain at most two faulty nodes that are included in other sub-tori (i.e., neither in $T^{\delta}_{s}$ nor in $T^{\delta}_{d}$). These two faulty nodes are either part of the same cluster, or they are part of two distinct 2-node clusters that have one node in $T^{\delta}_{s}$ and one node in $T^{\delta}_{d}$, respectively. So, importantly, if these at most two faulty nodes are part of two distinct 2-node clusters, these two faulty nodes are necessarily located in different sub-tori. See Figure 5.
Figure 5. The two possible cluster distributions in Case 1.1 with respect to $T^{\delta}_{s}$ and $T^{\delta}_{d}$ when $n = 3$ and thus at most five clusters. Ellipses separate sub-tori.
For the algorithm to be applied recursively inside a sub-torus, this sub-torus can include at most $2n - 3$ clusters. So, at least $2n - 2$ clusters are needed to make one sub-torus unavailable. Therefore, neither the at most one 2-node cluster nor the at most two faulty nodes located in distinct sub-tori suffice to make another sub-torus unavailable for recursion since they would induce at most one cluster inside a sub-torus, and $1 < 2n - 2$ given that $n \ge 2$. Therefore, since there exist at least $k \ge 5$ sub-tori and at most two of them are unavailable ($T^{\delta}_{s}$ and $T^{\delta}_{d}$), at least three sub-tori always remain available for recursion.
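The counting argument above can be checked on a small instance with the following sketch. It assumes, as an illustration, that the sub-torus of index $i$ along dimension $\delta$ is made of the nodes whose $\delta$-th coordinate equals $i$; the function name `available_subtori` and the example clusters are hypothetical.

```python
def available_subtori(n, k, delta, clusters):
    """Indices i of the sub-tori along dimension delta (1-based) that
    intersect at most 2n - 3 clusters, i.e. that can host the recursion."""
    limit = 2 * n - 3
    available = []
    for i in range(k):
        count = sum(1 for c in clusters
                    if any(v[delta - 1] == i for v in c))
        if count <= limit:
            available.append(i)
    return available


# T(2, 5) with 2n - 1 = 3 clusters: two 2-node clusters spanning the
# sub-tori 0 and 1 (both become unavailable) and one single faulty node.
clusters = [{(0, 0), (1, 0)}, {(0, 2), (1, 2)}, {(3, 3)}]
print(available_subtori(2, 5, 1, clusters))
```

In this configuration the sub-tori of index 0 and 1 each intersect two clusters, which exceeds $2n - 3 = 1$, while the three remaining sub-tori stay available.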
Next, we show that both s and d are routable to at least one of these three available sub-tori. If s and d are adjacent, select s → d and there is nothing else to prove. So, we can assume that s and d are not adjacent. We distinguish the following mutually exclusive sub-cases which are exhaustive.
This case can occur only when $n = 2$ and it implies that either (a) there are two 2-node clusters each included in […], or (b) there is only one such cluster and the other two clusters respectively have at least one node in $N_{T^{\delta}_{s}}(s)$ and one node in $N_{T^{\delta}_{d}}(d)$. As shown in Figure 6, the former case (a) occurs only when $k = 4$ and the latter case (b) only when $k = 3$. Hence, given that $k \ge 5$ is assumed, these two cases shall never occur and there is thus nothing to prove.
Ellipses separate sub-tori, faulty nodes are greyed and the clusters that include two nodes are materialised with thicker lines.
Since both $T^{\delta}_{s}$ and $T^{\delta}_{d}$ are unavailable, at least $2n - 3$ clusters each have one node in $T^{\delta}_{s}$ and the other in $T^{\delta}_{d}$, and either (a) at least one other cluster also does, or (b) the two other clusters each have at least one node in $T^{\delta}_{s} \cup T^{\delta}_{d}$. In the former case (a), at least $2n - 2$ clusters each have one node in $T^{\delta}_{s}$ and the other in $T^{\delta}_{d}$, and those clusters thus can block at most one path of $\{p^{+}_{s,\delta}, p^{-}_{s,\delta}\}$ and at most one path […]
[…] with $c' \le 2(n - 1) - 1$ (and obviously $c' \le c$). The relation $c' \le 2(n - 1) - 1$ is the invariant of the recursion. Since $n$ is decreased by one at each step, $c$ is guaranteed to reach 0, that is, the base case of the recursion. Therefore, the total worst-case time complexity of the proposed algorithm is $O(n^2 k^2 |F|)$ and the maximum path length is $n(2k - 2) + n \lfloor k/2 \rfloor = n(2k + \lfloor k/2 \rfloor - 2)$.
The described algorithm selects a fault-free path of length at most $n(2k + \lfloor k/2 \rfloor - 2)$ with an $O(n^2 k^2 |F|)$ worst-case time complexity, with $F$ the set of faulty nodes induced by the faulty clusters. The maximum path length is of the same order as the network diameter, $O(nk)$, and is thus on par with previous works on node-to-node routing under the cluster-fault tolerant model [27,28,31].
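As a quick numerical illustration of this bound, the sketch below evaluates the maximum path length $n(2k + \lfloor k/2 \rfloor - 2)$ next to the torus diameter $n \lfloor k/2 \rfloor$ for a few small tori. The function names are hypothetical; only the two formulas come from the text.

```python
def max_path_length(n, k):
    """Upper bound on the length of the selected path:
    n(2k - 2) + n*floor(k/2) = n(2k + floor(k/2) - 2)."""
    return n * (2 * k + k // 2 - 2)


def diameter(n, k):
    """Diameter of T(n, k): n * floor(k/2)."""
    return n * (k // 2)


for n in range(2, 8):
    k = 5
    print(n, k, diameter(n, k), max_path_length(n, k))
```

For instance, in a T(3, 5) the bound is $3 \times (10 + 2 - 2) = 30$ edges while the diameter is $3 \times 2 = 6$; both grow linearly in $nk$.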

Empirical Evaluation
Now that the worst-case complexities have been established in Section 6, we inspect the average behaviour of the proposed algorithm, implemented to this end. Two experiments were conducted: the first one aims at measuring the maximum length of a path selected by the proposed algorithm, and the second one at measuring the average execution time taken by the algorithm to solve one instance of the torus cluster-fault tolerant routing problem. These experiments were conducted on a computer equipped with an Intel Core i5-1035G7 processor (clocked at 1.20 GHz) and 8 GB RAM, and running Windows 10 Home 64-bit.
The experimental conditions for the first experiment (i.e., maximum path length measurement) were as follows: in a $T(n, 5)$, the source and destination nodes are randomly selected in the set of all the torus nodes. The torus arity $k$ was fixed to 5 in this experiment to maximise the routing difficulty since the number of faults depends on $n$ and not on $k$. Then, $2n - 1$ faulty clusters, which is the maximum number that can be tolerated, were also randomly generated. The faulty clusters are all of diameter one, once again to maximise the routing problem difficulty (i.e., a higher number of faulty nodes). Then, the algorithm implementation was used to solve the corresponding routing problem and the length of the selected path was recorded. This process was repeated 10,000 times for each value of $n$ with $2 \le n \le 7$, and two path length values were calculated for each $n$: the maximum path length and the average path length of the 10,000 selected paths. The results of this first experiment are shown in Figure 8, together with the theoretical maximum path length as established previously in Section 6 for reference.
The second experiment (i.e., execution time measurement) was conducted in the same experimental conditions as the first experiment, with the exception that the routing problem was solved in a $T(n, \max\{5, n + 1\})$: the arity $k$ was set to $\max\{5, n + 1\}$ in this timing experiment to evaluate the average time complexity as $k$ and $n$ both increase. The path selection algorithm was run 10,000 times for each $(n, \max\{5, n + 1\})$ pair with $2 \le n \le 7$, each time measuring the real CPU time (i.e., excluding the time for garbage collection) taken to solve the problem instance. The obtained results are given in Figure 9, together with the worst-case time complexity as established previously in Section 6 for reference.
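The generation of one random problem instance, as described for these experiments, could look like the following sketch. It is not the implementation used for the reported measurements: the function name is hypothetical, and it additionally assumes that the source and destination nodes are distinct and that the $2n - 1$ two-node clusters are pairwise node-disjoint and avoid the source and destination nodes.

```python
import random


def random_instance(n, k, rng=random):
    """Draw a source s, a destination d and 2n - 1 faulty clusters of
    diameter one (two adjacent nodes each) in T(n, k), with k >= 3."""
    def random_node():
        return tuple(rng.randrange(k) for _ in range(n))

    s = random_node()
    d = random_node()
    while d == s:                      # assumption: s and d are distinct
        d = random_node()

    used = {s, d}                      # nodes that may not become faulty
    clusters = []
    while len(clusters) < 2 * n - 1:
        u = random_node()
        i = rng.randrange(n)           # dimension along which the cluster lies
        step = rng.choice((-1, 1))
        v = u[:i] + ((u[i] + step) % k,) + u[i + 1:]
        if u in used or v in used:
            continue                   # reject and redraw
        used.update((u, v))
        clusters.append({u, v})
    return s, d, clusters


s, d, clusters = random_instance(3, 5, random.Random(0))
print(s, d, len(clusters))
```

Passing a seeded `random.Random` instance, as in the usage line, makes a run reproducible.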
The following observations can be made from the obtained experimental results. First, regarding the maximum path length, one can note that it remains well below the theoretical upper bound, which is an indicator of the good performance of the algorithm. Second, regarding the average execution time, one can note that it remains well below the worst-case time complexity, which is yet another indicator of the efficiency of the proposed algorithm.

Conclusions
The growing number of Internet-connected devices and their sensors, comparable to that of the computing nodes included in modern supercomputers, induces large interconnection networks. Hence, the performance of networks on this scale is tied to efficient and robust data routing. For example, major supercomputer makers such as IBM, Cray and Fujitsu have been relying on the torus topology for the interconnection network because of its advantageous topological properties. The torus topology is also applicable to interconnect sensor networks, for instance, to report information collected by sensors across the network to the network user. Given the huge number of network nodes involved, faults are very likely to occur. A routing algorithm in a torus that is tolerant to faults is thus key for the future of such networks and has direct implications for quality of service by reducing the number of failed data communications. Furthermore, since hardware properties cause faults to often happen in clusters (e.g., the same power supply unit serves a few nodes), it is critical to tolerate not only node faults but also cluster faults. Improving on Menger's condition on the maximum number of node faults that can be tolerated, and on torus fault-tolerant routing algorithms described in previous works, we have proposed in this paper for the first time a node-to-node routing algorithm in a torus that is tolerant to cluster faults. In a $T(n, k)$ with at most $2n - 1$ faulty clusters of diameter at most 1, the described algorithm selects a fault-free path of length at most $n(2k + \lfloor k/2 \rfloor - 2)$ with an $O(n^2 k^2 |F|)$ worst-case time complexity, with $F$ the set of faulty nodes induced by the faulty clusters.
Regarding future works, it will be meaningful to first try to consider faulty clusters of diameter 2, possibly reducing the number of tolerated faulty clusters. Then, selecting several fault-free disjoint paths between the source and destination nodes can be considered. Furthermore, measuring the average performance of the proposed algorithm and comparing the results with the formally established worst-case complexities (maximum path length and time complexity) is yet another research route.