Network Coding Approaches for Distributed Computation over Lossy Wireless Networks

In wireless distributed computing systems, worker nodes connect to a master node wirelessly and perform large-scale computational tasks that are parallelized across them. However, the common phenomenon of straggling (i.e., worker nodes often experience unpredictable slowdown during computation and communication) and packet losses due to severe channel fading can significantly increase the latency of computational tasks. In this paper, we consider a heterogeneous, wireless, distributed computing system performing large-scale matrix multiplications, which form the core of many machine learning applications. To address the aforementioned challenges, we first propose a random linear network coding (RLNC) approach that leverages the linearity of matrix multiplication and has many salient properties, including ratelessness, maximum straggler tolerance and near-ideal load balancing. We then theoretically demonstrate that its latency converges to the optimum in probability when the matrix size grows to infinity. To combat the high encoding and decoding overheads of the RLNC approach, we further propose a practical variation based on batched sparse (BATS) codes. The effectiveness of our proposed approaches is demonstrated by numerical simulations.


Introduction
In recent years, due to the proliferation of computationally intensive applications at the wireless edge, such as federated learning [1] and image recognition [2], wireless distributed computing, in which large-scale computational tasks are carried out collaboratively by a cluster of wireless devices, has drawn great interest [3,4]. Meanwhile, due to the inherent randomness of the wireless environment, wireless distributed computing systems face multiple challenges. One main challenge is the straggler issue: computing devices often experience unpredictable slowdown or even dropout during computation and communication, which can greatly increase the latency of a computational task or even cause it to fail [5]. Another challenge is the packet-loss issue: packets can be lost during transmission due to severe channel fading in wireless networks.
In this paper, we consider a typical wireless distributed computing system consisting of multiple worker nodes and a master node. We focus on distributed matrix multiplication y = Ax, which forms the core of many computation-intensive machine learning applications, such as linear regression, and aim to tackle the two challenges above. One common approach to mitigating the effect of stragglers is to provide redundancy through replication [6][7][8], which has been widely used in large distributed systems such as MapReduce [9] and Spark [10]. However, this kind of r-replication strategy can only tolerate r stragglers, and using a larger r increases the computation redundancy, which can lead to poor performance.
Our main contributions are as follows.
• We first propose a random linear network coding (RLNC) [19] based approach. In this approach, the matrix A to be multiplied is first split into multiple submatrices $A_1, \ldots, A_k$, and each worker node is assigned multiple submatrices, each of which is a random linear combination of $A_1, \ldots, A_k$. Each worker node multiplies each assigned submatrix with the input x, and for each transmission it generates a random linear combination of the submatrix-vector products computed so far. Once it has received enough packets with linearly independent global encoding vectors, the master node can recover the desired result Ax by Gaussian elimination. We model the computation and communication process as a continuous-time trellis, and by conducting a probabilistic analysis of the connectivity of the trellis, we theoretically show that the latency of the RLNC approach converges to the optimum in probability when the matrix size grows to infinity.
• Since the RLNC approach has high encoding and decoding costs, we further propose a practical variation of the RLNC approach based on batched sparse (BATS) codes [20] and show how to optimize the performance of the BATS approach.
• We conduct numerical simulations to evaluate the proposed RLNC and BATS approaches. The simulation results show that both approaches can overcome the straggler issue and the packet-loss issue effectively and achieve near-optimal performance.
The remainder of the paper is organized as follows. Section 2 introduces the system model. Sections 3 and 4 introduce the RLNC approach and the BATS approach, respectively. Section 5 presents the numerical evaluation results. Finally, Section 6 concludes.

Coding-Based Wireless Distributed Computation
As shown in Figure 1, we consider a heterogeneous wireless distributed computing system consisting of a master node and n heterogeneous worker nodes. These worker nodes, denoted by $w_1, w_2, \ldots, w_n$, are connected wirelessly to the master node. We focus on the matrix-vector multiplication problem, whose goal is to compute the result y = Ax for a given matrix $A \in \mathbb{R}^{m \times d}$ and an arbitrary vector $x \in \mathbb{R}^{d \times 1}$, where $\mathbb{R}$ is the set of real numbers. Our results can be directly extended to matrix-matrix multiplication, where x is a small matrix.
[Figure 1: master node; worker 1, worker 2, ...; packet loss on the wireless links.]
In order to mitigate the effect of unpredictable node slowdown during computation and communication, we consider an error-correcting-code-based computing framework which consists of four components:
• Encoding before computation: The matrix A is first split along its rows equally into k submatrices $A_1, \ldots, A_k$, i.e., $A = [A_1^T, A_2^T, \ldots, A_k^T]^T$. Without loss of generality, we assume that m/k is an integer. These submatrices are encoded into more submatrices using an error-correcting code, and the coded submatrices are then placed on worker nodes. The submatrices assigned to worker node $w_i$ are denoted as $\tilde{A}_{i,1}, \tilde{A}_{i,2}, \ldots, \tilde{A}_{i,k_i}$, where $k_i$ is the number of submatrices assigned to $w_i$. We emphasize that, in many applications, such as linear regression, this encoding is used for multiple computations with different inputs x [11], so the encoding is often required to be executed before the arrival of any x.
• Computation at each worker node: When an input x arrives at the master node, the master node broadcasts x to all worker nodes. Once worker node $w_i$ receives x, it computes $\tilde{A}_{i,1} x, \tilde{A}_{i,2} x, \ldots, \tilde{A}_{i,k_i} x$ sequentially.
• Communication from each worker node: During the computation, each worker node keeps sending its local computation results to the master node. For this, each submatrix-vector product, which is a vector of length m/k, is encapsulated into a packet. We assume that the communication link between worker node $w_i$ and the master node can be modeled as a packet erasure channel, where each packet is erased independently with probability $\varepsilon_i$. In order to combat these packet losses, each worker node can transmit its local computation results using a coding-based approach.
• Decoding at the master node: Once the master node receives enough information, it recovers the desired result y = Ax and notifies all the worker nodes to stop the computation.

Delay Model
In this paper, we mainly focus on minimizing the latency, which is the time required by the wireless computing system so that the result y = Ax can be successfully decoded at the master node by aggregating the results sent from the worker nodes. For the characterization of the latency, we consider the following two models, one for computation delay and the other for communication delay.
As in [14], we consider a computation delay model as follows. The computation delay at each worker node w i consists of two parts. The first is an initial setup time before w i starts to perform any submatrix-vector multiplication, denoted by X i , which is assumed to follow an exponential distribution with rate λ i . The second is a constant time for calculating each submatrix-vector product, which is denoted by τ i . Hence, the delay for computing r submatrix-vector products by w i is X i + τ i r.
In order to characterize the straggling effect during communication, we model the communication time of a packet from worker node $w_i$ to the master node as a shifted exponential random variable with rate $\mu_i$ and shift parameter $\theta_i$. Additionally, the communication times of all packets are mutually independent. This model has also been adopted in [17,21].
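The two delay models can be sampled directly, which is how a simulation of this system would typically draw its timings. The following is a minimal sketch, not the authors' simulator; the function names and all numerical parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def computation_delay(rng, lam, tau, r):
    """Delay for one worker to finish r submatrix-vector products: X + tau * r,
    where the setup time X is exponential with rate lam."""
    setup = rng.exponential(scale=1.0 / lam)
    return setup + tau * r

def communication_delay(rng, mu, theta, size=None):
    """Per-packet communication time: shift theta plus an exponential with rate mu."""
    return theta + rng.exponential(scale=1.0 / mu, size=size)

rng = np.random.default_rng(0)
# Hypothetical parameter values (assumptions, not taken from the paper).
comp = computation_delay(rng, lam=2.0, tau=0.01, r=100)
comm = communication_delay(rng, mu=50.0, theta=0.005, size=10_000)
print(f"computation delay for 100 products: {comp:.3f}")
print(f"mean packet time: {comm.mean():.4f} (model mean: {0.005 + 1 / 50.0:.4f})")
```

Under this model the mean packet time is the shift plus the reciprocal of the rate, which the sample mean should approach.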

A Network Coding Approach
In order to combat the straggling effects during both computation and communication, as well as the packet losses during communication, in this section we propose a random linear network coding (RLNC)-based approach. We show that, when the incurred overheads are ignored, it achieves optimal latency in the asymptotic sense, i.e., as the number of rows of A goes to infinity. A practical version of this approach is given in the next section.

Description
We describe the RLNC-based approach following the computing framework given in Section 2.1.

Encoding before computation: In the RLNC-based approach, each submatrix $\tilde{A}_{i,j}$ assigned to worker node $w_i$ is a random linear combination of $A_1, \ldots, A_k$; i.e.,

$$\tilde{A}_{i,j} = \sum_{e=1}^{k} c_{i,j,e} A_e, \tag{1}$$

where each $c_{i,j,e}$ is chosen randomly and independently according to a standard normal distribution. Since this encoding approach is rateless, $k_i$ can be arbitrarily large.

Computation at each worker node: When worker node $w_i$ receives an input x, it starts to compute the local results $\tilde{y}_{i,j} = \tilde{A}_{i,j} x$, $j = 1, 2, \ldots$, sequentially.

Communication from each worker node: For each packet transmission starting at time t, worker node $w_i$ generates a random linear combination of all the local computation results in hand as

$$\hat{y}_{i,t} = \sum_{j=1}^{d_i(t)} c_j \tilde{y}_{i,j}, \tag{2}$$

where $d_i(t)$ is the number of local results that have been computed before time t by $w_i$. Here, $(c_1, \ldots, c_{d_i(t)})$ is referred to as the local encoding vector of $\hat{y}_{i,t}$.

Decoding at the master node: Due to the linearity of matrix-vector multiplication, we can see that

$$\hat{y}_{i,t} = \sum_{j=1}^{d_i(t)} c_j \sum_{e=1}^{k} c_{i,j,e} A_e x = \sum_{e=1}^{k} \Big( \sum_{j=1}^{d_i(t)} c_j c_{i,j,e} \Big) A_e x, \tag{3}$$

i.e., each packet received by the master node is a linear combination of $A_1 x, \ldots, A_k x$. The vector $\big( \sum_{j} c_j c_{i,j,1}, \ldots, \sum_{j} c_j c_{i,j,k} \big)$ is referred to as the global encoding vector of $\hat{y}_{i,t}$. Hence, when the master node has received enough packets to have k linearly independent global encoding vectors, it can recover the desired results $A_1 x, A_2 x, \ldots, A_k x$ by Gaussian elimination.

Overhead: Our RLNC approach suffers from high encoding and decoding complexities, just like RLNC for communication. More specifically, the encoding cost per submatrix is $O(k \cdot \frac{m}{k} \cdot d) = O(md)$, and the total decoding cost is $O(k^3 + k^2 \cdot \frac{m}{k}) = O(k^3 + km)$. The encoding cost is high, but the encoding needs to be done only once, before any computation, and can then be reused to compute Ax for as many different inputs x as needed. The decoding cost is also high when k is large, but it is independent of d, the number of columns of A. Thus, when d is very large, the decoding cost at the master node can be much lower than the computation cost at each worker node. In addition, the decoding at the master node can be done incrementally using Gauss-Jordan elimination, which can further reduce the decoding latency.
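The four steps described above can be illustrated end to end on a toy instance. The following is a minimal sketch under stated assumptions: the sizes, the number of workers, and the loss probability are hypothetical; coded submatrices are generated on demand here for brevity (in the framework they are prepared before x arrives); and `numpy.linalg.lstsq` is used in place of Gaussian elimination to solve the resulting linear system.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, k = 12, 5, 4                 # toy sizes (assumed); m/k must be an integer
n_workers, loss_prob = 3, 0.3      # hypothetical system parameters
A = rng.standard_normal((m, d))
x = rng.standard_normal(d)
blocks = A.reshape(k, m // k, d)   # row blocks A_1, ..., A_k

def encode(coeffs):
    """Coded submatrix: sum_e coeffs[e] * A_e."""
    return np.tensordot(coeffs, blocks, axes=1)

coeffs = [[] for _ in range(n_workers)]    # encoding coefficients per worker
results = [[] for _ in range(n_workers)]   # local results per worker
G_rows, payloads = [], []                  # what the master has received

# Each round, every worker finishes one more product and sends one coded packet.
while not G_rows or np.linalg.matrix_rank(np.array(G_rows)) < k:
    for i in range(n_workers):
        c = rng.standard_normal(k)
        coeffs[i].append(c)
        results[i].append(encode(c) @ x)
        c_local = rng.standard_normal(len(results[i]))   # local encoding vector
        if rng.random() < loss_prob:                     # packet erased
            continue
        G_rows.append(c_local @ np.array(coeffs[i]))     # global encoding vector
        payloads.append(c_local @ np.array(results[i]))  # packet payload

# Master: solve G * [A_1 x; ...; A_k x] = payloads (least squares as a stand-in).
G, P = np.array(G_rows), np.array(payloads)
Y, *_ = np.linalg.lstsq(G, P, rcond=None)
y_hat = Y.reshape(-1)
print("max recovery error:", np.abs(y_hat - A @ x).max())
```

The loop stops as soon as the received global encoding vectors span all k dimensions, which mirrors the rateless behavior: no particular packet is required, only enough independent ones.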
Note that the global encoding vector is required by the master node for decoding. To achieve this efficiently, we use a pseudo-random number generator to generate the local encoding vector for each transmitted packet, and we append to the packet the random seed together with the number of local results that have been computed. The master node can then obtain the global encoding vector according to (3). In this way, the coefficient overhead is negligible, in contrast to traditional RLNC for communication networks.
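The seed-based header can be sketched as follows. This is an illustrative sketch only: the helper name `local_encoding_vector`, the header layout, and the choice of NumPy's generator are assumptions, and the master is assumed to know the encoding coefficients $c_{i,j,e}$ of each worker's coded submatrices.

```python
import numpy as np

def local_encoding_vector(seed, n_results):
    """Regenerate a local encoding vector from (seed, number of local results)."""
    return np.random.default_rng(seed).standard_normal(n_results)

k, n_results, seed = 4, 3, 12345   # hypothetical values
# Encoding coefficients of the worker's first n_results coded submatrices
# (assumed to be known to the master as well).
C = np.random.default_rng(7).standard_normal((n_results, k))

# Worker side: draw the local encoding vector and send only a small header.
c_worker = local_encoding_vector(seed, n_results)
header = (seed, n_results)

# Master side: rebuild the local vector, then the global encoding vector.
c_master = local_encoding_vector(*header)
global_vector = c_master @ C
print(global_vector)
```

The header carries two integers regardless of k, whereas sending the global encoding vector explicitly would cost k real numbers per packet.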

Remark 1.
Lin et al. [22] have also applied RLNC in distributed training on mobile devices. They used RLNC to create coded data partitions among mobile devices so as to tolerate computational uncertainties, and their main purpose is to reduce the need to exchange data partitions across mobile devices. Differently from [22], the use of RLNC in this paper is for straggler mitigation and packet-loss tolerance in a joint manner, while leveraging the computation and communication capabilities of all worker nodes.

Remark 2.
Since random linear network coding is performed over the field of real numbers as opposed to a finite field, the entries of the generated matrices could be very large numbers, making the whole computation numerically unstable. In fact, this issue is present in any coded distributed computation over the field of real numbers and is not limited to our approaches. There are two basic approaches to dealing with this issue. One is to use very small coefficients to avoid the emergence of large numbers, which is possible because the encoding operations in our proposed approach are linear in these coefficients. This is significantly different from the Reed-Solomon-code/polynomial-code-based approaches that have been widely adopted in coded distributed computation (see, e.g., [11,23]), where the coefficients are powers of evaluation points. In particular, the numerical instability issue for the RLNC approach is much less severe than that for Reed-Solomon-code/polynomial-code-based approaches, since Vandermonde matrices have exponentially large condition numbers. The other is to employ the finite field embedding technique [24,25], where the entries are quantized to a finite number of digits and then embedded into a finite field. Nevertheless, both approaches incur numerical errors. How to guarantee numerical stability in coded distributed computation is still an open problem and requires further study.
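The gap in conditioning between Gaussian and Vandermonde coefficient matrices can be observed numerically. The sketch below is only an illustration; the matrix size and the integer evaluation points are hypothetical choices, not values from the schemes cited above.

```python
import numpy as np

k = 12                                          # hypothetical matrix size
rng = np.random.default_rng(0)
gaussian = rng.standard_normal((k, k))          # RLNC-style coefficients
points = np.arange(1, k + 1, dtype=float)       # assumed evaluation points
vandermonde = np.vander(points, k, increasing=True)  # rows: powers of a point

cond_g = np.linalg.cond(gaussian)
cond_v = np.linalg.cond(vandermonde)
print(f"Gaussian condition number:    {cond_g:.3e}")
print(f"Vandermonde condition number: {cond_v:.3e}")
```

A larger condition number means that rounding errors in the received results are amplified more strongly during decoding.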

Latency Analysis
The following result characterizes an upper bound on the latency of the proposed RLNC-based approach.
The following result establishes a lower bound on the latency of any scheme under the coding framework.

Theorem 2.
For any scheme under the coding framework, the probability that its latency $T_{any}$ is less than $T_0$ decays exponentially with k; i.e., for any constant δ > 0, there exists some constant η > 1 that does not depend on k, such that

From Theorems 1 and 2, it is straightforward to see that the proposed RLNC-based approach is asymptotically optimal. In the following, we formally prove Theorems 1 and 2 by a connectivity analysis of a continuous-time trellis, which models the computation and communication processes.
For any scheme under the coding framework, as illustrated in Figure 2, we model the computation and communication processes of each worker node $w_i$ up to time t using a continuous-time trellis $G_i^{(t)}$ [26], where edges are classified into three types: computation edges, transmission edges and memory edges. Each computation edge models the computation of a submatrix-vector product. Suppose $w_i$ computes a submatrix-vector product from time $t_0$ to $t_0 + \tau_i \le t$. Then, two nodes, $w_i(t_0)$ and $w_i(t_0 + \tau_i)$, are introduced, and there is a computation edge from $w_i(t_0)$ to $w_i(t_0 + \tau_i)$. Similarly, suppose a packet is transmitted from $w_i$ at time $t_0$ and received successfully by the master node at time $t_1 \le t$. Then, two nodes $w_i(t_0)$ and $m_i(t_1)$ are introduced if they do not already exist, and there is a transmission edge from $w_i(t_0)$ to $m_i(t_1)$. We also introduce a node $w_i(0)$ and a node $m_i(t)$. Nodes $\{w_i(\cdot)\}$ are connected along the timeline, and so are nodes $\{m_i(\cdot)\}$. The edges for such connections are called memory edges. Each computation edge and each transmission edge has unit capacity, and each memory edge has infinite capacity. Finally, we construct a global continuous-time trellis $G^{(t)}$, which includes the union of all $G_i^{(t)}$ and two auxiliary nodes w(0) and m(t). In addition, there is an edge with infinite capacity from w(0) to each $w_i(0)$, and there is an edge with infinite capacity from each $m_i(t)$ to m(t).
The usefulness of the continuous-time trellis model is summarized in the following result.

Proof.
It is straightforward to see that the first part holds. The second part is inherited from the optimality of RLNC in communication networks [19] and the fact that all the operations are over the real field $\mathbb{R}$.

Now, we proceed to prove Theorems 1 and 2. We start by presenting some concentration results regarding the communication between worker nodes and the master node.

Lemma 1.
Suppose $Y_1, Y_2, \ldots$ independently follow a shifted exponential distribution with rate µ and shift parameter θ. Then, for any constant δ > 0, there exists some constant $\eta_1 > 1$, such that

Proof.
The result can be proved by a Chernoff-like argument based on the moment generating function [27]. The moment generating function of $Y_i$ is $E[e^{sY_i}] = \mu e^{s\theta} / (\mu - s)$ for $s < \mu$. Hence,

where the inequality holds by applying Markov's inequality.
For a scheme, let $N_i(t)$ ($\bar{N}_i(t)$, resp.) be the number of packet transmissions (successful packet transmissions, resp.) from worker node $w_i$ to the master node during the time interval $(X_i, X_i + t)$.

Lemma 2.
For any scheme and any constant δ > 0, there exists some constant $\eta_2 > 1$, such that

Proof.
Let $Y_1, Y_2, \ldots, Y_{N_i(t)}$ be i.i.d. shifted exponential random variables with rate $\mu_i$ and shift parameter $\theta_i$, and $s = (1 + \delta) r_i t$. According to Lemma 1, there exist some constant

Lemma 3.
For any scheme and any constant δ > 0, there exists some constant $\eta_3 > 1$ such that

Proof.
Let A denote the event that $N_i(t) \ge (1 + \delta/2) r_i t$. By the law of total probability,

According to Lemma 2, there exists some constant $\eta_2 > 1$ such that $\Pr(A) = O(\eta_2^{-t})$. Let N be a binomial random variable with parameters $(1 + \delta/2) r_i t$ and $1 - \varepsilon_i$. Then, there exists some constant $\eta_3' > 1$ such that

where the second step follows by applying the Chernoff bound for a binomial random variable [27]. Finally, by letting $\eta_3 = \min\{\eta_2, \eta_3'\}$, we have

Lemma 4.
For any scheme, let $F_i(t)$ be the maximum flow from $w_i(0)$ to m(t) in its continuous-time trellis $G^{(t)}$. Then, for any constant δ > 0, there exists some constant $\eta_4 > 1$ such that

for some constant $\eta_4 > 1$. By the law of total probability,

We consider two cases. In the first case, 1 Thus, Pr

for some constant $\eta_5 > 1$, where the last step follows from Lemma 3. Thus, we can show

Now we are ready to prove Theorem 2.
Proof of Theorem 2. For any scheme, since the maximum flow from , according to Proposition 1, its latency T any satisfies where the last step follows from Lemma 4.
Next, we turn to the proof of Theorem 1. For the RLNC approach and $t \ge X_i$, let $N_i(t, t + \Delta t)$ be the number of successful packet transmissions from worker node $w_i$ to the master node during the time interval $(t, t + \Delta t)$. We have the following result.

Lemma 5. For any t ≥
i.e., N i (t, t + ∆t)/∆t converges to r i (1 − ε i ) in probability when ∆t goes to infinity, or equivalently, for any constant > 0.
Proof. The result can be shown similarly to that of Lemma 3.
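The convergence stated in Lemma 5 can be checked empirically with a small Monte Carlo experiment. The sketch below rests on assumptions that are not spelled out in the text above: packets are sent back to back, so that the transmission rate is taken to be $r_i = 1/(\theta_i + 1/\mu_i)$, and all parameter values are hypothetical.

```python
import numpy as np

def successful_packet_rate(rng, mu, theta, eps, horizon):
    """Empirical rate of successfully received packets over [0, horizon]."""
    n_max = int(horizon / theta) + 1           # no more packets than this can fit
    durations = theta + rng.exponential(1.0 / mu, size=n_max)
    finish_times = np.cumsum(durations)
    sent = int(np.searchsorted(finish_times, horizon, side="right"))
    received = rng.binomial(sent, 1.0 - eps)   # independent erasures
    return received / horizon

rng = np.random.default_rng(0)
mu, theta, eps = 50.0, 0.005, 0.2              # hypothetical values
r = 1.0 / (theta + 1.0 / mu)                   # assumed transmission rate r_i
for horizon in (10.0, 100.0, 1000.0):
    rate = successful_packet_rate(rng, mu, theta, eps, horizon)
    print(f"horizon={horizon:7.1f}  rate={rate:6.2f}  r(1-eps)={r * (1 - eps):.2f}")
```

As the horizon grows, the empirical rate should settle near $r_i(1 - \varepsilon_i)$, in line with the convergence in probability claimed by the lemma.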

Lemma 6.
Let $F_i(t)$ be the maximum flow from $w_i(0)$ to m(t) in the continuous-time trellis $G^{(t)}$ of the RLNC approach. Then,

Proof.
According to Theorem 1 of [26], Lemma 5 implies this result immediately.

Now we can prove Theorem 1.
Proof of Theorem 1. According to Lemma 6,F i it is straightforward to check that According to Proposition 1, this implies that The proof is accomplished.

BATS-Code-Based Approach
As mentioned earlier, despite its optimality, the RLNC-based approach suffers from high encoding and decoding overheads. In this section, we propose a new approach based on batched sparse (BATS) codes [20], a variation of RLNC with low encoding and decoding overheads.

Description
In the BATS-code-based approach, the k submatrices $A_1, \ldots, A_k$ are first encoded into $A_1, \ldots, A_k, A_{k+1}, \ldots, A_{k'}$ using a fixed-rate systematic erasure code (called a precode), where $k' = (1 + \epsilon)k$ and $\epsilon$ is a small positive constant (e.g., 0.02). BATS codes are rateless, as an infinite number of batches can be generated. Each batch is generated as follows:
• Sample a degree deg according to a given degree distribution $\Psi = (\Psi_1, \ldots, \Psi_D)$, where D is the maximum degree;
• Select deg distinct submatrices uniformly at random from $A_1, \ldots, A_k, A_{k+1}, \ldots, A_{k'}$;
• Generate M random linear combinations of the deg submatrices, which are referred to as a batch.
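The three batch-generation steps above can be sketched as follows. This is a minimal illustration rather than a full BATS encoder: the precode is not implemented (random placeholder blocks stand in for the precoded submatrices), and the degree distribution used here is a hypothetical, unoptimized one.

```python
import numpy as np

def generate_batch(rng, blocks, degree_dist, M):
    """Generate one batch from precoded submatrices.

    blocks: array of shape (k', rows, d); degree_dist: (Psi_1, ..., Psi_D).
    Returns the chosen indices, the M x deg coefficients, and the batch.
    """
    degrees = np.arange(1, len(degree_dist) + 1)
    deg = int(rng.choice(degrees, p=degree_dist))             # step 1: sample degree
    chosen = rng.choice(len(blocks), size=deg, replace=False)  # step 2: pick submatrices
    coeffs = rng.standard_normal((M, deg))                     # step 3: M combinations
    batch = np.tensordot(coeffs, blocks[chosen], axes=1)       # shape (M, rows, d)
    return chosen, coeffs, batch

rng = np.random.default_rng(0)
k_prime, rows, d, M = 10, 3, 4, 8                  # toy sizes (assumed)
blocks = rng.standard_normal((k_prime, rows, d))   # placeholder precoded submatrices
psi = np.array([0.1, 0.3, 0.3, 0.2, 0.1])          # hypothetical degree distribution
chosen, coeffs, batch = generate_batch(rng, blocks, psi, M)
print("degree:", len(chosen), "batch shape:", batch.shape)
```

Because each batch touches only deg submatrices instead of all of them, the per-batch encoding work does not grow with the total number of submatrices.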
Based on the BATS code, batches of submatrices are assigned to worker nodes, and each worker node performs its local computation batch by batch, where each batch consists of M submatrix-vector multiplications. In order to forward the computational result of a batch to the master node, each worker node generates a number of packets, each of which is a random linear combination of the M submatrix-vector products corresponding to the batch. For decoding, the master node first recovers products among $A_1 x, \ldots, A_k x, A_{k+1} x, \ldots, A_{k'} x$ (with $k'$ the number of precoded submatrices) using Gaussian-elimination-based belief propagation (BP) decoding. Once any k, or slightly more than k, of these products have been recovered, the master node can recover all of $A_1 x, \ldots, A_k x$ by decoding the precode. See [20] for more details.
Overhead: In the BATS-code-based approach, the encoding cost per submatrix is $O(\mathrm{deg} \cdot \frac{m}{k} \cdot d) = O(\frac{md}{k})$, and the total decoding cost is $O((M^3 + M^2 \frac{m}{k}) \cdot \frac{k}{M}) = O(M^2 k + Mm)$. Clearly, both the encoding cost and the decoding cost are much lower than for the RLNC approach, especially when M is a small constant (e.g., 8 or 16). As for the RLNC approach, the decoding cost is independent of d, and the coefficient overhead is negligible when leveraging the pseudo-random-number-generator-based approach.
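To get a feel for the size of the gap, the two decoding-cost expressions can be evaluated for the parameters used later in the performance evaluation (k = 1000, m = 50,000, M = 8). This is a rough sketch: the RLNC expression assumes plain Gaussian elimination with cost on the order of $k^3 + km$, and all constant factors are ignored, so only the ratio's order of magnitude is meaningful.

```python
def rlnc_decoding_cost(k, m):
    """Order-of-magnitude cost of Gaussian elimination: k^3 + k*m (assumed form)."""
    return k**3 + k * m

def bats_decoding_cost(k, m, M):
    """Order-of-magnitude BATS decoding cost: M^2 * k + M * m."""
    return M**2 * k + M * m

k, m, M = 1000, 50_000, 8
rlnc = rlnc_decoding_cost(k, m)
bats = bats_decoding_cost(k, m, M)
print(f"RLNC ~ {rlnc:.3e}, BATS ~ {bats:.3e}, ratio ~ {rlnc / bats:.0f}")
```

Neither expression involves d, which matches the observation that the decoding cost is independent of the number of columns of A.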

Remark 3.
There have been many other sparse variants of random linear network coding, including chunked codes (e.g., [28,29]), tunable sparse network coding (e.g., [30,31]), and sliding-window coding (e.g., [32][33][34][35][36]). While many of these codes can also be applied, BATS codes are more suitable for this distributed computing scenario. On the one hand, BATS codes are rateless. Thus, all the worker nodes can keep computing and forwarding local results to the master node until the whole computation is completed, as long as enough batches are placed on each worker node. In contrast, chunked codes (e.g., [28,29]) usually have fixed coding rates or require a lot of feedback from the master node. On the other hand, as mentioned in Section 2, in many applications the step of encoding before computation must be performed before the arrival of any input x. In other words, this encoding step should be independent of the uncertain computation and communication processes of worker nodes. However, unlike BATS codes, sliding-window codes are often generated on the fly and are therefore less suitable.

Performance Optimization
The performance of BATS code heavily depends on how the M computation results of each batch are transmitted to the master node, and which degree distribution is used.
Suppose that worker node $w_i$ sends $Z_i$ coded packets to the master node for the computation results of each batch $B_j$. Let $H_j$ be a $Z_i \times M$ matrix, where each row corresponds to a transmitted packet. If the packet is successfully received by the master node, then the row is its local encoding vector. Otherwise, the row is the zero vector. Let $h_i = (h_{i,0}, \ldots, h_{i,M})$ denote the rank distribution of $H_j$, where $h_{i,r}$ is the probability that $H_j$ has rank r. We can show that

where ub is an upper bound of $Z_i$. In order to maximize the transmission efficiency of the BATS code, we apply the linear programming method [37] to optimize the distribution of $Z_i$:

Here, the objective is to maximize the expected rank. The first constraint requires that the expected time for transmitting $Z_i$ packets to the master node be no larger than the time for computing M submatrix-vector multiplications, and the last two constraints require that $\Pr(Z_i = \ell)$, $\ell = 0, \ldots, ub$, be a probability distribution. As time goes to infinity, the proportion of batches whose computation results have been sent to the master node by worker node $w_i$ is $\frac{1/\tau_i}{\sum_{j=1}^{n} 1/\tau_j}$. Hence, we can derive the empirical rank distribution h over all the batches processed by the worker nodes as $h = \sum_{i=1}^{n} \frac{1/\tau_i}{\sum_{j=1}^{n} 1/\tau_j} h_i$. Based on the empirical rank distribution, we can find a good degree distribution Ψ such that the BATS code achieves a coding rate close to $\bar{h}/M$, where $\bar{h}$ is the expected value corresponding to the empirical rank distribution (cf. [20]).
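For intuition about the rank distribution, consider the simplified case in which a worker sends a fixed number Z of packets per batch (the text optimizes a distribution over $Z_i$ instead). Assuming real-valued random local encoding vectors, any set of received rows is linearly independent with probability 1 up to rank M, so the rank equals the smaller of M and the number of received packets. The sketch below computes the resulting distribution; the function names and parameter values are hypothetical.

```python
from math import comb

def rank_distribution(Z, M, eps):
    """Rank distribution (h_0, ..., h_M) for a fixed number Z of sent packets,
    assuming rank = min(M, number of received packets)."""
    p = 1.0 - eps
    h = [0.0] * (M + 1)
    for received in range(Z + 1):
        prob = comb(Z, received) * p**received * eps**(Z - received)
        h[min(received, M)] += prob
    return h

def expected_rank(h):
    return sum(r * hr for r, hr in enumerate(h))

M, eps = 8, 0.2                      # hypothetical batch size and erasure probability
for Z in (8, 10, 12):
    h = rank_distribution(Z, M, eps)
    print(f"Z={Z:2d}  expected rank={expected_rank(h):.3f}  Pr(rank=M)={h[M]:.3f}")
```

Sending more packets per batch raises the expected rank but takes more transmission time, which is the trade-off that the linear program balances against the computation time of a batch.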

Performance Evaluation
In this section, we first evaluate the decoding cost incurred by our proposed approaches, and then we present simulations conducted to evaluate the overall computational performances of these approaches in comparison to some state-of-the-art approaches.
We first ran some experiments on a computer with an Intel(R) Core(TM) i7-10700 CPU at 2.90 GHz and Python 3.7. In these experiments, the matrix A was 50,000 × d, where d ranged from 1000 to 16,000. Matrix A was split into 1000 submatrices of the same size, so each submatrix consisted of 50 rows and each transmitted packet consisted of 50 real numbers. In the BATS-code-based approach, the batch size was set to eight. We simulated the decoding process and evaluated the decoding delays (in seconds) of both the RLNC-based approach and the BATS-code-based approach. The delay of the original matrix multiplication was also evaluated. The results are presented in Table 1. Note that the decoding latencies of both the RLNC-based approach and the BATS-code-based approach are independent of d, whereas the latency of the original matrix multiplication grows linearly with d. From this table, we can see that even when d = 1000, the decoding latency of the BATS-code-based approach is only about 1.58% of the latency of the original computation, and when d grows larger, this latency becomes negligible. In contrast, when d = 1000 or d = 2000, the decoding cost of the RLNC-based approach is prohibitive.
We also conducted simulations to evaluate the performance of our proposed approaches. In our simulations, the number of worker nodes was 10, and the settings of matrix A remained the same as above, except that the number of columns d was irrelevant in our simulations. We simulated four scenarios. In the first three scenarios, worker nodes were homogeneous, and the relative magnitudes of the computation time per submatrix-vector product and the average communication time of a packet varied among these scenarios. In the last scenario, worker nodes were heterogeneous. The involved parameters of these scenarios are given as follows. For these scenarios, we evaluated the following six methods.
• Uniform uncoded, where the divided submatrices were equally assigned to 10 worker nodes; i.e., each worker node computed 100 submatrices.
• Two-Replication, where the divided submatrices were equally assigned to five worker nodes, and the computing tasks of these worker nodes were replicated at another five worker nodes.
• (10,8) MDS code, where the divided 1000 submatrices were encoded into 1250 submatrices and then equally assigned to 10 worker nodes.
• LT code [14], where the 1000 original submatrices were encoded using LT codes, and an infinite number of coded submatrices was assigned to each worker node.
• RLNC: The details are introduced in Section 3. The time cost of recoding and decoding operations was ignored.
• BATS code: The details are introduced in Section 4, and a batch size of eight was used.
While our proposed schemes tackle the packet-loss issue, the first four of the above schemes do not consider this issue at all. For these four schemes, we therefore assumed an ideal retransmission (IR) scheme, in which a worker node knows immediately whether a transmitted packet has been lost. This assumption favors these schemes. In the following, we refer to the first four schemes as Uncoded + IR, Rep + IR, (10,8) MDS + IR and LT + IR, respectively.
The latency performance of these approaches under the four scenarios is plotted in Figure 3, where the decoding latency at the master node is ignored. From this figure, we observe the following.
• Among the first four schemes, LT + IR achieved the best performance in all four scenarios. Note that IR eliminates the packet-loss issue, and this result has also been demonstrated in [14], where only the straggler issue was considered. This is because LT codes can achieve near-perfect load balance among the worker nodes in the presence of stragglers.
• In all these scenarios, the proposed RLNC approach achieved the best latency performance among all the schemes. In particular, the performance of the RLNC approach was slightly better than that of LT + IR. Just like LT + IR, our RLNC approach also achieved near-perfect load balance among the worker nodes. Meanwhile, LT + IR incurred a small precode overhead, whereas the RLNC approach did not. This result also demonstrates the near-optimality of the RLNC approach.
• Our BATS approach performed much better than Uncoded + IR, Rep + IR, and (10,8) MDS + IR in all these scenarios, but slightly worse than LT + IR and RLNC. Since LT + IR assumes an ideal retransmission scheme, which is impractical, and the RLNC approach incurs high encoding and decoding costs, the BATS approach is much more practical.
In summary, both our RLNC approach and our BATS approach can overcome both the straggler issue and the packet-loss issue effectively and can achieve near-optimal performance in different scenarios when the number of columns d is large enough.

Conclusions
In this paper, we focused on addressing the straggler issue and the packet-loss issue jointly for distributed matrix multiplication in wireless distributed computing systems. We proposed an RLNC approach and proved its asymptotic optimality using a continuous-time-trellis-based argument. We further proposed a more practical variation of the RLNC approach based on BATS codes. The effectiveness of both approaches was demonstrated through numerical simulations.