Coding for Large-Scale Distributed Machine Learning

This article aims to give a comprehensive and rigorous review of the principles and recent developments of coding for large-scale distributed machine learning (DML). With increasing data volumes and the pervasive deployment of sensors and computing machines, machine learning has become more distributed, and the numbers of computing nodes and data volumes involved in learning tasks have increased significantly. For large-scale distributed learning systems, significant challenges have appeared in terms of delay, errors, efficiency, etc. To address these problems, various error-control or performance-boosting schemes have been proposed recently, such as the duplication of computing nodes. More recently, error-control coding has been investigated for DML to improve reliability and efficiency. The benefits of coding for DML include high efficiency, low complexity, etc. Despite the benefits and recent progress, however, a comprehensive survey on this topic, especially for large-scale learning, is still lacking. This paper seeks to introduce the theories and algorithms of coding for DML. For primal-based DML schemes, we first discuss gradient coding with optimal code distance. Then, we introduce random coding for gradient-based DML. For primal-dual-based DML, i.e., ADMM (alternating direction method of multipliers), we propose a separate coding method for the two steps of distributed optimization, and coding schemes for the different steps are then discussed. Finally, a few potential directions for future work are given.


Background and Motivations
With the fast development of computing and communication technologies and emerging data-driven applications, e.g., IoT (Internet of Things), social network analysis, smart grids and vehicular networks, the volume of data for various intelligent systems with machine learning has increased explosively, along with the number of involved computing nodes [1], i.e., learning has reached a large scale. For instance, learning systems based on MapReduce [2] have been widely used and may often reach data volumes of petabytes (10^15 bytes), produced and stored in thousands of separate nodes [3,4]. Large-scale machine learning is pervasive in our societies and industries. Meanwhile, it is inefficient (sometimes even infeasible) to transmit all data to a central node for analysis. For this reason, distributed machine learning (DML), which stores and processes all or parts of the data in different nodes, has attracted significant research interest and applications [1,3-16]. There are different methods of implementing DML, i.e., primal methods (e.g., distributed gradient descent [4,7], federated learning [5,6]) and primal-dual methods (e.g., the alternating direction method of multipliers (ADMM)) [16]. In a DML system, participating nodes (i.e., agents or workers) normally process local data and send the learning model information to other nodes for consensus. For instance, in a typical federated learning system [5,6], worker nodes run multiple rounds of gradient descent (local epochs) with local data and the received global model. Then, the updated local models are sent to the server to be aggregated into a new global model (normally a weighted sum). The models are normally much shorter than the raw data. Thus, federated learning saves significant communication costs, and meanwhile transmitting models generally preserves privacy better than sending raw data over networks. Actually, in addition to federated learning, other DML schemes share similar benefits.

Introduction of Distributed Machine Learning
In general, DML has two steps: (1) Agents learn local models from local data, possibly combined with global models. This step may iterate multiple rounds, i.e., local iterations, to produce a local model. (2) With the local models, agents reach consensus. These two steps may also iterate multiple rounds, i.e., global iterations. There are different methods to implement the two steps, for instance, the primal and primal-dual methods mentioned above. There are also different ways to achieve consensus, for instance, through a central server (i.e., the master-slave method) or fully decentralized. For the former, the implementation is relatively straightforward; for the latter, there are different approaches, as will be discussed later on. For Step (1), common local learning machines include, for example, linear (polynomial) regression, classification and neural networks. The common approach of these learning algorithms is to find the model parameters (e.g., the weights in neural networks) that minimize a cost function (such as the mean-squared error/L2 loss, hinge loss or cross-entropy loss). In general, convex cost functions should be chosen. For instance, for linear regression, we assume x, y are the input and output of the training data, respectively, and w (normally a matrix or a vector) is the weight to be optimized. If the mean-squared error cost function is used, then the learning machine solves min_w ||xw − y||^2.
To find the optimal w, one common approach is gradient descent, a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. If the cost function is convex, then the local minimum is also the global minimum [33]. For instance, in the training process of neural networks, gradient descent is commonly used to find the optimized weights and biases iteratively. The gradient is found by the partial derivatives of the cost function with respect to the optimizing variables (weights and biases). For instance, for node i, the optimizing variables can be updated by

w^i_{t+1} = w^i_t − γ ∇F(w^i_t, D_i),    (2)

where t is the iteration step index, γ is the step size, D_i is the data set (training samples) in node i, F(w^i_t) is the cost function at the current optimizing variables, and ∇F(w^i_t, D_i) denotes the gradient for the given (w^i_t, D_i) (by partial derivatives). The training process is normally performed in batches of data. D_i can be further divided into subsets, e.g., N subsets, i.e., D_i = {D^1_i, D^2_i, ..., D^N_i}. If the subsets are exclusive, the gradients from different subsets are independent and can be computed separately and summed, i.e., ∇F(w^i_t, D_i) = ∑_{k=1}^{N} ∇F(w^i_t, D^k_i). However, in many DML systems, e.g., those based on MapReduce file systems, or sensor nodes in neighboring areas, there may be overlapping data subsets, i.e., D^k_i = D^n_j for certain k, n and i ≠ j. Therefore, there may be identical gradients in different nodes. These properties were recently exploited for coding.

It is clear from (2) that, for given gradients, the steps of finding the optimal parameters are mainly linear matrix operations (matrix multiplications). Actually, in addition to neural networks, one core operation of many other learning algorithms is also matrix multiplication, such as regression, power-iteration-like algorithms, etc. [4]. Thus, one major class of coding schemes for DML is based on the matrix multiplications of the learning process [4,8-14,24,25]. Clearly, the major coding schemes (forward error-control coding and network coding) are linear in terms of encoding and decoding operations, i.e., C = M × W, where C, M and W are the codeword (vector), coding matrix and information message, respectively. Since both learning and coding operations are linear matrix operations, the coding matrix and the learning matrix can be jointly optimized. On the other hand, coding can also be optimized separately to provide efficient and reliable information pipelines for DML systems; in this way, the coding and DML matrices are separately optimized. Separate optimization has been widely studied for many years for existing systems due to its simpler design relative to joint design, and there are many works in the literature on the separate optimization of learning systems and coding schemes. We will focus on joint design in this article.
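To make the update (2) concrete, the following sketch (our own illustration in Python/NumPy; the function and variable names are not from the cited works) runs gradient descent for the linear least-squares cost above, with the local data set split into disjoint subsets whose gradients are computed separately and summed.

```python
import numpy as np

def subset_gradient(w, X, y):
    # Gradient of 0.5 * ||X w - y||^2 for one data subset D_i^k.
    return X.T @ (X @ w - y)

def local_update(w, subsets, gamma):
    # One step of (2): w <- w - gamma * sum_k grad F(w, D_i^k).
    grad = sum(subset_gradient(w, X, y) for X, y in subsets)
    return w - gamma * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=90)

# Split node i's data set D_i into N = 3 disjoint subsets.
subsets = [(X[k::3], y[k::3]) for k in range(3)]

w = np.zeros(3)
for t in range(500):
    w = local_update(w, subsets, gamma=0.01)
print(w)  # approaches w_true
```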

Coding for Reliable Large-Scale DML
In this section, we will first give a review of the basic principles of coding for reliable DML. Then, we will discuss two optimal code constructions for DML.
A toy example of how coding can help to deal with stragglers is given in Figure 1 [34]. For instance, it can be a federated learning network with worker and server nodes. There is partial overlapping of the data segments in different worker nodes and thus partial overlapping of the gradients. As in Figure 1, we divide the data set of a node into multiple smaller sets to denote the partial overlapping among nodes. Meanwhile, multiple sets in a node are also necessary for encoding, as shown in the figure, since one data set corresponds to one source symbol of the code. In the server node, a weighted sum of the gradients is needed. In the figure, three worker nodes have different data parts D_1, D_2, D_3, which are used to compute the gradients G_1, G_2, G_3, respectively. The server does not need the individual gradients but only their sum G_s = G_1 + G_2 + G_3. We can easily see that the coded blocks from any two nodes suffice to recover G_s. For instance, if worker 3 is in outage, then G_s = 2(G_1/2 + G_2) − (G_2 − G_3) with the two coded blocks transmitted from worker 1 and worker 2. If there is no coding, then worker 1 and worker 2 have to transmit G_1, G_2 and G_3 separately with three blocks, after coordination. Thus, coding can save both transmission and coordination loads.
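The recovery in this toy example can be verified numerically with a few lines of Python (a sketch of our own; the gradients here are random vectors standing in for G_1, G_2, G_3):

```python
import numpy as np

rng = np.random.default_rng(1)
G1, G2, G3 = rng.normal(size=(3, 4))   # stand-ins for the three workers' gradients

c1 = 0.5 * G1 + G2                     # coded block sent by worker 1
c2 = G2 - G3                           # coded block sent by worker 2

# Worker 3 straggles, yet the server recovers the sum from c1 and c2 alone:
Gs = 2 * c1 - c2
assert np.allclose(Gs, G1 + G2 + G3)
```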
Though the idea of applying coding for DML is straightforward as shown in the above toy example, the code design will be rather challenging for large-scale DML, i.e., when the numbers of nodes and/or gradients per node are very large. One big challenge is how to construct encoding and decoding matrices, especially with limited complexity.
In what follows, we will first give a brief introduction to the MapReduce file systems, which are often used in DML. Then, we will discuss the coding schemes with deterministic constructions [34]. The random construction based on fountain codes, which normally has lower complexity, is given in the next section [13,14].

In large DML systems, MapReduce is a commonly used distributed file storage system. As shown in Figure 2, there are three stages in the MapReduce file systems: map, shuffling and reduce. In the system, data are stored in different nodes. In the map stage, stored data are sent to different computing nodes (e.g., cloud computing nodes), according to pre-defined protocols. In the shuffling stage, the computed results (e.g., gradients) are exchanged among nodes. Finally, the end users collect the computed results in the reduce stage. MapReduce can be used in federated learning, which was originally proposed for applications in mobile devices [5]. In such a scenario, data are first sent to different worker nodes in the map stage, according to certain design principles. Then, in the shuffling stage, local model parameters are aggregated at the server node. Finally, the aggregated models are obtained in the final iteration at the server. In this way, the worker nodes have all the data necessary for computing local models, sent from the storage nodes. However, there may be straggling worker nodes, due to either slow computing at the node or transmission errors in the channels. In such a scenario, gradient coding [34] can be used to tolerate the straggler nodes.

To construct gradient coding, we use A to describe the possible straggler patterns together with the corresponding decoding coefficients, and B to describe how different gradients (or model parameters) are combined at the worker nodes. Thus, A is, in some sense, a transmission matrix multiplied by decoding matrices (its rows recover the transmitted gradients from the received coded symbols), and B can be regarded as an encoding matrix. Assuming that k is the number of different gradients (data partitions) over all nodes and that there are in total n output channels over all nodes, the dimension of B is n × k. Denoting ḡ = [g_1, g_2, ..., g_k]^T as the vector of all gradients, worker node i transmits b_i ḡ, where b_i, the i-th row of B, is the encoding vector at node i. The dimension of A is k × n. Each row of A corresponds to one instance of a straggling pattern, in which a 0 marks a straggler node, and specifies how the gradients are reproduced at the server; thus, all rows of A together cover all possible ways of straggling. Denoting f as the number of surviving workers (non-stragglers), each row of A has at least n − f zeros (the entries of straggler nodes must be zero). In the example of Figure 1, we only need the sum of the gradients from the worker nodes instead of the individual gradient values. Thus, we have AB = 1_{k×k}, and each row of ABḡ is identically G_1 + G_2 + G_3, where 1_{k×k} denotes the all-ones matrix; the corresponding A and B for this example can be written out explicitly (see [34]). Clearly, if we want the individual values of ḡ, we should redesign A and B such that AB is an identity matrix; or, if we want a weighted sum of the gradients (with weights more general than 1), A and B should also be redesigned. From this description, we can see that the main challenge of designing gradient coding is to find a suitable encoding matrix B such that the straggling loss defined through A can be corrected. In [34], two different ways of finding B and the corresponding A are given, i.e., the fractional repetition and cyclic repetition schemes, as detailed in the following.
We denote n and s as the number of worker nodes and straggler nodes, respectively, and assume n is a multiple of s + 1. Then, fractional repetition construction is described as the following steps.

• Divide the n workers into s + 1 groups of size n/(s + 1);
• In each group, divide all the data equally and disjointly, assigning s + 1 partitions to each worker;
• All the groups are replicas of each other;
• After local computing, every worker transmits the sum of its partial gradients.
By the second step, in a group, the first worker obtains the first s + 1 partitions from the map stage and computes the first s + 1 gradients, the second worker obtains the second s + 1 partitions and computes the corresponding s + 1 gradients, and so on. The encoding of each group of workers can be denoted by a block matrix B̄_block(n, s) ∈ R^{(n/(s+1)) × n} with

B̄_block(n, s) = [ 1_{1×(s+1)}   0_{1×(s+1)}   ...   0_{1×(s+1)}
                   0_{1×(s+1)}   1_{1×(s+1)}   ...   0_{1×(s+1)}
                       ...            ...       ...       ...
                   0_{1×(s+1)}   0_{1×(s+1)}   ...   1_{1×(s+1)} ].

Here, 1_{1×(s+1)} and 0_{1×(s+1)} denote the 1 × (s + 1) all-ones and all-zeros row vectors, respectively. Then, B is obtained by stacking s + 1 copies of B̄_block(n, s), i.e.,

B_frac = [ B̄^1_block(n, s); B̄^2_block(n, s); ...; B̄^{s+1}_block(n, s) ],    (5)

where B̄^i_block(n, s) = B̄_block(n, s) for i ∈ {1, ..., s + 1}. In addition to the encoding matrix B_frac, reference [34] also gives an algorithm (Algorithm 1) for constructing the corresponding A matrix.
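A minimal sketch of the fractional repetition encoding matrix described above (assuming, as in [34], that the number of data partitions equals the number of workers n and that n is a multiple of s + 1; the helper name is ours):

```python
import numpy as np

def b_frac(n, s):
    """Fractional repetition encoding matrix B_frac (n x n):
    each group of n/(s+1) workers covers all n partitions disjointly,
    s+1 partitions per worker, and the s+1 groups are replicas."""
    assert n % (s + 1) == 0
    # One group: row i has ones on partitions i*(s+1), ..., (i+1)*(s+1)-1.
    B_block = np.kron(np.eye(n // (s + 1)), np.ones((1, s + 1)))
    # Stack s+1 identical groups.
    return np.vstack([B_block] * (s + 1))

B = b_frac(n=6, s=1)
print(B.shape)  # (6, 6); each row has s + 1 = 2 ones
```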
It was shown in [34] that, with the fractional repetition scheme, B = B_frac from (5) and A from Algorithm 1 can correct any s stragglers. This can be stated more formally as the following theorem.

Theorem 1. Consider B = B_frac constructed by (5) for a given number of workers n and stragglers s (< n). Then, the scheme (A, B_frac), with A from Algorithm 1, is robust to any s stragglers.
Here, we refer interested readers to [34] for the proof. In addition to the fractional repetition construction, another way of finding the B matrix is the cyclic repetition scheme, which does not require n to be a multiple of s + 1. The algorithm to construct the cyclic repetition B matrix is given as Algorithm 2 in [34].
The resultant matrix B = B_cyc from Algorithm 2 has the following support (non-zero pattern):

supp(B_cyc) = [ *  *  ...  *  0  ...  0
                0  *  *  ...  *  0  ...  0
                          ...
                *  ...  *  0  ...  0  * ],

where * denotes a non-zero entry of B_cyc; each row of supp(B_cyc) contains (s + 1) non-zero entries, and the positions of the non-zero entries are shifted one step to the right from row to row, wrapping around cyclically until the last row. The construction of the A matrix again follows Algorithm 1, now applied to B_cyc. (We omit the detailed steps of Algorithm 2 here; its output is a matrix B ∈ R^{n×n} with (s + 1) non-zeros in each row.) It was shown in [34] that the cyclic repetition scheme can also correct any s stragglers:

Theorem 2. Consider B = B_cyc from Algorithm 2, for a given number of workers n and stragglers s (< n). Then, the scheme (A, B_cyc), with A from Algorithm 1, is robust to any s stragglers.
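For illustration, the cyclic support pattern can be generated as follows (this sketch of ours only builds the 0/1 support mask; the actual non-zero values produced by Algorithm 2 of [34] are chosen so that the span condition below is satisfied):

```python
import numpy as np

def cyclic_support(n, s):
    """0/1 support mask of B_cyc: each row has s+1 non-zeros,
    cyclically shifted one position to the right per row."""
    S = np.zeros((n, n), dtype=int)
    for i in range(n):
        S[i, [(i + j) % n for j in range(s + 1)]] = 1
    return S

print(cyclic_support(n=5, s=2))
# Each row has three consecutive (cyclically wrapped) non-zero positions.
```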
The fractional repetition and cyclic repetition schemes provide specific encoding and decoding methods for master-worker DML that tolerate any s stragglers. More generally, [34] also gives a necessary condition on the matrix B for tolerating any s stragglers, as follows.
Condition 1 (B-Span): Consider any scheme (A, B) robust to any s stragglers, given n (s < n) workers. Then, for every subset I ⊆ {1, ..., n} of n − s surviving workers, the all-ones row vector satisfies 1_{1×k} ∈ span{b_i | i ∈ I}, where span{·} denotes the linear span of the vectors.
If the A matrix is constructed by Algorithm 1, Condition 1 is also sufficient, as stated in the following corollary.

Corollary 1. If the A matrix is constructed by Algorithm 1 and B satisfies Condition 1, then (A, B) can correct any s stragglers.
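For small n, Condition 1 can be checked by brute force: for every set of n − s surviving workers, test whether the all-ones vector lies in the span of their encoding rows. The helper below (our own sketch, using least squares as a numerical span test) illustrates the check; for example, it returns True for the fractional repetition matrix of the earlier sketch with n = 6 and s = 1.

```python
import numpy as np
from itertools import combinations

def robust_to_s_stragglers(B, s, tol=1e-8):
    """Check Condition 1 (B-Span): for every set of n-s surviving workers,
    the all-ones vector must lie in the span of their encoding rows b_i."""
    n, k = B.shape
    ones = np.ones(k)
    for survivors in combinations(range(n), n - s):
        Bs = B[list(survivors), :]
        # Least-squares decoding vector a with a @ Bs ~ all-ones.
        a, *_ = np.linalg.lstsq(Bs.T, ones, rcond=None)
        if np.linalg.norm(a @ Bs - ones) > tol:
            return False
    return True

# e.g., robust_to_s_stragglers(b_frac(6, 1), s=1) -> True
```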
Numerical results: In Figure 3, the average time per iteration of different schemes from [34] is compared. In the naive scheme, the data are divided uniformly across all workers without replication, and the master simply waits for all workers to send their gradients. In the ignore-s-stragglers scheme, the data distribution is the same as in the naive scheme, but the master node only waits until n − s worker nodes have successfully sent their gradients (there is no need to wait for all gradients). Thus, as discussed in [34], the ignore-stragglers scheme may lose generalization performance by ignoring part of the data sets at straggler nodes. The learning algorithm is logistic regression. The training data are from the Amazon Employee Access dataset from Kaggle. The delay is introduced by the computing latency of AWS clusters, and there is no transmission error.
As shown in the figure, the naive scheme performs the worst. With an increasing number of stragglers, the coding schemes also outperform the ignore-stragglers scheme, as expected.

Random Coding Construction for Large-Scale DML
The gradient coding in [34] works well for DML schemes with a master-worker structure of limited size (a finite number of nodes and limited data partitions). However, the deterministic construction of the encoding and decoding matrices may be challenging when the number of nodes or data partitions (e.g., n or k) is large. The first challenge is the complexity of encoding and decoding, both of which are based on matrix multiplication and may be rather complex, especially decoding (e.g., based on Gaussian elimination). Though DML with MDS codes is optimal in terms of code distance (i.e., the number of straggler nodes that can be tolerated), the coding complexity becomes rather high as the number of participating nodes grows, i.e., for hundreds or even thousands of computing nodes. For instance, Reed-Solomon codes normally need to operate over non-binary fields, which is of high complexity. Another challenge is the lack of flexibility. Both the fractional repetition and cyclic repetition coding schemes assume static networks (worker nodes and data). However, in practice, the set of participating nodes may vary, for example with mobile nodes or sensors. In the mobile computing scenario, the number of participating nodes may be unknown, and it will be rather difficult to design deterministic coding matrices (A or B) in such a scenario. Similarly, if the data come from sensors, the amount of data may also vary. Thus, deterministic code constructions are hard to adapt to these scenarios, which, however, are very common in large-scale learning networks. Coding schemes that are efficient in varying networks and of low complexity are therefore preferable for large-scale DML. In [13,14], we investigated random coding for DML (or distributed computing in general) to address these problems. Our coding scheme is based on fountain codes [35-37] and is introduced as follows.
Encoding Phase: As shown in Figure 4, we consider a network with multiple storage and computing/fog nodes. Let FN_f denote the f-th fog node and SU_s the s-th storage unit, with f ∈ {1, 2, ..., F} and s ∈ {1, 2, ..., S}, respectively. Let D_f denote the dataset that fog node f needs to finish a learning task; D_f is obtained from the storage units available to node f. For instance, in a DML system with wireless links as in Figure 4, D_f is the union of the data of all the storage units within the communication range of FN_f (i.e., within R_f). Similar to federated learning, FN_f uses the current model parameters to calculate gradients, namely, intermediate gradients, denoted as g_f = [g_{f,1}, g_{f,2}, ..., g_{f,|D_f|}], where g_{f,a} is the gradient trained by data item a (a ∈ D_f) and |D_f| is the size of D_f. Meanwhile, the fog nodes need to calculate the intermediate model parameters (e.g., weights) w_f = [w_{f,1}, w_{f,2}, ..., w_{f,|w_f|}], where |w_f| is the length of the model parameters learned at FN_f. The intermediate gradients and model parameters are then encoded and sent to other fog nodes (or the central server if there is one) for further processing. The coding process for g_f is as follows.

• A degree d_g is selected according to the degree distribution Ω(x) = ∑_d Ω_d x^d; then, d_g intermediate gradients are chosen uniformly at random from g_f and linearly combined into one coded symbol. This is repeated for each coded symbol to be transmitted.

Ω(x) can be optimized according to the probability of straggling (regarded as erasure) due to channel errors, slow computing, etc. The optimization of the degree distribution for distributed fountain codes can be found in, for example, [38], and we do not discuss it here due to space limitations. With the above coding process, the resulting coded intermediate gradients are g̃_f = G^g_f g_f, where G^g_f is the generator matrix at fog node FN_f. The encoding process for w_f is the same as that for g_f, with a possibly different degree distribution; the resulting coded model parameters are w̃_f = G^w_f w_f, where G^w_f is the generator matrix at FN_f for the model parameters.
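The encoding step above can be sketched as follows (an LT-style illustration of our own; the degree distribution and all names are placeholders, in practice Ω is optimized as in [38], and the selected indices must be conveyed with each coded symbol so that the receiver knows the generator matrix):

```python
import numpy as np

rng = np.random.default_rng(2)

def lt_encode(gradients, omega, num_coded):
    """Fountain-style encoding of intermediate gradients.

    gradients : array (num_src, dim), one row per source gradient g_{f,a}
    omega     : degree distribution, omega[d-1] = Prob(degree = d)
    Returns the coded symbols and the index sets used (needed for decoding).
    """
    num_src = gradients.shape[0]
    coded, index_sets = [], []
    for _ in range(num_coded):
        d = rng.choice(np.arange(1, len(omega) + 1), p=omega)  # sample degree from Omega(x)
        idx = rng.choice(num_src, size=min(d, num_src), replace=False)
        coded.append(gradients[idx].sum(axis=0))                # combine d gradients
        index_sets.append(idx)
    return np.array(coded), index_sets

g_f = rng.normal(size=(10, 4))          # 10 intermediate gradients of dimension 4
omega = np.array([0.3, 0.4, 0.2, 0.1])  # toy degree distribution over degrees 1..4
coded, idx = lt_encode(g_f, omega, num_coded=15)
```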
Due to straggling, only part of the transmitted coded symbols is received. Let ε_{i,f} denote the straggling probability from FN_i to FN_f, caused by various reasons, e.g., physical-layer erasures, slow computing and congestion. The generator matrices corresponding to the coded intermediate gradients and model parameters actually received at FN_f can then be written as G̃^g_{i,f} and G̃^w_{i,f}, respectively. Let I = {I_1, ..., I_F} be straggling indicators and let λ be the probability of straggling; then, each I_f (f ∈ {1, 2, ..., F}) can be evaluated from λ. Fog node FN_f decodes the received coded intermediate parameters from G̃^g_{i,f} and G̃^w_{i,f} (i ∈ {1, 2, ..., F} \ {f}), and tries to decode N − |D_f| new gradients and Γ_w ∑_{i∈{1,2,...,F}\{f}} |w_i| model parameters, where Γ_w ∈ [0, 1] is a parameter determined by the specific learning algorithm. Owing to the properties of fountain codes (e.g., LT or Raptor codes), iterative decoding is feasible if the number of received coded gradients or model parameters is slightly larger than the number of gradients or model parameters at the transmitting fog nodes. Clearly, to optimize the code degree distribution and the task allocation, it is critical for a node to know the number of intermediate gradients and model parameters it will receive. For this purpose, we have the following analysis.
Assume γ_{a,b} is the overlapping ratio of the datasets in FN_a and FN_b; collecting these ratios for all pairs of fog nodes gives the overlapping-ratio matrix γ = [γ_{a,b}]. If γ_{a,b} = 0, then nodes FN_a and FN_b have disjoint datasets. Based on γ, Theorem 3 in [13,14] evaluates the number Δ of new (useful) intermediate gradients that a fog node can expect to receive; the expression involves sets Θ_{a,f} formed by indices of fog nodes, and we refer to [13,14] for the details. If γ is known at each fog node (or at least known at each receiving node for its transmitting neighbors), then Δ can be evaluated, and the computation and communication loads can be optimized through proper task assignment and code degree optimization. Theorem 3 is for gradients; a similar analysis also holds for the model parameters.

In Figure 5, we show the coding gains in terms of the communication load, which is defined as the ratio of the total amount of data transmitted by all the fog nodes to the amount of data required at these fog nodes. As we can see from the figure, if the number of nodes F or the straggling probability increases, the coding gain increases, as expected.

We note that both the deterministic codes in Section 3 and the random-construction codes here are actually types of network coding [29,30], which can reduce communication loads by computing at intermediate nodes (fog nodes) [3,4]. More recently, a special type of network code, i.e., BATS (batched sparse) codes, was proposed with two layered codes, as shown in Figure 6. For the outer code, error-control codes such as fountain codes can be used in the map phase; for the inner code, network codes such as random linear network codes can be used in the data-shuffling stage. In [12], we studied BATS codes for fog computing networks. As shown in Figure 7, numerical results demonstrate that BATS codes can achieve a lower communication load than uncoded schemes and deterministic (network) codes if the computing load is below certain thresholds. Here, we skip further details and refer interested readers to [12]. (In Figure 7, e_F denotes the channel erasure probability and corresponds to the straggling probability; the computing load is defined by the number of involved computing nodes and thus corresponds to the expansion coefficients.)

Introduction and System Setup
As a primal-dual optimization method, ADMM is shown to generally converge at a rate of O(1/t) for convex functions, where t is the iteration number [16], which is often faster than schemes based on primal methods. Meanwhile, ADMM also has the benefits of robustness to non-smooth/non-convex functions and suitability for fully decentralized implementation. Thus, ADMM is especially suitable for large-scale DML and has attracted substantial research interest. For DML, especially for a fully decentralized learning system without a central server, we can denote the learning network as G = (N, E), where N = {1, . . . , N} is the set of agents (computing nodes) and E is the set of links. For ADMM, the agents aim at collaboratively solving the following consensus optimization problem:

min_x ∑_{i=1}^{N} f_i(x; D_i),    (12)

where f_i : R^p → R is the local optimization function of agent i, and D_i is the data set of agent i. All the agents share a global optimization variable x ∈ R^p. The data sets of different agents may overlap, i.e., D_i ∩ D_j ≠ ∅ for some or all i ≠ j. This can happen, for instance, among sensors in nearby areas for weather, traffic, smart grids, etc., or, if MapReduce is used, when the same data are mapped to different agents. For ADMM, (12) is solved iteratively by a two-step process:
• Step (a): local optimization of f_i upon receiving the updated global variable, using D_i (normally via the augmented Lagrangian, as detailed below);
• Step (b): the global variable x reaches consensus.
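For concreteness, the sketch below shows the textbook global-consensus ADMM [16] for a least-squares instance of (12) (this is the standard centralized-consensus form, not the incremental or coded variants of [15] discussed later; the closed-form x-update and all names are our own illustration). Step (a) is the per-agent x-update, and Step (b) is the consensus z-update together with the dual ascent.

```python
import numpy as np

def consensus_admm(A_list, b_list, rho=1.0, iters=200):
    """Global-consensus ADMM for min_x sum_i ||A_i x - b_i||^2."""
    N, p = len(A_list), A_list[0].shape[1]
    x = np.zeros((N, p)); y = np.zeros((N, p)); z = np.zeros(p)
    for _ in range(iters):
        # Step (a): local x-updates (closed form for least squares).
        for i in range(N):
            A, b = A_list[i], b_list[i]
            x[i] = np.linalg.solve(2 * A.T @ A + rho * np.eye(p),
                                   2 * A.T @ b + rho * z - y[i])
        # Step (b): consensus (z-update) and dual ascent.
        z = (x + y / rho).mean(axis=0)
        y += rho * (x - z)
    return z

rng = np.random.default_rng(3)
A_list = [rng.normal(size=(20, 3)) for _ in range(4)]
x_true = np.array([1.0, -1.0, 2.0])
b_list = [A @ x_true + 0.01 * rng.normal(size=20) for A in A_list]
print(consensus_admm(A_list, b_list))  # close to x_true
```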
With DML, there are also straggler-node and unreliable-link challenges for ADMM, especially for large-scale and heterogeneous networks or networks with wireless links. However, with primal-dual optimization, it is very hard (if possible at all) to transform the ADMM optimization process into a linear operation (e.g., a matrix multiplication, as in gradient descent). Thus, coding schemes based on linear operations (e.g., the matrix multiplications in [4,8-11,24,25]) cannot be directly applied to ADMM, and, to the best of our knowledge, there are very few results on coding for ADMM so far. To address this problem, one solution is to apply coding separately to the two steps of ADMM. For instance, error-control coding can be used for the local optimization if the data of an agent are stored in different locations. For the global consensus, network coding can be used to reduce the communication load and increase reliability. In [15], we preliminarily investigated how coding (MDS codes) can be used in the local optimization (Step (a)). A more detailed introduction is given as follows.
As depicted in Figure 8, a distributed computing system consists of multiple agents, each of which is connected with several edge computing nodes (ECNs). Agents can communicate with each other through links. ECNs are capable of processing data collected from sensors and transferring the desired messages (e.g., model updates) back to the connected agent. Based on the agent coverage and computing resources, the ECNs connected to agent i (∈ N) are denoted as K_i = {1, . . . , K_i}. This model is common in current intelligent systems, such as smart factories or smart homes. The multi-agent system seeks to find the optimal solution x* by solving (12), with D_i allocated to the dispersed ECNs K_i. The decentralized optimization problem can be formulated as follows. By defining x = [x_1, . . . , x_N] ∈ R^{pN×d} and introducing a global variable z ∈ R^{p×d}, problem (12) can be reformulated as

(P-1): min_{x,z} ∑_{i=1}^{N} f_i(x_i; D_i)   s.t.   x = (1 ⊗ I_p) z,

where 1 = [1, . . . , 1]^T ∈ R^N and ⊗ is the Kronecker product. In what follows, we present communication-efficient and straggler-tolerant decentralized algorithms, by which the agents can collaboratively find an optimal solution through local computations and limited information exchange among neighbors. In the scheme, the local gradients are calculated at the dispersed ECNs, while the variables, including the primal and dual variables and the global variable z, are updated at the corresponding agent. For illustration purposes, we first present the stochastic incremental ADMM (sI-ADMM) and then its coded version (csI-ADMM); both were proposed in [15]. We first review the standard incremental ADMM iterations for decentralized consensus optimization. The augmented Lagrangian function of problem (P-1) is

L_ρ(x, z, y) = ∑_{i=1}^{N} f_i(x_i; D_i) + ⟨y, x − (1 ⊗ I_p) z⟩ + (ρ/2) ||x − (1 ⊗ I_p) z||²,

where y = [y_1, . . . , y_N] ∈ R^{pN×d} is the dual variable and ρ > 0 is a penalty parameter.
With incremental ADMM (I-ADMM) [39,40], under the guarantee that ∑_{i=1}^{N} (x^1_i − y^1_i/ρ) = 0 (e.g., by initializing x^1_i = y^1_i = 0), the updates of x, y and z at the (k + 1)-th iteration proceed as follows: only the active agent i = i_k updates its primal variable x_i (via the argmin of the augmented Lagrangian) and its dual variable y_i, while all other agents keep x^k_i and y^k_i unchanged ((15a), (15b)); the global variable z is then updated from the newly updated local variables ((15c)). For ADMM, solving the augmented Lagrangian subproblem, especially the x-update above, may lead to rather high computational complexity. To achieve fast computation for the x-update, a first-order approximation and mini-batch stochastic optimization can be adopted in (15a). Furthermore, a quadratic proximal term with parameter τ_k is introduced in [15] to stabilize the convergence behavior of the inexact augmented Lagrangian method. Ref. [15] also introduces an updating step size γ_k for the dual update. Both parameters τ_k and γ_k can be adjusted with the iteration k. The resulting updates of x and y at the (k + 1)-th iteration ((16a), (16b)) again only modify the variables of the active agent i_k, with all other agents keeping x^k_i and y^k_i. Here, G_i(x^k_i; ξ^k_i) is the mini-batch stochastic gradient, which can be obtained as

G_i(x^k_i; ξ^k_i) = (1/M) ∑_{l=1}^{M} ∇F_i(x^k_i; ξ^k_{i,l}),

where M is the mini-batch size, ξ^k_i = {ξ^k_{i,l}}_{l=1}^{M} denotes a set of independent and identically distributed samples selected in one batch, and ∇F_i(x^k_i; ξ^k_{i,l}) is the stochastic gradient of a single example ξ^k_{i,l}.
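As an illustration of the mini-batch stochastic gradient and of a first-order (linearized) proximal x-update of the kind described above, consider the following sketch (our own instance for a least-squares loss; the exact form and parameters of (16a) in [15] may differ):

```python
import numpy as np

def minibatch_grad(x, X, Y, M, rng):
    """G_i(x; xi) = (1/M) * sum_l grad F_i(x; xi_l)
    for the per-sample loss F_i(x; (a, b)) = 0.5 * (a @ x - b)^2."""
    idx = rng.choice(len(Y), size=M, replace=False)
    A, b = X[idx], Y[idx]
    return A.T @ (A @ x - b) / M

def linearized_prox_x_update(x, z, y, G, rho, tau):
    """Minimize <G, x'> + <y, x' - z> + rho/2 ||x' - z||^2 + tau/2 ||x' - x||^2."""
    return (rho * z + tau * x - G - y) / (rho + tau)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, 2.0, -1.0]) + 0.01 * rng.normal(size=100)
x = np.zeros(3); z = np.zeros(3); y = np.zeros(3)
G = minibatch_grad(x, X, Y, M=10, rng=rng)
x_new = linearized_prox_x_update(x, z, y, G, rho=1.0, tau=1.0)
```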

Mini-Batch Stochastic I-ADMM
For the above setup of ADMM, the response time is defined as the execution time for updating all variables in each iteration. In the updates, all steps, including the x-update, y-update and z-update, are assumed to be performed at the agents rather than the ECNs. In practice, the updates are often computed in tandem, which leads to a long response time. With the fast development of edge/fog computing, it is feasible to further reduce the response time, since the computation of the local gradients can be dispersed to multiple edge nodes, as shown in Figure 8. Each ECN computes a gradient using local data and shares the result with its corresponding agent; no information is exchanged directly among ECNs. Agents are activated in a predetermined cyclic pattern, e.g., according to a Hamiltonian cycle, and ECNs are activated whenever the connected agent is active, as shown in Figure 8. A Hamiltonian-cycle-based activation pattern is a cyclic pattern through the graph that visits each agent exactly once (i.e., 1 → 2 → 4 → 5 → 3 in Figure 8). Correspondingly, the mini-batch stochastic incremental ADMM (sI-ADMM) [15] is presented in Algorithm 3. At agent i_k, the global variable z^{k+1} is updated and passed as a token to the next agent i_{k+1} via a pre-determined traversing pattern, as shown in Figure 8. Specifically, in the k-th iteration with cycle index m = ⌈k/N⌉, agent i_k is activated. The token z^k is first received, and then the active agent broadcasts the local variable x^k_i to its attached ECNs K_i. According to the batch data with index I^k_{i,j}, a new gradient g_{i,j} is calculated at each ECN, followed by the gradient update, x-update, y-update and z-update at agent i_k, via steps 21-24 in Algorithm 3. At last, the global variable z^{k+1} is passed as a token to the neighbor i_{k+1}. In Algorithm 3, the stopping criterion is reached when ||z^k − x^k_i|| ≤ ε_pri and ||G_i(x^k_i; ξ^k_i) − y^k_i|| ≤ ε_dual, ∀i ∈ N, where ε_pri and ε_dual are two pre-defined feasibility tolerances. The main steps of Algorithm 3 (sI-ADMM), starting from the data-allocation phase (initialization steps omitted), are as follows:

4: divide the D_i labeled data into K_i equal disjoint partitions and denote each partition as ξ_{i,j}, j ∈ K_i;
5: for ECN j ∈ K_i do
6:   allocate ξ_{i,j} to ECN j;
7:   partition the ξ_{i,j} examples into multiple batches, each of size M/K_i;
8: end for
9: end for
10: Updating process:
11: for k = 1, 2, . . . do
12:   Steps of active agent i = i_k = (k − 1) mod N + 1:
13:   receive token z^k;
14:   broadcast local variable x^k_i to ECNs K_i;
15:   ECN j ∈ K_i computes gradients in parallel:
16:     receive local primal variable x^k_i;
17:     select batch index I^k_{i,j} = m mod (|ξ_{i,j}| K_i / M);
18:     update gradient g_{i,j} = (K_i/M) ∑_{l=1}^{M/K_i} ∇F_i(x^k_i; ξ^k_{i,l});
19:     transmit g_{i,j} to the connected agent;
20:   until the K_i-th responded message is received;
21:   update the gradient via gradient summation;
22:   update x^{k+1} according to (16a);
23:   update y^{k+1} according to (16b);
24:   update z^{k+1} according to (15c);
25:   send token z^{k+1} to agent i_{k+1} via link (i_k, i_{k+1});
26:   until the stopping criterion is satisfied.
27: end for

Coding for Local Optimization for sI-ADMM
With the limited reliability and computing capability of ECNs, straggling nodes may be a significant performance bottleneck in the learning network. To address this problem, error-control codes can be used to mitigate the impact of straggling nodes by leveraging data redundancy. Similar to Section 3, two MDS-based coding methods over the real field R, i.e., the fractional repetition scheme and the cyclic repetition scheme, can be adopted and integrated with sI-ADMM to reduce the response time in the presence of straggling nodes. The coded sI-ADMM (csI-ADMM) approach is presented in Algorithm 4. Denote the minimum required number of ECNs by R_i and the maximum number of stragglers the system can tolerate by S_i. Different from sI-ADMM, in csI-ADMM, encoding and decoding processes are performed at each ECN j ∈ K_i and its corresponding agent i, respectively. G_i(x^k_i; ξ^k_i) is updated via steps 15-20, where the local gradient is calculated at ECN j ∈ K_i in parallel from (S_i + 1)M/K_i selected batch samples, and the gradient summation can be recovered at the active agent i_k with the responses from any R_i out of the K_i ECNs, to combat slow links and straggler nodes. As in steps 22-26 of sI-ADMM, the activated agent i_k then updates the local variables successively. Computation redundancy is introduced, but agent i can tolerate any S_i = K_i − R_i stragglers.
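The agent-side recovery can be illustrated with a fractional-repetition assignment of partitions to ECNs (a simplified sketch of our own, with scalar partial gradients; the agent recovers the gradient sum although one ECN straggles):

```python
import numpy as np

def frac_rep_assignment(K, S):
    """Assign K partitions to K ECNs (K a multiple of S+1): the S+1 groups of
    K/(S+1) ECNs are replicas, each ECN holding S+1 consecutive partitions."""
    assert K % (S + 1) == 0
    return [list(range(j * (S + 1), (j + 1) * (S + 1)))
            for _ in range(S + 1) for j in range(K // (S + 1))]

def recover_sum(responses, assign, K):
    """Greedy decoding at the agent: pick responding ECNs whose partition
    sets are disjoint and cover all K partitions, then add their messages."""
    covered, total = set(), 0.0
    for e, msg in responses.items():
        parts = set(assign[e])
        if parts.isdisjoint(covered):
            covered |= parts
            total = total + msg
    assert covered == set(range(K)), "not decodable from these responses"
    return total

K, S = 6, 1                                  # 6 ECNs/partitions, tolerate 1 straggler
assign = frac_rep_assignment(K, S)
partial = np.arange(1.0, K + 1)              # partial gradient of each partition
responses = {e: partial[assign[e]].sum() for e in range(K) if e != 2}  # ECN 2 straggles
print(recover_sum(responses, assign, K), partial.sum())  # both 21.0
```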

Simulations for Coded Local Optimization
Both computer-generated (synthetic) and real-world datasets are used to evaluate the performance of the coded stochastic ADMM algorithms. The experimental network G consists of N agents and E = ηN(N − 1)/2 links, where η is the network connectivity ratio. For agent i, K_i = K ECNs with the same computing power (e.g., computation and memory) are attached. To reduce the impact of the token traversing pattern, both the Hamiltonian-cycle-based and non-Hamiltonian-cycle-based (i.e., shortest-path-cycle-based [41]) token traversing methods are evaluated for the proposed algorithms.
To demonstrate the advantages of the coding schemes, the csI-ADMM algorithms are compared with the uncoded sI-ADMM algorithms with respect to the accuracy [42], which is defined in terms of the distance between the current iterate and the optimal solution x* ∈ R^{p×d} of (P-1), and the test error [43], which is defined as the mean-squared-error loss. To demonstrate the robustness against straggler nodes, distributed coding schemes, including the cyclic and fractional repetition methods as well as the uncoded method, are used for comparison. For a fair comparison, the algorithm parameters are tuned and kept the same across the different experiments. Moreover, unicast is considered among agents, and the communication cost per link is 1 unit. The time consumed by each communication among agents is assumed to follow a uniform distribution U(10^-5, 10^-4) seconds. The response time of each ECN is measured by its computation time, and the overall response time of each iteration is equal to the execution time for updating all variables in that iteration. All experiments were performed using Python on a laptop with an Intel CPU @ 2.3 GHz and 16 GB of RAM.
To show the benefit of coding, in Figure 9 we compare the accuracy vs. running time for both coded and uncoded sI-ADMM. In the simulation, a maximum straggler delay ε_i (i = 1, 2, 3) per iteration is considered; for illustration purposes, we set different values with ε_1 > ε_2 > ε_3. To show the benefit of coding for the convergence rate, i.e., the convergence vs. straggler-nodes trade-off of csI-ADMM, the impact of the number of straggler nodes on the convergence speed is shown in Figure 10. In the simulations, 10 independent experiment runs with the same setup are performed on synthetic data, and the average is presented. We can see that, with an increasing number of straggler nodes, the convergence speed decreases. This is because increasing the number of straggler nodes decreases the allowable mini-batch size allocated in each iteration and therefore slows the convergence.

Discussion
Above, we discussed the application of error-control coding in the local optimization step of ADMM. In the agent consensus step, there can also be straggling or transmission errors when updating the global variables. To improve the reliability of the consensus step, we can use linear network error-correction codes [31] or BATS codes [32] based on LT codes. For the latter, the global variable (vector) is divided into many smaller vectors, and the encoding process continues until certain stopping criteria are reached (e.g., feedback from other nodes or a timeout). There are quite a few papers on applying network coding to consensus; see [44,45].
Since there is no significant difference between the consensus process of the global variables of ADMM or other types of messages, interested readers are referred to these papers for further reading. We note that network coding can improve both the reliability and security of the consensus, i.e., as secure network codes [46].

Conclusions and Future Work
We discussed how coding can be used to improve the reliability and reduce the communication loads of both primal- and primal-dual-based DML. We discussed both deterministic (and optimal) and random constructions of error-control codes for DML. Owing to its low complexity and high flexibility, the latter may be more suitable for large-scale DML. For primal-dual-based DML (i.e., ADMM), we discussed a separate coding process for the two steps of ADMM, i.e., for the local optimization and consensus processes separately. We introduced algorithms for using codes in the local optimization of ADMM.
With emerging applications of increasing interest, DML will become more and more common. Another interesting area for applying coding to DML is security. Though DML has a certain privacy-preserving capability (compared to transmitting raw data), a higher security standard may be needed for sensitive applications. Secure coding has been an active topic for years; see [47]. We also have preliminary results on improving privacy by artificial noise in DML [40]. However, further study is needed to improve performance and to cover more general scenarios.
Another interesting area for future work may be the further study of coding for primal-dual methods. Though separate coding for the two steps of ADMM may partly solve the problem, the coding efficiency may be low and the system complexity may be high. As discussed in Section 5, directly applying error-control codes to ADMM may be infeasible. Another potential approach may be to simplify the optimization functions without significant performance loss, so that error-control codes can be applied.