Coordinate-Descent Adaptation over Hamiltonian Multi-Agent Networks

The incremental least-mean-square (ILMS) algorithm is a useful method for distributed adaptation and learning over Hamiltonian networks. To implement the ILMS algorithm, each node needs to receive the local estimate of the previous node on the cycle path in order to update its own local estimate. However, in some practical situations, perfect data exchange may not be possible among the nodes. In this paper, we develop a new version of the ILMS algorithm in which, at each adaptation step, only a random subset of the coordinates of the update vector is available. We draw a comparison between the proposed coordinate-descent incremental LMS (CD-ILMS) algorithm and the ILMS algorithm in terms of convergence rate and computational complexity. Employing the energy-conservation approach, we derive closed-form expressions that describe the learning curves in terms of excess mean-square error (EMSE) and mean-square deviation (MSD). We show that the CD-ILMS algorithm has the same steady-state error performance as the ILMS algorithm, although it converges more slowly. Numerical examples verify the efficiency of the CD-ILMS algorithm and the accuracy of the theoretical analysis.

In typical consensus strategies, each node collects noisy measurements over a period of time and performs a local processing task (e.g., computes a local estimate). The nodes then collaborate through several iterations, sharing their information until agreement is reached. In incremental strategies, a Hamiltonian cycle is established through the nodes: each node receives data from the previous node on the cycle path and sends its updated data to the next node. This mode of cooperation reduces inter-node communication across the network. In diffusion strategies, information is spread through the network simultaneously and locally at all nodes, where the distributed processing task is performed. Since in this mode of cooperation each node collaborates with all of its connected neighbors, the communication burden is higher than that of an incremental strategy.
In this paper, we propose an incremental algorithm for solving the distributed estimation problem over a Hamiltonian network. One of the primary methods for this problem is the incremental least-mean-square (ILMS) algorithm reported in [19]. In this algorithm, the locally calculated estimates are updated at each node by the LMS adaptation rule, without requiring any prior information about the data statistics. To implement the ILMS algorithm, it is assumed that, at each time instant, the adaptation step has access to all entries of the approximate gradient vector; see expression (3) further ahead. In some practical situations (e.g., limited communication resources, or missing data due to imperfect communication links), only a fraction of the approximate gradient vector is available for adaptation. Developing a version of the ILMS algorithm for such scenarios is therefore of practical importance.
Different algorithms have been reported in the literature that solve the problem of distributed learning with partial information. In [30][31][32], a Bernoulli variable is used to model the step-size parameter, so that when the step-size is zero the adaptation step is skipped and not all entries of the intermediate estimates are updated. In [33][34][35][36][37][38], a different notion of partial diffusion of information is employed, which relies on the exchange of a subset of the entries of the weight estimate vectors themselves; that is, each node shares only a fraction of its local information with its neighbors. Likewise, in [39], a situation is considered in which some entries of the regression vector are missing at random due to incomplete information or censoring. To compensate for the missing data, the authors of [39] proposed a scheme to estimate the underlying model parameters. Other criteria have also been proposed in the literature to motivate partial-update schemes. For instance, in [40], periodic and sequential partial LMS updates are proposed: the former scheme updates all filter coefficients every P-th iteration, with P > 1, instead of every iteration, while the latter updates only a predetermined portion of the coefficients at each iteration. In [41], the stochastic partial LMS algorithm is proposed, a randomized version of the sequential scheme in which the coefficient subsets are selected in a random rather than deterministic fashion. The works [42,43] utilize the concept of set-membership filtering to partially update the weight vector, where updates occur only when the innovation obtained from the data exceeds a predetermined threshold. The works [12,44,45] employ the Krylov-subspace concept, based on dimensionality-reduction policies, to partially update the weight vector. Some other techniques limit the updates based on energy considerations, e.g., [46].
In [47], a reduced-complexity augmented complex least-mean-square algorithm is proposed by employing partial updates for improper complex signals; the algorithm selects a fraction of the coefficients at every iteration. In [48], the authors focus on diffusion learning mechanisms and study their mean-square error performance under a generalized coordinate-descent scheme, where the adaptation step at each node involves only a random subset of the coordinates of the gradient vector. None of the aforementioned works, however, relies on the adaptive incremental strategy. To address this gap, we combine two techniques (the incremental mode of cooperation and coordinate-descent updates) into a new method, termed coordinate-descent ILMS (CD-ILMS), for the distributed estimation problem.
The main contributions of this manuscript can be stated as follows:
1. A CD-ILMS algorithm is proposed, both to model situations where some entries of the approximate gradient vectors are missing and to deliberately control the computational burden of the estimate update;
2. We examine the effect of random partial gradient information on the learning performance and convergence rate of the ILMS algorithm for the MSE cost function;
3. A theoretical performance analysis of the CD-ILMS algorithm is carried out under some simplifying assumptions and approximations that are typical in adaptive systems and lead to predictions that match well with practice;
4. Stability conditions for the CD-ILMS algorithm are derived in both the mean and mean-square senses under certain statistical conditions; in particular, we find the condition on the step-size µ that guarantees the stability of the CD-ILMS algorithm;
5. Employing the energy-conservation paradigm, we derive closed-form expressions that describe the learning curves in terms of excess mean-square error (EMSE) and mean-square deviation (MSD);
6. We compare the CD-ILMS algorithm with the regular full-update ILMS algorithm in terms of convergence rate and computational complexity. We find that, although the CD-ILMS algorithm takes longer to reach the steady-state region, its overall computational cost remains the same, since the savings in computation per iteration offset the larger number of iterations.
The paper is organized as follows. In Section 2, we briefly review the ILMS algorithm and formulate the CD-ILMS algorithm. In Section 3.1, we describe the system model and assumptions. The stability and performance of the CD-ILMS algorithm are analyzed in the mean and mean-square senses in Sections 3.2 and 3.3, respectively. Sections 3.4 and 3.5 investigate the transient and steady-state performance of the CD-ILMS algorithm, respectively. In Section 4, we compare the ILMS and CD-ILMS algorithms on the basis of convergence rate and computational complexity. Performance evaluations are illustrated in Section 5. The paper is concluded in Section 6.
Notation: Throughout the paper, we adopt normal lowercase letters for scalars, bold lowercase letters for column vectors, and bold uppercase letters for matrices, while I denotes an identity matrix of appropriate size. The superscript (·)^T denotes transposition for real-valued vectors and matrices. The symbol E[·] is the expectation operator, tr{·} represents the trace of its matrix argument, vec{·} vectorizes a matrix by stacking its columns on top of each other, and ⊗ is the standard Kronecker product.

Algorithm Description
We consider a network of K nodes with an incremental cooperation topology, as shown in Figure 1. The network is used to estimate the parameter w^o ∈ R^{L×1} that minimizes the following aggregate global cost function:

J^{glob}(w) = Σ_{k=1}^{K} f_k(w),   (1)

where it is assumed that each individual cost function f_k(w) : R^{L×1} → R is ν-strongly convex and its gradient satisfies the δ-Lipschitz condition [25]. These conditions are equivalent to requiring f_k(w) to be twice-differentiable and the Hessian matrices of the individual costs, ∇²_w f_k(w) ∈ R^{L×L}, to be bounded as

ν I_L ≤ ∇²_w f_k(w) ≤ δ I_L,   (2)

for some positive parameters {ν, δ} with ν < δ. Some cases of interest, e.g., log-loss regression or MSE costs [25,26], automatically satisfy the condition in (2).

Incremental Steepest-Descent Solution
Applying the steepest-descent method in conjunction with the incremental strategy to (1) gives the incremental steepest-descent solution listed in Algorithm 1. Observe that, in this implementation, each node k needs to receive the estimate of w^o calculated at node k − 1 (denoted by ψ_{k−1}^{(t)}). In (3), w_t denotes the global estimate of w^o at time instant t. Moreover, since the true gradient vectors, {∇_{w^T} f_k(·)}, are not available beforehand in many practical applications, they are replaced by approximations, {∇̂_{w^T} f_k(·)}.
Algorithm 1. Incremental Strategy [19].

Coordinate-Descent Incremental Adaptation
Consider a case in which, due to some practical issues (e.g., missing data or limited computational and communication resources), only a subset of the entries of the approximate gradient vector can be used at each update. To handle such a situation, a suitable model must first be adopted. To this end, we follow an approach similar to [48] and define a diagonal random matrix Π_{k,t} ∈ R^{L×L} for node k at iteration t as

Π_{k,t} = diag{ π_{k,t}(1), π_{k,t}(2), . . . , π_{k,t}(L) },

where the π_{k,t}(l) are Bernoulli random variables, so that the randomness in the update varies across space, k, and over time, t. We have π_{k,t}(l) = 1 with probability 1 − π_k and π_{k,t}(l) = 0 with probability π_k; when π_{k,t}(l) = 0, the l-th entry of the approximate gradient in (3) is not updated. Multiplying the approximate gradient vector by Π_{k,t}, we replace the full-update adaptation step by its partially updated counterpart.
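As an illustration, the random selection matrix Π_{k,t} can be sketched in a few lines of Python; the helper name `selection_matrix` and the parameter values below are our own choices for illustration, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def selection_matrix(L, pi_miss, rng):
    """Diagonal Bernoulli selection matrix Pi_{k,t}: each diagonal entry
    equals 1 with probability 1 - pi_miss (the coordinate is updated)
    and 0 with probability pi_miss (the coordinate is skipped)."""
    return np.diag((rng.random(L) > pi_miss).astype(float))

L, pi_miss = 8, 0.6
Pi = selection_matrix(L, pi_miss, rng)
grad = rng.standard_normal(L)      # stand-in for the approximate gradient
masked = Pi @ grad                 # only a random subset of entries survives
```

Multiplying by `Pi` zeroes out exactly the coordinates whose Bernoulli draw was 0, which is the mechanism the analysis models through E[Π_{k,t}] = (1 − π_k) I_L.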

Data Model and Assumptions
Consider MSE networks where, at each time instant t, each node k observes a scalar measurement d_k(t) ∈ R and a 1 × L row regression vector u_{k,t} ∈ R^{1×L}. A linear regression model is used to describe the data collected at each individual node:

d_k(t) = u_{k,t} w^o + v_k(t),   (7)

where v_k(t) is the noise process. The individual cost function f_k(w) associated with each node k is the MSE (quadratic) cost

f_k(w) = E[ |d_k(t) − u_{k,t} w|² ].

To proceed with the analysis, we introduce some notation and assumptions. In our analysis, we assume that:

Assumption 1.
(i) For all nodes k and t ≥ 1, the measurement noises {v_k(t)} are zero-mean, white, independent of the input and desired signals, and have variances σ²_{v,k};
(ii) The regression data u_{k,t}, for all nodes k and all observation times t ≥ 1, are zero-mean and white over time and space, with

E[ u_{k,t}^T u_{ℓ,s} ] = R_{u,k} δ_{kℓ} δ_{ts},

where 0 < R_{u,k} ∈ R^{L×L} and δ_{kℓ} denotes the Kronecker delta, i.e., it is equal to one when k = ℓ and zero otherwise;
(iii) The step-sizes µ_k are small enough that terms quadratic in µ_k can be neglected.
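The data model (7) and Assumption 1 can be mimicked in a short Python sketch; the function name `observe` and the chosen covariance (white regressors, R_u = I_L) are our own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 8

# Unknown parameter, normalized to unit norm as in the simulation section.
w_o = rng.standard_normal(L)
w_o /= np.linalg.norm(w_o)

def observe(w_o, sigma2_v, rng):
    """One (d_k(t), u_{k,t}) pair drawn from the linear model
    d_k(t) = u_{k,t} w_o + v_k(t) under Assumption 1."""
    u = rng.standard_normal(len(w_o))              # white regressor, R_u = I_L
    v = np.sqrt(sigma2_v) * rng.standard_normal()  # zero-mean white noise
    return u @ w_o + v, u

d, u = observe(w_o, 0.01, rng)
```

In the noiseless limit (sigma2_v = 0) the measurement reduces exactly to the inner product u_{k,t} w^o, which is a quick sanity check on the model.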
Let K_{t−1} represent the information available about the random processes ψ_k^{(τ)} and Π_{k,τ} at all agents k = 1, 2, . . . , K for τ ≤ t − 1.

Assumption 2. For all k, ℓ, m, n, the indicator variables π_{k,t}(m) and π_{ℓ,t}(n) are assumed to be independent of each other. Moreover, the random variables π_{k,t}(m) and K_{t−1} are independent of each other, for all nodes k and all iterations t.
Based on the adaptive implementation of the incremental strategy [19], the adaptive distributed coordinate-descent incremental LMS (CD-ILMS) algorithm is summarized in Algorithm 2. In this algorithm, each node updates the local estimate as

ψ_k^{(t)} = ψ_{k−1}^{(t)} + µ_k Π_{k,t} u_{k,t}^T ( d_k(t) − u_{k,t} ψ_{k−1}^{(t)} ).   (10)

Algorithm 2. CD-ILMS.
Initialization: start with w_{−1} = initial condition.
for every time t ≥ 0 do
    set the fictitious boundary condition ψ_0^{(t)} = w_{t−1}
    for k = 1 to K do
        update ψ_k^{(t)} according to (10)
    end for
    set w_t = ψ_K^{(t)}
end for
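One full run of the CD-ILMS recursion under the MSE data model can be sketched as below. This is a minimal simulation under our own illustrative parameter choices (white unit-variance regressors, identical step-sizes and missing probabilities across nodes); the elementwise mask plays the role of the diagonal matrix Π_{k,t}.

```python
import numpy as np

rng = np.random.default_rng(2)
K, L, T = 20, 8, 2000            # nodes, filter length, time iterations
mu, pi_miss, sigma2_v = 0.02, 0.6, 0.01

w_o = rng.standard_normal(L)
w_o /= np.linalg.norm(w_o)       # unknown parameter, unit norm

w = np.zeros(L)                  # w_{-1}: initial global estimate
for t in range(T):
    psi = w.copy()               # fictitious boundary condition: psi_0 = w_{t-1}
    for k in range(K):
        u = rng.standard_normal(L)                       # regressor, R_u = I_L
        d = u @ w_o + np.sqrt(sigma2_v) * rng.standard_normal()
        mask = (rng.random(L) > pi_miss).astype(float)   # diag of Pi_{k,t}
        psi = psi + mu * mask * u * (d - u @ psi)        # partial LMS adaptation
    w = psi                      # node K delivers the global estimate w_t

msd = np.sum((w - w_o) ** 2)     # squared deviation after T cycles
```

With these small step-sizes the final squared deviation settles near the small steady-state level predicted by the analysis, even though only about 40% of the coordinates are updated at each step.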

Mean Stability Analysis
We now analyze the mean behavior of the proposed algorithm. More specifically, we seek conditions under which, for sufficiently large t and for all k, the proposed algorithm is asymptotically unbiased. To proceed, the local error signals ψ̃_k^{(t)} = w^o − ψ_k^{(t)} are defined. Replacing d_k(t) from (7) into (10) and subtracting w^o from both sides gives (11). Taking statistical expectations of both sides of (11) and using the items of Assumption 1, we obtain (12). Iterating recursion (12) over t and setting k = K, we deduce that E[ψ̃_k^{(t)}] evolves according to (13), where Q_k ≜ I_L − µ_k (1 − π_k) R_{u,k}. Clearly, the mean weight-error vector of the CD-ILMS algorithm depends on the spectral radius of the product Q ≜ Q_K Q_{K−1} · · · Q_1. The following proposition summarizes the required mean stability condition for the proposed algorithm.

Proposition 1 (Mean Stability)
. Assume the data model (7) and Assumption 1 hold. Then, the CD-ILMS algorithm is asymptotically unbiased if, and only if, the step-size parameters {µ_k} satisfy the condition

0 < µ_k < 2 / ( (1 − π_k) λ_max(R_{u,k}) ),   (14)

where λ_max(·) denotes the largest eigenvalue of its argument.
Proof. The asymptotic unbiasedness of the CD-ILMS algorithm is guaranteed if, and only if, the matrix Q is stable or, equivalently, all its eigenvalues lie inside the unit disc. Since the spectral radius of a matrix is bounded by any induced norm of that matrix [49], we have ρ(Q) ≤ ‖Q_K‖₂ ‖Q_{K−1}‖₂ · · · ‖Q_1‖₂ = ρ(Q_K) ρ(Q_{K−1}) · · · ρ(Q_1), where the equality holds because every Q_k is a symmetric matrix, so its spectral radius agrees with its 2-induced norm. Accordingly, to guarantee stability, it is enough to have ρ(Q_k) < 1 for all k, which is equivalent to (14). Thus, the conclusion in (14) is verified.
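The step-size bound in (14) is easy to check numerically. The sketch below is our own construction: it builds an arbitrary positive-definite covariance R_u and verifies that the spectral radius of Q_k = I_L − µ (1 − π) R_u crosses 1 exactly when µ exceeds 2/((1 − π) λ_max(R_u)).

```python
import numpy as np

rng = np.random.default_rng(3)
L, pi_miss = 4, 0.6

A = rng.standard_normal((L, L))
R_u = A @ A.T + 0.1 * np.eye(L)        # an arbitrary positive-definite covariance
lam_max = np.linalg.eigvalsh(R_u).max()

mu_bound = 2.0 / ((1 - pi_miss) * lam_max)   # the bound in (14)

def rho_Q(mu):
    """Spectral radius of Q_k = I - mu (1 - pi) R_u."""
    Q = np.eye(L) - mu * (1 - pi_miss) * R_u
    return np.max(np.abs(np.linalg.eigvals(Q)))

rho_inside  = rho_Q(0.5 * mu_bound)    # step-size inside the bound
rho_outside = rho_Q(1.5 * mu_bound)    # step-size outside the bound
```

Inside the bound the spectral radius is strictly below one (stable mean recursion); outside it exceeds one, so the mean weight-error vector diverges.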

Mean-Square Performance
In this section, the mean-square performance of the error recursion (11) is examined. We start by equating the squared weighted Euclidean norms of both sides of (11), taking expectations, and employing Assumption 1. After a straightforward calculation, we find the weighted-variance relation (18), which holds for any arbitrary positive-definite weighting matrix Σ. Due to the assumptions on v_k(t), the expectations of the cross-terms involving v_k(t) evaluate to zero. Note that the items of Assumption 1 guarantee that ψ̃_{k−1}^{(t)} is independent of Σ and u_{k,t}, so the expectation in (18) can be rewritten as (19). Introducing the definitions (20)–(22), we can rewrite (18) as (23). Applying the vectorization operator to both sides of (19) and using (21), (22), Assumptions 1 and 2, and the relationship between the Kronecker product and the vectorization operator (for any matrices {A, Σ, B} of compatible dimensions, vec{AΣB} = (B^T ⊗ A)vec{Σ}), we find that the weighting vectors {σ, σ′} satisfy

σ′ = F_k σ,   (24)

where F_k ∈ R^{L²×L²} is given by (25). Note that vectorization commutes with expectation. Under Assumption 1, we can approximate F_k as in (26). To evaluate E[ u_{k,t} Π_{k,t} Σ Π_{k,t} u_{k,t}^T ], we employ a useful property from matrix algebra relating the vectorization operator to the trace operator (for real matrices {A, B}, tr{A^T B} = vec^T{B} vec{A} [50]), together with the fact that Σ is symmetric and deterministic, to obtain (27), where the matrix H_k is defined in (28) and, in light of Assumptions 1 and 2, can be evaluated as in (29). It follows by direct inspection that the entries of H_k are given by (30). Substituting (24) and (27) into (23) gives (31). In the following proposition, we summarize the conditions that guarantee the mean-square convergence of the CD-ILMS algorithm.
Proposition 2 (Mean-Square Stability). Assume the data {d_k(t), u_{k,t}} satisfy the model (7). Under Assumptions 1 and 2, the recursion of type (31) is stable and convergent if, and only if, the matrix F_k given by (25) is stable. Moreover, the stability of F_k is achieved when the {µ_k} are chosen so as to satisfy (14).
Proof. Let A and B be two arbitrary matrices of compatible dimensions. Then, any eigenvalue of A ⊗ B is a product λ_i(A) × λ_j(B), where λ_i(A) is an eigenvalue of A and λ_j(B) is an eigenvalue of B [50]. Moreover, A and A^T have the same set of eigenvalues. Accordingly, using expression (26), we have ρ(F_k) = [ρ(I_L − µ_k (1 − π_k) R_{u,k})]². It follows that F_k is stable if, and only if, (I_L − µ_k (1 − π_k) R_{u,k}) is stable. Therefore, step-sizes {µ_k} that guarantee mean stability ensure mean-square stability as well, and the result (14) follows.
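The Kronecker-product eigenvalue property used in this proof can be verified numerically with a couple of random matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# Eigenvalues of the Kronecker product...
eig_kron = np.linalg.eigvals(np.kron(A, B))

# ...versus all pairwise products of the factors' eigenvalues.
eig_prod = np.array([la * lb
                     for la in np.linalg.eigvals(A)
                     for lb in np.linalg.eigvals(B)])

# Every pairwise product appears (to numerical precision) in the
# spectrum of the Kronecker product.
same_spectrum = all(np.min(np.abs(eig_kron - z)) < 1e-8 for z in eig_prod)
```

This matching-by-nearest-value check avoids having to sort complex spectra, which can be fragile under floating-point perturbations.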

Learning Curves
In this section, the variance relation (31) is employed to obtain a recursive equation that describes the transient behavior of the CD-ILMS algorithm. Since the weighting matrices {Σ, Σ′} can be node-dependent, we replace {σ, σ′} by {σ_k, σ′_k}, so that (31) can be rewritten in the form (32). To couple these relations across the cycle, the incremental topology along with the weighting matrices is exploited [19] as follows: first, by iterating (32), a set of K coupled equalities (33) is obtained.
To explain the flow of energy through the agents, the free parameters σ and σ′ must be related. Choosing the weighting vector σ_{k−1} = F_k σ_k and combining (33) and (34) gives (35). Iterating in this manner, a recursive expression (36) is obtained that relates the error quantities ψ̃_k^{(t)} across iterations. Choosing σ = q = vec{I_L} and substituting it into (36) gives expression (37), which describes how the MSD of node k evolves over time, where Γ_{k−1,ℓ} ∈ R^{L²×L²} denotes a product of the F_k matrices of the nodes and a_{k−1} ∈ R^{1×L²} is defined by

a_{k−1} ≜ g_k Γ_{k−1,2} + g_{k+1} Γ_{k−1,3} + · · · + g_{k−2} Γ_{k−1,K} + g_{k−1}.   (39)

Therefore, the theoretical expression for the MSD of node k is given in the compact form (40), where the vectors γ_{t−1} and b_{t−1} are defined in (41). Similarly, the selection σ_{k−1} = r_k = vec{R_{u,k}} leads to an expression that describes how the EMSE of node k evolves over time, with the corresponding vectors γ_{t−1} and b_{t−1} given accordingly.

Steady-State Behavior
The variance relation (32) can also be used to evaluate the steady-state performance at node k. At steady state, i.e., as t → ∞, the variance relation (32) reduces to (42). Iterating Equation (42) over the incremental network topology and selecting appropriate weighting vectors σ_k for k = 1, 2, . . . , K, we obtain the equality (43). Using (38) and (39) to simplify expression (43), we obtain (44). Equation (44) can be employed to evaluate the performance measures at node k, defined in (45) and (46). Proper selection of the weighting vector σ_{k−1} in (44) gives the required mean-square values. Selecting the weighting vector σ_{k−1} as the solution of the linear equation (I_{L²} − Γ_{k−1,1}) σ_{k−1} = q or (I_{L²} − Γ_{k−1,1}) σ_{k−1} = r_k, the desired MSD and EMSE at node k are obtained as (47) and (48), respectively. The vector a_{k−1} can be regarded as the aggregate effect of the transformed noise and the local data statistics from all nodes in the incremental topology, and the matrix Γ_{k−1,ℓ} can be interpreted as the transition matrix for the weighting vector σ_{k−1}.

Remark 1.
It should be noted that (47) and (48) are exactly the theoretical MSD and EMSE of the full-update ILMS algorithm, which means that the CD-ILMS algorithm has the same steady-state error performance as the incremental LMS algorithm.

Further Insight into the Proposed CD-ILMS
To gain more insight into the performance of the CD-ILMS algorithm, in this section we compare its convergence rate and computational complexity with those of the regular full-update ILMS algorithm [19].

Convergence Rate
To make the analysis more tractable, we adopt the following assumption.

Assumption 3. The missing probabilities {π_k}, the covariance matrices {R_{u,k}}, and the step-sizes {µ_k} are identical across the nodes, i.e., π_k = π, µ_k = µ, and R_{u,k} = R_u for all k.

It can be seen from (13) that the evolution of the weight-error vector for both the CD-ILMS and ILMS algorithms is controlled by the modes of the matrices Q_coor and Q_full, where the subscripts 'coor' and 'full' denote, respectively, the stochastic coordinate-descent and full-update incremental implementations. Under Assumption 3, these matrices take the forms (51) and (52), from which the modes of convergence are found to be

κ_coor,l = [1 − µ (1 − π) λ_l]^K,   κ_full,l = [1 − µ λ_l]^K,

where λ_l denotes the l-th eigenvalue of R_u. Let τ_coor and τ_full represent the largest numbers of time iterations required for the mean error vector, E[ψ̃_k^{(t)}], to converge to zero. Then it holds that

τ_coor / τ_full = ln(κ_full,l) / ln(κ_coor,l) ≈ 1 / (1 − π),   (55)

where, in the approximation step, we used ln(1 − z) ≈ −z as z → 0.
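The approximation in (55) can be checked with a quick numerical sketch; the values of µ, λ, π, and K below are arbitrary illustrative choices. Note that the ratio is independent of the network size K, since the exponent cancels in the ratio of logarithms.

```python
import numpy as np

mu, lam, pi_miss, K = 0.01, 1.0, 0.6, 20   # illustrative values

kappa_full = (1 - mu * lam) ** K                  # full-update mode
kappa_coor = (1 - mu * (1 - pi_miss) * lam) ** K  # coordinate-descent mode

ratio  = np.log(kappa_full) / np.log(kappa_coor)  # = tau_coor / tau_full
approx = 1.0 / (1 - pi_miss)                      # small-step approximation
```

For these values the exact log-ratio is within a fraction of a percent of 1/(1 − π) = 2.5, confirming that the small-step-size approximation is accurate.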

Remark 2.
Expression (55) quantifies the increase in convergence time of the CD-ILMS algorithm. Since the convergence time is longer, the CD-ILMS algorithm may require more data to be exchanged through the network in total, compared with the ILMS algorithm.

Computational Complexity
We now discuss the computational complexity of the full-update ILMS algorithm and its coordinate-descent implementation. Let θ_add ≥ 0 and θ_mul ≥ 0 represent the numbers of real additions and multiplications, respectively, required for every entry of the gradient vector. In the CD-ILMS algorithm, every node requires, on average, (1 − π)(θ_mul L + L) real multiplications and (1 − π)(θ_add L + L) real additions per iteration in the adaptation step (10), while in the full implementation, every node requires θ_mul L + L multiplications and θ_add L + L additions per iteration. Moreover, let β_coor,k and β_full,k denote the numbers of multiplications needed by the adaptation step per iteration at agent k in the CD-ILMS and full-update cases, i.e.,

β_full,k = (θ_mul + 1) L,   (56)
β_coor,k = (1 − π)(θ_mul + 1) L.   (57)

If these algorithms require τ_coor and τ_full iterations to attain their steady-state values, then the total numbers of real multiplications at node k, denoted α_coor,k and α_full,k, are obtained as

α_coor,k = β_coor,k × τ_coor,   (58)
α_full,k = β_full,k × τ_full,   (59)

so that, employing (55), τ_coor ≈ τ_full / (1 − π). Substituting (56) and (57) into (58) then gives α_coor,k ≈ (1 − π)(θ_mul + 1) L × τ_full / (1 − π) = (θ_mul + 1) L τ_full = α_full,k.

Remark 3. It is clear that α_coor,k and α_full,k are essentially identical. This means that, although the convergence of the CD-ILMS algorithm to the steady-state region takes longer, the overall computational cost remains invariant: the savings in computation per iteration are offset by the larger number of iterations. Generally speaking, in situations where the computational complexity per iteration must be kept minimal, the coordinate-descent scheme is recommended, at the price of more iterations to achieve the same steady-state performance.
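The complexity bookkeeping above can be replayed directly; θ_mul, L, and τ_full below are arbitrary illustrative values.

```python
import math

theta_mul, L_dim = 2, 8        # hypothetical multiplies per gradient entry, filter length
pi_miss = 0.6                  # missing probability
tau_full = 1000                # iterations to steady state, full update

beta_full = (theta_mul + 1) * L_dim                 # multiplications/iteration, full
beta_coor = (1 - pi_miss) * (theta_mul + 1) * L_dim # multiplications/iteration, partial

tau_coor = tau_full / (1 - pi_miss)  # convergence-time ratio 1/(1 - pi)

alpha_full = beta_full * tau_full    # total multiplications, full update
alpha_coor = beta_coor * tau_coor    # total multiplications, coordinate descent
```

The factor (1 − π) cancels between the per-iteration saving and the extra iterations, so the two totals coincide, which is precisely the content of Remark 3.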

Remark 4 (Communication Costs).
In both the coordinate-descent incremental and regular incremental schemes, each node k receives the weight estimate ψ_{k−1}^{(t)} from its predecessor, node k − 1, on the cycle, so the number of scalar communications needed at each node per iteration is L. Moreover, the nodes also need to share the global estimate, w_{t−1}; thus, the communication requirement per iteration per node is 2L.

Simulation Results
This section provides computer simulations to evaluate the performance of the CD-ILMS algorithm and verify the theoretical analysis. We consider a distributed network of K = 20 agents with a ring topology (see Figure 2). The parameter vector w^o of length L = 8 is generated randomly with ‖w^o‖² = 1, and the step-sizes are µ_k = 0.02. To obtain more accurate results, we average over 100 independent runs. Although the analysis relies on the independence assumptions, all simulations are carried out with regressors having a shift structure, so as to address realistic scenarios. To this end, the regressors are generated by a first-order auto-regressive model, where the parameters z_k ∈ (−1, 1) and the h_{k,t} are zero-mean, unit-variance Gaussian sequences. For each node, the measurement noise v_k(t) is a zero-mean white Gaussian sequence with σ²_{v,k} ∈ (0.001, 0.1). Figure 3 illustrates the network statistical settings.

In Figure 4, we plot the experimental and theoretical MSD and EMSE learning curves using both full and partial updates for different values of π_k ∈ {0.4, 0.6, 0.9}. It can be observed that the CD-ILMS algorithm has the same steady-state error performance as the ILMS algorithm. Moreover, as expected, as π_k → 0 the performance of the coordinate-descent scheme approaches that of the full-gradient incremental algorithm. Generally speaking, the speed of convergence of the coordinate-descent scheme degrades proportionally as the missing probability is increased.

In Figure 5, we show the experimental and theoretical steady-state MSDs and EMSEs for π_k = 0.6. The steady-state MSD and EMSE values calculated using the theoretical expressions (47) and (48) agree well with those obtained from the simulations.

Figure 6 shows the modes of convergence for the case R_u = I_L and K = 10, for π_k ∈ {0.5, 0.8}, as a function of μ̄ = µλ, for both the ILMS and CD-ILMS algorithms.
From Figure 6, it can be observed that, regardless of the value of the step-size parameter, the full-update ILMS algorithm is always faster than the CD-ILMS algorithm. Figure 6. Modes of mean convergence for the ILMS and CD-ILMS algorithms using K = 10, with π = 0.8 (top) and π = 0.5 (bottom).

Conclusions
In this paper, we have derived the CD-ILMS algorithm for adaptive estimation over incremental networks and presented a detailed performance analysis based on the weighted energy-conservation approach under certain assumptions and approximations. More specifically, we have derived mean and mean-square stability conditions, as well as theoretical expressions for the steady-state values and learning curves of the MSD and EMSE. To gain further insight into the performance of the CD-ILMS algorithm, its convergence rate and computational complexity have been compared with those of the full-update ILMS algorithm. It has been shown that the full-update ILMS and CD-ILMS algorithms provide the same steady-state performance, while the full-update ILMS algorithm always converges faster. Finally, numerical examples have been provided to support the theoretical derivations.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.