An Effective Adaptive Combination Strategy for Distributed Learning Network

Abstract: In this paper, we develop a modified adaptive combination strategy for the distributed estimation problem over diffusion networks. We consider the online adaptive combiner estimation problem from the perspective of minimum variance unbiased estimation. In contrast with the classic adaptive combination strategy, which exploits an orthogonal projection technique, we formulate an unconstrained mean-square deviation (MSD) cost function by introducing Lagrange multipliers. Based on the Karush-Kuhn-Tucker (KKT) conditions, we derive a fixed-point iteration scheme for the adaptive combiners. Illustrative simulations validate the improved transient and steady-state performance of the diffusion least-mean-square (LMS) algorithm incorporating the proposed adaptive combination strategy.


Introduction
It is generally beneficial to exploit diffusion strategies for distributed parameter estimation problems over adaptive networks [1][2][3][4][5][6]. Specifically, diffusion least-mean-square (LMS)-based methods have already been used in many contexts, such as biological behavior modeling [7,8], distributed detection [9], distributed localization [10], and target tracking and escaping from predators [11], where scalability, robustness, and low power consumption are desirable features [1]. In the diffusion strategy, each node of the network is allowed to receive the intermediate estimates of its neighboring nodes to improve the accuracy of its local estimate. Such cooperation enables each node to leverage the spatial diversity of the noise profile across the entire network. From this point of view, the performance of distributed diffusion methods can be further enhanced by using suitable combination weights (combiners).
There have been several static combination rules [1,12], e.g., the Metropolis, Laplacian, Uniform, and Relative-degree rules. However, these static combiners are designed based solely on the network topology, so they generally cannot adapt to the spatial variation of signal and noise statistics. To address this problem, many studies resort to adaptive combination (AC) strategies [12][13][14][15][16][17][18], most of which are developed for the adapt-then-combine (ATC) diffusion LMS algorithm [1].
Based on minimum variance unbiased estimation (MVUE), the classic AC strategy [12] outperforms existing static combiners when applied to diffusion LMS algorithms. An optimal adaptive combination scheme is derived by adaptively estimating the variances of the measurement noises [13]. Simulation results validate the superior steady-state performance of the diffusion LMS algorithm with the optimal combiners of [13], as compared to previous static and adaptive combiners. Building on the adaptive combination rule in [13], an optimal combination rule that accounts for channel distortion has also been proposed [16]. To achieve both an accelerated convergence rate and good steady-state network performance, combination switching mechanisms [14,15] have been proposed, i.e., a static combination scheme in the converging stage and an AC scheme when approaching the steady state.
In addition, a decoupled adapt-then-combine (D-ATC) algorithm has been proposed, for which a least-squares (LS)-based AC scheme is developed [17,18]; it achieves performance close to that of the ATC algorithm with the classic AC in homogeneous networks.

Motivation and Contribution
As mentioned above, the classic AC strategy is derived based on MVUE, which has been validated as a feasible criterion [12]. In fact, one of the key techniques in the classic AC strategy [12] is an orthogonal projection that guarantees the combiners sum to 1. However, the orthogonal projection actually restricts the update direction of the combiners at each iteration; this restriction can be relaxed, as shown in Section 3.2, to further improve the performance of the diffusion LMS algorithm.
In this paper, we again formulate the online adaptive combiner estimation of the ATC algorithm from the perspective of MVUE. Instead of directly exploiting the orthogonal projection technique of [12], we present an unconstrained mean-square deviation (MSD) cost function based on Lagrange multipliers. Using the fixed-point iteration methodology and the KKT necessary conditions, we develop an effective adaptive combination strategy, which relies solely on the previous instantaneous intermediate weight estimates, without resorting to knowledge of the measurement data and noises. The proposed AC strategy can be seen as a modified and extended version of the classic AC in [12]. Simulations validate the superior performance of the diffusion LMS algorithm when using the proposed AC strategy.
Notation: R and C denote the fields of real and complex numbers, respectively. Scalars are denoted by lower-case letters, and vectors and matrices respectively by lower- and upper-case boldface letters. The transpose and conjugate transpose are denoted by (·)^T and (·)^H, respectively. E{·} represents expectation. ℜ(·) denotes the real part. col{·} stands for the vector obtained by stacking its arguments on top of one another. diag{·} generates a diagonal matrix from the given diagonal arguments. [·]_i stands for the ith element of a vector. min{·} denotes the minimum element of a vector. We define the eigenvalue set of the square matrix F as {λ(F)}, with λ_max(F) denoting the maximum eigenvalue. The spectral radius of the square matrix F is denoted by ρ(F) ≜ max{|λ(F)|}.

Model Assumption
Consider a network containing N nodes, which collectively estimate an M-dimensional unknown parameter w^o ∈ C^M. N_k denotes the set of neighbors of node k, including k itself. The cardinality of N_k is n_k. For each node k, at time instant t, the regressor u_k(t) ∈ C^M and the measurement signal d_k(t) ∈ C are available. The signal model is given by

d_k(t) = u_k^H(t) w^o + v_k(t),

where v_k(t) denotes the additive zero-mean white Gaussian measurement noise at node k, with variance σ²_v,k. For any k and t, v_k(t) is independent of u_k(t), and for all k ≠ l or i ≠ j, v_k(i) is independent of v_l(j).
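As a concrete illustration, the measurement model can be simulated as follows. This is a minimal sketch assuming real-valued data (so (·)^H reduces to a transpose); the sizes and per-node noise variances are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 5, 20, 1000                    # parameter length, number of nodes, time steps (illustrative)
w_o = np.ones(M) / M                     # unknown parameter w^o
sigma2_v = 0.01 + 0.09 * rng.random(N)   # hypothetical per-node noise variances sigma^2_{v,k}

# d_k(t) = u_k(t)^H w^o + v_k(t); real-valued data is assumed here
u = rng.standard_normal((T, N, M))                   # white Gaussian regressors
v = rng.standard_normal((T, N)) * np.sqrt(sigma2_v)  # spatially independent measurement noise
d = np.einsum('tnm,m->tn', u, w_o) + v               # (T, N) measurements
```
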

ATC Algorithm
The main goal of a distributed estimation algorithm is to generate an estimate w_k(t) of w^o at each node k and time t in a distributed manner. In the diffusion strategy, generally, each node k executes a local adaptation step to obtain an intermediate estimate ψ_k(t); then all the nodes share their intermediate estimates with their neighbors; and finally, each node k linearly combines all the intermediate estimates received from its neighbors under some combination weights. The detailed steps of the ATC diffusion LMS algorithm are

ψ_k(t) = w_k(t − 1) + μ_k u_k(t) e_k(t),   (adaptation)
w_k(t) = Σ_{l∈N_k} a_{l,k} ψ_l(t),   (combination)

where e_k(t) = d_k(t) − u_k^H(t) w_k(t − 1) is the prior estimate error and μ_k > 0 is the step size for k ∈ {1, 2, . . . , N}. The combiner a_{l,k} is the weight of the intermediate estimate from node l during the combination step of node k. Moreover, the non-negative combination matrix A = [a_{l,k}] satisfies [1,12]

a_{l,k} ≥ 0,   1_N^T a_k = 1,   a_{l,k} = 0 if l ∉ N_k,

with a_k denoting the kth column of A. Notice that A is left-stochastic since the entries of each column are non-negative and sum to 1 [19,20].
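The two steps above can be sketched in code. The following is a hedged reference implementation for the real-valued case; the fully connected network with uniform combiners mentioned in the usage note is only a placeholder, and any left-stochastic A respecting the topology can be substituted.

```python
import numpy as np

def atc_lms(d, u, A, mu):
    """Adapt-then-combine diffusion LMS (real-valued sketch).
    d: (T, N) measurements, u: (T, N, M) regressors,
    A: (N, N) left-stochastic combination matrix with A[l, k] = a_{l,k},
    mu: (N,) step sizes. Returns the final local estimates, shape (N, M)."""
    T, N, M = u.shape
    w = np.zeros((N, M))
    for t in range(T):
        # adaptation: psi_k(t) = w_k(t-1) + mu_k u_k(t) e_k(t)
        e = d[t] - np.einsum('nm,nm->n', u[t], w)   # prior errors e_k(t)
        psi = w + mu[:, None] * u[t] * e[:, None]
        # combination: w_k(t) = sum_{l in N_k} a_{l,k} psi_l(t)
        w = A.T @ psi
    return w
```

For a fully connected network with static uniform combiners, A = 1_N 1_N^T / N reproduces the uniform rule.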

Fixed-Point Iteration Solution
First, we introduce a transform matrix P_k, defined as P_k ≜ [lth column of I_N]_{l∈N_k}. Then, a_k in (6) can be expressed as a_k = P_k b_k, with b_k ∈ R^{n_k}. Therefore, the minimization problem (6) can be transformed into the problem (7). By introducing the Lagrange multipliers α and ω, we obtain the cost function (8). Taking the gradient of (8) with respect to b_k and applying the Karush-Kuhn-Tucker (KKT) conditions [21], the optimal tuple (b_k, ω, α) must satisfy the resulting stationarity equations. We introduce a positive definite diagonal matrix D_k(t − 1), whose ith diagonal element is an arbitrary positive function of b_k,i(t − 1), so that the resulting direction is still a descent direction of the cost function (8) [21]. Therefore, we obtain the adaptive solution of problem (7) through the fixed-point iteration (12), where b_k,i(t) is the estimate of b_k,i at time instant t and η_k,i is the learning factor of b_k,i(t).
To simplify the problem, we choose the same learning factor η_k,i = η_k for all i at any time t. Hence, we can rewrite (12) in vector form as (13). Applying the constraint 1^T_{n_k} b_k(t) = 1 and pre-multiplying both sides of (13) by 1^T_{n_k} yields the Lagrange multiplier α. Substituting α into (13), and using the constraint 1^T_{n_k} b_k(t − 1) = 1 again, we obtain the update of the combiners b_k(t) in (15), with I_{n_k} denoting the n_k × n_k identity matrix. The adaptive combiners (15) can be updated by the two incremental steps (16) and (17). Then, we obtain the combiner a_k(t), which can be used in (3) to update the local weight estimate adaptively. We also define the adaptive combination matrix A(t) ∈ R^{N×N}, where a_k(t) is its kth column. Please note that −g_k(t) in (16) can be seen as the product of ζ_k(t − 1) and the gradient ∇_{b_k} J(b_k(t − 1)), which means ζ_k(t − 1) acts as an auxiliary matrix that adjusts the update direction ∇_{b_k} J(b_k(t − 1)). On the one hand, G_k(t − 1) in (18) is a projection matrix [22], which makes the update direction −g_k(t) orthogonal to the vector 1_{n_k}, i.e., 1^T_{n_k} g_k(t) = 0. Then we have 1^T_{n_k} b_k(t) = 1^T_{n_k} b_k(t − 1) = · · · = 1^T_{n_k} b_k(0) = 1 if the initial combiners satisfy 1^T_{n_k} b_k(0) = 1. On the other hand, since ζ_k(t − 1) = G_k(t − 1)Γ_k(t − 1) is, as a whole, a positive semi-definite symmetric matrix, the update direction −g_k(t) is still a descent direction of the cost function (7) [21]. Instead of using the positive semi-definite symmetric matrix ζ_k(t − 1) = G_k(t − 1)Γ_k(t − 1), the classic AC [12] replaces ζ_k(t − 1) with the orthogonal projection matrix characterized by 1_{n_k}, i.e., I_{n_k} − 1_{n_k} 1^T_{n_k}/n_k, which actually restricts the update direction of the adaptive combiners to lie in the hyperplane spanned by 1_{n_k} and ∇_{b_k} J(b_k(t − 1)). In fact, we find that (16) and (17) reduce to the classic AC [12] when D_k(t) = diag{b_k(t)}^{−1}.
We now consider optimizing the learning factor η_k(t). Substituting (16) and (17) into (7) yields a cost function h(η_k(t)) with respect to the learning factor η_k(t). Obviously, h(η_k(t)) is a quadratic (convex) function of η_k(t); thus, its minimum is readily attained by setting the derivative to zero, which gives the optimal learning factor η^o_k(t). Note that η^o_k(t) is non-negative since ζ_k(t − 1) is a positive semi-definite matrix.
To guarantee that the combiners are non-negative, we set the upper bound of η_k(t) as in (22) [12], where ε > 0 is a small constant and ‖·‖_∞ represents the maximum norm. Thus, we choose the learning factor in (17) at time instant t accordingly. Please note that Q_Ψk is usually unavailable in practical applications. As done in [12], Q_Ψk can be replaced by its instantaneous approximation based on ∆Ψ_k(t) = Ψ_k(t) − Ψ_k(t − 1). To make the estimate smoother, we introduce a forgetting factor λ; the iterative expression of Q̂_Ψk can then be written as (25). In practical applications, we use Q̂_Ψk(t) in (25) to replace the aforementioned statistical quantity Q_Ψk for each node k at time instant t.
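To make the update concrete, the following sketch implements one combiner update at a single node. It is a simplified illustration, assuming γ = 0 (so D_k = I), a gradient of the form Q̂_Ψk b_k, and a per-element non-negativity cap on the learning factor; the exact expressions, including the ∞-norm bound of (22), are those derived above.

```python
import numpy as np

def update_combiners(b, dpsi, Q_hat, lam=0.95, eps=0.5e-4):
    """One sketched update of the combiners b_k at a node with n_k neighbors.
    b: (n_k,) current combiners summing to 1; dpsi: (n_k, M) increments
    Delta psi_l(t) of the neighbors' intermediate estimates; Q_hat: running
    estimate of Q_Psi_k. Returns the updated (b, Q_hat)."""
    n = b.size
    Q_hat = lam * Q_hat + dpsi @ dpsi.T          # smoothed estimate, cf. (25)
    G = np.eye(n) - np.ones((n, n)) / n          # projection keeping 1^T b = 1
    g = G @ (Q_hat @ b)                          # projected update direction
    denom = g @ Q_hat @ g
    eta = (g @ Q_hat @ b) / denom if denom > 1e-12 else 0.0  # minimizer of the quadratic cost
    if np.any(g > 0):                            # cap eta so b - eta * g stays non-negative
        eta = min(eta, float((b[g > 0] / g[g > 0]).min()) - eps)
    eta = max(eta, 0.0)
    return b - eta * g, Q_hat
```

By construction 1^T g = 0, so the combiners keep summing to one, and the cap on the learning factor keeps them non-negative.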
Finally, the implementation of the ATC algorithm with the proposed AC strategy is summarized in Algorithm 1.

Mean Convergence
We now analyze the mean convergence of the diffusion LMS algorithm with the proposed adaptive combiners.

Assumption 2. The combination matrix A(t) is independent of all regressors u_k(t) and all local weight estimates w_k(t − 1) at time t − 1 ([12], Assumption 4.3).

Theorem 1.
Under Assumptions 1 and 2, a sufficient condition to guarantee the convergence of the diffusion LMS algorithm is ρ(I_M − μ_k R_u,k) < 1 for each node k, i.e., 0 < μ_k < 2/λ_max(R_u,k), where ρ(R) denotes the spectral radius of the matrix R and R_u,k = E{u_k(t) u_k^H(t)}.

Simulation Results
We evaluate herein the MSD of the proposed algorithm. Without loss of generality, the unknown weight vector w^o is set to 1_M/M with M = 5. The initial weight vector estimates are w_k(0) = 0_M for each node k. The constant ε used in (22) is set to 0.5 × 10^{−4}, and the forgetting factor is λ = 0 or λ = 0.95. For the proposed AC, we consider D_k(t) = diag{b_k(t)}^γ, where γ can be 0, −1, or −2.
We use the empirical MSDs as the performance metric. Both the transient and steady-state empirical network MSDs hereinafter are obtained by averaging L = 500 independent trials over all nodes of the network, where the weight error vector is w̃_k(t) = w^o − w_k(t).

Example 1. The network topology is depicted in Figure 1a. The measurement noise is real Gaussian white noise whose variance σ²_v,k at each node k is presented in Figure 1b. According to the mean convergence condition, we set the step size to μ_k = 0.01 for each node k, and we herein consider γ = 0. Each regressor u_k(t) is a real Gaussian white noise sequence with covariance matrix R_u,k = σ²_u,k I_M and σ²_u,k = 1 for all k. The noise power of nodes 5, 12, and 13 suddenly increases to 5 at the 1500th iteration. As demonstrated in Figure 2, the proposed AC strategy outperforms the classic AC [12] and the uniform combination [1] in terms of steady-state performance with similar convergence rates, and outperforms the LS-based AC [17] and the optimal AC [13] in terms of convergence rate. We also observe that the forgetting factor can further enhance the performance of the proposed AC. After the sudden change of the noise power of some nodes, the proposed AC exhibits a rather fast reconvergence rate and good steady-state performance.

Example 2. The initial measurement noise variance σ²_v,k at each node k is presented in Figure 1b. In this simulation, we compare the proposed AC scheme with the uniform combination and the classic AC in terms of the steady-state MSD at different noise variances τσ²_v,k at each node k. The other simulation conditions are the same as in Example 1. The steady-state MSDs with respect to the noise variances are illustrated in Table 1.

The fairness of this comparison is supported by the closely matched transient MSD convergence curves plotted in Figure 3. As shown in Table 1, as the noise variances increase, the ATC algorithm incorporating the proposed AC exhibits superior performance compared to the ATC algorithm with the uniform combination or the classic AC strategy. Additionally, the performance gain brought by the forgetting factor becomes limited once the noise variances grow beyond a certain level.

Example 3.
We now consider the target tracking model ([17], Equation (52)), namely

w^o(t) = w^o(t − 1) + ξ(t),

where ξ(t) is a sequence of independent identically distributed perturbations with zero mean and covariance matrix Ξ, independent of the input regressors and the measurement noise at every iteration. We herein consider ξ(t) to be Gaussian white with Ξ = σ²_ξ I_M and σ²_ξ = 1 × 10^{−7}, i.e., the unknown weight vector varies slowly. The other simulation conditions are the same as in Example 1. As illustrated in Figure 4, similar to Example 1, the proposed AC strategy outperforms the uniform combination rule, the classic AC, and the LS-based AC in terms of steady-state performance, and outperforms the optimal AC in terms of convergence rate, under tracking scenarios. We also observe that the forgetting factor can further enhance the performance of the proposed AC in the tracking scenario.
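A short sketch of this random-walk target model, assuming the standard recursion w^o(t) = w^o(t − 1) + ξ(t) with the stated perturbation variance; the initial vector matches the earlier setup.

```python
import numpy as np

rng = np.random.default_rng(3)
M, T = 5, 1000
sigma_xi = np.sqrt(1e-7)        # sigma^2_xi = 1e-7, so the target drifts slowly
w_o = np.ones(M) / M            # initial unknown weight vector
track = np.empty((T, M))
for t in range(T):
    w_o = w_o + sigma_xi * rng.standard_normal(M)   # w^o(t) = w^o(t-1) + xi(t)
    track[t] = w_o
```
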

Example 4.
We now consider the impact of the factors λ and γ on the performance of the proposed AC strategy. Without loss of generality, we consider the step sizes μ_k = 0.02 for each node k. The simulation result is shown in Figure 5.
It can be seen from Figure 5 that, for a given γ, a larger λ brings a higher performance gain for the proposed AC. We also observe that choosing γ = −2 yields better steady-state performance than the other two choices, especially for a small forgetting factor λ. In particular, compared to Example 1, for the two scenarios γ = 0, λ = 0 and γ = 0, λ = 0.95, we find that the larger step sizes in this example lead to an accelerated convergence rate at the cost of degraded steady-state performance.

Example 5. We consider the sparse network with N = 15 nodes, whose topology is the same as ([12], Figure 8) and is depicted in Figure 6a. The measurement noise is real Gaussian white noise whose variance σ²_v,k at each node is presented in Figure 6c. We consider a heterogeneous network with step sizes μ_k = 0.004 for the orange shaded nodes and μ_k = 0.02 for the rest. Each regressor u_k(t) is a real Gaussian white noise sequence with covariance matrix R_u,k = σ²_u,k I_M, with σ²_u,k presented in Figure 6b for all k. As illustrated in Figure 7, for the sparse network, the proposed AC scheme outperforms the classic AC and the LS-based AC in terms of steady-state performance while keeping a comparable convergence rate. It also outperforms the optimal AC scheme in terms of convergence rate. Moreover, the introduction of the forgetting factor can further enhance the performance of the proposed AC scheme.

Conclusions
In this paper, we present a modified adaptive combination strategy for the distributed estimation problem over diffusion networks to improve robustness against the spatial variation of signal and noise statistics over the network. Considering the Karush-Kuhn-Tucker conditions and fixed-point iteration methodology, we derive an effective adaptive combination strategy for the ATC diffusion LMS algorithm. We also invoke the forgetting factor and optimize the learning factor to further enhance the performance of the proposed adaptive combination strategy. Illustrative simulations validate the improved performance of the diffusion LMS algorithm with the proposed adaptive combination strategy.
where 𝓕(t) ≜ ∏_{i=t}^{1} F(i) = F(t) · · · F(1). To facilitate our analysis, we herein introduce a submultiplicative matrix norm (a submultiplicative matrix norm satisfies ‖AB‖ ≤ ‖A‖‖B‖ [23]). For any square matrix X and any ϵ > 0, there exists a submultiplicative matrix norm ‖·‖ such that ρ(X) ≤ ‖X‖ ≤ ρ(X) + ϵ, where ρ(X) denotes the spectral radius of X [23,24]. Accordingly, we have ‖𝓕(t)‖ ≤ ∏_{i=t}^{1} ‖F(i)‖. Notice that the diffusion LMS algorithm converges if and only if lim_{t→∞} 𝓕(t) = 0. Hence, a sufficient condition for the diffusion LMS algorithm to converge is ‖F(t)‖ ≤ ρ(F(t)) + ϵ < 1. Therefore, the diffusion LMS algorithm converges if ρ(F(t)) < 1 for all t, with a sufficiently small ϵ chosen. Thus, the diffusion LMS algorithm converges if F(t) in (A8) is stable at each t.
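The stability argument above can be checked numerically. The following sketch uses a hypothetical stable matrix (not the F(t) of the paper) to illustrate that ρ(F) < 1 drives the product of transition matrices to zero.

```python
import numpy as np

def spectral_radius(X):
    """rho(X) = max |lambda(X)| over the eigenvalues of the square matrix X."""
    return float(np.abs(np.linalg.eigvals(X)).max())

# a hypothetical stable transition matrix: rho(F) < 1
F = np.array([[0.9, 0.05],
              [0.0, 0.8]])
assert spectral_radius(F) < 1
P = np.linalg.matrix_power(F, 200)   # here F(i) = F for all i, so the product is F^200
assert np.abs(P).max() < 1e-6        # the product vanishes, i.e., convergence
```
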