A Distributed Optimization Accelerated Algorithm with Uncoordinated Time-Varying Step-Sizes in an Undirected Network

Abstract: In recent years, significant progress has been made in the field of distributed optimization algorithms. This study focuses on the distributed convex optimization problem over an undirected network, where the target is to minimize the average of all local objective functions while each agent communicates necessary information only with its neighbors. Building on a state-of-the-art algorithm, we propose a novel distributed optimization algorithm for the case where the objective function of each agent is smooth and strongly convex. Faster convergence is attained by utilizing the Nesterov and Heavy-ball accelerated methods simultaneously, making the algorithm applicable to many large-scale distributed tasks. Meanwhile, the step-sizes and accelerated momentum coefficients are designed to be uncoordinated, time-varying, and nonidentical, which lets the algorithm adapt to a wide range of application scenarios. Under some necessary assumptions and conditions, a linear convergence rate is established through rigorous theoretical analysis. Finally, numerical experiments on a real dataset demonstrate the superiority and efficacy of the novel algorithm compared to similar algorithms.


Introduction
In recent years, with the rapid development of artificial intelligence, big data, etc., distributed optimization problems in multi-agent systems have attracted much attention. Distributed optimization methods have gained significant and growing interest due to their widespread applications in science and engineering, such as the transmission of information in wireless sensor networks [1][2][3], the collaboration of vehicles in formation control [4,5], speeding up the optimization process in distributed machine learning [6,7], distributed resource allocation in smart-grid networks [8][9][10], distributed control in nonlinear dynamical systems [11,12], etc. Specifically, a distributed optimization framework avoids establishing long-distance communication between agents while providing better load balancing for the network. In contrast to traditional centralized optimization, agents in a multi-agent system communicate only with their neighbors, and the local objective function of each agent is known only to itself.
Literature Review: Since the DGD (distributed gradient descent) algorithm was proposed by Nedic [13] for solving distributed convex problems in multi-agent systems, great progress has been made in the distributed optimization field. In particular, distributed first-order methods have attracted many researchers' attention. Based on consensus theory [14] and gradient-descent technology, diminishing step-sizes were introduced into the algorithm. Building on such methods, the Heavy-ball and Nesterov momentum terms can bring about a faster convergence rate in large-scale computing and communication tasks.
Statement of Contributions: Throughout this article, we mainly focus on distributed convex optimization over an undirected network. We propose a novel distributed optimization algorithm with uncoordinated, time-varying, and nonidentical step-sizes and accelerated momentum terms, which has a faster linear convergence rate and applies to more scenarios. To summarize, our three contributions are as follows:
• Based on the distributed optimization methods in [19,26,35], we designed and discuss a faster distributed optimization accelerated algorithm, named UGNH (UG with Nesterov and Heavy-ball accelerated methods), which solves distributed convex problems over an undirected network. In particular, combining the Nesterov and Heavy-ball momentum terms improves the convergence rate, as can be seen in the numerical experiments.
• Compared to related algorithms, in our algorithm not only the step-sizes but also the coefficients of the momentum terms (for convenience, we call them coefficients for short later) are uncoordinated, time-varying, and nonidentical, and are locally chosen by each agent. The convergence analysis shows that the step-sizes and coefficients are more flexible than in most existing methods. Moreover, when the local objective functions are smooth and strongly convex, we obtain upper bounds on the step-sizes and coefficients under which the sequences generated by UGNH converge linearly to the exact optimal solution.
• In contrast to related algorithms, the upper bounds on the largest step-size and the coefficients of UGNH are more relaxed, depending only on the parameters of the objective functions and the topology of the network. Meanwhile, some (but not all) step-sizes and coefficients among the agents may be zero.
Organization: The rest of this article is arranged as follows. In Section 2, we describe the distributed problem and provide some necessary assumptions. In Section 3, we discuss the development of relevant distributed optimization algorithms and two classical accelerated methods and then propose a new distributed accelerated algorithm. Convergence analysis is detailed in Section 4. In Section 5, numerical experiments are provided to demonstrate the superiority and efficiency of our algorithm. Finally, Section 6 concludes this article and provides some research directions for the future.
Basic Notation: Throughout the rest of this article, unless otherwise specified, all vectors are column vectors, and n is the number of agents in the network. The real-number set, the natural-number set, and the space of m-dimensional real column vectors are denoted by R, N, and R^m, respectively. The subscripts i, j ∈ {1, 2, · · · , n} index the agents, while the superscript t indexes the iteration step, e.g., x_i^t is the ith agent's decision variable at the tth iteration; 0_n ∈ R^n, 1_n ∈ R^n, and I_n ∈ R^{n×n} denote the n-dimensional zero vector, one vector, and identity matrix, respectively. For a matrix P, p_ij denotes the element in the ith row and jth column, while its spectral radius and spectral norm are denoted by ρ(P) and ‖P‖, respectively. Similarly, ‖x‖ denotes the 2-norm of a vector x. The transposes of a vector x and a matrix P are denoted by x^T and P^T, respectively. For a vector r = [r_1, r_2, · · · , r_n]^T, diag(r) is the diagonal matrix whose diagonal elements equal the entries of r. The notation ⊗ denotes the Kronecker product. Let ∇f(x) : R^m → R^m denote the gradient of f(x) at x.

Preliminaries
This section describes the formulation of the distributed optimization problem and some necessary basic assumptions related to network and function.

Problem Formulation
Consider an undirected network of n agents that cooperatively solve an optimization problem over a common variable x ∈ R^m of the following form:

min_{x ∈ R^m} f(x) = (1/n) ∑_{i=1}^n f_i(x). (1)

Here, each local objective function f_i : R^m → R is convex and is possessed by agent i, which exchanges local information only with its neighbors. Our main target is to design a distributed optimization algorithm whose decision variables converge linearly to the optimal solution minimizing the average of all local objective functions. The optimal average objective value of problem (1) is f(x^*), where x^* ∈ R^m is the optimal decision variable; that is, the global optimal solution of (1) is x^* = argmin_{x ∈ R^m} f(x). As a local copy x_i of the global decision variable is kept at each agent, problem (1) can be solved in a distributed way by iterating the decision variables. In this study, the network is described as G = {V, E}, where V = {1, 2, · · · , n} is the vertex set representing the agents, and E = {(i, j) | i, j ∈ V} is the edge set. In an undirected network, an edge (i, j) ∈ E implies that (j, i) ∈ E too, and agents i and j can exchange information with each other. Let N_i = {j | (i, j) ∈ E} ∪ {i} denote the set of all neighbors of agent i. Then, formulation (1) can be rewritten as follows:

min_{x_1, · · · , x_n ∈ R^m} (1/n) ∑_{i=1}^n f_i(x_i), s.t. x_i = x_j, ∀ (i, j) ∈ E. (2)

Recently, it has been proved in [33] that the equality (1/α) L^{1/2} x = 0 is equivalent to the consensus condition x_i = x_j, where α is the step-size and L = I − P is a Laplacian matrix. The primal-dual method can then be introduced to solve (2) by utilizing the augmented Lagrangian function, which is also a cornerstone of our algorithm.
Next, some necessary assumptions about the underlying graph and local objective functions are formalized, which are a common standard in related distributed optimization studies.

Assumptions
Assumption 1 ([35]). The network G = {V, E} is connected, undirected, and simple. In particular, there are no self-loops of any agent and no multiple links between any two agents.

Assumption 2 ([19]).
A non-negative symmetric doubly stochastic weight matrix P = [p_ij] ∈ R^{n×n} is defined to represent network G. The weights of the matrix P satisfy the following three conditions: (i) p_ij = p_ji > 0 if (i, j) ∈ E, and p_ij = 0 if j ∉ N_i; (ii) p_ii = 1 − ∑_{j≠i} p_ij > 0; (iii) P 1_n = 1_n and 1_n^T P = 1_n^T.

Assumption 3. Each local objective function f_i : R^m → R, i ∈ V, is smooth with Lipschitz constant ψ_i and strongly convex with parameter μ_i. Mathematically, there exist ψ_i > 0 and μ_i ≥ 0 (with ∑_i μ_i > 0) such that, for any x, y ∈ R^m:

‖∇f_i(x) − ∇f_i(y)‖ ≤ ψ_i ‖x − y‖,
f_i(y) ≥ f_i(x) + ∇f_i(x)^T (y − x) + (μ_i/2) ‖y − x‖^2.

Assumption 1 ensures that each agent can directly or indirectly affect every other agent in the network. Assumption 3 is a standard assumption in the convergence analysis of distributed optimization methods. In particular, under strong convexity of each function, there exists a unique global optimal solution to problem (1). Moreover, for the global objective function f, we define ψ̄ = (1/n) ∑_{i=1}^n ψ_i as the global Lipschitz constant and μ̄ = (1/n) ∑_{i=1}^n μ_i as the global strong-convexity parameter.
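As a concrete illustration of Assumption 2, the sketch below constructs a symmetric doubly stochastic weight matrix with the Metropolis rule, one common construction satisfying these conditions (the article itself does not mandate a specific rule), and checks the required properties on a small hypothetical path graph:

```python
import numpy as np

def metropolis_weights(adj):
    """Build a symmetric doubly stochastic weight matrix P from the
    adjacency matrix of an undirected simple graph (no self-loops)."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                # off-diagonal Metropolis weight for edge (i, j)
                P[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        P[i, i] = 1.0 - P[i].sum()   # remaining mass on the self-weight
    return P

# 4-agent path graph: 1-2-3-4
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
P = metropolis_weights(adj)
assert np.allclose(P, P.T)              # symmetric
assert np.allclose(P.sum(axis=1), 1.0)  # doubly stochastic
assert (P >= 0).all()                   # non-negative
```

Because P is symmetric with rows summing to one, the column sums are automatically one as well, so P 1_n = 1_n and 1_n^T P = 1_n^T both hold.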

Algorithm Development
In this section, Section 3.1 describes the development of some related algorithms. Section 3.2 describes the Nesterov and Heavy-ball accelerated methods for the distributed optimization algorithm. Section 3.3 describes the proposed algorithm UGNH and the relationship between UGNH and the previous algorithms.

Related Algorithms
In this subsection, we focus on some classical algorithms related to the proposed algorithm, DGD, EXTRA, HSADO, and UG, and briefly explain each of them.
In [13], Nedic and Ozdaglar proposed a standard distributed gradient descent method, DGD. The method updated the decision variable at each agent through its neighbors and the local negative gradient direction, as follows:

x_i^{t+1} = ∑_{j ∈ N_i} p_ij x_j^t − α^t ∇f_i(x_i^t), (3)

where α^t was the step-size, which satisfied α^t > 0, ∑_{t=0}^∞ α^t = ∞, and ∑_{t=0}^∞ (α^t)^2 < ∞; the matrix P satisfied Assumption 2. The variable x_i^t stored at agent i is the local estimate of x at the tth iteration. It was proved that the sequences generated by DGD cannot converge to the exact optimal solution x^* when employing a fixed step-size, i.e., α^t = α. By taking an appropriately diminishing step-size, DGD can converge exactly, but the convergence rate was sublinear.
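The fixed-step bias can be made concrete with a small sketch: running the DGD-style update (3) with a constant step-size on illustrative scalar quadratics f_i(x) = (x − b_i)^2/2 over a hypothetical 4-agent ring (all weights and data below are made up for the demonstration). The agents' average matches the minimizer mean(b) = 3, but a persistent consensus gap remains, which is exactly why diminishing step-sizes are needed for exact convergence:

```python
import numpy as np

# Symmetric doubly stochastic weights for a 4-agent ring (illustrative)
P = np.array([[1/3, 1/3, 0.0, 1/3],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [1/3, 0.0, 1/3, 1/3]])
b = np.array([1.0, 2.0, 3.0, 6.0])       # minimizer of the average: mean(b) = 3
alpha = 0.1                              # FIXED step-size
x = np.zeros(4)
for _ in range(3000):
    # DGD update with grad f_i(x_i) = x_i - b_i
    x = P @ x - alpha * (x - b)

# The network average is exact, but the agents do not reach consensus:
bias_gap = x.max() - x.min()             # stays bounded away from zero
```

With a diminishing step-size sequence (e.g., α^t = 1/(t+1)) the gap would instead vanish, at a sublinear rate, as stated in the text.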
To acquire linear convergence, Shi et al. [19] proposed a new method, EXTRA, by modifying the update rule of DGD (3). Two steps were performed as follows:

x_i^1 = ∑_{j ∈ N_i} p_ij x_j^0 − α ∇f_i(x_i^0), (4)
x_i^{t+2} = x_i^{t+1} + ∑_{j ∈ N_i} p_ij x_j^{t+1} − ∑_{j ∈ N_i} p̃_ij x_j^t − α [∇f_i(x_i^{t+1}) − ∇f_i(x_i^t)], (5)

where the step-size α > 0 was a constant, the matrix P satisfied Assumption 2, and P̃ = (I_n + P)/2 was appropriate. Compared to DGD (3), an initial condition (4) and one more past iterate in (5) were added. Notably, although the step-size was a constant, EXTRA can linearly converge to the exact optimal solution as long as the step-size is chosen appropriately.
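On the same illustrative quadratics as above, a sketch of the EXTRA recursion (4)-(5) shows the contrast with fixed-step DGD: with a constant step-size, all agents converge to the exact consensus minimizer mean(b) = 3 (weights and data are again hypothetical):

```python
import numpy as np

P = np.array([[1/3, 1/3, 0.0, 1/3],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [1/3, 0.0, 1/3, 1/3]])
P_tilde = (np.eye(4) + P) / 2            # P~ = (I + P)/2
b = np.array([1.0, 2.0, 3.0, 6.0])
grad = lambda x: x - b                   # stacked local gradients of (x - b_i)^2/2
alpha = 0.2                              # FIXED step-size

x_prev = np.zeros(4)
x_curr = P @ x_prev - alpha * grad(x_prev)            # initial step (4)
for _ in range(1000):                                 # recursion (5)
    x_next = (np.eye(4) + P) @ x_curr - P_tilde @ x_prev \
             - alpha * (grad(x_curr) - grad(x_prev))
    x_prev, x_curr = x_curr, x_next
# x_curr is now (numerically) the exact consensus optimum 3.0 at every agent
```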
Based on DGD (3), Qu and Li [26] proposed a novel distributed algorithm, HSADO, using gradient-tracking technology. An auxiliary variable z_i^t was introduced to estimate the network-wide average gradient (1/n) ∑_{i=1}^n ∇f_i(x_i^t) at the tth iteration for agent i. As a result, the gradient term −α ∇f_i(x_i^t) in (3) was replaced by −α z_i^t. The specific updating rules were as follows:

x_i^{t+1} = ∑_{j ∈ N_i} p_ij x_j^t − α z_i^t, (6)
z_i^{t+1} = ∑_{j ∈ N_i} p_ij z_j^t + ∇f_i(x_i^{t+1}) − ∇f_i(x_i^t), (7)

where the step-size α > 0 was a constant, and the matrix P satisfied Assumption 2. Under the previous assumptions, initialized with x_i^0 ∈ R^m and z_i^0 = ∇f_i(x_i^0), a global linear convergence rate could be obtained with an appropriately chosen fixed step-size.
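A sketch of the gradient-tracking updates (6)-(7) on the same illustrative quadratics: z_i tracks the network-wide average gradient (its mean equals the mean of the local gradients at every iteration), and the iterates converge linearly to the exact minimizer with a fixed step-size:

```python
import numpy as np

P = np.array([[1/3, 1/3, 0.0, 1/3],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [1/3, 0.0, 1/3, 1/3]])
b = np.array([1.0, 2.0, 3.0, 6.0])
grad = lambda x: x - b
alpha = 0.1                               # fixed step-size

x = np.zeros(4)
z = grad(x)                               # z_i^0 = grad f_i(x_i^0)
for _ in range(1000):
    x_new = P @ x - alpha * z             # (6): consensus + tracked gradient
    z = P @ z + grad(x_new) - grad(x)     # (7): gradient-tracking update
    x = x_new
# x has converged to the exact consensus optimum 3.0;
# the tracking invariant mean(z) == mean(grad(x)) holds throughout.
```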
Recently, a novel distributed optimization algorithm, UG, was proposed in [35], which used the primal-dual method to solve the equivalent problem (2). Through tuning its parameters, the algorithm subsumes the well-known algorithms EXTRA and HSADO. Its updating rules (8) and (9) iterate a primal variable x_i^t and a dual variable z_i^t with a constant step-size α > 0, initialized to x_i^0 ∈ R^m and z_i^0 = 0_m, respectively. For more-compact notation, we define P = [p_ij], L = [l_ij], and K = [k_ij]. The matrix L = I_n − P, and the matrix K ∈ R^{n×n} is symmetric with the property that there exists some constant λ such that K 1_n = λ 1_n.
When the matrix K is chosen properly, the algorithm UG is equivalent to: (1) EXTRA, when K = (1/α) P; and (2) HSADO, when K = 0_n. For other choices, K = ((μ̄ + ψ̄)/2) I_n and K = ((μ̄ + ψ̄)/(1 + λ_n)) P (λ_n being the smallest eigenvalue of the matrix P) are appropriate for the forms K = k I_n and K = k P, respectively. The matrix K introduces no extra computational variables and no additional communication with other agents, so (8) and (9) are easy to implement. As UG unifies and generalizes the previous methods, we mainly focus on it.

Distributed Accelerated Methods
In this section, centralized Nesterov and Heavy-ball accelerated methods will be introduced. With them, many distributed optimization algorithms can converge faster.
For the gradient-descent algorithm, i.e., x^{t+1} = x^t − α ∇f(x^t), the convergence rate is governed by κ = ψ̄/μ̄, the condition number of the objective function.
If ψ̄ is much larger than μ̄ so that κ is large, then gradient descent becomes quite slow. To accelerate gradient descent, Polyak [40] proposed a method called Heavy-ball for updating the decision variable. The specific update was as follows:

x^{t+1} = x^t − α ∇f(x^t) + γ (x^t − x^{t−1}),

where γ was the momentum coefficient, and the term γ(x^t − x^{t−1}) was used to accelerate the convergence of the decision variable. It had been proved that, under an appropriate step-size α and coefficient γ, the momentum-accelerated method could attain a convergence rate on the order of ((√κ − 1)/(√κ + 1))^t for strongly convex quadratic objectives, which was obviously faster.
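The speed-up is easy to observe numerically. The sketch below minimizes an ill-conditioned quadratic f(x) = x^T A x / 2 with κ = 100 and counts iterations until ‖x‖ < 1e-6, once with plain gradient descent and once with Heavy-ball under Polyak's classical tuning (the quadratic and tolerances are illustrative choices, not from the article):

```python
import numpy as np

A = np.diag([1.0, 100.0])                 # mu = 1, psi = 100, kappa = 100
x0 = np.array([1.0, 1.0])

def run(alpha, gamma, iters=20000):
    """Iterate x+ = x - alpha*A@x + gamma*(x - x_prev); return the first t
    with ||x|| < 1e-6.  Setting gamma = 0 recovers plain gradient descent."""
    x_prev, x = x0.copy(), x0.copy()
    for t in range(iters):
        if np.linalg.norm(x) < 1e-6:
            return t
        x, x_prev = x - alpha * (A @ x) + gamma * (x - x_prev), x
    return iters

gd_iters = run(alpha=2.0 / 101.0, gamma=0.0)          # alpha = 2/(psi + mu)
# Polyak tuning: alpha = 4/(sqrt(psi)+sqrt(mu))^2, gamma = ((sqrt(k)-1)/(sqrt(k)+1))^2
hb_iters = run(alpha=4.0 / 121.0, gamma=(9.0 / 11.0) ** 2)
# hb_iters is roughly an order of magnitude smaller than gd_iters here
```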
Inspired by conjugate gradient methods [44], historical gradient information can improve the convergence rate of first-order optimization algorithms. Nesterov proposed a method called CNGD [40] (Centralized Nesterov Gradient Descent) as follows:

x^{t+1} = y^t − α ∇f(y^t),
y^{t+1} = x^{t+1} + γ (x^{t+1} − x^t),

where α = 1/ψ̄ and γ = (√κ − 1)/(√κ + 1). It is notable that the two accelerated methods have been adapted in many distributed algorithms, such as [42,43], etc. In this study, we devoted ourselves to studying the two accelerated methods on UG.
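A matching sketch of the Nesterov scheme on the same illustrative ill-conditioned quadratic (κ = 100): the gradient is evaluated at the extrapolated point y, and the iteration count to reach ‖x‖ < 1e-6 is roughly √κ times smaller than for plain gradient descent:

```python
import numpy as np

A = np.diag([1.0, 100.0])                  # mu = 1, psi = 100, kappa = 100
psi, kappa = 100.0, 100.0
gamma = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

x = np.array([1.0, 1.0])
y = x.copy()
nag_iters = None
for t in range(5000):
    if np.linalg.norm(x) < 1e-6:
        nag_iters = t
        break
    x_new = y - (1.0 / psi) * (A @ y)      # gradient step at the look-ahead point
    y = x_new + gamma * (x_new - x)        # Nesterov extrapolation
    x = x_new
# nag_iters is on the order of a few hundred, versus ~700 for plain
# gradient descent on this problem
```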

The Proposed Algorithm
The recent studies [35,42] are the most relevant to our work. Based on them, and considering that the Nesterov and Heavy-ball accelerated methods are very helpful for achieving faster convergence, we added both into UG simultaneously. Meanwhile, in order to apply to many more scenarios, the step-sizes and coefficients were designed to be uncoordinated, time-varying, and nonidentical. Combining these ideas, we propose a new distributed optimization algorithm named UGNH, whose updates are given by (13)-(15). Here i, j ∈ V, t ∈ N, and the step-sizes α_i^t > 0 and accelerated momentum coefficients γ_i^t ≥ 0 are uncoordinated, time-varying, and nonidentical, locally chosen at each agent. At the tth iteration, each agent stores three variables: the primal decision variable x_i^t ∈ R^m, the temporary variable y_i^t ∈ R^m, and the dual variable z_i^t ∈ R^m, which start from the initial states x_i^0 ∈ R^m, y_i^0 ∈ R^m, and z_i^0 = 0_m. The update of UGNH at each agent i is formally described in Algorithm 1.

Algorithm 1
The update of the algorithm UGNH at each agent i
1: Initialization: each agent starts with x_i^0 ∈ R^m, y_i^0 ∈ R^m, and z_i^0 = 0_m.
2: for t = 0, 1, 2, · · · do
3:   Update the primal decision variable x_i^{t+1} as in (13).
4:   Update the temporary variable y_i^{t+1} as in (14).
5:   for j = 1, 2, · · · , n do
6:     for q = 1, 2, · · · , n do
7:       Calculate z_temp = ∑_{j=1}^n l_ij (∇f_j(y_j^t) + z_j^t − ∑_{q=1}^n k_jq y_q^t)
8:     end for
9:   end for
10:  Update the dual variable z_i^{t+1} as in (15).
11: end for

Remark 1. The term γ_i^t (x_i^t − x_i^{t−1}) is the Heavy-ball accelerated term in (13), (14) is the Nesterov accelerated term, and (15) is the dual-variable iteration. It is also easy to verify that UGNH is equivalent to UG if α_i^t = α and γ_i^t = 0. Further, it equals EXTRA and HSADO if the matrix K is chosen properly.
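For intuition, the sketch below runs one plausible reading of the Algorithm 1 structure on the illustrative scalar quadratics used earlier, with K = 0 (the choice the text notes reduces UG to the HSADO case) and with the dual sign chosen so that γ_i^t = 0 recovers gradient tracking. The uncoordinated, nonidentical step-sizes and coefficients, the network, and the data are all hypothetical stand-ins, and this is only a structural sketch, not the exact updates (13)-(15):

```python
import numpy as np

P = np.array([[1/3, 1/3, 0.0, 1/3],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 1/3, 1/3, 1/3],
              [1/3, 0.0, 1/3, 1/3]])
L = np.eye(4) - P                            # Laplacian L = I - P
b = np.array([1.0, 2.0, 3.0, 6.0])
grad = lambda v: v - b                       # stacked gradients of (v_i - b_i)^2/2
alpha = np.array([0.04, 0.05, 0.05, 0.045])  # uncoordinated step-sizes (illustrative)
gamma = np.array([0.05, 0.03, 0.05, 0.04])   # uncoordinated coefficients (illustrative)

x_prev = np.zeros(4); x = np.zeros(4); y = np.zeros(4); z = np.zeros(4)
for _ in range(5000):
    g = grad(y) + z                          # with K = 0: grad F(y) + z - K y
    x_new = P @ x + gamma * (x - x_prev) - alpha * g   # Heavy-ball primal step
    y = x_new + gamma * (x_new - x)                    # Nesterov extrapolation
    z = z - L @ g                                      # dual update
    x_prev, x = x, x_new
# every agent reaches the exact consensus optimum mean(b) = 3
```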

Remark 2.
For the sake of compactness and brevity, let the dimension m = 1; the multi-dimensional case can be handled similarly via the Kronecker product.
As a result, we define x^t = [x_1^t, x_2^t, · · · , x_n^t]^T ∈ R^n, y^t = [y_1^t, y_2^t, · · · , y_n^t]^T ∈ R^n, z^t = [z_1^t, z_2^t, · · · , z_n^t]^T ∈ R^n, and ∇F(y^t) = [∇f_1(y_1^t), ∇f_2(y_2^t), · · · , ∇f_n(y_n^t)]^T ∈ R^n; other notations used later are defined as before. Then, UGNH can be compactly reformulated in matrix form as (16)-(18), where α^t = [α_1^t, α_2^t, · · · , α_n^t]^T ∈ R^n and γ^t = [γ_1^t, γ_2^t, · · · , γ_n^t]^T ∈ R^n collect the step-sizes and coefficients, respectively. Furthermore, we define Γ_α^t = diag(α^t) ∈ R^{n×n} and Γ_γ^t = diag(γ^t) ∈ R^{n×n}.

Convergence Analysis
This section analyzes in detail the linear convergence of decision variable sequences generated by UGNH when step-sizes and coefficients are chosen properly. First, we define some notations that may frequently be used later.
Moreover, considering that the step-sizes and coefficients are uncoordinated, time-varying, and nonidentical, there are many possible numerical values that may be difficult to handle. By employing a small trick, we only study the supremum and infimum of the step-sizes and coefficients, defined as α_max = sup_{t ∈ N} max_{i ∈ V} α_i^t, α_min = inf_{t ∈ N} min_{i ∈ V} α_i^t, and γ̄ = sup_{t ∈ N} max_{i ∈ V} γ_i^t. In addition, let ξ_α = α_max − α_min be the difference between α_max and α_min, and let Φ = α_max / α_min be the condition number of the step-sizes. Before giving the main results, we introduce some helpful supporting lemmas for the convergence analysis.
Next, we bound the four norm expressions at the (t + 1)th iteration through their estimates at the tth iteration in terms of linear combinations. Subsequently, based on Assumptions 1-3, we establish a linear-inequalities system for the convergence analysis. In what follows, the consensus violation Ξ_1^{t+1} is bounded first. Lemma 4. ∀ t > 0, the following inequality holds: Proof of Lemma 4. Considering (16) and (17), we obtain (20). Note that (I_n − J_n) x^{t+1} = Ξ_1^{t+1}, (I_n − J_n) P = P − J_n, and (P − J_n) 1_n = 0_n; multiplying both sides of (20) by (I_n − J_n) yields (21). Based on the fact z^t − z^* = z^t + ∇F(x^*) [35] and Assumption 3, taking the norm on both sides of (21) and recalling the definitions of a and b yields (23). Rearranging the terms in (23), the result in Lemma 4 is obtained.
Lemma 5. ∀ t > 0, the following inequality holds: Proof of Lemma 5. Multiplying both sides of (16) by J_n and substituting y^t = x^t + Γ_γ^{t−1} Ξ_3^t, we obtain (25). To obtain the related terms, recalling the fact that z̄^{t+1} = z̄^t = · · · = z̄^0 = 0 (e.g., J_n z^t = 0) in [35], we add and then remove some useful terms in (25), which gives (26). By applying ∇f(x) = (1/n) 1_n^T ∇F(x) and subtracting x^* on both sides of (26), we then obtain (27). Taking the norm on both sides of (27) and using Lemma 1 yields (28). Rearranging the terms in (28), the desired result can be obtained.
Lemma 6. ∀ t > 0, the following inequality holds: Proof of Lemma 6. Substituting (16) and then subtracting x^t on both sides yields (30). The second equality is based on (P − I_n) 1_n = 0_n; recalling the definition of c and taking the norm on both sides of (30), we have (31). Rearranging the terms in (31), the result in Lemma 6 is obtained.

Lemma 7.
Let Assumptions 2-3 and Lemma 2 hold. ∀ t > 0, the following inequality holds: Proof of Lemma 7. Noting (P − J_n) 1_n = 0_n and adding ∇F(x^*) on both sides of (18), we obtain (33). The third equality of (33) follows from the facts in [35] and Lemma 2: z̄^{t+1} = z̄^t = · · · = z̄^0 = 0, J_n ∇F(x^*) = 0_n, and L K 1_n = 0_n. Recalling the definition of d and taking the norm on both sides of (33), we have (34). Substituting y^t = x^t + Γ_γ^{t−1} Ξ_3^t into (34) and rearranging the terms yields the desired result.
With the Lemmas 4-7 above, we established the main convergence result as follows.
Theorem 1. Suppose that Assumptions 1-3 hold. Considering the sequences {x^t}, {y^t}, and {z^t} generated by the proposed algorithm UGNH and combining Lemmas 4-7 into a linear-inequalities system, we obtain (35), where the matrix H ∈ R^{4×4} is given below. The largest step-size satisfies the bound (36), the maximum momentum coefficient satisfies (37), and the condition number satisfies (38), where ε_1, ε_2, ε_3, and ε_4 are arbitrary positive constants obeying the picking rules stated below. Then, the spectral radius of the matrix H is strictly less than 1, i.e., ρ(H) < 1, which is the desired result.
Proof of Theorem 1. According to Lemmas 4-7, we can immediately obtain the inequalities (35). Then, we provide some necessary conditions on the parameters α_max, γ̄, and Φ such that ρ(H) < 1. Based on Lemma 3, let ε = [ε_1, ε_2, ε_3, ε_4]^T ∈ R^4 be a positive vector; if H ε < ε, then ρ(H) < 1. According to the definition of H above, the inequality H ε < ε is equivalent to the following four inequalities: According to Lemma 1, if 0 < α_max < 1/ψ̄ and ζ = 1 − μ̄ α_max, then (41) is equivalent to the following inequality: To make sure that the parameter γ̄ is positive, the right sides of (40) and (42)-(44) must be positive. Immediately, we obtain the following conditions: Recalling that ξ_α = α_max − α_min and that Φ = α_max / α_min is the condition number, (46) further implies that: Now, we select a proper vector ε = [ε_1, ε_2, ε_3, ε_4]^T such that the parameters α_max, γ̄, and Φ are available. Based on (46)-(48), an arbitrary positive constant ε_2 is chosen first; then we choose ε_1 from (46), and finally ε_3 and ε_4 from (47) and (48), respectively. Hence, according to (45) and (47), and the requirement 0 < α_max < 1/ψ̄ in (44), the upper bound on the largest step-size α_max shown in (36) is obtained. Furthermore, according to (46), the upper bound on the condition number Φ shown in (38) is obtained. Besides, the upper bound on the maximum coefficient γ̄ follows from (40) and (42)-(44). Above all, the proof is finished. Remark 3. According to Theorem 1, a linear convergence rate of the proposed algorithm is obtained whenever the parameters α_max, γ̄, and Φ satisfy the conditions (36)-(38), respectively. It is noteworthy that these parameters depend only on the topology of the network and the objective functions. Although some global parameters such as μ̄ and ψ̄ are needed when designing the step-sizes and coefficients, these parameters can be easily pre-computed without much effort.
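The elementwise test used in the proof (a nonnegative matrix H and a positive vector ε with H ε < ε implying ρ(H) < 1) is easy to check numerically; the 4×4 matrix below is an arbitrary illustrative example, not the H of Theorem 1:

```python
import numpy as np

# Illustrative nonnegative matrix standing in for the H of Theorem 1
H = np.array([[0.5, 0.2, 0.0, 0.1],
              [0.1, 0.6, 0.1, 0.0],
              [0.0, 0.2, 0.4, 0.2],
              [0.2, 0.0, 0.1, 0.5]])
eps = np.array([1.0, 1.0, 1.0, 1.0])     # a positive vector with H @ eps < eps

assert np.all(H @ eps < eps)             # the elementwise criterion holds...
rho = np.max(np.abs(np.linalg.eigvals(H)))
assert rho < 1                           # ...so the spectral radius is below 1
```

Once ρ(H) < 1, the stacked error vector in (35) contracts geometrically, which is exactly the claimed linear convergence.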

Remark 4.
Being uncoordinated and being nonidentical are two important characteristics designed into many related studies, considering that step-sizes and coefficients might change over time in some practical scenarios. In our algorithm, the step-sizes and coefficients were designed to be uncoordinated, time-varying, and nonidentical. Furthermore, the largest step-size and coefficients were chosen according to their bounds in Theorem 1, which depend only on the communication network and the objective functions. Notably, there is also a bound on the condition number, so once the largest step-size is chosen, the smallest step-size must be chosen carefully.

Numerical Experiments
In this section, some necessary numerical experiments on a real dataset are provided to illustrate the efficiency and superiority of our algorithm. We consider a binary-classification logistic-regression problem on the Wisconsin breast cancer dataset from the UCI Machine Learning Repository [45]. The problem takes the form of problem (1), with each local objective function f_i written as follows:

f_i(x) = ∑_{j=1}^{n_i} ln(1 + exp(−y_ij a_ij^T x)) + (λ/2) ‖x‖^2,

where n is the number of agents in the network, and d is the dimension of the decision variable. Each agent i is assumed to hold an equal number of data samples n_i, i.e., n_i = N/n (N is the total number of data samples); a_ij ∈ R^d is the feature vector of the jth data sample at the ith agent, while y_ij ∈ {−1, 1} is the corresponding label. The regularization term (λ/2)‖x‖^2 with parameter λ = 1 is used to avoid over-fitting.
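The local gradient that each agent feeds into the algorithms can be sketched as below; the data here are synthetic stand-ins for the Wisconsin samples, and the finite-difference check simply confirms that the gradient formula matches the regularized logistic loss:

```python
import numpy as np

# Local logistic-regression objective (sketch):
# f_i(x) = sum_j ln(1 + exp(-y_ij * a_ij^T x)) + (lam/2)||x||^2, lam = 1.
rng = np.random.default_rng(0)
d, n_i, lam = 9, 20, 1.0
A = rng.normal(size=(n_i, d))            # feature vectors a_ij (synthetic)
y = rng.choice([-1.0, 1.0], size=n_i)    # labels y_ij (synthetic)

def local_grad(x):
    s = y * (A @ x)
    # d/dx ln(1 + exp(-s)) = -y * a / (1 + exp(s))
    return -(A * (y / (1.0 + np.exp(s)))[:, None]).sum(axis=0) + lam * x

# central finite-difference check of the gradient at a random point
f = lambda x: np.log1p(np.exp(-(y * (A @ x)))).sum() + 0.5 * lam * x @ x
x0 = rng.normal(size=d)
h = 1e-6
num = np.array([(f(x0 + h * e) - f(x0 - h * e)) / (2 * h) for e in np.eye(d)])
assert np.allclose(num, local_grad(x0), atol=1e-4)
```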
In the experiments, we set N = 200 training samples, and d = 9 is the number of features in the real dataset. Meanwhile, we simulated a random undirected Erdos-Renyi network with n = 10 nodes and edge probability p = 0.7. Then, we compared the proposed algorithm UGNH to the relevant algorithms EXTRA, HSADO, and UG.
• Figure 1 indicates that the proposed algorithm UGNH improves the convergence rate compared to the related algorithms on the real dataset; thus, UGNH is effective and superior. From Figure 2, the sequences generated by UGNH, EXTRA, UG, and HSADO converge to the optimal solutions as expected. To avoid cluttering the figure, only one dimension of each decision variable is exhibited.
• Figure 3 shows that UGNH, with both the Nesterov momentum and the Heavy-ball momentum, improves the convergence rate compared to the algorithm with only one or no momentum term.
• From Figure 4, we conclude that although the step-size is usually chosen very small, a larger step-size leads to a faster convergence rate as long as it stays below the upper bound. For the coefficient, a similar result is observed in Figure 5. Comparing the two figures, small changes in the step-size are more influential than those in the coefficient.

Conclusions
In this study, a novel distributed optimization accelerated algorithm with uncoordinated, time-varying, and nonidentical step-sizes and coefficients was proposed. It is mainly applied to the distributed convex optimization problem over an undirected network, where all agents collaboratively optimize the average of all local objective functions. When the largest step-size and the maximum coefficient do not exceed the estimated upper bounds provided in Theorem 1, the convergence rate of UGNH is linear, under the condition that each local objective function is smooth and strongly convex. Besides, these bounds depend only on the topology of the network and the local objective functions.
It is worth noting that, to achieve a faster linear convergence rate, the Heavy-ball and Nesterov accelerated methods were simultaneously added into the algorithm, which provides a new way to accelerate the convergence of other distributed optimization algorithms. Furthermore, the experimental results verified its effective and superior performance on a real dataset. However, UGNH is not suitable for all scenarios, and there are more in-depth areas worth studying, such as time-varying network architectures, random link failures, asynchronous communication between agents, directed networks, and so on. These problems are worthy of further study and constitute our future research directions.