Distributed Optimization Algorithm for Composite Optimization Problems with Non-Smooth Function

Abstract: This paper studies distributed optimization problems over a class of undirected networks. The objective function of the problem consists of a smooth convex function and a non-smooth convex function, and each agent in the network needs to optimize the sum of the two. For this class of problems, based on the operator splitting method, this paper uses the proximal operator to handle the non-smooth term and designs a distributed algorithm that allows the use of uncoordinated step-sizes. In addition, by introducing a random block-coordinate mechanism, this paper develops an asynchronous iterative version of the synchronous algorithm. Finally, the convergence of both algorithms is proven, and their effectiveness is verified through numerical simulations.


Introduction
In this paper, we study a class of distributed multi-agent problems on networks. The agents in the network system cooperate to solve the following problem:

min_{x ∈ R^n} ∑_{i=1}^{m} ( f_i(x) + g_i(x) ),   (1)

where x ∈ R^n is the decision variable, f_i is a Lipschitz-differentiable convex function, and g_i is a non-smooth convex function. Examples of f_i include quadratic functions and logistic functions [1], and examples of g_i include the elastic-net norm, the L1-norm, and indicator functions [2]. For the network system, we consider that each agent is only allowed to interact with its neighbor agents and there is no central agent to process data; then we obtain

min ∑_{i=1}^{m} ( f_i(x_i) + g_i(x_i) )  s.t. x_i = x_j, ∀(i, j) ∈ E,   (2)

where x_i ∈ R^n is the local estimate of x and E represents the collection of edges in the network. This distributed computing architecture captures various areas, including distributed information processing and decision making, networked multi-vehicle coordination, and distributed estimation. Typical applications include power systems control [3], model predictive control [4], statistical inference and learning [5], and distributed average consensus [6].
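The non-smooth terms g_i named above (the L1-norm and indicator functions) both admit inexpensive proximal operators, which is what the algorithms developed later exploit. As a minimal illustrative sketch (not part of the paper's algorithm), the two prox maps can be written as:

```python
import numpy as np

def prox_l1(v, t):
    """Proximal operator of t*||.||_1: componentwise soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_box(v, lo, hi):
    """Proximal operator of the indicator of the box [lo, hi]: projection."""
    return np.clip(v, lo, hi)
```

For example, soft-thresholding shrinks each entry toward zero by t and zeroes out small entries, while the prox of an indicator function is simply the Euclidean projection onto the corresponding set.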
In recent years, most of the literature has focused on the case where the optimization objective contains only one smooth convex function. Many centralized algorithms with excellent performance, such as proximal gradient descent, the sub-gradient algorithm, and Newton's method, have been extended to distributed forms to solve such problems. The sub-gradient algorithm is the most commonly used method. In [7], Nedić and Ozdaglar apply this method to the distributed optimization problem on time-varying networks and creatively propose the distributed sub-gradient method (DGD). Shi et al. [8] propose an exact first-order algorithm (EXTRA), which exploits the error between adjacent iterations of the DGD algorithm, and prove its linear convergence. Then, [9] designs a distributed first-order algorithm by combining DGD with the gradient tracking method. To further accelerate convergence, researchers successively propose distributed ADMM algorithms in [10-13]. However, these algorithms can only solve optimization problems with a single smooth function.
For the composite distributed optimization problem (2) with a non-smooth term, many research results have emerged. The authors of [14] design a proximal gradient method by combining Nesterov acceleration mechanisms; however, each iteration consumes more computing resources because more inner iteration steps are required. In undirected networks, Shi et al. design a proximal gradient exact first-order algorithm (PG-EXTRA) for composite optimization problems in [15], based on the classical first-order distributed optimization algorithm EXTRA [8]. The algorithm converges exactly to the optimal solution of the problem with a fixed step-size, which distinguishes it from most algorithms that must use decaying step-sizes. The authors of [16] propose a communication-efficient random-walk method named Walkman by using a Markov chain. By analyzing the relationship between optimization accuracy and network connectivity, this method obtains an explicit expression for the communication complexity and the communication efficiency of the system. Further, considering that the complexity of real scenarios causes most agents in the network to transmit data in a directed way, ref. [17] uses the push-sum mechanism to eliminate the information imbalance caused by directed networks and proposes the PG-ExtraPush algorithm on the basis of [8], while maintaining the same convergence property.
Recently, operator splitting techniques have become the mainstream approach for this kind of complex optimization problem. Since Combettes and Pesquet first applied operator splitting to composite optimization by designing a fully splitting algorithm, refs. [18-21] and others have successively proposed various algorithms for composite optimization. However, operator splitting has rarely been applied to distributed composite optimization. Motivated by this, this paper aims to design a distributed algorithm with excellent performance by using the operator splitting method and the theory of monotone operators.
Contributions: Compared with most existing distributed optimization algorithms, the main contributions of this paper are summarized as follows:
1. To solve problem (2), this paper develops a novel, fully distributed algorithm based on the operator splitting method, which has advantages in flexibility and efficiency over its relatively centralized counterparts [18-21].
2. Based on a class of randomized block-coordinate methods, an asynchronous iterative version of the proposed algorithm is also derived, wherein only a subset of independently activated agents participates in the updates. Note that such an activation scheme is more flexible than single-coordinate activation [22].
3. Both proposed algorithms allow not only local information interaction among neighboring agents but also the use of uncoordinated step-sizes, without any requirement of coordinated or dynamic ones as considered in [7-9,14,23]. Additionally, the convergence of both algorithms is ensured under the same mild assumptions. In particular, the use of a local Lipschitz assumption avoids conservative step-size selections, unlike the global one assumed in [8,14,15,17].
Organization: The remaining sections of the paper are organized as follows. Section 2 provides the symbols, lemmas, definitions, and assumptions used in the paper. We give the specific process of algorithm derivation in Section 3. In Section 4, we present the convergence analysis of the proposed algorithms. Section 5 presents simulation experiments to verify the algorithms. Finally, Section 6 concludes the paper.

Preliminaries
In this section, we give the notation and present the definitions and lemmas that will be used in the paper. Then, we give two important assumptions.
First of all, we introduce some notions from graph theory. Let G = (V, E) represent an undirected network composed of m agents, where V denotes the set of agents and E denotes the set of edges. The neighborhood of the i-th agent is denoted by N_i = {j | (i, j) ∈ E}. In particular, when there is at least one path between any two agents in an undirected network G, the network is connected.
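Connectivity of the communication graph, required by the assumptions below, can be checked with a standard breadth-first search; the following helper is only an illustrative sketch, not part of the paper's algorithms:

```python
from collections import deque

def is_connected(m, edges):
    """Check connectivity of an undirected graph on agents 0..m-1 by BFS."""
    adj = {i: [] for i in range(m)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    seen = {0}
    q = deque([0])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                q.append(v)
    # the graph is connected iff BFS from agent 0 reaches every agent
    return len(seen) == m
```

For instance, a path graph 0-1-2 is connected, while a graph on three agents with the single edge (0, 1) is not.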
Let R^n denote the n-dimensional Euclidean space and ‖·‖ denote the Euclidean norm of a vector x ∈ R^n. The notation ρ_max(·) is the spectral radius of a matrix, and N represents the set of positive integers. Let X_0(R^n) denote the collection of all proper lower semi-continuous convex functions from R^n to (−∞, +∞]. When each W_i is a positive definite matrix, blkdiag{W_i}_{i∈V} denotes the positive definite block-diagonal matrix with the W_i as diagonal blocks. Let ri(·) denote the relative interior of a convex subset and dom f denote the effective domain of f. The subdifferential of a function f_i at x is expressed as ∂f_i(x) = {v ∈ R^n : f_i(y) ≥ f_i(x) + ⟨v, y − x⟩, ∀y ∈ R^n}. We also give the following lemmas and assumptions.

Lemma 3 ([19]). Consider the fixed-point iteration u^{k+1} = T(u^k) with a continuous operator T. If lim_{k→∞} ‖u^{k+1} − u^k‖ = 0 and the sequence {‖u^k − u*‖} is non-increasing for every fixed point u* of T, then {u^k} converges to a fixed point of T. In addition, given σ_T > 0, an operator T is said to be σ_T-strongly monotone if ⟨T(u) − T(v), u − v⟩ ≥ σ_T ‖u − v‖² for all u, v.
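The fixed-point machinery in Lemma 3 can be illustrated with a one-line Picard iteration; the contraction T used below is only an example, not an operator from the paper:

```python
def picard(T, u0, tol=1e-12, max_iter=1000):
    """Picard iteration u^{k+1} = T(u^k); stops when successive iterates agree."""
    u = u0
    for _ in range(max_iter):
        u_next = T(u)
        if abs(u_next - u) < tol:
            return u_next
        u = u_next
    return u

# T(u) = 0.5*u + 1 is a contraction whose unique fixed point is u* = 2.
```

Running `picard(lambda u: 0.5 * u + 1, 0.0)` drives the iterates to the fixed point u* = 2, matching the convergence behavior the lemma describes.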
The following assumptions will also be used.

Assumption 1. Graph G is undirected and connected.
Assumption 2. The following three points are satisfied:
1. f_i : R^n → R is a smooth convex function whose gradient is Lipschitz continuous with constant 1/β_i, i.e., ‖∇f_i(x) − ∇f_i(y)‖ ≤ (1/β_i)‖x − y‖ for all x, y ∈ R^n;
2. g_i ∈ X_0(R^n);
3. Problem (2) has at least one solution.

Algorithm Development
In this section, we design and derive the synchronous and asynchronous algorithms.
We next carry out an equivalent transformation of problem (2) to facilitate the subsequent algorithm design. The constraint x_i = x_j in (2) can be written in the edge-based form (4), where E_ij = I ∈ R^{n×n} for i < j and E_ij = −I ∈ R^{n×n} otherwise. Then, we define the linear operator M_(i,j) in (5) with the compact variable x = col{x_i}_{i∈V}. We stack all M_(i,j) to obtain the operator M in (6), of dimension 2n|E| × mn, where |E| is the number of edges of the network. Considering the set C_(i,j) in (7), constraint (4) can be further reformulated as Mx ∈ C. Based on the above analysis, problem (2) can be transformed into

min_x ∑_{i=1}^{m} ( f_i(x_i) + g_i(x_i) ) + δ_C(Mx),   (8)

where δ_C represents the indicator function, i.e., δ_C(y) = 0 if y ∈ C and δ_C(y) = +∞ otherwise. Then, let f(x) = ∑_{i=1}^{m} f_i(x_i), g(x) = ∑_{i=1}^{m} g_i(x_i), and C = ∏_{(i,j)∈E} C_(i,j) (∏ denotes the Cartesian product). Hence, the compact form of problem (8) can be expressed as

min_x f(x) + g(x) + δ_C(Mx).   (9)
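One plausible reading of the stacked operator M described above (the exact displayed definitions in (5) and (6) are not reproduced here) is that it selects, per edge (i, j), the pair (x_i, x_j), so that the consensus constraint becomes a set-membership condition on Mx. A small numerical sketch under that assumption:

```python
import numpy as np

def edge_operator(edges, m, n):
    """Stack, per edge (i, j), the map x -> (x_i, x_j) into a 2n|E| x mn matrix.
    One plausible reading of the stacked operator M in the text."""
    E = len(edges)
    M = np.zeros((2 * n * E, m * n))
    for e, (i, j) in enumerate(edges):
        M[2 * e * n:(2 * e + 1) * n, i * n:(i + 1) * n] = np.eye(n)
        M[(2 * e + 1) * n:(2 * e + 2) * n, j * n:(j + 1) * n] = np.eye(n)
    return M

edges = [(0, 1), (1, 2)]                 # path graph on m = 3 agents
M = edge_operator(edges, m=3, n=2)
x_consensus = np.tile([1.0, -2.0], 3)    # x_1 = x_2 = x_3
y = M @ x_consensus                      # every edge pair (x_i, x_j) agrees
```

When x satisfies consensus, every stacked edge block of Mx contains two equal copies, i.e., Mx lies in the consensus set C.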

Synchronous Algorithm 1
According to fixed-point theory, we design the distributed optimization algorithm for problem (2) starting from (9). We define the step-size matrices Γ = blkdiag{γ_i I_n}_{i∈V} and Λ = blkdiag{λ_(i,j) I_{2n}}_{(i,j)∈E}, where we let γ_(i,j) = blkdiag{γ_i I_n, γ_j I_n}, and then introduce the operators in (10)-(13), whose fixed points are (s*, x*). In particular, s*_{(i,j),i} and s*_{(i,j),j} are maintained by agents i and j, respectively. Considering the update variables x^{k+1}, y^{k+1}, and s^{k+1} = col{s^{k+1}_{(i,j)}}_{(i,j)∈E} with the edge-based variable s^{k+1}_{(i,j)} = col{s^{k+1}_{(i,j),i}, s^{k+1}_{(i,j),j}}, we form the Picard sequence of T and obtain the update rules (14). Let w̃^{k+1} = col{w̃^{k+1}_{(i,j)}}_{(i,j)∈E} = col{λ_(i,j) γ_(i,j)^{-1} s^{k+1}_{(i,j)}}_{(i,j)∈E}. Using Lemma 2, (14) can be rewritten as (15a)-(15c). Next, we split (15a)-(15c) in a distributed manner. It follows from (5) and (6) that (15a) yields (16). For (15b), we multiply both sides of the equality by Λ. As done in (16), we also split (15b) and (15c) and use the identity prox_{δ_{C_(i,j)}} = proj_{C_(i,j)} to obtain the semi-distributed form (17a)-(17c). Note that (17b) is not fully distributed due to the structure w̃^{k+1}_{(i,j)} = col{w̃^{k+1}_{(i,j),i}, w̃^{k+1}_{(i,j),j}}. By using (4) and (5), the projection of vectors e_1, e_2 ∈ R^n onto C_(i,j) can be expressed in closed form as in (18), which yields the local update of (17b) for w̃^{k+1}_{(i,j),i}. Therefore, according to (17a), (17c), and the update of w̃^{k+1}_{(i,j),i}, we can summarize the synchronous distributed algorithm as follows.

Remark 1. Notice that Algorithm 1 is completely distributed and does not involve any global parameters. For example, each agent individually maintains the private primal variable x^k_i, the auxiliary variable y^k_i, and the edge-based variables w̃^k_{(i,j),i}. For each edge (i, j) ∈ E in the network, the auxiliary profile w̃^k_{(i,j)} = col{w̃^k_{(i,j),i}, w̃^k_{(i,j),j}} contains two components, w̃^k_{(i,j),i} and w̃^k_{(i,j),j}, which are kept by agents i and j, respectively. Meanwhile, the information exchange is locally conducted; that is, agent i shares its updated data y^{k+1}_i and w̃^{k+1}_{(i,j),i} with all its neighbors j ∈ N_i. On the other hand, the proposed algorithm uses uncoordinated constant positive step-sizes γ_i, which essentially distinguishes it from the global and dynamic ones in [7-9,14,23]. It is also worth noting that the edge-based step-size λ_(i,j), held by the agents i and j linked by the edge (i, j) ∈ E, can be seen as an inherent parameter of the communication network, revealing the quality of the communication.

Algorithm 1 Distributed algorithm based on proximal operators
Input: For all agents i ∈ V, initialize x^0_i ∈ R^n and w̃^0_{(i,j),i} ∈ R^n for j ∈ N_i, and select proper positive step-sizes (parameters) γ_i and λ_(i,j).
For k = 0, 1, . . ., do:
1.-4. Each agent i performs its local updates and sends y^{k+1}_i and w̃^{k+1}_{(i,j),i} to j for j ∈ N_i.
5. Until ‖x^{k+1}_i − x^k_i‖ approaches zero.
End
Output: The primal variable x^{k+1}_i as the optimal solution x*_i.
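The precise update rules of Algorithm 1 are given in the paper's displayed equations (14)-(18). As a hedged stand-in for the class of fixed-step exact distributed first-order methods that this work builds on, the following sketch implements EXTRA [8] on a toy scalar consensus problem; the network, data, and step-size below are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

# Metropolis mixing matrix for the path graph 1-2-3 (symmetric, doubly stochastic)
W = np.array([[2/3, 1/3, 0.0],
              [1/3, 1/3, 1/3],
              [0.0, 1/3, 2/3]])
W_tilde = (np.eye(3) + W) / 2
a = np.array([1.0, 2.0, 6.0])            # local data: f_i(x) = 0.5*(x - a_i)^2
grad = lambda x: x - a                   # stacked local gradients
alpha = 0.5                              # fixed step-size (within the EXTRA bound)

x_prev = np.zeros(3)
x = W @ x_prev - alpha * grad(x_prev)    # first EXTRA step
for _ in range(5000):
    # EXTRA: x^{k+1} = (I + W) x^k - W_tilde x^{k-1} - alpha*(grad^k - grad^{k-1})
    x, x_prev = (np.eye(3) + W) @ x - W_tilde @ x_prev \
        - alpha * (grad(x) - grad(x_prev)), x
# all agents approach the global minimizer mean(a) = 3 with a *fixed* step-size
```

The point of the sketch is the qualitative property claimed in the text: with a constant step-size, the iterates converge exactly to consensus on the global minimizer, rather than to a neighborhood as plain DGD would.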

Asynchronous Algorithm 2
Here, we extend the synchronous Algorithm 1 to an asynchronous iterative version based on the random block-coordinate mechanism in [2]. Following the principle of this mechanism, we define diagonal coordinate matrices P_i ∈ R^{(2|E|+m)n × (2|E|+m)n} (where |E| denotes the number of edges of the graph) with diagonal elements 0 or 1, and then divide the vector (s, x) into m blocks. At the same time, we define the activation vector ξ^k ∈ {0, 1}^m. When ξ^k_i = 1, agent i is activated at the k-th iteration; otherwise, it is not activated.
In order to describe the activation state of different coordinate blocks and ensure random activation, we give the following assumption.
Assumption 3. The following two points are satisfied:
1. The coordinate matrices satisfy ∑_{i=1}^{m} P_i = I;
2. {ξ^k}_{k≥0} is a {0, 1}^m-valued sequence of independent and identically distributed random vectors, where p_i = P(ξ^k_i = 1) > 0 denotes the activation probability of agent i.
Then, based on the given assumption, we can develop the asynchronous algorithm as follows. It can be seen that Algorithm 2 allows each agent to wake up with an independent probability, which means that a subset of randomly activated agents participates in the updates while the inactivated ones stay in their previous states. Such a scheme is more flexible than the single waking-up scheme [22] or schemes whose activated block coordinates are uniformly selected [26]. In addition, each probability p_i is completely independent of the others and does not need to satisfy strict conditions such as ∑_{i=1}^{m} p_i = 1.

Algorithm 2 Asynchronous distributed version
Input: For all agents i ∈ V, initialize x^0_i ∈ R^n and w̃^0_{(i,j),i} ∈ R^n for j ∈ N_i, and select proper positive step-sizes (parameters) γ_i and λ_(i,j).
For k = 0, 1, . . ., do:
Each agent i is activated independently with probability p_i and, if activated, performs update steps 1-5 of Algorithm 1 for j ∈ N_i; the agents that are not activated keep their last values unchanged.

End
Output: The primal variable x k+1 i as the optimal solution x * i .
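A minimal sketch of the random activation mechanism in Algorithm 2: each agent wakes independently with probability p_i, and inactive agents keep their previous states. The probabilities and the placeholder update map below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.8])            # independent activation probabilities p_i
x = np.array([1.0, 2.0, 3.0])            # current local states

def async_round(x, update, p, rng):
    """One asynchronous round: agent i wakes with probability p_i;
    inactive agents keep their previous state."""
    active = rng.random(p.size) < p
    x_new = x.copy()
    x_new[active] = update(x)[active]    # only activated blocks are refreshed
    return x_new, active

x_new, active = async_round(x, lambda x: 0.5 * x, p, rng)
```

This mirrors the compact operator form used later: the random diagonal selection acts like E^{k+1}, applying the update only on the activated blocks.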
In order to facilitate the subsequent derivation of convergence, we give a compact form of Algorithm 2. By setting u = (s, x), we get

u^{k+1} = u^k + E^{k+1}( Tu^k − u^k ),   (23)

where E^{k+1} = ∑_{i=1}^{m} ξ^{k+1}_i P_i and the operator T can be seen in Equation (11).

Convergence Analysis
The convergence proofs of the algorithms are provided in this section. The following assumption gives the conditions to be met for the convergence of the algorithms.
Assumption 4. Recall the local Lipschitz constants β_i in Assumption 2. It is assumed that the step-sizes satisfy the following conditions:
Lemma 4. Let x* be a solution to (9); then s* = T_1(s*, x*) and x* = T_2(s*, x*), which means that u* = (s*, x*) is a fixed point of T. Conversely, x* is a solution to (9) when u* is a fixed point of T.
Proof. Use the first-order optimality condition of (9) to obtain 0 ∈ Γ∇f(x*) + Γ∂g(x*) + ΓM^T ∂δ_C(Mx*), where x* is the optimal solution. According to the definition of the matrix step-sizes, we further obtain (19). Use Lemma 1 and let s* ∈ Λ^{-1}∂δ_C(Mx*) to get (20). Then, according to (19) and (20), we have x* = T_2(s*, x*) and s* = T_1(s*, x*). Meanwhile, u* = Tu*, where u* = (s*, x*). Conversely, if u* = Tu*, it can also be deduced that x* satisfies the first-order optimality condition of problem (9); thus x* is an optimal solution of problem (9).
Lemma 5. Let Assumptions 1 and 2 hold; then (21) and (22) hold.
Proof. Combining (14), (19), and Lemma 2, we obtain an intermediate identity. We then introduce an equality that holds for any positive definite matrix K; combining the above two results, we derive (21). In order to prove the validity of (22), (3) is applied to (14). Using subdifferential properties and an equivalent reformulation, we obtain (25); a derivation similar to (25) is carried out for (14). Therefore, combining the above two equalities and (26), we can get (22).
Lemma 6. Let Assumptions 1 and 2 hold. Set β = blkdiag{β_i I_n}_{i∈V}. For the matrix P = blkdiag{Λ, Γ^{-1}} and u = (s, x), (27) holds.
Proof. Adding (21) and (22) and then rearranging, we get the following.
Further, we have (28). Then we deal with some terms in this inequality: combining (20) with Lemma 1, and further using subdifferential properties, the relevant terms can be bounded. Bringing the above results back to (28), we get (27).
Summing (27) over k from 0 to N and letting N tend to infinity, we conclude that lim_{k→∞} ‖u^{k+1} − u^k‖²_P = 0 and that the sequence {‖u^k − u*‖²_P} is non-increasing; we record this conclusion as Lemma 7. Next, we give the following theorem to prove the convergence of Algorithm 1.
Theorem 1.Under Assumptions 1-4, {x k } and {u k } converge to the optimal solution of (2) and the fixed points of T, respectively.
Proof. Because prox_f and I − prox_f are firmly nonexpansive, T is continuous. Then, lim_{k→∞} ‖u^{k+1} − u^k‖²_P = 0 and the non-increasing property of the sequence ‖u^k − u*‖²_P are obtained from Lemma 7. Based on Lemma 3, the sequence {u^k} converges to a fixed point of T. According to Lemma 4, it can be concluded that {x^k} converges to a solution of (2).
At the same time, we also give the following theorem to prove the convergence of Algorithm 2.
Theorem 2. Under Assumptions 1-4, the sequences {x^k} and {u^k} generated by Algorithm 2 converge almost surely to an optimal solution of (2) and a fixed point of T, respectively.
Proof. Before the proof, we give some definitions. Let Π = ∑_{i=1}^{m} p_i P_i denote the probability matrix, and let E[·|F^k] denote the conditional expectation, abbreviated E_k[·], where F^k represents the filtration generated by ξ^1, . . . , ξ^k. Based on the definition of ξ^k, we have E[E^{k+1}] = Π. Using the idempotent property of E^{k+1} together with Lemma 6 and (23), we get (37). Therefore, if Assumption 4 holds, we can obtain the convergence of (37) according to [28] (Theorem 3), [27] (Proposition 2.3), and the Robbins-Siegmund lemma in [29].

Numerical Experiments

Case Study I: Performance Examination
We present the effectiveness of the algorithms in this section by solving a class of quadratic programming problems on undirected networks. The network topology is shown in Figure 1. The quadratic programming model is given in (38), where x_i is the decision variable of each agent. The matrix V_i in the objective function is diagonal, with elements randomly selected in [−8, 8], and the elements of the vector b_i are randomly selected in [−10, −5]. For the box constraint on x_i, x_i^min is drawn from [−10, −5] and x_i^max from [5, 10]. To solve problem (38), we need to convert it into the form of problem (8). Defining the set X_i and the indicator function δ_{X_i}(x_i), we can then get problem (39). Figure 2a shows that the agents finally converge to a consistent state through the synchronous Algorithm 1. In Figure 2b, we use the asynchronous Algorithm 2 with activation probability p_i = 0.2 to describe the states of the agents under the same parameter conditions. In Figure 3, the performance of both proposed algorithms is depicted through a comparison with existing algorithms, i.e., an ADMM-based method [30] and TriPD-Dist together with its asynchronous version [2], by plotting the logarithmic values of 1/m · ∑_{i=1}^{m} ‖x^k_i − x*‖. It can be seen that Algorithm 1 outperforms the ADMM-based method and TriPD-Dist, and the proposed asynchronous algorithm (Algorithm 2) also has a faster convergence speed than asynchronous TriPD-Dist.
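As a hedged companion to Case Study I, the following sketch solves a centralized instance of the box-constrained quadratic program by projected gradient descent. A strictly positive diagonal V is assumed here so that the problem is convex and has a closed-form minimizer to check against; the data ranges loosely mimic the ones in the text and are otherwise illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
v = rng.uniform(1.0, 8.0, n)             # positive diagonal of V (convexity assumed)
b = rng.uniform(-10.0, -5.0, n)
lo, hi = -10.0, 5.0                      # box bounds x_min, x_max

# projected gradient on f(x) = 0.5*x^T V x + b^T x over the box [lo, hi]^n
x = np.zeros(n)
step = 1.0 / v.max()                     # 1/L step for the quadratic
for _ in range(2000):
    x = np.clip(x - step * (v * x + b), lo, hi)

# for diagonal positive V the minimizer is available in closed form
x_star = np.clip(-b / v, lo, hi)
```

The projection step is exactly the prox of the indicator function δ_{X_i} used to cast (38) in the form of problem (8).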

Case Study II: First-Order Dynamics System
In this subsection, we apply the proposed synchronous algorithm to solve a first-order dynamics system problem in a 2-D space [31], where each agent has its own cost function f_i(p) = ‖p − p_{x,i}‖² + ‖p − p_{y,i}‖², with the action response p = [p_x, p_y]^T and the private reference positions p_{x,i} = [i − 3.5, 0]^T and p_{y,i} = [0, i − 3.5]^T. The goal of the considered problem is that all agents cooperatively find the optimal position p under the local constraints Ω_i ⊂ R^2 (disk-shaped motion areas centered at the initial positions, cf. Figure 5) and the consensus constraints p_i = p_j, (i, j) ∈ E, where p_i ∈ R^2 is the local estimated action for p. In light of (1), we can set g_i(p_i) = δ_{Ω_i}(p_i). The selections of step-sizes are the same as in Case Study I.

The results are described in Figures 4 and 5. To be specific, Figure 4a,b reflect the trajectories of p_i = [p_{x,i}, p_{y,i}]^T. Figure 5 depicts the motions of the entire system over the iterations, where the optimal position p* = [0.6743, 0.2711]^T is marked by a cross at the intersection of two star lines, the circles with dotted lines are the corresponding motion areas of the agents, and the solid ones are the initial positions.

Conclusions
This paper studies a class of distributed composite optimization problems with non-smooth convex functions. To solve this kind of problem, two completely distributed algorithms are proposed and verified both in theory and through simulation. There are still some aspects worthy of improvement: for example, regarding the network structure, the undirected graph could be extended to a directed graph, and the algorithms could be combined with more practical application scenarios, such as resource allocation.

Figure 4. Evaluations of positions. (a) Evaluations of p^k_{x,i}. (b) Evaluations of p^k_{y,i}.