Abstract
This paper addresses the decentralized composite optimization problem, where a network of agents cooperatively minimizes the sum of their local objective functions with non-differentiable terms. We propose a novel communication-efficient decentralized ADMM framework, termed CE-DADMM, by combining the ADMM framework with the three-point compressor (3PC) communication mechanism. This framework not only covers existing mainstream communication-efficient algorithms but also introduces a series of new algorithms. One of the key features of the CE-DADMM framework is its flexibility, allowing it to adapt to different communication and computation needs and to balance communication efficiency against computational overhead. Notably, when employing quasi-Newton updates, CE-DADMM becomes the first compression-based communication-efficient second-order algorithm that can efficiently handle composite optimization problems. Theoretical analysis shows that, even in the presence of compression errors, the proposed algorithm maintains exact linear convergence when the local objective functions are strongly convex. Finally, numerical experiments demonstrate the algorithm’s impressive communication efficiency.
1. Introduction
The recent increase in the number of mobile devices with enhanced computing and communication capabilities has led to significant development of multiagent systems [1,2]. Within these systems, many applications, including smart grid management [3,4], wireless communications [5], multi-robot coordination [6,7], and large-scale machine learning [8], can be cast as decentralized optimization problems, in which a network of nodes cooperatively solves a finite-sum optimization problem using local information.
A vast number of decentralized optimization algorithms have been proposed since the pioneering work on DGD [9], in which each node performs gradient descent and simultaneously communicates its decision vector with its neighbors for consensus. As DGD requires a diminishing stepsize, which may slow the convergence rate, gradient tracking (GT) based algorithms using constant stepsizes were subsequently developed [10,11] and have been extensively investigated under various scenarios [12,13,14,15], to name a few. However, GT-based methods require transmitting both the decision vector and an additional gradient estimation vector, which increases the communication cost. In parallel, another type of decentralized algorithm based on the alternating direction method of multipliers (ADMM) has been proposed and analyzed [16,17]. Compared with GT-based algorithms, ADMM-type algorithms can achieve the same convergence rate while requiring the transmission of only the decision vector, which can be more communication-efficient. Following this line, several decentralized optimization algorithms have been proposed to accelerate the convergence rate by introducing second-order information [18,19,20,21]. More recently, Ref. [22] proposed a family of decentralized curvature-aided primal-dual algorithms, which covers gradient, Newton, and BFGS types of updates.
In decentralized algorithms, improving communication efficiency is of great significance. Existing methods can be classified into two types. The first adopts a compressed communication scheme, using quantization [23] or sparsification [24,25] techniques to reduce the communication overhead per transmission. Recently, the compressed communication scheme has been combined with DGD [26,27], GT-based algorithms [28,29], and ADMM-type algorithms [30,31]. The second employs an intermittent communication scheme, which aims to reduce the communication frequency. Such methods include event-triggered communication [32,33,34], the lazy aggregation scheme (LAG) [35], etc. Besides, there are also some works combining both methods [36,37,38]. In particular, Ref. [39] combined event-triggered and compressed communication schemes with an ADMM-type algorithm and proposed a communication-efficient decentralized second-order optimization algorithm, which improves both computation and communication efficiency. It is worth noting that the information distortion caused by the compressed communication scheme may have a negative effect on the convergence performance of decentralized optimization algorithms. To overcome this drawback, Ref. [40] developed an error-feedback communication scheme (EF21) to avoid the negative effect of information distortion. Recently, a more general efficient communication scheme termed the three-point compressor (3PC) was proposed in [41], which provides a unified method including EF21 and LAG as special cases. However, the 3PC scheme has only been investigated in distributed gradient descent algorithms under the parameter-server framework.
Despite this progress, communication-efficient decentralized optimization algorithms over general networks still lack an in-depth analysis, especially for objectives with a non-differentiable part, i.e., decentralized composite optimization problems. Note that such problems have wide applications in the field of machine learning due to the presence of non-differentiable regularization terms. Currently, some decentralized composite optimization algorithms have been proposed [22,42,43,44], but without employing efficient communication schemes. Moreover, to the best of our knowledge, no work has been reported on communication-efficient decentralized composite optimization using second-order information. To fill this gap, in this paper, we incorporate the general efficient communication scheme 3PC [41] into the ADMM-based decentralized optimization framework [22], which results in a family of communication-efficient decentralized composite optimization algorithms with theoretical guarantees. It is worth noting that such an incorporation is not trivial, as we need to overcome the negative effect arising from the propagation of communication errors over the network. The main contributions of this work are summarized in the following two aspects:
- First, we propose a flexible framework termed CE-DADMM for communication-efficient decentralized composite optimization problems. The framework not only encompasses some existing algorithms, such as COLA [32] and CC-DQM [39], but also introduces several new algorithms. Specifically, by incorporating quasi-Newton updates into CE-DADMM, we derive CE-DADMM-BFGS, the first communication-efficient decentralized second-order algorithm for composite optimization. Compared with CC-DQM, it avoids computing the Hessian matrix and its inverse, significantly reducing the computational cost. Compared with DRUID [22], CE-DADMM reduces the communication cost owing to the efficient communication scheme.
- Second, we theoretically prove that CE-DADMM achieves exact linear convergence under the assumption of strong convexity by carefully analyzing the mixing error arising from the efficient communication scheme and the disagreement of decision vectors. The dependency of the convergence rate on the parameters of the compression mechanism is also established. Additionally, extensive numerical experiments are presented to substantiate the superior performance of our algorithms in terms of communication efficiency.
Notation. If not specified, and represent the Euclidean norm and the spectral norm, respectively. For a positive definite matrix , let . Use to denote the set . The proximal mapping for a function is defined by . Let represent the d-dimensional identity matrix, and represent the Kronecker product of matrices and .
2. Problem Setting
In this paper, we study the decentralized composite optimization problem on an undirected connected network with n agents (or, nodes), which takes the form
where refers to the decision vector, is a convex and smooth function accessible only to node i, and is a convex (possibly non-smooth) regularizer.
Next, we equivalently reformulate problem (1) into a compact form in terms of the whole network, following the same idea as in [22]. Denote the communication graph as , where is the set of agents and is the set of edges, which contains the pair if and only if agent i can communicate with agent j. There are no self-loops in , i.e., for any . Note that the edges in are enumerated in arbitrary order, with denoting the k-th edge, where and is the number of edges. The neighbor set of agent i is . Let and be the local decision vectors corresponding to the i-th node and the k-th edge, respectively. We assume that is connected. Then, problem (1) is equivalent to the following constrained form:
where is an auxiliary variable for decoupling the smooth and non-smooth functions. Denote the optimal solutions of problems (1) and (2) as and , respectively. It is straightforward to verify that for all and .
In what follows, define and . In addition, define two matrices and as follows: the k-th row of both and represents the k-th edge . Specifically, the entries and are both equal to 1 if and only if the edge ; otherwise, they are 0. We also define , where is a vector with a 1 at its l-th position and 0 elsewhere. Clearly, the matrix extracts the component of that corresponds to agent l, meaning that . Let . Then, problem (2) can be written as
Note that problem (3) is written from the network level, which will be the basis for designing our algorithm.
3. Algorithm Formulation
In this section, we first introduce the basic iterations of our algorithm based on ADMM method. Then, by combining compressed communication techniques with the ADMM-based algorithm, we will devise our algorithm and discuss its relationship with existing algorithms.
3.1. Background: ADMM-Based Algorithm
ADMM is a powerful tool to solve an optimization problem with several blocks of variables. To apply ADMM for solving problem (3), define its augmented Lagrangian as:
where and are positive constants, and are Lagrange multipliers. Then, the k-th iteration of ADMM is written as
Define , , , , . In addition, define , and similarly for , and . Similar to [22], if we initialize the multipliers with , , then it holds that . Let , and approximate the augmented Lagrangian in (5a) by employing a second-order expansion at as
where is short for , and is an invertible matrix representing the approximated Hessian of the augmented Lagrangian, then the iteration (5) can be simplified as
Compared with (5), the iteration (6) contains fewer vectors by eliminating the vector and replacing by to halve the dimension of . Note that the iteration (6) is written in terms of the whole network. To implement (6) in a decentralized manner, we require the matrix to be block-diagonal, so that each block can be computed independently by each agent. Here, we assume that . The choice of will be discussed later. Then, agent i performs the following iteration:
where if , otherwise .
3.2. Communication-Efficient Decentralized ADMM
Recalling the iterations (7a) and (7b), it can be seen that agent i communicates the information on to its neighbors at each iteration. However, such communication might not be feasible in scenarios with limited communication resources. To reduce the communication overhead, we introduce the idea of a compressed communication scheme into iteration (7), which results in our algorithm, termed communication-efficient decentralized ADMM for composite optimization (CE-DADMM).
To compress the communication, we first give the definition of a compressor.
Definition 1
(Compressor). A randomized map is called a compressor if there exists a constant such that holds for any .
In Definition 1, the compressor is characterized by the relationship between the compression error and the original state. Clearly, refers to the compression ratio. We can also call a -compressor. In our algorithm, we do not apply directly to the transmitted , as this would lead to a non-vanishing compression error. Instead, we adopt a general compressor termed the three-point compressor (3PC) [41], whose definition is given below.
Definition 2
(Three-Point Compressor, see [41]). A randomized map is called a three-point compressor (3PC) if there exist constants and such that
where and are parameters of the compressor.
A 3PC map can be realized using a -compressor . Two examples of are given below [41]:
It can be checked that and for EF21, and and for CLAG.
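To make these compression mechanisms concrete, the following is a minimal Python sketch, under our own assumptions, of a Top-K contractive compressor together with EF21- and CLAG-style 3PC maps; the function names and the CLAG trigger threshold `zeta` are illustrative and are not taken verbatim from the paper or from [41].

```python
import numpy as np

def top_k(v, k):
    """Top-K compressor: keep the k largest-magnitude entries of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef21_3pc(h_prev, y_new, k):
    """EF21 viewed as a 3PC map: move the running estimate h toward the new
    point by a compressed correction, so the compression error is fed back."""
    return h_prev + top_k(y_new - h_prev, k)

def clag_3pc(h_prev, y_prev, y_new, k, zeta=1.0):
    """CLAG-style 3PC map (illustrative trigger): send a compressed correction
    only when the estimate is sufficiently outdated; otherwise skip communication."""
    if np.sum((y_new - h_prev) ** 2) > zeta * np.sum((y_new - y_prev) ** 2):
        return h_prev + top_k(y_new - h_prev, k)
    return h_prev  # no transmission this round
```

In both maps only the compressed correction needs to be transmitted, since the receiver can maintain the same running estimate.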
Next, we formulate our algorithm based on 3PC, whose pseudocode is presented in Algorithm 1. We introduce a new state related to , which represents the estimate of held by agent i’s neighbors. The computation of relies on , and since consists of compressed information, this significantly reduces the computational cost associated with the inverse of . Then, at iteration t, agent i transmits the compressed vector rather than to its neighbors, which leads to the new iterations below:
Algorithm 1 CE-DADMM
1: Initialization: , , , , , .
2: for t = 0, 1, … do
3:  for agent i do
4:   Compute using (13), (14), or (15) according to its choice;
5:   Compute using (11a);
6:   // Compressing information
7:   Broadcast to neighbors
8:   
9:   if then // Dealing with the non-smooth function
10:   
11:   
12:  end if
13: end for
14: end for
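For illustration only, the snippet below simulates, in a single process, the compress-and-broadcast pattern of steps 6–8 of Algorithm 1: each agent compresses the difference between its new state and the shared estimate, and every copy of that estimate is updated with the same correction. The local update itself (steps 4–5, which depend on (11a) and the choice of the Hessian approximation) is left abstract, and all names are our own, not the authors' code.

```python
# Minimal sketch of one communication round of Algorithm 1 (steps 6-8),
# reusing top_k from the previous sketch.
def communication_round(x, h, k):
    """x[i]: agent i's new local decision vector after its primal update;
    h[i]: the estimate of x[i] shared by agent i and its neighbors."""
    for i in x:
        correction = top_k(x[i] - h[i], k)   # compressed message broadcast by agent i
        h[i] = h[i] + correction             # sender and receivers apply the same update
    return h
```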
3.3. Discussion
Our algorithm CE-DADMM presents a flexible framework that accommodates gradient updates, Newton updates, and quasi-Newton updates, depending on the choice of matrix . Here, has the following general structure:
where is used to provide additional robustness and is a matrix to be determined. A detailed discussion is presented below.
Case 1: Gradient Updates. By choosing , (12) reduces to
Clearly, is diagonal. The computation of requires computational cost. Compared with COLA [45], CE-DADMM accounts for the presence of the non-smooth term and allows for more options in the choice of the compression mechanism . When is excluded, only the lazy aggregation compression mechanism is applied, and gradient updates are used, CE-DADMM recovers the form of COLA.
Case 2: Newton Updates. By choosing , (12) reduces to
According to the definition of , is a block diagonal matrix with the ith block being . The computation of incurs computational cost. When CE-DADMM uses Newton updates, excludes , and adopts the same communication compression approach as CC-DQM [39], it recovers the form consistent with CC-DQM.
Case 3: Quasi-Newton Updates. Inspired by the distributed BFGS scheme in [22], we can derive a novel decentralized algorithm termed CE-DADMM-BFGS, which combines the BFGS method with communication-efficient mechanisms. According to the secant condition, each agent i constructs a model of the inverse Hessian directly using the pairs defined as
The Hessian inverse approximation is then iteratively updated as:
Notably, the explicit inverse of is unnecessary, as this expression serves merely as a formal representation. Consequently, the computational cost for each agent is reduced from to .
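For reference, a standard BFGS inverse-Hessian update for a curvature pair (s, y) is sketched below; how the pair is actually formed in CE-DADMM-BFGS is given by the authors' construction above, which we leave abstract here, and the curvature safeguard is our own addition.

```python
import numpy as np

def bfgs_inverse_update(B_inv, s, y, eps=1e-12):
    """One standard BFGS update of the inverse-Hessian approximation B_inv
    using the curvature pair (s, y). If the curvature condition y^T s > 0
    fails numerically, the previous approximation is kept (a common safeguard)."""
    sy = float(y @ s)
    if sy <= eps:
        return B_inv
    rho = 1.0 / sy
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ B_inv @ V.T + rho * np.outer(s, s)
```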
4. Convergence Analysis
In this section, we propose a unified framework to analyze the proposed algorithms that incorporate gradient, Newton, and BFGS updates, along with the communication-efficient mechanism. First, we make the following assumptions, which hold throughout the paper.
Assumption 1.
Each is twice continuously differentiable, -strongly convex, and –smooth, i.e., , where . is proper, closed, and convex, i.e., holds for any subgradients and .
Assumption 2.
Each is Lipschitz continuous with constant , i.e., holds for any
Assumption 3.
Each is uniformly upper bounded, i.e., for any , there exists a constant such that
It is worth noting that Assumption 3 is only required for the quasi-Newton update case. Next, we introduce the optimality condition of problem (3), which is independent of the algorithm and has been proved in [22]. The result is given below.
Lemma 1
(Optimality condition, see Lemma 2 in [22]). A pair is a primal-dual optimal pair of problem (3) if and only if the following holds:
Moreover, there exists a unique dual optimal pair that lies in the column space of .
Now, we are ready to analyze our algorithm CE-DADMM. First, we write (11) into a compact form:
According to the discussion in Section 3.1, and hold in (17). Then, it follows from Lemma 1 that the convergence of CE-DADMM can be established by showing that converges to .
Due to the presence of the efficient communication scheme, we need to analyze the impact of the communication error on the convergence of CE-DADMM. Define , where for every agent i. Clearly, describes the error caused by the efficient communication scheme. Regarding , the following result holds.
Lemma 2.
The error in CE-DADMM satisfies , where A and B are the parameters of 3PC.
Proof.
According to the definition of 3PC, we have
which completes the proof. □
Clearly, if 3PC is set as EF21 (9), it follows from Lemma 2 that
If 3PC is set as CLAG (10), we have
Then, to characterize the suboptimality of the iterates when (5a) is replaced by (17a), we introduce the following error term:
The bound on the error term (18) is given below, which is important for our main result.
Lemma 3.
It holds that , where and correspond to the update case as below:
Proof.
See Appendix A. □
Lemma 3 extends the results in [17,18] by providing an upper bound on the error introduced when the exact sub-optimization step (5a) is replaced with a one-step update using the compressed variable (17a). Under Newton updates, as the error approaches zero, the term becomes smaller than in (20). In the case of quasi-Newton updates, the error remains bounded by a constant, ensuring that it does not grow indefinitely.
Let and denote the maximum and minimum eigenvalues of , respectively. Let denote the maximum eigenvalue of . Denote by the smallest positive eigenvalue of , where is given in Lemma 1. Define , , and . Consider the following Lyapunov function:
Clearly, converging to zero implies that converges to the optimal solution. We will use to establish the convergence result of our algorithm, which is given below.
Theorem 1.
Suppose Assumptions 1–3 hold. Let , , , and . If ζ satisfies
then the iterates generated by CE-DADMM satisfy , where
Proof.
See Appendix B. □
Clearly, Theorem 1 implies that CE-DADMM achieves exact linear convergence at a rate of with . The larger is, the faster CE-DADMM converges. To ensure that is positive in (23), the step size should not be too large. Besides, excessive compression of should be avoided. This is because, when the compression ratio A is very small, B approaches infinity. According to (23), as , the first term in (23) becomes negative. It is worth noting that when , the compression ratio A can be arbitrarily small, making the approach highly applicable in scenarios with extremely limited bandwidth. In this situation, we also obtain the fastest convergence rate and the smallest communication cost. However, setting implies solving the subproblem (5a) exactly, which may result in a high computational cost. This reveals a trade-off between communication cost and computation cost in decentralized optimization.
5. Numerical Experiments
In this section, CE-DADMM is compared with existing state-of-the-art algorithms, including DRUID [22], PG-EXTRA [42], P2D2 [43], and CC-DQM [39], on distributed logistic/ridge/LASSO regression problems. Note that CC-DQM does not support non-smooth terms.
Datasets. We use real-world datasets from the LIBSVM library (Available Online: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html, accessed on 3 December 2024): a9a (32,561 samples, 123 dimensions) and ijcnn1 (49,990 samples, 22 dimensions). The samples are evenly distributed across n agents. The distribution of samples across agents for the a9a dataset (left) and ijcnn1 (right) is shown in Figure 1. Our experiments are implemented in Python 3.10.13.
Figure 1.
Distribution of samples across agents for the a9a dataset (left) and ijcnn1 (right).
Experimental setting. The communication graph is randomly generated, with connections among agents drawn from a Bernoulli distribution (), as shown in Figure 2. We evaluate performance based on the total communication bits and the number of iterations. CE-DADMM employs the compression mechanisms EF21 and CLAG, using a Top-K compressor that keeps 30 dimensions for the a9a dataset and 6 dimensions for the ijcnn1 dataset. In the experiments, we examine the algorithms from two perspectives: the number of iterations and the total communication bits. The number of iterations refers to the number of iterations the algorithm performs, while the total communication bits are calculated from the cumulative number of bits of the variables transmitted between agents. Additionally, we define to measure the algorithm’s convergence.
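As an illustration of the communication accounting (our own assumed encoding, not necessarily the one used in the paper), the snippet below applies the Top-K settings above and counts transmitted bits per message assuming 64-bit values plus 32-bit indices for each retained entry.

```python
# Hypothetical accounting of one compressed transmission (assumed encoding).
d_a9a, k_a9a = 123, 30          # a9a: 123-dimensional model, Top-30 kept
d_ijcnn1, k_ijcnn1 = 22, 6      # ijcnn1: 22-dimensional model, Top-6 kept

def bits_per_message(k, bits_value=64, bits_index=32):
    """Bits needed to send k (value, index) pairs of a Top-K-compressed vector."""
    return k * (bits_value + bits_index)

print(bits_per_message(k_a9a), "bits vs", d_a9a * 64, "bits uncompressed (a9a)")
print(bits_per_message(k_ijcnn1), "bits vs", d_ijcnn1 * 64, "bits uncompressed (ijcnn1)")
```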
Figure 2.
Random communication graph of network with 10 agents.
5.1. Distributed Logistic Regression
The distributed logistic regression solves problem (1) with and defined as:
where represents the feature vector, denotes the label, and is the number of samples held by agent i. The parameters and are regularization parameters.
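Since the displayed formula may not have carried over, the sketch below gives the standard form we assume for the local objective: the smooth part f_i is the average logistic loss over agent i's samples plus an l2 term, while the l1 term plays the role of the non-smooth regularizer g (our assumption of the usual setting; the paper's exact scaling may differ).

```python
import numpy as np

def f_i(theta, A_i, b_i, lam2):
    """Assumed smooth local objective: average logistic loss + (lam2/2)*||theta||^2.
    A_i: (m_i, d) feature matrix of agent i; b_i: labels in {-1, +1}."""
    z = -b_i * (A_i @ theta)
    return np.mean(np.logaddexp(0.0, z)) + 0.5 * lam2 * theta @ theta

def grad_f_i(theta, A_i, b_i, lam2):
    """Gradient of the assumed smooth local objective."""
    z = -b_i * (A_i @ theta)
    sigma = 1.0 / (1.0 + np.exp(-z))              # sigmoid(z)
    return -(A_i * (b_i * sigma)[:, None]).mean(axis=0) + lam2 * theta
```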
In Figure 3 and Figure 4, when measured by the number of iterations, CE-DADMM with the EF21 and CLAG compression mechanisms performs on par with DRUID without compression and surpasses P2D2 and PG-EXTRA in both convergence speed and accuracy. Additionally, we observe that CE-DADMM with (quasi-)Newton updates significantly reduces the number of iterations required to reach a given accuracy compared with first-order methods. When measured by total communication bits, the introduction of EF21 and CLAG in CE-DADMM allows for a substantial reduction in communication overhead compared with DRUID, even under the same update scheme. Notably, when using quasi-Newton updates, CE-DADMM requires fewer communication bits to achieve the same accuracy than DRUID with Newton updates, and it also outperforms P2D2 and PG-EXTRA in this regard. For the convergence accuracy targets listed in Table 1, the detailed numerical results are presented in Table 2 and Table 3. Note that for P2D2 and PG-EXTRA, since they fail to reach the predefined convergence accuracy, we use the number of iterations required for them to converge to their optimal values as a benchmark.
Figure 3.
Performance comparison of distributed logistic regression on the a9a dataset: Plots of iteration number (left) and total communication bits (right) versus distance error.
Figure 4.
Performance comparison of distributed logistic regression on the ijcnn1 dataset: Plots of iteration number (left) and total communication bits (right) versus distance error.
Table 1.
Convergence accuracy () for different experiments.
Table 2.
Comparison of iterations.
Table 3.
Comparison of communication bits.
5.2. Distributed Ridge Regression
In Figure 5 and Figure 6, we observe that, when measured by the number of iterations, CE-DADMM with the EF21 and CLAG compression mechanisms performs similarly to DRUID without compression. Moreover, CE-DADMM using (quasi-)Newton updates converges faster than CC-DQM, which also employs a communication-efficient mechanism. On the other hand, when CE-DADMM employs the first-order update, its convergence is slower than that of the second-order method CC-DQM, as the latter benefits from additional Hessian information. When measured by total communication bits, CE-DADMM with (quasi-)Newton updates requires fewer bits to achieve the same accuracy than CC-DQM. Additionally, CE-DADMM benefits from the communication-efficient mechanisms, resulting in a substantial reduction in the communication bits needed to achieve the same convergence accuracy compared with DRUID. The corresponding numerical results are also presented in Table 2 and Table 3.
Figure 5.
Performance comparison of distributed ridge regression on the a9a dataset: Plots of iteration number (left) and total communication bits (right) versus distance error.
Figure 6.
Performance comparison of distributed ridge regression on the ijcnn1 dataset: Plots of iteration number (left) and total communication bits (right) versus distance error.
5.3. Distributed LASSO
The distributed LASSO solves problem (1) with and defined as:
where , , and are as defined in Section 5.1. The parameters and are regularization parameters.
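Here the non-smooth term g is (assumed to be) the l1 regularizer, whose proximal mapping, used in the non-smooth step of Algorithm 1, is the elementwise soft-thresholding operator; a minimal sketch with an illustrative step size gamma follows.

```python
import numpy as np

def prox_l1(v, tau):
    """Proximal mapping of tau*||.||_1: elementwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

# Illustrative use: for g(theta) = lam1 * ||theta||_1 and a step size gamma,
# the proximal update would read z_new = prox_l1(argument, gamma * lam1),
# where 'argument' depends on the ADMM iterates (left abstract here).
```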
In Figure 7 and Figure 8, when measured by the number of iterations, CE-DADMM performs similarly to DRUID and outperforms P2D2 and PG-EXTRA in terms of convergence speed and accuracy, indicating that the introduction of the 3PC compression mechanism does not negatively affect the convergence speed of CE-DADMM. When measured by total communication bits, after introducing the EF21 and CLAG compression mechanisms, CE-DADMM significantly reduces the communication overhead compared with DRUID, while also outperforming P2D2 and PG-EXTRA; this demonstrates that the communication compression mechanisms can significantly lower communication costs between agents. The corresponding numerical results are also presented in Table 2 and Table 3.
Figure 7.
Performance comparison of distributed LASSO on the a9a dataset: Plots of iteration number (left) and total communication bits (right) versus distance error.
Figure 8.
Performance comparison of distributed LASSO on the ijcnn1 dataset: Plots of iteration number (left) and total communication bits (right) versus distance error.
6. Conclusions
This paper presents a communication-efficient decentralized ADMM algorithm for composite optimization, named CE-DADMM. The algorithm combines the ADMM framework with the 3PC communication mechanism, effectively adapting to various communication and computational demands while balancing communication efficiency and computational cost. Notably, when employing quasi-Newton updates, CE-DADMM becomes the first compression-based communication-efficient second-order algorithm for composite optimization. Theoretical analysis demonstrates that the proposed algorithm achieves exact linear convergence when the local objective functions are strongly convex. Numerical experiments further validate the effectiveness and superior performance of the algorithm. Future work will focus on extending the algorithm to fully asynchronous settings and stochastic problems.
Author Contributions
Z.C.: Conceptualization, writing—original draft and methodology; Z.Z.: Conceptualization, writing—original draft and methodology; S.Y.: writing—review and editing and supervision; J.C.: writing—review and editing and supervision. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62176056, and in part by the Young Elite Scientists Sponsorship Program by the China Association for Science and Technology (CAST) under Grant 2021QNRC001.
Data Availability Statement
The data supporting the findings of this study are openly available in a9a and ijcnn1 at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html (accessed on 3 December 2024).
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.
Appendix A. Proof of Lemma 3
Proof of Lemma 3.
First, according to (18), it follows from the triangle inequality and the Cauchy–Schwarz inequality that
For case 1 (gradient updates), we have . By Assumption 1, we obtain .
For case 2 (Newton updates), we have . Applying Assumption 1 and (A1) yields
In parallel, by the fundamental theorem of calculus, we can obtain
which implies that
For case 3 (quasi-Newton updates), according to Assumption 3, the secant condition , and the definition of , we have
which completes the proof of case 3. □
Appendix B. Proof of Theorem 1
Lemma A1.
Let be the unique dual optimal pair which lies in the column space of as established in Lemma 1. The following inequality holds:
Proof.
We rewrite (17b) and (17d) as
First, we show that the column space of belongs to the column space of . We fix all as , i.e., , then it holds that , which shows . By setting , we conclude that lies in the column space of . □
Lemma A2.
The following two inequalities hold:
Proof.
From the definition of the proximal operator, it holds that
By the optimality condition of (A9) and the dual update (17d), we obtain
which implies that . Then, it follows from the convexity of that
Similarly, using (16b), we also have
The proof is completed. □
Proof of Theorem 1.
To make the proof more concise, let . As is strongly convex with Lipschitz continuous gradient, the following inequality holds:
For , we have
For , since , it follows from (16c) that
For , using and (16d), there is
For , using (16e), we have
For , using and (16c), we have
For , using (17d) and (16e), we have
Substituting (A11)–(A16) into (A10) yields
where and can be further estimated as
and
where (A19) uses the fact that the largest eigenvalue of is 1. Finally, substituting (A18) and (A19) into (A17), we obtain
Recall the definitions of and , and define and . The inequalities and hold, along with the assumption that . Thus, (A20) can be rewritten as
To establish linear convergence, we need to show the following holds for some :
Note that
Next, we establish an upper bound for each component of (A23), primarily using the inequality and . First, according to Lemma A1, we obtain
Next, since , we have
Then, from (17d) and (16e), we obtain
References
- Olfati-Saber, R.; Fax, J.A.; Murray, R.M. Consensus and cooperation in networked multi-agent systems. Proc. IEEE 2007, 95, 215–233. [Google Scholar] [CrossRef]
- Yoo, S.J.; Park, B.S. Dynamic event-triggered prescribed-time consensus tracking of nonlinear time-delay multiagent systems by output feedback. Fractal Fract. 2024, 8, 545. [Google Scholar] [CrossRef]
- Liu, H.J.; Shi, W.; Zhu, H. Distributed voltage control in distribution networks: Online and robust implementations. IEEE Trans. Smart Grid 2017, 9, 6106–6117. [Google Scholar] [CrossRef]
- Molzahn, D.K.; Dorfler, F.; Sandberg, H.; Low, S.H.; Chakrabarti, S.; Baldick, R.; Lavaei, J. A survey of distributed optimization and control algorithms for electric power systems. IEEE Trans. Smart Grid 2017, 8, 2941–2962. [Google Scholar] [CrossRef]
- Liu, Y.F.; Chang, T.H.; Hong, M.; Wu, Z.; So, A.M.C.; Jorswieck, E.A.; Yu, W. A survey of recent advances in optimization methods for wireless communications. IEEE J. Sel. Areas Commun. 2024, 42, 2992–3031. [Google Scholar] [CrossRef]
- Huang, J.; Zhou, S.; Tu, H.; Yao, Y.; Liu, Q. Distributed optimization algorithm for multi-robot formation with virtual reference center. IEEE/CAA J. Autom. Sin. 2022, 9, 732–734. [Google Scholar] [CrossRef]
- Yang, X.; Zhao, W.; Yuan, J.; Chen, T.; Zhang, C.; Wang, L. Distributed optimization for fractional-order multi-agent systems based on adaptive backstepping dynamic surface control technology. Fractal Fract. 2022, 6, 642. [Google Scholar] [CrossRef]
- Liu, J.; Zhang, C. Distributed learning systems with first-order methods. Found. Trends Databases 2020, 9, 1–100. [Google Scholar] [CrossRef]
- Nedic, A.; Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 2009, 54, 48–61. [Google Scholar] [CrossRef]
- Nedic, A.; Olshevsky, A.; Shi, W. Achieving geometric convergence for distributed optimization over time-varying graphs. SIAM J. Optim. 2017, 27, 2597–2633. [Google Scholar] [CrossRef]
- Xu, J.; Zhu, S.; Soh, Y.C.; Xie, L. Convergence of asynchronous distributed gradient methods over stochastic networks. IEEE Trans. Autom. Control 2018, 63, 434–448. [Google Scholar] [CrossRef]
- Wen, X.; Luan, L.; Qin, S. A continuous-time neurodynamic approach and its discretization for distributed convex optimization over multi-agent systems. Neural Netw. 2021, 143, 52–65. [Google Scholar] [CrossRef]
- Feng, Z.; Xu, W.; Cao, J. Alternating inertial and overrelaxed algorithms for distributed generalized Nash equilibrium seeking in multi-player games. Fractal Fract. 2021, 5, 62. [Google Scholar] [CrossRef]
- Che, K.; Yang, S. A snapshot gradient tracking for distributed optimization over digraphs. In Proceedings of the CAAI International Conference on Artificial Intelligence, Beijing, China, 27–28 August 2022; pp. 348–360. [Google Scholar]
- Zhou, S.; Wei, Y.; Liang, S.; Cao, J. A gradient tracking protocol for optimization over Nabla fractional multi-agent systems. IEEE Trans. Signal Inf. Process. Over Netw. 2024, 10, 500–512. [Google Scholar] [CrossRef]
- Shi, W.; Ling, Q.; Wu, G.; Yin, W. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 2015, 25, 944–966. [Google Scholar] [CrossRef]
- Ling, Q.; Shi, W.; Wu, G.; Ribeiro, A. DLM: Decentralized linearized alternating direction method of multipliers. IEEE Trans. Signal Process. 2015, 63, 4051–4064. [Google Scholar] [CrossRef]
- Mokhtari, A.; Shi, W.; Ling, Q.; Ribeiro, A. DQM: Decentralized quadratically approximated alternating direction method of multipliers. IEEE Trans. Signal Process. 2016, 64, 5158–5173. [Google Scholar] [CrossRef]
- Eisen, M.; Mokhtari, A.; Ribeiro, A. A primal-dual quasi-Newton method for exact consensus optimization. IEEE Trans. Signal Process. 2019, 67, 5983–5997. [Google Scholar] [CrossRef]
- Mansoori, F.; Wei, E. A fast distributed asynchronous Newton-based optimization algorithm. IEEE Trans. Autom. Control 2019, 65, 2769–2784. [Google Scholar] [CrossRef]
- Jiang, X.; Qin, S.; Xue, X.; Liu, X. A second-order accelerated neurodynamic approach for distributed convex optimization. Neural Netw. 2022, 146, 161–173. [Google Scholar] [CrossRef] [PubMed]
- Li, Y.; Voulgaris, P.G.; Stipanović, D.M.; Freris, N.M. Communication efficient curvature aided primal-dual algorithms for decentralized optimization. IEEE Trans. Autom. Control 2023, 68, 6573–6588. [Google Scholar] [CrossRef]
- Alistarh, D.; Grubic, D.; Li, J.Z.; Tomioka, R.; Vojnovic, M. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Proceedings of the 30th NeurIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 1710–1721. [Google Scholar]
- Wangni, J.; Wang, J.; Liu, J.; Zhang, T. Gradient sparsification for communication-efficient distributed optimization. In Proceedings of the 31st NeurIPS 2018, Montreal, QC, Canada, 2–8 December 2018; pp. 1306–1316. [Google Scholar]
- Stich, S.U.; Cordonnier, J.B.; Jaggi, M. Sparsified SGD with memory. In Proceedings of the 31st NeurIPS 2018, Montreal, QC, Canada, 2–8 December 2018; pp. 4447–4458. [Google Scholar]
- Doan, T.T.; Maguluri, S.T.; Romberg, J. Fast convergence rates of distributed subgradient methods with adaptive quantization. IEEE Trans. Autom. Control 2020, 66, 2191–2205. [Google Scholar] [CrossRef]
- Taheri, H.; Mokhtari, A.; Hassni, H.; Pedarsani, R. Quantized decentralized stochastic learning over directed graphs. In Proceedings of the 37th ICML, Virtual, 13–18 July 2020; pp. 9324–9333. [Google Scholar]
- Song, Z.; Shi, L.; Pu, S.; Yan, M. Compressed gradient tracking for decentralized optimization over general directed networks. IEEE Trans. Signal Process. 2022, 70, 1775–1787. [Google Scholar] [CrossRef]
- Xiong, Y.; Wu, L.; You, K.; Xie, L. Quantized distributed gradient tracking algorithm with linear convergence in directed networks. IEEE Trans. Autom. Control 2022, 68, 5638–5645. [Google Scholar] [CrossRef]
- Zhu, S.; Hong, M.; Chen, B. Quantized consensus ADMM for multi-agent distributed optimization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4134–4138. [Google Scholar]
- Elgabli, A.; Park, J.; Bedi, A.S.; Issaid, C.B.; Bennis, M.; Aggarwal, V. Q-GADMM: Quantized group ADMM for communication efficient decentralized machine learning. IEEE Trans. Commun. 2020, 69, 164–181. [Google Scholar] [CrossRef]
- Li, W.; Liu, Y.; Tian, Z.; Ling, Q. Communication-censored linearized ADMM for decentralized consensus optimization. IEEE Trans. Signal Inf. Process. Over Netw. 2020, 6, 18–34. [Google Scholar] [CrossRef]
- Gao, L.; Deng, S.; Li, H.; Li, C. An event-triggered approach for gradient tracking in consensus-based distributed optimization. IEEE Trans. Netw. Sci. Eng. 2021, 9, 510–523. [Google Scholar] [CrossRef]
- Zhang, Z.; Yang, S.; Xu, W.; Di, K. Privacy-preserving distributed ADMM with event-triggered communication. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 2835–2847. [Google Scholar] [CrossRef]
- Chen, T.; Giannakis, G.; Sun, T.; Yin, W. LAG: Lazily aggregated gradient for communication-efficient distributed learning. Adv. Neural Inf. Process. Syst. 2018, 31, 5050–5060. [Google Scholar]
- Sun, J.; Chen, T.; Giannakis, G.; Yang, Z. Communication-efficient distributed learning via lazily aggregated quantized gradients. Adv. Neural Inf. Process. Syst. 2019, 32, 3370–3380. [Google Scholar]
- Singh, N.; Data, D.; George, J.; Diggavi, S. SPARQ-SGD: Event-triggered and compressed communication in decentralized optimization. IEEE Trans. Autom. Control 2022, 68, 721–736. [Google Scholar] [CrossRef]
- Yang, X.; Yuan, J.; Chen, T.; Yang, H. Distributed adaptive optimization algorithm for fractional high-order multiagent systems based on event-triggered strategy and input quantization. Fractal Fract. 2023, 7, 749. [Google Scholar] [CrossRef]
- Zhang, Z.; Yang, S.; Xu, W. Decentralized ADMM with compressed and event-triggered communication. Neural Netw. 2023, 165, 472–482. [Google Scholar] [CrossRef]
- Richtárik, P.; Sokolov, I.; Fatkhullin, I. EF21: A new, simpler, theoretically better, and practically faster error feedback. In Proceedings of the 34th NeurIPS, Virtual, 6–14 December 2021; pp. 4384–4396. [Google Scholar]
- Richtarik, P.; Sokolov, I.; Fatkhullin, I.; Gasanov, E.; Li, Z.; Gorbunov, E. 3PC: Three point compressors for communication-efficient distributed training and a better theory for Lazy aggregation. In Proceedings of the 39th ICML, Baltimore, MD, USA, 17–23 July 2022; pp. 18596–18648. [Google Scholar]
- Shi, W.; Ling, Q.; Wu, G.; Yin, W. A proximal gradient algorithm for decentralized composite optimization. IEEE Trans. Signal Process. 2015, 63, 6013–6023. [Google Scholar] [CrossRef]
- Alghunaim, S.; Yuan, K.; Sayed, A.H. A linearly convergent proximal gradient algorithm for decentralized optimization. In Proceedings of the 32nd NeurIPS, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Guo, L.; Shi, X.; Yang, S.; Cao, J. DISA: A dual inexact splitting algorithm for distributed convex composite optimization. IEEE Trans. Autom. Control 2024, 69, 2995–3010. [Google Scholar] [CrossRef]
- Li, W.; Liu, Y.; Tian, Z.; Ling, Q. COLA: Communication-censored linearized ADMM for decentralized consensus optimization. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5237–5241. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).