1. Introduction
In this paper, we mainly consider the saddle-point problem (1), where F is a convex and lower semi-continuous (l.s.c.) function given as a finite average of n convex, smooth, and l.s.c. component functions f_i, d is the dimension of the feature space, G* is the Fenchel conjugate of a convex (but possibly non-smooth) function G, and the dual feasible set is nonempty, closed, and convex. Such a problem seeks a trade-off between minimizing the objective over the primal variable x and maximizing it over the dual variable y. Many modern machine learning problems can be formulated in this way, such as total variation denoising [1,2,3], ℓ1-norm regularization problems [4], and image reconstruction [5].
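For concreteness, the standard bilinear saddle-point form consistent with this description is sketched below; the exact constraint sets and scaling of Problem (1) follow the paper's original statement, and the symbols here simply collect the quantities named above.
\[
\min_{x \in \mathbb{R}^{d}} \; \max_{y} \;\; F(x) \;+\; \langle A x,\, y \rangle \;-\; G^{*}(y),
\qquad
F(x) \;=\; \frac{1}{n}\sum_{i=1}^{n} f_i(x).
\]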
The saddle-point problem above can equivalently be solved through its primal problem. By the definition of the conjugate function G*, the primal problem can be written as the minimization of F(x) + G(Ax) over x, which appears frequently in the machine learning community, for instance in graph-guided fused Lasso (where G(Ax) is an ℓ1 penalty on Ax) and low-rank matrix recovery (where G is a multiple of the nuclear norm of a matrix). By introducing an auxiliary variable, the primal problem can be reformulated as an equality-constrained problem (sketched after this paragraph). Alternating direction methods of multipliers (ADMMs) [6,7,8,9] are common algorithms for such equality-constrained problems and have shown clear advantages. For example, for large-scale equality-constrained problems (i.e., when n is very large), the Online and Stochastic Alternating Direction Methods of Multipliers (OADMM [8] and SADMM [10]) have been proposed, though they only have suboptimal convergence rates. Thus, acceleration techniques such as those in [11,12,13,14,15,16] have been developed to successfully address the high variance of stochastic gradient estimators. Among them, stochastic variance reduced gradient (SVRG) methods such as those in [16,17] obtain a linear convergence rate for strongly convex (SC) problems. Katyusha [18], MIG [16], and ASVRG [19] further improve the convergence rates for non-strongly convex (non-SC) problems by designing different momentum tricks. Recently, some researchers have introduced these techniques into SADMM and proposed stochastic variants with faster convergence rates. SAG-ADMM [20] attains linear convergence for SC problems, but it requires extra memory to store all past gradients. Similarly, SDCA-ADMM [21] inherits the drawbacks of SDCA [22], which also calls for extra storage. In contrast, stochastic variance reduced gradient ADMM (SVRG-ADMM) [23] does not require extra storage while ensuring linear convergence for SC problems and an O(1/T) convergence rate for non-SC problems, and ASVRG-ADMM [24] further improves the non-SC rate to O(1/T²). However, as indirect methods for solving saddle-point problems, ADMM-type methods usually require that the proximal mapping of the regularizer G be easy to compute, and they need to update at least three vector variables at each iteration. Thus, when G is complex [25] or when solving large-scale structured regularization problems, ADMM-type methods may not be the first choice, and primal–dual algorithms [4,26] are more efficient.
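As a reference for the reformulation mentioned above, the sketch below writes the primal problem and its ADMM-style splitting; the auxiliary variable name z is ours. ADMM then alternates over x, z, and a Lagrange multiplier, which is the source of the "at least three vector variables" per iteration noted above.
\[
\min_{x} \; F(x) + G(Ax)
\quad\Longleftrightarrow\quad
\min_{x,\,z} \; F(x) + G(z) \quad \text{s.t.} \;\; Ax = z .
\]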
As another effective tool, primal–dual algorithms are prevalent for solving the saddle-point problem directly, such as Stochastic Primal–Dual Coordinate (SPDC) methods [26,27,28,29] and primal–dual hybrid gradient (PDHG) methods [25,30,31,32,33]. These methods alternate between maximizing over the dual variable y and minimizing over the primal variable x (see the sketch below). Thus, primal–dual algorithms update at least one fewer vector variable than ADMMs when solving the saddle-point problem, resulting in a lower per-iteration complexity. Due to such properties, primal–dual algorithms have been widely used in various machine learning applications [34,35,36,37,38,39].
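To make the alternating structure concrete, the following is a minimal sketch of one deterministic PDHG iteration in the style of [25], assuming the bilinear saddle-point form above; `prox_tau_F` and `prox_sigma_Gstar` are placeholder proximal operators supplied by the user, not functions defined in this paper.

```python
import numpy as np

def pdhg_step(x, y, x_bar, A, tau, sigma, theta, prox_tau_F, prox_sigma_Gstar):
    """One deterministic PDHG iteration (Chambolle-Pock style) for
    min_x max_y F(x) + <Ax, y> - G*(y).

    prox_tau_F(v)      : proximal operator of tau * F at v
    prox_sigma_Gstar(v): proximal operator of sigma * G* at v
    """
    # Dual ascent step using the extrapolated primal point x_bar.
    y_new = prox_sigma_Gstar(y + sigma * (A @ x_bar))
    # Primal descent step using the fresh dual iterate.
    x_new = prox_tau_F(x - tau * (A.T @ y_new))
    # Linear extrapolation (theta = 1 recovers the standard PDHG choice).
    x_bar_new = x_new + theta * (x_new - x)
    return x_new, y_new, x_bar_new
```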
Many SPDC algorithms obtain excellent performance when d is very large. The work [26] proposed the SPDC method with a linear convergence rate when the loss function is smooth and SC. In order to further reduce the per-iteration complexity, the work [27] proposed a stochastic primal–dual method, called SPD1, whose per-iteration cost is independent of the dimension d. By incorporating the variance reduction technique [13], SPD1-VR obtains linear convergence for SC problems. More generally, for empirical composition optimization problems, SVRPDA-I and SVRPDA-II in [28] both achieve linear convergence under smooth component functions and an SC regularization term.
In contrast, when n is very large, stochastic PDHG methods become a more competitive alternative for solving the saddle-point problem. For such large-scale problems, the deterministic PDHG algorithm [25] incurs an extremely expensive per-iteration cost because it requires a full pass over the data, and thus its stochastic version (SPDHG) [4] was developed. SPDHG selects only one sample for updating the primal variable at each iteration, which achieves the best possible sample complexity. However, SPDHG only obtains suboptimal convergence rates for both the SC and non-SC cases of Problem (1). Another work [40] proposed a stochastic accelerated primal–dual (APD) algorithm for solving bilinear saddle-point problems in the online setting, which matches the lower bound for that setting based on the primal–dual gap. The works [41,42] further focused on the finite-sum setting and analyzed primal–dual algorithms for the non-smooth case. For example, Song et al. [42] considered the non-smooth saddle-point problem by using the convex conjugate of the data fidelity term, but their method is limited to primal problems with simple regularizers. In contrast, our algorithms can solve more general structured regularization problems, so their algorithms are orthogonal to ours. Recently, several faster stochastic primal–dual methods have been proposed. The work [43] achieves a linear convergence rate under strong convexity of the regularizer instead of F. The algorithms proposed in [44] further achieve a complexity matching the lower bound when both F and the regularizer are SC. However, neither of these two works considers the non-SC setting. To bridge this gap, Zhao et al. [45] proposed a restart scheme that handles general convex–concave saddle-point problems rather than just the bilinear structure. Very recently, SVRG-PDFP [46] integrated the variance reduction technique with the primal–dual fixed-point method to solve the graph-guided logistic regression model and CT image reconstruction, and achieved an O(1/T) convergence rate for non-SC finite-sum problems; however, a gap remains between this rate and the lower bound [47]. Thus, it is essential to take advantage of the relative simplicity of primal–dual methods to design faster algorithms for solving the finite-sum problem (1).
1.1. Our Motivations
In this paper, we focus on the large sample regime (i.e., n is large). To solve the large-scale saddle-point problem more effectively, we mainly consider the following factors:
Computational cost per iteration: Although stochastic ADMMs can be used to solve the large-scale saddle-point problem, their per-iteration cost is still high. That is, stochastic ADMMs usually involve a positive semi-definite matrix Q and update at least three variables per iteration, which increases the computational cost.
Theoretical properties: SPDHG only employs a decaying step size to reduce the variance of the stochastic gradient estimator, which leads to a suboptimal convergence rate. Thus, there still exists a gap in the convergence rate between SPDHG and state-of-the-art stochastic methods. Recently, SVRG-PDFP improved the non-SC convergence rate to O(1/T), which only partially closes this gap.
Applications: SVRG-PDFP requires both F and the regularizer to be SC in order to obtain a linear convergence rate, which limits its applicability. SPDC can also reduce the number of updated variables and achieve a linear convergence rate for SC problems, as mentioned above. However, these methods require the regularized functions to be SC; for common regularizers (e.g., ℓ1-norm regularization), this condition is not satisfied.
These facts motivate us to design a more efficient primal–dual algorithm for solving the large-scale saddle-point problem (1).
1.2. Our Contributions
We address the above tricky issues by proposing efficient stochastic variance reduced primal–dual hybrid gradient methods, which have the following advantages:
Accelerated primal–dual algorithms: We propose novel primal–dual hybrid gradient methods (SVR-PDHG and ASVR-PDHG) to solve SC and non-SC objectives by integrating variance reduction and acceleration techniques. In our ASVR-PDHG algorithm, we design a new momentum acceleration step and a linear extrapolation step to further improve our theoretical and practical convergence speeds.
Better convergence rates: For non-SC problems, we rigorously prove that SVR-PDHG achieves an O(1/T) convergence rate and ASVR-PDHG attains an O(1/T²) convergence rate with respect to our convergence criterion. Moreover, our algorithms enjoy linear convergence rates for SC problems. As by-products, we also analyze their gradient complexity.
Lower computation cost: Our SVR-PDHG and ASVR-PDHG have simpler structures than SVRG-ADMM and ASVRG-ADMM, respectively, and they update one fewer vector variable than stochastic ADMMs, which reduces the per-iteration cost and explains why our algorithms perform better in practice.
More general applications: Our algorithms require fewer assumptions, which significantly extends their applicability. Firstly, the boundedness assumptions (i.e., that the primal and dual iterates lie in convex compact sets with bounded diameters) are removed. Secondly, unlike SVRPDA [28], our algorithms are also applicable to non-SC regularization (e.g., ℓ1-regularization). Thirdly, our algorithms only require the strong convexity of F to achieve a linear convergence rate for SC problems, while the SVRG-PDFP and LPD [44] algorithms call for the strong convexity of the regularizer.
Asynchronous parallel algorithms: We extend our SVR-PDHG and ASVR-PDHG algorithms to the asynchronous parallel setting. To the best of our knowledge, this is the first asynchronous parallel primal–dual algorithm. Our experiments show that the speedup of SVR-PDHG and ASVR-PDHG is proportional to the number of threads.
Superior empirical behavior: We conduct various experiments on non-SC graph-guided fused Lasso problems, SC graph-guided logistic regression, and multi-task learning problems. Compared with SPDHG, our algorithms achieve much better performance for both SC and non-SC problems. Due to the low per-iteration cost and the acceleration techniques, our ASVR-PDHG obtains a clear speedup over SVRG-PDFP, SVRG-ADMM, and ASVRG-ADMM.
3. Our Stochastic Primal–Dual Hybrid Gradient Algorithms
In this section, we integrate variance reduction and momentum acceleration techniques into SPDHG and propose two stochastic variance reduced primal–dual hybrid gradient methods, called SVR-PDHG and ASVR-PDHG, where we design key linear extrapolation and momentum acceleration steps to improve the convergence rate. Moreover, we design asynchronous parallel versions for the proposed algorithms to further accelerate solving non-SC problems.
3.1. Our SVR-PDHG Algorithm
We first propose a stochastic variance reduced primal–dual hybrid gradient (SVR-PDHG) method for solving SC and non-SC objectives, as shown in Algorithms 1 and 2. Our algorithms are divided into S epochs, and each epoch includes T updates, where T is usually chosen to be of the order of n as in [13,23,24]. More specifically, SVR-PDHG mainly includes the following three steps:
▸ Update Dual Variable. We specially design a proximal term to keep the next dual iterate close to the current one. Specifically, the first-order surrogate function of the dual variable y combines the bilinear coupling term, the conjugate function G*, and this proximal term, where the extrapolated primal point entering the coupling term is updated by the linear extrapolation step in (8) below. G* is the conjugate function of G and is usually easy to handle; for example, for graph-guided fused Lasso problems, G* has a closed-form proximal mapping (see the sketch below).
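As an illustration of why the dual sub-problem is cheap in the fused-Lasso case, the sketch below assumes G(z) = λ1‖z‖₁, so that G* is the indicator of the ℓ∞-ball of radius λ1 and its proximal mapping reduces to coordinate-wise clipping; the function name and the parameter lam1 are ours.

```python
import numpy as np

def prox_Gstar_fused_lasso(v, lam1):
    """Proximal mapping of G*(y) when G(z) = lam1 * ||z||_1.

    G* is then the indicator of {y : ||y||_inf <= lam1}, so the prox
    (for any step size) is the Euclidean projection onto that box,
    i.e., coordinate-wise clipping to [-lam1, lam1].
    """
    return np.clip(v, -lam1, lam1)

# Example: project a candidate dual point.
# y_next = prox_Gstar_fused_lasso(y + sigma * (A @ x_bar), lam1=0.1)
```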
▸ Update Primal Variable. Analogous to the sub-problem of y, we also add a proximal term to the sub-problem of x. The primal variable is then updated by a proximal gradient step whose search direction is the variance-reduced gradient estimator in (4). Note that a constant step size is used instead of the decaying step size employed in [4].
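The variance-reduced estimator in (4) is not reproduced above; the following is a minimal sketch of the standard mini-batch SVRG estimator it is described as, with the component gradients grad_f_i supplied by the user.

```python
import numpy as np

def svrg_gradient(x, x_snapshot, full_grad_snapshot, batch_indices, grad_f_i):
    """Mini-batch SVRG estimator:
        v = (1/b) * sum_{i in batch} [grad_f_i(x) - grad_f_i(x_snapshot)] + full_grad_snapshot.
    It is unbiased for the full gradient at x and has reduced variance
    when x is close to the snapshot point.
    """
    b = len(batch_indices)
    correction = sum(grad_f_i(i, x) - grad_f_i(i, x_snapshot) for i in batch_indices) / b
    return correction + full_grad_snapshot
```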
▸ Linear Extrapolation Step. In order to further improve the theoretical convergence rate, we design a key linear extrapolation update (8) for the point used in the dual sub-problem. With one choice of the extrapolation parameter, an extra inner product term appears in our proofs, which only yields convergence within a certain error range, as for the Arrow–Hurwicz method [25]. With the other choice, the linear extrapolation step eliminates this extra inner product term, which ensures the O(1/T) convergence rate for non-SC problems.
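Since the displayed update (8) is not reproduced here, the following gives the extrapolation rule in the form the description suggests, assuming the extrapolation acts on the primal iterate as in Chambolle–Pock; the parameter name ω and the two candidate values are our reading of the discussion above.
\[
\bar{x}_{t+1} \;=\; x_{t+1} \;+\; \omega\,\big(x_{t+1} - x_{t}\big), \qquad \omega \in \{0, 1\},
\]
where ω = 0 corresponds to the Arrow–Hurwicz-type behavior and ω = 1 is the choice that cancels the extra inner product term.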
Algorithm 1 SVR-PDHG for SC Objectives.
- Input: T, the step sizes, the mini-batch size b, and the initial points.
- Initialize: the primal and dual variables and the snapshot point.
- 1: for s = 1, 2, …, S do
- 2: set the snapshot point and compute the full gradient at it; initialize the inner-loop iterates;
- 3: for t = 0, 1, …, T − 1 do
- 4: choose a mini-batch of size b uniformly at random;
- 5: update the dual variable;
- 6: update the primal variable;
- 7: perform the linear extrapolation step;
- 8: end for
- 9: set the snapshot point and the initial iterates of the next epoch;
- 10: end for
- Output: the last primal and dual iterates.
Algorithm 2 SVR-PDHG for non-SC objectives.
- Input: T, the step sizes, the mini-batch size b, and the initial points.
- Initialize: the primal and dual variables and the snapshot point.
- 1: for s = 1, 2, …, S do
- 2: set the snapshot point and compute the full gradient at it; initialize the inner-loop iterates;
- 3: for t = 0, 1, …, T − 1 do
- 4: choose a mini-batch of size b uniformly at random;
- 5: update the dual variable;
- 6: update the primal variable;
- 7: perform the linear extrapolation step;
- 8: end for
- 9: set the snapshot point and the initial iterates of the next epoch;
- 10: end for
- Output: the averaged primal and dual iterates.
The remaining update rules of our SVR-PDHG algorithm for SC and non-SC objectives are outlined in Algorithms 1 and 2, respectively. The main differences between SVR-PDHG for SC and non-SC problems are as follows. For SC objectives, the initial dual variable of each epoch is reset, which contributes to attaining a linear convergence rate; for non-SC objectives, the initial primal and dual variables of each epoch are instead carried over from the previous epoch. Furthermore, the outputs of SVR-PDHG for SC problems are the last iterates, i.e., they are measured in a non-ergodic sense (a convergence rate is ergodic if it measures the optimality at the averaged iterates, while it is non-ergodic if it measures the optimality at the last iterate directly), whereas the outputs of SVR-PDHG for non-SC problems are averaged (ergodic) iterates.
3.2. Our ASVR-PDHG Algorithm
In this part, we design an accelerated stochastic variance reduced primal–dual hybrid gradient (ASVR-PDHG) method for solving SC and non-SC problems as shown in Algorithms 3 and 4, respectively. In particular, to eliminate the boundedness assumption for the non-SC case, we design a new adaptive epoch length strategy. More specifically, ASVR-PDHG mainly includes three steps:
▸ Update Dual Variable. The optimization sub-problem of y contains a similar proximal term to keep the next dual iterate close to the current one. Different from existing works, this quadratic term is weighted by an acceleration factor tied to the momentum parameter (when solving SC problems, this factor is fixed for all s). The resulting first-order surrogate function again involves an extrapolated primal point, which is updated by the linear extrapolation step in (12) below.
▸ Update Primal Variable. We introduce an auxiliary variable to accelerate the primal variable, which mainly involves the following two steps.
Gradient descent: We first update the auxiliary variable using the variance-reduced gradient estimator. In particular, it is obtained by solving a proximal sub-problem with a constant step size.
Momentum acceleration: We design a momentum acceleration step that uses the snapshot point of the previous epoch. The primal iterate is formed by mixing the auxiliary variable with this snapshot point through a momentum parameter (see the sketch below). For non-SC objectives, it can be seen from the outer loop that the momentum parameter is monotonically decreasing and satisfies a condition depending on the mini-batch size b for all s.
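The displayed momentum update is not reproduced here; the following is a plausible instantiation following the ASVRG-style momentum the description points to, where z_{t+1} denotes the auxiliary variable, \tilde{x}^{s-1} the snapshot point of the previous epoch, and θ_s the momentum parameter (these symbols are our notation, not necessarily the paper's).
\[
x_{t+1} \;=\; \tilde{x}^{\,s-1} \;+\; \theta_s\,\big(z_{t+1} - \tilde{x}^{\,s-1}\big), \qquad \theta_s \in (0, 1].
\]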
▸ Linear Extrapolation Step. We also design a key linear extrapolation step for the extrapolated primal point in Algorithms 3 and 4, given by update (12). As before, a suitable choice of the extrapolation parameter eliminates an extra inner product term, which is one of the reasons our ASVR-PDHG algorithm attains the O(1/T²) convergence rate for non-SC problems.
Algorithm 3 ASVR-PDHG for SC objectives.
- Input: T, the step sizes, the mini-batch size b, and the momentum parameter.
- Initialize: the primal, dual, and auxiliary variables and the snapshot point.
- 1: for s = 1, 2, …, S do
- 2: set the snapshot point, compute the full gradient at it, and initialize the inner-loop iterates;
- 3: for t = 0, 1, …, T − 1 do
- 4: choose a mini-batch of size b uniformly at random;
- 5: update the dual variable;
- 6: gradient descent step for the auxiliary variable;
- 7: momentum acceleration step;
- 8: perform the linear extrapolation step;
- 9: end for
- 10: set the snapshot point and the initial iterates of the next epoch;
- 11: end for
- Output: the primal and dual solution estimates.
Algorithm 4 ASVR-PDHG for non-SC objectives.
- Input: the step sizes, the mini-batch size b, the initial momentum parameter, and the initial epoch length.
- Initialize: the primal, dual, and auxiliary variables and the snapshot point.
- 1: for s = 1, 2, …, S do
- 2: set the snapshot point, compute the full gradient at it, and initialize the inner-loop iterates;
- 3: for t = 0, 1, …, T_s − 1 do
- 4: choose a mini-batch of size b uniformly at random;
- 5: update the dual variable;
- 6: gradient descent step for the auxiliary variable;
- 7: momentum acceleration step;
- 8: perform the linear extrapolation step;
- 9: end for
- 10: set the snapshot point and the initial iterates of the next epoch;
- 11: update the momentum parameter and the epoch length adaptively;
- 12: end for
- Output: the primal and dual solution estimates.
Moreover, we set the epoch length adaptively (for SC problems, it is fixed for all s) to further accelerate our algorithms, where the epoch length denotes the number of inner iterations at the s-th outer loop. The key differences between ASVR-PDHG for SC and non-SC problems are as follows.
ASVR-PDHG for SC problems: The initial dual variable at each epoch is reset, which contributes to attaining a linear convergence rate for SC objectives. The momentum parameter and the inner-loop length are set to constants and T, respectively.
ASVR-PDHG for non-SC problems: The initial primal and dual variables of each epoch are carried over from the previous epoch. Following ASVRG-ADMM [24], the momentum parameters form a monotonically decreasing sequence. Different from ASVRG-ADMM, a new adaptive strategy for the epoch length is designed: starting from an initial epoch length, the length of each subsequent epoch is multiplied by a factor determined by the current momentum parameters, which is the reason why the boundedness assumption can be eliminated. Since the momentum parameter is decreasing, this growth factor is greater than 1 and decreases gradually, whereas a constant factor of 2 is used in SVRG++ [49].
3.3. Our Asynchronous Parallel Algorithms
In this subsection, we extend our SVR-PDHG and ASVR-PDHG algorithms to the sparse and asynchronous parallel setting to further accelerate the convergence speed for large-scale, sparse, high-dimensional non-SC problems. To the best of our knowledge, this is the first asynchronous parallel stochastic primal–dual algorithm. Our parallel ASVR-PDHG algorithm is shown in Algorithm 5, and the parallel SVR-PDHG algorithm is shown in Algorithm A1 in Appendix A. Specifically, we consider a sparsity-friendly problem structure and a mini-batch size of one sample to facilitate parallelism. Taking Algorithm 5 as an example, there are three main differences compared with Algorithm 4.
Algorithm 5 ASVR-PDHG for non-SC objectives in the sparse and asynchronous parallel setting.
- Input: the step sizes, the initial momentum parameter, and the initial epoch length.
- Initialize: the shared primal, dual, and auxiliary variables, the snapshot point, and p threads.
- 1: for s = 1, 2, …, S do
- 2: read the current iterate from the shared memory and let all threads compute the full gradient in parallel; set the snapshot point and initialize the inner-loop iterates;
- 3: reset the inner-loop counter; //inner loop counter
- 4: while the counter is smaller than the epoch length, in parallel, do
- 5: atomically increase the counter; //atomic increase counter
- 6: choose a sample index uniformly at random;
- 7: set the support set to the support of the chosen sample;
- 8: perform an inconsistent read of the shared variables;
- 9: update the coordinates of the dual variable indexed by the support;
- 10: update the coordinates of the primal variable indexed by the support using the sparse estimator (14);
- 11: momentum acceleration step on the support;
- 12: linear extrapolation step on the support;
- 13: update the remaining iterates on the support;
- 14: end while
- 15: set the snapshot point and the initial iterates of the next epoch;
- 16: update the momentum parameter and the epoch length adaptively;
- 17: end for
- Output: the primal and dual solution estimates.
▸ The full gradient in Algorithm 5 is computed in parallel, while the full gradient in Algorithm 4 is computed serially.
▸ We adopt a sparse approximation technique [50] to decrease the chance of conflicts between multiple cores, which replaces the SVRG estimator (4) with the sparse estimator (14), in which a reweighting matrix is used to construct sparse iterates. This matrix must be chosen so that the estimator remains unbiased; under such a condition, the sparse approximated estimator (14) is still an unbiased estimator of the full gradient.
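A minimal sketch of the sparse approximation described above, in the spirit of [50]: the dense full-gradient term is reweighted by entries inversely proportional to how often each coordinate appears in the samples' supports and is restricted to the support of the chosen sample, so that the estimator stays unbiased while the update touches only a few coordinates. The variable names and the probability estimate are our assumptions.

```python
import numpy as np

def sparse_svrg_gradient(i, x, x_snapshot, full_grad_snapshot, support, inv_prob, grad_f_i):
    """Sparse variance-reduced estimator restricted to `support`.

    inv_prob[j] ~ 1 / P(coordinate j belongs to a random sample's support),
    so the reweighted full-gradient term has expectation equal to the
    full gradient and the estimator remains unbiased.
    """
    v = np.zeros_like(x)
    # Component gradients are already supported on `support` for sparse data.
    v[support] = grad_f_i(i, x)[support] - grad_f_i(i, x_snapshot)[support]
    # Reweighted full-gradient term, kept only on the support.
    v[support] += inv_prob[support] * full_grad_snapshot[support]
    return v
```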
▸ The proposed asynchronous parallel algorithm only updates the coordinates of the variables in the support set of the chosen random sample, rather than the entire dense vector; see lines 9–10 of Algorithm 5. As long as different threads touch different coordinates, Algorithm 5 effectively avoids write conflicts. Thus, our parallel ASVR-PDHG can exploit the power of multi-core processor architectures and further accelerate Algorithm 4 on sparse datasets.
4. Theoretical Analysis
This section provides the convergence analysis for SVR-PDHG and ASVR-PDHG (i.e., Algorithms 1–4) in SC and non-SC cases, respectively. We first introduce the convergence criterion and then give the key technical results (i.e., Lemmas 1 and 2) for SVR-PDHG and ASVR-PDHG, respectively. Finally, we prove the convergence rate of our algorithms as shown by the following Theorems 1–4.
4.1. Convergence Criterion
In Problem (1), the objective is convex in x for each fixed y and concave in y for each fixed x. Under these conditions, Sion's minimax theorem [51] guarantees that the min–max and max–min values coincide. In other words, there exists at least one saddle point at which the objective can neither be decreased in the primal variable nor increased in the dual variable. This setting allows us to establish the convergence criterion for our algorithms. Following the work [4], we first introduce a gap-style criterion function as the convergence criterion, which has the following properties.
Property 1. If the region under consideration contains a saddle point, then the criterion function is non-negative, and it vanishes only at a saddle point.
Property 2. By the definition of the criterion function, a corresponding inequality holds for every feasible primal–dual pair, and this inequality is used repeatedly in our proofs.
Other convergence criteria are commonly used, such as the primal–dual gap of a candidate point, i.e., the difference between the maximal objective value over the dual variable and the minimal objective value over the primal variable at that point. For example, Chen et al. [40] used the primal–dual gap as the measurement and achieved a complexity matching the lower bound for online bilinear saddle-point problems. Zhao et al. [41] proposed the OTPDHG algorithm, which also uses the primal–dual gap and achieves the optimal convergence rate for online bilinear saddle-point problems even when A is unknown a priori. Zhao et al. [45] further considered the beyond-bilinear setting and, still based on the primal–dual gap, obtained the optimal convergence rate for online saddle-point problems. It is worth noting that the works mentioned above focus on the online setting, while we analyze the finite-sum setting; the lower bound for the online setting is usually higher than that for the finite-sum setting [52,53]. Although Song et al. [42] also analyzed convergence based on the primal–dual gap in the finite-sum setting, they focused on non-smooth problems, while our algorithms aim to improve the convergence rates for smooth problems, so these analyses are orthogonal to ours. Thekumparampil et al. [44] considered the finite-sum setting, but their rates are based on the primal–dual gap and require a strongly convex regularizer. In this paper, we prove faster convergence rates based on our criterion function in the non-SC finite-sum setting.
Using this convergence criterion, we analyze our SVR-PDHG and ASVR-PDHG algorithms in the next two subsections. The detailed proofs of all theoretical results are provided in Appendix A. Here, we give a brief proof sketch: our main proofs start from a one-epoch analysis, i.e., Lemmas 1 and 2 below. Then, in Section 4.2, we prove the convergence results of SVR-PDHG in Theorems 1 and 2, which rely on the one-epoch inequality in Lemma 1; the gradient complexity results are given as a by-product. In Section 4.3, we prove the convergence rate and gradient complexity of ASVR-PDHG in Theorems 3 and 4, which depend on the one-epoch upper bound in Lemma 2.
4.2. Convergence Analysis of SVR-PDHG
This subsection provides the convergence analysis for SVR-PDHG (i.e., Algorithms 1 and 2). Lemma 1 provides a one-epoch analysis for SVR-PDHG.
Key technical challenges for SVR-PDHG. Line 5 of our SVR-PDHG algorithm eases the computational burden but simultaneously complicates the convergence analysis by introducing a tricky inner product term into the bound on the dual variable y. To address this challenge, we use the linear extrapolation step and establish an upper bound that eliminates this inner product term in Lemma 1.
Lemma 1 (One-Epoch Analysis for SVR-PDHG).
Suppose Assumption 1 holds. Consider the sequence generated by Algorithm 1 or 2 in one epoch, and let an optimal solution of Problem (1) be given. Under suitable conditions on the step sizes and the parameter γ, the expectation of the convergence criterion over the epoch admits an upper bound whose precise form and constants are given in Appendix A. Lemma 1 thus provides the one-epoch upper bound for our SVR-PDHG. Based on this lemma, we combine the analysis across epochs and derive Theorems 1 and 2 for SC and non-SC objectives, respectively. Lemma 1 also guides our analysis of ASVR-PDHG.
Theorem 1 (SVR-PDHG for SC Objectives).
Let the output of Algorithm 1 be given. Suppose Assumptions 1–3 hold and A has full row rank. Under suitable choices of the step sizes and the epoch length, the expected error contracts by a constant factor per epoch, so SVR-PDHG converges linearly. In other words, the gradient complexity of SVR-PDHG to achieve an ϵ-additive error grows only logarithmically in 1/ϵ (the precise constants are given in Appendix A).
Theorem 1 shows that SVR-PDHG obtains a linear convergence rate for SC objectives. Unlike SVRG-PDFP, SVR-PDHG does not require the strong convexity of the regularizer G. Moreover, our SVR-PDHG algorithm achieves the same contraction coefficient as the inexact Uzawa method [48] for solving SC objectives.
Theorem 2 (SVR-PDHG for Non-SC Objectives).
Suppose Assumptions 1 and 3 hold, and let the output of Algorithm 2 be given. Under suitable choices of the step sizes, the expected convergence criterion of the output decays at the rate O(1/T), and the corresponding gradient complexity to reach an ϵ-accurate output follows directly (the precise bound and constants are given in Appendix A). Theorem 2 shows that SVR-PDHG removes the boundedness assumption of SPDHG, depends only on problem-dependent constants, and achieves the O(1/T) convergence rate for non-SC objectives. In addition, SVR-PDHG has simpler iteration rules than SVRG-ADMM. Thus, although SVR-PDHG and SVRG-ADMM share the same convergence rate, the former is faster in practice.
4.3. Convergence Analysis of ASVR-PDHG
This subsection provides the convergence analysis for ASVR-PDHG (i.e., Algorithms 3 and 4). Similarly, we first provide a one-epoch upper bound for our ASVR-PDHG.
Key technical challenges for ASVR-PDHG. In addition to the technical challenges already present in the analysis of SVR-PDHG, the momentum acceleration step further increases the difficulty of establishing faster convergence rates for ASVR-PDHG, and clarifying the behavior of momentum acceleration is a key step. To address this challenge, we use the improved variance upper bound of [24] and design auxiliary quantities that clarify the behavior of the momentum acceleration step while bounding a new and tricky inner product term in Lemma 2. Moreover, a fixed epoch length T hinders achieving the O(1/T²) convergence rate and keeps the result dependent on the boundedness assumption, thereby limiting the applicability of our algorithms. Our new adaptive strategy for the epoch length addresses these issues.
Lemma 2 (One-Epoch Analysis for ASVR-PDHG).
Let the sequence be generated by Algorithm 3 or 4 under Assumption 1, and let an optimal solution of Problem (1) be given. Under suitable conditions on the step sizes and the parameter γ, a one-epoch inequality holds for all s (its precise form is given in Appendix A). From Lemma 2, we obtain the relationship between two consecutive epochs of our ASVR-PDHG algorithm. Based on this, we establish the convergence properties of ASVR-PDHG for both SC and non-SC objectives.
Theorem 3 (ASVR-PDHG for SC Objectives).
Suppose Assumptions 1–3 hold and A has full row rank. Let the output generated by Algorithm 3 be given. Under suitable choices of the step sizes, the momentum parameter, and the epoch length, the expected error contracts by a constant factor per epoch. Analogous to SVR-PDHG, the gradient complexity of ASVR-PDHG to achieve an ϵ-additive error also grows only logarithmically in 1/ϵ.
Theorem 3 indicates that ASVR-PDHG achieves a linear convergence rate for SC objectives. Note that its contraction factor is more concise than that of SVR-PDHG, and ASVR-PDHG actually converges faster than SVR-PDHG for SC problems, as shown in our experiments, which demonstrates the benefit of momentum acceleration.
Theorem 4 (ASVR-PDHG for Non-SC Objectives).
Suppose Assumptions 1 and 3 hold, and let the output of Algorithm 4 be given. Under suitable choices of the step sizes, the initial momentum parameter, and the initial epoch length, the expected convergence criterion of the output decays at the rate O(1/T²), and the corresponding gradient complexity to reach an ϵ-accurate output follows directly (the precise bound and constants are given in Appendix A). In light of Theorem 4, ASVR-PDHG achieves an O(1/T²) convergence rate for non-SC objectives. Note that ASVR-PDHG removes the extra boundedness assumption of SPDHG and depends only on problem-dependent constants. That is, ASVR-PDHG improves the convergence rate of variance reduction algorithms (e.g., SVRG-ADMM, SVRG-PDFP, and SVR-PDHG) from O(1/T) to O(1/T²) through the adaptive epoch length strategy, the linear extrapolation step, and the momentum acceleration technique.
Remark 1. In order to further highlight the advantages of our algorithms, we compare their gradient complexity with that of other algorithms. When solving non-SC problems, the gradient complexity of SPDHG [4] is of the same order as those of SGD [54] and SADMM [10]. Although SVR-PDHG has the same gradient complexity as SVRG-ADMM, SVR-PDHG has better practical performance. Theorem 4 implies that our ASVR-PDHG effectively reduces the gradient complexity without requiring additional assumptions. We summarize the gradient complexity of several stochastic primal–dual methods and stochastic ADMMs for non-SC problems in Table 1, where the big-O notation hides problem-dependent constants.
5. Experimental Results
This section evaluates the performance of our SVR-PDHG and ASVR-PDHG methods against several state-of-the-art algorithms for solving non-SC graph-guided fused Lasso problems, SC graph-guided logistic regression, and non-SC multi-task learning problems. Our source code is available at https://github.com/Weixin-An/ASVR-PDHG, accessed on 10 November 2021. The compared algorithms include SPDHG [55], SVRG-PDFP [46], SVRG-ADMM [23], and ASVRG-ADMM [24]. To alleviate statistical variability, each experiment is repeated 10 times and plotted as a shaded figure, where the shaded region represents the standard deviation and the solid line represents the mean value. All the experiments are carried out on an Intel Core i7-7700 3.6 GHz CPU (Intel Corporation, Santa Clara, CA, USA) with 32 GB RAM.
Hyper-parameter Selection. Based on the small-scale synthetic dataset described in Section 5.1, we perform hyper-parameter selection. We use grid search to choose relatively good step sizes for our algorithms in all cases unless otherwise specified, and one parameter is fixed in all experiments to match the setting of our theoretical analysis. For SVR-PDHG, the epoch length is set as a function of the number of training samples n for both SC and non-SC problems. For ASVR-PDHG on SC problems, we use a common constant momentum parameter and the same number of inner loops; for ASVR-PDHG on non-SC problems, we choose an initial value of the momentum parameter and apply the adaptive epoch length strategy during the first 10 epochs. As for the compared algorithms, we use the same hyper-parameters as in [56] for ASVRG-ADMM, tune the parameters as in [20] for SVRG-ADMM, and also adopt grid search to choose the best step sizes for SVRG-PDFP. As for the mini-batch sizes, we choose them guided by theory, considering the trade-off between time consumption and reasonably good performance. Specifically, we test the performance of our algorithms under different mini-batch sizes, and the results are shown in Figure A1 and Figure A2 in Appendix B; according to these figures and the trade-off between time cost and loss, we fix the mini-batch sizes for the SC and non-SC problems.
We first solve the following non-SC graph-guided fused Lasso problem and SC graph-guided logistic regression problem in Section 5.1, Section 5.2, Section 5.3, and Section 5.4 (Problem (22)), where each component function is the logistic loss on a feature–label pair, two regularization parameters control the ℓ1 and ℓ2 terms, and A is set as described in [57]. The ℓ1-norm regularized minimization can be converted into Problem (1) by rewriting the ℓ1 norm through its conjugate, which involves the maximum (ℓ∞) norm of a vector. Thus, Problem (22) can be converted into the saddle-point problems (23) and (24), respectively, in which the conjugate function G* is the indicator of an ℓ∞-norm ball. Then, we consider the more general case in Section 5.5. Lastly, we solve the non-SC multi-task learning problem in Section 5.6.
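For reference, the following sketches the objective just described: the average logistic loss plus an ℓ1 penalty on Ax, with an optional ℓ2 term for the SC variant. The parameter names lam1 and lam2 and the ±1 label convention are our assumptions; A is the graph matrix built as in [57].

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic loss over feature matrix X (n x d) and labels y in {-1, +1}."""
    margins = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -margins))  # numerically stable log(1 + exp(-m))

def graph_guided_objective(w, X, y, A, lam1, lam2=0.0):
    """Graph-guided objective: logistic loss + lam1*||A w||_1 (+ lam2/2*||w||^2 for the SC case)."""
    return logistic_loss(w, X, y) + lam1 * np.linalg.norm(A @ w, 1) + 0.5 * lam2 * (w @ w)
```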
5.1. Comparison of PDHG-Type Algorithms on Synthetic Datasets
In this subsection, to verify the advantages of our algorithms over PDHG-type algorithms, we first conduct experiments for solving Problems (23) and (24) on synthetic datasets, which are generated as follows. Each sample is drawn from i.i.d. standard Gaussian random variables and normalized, and the corresponding label is obtained from a noisy linear model whose ground-truth vector is drawn from the d-dimensional standard normal distribution; the noise also follows a normal distribution with mean 0 and standard deviation 0.01. The regularization parameters of Problems (23) and (24) are fixed to constant values.
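A sketch of a data-generation routine consistent with this description; the dimensions, the row-wise normalization, and the sign-based labeling are our assumptions where the original values were not reproduced.

```python
import numpy as np

def make_synthetic(n=1000, d=100, noise_std=0.01, seed=0):
    """Generate i.i.d. Gaussian samples, normalize each row, and create
    labels from a noisy linear model with a Gaussian ground-truth vector."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # row-wise normalization
    x_star = rng.standard_normal(d)
    y = np.sign(X @ x_star + noise_std * rng.standard_normal(n))
    y[y == 0] = 1.0                                  # avoid zero labels
    return X, y, x_star
```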
Figure 1 shows the comparison between our algorithms and SPDHG for solving SC and non-SC problems on a small-scale synthetic dataset. The results imply that the variance reduction methods, including SVR-PDHG and ASVR-PDHG, converge considerably faster than SPDHG, which verifies that our algorithms improve the theoretical convergence rate. ASVR-PDHG converges much faster than SVR-PDHG in the non-SC setting, which verifies that the momentum acceleration step significantly improves the convergence speed for non-SC problems. ASVR-PDHG also performs better than SVR-PDHG in the SC setting, which demonstrates the benefit of momentum acceleration for SC problems.
5.2. Comparison of PDHG-Type Algorithms on Real-World Datasets
Since the experimental phenomena are similar across the five real-world datasets in Table 2, we only report the results on the bio, phy, and epsilon datasets in this subsection.
Figure 2 shows the experimental results of Algorithms 2 and 4 for solving the non-SC Problem (23). All the results show that our SVR-PDHG and ASVR-PDHG perform clearly better than their baseline, SPDHG. Moreover, our ASVR-PDHG consistently converges much faster than both SVR-PDHG and SPDHG in all cases, which verifies the effectiveness of our momentum trick for accelerating the variance-reduced stochastic PDHG algorithm.
As for SC objectives, Figure 3 shows the experimental results of SVR-PDHG and ASVR-PDHG on the three real-world datasets. It can be observed that SVR-PDHG and ASVR-PDHG are superior to the baseline SPDHG by a significant margin in terms of both the number of passes through the data and the CPU time, which also verifies our theoretical convergence results, i.e., the linear convergence rate. Note that the standard deviation of the results of our methods is relatively small, which implies that our algorithms are relatively stable.
5.3. Sparse and Asynchronous Parallel Setting
We also evaluate our algorithms in the sparse and asynchronous parallel setting. We consider a simplified instance of the non-SC Problem (23) to facilitate parallelism and select the regularization parameter accordingly. The sparse datasets rcv1.small and real-sim are used to test our algorithms, and the sparse reweighting is chosen as in [50]. We take the single-thread algorithm as the baseline and compare the performance of all methods in terms of running time. The parallel SPDHG is implemented by ourselves by updating the support sets of the primal and dual vectors in a parallel fashion. All the algorithms in the asynchronous parallel setting are implemented in C++ with a Matlab interface, and the experimental results are shown in Figure 4.
All the experimental results in Figure 4 show that our SVR-PDHG and ASVR-PDHG significantly outperform SPDHG in terms of running time on both one thread and four threads. Our ASVR-PDHG method achieves a clear speedup over our SVR-PDHG on both one thread and four threads, which benefits from our momentum acceleration and adaptive epoch length strategy. Moreover, SVR-PDHG and ASVR-PDHG with four threads are substantially faster than their single-thread counterparts. These phenomena indicate that the linear extrapolation step, the momentum acceleration trick, and the adaptive epoch length strategy are also suitable for large-scale machine learning problems in the sparse and asynchronous parallel setting, and that our asynchronous parallel algorithms achieve a speedup roughly proportional to the number of threads.
5.4. Comparison with State-of-the-Art Stochastic Methods
To illustrate the advantages of our methods over state-of-the-art methods, we further conduct experiments on a large-scale synthetic dataset and on real-world datasets, using the same regularization settings for the SC and non-SC problems as above.
Figure 5 shows the experimental results of SPDHG, SVR-PDHG, ASVR-PDHG, SVRG-ADMM, and ASVRG-ADMM on a larger synthetic dataset. Our algorithms (SVR-PDHG and ASVR-PDHG) converge significantly faster than SPDHG. For SC problems, SVR-PDHG and ASVR-PDHG achieve a clear average speedup over SVRG-ADMM, ASVRG-ADMM, and SVRG-PDFP. For non-SC problems, ASVR-PDHG achieves a clear average speedup over the other algorithms, which benefits from updating fewer variables, avoiding the matrix Q, and the momentum acceleration technique.
Due to limited space and the similar experimental phenomena on the five real-world datasets in Table 2, we only report the results on the epsilon_test and w8a datasets. Figure 6 shows the comparison results. Our ASVR-PDHG almost always converges much faster than the stochastic ADMMs in all settings. For SC problems, although all the algorithms attain a linear convergence rate, our algorithms achieve a clear average speedup over SVRG-ADMM and ASVRG-ADMM because we do not require the matrix Q of ADMM-type methods; compared with SVRG-PDFP, our algorithms also converge significantly faster. For non-SC problems, our ASVR-PDHG achieves a clear speedup over the other stochastic algorithms.
We also compare our methods with well-known extragradient (EG) methods, namely stochastic EG (SEG) [58], stochastic AG-EG (SAG-EG) [59], and stochastic variance-reduced EG (SEG-VR) [60]. We use the same initialization for all methods and choose the batch size and step sizes guided by theory, considering the trade-off between time consumption and accuracy while maintaining reasonably good performance. The experimental results on the phy, w8a, and epsilon datasets are shown in Figure 7. We observe that our proposed methods still converge faster than the EG-type methods. In particular, for non-SC problems, our SVR-PDHG and ASVR-PDHG algorithms achieve at least 6× and 7× speedups over the EG-type methods, respectively, which benefits from variance reduction and our momentum acceleration technique.
5.5. Comparisons of Primal–Dual Algorithms in a More General Setting
We further apply our algorithms in a more general setting. We compare our methods with other primal–dual algorithms, namely SEG [58], SAG-EG [59], SEG-VR [60], and SVRG-PDFP [46], when solving a regularized logistic regression problem. Its primal–dual formulation is obtained via convex conjugacy as in Problem (1), with a regularization parameter controlling the strength of the regularizer. We use the same initialization and batch size for all methods. The convergence results are shown in Figure 8.
From Figure 8, it can be seen that the experimental behavior is similar to that of the previous setting. Specifically, our SVR-PDHG algorithm performs slightly better than the SEG and SVRG-PDFP methods and achieves a clear speedup over the other compared methods. Our ASVR-PDHG algorithm further improves the convergence speed of SVR-PDHG, which again verifies the effectiveness of our momentum acceleration technique.
5.6. Multi-Task Learning
In this subsection, in order to verify the advantages of our algorithms for solving matrix nuclear norm regularized problems, we conduct the following multi-task learning experiments. The multi-task learning model minimizes the sum over N tasks of the logistic loss of each task plus a nuclear norm regularizer on the parameter matrix, where N is the number of tasks and the nuclear norm is the sum of the singular values of a matrix. To solve this model with stochastic ADMMs, an auxiliary variable Y is introduced and the original model is transformed into an equality-constrained problem. In order to apply the primal–dual technique instead, the nuclear norm is rewritten in its variational form as a maximization over matrices with bounded spectral norm, where the spectral norm is the largest singular value of a matrix. In this way, the multi-task learning model can be formulated as a saddle-point problem.
In this formulation, the conjugate function is the indicator of a spectral-norm ball (see the sketch below). This problem can also be solved by SPDHG, SVRG-PDFP, SVRG-ADMM, ASVRG-ADMM, and our algorithms.
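As an illustration of the dual step in this saddle-point formulation: if the regularizer is λ times the nuclear norm, its conjugate is the indicator of the spectral-norm ball of radius λ, so the dual proximal mapping is the projection onto that ball, computed by clipping singular values. A minimal sketch (the function name and lam are ours):

```python
import numpy as np

def project_spectral_ball(Y, lam):
    """Project a matrix Y onto {Z : ||Z||_2 <= lam} by clipping its singular values.
    This is the proximal mapping of the conjugate of lam * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return (U * np.minimum(s, lam)) @ Vt
```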
We compare the stochastic ADMMs, SVRG-PDFP, and our algorithms on the 20newsgroups dataset (available at https://github.com/jiayuzhou/MALSAR/tree/master/data, accessed on 5 November 2021), fixing the regularization parameter and the mini-batch size for each task. The training loss (i.e., the training objective value minus the minimum value) and the test error are shown in Figure 9. It can be observed that ASVR-PDHG significantly outperforms the other algorithms in terms of both convergence speed and test error.
5.7. Non-Convex Support Vector Machines
We also compare the related methods on the Support Vector Machine (SVM) problem. Given a training set, the non-convex norm-penalized SVM minimizes a penalized hinge loss function. In the same way as above, the SVM can be formulated as a saddle-point problem. We compare the ASVRG-ADMM, SVRG-PDFP, SAG-EG [59], and SPDM [61] algorithms on a synthetic dataset and the phy dataset, fixing the regularization parameter and the mini-batch size. The remaining hyper-parameters are chosen guided by theory, considering the trade-off between time consumption and reasonably good performance. The experimental results are shown in Figure 10. It can be found that, when solving non-convex problems, our ASVR-PDHG algorithm still maintains a clear advantage.
Limitations. Our algorithms are proven to achieve improved convergence rates only under convexity assumptions. For non-convex problems, although their convergence properties are unknown, they achieve better experimental performance than some state-of-the-art methods, as shown in Figure 10. As for complex non-convex problems such as training deep networks, the gradient computation and acceleration steps may increase the computational cost; however, since the batch size can be chosen to be small and only vector addition operations are performed, the extra cost is moderate and remains smaller than that of ADMM-type algorithms. We will study the convergence properties of non-convex primal–dual problems in future work.