Article

Stochastic Zeroth-Order Multi-Gradient Algorithm for Multi-Objective Optimization

by Zhihao Li 1, Qingtao Wu 1,2, Moli Zhang 1,*, Lin Wang 1,2, Youming Ge 1 and Guoyong Wang 3

1 School of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China
2 Intelligent System Science and Technology Innovation Center, Longmen Laboratory, Luoyang 471023, China
3 School of Computer and Information Engineering, Luoyang Institute of Science and Technology, Luoyang 471023, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(4), 627; https://doi.org/10.3390/math13040627
Submission received: 10 January 2025 / Revised: 27 January 2025 / Accepted: 5 February 2025 / Published: 14 February 2025

Abstract: Multi-objective optimization (MOO) has become an important method in machine learning, where multiple competing objectives are optimized simultaneously. Many existing MOO algorithms assume that gradient information is readily available and use it to optimize the objective functions. However, when gradients are unavailable, as with black-box or non-differentiable functions, these algorithms become ineffective. In this paper, we propose a zeroth-order MOO algorithm named SZMG (stochastic zeroth-order multi-gradient algorithm), which approximates the gradients of the objectives by finite-difference methods. Meanwhile, to avoid conflicting gradients between objectives and to reduce the bias of the stochastic multi-gradient direction caused by stochastic gradients, an SGD-type method is adopted to acquire the weight parameters. The convergence rate of the SZMG algorithm is established under a non-convex setting and mild assumptions. Simulation results demonstrate the effectiveness of the SZMG algorithm.

1. Introduction

In recent years, multi-objective optimization (MOO) has gained much attention in machine learning applications, including autonomous driving [1], language model training [2], safe reinforcement learning [3], and adversarial attacks [4]. MOO aims to optimize multiple, potentially competing objective functions simultaneously. Unlike single-objective optimization (SOO), which seeks the minimum of a single function, MOO problems rarely have a solution that is optimal for all objective functions. Therefore, the goal of MOO is to find Pareto-optimal solutions, i.e., the best trade-offs among the objective functions.
To solve MOO problems, various MOO algorithms have been proposed. One class of MOO methods comprises heuristic algorithms inspired by natural behaviors, reactions, and communication mechanisms [5]. For example, NSGA-II (non-dominated sorting genetic algorithm II) [6], influenced by the process of biological evolution, explores the solution space to find optimal or near-optimal Pareto solutions by simulating genetic operations such as selection, crossover, and mutation. MOPSO (multi-objective particle swarm optimization) [7] and MOGWO (multi-objective gray wolf optimizer) [8], inspired by the survival and behavioral activities of organisms, simulate collaborative mechanisms within a population to randomly search for Pareto solutions. The AMOSA (archived multi-objective simulated annealing) method [9], inspired by the annealing process in metals, explores the entire solution space by accepting poorer solutions with some probability. However, these algorithms rarely provide theoretical convergence guarantees and generally yield a set of solutions with each run, which can be computationally expensive for machine learning problems (Table 1).
In contrast, gradient-based MOO algorithms have first-order theoretical convergence guarantees and are more efficient at finding Pareto solutions through gradient information [14]. One simple approach is to adopt scalarization methods, which convert MOO problems into SOO problems by presetting a group of weights and then solve them using SOO methods [15,16]. However, scalarization methods face challenges with conflicting objectives, such as conflicting gradients, which can impede the algorithm's ability to find Pareto solutions [17]. Additionally, pre-selecting the weights is itself a significant challenge. Reference [18] proposed a dynamic weighting method called the multi-gradient descent algorithm (MGDA), which determines a multi-gradient descent direction in each iteration by solving a min-norm quadratic subproblem. This direction can optimize all objective functions simultaneously while avoiding the need for pre-selected weights. Since then, various MGDA variants have been proposed [19,20].
The aforementioned works mainly focus on the deterministic case. In practical applications, many real-world optimization problems involve unknown or uncertain data, and processing all the data can be extremely costly. An alternative is to use a stochastic optimization method. Mercier et al. [21] first proposed the stochastic version of MGDA, called SMGDA, by combining stochastic gradient descent with MGDA. However, that paper did not identify a critical property of SMGDA: the stochastic multi-gradient direction is biased. Liu and Vicente [22] improved the analysis of SMGDA based on previous work and provided convergence analysis under convex and strongly convex settings. Subsequently, refs. [23,24] conducted theoretical studies of SMGDA under convex and non-convex settings.
However, the above works assume that gradient information is readily available. In some cases, gradients are unavailable or prohibitively expensive to obtain, such as for black-box functions, where the explicit form and gradient information of the objective functions are unknown, rendering gradient-based optimization algorithms ineffective. Recently, some methods have been proposed to address these problems. Chen et al. [25] proposed a Hessian-aware zeroth-order optimization method, but that work mainly focuses on the convex setting, whereas in practical applications non-convex functions more accurately represent actual scenarios. Ye et al. [13] proposed a multi-objective evolutionary strategy algorithm that iteratively updates a Gaussian distribution to find a Pareto solution. However, that paper assumes that the covariance matrix of the Gaussian distribution is diagonal, making the method more suitable for separable functions whose parameters are independent. Additionally, its theoretical analysis assumes that all function values are bounded, which is an uncommon assumption in standard SOO and MOO convergence analysis. In this paper, a stochastic zeroth-order multi-gradient algorithm (SZMG) is proposed under the milder assumption that at least one objective function is bounded. As a zeroth-order variant of the method presented in [24], SZMG uses zeroth-order techniques to estimate gradients, addressing the issue of unavailable function gradients. The theoretical analysis of the SZMG algorithm is established under non-convex settings and mild assumptions. Numerical experiments demonstrate that the SZMG algorithm effectively optimizes both separable and non-separable functions.
The contributions of this paper are as follows:
  • This paper introduces a new gradient-free MOO algorithm, the stochastic zeroth-order multi-gradient algorithm (SZMG), which applies the zeroth-order optimization method to multi-objective optimization problems.
  • The convergence rate of $O(\sqrt{d/T})$ is established for the SZMG algorithm under non-convex settings and mild assumptions, where $d$ is the dimensionality of the model parameters and $T$ is the number of iterations.
  • Numerical experiments demonstrate the effectiveness of the SZMG algorithm.
The notations used in this paper are summarized in Table 2.

2. Related Work

In this section, the development of gradient-based and gradient-free MOO methods is briefly reviewed.

2.1. Gradient-Based Deterministic MOO

Gradient-based MOO methods can be divided into two categories. The first is the scalarization method, as seen in References [15,16,26], which converts MOO problems into SOO problems by pre-selecting weights; however, pre-selecting weights is a challenging task and does not resolve the issue of conflicting gradients. The other category is non-scalarization methods, which extend the ideas of SOO methods by finding a direction that optimizes all objective functions simultaneously. Examples include the steepest descent method [27], the proximal method [28], Newton's method [29], the trust-region method [30], and the conjugate gradient method [31]. Reference [18] proposes a dynamic weighting method called MGDA, which determines a descent direction in each iteration by solving a quadratic subproblem. Since then, various MGDA variants have been studied, such as methods for obtaining Pareto distributions [19,20] and for minimizing the average loss [32]. However, the above studies focus on the deterministic case; in practical applications, the stochastic and gradient-free settings remain a challenge.

2.2. Gradient-Based Stochastic MOO

In recent years, stochastic versions of MGDA have been studied. Reference [21] proposed SMGDA, but did not identify an important property of SMGDA: the stochastic multi-gradient direction is biased. This property was identified in Reference [22], which provided a convergence analysis under strongly convex and convex settings by adopting a linearly increasing batch size. However, a linearly increasing batch size is impractical. To tackle this challenge, various methods have been proposed to reduce the stochastic multi-gradient bias, such as momentum mechanisms [23,33], SGD-type methods [24], and double sampling [34,35]. However, these works all focus on situations where gradients are available; when the gradients of the functions cannot be obtained, these methods become ineffective.

2.3. Gradient-Free MOO

Many algorithms study black-box MOO problems. Bayesian optimization algorithms, as seen in [36,37], are better suited to low-dimensional, expensive MOO problems. Another category is heuristic algorithms, such as [6,7,38]; these methods generally explore the entire Pareto set, are computationally expensive, and typically lack convergence guarantees. Recently, Reference [25] proposed a Hessian-aware zeroth-order optimization method, but that paper mainly focused on the convex setting. Reference [13] proposed a multi-objective evolutionary strategy algorithm that iteratively updates a Gaussian distribution to find a Pareto solution. However, it assumes that the covariance matrix of the Gaussian distribution is diagonal, which is more suitable for separable functions, and its theoretical convergence guarantees require all function values to be bounded, which is uncommon in standard SOO and MOO theoretical analysis. In this paper, we propose a zeroth-order multi-objective optimization algorithm (SZMG) that adopts a milder assumption, and its theoretical analysis is established under non-convex settings. In the numerical experiments, the SZMG algorithm demonstrates effectiveness in optimizing both separable and non-separable functions.

3. Preliminaries

In this section, the basic concept of MOO problems, the MGDA algorithm, and the zeroth-order gradient estimation methods are introduced.

3.1. Problem Setup

This paper addresses the following stochastic MOO problem:
$$\min_z H(z) = \big(\mathbb{E}_w[h_1(z,w)], \ldots, \mathbb{E}_w[h_M(z,w)]\big), \tag{1}$$
where $h_i : \mathbb{R}^d \to \mathbb{R}$ is the $i$-th continuously differentiable non-convex objective with $h_i(z) = \mathbb{E}_w[h_i(z,w)]$, $z$ is a $d$-dimensional decision variable, $w$ is a random variable, and $M$ is the number of objective functions. In this setting, it is assumed that the gradients of the functions are not available and only function values can be queried. Generally, the goal of MOO is to seek Pareto optimality, i.e., a solution that no other solution dominates.
Definition 1 
(Pareto optimality). (a) For any two solutions $z_1, z_2 \in \mathbb{R}^d$, solution $z_1$ is said to dominate solution $z_2$ if $h_i(z_1) \le h_i(z_2)$ for all $i \in \{1, 2, \ldots, M\}$ and $H(z_1) \ne H(z_2)$. (b) A solution $z^*$ is called Pareto-optimal if no other solution dominates it.
For multi-objective non-convex functions, the goal is generally to find a Pareto stationary point, which is a necessary condition for Pareto optimality [23,24,33,39].
Definition 2 
(Pareto stationary point). A point $z$ is a Pareto stationary point if
$$\mathrm{range}\big(\nabla H(z)^\top\big) \cap \big(-\mathbb{R}_{++}^M\big) = \emptyset,$$
where $\nabla H(z) := \big(\nabla h_1(z), \nabla h_2(z), \ldots, \nabla h_M(z)\big)$ and $\mathbb{R}_{++}^M$ is the positive orthant cone. Equivalently, by Gordan's theorem, $z$ is Pareto stationary if and only if $\min_{\lambda}\{\|\nabla H(z)\lambda\|^2 : \lambda \ge 0,\ \mathbf{1}^\top\lambda = 1\} = 0$, which is exactly the quantity bounded in the convergence analysis below.

3.2. Multi-Gradient Descent Algorithm

The multi-gradient descent algorithm is a first-order gradient algorithm that aims to avoid conflicting gradients between functions. In each iteration, the algorithm first computes the gradients of the objectives and then obtains the descent direction $\nabla H(z)\lambda^*(z)$, where the weights $\lambda^*(z) = (\lambda_1^*(z), \ldots, \lambda_M^*(z))$ are acquired by solving the following subproblem:
$$\lambda^*(z) \in \arg\min_{\lambda}\ \|\nabla H(z)\lambda\|^2 \quad \mathrm{s.t.}\quad \lambda \in \mathcal{M} := \left\{\lambda \in \mathbb{R}^M \mid \mathbf{1}^\top\lambda = 1,\ \lambda \ge 0\right\}. \tag{2}$$
Then the parameter $z$ is updated by the following equation:
$$z_{t+1} = z_t - \eta_t\, d(z_t), \tag{3}$$
where $d(z_t) = \nabla H(z_t)\lambda^*(z_t)$ is the descent direction at iteration $t$ and $\eta_t$ is the corresponding learning rate.
The standard MGDA is given in Algorithm 1, as follows:
Algorithm 1 Multi-gradient descent algorithm (MGDA)
1: Input: initial model parameter $z_0$, learning rate $\eta$.
2: for $t = 0, 1, \ldots, T-1$ do
3:     for objective function $i = 1, 2, \ldots, M$ do
4:         Calculate gradient $\nabla h_i(z_t)$.
5:     end for
6:     Compute weights $\lambda^*(z_t)$ following (2).
7:     Update model parameters $z_{t+1}$ following (3).
8: end for
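As a concrete illustration (our sketch, not code from the paper), for $M = 2$ objectives the subproblem (2) admits a well-known closed-form solution, so one MGDA iteration takes only a few lines of NumPy; the helper `grads_fn`, assumed here to return the two exact gradients, is illustrative.

```python
import numpy as np

def mgda_weights_two_objectives(g1, g2):
    """Closed-form solution of subproblem (2) for M = 2.

    Minimizes ||lam*g1 + (1-lam)*g2||^2 over lam in [0, 1]:
        lam = clip(<g2 - g1, g2> / ||g1 - g2||^2, 0, 1).
    """
    diff = g1 - g2
    denom = float(np.dot(diff, diff))
    if denom == 0.0:              # identical gradients: any weight works
        return 0.5
    lam = float(np.dot(g2 - g1, g2)) / denom
    return float(np.clip(lam, 0.0, 1.0))

def mgda_step(z, grads_fn, eta):
    """One iteration of Algorithm 1: weights via (2), update via (3)."""
    g1, g2 = grads_fn(z)                  # exact gradients of h1, h2
    lam = mgda_weights_two_objectives(g1, g2)
    d = lam * g1 + (1.0 - lam) * g2       # common descent direction
    return z - eta * d
```

For $M > 2$, subproblem (2) is a small quadratic program over the simplex and is typically solved with a generic QP solver or Frank–Wolfe-type iterations.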

3.3. Zeroth-Order Optimization

ZOO is a method for approximating the gradient of a function by querying function values. Its fundamental idea is to use finite differences to approximate the directional derivatives of a function $h(z)$ along certain directions. In this paper, the Gaussian gradient estimator is utilized to approximate the gradients. For any function $h_i(z)$, the Gaussian-smoothed version of $h_i(z)$ is as follows:
$$h_i^\nu(z) = \mathbb{E}_u[h_i(z + \nu u)], \tag{4}$$
where $u$ follows the standard Gaussian distribution and $\nu > 0$ is a smoothing parameter. The gradient of $h_i^\nu(z)$ can be expressed as follows:
$$\nabla h_i^\nu(z) = \mathbb{E}_u\left[\frac{h_i(z + \nu u)}{\nu}\, u\right] \tag{5}$$
$$\hphantom{\nabla h_i^\nu(z)} = \mathbb{E}_u\left[\frac{h_i(z + \nu u) - h_i(z)}{\nu}\, u\right], \tag{6}$$
where Equation (5) is the single-point estimation and Equation (6) is the two-point estimation. In general, Equation (5) has a greater gradient variance than Equation (6) [40,41]. In this paper, two-point estimation is adopted to approximate the gradient.
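As a concrete illustration of the two-point estimator in Equation (6), the following NumPy sketch (ours, with an assumed callable `h` returning a scalar) averages the finite-difference term over $q$ independent Gaussian directions:

```python
import numpy as np

def two_point_gaussian_gradient(h, z, nu, q=1, rng=None):
    """Two-point Gaussian estimate of grad h^nu(z), following Equation (6).

    Averages (h(z + nu*u) - h(z)) / nu * u over q directions u ~ N(0, I_d).
    """
    rng = np.random.default_rng() if rng is None else rng
    base = h(z)                    # one shared query at the base point
    g = np.zeros_like(z, dtype=float)
    for _ in range(q):
        u = rng.standard_normal(z.shape[0])
        g += (h(z + nu * u) - base) / nu * u
    return g / q
```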

4. Stochastic Zeroth-Order Multi-Gradient Algorithm

To address the challenge of unavailable gradients, this paper applies the ZOO technique to MOO problems. The multi-objective problem (1) is converted into a Gaussian smoothing multi-objective problem, as follows:
$$\min_z H^\nu(z) = \big(h_1^\nu(z), \ldots, h_M^\nu(z)\big), \tag{7}$$
where $h_i^\nu(z) = \mathbb{E}_{u,w}[h_i(z + \nu u, w)]$. Following Equation (6), the stochastic gradient of the function $h_i^\nu(z)$ is defined as follows:
$$s_i(z, u, w) = \frac{1}{q_1 q_2}\sum_{m=1}^{q_1}\sum_{n=1}^{q_2}\frac{h_i(z + \nu u_m, w_n) - h_i(z, w_n)}{\nu}\, u_m. \tag{8}$$
In this paper, it is assumed that $\mathbb{E}_w[h_i(z, w)] = h_i(z)$, which indicates that the stochastic gradient $s_i(z, u, w)$ is an unbiased estimate of $\nabla h_i^\nu(z)$, due to the following:
$$\mathbb{E}[s_i(z, u, w)] = \frac{1}{q_1 q_2}\,\mathbb{E}_u\!\left[\mathbb{E}_w\!\left[\sum_{m=1}^{q_1}\sum_{n=1}^{q_2}\frac{h_i(z + \nu u_m, w_n) - h_i(z, w_n)}{\nu}\, u_m \,\Big|\, u\right]\right] = \frac{1}{q_1 q_2}\,\mathbb{E}_u\!\left[\sum_{m=1}^{q_1}\sum_{n=1}^{q_2}\frac{h_i(z + \nu u_m) - h_i(z)}{\nu}\, u_m\right] = \nabla h_i^\nu(z). \tag{9}$$
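A direct transcription of the estimator in Equation (8) is sketched below. It assumes a stochastic oracle `h_i(z, w)` that can be evaluated at a common sample $w$ for both query points; representing each $w_n$ by an integer seed is purely an illustrative device, not part of the paper's formulation.

```python
import numpy as np

def zeroth_order_stochastic_gradient(h_i, z, nu, q1, q2, rng=None):
    """Sketch of s_i(z, u, w) from Equation (8).

    Draws q1 Gaussian directions u_m and q2 samples w_n, and reuses
    each w_n at both z + nu*u_m and z so the noise cancels in the
    finite difference.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = z.shape[0]
    samples = [int(rng.integers(1 << 30)) for _ in range(q2)]  # stands in for w_n
    g = np.zeros(d)
    for _ in range(q1):
        u = rng.standard_normal(d)
        for w in samples:
            g += (h_i(z + nu * u, w) - h_i(z, w)) / nu * u
    return g / (q1 * q2)
```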
According to Equation (9), an unbiased stochastic gradient of the function $h_i^\nu(z)$ can be obtained by querying only function values. To avoid potential conflicts in gradient directions, such as $\langle s_i(z,u,w),\, s_j(z,u,w)\rangle < 0$, an MGDA-type method is adopted to determine the weight values by solving the following problem:
$$\min_{\lambda}\ \|\nabla H^\nu(z)\lambda\|^2 \quad \mathrm{s.t.}\quad \lambda \in \mathcal{M} := \left\{\lambda \in \mathbb{R}^M \mid \mathbf{1}^\top\lambda = 1,\ \lambda \ge 0\right\}, \tag{10}$$
where $\nabla H^\nu(z) := \big(\nabla h_1^\nu(z), \ldots, \nabla h_M^\nu(z)\big)$.
According to References [22,33], replacing the deterministic gradients $\nabla H^\nu(z)$ with the stochastic gradients $S_\nu(z,u,w)$ in Equation (10) and directly solving for the weights introduces a bias, i.e., $\mathbb{E}\big[S_\nu(z,u,w)\lambda_\nu^*(z,u,w)\big] \ne \nabla H^\nu(z)\lambda_\nu^*(z)$. To reduce this bias, an SGD-type method from [24] is adopted to obtain the optimal weight value $\lambda_\nu^*$.
The method for updating the weight λ t is as follows:
$$\lambda_{t,k+1} = \Pi_{\mathcal{M}}\left(\lambda_{t,k} - \beta_{t,k}\, S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\,\lambda_{t,k}\right), \tag{11}$$
where $(u_t, w_t)$ and $(u_t', w_t')$ are independent samples, $\beta_{t,k}$ is the learning rate for the weight $\lambda_t$ at the $k$-th inner iteration, and $\Pi_{\mathcal{M}}$ denotes the projection onto $\mathcal{M}$. The stochastic gradient $S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\lambda_{t,k}$ is an unbiased estimate of $\nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\lambda_{t,k}$, i.e., of the gradient of $\frac{1}{2}\|\nabla H^\nu(z_t)\lambda\|^2$ at $\lambda_{t,k}$, because
$$\mathbb{E}\left[S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\,\lambda_{t,k}\right] = \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\,\lambda_{t,k}. \tag{12}$$
After obtaining the weights, the decision variable $z_t$ is updated by the following equation:
$$z_{t+1} = z_t - \eta_t\, S_\nu(z_t,u_t,w_t)\,\lambda_{t,K}, \tag{13}$$
where $\eta_t$ is the learning rate and $K$ is the number of inner iterations for the weight update.
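The projection $\Pi_{\mathcal{M}}$ is not spelled out in the text; any Euclidean projection onto the probability simplex works. The sketch below (ours) uses the standard sorting-based projection and applies one weight update (11), where `S` and `S_prime` denote the assumed $d \times M$ matrices whose columns are the zeroth-order gradient estimates from two independent samples:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {lam : lam >= 0, 1^T lam = 1}."""
    u = np.sort(v)[::-1]                     # sort in decreasing order
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.shape[0] + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def weight_update(lam, S, S_prime, beta):
    """One projected-SGD step on the weights, following Equation (11)."""
    grad = S.T @ (S_prime @ lam)   # unbiased for the objective in (10), by (12)
    return project_to_simplex(lam - beta * grad)
```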
The proposed SZMG algorithm is given in Algorithm 2.
Algorithm 2 Stochastic zeroth-order multi-gradient algorithm (SZMG)
1: Input: initial model parameter $z_0$, weights $\lambda_0$, learning rates $\{\eta_t\}_{t=0}^{T}$, $\{\beta_t\}_{t=0}^{T}$.
2: for $t = 0, 1, \ldots, T-1$ do
3:     Set weights $\lambda_{t,0} = \lambda_{t-1,K}$.
4:     for $k = 0, 1, \ldots, K-1$ do
5:         Calculate zeroth-order gradients $S_\nu(z_t,u_t,w_t)$, $S_\nu(z_t,u_t',w_t')$.
6:         Update weights $\lambda_{t,k+1}$ following (11).
7:     end for
8:     Calculate zeroth-order gradient $S_\nu(z_t,u_t,w_t)$.
9:     Update model parameters $z_{t+1}$ following (13).
10: end for
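Combining the pieces, a compact sketch of Algorithm 2 is given below (ours, reusing `project_to_simplex` from the sketch above). The oracle interface `h_i(z, w)`, the seed-based sampling of $w$, and the simplification $q_2 = 1$ are our assumptions, not part of the original algorithm description:

```python
import numpy as np

def szmg(h_list, z0, T, K, eta, c, nu, q1=1, rng=None):
    """Sketch of Algorithm 2 (SZMG) with beta_{t,k} = c / sqrt(k)."""
    rng = np.random.default_rng() if rng is None else rng
    z, M, d = z0.astype(float), len(h_list), z0.shape[0]
    lam = np.ones(M) / M                      # lambda_0

    def S(z):  # d x M matrix of zeroth-order stochastic gradients, Eq. (8)
        cols = []
        for h in h_list:
            g = np.zeros(d)
            for _ in range(q1):
                u = rng.standard_normal(d)
                w = int(rng.integers(1 << 30))      # fresh sample for w
                g += (h(z + nu * u, w) - h(z, w)) / nu * u
            cols.append(g / q1)
        return np.stack(cols, axis=1)

    for t in range(T):
        for k in range(1, K + 1):             # inner loop: weights, Eq. (11)
            grad = S(z).T @ (S(z) @ lam)      # two independent estimates
            lam = project_to_simplex(lam - (c / np.sqrt(k)) * grad)
        z = z - eta * (S(z) @ lam)            # outer update, Eq. (13)
    return z, lam
```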

5. Convergence Analysis

In this section, some standard assumptions are introduced first. Then, the error bound between the gradient $S_\nu(z_t,u_t,w_t)\lambda_{t,K}$ and the gradient $\nabla H(z_t)\lambda^*(z_t)$ is analyzed. Finally, the convergence rate of the SZMG algorithm is established. All proofs are given in Appendix B.
Assumption 1. 
For any $z \in \mathbb{R}^d$, the estimate of $h_i(z)$ is unbiased, i.e., $\mathbb{E}_w[h_i(z,w)] = h_i(z)$.
This assumption ensures that Equation (9) holds, making $s_i(z,u,w)$ an unbiased estimate of the gradient $\nabla h_i^\nu(z)$. It is also a standard assumption in stochastic zeroth-order optimization [12,40,42].
Assumption 2. 
All functions $h_1(z,w), \ldots, h_M(z,w)$ are $l$-Lipschitz continuous for any $w$, and $h_1(z), \ldots, h_M(z)$ are $L$-smooth, i.e.,
$$\|\nabla h_i(z_1, w)\| \le l,$$
$$h_i(z_1) \le h_i(z_2) + \left\langle \nabla h_i(z_2),\ z_1 - z_2 \right\rangle + \frac{L}{2}\|z_1 - z_2\|^2,$$
for any $z_1, z_2 \in \mathbb{R}^d$.
Applying Jensen’s inequality, we have the following:
$$\|\nabla h_i(z)\|^2 \le \mathbb{E}\|\nabla h_i(z,w)\|^2 \le l^2,$$
which implies that the functions $h_1(z), \ldots, h_M(z)$ are $l$-Lipschitz continuous.
This is a standard assumption in MOO problems [23,24]. Here, the assumption that $h_i(z,w)$ is $l$-Lipschitz continuous ensures that the gradient estimate $s_i(z,u,w)$ is bounded for all $i \in \{1,\ldots,M\}$, which is necessary to control the drift of the optimal weight $\lambda$ and thereby establish convergence guarantees in the non-convex setting. This assumption is also used for zeroth-order stochastic optimization problems in [12,43]. The $L$-smoothness of the functions is a standard assumption for establishing convergence in the non-convex setting [44].
Assumption 3. 
For some $\lambda \in \mathcal{M}$, there exists a constant $C_H$ such that $H^\nu(z_0)\lambda - \min_z H^\nu(z)\lambda \le C_H$, where $H^\nu(z)\lambda = \sum_{i=1}^{M}\lambda_i h_i^\nu(z)$.
This assumption implies that at least one objective function has a bounded gap at the initial point, e.g., $h_i^\nu(z_0) - \min_z h_i^\nu(z) \le C_H$, so the distance between the initial point and a Pareto-optimal solution is finite. From this, it can be inferred that the algorithm can converge to a Pareto solution within a reasonable amount of time. Compared with [13], which requires the function values to be bounded for all objective functions at every point $z$, this is a milder assumption.
These assumptions are typically reasonable in the context of black-box adversarial attacks on an image classification neural network model. For example, in [45], the sparse adversarial attack is transformed into a multi-objective black-box optimization problem. In this context, the aforementioned assumptions hold.
The error upper bound between $S_\nu(z_t,u_t,w_t)\lambda_{t,K}$ and $\nabla H(z_t)\lambda^*(z_t)$ is established as follows:
Proposition 1. 
Suppose Assumptions 1 and 2 are satisfied. If we set $\nu = \Theta\big(\tfrac{1}{\sqrt{T}\,d}\big)$, $K = T^2$, $\sqrt{d} \le q_1 q_2 < d$, and $\beta_{t,k} = \tfrac{c}{\sqrt{k}}$, where $c > 0$ is a constant, we have the following:
$$\mathbb{E}\left\|\mathbb{E}\left[S_\nu(z_t,u_t,w_t) \mid z_t\right]\lambda_{t,K} - \nabla H(z_t)\lambda_t^*\right\|^2 \le \frac{c(\log(K)+2)}{\sqrt{K}}\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2 + (\log(K)+2)\left(\frac{1}{c\sqrt{K}} + 2\nu M L l (d+3)^{3/2}\right) + \frac{M\nu^2}{4}L^2(d+3)^3 = O\!\left(\frac{\sqrt{d}\,\log(K)}{K^{1/4}}\right).$$
Proposition 1 shows that the error bound converges to 0 as $K \to \infty$. The following theorem shows that SZMG converges to a Pareto stationary point under a non-convex setting and mild assumptions:
Theorem 1. 
Suppose Assumptions 1, 2, and 3 are satisfied. If we set $\eta = \Theta(T^{-1/2})$, $\nu = \Theta\big(\tfrac{1}{\sqrt{T}\,d}\big)$, $K = \Theta(T^2)$, $\sqrt{d} \le q_1 q_2 < d$, and $\beta_{t,k} = \tfrac{c}{\sqrt{k}}$, where $c > 0$ is a constant, we have the following:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla H(z_t)\lambda_t^*\|^2 \le \frac{2C_H}{T\eta} + 4l\sqrt{\frac{1+\log(T)}{cT}} + \frac{M\nu^2}{2}L^2(d+3)^3 + \frac{M\nu^2}{2q_1q_2}L^2(d+6)^3\left(2l\sqrt{\frac{2c+2c\log(T)}{T}} + L\eta\right) + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\left(2l\sqrt{\frac{2c+2c\log(T)}{T}} + L\eta\right) = O\!\left(\frac{d\log(T)}{q_1q_2\sqrt{T}}\right).$$
Theorem 1 shows that the proposed SZMG algorithm converges to a Pareto stationary point at a rate of $O\big(\tfrac{d\log(T)}{q_1q_2\sqrt{T}}\big)$. By choosing the parameter $q_1q_2 = \sqrt{d}$ and ignoring the $\log(T)$ term, which grows very slowly, the convergence rate of the SZMG algorithm reaches $O(\sqrt{d/T})$, matching the rate in the zeroth-order stochastic single-objective non-convex setting [12]. To achieve an $\epsilon$-optimal solution, the query complexity for each objective function is bounded by $O(d^{9/2}/\epsilon^6)$.
To reduce the function query complexity, the error between $S_\nu(z_t,u_t,w_t)\lambda_{t,K}$ and $\nabla H(z_t)\lambda^*(z_t)$ can be left uncontrolled by setting $K = 1$. The following theorem shows that the query complexity for converging to an $\epsilon$-optimal solution is then bounded by $O(d^{3/2}/\epsilon^2)$.
Theorem 2. 
Suppose Assumptions 1–3 are satisfied. If we set $\eta = \Theta(T^{-1/2})$, $\beta = \Theta(T^{-1/2}d^{-1/2})$, $\nu = \Theta(T^{-1/2}d^{-1})$, $\sqrt{d} \le q_1 q_2 < d$, and $K = 1$, we have the following:
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla H(z_t)\lambda_t^*\|^2 \le \frac{2C_H}{T\eta} + \beta\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2 + \frac{2}{T\beta} + \eta L\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right) + \frac{M\nu^2}{2}L^2(d+3)^3 = O\!\left(\frac{d^{3/2}}{T^{1/2}(q_1q_2)^2}\right).$$
If we set the parameter $q_1q_2 = \sqrt{d}$, Theorem 2 shows that the SZMG algorithm converges to a Pareto stationary point at a rate of $O(\sqrt{d/T})$. To achieve an $\epsilon$-optimal solution, i.e., $\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla H(z_t)\lambda_t^*\|^2 \le \epsilon$, roughly $T = O(d/\epsilon^2)$ iterations suffice, and with $O(q_1q_2) = O(\sqrt{d})$ queries per iteration, the query complexity for each function is bounded by $O(d^{3/2}/\epsilon^2)$.

6. Experiments

In this section, several synthetic experiments are conducted to verify the performance of the proposed SZMG algorithm.

6.1. Toy Example

To demonstrate that the proposed SZMG algorithm achieves better performance in stochastic multi-objective black-box settings, the first synthetic experiment is a two-dimensional problem used in [34]. The objective functions are defined as follows:
$$h_1(z) = f_1(z)\,g_1(z) + f_2(z)\,c_1(z),$$
$$h_2(z) = f_1(z)\,g_2(z) + f_2(z)\,c_2(z),$$
where the functions $f_1(z)$, $f_2(z)$, $g_1(z)$, $g_2(z)$, $c_1(z)$, and $c_2(z)$ are defined by the following equations:
$$\begin{aligned} f_1(z) &= \max\big(\tanh(0.5\,z_2),\ 0\big), \\ f_2(z) &= \max\big(\tanh(-0.5\,z_2),\ 0\big), \\ c_1(z) &= \big((z_1 + 3.5)^2 + 0.1\,(z_2 - 1)^2\big)/10 - 20 - 2 z_1 w_1 - 5.5 z_2 w_2, \\ c_2(z) &= \big((z_1 - 3.5)^2 + 0.1\,(z_2 - 1)^2\big)/10 - 20 + 2 z_1 w_1 - 5.5 z_2 w_2, \\ g_1(z) &= \log\big(\max\big(|0.5\,(z_1 - 7) - \tanh(z_2)|,\ 0.000005\big)\big) + 6, \\ g_2(z) &= \log\big(\max\big(|0.5\,(z_1 + 3) + \tanh(z_2) + 2|,\ 0.000005\big)\big) + 6, \end{aligned}$$
where $z = [z_1, z_2] \in \mathbb{R}^2$ is the model parameter and $w = [w_1, w_2] \in \mathbb{R}^2$ represents stochastic data following the standard Gaussian distribution.
The following three initial points are selected:
$$z_0 \in \{(8.5,\ 7.5),\ (10,\ 3),\ (9,\ 9)\}.$$
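For reference, the toy problem can be coded directly from the definitions above (our sketch; the sign conventions follow the reconstruction given there):

```python
import numpy as np

def toy_objectives(z, w):
    """Stochastic toy objectives h1(z), h2(z) of Section 6.1.

    z is the 2-D parameter; w is a 2-D standard Gaussian sample.
    """
    z1, z2 = float(z[0]), float(z[1])
    w1, w2 = float(w[0]), float(w[1])
    f1 = max(np.tanh(0.5 * z2), 0.0)
    f2 = max(np.tanh(-0.5 * z2), 0.0)
    g1 = np.log(max(abs(0.5 * (z1 - 7) - np.tanh(z2)), 5e-6)) + 6
    g2 = np.log(max(abs(0.5 * (z1 + 3) + np.tanh(z2) + 2), 5e-6)) + 6
    c1 = ((z1 + 3.5) ** 2 + 0.1 * (z2 - 1) ** 2) / 10 - 20 \
         - 2 * z1 * w1 - 5.5 * z2 * w2
    c2 = ((z1 - 3.5) ** 2 + 0.1 * (z2 - 1) ** 2) / 10 - 20 \
         + 2 * z1 * w1 - 5.5 * z2 * w2
    return f1 * g1 + f2 * c1, f1 * g2 + f2 * c2
```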
The optimization trajectories of ZO-AdaMU (zeroth-order adaptive momentum and uncertainty) [10], BES (Bernoulli smoothing) [11], RSGF (randomized stochastic gradient-free) [12], ASMG (adaptive stochastic multi-objective gradient) [13], MGDA [18], and SZMG are plotted in Figure 1. RSGF uses the Gaussian gradient estimator, and BES adopts a variant based on the Bernoulli distribution to estimate the gradient. ZO-AdaMU adopts the Gaussian gradient estimator while adapting its perturbations with momentum and uncertainty. These three algorithms primarily target single-objective problems; in the multi-objective setting, the objectives are aggregated into a single objective with equal weights. ASMG is an adaptive multi-objective gradient algorithm based on evolutionary strategies, and MGDA is a first-order gradient algorithm. All algorithms adopt the Adam optimizer, and the number of iterations is T = 70,000. The other hyperparameters are summarized in Table 3.
In Figure 1, the results show that SZMG and ASMG can effectively optimize stochastic multi-objective black-box functions while addressing the conflicting-gradient problem: all three initial points converge to the Pareto front. In contrast, the ZO-AdaMU, BES, and RSGF methods become trapped in a valley for the initial points (8.5, 7.5) and (9, 9) and fail to reach the Pareto front due to the dominant gradient of a specific objective.
Figure 2 examines the impact of the number of inner iterations $K$ on the error between the gradients $S_\nu(z_t,u_t,w_t)\lambda_{t,K}$ and $\nabla H(z_t)\lambda^*(z_t)$. The gradient deviation of each trajectory is measured by the average norm $\frac{1}{T}\sum_{t=1}^{T}\|S_\nu(z_t,u_t,w_t)\lambda_{t,K} - \nabla H(z_t)\lambda^*(z_t)\|$. The results show that as $K$ increases, the gradient deviation becomes smaller, which validates Proposition 1. In Figure 1, the optimization trajectories of the SZMG algorithm are presented for different numbers of inner iterations $K$.

6.2. Separable and Non-Separable Functions

To validate the performance of the proposed SZMG algorithm under separable and non-separable functions, SZMG is compared to other algorithms using synthetic functions in 500 and 1000 dimensions.
For separable functions, we have the following:
$$H(z) = \big(h_1(z),\ h_2(z)\big) = \left(\sum_{i=1}^{d} 10^{\frac{2(i-1)}{d-1}} z_i^2,\quad 10d - \sum_{i=1}^{d} 10\cos\!\big(2\pi\, 10^{\frac{i-1}{d-1}} z_i\big) + \sum_{i=1}^{d}\big(10^{\frac{i-1}{d-1}} z_i\big)^2\right).$$
For non-separable functions, we have the following:
$$H(x) = H(Rz) = \big(h_1(x),\ h_2(x)\big),$$
where $h_1(x) = \sum_{i=1}^{d}\big(x_i^2 - 10\cos(2\pi x_i) + 10\big)$, $h_2(x) = \frac{1}{d}\sum_{i=1}^{d-1}\big(100(x_i^2 - x_{i+1})^2 + (x_i - 4)^2\big)$, and $R$ is an orthogonal matrix generated by applying the Gram–Schmidt procedure to a random matrix with standard normal entries.
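Both families are easy to instantiate. The sketch below follows the definitions above, substituting a QR factorization of a Gaussian matrix for the explicit Gram–Schmidt procedure (the two are equivalent up to column signs); all function names are ours:

```python
import numpy as np

def separable_pair(z):
    """Separable objectives: scaled ellipsoid h1 and scaled Rastrigin-type h2."""
    d = z.shape[0]
    s = 10.0 ** (np.arange(d) / (d - 1))            # 10^{(i-1)/(d-1)}
    h1 = float(np.sum((s * z) ** 2))                # = sum 10^{2(i-1)/(d-1)} z_i^2
    h2 = float(10 * d - np.sum(10 * np.cos(2 * np.pi * s * z))
               + np.sum((s * z) ** 2))
    return h1, h2

def nonseparable_pair(z, R):
    """Non-separable objectives obtained by rotating z with orthogonal R."""
    x = R @ z
    h1 = float(np.sum(x ** 2 - 10 * np.cos(2 * np.pi * x) + 10))
    h2 = float(np.sum(100 * (x[:-1] ** 2 - x[1:]) ** 2
                      + (x[:-1] - 4) ** 2) / z.shape[0])
    return h1, h2

def random_orthogonal(d, rng=None):
    """Orthogonal matrix from the QR factorization of a Gaussian matrix."""
    rng = np.random.default_rng() if rng is None else rng
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q
```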
The SZMG algorithm is compared with ZO-AdaMU [10], BES [11], RSGF [12], and ASMG [13]. For the ZO-AdaMU, BES, RSGF, and SZMG methods, the learning rate is selected from {0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001}, and the smoothing parameter $\nu$ is set to 0.00001. In the SZMG algorithm, the weight learning rate $\beta$ is set to 0.005, and the number of inner iterations $K$ is set to 1. The ASMG algorithm selects its learning rate from {5, 1, 0.5, 0.1, 0.05}. The total number of iterations $T$ and the sample batch size $q_1$ are set to 5000 and 50, respectively, for all algorithms.
Figure 3 and Figure 4 present the convergence results of the five algorithms on the separable function in different dimensions; specifically, the function value is plotted against the number of iterations. The results demonstrate that the proposed SZMG algorithm achieves better performance, while the ASMG algorithm exhibits faster convergence on the separable function. The ZO-AdaMU, BES, and RSGF methods converge to suboptimal Pareto solutions due to conflicting gradients. Figure 5 and Figure 6 present the convergence results of the five algorithms on the non-separable functions in different dimensions; they show that SZMG outperforms ASMG in both convergence speed and final result. Together, these results show that the SZMG algorithm performs well on both separable and non-separable functions, demonstrating its effectiveness.

6.2.1. Runtime

In this subsection, the runtimes of different algorithms on separable functions are compared to evaluate their computational complexity. The runtime of an algorithm is the average time per iteration over 5000 iterations. The experimental results were obtained using a PC equipped with a 3.60-GHz Intel Core i7-7700 CPU (Acer, Luoyang, China).
Figure 7 shows the average time per iteration of ZO-AdaMU, BES, RSGF, ASMG, and SZMG across different dimensions of the separable functions. The BES and RSGF methods have the shortest runtime per iteration because they do not need to compute weights. ZO-AdaMU takes the longest time because it requires multiple uncertainty perturbations internally. SZMG requires an additional gradient estimation and a matrix operation to obtain the weights, making its runtime longer than RSGF's. Compared with ASMG, SZMG has a shorter runtime, which indicates better efficiency.

6.2.2. Scalability

In this subsection, the scalability of SZMG is studied on separable and non-separable functions at 1500 and 3000 dimensions.
Figure 8 and Figure 9 show the convergence results of ZO-AdaMU, BES, RSGF, ASMG, and SZMG on separable functions in 1500 and 3000 dimensions. These results indicate that the SZMG algorithm performs better and can converge to Pareto solutions. Figure 10 and Figure 11 present the convergence results on non-separable functions in 1500 and 3000 dimensions. SZMG again delivers better performance, which suggests that gradient-based optimization methods are better suited to high-dimensional problems and that the SGD-based dynamic weights effectively address conflicting gradients. These results demonstrate the SZMG algorithm's effectiveness.

6.2.3. Sensitiveness to Batch Size

The sample batch size is a key parameter in the proposed SZMG algorithm. A larger batch size can reduce the variance of zeroth-order gradients, leading to a more accurate gradient direction, but it also results in higher time and function query overheads. In this subsection, we study the impact of different batch sizes on the convergence of the SZMG algorithm for non-separable functions.
Figure 12 shows the impact of batch sizes 10, 30, 50, and 70 on the SZMG algorithm for the 1000-dimensional non-separable function. As the batch size increases, the convergence trajectory of the SZMG algorithm becomes more stable and the convergence improves. Figure 13 shows the average runtime per iteration for different batch sizes; the runtime increases linearly with the batch size. Therefore, careful selection of an appropriate batch size is critical for balancing time consumption and stability in the SZMG algorithm.

7. Discussion

In the experiments section, several numerical experiments were conducted to validate the practical performance of the proposed SZMG algorithm. The results show that SZMG delivers promising performance on smooth black-box functions. As a zeroth-order variant of first-order multi-objective optimization algorithms, SZMG exhibits a comparable convergence rate and can serve as a substitute for these algorithms when handling black-box multi-objective optimization problems or scenarios with expensive gradients. Theoretically, we have shown that the convergence rate of SZMG is a factor of $O(\sqrt{d})$ worse than that of first-order multi-objective optimization methods.
Despite its advantages, SZMG has potential limitations such as the curse of dimensionality. As the dimensionality of the problem increases, the number of required function evaluations also rises significantly, leading to a sharp increase in computational costs. Additionally, the theoretical convergence guarantees of SZMG are based on smooth functions, and its performance may degrade when applied to non-smooth functions. These limitations are also important research directions for zeroth-order multi-objective optimization algorithms.
In real-world applications, the SZMG algorithm has a wide range of application scenarios, such as black-box adversarial attacks. In single-objective optimization, zeroth-order algorithms are widely used in black-box attacks and have achieved excellent results [46]. Recently, Reference [45] explored sparse adversarial attacks as multi-objective problems, employing heuristic algorithms to generate adversarial examples. There is significant potential for applying the SZMG algorithm in this field.

8. Conclusions

In this paper, a gradient-free stochastic multi-objective optimization algorithm is proposed for addressing black-box multi-objective optimization problems. The algorithm applies zeroth-order optimization techniques to multi-objective optimization and adopts a dynamic weighting method to avoid conflicting gradients. Theoretically, this paper provides a convergence guarantee for the proposed algorithm under non-convex settings and mild assumptions. Moreover, synthetic experiments demonstrate the effectiveness of the proposed algorithm. In the future, it will be valuable to explore methods for zeroth-order algorithms to handle high-dimensional problems and introduce variance reduction techniques into zeroth-order optimization to reduce gradient variance and achieve better convergence performance.

Author Contributions

Conceptualization, Z.L., Q.W., M.Z., L.W., Y.G. and G.W.; methodology, Z.L. and Q.W.; software, Z.L. and Q.W.; validation, Z.L., Q.W., M.Z., L.W., Y.G. and G.W.; formal analysis, Z.L. and Q.W.; investigation, Z.L., Q.W. and M.Z.; resources, Q.W. and M.Z.; data curation, Z.L., Q.W. and M.Z.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L., Q.W. and M.Z.; visualization, Z.L., Q.W. and M.Z.; supervision, Q.W.; project administration, Q.W.; funding acquisition, Q.W. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) under grant no. 62176113, and in part by the Key Technologies R & D Program of Henan Province under grant no. 242102211024, and in part by the Longmen Laboratory Frontier Exploration Project of Henan Province under grant no. LMQYTSKT035.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Extra Lemmas

In this section, some lemmas are presented for the subsequent proof.
Lemma A1 
([47]). If the functions $h_1(z), \ldots, h_M(z)$ are $l$-Lipschitz continuous and $L$-smooth, then $h_1^\nu(z), \ldots, h_M^\nu(z)$ are also $l$-Lipschitz continuous and $L$-smooth.
Lemma A2. 
If Assumptions 1 and 2 are satisfied, it holds that for all $z \in \mathbb{R}^d$,
$$\|\nabla H^\nu(z)\|^2 \le Ml^2, \tag{A1}$$
$$\|\nabla H^\nu(z) - \nabla H(z)\|^2 \le \frac{M\nu^2}{4}L^2(d+3)^3, \tag{A2}$$
$$\mathbb{E}\|S_\nu(z,u,w) - \nabla H^\nu(z)\|^2 \le \frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9}{q_1q_2}Ml^2. \tag{A3}$$
Proof. 
Based on Lemma A1, we have the following:
$$\|\nabla h_i^\nu(z)\| \le l. \tag{A4}$$
Therefore, the squared norm of the gradient of $H^\nu(z)$ satisfies the following:
$$\|\nabla H^\nu(z)\|^2 = \|(\nabla h_1^\nu(z), \ldots, \nabla h_M^\nu(z))\|^2 \le M\max_i\|\nabla h_i^\nu(z)\|^2 \le Ml^2. \tag{A5}$$
Following Lemma 3 and Theorem 4 in [47], we have the following:
$$\|\nabla h_i^\nu(z) - \nabla h_i(z)\| \le \frac{\nu}{2}L(d+3)^{3/2}, \tag{A6}$$
$$\mathbb{E}\left\|\frac{h_i(z+\nu u, w) - h_i(z, w)}{\nu}\,u\right\|^2 \le \frac{\nu^2}{2}L^2(d+6)^3 + 2(d+4)l^2. \tag{A7}$$
Therefore, we have the following:
$$\|\nabla H^\nu(z) - \nabla H(z)\|^2 \le M\max_i\|\nabla h_i^\nu(z) - \nabla h_i(z)\|^2 \le \frac{M\nu^2}{4}L^2(d+3)^3, \tag{A8}$$
$$\begin{aligned} \mathbb{E}\|s_i(z,u,w) - \nabla h_i^\nu(z)\|^2 &= \frac{1}{q_1^2q_2^2}\,\mathbb{E}\left\|\sum_{m=1}^{q_1}\sum_{n=1}^{q_2}\left(\frac{h_i(z+\nu u_m, w_n) - h_i(z, w_n)}{\nu}\,u_m - \nabla h_i^\nu(z)\right)\right\|^2 \\ &\le \frac{1}{q_1q_2}\max_{m,n}\,\mathbb{E}\left\|\frac{h_i(z+\nu u_m, w_n) - h_i(z, w_n)}{\nu}\,u_m - \nabla h_i^\nu(z)\right\|^2 \\ &\le \frac{1}{q_1q_2}\max_{m,n}\left(\mathbb{E}\left\|\frac{h_i(z+\nu u_m, w_n) - h_i(z, w_n)}{\nu}\,u_m\right\|^2 + \|\nabla h_i^\nu(z)\|^2\right) \\ &\le \frac{\nu^2L^2}{2q_1q_2}(d+6)^3 + \frac{2d+9}{q_1q_2}l^2, \tag{A9} \end{aligned}$$
$$\mathbb{E}\|S_\nu(z,u,w) - \nabla H^\nu(z)\|^2 \le M\max_i\,\mathbb{E}\|s_i(z,u,w) - \nabla h_i^\nu(z)\|^2 \le \frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9}{q_1q_2}Ml^2. \tag{A10}$$
Therefore, the proof is completed. □
Lemma A3. 
Suppose Assumptions 1 and 2 are satisfied. Then the following holds:
$$\mathbb{E}\left\|S_\nu(z,u,w)^\top S_\nu(z,u',w')\right\|^2 \le \left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2. \tag{A11}$$
Proof. 
$$\begin{aligned} \mathbb{E}\left\|S_\nu(z,u,w)^\top S_\nu(z,u',w')\right\|^2 &\le \mathbb{E}\left[\|S_\nu(z,u,w)\|^2\,\|S_\nu(z,u',w')\|^2\right] \\ &= \mathbb{E}\left\|S_\nu(z,u,w) - \nabla H^\nu(z) + \nabla H^\nu(z)\right\|^2\,\mathbb{E}\left\|S_\nu(z,u',w') - \nabla H^\nu(z) + \nabla H^\nu(z)\right\|^2 \\ &= \left(\mathbb{E}\left\|S_\nu(z,u,w) - \nabla H^\nu(z)\right\|^2 + \|\nabla H^\nu(z)\|^2\right)\left(\mathbb{E}\left\|S_\nu(z,u',w') - \nabla H^\nu(z)\right\|^2 + \|\nabla H^\nu(z)\|^2\right) \\ &\le \left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2, \tag{A12} \end{aligned}$$
where the first inequality is due to the Cauchy–Schwarz inequality, the first equality uses the independence of the two samples, and the last inequality follows from Lemma A2 together with $\frac{2d+9}{q_1q_2}Ml^2 + Ml^2 = \frac{2d+9+q_1q_2}{q_1q_2}Ml^2$. □
Lemma A4. 
Suppose Assumptions 1 and 2 are satisfied. Then the following holds:
$$-\mathbb{E}\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_t\right\rangle \le \frac{1}{2\beta_t}\mathbb{E}\left[\|\lambda_t - \lambda\|^2 - \|\lambda_{t+1} - \lambda\|^2\right] + \frac{\beta_t}{2}\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\right\|^2 - \mathbb{E}\|\nabla H^\nu(z_t)\lambda_t\|^2. \tag{A13}$$
Proof. 
Following the update rule for the weights $\lambda$, we have the following:
$$\begin{aligned} \mathbb{E}\|\lambda_{t+1} - \lambda\|^2 &= \mathbb{E}\left\|\Pi_{\mathcal{M}}\big(\lambda_t - \beta_t\,S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\lambda_t\big) - \Pi_{\mathcal{M}}(\lambda)\right\|^2 \\ &\le \mathbb{E}\left\|\lambda_t - \beta_t\,S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\lambda_t - \lambda\right\|^2 \\ &= \mathbb{E}\|\lambda_t - \lambda\|^2 + \beta_t^2\,\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\lambda_t\right\|^2 - 2\beta_t\,\mathbb{E}\left\langle\lambda_t - \lambda,\ \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\lambda_t\right\rangle \\ &\le \mathbb{E}\|\lambda_t - \lambda\|^2 + \beta_t^2\,\mathbb{E}\left[\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\right\|^2\|\lambda_t\|^2\right] - 2\beta_t\,\mathbb{E}\left\langle\lambda_t - \lambda,\ \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\lambda_t\right\rangle \\ &\le \mathbb{E}\|\lambda_t - \lambda\|^2 + \beta_t^2\,\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\right\|^2 - 2\beta_t\,\mathbb{E}\left\langle\lambda_t - \lambda,\ \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\lambda_t\right\rangle, \tag{A14} \end{aligned}$$
where the first inequality is due to the non-expansiveness of the projection onto a convex set, the second inequality results from the Cauchy–Schwarz inequality, and the last inequality follows from $\|\lambda_t\|^2 \le 1$.
For the last term on the right-hand side, we have the following:
$$-2\beta_t\,\mathbb{E}\left\langle\lambda_t - \lambda,\ \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\lambda_t\right\rangle = -2\beta_t\,\mathbb{E}\left\langle\nabla H^\nu(z_t)(\lambda_t - \lambda),\ \nabla H^\nu(z_t)\lambda_t\right\rangle = -2\beta_t\,\mathbb{E}\|\nabla H^\nu(z_t)\lambda_t\|^2 + 2\beta_t\,\mathbb{E}\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_t\right\rangle. \tag{A15}$$
Plugging Equation (A15) into Equation (A14) and rearranging, the following holds:
$$-\mathbb{E}\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_t\right\rangle \le \frac{1}{2\beta_t}\mathbb{E}\left[\|\lambda_t - \lambda\|^2 - \|\lambda_{t+1} - \lambda\|^2\right] + \frac{\beta_t}{2}\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\right\|^2 - \mathbb{E}\|\nabla H^\nu(z_t)\lambda_t\|^2. \tag{A16}$$
Hence, the result of this lemma is obtained. □
Lemma A5. 
Suppose Assumptions 1 and 2 are satisfied, and note that $f(\lambda) = \|\nabla H(z_t)\lambda\|^2$ is a convex function. Set $\lambda_t^* = \arg\min_{\lambda\in\mathcal{M}}\|\nabla H(z_t)\lambda\|^2$ and the learning rate $\beta_{t,k} = \frac{c}{\sqrt{k}}$, where $c > 0$ is a constant. Then, for $K \ge 1$, the following holds:
$$\mathbb{E}[f(\lambda_{t,K})] - f(\lambda_t^*) \le \frac{\log(K)+2}{\sqrt{K}}\left(\frac{1}{c} + c\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2\right) + 2\nu M L l(d+3)^{3/2}\left(\log(K)+2\right). \tag{A17}$$
Proof. 
Following the update rule for the weights $\lambda$, we have the following:
$$\begin{aligned} \mathbb{E}\|\lambda_{t,k+1} - \lambda\|^2 &= \mathbb{E}\left\|\Pi_{\mathcal{M}}\big(\lambda_{t,k} - \beta_{t,k}\,S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t'',w_t'')\lambda_{t,k}\big) - \Pi_{\mathcal{M}}(\lambda)\right\|^2 \\ &\le \mathbb{E}\left\|\lambda_{t,k} - \beta_{t,k}\,S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t'',w_t'')\lambda_{t,k} - \lambda\right\|^2 \\ &= \mathbb{E}\|\lambda_{t,k} - \lambda\|^2 + \beta_{t,k}^2\,\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t'',w_t'')\lambda_{t,k}\right\|^2 - 2\beta_{t,k}\,\mathbb{E}\left\langle\lambda_{t,k} - \lambda,\ \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\lambda_{t,k}\right\rangle, \tag{A18} \end{aligned}$$
where the inequality is due to the non-expansiveness of the projection onto a convex set.
Rearranging the above inequality, the following holds:
$$\begin{aligned} \mathbb{E}\left\langle\lambda_{t,k} - \lambda,\ \nabla H(z_t)^\top\nabla H(z_t)\lambda_{t,k}\right\rangle &\le \mathbb{E}\left\langle\lambda_{t,k} - \lambda,\ \big(\nabla H(z_t)^\top\nabla H(z_t) - \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\big)\lambda_{t,k}\right\rangle \\ &\quad + \frac{1}{2\beta_{t,k}}\mathbb{E}\left[\|\lambda_{t,k} - \lambda\|^2 - \|\lambda_{t,k+1} - \lambda\|^2\right] + \frac{\beta_{t,k}}{2}\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t'',w_t'')\lambda_{t,k}\right\|^2 \\ &\le \mathbb{E}\left[\|\lambda_{t,k} - \lambda\|\left\|\nabla H(z_t)^\top\nabla H(z_t) - \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\right\|\|\lambda_{t,k}\|\right] \\ &\quad + \frac{1}{2\beta_{t,k}}\mathbb{E}\left[\|\lambda_{t,k} - \lambda\|^2 - \|\lambda_{t,k+1} - \lambda\|^2\right] + \frac{\beta_{t,k}}{2}\mathbb{E}\left[\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t'',w_t'')\right\|^2\|\lambda_{t,k}\|^2\right] \\ &\le \underbrace{2\,\mathbb{E}\left\|\nabla H(z_t)^\top\nabla H(z_t) - \nabla H^\nu(z_t)^\top\nabla H^\nu(z_t)\right\|}_{I_1} + \frac{1}{2\beta_{t,k}}\mathbb{E}\left[\|\lambda_{t,k} - \lambda\|^2 - \|\lambda_{t,k+1} - \lambda\|^2\right] + \frac{\beta_{t,k}}{2}\underbrace{\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t'',w_t'')\right\|^2}_{I_2}, \tag{A19} \end{aligned}$$
where the second inequality is due to the Cauchy–Schwarz inequality, and the last inequality follows from $\|\lambda_{t,k} - \lambda\| \le 2$ and $\|\lambda_{t,k}\|^2 \le 1$.
Following the convexity of $f$, i.e., $f(\lambda_1) \ge f(\lambda_2) + \langle\nabla f(\lambda_2),\ \lambda_1 - \lambda_2\rangle$, the following holds:
$$\mathbb{E}\left[f(\lambda_{t,k}) - f(\lambda)\right] \le \frac{1}{2\beta_{t,k}}\mathbb{E}\left[\|\lambda_{t,k} - \lambda\|^2 - \|\lambda_{t,k+1} - \lambda\|^2\right] + \frac{\beta_{t,k}}{2}I_2 + I_1. \tag{A20}$$
Letting $r$ be an element of $\{1, \ldots, K-1\}$ and summing over $k = K-r, \ldots, K$, the following holds:
$$\sum_{k=K-r}^{K}\mathbb{E}\left[f(\lambda_{t,k}) - f(\lambda)\right] \le \frac{1}{2\beta_{t,K-r}}\mathbb{E}\|\lambda_{t,K-r} - \lambda\|^2 + (r+1)I_1 + \frac{I_2}{2}\sum_{k=K-r}^{K}\beta_{t,k} + \sum_{k=K-r+1}^{K}\frac{\mathbb{E}\|\lambda_{t,k} - \lambda\|^2}{2}\left(\frac{1}{\beta_{t,k}} - \frac{1}{\beta_{t,k-1}}\right). \tag{A21}$$
Setting $\beta_{t,k} = \frac{c}{\sqrt{k}}$ and $\lambda = \lambda_{t,K-r}$, we have the following:
$$\begin{aligned} \sum_{k=K-r}^{K}\mathbb{E}\left[f(\lambda_{t,k}) - f(\lambda_{t,K-r})\right] &\le (r+1)I_1 + \frac{I_2 c}{2}\sum_{k=K-r}^{K}\frac{1}{\sqrt{k}} + \frac{1}{c}\left(\sqrt{K} - \sqrt{K-r}\right) \\ &\le (r+1)I_1 + \left(\frac{1}{c} + I_2 c\right)\left(\sqrt{K} - \sqrt{K-r-1}\right) \\ &= (r+1)I_1 + \left(\frac{1}{c} + I_2 c\right)\frac{r+1}{\sqrt{K} + \sqrt{K-r-1}} \\ &\le \left(\frac{1}{c} + I_2 c\right)\frac{r+1}{\sqrt{K}} + (r+1)I_1, \tag{A22} \end{aligned}$$
where the first inequality is due to $\|\lambda_{t,k} - \lambda_{t,K-r}\|^2 \le 2$, and the second inequality follows from $\sum_{k=K-r}^{K}\frac{1}{\sqrt{k}} \le 2\left(\sqrt{K} - \sqrt{K-r-1}\right)$.
Letting $L_r = \frac{1}{r+1}\sum_{k=K-r}^{K}\mathbb{E}[f(\lambda_{t,k})]$, the following holds:
$$\mathbb{E}[f(\lambda_{t,K-r})] \ge L_r - \left(\frac{1}{c} + I_2 c\right)\frac{1}{\sqrt{K}} - I_1. \tag{A23}$$
Following the definition of $L_r$, we have the following:
$$r L_{r-1} = (r+1)L_r - \mathbb{E}[f(\lambda_{t,K-r})] \le r L_r + \left(\frac{1}{c} + I_2 c\right)\frac{1}{\sqrt{K}} + I_1, \tag{A24}$$
and dividing by $r$, the following holds:
$$L_{r-1} \le L_r + \left(\left(\frac{1}{c} + I_2 c\right)\frac{1}{\sqrt{K}} + I_1\right)\frac{1}{r}. \tag{A25}$$
Summing over $r = 1, \ldots, K-1$, the following holds:
$$L_0 = \mathbb{E}[f(\lambda_{t,K})] \le L_{K-1} + \left(\left(\frac{1}{c} + I_2 c\right)\frac{1}{\sqrt{K}} + I_1\right)\sum_{r=1}^{K-1}\frac{1}{r} \le L_{K-1} + \left(\left(\frac{1}{c} + I_2 c\right)\frac{1}{\sqrt{K}} + I_1\right)\left(\log(K) + 1\right), \tag{A26}$$
where the last inequality is due to $\sum_{r=1}^{K-1}\frac{1}{r} \le 1 + \log(K)$.
Now, using Equation (A21) with $r = K-1$, $\beta_{t,k} = \frac{c}{\sqrt{k}}$, and $\lambda = \lambda_t^*$, the following holds:
$$\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[f(\lambda_{t,k}) - f(\lambda_t^*)\right] \le \frac{1}{2K\beta_{t,1}}\mathbb{E}\|\lambda_{t,1} - \lambda_t^*\|^2 + \sum_{k=2}^{K}\frac{\mathbb{E}\|\lambda_{t,k} - \lambda_t^*\|^2}{2K}\left(\frac{1}{\beta_{t,k}} - \frac{1}{\beta_{t,k-1}}\right) + I_1 + \frac{I_2}{2K}\sum_{k=1}^{K}\beta_{t,k} \le \frac{1}{c\sqrt{K}} + \frac{cI_2}{\sqrt{K}} + I_1. \tag{A27}$$
It is then easy to see the following:
$$L_{K-1} - f(\lambda_t^*) = \frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\left[f(\lambda_{t,k}) - f(\lambda_t^*)\right] \le \frac{1}{c\sqrt{K}} + \frac{cI_2}{\sqrt{K}} + I_1. \tag{A28}$$
Plugging Equation (A28) into Equation (A26), the following holds:
$$\mathbb{E}[f(\lambda_{t,K})] - f(\lambda_t^*) \le \left(\left(\frac{1}{c} + I_2 c\right)\frac{1}{\sqrt{K}} + I_1\right)\left(\log(K) + 2\right). \tag{A29}$$
For the term $I_1$, we have the following:
$$\begin{aligned} 2\,\mathbb{E}\left\|\nabla H^\nu(z_t)^\top\nabla H^\nu(z_t) - \nabla H(z_t)^\top\nabla H(z_t)\right\| &= 2\,\mathbb{E}\left\|\nabla H^\nu(z_t)^\top\big(\nabla H^\nu(z_t) - \nabla H(z_t)\big) + \big(\nabla H^\nu(z_t) - \nabla H(z_t)\big)^\top\nabla H(z_t)\right\| \\ &\le 2\,\mathbb{E}\left[\left\|\nabla H^\nu(z_t)\right\|\left\|\nabla H^\nu(z_t) - \nabla H(z_t)\right\|\right] + 2\,\mathbb{E}\left[\left\|\nabla H^\nu(z_t) - \nabla H(z_t)\right\|\left\|\nabla H(z_t)\right\|\right] \\ &\le 2\,\mathbb{E}\left[\left\|\nabla H^\nu(z_t) - \nabla H(z_t)\right\|\left(\left\|\nabla H^\nu(z_t)\right\| + \left\|\nabla H(z_t)\right\|\right)\right] \\ &\le 2\nu M L l(d+3)^{3/2}, \tag{A30} \end{aligned}$$
where the first inequality is due to the triangle inequality, the second inequality follows from the Cauchy–Schwarz inequality, and the last inequality is due to Lemma A2.
Following Lemma A3, for the term $I_2$, the following holds:
$$\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t'',w_t'')\right\|^2 \le \left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2. \tag{A31}$$
Plugging Equations (A30) and (A31) into Equation (A29), the following holds:
$$\mathbb{E}[f(\lambda_{t,K})] - f(\lambda_t^*) \le \frac{\log(K)+2}{\sqrt{K}}\left(\frac{1}{c} + c\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2\right) + 2\nu M L l(d+3)^{3/2}\left(\log(K)+2\right). \tag{A32}$$
Thus, this lemma is proved. □
Lemma A6. 
Suppose Assumptions 1 and 2 are satisfied. For any $\lambda \in \mathcal{M}$, the following holds:
$$\|\nabla H(z_t)\lambda\|^2 \le 2\|\nabla H^\nu(z_t)\lambda\|^2 + \frac{M\nu^2}{2}L^2(d+3)^3. \tag{A33}$$
Proof. 
$$\|\nabla H(z_t)\lambda - \nabla H^\nu(z_t)\lambda\|^2 \le \|\nabla H(z_t) - \nabla H^\nu(z_t)\|^2\|\lambda\|^2 \le \|\nabla H(z_t) - \nabla H^\nu(z_t)\|^2 \le \frac{M\nu^2}{4}L^2(d+3)^3, \tag{A34}$$
where the first inequality follows from the Cauchy–Schwarz inequality, the second is due to $\|\lambda\|^2 \le 1$, and the last is due to Lemma A2.
For the lower bound of the left-hand side, we have the following:
$$\|\nabla H(z_t)\lambda - \nabla H^\nu(z_t)\lambda\|^2 = \|\nabla H(z_t)\lambda\|^2 + \|\nabla H^\nu(z_t)\lambda\|^2 - 2\left\langle\nabla H(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda\right\rangle \ge \frac{1}{2}\|\nabla H(z_t)\lambda\|^2 - \|\nabla H^\nu(z_t)\lambda\|^2, \tag{A35}$$
where the last inequality is due to Young's inequality.
Plugging Equation (A35) into Equation (A34), the following holds:
$$\|\nabla H(z_t)\lambda\|^2 \le 2\|\nabla H^\nu(z_t)\lambda\|^2 + \frac{M\nu^2}{2}L^2(d+3)^3. \tag{A36}$$
Therefore, the result of this lemma is proved completely. □

Appendix B. Proof of the Results in Section 5

Proof of Proposition 1. 
Based on the definition of $S_\nu(z_t,u_t,w_t)$, we have the following:
$$\begin{aligned} \mathbb{E}\left\|\mathbb{E}\left[S_\nu(z_t,u_t,w_t) \mid z_t\right]\lambda_{t,K} - \nabla H(z_t)\lambda_t^*\right\|^2 &= \mathbb{E}\left\|\nabla H^\nu(z_t)\lambda_{t,K} - \nabla H(z_t)\lambda_t^*\right\|^2 \\ &= \mathbb{E}\left\|\nabla H^\nu(z_t)\lambda_{t,K} - \nabla H(z_t)\lambda_{t,K} + \nabla H(z_t)\lambda_{t,K} - \nabla H(z_t)\lambda_t^*\right\|^2 \\ &\le 2\,\underbrace{\mathbb{E}\left\|\nabla H^\nu(z_t)\lambda_{t,K} - \nabla H(z_t)\lambda_{t,K}\right\|^2}_{I_1} + 2\,\underbrace{\mathbb{E}\left\|\nabla H(z_t)\lambda_{t,K} - \nabla H(z_t)\lambda_t^*\right\|^2}_{I_2}, \tag{A37} \end{aligned}$$
where the inequality is due to Young's inequality.
For the term $I_1$, we have the following:
$$\mathbb{E}\left\|\nabla H^\nu(z_t)\lambda_{t,K} - \nabla H(z_t)\lambda_{t,K}\right\|^2 \le \mathbb{E}\left[\left\|\nabla H^\nu(z_t) - \nabla H(z_t)\right\|^2\|\lambda_{t,K}\|^2\right] \le \mathbb{E}\left\|\nabla H^\nu(z_t) - \nabla H(z_t)\right\|^2 \le \frac{M\nu^2}{4}L^2(d+3)^3, \tag{A38}$$
where the first inequality follows from the Cauchy–Schwarz inequality, the second is due to $\|\lambda_{t,K}\|^2 \le 1$, and the last follows from Lemma A2.
For the term $I_2$, we have the following:
$$\mathbb{E}\left\|\nabla H(z_t)\lambda_{t,K} - \nabla H(z_t)\lambda_t^*\right\|^2 = \mathbb{E}\left[\|\nabla H(z_t)\lambda_{t,K}\|^2 + \|\nabla H(z_t)\lambda_t^*\|^2 - 2\left\langle\nabla H(z_t)\lambda_{t,K},\ \nabla H(z_t)\lambda_t^*\right\rangle\right] \le \mathbb{E}\left[\|\nabla H(z_t)\lambda_{t,K}\|^2 - \|\nabla H(z_t)\lambda_t^*\|^2\right], \tag{A39}$$
where the inequality is due to the optimality condition
$$\left\langle\nabla H(z_t)\lambda_{t,K},\ \nabla H(z_t)\lambda_t^*\right\rangle \ge \|\nabla H(z_t)\lambda_t^*\|^2. \tag{A40}$$
Following Lemma A5, the following holds:
$$\mathbb{E}\left[\|\nabla H(z_t)\lambda_{t,K}\|^2 - \|\nabla H(z_t)\lambda_t^*\|^2\right] \le \frac{\log(K)+2}{\sqrt{K}}\left(\frac{1}{c} + c\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2\right) + 2\nu M L l(d+3)^{3/2}\left(\log(K)+2\right). \tag{A41}$$
Plugging $I_1$ and $I_2$ into Equation (A37), the following holds:
$$\mathbb{E}\left\|\mathbb{E}\left[S_\nu(z_t,u_t,w_t) \mid z_t\right]\lambda_{t,K} - \nabla H(z_t)\lambda_t^*\right\|^2 \le \frac{M\nu^2}{4}L^2(d+3)^3 + \frac{\log(K)+2}{\sqrt{K}}\left(\frac{1}{c} + c\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2\right) + 2\nu M L l(d+3)^{3/2}\left(\log(K)+2\right). \tag{A42}$$
Therefore, Proposition 1 is proved completely. □
Proof of Theorem 1. 
Following Lemma A1, it is known that $h_i^\nu(z)$ is $L$-smooth, so
$$\begin{aligned} h_i^\nu(z_{t+1}) &\le h_i^\nu(z_t) + \left\langle\nabla h_i^\nu(z_t),\ z_{t+1} - z_t\right\rangle + \frac{L}{2}\|z_{t+1} - z_t\|^2 \\ &= h_i^\nu(z_t) - \eta_t\left\langle\nabla h_i^\nu(z_t),\ S_\nu(z_t,u_t,w_t)\lambda_{t,K}\right\rangle + \frac{L\eta_t^2}{2}\|S_\nu(z_t,u_t,w_t)\lambda_{t,K}\|^2 \\ &\le h_i^\nu(z_t) - \eta_t\left\langle\nabla h_i^\nu(z_t),\ S_\nu(z_t,u_t,w_t)\lambda_{t,K}\right\rangle + \frac{L\eta_t^2}{2}\|S_\nu(z_t,u_t,w_t)\|^2\|\lambda_{t,K}\|^2 \\ &\le h_i^\nu(z_t) - \eta_t\left\langle\nabla h_i^\nu(z_t),\ S_\nu(z_t,u_t,w_t)\lambda_{t,K}\right\rangle + \frac{L\eta_t^2}{2}\|S_\nu(z_t,u_t,w_t)\|^2, \tag{A43} \end{aligned}$$
where the second inequality follows from the Cauchy–Schwarz inequality, and the last inequality is due to $\|\lambda_{t,K}\|^2 \le 1$.
Taking the expectation conditioned on $z_t$, the following holds:
$$\begin{aligned} \mathbb{E}[h_i^\nu(z_{t+1})] &\le \mathbb{E}[h_i^\nu(z_t)] - \eta_t\,\mathbb{E}\left\langle\nabla h_i^\nu(z_t),\ \nabla H^\nu(z_t)\lambda_{t,K}\right\rangle + \frac{L\eta_t^2}{2}\mathbb{E}\|S_\nu(z_t,u_t,w_t)\|^2 \\ &= \mathbb{E}[h_i^\nu(z_t)] - \eta_t\,\mathbb{E}\left\langle\nabla h_i^\nu(z_t),\ \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\rangle + \frac{L\eta_t^2}{2}\mathbb{E}\|S_\nu(z_t,u_t,w_t)\|^2 - \eta_t\,\mathbb{E}\left\langle\nabla h_i^\nu(z_t),\ \nabla H^\nu(z_t)\lambda_{t,K} - \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\rangle \\ &\le \mathbb{E}[h_i^\nu(z_t)] - \eta_t\,\mathbb{E}\|\nabla H^\nu(z_t)\lambda_{t,\nu}^*\|^2 + \frac{L\eta_t^2}{2}\mathbb{E}\|S_\nu(z_t,u_t,w_t)\|^2 - \eta_t\,\mathbb{E}\left\langle\nabla h_i^\nu(z_t),\ \nabla H^\nu(z_t)\lambda_{t,K} - \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\rangle \\ &\le \mathbb{E}[h_i^\nu(z_t)] - \eta_t\,\mathbb{E}\|\nabla H^\nu(z_t)\lambda_{t,\nu}^*\|^2 + \frac{L\eta_t^2}{2}\mathbb{E}\|S_\nu(z_t,u_t,w_t)\|^2 + \eta_t\,\mathbb{E}\left[\|\nabla h_i^\nu(z_t)\|\left\|\nabla H^\nu(z_t)\lambda_{t,K} - \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\|\right] \\ &\le \mathbb{E}[h_i^\nu(z_t)] - \eta_t\,\mathbb{E}\|\nabla H^\nu(z_t)\lambda_{t,\nu}^*\|^2 + \frac{L\eta_t^2}{2}\mathbb{E}\|S_\nu(z_t,u_t,w_t)\|^2 + \eta_t l\,\underbrace{\mathbb{E}\left\|\nabla H^\nu(z_t)\lambda_{t,K} - \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\|}_{I_1}, \tag{A44} \end{aligned}$$
where the second inequality follows from the optimality condition $\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\rangle \ge \|\nabla H^\nu(z_t)\lambda_{t,\nu}^*\|^2$ for any $\lambda \in \mathcal{M}$, as stated in [18]; the third inequality is due to the Cauchy–Schwarz inequality; and the last inequality follows from the $l$-Lipschitz continuity of $h_i^\nu(z)$.
For the term $I_1$, following Jensen's inequality, the following holds:
$$\mathbb{E}\left\|\nabla H^\nu(z_t)\lambda_{t,K} - \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\| \le \sqrt{\mathbb{E}\left\|\nabla H^\nu(z_t)\lambda_{t,K} - \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\|^2} \le \sqrt{\mathbb{E}\left[\|\nabla H^\nu(z_t)\lambda_{t,K}\|^2 - \|\nabla H^\nu(z_t)\lambda_{t,\nu}^*\|^2\right]}, \tag{A45}$$
where the last inequality is due to the optimality condition $\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_{t,\nu}^*\right\rangle \ge \|\nabla H^\nu(z_t)\lambda_{t,\nu}^*\|^2$.
Since $f(\lambda) = \|\nabla H^\nu(z_t)\lambda\|^2$ is a convex function and $\sup_{\lambda,\lambda'\in\mathcal{M}}\|\lambda - \lambda'\|^2 \le 2$, we set $\lambda_{t,\nu}^* = \arg\min_{\lambda\in\mathcal{M}}\|\nabla H^\nu(z_t)\lambda\|^2$ and take the weight learning rate $\beta_{t,k} = \frac{c}{\sqrt{k}}$, where $c$ is a constant. Following Theorem 2 in [48] and Lemma A3, the following holds:
$$\mathbb{E}\left[\|\nabla H^\nu(z_t)\lambda_{t,K}\|^2 - \|\nabla H^\nu(z_t)\lambda_{t,\nu}^*\|^2\right] \le \left(\frac{2}{c} + c\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2\right)\frac{2+\log(K)}{\sqrt{K}} =: V. \tag{A46}$$
Plugging Equations (A46) and (A45) into Equation (A44) and rearranging, the following holds:
$$\begin{aligned} \mathbb{E}\|\nabla H^\nu(z_t)\lambda_{t,\nu}^*\|^2 &\le \frac{\mathbb{E}\left[h_i^\nu(z_t) - h_i^\nu(z_{t+1})\right]}{\eta_t} + l\sqrt{V} + \frac{L\eta_t}{2}\mathbb{E}\|S_\nu(z_t,u_t,w_t)\|^2 \\ &= \frac{\mathbb{E}\left[h_i^\nu(z_t) - h_i^\nu(z_{t+1})\right]}{\eta_t} + l\sqrt{V} + \frac{L\eta_t}{2}\mathbb{E}\left\|S_\nu(z_t,u_t,w_t) - \nabla H^\nu(z_t) + \nabla H^\nu(z_t)\right\|^2 \\ &= \frac{\mathbb{E}\left[h_i^\nu(z_t) - h_i^\nu(z_{t+1})\right]}{\eta_t} + l\sqrt{V} + \frac{L\eta_t}{2}\left(\mathbb{E}\left\|S_\nu(z_t,u_t,w_t) - \nabla H^\nu(z_t)\right\|^2 + \|\nabla H^\nu(z_t)\|^2\right) \\ &\le \frac{\mathbb{E}\left[h_i^\nu(z_t) - h_i^\nu(z_{t+1})\right]}{\eta_t} + l\sqrt{V} + \frac{L\eta_t}{2}\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right), \tag{A47} \end{aligned}$$
where the last inequality is due to Lemma A2.
Following Lemma A6, the following holds:
$$\mathbb{E}\|\nabla H(z_t)\lambda_{t,\nu}^*\|^2 \le \frac{2\,\mathbb{E}\left[h_i^\nu(z_t) - h_i^\nu(z_{t+1})\right]}{\eta_t} + 2l\sqrt{V} + \frac{M\nu^2}{2}L^2(d+3)^3 + L\eta_t\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right). \tag{A48}$$
Due to $\|\nabla H(z_t)\lambda_t^*\|^2 \le \|\nabla H(z_t)\lambda_{t,\nu}^*\|^2$, we have the following:
$$\mathbb{E}\|\nabla H(z_t)\lambda_t^*\|^2 \le \frac{2\,\mathbb{E}\left[h_i^\nu(z_t) - h_i^\nu(z_{t+1})\right]}{\eta_t} + 2l\sqrt{V} + \frac{M\nu^2}{2}L^2(d+3)^3 + L\eta_t\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right). \tag{A49}$$
Setting $\eta_t = \eta$ and summing over $t \in \{0, \ldots, T-1\}$, the following holds:
$$\begin{aligned} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla H(z_t)\lambda_t^*\|^2 &\le \frac{2\,\mathbb{E}\left[h_i^\nu(z_0) - h_i^\nu(z_T)\right]}{T\eta} + L\eta\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right) + \frac{M\nu^2}{2}L^2(d+3)^3 + 2l\sqrt{\left(\frac{2}{c} + c\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2\right)\frac{2+\log(K)}{\sqrt{K}}} \\ &\le \frac{2C_H}{T\eta} + L\eta\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right) + \frac{M\nu^2}{2}L^2(d+3)^3 + 4l\sqrt{\frac{1+\log(T)}{cT}} + 2l\sqrt{\frac{2c+2c\log(T)}{T}}\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right). \tag{A50} \end{aligned}$$
Therefore, Theorem 1 is proved completely. □
Proof of Theorem 2. 
Following the smoothness of $h_i^\nu(z_t)$, the following holds:
$$h_i^\nu(z_{t+1}) \le h_i^\nu(z_t) - \eta_t\left\langle\nabla h_i^\nu(z_t),\ S_\nu(z_t,u_t,w_t)\lambda_t\right\rangle + \frac{L}{2}\eta_t^2\|S_\nu(z_t,u_t,w_t)\lambda_t\|^2. \tag{A51}$$
For any $\lambda \in \mathcal{M}$ (multiplying (A51) by $\lambda_i$ and summing over $i$), we have the following:
$$H^\nu(z_{t+1})\lambda \le H^\nu(z_t)\lambda - \eta_t\left\langle\nabla H^\nu(z_t)\lambda,\ S_\nu(z_t,u_t,w_t)\lambda_t\right\rangle + \frac{L}{2}\eta_t^2\|S_\nu(z_t,u_t,w_t)\|^2. \tag{A52}$$
Taking the expectation conditioned on $z_t$, the following holds:
$$\begin{aligned} \mathbb{E}[H^\nu(z_{t+1})\lambda] &\le \mathbb{E}[H^\nu(z_t)\lambda] - \eta_t\,\mathbb{E}\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_t\right\rangle + \frac{L}{2}\eta_t^2\,\mathbb{E}\|S_\nu(z_t,u_t,w_t)\|^2 \\ &= \mathbb{E}[H^\nu(z_t)\lambda] - \eta_t\,\mathbb{E}\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_t\right\rangle + \frac{L}{2}\eta_t^2\left(\mathbb{E}\left\|S_\nu(z_t,u_t,w_t) - \nabla H^\nu(z_t)\right\|^2 + \|\nabla H^\nu(z_t)\|^2\right) \\ &\le \mathbb{E}[H^\nu(z_t)\lambda] - \eta_t\,\mathbb{E}\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_t\right\rangle + \frac{L}{2}\eta_t^2\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right), \tag{A53} \end{aligned}$$
where the last inequality follows from Lemma A2.
For the second term on the right-hand side, following Lemma A4, the following holds:
$$-\eta_t\,\mathbb{E}\left\langle\nabla H^\nu(z_t)\lambda,\ \nabla H^\nu(z_t)\lambda_t\right\rangle \le \frac{\eta_t}{2\beta_t}\mathbb{E}\left[\|\lambda_t - \lambda\|^2 - \|\lambda_{t+1} - \lambda\|^2\right] + \frac{\eta_t\beta_t}{2}\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\right\|^2 - \eta_t\,\mathbb{E}\|\nabla H^\nu(z_t)\lambda_t\|^2. \tag{A54}$$
Plugging Equation (A54) into Equation (A53) and rearranging, the following holds:
$$\begin{aligned} \mathbb{E}\|\nabla H^\nu(z_t)\lambda_t\|^2 &\le \frac{1}{\eta_t}\mathbb{E}\left[H^\nu(z_t)\lambda - H^\nu(z_{t+1})\lambda\right] + \frac{1}{2\beta_t}\mathbb{E}\left[\|\lambda_t - \lambda\|^2 - \|\lambda_{t+1} - \lambda\|^2\right] + \frac{\beta_t}{2}\mathbb{E}\left\|S_\nu(z_t,u_t,w_t)^\top S_\nu(z_t,u_t',w_t')\right\|^2 + \frac{L}{2}\eta_t\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right) \\ &\le \frac{1}{\eta_t}\mathbb{E}\left[H^\nu(z_t)\lambda - H^\nu(z_{t+1})\lambda\right] + \frac{1}{2\beta_t}\mathbb{E}\left[\|\lambda_t - \lambda\|^2 - \|\lambda_{t+1} - \lambda\|^2\right] + \frac{\beta_t}{2}\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2 + \frac{L}{2}\eta_t\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right), \tag{A55} \end{aligned}$$
where the last inequality is due to Lemma A3.
Following Lemma A6, the following holds:
$$\mathbb{E}\|\nabla H(z_t)\lambda_t\|^2 \le \frac{2}{\eta_t}\mathbb{E}\left[H^\nu(z_t)\lambda - H^\nu(z_{t+1})\lambda\right] + \frac{1}{\beta_t}\mathbb{E}\left[\|\lambda_t - \lambda\|^2 - \|\lambda_{t+1} - \lambda\|^2\right] + \frac{M\nu^2}{2}L^2(d+3)^3 + \beta_t\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2 + \eta_t L\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right). \tag{A56}$$
Due to $\|\nabla H(z_t)\lambda_t^*\|^2 \le \|\nabla H(z_t)\lambda_t\|^2$, we have the following:
$$\mathbb{E}\|\nabla H(z_t)\lambda_t^*\|^2 \le \frac{2}{\eta_t}\mathbb{E}\left[H^\nu(z_t)\lambda - H^\nu(z_{t+1})\lambda\right] + \frac{1}{\beta_t}\mathbb{E}\left[\|\lambda_t - \lambda\|^2 - \|\lambda_{t+1} - \lambda\|^2\right] + \frac{M\nu^2}{2}L^2(d+3)^3 + \beta_t\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2 + \eta_t L\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right). \tag{A57}$$
Setting $\eta_t = \eta$, $\beta_t = \beta$ and summing over $t \in \{0, \ldots, T-1\}$, the following holds:
$$\begin{aligned} \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\nabla H(z_t)\lambda_t^*\|^2 &\le \frac{2\,\mathbb{E}\left[H^\nu(z_0)\lambda - H^\nu(z_T)\lambda\right]}{T\eta} + \frac{\|\lambda_0 - \lambda\|^2 - \mathbb{E}\|\lambda_T - \lambda\|^2}{T\beta} + \frac{M\nu^2}{2}L^2(d+3)^3 + \beta\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2 + \eta L\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right) \\ &\le \frac{2C_H}{T\eta} + \frac{2}{T\beta} + \frac{M\nu^2}{2}L^2(d+3)^3 + \beta\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right)^2 + \eta L\left(\frac{M\nu^2}{2q_1q_2}L^2(d+6)^3 + \frac{2d+9+q_1q_2}{q_1q_2}Ml^2\right), \tag{A58} \end{aligned}$$
where the last inequality is due to Assumption 3 and $\sup_{\lambda,\lambda'\in\mathcal{M}}\|\lambda - \lambda'\|^2 \le 2$. Thus, Theorem 2 is proved completely. □

References

  1. Lu, W.; Zhou, Y.; Wan, G.; Hou, S.; Song, S. L3-Net: Towards Learning Based LiDAR Localization for Autonomous Driving. In Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA, 16–20 June 2019.
  2. Fernandes, P.; Ghorbani, B.; Garcia, X.; Freitag, M.; Firat, O. Scaling Laws for Multilingual Neural Machine Translation. In Proceedings of the 40th International Conference on Machine Learning, ICML, Honolulu, HI, USA, 23–29 July 2023.
  3. Satija, H.; Thomas, P.S.; Pineau, J.; Laroche, R. Multi-Objective SPIBB: Seldonian Offline Policy Improvement with Safety Constraints in Finite MDPs. In Proceedings of the 35th Advances in Neural Information Processing Systems, NeurIPS, Virtual, 6–14 December 2021.
  4. Zhang, L.; Liu, X.; Mahmood, K.; Ding, C.; Guan, H. Dynamic Gradient Balancing for Enhanced Adversarial Attacks on Multi-Task Models. CoRR 2023.
  5. Cui, Y.; Geng, Z.; Zhu, Q.; Han, Y. Review: Multi-objective optimization methods and application in energy saving. Energy 2017, 125, 681–704.
  6. Deb, K.; Agrawal, S.; Pratap, A.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
  7. Coello, C.A.C.; Pulido, G.T.; Lechuga, M.S. Handling Multiple Objectives With Particle Swarm Optimization. IEEE Trans. Evol. Comput. 2004, 8, 256–279.
  8. Mirjalili, S.; Saremi, S.; Mirjalili, S.M.; dos Santos Coelho, L. Multi-objective grey wolf optimizer: A novel algorithm for multi-criterion optimization. Expert Syst. Appl. 2016, 47, 106–119.
  9. Bandyopadhyay, S.; Saha, S.; Maulik, U.; Deb, K. A Simulated Annealing-Based Multiobjective Optimization Algorithm: AMOSA. IEEE Trans. Evol. Comput. 2008, 12, 269–283.
  10. Jiang, S.; Chen, Q.; Pan, Y.; Xiang, Y.; Lin, Y.; Wu, X.; Liu, C.; Song, X. ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-Order Optimization. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, AAAI, Vancouver, BC, Canada, 20–27 February 2024.
  11. Gao, K.; Sener, O. Generalizing Gaussian Smoothing for Random Search. In Proceedings of the 39th International Conference on Machine Learning, ICML, Baltimore, MD, USA, 17–23 July 2022.
  12. Ghadimi, S.; Lan, G. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM J. Optim. 2013, 23, 2341–2368.
  13. Ye, F.; Lyu, Y.; Wang, X.; Zhang, Y.; Tsang, I.W. Adaptive Stochastic Gradient Algorithm for Black-box Multi-Objective Learning. In Proceedings of the 12th International Conference on Learning Representations, ICLR, Vienna, Austria, 7–11 May 2024.
  14. Zerbinati, A.; Desideri, J.A.; Duvigneau, R. Comparison Between MGDA and PAES for Multi-Objective Optimization; Research Report RR-7667; INRIA: Le Chesnay-Rocquencourt, France, 2011.
  15. Gass, S.; Saaty, T. The computational algorithm for the parametric objective function. Nav. Res. Logist. 1955, 2, 39–45.
  16. Haimes, Y. On a bicriterion formulation of the problems of integrated system identification and system optimization. IEEE Trans. Syst. Man Cybern. Syst. 1971, SMC-1, 296–297.
  17. Chen, Z.; Ngiam, J.; Huang, Y.; Luong, T.; Kretzschmar, H.; Chai, Y.; Anguelov, D. Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout. In Proceedings of the 34th Advances in Neural Information Processing Systems, NeurIPS, Virtual, 6–12 December 2020.
  18. Désidéri, J.A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. C. R. Math. 2012, 350, 313–318.
  19. Lin, X.; Zhen, H.; Li, Z.; Zhang, Q.; Kwong, S. Pareto Multi-Task Learning. In Proceedings of the 33rd Advances in Neural Information Processing Systems, NeurIPS, Vancouver, BC, Canada, 8–14 December 2019.
  20. Mahapatra, D.; Rajan, V. Multi-Task Learning with User Preferences: Gradient Descent with Controlled Ascent in Pareto Optimization. In Proceedings of the 37th International Conference on Machine Learning, ICML, Virtual Event, 13–18 July 2020.
  21. Mercier, Q.; Poirion, F.; Désidéri, J. A stochastic multiple gradient descent algorithm. Eur. J. Oper. Res. 2018, 271, 808–817.
  22. Liu, S.; Vicente, L.N. The stochastic multi-gradient algorithm for multi-objective optimization and its application to supervised machine learning. Ann. Oper. Res. 2024, 339, 1119–1148.
  23. Zhou, S.; Zhang, W.; Jiang, J.; Zhong, W.; Gu, J.; Zhu, W. On the Convergence of Stochastic Multi-Objective Gradient Manipulation and Beyond. In Proceedings of the 36th Advances in Neural Information Processing Systems, NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022.
  24. Xiao, P.; Ban, H.; Ji, K. Direction-oriented Multi-objective Learning: Simple and Provable Stochastic Algorithms. In Proceedings of the 37th Advances in Neural Information Processing Systems, NeurIPS, New Orleans, LA, USA, 10–16 December 2023.
  25. Chen, R.; Li, Y.; Chai, T. Multi-Objective Derivative-Free Optimization Based on Hessian-Aware Gaussian Smoothing Method. In Proceedings of the 18th IEEE International Conference on Control & Automation, ICCA, Reykjavík, Iceland, 18–21 June 2024.
  26. Tammer, C.; Winkler, K. A new scalarization approach and applications in multicriteria d.c. optimization. J. Nonlinear Convex Anal. 2003, 4, 365–380.
  27. Fliege, J.; Svaiter, B.F. Steepest descent methods for multicriteria optimization. Math. Methods Oper. Res. 2000, 51, 479–494.
  28. Bonnel, H.; Iusem, A.N.; Svaiter, B.F. Proximal Methods in Vector Optimization. SIAM J. Optim. 2005, 15, 953–970.
  29. Fliege, J.; Drummond, L.M.G.; Svaiter, B.F. Newton’s Method for Multiobjective Optimization. SIAM J. Optim. 2009, 20, 602–626.
  30. Carrizo, G.A.; Lotito, P.A.; Maciel, M.C. Trust region globalization strategy for the nonconvex unconstrained multiobjective optimization problem. Math. Program. 2016, 159, 339–369.
  31. Pérez, L.R.L.; da Fonseca Prudente, L. Nonlinear Conjugate Gradient Methods for Vector Optimization. SIAM J. Optim. 2018, 28, 2690–2720.
  32. Liu, L.; Li, Y.; Kuang, Z.; Xue, J.; Chen, Y.; Yang, W.; Liao, Q.; Zhang, W. Towards Impartial Multi-task Learning. In Proceedings of the 9th International Conference on Learning Representations, ICLR, Virtual Event, 3–7 May 2021.
  33. Fernando, H.D.; Shen, H.; Liu, M.; Chaudhury, S.; Murugesan, K.; Chen, T. Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Approach. In Proceedings of the 11th International Conference on Learning Representations, ICLR, Kigali, Rwanda, 1–5 May 2023.
  34. Chen, L.; Fernando, H.D.; Ying, Y.; Chen, T. Three-Way Trade-Off in Multi-Objective Learning: Optimization, Generalization and Conflict-Avoidance. J. Mach. Learn. Res. 2024, 25, 193:1–193:53.
  35. Fernando, H.D.; Chen, L.; Lu, S.; Chen, P.; Liu, M.; Chaudhury, S.; Murugesan, K.; Liu, G.; Wang, M.; Chen, T. Variance Reduction Can Improve Trade-Off in Multi-Objective Learning. In Proceedings of the 49th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, Seoul, Republic of Korea, 14–19 April 2024.
  36. Laumanns, M.; Ocenasek, J. Bayesian Optimization Algorithms for Multi-objective Optimization. In Proceedings of the 7th Parallel Problem Solving from Nature, PPSN, Granada, Spain, 7–11 September 2002.
  37. Belakaria, S.; Deshwal, A.; Jayakodi, N.K.; Doppa, J.R. Uncertainty-Aware Search Framework for Multi-Objective Bayesian Optimization. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, AAAI, New York, NY, USA, 7–12 February 2020.
  38. Zhang, Q.; Li, H. MOEA/D: A Multiobjective Evolutionary Algorithm Based on Decomposition. IEEE Trans. Evol. Comput. 2007, 11, 712–731.
  39. Fliege, J.; Vaz, A.I.F.; Vicente, L.N. Complexity of gradient descent for multiobjective optimization. Optim. Methods Softw. 2019, 34, 949–959.
  40. Liu, S.; Chen, P.; Kailkhura, B.; Zhang, G.; Hero, A.O., III; Varshney, P.K. A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning: Principals, Recent Advances, and Applications. IEEE Signal Process. Mag. 2020, 37, 43–54.
  41. Berahas, A.S.; Cao, L.; Choromanski, K.; Scheinberg, K. A Theoretical and Empirical Comparison of Gradient Approximations in Derivative-Free Optimization. Found. Comput. Math. 2022, 22, 507–560.
  42. Balasubramanian, K.; Ghadimi, S. Zeroth-Order Nonconvex Stochastic Optimization: Handling Constraints, High Dimensionality, and Saddle Points. Found. Comput. Math. 2022, 22, 35–76.
  43. Chen, X.; Liu, S.; Xu, K.; Li, X.; Lin, X.; Hong, M.; Cox, D.D. ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization. In Proceedings of the 33rd Advances in Neural Information Processing Systems, NeurIPS, Vancouver, BC, Canada, 8–14 December 2019.
  44. Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311.
  45. Williams, P.N.; Li, K. Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Vancouver, BC, Canada, 17–24 June 2023.
  46. Papernot, N.; McDaniel, P.D.; Goodfellow, I.J.; Jha, S.; Celik, Z.B.; Swami, A. Practical Black-Box Attacks against Machine Learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS, Abu Dhabi, United Arab Emirates, 2–6 April 2017.
  47. Nesterov, Y.E.; Spokoiny, V.G. Random Gradient-Free Minimization of Convex Functions. Found. Comput. Math. 2017, 17, 527–566.
  48. Shamir, O.; Zhang, T. Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes. In Proceedings of the 30th International Conference on Machine Learning, ICML, Atlanta, GA, USA, 16–21 June 2013.
Figure 1. (a–c) show the 3D plots of the two objective functions and the contour maps of objective 1 and objective 2, where the black ★ marks the minimum point and the solid background contours display the landscape. (d–l) show the optimization trajectories of the different algorithms, where black • marks three different initial points, the gray bar is the Pareto front, and the black ★ denotes the point on the Pareto front corresponding to equal weights for each objective. All trajectories fade from red (start) to yellow (end).
Figure 2. The average norm of gradient deviation with respect to the number of inner iterations K at the three initial points.
Figure 3. The two objective function values with respect to the number of iterations T on the separable function with 500 dimensions.
Figure 4. The two objective function values with respect to the number of iterations T on the separable function with 1000 dimensions.
Figure 5. The two objective function values with respect to the number of iterations T on the non-separable function with 500 dimensions.
Figure 6. The two objective function values with respect to the number of iterations T on the non-separable function with 1000 dimensions.
Figure 7. The average runtime per iteration with respect to the dimensions, d, on separable functions.
Figure 8. The two objective function values with respect to the number of iterations T on the separable function with 1500 dimensions.
Figure 9. The two objective function values with respect to the number of iterations T on the separable function with 3000 dimensions.
Figure 10. The two objective function values with respect to the number of iterations T on the non-separable function with 1500 dimensions.
Figure 11. The two objective function values with respect to the number of iterations T on the non-separable function with 3000 dimensions.
Figure 12. The two objective function values with respect to the number of iterations T on the non-separable function with 1000 dimensions.
Figure 13. The average runtime per iteration with respect to the batch size on non-separable functions.
Table 1. The time and space complexity of a single iteration.
Algorithm            Time Complexity                Space Complexity
ZO-AdaMU [10]        O(q_1 q_2 d)                   O(d)
BES [11]             O(q_1 q_2 d)                   O(q_1 q_2 d)
RSGF [12]            O(q_1 q_2 d)                   O(q_1 q_2 d)
ASMG [13]            O(M q_1 q_2 d)                 O(M q_1 q_2 + q_1 q_2 d)
SZMG (this paper)    O(K(M^2 d + M q_1 q_2 d))      O(M d + M^2 + q_1 q_2 d)
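To make the O(q_1 q_2 d) entries concrete, the sketch below shows a one-sided Gaussian-smoothing gradient estimator of the kind these zeroth-order methods build on; the name zo_gradient and its signature are our illustrative choices, not taken from any of the cited papers. Each of the q_1 q_2 sampled directions costs two function evaluations plus O(d) arithmetic, which is where the per-objective O(q_1 q_2 d) time comes from.

```python
import numpy as np

def zo_gradient(h, z, nu, q1, q2, rng):
    """One-sided Gaussian-smoothing gradient estimate of a black-box
    objective h at z, averaged over q1 * q2 sampled directions.
    Each direction needs one extra evaluation of h plus O(d) arithmetic,
    so a single call costs O(q1 * q2 * d) time for one objective."""
    d = z.shape[0]
    h_z = h(z)                                   # base evaluation, shared by all directions
    g = np.zeros(d)
    for _ in range(q1 * q2):
        u = rng.standard_normal(d)               # random direction u ~ N(0, I_d)
        g += (h(z + nu * u) - h_z) / nu * u      # finite-difference estimate along u
    return g / (q1 * q2)

# Illustrative usage on a quadratic objective (gradient of 0.5*||z||^2 is z):
# rng = np.random.default_rng(0)
# g = zo_gradient(lambda z: 0.5 * np.dot(z, z), np.ones(10), 1e-5, 8, 8, rng)
```

Averaging over the q_1 q_2 directions reduces the estimator's variance, which is consistent with the q_1 q_2 denominators appearing in the bound of Theorem 2.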
Table 2. Some important notations used in the paper.
Notation             Description
z                    Decision variable
w                    Data random variable, independent of z
u                    A random vector following the Gaussian distribution
ν                    Smoothing parameter
d                    The dimension of the decision variable z
M                    Total number of objective functions
h_i(z)               The i-th objective function
h_i^ν(z)             The smoothed version of the function h_i(z)
∇h_i(z)              Gradient of the i-th objective function
∇h_i^ν(z)            Gradient of the function h_i^ν(z)
s_i(z, u, w)         Stochastic gradient of the function h_i^ν(z)
H(z)                 A matrix containing ∇h_1(z), …, ∇h_M(z) as columns
H^ν(z)               A matrix containing ∇h_1^ν(z), …, ∇h_M^ν(z) as columns
S_ν(z, u, w)         A matrix containing s_1(z, u, w), …, s_M(z, u, w) as columns
λ_t^*                λ_t^* = arg min_λ ‖H(z_t) λ‖_2
λ_{t,ν}^*            λ_{t,ν}^* = arg min_λ ‖H^ν(z_t) λ‖_2
λ_{t,k+1}            λ_{t,k+1} = Π_M( λ_{t,k} − β_{t,k} S_ν(z_t, u_t, w_t)^⊤ S_ν(z_t, u_t, w_t) λ_{t,k} ), with Π_M the projection onto the simplex M
η, β                 Learning rates (for z and λ, respectively)
c, l, L              Positive constants
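Putting this notation together, here is a minimal sketch of one outer iteration in the spirit of SZMG: build the matrix S whose columns estimate the smoothed gradients s_i, run K projected-SGD steps on λ (the λ_{t,k+1} update above), and then move z along the combined direction Sλ. It reuses the hypothetical zo_gradient helper sketched after Table 1; the simplex projection and all names are our own illustrative choices under these assumptions, not the authors' implementation.

```python
import numpy as np

def project_simplex(lam):
    """Euclidean projection onto the probability simplex {lam >= 0, sum(lam) = 1}."""
    m = lam.shape[0]
    u = np.sort(lam)[::-1]                       # sort entries in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, m + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(lam - theta, 0.0)

def szmg_step(objectives, z, lam, nu, q1, q2, eta, beta, K, rng):
    """One illustrative outer iteration: columns of S estimate the smoothed
    gradients s_i; K inner projected-SGD steps drive lam toward
    arg min ||S lam||_2 over the simplex; z then descends along S lam."""
    S = np.stack([zo_gradient(h, z, nu, q1, q2, rng) for h in objectives], axis=1)
    for _ in range(K):
        # gradient of 0.5 * ||S lam||^2 with respect to lam is S^T S lam
        lam = project_simplex(lam - beta * (S.T @ (S @ lam)))
    return z - eta * (S @ lam), lam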
Table 3. Summary of hyperparameters.
Algorithm    ν         q_1   q_2    η       K    γ          β
ZO-AdaMU     0.00001   8     8      0.002   -    -          -
BES          0.00001   8     8      0.002   -    -          -
RSGF         0.00001   8     8      0.002   -    -          -
ASMG         -         8     8      0.01    -    1/(t+1)    -
MGDA         -         -     full   0.002   -    -          -
SZMG         0.00001   8     8      0.002   1    -          0.01/k