Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method

Ma, Yinpu; Li, Cunlin; Wang, Zhichao; Li, Qian

doi:10.3390/axioms15030203

Open AccessArticle

Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method

¹

School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China

²

School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Perth, WA 6102, Australia

^*

Author to whom correspondence should be addressed.

Axioms 2026, 15(3), 203; https://doi.org/10.3390/axioms15030203

Submission received: 8 January 2026 / Revised: 1 March 2026 / Accepted: 5 March 2026 / Published: 10 March 2026

(This article belongs to the Section Geometry and Topology)

Download

Browse Figures

Versions Notes

Abstract

Recently, the adaptive regularized proximal quasi-Newton (ARPQN) method has demonstrated a strong performance in solving composite optimization problems over the Stiefel manifold. However, its reliance on first-order information limits its applicability to scenarios where gradient and Hessian evaluations are unavailable or costly. In this paper, we propose a zeroth-order adaptive regularized proximal quasi-Newton method (ZO-ARPQN) for black-box composite optimization over Riemannian manifolds, particularly the Stiefel and symmetric positive definite (SPD) manifolds. The proposed method estimates the Riemannian gradient and curvature information through randomized one-point finite-difference approximations and adaptively updates a regularized quasi-Newton matrix to capture the local manifold geometry. Theoretically, we established global convergence and complex analyses under mild assumptions. More importantly, by incorporating curvature-aware regularization and random perturbations in the proximal quasi-Newton framework, we proved that ZO-ARPQN can escape strict saddle points with a high probability. This guarantees convergence to a stationary point, even in the absence of explicit gradients. Extensive numerical experiments were conducted on manifold-constrained problems, including sparse PCA and robot stiffness tuning. These demonstrated that ZO-ARPQN shows a competitive convergence behavior compared with other state-of-the-art Riemannian optimization methods, while requiring only function evaluations.

Keywords:

zeroth-order optimization; Riemannian manifold; Stiefel manifold; SPD manifold; proximal quasi-Newton method; adaptive regularization; saddle-point escape; black-box optimization

MSC:

49M37

1. Introduction

In recent years, composite optimization problems on Riemannian manifolds have attracted widespread attention in various application fields. The objective function of such problems typically consists of the sum of a smooth function and a nonsmooth function, with optimization variables constrained on a manifold. The general form of a Riemannian composite optimization problem is summarized as follows:

min_{X \in M} F (X) = f (X) + h (X) .

(1)

where

f : R^{n \times r} \to R

is a smooth function and

h : R^{n \times r} \to R

is a convex, but nonsmooth, function. The aforementioned Riemannian composite optimization problems have found extensive applications, such as compressed sensing [1], sparse principal component analyses [2], and clusteringproblems [3]. For more applications of composite optimization on manifolds, readers are referred to [4,5,6].

Composite optimization problems with manifold constraints have been extensively studied in recent years. Currently, popular solution methods can be roughly categorized into the following types: operator-splitting methods, proximal gradient methods, and proximal Newton methods.

Operator-Splitting Methods. Riemannian ADMM (alternating-direction method of multipliers) provides a flexible decomposition strategy, especially suitable for composite problems with special structures. It usually transforms the original problem into an equivalent form via variable splitting and then solves it through alternating updates and dual ascent. Additionally, other operator-splitting ideas, such as Douglas–Rachford splitting, have been extended to manifolds [7]. Research on Riemannian ADMM focuses on its convergence analysis under nonconvex and nonsmooth settings [8] and efficient implementation in specific applications [3]. For example, in distributed or federated learning scenarios, the decomposable nature of ADMM enables the design of communication-efficient and privacy-preserving Riemannian optimization algorithms [9]. The advantages of Riemannian ADMM lie in its high flexibility and ability to decouple variables; however, it requires the introduction of additional variables when constructing the Lagrangian objective function, which slows down its convergence rate.

First-Order Methods: Proximal Gradient Methods. This is the most fundamental and straightforward approach for solving Riemannian composite problems. The Riemannian proximal gradient method iteratively decomposes the smooth and nonsmooth terms, i.e., [4,5]:

x_{k + 1} = {prox}_{γ_{k} h} (R_{x_{k}} (- γ_{k} \nabla f (x_{k})))

(2)

which denotes the Riemannian proximal operator. This algorithm first performs one step of Riemannian gradient descent on the smooth part f, and then corrects the nonsmooth part h via the proximal operator, entirely within the manifold

M

.

To accelerate convergence, researchers have proposed various accelerated Riemannian proximal gradient algorithms by leveraging Nesterov’s acceleration idea. These methods typically improve the convergence rate from

O (1 / k)

to

O (1 / k^{2})

under geodesically convex or specific curvature conditions [10,11]. For large-scale data, computing the full gradient

\nabla f (x_{k})

is prohibitively expensive; thus, Riemannian stochastic proximal gradient methods and their variance-reduced variants (e.g., R-SVRG-PG [12]) have emerged. They approximate the gradient through random sampling, significantly reducing the cost per iteration while achieving fast convergence. Considering that

f (x)

is nonconvex in many practical applications, recent works have analyzed the convergence of proximal gradient methods under nonconvex settings, proving that the algorithm converges to stationary points and providing convergence rate guarantees [13]. This method is simple to implement and theoretically mature, but suffers from slow convergence and reliance on the proximal operator.

Second-Order Methods: Proximal Newton Methods. To achieve faster convergence than first-order methods, second-order methods incorporate curvature (Hessian) information of the smooth term f to guide updates. The core idea of the regularized Riemannian proximal Newton method is to obtain the update direction

η_{k}

at each tangent space

T_{x_{k}} M

by solving a quadratic approximation subproblem [4,5]:

η_{k} = \underset{η \in T_{X_{k}} M}{arg min} 〈\nabla f (X_{k}), η〉 + \frac{1}{2} 〈η, H_{k} [η]〉 + h (R_{X_{k}} (η))

(3)

where

H_{k}

is a positive definite approximation of the Riemannian Hessian operator

H f (x_{k})

.

Standard Newton methods face two major challenges: (1) how to ensure global convergence when the Hessian is non-positive definite, and (2) how to handle the enormous computational, storage, and inversion costs of the Hessian matrix. Recent research on quasi-Newton methods and adaptive regularization has focused on addressing these two issues. These are key techniques to address the computational bottleneck of second-order methods. Riemannian quasi-Newton methods (e.g., Riemannian BFGS or L-BFGS) no longer directly compute the true Hessian matrix; instead, they construct an increasingly accurate Hessian approximation matrix

B_{k}

using first-order (gradient) information collected during iterations [14,15]. A representative method is the adaptive regularized proximal quasi-Newton (ARPQN) method [16], which solves a subproblem of the following form at each step:

min_{V \in T_{X_{k}} M} 〈\nabla f (X_{k}), V〉 + \frac{1}{2} 〈(B_{k} + σ_{k} I) [V], V〉 + h (R_{X_{k}} (V))

(4)

The core advantages of this method are Hessian approximation and adaptive regularization. This allows the algorithm to flexibly balance gradient and curvature information, thereby achieving faster convergence in practice.

It should be noted that the aforementioned methods all assume that the gradient and Hessian matrix of the objective function have analytical forms. However, many optimization problems in practical applications face the following challenges, rendering traditional gradient-based (first-order or second-order) methods inapplicable:

1.: Unavailable Gradient: In many practical problems, the objective function $f (X)$ is a black box: one can only query its value without access to analytical gradients or automatic differentiation. This often occurs in complex physical simulations such as fluid dynamics or structural mechanics, where each function evaluation is computationally expensive and differentiation is infeasible. Traditional approaches like Bayesian optimization mitigate the cost through surrogate models [17], but become inefficient as the problem dimension increases. In robotic stiffness control, the objective function describing the robot–environment interaction has no closed-form gradient and can only be evaluated through high-fidelity simulations or experiments. The mapping from stiffness parameters (an SPD matrix) to performance scores thus forms a typical black-box optimization problem.
2.: Prohibitive Gradient Computation Cost: Even if the gradient exists in theory, the cost of computing it may be unacceptably high. In large-scale reinforcement-learning scenarios, the reward function of an agent may depend on an extremely long sequence of decisions. Computing the gradient of policy parameters through backpropagation may consume enormous computational resources and memory. Directly evaluating the performance of the policy (zeroth-order information) may be much cheaper. Zeroth-order methods such as evolutionary strategies have been proven to be scalable alternatives for reinforcement learning [18].

Based on the above motivations, zeroth-order methods can be applied to any optimization problem where function evaluations are accessible, regardless of how complex its internal structure is or whether it is differentiable. This greatly expands the application boundary of optimization algorithms, enabling them to solve black-box problems that traditional methods cannot address.

First, we clarify the definition: Zeroth-order Riemannian methods, also known as derivative-free Riemannian optimization methods, refer to a class of methods that solve optimization problems on manifolds using only function evaluation values (zeroth-order information), without using explicit gradients (first-order information) or Hessian matrices (second-order information) of the objective function. By abandoning reliance on accurate gradient information, zeroth-order Riemannian methods gain the ability to handle complex problems such as black-box, noisy, and high-cost scenarios. They sacrifice a certain degree of convergence efficiency (compared to gradient-based methods) to significantly broaden the application scope of optimization algorithms on manifolds.

2. Theoretical Contributions

This work proposes the zeroth-order Riemannian adaptive quasi-Newton method (ZO-ARPQN) for composite optimization on Stiefel manifolds, establishing three fundamental advances:

Zeroth-Order Extension of Proximal Quasi-Newton on Riemannian Manifolds. We generalize the adaptive regularized proximal quasi-Newton (ARPQN) framework [16] to the zeroth-order setting over Riemannian manifolds such as the Stiefel and SPD manifolds. By constructing randomized finite-difference estimators that approximate both the Riemannian gradient and curvature information, ZO-ARPQN achieves first-order accuracy using only function evaluations, enabling black-box optimization in manifold-constrained problems.

Global Convergence Guarantees under Noisy Gradient Estimates. We establish global convergence to a stationary point under mild smoothness and regularization assumptions, even when the gradient and Hessian information are approximated by random perturbations. Theoretical bounds are derived on the bias and variance of the zeroth-order estimators, showing that ZO-ARPQN maintains the same global convergence as its first-order counterpart.

Saddle-Point Escaping. By incorporating curvature-aware regularization and adaptive random perturbation in the proximal quasi-Newton step, we prove that ZO-ARPQN can escape strict saddle points with a high probability, ensuring convergence to a stationary point. This result extends existing saddle-point escape theory to the manifold and zeroth-order settings.

3. Preliminaries on Riemannian Optimization

The core idea of Riemannian optimization is to extend classical optimization methods from Euclidean space to non-Euclidean spaces, i.e., Riemannian manifolds. Compared with the Euclidean case, manifold optimization requires constraining the geometric structure of iterates, such that search directions, gradients, and update operators are all defined within the framework of tangent spaces. This requires us to introduce tools such as tangent spaces, Riemannian metrics, retractions, and Riemannian Hessians.

3.1. Tangent Space and Tangent Bundle

Let

M

be a manifold. The tangent space at a point

X \in M

is denoted as

T_{X} M

, which consists of all tangent vectors at X. The tangent bundle

T M : = ⋃_{X \in M} T_{X} M

represents the collection of all tangent spaces of

M

.

3.2. Riemannian Metric and Norm

If the tangent spaces are endowed with an inner product

{〈 ξ, η 〉}_{X}, ξ, η \in T_{X} M

that varies smoothly with the point X, then

(M, 〈 \cdot, \cdot 〉)

is called a Riemannian manifold. The induced norm is

{∥ ξ ∥}_{X} = \sqrt{{〈 ξ, ξ 〉}_{X}} .

For simplicity, we denote

∥ ξ ∥

and

〈 ξ, ξ 〉

if there is no ambiguity.

3.3. Riemannian Gradient

In [19], if

f : M \to R

is a smooth function, its gradient

\nabla f (X) \in T_{X} M

at

X \in M

is defined as the unique tangent vector satisfying

〈 \nabla f (X), ξ 〉 = D f (X) [ξ], \forall ξ \in T_{X} M,

where

D f (X) [ξ]

denotes the directional derivative of f at X along

ξ

. In this paper, we adopt the Riemannian metric induced by the Euclidean inner product, i.e.,

{〈 \cdot, \cdot 〉}_{x} = 〈 \cdot, \cdot 〉, \forall x \in M .

Based on this Riemannian metric, the Riemannian gradient of a function is defined as the projection of its Euclidean gradient onto the tangent space:

\nabla f (x) = {Proj}_{T_{x} M} (\nabla f (x))

(5)

Definition 1

(Retraction [4]). A retraction on a manifold

M

is a smooth mapping

R : T M \to M

satisfying the following properties. Let

R_{X}

denote the restriction of R to

T_{X} M

: (1)

R_{X} (0_{X}) = X

, where

0_{X}

is the zero element of

T_{X} M

; (2) under the canonical isomorphism

T_{0_{X}} (T_{X} M) ≅ T_{X} M

, we have

D R_{X} (0_{X}) = {id}_{T_{X} M},

where

D R_{X} (0_{X})

denotes the differential of

R_{X}

at

0_{X}

, and

{id}_{T_{X} M}

denotes the identity mapping on

T_{X} M

.

The following proposition serves as a key bridge connecting retractions and optimization analyses. It endows complex manifold optimization algorithms with strict mathematical certainty. Inequality (6) guarantees the boundedness and controllability of the retraction operation. It ensures that a tangent space displacement

ξ

of a finite magnitude does not result in an infinitely distant move on the manifold, which is a prerequisite for any subsequent convergence analysis.

Proposition 1

([19]). If

M

is a compact embedded submanifold of

R^{N}

and R is a retraction, then there exist constants

M_{1}, M_{2} > 0

such that, for all

X \in M

and

ξ \in T_{X} M

,

∥ R_{X} (ξ) - X ∥ \leq M_{1} ∥ ξ ∥

(6)

∥ R_{X} (ξ) - X - ξ ∥ \leq M_{2} {∥ ξ ∥}^{2}

(7)

The following assumption is the most natural and practical way to extend the well-known L-smooth (i.e., the function has a Lipschitz-continuous gradient) concept from Euclidean space to Riemannian manifolds.

Assumption 1

(L-retraction smoothness [20]). There exists a constant

L_{g} \geq 0

such that, for the objective function f of problem (1), the following inequality holds:

|f (R_{x} (η)) - f (x) - {〈 \nabla f (x), η 〉}_{x}| \leq \frac{L_{g}}{2} {∥ η ∥}_{x}^{2}, \forall x \in M, η \in T_{x} M .

(8)

This assumption ensures that the behavior of function f along the retraction path is well-posed, i.e., as long as

η

is sufficiently small, predicting the function value at the next point using the gradient information at the current point is relatively accurate, ensuring that the gradient-based linear approximation is locally effective. Thus, a convergence analysis from Euclidean space (gradient descent, trust region, Newton method) can be extended to manifolds.

4. Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Newton (ZO-ARPQN)

4.1. Zeroth-Order Riemannian Gradient and Hessian

In existing manifold optimization methods, gradients or their tangent space projections are assumed to be directly available. However, in many real-world problems such as black-box optimization, noisy or nonsmooth functions, and high-dimensional models, explicit gradient computation is either infeasible or too costly. Zeroth-order optimization offers an effective alternative by relying only on function evaluations.

While classical Euclidean zeroth-order estimators (e.g., Gaussian smoothing or coordinate finite differences) can approximate gradients accurately, they require many function calls and do not generalize well to manifolds. To overcome this, we employ a random direction difference method that samples one or several random tangent directions and uses one-sided finite differences to estimate gradients. This approach makes the function query cost independent of the dimension, achieves good scalability to high-dimensional Riemannian problems, and balances theoretical soundness with computational efficiency.

Definition 2

(Stochastic Difference Zeroth-Order Riemannian Gradient [20]). Generate

u = P u_{0} \in T_{x} M

, where

u_{0} \sim N (0, I_{n}) \in R^{n}

, and

P \in R^{n \times n}

is an orthogonal projection matrix that projects vectors onto the tangent space

T_{x} M

. Thus, u follows the distribution

N (0, P P^{⊤})

, i.e., the standard normal distribution on the tangent space; all eigenvalues of

P P^{⊤}

are either 0 (corresponding to eigenvectors orthogonal to the tangent space) or 1 (corresponding to eigenvectors embedded in the tangent space). The zeroth-order Riemannian gradient is defined as:

g_{μ} (x) = \frac{f (R_{x} (μ u)) - f (x)}{μ} u = \frac{f (R_{x} (μ P u_{0})) - f (x)}{μ} P u_{0} .

(9)

It should be noted that the projection matrix P is easy to compute on common manifolds. For example, for the Stiefel manifold

M

, its projection can be written as

{Proj}_{T_{X} M} (Y) = (I - X X^{⊤}) Y + X skew (X^{⊤} Y),

where

skew (A) : = (A - A^{⊤}) / 2

. In the stochastic case, performing multiple samplings in each iteration can improve the convergence speed:

{\bar{g}}_{μ, ξ} (x) = \frac{1}{m} \sum_{i = 1}^{m} g_{μ, ξ_{i}} (x), where g_{μ, ξ_{i}} (x) = \frac{F (R_{x} (μ u_{i}), ξ_{i}) - F (x, ξ_{i})}{μ} u_{i},

(10)

where

u_{i}

is a standard normal random vector on the tangent space

T_{x} M

. We also have:

E_{ξ} [g_{μ, ξ_{i}} (x)] = \frac{f (R_{x} (μ u)) - f (x)}{μ} u = g_{μ} (x) .

(11)

Lemma 1

(Upper bound on the error of the averaged zeroth-order gradient [20]). For the zeroth-order Riemannian gradient estimation with multiple samplings, we have

E {∥{\bar{g}}_{μ, ξ} (x) - \nabla f (x)∥}^{2} \leq μ^{2} L_{g}^{2} {(d + 6)}^{3} + \frac{8 (d + 4)}{m} σ^{2} + \frac{8 (d + 4)}{m} {∥ \nabla f (x) ∥}^{2},

(12)

where the expectation

E

is taken over by the Gaussian vectors

U = {u_{1}, \dots, u_{m}}

and ξ.

Definition 3

(Zeroth-Order Riemannian Hessian [20]). The zeroth-order Riemannian Hessian estimation of function f at point x is defined as:

H_{μ} (x) = \frac{1}{2 μ^{2}} (u u^{⊤} - P) [F (R_{x} (μ u), ξ) + F (R_{x} (- μ u), ξ) - 2 F (x, ξ)] .

(13)

It should be noted that the Riemannian Hessian here is actually the projected Hessian estimation of the pullback function

F_{x} (η, ξ) : = F (R_{x} (η), ξ), \forall x \in M, η \in T_{x} M

on the tangent space

T_{x} M

. Additionally, multi-sampling technology is adopted. For

i = 1, \dots, b

, each Hessian is defined as:

H_{μ, i} (x) = \frac{1}{2 μ^{2}} (u_{i} u_{i}^{⊤} - P) [F (R_{x} (μ u_{i}), ξ_{i}) + F (R_{x} (- μ u_{i}), ξ_{i}) - 2 F (x, ξ_{i})] .

(14)

Its averaged form is:

{\bar{H}}_{μ, ξ} (x) = \frac{1}{b} \sum_{i = 1}^{b} H_{μ, i} (x) .

(15)

Assumption 2

(Lipschitz Hessian assumption). For any

x \in M

and

η \in T_{x} M

, we have

{∥P_{η}^{- 1} \circ H F (R_{x} (η), ξ) \circ P_{η} - H F (x, ξ)∥}_{op} \leq L_{H} ∥ η ∥,

(16)

which holds almost everywhere, where

P_{η} : T_{x} M \to T_{R_{x} (η)} M

denotes the parallel transport and ∘ denotes the function composition. Here,

{∥ \cdot ∥}_{op}

denotes the operator norm. This assumption constrains the rate of change in the Hessian on the manifold.

Assumption 2 is the Riemannian counterpart of the Lipschitz Hessian assumption in Euclidean space, whose equivalent conditions include (see [13]):

∥P_{η}^{- 1} \nabla F (R_{x} (η), ξ) - \nabla f (x) - H F (x, ξ) [η]∥ \leq \frac{L_{H}}{2} {∥ η ∥}^{2},

(17)

|F (R_{x} (η), ξ) - [F (x, ξ) + 〈 η, \nabla F (x, ξ) 〉 + \frac{1}{2} 〈 η, H F (x, ξ) [η] 〉]| \leq \frac{L_{H}}{6} {∥ η ∥}^{3} .

(18)

In the Euclidean case,

P_{η}

degenerates to the identity mapping. In this section, we also assume that

F (\cdot, ξ)

satisfies Assumption 1 and introduce the following common assumption, which is often used in zeroth-order stochastic optimization.

Assumption 3

(Bounded Variance Assumption [19,21]). Let

E = E_{ξ}

; then, we have:

E [F (x, ξ)] = f (x)

,

E [\nabla F (x, ξ)] = \nabla f (x)

,

{E ∥ \nabla F (x, ξ) - \nabla f (x) ∥}^{2} \leq σ^{2}

for all

x \in M

.

The error between the approximate Hessian matrix constructed using the zeroth-order method and the true Riemannian

H f (x)

is bounded in expectation. By choosing a sufficiently large number of samples b and a sufficiently small step size

μ

, the approximation error can be made arbitrarily small, thereby ensuring the effectiveness and controllability of the zeroth-order Hessian estimation.

Lemma 2

([20]). Let

{\bar{H}}_{μ, ξ} (x)

be computed according to Formula (15), and

H_{μ} (x)

be computed according to Formula (13). Then, for any

x \in M

and

η \in T_{x} M

, we have:

E_{μ, ξ} {∥{\bar{H}}_{μ, ξ} (x) - H f (x)∥}_{op}^{2} \leq \frac{{(d + 16)}^{4}}{\sqrt{2 b}} L_{g} + \frac{μ^{2} L_{H}^{2}}{18} {(d + 6)}^{5},

(19)

E_{u, Ξ} {∥H_{μ} (x)∥}_{F}^{4} \leq \frac{{(d + 16)}^{8}}{8} L_{g}^{2} .

(20)

This lemma provides a theoretical basis for the zeroth-order quasi-Newton method: even without an explicit Hessian, a high-quality Hessian approximation can be constructed using function value approximation; it provides an error upper bound for subsequent convergence analyses, ensuring the theoretical feasibility of the method.

4.2. Zeroth-Order Extension: Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Algorithm

To address the scenario where first-order or second-order derivative information cannot be directly obtained, based on the above groundwork, we propose a zeroth-order information-based adaptive Riemannian optimization algorithm. Our method utilizes the aforementioned zeroth-order gradient and Hessian estimators to construct the following regularized quadratic approximation model in each iteration:

{\tilde{ϕ}}_{k} (V) = 〈 {\bar{g}}_{μ, ξ} (x_{k}), V 〉 + \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x_{k}) + σ_{k} I) [V], V 〉 + h (x_{k} + V),

(21)

where

{\bar{g}}_{μ, ξ} (x_{k})

denotes the multi-sampling mean of the zeroth-order Riemannian gradient,

{\bar{H}}_{μ, ξ} (x_{k})

is the mean of the zeroth-order Riemannian Hessian estimation, and

σ_{k}

is the adaptive regularization parameter. By minimizing Subproblem (21) on the tangent space

T_{x_{k}} M

, we can obtain a reasonable descent direction using only function values (zeroth-order information). Furthermore, the adaptive adjustment mechanism of the regularization parameter

σ_{k}

ensures the stability and convergence of the algorithm under nonconvex conditions, thereby enabling the method to exhibit an excellent theoretical and practical performance in zeroth-order Riemannian optimization problems.

The greatest advantage of the zeroth-order extended adaptive Riemannian proximal quasi-Newton method is that it breaks the dependence on explicit gradients and Hessians, but can still closely approximate the second-order curvature information in black-box, non-differentiable, or even noisy environments, maintaining the efficiency and stability of the algorithm. This significantly expands the application scope of traditional Riemannian proximal Newton methods [16,22]. The complete optimization process is shown in Algorithm 1 (ZO-ARPQN). Below, we briefly describe the main steps of the ZO-ARPQN algorithm.

Algorithm 1 ZO-ARPQN: Zeroth-Order Adaptive Regularized Proximal Quasi-Newton Algorithm

Require:: Initial point $X_{0} \in M$ ; initial regularization parameter $σ_{0} > 0$ ; line search parameters $σ, γ \in (0, 1)$ ; threshold parameters $0 < η_{1} < η_{2} < 1$ ; sample batch sizes $m, b$ ; scaling factors $0 < γ_{1} < 1 < γ_{2}$
1:: for $k = 0, 1, 2, \dots$ do
2:: Generate the zeroth-order gradient ${\bar{g}}_{μ, ξ} (x_{k})$ and zeroth-order Hessian ${\bar{H}}_{μ, ξ} (x_{k})$ based on sample sizes $m, b$
3:: while true do
4:: Solve the subproblem

${\tilde{ϕ}}_{k} (V) = 〈 {\bar{g}}_{μ, ξ} (x_{k}), V 〉 + \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x_{k}) + σ_{k} I) [V], V 〉 + h (x_{k} + V),$

to obtain the search direction $V_{k}$
5:: Set initial step size $α_{k} = 1$
6:: while Condition (22) is not satisfied do
7:: $α_{k} \leftarrow γ α_{k}$
8:: end while
9:: Let $Z_{k} = R_{X_{k}} (α_{k} V_{k})$
10:: Compute the ratio

$ρ_{k} = \frac{F (R_{X_{k}} (α_{k} V_{k})) - F (X_{l (k)})}{{\tilde{ϕ}}_{k} (α_{k} V_{k}) - {\tilde{ϕ}}_{k} (0)} .$
11:: if $ρ_{k} \geq η_{1}$ then
12:: if $ρ_{k} \geq η_{2}$ then
13:: Update $σ_{k} \leftarrow γ_{1} σ_{k}$
14:: end if
15:: Break
16:: else
17:: Update $σ_{k} \leftarrow γ_{2} σ_{k}$
18:: end if
19:: end while
20:: Update $X_{k + 1} = Z_{k}$ , $σ_{k + 1} = σ_{k}$
21:: end for

Step size selection strategy. In each iteration, the algorithm first obtains the search direction

V_{k}

by solving the subproblem. Then, a non-monotone line search with the Armijo condition is used to determine the appropriate step size

α_{k}

. Specifically, it is defined as:

α_{k} = γ^{N_{k}}, N_{k} \in N,

where

N_{k}

is the smallest integer satisfying the following condition:

F (R_{X_{k}} (α_{k} V_{k})) \leq max_{max {0, k - m} \leq j \leq k} F (X_{j}) - \frac{1}{2} σ α_{k} {∥ V_{k} ∥}_{B_{k}}^{2} .

(22)

The scaling factor

γ \in (0, 1)

ensures that

α_{k}

decreases gradually during backtracking;

σ > 0

is a line search parameter. The non-monotone condition allows the function value at the current iterate to not strictly decrease, but instead be compared with the maximum value among the most recent m iterations, thereby improving the algorithm’s robustness in complex nonconvex problems.

Model consistency measure (ratio definition). Let

F (X_{l (k)}) = max_{max {0, k - m} \leq j \leq k} F (X_{j}) .

This serves as the reference benchmark for the function value in the current iteration. To measure the degree of matching between the constructed quadratic approximation model and the true function decrease, we introduce the ratio:

ρ_{k} : = \frac{F (R_{x_{k}} (α_{k} V_{k})) - F (X_{l (k)})}{{\tilde{ϕ}}_{k} (α_{k} V_{k}) - {\tilde{ϕ}}_{k} (0)} .

(23)

To simplify the notation, we introduce

l (k) : = {argmax}_{max {0, k - m} \leq j \leq k} F (X_{j}) .

(24)

The ratio

ρ_{k}

characterizes the consistency between the predicted and actual decrease, where the numerator is the actual function decrease under step size

α_{k}

and the denominator is the predicted decrease in the quadratic approximation model.

Update strategy for regularization parameter

σ_{k}

. To ensure the stability and convergence of the algorithm, the regularization parameter

σ_{k}

is dynamically adjusted based on the ratio

ρ_{k}

:

1.: If $ρ_{k} < η_{1}$ , it indicates an inaccurate prediction or an excessively large step size, and the model is too optimistic:

$σ_{k + 1} = γ_{2} σ_{k}, γ_{2} > 1,$

enhancing the regularization strength to make the model more conservative.
2.: If $ρ_{k} \geq η_{1}$ , it indicates an acceptable decrease performance.
3.: If $ρ_{k} \geq η_{2}$ , it indicates an extremely accurate model prediction and an ideal decrease performance:

$σ_{k + 1} = γ_{1} σ_{k}, γ_{1} \in (0, 1),$

weakening the regularization.
4.: If $η_{1} \leq ρ_{k} < η_{2}$ , keep the regularization unchanged:

$σ_{k + 1} = σ_{k} .$

This corresponds to the scenario where the model prediction is roughly consistent with the true decrease. If

σ_{k}

is reduced hastily at this point, the model may become overly optimistic, leading to instability in the next iteration; if

σ_{k}

is increased hastily, the model may become overly conservative, sacrificing convergence speed. Therefore, the most reasonable choice is to keep the regularization level unchanged. This design ensures that the algorithm balances convergence and efficiency in long-term operation, without affecting stability due to frequent adjustments of

σ_{k}

.

Finally, the new iterate is given by:

X_{k + 1} = R_{X_{k}} (α_{k} V_{k}) .

This algorithm integrates mechanisms from three aspects: (1) a non-monotone line search, which improves the algorithm’s robustness in nonconvex scenarios; (2) the model consistency ratio, which dynamically evaluates the difference between the predicted and true decreases; and (3) an adaptive regularization update, which balances convergence and efficiency by adjusting

σ_{k}

.

Thus, the ZO-ARPQN algorithm can efficiently simulate the behavior of second-order methods using only function values, and it balances the convergence stability and computational efficiency both theoretically and practically.

5. Convergence Analysis

In this subsection, we prove the global convergence of Algorithm 1. First, we present some standard assumptions that will be used in subsequent analyses.

Assumption 4.

Let

{X_{k}}

be the iterate sequence generated by Algorithm 1. Then, (1)

f : R^{n \times r} \to R

is a continuously differentiable function, and its gradient

\nabla f

is Lipschitz continuous with constant L; (2)

h : R^{n \times r} \to R

is a convex, but nonsmooth, function, and h is Lipschitz continuous with constant

L_{h}

; (3) there exist constants

0 < κ_{1} < κ_{2}

such that, for all

k \geq 0

,

κ_{1} {∥ V ∥}^{2} \leq 〈 {\bar{H}}_{k} [V], V 〉 \leq κ_{2} {∥ V ∥}^{2}, \forall V \in T_{X_{k}} M;

(25)

and (4) the optimal solution

V_{k}

of Subproblem (4) satisfies

∥ V_{k} ∥ \neq 0

for all

k \geq 0

.

Remark 1.

In Assumption (4), the first two items are standard assumptions in the convergence analysis of composite optimization. From Assumption (4), the objective function

ϕ_{k}

of Subproblem (21) is strongly convex, and thus has a unique solution, denoted as

V_{k}

. According to the first-order optimality condition of the original problem (1)

0 \in \nabla f (X^{*}) + {Proj}_{T_{X^{*}} M} (\partial h (X^{*}))

(26)

we have

0 \in {Proj}_{T_{X_{k}} M} \partial {\tilde{ϕ}}_{k} (V_{k}) = \nabla f (X_{k}) + ({\tilde{H}}_{k} + σ_{k} I) [V_{k}] + {Proj}_{T_{X_{k}} M} \partial h (X_{k} + V_{k}) .

(27)

Therefore,

V_{k} = 0

is equivalent to

X_{k}

satisfying (26), i.e.,

X_{k}

is a stationary point of Problem (1). If

V_{k} \neq 0

, similar to the proof of Lemma 5.1 in [23], it can be shown that

V_{k}

brings a sufficient decrease in

ϕ_{k}

. For completeness, we provide the proof below.

Lemma 3.

Given iterate

X_{k}

, let

\tilde{ϕ} (V) : = 〈 {\bar{g}}_{μ, ξ} (x), V 〉 + \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + σ_{k} I) [V], V 〉 + h (x + V)

(28)

be the objective function in (21). Then, for any

α \in [0, 1]

, we have:

\tilde{ϕ} (α V_{k}) - \tilde{ϕ} (0) \leq \frac{α (α - 2)}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + σ_{k} I) [V_{k}], V_{k} 〉 .

(29)

Proof.

Since

\tilde{ϕ}

is

\frac{1}{t}

-strongly convex, we have

\tilde{ϕ} (\hat{V}) \geq \tilde{ϕ} (V) + 〈 \partial \tilde{ϕ} (V), \hat{V} - V 〉 + \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + α_{k} I) [V], V 〉, \forall V, \hat{V} \in R^{n \times r} .

(30)

In particular, if

V, \hat{V}

are feasible solutions (i.e.,

V, \hat{V} \in T_{X_{k}} M

), then:

〈 \partial \tilde{ϕ} (V), \hat{V} - V 〉 = 〈 {Proj}_{T_{X_{k}} M} \partial \tilde{ϕ} (V), \hat{V} - V 〉 .

From the optimality condition,

0 \in {Proj}_{T_{X_{k}} M} \partial \tilde{ϕ} (V_{k})

. Let

V = V_{k}

and substitute

\hat{V} = 0

into (30); then, we obtain:

\tilde{ϕ} (0) \geq \tilde{ϕ} (V_{k}) + \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + α_{k} I) [V], V 〉 .

According to the definitions of

\tilde{ϕ} (0) = h (X_{k})

and

\tilde{ϕ} (V_{k})

, further expansion gives:

h (X_{k}) \geq 〈 \nabla f (X_{k}), V_{k} 〉 + \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + α_{k} I) [V], V 〉 + h (X_{k} + V_{k}) + \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + α_{k} I) [V], V 〉 .

Additionally, using the convexity of h, we have:

h (X_{k} + α V_{k}) = h (α (X_{k} + V_{k}) + (1 - α) X_{k}) \leq α h (X_{k} + V_{k}) + (1 - α) h (X_{k}) .

According to

\tilde{ϕ} (0) = h (X_{k})

, expand

{\tilde{ϕ}}_{k} (α V_{k})

by definition, and combine the above inequalities:

\begin{matrix} {\tilde{ϕ}}_{k} (α V_{k}) - {\tilde{ϕ}}_{k} (0) & = 〈 \nabla f (X_{k}), α V_{k} 〉 + \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + σ_{k} I) [α V_{k}], α V_{k} 〉 + h (X_{k} + α V_{k}) - h (X_{k}) \\ \leq α (〈 \nabla f (X_{k}), V_{k} 〉 + \frac{α}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + σ_{k} I) [V_{k}], V_{k} 〉 + h (X_{k} + V_{k}) - h (X_{k})) \\ = α ({\tilde{ϕ}}_{k} (V_{k}) - {\tilde{ϕ}}_{k} (0) + \frac{α - 1}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + σ_{k} I) [V_{k}], V_{k} 〉) \\ \leq \frac{α (α - 2)}{2} 〈 ({\bar{H}}_{μ, ξ} (x) + σ_{k} I) [V_{k}], V_{k} 〉 . \end{matrix}

(31)

This completes the proof. □

Regarding the boundedness of

σ_{k}

: An important part of the convergence analysis of regularized Newton-type methods is to prove the boundedness of the sequence

{σ_{k}}

. We first present a preparatory lemma. According to the procedure of Algorithm 1, if

ρ_{k} < η_{1}

, then

σ_{k}

is scaled up by a factor of

γ_{2}

. We aim to prove that, when

σ_{k}

is sufficiently large,

ρ_{k} \geq η_{1}

always holds. Thus, the inner loop (Steps 3–15) of Algorithm 1 will terminate in finite steps. Since the manifold

M

is compact, we can define

ϱ : = sup_{X \in M} ∥ \nabla f (X) ∥ .

Recall that parameters

M_{1}

and

M_{2}

are given by (6) and (7), respectively. Our goal is to provide a sufficient condition that ensures

E [ρ_{k}] \geq η_{1}

.

Lemma 4.

(Sufficient condition for

E [ρ_{k}] \geq η_{1}

) Under the assumptions that f is L-smooth (Assumption (4)) and h is

L_{h}

-Lipschitz, define constants

c_{1} : = ϱ M_{2} + \frac{1}{2} L M_{1}^{2}, c_{2} : = c_{1} + L_{h} M_{2}, ϱ : = sup_{x \in M} ∥ \nabla f (x) ∥ .

Let

V_{k} \in T_{x_{k}} M

be the solution of the subproblem

{min}_{V} {\tilde{ϕ}}_{k} (V)

, take step size

α_{k} \in (0, 1]

, denote

η_{k} = α_{k} V_{k}

, and define

ρ_{k} : = \frac{F (x_{k}) - F (R_{x_{k}} (η_{k}))}{- {\tilde{ϕ}}_{k} (η_{k}) + {\tilde{ϕ}}_{k} (0)} .

If there exists

κ_{1} > 0

such that (corresponding to parameter (25))

- {\tilde{ϕ}}_{k} (α V) + {\tilde{ϕ}}_{k} (0) \geq \frac{α (2 - α)}{2} κ_{1} {∥ V ∥}^{2}, \forall α \in [0, 1],

(32)

and the regularization parameter

σ_{k}

and trust region radius

{∥ V ∥}_{k} \leq Δ

satisfy

σ_{k} \geq \bar{σ} : = - κ_{1} + \frac{2 c_{2}}{2 - η_{1}} + \frac{2}{2 - η_{1}} (\frac{ε_{g}}{Δ} + ε_{H}),

(33)

then we have the conclusion

E [ρ_{k}] \geq η_{1} .

Proof.

According to the strong convexity of f,

f (R_{x} (η)) \leq f (x) + 〈 \nabla f (x), η 〉 + \frac{L_{g}}{2} {∥ η ∥}^{2} .

(34)

Since h is

L_{h}

-Lipschitz in the ambient space, we have the following second-order retraction error:

∥ R_{x} {(η) - (X + η) ∥}_{F} \leq M_{2} {∥ η ∥}^{2} .

(35)

Thus,

h (R_{x} (η)) \leq h (x + η) + L_{h} M_{2} {∥ η ∥}^{2} .

(36)

Define updated constants

c_{1} : = \frac{L_{g}}{2}, c_{2} : = c_{1} + L_{h} M_{2} = \frac{L_{g}}{2} + L_{h} M_{2} .

(37)

From the operator norm error of the Hessian

ε_{H} : = E {∥ {\bar{H}}_{μ, ξ} (x) - H f (x) ∥}_{op} \leq \tilde{C} \frac{{(d + 16)}^{6}}{b^{3 / 2}} L_{g}^{1.5} + \frac{1}{27} μ^{3} L_{H}^{3} {(d + 6)}^{7.5} .

(38)

From Lemma 1, using Jensen’s inequality to extract the square root

ε_{g} : = \sqrt{μ^{2} L_{g}^{2} {(d + δ)}^{3} + \frac{8 (d + 4)}{n} σ^{2} + \frac{8 (d + 4)}{n} {∥ \nabla f (x_{k}) ∥}^{2}}, E ∥ {\bar{g}}_{μ, ξ} (x_{k}) - \nabla f (x_{k}) ∥ \leq ε_{g} .

(39)

There exists

κ_{1} > 0

such that

〈 ({\tilde{H}}_{μ, ξ} (x) + κ_{1} I) [V], V 〉 \geq 0 .

(40)

Thus, for any

α \in [0, 1]

, there is a “strongly convex” lower bound for the model

- {\tilde{ϕ}}_{k} (α V) + {\tilde{ϕ}}_{k} (0) \geq \frac{α (2 - α)}{2} κ_{1} {∥ V ∥}^{2} .

(41)

From (34)–(36), for any

V \in T_{x} M

,

α > 0

, we have

\begin{matrix} F (R_{x} (α V)) & \leq f (x) + 〈 \nabla f (x), α V 〉 + c_{1} {∥ α V ∥}^{2} + h (x + α V) + L_{h} M_{2} {∥ α V ∥}^{2} \\ = F (x) + 〈 \nabla f (x), α V 〉 + c_{2} {∥ α V ∥}^{2} + h (x + α V) - h (x) . \end{matrix}

(42)

Add and subtract

{\tilde{ϕ}}_{k} (α V)

on the right-hand side, then add and subtract

\nabla f

and

\bar{g}

, and use

〈 (H f (x_{k}) + κ_{1} I) [U], U 〉 \geq 0

to obtain

\begin{matrix} F (R_{x_{k}} (α V)) & \leq F (x_{k}) + {\tilde{ϕ}}_{k} (α V) + (c_{2} - \frac{1}{2} (κ_{1} + σ_{k})) {∥ α V ∥}^{2} \\ + 〈 \nabla f (x_{k}) - {\bar{g}}_{μ, ξ} (x_{k}), α V 〉 - \frac{1}{2} 〈 ({\bar{H}}_{μ, ξ} (x_{k}) - H f (x_{k})) [α V], α V 〉 . \end{matrix}

(43)

Take the expectation over

(μ, ξ)

and apply the Cauchy–Schwarz inequality and (38) and (39); then, we have

\begin{matrix} F (R_{x_{k}} (α V)) & \leq F (x_{k}) + E {\tilde{ϕ}}_{k} (α V) + (c_{2} - \frac{1}{2} (κ_{1} + σ_{k})) {∥ α V ∥}^{2} \\ + \underset{Gradient Estimation Error}{\underset{︸}{〈 \nabla f (x_{k}) - {\bar{g}}_{k}, α V 〉}} - \frac{1}{2} \underset{Hessian Estimation Error}{\underset{︸}{〈 ({\bar{H}}_{k} - H f (x_{k})) [α V], α V 〉}} . \end{matrix}

(44)

Take the expectation over

(μ, ξ)

and apply the Cauchy–Schwarz inequality and (38) and (39):

E F (R_{x_{k}} (α V)) \leq F (x_{k}) + E {\tilde{ϕ}}_{k} (α V) + (c_{2} - \frac{1}{2} (κ_{1} + σ_{k}) + \frac{1}{2} ε_{H}) {∥ α V ∥}^{2} + α ε_{g} ∥ V ∥ .

(45)

Let

V_{k}

be the solution of the subproblem

{min}_{V} {\tilde{φ}}_{k} (V)

, and define

ρ_{k} : = \frac{F (x_{k}) - F (R_{x_{k}} (α_{k} V_{k}))}{- {\tilde{ϕ}}_{k} (α_{k} V_{k}) + {\tilde{ϕ}}_{k} (0)} .

(46)

From (32) and (45), we have

\begin{matrix} 1 - E [ρ_{k}] & = \frac{E [F (R_{x_{k}} (α_{k} V_{k})) - F (x_{k}) - {\tilde{ϕ}}_{k} (α_{k} V_{k}) + {\tilde{ϕ}}_{k} (0)]}{- {\tilde{ϕ}}_{k} (α_{k} V_{k}) + {\tilde{ϕ}}_{k} (0)} \\ \leq \frac{(c_{2} - \frac{1}{2} (κ_{1} + σ_{k}) + \frac{1}{2} ε_{H}) ∥ α_{k} V_{k} ∥^{2} + α_{k} ε_{g} ∥ V_{k} ∥}{\frac{α_{k} (2 - α_{k})}{2} κ_{1} {∥ V_{k} ∥}^{2}} \\ = \frac{2}{κ_{1} (2 - α_{k})} (c_{2} - \frac{1}{2} (κ_{1} + σ_{k}) + \frac{1}{2} ε_{H} + \frac{ε_{g}}{∥ V_{k} ∥}) . \end{matrix}

(47)

To ensure

E [ρ_{k}] \geq η_{1}

, it is sufficient to set the last term ≤1

- η_{1}

and solve for

σ_{k}

to obtain the equivalent condition

σ_{k} \geq - κ_{1} (2 - α_{k}) (1 - η_{1}) + \frac{2 ε_{g}}{∥ V_{k} ∥} + 2 c_{2} + ε_{H} - κ_{1} .

(48)

Additionally,

0 \leq α_{k} \leq 1

; thus,

1 \leq 2 - α_{k} \leq 2

. Introduce the trust region radius

∥ V_{k} ∥ \leq Δ

; thus, it is sufficient to have

σ_{k} \geq 2 c_{2} + ε_{H} + \frac{2 ε_{g}}{Δ} - κ_{1} (2 - η_{1})

(49)

to ensure

E [ρ_{k}] \geq η_{1}

. □

Lemma 5

(Boundedness of the Regularization Parameter

σ_{k}

). Assume that the solution to the subproblem satisfies a uniform radius upper bound

∥V_{k}∥ \leq Δ

(attainable via the trust region radius or proximal regularization term). Let:

σ_{est} : = 2 c_{2} + ε_{H} + \frac{2 ε_{g}}{Δ} - κ_{1} (2 - η_{1})

(50)

Then, for all

k \geq 0

, the following holds:

σ_{k} \leq max \{σ_{0}, γ_{2} σ_{est}\}

(51)

Proof.

From Equation (49), we know that, if

σ_{k} \geq σ_{est}

, then

E [ρ_{k}] \geq η_{1}

always holds. We prove the conclusion by mathematical induction:

Base Case (

k = 0

): It is trivially true that

σ_{0} \leq max {σ_{0}, γ_{2} σ_{est}}

.

Inductive Hypothesis: Assume that the inequality holds for

k = j

, i.e.,

σ_{j} \leq max {σ_{0}, γ_{2} σ_{est}}

.

Inductive Step (

k = j + 1

): Consider two cases: (1) If

σ_{j} < σ_{est}

: By Lemma 4,

σ_{j}

fails to satisfy the acceptance condition in Step 10 of Algorithm 1. According to Step 15 of the algorithm,

σ_{j}

is scaled up by a factor of

γ_{2} > 1

, so

σ_{j + 1} = γ_{2} σ_{j} < γ_{2} σ_{est}

. (2) If

σ_{j} \geq σ_{est}

: by Lemma 4,

E [ρ_{j}] \geq η_{1}

. According to Steps 10–13 of the algorithm,

σ_{j + 1}

is either unchanged or scaled down by

γ_{1} \in (0, 1)

(i.e.,

σ_{j + 1} = γ_{1} σ_{j}

). Thus,

σ_{j + 1} \leq σ_{j} \leq max {σ_{0}, γ_{2} σ_{est}}

(by the inductive hypothesis).

By combining both cases, we have

σ_{j + 1} \leq max {σ_{0}, γ_{2} σ_{est}}

. By the principle of mathematical induction, the inequality holds for all

k \geq 0

. □

Theorem 1

(Convergence to Stationary Points). Define:

{\bar{α}}_{est} : = min \{1, \frac{(2 - σ) κ_{1}}{2 (c_{2} + \frac{1}{2} ε_{H} + \frac{ε_{g}}{Δ})}\}

(52)

When

α_{k} \geq γ {\bar{α}}_{est}

for all

k \geq 0

(where

γ \in (0, 1)

is the backtracking scaling factor in Step 7 of the algorithm), the backtracking line search (Steps 6–8 of the algorithm) terminates in finite steps. Furthermore, we have:

lim_{k \to \infty} E [∥V_{k}∥] = 0

(53)

and any accumulation point

X^{*}

of the iterate sequence

{X_{k}}

is a first-order stationary point of the problem

F (X) = f (X) + h (X)

(i.e.,

\nabla F (X^{*}) = 0

).

Proof.

Substitute Equation (54) into Equation (47). When

α \leq {\bar{α}}_{est}

, the following inequality holds:

(c_{2} - \frac{1}{2} (κ_{1} + σ_{k}) + \frac{1}{2} ε_{H}) α + \frac{ε_{g}}{Δ} \leq \frac{1 - η_{1}}{2} κ_{1} (2 - α)

(54)

This implies that

E [ρ_{k}] \geq η_{1}

, so the step size

α

is accepted. For the backtracking line search, we start with

α = 1

and scale it down by

γ

iteratively. Since

α

will eventually be reduced to

\leq {\bar{α}}_{est}

(or

σ_{k}

will be increased to

\geq {\bar{σ}}_{est}

), the line search terminates in finite steps, and

α_{k} \geq γ {\bar{α}}_{est}

.

From Equation (32) and the lower bound of

α_{k}

, we derive:

E [F (x_{k}) - F (x_{k + 1})] \geq \frac{η_{1} κ_{1}}{2} α_{\min} (2 - α_{\min}) E [{∥V_{k}∥}^{2}]

(55)

where

α_{\min} = γ {\bar{α}}_{est} > 0

. By summing both sides over k from 0 to ∞, the left-hand side becomes

E [F (x_{0}) - {lim}_{k \to \infty} F (x_{k})]

. Since

{E [F (x_{l (k)})]}

is non-increasing and bounded below (as F is bounded from below), the sum converges:

\sum_{k = 0}^{\infty} α_{k} E [{∥V_{k}∥}^{2}] < \infty

(56)

Because

α_{k} \geq α_{\min} > 0

, the terms of the series

\sum_{k = 0}^{\infty} E [{∥V_{k}∥}^{2}]

must tend to zero. Thus:

lim_{k \to \infty} E [∥V_{k}∥] = 0

To prove that accumulation points are stationary, let

X^{*}

be an accumulation point of

{X_{k}}

, and let

{X_{k_{i}}}

be a subsequence converging to

X^{*}

. From

{lim}_{i \to \infty} E [∥V_{k_{i}}∥] = 0

, we have

{lim}_{i \to \infty} V_{k_{i}} = 0

(by the continuity of the norm). Substitute

V_{k_{i}} \to 0

into the first-order optimality condition of the subproblem:

0 \in {\bar{g}}_{μ, ξ} (X_{k_{i}}) + ({\bar{H}}_{μ, ξ} (X_{k_{i}}) + σ_{k_{i}} I) V_{k_{i}} + {Proj}_{T_{X_{k_{i}}} M} \partial h (X_{k_{i}} + V_{k_{i}})

By taking the limit as

i \to \infty

and using the continuity of

\nabla f

, error, and

\partial h

, we obtain the first-order optimality condition of the original problem:

0 \in \nabla f (X^{*}) + {Proj}_{T_{X^{*}} M} (\partial h (X^{*}))

This confirms that

X^{*}

is a stationary point of Problem (1). □

5.1. Complexity Analysis

When analyzing the overall complexity of stochastic zeroth-order Riemannian optimization algorithms, we must consider both the computational cost of the outer iterations (Steps 1–16 of the algorithm) and the inner iterations (Steps 3–15 of the algorithm). The former characterizes the number of main iterations required to converge to an

ε

-stationary point, while the latter reflects the number of calls to the approximate subproblem solver on the manifold (ASSN) within each outer iteration. Due to the non-monotone line search and trust region/regularization update mechanisms, there is a tight coupling between the complexities of the outer and inner iterations.

In this proof, the acceptance criterion is defined as:

ρ_{k} \geq η_{1}

(57)

Definition 4

(

ε

-Stationary Point). If the optimal solution

V_{k}

of the subproblem in a certain outer iteration satisfies

∥V_{k}∥ = ε

, then

X_{k}

is called an ε-stationary point.

Theorem 2

(Outer and Inner Iteration Complexity). Let

F^{*}

be the optimal value of Problem (1), and

σ \in (0, 1)

be the constant in the sufficient decrease criterion. Define:

Θ : = \frac{2 (F (x_{0}) - F^{*})}{η_{1} κ_{1} γ \bar{α} (2 - γ \bar{α}) ε^{2}}

(58)

The algorithm finds an ε-stationary point within, at most,

⌈ Θ ⌉

outer iterations. Furthermore, the total number of inner ASSN calls satisfies:

\sum_{i = 0}^{⌈ Θ ⌉ - 1} r (i) \leq ⌈ Θ ⌉ (2 - {log}_{γ_{2}} γ_{1}) + {log}_{γ_{2}} (\frac{max \{σ_{0}, γ_{2} \bar{σ}\}}{σ_{0}})

(59)

Proof.

Step 1: Upper Bound on Outer Iterations From Equation (55) and

α_{k} \geq α_{\min} : = γ \bar{α}

, we have:

E [F (X_{k}) - F (X_{k + 1})] \geq \frac{η_{1} κ_{1}}{2} α_{\min} (2 - α_{\min}) E [{∥V_{k}∥}^{2}]

(60)

Assume the K-th outer iteration yields

∥V_{K}∥ = ε

. By summing both sides over k from 0 to

K - 1

, the left-hand side telescopes to

E [F (X_{0}) - F (X_{K})]

. Since

F (X_{K}) \geq F^{*}

, we get:

E [F (X_{0}) - F^{*}] \geq K \cdot \frac{η_{1} κ_{1}}{2} α_{\min} (2 - α_{\min}) ε^{2}

(61)

Rearranging for K gives:

K \leq \frac{2 (E [F (X_{0}) - F^{*}])}{η_{1} κ_{1} α_{\min} (2 - α_{\min}) ε^{2}} = Θ

(62)

Taking the ceiling function confirms that an

ε

-stationary point is found within

⌈ Θ ⌉

outer iterations.

Step 2: Upper Bound on Inner Iterations By the boundedness of

σ_{k}

(Lemma 5):

σ_{k} \leq max \{σ_{0}, γ_{2} \bar{σ}\}

(63)

Suppose the inner loop calls ASSN

r (i)

times in the i-th outer iteration. By the update rule of

σ

, we have

σ_{i + 1} = γ_{2}^{r (i) - 1} σ_{i}

. Since

γ_{1} < 1

, the following inequality holds:

\frac{σ_{i + 1}}{σ_{i}} \geq γ_{1} γ_{2}^{r (i) - 1} \Rightarrow r (i) \leq 1 + {log}_{γ_{2}} (\frac{σ_{i + 1}}{σ_{i}} \cdot \frac{γ_{2}}{γ_{1}})

(64)

umming both sides from

i = 0

to

i = K - 1

(where

K = ⌈ Θ ⌉

) results in the following:

\sum_{i = 0}^{K - 1} r (i) \leq K + \sum_{i = 0}^{K - 1} {log}_{γ_{2}} (\frac{σ_{i + 1}}{σ_{i}}) + K {log}_{γ_{2}} (\frac{γ_{2}}{γ_{1}})

The sum of logarithms simplifies to the logarithm of the product (telescoping sum):

\sum_{i = 0}^{K - 1} {log}_{γ_{2}} (\frac{σ_{i + 1}}{σ_{i}}) = {log}_{γ_{2}} (\frac{σ_{K}}{σ_{0}})

By substituting Equation (63) (i.e.,

σ_{K} \leq max {σ_{0}, γ_{2} \bar{σ}}

) and simplifying the logarithmic term, we obtain the following:

\sum_{i = 0}^{K - 1} r (i) \leq K (2 - {log}_{γ_{2}} γ_{1}) + {log}_{γ_{2}} (\frac{max \{σ_{0}, γ_{2} \bar{σ}\}}{σ_{0}})

(65)

This completes the proof. □

5.2. Saddle-Point Escape Theorem

In non-convex optimization problems, the presence of saddle points severely degrades the convergence efficiency of algorithms. Unlike local minima, saddle points have zero gradients while their Hessian matrices have negative eigenvalues. This causes first-order methods (e.g., stochastic gradient descent) to easily stagnate near saddle points, leading to prolonged slow convergence or even halting. This phenomenon is more prominent in high-dimensional manifold optimization, where the number of saddle points far exceeds that of local minima.

In recent years, second-order methods (e.g., Hessian-based algorithms) have been proven to escape saddle points more effectively. However, in the zeroth-order setting, we cannot directly access accurate gradient or Hessian information, and can only rely on function value approximations. This makes the identification and escape of saddle points more challenging. Traditional zeroth-order methods often require excessive sampling and computation, resulting in a high complexity that hinders practical application. The proposed method in this paper achieves saddle-point escape using only function value queries, without the need for explicit first-order or second-order information.

Lemma 6.

Let

g_{k} = \nabla f (x_{k})

,

{\bar{g}}_{μ} (x_{k})

be the zeroth-order gradient estimate, and

{\bar{H}}_{μ} (x_{k}) = \frac{1}{b} \sum_{i = 1}^{b} H_{μ, i} (x_{k})

be the mini-batch average of Hessian estimates. The following control inequality holds:

∥{\bar{g}}_{μ} (x_{k}) + {\bar{H}}_{μ} (x_{k}) [η_{k}]∥ \leq \frac{1}{2} {∥η_{k}∥}^{2} + \sqrt{δ_{g}} + δ_{μ}

(66)

where

δ_{g}

and

δ_{μ}

correspond to the zeroth-order gradient estimation error and Hessian mini-batch estimation error, respectively.

Proof.

Expand using the triangle inequality:

∥{\bar{g}}_{μ} (x_{k}) + {\bar{H}}_{μ} (x_{k}) [η_{k}]∥ \leq ∥g_{k} - {\bar{g}}_{μ} (x_{k})∥ + ∥{\bar{H}}_{μ} (x_{k}) [η_{k}]∥ + ∥g_{k}∥

Step 1: Bound on

∥{\bar{H}}_{μ} (x_{k}) [η_{k}]∥

. Apply the Frobenius norm and Cauchy–Schwarz inequality:

∥{\bar{H}}_{μ} (x_{k}) [η_{k}]∥ \leq ∥[∥ F] {\bar{H}}_{μ} (x_{k}) ∥η_{k}∥

Step 2: Apply Young’s Inequality. For the product term, Young’s inequality (

a b \leq \frac{a^{2}}{2} + \frac{b^{2}}{2}

) gives:

∥[∥ F] {\bar{H}}_{μ} (x_{k}) ∥η_{k}∥ \leq \frac{1}{2} {∥η_{k}∥}^{2} + \frac{1}{2} ∥[∥ F] {{\bar{H}}_{μ} (x_{k})}^{2}

(67)

Step 3: Bound on

E [∥[∥ F] {{\bar{H}}_{μ} (x_{k})}^{2}]

. Note that

{\bar{H}}_{μ} (x_{k})

is the average of b independent and identically distributed (i.i.d.) terms

H_{μ, i} (x_{k})

. By Jensen’s inequality and Cauchy–Schwarz, we have:

E [∥[∥ F] {{\bar{H}}_{μ} (x_{k})}^{2}] \leq \frac{1}{b} \sum_{i = 1}^{b} E [∥[∥ F] {H_{μ, i} (x_{k})}^{2}] \leq \frac{1}{b} {(E [∥[∥ F] {H_{μ} (x_{k})}^{4}])}^{1 / 2}

Using the fourth-moment bound from Equation (20) (

E [∥[∥ F] {H_{μ} (x_{k})}^{4}] ≲ \frac{{(d + 16)}^{8} L_{g}^{2}}{8}

), we define:

\frac{1}{2} E [∥[∥ F] {{\bar{H}}_{μ} (x_{k})}^{2}] \leq \frac{{(d + 16)}^{4}}{4 \sqrt{2}} L_{g} : = δ_{μ}

(68)

Step 4: Combine Results. Substitute Equation (67) and the gradient estimation error bound (from Lemma 1,

E [∥g_{k} - {\bar{g}}_{μ} (x_{k})∥] \leq \sqrt{δ_{g}}

) into the triangle inequality. Taking the expectation and simplifying gives:

E [∥{\bar{g}}_{μ} (x_{k}) + {\bar{H}}_{μ} (x_{k}) [η_{k}]∥] \leq \frac{1}{2} E [{∥η_{k}∥}^{2}] + \sqrt{δ_{g}} + δ_{μ}

This completes the proof. □

Lemma 7.

Let

x_{k + 1} = R_{x_{k}} (η_{k})

and

g_{k} = \nabla f (x_{k})

. The following holds:

E [∥g_{k + 1}∥] \leq \frac{L_{H} + 1}{2} E [{∥η_{k}∥}^{2}] + \sqrt{δ_{g}} + δ_{H} + δ_{μ}

(69)

where

δ_{g}

,

δ_{H}

, and

δ_{μ}

correspond to the zeroth-order gradient estimation error, zeroth-order Hessian estimation error, and mini-batch averaging error, respectively.

Proof.

Since parallel transport

P_{η_{k}}

is isometric, we have:

∥g_{k + 1}∥ = ∥P_{η_{k}}^{- 1} g_{k + 1}∥

Add and subtract

g_{k} + H f (x_{k}) [η_{k}]

to decompose the term into four components:

\begin{matrix} P_{η_{k}}^{- 1} g_{k + 1} & = \underset{Taylor Remainder}{\underset{︸}{P_{η_{k}}^{- 1} g_{k + 1} - g_{k} - H f (x_{k}) [η_{k}]}} + \underset{Gradient Estimation Error}{\underset{︸}{g_{k} - {\bar{g}}_{μ, ξ} (x_{k})}} \\ + \underset{Hessian Estimation Error}{\underset{︸}{H f (x_{k}) [η_{k}] - {\bar{H}}_{μ, ξ} (x_{k}) [η_{k}]}} + \underset{Subproblem Main Term}{\underset{︸}{{\bar{g}}_{μ, ξ} (x_{k}) + {\bar{H}}_{μ, ξ} (x_{k}) [η_{k}]}} \end{matrix}

Taking the norm and applying the triangle inequality yields the following:

∥g_{k + 1}∥ \leq ∥P_{η_{k}}^{- 1} g_{k + 1} - g_{k} - H f (x_{k}) [η_{k}]∥ + ∥g_{k} - {\bar{g}}_{μ, ξ}∥ + ∥H f (x_{k}) [η_{k}] - {\bar{H}}_{μ, ξ} [η_{k}]∥ + ∥{\bar{g}}_{μ, ξ} + {\bar{H}}_{μ, ξ} [η_{k}]∥

(i) Bound on the Taylor Remainder. By the equivalent condition of Assumption 2 (Lipschitz Hessian, Equation (17)),

∥P_{η_{k}}^{- 1} g_{k + 1} - g_{k} - H f (x_{k}) [η_{k}]∥ \leq \frac{L_{H}}{2} {∥η_{k}∥}^{2}

(ii) Bound on Gradient Estimation Error. From Lemma 1, the expectation of the gradient estimation error satisfies

E [∥g_{k} - {\bar{g}}_{μ, ξ} (x_{k})∥] \leq \sqrt{δ_{g}}

.

(iii) Bound on Hessian Estimation Error. Using the operator norm and Young’s inequality:

∥H f (x_{k}) [η_{k}] - {\bar{H}}_{μ, ξ} [η_{k}]∥ \leq ∥[∥ op] H f (x_{k}) - {\bar{H}}_{μ, ξ} ∥η_{k}∥ \leq \frac{1}{2} {∥η_{k}∥}^{2} + \frac{1}{2} ∥[∥ op] {H f (x_{k}) - {\bar{H}}_{μ, ξ}}^{2}

Taking the expectation and using Lemma 2 (

E [∥[∥ op] {H f (x_{k}) - {\bar{H}}_{μ, ξ}}^{2}] \leq δ_{H}

), we get:

E [∥H f (x_{k}) [η_{k}] - {\bar{H}}_{μ, ξ} [η_{k}]∥] \leq \frac{1}{2} E [{∥η_{k}∥}^{2}] + δ_{H}

(iv) Bound on the Subproblem Main Term. By Lemma 6, we have

E [∥{\bar{g}}_{μ, ξ} + {\bar{H}}_{μ, ξ} [η_{k}]∥] \leq \frac{1}{2} E [{∥η_{k}∥}^{2}] + \sqrt{δ_{g}} + δ_{μ}

.

(v) Combine All Bounds. Substitute (i)–(iv) into the triangle inequality and simplify using algebraic manipulations:

E [∥g_{k + 1}∥] \leq (\frac{L_{H}}{2} + 1) E [{∥η_{k}∥}^{2}] + \sqrt{δ_{g}} + δ_{H} + δ_{μ}

This completes the proof. □

Lemma 8

(Relationship Between Step Size and Minimum Eigenvalue). Let

x_{k + 1} = R_{x_{k}} (η_{k})

, and

P_{η_{k}} : T_{x_{k}} M \to T_{x_{k + 1}} M

be the isometric parallel transport along

η_{k}

. Let

σ_{k} \geq 0

be such that

{\bar{H}}_{μ, ξ} + σ_{k} I ⪰ 0

. Then, the following holds:

E [∥η_{k}∥] \geq - \frac{\sqrt{δ_{H}} + σ_{k} + E [λ_{\min} (H f (x_{k + 1}))]}{L_{H}}

(70)

Proof.

Step 1: Invariance of Minimum Eigenvalue Under Parallel Transport. Since parallel transport is isometric, the minimum eigenvalue of the Hessian is invariant:

λ_{\min} (H f (x_{k + 1})) = λ_{\min} (P_{η_{k}}^{- 1} H f (x_{k + 1}) P_{η_{k}})

Let

Δ_{k} : = P_{η_{k}}^{- 1} H f (x_{k + 1}) P_{η_{k}} - H f (x_{k})

. Then:

λ_{\min} (H f (x_{k + 1})) = λ_{\min} (H f (x_{k}) + Δ_{k})

Step 2: Apply Weyl’s Inequality. Weyl’s inequality states that, for symmetric matrices A and B,

λ_{\min} (A + B) \geq λ_{\min} (A) + λ_{\min} (B)

. For any symmetric matrix A, the minimum eigenvalue satisfies

λ_{\min} (A) \geq - ∥[∥ op] A

(by the Rayleigh quotient characterization). Thus:

λ_{\min} (Δ_{k}) \geq - ∥[∥ op] Δ_{k}

By Assumption 2 (Equation (16)),

∥[∥ op] Δ_{k} \leq L_{H} ∥η_{k}∥

. Substituting into Weyl’s inequality yields the following:

λ_{\min} (H f (x_{k + 1})) \geq λ_{\min} (H f (x_{k})) - L_{H} ∥η_{k}∥

(71)

Step 3: Decompose the Hessian and Introduce Regularization. Rewrite the true Hessian as:

H f (x_{k}) = (H f (x_{k}) - {\bar{H}}_{μ, ξ} (x_{k})) + ({\bar{H}}_{μ, ξ} (x_{k}) + σ_{k} I) - σ_{k} I

Substitute into Equation (71) and apply Weyl’s inequality again. Since

{\bar{H}}_{μ, ξ} + σ_{k} I ⪰ 0

, its minimum eigenvalue is non-negative, so:

λ_{\min} (H f (x_{k + 1})) \geq λ_{\min} (H f (x_{k}) - {\bar{H}}_{μ, ξ} (x_{k})) - σ_{k} - L_{H} ∥η_{k}∥

(72)

Step 4: Take Expectation and Rearrange. For the symmetric matrix

H f (x_{k}) - {\bar{H}}_{μ, ξ} (x_{k})

, we have

λ_{\min} (\cdot) \geq - ∥[∥ op] \cdot

. Taking the expectation and using Lemma 2 (

E [∥[∥ op] H f (x_{k})]

- {\bar{H}}_{μ, ξ} (x_{k}) \leq \sqrt{δ_{H}}

) yields the following:

E [λ_{\min} (H f (x_{k}) - {\bar{H}}_{μ, ξ} (x_{k}))] \geq - \sqrt{δ_{H}}

Taking the expectation of Equation (72) and rearranging terms gives:

E [λ_{\min} (H f (x_{k + 1}))] \geq - \sqrt{δ_{H}} - σ_{k} - L_{H} E [∥η_{k}∥]

Rearranging for

E [∥η_{k}∥]

completes the proof. □

Theorem 3.

Let

M

be a manifold, and let

f : M \to R

satisfy Assumptions 1–3. Define:

k_{m i n} : = {argmin}_{k} E_{u_{k}, Ξ_{k}} ∥η_{k}∥

If the step size in the update of Algorithm 1 satisfies

α \geq L_{H}

, then:

E [∥g_{k_{\min} + 1}∥] \leq O (ε), E [λ_{\min} (H f_{k_{\min} + 1})] \geq - O (\sqrt{ε})

(73)

where

λ_{\min} (\cdot)

denotes the minimum eigenvalue. The parameters must satisfy:

N = O (\frac{1}{ε^{3 / 2}}), μ = O (min \{\frac{ε}{d^{3 / 2}}, \sqrt{\frac{ε}{d^{3}}}\}), m = O (\frac{d}{ε^{2}}), b = O (\frac{d^{4}}{ε})

(74)

Thus, the complexity of the zeroth-order oracle is:

O (N m + N b) = O (\frac{d}{ε^{7 / 2}} + \frac{d^{4}}{ε^{5 / 2}})

Proof.

From Lemma 6, we have:

E [∥g_{k + 1}∥] \leq \frac{1}{2} (L_{H} + 1) E [{∥η_{k}∥}^{2}] + \sqrt{δ_{g}} + δ_{H} + δ_{μ}

(75)

where

δ_{g}

,

δ_{H}

, and

δ_{μ}

correspond to the gradient and Hessian estimation errors. On the other hand, from Lemma 4.8 (relationship between step size and minimum eigenvalue),

E [∥η_{k}∥] \geq - \frac{\sqrt{δ_{H}} + α_{k} + E [λ_{\min} (H f_{k + 1})]}{L_{H}}

(76)

Next, we analyze the upper bound of

E [∥η_{k}∥]

.

Step 1: Third-Order Taylor Upper Bound. By the equivalent form of Assumption 2 (Equation (18)),

f (R_{x_{k}} (η_{k})) \leq f (x_{k}) + 〈g_{k}, η_{k}〉 + \frac{1}{2} 〈η_{k}, H f (x_{k}) η_{k}〉 + \frac{L_{H}}{6} {∥η_{k}∥}^{3}

(77)

Step 2: Decompose Gradient and Hessian into Estimation + Error. Write the true gradient and Hessian as:

g_{k} = {\bar{g}}_{μ, ξ} + (g_{k} - {\bar{g}}_{μ, ξ}), H f (x_{k}) = {\bar{H}}_{μ, ξ} + (H f (x_{k}) - {\bar{H}}_{μ, ξ})

Substitute into Equation (77) to decompose the function value into the main term, first-order error, second-order error, and third-order term:

f_{k + 1} \leq f_{k} + \underset{Main Term}{\underset{︸}{〈{\bar{g}}_{μ, ξ}, η_{k}〉 + \frac{1}{2} 〈η_{k}, {\bar{H}}_{μ, ξ} η_{k}〉}} + \underset{First - Order Error}{\underset{︸}{〈g_{k} - {\bar{g}}_{μ, ξ}, η_{k}〉}} + \underset{Second - Order Error}{\underset{︸}{\frac{1}{2} 〈η_{k}, (H f (x_{k}) - {\bar{H}}_{μ, ξ}) η_{k}〉}} + \frac{L_{H}}{6} {∥η_{k}∥}^{3}

(78)

Step 3: First-Order Optimality Condition of the Subproblem. From the first-order optimality condition of Subproblem (21),

0 \in {\bar{g}}_{μ, ξ} (x_{k}) + ({\bar{H}}_{μ, ξ} (x_{k}) + α I) η_{k} + {Proj}_{T_{X_{k}} M} \partial h (x_{k} + η_{k})

(79)

Take the inner product with

η_{k}

and rearrange. Using the positive definiteness of

{\bar{H}}_{μ, ξ} + 2 α I

and the subgradient inequality for convex h (

〈\partial h (x_{k} + η_{k}), η_{k}〉 \geq h (x_{k} + η_{k}) - h (x_{k})

), we get:

\begin{matrix} 〈{\bar{g}}_{μ, ξ}, η_{k}〉 + \frac{1}{2} 〈η_{k}, {\bar{H}}_{μ, ξ} η_{k}〉 & = - \frac{1}{2} 〈η_{k}, ({\bar{H}}_{μ, ξ} + 2 α I) η_{k}〉 - 〈\partial h (x_{k} + η_{k}), η_{k}〉 \\ \leq h (x_{k}) - h (x_{k} + η_{k}) \end{matrix}

(80)

Substitute this into Equation (78) and combine terms for

F = f + h

:

F_{k + 1} \leq F_{k} + \underset{First - Order Error}{\underset{︸}{〈g_{k} - {\bar{g}}_{μ, ξ}, η_{k}〉}} + \underset{Second - Order Error}{\underset{︸}{\frac{1}{2} 〈η_{k}, (H f (x_{k}) - {\bar{H}}_{μ, ξ}) η_{k}〉}} + \frac{L_{H}}{6} {∥η_{k}∥}^{3}

(81)

Step 4: Bound First-Order and Second-Order Errors. Using the Cauchy–Schwarz inequality and Young’s inequality,

\begin{matrix} E [〈g_{k} - {\bar{g}}_{μ} (x_{k}), η_{k}〉 + \frac{1}{2} 〈η_{k}, (H f (x_{k}) - {\bar{H}}_{μ} (x_{k})) η_{k}〉] & \leq E [∥g_{k} - {\bar{g}}_{μ} (x_{k})∥ ∥η_{k}∥ + \frac{1}{2} ∥[∥ op] H f (x_{k}) - {\bar{H}}_{μ} (x_{k}) {∥η_{k}∥}^{2}] \\ \leq \frac{32}{3 β} E [{∥g_{k} - {\bar{g}}_{μ} (x_{k})∥}^{3 / 2}] + \frac{12}{β} E [∥[∥ op] {H f (x_{k}) - {\bar{H}}_{μ} (x_{k})}^{3}] \\ + \frac{β}{24} E [{∥η_{k}∥}^{3}] \end{matrix}

(82)

Substitute into Equation (82), take

β > L_{H}

, and use the lower bounds of gradient and Hessian errors. Summing over

k = 0

to

N - 1

and rearranging gives:

\frac{1}{N} \sum_{k = 0}^{N} E [{∥η_{k}∥}^{3}] \leq \frac{24}{L_{H}} (\frac{F_{0} - F^{*}}{N} + \frac{32}{3 L_{H}} δ_{g}^{3 / 4} + \frac{12}{L_{H}} {\tilde{δ}}_{H})

(83)

where

{\tilde{δ}}_{H} = \tilde{C} \frac{{(d + 16)}^{6}}{b^{3 / 2}} L_{g}^{1.5} + \frac{1}{27} μ^{3} L_{H}^{3} {(d + 6)}^{7.5}

. By combining with the parameter choices in Equation (74), we obtain:

E [{∥η_{k_{\min}}∥}^{3}] \leq O (ε^{3 / 2}), E [{∥η_{k_{\min}}∥}^{2}] \leq O (ε)

(84)

Step 5: Final Bounds. Substitute Equation (84) into Lemmas 6 and 7. Simplifying using the parameter choices gives:

E [∥g_{k_{\min} + 1}∥] \leq O (ε), E [λ_{\min} (H f_{k_{\min} + 1})] \geq - O (\sqrt{ε})

The complexity of the zeroth-order oracle is calculated by substituting N, m, and b from Equation (74) into

O (N m + N b)

, leading to:

O (\frac{d}{ε^{7 / 2}} + \frac{d^{4}}{ε^{5 / 2}})

This completes the proof. □

6. Application of ZO-ARPQN

6.1. Sparse Principal Component Analysis (SPCA) on the Stiefel Manifold

A traditional principal component analysis (PCA) extracts low-dimensional representations of data by maximizing the projection variance, but its loading vectors are usually dense, making it difficult to interpret the contribution of individual variables. A sparse principal component analysis (SPCA) introduces a sparsity penalty while maximizing the variance, thereby balancing variable selection and interpretability. Typical applications include gene expression analyses, spectral component separation, financial factor construction, and feature compression in high-dimensional regression. Compared with a PCA, the additional benefits of an SPCA are reflected in: a stronger interpretability (fewer non-zero variables can describe components), a better robustness (reducing the impact of noisy variables), and convenience for downstream modeling (sparse factors are more conducive to regression/classification tasks). This paper considers the SPCA problem on the Stiefel manifold

S t (n, p) = {X \in R^{n \times p} ∣ X^{⊤} X = I_{p}}

, formulated as:

min_{X \in S t (n, p)} F (X) : = f (X) + {λ ∥ X ∥}_{1} = - \frac{1}{2} Tr (X^{⊤} H X) + λ {∥ X ∥}_{1},

(85)

where

H \in R^{n \times n}

is a symmetric positive semi-definite matrix, and

{∥ \cdot ∥}_{1}

denotes the element-wise

ℓ_{1}

-norm to induce sparsity.

6.1.1. Experimental Parameter Settings

The following parameters were adopted in this experiment (consistent with the reproducibility requirements of SJTU doctoral theses):

Data Generation: Set $n = 100$ (number of variables), $p = 10$ (number of principal components), and $K = 5$ (rank of the true subspace). First, perform QR orthogonalization on a Gaussian matrix $A \in R^{n \times K}$ , construct $Λ = diag (| N (0, 1) |)$ (diagonal matrix of absolute values of standard normal samples), and then form $H = A Λ A^{⊤}$ and normalize it by the spectral norm to ensure numerical stability. The sparsity coefficient is set to $λ = 0.1$ (determined via 5-fold cross-validation to balance sparsity and variance retention).
ZO-ManPG Algorithm: Refer to Algorithm 3 in Reference [24]. Set the finite difference radius to $μ = 0.05$ , the gradient sampling counts to $m = {5, 20, 50}$ (to analyze the impact of the sampling density on zero-order estimation), and the maximum iterations to $N = 150$ (sufficient to ensure the convergence of all the compared algorithms).
ZO-ARPQN Algorithm: Adopt the same $μ = 0.05$ and $m = {5, 20, 50}$ as ZO-ManPG; set the Hessian sampling count to $b = 15$ (compromising between the estimation accuracy and computational cost); the initial regularization parameter to $σ_{0} = 5.0$ ; the line search shrinkage factor to $γ = 0.5$ ; the ratio test thresholds to $η_{1} = 0.25$ (lower bound for an acceptable model fidelity) and $η_{2} = 0.75$ (threshold for reducing regularization); the adaptive regularization coefficients to $γ_{1} = 0.8$ (reduction factor) and $γ_{2} = 2.0$ (increase factor); and the maximum iterations to $N = 150$ .
ARPQN Algorithm [16]: As the oracle baseline (with exact gradient/Hessian), use the same $σ_{0}$ , $γ$ , $η_{1}$ , $η_{2}$ , $γ_{1}$ , $γ_{2}$ , and N as ZO-ARPQN, except that the exact Riemannian gradient/Hessian (instead of zero-order estimation) is used to verify the performance upper bound of quasi-Newton methods.

6.1.2. Result Analysis

The vertical axis of the convergence curves plots the optimality gap

F (X_{k}) - F^{*}

(in logarithmic scale) to clearly show the convergence trend. Since Problem (85) is non-convex, the analytical optimal value is difficult to obtain; thus, this paper takes the minimum function value among all the compared algorithms as the empirical optimal value:

F^{*} = min_{a \in {ZO - ManPG, ZO - ARPQN, ARPQN}} F_{a} (X),

where

F_{a} (X)

denotes the final function value of algorithm a. The sparsity metric is defined as:

sparsity (X) = \frac{# {(i, j) : | X_{i j} | < 10^{- 3}}}{n \cdot p} \times 100 %,

which measures the proportion of zero elements in the solution matrix. A higher sparsity indicates that more variables are “compressed” to zero, enhancing the feature-selection ability of the model; however, excessive sparsity may lead to loss of principal component information.

To analyze the impact of the gradient sampling count m on the stability and accuracy of zero-order algorithms, this experiment set

m = 5, 15, 30

respectively for the SPCA problem on the Stiefel manifold, and compared three algorithms: the zeroth-order manifold proximal gradient (ZO-ManPG), the zeroth-order adaptive regularized proximal quasi-Newton (ZO-ARPQN), and ARPQN (with exact gradient/Hessian). All algorithms wererun for 150 iterations under the same initial point and parameters, and the convergence curves of the optimality gap

F (X_{k}) - F^{*}

are plotted. As shown in the figures, all three algorithms exhibited a monotonically decreasing convergence trend, but their performance differed significantly under different sampling counts.

6.1.3. Case 1: Low Sampling Count ( $M = 5$ ) (Figure 1)

When the sampling count was small, the number of random perturbation directions was insufficient, leading to large variance in zero-order estimation of gradients and Hessians, resulting in obvious fluctuations in the convergence curves. Even so, ZO-ARPQN still outperformed ZO-ManPG, which indicates that the quasi-Newton direction and adaptive regularization can effectively suppress noise. ARPQN converged rapidly within approximately 20 iterations, serving as a performance upper bound. Due to the large noise in zero-order Hessian estimation of ZO-ARPQN, the

ℓ_{1}

-regularization term dominated, leading to excessive sparsity (60.0%); in contrast, ZO-ManPG barely produced a significant sparse structure (5.4%), indicating that its gradient approximation is sensitive to noise.

Figure 1. Convergence curves of three algorithms (gradient sampling count

m = 5

).

Figure 1. Convergence curves of three algorithms (gradient sampling count

m = 5

).

6.1.4. Case 2: Moderate Sampling Count ( $M = 15$ ) (Figure 2)

With an increase in the sampling count, zero-order estimation became more stable. The convergence speed of ZO-ARPQN significantly accelerated and almost overlapped with ARPQN after approximately 40 iterations; the oscillation of ZO-ManPG weakened, but its convergence speed remained slow. As the sampling increased, the sparsity of ZO-ARPQN decreased significantly (2.4%), indicating that gradient estimation became more accurate and the impact of sparse constraints weakened; the sparsity of ZO-ManPG increased slightly (20%), showing a more stable convergence trend.

Figure 2. Convergence curves of three algorithms (gradient sampling count

m = 15

).

Figure 2. Convergence curves of three algorithms (gradient sampling count

m = 15

).

6.1.5. Case 3: High Sampling Count ( $M = 30$ ) (Figure 3)

When the sampling count further increased, the convergence curve of ZO-ARPQN almost completely overlapped with that of ARPQN, indicating that its zero-order Hessian approximation is extremely accurate under high sampling. Although ZO-ManPG was still the slowest, its final accuracy improved significantly. The sparsity of ZO-ARPQN was close to that of ARPQN (18.2% vs. 24.2% (Table 1), indicating that high sampling can effectively recover the true sparse structure. The sparsity of ZO-ManPG increased to 53.4%, but its overall convergence remained slow.

Figure 3. Convergence curves of three algorithms (gradient sampling count

m = 30

).

Figure 3. Convergence curves of three algorithms (gradient sampling count

m = 30

).

In conclusion, increasing the sampling count m can significantly improve the convergence quality and stability of zero-order algorithms. Among them, ZO-ARPQN benefited the most from high sampling, as its quasi-Newton mechanism can fully utilize the accurate zero-order Hessian information to approach the performance of the oracle algorithm (ARPQN); ZO-ManPG, as a first-order method, is limited by the inherent low accuracy of zero-order gradient estimation, and even high sampling cannot compensate for the performance gap with second-order methods.

6.2. Black-Box Optimization in Robotic Stiffness Control (Figure 4)

The specific connection between zeroth-order Riemannian optimization and its application in robotic stiffness control [16] is elaborated as follows.

Problem Background: Adaptive force/stiffness control aims to address a classic challenge in robotics: how to enable robots to adjust the force they apply adaptively when performing delicate manipulation tasks (e.g., inserting a key into a lock hole, scraping along a complex surface). This is closely associated with stiffness control, as the robot’s impedance controller achieves precise force control by adjusting its own stiffness and damping. Therefore, learning an adaptive force controller essentially involves learning how to dynamically adjust stiffness parameters based on real-time conditions.

Motivation for Black-Box/Zeroth-Order Optimization: The performance of a controller (e.g., whether key insertion is smooth, whether jamming occurs) cannot be described by simple mathematical formulas. Its performance score can only be obtained through physical experiments or high-fidelity simulations. The process of converting controller parameters to task performance scores is a typical black-box system, and its gradient is completely unknown. Thus, zeroth-order optimization methods become indispensable. The Bayesian optimization adopted in this paper is precisely a zeroth-order method designed for expensive black-box problems.

Figure 4. Schematic diagram of robotic stiffness control.

Riemannian Manifold Constraints: In robotic force/stiffness control, key parameters often lie on specific manifolds:

Stiffness Matrix (K): Physically valid stiffness matrices must be symmetric positive definite (SPD) matrices. The set of all SPD matrices forms a Riemannian manifold. During the optimization process, it is critical to ensure that the updated parameters remain SPD matrices after each iteration.
Desired Pose (R): The controller may need to adjust the desired tool pose (e.g., the orientation of a key), which is represented by a rotation matrix belonging to the $S O (3)$ manifold.

In Cartesian space impedance/stiffness control, the compliance of the end-effector is characterized by a symmetric positive definite (SPD) matrix

K^{P} \in S_{+ +}^{d}

. Given the current end-effector position

\hat{p} \in R^{d}

, end-effector velocity

\dot{p}

, and external force

f^{e} \in R^{d}

, the control law drives the system to reach a new equilibrium position

p (K^{P})

, whose common approximation is:

p (K^{P}) = \hat{p} - \dot{p} - {(K^{P})}^{- 1} f^{e} .

Intuitively, the stiffness matrix

K^{P}

determines the “hardness/softness” of the robot end-effector in response to external disturbances. Given the current end-effector position

\hat{p} \in R^{d}

, desired equilibrium position P, external force

f^{e}

, and damping matrix

K^{D} = K^{P}

(under critical damping), the equilibrium equation can be expressed as:

f^{e} = K^{P} (\hat{p} - p) - K^{D} \dot{p},

(86)

from which the position p is solved.

To obtain a reasonable stiffness matrix

K^{P}

, the following objective function is defined [25]:

f (K^{P}) = w_{p} {∥\hat{p} - p (K^{P})∥}^{2} + w_{d} det (K^{P}) + w_{c} cond (K^{P}),

(87)

where

w_{p}, w_{d}, w_{c} > 0

are weights,

p (K^{P})

denotes the equilibrium position, and

cond (\cdot)

represents the condition number.

This objective function consists of three components:

First term: squared position error ${∥\hat{p} - p (K^{P})∥}^{2}$ , which ensures the end-effector returns as close as possible to the target position;
Second term: penalty on $det (K^{P})$ , which prevents an excessively large stiffness (avoiding damage to the robot or target object);
Third term: penalty on the condition number, which avoids ill-conditioned stiffness matrices (ensuring stable force control).

This is a black-box optimization problem defined on the SPD manifold

S_{+ +}^{d}

: the function f has no explicit gradient information, and only function value queries are available. This characteristic makes traditional first-order methods (e.g., Riemannian gradient descent) difficult to apply directly, necessitating the design of Riemannian optimization methods based on zeroth-order function values.

6.2.1. Manifold and Tangent Space

The SPD manifold is mathematically defined as:

M = S_{+ +}^{d} = \{K \in R^{d \times d} ∣ K = K^{⊤}, K ≻ 0\},

where

K ≻ 0

indicates that K is a positive definite matrix. The tangent space at any point

K \in M

(denoted as

T_{K} M

) consists of all symmetric matrices, i.e.:

T_{K} M = \{U \in R^{d \times d} ∣ U = U^{⊤}\} .

6.2.2. Riemannian Metric

To ensure geometric consistency in stiffness control, the affine-invariant metric—a standard metric for SPD manifolds—is adopted:

{〈 U, V 〉}_{K} = tr (K^{- 1} U K^{- 1} V), \forall U, V \in T_{K} M,

where

tr (\cdot)

denotes the matrix trace. The norm induced by this metric is:

{∥ U ∥}_{K}^{2} = {〈 U, U 〉}_{K},

which characterizes the “magnitude” of tangent vectors in the context of SPD matrix updates.

6.2.3. Riemannian Gradient and Exponential Map

For a smooth objective function

f : M \to R

, let

\nabla_{E} f (K)

denote its Euclidean gradient (computed as if K were an unconstrained matrix in

R^{d \times d}

). The Riemannian gradient—projected onto the tangent space

T_{K} M

—is:

grad f (K) = K \cdot sym (\nabla_{E} f (K)) \cdot K,

where

sym (A) = \frac{1}{2} (A + A^{⊤})

denotes the symmetric part of matrix A.

The exponential map (used to map tangent vectors back to the SPD manifold) is:

{Exp}_{K} (U) = K^{1 / 2} exp (K^{- 1 / 2} U K^{- 1 / 2}) K^{1 / 2}, \forall U \in T_{K} M,

where

K^{1 / 2}

is the matrix square root of K, and

exp (\cdot)

denotes the matrix exponential. In numerical implementation, this exponential map is directly used as the retraction (a key operation in Riemannian optimization to maintain manifold constraints).

6.2.4. Experimental Setup

Since explicit calculation of gradients and Hessian matrices is impossible in black-box stiffness control, the standard ARPQN (proximal quasi-Newton method with exact gradients) cannot be used as a comparative baseline.

Basic Parameters

Initial stiffness matrix: $K_{0} = I_{d}$ (identity matrix, ensuring initial feasibility on the SPD manifold);
Task space dimension: $d = 3$ (consistent with 3D Cartesian space control of robotic end-effectors);
Maximum iterations: 100 (balancing the computational efficiency and convergence accuracy).

Algorithm-Specific Parameters

Two zeroth-order Riemannian optimization algorithms were compared, with parameters tuned to achieve an optimal performance:

ZO-ManPG (Zeroth-Order Manifold Proximal Gradient): step size: $α = 0.1$ ; finite difference radius (for zero-order gradient estimation): $μ = 10^{- 3}$ ; and mini-batch size (for noise reduction): $m = 10$ .
ZO-ARPQN (Zeroth-Order Adaptive Regularized Proximal Quasi-Newton): smoothing parameter (for zero-order Hessian estimation): $μ = 10^{- 4}$ ; mini-batch size (for gradient/Hessian estimation): $m = 10$ ; initial regularization parameter: $σ_{0} = 5.0$ ; ratio test thresholds (for adaptive regularization): $η_{1} = 0.1$ , $η_{2} = 0.8$ ; and step size decay factor: $γ = 0.6$ .

Both algorithms use the exponential map

{Exp}_{K} (U)

for retraction. After each update, matrix eigenvalues are clipped to the range

[10^{- 3}, 10^{3}]

to ensure the stiffness matrix K remains symmetric positive definite (avoiding physical invalidity of the stiffness parameters).

6.2.5. Experimental Results and Analysis

Figure 5 shows the objective function descent process of ZO-ManPG and ZO-ARPQN on the SPD manifold for the 3D stiffness adjustment task. The vertical axis represents the objective function value

F (K_{k})

(logarithmic scale to highlight convergence trends), and the horizontal axis represents the number of iterations.

F_{final}^{ManPG} = 0.2245, F_{final}^{ARPQN} = 0.1622

Convergence Trend Analysis

(1) Early-stage convergence: Both algorithms reduced the objective function by approximately two orders of magnitude within the first 10 iterations, verifying that zero-order approximation is feasible for black-box stiffness tuning. (2) Late-stage convergence: The ZO-ARPQN curve remained consistently below ZO-ManPG, indicating a higher descent efficiency and more stable convergence. This was attributed to ZO-ARPQN’s adaptive regularization mechanism (dynamically adjusting

σ_{k}

based on model fidelity) and quasi-Newton direction (using Hessian approximation to guide updates).

Quantitative Performance

After 100 iterations, the final objective function values were:

F_{final}^{ManPG} = 0.2245, F_{final}^{ARPQN} = 0.1622 .

Compared to ZO-ManPG, ZO-ARPQN achieved a 27.7% reduction in the objective function value under the same iteration budget, confirming its advantage in balancing zero-order stability and the convergence accuracy.

Practical Significance

The experimental results validate that ZO-ARPQN is effective for non-differentiable black-box SPD optimization tasks (e.g., robotic stiffness control). Its ability to handle “no gradient + manifold constraints” makes it suitable for engineering scenarios where only experimental/simulation scores are available, providing a feasible solution for adaptive stiffness tuning in robotic manipulation.

7. Conclusions

We propose ZO-ARPQN, the first zeroth-order adaptive regularized proximal quasi-Newton framework for composite optimization on Riemannian manifolds, including the Stiefel and SPD manifolds. The method extends the classical ARPQN algorithm to black-box settings by constructing stochastic one-point finite-difference estimators for both the Riemannian gradient and curvature, enabling second-order optimization without explicit derivatives.

We established global convergence guarantees under mild smoothness assumptions, proved that the algorithm achieves convergence to first-order stationary points even with noisy zeroth-order information, and further demonstrated that the curvature-aware regularization enables escape from strict saddle points with a high probability. A complete iteration-complexity analysis is provided, showing that ZO-ARPQN maintains competitive theoretical efficiency compared to existing zeroth-order Riemannian methods.

Extensive numerical experiments on sparse PCA and robot stiffness tuning validated the practical effectiveness of the proposed method. ZO-ARPQN achieved a convergence behavior comparable to first-order ARPQN and other state-of-the-art Riemannian solvers, while requiring only function evaluations. These results highlight its potential for applications in black-box manifold optimization, robotics, machine learning, and simulation-based design.

Future work will include exploring variance-reduced zeroth-order strategies, developing limited-memory versions for large-scale manifolds, and extending the framework to Riemannian stochastic compositional optimization and constrained reinforcement learning.

Author Contributions

Conceptualization, Y.M. and C.L.; Methodology, Y.M.; Software, Y.M.; Validation, Y.M., C.L. and Z.W.; Formal Analysis, Y.M.; Investigation, Y.M.; Resources, C.L.; Data Curation, Y.M.; Writing—Original Draft Preparation, Y.M.; Writing—Review and Editing, C.L. and Y.M.; Visualization, Y.M.; Supervision, C.L.; Project Administration, C.L.; Funding Acquisition, C.L. and Q.L. provided additional support in data analysis and manuscript revision. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Natural Science Foundation of Ningxia (Project No.: 2025AAC030018); The National Social Science Fund of China (grant No. 1246010681): The Uncertain Rescue Model of Major Disaster; The National Social Science Fund of China (grant No. 23BMZ062): Co-Creation of Ecosystem Value for Green Brands in Northwest China’s Characteristic Agriculture under Dual Carbon Targets; The North Minzu University Research Initiative (grant Nos. 2022ZLGTTYS12 and ZDZX201805); and The First-Class Discipline Construction Program of Ningxia (grant No. NXYLXK2017B09). Additional partial support was received from: The Ningxia Youth Talent Support Program (2021 cohort), The Leading Talent Program of North Minzu University, and The Governance and Social Management Research Center of Northwest Ethnic Regions.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to (specify the reason for the restriction).

Acknowledgments

The authors gratefully acknowledge the computational resources provided by the Intelligent Optimization Laboratory of North Minzu University. Special thanks are extended to the anonymous reviewers, whose insightful comments significantly strengthened the theoretical rigor of this work.

Conflicts of Interest

The authors declare no competing interests.

References

Vandereycken, B. Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 2013, 23, 1214–1236. [Google Scholar] [CrossRef]
Journée, M.; Nesterov, Y.; Richtarik, P.; Sepulchre, R. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 2010, 11, 517–553. [Google Scholar]
Li, J.; Ma, S.; Srivastava, T. A riemannian admm. arXiv 2022, arXiv:2211.02163. [Google Scholar]
Absil, P.-A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008. [Google Scholar] [CrossRef]
Boumal, N. An Introduction to Optimization on Smooth Manifolds; Cambridge University Press: Cambridge, UK, 2023. [Google Scholar]
Ferreira, O.P.; Oliveira, P.R. Proximal point algorithm on Riemannian manifolds. Optimization 2002, 51, 257–270. [Google Scholar] [CrossRef]
Bergmann, R.; Persch, J.; Steidl, G. A parallel Douglas–Rachford algorithm for minimizing ROF-like functionals on images with values in symmetric Hadamard manifolds. SIAM J. Imaging Sci. 2016, 9, 901–937. [Google Scholar] [CrossRef]
Kovnatsky, A.; Glashoff, K.; Bronstein, M.M. MADMM: A Generic Algorithm for Non-Smooth Optimization on Manifolds. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016. [Google Scholar]
Reimherr, M.; Bharath, K.; Soto, C. Differential privacy over riemannian manifolds. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021; pp. 12292–12303. [Google Scholar]
Duruisseaux, V.; Leok, M. A variational formulation of accelerated optimization on Riemannian manifolds. SIAM J. Math. Data Sci. 2022, 4, 649–674. [Google Scholar] [CrossRef]
Han, A.; Mishra, B.; Jawanpuria, P.; Gao, J. Riemannian accelerated gradient methods via extrapolation. Proc. Mach. Learn. Res. 2023, 206, 1554–1585. [Google Scholar]
Zhang, H.; Reddi, S.J.; Sra, S. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10, December 2016. [Google Scholar]
Xu, Y.; Jin, R.; Yang, T. NEON+: Accelerated gradient methods for extracting negative curvature for non-convex optimization. arXiv 2017, arXiv:1712.01033. [Google Scholar]
Jensen, R.; Zimmermann, R. Riemannian optimization on the symplectic Stiefel manifold using second-order information. arXiv 2024, arXiv:2404.08463. [Google Scholar] [CrossRef]
Ring, W.; Wirth, B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim. 2012, 22, 596–627. [Google Scholar] [CrossRef]
Wang, Q.; Yang, W.H. An adaptive regularized proximal Newton-type methods for composite optimization over the Stiefel manifold. Comput. Optim. Appl. 2024, 89, 419–457. [Google Scholar] [CrossRef]
Frazier, P.I. A tutorial on Bayesian optimization. arXiv 2018, arXiv:1807.02811. [Google Scholar] [CrossRef]
Salimans, T.; Ho, J.; Chen, X.; Sidor, S.; Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv 2017, arXiv:1703.03864. [Google Scholar] [CrossRef]
Zhou, P.; Yuan, X.T.; Feng, J. Faster first-order methods for stochastic non-convex optimization on Riemannian manifolds. Proc. Mach. Learn. Res. 2019, 89, 138–147. [Google Scholar] [CrossRef] [PubMed]
Li, J.; Balasubramanian, K.; Ma, S. Stochastic zeroth-order Riemannian derivative estimation and optimization. Math. Oper. Res. 2023, 48, 1183–1211. [Google Scholar] [CrossRef]
Balasubramanian, K.; Ghadimi, S. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Found. Comput. Math. 2022, 22, 35–76. [Google Scholar] [CrossRef]
Wang, Q.; Yang, W.H. Proximal quasi-Newton method for composite optimization over the Stiefel manifold. J. Sci. Comput. 2023, 95, 39. [Google Scholar] [CrossRef]
Chen, S.; Ma, S.; Man-Cho So, A.; Zhang, T. Proximal gradient method for nonsmooth optimization over the Stiefel manifold. SIAM J. Optim. 2020, 30, 210–239. [Google Scholar] [CrossRef]
Li, J.; Balasubramanian, K.; Ma, S. Zeroth-order optimization on Riemannian manifolds. arXiv 2020, arXiv:2003.11238. [Google Scholar]
Jaquier, N.; Rozo, L.; Calinon, S.; Bürger, M. Bayesian optimization meets Riemannian manifolds in robot learning. Proc. Mach. Learn. Res. 2020, 100, 233–246. [Google Scholar]

Figure 5. Convergence comparison of zeroth-order algorithms on SPD manifold (3D stiffness control task).

Table 1. Sparsity comparison of three algorithms under different gradient sampling counts m.

Sampling Count m	Algorithm	Final Objective Value $F_{final}$	Sparsity (%)
5	ZO-ManPG	1.6205	5.4
5	ZO-ARPQN	2.4447	60.0
5	ARPQN	3.3191	24.2
15	ZO-ManPG	1.3465	20.0
15	ZO-ARPQN	3.8697	2.4
15	ARPQN	3.3191	24.2
30	ZO-ManPG	1.2499	53.4
30	ZO-ARPQN	3.4940	18.2
30	ARPQN	3.3191	24.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ma, Y.; Li, C.; Wang, Z.; Li, Q. Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method. Axioms 2026, 15, 203. https://doi.org/10.3390/axioms15030203

AMA Style

Ma Y, Li C, Wang Z, Li Q. Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method. Axioms. 2026; 15(3):203. https://doi.org/10.3390/axioms15030203

Chicago/Turabian Style

Ma, Yinpu, Cunlin Li, Zhichao Wang, and Qian Li. 2026. "Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method" Axioms 15, no. 3: 203. https://doi.org/10.3390/axioms15030203

APA Style

Ma, Y., Li, C., Wang, Z., & Li, Q. (2026). Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method. Axioms, 15(3), 203. https://doi.org/10.3390/axioms15030203

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method

Abstract

1. Introduction

2. Theoretical Contributions

3. Preliminaries on Riemannian Optimization

3.1. Tangent Space and Tangent Bundle

3.2. Riemannian Metric and Norm

3.3. Riemannian Gradient

4. Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Newton (ZO-ARPQN)

4.1. Zeroth-Order Riemannian Gradient and Hessian

4.2. Zeroth-Order Extension: Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Algorithm

5. Convergence Analysis

5.1. Complexity Analysis

5.2. Saddle-Point Escape Theorem

6. Application of ZO-ARPQN

6.1. Sparse Principal Component Analysis (SPCA) on the Stiefel Manifold

6.1.1. Experimental Parameter Settings

6.1.2. Result Analysis

6.1.3. Case 1: Low Sampling Count ( M = 5 ) (Figure 1)

6.1.4. Case 2: Moderate Sampling Count ( M = 15 ) (Figure 2)

6.1.5. Case 3: High Sampling Count ( M = 30 ) (Figure 3)

6.2. Black-Box Optimization in Robotic Stiffness Control (Figure 4)

6.2.1. Manifold and Tangent Space

6.2.2. Riemannian Metric

6.2.3. Riemannian Gradient and Exponential Map

6.2.4. Experimental Setup

Basic Parameters

Algorithm-Specific Parameters

6.2.5. Experimental Results and Analysis

Convergence Trend Analysis

Quantitative Performance

Practical Significance

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

6.1.3. Case 1: Low Sampling Count ( $M = 5$ ) (Figure 1)

6.1.4. Case 2: Moderate Sampling Count ( $M = 15$ ) (Figure 2)

6.1.5. Case 3: High Sampling Count ( $M = 30$ ) (Figure 3)