1. Introduction
In recent years, composite optimization problems on Riemannian manifolds have attracted widespread attention in various application fields. The objective function of such problems typically consists of the sum of a smooth function and a nonsmooth function, with the optimization variable constrained to a manifold. The general form of a Riemannian composite optimization problem is
$$\min_{X \in \mathcal{M}} F(X) := f(X) + h(X),$$
where $f: \mathcal{M} \to \mathbb{R}$ is a smooth function and $h: \mathcal{M} \to \mathbb{R}$ is a convex, but nonsmooth, function. The aforementioned Riemannian composite optimization problems have found extensive applications, such as compressed sensing [
1], sparse principal component analyses [
2], and clustering problems [
3]. For more applications of composite optimization on manifolds, readers are referred to [
4,
5,
6].
Composite optimization problems with manifold constraints have been extensively studied in recent years. Currently, popular solution methods can be roughly categorized into the following types: operator-splitting methods, proximal gradient methods, and proximal Newton methods.
Operator-Splitting Methods. Riemannian ADMM (alternating-direction method of multipliers) provides a flexible decomposition strategy, especially suitable for composite problems with special structures. It usually transforms the original problem into an equivalent form via variable splitting and then solves it through alternating updates and dual ascent. Additionally, other operator-splitting ideas, such as Douglas–Rachford splitting, have been extended to manifolds [
7]. Research on Riemannian ADMM focuses on its convergence analysis under nonconvex and nonsmooth settings [
8] and efficient implementation in specific applications [
3]. For example, in distributed or federated learning scenarios, the decomposable nature of ADMM enables the design of communication-efficient and privacy-preserving Riemannian optimization algorithms [
9]. The advantages of Riemannian ADMM lie in its high flexibility and ability to decouple variables; however, it requires the introduction of additional variables when constructing the Lagrangian objective function, which slows down its convergence rate.
First-Order Methods: Proximal Gradient Methods. This is the most fundamental and straightforward approach for solving Riemannian composite problems. The Riemannian proximal gradient method iteratively decomposes the smooth and nonsmooth terms, i.e., [
4,
5]:
$$\eta_k = \operatorname*{arg\,min}_{\eta \in T_{X_k}\mathcal{M}} \; \langle \operatorname{grad} f(X_k), \eta \rangle + \frac{1}{2t_k} \|\eta\|^2 + h\big(R_{X_k}(\eta)\big), \qquad X_{k+1} = R_{X_k}(\eta_k),$$
in which the minimization defines the Riemannian proximal operator. This algorithm first performs one step of Riemannian gradient descent on the smooth part $f$, and then corrects the nonsmooth part $h$ via the proximal operator, entirely within the manifold $\mathcal{M}$.
To accelerate convergence, researchers have proposed various accelerated Riemannian proximal gradient algorithms by leveraging Nesterov’s acceleration idea. These methods typically improve the convergence rate from $\mathcal{O}(1/k)$ to $\mathcal{O}(1/k^2)$ under geodesically convex or specific curvature conditions [
10,
11]. For large-scale data, computing the full gradient
is prohibitively expensive; thus, Riemannian stochastic proximal gradient methods and their variance-reduced variants (e.g., R-SVRG-PG [
12]) have emerged. They approximate the gradient through random sampling, significantly reducing the cost per iteration while achieving fast convergence. Considering that $f$ is nonconvex in many practical applications, recent works have analyzed the convergence of proximal gradient methods under nonconvex settings, proving that the algorithm converges to stationary points and providing convergence rate guarantees [
13]. This method is simple to implement and theoretically mature, but suffers from slow convergence and reliance on the proximal operator.
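To make the scheme concrete, the following numpy sketch runs simplified Riemannian proximal gradient iterations on the unit sphere with $h = \lambda\|\cdot\|_1$. The helper names are ours, and applying the soft-threshold in the ambient space followed by a renormalizing retraction is a simplification of the exact manifold proximal operator, not the method of any particular reference.

```python
import numpy as np

def sphere_proj(x, v):
    """Project an ambient vector v onto the tangent space of the unit sphere at x."""
    return v - np.dot(x, v) * x

def sphere_retract(x, eta):
    """Retraction on the sphere: move in the ambient space and renormalize."""
    y = x + eta
    return y / np.linalg.norm(y)

def soft_threshold(v, tau):
    """Euclidean proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_step(x, egrad_f, lam, t):
    """One simplified Riemannian proximal gradient step: a gradient step on f
    in the tangent space followed by a retraction, then an ambient
    soft-threshold for h = lam*||.||_1, retracted back onto the sphere."""
    rgrad = sphere_proj(x, egrad_f(x))       # Riemannian gradient of f at x
    y = sphere_retract(x, -t * rgrad)        # smooth (gradient) step
    z = soft_threshold(y, t * lam)           # nonsmooth correction in ambient space
    n = np.linalg.norm(z)
    return y if n == 0.0 else z / n          # retract back onto the sphere

# toy instance: f(x) = 0.5*||A x - b||^2 restricted to the unit sphere
rng = np.random.default_rng(0)
A, b = rng.standard_normal((5, 4)), rng.standard_normal(5)
egrad = lambda x: A.T @ (A @ x - b)
x = np.ones(4) / 2.0                         # unit-norm starting point
for _ in range(100):
    x = prox_grad_step(x, egrad, lam=0.05, t=0.02)
```

Each iterate remains exactly on the manifold because every update ends with a retraction.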
Second-Order Methods: Proximal Newton Methods. To achieve faster convergence than first-order methods, second-order methods incorporate curvature (Hessian) information of the smooth term
f to guide updates. The core idea of the regularized Riemannian proximal Newton method is to obtain the update direction $\eta_k$ in each tangent space $T_{X_k}\mathcal{M}$ by solving a quadratic approximation subproblem [4,5]:
$$\eta_k = \operatorname*{arg\,min}_{\eta \in T_{X_k}\mathcal{M}} \; \langle \operatorname{grad} f(X_k), \eta \rangle + \frac{1}{2} \langle B_k[\eta], \eta \rangle + h\big(R_{X_k}(\eta)\big),$$
where $B_k$ is a positive definite approximation of the Riemannian Hessian operator $\operatorname{Hess} f(X_k)$.
Standard Newton methods face two major challenges: (1) how to ensure global convergence when the Hessian is non-positive definite, and (2) how to handle the enormous computational, storage, and inversion costs of the Hessian matrix. Recent research on quasi-Newton methods and adaptive regularization has focused on addressing these two issues. These are key techniques to address the computational bottleneck of second-order methods. Riemannian quasi-Newton methods (e.g., Riemannian BFGS or L-BFGS) no longer directly compute the true Hessian matrix; instead, they construct an increasingly accurate Hessian approximation matrix
using first-order (gradient) information collected during iterations [
14,
15]. A representative method is the adaptive regularized proximal quasi-Newton (ARPQN) method [
16], which solves a subproblem of the following form at each step:
$$\min_{\eta \in T_{X_k}\mathcal{M}} \; \langle \operatorname{grad} f(X_k), \eta \rangle + \frac{1}{2} \langle B_k[\eta], \eta \rangle + \frac{\sigma_k}{2} \|\eta\|^2 + h\big(R_{X_k}(\eta)\big).$$
The core advantages of this method are its Hessian approximation and adaptive regularization, which allow the algorithm to flexibly balance gradient and curvature information, thereby achieving faster convergence in practice.
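In Euclidean notation, the quasi-Newton construction mentioned above can be sketched as follows; on a manifold, the step s and gradient difference y would additionally be transported into a common tangent space before the update. The curvature threshold below is illustrative.

```python
import numpy as np

def bfgs_update(B, s, y, eps=1e-10):
    """Standard BFGS update of a Hessian approximation B from the step s and
    gradient difference y; the update is skipped (a common safeguard) when
    the curvature condition s^T y > 0 fails, which keeps B positive definite."""
    sy = float(s @ y)
    if sy <= eps:
        return B                              # skip: curvature condition violated
    Bs = B @ s
    return B - np.outer(Bs, Bs) / float(s @ Bs) + np.outer(y, y) / sy

# sanity check on a quadratic f(x) = 0.5 x^T H x, where y = H s exactly
H = np.array([[3.0, 1.0], [1.0, 2.0]])
B = np.eye(2)
s = np.array([1.0, -0.5])
B = bfgs_update(B, s, H @ s)                  # one update with exact curvature pair
```

The secant equation $B_{k+1}s = y$ holds exactly after each accepted update, which is the property quasi-Newton methods exploit.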
It should be noted that the aforementioned methods all assume that the gradient and Hessian matrix of the objective function have analytical forms. However, many optimization problems in practical applications face the following challenges, rendering traditional gradient-based (first-order or second-order) methods inapplicable:
- 1.
Unavailable Gradient: In many practical problems, the objective function
is a black box: one can only query its value without access to analytical gradients or automatic differentiation. This often occurs in complex physical simulations such as fluid dynamics or structural mechanics, where each function evaluation is computationally expensive and differentiation is infeasible. Traditional approaches like Bayesian optimization mitigate the cost through surrogate models [
17], but become inefficient as the problem dimension increases. In robotic stiffness control, the objective function describing the robot–environment interaction has no closed-form gradient and can only be evaluated through high-fidelity simulations or experiments. The mapping from stiffness parameters (an SPD matrix) to performance scores thus forms a typical black-box optimization problem.
- 2.
Prohibitive Gradient Computation Cost: Even if the gradient exists in theory, the cost of computing it may be unacceptably high. In large-scale reinforcement-learning scenarios, the reward function of an agent may depend on an extremely long sequence of decisions. Computing the gradient of policy parameters through backpropagation may consume enormous computational resources and memory. Directly evaluating the performance of the policy (zeroth-order information) may be much cheaper. Zeroth-order methods such as evolutionary strategies have been proven to be scalable alternatives for reinforcement learning [
18].
Based on the above motivations, zeroth-order methods can be applied to any optimization problem where function evaluations are accessible, regardless of how complex its internal structure is or whether it is differentiable. This greatly expands the application boundary of optimization algorithms, enabling them to solve black-box problems that traditional methods cannot address.
First, we clarify the definition: Zeroth-order Riemannian methods, also known as derivative-free Riemannian optimization methods, refer to a class of methods that solve optimization problems on manifolds using only function evaluation values (zeroth-order information), without using explicit gradients (first-order information) or Hessian matrices (second-order information) of the objective function. By abandoning reliance on accurate gradient information, zeroth-order Riemannian methods gain the ability to handle complex problems such as black-box, noisy, and high-cost scenarios. They sacrifice a certain degree of convergence efficiency (compared to gradient-based methods) to significantly broaden the application scope of optimization algorithms on manifolds.
2. Theoretical Contributions
This work proposes the zeroth-order Riemannian adaptive quasi-Newton method (ZO-ARPQN) for composite optimization on Stiefel manifolds, establishing three fundamental advances:
Zeroth-Order Extension of Proximal Quasi-Newton on Riemannian Manifolds. We generalize the adaptive regularized proximal quasi-Newton (ARPQN) framework [
16] to the zeroth-order setting over Riemannian manifolds such as the Stiefel and SPD manifolds. By constructing randomized finite-difference estimators that approximate both the Riemannian gradient and curvature information, ZO-ARPQN achieves first-order accuracy using only function evaluations, enabling black-box optimization in manifold-constrained problems.
Global Convergence Guarantees under Noisy Gradient Estimates. We establish global convergence to a stationary point under mild smoothness and regularization assumptions, even when the gradient and Hessian information are approximated by random perturbations. Theoretical bounds are derived on the bias and variance of the zeroth-order estimators, showing that ZO-ARPQN maintains the same global convergence as its first-order counterpart.
Saddle-Point Escaping. By incorporating curvature-aware regularization and adaptive random perturbation in the proximal quasi-Newton step, we prove that ZO-ARPQN can escape strict saddle points with a high probability, ensuring convergence to a stationary point. This result extends existing saddle-point escape theory to the manifold and zeroth-order settings.
3. Preliminaries on Riemannian Optimization
The core idea of Riemannian optimization is to extend classical optimization methods from Euclidean space to non-Euclidean spaces, i.e., Riemannian manifolds. Compared with the Euclidean case, manifold optimization requires constraining the geometric structure of iterates, such that search directions, gradients, and update operators are all defined within the framework of tangent spaces. This requires us to introduce tools such as tangent spaces, Riemannian metrics, retractions, and Riemannian Hessians.
3.1. Tangent Space and Tangent Bundle
Let
be a manifold. The tangent space at a point
is denoted as
, which consists of all tangent vectors at
X. The tangent bundle
represents the collection of all tangent spaces of
.
3.2. Riemannian Metric and Norm
If the tangent spaces are endowed with an inner product
that varies smoothly with the point
X, then
is called a Riemannian manifold. The induced norm is
For simplicity, we denote
and
if there is no ambiguity.
3.3. Riemannian Gradient
In [
19], if
is a smooth function, its gradient
at
is defined as the unique tangent vector satisfying
where
denotes the directional derivative of
f at
X along
. In this paper, we adopt the Riemannian metric induced by the Euclidean inner product, i.e., $\langle \xi, \eta \rangle_X = \operatorname{tr}(\xi^{\top} \eta)$ for $\xi, \eta \in T_X\mathcal{M}$. Based on this Riemannian metric, the Riemannian gradient of a function is defined as the projection of its Euclidean gradient onto the tangent space:
$$\operatorname{grad} f(X) = P_{T_X\mathcal{M}}\big(\nabla f(X)\big).$$
Definition 1 (Retraction [
4])
. A retraction on a manifold $\mathcal{M}$ is a smooth mapping $R: T\mathcal{M} \to \mathcal{M}$ satisfying the following properties. Let $R_X$ denote the restriction of $R$ to $T_X\mathcal{M}$: (1) $R_X(0_X) = X$, where $0_X$ is the zero element of $T_X\mathcal{M}$; (2) under the canonical isomorphism $T_{0_X}(T_X\mathcal{M}) \simeq T_X\mathcal{M}$, we have $\mathrm{d}R_X(0_X) = \operatorname{id}_{T_X\mathcal{M}}$, where $\mathrm{d}R_X(0_X)$ denotes the differential of $R_X$ at $0_X$, and $\operatorname{id}_{T_X\mathcal{M}}$ denotes the identity mapping on $T_X\mathcal{M}$. The following proposition serves as a key bridge connecting retractions and optimization analyses. It endows complex manifold optimization algorithms with strict mathematical certainty. Inequality (
6) guarantees the boundedness and controllability of the retraction operation. It ensures that a tangent space displacement
of a finite magnitude does not result in an infinitely distant move on the manifold, which is a prerequisite for any subsequent convergence analysis.
Proposition 1 ([
19])
. If $\mathcal{M}$ is a compact embedded submanifold of a Euclidean space and $R$ is a retraction, then there exist constants $M_1, M_2 > 0$ such that, for all $X \in \mathcal{M}$ and $\eta \in T_X\mathcal{M}$,
$$\|R_X(\eta) - X\| \le M_1 \|\eta\|, \qquad \|R_X(\eta) - X - \eta\| \le M_2 \|\eta\|^2.$$
The following assumption is the most natural and practical way to extend the well-known L-smooth (i.e., the function has a Lipschitz-continuous gradient) concept from Euclidean space to Riemannian manifolds.
Assumption 1 (L-retraction smoothness [
20])
. There exists a constant $L > 0$ such that, for the objective function $f$ of Problem (1), the following inequality holds:
$$f\big(R_X(\eta)\big) \le f(X) + \langle \operatorname{grad} f(X), \eta \rangle + \frac{L}{2}\|\eta\|^2, \qquad \forall\, X \in \mathcal{M},\; \eta \in T_X\mathcal{M}.$$
This assumption ensures that the behavior of $f$ along the retraction path is well-posed: as long as $\|\eta\|$ is sufficiently small, predicting the function value at the next point using the gradient information at the current point is relatively accurate, so the gradient-based linear approximation is locally effective. Thus, convergence analyses from Euclidean space (gradient descent, trust region, Newton methods) can be extended to manifolds.
4. Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Method (ZO-ARPQN)
4.1. Zeroth-Order Riemannian Gradient and Hessian
In existing manifold optimization methods, gradients or their tangent space projections are assumed to be directly available. However, in many real-world problems such as black-box optimization, noisy or nonsmooth functions, and high-dimensional models, explicit gradient computation is either infeasible or too costly. Zeroth-order optimization offers an effective alternative by relying only on function evaluations.
While classical Euclidean zeroth-order estimators (e.g., Gaussian smoothing or coordinate finite differences) can approximate gradients accurately, they require many function calls and do not generalize well to manifolds. To overcome this, we employ a random direction difference method that samples one or several random tangent directions and uses one-sided finite differences to estimate gradients. This approach makes the function query cost independent of the dimension, achieves good scalability to high-dimensional Riemannian problems, and balances theoretical soundness with computational efficiency.
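As an illustration, the following sketch implements this random-direction estimator on the Stiefel manifold, using the standard tangent projection and a QR retraction; the function names, smoothing parameter, and batch size are our own choices.

```python
import numpy as np

def stiefel_proj(X, xi):
    """Project an ambient matrix xi onto the tangent space of St(n, p) at X."""
    XtXi = X.T @ xi
    return xi - X @ ((XtXi + XtXi.T) / 2.0)

def qr_retract(X, eta):
    """QR-based retraction mapping the tangent vector eta back onto St(n, p)."""
    Q, R = np.linalg.qr(X + eta)
    return Q * np.sign(np.sign(np.diag(R)) + 0.5)   # fix column signs for uniqueness

def zo_riemannian_grad(f, X, mu=1e-5, batch=2000, rng=None):
    """One-sided finite-difference gradient estimate, averaged over `batch`
    random tangent directions (projected ambient Gaussians)."""
    rng = rng or np.random.default_rng()
    g = np.zeros_like(X)
    fX = f(X)
    for _ in range(batch):
        U = stiefel_proj(X, rng.standard_normal(X.shape))  # random tangent direction
        g += (f(qr_retract(X, mu * U)) - fX) / mu * U
    return g / batch

# toy check: f(X) = 0.5*||X - A||_F^2 has Riemannian gradient proj_X(X - A)
n, p = 6, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((n, p))
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
f = lambda Y: 0.5 * np.linalg.norm(Y - A) ** 2
g_hat = zo_riemannian_grad(f, X, rng=rng)
g_true = stiefel_proj(X, X - A)
```

On this toy quadratic, the averaged estimate aligns closely with the true Riemannian gradient $P_X(X - A)$ once the batch is moderately large.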
Definition 2 (Stochastic Difference Zeroth-Order Riemannian Gradient [
20])
. Generate $u = P\xi$, where $\xi \sim \mathcal{N}(0, I)$, and $P$ is an orthogonal projection matrix that projects vectors onto the tangent space $T_X\mathcal{M}$. Thus, $u$ follows the distribution $\mathcal{N}(0, P)$, i.e., the standard normal distribution on the tangent space; all eigenvalues of $P$ are either 0 (corresponding to eigenvectors orthogonal to the tangent space) or 1 (corresponding to eigenvectors lying in the tangent space). The zeroth-order Riemannian gradient is defined as:
$$g_\mu(X) = \frac{f\big(R_X(\mu u)\big) - f(X)}{\mu}\, u.$$
It should be noted that the projection matrix
P is easy to compute on common manifolds. For example, for the Stiefel manifold $\mathrm{St}(n, p) = \{X \in \mathbb{R}^{n \times p} : X^{\top}X = I_p\}$, the projection can be written as
$$P_X(\xi) = \xi - X \operatorname{sym}(X^{\top}\xi),$$
where $\operatorname{sym}(A) = (A + A^{\top})/2$. In the stochastic case, performing multiple samplings in each iteration can improve the convergence speed:
$$\bar{g}_\mu(X) = \frac{1}{b} \sum_{i=1}^{b} \frac{f\big(R_X(\mu u_i)\big) - f(X)}{\mu}\, u_i,$$
where each $u_i$ is a standard normal random vector on the tangent space $T_X\mathcal{M}$. We also have:
Lemma 1 (Upper bound on the error of the averaged zeroth-order gradient [
20])
. For the zeroth-order Riemannian gradient estimation with multiple samplings, we have the following error bound, where the expectation is taken over the Gaussian vectors and ξ. Definition 3 (Zeroth-Order Riemannian Hessian [
20])
. The zeroth-order Riemannian Hessian estimation of function f at point x is defined as:
$$H_\mu(x) = \frac{f\big(R_x(\mu u)\big) + f\big(R_x(-\mu u)\big) - 2 f(x)}{2\mu^2}\,\big(u u^{\top} - P\big), \qquad u = P\xi,\; \xi \sim \mathcal{N}(0, I).$$
It should be noted that the Riemannian Hessian here is actually the projected Hessian estimation of the pullback function $f \circ R_x$ on the tangent space $T_x\mathcal{M}$. Additionally, multi-sampling is adopted: for $i = 1, \ldots, b$, each Hessian sample is defined as above with an independent direction $u_i$, and the samples are averaged.
Assumption 2 (Lipschitz Hessian assumption)
. For any $x \in \mathcal{M}$ and $\eta \in T_x\mathcal{M}$, we have
$$\big\| \mathcal{P}_\eta^{-1} \circ \operatorname{Hess} f\big(R_x(\eta)\big) \circ \mathcal{P}_\eta - \operatorname{Hess} f(x) \big\|_{op} \le L_H \|\eta\|,$$
which holds almost everywhere, where $\mathcal{P}_\eta$ denotes the parallel transport along $\eta$ and $\circ$ denotes function composition. Here, $\|\cdot\|_{op}$ denotes the operator norm. This assumption constrains the rate of change in the Hessian on the manifold. Assumption 2 is the Riemannian counterpart of the Lipschitz Hessian assumption in Euclidean space, whose equivalent conditions include (see [
13]):
In the Euclidean case, the parallel transport degenerates to the identity mapping. In this section, we also assume that $f$ satisfies Assumption 1, and we introduce the following common assumption, which is often used in zeroth-order stochastic optimization.
Assumption 3 (Bounded Variance Assumption [
19,
21])
. Let the zeroth-order estimators be as defined above; then, their variances are uniformly bounded for all $x \in \mathcal{M}$. In particular, the error between the approximate Hessian matrix constructed using the zeroth-order method and the true Riemannian Hessian is bounded in expectation. By choosing a sufficiently large number of samples b and a sufficiently small step size, the approximation error can be made arbitrarily small, thereby ensuring the effectiveness and controllability of the zeroth-order Hessian estimation.
Lemma 2 ([
20])
. Let be computed according to Formula (15) and be computed according to Formula (13). Then, for any and , we have the stated error bound. This lemma provides a theoretical basis for the zeroth-order quasi-Newton method: even without an explicit Hessian, a high-quality Hessian approximation can be constructed using only function values; it also provides an error upper bound for subsequent convergence analyses, ensuring the theoretical feasibility of the method.
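To make the above error bounds concrete, here is a Euclidean sketch of a Stein-type second-difference Hessian estimator; on the manifold, the sampled directions would additionally be projected onto the tangent space as in Definition 3. The batch size and test function are illustrative.

```python
import numpy as np

def zo_hessian(f, x, mu=1e-3, batch=20000, rng=None):
    """Stein-type zeroth-order Hessian estimate: average, over random Gaussian
    directions u, of the second difference of f times (u u^T - I)."""
    rng = rng or np.random.default_rng()
    d = x.size
    H = np.zeros((d, d))
    fx = f(x)
    for _ in range(batch):
        u = rng.standard_normal(d)
        second_diff = (f(x + mu * u) + f(x - mu * u) - 2.0 * fx) / (2.0 * mu ** 2)
        H += second_diff * (np.outer(u, u) - np.eye(d))
    H /= batch
    return (H + H.T) / 2.0                   # symmetrize the estimate

# sanity check on a quadratic, whose Hessian the estimator should recover
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda z: 0.5 * float(z @ A @ z)
H_hat = zo_hessian(f, np.array([0.3, -0.7]), rng=np.random.default_rng(0))
```

For a quadratic, the second difference equals $\tfrac{1}{2}u^{\top}Au$ exactly, so the remaining error is the Monte Carlo averaging error, mirroring the bias/variance split in the lemmas above.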
4.2. Zeroth-Order Extension: Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Algorithm
To address the scenario where first-order or second-order derivative information cannot be directly obtained, based on the above groundwork, we propose a zeroth-order information-based adaptive Riemannian optimization algorithm. Our method utilizes the aforementioned zeroth-order gradient and Hessian estimators to construct the following regularized quadratic approximation model in each iteration:
$$m_k(\eta) = \langle \bar{g}_k, \eta \rangle + \frac{1}{2} \langle \bar{H}_k[\eta], \eta \rangle + \frac{\sigma_k}{2} \|\eta\|^2 + h\big(R_{X_k}(\eta)\big),$$
where $\bar{g}_k$ denotes the multi-sampling mean of the zeroth-order Riemannian gradient, $\bar{H}_k$ is the mean of the zeroth-order Riemannian Hessian estimates, and $\sigma_k$ is the adaptive regularization parameter. By minimizing Subproblem (
21) on the tangent space
, we can obtain a reasonable descent direction using only function values (zeroth-order information). Furthermore, the adaptive adjustment mechanism of the regularization parameter
ensures the stability and convergence of the algorithm under nonconvex conditions, thereby enabling the method to exhibit an excellent theoretical and practical performance in zeroth-order Riemannian optimization problems.
The greatest advantage of the zeroth-order extended adaptive Riemannian proximal quasi-Newton method is that it breaks the dependence on explicit gradients and Hessians, but can still closely approximate the second-order curvature information in black-box, non-differentiable, or even noisy environments, maintaining the efficiency and stability of the algorithm. This significantly expands the application scope of traditional Riemannian proximal Newton methods [
16,
22]. The complete optimization process is shown in Algorithm 1 (ZO-ARPQN). Below, we briefly describe the main steps of the ZO-ARPQN algorithm.
Algorithm 1 ZO-ARPQN: Zeroth-Order Adaptive Regularized Proximal Quasi-Newton Algorithm
Require: Initial point; initial regularization parameter; line search parameters; threshold parameters; sample batch sizes; scaling factors
1: for each outer iteration k do
2:   Generate the zeroth-order gradient and zeroth-order Hessian based on the sample sizes
3:   while true do
4:     Solve the subproblem to obtain the search direction
5:     Set the initial step size
6:     while Condition (22) is not satisfied do
7:       Shrink the step size by the backtracking factor
8:     end while
9:     Form the tentative iterate via the retraction
10:    Compute the model-consistency ratio
11:    if the step is accepted then
12:      if the model prediction is very accurate then
13:        Decrease the regularization parameter
14:      end if
15:      Break
16:    else
17:      Increase the regularization parameter
18:    end if
19:  end while
20:  Update the iterate and the regularization parameter
21: end for
Step size selection strategy. In each iteration, the algorithm first obtains the search direction
by solving the subproblem. Then, a non-monotone line search with the Armijo condition is used to determine the appropriate step size
. Specifically, it is defined as:
where
is the smallest integer satisfying the following condition:
The scaling factor ensures that the step size decreases gradually during backtracking, and a line search parameter controls the required decrease. The non-monotone condition allows the function value at the current iterate to not strictly decrease; instead, it is compared with the maximum value among the most recent m iterations, thereby improving the algorithm’s robustness on complex nonconvex problems.
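A minimal sketch of this non-monotone backtracking rule (parameter names and defaults are ours, standing in for the quantities in Condition (22)):

```python
import numpy as np

def nonmonotone_armijo(F, x, d, pred_decrease, history, beta=0.5, sigma=1e-4,
                       alpha0=1.0, max_backtracks=30):
    """Backtracking line search with a non-monotone Armijo condition: accept a
    step once F(x + alpha*d) <= max(recent F values) - sigma*alpha*pred_decrease.
    `history` holds the last m objective values; `pred_decrease` > 0 is an
    illustrative stand-in for the model-predicted decrease."""
    F_ref = max(history)                     # non-monotone reference value
    alpha = alpha0
    for _ in range(max_backtracks):
        if F(x + alpha * d) <= F_ref - sigma * alpha * pred_decrease:
            return alpha
        alpha *= beta                        # shrink the step geometrically
    return alpha

# toy usage on F(x) = ||x||^2 with the steepest descent direction
F = lambda x: float(x @ x)
x = np.array([1.0, 2.0])
d = -2.0 * x                                 # d = -grad F(x)
alpha = nonmonotone_armijo(F, x, d, pred_decrease=float(d @ d), history=[F(x)])
```

Because the reference value is the maximum over recent objective values, a step can be accepted even when it temporarily increases the objective relative to the immediately preceding iterate.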
Model consistency measure (ratio definition). Let
This serves as the reference benchmark for the function value in the current iteration. To measure the degree of matching between the constructed quadratic approximation model and the true function decrease, we introduce the ratio:
To simplify the notation, we introduce
The ratio
characterizes the consistency between the predicted and actual decrease, where the numerator is the actual function decrease under step size
and the denominator is the predicted decrease in the quadratic approximation model.
Update strategy for regularization parameter . To ensure the stability and convergence of the algorithm, the regularization parameter is dynamically adjusted based on the ratio :
- 1.
If
, it indicates an inaccurate prediction or an excessively large step size, and the model is too optimistic:
enhancing the regularization strength to make the model more conservative.
- 2.
If , it indicates an acceptable decrease performance.
- 3.
If
, it indicates an extremely accurate model prediction and an ideal decrease performance:
weakening the regularization.
- 4.
If
, keep the regularization unchanged:
This corresponds to the scenario where the model prediction is roughly consistent with the true decrease. If is reduced hastily at this point, the model may become overly optimistic, leading to instability in the next iteration; if is increased hastily, the model may become overly conservative, sacrificing convergence speed. Therefore, the most reasonable choice is to keep the regularization level unchanged. This design ensures that the algorithm balances convergence and efficiency in long-term operation, without affecting stability due to frequent adjustments of .
Finally, the new iterate is given by:
This algorithm integrates mechanisms from three aspects: (1) a non-monotone line search, which improves the algorithm’s robustness in nonconvex scenarios; (2) the model consistency ratio, which dynamically evaluates the difference between the predicted and true decreases; and (3) an adaptive regularization update, which balances convergence and efficiency by adjusting .
Thus, the ZO-ARPQN algorithm can efficiently simulate the behavior of second-order methods using only function values, and it balances the convergence stability and computational efficiency both theoretically and practically.
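The ratio-based update of the regularization parameter can be sketched as follows; the thresholds, scaling factors, and safeguards are illustrative rather than the paper's exact constants.

```python
def update_regularization(sigma, rho, eta1=0.25, eta2=0.75,
                          gamma_inc=2.0, gamma_dec=0.5,
                          sigma_min=1e-8, sigma_max=1e8):
    """Trust-region-style update of the regularization parameter sigma driven
    by the model-consistency ratio rho:
      rho <  eta1          -> model too optimistic: enlarge sigma;
      rho >  eta2          -> model very accurate: shrink sigma;
      eta1 <= rho <= eta2  -> acceptable match: keep sigma unchanged."""
    if rho < eta1:
        sigma = min(sigma * gamma_inc, sigma_max)   # regularize more strongly
    elif rho > eta2:
        sigma = max(sigma * gamma_dec, sigma_min)   # relax the regularization
    return sigma
```

Clamping to $[\sigma_{\min}, \sigma_{\max}]$ is a common safeguard that keeps the subproblem well-conditioned without affecting the asymptotic analysis.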
5. Convergence Analysis
In this subsection, we prove the global convergence of Algorithm 1. First, we present some standard assumptions that will be used in subsequent analyses.
Assumption 4. Let be the iterate sequence generated by Algorithm 1. Then, (1) is a continuously differentiable function, and its gradient is Lipschitz continuous with constant L; (2) is a convex, but nonsmooth, function, and h is Lipschitz continuous with constant ; (3) there exist constants such that, for all , and (4) the optimal solution of Subproblem (4) satisfies for all . Remark 1. In Assumption 4, the first two items are standard assumptions in the convergence analysis of composite optimization. From Assumption 4, the objective function of Subproblem (21) is strongly convex, and thus has a unique solution, denoted as . According to the first-order optimality condition of the original problem (1), we have the stationarity characterization (26). Therefore, is equivalent to satisfying (26), i.e., is a stationary point of Problem (1). If , similar to the proof of Lemma 5.1 in [23], it can be shown that brings a sufficient decrease in . For completeness, we provide the proof below. Lemma 3. Given iterate , let be the objective function in (21). Then, for any , we have: Proof. Since
is
-strongly convex, we have
In particular, if
are feasible solutions (i.e.,
), then:
From the optimality condition,
. Let
and substitute
into (
30); then, we obtain:
According to the definitions of
and
, further expansion gives:
Additionally, using the convexity of
h, we have:
According to
, expand
by definition, and combine the above inequalities:
This completes the proof. □
Regarding the boundedness of
: An important part of the convergence analysis of regularized Newton-type methods is to prove the boundedness of the sequence
. We first present a preparatory lemma. According to the procedure of Algorithm 1, if
, then
is scaled up by a factor of
. We aim to prove that, when
is sufficiently large,
always holds. Thus, the inner loop (Steps 3–15) of Algorithm 1 will terminate in finite steps. Since the manifold
is compact, we can define
Recall that parameters
and
are given by (
6) and (
7), respectively. Our goal is to provide a sufficient condition that ensures
.
Lemma 4 (Sufficient condition for ). Under the assumptions that f is L-smooth (Assumption 4) and h is -Lipschitz, define the constants as follows. Let be the solution of the subproblem , take step size , denote , and define the associated quantities. If there exists such that the stated bound (corresponding to parameter (25)) holds, and the regularization parameter and trust region radius satisfy the given conditions, then we have the conclusion. Proof. According to the strong convexity of
f,
Since
h is
-Lipschitz in the ambient space, we have the following second-order retraction error:
From the operator norm error of the Hessian
From Lemma 1, using Jensen’s inequality to extract the square root
There exists
such that
Thus, for any
, there is a “strongly convex” lower bound for the model
From (
34)–(
36), for any
,
, we have
Add and subtract
on the right-hand side, then add and subtract
and
, and use
to obtain
Take the expectation over
and apply the Cauchy–Schwarz inequality and (
38) and (
39); then, we have
Take the expectation over
and apply the Cauchy–Schwarz inequality and (
38) and (
39):
Let
be the solution of the subproblem
, and define
From (
32) and (
45), we have
To ensure
, it is sufficient to set the last term ≤1
and solve for
to obtain the equivalent condition
Additionally,
; thus,
. Introduce the trust region radius
; thus, it is sufficient to have
to ensure
. □
Lemma 5 (Boundedness of the Regularization Parameter
)
. Assume that the solution to the subproblem satisfies a uniform radius upper bound (attainable via the trust region radius or proximal regularization term). Let the bound be defined as follows. Then, for all , the following holds. Proof. From Equation (
49), we know that, if
, then
always holds. We prove the conclusion by mathematical induction:
Base Case (): It is trivially true that .
Inductive Hypothesis: Assume that the inequality holds for , i.e., .
Inductive Step (): Consider two cases: (1) If : By Lemma 4, fails to satisfy the acceptance condition in Step 10 of Algorithm 1. According to Step 15 of the algorithm, is scaled up by a factor of , so . (2) If : by Lemma 4, . According to Steps 10–13 of the algorithm, is either unchanged or scaled down by (i.e., ). Thus, (by the inductive hypothesis).
By combining both cases, we have . By the principle of mathematical induction, the inequality holds for all . □
Theorem 1 (Convergence to Stationary Points)
. Define the constant as follows. When, for all , the stated step-size condition holds (where is the backtracking scaling factor in Step 7 of the algorithm), the backtracking line search (Steps 6–8 of the algorithm) terminates in finite steps. Furthermore, we have the following, and any accumulation point of the iterate sequence is a first-order stationary point of the problem (i.e., ). Proof. Substitute Equation (
54) into Equation (
47). When
, the following inequality holds:
This implies that , so the step size is accepted. For the backtracking line search, we start with and scale it down by iteratively. Since will eventually be reduced to (or will be increased to ), the line search terminates in finite steps, and .
From Equation (
32) and the lower bound of
, we derive:
where
. By summing both sides over
k from 0 to
∞, the left-hand side becomes
. Since
is non-increasing and bounded below (as
F is bounded from below), the sum converges:
Because
, the terms of the series
must tend to zero. Thus:
To prove that accumulation points are stationary, let
be an accumulation point of
, and let
be a subsequence converging to
. From
, we have
(by the continuity of the norm). Substitute
into the first-order optimality condition of the subproblem:
By taking the limit as
and using the continuity of
, error, and
, we obtain the first-order optimality condition of the original problem:
This confirms that
is a stationary point of Problem (
1). □
5.1. Complexity Analysis
When analyzing the overall complexity of stochastic zeroth-order Riemannian optimization algorithms, we must consider both the computational cost of the outer iterations (Steps 1–16 of the algorithm) and the inner iterations (Steps 3–15 of the algorithm). The former characterizes the number of main iterations required to converge to an ε-stationary point, while the latter reflects the number of calls to the approximate subproblem solver on the manifold (ASSN) within each outer iteration. Due to the non-monotone line search and trust region/regularization update mechanisms, the complexities of the outer and inner iterations are tightly coupled.
In this proof, the acceptance criterion is defined as:
Definition 4 (ε-Stationary Point). If the optimal solution of the subproblem in a certain outer iteration satisfies , then is called an ε-stationary point.
Theorem 2 (Outer and Inner Iteration Complexity)
. Let be the optimal value of Problem (1), and be the constant in the sufficient decrease criterion. Define the constants below. The algorithm finds an ε-stationary point within, at most, outer iterations. Furthermore, the total number of inner ASSN calls satisfies the stated bound. Proof. Step 1: Upper Bound on Outer Iterations. From Equation (
55) and
, we have:
Assume the
K-th outer iteration yields
. By summing both sides over
k from 0 to
, the left-hand side telescopes to
. Since
, we get:
Rearranging for
K gives:
Taking the ceiling function confirms that an
-stationary point is found within
outer iterations.
Step 2: Upper Bound on Inner Iterations By the boundedness of
(Lemma 5):
Suppose the inner loop calls ASSN
times in the
i-th outer iteration. By the update rule of
, we have
. Since
, the following inequality holds:
Summing both sides from
to
(where
) results in the following:
The sum of logarithms simplifies to the logarithm of the product (telescoping sum):
By substituting Equation (
63) (i.e.,
) and simplifying the logarithmic term, we obtain the following:
This completes the proof. □
5.2. Saddle-Point Escape Theorem
In non-convex optimization problems, the presence of saddle points severely degrades the convergence efficiency of algorithms. Unlike local minima, saddle points have zero gradients while their Hessian matrices have negative eigenvalues. This causes first-order methods (e.g., stochastic gradient descent) to easily stagnate near saddle points, leading to prolonged slow convergence or even halting. This phenomenon is more prominent in high-dimensional manifold optimization, where the number of saddle points far exceeds that of local minima.
In recent years, second-order methods (e.g., Hessian-based algorithms) have been proven to escape saddle points more effectively. However, in the zeroth-order setting, we cannot directly access accurate gradient or Hessian information, and can only rely on function value approximations. This makes the identification and escape of saddle points more challenging. Traditional zeroth-order methods often require excessive sampling and computation, resulting in a high complexity that hinders practical application. The proposed method in this paper achieves saddle-point escape using only function value queries, without the need for explicit first-order or second-order information.
Lemma 6. Let be the zeroth-order gradient estimate and be the mini-batch average of Hessian estimates. The following control inequality holds, where and correspond to the zeroth-order gradient estimation error and the Hessian mini-batch estimation error, respectively. Proof. Expand using the triangle inequality:
Step 1: Bound on
. Apply the Frobenius norm and Cauchy–Schwarz inequality:
Step 2: Apply Young’s Inequality. For the product term, Young’s inequality gives:
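The step above invokes Young's inequality; a standard weighted form, ab ≤ a²/(2ε) + εb²/2 for a, b ≥ 0 and ε > 0, can be sanity-checked over a grid of values (the grid is arbitrary, and whether the paper uses exactly this weighted form is not shown in the extracted text):

```python
import itertools

# Weighted Young's inequality: a*b <= a**2/(2*eps) + eps*b**2/2
# for nonnegative a, b and eps > 0. It follows from (a/sqrt(eps) - sqrt(eps)*b)**2 >= 0.
values = [0.0, 0.1, 0.5, 1.0, 3.0, 10.0]
epsilons = [0.01, 0.5, 1.0, 2.0, 100.0]
for a, b, eps in itertools.product(values, values, epsilons):
    assert a * b <= a**2 / (2 * eps) + eps * b**2 / 2 + 1e-12
```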
Step 3: Bound on
. Note that
is the average of
b independent and identically distributed (i.i.d.) terms
. By Jensen’s inequality and Cauchy–Schwarz, we have:
Using the fourth-moment bound from Equation (20), we define:
Step 4: Combine Results. Substitute Equation (67) and the gradient estimation error bound from Lemma 1 into the triangle inequality. Taking the expectation and simplifying gives:
This completes the proof. □
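Step 3 of the proof rests on a standard fact: the average of b i.i.d. estimates has variance reduced by a factor of b. A quick Monte Carlo check with synthetic unit-variance "estimates" (not the actual Hessian estimators):

```python
import numpy as np

rng = np.random.default_rng(42)
b, trials = 16, 200_000

# Each row holds b i.i.d. unit-variance "estimates"; averaging a
# mini-batch of b of them should cut the variance by a factor of b.
samples = rng.standard_normal((trials, b))
batch_means = samples.mean(axis=1)

# Empirical variance of the batch means, rescaled by b, recovers ~1.
assert abs(batch_means.var() * b - 1.0) < 0.05
```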
Lemma 7. Let and . The following holds:
where , , and correspond to the zeroth-order gradient estimation error, zeroth-order Hessian estimation error, and mini-batch averaging error, respectively.
Proof. Since parallel transport
is isometric, we have:
Add and subtract
to decompose the term into four components:
Taking the norm and applying the triangle inequality yields the following:
(i) Bound on the Taylor Remainder. By the equivalent condition of Assumption 2 (Lipschitz Hessian, Equation (17)),
(ii) Bound on Gradient Estimation Error. From Lemma 1, the expectation of the gradient estimation error satisfies .
(iii) Bound on Hessian Estimation Error. Using the operator norm and Young’s inequality:
Taking the expectation and using Lemma 2, we get:
(iv) Bound on the Subproblem Main Term. By Lemma 6, we have .
(v) Combine All Bounds. Substitute (i)–(iv) into the triangle inequality and simplify using algebraic manipulations:
This completes the proof. □
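Lemma 7 opens by using that parallel transport is a linear isometry; in coordinates an isometry acts as an orthogonal matrix, which preserves norms and, under conjugation, eigenvalues (the latter fact is what Lemma 8 uses next). A small numeric sketch with a random orthogonal map standing in for the transport:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Random orthogonal matrix Q (stand-in for a parallel transport map).
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))

v = rng.standard_normal(n)
H = rng.standard_normal((n, n))
H = (H + H.T) / 2                      # symmetric stand-in for a Hessian

# Isometry: norms are preserved.
assert np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v))

# Conjugation by an isometry preserves the (sorted) eigenvalues.
assert np.allclose(np.linalg.eigvalsh(Q.T @ H @ Q), np.linalg.eigvalsh(H))
```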
Lemma 8 (Relationship Between Step Size and Minimum Eigenvalue). Let , and be the isometric parallel transport along . Let be such that . Then, the following holds:
Proof. Step 1: Invariance of Minimum Eigenvalue Under Parallel Transport. Since parallel transport is isometric, the minimum eigenvalue of the Hessian is invariant:
Let
. Then:
Step 2: Apply Weyl’s Inequality. Weyl’s inequality states that, for symmetric matrices $A$ and $B$, $\lambda_{\min}(A+B) \ge \lambda_{\min}(A) + \lambda_{\min}(B)$. For any symmetric matrix $A$, the minimum eigenvalue satisfies $\lambda_{\min}(A) = \min_{\|v\|=1} v^{\top} A v$ (by the Rayleigh quotient characterization). Thus:
By Assumption 2 (Equation (16)), substituting the resulting bound into Weyl’s inequality yields the following:
Step 3: Decompose the Hessian and Introduce Regularization. Rewrite the true Hessian as:
Substitute into Equation (71) and apply Weyl’s inequality again. Since the regularization term is positive semidefinite, its minimum eigenvalue is non-negative, so:
Step 4: Take Expectation and Rearrange. For the symmetric matrix
, we have
. Taking the expectation and using Lemma 2 yields the following:
Taking the expectation of Equation (72) and rearranging terms gives:
Rearranging for
completes the proof. □
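The Weyl bound at the heart of Steps 2–3 — superadditivity of the smallest eigenvalue, λ_min(A + B) ≥ λ_min(A) + λ_min(B) for symmetric A and B — can be verified on random symmetric matrices (generic matrices, not the paper's Hessians):

```python
import numpy as np

rng = np.random.default_rng(7)

def lam_min(M):
    # eigvalsh returns eigenvalues in ascending order for symmetric input.
    return np.linalg.eigvalsh(M)[0]

for _ in range(100):
    n = int(rng.integers(2, 8))
    A = rng.standard_normal((n, n)); A = (A + A.T) / 2
    B = rng.standard_normal((n, n)); B = (B + B.T) / 2
    # Weyl: the smallest eigenvalue is superadditive for symmetric matrices.
    assert lam_min(A + B) >= lam_min(A) + lam_min(B) - 1e-10
```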
Theorem 3. Let be a manifold, and let satisfy Assumptions 1–3. Define: If the step size in the update of Algorithm 1 satisfies , then:
where denotes the minimum eigenvalue. The parameters must satisfy:
Thus, the complexity of the zeroth-order oracle is:
Proof. From Lemma 6, we have:
where
,
, and
correspond to the gradient and Hessian estimation errors. On the other hand, from Lemma 8 (relationship between step size and minimum eigenvalue),
Next, we analyze the upper bound of
.
Step 1: Third-Order Taylor Upper Bound. By the equivalent form of Assumption 2 (Equation (18)),
Step 2: Decompose Gradient and Hessian into Estimation + Error. Write the true gradient and Hessian as:
Substitute into Equation (77) to decompose the function value into the main term, first-order error, second-order error, and third-order term:
Step 3: First-Order Optimality Condition of the Subproblem. From the first-order optimality condition of Subproblem (21),
Take the inner product with
and rearrange. Using the positive definiteness of
and the subgradient inequality for convex $h$, we get:
Substitute this into Equation (78) and combine like terms:
Step 4: Bound First-Order and Second-Order Errors. Using the Cauchy–Schwarz inequality and Young’s inequality,
Substitute into Equation (82), take
, and use the lower bounds of gradient and Hessian errors. Summing over
to
and rearranging gives:
where
. By combining with the parameter choices in Equation (74), we obtain:
Step 5: Final Bounds. Substitute Equation (84) into Lemmas 6 and 7. Simplifying using the parameter choices gives:
The complexity of the zeroth-order oracle is calculated by substituting $N$, $m$, and $b$ from Equation (74), leading to:
This completes the proof. □
7. Conclusions
We propose ZO-ARPQN, the first zeroth-order adaptive regularized proximal quasi-Newton framework for composite optimization on Riemannian manifolds, including the Stiefel and SPD manifolds. The method extends the classical ARPQN algorithm to black-box settings by constructing stochastic one-point finite-difference estimators for both the Riemannian gradient and curvature, enabling second-order optimization without explicit derivatives.
We established global convergence guarantees under mild smoothness assumptions, proved that the algorithm achieves convergence to first-order stationary points even with noisy zeroth-order information, and further demonstrated that the curvature-aware regularization enables escape from strict saddle points with a high probability. A complete iteration-complexity analysis is provided, showing that ZO-ARPQN maintains competitive theoretical efficiency compared to existing zeroth-order Riemannian methods.
Extensive numerical experiments on sparse PCA and robot stiffness tuning validated the practical effectiveness of the proposed method. ZO-ARPQN achieved a convergence behavior comparable to first-order ARPQN and other state-of-the-art Riemannian solvers, while requiring only function evaluations. These results highlight its potential for applications in black-box manifold optimization, robotics, machine learning, and simulation-based design.
Future work will include exploring variance-reduced zeroth-order strategies, developing limited-memory versions for large-scale manifolds, and extending the framework to Riemannian stochastic compositional optimization and constrained reinforcement learning.