
Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method

1 School of Mathematics and Information Science, North Minzu University, Yinchuan 750021, China
2 School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Perth, WA 6102, Australia
* Author to whom correspondence should be addressed.
Axioms 2026, 15(3), 203; https://doi.org/10.3390/axioms15030203
Submission received: 8 January 2026 / Revised: 1 March 2026 / Accepted: 5 March 2026 / Published: 10 March 2026
(This article belongs to the Section Geometry and Topology)

Abstract

Recently, the adaptive regularized proximal quasi-Newton (ARPQN) method has demonstrated strong performance in solving composite optimization problems over the Stiefel manifold. However, its reliance on first-order information limits its applicability in scenarios where gradient and Hessian evaluations are unavailable or costly. In this paper, we propose a zeroth-order adaptive regularized proximal quasi-Newton method (ZO-ARPQN) for black-box composite optimization over Riemannian manifolds, particularly the Stiefel and symmetric positive definite (SPD) manifolds. The proposed method estimates the Riemannian gradient and curvature information through randomized one-point finite-difference approximations and adaptively updates a regularized quasi-Newton matrix to capture the local manifold geometry. Theoretically, we establish global convergence and complexity analyses under mild assumptions. More importantly, by incorporating curvature-aware regularization and random perturbations into the proximal quasi-Newton framework, we prove that ZO-ARPQN escapes strict saddle points with high probability, guaranteeing convergence to a stationary point even in the absence of explicit gradients. Extensive numerical experiments on manifold-constrained problems, including sparse PCA and robot stiffness tuning, demonstrate that ZO-ARPQN achieves competitive convergence behavior compared with state-of-the-art Riemannian optimization methods while requiring only function evaluations.

1. Introduction

In recent years, composite optimization problems on Riemannian manifolds have attracted widespread attention in various application fields. The objective function of such problems typically consists of the sum of a smooth function and a nonsmooth function, with optimization variables constrained on a manifold. The general form of a Riemannian composite optimization problem is summarized as follows:
$$\min_{X \in \mathcal{M}}\ F(X) = f(X) + h(X), \tag{1}$$
where $f:\mathbb{R}^{n\times r}\to\mathbb{R}$ is a smooth function and $h:\mathbb{R}^{n\times r}\to\mathbb{R}$ is a convex, but nonsmooth, function. The aforementioned Riemannian composite optimization problems have found extensive applications, such as compressed sensing [1], sparse principal component analyses [2], and clustering problems [3]. For more applications of composite optimization on manifolds, readers are referred to [4,5,6].
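As a concrete instance of this composite form, sparse PCA (application [2] above) can be written with a smooth variance term and an $\ell_1$ sparsity penalty over the Stiefel manifold. A minimal NumPy sketch, where the data matrix, dimensions, and penalty weight are illustrative choices rather than values from the paper:

```python
import numpy as np

def sparse_pca_objective(X, A, lam):
    """Composite objective F(X) = f(X) + h(X) for sparse PCA:
    f(X) = -trace(X^T A^T A X)   (smooth part: explained variance, negated)
    h(X) = lam * ||X||_1         (convex but nonsmooth sparsity penalty)."""
    f = -np.trace(X.T @ (A.T @ A) @ X)
    h = lam * np.abs(X).sum()
    return f + h

# A feasible point on the Stiefel manifold St(n, r), i.e. X^T X = I_r,
# obtained from a thin QR factorization of a random matrix.
rng = np.random.default_rng(0)
n, r = 8, 2
A = rng.standard_normal((20, n))
X, _ = np.linalg.qr(rng.standard_normal((n, r)))
val = sparse_pca_objective(X, A, lam=0.1)
```

The nonsmooth term $h$ is evaluated directly here; the algorithms discussed below handle it through a proximal step rather than differentiation.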
Composite optimization problems with manifold constraints have been extensively studied in recent years. Currently, popular solution methods can be roughly categorized into the following types: operator-splitting methods, proximal gradient methods, and proximal Newton methods.
Operator-Splitting Methods. Riemannian ADMM (alternating-direction method of multipliers) provides a flexible decomposition strategy, especially suitable for composite problems with special structures. It usually transforms the original problem into an equivalent form via variable splitting and then solves it through alternating updates and dual ascent. Additionally, other operator-splitting ideas, such as Douglas–Rachford splitting, have been extended to manifolds [7]. Research on Riemannian ADMM focuses on its convergence analysis under nonconvex and nonsmooth settings [8] and efficient implementation in specific applications [3]. For example, in distributed or federated learning scenarios, the decomposable nature of ADMM enables the design of communication-efficient and privacy-preserving Riemannian optimization algorithms [9]. The advantages of Riemannian ADMM lie in its high flexibility and ability to decouple variables; however, it requires the introduction of additional variables when constructing the Lagrangian objective function, which slows down its convergence rate.
First-Order Methods: Proximal Gradient Methods. This is the most fundamental and straightforward approach for solving Riemannian composite problems. The Riemannian proximal gradient method iteratively decomposes the smooth and nonsmooth terms, i.e., [4,5]:
$$x_{k+1} = \operatorname{prox}^{R}_{\gamma_k h}\!\big(R_{x_k}(-\gamma_k \operatorname{grad} f(x_k))\big),$$
where $\operatorname{prox}^{R}_{\gamma h}$ denotes the Riemannian proximal operator. This algorithm first performs one step of Riemannian gradient descent on the smooth part f and then corrects for the nonsmooth part h via the proximal operator, entirely within the manifold $\mathcal M$.
To accelerate convergence, researchers have proposed various accelerated Riemannian proximal gradient algorithms by leveraging Nesterov’s acceleration idea. These methods typically improve the convergence rate from O ( 1 / k ) to O ( 1 / k 2 ) under geodesically convex or specific curvature conditions [10,11]. For large-scale data, computing the full gradient f ( x k ) is prohibitively expensive; thus, Riemannian stochastic proximal gradient methods and their variance-reduced variants (e.g., R-SVRG-PG [12]) have emerged. They approximate the gradient through random sampling, significantly reducing the cost per iteration while achieving fast convergence. Considering that f ( x ) is nonconvex in many practical applications, recent works have analyzed the convergence of proximal gradient methods under nonconvex settings, proving that the algorithm converges to stationary points and providing convergence rate guarantees [13]. This method is simple to implement and theoretically mature, but suffers from slow convergence and reliance on the proximal operator.
Second-Order Methods: Proximal Newton Methods. To achieve faster convergence than first-order methods, second-order methods incorporate curvature (Hessian) information of the smooth term f to guide updates. The core idea of the regularized Riemannian proximal Newton method is to obtain the update direction η k at each tangent space T x k M by solving a quadratic approximation subproblem [4,5]:
$$\eta_k = \operatorname*{arg\,min}_{\eta \in T_{X_k}\mathcal M}\ \big\langle \operatorname{grad} f(X_k), \eta\big\rangle + \frac12 \big\langle \eta, H_k[\eta]\big\rangle + h\big(R_{X_k}(\eta)\big),$$
where $H_k$ is a positive definite approximation of the Riemannian Hessian operator $\operatorname{Hess} f(x_k)$.
Standard Newton methods face two major challenges: (1) how to ensure global convergence when the Hessian is non-positive definite, and (2) how to handle the enormous computational, storage, and inversion costs of the Hessian matrix. Recent research on quasi-Newton methods and adaptive regularization has focused on addressing these two issues. These are key techniques to address the computational bottleneck of second-order methods. Riemannian quasi-Newton methods (e.g., Riemannian BFGS or L-BFGS) no longer directly compute the true Hessian matrix; instead, they construct an increasingly accurate Hessian approximation matrix B k using first-order (gradient) information collected during iterations [14,15]. A representative method is the adaptive regularized proximal quasi-Newton (ARPQN) method [16], which solves a subproblem of the following form at each step:
$$\min_{V \in T_{X_k}\mathcal M}\ \big\langle \operatorname{grad} f(X_k), V\big\rangle + \frac12 \big\langle (B_k + \sigma_k I)[V], V\big\rangle + h\big(R_{X_k}(V)\big).$$
The core advantages of this method are Hessian approximation and adaptive regularization. This allows the algorithm to flexibly balance gradient and curvature information, thereby achieving faster convergence in practice.
It should be noted that the aforementioned methods all assume that the gradient and Hessian matrix of the objective function have analytical forms. However, many optimization problems in practical applications face the following challenges, rendering traditional gradient-based (first-order or second-order) methods inapplicable:
1.
Unavailable Gradient: In many practical problems, the objective function f ( X ) is a black box: one can only query its value without access to analytical gradients or automatic differentiation. This often occurs in complex physical simulations such as fluid dynamics or structural mechanics, where each function evaluation is computationally expensive and differentiation is infeasible. Traditional approaches like Bayesian optimization mitigate the cost through surrogate models [17], but become inefficient as the problem dimension increases. In robotic stiffness control, the objective function describing the robot–environment interaction has no closed-form gradient and can only be evaluated through high-fidelity simulations or experiments. The mapping from stiffness parameters (an SPD matrix) to performance scores thus forms a typical black-box optimization problem.
2.
Prohibitive Gradient Computation Cost: Even if the gradient exists in theory, the cost of computing it may be unacceptably high. In large-scale reinforcement-learning scenarios, the reward function of an agent may depend on an extremely long sequence of decisions. Computing the gradient of policy parameters through backpropagation may consume enormous computational resources and memory. Directly evaluating the performance of the policy (zeroth-order information) may be much cheaper. Zeroth-order methods such as evolutionary strategies have been proven to be scalable alternatives for reinforcement learning [18].
Based on the above motivations, zeroth-order methods can be applied to any optimization problem where function evaluations are accessible, regardless of how complex its internal structure is or whether it is differentiable. This greatly expands the application boundary of optimization algorithms, enabling them to solve black-box problems that traditional methods cannot address.
First, we clarify the definition: Zeroth-order Riemannian methods, also known as derivative-free Riemannian optimization methods, refer to a class of methods that solve optimization problems on manifolds using only function evaluation values (zeroth-order information), without using explicit gradients (first-order information) or Hessian matrices (second-order information) of the objective function. By abandoning reliance on accurate gradient information, zeroth-order Riemannian methods gain the ability to handle complex problems such as black-box, noisy, and high-cost scenarios. They sacrifice a certain degree of convergence efficiency (compared to gradient-based methods) to significantly broaden the application scope of optimization algorithms on manifolds.

2. Theoretical Contributions

This work proposes the zeroth-order Riemannian adaptive regularized proximal quasi-Newton method (ZO-ARPQN) for composite optimization over the Stiefel and SPD manifolds, establishing three fundamental advances:
Zeroth-Order Extension of Proximal Quasi-Newton on Riemannian Manifolds. We generalize the adaptive regularized proximal quasi-Newton (ARPQN) framework [16] to the zeroth-order setting over Riemannian manifolds such as the Stiefel and SPD manifolds. By constructing randomized finite-difference estimators that approximate both the Riemannian gradient and curvature information, ZO-ARPQN achieves first-order accuracy using only function evaluations, enabling black-box optimization in manifold-constrained problems.
Global Convergence Guarantees under Noisy Gradient Estimates. We establish global convergence to a stationary point under mild smoothness and regularization assumptions, even when the gradient and Hessian information are approximated by random perturbations. Theoretical bounds are derived on the bias and variance of the zeroth-order estimators, showing that ZO-ARPQN maintains the same global convergence as its first-order counterpart.
Saddle-Point Escaping. By incorporating curvature-aware regularization and adaptive random perturbation in the proximal quasi-Newton step, we prove that ZO-ARPQN can escape strict saddle points with a high probability, ensuring convergence to a stationary point. This result extends existing saddle-point escape theory to the manifold and zeroth-order settings.

3. Preliminaries on Riemannian Optimization

The core idea of Riemannian optimization is to extend classical optimization methods from Euclidean space to non-Euclidean spaces, i.e., Riemannian manifolds. Compared with the Euclidean case, manifold optimization requires constraining the geometric structure of iterates, such that search directions, gradients, and update operators are all defined within the framework of tangent spaces. This requires us to introduce tools such as tangent spaces, Riemannian metrics, retractions, and Riemannian Hessians.

3.1. Tangent Space and Tangent Bundle

Let $\mathcal M$ be a manifold. The tangent space at a point $X \in \mathcal M$ is denoted by $T_X\mathcal M$ and consists of all tangent vectors at $X$. The tangent bundle
$$T\mathcal M := \bigcup_{X\in\mathcal M} T_X\mathcal M$$
represents the collection of all tangent spaces of $\mathcal M$.

3.2. Riemannian Metric and Norm

If the tangent spaces are endowed with an inner product
$$\langle \xi, \eta\rangle_X, \qquad \xi, \eta \in T_X\mathcal M,$$
that varies smoothly with the point $X$, then $(\mathcal M, \langle\cdot,\cdot\rangle)$ is called a Riemannian manifold. The induced norm is
$$\|\xi\|_X = \sqrt{\langle \xi, \xi\rangle_X}.$$
For simplicity, we write $\|\xi\|$ and $\langle \xi, \xi\rangle$ when there is no ambiguity.

3.3. Riemannian Gradient

Following [19], if $f:\mathcal M\to\mathbb R$ is a smooth function, its Riemannian gradient $\operatorname{grad} f(X) \in T_X\mathcal M$ at $X\in\mathcal M$ is defined as the unique tangent vector satisfying
$$\big\langle \operatorname{grad} f(X), \xi\big\rangle = Df(X)[\xi], \qquad \forall\, \xi \in T_X\mathcal M,$$
where $Df(X)[\xi]$ denotes the directional derivative of $f$ at $X$ along $\xi$. In this paper, we adopt the Riemannian metric induced by the Euclidean inner product, i.e.,
$$\langle \cdot, \cdot\rangle_x = \langle \cdot, \cdot\rangle, \qquad \forall\, x \in \mathcal M.$$
Under this metric, the Riemannian gradient of a function is the projection of its Euclidean gradient onto the tangent space:
$$\operatorname{grad} f(x) = \operatorname{Proj}_{T_x\mathcal M}\big(\nabla f(x)\big).$$
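The defining identity $\langle \operatorname{grad} f(X), \xi\rangle = Df(X)[\xi]$ can be checked numerically. The sketch below uses $f(X) = \operatorname{tr}(X^\top A X)$ with symmetric $A$, whose Euclidean gradient is $2AX$, together with the standard tangent projection formula for the Stiefel manifold (stated here as an assumption; the paper recalls it later for this manifold):

```python
import numpy as np

def proj_stiefel(X, Y):
    """Tangent projection on St(n, r): (I - X X^T) Y + X * skew(X^T Y)."""
    skew = lambda A: (A - A.T) / 2
    return (np.eye(X.shape[0]) - X @ X.T) @ Y + X @ skew(X.T @ Y)

rng = np.random.default_rng(3)
n, r = 6, 2
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
X, _ = np.linalg.qr(rng.standard_normal((n, r)))
f = lambda Z: np.trace(Z.T @ A @ Z)

xi = proj_stiefel(X, rng.standard_normal((n, r)))   # a tangent direction
grad = proj_stiefel(X, 2 * A @ X)                   # Riemannian gradient
t = 1e-6
dir_deriv = (f(X + t * xi) - f(X)) / t              # finite-difference Df(X)[xi]
# <grad f(X), xi> should match the directional derivative up to O(t).
```

Because the projection is self-adjoint and $\xi$ is already tangent, $\langle \operatorname{Proj}(\nabla f), \xi\rangle = \langle \nabla f, \xi\rangle$, which is what the check exploits.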
Definition 1 
(Retraction [4]). A retraction on a manifold $\mathcal M$ is a smooth mapping $R: T\mathcal M \to \mathcal M$ satisfying the following properties, where $R_X$ denotes the restriction of $R$ to $T_X\mathcal M$: (1) $R_X(0_X) = X$, where $0_X$ is the zero element of $T_X\mathcal M$; (2) under the canonical isomorphism $T_{0_X}(T_X\mathcal M) \simeq T_X\mathcal M$, we have
$$DR_X(0_X) = \operatorname{id}_{T_X\mathcal M},$$
where $DR_X(0_X)$ denotes the differential of $R_X$ at $0_X$, and $\operatorname{id}_{T_X\mathcal M}$ denotes the identity mapping on $T_X\mathcal M$.
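On the Stiefel manifold, one standard retraction satisfying Definition 1 is the QR retraction $R_X(\xi) = \operatorname{qf}(X + \xi)$, the Q-factor of a thin QR factorization (an illustrative choice; the paper does not fix a particular retraction here). A minimal NumPy sketch:

```python
import numpy as np

def qr_retraction(X, xi):
    """QR retraction on St(n, r): map X + xi back onto the manifold
    via the Q-factor of a thin QR factorization."""
    Q, R = np.linalg.qr(X + xi)
    # Fix signs so that diag(R) > 0; this makes the Q-factor unique
    # and guarantees R_X(0) = X exactly.
    s = np.sign(np.diag(R))
    s[s == 0] = 1.0
    return Q * s

rng = np.random.default_rng(1)
n, r = 6, 2
X, _ = np.linalg.qr(rng.standard_normal((n, r)))
# Property (1) of Definition 1: R_X(0) = X.
print(np.allclose(qr_retraction(X, np.zeros((n, r))), X))  # prints True
```

Since $X + \xi$ has full column rank for small $\xi$, the retracted point always has orthonormal columns, i.e., it stays on the manifold.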
The following proposition serves as a key bridge connecting retractions and optimization analyses. It endows complex manifold optimization algorithms with strict mathematical certainty. Inequality (6) guarantees the boundedness and controllability of the retraction operation. It ensures that a tangent space displacement ξ of a finite magnitude does not result in an infinitely distant move on the manifold, which is a prerequisite for any subsequent convergence analysis.
Proposition 1 
([19]). If $\mathcal M$ is a compact embedded submanifold of $\mathbb R^N$ and $R$ is a retraction, then there exist constants $M_1, M_2 > 0$ such that, for all $X \in \mathcal M$ and $\xi \in T_X\mathcal M$,
$$\|R_X(\xi) - X\| \le M_1 \|\xi\|, \tag{6}$$
$$\|R_X(\xi) - X - \xi\| \le M_2 \|\xi\|^2. \tag{7}$$
The following assumption is the most natural and practical way to extend the well-known L-smooth (i.e., the function has a Lipschitz-continuous gradient) concept from Euclidean space to Riemannian manifolds.
Assumption 1 
(L-retraction smoothness [20]). There exists a constant $L_g \ge 0$ such that, for the objective function $f$ of Problem (1), the following inequality holds:
$$f\big(R_x(\eta)\big) - f(x) - \big\langle \operatorname{grad} f(x), \eta\big\rangle_x \le \frac{L_g}{2}\|\eta\|_x^2, \qquad \forall\, x \in \mathcal M,\ \eta \in T_x\mathcal M.$$
This assumption ensures that the behavior of function f along the retraction path is well-posed, i.e., as long as η is sufficiently small, predicting the function value at the next point using the gradient information at the current point is relatively accurate, ensuring that the gradient-based linear approximation is locally effective. Thus, a convergence analysis from Euclidean space (gradient descent, trust region, Newton method) can be extended to manifolds.

4. Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Method (ZO-ARPQN)

4.1. Zeroth-Order Riemannian Gradient and Hessian

In existing manifold optimization methods, gradients or their tangent space projections are assumed to be directly available. However, in many real-world problems such as black-box optimization, noisy or nonsmooth functions, and high-dimensional models, explicit gradient computation is either infeasible or too costly. Zeroth-order optimization offers an effective alternative by relying only on function evaluations.
While classical Euclidean zeroth-order estimators (e.g., Gaussian smoothing or coordinate finite differences) can approximate gradients accurately, they require many function calls and do not generalize well to manifolds. To overcome this, we employ a random direction difference method that samples one or several random tangent directions and uses one-sided finite differences to estimate gradients. This approach makes the function query cost independent of the dimension, achieves good scalability to high-dimensional Riemannian problems, and balances theoretical soundness with computational efficiency.
Definition 2 
(Stochastic Difference Zeroth-Order Riemannian Gradient [20]). Generate $u = Pu_0 \in T_x\mathcal M$, where $u_0 \sim N(0, I_n) \in \mathbb R^n$ and $P \in \mathbb R^{n\times n}$ is the orthogonal projection matrix onto the tangent space $T_x\mathcal M$. Thus, $u$ follows the distribution $N(0, PP^\top)$, i.e., the standard normal distribution on the tangent space; all eigenvalues of $PP^\top$ are either 0 (eigenvectors orthogonal to the tangent space) or 1 (eigenvectors within the tangent space). The zeroth-order Riemannian gradient is defined as
$$g_\mu(x) = \frac{f\big(R_x(\mu u)\big) - f(x)}{\mu}\, u = \frac{f\big(R_x(\mu P u_0)\big) - f(x)}{\mu}\, P u_0.$$
It should be noted that the projection matrix $P$ is easy to compute on common manifolds. For example, for the Stiefel manifold $\mathcal M$, the projection can be written as
$$\operatorname{Proj}_{T_X\mathcal M}(Y) = (I - XX^\top)Y + X\,\operatorname{skew}(X^\top Y),$$
where $\operatorname{skew}(A) := (A - A^\top)/2$. In the stochastic case, performing multiple samplings in each iteration can improve the convergence speed:
$$\bar g_{\mu,\xi}(x) = \frac1m \sum_{i=1}^m g_{\mu,\xi_i}(x), \qquad \text{where } g_{\mu,\xi_i}(x) = \frac{F\big(R_x(\mu u_i), \xi_i\big) - F(x, \xi_i)}{\mu}\, u_i,$$
where $u_i$ is a standard normal random vector on the tangent space $T_x\mathcal M$. We also have
$$\mathbb E_\xi\big[g_{\mu,\xi_i}(x)\big] = \frac{f\big(R_x(\mu u_i)\big) - f(x)}{\mu}\, u_i = g_\mu(x).$$
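The averaged estimator above can be sketched as follows for the (deterministic) Stiefel case, combining the tangent projection, Gaussian tangent directions, a one-sided finite difference, and a QR retraction (the retraction and the test function are illustrative choices, not fixed by the definition):

```python
import numpy as np

def proj_stiefel(X, Y):
    """Tangent projection on St(n, r): (I - X X^T) Y + X * skew(X^T Y)."""
    skew = lambda A: (A - A.T) / 2
    return (np.eye(X.shape[0]) - X @ X.T) @ Y + X @ skew(X.T @ Y)

def qr_retract(X, xi):
    """QR retraction (one standard choice of R_X)."""
    Q, R = np.linalg.qr(X + xi)
    s = np.sign(np.diag(R)); s[s == 0] = 1.0
    return Q * s

def zo_riemannian_gradient(f, X, mu=1e-5, m=1000, rng=None):
    """Averaged one-point finite-difference estimator of grad f(X):
    (1/m) * sum_i [f(R_X(mu u_i)) - f(X)] / mu * u_i,
    with each u_i a Gaussian direction projected onto the tangent space."""
    rng = rng if rng is not None else np.random.default_rng()
    fX = f(X)
    g = np.zeros_like(X)
    for _ in range(m):
        u = proj_stiefel(X, rng.standard_normal(X.shape))
        g += (f(qr_retract(X, mu * u)) - fX) / mu * u
    return g / m

# Sanity check on f(X) = trace(X^T A X), whose Riemannian gradient is
# Proj_{T_X}(2 A X) for symmetric A.
rng = np.random.default_rng(2)
n, r = 6, 2
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
X, _ = np.linalg.qr(rng.standard_normal((n, r)))
f = lambda Z: np.trace(Z.T @ A @ Z)
g_est = zo_riemannian_gradient(f, X, mu=1e-5, m=3000, rng=rng)
g_true = proj_stiefel(X, 2 * A @ X)
```

With $m$ in the thousands the sample average tracks the true Riemannian gradient to within a modest relative error, consistent with the variance terms in Lemma 1 below.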
Lemma 1 
(Upper bound on the error of the averaged zeroth-order gradient [20]). For the zeroth-order Riemannian gradient estimator with multiple samplings, we have
$$\mathbb E\,\big\|\bar g_{\mu,\xi}(x) - \operatorname{grad} f(x)\big\|^2 \le \mu^2 L_g^2 (d+6)^3 + \frac{8(d+4)}{m}\sigma^2 + \frac{8(d+4)}{m}\big\|\operatorname{grad} f(x)\big\|^2,$$
where the expectation $\mathbb E$ is taken over the Gaussian vectors $U = \{u_1, \dots, u_m\}$ and $\xi$.
Definition 3 
(Zeroth-Order Riemannian Hessian [20]). The zeroth-order Riemannian Hessian estimator of the function $f$ at the point $x$ is defined as
$$H_\mu(x) = \frac{1}{2\mu^2}\big(u u^\top - P\big)\Big[F\big(R_x(\mu u), \xi\big) + F\big(R_x(-\mu u), \xi\big) - 2F(x, \xi)\Big]. \tag{13}$$
It should be noted that the Riemannian Hessian here is actually the projected Hessian estimator of the pullback function
$$F_x(\eta, \xi) := F\big(R_x(\eta), \xi\big), \qquad x \in \mathcal M,\ \eta \in T_x\mathcal M,$$
on the tangent space $T_x\mathcal M$. Additionally, the multi-sampling technique is adopted. For $i = 1, \dots, b$, each Hessian sample is defined as
$$H_{\mu,i}(x) = \frac{1}{2\mu^2}\big(u_i u_i^\top - P\big)\Big[F\big(R_x(\mu u_i), \xi_i\big) + F\big(R_x(-\mu u_i), \xi_i\big) - 2F(x, \xi_i)\Big],$$
and its averaged form is
$$\bar H_{\mu,\xi}(x) = \frac1b \sum_{i=1}^b H_{\mu,i}(x). \tag{15}$$
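A minimal NumPy sketch of the averaged estimator, acting on the vectorized tangent space of the Stiefel manifold; the function is taken deterministic ($\xi$ suppressed) and the retraction is a QR retraction, both illustrative choices:

```python
import numpy as np

def proj_stiefel(X, Y):
    skew = lambda A: (A - A.T) / 2
    return (np.eye(X.shape[0]) - X @ X.T) @ Y + X @ skew(X.T @ Y)

def qr_retract(X, xi):
    Q, R = np.linalg.qr(X + xi)
    s = np.sign(np.diag(R)); s[s == 0] = 1.0
    return Q * s

def zo_riemannian_hessian(f, X, mu=1e-3, b=100, rng=None):
    """Averaged zeroth-order Hessian estimator: each sample is
    (u u^T - P) * [f(R_X(mu u)) + f(R_X(-mu u)) - 2 f(X)] / (2 mu^2),
    where u is a Gaussian tangent direction (vectorized) and P is the
    matrix of the tangent-space projection in vectorized coordinates."""
    rng = rng if rng is not None else np.random.default_rng()
    n, r = X.shape
    N = n * r
    # Build P column by column from the projections of the basis vectors.
    P = np.column_stack([proj_stiefel(X, e.reshape(n, r)).ravel()
                         for e in np.eye(N)])
    fX = f(X)
    H = np.zeros((N, N))
    for _ in range(b):
        U = proj_stiefel(X, rng.standard_normal((n, r)))
        u = U.ravel()
        diff2 = f(qr_retract(X, mu * U)) + f(qr_retract(X, -mu * U)) - 2 * fX
        H += (np.outer(u, u) - P) * diff2 / (2 * mu**2)
    return H / b

# Illustrative call on a smooth quadratic over St(4, 2).
rng = np.random.default_rng(4)
n, r = 4, 2
A = rng.standard_normal((n, n)); A = (A + A.T) / 2
X, _ = np.linalg.qr(rng.standard_normal((n, r)))
H = zo_riemannian_hessian(lambda Z: np.trace(Z.T @ A @ Z), X, mu=1e-3, b=50, rng=rng)
```

Each sample is symmetric by construction ($uu^\top$ and $P$ are both symmetric), so the average inherits the symmetry expected of a Hessian approximation.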
Assumption 2 
(Lipschitz Hessian assumption). For any $x \in \mathcal M$ and $\eta \in T_x\mathcal M$, we have
$$\big\| P_\eta^{-1} \circ H_F\big(R_x(\eta), \xi\big) \circ P_\eta - H_F(x, \xi) \big\|_{\mathrm{op}} \le L_H \|\eta\|,$$
which holds almost everywhere, where $P_\eta: T_x\mathcal M \to T_{R_x(\eta)}\mathcal M$ denotes the parallel transport, $\circ$ denotes function composition, and $\|\cdot\|_{\mathrm{op}}$ denotes the operator norm. This assumption constrains the rate of change of the Hessian on the manifold.
Assumption 2 is the Riemannian counterpart of the Lipschitz Hessian assumption in Euclidean space, whose equivalent conditions include (see [13]):
$$\big\| P_\eta^{-1}\, \operatorname{grad} F\big(R_x(\eta), \xi\big) - \operatorname{grad} F(x, \xi) - H_F(x, \xi)[\eta] \big\| \le \frac{L_H}{2}\|\eta\|^2,$$
$$\Big| F\big(R_x(\eta), \xi\big) - F(x, \xi) - \big\langle \eta, \operatorname{grad} F(x, \xi)\big\rangle - \frac12 \big\langle \eta, H_F(x, \xi)[\eta]\big\rangle \Big| \le \frac{L_H}{6}\|\eta\|^3.$$
In the Euclidean case, $P_\eta$ degenerates to the identity mapping. In this section, we also assume that $F(\cdot, \xi)$ satisfies Assumption 1 and introduce the following common assumption, which is often used in zeroth-order stochastic optimization.
Assumption 3 
(Bounded Variance Assumption [19,21]). Let $\mathbb E = \mathbb E_\xi$; then, for all $x \in \mathcal M$,
$$\mathbb E\big[F(x,\xi)\big] = f(x), \qquad \mathbb E\big[\operatorname{grad} F(x,\xi)\big] = \operatorname{grad} f(x), \qquad \mathbb E\,\big\|\operatorname{grad} F(x,\xi) - \operatorname{grad} f(x)\big\|^2 \le \sigma^2.$$
The error between the approximate Hessian matrix constructed using the zeroth-order method and the true Riemannian H f ( x ) is bounded in expectation. By choosing a sufficiently large number of samples b and a sufficiently small step size μ , the approximation error can be made arbitrarily small, thereby ensuring the effectiveness and controllability of the zeroth-order Hessian estimation.
Lemma 2 
([20]). Let $\bar H_{\mu,\xi}(x)$ be computed according to Formula (15) and $H_\mu(x)$ according to Formula (13). Then, for any $x \in \mathcal M$ and $\eta \in T_x\mathcal M$, we have
$$\mathbb E_{\mu,\xi}\,\big\| \bar H_{\mu,\xi}(x) - \operatorname{Hess} f(x) \big\|_{\mathrm{op}} \le \frac{2(d+16)^4}{\sqrt{2b}}\, L_g + \frac{\mu^2 L_H^2}{18}(d+6)^5,$$
$$\mathbb E_{u,\Xi}\,\big\| H_\mu(x) \big\|_F^2 \le 4(d+16)^8 \cdot 8 L_g^2.$$
This lemma provides a theoretical basis for the zeroth-order quasi-Newton method: even without an explicit Hessian, a high-quality Hessian approximation can be constructed using function value approximation; it provides an error upper bound for subsequent convergence analyses, ensuring the theoretical feasibility of the method.

4.2. Zeroth-Order Extension: Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Algorithm

To address the scenario where first-order or second-order derivative information cannot be directly obtained, based on the above groundwork, we propose a zeroth-order information-based adaptive Riemannian optimization algorithm. Our method utilizes the aforementioned zeroth-order gradient and Hessian estimators to construct the following regularized quadratic approximation model in each iteration:
$$\tilde\phi_k(V) = \big\langle \bar g_{\mu,\xi}(x_k), V\big\rangle + \frac12 \big\langle (\bar H_{\mu,\xi}(x_k) + \sigma_k I)[V], V\big\rangle + h(x_k + V), \tag{21}$$
where g ¯ μ , ξ ( x k ) denotes the multi-sampling mean of the zeroth-order Riemannian gradient, H ¯ μ , ξ ( x k ) is the mean of the zeroth-order Riemannian Hessian estimation, and σ k is the adaptive regularization parameter. By minimizing Subproblem (21) on the tangent space T x k M , we can obtain a reasonable descent direction using only function values (zeroth-order information). Furthermore, the adaptive adjustment mechanism of the regularization parameter σ k ensures the stability and convergence of the algorithm under nonconvex conditions, thereby enabling the method to exhibit an excellent theoretical and practical performance in zeroth-order Riemannian optimization problems.
The greatest advantage of the zeroth-order extended adaptive Riemannian proximal quasi-Newton method is that it breaks the dependence on explicit gradients and Hessians, but can still closely approximate the second-order curvature information in black-box, non-differentiable, or even noisy environments, maintaining the efficiency and stability of the algorithm. This significantly expands the application scope of traditional Riemannian proximal Newton methods [16,22]. The complete optimization process is shown in Algorithm 1 (ZO-ARPQN). Below, we briefly describe the main steps of the ZO-ARPQN algorithm.
Algorithm 1 ZO-ARPQN: Zeroth-Order Adaptive Regularized Proximal Quasi-Newton Algorithm
Require: Initial point $X_0 \in \mathcal M$; initial regularization parameter $\sigma_0 > 0$; line search parameters $\sigma, \gamma \in (0,1)$; threshold parameters $0 < \eta_1 < \eta_2 < 1$; sample batch sizes $m, b$; scaling factors $0 < \gamma_1 < 1 < \gamma_2$
1: for $k = 0, 1, 2, \dots$ do
2:   Generate the zeroth-order gradient $\bar g_{\mu,\xi}(x_k)$ and zeroth-order Hessian $\bar H_{\mu,\xi}(x_k)$ based on sample sizes $m, b$
3:   while true do
4:     Solve the subproblem
$$\min_{V \in T_{x_k}\mathcal M}\ \tilde\phi_k(V) = \big\langle \bar g_{\mu,\xi}(x_k), V\big\rangle + \frac12\big\langle(\bar H_{\mu,\xi}(x_k) + \sigma_k I)[V], V\big\rangle + h(x_k + V)$$
       to obtain the search direction $V_k$
5:     Set the initial step size $\alpha_k = 1$
6:     while Condition (22) is not satisfied do
7:       $\alpha_k \leftarrow \gamma\,\alpha_k$
8:     end while
9:     Let $Z_k = R_{X_k}(\alpha_k V_k)$
10:    Compute the ratio
$$\rho_k = \frac{F\big(R_{X_k}(\alpha_k V_k)\big) - F\big(X_{l(k)}\big)}{\tilde\phi_k(\alpha_k V_k) - \tilde\phi_k(0)}$$
11:    if $\rho_k \ge \eta_1$ then
12:      if $\rho_k \ge \eta_2$ then
13:        Update $\sigma_k \leftarrow \gamma_1 \sigma_k$
14:      end if
15:      break
16:    else
17:      Update $\sigma_k \leftarrow \gamma_2 \sigma_k$
18:    end if
19:  end while
20:  Update $X_{k+1} = Z_k$, $\sigma_{k+1} = \sigma_k$
21: end for
Step size selection strategy. In each iteration, the algorithm first obtains the search direction V k by solving the subproblem. Then, a non-monotone line search with the Armijo condition is used to determine the appropriate step size α k . Specifically, it is defined as:
$$\alpha_k = \gamma^{N_k}, \qquad N_k \in \mathbb N,$$
where N k is the smallest integer satisfying the following condition:
$$F\big(R_{X_k}(\alpha_k V_k)\big) \le \max_{\max\{0,\, k-m\} \le j \le k} F(X_j) - \frac12 \sigma \alpha_k \|V_k\|_{B_k}^2. \tag{22}$$
The scaling factor $\gamma \in (0,1)$ ensures that $\alpha_k$ decreases gradually during backtracking; $\sigma > 0$ is a line search parameter. The non-monotone condition allows the function value at the current iterate to not strictly decrease; instead, it is compared with the maximum value among the most recent m iterations, thereby improving the algorithm's robustness on complex nonconvex problems.
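The backtracking loop above can be sketched as a small helper; the $B_k$-norm $\|V\|_{B_k}^2 = \langle B_k[V], V\rangle$ is passed in precomputed, and the names and the toy Euclidean instance are illustrative:

```python
import numpy as np

def nonmonotone_armijo(F, retract, X, V, hist, sigma=1e-4, gamma=0.5,
                       Bk_norm_sq=None, max_backtracks=50):
    """Return the first alpha in {1, gamma, gamma^2, ...} satisfying
    F(R_X(alpha V)) <= max(recent F values) - 0.5*sigma*alpha*||V||_B^2.
    `hist` holds the objective values of the last few iterates."""
    if Bk_norm_sq is None:
        Bk_norm_sq = float(np.sum(V * V))   # fall back to ||V||^2
    ref = max(hist)                          # non-monotone reference value
    alpha = 1.0
    for _ in range(max_backtracks):
        if F(retract(X, alpha * V)) <= ref - 0.5 * sigma * alpha * Bk_norm_sq:
            break
        alpha *= gamma
    return alpha

# Toy Euclidean illustration: F(x) = ||x||^2, "retraction" = identity shift.
F = lambda x: float(np.sum(x * x))
retract = lambda x, v: x + v
x0 = np.array([1.0])
alpha = nonmonotone_armijo(F, retract, x0, -2 * x0, hist=[F(x0)])
# alpha == 0.5 here: the full step alpha = 1 overshoots past the minimizer
# and fails the condition; one backtrack lands exactly at the minimum.
```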
Model consistency measure (ratio definition). Let
$$F\big(X_{l(k)}\big) = \max_{\max\{0,\, k-m\} \le j \le k} F(X_j).$$
This serves as the reference benchmark for the function value in the current iteration. To measure the degree of matching between the constructed quadratic approximation model and the true function decrease, we introduce the ratio:
$$\rho_k := \frac{F\big(R_{x_k}(\alpha_k V_k)\big) - F\big(X_{l(k)}\big)}{\tilde\phi_k(\alpha_k V_k) - \tilde\phi_k(0)}.$$
To simplify the notation, we introduce
$$l(k) := \operatorname*{arg\,max}_{\max\{0,\, k-m\} \le j \le k} F(X_j).$$
The ratio ρ k characterizes the consistency between the predicted and actual decrease, where the numerator is the actual function decrease under step size α k and the denominator is the predicted decrease in the quadratic approximation model.
Update strategy for regularization parameter σ k . To ensure the stability and convergence of the algorithm, the regularization parameter σ k is dynamically adjusted based on the ratio ρ k :
1.
If ρ k < η 1 , it indicates an inaccurate prediction or an excessively large step size, and the model is too optimistic:
$$\sigma_{k+1} = \gamma_2 \sigma_k, \qquad \gamma_2 > 1,$$
enhancing the regularization strength to make the model more conservative.
2.
If ρ k η 1 , it indicates an acceptable decrease performance.
3.
If ρ k η 2 , it indicates an extremely accurate model prediction and an ideal decrease performance:
$$\sigma_{k+1} = \gamma_1 \sigma_k, \qquad \gamma_1 \in (0,1),$$
weakening the regularization.
4.
If η 1 ρ k < η 2 , keep the regularization unchanged:
$$\sigma_{k+1} = \sigma_k.$$
This corresponds to the scenario where the model prediction is roughly consistent with the true decrease. If σ k is reduced hastily at this point, the model may become overly optimistic, leading to instability in the next iteration; if σ k is increased hastily, the model may become overly conservative, sacrificing convergence speed. Therefore, the most reasonable choice is to keep the regularization level unchanged. This design ensures that the algorithm balances convergence and efficiency in long-term operation, without affecting stability due to frequent adjustments of σ k .
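The three-way rule above can be sketched as a small helper; the concrete thresholds $\eta_1 = 0.25$, $\eta_2 = 0.75$ and factors $\gamma_1 = 0.5$, $\gamma_2 = 2$ are illustrative defaults (the paper only requires $0 < \eta_1 < \eta_2 < 1$ and $0 < \gamma_1 < 1 < \gamma_2$), and the `accept` flag mirrors the break/retry logic of the inner loop:

```python
def update_regularization(rho, sigma, eta1=0.25, eta2=0.75,
                          gamma1=0.5, gamma2=2.0):
    """Trust-region-style update of the regularization parameter:
    rho < eta1          -> model too optimistic: increase sigma, reject step;
    eta1 <= rho < eta2  -> acceptable decrease: keep sigma, accept step;
    rho >= eta2         -> very accurate model: decrease sigma, accept step."""
    if rho < eta1:
        return gamma2 * sigma, False   # enlarge regularization, retry step
    if rho >= eta2:
        return gamma1 * sigma, True    # relax regularization, accept step
    return sigma, True                 # keep sigma, accept step
```

For example, `update_regularization(0.1, 1.0)` doubles the regularization and rejects the step, while `update_regularization(0.9, 1.0)` halves it and accepts.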
Finally, the new iterate is given by:
$$X_{k+1} = R_{X_k}(\alpha_k V_k).$$
This algorithm integrates mechanisms from three aspects: (1) a non-monotone line search, which improves the algorithm’s robustness in nonconvex scenarios; (2) the model consistency ratio, which dynamically evaluates the difference between the predicted and true decreases; and (3) an adaptive regularization update, which balances convergence and efficiency by adjusting σ k .
Thus, the ZO-ARPQN algorithm can efficiently simulate the behavior of second-order methods using only function values, and it balances the convergence stability and computational efficiency both theoretically and practically.

5. Convergence Analysis

In this section, we prove the global convergence of Algorithm 1. First, we present some standard assumptions that will be used in the subsequent analysis.
Assumption 4. 
Let $\{X_k\}$ be the iterate sequence generated by Algorithm 1. Then, (1) $f:\mathbb R^{n\times r}\to\mathbb R$ is a continuously differentiable function, and its gradient $\nabla f$ is Lipschitz continuous with constant $L$; (2) $h:\mathbb R^{n\times r}\to\mathbb R$ is a convex, but nonsmooth, function, and $h$ is Lipschitz continuous with constant $L_h$; (3) there exist constants $0 < \kappa_1 < \kappa_2$ such that, for all $k \ge 0$,
$$\kappa_1 \|V\|^2 \le \big\langle \bar H_k[V], V\big\rangle \le \kappa_2 \|V\|^2, \qquad \forall\, V \in T_{X_k}\mathcal M; \tag{25}$$
and (4) the optimal solution $V_k$ of Subproblem (4) satisfies $V_k \neq 0$ for all $k \ge 0$.
Remark 1. 
In Assumption 4, the first two items are standard assumptions in the convergence analysis of composite optimization. From Assumption 4, the objective function $\tilde\phi_k$ of Subproblem (21) is strongly convex and thus has a unique solution, denoted by $V_k$. According to the first-order optimality condition of the original Problem (1),
$$0 \in \operatorname{grad} f(X^*) + \operatorname{Proj}_{T_{X^*}\mathcal M}\big(\partial h(X^*)\big), \tag{26}$$
we have
$$0 \in \operatorname{Proj}_{T_{X_k}\mathcal M}\,\partial\tilde\phi_k(V_k) = \operatorname{grad} f(X_k) + (\tilde H_k + \sigma_k I)[V_k] + \operatorname{Proj}_{T_{X_k}\mathcal M}\,\partial h(X_k + V_k).$$
Therefore, $V_k = 0$ is equivalent to $X_k$ satisfying (26), i.e., $X_k$ is a stationary point of Problem (1). If $V_k \neq 0$, similar to the proof of Lemma 5.1 in [23], it can be shown that $V_k$ yields a sufficient decrease in $\tilde\phi_k$. For completeness, we provide the proof below.
Lemma 3. 
Given the iterate $X_k$, let
$$\tilde\phi(V) := \big\langle \bar g_{\mu,\xi}(x), V\big\rangle + \frac12\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[V], V\big\rangle + h(x + V)$$
be the objective function in (21). Then, for any $\alpha \in [0,1]$, we have
$$\tilde\phi(\alpha V_k) - \tilde\phi(0) \le \frac{\alpha(\alpha - 2)}{2}\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[V_k], V_k\big\rangle.$$
Proof. 
Since $\tilde\phi$ is strongly convex (its quadratic part is positive definite by Assumption 4), we have
$$\tilde\phi(\hat V) \ge \tilde\phi(V) + \big\langle \partial\tilde\phi(V), \hat V - V\big\rangle + \frac12\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[\hat V - V], \hat V - V\big\rangle, \qquad \forall\, V, \hat V \in \mathbb R^{n\times r}. \tag{30}$$
In particular, if $V, \hat V$ are feasible (i.e., $V, \hat V \in T_{X_k}\mathcal M$), then
$$\big\langle \partial\tilde\phi(V), \hat V - V\big\rangle = \big\langle \operatorname{Proj}_{T_{X_k}\mathcal M}\,\partial\tilde\phi(V), \hat V - V\big\rangle.$$
From the optimality condition, $0 \in \operatorname{Proj}_{T_{X_k}\mathcal M}\,\partial\tilde\phi(V_k)$. Setting $V = V_k$ and $\hat V = 0$ in (30), we obtain
$$\tilde\phi(0) \ge \tilde\phi(V_k) + \frac12\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[V_k], V_k\big\rangle.$$
Since $\tilde\phi(0) = h(X_k)$, expanding $\tilde\phi(V_k)$ by its definition gives
$$h(X_k) \ge \big\langle \bar g_{\mu,\xi}(x), V_k\big\rangle + \frac12\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[V_k], V_k\big\rangle + h(X_k + V_k) + \frac12\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[V_k], V_k\big\rangle.$$
Additionally, using the convexity of $h$, we have
$$h(X_k + \alpha V_k) = h\big(\alpha(X_k + V_k) + (1 - \alpha)X_k\big) \le \alpha\, h(X_k + V_k) + (1 - \alpha)\, h(X_k).$$
Using $\tilde\phi(0) = h(X_k)$, expanding $\tilde\phi_k(\alpha V_k)$ by definition, and combining the above inequalities:
$$\begin{aligned}
\tilde\phi_k(\alpha V_k) - \tilde\phi_k(0) &= \big\langle \bar g_{\mu,\xi}(x), \alpha V_k\big\rangle + \frac12\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[\alpha V_k], \alpha V_k\big\rangle + h(X_k + \alpha V_k) - h(X_k)\\
&\le \alpha\big\langle \bar g_{\mu,\xi}(x), V_k\big\rangle + \frac{\alpha^2}{2}\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[V_k], V_k\big\rangle + \alpha\big(h(X_k + V_k) - h(X_k)\big)\\
&= \alpha\big(\tilde\phi_k(V_k) - \tilde\phi_k(0)\big) + \frac{\alpha(\alpha - 1)}{2}\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[V_k], V_k\big\rangle\\
&\le \frac{\alpha(\alpha - 2)}{2}\big\langle(\bar H_{\mu,\xi}(x) + \sigma_k I)[V_k], V_k\big\rangle.
\end{aligned}$$
This completes the proof. □
Regarding the boundedness of σ k : An important part of the convergence analysis of regularized Newton-type methods is to prove the boundedness of the sequence { σ k } . We first present a preparatory lemma. According to the procedure of Algorithm 1, if ρ k < η 1 , then σ k is scaled up by a factor of γ 2 . We aim to prove that, when σ k is sufficiently large, ρ k η 1 always holds. Thus, the inner loop (Steps 3–15) of Algorithm 1 will terminate in finite steps. Since the manifold M is compact, we can define
ϱ : = sup X ∈ M ‖ ∇ f ( X ) ‖ .
Recall that parameters M 1 and M 2 are given by (6) and (7), respectively. Our goal is to provide a sufficient condition that ensures E [ ρ k ] ≥ η 1 .
Lemma 4. 
(Sufficient condition for E [ ρ k ] ≥ η 1 ) Under the assumptions that f is L-smooth (Assumption (4)) and h is L h -Lipschitz, define constants
c 1 : = ϱ M 2 + 1 2 L M 1 2 , c 2 : = c 1 + L h M 2 , ϱ : = sup x ∈ M ‖ ∇ f ( x ) ‖ .
Let V k ∈ T x k M be the solution of the subproblem min V ϕ ˜ k ( V ) , take step size α k ∈ ( 0 , 1 ] , denote η k = α k V k , and define
ρ k : = F ( x k ) − F ( R x k ( η k ) ) − ϕ ˜ k ( η k ) + ϕ ˜ k ( 0 ) .
If there exists κ 1 > 0 such that (corresponding to parameter (25))
− ϕ ˜ k ( α V ) + ϕ ˜ k ( 0 ) ≥ α ( 2 − α ) 2 κ 1 ‖ V ‖ 2 , ∀ α ∈ [ 0 , 1 ] ,
and the regularization parameter σ k and trust region radius ‖ V k ‖ ≤ Δ satisfy
σ k ≥ σ ¯ : = κ 1 + 2 c 2 2 − η 1 + 2 2 − η 1 · ε g Δ + ε H ,
then we have the conclusion
E [ ρ k ] ≥ η 1 .
Proof. 
By the L g -smoothness of f (descent lemma),
f ( R x ( η ) ) ≤ f ( x ) + ⟨ ∇ f ( x ) , η ⟩ + L g 2 ‖ η ‖ 2 .
The second-order retraction satisfies the following error bound in the ambient space:
‖ R x ( η ) − ( x + η ) ‖ F ≤ M 2 ‖ η ‖ 2 .
Thus, since h is L h -Lipschitz,
h ( R x ( η ) ) ≤ h ( x + η ) + L h M 2 ‖ η ‖ 2 .
Define updated constants
c 1 : = L g 2 , c 2 : = c 1 + L h M 2 = L g 2 + L h M 2 .
From the operator norm error of the Hessian estimate,
ε H : = E ‖ H ¯ μ , ξ ( x ) − H f ( x ) ‖ op ≤ C ˜ ( d + 16 ) 6 b 3 / 2 L g 1.5 + 1 27 μ 3 L H 3 ( d + 6 ) 7.5 .
From Lemma 1, using Jensen’s inequality to extract the square root
ε g : = μ 2 L g 2 ( d + δ ) 3 + 8 ( d + 4 ) n σ 2 + 8 ( d + 4 ) n ‖ ∇ f ( x k ) ‖ 2 , E ‖ g ¯ μ , ξ ( x k ) − ∇ f ( x k ) ‖ ≤ ε g .
There exists κ 1 > 0 such that
⟨ ( H ¯ μ , ξ ( x ) + κ 1 I ) [ V ] , V ⟩ ≥ 0 .
Thus, for any α ∈ [ 0 , 1 ] , there is a “strongly convex” lower bound for the model
− ϕ ˜ k ( α V ) + ϕ ˜ k ( 0 ) ≥ α ( 2 − α ) 2 κ 1 ‖ V ‖ 2 .
From (34)–(36), for any V ∈ T x M , α > 0 , we have
F ( R x ( α V ) ) ≤ f ( x ) + ⟨ ∇ f ( x ) , α V ⟩ + c 1 ‖ α V ‖ 2 + h ( x + α V ) + L h M 2 ‖ α V ‖ 2 = F ( x ) + ⟨ ∇ f ( x ) , α V ⟩ + c 2 ‖ α V ‖ 2 + h ( x + α V ) − h ( x ) .
Add and subtract ϕ ˜ k ( α V ) on the right-hand side, then add and subtract ∇ f and g ¯ , and use ⟨ ( H f ( x k ) + κ 1 I ) [ U ] , U ⟩ ≥ 0 to obtain
F ( R x k ( α V ) ) ≤ F ( x k ) + ϕ ˜ k ( α V ) + ( c 2 − 1 2 ( κ 1 + σ k ) ) ‖ α V ‖ 2 + ⟨ ∇ f ( x k ) − g ¯ μ , ξ ( x k ) , α V ⟩ − 1 2 ⟨ ( H ¯ μ , ξ ( x k ) − H f ( x k ) ) [ α V ] , α V ⟩ .
Rewriting this bound with the estimation error terms highlighted, we have
F ( R x k ( α V ) ) ≤ F ( x k ) + ϕ ˜ k ( α V ) + ( c 2 − 1 2 ( κ 1 + σ k ) ) ‖ α V ‖ 2 + ⟨ ∇ f ( x k ) − g ¯ k , α V ⟩ (Gradient Estimation Error) − 1 2 ⟨ ( H ¯ k − H f ( x k ) ) [ α V ] , α V ⟩ (Hessian Estimation Error) .
Taking the expectation over ( μ , ξ ) and applying the Cauchy–Schwarz inequality together with (38) and (39), we obtain:
E [ F ( R x k ( α V ) ) ] ≤ F ( x k ) + E [ ϕ ˜ k ( α V ) ] + ( c 2 − 1 2 ( κ 1 + σ k ) + 1 2 ε H ) ‖ α V ‖ 2 + α ε g ‖ V ‖ .
Let V k be the solution of the subproblem min V ϕ ˜ k ( V ) , and define
ρ k : = F ( x k ) − F ( R x k ( α k V k ) ) − ϕ ˜ k ( α k V k ) + ϕ ˜ k ( 0 ) .
From (32) and (45), we have
1 − E [ ρ k ] = E [ F ( R x k ( α k V k ) ) − F ( x k ) − ϕ ˜ k ( α k V k ) + ϕ ˜ k ( 0 ) ] / ( − ϕ ˜ k ( α k V k ) + ϕ ˜ k ( 0 ) ) ≤ [ ( c 2 − 1 2 ( κ 1 + σ k ) + 1 2 ε H ) ‖ α k V k ‖ 2 + α k ε g ‖ V k ‖ ] / [ α k ( 2 − α k ) 2 κ 1 ‖ V k ‖ 2 ] = ( 2 / ( κ 1 ( 2 − α k ) ) ) [ ( c 2 − 1 2 ( κ 1 + σ k ) + 1 2 ε H ) α k + ε g / ‖ V k ‖ ] .
To ensure E [ ρ k ] ≥ η 1 , it is sufficient to make the last term ≤ 1 − η 1 and solve for σ k to obtain the equivalent condition
σ k ≥ − κ 1 ( 2 − α k ) ( 1 − η 1 ) + 2 ε g / ‖ V k ‖ + 2 c 2 + ε H − κ 1 .
Additionally, 0 ≤ α k ≤ 1 ; thus, 1 ≤ 2 − α k ≤ 2 . Introduce the trust region radius ‖ V k ‖ ≤ Δ ; thus, it is sufficient to have
σ k ≥ 2 c 2 + ε H + 2 ε g / Δ − κ 1 ( 2 − η 1 )
to ensure E [ ρ k ] ≥ η 1 . □
Lemma 5 
(Boundedness of the Regularization Parameter σ k ). Assume that the solution to the subproblem satisfies a uniform radius upper bound ‖ V k ‖ ≤ Δ (attainable via the trust region radius or proximal regularization term). Let:
σ est : = 2 c 2 + ε H + 2 ε g / Δ − κ 1 ( 2 − η 1 )
Then, for all k 0 , the following holds:
σ k ≤ max { σ 0 , γ 2 σ est }
Proof. 
From Equation (49), we know that, if σ k ≥ σ est , then E [ ρ k ] ≥ η 1 always holds. We prove the conclusion by mathematical induction:
Base Case ( k = 0 ): It is trivially true that σ 0 ≤ max { σ 0 , γ 2 σ est } .
Inductive Hypothesis: Assume that the inequality holds for k = j , i.e., σ j ≤ max { σ 0 , γ 2 σ est } .
Inductive Step ( k = j + 1 ): Consider two cases: (1) If σ j < σ est : the acceptance condition in Step 10 of Algorithm 1 may fail; in that case, according to Step 15 of the algorithm, σ j is scaled up by a factor of γ 2 > 1 , so σ j + 1 = γ 2 σ j < γ 2 σ est . (2) If σ j ≥ σ est : by Lemma 4, E [ ρ j ] ≥ η 1 . According to Steps 10–13 of the algorithm, σ j + 1 is either unchanged or scaled down by γ 1 ∈ ( 0 , 1 ) (i.e., σ j + 1 = γ 1 σ j ). Thus, σ j + 1 ≤ σ j ≤ max { σ 0 , γ 2 σ est } (by the inductive hypothesis).
By combining both cases, we have σ j + 1 ≤ max { σ 0 , γ 2 σ est } . By the principle of mathematical induction, the inequality holds for all k ≥ 0 . □
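The accept/inflate/deflate logic that Lemma 5 bounds can be sketched in a few lines (a minimal Python sketch; the thresholds and factors mirror Steps 10–15 of Algorithm 1, while the surrogate ratio rule standing in for E [ ρ k ] is purely illustrative):

```python
def update_sigma(sigma, rho, eta1=0.25, eta2=0.75, gamma1=0.8, gamma2=2.0):
    """One regularization update mirroring Steps 10-15 of Algorithm 1:
    inflate sigma when the ratio test fails, deflate when the model fits
    very well, and keep it unchanged otherwise."""
    if rho < eta1:
        return gamma2 * sigma      # reject: scale up by gamma2 > 1
    if rho > eta2:
        return gamma1 * sigma      # very good fit: scale down by gamma1 < 1
    return sigma

# Induction of Lemma 5: once sigma >= sigma_est the ratio test passes,
# so sigma never exceeds gamma2 * sigma_est (starting from sigma0 below it).
sigma_est = 10.0
sigma, history = 1.0, []
for _ in range(200):
    rho = 0.9 if sigma >= sigma_est else 0.1   # toy surrogate for E[rho_k]
    sigma = update_sigma(sigma, rho)
    history.append(sigma)

assert max(history) <= 2.0 * sigma_est         # the gamma2 * sigma_est cap
```

The toy run illustrates the inductive argument: σ can overshoot σ est by at most one inflation factor γ 2 , and is pulled back afterwards.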
Theorem 1 
(Convergence to Stationary Points). Define:
α ¯ est : = min { 1 , ( 2 − σ ) κ 1 / ( 2 c 2 + 1 2 ε H + ε g / Δ ) }
When α k ≥ γ α ¯ est for all k ≥ 0 (where γ ∈ ( 0 , 1 ) is the backtracking scaling factor in Step 7 of the algorithm), the backtracking line search (Steps 6–8 of the algorithm) terminates in finite steps. Furthermore, we have:
lim k → ∞ E ‖ V k ‖ = 0
and any accumulation point X * of the iterate sequence { X k } is a first-order stationary point of the problem F ( X ) = f ( X ) + h ( X ) (i.e., 0 ∈ ∂ F ( X * ) ).
Proof. 
Substitute Equation (54) into Equation (47). When α ≤ α ¯ est , the following inequality holds:
( c 2 − 1 2 ( κ 1 + σ k ) + 1 2 ε H ) α + ε g / Δ ≤ 1 − η 1 2 κ 1 ( 2 − α )
This implies that E [ ρ k ] ≥ η 1 , so the step size α is accepted. For the backtracking line search, we start with α = 1 and scale it down by γ iteratively. Since α will eventually be reduced to α ¯ est (or σ k will be increased to σ ¯ est ), the line search terminates in finite steps, and α k ≥ γ α ¯ est .
From Equation (32) and the lower bound of α k , we derive:
E [ F ( x k ) − F ( x k + 1 ) ] ≥ η 1 κ 1 2 α min ( 2 − α min ) E ‖ V k ‖ 2
where α min = γ α ¯ est > 0 . Summing both sides over k from 0 to ∞ , the left-hand side telescopes to E [ F ( x 0 ) ] − lim k → ∞ E [ F ( x k ) ] . Since { E [ F ( x k ) ] } is non-increasing and bounded below (as F is bounded from below), the sum converges:
∑ k = 0 ∞ α k E ‖ V k ‖ 2 < ∞
Because α k ≥ α min > 0 , the terms of the series ∑ k = 0 ∞ E ‖ V k ‖ 2 must tend to zero. Thus:
lim k → ∞ E ‖ V k ‖ = 0
To prove that accumulation points are stationary, let X * be an accumulation point of { X k } , and let { X k i } be a subsequence converging to X * . From lim i → ∞ E ‖ V k i ‖ = 0 , we have lim i → ∞ V k i = 0 (by the continuity of the norm). Substitute V k i → 0 into the first-order optimality condition of the subproblem:
0 ∈ g ¯ μ , ξ ( X k i ) + ( H ¯ μ , ξ ( X k i ) + σ k i I ) V k i + Proj T X k i M ∂ h ( X k i + V k i )
By taking the limit as i → ∞ and using the continuity of ∇ f , the vanishing estimation errors, and the closedness of ∂ h , we obtain the first-order optimality condition of the original problem:
0 ∈ ∇ f ( X * ) + Proj T X * M ( ∂ h ( X * ) )
This confirms that X * is a stationary point of Problem (1). □

5.1. Complexity Analysis

When analyzing the overall complexity of stochastic zeroth-order Riemannian optimization algorithms, we must consider both the computational cost of the outer iterations (Steps 1–16 of the algorithm) and the inner iterations (Steps 3–15 of the algorithm). The former characterizes the number of main iterations required to converge to an ε -stationary point, while the latter reflects the number of calls to the approximate subproblem solver on the manifold (ASSN) within each outer iteration. Due to the non-monotone line search and trust region/regularization update mechanisms, there is a tight coupling between the complexities of the outer and inner iterations.
In this proof, the acceptance criterion is defined as:
ρ k ≥ η 1
Definition 4 
( ε -Stationary Point). If the optimal solution V k of the subproblem in a certain outer iteration satisfies ‖ V k ‖ ≤ ε , then X k is called an ε-stationary point.
Theorem 2 
(Outer and Inner Iteration Complexity). Let F * be the optimal value of Problem (1), and σ ( 0 , 1 ) be the constant in the sufficient decrease criterion. Define:
Θ : = ⌈ 2 ( F ( x 0 ) − F * ) / ( η 1 κ 1 γ α ¯ ( 2 − γ α ¯ ) ε 2 ) ⌉
The algorithm finds an ε-stationary point within, at most, Θ outer iterations. Furthermore, the total number of inner ASSN calls satisfies:
∑ i = 0 Θ − 1 r ( i ) ≤ Θ ( 2 − log γ 2 γ 1 ) + log γ 2 ( max { σ 0 , γ 2 σ ¯ } / σ 0 )
Proof. 
Step 1: Upper Bound on Outer Iterations. From Equation (55) and α k ≥ α min : = γ α ¯ , we have:
E [ F ( X k ) − F ( X k + 1 ) ] ≥ η 1 κ 1 2 α min ( 2 − α min ) E ‖ V k ‖ 2
Suppose that no iterate among k = 0 , … , K − 1 is ε-stationary, i.e., ‖ V k ‖ ≥ ε for these k. Summing both sides over k from 0 to K − 1 , the left-hand side telescopes to E [ F ( X 0 ) − F ( X K ) ] . Since F ( X K ) ≥ F * , we get:
E [ F ( X 0 ) ] − F * ≥ K · η 1 κ 1 2 α min ( 2 − α min ) ε 2
Rearranging for K gives:
K ≤ 2 ( E [ F ( X 0 ) ] − F * ) / ( η 1 κ 1 α min ( 2 − α min ) ε 2 ) = Θ
Taking the ceiling function confirms that an ε -stationary point is found within Θ outer iterations.
Step 2: Upper Bound on Inner Iterations. By the boundedness of σ k (Lemma 5):
σ k ≤ max { σ 0 , γ 2 σ ¯ }
Suppose the inner loop calls ASSN r ( i ) times in the i-th outer iteration. Each rejected step scales σ up by γ 2 , while the final accepted step may scale it down by γ 1 < 1 ; hence σ i + 1 ≥ γ 1 γ 2 r ( i ) − 1 σ i . This yields:
r ( i ) ≤ 1 + log γ 2 ( σ i + 1 σ i · γ 2 γ 1 )
Summing both sides from i = 0 to i = K − 1 (where K = Θ ) results in the following:
∑ i = 0 K − 1 r ( i ) ≤ K + ∑ i = 0 K − 1 log γ 2 ( σ i + 1 / σ i ) + K log γ 2 ( γ 2 / γ 1 )
The sum of logarithms simplifies to the logarithm of the product (telescoping sum):
∑ i = 0 K − 1 log γ 2 ( σ i + 1 / σ i ) = log γ 2 ( σ K / σ 0 )
By substituting Equation (63) (i.e., σ K ≤ max { σ 0 , γ 2 σ ¯ } ) and simplifying the logarithmic term, we obtain the following:
∑ i = 0 K − 1 r ( i ) ≤ K ( 2 − log γ 2 γ 1 ) + log γ 2 ( max { σ 0 , γ 2 σ ¯ } / σ 0 )
This completes the proof. □
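The two bounds of Theorem 2 are easy to evaluate numerically; the sketch below plugs illustrative placeholder constants (not values from the experiments) into Θ and the inner-call bound:

```python
import math

def outer_iteration_bound(F0, Fstar, eta1, kappa1, gamma, alpha_bar, eps):
    """Theta = ceil( 2 (F0 - F*) / (eta1 kappa1 a (2 - a) eps^2) ), a = gamma * alpha_bar."""
    a = gamma * alpha_bar
    return math.ceil(2.0 * (F0 - Fstar)
                     / (eta1 * kappa1 * a * (2.0 - a) * eps ** 2))

def inner_call_bound(Theta, gamma1, gamma2, sigma0, sigma_bar):
    """Total ASSN calls <= Theta (2 - log_{gamma2} gamma1)
       + log_{gamma2}( max{sigma0, gamma2 sigma_bar} / sigma0 )."""
    return (Theta * (2.0 - math.log(gamma1, gamma2))
            + math.log(max(sigma0, gamma2 * sigma_bar) / sigma0, gamma2))

Theta = outer_iteration_bound(F0=10.0, Fstar=0.0, eta1=0.25, kappa1=1.0,
                              gamma=0.5, alpha_bar=1.0, eps=0.1)
calls = inner_call_bound(Theta, gamma1=0.8, gamma2=2.0, sigma0=5.0, sigma_bar=10.0)
assert Theta == 10667     # ceil(20 / 0.001875)
assert calls > Theta      # inner ASSN calls dominate the outer count
```

The ε − 2 dependence of Θ is visible directly: halving ε quadruples the worst-case outer iteration count, while the inner-call overhead per outer iteration stays bounded by a constant depending only on γ 1 and γ 2 .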

5.2. Saddle-Point Escape Theorem

In non-convex optimization problems, the presence of saddle points severely degrades the convergence efficiency of algorithms. Unlike local minima, saddle points have zero gradients while their Hessian matrices have negative eigenvalues. This causes first-order methods (e.g., stochastic gradient descent) to easily stagnate near saddle points, leading to prolonged slow convergence or even halting. This phenomenon is more prominent in high-dimensional manifold optimization, where the number of saddle points far exceeds that of local minima.
In recent years, second-order methods (e.g., Hessian-based algorithms) have been proven to escape saddle points more effectively. However, in the zeroth-order setting, we cannot directly access accurate gradient or Hessian information, and can only rely on function value approximations. This makes the identification and escape of saddle points more challenging. Traditional zeroth-order methods often require excessive sampling and computation, resulting in a high complexity that hinders practical application. The proposed method in this paper achieves saddle-point escape using only function value queries, without the need for explicit first-order or second-order information.
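A randomized finite-difference estimator of the kind such function-value-only methods rely on can be sketched as follows (an illustrative construction on the Stiefel manifold; the sampling count m, smoothing radius μ, and the tangent-space projection formula are standard choices, not the paper's exact estimator):

```python
import numpy as np

def zo_riemannian_grad(f, X, mu=1e-2, m=20, rng=None):
    """Randomized finite-difference estimate of the Riemannian gradient of f
    on St(n, p), built only from function evaluations: average one-sided
    differences along random tangent directions."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    g = np.zeros_like(X)
    f0 = f(X)
    for _ in range(m):
        U = rng.standard_normal((n, p))
        # project the Gaussian perturbation onto the tangent space T_X St(n, p)
        XtU = X.T @ U
        U_tan = U - X @ (XtU + XtU.T) / 2.0
        g += (f(X + mu * U_tan) - f0) / mu * U_tan
    return g / m

# sanity check on a smooth quadratic: the estimate should align with the
# true Euclidean gradient projected onto the tangent space
n, p = 8, 2
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)); H = A @ A.T
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
f = lambda Y: 0.5 * np.trace(Y.T @ H @ Y)
g_hat = zo_riemannian_grad(f, X, mu=1e-4, m=400, rng=1)
G = H @ X
XtG = X.T @ G
g_true = G - X @ (XtG + XtG.T) / 2.0
cos = np.sum(g_hat * g_true) / (np.linalg.norm(g_hat) * np.linalg.norm(g_true))
assert cos > 0.5   # strong alignment with the exact Riemannian gradient
```

The estimator's variance scales with the tangent-space dimension divided by m, which is exactly why the sampling counts m and b enter the error constants δ g , δ H , and δ μ below.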
Lemma 6. 
Let g k = ∇ f ( x k ) , let g ¯ μ ( x k ) be the zeroth-order gradient estimate, and let H ¯ μ ( x k ) = 1 b ∑ i = 1 b H μ , i ( x k ) be the mini-batch average of Hessian estimates. The following control inequality holds:
E ‖ g ¯ μ ( x k ) + H ¯ μ ( x k ) [ η k ] ‖ ≤ 1 2 E ‖ η k ‖ 2 + δ g + δ μ
where δ g and δ μ correspond to the zeroth-order gradient estimation error and Hessian mini-batch estimation error, respectively.
Proof. 
Expand using the triangle inequality:
‖ g ¯ μ ( x k ) + H ¯ μ ( x k ) [ η k ] ‖ ≤ ‖ g k − g ¯ μ ( x k ) ‖ + ‖ H ¯ μ ( x k ) [ η k ] ‖ + ‖ g k ‖
Step 1: Bound on ‖ H ¯ μ ( x k ) [ η k ] ‖ . Apply the Frobenius norm and the Cauchy–Schwarz inequality:
‖ H ¯ μ ( x k ) [ η k ] ‖ ≤ ‖ H ¯ μ ( x k ) ‖ F ‖ η k ‖
Step 2: Apply Young’s Inequality. For the product term, Young’s inequality ( a b ≤ a 2 / 2 + b 2 / 2 ) gives:
‖ H ¯ μ ( x k ) ‖ F ‖ η k ‖ ≤ 1 2 ‖ η k ‖ 2 + 1 2 ‖ H ¯ μ ( x k ) ‖ F 2
Step 3: Bound on E ‖ H ¯ μ ( x k ) ‖ F 2 . Note that H ¯ μ ( x k ) is the average of b independent and identically distributed (i.i.d.) terms H μ , i ( x k ) . By Jensen’s inequality and Cauchy–Schwarz, we have:
E ‖ H ¯ μ ( x k ) ‖ F 2 ≤ 1 b ∑ i = 1 b E ‖ H μ , i ( x k ) ‖ F 2 ≤ 1 b ( E ‖ H μ ( x k ) ‖ F 4 ) 1 / 2
Using the fourth-moment bound from Equation (20) ( E ‖ H μ ( x k ) ‖ F 4 ≤ ( d + 16 ) 8 L g 2 / 8 ), we define:
1 2 E ‖ H ¯ μ ( x k ) ‖ F 2 ≤ ( d + 16 ) 4 4 2 L g : = δ μ
Step 4: Combine Results. Substitute Equation (67) and the gradient estimation error bound (from Lemma 1, E ‖ g k − g ¯ μ ( x k ) ‖ ≤ δ g ) into the triangle inequality. Taking the expectation and simplifying gives:
E ‖ g ¯ μ ( x k ) + H ¯ μ ( x k ) [ η k ] ‖ ≤ 1 2 E ‖ η k ‖ 2 + δ g + δ μ
This completes the proof. □
Lemma 7. 
Let x k + 1 = R x k ( η k ) and g k = ∇ f ( x k ) . The following holds:
E ‖ g k + 1 ‖ ≤ ( L H 2 + 1 ) E ‖ η k ‖ 2 + δ g + δ H + δ μ
where δ g , δ H , and δ μ correspond to the zeroth-order gradient estimation error, zeroth-order Hessian estimation error, and mini-batch averaging error, respectively.
Proof. 
Since parallel transport P η k is isometric, we have:
‖ g k + 1 ‖ = ‖ P η k − 1 g k + 1 ‖
Add and subtract g k + H f ( x k ) [ η k ] to decompose the term into four components:
P η k − 1 g k + 1 = ( P η k − 1 g k + 1 − g k − H f ( x k ) [ η k ] ) (Taylor Remainder) + ( g k − g ¯ μ , ξ ( x k ) ) (Gradient Estimation Error) + ( H f ( x k ) [ η k ] − H ¯ μ , ξ ( x k ) [ η k ] ) (Hessian Estimation Error) + ( g ¯ μ , ξ ( x k ) + H ¯ μ , ξ ( x k ) [ η k ] ) (Subproblem Main Term)
Taking the norm and applying the triangle inequality yields the following:
‖ g k + 1 ‖ ≤ ‖ P η k − 1 g k + 1 − g k − H f ( x k ) [ η k ] ‖ + ‖ g k − g ¯ μ , ξ ‖ + ‖ H f ( x k ) [ η k ] − H ¯ μ , ξ [ η k ] ‖ + ‖ g ¯ μ , ξ + H ¯ μ , ξ [ η k ] ‖
(i) Bound on the Taylor Remainder. By the equivalent condition of Assumption 2 (Lipschitz Hessian, Equation (17)),
‖ P η k − 1 g k + 1 − g k − H f ( x k ) [ η k ] ‖ ≤ L H 2 ‖ η k ‖ 2
(ii) Bound on Gradient Estimation Error. From Lemma 1, the expectation of the gradient estimation error satisfies E ‖ g k − g ¯ μ , ξ ( x k ) ‖ ≤ δ g .
(iii) Bound on Hessian Estimation Error. Using the operator norm and Young’s inequality:
‖ H f ( x k ) [ η k ] − H ¯ μ , ξ [ η k ] ‖ ≤ ‖ H f ( x k ) − H ¯ μ , ξ ‖ op ‖ η k ‖ ≤ 1 2 ‖ η k ‖ 2 + 1 2 ‖ H f ( x k ) − H ¯ μ , ξ ‖ op 2
Taking the expectation and using Lemma 2 ( E ‖ H f ( x k ) − H ¯ μ , ξ ‖ op 2 ≤ δ H ), we get:
E ‖ H f ( x k ) [ η k ] − H ¯ μ , ξ [ η k ] ‖ ≤ 1 2 E ‖ η k ‖ 2 + δ H
(iv) Bound on the Subproblem Main Term. By Lemma 6, we have E ‖ g ¯ μ , ξ + H ¯ μ , ξ [ η k ] ‖ ≤ 1 2 E ‖ η k ‖ 2 + δ g + δ μ .
(v) Combine All Bounds. Substitute (i)–(iv) into the triangle inequality and simplify using algebraic manipulations:
E ‖ g k + 1 ‖ ≤ ( L H 2 + 1 ) E ‖ η k ‖ 2 + δ g + δ H + δ μ
This completes the proof. □
Lemma 8 
(Relationship Between Step Size and Minimum Eigenvalue). Let x k + 1 = R x k ( η k ) , and P η k : T x k M → T x k + 1 M be the isometric parallel transport along η k . Let σ k ≥ 0 be such that H ¯ μ , ξ + σ k I ⪰ 0 . Then, the following holds:
E ‖ η k ‖ ≥ − ( δ H + σ k + E [ λ min ( H f ( x k + 1 ) ) ] ) / L H
Proof. 
Step 1: Invariance of Minimum Eigenvalue Under Parallel Transport. Since parallel transport is isometric, the minimum eigenvalue of the Hessian is invariant:
λ min ( H f ( x k + 1 ) ) = λ min ( P η k 1 H f ( x k + 1 ) P η k )
Let Δ k : = P η k − 1 H f ( x k + 1 ) P η k − H f ( x k ) . Then:
λ min ( H f ( x k + 1 ) ) = λ min ( H f ( x k ) + Δ k )
Step 2: Apply Weyl’s Inequality. Weyl’s inequality states that, for symmetric matrices A and B, λ min ( A + B ) ≥ λ min ( A ) + λ min ( B ) . For any symmetric matrix A, the minimum eigenvalue satisfies λ min ( A ) ≥ − ‖ A ‖ op (by the Rayleigh quotient characterization). Thus:
λ min ( Δ k ) ≥ − ‖ Δ k ‖ op
By Assumption 2 (Equation (16)), ‖ Δ k ‖ op ≤ L H ‖ η k ‖ . Substituting into Weyl’s inequality yields the following:
λ min ( H f ( x k + 1 ) ) ≥ λ min ( H f ( x k ) ) − L H ‖ η k ‖
Step 3: Decompose the Hessian and Introduce Regularization. Rewrite the true Hessian as:
H f ( x k ) = ( H f ( x k ) − H ¯ μ , ξ ( x k ) ) + ( H ¯ μ , ξ ( x k ) + σ k I ) − σ k I
Substitute into Equation (71) and apply Weyl’s inequality again. Since H ¯ μ , ξ + σ k I ⪰ 0 , its minimum eigenvalue is non-negative, so:
λ min ( H f ( x k + 1 ) ) ≥ λ min ( H f ( x k ) − H ¯ μ , ξ ( x k ) ) − σ k − L H ‖ η k ‖
Step 4: Take Expectation and Rearrange. For the symmetric matrix H f ( x k ) − H ¯ μ , ξ ( x k ) , we have λ min ( · ) ≥ − ‖ · ‖ op . Taking the expectation and using Lemma 2 ( E ‖ H f ( x k ) − H ¯ μ , ξ ( x k ) ‖ op ≤ δ H ) yields the following:
E [ λ min ( H f ( x k ) − H ¯ μ , ξ ( x k ) ) ] ≥ − δ H
Taking the expectation of Equation (72) and rearranging terms gives:
E [ λ min ( H f ( x k + 1 ) ) ] ≥ − δ H − σ k − L H E ‖ η k ‖
Rearranging for E ‖ η k ‖ completes the proof. □
Theorem 3. 
Let M be a manifold, and let f : M R satisfy Assumptions 1–3. Define:
k min : = argmin k E u k , Ξ k ‖ η k ‖
If the step size in the update of Algorithm 1 satisfies α L H , then:
E ‖ g k min + 1 ‖ ≤ O ( ε ) , E [ λ min ( H f k min + 1 ) ] ≥ − O ( ε )
where λ min ( · ) denotes the minimum eigenvalue. The parameters must satisfy:
N = O 1 ε 3 / 2 , μ = O min ε d 3 / 2 , ε d 3 , m = O d ε 2 , b = O d 4 ε
Thus, the complexity of the zeroth-order oracle is:
O ( N m + N b ) = O ( d ε − 7 / 2 + d 4 ε − 5 / 2 )
Proof. 
From Lemma 7, we have:
E ‖ g k + 1 ‖ ≤ ( L H 2 + 1 ) E ‖ η k ‖ 2 + δ g + δ H + δ μ
where δ g , δ H , and δ μ correspond to the gradient and Hessian estimation errors. On the other hand, from Lemma 8 (relationship between step size and minimum eigenvalue),
E ‖ η k ‖ ≥ − ( δ H + α k + E [ λ min ( H f k + 1 ) ] ) / L H
Next, we analyze the upper bound of E η k .
Step 1: Third-Order Taylor Upper Bound. By the equivalent form of Assumption 2 (Equation (18)),
f ( R x k ( η k ) ) ≤ f ( x k ) + ⟨ g k , η k ⟩ + 1 2 ⟨ η k , H f ( x k ) η k ⟩ + L H 6 ‖ η k ‖ 3
Step 2: Decompose Gradient and Hessian into Estimation + Error. Write the true gradient and Hessian as:
g k = g ¯ μ , ξ + ( g k − g ¯ μ , ξ ) , H f ( x k ) = H ¯ μ , ξ + ( H f ( x k ) − H ¯ μ , ξ )
Substitute into Equation (77) to decompose the function value into the main term, first-order error, second-order error, and third-order term:
f k + 1 ≤ f k + ⟨ g ¯ μ , ξ , η k ⟩ + 1 2 ⟨ η k , H ¯ μ , ξ η k ⟩ (Main Term) + ⟨ g k − g ¯ μ , ξ , η k ⟩ (First-Order Error) + 1 2 ⟨ η k , ( H f ( x k ) − H ¯ μ , ξ ) η k ⟩ (Second-Order Error) + L H 6 ‖ η k ‖ 3
Step 3: First-Order Optimality Condition of the Subproblem. From the first-order optimality condition of Subproblem (21),
0 ∈ g ¯ μ , ξ ( x k ) + ( H ¯ μ , ξ ( x k ) + α I ) η k + Proj T X k M ∂ h ( x k + η k )
Take the inner product with η k and rearrange. Using the positive definiteness of H ¯ μ , ξ + 2 α I and the subgradient inequality for convex h ( ⟨ ∂ h ( x k + η k ) , η k ⟩ ≥ h ( x k + η k ) − h ( x k ) ), we get:
⟨ g ¯ μ , ξ , η k ⟩ + 1 2 ⟨ η k , H ¯ μ , ξ η k ⟩ = − 1 2 ⟨ η k , ( H ¯ μ , ξ + 2 α I ) η k ⟩ − ⟨ ∂ h ( x k + η k ) , η k ⟩ ≤ h ( x k ) − h ( x k + η k )
Substitute this into Equation (78) and combine terms for F = f + h :
F k + 1 ≤ F k + ⟨ g k − g ¯ μ , ξ , η k ⟩ (First-Order Error) + 1 2 ⟨ η k , ( H f ( x k ) − H ¯ μ , ξ ) η k ⟩ (Second-Order Error) + L H 6 ‖ η k ‖ 3
Step 4: Bound First-Order and Second-Order Errors. Using the Cauchy–Schwarz inequality and Young’s inequality,
E [ ⟨ g k − g ¯ μ ( x k ) , η k ⟩ + 1 2 ⟨ η k , ( H f ( x k ) − H ¯ μ ( x k ) ) η k ⟩ ] ≤ E [ ‖ g k − g ¯ μ ( x k ) ‖ ‖ η k ‖ + 1 2 ‖ H f ( x k ) − H ¯ μ ( x k ) ‖ op ‖ η k ‖ 2 ] ≤ 32 3 β E ‖ g k − g ¯ μ ( x k ) ‖ 3 / 2 + 12 β E ‖ H f ( x k ) − H ¯ μ ( x k ) ‖ op 3 + β 24 E ‖ η k ‖ 3
Substitute into Equation (82), take β > L H , and use the upper bounds on the gradient and Hessian errors. Summing over k = 0 to N − 1 and rearranging gives:
1 N ∑ k = 0 N − 1 E ‖ η k ‖ 3 ≤ 24 L H ( F 0 − F * ) N + 32 3 L H δ g 3 / 4 + 12 L H δ ˜ H
where δ ˜ H = C ˜ ( d + 16 ) 6 b 3 / 2 L g 1.5 + 1 27 μ 3 L H 3 ( d + 6 ) 7.5 . By combining with the parameter choices in Equation (74), we obtain:
E ‖ η k min ‖ 3 ≤ O ( ε 3 / 2 ) , E ‖ η k min ‖ 2 ≤ O ( ε )
Step 5: Final Bounds. Substitute Equation (84) into Lemmas 7 and 8. Simplifying using the parameter choices gives:
E ‖ g k min + 1 ‖ ≤ O ( ε ) , E [ λ min ( H f k min + 1 ) ] ≥ − O ( ε )
The complexity of the zeroth-order oracle is calculated by substituting N, m, and b from Equation (74) into O ( N m + N b ) , leading to:
O ( d ε − 7 / 2 + d 4 ε − 5 / 2 )
This completes the proof. □

6. Application of ZO-ARPQN

6.1. Sparse Principal Component Analysis (SPCA) on the Stiefel Manifold

A traditional principal component analysis (PCA) extracts low-dimensional representations of data by maximizing the projection variance, but its loading vectors are usually dense, making it difficult to interpret the contribution of individual variables. A sparse principal component analysis (SPCA) introduces a sparsity penalty while maximizing the variance, thereby balancing variable selection and interpretability. Typical applications include gene expression analyses, spectral component separation, financial factor construction, and feature compression in high-dimensional regression. Compared with a PCA, the additional benefits of an SPCA are reflected in: a stronger interpretability (fewer non-zero variables can describe components), a better robustness (reducing the impact of noisy variables), and convenience for downstream modeling (sparse factors are more conducive to regression/classification tasks). This paper considers the SPCA problem on the Stiefel manifold S t ( n , p ) = { X R n × p X X = I p } , formulated as:
min X ∈ S t ( n , p ) F ( X ) : = f ( X ) + λ ‖ X ‖ 1 = − 1 2 Tr ( X ⊤ H X ) + λ ‖ X ‖ 1 ,
where H ∈ R n × n is a symmetric positive semi-definite matrix, and ‖ · ‖ 1 denotes the element-wise ℓ 1 -norm to induce sparsity.
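Evaluating this objective is a single function query, which is all a zeroth-order method needs. A minimal sketch (illustrative only; note that the smooth term enters with a minus sign, since SPCA maximizes the captured variance):

```python
import numpy as np

def spca_objective(X, H, lam):
    """F(X) = -1/2 tr(X^T H X) + lam * ||X||_1: the trace term rewards
    captured variance, the elementwise l1 term induces sparsity."""
    return -0.5 * np.trace(X.T @ H @ X) + lam * np.abs(X).sum()

def is_on_stiefel(X, tol=1e-10):
    """Feasibility check: X^T X = I_p up to numerical tolerance."""
    p = X.shape[1]
    return np.linalg.norm(X.T @ X - np.eye(p)) < tol

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)); H = A @ A.T      # any symmetric PSD matrix
X, _ = np.linalg.qr(rng.standard_normal((6, 3)))  # a random Stiefel point
assert is_on_stiefel(X)
val = spca_objective(X, H, lam=0.1)
assert np.isfinite(val)
```

A black-box solver only ever calls `spca_objective`; the manifold constraint is enforced by the retraction, not by the objective.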

6.1.1. Experimental Parameter Settings

The following parameters were adopted in this experiment:
  • Data Generation: Set n = 100 (number of variables), p = 10 (number of principal components), and K = 5 (rank of the true subspace). First, perform QR orthogonalization on a Gaussian matrix A R n × K , construct Λ = diag ( | N ( 0 , 1 ) | ) (diagonal matrix of absolute values of standard normal samples), and then form H = A Λ A and normalize it by the spectral norm to ensure numerical stability. The sparsity coefficient is set to λ = 0.1 (determined via 5-fold cross-validation to balance sparsity and variance retention).
  • ZO-ManPG Algorithm: Refer to Algorithm 3 in Reference [24]. Set the finite difference radius to μ = 0.05 , the gradient sampling counts to m = { 5 , 15 , 30 } (to analyze the impact of the sampling density on zeroth-order estimation), and the maximum iterations to N = 150 (sufficient to ensure the convergence of all the compared algorithms).
  • ZO-ARPQN Algorithm: Adopt the same μ = 0.05 and m = { 5 , 15 , 30 } as ZO-ManPG; set the Hessian sampling count to b = 15 (compromising between the estimation accuracy and computational cost); the initial regularization parameter to σ 0 = 5.0 ; the line search shrinkage factor to γ = 0.5 ; the ratio test thresholds to η 1 = 0.25 (lower bound for an acceptable model fidelity) and η 2 = 0.75 (threshold for reducing regularization); the adaptive regularization coefficients to γ 1 = 0.8 (reduction factor) and γ 2 = 2.0 (increase factor); and the maximum iterations to N = 150 .
  • ARPQN Algorithm [16]: As the oracle baseline (with exact gradient/Hessian), use the same σ 0 , γ , η 1 , η 2 , γ 1 , γ 2 , and N as ZO-ARPQN, except that the exact Riemannian gradient/Hessian (instead of zero-order estimation) is used to verify the performance upper bound of quasi-Newton methods.
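The data-generation recipe in the first bullet can be sketched directly (the seed and helper name are ours, for reproducibility of the sketch only):

```python
import numpy as np

def make_spca_instance(n=100, K=5, seed=0):
    """Build H = A Lambda A^T, normalized by its spectral norm, following
    the data-generation recipe described in the experiment settings."""
    rng = np.random.default_rng(seed)
    A, _ = np.linalg.qr(rng.standard_normal((n, K)))  # QR orthogonalization
    Lam = np.diag(np.abs(rng.standard_normal(K)))     # |N(0,1)| eigenvalues
    H = A @ Lam @ A.T
    return H / np.linalg.norm(H, 2)                   # spectral normalization

H = make_spca_instance()
# H is symmetric, PSD, rank K, with unit spectral norm
assert np.allclose(H, H.T)
w = np.linalg.eigvalsh(H)
assert w.min() > -1e-10 and abs(w.max() - 1.0) < 1e-10
```

The spectral normalization keeps the smooth part of the objective uniformly scaled, which stabilizes the common finite-difference radius μ across instances.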

6.1.2. Result Analysis

The vertical axis of the convergence curves plots the optimality gap F ( X k ) F * (in logarithmic scale) to clearly show the convergence trend. Since Problem (85) is non-convex, the analytical optimal value is difficult to obtain; thus, this paper takes the minimum function value among all the compared algorithms as the empirical optimal value:
F * = min a ∈ { ZO-ManPG , ZO-ARPQN , ARPQN } F a ( X ) ,
where F a ( X ) denotes the final function value of algorithm a. The sparsity metric is defined as:
sparsity ( X ) = # { ( i , j ) : | X i j | < 10 3 } n · p × 100 % ,
which measures the proportion of zero elements in the solution matrix. A higher sparsity indicates that more variables are “compressed” to zero, enhancing the feature-selection ability of the model; however, excessive sparsity may lead to loss of principal component information.
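The sparsity metric above is straightforward to compute; a small sketch with the 10 − 3 threshold:

```python
import numpy as np

def sparsity(X, tol=1e-3):
    """Percentage of entries of X with |X_ij| < tol (numerically zero)."""
    return 100.0 * np.mean(np.abs(X) < tol)

X = np.array([[0.5, 0.0],
              [1e-4, -0.2]])
assert sparsity(X) == 50.0   # two of the four entries count as zero
```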
To analyze the impact of the gradient sampling count m on the stability and accuracy of zeroth-order algorithms, this experiment set m = 5 , 15 , 30 , respectively, for the SPCA problem on the Stiefel manifold, and compared three algorithms: the zeroth-order manifold proximal gradient (ZO-ManPG), the zeroth-order adaptive regularized proximal quasi-Newton (ZO-ARPQN), and ARPQN (with exact gradient/Hessian). All algorithms were run for 150 iterations under the same initial point and parameters, and the convergence curves of the optimality gap F ( X k ) − F * are plotted. As shown in the figures, all three algorithms exhibited a monotonically decreasing convergence trend, but their performance differed significantly under different sampling counts.

6.1.3. Case 1: Low Sampling Count ( m = 5 ) (Figure 1)

When the sampling count was small, the number of random perturbation directions was insufficient, leading to a large variance in the zeroth-order estimates of gradients and Hessians and to obvious fluctuations in the convergence curves. Even so, ZO-ARPQN still outperformed ZO-ManPG, which indicates that the quasi-Newton direction and adaptive regularization can effectively suppress noise. ARPQN converged rapidly within approximately 20 iterations, serving as a performance upper bound. Due to the large noise in the zeroth-order Hessian estimates of ZO-ARPQN, the ℓ 1 -regularization term dominated, leading to excessive sparsity (60.0%); in contrast, ZO-ManPG barely produced a significant sparse structure (5.4%), indicating that its gradient approximation is sensitive to noise.
Figure 1. Convergence curves of three algorithms (gradient sampling count m = 5 ).
Axioms 15 00203 g001

6.1.4. Case 2: Moderate Sampling Count ( m = 15 ) (Figure 2)

With an increase in the sampling count, the zeroth-order estimates became more stable. The convergence of ZO-ARPQN accelerated significantly and almost overlapped with that of ARPQN after approximately 40 iterations; the oscillation of ZO-ManPG weakened, but its convergence speed remained slow. As the sampling increased, the sparsity of ZO-ARPQN decreased markedly (to 2.4%), indicating that gradient estimation became more accurate and the influence of the sparsity term weakened; the sparsity of ZO-ManPG rose to 20%, showing a more stable convergence trend.
Figure 2. Convergence curves of three algorithms (gradient sampling count m = 15 ).
Axioms 15 00203 g002

6.1.5. Case 3: High Sampling Count ( m = 30 ) (Figure 3)

When the sampling count increased further, the convergence curve of ZO-ARPQN almost completely overlapped with that of ARPQN, indicating that its zeroth-order Hessian approximation is extremely accurate under high sampling. Although ZO-ManPG was still the slowest, its final accuracy improved significantly. The sparsity of ZO-ARPQN was close to that of ARPQN (18.2% vs. 24.2%; Table 1), indicating that high sampling can effectively recover the true sparse structure. The sparsity of ZO-ManPG increased to 53.4%, but its overall convergence remained slow.
Figure 3. Convergence curves of three algorithms (gradient sampling count m = 30 ).
Axioms 15 00203 g003
In conclusion, increasing the sampling count m can significantly improve the convergence quality and stability of zeroth-order algorithms. Among them, ZO-ARPQN benefited the most from high sampling, as its quasi-Newton mechanism can fully utilize the accurate zeroth-order Hessian information to approach the performance of the oracle algorithm (ARPQN); ZO-ManPG, as a first-order method, is limited by the inherently low accuracy of zeroth-order gradient estimation, and even high sampling cannot close the performance gap with second-order methods.

6.2. Black-Box Optimization in Robotic Stiffness Control (Figure 4)

The specific connection between zeroth-order Riemannian optimization and its application in robotic stiffness control [16] is elaborated as follows.
Problem Background: Adaptive force/stiffness control aims to address a classic challenge in robotics: how to enable robots to adjust the force they apply adaptively when performing delicate manipulation tasks (e.g., inserting a key into a lock hole, scraping along a complex surface). This is closely associated with stiffness control, as the robot’s impedance controller achieves precise force control by adjusting its own stiffness and damping. Therefore, learning an adaptive force controller essentially involves learning how to dynamically adjust stiffness parameters based on real-time conditions.
Motivation for Black-Box/Zeroth-Order Optimization: The performance of a controller (e.g., whether key insertion is smooth, whether jamming occurs) cannot be described by simple mathematical formulas. Its performance score can only be obtained through physical experiments or high-fidelity simulations. The mapping from controller parameters to task performance scores is a typical black-box system, and its gradient is completely unknown. Thus, zeroth-order optimization methods become indispensable; for instance, the Bayesian optimization adopted in [16] is precisely a zeroth-order method designed for expensive black-box problems.
Figure 4. Schematic diagram of robotic stiffness control.
Figure 4. Schematic diagram of robotic stiffness control.
Axioms 15 00203 g004
Riemannian Manifold Constraints: In robotic force/stiffness control, key parameters often lie on specific manifolds:
  • Stiffness Matrix (K): Physically valid stiffness matrices must be symmetric positive definite (SPD) matrices. The set of all SPD matrices forms a Riemannian manifold. During the optimization process, it is critical to ensure that the updated parameters remain SPD matrices after each iteration.
  • Desired Pose (R): The controller may need to adjust the desired tool pose (e.g., the orientation of a key), which is represented by a rotation matrix belonging to the S O ( 3 ) manifold.
In Cartesian space impedance/stiffness control, the compliance of the end-effector is characterized by a symmetric positive definite (SPD) matrix K P ∈ S + + d . Given the current end-effector position p ^ ∈ R d , end-effector velocity p ˙ , and external force f e ∈ R d , the control law drives the system to reach a new equilibrium position p ( K P ) , whose common approximation is:
p ( K P ) = p ^ − p ˙ − K P − 1 f e .
Intuitively, the stiffness matrix K P determines the “hardness/softness” of the robot end-effector in response to external disturbances. Given the current end-effector position p ^ ∈ R d , desired equilibrium position p, external force f e , and damping matrix K D = K P (under critical damping), the equilibrium equation can be expressed as:
f e = K P ( p ^ − p ) − K D p ˙ ,
from which the position p is solved.
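Solving the equilibrium equation for p amounts to one linear solve; a sketch with K D = K P as stated (the numerical values are illustrative):

```python
import numpy as np

def equilibrium_position(p_hat, p_dot, f_e, K_P):
    """Solve f_e = K_P (p_hat - p) - K_D p_dot for p with K_D = K_P,
    giving p = p_hat - p_dot - K_P^{-1} f_e."""
    return p_hat - p_dot - np.linalg.solve(K_P, f_e)

K_P = np.diag([100.0, 100.0, 50.0])   # stiff in x/y, softer in z
p_hat = np.zeros(3)
p_dot = np.zeros(3)
f_e = np.array([1.0, 0.0, -2.0])      # external contact force
p = equilibrium_position(p_hat, p_dot, f_e, K_P)
# softer axes deflect more under the same force magnitude
assert abs(p[0] - (-0.01)) < 1e-12 and abs(p[2] - 0.04) < 1e-12
```

Using a linear solve instead of forming K P − 1 explicitly is both cheaper and numerically safer for ill-conditioned stiffness matrices.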
To obtain a reasonable stiffness matrix K P , the following objective function is defined [25]:
f ( K P ) = w p ‖ p ^ − p ( K P ) ‖ 2 + w d det ( K P ) + w c cond ( K P ) ,
where w p , w d , w c > 0 are weights, p ( K P ) denotes the equilibrium position, and cond ( · ) represents the condition number.
This objective function consists of three components:
  • First term: squared position error ‖ p ^ − p ( K P ) ‖ 2 , which ensures the end-effector returns as close as possible to the target position;
  • Second term: penalty on det ( K P ) , which prevents an excessively large stiffness (avoiding damage to the robot or target object);
  • Third term: penalty on the condition number, which avoids ill-conditioned stiffness matrices (ensuring stable force control).
This is a black-box optimization problem defined on the SPD manifold S + + d : the function f has no explicit gradient information, and only function value queries are available. This characteristic makes traditional first-order methods (e.g., Riemannian gradient descent) difficult to apply directly, necessitating the design of Riemannian optimization methods based on zeroth-order function values.
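Putting the three components together, one function query of this black-box objective looks as follows (the weights are illustrative placeholders, not tuned values):

```python
import numpy as np

def stiffness_objective(K_P, p_hat, p_dot, f_e,
                        w_p=1.0, w_d=1e-3, w_c=1e-2):
    """Black-box objective from the text: squared position error
    + det penalty + condition-number penalty, all from function queries."""
    p = p_hat - p_dot - np.linalg.solve(K_P, f_e)   # equilibrium position
    return (w_p * np.sum((p_hat - p) ** 2)
            + w_d * np.linalg.det(K_P)
            + w_c * np.linalg.cond(K_P))

p_hat, p_dot = np.zeros(3), np.zeros(3)
f_e = np.array([1.0, 0.0, -2.0])
soft = stiffness_objective(np.eye(3), p_hat, p_dot, f_e)
stiff = stiffness_objective(100.0 * np.eye(3), p_hat, p_dot, f_e)
# stiffer matrices shrink the position-error term but pay a larger
# det penalty; the optimizer must balance the two
assert np.isfinite(soft) and np.isfinite(stiff)
```

A zeroth-order Riemannian method only ever sees these scalar returns, never the gradient of `stiffness_objective` with respect to K P .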

6.2.1. Manifold and Tangent Space

The SPD manifold is mathematically defined as:
$$ \mathcal{M} = \mathcal{S}_{++}^d = \left\{ K \in \mathbb{R}^{d \times d} \mid K = K^\top ,\ K \succ 0 \right\} , $$
where $K \succ 0$ indicates that $K$ is a positive definite matrix. The tangent space at any point $K \in \mathcal{M}$ (denoted $T_K \mathcal{M}$) consists of all symmetric matrices, i.e.:
$$ T_K \mathcal{M} = \left\{ U \in \mathbb{R}^{d \times d} \mid U = U^\top \right\} . $$

6.2.2. Riemannian Metric

To ensure geometric consistency in stiffness control, the affine-invariant metric, a standard metric for SPD manifolds, is adopted:
$$ \langle U, V \rangle_K = \mathrm{tr}\!\left( K^{-1} U K^{-1} V \right) , \quad \forall\, U, V \in T_K \mathcal{M} , $$
where $\mathrm{tr}(\cdot)$ denotes the matrix trace. The norm induced by this metric is:
$$ \| U \|_K^2 = \langle U, U \rangle_K , $$
which characterizes the "magnitude" of tangent vectors in the context of SPD matrix updates.
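In code, the metric and its induced norm reduce to two linear solves and a trace. The following is a direct transcription of the formulas above; the function names are ours:

```python
import numpy as np

def aff_inv_inner(K, U, V):
    """Affine-invariant inner product <U, V>_K = tr(K^{-1} U K^{-1} V)."""
    return float(np.trace(np.linalg.solve(K, U) @ np.linalg.solve(K, V)))

def aff_inv_norm(K, U):
    """Norm induced by the affine-invariant metric."""
    return float(np.sqrt(aff_inv_inner(K, U, U)))
```

One property worth checking: scaling $K$ up makes the same tangent vector "shorter" in this metric, which is exactly the base-point dependence that distinguishes it from the Euclidean inner product.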

6.2.3. Riemannian Gradient and Exponential Map

For a smooth objective function $f : \mathcal{M} \to \mathbb{R}$, let $\nabla_E f(K)$ denote its Euclidean gradient (computed as if $K$ were an unconstrained matrix in $\mathbb{R}^{d \times d}$). Under the affine-invariant metric, the Riemannian gradient is:
$$ \operatorname{grad} f(K) = K \, \mathrm{sym}\!\left( \nabla_E f(K) \right) K , $$
where $\mathrm{sym}(A) = \tfrac{1}{2}(A + A^\top)$ denotes the symmetric part of a matrix $A$.
The exponential map (used to map tangent vectors back to the SPD manifold) is:
$$ \operatorname{Exp}_K(U) = K^{1/2} \exp\!\left( K^{-1/2} U K^{-1/2} \right) K^{1/2} , \quad U \in T_K \mathcal{M} , $$
where $K^{1/2}$ is the matrix square root of $K$ and $\exp(\cdot)$ denotes the matrix exponential. In the numerical implementation, this exponential map is used directly as the retraction (a key operation in Riemannian optimization for maintaining manifold feasibility).
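Both maps can be implemented with a single symmetric-eigendecomposition helper. This is a pure-NumPy sketch (`scipy.linalg.sqrtm`/`expm` would work equally well); the helper names are ours:

```python
import numpy as np

def _sym_fun(A, fun):
    """Apply a scalar function to a symmetric matrix via eigendecomposition."""
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(fun(w)) @ Q.T

def sym(A):
    """Symmetric part sym(A) = (A + A^T) / 2."""
    return 0.5 * (A + A.T)

def riemannian_grad(K, egrad):
    """grad f(K) = K sym(egrad) K under the affine-invariant metric."""
    return K @ sym(egrad) @ K

def exp_map(K, U):
    """Exp_K(U) = K^{1/2} exp(K^{-1/2} U K^{-1/2}) K^{1/2}."""
    K_half = _sym_fun(K, np.sqrt)
    K_half_inv = _sym_fun(K, lambda w: 1.0 / np.sqrt(w))
    return K_half @ _sym_fun(K_half_inv @ U @ K_half_inv, np.exp) @ K_half
```

At $K = I$ the map reduces to the ordinary matrix exponential, e.g. $\operatorname{Exp}_I(\operatorname{diag}(1, 2)) = \operatorname{diag}(e, e^2)$, and the output is always SPD, which is what makes it usable as a retraction.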

6.2.4. Experimental Setup

Since explicit gradients and Hessian matrices cannot be computed in black-box stiffness control, the standard ARPQN method (a proximal quasi-Newton method requiring exact gradients) cannot serve as a comparative baseline here.
Basic Parameters
  • Initial stiffness matrix: $K_0 = I_d$ (identity matrix, ensuring initial feasibility on the SPD manifold);
  • Task space dimension: d = 3 (consistent with 3D Cartesian space control of robotic end-effectors);
  • Maximum iterations: 100 (balancing the computational efficiency and convergence accuracy).
Algorithm-Specific Parameters
Two zeroth-order Riemannian optimization algorithms were compared, with the parameters of each tuned for best performance:
  • ZO-ManPG (Zeroth-Order Manifold Proximal Gradient): step size $\alpha = 0.1$; finite-difference radius (for zeroth-order gradient estimation) $\mu = 10^{-3}$; and mini-batch size (for noise reduction) $m = 10$.
  • ZO-ARPQN (Zeroth-Order Adaptive Regularized Proximal Quasi-Newton): smoothing parameter (for zeroth-order Hessian estimation) $\mu = 10^{-4}$; mini-batch size (for gradient/Hessian estimation) $m = 10$; initial regularization parameter $\sigma_0 = 5.0$; ratio-test thresholds (for adaptive regularization) $\eta_1 = 0.1$, $\eta_2 = 0.8$; and step-size decay factor $\gamma = 0.6$.
Both algorithms use the exponential map $\operatorname{Exp}_K(U)$ as the retraction. After each update, the matrix eigenvalues are clipped to the range $[10^{-3}, 10^{3}]$ to ensure the stiffness matrix $K$ remains symmetric positive definite (avoiding physically invalid stiffness parameters).
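The eigenvalue-clipping safeguard described above can be sketched as follows (our implementation of the stated step, with the stated bounds as defaults):

```python
import numpy as np

def clip_spd(K, lo=1e-3, hi=1e3):
    """Project K back to a well-conditioned SPD matrix by clipping eigenvalues."""
    K = 0.5 * (K + K.T)            # symmetrize against floating-point drift
    w, Q = np.linalg.eigh(K)
    w = np.clip(w, lo, hi)         # enforce eigenvalues in [lo, hi]
    return Q @ np.diag(w) @ Q.T
```

Because only the eigenvalues are modified, the eigenvectors (and hence the principal stiffness directions) of $K$ are preserved.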

6.2.5. Experimental Results and Analysis

Figure 5 shows the objective function descent of ZO-ManPG and ZO-ARPQN on the SPD manifold for the 3D stiffness adjustment task. The vertical axis represents the objective function value $F(K_k)$ (logarithmic scale, to highlight convergence trends), and the horizontal axis represents the number of iterations.
Convergence Trend Analysis
(1) Early-stage convergence: both algorithms reduced the objective function by approximately two orders of magnitude within the first 10 iterations, verifying that zeroth-order approximation is feasible for black-box stiffness tuning. (2) Late-stage convergence: the ZO-ARPQN curve remained consistently below that of ZO-ManPG, indicating higher descent efficiency and more stable convergence. This is attributed to ZO-ARPQN's adaptive regularization mechanism (dynamically adjusting $\sigma_k$ based on model fidelity) and its quasi-Newton direction (using a Hessian approximation to guide updates).
Quantitative Performance
After 100 iterations, the final objective function values were:
$$ F_{\text{final}}^{\text{ManPG}} = 0.2245 , \qquad F_{\text{final}}^{\text{ARPQN}} = 0.1622 . $$
Compared with ZO-ManPG, ZO-ARPQN achieved a 27.7% lower final objective value under the same iteration budget, confirming its advantage in balancing zeroth-order stability and convergence accuracy.
Practical Significance
The experimental results validate that ZO-ARPQN is effective for non-differentiable black-box SPD optimization tasks (e.g., robotic stiffness control). Its ability to handle “no gradient + manifold constraints” makes it suitable for engineering scenarios where only experimental/simulation scores are available, providing a feasible solution for adaptive stiffness tuning in robotic manipulation.

7. Conclusions

We propose ZO-ARPQN, the first zeroth-order adaptive regularized proximal quasi-Newton framework for composite optimization on Riemannian manifolds, including the Stiefel and SPD manifolds. The method extends the classical ARPQN algorithm to black-box settings by constructing stochastic one-point finite-difference estimators for both the Riemannian gradient and curvature, enabling second-order optimization without explicit derivatives.
We established global convergence guarantees under mild smoothness assumptions, proved that the algorithm achieves convergence to first-order stationary points even with noisy zeroth-order information, and further demonstrated that the curvature-aware regularization enables escape from strict saddle points with a high probability. A complete iteration-complexity analysis is provided, showing that ZO-ARPQN maintains competitive theoretical efficiency compared to existing zeroth-order Riemannian methods.
Extensive numerical experiments on sparse PCA and robot stiffness tuning validated the practical effectiveness of the proposed method. ZO-ARPQN achieved a convergence behavior comparable to first-order ARPQN and other state-of-the-art Riemannian solvers, while requiring only function evaluations. These results highlight its potential for applications in black-box manifold optimization, robotics, machine learning, and simulation-based design.
Future work will include exploring variance-reduced zeroth-order strategies, developing limited-memory versions for large-scale manifolds, and extending the framework to Riemannian stochastic compositional optimization and constrained reinforcement learning.

Author Contributions

Conceptualization, Y.M. and C.L.; Methodology, Y.M.; Software, Y.M.; Validation, Y.M., C.L. and Z.W.; Formal Analysis, Y.M.; Investigation, Y.M.; Resources, C.L.; Data Curation, Y.M.; Writing—Original Draft Preparation, Y.M.; Writing—Review and Editing, C.L. and Y.M.; Visualization, Y.M.; Supervision, C.L.; Project Administration, C.L.; Funding Acquisition, C.L. Q.L. provided additional support in data analysis and manuscript revision. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by the Natural Science Foundation of Ningxia (No. 2025AAC030018); the National Social Science Fund of China (No. 1246010681, "The Uncertain Rescue Model of Major Disaster"); the National Social Science Fund of China (No. 23BMZ062, "Co-Creation of Ecosystem Value for Green Brands in Northwest China's Characteristic Agriculture under Dual Carbon Targets"); the North Minzu University Research Initiative (Nos. 2022ZLGTTYS12 and ZDZX201805); and the First-Class Discipline Construction Program of Ningxia (No. NXYLXK2017B09). Additional partial support was received from the Ningxia Youth Talent Support Program (2021 cohort), the Leading Talent Program of North Minzu University, and the Governance and Social Management Research Center of Northwest Ethnic Regions.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors gratefully acknowledge the computational resources provided by the Intelligent Optimization Laboratory of North Minzu University. Special thanks are extended to the anonymous reviewers, whose insightful comments significantly strengthened the theoretical rigor of this work.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Vandereycken, B. Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 2013, 23, 1214–1236. [Google Scholar] [CrossRef]
  2. Journée, M.; Nesterov, Y.; Richtarik, P.; Sepulchre, R. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 2010, 11, 517–553. [Google Scholar]
  3. Li, J.; Ma, S.; Srivastava, T. A Riemannian ADMM. arXiv 2022, arXiv:2211.02163. [Google Scholar]
  4. Absil, P.-A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008. [Google Scholar] [CrossRef]
  5. Boumal, N. An Introduction to Optimization on Smooth Manifolds; Cambridge University Press: Cambridge, UK, 2023. [Google Scholar]
  6. Ferreira, O.P.; Oliveira, P.R. Proximal point algorithm on Riemannian manifolds. Optimization 2002, 51, 257–270. [Google Scholar] [CrossRef]
  7. Bergmann, R.; Persch, J.; Steidl, G. A parallel Douglas–Rachford algorithm for minimizing ROF-like functionals on images with values in symmetric Hadamard manifolds. SIAM J. Imaging Sci. 2016, 9, 901–937. [Google Scholar] [CrossRef]
  8. Kovnatsky, A.; Glashoff, K.; Bronstein, M.M. MADMM: A Generic Algorithm for Non-Smooth Optimization on Manifolds. In Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016. [Google Scholar]
  9. Reimherr, M.; Bharath, K.; Soto, C. Differential privacy over riemannian manifolds. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021), Virtual, 6–14 December 2021; pp. 12292–12303. [Google Scholar]
  10. Duruisseaux, V.; Leok, M. A variational formulation of accelerated optimization on Riemannian manifolds. SIAM J. Math. Data Sci. 2022, 4, 649–674. [Google Scholar] [CrossRef]
  11. Han, A.; Mishra, B.; Jawanpuria, P.; Gao, J. Riemannian accelerated gradient methods via extrapolation. Proc. Mach. Learn. Res. 2023, 206, 1554–1585. [Google Scholar]
  12. Zhang, H.; Reddi, S.J.; Sra, S. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  13. Xu, Y.; Jin, R.; Yang, T. NEON+: Accelerated gradient methods for extracting negative curvature for non-convex optimization. arXiv 2017, arXiv:1712.01033. [Google Scholar]
  14. Jensen, R.; Zimmermann, R. Riemannian optimization on the symplectic Stiefel manifold using second-order information. arXiv 2024, arXiv:2404.08463. [Google Scholar] [CrossRef]
  15. Ring, W.; Wirth, B. Optimization methods on Riemannian manifolds and their application to shape space. SIAM J. Optim. 2012, 22, 596–627. [Google Scholar] [CrossRef]
  16. Wang, Q.; Yang, W.H. An adaptive regularized proximal Newton-type methods for composite optimization over the Stiefel manifold. Comput. Optim. Appl. 2024, 89, 419–457. [Google Scholar] [CrossRef]
  17. Frazier, P.I. A tutorial on Bayesian optimization. arXiv 2018, arXiv:1807.02811. [Google Scholar] [CrossRef]
  18. Salimans, T.; Ho, J.; Chen, X.; Sidor, S.; Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv 2017, arXiv:1703.03864. [Google Scholar] [CrossRef]
  19. Zhou, P.; Yuan, X.T.; Feng, J. Faster first-order methods for stochastic non-convex optimization on Riemannian manifolds. Proc. Mach. Learn. Res. 2019, 89, 138–147. [Google Scholar]
  20. Li, J.; Balasubramanian, K.; Ma, S. Stochastic zeroth-order Riemannian derivative estimation and optimization. Math. Oper. Res. 2023, 48, 1183–1211. [Google Scholar] [CrossRef]
  21. Balasubramanian, K.; Ghadimi, S. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Found. Comput. Math. 2022, 22, 35–76. [Google Scholar] [CrossRef]
  22. Wang, Q.; Yang, W.H. Proximal quasi-Newton method for composite optimization over the Stiefel manifold. J. Sci. Comput. 2023, 95, 39. [Google Scholar] [CrossRef]
  23. Chen, S.; Ma, S.; Man-Cho So, A.; Zhang, T. Proximal gradient method for nonsmooth optimization over the Stiefel manifold. SIAM J. Optim. 2020, 30, 210–239. [Google Scholar] [CrossRef]
  24. Li, J.; Balasubramanian, K.; Ma, S. Zeroth-order optimization on Riemannian manifolds. arXiv 2020, arXiv:2003.11238. [Google Scholar]
  25. Jaquier, N.; Rozo, L.; Calinon, S.; Bürger, M. Bayesian optimization meets Riemannian manifolds in robot learning. Proc. Mach. Learn. Res. 2020, 100, 233–246. [Google Scholar]
Figure 5. Convergence comparison of zeroth-order algorithms on SPD manifold (3D stiffness control task).
Table 1. Sparsity comparison of three algorithms under different gradient sampling counts m.
| Sampling Count m | Algorithm | Final Objective Value $F_{\text{final}}$ | Sparsity (%) |
|---|---|---|---|
| 5 | ZO-ManPG | 1.6205 | 5.4 |
| 5 | ZO-ARPQN | 2.4447 | 60.0 |
| 5 | ARPQN | 3.3191 | 24.2 |
| 15 | ZO-ManPG | 1.3465 | 20.0 |
| 15 | ZO-ARPQN | 3.8697 | 2.4 |
| 15 | ARPQN | 3.3191 | 24.2 |
| 30 | ZO-ManPG | 1.2499 | 53.4 |
| 30 | ZO-ARPQN | 3.4940 | 18.2 |
| 30 | ARPQN | 3.3191 | 24.2 |

Share and Cite

MDPI and ACS Style

Ma, Y.; Li, C.; Wang, Z.; Li, Q. Zeroth-Order Riemannian Adaptive Regularized Proximal Quasi-Newton Optimization Method. Axioms 2026, 15, 203. https://doi.org/10.3390/axioms15030203

