Although the global convergence analysis is developed under standard inexact line-search assumptions of Armijo/Wolfe type, the numerical implementation reported in this section computes the step sizes in both the predictor and corrector stages by bounded one-dimensional minimization, using MATLAB’s fminbnd routine over a prescribed search interval. Thus, the theoretical framework provides sufficient conditions for convergence, whereas fminbnd serves as a practical step-selection procedure in the computational experiments. Next, we define the test functions:
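The Booth function, $f(x,y) = (x + 2y - 7)^2 + (2x + y - 5)^2$, is strictly convex and has a unique global minimizer at $(1, 3)$ with value $0$. The Himmelblau function, $f(x,y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2$, has four local minima which are also global minima, all satisfying $f = 0$: $(3, 2)$, $(-2.805118, 3.131312)$, $(-3.779310, -3.283186)$, and $(3.584428, -1.848126)$. The Freudenstein–Roth function, $f(x,y) = \bigl(-13 + x + ((5 - y)y - 2)y\bigr)^2 + \bigl(-29 + x + ((y + 1)y - 14)y\bigr)^2$, has a global minimizer at $(5, 4)$ with $f = 0$, a well-known local minimizer at approximately $(11.41, -0.8968)$, and it also exhibits a saddle point. The 3D landscapes in Figure 1 visualize these geometries: Figure 1a corresponds to the convex, unimodal Booth function, whereas Figure 1b and Figure 1c show the multimodal landscapes of the Himmelblau and Freudenstein–Roth functions, respectively, providing intuition about the basins and valleys against which the HOQN variants are assessed.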
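As a quick sanity check on the stated minimizers, the three benchmarks can be coded directly. The sketch below uses the standard textbook forms of the Booth, Himmelblau and Freudenstein–Roth functions, which are assumed to coincide with the definitions used in the experiments; the function and variable names are illustrative only.

```python
def booth(x, y):
    """Booth function: strictly convex quadratic with minimizer (1, 3)."""
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2


def himmelblau(x, y):
    """Himmelblau function: four minimizers, all with value 0."""
    return (x ** 2 + y - 11) ** 2 + (x + y ** 2 - 7) ** 2


def freudenstein_roth(x, y):
    """Freudenstein-Roth function: global minimizer (5, 4), plus a local one."""
    r1 = -13 + x + ((5 - y) * y - 2) * y
    r2 = -29 + x + ((y + 1) * y - 14) * y
    return r1 ** 2 + r2 ** 2


# Minimizers of the Himmelblau function, rounded to six decimals.
HIMMELBLAU_MINIMA = [
    (3.0, 2.0),
    (-2.805118, 3.131312),
    (-3.779310, -3.283186),
    (3.584428, -1.848126),
]

# Evaluate each function at its known minimizers.
booth_min = booth(1.0, 3.0)
himmelblau_values = [himmelblau(px, py) for px, py in HIMMELBLAU_MINIMA]
fr_global = freudenstein_roth(5.0, 4.0)
fr_local = freudenstein_roth(11.41, -0.8968)
```

The values at the rounded Himmelblau minimizers are not exactly zero because the coordinates are truncated, and the Freudenstein–Roth local minimizer has a strictly positive objective value, which is what makes it a useful trap for local methods.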
A particularly important metric in numerical tests of iterative methods is the Approximate Computational Order of Convergence (ACOC), as it allows for the experimental validation of the theoretical order of convergence. In the context of hybrid optimization algorithms, estimating the convergence order, denoted by $p$, using the classical formula proposed in [25] often leads to unreliable results due to significant fluctuations between consecutive iterations. To obtain a more stable and consistent estimate of $p$, we adopt an approach based on the linear regression of the logarithms of successive errors $e_k$. This methodology uses information from multiple iterations within the asymptotic regime, thereby reducing sensitivity to fluctuations in individual errors and providing a more robust estimate than formulas based solely on the most recent iterations. In particular, the ACOC is calculated as the slope of the least-squares regression line that fits the data $\ln e_{k+1}$ versus $\ln e_k$, and is given by:

$$p \approx \frac{N \sum_{k=k_0}^{m-1} x_k y_k - \sum_{k=k_0}^{m-1} x_k \sum_{k=k_0}^{m-1} y_k}{N \sum_{k=k_0}^{m-1} x_k^2 - \left(\sum_{k=k_0}^{m-1} x_k\right)^2},$$

where $e_k$ denotes the error at iteration k, m is the total number of iterations considered, $x_k = \ln e_k$, and $y_k = \ln e_{k+1}$, with $N = m - k_0$ denoting the number of data points used in the regression. Furthermore, $k_0$ denotes the first iteration from which the iterates are considered to lie in the asymptotic regime.
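The regression-based estimate can be illustrated in a few lines. The sketch below is not the implementation used for the tables: the helper name `acoc_regression`, the argument `k0`, and the synthetic error sequence are assumptions made purely for illustration. It computes the least-squares slope of the log of each error against the log of the previous one.

```python
import math


def acoc_regression(errors, k0=0):
    """Least-squares slope of log(e_{k+1}) against log(e_k) for k >= k0."""
    xs = [math.log(e) for e in errors[k0:-1]]
    ys = [math.log(e) for e in errors[k0 + 1:]]
    n = len(xs)
    if n < 2:
        raise ValueError("at least two (e_k, e_{k+1}) pairs are required")
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / (n * sxx - sx * sx)


# Synthetic errors obeying e_{k+1} = 2 * e_k**3 (an exactly cubic sequence).
errors = [1e-1]
for _ in range(3):
    errors.append(2.0 * errors[-1] ** 3)

order = acoc_regression(errors)
```

For an exactly cubic sequence the points lie on a line of slope three, so the estimate recovers the order up to rounding. With real iterates the choice of `k0` matters, since pre-asymptotic iterations bias the slope.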
5.1. Hybrid Optimization with Complete Memory
We begin by analyzing the full memory variants of the proposed hybrid methods, comparing their numerical performance with classical quasi-Newton schemes on selected test functions.
Table 1 and
Table 2 present the numerical results obtained for the Himmelblau and Freudenstein–Roth functions, considering two different initial conditions. The classical quasi-Newton methods BFGS, DFP, and SR1 exhibit superlinear convergence, as is characteristic of this class of methods, requiring between five and six iterations to reach the minimum. On the other hand, with the exception of
, the proposed hybrid methods consistently converge in fewer iterations for both functions and initial conditions, achieving observed convergence orders consistent with the theoretical cubic order and computational times competitive with the classical methods considered. Furthermore, the final gradient norm in these nonconvex problems can vary significantly with the initial condition; however, the smallest values were obtained by some of the proposed hybrid variants, which indicates that these variants approximate the minimizer more accurately.
The reported computation time (CT(s)) is the total execution time of the iterative algorithm, measured in seconds, from the start of the process until the termination criterion is satisfied or the maximum number of iterations is reached; it therefore includes the cost of all operations performed in each iteration. To obtain reliable time measurements, each method was run 15 times for each test problem, and the times reported in the tables are the averages of those runs. The numerical experiments were conducted on a 13-inch MacBook Air (2025) running macOS Sequoia 15.7.4, equipped with an Apple M4 chip and 16 GB of unified memory. This environment provides a consistent basis for comparing the methods.
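The timing protocol (mean wall-clock time over 15 repetitions) can be sketched as follows. The experiments themselves were run in MATLAB; this Python fragment only illustrates the averaging, and `average_runtime` and the dummy workload are hypothetical names introduced here.

```python
import time


def average_runtime(solver, runs=15):
    """Mean wall-clock time in seconds over `runs` calls of solver()."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        solver()  # one complete run, from start to termination
        total += time.perf_counter() - start
    return total / runs


# Dummy workload standing in for one full optimization run.
mean_time = average_runtime(lambda: sum(i * i for i in range(10_000)))
```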
The numerical results indicate that the efficiency of the methods strongly depends on the nature of the problem. In the nonconvex problems, such as the Himmelblau and Freudenstein–Roth functions, the hybrid methods reach the minimizer in fewer iterations and with a faster convergence rate than the classical BFGS and DFP methods, while also yielding smaller final gradient norms.
Let us now analyze the Booth function, which is a two-dimensional strictly convex quadratic problem with unique global minimizer $x^* = (1, 3)^T$.
Table 3 reports the numerical results obtained from two different initial conditions. All the methods converge in two iterations and reach final gradient norms of order
or smaller. Since only two iterations are required, the approximate computational order of convergence cannot be reliably estimated; therefore, the ACOC is not reported.
The results in
Table 3 should be interpreted as a consistency check for a simple strictly convex quadratic problem, not as evidence of a general optimality property of the proposed methods. In exact arithmetic, Newton’s method solves a strictly convex quadratic problem in one step when the exact inverse Hessian is used, while full-memory quasi-Newton schemes with line searches may display very fast finite termination on low-dimensional quadratic problems. Therefore, the two-iteration behavior observed for the Booth function is a finite-dimensional quadratic effect. Consequently, the Booth experiment confirms that all methods behave accurately on a convex quadratic benchmark, but it is not sufficient to assess scalability or general asymptotic behavior. For this reason, the Booth test is complemented below with high-dimensional quadratic experiments with prescribed Hessian condition numbers.
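The finite-termination effect can be reproduced with a textbook BFGS iteration. In exact arithmetic, BFGS with exact line searches terminates in at most n iterations on a strictly convex quadratic in n variables. The sketch below assumes the standard Booth function written as $\tfrac12 x^T A x - b^T x + c$, an identity initial inverse Hessian, and the starting point $(0, 0)$; none of these choices is taken from the reported experiments.

```python
import numpy as np

# Booth function as a quadratic: Hessian A and linear term b (constant omitted).
A = np.array([[10.0, 8.0], [8.0, 10.0]])
b = np.array([34.0, 38.0])


def grad(x):
    """Gradient of the Booth function."""
    return A @ x - b


def bfgs_exact(x0, max_iter=10, tol=1e-10):
    """Inverse-Hessian BFGS with exact line search on the quadratic above."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(2)  # assumed initial inverse-Hessian approximation
    g = grad(x)
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            return x, k
        d = -H @ g
        alpha = -(g @ d) / (d @ A @ d)  # exact minimizer along d
        s = alpha * d
        x = x + s
        g_new = grad(x)
        y = g_new - g
        rho = 1.0 / (y @ s)
        I = np.eye(2)
        H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
            + rho * np.outer(s, s)
        g = g_new
    return x, max_iter


x_final, iterations = bfgs_exact([0.0, 0.0])
```

Here `iterations` counts the steps taken before the gradient test is met, so a value of two matches the two-iteration behavior discussed above and says nothing about performance on non-quadratic problems.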
5.2. High-Dimensional Quadratic Scalability Test
Since the Booth function is only a two-dimensional strictly convex quadratic problem, its two-iteration behavior should not be interpreted as evidence of general scalability. To test the methods beyond the low-dimensional Booth case, we consider the strictly convex quadratic family
where
Q is an orthogonal matrix generated with a fixed random seed. Hence,
The vector
b was chosen as
, with
so that the exact minimizer is known. We used
, and the stopping criterion
. The step length was computed by exact one-dimensional minimization along each search direction, in order to isolate the cost of the quasi-Newton updates from line-search variability. The tested dimensions and condition numbers were
. To assess the expected dense $O(n^2)$ scaling, Table 4 reports the number of outer iterations and the normalized cost
.
All methods reached the prescribed tolerance in every tested case. The hybrid variant consistently required fewer outer iterations than both BFGS and . In the most demanding case, , BFGS required 770 iterations and required 800 iterations, whereas required 408 iterations. This confirms the reduction in outer steps produced by the predictor–corrector structure. Although performs two stages per outer iteration, the normalized cost remains of order across all tested dimensions and condition numbers, consistently with the dense cost of full-memory quasi-Newton updates. Hence, these experiments show that the proposed hybrid scheme extends beyond the two-dimensional Booth case, preserves the expected dense scaling, and substantially reduces the number of outer iterations on high-dimensional quadratic problems.
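A quadratic test family with a prescribed condition number can be generated as sketched below. The logarithmically spaced spectrum between 1 and $\kappa$, the choice of the all-ones minimizer, and the helper names are assumptions for illustration; the text only specifies an orthogonal factor generated with a fixed seed and a right-hand side built from a known minimizer.

```python
import numpy as np


def make_quadratic(n, kappa, seed=0):
    """Return (A, b, x_star) for f(x) = 0.5 x^T A x - b^T x with cond(A) = kappa."""
    rng = np.random.default_rng(seed)
    # Orthogonal factor from the QR decomposition of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    # Assumed spectrum: eigenvalues log-spaced in [1, kappa].
    lam = np.logspace(0.0, np.log10(kappa), n)
    A = (Q * lam) @ Q.T  # Q diag(lam) Q^T
    x_star = np.ones(n)  # assumed known minimizer
    b = A @ x_star
    return A, b, x_star


def exact_step(A, g, d):
    """Exact minimizer of the quadratic along direction d from a point with gradient g."""
    return -(g @ d) / (d @ A @ d)


A, b, x_star = make_quadratic(n=50, kappa=1e4)
x = np.zeros(50)
g = A @ x - b
d = -g  # steepest-descent direction, used only to exercise exact_step
alpha = exact_step(A, g, d)
```

Because the step is exact, the new gradient is orthogonal to the search direction, which removes line-search variability from the comparison of outer iteration counts.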
5.3. Dolan–Moré Performance Profiles and Robustness Assessment
To provide a broader numerical assessment, we complement the pointwise convergence results with Dolan–Moré performance profiles. Among the proposed hybrid variants,
BM1D,
BM2D and
BM3D were selected for the global benchmark because they showed the most stable behavior in the preliminary screening, with fewer non-descent directions, fewer line-search breakdowns and more reliable quasi-Newton updating. These methods correspond, respectively, to the modified Chun-type, Ostrowski-type and Traub-type hybrid quasi-Newton schemes, and are compared with the classical quasi-Newton methods DFP, BFGS and SR1 [
19]. Dolan–Moré profiles are a standard tool for benchmarking optimization solvers over a common test set [
26].
Let $\mathcal{P}$ be the set of test instances and $\mathcal{S}$ the set of solvers. For each $p \in \mathcal{P}$ and $s \in \mathcal{S}$, let $t_{p,s}$ denote the computational cost required by solver s on problem p. If the solver fails to satisfy the stopping criterion within the computational budget, or produces non-finite values, we set $t_{p,s} = \infty$. The performance ratio is

$$r_{p,s} = \frac{t_{p,s}}{\min\{t_{p,\sigma} : \sigma \in \mathcal{S}\}},$$

where the minimum is taken over the solvers that successfully solve problem p. If no solver solves a given instance, all ratios are treated as infinite and the instance contributes only to the failure-rate analysis. The Dolan–Moré profile of solver s is defined by

$$\rho_s(\tau) = \frac{1}{|\mathcal{P}|} \, \bigl|\{p \in \mathcal{P} : r_{p,s} \le \tau\}\bigr|, \qquad \tau \ge 1.$$

Thus, $\rho_s(1)$ measures relative efficiency, whereas the limiting value of $\rho_s(\tau)$ for large $\tau$ measures robustness, since failed runs keep infinite cost.
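The profile computation described above is short enough to sketch directly. The function name `performance_profile` and the toy cost table are illustrative assumptions; failures are encoded as infinite cost, and an instance that no solver solves yields infinite ratios for every solver.

```python
import math


def performance_profile(costs, taus):
    """costs[p][s] is the cost of solver s on problem p (math.inf on failure).

    Returns profile[s][j], the fraction of problems whose performance
    ratio for solver s is at most taus[j].
    """
    n_prob = len(costs)
    n_solv = len(costs[0])
    ratios = []
    for row in costs:
        best = min(row)
        if math.isinf(best):
            # No solver succeeded: the instance never counts as solved.
            ratios.append([math.inf] * n_solv)
        else:
            ratios.append([c / best for c in row])
    return [
        [sum(r[s] <= tau for r in ratios) / n_prob for tau in taus]
        for s in range(n_solv)
    ]


# Toy data: 3 problems, 3 solvers; the last problem defeats every solver.
costs = [
    [10.0, 20.0, math.inf],
    [30.0, 15.0, 15.0],
    [math.inf, math.inf, math.inf],
]
taus = [1.0, 2.0, 10.0]
profile = performance_profile(costs, taus)
```

In this toy table no profile can exceed 2/3, because the unsolved third instance caps the limiting value for every solver; this is the sense in which the right-hand tail of a profile measures robustness.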
The benchmark set extends beyond the basic two-dimensional tests. It consists of 30 base problems from classical unconstrained optimization benchmarks, the Moré–Garbow–Hillstrom family, the Wood problem, higher-dimensional extensions, and CUTEr/CUTEst-type analytic test instances [
27,
28,
29,
30]. The composition of the set is reported in
Table 5. Each base problem was tested in five variants: one smooth version, two deterministic noisy versions, and two piecewise-smooth nonsmooth versions. Hence, the complete benchmark contains $30 \times 5 = 150$ instances and, with six solvers, $150 \times 6 = 900$ numerical runs.
The primary performance measure was the total computational work
, where
and
are the numbers of objective-function and gradient evaluations. This metric is appropriate because
BM1D,
BM2D and
BM3D are two-stage schemes, whereas DFP, BFGS and SR1 are one-stage quasi-Newton methods. For completeness, we also report profiles based on the number of outer iterations, which highlight the reduction in iteration count achieved by the proposed methods. For smooth instances, a run was declared successful when
. For noisy and nonsmooth variants, where the gradient norm may be less reliable, we used the function-reduction criterion
Here,
denotes the best available reference value; when the exact value was unavailable, it was taken as the best value obtained over all solvers. This criterion follows the benchmarking philosophy of Moré and Wild for smooth, noisy and piecewise-smooth optimization problems [
31]. For noisy variants, the solvers used perturbed information, but success was assessed with the corresponding unperturbed objective value. In addition to the performance profiles, we report the failure rate
where
denotes the smooth, noisy, nonsmooth, or full benchmark class. This measure complements the profiles, since a solver may be efficient on the instances it solves while still being unreliable under perturbations or nonsmoothness.
Table 6 summarizes the Dolan–Moré results for the smooth subset of 30 instances. The first block uses the total work
, whereas the second block uses the number of outer iterations. The column “Best at $\tau = 1$” counts ties independently.
The smooth-subset results show that BFGS and the three proposed hybrid methods solve all 30 instances, whereas DFP and SR1 show one and two failures, respectively. In terms of outer iterations,
BM1D,
BM2D and
BM3D have median iteration counts of 4.5, 5.0 and 5.5, compared with 7.0, 8.5 and 6.5 for DFP, BFGS and SR1.
BM2D attains the best iteration count in 23 instances, followed by
BM1D in 21 and
BM3D in 13. When total work
is used, the comparison becomes more balanced, as expected for two-stage methods. Even so, all three hybrid methods solve every smooth instance and reach
. Among them,
BM2D gives the most balanced smooth-subset behavior, with
, full success, and lower median work than
BM3D. To assess robustness under perturbations and loss of smoothness,
Table 7 reports failure rates over the smooth, noisy and nonsmooth subsets, together with the overall rate over all 150 instances.
The failure-rate results indicate that the noisy variants are the most demanding part of the benchmark. In the smooth and nonsmooth classes,
BM1D,
BM2D and
BM3D achieve zero failure rate, matching BFGS and improving over DFP and SR1. In the noisy class, BFGS has the lowest failure rate,
, followed by
BM2D and
BM3D, both with
. Overall,
BM2D and
BM3D are the most robust proposed methods, with global failure rates of
, close to BFGS
.
Figure 2 reports the empirical success rates by problem class, complementing
Table 7.
The heat map confirms that the proposed methods solve all smooth and nonsmooth instances, while maintaining competitive success rates in the noisy class: for BM1D and for BM2D and BM3D.
Figure 3 shows the aggregated failure rates over the full benchmark. BFGS gives the lowest global failure rate,
, while
BM2D and
BM3D follow closely with
. The larger failure rates of SR1 and DFP,
and
, respectively, indicate greater sensitivity under the present perturbation setting.
The global Dolan–Moré profiles over the full benchmark are shown in
Figure 4. The left panel uses the primary metric
, whereas the right panel uses the number of outer iterations.
The work-based profile in
Figure 4a gives the most balanced comparison because it accounts for the additional evaluations required by the two-stage hybrid schemes. In this metric, BFGS and SR1 are highly competitive for small values of the performance-ratio threshold $\tau$, reflecting their lower cost per iteration. However, the proposed methods remain close to the best solvers over a wide range of $\tau$, with limiting values consistent with the low failure rates reported in
Table 7. The iteration-based profile in
Figure 4b highlights the main advantage of the proposed schemes:
BM1D,
BM2D and
BM3D solve a large fraction of the test instances with fewer outer iterations than DFP, BFGS and SR1. Overall,
Table 6 and
Table 7, together with
Figure 2,
Figure 3 and
Figure 4, show that the proposed hybrid schemes are robust and competitive on a heterogeneous benchmark set.
BM2D and
BM3D are the most robust proposed methods, whereas
BM1D provides strong iteration reduction on several smooth instances. BFGS remains a very strong classical baseline in terms of total work and failure rate; therefore, the proposed methods should be interpreted as competitive hybrid alternatives that reduce outer iterations while preserving robust behavior under smooth, noisy and nonsmooth scenarios.
5.4. Local Verification Setup on Ill-Conditioned Hessians
To validate the explicit cubic error recurrences derived in Theorem 2 and Lemma 1, we consider a family of smooth test functions with prescribed Hessian condition number at the solution. For
, let
where
Q is a fixed orthogonal matrix generated with a prescribed random seed. We consider
Then
The gradient and Hessian are
where
and
denote componentwise powers. In the experiments, we used
. Since the convergence theory is local, the verification was carried out in the full-step regime
, which corresponds to the asymptotic regime assumed in the local analysis. For each value of
, 96 local samples were generated by taking
, with small radii
r and randomly generated unit directions
v. The initial inverse Hessian approximation was chosen as
. For sufficiently small
, this choice satisfies
which is precisely the first-order inverse-Hessian consistency condition used in Theorem 2. For each sample, one HOQN iteration was performed and the observed error
was compared with the cubic term
. The local order
was estimated by a log–log regression of
versus
. We also report
Thus, bounded values of
Q provide numerical evidence for
. For
and
, respectively, we monitor the inverse-Hessian consistency quantities
where
is the intermediate inverse Hessian approximation obtained after the BFGS update following the Newton-type predictor.
Table 8 reports the local verification results for
and
on ill-conditioned Hessians.
The results show that the estimated local order remains close to three for both
and
, even for
. Orders higher than three are consistent with the theoretical result, which guarantees a cubic convergence order, and may occur when the leading cubic coefficient is small or is partially canceled out along some sampled directions. The empirical constants
and the theoretical constants
remain finite across all tests. The small values of
Q indicate that, after subtracting the explicit cubic contribution
, the remaining error behaves as a fourth-order residual, which supports
Finally, the bounded values of and, for , , confirm that the approximations of the inverse Hessian before and after the intermediate BFGS update remain first-order consistent with . This provides numerical evidence that the BFGS update following the Newton-type predictor and the DFP update following the high-order corrector preserve the local cubic regime for the two-update variant described as .
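The sampling-and-regression procedure used for the local order can be illustrated on a simpler iteration. The sketch below does not implement HOQN: it substitutes a plain Newton step (local order two) on the Himmelblau function near the minimizer $(3, 2)$, and the radii, seed and helper names are assumptions. It draws 24 random unit directions, reuses them at four radii (96 samples), and fits the slope of the log of the error after one step against the log of the initial error.

```python
import numpy as np

X_STAR = np.array([3.0, 2.0])  # a minimizer of the Himmelblau function


def grad(p):
    """Gradient of the Himmelblau function."""
    x, y = p
    u = x * x + y - 11.0
    v = x + y * y - 7.0
    return np.array([4 * x * u + 2 * v, 2 * u + 4 * y * v])


def hess(p):
    """Hessian of the Himmelblau function."""
    x, y = p
    return np.array([
        [12 * x * x + 4 * y - 42, 4 * x + 4 * y],
        [4 * x + 4 * y, 4 * x + 12 * y * y - 26],
    ])


def one_step(p):
    """Stand-in iteration: one full Newton step (not the HOQN scheme)."""
    return p - np.linalg.solve(hess(p), grad(p))


def estimate_local_order(radii, n_dirs=24, seed=0):
    """Slope of log||e_1|| versus log||e_0|| over samples x* + r v."""
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dirs, 2))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    e0, e1 = [], []
    for r in radii:
        for v in dirs:
            x0 = X_STAR + r * v
            e0.append(r)
            e1.append(np.linalg.norm(one_step(x0) - X_STAR))
    slope, _ = np.polyfit(np.log(e0), np.log(e1), 1)
    return slope


radii = np.logspace(-3, -2, 4)  # 4 radii x 24 directions = 96 samples
order = estimate_local_order(radii)
```

Replacing `one_step` with a cubically convergent iteration would be expected to shift the fitted slope towards three. Reusing the same directions at every radius keeps direction-dependent error constants from biasing the slope.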
5.5. Dynamical Planes
To analyze the dependence of the methods on the initial estimate, we generate dynamical planes [
32]. Each plane is constructed on a uniform grid, using each grid point as an initial condition and coloring it according to the minimizer reached by the method, thus visualizing the corresponding basins of attraction. In all dynamical planes, the colors indicate the basins of attraction associated with the minimizers reached by the method, whereas the black symbols mark the corresponding local minimizers. Unlike root-finding dynamical planes, here the goal is to locate minimizers rather than zeros of a nonlinear operator; consequently, the observed equilibrium points may appear on the boundary of attraction regions. The configuration used was a
grid, with a maximum of 500 iterations and tolerance
.
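The construction of a dynamical plane can be sketched as follows. This is a minimal illustration, not the code behind the figures: it uses a textbook BFGS iteration with Armijo backtracking on the Himmelblau function, a coarse grid on an assumed window $[-5, 5]^2$, and a hypothetical labeling rule that assigns each start to the nearest known minimizer, or to -1 if none is reached.

```python
import numpy as np

MINIMA = np.array([
    [3.0, 2.0],
    [-2.805118, 3.131312],
    [-3.779310, -3.283186],
    [3.584428, -1.848126],
])


def f(p):
    x, y = p
    return (x * x + y - 11.0) ** 2 + (x + y * y - 7.0) ** 2


def grad(p):
    x, y = p
    u = x * x + y - 11.0
    v = x + y * y - 7.0
    return np.array([4 * x * u + 2 * v, 2 * u + 4 * y * v])


def bfgs_armijo(x0, max_iter=500, tol=1e-8):
    """Textbook BFGS with Armijo backtracking (illustrative solver)."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(2)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        d = -H @ g
        fx, slope, alpha = f(x), g @ d, 1.0
        while f(x + alpha * d) > fx + 1e-4 * alpha * slope and alpha > 1e-12:
            alpha *= 0.5
        s = alpha * d
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        if y @ s > 1e-12:  # skip the update if the curvature condition fails
            rho = 1.0 / (y @ s)
            I = np.eye(2)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x


def basin_label(x0):
    """Index of the minimizer reached from x0, or -1 if none is reached."""
    x = bfgs_armijo(x0)
    dist = np.linalg.norm(MINIMA - x, axis=1)
    k = int(np.argmin(dist))
    return k if dist[k] < 1e-3 else -1


def dynamical_plane(lo=-5.0, hi=5.0, n=9):
    """Label every point of an n-by-n uniform grid by its basin."""
    axis = np.linspace(lo, hi, n)
    return np.array([[basin_label((x, y)) for x in axis] for y in axis])


plane = dynamical_plane()
```

Coloring `plane` by label gives a coarse picture of the basins of attraction; a finer grid resolves the fragmented boundaries at the cost of one optimization run per grid point.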
Figure 5,
Figure 6 and
Figure 7 show the dynamical planes for the Himmelblau function, and
Figure 8,
Figure 9 and
Figure 10 show the corresponding results for the Freudenstein–Roth function [
12]. Overall, BFGS, DFP, SR1,
,
,
, and
exhibit similar and comparatively stable dynamical behavior, whereas
and
show more fragmented or chaotic basins. The Himmelblau function produces the most complex attraction structure, with highly fragmented regions where small changes in the initial estimate may lead to different local minima. By contrast, the Freudenstein–Roth planes are less fragmented.
For the strictly convex quadratic Booth function, the Hessian is constant and positive definite, and $(1, 3)$ is the unique minimizer. Hence, the considered descent line-search variants do not generate additional attractors: all initial conditions converge to the same point and the basin of attraction coincides with the whole domain. Therefore, only the
plane is shown in
Figure 11, as it is representative of all methods for
.
5.6. Hybrid Optimization with Limited Memory
In general, optimization methods may require a significant amount of memory, making it essential to adopt computational strategies that efficiently manage such requirements. One of the most widely used approaches in this context is the limited-memory variant of the BFGS method (L-BFGS) [
33]. Instead of storing or factorizing the full Hessian matrix, which entails a memory cost of order $O(n^2)$ and becomes impractical for large dimensions $n$, L-BFGS retains only a limited number m of the most recent curvature pairs $(s_k, y_k)$. As a result, the overall memory cost is reduced to $O(mn)$, while still providing an effective approximation of curvature information. Building upon this strategy, we propose the hybrid limited-memory method L-HOQN, whose iterative scheme is described in Algorithm 2.
In low-dimensional problems, such as the Himmelblau, Freudenstein–Roth, and Booth test functions with $n = 2$, limited-memory optimization methods and, in particular, the L-HOQN method, are affected by a loss of curvature information and, consequently, exhibit slower convergence than their full-memory counterparts. In the case of the L-HOQN method variants,
–LBFGS,
-LBFGS, and
-LBFGS, this loss reduces the effectiveness of the high order corrections and slows down the convergence process, as observed in the results reported in
Table 9,
Table 10 and
Table 11. Nevertheless, this behavior changes substantially in practical high-dimensional applications, such as neural network training, which will be analyzed in
Section 6.
| Algorithm 2 Hybrid method L-HOQN |
| Require: , , , , , , memory with at most m pairs |
| Ensure: , |
| 1: |
| 2: |
| 3: |
| 4: |
| 5: if
then |
| 6: |
| 7: |
| 8: |
| 9: else if
then |
| 10: |
| 11: |
| 12: |
| 13: else if
then |
| 14: |
| 15: |
| 16: |
| 17: end if |
| 18: |
| 19: |
| 20: |
| 21: ifthen |
| 22: remove oldest pair from |
| 23: end if |
| 24: append to |
| 25: update via two-loop recursion on |
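The memory management and two-loop recursion in the last steps of Algorithm 2 follow the standard L-BFGS pattern, sketched below. This is the textbook recursion, not the full L-HOQN scheme: the predictor and corrector stages are omitted, and the names `two_loop` and `memory`, together with the scaled-identity initial matrix, are assumptions.

```python
from collections import deque

import numpy as np


def two_loop(g, pairs):
    """Apply the L-BFGS inverse-Hessian approximation to g.

    `pairs` holds (s, y) curvature pairs ordered from oldest to newest.
    """
    q = np.array(g, dtype=float)
    alphas = []
    for s, y in reversed(pairs):  # first loop: newest to oldest
        a = (s @ q) / (y @ s)
        alphas.append(a)
        q -= a * y
    if pairs:
        s, y = pairs[-1]
        q *= (s @ y) / (y @ y)  # assumed initial scaling gamma * I
    for (s, y), a in zip(pairs, reversed(alphas)):  # second loop: oldest to newest
        beta = (y @ q) / (y @ s)
        q += (a - beta) * s
    return q


# A deque with maxlen=m drops the oldest pair automatically when full.
rng = np.random.default_rng(0)
A = np.diag([1.0, 2.0, 3.0, 4.0])  # toy Hessian used to build consistent pairs
memory = deque(maxlen=3)
for _ in range(5):
    s = rng.standard_normal(4)
    memory.append((s, A @ s))

s_last, y_last = memory[-1]
d = two_loop(y_last, memory)
```

A useful check is the secant condition: applying the approximation to the newest `y` returns the newest `s`, which holds for any admissible memory content. Storage is m pairs of n-vectors, in line with the $O(mn)$ memory cost of limited-memory schemes.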