Article

Efficiency and Stability of a New Hybrid Unconstrained Optimization Algorithm with Quasi-Newton Updates and Higher-Order Methods

by Alicia Cordero 1,*, Javier G. Maimó 2, Juan R. Torregrosa 1 and Natanael Ureña Castillo 2
1 Instituto de Matemática Multidisciplinar, Universitat Politècnica de València, Camino de Vera, s/n, 46022 Valencia, Spain
2 Ciencias Básicas y Ambientales (CBA), Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo 10602, Dominican Republic
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(10), 1746; https://doi.org/10.3390/math14101746
Submission received: 3 April 2026 / Revised: 4 May 2026 / Accepted: 7 May 2026 / Published: 19 May 2026

Abstract

We propose the higher-order quasi-Newton (HOQN) method, a hybrid algorithm for unconstrained optimization that combines Newtonian predictors with higher-order correctors derived from vector extensions of the Traub, Chun, and Ostrowski methods, along with quasi-Newton updates of the inverse Hessian using Broyden–Fletcher–Goldfarb–Shanno (BFGS) or Davidon–Fletcher–Powell (DFP) formulas. We demonstrate that the resulting scheme achieves cubic local convergence order, representing a substantial improvement over the superlinear convergence typical of classical quasi-Newton methods, while maintaining a cost of $O(n^2)$ per iteration. We also analyze variants that incorporate two successive quasi-Newton updates, and show that they retain the same cubic order. Numerical experiments with the benchmark functions of Himmelblau and Freudenstein–Roth confirm the theoretical convergence order and show that the hybrid variants consistently require fewer iterations than BFGS, DFP, and Symmetric Rank-One (SR1). In the case of the Booth function, given its strictly convex quadratic structure, the proposed hybrid methods reach the global minimum in just two iterations and exhibit numerical accuracy superior to that of classical quasi-Newton methods. In addition, limited-memory variants (L-HOQN) are introduced; these are evaluated during the training of a convolutional neural network on the MNIST dataset, where they achieve test accuracies exceeding 99% and outperform L-BFGS and standard stochastic gradient descent (SGD) at all tested learning rates.

1. Introduction

The development of quasi-Newton methods is one of the most important lines of research in unconstrained numerical optimization, due to their ability to balance computational efficiency and accuracy in large-scale problems. These methods seek to replicate the local efficiency of the Newton method while avoiding the explicit computation of the inverse of the Hessian matrix.
Let us first consider the classical Newton method for unconstrained optimization. Let $f:\mathbb{R}^n\to\mathbb{R}$ be a twice continuously differentiable function, where $\nabla f(x)$ and $\nabla^2 f(x)$ denote the gradient and the Hessian of $f$ at a point $x\in\mathbb{R}^n$, respectively. Newton’s method generates a sequence $\{x^{(k)}\}$ of approximations to a minimizer of $f$ according to
$$d^{(k)} = -\left[\nabla^2 f(x^{(k)})\right]^{-1}\nabla f(x^{(k)}),\qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)},$$
where $x^{(k)}$ denotes the current iterate, $d^{(k)}$ is the Newton search direction, and $\alpha^{(k)}>0$ is a step length obtained by means of a line search procedure.
Under standard regularity assumptions, the method exhibits local quadratic convergence. However, its implementation requires the evaluation and subsequent inversion of the Hessian at each iteration, which implies a computational cost of order $O(n^3)$, making it impractical for large-scale problems. Furthermore, the Newton direction guarantees descent only when the Hessian is positive definite in a neighborhood of the solution, a condition that may not be satisfied in nonconvex problems.
With the aim of reducing computational cost while maintaining good convergence properties, quasi-Newton methods replace the exact Hessian with an iteratively updated approximation. In general, these methods generate the sequence $\{x^{(k)}\}$ according to the scheme
$$d^{(k)} = -H^{(k)}\nabla f(x^{(k)}),\qquad x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)},\qquad s^{(k)} = x^{(k+1)} - x^{(k)},\qquad y^{(k)} = \nabla f(x^{(k+1)}) - \nabla f(x^{(k)}),$$
where $H^{(k)}$ denotes an approximation of the inverse Hessian matrix $\left[\nabla^2 f(x^{(k)})\right]^{-1}$, $s^{(k)}$ represents the displacement between two successive iterates, and $y^{(k)}$ denotes the variation of the gradient between those iterates. The matrix $H^{(k+1)}$ is computed through a quasi-Newton update formula satisfying the secant equation
$$H^{(k+1)} y^{(k)} = s^{(k)}.$$
Different quasi-Newton methods arise from different update formulas. Among them, the DFP method, introduced in the 1960s, is one of the first designed for this purpose. The DFP update of the inverse Hessian is given by
$$H^{(k+1)} = H^{(k)} + \frac{s^{(k)} (s^{(k)})^T}{(s^{(k)})^T y^{(k)}} - \frac{H^{(k)} y^{(k)} (y^{(k)})^T H^{(k)}}{(y^{(k)})^T H^{(k)} y^{(k)}},$$
and was first presented in [1].
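The DFP formula above can be sketched in a few lines. This is a minimal illustration in plain Python lists (to stay dependency-free), not the authors' implementation; the function names `dot`, `matvec`, and `dfp_update` and the sample vectors are assumptions made here, and the sketch assumes a symmetric $H^{(k)}$ so that $(y^{(k)})^T H^{(k)} = (H^{(k)} y^{(k)})^T$.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]


def dfp_update(H, s, y):
    """Return the DFP update of a symmetric inverse-Hessian approximation H.

    Assumption of this sketch: H is symmetric and the curvature s^T y is
    positive, the usual condition under which positive definiteness is kept.
    """
    sy = dot(s, y)
    if sy <= 0.0:
        raise ValueError("this sketch requires s^T y > 0")
    Hy = matvec(H, y)          # H y; by symmetry, y^T H = (H y)^T
    yHy = dot(y, Hy)
    n = len(s)
    return [
        [H[i][j] + s[i] * s[j] / sy - Hy[i] * Hy[j] / yHy for j in range(n)]
        for i in range(n)
    ]


# Illustrative data (not from the article): identity start and one (s, y) pair.
H0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 2.0]
y = [3.0, 1.0]
H1 = dfp_update(H0, s, y)
# The secant equation H1 y = s should hold up to rounding.
secant_residual = [a - b for a, b in zip(matvec(H1, y), s)]
```

Only matrix-vector products and rank-one terms are formed, which is where the $O(n^2)$ cost per update comes from.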
Later, in the 1970s, the BFGS method [2] was developed and is known for its efficiency and stability. In terms of the inverse Hessian approximation, the update is
$$H^{(k+1)} = \left(I - \frac{s^{(k)} (y^{(k)})^T}{(y^{(k)})^T s^{(k)}}\right) H^{(k)} \left(I - \frac{y^{(k)} (s^{(k)})^T}{(y^{(k)})^T s^{(k)}}\right) + \frac{s^{(k)} (s^{(k)})^T}{(y^{(k)})^T s^{(k)}},$$
where $I$ denotes the identity matrix of the same dimension as $H^{(k)}$.
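As a hedged illustration of the BFGS formula, the sketch below expands the product form algebraically so that only matrix-vector products and rank-one terms are needed. With $\rho = 1/((y^{(k)})^T s^{(k)})$ and symmetric $H$, the product form equals $H - \rho\,(Hy\,s^T + s\,(Hy)^T) + (\rho^2\, y^T H y + \rho)\, s s^T$. The function names and test vectors are assumptions of this example.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]


def bfgs_update(H, s, y):
    """BFGS update of a symmetric inverse-Hessian approximation H.

    Uses the expanded form of (I - rho s y^T) H (I - rho y s^T) + rho s s^T,
    which avoids forming matrix-matrix products.
    """
    sy = dot(s, y)
    if sy <= 0.0:
        raise ValueError("this sketch requires y^T s > 0")
    rho = 1.0 / sy
    Hy = matvec(H, y)
    coeff = rho * rho * dot(y, Hy) + rho   # multiplies s s^T
    n = len(s)
    return [
        [
            H[i][j] - rho * (Hy[i] * s[j] + s[i] * Hy[j]) + coeff * s[i] * s[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


# Illustrative data (not from the article).
H0 = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 2.0]
y = [3.0, 1.0]
H1 = bfgs_update(H0, s, y)
H1y = matvec(H1, y)   # should reproduce s (secant equation)
```

Writing the update this way is a design choice for the sketch: a literal implementation of the product form would cost $O(n^3)$ with dense matrix products.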
Another well-known quasi-Newton method is the symmetric rank-one (SR1) method [3]. The SR1 formula is considered one of the earliest quasi-Newton updates and is often associated with the ideas of Davidon [4].
However, its theoretical development and modern use in optimization were established primarily during the period 1991–1996. In terms of the approximation of the inverse of the Hessian, its update is defined by
$$H^{(k+1)} = H^{(k)} + \frac{\left(s^{(k)} - H^{(k)} y^{(k)}\right)\left(s^{(k)} - H^{(k)} y^{(k)}\right)^T}{\left(s^{(k)} - H^{(k)} y^{(k)}\right)^T y^{(k)}}.$$
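The SR1 update can be sketched in the same style. Because its denominator can vanish, practical codes usually skip the update when it is too small; the skip rule and the threshold `r` below are assumptions of this sketch and are not taken from the article.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in M]


def sr1_update(H, s, y, r=1e-8):
    """SR1 update of an inverse-Hessian approximation H (list of rows).

    The update is skipped when |(s - H y)^T y| is tiny relative to
    ||s - H y|| * ||y||. This safeguard is an assumption of the sketch.
    """
    u = [si - hi for si, hi in zip(s, matvec(H, y))]   # u = s - H y
    denom = dot(u, y)
    if abs(denom) <= r * math.sqrt(dot(u, u)) * math.sqrt(dot(y, y)):
        return [row[:] for row in H]                   # keep H unchanged
    n = len(s)
    return [[H[i][j] + u[i] * u[j] / denom for j in range(n)] for i in range(n)]


# Illustrative data (not from the article).
H0 = [[1.0, 0.0], [0.0, 1.0]]
H1 = sr1_update(H0, [1.0, 2.0], [3.0, 1.0])
H1y = matvec(H1, [3.0, 1.0])                       # should reproduce s = (1, 2)
H_skip = sr1_update(H0, [3.0, 1.0], [3.0, 1.0])    # s = H y, so the update is skipped
```

Unlike DFP and BFGS, the SR1 update does not guarantee that a positive definite $H^{(k)}$ stays positive definite, which is one reason a safeguard of this kind is common.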
Quasi-Newton methods reduce the cost per iteration to $O(n^2)$ [5,6], but may suffer from loss of superlinear behavior and accumulation of numerical inaccuracies. In this context, hybrid algorithms [7,8,9,10] have emerged as a promising alternative by combining the strengths of different approaches to overcome the limitations of classical methods.
This article presents the Higher-Order Quasi-Newton (HOQN) method, a new hybrid algorithm with cubic local convergence order, which combines Newton-type techniques with high-order correctors constructed as vector extensions of the scalar methods of Chun, Ostrowski, and Traub. These vector extensions satisfy the optimality condition established by the Cordero–Torregrosa conjecture [11], according to which the convergence order of any Newton-type iterative method for solving nonlinear systems without memory cannot exceed
$$2^{k_1 + k_2 - 1},$$
where $k_1$ represents the number of functional evaluations appearing in the entries of the Jacobian matrix per iteration and $k_2$ represents the number of evaluations of the nonlinear function, with $k_1 \le k_2$. If a method reaches this upper bound, it is said to be optimal in the vectorial sense. In the proposed scheme, these high-order correctors are combined with quasi-Newton updates of the inverse Hessian matrix. Although the theoretical convergence of the method has been demonstrated in the context of convex functions, numerical experiments show that, in practice, the scheme also performs excellently in the minimization of nonconvex functions.
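A small worked example of the bound may help. The helper below simply evaluates $2^{k_1+k_2-1}$; its name and the input check $1 \le k_1 \le k_2$ are assumptions made for illustration.

```python
def optimal_order_bound(k1: int, k2: int) -> int:
    """Conjectured maximal order 2**(k1 + k2 - 1), assuming 1 <= k1 <= k2."""
    if k1 < 1 or k2 < k1:
        raise ValueError("expected 1 <= k1 <= k2")
    return 2 ** (k1 + k2 - 1)


# Newton's method: one Jacobian and one function evaluation per iteration.
newton_bound = optimal_order_bound(1, 1)
# Newton predictor plus one corrector reusing the same Jacobian:
# one Jacobian evaluation and two function evaluations per iteration.
two_step_bound = optimal_order_bound(1, 2)
```

With one Jacobian and two function evaluations the bound is 4, which is the order attributed in this article to the vector methods built on a Newton predictor and a single corrector.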
The HOQN scheme is applicable to any update formula, and even allows different update formulas after each hybrid step or nonlinear combinations of them. A single update formula for the inverse of the Hessian matrix can be used after the high-order step, or two quasi-Newton update formulas can be used: one after the Newton step and another after the high-order corrector. This flexibility provides greater robustness and adaptability for solving complex optimization problems.
The main contribution of this work is to show that the HOQN method provides faster and more stable convergence than classical methods, due to the integration of higher-order corrections together with quasi-Newton updates. In addition, the dependence of the computational performance of the HOQN method on the initial estimate is studied by representing the basins of attraction for three benchmark functions: Himmelblau, Freudenstein–Roth, and Booth [12]. New variants of the hybrid algorithm with limited memory are also proposed, which are validated in the training of convolutional neural networks, obtaining excellent overall performance.
This manuscript is organized as follows. Section 1 provides an introduction to the study, outlining the state of the art in quasi-Newton methods and presenting the preliminary concepts that form the theoretical foundation of the research. Section 2 reviews the vector extensions of the scalar methods of Chun, Ostrowski, and Traub, and introduces a new optimal vector variant of Traub’s method inspired by the construction proposed by Singh et al. in [13]. Section 3 presents the general HOQN framework, including the one-update and two-update hybrid quasi-Newton schemes. Section 4 establishes the local convergence analysis of the proposed methods and proves the cubic convergence order under suitable smoothness and inverse-Hessian consistency assumptions. Section 5 reports the numerical experiments, including full-memory tests on classical benchmark functions, high-dimensional quadratic experiments, Dolan–Moré performance profiles, dynamical-plane analysis, and the limited-memory HOQN variants. Section 6 discusses the practical application of the proposed limited-memory hybrid optimizers to the training of a convolutional neural network on the MNIST dataset. Finally, Section 7 presents the main conclusions and outlines possible directions for future research.

Preliminary Concepts

In order to establish the theoretical foundations for the study of the proposed hybrid optimization methods, it is necessary to recall some classical definitions from optimization theory. These notions provide the mathematical framework for guaranteeing convergence properties and characterizing the behavior of the algorithms. In the sections devoted to vector extensions of iterative methods for nonlinear systems, we use the notation $F:\mathbb{R}^n\to\mathbb{R}^n$ for a nonlinear mapping and $F'(x)$ for its Jacobian. In the particular case of unconstrained optimization, this mapping is identified with the gradient, that is, $F(x)=\nabla f(x)$. Hence, the first-order necessary condition for a local minimizer of $f$ can be written as the nonlinear system $F(x)=0$.
Definition 1
(Convex Function). A function $f:\mathbb{R}^n\to\mathbb{R}$ is convex [14] if, for any pair of points $x_1, x_2\in\mathbb{R}^n$ and any $\lambda\in[0,1]$, the following inequality holds:
$$f\left(\lambda x_1 + (1-\lambda)x_2\right) \le \lambda f(x_1) + (1-\lambda) f(x_2).$$
In addition to convexity, it is often useful to quantify how rapidly a function can change. This leads to the notion of Lipschitz continuity, which provides a bound on the variation of a function with respect to its inputs.
Definition 2
(Lipschitz Continuity of the Gradient). Let $f:\mathbb{R}^n\to\mathbb{R}$ be a continuously differentiable function. We say that the gradient $\nabla f$ is Lipschitz continuous [15,16] on a set $D\subseteq\mathbb{R}^n$ if there exists a constant $L\ge 0$ such that, for all $x, y\in D$,
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x-y\|,$$
where $\|\cdot\|$ denotes a norm in $\mathbb{R}^n$, and $L$ is the Lipschitz constant.
When analyzing iterative optimization algorithms, line search techniques play a key role in guaranteeing sufficient progress at each iteration. For this purpose, the strong Wolfe conditions are frequently imposed.
Definition 3
(Strong Wolfe Conditions). An optimization algorithm satisfies the strong Wolfe conditions [17,18,19] if, at each iteration $k$, the search direction $d^{(k)}$ (with $\nabla f(x^{(k)})^T d^{(k)} < 0$) and the step size $\alpha^{(k)}>0$ jointly satisfy
$$\text{Sufficient decrease (Armijo):}\quad f\left(x^{(k)} + \alpha^{(k)} d^{(k)}\right) \le f(x^{(k)}) + c_1\,\alpha^{(k)}\,\nabla f(x^{(k)})^T d^{(k)},$$
$$\text{Curvature condition:}\quad \left|\nabla f\left(x^{(k)} + \alpha^{(k)} d^{(k)}\right)^T d^{(k)}\right| \le c_2\,\left|\nabla f(x^{(k)})^T d^{(k)}\right|,$$
where $x^{(k)}$ is the current iterate, $\nabla f(x^{(k)})$ is the gradient of the objective function at $x^{(k)}$, and the constants satisfy $0 < c_1 < c_2 < 1$ (typically $c_1\in[10^{-6},10^{-4}]$ and $c_2\approx 0.9$).
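The two conditions of Definition 3 can be checked for a trial step as in the following sketch. The function name `satisfies_strong_wolfe`, the default constants, and the one-dimensional test problem are assumptions chosen for illustration only.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def satisfies_strong_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions for the trial step x + alpha * d.

    Assumes 0 < c1 < c2 < 1 and that d is a descent direction at x.
    """
    slope0 = dot(grad(x), d)
    if slope0 >= 0.0:
        raise ValueError("d must be a descent direction")
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    sufficient_decrease = f(x_new) <= f(x) + c1 * alpha * slope0
    curvature = abs(dot(grad(x_new), d)) <= c2 * abs(slope0)
    return sufficient_decrease and curvature


# Hypothetical test problem: f(x) = x^2 with steepest-descent direction at x = 1.
f = lambda x: x[0] ** 2
grad = lambda x: [2.0 * x[0]]
x, d = [1.0], [-2.0]

ok_exact = satisfies_strong_wolfe(f, grad, x, d, alpha=0.5)    # lands on the minimizer
too_long = satisfies_strong_wolfe(f, grad, x, d, alpha=1.0)    # no decrease in f
too_short = satisfies_strong_wolfe(f, grad, x, d, alpha=0.01)  # slope barely reduced
```

In this example the overlong step fails the sufficient-decrease test and the very short step fails the curvature test, which shows how the two conditions bracket acceptable step lengths from both sides.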
Finally, to ensure that quasi-Newton updates preserve desirable convergence properties, one of the most widely used criteria is the Dennis–Moré condition, which characterizes the superlinear behavior of these methods. This condition will play a fundamental role in demonstrating that the quasi-Newton updates embedded in the HOQN scheme preserve its local order of convergence.
Definition 4
(Dennis–Moré Condition [5,20]). Let $\{x^{(k)}\}$ be a sequence generated by a quasi-Newton method converging to a local minimizer $\xi$, and let $\{H^{(k)}\}$ denote the corresponding sequence of inverse Hessian approximations. The sequence $\{H^{(k)}\}$ is said to satisfy the Dennis–Moré condition at $\xi$ if
$$\lim_{k\to\infty} \frac{\left\|\left(H^{(k+1)} - [\nabla^2 f(\xi)]^{-1}\right) s^{(k)}\right\|}{\|s^{(k)}\|} = 0.$$

2. Vector Extensions of Traub’s, Chun’s, and Ostrowski’s Methods

Although the methods of Chun, Ostrowski and Traub were originally developed for solving nonlinear equations, their extension to the context of unconstrained optimization arises naturally by imposing the condition $\nabla f(x)=0$. Likewise, the methods of Chun [21] and Ostrowski [22], widely recognized for their high-order properties, provide a relevant conceptual basis for the development of vector extensions oriented toward optimization problems. For the design of optimal vectorial extensions of Chun’s and Ostrowski’s methods, the strategy proposed by Cordero et al. [23] is particularly relevant. This strategy is based on the fundamental idea introduced by Singh et al. [13], who present a two-step iterative scheme with fifth-order convergence for solving nonlinear systems. This scheme includes a first step based on a standard Newton iteration,
$$z^{(k)} = x^{(k)} - \left[F'(x^{(k)})\right]^{-1} F(x^{(k)}),$$
which computes a predictor $z^{(k)}$ that is refined in the subsequent step. Here, $F:\mathbb{R}^n\to\mathbb{R}^n$ defines the nonlinear system $F(x)=0$, and $F'(x)$ denotes the Jacobian matrix of $F$ evaluated at $x$. Unless otherwise stated, all norms considered throughout this work are Euclidean norms. The scheme then includes a second step, where a weighted corrective term is introduced, based on the ratio $\|F(z^{(k)})\|^2/\|F(x^{(k)})\|^2$ of the squared norms of the function evaluated at $z^{(k)}$ and $x^{(k)}$:
$$x^{(k+1)} = z^{(k)} - \left(1 + \frac{\|F(z^{(k)})\|^2}{\|F(x^{(k)})\|^2}\right)\left[F'(z^{(k)})\right]^{-1} F(z^{(k)}).$$
In [23], Cordero et al. extend this idea by parametrizing the iterative scheme through the Ermakov Hyperfamily for Systems, with expression:
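$$z^{(k)} = x^{(k)} - \left[F'(x^{(k)})\right]^{-1} F(x^{(k)}),\qquad x^{(k+1)} = z^{(k)} - \left[F'(x^{(k)})\right]^{-1}\left[p_k F(z^{(k)}) + q_k F(x^{(k)})\right],\qquad k=0,1,\ldots, \tag{5}$$
where
$$\nu_k = \frac{\|F(z^{(k)})\|^2}{\|F(x^{(k)})\|^2} = \frac{F(z^{(k)})^T F(z^{(k)})}{F(x^{(k)})^T F(x^{(k)})},\qquad K_k = \frac{1}{1+\lambda\nu_k},\qquad p_k = K_k\,(1+\psi\nu_k),\qquad q_k = 2K_k\nu_k,$$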
obtaining, from (5), as special cases, the optimal vector versions of the scalar methods of Chun and Ostrowski, which we denote by $M_1$ and $M_2$, respectively. For the optimal vector version $M_1$, setting $\lambda=0$ and $\psi=1$, the iteration is expressed as
$$x^{(k+1)} = z^{(k)} - \left[F'(x^{(k)})\right]^{-1}\left[(1+\nu_k)\,F(z^{(k)}) - 2\nu_k\,F(x^{(k)})\right].$$
On the other hand, by choosing $\lambda=-4$ and $\psi=0$, we obtain the optimal vector version $M_2$, whose iterative expression is
$$x^{(k+1)} = z^{(k)} - \frac{1}{1-4\nu_k}\left[F'(x^{(k)})\right]^{-1}\left[F(z^{(k)}) + 2\nu_k\,F(x^{(k)})\right].$$
Traub’s method [24], in its multidimensional form,
$$z^{(k)} = x^{(k)} - \left[F'(x^{(k)})\right]^{-1} F(x^{(k)}),\qquad x^{(k+1)} = z^{(k)} - \left[F'(x^{(k)})\right]^{-1} F(z^{(k)}),$$
constitutes a third-order multistep scheme that serves as one of the fundamental precursors to the hybrid constructions considered in this work. To align the Traub method with the idea introduced by Singh et al. in [13], we employ a technique based on weight functions depending on the acceleration parameter $\nu_k$, through which a new fourth-order variant of this method is generated. The incorporation of scalar weight functions defined in terms of $\nu_k$ allows us to preserve the essential structure of Traub’s method while, at the same time, endowing it with an additional acceleration mechanism that improves its local behavior. In this sense, the resulting variant can be interpreted as a natural extension of Traub’s method within the class of multistep methods with acceleration parameters. The following theorem formalizes this construction, establishes the corresponding error equation, and proves that the resulting method possesses fourth-order local convergence.
Theorem 1.
Let $F: D\subseteq\mathbb{R}^n\to\mathbb{R}^n$ be a sufficiently differentiable function in an open convex neighborhood $D$ of $\xi$, where $\xi$ is a simple solution of $F(x)=0$; that is, $F(\xi)=0$ and $F'(\xi)$ is nonsingular. Assume that $x^{(0)}$ is an initial approximation sufficiently close to $\xi$. Consider the iterative scheme
$$z^{(k)} = x^{(k)} - \left[F'(x^{(k)})\right]^{-1} F(x^{(k)}),\qquad x^{(k+1)} = x^{(k)} - H(\nu_k)\left[F'(x^{(k)})\right]^{-1} F(x^{(k)}) - G(\nu_k)\left[F'(x^{(k)})\right]^{-1} F(z^{(k)}), \tag{10}$$
where $e^{(k)} = x^{(k)} - \xi$, $\nu_k = \dfrac{\|F(z^{(k)})\|^2}{\|F(x^{(k)})\|^2}$, and $C_j = \dfrac{1}{j!}\left[F'(\xi)\right]^{-1} F^{(j)}(\xi)$, $j\ge 2$.
Then, the sequence $\{x^{(k)}\}_{k\ge 0}$ converges locally to $\xi$ with order of convergence 4 if and only if
$$H_0 = H(0) = 1,\qquad G_0 = G(0) = 1,\qquad H_1 = H'(0) = 2,\qquad |G_1| = |G'(0)| < +\infty.$$
In this case, the error equation is
$$e^{(k+1)} = \left(2C_2^3 - 6C_2C_3 + 3C_4 - G_1C_2^3 + 2P^{-1}QC_2^2\right)(e^{(k)})^4 + O\left((e^{(k)})^5\right),$$
with $P$ and $Q$ as defined in the proof, and, therefore, the method is of order 4.
Proof. 
Let $e^{(k)} = x^{(k)} - \xi$ denote the error. The Taylor expansions of $F(x^{(k)})$ and $F'(x^{(k)})$ around $\xi$ are
$$F(x^{(k)}) = F'(\xi)\left[e^{(k)} + C_2 (e^{(k)})^2 + C_3 (e^{(k)})^3 + C_4 (e^{(k)})^4\right] + O\left((e^{(k)})^5\right),$$
and
$$F'(x^{(k)}) = F'(\xi)\left[I + 2C_2 e^{(k)} + 3C_3 (e^{(k)})^2 + 4C_4 (e^{(k)})^3\right] + O\left((e^{(k)})^4\right),$$
where $C_j = \dfrac{1}{j!}\left[F'(\xi)\right]^{-1} F^{(j)}(\xi)$, $j\ge 2$. Furthermore, imposing that the inverse of the Jacobian satisfies
$$\left[F'(x^{(k)})\right]^{-1} F'(x^{(k)}) = F'(x^{(k)})\left[F'(x^{(k)})\right]^{-1} = I,$$
we obtain
$$\left[F'(x^{(k)})\right]^{-1} = \left[I + X_2 e^{(k)} + X_3 (e^{(k)})^2 + X_4 (e^{(k)})^3\right]\left[F'(\xi)\right]^{-1} + O\left((e^{(k)})^4\right),$$
where
$$X_2 = -2C_2,\qquad X_3 = 4C_2^2 - 3C_3,\qquad X_4 = -2\left(4C_2^3 + 2C_4 - 3C_2C_3 - 3C_3C_2\right).$$
Therefore, the expansion of $A_k = \left[F'(x^{(k)})\right]^{-1} F(x^{(k)})$ is given by
$$A_k = e^{(k)} - C_2 (e^{(k)})^2 + 2\left(-C_3 + C_2^2\right)(e^{(k)})^3 + B\,(e^{(k)})^4 + O\left((e^{(k)})^5\right),$$
where $B = -4C_2^3 - 3C_4 + 4C_2C_3 + 3C_3C_2$.
Let us now calculate the error of Newton’s step. Since $z^{(k)} = x^{(k)} - A_k$, it follows that $e_z^{(k)} = z^{(k)} - \xi = e^{(k)} - A_k$. Therefore,
$$e_z^{(k)} = C_2 (e^{(k)})^2 - 2\left(-C_3 + C_2^2\right)(e^{(k)})^3 - B\,(e^{(k)})^4 + O\left((e^{(k)})^5\right),$$
that is,
$$e_z^{(k)} = C_2 (e^{(k)})^2 - 2\left(-C_3 + C_2^2\right)(e^{(k)})^3 + \left(4C_2^3 + 3C_4 - 4C_2C_3 - 3C_3C_2\right)(e^{(k)})^4 + O\left((e^{(k)})^5\right). \tag{14}$$
Now let us expand $F(z^{(k)})$ around $\xi$ using the result obtained in (14):
$$F(z^{(k)}) = F'(\xi)\left[e_z^{(k)} + C_2 (e_z^{(k)})^2 + C_3 (e_z^{(k)})^3 + C_4 (e_z^{(k)})^4\right] + O\left((e_z^{(k)})^5\right) = F'(\xi)\left[C_2 (e^{(k)})^2 + A\,(e^{(k)})^3 + \left(B + C_2^3\right)(e^{(k)})^4\right] + O\left((e^{(k)})^5\right),$$
where $A = -2\left(-C_3 + C_2^2\right)$ and, from here on, $B = 4C_2^3 + 3C_4 - 4C_2C_3 - 3C_3C_2$ (that is, $B$ now denotes the opposite of the coefficient used in the expansion of $A_k$). Therefore,
$$F(z^{(k)}) = F'(\xi)\left[C_2 (e^{(k)})^2 + 2\left(C_3 - C_2^2\right)(e^{(k)})^3 + \left(5C_2^3 + 3C_4 - 4C_2C_3 - 3C_3C_2\right)(e^{(k)})^4\right] + O\left((e^{(k)})^5\right).$$
Consequently,
$$B_k = \left[F'(x^{(k)})\right]^{-1} F(z^{(k)}) = C_2 (e^{(k)})^2 + 2\left(C_3 - 2C_2^2\right)(e^{(k)})^3 + B_4\,(e^{(k)})^4 + O\left((e^{(k)})^5\right),$$
where $B_4 = 3C_4 - 8C_2C_3 - 6C_3C_2 + 13C_2^3$.
Let us now proceed to determine the Taylor expansion of $\nu_k$,
$$\nu_k = \frac{\|F(z^{(k)})\|^2}{\|F(x^{(k)})\|^2} = \frac{F(z^{(k)})^T F(z^{(k)})}{F(x^{(k)})^T F(x^{(k)})}. \tag{19}$$
Let $F:\mathbb{R}^n\to\mathbb{R}^n$ be sufficiently differentiable. For every $x = (x_1, x_2, \ldots, x_n)^T\in\mathbb{R}^n$, it can be written as
$$F(x) = \left(f_1(x), f_2(x), \ldots, f_n(x)\right)^T,$$
where each coordinate function $f_i:\mathbb{R}^n\to\mathbb{R}$ is scalar-valued. Observe that, although $F(x^{(k)})$ and $F(z^{(k)})$ are vectors in $\mathbb{R}^n$, the quotient $\nu_k$ is a scalar quantity. To manipulate it explicitly, Singh et al. [13] express it as
$$\frac{F(z^{(k)})^T F(z^{(k)})}{F(x^{(k)})^T F(x^{(k)})} = \frac{\sum_{i=1}^n f_i^2(z^{(k)})}{\sum_{i=1}^n f_i^2(x^{(k)})}.$$
Expanding $f_i(x^{(k)})$ and $f_i(z^{(k)})$ in Taylor series about $\xi$, and using $f_i(\xi)=0$, we obtain
$$f_i(x^{(k)}) = f_i'(\xi)\,e^{(k)} + \frac{1}{2} f_i''(\xi)\,(e^{(k)})^2 + \frac{1}{6} f_i'''(\xi)\,(e^{(k)})^3 + O\left((e^{(k)})^4\right), \tag{20}$$
$$f_i(z^{(k)}) = f_i'(\xi)\,e_z^{(k)} + \frac{1}{2} f_i''(\xi)\,(e_z^{(k)})^2 + \frac{1}{6} f_i'''(\xi)\,(e_z^{(k)})^3 + O\left((e_z^{(k)})^4\right), \tag{21}$$
where $f_i'(x) = \left(\dfrac{\partial f_i}{\partial x_1}, \ldots, \dfrac{\partial f_i}{\partial x_n}\right)$ denotes the gradient of $f_i$ written as a row vector, and $f_i''(x) = \left(\dfrac{\partial^2 f_i}{\partial x_j\,\partial x_k}\right)_{n\times n}$ is the Hessian matrix of $f_i$. Higher-order terms correspond to the multilinear derivatives of third order.
We rewrite (20) and (21) as
$$f_i(x^{(k)}) = R_i\,e^{(k)} + H_i\,(e^{(k)})^2 + K_i\,(e^{(k)})^3 + O\left((e^{(k)})^4\right),$$
$$f_i(z^{(k)}) = R_i\,e_z^{(k)} + H_i\,(e_z^{(k)})^2 + K_i\,(e_z^{(k)})^3 + O\left((e_z^{(k)})^4\right),$$
where $R_i = f_i'(\xi)$, $H_i = \frac{1}{2} f_i''(\xi)$, and $K_i = \frac{1}{6} f_i'''(\xi)$. Then
$$f_i^2(x^{(k)}) = P_i\,(e^{(k)})^2 + Q_i\,(e^{(k)})^3 + S_i\,(e^{(k)})^4 + O\left((e^{(k)})^5\right),$$
$$f_i^2(z^{(k)}) = P_i\,(e_z^{(k)})^2 + Q_i\,(e_z^{(k)})^3 + S_i\,(e_z^{(k)})^4 + O\left((e_z^{(k)})^5\right),$$
where
$$P_i = R_i^T R_i,\qquad Q_i = R_i^T H_i + H_i^T R_i,\qquad S_i = R_i^T K_i + K_i^T R_i + H_i^T H_i,\qquad P = \sum_{i=1}^n P_i,\qquad Q = \sum_{i=1}^n Q_i,\qquad S = \sum_{i=1}^n S_i.$$
We use $e_z^{(k)} = C_2 (e^{(k)})^2 + A\,(e^{(k)})^3 + B\,(e^{(k)})^4 + O\left((e^{(k)})^5\right)$ and compute
$$(e_z^{(k)})^2 = C_2^2 (e^{(k)})^4 + \left(C_2A + AC_2\right)(e^{(k)})^5 + \left(A^2 + C_2B + BC_2\right)(e^{(k)})^6 + O\left((e^{(k)})^7\right).$$
For $(e_z^{(k)})^3$, the dominant term is of order six, hence $(e_z^{(k)})^3 = C_2^3 (e^{(k)})^6 + O\left((e^{(k)})^7\right)$. Since $e_z^{(k)} = O\left((e^{(k)})^2\right)$, in order to compute $\nu_k$ up to order four it suffices to consider $P\,(e_z^{(k)})^2$ up to $(e^{(k)})^6$ and $Q\,(e_z^{(k)})^3$ at the term $C_2^3 (e^{(k)})^6$, while $S\,(e_z^{(k)})^4$ starts at order $(e^{(k)})^8$ and does not contribute. The numerator and the denominator, respectively, become:
$$\sum_{i=1}^n f_i^2(z^{(k)}) = PC_2^2 (e^{(k)})^4 + P\left(C_2A + AC_2\right)(e^{(k)})^5 + \left[P\left(A^2 + C_2B + BC_2\right) + 2QC_2^3\right](e^{(k)})^6 + O\left((e^{(k)})^7\right), \tag{27}$$
$$\sum_{i=1}^n f_i^2(x^{(k)}) = P\,(e^{(k)})^2 + Q\,(e^{(k)})^3 + S\,(e^{(k)})^4 + O\left((e^{(k)})^5\right). \tag{28}$$
Substituting (27) and (28) in (19) and expanding the quotient up to order four, we obtain
$$\nu_k = C_2^2 (e^{(k)})^2 + \left(2C_2C_3 + 2C_3C_2 - 4C_2^3 - P^{-1}QC_2^2\right)(e^{(k)})^3 + \Big(12C_2^4 + 4C_3^2 + 3\left(C_2C_4 + C_4C_2\right) - 8C_2^2C_3 - 7C_3C_2^2 - 7C_2C_3C_2 + 5P^{-1}QC_2^3 - 2P^{-1}QC_2C_3 - 2P^{-1}QC_3C_2 + \left[(P^{-1}Q)^2 - P^{-1}S\right]C_2^2\Big)(e^{(k)})^4 + O\left((e^{(k)})^5\right).$$
Since $\nu_k\to 0$ as $k\to\infty$, the functions $H$ and $G$ can be expanded in Taylor series about $0$ as follows:
$$H(\nu_k) = H_0 + H_1\nu_k + H_2\nu_k^2 + O(\nu_k^3),\qquad G(\nu_k) = G_0 + G_1\nu_k + G_2\nu_k^2 + O(\nu_k^3).$$
For convenience, we introduce
$$A_k = \left[F'(x^{(k)})\right]^{-1} F(x^{(k)}),\qquad B_k = \left[F'(x^{(k)})\right]^{-1} F(z^{(k)}),$$
so that the iterative error equation can be written as
$$e^{(k+1)} = e^{(k)} - H(\nu_k)\,A_k - G(\nu_k)\,B_k. \tag{30}$$
Using the previous expansions, we have
$$A_k = e^{(k)} + O\left((e^{(k)})^2\right),\qquad B_k = O\left((e^{(k)})^2\right),$$
and, since $\nu_k = O\left((e^{(k)})^2\right)$,
$$\nu_k^2 A_k = O\left((e^{(k)})^5\right),\qquad \nu_k^2 B_k = O\left((e^{(k)})^6\right).$$
Therefore, the terms involving $\nu_k^2$ and higher powers do not affect the error equation up to order four.
Substituting these expansions into (30) and collecting powers of $e^{(k)}$, we obtain
$$e^{(k+1)} = (1-H_0)\,e^{(k)} + (H_0-G_0)\,C_2 (e^{(k)})^2 + \left[2(H_0-G_0)\,C_3 + (4G_0 - 2H_0 - H_1)\,C_2^2\right](e^{(k)})^3 + \Big[3(H_0-G_0)\,C_4 + (-4H_0 + 8G_0 - 2H_1)\,C_2C_3 + (-3H_0 + 6G_0 - H_1)\,C_3C_2 + (4H_0 - 13G_0 + 5H_1 - G_1)\,C_2^3 + H_1P^{-1}QC_2^2\Big](e^{(k)})^4 + O\left((e^{(k)})^5\right).$$
Therefore, the order-four conditions are obtained by canceling the coefficients of $e^{(k)}$, $(e^{(k)})^2$, and $(e^{(k)})^3$, namely
$$1 - H_0 = 0,\qquad H_0 - G_0 = 0,\qquad 4G_0 - 2H_0 - H_1 = 0,$$
which yield
$$H_0 = 1,\qquad G_0 = 1,\qquad H_1 = 2,$$
while $G_1$ remains a free finite parameter.
Consequently,
$$e^{(k+1)} = \left(2C_2^3 - 6C_2C_3 + 3C_4 - G_1C_2^3 + 2P^{-1}QC_2^2\right)(e^{(k)})^4 + O\left((e^{(k)})^5\right),$$
and, therefore, method (10) is of order 4.    □
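To make scheme (10) concrete, the following sketch applies it with the linear weights $H(\nu) = 1 + 2\nu$ and $G(\nu) = 1 + G_1\nu$, which satisfy the conditions of Theorem 1. The $2\times 2$ test system (with root $(1,2)$), the starting point, the helper names, and the use of Cramer's rule are all assumptions made for illustration; they do not come from the article.

```python
import math


def solve2(J, b):
    """Solve the 2x2 linear system J u = b by Cramer's rule."""
    (a11, a12), (a21, a22) = J
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]


def sqnorm(v):
    """Squared Euclidean norm."""
    return sum(t * t for t in v)


def weighted_traub_step(F, J, x, G1=0.0):
    """One step of scheme (10) with H(nu) = 1 + 2 nu and G(nu) = 1 + G1 nu."""
    Fx, Jx = F(x), J(x)
    newton = solve2(Jx, Fx)                        # [F'(x)]^{-1} F(x)
    z = [xi - ni for xi, ni in zip(x, newton)]     # Newton predictor
    Fz = F(z)
    nu = sqnorm(Fz) / sqnorm(Fx)
    rhs = [(1 + 2 * nu) * a + (1 + G1 * nu) * b for a, b in zip(Fx, Fz)]
    corr = solve2(Jx, rhs)                         # reuses the Jacobian at x
    return [xi - ci for xi, ci in zip(x, corr)]


# Hypothetical test system with root (1, 2).
def F(x):
    return [x[0] ** 2 + x[1] - 3.0, x[0] + x[1] ** 2 - 5.0]


def J(x):
    return [[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]]


x = [1.2, 1.8]
for _ in range(10):
    if math.sqrt(sqnorm(F(x))) < 1e-12:            # also avoids dividing by ||F(x)|| = 0
        break
    x = weighted_traub_step(F, J, x)
```

Both linear solves use the same Jacobian $F'(x^{(k)})$, so a single factorization could be reused in a larger-scale implementation; that is the practical benefit of keeping the weights scalar.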
From (10), with $H_0=1$, $G_0=1$, $H_1=2$ and $|G_1|<+\infty$, we obtain a new optimal vectorial variant of the scalar Traub method, denoted by $M_3$. The iterative scheme is given by
$$z^{(k)} = x^{(k)} - \left[F'(x^{(k)})\right]^{-1} F(x^{(k)}),\qquad x^{(k+1)} = x^{(k)} - \left[F'(x^{(k)})\right]^{-1}\left[(1+2\nu_k)\,F(x^{(k)}) + (1+G_1\nu_k)\,F(z^{(k)})\right],$$
where
$$\nu_k = \frac{\|F(z^{(k)})\|^2}{\|F(x^{(k)})\|^2}.$$
The coefficient $\nu_k$ provides the $M_1$, $M_2$, and $M_3$ methods with greater computational efficiency compared to their classical counterparts, which, although preserving fourth-order convergence, are computationally more expensive. This is particularly important in optimization, where reducing computational cost is essential for effectively tackling large-scale and complex problems. If $F:\mathbb{R}^n\to\mathbb{R}^n$ denotes the nonlinear system under consideration, then, in the particular case of optimization, $F$ is identified with the gradient vector of a scalar function $f:\mathbb{R}^n\to\mathbb{R}$, that is, $F(x)=\nabla f(x)$. Accordingly, these methods can be rewritten in terms of the gradient and the Hessian of $f$, and thus applied to approximate local extrema of $f$. Given an initial point $x^{(0)}\in\mathbb{R}^n$, the $M_1$, $M_2$, and $M_3$ methods generate sequences of points $x^{(k)}$ according to the following iterative formulas:
$$M_1:\quad x^{(k+1)} = z^{(k)} - \left[\nabla^2 f(x^{(k)})\right]^{-1}\left[(1+\nu_k)\,\nabla f(z^{(k)}) - 2\nu_k\,\nabla f(x^{(k)})\right], \tag{34}$$
$$M_2:\quad x^{(k+1)} = z^{(k)} - \frac{1}{1-4\nu_k}\left[\nabla^2 f(x^{(k)})\right]^{-1}\left[\nabla f(z^{(k)}) + 2\nu_k\,\nabla f(x^{(k)})\right], \tag{35}$$
$$M_3:\quad x^{(k+1)} = x^{(k)} - \left[\nabla^2 f(x^{(k)})\right]^{-1}\left[(1+2\nu_k)\,\nabla f(x^{(k)}) + (1+G_1\nu_k)\,\nabla f(z^{(k)})\right], \tag{36}$$
where $z^{(k)} = x^{(k)} - \left[\nabla^2 f(x^{(k)})\right]^{-1}\nabla f(x^{(k)})$ is the Newton predictor, and $\nu_k = \dfrac{\|\nabla f(z^{(k)})\|^2}{\|\nabla f(x^{(k)})\|^2}$.
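The gradient forms (34)–(36) can be tried on the Booth function $f(x_1,x_2) = (x_1 + 2x_2 - 7)^2 + (2x_1 + x_2 - 5)^2$, whose minimizer is $(1,3)$. The sketch below uses the exact Hessian; the helper names, the starting point, and the guard against a zero gradient are assumptions of this example, and the $M_3$ branch uses the $x^{(k)}$-based form.

```python
def solve2(A, b):
    """Solve the 2x2 linear system A u = b by Cramer's rule."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [(b[0] * a22 - a12 * b[1]) / det, (a11 * b[1] - a21 * b[0]) / det]


def sqnorm(v):
    return sum(t * t for t in v)


def booth_grad(x):
    r1 = x[0] + 2.0 * x[1] - 7.0
    r2 = 2.0 * x[0] + x[1] - 5.0
    return [2.0 * r1 + 4.0 * r2, 4.0 * r1 + 2.0 * r2]


def booth_hess(x):
    return [[10.0, 8.0], [8.0, 10.0]]   # constant: Booth is quadratic


def corrector_step(grad, hess, x, method="M1", G1=0.0):
    """One full-step iteration of M1, M2, or M3 in gradient form."""
    g, Hx = grad(x), hess(x)
    if sqnorm(g) == 0.0:
        return list(x)                   # already stationary
    z = [xi - ui for xi, ui in zip(x, solve2(Hx, g))]   # Newton predictor
    gz = grad(z)
    nu = sqnorm(gz) / sqnorm(g)
    if method == "M1":
        rhs, base = [(1 + nu) * a - 2 * nu * b for a, b in zip(gz, g)], z
    elif method == "M2":                 # assumes 1 - 4 nu != 0
        rhs, base = [(a + 2 * nu * b) / (1 - 4 * nu) for a, b in zip(gz, g)], z
    elif method == "M3":
        rhs, base = [(1 + 2 * nu) * b + (1 + G1 * nu) * a for a, b in zip(gz, g)], x
    else:
        raise ValueError("method must be 'M1', 'M2', or 'M3'")
    return [bi - ui for bi, ui in zip(base, solve2(Hx, rhs))]


results = {m: corrector_step(booth_grad, booth_hess, [0.0, 0.0], m)
           for m in ("M1", "M2", "M3")}
```

Because the Booth function is quadratic, the Newton predictor with the exact Hessian already lands on the minimizer, so $\nu_k = 0$ and all three correctors return the same point. A nonquadratic objective would be needed to see the correctors differ.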

3. Hybrid-Quasi-Newton Method (HOQN)

This algorithm combines Newton’s method with the high-order corrections given by (34)–(36) and a quasi-Newton update of the inverse Hessian based on the DFP scheme; see Algorithm 1. Accordingly, the $M_1$-DFP, $M_2$-DFP, and $M_3$-DFP variants are obtained. At each iteration, a Newton-type direction followed by a high-order correction is computed to define the search direction; the step size is determined via an inexact line search (Armijo or Wolfe), and the inverse Hessian approximation is updated through a quasi-Newton formula at the end of the iteration. Additionally, the scheme can be extended by incorporating an intermediate quasi-Newton update after the Newton step, for instance using BFGS, leading to the variants described in Remark 1. The higher-order corrections employed in the algorithm allow a local cubic convergence order to be achieved without significantly increasing the computational cost per iteration. This results in a faster reduction of the gradient norm and, consequently, in fewer iterations and function evaluations in nonconvex problems, where a pure Newton or a standard quasi-Newton method is less competitive. The general algorithm of the HOQN method is given by:
Algorithm 1 HOQN iteration with line search
Inputs: $x^{(k)}$, inverse Hessian approximation $H^{(k)}$, objective $f$, gradient $\nabla f$, $|G_1| < +\infty$
$g^{(k)} \leftarrow \nabla f(x^{(k)})$
$d_N^{(k)} \leftarrow -H^{(k)} g^{(k)}$
$\alpha_N^{(k)} \leftarrow \arg\min_{\alpha\in(0,\alpha_{\max}]} f(x^{(k)} + \alpha\, d_N^{(k)})$
$z^{(k)} \leftarrow x^{(k)} + \alpha_N^{(k)} d_N^{(k)}$
$g_z^{(k)} \leftarrow \nabla f(z^{(k)})$
$\nu_k \leftarrow \|g_z^{(k)}\|^2 / \|g^{(k)}\|^2$
if method = $M_1$-DFP then
    $d_{HO}^{(k)} \leftarrow -H^{(k)}\left[(1+\nu_k)\, g_z^{(k)} - 2\nu_k\, g^{(k)}\right]$
    $\alpha_{HO}^{(k)} \leftarrow \arg\min_{\alpha\in(0,\alpha_{\max}]} f(z^{(k)} + \alpha\, d_{HO}^{(k)})$
    $x^{(k+1)} \leftarrow z^{(k)} + \alpha_{HO}^{(k)} d_{HO}^{(k)}$
else if method = $M_2$-DFP then
    $d_{HO}^{(k)} \leftarrow -\dfrac{1}{1-4\nu_k}\, H^{(k)}\left[g_z^{(k)} + 2\nu_k\, g^{(k)}\right]$
    $\alpha_{HO}^{(k)} \leftarrow \arg\min_{\alpha\in(0,\alpha_{\max}]} f(z^{(k)} + \alpha\, d_{HO}^{(k)})$
    $x^{(k+1)} \leftarrow z^{(k)} + \alpha_{HO}^{(k)} d_{HO}^{(k)}$
else if method = $M_3$-DFP then
    $d_{HO}^{(k)} \leftarrow -H^{(k)}\left[(1+2\nu_k)\, g^{(k)} + (1+G_1\nu_k)\, g_z^{(k)}\right]$
    $\alpha_{HO}^{(k)} \leftarrow \arg\min_{\alpha\in(0,\alpha_{\max}]} f(x^{(k)} + \alpha\, d_{HO}^{(k)})$
    $x^{(k+1)} \leftarrow x^{(k)} + \alpha_{HO}^{(k)} d_{HO}^{(k)}$
end if
$g^{(k+1)} \leftarrow \nabla f(x^{(k+1)})$
$s^{(k)} \leftarrow x^{(k+1)} - x^{(k)}$
$y^{(k)} \leftarrow g^{(k+1)} - g^{(k)}$
$H^{(k+1)} \leftarrow \mathrm{DFPUpdate}(H^{(k)}, s^{(k)}, y^{(k)})$
Output: $x^{(k+1)}$, $H^{(k+1)}$
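Algorithm 1 can be sketched as follows. This is an illustrative reading of the pseudocode, not the authors' code. Several pieces are assumptions made here: the $\arg\min$ over $(0,\alpha_{\max}]$ is replaced by a crude search over the grid $\alpha_{\max}/2^j$ that returns a zero step if no trial point lowers $f$, the DFP update is skipped when $s^T y$ is not safely positive, and all function names and the Booth test problem are hypothetical choices.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    return [dot(row, v) for row in M]


def move(x, alpha, d):
    return [xi + alpha * di for xi, di in zip(x, d)]


def grid_line_search(f, x, d, alpha_max=1.0, halvings=40):
    """Best step among alpha_max / 2**j; returns 0.0 if none lowers f."""
    best_alpha, best_val = 0.0, f(x)
    alpha = alpha_max
    for _ in range(halvings):
        val = f(move(x, alpha, d))
        if val < best_val:
            best_alpha, best_val = alpha, val
        alpha *= 0.5
    return best_alpha


def dfp_update(H, s, y, eps=1e-12):
    sy = dot(s, y)
    if sy <= eps:                      # curvature safeguard (assumption)
        return [row[:] for row in H]
    Hy = matvec(H, y)
    yHy = dot(y, Hy)
    n = len(s)
    return [[H[i][j] + s[i] * s[j] / sy - Hy[i] * Hy[j] / yHy
             for j in range(n)] for i in range(n)]


def hoqn_step(f, grad, x, H, method="M1", G1=0.0):
    """One HOQN iteration with a DFP update; requires grad(x) != 0."""
    g = grad(x)
    d_n = [-t for t in matvec(H, g)]                     # quasi-Newton direction
    z = move(x, grid_line_search(f, x, d_n), d_n)        # predictor
    gz = grad(z)
    nu = dot(gz, gz) / dot(g, g)
    if method == "M1":
        w, base = [(1 + nu) * a - 2 * nu * b for a, b in zip(gz, g)], z
    elif method == "M2":                                 # assumes 1 - 4 nu != 0
        w, base = [(a + 2 * nu * b) / (1 - 4 * nu) for a, b in zip(gz, g)], z
    elif method == "M3":
        w, base = [(1 + 2 * nu) * b + (1 + G1 * nu) * a for a, b in zip(gz, g)], x
    else:
        raise ValueError("method must be 'M1', 'M2', or 'M3'")
    d_ho = [-t for t in matvec(H, w)]                    # high-order direction
    x_new = move(base, grid_line_search(f, base, d_ho), d_ho)
    s = [a - b for a, b in zip(x_new, x)]
    y = [a - b for a, b in zip(grad(x_new), g)]
    return x_new, dfp_update(H, s, y)


def booth(x):
    return (x[0] + 2 * x[1] - 7) ** 2 + (2 * x[0] + x[1] - 5) ** 2


def booth_grad(x):
    r1, r2 = x[0] + 2 * x[1] - 7, 2 * x[0] + x[1] - 5
    return [2 * r1 + 4 * r2, 4 * r1 + 2 * r2]


x0 = [0.0, 0.0]
H0 = [[1.0, 0.0], [0.0, 1.0]]
x1, H1 = hoqn_step(booth, booth_grad, x0, H0, method="M1")
```

In practice `hoqn_step` would be called repeatedly, feeding back the returned pair, until the gradient norm falls below a tolerance. A Wolfe-type line search, as mentioned in the text, would be the more usual choice than the grid search used in this sketch.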
Remark 1.
If an intermediate quasi-Newton update is incorporated, that is, if an additional update is performed after the Newton step in Algorithm 1, for instance using a BFGS update, the following variants are obtained: $BM_1D$, $BM_2D$, and $BM_3D$.

4. Convergence Analysis of the HOQN Method

Theorem 2
(Local cubic error equation for the $M_i$-DFP HOQN variants). Let $F=\nabla f:\mathbb{R}^n\to\mathbb{R}^n$ be sufficiently differentiable in a neighborhood of a strict local minimizer $\xi$, so that the Taylor expansions used in Theorem 1 hold. Assume that
$$F(\xi) = 0,\qquad F'(\xi) = \nabla^2 f(\xi) \succ 0.$$
Let
$$e^{(k)} = x^{(k)} - \xi,\qquad C_j = \frac{1}{j!}\left[F'(\xi)\right]^{-1} F^{(j)}(\xi),\quad j\ge 2.$$
As in Theorem 1, the notation $O\left((e^{(k)})^p\right)$ is understood in the vectorial sense; for a matrix sequence, $E^{(k)} = O(e^{(k)})$ means $\|E^{(k)}\| \le C\,\|e^{(k)}\|$ for some constant $C>0$.
Consider the one-update variants $M_1$-DFP, $M_2$-DFP, and $M_3$-DFP, obtained from the correctors $M_1$, $M_2$, and $M_3$ in (34), (35), and (36), respectively, by replacing $\left[F'(x^{(k)})\right]^{-1}$ with $H^{(k)}$. Assume that the iteration is in the local full-step regime,
$$\alpha_N^{(k)} = \alpha_{HO}^{(k)} = 1,\qquad z^{(k)} = x^{(k)} - H^{(k)} F(x^{(k)}),\qquad \nu_k = \frac{\|F(z^{(k)})\|^2}{\|F(x^{(k)})\|^2},$$
for all sufficiently large $k$. For the $M_2$-variant, assume $1-4\nu_k\neq 0$, which holds locally. For the $M_3$-variant, the corrector is understood in the $x^{(k)}$-based form
$$x^{(k+1)} = x^{(k)} - H^{(k)}\left[(1+2\nu_k)\,F(x^{(k)}) + (1+G_1\nu_k)\,F(z^{(k)})\right],\qquad |G_1| < +\infty.$$
Assume the first-order local inverse-Hessian consistency condition
$$H^{(k)} = \left[F'(x^{(k)})\right]^{-1} + E^{(k)},\qquad E^{(k)} = O(e^{(k)}). \tag{37}$$
Define
$$R_k := E^{(k)} F'(\xi),\qquad Z_k := C_2 (e^{(k)})^2 - R_k\, e^{(k)},$$
and
$$\vartheta_k := \frac{\|F'(\xi)\, Z_k\|^2}{\|F'(\xi)\, e^{(k)}\|^2}.$$
Then
$$Z_k = O\left((e^{(k)})^2\right),\qquad \vartheta_k = O\left(\|e^{(k)}\|^2\right),\qquad \nu_k = \vartheta_k + O\left(\|e^{(k)}\|^3\right).$$
Moreover, the explicit local error recurrences are
$$M_1\text{-DFP}:\quad e^{(k+1)} = \mathcal{C}_{1,k}^{(3)} + O\left((e^{(k)})^4\right), \tag{39}$$
$$M_2\text{-DFP}:\quad e^{(k+1)} = \mathcal{C}_{2,k}^{(3)} + O\left((e^{(k)})^4\right), \tag{40}$$
$$M_3\text{-DFP}:\quad e^{(k+1)} = \mathcal{C}_{3,k}^{(3)} + O\left((e^{(k)})^4\right), \tag{41}$$
where the cubic contributions are explicitly given by
$$\mathcal{C}_{1,k}^{(3)} = 2C_2\, e^{(k)}\left(C_2 (e^{(k)})^2 - R_k e^{(k)}\right) - R_k\left(C_2 (e^{(k)})^2 - R_k e^{(k)}\right) + 2\vartheta_k\, e^{(k)}, \tag{42}$$
$$\mathcal{C}_{2,k}^{(3)} = 2C_2\, e^{(k)}\left(C_2 (e^{(k)})^2 - R_k e^{(k)}\right) - R_k\left(C_2 (e^{(k)})^2 - R_k e^{(k)}\right) - 2\vartheta_k\, e^{(k)}, \tag{43}$$
$$\mathcal{C}_{3,k}^{(3)} = \mathcal{C}_{2,k}^{(3)}. \tag{44}$$
Consequently, there exists $K>0$ such that
$$\|e^{(k+1)}\| \le K\,\|e^{(k)}\|^3$$
for all sufficiently large $k$. Hence, each $M_i$-DFP HOQN variant has local convergence of order at least three. If
$$\liminf_{k\to\infty} \frac{\|\mathcal{C}_{i,k}^{(3)}\|}{\|e^{(k)}\|^3} > 0,$$
then the corresponding local convergence order is exactly three.
Proof. 
We use the Taylor expansions established in Theorem 1. In particular,
$$\left[F'(x^{(k)})\right]^{-1} F(x^{(k)}) = e^{(k)} - C_2 (e^{(k)})^2 + 2\left(-C_3 + C_2^2\right)(e^{(k)})^3 + O\left((e^{(k)})^4\right).$$
Using (37), $R_k = E^{(k)} F'(\xi)$, and $R_k = O(e^{(k)})$, we obtain
$$H^{(k)} F(x^{(k)}) = e^{(k)} - C_2 (e^{(k)})^2 + 2\left(-C_3 + C_2^2\right)(e^{(k)})^3 + R_k e^{(k)} + R_k C_2 (e^{(k)})^2 + O\left((e^{(k)})^4\right).$$
Hence, for $e_z^{(k)} = z^{(k)} - \xi$,
$$e_z^{(k)} = Z_k + 2\left(C_3 - C_2^2\right)(e^{(k)})^3 - R_k C_2 (e^{(k)})^2 + O\left((e^{(k)})^4\right),$$
and therefore
$$Z_k = O\left((e^{(k)})^2\right),\qquad e_z^{(k)} = O\left((e^{(k)})^2\right).$$
Since $e_z^{(k)} = O\left((e^{(k)})^2\right)$,
$$F(z^{(k)}) = F'(\xi)\, e_z^{(k)} + O\left((e^{(k)})^4\right),\qquad \left[F'(x^{(k)})\right]^{-1} F'(\xi) = I - 2C_2 e^{(k)} + O\left((e^{(k)})^2\right).$$
Consequently,
$$e_z^{(k)} - H^{(k)} F(z^{(k)}) = 2C_2\, e^{(k)} Z_k - R_k Z_k + O\left((e^{(k)})^4\right). \tag{47}$$
Moreover, from (46),
$$F(z^{(k)}) = F'(\xi)\, Z_k + O\left((e^{(k)})^3\right),\qquad F(x^{(k)}) = F'(\xi)\, e^{(k)} + O\left((e^{(k)})^2\right),$$
which gives
$$\nu_k = \vartheta_k + O\left(\|e^{(k)}\|^3\right),\qquad \vartheta_k = O\left(\|e^{(k)}\|^2\right).$$
Also,
$$H^{(k)} F(x^{(k)}) = e^{(k)} + O\left((e^{(k)})^2\right),\qquad H^{(k)} F(z^{(k)}) = O\left((e^{(k)})^2\right).$$
Thus,
$$\nu_k H^{(k)} F(x^{(k)}) = \vartheta_k\, e^{(k)} + O\left((e^{(k)})^4\right),\qquad \nu_k H^{(k)} F(z^{(k)}) = O\left((e^{(k)})^4\right). \tag{49}$$
For $M_1$-DFP, (47) and (49) yield
$$e^{(k+1)} = 2C_2\, e^{(k)} Z_k - R_k Z_k + 2\vartheta_k\, e^{(k)} + O\left((e^{(k)})^4\right),$$
which proves (39) and (42). For $M_2$-DFP, using
$$(1-4\nu_k)^{-1} = 1 + 4\nu_k + O\left(\|e^{(k)}\|^4\right),$$
together with (47) and (49), gives
$$e^{(k+1)} = 2C_2\, e^{(k)} Z_k - R_k Z_k - 2\vartheta_k\, e^{(k)} + O\left((e^{(k)})^4\right),$$
which proves (40) and (43).
For $M_3$-DFP, using the $x^{(k)}$-based form and $e^{(k)} - H^{(k)} F(x^{(k)}) = e_z^{(k)}$, we obtain
$$e^{(k+1)} = e_z^{(k)} - H^{(k)} F(z^{(k)}) - 2\nu_k H^{(k)} F(x^{(k)}) - G_1\nu_k H^{(k)} F(z^{(k)}).$$
The last term is $O\left((e^{(k)})^4\right)$, since $|G_1| < +\infty$, $\nu_k = O\left(\|e^{(k)}\|^2\right)$, and $H^{(k)} F(z^{(k)}) = O\left((e^{(k)})^2\right)$.
Therefore,
$$e^{(k+1)} = 2C_2\, e^{(k)} Z_k - R_k Z_k - 2\vartheta_k\, e^{(k)} + O\left((e^{(k)})^4\right),$$
which proves (41) and (44).
Finally, since $R_k = O(e^{(k)})$, $Z_k = O\left((e^{(k)})^2\right)$, and $\vartheta_k = O\left(\|e^{(k)}\|^2\right)$, each $\mathcal{C}_{i,k}^{(3)}$ is $O\left((e^{(k)})^3\right)$.
Hence,
$$\|e^{(k+1)}\| \le K\,\|e^{(k)}\|^3$$
for some $K>0$ and all sufficiently large $k$. If the normalized cubic contribution does not vanish asymptotically, the order is exactly three; otherwise, it is at least three.    □
Remark 2
(Interpretation of the cubic error equation). The cubic contributions of $M_2^{\mathrm{DFP}}$ and $M_3^{\mathrm{DFP}}$ coincide:
$$C_{2,k}^{(3)} = C_{3,k}^{(3)}.$$
This is due to the common dominant term $-2\nu_k H^{(k)}F(x^{(k)})$; their first method-dependent difference appears at fourth order. Indeed, if
$$A_k := H^{(k)}F(x^{(k)}), \qquad B_k := H^{(k)}F(z^{(k)}),$$
then
$$e_{M_2}^{(k+1)} - e_{M_3}^{(k+1)} = (G_1 - 4)\,\nu_k B_k + O(\|e^{(k)}\|^5) = (G_1 - 4)\,\vartheta_k Z_k + O(\|e^{(k)}\|^5).$$
The quasi-Newton perturbation enters the dominant cubic term through $R_k = E^{(k)}F'(\xi)$. If locally
$$\|E^{(k)}\| \le \eta\,\|e^{(k)}\|, \qquad \rho := \eta\,\|F'(\xi)\|,$$
then
$$\|R_k\| \le \rho\,\|e^{(k)}\|, \qquad \|Z_k\| \le a\,\|e^{(k)}\|^2, \qquad a := \|C_2\| + \rho.$$
With
$$\kappa_* := \|F'(\xi)\|\,\|[F'(\xi)]^{-1}\|,$$
we obtain
$$\vartheta_k \le \kappa_*^2 a^2\,\|e^{(k)}\|^2.$$
Thus, one may take
$$\|e^{(k+1)}\| \le K_{\mathrm{HOQN}}\,\|e^{(k)}\|^3 + O(\|e^{(k)}\|^4), \qquad K_{\mathrm{HOQN}} = 2\|C_2\|\,a + \rho\,a + 2\kappa_*^2 a^2.$$
This bound shows that the cubic constant depends on the local curvature coefficient $C_2$, the conditioning of $F'(\xi)$, and the first-order inverse-Hessian perturbation.
For comparison, a pure quasi-Newton step
$$x^{(k+1)} = x^{(k)} - H^{(k)}F(x^{(k)})$$
satisfies, under (37),
$$e^{(k+1)} = C_2 (e^{(k)})^2 - R_k e^{(k)} + O(\|e^{(k)}\|^3).$$
Hence, even under $H^{(k)} = [F'(x^{(k)})]^{-1} + O(\|e^{(k)}\|)$, a pure quasi-Newton step is at most quadratic in this expansion. Under the Dennis–Moré condition alone, one obtains superlinear convergence, but no fixed cubic order is guaranteed in general. The HOQN correction cancels the quadratic residual and shifts the leading term to order three. Although exact high-order correctors may reach fourth order, the quasi-Newton approximation and the line-search setting reduce the effective HOQN local order to cubic, which still improves on the usual superlinear behavior of classical quasi-Newton schemes.
The next lemma shows that the cubic local order obtained in Theorem 2 is preserved by the two-update BFGS–DFP strategy.
Lemma 1
(Invariance of cubic local convergence under BFGS–DFP double updating). Assume the hypotheses of Theorem 2. Let $F = \nabla f$, and let $H_N^{(k)}$ denote the inverse Hessian approximation at the beginning of the $k$-th HOQN iteration. We consider the complete two-update variants $BM_1D$, $BM_2D$, and $BM_3D$, consisting of a Newton-type predictor, an intermediate BFGS update, a high-order corrector, and a final DFP update. In the local full-step regime,
$$z^{(k)} = x^{(k)} - H_N^{(k)}F(x^{(k)}), \qquad \nu_k = \frac{\|F(z^{(k)})\|^2}{\|F(x^{(k)})\|^2}.$$
After this predictor, a BFGS update (2) is applied with
$$s_N^{(k)} = z^{(k)} - x^{(k)}, \qquad y_N^{(k)} = F(z^{(k)}) - F(x^{(k)}),$$
producing $\hat{H}^{(k)}$. The high-order correction is then computed using $\hat{H}^{(k)}$. In the local full-step regime, the correction stages of the complete two-update variants are represented by
$$BM_1D:\quad x^{(k+1)} = z^{(k)} - \hat{H}^{(k)}\bigl[(1 + \nu_k)F(z^{(k)}) - 2\nu_k F(x^{(k)})\bigr], \tag{50}$$
$$BM_2D:\quad x^{(k+1)} = z^{(k)} - \frac{1}{1 - 4\nu_k}\,\hat{H}^{(k)}\bigl[F(z^{(k)}) + 2\nu_k F(x^{(k)})\bigr], \tag{51}$$
$$BM_3D:\quad x^{(k+1)} = x^{(k)} - \hat{H}^{(k)}\bigl[(1 + 2\nu_k)F(x^{(k)}) + (1 + G_1\nu_k)F(z^{(k)})\bigr], \qquad |G_1| < +\infty. \tag{52}$$
Here, $BM_iD$ denotes the complete two-update hybrid iteration; the displayed formulas are its local full-step correction maps. For $BM_2D$, $1 - 4\nu_k \ne 0$ locally.
After $x^{(k+1)}$ is computed, a DFP update (1) is applied using $\hat{H}^{(k)}$ and
$$s_D^{(k)} = x^{(k+1)} - x^{(k)}, \qquad y_D^{(k)} = F(x^{(k+1)}) - F(x^{(k)}),$$
yielding $H_N^{(k+1)}$. Assume that the BFGS and DFP updates are well defined, that their curvature conditions hold, and that
$$H_N^{(k)} = [F'(x^{(k)})]^{-1} + E_{N,k}, \qquad \hat{H}^{(k)} = [F'(x^{(k)})]^{-1} + \hat{E}_k,$$
with
$$E_{N,k} = O(\|e^{(k)}\|), \qquad \hat{E}_k = O(\|e^{(k)}\|).$$
Assume also that the final DFP update propagates first-order local consistency:
$$H_N^{(k+1)} = [F'(x^{(k+1)})]^{-1} + O(\|e^{(k+1)}\|).$$
Define
$$R_{N,k} := E_{N,k}F'(\xi), \qquad \hat{R}_k := \hat{E}_k F'(\xi), \qquad Z_{N,k} := C_2 (e^{(k)})^2 - R_{N,k}\,e^{(k)},$$
and
$$\vartheta_{N,k} := \frac{\|F'(\xi)Z_{N,k}\|^2}{\|F'(\xi)e^{(k)}\|^2}.$$
Then
$$Z_{N,k} = O(\|e^{(k)}\|^2), \qquad \vartheta_{N,k} = O(\|e^{(k)}\|^2), \qquad \nu_k = \vartheta_{N,k} + O(\|e^{(k)}\|^3).$$
Moreover,
$$BM_1D:\quad e^{(k+1)} = \hat{C}_{1,k}^{(3)} + O(\|e^{(k)}\|^4), \tag{54}$$
$$BM_2D:\quad e^{(k+1)} = \hat{C}_{2,k}^{(3)} + O(\|e^{(k)}\|^4), \tag{55}$$
$$BM_3D:\quad e^{(k+1)} = \Delta_k + \hat{C}_{2,k}^{(3)} + O(\|e^{(k)}\|^4), \tag{56}$$
where
$$\Delta_k := \bigl(H_N^{(k)} - \hat{H}^{(k)}\bigr)F(x^{(k)})$$
and
$$\hat{C}_{1,k}^{(3)} = 2C_2 e^{(k)} Z_{N,k} - \hat{R}_k Z_{N,k} + 2\vartheta_{N,k}\,e^{(k)},$$
$$\hat{C}_{2,k}^{(3)} = 2C_2 e^{(k)} Z_{N,k} - \hat{R}_k Z_{N,k} - 2\vartheta_{N,k}\,e^{(k)}.$$
Consequently, $BM_1D$ and $BM_2D$ preserve the cubic local order. For $BM_3D$, the same conclusion holds provided that
$$\Delta_k = \bigl(H_N^{(k)} - \hat{H}^{(k)}\bigr)F(x^{(k)}) = O(\|e^{(k)}\|^3).$$
Equivalently, at this order,
$$\bigl(R_{N,k} - \hat{R}_k\bigr)e^{(k)} = O(\|e^{(k)}\|^3).$$
In particular,
$$\hat{H}^{(k)} - H_N^{(k)} = O(\|e^{(k)}\|^2)$$
implies the compatibility condition above. Under this condition, all three complete two-update variants preserve the cubic local convergence order.
Proof. 
Let
$$e = e^{(k)}, \qquad e_z^{(k)} = z^{(k)} - \xi.$$
Repeating the derivation of (46) for the predictor matrix $H_N^{(k)}$ gives
$$e_z^{(k)} = Z_{N,k} + 2(C_3 - C_2^2)e^3 - R_{N,k}C_2 e^2 + O(\|e\|^4).$$
Thus, $Z_{N,k} = O(\|e\|^2)$ and $e_z^{(k)} = O(\|e\|^2)$.
Next, repeating the cancellation argument leading to (47), now using $\hat{H}^{(k)}$, yields
$$e_z^{(k)} - \hat{H}^{(k)}F(z^{(k)}) = 2C_2 e Z_{N,k} - \hat{R}_k Z_{N,k} + O(\|e\|^4). \tag{60}$$
Similarly, by the arguments leading to (48) and (49),
$$\nu_k = \vartheta_{N,k} + O(\|e\|^3), \qquad \vartheta_{N,k} = O(\|e\|^2),$$
and
$$\nu_k \hat{H}^{(k)}F(x^{(k)}) = \vartheta_{N,k}\,e + O(\|e\|^4), \qquad \nu_k \hat{H}^{(k)}F(z^{(k)}) = O(\|e\|^4). \tag{61}$$
For $BM_1D$, substituting (60) and (61) into (50) gives
$$e^{(k+1)} = 2C_2 e Z_{N,k} - \hat{R}_k Z_{N,k} + 2\vartheta_{N,k}\,e + O(\|e\|^4),$$
which proves (54). For $BM_2D$, using
$$(1 - 4\nu_k)^{-1} = 1 + 4\nu_k + O(\|e\|^4)$$
in (51) gives
$$e^{(k+1)} = 2C_2 e Z_{N,k} - \hat{R}_k Z_{N,k} - 2\vartheta_{N,k}\,e + O(\|e\|^4),$$
which proves (55).
For $BM_3D$, since the predictor was computed with $H_N^{(k)}$,
$$e - \hat{H}^{(k)}F(x^{(k)}) = e_z^{(k)} + \bigl(H_N^{(k)} - \hat{H}^{(k)}\bigr)F(x^{(k)}) = e_z^{(k)} + \Delta_k.$$
Using this identity in (52), together with (60) and (61), and using $G_1\nu_k\hat{H}^{(k)}F(z^{(k)}) = O(\|e\|^4)$, we obtain
$$e^{(k+1)} = \Delta_k + 2C_2 e Z_{N,k} - \hat{R}_k Z_{N,k} - 2\vartheta_{N,k}\,e + O(\|e\|^4),$$
which proves (56).
Since
$$Z_{N,k} = O(\|e\|^2), \qquad R_{N,k} = O(\|e\|), \qquad \hat{R}_k = O(\|e\|), \qquad \vartheta_{N,k} = O(\|e\|^2),$$
the cubic order of $BM_1D$ and $BM_2D$ follows. For $BM_3D$, the same conclusion holds if $\Delta_k = O(\|e\|^3)$. Moreover,
$$F(x^{(k)}) = F'(\xi)\,e + O(\|e\|^2)$$
implies
$$\Delta_k = (E_{N,k} - \hat{E}_k)F(x^{(k)}) = (R_{N,k} - \hat{R}_k)\,e + O(\|e\|^3),$$
which gives the stated equivalence. The sufficient condition $\hat{H}^{(k)} - H_N^{(k)} = O(\|e^{(k)}\|^2)$ immediately implies $\Delta_k = O(\|e^{(k)}\|^3)$.
Finally, the DFP update is applied only after $x^{(k+1)}$ has been computed, so it does not enter the current local error equation. Its role is to generate $H_N^{(k+1)}$, and the assumed first-order consistency of this matrix supplies the induction hypothesis for the next iteration. Therefore, the BFGS–DFP double updating strategy preserves the cubic local convergence order under the stated assumptions.    □
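The local full-step correction maps (50)–(52) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `hoqn_full_step`, the list-based vector type, and the default `g1=0.0` are hypothetical choices, the two matrices `h_pred` and `h_corr` are passed in by the caller, and the intermediate BFGS update, the final DFP update, and the line searches of the actual algorithm are omitted.

```python
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(a: Matrix, v: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def sqnorm(v: Vector) -> float:
    """Squared Euclidean norm."""
    return sum(c * c for c in v)


def hoqn_full_step(grad: Callable[[Vector], Vector], x: Vector,
                   h_pred: Matrix, h_corr: Matrix,
                   variant: str = "BM1D", g1: float = 0.0) -> Vector:
    """One full-step predictor-corrector map (no line search, no updates).

    h_pred plays the role of H_N^(k), h_corr the role of hat-H^(k);
    g1 is the free finite parameter G_1 of the Traub-type variant.
    Assumes grad(x) is nonzero, so that nu is well defined.
    """
    gx = grad(x)
    # Predictor: z = x - H_N F(x)
    z = [xi - di for xi, di in zip(x, matvec(h_pred, gx))]
    gz = grad(z)
    nu = sqnorm(gz) / sqnorm(gx)
    if variant == "BM1D":    # Chun-type corrector, map (50)
        w = [(1 + nu) * a - 2 * nu * b for a, b in zip(gz, gx)]
        return [zi - si for zi, si in zip(z, matvec(h_corr, w))]
    if variant == "BM2D":    # Ostrowski-type corrector, map (51)
        w = [a + 2 * nu * b for a, b in zip(gz, gx)]
        scale = 1.0 / (1.0 - 4.0 * nu)
        return [zi - scale * si for zi, si in zip(z, matvec(h_corr, w))]
    if variant == "BM3D":    # Traub-type corrector, map (52)
        w = [(1 + 2 * nu) * b + (1 + g1 * nu) * a for a, b in zip(gz, gx)]
        return [xi - si for xi, si in zip(x, matvec(h_corr, w))]
    raise ValueError(f"unknown variant: {variant}")


def booth_grad(x: Vector) -> Vector:
    """Gradient of the Booth quadratic (constant Hessian [[10, 8], [8, 10]])."""
    return [10 * x[0] + 8 * x[1] - 34, 8 * x[0] + 10 * x[1] - 38]


# Exact inverse Hessian of the Booth function, used here for both stages.
H_EXACT = [[10 / 36, -8 / 36], [-8 / 36, 10 / 36]]
```

With the exact inverse Hessian of a quadratic, the predictor already lands on the minimizer, so $\nu_k$ is numerically zero and all three correctors return the predictor point. This is a consistency check only; it says nothing about the behavior with approximate matrices.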
To continue the formal analysis of the convergence properties of the HOQN method, we adopt the following notation and assumptions over the bounded level set
$$\mathcal{L} := \bigl\{x \in \mathbb{R}^n : f(x) \le f(x^{(0)})\bigr\}.$$
Throughout this work, the symbols $\preceq$, $\succeq$, and $\succ 0$ are understood in the Loewner sense for symmetric matrices; in particular, $H^{(k)} \succeq \lambda I$ means that $H^{(k)} - \lambda I$ is positive semidefinite, whereas $H^{(k)} \succ 0$ means that $H^{(k)}$ is symmetric positive definite.
(B1) The objective function $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable on an open neighborhood of $\mathcal{L}$, i.e., $f \in C^2(\mathbb{R}^n)$.
(B2) There exist constants $0 < m \le M$ such that $mI \preceq \nabla^2 f(x) \preceq MI$ for all $x \in \mathcal{L}$. In particular, $\nabla f$ is Lipschitz continuous on $\mathcal{L}$.
(B3) The initial matrix $H^{(0)}$ is symmetric positive definite, and the sequence $\{H^{(k)}\}$ generated by the quasi-Newton updates satisfies $H^{(k)} \succeq \lambda I$ for all $k \ge 0$, for some constant $\lambda > 0$.
(B4) The step sizes $\alpha_N^{(k)}$ and $\alpha_{HO}^{(k)}$ associated with the Newton and higher-order directions, respectively, are obtained by inexact line searches satisfying the strong Wolfe conditions with constants $0 < c_1 < c_2 < 1$.
Lemma 2
(Strong Descent Condition). Let
$$\underline{\lambda} := \inf_{k \ge 0} \lambda_{\min}\bigl(H^{(k)}\bigr) > 0,$$
where $\lambda_{\min}(H^{(k)})$ denotes the smallest eigenvalue of the symmetric positive-definite (SPD) matrix $H^{(k)}$. Under assumptions (B1)–(B4), the Newton direction $d_N^{(k)}$ and the high-order correction $d_{HO}^{(k)}(z^{(k)})$ at the $k$-th iteration of Algorithm 1 satisfy
$$\nabla f(x^{(k)})^T d_N^{(k)} \le -\underline{\lambda}\,\|\nabla f(x^{(k)})\|^2, \tag{62}$$
$$\nabla f(z^{(k)})^T d_{HO}^{(k)} \le -\underline{\lambda}\,\|\nabla f(z^{(k)})\|^2. \tag{63}$$
Consequently, the step sizes $\alpha_N^{(k)}, \alpha_{HO}^{(k)} > 0$ returned by two independent strong-Wolfe line searches yield the one-step decrease
$$f(x^{(k+1)}) \le f(x^{(k)}) - c_1\underline{\lambda}\Bigl[\alpha_N^{(k)}\|\nabla f(x^{(k)})\|^2 + \alpha_{HO}^{(k)}\|\nabla f(z^{(k)})\|^2\Bigr]. \tag{64}$$
Proof. 
Let $g^{(k)} := \nabla f(x^{(k)})$ and $g_z^{(k)} := \nabla f(z^{(k)})$ with $z^{(k)} := x^{(k)} + \alpha_N^{(k)} d_N^{(k)}$. Let $H^{(k)} \succ 0$ be the (inverse-Hessian) approximation at $x^{(k)}$ and assume $H^{(k)} \succeq \underline{\lambda} I$ with $\underline{\lambda} := \inf_{k \ge 0} \lambda_{\min}(H^{(k)}) > 0$. By the Rayleigh inequality,
$$g^{(k)T} H^{(k)} g^{(k)} \ge \lambda_{\min}\bigl(H^{(k)}\bigr)\|g^{(k)}\|^2 \ge \underline{\lambda}\,\|g^{(k)}\|^2.$$
Defining the Newton direction by $d_N^{(k)} := -H^{(k)} g^{(k)}$, we obtain
$$g^{(k)T} d_N^{(k)} = -g^{(k)T} H^{(k)} g^{(k)} \le -\underline{\lambda}\,\|g^{(k)}\|^2,$$
which proves (62).
We consider the three high-order directions used in Algorithm 1 and show, in each case, that
$$g_z^T d_{HO}^{(k)} \le -\underline{\lambda}\,\|g_z\|^2.$$
To analyze the uniform spectral bound associated with the $M_1$, $M_2$, and $M_3$ directions, we proceed separately for each high-order correction.
Since $\nu_k = \|g_z^{(k)}\|^2 / \|g^{(k)}\|^2$ and the iteration is considered before termination, it follows that $\nu_k > 0$. Moreover, since $H^{(k)} \succeq \underline{\lambda} I$, we have
$$g_z^{(k)T} H^{(k)} g_z^{(k)} \ge \underline{\lambda}\,\|g_z^{(k)}\|^2.$$
In addition, we assume that, in the local regime,
g z ( k ) T H ( k ) g ( k ) 0 .
(i) $M_1$ (HOChun): For the Chun-type direction, define
$$d_{HOChun}^{(k)} := -H^{(k)}\bigl[(1 + \nu_k)\,g_z^{(k)} - 2\nu_k\,g^{(k)}\bigr].$$
Then,
$$g_z^{(k)T} d_{HOChun}^{(k)} = -\bigl[(1 + \nu_k)\,g_z^{(k)T}H^{(k)}g_z^{(k)} - 2\nu_k\,g_z^{(k)T}H^{(k)}g^{(k)}\bigr] \le -\bigl[(1 + \nu_k)\,\underline{\lambda}\,\|g_z^{(k)}\|^2 - 2\nu_k\,g_z^{(k)T}H^{(k)}g^{(k)}\bigr] \le -\underline{\lambda}\,\|g_z^{(k)}\|^2.$$
Therefore,
$$g_z^{(k)T} d_{HOChun}^{(k)} \le -\underline{\lambda}\,\|g_z^{(k)}\|^2.$$
(ii) $M_2$ (HOOS): For the Ostrowski-type direction, define
$$d_{HOOS}^{(k)} := -\frac{1}{1 - 4\nu_k}\,H^{(k)}\bigl[g_z^{(k)} + 2\nu_k\,g^{(k)}\bigr].$$
Then,
$$g_z^{(k)T} d_{HOOS}^{(k)} = -\Bigl[\frac{1}{1 - 4\nu_k}\,g_z^{(k)T}H^{(k)}g_z^{(k)} + \frac{2\nu_k}{1 - 4\nu_k}\,g_z^{(k)T}H^{(k)}g^{(k)}\Bigr] \le -\frac{\underline{\lambda}}{1 - 4\nu_k}\,\|g_z^{(k)}\|^2,$$
provided that $1 - 4\nu_k > 0$. If the factor $\frac{1}{1 - 4\nu_k}$ is absorbed into the line-search step size, we obtain
$$g_z^{(k)T} d_{HOOS}^{(k)} \le -\underline{\lambda}\,\|g_z^{(k)}\|^2.$$
(iii) $M_3$ (HOT): For the Traub-type direction, define
$$d_{HOT}^{(k)} := -H^{(k)}\bigl[(1 + 2\nu_k)\,g^{(k)} + (1 + G_1\nu_k)\,g_z^{(k)}\bigr], \qquad G_1 \ge 0.$$
Then,
$$g_z^{(k)T} d_{HOT}^{(k)} = -\bigl[(1 + 2\nu_k)\,g_z^{(k)T}H^{(k)}g^{(k)} + (1 + G_1\nu_k)\,g_z^{(k)T}H^{(k)}g_z^{(k)}\bigr] \le -\bigl[(1 + 2\nu_k)\,g_z^{(k)T}H^{(k)}g^{(k)} + (1 + G_1\nu_k)\,\underline{\lambda}\,\|g_z^{(k)}\|^2\bigr] \le -\underline{\lambda}\,\|g_z^{(k)}\|^2.$$
Therefore,
$$g_z^{(k)T} d_{HOT}^{(k)} \le -\underline{\lambda}\,\|g_z^{(k)}\|^2.$$
In all three cases, (63) holds. Applying the first Wolfe condition at $x^{(k)}$, we obtain
$$f\bigl(x^{(k)} + \alpha_N^{(k)} d_N^{(k)}\bigr) \le f(x^{(k)}) + c_1\alpha_N^{(k)}\,g^{(k)T} d_N^{(k)}.$$
Using (62), it follows that
$$f(z^{(k)}) \le f(x^{(k)}) - c_1\underline{\lambda}\,\alpha_N^{(k)}\|g^{(k)}\|^2. \tag{65}$$
Now apply the first Wolfe condition at $z^{(k)}$ for the high-order step:
$$f\bigl(z^{(k)} + \alpha_{HO}^{(k)} d_{HO}^{(k)}\bigr) \le f(z^{(k)}) + c_1\alpha_{HO}^{(k)}\,g_z^{(k)T} d_{HO}^{(k)}.$$
Substituting (63) and then (65), we obtain
$$f(x^{(k+1)}) \le f(x^{(k)}) - c_1\underline{\lambda}\Bigl[\alpha_N^{(k)}\|g^{(k)}\|^2 + \alpha_{HO}^{(k)}\|g_z^{(k)}\|^2\Bigr],$$
which is exactly (64). The proof is complete.    □
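The Rayleigh-quotient bound behind (62) can be checked numerically on a small example. The sketch below is illustrative only: it assumes a $2 \times 2$ symmetric positive-definite matrix, uses the closed-form smallest eigenvalue of such a matrix, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def lambda_min_2x2(h: Matrix) -> float:
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = h[0][0], h[0][1], h[1][1]
    return 0.5 * (a + c) - math.hypot(0.5 * (a - c), b)


def quasi_newton_direction(h: Matrix, g: Vector) -> Vector:
    """Direction d = -H g for a 2x2 inverse-Hessian approximation H."""
    return [-(h[0][0] * g[0] + h[0][1] * g[1]),
            -(h[1][0] * g[0] + h[1][1] * g[1])]


def satisfies_descent_bound(h: Matrix, g: Vector, tol: float = 1e-12) -> bool:
    """Check g^T d <= -lambda_min(H) * ||g||^2 for d = -H g."""
    d = quasi_newton_direction(h, g)
    slope = g[0] * d[0] + g[1] * d[1]
    bound = -lambda_min_2x2(h) * (g[0] ** 2 + g[1] ** 2)
    return slope <= bound + tol
```

For $H = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ the eigenvalues are 1 and 3, and the bound is attained with equality when $g$ is an eigenvector of the smallest eigenvalue, for instance $g = (1, -1)^T$.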
Lemma 3
(Asymptotic gradient nulling). Let $\{x^{(k)}\}_{k \ge 0}$ be the sequence produced by Algorithm 1. Suppose (B1)–(B4) hold, and that the search directions generated by the algorithm satisfy the uniform-descent and bounded-direction properties stated in Lemma 2. Then
$$\lim_{k \to \infty} \|\nabla f(x^{(k)})\| = 0.$$
Proof. 
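We now prove this result.
(i) By Lemma 2, we know that
$$f(x^{(k+1)}) \le f(x^{(k)}) - \underline{\lambda}\Bigl[c_1\alpha_N^{(k)}\|g^{(k)}\|^2 + \bar{c}_1\alpha_{HO}^{(k)}\|g_z^{(k)}\|^2\Bigr].$$
(ii) Using (B2), $f$ attains a minimum on the set $\mathcal{L} = \{x \in \mathbb{R}^n : f(x) \le f(x^{(0)})\}$. Let $f_{\inf} := \inf_{x \in \mathcal{L}} f(x)$. Then $f_{\inf} \le f(x)$ for all $x \in \mathcal{L}$. Hence, for the iterates $\{x^{(k)}\}$ we obtain
$$\begin{aligned}
f(x^{(1)}) &\le f(x^{(0)}) - \underline{\lambda}\bigl[c_1\alpha_0^N\|g_0\|^2 + \bar{c}_1\alpha_0^{HO}\|g_{z_0}\|^2\bigr],\\
f(x^{(2)}) &\le f(x^{(1)}) - \underline{\lambda}\bigl[c_1\alpha_1^N\|g_1\|^2 + \bar{c}_1\alpha_1^{HO}\|g_{z_1}\|^2\bigr] \le f(x^{(0)}) - \underline{\lambda}\sum_{j=0}^{1}\bigl[c_1\alpha_j^N\|g_j\|^2 + \bar{c}_1\alpha_j^{HO}\|g_{z_j}\|^2\bigr],\\
f(x^{(3)}) &\le f(x^{(2)}) - \underline{\lambda}\bigl[c_1\alpha_2^N\|g_2\|^2 + \bar{c}_1\alpha_2^{HO}\|g_{z_2}\|^2\bigr] \le f(x^{(0)}) - \underline{\lambda}\sum_{j=0}^{2}\bigl[c_1\alpha_j^N\|g_j\|^2 + \bar{c}_1\alpha_j^{HO}\|g_{z_j}\|^2\bigr],
\end{aligned}$$
and, in general,
$$f(x^{(k+1)}) \le f(x^{(0)}) - \underline{\lambda}\sum_{j=0}^{k}\bigl[c_1\alpha_j^N\|g_j\|^2 + \bar{c}_1\alpha_j^{HO}\|g_{z_j}\|^2\bigr].$$
Letting $k \to \infty$ yields
$$f_{\inf} \le f(x^{(0)}) - \underline{\lambda}\Bigl[c_1\sum_{j=0}^{\infty}\alpha_j^N\|g_j\|^2 + \bar{c}_1\sum_{j=0}^{\infty}\alpha_j^{HO}\|g_{z_j}\|^2\Bigr],$$
hence both series
$$\sum_{j=0}^{\infty}\alpha_j^N\|g_j\|^2 \quad\text{and}\quad \sum_{j=0}^{\infty}\alpha_j^{HO}\|g_{z_j}\|^2$$
are convergent.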
(iii)
Now, let us prove that the step sizes $\alpha_N^{(k)}$ and $\alpha_{HO}^{(k)}$ are bounded away from zero whenever the corresponding gradients remain bounded away from zero. Let $\phi(\alpha) := f(x + \alpha d)$, so
$$\phi'(\alpha) = \nabla f(x + \alpha d)^T d \quad\text{and}\quad \phi'(0) = \nabla f(x)^T d = g^T d.$$
Assume $\nabla f$ is $L$-Lipschitz, that is,
$$\|\nabla f(u) - \nabla f(v)\| \le L\,\|u - v\|, \qquad \forall\, u, v.$$
We consider $u = x + \alpha d$ and $v = x$. Then,
$$\|\nabla f(x + \alpha d) - \nabla f(x)\| \le L\,\|(x + \alpha d) - x\| = L\alpha\,\|d\|.$$
Then, for any $\alpha > 0$,
$$\phi'(\alpha) - \phi'(0) = \bigl(\nabla f(x + \alpha d) - \nabla f(x)\bigr)^T d \le \|\nabla f(x + \alpha d) - \nabla f(x)\|\,\|d\| \le L\alpha\,\|d\|^2.$$
Hence,
$$\phi'(\alpha) - \phi'(0) \le L\alpha\,\|d\|^2. \tag{68}$$
Let $\alpha$ satisfy the strong Wolfe curvature condition
$$|\phi'(\alpha)| \le c_2\,|\phi'(0)|, \qquad 0 < c_2 < 1.$$
Then
$$-\phi'(\alpha) \le |\phi'(\alpha)| \le c_2\,|\phi'(0)|.$$
Combining with (68), we obtain
$$-\phi'(0) - L\alpha\,\|d\|^2 \le c_2\,|\phi'(0)|.$$
Thus,
$$\phi'(0) + c_2\,|\phi'(0)| \ge -L\alpha\,\|d\|^2.$$
Since $d$ is a descent direction, $\phi'(0) = g^T d < 0$ and thus $|\phi'(0)| = -\phi'(0)$. Replacing, we get
$$\phi'(0) + c_2\,|\phi'(0)| = \phi'(0) - c_2\,\phi'(0) = (1 - c_2)\,\phi'(0).$$
Therefore, the equivalent inequality is
$$-(1 - c_2)\,|\phi'(0)| \ge -L\alpha\,\|d\|^2 \iff (1 - c_2)\,|\phi'(0)| \le L\alpha\,\|d\|^2,$$
that is,
$$\alpha \ge \frac{(1 - c_2)\,|\phi'(0)|}{L\,\|d\|^2} = \frac{(1 - c_2)\,|g^T d|}{L\,\|d\|^2}. \tag{69}$$
Assume there exist constants $\underline{\lambda} > 0$ and $C > 0$ such that
$$|g^T d| \ge \underline{\lambda}\,\|g\|^2, \qquad \|d\| \le C\,\|g\|. \tag{70}$$
Then $\|d\|^2 \le C^2\|g\|^2$ and, substituting (70) into (69), we obtain
$$\alpha \ge \frac{(1 - c_2)\,\underline{\lambda}\,\|g\|^2}{L\,C^2\,\|g\|^2} = \frac{(1 - c_2)\,\underline{\lambda}}{L\,C^2} = \alpha_{\min} > 0.$$
Hence, if $\|g^{(k)}\| \ge \varepsilon$, then $\alpha_N^{(k)} \ge \alpha_{\min}$; similarly, if $\|g(z^{(k)})\| \ge \varepsilon$, then $\alpha_{HO}^{(k)} \ge \alpha_{\min}$.
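Assume by contradiction that $\limsup_{k \to \infty} \|g^{(k)}\| = \eta > 0$. Then, for any $\varepsilon > 0$, there exist infinitely many indices $k$ such that $\|g^{(k)}\| \ge \eta - \varepsilon$; choosing $\varepsilon = \eta/2$ yields a subsequence $\{k_\ell\}$ satisfying
$$\|g^{(k_\ell)}\| \ge \eta/2, \qquad \forall\, \ell.$$
By (iii), $\alpha_N^{(k_\ell)} \ge \alpha_{\min}$. Therefore,
$$\alpha_N^{(k_\ell)}\,\|g^{(k_\ell)}\|^2 \ge \alpha_{\min}\Bigl(\frac{\eta}{2}\Bigr)^2, \qquad \forall\, \ell.$$
Summing over $\ell$ yields
$$\sum_{\ell=1}^{\infty}\alpha_N^{(k_\ell)}\,\|g^{(k_\ell)}\|^2 \ge \sum_{\ell=1}^{\infty}\alpha_{\min}\Bigl(\frac{\eta}{2}\Bigr)^2 = \infty.$$
On the other hand, by (ii) we have
$$\sum_{k=0}^{\infty}\alpha_N^{(k)}\,\|g^{(k)}\|^2 < \infty,$$
which contradicts the divergence of the subseries above. Since
$$g^{(k)} := \nabla f(x^{(k)}),$$
the result follows immediately. Hence,
$$\lim_{k \to \infty}\|g^{(k)}\| = \lim_{k \to \infty}\|g(z^{(k)})\| = 0.$$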
   □
Remark 3
(Worst-case complexity for first-order stationarity). The descent estimate in Lemma 2 and the step-size lower bound established in Lemma 3 yield a standard worst-case complexity bound. Let
$$g^{(k)} := \nabla f(x^{(k)}), \qquad g_z^{(k)} := \nabla f(z^{(k)}), \qquad f_{\inf} := \inf_{x \in \mathcal{L}} f(x),$$
where $\mathcal{L} = \{x \in \mathbb{R}^n : f(x) \le f(x^{(0)})\}$. If, before termination,
$$\alpha_N^{(k)} \ge \alpha_{\min} > 0, \qquad \alpha_{HO}^{(k)} \ge \alpha_{\min} > 0, \qquad \sigma := \underline{\lambda}\,\alpha_{\min}\min\{c_1, \bar{c}_1\},$$
then Lemma 2 gives
$$f(x^{(k+1)}) \le f(x^{(k)}) - \sigma\bigl[\|g^{(k)}\|^2 + \|g_z^{(k)}\|^2\bigr].$$
Summing from $k = 0$ to $N - 1$, we obtain
$$\sigma\sum_{k=0}^{N-1}\bigl[\|g^{(k)}\|^2 + \|g_z^{(k)}\|^2\bigr] \le f(x^{(0)}) - f_{\inf}.$$
In particular,
$$\min_{0 \le k \le N-1}\|g^{(k)}\| \le \sqrt{\frac{f(x^{(0)}) - f_{\inf}}{\sigma N}}.$$
Therefore, to guarantee an index $0 \le k \le N - 1$ such that
$$\|\nabla f(x^{(k)})\| \le \varepsilon,$$
it is sufficient to take
$$N \ge \frac{f(x^{(0)}) - f_{\inf}}{\sigma\,\varepsilon^2}.$$
Hence, the worst-case number of outer iterations required to reach first-order stationarity is
$$N_\varepsilon = O(\varepsilon^{-2}).$$
Since each full-memory HOQN iteration involves a finite number of matrix-vector products and rank-two quasi-Newton updates, its dense algebraic cost is $O(n^2)$. Consequently, the dense worst-case arithmetic complexity is
$$O(n^2\varepsilon^{-2}).$$
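As a worked illustration of the bound above, the following sketch evaluates $\alpha_{\min}$ from Lemma 3 and the sufficient iteration count $N$ from this remark. The function and parameter names are hypothetical, and the constants passed in (Lipschitz constant, direction bound $C$, spectral bound $\underline{\lambda}$) are assumed to be known, which is rarely the case in practice; the result is a worst-case guarantee, not a prediction of observed iteration counts.

```python
import math


def step_size_lower_bound(c2: float, lam: float, lipschitz: float, c_dir: float) -> float:
    """alpha_min = (1 - c2) * lambda / (L * C^2), as in the step-size bound of Lemma 3."""
    return (1.0 - c2) * lam / (lipschitz * c_dir ** 2)


def outer_iteration_bound(f0: float, f_inf: float, lam: float, alpha_min: float,
                          c1: float, c1_bar: float, eps: float) -> int:
    """Smallest integer N with N >= (f0 - f_inf) / (sigma * eps^2),
    where sigma = lambda * alpha_min * min(c1, c1_bar)."""
    sigma = lam * alpha_min * min(c1, c1_bar)
    return math.ceil((f0 - f_inf) / (sigma * eps ** 2))
```

Halving the tolerance $\varepsilon$ multiplies the bound by four, which is the $O(\varepsilon^{-2})$ behavior stated in the remark.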
Lemma 4
(Dennis–Moré condition). Let $f : \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$ near $\xi$, with $\nabla^2 f(\xi) \succ 0$, and let $\{x^{(k)}\}$ be generated by Algorithm 1. Set
$$e^{(k)} := x^{(k)} - \xi, \qquad s^{(k)} := x^{(k+1)} - x^{(k)},$$
and define
$$d^{(k)} := -H^{(k)} g(x^{(k)}), \qquad d_N^{(k)} := -\nabla^2 f(\xi)^{-1} g(x^{(k)}).$$
Assume that $\{H^{(k)}\}$ is updated by (1), that $\{H^{(k)}\}$ is bounded, that $x^{(k)} \to \xi$ superlinearly, and that the Dennis–Moré condition holds:
$$\lim_{k \to \infty}\frac{\bigl\|\bigl(H^{(k)} - \nabla^2 f(\xi)^{-1}\bigr)s^{(k)}\bigr\|}{\|s^{(k)}\|} = 0.$$
Then
$$\frac{\|d^{(k)} - d_N^{(k)}\|}{\|d_N^{(k)}\|} \to 0.$$
Proof. 
Let
$$H_* := \nabla^2 f(\xi)^{-1}, \qquad g^{(k)} := g(x^{(k)}).$$
By definition,
$$d^{(k)} - d_N^{(k)} = -\bigl(H^{(k)} - H_*\bigr)g^{(k)}. \tag{71}$$
Using the first-order expansion of the gradient around $\xi$, and the superlinear convergence of $x^{(k)}$, we have
$$g^{(k)} = \nabla^2 f(\xi)\,e^{(k)} + O(\|e^{(k)}\|^2), \qquad s^{(k)} = e^{(k+1)} - e^{(k)} = -e^{(k)} + o(\|e^{(k)}\|).$$
Hence
$$e^{(k)} = -s^{(k)} + o(\|s^{(k)}\|), \qquad g^{(k)} = -\nabla^2 f(\xi)\,s^{(k)} + o(\|s^{(k)}\|).$$
Substituting this expression for $g^{(k)}$ into (71) gives
$$d^{(k)} - d_N^{(k)} = \bigl(H^{(k)} - H_*\bigr)\nabla^2 f(\xi)\,s^{(k)} + o(\|s^{(k)}\|).$$
Since $\nabla^2 f(\xi)$ is fixed and $\{H^{(k)}\}$ is bounded, there exists $C > 0$ such that
$$\|d^{(k)} - d_N^{(k)}\| \le C\,\bigl\|\bigl(H^{(k)} - H_*\bigr)s^{(k)}\bigr\| + o(\|s^{(k)}\|).$$
Dividing by $\|s^{(k)}\|$ and using the Dennis–Moré condition yields
$$\frac{\|d^{(k)} - d_N^{(k)}\|}{\|s^{(k)}\|} \to 0.$$
Moreover, since
$$d_N^{(k)} = -H_* g^{(k)} = s^{(k)} + o(\|s^{(k)}\|),$$
we have $\|d_N^{(k)}\| \sim \|s^{(k)}\|$. Therefore,
$$\frac{\|d^{(k)} - d_N^{(k)}\|}{\|d_N^{(k)}\|} \to 0,$$
which proves the result.    □

5. Numerical Tests

In this section, we numerically compare the proposed HOQN variants ($M_1^{\mathrm{DFP}}$, $M_2^{\mathrm{DFP}}$, $M_3^{\mathrm{DFP}}$, $BM_1D$, $BM_2D$, and $BM_3D$) with the classical quasi-Newton methods BFGS, DFP, and SR1, using the Booth test function $f_B$, the Himmelblau function $f_H$, and the Freudenstein–Roth function $f_{FR}$. All numerical experiments were carried out in MATLAB R2024. The iterative process was terminated when the gradient norm satisfied the stopping criterion $\|\nabla f(x^{(k)})\| < 10^{-6}$.
Although the global convergence analysis is developed under standard inexact line-search assumptions of Armijo/Wolfe type, the numerical implementation reported in this section computes the step sizes in both the predictor and corrector stages by bounded one-dimensional minimization using MATLAB’s fminbnd routine over the interval [ 0 , 10 ] . Thus, the theoretical framework provides sufficient conditions for the convergence analysis, whereas fminbnd is adopted as a practical step-selection procedure in the computational experiments. Next, we define the test functions:
  • $f_B(x_1, x_2) = (x_1 + 2x_2 - 7)^2 + (2x_1 + x_2 - 5)^2$;
  • $f_H(x_1, x_2) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2$;
  • $f_{FR}(x_1, x_2) = \bigl(x_1 + x_2(x_2(x_2 + 1) - 14) - 29\bigr)^2 + \bigl(-x_1 + x_2(x_2(x_2 - 5) + 2) + 13\bigr)^2$.
The Booth function $f_B$ is strictly convex and has a unique global minimizer at $(1, 3)^T$ with value $f_B\bigl((1, 3)^T\bigr) = 0$. The Himmelblau function has four local minima which are also global minima, all satisfying $f_H = 0$: $(3, 2)^T$, $(-2.805118, 3.131312)^T$, $(-3.779310, -3.283186)^T$, and $(3.584428, -1.848126)^T$. The Freudenstein–Roth function has a global minimizer at $(5, 4)^T$ with $f_{FR}\bigl((5, 4)^T\bigr) = 0$, a well-known local minimizer at approximately $(11.4128, -0.8968)^T$, and it also exhibits a saddle point at approximately $(23.9206, 2.2300)^T$. The 3D landscapes in Figure 1 visualize these geometries: Figure 1a corresponds to the convex, unimodal $f_B$, whereas Figure 1b and Figure 1c show the multimodal landscapes of $f_H$ and $f_{FR}$, respectively, providing intuition about the basins and valleys against which the HOQN variants are assessed.
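For readers who want to reproduce the setup, the three test functions can be written as follows. The experiments in the paper were run in MATLAB; this Python sketch is only a convenience, with hypothetical function names, and the signs are chosen so that the minimizers stated above evaluate to zero.

```python
def booth(x1: float, x2: float) -> float:
    """Booth function; unique global minimizer at (1, 3)."""
    return (x1 + 2 * x2 - 7) ** 2 + (2 * x1 + x2 - 5) ** 2


def himmelblau(x1: float, x2: float) -> float:
    """Himmelblau function; four global minimizers with value 0."""
    return (x1 ** 2 + x2 - 11) ** 2 + (x1 + x2 ** 2 - 7) ** 2


def freudenstein_roth(x1: float, x2: float) -> float:
    """Freudenstein-Roth function; global minimizer at (5, 4)."""
    r1 = x1 + x2 * (x2 * (x2 + 1) - 14) - 29
    r2 = -x1 + x2 * (x2 * (x2 - 5) + 2) + 13
    return r1 ** 2 + r2 ** 2
```

Evaluating `freudenstein_roth` at the approximate local minimizer $(11.4128, -0.8968)^T$ gives a value close to 49, which is why a descent method started in that basin stalls far from the global minimum value 0.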
A particularly important metric in numerical tests of iterative methods is the Approximate Computational Order of Convergence (ACOC), as it allows for the experimental validation of the theoretical order of convergence. In the context of hybrid optimization algorithms, estimating the convergence order, denoted by $\rho$, using the classical formula proposed in [25], often leads to unreliable results due to significant fluctuations between consecutive iterations. To obtain a more stable and consistent estimate of $\rho$, we adopt an approach based on the linear regression of the logarithms of successive errors $(\ln E_k, \ln E_{k+1})$. This methodology leverages global information from multiple iterations within the asymptotic regime, thereby reducing sensitivity to local fluctuations in individual errors and providing a more robust estimate than formulas based solely on the most recent iterations. In particular, the ACOC is calculated as the slope of the least-squares regression line that fits the data $\ln E_{k+1}$ versus $\ln E_k$, and is given by:
$$\rho = \frac{\sum_{k=k_0}^{m-1}\bigl(\ln E_k - \overline{\ln E_k}\bigr)\bigl(\ln E_{k+1} - \overline{\ln E_{k+1}}\bigr)}{\sum_{k=k_0}^{m-1}\bigl(\ln E_k - \overline{\ln E_k}\bigr)^2},$$
where $E_k = \|x^{(k)} - \xi\|$ denotes the error at iteration $k$, $m$ is the total number of iterations considered, $\overline{\ln E_k} = \frac{1}{N}\sum_{k=k_0}^{m-1}\ln E_k$, and $\overline{\ln E_{k+1}} = \frac{1}{N}\sum_{k=k_0}^{m-1}\ln E_{k+1}$, with $N = m - k_0$ denoting the number of data points used in the regression. Furthermore, $k_0$ denotes the first iteration from which the iterates are considered to lie in the asymptotic regime.
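The regression-based ACOC can be computed with a few lines of standard-library Python. This is a minimal sketch: the function name `acoc_regression` and its argument layout (a list `errors` holding $E_0, \dots, E_m$ and the index `k0`) are assumptions made for illustration, and the errors are assumed to be strictly positive.

```python
import math
from typing import Sequence


def acoc_regression(errors: Sequence[float], k0: int = 0) -> float:
    """Slope of the least-squares line fitting ln E_{k+1} against ln E_k.

    errors[k] is E_k = ||x^(k) - xi||; pairs (E_k, E_{k+1}) with k >= k0 are used.
    """
    logs = [math.log(e) for e in errors[k0:]]
    xs, ys = logs[:-1], logs[1:]
    n = len(xs)
    if n < 2:
        raise ValueError("at least two (E_k, E_{k+1}) pairs are required")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

For a synthetic sequence with $E_{k+1} = E_k^3$ the estimate equals 3 up to rounding, and for $E_{k+1} = E_k^2$ it equals 2. With real iterates, the choice of $k_0$ matters, since pre-asymptotic points bias the slope.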

5.1. Hybrid Optimization with Complete Memory

We begin by analyzing the full-memory variants of the proposed hybrid methods, comparing their numerical performance with classical quasi-Newton schemes on selected test functions. Table 1 and Table 2 present the numerical results obtained for the Himmelblau and Freudenstein–Roth functions, considering two different initial conditions. The classical quasi-Newton methods BFGS, DFP, and SR1 exhibit superlinear convergence, as is characteristic of this class of methods, requiring between five and six iterations to reach the minimum. On the other hand, with the exception of $M_1^{\mathrm{DFP}}$, the proposed hybrid methods consistently converge in fewer iterations for both functions and initial conditions, achieving observed convergence orders consistent with the theoretical cubic order and computational times competitive with the classical methods considered. Furthermore, in these nonconvex problems the final gradient norm $\|\nabla f(x^{(k+1)})\|$ can vary significantly depending on the initial condition used; however, the smallest values were obtained by some of the proposed hybrid variants, which indicates that these variants approximate the minimizer more accurately.
The reported computation time (CT(s)) in the numerical experiments corresponds to the total execution time of the iterative algorithm, measured in seconds, from the start of the process until the termination criterion is satisfied or the maximum number of iterations is reached. Consequently, this value includes the cost of all operations performed in each iteration. To obtain reliable time measurements, each method was run 15 times for each test problem, and the times reported in the tables correspond to the average of those runs. The numerical experiments were conducted on a 13-inch MacBook Air (2025) running macOS Sequoia 15.7.4, equipped with an Apple M4 chip and 16 GB of unified memory. This computational environment provides a consistent and up-to-date basis for the comparative evaluation of the methods under consideration.
The numerical results indicate that the efficiency of the methods strongly depends on the nature of the problem. In the nonconvex problems, such as the Himmelblau and Freudenstein–Roth functions, the hybrid methods reach the minimizer in fewer iterations and with a faster convergence rate than the classical BFGS and DFP methods, while also yielding smaller final gradient norms.
Let us now analyze the Booth function, which is a two-dimensional strictly convex quadratic problem with unique global minimizer $\xi = (1, 3)^T$. Table 3 reports the numerical results obtained from two different initial conditions. All the methods converge in two iterations and reach final gradient norms of order $10^{-14}$ or smaller. Since only two iterations are required, the approximate computational order of convergence cannot be reliably estimated; therefore, the value of $\rho$ is not reported.
The results in Table 3 should be interpreted as a consistency check for a simple strictly convex quadratic problem, not as evidence of a general optimality property of the proposed methods. In exact arithmetic, Newton’s method solves a strictly convex quadratic problem in one step when the exact inverse Hessian is used, while full-memory quasi-Newton schemes with line searches may display very fast finite termination on low-dimensional quadratic problems. Therefore, the two-iteration behavior observed for the Booth function is a finite-dimensional quadratic effect. Consequently, the Booth experiment confirms that all methods behave accurately on a convex quadratic benchmark, but it is not sufficient to assess scalability or general asymptotic behavior. For this reason, the Booth test is complemented below with high-dimensional quadratic experiments with prescribed Hessian condition numbers.

5.2. High-Dimensional Quadratic Scalability Test

Since the Booth function is only a two-dimensional strictly convex quadratic problem, its two-iteration behavior should not be interpreted as evidence of general scalability. To test the methods beyond the low-dimensional Booth case, we consider the strictly convex quadratic family
$$f_{n,\kappa}(x) = \tfrac{1}{2}x^T A_{n,\kappa}\,x - b^T x, \qquad A_{n,\kappa} = Q^T\Lambda_{n,\kappa}Q, \qquad \Lambda_{n,\kappa} = \operatorname{diag}\bigl(1, \kappa^{1/(n-1)}, \kappa^{2/(n-1)}, \dots, \kappa\bigr),$$
where $Q$ is an orthogonal matrix generated with a fixed random seed. Hence,
$$\nabla^2 f_{n,\kappa}(x) = A_{n,\kappa}, \qquad \kappa(A_{n,\kappa}) = \kappa.$$
The vector $b$ was chosen as $b = A_{n,\kappa}\,\xi$, with ξ = 1 n ( 1 , … , 1 ) T so that the exact minimizer is known. We used $x^{(0)} = 0$, $H^{(0)} = I$, and the stopping criterion $\|\nabla f(x^{(k)})\| \le 10^{-6}$. The step length was computed by exact one-dimensional minimization along each search direction, in order to isolate the cost of the quasi-Newton updates from line-search variability. The tested dimensions and condition numbers were $n \in \{100, 500, 1000\}$ and $\kappa \in \{10^2, 10^6\}$. To assess the expected dense $O(n^2)$ scaling, Table 4 reports the number of outer iterations and the normalized cost $\eta = \dfrac{CT}{\mathrm{Iter}\cdot n^2}$.
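The two quantities that define this test, the logarithmically spaced spectrum of $\Lambda_{n,\kappa}$ and the normalized cost $\eta$, are simple to compute. The sketch below uses hypothetical function names and deliberately leaves out the orthogonal matrix $Q$ and the solver itself; it only illustrates that the prescribed spectrum has condition number $\kappa$ and how $\eta$ is formed from a timing, an iteration count, and the dimension.

```python
from typing import List


def log_spaced_spectrum(n: int, kappa: float) -> List[float]:
    """Diagonal of Lambda_{n,kappa}: kappa^(i/(n-1)) for i = 0, ..., n-1 (n >= 2)."""
    return [kappa ** (i / (n - 1)) for i in range(n)]


def normalized_cost(ct_seconds: float, iterations: int, n: int) -> float:
    """eta = CT / (Iter * n^2), the per-iteration time scaled by the dense O(n^2) cost."""
    return ct_seconds / (iterations * n ** 2)
```

If $\eta$ stays at the same order of magnitude as $n$ grows, the measured time per iteration is scaling like $n^2$, which is the interpretation used for Table 4.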
All methods reached the prescribed tolerance in every tested case. The hybrid variant $BM_1D$ consistently required fewer outer iterations than both BFGS and $M_1^{\mathrm{DFP}}$. In the most demanding case, $n = 1000$, $\kappa = 10^6$, BFGS required 770 iterations and $M_1^{\mathrm{DFP}}$ required 800 iterations, whereas $BM_1D$ required 408 iterations. This confirms the reduction in outer steps produced by the predictor–corrector structure. Although $BM_1D$ performs two stages per outer iteration, the normalized cost $\eta$ remains of order $10^{-8}$ across all tested dimensions and condition numbers, consistent with the dense $O(n^2)$ cost of full-memory quasi-Newton updates. Hence, these experiments show that the proposed hybrid scheme extends beyond the two-dimensional Booth case, preserves the expected dense scaling, and substantially reduces the number of outer iterations on high-dimensional quadratic problems.

5.3. Dolan–Moré Performance Profiles and Robustness Assessment

To provide a broader numerical assessment, we complement the pointwise convergence results with Dolan–Moré performance profiles. Among the proposed hybrid variants, BM1D, BM2D and BM3D were selected for the global benchmark because they showed the most stable behavior in the preliminary screening, with fewer non-descent directions, fewer line-search breakdowns and more reliable quasi-Newton updating. These methods correspond, respectively, to the modified Chun-type, Ostrowski-type and Traub-type hybrid quasi-Newton schemes, and are compared with the classical quasi-Newton methods DFP, BFGS and SR1 [19]. Dolan–Moré profiles are a standard tool for benchmarking optimization solvers over a common test set [26].
Let $\mathcal{P}$ be the set of test instances and $\mathcal{S}$ the set of solvers. For each $p \in \mathcal{P}$ and $s \in \mathcal{S}$, let $t_{p,s}$ denote the computational cost required by solver $s$. If the solver fails to satisfy the stopping criterion within the computational budget, or produces non-finite values, we set $t_{p,s} = +\infty$. The performance ratio is
$$r_{p,s} = \frac{t_{p,s}}{\min_{\bar{s} \in \mathcal{S}} t_{p,\bar{s}}},$$
where the minimum is taken over the solvers that successfully solve problem $p$. If no solver solves a given instance, all ratios are treated as infinite and the instance contributes only to the failure-rate analysis. The Dolan–Moré profile of solver $s$ is defined by
$$\rho_s(\tau) = \frac{1}{|\mathcal{P}|}\,\bigl|\{p \in \mathcal{P} : r_{p,s} \le \tau\}\bigr|, \qquad \tau \ge 1.$$
Thus, $\rho_s(1)$ measures relative efficiency, whereas the limiting value of $\rho_s(\tau)$ for large $\tau$ measures robustness, since failed runs keep infinite cost.
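The profile definition translates directly into code. The following sketch assumes the costs are stored as a nested dictionary (problem name to solver name to cost), that successful costs are strictly positive, and that failures are encoded as `math.inf`; these data-layout choices and the function name are illustrative assumptions, not the benchmark code used for the reported tables.

```python
import math
from typing import Dict


def performance_profile(costs: Dict[str, Dict[str, float]], tau: float) -> Dict[str, float]:
    """Dolan-More profile value rho_s(tau) for every solver.

    costs[p][s] is the cost t_{p,s}; math.inf marks a failed run.
    """
    solvers = sorted({s for row in costs.values() for s in row})
    counts = {s: 0 for s in solvers}
    for row in costs.values():
        best = min(row.values())
        if math.isinf(best):
            # No solver succeeded: every ratio is infinite on this instance.
            continue
        for s in solvers:
            ratio = row.get(s, math.inf) / best
            if ratio <= tau:
                counts[s] += 1
    return {s: counts[s] / len(costs) for s in solvers}
```

Instances that no solver completes still count in the denominator $|\mathcal{P}|$, so they lower every profile equally, in line with the convention stated above.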
Following the reviewer’s recommendation, the benchmark set was enlarged beyond the basic two-dimensional tests. It consists of 30 base problems from classical unconstrained optimization benchmarks, the Moré–Garbow–Hillstrom family, the Wood problem, higher-dimensional extensions, and CUTEr/CUTEst-type analytic test instances [27,28,29,30]. The composition of the set is reported in Table 5. Each base problem was tested in five variants: one smooth version, two deterministic noisy versions, and two piecewise-smooth nonsmooth versions. Hence, the complete benchmark contains 30 × 5 = 150 instances and, with six solvers, 150 × 6 = 900 numerical runs.
The primary performance measure was the total computational work $t_{p,s} = N_f + N_g$, where $N_f$ and $N_g$ are the numbers of objective-function and gradient evaluations. This metric is appropriate because BM1D, BM2D and BM3D are two-stage schemes, whereas DFP, BFGS and SR1 are one-stage quasi-Newton methods. For completeness, we also report profiles based on the number of outer iterations, which highlight the reduction in iteration count achieved by the proposed methods. For smooth instances, a run was declared successful when $\|\nabla f(x^{(k)})\| \le 10^{-6}$. For noisy and nonsmooth variants, where the gradient norm may be less reliable, we used the function-reduction criterion
$$f(x^{(k)}) \le f_L + \tau_f\bigl(f(x^{(0)}) - f_L\bigr), \qquad \tau_f = 10^{-5}.$$
Here, $f_L$ denotes the best available reference value; when the exact value was unavailable, it was taken as the best value obtained over all solvers. This criterion follows the benchmarking philosophy of Moré and Wild for smooth, noisy and piecewise-smooth optimization problems [31]. For noisy variants, the solvers used perturbed information, but success was assessed with the corresponding unperturbed objective value. In addition to the performance profiles, we report the failure rate
$$\mathrm{FR}_{s,\mathcal{C}} = 1 - \frac{1}{|\mathcal{C}|}\,\bigl|\{p \in \mathcal{C} : t_{p,s} < +\infty\}\bigr|,$$
where $\mathcal{C} \subseteq \mathcal{P}$ denotes the smooth, noisy, nonsmooth, or full benchmark class. This measure complements the profiles, since a solver may be efficient on the instances it solves while still being unreliable under perturbations or nonsmoothness. Table 6 summarizes the Dolan–Moré results for the smooth subset of 30 instances. The first block uses the total work $N_f + N_g$, whereas the second block uses the number of outer iterations. The column “Best at $\tau = 1$” counts ties independently.
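The function-reduction success test and the failure rate can be sketched as follows. The function names and the list-based input are hypothetical, and failed runs are again assumed to be encoded as infinite cost.

```python
import math
from typing import Sequence


def is_successful(f_k: float, f_0: float, f_ref: float, tau_f: float = 1e-5) -> bool:
    """Function-reduction criterion f(x_k) <= f_L + tau_f * (f(x_0) - f_L)."""
    return f_k <= f_ref + tau_f * (f_0 - f_ref)


def failure_rate(solver_costs: Sequence[float]) -> float:
    """FR = 1 - (number of finite costs) / (number of instances in the class)."""
    solved = sum(1 for t in solver_costs if not math.isinf(t))
    return 1.0 - solved / len(solver_costs)
```

As a reading aid for the reported rates: a failure rate of 0.1333 on a class of 60 instances corresponds to 8 failed runs, assuming the noisy class contains the 60 instances implied by two noisy variants per base problem.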
The smooth-subset results show that BFGS and the three proposed hybrid methods solve all 30 instances, whereas DFP and SR1 show one and two failures, respectively. In terms of outer iterations, BM1D, BM2D and BM3D have median iteration counts of 4.5, 5.0 and 5.5, compared with 7.0, 8.5 and 6.5 for DFP, BFGS and SR1. BM2D attains the best iteration count in 23 instances, followed by BM1D in 21 and BM3D in 13. When total work $N_f + N_g$ is used, the comparison becomes more balanced, as expected for two-stage methods. Even so, all three hybrid methods solve every smooth instance and reach $\rho_s(5) = 1$. Among them, BM2D gives the most balanced smooth-subset behavior, with $\rho_s(2) = 0.9333$, full success, and lower median work than BM3D. To assess robustness under perturbations and loss of smoothness, Table 7 reports failure rates over the smooth, noisy and nonsmooth subsets, together with the overall rate over all 150 instances.
The failure-rate results indicate that the noisy variants are the most demanding part of the benchmark. In the smooth and nonsmooth classes, BM1D, BM2D and BM3D achieve zero failure rate, matching BFGS and improving over DFP and SR1. In the noisy class, BFGS has the lowest failure rate, $0.1167$, followed by BM2D and BM3D, both with $0.1333$. Overall, BM2D and BM3D are the most robust proposed methods, with global failure rates of $0.0533$, close to BFGS ($0.0467$). Figure 2 reports the empirical success rates by problem class, complementing Table 7.
The heat map confirms that the proposed methods solve all smooth and nonsmooth instances, while maintaining competitive success rates in the noisy class: 85.0% for BM1D and 86.7% for BM2D and BM3D.
Figure 3 shows the aggregated failure rates over the full benchmark. BFGS gives the lowest global failure rate, 4.67%, while BM2D and BM3D follow closely with 5.33%. The larger failure rates of SR1 and DFP, 8.67% and 9.33%, respectively, indicate greater sensitivity under the present perturbation setting.
The global Dolan–Moré profiles over the full benchmark are shown in Figure 4. The left panel uses the primary metric $N_f + N_g$, whereas the right panel uses the number of outer iterations.
The work-based profile in Figure 4a gives the most balanced comparison because it accounts for the additional evaluations required by the two-stage hybrid schemes. In this metric, BFGS and SR1 are highly competitive for small values of $\tau$, reflecting their lower cost per iteration. However, the proposed methods remain close to the best solvers over a wide range of $\tau$, with limiting values consistent with the low failure rates reported in Table 7. The iteration-based profile in Figure 4b highlights the main advantage of the proposed schemes: BM1D, BM2D and BM3D solve a large fraction of the test instances with fewer outer iterations than DFP, BFGS and SR1. Overall, Table 6 and Table 7, together with Figure 2, Figure 3 and Figure 4, show that the proposed hybrid schemes are robust and competitive on a heterogeneous benchmark set. BM2D and BM3D are the most robust proposed methods, whereas BM1D provides strong iteration reduction on several smooth instances. BFGS remains a very strong classical baseline in terms of total work and failure rate; therefore, the proposed methods should be interpreted as competitive hybrid alternatives that reduce outer iterations while preserving robust behavior under smooth, noisy and nonsmooth scenarios.
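For readers who wish to reproduce profiles of this kind, the following sketch computes the Dolan–Moré value $\rho_s(\tau)$, that is, the fraction of problems on which the cost of solver $s$ is within a factor $\tau$ of the best cost. The function name `performance_profile`, the solver labels, and the cost values are illustrative assumptions; failures are assumed to be stored as `math.inf`.

```python
import math

def performance_profile(costs, tau):
    """Return rho_s(tau) for every solver s.

    `costs[s][p]` is the cost t_{p,s} of solver s on problem p
    (math.inf for a failure). rho_s(tau) is the fraction of problems
    whose ratio t_{p,s} / min_s t_{p,s} does not exceed tau.
    """
    solvers = list(costs)
    n_problems = len(costs[solvers[0]])
    best = [min(costs[s][p] for s in solvers) for p in range(n_problems)]
    profile = {}
    for s in solvers:
        count = 0
        for p in range(n_problems):
            t = costs[s][p]
            # A failed run never counts, whatever the value of tau
            if t < math.inf and t <= tau * best[p]:
                count += 1
        profile[s] = count / n_problems
    return profile

# Made-up costs for two hypothetical solvers on three problems
costs = {
    "solver_A": [10.0, 20.0, math.inf],
    "solver_B": [20.0, 10.0, 30.0],
}
for tau in (1.0, 2.0):
    print(tau, performance_profile(costs, tau))
```

At $\tau = 1$ every solver that ties for the best cost on a problem is counted, which matches the convention that ties are counted independently.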

5.4. Local Verification Setup on Ill-Conditioned Hessians

To validate the explicit cubic error recurrences derived in Theorem 2 and Lemma 1, we consider a family of smooth test functions with prescribed Hessian condition number at the solution. For $\kappa > 1$, let
$$A_\kappa = Q^T \Lambda_\kappa Q, \qquad \Lambda_\kappa = \operatorname{diag}\left(1, \kappa^{1/(n-1)}, \kappa^{2/(n-1)}, \ldots, \kappa\right),$$
where Q is a fixed orthogonal matrix generated with a prescribed random seed. We consider
$$f_\kappa(x) = \frac{1}{2} x^T A_\kappa x + \frac{\gamma}{3} \sum_{i=1}^{n} x_i^3 + \frac{\delta}{4} \sum_{i=1}^{n} x_i^4, \qquad x \in \mathbb{R}^n.$$
Then
$$\xi = 0, \qquad \nabla f_\kappa(0) = 0, \qquad \nabla^2 f_\kappa(0) = A_\kappa, \qquad \kappa_2\left(\nabla^2 f_\kappa(0)\right) = \kappa.$$
The gradient and Hessian are
$$F(x) = \nabla f_\kappa(x) = A_\kappa x + \gamma x^2 + \delta x^3, \qquad F'(x) = \nabla^2 f_\kappa(x) = A_\kappa + 2\gamma \operatorname{diag}(x) + 3\delta \operatorname{diag}(x^2),$$
where $x^2$ and $x^3$ denote componentwise powers. In the experiments, we used $n = 20$, $\gamma = 10^{-1}$, $\delta = 10^{-2}$, and $\kappa \in \{10^2, 10^4, 10^6, 10^8\}$. Since the convergence theory is local, the verification was carried out in the full-step regime $\alpha_N^{(k)} = \alpha_{HO}^{(k)} = 1$, which corresponds to the asymptotic regime assumed in the local analysis. For each value of $\kappa$, 96 local samples were generated by taking $x^{(0)} = r v$, $\|v\| = 1$, with small radii $r$ and randomly generated unit directions $v$. The initial inverse Hessian approximation was chosen as $H_N^{(0)} = A_\kappa^{-1}$. For sufficiently small $\|x^{(0)}\|$, this choice satisfies
$$H_N^{(0)} = [F'(x^{(0)})]^{-1} + O(\|e^{(0)}\|),$$
which is precisely the first-order inverse-Hessian consistency condition used in Theorem 2. For each sample, one HOQN iteration was performed and the observed error $e^{(1)}$ was compared with the cubic term $C_0^{(3)}$. The local order $\rho_{\mathrm{loc}}$ was estimated by a log–log regression of $\|e^{(1)}\|$ versus $\|e^{(0)}\|$. We also report
$$K_{\mathrm{emp}} = \frac{\|e^{(1)}\|}{\|e^{(0)}\|^3}, \qquad K_{\mathrm{th}} = \frac{\|C_0^{(3)}\|}{\|e^{(0)}\|^3}, \qquad Q = \frac{\|e^{(1)} - C_0^{(3)}\|}{\|e^{(0)}\|^4}.$$
Thus, bounded values of $Q$ provide numerical evidence for $e^{(1)} - C_0^{(3)} = O(\|e^{(0)}\|^4)$. For $M_1$-DFP and BM1D, respectively, we monitor the inverse-Hessian consistency quantities
$$\delta_N = \frac{\|H_N^{(0)} - [F'(x^{(0)})]^{-1}\|}{\|e^{(0)}\|}, \qquad \delta_B = \frac{\|\hat{H}^{(0)} - [F'(x^{(0)})]^{-1}\|}{\|e^{(0)}\|},$$
where $\hat{H}^{(0)}$ is the intermediate inverse Hessian approximation obtained after the BFGS update following the Newton-type predictor.
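The construction of the test family can be illustrated with a short, dependency-free sketch. It is not the code used for Table 8. The orthogonal matrix is built here as a Householder reflector from a seeded random vector, which is only one possible way to obtain a fixed orthogonal $Q$; the default values of `gamma` and `delta` follow the values quoted in the text, and all function names are hypothetical.

```python
import math
import random

def householder(n, seed=0):
    """Orthogonal Q = I - 2 u u^T built from a seeded random unit vector u."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in u))
    u = [c / norm for c in u]
    return [[(1.0 if i == j else 0.0) - 2.0 * u[i] * u[j] for j in range(n)]
            for i in range(n)]

def build_A(n, kappa, seed=0):
    """A_kappa = Q^T diag(1, kappa^(1/(n-1)), ..., kappa) Q."""
    Q = householder(n, seed)
    lam = [kappa ** (i / (n - 1)) for i in range(n)]
    return [[sum(Q[k][i] * lam[k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def f_kappa(A, x, gamma=0.1, delta=0.01):
    """Quadratic form plus componentwise cubic and quartic terms."""
    n = len(x)
    quad = 0.5 * sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))
    return (quad + gamma / 3.0 * sum(t ** 3 for t in x)
            + delta / 4.0 * sum(t ** 4 for t in x))

def grad_f_kappa(A, x, gamma=0.1, delta=0.01):
    """F(x) = A x + gamma x^2 + delta x^3 with componentwise powers."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + gamma * x[i] ** 2
            + delta * x[i] ** 3 for i in range(n)]

def sample_start(n, r, seed=0):
    """Initial point x0 = r * v with a random unit direction v."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [r * c / norm for c in v]

A = build_A(20, 1e4)
x0 = sample_start(20, 1e-3)
g0 = grad_f_kappa(A, x0)
print(math.sqrt(sum(c * c for c in g0)))  # gradient norm at the sampled point
```

Because the minimizer is $\xi = 0$, the initial error is simply $e^{(0)} = x^{(0)}$, so the radius `r` directly controls $\|e^{(0)}\|$.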
Table 8 reports the local verification results for $M_1$-DFP and BM1D on ill-conditioned Hessians.
The results show that the estimated local order remains close to three for both $M_1$-DFP and BM1D, even for $\kappa = 10^8$. Orders higher than three are consistent with the theoretical result, which guarantees a cubic convergence order, and may occur when the leading cubic coefficient is small or is partially canceled out along some sampled directions. The empirical constants $K_{\mathrm{emp}}$ and the theoretical constants $K_{\mathrm{th}}$ remain finite across all tests. The small values of $Q$ indicate that, after subtracting the explicit cubic contribution $C_0^{(3)}$, the remaining error behaves as a fourth-order residual, which supports
$$e^{(1)} - C_0^{(3)} = O(\|e^{(0)}\|^4).$$
Finally, the bounded values of $\delta_N$ and, for BM1D, $\delta_B$, confirm that the approximations of the inverse Hessian before and after the intermediate BFGS update remain first-order consistent with $[F'(x^{(0)})]^{-1}$. This provides numerical evidence that the BFGS update following the Newton-type predictor and the DFP update following the high-order corrector preserve the local cubic regime for the two-update variant described as BM1D.
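The order estimate and the constants reported in this subsection can be computed as in the following sketch. It assumes that the error norms of the samples are already available; the function names are hypothetical and the data in the example are synthetic, generated from an exact cubic law so that the fitted slope can be checked.

```python
import math

def estimate_local_order(e0_norms, e1_norms):
    """Least-squares fit of log||e1|| = rho * log||e0|| + log K."""
    xs = [math.log(v) for v in e0_norms]
    ys = [math.log(v) for v in e1_norms]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    rho = sxy / sxx
    K = math.exp(y_mean - rho * x_mean)
    return rho, K

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def local_constants(e0, e1, c3):
    """K_emp, K_th and Q for one sample; e0, e1 and c3 are vectors."""
    n0 = norm(e0)
    K_emp = norm(e1) / n0 ** 3
    K_th = norm(c3) / n0 ** 3
    Q = norm([a - b for a, b in zip(e1, c3)]) / n0 ** 4
    return K_emp, K_th, Q

# Synthetic check: errors that follow ||e1|| = 2.5 * ||e0||^3 exactly
e0_norms = [1e-2, 5e-3, 2e-3, 1e-3]
e1_norms = [2.5 * v ** 3 for v in e0_norms]
rho, K = estimate_local_order(e0_norms, e1_norms)
print(rho, K)
```

A fitted slope close to three together with bounded values of $Q$ is the pattern that the verification looks for.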

5.5. Dynamical Planes

To analyze the dependence of the methods on the initial estimate, we generate dynamical planes [32]. Each plane is constructed on a uniform grid, using each grid point as an initial condition and coloring it according to the minimizer reached by the method, thus visualizing the corresponding basins of attraction. In all dynamical planes, the black symbols mark the corresponding local minimizers. Unlike root-finding dynamical planes, here the goal is to locate minimizers rather than zeros of a nonlinear operator; consequently, the observed equilibrium points may appear on the boundary of attraction regions. The configuration used was a $500 \times 500$ grid, with a maximum of 500 iterations and tolerance $10^{-4}$. Figure 5, Figure 6 and Figure 7 show the dynamical planes for the Himmelblau function, and Figure 8, Figure 9 and Figure 10 show the corresponding results for the Freudenstein–Roth function [12]. Overall, BFGS, DFP, SR1, $M_3$-DFP, BM1D, BM2D, and BM3D exhibit similar and comparatively stable dynamical behavior, whereas $M_1$-DFP and $M_2$-DFP show more fragmented or chaotic basins. The Himmelblau function produces the most complex attraction structure, with highly fragmented regions where small changes in the initial estimate may lead to different local minima. By contrast, the Freudenstein–Roth planes are less fragmented.
For the strictly convex quadratic Booth function, the Hessian is constant and positive definite, and $\xi = (1, 3)$ is the unique minimizer. Hence, the considered descent line-search variants do not generate additional attractors: all initial conditions converge to the same point and the basin of attraction coincides with the whole domain. Therefore, only the BM1D plane is shown in Figure 11, as it is representative of all methods for $f_B$.
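The grid-based construction of a dynamical plane can be sketched as follows for the Himmelblau function. This is only an illustration of the labeling procedure: a plain Newton iteration on the gradient is used as a stand-in for the quasi-Newton and hybrid methods of the paper, the grid is much coarser than the $500 \times 500$ grid used for the figures, and the function names and the matching radius `1e-2` are assumptions.

```python
import math
from collections import Counter

# Approximate local minimizers of the Himmelblau function
MINIMIZERS = [(3.0, 2.0), (-2.805118, 3.131312),
              (-3.779310, -3.283186), (3.584428, -1.848126)]

def gradient(x, y):
    a = x * x + y - 11.0
    b = x + y * y - 7.0
    return 4.0 * x * a + 2.0 * b, 2.0 * a + 4.0 * y * b

def hessian(x, y):
    # Returns (f_xx, f_xy, f_yy)
    return (12.0 * x * x + 4.0 * y - 42.0,
            4.0 * (x + y),
            12.0 * y * y + 4.0 * x - 26.0)

def basin_label(x, y, max_iter=500, tol=1e-4):
    """Index of the minimizer reached from (x, y), or -1 if none is reached."""
    for _ in range(max_iter):
        gx, gy = gradient(x, y)
        if math.hypot(gx, gy) <= tol:
            break
        hxx, hxy, hyy = hessian(x, y)
        det = hxx * hyy - hxy * hxy
        if det == 0.0:
            return -1
        # Newton step: solve the 2x2 system H d = g by Cramer's rule
        x -= (hyy * gx - hxy * gy) / det
        y -= (hxx * gy - hxy * gx) / det
    else:
        return -1  # iteration limit reached without convergence
    for idx, (mx, my) in enumerate(MINIMIZERS):
        if math.hypot(x - mx, y - my) < 1e-2:
            return idx
    return -1  # converged to a saddle point or a maximizer

def dynamical_plane(n=41, lo=-5.0, hi=5.0):
    """Label every point of an n x n grid on [lo, hi]^2 (rows are y values)."""
    step = (hi - lo) / (n - 1)
    return [[basin_label(lo + j * step, lo + i * step) for j in range(n)]
            for i in range(n)]

plane = dynamical_plane()
print(Counter(label for row in plane for label in row))
```

Coloring each grid cell by its label gives a picture of the basins; points labeled `-1` are those from which this stand-in iteration does not reach a minimizer.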

5.6. Hybrid Optimization with Limited Memory

In general, optimization methods may require a significant amount of memory, making it essential to adopt computational strategies that efficiently manage such requirements. One of the most widely used approaches in this context is the limited-memory variant of the BFGS method (L-BFGS) [33]. Instead of storing or factorizing the full Hessian matrix, which entails a memory cost of order $O(n^2)$ and becomes impractical for dimensions $n \ge 10^4$, L-BFGS retains only a limited number $m$ of the most recent curvature pairs $(s_i, y_i)$. As a result, the overall memory cost is reduced to $O(mn)$, while still providing an effective approximation of curvature information. Building upon this strategy, we propose the hybrid limited-memory method L-HOQN, whose iterative scheme is described in Algorithm 2.
In low-dimensional problems, such as the Himmelblau, Freudenstein–Roth, and Booth test functions with $n = 2$, limited-memory optimization methods and, in particular, the L-HOQN method, are affected by a loss of curvature information and, consequently, exhibit slower convergence than their full-memory counterparts. In the case of the L-HOQN method variants, $M_1$-LBFGS, $M_2$-LBFGS, and $M_3$-LBFGS, this loss reduces the effectiveness of the high-order corrections and slows down the convergence process, as observed in the results reported in Table 9, Table 10 and Table 11. Nevertheless, this behavior changes substantially in practical high-dimensional applications, such as neural network training, which will be analyzed in Section 6.
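To make the $O(mn)$ cost concrete, the standard L-BFGS two-loop recursion is sketched below. It applies the implicit inverse Hessian approximation to a vector using only the stored pairs. The sketch is generic and is not taken from the authors' implementation; the initial scaling $s^T y / y^T y$ is a common textbook choice assumed here, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def two_loop(vector, pairs):
    """Return H * vector, where H is the L-BFGS inverse Hessian approximation.

    `pairs` is a list of curvature pairs (s_i, y_i) ordered from oldest to
    newest, each with s_i^T y_i > 0. Only O(m n) operations are needed.
    """
    q = list(vector)
    history = []
    # First loop: newest pair to oldest pair
    for s, y in reversed(pairs):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        history.append((alpha, rho))
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
    # Initial scaling H0 = gamma * I (assumed choice; identity if no pairs)
    if pairs:
        s, y = pairs[-1]
        gamma = dot(s, y) / dot(y, y)
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest pair
    for (s, y), (alpha, rho) in zip(pairs, reversed(history)):
        beta = rho * dot(y, r)
        r = [ri + si * (alpha - beta) for ri, si in zip(r, s)]
    return r

pairs = [([1.0, 0.0], [2.0, 1.0]), ([0.5, 1.0], [1.0, 3.0])]
gradient = [0.3, -0.7]
direction = [-c for c in two_loop(gradient, pairs)]  # quasi-Newton direction -H g
print(direction)
```

A useful sanity check is the secant condition: applying the recursion to the most recent $y$ returns the most recent $s$, up to rounding.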
Algorithm 2 Hybrid method L-HOQN
Require: $x^{(k)}$, $g^{(k)} = \nabla f(x^{(k)})$, $H^{(k)}$, $\alpha_N^{(k)}, \alpha_{HO}^{(k)} > 0$, type, $|G_1| < +\infty$, memory $\mathcal{M}$ with at most $m$ pairs
Ensure: $x^{(k+1)}$, $H^{(k+1)}$
  1:  $d_N^{(k)} \leftarrow -H^{(k)} g^{(k)}$
  2:  $z^{(k)} \leftarrow x^{(k)} + \alpha_N^{(k)} d_N^{(k)}$
  3:  $g_z^{(k)} \leftarrow \nabla f(z^{(k)})$
  4:  $\nu_k \leftarrow \|g_z^{(k)}\|^2 / \|g^{(k)}\|^2$
  5:  if type $= M_1$ then
  6:      $h^{(k)} \leftarrow (1 + \nu_k)\, g_z^{(k)} - 2 \nu_k\, g^{(k)}$
  7:      $d_{HO}^{(k)} \leftarrow -H^{(k)} h^{(k)}$
  8:      $x^{(k+1)} \leftarrow z^{(k)} + \alpha_{HO}^{(k)} d_{HO}^{(k)}$
  9:  else if type $= M_2$ then
 10:      $h^{(k)} \leftarrow g_z^{(k)} + 2 \nu_k\, g^{(k)}$
 11:      $d_{HO}^{(k)} \leftarrow -\frac{1}{1 - 4 \nu_k} H^{(k)} h^{(k)}$
 12:      $x^{(k+1)} \leftarrow z^{(k)} + \alpha_{HO}^{(k)} d_{HO}^{(k)}$
 13:  else if type $= M_3$ then
 14:      $h^{(k)} \leftarrow (1 + 2 \nu_k)\, g^{(k)} + (1 + G_1 \nu_k)\, g_z^{(k)}$
 15:      $d_{HO}^{(k)} \leftarrow -H^{(k)} h^{(k)}$
 16:      $x^{(k+1)} \leftarrow x^{(k)} + \alpha_{HO}^{(k)} d_{HO}^{(k)}$
 17:  end if
 18:  $g^{(k+1)} \leftarrow \nabla f(x^{(k+1)})$
 19:  $s^{(k)} \leftarrow x^{(k+1)} - x^{(k)}$
 20:  $y^{(k)} \leftarrow g^{(k+1)} - g^{(k)}$
 21:  if card$(\mathcal{M}) = m$ then
 22:      remove oldest pair from $\mathcal{M}$
 23:  end if
 24:  append $(s^{(k)}, y^{(k)})$ to $\mathcal{M}$
 25:  update $H^{(k+1)}$ via two-loop recursion on $\mathcal{M}$
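The memory handling at the end of Algorithm 2 (forming the new curvature pair, discarding the oldest pair when the memory is full, and appending the new one) maps naturally onto a bounded double-ended queue. The sketch below is one possible realization, with a hypothetical function name; the application of $H^{(k+1)}$ through the two-loop recursion is not shown.

```python
from collections import deque

def update_memory(memory, x_old, x_new, g_old, g_new):
    """Form the pair (s, y) and store it, dropping the oldest pair if full.

    `memory` is a deque created with maxlen=m, so appending to a full
    deque removes its oldest entry automatically.
    """
    s = [a - b for a, b in zip(x_new, x_old)]   # s = x_new - x_old
    y = [a - b for a, b in zip(g_new, g_old)]   # y = g_new - g_old
    memory.append((s, y))
    return memory

m = 2
memory = deque(maxlen=m)
# Illustrative iterates of f(x) = x1^2 + x2^2, whose gradient is 2x
points = [[4.0, 2.0], [2.0, 1.0], [1.0, 0.5], [0.5, 0.25]]
for x_old, x_new in zip(points, points[1:]):
    g_old = [2.0 * c for c in x_old]
    g_new = [2.0 * c for c in x_new]
    update_memory(memory, x_old, x_new, g_old, g_new)
print(len(memory), memory[0], memory[-1])
```

With `maxlen=m` the storage never exceeds $m$ pairs of $n$-vectors, which is the $O(mn)$ footprint of the limited-memory scheme.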

6. Practical Application in Neural Networks for the MNIST Database

The MNIST (Modified National Institute of Standards and Technology) dataset [34] is a canonical benchmark in the fields of computer vision and deep learning. It consists of 70,000 grayscale images of handwritten digits, of which 60,000 are designated for training and 10,000 for evaluation, each with a resolution of 28 × 28 pixels. Despite its apparent simplicity, the task of training neural networks on this corpus entails the resolution of a high-dimensional nonconvex optimization problem, characterized by the presence of multiple local minima and highly intricate error surfaces. Handwritten digit classification constitutes a benchmark problem of substantial practical relevance in optical character recognition (OCR) applications, including forms, postal codes, bank checks, vehicle license plates, and other environments involving the processing of handwritten data, and it has served for decades as a standard testbed for the validation of pattern recognition theories and the assessment of machine learning algorithms. In order to facilitate the comparability of results and to promote progress in the field, several benchmark databases have been developed in which handwritten digit samples are uniformly preprocessed through segmentation and normalization, thereby providing a common framework in which researchers may objectively compare the performance of their methods [35]. Consequently, MNIST is systematically employed as a testbed for quantifying and comparing the effectiveness of different optimization algorithms and classification architectures in handwritten digit recognition tasks.

6.1. Architecture of the Implemented Neural Network

A convolutional neural network (CNN) was implemented (see Figure 12). Its input is a single-channel grayscale image of size $28 \times 28 \times 1$, processed by two convolutional blocks. The first applies 32 kernels of size $3 \times 3$ with ReLU activation, followed by $2 \times 2$ max-pooling (yielding feature maps of size $13 \times 13 \times 32$); the second uses 64 kernels of the same size with ReLU activation and identical pooling (yielding feature maps of size $5 \times 5 \times 64$). The resulting tensor is flattened into a 1600-dimensional vector, passed through a fully connected layer of 128 neurons with ReLU activation, and finally projected to 10 output logits corresponding to the digit classes 0–9. This architecture balances complexity and efficiency by capturing spatial patterns and reducing dimensionality in the convolutional stages, while the dense layers perform the final classification.
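The feature-map sizes quoted above can be checked with simple shape arithmetic. The sketch below assumes unit stride and no padding in the convolutions and a pooling stride equal to the pooling window, which is the configuration that reproduces the stated sizes; the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel, stride=None):
    """Spatial output size of max-pooling (stride defaults to the window)."""
    stride = kernel if stride is None else stride
    return (size - kernel) // stride + 1

def trace_shapes(size=28, channels=1, blocks=(32, 64)):
    """Follow (height, width, channels) through the two convolutional blocks."""
    shapes = [(size, size, channels)]
    for out_channels in blocks:
        size = pool_out(conv_out(size, 3), 2)   # 3x3 conv, then 2x2 pooling
        channels = out_channels
        shapes.append((size, size, channels))
    return shapes

shapes = trace_shapes()
h, w, c = shapes[-1]
print(shapes)      # [(28, 28, 1), (13, 13, 32), (5, 5, 64)]
print(h * w * c)   # 1600
```

The flattened length $5 \cdot 5 \cdot 64 = 1600$ agrees with the input dimension of the fully connected layer.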
The training of the neural network is formulated as a large-scale unconstrained optimization problem, whose objective is to determine the set of parameters θ (network weights and biases) that minimizes a regularized objective function. In particular, we consider the problem
$$\min_{\theta} J(\theta) = L(\theta) + \lambda \|\theta\|_2^2,$$
where $L(\theta)$ denotes the multiclass cross-entropy loss function [36,37] and $\lambda = 10^{-4}$ is the $L_2$ regularization parameter. The cross-entropy loss is defined in terms of the direct network outputs (logits) as
$$L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{z_{i,y_i}}}{\sum_{j=1}^{10} e^{z_{i,j}}},$$
where $N$ denotes the number of samples (or the minibatch size), $z_{i,j}$ represents the logit associated with sample $i$ and class $j$, and $y_i \in \{0, \ldots, 9\}$ is the corresponding true label.
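The regularized objective can be evaluated directly from logits as in the following sketch. It is written in plain Python for clarity and uses the log-sum-exp shift for numerical stability; it is not the PyTorch code used in the experiments, and the function names as well as the flat parameter list `theta` are assumptions made for illustration.

```python
import math

def cross_entropy(logits, labels):
    """Mean of -log softmax(z_i)[y_i] over the samples."""
    total = 0.0
    for z, y in zip(logits, labels):
        z_max = max(z)  # shift by the maximum to avoid overflow in exp
        log_sum = z_max + math.log(sum(math.exp(v - z_max) for v in z))
        total += log_sum - z[y]
    return total / len(logits)

def objective(logits, labels, theta, lam=1e-4):
    """J(theta) = L(theta) + lam * ||theta||_2^2."""
    return cross_entropy(logits, labels) + lam * sum(t * t for t in theta)

# Uninformative logits over 10 classes give a loss of log(10)
uniform = [[0.0] * 10, [0.0] * 10]
print(cross_entropy(uniform, [3, 7]))
print(objective(uniform, [3, 7], theta=[1.0, 2.0]))
```

A network that outputs equal logits for all ten classes therefore starts at a loss of $\log 10 \approx 2.3026$, which is a convenient reference value when reading batch-loss curves.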
Training the convolutional architecture shown in Figure 12 leads to a very high-dimensional optimization problem, in which storing the full Hessian matrix is computationally infeasible. Therefore, it is essential to employ optimization methods that preserve curvature information without incurring excessive memory consumption. In Section 6.2, we introduce three new memory-constrained hybrid optimization algorithms that combine high-order corrections with L-BFGS updates to efficiently train the neural network on the MNIST dataset.
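A rough count illustrates the scale of the problem. The sketch below counts the trainable parameters of the architecture described above, assuming that every layer carries a bias term, and compares the storage of a dense $n \times n$ matrix with that of $m$ curvature pairs. The 4-byte value size (single precision) and the helper names are assumptions.

```python
def count_parameters():
    """Weights plus biases of the two conv layers and the two dense layers."""
    conv1 = 3 * 3 * 1 * 32 + 32      # 32 kernels of size 3x3 on 1 channel
    conv2 = 3 * 3 * 32 * 64 + 64     # 64 kernels of size 3x3 on 32 channels
    dense1 = 1600 * 128 + 128        # flattened features to 128 neurons
    dense2 = 128 * 10 + 10           # 128 neurons to 10 logits
    return conv1 + conv2 + dense1 + dense2

def dense_matrix_bytes(n, bytes_per_value=4):
    """Storage of a full n x n Hessian (or inverse Hessian) approximation."""
    return n * n * bytes_per_value

def limited_memory_bytes(n, m, bytes_per_value=4):
    """Storage of m curvature pairs (s_i, y_i), each made of two n-vectors."""
    return 2 * m * n * bytes_per_value

n = count_parameters()
print(n)                                   # 225034
print(dense_matrix_bytes(n) / 1e9)         # dense storage in GB
print(limited_memory_bytes(n, 5) / 1e6)    # limited-memory storage in MB, m = 5
```

Under these assumptions the dense matrix would need roughly 200 GB, whereas five curvature pairs need about 9 MB, which is why only limited-memory updates are practical here.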

6.2. Proposed Limited-Memory Hybrid Optimizers for CNN Training

In this section, we evaluate the performance of the $M_1$-LBFGS, $M_2$-LBFGS, and $M_3$-LBFGS methods (see Algorithm 2) on a more complex problem, specifically in the training of the convolutional neural network shown in Figure 12, designed for automatic digit classification. These new hybrid algorithms are generally more accurate and, in most cases, more stable during training than the classical SGD and L-BFGS methods. To solve the optimization problem arising from the training of this network, we employ the PyTorch library [38], version 2.8.0, which is widely recognized in the field of deep learning. PyTorch provides native GPU support, which is essential for minibatch training in deep architectures, and offers a flexible API that enables the efficient implementation of hybrid optimizers such as those proposed in this work. All experiments were performed on a 13-inch MacBook Air equipped with an Apple M4 chip and 16 GB of memory, running macOS Sequoia 15.7.4. The code was implemented and executed using Python 3.12.3. Our implementations use a fixed learning rate $\alpha$, momentum values $\mu = 0.95$ and $\mu = 0.99$ (see Table 12 and Table 13), and a limited-memory L-BFGS buffer of size $m = 5$, 10, 15, and 20, allowing us to assess the impact of the learning rate, momentum, and memory size on the efficiency of the hybrid methods.

6.3. Efficiency of the $M_1$-LBFGS, $M_2$-LBFGS and $M_3$-LBFGS Methods for MNIST

One of the key parameters affecting the efficiency of limited-memory optimization methods for minimizing the cross-entropy loss in neural networks is the learning rate (lr).
Table 12 shows how this parameter, together with the memory size, influences the accuracy of the $M_1$-LBFGS, $M_2$-LBFGS, $M_3$-LBFGS, and L-BFGS optimization methods. The reported accuracy metrics correspond to the test dataset and are computed after each training epoch. In addition, the average time reported in the table represents the total time per epoch, including both the training and evaluation phases. The results indicate that, for lr = 0.10, the proposed hybrid variants consistently achieve accuracy levels above 99%, outperforming the classical L-BFGS method for all memory sizes.
In Table 13, we report the comparison between the proposed hybrid schemes and the classical SGD and L-BFGS methods in the training of the neural network on the MNIST dataset, considering the learning rates lr = 0.01, lr = 0.025, and lr = 0.08. In all limited-memory algorithms, a memory size $m = 5$ was used, while in the proposed hybrid methods, a momentum parameter $\mu = 0.99$ was incorporated.
The results correspond to the test set and consistently show that the hybrid methods achieve higher final accuracy than the classical approaches across all learning-rate regimes analyzed. In particular, for lr = 0.08, the hybrid schemes clearly outperform L-BFGS (0.9902) and SGD (0.9920), indicating superior generalization capability under more demanding training configurations. This trend persists for lr = 0.025, where the hybrid methods retain a significant advantage over L-BFGS and SGD in terms of the achieved accuracy. Even in the conservative scenario lr = 0.01, the results show that the hybrid algorithms perform competitively with or better than the classical methods. Although the computational cost per epoch is moderately higher in the hybrid schemes, the increase is proportional and justified by the systematic improvement in accuracy observed. As shown in Table 13, when comparing the performance of the $M_1$-LBFGS, $M_2$-LBFGS, and $M_3$-LBFGS algorithms under the learning rates lr = 0.10 and lr = 0.15, it can be observed that, for the lower learning rate, the three hybrid algorithms achieve their highest accuracy levels and outperform the classical L-BFGS and SGD methods. Figure 13 shows the evolution of the MNIST test accuracy for the implemented neural network under two learning rate values, lr = 0.10 and lr = 0.15. When comparing the optimizers, the hybrid variants $M_1$-LBFGS, $M_2$-LBFGS, and $M_3$-LBFGS exhibit a more stable and consistent behavior across epochs, achieving final accuracies that are slightly higher than or comparable to those obtained by the classical L-BFGS and SGD methods.
Although this paper proposes three new hybrid algorithms with limited memory, the $M_3$-LBFGS method may become saturated when learning rates greater than lr = 0.15 are used. For this reason, the use of the $M_1$-LBFGS and $M_2$-LBFGS algorithms is recommended in such cases. The curves in Figure 14 show the accuracy achieved by the $M_1$-LBFGS, $M_2$-LBFGS, L-BFGS, and SGD optimizers during the training and testing phases of the neural network, for the learning rate lr = 0.20. In the hybrid variants, a memory size of $m = 5$ was used. It can be observed that the $M_1$-LBFGS and $M_2$-LBFGS methods exhibit greater stability and, in general, a final accuracy slightly higher than that of the classical L-BFGS and SGD methods. Furthermore, the graphs show that the hybrid methods maintain very similar behavior between the training and test sets, while L-BFGS and SGD exhibit less consistent performance across both sets.
Figure 15 illustrates the evolution of the batch loss during the training of the neural network using the hybrid optimizers $M_1$-LBFGS, $M_2$-LBFGS, and $M_3$-LBFGS, with memory size $m = 5$ and $\mu = 0.95$. Figure 15a–c correspond to a learning rate of 0.05, whereas Figure 15d–f present the results for a learning rate of 0.1. It is observed that, for the lower learning rate (lr = 0.05), the batch loss decreases gradually with noticeable fluctuations before reaching a stable regime, with $M_2$-LBFGS exhibiting higher variability. By comparison, for the higher learning rate (lr = 0.1), all three hybrid methods display a faster and more uniform reduction of the loss, with significantly reduced noise after the initial training stages, indicating improved stability and faster convergence. Moreover, $M_1$-LBFGS and $M_3$-LBFGS tend to reach slightly lower stabilized loss values compared to $M_2$-LBFGS. Accuracy levels above 99% achieved by the $M_1$-LBFGS, $M_2$-LBFGS, and $M_3$-LBFGS variants for lr = 0.1, as reported in Table 12 and Table 13, together with the low and stable batch loss values observed during training in Figure 15, indicate that these hybrid methods exhibit highly competitive performance in this high-dimensional optimization scenario.
In general, the numerical results show that the three proposed limited-memory hybrid variants are promising strategies for MNIST classification, achieving high accuracy and demonstrating stable and reliable neural network training.

6.4. Additional MNIST Experiments with Adam and AdamW

Finally, we compared the proposed limited-memory HOQN variants with the widely used adaptive first-order optimizers Adam [39,40] and AdamW [41,42]. The variants $M_1$-LBFGS and $M_2$-LBFGS were selected because they showed the most stable behavior in preliminary experiments, particularly with respect to changes in the learning rate. For both methods, we used memory size $m = 5$, momentum parameter $\mu = 0.95$, $L_2$ regularization parameter $10^{-8}$, learning rate 0.1, and batch size 64. Adam and AdamW were trained with learning rate 0.001 under the same CNN architecture, batch size, number of epochs, and random seeds. The corresponding results are reported in Table 14.
The results show that the proposed limited-memory HOQN variants remain competitive with Adam and AdamW. In particular, $M_1$-LBFGS attains the highest mean test accuracy, while $M_2$-LBFGS achieves comparable performance with low variability across seeds. Although AdamW obtains the lowest mean training loss, the test accuracies of all methods remain close, indicating that the proposed L-HOQN variants provide stable and competitive classification performance under the same training protocol.

7. Conclusions

This work introduced the higher-order quasi-Newton (HOQN) framework for unconstrained optimization, combining Newton-type predictors, higher-order correction terms derived from vector extensions of the Chun, Ostrowski, and Traub methods, and quasi-Newton updates of the inverse Hessian approximation. The resulting formulation provides a flexible hybrid structure that allows one-update and two-update variants while preserving the dense full-memory quasi-Newton cost of order $O(n^2)$ per iteration.
From the theoretical point of view, local cubic convergence was established for the one-update variants $M_1$-DFP, $M_2$-DFP, and $M_3$-DFP under standard smoothness, positive-definiteness, and first-order inverse-Hessian consistency assumptions. The analysis also shows that the BFGS–DFP two-update strategy preserves the cubic local regime under suitable compatibility conditions. These results show that the higher-order correction cancels the dominant quadratic residual of standard quasi-Newton iterations, shifting the leading error term to cubic order.
The numerical experiments confirm the practical relevance of the proposed hybrid schemes. On the Himmelblau and Freudenstein–Roth functions, the HOQN variants generally require fewer outer iterations than classical BFGS, DFP, and SR1 methods while maintaining competitive computational times. The Booth function, due to its strictly convex quadratic structure, is mainly interpreted as a consistency test, where all methods exhibit very fast convergence. Additional high-dimensional quadratic experiments further show that the predictor–corrector structure remains effective beyond two-dimensional problems, particularly in ill-conditioned settings, while preserving the expected dense $O(n^2)$ scaling.
The Dolan–Moré performance profiles and dynamical-plane analyses provide additional evidence of robustness and stability. Over heterogeneous smooth, noisy, and nonsmooth benchmark instances, the proposed methods are competitive with classical quasi-Newton schemes; although BFGS remains a strong baseline when total computational work is considered, BM2D and BM3D achieve low failure rates and robust performance. The dynamical planes also show that the two-update variants BM1D, BM2D, and BM3D exhibit stable basin structures, whereas some one-update variants are more sensitive to the initial approximation.
Limited-memory variants of the HOQN framework were also developed for large-scale optimization. Although these variants may lose curvature information in low-dimensional test functions, their performance in convolutional neural network training on MNIST is promising. The proposed $M_1$-LBFGS, $M_2$-LBFGS, and $M_3$-LBFGS methods achieve test accuracies above 99% for suitable learning rates and show stable behavior across training and test sets. Comparisons with Adam and AdamW over several random seeds indicate that the limited-memory hybrid methods are competitive in classification accuracy, with small variability across initializations.
Overall, the results suggest that the HOQN framework is a promising strategy for combining the fast local behavior of higher-order methods with the computational practicality of quasi-Newton updates. Future work should address global convergence theory for nonconvex problems, trust-region and adaptive line-search globalization strategies, stochastic and mini-batch limited-memory variants, deeper neural architectures, and more challenging datasets. Another relevant direction is the construction of new hybrid quasi-Newton algorithms based on other optimal vector methods for nonlinear systems, following the ideas introduced by Singh et al. in [13], and their adaptation to unconstrained optimization.

Author Contributions

Conceptualization, A.C. and N.U.C.; methodology, N.U.C. and J.R.T.; software, N.U.C. and J.G.M.; validation, J.G.M. and J.R.T.; formal analysis, N.U.C.; investigation, A.C., N.U.C., J.R.T. and J.G.M.; visualization, N.U.C.; writing—original draft preparation, N.U.C.; writing—review and editing, A.C., J.R.T., J.G.M. and N.U.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fondo Nacional de Innovación y Desarrollo Científico y Tecnológico (FONDOCYT) of the Ministerio de Educación Superior, Ciencia y Tecnología de la República Dominicana (MESCyT), under grant number FONDOCYT 2023-1-1D2-0537.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments and suggestions. Moreover, the authors gratefully acknowledge the institutional support of INTEC and UPV, which made the development of this research possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fletcher, R. A new approach to variable metric algorithms. Comput. J. 1970, 13, 317–322.
  2. Broyden, C.G. The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA J. Appl. Math. 1970, 6, 76–90.
  3. Byrd, R.H.; Khalfan, H.F.; Schnabel, R.B. Analysis of a symmetric rank-one trust region method. SIAM J. Optim. 1996, 6, 1025–1039.
  4. Davidon, W.C. Variable Metric Method for Minimization; ANL-5990; Argonne National Laboratory: Lemont, IL, USA, 1959.
  5. Dennis, J.E., Jr.; Moré, J.J. Quasi-Newton methods, motivation and theory. SIAM Rev. 1977, 19, 46–89.
  6. Dunn, J.C.; Bertsekas, D.P. Efficient dynamic programming implementations of Newton’s method for unconstrained optimal control problems. J. Optim. Theory Appl. 1989, 63, 23–38.
  7. Singh, A.; Sharma, A.; Rajput, S.; Bose, A.; Hu, X. An investigation on hybrid particle swarm optimization algorithms for parameter optimization of PV cells. Electronics 2022, 11, 909.
  8. Farnad, B.; Jafarian, A.; Baleanu, D. A new hybrid algorithm for continuous optimization problem. Appl. Math. Model. 2018, 55, 652–673.
  9. Wang, G.; Guo, L. A novel hybrid bat algorithm with harmony search for global numerical optimization. J. Appl. Math. 2013, 2013, 696491.
  10. Garg, H. A hybrid PSO-GA algorithm for constrained optimization problems. Appl. Math. Comput. 2016, 274, 292–305.
  11. Arroyo, V.; Cordero, A.; Torregrosa, J.R. Approximation of artificial satellites’ preliminary orbits: The efficiency challenge. Math. Comput. Model. 2011, 54, 1802–1807.
  12. Andrei, N. An unconstrained optimization test functions collection. Adv. Model. Optim. 2008, 10, 147–161.
  13. Singh, H.; Sharma, J.R.; Kumar, S. A simple yet efficient two-step fifth-order weighted-Newton method for nonlinear models. Numer. Algorithms 2023, 93, 203–225.
  14. Bubeck, S. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 2015, 8, 231–357.
  15. Hager, W.W. Lipschitz continuity for constrained processes. SIAM J. Control Optim. 1979, 17, 321–338.
  16. Pardalos, P.M.; Žilinskas, A.; Žilinskas, J. Non-Convex Multi-Objective Optimization; Springer: Berlin/Heidelberg, Germany, 2017.
  17. Dai, Y.H.; Yuan, Y. A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim. 1999, 10, 177–182.
  18. Dong, X.; Liu, H.; He, Y. A self-adjusting conjugate gradient method with sufficient descent condition and conjugacy condition. J. Optim. Theory Appl. 2015, 165, 225–241.
  19. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer: New York, NY, USA, 2006.
  20. Dennis, J.E.; Moré, J.J. A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 1974, 28, 549–560.
  21. Chun, C. Construction of Newton-like iteration methods for solving nonlinear equations. Numer. Math. 2006, 104, 297–315.
  22. Ostrowski, A.M. Solution of Equations and Systems of Equations: Pure and Applied Mathematics: A Series of Monographs and Textbooks; Elsevier: Amsterdam, The Netherlands, 2016; Volume 9.
  23. Cordero, A.; Rojas-Hiciano, R.V.; Torregrosa, J.R.; Vassileva, M.P. A highly efficient class of optimal fourth-order methods for solving nonlinear systems. Numer. Algorithms 2024, 95, 1879–1904.
  24. Traub, J.F. Iterative Methods for the Solution of Equations; American Mathematical Soc.: Providence, RI, USA, 1982; Volume 312.
  25. Cordero, A.; Torregrosa, J.R. Variants of Newton’s method using fifth-order quadrature formulas. Appl. Math. Comput. 2007, 190, 686–698.
  26. Dolan, E.D.; Moré, J.J. Benchmarking Optimization Software with Performance Profiles. Math. Program. 2002, 91, 201–213.
  27. Moré, J.J.; Garbow, B.S.; Hillstrom, K.E. Testing Unconstrained Optimization Software. ACM Trans. Math. Softw. 1981, 7, 17–41.
  28. Bongartz, I.; Conn, A.R.; Gould, N.I.M.; Toint, P.L. CUTE: Constrained and Unconstrained Testing Environment. ACM Trans. Math. Softw. 1995, 21, 123–160.
  29. Gould, N.I.M.; Orban, D.; Toint, P.L. CUTEr and SifDec: A Constrained and Unconstrained Testing Environment, Revisited. ACM Trans. Math. Softw. 2003, 29, 373–394.
  30. Goh, B.S.; McDonald, D. Newton methods to solve a system of nonlinear algebraic equations. J. Optim. Theory Appl. 2015, 164, 261–276.
  31. Moré, J.J.; Wild, S.M. Benchmarking Derivative-Free Optimization Algorithms. SIAM J. Optim. 2009, 20, 172–191.
  32. Chicharro, F.I.; Cordero, A.; Torregrosa, J.R. Drawing dynamical and parameters planes of iterative families and methods. Sci. World J. 2013, 2013, 780153.
  33. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528.
  34. LeCun, Y. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 5 January 2025).
  35. Dieterich, J.M.; Hartke, B. Empirical review of standard benchmark functions using evolutionary global optimization. arXiv 2012, arXiv:1207.4318. [Google Scholar] [CrossRef]
  36. Kline, D.M.; Berardi, V.L. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Comput. Appl. 2005, 14, 310–318. [Google Scholar] [CrossRef]
  37. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar]
  38. Paszke, A. Pytorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  39. Barakat, A.; Bianchi, P. Convergence and Dynamical Behavior of the ADAM Algorithm for Nonconvex Stochastic Optimization. SIAM J. Optim. 2021, 31, 244–274. [Google Scholar] [CrossRef]
  40. Chen, C.; Shen, L.; Zou, F.; Liu, W. Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration. J. Mach. Learn. Res. 2022, 23, 1–47. [Google Scholar]
  41. Zhuang, Z.; Liu, M.; Cutkosky, A.; Orabona, F. Understanding AdamW through Proximal Methods and Scale-Freeness. arXiv 2022, arXiv:2202.00089. [Google Scholar] [CrossRef]
  42. Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards Understanding Convergence and Generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493. [Google Scholar] [CrossRef]
Figure 1. Three-dimensional representations of the benchmark functions.
Figure 2. Success-rate heat map for the methods under smooth, noisy and nonsmooth scenarios. Each entry reports the percentage of solved instances in the corresponding class.
Figure 3. Overall failure rates of the methods over the full benchmark set of 150 smooth, noisy and nonsmooth instances.
Figure 4. Dolan–Moré performance profiles over the full benchmark set, using computational work and number of outer iterations as performance measures.
Figure 5. Dynamical planes of the BFGS, DFP, and SR1 methods for the Himmelblau function.
Figure 6. Dynamical planes of one-update hybrid methods for the Himmelblau function.
Figure 7. Dynamical planes of two-update hybrid methods for the Himmelblau function.
Figure 8. Dynamical planes of the BFGS, DFP, and SR1 methods for the Freudenstein–Roth function.
Figure 9. Dynamical planes of one-update hybrid methods for the Freudenstein–Roth function.
Figure 10. Dynamical planes of two-update hybrid methods for the Freudenstein–Roth function.
Figure 11. Dynamical plane of the B$M_1$D method for the Booth function.
Figure 12. Neural network architecture.
Figure 13. MNIST test accuracy per epoch across learning rates.
Figure 14. Training and test accuracy on MNIST for lr = 0.20.
Figure 15. MNIST loss for $M_1$-LBFGS, $M_2$-LBFGS, and $M_3$-LBFGS at two learning rates.
Table 1. Numerical tests for the Himmelblau function with two different initial estimates.
Himmelblau function ($f_H$), $\xi = (-3.7793, -3.2832)^T$

| $x^{(0)}$ | Method | Iter | $\rho$ | $f(x^{(k+1)})$ | CT (s) |
|---|---|---|---|---|---|
| $(2.2920, 2.6501)^T$ | BFGS | 5 | 1.8 | $3.8318 \times 10^{-7}$ | 0.0021 |
| | DFP | 5 | 1.8 | $3.8871 \times 10^{-7}$ | 0.0013 |
| | SR1 | 5 | 1.8 | $3.8989 \times 10^{-7}$ | 0.0024 |
| | $M_1$DFP | 10 | 2.9 | $2.0383 \times 10^{-12}$ | 0.0047 |
| | $M_2$DFP | 4 | 2.3 | $1.43 \times 10^{-14}$ | 0.0038 |
| | $M_3$DFP | 4 | 2.5 | $7.9947 \times 10^{-8}$ | 0.0019 |
| | B$M_1$D | 3 | 2.8 | $5.7620 \times 10^{-11}$ | 0.0013 |
| | B$M_2$D | 3 | 2.3 | $4.20 \times 10^{-11}$ | 0.0058 |
| | B$M_3$D | 4 | 2.5 | $5.5011 \times 10^{-9}$ | 0.0040 |
| $(1.956, 2.667)^T$ | BFGS | 6 | 1.6 | $1.0194 \times 10^{-8}$ | 0.0044 |
| | DFP | 6 | 1.6 | $1.0500 \times 10^{-8}$ | 0.0040 |
| | SR1 | 6 | 1.6 | $1.0491 \times 10^{-8}$ | 0.0015 |
| | $M_1$DFP | 12 | 3.0 | $3.9892 \times 10^{-7}$ | 0.0029 |
| | $M_2$DFP | 4 | 2.4 | $3.81 \times 10^{-7}$ | 0.0015 |
| | $M_3$DFP | 4 | 2.7 | $5.9150 \times 10^{-7}$ | 0.0039 |
| | B$M_1$D | 3 | 3.2 | $3.8079 \times 10^{-8}$ | 0.0058 |
| | B$M_2$D | 3 | 2.8 | $6.24 \times 10^{-7}$ | 0.0050 |
| | B$M_3$D | 4 | 2.4 | $9.3926 \times 10^{-8}$ | 0.0012 |
Table 2. Numerical tests for the Freudenstein–Roth function with two different initial estimates.
Freudenstein–Roth function ($f_{FR}$), $\xi = (5, 4)^T$

| $x^{(0)}$ | Method | Iter | $\rho$ | $f(x^{(k+1)})$ | CT (s) |
|---|---|---|---|---|---|
| $(3.5081, 4.0087)^T$ | BFGS | 4 | 1.3 | $2.2231 \times 10^{-7}$ | 0.0068 |
| | DFP | 4 | 1.3 | $2.2179 \times 10^{-7}$ | 0.0066 |
| | SR1 | 4 | 1.3 | $2.2171 \times 10^{-7}$ | 0.0044 |
| | $M_1$DFP | 5 | 3.0 | $6.3499 \times 10^{-10}$ | 0.0043 |
| | $M_2$DFP | 3 | 2.8 | $1.01 \times 10^{-8}$ | 0.0017 |
| | $M_3$DFP | 3 | 2.6 | $7.0095 \times 10^{-7}$ | 0.0088 |
| | B$M_1$D | 2 | | $8.7848 \times 10^{-8}$ | 0.0064 |
| | B$M_2$D | 2 | | $4.00 \times 10^{-7}$ | 0.0038 |
| | B$M_3$D | 3 | 2.6 | $7.0595 \times 10^{-7}$ | 0.0087 |
| $(4.3, 4.0001)^T$ | BFGS | 4 | 1.2 | $2.8710 \times 10^{-7}$ | 0.0014 |
| | DFP | 4 | 1.2 | $2.8639 \times 10^{-7}$ | 0.0010 |
| | SR1 | 4 | 1.2 | $2.8612 \times 10^{-7}$ | 0.0014 |
| | $M_1$DFP | 4 | 2.5 | $5.8292 \times 10^{-8}$ | 0.0093 |
| | $M_2$DFP | 4 | 3.0 | $4.56 \times 10^{-7}$ | 0.0064 |
| | $M_3$DFP | 3 | 3.0 | $1.9586 \times 10^{-7}$ | 0.0047 |
| | B$M_1$D | 2 | | $2.3814 \times 10^{-7}$ | 0.0012 |
| | B$M_2$D | 2 | | $3.40 \times 10^{-7}$ | 0.0015 |
| | B$M_3$D | 3 | 3.0 | $1.9628 \times 10^{-7}$ | 0.0063 |
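The $\rho$ column is not defined in this part of the article. One reading, offered here only as an assumption, is that it is an approximated computational order of convergence (ACOC) estimated from the norms of successive steps. That estimate needs at least four iterates, which would be consistent with the empty $\rho$ cells for runs that stop after two iterations. A minimal Python sketch under that assumption (the helper name `acoc` and the Newton test problem are illustrative, not taken from the paper):

```python
import math

def acoc(iterates):
    """Estimate the order of convergence from the last four iterates.

    Assumed formula (not taken from this excerpt):
        rho ~ ln(d3 / d2) / ln(d2 / d1),
    where d1, d2, d3 are the norms of three successive steps.
    Returns None when fewer than four iterates are available.
    """
    if len(iterates) < 4:
        return None
    a, b, c, d = iterates[-4:]
    d1, d2, d3 = math.dist(b, a), math.dist(c, b), math.dist(d, c)
    return math.log(d3 / d2) / math.log(d2 / d1)

# Illustration only: Newton's method on g(x) = x^2 - 2, which converges
# quadratically, so the estimate should come out close to 2.
xs = [(1.5,)]
for _ in range(4):
    x = xs[-1][0]
    xs.append((x - (x * x - 2.0) / (2.0 * x),))

rho = acoc(xs)          # uses the last four of the five stored iterates
too_short = acoc(xs[:3])  # only three iterates: no estimate possible
```

With only an initial point and two updates there are three iterates, so `acoc` returns `None`, mirroring the blank entries.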
Table 3. Numerical tests for the Booth function $f_B$ with two different initial estimates.
Booth function $f_B$, $\xi = (1, 3)^T$

| $x^{(0)}$ | Method | Iter | $\rho$ | $f(x^{(k+1)})$ | CT (s) |
|---|---|---|---|---|---|
| $(3.45, 4.08)^T$ | BFGS | 2 | | $1.4200 \times 10^{-14}$ | 0.0046 |
| | DFP | 2 | | $2.2500 \times 10^{-14}$ | 0.0020 |
| | SR1 | 2 | | $1.5888 \times 10^{-14}$ | 0.0038 |
| | $M_1$DFP | 2 | | $7.1054 \times 10^{-15}$ | 0.0041 |
| | $M_2$DFP | 2 | | $1.5900 \times 10^{-14}$ | 0.0038 |
| | $M_3$DFP | 2 | | $1.0049 \times 10^{-14}$ | 0.0023 |
| | B$M_1$D | 2 | | $0.0000 \times 10^{0}$ | 0.0022 |
| | B$M_2$D | 2 | | $0.0000 \times 10^{0}$ | 0.0022 |
| | B$M_3$D | 2 | | $0.0000 \times 10^{0}$ | 0.0068 |
| $(3, 9)^T$ | BFGS | 2 | | $3.5500 \times 10^{-14}$ | 0.0007 |
| | DFP | 2 | | $2.9300 \times 10^{-14}$ | 0.0007 |
| | SR1 | 2 | | $5.1238 \times 10^{-14}$ | 0.0029 |
| | $M_1$DFP | 2 | | $2.5600 \times 10^{-14}$ | 0.0031 |
| | $M_2$DFP | 2 | | $1.0000 \times 10^{-14}$ | 0.0019 |
| | $M_3$DFP | 2 | | $7.9441 \times 10^{-14}$ | 0.0056 |
| | B$M_1$D | 2 | | $0.0000 \times 10^{0}$ | 0.0006 |
| | B$M_2$D | 2 | | $0.0000 \times 10^{0}$ | 0.0022 |
| | B$M_3$D | 2 | | $0.0000 \times 10^{0}$ | 0.0059 |
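As a quick sanity check on the minimizers $\xi$ quoted for the three benchmark functions, the sketch below evaluates the functions at those points. The formulas are the standard textbook definitions of the Himmelblau, Freudenstein–Roth, and Booth functions and are assumed to match the paper's $f_H$, $f_{FR}$, and $f_B$. The Himmelblau point is taken with negative coordinates, $(-3.7793, -3.2832)$, which is one of the four known global minimizers of that function, and $(11.41, -0.90)$ is the rounded location of the known non-global local minimizer of the Freudenstein–Roth function.

```python
def himmelblau(x, y):
    """Standard Himmelblau function; global minimum value 0."""
    return (x**2 + y - 11.0) ** 2 + (x + y**2 - 7.0) ** 2

def freudenstein_roth(x, y):
    """Standard Freudenstein-Roth function; global minimum 0 at (5, 4)."""
    r1 = -13.0 + x + ((5.0 - y) * y - 2.0) * y
    r2 = -29.0 + x + ((y + 1.0) * y - 14.0) * y
    return r1**2 + r2**2

def booth(x, y):
    """Standard Booth function; global minimum 0 at (1, 3)."""
    return (x + 2.0 * y - 7.0) ** 2 + (2.0 * x + y - 5.0) ** 2

values = {
    "Himmelblau at (-3.7793, -3.2832)": himmelblau(-3.7793, -3.2832),
    "Freudenstein-Roth at (5, 4)": freudenstein_roth(5.0, 4.0),
    "Freudenstein-Roth at (11.41, -0.90)": freudenstein_roth(11.41, -0.90),
    "Booth at (1, 3)": booth(1.0, 3.0),
}

for name, value in values.items():
    print(f"{name}: {value:.6g}")
```

The two global minimizers with integer coordinates give exactly zero, the four-digit Himmelblau point gives a value that is zero only up to rounding of the coordinates, and the second Freudenstein–Roth point gives a value near 49, so convergence to it is convergence to a local, not global, minimum.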
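Table 4. Scalability test on high-dimensional strictly convex quadratic problems.

| $n$ | $\kappa$ | Iterations: BFGS | Iterations: $M_1$DFP | Iterations: B$M_1$D | Normalized cost $\eta$: BFGS | Normalized cost $\eta$: $M_1$DFP | Normalized cost $\eta$: B$M_1$D |
|---|---|---|---|---|---|---|---|
| 100 | $10^{2}$ | 57 | 58 | 34 | $2.3662 \times 10^{-8}$ | $2.2992 \times 10^{-8}$ | $2.9310 \times 10^{-8}$ |
| 100 | $10^{6}$ | 98 | 112 | 58 | $1.1032 \times 10^{-8}$ | $8.9478 \times 10^{-9}$ | $1.5510 \times 10^{-8}$ |
| 500 | $10^{2}$ | 82 | 70 | 49 | $1.0261 \times 10^{-8}$ | $1.0478 \times 10^{-8}$ | $1.6158 \times 10^{-8}$ |
| 500 | $10^{6}$ | 394 | 434 | 220 | $8.3910 \times 10^{-9}$ | $8.9006 \times 10^{-9}$ | $1.3985 \times 10^{-8}$ |
| 1000 | $10^{2}$ | 87 | 75 | 48 | $8.2811 \times 10^{-9}$ | $1.0092 \times 10^{-8}$ | $1.3768 \times 10^{-8}$ |
| 1000 | $10^{6}$ | 770 | 800 | 408 | $8.8353 \times 10^{-9}$ | $1.5070 \times 10^{-8}$ | $2.2218 \times 10^{-8}$ |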
Table 5. Composition of the 30 base problems used in the Dolan–Moré benchmark.

| Collection | Problems | Number |
|---|---|---|
| Moré–Garbow–Hillstrom/Wood | Rosenbrock, Freudenstein–Roth, Beale, Powell singular, Wood, Extended Rosenbrock ($n = 10, 20$), Extended Powell ($n = 12, 20$), Broyden tridiagonal ($n = 10, 20$), Brown almost-linear, and variably dimensioned problems ($n = 10, 20$) | 14 |
| Classical unconstrained tests | Himmelblau, Booth, Sphere, Zakharov, Styblinski–Tang, Branin–Hoo, Matyas, Dixon–Price, Powell badly scaled, Three-hump camel, Extended Beale, and Extended Himmelblau | 12 |
| CUTEr/CUTEst-type analytic representatives | Trigonometric, Penalty I, ARWHEAD, and GENHUMPS | 4 |
Table 6. Dolan–Moré results for the 30 smooth test instances.

Performance measured by computational work ($N_f + N_g$)

| Method | Solved | Solved (%) | Best at $\tau = 1$ | $\rho_s(2)$ | $\rho_s(5)$ | Median cost |
|---|---|---|---|---|---|---|
| DFP | 29 | 96.667 | 16 | 0.8333 | 0.9667 | 159.0 |
| BFGS | 30 | 100.000 | 13 | 0.9667 | 1.0000 | 163.0 |
| SR1 | 28 | 93.333 | 18 | 0.9333 | 0.9333 | 140.0 |
| BM1D | 30 | 100.000 | 5 | 0.8667 | 1.0000 | 169.5 |
| BM2D | 30 | 100.000 | 4 | 0.9333 | 1.0000 | 182.5 |
| BM3D | 30 | 100.000 | 5 | 0.6667 | 1.0000 | 245.0 |

Performance measured by number of outer iterations

| Method | Solved | Solved (%) | Best at $\tau = 1$ | $\rho_s(2)$ | $\rho_s(5)$ | Median cost |
|---|---|---|---|---|---|---|
| DFP | 29 | 96.667 | 4 | 0.7000 | 0.9667 | 7.0 |
| BFGS | 30 | 100.000 | 4 | 0.9333 | 1.0000 | 8.5 |
| SR1 | 28 | 93.333 | 5 | 0.8000 | 0.9333 | 6.5 |
| BM1D | 30 | 100.000 | 21 | 1.0000 | 1.0000 | 4.5 |
| BM2D | 30 | 100.000 | 23 | 1.0000 | 1.0000 | 5.0 |
| BM3D | 30 | 100.000 | 13 | 1.0000 | 1.0000 | 5.5 |
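The quantities $\rho_s(2)$, $\rho_s(5)$, and "Best at $\tau = 1$" are values of a Dolan–Moré performance profile: for each problem, a solver's cost is divided by the best cost achieved by any solver on that problem, and $\rho_s(\tau)$ is the fraction of problems on which that ratio is at most $\tau$. The sketch below illustrates the standard definition in plain Python. The cost matrix is made up for the example and is not the paper's data, and the function name is hypothetical.

```python
import math

def performance_profile(costs, taus):
    """Dolan-More performance profile.

    costs[p][s] is the cost of solver s on problem p (math.inf for a
    failure); costs are assumed positive. Returns profile[t][s], the
    fraction of problems whose performance ratio for solver s is <= taus[t].
    """
    n_problems = len(costs)
    n_solvers = len(costs[0])
    profile = []
    for tau in taus:
        counts = [0] * n_solvers
        for row in costs:
            best = min(row)  # best cost on this problem
            for s, cost in enumerate(row):
                if math.isfinite(cost) and cost / best <= tau:
                    counts[s] += 1
        profile.append([k / n_problems for k in counts])
    return profile

# Hypothetical costs for 4 problems and 3 solvers (A, B, C).
costs = [
    [10.0, 20.0, 10.0],
    [30.0, 15.0, math.inf],  # solver C fails on this problem
    [8.0, 40.0, 16.0],
    [50.0, 50.0, 100.0],
]
profile = performance_profile(costs, taus=[1.0, 2.0, 5.0])
```

At $\tau = 1$ the profile counts the problems on which a solver is (possibly jointly) the best, so ties are counted for every tied solver and the "best" counts can sum to more than the number of problems.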
Table 7. Failure rates by benchmark class and overall.

| Method | Smooth | Noisy | Nonsmooth | Overall |
|---|---|---|---|---|
| DFP | 0.0333 | 0.1833 | 0.0333 | 0.0933 |
| BFGS | 0.0000 | 0.1167 | 0.0000 | 0.0467 |
| SR1 | 0.0667 | 0.1500 | 0.0333 | 0.0867 |
| BM1D | 0.0000 | 0.1500 | 0.0000 | 0.0600 |
| BM2D | 0.0000 | 0.1333 | 0.0000 | 0.0533 |
| BM3D | 0.0000 | 0.1333 | 0.0000 | 0.0533 |
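The Overall column can be read as a size-weighted average of the per-class failure rates. The sketch below recomputes it from the class columns. The class sizes are partly an assumption: 30 smooth instances and 150 instances in total are stated in the table and figure captions, while the 60/60 split between noisy and nonsmooth instances is inferred here because it reproduces the Overall column, and is not stated in this part of the article.

```python
# Class sizes: 30 smooth (from the Table 6 caption) and 150 in total
# (from the Figure 3 caption). The 60/60 noisy/nonsmooth split is an
# inference, not a figure quoted from the paper.
SIZES = {"smooth": 30, "noisy": 60, "nonsmooth": 60}

# (smooth, noisy, nonsmooth, overall) failure rates as printed in Table 7.
RATES = {
    "DFP": (0.0333, 0.1833, 0.0333, 0.0933),
    "BFGS": (0.0000, 0.1167, 0.0000, 0.0467),
    "SR1": (0.0667, 0.1500, 0.0333, 0.0867),
    "BM1D": (0.0000, 0.1500, 0.0000, 0.0600),
    "BM2D": (0.0000, 0.1333, 0.0000, 0.0533),
    "BM3D": (0.0000, 0.1333, 0.0000, 0.0533),
}

def overall_rate(smooth, noisy, nonsmooth):
    """Size-weighted mean of the three per-class failure rates."""
    failures = (
        smooth * SIZES["smooth"]
        + noisy * SIZES["noisy"]
        + nonsmooth * SIZES["nonsmooth"]
    )
    return failures / sum(SIZES.values())

recomputed = {m: round(overall_rate(*r[:3]), 4) for m, r in RATES.items()}

for method, value in recomputed.items():
    print(f"{method}: recomputed {value:.4f}, reported {RATES[method][3]:.4f}")
```

Under these sizes, DFP's rates correspond to 1 + 11 + 2 = 14 failures out of 150, and BFGS's to 7 failures, all in the noisy class.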
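Table 8. Cubic error verification for $M_1$DFP and B$M_1$D.

| Method | $\kappa$ | Samples | $\rho_{loc}$ | $K_{emp}$ | $K_{th}$ | $Q$ | $\delta_N$ | $\delta_B$ |
|---|---|---|---|---|---|---|---|---|
| $M_1$DFP | $10^{2}$ | 96 | 3.0107 | $2.5538 \times 10^{-6}$ | $2.5327 \times 10^{-6}$ | $3.1318 \times 10^{-5}$ | $2.8409 \times 10^{-2}$ | |
| $M_1$DFP | $10^{4}$ | 96 | 3.9990 | $8.9961 \times 10^{-8}$ | $2.2590 \times 10^{-7}$ | $1.3963 \times 10^{-5}$ | $2.0813 \times 10^{-2}$ | |
| $M_1$DFP | $10^{6}$ | 96 | 4.0005 | $1.0508 \times 10^{-8}$ | $1.9845 \times 10^{-7}$ | $8.9582 \times 10^{-6}$ | $1.5872 \times 10^{-2}$ | |
| $M_1$DFP | $10^{8}$ | 96 | 3.9252 | $2.5802 \times 10^{-9}$ | $8.0759 \times 10^{-8}$ | $3.7921 \times 10^{-6}$ | $8.1706 \times 10^{-3}$ | |
| B$M_1$D | $10^{2}$ | 96 | 3.0007 | $2.4177 \times 10^{-5}$ | $2.4889 \times 10^{-5}$ | $3.0458 \times 10^{-5}$ | $2.8409 \times 10^{-2}$ | $2.7995 \times 10^{-2}$ |
| B$M_1$D | $10^{4}$ | 96 | 3.2511 | $2.4725 \times 10^{-7}$ | $3.5745 \times 10^{-7}$ | $1.3954 \times 10^{-5}$ | $2.0813 \times 10^{-2}$ | $2.0811 \times 10^{-2}$ |
| B$M_1$D | $10^{6}$ | 96 | 3.7883 | $1.2633 \times 10^{-8}$ | $2.0103 \times 10^{-7}$ | $8.9578 \times 10^{-6}$ | $1.5872 \times 10^{-2}$ | $1.5872 \times 10^{-2}$ |
| B$M_1$D | $10^{8}$ | 96 | 3.8858 | $2.5274 \times 10^{-9}$ | $8.0762 \times 10^{-8}$ | $3.7922 \times 10^{-6}$ | $8.1706 \times 10^{-3}$ | $8.1706 \times 10^{-3}$ |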
Table 9. Limited-memory results for the Himmelblau function with $m = 10$.
Himmelblau function ($f_H$)

| $x^{(0)}$ | $\xi$ | Method | Iter | $\rho$ | $f(x^{(k+1)})$ | CT (s) |
|---|---|---|---|---|---|---|
| $(4.9, 1.96)^T$ | $(-2.81, 3.13)^T$ | $M_1$-LBFGS | 170 | 1.0034 | $8.7304 \times 10^{-13}$ | 0.0161 |
| | | $M_2$-LBFGS | 530 | 1.0053 | $2.2616 \times 10^{-9}$ | 0.0292 |
| | | $M_3$-LBFGS | 68 | 0.9730 | $8.4545 \times 10^{-13}$ | 0.0143 |
| | | L-BFGS | 187 | 0.9993 | $6.5113 \times 10^{-13}$ | 0.0069 |
| $(1.95, 3.19)^T$ | $(3.58, -1.85)^T$ | $M_1$-LBFGS | 293 | 1.0024 | $9.0901 \times 10^{-13}$ | 0.0153 |
| | | $M_2$-LBFGS | 561 | 1.0052 | $4.4543 \times 10^{-9}$ | 0.0350 |
| | | $M_3$-LBFGS | 95 | 0.9590 | $9.8602 \times 10^{-13}$ | 0.0412 |
| | | L-BFGS | 106 | 0.9953 | $6.9535 \times 10^{-13}$ | 0.0036 |
Table 10. Limited-memory results for the Freudenstein–Roth function with $m = 10$.
Freudenstein–Roth function ($f_{FR}$)

| $x^{(0)}$ | $\xi$ | Method | Iter | $\rho$ | $f(x^{(k+1)})$ | CT (s) |
|---|---|---|---|---|---|---|
| $(18, 1)^T$ | $(5, 4)^T$ | $M_1$-LBFGS | 749 | 0.9336 | $9.9704 \times 10^{-9}$ | 0.0263 |
| | | $M_2$-LBFGS | 792 | 0.9975 | $2.5072 \times 10^{-6}$ | 0.0323 |
| | | $M_3$-LBFGS | 714 | 0.9929 | $9.8866 \times 10^{-10}$ | 0.0318 |
| | | L-BFGS | 827 | 0.9929 | $9.9511 \times 10^{-9}$ | 0.0230 |
| $(9, 0)^T$ | $(11.41, -0.90)^T$ | $M_1$-LBFGS | 1158 | 0.9912 | $3.0208 \times 10^{-7}$ | 0.0431 |
| | | $M_2$-LBFGS | 175 | 1.0327 | $2.4232 \times 10^{-10}$ | 0.0444 |
| | | $M_3$-LBFGS | 314 | 0.9789 | $9.9611 \times 10^{-9}$ | 0.0152 |
| | | L-BFGS | 1158 | 0.9944 | $8.4165 \times 10^{-10}$ | 0.0266 |
Table 11. Limited-memory results for the Booth function with $m = 10$.
Booth function ($f_B$), $\xi = (1, 3)^T$

| $x^{(0)}$ | Method | Iter | $\rho$ | $f(x^{(k+1)})$ | CT (s) |
|---|---|---|---|---|---|
| $(6.01, 1.50)^T$ | $M_1$-LBFGS | 5 | 1.6198 | $0.0 \times 10^{0}$ | 0.0123 |
| | $M_2$-LBFGS | 16 | 0.9297 | $1.7990 \times 10^{-13}$ | 0.0069 |
| | $M_3$-LBFGS | 5 | 2.2100 | $2.2571 \times 10^{-9}$ | 0.0010 |
| | L-BFGS | 3 | 0.2094 | $0.0 \times 10^{0}$ | 0.0003 |
| $(2, 8)^T$ | $M_1$-LBFGS | 24 | 1.0248 | $9.0994 \times 10^{-14}$ | 0.0016 |
| | $M_2$-LBFGS | 7 | 0.9539 | $0.0 \times 10^{0}$ | 0.0024 |
| | $M_3$-LBFGS | 5 | 1.2141 | $6.7270 \times 10^{-9}$ | 0.0011 |
| | L-BFGS | 3 | 0.2304 | $0.0 \times 10^{0}$ | 0.0011 |
Table 12. MNIST performance versus memory size $m$, momentum $\mu = 0.95$, and learning rate lr.

| lr | $m$ | $M_1$-LBFGS Acc. | $M_1$-LBFGS Avg. time (s) | $M_2$-LBFGS Acc. | $M_2$-LBFGS Avg. time (s) | $M_3$-LBFGS Acc. | $M_3$-LBFGS Avg. time (s) | L-BFGS Acc. | L-BFGS Avg. time (s) |
|---|---|---|---|---|---|---|---|---|---|
| 0.05 | 5 | 0.9880 | 14.27 | 0.9894 | 14.50 | 0.9900 | 14.10 | 0.9915 | 12.45 |
| | 10 | 0.9877 | 15.25 | 0.9880 | 17.04 | 0.9909 | 15.20 | 0.9918 | 13.09 |
| | 15 | 0.9894 | 16.19 | 0.9876 | 17.80 | 0.9901 | 16.20 | 0.9854 | 12.30 |
| | 20 | 0.9885 | 17.72 | 0.9891 | 18.55 | 0.9903 | 17.50 | 0.9899 | 12.32 |
| 0.10 | 5 | 0.9928 | 14.74 | 0.9921 | 15.33 | 0.9930 | 14.20 | 0.9904 | 12.17 |
| | 10 | 0.9931 | 15.99 | 0.9923 | 17.11 | 0.9935 | 15.30 | 0.9911 | 12.32 |
| | 15 | 0.9932 | 16.87 | 0.9931 | 18.00 | 0.9928 | 16.10 | 0.9916 | 12.17 |
| | 20 | 0.9934 | 17.92 | 0.9932 | 19.36 | 0.9931 | 17.10 | 0.9929 | 12.90 |
Table 13. Comparison of hybrid optimizers with SGD and L-BFGS on MNIST for different learning rates.

| Method | Learning rate (lr) | Final accuracy | Avg. time (s) |
|---|---|---|---|
| $M_3$-LBFGS | 0.01 | 0.9889 | 15.20 |
| | 0.025 | 0.9920 | 14.20 |
| | 0.08 | 0.9940 | 14.10 |
| $M_1$-LBFGS | 0.01 | 0.9888 | 14.20 |
| | 0.025 | 0.9922 | 14.15 |
| | 0.08 | 0.9925 | 13.95 |
| $M_2$-LBFGS | 0.01 | 0.9890 | 14.65 |
| | 0.025 | 0.9916 | 14.62 |
| | 0.08 | 0.9920 | 14.66 |
| L-BFGS | 0.01 | 0.9854 | 12.23 |
| | 0.025 | 0.9876 | 12.27 |
| | 0.08 | 0.9902 | 12.33 |
| SGD | 0.01 | 0.9857 | 11.93 |
| | 0.025 | 0.9879 | 11.99 |
| | 0.08 | 0.9920 | 12.02 |
Table 14. MNIST loss and accuracy over three seeds after 10 epochs.

| Method | lr | Loss | Accuracy |
|---|---|---|---|
| $M_1$-LBFGS | 0.100 | 0.0151 ± 0.0007 | 0.9921 ± 0.0006 |
| $M_2$-LBFGS | 0.100 | 0.0208 ± 0.0008 | 0.9917 ± 0.0007 |
| Adam | 0.001 | 0.0101 ± 0.0009 | 0.9913 ± 0.0010 |
| AdamW | 0.001 | 0.0088 ± 0.0006 | 0.9915 ± 0.0012 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cordero, A.; Maimó, J.G.; Torregrosa, J.R.; Castillo, N.U. Efficiency and Stability of a New Hybrid Unconstrained Optimization Algorithm with Quasi-Newton Updates and Higher-Order Methods. Mathematics 2026, 14, 1746. https://doi.org/10.3390/math14101746
