Efficiency and Stability of a New Hybrid Unconstrained Optimization Algorithm with Quasi-Newton Updates and Higher-Order Methods

Cordero, Alicia; Maimó, Javier G.; Torregrosa, Juan R.; Castillo, Natanael Ureña

doi:10.3390/math14101746

¹ Instituto de Matemática Multidisciplinar, Universitat Politècnica de València, Camino de Vera, s/n, 46022 Valencia, Spain

² Ciencias Básicas y Ambientales (CBA), Instituto Tecnológico de Santo Domingo (INTEC), Santo Domingo 10602, Dominican Republic

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(10), 1746; https://doi.org/10.3390/math14101746

Submission received: 3 April 2026 / Revised: 4 May 2026 / Accepted: 7 May 2026 / Published: 19 May 2026

Abstract

We propose the higher-order quasi-Newton (HOQN) method, a hybrid algorithm for unconstrained optimization that combines Newtonian predictors with higher-order correctors derived from vector extensions of the Traub, Chun, and Ostrowski methods, along with quasi-Newton updates of the inverse Hessian using Broyden–Fletcher–Goldfarb–Shanno (BFGS) or Davidon–Fletcher–Powell (DFP) formulas. We demonstrate that the resulting scheme achieves cubic local convergence order, representing a substantial improvement over the superlinear convergence typical of classical quasi-Newton methods, while maintaining a cost of O (n^{2}) per iteration. We also analyze variants that incorporate two successive quasi-Newton updates, and show that they retain the same cubic order. Numerical experiments with the benchmark functions of Himmelblau and Freudenstein–Roth confirm the theoretical convergence order and show that the hybrid variants consistently require fewer iterations than BFGS, DFP, and Symmetric Rank-One (SR1). In the case of the Booth function, given its strictly convex quadratic structure, the proposed hybrid methods reach the global minimum in just two iterations and exhibit numerical accuracy superior to that of classical quasi-Newton methods. In addition, limited-memory variants (L-HOQN) are introduced; these are evaluated during the training of a convolutional neural network on the MNIST dataset, where they achieve test accuracies exceeding 99% and outperform L-BFGS and standard stochastic gradient descent (SGD) at all tested learning rates.

Keywords:

hybrid method; quasi-Newton methods; gradient; Hessian; benchmark functions; dynamical planes; Dolan–Moré performance profile; neural networks

MSC:

65H10

1. Introduction

The development of quasi-Newton methods is one of the most important lines of research in unconstrained numerical optimization, due to their ability to balance computational efficiency and accuracy in large-scale problems. These methods seek to replicate the local efficiency of the Newton method while avoiding the explicit computation of the inverse of the Hessian matrix.

Let us first consider the classical Newton method for unconstrained optimization. Let f:

R^{n} \to R

be a twice continuously differentiable function, where

\nabla f (x)

and

\nabla^{2} f (x)

denote the gradient and the Hessian of f at a point

x \in R^{n}

, respectively. Newton’s method generates a sequence

{x^{(k)}}

of approximations to a minimizer of f according to

d^{(k)} = - {[\nabla^{2} f (x^{(k)})]}^{- 1} \nabla f (x^{(k)}), x^{(k + 1)} = x^{(k)} + α^{(k)} d^{(k)},

where

x^{(k)}

denotes the current iterate,

d^{(k)}

is the Newton search direction, and

α^{(k)} > 0

is a step length obtained by means of a line search procedure.

Under standard regularity assumptions, the method exhibits local quadratic convergence. However, its implementation requires the evaluation and subsequent inversion of the Hessian at each iteration, which implies a computational cost of order

O (n^{3})

, making it impractical for large-scale problems. Furthermore, the Newton direction is guaranteed to be a descent direction only when the Hessian is positive definite at the current iterate, a condition that may not be satisfied in nonconvex problems.
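As a minimal sketch of the Newton iteration described above, the following Python snippet (assuming NumPy is available) applies full steps, i.e., α^{(k)} = 1, to a hypothetical convex test function f(x_1, x_2) = e^{x_1} - x_1 + x_2^2, whose minimizer is the origin. The test function, the starting point, and the helper names are illustrative choices that do not come from the article; the direction is obtained by solving a linear system instead of forming the inverse Hessian explicitly.

```python
import numpy as np

def grad(x):
    # Gradient of the illustrative function f(x) = exp(x[0]) - x[0] + x[1]**2.
    return np.array([np.exp(x[0]) - 1.0, 2.0 * x[1]])

def hess(x):
    # Hessian of the same function; it is positive definite everywhere.
    return np.array([[np.exp(x[0]), 0.0], [0.0, 2.0]])

def newton(x0, tol=1e-12, max_iter=50):
    """Newton's method with full steps; returns the last iterate and the gradient norms."""
    x = np.asarray(x0, dtype=float)
    history = [np.linalg.norm(grad(x))]
    for _ in range(max_iter):
        if history[-1] < tol:
            break
        d = -np.linalg.solve(hess(x), grad(x))  # Newton direction, no explicit inverse
        x = x + d                               # full step (alpha = 1)
        history.append(np.linalg.norm(grad(x)))
    return x, history

x_min, grad_norms = newton([1.0, -2.0])
```

Printing `grad_norms` should show the gradient norm roughly squaring from one iteration to the next once the iterates are close to the minimizer, which is the quadratic behavior mentioned in the text.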

With the aim of reducing computational cost while maintaining good convergence properties, quasi-Newton methods replace the exact Hessian with an iteratively updated approximation. In general, these methods generate the sequence

{x^{(k)}}

according to the scheme

\begin{matrix} d^{(k)} = - H^{(k)} \nabla f (x^{(k)}), x^{(k + 1)} = x^{(k)} + α^{(k)} d^{(k)}, \\ s^{(k)} = x^{(k + 1)} - x^{(k)}, y^{(k)} = \nabla f (x^{(k + 1)}) - \nabla f (x^{(k)}), \end{matrix}

where

H^{(k)}

denotes an approximation of the inverse Hessian matrix

{[\nabla^{2} f (x^{(k)})]}^{- 1}

,

s^{(k)}

represents the displacement between two successive iterates, and

y^{(k)}

denotes the variation of the gradient between those iterates. The matrix

H^{(k + 1)}

is computed through a quasi-Newton update formula satisfying the secant equation

H^{(k + 1)} y^{(k)} = s^{(k)} .

Different quasi-Newton methods arise from different update formulas. Among them, the DFP method, introduced in the 1960s, is one of the first designed for this purpose. The DFP update of the inverse Hessian is given by

H^{(k + 1)} = H^{(k)} + \frac{s^{(k)} {(s^{(k)})}^{T}}{{(s^{(k)})}^{T} y^{(k)}} - \frac{H^{(k)} y^{(k)} {(y^{(k)})}^{T} H^{(k)}}{{(y^{(k)})}^{T} H^{(k)} y^{(k)}},

(1)

and was first presented in [1].

Later, in the 1970s, the BFGS method [2] was developed and is known for its efficiency and stability. In terms of the inverse Hessian approximation, the update is

H^{(k + 1)} = (I - \frac{s^{(k)} {(y^{(k)})}^{T}}{{(y^{(k)})}^{T} s^{(k)}}) H^{(k)} (I - \frac{y^{(k)} {(s^{(k)})}^{T}}{{(y^{(k)})}^{T} s^{(k)}}) + \frac{s^{(k)} {(s^{(k)})}^{T}}{{(y^{(k)})}^{T} s^{(k)}},

(2)

where I denotes the identity matrix of the same dimension as

H^{(k)}

.
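The two updates (1) and (2) can be sketched in a few lines of Python (assuming NumPy). The matrix and the vectors below are hypothetical data chosen only so that (s^{(k)})^{T} y^{(k)} > 0, and the DFP routine assumes that H^{(k)} is symmetric; the final checks simply confirm that both updated matrices satisfy the secant equation.

```python
import numpy as np

def dfp_update(H, s, y):
    """DFP update (1) of the inverse Hessian approximation; H is assumed symmetric."""
    Hy = H @ y
    return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)

def bfgs_update(H, s, y):
    """BFGS update (2) of the inverse Hessian approximation."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Hypothetical data: a symmetric positive definite H and a pair (s, y) with s^T y > 0.
H0 = np.array([[2.0, 0.5], [0.5, 1.0]])
s = np.array([1.0, -0.5])
y = np.array([0.8, 0.3])

H_dfp = dfp_update(H0, s, y)
H_bfgs = bfgs_update(H0, s, y)
secant_ok = np.allclose(H_dfp @ y, s) and np.allclose(H_bfgs @ y, s)
```

Both formulas cost O (n^{2}) operations per update, since they only involve matrix-vector and outer products.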

Another well-known quasi-Newton method is the symmetric rank-one (SR1) method [3]. The SR1 formula is considered one of the earliest quasi-Newton updates and is often associated with the ideas of Davidon [4].

However, its theoretical development and modern use in optimization were established primarily during the period 1991–1996. In terms of the approximation of the inverse of the Hessian, its update is defined by

H^{(k + 1)} = H^{(k)} + \frac{(s^{(k)} - H^{(k)} y^{(k)}) {(s^{(k)} - H^{(k)} y^{(k)})}^{T}}{{(s^{(k)} - H^{(k)} y^{(k)})}^{T} y^{(k)}} .

(3)
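A possible implementation of the SR1 update (3) is sketched below in Python (assuming NumPy). The skipping rule based on the threshold `r` is a commonly used safeguard against a vanishing denominator; it is an addition for illustration and is not part of formula (3). The data are hypothetical.

```python
import numpy as np

def sr1_update(H, s, y, r=1e-8):
    """SR1 update (3) of the inverse Hessian approximation.

    The update is skipped (H is returned unchanged) when the denominator is
    negligible; this safeguard is an assumption of this sketch.
    """
    v = s - H @ y
    denom = v @ y
    if abs(denom) <= r * np.linalg.norm(v) * np.linalg.norm(y):
        return H
    return H + np.outer(v, v) / denom

# Hypothetical data.
H0 = np.eye(2)
s = np.array([1.0, -0.5])
y = np.array([0.8, 0.3])

H1 = sr1_update(H0, s, y)           # regular update
H_same = sr1_update(H0, H0 @ y, y)  # s = H y, so v = 0 and the update is skipped
```

In this example the updated matrix satisfies the secant equation but is indefinite, which illustrates that, unlike DFP and BFGS, the SR1 formula does not in general preserve positive definiteness.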

Quasi-Newton methods reduce the cost per iteration to

O (n^{2})

[5,6], but may suffer from loss of superlinear behavior and accumulation of numerical inaccuracies. In this context, hybrid algorithms [7,8,9,10] have emerged as a promising alternative by combining the strengths of different approaches to overcome the limitations of classical methods.

This article presents the Higher-Order Quasi-Newton (HOQN) method, a new hybrid algorithm with cubic local convergence order, which combines Newton-type techniques with high-order correctors constructed as vector extensions of the scalar methods of Chun, Ostrowski, and Traub. These vector extensions satisfy the optimality condition established by the Cordero–Torregrosa conjecture [11], according to which the convergence order of any Newton-type iterative method for solving nonlinear systems without memory cannot exceed

2^{k_{1} + k_{2} - 1},

where

k_{1}

represents the number of functional evaluations appearing in the entries of the Jacobian matrix per iteration and

k_{2}

represents the number of evaluations of the nonlinear function, with

k_{1} \leq k_{2}

. If a method reaches this upper bound, it is said to be optimal in the vectorial sense. In the proposed scheme, these high-order correctors are combined with quasi-Newton updates of the inverse Hessian matrix. Although the theoretical convergence of the method has been demonstrated in the context of convex functions, numerical experiments show that, in practice, the scheme also performs excellently in the minimization of nonconvex functions.

The HOQN scheme is applicable to any update formula, and it even allows different update formulas after each hybrid step or nonlinear combinations of them. Specifically, a single update formula for the inverse of the Hessian matrix can be used after the high-order step, or two quasi-Newton update formulas can be used: one after the Newton step and another after the high-order corrector. This flexibility provides greater robustness and adaptability for solving complex optimization problems.

The main contribution of this work is to show that the HOQN method provides faster and more stable convergence than classical methods, due to the integration of higher-order corrections together with quasi-Newton updates. In addition, the dependence of the computational performance of the HOQN method on the initial estimate is studied by representing the basins of attraction for three benchmark functions: Himmelblau, Freudenstein–Roth, and Booth [12]. New variants of the hybrid algorithm with limited memory are also proposed, which are validated in the training of convolutional neural networks, obtaining excellent overall performance.

This manuscript is organized as follows. Section 1 provides an introduction to the study, outlining the state of the art in quasi-Newton methods and presenting the preliminary concepts that form the theoretical foundation of the research. Section 2 reviews the vector extensions of the scalar methods of Chun, Ostrowski, and Traub, and introduces a new optimal vector variant of Traub’s method inspired by the construction proposed by Singh et al. in [13]. Section 3 presents the general HOQN framework, including the one-update and two-update hybrid quasi-Newton schemes. Section 4 establishes the local convergence analysis of the proposed methods and proves the cubic convergence order under suitable smoothness and inverse-Hessian consistency assumptions. Section 5 reports the numerical experiments, including full-memory tests on classical benchmark functions, high-dimensional quadratic experiments, Dolan–Moré performance profiles, dynamical-plane analysis, and the limited-memory HOQN variants. Section 6 discusses the practical application of the proposed limited-memory hybrid optimizers to the training of a convolutional neural network on the MNIST dataset. Finally, Section 7 presents the main conclusions and outlines possible directions for future research.

Preliminary Concepts

In order to establish the theoretical foundations for the study of the proposed hybrid optimization methods, it is necessary to recall some classical definitions from optimization theory. These notions provide the mathematical framework for guaranteeing convergence properties and characterizing the behavior of the algorithms. In the sections devoted to vector extensions of iterative methods for nonlinear systems, we use the notation

F : R^{n} \to R^{n}

for a nonlinear mapping and

F^{'} (x)

for its Jacobian. In the particular case of unconstrained optimization, this mapping is identified with the gradient, that is,

F (x) = \nabla f (x)

. Hence, the first-order necessary condition for a local minimizer of f can be written as the nonlinear system

F (x) = 0

.

Definition 1

(Convex Function). A function

f : R^{n} \to R

is convex [14] if, for any pair of points

x_{1}, x_{2} \in R^{n}

and any

λ \in [0, 1]

, the following inequality holds:

f (λ x_{1} + (1 - λ) x_{2}) \leq λ f (x_{1}) + (1 - λ) f (x_{2}) .

In addition to convexity, it is often useful to quantify how rapidly a function can change. This leads to the notion of Lipschitz continuity, which provides a bound on the variation of a function with respect to its inputs.

Definition 2

(Lipschitz Continuity of the Gradient). Let

f : R^{n} \to R

be a continuously differentiable function. We say that the gradient

\nabla f

is Lipschitz continuous [15,16] on a set

D \subseteq R^{n}

if there exists a constant

L \geq 0

such that, for all

x, y \in D

,

∥ \nabla f (x) - \nabla f (y) ∥ \leq L ∥ x - y ∥,

where

∥ \cdot ∥

denotes a norm in

R^{n}

, and L is the Lipschitz constant.
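As a brief worked illustration of Definition 2 (added here as an example, with the Euclidean norm assumed), consider a quadratic function f (x) = \frac{1}{2} x^{T} A x - b^{T} x with A symmetric, so that \nabla f (x) = A x - b. Then:

```latex
\| \nabla f (x) - \nabla f (y) \| = \| A (x - y) \| \leq \| A \|_{2} \, \| x - y \|,
\qquad
L = \| A \|_{2} = \max_{i} | \lambda_{i} (A) | .
```

For instance, the Booth function f (x_{1}, x_{2}) = (x_{1} + 2 x_{2} - 7)^{2} + (2 x_{1} + x_{2} - 5)^{2} has the constant Hessian A with rows (10, 8) and (8, 10), whose eigenvalues are 2 and 18, so its gradient is Lipschitz continuous on R^{n} with L = 18.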

When analyzing iterative optimization algorithms, line search techniques play a key role in guaranteeing sufficient progress at each iteration. For this purpose, the strong Wolfe conditions are frequently imposed.

Definition 3

(Strong Wolfe Conditions). An optimization algorithm satisfies the strong Wolfe conditions [17,18,19] if, at each iteration k, the search direction

d^{(k)}

(with

\nabla f {(x^{(k)})}^{T} d^{(k)} < 0

) and the step size

α^{(k)} > 0

jointly meet:

\begin{matrix} Sufficient decrease (Armijo): & f (x^{(k)} + α^{(k)} d^{(k)}) \leq f (x^{(k)}) + c_{1} α^{(k)} \nabla f {(x^{(k)})}^{T} d^{(k)}, \\ Curvature condition: & |\nabla f {(x^{(k)} + α^{(k)} d^{(k)})}^{T} d^{(k)}| \leq c_{2} |\nabla f {(x^{(k)})}^{T} d^{(k)}|, \end{matrix}

(4)

where

x^{(k)}

is the current iterate,

\nabla f (x^{(k)})

is the gradient of the objective function at

x^{(k)}

, and the constants satisfy

0 < c_{1} < c_{2} < 1

(typically

c_{1} \in [10^{- 6}, 10^{- 4}]

and

c_{2} \approx 0.9

).
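The two inequalities in (4) can be checked numerically for a trial step. The following Python sketch (assuming NumPy) does so for a generic objective; the function name `satisfies_strong_wolfe`, the default constants, and the quadratic test problem are illustrative assumptions.

```python
import numpy as np

def satisfies_strong_wolfe(f, grad, x, d, alpha, c1=1e-4, c2=0.9):
    """Return True if the step alpha along d satisfies both conditions in (4)."""
    slope0 = grad(x) @ d
    if slope0 >= 0:
        raise ValueError("d must be a descent direction")
    x_new = x + alpha * d
    armijo = f(x_new) <= f(x) + c1 * alpha * slope0        # sufficient decrease
    curvature = abs(grad(x_new) @ d) <= c2 * abs(slope0)   # strong curvature condition
    return bool(armijo and curvature)

# Illustrative problem: f(x) = 0.5 * ||x||^2 with the steepest descent direction.
f = lambda x: 0.5 * (x @ x)
grad = lambda x: x
x = np.array([1.0, 0.0])
d = -grad(x)

ok_exact = satisfies_strong_wolfe(f, grad, x, d, alpha=1.0)    # exact minimizer along d
ok_long = satisfies_strong_wolfe(f, grad, x, d, alpha=2.0)     # no decrease: Armijo fails
ok_short = satisfies_strong_wolfe(f, grad, x, d, alpha=0.01)   # slope barely changes
```

In this example the step α = 1 is accepted, α = 2 violates the sufficient decrease condition, and α = 0.01 violates the curvature condition, which is precisely the role of the second inequality: it rules out steps that are too short.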

Finally, to ensure that quasi-Newton updates preserve desirable convergence properties, one of the most widely used criteria is the Dennis–Moré condition, which characterizes the superlinear behavior of these methods. This condition will play a fundamental role in demonstrating that the quasi-Newton updates embedded in the HOQN scheme preserve its local order of convergence.

Definition 4

(Dennis–Moré Condition [5,20]). Let

{x^{(k)}}

be a sequence generated by a quasi-Newton method converging to a local minimizer ξ, and let

{H^{(k)}}

denote the corresponding sequence of inverse Hessian approximations. The sequence

{H^{(k)}}

is said to satisfy the Dennis–Moré condition at ξ if

\lim_{k \to \infty} \frac{∥ (H^{(k + 1)} - {[\nabla^{2} f (ξ)]}^{- 1}) s^{(k)} ∥}{∥ s^{(k)} ∥} = 0 .

2. Vector Extensions of Traub’s, Chun’s, and Ostrowski’s Methods

Although the methods of Chun, Ostrowski and Traub were originally developed for solving nonlinear equations, their extension to the context of unconstrained optimization arises naturally by imposing the condition

\nabla f (x) = 0

. The methods of Chun [21] and Ostrowski [22], widely recognized for their high-order properties, provide a relevant conceptual basis for the development of vector extensions oriented toward optimization problems. For the design of optimal vectorial extensions of Chun’s and Ostrowski’s methods, the strategy proposed by Cordero et al. [23] is particularly relevant. This strategy is based on the fundamental idea introduced by Singh et al. [13], who present a two-step iterative scheme with fifth-order convergence for solving nonlinear systems. The first step of this scheme is a standard Newton iteration,

z^{(k)} = x^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)}),

which computes a predictor

z^{(k)}

that is refined in the subsequent step. Here, F :

R^{n} \to R^{n}

defines the nonlinear system of equations F (x) = 0, and

F^{'} (x)

denotes the Jacobian matrix of F evaluated at x. Unless otherwise stated, all norms considered throughout this work are Euclidean norms. The scheme then includes a second step, where a weighted corrective term is introduced, based on the ratio of squared norms of the function evaluated at

z^{(k)}

and

x^{(k)}

, expressed as

\frac{{∥F (z^{(k)})∥}^{2}}{{∥F (x^{(k)})∥}^{2}}

.

x^{(k + 1)} = z^{(k)} - (1 + \frac{{∥F (z^{(k)})∥}^{2}}{{∥F (x^{(k)})∥}^{2}}) {[F^{'} (z^{(k)})]}^{- 1} F (z^{(k)}) .

In [23], Cordero et al. extend this idea by parametrizing the iterative scheme through the Ermakov Hyperfamily for Systems, with expression:

\begin{matrix} z^{(k)} = x^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)}), \\ x^{(k + 1)} = z^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} (p_{k} F (z^{(k)}) + q_{k} F (x^{(k)})), k = 0, 1, \dots \end{matrix}

(5)

where

ν_{k} = \frac{{∥F (z^{(k)})∥}^{2}}{{∥F (x^{(k)})∥}^{2}} = \frac{F {(z^{(k)})}^{T} F (z^{(k)})}{F {(x^{(k)})}^{T} F (x^{(k)})}, K_{k} = \frac{1}{1 + λ ν_{k}}, p_{k} = K_{k} (1 + ψ ν_{k}), q_{k} = 2 K_{k} ν_{k},

(6)

obtaining, from (5), as special cases, the optimal vector versions of the scalar methods of Chun and Ostrowski, which we have denoted, respectively, by

M_{1}

and

M_{2}

. For the optimal vector version

M_{1}

, when setting

λ = 0

and

ψ = 1

, the iteration is expressed as:

x^{(k + 1)} = z^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} ((1 + ν_{k}) F (z^{(k)}) + 2 ν_{k} F (x^{(k)})) .

(7)

On the other hand, by choosing

λ = - 4

and

ψ = 0

, we obtain the optimal vector version

M_{2}

, whose iterative expression is

x^{(k + 1)} = z^{(k)} - \frac{1}{1 - 4 ν_{k}} {[F^{'} (x^{(k)})]}^{- 1} (F (z^{(k)}) + 2 ν_{k} F (x^{(k)})) .

(8)
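The family (5)–(6) can be sketched in Python (assuming NumPy) as a single routine parametrized by λ and ψ. The coefficients are coded exactly as in (6), so that (λ, ψ) = (0, 1) gives p_{k} = 1 + ν_{k}, q_{k} = 2 ν_{k}, and (λ, ψ) = (-4, 0) gives the M_{2} iteration (8). The 2 × 2 test system, the starting point, and the function names are hypothetical choices made only for illustration.

```python
import numpy as np

def F(x):
    # Illustrative system with a root at (1, 1): a circle intersected with a line.
    return np.array([x[0] ** 2 + x[1] ** 2 - 2.0, x[0] - x[1]])

def J(x):
    # Jacobian of F.
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

def ermakov_family(x0, lam, psi, tol=1e-10, max_iter=50):
    """Two-step scheme (5) with the coefficients nu_k, K_k, p_k, q_k of (6)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:   # also avoids dividing by ||F(x)||^2 = 0
            break
        Jx = J(x)
        z = x - np.linalg.solve(Jx, Fx)          # Newton predictor
        Fz = F(z)
        nu = (Fz @ Fz) / (Fx @ Fx)
        K = 1.0 / (1.0 + lam * nu)
        p = K * (1.0 + psi * nu)
        q = 2.0 * K * nu
        x = z - np.linalg.solve(Jx, p * Fz + q * Fx)   # corrector with frozen Jacobian
    return x

root_m1 = ermakov_family([1.2, 0.9], lam=0.0, psi=1.0)    # Chun-type member
root_m2 = ermakov_family([1.2, 0.9], lam=-4.0, psi=0.0)   # Ostrowski-type member
```

Note that both steps reuse the same Jacobian F^{'} (x^{(k)}), so a single factorization per iteration would suffice in a more careful implementation.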

Finally, Traub’s method [24], in its multidimensional form,

\begin{matrix} z^{(k)} & = x^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)}), \\ x^{(k + 1)} & = z^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} F (z^{(k)}), \end{matrix}

(9)

constitutes a third-order multistep scheme that serves as one of the fundamental precursors to the hybrid constructions considered in this work. To align the Traub method with the idea introduced by Singh et al. in [13], we employ a technique based on weight functions depending on the acceleration parameter

ν_{k}

, through which a new fourth-order variant of this method is generated. The incorporation of scalar weight functions defined in terms of

ν_{k}

allows us to preserve the essential structure of Traub’s method while, at the same time, endowing it with an additional acceleration mechanism that improves its local behavior. In this sense, the resulting variant can be interpreted as a natural extension of Traub’s method within the class of multistep methods with acceleration parameters. The following theorem formalizes this construction, establishes the corresponding error equation, and rigorously proves that the resulting method possesses fourth-order local convergence.

Theorem 1.

Let

F : D \subset R^{n} \to R^{n}

be a sufficiently differentiable function in an open convex neighborhood D of ξ, where ξ is a simple solution of

F (x) = 0

; that is,

F (ξ) = 0, F^{'} (ξ)

is nonsingular. Assume that

x^{(0)}

is an initial approximation sufficiently close to ξ. Consider the iterative scheme

\{\begin{matrix} z^{(k)} = x^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)}), \\ x^{(k + 1)} = x^{(k)} - H (ν_{k}) {[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)}) - G (ν_{k}) {[F^{'} (x^{(k)})]}^{- 1} F (z^{(k)}), \end{matrix}

(10)

where H and G are sufficiently differentiable real-valued weight functions of a scalar argument,

e^{(k)} = x^{(k)} - ξ, ν_{k} = \frac{{∥F (z^{(k)})∥}^{2}}{{∥F (x^{(k)})∥}^{2}}

, and

C_{j} = \frac{1}{j!} {[F^{'} (ξ)]}^{- 1} F^{(j)} (ξ), j \geq 2

.

Then, the sequence

{x^{(k)}}_{k \geq 0}

converges locally to ξ with order of convergence 4 if and only if

H_{0} = H (0) = 1, G_{0} = G (0) = 1, H_{1} = H^{'} (0) = 2, | G_{1} | = | G^{'} (0) | < + \infty .

In this case, the error equation is

e^{(k + 1)} = (2 C_{2}^{3} - 6 C_{2} C_{3} + 3 C_{4} - G_{1} C_{2}^{3} + 2 P^{- 1} Q C_{2}^{2}) e^{{(k)}^{4}} + O (e^{{(k)}^{5}}) .

(11)

and, therefore, the method is of order 4.

Proof.

Let

e^{(k)} = x^{(k)} - ξ

denote the error. The Taylor expansions of

F (x^{(k)})

and

F^{'} (x^{(k)})

around

ξ

are

F (x^{(k)}) = F^{'} (ξ) (e^{(k)} + C_{2} e^{{(k)}^{2}} + C_{3} e^{{(k)}^{3}} + C_{4} e^{{(k)}^{4}}) + O (e^{{(k)}^{5}}),

and

F^{'} (x^{(k)}) = F^{'} (ξ) (I + 2 C_{2} e^{(k)} + 3 C_{3} e^{{(k)}^{2}} + 4 C_{4} e^{{(k)}^{3}}) + O (e^{{(k)}^{4}}),

where

C_{j} = \frac{1}{j!} {[F^{'} (ξ)]}^{- 1} F^{(j)} (ξ), j \geq 2

. Furthermore, assuming that the inverse of the Jacobian satisfies

{[F^{'} (x^{(k)})]}^{- 1} F^{'} (x^{(k)}) = F^{'} (x^{(k)}) {[F^{'} (x^{(k)})]}^{- 1} = I,

we obtain

{[F^{'} (x^{(k)})]}^{- 1} = (I + X_{2} e^{(k)} + X_{3} e^{{(k)}^{2}} + X_{4} e^{{(k)}^{3}}) {[F^{'} (ξ)]}^{- 1} + O (e^{{(k)}^{4}}),

(12)

where

X_{2} = - 2 C_{2}, X_{3} = 4 C_{2}^{2} - 3 C_{3}, X_{4} = - 2 (4 C_{2}^{3} + 2 C_{4} - 3 C_{2} C_{3} - 3 C_{3} C_{2}) .

Therefore, the expansion of

A_{k} = {[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)})

is given by

A_{k} = e^{(k)} - C_{2} e^{{(k)}^{2}} + 2 (- C_{3} + C_{2}^{2}) e^{{(k)}^{3}} + B e^{{(k)}^{4}} + O (e^{{(k)}^{5}}),

(13)

where

B = - 4 C_{2}^{3} - 3 C_{4} + 4 C_{2} C_{3} + 3 C_{3} C_{2}

.

Let us now calculate the error of Newton’s step. Since

z^{(k)} = x^{(k)} - A_{k}

, it follows that

e_{z}^{(k)} = z^{(k)} - ξ = e^{(k)} - A_{k}

. Therefore,

e_{z}^{(k)} = C_{2} e^{{(k)}^{2}} - 2 (- C_{3} + C_{2}^{2}) e^{{(k)}^{3}} - B e^{{(k)}^{4}} + O (e^{{(k)}^{5}}),

that is,

e_{z}^{(k)} = C_{2} e^{{(k)}^{2}} - 2 (- C_{3} + C_{2}^{2}) e^{{(k)}^{3}} + (4 C_{2}^{3} + 3 C_{4} - 4 C_{2} C_{3} - 3 C_{3} C_{2}) e^{{(k)}^{4}} + O (e^{{(k)}^{5}}) .

(14)

Now let us expand

F (z^{(k)})

around

ξ

using the result obtained in (14):

\begin{matrix} F (z^{(k)}) = & F^{'} (ξ) (e_{z}^{(k)} + C_{2} e_{z}^{{(k)}^{2}} + C_{3} e_{z}^{{(k)}^{3}} + C_{4} e_{z}^{{(k)}^{4}}) + O (e_{z}^{{(k)}^{5}}) \\ = & F^{'} (ξ) (C_{2} e^{{(k)}^{2}} + A e^{{(k)}^{3}} + (B + C_{2}^{3}) e^{{(k)}^{4}}) + O (e^{{(k)}^{5}}) . \end{matrix}

(15)

where

A = - 2 (- C_{3} + C_{2}^{2})

,

B = 4 C_{2}^{3} + 3 C_{4} - 4 C_{2} C_{3} - 3 C_{3} C_{2}

. Note that B has been redefined here as the negative of the constant B used in (13). Therefore,

F (z^{(k)}) = F^{'} (ξ) (C_{2} e^{{(k)}^{2}} + 2 (C_{3} - C_{2}^{2}) e^{{(k)}^{3}} + (5 C_{2}^{3} + 3 C_{4} - 4 C_{2} C_{3} - 3 C_{3} C_{2}) e^{{(k)}^{4}}) + O (e^{{(k)}^{5}}) .

(16)

Consequently,

B_{k} = {[F^{'} (x^{(k)})]}^{- 1} F (z^{(k)}) = C_{2} e^{{(k)}^{2}} + 2 (C_{3} - 2 C_{2}^{2}) e^{{(k)}^{3}} + B_{4} e^{{(k)}^{4}} + O (e^{{(k)}^{5}}),

(17)

where

B_{4} = 3 C_{4} - 8 C_{2} C_{3} - 6 C_{3} C_{2} + 13 C_{2}^{3} .

Let us now proceed to determine the Taylor expansion of

ν_{k}

.

ν_{k} = \frac{{∥F (z^{(k)})∥}^{2}}{{∥F (x^{(k)})∥}^{2}} = \frac{F {(z^{(k)})}^{T} F (z^{(k)})}{F {(x^{(k)})}^{T} F (x^{(k)})} .

(18)

Let

F : R^{n} \to R^{n}

be sufficiently differentiable. For every

x = {(x_{1}, x_{2}, \dots, x_{n})}^{T} \in R^{n}

, it can be written as

F (x) = {(f_{1} (x), f_{2} (x), \dots, f_{n} (x))}^{T},

where each coordinate function

f_{i} : R^{n} \to R

is scalar-valued. Observe that, although

F (x^{(k)})

and

F (z^{(k)})

are vectors in

R^{n}

, the quotient

ν_{k}

is a scalar quantity. To manipulate it explicitly, Singh et al. [13] express it as

\frac{F {(z^{(k)})}^{T} F (z^{(k)})}{F {(x^{(k)})}^{T} F (x^{(k)})} = \frac{\sum_{i = 1}^{n} f_{i}^{2} (z^{(k)})}{\sum_{i = 1}^{n} f_{i}^{2} (x^{(k)})} .

(19)

Expanding

f_{i} (x^{(k)})

and

f_{i} (z^{(k)})

in Taylor series about

ξ

, we obtain

f_{i} (x^{(k)}) = f_{i}^{'} (ξ) e^{(k)} + \frac{1}{2} f_{i}^{''} (ξ) e^{{(k)}^{2}} + \frac{1}{6} f_{i}^{'''} (ξ) e^{{(k)}^{3}} + O (e^{{(k)}^{4}}),

(20)

f_{i} (z^{(k)}) = f_{i}^{'} (ξ) e_{z}^{(k)} + \frac{1}{2} f_{i}^{''} (ξ) e_{z}^{{(k)}^{2}} + \frac{1}{6} f_{i}^{'''} (ξ) e_{z}^{{(k)}^{3}} + O (e_{z}^{{(k)}^{4}}),

(21)

where

f_{i}^{'} (x) = (\frac{\partial f_{i}}{\partial x_{1}}, \dots, \frac{\partial f_{i}}{\partial x_{n}})

denotes the gradient (written as a row vector) of

f_{i}

, and

f_{i}^{''} (x) = {[\frac{\partial^{2} f_{i}}{\partial x_{j} \partial x_{k}}]}_{n \times n}

is the Hessian matrix of

f_{i}

. Higher order terms correspond to the multilinear derivatives of third order.

We rewrite (20) and (21) as

f_{i} (x^{(k)}) = R_{i} e^{(k)} + H_{i} e^{{(k)}^{2}} + K_{i} e^{{(k)}^{3}} + O (e^{{(k)}^{4}}),

(22)

f_{i} (z^{(k)}) = R_{i} e_{z}^{(k)} + H_{i} e_{z}^{{(k)}^{2}} + K_{i} e_{z}^{{(k)}^{3}} + O (e^{{(k)}^{4}}),

(23)

where

R_{i} = f_{i}^{'} (ξ), H_{i} = \frac{1}{2} f_{i}^{''} (ξ), K_{i} = \frac{1}{6} f_{i}^{'''} (ξ)

. Then

f_{i}^{2} (x^{(k)}) = P_{i} e^{{(k)}^{2}} + Q_{i} e^{{(k)}^{3}} + S_{i} e^{{(k)}^{4}} + O (e^{{(k)}^{5}}),

(24)

f_{i}^{2} (z^{(k)}) = P_{i} e_{z}^{{(k)}^{2}} + Q_{i} e_{z}^{{(k)}^{3}} + S_{i} e_{z}^{{(k)}^{4}} + O (e^{{(k)}^{5}}),

(25)

where

\begin{matrix} P_{i} & = R_{i}^{T} R_{i}, & Q_{i} & = R_{i}^{T} H_{i} + H_{i}^{T} R_{i}, & S_{i} & = R_{i}^{T} K_{i} + K_{i}^{T} R_{i} + H_{i}^{T} H_{i}, \\ P & = \sum_{i = 1}^{n} P_{i}, & Q & = \sum_{i = 1}^{n} Q_{i}, & S & = \sum_{i = 1}^{n} S_{i} . \end{matrix}

We use

e_{z}^{(k)} = C_{2} e^{{(k)}^{2}} + A e^{{(k)}^{3}} + B e^{{(k)}^{4}} + O (e^{{(k)}^{5}})

and compute

e_{z}^{{(k)}^{2}} = C_{2}^{2} e^{{(k)}^{4}} + (C_{2} A + A C_{2}) e^{{(k)}^{5}} + (A^{2} + C_{2} B + B C_{2}) e^{{(k)}^{6}} + O (e^{{(k)}^{7}}) .

(26)

For

e_{z}^{{(k)}^{3}}

, the dominant term is of order six, hence

e_{z}^{{(k)}^{3}} = C_{2}^{3} e^{{(k)}^{6}} + O (e^{{(k)}^{7}})

. Since

e_{z}^{(k)} = O (e^{{(k)}^{2}})

, in order to compute

ν_{k}

up to order four it suffices to consider

P e_{z}^{{(k)}^{2}}

up to

e^{{(k)}^{6}}

,

Q e_{z}^{{(k)}^{3}}

at the term

C_{2}^{3} e^{{(k)}^{6}}

, while

S e_{z}^{{(k)}^{4}}

starts at order

e^{{(k)}^{8}}

and does not contribute. The numerator and the denominator, respectively, become:

\begin{matrix} \sum_{i = 1}^{n} f_{i}^{2} (z^{(k)}) = & P C_{2}^{2} e^{{(k)}^{4}} + P (C_{2} A + A C_{2}) e^{{(k)}^{5}} \\ + & (P (A^{2} + C_{2} B + B C_{2}) + 2 Q C_{2}^{3}) e^{{(k)}^{6}} + O (e^{{(k)}^{7}}), \end{matrix}

(27)

\sum_{i = 1}^{n} f_{i}^{2} (x^{(k)}) = P e^{{(k)}^{2}} + Q e^{{(k)}^{3}} + S e^{{(k)}^{4}} + O (e^{{(k)}^{5}}) .

(28)

Substituting (27) and (28) in (19) and expanding the quotient up to order four, we obtain

\begin{matrix} ν_{k} = & C_{2}^{2} e^{{(k)}^{2}} + (2 C_{2} C_{3} + 2 C_{3} C_{2} - 4 C_{2}^{3} - P^{- 1} Q C_{2}^{2}) e^{{(k)}^{3}} \\ + (12 C_{2}^{4} + 4 C_{3}^{2} + 3 (C_{2} C_{4} + C_{4} C_{2}) - 8 C_{2}^{2} C_{3} - 7 C_{3} C_{2}^{2} - 7 C_{2} C_{3} C_{2} \\ + 5 P^{- 1} Q C_{2}^{3} - 2 P^{- 1} Q C_{2} C_{3} - 2 P^{- 1} Q C_{3} C_{2} + ({(P^{- 1} Q)}^{2} - P^{- 1} S) C_{2}^{2}) e^{{(k)}^{4}} + O (e^{{(k)}^{5}}) . \end{matrix}

(29)

Since

ν_{k} \to 0

as

k \to \infty

, the functions H and G can be expanded in Taylor series about 0 as follows:

\begin{matrix} H (ν_{k}) & = H_{0} + H_{1} ν_{k} + H_{2} ν_{k}^{2} + O (ν_{k}^{3}), \\ G (ν_{k}) & = G_{0} + G_{1} ν_{k} + G_{2} ν_{k}^{2} + O (ν_{k}^{3}) . \end{matrix}

For convenience, we introduce

A_{k} = {[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)}), B_{k} = {[F^{'} (x^{(k)})]}^{- 1} F (z^{(k)}),

so that the iterative error equation can be written as

e^{(k + 1)} = e^{(k)} - H (ν_{k}) A_{k} - G (ν_{k}) B_{k} .

(30)

Using the previous expansions, we have

A_{k} = e^{(k)} + O (e^{{(k)}^{2}}), B_{k} = O (e^{{(k)}^{2}}),

and, since

ν_{k} = O (e^{{(k)}^{2}})

,

ν_{k}^{2} A_{k} = O (e^{{(k)}^{5}}), ν_{k}^{2} B_{k} = O (e^{{(k)}^{6}}) .

Therefore, the terms involving

ν_{k}^{2}

and higher powers do not affect the error equation up to order four.

Substituting these expansions into (30) and collecting powers of

e^{(k)}

, we obtain

\begin{matrix} e^{(k + 1)} = & (1 - H_{0}) e^{(k)} + (H_{0} - G_{0}) C_{2} e^{{(k)}^{2}} + (2 (H_{0} - G_{0}) C_{3} + (4 G_{0} - 2 H_{0} - H_{1}) C_{2}^{2}) e^{{(k)}^{3}} \\ + (3 (H_{0} - G_{0}) C_{4} + (- 4 H_{0} + 8 G_{0} - 2 H_{1}) C_{2} C_{3} + (- 3 H_{0} + 6 G_{0} - H_{1}) C_{3} C_{2} \\ + (4 H_{0} - 13 G_{0} + 5 H_{1} - G_{1}) C_{2}^{3} + H_{1} P^{- 1} Q C_{2}^{2}) e^{{(k)}^{4}} + O (e^{{(k)}^{5}}) . \end{matrix}

(31)

Therefore, the order-four conditions are obtained by canceling the coefficients of

e^{(k)}, e^{{(k)}^{2}}, e^{{(k)}^{3}}

, namely

1 - H_{0} = 0, H_{0} - G_{0} = 0, 4 G_{0} - 2 H_{0} - H_{1} = 0,

which yield

H_{0} = 1, G_{0} = 1, H_{1} = 2,

while

G_{1}

remains a free finite parameter.

Consequently,

e^{(k + 1)} = (2 C_{2}^{3} - 6 C_{2} C_{3} + 3 C_{4} - G_{1} C_{2}^{3} + 2 P^{- 1} Q C_{2}^{2}) e^{{(k)}^{4}} + O (e^{{(k)}^{5}}),

(32)

and, therefore, method (10) is of order 4. □

From (10), with

H_{0} = 1

,

G_{0} = 1

,

H_{1} = 2

and

| G_{1} | < + \infty

, we obtain a new optimal vectorial variant of the scalar Traub method, denoted by

M_{3}

. The iterative scheme is given by

\{\begin{matrix} z^{(k)} = x^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)}), \\ x^{(k + 1)} = x^{(k)} - {[F^{'} (x^{(k)})]}^{- 1} ((1 + 2 ν_{k}) F (x^{(k)}) + (1 + G_{1} ν_{k}) F (z^{(k)})) . \end{matrix}

(33)

where

ν_{k} = \frac{{∥F (z^{(k)})∥}^{2}}{{∥F (x^{(k)})∥}^{2}},
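The weight-function scheme (10) can be sketched in Python (assuming NumPy) with H and G passed as callables. Taking H ≡ 1 and G ≡ 1 recovers Traub's method (9), while H (ν) = 1 + 2 ν and G (ν) = 1 + G_{1} ν gives the M_{3} variant (33). The test system, the starting point, the value G_{1} = 0, and the function names are assumptions made for illustration only.

```python
import numpy as np

def F(x):
    # Illustrative system with a root at (1, 1).
    return np.array([x[0] ** 2 + x[1] ** 2 - 2.0, x[0] - x[1]])

def J(x):
    # Jacobian of F.
    return np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])

def weighted_traub(x0, H, G, tol=1e-10, max_iter=50):
    """Scheme (10): x_new = x - H(nu) * A_k - G(nu) * B_k with a frozen Jacobian."""
    x = np.asarray(x0, dtype=float)
    iterations = 0
    while np.linalg.norm(F(x)) >= tol and iterations < max_iter:
        Fx, Jx = F(x), J(x)
        A = np.linalg.solve(Jx, Fx)     # A_k = [F'(x)]^{-1} F(x)
        z = x - A                       # Newton predictor
        Fz = F(z)
        B = np.linalg.solve(Jx, Fz)     # B_k = [F'(x)]^{-1} F(z)
        nu = (Fz @ Fz) / (Fx @ Fx)
        x = x - H(nu) * A - G(nu) * B
        iterations += 1
    return x, iterations

G1 = 0.0  # free finite parameter of Theorem 1 (illustrative value)
x_traub, it_traub = weighted_traub([1.2, 0.9], H=lambda nu: 1.0, G=lambda nu: 1.0)
x_m3, it_m3 = weighted_traub([1.2, 0.9],
                             H=lambda nu: 1.0 + 2.0 * nu,
                             G=lambda nu: 1.0 + G1 * nu)
```

Both runs need the same evaluations per iteration (F at two points and one Jacobian); the only extra work in M_{3} is the scalar ν_{k}, which is computed from quantities that are already available.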

The coefficient

ν_{k}

provides the

M_{1}, M_{2}

, and

M_{3}

methods with greater computational efficiency compared to their classical counterparts, which, although preserving fourth-order convergence, are computationally more expensive. This is particularly important in optimization, where reducing computational cost is essential for effectively tackling large-scale and complex problems. If

F : R^{n} \to R^{n}

denotes the nonlinear system under consideration, then, in the particular case of optimization, F is identified with the gradient vector of a scalar function

f : R^{n} \to R

, that is,

F (x) = \nabla f (x)

. Accordingly, these methods can be rewritten in terms of the gradient and the Hessian of f, and thus applied to approximate local extrema of f. Given an initial point

x^{(0)} \in R^{n}

, the

M_{1}, M_{2}

, and

M_{3}

methods generate sequences of points

\{x^{(k)}\}

according to the following iterative formulas:

\begin{matrix} M_{1} : x^{(k + 1)} & = z^{(k)} - {[\nabla^{2} f (x^{(k)})]}^{- 1} ((1 + ν_{k}) \nabla f (z^{(k)}) + 2 ν_{k} \nabla f (x^{(k)})), \end{matrix}

(34)

\begin{matrix} M_{2} : x^{(k + 1)} & = z^{(k)} - \frac{1}{1 - 4 ν_{k}} {[\nabla^{2} f (x^{(k)})]}^{- 1} (\nabla f (z^{(k)}) + 2 ν_{k} \nabla f (x^{(k)})), \end{matrix}

(35)

\begin{matrix} M_{3} : x^{(k + 1)} & = x^{(k)} - {[\nabla^{2} f (x^{(k)})]}^{- 1} ((1 + 2 ν_{k}) \nabla f (x^{(k)}) + (1 + G_{1} ν_{k}) \nabla f (z^{(k)})) . \end{matrix}

(36)

where

z^{(k)} = x^{(k)} - {[\nabla^{2} f (x^{(k)})]}^{- 1} \nabla f (x^{(k)})

is the Newton predictor, and

ν_{k} = \frac{∥ \nabla f (z^{(k)}) ∥^{2}}{∥ \nabla f (x^{(k)}) ∥^{2}}

.
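To make the optimization form of the three correctors concrete, the sketch below (Python, assuming NumPy) applies one iteration of each to the Booth function using its exact Hessian. In this sketch the M_{1} corrector uses the coefficient q_{k} = 2 ν_{k} from (6), and the M_{3} corrector starts from x^{(k)}, as in (33). Because the Booth function is quadratic, the Newton predictor already lands on the minimizer (1, 3), so \nabla f (z^{(k)}) ≈ 0 and ν_{k} ≈ 0; the example therefore only illustrates how the formulas are coded, not their convergence order. The function names and the starting point are illustrative.

```python
import numpy as np

A = np.array([[10.0, 8.0], [8.0, 10.0]])  # constant Hessian of the Booth function
b = np.array([34.0, 38.0])

def booth(x):
    return (x[0] + 2 * x[1] - 7) ** 2 + (2 * x[0] + x[1] - 5) ** 2

def grad(x):
    # Gradient of the Booth function, which equals A x - b.
    return A @ x - b

def corrector_step(x, method, G1=0.0):
    """One iteration of M1, M2 or M3 with the exact Hessian and full steps."""
    g = grad(x)
    z = x - np.linalg.solve(A, g)             # Newton predictor
    gz = grad(z)
    nu = (gz @ gz) / (g @ g)
    if method == "M1":
        return z - np.linalg.solve(A, (1 + nu) * gz + 2 * nu * g)
    if method == "M2":
        return z - np.linalg.solve(A, gz + 2 * nu * g) / (1 - 4 * nu)
    if method == "M3":
        return x - np.linalg.solve(A, (1 + 2 * nu) * g + (1 + G1 * nu) * gz)
    raise ValueError("method must be 'M1', 'M2' or 'M3'")

x0 = np.array([0.0, 0.0])
results = {m: corrector_step(x0, m) for m in ("M1", "M2", "M3")}
```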

3. Higher-Order Quasi-Newton Method (HOQN)

This algorithm combines Newton’s method with high order corrections given by (34)–(36) and a quasi-Newton update of the inverse Hessian based on the DFP scheme; see Algorithm 1. Accordingly, the

M_{1} DFP

,

M_{2} DFP

, and

M_{3} DFP

variants are obtained. At each iteration, a Newton-type direction followed by a high-order correction is computed to define the search direction; the step size is determined via an inexact line search (Armijo or Wolfe), and the inverse Hessian approximation is updated through a quasi-Newton formula at the end of the iteration. Additionally, the scheme can be extended by incorporating an intermediate quasi-Newton update after the Newton step, for instance using BFGS, leading to the variants described in Remark 1. The higher-order corrections employed in the algorithm allow a local cubic convergence order to be achieved without significantly increasing the computational cost per iteration. This results in a faster reduction of the gradient norm and, consequently, in fewer iterations and function evaluations in nonconvex problems, where a pure Newton or a standard quasi-Newton method is less competitive. The general algorithm of the HOQN method is given by:

Algorithm 1 HOQN iteration with line search

Inputs: x^{(k)}, inverse Hessian approximation H^{(k)}, objective f, gradient \nabla f, corrector method (M_{1} DFP, M_{2} DFP, or M_{3} DFP), parameter G_{1} with | G_{1} | < + \infty

1: g^{(k)} \leftarrow \nabla f (x^{(k)})
2: d_{N}^{(k)} \leftarrow - H^{(k)} g^{(k)}
3: α_{N}^{(k)} \leftarrow \arg \min_{α \in (0, α_{\max}]} f (x^{(k)} + α d_{N}^{(k)})
4: z^{(k)} \leftarrow x^{(k)} + α_{N}^{(k)} d_{N}^{(k)}
5: g_{z}^{(k)} \leftarrow \nabla f (z^{(k)})
6: ν_{k} \leftarrow \frac{∥ g_{z}^{(k)} ∥^{2}}{∥ g^{(k)} ∥^{2}}
7: if method = M_{1} DFP then
8:     d_{HO}^{(k)} \leftarrow - H^{(k)} ((1 + ν_{k}) g_{z}^{(k)} + 2 ν_{k} g^{(k)})
9:     α_{HO}^{(k)} \leftarrow \arg \min_{α \in (0, α_{\max}]} f (z^{(k)} + α d_{HO}^{(k)})
10:     x^{(k + 1)} \leftarrow z^{(k)} + α_{HO}^{(k)} d_{HO}^{(k)}
11: else if method = M_{2} DFP then
12:     d_{HO}^{(k)} \leftarrow - \frac{1}{1 - 4 ν_{k}} H^{(k)} (g_{z}^{(k)} + 2 ν_{k} g^{(k)})
13:     α_{HO}^{(k)} \leftarrow \arg \min_{α \in (0, α_{\max}]} f (z^{(k)} + α d_{HO}^{(k)})
14:     x^{(k + 1)} \leftarrow z^{(k)} + α_{HO}^{(k)} d_{HO}^{(k)}
15: else if method = M_{3} DFP then
16:     d_{HO}^{(k)} \leftarrow - H^{(k)} ((1 + 2 ν_{k}) g^{(k)} + (1 + G_{1} ν_{k}) g_{z}^{(k)})
17:     α_{HO}^{(k)} \leftarrow \arg \min_{α \in (0, α_{\max}]} f (x^{(k)} + α d_{HO}^{(k)})
18:     x^{(k + 1)} \leftarrow x^{(k)} + α_{HO}^{(k)} d_{HO}^{(k)}
19: end if
20: g^{(k + 1)} \leftarrow \nabla f (x^{(k + 1)})
21: s^{(k)} \leftarrow x^{(k + 1)} - x^{(k)}
22: y^{(k)} \leftarrow g^{(k + 1)} - g^{(k)}
23: H^{(k + 1)} \leftarrow DFP Update (H^{(k)}, s^{(k)}, y^{(k)})

Output: x^{(k + 1)}, H^{(k + 1)}

Remark 1.

If an intermediate quasi-Newton update is incorporated, that is, if an additional update is performed after the Newton step in Algorithm 1, for instance using a BFGS update, the following variants are obtained:

B M_{1} D

,

B M_{2} D

and

B M_{3} D

.
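The M_{1} DFP branch of Algorithm 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: the arg-min line search is replaced by Armijo backtracking, the DFP formula is the textbook inverse-Hessian update (assumed to agree with (1)), a non-descent correction is simply skipped, and all helper names (`backtracking`, `dfp_update`, `hoqn_m1_dfp_step`) are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(H, v):
    return [dot(row, v) for row in H]

def axpy(x, alpha, d):
    """Return x + alpha * d."""
    return [xi + alpha * di for xi, di in zip(x, d)]

def backtracking(f, x, d, slope, alpha=1.0, c1=1e-4, shrink=0.5, max_trials=60):
    """Armijo backtracking (assumed stand-in for the arg-min line search)."""
    if slope >= 0:
        return 0.0  # not a descent direction: take no step
    fx = f(x)
    for _ in range(max_trials):
        if f(axpy(x, alpha, d)) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    return 0.0  # no acceptable step found: stay at the current point

def dfp_update(H, s, y):
    """Textbook DFP update of the inverse Hessian; skipped if curvature fails."""
    sy, Hy = dot(s, y), matvec(H, y)
    yHy = dot(y, Hy)
    if sy <= 1e-14 or yHy <= 1e-14:
        return H
    n = len(s)
    return [[H[i][j] + s[i] * s[j] / sy - Hy[i] * Hy[j] / yHy
             for j in range(n)] for i in range(n)]

def hoqn_m1_dfp_step(f, grad, x, H):
    """Steps 1-10 and 20-23 of Algorithm 1 for the M1 DFP branch."""
    g = grad(x)
    gg = dot(g, g)
    if gg == 0.0:
        return x, H  # already stationary
    d_n = [-v for v in matvec(H, g)]
    z = axpy(x, backtracking(f, x, d_n, dot(g, d_n)), d_n)
    g_z = grad(z)
    nu = dot(g_z, g_z) / gg
    d_ho = [-v for v in matvec(H, [(1 + nu) * a - 2 * nu * b
                                   for a, b in zip(g_z, g)])]
    x_new = axpy(z, backtracking(f, z, d_ho, dot(g_z, d_ho)), d_ho)
    s = [a - b for a, b in zip(x_new, x)]
    y = [a - b for a, b in zip(grad(x_new), g)]
    return x_new, dfp_update(H, s, y)

def booth(x):
    return (x[0] + 2 * x[1] - 7) ** 2 + (2 * x[0] + x[1] - 5) ** 2

def booth_grad(x):
    r1, r2 = x[0] + 2 * x[1] - 7, 2 * x[0] + x[1] - 5
    return [2 * r1 + 4 * r2, 4 * r1 + 2 * r2]

# Usage: a few iterations on the Booth function from the origin with H = I.
x, H = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
history = [booth(x)]
for _ in range(30):
    x, H = hoqn_m1_dfp_step(booth, booth_grad, x, H)
    history.append(booth(x))
```

Because both sub-steps are accepted only under the Armijo test, the recorded objective values in `history` are nonincreasing by construction; the other branches differ only in the formula for the correction direction and, for M_{3}, in the base point of the second step.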

4. Convergence Analysis of the HOQN Method

Theorem 2

(Local cubic error equation for the

M_{i}

DFP-HOQN variants). Let

F = \nabla f : R^{n} \to R^{n}

be sufficiently differentiable in a neighborhood of a strict local minimizer ξ, so that the Taylor expansions used in Theorem 1 hold. Assume that

F (ξ) = 0, F^{'} (ξ) = \nabla^{2} f (ξ) ≻ 0 .

Let

e^{(k)} = x^{(k)} - ξ, C_{j} = \frac{1}{j!} {[F^{'} (ξ)]}^{- 1} F^{(j)} (ξ), j \geq 2 .

As in Theorem 1, the notation

O (e^{{(k)}^{p}})

is understood in the vectorial sense; for a matrix sequence,

E^{(k)} = O (e^{(k)})

means

∥ E^{(k)} ∥ \leq C ∥ e^{(k)} ∥

.

Consider the one-update variants

M_{1} DFP

,

M_{2} DFP

, and

M_{3} DFP

, obtained from the correctors

M_{1}

,

M_{2}

, and

M_{3}

in (34), (35), and (36), respectively, by replacing

{[F^{'} (x^{(k)})]}^{- 1}

with

H^{(k)}

. Assume that the iteration is in the local full-step regime,

α_{N}^{(k)} = α_{H O}^{(k)} = 1, z^{(k)} = x^{(k)} - H^{(k)} F (x^{(k)}), ν_{k} = \frac{∥ F (z^{(k)}) ∥^{2}}{∥ F (x^{(k)}) ∥^{2}},

for all sufficiently large k. For the

M_{2}

-variant, assume

1 - 4 ν_{k} \neq 0

, which holds locally. For the

M_{3}

-variant, the corrector is understood in the

x^{(k)}

-based form

x^{(k + 1)} = x^{(k)} - H^{(k)} ((1 + 2 ν_{k}) F (x^{(k)}) + (1 + G_{1} ν_{k}) F (z^{(k)})), | G_{1} | < + \infty .

Assume the first-order local inverse-Hessian consistency condition

H^{(k)} = {[F^{'} (x^{(k)})]}^{- 1} + E^{(k)}, E^{(k)} = O (e^{(k)}) .

(37)

Define

R_{k} : = E^{(k)} F^{'} (ξ), Z_{k} : = C_{2} e^{{(k)}^{2}} - R_{k} e^{(k)},

and

ϑ_{k} : = \frac{∥ F^{'} (ξ) Z_{k} ∥^{2}}{∥ F^{'} (ξ) e^{(k)} ∥^{2}} .

(38)

Then

Z_{k} = O (e^{{(k)}^{2}}), ϑ_{k} = O (e^{{(k)}^{2}}), ν_{k} = ϑ_{k} + O (e^{{(k)}^{3}}) .

Moreover, the explicit local error recurrences are

\begin{matrix} M_{1} DFP : e^{(k + 1)} & = C_{1, k}^{(3)} + O (e^{{(k)}^{4}}), \end{matrix}

(39)

\begin{matrix} M_{2} DFP : e^{(k + 1)} & = C_{2, k}^{(3)} + O (e^{{(k)}^{4}}), \end{matrix}

(40)

\begin{matrix} M_{3} DFP : e^{(k + 1)} & = C_{3, k}^{(3)} + O (e^{{(k)}^{4}}), \end{matrix}

(41)

where the cubic contributions are explicitly given by

\begin{matrix} C_{1, k}^{(3)} & = 2 C_{2} e^{(k)} (C_{2} e^{{(k)}^{2}} - R_{k} e^{(k)}) - R_{k} (C_{2} e^{{(k)}^{2}} - R_{k} e^{(k)}) + 2 ϑ_{k} e^{(k)}, \end{matrix}

(42)

\begin{matrix} C_{2, k}^{(3)} & = 2 C_{2} e^{(k)} (C_{2} e^{{(k)}^{2}} - R_{k} e^{(k)}) - R_{k} (C_{2} e^{{(k)}^{2}} - R_{k} e^{(k)}) - 2 ϑ_{k} e^{(k)}, \end{matrix}

(43)

\begin{matrix} C_{3, k}^{(3)} & = C_{2, k}^{(3)} . \end{matrix}

(44)

Consequently, there exists

K > 0

such that

∥ e^{(k + 1)} ∥ \leq K ∥ e^{(k)} ∥^{3}

for all sufficiently large k. Hence, each

M_{i} DFP

-HOQN variant has local convergence of order at least three. If

\underset{k \to \infty}{lim inf} \frac{∥ C_{i, k}^{(3)} ∥}{∥ e^{(k)} ∥^{3}} > 0,

then the corresponding local convergence order is exactly three.

Proof.

We use the Taylor expansions established in Theorem 1. In particular,

{[F^{'} (x^{(k)})]}^{- 1} F (x^{(k)}) = e^{(k)} - C_{2} e^{{(k)}^{2}} + 2 (- C_{3} + C_{2}^{2}) e^{{(k)}^{3}} + O (e^{{(k)}^{4}}) .

Using (37),

R_{k} = E^{(k)} F^{'} (ξ)

, and

R_{k} = O (e^{(k)})

, we obtain

H^{(k)} F (x^{(k)}) = e^{(k)} - C_{2} e^{{(k)}^{2}} + 2 (- C_{3} + C_{2}^{2}) e^{{(k)}^{3}} + R_{k} e^{(k)} + R_{k} C_{2} e^{{(k)}^{2}} + O (e^{{(k)}^{4}}) .

(45)

Hence, for

e_{z}^{(k)} = z^{(k)} - ξ

,

e_{z}^{(k)} = Z_{k} + 2 (C_{3} - C_{2}^{2}) e^{{(k)}^{3}} - R_{k} C_{2} e^{{(k)}^{2}} + O (e^{{(k)}^{4}}),

(46)

and therefore

Z_{k} = O (e^{{(k)}^{2}}), e_{z}^{(k)} = O (e^{{(k)}^{2}}) .

Since

e_{z}^{(k)} = O (e^{{(k)}^{2}})

,

F (z^{(k)}) = F^{'} (ξ) e_{z}^{(k)} + O (e^{{(k)}^{4}}), {[F^{'} (x^{(k)})]}^{- 1} F^{'} (ξ) = I - 2 C_{2} e^{(k)} + O (e^{{(k)}^{2}}) .

Consequently,

e_{z}^{(k)} - H^{(k)} F (z^{(k)}) = 2 C_{2} e^{(k)} Z_{k} - R_{k} Z_{k} + O (e^{{(k)}^{4}}) .

(47)

Moreover, from (46),

F (z^{(k)}) = F^{'} (ξ) Z_{k} + O (e^{{(k)}^{3}}), F (x^{(k)}) = F^{'} (ξ) e^{(k)} + O (e^{{(k)}^{2}}),

which gives

ν_{k} = ϑ_{k} + O (e^{{(k)}^{3}}), ϑ_{k} = O (e^{{(k)}^{2}}) .

(48)

Also,

H^{(k)} F (x^{(k)}) = e^{(k)} + O (e^{{(k)}^{2}}), H^{(k)} F (z^{(k)}) = O (e^{{(k)}^{2}}) .

Thus,

ν_{k} H^{(k)} F (x^{(k)}) = ϑ_{k} e^{(k)} + O (e^{{(k)}^{4}}), ν_{k} H^{(k)} F (z^{(k)}) = O (e^{{(k)}^{4}}) .

(49)

For

M_{1} DFP

, (47) and (49) yield

e^{(k + 1)} = 2 C_{2} e^{(k)} Z_{k} - R_{k} Z_{k} + 2 ϑ_{k} e^{(k)} + O (e^{{(k)}^{4}}),

which proves (39) and (42). For

M_{2} DFP

, using

{(1 - 4 ν_{k})}^{- 1} = 1 + 4 ν_{k} + O (e^{{(k)}^{4}}),

together with (47) and (49), gives

e^{(k + 1)} = 2 C_{2} e^{(k)} Z_{k} - R_{k} Z_{k} - 2 ϑ_{k} e^{(k)} + O (e^{{(k)}^{4}}),

which proves (40) and (43).

For

M_{3} DFP

, using the

x^{(k)}

-based form and

e^{(k)} - H^{(k)} F (x^{(k)}) = e_{z}^{(k)}

, we obtain

e^{(k + 1)} = e_{z}^{(k)} - H^{(k)} F (z^{(k)}) - 2 ν_{k} H^{(k)} F (x^{(k)}) - G_{1} ν_{k} H^{(k)} F (z^{(k)}) .

The last term is

O (e^{{(k)}^{4}})

, since

| G_{1} | < + \infty

,

ν_{k} = O (e^{{(k)}^{2}})

, and

H^{(k)} F (z^{(k)}) = O (e^{{(k)}^{2}})

.

Therefore,

e^{(k + 1)} = 2 C_{2} e^{(k)} Z_{k} - R_{k} Z_{k} - 2 ϑ_{k} e^{(k)} + O (e^{{(k)}^{4}}),

which proves (41) and (44).

Finally, since

R_{k} = O (e^{(k)})

,

Z_{k} = O (e^{{(k)}^{2}})

, and

ϑ_{k} = O (e^{{(k)}^{2}})

, each

C_{i, k}^{(3)}

is

O (e^{{(k)}^{3}})

.

Hence,

∥ e^{(k + 1)} ∥ \leq K ∥ e^{(k)} ∥^{3}

for some

K > 0

and all sufficiently large k. If the normalized cubic contribution does not vanish asymptotically, the order is exactly three; otherwise, it is at least three. □
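As a hedged numerical illustration of the cubic error recurrence, the following sketch applies one full M_{1} step in the scalar case with an exact inverse derivative, so that E^{(k)} = 0 and R_{k} = 0. The test gradient F(x) = exp(x) - 1 (root ξ = 0, C_{2} = 1/2) and the function names are assumptions chosen for illustration only. In this scalar setting (42) reduces to 4 C_{2}^{2} e^{3}, so the ratio |e^{(k+1)}| / |e^{(k)}|^{3} is expected to approach 1 as the initial error shrinks.

```python
import math

def m1_full_step(F, dF, x):
    """One full-step M1 corrector with H = 1 / F'(x) (scalar case, E = 0)."""
    h = 1.0 / dF(x)
    fx = F(x)
    z = x - h * fx                      # Newton-type predictor
    fz = F(z)
    nu = (fz * fz) / (fx * fx)          # nu_k = |F(z)|^2 / |F(x)|^2
    return z - h * ((1 + nu) * fz - 2 * nu * fx)

def F(x):
    return math.expm1(x)                # gradient of f(x) = exp(x) - x

def dF(x):
    return math.exp(x)

def cubic_ratio(e0):
    """Return |e_next| / |e0|^3 for a start at distance e0 from the root 0."""
    return abs(m1_full_step(F, dF, e0)) / abs(e0) ** 3

ratios = [cubic_ratio(e0) for e0 in (0.1, 0.01)]
```

A bounded ratio for decreasing e0 is consistent with order three; it does not by itself prove the order, which is what the theorem establishes.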

Remark 2

(Interpretation of the cubic error equation). The cubic contributions of

M_{2} DFP

and

M_{3} DFP

coincide:

C_{2, k}^{(3)} = C_{3, k}^{(3)} .

This is due to the common dominant term

- 2 ν_{k} H^{(k)} F (x^{(k)})

; their first method-dependent difference appears at fourth order. Indeed, if

A_{k} : = H^{(k)} F (x^{(k)}), B_{k} : = H^{(k)} F (z^{(k)}),

then

e_{M_{2}}^{(k + 1)} - e_{M_{3}}^{(k + 1)} = (G_{1} - 4) ν_{k} B_{k} + O (e^{{(k)}^{5}}) = (G_{1} - 4) ϑ_{k} Z_{k} + O (e^{{(k)}^{5}}) .

The quasi-Newton perturbation enters the dominant cubic term through

R_{k} = E^{(k)} F^{'} (ξ)

. If locally

∥ E^{(k)} ∥ \leq η ∥ e^{(k)} ∥, ρ : = η ∥ F^{'} (ξ) ∥,

then

∥ R_{k} ∥ \leq ρ ∥ e^{(k)} ∥, ∥ Z_{k} ∥ \leq a ∥ e^{(k)} ∥^{2}, a : = ∥ C_{2} ∥ + ρ .

With

κ_{*} : = ∥ F^{'} (ξ) ∥ ∥ {[F^{'} (ξ)]}^{- 1} ∥,

we obtain

ϑ_{k} \leq κ_{*}^{2} a^{2} {∥ e^{(k)} ∥}^{2} .

Thus, one may take

∥ e^{(k + 1)} ∥ \leq K_{H O Q N} ∥ e^{(k)} ∥^{3} + O (∥ e^{(k)} ∥^{4}), K_{H O Q N} = 2 ∥ C_{2} ∥ a + ρ a + 2 κ_{*}^{2} a^{2} .

This bound shows that the cubic constant depends on the local curvature coefficient

C_{2}

, the conditioning of

F^{'} (ξ)

, and the first-order inverse-Hessian perturbation.
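The bound above is straightforward to evaluate once estimates of its ingredients are available. The helper below is a minimal sketch; the function name and argument names are hypothetical, and the inputs (norm of C_{2}, perturbation constant η, norm of F'(ξ), condition number κ_{*}) are assumed to be supplied by the user.

```python
def hoqn_cubic_constant(c2_norm, eta, fprime_norm, kappa):
    """Evaluate K_HOQN = 2*||C2||*a + rho*a + 2*kappa^2*a^2.

    rho = eta * ||F'(xi)|| and a = ||C2|| + rho, as defined in Remark 2.
    """
    rho = eta * fprime_norm
    a = c2_norm + rho
    return 2 * c2_norm * a + rho * a + 2 * kappa ** 2 * a ** 2

# Exact inverse Hessian (eta = 0), ||C2|| = 1/2, perfectly conditioned problem.
k_exact = hoqn_cubic_constant(0.5, 0.0, 1.0, 1.0)
# A perturbed inverse Hessian on a worse-conditioned problem.
k_perturbed = hoqn_cubic_constant(0.5, 0.1, 2.0, 3.0)
```

Comparing the two calls shows how the κ_{*}^{2} a^{2} term dominates the constant as the conditioning deteriorates.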

For comparison, a pure quasi-Newton step

x^{(k + 1)} = x^{(k)} - H^{(k)} F (x^{(k)})

satisfies, under (37),

e^{(k + 1)} = C_{2} e^{{(k)}^{2}} - R_{k} e^{(k)} + O (e^{{(k)}^{3}}) .

Hence, even under

H^{(k)} = {[F^{'} (x^{(k)})]}^{- 1} + O (e^{(k)})

, a pure quasi-Newton step is at most quadratic in this expansion. Under the Dennis–Moré condition alone, one obtains superlinear convergence, but no fixed cubic order is guaranteed in general. The HOQN correction cancels the quadratic residual and shifts the leading term to order three. Although exact high-order correctors may reach fourth order, the quasi-Newton approximation and the line-search setting reduce the effective HOQN local order to cubic, which still improves on the usual superlinear behavior of classical quasi-Newton schemes.

The next lemma shows that the cubic local order obtained in Theorem 2 is preserved by the two-update BFGS–DFP strategy.

Lemma 1

(Invariance of cubic local convergence under BFGS–DFP double updating). Assume the hypotheses of Theorem 2. Let

F = \nabla f

, and let

H_{N}^{(k)}

denote the inverse Hessian approximation at the beginning of the k-th HOQN iteration. We consider the complete two-update variants

B M_{1} D

,

B M_{2} D

, and

B M_{3} D

, consisting of a Newton-type predictor, an intermediate BFGS update, a high-order corrector, and a final DFP update. In the local full-step regime,

z^{(k)} = x^{(k)} - H_{N}^{(k)} F (x^{(k)}), ν_{k} = \frac{{∥F (z^{(k)})∥}^{2}}{{∥F (x^{(k)})∥}^{2}}

. After this predictor, a BFGS update (2) is applied with

s_{N}^{(k)} = z^{(k)} - x^{(k)}, y_{N}^{(k)} = F (z^{(k)}) - F (x^{(k)}),

producing

{\hat{H}}^{(k)}

. The high-order correction is then computed using

{\hat{H}}^{(k)}

. In the local full-step regime, the correction stages of the complete two-update variants are represented by

\begin{matrix} B M_{1} D : x^{(k + 1)} & = z^{(k)} - {\hat{H}}^{(k)} ((1 + ν_{k}) F (z^{(k)}) - 2 ν_{k} F (x^{(k)})), \end{matrix}

(50)

\begin{matrix} B M_{2} D : x^{(k + 1)} & = z^{(k)} - \frac{1}{1 - 4 ν_{k}} {\hat{H}}^{(k)} (F (z^{(k)}) + 2 ν_{k} F (x^{(k)})), \end{matrix}

(51)

\begin{matrix} B M_{3} D : x^{(k + 1)} & = x^{(k)} - {\hat{H}}^{(k)} ((1 + 2 ν_{k}) F (x^{(k)}) + (1 + G_{1} ν_{k}) F (z^{(k)})), | G_{1} | < + \infty . \end{matrix}

(52)

Here,

B M_{i} D

denotes the complete two-update hybrid iteration; the displayed formulas are its local full-step correction maps. For

B M_{2} D

,

1 - 4 ν_{k} \neq 0

locally.

After

x^{(k + 1)}

is computed, a DFP update (1) is applied using

{\hat{H}}^{(k)}

and

s_{D}^{(k)} = x^{(k + 1)} - x^{(k)}, y_{D}^{(k)} = F (x^{(k + 1)}) - F (x^{(k)}),

yielding

H_{N}^{(k + 1)}

. Assume that the BFGS and DFP updates are well defined, that their curvature conditions hold, and that

H_{N}^{(k)} = {[F^{'} (x^{(k)})]}^{- 1} + E_{N, k}, {\hat{H}}^{(k)} = {[F^{'} (x^{(k)})]}^{- 1} + {\hat{E}}_{k},

with

E_{N, k} = O (e^{(k)}), {\hat{E}}_{k} = O (e^{(k)}) .

Assume also that the final DFP update propagates first-order local consistency:

H_{N}^{(k + 1)} = {[F^{'} (x^{(k + 1)})]}^{- 1} + O (e^{(k + 1)}) .

Define

R_{N, k} : = E_{N, k} F^{'} (ξ), {\hat{R}}_{k} : = {\hat{E}}_{k} F^{'} (ξ), Z_{N, k} : = C_{2} e^{{(k)}^{2}} - R_{N, k} e^{(k)},

and

ϑ_{N, k} : = \frac{∥ F^{'} (ξ) Z_{N, k} ∥^{2}}{∥ F^{'} (ξ) e^{(k)} ∥^{2}} .

(53)

Then

Z_{N, k} = O (e^{{(k)}^{2}}), ϑ_{N, k} = O (e^{{(k)}^{2}}), ν_{k} = ϑ_{N, k} + O (e^{{(k)}^{3}}) .

Moreover,

\begin{matrix} B M_{1} D : e^{(k + 1)} & = {\hat{C}}_{1, k}^{(3)} + O (e^{{(k)}^{4}}), \end{matrix}

(54)

\begin{matrix} B M_{2} D : e^{(k + 1)} & = {\hat{C}}_{2, k}^{(3)} + O (e^{{(k)}^{4}}), \end{matrix}

(55)

\begin{matrix} B M_{3} D : e^{(k + 1)} & = Δ_{k} + {\hat{C}}_{2, k}^{(3)} + O (e^{{(k)}^{4}}), \end{matrix}

(56)

where

Δ_{k} : = (H_{N}^{(k)} - {\hat{H}}^{(k)}) F (x^{(k)})

and

\begin{matrix} {\hat{C}}_{1, k}^{(3)} & = 2 C_{2} e^{(k)} Z_{N, k} - {\hat{R}}_{k} Z_{N, k} + 2 ϑ_{N, k} e^{(k)}, \end{matrix}

(57)

\begin{matrix} {\hat{C}}_{2, k}^{(3)} & = 2 C_{2} e^{(k)} Z_{N, k} - {\hat{R}}_{k} Z_{N, k} - 2 ϑ_{N, k} e^{(k)} . \end{matrix}

(58)

Consequently,

B M_{1} D

and

B M_{2} D

preserve the cubic local order. For

B M_{3} D

, the same conclusion holds provided that

Δ_{k} = (H_{N}^{(k)} - {\hat{H}}^{(k)}) F (x^{(k)}) = O (e^{{(k)}^{3}}) .

Equivalently, at this order,

(R_{N, k} - {\hat{R}}_{k}) e^{(k)} = O (e^{{(k)}^{3}}) .

In particular,

{\hat{H}}^{(k)} - H_{N}^{(k)} = O (e^{{(k)}^{2}})

implies the compatibility condition above. Under this condition, all three complete two-update variants preserve the cubic local convergence order.

Proof.

Let

e = e^{(k)}, e_{z}^{(k)} = z^{(k)} - ξ .

Repeating the derivation of (46) for the predictor matrix

H_{N}^{(k)}

gives

e_{z}^{(k)} = Z_{N, k} + 2 (C_{3} - C_{2}^{2}) e^{3} - R_{N, k} C_{2} e^{2} + O (e^{4}) .

(59)

Thus,

Z_{N, k} = O (e^{2})

and

e_{z}^{(k)} = O (e^{2})

.

Next, repeating the cancellation argument leading to (47), now using

{\hat{H}}^{(k)}

, yields

e_{z}^{(k)} - {\hat{H}}^{(k)} F (z^{(k)}) = 2 C_{2} e Z_{N, k} - {\hat{R}}_{k} Z_{N, k} + O (e^{4}) .

(60)

Similarly, by the arguments leading to (48) and (49),

ν_{k} = ϑ_{N, k} + O (e^{3}), ϑ_{N, k} = O (e^{2}),

and

ν_{k} {\hat{H}}^{(k)} F (x^{(k)}) = ϑ_{N, k} e + O (e^{4}), ν_{k} {\hat{H}}^{(k)} F (z^{(k)}) = O (e^{4}) .

(61)

For

B M_{1} D

, substituting (60) and (61) into (50) gives

e^{(k + 1)} = 2 C_{2} e Z_{N, k} - {\hat{R}}_{k} Z_{N, k} + 2 ϑ_{N, k} e + O (e^{4}),

which proves (54). For

B M_{2} D

, using

{(1 - 4 ν_{k})}^{- 1} = 1 + 4 ν_{k} + O (e^{4})

in (51) gives

e^{(k + 1)} = 2 C_{2} e Z_{N, k} - {\hat{R}}_{k} Z_{N, k} - 2 ϑ_{N, k} e + O (e^{4}),

which proves (55).

For

B M_{3} D

, since the predictor was computed with

H_{N}^{(k)}

,

e - {\hat{H}}^{(k)} F (x^{(k)}) = e_{z}^{(k)} + (H_{N}^{(k)} - {\hat{H}}^{(k)}) F (x^{(k)}) = e_{z}^{(k)} + Δ_{k} .

Using this identity in (52), together with (60) and (61), and using

G_{1} ν_{k} {\hat{H}}^{(k)} F (z^{(k)}) = O (e^{4})

, we obtain

e^{(k + 1)} = Δ_{k} + 2 C_{2} e Z_{N, k} - {\hat{R}}_{k} Z_{N, k} - 2 ϑ_{N, k} e + O (e^{4}),

which proves (56).

Since

Z_{N, k} = O (e^{2}), R_{N, k} = O (e), {\hat{R}}_{k} = O (e), ϑ_{N, k} = O (e^{2}),

the cubic order of

B M_{1} D

and

B M_{2} D

follows. For

B M_{3} D

, the same conclusion holds if

Δ_{k} = O (e^{3})

. Moreover,

F (x^{(k)}) = F^{'} (ξ) e + O (e^{2})

implies

Δ_{k} = (E_{N, k} - {\hat{E}}_{k}) F (x^{(k)}) = (R_{N, k} - {\hat{R}}_{k}) e + O (e^{3}),

which gives the stated equivalence. The sufficient condition

{\hat{H}}^{(k)} - H_{N}^{(k)} = O (e^{{(k)}^{2}})

immediately implies

Δ_{k} = O (e^{{(k)}^{3}})

.

Finally, the DFP update is applied only after

x^{(k + 1)}

has been computed, so it does not enter the current local error equation. Its role is to generate

H_{N}^{(k + 1)}

, and the assumed first-order consistency of this matrix supplies the induction hypothesis for the next iteration. Therefore, the BFGS–DFP double updating strategy preserves the cubic local convergence order under the stated assumptions. □
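The double-updating strategy can be sketched as two successive inverse-Hessian updates. Formulas (1) and (2) are not reproduced in this section, so the code below uses the textbook DFP and BFGS inverse updates and assumes they coincide with them; the function names and the quadratic test data are hypothetical. Both updates satisfy the secant equation H_{+} y = s, which the usage lines verify.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(H, v):
    return [dot(row, v) for row in H]

def dfp_inverse_update(H, s, y):
    """H+ = H + s s^T / (s^T y) - (H y)(H y)^T / (y^T H y), H symmetric."""
    sy, Hy = dot(s, y), matvec(H, y)
    yHy = dot(y, Hy)
    n = len(s)
    return [[H[i][j] + s[i] * s[j] / sy - Hy[i] * Hy[j] / yHy
             for j in range(n)] for i in range(n)]

def bfgs_inverse_update(H, s, y):
    """H+ = (I - r s y^T) H (I - r y s^T) + r s s^T with r = 1 / (s^T y)."""
    r, Hy = 1.0 / dot(s, y), matvec(H, y)
    yHy = dot(y, Hy)
    n = len(s)
    return [[H[i][j] - r * (s[i] * Hy[j] + Hy[i] * s[j])
             + (r * r * yHy + r) * s[i] * s[j]
             for j in range(n)] for i in range(n)]

# Usage on a quadratic with Hessian A, where y = A s holds exactly.
A = [[10.0, 8.0], [8.0, 10.0]]
H_N = [[1.0, 0.0], [0.0, 1.0]]
s_N = [1.0, 0.0]
y_N = matvec(A, s_N)                          # predictor pair (s_N, y_N)
H_hat = bfgs_inverse_update(H_N, s_N, y_N)    # intermediate BFGS update
s_D = [0.5, 1.0]
y_D = matvec(A, s_D)                          # full-iteration pair (s_D, y_D)
H_next = dfp_inverse_update(H_hat, s_D, y_D)  # final DFP update
```

The sketch omits the curvature safeguards that a practical implementation would need when s^T y is not safely positive.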

To continue the formal analysis of the convergence properties of the HOQN method, we adopt the following notation and assumptions over the bounded level set

L : = \{x \in R^{n} : f (x) \leq f (x^{(0)})\} .

Throughout this work, the symbols ⪯, ⪰, and

≻ 0

are understood in the Loewner sense for symmetric matrices; in particular,

H^{(k)} ⪰ λ I

means that

H^{(k)} - λ I

is positive semidefinite, whereas

H^{(k)} ≻ 0

means that

H^{(k)}

is symmetric positive definite.

(B1): The objective function $f : R^{n} \to R$ is twice continuously differentiable on an open neighborhood of $L$ , i.e., $f \in C^{2} (R^{n})$ .
(B2): There exist constants $0 < m \leq M$ such that $m I ⪯ \nabla^{2} f (x) ⪯ M I, \forall x \in L$ . In particular, $\nabla f$ is Lipschitz continuous on $L$ .
(B3): The initial matrix $H^{(0)}$ is symmetric positive definite, and the sequence $\{H^{(k)}\}$ generated by the quasi-Newton updates satisfies $H^{(k)} ⪰ λ I, \forall k \geq 0$ , for some constant $λ > 0$ .
(B4): The step sizes $α_{N}^{(k)}$ and $α_{H O}^{(k)}$ associated with the Newton and higher–order directions, respectively, are obtained by inexact line searches satisfying the strong Wolfe conditions with constants $0 < c_{1} < c_{2} < 1$ .

Lemma 2

(Strong Descent Condition). Let

\underset{̲}{λ} : = \inf_{k \geq 0} λ_{\min} (H^{(k)}) > 0,

where

λ_{\min} (H^{(k)})

denotes the smallest eigenvalue of the symmetric positive–definite (SPD) matrix

H^{(k)}

. Under assumptions (B1)–(B4), the Newton direction

d_{N}^{(k)}

and the high-order correction

d_{H O}^{(k)}

applied at

z^{(k)}

in the k-th iteration of Algorithm 1 satisfy

\begin{matrix} \nabla f {(x^{(k)})}^{T} d_{N}^{(k)} & \leq - \underset{̲}{λ} {∥\nabla f (x^{(k)})∥}^{2}, \end{matrix}

(62)

\begin{matrix} \nabla f {(z^{(k)})}^{T} d_{H O}^{(k)} & \leq - \underset{̲}{λ} {∥\nabla f (z^{(k)})∥}^{2} . \end{matrix}

(63)

Consequently, the step sizes

α_{N}^{(k)}, α_{H O}^{(k)} > 0

returned by two independent strong–Wolfe line searches yield the one–step decrease

f (x^{(k + 1)}) \leq f (x^{(k)}) - c_{1} \underset{̲}{λ} (α_{N}^{(k)} {∥\nabla f (x^{(k)})∥}^{2} + α_{H O}^{(k)} {∥\nabla f (z^{(k)})∥}^{2}) .

(64)

Proof.

Let

g^{(k)} : = \nabla f (x^{(k)})

and

g_{z}^{(k)} : = \nabla f (z^{(k)})

with

z^{(k)} : = x^{(k)} + α_{N}^{(k)} d_{N}^{(k)}

. Let

H^{(k)} ≻ 0

be the (inverse–Hessian) approximation at

x^{(k)}

and assume

H^{(k)} ⪰ \underset{̲}{λ} I

with

\underset{̲}{λ} : = \inf_{k \geq 0} λ_{\min} (H^{(k)}) > 0

. By the Rayleigh inequality,

{(g^{(k)})}^{T} H^{(k)} g^{(k)} \geq λ_{\min} (H^{(k)}) ∥ g^{(k)} ∥^{2} \geq \underset{̲}{λ} {∥ g^{(k)} ∥}^{2} .

Defining the Newton direction by

d_{N}^{(k)} : = - H^{(k)} g^{(k)}

, we obtain

{(g^{(k)})}^{T} d_{N}^{(k)} = - {(g^{(k)})}^{T} H^{(k)} g^{(k)} \leq - \underset{̲}{λ} {∥ g^{(k)} ∥}^{2},

which proves (62).

We consider the three high–order directions used in Algorithm 1 and show, in each case, that

g_{z}^{T} d_{H O}^{(k)} \leq - \underset{̲}{λ} {∥ g_{z} ∥}^{2} .

To analyze the uniform spectral bound associated with the

M_{1}

,

M_{2}

, and

M_{3}

directions, we proceed separately for each high-order correction.

Since

ν_{k} = \frac{{∥g_{z}^{(k)}∥}^{2}}{{∥g^{(k)}∥}^{2}}

and the iteration is considered before termination, it follows that

ν_{k} > 0

. Moreover, since

H^{(k)} ⪰ \underset{̲}{λ} I

, we have

{(g_{z}^{(k)})}^{T} H^{(k)} g_{z}^{(k)} \geq \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2} .

In addition, we assume that, in the local regime,

{(g_{z}^{(k)})}^{T} H^{(k)} g^{(k)} \geq 0 .

(i): $M_{1}$ (HOChun): For the Chun-type direction, define

$d_{H O C h u n}^{(k)} : = - H^{(k)} ((1 + ν_{k}) g_{z}^{(k)} - 2 ν_{k} g^{(k)}) .$

Then,

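$\begin{matrix} - {(g_{z}^{(k)})}^{T} d_{H O C h u n}^{(k)} & = (1 + ν_{k}) {(g_{z}^{(k)})}^{T} H^{(k)} g_{z}^{(k)} - 2 ν_{k} {(g_{z}^{(k)})}^{T} H^{(k)} g^{(k)} \\ & \geq (1 + ν_{k}) \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2} - 2 ν_{k} {(g_{z}^{(k)})}^{T} H^{(k)} g^{(k)} \\ & \geq \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2}, \end{matrix}$

where the last inequality requires $2 {(g_{z}^{(k)})}^{T} H^{(k)} g^{(k)} \leq \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2}$ , so that the cross term does not exceed the surplus $ν_{k} \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2}$ . Under this condition,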

${(g_{z}^{(k)})}^{T} d_{H O C h u n}^{(k)} \leq - \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2} .$
(ii): $M_{2}$ (HOOS): For the Ostrowski-type direction, define

$d_{H O O S}^{(k)} : = - \frac{1}{1 - 4 ν_{k}} H^{(k)} (g_{z}^{(k)} + 2 ν_{k} g^{(k)}) .$

Then,

$\begin{matrix} - {(g_{z}^{(k)})}^{T} d_{H O O S}^{(k)} & = \frac{1}{1 - 4 ν_{k}} {(g_{z}^{(k)})}^{T} H^{(k)} g_{z}^{(k)} + \frac{2 ν_{k}}{1 - 4 ν_{k}} {(g_{z}^{(k)})}^{T} H^{(k)} g^{(k)} \\ \geq \frac{\underset{̲}{λ}}{1 - 4 ν_{k}} {∥ g_{z}^{(k)} ∥}^{2}, \end{matrix}$

provided that $1 - 4 ν_{k} > 0$ . If the factor $\frac{1}{1 - 4 ν_{k}}$ is absorbed into the line-search step size, we obtain

${(g_{z}^{(k)})}^{T} d_{H O O S}^{(k)} \leq - \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2} .$
(iii): $M_{3}$ (HOT): For the Traub-type direction, define

$d_{H O T}^{(k)} : = - H^{(k)} ((1 + 2 ν_{k}) g^{(k)} + (1 + G_{1} ν_{k}) g_{z}^{(k)}), G_{1} \geq 0 .$

Then,

$\begin{matrix} - {(g_{z}^{(k)})}^{T} d_{H O T}^{(k)} & = (1 + 2 ν_{k}) {(g_{z}^{(k)})}^{T} H^{(k)} g^{(k)} + (1 + G_{1} ν_{k}) {(g_{z}^{(k)})}^{T} H^{(k)} g_{z}^{(k)} \\ & \geq (1 + 2 ν_{k}) {(g_{z}^{(k)})}^{T} H^{(k)} g^{(k)} + (1 + G_{1} ν_{k}) \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2} \\ & \geq \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2} . \end{matrix}$

Therefore,

${(g_{z}^{(k)})}^{T} d_{H O T}^{(k)} \leq - \underset{̲}{λ} {∥ g_{z}^{(k)} ∥}^{2} .$

In all three cases, (63) holds. Applying the first Wolfe condition at

x^{(k)}

, we obtain

f (x^{(k)} + α_{N}^{(k)} d_{N}^{(k)}) \leq f (x^{(k)}) + c_{1} α_{N}^{(k)} {(g^{(k)})}^{T} d_{N}^{(k)} .

Using (62), it follows that

f (z^{(k)}) \leq f (x^{(k)}) - c_{1} \underset{̲}{λ} α_{N}^{(k)} {∥ g^{(k)} ∥}^{2} .

(65)

Now apply the first Wolfe condition at

z^{(k)}

for the high-order step:

f (z^{(k)} + α_{H O}^{(k)} d_{H O}^{(k)}) \leq f (z^{(k)}) + c_{1} α_{H O}^{(k)} {(g_{z}^{(k)})}^{T} d_{H O}^{(k)} .

Substituting (63) and then (65), we obtain

f (x^{(k + 1)}) \leq f (x^{(k)}) - c_{1} \underset{̲}{λ} (α_{N}^{(k)} ∥ g^{(k)} ∥^{2} + α_{H O}^{(k)} {∥ g_{z}^{(k)} ∥}^{2}),

which is exactly (64). The proof is complete. □
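The Rayleigh-quotient argument behind (62) can be checked numerically for a 2 × 2 symmetric positive definite matrix. The sketch below is illustrative only: the matrix, the gradient vectors, and the helper names are assumptions, and the smallest eigenvalue is computed with the closed form for symmetric 2 × 2 matrices.

```python
import math

def lambda_min_2x2(H):
    """Smallest eigenvalue of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    a, b, c = H[0][0], H[0][1], H[1][1]
    return 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)

def newton_descent_sides(H, g):
    """Return (g^T d_N, -lambda_min(H) * ||g||^2) for d_N = -H g."""
    d = [-(H[0][0] * g[0] + H[0][1] * g[1]),
         -(H[1][0] * g[0] + H[1][1] * g[1])]
    lhs = g[0] * d[0] + g[1] * d[1]
    rhs = -lambda_min_2x2(H) * (g[0] ** 2 + g[1] ** 2)
    return lhs, rhs

H = [[2.0, 1.0], [1.0, 3.0]]            # symmetric positive definite
checks = [newton_descent_sides(H, g)
          for g in ([1.0, -2.0], [3.0, 4.0], [-1.0, 0.5])]
```

For every gradient in the list, the first component (the directional derivative) is no larger than the second (the bound in (62)).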

Lemma 3

(Asymptotic gradient nulling). Let

\{x^{(k)}\}_{k \geq 0}

be the sequence produced by Algorithm 1. Suppose (B1)–(B4) hold, and that the search directions generated by the algorithm satisfy the uniform-descent property of Lemma 2 and the bounded-direction property (70). Then

\lim_{k \to \infty} ∥\nabla f (x^{(k)})∥ = 0 .

Proof.

(i): By Lemma 2,

$f (x^{(k + 1)}) \leq f (x^{(k)}) - \underset{̲}{λ} (c_{1} α_{N}^{(k)} {∥ g^{(k)} ∥}^{2} + {\bar{c}}_{1} α_{H O}^{(k)} {∥ g_{z}^{(k)} ∥}^{2}),$

where ${\bar{c}}_{1}$ denotes the sufficient-decrease constant of the high-order line search; taking ${\bar{c}}_{1} = c_{1}$ recovers (64).
(ii): Using (B2), f attains a minimum on the set $L = \{x \in R^{n} : f (x) \leq f (x^{(0)})\}$ . Let $f_{\inf} : = \inf_{x \in L} f (x)$ . Then $f_{\inf} \leq f (x)$ for all $x \in L$ . Hence, for the iterates $\{x^{(k)}\}$ we obtain

$\begin{matrix} f (x^{(1)}) & \leq f (x^{(0)}) - \underset{̲}{λ} (c_{1} α_{N}^{(0)} {∥ g^{(0)} ∥}^{2} + {\bar{c}}_{1} α_{H O}^{(0)} {∥ g_{z}^{(0)} ∥}^{2}), \\ f (x^{(2)}) & \leq f (x^{(1)}) - \underset{̲}{λ} (c_{1} α_{N}^{(1)} {∥ g^{(1)} ∥}^{2} + {\bar{c}}_{1} α_{H O}^{(1)} {∥ g_{z}^{(1)} ∥}^{2}) \\ & \leq f (x^{(0)}) - \underset{̲}{λ} \sum_{j = 0}^{1} (c_{1} α_{N}^{(j)} {∥ g^{(j)} ∥}^{2} + {\bar{c}}_{1} α_{H O}^{(j)} {∥ g_{z}^{(j)} ∥}^{2}), \\ f (x^{(3)}) & \leq f (x^{(2)}) - \underset{̲}{λ} (c_{1} α_{N}^{(2)} {∥ g^{(2)} ∥}^{2} + {\bar{c}}_{1} α_{H O}^{(2)} {∥ g_{z}^{(2)} ∥}^{2}) \\ & \leq f (x^{(0)}) - \underset{̲}{λ} \sum_{j = 0}^{2} (c_{1} α_{N}^{(j)} {∥ g^{(j)} ∥}^{2} + {\bar{c}}_{1} α_{H O}^{(j)} {∥ g_{z}^{(j)} ∥}^{2}), \end{matrix}$

(66)

and, in general,

$f (x^{(k + 1)}) \leq f (x^{(0)}) - \underset{̲}{λ} \sum_{j = 0}^{k} (c_{1} α_{N}^{(j)} {∥ g^{(j)} ∥}^{2} + {\bar{c}}_{1} α_{H O}^{(j)} {∥ g_{z}^{(j)} ∥}^{2}) .$

Letting $k \to \infty$ yields

$f_{\inf} \leq f (x^{(0)}) - \underset{̲}{λ} (c_{1} \sum_{j = 0}^{\infty} α_{N}^{(j)} {∥ g^{(j)} ∥}^{2} + {\bar{c}}_{1} \sum_{j = 0}^{\infty} α_{H O}^{(j)} {∥ g_{z}^{(j)} ∥}^{2}),$

hence both series

$\sum_{j = 0}^{\infty} α_{N}^{(j)} {∥ g^{(j)} ∥}^{2} \quad \text{and} \quad \sum_{j = 0}^{\infty} α_{H O}^{(j)} {∥ g_{z}^{(j)} ∥}^{2}$

are convergent.
(iii): Now, let us prove that the step sizes $α_{N}^{(k)}$ and $α_{H O}^{(k)}$ are bounded away from zero whenever the corresponding gradients remain bounded away from zero. Let $ϕ (α) : = f (x + α d)$ , so

$ϕ^{'} (α) = \nabla f {(x + α d)}^{T} d \quad \text{and} \quad ϕ^{'} (0) = \nabla f {(x)}^{T} d = g^{T} d .$

Assume $\nabla f$ is L-Lipschitz, that is,

$∥ \nabla f (u) - \nabla f (v) ∥ \leq L ∥ u - v ∥, \forall u, v .$

Taking $u = x + α d$ and $v = x$ gives

$∥ \nabla f (x + α d) - \nabla f (x) ∥ \leq L ∥ (x + α d) - x ∥ = L α ∥ d ∥ .$

Then, for any $α > 0$ , the Cauchy–Schwarz inequality yields

$\begin{matrix} ϕ^{'} (α) - ϕ^{'} (0) & = {(\nabla f (x + α d) - \nabla f (x))}^{T} d \\ & \leq ∥ \nabla f (x + α d) - \nabla f (x) ∥ ∥ d ∥ \\ & \leq L α {∥ d ∥}^{2} . \end{matrix}$

(67)

Hence,

$ϕ^{'} (α) \leq ϕ^{'} (0) + L α {∥ d ∥}^{2} .$

(68)

Let $α$ satisfy the strong Wolfe curvature condition

$| ϕ^{'} (α) | \leq c_{2} | ϕ^{'} (0) |, 0 < c_{2} < 1 .$

Then

$ϕ^{'} (α) \geq - | ϕ^{'} (α) | \geq - c_{2} | ϕ^{'} (0) | .$

Combining with (68), we obtain

$- c_{2} | ϕ^{'} (0) | \leq ϕ^{'} (0) + L α {∥ d ∥}^{2} .$

Since d is a descent direction, $ϕ^{'} (0) = g^{T} d < 0$ and thus $ϕ^{'} (0) = - | ϕ^{'} (0) |$ . Replacing, we get

$(1 - c_{2}) | ϕ^{'} (0) | \leq L α {∥ d ∥}^{2},$

that is,

$α \geq \frac{(1 - c_{2}) | ϕ^{'} (0) |}{L {∥ d ∥}^{2}} = \frac{(1 - c_{2}) | g^{T} d |}{L {∥ d ∥}^{2}} .$

(69)

Assume there exist constants $\underset{̲}{λ} > 0$ and $C > 0$ such that

$| g^{T} d | \geq \underset{̲}{λ} {∥ g ∥}^{2}, ∥ d ∥ \leq C ∥ g ∥ .$

(70)

Then ${∥ d ∥}^{2} \leq C^{2} {∥ g ∥}^{2}$ and, substituting (70) into (69), we obtain

$α \geq \frac{(1 - c_{2}) \underset{̲}{λ} {∥ g ∥}^{2}}{L C^{2} {∥ g ∥}^{2}} = \frac{(1 - c_{2}) \underset{̲}{λ}}{L C^{2}} = α_{\min} > 0 .$

Hence, if $∥ g^{(k)} ∥ \geq ε$ , then $α_{N}^{(k)} \geq α_{\min}$ ; similarly, if $∥ g (z^{(k)}) ∥ \geq ε$ , then $α_{H O}^{(k)} \geq α_{\min}$ .

Assume by contradiction that

\limsup_{k \to \infty} ∥ g^{(k)} ∥ = η > 0

. Then, for any

ε > 0

, there exist infinitely many indices k such that

∥ g^{(k)} ∥ \geq η - ε

; choosing

ε = η / 2

yields a subsequence

\{k_{ℓ}\}

satisfying

∥ g^{(k_{ℓ})} ∥ \geq η / 2, \forall ℓ .

By (iii),

α_{N}^{(k_{ℓ})} \geq α_{\min}

. Therefore,

α_{N}^{(k_{ℓ})} {∥ g^{(k_{ℓ})} ∥}^{2} \geq α_{\min} {(\frac{η}{2})}^{2}, \forall ℓ .

Summing over ℓ yields

\sum_{ℓ = 1}^{\infty} α_{N}^{(k_{ℓ})} {∥ g^{(k_{ℓ})} ∥}^{2} \geq \sum_{ℓ = 1}^{\infty} α_{\min} {(\frac{η}{2})}^{2} = \infty .

On the other hand, by (ii) we have

\sum_{k = 0}^{\infty} α_{N}^{(k)} {∥ g^{(k)} ∥}^{2} < \infty,

which contradicts the divergence of the subseries above. Thus the limit superior must vanish. The same argument applied to the second series in (ii) handles the gradients at the intermediate points. Hence,

\lim_{k \to \infty} ∥ g^{(k)} ∥ = \lim_{k \to \infty} ∥ g (z^{(k)}) ∥ = 0 .

□
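The two ingredients of the proof, the strong Wolfe test and the step-size lower bound (69), can be written as small helpers. This is a sketch with hypothetical function names; the one-dimensional quadratic used in the usage lines, ϕ(α) = 2 (1 - α)^2 with Lipschitz constant L = 4 and ∥d∥ = 1, is an assumption chosen so that the values can be checked by hand.

```python
def satisfies_strong_wolfe(phi0, dphi0, phi_a, dphi_a, alpha, c1=1e-4, c2=0.9):
    """Check sufficient decrease and the strong curvature condition at alpha."""
    armijo = phi_a <= phi0 + c1 * alpha * dphi0
    curvature = abs(dphi_a) <= c2 * abs(dphi0)
    return armijo and curvature

def step_lower_bound(dphi0, d_norm_sq, lipschitz, c2=0.9):
    """Lower bound (69): alpha >= (1 - c2) |phi'(0)| / (L ||d||^2)."""
    return (1 - c2) * abs(dphi0) / (lipschitz * d_norm_sq)

def phi(alpha):
    return 2.0 * (1.0 - alpha) ** 2

def dphi(alpha):
    return -4.0 * (1.0 - alpha)

bound = step_lower_bound(dphi(0.0), 1.0, 4.0)
ok_full = satisfies_strong_wolfe(phi(0.0), dphi(0.0), phi(1.0), dphi(1.0), 1.0)
ok_tiny = satisfies_strong_wolfe(phi(0.0), dphi(0.0), phi(0.05), dphi(0.05), 0.05)
```

In this example the full step α = 1 passes both conditions, while α = 0.05 lies below the bound and fails the curvature condition, in line with (69).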

Remark 3

(Worst-case complexity for first-order stationarity). The descent estimate in Lemma 2 and the step-size lower bound established in Lemma 3 yield a standard worst-case complexity bound. Let

g^{(k)} : = \nabla f (x^{(k)}), g_{z}^{(k)} : = \nabla f (z^{(k)}), f_{\inf} : = \inf_{x \in L} f (x),

where

L = \{x \in R^{n} : f (x) \leq f (x^{(0)})\}

. If, before termination,

α_{N}^{(k)} \geq α_{\min} > 0, α_{H O}^{(k)} \geq α_{\min} > 0, σ : = \underset{̲}{λ} α_{\min} \min \{c_{1}, {\bar{c}}_{1}\},

then Lemma 2 gives

f (x^{(k + 1)}) \leq f (x^{(k)}) - σ (∥ g^{(k)} ∥^{2} + {∥ g_{z}^{(k)} ∥}^{2}) .

Summing from

k = 0

to

N - 1

, we obtain

σ \sum_{k = 0}^{N - 1} (∥ g^{(k)} ∥^{2} + {∥ g_{z}^{(k)} ∥}^{2}) \leq f (x^{(0)}) - f_{\inf} .

In particular,

\min_{0 \leq k \leq N - 1} ∥ g^{(k)} ∥ \leq \sqrt{\frac{f (x^{(0)}) - f_{\inf}}{σ N}} .

Therefore, to guarantee an index

0 \leq k \leq N - 1

such that

∥ \nabla f (x^{(k)}) ∥ \leq ε,

it is sufficient to take

N \geq \frac{f (x^{(0)}) - f_{\inf}}{σ ε^{2}} .

Hence, the worst-case number of outer iterations required to reach first-order stationarity is

N_{ε} = O (ε^{- 2}) .

Since each full-memory HOQN iteration involves a finite number of matrix-vector products and rank-two quasi-Newton updates, its dense algebraic cost is

O (n^{2})

. Consequently, the dense worst-case arithmetic complexity is

O (n^{2} ε^{- 2}) .
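The iteration bound in this remark can be evaluated directly once the constants are fixed. The helper below is a minimal sketch with a hypothetical name and argument order; the constants in the usage lines are arbitrary powers of two chosen so that the arithmetic is exact.

```python
import math

def iterations_for_stationarity(f0, f_inf, lam, alpha_min, c1, c1_bar, eps):
    """Smallest integer N with N >= (f0 - f_inf) / (sigma * eps^2).

    sigma = lam * alpha_min * min(c1, c1_bar), as defined in Remark 3.
    """
    sigma = lam * alpha_min * min(c1, c1_bar)
    return math.ceil((f0 - f_inf) / (sigma * eps ** 2))

n_coarse = iterations_for_stationarity(1.0, 0.0, 1.0, 1.0, 0.5, 0.25, 0.5)
n_fine = iterations_for_stationarity(1.0, 0.0, 1.0, 1.0, 0.5, 0.25, 0.25)
```

Halving the tolerance multiplies the bound by four, which is the ε^{-2} scaling stated above; the bound is a worst-case guarantee and is typically far larger than the iteration counts observed in practice.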

Lemma 4

(Dennis–Moré condition). Let

f : R^{n} \to R

be of class

C^{2}

near ξ, with

\nabla^{2} f (ξ) ≻ 0

, and let

\{x^{(k)}\}

be generated by Algorithm 1. Set

e^{(k)} : = x^{(k)} - ξ, s^{(k)} : = x^{(k + 1)} - x^{(k)},

and define

d^{(k)} : = - H^{(k)} g (x^{(k)}), d_{N}^{(k)} : = - \nabla^{2} f {(ξ)}^{- 1} g (x^{(k)}) .

Assume that

\{H^{(k)}\}

is updated by (1), that

\{H^{(k)}\}

is bounded, that

x^{(k)} \to ξ

superlinearly, and that the Dennis–Moré condition holds:

\lim_{k \to \infty} \frac{∥ (H^{(k)} - \nabla^{2} f {(ξ)}^{- 1}) s^{(k)} ∥}{∥ s^{(k)} ∥} = 0 .

Then

\frac{∥ d^{(k)} - d_{N}^{(k)} ∥}{∥ d_{N}^{(k)} ∥} \to 0 .

Proof.

Let

H^{*} : = \nabla^{2} f {(ξ)}^{- 1}, g^{(k)} : = g (x^{(k)}) .

By definition,

d^{(k)} - d_{N}^{(k)} = - (H^{(k)} - H^{*}) g^{(k)} .

(71)

Using the first-order expansion of the gradient around

ξ

, and the superlinear convergence of

x^{(k)}

, we have

g^{(k)} = \nabla^{2} f (ξ) e^{(k)} + O (∥ e^{(k)} ∥^{2}), s^{(k)} = e^{(k + 1)} - e^{(k)} = - e^{(k)} + o (∥ e^{(k)} ∥) .

Hence

e^{(k)} = - s^{(k)} + o (∥ s^{(k)} ∥), g^{(k)} = - \nabla^{2} f (ξ) s^{(k)} + o (∥ s^{(k)} ∥) .

Substituting this expression for

g^{(k)}

into (71) gives

d^{(k)} - d_{N}^{(k)} = (H^{(k)} - H^{*}) \nabla^{2} f (ξ) s^{(k)} + o (∥ s^{(k)} ∥) .

Since

\nabla^{2} f (ξ)

is fixed and

\{H^{(k)}\}

is bounded, there exists

C > 0

such that

∥ d^{(k)} - d_{N}^{(k)} ∥ \leq C ∥ (H^{(k)} - H^{*}) s^{(k)} ∥ + o (∥ s^{(k)} ∥) .

Dividing by

∥ s^{(k)} ∥

and using the Dennis–Moré condition yields

\frac{∥ d^{(k)} - d_{N}^{(k)} ∥}{∥ s^{(k)} ∥} \to 0 .

Moreover, since

d_{N}^{(k)} = - H^{*} g^{(k)} = s^{(k)} + o (∥ s^{(k)} ∥),

we have

∥ d_{N}^{(k)} ∥ \sim ∥ s^{(k)} ∥

. Therefore,

\frac{∥ d^{(k)} - d_{N}^{(k)} ∥}{∥ d_{N}^{(k)} ∥} \to 0,

which proves the result. □

5. Numerical Tests

In this section, we numerically compare the proposed HOQN variants (

M_{1} DFP

,

M_{2} DFP

,

M_{3} DFP

,

B M_{1} D

,

B M_{2} D

, and

B M_{3} D

) with the classical quasi-Newton methods BFGS, DFP, and SR1, using the Booth test function

f_{B}

, the Himmelblau function

f_{H}

, and the Freudenstein–Roth function

f_{F R}

. All numerical experiments were carried out in MATLAB R2024. The iterative process was terminated when the gradient norm satisfied the stopping criterion

∥\nabla f (x^{(k)})∥ < 10^{- 6}

.

Although the global convergence analysis is developed under standard inexact line-search assumptions of Armijo/Wolfe type, the numerical implementation reported in this section computes the step sizes in both the predictor and corrector stages by bounded one-dimensional minimization using MATLAB’s fminbnd routine over the interval

[0, 10]

. Thus, the theoretical framework provides sufficient conditions for the convergence analysis, whereas fminbnd is adopted as a practical step-selection procedure in the computational experiments. Next, we define the test functions:

$f_{B} (x_{1}, x_{2}) = {(x_{1} + 2 x_{2} - 7)}^{2} + {(2 x_{1} + x_{2} - 5)}^{2}$ ;
$f_{H} (x_{1}, x_{2}) = {(x_{1}^{2} + x_{2} - 11)}^{2} + {(x_{1} + x_{2}^{2} - 7)}^{2}$ ;
$f_{F R} (x_{1}, x_{2}) = {(x_{1} + x_{2} (x_{2} (x_{2} + 1) - 14) - 29)}^{2} + {(- x_{1} + x_{2} (x_{2} (x_{2} - 5) + 2) + 13)}^{2}$ .

The Booth function

f_{B}

is strictly convex and has a unique global minimizer at

{(1, 3)}^{T}

with value

f_{B} ({(1, 3)}^{T}) = 0

. The Himmelblau function has four local minima which are also global minima, all satisfying

f_{H} = 0

:

{(3, 2)}^{T}

,

{(- 2.805118, 3.131312)}^{T}

,

{(- 3.779310, - 3.283186)}^{T}

, and

{(3.584428, - 1.848126)}^{T}

. The Freudenstein–Roth function has a global minimizer at

{(5, 4)}^{T}

with

f_{F R} ({(5, 4)}^{T}) = 0

, a well-known local minimizer at approximately

{(11.4128, - 0.8968)}^{T}

, and it also exhibits a saddle point at approximately

{(23.9206, 2.2300)}^{T}

. The 3D landscapes in Figure 1 visualize these geometries: Figure 1a corresponds to the convex, unimodal

f_{B}

, whereas Figure 1b and Figure 1c show the multimodal landscapes of

f_{H}

and

f_{F R}

, respectively, providing intuition about the basins and valleys against which the HOQN variants are assessed.
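For reference, the three test functions can be written as plain Python functions and evaluated at the minimizers quoted above. This sketch is only a convenience for reproducing the setup; the experiments themselves were run in MATLAB, and the function names are assumptions.

```python
def booth(x1, x2):
    return (x1 + 2 * x2 - 7) ** 2 + (2 * x1 + x2 - 5) ** 2

def himmelblau(x1, x2):
    return (x1 ** 2 + x2 - 11) ** 2 + (x1 + x2 ** 2 - 7) ** 2

def freudenstein_roth(x1, x2):
    r1 = x1 + x2 * (x2 * (x2 + 1) - 14) - 29
    r2 = -x1 + x2 * (x2 * (x2 - 5) + 2) + 13
    return r1 ** 2 + r2 ** 2

# Minimizers of the Himmelblau function as listed in the text (6 decimals).
himmelblau_minima = [(3.0, 2.0), (-2.805118, 3.131312),
                     (-3.779310, -3.283186), (3.584428, -1.848126)]
himmelblau_values = [himmelblau(a, b) for a, b in himmelblau_minima]

# Freudenstein-Roth value at the approximate local (non-global) minimizer.
fr_local_value = freudenstein_roth(11.4128, -0.8968)
```

The value at the local minimizer of the Freudenstein–Roth function is strictly positive, which is what distinguishes it from the global minimizer at (5, 4).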

A particularly important metric in numerical tests of iterative methods is the Approximate Computational Order of Convergence (ACOC), as it allows for the experimental validation of the theoretical order of convergence. In the context of hybrid optimization algorithms, estimating the convergence order, denoted by

ρ

, using the classical formula proposed in [25], often leads to unreliable results due to significant fluctuations between consecutive iterations. To obtain a more stable and consistent estimate of

ρ

, we adopt an approach based on the linear regression of the logarithms of successive errors

(\ln E_{k}, \ln E_{k + 1})

. This methodology leverages global information from multiple iterations within the asymptotic regime, thereby reducing sensitivity to local fluctuations in individual errors and providing a more robust estimate than formulas based solely on the most recent iterations. In particular, the ACOC is calculated as the slope of the least-squares regression line that fits the data

\ln E_{k + 1}

versus

\ln E_{k}

, and is given by:

ρ = \frac{\sum_{k = k_{0}}^{m - 1} (\ln E_{k} - \bar{\ln E_{k}}) (\ln E_{k + 1} - \bar{\ln E_{k + 1}})}{\sum_{k = k_{0}}^{m - 1} {(\ln E_{k} - \bar{\ln E_{k}})}^{2}},

(72)

where

E_{k} = ∥ x^{(k)} - ξ ∥

denotes the error at iteration k, m is the total number of iterations considered,

\bar{\ln E_{k}} = \frac{1}{N} \sum_{k = k_{0}}^{m - 1} \ln E_{k}

, and

\bar{\ln E_{k + 1}} = \frac{1}{N} \sum_{k = k_{0}}^{m - 1} \ln E_{k + 1}

, with

N = m - k_{0}

denoting the number of data points used in the regression. Furthermore,

k_{0}

denotes the first iteration from which the iterates are considered to lie in the asymptotic regime.
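The regression estimate in (72) can be sketched as follows. The helper name `acoc_regression` is hypothetical, and the input is assumed to be the list of errors E_0, ..., E_m with every entry strictly positive.

```python
import math


def acoc_regression(errors, k0=0):
    """Slope of the least-squares line through (ln E_k, ln E_{k+1}), k = k0..m-1."""
    xs = [math.log(e) for e in errors[k0:-1]]      # ln E_k
    ys = [math.log(e) for e in errors[k0 + 1:]]    # ln E_{k+1}
    n = len(xs)                                    # N = m - k0 data points
    if n < 2:
        raise ValueError("at least two data points are needed for a slope")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


# Synthetic errors obeying E_{k+1} = 2 * E_k^3, so the slope should be 3
errs = [1e-1]
for _ in range(3):
    errs.append(2.0 * errs[-1] ** 3)
print(acoc_regression(errs))
```

With only two iterations available, as in the Booth experiment, there is a single data point and the slope is undefined, which is why the sketch raises an error in that case.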

5.1. Hybrid Optimization with Complete Memory

We begin by analyzing the full-memory variants of the proposed hybrid methods, comparing their numerical performance with classical quasi-Newton schemes on selected test functions. Table 1 and Table 2 present the numerical results obtained for the Himmelblau and Freudenstein–Roth functions, considering two different initial conditions. The classical quasi-Newton methods BFGS, DFP, and SR1 exhibit superlinear convergence, as is characteristic of this class of methods, requiring five or six iterations to reach the minimum. On the other hand, with the exception of

M_{1} DFP

, the proposed hybrid methods consistently converge in fewer iterations for both functions and initial conditions, achieving observed convergence orders consistent with the theoretical cubic order and computational times competitive with the classical methods considered. Furthermore, regarding the final gradient norm

∥\nabla f (x^{(k + 1)})∥

, this metric can vary significantly with the initial condition in these nonconvex problems; however, the smallest values were obtained by some of the proposed hybrid variants, which indicates that these variants approximate the minimizer more accurately.

The reported computation time (CT(s)) is the total execution time of the iterative algorithm, measured in seconds, from the start of the process until the termination criterion is satisfied or the maximum number of iterations is reached. Consequently, this value includes the cost of all operations performed in each iteration. To obtain reliable time measurements, each method was run 15 times for each test problem, and the times reported in the tables are the averages of those runs. The numerical experiments were conducted on a 13-inch MacBook Air (2025) running macOS Sequoia 15.7.4, equipped with an Apple M4 chip and 16 GB of unified memory. This computational environment provides a consistent and up-to-date basis for the comparative evaluation of the methods under consideration.

The numerical results indicate that the efficiency of the methods strongly depends on the nature of the problem. In the nonconvex problems, such as the Himmelblau and Freudenstein–Roth functions, the hybrid methods reach the minimizer in fewer iterations and with a faster convergence rate than the classical BFGS and DFP methods, while also yielding smaller final gradient norms.

Let us now analyze the Booth function, which is a two-dimensional strictly convex quadratic problem with unique global minimizer

ξ = {(1, 3)}^{T}

. Table 3 reports the numerical results obtained from two different initial conditions. All the methods converge in two iterations and reach final gradient norms of order

10^{- 14}

or smaller. Since only two iterations are required, the approximate computational order of convergence cannot be reliably estimated; therefore, the value of

ρ

is not reported.

The results in Table 3 should be interpreted as a consistency check for a simple strictly convex quadratic problem, not as evidence of a general optimality property of the proposed methods. In exact arithmetic, Newton’s method solves a strictly convex quadratic problem in one step when the exact inverse Hessian is used, while full-memory quasi-Newton schemes with line searches may display very fast finite termination on low-dimensional quadratic problems. Therefore, the two-iteration behavior observed for the Booth function is a finite-dimensional quadratic effect. Consequently, the Booth experiment confirms that all methods behave accurately on a convex quadratic benchmark, but it is not sufficient to assess scalability or general asymptotic behavior. For this reason, the Booth test is complemented below with high-dimensional quadratic experiments with prescribed Hessian condition numbers.
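The one-step property of Newton's method on a strictly convex quadratic can be checked directly on the Booth function. The sketch below uses the gradient and the constant Hessian [[10, 8], [8, 10]] obtained by differentiating f_B by hand; the names are illustrative.

```python
def grad_booth(x):
    # Gradient of f_B(x1, x2) = (x1 + 2 x2 - 7)^2 + (2 x1 + x2 - 5)^2
    x1, x2 = x
    return (10 * x1 + 8 * x2 - 34, 8 * x1 + 10 * x2 - 38)


# Inverse of the constant Hessian [[10, 8], [8, 10]] (determinant 36)
H_INV = ((10 / 36, -8 / 36), (-8 / 36, 10 / 36))


def newton_step(x):
    # Full Newton step x - H^{-1} grad f(x)
    g = grad_booth(x)
    return (
        x[0] - (H_INV[0][0] * g[0] + H_INV[0][1] * g[1]),
        x[1] - (H_INV[1][0] * g[0] + H_INV[1][1] * g[1]),
    )


for start in [(0.0, 0.0), (-5.0, 7.0), (10.0, -10.0)]:
    print(start, "->", newton_step(start))
```

Every starting point is mapped to (1, 3) up to rounding, which illustrates why the Booth results serve only as a consistency check.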

5.2. High-Dimensional Quadratic Scalability Test

Since the Booth function is only a two-dimensional strictly convex quadratic problem, its two-iteration behavior should not be interpreted as evidence of general scalability. To test the methods beyond the low-dimensional Booth case, we consider the strictly convex quadratic family

f_{n, κ} (x) = \frac{1}{2} x^{T} A_{n, κ} x - b^{T} x, A_{n, κ} = Q^{T} Λ_{n, κ} Q, Λ_{n, κ} = diag (1, κ^{1 / (n - 1)}, κ^{2 / (n - 1)}, \dots, κ),

where Q is an orthogonal matrix generated with a fixed random seed. Hence,

\nabla^{2} f_{n, κ} (x) = A_{n, κ}, κ (A_{n, κ}) = κ .

The vector b was chosen as

b = A_{n, κ} ξ

, with

ξ = \frac{1}{\sqrt{n}} {(1, \dots, 1)}^{T}

so that the exact minimizer is known. We used

x^{(0)} = 0, H^{(0)} = I

, and the stopping criterion

∥\nabla f (x^{(k)})∥ \leq 10^{- 6}

. The step length was computed by exact one-dimensional minimization along each search direction, in order to isolate the cost of the quasi-Newton updates from line-search variability. The tested dimensions and condition numbers were

n \in \{100, 500, 1000\}, κ \in \{10^{2}, 10^{6}\}

. To assess the expected dense

O (n^{2})

scaling, Table 4 reports the number of outer iterations and the normalized cost

η = \frac{CT}{Iter \cdot n^{2}}

.
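The ingredients of this test can be sketched as follows: the geometrically spaced spectrum, the closed-form exact step length for a quadratic, and the normalized cost. For brevity the sketch assumes Q = I, so the Hessian is diagonal; this is a simplification, not the setting of the reported experiments, and all helper names are hypothetical.

```python
import math


def geometric_spectrum(n, kappa):
    # Eigenvalues 1, kappa^(1/(n-1)), ..., kappa
    return [kappa ** (i / (n - 1)) for i in range(n)]


def exact_step(eigs, g, d):
    # Minimizer of alpha -> f(x + alpha d) for f = 0.5 x^T A x - b^T x, A = diag(eigs):
    # alpha = -(g^T d) / (d^T A d), with g the gradient at x
    gd = sum(gi * di for gi, di in zip(g, d))
    dAd = sum(lam * di * di for lam, di in zip(eigs, d))
    return -gd / dAd


def normalized_cost(ct_seconds, iterations, n):
    # eta = CT / (Iter * n^2)
    return ct_seconds / (iterations * n ** 2)


n, kappa = 3, 100.0
eigs = geometric_spectrum(n, kappa)
xi = [1.0 / math.sqrt(n)] * n                  # known minimizer
b = [lam * v for lam, v in zip(eigs, xi)]      # b = A xi
g = [-bi for bi in b]                          # gradient at x0 = 0
d = [-gi for gi in g]                          # steepest-descent direction
alpha = exact_step(eigs, g, d)
g_new = [gi + alpha * lam * di for gi, lam, di in zip(g, eigs, d)]
print(alpha, sum(a * c for a, c in zip(g_new, d)))
```

After an exact step the new gradient is orthogonal to the search direction, which is what the second printed number (close to zero) shows.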

All methods reached the prescribed tolerance in every tested case. The hybrid variant

{BM}_{1} D

consistently required fewer outer iterations than both BFGS and

M_{1} DFP

. In the most demanding case,

n = 1000, κ = 10^{6}

, BFGS required 770 iterations and

M_{1} DFP

required 800 iterations, whereas

{BM}_{1} D

required 408 iterations. This confirms the reduction in outer steps produced by the predictor–corrector structure. Although

{BM}_{1} D

performs two stages per outer iteration, the normalized cost

η

remains of order

10^{- 8}

across all tested dimensions and condition numbers, consistent with the dense

O (n^{2})

cost of full-memory quasi-Newton updates. Hence, these experiments show that the proposed hybrid scheme extends beyond the two-dimensional Booth case, preserves the expected dense scaling, and substantially reduces the number of outer iterations on high-dimensional quadratic problems.

5.3. Dolan–Moré Performance Profiles and Robustness Assessment

To provide a broader numerical assessment, we complement the pointwise convergence results with Dolan–Moré performance profiles. Among the proposed hybrid variants, BM₁D, BM₂D and BM₃D were selected for the global benchmark because they showed the most stable behavior in the preliminary screening, with fewer non-descent directions, fewer line-search breakdowns and more reliable quasi-Newton updating. These methods correspond, respectively, to the modified Chun-type, Ostrowski-type and Traub-type hybrid quasi-Newton schemes, and are compared with the classical quasi-Newton methods DFP, BFGS and SR1 [19]. Dolan–Moré profiles are a standard tool for benchmarking optimization solvers over a common test set [26].

Let

P

be the set of test instances and

S

the set of solvers. For each

p \in P

and

s \in S

, let

t_{p, s}

denote the computational cost required by solver s. If the solver fails to satisfy the stopping criterion within the computational budget, or produces non-finite values, we set

t_{p, s} = + \infty

. The performance ratio is

r_{p, s} = \frac{t_{p, s}}{\min_{\bar{s} \in S} t_{p, \bar{s}}},

where the minimum is taken over the solvers that successfully solve problem p. If no solver solves a given instance, all ratios are treated as infinite and the instance contributes only to the failure-rate analysis. The Dolan–Moré profile of solver s is defined by

ρ_{s} (τ) = \frac{1}{| P |} |\{p \in P : r_{p, s} \leq τ\}|, τ \geq 1 .

Thus,

ρ_{s} (1)

measures relative efficiency, whereas the limiting value of

ρ_{s} (τ)

for large

τ

measures robustness, since failed runs keep infinite cost.
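The definitions above translate into a few lines of code. In this sketch, `costs[p][s]` holds t_{p,s}, with `math.inf` marking a failed run; the function names and the small cost table are illustrative and are not data from the paper.

```python
import math


def performance_ratios(costs):
    """Return r[p][s] = t[p][s] / min_s t[p][s]; all-infinite rows stay infinite."""
    ratios = []
    for row in costs:
        best = min(row)
        if math.isinf(best):
            # No solver succeeded: the instance only counts toward failure rates
            ratios.append([math.inf] * len(row))
        else:
            ratios.append([t / best for t in row])
    return ratios


def profile(costs, s, tau):
    """Fraction of instances that solver s solves within a factor tau of the best."""
    ratios = performance_ratios(costs)
    return sum(1 for row in ratios if row[s] <= tau) / len(ratios)


# Four hypothetical instances, two solvers
costs = [
    [10.0, 20.0],
    [30.0, 15.0],
    [math.inf, 40.0],
    [math.inf, math.inf],
]
for s in range(2):
    print(s, profile(costs, s, 1.0), profile(costs, s, 2.0), profile(costs, s, 1e9))
```

The value at a very large τ equals the fraction of instances the solver managed to solve, which is the robustness reading described above.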

Following the reviewer’s recommendation, the benchmark set was enlarged beyond the basic two-dimensional tests. It consists of 30 base problems from classical unconstrained optimization benchmarks, the Moré–Garbow–Hillstrom family, the Wood problem, higher-dimensional extensions, and CUTEr/CUTEst-type analytic test instances [27,28,29,30]. The composition of the set is reported in Table 5. Each base problem was tested in five variants: one smooth version, two deterministic noisy versions, and two piecewise-smooth nonsmooth versions. Hence, the complete benchmark contains

30 \times 5 = 150

instances and, with six solvers,

150 \times 6 = 900

numerical runs.

The primary performance measure was the total computational work

t_{p, s} = N_{f} + N_{g}

, where

N_{f}

and

N_{g}

are the numbers of objective-function and gradient evaluations. This metric is appropriate because BM₁D, BM₂D and BM₃D are two-stage schemes, whereas DFP, BFGS and SR1 are one-stage quasi-Newton methods. For completeness, we also report profiles based on the number of outer iterations, which highlight the reduction in iteration count achieved by the proposed methods. For smooth instances, a run was declared successful when

∥\nabla f (x^{(k)})∥ \leq 10^{- 6}

. For noisy and nonsmooth variants, where the gradient norm may be less reliable, we used the function-reduction criterion

f (x^{(k)}) \leq f_{L} + τ_{f} (f (x^{(0)}) - f_{L}), τ_{f} = 10^{- 5} .

Here,

f_{L}

denotes the best available reference value; when the exact value was unavailable, it was taken as the best value obtained over all solvers. This criterion follows the benchmarking philosophy of Moré and Wild for smooth, noisy and piecewise-smooth optimization problems [31]. For noisy variants, the solvers used perturbed information, but success was assessed with the corresponding unperturbed objective value. In addition to the performance profiles, we report the failure rate

{FR}_{s, C} = 1 - \frac{1}{| C |} |\{p \in C : t_{p, s} < + \infty\}|,

where

C \subseteq P

denotes the smooth, noisy, nonsmooth, or full benchmark class. This measure complements the profiles, since a solver may be efficient on the instances it solves while still being unreliable under perturbations or nonsmoothness. Table 6 summarizes the Dolan–Moré results for the smooth subset of 30 instances. The first block uses the total work

N_{f} + N_{g}

, whereas the second block uses the number of outer iterations. The column “Best at

τ = 1

” counts ties independently.
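The function-reduction success test and the failure rate can be written compactly. This is a sketch with hypothetical helper names; failed runs are encoded as infinite cost, as in the profile definition.

```python
import math


def reduction_success(f_k, f_0, f_ref, tau_f=1e-5):
    # f(x_k) <= f_L + tau_f * (f(x_0) - f_L), with f_ref playing the role of f_L
    return f_k <= f_ref + tau_f * (f_0 - f_ref)


def failure_rate(costs_for_solver):
    # FR = 1 - (number of instances with finite cost) / (number of instances)
    solved = sum(1 for t in costs_for_solver if math.isfinite(t))
    return 1.0 - solved / len(costs_for_solver)


print(reduction_success(1e-6, 1.0, 0.0))   # reduction below the threshold
print(reduction_success(1e-4, 1.0, 0.0))   # not enough reduction
print(failure_rate([10.0, math.inf, 5.0, 7.0]))
```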

The smooth-subset results show that BFGS and the three proposed hybrid methods solve all 30 instances, whereas DFP and SR1 show one and two failures, respectively. In terms of outer iterations, BM₁D, BM₂D and BM₃D have median iteration counts of 4.5, 5.0 and 5.5, compared with 7.0, 8.5 and 6.5 for DFP, BFGS and SR1. BM₂D attains the best iteration count in 23 instances, followed by BM₁D in 21 and BM₃D in 13. When total work

N_{f} + N_{g}

is used, the comparison becomes more balanced, as expected for two-stage methods. Even so, all three hybrid methods solve every smooth instance and reach

ρ_{s} (5) = 1

. Among them, BM₂D gives the most balanced smooth-subset behavior, with

ρ_{s} (2) = 0.9333

, full success, and lower median work than BM₃D. To assess robustness under perturbations and loss of smoothness, Table 7 reports failure rates over the smooth, noisy and nonsmooth subsets, together with the overall rate over all 150 instances.

The failure-rate results indicate that the noisy variants are the most demanding part of the benchmark. In the smooth and nonsmooth classes, BM₁D, BM₂D and BM₃D achieve zero failure rate, matching BFGS and improving over DFP and SR1. In the noisy class, BFGS has the lowest failure rate,

0.1167

, followed by BM₂D and BM₃D, both with

0.1333

. Overall, BM₂D and BM₃D are the most robust proposed methods, with global failure rates of

0.0533

, close to BFGS

(0.0467)

. Figure 2 reports the empirical success rates by problem class, complementing Table 7.
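The reported rates can be cross-checked with simple arithmetic. With two noisy variants per base problem, the noisy class has 2 × 30 = 60 instances. The failure counts used below (7 and 8) are inferred from the rounded rates and are not stated in the text; they are consistent with zero failures in the smooth and nonsmooth classes for these solvers.

```python
NOISY = 2 * 30    # two noisy variants for each of the 30 base problems
TOTAL = 5 * 30    # full benchmark


def rate(failures, size):
    return round(failures / size, 4)


# Inferred counts: 7 failed noisy runs for BFGS, 8 for BM2D and BM3D
print(rate(7, NOISY), rate(7, TOTAL))   # noisy and overall rate for 7 failures
print(rate(8, NOISY), rate(8, TOTAL))   # noisy and overall rate for 8 failures
```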

The heat map confirms that the proposed methods solve all smooth and nonsmooth instances, while maintaining competitive success rates in the noisy class:

85.0 %

for BM₁D and

86.7 %

for BM₂D and BM₃D.

Figure 3 shows the aggregated failure rates over the full benchmark. BFGS gives the lowest global failure rate,

4.67 %

, while BM₂D and BM₃D follow closely with

5.33 %

. The larger failure rates of SR1 and DFP,

8.67 %

and

9.33 %

, respectively, indicate greater sensitivity under the present perturbation setting.

The global Dolan–Moré profiles over the full benchmark are shown in Figure 4. The left panel uses the primary metric

N_{f} + N_{g}

, whereas the right panel uses the number of outer iterations.

The work-based profile in Figure 4a gives the most balanced comparison because it accounts for the additional evaluations required by the two-stage hybrid schemes. In this metric, BFGS and SR1 are highly competitive for small values of

τ

, reflecting their lower cost per iteration. However, the proposed methods remain close to the best solvers over a wide range of

τ

, with limiting values consistent with the low failure rates reported in Table 7. The iteration-based profile in Figure 4b highlights the main advantage of the proposed schemes: BM₁D, BM₂D and BM₃D solve a large fraction of the test instances with fewer outer iterations than DFP, BFGS and SR1. Overall, Table 6 and Table 7, together with Figure 2, Figure 3 and Figure 4, show that the proposed hybrid schemes are robust and competitive on a heterogeneous benchmark set. BM₂D and BM₃D are the most robust proposed methods, whereas BM₁D provides strong iteration reduction on several smooth instances. BFGS remains a very strong classical baseline in terms of total work and failure rate; therefore, the proposed methods should be interpreted as competitive hybrid alternatives that reduce outer iterations while preserving robust behavior under smooth, noisy and nonsmooth scenarios.

5.4. Local Verification Setup on Ill-Conditioned Hessians

To validate the explicit cubic error recurrences derived in Theorem 2 and Lemma 1, we consider a family of smooth test functions with prescribed Hessian condition number at the solution. For

κ > 1

, let

A_{κ} = Q^{T} Λ_{κ} Q, Λ_{κ} = diag (1, κ^{1 / (n - 1)}, κ^{2 / (n - 1)}, \dots, κ),

where Q is a fixed orthogonal matrix generated with a prescribed random seed. We consider

f_{κ} (x) = \frac{1}{2} x^{T} A_{κ} x + \frac{γ}{3} \sum_{i = 1}^{n} x_{i}^{3} + \frac{δ}{4} \sum_{i = 1}^{n} x_{i}^{4}, x \in R^{n} .

Then

ξ = 0, \nabla f_{κ} (0) = 0, \nabla^{2} f_{κ} (0) = A_{κ}, κ (\nabla^{2} f_{κ} (0)) = κ .

The gradient and Hessian are

F (x) = \nabla f_{κ} (x) = A_{κ} x + γ x^{\circ 2} + δ x^{\circ 3}, F^{'} (x) = \nabla^{2} f_{κ} (x) = A_{κ} + 2 γ diag (x) + 3 δ diag (x^{\circ 2}),

where

x^{\circ 2}

and

x^{\circ 3}

denote componentwise powers. In the experiments, we used

n = 20, γ = 10^{- 1}, δ = 10^{- 2}, κ \in \{10^{2}, 10^{4}, 10^{6}, 10^{8}\}

. Since the convergence theory is local, the verification was carried out in the full-step regime

α_{N}^{(k)} = α_{H O}^{(k)} = 1

, which corresponds to the asymptotic regime assumed in the local analysis. For each value of

κ

, 96 local samples were generated by taking

x^{(0)} = r v, ∥ v ∥ = 1

, with small radii r and randomly generated unit directions v. The initial inverse Hessian approximation was chosen as

H_{N}^{(0)} = A_{κ}^{- 1}

. For sufficiently small

x^{(0)}

, this choice satisfies

H_{N}^{(0)} = {[F^{'} (x^{(0)})]}^{- 1} + O (∥ e^{(0)} ∥),

which is precisely the first-order inverse-Hessian consistency condition used in Theorem 2. For each sample, one HOQN iteration was performed and the observed error

e^{(1)}

was compared with the cubic term

C_{0}^{(3)}

. The local order

ρ_{loc}

was estimated by a log–log regression of

∥ e^{(1)} ∥

versus

∥ e^{(0)} ∥

. We also report

K^{emp} = \frac{∥ e^{(1)} ∥}{∥ e^{(0)} ∥^{3}}, K^{th} = \frac{∥ C_{0}^{(3)} ∥}{∥ e^{(0)} ∥^{3}}, Q = \frac{∥ e^{(1)} - C_{0}^{(3)} ∥}{∥ e^{(0)} ∥^{4}} .

Thus, bounded values of Q provide numerical evidence for

∥e^{(1)} - C_{0}^{(3)}∥ = O ({∥e^{(0)}∥}^{4})

. For

M_{1} DFP

and

{BM}_{1} D

, respectively, we monitor the inverse-Hessian consistency quantities

δ_{N} = \frac{∥ H_{N}^{(0)} - {[F^{'} (x^{(0)})]}^{- 1} ∥}{∥ e^{(0)} ∥}, δ_{B} = \frac{∥ {\hat{H}}^{(0)} - {[F^{'} (x^{(0)})]}^{- 1} ∥}{∥ e^{(0)} ∥},

where

{\hat{H}}^{(0)}

is the intermediate inverse Hessian approximation obtained after the BFGS update following the Newton-type predictor.
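The test function, its gradient and its Hessian are simple enough to transcribe. The sketch below assumes Q = I, so that A_κ is the diagonal matrix of the prescribed eigenvalues; this keeps the code short and is not the configuration used in the reported experiments. All names are illustrative.

```python
import math


def f_kappa(x, eigs, gamma=0.1, delta=0.01):
    quad = 0.5 * sum(lam * xi * xi for lam, xi in zip(eigs, x))
    return quad + gamma / 3 * sum(xi ** 3 for xi in x) + delta / 4 * sum(xi ** 4 for xi in x)


def grad_kappa(x, eigs, gamma=0.1, delta=0.01):
    # F(x) = A x + gamma x^{o2} + delta x^{o3} (componentwise powers)
    return [lam * xi + gamma * xi ** 2 + delta * xi ** 3 for lam, xi in zip(eigs, x)]


def hess_diag_kappa(x, eigs, gamma=0.1, delta=0.01):
    # Diagonal of F'(x) = A + 2 gamma diag(x) + 3 delta diag(x^{o2}) when A is diagonal
    return [lam + 2 * gamma * xi + 3 * delta * xi ** 2 for lam, xi in zip(eigs, x)]


def k_emp(e1, e0):
    # Empirical cubic constant ||e1|| / ||e0||^3 (Euclidean norms)
    n1 = math.sqrt(sum(v * v for v in e1))
    n0 = math.sqrt(sum(v * v for v in e0))
    return n1 / n0 ** 3


eigs = [1.0, 10.0, 100.0]
print(grad_kappa([0.0, 0.0, 0.0], eigs), hess_diag_kappa([0.0, 0.0, 0.0], eigs))
```

At ξ = 0 the gradient vanishes and the Hessian reduces to the prescribed spectrum, as stated above.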

Table 8 reports the local verification results for

M_{1} DFP

and

{BM}_{1} D

on ill-conditioned Hessians.

The results show that the estimated local order remains close to three for both

M_{1} DFP

and

{BM}_{1} D

, even for

κ = 10^{8}

. Orders higher than three are consistent with the theoretical result, which guarantees a cubic convergence order, and may occur when the leading cubic coefficient is small or is partially canceled out along some sampled directions. The empirical constants

K^{emp}

and the theoretical constants

K^{th}

remain finite across all tests. The small values of Q indicate that, after subtracting the explicit cubic contribution

C_{0}^{(3)}

, the remaining error behaves as a fourth-order residual, which supports

∥e^{(1)} - C_{0}^{(3)}∥ = O ({∥e^{(0)}∥}^{4}) .

Finally, the bounded values of

δ_{N}

and, for

{BM}_{1} D

,

δ_{B}

, confirm that the approximations of the inverse Hessian before and after the intermediate BFGS update remain first-order consistent with

{[F^{'} (x^{(0)})]}^{- 1}

. This provides numerical evidence that the BFGS update following the Newton-type predictor and the DFP update following the high-order corrector preserve the local cubic regime for the two-update variant described as

{BM}_{1} D

.

5.5. Dynamical Planes

To analyze the dependence of the methods on the initial estimate, we generate dynamical planes [32]. Each plane is constructed on a uniform grid, using each grid point as an initial condition and coloring it according to the minimizer reached by the method, thus visualizing the corresponding basins of attraction. In all dynamical planes, the colors indicate the basins of attraction associated with the minimizers reached by the method, whereas the black symbols mark the corresponding local minimizers. Unlike root-finding dynamical planes, here the goal is to locate minimizers rather than zeros of a nonlinear operator; consequently, the observed equilibrium points may appear on the boundary of attraction regions. The configuration used was a

500 \times 500

grid, with a maximum of 500 iterations and tolerance

10^{- 4}

. Figure 5, Figure 6 and Figure 7 show the dynamical planes for the Himmelblau function, and Figure 8, Figure 9 and Figure 10 show the corresponding results for the Freudenstein–Roth function [12]. Overall, BFGS, DFP, SR1,

M_{3} DFP

,

B M_{1} D

,

B M_{2} D

, and

B M_{3} D

exhibit similar and comparatively stable dynamical behavior, whereas

M_{1} DFP

and

M_{2} DFP

show more fragmented or chaotic basins. The Himmelblau function produces the most complex attraction structure, with highly fragmented regions where small changes in the initial estimate may lead to different local minima. By contrast, the Freudenstein–Roth planes are less fragmented.

For the strictly convex quadratic Booth function, the Hessian is constant positive definite and

ξ = {(1, 3)}^{T}

is the unique minimizer. Hence, the considered descent line-search variants do not generate additional attractors: all initial conditions converge to the same point and the basin of attraction coincides with the whole domain. Therefore, only the

B M_{1} D

plane is shown in Figure 11, as it is representative of all methods for

f_{B}

.

5.6. Hybrid Optimization with Limited Memory

In general, optimization methods may require a significant amount of memory, making it essential to adopt computational strategies that efficiently manage such requirements. One of the most widely used approaches in this context is the limited-memory variant of the BFGS method (L-BFGS) [33]. Instead of storing or factorizing the full Hessian matrix, which entails a memory cost of order

O (n^{2})

and becomes impractical for dimensions

n ≳ 10^{4}

, L-BFGS retains only a limited number m of the most recent curvature pairs

(s_{i}, y_{i})

. As a result, the overall memory cost is reduced to

O (m n)

, while still providing an effective approximation of curvature information. Building upon this strategy, we propose the hybrid limited-memory method L-HOQN, whose iterative scheme is described in Algorithm 2.
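For reference, the standard L-BFGS two-loop recursion, which applies the implicit inverse-Hessian approximation to a vector using only the stored pairs, can be sketched as follows. The initial scaling s^T y / y^T y is a common choice and is an assumption here; the text does not specify which scaling the authors use.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def two_loop(grad, pairs):
    """Return H @ grad, with H defined implicitly by (s, y) pairs, oldest first."""
    q = list(grad)
    coeffs = []
    # First loop: newest pair to oldest
    for s, y in reversed(pairs):
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        coeffs.append((a, rho))
        q = [qi - a * yi for qi, yi in zip(q, y)]
    # Initial scaling H0 = (s^T y / y^T y) I from the newest pair (assumed choice)
    if pairs:
        s, y = pairs[-1]
        scale = dot(s, y) / dot(y, y)
    else:
        scale = 1.0
    r = [scale * qi for qi in q]
    # Second loop: oldest pair to newest
    for (s, y), (a, rho) in zip(pairs, reversed(coeffs)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return r


pairs = [([1.0, 0.0], [2.0, 0.5]), ([0.5, 1.0], [1.0, 3.0])]
print(two_loop([1.0, 3.0], pairs))   # applying H to the newest y recovers the newest s
```

Storing the pairs in a `collections.deque(maxlen=m)` would give the O(mn) memory bound, since the oldest pair is dropped automatically once m pairs are held.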

In low-dimensional problems, such as the Himmelblau, Freudenstein–Roth, and Booth test functions with

n = 2

, limited-memory optimization methods and, in particular, the L-HOQN method, are affected by a loss of curvature information and, consequently, exhibit slower convergence than their full-memory counterparts. In the case of the L-HOQN method variants,

M_{1}

-LBFGS,

M_{2}

-LBFGS, and

M_{3}

-LBFGS, this loss reduces the effectiveness of the high-order corrections and slows down the convergence process, as observed in the results reported in Table 9, Table 10 and Table 11. Nevertheless, this behavior changes substantially in practical high-dimensional applications, such as neural network training, which is analyzed in Section 6.

Algorithm 2 Hybrid method L-HOQN

Require:

x^{(k)}

,

g^{(k)} = \nabla f (x^{(k)})

,

H^{(k)}

,

α_{N}^{(k)}, α_{H O}^{(k)} > 0

,

type

,

| G_{1} | < + \infty

, memory

M

with at most m pairs

Ensure:

x^{(k + 1)}

,

H^{(k + 1)}

1:

d_{N}^{(k)} \leftarrow - H^{(k)} g^{(k)}

2:

z^{(k)} \leftarrow x^{(k)} + α_{N}^{(k)} d_{N}^{(k)}

3:

g_{z}^{(k)} \leftarrow \nabla f (z^{(k)})

4:

ν_{k} \leftarrow \frac{∥ g_{z}^{(k)} ∥^{2}}{∥ g^{(k)} ∥^{2}}

5: if

type = M_{1}

then

6:

h^{(k)} \leftarrow (1 + ν_{k}) g_{z}^{(k)} - 2 ν_{k} g^{(k)}

7:

d_{H O}^{(k)} \leftarrow - H^{(k)} h^{(k)}

8:

x^{(k + 1)} \leftarrow z^{(k)} + α_{H O}^{(k)} d_{H O}^{(k)}

9: else if

type = M_{2}

then

10:

h^{(k)} \leftarrow g_{z}^{(k)} + 2 ν_{k} g^{(k)}

11:

d_{H O}^{(k)} \leftarrow - \frac{1}{1 - 4 ν_{k}} H^{(k)} h^{(k)}

12:

x^{(k + 1)} \leftarrow z^{(k)} + α_{H O}^{(k)} d_{H O}^{(k)}

13: else if

type = M_{3}

then

14:

h^{(k)} \leftarrow (1 + 2 ν_{k}) g^{(k)} + (1 + G_{1} ν_{k}) g_{z}^{(k)}

15:

d_{H O}^{(k)} \leftarrow - H^{(k)} h^{(k)}

16:

x^{(k + 1)} \leftarrow x^{(k)} + α_{H O}^{(k)} d_{H O}^{(k)}

17: end if

18:

g^{(k + 1)} \leftarrow \nabla f (x^{(k + 1)})

19:

s^{(k)} \leftarrow x^{(k + 1)} - x^{(k)}

20:

y^{(k)} \leftarrow g^{(k + 1)} - g^{(k)}

21: if

card (M) = m

then

22: remove oldest pair from

M

23: end if

24: append

(s^{(k)}, y^{(k)})

to

M

25: update

H^{(k + 1)}

via two-loop recursion on

M
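Steps 1 to 17 of Algorithm 2 can be transcribed as below. This is a sketch under stated assumptions: the action of H^{(k)} is passed in as a callable `apply_H` (for example, a two-loop recursion), the step sizes default to 1, the default `G1 = 0.0` is purely illustrative, and the memory update of steps 18 to 25 is left out. No safeguards are added for a zero gradient or for 1 - 4ν = 0 in the M2 branch.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lhoqn_step(x, grad, apply_H, kind, alpha_n=1.0, alpha_ho=1.0, G1=0.0):
    """One predictor-corrector step (Algorithm 2, steps 1-17). Returns (x_next, z)."""
    g = grad(x)
    d_n = [-v for v in apply_H(g)]                        # step 1
    z = [xi + alpha_n * di for xi, di in zip(x, d_n)]     # step 2
    g_z = grad(z)                                         # step 3
    nu = dot(g_z, g_z) / dot(g, g)                        # step 4
    if kind == "M1":
        h = [(1 + nu) * gz - 2 * nu * gi for gz, gi in zip(g_z, g)]
        d = [-v for v in apply_H(h)]
        base = z
    elif kind == "M2":
        h = [gz + 2 * nu * gi for gz, gi in zip(g_z, g)]
        d = [-v / (1 - 4 * nu) for v in apply_H(h)]
        base = z
    elif kind == "M3":
        h = [(1 + 2 * nu) * gi + (1 + G1 * nu) * gz for gi, gz in zip(g, g_z)]
        d = [-v for v in apply_H(h)]
        base = x                                          # M3 restarts from x^(k)
    else:
        raise ValueError("kind must be 'M1', 'M2' or 'M3'")
    x_next = [bi + alpha_ho * di for bi, di in zip(base, d)]
    return x_next, z


# Quadratic with Hessian diag(1, 4) and minimizer (1, -2); exact inverse Hessian
grad = lambda x: [1.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]
apply_H = lambda v: [v[0] / 1.0, v[1] / 4.0]
for kind in ("M1", "M2", "M3"):
    print(kind, lhoqn_step([3.0, 5.0], grad, apply_H, kind)[0])
```

On this quadratic with the exact inverse Hessian, the predictor already lands on the minimizer, so ν = 0 and all three correctors return the same point.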

6. Practical Application in Neural Networks for the MNIST Database

The MNIST (Modified National Institute of Standards and Technology) dataset [34] is a canonical benchmark in the fields of computer vision and deep learning. It consists of 70,000 grayscale images of handwritten digits, of which 60,000 are designated for training and 10,000 for evaluation, each with a resolution of

28 \times 28

pixels. Despite its apparent simplicity, the task of training neural networks on this corpus entails the resolution of a high-dimensional nonconvex optimization problem, characterized by the presence of multiple local minima and highly intricate error surfaces. Handwritten digit classification constitutes a benchmark problem of substantial practical relevance in optical character recognition (OCR) applications, including forms, postal codes, bank checks, vehicle license plates, and other environments involving the processing of handwritten data, and it has served for decades as a standard testbed for the validation of pattern recognition theories and the assessment of machine learning algorithms. In order to facilitate the comparability of results and to promote progress in the field, several benchmark databases have been developed in which handwritten digit samples are uniformly preprocessed through segmentation and normalization, thereby providing a common framework in which researchers may objectively compare the performance of their methods [35]. Consequently, MNIST is systematically employed as a testbed for quantifying and comparing the effectiveness of different optimization algorithms and classification architectures in handwritten digit recognition tasks.

6.1. Architecture of the Implemented Neural Network

A convolutional neural network (CNN) was implemented (see Figure 12) whose input is a single-channel grayscale image of size

28 \times 28 \times 1

, processed by two convolutional blocks: the first applies 32 kernels of size

3 \times 3

with ReLU activation followed by max-pooling of

2 \times 2

(yielding feature maps of size

13 \times 13 \times 32

), and the second uses 64 kernels of the same size with ReLU activation and identical pooling (yielding feature maps of size

5 \times 5 \times 64

). The resulting tensor is flattened into a 1600-dimensional vector, passed through a fully connected layer of 128 neurons with ReLU activation, and finally projected to 10 logit outputs corresponding to the digit classes 0–9. This architecture balances complexity and efficiency by capturing spatial patterns and reducing dimensionality in the convolutional stages, while the dense layers perform the final classification.
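The stated feature-map sizes can be reproduced with simple shape arithmetic. The sketch assumes stride 1 and no padding in the convolutions and floor division in the pooling, which is the combination that yields the 13 × 13 and 5 × 5 maps quoted above. The parameter count further assumes that every layer carries a bias term; the text does not confirm this.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    # Output side length of a square convolution
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2):
    # Non-overlapping max-pooling with floor division
    return size // window


def conv_params(c_in, c_out, kernel=3):
    return c_out * (kernel * kernel * c_in) + c_out    # weights + biases


def dense_params(n_in, n_out):
    return n_in * n_out + n_out                        # weights + biases


s1 = pool_out(conv_out(28))     # side length after the first block
s2 = pool_out(conv_out(s1))     # side length after the second block
flat = s2 * s2 * 64             # flattened feature vector
total = (conv_params(1, 32) + conv_params(32, 64)
         + dense_params(flat, 128) + dense_params(128, 10))
print(s1, s2, flat, total)
```

Under these assumptions the dimension of θ is on the order of 2 × 10^5, which makes a dense inverse-Hessian approximation impractical and motivates the limited-memory variants.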

The training of the neural network is formulated as a large-scale unconstrained optimization problem, whose objective is to determine the set of parameters

θ

(network weights and biases) that minimizes a regularized objective function. In particular, we consider the problem

\min_{θ} J (θ) = L (θ) + λ {∥ θ ∥}_{2}^{2},

where

L (θ)

denotes the multiclass cross-entropy loss function [36,37] and

λ = 10^{- 4}

is the

L_{2}

regularization parameter. The cross-entropy loss is defined in terms of the direct network outputs (logits) as

L (θ) = - \frac{1}{N} \sum_{i = 1}^{N} \log (\frac{e^{z_{i, y_{i}}}}{\sum_{j = 1}^{10} e^{z_{i, j}}}),

where N denotes the number of samples (or the minibatch size),

z_{i, j}

represents the logit associated with sample i and class j, and

y_{i} \in \{0, \dots, 9\}

is the corresponding true label.
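The loss and the regularized objective can be evaluated from raw logits as sketched below. The log-sum-exp shift is a standard numerical-stability device added here; it does not change the value of the formula. The function names are illustrative, and in practice the authors' PyTorch code would compute this on tensors.

```python
import math


def cross_entropy(logits, labels):
    """Mean of -log softmax(z_i)[y_i] over the samples, computed from logits."""
    total = 0.0
    for z, y in zip(logits, labels):
        m = max(z)                                            # shift for stability
        lse = m + math.log(sum(math.exp(v - m) for v in z))   # log-sum-exp
        total += lse - z[y]
    return total / len(labels)


def objective(logits, labels, theta, lam=1e-4):
    # J(theta) = L(theta) + lambda * ||theta||_2^2
    return cross_entropy(logits, labels) + lam * sum(t * t for t in theta)


uniform = [[0.0] * 10]
print(cross_entropy(uniform, [3]), math.log(10))
```

With uniform logits over the ten classes, the loss equals ln 10, the usual starting value for an untrained classifier.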

Training the convolutional architecture shown in Figure 12 leads to a very high-dimensional optimization problem, in which storing the full Hessian matrix is computationally infeasible. Therefore, it is essential to employ optimization methods that preserve curvature information without incurring excessive memory consumption. In Section 6.2, we introduce three new memory-constrained hybrid optimization algorithms that combine high-order corrections with L-BFGS updates to efficiently train the neural network on the MNIST dataset.

6.2. Proposed Limited-Memory Hybrid Optimizers for CNN Training

In this section, we evaluate the performance of the

M_{1}

-LBFGS,

M_{2}

-LBFGS, and

M_{3}

-LBFGS methods (see Algorithm 2) on a more complex problem, specifically in the training of the convolutional neural network shown in Figure 12, designed for automatic digit classification. These new hybrid algorithms are generally more accurate and, in most cases, more stable during training than the classical SGD and L-BFGS methods. To solve the optimization problem arising from the training of this network, we employ the PyTorch library [38], version 2.8.0, which is widely recognized in the field of deep learning. PyTorch provides native GPU support, which is essential for minibatch training in deep architectures, and offers a flexible API that enables the efficient implementation of hybrid optimizers such as those proposed in this work. All experiments were performed on a 13-inch MacBook Air equipped with an Apple M4 chip and 16 GB of memory, running macOS Sequoia 15.7.4. The code was implemented and executed using Python 3.12.3. Our implementations use a fixed learning rate

α

, momentum values

μ = 0.95

and

μ = 0.99

(see Table 12 and Table 13), and a limited-memory L-BFGS buffer of size

m = 5, 10, 15,

and 20, allowing us to assess the impact of the learning rate, momentum, and memory size on the efficiency of the hybrid methods.

6.3. Efficiency of the $M_{1}$ -LBFGS, $M_{2}$ -LBFGS and $M_{3}$ -LBFGS Methods for MNIST

One of the key parameters affecting the efficiency of limited-memory optimization methods for minimizing the cross-entropy loss in neural networks is the learning rate (

lr

).

Table 12 shows how this parameter, together with the memory size, influences the accuracy of the

M_{1}

-LBFGS,

M_{2}

-LBFGS,

M_{3}

-LBFGS, and L-BFGS optimization methods. The reported accuracy metrics correspond to the test dataset and are computed after each training epoch. In addition, the average time reported in the table represents the total time per epoch, including both the training and evaluation phases. The results indicate that, for

lr = 0.10

, the proposed hybrid variants consistently achieve accuracy levels above

99 %

, outperforming the classical L-BFGS method for all memory sizes.

In Table 13, we report the comparison between the proposed hybrid schemes and the classical SGD and L-BFGS methods in the training of the neural network on the MNIST dataset, considering the learning rates

lr = 0.01

,

lr = 0.025

, and

lr = 0.08

. In all limited-memory algorithms, a memory size

m = 5

was used, while in the proposed hybrid methods, an optimization momentum parameter

μ = 0.99

was incorporated.

The results correspond to the test set and consistently show that the hybrid methods achieve higher levels of final accuracy than the classical approaches across all learning-rate regimes analyzed. In particular, for

lr = 0.08

, the hybrid schemes clearly outperform L-BFGS (0.9902) and SGD (0.9920), indicating superior generalization capability under more demanding training configurations. This trend persists for

lr = 0.025

, where the hybrid methods retain a significant advantage over L-BFGS and SGD in terms of the achieved accuracy. Even in the conservative scenario

lr = 0.01

, the results show that the hybrid algorithms perform competitively with or better than the classical methods. Although the computational cost per epoch is moderately higher in the hybrid schemes, the increase is proportional and fully justified by the systematic improvement in accuracy observed. As shown in Table 13, when comparing the performance of the

M_{1}

-LBFGS,

M_{2}

-LBFGS, and

M_{3}

-LBFGS algorithms under the learning rates

lr

= 0.10 and

lr = 0.15

, it can be observed that, for the lower learning rate, the three hybrid algorithms achieve the highest accuracy levels among them and outperform the classical L-BFGS and SGD methods. Figure 13 shows the evolution of the MNIST test accuracy for the implemented neural network under two learning rate values,

lr = 0.10

and

lr = 0.15

. When comparing the optimizers, the hybrid variants

M_{1}

-LBFGS,

M_{2}

-LBFGS, and

M_{3}

-LBFGS exhibit a more stable and consistent behavior across epochs, achieving final accuracies that are slightly higher or comparable to those obtained by the classical L-BFGS and SGD methods.

Although this paper proposes three new limited-memory hybrid algorithms, the $M_{3}$-LBFGS method may become saturated when learning rates greater than $lr = 0.15$ are used. For this reason, the $M_{1}$-LBFGS and $M_{2}$-LBFGS algorithms are recommended in such cases. The curves in Figure 14 show the accuracy achieved by the $M_{1}$-LBFGS, $M_{2}$-LBFGS, L-BFGS, and SGD optimizers during the training and testing phases of the neural network for the learning rate $lr = 0.20$. In the hybrid variants, a memory size of $m = 5$ was used. The $M_{1}$-LBFGS and $M_{2}$-LBFGS methods exhibit greater stability and, in general, a final accuracy slightly higher than that of the classical L-BFGS and SGD methods. Furthermore, the graphs show that the hybrid methods behave very similarly on the training and test sets, while L-BFGS and SGD perform less consistently across the two sets.

Figure 15 illustrates the evolution of the batch loss during the training of the neural network using the hybrid optimizers $M_{1}$-LBFGS, $M_{2}$-LBFGS, and $M_{3}$-LBFGS, with memory size $m = 5$ and $μ = 0.95$. Figure 15a–c correspond to a learning rate of $0.05$, whereas Figure 15d–f present the results for a learning rate of $0.1$. For the lower learning rate ($lr = 0.05$), the batch loss decreases gradually with noticeable fluctuations before reaching a stable regime, with $M_{2}$-LBFGS exhibiting higher variability. By comparison, for the higher learning rate ($lr = 0.1$), all three hybrid methods display a faster and more uniform reduction of the loss, with significantly reduced noise after the initial training stages, indicating improved stability and faster convergence. Moreover, $M_{1}$-LBFGS and $M_{3}$-LBFGS tend to reach slightly lower stabilized loss values than $M_{2}$-LBFGS. The accuracy levels above $99\%$ achieved by the $M_{1}$-LBFGS, $M_{2}$-LBFGS, and $M_{3}$-LBFGS variants for $lr = 0.1$, as reported in Table 12 and Table 13, together with the low and stable batch loss values observed during training in Figure 15, indicate that these hybrid methods exhibit highly competitive performance in this high-dimensional optimization scenario.

In general, the numerical results show that the three proposed limited-memory hybrid variants are promising strategies for MNIST classification, achieving high accuracy and demonstrating stable and reliable neural network training.

6.4. Additional MNIST Experiments with Adam and AdamW

Finally, we compared the proposed limited-memory HOQN variants with the widely used adaptive first-order optimizers Adam [39,40] and AdamW [41,42]. The variants $M_{1}$-LBFGS and $M_{2}$-LBFGS were selected because they showed the most stable behavior in preliminary experiments, particularly with respect to changes in the learning rate. For both methods, we used memory size $m = 5$, momentum parameter $μ = 0.95$, $L_{2}$ regularization parameter $10^{-8}$, learning rate $0.1$, and batch size 64. Adam and AdamW were trained with learning rate $0.001$ under the same CNN architecture, batch size, number of epochs, and random seeds. The corresponding results are reported in Table 14.

The results show that the proposed limited-memory HOQN variants remain competitive with Adam and AdamW. In particular, $M_{1}$-LBFGS attains the highest mean test accuracy (0.9921), while $M_{2}$-LBFGS achieves comparable performance (0.9917) with low variability across seeds. Although AdamW obtains the lowest mean training loss (0.0088), the test accuracies of all methods remain close (0.9913 to 0.9921), indicating that the proposed L-HOQN variants provide stable and competitive classification performance under the same training protocol.
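As a reading aid for the "mean ± deviation" entries of Table 14, the short sketch below shows one way such a summary over seeds can be computed. The accuracy values are hypothetical placeholders rather than the per-seed results of this study, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the estimator is not stated here.

```python
import statistics


def summarize(values, digits=4):
    """Format per-seed values as 'mean ± sample standard deviation'."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample (n - 1) estimator, assumed
    return f"{mean:.{digits}f} \u00b1 {std:.{digits}f}"


# Hypothetical final test accuracies for three random seeds.
seed_accuracies = [0.9900, 0.9910, 0.9920]
summary = summarize(seed_accuracies)
print(summary)
```

With only three seeds, the deviation is a coarse indicator of variability, so differences between methods that are smaller than the reported deviations should be read with caution.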

7. Conclusions

This work introduced the higher-order quasi-Newton (HOQN) framework for unconstrained optimization, combining Newton-type predictors, higher-order correction terms derived from vector extensions of the Chun, Ostrowski, and Traub methods, and quasi-Newton updates of the inverse Hessian approximation. The resulting formulation provides a flexible hybrid structure that allows one-update and two-update variants while preserving the dense full-memory quasi-Newton cost of order $O(n^{2})$ per iteration.

From the theoretical point of view, local cubic convergence was established for the one-update variants $M_{1}$-DFP, $M_{2}$-DFP, and $M_{3}$-DFP under standard smoothness, positive-definiteness, and first-order inverse-Hessian consistency assumptions. The analysis also shows that the BFGS–DFP two-update strategy preserves the cubic local regime under suitable compatibility conditions. These results show that the higher-order correction cancels the dominant quadratic residual of standard quasi-Newton iterations, shifting the leading error term to cubic order.
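The shift from quadratic to cubic order can be observed numerically on a scalar toy problem. The sketch below compares Newton's method with the scalar Traub scheme (a Newton predictor followed by a corrector that reuses the same derivative) on the hypothetical test equation $x^{2} - 2 = 0$, and estimates the order with the usual approximated computational order of convergence. It illustrates the predictor–corrector mechanism only and is not the vector HOQN iteration; the function names and the test equation are assumptions made for illustration.

```python
import math


def newton(f, df, x, steps):
    """Plain Newton iteration; returns the list of iterates."""
    xs = [x]
    for _ in range(steps):
        x = x - f(x) / df(x)
        xs.append(x)
    return xs


def traub(f, df, x, steps):
    """Scalar Traub scheme: Newton predictor plus frozen-derivative corrector."""
    xs = [x]
    for _ in range(steps):
        d = df(x)
        y = x - f(x) / d   # Newton predictor
        x = y - f(y) / d   # corrector reusing the derivative at x
        xs.append(x)
    return xs


def acoc(xs):
    """Approximated computational order from the last four iterates."""
    a, b, c, d = xs[-4:]
    d1, d2, d3 = abs(b - a), abs(c - b), abs(d - c)
    return math.log(d3 / d2) / math.log(d2 / d1)


def f(x):
    return x * x - 2.0


def df(x):
    return 2.0 * x


# The number of steps is kept small so that the last differences stay
# above double-precision rounding error.
newton_order = acoc(newton(f, df, 1.5, 4))
traub_order = acoc(traub(f, df, 1.5, 3))
print(f"Newton: {newton_order:.2f}, Traub: {traub_order:.2f}")
```

The estimate is only meaningful while the differences between consecutive iterates remain well above machine precision, which is why very few iterations are used here.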

The numerical experiments confirm the practical relevance of the proposed hybrid schemes. On the Himmelblau and Freudenstein–Roth functions, the HOQN variants generally require fewer outer iterations than classical BFGS, DFP, and SR1 methods while maintaining competitive computational times. The Booth function, due to its strictly convex quadratic structure, is mainly interpreted as a consistency test, where all methods exhibit very fast convergence. Additional high-dimensional quadratic experiments further show that the predictor–corrector structure remains effective beyond two-dimensional problems, particularly in ill-conditioned settings, while preserving the expected dense $O(n^{2})$ scaling.

The Dolan–Moré performance profiles and dynamical-plane analyses provide additional evidence of robustness and stability. Over heterogeneous smooth, noisy, and nonsmooth benchmark instances, the proposed methods are competitive with classical quasi-Newton schemes; although BFGS remains a strong baseline when total computational work is considered, BM₂D and BM₃D achieve low failure rates and robust performance. The dynamical planes also show that the two-update variants BM₁D, BM₂D, and BM₃D exhibit stable basin structures, whereas some one-update variants are more sensitive to the initial approximation.

Limited-memory variants of the HOQN framework were also developed for large-scale optimization. Although these variants may lose curvature information in low-dimensional test functions, their performance in convolutional neural network training on MNIST is promising. The proposed $M_{1}$-LBFGS, $M_{2}$-LBFGS, and $M_{3}$-LBFGS methods achieve test accuracies above $99\%$ for suitable learning rates and show stable behavior across training and test sets. Comparisons with Adam and AdamW over several random seeds indicate that the limited-memory hybrid methods are competitive in classification accuracy, with small variability across initializations.

Overall, the results suggest that the HOQN framework is a promising strategy for combining the fast local behavior of higher-order methods with the computational practicality of quasi-Newton updates. Future work should address global convergence theory for nonconvex problems, trust-region and adaptive line-search globalization strategies, stochastic and mini-batch limited-memory variants, deeper neural architectures, and more challenging datasets. Another relevant direction is the construction of new hybrid quasi-Newton algorithms based on other optimal vector methods for nonlinear systems, following the ideas introduced by Singh et al. in [13], and their adaptation to unconstrained optimization.

Author Contributions

Conceptualization, A.C. and N.U.C.; methodology, N.U.C. and J.R.T.; software, N.U.C. and J.G.M.; validation, J.G.M. and J.R.T.; formal analysis, N.U.C.; investigation, A.C., N.U.C., J.R.T. and J.G.M.; visualization, N.U.C.; writing—original draft preparation, N.U.C.; writing—review and editing, A.C., J.R.T., J.G.M. and N.U.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fondo Nacional de Innovación y Desarrollo Científico y Tecnológico (FONDOCYT) of the Ministerio de Educación Superior, Ciencia y Tecnología de la República Dominicana (MESCyT), under grant number FONDOCYT 2023-1-1D2-0537.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their comments and suggestions. Moreover, the authors gratefully acknowledge the institutional support of INTEC and UPV, which made the development of this research possible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Fletcher, R. A new approach to variable metric algorithms. Comput. J. 1970, 13, 317–322.
2. Broyden, C.G. The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA J. Appl. Math. 1970, 6, 76–90.
3. Byrd, R.H.; Khalfan, H.F.; Schnabel, R.B. Analysis of a symmetric rank-one trust region method. SIAM J. Optim. 1996, 6, 1025–1039.
4. Davidon, W.C. Variable Metric Method for Minimization; ANL-5990; Argonne National Laboratory: Lemont, IL, USA, 1959.
5. Dennis, J.E., Jr.; Moré, J.J. Quasi-Newton methods, motivation and theory. SIAM Rev. 1977, 19, 46–89.
6. Dunn, J.C.; Bertsekas, D.P. Efficient dynamic programming implementations of Newton’s method for unconstrained optimal control problems. J. Optim. Theory Appl. 1989, 63, 23–38.
7. Singh, A.; Sharma, A.; Rajput, S.; Bose, A.; Hu, X. An investigation on hybrid particle swarm optimization algorithms for parameter optimization of PV cells. Electronics 2022, 11, 909.
8. Farnad, B.; Jafarian, A.; Baleanu, D. A new hybrid algorithm for continuous optimization problem. Appl. Math. Model. 2018, 55, 652–673.
9. Wang, G.; Guo, L. A novel hybrid bat algorithm with harmony search for global numerical optimization. J. Appl. Math. 2013, 2013, 696491.
10. Garg, H. A hybrid PSO-GA algorithm for constrained optimization problems. Appl. Math. Comput. 2016, 274, 292–305.
11. Arroyo, V.; Cordero, A.; Torregrosa, J.R. Approximation of artificial satellites’ preliminary orbits: The efficiency challenge. Math. Comput. Model. 2011, 54, 1802–1807.
12. Andrei, N. An unconstrained optimization test functions collection. Adv. Model. Optim. 2008, 10, 147–161.
13. Singh, H.; Sharma, J.R.; Kumar, S. A simple yet efficient two-step fifth-order weighted-Newton method for nonlinear models. Numer. Algorithms 2023, 93, 203–225.
14. Bubeck, S. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn. 2015, 8, 231–357.
15. Hager, W.W. Lipschitz continuity for constrained processes. SIAM J. Control Optim. 1979, 17, 321–338.
16. Pardalos, P.M.; Žilinskas, A.; Žilinskas, J. Non-Convex Multi-Objective Optimization; Springer: Berlin/Heidelberg, Germany, 2017.
17. Dai, Y.H.; Yuan, Y. A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim. 1999, 10, 177–182.
18. Dong, X.; Liu, H.; He, Y. A self-adjusting conjugate gradient method with sufficient descent condition and conjugacy condition. J. Optim. Theory Appl. 2015, 165, 225–241.
19. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer: New York, NY, USA, 2006.
20. Dennis, J.E.; Moré, J.J. A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 1974, 28, 549–560.
21. Chun, C. Construction of Newton-like iteration methods for solving nonlinear equations. Numer. Math. 2006, 104, 297–315.
22. Ostrowski, A.M. Solution of Equations and Systems of Equations: Pure and Applied Mathematics: A Series of Monographs and Textbooks; Elsevier: Amsterdam, The Netherlands, 2016; Volume 9.
23. Cordero, A.; Rojas-Hiciano, R.V.; Torregrosa, J.R.; Vassileva, M.P. A highly efficient class of optimal fourth-order methods for solving nonlinear systems. Numer. Algorithms 2024, 95, 1879–1904.
24. Traub, J.F. Iterative Methods for the Solution of Equations; American Mathematical Soc.: Providence, RI, USA, 1982; Volume 312.
25. Cordero, A.; Torregrosa, J.R. Variants of Newton’s method using fifth-order quadrature formulas. Appl. Math. Comput. 2007, 190, 686–698.
26. Dolan, E.D.; Moré, J.J. Benchmarking Optimization Software with Performance Profiles. Math. Program. 2002, 91, 201–213.
27. Moré, J.J.; Garbow, B.S.; Hillstrom, K.E. Testing Unconstrained Optimization Software. ACM Trans. Math. Softw. 1981, 7, 17–41.
28. Bongartz, I.; Conn, A.R.; Gould, N.I.M.; Toint, P.L. CUTE: Constrained and Unconstrained Testing Environment. ACM Trans. Math. Softw. 1995, 21, 123–160.
29. Gould, N.I.M.; Orban, D.; Toint, P.L. CUTEr and SifDec: A Constrained and Unconstrained Testing Environment, Revisited. ACM Trans. Math. Softw. 2003, 29, 373–394.
30. Goh, B.S.; McDonald, D. Newton methods to solve a system of nonlinear algebraic equations. J. Optim. Theory Appl. 2015, 164, 261–276.
31. Moré, J.J.; Wild, S.M. Benchmarking Derivative-Free Optimization Algorithms. SIAM J. Optim. 2009, 20, 172–191.
32. Chicharro, F.I.; Cordero, A.; Torregrosa, J.R. Drawing dynamical and parameters planes of iterative families and methods. Sci. World J. 2013, 2013, 780153.
33. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989, 45, 503–528.
34. LeCun, Y. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 5 January 2025).
35. Dieterich, J.M.; Hartke, B. Empirical review of standard benchmark functions using evolutionary global optimization. arXiv 2012, arXiv:1207.4318.
36. Kline, D.M.; Berardi, V.L. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Comput. Appl. 2005, 14, 310–318.
37. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31.
38. Paszke, A. PyTorch: An imperative style, high-performance deep learning library. arXiv 2019, arXiv:1912.01703.
39. Barakat, A.; Bianchi, P. Convergence and Dynamical Behavior of the ADAM Algorithm for Nonconvex Stochastic Optimization. SIAM J. Optim. 2021, 31, 244–274.
40. Chen, C.; Shen, L.; Zou, F.; Liu, W. Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration. J. Mach. Learn. Res. 2022, 23, 1–47.
41. Zhuang, Z.; Liu, M.; Cutkosky, A.; Orabona, F. Understanding AdamW through Proximal Methods and Scale-Freeness. arXiv 2022, arXiv:2202.00089.
42. Zhou, P.; Xie, X.; Lin, Z.; Yan, S. Towards Understanding Convergence and Generalization of AdamW. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6486–6493.

Figure 1. Three-dimensional representations of the benchmark functions.

Figure 2. Success-rate heat map for the methods under smooth, noisy and nonsmooth scenarios. Each entry reports the percentage of solved instances in the corresponding class.

Figure 3. Overall failure rates of the methods over the full benchmark set of 150 smooth, noisy and nonsmooth instances.

Figure 4. Dolan–Moré performance profiles over the full benchmark set, using computational work and number of outer iterations as performance measures.

Figure 5. Dynamical planes of the BFGS, DFP, and SR1 methods for the Himmelblau function.

Figure 6. Dynamical planes of one-update hybrid methods for the Himmelblau function.

Figure 7. Dynamical planes of two-update hybrid methods for the Himmelblau function.

Figure 8. Dynamical planes of the BFGS, DFP, and SR1 methods for the Freudenstein–Roth function.

Figure 9. Dynamical planes of one-update hybrid methods for the Freudenstein–Roth function.

Figure 10. Dynamical planes of two-update hybrid methods for the Freudenstein–Roth function.

Figure 11. Dynamical plane of the $BM_{1}D$ method for the Booth function.

Figure 12. Neural network architecture.

Figure 13. MNIST test accuracy per epoch across learning rates.

Figure 14. Training and test accuracy on MNIST for $lr = 0.20$.

Figure 15. MNIST loss for $M_{1}$-LBFGS, $M_{2}$-LBFGS, and $M_{3}$-LBFGS at two learning rates.

Table 1. Numerical tests for the Himmelblau function with two different initial estimates.

Himmelblau Function $(f_{H})$ , $ξ = {(- 3.7793, - 3.2832)}^{T}$
$x^{(0)} = {(- 2.2920, - 2.6501)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
BFGS	5	1.8	$3.8318 \times 10^{- 7}$	0.0021
DFP	5	1.8	$3.8871 \times 10^{- 7}$	0.0013
SR1	5	1.8	$3.8989 \times 10^{- 7}$	0.0024
$M_{1} DFP$	10	2.9	$2.0383 \times 10^{- 12}$	0.0047
$M_{2} DFP$	4	2.3	$1.43 \times 10^{- 14}$	0.0038
$M_{3} DFP$	4	2.5	$7.9947 \times 10^{- 8}$	0.0019
$B M_{1} D$	3	2.8	$5.7620 \times 10^{- 11}$	0.0013
$B M_{2} D$	3	2.3	$4.20 \times 10^{- 11}$	0.0058
$B M_{3} D$	4	2.5	$5.5011 \times 10^{- 9}$	0.0040
$x^{(0)} = {(- 1.956, - 2.667)}^{T}$
BFGS	6	1.6	$1.0194 \times 10^{- 8}$	0.0044
DFP	6	1.6	$1.0500 \times 10^{- 8}$	0.0040
SR1	6	1.6	$1.0491 \times 10^{- 8}$	0.0015
$M_{1} DFP$	12	3.0	$3.9892 \times 10^{- 7}$	0.0029
$M_{2} DFP$	4	2.4	$3.81 \times 10^{- 7}$	0.0015
$M_{3} DFP$	4	2.7	$5.9150 \times 10^{- 7}$	0.0039
$B M_{1} D$	3	3.2	$3.8079 \times 10^{- 8}$	0.0058
$B M_{2} D$	3	2.8	$6.24 \times 10^{- 7}$	0.0050
$B M_{3} D$	4	2.4	$9.3926 \times 10^{- 8}$	0.0012

Table 2. Numerical tests for the Freudenstein–Roth function with two different initial estimates.

Freudenstein–Roth Function $(f_{FR})$ , $ξ = {(5, 4)}^{T}$
$x^{(0)} = {(3.5081, 4.0087)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
BFGS	4	1.3	$2.2231 \times 10^{- 7}$	0.0068
DFP	4	1.3	$2.2179 \times 10^{- 7}$	0.0066
SR1	4	1.3	$2.2171 \times 10^{- 7}$	0.0044
$M_{1} DFP$	5	3.0	$6.3499 \times 10^{- 10}$	0.0043
$M_{2} DFP$	3	2.8	$1.01 \times 10^{- 8}$	0.0017
$M_{3} DFP$	3	2.6	$7.0095 \times 10^{- 7}$	0.0088
$B M_{1} D$	2	–	$8.7848 \times 10^{- 8}$	0.0064
$B M_{2} D$	2	–	$4.00 \times 10^{- 7}$	0.0038
$B M_{3} D$	3	2.6	$7.0595 \times 10^{- 7}$	0.0087
$x^{(0)} = {(4.3, 4.0001)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
BFGS	4	1.2	$2.8710 \times 10^{- 7}$	0.0014
DFP	4	1.2	$2.8639 \times 10^{- 7}$	0.0010
SR1	4	1.2	$2.8612 \times 10^{- 7}$	0.0014
$M_{1} DFP$	4	2.5	$5.8292 \times 10^{- 8}$	0.0093
$M_{2} DFP$	4	3.0	$4.56 \times 10^{- 7}$	0.0064
$M_{3} DFP$	3	3.0	$1.9586 \times 10^{- 7}$	0.0047
$B M_{1} D$	2	–	$2.3814 \times 10^{- 7}$	0.0012
$B M_{2} D$	2	–	$3.40 \times 10^{- 7}$	0.0015
$B M_{3} D$	3	3.0	$1.9628 \times 10^{- 7}$	0.0063

Table 3. Numerical tests for the Booth function $f_{B}$ with two different initial estimates.

Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
Booth Function $f_{B}$ , $ξ = {(1, 3)}^{T}$
$x^{(0)} = {(3.45, 4.08)}^{T}$
BFGS	2	–	$1.4200 \times 10^{- 14}$	0.0046
DFP	2	–	$2.2500 \times 10^{- 14}$	0.0020
SR1	2	–	$1.5888 \times 10^{- 14}$	0.0038
$M_{1} DFP$	2	–	$7.1054 \times 10^{- 15}$	0.0041
$M_{2} DFP$	2	–	$1.5900 \times 10^{- 14}$	0.0038
$M_{3} DFP$	2	–	$1.0049 \times 10^{- 14}$	0.0023
${BM}_{1} D$	2	–	$0.0000 \times 10^{0}$	0.0022
${BM}_{2} D$	2	–	$0.0000 \times 10^{0}$	0.0022
${BM}_{3} D$	2	–	$0.0000 \times 10^{0}$	0.0068
$x^{(0)} = {(- 3, 9)}^{T}$
BFGS	2	–	$3.5500 \times 10^{- 14}$	0.0007
DFP	2	–	$2.9300 \times 10^{- 14}$	0.0007
SR1	2	–	$5.1238 \times 10^{- 14}$	0.0029
$M_{1} DFP$	2	–	$2.5600 \times 10^{- 14}$	0.0031
$M_{2} DFP$	2	–	$1.0000 \times 10^{- 14}$	0.0019
$M_{3} DFP$	2	–	$7.9441 \times 10^{- 14}$	0.0056
${BM}_{1} D$	2	–	$0.0000 \times 10^{0}$	0.0006
${BM}_{2} D$	2	–	$0.0000 \times 10^{0}$	0.0022
${BM}_{3} D$	2	–	$0.0000 \times 10^{0}$	0.0059
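The two-iteration counts in Table 3 are consistent with the quadratic structure of the test problem. As a minimal check, the sketch below assumes the standard definition of the Booth function, $f_{B}(x, y) = (x + 2y - 7)^{2} + (2x + y - 5)^{2}$, whose Hessian is constant, and shows that a single exact Newton step from either initial estimate of Table 3 lands on $ξ = (1, 3)^{T}$. The helper names are illustrative, and the exact Hessian is used here instead of a quasi-Newton approximation.

```python
def booth(x, y):
    """Booth function (standard definition, assumed here)."""
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2


def booth_grad(x, y):
    """Analytic gradient of the Booth function."""
    return (10 * x + 8 * y - 34, 8 * x + 10 * y - 38)


# The Hessian of a quadratic is constant; here it is positive definite
# (determinant 36, trace 20).
HESSIAN = ((10.0, 8.0), (8.0, 10.0))


def newton_step(x, y):
    """One exact Newton step, solving the 2x2 system by Cramer's rule."""
    gx, gy = booth_grad(x, y)
    (a, b), (c, d) = HESSIAN
    det = a * d - b * c
    dx = (d * gx - b * gy) / det
    dy = (a * gy - c * gx) / det
    return x - dx, y - dy


for x0 in [(3.45, 4.08), (-3.0, 9.0)]:
    x1 = newton_step(*x0)
    print(x0, "->", x1, "f =", booth(*x1))
```

Since every Newton-type predictor with an exact Hessian solves a strictly convex quadratic in one step, the remaining differences between methods in Table 3 come from how quickly each quasi-Newton update recovers the constant Hessian.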

Table 4. Scalability test on high-dimensional strictly convex quadratic problems.

$n$	$κ$	Iterations (BFGS)	Iterations ($M_{1} DFP$)	Iterations (${BM}_{1} D$)	Normalized cost $η$ (BFGS)	Normalized cost $η$ ($M_{1} DFP$)	Normalized cost $η$ (${BM}_{1} D$)
100	$10^{2}$	57	58	34	$2.3662 \times 10^{- 8}$	$2.2992 \times 10^{- 8}$	$2.9310 \times 10^{- 8}$
100	$10^{6}$	98	112	58	$1.1032 \times 10^{- 8}$	$8.9478 \times 10^{- 9}$	$1.5510 \times 10^{- 8}$
500	$10^{2}$	82	70	49	$1.0261 \times 10^{- 8}$	$1.0478 \times 10^{- 8}$	$1.6158 \times 10^{- 8}$
500	$10^{6}$	394	434	220	$8.3910 \times 10^{- 9}$	$8.9006 \times 10^{- 9}$	$1.3985 \times 10^{- 8}$
1000	$10^{2}$	87	75	48	$8.2811 \times 10^{- 9}$	$1.0092 \times 10^{- 8}$	$1.3768 \times 10^{- 8}$
1000	$10^{6}$	770	800	408	$8.8353 \times 10^{- 9}$	$1.5070 \times 10^{- 8}$	$2.2218 \times 10^{- 8}$

Table 5. Composition of the 30 base problems used in the Dolan–Moré benchmark.

Collection	Problems	Number
Moré–Garbow–Hillstrom/Wood	Rosenbrock, Freudenstein–Roth, Beale, Powell singular, Wood, Extended Rosenbrock $(n = 10, 20)$ , Extended Powell $(n = 12, 20)$ , Broyden tridiagonal $(n = 10, 20)$ , Brown almost-linear, and variably dimensioned problems $(n = 10, 20)$	14
Classical unconstrained tests	Himmelblau, Booth, Sphere, Zakharov, Styblinski–Tang, Branin–Hoo, Matyas, Dixon–Price, Powell badly scaled, Three-hump camel, Extended Beale, and Extended Himmelblau	12
CUTEr/CUTEst-type analytic representatives	Trigonometric, Penalty I, ARWHEAD, and GENHUMPS	4

Table 6. Dolan–Moré results for the 30 smooth test instances.

Performance measured by computational work $(N_{f} + N_{g})$
Method	Solved	Solved (%)	Best at $τ = 1$	$ρ_{s} (2)$	$ρ_{s} (5)$	Median cost
DFP	29	96.667	16	0.8333	0.9667	159.0
BFGS	30	100.000	13	0.9667	1.0000	163.0
SR1	28	93.333	18	0.9333	0.9333	140.0
BM₁D	30	100.000	5	0.8667	1.0000	169.5
BM₂D	30	100.000	4	0.9333	1.0000	182.5
BM₃D	30	100.000	5	0.6667	1.0000	245.0
Performance measured by number of outer iterations
Method	Solved	Solved (%)	Best at $τ = 1$	$ρ_{s} (2)$	$ρ_{s} (5)$	Median cost
DFP	29	96.667	4	0.7000	0.9667	7.0
BFGS	30	100.000	4	0.9333	1.0000	8.5
SR1	28	93.333	5	0.8000	0.9333	6.5
BM₁D	30	100.000	21	1.0000	1.0000	4.5
BM₂D	30	100.000	23	1.0000	1.0000	5.0
BM₃D	30	100.000	13	1.0000	1.0000	5.5
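For readers less familiar with the columns $ρ_{s}(2)$ and $ρ_{s}(5)$, the sketch below computes a Dolan–Moré performance profile [26]: for each problem, the cost of every solver is divided by the best cost on that problem, and $ρ_{s}(τ)$ is the fraction of problems on which solver $s$ is within a factor $τ$ of the best. The cost data are hypothetical placeholders (not the benchmark data of this study), failures are encoded as infinity, and the function name is an assumption.

```python
import math


def performance_profile(costs, tau):
    """Return {solver: rho_s(tau)} for a dict of per-problem cost lists.

    `costs[solver][p]` is the cost on problem p, or math.inf on failure.
    """
    solvers = list(costs)
    n_problems = len(costs[solvers[0]])
    # Best (smallest) cost achieved by any solver on each problem.
    best = [min(costs[s][p] for s in solvers) for p in range(n_problems)]
    profile = {}
    for s in solvers:
        within = sum(
            1
            for p in range(n_problems)
            if math.isfinite(costs[s][p]) and costs[s][p] <= tau * best[p]
        )
        profile[s] = within / n_problems
    return profile


# Hypothetical costs of two solvers on four problems.
example_costs = {
    "solver_A": [10.0, 20.0, math.inf, 8.0],
    "solver_B": [12.0, 10.0, 30.0, 16.0],
}

for tau in (1, 2, 5):
    print(tau, performance_profile(example_costs, tau))
```

In this reading, $ρ_{s}(1)$ multiplied by the number of problems counts the instances on which a solver attains the best cost, with ties credited to every tied solver, and the limit of $ρ_{s}(τ)$ for large $τ$ equals the fraction of solved instances.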

Table 7. Failure rates by benchmark class and overall.

Method	Smooth	Noisy	Nonsmooth	Overall
DFP	0.0333	0.1833	0.0333	0.0933
BFGS	0.0000	0.1167	0.0000	0.0467
SR1	0.0667	0.1500	0.0333	0.0867
BM₁D	0.0000	0.1500	0.0000	0.0600
BM₂D	0.0000	0.1333	0.0000	0.0533
BM₃D	0.0000	0.1333	0.0000	0.0533
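The overall column of Table 7 can be read as an instance-weighted average of the per-class failure rates. The sketch below checks this reading using the 150 instances of Figure 3 and the 30 smooth instances of Table 6; the split of the remaining 120 instances into 60 noisy and 60 nonsmooth ones is an assumption, chosen because it reproduces the reported values to four decimals.

```python
# Class sizes: 30 smooth instances (Table 6); the 60/60 split of the
# remaining instances is an assumption, not stated in the tables.
CLASS_SIZES = {"smooth": 30, "noisy": 60, "nonsmooth": 60}

# Failure rates copied from Table 7: (smooth, noisy, nonsmooth, overall).
TABLE_7 = {
    "DFP": (0.0333, 0.1833, 0.0333, 0.0933),
    "BFGS": (0.0000, 0.1167, 0.0000, 0.0467),
    "SR1": (0.0667, 0.1500, 0.0333, 0.0867),
    "BM1D": (0.0000, 0.1500, 0.0000, 0.0600),
    "BM2D": (0.0000, 0.1333, 0.0000, 0.0533),
    "BM3D": (0.0000, 0.1333, 0.0000, 0.0533),
}


def overall_rate(smooth, noisy, nonsmooth, sizes=CLASS_SIZES):
    """Instance-weighted average of the three per-class failure rates."""
    total = sum(sizes.values())
    weighted = (
        smooth * sizes["smooth"]
        + noisy * sizes["noisy"]
        + nonsmooth * sizes["nonsmooth"]
    )
    return weighted / total


for method, (sm, no, ns, reported) in TABLE_7.items():
    print(f"{method}: computed {overall_rate(sm, no, ns):.4f}, reported {reported:.4f}")
```

Under this weighting, the noisy class dominates the overall failure rate of every method, since it is the only class in which the BFGS and two-update hybrid variants fail at all.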

Table 8. Cubic error verification for $M_{1} DFP$ and ${BM}_{1} D$.

Method	$κ$	Samples	$ρ_{loc}$	$K^{emp}$	$K^{th}$	Q	$δ_{N}$	$δ_{B}$
$M_{1} DFP$	$10^{2}$	96	3.0107	$2.5538 \times 10^{- 6}$	$2.5327 \times 10^{- 6}$	$3.1318 \times 10^{- 5}$	$2.8409 \times 10^{- 2}$	–
$M_{1} DFP$	$10^{4}$	96	3.9990	$8.9961 \times 10^{- 8}$	$2.2590 \times 10^{- 7}$	$1.3963 \times 10^{- 5}$	$2.0813 \times 10^{- 2}$	–
$M_{1} DFP$	$10^{6}$	96	4.0005	$1.0508 \times 10^{- 8}$	$1.9845 \times 10^{- 7}$	$8.9582 \times 10^{- 6}$	$1.5872 \times 10^{- 2}$	–
$M_{1} DFP$	$10^{8}$	96	3.9252	$2.5802 \times 10^{- 9}$	$8.0759 \times 10^{- 8}$	$3.7921 \times 10^{- 6}$	$8.1706 \times 10^{- 3}$	–
${BM}_{1} D$	$10^{2}$	96	3.0007	$2.4177 \times 10^{- 5}$	$2.4889 \times 10^{- 5}$	$3.0458 \times 10^{- 5}$	$2.8409 \times 10^{- 2}$	$2.7995 \times 10^{- 2}$
${BM}_{1} D$	$10^{4}$	96	3.2511	$2.4725 \times 10^{- 7}$	$3.5745 \times 10^{- 7}$	$1.3954 \times 10^{- 5}$	$2.0813 \times 10^{- 2}$	$2.0811 \times 10^{- 2}$
${BM}_{1} D$	$10^{6}$	96	3.7883	$1.2633 \times 10^{- 8}$	$2.0103 \times 10^{- 7}$	$8.9578 \times 10^{- 6}$	$1.5872 \times 10^{- 2}$	$1.5872 \times 10^{- 2}$
${BM}_{1} D$	$10^{8}$	96	3.8858	$2.5274 \times 10^{- 9}$	$8.0762 \times 10^{- 8}$	$3.7922 \times 10^{- 6}$	$8.1706 \times 10^{- 3}$	$8.1706 \times 10^{- 3}$

Table 9. Limited-memory results for the Himmelblau function with $m = 10$.

Himmelblau Function $(f_{H})$
$x^{(0)} = {(4.9, 1.96)}^{T}$ , $ξ = {(- 2.81, 3.13)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
$M_{1}$ -LBFGS	170	1.0034	$8.7304 \times 10^{- 13}$	0.0161
$M_{2}$ -LBFGS	530	1.0053	$2.2616 \times 10^{- 9}$	0.0292
$M_{3}$ -LBFGS	68	0.9730	$8.4545 \times 10^{- 13}$	0.0143
L-BFGS	187	0.9993	$6.5113 \times 10^{- 13}$	0.0069
$x^{(0)} = {(1.95, 3.19)}^{T}$ , $ξ = {(3.58, - 1.85)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
$M_{1}$ -LBFGS	293	1.0024	$9.0901 \times 10^{- 13}$	0.0153
$M_{2}$ -LBFGS	561	1.0052	$4.4543 \times 10^{- 9}$	0.0350
$M_{3}$ -LBFGS	95	0.9590	$9.8602 \times 10^{- 13}$	0.0412
L-BFGS	106	0.9953	$6.9535 \times 10^{- 13}$	0.0036

Table 10. Limited-memory results for the Freudenstein–Roth function with $m = 10$.

Freudenstein–Roth Function $(f_{FR})$
$x^{(0)} = {(18, 1)}^{T}$ , $ξ = {(5, 4)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
$M_{1}$ -LBFGS	749	0.9336	$9.9704 \times 10^{- 9}$	0.0263
$M_{2}$ -LBFGS	792	0.9975	$2.5072 \times 10^{- 6}$	0.0323
$M_{3}$ -LBFGS	714	0.9929	$9.8866 \times 10^{- 10}$	0.0318
L-BFGS	827	0.9929	$9.9511 \times 10^{- 9}$	0.0230
$x^{(0)} = {(9, 0)}^{T}$ , $ξ = {(11.41, - 0.90)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
$M_{1}$ -LBFGS	1158	0.9912	$3.0208 \times 10^{- 7}$	0.0431
$M_{2}$ -LBFGS	175	1.0327	$2.4232 \times 10^{- 10}$	0.0444
$M_{3}$ -LBFGS	314	0.9789	$9.9611 \times 10^{- 9}$	0.0152
L-BFGS	1158	0.9944	$8.4165 \times 10^{- 10}$	0.0266

Table 11. Limited-memory results for the Booth function with $m = 10$.

Booth Function $(f_{B})$ , $ξ = {(1, 3)}^{T}$
$x^{(0)} = {(6.01, - 1.50)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
$M_{1}$ -LBFGS	5	1.6198	$0.0 \times 10^{0}$	0.0123
$M_{2}$ -LBFGS	16	0.9297	$1.7990 \times 10^{- 13}$	0.0069
$M_{3}$ -LBFGS	5	2.2100	$2.2571 \times 10^{- 9}$	0.0010
L-BFGS	3	0.2094	$0.0 \times 10^{0}$	0.0003
$x^{(0)} = {(2, 8)}^{T}$
Method	Iter	$ρ$	$∥ \nabla f (x^{(k + 1)}) ∥$	CT (s)
$M_{1}$ -LBFGS	24	1.0248	$9.0994 \times 10^{- 14}$	0.0016
$M_{2}$ -LBFGS	7	0.9539	$0.0 \times 10^{0}$	0.0024
$M_{3}$ -LBFGS	5	1.2141	$6.7270 \times 10^{- 9}$	0.0011
L-BFGS	3	0.2304	$0.0 \times 10^{0}$	0.0011

Table 12. MNIST performance versus memory size $m$, momentum $μ = 0.95$, and learning rate $lr$.

$lr$	$m$	$M_{1}$-LBFGS Acc.	$M_{1}$-LBFGS Avg. Time (s)	$M_{2}$-LBFGS Acc.	$M_{2}$-LBFGS Avg. Time (s)	$M_{3}$-LBFGS Acc.	$M_{3}$-LBFGS Avg. Time (s)	L-BFGS Acc.	L-BFGS Avg. Time (s)
0.05	5	0.9880	14.27	0.9894	14.50	0.9900	14.10	0.9915	12.45
	10	0.9877	15.25	0.9880	17.04	0.9909	15.20	0.9918	13.09
	15	0.9894	16.19	0.9876	17.80	0.9901	16.20	0.9854	12.30
	20	0.9885	17.72	0.9891	18.55	0.9903	17.50	0.9899	12.32
0.10	5	0.9928	14.74	0.9921	15.33	0.9930	14.20	0.9904	12.17
	10	0.9931	15.99	0.9923	17.11	0.9935	15.30	0.9911	12.32
	15	0.9932	16.87	0.9931	18.00	0.9928	16.10	0.9916	12.17
	20	0.9934	17.92	0.9932	19.36	0.9931	17.10	0.9929	12.90

Table 13. Comparison of hybrid optimizers with SGD and L-BFGS on MNIST for different learning rates.

Methods	Learning Rate ( $lr$ )	Final Accuracy	Avg. Time (s)
$M_{3}$ -LBFGS	0.01	0.9889	15.20
	0.025	0.9920	14.20
	0.08	0.9940	14.10
$M_{1}$ -LBFGS	0.01	0.9888	14.20
	0.025	0.9922	14.15
	0.08	0.9925	13.95
$M_{2}$ -LBFGS	0.01	0.9890	14.65
	0.025	0.9916	14.62
	0.08	0.9920	14.66
L-BFGS	0.01	0.9854	12.23
	0.025	0.9876	12.27
	0.08	0.9902	12.33
SGD	0.01	0.9857	11.93
	0.025	0.9879	11.99
	0.08	0.9920	12.02

Table 14. MNIST loss and accuracy over three seeds after 10 epochs.

Method	lr	Loss	Accuracy
$M_{1} LBFGS$	0.100	$0.0151 \pm 0.0007$	$0.9921 \pm 0.0006$
$M_{2} LBFGS$	0.100	$0.0208 \pm 0.0008$	$0.9917 \pm 0.0007$
Adam	0.001	$0.0101 \pm 0.0009$	$0.9913 \pm 0.0010$
AdamW	0.001	$0.0088 \pm 0.0006$	$0.9915 \pm 0.0012$
