An Efficient Limited Memory Multi-Step Quasi-Newton Method

Abstract: This paper develops a novel class of quasi-Newton techniques tailored to the computational challenges posed by memory constraints. Such methods are commonly referred to as "limited" memory methods. The method proposed herein is adaptable: a customizable memory parameter governs the retention of historical data in constructing the Hessian estimate matrix at each iteration. The search directions generated through this approach are derived from a modified version closely resembling the full memory multi-step BFGS update, with limited memory computation applied to a single term to approximate a matrix–vector product. Results from numerical experiments, exploring various parameter configurations, substantiate the enhanced efficiency of the proposed algorithm within the class of limited memory quasi-Newton methods.


Introduction
Unconstrained optimization is concerned with the minimization of an objective function: minimize f(x), where f : R^n → R.
The aforementioned problem can be solved by employing a class of methods referred to as quasi-Newton techniques for unconstrained optimization. Only the function and its first derivatives are required for quasi-Newton (QN) methods [1]. The Hessian does not need to be available or even coded. To integrate updates in the function and the corresponding gradient, an approximation matrix to the actual Hessian is retained and updated across the iterations.
Given B_i, the current Hessian approximation, a new Hessian estimate B_{i+1} needs to be constructed for the new solution estimate x_{i+1}. To find B_{i+1}, we can apply a first-order Taylor expansion to the gradient vector about the point x_{i+1} to obtain the following relation, known as the Secant equation [2-6]:

B_{i+1} s_i = y_i, (1)

where s_i = x_{i+1} − x_i and y_i = g_{i+1} − g_i, with g_i ≡ g(x_i) the gradient at iterate x_i. The computed Hessian matrix approximations must satisfy (1). To estimate the next Hessian approximation matrix B_{i+1} using the current matrix B_i and the vectors s_i and y_i, an update of the form B_{i+1} = B_i + C_i is applied, where C_i is a correction matrix. It may be preferable to use H_{i+1} = H_i + D_i, where D_i is an update term and H_{i+1} = B_{i+1}^{-1}. The update can be of rank one or two. A widely used rank-two correction formula is the BFGS formula,

B_{i+1} = B_i − (B_i s_i s_i^T B_i)/(s_i^T B_i s_i) + (y_i y_i^T)/(y_i^T s_i),

with the corresponding inverse update

H_{i+1} = (I − ρ_i s_i y_i^T) H_i (I − ρ_i y_i s_i^T) + ρ_i s_i s_i^T, ρ_i = 1/(y_i^T s_i). (2)

Reported numerical outcomes show that this formula outperforms other updating formulas, particularly when the line search is inexact. BFGS is regarded as a standard update formula [3,5,7,8].
The search direction for the standard quasi-Newton methods is computed using p_i = −H_i g_i.
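For illustration, the standard BFGS inverse update (2) and the quasi-Newton search direction can be sketched in NumPy as follows. This is a minimal sketch; the function names are ours, not from any reference implementation:

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Standard BFGS update of the inverse Hessian approximation H.

    Implements H_{i+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    with rho = 1/(y^T s); requires the curvature condition y^T s > 0.
    """
    rho = 1.0 / (y @ s)
    n = len(s)
    V = np.eye(n) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def search_direction(H, g):
    """Quasi-Newton search direction p_i = -H_i g_i."""
    return -H @ g
```

By construction, the updated matrix satisfies the inverse Secant relation H_{i+1} y_i = s_i and stays symmetric whenever H_i is.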

Literature Review
Limited memory QN methods (L-BFGS) have gained considerable prominence in optimization due to their ability to handle large-scale problems efficiently, especially with memory constraints. These iterative methods are mostly classified as quasi-Newton techniques [9][10][11][12]. The versatility of L-BFGS methods is evident in their application across various domains. In machine learning, they are integral to training support vector machines (SVMs), optimizing logistic regression models, and fine-tuning neural networks. Their memory-efficient design makes them invaluable when working with large datasets. Additionally, L-BFGS plays a pivotal role in structural optimization, a field where finite element analysis involves high-dimensional design spaces. In computational chemistry, L-BFGS aids in optimizing molecular structures, facilitating breakthroughs in drug discovery and materials science.
Nevertheless, L-BFGS distinguishes itself from the classical quasi-Newton methods by storing only a limited number of recent iterations' data, often referred to as memory pairs (s_i, y_i), where s_i represents the change in the optimization variable and y_i denotes the change in the gradient (see (1)). These memory pairs are instrumental in updating the approximate Hessian matrix, achieving a balance between computational efficiency and optimization accuracy.
In the past decade, L-BFGS has seen significant methodological advancements and the emergence of new variants tailored to address specific optimization challenges. For example, L-BFGS-B extends the method to handle bound-constrained optimization problems effectively [10]. Innovations like L-BFGS-TF and L-BFGS++ have been introduced to enhance scalability and convergence properties. These methodological innovations have substantially widened the scope of L-BFGS applications, encompassing areas such as machine learning, deep learning, and numerical simulations.
Limited memory BFGS algorithms are often derived using the update identity for the inverse Hessian given in [10] as

H_{i+1} = V_i^T H_i V_i + ρ_i s_i s_i^T, where V_i = I − ρ_i y_i s_i^T and ρ_i = 1/(y_i^T s_i),

with s_i and y_i as in (1). For a given initial matrix H_0 = I, a sequence of vectors q_{i−m}, …, q_i is defined such that q_j = (I − ρ_j s_j y_j^T) q_{j+1}. The quantities q_j and q_{j+1} are then computed recursively. For more details see [10].
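The recursion above is the basis of Nocedal's two-loop recursion, which computes the product H_i g without ever forming H_i. A minimal sketch, assuming H_0 = γI and memory pairs stored oldest-first (the function name is ours):

```python
import numpy as np
from collections import deque

def lbfgs_direction(g, pairs, gamma=1.0):
    """Two-loop recursion: returns H_i @ g without forming H_i.

    pairs holds the m most recent (s_j, y_j) memory pairs, oldest
    first; gamma scales the initial matrix H_0 = gamma * I.
    """
    q = g.copy()
    rhos = [1.0 / (y @ s) for s, y in pairs]
    alphas = []
    # First loop: walk the pairs from newest to oldest.
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    r = gamma * q  # apply the initial matrix H_0
    # Second loop: walk the pairs from oldest to newest.
    for (s, y), rho, a in zip(pairs, rhos, reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```

With a single stored pair and γ = 1, the result coincides with the explicit product of the dense BFGS inverse update (2) applied to g, which provides a convenient correctness check.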
Next, we present some successful limited memory BFGS formulas. These formulas are all variations of the standard L-BFGS formula, and they are designed to enhance the numerical behavior of the L-BFGS algorithm for certain types of problems. One of the well-known limited memory methods is the Liu-Nocedal formula [9]. The updated Hessian matrix is the previous Hessian matrix updated with a correction term; the correction term is a weighted sum of the two most recent steps and gradient differences, where the weights are chosen to make certain that the updated Hessian matrix is positive-definite. Another successful formula is the Byrd-Nocedal-Schnabel formula (3) [13]. Its correction term is a weighted sum of the latest m steps and gradient differences, where the weights are chosen to ascertain that the newly computed Hessian matrix is positive-definite. A variation of (3) is the Zhu-Byrd-Nocedal formula (4) [14]. Its correction component is likewise a weighted sum of the latest m steps and gradient differences, with the weights chosen to keep the newly computed Hessian matrix positive-definite; additionally, the correction includes a term designed to improve the behavior of the L-BFGS algorithm on problems with non-smooth objective functions [15]. In (3) and (4), the parameter m controls the number of past update terms that are used to update the Hessian matrix. A larger value of m results in a more precise approximation of the Hessian matrix, closer to that of the standard QN methods.
Another idea, presented in [16], focuses on applying regularized Newton methods within a versatile category of unconstrained optimization algorithms. The method proposed there has the potential to combine favorable characteristics from both approaches. The primary emphasis is on the integration of regularization with limited memory quasi-Newton methods, leveraging the unique structure inherent in limited memory algorithms. The paper explores an alternative globalization technique, akin to the less familiar siblings of line search and trust-region methods, known as regularized Newton methods. The derived methods are referred to as regularized quasi-Newton methods. These methods compute the search direction p_i from the relation (B_i + µ_i I) p_i = −g_i, instead of the one used by the standard quasi-Newton methods, where the parameter µ_i serves as the regularization parameter.
The derived methods integrate features from both line search and trust region techniques and update a limited memory version of the Hessian matrix estimate. The method displays good behavior overall, though it involves the solution of a system of linear equations at each iteration, which increases the time complexity of the algorithm. A recent paper also introduces a limited memory quasi-Newton type method for unconstrained multi-objective optimization problems [17], which is suitable for large-scale scenarios. The proposed algorithm approximates the convex combination of the Hessian matrices of the objective functions using a positive definite matrix. The update formula for the approximation matrix is an extension of the one used in the L-BFGS method for scalar optimization [18]. The method is well defined even in the nonconvex case and exhibits global and R-linear convergence to Pareto optimality for twice continuously differentiable strongly convex problems. In the method derived in [17], the matrix used to update the inverse Hessian approximation uses ρ_i = 1/(s_i^T u_i), where u_i is a summation of the previous y vectors dictated by the number of vectors retained in memory. Those y values are multiplied by optimal Lagrange multipliers needed to achieve global convergence.
The performance of the algorithm is evaluated through computational comparisons with state-of-the-art Newton and quasi-Newton approaches in multi-objective optimization. The results show that the proposed approach is generally efficient. In our testing, the method failed to converge on some of our test functions, but it did well on others and managed to find global minimizers.
The paper in [19] presents a new numerical method for solving large-scale unconstrained optimization problems, derived from a modified BFGS-type update. The update formula is extended to the framework of a limited memory scheme. The update utilized in that paper uses ρ_i = 1/(s_i^T u_i) and u_i = y_i + γ_i s_i, for some scalar γ_i chosen to ensure convergence. The paper discusses the global convergence and convergence rate of the algorithm under the weak Wolfe-Powell line search.
A modified q-BFGS algorithm has been proposed for nonlinear unconstrained optimization problems [20]. It uses a simple symmetric positive definite matrix and a new q-quasi-Newton equation to build an approximate q-Hessian. The method preserves global convergence properties without assuming the convexity of the objective function, and numerical results show improvement over the original q-BFGS method. A limited memory quasi-Newton method has also been introduced for large-scale unconstrained multi-objective optimization problems; it approximates the convex combination of the Hessian matrices of the objectives and exhibits global and R-linear convergence to Pareto optimality, with empirical comparisons demonstrating its efficiency and effectiveness [21]. A new L-BFGS method with regularization techniques has been proposed for large-scale unconstrained optimization; it guarantees global convergence and is robust in terms of solving more problems [4]. Finally, the momentum-accelerated quasi-Newton (MoQ) method approximates Nesterov's accelerated gradient as a linear combination of past gradients, and its performance has been evaluated on a function approximation problem.

A New Multi-Step Limited Memory Method
In this section, we devise a new QN limited memory method inspired by the methods in (3) and (4). The method is characterized by its utilization of more of the already computed past data to improve the quality of the generated Hessian (inverse) approximations.
In the basic Secant equation, a straight line is used to generate a new iterate x_{i+1} given the iterate x_i, whereas higher-order interpolants are used in the multi-step approaches [22,23]. The main advantage of using polynomials is that they exploit more of the already computed data in the matrix update rather than simply discarding it. This is expected to result in better Hessian approximations.
Multi-step methods generally update the Hessian estimates to satisfy the form

B_{i+1} u_i = w_i, (5)

for which the classical Secant relation corresponds to the choices u_i = s_i and w_i = y_i. One of the alternatives to u_i and w_i proposed in [24] uses µ_{i−1} = δ²/(1 + 2δ), where δ = ∥s_i∥/∥s_{i−1}∥; with this choice, Equation (5) becomes a two-step relation. Concentrating on the BFGS updating method (2), the inverse Hessian approximation at iteration i may be expressed after manipulation as in (7), where r_i = H_i w_i. Limited memory techniques are developed using the identity (8), which is equivalent to (9) (using (7)). However, the problem in using (7) is that of computing r_i and H_i y_i without actually storing H_i. The simplest option would be to set H_i = I at each iteration, so that r_i = w_i and H_i y_i reduces to y_i. However, this is a numerically poor choice since it abandons the information collected in the previously updated matrices. Another proposal, considered here, is to retain some approximation of the vector r_i and correct it every cycle. This can be performed as follows: start with r_0 = H_0 w_0 = w_0 (since H_0 = I) and, recurrently, for any i, use relation (10). Relation (10) can then be utilized in (8) to compute the next direction vector. Still, the term H_{i−1} w_i in (10) is not accessible. The obvious option is to set r_{i−1} to w_{i−1} in (8), but such a choice is expected to negatively affect the updated matrix. Thus, our goal next is to build an expression for both H_{i−1} w_i and H_i y_i to complete the derivation. By doing so, we will have derived a method that is expected to be computationally superior to simply setting H_i to the identity matrix in (8), because more of the previously accumulated information is utilized in updating the matrix. Therefore, the preference computationally is to approximate the aforementioned matrix-vector products, as shown next.
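The two-step (u_i, w_i) pair can be sketched as follows. Since the exact combination from [24] is elided in the text above, the sketch assumes the common two-step choice u_i = s_i − µ_{i−1} s_{i−1} and w_i = y_i − µ_{i−1} y_{i−1}, using the stated µ_{i−1} = δ²/(1 + 2δ); the function name is ours:

```python
import numpy as np

def multistep_pair(s_i, s_im1, y_i, y_im1):
    """Two-step (u_i, w_i) pair with mu = delta^2 / (1 + 2*delta),
    delta = ||s_i|| / ||s_{i-1}||.

    The linear combination below (u = s_i - mu*s_{i-1},
    w = y_i - mu*y_{i-1}) is an assumed two-step choice; the exact
    formula from [24] is not reproduced in the text.
    """
    delta = np.linalg.norm(s_i) / np.linalg.norm(s_im1)
    mu = delta**2 / (1.0 + 2.0 * delta)
    u = s_i - mu * s_im1
    w = y_i - mu * y_im1
    return u, w, mu
```

Note that when consecutive steps have equal length, δ = 1 and µ = 1/3, so a third of the previous step's information is blended into the current pair.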
To proceed with the derivation, it should be noted that, for a diagonal matrix H_0, the subsequent matrices in the recurrence can be expressed in terms of H_0 and the stored pairs (u_1, w_1), (u_2, w_2), and so on, for vectors u and w as in (5).
Let η refer to the upper limit on the number of correction terms U that can be saved. Because H_0 is diagonal, the limiting count of n-vectors that can be utilized to build the matrix approximation is 2η + 1. We have hit our storage limit once H_η is produced. In this case, newer vectors start replacing older ones. This idea was influenced by Nocedal's algorithm [10]. Nocedal's technique is founded on the idea that the latest data should be reflected in the update, and replacing the oldest information is one compelling option. However, there is no assurance that the computed matrices will be positive-definite. Loss of positive-definiteness leads to search directions that are not necessarily descent directions, thus threatening the method's convergence. To avoid this problem, the standard BFGS updating formula is expressed in the product form given in (2) [2], where ρ_i is a parameter chosen to ascertain that the newly computed Hessian matrix is positive-definite and m is a parameter that controls the number of past updates used to update the Hessian matrix.
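The replacement of the oldest pairs once the storage limit is hit can be realized with a fixed-capacity buffer. A minimal sketch (variable names are ours):

```python
from collections import deque

# Circular memory of at most eta correction pairs: appending beyond
# capacity automatically discards the oldest pair, mirroring the
# replacement strategy borrowed from Nocedal's algorithm.
eta = 3
memory = deque(maxlen=eta)
for i in range(5):  # simulate five iterations producing (u_i, w_i)
    memory.append(("u%d" % i, "w%d" % i))

oldest, newest = memory[0], memory[-1]  # ("u2", "w2") and ("u4", "w4")
```

After five appends with η = 3, only the three newest pairs remain, with the two oldest silently discarded.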
Thus, given the above-proposed strategy, Nocedal suggests a form of the update in which, for i + 1 ≤ η, all stored pairs are used. Returning to our problem, we adopt Nocedal's approach with a suitably chosen small η to build an approximation to the product H_{i−1} w_i in (8); for example, for η = 1, only the most recent pair is used. Similarly, the product H_i y_i may be computed in the same manner. In general, the following is employed to build the above products for a given η and an appropriate replacement of the matrix/vectors involved (Algorithm 1):

Algorithm 1. Limited memory computation of the product z ≈ H q.
1. Set bound = min(i, η) and incr = i − bound.
2. q_bound = w_i.
3. FOR l = bound − 1, …, 0: { j = l + incr; α_l = ρ_j u_j^T q_{l+1}; q_l = q_{l+1} − α_l w_j }
4. z_0 = H_0 q_0.
5. FOR l = 0, 1, …, bound − 1: { j = l + incr; β = ρ_j w_j^T z_l; z_{l+1} = z_l + (α_l − β) u_j }

Our multi-step limited memory BFGS (MSCBFGS) algorithm can be stated as follows:
Step 0. Start with an approximate point x_0 to the solution and with H_0 = I (or a scalar multiple of it).
Step 1. Compute the search direction p_i using (8), with the matrix-vector products approximated via Algorithm 1.
Step 2. Minimize f(x_i + αp_i), where α ∈ R, to find a step length α_i along p_i. To make certain that the obtained search direction is adequately descent, the parameter α_i is required to satisfy the strong Wolfe conditions [25]:

f(x_i + α_i p_i) ≤ f(x_i) + c_1 α_i g_i^T p_i, |g(x_i + α_i p_i)^T p_i| ≤ c_2 |g_i^T p_i|, 0 < c_1 < c_2 < 1. (13)

Step 3. Set x_{i+1} = x_i + α_i p_i, update the stored pairs, and repeat from Step 1 until convergence.

We now need to establish that (8) produces a descent direction vector. To do so, it suffices to show that the matrix used in the computation of (8) is positive-definite.
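The overall iteration can be sketched as below. This is a simplified illustration rather than the authors' implementation: the function names are ours, a backtracking Armijo search stands in for the strong Wolfe search of Step 2, H_0 = I throughout, and the two-step pair uses the assumed combination u = s − µ s_prev, w = y − µ y_prev with µ = δ²/(1 + 2δ):

```python
import numpy as np
from collections import deque

def two_loop(g, pairs):
    """Nocedal-style two-loop recursion returning H @ g (H_0 = I)."""
    q = g.copy()
    alphas = []
    for s, y in reversed(pairs):          # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append((a, rho, s, y))
        q -= a * y
    r = q                                  # H_0 = I
    for a, rho, s, y in reversed(alphas):  # oldest to newest
        b = rho * (y @ r)
        r += (a - b) * s
    return r

def mscbfgs(f, grad, x0, eta=3, tol=1e-6, max_iter=500):
    """Sketch of the multi-step limited memory loop (hypothetical names)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    pairs = deque(maxlen=eta)              # oldest pairs are discarded
    s_prev = y_prev = None
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -two_loop(g, pairs)
        alpha = 1.0                        # backtracking Armijo search
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p):
            alpha *= 0.5
        s = alpha * p
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        u, w = s, y                        # classical Secant pair
        if s_prev is not None:             # try the two-step pair
            d = np.linalg.norm(s) / np.linalg.norm(s_prev)
            mu = d * d / (1.0 + 2.0 * d)
            u2, w2 = s - mu * s_prev, y - mu * y_prev
            if w2 @ u2 > 0:                # keep only if it preserves PD
                u, w = u2, w2
        if w @ u > 1e-12:                  # safeguard positive curvature
            pairs.append((u, w))
        s_prev, y_prev = s, y
        x, g = x_new, g_new
    return x
```

On a strongly convex quadratic, this sketch drives the gradient norm below the tolerance and recovers the exact minimizer.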

Theorem 1.
The matrix H_{i+1} updated in (7) is positive-definite if H_i is positive-definite and u_i^T w_i > 0.
Proof of Theorem 1. We first show that if H_{i+1} is positive-definite, then u_i^T w_i > 0. From (5), it follows that w_i^T u_i = u_i^T B_{i+1} u_i > 0, since B_{i+1} = H_{i+1}^{-1} is positive-definite. Now, given that w_i^T u_i > 0 and r_i = H_i w_i (from (7)), we proceed to prove that H_{i+1} is positive-definite by showing that v^T H_{i+1} v > 0 for any vector v ≠ 0. Using (7) and (11), we have

v^T H_{i+1} v = v^T V_i^T H_i V_i v + ρ_i (u_i^T v)^2, (14)

where V_i = I − ρ_i w_i u_i^T and ρ_i = 1/(w_i^T u_i) > 0. Since H_i is positive-definite, the first term is nonnegative, and the second term is nonnegative as well. If the second term vanishes, then u_i^T v = 0, so V_i v = v ≠ 0 and the first term is strictly positive. Thus, the whole expression in (14) is positive. □ It should be noted here that conditions (13) ensure only that s_i^T y_i > 0, while there are no guarantees for w_i^T u_i. In our implementations, the condition w_i^T u_i > 0 is tested, and if it is not satisfied, we revert to the classical Secant update at that specific iteration, with u_i = s_i and w_i = y_i.
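The safeguard described above (revert to the classical Secant pair whenever w_i^T u_i ≤ 0) can be expressed as a small helper; the function name is ours:

```python
import numpy as np

def choose_update_pair(s, y, u, w):
    """Use the multi-step pair (u, w) only when w^T u > 0, which
    preserves positive-definiteness by Theorem 1; otherwise revert to
    the classical Secant pair (s, y)."""
    if w @ u > 0:
        return u, w
    return s, y
```

Under a line search satisfying the strong Wolfe conditions (13), the fallback pair always has s^T y > 0, so the update remains well defined.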
As the new algorithm does not strip away the main structure of the BFGS update, the derived method possesses the same convergence properties as others in the same class, such as those in [10,26]. The R-linear convergence rate of the MSCBFGS method is proved under the assumption that the objective function is strongly convex. However, the method can also converge R-linearly for non-convex objective functions. The convergence rate of the MSCBFGS method can be affected by the choice of the memory parameter η: a larger value of η can lead to a faster convergence rate, but it can also make the method more sensitive to noise. The proof of the convergence rate is very similar to that conducted in [26].
Estimating the time complexity of the described optimization algorithm involves several factors. The number of iterations k, along with the computational costs of the algorithmic components O(f) and the convergence tests O(g), collectively shape the overall time complexity. A rough estimate of the algorithm's time complexity is O(k × (f + g)), reflecting the iterative nature of the optimization; a more precise analysis would require a detailed breakdown of specific operations and their computational costs. The per-iteration complexity and storage requirement is O(ηn), where η ≤ n is the size of the stored frame and n is the problem dimension, thus reducing the O(n²) computational complexity and memory requirements of standard quasi-Newton methods.
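The storage gap is easy to quantify. The sketch below compares the dense O(n²) requirement of a full quasi-Newton method with the (2η + 1) n-vector budget stated earlier, for an illustrative problem size (the numbers are ours, chosen to match the largest tested dimension):

```python
# Storage comparison: dense inverse Hessian vs. limited memory frame.
n, eta = 10_000, 3

full_storage = n * n                  # entries of a dense n x n matrix
limited_storage = (2 * eta + 1) * n   # 2*eta pair vectors + one diagonal

ratio = full_storage / limited_storage  # savings factor
```

For n = 10,000 and η = 3, the limited memory variant stores 70,000 numbers instead of 10⁸, a reduction of more than three orders of magnitude.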
The main feature of this algorithm is that it computes the search vector at each iteration using a formula that is almost the full memory multi-step BFGS version, with only an approximation applied to the term H_{i−1} w_i in (10). This is expected to produce results close to the full multi-step BFGS version, as the results reveal.

Numerical Results
The new method (7) is benchmarked against the methods in (3) and (4). These are tested on thirty distinct test problems with dimensions varying from 2 to 10,000; the total number of tested problem instances is 900. The functions tested are classified into four categories and can be found in [7,13,27-29]. The functions and their properties are listed in Table 1. Tables 2-5 report the results for each category, and Table 6 reports the totals. The figures presented in each table indicate the total number of iterations, function/gradient evaluations, timings, and points for every algorithm. A point is granted to a method if it obtains the minimal evaluation count in solving a problem; ties are resolved using the iteration count. The percentages in each table indicate the improvement (or otherwise) of each method on each of the evaluation criteria (iterations, evaluations, and time) relative to the benchmark method of Byrd et al., which is assigned the full percentage; a lower percentage indicates greater savings. Our results favor the new method with η = 3; larger values of η did not introduce improvements in performance worthwhile enough to justify the extra memory expense. Therefore, the test results reported for the new method are conducted with η = 3. The results reveal clearly that the MSCBFGS method (with η = 3) is superior to the other two, especially on large problems. Figures 1-3 report the comparisons on each criterion, while Figure 4 presents an overall overview. The analysis of the numerical scores for the newly developed limited memory MSCBFGS method in comparison to the Byrd et al. [13] and Zhu and Byrd [14] methods reveals notable improvements in efficiency and effectiveness. In terms of evaluations, the MSCBFGS method requires only 91.21% of the evaluations of the Byrd et al. [13] method, indicating enhanced computational efficiency. Additionally, it requires only 88.01% of the iterations of the original method, demonstrating improved convergence properties. Moreover, the MSCBFGS method significantly reduces computation time, using only 87.65% of the time of the Byrd et al. [13] method, while outperforming both the original method and the Zhu and Byrd method in overall effectiveness, as evidenced by its higher score of 369 compared to 232 and 299, respectively. Overall, these results suggest that the newly developed MSCBFGS method offers substantial improvements in efficiency and effectiveness, making it a promising approach for optimization problems.

Conclusions
In conclusion, this paper has presented a contribution to the field of optimization through the development of a novel quasi-Newton method, the multi-step limited memory BFGS (MSCBFGS) method. The method is tailored specifically to address the challenges posed by memory constraints in solving complex large-scale problems. By allowing flexibility in the choice of a storage parameter that governs the retention of past data in the formation of the new Hessian approximation, MSCBFGS exhibits adaptability and versatility. The outcomes of extensive numerical experiments have underscored the efficacy of the MSCBFGS algorithm. These findings not only affirm the algorithm's potential but also pave the way for its practical application in a range of memory-limited optimization tasks.
One notable advantage of the proposed method lies in its adaptability to varying memory constraints, offering a customizable parameter for controlling the retention of past data.This flexibility enables the algorithm to effectively navigate memory-limited environments, enhancing its applicability to a wide range of optimization problems.

Discussion
By utilizing a modified version of the full memory multi-step BFGS update, the method derived in this paper maintains a balance between computational efficiency and accuracy in approximating the Hessian matrix. However, it is crucial to acknowledge certain limitations. One such drawback is the potential sensitivity of the algorithm's performance to the selection of parameters, which may require careful tuning for optimal results. Additionally, while the method demonstrates improved effectiveness within constrained memory contexts, its performance in comparison to alternative approaches warrants further investigation, particularly in scenarios with highly complex or nonlinear objective functions. Overall, the presented method offers promising advancements in addressing memory limitations in quasi-Newton optimization methods, yet ongoing research is essential to fully understand its capabilities and limitations in diverse optimization scenarios.
Future research can investigate more deeply the optimization of the memory parameter selection process within MSCBFGS. Exploring adaptive or learning-based techniques to dynamically adjust this parameter during optimization could further enhance the algorithm's performance, especially when taking the nature of the problem being solved into account [30].

Figure 4. Performance overview of all criteria.

Table 1. Set of test problems.

Table 2. Large problems test results.

Table 3. Moderate high dimension problems test results.

Table 4. Medium-size dimension problems test results.

Table 5. Small dimension problems test results.

Table 6. Overall scores for the test results for 900 problems.