A Sparse Quasi-Newton Method Based on Automatic Differentiation for Solving Unconstrained Optimization Problems

Abstract: In this paper, we introduce a sparse, symmetric matrix completion quasi-Newton model based on automatic differentiation for solving unconstrained optimization problems in which the sparsity structure of the Hessian is available. The proposed method is a matrix completion quasi-Newton method with several attractive properties: it preserves the sparsity of the Hessian exactly and satisfies the quasi-Newton equation approximately. Under the usual assumptions, local and superlinear convergence are established. Numerical experiments show that the new method is effective and superior to the matrix completion quasi-Newton update with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method and to the limited-memory BFGS method.


Introduction
We concentrate on the unconstrained optimization problem
$$\min_{x \in \mathbb{R}^n} f(x), \qquad (1)$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is a twice continuously differentiable function, and $\nabla f(x)$ and $\nabla^2 f(x)$ denote the gradient and Hessian of $f$ at $x$, respectively. The first-order necessary condition of (1) is $\nabla f(x) = 0$, which can be written as the symmetric nonlinear equations
$$F(x) = 0, \qquad (2)$$
where $F : \mathbb{R}^n \to \mathbb{R}^n$ is a continuously differentiable mapping and the symmetry means that the Jacobian $F'(x)$ satisfies $F'(x) = F'(x)^T$. Such symmetric nonlinear systems are closely related to many practical problems, such as the gradient mapping of unconstrained optimization problems, the Karush-Kuhn-Tucker (KKT) system of equality constrained optimization problems, the discretized two-point boundary value problem, and the saddle point problem (2) [1-5].
For small- and medium-scale problems, classical quasi-Newton methods enjoy superlinear convergence without computing the Hessian [6,7]. Let $x_k$ be the current iterate and $B_k$ a symmetric approximation of the Hessian; then the sequence $\{x_k\}$ generated by quasi-Newton methods is
$$x_{k+1} = x_k + \alpha_k d_k,$$
where $\alpha_k > 0$ is a step length obtained by a line search or another strategy. The search direction $d_k$ is obtained by solving the linear equations
$$B_k d_k = -\nabla f(x_k),$$
where the quasi-Newton matrix $B_k$ is an approximation of $\nabla^2 f(x_k)$ and satisfies the secant condition
$$B_{k+1} s_k = y_k, \qquad s_k = x_{k+1} - x_k, \quad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$$
The Davidon-Fletcher-Powell (DFP) update was first proposed by Davidon [8] and developed by Fletcher and Powell [9]. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) update,
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k},$$
was proposed independently by Broyden [10], Fletcher [11], Goldfarb [12], and Shanno [13]. More on this topic can be found in references [14-17]. If we let $H_k = B_k^{-1}$, then by the Sherman-Morrison formula we have the Broyden family update
$$H_{k+1} = H_k - \frac{H_k y_k y_k^T H_k}{y_k^T H_k y_k} + \frac{s_k s_k^T}{s_k^T y_k} + \phi_k \,(y_k^T H_k y_k)\, v_k v_k^T, \qquad v_k = \frac{s_k}{s_k^T y_k} - \frac{H_k y_k}{y_k^T H_k y_k},$$
where $\phi_k \in [0, 1]$. When $\phi_k \equiv 1$, we obtain the BFGS update; when $\phi_k \equiv 0$, we obtain the DFP update. However, quasi-Newton methods are not desirable for large-scale problems, because the full matrix $B_k$ must be stored. To overcome this drawback, the so-called sparse quasi-Newton methods [14] have received much attention. As early as 1970, Schubert [18] proposed a sparse Broyden rank-one method; subsequently, Powell and Toint [19] and Toint [20] studied sparse quasi-Newton methods.
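For concreteness, the classical BFGS update above can be implemented in a few lines. The following sketch uses JAX/NumPy notation; the test vectors are illustrative and not taken from the paper.

```python
import jax.numpy as jnp

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation:
    B_new = B - (B s s^T B)/(s^T B s) + (y y^T)/(y^T s)."""
    Bs = B @ s
    return B - jnp.outer(Bs, Bs) / (s @ Bs) + jnp.outer(y, y) / (y @ s)

# toy usage with s_k = x_{k+1} - x_k and y_k = grad f(x_{k+1}) - grad f(x_k)
B0 = jnp.eye(3)
s = jnp.array([0.1, -0.2, 0.05])
y = jnp.array([0.3, -0.1, 0.2])
B1 = bfgs_update(B0, s, y)   # remains symmetric positive definite since y^T s > 0
```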
Existing sparse quasi-Newton methods usually use a sparse symmetric matrix as an approximation of the Hessian, so that both matrices have the same or similar structure. If the limited-memory technique [21,22] is adopted, which stores only several pairs $(s_k, y_k)$ and constructs $H_k$ by updating an initial matrix $H_0$ $m$ times, the method can be widely used in practical optimization problems. On the other hand, many large-scale problems in scientific fields take the partially separable form
$$f(x) = \sum_{i=1}^{m} f_i(x),$$
where each element function $f_i$, $i = 1, \ldots, m$, depends on only a few of the variables. For partially separable unconstrained optimization problems, the partitioned BFGS method [23,24] was proposed and performs well in practice. The partitioned BFGS method updates the matrix $B_k^i$ of each element function $f_i(x)$ separately via BFGS updating and sums these matrices to construct the next quasi-Newton matrix $B_{k+1}$. Since each $f_i(x)$ involves far fewer variables than $n$, the matrix $B_{k+1}^i$ is small, and the assembled matrix $B_{k+1}$ is therefore sparse. The quasi-Newton direction is the solution of the linear equations
$$B_k d_k = -\nabla f(x_k), \qquad B_k = \sum_{i=1}^{m} B_k^i,$$
where each $B_k^i$ is expanded to $n \times n$ by zero-padding, as sketched in the example below. However, the partitioned BFGS method cannot always preserve the positive definiteness of $B_k$ unless every element function $f_i(x)$ is convex, so it is usually implemented with a trust region strategy [25]. Recently, for partially separable nonlinear equations, Cao and Li [26] introduced two kinds of partitioned quasi-Newton methods and proved their global and superlinear convergence.
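The partitioned idea can be illustrated with a small sketch: each element matrix $B_k^i$ lives on the index set of its element function and is scattered into the full matrix before the Newton-type system is solved. The two element functions and index sets below are hypothetical examples, not test problems from the paper.

```python
import jax
import jax.numpy as jnp

# Hypothetical partially separable function on R^4: f(x) = f1(x0, x1) + f2(x2, x3).
elements = [
    (jnp.array([0, 1]), lambda z: (z[0] - 1.0) ** 2 + z[0] * z[1] ** 2),
    (jnp.array([2, 3]), lambda z: jnp.exp(z[0]) + (z[1] + 2.0) ** 2),
]

def f(x):
    # sum of element functions, each depending on a few variables
    return sum(fi(x[idx]) for idx, fi in elements)

def partitioned_direction(B_elem, grad, n):
    """Scatter the small element matrices B_k^i into the full n-by-n matrix B_k
    and solve B_k d = -grad f(x_k) for the partitioned quasi-Newton direction."""
    B = jnp.zeros((n, n))
    for (idx, _), Bi in zip(elements, B_elem):
        B = B.at[idx[:, None], idx[None, :]].add(Bi)  # add the element block
    return jnp.linalg.solve(B, -grad)

x = jnp.array([0.5, 0.5, 0.0, -1.0])
B_elem = [jnp.eye(2), jnp.eye(2)]     # element matrices, e.g. from element-wise BFGS
d = partitioned_direction(B_elem, jax.grad(f)(x), 4)
```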
Another efficient class of sparse quasi-Newton methods is designed to exploit the sparsity structure of the Hessian. We assume that
$$\big(\nabla^2 f(x)\big)_{ij} = 0 \quad \text{for all } (i, j) \notin F \text{ and all } x \in \mathbb{R}^n,$$
where $F \subseteq \{1, \ldots, n\} \times \{1, \ldots, n\}$. References [27,28] proposed sparse quasi-Newton methods in which the approximate inverse Hessian $H_{k+1}$ satisfies the secant equation $H_{k+1} y_k = s_k$. Recently, Yamashita [29] proposed another type of matrix completion quasi-Newton (MCQN) update for solving problem (1) with a sparse Hessian and proved local and superlinear convergence for MCQN updates with the DFP method. Reference [30] established the convergence of MCQN updates with the whole Broyden convex family. However, the global convergence analysis in [31] was given only for two-dimensional functions with uniformly positive definite Hessians.
Another kind of quasi-Newton method for solving large-scale unconstrained optimization problems is the diagonal quasi-Newton method, in which the Hessian of the objective function is approximated by a diagonal matrix with positive elements. The first version was developed by Nazareth [32], where the quasi-Newton matrix satisfies the least-change and weak secant conditions [33]:
$$\min \|B_{k+1} - B_k\|_F \quad \text{subject to} \quad s_k^T B_{k+1} s_k = s_k^T y_k, \qquad (3)$$
where $\|\cdot\|_F$ is the standard Frobenius norm. Recently, Andrei [34] developed a diagonal quasi-Newton method in which the diagonal elements satisfy the least-change weak secant condition (3) and minimize the trace of the update. Besides, many other techniques, such as forward and central finite differences, the variational principle with a weighted norm, and the generalized Frobenius norm, can be used to derive different kinds of diagonal quasi-Newton methods [35-37]. Under the usual assumptions, the diagonal quasi-Newton method is linearly convergent. The authors of [38] adopted a derivation similar to that of the DFP method and obtained a low-memory diagonal quasi-Newton method; using the Armijo line search, they established global convergence and gave sufficient conditions for the method to be superlinearly convergent.
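As a concrete illustration of the weak secant condition (3), the sketch below performs a least-change update of a diagonal approximation, $D_{k+1} = D_k + \lambda\,\mathrm{diag}(s_k \circ s_k)$, with $\lambda$ chosen so that $s_k^T D_{k+1} s_k = s_k^T y_k$. This is a generic construction in the spirit of [32-34], not the exact rule of any one of those papers.

```python
import jax.numpy as jnp

def diagonal_weak_secant_update(d, s, y):
    """Least-change update of a diagonal Hessian approximation D = diag(d):
    minimizes ||D_new - D||_F over diagonal matrices subject to the weak secant
    condition s^T D_new s = s^T y (a generic sketch, not the exact rule of [34])."""
    lam = (s @ y - (s * s) @ d) / jnp.sum(s ** 4)
    return d + lam * s ** 2

d0 = jnp.ones(3)                        # D_0 = I
s = jnp.array([0.2, -0.1, 0.3])
y = jnp.array([0.5, -0.3, 0.4])
d1 = diagonal_weak_secant_update(d0, s, y)
# check: s @ (d1 * s) equals s @ y up to rounding
```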
The main contribution of this paper is a sparse quasi-Newton algorithm based on automatic differentiation for solving (1). Firstly, similarly to the derivation of the BFGS update, we perform a symmetric rank-two quasi-Newton update
$$B_{k+1} = B_k - \frac{B_k \sigma_k \sigma_k^T B_k}{\sigma_k^T B_k \sigma_k} + \frac{\nabla^2 f(x_{k+1})\sigma_k \sigma_k^T \nabla^2 f(x_{k+1})}{\sigma_k^T \nabla^2 f(x_{k+1}) \sigma_k}, \qquad (4)$$
where $\sigma_k \in \mathbb{R}^n$ and $B_{k+1}$ satisfies the adjoint tangent condition [39]
$$B_{k+1}\sigma_k = \nabla^2 f(x_{k+1})\sigma_k.$$
For an $n \times n$ matrix $A$, we write $A \succ 0$ to mean that $A$ is positive definite. Then, when $B_k \succ 0$, we have $B_{k+1} \succ 0$ if and only if $\sigma_k^T \nabla^2 f(x_{k+1}) \sigma_k > 0$, which means that the proposed update (4) preserves positive definiteness, as BFGS updating does. Moreover, when $B_0$ is positive definite, the matrices $\{B_k\}$ generated by the proposed update (4) are positive definite when solving (1) with uniformly positive definite Hessians. In this work, we focus on $\sigma_k = s_k$; then the proposed rank-two quasi-Newton update (4) satisfies
$$B_{k+1} s_k = \nabla^2 f(x_{k+1}) s_k,$$
which means that $B_{k+1}$ coincides with $\nabla^2 f(x_{k+1})$ along the direction $s_k$. Several lemmas are given to present the properties of the proposed rank-two quasi-Newton update formula. Secondly, combining it with the idea of the MCQN method [29], we propose a sparse and symmetric quasi-Newton algorithm for solving (1). Under appropriate conditions, local and superlinear convergence are established. Finally, our numerical results illustrate that the proposed algorithm performs satisfactorily.
The paper is organized as follows. In Section 2, we introduce a symmetric rank-two quasi-Newton update based on automatic differentiation and prove several of its properties. In Section 3, using the idea of matrix completion, we present a sparse quasi-Newton algorithm and discuss its properties. In Section 4, we prove the local and superlinear convergence of the algorithm proposed in Section 3. Numerical results are reported in Section 5 and show that the performance of the proposed algorithm is encouraging. Finally, we give the conclusions.

A New Symmetric Rank-Two Quasi-Newton Update
Similarly to the derivation of the BFGS update, we derive a new symmetric rank-two quasi-Newton update and establish several lemmas. Let
$$B_{k+1} = B_k + \Delta_k,$$
where $\Delta_k$ is a rank-two matrix and $B_{k+1}$ satisfies the condition
$$B_{k+1}\sigma_k = \nabla^2 f(x_{k+1})\sigma_k, \qquad (5)$$
where $\sigma_k \in \mathbb{R}^n$ and $\sigma_k \neq 0$. Similarly to the derivation of BFGS, we obtain the following symmetric rank-two update:
$$B_{k+1} = B_k - \frac{B_k \sigma_k \sigma_k^T B_k}{\sigma_k^T B_k \sigma_k} + \frac{\nabla^2 f(x_{k+1})\sigma_k \sigma_k^T \nabla^2 f(x_{k+1})}{\sigma_k^T \nabla^2 f(x_{k+1})\sigma_k}. \qquad (6)$$
If we denote $H_k = B_k^{-1}$ and $H_{k+1} = B_{k+1}^{-1}$, then (6) can be expressed as
$$H_{k+1} = \Big(I - \frac{\sigma_k w_k^T}{w_k^T \sigma_k}\Big) H_k \Big(I - \frac{w_k \sigma_k^T}{w_k^T \sigma_k}\Big) + \frac{\sigma_k \sigma_k^T}{w_k^T \sigma_k}, \qquad w_k = \nabla^2 f(x_{k+1})\sigma_k. \qquad (7)$$
The update (6) involves the Hessian $\nabla^2 f(x_{k+1})$, but the full Hessian never needs to be computed in practice: for given vectors $x$, $s$, and $\sigma$, the products $\nabla^2 f(x)s$ and $\sigma^T \nabla^2 f(x)$ can be obtained exactly by the forward and reverse modes of automatic differentiation.
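The only Hessian information the update needs is the product $\nabla^2 f(x_{k+1})\sigma_k$ (and, by symmetry, $\sigma_k^T\nabla^2 f(x_{k+1})$ is its transpose). The paper computes these exactly with ADMAT in MATLAB; the sketch below uses forward-over-reverse automatic differentiation in JAX as a stand-in, and the test function is illustrative.

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    """Exact Hessian-vector product (nabla^2 f(x)) v via forward-over-reverse AD,
    without forming the full Hessian."""
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def rank_two_update(f, B, x_new, sigma):
    """Sketch of the symmetric rank-two update (6): the BFGS-type formula with
    y_k replaced by w = nabla^2 f(x_{k+1}) sigma_k, so that B_new sigma = w."""
    w = hvp(f, x_new, sigma)
    Bs = B @ sigma
    return B - jnp.outer(Bs, Bs) / (sigma @ Bs) + jnp.outer(w, w) / (sigma @ w)

# illustrative strictly convex test function
f = lambda x: jnp.sum(jnp.exp(x)) + 0.5 * (x @ x)
x1 = jnp.array([0.1, -0.2, 0.3])
sigma = jnp.array([1.0, 0.0, -1.0])
B1 = rank_two_update(f, jnp.eye(3), x1, sigma)
# B1 @ sigma reproduces hvp(f, x1, sigma), i.e. the adjoint tangent condition (5)
```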
Next, several lemmas are presented.
Lemma 1. We suppose that $B_k \succ 0$ and $B_{k+1}$ is updated by (6); then $B_{k+1} \succ 0$ if and only if
$$\sigma_k^T \nabla^2 f(x_{k+1}) \sigma_k > 0. \qquad (8)$$

Proof. According to (6) and the condition (5), for any $d_k \in \mathbb{R}^n$ with $d_k \neq 0$ one has
$$d_k^T B_{k+1} d_k = d_k^T B_k d_k - \frac{(d_k^T B_k \sigma_k)^2}{\sigma_k^T B_k \sigma_k} + \frac{(d_k^T \nabla^2 f(x_{k+1}) \sigma_k)^2}{\sigma_k^T \nabla^2 f(x_{k+1}) \sigma_k},$$
where, by the Cauchy-Schwarz inequality in the inner product induced by $B_k$,
$$d_k^T B_k d_k - \frac{(d_k^T B_k \sigma_k)^2}{\sigma_k^T B_k \sigma_k} \ge 0,$$
and the equality holds if and only if $d_k = \lambda_k \sigma_k$ with $\lambda_k \neq 0$. If the inequality holds strictly, then $d_k^T B_{k+1} d_k > 0$ follows directly from (8). If the equality holds, i.e., there exists $\lambda_k \neq 0$ such that $d_k = \lambda_k \sigma_k$, then it can be deduced from (8) that
$$d_k^T B_{k+1} d_k = \lambda_k^2\, \sigma_k^T \nabla^2 f(x_{k+1}) \sigma_k > 0.$$
In conclusion, $d_k^T B_{k+1} d_k > 0$ for all $d_k \in \mathbb{R}^n$ with $d_k \neq 0$, and hence $B_{k+1} \succ 0$.

Lemma 2.
If we rewrite the update Formula (7) as $H_{k+1} = H_k + E$, where $H_k$ is symmetric, then $E$ is the solution of the following minimization problem:

Proof. A suitable Lagrangian function of the convex programming problem is
where $\Lambda$ and $\lambda$ are Lagrange multipliers. Setting the derivatives of the Lagrangian to zero and using the symmetry and the cyclic permutation property of the trace, one obtains the stationarity conditions. Taking the transpose and adding eliminates $\Lambda$; together with the nonsingularity of $W$, this determines $E$. Substituting this into (9) gives the result (7).
Lemma 3. The matrix $B_{k+1}$ given by (6) solves the variational problem
$$\min_{B}\ \psi\big(H_k^{1/2} B H_k^{1/2}\big) \quad \text{s.t.}\quad B = B^T,\quad B\sigma_k = \nabla^2 f(x_{k+1})\sigma_k,$$
where $\psi : \mathbb{R}^{n\times n} \to \mathbb{R}$ [40] is given by $\psi(A) = \operatorname{tr}(A) - \ln \det(A)$.

Proof. We form the Lagrangian function of the variational problem, where $\Lambda$ and $\lambda$ are the Lagrange multipliers, and set its derivative to zero. Transposing and adding in (12), and combining the result with the tangent condition, we obtain Formula (7). According to the Sherman-Morrison formula, (7) is equivalent to (6). Since the function $\psi(H_k^{1/2} B H_k^{1/2})$ is strictly convex on $B \succ 0$, the update Formula (6) is the unique solution of the variational problem.
In this paper, we set $\sigma_k = s_k$, so one has
$$B_{k+1} s_k = \nabla^2 f(x_{k+1}) s_k,$$
which means that $B_{k+1}$ is an exact approximation to $\nabla^2 f(x_{k+1})$ in the direction $s_k$. Then we have the symmetric rank-two update formula
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{\nabla^2 f(x_{k+1}) s_k s_k^T \nabla^2 f(x_{k+1})}{s_k^T \nabla^2 f(x_{k+1}) s_k}.$$
It can be seen that $B_{k+1}$ preserves symmetry whenever $B_k$ is symmetric. If we denote $w_k = \nabla^2 f(x_{k+1}) s_k$, then we can obtain an update formula analogous to the Broyden convex family:
$$H_{k+1} = H_k - \frac{H_k w_k w_k^T H_k}{w_k^T H_k w_k} + \frac{s_k s_k^T}{s_k^T w_k} + \phi_k \,(w_k^T H_k w_k)\, v_k v_k^T, \qquad v_k = \frac{s_k}{s_k^T w_k} - \frac{H_k w_k}{w_k^T H_k w_k}, \qquad (15)$$
where the parameter $\phi_k \in [0, 1]$. The choice $\phi_k \equiv 1$ corresponds to the BFGS-type update; a sketch of this family update is given below.
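The following sketch implements the Broyden-family-like inverse update above with $w_k = \nabla^2 f(x_{k+1}) s_k$ in place of $y_k$. Since the paper's specific definition of $\phi_k$ appears in an omitted display, the scalar phi is left as a free parameter here (phi = 1 gives the BFGS-type member, phi = 0 the DFP-type member).

```python
import jax.numpy as jnp

def family_update_inverse(H, s, w, phi):
    """Broyden-family-like update of the inverse approximation H, with the
    AD-computed vector w = nabla^2 f(x_{k+1}) s in place of y_k.
    phi = 1.0 gives the BFGS-type member, phi = 0.0 the DFP-type member."""
    Hw = H @ w
    wHw = w @ Hw
    sw = s @ w
    v = s / sw - Hw / wHw
    return (H - jnp.outer(Hw, Hw) / wHw
              + jnp.outer(s, s) / sw
              + phi * wHw * jnp.outer(v, v))

# toy usage with illustrative vectors
H0 = jnp.eye(3)
s = jnp.array([0.1, -0.2, 0.05])
w = jnp.array([0.4, -0.3, 0.1])
H1 = family_update_inverse(H0, s, w, phi=1.0)
```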

Algorithm and Related Properties
For the update formula (15), we adopt the idea of matrix completion: the next quasi-Newton matrix $H_{k+1}$ is the solution of a minimization problem, denoted by (18), that completes $H_{AD}$ while respecting the prescribed sparsity pattern. When $G(V, \bar{F})$ is chordal, the minimization problem (18) can be solved by solving the maximum-determinant completion problem $\max \det(H)$ over the positive definite matrices whose prescribed entries are fixed, denoted by (19). Then $H_{k+1}$ can be expressed by the sparse clique-factorization formula of [29]. Algorithm 1 is stated as follows.
Algorithm 1 (Sparse Quasi-Newton Algorithm based on Automatic Differentiation)
Step 0. Compute $\bar{F}$ according to $F$ such that $G(V, \bar{F})$ is a chordal graph, where $V = \{1, 2, \cdots, n\}$. Choose $x_0 \in \mathbb{R}^n$, $\epsilon > 0$, and a matrix $H_0 \in \mathbb{R}^{n \times n}$ with $H_0 \succ 0$. Set $k := 0$.
Step 1. If $\|\nabla f(x_k)\| \le \epsilon$, stop.
Step 2. Compute the search direction $d_k = -H_k \nabla f(x_k)$ and set $x_{k+1} = x_k + \alpha_k d_k$.
Step 3. Compute $H_{AD}$ by the update Formula (15).
Step 4. Get $H_{k+1}$ by the minimization problem (18). When $G(V, \bar{F})$ is a chordal graph, the problem (18) can be solved by solving the problem (19). Set $k := k + 1$ and go to Step 1.
When the matrix in Step 3 is updated by a Broyden-class method, Algorithm 1 reduces to the method in [29]. In the present paper, we focus on the MCQN update with $H_{AD} = H_{k+1}$, where $H_{k+1}$ is given by (15). A sketch of the overall iteration is given below.
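A high-level sketch of the main loop of Algorithm 1 follows. The helper complete_sparse is hypothetical: it stands in for the clique-factorization formula of [29] that solves (18)/(19) for a chordal pattern and is not reproduced here. The update in Step 3 is written as the BFGS-type member of the family for brevity, and a unit step length is used, as in the convergence analysis.

```python
import jax
import jax.numpy as jnp

def hvp(f, x, v):
    # exact (nabla^2 f(x)) v by forward-over-reverse automatic differentiation
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

def sparse_qn_ad(f, x0, H0, complete_sparse, eps=1e-8, max_iter=200):
    """Sketch of Algorithm 1: AD-based rank-two update followed by a
    matrix-completion step that restores the prescribed sparsity pattern.
    `complete_sparse` is a hypothetical stand-in for the MCQN step (18)/(19)."""
    x, H = x0, H0
    for _ in range(max_iter):
        g = jax.grad(f)(x)
        if jnp.linalg.norm(g) <= eps:               # Step 1: stopping test
            break
        d = -H @ g                                  # Step 2: quasi-Newton direction
        x_new = x + d                               # unit step length (alpha_k = 1)
        s = x_new - x
        w = hvp(f, x_new, s)                        # exact nabla^2 f(x_{k+1}) s_k by AD
        Hw = H @ w
        H_ad = (H - jnp.outer(Hw, Hw) / (w @ Hw)    # Step 3: BFGS-type member of the family update
                  + jnp.outer(s, s) / (s @ w))
        H = complete_sparse(H_ad)                   # Step 4: matrix completion (hypothetical helper)
        x = x_new
    return x
```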
In what follows, we give some notation for the convenience of the analysis. For a nonsingular matrix $P$ satisfying the sparsity condition (20), we define the scaled matrices $\tilde{H}_{k+1}$ and $\tilde{H}_{AD}$, where $H_{AD} = H_{k+1}$ is given by (15), and obtain the corresponding scaled form of (15). Similarly to [30], we make the analogous boundedness assumption. According to [41] and (21), we obtain the trace and determinant relations used below. Next, we establish a relation between $\tilde{H}_{k+1}$ and $\tilde{H}_{AD}$, which is essential for establishing the local and superlinear convergence of Algorithm 1.
Proof. We can obtain from (18) that, combined with (20), for any $(i, j) \in F$ at least one of $(H_{k+1} - H_{AD})_{ij}$ and $(P^{-1})_{ij}$ equals zero. Moreover, since $H_{AD}$ satisfies the constraints of (19), we must have $\det(H_{k+1}) \ge \det(H_{AD})$.

The Local and Superlinear Convergence
Based on the discussion in Section 3, we prove the local and superlinear convergence of Algorithm 1. First, we list the assumptions.

Assumption 1.
Assume that $x^*$ is a solution of (1) and that: (1) The function $f : \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable on a neighborhood $\Omega$ of $x^*$.
(2) There exist two constants, $m > 0$ and $M > 0$, satisfying
$$m\|z\|^2 \le z^T \nabla^2 f(x) z \le M\|z\|^2 \quad \text{for all } z \in \mathbb{R}^n,\ x \in \Omega.$$
According to Assumption 1, there exist constants $\bar{L} > 0$ and $L > 0$ satisfying the corresponding Lipschitz-type bounds. We define the quantities used below and obtain the associated estimate from (32). If we take $P = H_*$, the corresponding bound follows from (34). Furthermore, it is easy to deduce the estimate in (36), where $c_1 > 0$ and $c_2 \in (0, b)$. We then rewrite (29) in terms of the quantities defined in (37). As $\gamma_k = q_k / m_k$ and $0 \le q_k \le \operatorname{tr}(\tilde{H}_k)$, we obtain a further bound from the above inequality and (36). One then has a bound involving a matrix $A$ with $A^T = A$ and $A \succ 0$; moreover, it follows from (40) that (41) holds with $c_3 = c_1\frac{1}{n} + \frac{e}{e-1}$. Since $\tau_k^2 \le 1$ and $\beta_k \gamma_k \le 1$, it is obvious that $\rho_k, \zeta_k, \xi_k > 0$ and that (42) holds. The theorem given below shows that Algorithm 1 converges locally and linearly, where the relation (42) plays an essential role.

Theorem 1. Let Assumption 1 hold and let the sequence $\{x_k\}$ be generated by Algorithm 1 with $\alpha_k \equiv 1$, where $H_k$ is updated by (15). Then for any $\rho \in (0, 1)$ there is a constant $\tau > 0$ such that, whenever $\|x_0 - x^*\| \le \tau$, the sequence $\{x_k\}$ converges to $x^*$ at least linearly.

Proof. According to Lemma 4 of [29], there exist constants $\bar{\tau} \in (0, b)$ and $\delta > 0$ such that (44) and (45) hold, where $H \succ 0$ and $\tilde{H} = H_*^{-1/2} H H_*^{-1/2}$. Define $\tau$ as in (46). We will prove the inequalities (43) and (47) for all $k \ge 0$ by induction. By the Lipschitz continuity of $\nabla^2 f(x)$, we have (48) for all $x \in \Omega$. Then, when $k = 0$, it is easy to deduce (43) from (44) and (45). Moreover, taking $\alpha_k \equiv 1$ and substituting $x_0$ into (48), we obtain that (43) and (47) hold for $k = 1$. Assume that (43) and (47) hold for $k = 0, 1, \ldots, l$. Then, by the definition of $\tau$ in (46) and by combining (42) and (44), it can be seen that (47) holds for $k = l + 1$. This completes the proof.
Based on the above discussion and the relation (42), we can show the superlinear convergence of Algorithm 1.

Theorem 2. Let Assumption 1 hold and let the sequence $\{x_k\}$ be generated by Algorithm 1 with $\alpha_k \equiv 1$, where $H_k$ is updated by (15). Then there is a constant $\tau > 0$ such that when $\|x_0 - x^*\| \le \tau$, the sequence $\{x_k\}$ converges to $x^*$ superlinearly.
Proof. Let $\tau$ be defined as in Theorem 1. It follows from (41) that, summing the resulting inequality and combining (51) and (54), we can deduce that the nonnegative quantities $\rho_k$, $\zeta_k$, and $\xi_k$ all tend to zero as $k \to +\infty$. Furthermore, according to the definition in (37), we have the following.

First, we have
For the subsequence $\{k_i : \phi_{k_i} \le \frac{1}{2}\}$, one has $q_{k_i} \to 1$ and $\tau_{k_i} \to 1$, and then (53) is true. Moreover, it is easy to deduce the relations (56)-(58). For the subsequence $\{k_i : \phi_{k_i} > \frac{1}{2}\}$, one has $q_{k_i} \to 1$, $\beta_{k_i}\gamma_{k_i} \to 1$, and $m_{k_i} \to 1$; then (53) is true by (56)-(58). Thus, the relation (53) holds for all $k$.
Next, we show that (53) implies that the sufficient condition [6]
$$\lim_{k \to \infty} \frac{\|(B_k - \nabla^2 f(x^*)) s_k\|}{\|s_k\|} = 0 \qquad (59)$$
holds. According to (47), there is a constant $\lambda_{\min} > 0$ such that $(\lambda_k)_i \ge \lambda_{\min}$, where $(\lambda_k)_i$, $i = 1, 2, \cdots, n$, denote the eigenvalues of $H_k$. Letting $w_k = \nabla^2 f(x_{k+1}) s_k$ and using the fact that $x_k \to x^*$ as $k \to \infty$, one obtains (59) from (53), which is the well-known Dennis-Moré condition. Thus, the convergence is superlinear.

Numerical Experiments
The results in [29] show that the MCQN update with the BFGS method has better numerical performance than the MCQN update with the DFP method. Hence, we compare the numerical performance of Algorithm 1 with that of the MCQN update with the BFGS method and the limited-memory BFGS method.
The 24 test problems and their initial points are given in Table 1; they are taken from [29,42-44]. All the test problems have special Hessian structures, such as band matrices, so the chordal extension of the sparsity pattern can be obtained easily. Then $H_{k+1}$ in Algorithm 1 can be written in the sparse clique-factorization form.
All the methods were coded in MATLAB R2016a on a Core (TM) i5 PC. The automatic differentiation was computed by ADMAT 2.0, which is available on the Cayuga Research GitHub page. In Tables 1-4 and Figures 1 and 2, we report the numerical performance of the three methods. For convenience, we use the following notation in our numerical results.
Pro: the problem; Dim: the dimension of the test problem; Init: the initial point; Method: the algorithm used to solve the problem; MCQN-BFGS: the MCQN update with the BFGS method; L-BFGS: the limited-memory BFGS method. For the sake of precise comparison, we adopted the performance profiles of [45], which are distribution functions of a performance metric. We denote by P and S the test set and the set of solvers, and by $N_p$ and $N_s$ the number of problems and the number of solvers, respectively. For solver $s \in S$ and problem $p \in P$, we define $t_{p,s}$ as the number of iterations or function evaluations required to solve problem $p$ by solver $s$. Then, using the performance ratio
$$r_{p,s} = \frac{t_{p,s}}{\min\{t_{p,q} : q \in S\}},$$
we define
$$\rho_s(t) = \frac{1}{N_p}\,\big|\{p \in P : r_{p,s} \le t\}\big|,$$
where $r_{p,s} \le r_M$ for some constant $r_M$ for all $p$ and $s$, and the equality $r_{p,s} = r_M$ holds if and only if solver $s$ cannot solve problem $p$. Therefore, $\rho_s : \mathbb{R} \to [0, 1]$ is the probability that, for solver $s \in S$, the performance ratio $r_{p,s}$ is within a factor $t \in \mathbb{R}$ of the best possible ratio. Figure 1 compares the numbers of iterations of Algorithm 1 and the MCQN update with the BFGS method by using performance profiles. The top curve corresponds to Algorithm 1, which shows that Algorithm 1 performed better than the MCQN update with the BFGS method. Additionally, Figure 2 demonstrates that Algorithm 1 performed better than the limited-memory BFGS method. Secondly, for a further comparison of Algorithm 1 and the MCQN update with the BFGS method, we tested five different initial points, $x_0$, $2x_0$, $4x_0$, $7x_0$, and $10x_0$, where $x_0$ is specified in Table 1; the dimension of the test problems was 1000. Table 4 reports the number of iterations required by the two methods for the 24 test problems, which also demonstrates that Algorithm 1 is effective and superior to the MCQN update with the BFGS method. Table 4. Results for Dim = 1000 with different initial points.
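For completeness, the performance profiles of [45] used in Figures 1 and 2 can be computed as in the sketch below; the iteration-count table is illustrative, not the paper's data.

```python
import jax.numpy as jnp

def performance_profile(T, ts):
    """Dolan-More performance profiles. T[p, s] is the cost (e.g. iterations)
    of solver s on problem p (use jnp.inf for failures); ts is a grid of
    thresholds. Returns rho[i, s] = fraction of problems with r_{p,s} <= ts[i]."""
    r = T / jnp.min(T, axis=1, keepdims=True)       # performance ratios r_{p,s}
    return jnp.stack([jnp.mean(r <= t, axis=0) for t in ts])

# illustrative data: 4 problems, 2 solvers
T = jnp.array([[10.0, 14.0],
               [25.0, 22.0],
               [8.0, 8.0],
               [30.0, jnp.inf]])                    # inf marks a failure
rho = performance_profile(T, jnp.linspace(1.0, 3.0, 21))
```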

Conclusions
In this paper, we presented a symmetric rank-two quasi-Newton update based on an adjoint tangent condition for solving unconstrained optimization problems. Combined with the idea of matrix completion, we proposed a sparse quasi-Newton algorithm and established its local and superlinear convergence. Numerical results demonstrated that the proposed algorithm outperforms the MCQN update with the BFGS method and the limited-memory BFGS method and can be applied to large-scale unconstrained optimization problems.

Data Availability Statement:
The data used to support this research and all the code used in this study are available from the corresponding author upon request.

Conflicts of Interest:
The authors declare no conflict of interest.