Article

Accelerating Symmetric Rank-1 Quasi-Newton Method with Nesterov’s Gradient for Training Neural Networks

1
Graduate School of Science and Technology, Shizuoka University, Hamamatsu 432-8561, Shizuoka, Japan
2
Graduate School of Electrical and Information Engineering, Shonan Institute of Technology, Fujisawa 251-8511, Kanagawa, Japan
3
Graduate School of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Shizuoka, Japan
4
Research Institute of Electronics, Shizuoka University, Hamamatsu 432-8561, Shizuoka, Japan
*
Author to whom correspondence should be addressed.
Algorithms 2022, 15(1), 6; https://doi.org/10.3390/a15010006
Received: 30 November 2021 / Revised: 19 December 2021 / Accepted: 20 December 2021 / Published: 24 December 2021
(This article belongs to the Special Issue Stochastic Algorithms and Their Applications)

Abstract

Gradient-based methods are widely used for training neural networks and can be broadly categorized into first- and second-order methods. Second-order methods have been shown to converge better than first-order methods, especially on highly nonlinear problems. The BFGS quasi-Newton method is the most commonly studied second-order method for neural network training. Recent work has shown that the convergence of the BFGS method can be sped up using Nesterov’s accelerated gradient and momentum terms. The SR1 quasi-Newton method, though less commonly used in training neural networks, is known to have interesting properties and to provide good Hessian approximations when used with a trust-region approach. This paper therefore investigates accelerating the Symmetric Rank-1 (SR1) quasi-Newton method with Nesterov’s gradient for training neural networks and briefly discusses its convergence. The performance of the proposed method is evaluated on a function approximation problem and an image classification problem.

1. Introduction

Neural networks have been shown to have great potential in several applications. Hence, there is a great demand for large-scale algorithms that can train neural networks effectively and efficiently. Neural network training poses several challenges such as ill-conditioning, hyperparameter tuning, exploding and vanishing gradients, and saddle points. Thus, the optimization algorithm plays an important role in training neural networks. Gradient-based algorithms have been widely used in training neural networks and can be broadly categorized into first-order methods (e.g., SGD, Adam) and higher-order methods (e.g., Newton method, quasi-Newton method), each with its own pros and cons. Much progress has been made in the last 20 years in designing and implementing robust and efficient methods suitable for deep learning and neural networks. While several works focus on sophisticated update strategies for improving the performance of the optimization algorithm, others propose acceleration techniques such as incorporating momentum, Nesterov’s acceleration or Anderson’s acceleration. Furthermore, it has been shown that second-order methods converge faster than first-order methods, even without acceleration techniques. Most of the second-order quasi-Newton methods used in training neural networks are rank-2 update methods; rank-1 methods are not widely used since they do not perform as well as the rank-2 update methods. In this paper, we investigate whether Nesterov’s acceleration can be applied to the rank-1 update methods of the quasi-Newton family to improve performance.

Related Works

First-order methods are most commonly used due to their simplicity and low computational complexity. Several works have been devoted to first-order methods such as gradient descent [1,2] and its variance-reduced forms [3,4,5], Nesterov’s Accelerated Gradient Descent (NAG) [6], AdaGrad [7], RMSprop [8], and Adam [9]. However, second-order methods have been shown to have better convergence, with the only drawbacks being high computational and storage costs. Thus, several approximations have been proposed under Newton [10,11] and quasi-Newton [12] methods to efficiently use the second-order information while keeping the computational load minimal. Recently there has been a surge of interest in designing efficient second-order quasi-Newton variants that are better suited for large-scale problems, such as in [13,14,15,16], since in addition to better convergence, second-order methods are more suitable for parallel and distributed training. It is notable that among the quasi-Newton methods, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is the most widely studied for training neural networks. The Symmetric Rank-1 (SR1) quasi-Newton method, though less commonly used in training neural networks, is known to have interesting properties and to provide good Hessian approximations when used with a trust-region approach [17,18]. Several works in optimization [19,20,21] have shown SR1 quasi-Newton methods to be efficient. Recent works such as [22,23] have proposed sampled LSR1 (limited memory) quasi-Newton updates for machine learning and describe efficient ways to implement distributed training. Recent studies such as [24,25] have shown that the BFGS method can be accelerated by using Nesterov’s accelerated gradient and momentum terms. In this paper, we explore whether Nesterov’s acceleration can be applied to the LSR1 quasi-Newton method as well. We thus propose a new limited memory Nesterov’s accelerated symmetric rank-1 method (L-SR1-N) for training neural networks. We show that the performance of the LSR1 quasi-Newton method can be significantly improved using the trust-region approach and Nesterov’s acceleration.

2. Background

Training in neural networks is an iterative process in which the parameters are updated in order to minimize an objective function. Given a mini-batch $X \subseteq T_r$ with input-output pair samples $(\mathbf{x}_p, \mathbf{o}_p)_{p \in X}$ drawn at random from the training set $T_r$ and an error function $E_p(\mathbf{w}; \mathbf{x}_p, \mathbf{o}_p)$ parameterized by a vector $\mathbf{w} \in \mathbb{R}^d$, the objective function to be minimized is defined as
$$E(\mathbf{w}) = \frac{1}{b} \sum_{p \in X} E_p(\mathbf{w}), \tag{1}$$
where $b = |X|$ is the batch size. In full batch, $X = T_r$ and $b = n$, where $n = |T_r|$. In gradient-based methods, the objective function $E(\mathbf{w})$ under consideration is minimized by the iterative formula (2), where $k \in \mathbb{N}$ is the iteration count and $\mathbf{v}_{k+1}$ is the update vector, which is defined for each gradient algorithm.
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \mathbf{v}_{k+1}, \quad \text{for } k = 1, 2, \ldots, k_{max} \in \mathbb{N}. \tag{2}$$
Notations: We briefly define the notations used in this paper. In general, all vectors are denoted by boldface lowercase characters, matrices by boldface uppercase characters and scalars by simple lowercase characters. The scalars, vectors and matrices at each iteration bear the corresponding iteration index k as a subscript. Below is a list of notations used.
  • iteration index $k \in \mathbb{N}$: $k = 1, 2, \ldots, k_{max} \in \mathbb{N}$
  • $n$ is the total number of samples in $T_r$ and is given by $|T_r|$.
  • $b$ is the number of samples in the mini-batch $X \subseteq T_r$ and is given by $|X|$.
  • $d$ is the number of parameters of the neural network.
  • $m_L$ is the limited memory size.
  • $\alpha_k$ is the learning rate or step size.
  • $\mu_k$ is the momentum coefficient, chosen in the range (0,1).
  • $E(\mathbf{w})$ is the error evaluated at $\mathbf{w}$.
  • $\nabla E(\mathbf{w})$ is the gradient of the error function evaluated at $\mathbf{w}$.
In the following sections, we briefly discuss the common first and second order gradient based methods.

2.1. First-Order Gradient Descent and Nesterov’s Accelerated Gradient Descent Methods

The gradient descent (GD) method is one of the earliest and simplest gradient-based algorithms. The update vector $\mathbf{v}_{k+1}$ is given as
$$\mathbf{v}_{k+1} = -\alpha_k \nabla E(\mathbf{w}_k). \tag{3}$$
The learning rate $\alpha_k$ determines the step size along the direction of the negative gradient $-\nabla E(\mathbf{w}_k)$. The step size $\alpha_k$ is usually fixed or set to a simple decay schedule.
The Nesterov’s Accelerated Gradient (NAG) method [6] is a modification of the gradient descent method in which the gradient is computed at $\mathbf{w}_k + \mu_k \mathbf{v}_k$ instead of $\mathbf{w}_k$. Thus, the update vector is given by
$$\mathbf{v}_{k+1} = \mu_k \mathbf{v}_k - \alpha_k \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k), \tag{4}$$
where $\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$ is the gradient at $\mathbf{w}_k + \mu_k \mathbf{v}_k$ and is referred to as Nesterov’s accelerated gradient. The momentum coefficient $\mu_k$ is a hyperparameter chosen in the range (0,1). Several adaptive momentum and restart schemes have also been proposed for the choice of the momentum [26,27]. The GD and NAG algorithms are shown in Algorithm 1 and Algorithm 2, respectively.
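Algorithm 1 GD Method
Require: $\varepsilon$ and $k_{max}$
Initialize: $\mathbf{w}_k \in \mathbb{R}^d$
$k \leftarrow 1$
while $\|\nabla E(\mathbf{w}_k)\| > \varepsilon$ and $k < k_{max}$ do
    Calculate $\nabla E(\mathbf{w}_k)$
    $\mathbf{v}_{k+1} \leftarrow -\alpha_k \nabla E(\mathbf{w}_k)$
    $\mathbf{w}_{k+1} \leftarrow \mathbf{w}_k + \mathbf{v}_{k+1}$
    $k \leftarrow k + 1$
end while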
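Algorithm 2 NAG Method
Require: $0 < \mu_k < 1$, $\varepsilon$ and $k_{max}$
Initialize: $\mathbf{w}_k \in \mathbb{R}^d$ and $\mathbf{v}_k = 0$
$k \leftarrow 1$
while $\|\nabla E(\mathbf{w}_k)\| > \varepsilon$ and $k < k_{max}$ do
    Calculate $\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$
    $\mathbf{v}_{k+1} \leftarrow \mu_k \mathbf{v}_k - \alpha_k \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$
    $\mathbf{w}_{k+1} \leftarrow \mathbf{w}_k + \mathbf{v}_{k+1}$
    $k \leftarrow k + 1$
end while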
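As an illustration of Algorithms 1 and 2, the following is a minimal Python sketch, not the authors’ implementation. It assumes a fixed step size and a fixed momentum coefficient, and it uses a hypothetical ill-conditioned quadratic as the objective; the function name `nag` and all numerical values are chosen here for illustration only. Setting `mu=0` recovers plain gradient descent.

```python
import numpy as np

def nag(grad, w0, alpha, mu, eps=1e-8, k_max=10_000):
    """Nesterov's accelerated gradient with fixed alpha and mu; mu=0 gives plain GD."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    k = 1
    while np.linalg.norm(grad(w)) > eps and k < k_max:
        g = grad(w + mu * v)      # gradient at the look-ahead point w_k + mu_k v_k
        v = mu * v - alpha * g    # update vector v_{k+1}
        w = w + v                 # iterate update w_{k+1} = w_k + v_{k+1}
        k += 1
    return w, k

# Hypothetical test problem (not from the paper): E(w) = 0.5 * w^T A w.
A = np.diag([1.0, 50.0])
grad = lambda w: A @ w
w0 = np.array([1.0, 1.0])

w_gd, k_gd = nag(grad, w0, alpha=0.02, mu=0.0)    # Algorithm 1
w_nag, k_nag = nag(grad, w0, alpha=0.02, mu=0.8)  # Algorithm 2
print("GD iterations:", k_gd, "NAG iterations:", k_nag)
```

On this toy problem the look-ahead gradient lets NAG reach the tolerance in noticeably fewer iterations than GD with the same step size; the gap depends on the conditioning of the problem and on the chosen momentum.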

2.2. Second-Order Quasi-Newton Methods

Second-order methods such as Newton’s method have better convergence than first-order methods. The update vector of second-order methods takes the form
$$\mathbf{v}_{k+1} = -\alpha_k \mathbf{H}_k \nabla E(\mathbf{w}_k). \tag{5}$$
However, computing the inverse of the Hessian matrix $\mathbf{H}_k = \mathbf{B}_k^{-1}$ incurs a high computational cost, especially for large-scale problems. Thus, quasi-Newton methods, in which the inverse of the Hessian matrix is approximated iteratively, are widely used.

2.2.1. BFGS Quasi-Newton Method

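The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is one of the most popular quasi-Newton methods for unconstrained optimization. The update vector of the BFGS quasi-Newton method is given as $\mathbf{v}_{k+1} = \alpha_k \mathbf{g}_k$, where $\mathbf{g}_k = -\mathbf{H}_k^{BFGS} \nabla E(\mathbf{w}_k)$ is the search direction. The Hessian matrix $\mathbf{H}_k^{BFGS}$ is symmetric positive definite and is iteratively approximated by the following BFGS rank-2 update formula [28].
$$\mathbf{H}_{k+1}^{BFGS} = \left(\mathbf{I} - \frac{\mathbf{p}_k \mathbf{q}_k^T}{\mathbf{q}_k^T \mathbf{p}_k}\right) \mathbf{H}_k^{BFGS} \left(\mathbf{I} - \frac{\mathbf{q}_k \mathbf{p}_k^T}{\mathbf{q}_k^T \mathbf{p}_k}\right) + \frac{\mathbf{p}_k \mathbf{p}_k^T}{\mathbf{q}_k^T \mathbf{p}_k}, \tag{6}$$
where $\mathbf{I}$ denotes the identity matrix, and
$$\mathbf{p}_k = \mathbf{w}_{k+1} - \mathbf{w}_k \quad \text{and} \quad \mathbf{q}_k = \nabla E(\mathbf{w}_{k+1}) - \nabla E(\mathbf{w}_k). \tag{7}$$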
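The rank-2 update in (6) can be sketched in a few lines of NumPy. This is an illustrative sketch under the assumption that $\mathbf{q}_k^T \mathbf{p}_k > 0$; the function name `bfgs_update` and the random test data are hypothetical. The check at the end confirms numerically that the updated matrix maps $\mathbf{q}_k$ to $\mathbf{p}_k$ and stays symmetric positive definite.

```python
import numpy as np

def bfgs_update(H, p, q):
    """Rank-2 BFGS update of the matrix H from the pair (p, q); assumes q^T p > 0."""
    rho = 1.0 / (q @ p)
    V = np.eye(len(p)) - rho * np.outer(p, q)     # I - p q^T / (q^T p)
    return V @ H @ V.T + rho * np.outer(p, p)

# Hypothetical data: a pair generated by a convex quadratic, so q = A p and q^T p > 0.
rng = np.random.default_rng(1)
Q = rng.standard_normal((3, 3))
A = Q.T @ Q + np.eye(3)
p = rng.standard_normal(3)
q = A @ p

H1 = bfgs_update(np.eye(3), p, q)
print(np.allclose(H1 @ q, p), np.all(np.linalg.eigvalsh(H1) > 0))
```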

2.2.2. Nesterov’s Accelerated Quasi-Newton Method

The Nesterov’s Accelerated Quasi-Newton (NAQ) method [24] introduces Nesterov’s acceleration to the BFGS quasi-Newton method by approximating the quadratic model of the objective function at $\mathbf{w}_k + \mu_k \mathbf{v}_k$ and by incorporating Nesterov’s accelerated gradient $\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$ in its Hessian update. The update vector of NAQ can be written as
$$\mathbf{v}_{k+1} = \mu_k \mathbf{v}_k + \alpha_k \mathbf{g}_k, \tag{8}$$
where $\mathbf{g}_k = -\mathbf{H}_k^{NAQ} \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$ is the search direction and the Hessian update equation is given as
$$\mathbf{H}_{k+1}^{NAQ} = \left(\mathbf{I} - \frac{\mathbf{p}_k \mathbf{q}_k^T}{\mathbf{q}_k^T \mathbf{p}_k}\right) \mathbf{H}_k^{NAQ} \left(\mathbf{I} - \frac{\mathbf{q}_k \mathbf{p}_k^T}{\mathbf{q}_k^T \mathbf{p}_k}\right) + \frac{\mathbf{p}_k \mathbf{p}_k^T}{\mathbf{q}_k^T \mathbf{p}_k}, \tag{9}$$
where
$$\mathbf{p}_k = \mathbf{w}_{k+1} - (\mathbf{w}_k + \mu_k \mathbf{v}_k) \quad \text{and} \quad \mathbf{q}_k = \nabla E(\mathbf{w}_{k+1}) - \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k). \tag{10}$$
Equation (9) is derived from the secant condition $\mathbf{q}_k = (\mathbf{H}_{k+1}^{NAQ})^{-1} \mathbf{p}_k$ and the rank-2 updating formula [24]. It is proven that the Hessian matrix $\mathbf{H}_{k+1}^{NAQ}$ updated by (9) is a positive definite symmetric matrix, given that $\mathbf{H}_k^{NAQ}$ is initialized to the identity matrix [24]. It is shown in [24] that NAQ has convergence properties similar to those of BFGS.
The BFGS and NAQ algorithms are shown in Algorithm 3 and Algorithm 4, respectively. Note that in NAQ the gradient is computed twice in one iteration. This increases the computational cost compared to the BFGS quasi-Newton method. However, due to acceleration by the momentum and Nesterov’s gradient term, NAQ converges faster than BFGS. Often, as the scale of the neural network model increases, the $O(d^2)$ cost of storing and updating the Hessian matrices $\mathbf{H}_k^{BFGS}$ and $\mathbf{H}_k^{NAQ}$ becomes expensive. Hence, the limited memory variants LBFGS and LNAQ were proposed, in which the respective Hessian matrices are updated using only the last $m_L$ curvature information pairs $\{\mathbf{p}_i, \mathbf{q}_i\}_{i=k-1}^{k-m_L-1}$, where $m_L$ is the limited memory size and is chosen such that $m_L \ll b$.
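Algorithm 3 BFGS Method
Require: $\varepsilon$ and $k_{max}$
Initialize: $\mathbf{w}_k \in \mathbb{R}^d$ and $\mathbf{H}_k = \mathbf{I}$
$k \leftarrow 1$
Calculate $\nabla E(\mathbf{w}_k)$
while $\|\nabla E(\mathbf{w}_k)\| > \varepsilon$ and $k < k_{max}$ do
    $\mathbf{g}_k \leftarrow -\mathbf{H}_k^{BFGS} \nabla E(\mathbf{w}_k)$
    Determine $\alpha_k$ by line search
    $\mathbf{v}_{k+1} \leftarrow \alpha_k \mathbf{g}_k$
    $\mathbf{w}_{k+1} \leftarrow \mathbf{w}_k + \mathbf{v}_{k+1}$
    Calculate $\nabla E(\mathbf{w}_{k+1})$
    Update $\mathbf{H}_{k+1}^{BFGS}$ using (6)
    $k \leftarrow k + 1$
end while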
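Algorithm 4 NAQ Method
Require: $0 < \mu_k < 1$, $\varepsilon$ and $k_{max}$
Initialize: $\mathbf{w}_k \in \mathbb{R}^d$, $\mathbf{H}_k = \mathbf{I}$ and $\mathbf{v}_k = 0$
$k \leftarrow 1$
while $\|\nabla E(\mathbf{w}_k)\| > \varepsilon$ and $k < k_{max}$ do
    Calculate $\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$
    $\mathbf{g}_k \leftarrow -\mathbf{H}_k^{NAQ} \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$
    Determine $\alpha_k$ by line search
    $\mathbf{v}_{k+1} \leftarrow \mu_k \mathbf{v}_k + \alpha_k \mathbf{g}_k$
    $\mathbf{w}_{k+1} \leftarrow \mathbf{w}_k + \mathbf{v}_{k+1}$
    Calculate $\nabla E(\mathbf{w}_{k+1})$
    Update $\mathbf{H}_{k+1}^{NAQ}$ using (9)
    $k \leftarrow k + 1$
end while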

2.2.3. SR1 Quasi-Newton Method

While the BFGS and NAQ methods update the Hessian using rank-2 updates, the Symmetric Rank-1 (SR1) method performs rank-1 updates [28]. The Hessian update of the SR1 method is given as
$$\mathbf{H}_{k+1}^{SR1} = \mathbf{H}_k^{SR1} + \frac{(\mathbf{p}_k - \mathbf{H}_k^{SR1} \mathbf{q}_k)(\mathbf{p}_k - \mathbf{H}_k^{SR1} \mathbf{q}_k)^T}{(\mathbf{p}_k - \mathbf{H}_k^{SR1} \mathbf{q}_k)^T \mathbf{q}_k}, \tag{11}$$
where
$$\mathbf{p}_k = \mathbf{w}_{k+1} - \mathbf{w}_k \quad \text{and} \quad \mathbf{q}_k = \nabla E(\mathbf{w}_{k+1}) - \nabla E(\mathbf{w}_k). \tag{12}$$
Unlike the BFGS or NAQ method, the Hessian generated by the SR1 update may not always be positive definite. Also, the denominator can vanish or become very close to zero. Thus, SR1 methods are not widely used in neural network training. However, SR1 methods are known to converge faster towards the true Hessian than the BFGS method, and have computational advantages for sparse problems [17]. Furthermore, several strategies have been introduced to overcome the drawbacks of the SR1 method, resulting in performance almost on par with, if not better than, that of the BFGS method.
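The two drawbacks mentioned above (possible loss of positive definiteness and a vanishing denominator) can be seen in a small NumPy sketch of the SR1 update. The function name `sr1_update`, the skipping threshold `tol` and the two-dimensional data are assumptions made for illustration; skipping the update when the denominator is tiny is one common safeguard, not necessarily the one used in the experiments.

```python
import numpy as np

def sr1_update(H, p, q, tol=1e-8):
    """SR1 update of H from the pair (p, q); the update is skipped if the denominator is tiny."""
    r = p - H @ q
    denom = r @ q
    if abs(denom) <= tol * np.linalg.norm(r) * np.linalg.norm(q):
        return H                                  # safeguard against a vanishing denominator
    return H + np.outer(r, r) / denom

# Hypothetical pair with negative curvature (q^T p < 0), which BFGS could not use.
p = np.array([1.0, 0.0])
q = np.array([-1.0, 0.0])
H1 = sr1_update(np.eye(2), p, q)
print(H1)                          # diag(-1, 1): indefinite, but H1 @ q equals p
print(np.linalg.eigvalsh(H1))

# A pair that is already reproduced by H (p = H q) leaves H unchanged.
H_same = sr1_update(np.eye(2), q, q)
```

The indefinite result is the reason the SR1 update is usually paired with a trust-region strategy rather than a line search.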
Thus, in this paper, we investigate if the performance of the SR1 method can be accelerated using Nesterov’s gradient. We propose a new limited memory Nesterov’s accelerated symmetric rank-1 (L-SR1-N) method and evaluate its performance in comparison to the conventional limited memory symmetric rank-1 (LSR1) method.

3. Proposed Method

Second order quasi-Newton (QN) methods build an approximation of a quadratic model recursively using the curvature information along a generated trajectory. In this section, we first show that the Nesterov’s acceleration when applied to QN satisfies the secant condition and then show the derivation of the proposed Nesterov Accelerated Symmetric Rank-1 Quasi-Newton Method.

Nesterov Accelerated Symmetric Rank-1 Quasi-Newton Method

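Suppose that $E: \mathbb{R}^d \to \mathbb{R}$ is continuously differentiable and that $\mathbf{d} \in \mathbb{R}^d$; then, from the Taylor series, the quadratic model of the objective function at an iterate $\mathbf{w}_k$ is given as
$$E(\mathbf{w}_k + \mathbf{d}) \approx m_k(\mathbf{d}) \approx E(\mathbf{w}_k) + \nabla E(\mathbf{w}_k)^T \mathbf{d} + \frac{1}{2} \mathbf{d}^T \nabla^2 E(\mathbf{w}_k) \mathbf{d}. \tag{13}$$
In order to find the minimizer $\mathbf{d}_k$, we set $\nabla m_k(\mathbf{d}) = 0$ and thus have
$$\mathbf{d}_k = -\nabla^2 E(\mathbf{w}_k)^{-1} \nabla E(\mathbf{w}_k) = -\mathbf{B}_k^{-1} \nabla E(\mathbf{w}_k). \tag{14}$$
The new iterate $\mathbf{w}_{k+1}$ is given as
$$\mathbf{w}_{k+1} = \mathbf{w}_k - \alpha_k \mathbf{B}_k^{-1} \nabla E(\mathbf{w}_k), \tag{15}$$
and the quadratic model at the new iterate is given as
$$E(\mathbf{w}_{k+1} + \mathbf{d}) \approx m_{k+1}(\mathbf{d}) \approx E(\mathbf{w}_{k+1}) + \nabla E(\mathbf{w}_{k+1})^T \mathbf{d} + \frac{1}{2} \mathbf{d}^T \mathbf{B}_{k+1} \mathbf{d}, \tag{16}$$
where $\alpha_k$ is the step length, and $\mathbf{B}_k^{-1} = \mathbf{H}_k$ and its consecutive updates $\mathbf{B}_{k+1}^{-1} = \mathbf{H}_{k+1}$ are symmetric positive definite matrices satisfying the secant condition. Nesterov’s acceleration approximates the quadratic model at $\mathbf{w}_k + \mu_k \mathbf{v}_k$ instead of at the iterate $\mathbf{w}_k$. Here $\mathbf{v}_k = \mathbf{w}_k - \mathbf{w}_{k-1}$ and $\mu_k$ is the momentum coefficient in the range $(0, 1)$. Thus the new iterate $\mathbf{w}_{k+1}$ is given as
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \mu_k \mathbf{v}_k - \alpha_k \mathbf{B}_k^{-1} \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k), \tag{17}$$
$$\phantom{\mathbf{w}_{k+1}} = \mathbf{w}_k + \mu_k \mathbf{v}_k + \alpha_k \mathbf{d}_k. \tag{18}$$
In order to show that the Nesterov accelerated updates also satisfy the secant condition, we require that the gradient of $m_{k+1}$ match the gradient of the objective function at the last two iterates $(\mathbf{w}_k + \mu_k \mathbf{v}_k)$ and $\mathbf{w}_{k+1}$. In other words, we impose the following two requirements on $\mathbf{B}_{k+1}$:
$$\nabla m_{k+1}|_{\mathbf{d}=0} = \nabla E(\mathbf{w}_{k+1} + \mathbf{d})|_{\mathbf{d}=0} = \nabla E(\mathbf{w}_{k+1}), \tag{19}$$
$$\nabla m_{k+1}|_{\mathbf{d}=-\alpha_k \mathbf{d}_k} = \nabla E(\mathbf{w}_{k+1} + \mathbf{d})|_{\mathbf{d}=-\alpha_k \mathbf{d}_k} = \nabla E(\mathbf{w}_{k+1} - \alpha_k \mathbf{d}_k) = \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k). \tag{20}$$
From (16),
$$\nabla m_{k+1}(\mathbf{d}) = \nabla E(\mathbf{w}_{k+1}) + \mathbf{B}_{k+1} \mathbf{d}. \tag{21}$$
Substituting $\mathbf{d} = 0$ in (21), the condition in (19) is satisfied. From (20) and substituting $\mathbf{d} = -\alpha_k \mathbf{d}_k$ in (21), we have
$$\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k) = \nabla E(\mathbf{w}_{k+1}) - \alpha_k \mathbf{B}_{k+1} \mathbf{d}_k. \tag{22}$$
Substituting for $\alpha_k \mathbf{d}_k$ from (18) in (22), we get
$$\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k) = \nabla E(\mathbf{w}_{k+1}) - \mathbf{B}_{k+1} (\mathbf{w}_{k+1} - (\mathbf{w}_k + \mu_k \mathbf{v}_k)). \tag{23}$$
On rearranging the terms, we have the secant condition
$$\mathbf{y}_k = \mathbf{B}_{k+1} \mathbf{s}_k, \tag{24}$$
where
$$\mathbf{y}_k = \nabla E(\mathbf{w}_{k+1}) - \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k) \quad \text{and} \quad \mathbf{s}_k = \mathbf{w}_{k+1} - (\mathbf{w}_k + \mu_k \mathbf{v}_k) = \alpha_k \mathbf{d}_k. \tag{25}$$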
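We have thus shown that the Nesterov accelerated QN update satisfies the secant condition. The update equation of $\mathbf{B}_{k+1}$ for SR1-N can be derived similarly to that of the classic SR1 update [28]. The secant condition requires $\mathbf{B}_k$ to be updated with a symmetric matrix such that $\mathbf{B}_{k+1}$ is also symmetric and satisfies the secant condition. The update of $\mathbf{B}_{k+1}$ is defined using a symmetric rank-1 matrix $\mathbf{u}\mathbf{u}^T$ formed from an arbitrary vector $\mathbf{u}$ and is given as
$$\mathbf{B}_{k+1} = \mathbf{B}_k + \sigma \mathbf{u} \mathbf{u}^T, \tag{26}$$
where $\sigma$ and $\mathbf{u}$ are chosen such that they satisfy the secant condition in (24). Substituting (26) in (24), we get
$$\mathbf{y}_k = \mathbf{B}_k \mathbf{s}_k + (\sigma \mathbf{u}^T \mathbf{s}_k) \mathbf{u}. \tag{27}$$
Since $(\sigma \mathbf{u}^T \mathbf{s}_k)$ is a scalar, we can deduce that $\mathbf{u}$ is a scalar multiple of $\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k$, i.e., $\mathbf{u} = \delta (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)$, and thus have
$$(\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k) = \sigma \delta^2 [\mathbf{s}_k^T (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)] (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k), \tag{28}$$
where
$$\sigma = \mathrm{sign}[\mathbf{s}_k^T (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)] \quad \text{and} \quad \delta = \pm |\mathbf{s}_k^T (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)|^{-1/2}. \tag{29}$$
Thus the proposed Nesterov accelerated symmetric rank-1 (L-SR1-N) update is given as
$$\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{(\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)(\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)^T}{(\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)^T \mathbf{s}_k}. \tag{30}$$
Note that the Hessian update is performed only if the condition in (31) is satisfied; otherwise, $\mathbf{B}_{k+1} = \mathbf{B}_k$.
$$|\mathbf{s}_k^T (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)| \geq \rho \|\mathbf{s}_k\| \|\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k\|. \tag{31}$$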
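By applying the Sherman-Morrison-Woodbury formula [28], we can find $\mathbf{B}_{k+1}^{-1} = \mathbf{H}_{k+1}$ as
$$\mathbf{H}_{k+1} = \mathbf{H}_k + \frac{(\mathbf{s}_k - \mathbf{H}_k \mathbf{y}_k)(\mathbf{s}_k - \mathbf{H}_k \mathbf{y}_k)^T}{(\mathbf{s}_k - \mathbf{H}_k \mathbf{y}_k)^T \mathbf{y}_k}, \tag{32}$$
where
$$\mathbf{y}_k = \nabla E(\mathbf{w}_{k+1}) - \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k) \quad \text{and} \quad \mathbf{s}_k = \mathbf{w}_{k+1} - (\mathbf{w}_k + \mu_k \mathbf{v}_k) = \alpha_k \mathbf{d}_k. \tag{33}$$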
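The relation between the direct update of $\mathbf{B}_k$ and the inverse update of $\mathbf{H}_k$ can be checked numerically. The sketch below is illustrative only: it assumes a hypothetical convex quadratic objective, dense matrices, and helper names (`sr1n_pair`, `sr1_update`) that do not appear in the paper. Applying the same skipping test to the inverse form is also an assumption made here for simplicity.

```python
import numpy as np

def sr1n_pair(grad, w, v, mu, w_next):
    """Curvature pair (s_k, y_k) of (33), measured from the look-ahead point w_k + mu_k v_k."""
    w_hat = w + mu * v
    return w_next - w_hat, grad(w_next) - grad(w_hat)

def sr1_update(M, a, b, rho=1e-8):
    """Generic SR1 update so that the result maps a to b; skipped if a test like (31) fails."""
    r = b - M @ a
    if abs(a @ r) <= rho * np.linalg.norm(a) * np.linalg.norm(r):
        return M
    return M + np.outer(r, r) / (r @ a)

# Hypothetical quadratic E(w) = 0.5 * w^T A w with gradient A w.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 4))
A = Q.T @ Q + np.eye(4)
grad = lambda w: A @ w
w, v, w_next = rng.standard_normal((3, 4))

s, y = sr1n_pair(grad, w, v, mu=0.9, w_next=w_next)
B1 = sr1_update(np.eye(4), s, y)   # direct form: B_{k+1} s_k = y_k
H1 = sr1_update(np.eye(4), y, s)   # inverse form: H_{k+1} y_k = s_k
print(np.allclose(B1 @ s, y), np.allclose(H1 @ y, s), np.allclose(B1 @ H1, np.eye(4)))
```

Starting from $\mathbf{B}_k = \mathbf{H}_k = \mathbf{I}$, the two updated matrices are inverses of each other, as the Sherman-Morrison-Woodbury formula predicts.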
The proposed algorithm is shown in Algorithm 5. We implement the proposed method in its limited memory form, where the Hessian is updated using the most recent $m_L$ curvature information pairs satisfying (31). Here $m_L$ denotes the limited memory size and is chosen such that $m_L \ll b$. The proposed method uses the trust-region approach, where the subproblem is solved using the CG-Steihaug method [28] as shown in Algorithm 6. Also note that the proposed L-SR1-N requires two gradient computations per iteration. The Nesterov’s gradient $\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$ can be approximated [25,29] as a linear combination of past gradients as shown below.
$$\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k) \approx (1 + \mu_k) \nabla E(\mathbf{w}_k) - \mu_k \nabla E(\mathbf{w}_{k-1}). \tag{34}$$
Approximating the Nesterov’s gradient in L-SR1-N in this way yields the momentum accelerated symmetric rank-1 (L-MoSR1) method, which needs only one gradient computation per iteration.
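To give some intuition for this approximation, note that it is exact whenever the gradient is an affine function of the parameters, i.e., for a quadratic objective, because $\mathbf{v}_k = \mathbf{w}_k - \mathbf{w}_{k-1}$. The short sketch below verifies this on a hypothetical quadratic; the function name `approx_nesterov_grad` and the test data are illustrative assumptions, and for a general neural network loss the two sides agree only approximately.

```python
import numpy as np

def approx_nesterov_grad(g_k, g_prev, mu):
    """Linear combination of the two most recent gradients used in place of the look-ahead gradient."""
    return (1.0 + mu) * g_k - mu * g_prev

# Hypothetical quadratic E(w) = 0.5 * w^T A w - c^T w, with gradient A w - c.
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 5))
A = Q.T @ Q
c = rng.standard_normal(5)
grad = lambda w: A @ w - c

w_prev, w_k = rng.standard_normal((2, 5))
mu = 0.9
v_k = w_k - w_prev                                  # v_k = w_k - w_{k-1}

exact = grad(w_k + mu * v_k)                        # costs one extra gradient evaluation
approx = approx_nesterov_grad(grad(w_k), grad(w_prev), mu)
print(np.allclose(exact, approx))
```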
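Algorithm 5 Proposed Algorithm
while $\|\nabla E(\mathbf{w}_k)\| > \epsilon$ and $k < k_{max}$ do
    Determine $\mu_k$
    Compute $\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$
    Find $\mathbf{s}_k$ by the CG-Steihaug subproblem solver in Algorithm 6
    Compute $\rho_k = \dfrac{E(\mathbf{w}_k + \mu_k \mathbf{v}_k) - E(\mathbf{w}_k + \mu_k \mathbf{v}_k + \mathbf{s}_k)}{m_k(0) - m_k(\mathbf{s}_k)}$
    if $\rho_k \geq \eta$ then
        Set $\mathbf{v}_{k+1} = \mu_k \mathbf{v}_k + \mathbf{s}_k$, $\mathbf{w}_{k+1} = \mathbf{w}_k + \mathbf{v}_{k+1}$
    else
        Set $\mathbf{v}_{k+1} = \mathbf{v}_k$, $\mathbf{w}_{k+1} = \mathbf{w}_k$, reset $\mu_k$
    end if
    $\Delta_{k+1} = \mathrm{adjustTR}(\Delta_k, \rho_k)$
    Compute $\mathbf{y}_k = \nabla E(\mathbf{w}_{k+1}) - \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k) + \zeta \mathbf{s}_k$
    Update the $(\mathbf{S}_k, \mathbf{Y}_k)$ buffer with $(\mathbf{s}_k, \mathbf{y}_k)$ if (31) is satisfied
end while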
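The acceptance test in Algorithm 5 compares the actual reduction of the objective with the reduction predicted by the quadratic model. The sketch below illustrates that ratio together with one possible radius rule. The routine adjustTR is not spelled out in the text, so `adjust_tr` and its thresholds (0.25, 0.75) and factors (0.5, 2.0) are purely hypothetical choices in the style of textbook trust-region methods; the quadratic test objective is likewise an assumption.

```python
import numpy as np

def reduction_ratio(E, w_hat, s, g, B):
    """rho_k = actual reduction / predicted reduction at the look-ahead point w_hat."""
    actual = E(w_hat) - E(w_hat + s)
    predicted = -(g @ s + 0.5 * s @ B @ s)    # m_k(0) - m_k(s_k)
    return actual / predicted

def adjust_tr(delta, rho_k, shrink=0.5, grow=2.0):
    """Hypothetical radius rule: shrink on poor agreement, grow on good agreement."""
    if rho_k < 0.25:
        return shrink * delta
    if rho_k > 0.75:
        return grow * delta
    return delta

# Hypothetical quadratic objective; with B equal to the true Hessian the model is exact.
A = np.diag([1.0, 4.0])
E = lambda w: 0.5 * w @ A @ w
w_hat = np.array([1.0, 1.0])
g = A @ w_hat
s = -0.1 * g

rho_k = reduction_ratio(E, w_hat, s, g, A)
print(rho_k, adjust_tr(1.0, rho_k))
```

When the model Hessian matches the true Hessian of a quadratic, the ratio equals one and the step is accepted for any threshold $\eta < 1$.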
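Algorithm 6 CG-Steihaug
Require: Gradient $\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$, tolerance $\epsilon_k > 0$, and trust-region radius $\Delta_k$.
Initialize: Set $\mathbf{z}_0 = 0$, $\mathbf{r}_0 = \nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$, $\mathbf{d}_0 = -\mathbf{r}_0 = -\nabla E(\mathbf{w}_k + \mu_k \mathbf{v}_k)$
if $\|\mathbf{r}_0\| < \epsilon_k$ then
    return $\mathbf{s}_k = \mathbf{z}_0 = 0$
end if
for $i = 0, 1, 2, \ldots$ do
    if $\mathbf{d}_i^T \mathbf{B}_k \mathbf{d}_i \leq 0$ then
        Find $\tau$ such that $\mathbf{s}_k = \mathbf{z}_i + \tau \mathbf{d}_i$ minimizes (41) and satisfies $\|\mathbf{s}_k\| = \Delta_k$
        return $\mathbf{s}_k$
    end if
    Set $\alpha_i = \dfrac{\mathbf{r}_i^T \mathbf{r}_i}{\mathbf{d}_i^T \mathbf{B}_k \mathbf{d}_i}$
    Set $\mathbf{z}_{i+1} = \mathbf{z}_i + \alpha_i \mathbf{d}_i$
    if $\|\mathbf{z}_{i+1}\| \geq \Delta_k$ then
        Find $\tau \geq 0$ such that $\mathbf{s}_k = \mathbf{z}_i + \tau \mathbf{d}_i$ satisfies $\|\mathbf{s}_k\| = \Delta_k$
        return $\mathbf{s}_k$
    end if
    Set $\mathbf{r}_{i+1} = \mathbf{r}_i + \alpha_i \mathbf{B}_k \mathbf{d}_i$
    if $\|\mathbf{r}_{i+1}\| < \epsilon_k$ then
        return $\mathbf{s}_k = \mathbf{z}_{i+1}$
    end if
    Set $\beta_{i+1} = \dfrac{\mathbf{r}_{i+1}^T \mathbf{r}_{i+1}}{\mathbf{r}_i^T \mathbf{r}_i}$
    Set $\mathbf{d}_{i+1} = -\mathbf{r}_{i+1} + \beta_{i+1} \mathbf{d}_i$
end for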
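Algorithm 6 can be sketched in NumPy as follows. This is an illustrative dense-matrix version, not the limited memory implementation used in the experiments: `B` is passed as an explicit matrix, the names `boundary_taus` and `cg_steihaug` are hypothetical, and the iteration cap `max_iter` is an added safeguard that is not part of the algorithm as stated.

```python
import numpy as np

def boundary_taus(z, d, delta):
    """Both roots tau of ||z + tau * d|| = delta (z is assumed to lie inside the region)."""
    a, b, c = d @ d, 2.0 * (z @ d), z @ z - delta ** 2
    disc = np.sqrt(b * b - 4.0 * a * c)
    return (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)

def cg_steihaug(g, B, delta, eps=1e-10, max_iter=None):
    """Approximately minimize g^T s + 0.5 s^T B s subject to ||s|| <= delta."""
    z = np.zeros_like(g)
    r = g.copy()
    d = -r
    if np.linalg.norm(r) < eps:
        return z
    for _ in range(max_iter or 2 * len(g)):
        Bd = B @ d
        curv = d @ Bd
        if curv <= 0:
            # Non-positive curvature: move to the boundary point with the lower model value.
            tau = min(boundary_taus(z, d, delta),
                      key=lambda t: t * (r @ d) + 0.5 * t * t * curv)
            return z + tau * d
        alpha = (r @ r) / curv
        z_next = z + alpha * d
        if np.linalg.norm(z_next) >= delta:
            # The CG step leaves the trust region: stop on the boundary.
            return z + boundary_taus(z, d, delta)[1] * d
        r_next = r + alpha * Bd
        if np.linalg.norm(r_next) < eps:
            return z_next
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        z, r = z_next, r_next
    return z
```

With a positive definite `B` and a large radius the routine returns (up to the tolerance) the Newton step; with a small radius or negative curvature it returns a step of length `delta`.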

4. Convergence Analysis

In this section we discuss the convergence of the proposed Nesterov accelerated symmetric rank-1 (L-SR1-N) algorithm in its limited memory form. As mentioned earlier, Nesterov’s acceleration approximates the quadratic model at $\mathbf{w}_k + \mu_k \mathbf{v}_k$ instead of at the iterate $\mathbf{w}_k$. For ease of representation, we write $\hat{\mathbf{w}}_k = \mathbf{w}_k + \mu_k \mathbf{v}_k$ $\forall k = 1, 2, \ldots, k_{max} \in \mathbb{N}$. In the limited memory scheme, the Hessian matrix can be implicitly constructed using the most recent $m_L$ curvature information pairs $\{\mathbf{s}_i, \mathbf{y}_i\}_{i=k-1}^{k-m_L-1}$. At a given iteration $k$, we define matrices $\mathbf{S}_k$ and $\mathbf{Y}_k$ of dimensions $d \times m_L$ as
$$\mathbf{S}_k = [\mathbf{s}_{k-1}, \mathbf{s}_{k-2}, \ldots, \mathbf{s}_{k-m_L-1}] \quad \text{and} \quad \mathbf{Y}_k = [\mathbf{y}_{k-1}, \mathbf{y}_{k-2}, \ldots, \mathbf{y}_{k-m_L-1}], \tag{35}$$
where the curvature pairs $\{\mathbf{s}_i, \mathbf{y}_i\}_{i=k-1}^{k-m_L-1}$ are each vectors of dimensions $d \times 1$. The Hessian approximation in (30) can be expressed in its compact representation form [30] as
$$\mathbf{B}_k = \mathbf{B}_0 + (\mathbf{Y}_k - \mathbf{B}_0 \mathbf{S}_k)(\mathbf{L}_k + \mathbf{D}_k + \mathbf{L}_k^T - \mathbf{S}_k^T \mathbf{B}_0 \mathbf{S}_k)^{-1} (\mathbf{Y}_k - \mathbf{B}_0 \mathbf{S}_k)^T, \tag{36}$$
where $\mathbf{B}_0$ is the initial $d \times d$ Hessian matrix, $\mathbf{L}_k$ is an $m_L \times m_L$ lower triangular matrix and $\mathbf{D}_k$ is an $m_L \times m_L$ diagonal matrix as given below,
$$\mathbf{B}_0 = \gamma_k \mathbf{I}, \qquad (\mathbf{L}_k)_{i,j} = \begin{cases} \mathbf{s}_i^T \mathbf{y}_j & \text{if } i > j, \\ 0 & \text{otherwise}, \end{cases} \qquad \mathbf{D}_k = \mathrm{diag}[\mathbf{S}_k^T \mathbf{Y}_k]. \tag{37}$$
Let $\Omega$ be the level set such that $\Omega = \{\mathbf{w} \in \mathbb{R}^d : E(\mathbf{w}) \leq E(\mathbf{w}_0)\}$ and let $\{\mathbf{s}_k\}$, $k = 1, 2, \ldots, k_{max} \in \mathbb{N}$, denote the sequence generated by the explicit trust-region algorithm, where $\Delta_k$ is the trust-region radius of the successful update step. We choose $\gamma_k = 0$. Since the curvature information pairs $(\mathbf{s}_k, \mathbf{y}_k)$ given by (33) are stored in $\mathbf{S}_k$ and $\mathbf{Y}_k$ only if they satisfy the condition in (31), the matrix $\mathbf{M}_k = (\mathbf{L}_k + \mathbf{D}_k + \mathbf{L}_k^T - \mathbf{S}_k^T \mathbf{B}_0 \mathbf{S}_k)$ is invertible and positive semi-definite.
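The compact representation can be compared against the recursive rank-1 updates in a few lines of NumPy. The sketch below makes several assumptions that are not part of the paper: the pairs come from a hypothetical convex quadratic (so $\mathbf{y}_i = \mathbf{A}\mathbf{s}_i$ and $\mathbf{s}_i^T \mathbf{y}_j = \mathbf{s}_j^T \mathbf{y}_i$, which makes the column ordering immaterial), every update is well defined, and the function names are illustrative.

```python
import numpy as np

def sr1_recursive(B0, S, Y):
    """Apply the rank-1 update (30) once per stored pair, column by column."""
    B = B0.copy()
    for s, y in zip(S.T, Y.T):
        r = y - B @ s
        B = B + np.outer(r, r) / (r @ s)
    return B

def sr1_compact(gamma, S, Y):
    """Compact form B_0 + Psi M^{-1} Psi^T with B_0 = gamma * I."""
    B0 = gamma * np.eye(S.shape[0])
    SY = S.T @ Y
    L = np.tril(SY, k=-1)                 # (L)_{ij} = s_i^T y_j for i > j
    D = np.diag(np.diag(SY))              # D = diag[S^T Y]
    M = L + D + L.T - S.T @ B0 @ S        # small m_L x m_L matrix
    Psi = Y - B0 @ S
    return B0 + Psi @ np.linalg.solve(M, Psi.T)

# Hypothetical data: d = 6 parameters, m_L = 3 stored pairs from a quadratic.
rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 6))
A = Q.T @ Q + np.eye(6)
S = rng.standard_normal((6, 3))
Y = A @ S

for gamma in (0.0, 0.5):
    print(gamma, np.allclose(sr1_recursive(gamma * np.eye(6), S, Y), sr1_compact(gamma, S, Y)))
```

Only an $m_L \times m_L$ system is solved in the compact form, which is what makes the limited memory scheme cheap when $m_L$ is much smaller than $d$.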
Assumption 1.
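The sequences of iterates $\mathbf{w}_k$ and $\hat{\mathbf{w}}_k$ $\forall k = 1, 2, \ldots, k_{max} \in \mathbb{N}$ remain in the closed and bounded set $\Omega$ on which the objective function is twice continuously differentiable and has a Lipschitz continuous gradient, i.e., there exists a constant $L > 0$ such that
$$\|\nabla E(\mathbf{w}_{k+1}) - \nabla E(\hat{\mathbf{w}}_k)\| \leq L \|\mathbf{w}_{k+1} - \hat{\mathbf{w}}_k\| \quad \forall\, \mathbf{w}_{k+1}, \hat{\mathbf{w}}_k \in \mathbb{R}^d. \tag{38}$$
Assumption 2.
The Hessian matrix is bounded and well-defined, i.e., there exist constants $\rho$ and $M$ such that
$$\rho \leq \|\mathbf{B}_k\| \leq M \quad \forall k = 1, 2, \ldots, k_{max} \in \mathbb{N}, \tag{39}$$
and for each iteration $k$
$$|\mathbf{s}_k^T (\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)| \geq \rho \|\mathbf{s}_k\| \|\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k\|. \tag{40}$$
Assumption 3.
Let $\mathbf{B}_k$ be any $d \times d$ symmetric matrix and $\mathbf{s}_k$ be an optimal solution to the trust-region subproblem,
$$\min_{\mathbf{d}} m_k(\mathbf{d}) = E(\hat{\mathbf{w}}_k) + \mathbf{d}^T \nabla E(\hat{\mathbf{w}}_k) + \frac{1}{2} \mathbf{d}^T \mathbf{B}_k \mathbf{d}, \tag{41}$$
where $\hat{\mathbf{w}}_k + \mathbf{d}$ lies in the trust region. Then for all $k \geq 0$,
$$\left| \nabla E(\hat{\mathbf{w}}_k)^T \mathbf{s}_k + \frac{1}{2} \mathbf{s}_k^T \mathbf{B}_k \mathbf{s}_k \right| \geq \frac{1}{2} \|\nabla E(\hat{\mathbf{w}}_k)\| \min\left\{ \Delta_k, \frac{\|\nabla E(\hat{\mathbf{w}}_k)\|}{\|\mathbf{B}_k\|} \right\}. \tag{42}$$
This assumption ensures that the trust-region subproblem is solved to a sufficiently optimal solution at every iteration. It can be justified along the lines of the trust-region proof by Powell.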
Lemma 1.
If assumptions A1 to A3 hold, $\mathbf{s}_k$ is an optimal solution to the trust-region subproblem given in (41), and the initial $\gamma_k$ is bounded (i.e., $0 \leq \gamma_k \leq \bar{\gamma}_k$), then for all $k \geq 0$, the Hessian update given by Algorithm 5 and (26) is bounded.
Proof. 
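We begin with the proof for the general case [31], where the Hessian is bounded by
$$\|\mathbf{B}_k^{(j)}\| \leq \left(1 + \frac{1}{\rho}\right)^j \gamma_k + \left[\left(1 + \frac{1}{\rho}\right)^j - 1\right] M. \tag{43}$$
The proof of (43) is by mathematical induction. Let $m_L$ be the limited memory size and $(\mathbf{s}_{k,j}, \mathbf{y}_{k,j})$ be the curvature information pairs given by (33) at the $k$th iteration for $j = 1, 2, \ldots, m_L$. For $j = 0$, we can see that (43) holds true. Let us assume that (43) holds true for some $j \geq 0$. Thus for $j + 1$ we have
$$\mathbf{B}_k^{(j+1)} = \mathbf{B}_k^{(j)} + \frac{(\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1})(\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1})^T}{(\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1})^T \mathbf{s}_{k,j+1}} \tag{44}$$
$$\|\mathbf{B}_k^{(j+1)}\| \leq \|\mathbf{B}_k^{(j)}\| + \left\| \frac{(\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1})(\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1})^T}{(\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1})^T \mathbf{s}_{k,j+1}} \right\| \tag{45}$$
$$\leq \|\mathbf{B}_k^{(j)}\| + \frac{\|(\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1})(\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1})^T\|}{\rho \|\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1}\| \|\mathbf{s}_{k,j+1}\|} \tag{46}$$
$$\leq \|\mathbf{B}_k^{(j)}\| + \frac{\|\mathbf{y}_{k,j+1} - \mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1}\|}{\rho \|\mathbf{s}_{k,j+1}\|} \tag{47}$$
$$\leq \|\mathbf{B}_k^{(j)}\| + \frac{\|\mathbf{y}_{k,j+1}\|}{\rho \|\mathbf{s}_{k,j+1}\|} + \frac{\|\mathbf{B}_k^{(j)} \mathbf{s}_{k,j+1}\|}{\rho \|\mathbf{s}_{k,j+1}\|} \tag{48}$$
$$\leq \|\mathbf{B}_k^{(j)}\| + \frac{\|\mathbf{y}_{k,j+1}\|}{\rho \|\mathbf{s}_{k,j+1}\|} + \frac{\|\mathbf{B}_k^{(j)}\|}{\rho} \tag{49}$$
$$\leq \left(1 + \frac{1}{\rho}\right) \|\mathbf{B}_k^{(j)}\| + \frac{M}{\rho} \tag{50}$$
$$\leq \left(1 + \frac{1}{\rho}\right) \left[ \left(1 + \frac{1}{\rho}\right)^j \gamma_k + \left[\left(1 + \frac{1}{\rho}\right)^j - 1\right] M \right] + \frac{M}{\rho} \tag{51}$$
$$\|\mathbf{B}_k^{(j+1)}\| \leq \left(1 + \frac{1}{\rho}\right)^{j+1} \gamma_k + \left[\left(1 + \frac{1}{\rho}\right)^{j+1} - 1\right] M. \tag{52}$$
Since we use the limited memory scheme, $\mathbf{B}_{k+1} = \mathbf{B}_k^{(m_L)}$, where $m_L$ is the limited memory size. Therefore, the Hessian approximation at the $k$th iteration satisfies
$$\|\mathbf{B}_{k+1}\| \leq \left(1 + \frac{1}{\rho}\right)^{m_L} \gamma_k + \left[\left(1 + \frac{1}{\rho}\right)^{m_L} - 1\right] M. \tag{53}$$
We choose $\gamma_k = 0$, as it removes the need to choose a hyperparameter for the initial Hessian $\mathbf{B}_k^{(0)} = \gamma_k \mathbf{I}$ and also ensures that the CG subproblem solver (Algorithm 6) terminates in at most $m_L$ iterations [22]. Thus the Hessian approximation at the $k$th iteration satisfies (54) and is still bounded.
$$\|\mathbf{B}_{k+1}\| \leq \left[\left(1 + \frac{1}{\rho}\right)^{m_L} - 1\right] M. \tag{54}$$
This completes the inductive proof.  □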
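The final step of the induction rests on a short algebraic simplification that the proof leaves implicit. It can be written out as follows, where the shorthand $c = 1 + 1/\rho$ is introduced here only for readability and is not notation from the paper.

```latex
% Shorthand (assumption for illustration): c = 1 + 1/\rho.
\begin{align*}
c\left[c^{j}\gamma_k + \left(c^{j}-1\right)M\right] + \frac{M}{\rho}
  &= c^{j+1}\gamma_k + \left(c^{j+1} - c\right)M + \frac{M}{\rho} \\
  &= c^{j+1}\gamma_k + \left(c^{j+1} - 1 - \tfrac{1}{\rho} + \tfrac{1}{\rho}\right)M \\
  &= c^{j+1}\gamma_k + \left(c^{j+1} - 1\right)M .
\end{align*}
% With \gamma_k = 0 and j + 1 = m_L this reduces to (c^{m_L} - 1) M, the bound in (54).
```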
Theorem 1.
Given a level set $\Omega = \{\mathbf{w} \in \mathbb{R}^d : E(\mathbf{w}) \leq E(\mathbf{w}_0)\}$ that is bounded, let $\{\mathbf{w}_k\}$ be the sequence of iterates generated by Algorithm 5. If assumptions (A1) to (A3) hold true, then we have
$$\lim_{k \to \infty} \|\nabla E(\mathbf{w}_k)\| = 0. \tag{55}$$
Proof. 
From the derivation of the proposed L-SR1-N algorithm, it is shown that Nesterov’s acceleration applied to the quasi-Newton method satisfies the secant condition. The proposed algorithm ensures that the Hessian update is well defined, as the curvature pairs used in the Hessian update satisfy (31) for all $k$. The sequence of updates is generated by the trust-region method, where $\mathbf{s}_k$ is the optimal solution to the subproblem in (41). From Theorem 2.2 in [32], it can be shown that the updates made by the trust-region method converge to a stationary point. Since $\mathbf{B}_k$ is shown to be bounded (Lemma 1), it follows from that theorem that as $k \to \infty$, $\mathbf{w}_k$ converges to a point such that $\|\nabla E(\mathbf{w}_k)\| = 0$.   □

5. Simulation Results

We evaluate the performance of the proposed Nesterov accelerated symmetric rank-1 quasi-Newton (L-SR1-N) method in its limited memory form in comparison to conventional first-order and second-order methods. We illustrate the performance in both the full batch and the stochastic/mini-batch settings. The hyperparameters are set to their default values. The momentum coefficient $\mu_k$ is set to 0.9 in NAG and 0.85 in oLNAQ [33]. For L-NAQ [34], L-MoQ [35], and the proposed methods, the momentum coefficient $\mu_k$ is set adaptively. The adaptive $\mu_k$ is obtained from the following equations, where $\theta_k$ is initialized to 1 and $\eta = 10^{-6}$.
$$\mu_k = \theta_k (1 - \theta_k) / (\theta_k^2 + \theta_{k+1}), \tag{56}$$
$$\theta_{k+1}^2 = (1 - \theta_{k+1}) \theta_k^2 + \eta \theta_{k+1}. \tag{57}$$
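The second relation defines $\theta_{k+1}$ only implicitly; it is a quadratic in $\theta_{k+1}$ whose positive root can be taken in closed form. The following is a minimal sketch of the resulting schedule, assuming the initial value $\theta = 1$ and $\eta = 10^{-6}$; the function names are hypothetical and the sketch ignores the momentum reset used in Algorithm 5.

```python
import math

def next_theta(theta, eta=1e-6):
    """Positive root t of t^2 = (1 - t) * theta^2 + eta * t."""
    b = theta * theta - eta
    return 0.5 * (-b + math.sqrt(b * b + 4.0 * theta * theta))

def momentum_schedule(n_iter, theta0=1.0, eta=1e-6):
    """Adaptive momentum coefficients mu_k for n_iter consecutive iterations."""
    theta, mus = theta0, []
    for _ in range(n_iter):
        theta_next = next_theta(theta, eta)
        mus.append(theta * (1.0 - theta) / (theta * theta + theta_next))
        theta = theta_next
    return mus

mus = momentum_schedule(100)
print(mus[0], mus[1], mus[-1])   # starts at 0 and approaches 1 as k grows
```

Under these assumptions the schedule starts with no momentum and gradually increases it, so early iterations behave like the unaccelerated method.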

5.1. Results of the Levy Function Approximation Problem

Consider the following Levy function approximation problem to be modeled by a neural network.
$$f(x_1 \ldots x_p) = \frac{\pi}{p} \left\{ \sum_{i=1}^{p-1} \left[ (x_i - 1)^2 (1 + 10 \sin^2(\pi x_{i+1})) \right] + 10 \sin^2(\pi x_1) + (x_p - 1)^2 \right\}, \quad x_i \in [-4, 4], \forall i. \tag{58}$$
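For reference, the target function can be evaluated with a short vectorized NumPy routine. The function name `levy` and the uniform sampling of inputs are assumptions made for illustration; how the training samples were actually drawn is not stated here.

```python
import numpy as np

def levy(x):
    """Levy function of (58); x has shape (..., p) and the result has shape (...)."""
    x = np.asarray(x, dtype=float)
    p = x.shape[-1]
    inner = np.sum((x[..., :-1] - 1.0) ** 2
                   * (1.0 + 10.0 * np.sin(np.pi * x[..., 1:]) ** 2), axis=-1)
    return (np.pi / p) * (inner
                          + 10.0 * np.sin(np.pi * x[..., 0]) ** 2
                          + (x[..., -1] - 1.0) ** 2)

# Hypothetical sampling of inputs in [-4, 4]^5.
rng = np.random.default_rng(0)
X = rng.uniform(-4.0, 4.0, size=(5000, 5))
targets = levy(X)
print(targets.shape, levy(np.ones(5)), levy(np.zeros(5)))
```

The function vanishes at $x_i = 1$ for all $i$, and at the origin with $p = 5$ it evaluates to $\pi$, which gives two quick sanity checks.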
The performance of the proposed L-SR1-N and L-MoSR1 is evaluated on the Levy function (58) where $p = 5$. Therefore the inputs to the neural network are $\{x_1, x_2, \ldots, x_5\}$. We use a single hidden layer with 50 hidden neurons. The neural network architecture is thus $5{-}50{-}1$. We terminate the training at $k_{max} = 10{,}000$, and set $\epsilon = 10^{-6}$ and $m_L = 10$. Sigmoid and linear activation functions are used for the hidden and output layers, respectively. The mean squared error function is used. The number of parameters is $d = 351$. Note that we use full batch for the training in this example and the number of training samples is $n = 5000$. Figure 1 shows the average results of 30 independent trials. The results confirm that the proposed L-SR1-N and L-MoSR1 perform better than the first-order methods as well as the conventional LSR1 and rank-2 LBFGS quasi-Newton methods. Furthermore, it can be observed that incorporating Nesterov’s gradient in LSR1 significantly improves the performance, bringing it almost on par with the rank-2 Nesterov accelerated L-NAQ and momentum accelerated L-MoQ methods. Thus we can confirm that the limited memory symmetric rank-1 quasi-Newton method can be significantly accelerated using Nesterov’s gradient. From the iterations vs. training error plot, we can observe that L-SR1-N and L-MoSR1 are almost identical in performance. This verifies that the approximation applied to L-SR1-N in L-MoSR1 is valid, and has an advantage in terms of computation wall time. This can be observed in the time vs. training error plot, where the L-MoSR1 method converges much faster than the other first- and second-order methods under comparison.
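The parameter count quoted above follows directly from the layer sizes: each fully connected layer contributes one weight per input-output connection plus one bias per output. A small helper (hypothetical name `count_params`) makes the arithmetic explicit.

```python
def count_params(layers):
    """Number of weights and biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers[:-1], layers[1:]))

# 5-50-1 network: (5*50 + 50) + (50*1 + 1) = 300 + 51
print(count_params([5, 50, 1]))  # → 351
```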

5.2. Results of MNIST Image Classification Problem

In large-scale optimization problems, owing to the massive amount of data and the large number of parameters of the neural network model, training the neural network using full batch is not feasible. Hence a stochastic approach is more desirable, in which the neural networks are trained using a relatively small subset of the training data, thereby significantly reducing the computational and memory requirements. However, getting second-order methods to work in a stochastic setting is a challenging task. A common problem in stochastic/mini-batch training is the sampling noise that arises because the gradients are estimated on different mini-batch samples at each iteration. In this section, we evaluate the performance of the proposed L-SR1-N and L-MoSR1 methods in the stochastic/mini-batch setting. We use the MNIST handwritten digit image classification problem for the evaluation. The MNIST dataset consists of 50,000 train and 10,000 test samples of $28 \times 28$ pixel images of handwritten digits from 0 to 9 that need to be classified. We evaluate the performance on this image classification task using a simple fully connected neural network and the LeNet-5 architecture. In a stochastic setting, the conventional LBFGS method is known to be affected by sampling noise; to alleviate this issue, [16] proposed the oLBFGS method, which computes two gradients per iteration. We thus compare the performance of our proposed method against both the naive stochastic LBFGS (denoted here as oLBFGS-1) and the oLBFGS proposed in [16].

5.2.1. Results of MNIST on Fully Connected Neural Networks

We first consider a simple fully connected neural network with two hidden layers of 100 and 50 hidden neurons, respectively. Thus, the neural network architecture used is $784{-}100{-}50{-}10$. The hidden layers use the ReLU activation function and the loss function used is the softmax cross-entropy loss. Figure 2 shows the performance comparison with a batch size $b = 128$ and limited memory size $m_L = 8$. It can be observed that the second-order quasi-Newton methods converge faster than the first-order methods in the first 500 iterations. From the results we can see that even though the stochastic L-SR1-N (oL-SR1-N) and stochastic MoSR1 (oL-MoSR1) do not perform the best on the small network, they significantly improve on the stochastic LSR1 (oLSR1) method and perform better than the oLBFGS-1 method. Since our aim is to investigate the effectiveness of Nesterov’s acceleration on SR1, we focus on the performance comparison of oLBFGS-1, oLSR1 and the proposed oL-SR1-N and oL-MoSR1 methods. As seen from Figure 2, oLBFGS-1 and oLSR1 do not further improve the test accuracy or test loss after 1000 iterations. However, incorporating Nesterov’s acceleration significantly improves the performance compared to the conventional oLSR1 and oLBFGS-1, thus confirming the effectiveness of Nesterov’s acceleration on LSR1 in the stochastic setting.

5.2.2. Results of MNIST on LeNet-5 Architecture

Next, we evaluate the performance of the proposed methods on a bigger network with convolutional layers. The LeNet-5 architecture consists of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully connected layers and finally a softmax classifier. The number of parameters is $d = 61{,}706$. Figure 3 shows the performance comparison when trained with a batch size of $b = 256$ and a limited memory size of $m_L = 8$. From the results, we can observe that oLNAQ performs the best. However, the proposed oL-SR1-N method performs better than the first order SGD, NAG and Adam methods as well as the second order oLSR1, oLBFGS-1 and oLBFGS methods. This confirms that incorporating Nesterov’s gradient can accelerate and significantly improve the performance of the conventional LSR1 method, even in the stochastic setting.
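The parameter count quoted above can be reproduced with a short calculation. The layer shapes used here (5 × 5 kernels, 6, 16 and 120 output channels, a dense layer of 84 units, fully connected channels between convolutions, and parameter-free average pooling) are the commonly used LeNet-5 configuration and are an assumption, since the text does not list them.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus one bias per output channel of a square convolution."""
    return out_channels * (in_channels * kernel * kernel + 1)


def dense_params(fan_in, fan_out):
    """Weights plus one bias per output neuron of a dense layer."""
    return fan_out * (fan_in + 1)


# Assumed LeNet-5 layout; average pooling layers carry no parameters here.
lenet5_layers = [
    ("conv1", conv_params(1, 6, 5)),
    ("conv2", conv_params(6, 16, 5)),
    ("conv3 (flattening)", conv_params(16, 120, 5)),
    ("fc1", dense_params(120, 84)),
    ("fc2 (softmax output)", dense_params(84, 10)),
]

for name, count in lenet5_layers:
    print(f"{name}: {count}")

d = sum(count for _, count in lenet5_layers)
print(d)  # 61706
```

With these assumed shapes the total matches the reported value of 61,706.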

6. Conclusions and Future Works

Acceleration techniques such as Nesterov’s acceleration have been shown to speed up convergence, as in the cases of NAG accelerating GD and NAQ accelerating the BFGS method. Second order methods are said to achieve better convergence than first order methods and are more suitable for parallel and distributed implementations. While the BFGS quasi-Newton method is the most extensively studied method in the context of deep learning and neural networks, there are other methods in the quasi-Newton family, such as the Symmetric Rank-1 (SR1) method, which have been shown to be effective in optimization but have not been extensively studied in the context of neural networks. SR1 methods converge towards the true Hessian faster than BFGS and have computational advantages for sparse or partially separable problems [17]. Thus, investigating acceleration techniques for the SR1 method is significant. We investigated whether Nesterov’s acceleration, which improves the performance of BFGS in the form of NAQ, can also improve the performance of other quasi-Newton methods such as SR1, and compared the performance among second order Nesterov’s accelerated variants. To this end, we have introduced a new limited memory Nesterov accelerated symmetric rank-1 (L-SR1-N) method for training neural networks. We compared the results with LNAQ to give a sense of how Nesterov’s acceleration affects the two methods of the quasi-Newton family, namely BFGS and SR1. The results confirm that the performance of the LSR1 method can be significantly improved in both the full batch and the stochastic settings by introducing Nesterov’s accelerated gradient. Furthermore, it can be observed that the proposed L-SR1-N method is competitive with LNAQ and is substantially better than the first order methods and the second order LSR1 and LBFGS methods.
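The Hessian approximation property of SR1 mentioned above can be illustrated on a small quadratic. The following is a minimal sketch of the standard full-memory SR1 update with the usual skipping safeguard, not the limited memory, Nesterov accelerated method proposed in this paper; the matrix `H`, the steps and the threshold `r` are made-up illustrative values.

```python
import math


def matvec(M, v):
    """Dense matrix-vector product for lists of lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sr1_update(B, s, y, r=1e-8):
    """Standard SR1 update B + (y - Bs)(y - Bs)^T / ((y - Bs)^T s).

    The update is skipped when the denominator is too small relative to
    ||s|| * ||y - Bs|| (r is an illustrative threshold).
    """
    resid = [y_i - bs_i for y_i, bs_i in zip(y, matvec(B, s))]
    denom = dot(resid, s)
    if abs(denom) <= r * math.sqrt(dot(s, s)) * math.sqrt(dot(resid, resid)):
        return B  # skip: update would be ill-defined or unstable
    n = len(s)
    return [[B[i][j] + resid[i] * resid[j] / denom for j in range(n)]
            for i in range(n)]


# Quadratic model with a known (hypothetical) Hessian H, so y = H s exactly.
H = [[2.0, 1.0], [1.0, 3.0]]
B = [[1.0, 0.0], [0.0, 1.0]]  # initial approximation: identity
for s in ([1.0, 0.0], [0.0, 1.0]):  # two linearly independent steps
    y = matvec(H, s)
    B = sr1_update(B, s, y)

print(B)  # [[2.0, 1.0], [1.0, 3.0]]
```

In this toy case the approximation equals the true Hessian after two linearly independent steps, which is the behavior on quadratics that motivates interest in SR1.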
It is shown both theoretically and empirically that the proposed L-SR1-N converges to a stationary point. From the results, it can also be noted that, unlike in the full batch example, the performance of oL-SR1-N and that of oL-MoSR1 do not correlate well in the stochastic setting. This can be attributed to sampling noise, similar to the difference between oLBFGS-1 and oLBFGS. In the stochastic setting, the curvature information vector $y_k$ of oL-MoSR1 is approximated from gradients computed on different mini-batch samples. This could introduce sampling noise and hence result in oL-MoSR1 not being a close approximation of the stochastic oL-SR1-N method. Future work could involve addressing the sampling noise problem with multi-batch strategies such as in [36], and further improving the performance of L-SR1-N. Furthermore, a detailed study on larger networks and problems with different hyperparameter settings could test the limits of the proposed method.
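The effect of computing $y_k$ from gradients on different mini-batches can be seen on a one-dimensional least-squares toy problem. This is only an illustrative sketch with made-up data and hypothetical names; it is not the authors' implementation. For a quadratic loss, a gradient difference taken on a single mini-batch reproduces that batch's curvature exactly, whereas mixing two mini-batches adds a term unrelated to the step.

```python
def batch_grad(w, batch):
    """Mean gradient of 0.5 * (a * w - b)**2 over a mini-batch of (a, b)."""
    return sum(a * (a * w - b) for a, b in batch) / len(batch)


def batch_curvature(batch):
    """Mean second derivative (a**2) over the mini-batch."""
    return sum(a * a for a, _ in batch) / len(batch)


# Two hypothetical mini-batches of (a, b) samples.
batch_a = [(1.0, 2.0), (2.0, 1.0)]
batch_b = [(3.0, 0.0), (0.5, 4.0)]

w_old, w_new = 0.0, 0.5
s = w_new - w_old

# Both gradients on the same mini-batch: y / s equals that batch's curvature.
y_same = batch_grad(w_new, batch_a) - batch_grad(w_old, batch_a)

# Gradients on different mini-batches: y also contains the gap between the
# two batch gradients, i.e., sampling noise.
y_mixed = batch_grad(w_new, batch_b) - batch_grad(w_old, batch_a)

print(y_same / s, batch_curvature(batch_a))   # 2.5 2.5
print(y_mixed / s, batch_curvature(batch_b))  # 6.625 4.625
```

In the mixed case the implied curvature (6.625) matches neither mini-batch, which gives an intuition for why curvature pairs built across mini-batches can degrade the quasi-Newton approximation.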

Author Contributions

Conceptualization, S.I. and S.M.; methodology, S.I.; software, S.I.; formal analysis, S.I., S.M. and H.N.; validation, S.I., S.M., H.N. and T.K.; writing (original draft preparation), S.I.; writing (review and editing), S.I. and S.M.; resources, H.A.; supervision, H.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Source code is available on: https://github.com/indra-ipd/sr1-n (accessed on 23 December 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bottou, L.; LeCun, Y. Large scale online learning. Adv. Neural Inf. Process. Syst. 2004, 16, 217–224.
  2. Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 177–186.
  3. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407.
  4. Peng, X.; Li, L.; Wang, F.Y. Accelerating minibatch stochastic gradient descent using typicality sampling. IEEE Trans. Neural Networks Learn. Syst. 2019, 31, 4649–4659.
  5. Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 2013, 26, 315–323.
  6. Nesterov, Y.E. A method for solving the convex programming problem with convergence rate O(1/k^2). Dokl. Akad. Nauk SSSR 1983, 269, 543–547.
  7. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
  8. Tieleman, T.; Hinton, G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. Neural Netw. Mach. Learn. 2012, 4, 26–31.
  9. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  10. Martens, J. Deep learning via Hessian-free optimization. ICML 2010, 27, 735–742.
  11. Roosta-Khorasani, F.; Mahoney, M.W. Sub-sampled Newton methods I: Globally convergent algorithms. arXiv 2016, arXiv:1601.04737.
  12. Dennis, J.E., Jr.; Moré, J.J. Quasi-Newton methods, motivation and theory. SIAM Rev. 1977, 19, 46–89.
  13. Mokhtari, A.; Ribeiro, A. RES: Regularized stochastic BFGS algorithm. IEEE Trans. Signal Process. 2014, 62, 6089–6104.
  14. Mokhtari, A.; Ribeiro, A. Global convergence of online limited memory BFGS. J. Mach. Learn. Res. 2015, 16, 3151–3181.
  15. Byrd, R.H.; Hansen, S.L.; Nocedal, J.; Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 2016, 26, 1008–1031.
  16. Schraudolph, N.N.; Yu, J.; Günter, S. A stochastic quasi-Newton method for online convex optimization. Artif. Intell. Stat. 2007, 26, 436–443.
  17. Byrd, R.H.; Khalfan, H.F.; Schnabel, R.B. Analysis of a symmetric rank-one trust region method. SIAM J. Optim. 1996, 6, 1025–1039.
  18. Brust, J.; Erway, J.B.; Marcia, R.F. On solving L-SR1 trust-region subproblems. Comput. Optim. Appl. 2017, 66, 245–266.
  19. Spellucci, P. A modified rank one update which converges Q-superlinearly. Comput. Optim. Appl. 2001, 19, 273–296.
  20. Modarres, F.; Hassan, M.A.; Leong, W.J. A symmetric rank-one method based on extra updating techniques for unconstrained optimization. Comput. Math. Appl. 2011, 62, 392–400.
  21. Khalfan, H.F.; Byrd, R.H.; Schnabel, R.B. A theoretical and experimental study of the symmetric rank-one update. SIAM J. Optim. 1993, 3, 1–24.
  22. Jahani, M.; Nazari, M.; Rusakov, S.; Berahas, A.S.; Takáč, M. Scaling up quasi-Newton algorithms: Communication efficient distributed SR1. In Proceedings of the International Conference on Machine Learning, Optimization, and Data Science, Siena, Italy, 19–23 July 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 41–54.
  23. Berahas, A.; Jahani, M.; Richtarik, P.; Takáč, M. Quasi-Newton methods for machine learning: Forget the past, just sample. Optim. Methods Softw. 2021, 36, 1–37.
  24. Ninomiya, H. A novel quasi-Newton-based optimization for neural network training incorporating Nesterov’s accelerated gradient. Nonlinear Theory Its Appl. IEICE 2017, 8, 289–301.
  25. Mahboubi, S.; Indrapriyadarsini, S.; Ninomiya, H.; Asai, H. Momentum acceleration of quasi-Newton based optimization technique for neural network training. Nonlinear Theory Its Appl. IEICE 2021, 12, 554–574.
  26. Sutskever, I.; Martens, J.; Dahl, G.E.; Hinton, G.E. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; Volume 28, pp. 1139–1147.
  27. O’Donoghue, B.; Candes, E. Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 2015, 15, 715–732.
  28. Nocedal, J.; Wright, S.J. Numerical Optimization, 2nd ed.; Springer Series in Operations Research; Springer: Berlin/Heidelberg, Germany, 2006.
  29. Mahboubi, S.; Indrapriyadarsini, S.; Ninomiya, H.; Asai, H. Momentum Acceleration of Quasi-Newton Training for Neural Networks. In Pacific Rim International Conference on Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2019; pp. 268–281.
  30. Byrd, R.H.; Nocedal, J.; Schnabel, R.B. Representations of quasi-Newton matrices and their use in limited memory methods. Math. Program. 1994, 63, 129–156.
  31. Lu, X.; Byrd, R.H. A Study of the Limited Memory SR1 Method in Practice. Ph.D. Thesis, University of Colorado at Boulder, Boulder, CO, USA, 1996.
  32. Shultz, G.A.; Schnabel, R.B.; Byrd, R.H. A family of trust-region-based algorithms for unconstrained minimization with strong global convergence properties. SIAM J. Numer. Anal. 1985, 22, 47–67.
  33. Indrapriyadarsini, S.; Mahboubi, S.; Ninomiya, H.; Asai, H. A Stochastic Quasi-Newton Method with Nesterov’s Accelerated Gradient. In ECML-PKDD; Springer: Berlin/Heidelberg, Germany, 2019.
  34. Mahboubi, S.; Ninomiya, H. A Novel Training Algorithm based on Limited-Memory quasi-Newton method with Nesterov’s Accelerated Gradient in Neural Networks and its Application to Highly-Nonlinear Modeling of Microwave Circuit. IARIA Int. J. Adv. Softw. 2018, 11, 323–334.
  35. Indrapriyadarsini, S.; Mahboubi, S.; Ninomiya, H.; Takeshi, K.; Asai, H. A modified limited memory Nesterov’s accelerated quasi-Newton. In Proceedings of the NOLTA Society Conference, IEICE, Online, 6–8 December 2021.
  36. Crammer, K.; Kulesza, A.; Dredze, M. Adaptive regularization of weight vectors. Adv. Neural Inf. Process. Syst. 2009, 22, 414–422.
Figure 1. Average results on the Levy function approximation problem with $m_L = 10$ (full batch).
Figure 2. Results of MNIST on the fully connected neural network with $b = 128$ and $m_L = 8$.
Figure 3. Results of MNIST on the LeNet-5 architecture with $b = 256$ and $m_L = 8$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Indrapriyadarsini, S.; Mahboubi, S.; Ninomiya, H.; Kamio, T.; Asai, H. Accelerating Symmetric Rank-1 Quasi-Newton Method with Nesterov’s Gradient for Training Neural Networks. Algorithms 2022, 15, 6. https://doi.org/10.3390/a15010006
