Adaptive Levenberg–Marquardt Algorithm: A New Optimization Strategy for Levenberg–Marquardt Neural Networks

: Engineering data are often highly nonlinear and contain high-frequency noise, so the Levenberg–Marquardt (LM) algorithm may not converge when a neural network optimized by the algorithm is trained with engineering data. In this work, we analyzed the reasons for the LM neural network’s poor convergence commonly associated with the LM algorithm. Speciﬁcally, the effects of different activation functions such as Sigmoid, Tanh, Rectiﬁed Linear Unit (RELU) and Parametric Rectiﬁed Linear Unit (PRLU) were evaluated on the general performance of LM neural networks, and special values of LM neural network parameters were found that could make the LM algorithm converge poorly. We proposed an adaptive LM (AdaLM) algorithm to solve the problem of the LM algorithm. The algorithm coordinates the descent direction and the descent step by the iteration number, which can prevent falling into the local minimum value and avoid the inﬂuence of the parameter state of LM neural networks. We compared the AdaLM algorithm with the traditional LM algorithm and its variants in terms of accuracy and speed in the context of testing common datasets and aero-engine data, and the results veriﬁed the effectiveness of the AdaLM algorithm.


Introduction
When applied to real-world data interspersed with high nonlinearity and highfrequency noise, LM neural networks have irreplaceable advantages. They reduce the requirement of computing resources and guarantee high convergence speed, making their performance superior to other existing regression methods. The LM neural network is an artificial neural network with an LM optimizer. The LM algorithm is a second-order method for solving general least squares problems. As an optimizer, the LM algorithm has wide engineering applications in LM neural networks because of its fast convergence and small memory occupation [1][2][3].
However, the disadvantage of the LM algorithm makes LM neural networks have potential symptoms of divergence or convergence to a bad local minimum. The paper [4] reported that the LM algorithm may become stuck and fail to degrade the cost function. For a nonconvex optimization problem, such as neural network optimization, there can be many local minima. The cost function may end up in the "bad" local minima when it is difficult to guarantee the nonsingularity of the Jacobian matrix and Lipschitz continuity [5]. LM neural networks show higher errors than do regular neural networks in practical engineering application [6].
Most of the recent work have tried to help the LM algorithm find one of the "good" local minima by combining with other algorithms for better engineering applications, because it is generally acceptable in the modern neural network community that searching for a global minimum is often an unnecessary endeavor, e.g., the genetic algorithm (GA) (1). F(w) and f (w) are nonlinearly continuous and differentiable. The f (w) is a neural network model, w is a parameter of the neural network model, and F(w) is the cost function of the neural network. In Refs. [12][13][14][15], the global optimization ability of the proposed algorithm based on nonlinearly continuous and differential F(w) and f (w) was verified. In detail, the activation functions in the neural networks are not necessarily continuous and differentiable, so the above references are only valid in the current setting without proving that the global optimization can be achieved in the neural network model. (2). J = ∂F(w)/∂w. J is a Jacobian matrix. The Jacobian matrix in Refs. [16][17][18] was obtained by directly solving the system of nonlinear equations. The calculation results were relatively accurate, but the calculation process was very complex. In the neural networks, the Jacobian matrix participating in the proof process was obtained by back-propagation rather than by the derivation of the objective function. Therefore, the theory of the above references may be not suitable for the field of neural networks.
Rather, a new adaptive LM algorithm is proposed for neural networks in this paper. We explain in detail the specific factors that cause the cost function to fall into bad local minima by the original LM algorithm, by analyzing the output performance of several activation functions. In view of these factors, the new algorithm makes up for the deficiency of the LM algorithm to train a network efficiently.
The remaining parts are organized as follows. Section 2 introduces the LM algorithm. Section 3 proves that the original LM algorithm has the possibility of falling into local minima in neural networks. Section 4 gives the optimization of the LM algorithm. Section 5 verifies the correctness of this theory about experiments, and the conclusions are drawn in last section.

LM Algorithm
The LM algorithm was first proposed for nonlinear least squares problems by scholars Levenberg and Marquardt [19]. Before introducing the LM and carrying out research, some parameters need to be defined, which are the same as those of [20].

Definition 1.
Given a neural network model f(w), the cost function is the least squares problem: 2 (1) where w represents neural network parameters and y label represents label data.

Definition 2.
A model L is assumed to describe the behavior of F in the current iteration There is a method to evaluate the accuracy of model L, called gain ratio q: LM method minimizes F(w) by iteration. h is introduced to describe the change direction of F(w).
where J is the Jacoby matrix, r is the cost function error, and µ is the damping parameter.
Given a variable v, the damping parameter µ is adjusted by v in every iteration. The µ and v updating formulas are illustrated below [20]:

Problem of LM Algorithm
The problem of the LM algorithm is from the gain ratio q. A positive q means L(h) is a good approximation to F(w + h), and vice versa. The L(0)−L(h) has proved to be positive [20], and its proof is: In Definition 3, we know q = (F(w) − F(w + h))/(L(0)−L(h)), which indicates whether L(h) is a good approximation to F(w + h) dependences on F(w) − F(w + h). As just said, if q > 0, L(h) is a good approximation. If q < 0, the damping parameter µ in Equation (4) is adjusted and the current iteration is recalculated.
There is a situation where q will always remain negative regardless of how the damping parameter µ in Equation (4) is adjusted, while the cost function F(w) in Equation (1) is still very large.
That situation indicates that iteration cannot continue, and that LM neural networks have fallen into the "bad" local minima or had divergence.

Analysis of Falling into "Bad" Local Minima
In the previous section, the possibility of the LM algorithm was analyzed. In this section, we focus on the case of where q would always remain negative and analyze the possible situation. If we find the condition where gain ratio q < 0 regardless of how the parameters are adjusted, we can prove that LM neural networks may diverge or converge to the poor local minima.
The proof process includes three steps: First, given the inequality f (w) + J(w) < r(w 2 ), the conditions satisfying the inequality are analyzed. Secondly, we prove that "∃f (w), when f (w) + J(w) < r(w 2 ), in Equation (3), the gain ratio q < 0". The conditions leading to q < 0 can be obtained. Thirdly, based on the above analysis, we discuss the conditions under which LM neural networks may diverge or converge to a worse local minimum.

Definition 5.
According to the structure of the neural network, the residual function is as follows: where σ is an activation function, x is the input of the neural network, the weight of the neural network is w, and the bias is b.
Definition 6. The first-order Taylor expansion for the output function f(w + h) of the neural network at h is: where r(w 2 ) is the remainder. Proof. According to Equation (7): Substituting Equation (8) into f (w) + J(w) < r(w 2 ) gives: Thus, to prove f (w) + J(w) < r(w 2 ), you only need to prove Equation (9). By Equation (6), we obtain: Substituting Equation (6) and Equation (10) into Equation (9) gives: Let: Thus, to prove f (w) + J(w) < r(w 2 ), you only need to prove: In order to simplify the calculation, define: Then: The frequently used activation functions include the sigmoid function, tanh function, ReLU function, and PReLU function. Discussion: (1). When σ is sigmoid: Then, the function G(w ,h ) is: Due to the arbitrariness of the value of y label , the first three terms of the function G(w ,h ) are considered and expressed by g(w ,h ): The image of g(w ,h ) in Equation (18) drawn by OCTAVE is shown in Figure 1: Regardless of the value of w , the g(w ,h ) curve monotonically decreases and intersects the x-axis at the point (0.5, 0). This means that when the conditions (σ = sigmoid, h ≤ 0.5, and y label > 0) are met, the algorithm has the possibility of divergence.
From Figure 1, it can be seen that the g(w ,h ) curve is monotonously decreasing and converges at point (h = 0.5, g(w ,h ) = 0); when h ≤ 0.5, g(w ,h ) ≥ 0.That is: Thus, when h ≤ 0.5 and y label > 0, G(w,h) > 0. As h = hx, for the arbitrariness of h, when the absolute values of x and y label are large enough, we have G(w,h) > 0, that is, f (w) + J(w) < r(w 2 ).
(2). When σ is tanh: Similarly, let: w = wx + b, h = hx, then: (20) Then, the function G(w ,h ) is: For the arbitrariness of the value of y label , the first three terms of the function G(w ,h ) are considered and expressed by g(w ,h ): The image of g(w ,h ) in Equation (21) drawn by OCTAVE is shown in Figure 2: In Figure 2, the curve g(w ,h ) is also monotonically decreasing. When h < 0.3, g(w ,h ) ≥ −3, and so: Thus, when h'< 0.3 and y label > 3, G(w,h) > 0. As h' = hx, for the arbitrariness of h, when the absolute values of x and y label are large enough, we have G(w,h) > 0, that is, f (w) + J(w) < r(w 2 ).
(3). When σ is ReLU: Due to the arbitrariness of the value of y label , the first three terms of the function G(w ,h ) are considered and expressed by g(w ,h ). The image of g(w ,h ) is shown in Figure 3: In Figure 3, when w < 0 and h < 0, g(w , h ) ≥ 0, and so: Thus, when w < 0, h < 0, and y label > 0, G(w,h) > 0. As h = hx and w = wx + b, for the arbitrariness of h, w, b, when the absolute values of x and y label are large enough, we have G(w,h) > 0, that is, f(w) + J(w) < r(w 2 ).

(4). When σ is PReLU:
Similarly, let: w = wx + b and h = hx, and draw the image of g(w ,h ) as shown in Figure 4: In Figure 4, when w < 0, g(w ,h ) > 0, and so: Thus, when w < 0 and y label > 0, G(w,h) > 0. As h' = hx and w'= wx + b, for the arbitrariness of h,w,b, when the absolute values of x and y label are large enough, we have G(w,h) > 0, that is, f (w) + J(w) < r(w 2 ).
To sum up, when the absolute values of x and y label are large enough, there are f (w) + J(w) < r(w 2 ).
Proof. The cost function is: F(w + h) = 1 2 ||f (w + h)|| 2 . The first-order Taylor expansion for F(w + h) at w is: As f (w) + J(w) < r(w 2 ), the r(w 2 ) term of the residual function f (w + h) after Taylor's firstorder expansion cannot be omitted. Therefore, there exists f (w) that makes it impossible to omit the R(w) term of Taylor's first-order expansion for F(w + h). This means: L(w) < R(w).
In [19], it has been proven that: In the case of Lemma 1 and Lemma 2, the LM algorithm causes cost functions to fall into "bad" local minima.
Proof. In the kth iteration, we have q k < 0; in the case of Lemma 2, according to the LM algorithm in the original, when q k < 0, the trust region will be reduced, that is: At this point, the absolute value of step h decreases, and its sign remains unchanged. There is a situation where q is always negative when h is at any point in the positive or negative half-axis, and h will approach zero infinitely and the cost function cannot decrease. At this time, there is: According to Equation (14), h' = hx, we have: We analyze the common activation functions one by one to find out whether they have the possibility of this situation: For the activation function σ = sigmoid, when y label > 1, the area covered by the function is shown in Figure 5a. It can be seen that g(w ,h ) will remain positive when h is at any point on the negative half of the x-axis. In the case of Lemma 1 and Lemma 2, it is known that "g(w ,h ) remains positive" means "q remains negative." According to Equation (14), h' = hx, so when y label > 1 and hx < 0, q remains negative, and the LM algorithm will stop running, and the LM neural networks will also diverge or converge poorly.
Similarly, for the activation function σ = tanh, when w ≤ 0 and y label > 3, the area covered by the function is shown in Figure 5b. When w > 0 and y label > 1, the area covered by the function is shown in Figure 5c. Thus, when y label > 3, x > 0, and hx < 0, the LM neural network will diverge or converge poorly.
Similarly, for the activation function σ = ReLU, when w < 0 and y label > 0, the area covered by the function is shown in Figure 5d. When w < 0, y label > 0, x > 0, and hx < 0, the LM neural networks will diverge or converge poorly.
Finally, for the activation function σ = PReLU, when w < 0 and y label > 0, the area covered by the function is shown in Figure 5d. When y label > 0 and w < 0, the LM neural network will diverge or converge poorly.  The grey area covered by the g(w ,h ) function when σ = tanh, y label > 3, and w ≤ 0. If h ≤ 0.3, g(w ,h ) is always positive; (c) The grey area covered by the g(w ,h ) function when σ = tanh, y label > 1, and w > 0. If h ≤ 0, g(w ,h ) is always positive; (d) The grey area covered by the g(w ,h ) function when σ = ReLU, y label > 0, and w < 0. If h ≤ 0, g(w ,h ) is always positive; (e) The grey area covered by the g(w ,h ) function when σ = PReLU, y label > 0, and w < 0. g(w ,h ) is always positive.

The Proposed Algorithm: AdaLM
It can be seen from Section 3 that the parameter state of the model will affect the stability of the LM algorithm. In order to reduce the influence of the state of the model on the algorithm, the improved scheme starts from other algorithms to participate in the iterative operation, and then slowly transitions to the LM algorithm. Therefore, a more stable algorithm is needed, and the steepest descent method meets this requirement.
With this idea, this paper proposed the AdaLM algorithm that is close to the steepest descent method at the beginning and close to the LM algorithm after several iterations.
The core step calculation of the AdaLM algorithm can be expressed by Equation (34): where A = J T J, α is the inertia parameter, V is the accumulated value of h, and k is the current number of iterations. The damping coefficient µ is calculated from q and v: (1). If q < 0, µ = (µ + 1) × v, v = v × 2; (2). If q > 0, µ = 0, v = 2.
If AdaLM reaches the global minimum: k → ∞ and F(x ∞ ) = g = 0. Given ε 1 > 0, the first stop criteria can be set to: Consistent with Reference [20], there is a second stop condition if the step size is very small: Finally, in all iterations, we need a protection against infinite loops, so there is a third stop condition: Thus, the AdaLM algorithm can be described in Algorithm 1:

Experimental Preparation
In order to evaluate the performance of AdaLM, it was arranged to compare with the original LM algorithm, as well as other popular LM variants, including High order Levenberg-Marquardt (HLM) and Three Step Levenberg-Marquardt (TSLM). Each algorithm is described in Table 1. In this paper, each algorithm and neural network were combined to build a prediction model, and the performance level of the algorithm was reflected by testing the prediction ability of the model.

Algorithm
Full Name Description As for the neural network model, it has been proven that the three-layer neural network can simulate any complex nonlinear mapping if there are enough neurons in the hidden layer [21][22][23]. Therefore, the neural network models involved in this paper were all three-layer neural networks. We found that all the models involved in this paper will converge before 50 steps of iteration. Thus, we set the maximum number of iterations of the model to 50. In this study, the mean absolute error (MAE) was used to give the accuracy of the experimental results. MAE is the average value of absolute error, which can reflect well the actual situation of the predicted value error. Its formula is as follows: The number of nodes in the model affects the quality of the neural network. The input and output nodes of the model need to be set according to the requirements of the task. The number of hidden nodes of the model can be calculated according to the empirical formula given in [24]: where m is the number of hidden layer nodes, n is the number of input layer nodes, and l is the number of output layer nodes. The neural network model is shown in Figure 6. The model uses the square difference formula to calculate the error between the prediction data and the label data. In order to test the engineering application ability of each algorithm, the real prediction task of aero-engine performance parameters was arranged in this paper. In this paper, fuel flow data of an aero-engine were selected as training data. There is a large noise in the fuel flow data of the aero-engine due to environmental conditions, operation conditions, flight tasks, maintenance measures, etc. [25]. In Figure 7, the fuel flow data collected from the CFM56-5B engine records the fuel consumption rate of the A320 aircraft in pounds per hour. It is shown from Figure 7 that the data were highly nonlinear with many outliers. The data sample size was small. Therefore, the selection of the data can reflect the robustness and convergence of the algorithm. The specific implementation steps of the data prediction task for the fuel flow of the aeroengine are as follows: Step 1: Obtain fuel flow data of the aeroengine. Set the total amount of data to N. Use normalization and data smoothing to preprocess the fuel flow data of the aeroengine, and then obtain the data X = (x 1 , x 2 , x 3 , . . . . . . , x N ). The purpose of normalization is to avoid numerical problems such as overflow and underflow, reducing the impact of model initialization, and improving the prediction accuracy. The purpose of data smoothing is to reduce the influence of noise and outliers on the model. The data smoothing method is given by Equation (40): Step 2: Organize data into datasets. In this step, the dataset is generated according to the number of input layer nodes n of neural network, and the determination method of n is given in step 5. The fuel flow data are divided into subsequences with length of n as the input data of the model and the adjacent element of the subsequence as the label data. Because this task is a one-step forecast, the label data dimension is 1. Input data X and label data Y are, respectively: Step 3: Split the dataset {X,Y} into training set {X train , Y train } and test set {X test , Y test }. According to experience, the first 95% of the data are used as training sets. The rest are the test set.
Step 4: Build the neural network model and use the training set {X train , Y train } to train the model. The model uses the square difference formula to calculate the error between the prediction data and the label data.
Step 5: Determine the number of nodes in the neural network model to make the model optimal. The enumeration method is used to determine the input node n of the neural network, with the range of 5−12. In detail, steps 2 to 4 are looped 8 times, and the input node n of the neural network is set to 5 to 12 in step 2 in each loop. The number of hidden layer nodes is determined by Equation (39), and the number of output nodes is 1. The MAE error in Equation (38) is used to describe the prediction accuracy of the experimental results. The number of input nodes and the corresponding model MAE are shown in Figure 8.  Figure 8 shows that when the number of input nodes n is 9, the output errors of the algorithm LM reach the minimum, as well as HLM, TSLM, and AdaLM.
Step 6: Test the model. Input the test set X test into the input model to obtain f (X test ).

The Influence of Each Algorithm on Activation Function
The purpose of this section is to evaluate the impact of the four algorithms on the activation function to evaluate the performance of each algorithm. In this paper, the AdaLM algorithm was compared with LM, HLM, and TSLM. As a contrast, two frameworks were set-up in this study to investigate the performance of these algorithms under different conditions.
The test results are listed in Table 2. " √ " means convergence, "good" means local minima, " " means "bad" local minima, and "×" means divergence or the term does not exist.  Table 2 shows that the traditional LM model diverges when Tanh, ReLU, and PReLU are used as activation functions, which proves that the LM problem exists. HLM is a higherorder descent method. The HLM model could not converge when the activation functions were Tanh and PReLU, which indicates that over-high-order descent may produce negative effects. The TSLM model is an improved version of HLM. In the test, TSLM converged with four activation functions. The AdaLM algorithm proposed in this paper could change the descent direction in time and has good stability, so it achieved convergence.

Evaluation of Prediction Effect
From Figure 9a, the generated curve amplitude was very small with a large error. From the perspective of the prediction effect, the curve predicted by LM could not describe the future trend of real data. From Figure 9b,c, it could hardly reproduce the original data. After training, HLM and TSLM only achieved a line with slight fluctuation. Obviously, they could not predict the trend of future data in this test. From Figure 9d, the curves generated by AdaLM could fit the real data well. From the perspective of the prediction effect, the curve predicted by AdaLM could describe the future trend of real data. In order to quantify the prediction effect of each algorithm, the time consumption in 50 iterations and the prediction MAE of algorithms are shown in Table 3. From Table 3, the neural network model with the original LM algorithm was trained in 140 s. The data predicted by the model after training were basically consistent with the original data, with an error of 41.4847. This shows that the original LM algorithm could basically make the neural network predict the fuel consumption data of the aeroengine, but the training took a lot of time because of the falling into a "bad" local minimum. The neural network model with HLM algorithm completed the training in 181 s. The data predicted by the model after training fit the original data with an error of 38.2276. The HLM algorithm took the longest time to train in the prediction of fuel consumption data of an aeroengine. The HLM algorithm increased the unnecessary complexity and did not improve the performance.
The neural network model of the TSLM algorithm was trained in 173 s. The data predicted by the trained model fit the original data with an error of 42.6715. TSLM algorithm has the highest error in this task, and the complexity of the algorithm affects the prediction performance.
The neural network model with the AdaLM algorithm proposed in this paper was trained in 62 s, and the data predicted by the trained model fit the original data with an error of only 32.9175, which indicates that the AdaLM algorithm took into account high performance and high speed in predicting the fuel consumption data of aeroengine, with the least time consumption and the highest accuracy.

Conclusions
This paper pointed out that the LM neural networks may converge poorly or diverge because the training data are large and the weight is not selected properly. This work can guide researchers to carry out some appropriate strategies such as training data preprocessing and weights intervention at the beginning of training to avoid problems in the use of the LM neural networks when they still insist on using this model. Furthermore, this work can expand the application scope of LM neural networks and make LM neural networks play its role better in more situations.
This study proposed a new solution to the problem of LM neural networks: the AdaLM algorithm. The algorithm adds the weight term and error direction on the original algorithm, which makes the algorithm close to the steepest descent method at the beginning to preliminarily adjust the weight of the neural network. Then, the AdaLM algorithm gradually approaches the LM algorithm, which improves the stability based on inheriting the efficient optimization ability of LM. This paper compared the performance of traditional LM, HLM, TSLM, and AdaLM algorithms. We found that the performance improvement of the higher-order algorithm was very limited. Among many high-order algorithms, the third-order algorithm had the better performance. The AdaLM algorithm could not only make the neural network converge to a "good" local minimum, but also greatly shorten the operation time without losing the prediction accuracy.
The application of the AdaLM algorithm focused on a small sample and highly nonlinear engineering data. It was especially effective for the analysis and prediction of engine performance parameters in the aviation field. In future research, we will collect engine data from more dimensions to comprehensively evaluate AdaLM's ability of analysis. The application scenarios of AdaLM will also be further explored.

Conflicts of Interest:
The authors declare no conflict of interest.