Local Linear Approximation Algorithm for Neural Network

This paper aims to develop a new training strategy to improve efficiency in the estimation of weights and biases in a feedforward neural network (FNN). We propose a local linear approximation (LLA) algorithm, which approximates ReLU with a linear function at the neuron level and estimates the weights and biases of a one-hidden-layer neural network iteratively. We further propose the layer-wise optimized adaptive neural network (LOAN), in which the LLA is used to estimate the weights and biases of the LOAN layer by layer adaptively. We compare the performance of the LLA with commonly-used procedures in machine learning on seven benchmark data sets. The numerical comparison implies that the proposed algorithm may outperform the existing procedures in terms of both training time and prediction accuracy.


Introduction
Although neural networks have many successful applications, efficiently training feedforward neural network (FNN) models to strike a balance between training time and prediction performance remains a longstanding challenge. Stochastic gradient descent [1][2][3] is widely used for training neural networks, but it suffers from a slow learning rate and noisy training curves in the early training epochs.
To reduce the total number of training iterations, and consequently the training time, while achieving high generalization ability, many studies have developed efficient training algorithms for neural networks, among which linear approximation of the neural network has received much attention. Douglas and Meng [4], Singhal and Wu [5] and Stan and Kamen [6] developed linearized least squares using Kalman recursion to update the weight parameters. They demonstrated increased training efficiency compared to gradient descent, but these methods lacked robustness in terms of model accuracy. Kollias and Anastassiou [7] developed an adaptive least squares algorithm based on the Marquardt-Levenberg optimization algorithm; see Ngia and Sjoberg [8], Kim and Lee [9] and Zhou and Si [10] for other Marquardt-Levenberg based methods. Moreover, Maddox et al. [11], Khan et al. [12] and Mu et al. [13] suggested linearizing the neural network with a first-order Taylor expansion, followed by a Gaussian process for inference. However, these methods only work on pre-trained networks, not on randomly initialized ones [14].
The work in this paper addresses the problem from a different point of view and develops a reliable training algorithm called the Local Linear Approximation (LLA) algorithm. In each iteration, LLA uses a first-order Taylor expansion to locally approximate the activation function of each neuron. It computes the estimated gradient and Hessian matrix, and updates the weights and biases in the spirit of the Fisher scoring algorithm. Algorithms that take Fisher information or the Hessian matrix into account have been proposed via natural gradient descent [15,16]; however, the proposed LLA differs from the existing algorithms. See Section 2.1 for more details. We develop LLA for both regression and classification problems. The major contributions of this work are summarized below: (a) We derive a novel adaptive method, LLA, which leverages the ideas of linearizing the neural network and iterative Fisher scoring, aiming to reduce training time while achieving higher prediction performance than current baselines. Distinguished from existing methods that linearize neural networks, LLA has a unique initialization procedure.
(b) We investigate the convergence behavior of LLA by establishing a connection between the Fisher scoring algorithm and the iterative regression. The connection implies that LLA and Fisher scoring algorithms share the same rate of convergence. Thus, the LLA has a quadratic convergence rate.
(c) As an extension of the one-hidden-layer LLA, we build a layer-wise optimized adaptive neural network (LOAN) for further improvement. The performance of LLA with multiple hidden layers is examined. The experiments demonstrate that a shallow LOAN achieves both stable performance and a high level of prediction accuracy.
(d) We use several benchmark data sets to illustrate and evaluate our proposed method. The results indicate that LLA has a faster convergence rate and outperforms existing optimization methods and machine learning models in terms of prediction on both training and testing data.
The rest of this paper is organized as follows. In Section 2, we develop the LLA algorithm for regression and classification problems and study the rate of convergence of the proposed LLA algorithm. Section 3 presents numerical comparisons. Conclusion and discussion are given in Section 4.

New Method for Constructing Neural Network
According to the universal approximation theorem [17,18], provided that the network has enough hidden units, an FNN with one hidden layer and a linear output layer may approximate any Borel measurable function well (Section 6.4.1 in [19]). It is common to consider a deep neural network for better prediction performance, but in practice a one-hidden-layer neural network with a large number of nodes can still achieve high prediction performance [20,21], which makes it an easy starting point for explaining our proposed LLA method.

Regression Settings
Consider a one-hidden-layer neural network with J_1 nodes,

    h_j^{(1)} = σ(x^T w_j + b_j), j = 1, …, J_1,    (1)

where h^{(0)} = x is the p-dimensional input vector, W and b are the weight matrix and the bias vector in the first hidden layer, respectively, w_j is the j-th row of W corresponding to the j-th node, and b_j is the j-th element of b. Throughout this paper, we set σ(t) = max{t, 0}, the ReLU activation function. Consider a linear regression model in the output layer,

    y = β_0 + β^T h^{(1)} + ε.    (2)

For this neural network, the parameters to be optimized are W, a J_1 × p weight matrix; b, a J_1-dimensional vector; β = (β_1, …, β_{J_1})^T, a J_1-dimensional vector; and β_0, a scalar. Denote θ_1 = {β_0, β}, θ_2 = {W, b} and θ = {θ_1, θ_2}. Let m(x; θ) = β_0 + β^T h^{(1)}. Suppose that we have a sample {x_i, y_i}, i = 1, …, n. In the literature, θ is estimated by minimizing the least squares (LS) function

    ℓ_1(θ) = Σ_{i=1}^n {y_i − m(x_i; θ)}^2,    (3)

which is highly nonlinear in θ_2. Hence, parameter estimation is challenging for deep neural networks.
To tackle this problem, we propose an algorithm to minimize (3). Given θ_2^{(c)} = {W^{(c)}, b^{(c)}} at the current step, we approximate σ(x^T w_j + b_j) by a linear function based on the first-order Taylor expansion of σ(t). Writing t_j = x^T w_j + b_j and t_j^{(c)} = x^T w_j^{(c)} + b_j^{(c)},

    σ(t_j) ≈ σ(t_j^{(c)}) + σ'(t_j^{(c)})(t_j − t_j^{(c)}).    (4)

We refer to (4) as a local linear approximation (LLA), since it is a linear approximation based on a first-order Taylor expansion at the neuron level. For any function f, let f̃ denote its approximation by (4) throughout the paper. Thus, m(x; θ) is approximated by

    m̃(x; θ, θ^{(c)}) = β_0 + Σ_{j=1}^{J_1} β_j {σ(t_j^{(c)}) + σ'(t_j^{(c)})(x^T w_j + b_j − t_j^{(c)})}.    (5)

Writing t_{ij}^{(c)} = x_i^T w_j^{(c)} + b_j^{(c)}, further define, for i = 1, …, n and j = 1, …, J_1,

    u_{ij} = σ(t_{ij}^{(c)}) − σ'(t_{ij}^{(c)}) t_{ij}^{(c)},    (6)
    v_{ij} = σ'(t_{ij}^{(c)}),    (7)
    s_{ij} = σ'(t_{ij}^{(c)}) x_i,    (8)

and

    z_i = (u_{i1}, …, u_{iJ_1}, v_{i1}, …, v_{iJ_1}, s_{i1}^T, …, s_{iJ_1}^T)^T,    (9)

which is a J_1(p + 2)-dimensional vector. With the aid of approximation (5), the objective function in (3) is approximated by

    ℓ̃_1(θ) = Σ_{i=1}^n {y_i − m̃(x_i; θ, θ^{(c)})}^2,    (10)

which is the least squares function of a linear regression with response y_i and predictors z_i.

Theorem 1.
Let θ^{(c)} be the value of θ at the c-th step. Then the Fisher scoring update of θ based on ℓ_1(θ) at θ^{(c)} coincides with the update based on ℓ̃_1(θ). The proof of Theorem 1 is given in Appendix A.1 of this paper. Theorem 1 implies that updating θ based on ℓ̃_1(θ) is equivalent to updating θ based on ℓ_1(θ). Let γ_j = β_j b_j and η_j = β_j w_j for j = 1, …, J_1, and define

    ϑ = (β^T, γ^T, η_1^T, …, η_{J_1}^T)^T,    (12)

where γ = [γ_1, …, γ_{J_1}]^T. Thus m̃(x_i; θ, θ^{(c)}) = β_0 + z_i^T ϑ is a linear function of (β_0, ϑ), while it is nonlinear in θ. This enables us to update ϑ easily.
Denote the resulting least squares estimates of β_j, γ_j and η_j by β̂_j, γ̂_j and η̂_j, respectively. By the definition of γ_j and η_j, we can update b_j and w_j as

    b_j = γ̂_j / β̂_j,  w_j = η̂_j / β̂_j,    (13)

for j = 1, …, J_1. If |β̂_j| is very close to zero, one may simply leave b_j and w_j unchanged. The LLA algorithm first updates β, γ, η_1, …, η_{J_1} by the least squares estimates of Equation (10); then the estimates of W and b are obtained from Equation (13). The estimation process is iterated until convergence. The procedure is summarized in Algorithm 1.
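To make one iteration concrete, here is a minimal numpy sketch. The function name and the two-pass split (a θ_1 pass, then a θ_2 pass) are our reading of Algorithm 1 rather than the authors' implementation: for ReLU, σ(t) − σ'(t)t is identically zero, so the sketch first estimates θ_1 = (β_0, β) by least squares on h^{(1)} and then recovers (W, b) through the linearized regression with γ_j = β_j b_j and η_j = β_j w_j.

```python
import numpy as np

def lla_step(X, y, W, b, eps=1e-3):
    """One LLA iteration for a one-hidden-layer ReLU regression network.

    A sketch of Algorithm 1 (interface and two-pass split are assumptions):
    theta_1 pass fits (beta_0, beta) on h^{(1)}; theta_2 pass regresses
    y - beta_0 on {sigma'_j, sigma'_j x} to get gamma_j = beta_j b_j and
    eta_j = beta_j w_j, then inverts them as in Equation (13).
    """
    n, p = X.shape
    J = W.shape[0]
    T = X @ W.T + b                      # pre-activations at the current step
    H = np.maximum(T, 0.0)               # sigma(t^{(c)})
    D = (T > 0).astype(float)            # sigma'(t^{(c)}) = 1(t > 0)
    # theta_1 pass: linear regression of y on h^{(1)}
    A = np.hstack([np.ones((n, 1)), H])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    beta0, beta = coef[0], coef[1:]
    # theta_2 pass: linearized regression for gamma_j and eta_j
    Z = np.hstack([D] + [D[:, [j]] * X for j in range(J)])
    sol, *_ = np.linalg.lstsq(Z, y - beta0, rcond=None)
    gamma, eta = sol[:J], sol[J:].reshape(J, p)
    W_new, b_new = W.copy(), b.copy()
    for j in range(J):
        if abs(beta[j]) >= eps:          # skip nodes with beta_j near zero
            b_new[j] = gamma[j] / beta[j]
            W_new[j] = eta[j] / beta[j]
    return W_new, b_new, beta0, beta
```

On noiseless data generated from a one-hidden-layer ReLU network, starting the step at the true (W, b) reproduces the data essentially exactly, consistent with the fixed-point behavior one expects at convergence.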
Motivated by single-index models, we design an initialization procedure for W and b with J_1 nodes as follows. First, we fit the data {x_i, y_i} with a linear regression model and obtain the least squares estimates of the coefficients of x, denoted by α̂, and of the intercept, denoted by α̂_0. Let τ̂_j, j = 1, …, J_1, be quantile points of α̂_0 + x_i^T α̂, i = 1, …, n, and set the initial values of w_j and b_j from α̂, α̂_0 and τ̂_j for j = 1, …, J_1. Since the objective function ℓ_1(θ) is nonconvex, a numerical minimization algorithm may depend on the initial value. We find that the proposed initialization strategy works reasonably well in practice. We have conducted a numerical comparison between this initialization procedure and the Xavier initialization strategy, which is based on randomization; in our simulation study, the proposed procedure has a better and more stable performance.

Algorithm 1 Local Linear Approximation (LLA) Algorithm for Regression
Input: Given data {X, y}, number of nodes J_1, and stability constant ε.
Initialization: Set W and b by the single-index initialization described above.
1: while not converged do
2: compute z_i, where z_i is defined in Equations (6)-(9);
3: obtain the least squares estimates (LSE) β̂_j s, γ̂_j s and η̂_j s by running a linear regression of y_i on the covariates z_i according to Equation (10);
4: if |β̂_j| ≥ ε then
5: update the biases and weights by Equation (13);
6: end if
7: end while
8: return W and b

To understand the convergence behavior of the proposed LLA algorithm, let us briefly describe the Fisher scoring algorithm for a general nonlinear least squares problem. Let

    ℓ_*(Θ) = Σ_{i=1}^n {y_i − g(x_i; Θ)}^2,

where g(x; Θ) is a general nonlinear function with finite second-order derivatives with respect to Θ, a general parameter vector. Denote by ℓ'_*(Θ) and ℓ''_*(Θ) the gradient and Hessian matrix of ℓ_*(Θ), respectively. The Fisher scoring algorithm updates Θ by

    Θ^{(c+1)} = Θ^{(c)} + [Ê{−ℓ''_*(Θ^{(c)})}]^{−1} ℓ'_*(Θ^{(c)}),

where Ê{−ℓ''_*(Θ^{(c)})} stands for an estimate of E{−ℓ''_*(Θ^{(c)})}. Similar to the proof of Theorem 1, we can rewrite Θ^{(c+1)} as the solution of the least squares problem

    min_Θ Σ_{i=1}^n {y_i − g(x_i; Θ^{(c)}) − ∇g(x_i; Θ^{(c)})^T (Θ − Θ^{(c)})}^2.    (15)

We next show that (10) corresponds to (15). Theorem 2. For given θ_1, let g(x_i; θ_2) = m(x_i; θ).
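The single-index initialization can be sketched as follows. The displayed formula for the initial values is not reproduced in the text above, so the particular mapping w_j = α̂, b_j = α̂_0 − τ̂_j used here is an assumption made for illustration.

```python
import numpy as np

def lla_init(X, y, J):
    """Single-index-style initialization for (W, b) with J nodes (a sketch).

    Fit a linear regression, take J quantile points tau_j of the fitted
    index alpha_0 + x_i^T alpha, and start every node along alpha-hat.
    The exact (w_j, b_j) formula is an assumption, not the authors' recipe.
    """
    n, p = X.shape
    A = np.hstack([np.ones((n, 1)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    alpha0, alpha = coef[0], coef[1:]
    fitted = alpha0 + X @ alpha
    taus = np.quantile(fitted, (np.arange(J) + 0.5) / J)
    W0 = np.tile(alpha, (J, 1))          # every node starts along alpha-hat
    b0 = alpha0 - taus                   # node j "switches on" near tau_j
    return W0, b0
```

Unlike Xavier initialization, this start is deterministic given the data, which is one plausible source of the stability reported above.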
Then (10) coincides with (15) with Θ = θ_2. The proof of Theorem 2 is given in Appendix A.2 of this paper. Based on Theorem 2, we may show that the proposed LLA algorithm shares the same convergence rate as the Fisher scoring algorithm. Thus, the convergence rate of the proposed LLA algorithm is quadratic. The Fisher scoring algorithm was proposed for FNNs back in the 1990s [22]. Recently, researchers have used Fisher information or Hessian information by constructing natural gradient descent [15,16,23]. A natural question that arises here is why the Fisher scoring algorithm is not implemented directly. We have compared the LLA algorithm and the Fisher scoring algorithm via numerical simulation; our comparison implies that the LLA algorithm is more stable and converges faster than the Fisher scoring algorithm. This may be because m̃(x_i; θ, θ^{(c)}) is linear in ϑ, and (10) provides a better approximation than (15) for both θ_1 and θ_2. In addition, the form of the ReLU activation function σ(t) implies that its first-order derivative σ'(t) = I(t > 0) is bounded, and its second-order derivative is 0 wherever it exists. The LLA algorithm takes advantage of this, since the LLA in (4) is based on σ(t) ≈ σ(t^{(c)}) + σ'(t^{(c)})(t − t^{(c)}), the first-order Taylor expansion of σ(t). Figure 1 presents an illustration of the LLA of the ReLU activation function at three different locations.
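The neuron-level linearization in (4) is just the tangent line of ReLU at the current pre-activation. Since σ'(t) = I(t > 0), the approximation is exact whenever t and t^{(c)} lie on the same side of zero and errs only across the kink, which a one-line function makes explicit:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def lla_relu(t, t_c):
    """Tangent-line (first-order Taylor) approximation of ReLU at a scalar
    expansion point t_c, as in Equation (4):
    sigma(t_c) + sigma'(t_c) * (t - t_c), with sigma'(t) = 1(t > 0)."""
    return relu(t_c) + float(t_c > 0) * (t - t_c)
```

For example, `lla_relu(2.0, 1.0)` returns 2.0 (exact), while expanding at a point on the other side of zero, as in `lla_relu(-1.0, 2.0)`, gives −1.0 instead of the true value 0, illustrating the local nature of the approximation.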

Binary Classification
For a classification problem, the output layer is a logistic model, P(y = 1 | x) = exp{m(x; θ)}/[1 + exp{m(x; θ)}], with m(x; θ) = β_0 + β^T h^{(1)}. Thus, training an FNN reduces to estimating θ by maximizing the log-likelihood function

    ℓ_2(θ) = Σ_{i=1}^n [y_i m(x_i; θ) − b{m(x_i; θ)}],

where b(m) = log{1 + exp(m)}. We approximate σ(x^T w_j + b_j) in the same way as in (4) and approximate ℓ_2(θ) similarly by

    ℓ̃_2(θ) = Σ_{i=1}^n [y_i m̃(x_i; θ, θ^{(c)}) − b{m̃(x_i; θ, θ^{(c)})}],    (18)

where m̃(x_i; θ, θ^{(c)}) is defined in Equation (5). Let ℓ'_2(θ) and ℓ''_2(θ) be the gradient and Hessian matrix of ℓ_2(θ), respectively. Analogues of Theorems 1 and 2 of Section 2.1 hold here, so the extension to classification involves no fundamentally new ideas. Thus, letting θ^{(c)} be the value of θ at the c-th step, θ can be updated based on ℓ̃_2(θ), and the convergence rate is quadratic as well. This implies that we can iteratively update ϑ using ℓ̃_2(ϑ), in which ϑ is defined as in (12) and m̃(x_i; θ, θ^{(c)}) = β_0 + z_i^T ϑ is a linear function of ϑ. Thus, we can directly apply a logistic regression of y_i on z_i and obtain an estimate of ϑ. To save computational cost, we use a one-step iteration rather than a full iteration of the logistic regression. The initial values play an important role in the model performance, since the log-likelihood function ℓ_2(θ) is nonconcave. Similarly, we propose a method to construct initial values of W and b for classification. We fit the data using a logistic regression model and obtain estimates of the intercept and the coefficients of x, denoted by α̂_0 and α̂, respectively. Let τ̂_j, j = 1, …, J_1, be quantile points of α̂_0 + x_i^T α̂, i = 1, …, n. We may set the initial values of w_j and b_j from α̂_0, α̂ and τ̂_j in the same fashion (Equation (19)), for j = 1, …, J_1. The procedure for updating β_0, β, b_j and w_j is summarized in Algorithm 2.

Algorithm 2 Local Linear Approximation (LLA) Algorithm for Classification
Input: Given data {X, y}, number of nodes J_1, and stability constant ε.
Initialization: Set W and b according to Equation (19).
1: while not converged do
2: compute z_i, where z_i is defined in Equations (6)-(9);
3: fit a logistic regression of y_i on z_i to obtain β̂_j s, γ̂_j s and η̂_j s according to Equation (18);
4: if |β̂_j^{(c)}| ≥ ε then
5: update the biases and weights by Equation (13);
6: end if
7: end while
8: return W and b
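The one-step logistic-regression update used in Algorithm 2 can be sketched as a single Fisher-scoring (IRLS) step. Here `Z` stands for the matrix of covariates z_i (with an intercept column) and `theta` for the stacked coefficients; the function name and interface are ours, not the authors'.

```python
import numpy as np

def one_step_logistic(Z, y, theta):
    """A single Fisher-scoring (IRLS) step for the logistic regression of
    y on z_i, as the classification LLA uses to save computational cost:
    theta <- theta + (Z^T V Z)^{-1} Z^T (y - mu), where mu = sigmoid(Z theta)
    and V = diag(mu (1 - mu)) gives the estimated Fisher information.
    """
    mu = 1.0 / (1.0 + np.exp(-Z @ theta))
    V = mu * (1.0 - mu)
    info = Z.T @ (Z * V[:, None])        # estimated Fisher information
    grad = Z.T @ (y - mu)                # score function
    return theta + np.linalg.solve(info, grad)
```

A single such step already increases the log-likelihood in typical well-conditioned problems, which is why a full inner logistic fit is unnecessary at each LLA iteration.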

Extension to Multi-Class Classification
The proposed procedures may be extended to multi-class classification problems. For a classification problem with T + 1 classes, consider the model

    P(y = t | x) = exp{m_t(x; θ_t)} / [1 + Σ_{s=1}^T exp{m_s(x; θ_s)}], t = 1, …, T,

with class 0 as the baseline, where, for a one-hidden-layer FNN, m_t(x; θ_t) = β_{t0} + β_t^T h_t^{(1)}, and W_t and b_t are the weight matrix and the bias vector for class t. Suppose {x_i, y_i}, i = 1, …, n, is a set of samples from the model above. Denote y_{ti} = I(y_i = t). Then the log-likelihood function is

    Σ_{i=1}^n [Σ_{t=1}^T y_{ti} m_t(x_i; θ_t) − log{1 + Σ_{t=1}^T exp(m_t(x_i; θ_t))}].

We may apply the same linear approximation in (4) to m_t(x; θ_t); given the other θ_s (s ≠ t), we may update θ_t easily using the strategy described in Section 2.2. Thus, Algorithm 2 can be used to iteratively update β_{t0}, β_t, W_t and b_t. Similarly, we may extend the LLA proposed in Section 2.2 to FNNs for multi-class classification problems.
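A forward pass of this multi-class model can be sketched as follows, with class 0 as the baseline and one set of network parameters per remaining class. The parameter layout `(beta_t0, beta_t, W_t, b_t)` is an assumption made for illustration.

```python
import numpy as np

def softmax_prob(X, params):
    """Class probabilities for the multi-class model sketched above.

    `params` is a hypothetical list of (beta_t0, beta_t, W_t, b_t) tuples,
    one per non-baseline class t; class 0 contributes a zero logit.
    """
    logits = [np.zeros(len(X))]                  # baseline class 0
    for beta_t0, beta_t, W_t, b_t in params:
        h = np.maximum(X @ W_t.T + b_t, 0.0)     # one-hidden-layer features
        logits.append(beta_t0 + h @ beta_t)
    L = np.column_stack(logits)
    L -= L.max(axis=1, keepdims=True)            # stabilize the softmax
    P = np.exp(L)
    return P / P.sum(axis=1, keepdims=True)
```

Each θ_t = {β_{t0}, β_t, W_t, b_t} enters the likelihood only through its own logit, which is what allows Algorithm 2 to update one class at a time with the others held fixed.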

Layer-Wise Optimized Adaptive Neural Network
In practice, a one-hidden-layer neural network with a large number of nodes may approximate regression functions well and lead to good prediction performance in many applications. Nevertheless, it is still of interest to examine the performance of deep neural networks for further improvement. In this subsection, we consider the regression problem as an example and propose a layer-wise optimized adaptive neural network (LOAN) as an extension of the one-hidden-layer LLA. The LLA may be extended to estimate the weights and biases in the LOAN layer by layer adaptively.
Consider an L-hidden-layer FNN,

    h^{(l)} = σ_l(W_l h^{(l−1)} + b_l), l = 1, …, L,

with h^{(0)} = x, the p-dimensional input vector, where the σ_l(·)s are activation functions applied to each component of the output from the previous layer, and W_l and b_l are the weight matrix and the bias vector in the l-th hidden layer. Consider a linear regression model in the output layer, y = β_0 + β^T h^{(L)} + ε. Back-propagation with a gradient descent algorithm seeks optimal weights and biases in all layers of the FNN by minimizing the corresponding least squares criterion, which leads to a high-dimensional, nonconvex minimization problem. In this subsection, we propose the LOAN for deep FNNs. Layer-wise construction of neural networks has been proposed in the literature. Contrastive divergence [24] trains Boltzmann machines through an approximate likelihood gradient obtained by sampling a Gibbs chain [25][26][27][28]. The decoupled neural interface [29] updates the weight matrices using a synthetic gradient, a gradient approximation estimated from local information, and trains each layer asynchronously. More recent work on this topic includes Chen et al. [30], Shin [31], and Song et al. [32].
The idea of the LOAN can be summarized as follows. We start with h^{(0)} = x, fit the data with a one-hidden-layer network as described in Section 2.1, and obtain estimates Ŵ and b̂ of W and b. We set W_1 = Ŵ and b_1 = b̂ for the weight matrix and bias vector of the first hidden layer of the FNN. Then we compute h^{(1)}_i, fit the data {h^{(1)}_i, y_i} with a one-hidden-layer network, obtain Ŵ and b̂ by the proposed LLA, and set W_2 = Ŵ and b_2 = b̂. In general, by running a one-hidden-layer network on the data {x̃_i, y_i} with x̃_i = h^{(l−1)}_i, we construct a LOAN by estimating the W_l s and b_l s layer by layer. This procedure is summarized in Algorithm 3.

Algorithm 3 Layerwise Optimized Adaptive Network (LOAN)
Input: LOAN structure [J_1, …, J_L] with J_l being the number of nodes in the l-th layer.
Initialization: Obtain the estimates of W_1 and b_1 in the first hidden layer by fitting the data {x_i, y_i} with the one-hidden-layer LLA in Algorithm 1. Set h^{(1)}_i accordingly.
1: for l = 1, …, L − 1 do
2: set x̃_i = h^{(l)}_i;
3: obtain the estimates of W_{l+1} and b_{l+1} by fitting the data {x̃_i, y_i} with the one-hidden-layer LLA in Algorithm 1;
4: compute h^{(l+1)}_i;
5: end for
6: return W_1, …, W_L and b_1, …, b_L

As described in Algorithm 3, the proposed LOAN is distinguished from the existing layer-wise deep neural networks [28,30,31,32] in that the LOAN builds the deep FNN layer by layer rather than treating the objective function as a function of the weights and biases in all layers simultaneously.
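The layer-wise construction can be sketched as follows, with the one-hidden-layer LLA fit abstracted as a plug-in. The signature `fit_one_hidden_layer(X, y, J)` returning `(W, b)` is a hypothetical stand-in for Algorithm 1.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def loan_fit(X, y, layer_sizes, fit_one_hidden_layer):
    """Layer-wise LOAN construction (a sketch of Algorithm 3): fit a
    one-hidden-layer network, freeze its (W_l, b_l), push the data through
    to obtain the next layer's inputs h^{(l)}, and repeat.
    """
    weights, H = [], X
    for J in layer_sizes:
        W, b = fit_one_hidden_layer(H, y, J)
        weights.append((W, b))
        H = relu(H @ W.T + b)            # x_i <- h^{(l)}_i for the next layer
    return weights, H
```

Because each layer is fit against the original response y with all earlier layers frozen, each step is just another instance of the one-hidden-layer problem rather than a joint optimization over all layers.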

Numerical Comparison
In this section, we compare the performance of the proposed LLA with commonly-used procedures in machine learning. To be specific, we compare the LLA algorithm with Adam and Adamax [33]. Furthermore, we compare the proposed LLA model with the following popular machine learning models (regressors or classifiers).

Regression Problems
First, we compare the performance of LLA with commonly-used procedures for regression problems based on four benchmark data sets. The stability constant ε in Algorithm 1 is set to 10^{−3} in this subsection.
(1) Airfoil (AF, for short) data, which consists of n = 1503 observations with p = 5 predictors, can be downloaded from https://archive.ics.uci.edu/ml/datasets/Airfoil+Self-Noise (accessed on 15 January 2022). The response is the scaled sound pressure level. The explanatory variables include frequency, angle of attack, chord length, free-stream velocity, and suction side displacement thickness.
(2) Bikesharing (BS, for short) data, which consists of n = 17,379 observations with p = 8 predictors and can be downloaded from https://archive.ics.uci.edu/ml/datasets/ bike+sharing+dataset (accessed on 15 January 2022). The response is the log-transformed count. The explanatory variables include normalized hour, normalized temperature, normalized humidity, normalized wind speed, season, holiday, weekday, and weather situation.
(3) California house price (CHP, for short) data [40], which consists of n = 20,640 observations and p = 8 predictors, can be downloaded from Carnegie-Mellon StatLib repository (http://lib.stat.cmu.edu/datasets/ (accessed on 15 January 2022)). The response variable is the median house value. The explanatory variables include longitude, latitude, housing median age, medium income, population, total rooms, total bedrooms, and households.
(4) Parkinson (PK, for short) data, which consists of n = 5874 observations and p = 19 predictors, can be downloaded from https://archive.ics.uci.edu/ml/datasets/Parkinsons+ Telemonitoring (accessed on 15 January 2022). The response variable is the total UPDRS scores, and the explanatory variables include subject age, subject gender, time interval from baseline recruitment date, and 16 biomedical voice measures.
In the numerical comparison, we first standardize the response and predictors so that their sample means and variances equal 0 and 1, respectively. We split each data set into 80% training data and 20% testing data with seeds 0, …, 29. Thus, we obtain 30 MSEs for testing data and 30 MSEs for training data for each procedure, and we report the sample mean and standard deviation (std) of the 30 MSEs for each procedure and data set. The left and right panels of Figure 2 depict the box-plots of MSEs for the AF and CHP data sets, respectively. For CHP data, the MSE of training data decreases as the number of nodes increases, while the MSE of testing data decreases at first and then remains at the same level as the number of nodes increases. This is expected since, as the number of nodes increases, the corresponding neural network may be overfitted, which may lead to an increase in prediction error. For the AF data set, the MSE of training and testing data decreases at first, but then increases as the number of nodes increases, reaching its minimum when the number of nodes is around 60. The difference in the pattern of training MSE between the AF and CHP data in Figure 2 might be due to the different sample sizes of the AF data (n = 1503) and CHP data (n = 20,640).
Experiments on these two data sets suggest that LLA yields the best prediction performance when the number of nodes in the one-hidden-layer neural network is 60. Thus, we set 60 as the default number of nodes for the LLA regressor.
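The evaluation protocol above (standardize, then repeated random 80/20 splits with seeds 0, …, 29) can be sketched as follows; the `fit_predict` interface is a hypothetical stand-in for any of the compared procedures.

```python
import numpy as np

def evaluate(fit_predict, X, y, n_splits=30, test_frac=0.2):
    """Monte Carlo protocol of this section: standardize, then for seeds
    0..n_splits-1 draw a random 80/20 split, fit, and record the training
    and testing MSEs; return their sample means and standard deviations.

    `fit_predict(Xtr, ytr, Xte)` must return (train_pred, test_pred).
    """
    X = (X - X.mean(0)) / X.std(0)       # mean 0, variance 1, as in the paper
    y = (y - y.mean()) / y.std()
    n = len(y)
    n_test = int(round(test_frac * n))
    train_mse, test_mse = [], []
    for seed in range(n_splits):
        idx = np.random.default_rng(seed).permutation(n)
        te, tr = idx[:n_test], idx[n_test:]
        p_tr, p_te = fit_predict(X[tr], y[tr], X[te])
        train_mse.append(np.mean((y[tr] - p_tr) ** 2))
        test_mse.append(np.mean((y[te] - p_te) ** 2))
    return (np.mean(train_mse), np.std(train_mse),
            np.mean(test_mse), np.std(test_mse))
```

Seeding each split by its index makes the 30 splits identical across all compared procedures, so differences in the reported means and standard deviations reflect the methods rather than the partitions.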

Comparison with Stochastic Gradient Algorithms
In this subsection, we compare the performance of LLA with Adam and Adamax [33], two popular stochastic gradient algorithms.
We consider the default settings for these three algorithms. The number of nodes in LLA is 60 by default, as discussed in Section 3.1.1. For the Adam algorithm, the default step size is α = 0.001, the exponential decay rates for the moment estimates are β_1 = 0.9 and β_2 = 0.999, the constant for numerical stability is ε = 10^{−8}, and the moving averages are initialized as zero vectors. The default settings for the Adamax algorithm are the same as those of Adam, except that the step size is set to α = 0.002. Figure 3 depicts the MSEs of training and testing data versus the number of epochs for each algorithm. From Figure 3, the MSEs of the LLA algorithm become stable quickly, after only a few epochs. Compared with Adam and Adamax, the LLA algorithm needs fewer epochs to achieve stability, and once the MSEs become stable, the LLA has the smallest MSEs among the three algorithms. This is expected because, for a nonlinear least squares problem, the LLA algorithm utilizes both the gradient and the Hessian of the objective function, whereas the Adam and Adamax algorithms are stochastic gradient descent algorithms, which only take advantage of the gradient of the objective function. Thus, they converge comparatively slowly.

Performance of LLA in Regression Settings
Next, we compare the performance of LLA with other commonly used procedures. All hyperparameters are set to their defaults; the performance of procedures such as MLP might be further improved by extensive hyperparameter tuning. Table 1 reports 100 times the sample mean and standard deviation of the 30 MSEs of testing data (testing MSE for short) as well as of the 30 MSEs of training data (training MSE for short). Table 1 implies that XGBoost performs better than the other two boosting methods, AdaBoost and GBM, and that the LLA performs well in terms of prediction accuracy. Specifically, the LLA has the lowest testing MSEs for the AF and BS data sets. For CHP, the LLA has a slightly greater testing MSE than XGBoost, while XGBoost has a deeper tree by default. For the AF, BS, and CHP data sets, the difference between the testing MSE and training MSE is small for LLA, implying that the LLA is less likely to over-fit. As illustrated in Figure 4, for the AF, BS, and CHP data sets, the prediction MSEs of MLP and GBM are likely to have large variation across data splittings and thus more potential outliers. LLA predictions have a relatively small standard deviation of MSEs and fewer points outside the interquartile range (IQR) of the boxplots, compared to the other procedures. This implies that LLA is likely to obtain consistent results and has a fairly stable performance. For PK data, LLA outperforms MLP in terms of both prediction accuracy and standard deviation of MSEs; however, XGBoost and RF have the best performance, which motivates us to examine the data further. By exploratory data analysis of the PK data, we find that several of its 19 predictors are highly correlated. For example, the correlation between Shimmer and Shimmer(dB) is 0.9923, and the correlation between Jitter(%) and Jitter(RAP) is 0.9841.
The weaker performance of the LLA may be due to the collinearity of the predictors, since linear regression is applied to update the weights and biases. Thus, we conduct a principal component analysis (PCA); the first 6, 9, and 13 principal components (PCs) explain 95%, 99%, and 99.9% of the variance, respectively. We then apply all procedures to the PK data with the predictors set to the PCs. Table 2 depicts the MSEs for PK data with PC predictors. Compared with Table 1, the performance of XGBoost and RF deteriorates. Tables 1 and 2 imply that the performance of RF and XGBoost is not robust under linear transformations of the predictors. The performance of MLP is quite stable across different numbers of PC variables, and the performance of the LLA improves significantly by using PCs.
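A PCA preprocessing step of this kind can be sketched with a plain SVD; this is a generic sketch, not tied to the software used in the paper.

```python
import numpy as np

def pca_scores(X, var_explained=0.95):
    """Replace collinear predictors by principal components, keeping the
    smallest number of PCs whose cumulative variance ratio reaches the
    target (0.95, 0.99 and 0.999 in the PK analysis)."""
    Xc = X - X.mean(0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)     # cumulative variance ratio
    k = int(np.searchsorted(ratio, var_explained)) + 1
    return Xc @ Vt[:k].T, k
```

On predictors containing near-duplicate columns, as in the PK data, the retained dimension k drops well below the original number of predictors, which removes the collinearity that disturbs the LLA's internal linear regressions.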

Classification
In this section, we compare the performance of LLA classifier with other commonlyused procedures based on the following three data sets.
(1) Ringnorm (RN, for short) data consists of n = 7400 observations with p = 20 predictors and can be downloaded from https://www.cs.toronto.edu/~delve/data/ringnorm/ desc.html (accessed on 15 January 2022). This data set has been analyzed in [41]. Each class is drawn from a multivariate normal distribution. Breiman [41] reports the theoretical expected misclassification rate as 1.3%.
(2) Magic gamma (MG, for short) data consists of n = 19,020 observations with p = 11 predictors and can be downloaded from https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope (accessed on 15 January 2022). The data are Monte Carlo generated to simulate the registration of high-energy gamma particles in a ground-based atmospheric Cherenkov gamma telescope using the imaging technique [42]. The binary response is whether the event is a signal or a background event. The explanatory variables include fLength (major axis of ellipse), fWidth (minor axis of ellipse), fSize (10-log of sum of content of all pixels), fConc (ratio of sum of two highest pixels over fSize), fConc1 (ratio of highest pixel over fSize), fAsym (distance from highest pixel to center, projected onto major axis), etc.
(3) Mammographic mass (MM, for short) data consists of n = 800 observations with missing values deleted and p = 4 predictors. It can be downloaded from https: //archive.ics.uci.edu/ml/datasets/Mammographic+Mass (accessed on 15 January 2022). The response is severity (benign or malignant). The explanatory variables include age, shape, margin, and density.
We first standardize the continuous predictors so that their sample means and variances equal 0 and 1, respectively. We split each data set into 80% training data and 20% testing data with seeds 0, …, 29. Thus, we obtain 30 AUC-ROC scores (AUC for short hereafter) for testing data and 30 AUC scores for training data for each procedure. (AUC-ROC is the area under the receiver operating characteristic curve, which plots the true positive rate against the false positive rate; it provides an aggregate measure of a classifier's performance across all possible classification thresholds.) We report the sample mean and standard deviation (std) of the 30 AUC scores for each procedure and data set.
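For reference, the AUC used throughout this section can be computed directly from ranks via the Mann-Whitney identity, without tracing the ROC curve:

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC-ROC via the rank-sum (Mann-Whitney) identity: the probability
    that a randomly chosen positive is scored above a randomly chosen
    negative, with ties counted as 1/2."""
    n = len(scores)
    ranks = np.empty(n)
    ranks[np.argsort(scores)] = np.arange(1, n + 1)
    for v in np.unique(scores):          # average ranks over tied scores
        m = scores == v
        ranks[m] = ranks[m].mean()
    pos = y_true == 1
    n1, n0 = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
```

A classifier that ranks every positive above every negative scores 1.0; random scoring averages 0.5, which is the natural baseline against which the tables below should be read.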

Impact of the Number of Nodes for Classification Problems
We first investigate the impact of the number of nodes on the performance of the LLA classifier in terms of AUCs for the testing and training data sets. The number of nodes ranges from 15 to 70 in the LLA. Figure 5 depicts the box-plots of AUCs for the RN and MM data, respectively. For the RN data set, the AUCs of training and testing data remain close to 1 across different numbers of nodes, and reach their maximum at approximately 30. As shown in Figure 5, when the number of nodes is small, the LLA algorithm has a stable performance; as the number of nodes increases, overfitting becomes likely, especially for the RN data set, leading to estimates with larger variance. For MM data, the AUCs of training data first increase and then remain constant when the number of nodes is between 50 and 70. For the testing data, when the number of nodes varies from 15 to 35, the AUCs are close to each other; as the number of nodes increases further, the AUCs for the testing data decrease and settle at approximately 0.85. When the number of nodes is too large, overfitting is likely to occur. As illustrated by these two data sets, when the number of nodes is set to 30, the corresponding LLA classifier yields good prediction performance. Therefore, we set 30 as the default value in our numerical experiments for classification problems.

Comparison with Stochastic Gradient Algorithms
We then examine the convergence rate of the LLA algorithm in comparison with Adam and Adamax, two well-known stochastic gradient algorithms for neural networks. To make a fair comparison among the LLA, Adam and Adamax, we set J_1 = 30 in the one-hidden-layer network for all three algorithms and use their default settings. Specifically, the default step size for Adam is α = 0.001, the exponential decay rates for the moment estimates are β_1 = 0.9 and β_2 = 0.999, the constant for numerical stability is set to ε = 10^{−8}, and the moving averages are initialized as zero vectors. For Adamax, the step size is set to α = 0.002. Figure 6 depicts the AUC scores of training and testing data versus the number of epochs for each algorithm. As seen from Figure 6, the AUC curves of the LLA algorithm increase and then become stable within 20 epochs for both the RN and MG data sets, whereas both Adam and Adamax need many more epochs to reach stable AUC values; for example, both need about 150 epochs for the RN data. This implies that the LLA converges faster than Adam and Adamax, consistent with our theoretical analysis. The LLA also performs better than Adam and Adamax in terms of AUC score, as can be clearly seen from the AUC plots for the MG data set.

Performance of LLA in Classification Problems
We next compare the performance of the LLA classifier with the other procedures under default settings. The number of nodes in the LLA for the RN, MM and MG data sets is 30, and the maximum number of epochs is set to 100. Table 3 reports 100 times the sample mean and standard deviation of the 30 AUC scores for the testing data (testing AUC for short) as well as those for the training data (training AUC for short). Table 3 implies that the LLA performs best for the RN and MM data sets in terms of AUC scores, followed by MLP. Figure 7 compares the AUC scores of the different methods more explicitly. As shown in Figure 7, all procedures have small standard deviations for the testing data; however, for the training data, RF has a large standard deviation for the RN data, while XGBoost, GBM, and AdaBoost fluctuate more for the MM data. The LLA has small standard deviations for all data sets, illustrating that the LLA retains stability. The differences between the testing AUCs and training AUCs are small for the LLA classifier, which shows that the LLA classifier is less likely to over-fit. For MG data, RF performs best, followed by XGBoost. This motivates us to examine the data further by considering multi-layer networks, which will be discussed in detail in Section 3.4.

Analysis of Computational Time
In this section, we compare the computational time of LLA with other commonly used procedures. As shown in Figure 3, LLA converges faster than the Adam algorithm in terms of training epochs; each training epoch of LLA, though, takes longer than one of Adam or Adamax. To make a fair comparison of computational time, we use 'tensorflow.keras.Sequential' to build one-hidden-layer neural networks with the Adam and Adamax optimizers. Similar to the Monte Carlo experiment settings used for Tables 1 and 3, we randomly split the data into two parts (80% for training and 20% for testing) and repeat the process 30 times to calculate the mean and standard deviation. For regression problems, the models contain 60 nodes (the default) in the hidden layer and are forced to run 1000 epochs. For classification problems, the models contain 30 nodes (the default) in the hidden layer and are forced to run 150 epochs. The experiments are run on a personal computer (3.1 GHz Intel Core i5 processor) with Python 3.8 and TensorFlow 2.4.0. Table 4 shows the sample means along with standard deviations (in parentheses) of the testing MSE (100 times the actual values) and the training time (in seconds). The results show that although LLA takes longer to train 1000/150 epochs, it achieves higher accuracy in both regression and classification models. Moreover, as shown in Figures 3 and 6, LLA converges in fewer epochs than Adam or Adamax, demonstrating that, if early stopping is applied, the computational time of LLA is comparable to that of Adam and Adamax. Consider the AF data set as an example: if LLA stops at 220 epochs, the mean and standard deviation of its training time are 55.3 and 9.2 s, meaning that the computational time of LLA is close to Adam's and Adamax's. For classification problems, LLA can also converge in fewer epochs than Adam or Adamax, as shown in Figure 6.

Performance of Deep LOAN
In this subsection, we examine the performance of the LOAN with multiple hidden layers. Since AdaBoost offers no option to set different depth levels, we exclude it from this comparison. We set the MLP, XGBoost, GBM, and RF to the same number of layers or depth L, the number of hidden layers used in the LLA. The other parameters of the MLP, XGBoost, GBM, and RF are set to their default values. For the regression problem, we apply these four procedures and the LLA to the CHP data; for the classification problem, we investigate the MG data in detail.
For regression problems, Table 5 depicts the sample means and standard deviations over the 30 training/testing data splits for L = 1, 2, and 3, based on the CHP data set. The number of nodes in the one-hidden-layer LOAN is set to 60. The LOAN with two hidden layers has 60 and 10 nodes in the first and second layers, respectively, and the LOAN with three hidden layers has 60, 10, and 50 nodes in the first, second, and third layers, respectively. From Table 5, we can see that both the LOAN and the MLP have smaller MSEs as L increases, though without dramatic improvement, while XGBoost, GBM, and RF seem to improve more as L increases. Table 5 indicates that the LOAN performs the best for L = 1, 2, and 3 in terms of the means and standard deviations of both the training and testing MSEs.
For classification problems, Table 6 depicts the sample means and standard deviations of the AUC scores for different values of L based on the MG data set. For the LOAN with different numbers of layers, 30 nodes are chosen for each layer. Compared with GBM and RF, the LLA, MLP, and XGBoost perform better. As the number of layers increases, the improvement of the LOAN is dramatic. Although the LLA does not perform the best when only one hidden layer is used, it achieves the highest AUC scores when L is 2 or 3. The standard deviations for the LOAN continue to decrease as L increases for both the testing and training data, illustrating that a deep LLA leads to more stable prediction performance for this data set.
These experiments demonstrate that even a shallow LOAN achieves both stable performance and a high level of prediction accuracy compared with commonly used deep learning models.

Conclusions and Discussion
In this paper, we develop the LLA algorithm for an FNN for both regression and classification problems, and show that the LLA has a quadratic convergence rate. Our comparison based on benchmark data sets indicates that the LLA can greatly improve training efficiency and outperforms commonly used procedures in both training time and prediction accuracy. The algorithm is generally applicable and can be extended to deep neural networks via layer-wise or back-propagation construction. Exciting future work could extend the proposed method to other settings, such as survival data analysis. A parallel computing algorithm can also be developed for the proposed method.

Acknowledgments:
The authors are grateful to the reviewers for constructive comments, which led to a significant improvement of this work. The views expressed in the paper are those of the authors and do not represent the views of Wells Fargo.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
This appendix consists of the proofs of Theorems 1 and 2.
The gradient and Hessian matrix of the least squares function $\ell_1(\theta) = \sum_{i=1}^n \{y_i - m(x_i;\theta)\}^2$ can be derived as
$$\ell_1'(\theta) = -2\sum_{i=1}^n \{y_i - m(x_i;\theta)\}\frac{\partial m(x_i;\theta)}{\partial\theta},$$
$$\ell_1''(\theta) = 2\sum_{i=1}^n \frac{\partial m(x_i;\theta)}{\partial\theta}\frac{\partial m(x_i;\theta)}{\partial\theta^T} - 2\sum_{i=1}^n \{y_i - m(x_i;\theta)\}\frac{\partial^2 m(x_i;\theta)}{\partial\theta\,\partial\theta^T}.$$
Thus, $E\{\ell_1''(\theta)\} = 2\sum_{i=1}^n E\left\{\frac{\partial m(x_i;\theta)}{\partial\theta}\frac{\partial m(x_i;\theta)}{\partial\theta^T}\right\}$, and a natural estimate of $E\{\ell_1''(\theta)\}$ is $\hat I(\theta) = 2\sum_{i=1}^n \frac{\partial m(x_i;\theta)}{\partial\theta}\frac{\partial m(x_i;\theta)}{\partial\theta^T}$. By using the Fisher scoring algorithm, we do not need to evaluate $\partial^2 m(x_i;\theta)/\partial\theta\,\partial\theta^T$, which typically is hard to calculate. The LLA approximates $m(x_i;\theta)$ by $\tilde m(x_i;\theta,\theta^{(c)})$, in which each ReLU activation is replaced by its linear approximation at the current estimate $\theta^{(c)}$. Therefore, the Fisher scoring algorithm for the least squares function based on the local linear approximation $\tilde\ell_1(\theta) = \sum_{i=1}^n \{y_i - \tilde m(x_i;\theta,\theta^{(c)})\}^2$ yields the same update $\theta^{(c+1)}$ as given in (A1). This completes the proof of Theorem 1.
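The Fisher scoring update above can be illustrated numerically. The sketch below is a minimal NumPy implementation under the assumption that the LLA replaces each ReLU $\sigma(u)$ by the indicator-weighted linear function $1\{u^{(c)} > 0\}\,u$ at the current estimate, so the Jacobian of $\tilde m$ equals the Jacobian of $m$ at $\theta^{(c)}$; the ridge term and the step-halving safeguard are additions for numerical stability in this sketch, not part of the paper's algorithm statement, and the toy data are hypothetical.

```python
import numpy as np

def predict(X, W, b, beta, beta0):
    # one-hidden-layer ReLU network: m(x) = beta0 + sum_j beta_j ReLU(w_j^T x + b_j)
    return beta0 + np.maximum(X @ W + b, 0.0) @ beta

def lla_step(X, y, W, b, beta, beta0, ridge=1e-6):
    """One Fisher-scoring (Gauss-Newton) update under the local linear
    approximation. A small ridge term and a step-halving safeguard are
    added for stability in this illustrative sketch."""
    n, d = X.shape
    J1 = W.shape[1]
    H = X @ W + b                      # (n, J1) pre-activations
    D = (H > 0).astype(float)          # active-neuron indicators at theta^(c)
    r = y - predict(X, W, b, beta, beta0)
    # Jacobian of m w.r.t. theta = (beta0, beta, vec(W), b)
    J = np.concatenate([
        np.ones((n, 1)),                                              # d m / d beta0
        D * H,                                                        # d m / d beta_j
        (X[:, :, None] * (D * beta)[:, None, :]).reshape(n, d * J1),  # d m / d W_kj
        D * beta,                                                     # d m / d b_j
    ], axis=1)
    # Fisher-scoring direction: (J^T J + ridge I)^{-1} J^T r
    delta = np.linalg.solve(J.T @ J + ridge * np.eye(J.shape[1]), J.T @ r)
    base = np.mean(r ** 2)
    t = 1.0
    for _ in range(30):                # halve the step until the loss decreases
        beta0_n = beta0 + t * delta[0]
        beta_n = beta + t * delta[1:1 + J1]
        W_n = W + t * delta[1 + J1:1 + J1 + d * J1].reshape(d, J1)
        b_n = b + t * delta[1 + J1 + d * J1:]
        if np.mean((y - predict(X, W_n, b_n, beta_n, beta0_n)) ** 2) < base:
            return W_n, b_n, beta_n, beta0_n
        t *= 0.5
    return W, b, beta, beta0           # no improving step found

# toy regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.maximum(X[:, 0] + 0.5 * X[:, 1], 0.0) - X[:, 2] + 0.1 * rng.normal(size=200)

J1 = 8
W = 0.5 * rng.normal(size=(3, J1)); b = rng.normal(size=J1)
beta = 0.5 * rng.normal(size=J1); beta0 = 0.0

mse0 = np.mean((y - predict(X, W, b, beta, beta0)) ** 2)
for _ in range(10):
    W, b, beta, beta0 = lla_step(X, y, W, b, beta, beta0)
mse1 = np.mean((y - predict(X, W, b, beta, beta0)) ** 2)
```

Because the update direction is built from first derivatives of $m$ only, the second derivatives $\partial^2 m/\partial\theta\,\partial\theta^T$ never need to be formed, which is the computational point of Theorem 1.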
Furthermore, it follows that, for $j = 1, \cdots, J_1$, the first part of the identity holds by the definitions of $\gamma_j$ and $z_{2ij}$. Similarly, the second part holds by the definitions of $\eta_j$ and $z_{3ij}$. Thus, (16) equals (10) exactly. This completes the proof.