A Lightweight Learning Method for Stochastic Configuration Networks Using Non-Inverse Solution

Abstract: Stochastic configuration networks (SCNs) face time-consuming issues when dealing with complex modeling tasks that usually require a mass of hidden nodes to build an enormous network. An important reason behind this issue is that SCNs always employ the Moore-Penrose generalized inverse method, which has high complexity, to update the output weights in each increment. To tackle this problem, this paper proposes lightweight SCNs, called L-SCNs. First, to avoid using the Moore-Penrose generalized inverse method, a positive definite equation is proposed to replace the over-determined equation, and the consistency of their solutions is proved. Then, to reduce the complexity of calculating the output weights, a low-complexity method based on Cholesky decomposition is proposed. The experimental results on both benchmark function approximation and real-world problems, including regression and classification applications, show that L-SCNs are sufficiently lightweight.


Introduction
Although deep neural networks have proven to be a powerful learning tool, most of them suffer from time-consuming training due to their massive numbers of hyperparameters and their complex structures. In many heterogeneous data analytics tasks, flattened networks can achieve promising performance. Among flattened networks, single-hidden-layer feedforward neural networks (SLFNs) [1,2] have been widely applied because of their universal approximation capability and simple construction. However, gradient-descent-based learning algorithms are generally adopted for SLFN training; therefore, slow convergence and entrapment in local minima are often-encountered problems [3].
The randomized learning method offers an alternative approach to training flattened networks. Many randomized flattened networks have been shown to approximate continuous functions on compact sets, and they also have the property of fast learning [4,5]. Stochastic configuration networks (SCNs) [6] provide a state-of-the-art randomized incremental learning method for SLFNs. In comparison with traditional randomized incremental learning models, SCNs have two advantages: (1) SCNs randomly assign the input weights and biases of the hidden nodes in dynamically adjustable scopes according to the supervisory mechanism; (2) SCNs yield a more compact network structure. Therefore, SCNs have been extensively studied and have become a hot topic of neural computing.
For large-scale data analytics, an ensemble learning method that quickly disassociates heterogeneous neurons was designed for SCNs by using a negative correlation learning strategy [7]. To improve learning efficiency, SCNs with block increments and variable increments were developed, which allow multiple hidden nodes to be added at each iteration [8,9]. Point and block increments were then integrated into parallel SCNs (PSCNs) [10]. To resolve modeling tasks on uncertain data, robust SCNs (RSCNs) were proposed by using the maximum correntropy criterion (MCC) and kernel density estimation [10][11][12]. To further improve expressiveness, SCNs with deep and stacked structures were proposed [13][14][15]. In [16], two-dimensional SCNs (2DSCNs) were constructed for image data analytics. To address prediction interval estimation problems, the corresponding deep, ensemble, robust, and sparse versions of SCNs were developed [17][18][19]. In addition to the above theoretical studies, SCNs have been successfully applied in many fields, such as optical fiber pre-warning systems [20], industrial processes [21], concrete defect recognition [22], and so on.
However, the SCN construction process can be extremely time-consuming when dealing with complex modeling tasks. The fundamental reason is that singular value decomposition (SVD) is needed to solve the output weights [23]. Concretely, the number of rows of the hidden layer output matrix is always much larger than the number of columns [24,25], which makes the hidden layer output matrix an over-determined matrix with no inverse [26]. To obtain the output weights, it is necessary to employ SVD to compute the Moore-Penrose (M-P) generalized inverse of this over-determined matrix. Theoretically, the complexity of SVD is related to the third power of the number of hidden nodes and to the product of the number of hidden nodes and the input dimension. This makes the modeling process of SCNs extremely time-consuming when dealing with complex tasks that require a large network structure (a large number of hidden nodes) to enhance the expressive power of the model. This paper proposes a lightweight non-inverse solution method for the output weights of SCNs (L-SCNs) by introducing normal equation theory [27] and Cholesky decomposition [28]. The main contributions of the paper are as follows:

1. To avoid adopting the M-P generalized inverse with SVD, a positive definite equation for solving the output weights is established based on normal equation theory to replace the over-determined equation;
2. The consistency of the solutions of the positive definite equation and the over-determined equation in calculating the output weights is proved;
3. A low-complexity method for solving the positive definite equation based on Cholesky decomposition is proposed;
4. Experimental results on both benchmark function approximation and real-world problems, including regression and classification applications, show that, compared with SCNs and IRVFLNs (an incremental variant of RVFLNs), the proposed L-SCNs have superior lightweight performance.
The remaining parts of the paper are organized as follows. In Section 2, the basic principle of SCNs and some remarks are shown. The algorithm description of L-SCNs and full proof of related theories are presented in Section 3. In Section 4, the experimental setup is given and the performance of L-SCNs is fully discussed. Some conclusions are drawn in Section 5.

Brief Review of SCNs
As a kind of flattened network, the SCNs model includes an input layer, a hidden layer, and an output layer. Its hidden layer is constructed incrementally according to the supervisory mechanism. The specific SCNs network structure is shown in Figure 1. The construction process of SCNs is briefly described as follows. Given an input X = {x_1, x_2, ..., x_N}, x_i ∈ R^d, and its corresponding output f = {f_1, f_2, ..., f_N}, f_i ∈ R^m, suppose that we have already built an SCN with L−1 hidden nodes, i.e.,

f_{L−1}(x) = Σ_{j=1}^{L−1} β_j g_j(w_j^T x + b_j),   (1)

where β_j = [β_{j,1}, β_{j,2}, ..., β_{j,m}]^T is the output weight vector of the j-th hidden node, w_j is the input weight vector of the j-th hidden node, b_j is the threshold of the j-th hidden node, g_j(w_j^T x + b_j) is the output of the j-th hidden node, and "T" denotes the transpose of a matrix.

The current residual error of the SCNs is calculated by Equation (2):

e_{L−1} = f − f_{L−1} = [e_{L−1,1}, ..., e_{L−1,m}].   (2)

Let ε denote the acceptable error tolerance. If e_{L−1} does not reach ε, new nodes are added to the SCNs under the supervisory configuration mechanism:

ξ_{L,q} = ⟨e_{L−1,q}, h_L⟩² / ⟨h_L, h_L⟩ − (1 − r − μ_L) ‖e_{L−1,q}‖² ≥ 0, q = 1, ..., m,   (3)

ξ_L = Σ_{q=1}^{m} ξ_{L,q},   (4)

where 0 < r < 1 indicates the regularization parameter and {μ_L} is a nonnegative real number sequence with μ_L ≤ 1 − r and lim_{L→∞} μ_L = 0. The best hidden node parameters are determined by the maximum ξ_L. Then, the output weights can be evaluated by Equation (5):

β* = arg min_β ‖H_L β − f‖ = H_L^† f,   (5)

where β* = [β*_1, ..., β*_L], H_L = [h_1, ..., h_L] is the current hidden layer output matrix with h_p = g_p(w_p^T X + b_p), p = 1, ..., L, and H_L^† is the M-P generalized inverse matrix of H_L. The above process is repeated until the residual error reaches the expected tolerance ε or the number of hidden nodes reaches the maximum.
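The supervisory configuration step above — draw random candidate nodes, keep those satisfying the inequality in Equation (3), and select the one with the largest ξ_L — can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the toy data, the sigmoid activation, and the parameter values (T_max, γ, r, μ_L) are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (illustrative only).
X = np.linspace(0, 1, 50).reshape(-1, 1)   # N x d inputs
f = np.sin(4 * X).ravel()                  # N targets (m = 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def configure_node(e, X, T_max=20, gamma=1.0, r=0.99, mu=0.01):
    """Try T_max random candidates (w, b) and keep the one with the
    largest xi = <e, h>^2 / <h, h> - (1 - r - mu) * ||e||^2, subject
    to xi >= 0 (the supervisory inequality, Equation (3))."""
    best = None
    for _ in range(T_max):
        w = rng.uniform(-gamma, gamma, size=X.shape[1])
        b = rng.uniform(-gamma, gamma)
        h = sigmoid(X @ w + b)             # candidate hidden node output
        xi = (e @ h) ** 2 / (h @ h) - (1 - r - mu) * (e @ e)
        if xi >= 0 and (best is None or xi > best[0]):
            best = (xi, w, b, h)
    return best

# For the first node, the residual is the target itself.
best = configure_node(f, X)
```

In a full implementation this step would be repeated after each node is added, with `e` updated to the new residual.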

Remark 1.
It can be seen that the hidden nodes of SCNs are built incrementally, and all output weights need to be recalculated after each hidden node is added. Therefore, the complexity of the modeling process depends on the evaluation method of the output weights.

Remark 2.
As can be seen from the above analysis, H_L is an over-determined matrix. Therefore, the M-P generalized inverse method is used to solve the over-determined equation H_L β = f. However, the complexity of the M-P generalized inverse method is related to the third power of the number of hidden nodes and to the product of the number of hidden nodes and the input dimension, owing to the use of singular value decomposition (SVD). Thus, the M-P method is very time-consuming, especially when dealing with complex modeling tasks that require a large number of hidden nodes. In addition, the M-P generalized inverse method can only obtain an approximate solution of the output weights, which makes it difficult to render the model optimal.
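Remark 2 can be illustrated numerically: a tall matrix H (many more samples than hidden nodes) has no ordinary inverse, and the M-P pseudoinverse computed via SVD yields the least-squares solution of the over-determined system. A minimal NumPy sketch with made-up dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)

N, L = 200, 10                       # far more samples than hidden nodes
H = rng.standard_normal((N, L))      # stand-in hidden layer output matrix
f = rng.standard_normal(N)           # stand-in target vector

# H is rectangular, so it has no ordinary inverse; np.linalg.pinv
# computes the Moore-Penrose pseudoinverse via SVD.
beta_pinv = np.linalg.pinv(H) @ f

# The same least-squares solution from NumPy's dedicated solver.
beta_lstsq, *_ = np.linalg.lstsq(H, f, rcond=None)
```

Both calls return the minimizer of ‖Hβ − f‖, but the SVD underlying `pinv` is the expensive part that L-SCNs set out to avoid.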

L-SCNs Method
From the above analysis, it can be seen that the M-P generalized inverse method involving SVD is the main reason for the time-consuming nature of SCNs modeling. In order to solve this problem, a positive definite equation based on the normal equation is proposed to replace the over-determined equation. Then, a low-computational-complexity method based on Cholesky decomposition is proposed to solve the positive definite equation and obtain the output weights, thereby reducing the modeling complexity of SCNs.

Positive Definite Equation
For the sake of brevity, this paper introduces H to replace H_L. According to normal equation theory, Hβ = f can be transformed into the positive definite equation

H^T H β = H^T f.   (6)

Theorem 1 guarantees the consistency of the solutions of the positive definite equation and the over-determined equation, and a strict proof is given.

Theorem 1. β* is a solution of the positive definite Equation (6) if and only if β* is a least-squares solution of the over-determined equation Hβ = f.

Proof. Sufficiency: Suppose β* satisfies H^T H β* = H^T f. For any β, write β = β* + δ; then

‖Hβ − f‖² = ‖Hβ* − f‖² + 2δ^T (H^T H β* − H^T f) + ‖Hδ‖²   (7)

= ‖Hβ* − f‖² + ‖Hδ‖² ≥ ‖Hβ* − f‖².   (8)

Therefore, β* is a least-squares solution of Hβ = f.

Necessity: Let r = f − Hβ; the i-th component of r can be written as

r_i = f_i − Σ_{j=1}^{L} h_{ij} β_j.   (9)

Let

F(β_1, ..., β_L) = ‖r‖² = Σ_{i=1}^{N} (f_i − Σ_{j=1}^{L} h_{ij} β_j)².   (10)

From the necessary conditions for an extremum of a multivariate function, it can be obtained that

∂F/∂β_k = −2 Σ_{i=1}^{N} h_{ik} (f_i − Σ_{j=1}^{L} h_{ij} β_j) = 0, k = 1, ..., L.   (11)

Equation (11) can be transformed into matrix form:

H^T H β = H^T f.   (12)

Based on the above analysis, the solutions of the two equations are consistent in theory. □
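The consistency established above can also be checked numerically: solving the positive definite normal equation with an ordinary linear solver gives the same output weights as the SVD-based pseudoinverse, up to floating-point error. A small NumPy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)

N, L = 300, 12
H = rng.standard_normal((N, L))      # full column rank almost surely
f = rng.standard_normal(N)

# Over-determined system solved via the M-P pseudoinverse (SVD).
beta_mp = np.linalg.pinv(H) @ f

# Positive definite normal equation (H^T H) beta = H^T f, Equation (12).
A = H.T @ H
b = H.T @ f
beta_ne = np.linalg.solve(A, b)
```

When H has full column rank, H^T H is invertible and both routes recover the unique least-squares solution.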

SCNs with Cholesky Decomposition
In order to reduce the computational complexity of the model, this paper uses the Cholesky decomposition method, which does not involve an inversion operation, to solve Equation (12). However, Cholesky decomposition has the premise that the decomposed matrix must be a positive definite symmetric matrix. Since H is not always of full rank in practical applications, H^T H is not necessarily a positive definite matrix. In this paper, we introduce a moderator factor I/C to make H^T H a full-rank matrix, where I is the identity matrix of the same size as H^T H and C is determined by cross validation. Thus, Equation (6) can be denoted by

A β = H^T f, A = H^T H + I/C.   (13)

The transpose of A can be evaluated by Equation (14):

A^T = (H^T H + I/C)^T = H^T H + I/C = A;   (14)

therefore, A = A^T, i.e., A is a symmetric matrix. Given an arbitrary vector v ≠ 0, the quadratic form of A can be expressed as

v^T A v = v^T H^T H v + v^T v / C = ‖Hv‖² + ‖v‖²/C > 0.   (15)

Based on this result, it is easy to verify that A is a positive definite symmetric matrix. The solving process of β* based on Cholesky decomposition is as follows. A is factorized as

A = S S^T,   (16)

where S is a lower triangular matrix. Based on Equation (16), the elements s_ij of S that are not 0 can be evaluated by

s_ii = (a_ii − Σ_{k=1}^{i−1} s_ik²)^{1/2}, s_ij = (a_ij − Σ_{k=1}^{j−1} s_ik s_jk) / s_jj, j < i,   (17)

where i, j = 1, 2, ..., L.
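The element-wise recurrences of Equation (17) can be sketched directly in code. The following NumPy function is an illustration of the standard Cholesky recurrence in the paper's notation, not the authors' implementation; the test matrix is built as H^T H + I/C with arbitrary dimensions and an arbitrary C:

```python
import numpy as np

def cholesky_lower(A):
    """Cholesky factorization A = S S^T of a symmetric positive
    definite matrix, built element by element as in Equation (17)."""
    n = A.shape[0]
    S = np.zeros_like(A)
    for i in range(n):
        for j in range(i + 1):
            acc = A[i, j] - S[i, :j] @ S[j, :j]
            if i == j:
                S[i, j] = np.sqrt(acc)        # diagonal entries s_ii
            else:
                S[i, j] = acc / S[j, j]       # below-diagonal entries s_ij
    return S

rng = np.random.default_rng(3)
H = rng.standard_normal((50, 8))
C = 100.0                                     # illustrative moderator value
A = H.T @ H + np.eye(8) / C                   # positive definite by (15)
S = cholesky_lower(A)
```

Only additions, multiplications, divisions, and square roots appear — no matrix inversion and no SVD.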
Substituting Equation (16) into Equation (13) gives S S^T β = H^T f. Let K = S^T β and b = H^T f; then Equation (13) can be denoted by

S K = b.   (18)

The elements of K are evaluated by forward substitution:

k_1 = b_1 / s_11, k_i = (b_i − Σ_{j=1}^{i−1} s_ij k_j) / s_ii, i = 2, ..., L.   (19)

To sum up, the output weights β*_i can be calculated by back substitution on S^T β = K:

β*_L = k_L / s_LL, β*_i = (k_i − Σ_{j=i+1}^{L} s_ji β*_j) / s_ii, i = L − 1, ..., 1.   (20)

The pseudo code of L-SCNs is described in Algorithm 1.
Algorithm 1 L-SCNs
1. Given a training set (X, f), set the error tolerance ε, the maximum number of hidden nodes L_max, the maximum number of random configurations T_max, the assignment range γ, the parameter r, and the moderator factor C
2. Initialize e := f, L := 1, W := ∅, Ω := ∅
3. While L ≤ L_max and ‖e‖ > ε do
4.   Randomly assign w_L and b_L from [−γ, γ]
5.   Compute the hidden node output h_L
6.   Compute ξ_L by Equations (3) and (4)
7.   If ξ_L ≥ 0
8.     Save w_L, b_L in W, and ξ_L in Ω
9.   Else
10.    Go back to step 4
11.  End If
12.  Repeat steps 4-11 until T_max candidates have been evaluated
13.  If W is not empty
14.    Find the w_L, b_L with the maximum ξ_L in Ω and add the corresponding hidden node
15.    Update H and compute the output weights β* by Equations (13)-(20)
16.    Compute the residual e, set L := L + 1, and reset W, Ω
17.  End If
18. End While
19. Return β* and the hidden node parameters
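The non-inverse solving step of Equations (18)-(20) — forward substitution for K, then back substitution for β* — can be sketched as follows. This is a NumPy illustration with made-up dimensions; `np.linalg.cholesky` stands in for the factorization of Equation (17):

```python
import numpy as np

def solve_via_cholesky(S, b):
    """Solve (S S^T) beta = b without any matrix inversion:
    forward substitution for K in S K = b (Equation (19)), then
    back substitution for beta in S^T beta = K (Equation (20))."""
    n = S.shape[0]
    K = np.zeros(n)
    for i in range(n):                        # forward substitution
        K[i] = (b[i] - S[i, :i] @ K[:i]) / S[i, i]
    beta = np.zeros(n)
    for i in range(n - 1, -1, -1):            # back substitution
        beta[i] = (K[i] - S[i + 1:, i] @ beta[i + 1:]) / S[i, i]
    return beta

rng = np.random.default_rng(4)
H = rng.standard_normal((60, 6))
f = rng.standard_normal(60)
A = H.T @ H + np.eye(6) / 100.0               # illustrative moderator I/C
b = H.T @ f
S = np.linalg.cholesky(A)                     # A = S S^T, as in (16)
beta = solve_via_cholesky(S, b)
```

The two triangular sweeps cost O(L²) each once the factorization is available, which is where the savings over the SVD route come from.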

Computational Complexity Analysis
It can be seen from the above description that the difference between the two methods lies in the calculation of the output weights β*. SCNs obtain the output weights as the product of the M-P generalized inverse matrix and the output f, where the M-P generalized inverse is calculated using the SVD method; therefore, the computational complexity of the output weight calculation in SCNs is about O(L³ + LMd). L-SCNs evaluate the output weights through the positive definite equation and Cholesky decomposition, which involve only simple addition, subtraction, multiplication, and division operations, so the computational complexity is about O(L³/3 + LMd + L²M). Here, M is the number of samples in the training set and d is the output dimension (the number of categories for classification; d = 1 in the regression problem). In summary, the method proposed in this paper has an obvious lightweight advantage when dealing with complex tasks that require a large number of hidden nodes.
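The complexity gap can be observed empirically. The sketch below (illustrative sizes; the moderator value C = 10⁸ is an arbitrary choice) times the SVD-based pseudoinverse against the normal-equation/Cholesky path. Absolute timings depend on the machine and BLAS build, but both paths should return essentially the same weights.

```python
import time
import numpy as np

rng = np.random.default_rng(5)
N, L = 2000, 400                       # many samples, many hidden nodes
H = rng.standard_normal((N, L))        # stand-in hidden layer output matrix
f = rng.standard_normal(N)

t0 = time.perf_counter()
beta_svd = np.linalg.pinv(H) @ f       # SVD-based M-P pseudoinverse
t_svd = time.perf_counter() - t0

t0 = time.perf_counter()
A = H.T @ H + np.eye(L) / 1e8          # positive definite with moderator I/C
S = np.linalg.cholesky(A)              # A = S S^T
K = np.linalg.solve(S, H.T @ f)        # forward solve: S K = H^T f
beta_chol = np.linalg.solve(S.T, K)    # back solve: S^T beta = K
t_chol = time.perf_counter() - t0
```

With these dimensions the Cholesky path is typically several times faster, and the gap grows as L increases, consistent with the O(L³) versus O(L³/3) terms above.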

Experiments
In this section, the performance of L-SCNs is evaluated and compared with original SCNs and IRVFLNs on some benchmark data sets. The sigmoid function is used as the activation function. All experiments on L-SCNs, SCNs, and IRVFLNs are performed in the MATLAB 2019b environment running on a Windows personal computer with an Intel(R) Xeon(R) E3-1225 v6 3.31 GHz CPU and 32 GB of RAM.

Data Sets Description
Eight data sets have been used in the experiments, including five real regression problems and three real classification problems, which were collected from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository [29] and the UCI HAR database [30]. These data sets specifically include winequality-white, California, delta_ail, Compactiv, Abalone, Iris, Human Activity Recognition (HAR), and wine. In addition, there is a highly nonlinear benchmark regression function data set [31,32], which is generated by Equation (21). The detailed information of all the data sets is shown in Table 1.

Experimental Setup
In each trial, all samples were randomly divided into training and test data sets. All results reported in the paper are averages over 30 trials on each data set. The specifications of the experimental setup are shown in Table 2, in which ε is the expected error tolerance, T_max is the maximum number of random configurations, L_max is the maximum number of hidden nodes, and γ is the assignment range of the hidden node parameters. The moderator factor C was obtained by cross validation.

Performance Comparison
First of all, the convergence and function fitting performance of IRVFLNs, SCNs, and L-SCNs are evaluated on the highly nonlinear benchmark regression function data set. The results, shown in Table 3, include the training time, training error, testing error, and number of hidden nodes, and the best experimental results are highlighted. It can be seen from Table 3 that the modeling time of L-SCNs is 18.8% and 66.69% lower than that of SCNs and IRVFLNs, respectively. The training and testing errors of L-SCNs also have obvious advantages, especially compared with IRVFLNs. In addition, compared with IRVFLNs, SCNs and L-SCNs save 36.8% and 43.83% of the hidden nodes, respectively. This is mainly because the hidden node parameter selection of the supervisory mechanism improves the compactness of the model while ensuring high model performance. Moreover, SCNs can only obtain approximate solutions when using the M-P generalized inverse to calculate the output weights, whereas the output weight evaluation method of L-SCNs yields exact solutions; therefore, L-SCNs are superior to SCNs in compactness and model performance. To further analyze the convergence and fitting ability of IRVFLNs, SCNs, and L-SCNs, this paper draws a convergence curve and a fitting curve, as shown in Figure 2. It can be seen from the convergence curve that IRVFLNs used up the 100 preset hidden nodes but still did not meet the expected error tolerance; in particular, it is difficult to improve the convergence of IRVFLNs by adding more nodes after the number of hidden nodes reaches 51. The convergence of both SCNs and L-SCNs meets the expected error tolerance, and L-SCNs converge faster: they use only 19 nodes to reduce the residual to 0.02 and only 56.17 nodes on average to meet the expectation. Compared with SCNs, L-SCNs save 11.12% of the nodes. This shows that L-SCNs modeling is faster and the structure of the built model is more compact.
The fitting curve shows that, among IRVFLNs, SCNs, and L-SCNs, the model built by IRVFLNs has the worst data fitting ability, while the models built by SCNs and L-SCNs have similar fitting capabilities. The experimental results on the real-world data sets are summarized in Tables 4 and 5. Table 4 gives the number of hidden nodes, the training time, the training error, and the testing error on the regression data sets. It can be seen from Table 4 that, for the Abalone data set, the training and testing errors of IRVFLNs are the worst and all 100 hidden nodes are used up, which is also the main reason for its longest modeling time. L-SCNs and SCNs achieve similar training and testing errors, but L-SCNs save 73.18% of the hidden nodes and 25.77% of the modeling time, respectively. On the Compactiv data set, IRVFLNs again use up all hidden nodes and achieve the worst training and testing errors, and the results of L-SCNs and SCNs are consistent with those on the Abalone data set. Comparing the experimental results on winequality-white, California, and delta_ail, it can be seen that: (1) when consuming the same number of hidden nodes, IRVFLNs modeling is the fastest, but the model performance is the worst; (2) the number of hidden nodes required for L-SCNs modeling is far less than that of the other two algorithms; (3) when the number of hidden nodes is small, L-SCNs have no obvious lightweight advantage. In summary, L-SCNs are superior to IRVFLNs and SCNs in terms of model compactness and modeling time when a large number of hidden nodes is needed. Table 5 shows the numbers of hidden nodes, training times, training errors, and testing errors of IRVFLNs, SCNs, and L-SCNs on the three real classification data sets. As can be seen from Table 5, for the Iris data set, the numbers of hidden nodes required by SCNs and L-SCNs are much less than the 107.2 of IRVFLNs; therefore, SCNs and L-SCNs have lower modeling times.
The main reason behind this result is that the node parameter selection of the supervisory mechanism yields better-quality node parameters, so the model reaches the expected value faster and performs better. Compared with SCNs, L-SCNs save 3.58% of the hidden nodes and 5.35% of the training time, respectively. At the same time, L-SCNs achieve the best testing error; in particular, the training errors of IRVFLNs, SCNs, and L-SCNs on the Iris data set are the same. For the HAR data set, compared with the other two algorithms, L-SCNs save 74.85% and 4.91% of the hidden nodes while saving 73.01% and 40.04% of the training time, respectively. In addition, the number of hidden nodes of IRVFLNs reaches the maximum value of 1000, yet the model performance is the worst. For the wine data set, L-SCNs and SCNs still have obvious advantages in the number of hidden nodes, training time, and testing error; compared with the other two algorithms, L-SCNs construct the best-performing model with the fewest hidden nodes (90) and the minimum training time (0.1722 s). In summary, L-SCNs have obvious merits in training efficiency and model compactness for classification tasks; therefore, L-SCNs constitute a lightweight algorithm. Through the analysis of Tables 1 and 5, it can be seen that the HAR and wine data sets have more samples and features than the Iris data set, especially HAR, and the experimental results show that the lightweight advantage of L-SCNs is more obvious on these data sets. Therefore, L-SCNs are suitable for dealing with large-scale data problems.
To further verify the lightweight advantage of L-SCNs, the modeling times of SCNs and L-SCNs on the HAR data set are plotted against the number of hidden nodes in Figure 3, with both methods using the same number of hidden nodes. It can be seen that before the number of hidden nodes reaches 100, the modeling times of SCNs and L-SCNs are almost the same. However, after 100 hidden nodes, as the number of hidden nodes increases, the lightweight advantage of L-SCNs becomes more and more obvious; when 500 nodes are reached, the gap between SCNs and L-SCNs widens to 36.66%. This shows that, when dealing with modeling tasks that require a large number of hidden nodes, the proposed L-SCNs can effectively reduce the modeling complexity. In addition, we compared the Cholesky decomposition approach with other methods, including QR decomposition, LDL decomposition, and SVD decomposition; the detailed results of all these approaches are shown in Table 6. It can be found from Table 6 that Cholesky decomposition is slightly better than QR and LDL decomposition, and that, as the number of nodes increases, Cholesky decomposition has an increasingly obvious lightweight advantage over SVD decomposition. The main reason for this result is that the computational complexities of QR and LDL decomposition are similar to that of Cholesky decomposition, whereas the computational complexity of SVD far exceeds these three methods. This clearly demonstrates the lightweight nature of Cholesky decomposition.
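The decomposition comparison can be reproduced in outline with NumPy (which provides Cholesky, QR, and SVD routines, though not LDL); the dimensions and the moderator value below are illustrative assumptions, and all three paths should agree on the output weights:

```python
import numpy as np

rng = np.random.default_rng(6)
N, L = 1000, 200
H = rng.standard_normal((N, L))        # stand-in hidden layer output matrix
f = rng.standard_normal(N)

# Cholesky path: factor A = S S^T, then two triangular solves.
A = H.T @ H + np.eye(L) / 1e8          # moderator I/C keeps A positive definite
S = np.linalg.cholesky(A)
beta_chol = np.linalg.solve(S.T, np.linalg.solve(S, H.T @ f))

# QR path: H = Q R, then solve the triangular system R beta = Q^T f.
Q, R = np.linalg.qr(H)
beta_qr = np.linalg.solve(R, Q.T @ f)

# SVD path: M-P pseudoinverse of H.
beta_svd = np.linalg.pinv(H) @ f
```

Timing these three paths over a growing L reproduces the trend reported in Table 6: the Cholesky and QR costs stay comparable while the SVD path falls behind.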

Conclusions
This work is motivated by the time-consuming calculation of the output weights at each addition of a hidden node. Lightweight stochastic configuration networks (L-SCNs) are developed by employing a non-inverse calculation method to solve this problem. In L-SCNs, a positive definite equation is first proposed based on normal equation theory to take the place of the over-determined equation and avoid the use of the M-P generalized inverse. Second, the Cholesky decomposition method with low computational complexity is used to solve the positive definite equation and obtain the output weights. The proposed L-SCNs have been evaluated on several benchmark data sets, and the experimental results show that L-SCNs not only resolve the high complexity of calculating the output weights but also improve the compactness of the model structure. In addition, the comparison with IRVFLNs and SCNs shows that L-SCNs have obvious lightweight advantages. Therefore, L-SCNs are particularly suitable for complex modeling tasks that usually require a large number of hidden nodes to build an enormous network.

Conflicts of Interest:
The authors declare no conflict of interest.