Local Coupled Extreme Learning Machine Based on Particle Swarm Optimization

We developed a new intelligent optimization strategy for the local coupled extreme learning machine (LC-ELM). In this method, the weights and biases between the input layer and the hidden layer, as well as the addresses and radiuses of the local coupled parameters, are determined and optimized by the particle swarm optimization (PSO) algorithm. Simulation results on regression and classification benchmark problems show that the proposed algorithm, called LC-PSO-ELM, has improved generalization performance and robustness compared with the extreme learning machine (ELM), the LC-ELM and the extreme learning machine based on particle swarm optimization (PSO-ELM), while using the same network size or a more compact network configuration.


Introduction
The mathematical model of single-hidden-layer feed-forward neural networks (SLFNs) has been widely used in many domains because of its ability to approximate strongly nonlinear input-output mappings. However, traditional learning methods are usually much slower than required, and few faster learning algorithms for SLFNs have been developed [1]. In 2006, a novel learning algorithm for SLFNs called the extreme learning machine (ELM) [1,2] was presented by Huang et al. to decrease the training time of SLFNs.
Unlike the existing learning algorithms for SLFNs, the ELM chooses the weights and biases between the input layer and the hidden layer randomly and then determines the weights between the hidden layer and the output layer by ordinary least squares. The ELM learning algorithm offers fast learning speed and good generalization performance with little human intervention, which makes it applicable to many areas, such as stock prediction [3], image classification [4] and fault diagnosis [5].
In the ELM, the number of hidden neurons is required to be greater than or equal to the number of the training samples so as to guarantee the convergence of the algorithm. Therefore, there will be a large number of input-hidden weights when the number of input neurons is large [6], which may reduce the generalization performance of SLFNs. The original ELM model has been equipped with various extensions to make it more suitable and efficient for specific applications [7]. For example, based on the structure of the local coupled feed-forward neural network (LCFNN) [8,9] and the learning mechanism of the ELM algorithm, the local coupled extreme learning machine (LC-ELM) learning algorithm was proposed by Qu in 2014 [10]. The algorithm reduces the complexity of searching for the weights between the input layer and the hidden layer by assigning addresses to the hidden neurons [10]. The advantage of the LC-ELM for image watermarking was examined by Mehta et al. [11].
In the LC-ELM learning algorithm, the addresses and radiuses are generally preset empirically or randomly. Thus, those parameters might not be optimal for the LC-ELM, and the algorithm may yield an inappropriate underlying model. In 2015, Qu et al. presented an evolutionary local coupled extreme learning machine (ELC-ELM). In the ELC-ELM, the differential evolution (DE) algorithm was used to optimize the addresses and the radiuses of the fuzzy membership functions in the hidden neurons to improve the generalization performance [12]. However, it should be noted that the hidden biases and input weights in the ELC-ELM were still set randomly.
The DE algorithm has good global convergence properties because it utilizes the differential information of the population. However, the same mechanism can also make the DE unstable, and the algorithm may become trapped in local optima [13,14]. Moreover, three parameters of the DE algorithm must be controlled manually [15]. In 1995, the particle swarm optimization (PSO) algorithm was presented by Kennedy and Eberhart [16], and it has been used in many optimization fields because it can converge to the global minimum quickly. Compared with other stochastic optimization techniques, the advantages of the PSO algorithm are that it is easy to implement in practice and that few parameters need to be adjusted [17,18]. The PSO algorithm and its improved variants, such as APSO (adaptive PSO) and PSOGSA (the hybrid of PSO and the gravitational search algorithm), have been used to select the optimal parameters between the input layer and the hidden layer (input weights and biases) of the ELM [19,20].
Therefore, in order to overcome the limitations of the DE, a new method combining the LC-ELM with an improved PSO, called LC-PSO-ELM, is proposed in this paper. In the proposed algorithm, the improved PSO algorithm is used to optimize the address and window radius of the local coupled parameters. In addition, the input weights and hidden layer biases of the ELM are also optimized to further improve the generalization performance of the LC-ELM, and the Moore-Penrose (MP) generalized inverse is used to calculate the weights between the hidden layer and the output layer analytically. To demonstrate the advantage of the proposed algorithm, we compared its computer simulation results with those of the ELM, LC-ELM and PSO-ELM algorithms. The comparison shows that the newly developed algorithm exhibits improved generalization performance, with the highest accuracy in most cases.
The rest of this paper is organized as follows. The local coupled extreme learning machine (LC-ELM) and the improved particle swarm optimization algorithm are given in Section 2. The local coupled extreme learning machine based on the PSO algorithm is introduced in Section 3. Section 4 presents the simulation results and analysis of the proposed algorithm on regression and classification benchmark problems. Finally, the conclusions are summarized in Section 5.

Local Coupled Extreme Learning Machine
The ELM learning algorithm is a simple, fast and efficient method. To further improve the generalization performance of the ELM, the LC-ELM learning algorithm was proposed by Qu [10], who investigated its efficiency on classification and regression benchmark problems.
In the LC-ELM, the use of the fuzzy membership function $F(\cdot)$ and the similarity relation $S(x, d_i)$ reduces the complexity of the weight search space, and the generalization performance improves correspondingly with a simple network structure. The mathematical formulation of the LC-ELM is as follows. Consider M arbitrary distinct examples $(x_i, t_i)$, $i = 1, \ldots, M$, where $x_i$ is the input and $t_i$ is the expected output. The output of the hidden layer neurons $g(w_i \cdot x_j + b_i)$ of the ELM is modified with the help of the fuzzy membership function to $g(w_i \cdot x_j + b_i) F(S(x_j, d_i))$. Therefore, the network output of the LC-ELM with N hidden neurons is mathematically modeled by

$$o_j = \sum_{i=1}^{N} \beta_i \, g(w_i \cdot x_j + b_i) \, F(S(x_j, d_i)), \quad j = 1, \ldots, M, \tag{1}$$

where $g(\cdot)$ denotes the activation function of the ELM, which can be not only a sigmoid function but also another function such as sin, cos or cubic. $\beta_i$ denotes the weight vector connecting the ith hidden neuron and the output neurons, $w_i$ is the weight vector connecting the ith hidden neuron and the input neurons, $b_i$ is the bias of the ith hidden neuron and $d_i$ is the address of the ith hidden node.
In the LC-ELM learning algorithm, the similarity relation $S(x, d_i)$ is the distance between the input x and the ith hidden node with address $d_i$. Various forms of the fuzzy membership function $F(\cdot)$, such as the Gaussian function, the sigmoid function and the reverse sigmoid function [21,22], are utilized. In addition, $F(\cdot)$ contains a radius parameter r that adjusts the width of the activation area; like the address parameter d, it is a parameter to be optimized. Combining the structure of the LCFNN with the learning mechanism of the ELM, the LC-ELM is also a three-step learning algorithm, and the parameters of the network (the input weights w and biases b between the input layer and the hidden layer, and the addresses d of the hidden neurons) are assigned randomly, as in the ELM [10].
The statement that the standard LC-ELM learning algorithm can approximate these M examples with zero error means that $\sum_{j=1}^{M} \|o_j - t_j\| = 0$, where $o_j$ is the actual output of the LC-ELM; i.e., the corresponding relation is defined by

$$\sum_{i=1}^{N} \beta_i \, g(w_i \cdot x_j + b_i) \, F(S(x_j, d_i)) = t_j, \quad j = 1, \ldots, M. \tag{2}$$

The above M equations can be written compactly as a linear system:

$$H \beta = T, \tag{3}$$

where H is the output matrix of the hidden layer and can be expressed as

$$H = \begin{bmatrix} h_{11} F(S(x_1, d_1)) & \cdots & h_{1N} F(S(x_1, d_N)) \\ \vdots & \ddots & \vdots \\ h_{M1} F(S(x_M, d_1)) & \cdots & h_{MN} F(S(x_M, d_N)) \end{bmatrix}_{M \times N}. \tag{4}$$

In Equation (4), $h_{ji} = g(w_i \cdot x_j + b_i)$ denotes the output of the ith hidden neuron with respect to $x_j$. $\beta = [\beta_1, \ldots, \beta_N]^T_{N \times q}$ is the matrix of the output weights, and $\beta_i$ denotes the weight vector connecting the ith hidden node and the output layer. $T = [t_1, \ldots, t_M]^T_{M \times q}$ is the matrix of the targets of the LC-ELM.
The smallest norm least squares solution of Equation (3) is

$$\beta = H^{+} T, \tag{5}$$

where $H^{+}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix H [23]. Based on the above discussion, the LC-ELM algorithm can be summarized in Algorithm 1.

Algorithm 1. The algorithm flow of LC-ELM
(1) Input weights w, hidden bias b and the node address d are allocated randomly.
(2) The output matrix of the hidden layer H is computed using Equation (4).
(3) Calculate the output weights β between the hidden layer and the output layer based on Equation (5): $\beta = H^{+} T$.
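
The three steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation: the function names `hidden_matrix`, `lc_elm_fit` and `lc_elm_predict` are hypothetical, S is taken to be the Euclidean distance, F is taken to be a reversed sigmoid with a single shared radius `r`, and all random parameters are drawn uniformly from [-1, 1].

```python
import numpy as np

def hidden_matrix(X, W, b, D, r):
    """Hidden-layer output matrix H of Equation (4), shape (M, N)."""
    G = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # sigmoid g(w_i . x_j + b_i)
    # Assumed similarity S: Euclidean distance between each input and each address.
    S = np.linalg.norm(X[:, None, :] - D[None, :, :], axis=2)
    F = 2.0 / (1.0 + np.exp(S / r))            # assumed membership: reversed sigmoid
    return G * F

def lc_elm_fit(X, T, n_hidden, r=1.0, seed=0):
    """Steps (1)-(3) of Algorithm 1 under the assumptions stated above."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.uniform(-1.0, 1.0, (n_hidden, n_in))   # step (1): random input weights
    b = rng.uniform(-1.0, 1.0, n_hidden)           # step (1): random hidden biases
    D = rng.uniform(-1.0, 1.0, (n_hidden, n_in))   # step (1): random node addresses
    H = hidden_matrix(X, W, b, D, r)               # step (2): Equation (4)
    beta = np.linalg.pinv(H) @ T                   # step (3): Equation (5)
    return W, b, D, r, beta

def lc_elm_predict(X, model):
    """Network output of Equation (1) for every row of X."""
    W, b, D, r, beta = model
    return hidden_matrix(X, W, b, D, r) @ beta
```

Here `X` is an M×n array of inputs and `T` an M×q array of targets. Because the output weights come from the pseudoinverse, the training residual can never exceed the norm of the targets themselves.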

Particle Swarm Optimization
In 1995, a particle swarm methodology, called the PSO algorithm, was proposed for nonlinear function optimization by Kennedy and Eberhart [16]. It is a population-based, heuristic optimization algorithm. The PSO algorithm is simple, easy to implement and has a fast convergence rate. It has been widely applied in scientific research and engineering [20].
As a swarm-based algorithm, the PSO algorithm lets its particles move through the search space according to the best positions found by themselves and by their neighbors. The initial values of the particles in the population are set randomly [24].
In the PSO algorithm, suppose D is the dimension of the search space and N is the number of particles. Then, $x_i^k$ and $v_i^k$ denote the current position and the current velocity of the ith particle at iteration k, respectively [25]. The new velocity and position of the particle at the next iteration are described as:

$$v_i^{k+1} = w \, v_i^k + c_1 \, rand() \, (p_i^k - x_i^k) + c_2 \, rand() \, (g^k - x_i^k), \tag{6}$$

$$x_i^{k+1} = x_i^k + v_i^{k+1}, \tag{7}$$

where w denotes the inertia weight, $c_1$ and $c_2$ are the acceleration coefficients, and rand() denotes a random number in the interval [0, 1]. $p_i^k$ is the best position found so far by the ith particle, and $g^k$ represents the global best position, i.e., the best position found so far in the population.
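
The velocity and position update described above can be written as a short function. This is a sketch only: the name `pso_step` is hypothetical, and drawing an independent random number for every dimension is an assumption (a common convention) that the text does not specify.

```python
import random

def pso_step(x, v, p_best, g_best, w, c1, c2, rng=random):
    """One velocity and position update for a single particle.

    x, v, p_best and g_best are equal-length lists of floats.
    Assumption: a fresh random number in [0, 1) is drawn per dimension.
    """
    new_v = [
        w * vd                                # inertia term
        + c1 * rng.random() * (pd - xd)       # cognitive term (own best)
        + c2 * rng.random() * (gd - xd)       # social term (global best)
        for xd, vd, pd, gd in zip(x, v, p_best, g_best)
    ]
    new_x = [xd + vd for xd, vd in zip(x, new_v)]
    return new_x, new_v
```

A particle that already sits on both its personal best and the global best, with zero velocity, does not move, which is a quick sanity check for any implementation of this update.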
In the PSO algorithm, the inertia weight w balances the global search and the local search. Therefore, in order to ensure higher exploring ability in the early iterations and fast convergence in the late iterations, w is not a constant but is expressed as a nonlinear function of the iteration number [17,26], in which $w_{max}$ and $w_{min}$ are the initial and terminal values of the inertia weight in the iteration process, respectively. The parameter max_iter is the maximum iteration number of the algorithm and iter is the current iteration number.
In addition, in order to enhance the global search in the early iterations, to encourage the particles to converge to the global optimal solution and to improve the convergence speed in the final iterations [27], the acceleration parameters $c_1$ and $c_2$ are also expressed as functions of the iteration number, in which $c_{1max}$ and $c_{1min}$, $c_{2max}$ and $c_{2min}$ are constants. Based on Equation (6), the searching ability of the cognitive and social components can be changed by changing the values of $c_1$ and $c_2$, which can improve the convergence rate of the PSO algorithm.
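
To make the idea of iteration-dependent parameters concrete, the sketch below shows one possible pair of schedules. The specific forms are assumptions chosen for illustration (a quadratic decay for w, a linearly decreasing $c_1$ and a linearly increasing $c_2$); they are not necessarily the expressions used in this paper, and the default bounds are placeholder values, not the settings of Table 3.

```python
def inertia_weight(it, max_iter, w_max=0.9, w_min=0.4):
    """Assumed nonlinear schedule: decays from w_max to w_min quadratically."""
    frac = it / max_iter
    return w_max - (w_max - w_min) * frac ** 2

def acceleration_coefficients(it, max_iter,
                              c1_max=2.5, c1_min=0.5,
                              c2_max=2.5, c2_min=0.5):
    """Assumed linear schedules: c1 shrinks while c2 grows over the run."""
    frac = it / max_iter
    c1 = c1_max - (c1_max - c1_min) * frac   # cognitive pull weakens
    c2 = c2_min + (c2_max - c2_min) * frac   # social pull strengthens
    return c1, c2
```

With schedules of this shape, early iterations favor exploration (large w and $c_1$), while late iterations pull the swarm toward the global best (small w, large $c_2$), matching the motivation given above.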

Local Coupled Extreme Learning Machine Based on the PSO Algorithm
Based on the optimization technique of the above PSO algorithm with self-adaptive parameters w and c, the parameter values w, b, d and r of the LC-ELM are optimized for improving the generalization performance in this work.
In the LC-ELM learning algorithm, the decoupling of the input layer and the hidden layer is determined by the address parameter d and the radius parameter r. However, these parameter values are determined randomly. In other words, they might not be suitable for the algorithm, resulting in poor performance. In addition, the hidden biases b and input weights w are also set randomly in the LC-ELM. Therefore, to improve the performance of the LC-ELM algorithm, the four parameters (w, b, d, r) of the LC-ELM are optimized simultaneously by the above adaptive PSO algorithm. Once the optimal parameters of the LC-ELM algorithm are established, the output weights between the hidden layer and the output layer of the LC-ELM are determined analytically based on Equation (5) of the ELM. The resulting method is called the LC-PSO-ELM algorithm in this paper.
Therefore, each particle in the search space of the LC-PSO-ELM is composed of the parameter values of the input weights, hidden biases, addresses and radiuses, as defined in Equation (11). Based on the global searching capability of the above PSO algorithm and the universal approximation performance of the LC-ELM learning algorithm, the detailed steps of the LC-PSO-ELM algorithm are described in Algorithm 2. The parameters in the algorithm are defined as follows: the training set consists of the M examples $(x_j, t_j)$, $g(w_i \cdot x_j + b_i)$ is the output function of the hidden neuron, N is the number of hidden neurons, and F and S are the fuzzy membership and similarity functions, respectively. max_iter is the preset maximum learning epoch of the PSO algorithm. $w_{max}$ and $w_{min}$ are the initial and terminal values of the inertia weight in the iterative stage. $c_{max}$ and $c_{min}$ are the initial and final values of the acceleration constants.

Algorithm 2. The algorithm flow of LC-PSO-ELM
(1) Initialization: each particle in the generation is composed of a set of the input weights w, biases b, addresses d and radiuses r, as shown in Equation (11). All components of each particle are initialized randomly in the range from −1 to 1.
(2) Iter = 1
(3) While Iter ≤ max_iter
(4)   (1) Evaluate the fitness function of each particle (the root mean square error for regression problems and the classification accuracy for classification problems).
      (2) Update the velocity and position of each particle based on Equations (6) and (7).
      (3) Iter = Iter + 1
(5) end while
(6) The optimal parameters of the LC-ELM are determined. Then, based on the optimized parameters:
      (1) The output matrix H of the hidden layer is computed based on Equation (4).
      (2) The output weights β are calculated based on Equation (5).
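
The particle described above is a flat vector that has to be split back into (w, b, d, r) before the fitness can be evaluated. The sketch below shows one way to do this. The layout [w | b | d | r], the choice of addresses living in the input space, and one radius per hidden neuron are assumptions made for illustration, and the function names are hypothetical.

```python
import random

def particle_length(n_hidden, n_inputs):
    """Weights (N*n) + biases (N) + addresses (N*n) + radiuses (N)."""
    return n_hidden * (2 * n_inputs + 2)

def random_particle(n_hidden, n_inputs, rng=random):
    """All components initialized uniformly in [-1, 1], as in step (1)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(particle_length(n_hidden, n_inputs))]

def decode_particle(theta, n_hidden, n_inputs):
    """Split a flat particle into (w, b, d, r) under the assumed layout."""
    if len(theta) != particle_length(n_hidden, n_inputs):
        raise ValueError("particle has the wrong length")
    k = n_hidden * n_inputs

    def rows(flat):
        # Reshape a flat list into n_hidden rows of n_inputs values each.
        return [flat[i * n_inputs:(i + 1) * n_inputs] for i in range(n_hidden)]

    w = rows(theta[:k])                          # input weights, one row per neuron
    b = theta[k:k + n_hidden]                    # hidden biases
    d = rows(theta[k + n_hidden:2 * k + n_hidden])  # addresses, one row per neuron
    r = theta[2 * k + n_hidden:]                 # radiuses, one per neuron
    return w, b, d, r
```

Keeping the encoding in one place means the PSO update can treat every particle as a plain vector, while the LC-ELM evaluation only ever sees the decoded parameters.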
Similar to the LC-ELM, the LC-PSO-ELM admits many choices for the composite function $F(S(x, d_i))$ formed from the similarity relation S and the fuzzy membership function F. For example, the similarity relation could be the fuzzy similarity function, the Gaussian kernel or the wave kernel. Likewise, any of the fuzzy membership functions in Equations (12)-(14) can be chosen in the LC-PSO-ELM learning algorithm.
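
As an illustration of one such combination, the sketch below evaluates a single local coupled hidden neuron using the wave kernel and the reversed sigmoid in the forms selected in the simulation section. The function names and the default values of `theta` and `r` are hypothetical, and returning 1 at zero distance is the analytic limit of the kernel rather than something stated in the text.

```python
import math

def wave_kernel(x, y, theta=1.0):
    """S(x, y) = (theta / ||x - y||) * sin(||x - y|| / theta)."""
    dist = math.dist(x, y)
    if dist == 0.0:
        return 1.0  # limit of the expression as ||x - y|| -> 0
    return (theta / dist) * math.sin(dist / theta)

def reversed_sigmoid(s, r=1.0):
    """F(s) = 2 / (1 + exp(s / r))."""
    return 2.0 / (1.0 + math.exp(s / r))

def coupled_hidden_output(x, w, b, d, r=1.0, theta=1.0):
    """g(w . x + b) * F(S(x, d)) with a sigmoid activation g."""
    g = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
    return g * reversed_sigmoid(wave_kernel(x, d, theta), r)
```

Swapping in a different kernel or membership function only requires replacing the corresponding helper, since the hidden output depends on them solely through the product $g \cdot F(S)$.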

Simulations and Performance Verification
In this section, the proposed LC-PSO-ELM learning algorithm is compared with three alternative ELM algorithms, the original ELM, the LC-ELM [10] and the PSO-ELM [17], on four function approximation (regression) and four classification benchmark problems. All experiments are conducted in the MATLAB R16a environment running on a 3.4 GHz CPU with 16 GB of RAM. The parameter specification of the benchmark problems is shown in Table 1. Experimentally well-characterized datasets were chosen to allow a fair comparison [28,29]: the Box and Jenkins gas furnace data were sourced from reference [30], the Calhousing data came from the StatLib dataset [31] and the other datasets were derived from the UCI (University of California, Irvine, CA, USA) Machine Learning Repository [32]. For each dataset, the order of the samples was shuffled randomly, and the data were then divided into a training group and a testing group at a ratio of approximately 70-30. The sizes of the two groups are shown in Table 1. The population size of the PSO algorithm is 200 and the maximum number of iterations is 50. The configurations of the ELM, PSO-ELM, LC-ELM and LC-PSO-ELM are listed in Table 2. For simplicity, RN is the abbreviation for random number and NDRN is the abbreviation for normally distributed random numbers.
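
The shuffle-then-split procedure described above can be sketched as follows. The function name `shuffle_split` and the rounding rule used for the training-set size are assumptions; the actual group sizes are those given in Table 1.

```python
import random

def shuffle_split(samples, train_fraction=0.7, seed=None):
    """Randomly reorder the samples, then cut them into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(samples)          # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))  # assumed rounding rule
    return shuffled[:n_train], shuffled[n_train:]
```

Calling the function with a different seed for every trial reproduces the idea of drawing a fresh 70-30 partition each time.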
As shown in Table 2, the sigmoid function is selected as the activation function of the four learning algorithms. The wave kernel $S(x, y) = (\theta / \|x - y\|) \sin(\|x - y\| / \theta)$ is selected as the similarity function, and the reversed sigmoid function of Equation (13), $F(x) = \frac{2}{1 + \exp(x/r)}$, is selected as the fuzzy membership function in the LC-ELM and the LC-PSO-ELM algorithms.
To make the comparison of the algorithms more convincing, the simulation results averaged over 10 trials (root mean square error (RMSE) for regression benchmarks and classification accuracy for classification benchmarks) are given in the following tables. The training and testing subsets of each of the 10 trials are created anew by randomly choosing samples from the datasets at a 70-30 ratio, and the robustness of the algorithms is compared using the standard deviation (STD) of the 10 trials. The CPU time of training is used to evaluate the computational complexity of the algorithms. The testing error and the CPU time of testing are used to evaluate the generalization performance and the application value of the algorithms, respectively. In all of the tables of simulation results, values in bold represent the comparatively best value among the algorithms. The control parameters of the PSO used in the PSO-ELM and LC-PSO-ELM algorithms are listed in Table 3.
Besides the input weights, hidden biases, address parameter and radius parameter (w, b, d, r), the generalization performance of the algorithms is affected mainly by the number of hidden nodes (neurons). To simplify the analysis and comparison, all of the figures in this paper that illustrate the generalization curves of the different algorithms for different numbers of hidden neurons, on both function approximation and classification problems, show the simulation results of one run of the experiments. As shown in Figure 1, in the function approximation problems, as the number of hidden nodes increases from one to some determined value, the testing RMSE of the algorithms first decreases rapidly and then stabilizes around a fluctuating value, except for the LC-ELM learning algorithm. From the figures, we can also conclude that the proposed LC-PSO-ELM algorithm has a lower testing RMSE in most cases, which means that its generalization performance is better than that of the other algorithms in one run. Figure 2 shows the generalization curves of the classification problems in one run of the experiments. The testing classification accuracy gradually increases with the number of hidden neurons, and the curves also show the superiority of the proposed algorithm and the instability of the LC-ELM algorithm in one run.
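
The evaluation protocol described above (one score per trial, then the mean and STD over the trials) can be sketched with the standard library. The function names are hypothetical, and using the sample standard deviation rather than the population one is an assumption, since the text does not say which was reported.

```python
import math
import statistics

def rmse(predicted, target):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))

def accuracy(predicted_labels, true_labels):
    """Fraction of correctly classified samples."""
    hits = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    return hits / len(true_labels)

def summarize_trials(scores):
    """Mean and (assumed sample) standard deviation over repeated trials."""
    return statistics.mean(scores), statistics.stdev(scores)
```

A small STD over the trials indicates that the score changes little from one random partition to the next, which is how robustness is read off the result tables.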
For the sake of comparison, based on the generalization curves of the different algorithms for different numbers of hidden neurons on the function approximation and classification problems, the number of hidden neurons selected for the proposed algorithm is equal to or less than that of the other algorithms. Meanwhile, a number of hidden neurons that gives each algorithm good generalization performance is also considered in the selection process. Finally, the number of hidden neurons in the algorithms for different benchmark problems is shown in Table 4.

Performance Comparison of Regression Benchmark Problems
This section compares the original ELM, LC-ELM, PSO-ELM and LC-PSO-ELM algorithms on the function approximation datasets. The average simulation results of 10 experiments are shown in Tables 5 and 6. From these tables, we can see that the proposed algorithm consumes much more training time than the other algorithms, which means that the adaptive PSO algorithm needs more time to search for the global optimal solution of the parameters (w, b, d, r) in the LC-PSO-ELM algorithm.
Although the training error of the proposed algorithm is higher than that of the other algorithms on the Autompg problem, this paper focuses on improved generalization performance. The testing time of all four algorithms is almost equivalent, and the proposed algorithm achieves better generalization performance with fewer parameters and a compact network configuration, which shows that it has good generalization value and practical applicability.
Moreover, the proposed LC-PSO-ELM and the PSO-ELM learning algorithms have relatively small STD values in the experiments, which means that the algorithms perform stably when their parameters are optimized by the PSO algorithm, although searching for the optimal parameters takes considerable time in the training process.
Except on the Autompg problem, the STD values of the LC-ELM are larger than those of the ELM, PSO-ELM and LC-PSO-ELM algorithms. The results show that the LC-ELM is the most unstable of the four learning algorithms, which agrees with the simulation results in Figures 1 and 2.

Performance Comparison of Classification Problems
The performance comparison among the ELM, LC-ELM, PSO-ELM and LC-PSO-ELM algorithms is given in Tables 7 and 8. The generalization performance on these problems is measured by the testing classification accuracy (testing accuracy). The simulation results in the tables show that the LC-PSO-ELM algorithm is clearly superior to the other algorithms in terms of generalization performance, except on the Iris dataset. From Tables 7 and 8, we can also conclude that the PSO-ELM algorithm and the LC-PSO-ELM algorithm have comparable generalization performance on the Iris dataset. The Iris subgraph of Figure 2 supports the same conclusion: as the number of hidden neurons increases, the testing classification accuracy of the proposed algorithm reaches 100% in 16 of 20 trials, whereas that of the PSO-ELM learning algorithm reaches 100% in 15. Therefore, the preferable performance of the proposed algorithm illustrates that the selection of optimized parameters in these specific problems is suitable for improving the generalization performance of the model. Moreover, the STD value of the PSO-ELM learning algorithm is the smallest of the four algorithms, which suggests that the PSO algorithm finds the global solution more easily when searching over two groups of parameters than over four. In addition, the LC-ELM is again the most unstable learning algorithm in most cases.
In summary, by analyzing all of the obtained results, the following conclusions can be drawn: (1) The generalization performance of the ELM algorithm can be improved by means of parameter optimization based on the PSO. (2) The improvement of the generalization performance is obtained at the expense of the CPU training time needed to search for the optimal parameters of the model. (3) The proposed algorithm has the best generalization ability of the compared algorithms for real applications.

Performance Comparison of LC-ELM Based on Two Different Optimization Methods of DE and PSO
Performance comparison results of the ELC-ELM [12] and the LC-PSO-ELM algorithms on regression and classification problems are listed in Table 9. In the ELC-ELM (evolutionary local coupled extreme learning machine) algorithm, the differential evolution (DE) optimization algorithm is used to improve the generalization performance by optimizing the hidden neuron addresses and the radiuses of the fuzzy membership functions, whereas the input weights and hidden biases are still preset randomly. The function approximation problem of Autompg and the classification problem of the Iris dataset are used to compare the generalization performance of the two algorithms. The number of hidden neurons in the LC-PSO-ELM algorithm is the same as or less than that in the ELC-ELM algorithm. As can be seen from Table 9 (the simulation results of the ELC-ELM algorithm are taken from reference [12]), although the learning speed of the LC-PSO-ELM is slower than that of the ELC-ELM, the generalization performance of the LC-PSO-ELM algorithm, which optimizes four parameter values, is better than that of the ELC-ELM algorithm, which optimizes two.

Performance Comparison of the LC-PSO-ELM Based on Different Fuzzy Membership Functions
The choice of activation (basis) functions of the ELM learning algorithm is problem dependent [33], which means that different fuzzy membership functions in the LC-ELM and its variants will affect the generalization performance. Meanwhile, Yu pointed out that the window function used in the LC-ELM does not satisfy the necessary conditions for a window function required by the LCFNN. As a result, an improper window function may cause the LC-ELM to have the same discriminant as the basic ELM [34]. For this reason, three different fuzzy membership functions, the Gaussian function, the reversed sigmoid function and the reversed tanh function, were used to verify the results. The simulation results of 10 trials with the three different fuzzy membership functions in the LC-PSO-ELM algorithm on regression and classification problems are listed in Table 10.
As can be seen from Table 10, the LC-PSO-ELM learning algorithm has different generalization performance with different fuzzy membership functions, and the better test accuracy is obtained with the reversed sigmoid function.

Conclusions
In this study, a novel learning algorithm, named LC-PSO-ELM, was proposed by combining the frame structure of the LC-ELM with the parameter optimization strategy of the PSO algorithm. The input weights, hidden biases, addresses and radiuses were all adjusted by the PSO to search for the optimal solution of the model.
The performance of the LC-PSO-ELM with different fuzzy membership functions was evaluated on function approximation and classification benchmark problems. Meanwhile, the generalization performance of the four algorithms, ELM, LC-ELM, PSO-ELM and LC-PSO-ELM, was compared, which showed that the proposed algorithm produces better generalization performance in most cases than the other ELM-based approaches.
Although the LC-PSO-ELM can obtain significantly improved generalization performance, its training time is much longer than that of the other algorithms because four parameter values must be optimized. In the future, a parallel training mechanism should be developed for the proposed method to improve its efficiency on very large datasets. Correspondingly, the sensitivity of the algorithm to the chosen activation functions should also be investigated theoretically.

Figure 1. The generalization curves of different algorithms based on different hidden neurons on function approximation problems: (a) Box and Jenkins gas furnace; (b) Autompg; (c) Abalone; (d) Calhousing.

Table 1. Parameters specification of the benchmark problems.

Table 2. Configurations of the ELM (extreme learning machine), PSO-ELM (extreme learning machine based on particle swarm optimization), LC-ELM (local coupled extreme learning machine) and LC-PSO-ELM algorithms.

Table 3. Control parameters used in the different algorithms of the PSO-ELM and LC-PSO-ELM.

Table 4. The number of hidden neurons in the algorithms for different benchmark problems.

Table 5. Performance comparison of different algorithms on regression problems of Box and Jenkins gas furnace data and Autompg.

Table 6. Performance comparison of different algorithms on regression problems of Abalone and Calhousing.

Table 7. Performance comparison of different algorithms on classification problems of Wine and Iris.

Table 8. Performance comparison of different algorithms on classification problems of Diabetes and Satimage.

Table 9. Performance comparison of the ELC-ELM and LC-PSO-ELM algorithms.

Table 10. Performance comparison of different fuzzy membership functions in the LC-PSO-ELM algorithm on regression or classification problems.