1. Introduction
The mathematical model of single-hidden-layer feed-forward neural networks (SLFNs) has been widely used in many domains because of its ability to approximate strongly nonlinear input-output mappings. However, traditional learning methods are usually much slower than required, and few faster learning algorithms for SLFNs have been developed [1]. In 2006, a novel learning algorithm for SLFNs called the extreme learning machine (ELM) [1,2] was presented by Huang et al. to reduce the training time of SLFNs.
Different from existing learning algorithms for SLFNs, the ELM chooses the weights and biases between the input layer and the hidden layer randomly, and then determines the weights between the hidden layer and the output layer by ordinary least squares. The ELM learning algorithm offers fast learning speed and good generalization performance with little human intervention, which makes it applicable to many areas, such as stock prediction [3], image classification [4] and fault diagnosis [5].
In the ELM, the number of hidden neurons is required to be greater than or equal to the number of training samples so as to guarantee the convergence of the algorithm. Therefore, there will be quite a lot of input-hidden weights when the number of input neurons is large [6], which may reduce the generalization performance of SLFNs. The original ELM model has been equipped with various extensions to make it more suitable and efficient for specific applications [7]. For example, based on the structure of the local coupled feed-forward neural network (LCFNN) [8,9] and the learning mechanism of the ELM algorithm, the local coupled extreme learning machine (LC-ELM) learning algorithm was proposed by Qu in 2014 [10]. The algorithm decreases the search complexity of the weights between the input layer and the hidden layer by assigning addresses to the hidden neurons [10]. The advantage of the LC-ELM for image watermarking was examined by Mehta et al. [11].
In the LC-ELM learning algorithm, the addresses and radiuses are generally preset empirically or randomly. Thus, those parameters might not be optimal for the LC-ELM, and the algorithm may yield an inappropriate underlying model. In 2015, Qu et al. presented the evolutionary local coupled extreme learning machine (ELC-ELM), in which the differential evolution (DE) algorithm is used to optimize the addresses and the radiuses of the fuzzy membership functions in the hidden neurons to improve the generalization performance [12]. However, it should be noted that the hidden biases and input weights in the ELC-ELM are still set randomly.
The DE algorithm has a good global convergence property because it utilizes the differential information of the population. However, this same mechanism can make the performance of DE unstable, and the algorithm may be trapped in local optima [13,14]. Moreover, three parameters of the DE algorithm must be tuned manually [15]. In 1995, the particle swarm optimization (PSO) algorithm was presented by Kennedy and Eberhart [16], and it has been used in many optimization fields since it can converge to the global minimum quickly. Compared with other stochastic optimization techniques, the advantages of the PSO algorithm are that it is easy to implement in practice and few parameters need to be adjusted [17,18]. The PSO algorithm and its improved variants, such as APSO (adaptive PSO) and PSOGSA (the hybrid PSO and gravitational search algorithm), have been used to select the optimal parameters between the input layer and the hidden layer (input weights and biases) of the ELM [19,20].
Therefore, in order to overcome the limitations of the DE, a new method combining the LC-ELM with an improved PSO, called LC-PSO-ELM, is proposed in this paper. In the proposed algorithm, the improved PSO is used to optimize the addresses and window radiuses of the local coupling parameters. In addition, the input weights and hidden layer biases of the ELM are also optimized to further improve the generalization performance of the LC-ELM, and the Moore-Penrose (MP) generalized inverse is used to calculate the weights between the hidden layer and the output layer analytically. To demonstrate the superiority of the proposed algorithm, we compared the simulation results of the developed algorithm with those of the ELM, LC-ELM and PSO-ELM algorithms. The comparison results demonstrate that the newly developed algorithm exhibits improved generalization performance with the highest accuracy.
The rest of this paper is organized as follows. The local coupled extreme learning machine (LC-ELM) and the improved particle swarm optimization algorithm are described in Section 2. The local coupled extreme learning machine based on the PSO algorithm is introduced in Section 3. Section 4 presents the simulation results and analysis of the proposed algorithm on regression and classification benchmark problems. Finally, the conclusions are summarized in Section 5.
2. Theoretical Background
2.1. Local Coupled Extreme Learning Machine
The ELM learning algorithm is a simple, fast and efficient method. To further improve the generalization performance of the ELM, the LC-ELM learning algorithm was proposed by Qu [10], in which the efficiency of the LC-ELM on classification and regression benchmark problems was investigated.
In the LC-ELM, due to the utilization of the fuzzy membership function $\mu(\cdot)$ and the similarity relation $d(\cdot,\cdot)$, the complexity of the weight searching space is reduced and the generalization performance is correspondingly improved owing to the simpler network structure. The mathematical formulation of the LC-ELM is presented as follows:
Consider $N$ arbitrary distinct examples $(\mathbf{x}_j, \mathbf{t}_j)$, where $\mathbf{x}_j \in \mathbb{R}^n$ is the input and $\mathbf{t}_j \in \mathbb{R}^m$ is the expected output, $j = 1, \dots, N$. The output $g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i)$ of the hidden layer neurons of the ELM is modified with the help of the fuzzy membership function as $\mu(d(\mathbf{x}_j, \mathbf{a}_i)) \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i)$. Therefore, the network output of the LC-ELM with $L$ hidden neurons is mathematically modeled by

$$f(\mathbf{x}_j) = \sum_{i=1}^{L} \boldsymbol{\beta}_i \, \mu\big(d(\mathbf{x}_j, \mathbf{a}_i)\big) \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i), \quad j = 1, \dots, N, \qquad (1)$$

where $g(\cdot)$ denotes the activation function of the ELM, which can be not only the sigmoid function but also other functions such as sin, cos, cubic, etc.; $\boldsymbol{\beta}_i$ denotes the weight vector connecting the $i$th hidden neuron and the output neurons; $\mathbf{w}_i$ is the weight vector connecting the $i$th hidden neuron and the input neurons; $b_i$ is the bias of the $i$th hidden neuron; and $\mathbf{a}_i$ is the address of the $i$th hidden node.
In the LC-ELM learning algorithm, the similarity relation $d(\mathbf{x}_j, \mathbf{a}_i)$ is the distance between the input $\mathbf{x}_j$ and the $i$th hidden node with address $\mathbf{a}_i$. Various forms of the fuzzy membership function $\mu(\cdot)$, such as the Gaussian function, sigmoid function and reversed sigmoid function [21,22], can be utilized. In addition, an underlying radius parameter $\sigma_i$ is kept in $\mu(\cdot)$ for adjusting the width of the activation area; it is an optimized parameter in the same way as the address parameter $\mathbf{a}_i$. Combining the structure of the LCFNN with the learning mechanism of the ELM, the LC-ELM is also a three-step learning algorithm, and the network parameters (the input weights $\mathbf{w}_i$ and biases $b_i$ between the input layer and the hidden layer, and the addresses $\mathbf{a}_i$ of the hidden neurons) are assigned randomly, the same as in the ELM [10].
The standard LC-ELM learning algorithm can approximate these $N$ examples with zero error, which means $\sum_{j=1}^{N} \|\mathbf{o}_j - \mathbf{t}_j\| = 0$, where $\mathbf{o}_j$ is the actual output of the LC-ELM; i.e., the corresponding relation is defined by

$$\sum_{i=1}^{L} \boldsymbol{\beta}_i \, \mu\big(d(\mathbf{x}_j, \mathbf{a}_i)\big) \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{t}_j, \quad j = 1, \dots, N. \qquad (2)$$

The above $N$ equations can be written compactly as a linear system:

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T}, \qquad (3)$$

where $\mathbf{H}$ is the output matrix of the hidden layer and can be expressed as

$$\mathbf{H} = \begin{bmatrix} h_1(\mathbf{x}_1) & \cdots & h_L(\mathbf{x}_1) \\ \vdots & \ddots & \vdots \\ h_1(\mathbf{x}_N) & \cdots & h_L(\mathbf{x}_N) \end{bmatrix}_{N \times L}. \qquad (4)$$

In the above Equation (4), $h_i(\mathbf{x}_j) = \mu(d(\mathbf{x}_j, \mathbf{a}_i)) \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i)$ denotes the output of the $i$th hidden neuron with respect to $\mathbf{x}_j$; $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \dots, \boldsymbol{\beta}_L]^{\mathrm{T}}$ is the matrix of the output weights, and $\boldsymbol{\beta}_i$ denotes the weight vector connecting the $i$th hidden node and the output layer; $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_N]^{\mathrm{T}}$ is the matrix of the targets of the LC-ELM.
The smallest norm least squares solution of Equation (3) is

$$\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T}, \qquad (5)$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $\mathbf{H}$ [23].
Based on the above discussion, the LC-ELM algorithm can be summarized in Algorithm 1.
Algorithm 1. The algorithm flow of the LC-ELM |
(1) The input weights $\mathbf{w}_i$, hidden biases $b_i$ and the node addresses $\mathbf{a}_i$ are allocated randomly. |
(2) The output matrix $\mathbf{H}$ of the hidden layer is computed using Equation (4). |
(3) Calculate the output weights between the hidden layer and the output layer based on Equation (5): $\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T}$. |
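As an illustration of Algorithm 1, the following is a minimal NumPy sketch of LC-ELM training. It assumes a sigmoid activation, the Euclidean distance as the similarity relation and a Gaussian fuzzy membership function; these choices and all names are illustrative, not taken from the original implementation.

```python
import numpy as np

def lc_elm_train(X, T, L, sigma=1.0, rng=np.random.default_rng(0)):
    """Minimal LC-ELM sketch: X is (N, n), T is (N, m), L hidden neurons.

    Assumptions (illustrative): sigmoid activation, Euclidean distance as
    the similarity relation and a Gaussian fuzzy membership function.
    """
    N, n = X.shape
    W = rng.uniform(-1, 1, (L, n))   # input weights, step (1): random
    b = rng.uniform(-1, 1, L)        # hidden biases, step (1): random
    A = rng.uniform(-1, 1, (L, n))   # hidden-node addresses, step (1): random

    # Step (2): hidden layer output matrix H (Equation (4)).
    d = np.linalg.norm(X[:, None, :] - A[None, :, :], axis=2)  # similarity relation
    mu = np.exp(-(d / sigma) ** 2)                             # Gaussian membership
    g = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))                   # sigmoid activation
    H = mu * g

    # Step (3): output weights via the Moore-Penrose pseudoinverse (Equation (5)).
    beta = np.linalg.pinv(H) @ T
    return (W, b, A, sigma), beta
```

Prediction on new inputs would reuse the same membership-times-activation construction of $\mathbf{H}$ with the stored parameters and multiply by `beta`.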
2.2. Particle Swarm Optimization
In 1995, a particle swarm methodology was proposed for nonlinear function optimization by Kennedy and Eberhart [16]; it came to be called the PSO algorithm. It is a population-based, heuristic optimization algorithm. The PSO algorithm is simple, easy to implement and has a fast convergence rate. It has been widely applied in scientific research and engineering applications [20].
As a swarm-based algorithm, the particles of the PSO algorithm fly through the search space guided by the best positions found so far by themselves and by their neighbors. The initial values of the particles in the population are set randomly [24].
In the PSO algorithm, suppose $D$ is the dimension of the search space and $S$ is the number of particles. Then, $\mathbf{x}_i(t) = (x_{i1}, x_{i2}, \dots, x_{iD})$ and $\mathbf{v}_i(t) = (v_{i1}, v_{i2}, \dots, v_{iD})$ denote the current position and the current velocity of the $i$th particle at iteration $t$, respectively [25]. Therefore, the new velocity and the particle position at the next iteration are described as:

$$v_{id}(t+1) = \omega v_{id}(t) + c_1 r_1 \big(p_{id}(t) - x_{id}(t)\big) + c_2 r_2 \big(p_{gd}(t) - x_{id}(t)\big), \qquad (6)$$

$$x_{id}(t+1) = x_{id}(t) + v_{id}(t+1), \qquad (7)$$

where $\omega$ denotes the inertia weight; $c_1$ and $c_2$ stand for the cognitive and social acceleration coefficients, respectively; $r_1$ and $r_2$ denote values in the interval $(0,1)$ that are set randomly; $\mathbf{p}_i = (p_{i1}, \dots, p_{iD})$ is the best position of the $i$th particle found so far, and $\mathbf{p}_g = (p_{g1}, \dots, p_{gD})$ represents the global best position, i.e., the best position found by the population so far.
In the PSO algorithm, the inertia weight $\omega$ plays the role of balancing the global search and the local search. Therefore, in order to ensure a higher exploring ability in the early iterations and a fast convergence speed in the late iterations, $\omega$ is not a constant and can be expressed as a nonlinear (here, quadratically decreasing) function of time [17,26]:

$$\omega(t) = \omega_{\mathrm{end}} + (\omega_{\mathrm{start}} - \omega_{\mathrm{end}}) \left( \frac{t_{\max} - t}{t_{\max}} \right)^2, \qquad (8)$$

where $\omega_{\mathrm{start}}$ and $\omega_{\mathrm{end}}$ are the initial and terminal values of the inertia weight in the iteration process, respectively; $t_{\max}$ is the maximum iteration number of the algorithm, and $t$ is the current iteration.
In addition, in order to enhance the global search in the early iterations, to encourage the particles to converge to the global optimal solution and to improve the convergence speed in the final iteration period [27], the acceleration parameters $c_1$ and $c_2$ are described as:

$$c_1(t) = c_{1f} + (c_{1i} - c_{1f}) \, \frac{t_{\max} - t}{t_{\max}}, \qquad (9)$$

$$c_2(t) = c_{2f} + (c_{2i} - c_{2f}) \, \frac{t_{\max} - t}{t_{\max}}, \qquad (10)$$

where $c_{1i}$ and $c_{1f}$, $c_{2i}$ and $c_{2f}$ are constants denoting the initial and final values of the two acceleration coefficients. Based on Equation (6), the searching ability of the cognitive and social components can be changed by varying the values of $c_1$ and $c_2$, which can improve the convergence rate of the PSO algorithm.
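The following short sketch illustrates how one PSO iteration updates the velocities, positions and time-varying coefficients under the forms assumed in Equations (6)-(10); the default parameter values shown are illustrative.

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, t, t_max,
             w_start=0.9, w_end=0.4,
             c1_i=2.5, c1_f=0.5, c2_i=0.5, c2_f=2.5,
             rng=np.random.default_rng(0)):
    """One PSO iteration; pos/vel/pbest are (S, D) arrays, gbest is (D,)."""
    # Nonlinearly decreasing inertia weight (Equation (8), assumed form).
    w = w_end + (w_start - w_end) * ((t_max - t) / t_max) ** 2
    # Time-varying acceleration coefficients (Equations (9) and (10)).
    c1 = c1_f + (c1_i - c1_f) * (t_max - t) / t_max
    c2 = c2_f + (c2_i - c2_f) * (t_max - t) / t_max
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    # Velocity and position updates (Equations (6) and (7)).
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    return pos + vel, vel
```

With these schedules, $c_1$ shrinks while $c_2$ grows over the run, shifting the swarm from individual exploration toward collective convergence.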
3. Local Coupled Extreme Learning Machine Based on the PSO Algorithm
Based on the optimization technique of the above PSO algorithm with self-adaptive parameters $\omega$, $c_1$ and $c_2$, the parameter values $\mathbf{w}_i$, $b_i$, $\mathbf{a}_i$ and $\sigma_i$ of the LC-ELM are optimized in this work to improve the generalization performance.
In the LC-ELM learning algorithm, the decoupling of the input layer and the hidden layer is determined by the address parameter $\mathbf{a}_i$ and the radius parameter $\sigma_i$. However, these parameter values are determined randomly; in other words, they might not be suitable for the algorithm, resulting in poor performance. In addition, the hidden biases and input weights are also set randomly in the LC-ELM. Therefore, to improve the performance of the LC-ELM algorithm, the four parameters of the LC-ELM are optimized simultaneously based on the above adaptive PSO algorithm. Once the optimal parameters of the LC-ELM are established, the output weights between the hidden layer and the output layer are determined analytically by Equation (5) of the ELM; the resulting method is called the LC-PSO-ELM algorithm in this paper.
Therefore, each particle in the search space of the LC-PSO-ELM is composed of the parameter values of the input weights, hidden biases, addresses and radiuses, which can be defined as:

$$\theta = [\mathbf{w}_1, \dots, \mathbf{w}_L, b_1, \dots, b_L, \mathbf{a}_1, \dots, \mathbf{a}_L, \sigma_1, \dots, \sigma_L], \qquad (11)$$

where $\mathbf{w}_i \in \mathbb{R}^n$, $b_i \in \mathbb{R}$, $\mathbf{a}_i \in \mathbb{R}^n$ and $\sigma_i \in \mathbb{R}$, $i = 1, \dots, L$.
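A minimal sketch of this encoding follows, assuming the flat-vector layout of Equation (11) with $n$-dimensional input weights and addresses; the layout and helper name are assumptions for illustration.

```python
import numpy as np

def decode_particle(theta, L, n):
    """Split a flat particle vector of length 2*L*n + 2*L into the four
    parameter groups of Equation (11)."""
    W = theta[:L * n].reshape(L, n)                   # input weights
    b = theta[L * n:L * n + L]                        # hidden biases
    A = theta[L * n + L:2 * L * n + L].reshape(L, n)  # hidden-node addresses
    sigma = theta[2 * L * n + L:]                     # membership radiuses (L values)
    return W, b, A, sigma
```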
Based on the global searching capability of the above PSO algorithm and the universal approximation capability of the LC-ELM learning algorithm, the detailed steps of the LC-PSO-ELM algorithm (Algorithm 2) are described as follows:
The parameters in the algorithm are defined as follows: the training set is denoted as $\{(\mathbf{x}_j, \mathbf{t}_j)\}_{j=1}^{N}$; $g(\cdot)$ is the output function of the hidden neurons; $L$ is the number of hidden neurons; $\mu(\cdot)$ and $d(\cdot,\cdot)$ are the fuzzy membership and similarity functions, respectively; $t_{\max}$ is the preset maximum learning epoch of the PSO algorithm; $\omega_{\mathrm{start}}$ and $\omega_{\mathrm{end}}$ are the initial and terminal values of the inertia weight in the iterative stage; and $c_{1i}$, $c_{1f}$, $c_{2i}$ and $c_{2f}$ are the initial and final values of the acceleration coefficients.
Algorithm 2. The algorithm flow of LC-PSO-ELM |
(1) Initialize the population (particles). Each particle in the generation is composed of a set of the input weights $\mathbf{w}_i$, biases $b_i$, addresses $\mathbf{a}_i$ and radiuses $\sigma_i$, as shown in Equation (11). All components of each particle are initialized randomly within a preset range. |
(2) Iter = 1 |
(3) While Iter ≤ $t_{\max}$ |
(4) (1) Evaluate the fitness function of each particle (the root mean square error for regression problems and the classification accuracy for classification problems). (2) Modify the velocity and position of each particle according to Equations (6)–(10). (3) Iter = Iter + 1 |
(5) End while |
(6) The optimal parameters of the LC-ELM are thereby determined. Then, based on the optimized parameters: (1) The output matrix $\mathbf{H}$ of the hidden layer is computed based on Equation (4). (2) The output weight matrix $\hat{\boldsymbol{\beta}}$ is calculated based on Equation (5). |
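Putting the pieces together, the following is a condensed sketch of Algorithm 2 that reuses the hypothetical helpers sketched above (`decode_particle` and `pso_step`). The fitness shown is the training RMSE for the regression case, and everything beyond the paper's stated steps (membership form, initialization range) is an illustrative assumption.

```python
import numpy as np

def hidden_matrix(X, W, b, A, sigma):
    """Hidden layer output H (Equation (4)): membership times activation."""
    d = np.linalg.norm(X[:, None, :] - A[None, :, :], axis=2)
    mu = np.exp(-(d / sigma) ** 2)            # Gaussian membership (assumed)
    g = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))  # sigmoid activation
    return mu * g

def lc_pso_elm(X, T, L, swarm=200, t_max=50, rng=np.random.default_rng(0)):
    n = X.shape[1]
    D = 2 * L * n + 2 * L                     # particle dimension, Equation (11)
    pos = rng.uniform(-1, 1, (swarm, D))      # step (1), assumed range
    vel = np.zeros((swarm, D))

    def fitness(theta):                       # training RMSE as fitness (regression)
        W, b, A, s = decode_particle(theta, L, n)
        H = hidden_matrix(X, W, b, A, s)
        beta = np.linalg.pinv(H) @ T
        return np.sqrt(np.mean((H @ beta - T) ** 2))

    fit = np.array([fitness(p) for p in pos])
    pbest, pbest_fit = pos.copy(), fit.copy()
    gbest = pos[fit.argmin()].copy()
    for t in range(1, t_max + 1):             # steps (2)-(5)
        pos, vel = pso_step(pos, vel, pbest, gbest, t, t_max, rng=rng)
        fit = np.array([fitness(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmin()].copy()
    # Step (6): final analytic solve with the optimized parameters.
    W, b, A, s = decode_particle(gbest, L, n)
    H = hidden_matrix(X, W, b, A, s)
    return (W, b, A, s), np.linalg.pinv(H) @ T
```

The swarm size of 200 and the 50 iterations mirror the experimental settings reported in Section 4.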
Similar to the LC-ELM, the combination of the similarity relation $d(\cdot,\cdot)$ and the fuzzy membership $\mu(\cdot)$ in the LC-PSO-ELM admits many selection strategies. For example, the similarity relation function could be a fuzzy similarity function, a Gaussian kernel, a wave kernel function, etc. Meanwhile, the fuzzy membership functions of Equations (12)–(14) can also be chosen in the LC-PSO-ELM learning algorithm.
4. Simulations and Performance Verification
In this section, the proposed LC-PSO-ELM learning algorithm is compared with three alternative ELM algorithms, the original ELM, the LC-ELM [10] and the PSO-ELM [17], on four function approximation (regression) and four classification benchmark problems. All simulations are conducted in the MATLAB R2016a environment running on a 3.4 GHz CPU with 16 GB RAM. The parameter specifications of the benchmark problems are shown in Table 1. Experimentally well-characterized datasets were chosen for a fair comparison in this paper [28,29]: the Box and Jenkins gas furnace data were sourced from reference [30], the Calhousing data came from the StatLib dataset [31] and the other datasets were derived from the UCI (University of California, Irvine, CA, USA) Machine Learning Repository [32]. For each dataset, the input order of the samples was shuffled randomly and then the data were divided into a training group and a testing group at approximately a 70-30 ratio. The sizes of the two groups are shown in Table 1.
The population size of the PSO algorithm is 200 and the maximum iteration number is 50. The configurations of the ELM, PSO-ELM, LC-ELM and LC-PSO-ELM are listed in Table 2. For simplicity, RN is the abbreviation for random number and NDRN for normally distributed random numbers.
As shown in Table 2, the sigmoid function is selected as the activation function of the four learning algorithms. The wave kernel is selected as the similarity function, and the reversed sigmoid function of Equation (13) is selected as the fuzzy membership function in the LC-ELM and LC-PSO-ELM algorithms.
In order to make the comparison of the different algorithms more convincing, the average simulation results of 10 trials (root mean square error (RMSE) for the regression benchmarks and classification accuracy for the classification problems) are given in the following tables. The training and testing subsets for each of the 10 trials are created by randomly re-splitting the samples of each dataset at a 70-30 ratio, and the robustness of the algorithms is compared using the standard deviation (STD) of the 10 trials. The CPU training time is used to evaluate the computational complexity of the algorithms, while the testing error and the CPU testing time are used to evaluate the generalization performance and practical value of the algorithms, respectively. In all of the tables of simulation results, values in bold represent the comparatively best result among the algorithms. The control parameters of the PSO used in the PSO-ELM and LC-PSO-ELM algorithms are listed in Table 3.
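A compact sketch of this 10-trial evaluation protocol, reusing the hypothetical `lc_pso_elm` and `hidden_matrix` helpers above; the split ratio and trial count follow the text, everything else is illustrative.

```python
import numpy as np

def evaluate(X, T, L, trials=10, rng=np.random.default_rng(0)):
    """Average testing RMSE and STD over random 70-30 re-splits."""
    errors = []
    for _ in range(trials):
        idx = rng.permutation(len(X))   # shuffle the input order
        cut = int(0.7 * len(X))         # 70-30 train/test split
        tr, te = idx[:cut], idx[cut:]
        (W, b, A, s), beta = lc_pso_elm(X[tr], T[tr], L, rng=rng)
        H_te = hidden_matrix(X[te], W, b, A, s)
        errors.append(np.sqrt(np.mean((H_te @ beta - T[te]) ** 2)))
    return np.mean(errors), np.std(errors)
```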
Besides the input weights $\mathbf{w}_i$, hidden biases $b_i$, address parameters $\mathbf{a}_i$ and radius parameters $\sigma_i$, the generalization performance of the algorithms is affected mainly by the number of hidden nodes (neurons). In order to simplify the analysis and comparison, all the figures in this paper illustrating the generalization curves of the different algorithms with different numbers of hidden neurons on function approximation and classification problems show the simulation results of one run of the experiments. As shown in Figure 1, in the function approximation problems, as the number of hidden nodes increases from one to some determined value, the testing RMSE of the algorithms first decreases rapidly, and then the curves become stable with a fluctuating value, except for the LC-ELM learning algorithm. From the figures, we can also conclude that the proposed LC-PSO-ELM algorithm has lower testing RMSE in most cases, which means that the generalization performance of the proposed algorithm is better than that of the other algorithms in one run.
Figure 2 shows the generalization curves for the classification problems in one run of the experiments. The testing classification accuracy gradually increases with the number of hidden neurons, which also shows the superiority of the proposed algorithm and the instability of the LC-ELM algorithm in one run.
For the sake of comparison, based on the generalization curves of the different algorithms with different numbers of hidden neurons on the function approximation and classification problems, the number of hidden neurons selected for the proposed algorithm is equal to or less than that of the other algorithms. Meanwhile, a suitable number of hidden neurons for each algorithm in terms of generalization performance is also considered in the selection process. Finally, the numbers of hidden neurons used in the algorithms for the different benchmark problems are shown in Table 4.
4.1. Performance Comparison of Regression Benchmark Problems
This section shows the comparison results of the four algorithms, the original ELM, LC-ELM, PSO-ELM and LC-PSO-ELM, on the function approximation datasets. The average simulation results of 10 experiments are shown in Table 5 and Table 6. From these tables, we can see that the training time consumed by the proposed algorithm is much longer than that of the other algorithms, which means that the adaptive PSO algorithm needs more time to search for the globally optimal parameters $\mathbf{w}_i$, $b_i$, $\mathbf{a}_i$ and $\sigma_i$ of the LC-PSO-ELM algorithm.
Although the training error of the proposed algorithm is higher than that of the other algorithms on the Autompg problem, the focus of this paper is improved generalization performance. The testing times of all of the algorithms are almost equivalent, and the proposed algorithm achieves better generalization performance with fewer parameters and a compact network configuration, which shows that the proposed algorithm has good generalization and practical applicability.
Moreover, the proposed LC-PSO-ELM and the PSO-ELM learning algorithms have relatively low STD values in the experiments, which means that the algorithms perform stably when their parameters are optimized by the PSO algorithm, although searching for the optimal parameters takes much time in the training process.
Except for the STD value on the Autompg problem, the STD values of the LC-ELM on the other problems are larger than those of the ELM, PSO-ELM and LC-PSO-ELM algorithms. The results show that the LC-ELM is the most unstable of the four learning algorithms, which is also consistent with the simulation results in Figure 1 and Figure 2.
4.2. Performance Comparison of Classification Problems
The performance comparison among the ELM, LC-ELM, PSO-ELM and LC-PSO-ELM algorithms is given in Table 7 and Table 8. The generalization performance on these problems is judged by the testing classification accuracy. The simulation results in the tables show that the LC-PSO-ELM algorithm is clearly superior to the other algorithms in terms of generalization performance, except on the Iris dataset. From Table 7 and Table 8, we can also conclude that the PSO-ELM algorithm and the LC-PSO-ELM algorithm have comparable generalization performance on the Iris dataset. From the corresponding subgraph of Figure 2, the proposed algorithm reaches 100% testing classification accuracy 16 times in 20 trials as the number of hidden neurons increases, while the PSO-ELM learning algorithm reaches 100% 15 times, which supports the same conclusion. Therefore, the preferable performance of the proposed algorithm illustrates that the selection of optimized parameters in these specific problems is suitable for improving the generalization performance of the model.
Moreover, the STD value of the PSO-ELM learning algorithm is the smallest among the four algorithms, which suggests that the PSO obtains the global solution more easily when searching two parameters than when searching four. In addition, the LC-ELM is again the most unstable learning algorithm in most cases.
In summary, by analyzing all of the obtained results, the following conclusions can be drawn:
- (1)
The generalization performance of the ELM algorithm can be improved by means of parameter optimization based on the PSO.
- (2)
The improvement of the generalization performance comes at the expense of CPU training time consumed in searching for the optimal parameters of the model.
- (3)
The proposed algorithm has the best generalization ability for real applications.
4.3. Performance Comparison of LC-ELM Based on Two Different Optimization Methods of DE and PSO
The performance comparison results of the ELC-ELM [12] and the LC-PSO-ELM algorithms on regression and classification problems are listed in Table 9. In the ELC-ELM (evolutionary local coupled extreme learning machine) algorithm, the differential evolution (DE) optimization algorithm is used to improve the generalization performance by optimizing the hidden neuron addresses and the radiuses of the fuzzy membership functions, whereas the input weights and hidden biases are still preset randomly.
The function approximation problem of Autompg and the classification problem of the Iris dataset are used to compare the generalization performance of the two algorithms. The number of hidden neurons in the LC-PSO-ELM algorithm is the same as or less than that in the ELC-ELM algorithm. As can be seen from Table 9 (the simulation results for the ELC-ELM algorithm are taken from reference [12]), although the learning speed of the LC-PSO-ELM is slower than that of the ELC-ELM, the generalization performance of the LC-PSO-ELM algorithm, which optimizes four parameter values, is better than that of the ELC-ELM algorithm, which optimizes two.
4.4. Performance Comparison of the LC-PSO-ELM Based on Different Fuzzy Membership Functions
The choice of the activation (basis) functions of the ELM learning algorithm is problem dependent [33], which means that different fuzzy membership functions in the LC-ELM and LC-PSO-ELM algorithms will affect the generalization performance. Meanwhile, Yu pointed out that the window function used in the LC-ELM does not satisfy the necessary conditions of the window function required by the LCFNN; as a result, an improper window function can cause the LC-ELM to have the same discriminant as the basic ELM [34]. For this reason, three different fuzzy membership functions, the Gaussian function, the reversed sigmoid function and the reversed tanh function, were used to verify the results. The simulation results of 10 trials with the three different fuzzy membership functions in the LC-PSO-ELM algorithm on regression and classification problems are listed in Table 10.
As can be seen from Table 10, the simulation results demonstrate that the LC-PSO-ELM learning algorithm exhibits different generalization performance with different fuzzy membership functions, and better test accuracy is obtained with the reversed sigmoid function.
5. Conclusions
In this study, a novel learning algorithm named LC-PSO-ELM was proposed by combining the frame structure of the LC-ELM with the parameter optimization strategy of the PSO algorithm. The input weights, hidden biases, addresses and radiuses were all adjusted by the PSO in searching for the optimal solution of the model.
On the function approximation and classification benchmark problems, the performance of the LC-PSO-ELM with different fuzzy membership functions was evaluated. Meanwhile, the generalization performance of the four algorithms, ELM, LC-ELM, PSO-ELM and LC-PSO-ELM, was compared, which showed that the proposed algorithm produces better generalization performance in most cases than the other alternative ELM-based approaches.
Although the LC-PSO-ELM can obtain significantly improved generalization performance, the training time of the algorithm is much longer than that of the others because four parameter values must be optimized. In the future, it will be necessary to develop a parallel training mechanism for the proposed method to improve its efficiency on problems with very large datasets. Correspondingly, it will also be necessary to investigate the sensitivities of the chosen activation functions theoretically.