Extreme Learning Machine Based on Firefly Adaptive Flower Pollination Algorithm Optimization

Extreme learning machine (ELM) has attracted considerable attention for its fast training speed and good generalization performance, and it has been widely used in both regression and classification problems. However, on account of the randomness of its input parameters, it requires more hidden nodes to obtain the desired accuracy. In this paper, we propose a firefly-based adaptive flower pollination algorithm (FA-FPA) to optimize the input weights and thresholds of the ELM algorithm. Nonlinear function fitting, iris classification and personal credit rating experiments show that ELM with FA-FPA (FA-FPA-ELM) obtains significantly better generalization performance (e.g., root mean square error, classification accuracy) than the traditional ELM, ELM with the firefly algorithm (FA-ELM), ELM with the flower pollination algorithm (FPA-ELM), ELM with the genetic algorithm (GA-ELM) and ELM with particle swarm optimization (PSO-ELM).


Introduction
For the past few years, various artificial intelligence algorithms have attracted increasing attention in scientific research and industrial applications. Machine learning, also known as statistical learning theory, is an important branch of artificial intelligence. It extracts regularities from data through data analysis and applies these regularities to predict or classify other, unknown data. Machine learning has been widely used in data mining [1], natural language processing [2], speech recognition [3] and search engines [4]. Neural networks are a crucial class of machine learning algorithms; they change the internal structure of the system by sensing changes in external information. Feedforward neural networks [5] are among the most widely used neural networks. A feedforward network generally includes an input layer, a hidden layer and an output layer, where the hidden layer can be multi-layered. Neurons in the same layer are not connected, adjacent layers may be connected in multiple ways, and each layer only receives information from the layer before it. Typical feedforward neural networks include perceptron models [6], BP neural networks [7], self-coding neural networks [8], radial basis networks [9,10] and convolutional neural networks [11].
Lately, ELM was put forward by Huang et al. [12][13][14][15]. It is a novel algorithm for single hidden layer feedforward networks [16][17][18][19]: it randomly generates the input weights and thresholds, and the output weights can then be calculated directly. The learning speed of the ELM algorithm is much faster than that of gradient-descent-based feedforward network learning algorithms, and it has good generalization performance on complex application problems [20,21]. However, because the input parameters are randomly generated, more hidden nodes are needed to obtain the desired accuracy; this motivates the optimization of the input weights and thresholds proposed in this paper.

Extreme Learning Machine (ELM)
ELM only needs to set the number of hidden nodes and can acquire the unique optimal solution. Suppose there are Q different training samples, where X and Y are the input data and output data, respectively:

$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1Q} \\ x_{21} & x_{22} & \cdots & x_{2Q} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nQ} \end{bmatrix}, \quad (1) \qquad Y = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1Q} \\ y_{21} & y_{22} & \cdots & y_{2Q} \\ \vdots & \vdots & & \vdots \\ y_{m1} & y_{m2} & \cdots & y_{mQ} \end{bmatrix}. \quad (2)$$

The actual output T of the network is

$$T = [t_1, t_2, \cdots, t_Q]_{m \times Q}. \quad (3)$$

The standard ELM with L hidden nodes can be denoted as

$$\sum_{i=1}^{L} \beta_i \varphi(w_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \cdots, Q, \quad (4)$$

where $t_j = [t_{1j}, t_{2j}, \cdots, t_{mj}]^T$ and $\varphi(x)$ represents the activation function. The specific expression of $t_j$ is

$$t_j = \begin{bmatrix} \sum_{i=1}^{L} \beta_{i1} \varphi(w_i \cdot x_j + b_i) \\ \vdots \\ \sum_{i=1}^{L} \beta_{im} \varphi(w_i \cdot x_j + b_i) \end{bmatrix}, \quad (5)$$

where $x_j = [x_{1j}, x_{2j}, \cdots, x_{nj}]^T$, $w_i = [w_{i1}, w_{i2}, \cdots, w_{in}]$ is the weight vector connecting the i-th hidden node and the input nodes, $\beta_i = [\beta_{i1}, \beta_{i2}, \cdots, \beta_{im}]^T$ is the weight vector connecting the i-th hidden node and the output nodes, and $b_i$ is the threshold of the i-th hidden node; $w_i \cdot x_j$ denotes the inner product of $w_i$ and $x_j$. Equation (5) can be written compactly as $H\beta = T^T$, where H is the hidden-layer output matrix:

$$H = \begin{bmatrix} \varphi(w_1 \cdot x_1 + b_1) & \cdots & \varphi(w_L \cdot x_1 + b_L) \\ \vdots & & \vdots \\ \varphi(w_1 \cdot x_Q + b_1) & \cdots & \varphi(w_L \cdot x_Q + b_L) \end{bmatrix}_{Q \times L}. \quad (6)$$

The least-squares solution of $H\beta = T^T$ can be represented as

$$\hat{\beta} = H^{\dagger} T^T, \quad (7)$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden-layer output matrix H.
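The training procedure described above (random input weights and thresholds, then output weights by least squares) can be sketched in a few lines of NumPy. This is an illustrative sketch, assuming a sigmoid activation; the function names and array shapes are ours, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, L, rng=None):
    """Basic ELM training. X: (n, Q) inputs, T: (m, Q) targets, L hidden nodes."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    W = rng.uniform(-1, 1, size=(L, n))   # random input weights w_i
    b = rng.uniform(-1, 1, size=(L, 1))   # random hidden thresholds b_i
    H = sigmoid(W @ X + b)                # hidden-layer outputs, shape (L, Q)
    # beta = H^dagger T^T: least-squares output weights via the pseudoinverse
    beta = np.linalg.pinv(H.T) @ T.T      # shape (L, m)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network output for inputs X, shape (m, Q)."""
    return (sigmoid(W @ X + b).T @ beta).T
```

Note that only `beta` is learned; `W` and `b` stay at their random values, which is exactly why more hidden nodes (or, as proposed here, optimized `W` and `b`) are needed for a given accuracy.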

Adaptive Flower Pollination Algorithm (FPA)
There are two ways to achieve pollination: self-pollination and cross-pollination. Self-pollination usually occurs when there is no reliable pollinator, and fertilization uses pollen of the same plant. Cross-pollination means that pollination is conducted with pollen from different plants; it usually involves bees, bats and birds, which can fly long distances. The behavior of bees and birds follows the Lévy distribution [43]. For convenience, it is assumed that each plant has only one flower and one pollen gamete. Each pollen gamete maintains a one-to-one relationship with a solution, and the pollination process follows the rules below.
1. Cross-pollination can be deemed a global pollination process, in which pollinators carrying pollen move in a way that follows Lévy flights.
2. Self-pollination can be seen as a process of local pollination.
3. Flower constancy is the probability of reproduction, which is proportional to the similarity of the two flowers involved.
4. The switch probability p ∈ [0, 1] adjusts between local and global pollination. Owing to the influence of distance and other factors, the whole pollination process is more inclined toward local pollination.
In the initial stage of FPA, a population $X_i^t = [x_{i,1}^t, x_{i,2}^t, \cdots, x_{i,D}^t]$ $(i = 1, 2, \cdots, N)$ is randomly generated, where N is the size of the population, D is the dimension of the optimization problem and t is the current number of iterations. Self-pollination is the process of local pollination:

$$X_i^{t+1} = X_i^t + \varepsilon (X_j^t - X_k^t), \quad (8)$$

where $X_j^t$, $X_k^t$ represent pollen of different flowers of the same plant, and j, k are random integers on [1, N]. The variation factor ε is a random number obeying a uniform distribution on [0, 1].
Cross-pollination is a global pollination process performed by pollinators through Lévy flights:

$$X_i^{t+1} = X_i^t + \gamma L (g_* - X_i^t), \quad (9)$$

where $X_i^t$ represents the position of pollen i at the t-th iteration, $g_*$ represents the optimal solution of the current population, γ is the scaling factor controlling the step size and L is the intensity of pollination, essentially a step length drawn from the Lévy distribution. In the basic FPA, the switch probability p is a constant, which may prevent the algorithm from converging and cause it to fall into a local optimum. Therefore, p is adjusted adaptively according to formula (10), where σ is a random number on [−1, 1]. According to reference [44], the algorithm performs best when p is 0.8, so p is set to float around 0.8 to prevent its value from becoming too large or too small. The improved algorithm adaptively adjusts the execution probability of global and local search, preventing FPA from falling into a local optimum while improving its convergence speed.
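The two pollination moves and the adaptive switch can be sketched as below. This is a minimal illustration, assuming Mantegna's algorithm for the Lévy steps and a simple "float around 0.8" rule `p = 0.8 + 0.1σ` for the switch probability; both are our assumptions, since the paper's exact Eq. (10) is not reproduced here.

```python
import numpy as np
from math import gamma, sin, pi

def levy(D, lam=1.5, rng=None):
    """Lévy-stable step lengths via Mantegna's algorithm (a common choice)."""
    sigma = (gamma(1 + lam) * sin(pi * lam / 2) /
             (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0, sigma, D)
    v = rng.normal(0, 1, D)
    return u / np.abs(v) ** (1 / lam)

def adaptive_fpa(f, D, N=10, iters=100, gamma_step=0.1, rng=0):
    """Minimize f over R^D with an adaptive flower pollination algorithm."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(-1, 1, (N, D))           # initial population of pollen
    fit = np.array([f(x) for x in X])
    g = X[fit.argmin()].copy()               # current best solution g*
    for t in range(iters):
        p = 0.8 + 0.1 * rng.uniform(-1, 1)   # assumed adaptive rule: p floats around 0.8
        for i in range(N):
            if rng.random() < p:             # global pollination via Lévy flight, Eq. (9)
                x_new = X[i] + gamma_step * levy(D, rng=rng) * (g - X[i])
            else:                            # local pollination, Eq. (8)
                j, k = rng.integers(0, N, 2)
                x_new = X[i] + rng.random() * (X[j] - X[k])
            f_new = f(x_new)
            if f_new < fit[i]:               # greedy replacement of worse pollen
                X[i], fit[i] = x_new, f_new
                if f_new < f(g):
                    g = x_new.copy()
    return g, f(g)
```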

Firefly Algorithm (FA)
In nature, fireflies communicate with other fireflies through their light-emitting characteristics. The FA regards the problem of finding the optimal value as that of finding the brightest firefly, and models the optimization process as the mutual attraction and iterative position updates of fireflies. To simplify the algorithm, the following assumptions are made:
1. The sex of the fireflies is ignored.
2. The attraction between fireflies depends only on their brightness and distance.
3. The brightness of a firefly is determined by the fitness function.
The brighter the firefly, the better its position. Fireflies that glow more weakly move toward brighter ones, and the brightest firefly represents the optimal solution of the function. If two fireflies are equally bright, they move randomly. If the distance between individual j and individual i is $r_{ij}$, then the brightness perceived at individual j is

$$I(r_{ij}) = I_0 e^{-\lambda r_{ij}^2}, \quad (11)$$

where λ is the light-intensity absorption coefficient, generally taken as 1.0, and $I_0$ is the maximum fluorescent brightness of a firefly. $r_{ij}$ denotes the Euclidean distance between fireflies i and j:

$$r_{ij} = \| X_i - X_j \| = \sqrt{\sum_{d=1}^{D} (x_{i,d} - x_{j,d})^2}. \quad (12)$$

The attraction between individuals j and i can be expressed as

$$\beta(r_{ij}) = \beta_0 e^{-\lambda r_{ij}^2}, \quad (13)$$

where $\beta_0$ is the maximum attraction of a firefly. If firefly j is attracted and moved by a brighter firefly i, its position is updated by

$$X_j^{t+1} = X_j^t + \beta(r_{ij}) (X_i^t - X_j^t) + \alpha \mu_j, \quad (14)$$

where α is the step factor and $\mu_j$ is a random vector whose entries follow a Gaussian or uniform distribution. The positions of the fireflies are constantly updated through brightness and attraction; eventually most fireflies gather around those with higher fitness values, achieving the purpose of optimization.
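The update rules above can be sketched as follows. This is an illustrative sketch under a minimization convention (lower fitness means brighter); the step-size decay at the end of each sweep is a common practical addition of ours, not something the paper specifies.

```python
import numpy as np

def firefly_algorithm(f, D, N=15, iters=100, beta0=0.2, alpha=0.25, lam=1.0, rng=0):
    """Minimize f over R^D with the firefly algorithm."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(-1, 1, (N, D))
    I = np.array([f(x) for x in X])      # brightness: lower fitness = brighter
    for _ in range(iters):
        for j in range(N):
            for i in range(N):
                if I[i] < I[j]:          # firefly i is brighter, so j moves toward i
                    r2 = np.sum((X[i] - X[j]) ** 2)   # squared distance, cf. Eq. (12)
                    beta = beta0 * np.exp(-lam * r2)  # attraction, Eq. (13)
                    # position update, Eq. (14): attraction term + random step
                    X[j] = X[j] + beta * (X[i] - X[j]) + alpha * rng.normal(0, 1, D)
                    I[j] = f(X[j])
        alpha *= 0.97                    # gradual step decay (our practical assumption)
    k = I.argmin()
    return X[k], I[k]
```

Because the brightest firefly is never attracted by any other, its position only improves, so the best fitness found is non-increasing over the sweeps.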

Firefly-Based Adaptive Flower Pollination Algorithm (FA-FPA)
Since FA adopts a parallel, multi-position global random search strategy, the positions obtained after its iterations help accelerate the convergence of the adaptive FPA and find the global optimal solution. Therefore, FA is used first to obtain better initial positions, which are then refined by the adaptive FPA. The specific procedure is shown in Table 1.

Table 1. Steps of the firefly-based adaptive flower pollination algorithm (FA-FPA).

Step 1: Initialize the algorithm parameters and set the loop termination condition;
Step 2: Randomly initialize the positions of the fireflies and calculate the objective function value of each individual;
Step 3: Use formulas (11) and (13) to determine the direction of movement of each individual;
Step 4: Update each individual's spatial position according to formula (14);
Step 5: Calculate the fitness function value of each individual based on its updated position;
Step 6: Generate σ randomly and calculate the switch probability p according to formula (10);
Step 7: Generate rand ∈ [0, 1] randomly; if p > rand, conduct a global search according to formula (9);
Step 8: If p ≤ rand, conduct a local search according to formula (8);
Step 9: Calculate the fitness function value of each pollen to find the current optimal solution;
Step 10: Judge whether the loop termination condition is met. If not, go to Step 6; if it is, go to Step 11;
Step 11: Output the result; the algorithm ends.
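The two-phase procedure in Table 1 can be sketched end to end as below. It is a sketch under the same assumptions as the earlier snippets: Mantegna Lévy steps, an assumed `p = 0.8 + 0.1σ` adaptive rule, and illustrative step sizes; none of these constants are fixed by the paper.

```python
import numpy as np
from math import gamma, sin, pi

def fa_fpa(f, D, N=10, fa_iters=20, fpa_iters=80, rng=0):
    """Table 1 sketch: FA warm-up (Steps 2-5), then adaptive-FPA refinement (Steps 6-10)."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(-1, 1, (N, D))
    fit = np.array([f(x) for x in X])
    # Phase 1 (Steps 2-5): firefly moves produce a good initial population.
    beta0, lam, alpha = 0.2, 1.0, 0.25
    for _ in range(fa_iters):
        for j in range(N):
            for i in range(N):
                if fit[i] < fit[j]:                      # i brighter: j moves toward i
                    beta = beta0 * np.exp(-lam * np.sum((X[i] - X[j]) ** 2))
                    X[j] = X[j] + beta * (X[i] - X[j]) + alpha * rng.normal(0, 1, D)
                    fit[j] = f(X[j])
    # Phase 2 (Steps 6-10): adaptive FPA refines from those positions.
    g = X[fit.argmin()].copy()
    lv = 1.5
    sig = (gamma(1 + lv) * sin(pi * lv / 2) /
           (gamma((1 + lv) / 2) * lv * 2 ** ((lv - 1) / 2))) ** (1 / lv)
    for _ in range(fpa_iters):
        p = 0.8 + 0.1 * rng.uniform(-1, 1)               # assumed adaptive switch
        for i in range(N):
            if rng.random() < p:                         # global pollination (Lévy flight)
                L = rng.normal(0, sig, D) / np.abs(rng.normal(0, 1, D)) ** (1 / lv)
                x_new = X[i] + 0.1 * L * (g - X[i])
            else:                                        # local pollination
                j, k = rng.integers(0, N, 2)
                x_new = X[i] + rng.random() * (X[j] - X[k])
            f_new = f(x_new)
            if f_new < fit[i]:
                X[i], fit[i] = x_new, f_new
                if f_new < f(g):
                    g = x_new.copy()
    return g, f(g)                                       # Step 11: output the result
```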

Firefly Adaptive Flower Pollination Extreme Learning Machine Algorithm (FA-FPA-ELM)
To effectively address the problem caused by the randomness of the input parameters of the ELM algorithm, this paper uses the FA-FPA, with its strong optimization ability, to optimize the ELM input weights and thresholds, yielding the adaptive FA-FPA-ELM algorithm. We first use FA-FPA to obtain the optimal combination of weights and thresholds, and then use it directly as the input weights and thresholds for ELM training. To reduce the influence of large differences between variables on the performance of the algorithm, we normalize the data with Equation (15):

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}}, \quad (15)$$

where $X_{\max}$, $X_{\min}$ denote the maximum and minimum values in the training samples, respectively. Next, we encode the ELM input parameters: the input weights and thresholds are expressed as an individual pollen in real-valued coding. According to Section 2, the numbers of neurons in the input layer and hidden layer are n and L, respectively; therefore, the string length of an individual pollen is

$$S = L \times n + L = L(n + 1).$$

Assume that the weight matrix ω connecting the input layer and the hidden layer is

$$\omega = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1n} \\ w_{21} & w_{22} & \cdots & w_{2n} \\ \vdots & \vdots & & \vdots \\ w_{L1} & w_{L2} & \cdots & w_{Ln} \end{bmatrix}$$

and the threshold vector of the hidden layer is $b = [b_1, b_2, \cdots, b_L]^T$; then the coding form of an individual pollen is

$$\theta = [w_{11}, w_{12}, \cdots, w_{Ln}, b_1, b_2, \cdots, b_L].$$

The fitness function used in FA-FPA is the root mean square error (RMSE). The RMSE and mean absolute error (MAE) between the actual output and the expected output of the network are used as the evaluation standard:

$$\mathrm{RMSE} = \sqrt{\frac{1}{Q} \sum_{i=1}^{Q} \big( f(x_i) - y_i \big)^2}, \qquad \mathrm{MAE} = \frac{1}{Q} \sum_{i=1}^{Q} \big| f(x_i) - y_i \big|,$$

where $f(x_i)$ denotes the predicted data and $y_i$ the actual data. The purpose of using FA-FPA to optimize the ELM input parameters is therefore to find pollen particles that minimize the RMSE of the ELM algorithm. Table 2 details the implementation steps of the FA-FPA-ELM algorithm.
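The normalization, pollen encoding and RMSE fitness described above can be sketched as follows; the helper names (`decode`, `fitness`, etc.) are ours for illustration, and a sigmoid activation is assumed as in the experiments.

```python
import numpy as np

def normalise(X):
    """Min-max normalization, Eq. (15)."""
    return (X - X.min()) / (X.max() - X.min())

def decode(theta, n, L):
    """Split a pollen vector of length L*(n+1) into input weights W (L, n)
    and hidden thresholds b (L, 1), following the coding form of theta."""
    W = theta[: L * n].reshape(L, n)
    b = theta[L * n:].reshape(L, 1)
    return W, b

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def mae(pred, target):
    return float(np.mean(np.abs(pred - target)))

def fitness(theta, X, T, n, L):
    """FA-FPA fitness: training RMSE of the ELM defined by this pollen.
    X: (n, Q) inputs, T: (m, Q) targets."""
    W, b = decode(theta, n, L)
    H = 1.0 / (1.0 + np.exp(-(W @ X + b)))   # hidden outputs, sigmoid activation
    beta = np.linalg.pinv(H.T) @ T.T          # output weights by least squares
    return rmse((H.T @ beta).T, T)
```

Passing `fitness` as the objective `f` of the FA-FPA sketch above (with one pollen per candidate `theta`) reproduces the optimization loop of FA-FPA-ELM.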

Numerical Experiments and Comparison
To verify the effectiveness of the proposed FA-FPA-ELM algorithm, we conduct experiments on three problems: nonlinear function fitting, iris classification and personal credit rating. This section compares the approximation and classification performance of FA-FPA-ELM with the traditional ELM, FA-ELM, FPA-ELM, GA-ELM and PSO-ELM algorithms.

Nonlinear Function Fitting Problem
We first approximate a nonlinear test function f(x). With the number of hidden nodes fixed, the experimental results for different numbers of iterations are given in Table 3. Next, we test the impact of the number of hidden nodes on the ELM and FA-FPA-ELM algorithms. The maximum number of iterations is set to 100. As the number of hidden layer nodes gradually increases, the training time and RMSE of the two algorithms change accordingly; the experimental results are demonstrated in Table 4. It can be seen from Table 3 that when the number of hidden layer nodes is fixed, the training time gradually increases with the number of iterations, while the training and test errors decrease. Once the iterations reach a certain value, the test error decreases only slowly. Table 4 demonstrates that as the number of hidden layer nodes grows, the RMSE of FA-FPA-ELM and ELM gradually decreases. The training RMSE of the FA-FPA-ELM algorithm reaches 0.0582 with only 5 hidden layer nodes, while the ELM requires 20 hidden nodes for its training RMSE to reach 0.0561. In other words, to obtain almost the same training RMSE, the new algorithm needs only a quarter of the hidden nodes required by ELM. Combining Tables 3 and 4, the maximum number of iterations of FA-FPA-ELM is set to 100 and the number of hidden layer nodes to 5. The fits of the ELM and FA-FPA-ELM algorithms to the test function f(x) are shown in Figure 3, and their absolute errors in Figure 4. It is not difficult to see that with 5 hidden nodes, the approximation effect of the FA-FPA-ELM algorithm is better than that of the ELM algorithm, and its absolute test error is smaller.
Next, we compare the approximation performance of the basic ELM, FA-ELM, FPA-ELM, GA-ELM, PSO-ELM and FA-FPA-ELM algorithms. To ensure a fair comparison, the same population size N = 10 and maximum iteration number Maxgen = 100 are used for all five algorithms that optimize the ELM. The parameters of FA-ELM, FPA-ELM and FA-FPA-ELM are consistent with those above. For GA-ELM, a crossover rate of p_crossover = 0.95 and a mutation rate of p_mutation = 0.05 are used. For PSO-ELM, an inertia weight w = 0.7 is used, and its two learning parameters c1 = c2 are set to 1.5. We run 30 simulation experiments for each of the six algorithms and report the results in Table 5. Figure 5 gives the iteration curves of three optimization algorithms (GA-ELM, PSO-ELM and FA-FPA-ELM), and Figure 6 shows their absolute errors. The comparison shows that GA, PSO and FA-FPA can all effectively optimize the input parameters of ELM, and that FA-FPA-ELM achieves the best approximation effect among the six algorithms, although its training time is longer.

Iris Classification Problem
Next, we verify the classification performance of the ELM, FA-ELM, FPA-ELM, GA-ELM, PSO-ELM and FA-FPA-ELM algorithms using the iris data set from the UCI database. Iris is a data set for multivariate analysis; it contains 150 samples, each with 4 attributes, divided into 3 categories. The four attributes are petal length, petal width, sepal length and sepal width, and they are used to predict the species of iris (Setosa, Versicolour, Virginica). Petal length and width are linearly related to the iris species, while sepal length and width are nonlinearly related to it. Training and test sets are generated randomly: 100 samples are used for training and the rest for testing. For FA, FPA and FA-FPA, a step factor α = 0.25, a maximum attraction β0 = 0.2, a light-intensity absorption coefficient λ = 1 and an initial switch probability p = 0.8 are used. For PSO-ELM, an inertia weight w = 0.7 is used, and its two learning parameters c1 = c2 are set to 1.5. For GA-ELM, a crossover rate of p_crossover = 0.95 and a mutation rate of p_mutation = 0.05 are used. All algorithms that optimize the ELM parameters use a population size of 15 and a maximum of 100 iterations. The fitness function is the training accuracy, the number of hidden layer neurons is 5 and the sigmoid function is the activation function for all six algorithms. For the iris classification problem, Figure 7 shows that when optimizing the ELM parameters, the FA-ELM algorithm is unstable and falls into a local optimum. The improved FA-FPA-ELM algorithm can jump out of local optimal solutions and accelerates the convergence of the FPA-ELM algorithm. The iteration curves of the three optimization algorithms (GA-ELM, PSO-ELM and FA-FPA-ELM) are presented in Figure 8.
Table 6 shows the training time, testing time, training accuracy and test accuracy of the six algorithms. It is clear from Figure 8 that the FA-FPA-ELM algorithm finds the optimal weights and thresholds within 40 iterations, while the PSO-ELM algorithm only finds its optimal parameters after more than 80 iterations. At 100 iterations, the accuracy of FA-FPA-ELM and PSO-ELM is greater than that of the GA-ELM algorithm. Table 6 shows that GA-ELM, FA-ELM, FPA-ELM, PSO-ELM and FA-FPA-ELM can all effectively optimize the parameters and improve classification accuracy. The FA-FPA-ELM algorithm has the highest accuracy and better classification performance than the other five algorithms, but it requires the longest training time.

Personal Credit Rating
With the development of the financial industry, banks have gradually established complete credit evaluation systems. Personal credit evaluation methods are mainly divided into qualitative and quantitative evaluation. Qualitative evaluation is based mainly on the subjective judgment of credit officers, while quantitative evaluation is based on individual customer data analyzed with tools such as scorecards and credit scoring models. Here we use a personal customer credit evaluation that classifies all customers into two categories, distinguishing only good and bad credit. The evaluation uses a data set from the German credit database compiled by Professor Hans Hofmann, which contains data on 1000 customers, each described by 20 attributes and labeled with good or bad credit. The 20 attributes comprise 7 numeric attributes and 13 categorical attributes. The numeric attributes are age, bank deposits, account duration, loan amount, installment payment as a percentage of monthly income, years at current residence and number of dependents. The categorical attributes include current account status, loan history, loan purpose, deposit status, working hours, personal status, security deposit, property, other installment plans, housing status, employment status, telephone and whether the customer is a foreign worker. Next, we use the newly proposed FA-FPA-ELM algorithm to evaluate user credit and compare it with the traditional ELM, FA-ELM, FPA-ELM, GA-ELM and PSO-ELM algorithms. The parameter selection of each algorithm is the same as in the iris classification problem. The training and test sets are randomly generated, with 700 training samples and 300 test samples. For the personal credit rating problem, Figure 9 shows that when the ELM parameters are optimized, the FA-ELM algorithm oscillates during the iterative process and falls into a local optimum.
The improved FA-FPA-ELM algorithm is comparatively stable and achieves better classification accuracy. Figure 10 shows the iteration curves of the GA-ELM, PSO-ELM and FA-FPA-ELM optimizations, and Table 7 reports the training time, testing time, training accuracy and test accuracy of the six algorithms. Figure 10 shows that FA-FPA-ELM has the fastest convergence speed. At 100 iterations, the accuracy of FA-FPA-ELM and PSO-ELM is greater than that of the GA-ELM algorithm. Table 7 shows that FA-ELM, FPA-ELM, GA-ELM, PSO-ELM and FA-FPA-ELM all improve classification accuracy, with FA-FPA-ELM achieving the highest. Because FA-FPA-ELM performs two intelligent optimizations, it takes longer to train than the FA-ELM, FPA-ELM, GA-ELM and PSO-ELM algorithms, but it attains a higher accuracy rate.

Conclusions
In this paper, our goal was to optimize the input weights and thresholds of the ELM algorithm. FA has good global optimization ability, while FPA has good local optimization ability; therefore, we first use FA to obtain better weights and thresholds and then use them as the initial weights and thresholds for the adaptive FPA optimization of ELM. The new algorithm combines the advantages of ELM and FA-FPA: simple parameter adjustment, global optimality and strong generalization ability. When fitting a nonlinear function, the RMSE and MAE of FA-FPA-ELM are better than those of the traditional ELM algorithm, indicating that optimizing the weights and thresholds of ELM yields a better approximation effect. In the iris classification and personal credit evaluation experiments, with 5 hidden nodes and the sigmoid activation function, the classification performance of the FA-FPA-ELM algorithm is better than that of the traditional ELM, FA-ELM, FPA-ELM, GA-ELM and PSO-ELM; however, because the FA-FPA-ELM algorithm searches for the optimal weights and thresholds twice, its training time is the longest. The experimental results show that the new algorithm needs fewer hidden layer nodes to achieve better approximation and classification results. In the future, we will consider applying the FA-FPA-ELM algorithm to practical engineering problems and comparing it with a wider range of algorithms.