Research and Application of a Hybrid Wind Energy Forecasting System Based on Data Processing and an Optimized Extreme Learning Machine

Accurate wind speed forecasting plays a significant role for grid operators and the use of wind energy, which helps meet increasing energy needs and improve the energy structure. However, choosing an accurate forecasting system is a challenging task. Many studies have been carried out in recent years, but unfortunately, these studies ignore the importance of data preprocessing and the influence of numerous missing values, leading to poor forecasting performance. In this paper, a hybrid forecasting system based on data preprocessing and an Extreme Learning Machine optimized by the cuckoo algorithm is proposed, which can overcome the limitations of the single ELM model. In the system, the standard genetic algorithm is added to reduce the dimensions of the input and utilize the time series model for error correction by focusing on the optimized extreme learning machine model. And according to screened results, the 5% fractile and 95% fractile are applied to compose the upper and lower bounds of the confidence interval, respectively. The assessment results indicate that the hybrid system successfully overcomes some limitations of the single Extreme Learning Machine model and traditional BP and Mycielski models and can be an effective tool compared to traditional forecasting models.


Introduction
At present, the increasing energy demand, security of the energy supply and reduction of emissions are the most difficult challenges that need to be address urgently for the whole world [1]. Energy consumption, which accounts for 60% of global greenhouse gas emissions, has already contributed to climate change [2]. How best to stop climate change and global warming while still satisfying the world's energy consumption, without impairing the global economy, is an essential problem for every country [3]. There is no doubt that renewable energy is an appropriate way.
In renewables, wind power technology is regarded as the most mature of the new technologies today, with the potential to cover more than 20% of the global electricity demand by 2050 [4]. By the end of 2015, the worldwide total cumulative installed wind capacity had reached about 432,680 MW [5]. Wind touches our life in countless ways all the time. Thus, research on wind power production can be taken as an illustrative case for renewable energy.
Wind energy is of great significance in many countries' energy structures and has received a large amount of attention as a type of renewable energy. However, many problems connected with wind 1.
In the data preprocessing module, the Dynamic absolute mean value method and Cubic Spline interpolation is employed to eliminate outliers and process the missing data, respectively. The Dynamic absolute mean value method can define the scope of outliers effectively by setting the value of the segmentation length and the coefficient. Cubic Spline interpolation not only can overcome the defects in high-order polynomial interpolation but also can guarantee a certain smoothness of piecewise interpolation. 2.
In the algorithm optimization module, the ELM algorithm is optimized by the Cuckoo Search algorithm. The Cuckoo Search algorithm is abstracted to solve various optimization problems by simulating the nest parasitic behavior in the natural world. In the paper, it is used to optimize the initial weight of ELM and then increase the forecasting accuracy. 3.
In the wind speed forecasting module, an innovative hybrid system which successfully takes advantages of CSELM algorithm and SGA algorithm is proposed. Then, according to this, we conduct an empirical analysis on the wind speed of Dangjin Mountain located in Akesai, China. The combined system can provide more accurate and stable wind speed forecasting results compared with traditional forecasting models.

4.
A more scientific and comprehensive evaluation is conducted to estimate the performance of the developed forecasting system in this paper. The evaluation system contains the performance of the whole research process including data preprocessing, optimized algorithm and empirical forecasting results.
The reminder of the paper is organized as follows: Section 2 introduces the algorithms for the outliers and the imputation method for missing data. Attempting to overcome the low forecasting accuracy of the single ELM method, this article proposes a prediction model that optimizes the initial weight of ELM by using Cuckoo Search. Section 3 filters the input and output matrix established by SGA, calibrates the forecasting error using the sliding ARMA process and estimates the upper limit and lower limit of the prediction point's confidence interval based on historical samples. Finally, Section 4 concludes the paper.

Data Preprocessing and Proposed CS-ELM Model
The section can be primarily divided into two parts. The first part introduces the algorithms for the outliers and the imputation method for missing data. Attempting to overcome the low forecasting accuracy of the single ELM model, this second part proposes a prediction model that optimizes the initial weight of ELM via the usage of Cuckoo Search and contrast the prediction results obtained from BP and Mycielski.

Multiple Patterns
This part introduces the Dynamic absolute mean value method, Cubic Spline interpolation function, Extreme Learning Machine and Cuckoo Optimization Algorithm in turn. Dynamic absolute mean value method is used to process outliers and Cubic Spline interpolation is applied for missing data interpolation. Extreme Learning Machine is a method, which is applied to forecast based on the processed data. Targeting at the poor forecasting accuracy of single ELM model, ELM model is optimized by Cuckoo Search in the paper.

Dynamic Absolute Mean Value Method
Let x 1 , x 2 , x 3 , · · · , x n be an evenly spaced discrete sequence that is segmented at a fixed length τ(τ < n). This part focuses on the first segmented sequence and calculates as follows: Definition 1. Calculate the mean value: and write the new sequence as: Definition 2. Calculate the absolute mean value of the new sequence: iterate over the new sequence x 1 , x 2 , · · · , x τ and mark the value that is larger than the threshold k·x . For x i such that x i > k·x , x i is generally the bad data for k ≥ 4.
The scope of outliers that need to be removed can be defined by setting the segmentation length τ and the coefficient k determining the threshold size. For max-min outliers in data with lower volatility, k = 4 is applied. However, k ≥ 7 is always set in volatile data.

Cubic Spline Interpolation
Spline interpolation is a popular model that not only can overcome the defects in high-order polynomial interpolation but also can guarantee a certain smoothness of piecewise interpolation. Therefore, after eliminating outliers, Cubic spline interpolation is used to complete data.
Let [a, b] be partitioned into subintervals by n + 1 points of the function f (x)(a = x 0 < x 1 < · · · < x n = b). Let f (x l ) = y l (l = 0, 1, 2, · · · , n). The function S(x) is a Cubic Spline interpolant on interval [x l−1 , x l ] if: The function S(x) is a cubic polynomial having a second-order continual derivative. And S(x l − 0) = S(x l + 0) (4) S (x l − 0) = S (x l + 0) (5) S (x l − 0) = S (x l + 0) for l = 1, 2, . . . , n − 1 (6) S(x l ) = y l for l = 1, 2, . . . , n − 1 To obtain the concrete expression of S(x) on each subinterval, analysis of its existence is necessary. For distinguishing conveniently, S l (x) is written for a cubic function on [x l−1 , x l ]. Thus, S l (x) = a l x 3 + b l x 2 + c l x + d l for x on [x l−1 , x l ] and l = 1, 2, · · · , n. In the cubic function, a l , b l , c l , d l are unknown parameters. Then, there are 4n unknown parameters in n subintervals. From (a), we can obtain: There are 3(n − 1) constraints. Then, considering (b), there will be 3(n − 1) + (n + 1) = 4n − 2 constraints in total. The other two free parameters in a cubic spline interpolant can be variously assigned [29].
In this paper, we invoke the function interp1 in MATLAB to interpolate elimination points, choosing not-a-knot as its end point constraints by default. The calling method is splineRes = interp1(datanum, data, indexRR, spline ), where data represents data, datanum is the subscript of the data, indexRR stands for the interpolation position, and spline is the selected Cubic Spline interpolation function.

Density Function Estimation
To estimate the probability density function effectively, a random sample X 1 , X 2 , · · · , X n is employed to estimate the histogram in this paper.
Assume that the non-negative density function f (x) meets f (x)dx = 1. For the purpose, starting point t 0 and width h, the two parameters are selected to construct a histogram where h decides the smoothness of the histogram. If h is minor, the histogram will show more detail. However, if the calculation is too large, the histogram will be too smooth. Thus, h can be considered as a smoothness parameter, and selecting a proper h is important.
Let B k = [t k , t k+1 ) be the kth subinterval, and t k+1 − t k = h. If there are v k data in the subinterval, at the point of x, there will bef Thus, if the size of the sample in the interval B k is given, the probability destiny value at the given point x will be worked out. A proper h is defined via these common rules: Extreme Learning Machine (ELM) is a promising artificial neural network method introduced by Huang which has very fast learning capability [30]. ELM has a simple three-layer structure: an input layer, output layer, and hidden layer, which contains a large number of nonlinear processing nodes [31]. The ELM structure of multiple inputs and a single output is shown by Figure 1. One advantage of ELM over traditional neural networks is that it is not necessary to adjust parameters iteratively. Thus, ELM has a much faster training process with better performance in some cases when compared with traditional neural networks [32]. However, ELM is always applied for balanced data. Imbalanced data problems require special treatment because characteristics of the imbalanced data can decrease the accuracy of the data [33].
(c) Scott's Rule Extreme Learning Machine (ELM) is a promising artificial neural network method introduced by Huang which has very fast learning capability [30]. ELM has a simple three-layer structure: an input layer, output layer, and hidden layer, which contains a large number of nonlinear processing nodes [31]. The ELM structure of multiple inputs and a single output is shown by Figure 1. One advantage of ELM over traditional neural networks is that it is not necessary to adjust parameters iteratively. Thus, ELM has a much faster training process with better performance in some cases when compared with traditional neural networks [32]. However, ELM is always applied for balanced data. Imbalanced data problems require special treatment because characteristics of the imbalanced data can decrease the accuracy of the data [33].

Cuckoo Search
The Cuckoo Search (CS) is a new meta-heuristic animal-behavior-imitation algorithm developed by Yang and Deb in 2009 [34]. The CS is always based on three idealized rules: first, each

Cuckoo Search
The Cuckoo Search (CS) is a new meta-heuristic animal-behavior-imitation algorithm developed by Yang and Deb in 2009 [34]. The CS is always based on three idealized rules: first, each cuckoo can only lay one egg at a time, and lay it in a randomly chosen host bird nest. Second, the best nests with high-quality eggs (solutions) can also be selected by the next generations. Last, the number of available host nests is fixed, and a host bird can discover an alien egg with probability p a ∈ [0, 1] [35]. Moreover, the performance of the CS can be improved by the usage of Lévy flights. The following Lévy flights are performed: In these formulas, the step size (∂) is related to the scale of the problem of interests [36], and ⊕ expresses entry-wise multiplication locations.
Due to the outliers in the collected data set, the prediction results of the Extreme Learning Machine are not very good. In order to overcome the weakness of the ELM model, one ELM model optimized by Cuckoo Search is developed in the paper. The computational steps of the optimized algorithm (CS-ELM) are described as Algorithm A2.

Wind Speed Empirical Analysis
Based on the data of Dangjin Mountain in Akeisai, China, the part uses MAE, RMSE and MAPE as prediction evaluation indexes, builds a model after statistical analysis and data preprocessing, and gets results.

Data Resources
Dangjin Mountain wind farm in Akesai is the first plateau type of demonstration wind farm established in our country and has been in use since 2010. In 2014, the wind farm had an electricity installment capacity of approximately 100 MW and provided surrounding areas with green and clean energy add up to 4.58 billion kWh [37]. The wind speed data collected every 15 min from 1 January to 31 December in 2013 for Akesai Dangjin Mountain wind farm is referenced for empirical analysis data in the experiment.

Prediction Evaluation Index
Let the forecasting target outcome be y target (n) ∈ R N for n = 1, 2, · · · , T presenting forecasting temporal points. To evaluate the accuracy of the output result y(n), mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE) are applied as prediction evaluation indexes and the equations of these indexes are shown as Table 1. Table 1. Three prediction evaluation indexes.

Metric
Definition Equation

MAE
Mean absolute error When these three indexes are smaller, single point prediction is more accurate and the applicability of the proposed model is better.

Statistical Analysis and Data Preprocessing
The wind speed data collected every 15 min covering the period from 1 January to 31 December in 2013 for Akesai Dangjin Mountain wind farm is used as empirical analysis data in the experiment. As shown in Figure 2a, the general trend of annual raw wind data demonstrates that the data fluctuation frequency is large and that there are many cuspidal points. The data value on both sides is small, but there is a peak in the middle. In addition, because of the existence of data loss, the diagram also presents the wind speed data collection. The grey line indicates that the data are missing at that position or that the value is the same at different times. It is estimated that the amount of missing data takes up 12.2% of the total samples. experiment. As shown in Figure 2a, the general trend of annual raw wind data demonstrates that the data fluctuation frequency is large and that there are many cuspidal points. The data value on both sides is small, but there is a peak in the middle. In addition, because of the existence of data loss, the diagram also presents the wind speed data collection. The grey line indicates that the data are missing at that position or that the value is the same at different times. It is estimated that the amount of missing data takes up 12.2% of the total samples.  Figure 2b shows the frequency histogram after removing missing data. It's easy to be seen from the frequency histogram that with the influence of significant missing data, the collected data does not obey the single peak statistical distribution, but has complex statistical characteristics. The boxplot below clearly reflects the range of 0~4 m/s, focuses nearly 50% of the data, and the other distributes in the range of 4~30 m/s more dispersedly. From this, it can be seen that the local has rich wind resources. Therefore, setting up accurate wind speed forecasting model is significant to the effective usage of wind energy for the region.
To estimate the probability density function of collected wind speed data, the probability histogram and the Freedman-Diaconis Rule function are applied to calculate h. In statistics, the Freedman-Dacoins rule can be used to selected the size of the bins to be used in a histogram [38]. For a set empirical measurements sampled from some probability distribution, the Freedman-Doaconis rule is designed to minimize the difference between the area under the empirical probability distribution and the area under the theoretical probability distribution.
According to the Freedman-Diaconis Rules, The calculation result shows that 0.5010 h  and that the probability of points falling into the range of 5~30 m is 0.5095. Figure 2c provides the probability histogram after removing missing data.
In the process of collecting data, the situation of collecting unreal data resulting from transmission error, instrument failure or special weather often appears. These data perform as abnormal and missing two cases and would influence further analyzing and forecasting by negative trends. To make the process of analyzing and predicting complete and the results more reasonable, the method of processing outliers and interpolating missing data is employed as follows.  Figure 2b shows the frequency histogram after removing missing data. It's easy to be seen from the frequency histogram that with the influence of significant missing data, the collected data does not obey the single peak statistical distribution, but has complex statistical characteristics. The boxplot below clearly reflects the range of 0~4 m/s, focuses nearly 50% of the data, and the other distributes in the range of 4~30 m/s more dispersedly. From this, it can be seen that the local has rich wind resources. Therefore, setting up accurate wind speed forecasting model is significant to the effective usage of wind energy for the region.
To estimate the probability density function of collected wind speed data, the probability histogram and the Freedman-Diaconis Rule function are applied to calculate h. In statistics, the Freedman-Dacoins rule can be used to selected the size of the bins to be used in a histogram [38]. For a set empirical measurements sampled from some probability distribution, the Freedman-Doaconis rule is designed to minimize the difference between the area under the empirical probability distribution and the area under the theoretical probability distribution. According to the Freedman-Diaconis Rules, h = 2 × IQR × n −1/3 . The calculation result shows that h = 0.5010 and that the probability of points falling into the range of 5~30 m is 0.5095. Figure 2c provides the probability histogram after removing missing data.
In the process of collecting data, the situation of collecting unreal data resulting from transmission error, instrument failure or special weather often appears. These data perform as abnormal and missing two cases and would influence further analyzing and forecasting by negative trends. To make the process of analyzing and predicting complete and the results more reasonable, the method of processing outliers and interpolating missing data is employed as follows.
The section takes the wind speed data as random dynamic signal and adopts the absolute mean value method to remove outliers. The underneath of Figure 3 shows existing minimax value segment in the empirical data adding up to 1000 from 20:00 on 15 February 2013 to 5:45 on 26 February 2013. The section applies τ = 10, k = 4 and τ = 50, k = 5 two parameter combinations, respectively, to remove the maximum point and minimum point. According to detected outliers by Dynamic Absolute Mean Value method (as shown in Figure 3), red dot and yellow dot present the extreme value point through the first and second parameter combination.  The recognition results of outliers with different segmentation length  and coefficient k are presented in Figure 3. From this, it is easy to draw the conclusion that using a single parameter combination is difficult to recognize all maximum and minimum value points comprehensively and is possible to eliminate the abnormal points. In addition, when the coefficient k is the same and the segmentation length  is different, both the subscript and the number of recognized points are different. Then, with the same  and an increasing k, both the rejection range and the number of found extreme value points become smaller. This is the reason why we apply two different parameter combinations to remove outliers. Figure 3 also lists the points' subscript with different coefficients. Moreover, there are a few single points absent in the original data distribution. Combined with the above eliminated outlier, the Cubic Spline function is selected to interpolate the missing value in the data. Missing too many data cannot achieve satisfied result, so the segments missing more than 20 values are excluded in the section. Figure 4 shows the wind speed data for empirical research after the removing of outliers and interpolating missing data. Compared with Figure 2a, the effect of the Dynamic Absolute Mean The recognition results of outliers with different segmentation length τ and coefficient k are presented in Figure 3. From this, it is easy to draw the conclusion that using a single parameter combination is difficult to recognize all maximum and minimum value points comprehensively and is possible to eliminate the abnormal points. In addition, when the coefficient k is the same and the segmentation length τ is different, both the subscript and the number of recognized points are different. Then, with the same τ and an increasing k, both the rejection range and the number of found extreme value points become smaller. This is the reason why we apply two different parameter combinations to remove outliers. Figure 3 also lists the points' subscript with different coefficients. Moreover, there are a few single points absent in the original data distribution. Combined with the above eliminated outlier, the Cubic Spline function is selected to interpolate the missing value in the data. Missing too many data cannot achieve satisfied result, so the segments missing more than 20 values are excluded in the section. Figure 4 shows the wind speed data for empirical research after the removing of outliers and interpolating missing data. Compared with Figure 2a, the effect of the Dynamic Absolute Mean Value method in removing outliers and the practicability of Cubic Spline Interpolation are ensured. From Figure 4, it can be observed that there are only 20 or more values that are empty in processed missing data segments. In the subsequent study, the overall data are decomposed into four datasets for empirical analysis according to the position of these nulls. Table 2 lists several statistical indexes, such as median and range before and after data interpolation. As can be observed, in these indexes, the average is slightly larger, and the rest values are the same as the original data, which suggests the preprocessing in the paper retains the basic characteristics and changing trend of the original data. In other words, the preferred method of removing extreme points and interpolating missing data is effective. Value method in removing outliers and the practicability of Cubic Spline Interpolation are ensured. From Figure 4, it can be observed that there are only 20 or more values that are empty in processed missing data segments. In the subsequent study, the overall data are decomposed into four datasets for empirical analysis according to the position of these nulls. Table 2 lists several statistical indexes, such as median and range before and after data interpolation. As can be observed, in these indexes, the average is slightly larger, and the rest values are the same as the original data, which suggests the preprocessing in the paper retains the basic characteristics and changing trend of the original data. In other words, the preferred method of removing extreme points and interpolating missing data is effective.  This section proposes building a wind speed forecasting model where random initial input weights and the threshold in the ELM method are optimized by Cuckoo Search. The framework of the entire optimization model is shown in Figure 5. The specific modeling process is as follows:


Step 1. Data processing: The scope of the outlier to be removed can be defined by the Dynamic absolute mean value method, and after eliminating outliers, Cubic spline interpolation is used to complete data. Normalize the input data to eliminate the null column in the transformation data matrix and avoid the influence of data in results obtained by ELM as much as possible.  Step   This section proposes building a wind speed forecasting model where random initial input weights and the threshold in the ELM method are optimized by Cuckoo Search. The framework of the entire optimization model is shown in Figure 5. The specific modeling process is as follows:

•
Step 1. Data processing: The scope of the outlier to be removed can be defined by the Dynamic absolute mean value method, and after eliminating outliers, Cubic spline interpolation is used to complete data. Normalize the input data to eliminate the null column in the transformation data matrix and avoid the influence of data in results obtained by ELM as much as possible.

•
Step function. Then, screen the optimal weight and threshold on the basis of this. As a result, the possibility of being selected is higher with higher fitness.

•
Step 3. Encoding: Code for individuals w 11 w 12 · · · w 1p w 21 w 22 · · · w 2p · · · w N1 · · · w N p bs 1 bs 2 · · · bs N where w denotes the weight and bs denotes the threshold. In the initial phases, the first N * p weights are initialized by the usage of the real random number in [−1, 1], but the rest part is initialized by the number in [0, 1].

•
Step 4. Optimization iteration: firstly initialize the fitness of the nest, use the initial optimal nest individual and Lévy flights disturbance to generate new individuals, and calculate the fitness values of all nest individuals after disturbance. Write the fitness of nest i as O(i) and the new nest , then replace the nest i; in every generation save the nest with small fitness, abandon the host nest with high recognition and find a new nest individual again.

•
Step 5. Test: The optimized weights are used as the input weights in ELM to build the optimized wind speed prediction model. • Step 6. Reconstructed model: Set a single point prediction threshold. If ELM forecasting values are continuously above the prediction threshold, re-optimize and adjust the weights. employ re-initialization. The objective function decides the search trends and the results of the algorithm. The section employs 10-fold cross-validation mean square error sum as the output value of the objective function. Then, screen the optimal weight and threshold on the basis of this. As a result, the possibility of being selected is higher with higher fitness.


Step 3. Encoding: Code for individuals 11 , then replace the nest i ; in every generation save the nest with small fitness, abandon the host nest with high recognition and find a new nest individual again.  Step 5. Test: The optimized weights are used as the input weights in ELM to build the optimized wind speed prediction model.  Step 6. Reconstructed model: Set a single point prediction threshold. If ELM forecasting values are continuously above the prediction threshold, re-optimize and adjust the weights.

Model Results
In this section, BP neural net, Mycielski and non-optimized ELM are used as contrast models to analyze the predictable performance of the ELM model optimized by the Cuckoo algorithm proposed in the previous section. To ensure the model's rationality, node number of input and hidden layers of BP, non-optimized ELM and CS-ELM are set to be the same. Moreover, the Sigmoid function is used as the activation function in three models. The parameter of the CS-ELM model is set as follows: let the nest number n = 50, the host recognition p = 0.25, the number of iterations N = 100, node number of input inputnum = 11 and hidden layers hidnum = 20. Moreover, Sigmoid function is the activation function, the weight is in [−1, 1], and the threshold is in [0, 1].
To reflect the forecasting effect of each model in different months and seasons, the annual data are divided into four datasets referred to as Data 1, Data 2, Data 3 and Data 4. In the section, at least two months of data are used for training and the weeklong prediction is selected for evaluating the performance of each model. Figure 6 lists the training data and time of the four datasets corresponding to four seasons from top to bottom. In this section, BP neural net, Mycielski and non-optimized ELM are used as contrast models to analyze the predictable performance of the ELM model optimized by the Cuckoo algorithm proposed in the previous section. To ensure the model's rationality, node number of input and hidden layers of BP, non-optimized ELM and CS-ELM are set to be the same. Moreover, the Sigmoid function is used as the activation function in three models. The parameter of the CS-ELM model is set as follows: let the nest number 50 n  , the host recognition To reflect the forecasting effect of each model in different months and seasons, the annual data are divided into four datasets referred to as Data 1, Data 2, Data 3 and Data 4. In the section, at least two months of data are used for training and the weeklong prediction is selected for evaluating the performance of each model. Figure 6 lists the training data and time of the four datasets corresponding to four seasons from top to bottom.  energy resource is abundant. In this case, establishing an accurate wind speed forecasting model is particularly important. Figure 6 presents the line chart of four test datasets at the same time, clearly describing the overall changing trend of wind speed in four seasons. As shown in Figure 6, the data in Data 4 is stable and the overall value is small. Data 2 and Data 3 fluctuate wildly, accompanied by certain periodicity. Both sides of the Data 1 value are small, and the middle value is large. Figure 7 includes figures that present the contrast between the true value and the predicted results obtained by the different models on the basis of four datasets. There, the horizontal axis is the real value, and the vertical axis shows the prediction result. Clearly, when the points are closer to the diagonal, the prediction effect is better. In Data 1, it is evident that the predictions of ELM and CS-ELM are close and focus around the diagonal, the Mycielski model is more dispersed, and the points of BPNN are in the range of 0-10 m/s and off to the real value, demonstrating that the prediction results are small. In the contrasting processing in Data 2, the points of BP off to the true value and Mycielski off to the prediction show that both models are biased for Data 2. The results of ELM and CS-ELM are close and more concentrated, suggesting that the two models have an advantage in the group. From the third figure, it can be seen that the prediction effect of the four models is unsatisfied, and among them, the Mycielski method is the worst one. For Data 4, the points are mainly in the range 0-4 m/s, and the four algorithms are similar. The forecasting results of the part more than 0-4 m/s are dispersed. Based on the overall situation in four datasets, the following conclusion can be draw: the effect of ELM and CS-ELM is satisfied relatively, while the results of BPNN and Mycielski are often dispersed. For convenience in contrasting the effects, Figure 6 also shows the basic statistical value of data use in four datasets. The test set length of all four datasets is 672, presenting collected wind speed data in seven days. In Data 2, the average wind speed is 11.2 m/s and the median is 10.94 m/s, illustrating the wind is strong in the period. The wind speed is 2.64 m/s on average in the low wind period, Data 4. The statistical indicators of Data 1 and Data 3 are similar. From range, it can be observed that all four datasets have windy days. The range of Data 2 reaches 26.3 m/s, and Data 1 also reaches 21.05 m/s. It suggests that the size and type of the local wind is various and that the wind energy resource is abundant. In this case, establishing an accurate wind speed forecasting model is particularly important. Figure 6 presents the line chart of four test datasets at the same time, clearly describing the overall changing trend of wind speed in four seasons. As shown in Figure 6, the data in Data 4 is stable and the overall value is small. Data 2 and Data 3 fluctuate wildly, accompanied by certain periodicity. Both sides of the Data 1 value are small, and the middle value is large. Figure 7 includes figures that present the contrast between the true value and the predicted results obtained by the different models on the basis of four datasets. There, the horizontal axis is the real value, and the vertical axis shows the prediction result. Clearly, when the points are closer to the diagonal, the prediction effect is better. In Data 1, it is evident that the predictions of ELM and CS-ELM are close and focus around the diagonal, the Mycielski model is more dispersed, and the points of BPNN are in the range of 0-10 m/s and off to the real value, demonstrating that the prediction results are small. In the contrasting processing in Data 2, the points of BP off to the true value and Mycielski off to the prediction show that both models are biased for Data 2. The results of ELM and CS-ELM are close and more concentrated, suggesting that the two models have an advantage in the group. From the third figure, it can be seen that the prediction effect of the four models is unsatisfied, and among them, the Mycielski method is the worst one. For Data 4, the points are mainly in the range 0-4 m/s, and the four algorithms are similar. The forecasting results of the part more than 0-4 m/s are dispersed. Based on the overall situation in four datasets, the following conclusion can be draw: the effect of ELM and CS-ELM is satisfied relatively, while the results of BPNN and Mycielski are often dispersed.  Table 3 collects the prediction evaluation indexes of different models in the four datasets. Overall, the prediction effect of ELM and CS-ELM is much better than that of BP and Mycielski. In Data 1 and Data 2, the performance of BP is worse than Mycielski, while in Data 3 and Data 4, Mycielski is worse. This is because initial weights have a great influence in BP, and different random initialization results led to different prediction results. Next, in view of the above four dataset results, after being optimized by CS, the ELM effect achieves a certain improvement. The improvement is the most obvious and the degree is the highest in Data 2 which can be seen by  Table 3 collects the prediction evaluation indexes of different models in the four datasets. Overall, the prediction effect of ELM and CS-ELM is much better than that of BP and Mycielski. In Data 1 and Data 2, the performance of BP is worse than Mycielski, while in Data 3 and Data 4, Mycielski is worse. This is because initial weights have a great influence in BP, and different random initialization results led to different prediction results. Next, in view of the above four dataset results, after being optimized by CS, the ELM effect achieves a certain improvement. The improvement is the most obvious and the degree is the highest in Data 2 which can be seen by comparing the values of the RMSE, MAE and MAPE. In regard to other datasets, only the performance of CS-ELM has a weak improvement.  Figure 8 shows the contrasting error boxplot of each model for four datasets, in which 1 represents the BPNN model, 2 is the Mycielski model, 3 shows ELM and 4 represents the CS-ELM model. Then, the following conclusions would be introduced: (a) In the first figure, the prediction result of BP is biased, and the part beyond quartiles is greater.
The data of the Mycielski model beyond the upper and lower edge is more and the range of error distribution is wide, presenting an error interval that is maximized. In ELM and the CS-ELM model, the data mainly focuses on the range of −2 m/s~2 m/s, the excess part is less than BP and Mycielski, and the prediction effect is better. and bottom edge is less than Mycielski, so the performance is better than Mycielski. However, its error concentration is not higher than ELM and CS-ELM, leading to worse performance than ELM and CS-ELM.   Figure 8 shows the contrasting error boxplot of each model for four datasets, in which 1 represents the BPNN model, 2 is the Mycielski model, 3 shows ELM and 4 represents the CS-ELM model. Then, the following conclusions would be introduced: (a) In the first figure, the prediction result of BP is biased, and the part beyond quartiles is greater.
The data of the Mycielski model beyond the upper and lower edge is more and the range of error distribution is wide, presenting an error interval that is maximized. In ELM and the CS-ELM model, the data mainly focuses on the range of −2 m/s~2 m/s, the excess part is less than BP and Mycielski, and the prediction effect is better. (b) In the second figure, the error range of the Mycielski model reaches [-10, 10], showing that the performance is very unsatisfied. The overall error is concentrated for ELM and CS-ELM. (c) In the last two figures, the result of BP is not obviously biased, and the data range beyond the top and bottom edge is less than Mycielski, so the performance is better than Mycielski. However, its error concentration is not higher than ELM and CS-ELM, leading to worse performance than ELM and CS-ELM.

Summary
This section applies Cuckoo Search to optimize the weight parameters of ELM. In accordance with the problem that ELM is sensitive to input data, the Dynamic Absolute mean value and Cubic Spline are used to reject outliers and interpolate parts of the missing data, which maintains the integrity of research data well.
In addition, to increase the stability of the proposed model, the rest of the data having more vacancy is removed after setting up an input and output mapping matrix net. Then, optimize the weight parameters of ELM by CS and build the optimized CS-ELM model. Based on the above analysis of the wind speed of Akesai Dangjin Mountain in four seasons, CS-ELM achieves much better accuracy than the other models, and its single-point prediction effect is best followed by ELM. The effect of BP and Mycielski is the worst. Therefore, through optimized weights and a threshold by CS and ELM, the model performance has an improvement, but the improvement is affected by the test set. The wind speed has considerable difference in different seasons, so both the prediction error distribution and results are different for different datasets.

Proposed Hybrid Model
Based on CS-ELM prediction model, the chapter mainly studies the following two questions: (1) proper simplified input regardless of the precision of the model; (2) how to provide the range of the forecasted points. To solve the first problem, a hybrid model based on the SGA and ARMA is developed. The SGA plays the main role, which simplifies input in the hybrid model, and ARMA is used to adjust the errors. For the second question, the interval prediction based on the distance cluster is proposed with the application of the filter input.
The remainder of the section is organized as follows: the theoretical part describes the SGA, ARMA and the modeling procedure. The empirical part first presents specific modeling steps, and then the point and interval forecasting results of the hybrid model are given on the basis of the wind speed of Akesai Dangjin Mountain in four seasons.

Theoretical Analysis
In the chapter, the Standard Genetic Algorithm is used for data filtering to achieve the goal that partial data can obtain higher prediction accuracy. In the end, ARMA model is employed for error correction. Based on the standard genetic algorithm SGA, ARMA time series and CS-ELM model developed in the second section, a hybrid model is proposed in this part.

Standard Genetic Algorithm
The Genetic Algorithm (GA) is a powerful stochastic algorithm based on simulating biological genetics and natural selection mechanism's biological evolution process [39], which was first described in the Ph.D. thesis of Bagay in 1967 [40]. GA starts with an initial set of random solutions called a population, and each individual in the population is called a chromosome [28].
To qualitative analyze the global convergence of SGA based on binary encoding, Holland has established the model theorem and implicit parallelism, proving that SGA can keep the optimal solution in each generation before or after selecting the operation and the result of SGA converge to global optimal solution with probability 1 [41]. Because of its characteristics of uncomplicated operation and parallel searching of the global space, SGA has been successfully utilized in optimal control and machine learning field.
The standard Genetic Algorithm flow chart is shown in Figure 9.

Stationary Time Series
The stationary of one time series includes strictly stationary, wide stationary and non-stationary. In practical application, the stability of one time series can be determined by its statistical properties, such as the mean, variance and so on.
The white noise test: if the time series {X , Only when the single variable time series meets stationarity and passes the white noise test, a stationary time series model can be build. The commonly used regression models are the AR model, MA model and ARMA model.
After building the regression model, it's necessary to verify the fitting effect of the selected model, which is to test the stability of the error.
There are two types of error stationary tests-figure tests and statistic tests-for error sequences. A figure test is a test method utilizing a drawing sequence diagram and autocorrelation. There, the commonly used test methods are the ADF test, PP test and so on. The Daniel test comes to the conclusion through the Spearman rank correlation coefficient s q passing the hypothesis test:

Stationary Time Series
The stationary of one time series includes strictly stationary, wide stationary and non-stationary. In practical application, the stability of one time series can be determined by its statistical properties, such as the mean, variance and so on.
The white noise test: if the time series {X t , t ∈ T} meets: (a)∀t ∈ T, EX t = µ (16) (b)∀t, s ∈ T, r(t, s) = σ 2 , t = s 0, t = s (17) then {X t } is a white noise sequence. Only when the single variable time series meets stationarity and passes the white noise test, a stationary time series model can be build. The commonly used regression models are the AR model, MA model and ARMA model.
After building the regression model, it's necessary to verify the fitting effect of the selected model, which is to test the stability of the error.
There are two types of error stationary tests-figure tests and statistic tests-for error sequences. A figure test is a test method utilizing a drawing sequence diagram and autocorrelation. There, the commonly used test methods are the ADF test, PP test and so on. The Daniel test comes to the conclusion through the Spearman rank correlation coefficient q s passing the hypothesis test:

Problem Description and Solution
In the previous section, the established model can be expressed as: where x 1,i , x 2,i , · · · , x p,i expresses the input variables of the model, p represents the dimension, i shows the ith column of the input matrix, f (·) represents the built CS-ELM model, y i is the target output, and ε i shows the error. Moreover, when applying a neural network to build the model, it is generally suggested that variables without high correlation are applied as input. However, in the previous section, the transformed original data matrix is used directly for output-input mapping where the data are in time order and the input has high correlation. Thus, SGA is developed to filter the transformed matrixes. The screening input is used as the input of CS-ELM. Then, adjust the errors with ARMA, achieving the establishment of the hybrid forecasting model. The process of SGA variable selection method joint in the model is shown as follows: Assume that the established input-output mapping is such that: where z 1 , z 2 , · · · , z p represents the coefficient of controlling input for 0 or 1. Thus, z 1 , z 2 , · · · , z p can be encoded with the binary method, for example, E[Y|x, z = (1, 0, 1, 0)] = x 1 + x 3 and E[Y|x, z = (1, 1, 0, 0)] = x 1 + x 2 . Then, choose the corresponding row data as the input of CS-ELM to build the prediction model. The method of multi-step ahead forecast is that updating the input data by discarding the old data for each loop to perform the prediction. The multi-step ahead forecast is defined as: define the time index h as the forecast origin and the positive integer l as the forecast horizon. Suppose we are at time index h and intended to forecastr h+1 , where l ≥ 1. Letr h (l) be the forecast of r h+1 , then we definedr h (l) as the l-step ahead forecast of r t at the forecast origin h. When l = 1, we definedr h (1) as one-step ahead forecast of r t at the forecast origin h [42,43].
The following is the method for building the interval model: • Calculate the Euclidean distance between the test input vector and training input matrix after screening and arrange according to the values from small to large. Set a distance threshold.
Lump the values less than the threshold into one class and find corresponding input vectors to form a new mapping matrix. Because each vector in the new matrix has a small distance with the test input vector, vectors therein are considered as the test input vector joined the interference. At a result, corresponding test output value which is used as a possible sample space of current prediction can be obtained. We select 5% fractile and 95% fractile of the sample as the upper and lower bounds of the confidence interval. However, due to the limited historical data, it is difficult to find close vectors and construct the sample space for the test input vector. For this reason, we can consider the spacing of part points and ignore the distance of other points, which reduces the overall distance and constructs the sample space of similar sequences. Although the SGA-CSELM model proposed in this chapter removes part variables during the input process, its result is similar to that of the CS-ELM model or even better. Thus, the SGA-CSELM screening results are adopted as a new test vector to find the possible forecast sample space again and obtain a confidence interval closer to the forecasting point. Figure 10 is an instance of the prediction point. In addition, for rare points that cannot form intervals, the moving average method is applied to obtain upper and lower boundaries.

Experiment
The hybrid model proposed in this section takes SGA into consideration, reduces the input variable appropriately and achieves a more powerful forecasting result than the CS-ELM model.

SGA-CSELM Model
The following briefly introduces the process of the construction of the SGA-CSELM combination model: 1. Data processing: Build the initial input and output matrix of the CSELM model and then eliminate the null columns in the matrix to ensure the continuity of the data. Normalize the data to reduce the interference of data resources on the ELM.
Set up the SGA parameters, such as the total number of individuals N , number of iterations Niter , crossover probability 1 p , mutation probability 3 p and so on.

Experiment
The hybrid model proposed in this section takes SGA into consideration, reduces the input variable appropriately and achieves a more powerful forecasting result than the CS-ELM model.

SGA-CSELM Model
The following briefly introduces the process of the construction of the SGA-CSELM combination model:

1.
Data processing: Build the initial input and output matrix of the CSELM model and then eliminate the null columns in the matrix to ensure the continuity of the data. Normalize the data to reduce the interference of data resources on the ELM. 2.
SGA parameters setting: Set the appropriate fitness function F(•), write the number of inputs requiring filtering as n and compile the SGA individuals as: G k = (g 1 , g 2 , · · · , g i , · · · , g n ), g i = 0 or g i = 1 Set up the SGA parameters, such as the total number of individuals N, number of iterations Niter, crossover probability p 1 , mutation probability p 3 and so on. The objective function determines the search trends and results of the algorithm, so the paper employs the mean square error of the test data fitting the value 1/mse(y −ŷ) as the output value of the objective function and filters the optimal weight and threshold accordingly.

5.
Encoding: Apply the coding w 11 w 12 · · · w 1p w 21 w 22 · · · w 2p · · · w N1 · · · w N p bs 1 bs 2 · · · bs N , where w represents the weight and bs stands for the threshold value, for individuals. In the initial phase, the first N * p weights are initialized by the usage of the real random number in [−1, 1], but the rest is initialized by the number in [0, 1]. 6.
Optimization iteration: First, initialize the fitness of the nest, use the initial optimal nest individual and Lévy flights disturbance to generate new individuals and calculate the fitness value of all nest individuals after disturbance. Write the fitness of nest i as O(i) and the new nest fitness as , then replace nest i; in every generation, save the nest with small fitness, abandon the host nest with high recognition and find a new nest individual again. 7.
Carry out test data fitting and obtain the training error: Set the sliding window width H, obtain the training error of the data within the widow width H used to test data and judge whether the data are white noise. If it is white noise, skip the ARMA model building steps, or establish ARMA and carry out step (8).

8.
Identify the ARMA model: Employ the window width data to calculate the corresponding model lag intervals for endogenous to minimum BIC rule and build the ARMA model. 9.
Training model: Take the obtained optimizing weights as the input weights of ELM and build the SGA-CSELM hybrid wind speed forecasting model. 10. Error correction: Enter the new data into the SGA-CSELM model to obtain the prediction. If there is an ARMA model, then enter the error obtain the error prediction, and correct the error in the SGA-CSELM results to obtain the Hybrid Model. 11. Reconstruct model: Set the forecasting threshold of a single point. If the efficiency of the ELM exceeds the threshold continuously, then retrain the Hybrid Model.

Model Results
Apply the GAOT toolbox to build SGA, and the specific parameters are set as follows: the number of iterations is 100, the species number is 20, the crossover probability is 0.09, and the mutation probability is 0.05. It is important to note that, due to variable screening, the input node of CSELM, inputnum, varies with the change of screening data. Therefore, when building the SGA-CSELM model, we set the number of hidden layer nodes at the Apply GAOT toolbox to build SGA, and the specific parameters are set as follows: the number of iterations is 100, the species number is 20, the crossover probability is 0.09, and the mutation probability is 0.05. It is important to note that due to variable screening, the input node of CSELM, inputnum, changes with the change of screening data. Therefore, when building SGA-CSELM model, we set the number of hidden layer nodes at: To facilitate the efficiency of models quickly, we directly introduce the prediction of ELM and CS-ELM and write them as ELM-1 and CSELM-1. In addition, the ELM model of which the number of hidden layer nodes is the same as screening hiddennum is added, and its efficiency and CS-ELM's are comprehensively compared. The input of the newly added model and the original one is the same; for the purpose of distinguishing them, they are written as ELM-2 and CSELM-2. Finally, through the test in stationary and white noise for the prediction error of the SGA-CSELM model, the ARMA model is built for the non-white noise part to correct the prediction error. The other parameters of the SGA-CSELM model are set as in the previous chapter.
Model Contrast Result Table 5 shows the prediction results of various models in Data 1. The first column is the six models used in the test. Among them, the results of ELM-1 and CSELM-1 come from the second section, ELM-2 and CSELM-2 are contrast models of which the layer nodes are set as SGA-CSELM, SGA-CSELM is the model that does not join ARMA for error correction, and the last hybrid model is the one joining ARMA. Moreover, the second to fourth columns describe the performance of each model under different indicators. According to the value of the RMSE, the hybrid model's value is the smallest and the ELM-2 s is the largest. From MAPE, all models are near 22%, the ELM-2 is the biggest, CSELM-1 is the smallest one, and the rest center. The efficiency of the hybrid model is the best, and SGA-CSELM follows from MAE. Taken together, although there is a decrease in the input information of SGA-CSELM after screening the variable, the prediction result is still better than ELM and near CS-ELM. The result of the hybrid model has ascends after error correction by ARMA. In the process of forecasting, we employ the ARMA time series model and a dynamic test and predict the stationarity of the error and judge whether the error is a white noise sequence. If the conditions of the time series model are met, then build ARMA to fetch useful information for further data. Figure 11a shows the line chart drawn according to 200 points used to construct the ARMA model in Data 1. From the figure, it can be seen that the error fluctuation is larger, and the SGA-CSELM model does not fully extract the data information in the process of training. Apply the ADF test for the sequence; the result indicates that it is smooth. Through the Ljung-Box Q-test, detect the error correlation, not the white noise sequence. Therefore, the ARMA model can be constructed for the data fitting error and further extract the useful information from the error term. Figure 11b is the autocorrelation and partial autocorrelation figure of the segment data, showing that the data fits ARMA (4,6), but in factual modeling, ARMA (1,1) is chosen because, when the orders of both AR and MA are 1, the value of the corresponding criterion is minimum (shown in Table 7) through comparing the criterion value of AIC and BIC, etc.     Table 6 lists the AIC value and BIC value for fitting the ARMA model in the data segment with the sliding window width H = 200 when R = 0, 1, 2, 3 and M = 0, 1, 2, 3. By comparing the BIC rule in the scope in which R = 0, 1, 2, · · · , 6 and M = 0, 1, 2, · · · , 6, it can be found that, when R = 1, M = 1, the AIC and BIC values are the smallest. Thus, ARMA (1,1) is used to build the model, and in the next H = 200 steps, the error for the SGA-CSELM model is corrected with it.  Table 7 shows the concrete results of the 1st to 28th points in Data 1 with each model. In the table, the second column is the measurement data, and the following columns are the point prediction results of ELM-1, CSELM-1, ELM-2, CSELM-2, SGA-CSELM and the model joining error correction. The last one is the error correction of the SGA-CSELM model with ARMA (1,1). For facility, all data reserve two decimal fractions. As a result of Table 7, the prediction effect of the Hybrid Model proposed is best, followed by that of the SGA-CSELM Model.  Figure 12 shows the prediction of the hybrid model and the upper and lower bounds of the prediction interval that is composed by 5% fractile and 95% fractile in Data 1, Data 2, Data 3 and Data 4. First, our model prediction is very close to the real sequence, showing that the forecasting effect of SGA-CSELM has a certain improvement after adding the ARMA model to correct the error. Next, our forecasting range includes almost all real values, and the interval width at some points is narrow. However, there is an obvious shortage in that, when the original data increases quickly, the prediction interval cannot well contain the real value, and the effect is better in the decrease part. To see the details of the forecasting results of the Hybrid model, the point prediction performance of Data 1, Data 2, Data 3 and Data 4 is also given in Figure 12 for reference. Table 8 lists the evaluation index situation of Data 2 after screening, on the basis of predictive results. Different from Data 1, overall, CSELM-2 works best followed by CSELM-1 and SGA-CSELM effects worse than the two models with the small gap. ELM-1 and ELM-2 work the worst. Because the error sequence is white noise after using SGA-CSELM to forecast the test data in Data 2, the prediction evaluation indexes of the Hybrid Model are not shown in Table 8.
According to Table 8 and the value of the RMSE, CSELM-1 works best, followed by SGA-CSELM, and the Hybrid model result is worse than that of CSELM-2 in Data 3. From the point of view of MAPE, the result of the Hybrid model is the best, and SGA-CSELM follows. From MAE, the parameter value of SGA-CSELM is the minimum. Taken together, each indicator in SGA-CSELM performs most stably, but its result is a bit weaker after joining correction by the time sequence, explaining that the ARMA model is not entirely appropriate for the dataset. Table 8 also presents the prediction index comparison result of each model in Data 4. From the table, we can see that SGA-CSELM is the smallest in RMSE and MAE. CSELM-2 is minimum in the index MAPE, and the MAPE value of the rest of the model is larger, even more than 50%. This is because the wind speed data in Data 2 is generally small and fluctuates frequently, and any small forecast deviation will lead to a sharp fall in results. In addition, the Hybrid Model of the effect is not the best one; this may be due to the error correction for SGA-CSELM, and the selected MA model is not good for modeling of different variance components.
Energies 2018, 11, x FOR PEER REVIEW  23 of 29 index MAPE, and the MAPE value of the rest of the model is larger, even more than 50%. This is because the wind speed data in Data 2 is generally small and fluctuates frequently, and any small forecast deviation will lead to a sharp fall in results. In addition, the Hybrid Model of the effect is not the best one; this may be due to the error correction for SGA-CSELM, and the selected MA model is not good for modeling of different variance components.

Summary
Based on the standard genetic algorithm SGA, ARMA time series and CS-ELM model developed in the second section, a Hybrid model is proposed in this section. When building the CS-ELM model in the previous section, all input and output mapping was put in models to test. However, taking SGA into consideration, the section reduces the input variable appropriately and achieves a more powerful forecasting result than the CS-ELM model.
In the process of hybrid modeling, first, make up the input subscript for 0-1 by the binary code and, through the standard genetic algorithm, select a variable to set up the SGA-CSELM model. Second, conduct the stationarity and white noise test for the prediction error of the SGA-CSELM model. If the error series meets the ARMA modeling conditions, then identify the optimal model by the AIC and BIC information rules. Thus, establish the SGA-CSELM model with error correction, namely, the Hybrid Model. Through empirical analysis of the four datasets of Dangjin Mountain in Akesai, the following results are produced:

1.
Under the influence of different initial weights, the effect of the single extreme learning machine model changes greatly; nevertheless, the ELM model optimized by the cuckoo algorithm possesses both a stable result and a reliable output guarantee.

2.
Selecting input variables by the usage of SGA reduces the information of the part variables to some extent, but its prediction effect can still achieve results that are not weaker than those of CS-ELM, which all inputs test.

3.
The error of SGA-CSELM is corrected by the ARMA model, and the performance of the hybrid model after calibration is improved and declines because during the process of fixing the width of the sliding window, the error may become a white noise sequence and join the error correction, leading to poor results.

4.
With the screening result of SGA, by calculating the distance between the predicted sample and constructed historical data-mapping matrix, possible forecast points of the sample can be screed to be composed of points that are close. The upper and lower bounds of the prediction sample constructed by calculating the 5% and 95% fractile numerical values obtain a good result.

Conclusions
Because existing fossil fuels cannot meet the increasing energy demand, increased attention has been paid to wind energy, a type of clean and renewable energy. However, owing to the intermittency and random character of wind speed, it is essential to build a wind speed forecasting model with high-precision for wind power utilization [44]. Therefore, this paper contributes to the development of an improved wind speed forecasting method [45].
In the paper, the optimized CS-ELM model is applied as the main model, joining SGA and ARMA to build a hybrid model. Based on a series of empirical analyses, the performance of the optimized CS-ELM model is found to overcome the limitations of the single ELM model, that is, it is easily affected by the initial weights, fluctuates, cannot provide stable prediction results and so on. Compared with BP and the single extreme learning machine model, the model has achieved a great improvement in the prediction effect and stable prediction results. It is easy to see that after SGA screening, the model only requires data of lower dimension to produce better prediction results than the single ELM. The ARMA model is employed for error correction to further extract useful information in error prediction. In addition, based on the SGA screening results, by calculating the distance between the test input vector and mapping matrix, building a possible prediction sample space and providing the rough range of the upper and lower boundaries of the predicted values has yielded an excellent performance. Providing an increased accuracy forecasting range can prevent the dramatic fault of a single point prediction, which on a windy day, allows managers to take measures in advance to protect the fan from damage and to have a more competitive advantage to drive markets.
Author Contributions: C.G. proposed the concept of this research, J.W. provided overall guidance; R.W. has written the whole manuscript. R.W. carried on data analysis; J.L. and J.W. polished the manuscript and supported in part during data processing.

Conflicts of Interest:
The authors declare no conflict of interest. x t,k The location of the kth nest in the tth generation G k The encoding of the kth individual ∂

Abbreviations
Step length controlled quantity g i The ith gene of the individual k ⊕ dot product p k The probability of the individual G k can be saved over to the next generation x b The current optimal solution rand Random number