Research on Combined Model Based on Multi-Objective Optimization and Application in Wind Speed Forecast

Wind power is an important part of modern power systems, and its use has been increasing rapidly compared with fossil energy. However, due to the intermittence and randomness of wind speed, system operators and researchers urgently need more reliable wind-speed prediction methods. The time series of wind speed has not only linear but also nonlinear characteristics. In addition, most methods consider only one criterion or rule (stability or accuracy), or one objective function, which can lead to poor forecasting results, so wind-speed forecasting remains a difficult and challenging problem. Existing forecasting models based on combination-model theory can adapt to some time-series data and overcome the shortcomings of single models, which suffer from poor accuracy and instability. In this paper, a combined forecasting model based on data preprocessing, a nondominated sorting genetic algorithm (NSGA-III) with three objective functions, and four models (two hybrid nonlinear models and two linear models) is proposed and successfully applied to wind-speed forecasting; it not only improves forecasting accuracy but also addresses forecasting stability. The experimental results show that the stability and accuracy of the proposed combined model are better than those of the single models, improving the mean absolute percentage error (MAPE) by 0.007% to 2.31% and the standard deviation of the mean absolute percentage error (STD-MAPE) by 0.0044 to 0.3497.


Introduction
In recent years, wind energy has become the focus of managers and researchers in the energy field, due to the advantages of wind power, such as renewability and cleanliness.
In 2016, newly installed global wind-power capacity exceeded 54 GW. This capacity is distributed among 90 countries, several of which have an installed capacity of more than 10 GW, and 29 of which have an installed capacity of more than 1 GW [1]. Cumulative installed capacity increased by 12.6%, reaching 486.8 GW. In the United States, more than 53,000 wind turbines operate, providing more than 84.1 GW of capacity across 41 states [2]. On the other side of the globe, wind power may become China's second largest power source by 2050 [3].
Accurate forecasting of wind speed is an important prerequisite for using wind power for the following reasons: (1) it reduces rotating and operating costs of wind-farm equipment; (2) it helps dispatching departments adjust their plans in time; and (3) it reduces the impact on the entire power grid.

Strategy of Our Proposed Combined Model
(1) Preprocess the time-series wind-speed data from China with denoising and reconstruction.
(2) Set the training and test data according to the character of the time-series wind-speed data; the interval between every two data items is ten minutes.
(3) Select two kinds of models as the single models of the combined model. The linear models are ARIMA and HW; the nonlinear models are CS-BPNN and DE-OSELM, which were optimized by cuckoo search and differential evolution, respectively. The parameter values are shown in Table 1.
(4) Build the combined forecasting model, which is based on nonpositive constraint theory, NSGA-III (which optimizes the weights of the branch models), and the four single models mentioned above.
(5) Evaluate the forecasting results and test the combined model.

The four metrics are defined as follows:
AE, the average forecast error of N forecast results: AE = (1/N) Σ_{n=1}^{N} (y_n − ŷ_n).
MAE, the average absolute forecast error: MAE = (1/N) Σ_{n=1}^{N} |y_n − ŷ_n|.
MSE, the average of the squared prediction errors: MSE = (1/N) Σ_{n=1}^{N} (y_n − ŷ_n)².
MAPE, the average absolute percentage error: MAPE = (1/N) Σ_{n=1}^{N} |(y_n − ŷ_n)/y_n| × 100%.
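As a concrete reference, the four metrics can be computed as follows (a plain-Python sketch; the function name and toy data are ours, not the paper's):

```python
def forecast_metrics(y, y_hat):
    """Compute AE, MAE, MSE, and MAPE for actual values y and forecasts y_hat."""
    n = len(y)
    errors = [a - f for a, f in zip(y, y_hat)]
    ae = sum(errors) / n                                   # mean signed error
    mae = sum(abs(e) for e in errors) / n                  # mean absolute error
    mse = sum(e * e for e in errors) / n                   # mean squared error
    mape = sum(abs(e / a) for e, a in zip(errors, y)) / n * 100.0
    return {"AE": ae, "MAE": mae, "MSE": mse, "MAPE": mape}

m = forecast_metrics([10.0, 8.0, 12.0], [9.0, 8.5, 12.5])
```

Note that AE keeps the sign of the errors, so positive and negative deviations can cancel, which is why MAE, MSE, and MAPE are needed alongside it.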

Results
The denoising process consists of four steps: embedding, singular-value decomposition, grouping, and diagonal averaging [35]. According to the observed time-series wind-speed data and these four steps, the trajectory matrix was constructed, then decomposed and reconstructed to extract signals representing different components of the original time series, such as long-term trend signals, periodic signals, and noise signals. To obtain satisfying results, the noise signals were discarded. The detailed process is as follows: (1) Embedding: Form the trajectory matrix of the series, the L × K matrix X = [X_1 : · · · : X_K], where X_i = (x_i, · · · , x_{i+L−1})^T (1 ≤ i ≤ K) are lagged vectors of size L and K = N − L + 1. Matrix X is a Hankel matrix, meaning that X has equal elements x_{ij} on the anti-diagonals i + j = const.
(2) Singular-value decomposition: Perform the singular-value decomposition (SVD) of trajectory matrix X. Set S = XX^T, denote by λ_1, · · · , λ_L the eigenvalues of S taken in decreasing order of magnitude (λ_1 ≥ · · · ≥ λ_L ≥ 0), and by U_1, . . . , U_L the orthonormal system of eigenvectors of S corresponding to these eigenvalues.
Set d = rank X = max{i : λ_i > 0} (note that d = L for a typical real-life series) and V_i = X^T U_i / √λ_i (i = 1, . . . , d). In this notation, the SVD of trajectory matrix X can be written as X = X_1 + · · · + X_d, where X_i = √λ_i U_i V_i^T are rank-one matrices called elementary matrices. The collection (√λ_i, U_i, V_i) is the ith eigentriple (ET) of the SVD. The vectors U_i are the left singular vectors of X, the numbers √λ_i are the singular values and provide the singular spectrum of X (this gives SSA its name), and the vectors √λ_i V_i = X^T U_i are the vectors of principal components (PCs). (3) Eigentriple grouping: Partition the index set {1, . . . , d} into m disjoint subsets I_1, · · · , I_m. For I = {i_1, · · · , i_p}, the resultant matrix X_I corresponding to group I is defined as X_I = X_{i_1} + · · · + X_{i_p}. The resultant matrices are computed for all groups, and the grouped SVD expansion of X can then be written as X = X_{I_1} + · · · + X_{I_m}.
(4) Diagonal averaging: Each matrix X_{I_j} of the grouped decomposition is hankelized, and the obtained Hankel matrix is then transformed into a new series of length N using the one-to-one correspondence between Hankel matrices and time series. Diagonal averaging applied to a resultant matrix X_{I_k} produces a reconstructed series X̃^(k) = (X̃_1^(k), · · · , X̃_N^(k)). In this way, the initial series x_1, · · · , x_N is decomposed into a sum of m reconstructed subseries: x_n = Σ_{k=1}^{m} X̃_n^(k), n = 1, . . . , N. This decomposition is the main result of the SSA algorithm, and it is meaningful if each reconstructed subseries can be classified as part of a trend, a periodic component, or noise.
The pseudocode for denoising is provided in Appendix A, Algorithm 1.
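The four SSA steps above can be sketched as follows (an illustrative NumPy implementation; the window length L and the number of retained components r are assumptions here — the paper selects the signal group via its own eigentriple grouping):

```python
import numpy as np

def ssa_denoise(x, L, r):
    """SSA denoising: embed, decompose, keep r leading components, hankelize."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    K = N - L + 1
    # (1) Embedding: build the L x K Hankel trajectory matrix
    X = np.column_stack([x[i:i + L] for i in range(K)])
    # (2) SVD of the trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # (3) Grouping: keep the r leading eigentriples as the signal group
    Xr = (U[:, :r] * s[:r]) @ Vt[:r, :]
    # (4) Diagonal averaging (hankelization) back to a length-N series
    out = np.zeros(N)
    cnt = np.zeros(N)
    for i in range(L):
        for j in range(K):
            out[i + j] += Xr[i, j]
            cnt[i + j] += 1
    return out / cnt

rng = np.random.default_rng(0)
t = np.linspace(0.0, 6.0, 120)
noisy = np.sin(t) + 0.1 * rng.normal(size=120)
clean = ssa_denoise(noisy, L=30, r=2)
```

Keeping all min(L, K) components reproduces the original series exactly, which is a convenient sanity check for the hankelization step.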

Methods and Heuristic Algorithm
This section presents the linear and nonlinear models selected to build the combined system, together with the heuristic algorithms, cuckoo search and differential evolution, used to optimize BPNN and OSELM.

Nonlinear Models
Because of the character of wind-speed data, nonlinear models can achieve excellent forecasting performance. In this paper, two nonlinear models, BPNN and ELM, were selected, and their parameters were optimized by heuristic algorithms to obtain better performance.

Back-Propagation Neural Network (BPNN) Model
BPNN is a type of multilayer feed-forward neural network with a wide variety of applications. It is based on a gradient-descent method that minimizes the sum of the squared errors between actual and desired output values. Each neuron uses a sigmoid-type transfer function whose output lies between 0 and 1, so the network can realize a continuous nonlinear mapping from input to output [36].
The topology of the BPNN is as follows. The inputs are first normalized: X'_i = (X_i − X_min)/(X_max − X_min), where X_min and X_max are the minimum and maximum values of the input or output vectors, and X_i denotes the real value of each vector.
Step 1. Calculate the outputs of all hidden-layer nodes: net_j = Σ_i w_{ji} x_i + b_j, y_j = f(net_j), where net_j is the activation value of node j, w_{ji} is the connection weight from input node i to hidden node j, b_j is the bias of neuron j, y_j is the output of hidden-layer node j, and f is the activation function of a node, usually a sigmoid function.
Step 2. Calculate the output of the neural network: O_1 = f_0(Σ_j w_{0j} y_j + b_0), where w_{0j} is the connection weight from hidden node j to the output node, b_0 is the bias of the output neuron, O_1 is the output of the network, and f_0 is the activation function of the output-layer node.
Step 3. Minimize the global error via the training algorithm: E = (1/2) Σ_{k=1}^{m} (Z_k − O_k)², where Z is the real output data vector and m is the number of outputs.
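The three steps above can be sketched as a minimal one-hidden-layer network trained by gradient descent (an illustrative toy, not the paper's CS-optimized configuration; the layer size, learning rate, and the logical-AND task are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sig(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_bpnn(X, Z, hidden=8, lr=0.5, epochs=3000):
    """One-hidden-layer BPNN, online gradient descent on squared error."""
    W1 = rng.normal(0, 1, (hidden, X.shape[1]))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 1, hidden)
    b2 = 0.0
    for _ in range(epochs):
        for x, z in zip(X, Z):
            y = sig(W1 @ x + b1)            # Step 1: hidden outputs y_j
            o = sig(W2 @ y + b2)            # Step 2: network output O_1
            d_o = (o - z) * o * (1 - o)     # Step 3: output-layer delta
            d_h = d_o * W2 * y * (1 - y)    #         hidden-layer deltas
            W2 -= lr * d_o * y
            b2 -= lr * d_o
            W1 -= lr * np.outer(d_h, x)
            b1 -= lr * d_h
    return lambda x: sig(W2 @ sig(W1 @ x + b1) + b2)

# Toy usage: learn the logical AND mapping
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Z = np.array([0.0, 0.0, 0.0, 1.0])
net = train_bpnn(X, Z)
```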

CS Algorithm
CS, which was proposed by Yang and Deb [37] in 2009, is derived from the behavior of cuckoos laying their eggs in other birds' nests so that those birds hatch the eggs for them. However, once the host birds discover the cuckoo eggs, they either throw the eggs away or abandon their nests and build new ones elsewhere. The CS algorithm is constructed on three assumptions: (a) each cuckoo lays one egg at a time in a randomly selected nest; (b) the best nests carry over to the following generations; and (c) the number of available host nests is constant, and the probability that a host bird discovers a cuckoo egg is p, which ranges from 0 to 1. In CS, every nest stands for a solution. The steps of BPNN optimized by CS are shown in Figure 1.
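A minimal CS loop under the three assumptions above can be sketched as follows, minimizing a toy objective rather than the paper's BPNN training error (the step size, population size, and Mantegna's Lévy-step routine are standard choices we assume, not the paper's settings):

```python
import numpy as np
from math import gamma, sin, pi

rng = np.random.default_rng(0)

def levy_step(dim, beta=1.5):
    # Mantegna's algorithm for Levy-distributed step lengths
    sigma = (gamma(1 + beta) * sin(pi * beta / 2) /
             (gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / beta)

def cuckoo_search(f, dim=2, n_nests=15, p_a=0.25, iters=300):
    nests = rng.uniform(-5.0, 5.0, (n_nests, dim))
    fit = np.array([f(x) for x in nests])
    for _ in range(iters):
        best = nests[np.argmin(fit)]
        for i in range(n_nests):
            # (a) one egg per cuckoo: a Levy flight around the current nest
            cand = nests[i] + 0.01 * levy_step(dim) * (nests[i] - best)
            if f(cand) < fit[i]:
                nests[i], fit[i] = cand, f(cand)
        for i in range(n_nests):
            # (c) a fraction p_a of nests is discovered and rebuilt
            if rng.random() < p_a:
                cand = nests[i] + rng.uniform(-1.0, 1.0, dim)
                if f(cand) < fit[i]:
                    nests[i], fit[i] = cand, f(cand)
    # (b) the best nest carries over as the final solution
    return nests[np.argmin(fit)], float(fit.min())

best_x, best_f = cuckoo_search(lambda x: float(np.sum(x ** 2)))
```

In the CS-BPNN hybrid, each nest would instead encode a full set of BPNN weights, with the training error as the fitness f.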


OSELM
An Extreme Learning Machine (ELM) is a simple and effective single hidden-layer feed-forward neural network proposed by Huang et al. [38]. In many practical applications, however, learning is a continuous process. For this reason, Liang et al. proposed an online learning algorithm, the OSELM [39].
For N arbitrary distinct training samples (x_i, t_i), where x_i ∈ R^n and t_i = [t_{i1}, t_{i2}, · · · , t_{im}]^T ∈ R^m, an ELM with L hidden-layer nodes and activation function g(x) can approximate the N samples with zero error by arbitrarily specifying a_j and b_j, which can be expressed as follows: Σ_{j=1}^{L} β_j g(a_j · x_i + b_j) = O_i, i = 1, . . . , N, where a_j is the input weight, b_j is the threshold of the jth hidden-layer node, x_i is the input vector, O_i is the output vector, and β_j is the output weight. Formula (8) can be simplified as Hβ = T, where β_{L×m} = [β_1, β_2, · · · , β_L]^T, T_{N×m} = [T_1, T_2, · · · , T_N]^T, and H is the hidden-layer output matrix.
H(a_1, · · · , a_L; b_1, · · · , b_L; x_1, · · · , x_N) collects the outputs of the hidden-layer nodes for inputs x_1, x_2, · · · , x_N; its (i, j) entry is the output of the jth hidden node for input x_i. After the pseudoinverse matrix is introduced, the least-squares solution of the above linear system is β̂ = H†T. Based on a recursive least-squares algorithm, the algorithm flow of OSELM can be described as follows: (1) Initialization stage: Given activation function g and L hidden-layer nodes, a small training set is used to initialize the network and obtain the initial output weights β^(0). Take k = 0, where k is the index of the data segment sent to the network.
(2) Online learning stage: Given the (k + 1)th data segment, calculate the output weights β^(k+1), set k = k + 1, and return to the online learning stage, constantly updating the output weights until data learning is completed.
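The batch (initialization-stage) solution β̂ = H†T can be sketched as follows; the recursive online update of OSELM is omitted, and the hidden-layer size and random-weight scales are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def elm_fit(X, T, L=20):
    """Batch ELM: random hidden layer, output weights by least squares."""
    A = rng.normal(0, 4, (L, X.shape[1]))   # input weights a_j (assumed scale)
    b = rng.uniform(-1, 1, L)               # hidden thresholds b_j
    H = 1.0 / (1.0 + np.exp(-(X @ A.T + b)))      # hidden-layer output matrix H
    beta, *_ = np.linalg.lstsq(H, T, rcond=None)  # least-squares beta = pinv(H) T
    return lambda Xn: (1.0 / (1.0 + np.exp(-(Xn @ A.T + b)))) @ beta

# Toy usage: fit one period of a sine wave
X = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
T = np.sin(2.0 * np.pi * X[:, 0])
model = elm_fit(X, T)
pred = model(X)
```

Because only β is trained, fitting reduces to a single linear solve, which is what makes the subsequent recursive (online) update of OSELM cheap.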

Differential Evolution
The differential-evolution algorithm is a parallel, direct, random-search algorithm based on group evolution [40]. First, the population is randomly initialized in the feasible solution space of the problem. The population is represented as an NP × D matrix of parameters x_{hl}, where NP is the population size, D is the number of decision variables, h = 1, 2, . . . , NP, and l = 1, 2, . . . , D.
Two different individual vectors are randomly selected and subtracted to generate a difference vector. The difference vector is weighted and added to a third randomly selected individual vector to generate a mutant vector; this operation is called mutation. Formula (11) applies the mutation operation to each individual x_h^t at generation t to produce the mutant individual v_h^{t+1} = x_{r1}^t + F(x_{r2}^t − x_{r3}^t), where r_1, r_2, r_3 ∈ {1, 2, . . . , NP} are mutually different and different from h, and F is the scaling factor. The mutant vector is then mixed with the target vector to generate a trial vector; this process is called crossover. Formula (12) cross-operates x_h^t with mutant individual v_h^{t+1} to generate the trial individual u_h^{t+1}: u_{hl}^{t+1} = v_{hl}^{t+1} if rand(l) ≤ CR or l = rnbr(h), and u_{hl}^{t+1} = x_{hl}^t otherwise, where rand(l) is a uniformly distributed random number in [0, 1], CR is the crossover probability in [0, 1], and rnbr(h) is a random index in {1, 2, . . . , D}.
If the fitness of the trial vector is better than that of the target vector, the trial vector replaces the target vector in the next generation; this operation is called selection. The fitness values J(u_h^{t+1}) and J(x_h^t) are compared using Formula (6). For a minimization problem, the individual with the lower fitness value is selected as the individual x_h^{t+1} of the new population, where J is the fitness function. Figure 2 shows the flow chart of the hybrid OSELM.
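The mutation, crossover, and selection steps above correspond to the classic DE/rand/1/bin scheme, sketched here on a toy objective (the population size and the F and CR values are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(3)

def de_minimize(f, D=3, NP=20, F=0.5, CR=0.9, gens=200):
    pop = rng.uniform(-5.0, 5.0, (NP, D))
    fit = np.array([f(x) for x in pop])
    for _ in range(gens):
        for h in range(NP):
            # Mutation (Formula (11)): v = x_r1 + F * (x_r2 - x_r3)
            r1, r2, r3 = rng.choice([i for i in range(NP) if i != h], 3, replace=False)
            v = pop[r1] + F * (pop[r2] - pop[r3])
            # Crossover (Formula (12)): binomial, one guaranteed mutant gene rnbr(h)
            mask = rng.random(D) < CR
            mask[rng.integers(D)] = True
            u = np.where(mask, v, pop[h])
            # Selection: keep the better of trial u and target x_h
            fu = f(u)
            if fu < fit[h]:
                pop[h], fit[h] = u, fu
    return pop[np.argmin(fit)], float(fit.min())

best_x, best_f = de_minimize(lambda x: float(np.sum(x ** 2)))
```

In the DE-OSELM hybrid, each individual would encode the OSELM hyperparameters instead, with the training error as the fitness f.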

Linear Models
Although wind-speed data are usually nonlinear, according to our test in Section 6.1, the time-series wind-speed data also have a linear character, so using linear models in the combined system is correct and appropriate. In this section, two linear models are briefly introduced.

ARIMA
The ARIMA model is one of the most popular forecasting models [41]. It can be expressed as follows: y_t = φ_1 y_{t−1} + · · · + φ_p y_{t−p} + ε_t − θ_1 ε_{t−1} − · · · − θ_q ε_{t−q}, where y_i (i = 1, 2, . . . , t) is the actual value, ε_i (i = 1, 2, . . . , t) is the random error at time i, φ_i and θ_i represent the coefficients, and p and q are integers often referred to as the orders of the autoregressive and moving-average polynomials, respectively.

Holt-Winters (HW)
The output of the HW method is written as F_{t+m}, an estimate of the value of x at time t + m, m > 0, based on the raw data up to time t; suppose we have a sequence of observations {x_t}, beginning at time t = 0, with a cycle of seasonal change of length L. The forecast and recursive updating equations are the following: F_{t+m} = s_t + m b_t + c_{t−L+1+((m−1) mod L)}, with s_t = α(x_t − c_{t−L}) + (1 − α)(s_{t−1} + b_{t−1}), b_t = β(s_t − s_{t−1}) + (1 − β) b_{t−1}, and c_t = γ(x_t − s_t) + (1 − γ) c_{t−L}, where α is the data-smoothing factor, β is the trend-smoothing factor, γ is the seasonal change-smoothing factor, and these three parameters lie between 0 and 1. {s_t} and {b_t} represent the smoothed value of the constant part for time t and the sequence of best estimates of the linear trend that are superimposed on the seasonal changes, respectively. {c_t} is the sequence of seasonal-correction factors [42].
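The additive form of these recursions can be sketched as follows (the initialization of level, trend, and seasonal factors is a simple illustrative choice; the paper does not specify its scheme here):

```python
def holt_winters_forecast(x, L, alpha=0.5, beta=0.3, gamma=0.3, m=1):
    """Additive Holt-Winters: level s, trend b, seasonal c with period L."""
    s = x[0]
    b = (x[L] - x[0]) / L
    c = [x[i] - s for i in range(L)]   # crude initial seasonal factors
    for t in range(L, len(x)):
        s_prev = s
        s = alpha * (x[t] - c[t % L]) + (1 - alpha) * (s + b)    # level
        b = beta * (s - s_prev) + (1 - beta) * b                 # trend
        c[t % L] = gamma * (x[t] - s) + (1 - gamma) * c[t % L]   # season
    return s + m * b + c[(len(x) + m - 1) % L]   # F_{t+m}

# Toy usage: trend of 1 per step plus a period-4 seasonal pattern
series = [t + [0.0, 2.0, 4.0, 1.0][t % 4] for t in range(40)]
f1 = holt_winters_forecast(series, L=4, m=1)
```

On this model-consistent toy series the next true value is 40, and the recursion converges close to it after the transient from the crude initialization decays.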


Our Proposed Combined Model
The combined model theory, MOPs, our proposed combined system with three objective functions, and the compared combined model with two objective functions are presented in this section. The flowchart is shown in Figure 3.


Combined-Model Theory
The forecasting model based on combination, initiated by Bates and Granger [32], has long been considered an improvement over individual models, and also an efficient and simple way to improve forecasting accuracy and stability. A new combined system that consolidates several neural networks, NSGA-III, and nonpositive constraint theory [34] is proposed in this paper.
Definition 1. Let x̂_{j,t} denote the unbiased out-of-sample forecast for x_t obtained by the jth individual model. Then, the combined output at time t of the combining method has the following weighted-average form [43,44]: x̂_t = Σ_{j=1}^{k} w_j x̂_{j,t}, where w_j is the weight assigned to the jth model.
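Definition 1 amounts to the following (treating the weights as summing to one while being allowed to go nonpositive is our reading of the nonpositive-constraint setup; the toy numbers are ours):

```python
def combine_forecasts(forecasts, weights):
    """Weighted-average combined forecast over k branch-model forecasts."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    T = len(forecasts[0])
    return [sum(w * f[t] for w, f in zip(weights, forecasts)) for t in range(T)]

# Toy usage: four branch forecasts, one weighted negatively
branch = [[5.0, 5.2], [5.1, 5.3], [4.9, 5.1], [5.6, 5.8]]
combined = combine_forecasts(branch, [0.4, 0.4, 0.3, -0.1])
```

In the proposed system, it is NSGA-III that searches for the weight vector, trading off the three objective functions of Section 5.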

Multiobjective Optimization Problem
Generally, MOPs can be classified into two groups: constrained and unconstrained problems. A constrained problem with J inequality and K equality constraints can be formulated as: minimize F(x) = (f_1(x), f_2(x), · · · , f_M(x))^T, subject to g_j(x) ≥ 0, j = 1, . . . , J; h_k(x) = 0, k = 1, . . . , K; x_i^L ≤ x_i ≤ x_i^U, i = 1, . . . , n, where M is the number of objectives (M ≥ 2) and x = (x_1, x_2, · · · , x_n)^T is the decision vector, with n the number of decision variables. In MOP (1), x_i^L and x_i^U are the lower and upper bounds of decision variable x_i, respectively.
When the inequality and equality constraints in MOP (1) are omitted, the unconstrained problem (with box constraints only) is obtained. To solve the MOP, NSGA-III is presented in the next section.

Introduction of Objective Functions and NSGA-III
Since wind-speed prediction is not a single-objective problem, multiple objectives must be considered to obtain both accuracy and stability. In this part, we present a multiobjective optimization algorithm that optimizes the weights of the four models under three proposed objective functions, so as to consider this problem comprehensively.

Objective Functions
In this paper, we chose three functions to be the objective functions.
(1) The Theil inequality coefficient (TIC) can be written as follows: TIC = √((1/N) Σ_{n=1}^{N} (y_n − ŷ_n)²) / (√((1/N) Σ_{n=1}^{N} y_n²) + √((1/N) Σ_{n=1}^{N} ŷ_n²)). TIC always lies between 0 and 1, and a lower value indicates better performance.
(2) Root mean squared error (RMSE): RMSE(Ŷ, Y) = √((1/N) Σ_{n=1}^{N} (y_n − ŷ_n)²). A smaller value of RMSE(Ŷ, Y) indicates more accurate forecasting performance of the system. (3) MAPE can be represented as MAPE(Ŷ, Y) = (1/N) Σ_{n=1}^{N} |(y_n − ŷ_n)/y_n| × 100%. The lower MAPE(Ŷ, Y) is, the more accurate the system. Thus, in this combined system, the fitness functions for the accuracy and stability objectives are defined by these three measures.
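Written out explicitly, the three objective functions are (a plain-Python sketch with toy data of our own):

```python
from math import sqrt

def rmse(y_hat, y):
    return sqrt(sum((a - f) ** 2 for a, f in zip(y, y_hat)) / len(y))

def mape(y_hat, y):
    return sum(abs((a - f) / a) for a, f in zip(y, y_hat)) / len(y) * 100.0

def tic(y_hat, y):
    n = len(y)
    denom = sqrt(sum(a * a for a in y) / n) + sqrt(sum(f * f for f in y_hat) / n)
    return rmse(y_hat, y) / denom   # lies in [0, 1]; lower is better

y = [10.0, 8.0, 12.0]       # actual values y_n
y_hat = [9.0, 8.5, 12.5]    # forecasts
```

Each function maps a (forecast, actual) pair to a scalar to be minimized, so the triple (TIC, RMSE, MAPE) forms the objective vector handed to NSGA-III.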

NSGA-III
NSGA-III [45] initializes a random population of size N and a set of widely distributed, prespecified M-dimensional reference points H on a unit hyperplane, placed so that it intersects each objective axis at one and has a normal vector of ones, covering the entire R^M_+ region. A more detailed description of NSGA-III is given in Appendix A.

Compared Combined Models with Two Objective Functions
Three two-objective-function models were chosen for comparison with our combined system: (1) a combined model with objective functions (21) and (22); (2) a combined model with objective functions (21) and (23); and (3) a combined model with objective functions (22) and (23).

Experiments
Some performance metrics must be described to comprehensively understand the model characteristics. The Diebold-Mariano test and forecasting effectiveness were used in this study. Several experiments are given in this section to test the data and our proposed combined system.

The Performance Metric
Four metrics, namely AE, MAE, MSE, and MAPE, are shown in Table 1 [46]; they focus on predictive accuracy and on evaluating the forecasting performance of two or more time-series models.
The Diebold-Mariano (DM) Test
Let the actual values be {y_t; t = 1, . . . , n + m} and consider two forecasts, ŷ^(1) and ŷ^(2). The forecast errors from the two models are ε^(i)_{n+h} = y_{n+h} − ŷ^(i)_{n+h}, i = 1, 2, h = 1, . . . , m. A suitable loss function L(ε^(i)_{n+h}), i = 1, 2, is applied to measure the accuracy of each forecast. Square-error loss and absolute-deviation loss are the most popular loss functions.
Square-error loss: L(ε^(i)_{n+h}) = (ε^(i)_{n+h})². Absolute-deviation loss: L(ε^(i)_{n+h}) = |ε^(i)_{n+h}|. The DM test statistic evaluates the forecasts according to an arbitrary loss function L(·): DM = d̄ / √(S²/m), where d_h = L(ε^(1)_{n+h}) − L(ε^(2)_{n+h}), d̄ is the sample mean of d_h, and S² is an estimator of the variance of d_h. The null hypothesis is H_0: E[d_h] = 0, versus the alternative hypothesis H_1: E[d_h] ≠ 0. The null hypothesis means that the two forecasts have the same accuracy; the alternative hypothesis means that they have different levels of accuracy. Under the null hypothesis, the DM statistic is asymptotically N(0, 1) distributed. The null hypothesis of no difference is rejected if the computed DM statistic falls outside the range from −z_{α/2} to z_{α/2}, that is, if |DM| > z_{α/2}, where z_{α/2} is the upper (positive) z-value from the standard normal table corresponding to half of the desired α level of the test.
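A minimal sketch of the DM statistic with squared-error loss follows (the variance estimator here ignores autocorrelation of the loss differential, a simplification appropriate for one-step-ahead forecasts; the toy error series are ours):

```python
from math import sqrt

def dm_statistic(e1, e2):
    """DM statistic comparing two forecast-error series, squared-error loss."""
    d = [a * a - b * b for a, b in zip(e1, e2)]       # loss differential d_h
    m = len(d)
    d_bar = sum(d) / m
    s2 = sum((x - d_bar) ** 2 for x in d) / (m - 1)   # variance estimate of d_h
    return d_bar / sqrt(s2 / m)                       # ~ N(0, 1) under H0

# Toy usage: model 2's errors are systematically larger
e1 = [0.1, -0.2, 0.15, -0.1, 0.05, 0.2, -0.15, 0.1]
e2 = [0.5, -0.6, 0.55, -0.5, 0.45, 0.6, -0.55, 0.5]
dm = dm_statistic(e1, e2)
# |dm| > z_{alpha/2} (1.96 at the 5% level) rejects equal accuracy
```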

Forecasting Effectiveness
Forecasting effectiveness measures forecasting accuracy through the square sum of the forecasting errors together with their mean and mean squared deviation. In some practical cases, it is necessary to further consider the kurtosis and skewness of the forecasting-accuracy distribution. On this basis, the general discrete form of forecasting effectiveness is given in this section [47].

Definition 4. Let ε_n denote the relative forecasting error at time n.
Definition 5. A_n = 1 − |ε_n| is called the forecasting accuracy at time n.
Definition 6. m_k = Σ_{n=1}^{N} Q_n A_n^k is the forecasting-accuracy effectiveness unit, where k is a positive integer, {Q_n, n = 1, 2, . . . , N} is the discrete probability distribution at time n, and Σ_{n=1}^{N} Q_n = 1, Q_n > 0. In particular, if prior information about the discrete probability distribution is unknown, we define Q_n = 1/N, n = 1, 2, . . . , N.
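Definitions 5 and 6 with uniform weights Q_n = 1/N can be computed as follows (the function name and toy data are ours):

```python
def effectiveness_unit(y, y_hat, k=1):
    """k-th forecasting-accuracy effectiveness unit m_k with Q_n = 1/N."""
    N = len(y)
    A = [1.0 - abs((a - f) / a) for a, f in zip(y, y_hat)]   # A_n = 1 - |e_n|
    return sum((1.0 / N) * an ** k for an in A)              # m_k

y = [10.0, 8.0, 12.0]
y_hat = [9.0, 8.5, 12.5]
m1 = effectiveness_unit(y, y_hat, k=1)   # first-order effectiveness
m2 = effectiveness_unit(y, y_hat, k=2)
```

Higher-order units (k = 2, 3, . . .) weight the accuracy distribution's spread, skewness, and kurtosis, which is what the general discrete form is designed to capture.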
Experiment I: Testing the Linear and Nonlinear Character of the Wind-Speed Data
Only by understanding the characteristics of the research data can we select an appropriate model and achieve good results in future forecasting work. In general, a linear model fits linear data better, just as a nonlinear model fits nonlinear data better, and a series may be not only linear or nonlinear but both. Therefore, it is necessary to judge the linear and nonlinear character of the data used in this paper, and we carried out the following experiment: three test functions, a linear function and nonlinear functions, were fitted to the data. From the results of Tables 2 and 3, the wind-speed data are both linear and nonlinear by hypothesis test. Therefore, the linear and nonlinear models considered in our proposed forecasting model are correct and necessary. Table 2. Testing wind-speed data by fitting linear or nonlinear functions.

The goodness-of-fit statistics reported are the R-squared, the adjusted R-squared, the number of observations (the number of rows without any NaN values), and the error degrees of freedom, n − p, where n is the number of observations and p is the number of coefficients in the model, including the intercept.

Experiment II: Models Tested with Wind-Speed Data From Site 1
In order to evaluate the accuracy and stability of the forecasting system, we used the average results of 100 trials, divided into two parts: accuracy and stability. In the accuracy part, we compare the AE, MAE, MSE, and MAPE values of the single models and the combined model (as shown in Table 4). (a) The results in Table 4 show the following: (1) CS-BPNN reached the best results on Tuesday, Wednesday, and Thursday compared to the other branch models, with MAPE values of 3.726%, 4.081%, and 5.173%, respectively.
(2) On Monday and Sunday, DE-OSELM obtained the lowest MAPE compared with other branch models.
(3) HW achieved the most accurate forecasting value of all branch models on Friday and Saturday.
(4) Although ARIMA achieved acceptable forecasting precision, its forecasting performance was worse than that of the nonlinear models and HW. (5) Our combined model significantly improved forecasting accuracy, with a lower MAPE than all branch models; the MAPE values from Monday to Sunday are 5.691%, 3.698%, 4.078%, 5.166%, 5.692%, 6.443%, and 5.082%, respectively.
(b) The results of Figure 4 show the following: (1) Part A shows the MAE, MSE, and MAPE of the five models; although our combined model did not achieve the lowest MAE every day, the lowest MSE and MAPE were obtained by our proposed model.
(2) Part B shows the forecasting results of CS-BPNN, DE-OSELM, ARIMA, HW, and the combined model.
(3) Part C also shows the 95% confidence intervals (CIs) obtained by each model; the figure indicates that the upper and lower CIs were close among the four branch models, but, for the linear models, more points fell within the confidence interval. As Part C shows, the errors of the combined model were very small, and our combined model also achieved a small CI.
Remark 1. From Table 4 and Figure 4, the results indicate that our proposed model performed better than the branch models. In brief, SSA, which denoises the time-series wind-speed data as a preprocessing method, and the combined model, which takes advantage of the branch models, together improve forecasting accuracy.


Experiment III: The Performance of Branch Models at Each Time Point
In this experiment, four models, two nonlinear hybrid models (CS-BPNN and DE-OSELM) and two linear models (ARIMA and HW), were tested for performance at each time point. In this part, the wind-speed data of Tuesday from Site 2 were used in our experiments.
(a) All Tuesday results in Figure 5 show the following: (1) Part A shows the MAPE values of the four branch models. From the figure, we can see that ARIMA performed the worst, but the MAPE values of the four models were approximately equal. (2) From Figure 5, Part B, we can see that the MSE and MAE of the four models were not high, and the values are very close among the four models.
(3) Figure 5, Part C also shows the 95% CIs obtained by the four branch models; it indicates that the upper and lower CIs were close between the two nonlinear models, while, among the linear models, the CI of ARIMA was smaller.
(4) The results reveal that no model can reach the best results at every time point.

Remark 2.
There is no one model that can reach the best results at every time point; each model has advantages and disadvantages. A combined model can aggregate the forecasting models to overcome this dilemma. This experiment provides a reason to apply combined-model theory to forecast wind speed.

Experiment IV: Stability Comparison with Branch Models
In Experiments II and III, we tested the time-point performance and accuracy of the branch models. In this experiment, we evaluated the stability of the proposed combined model by comparing the standard deviation of the MAPE (STD-MAPE) values (as shown in Table 6). Because the linear mathematical models contain no randomness, only the CS-BPNN and DE-OSELM branch models were tested for stability. Here, we used the wind-speed data from Sites 1 and 2 and report the MAPE and STD-MAPE results from 100 experiments. The results in Table 6 show the following: (1) CS-BPNN performs better than DE-OSELM in accuracy on most test days.
(2) In terms of stability, CS-BPNN obtained lower STD-MAPE than DE-OSELM in Site 1, but in Site 2, DE-OSELM was better than CS-BPNN.
(3) Our proposed combined model reached the best accuracy and stability values compared to the two other models.
(4) The lowest STD-MAPE value of CS-BPNN was achieved on Wednesday at Site 2; at the same time, the STD-MAPE values of DE-OSELM and our combined model were 0.2430 and 0.1846, respectively.

Remark 3. From the results shown in Table 6, our combined model obtained the lowest MAPE and STD-MAPE, which means that our combined model achieves high accuracy and strong stability compared with the single models.

Experiment V: Comparison with Three Combined Models (Two Objective Functions)
We proposed three other combined models with two objective functions in Section 5.2. In this experiment, we tested the accuracy and stability between our proposed system and the other three combined models. Here, wind-speed data from Site 3 were used in our experiments, and all of the models in this part were run 100 times.
(a) The results in Tables 7 and 8 show the following: (1) Table 7 shows the average AE, MAE, and MSE; it can be seen that our proposed combined model performed better in accuracy.
(2) The average results of the three two-objective-function models were close to the combined system (three objective functions). Especially on Friday, the difference between the combined model (three objective functions) and the combined models (two objective functions) was only 0.65%.
(3) As shown in Table 8, after running the models for 100 times, the standard-deviation MAPE and the MAPE range of the two-objective-function models was much higher and larger than the combined model. This shows that the stability of our combined model's performance was much better than the combined models with the two objective functions.
(b). Figure 6 shows the following; (1) Part A shows that, with regard to the minimum, maximum, and average results of MAPE from 100 experiments, our proposed combined model with three objective functions reached the lowest values compared with the other combined models on each day.
(2) Part C presents that the MAE and MSE of our proposed combined model with three objective functions were also lower than the other models.

Remark 4.
The results show that the performance of the combined model with two objective functions was close to that of the combined model with three objective functions when averaging the results. However, the single results of the combined model with three objective functions can be given a high degree of trust, as accuracy and stability were successfully enhanced by our combined model based on NSGA-III in the forecasting problems.

6.7. Experiment VI: Forecasting Results Test
To evaluate the forecasting results of these models, two important evaluation metrics, the DM test and forecasting availability, were applied in this part. We discuss the results from Site 3.
(1) The results of the DM test are shown in Table 9, which indicate that the combined system was significantly different from the other models across the different datasets.
(2) As shown in Table 10, the first-order and second-order forecasting availabilities of the proposed combined system were better than those of the other models for the seven datasets from the three regions in wind-speed forecasting. For example, on the Monday dataset, the first-order forecasting availabilities offered by the forecasting models were 0.9110, 0.9143, 0.8432, 0.8851, 0.9284, 0.9291, 0.9249, and 0.9456, respectively.
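The DM test statistic used above can be sketched as follows (a simplified sketch assuming one-step-ahead forecasts and a squared-error loss; the full test corrects the variance with autocovariance terms for multi-step horizons, and the sample data here are hypothetical):

```python
import math

def dm_test(actual, forecast_a, forecast_b, power=2):
    """Diebold-Mariano statistic for comparing two forecasts.

    Loss differential d_t = |e_a,t|^power - |e_b,t|^power; under the null
    of equal forecast accuracy, the statistic is asymptotically N(0, 1).
    """
    d = [abs(a - fa) ** power - abs(a - fb) ** power
         for a, fa, fb in zip(actual, forecast_a, forecast_b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / n
    return mean_d / math.sqrt(var_d / n)

# Negative values favor forecast_a (smaller loss); positive favor forecast_b.
stat = dm_test([1.0, 2.0, 3.0], [1.1, 1.9, 3.2], [1.3, 2.2, 2.8])
```

A |statistic| beyond the chosen normal critical value (e.g., 1.96 at the 5% level) indicates a significant accuracy difference between the two models.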

Conclusions
Effective, accurate, and reliable forecasting of wind speed is a crucial component of wind-farm management, and also a significant part of the economic development of a nation. However, previous studies only focused on one criterion, either accuracy or stability; for this reason, former models cannot achieve satisfying results with both high accuracy and strong stability. Wind-speed forecasting is also a multicriteria problem: considering only one criterion (accuracy or stability) cannot produce satisfying results. Thus, it is difficult to select one objective function from multiple objective functions, and arduous to ensure that the selected function can obtain a low MAPE and strong stability. In this paper, a combined forecasting model based on data preprocessing, MOP theory (which can simultaneously improve the accuracy and stability of wind-speed forecasting), and four models (two hybrid nonlinear models and two linear models) is proposed and was successfully applied to wind-speed forecasting, not only overcoming the issue of forecasting accuracy, but also solving the difficulty of forecasting stability. As the results show, our proposed combined model with three objective functions achieved lower MAPE and STDMAPE than the other models: it improved the MAPE by amounts ranging from 0.007% to 2.31%, and the STDMAPE by amounts ranging from 0.0044 to 0.3497. Moreover, according to our study, the combined model could be used in large wind farms to forecast wind speed, evaluate wind-energy resources, reduce operation costs, and save wind energy.
The Introduction of NSGA-III
NSGA-III performs the following operations at a generation t. First, the whole population P_t is classified into different nondomination levels, in the same way as in NSGA-II, following the principle of nondominated sorting. Then, P_t creates an offspring population Q_t by way of the usual recombination and mutation operators. For every reference point, only one individual of the population is expected to be found, so there is no need for any additional selection operation in NSGA-III. Next, a new population R_t is formed by combining P_t and Q_t (R_t = P_t ∪ Q_t). Subsequently, points starting from the first nondominated front are selected for P_{t+1}, one front at a time, until a complete front can no longer be fully included. We denote this final front as F_L. In general, only some solutions from F_L need to be selected for P_{t+1}, using a niche-preserving operator, which is described next. First, each population member of P_{t+1} and F_L is normalized using the current population spread, so that all objective vectors and reference points have commensurate values. Subsequently, each member of P_{t+1} and F_L is associated with a specific reference point using the shortest perpendicular distance d() of each population member to a reference line, created by joining the origin with a supplied reference point. Then, a careful niching strategy is employed to choose those F_L members that are associated with the least-represented reference points in P_{t+1}. The niching strategy emphasizes selecting one population member for as many supplied reference points as possible; a population member associated with an under-represented or unrepresented reference point is immediately preferred.
With continuous stress for emphasizing nondominated individuals, the whole process is then expected to find one population member corresponding to each supplied reference point close to the Pareto-optimal front, provided that the genetic variation operators (recombination and mutation) are capable of producing respective solutions. The use of well-spread reference points ensures a well-distributed set of trade-off points at the end.
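The generation step described above can be sketched in Python (a simplified illustration: the nondominated sort and reference-line association follow the description, while the niching step is condensed into a single greedy pass over the last front rather than the full iterative count-updating procedure):

```python
import numpy as np

def nondominated_sort(F):
    """Return fronts (lists of indices) for an (n x m) objective matrix F."""
    n = len(F)
    dominated_by = [set() for _ in range(n)]      # j's that i dominates
    dom_count = np.zeros(n, dtype=int)            # how many dominate i
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if np.all(F[i] <= F[j]) and np.any(F[i] < F[j]):
                dominated_by[i].add(j)
            elif np.all(F[j] <= F[i]) and np.any(F[j] < F[i]):
                dom_count[i] += 1
    fronts, current = [], [i for i in range(n) if dom_count[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated_by[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts

def associate(F, refs):
    """Associate each normalized objective vector with its nearest reference line."""
    span = F.max(axis=0) - F.min(axis=0)
    Fn = (F - F.min(axis=0)) / (span + 1e-12)     # normalize by population spread
    R = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    proj = Fn @ R.T                               # scalar projection onto each line
    d = np.linalg.norm(Fn[:, None, :] - proj[:, :, None] * R[None], axis=2)
    return d.argmin(axis=1)                       # index of nearest reference line

def select(F, refs, k):
    """Pick k survivors: accept whole fronts, then niche on the last front."""
    fronts = nondominated_sort(F)
    chosen = []
    for front in fronts:
        if len(chosen) + len(front) <= k:
            chosen += front
        else:
            assoc = associate(F, refs)
            niche = {r: sum(1 for i in chosen if assoc[i] == r)
                     for r in range(len(refs))}
            # prefer members attached to the least-represented reference points
            last = sorted(front, key=lambda i: niche[assoc[i]])
            chosen += last[:k - len(chosen)]
            break
    return chosen
```

With well-spread reference points, this selection keeps one member near each reference line, which is what yields the well-distributed Pareto front described above.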
The pseudocode for the NSGA-III is provided in Algorithm 2.

Parameters:
N-the length of a time series. X-the trajectory matrix. L-the window length.
K-the number of lagged vectors. L*-the minimum between L and K.
K*-the maximum between L and K. M-the number of repetitions of each trial.

Input: Y = {y_1, y_2, …, y_N}, a sequence of wind-speed time-series data.
Output: Ŷ = {ŷ_1, ŷ_2, …, ŷ_N}, the corresponding sequence of forecast wind-speed data.

Fitness function:
Minimize = (the three objective functions of the combined model)
NPop-the size of the population.
t-the current iteration number.
W_i-the weight of the ith single model.
Itmax-the maximum number of iterations.
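Given the weights W_i found by NSGA-III, the combined forecast is a weighted sum of the branch-model outputs; a minimal sketch (the branch forecasts and weights below are hypothetical, chosen only for illustration):

```python
import numpy as np

def combined_forecast(branch_preds, weights):
    """Weighted combination of branch-model forecasts.

    branch_preds: (n_models, n_steps) forecasts from the single models.
    weights: the W_i of one Pareto-optimal solution; normalized to sum to 1.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # enforce the sum-to-one constraint
    return w @ np.asarray(branch_preds, dtype=float)

# Hypothetical branch outputs for three time steps.
preds = [[5.0, 6.0, 4.0],   # hybrid nonlinear model 1 (e.g., CS-BPNN)
         [5.2, 5.8, 4.2],   # hybrid nonlinear model 2 (e.g., DE-OSELM)
         [4.9, 6.1, 3.9],   # linear model 1
         [5.1, 6.0, 4.1]]   # linear model 2
print(combined_forecast(preds, [0.4, 0.2, 0.2, 0.2]))
```

NSGA-III searches over such weight vectors, evaluating each against the three objective functions and returning a Pareto set from which the final weights are taken.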