One-Day-Ahead Hourly Wind Power Forecasting Using Optimized Ensemble Prediction Methods

This paper proposes an optimal ensemble method for one-day-ahead hourly wind power forecasting. The ensemble forecasting method is the most common method of meteorological forecasting: several different forecasting models are combined to increase forecasting accuracy. The proposed optimal ensemble method has three stages. The first stage uses the k-means method to classify wind power generation data into five distinct categories. In the second stage, five single prediction models, including a K-nearest neighbors (KNN) model, a recurrent neural network (RNN) model, a long short-term memory (LSTM) model, a support vector regression (SVR) model, and a random forest regression (RFR) model, are applied to the five categories of wind power data to generate a preliminary forecast. The final stage uses an optimal ensemble forecasting method for one-day-ahead hourly forecasting. This stage uses swarm-based intelligence (SBI) algorithms, including the particle swarm optimization (PSO), the salp swarm algorithm (SSA) and the whale optimization algorithm (WOA), to optimize the weight distribution for each single model. The final predicted value is the weighted sum of the outputs of the individual models. The proposed method is applied to a 3.6 MW wind power generation system that is located in Changhua, Taiwan. The results show that the proposed optimal ensemble model gives more accurate forecasts than the single prediction models. Compared with other ensemble methods, such as the least absolute shrinkage and selection operator (LASSO) and ridge regression methods, the proposed SBI algorithm also gives more accurate predictions.


Introduction
Renewable energy will account for 20% of the total energy that is generated by 2025 in Taiwan. The target for wind turbine power capacity is 4.2 GW. The intermittent nature of the delivery of renewable energy will have a significant impact on the power system. A novel coordinated control approach is used to offer high-quality voltages and allow optimal power transfer for a grid [1]. For an offshore wind farm that connects to the grid, the weak feeder and high harmonic characteristics have an impact on the safe operation of the system. The key technologies of transient protection for offshore wind farm transmission lines are reviewed in [2]. A study on the monitoring, operation, and maintenance of offshore wind farms is proposed to reduce the operation and maintenance costs and improve the stability of the power generation system [3].
Accurate wind power forecasting allows reliable power management and ensures an appropriate backup capacity, which reduces the cost of penetration and operation of wind power facilities. However, the variability and irregularity of wind mean that forecasts are uncertain, and this affects power system management decisions. The accuracy of wind power forecasting must be increased to ensure a reliable supply of power to the grid. The time horizon of one-day-ahead hourly forecasting of wind power is used for power management, day-ahead demand response, load dispatch planning, and ancillary services, such as the frequency regulation reserve, the fast response reserve and the real-time spinning reserve [4]. Accurate one-day-ahead hourly wind power forecasting allows a rational power supply reserve, which reduces operating costs. Many studies propose methods for wind power forecasting. Indirect forecasting and direct forecasting are the two major categories. Indirect forecasting predicts the future wind speed based on historical wind speed and meteorological data, using methods that include the hidden Markov model [5], the variational recurrent autoencoder [6], machine learning regression [7], the dynamic integration method [8], spectrum analysis [9], the hybrid machine learning model [10], the stochastic method [11] and the variable support segment method [12]. A power curve or a machine learning method that represents the nonlinear relationship between wind speed and the corresponding wind power is then used to establish a prediction model. In this study, an indirect method is used for wind power forecasting [13].
Direct forecasting uses a physical method, a statistical method, a learning machine method, a hybrid method or an ensemble method to establish a forecasting model based on historical wind power and meteorological data. The methods for direct forecasting include a gradient-boosting machine (GBM) algorithm [14], a Bayesian optimization-based machine learning algorithm [15], an AI-based hybrid method [16], a nonparametric probabilistic method [17], an online ensemble method [18], a variable mode decomposition method [19], a multi-step method [20], a hybrid algorithm [21], an LSTM model [22], and an SVR with rolling origin recalibration [23].
Each method may feature a large forecasting error due to the variability and irregularity of the wind. To increase forecasting accuracy, an ensemble technique that combines several machine learning methods is used. Ensemble forecasting methods (EFM) were used for early meteorological forecasting and are currently used to increase the accuracy of renewable energy forecasting. An EFM combines several different forecasting models to reduce overestimation and preserve the diversity of models. The EFM uses either competition or cooperation methods [24]. The competition method uses different data sets, or an individual model with the same data set but different parameters, to train a model. The prediction output from each model is averaged to give a final prediction. As shown in [25], the weather variables, such as temperature, humidity, precipitation, and wind speed, are regarded as individual models that affect the solar power output. A least absolute shrinkage and selection operator (LASSO) method is used to aggregate the output of each weather model. The results show that the LASSO algorithm achieves considerably higher accuracy than existing methods. A study [26] used a regression-based ensemble method for short-term solar forecasting. A random forest regression (RFR) with different parameters is used as a single forecasting method. Five RFR models are established and integrated using a ridge regression, for which the hyperparameters are tuned using a Bayesian optimization algorithm.
The cooperative method divides the prediction model into several sub-models. Depending on the characteristics of each sub-model, a prediction model is established, and the final predicted values are calculated by aggregating the outputs of each sub-model. A previous study [27] used a ridge regression method to aggregate the output of four machine learning algorithms for solar and wind power forecasts. Another study [28] used a constrained least squares (CLS) regression method to combine the wind power predictions of three single forecasting models. One study [29] used a chaos local search Jaya algorithm to aggregate the output of four machine learning networks for wind speed forecasting.
Another study [30] used a weighted average method to combine the output of four single models for wind speed forecasting. A stacking ensemble method uses an ensemble neural network (ENN) [31] or a recurrent neural network (RNN) [32] to aggregate the output of several single models for solar power forecasting. The ensemble methods that are mentioned avoid overfitting and give better forecasting accuracy than a single model. This study uses a cooperative method to evaluate five different models for one-day-ahead hourly wind power forecasting. The proposed method first uses the k-means method to divide wind power data into different clusters. Five single prediction models, including a K-nearest neighbors (KNN), an RNN, an LSTM, an SVR, and an RFR, are established to generate a preliminary forecast. An optimization technique that uses swarm-based intelligence (SBI) algorithms, such as the particle swarm optimization (PSO), the salp swarm algorithm (SSA) and the whale optimization algorithm (WOA), is used to assign a weight to each single model for every hour. The final predicted value is generated by adding the weighted outputs of each individual model. To address inaccuracy in the wind speed predictions from a forecasting platform, an RFR model is used to correct the forecasted values. The main contributions of this paper are as follows:
• A k-means method is used to divide historical wind power data into five different categories. Each category of data is used to establish individual forecasting models. A total of 25 sub-models (five categories of data with five single models) are established, which increases forecasting accuracy by 12% to 31%.
• A cooperative method that combines the outputs of five single machine learning algorithms prevents overestimation and gives a more accurate forecast than single prediction models.
• In contrast to existing cooperative methods, an SBI algorithm is used to optimize the weight distribution of each single model for every hour. Assigning weights for each hour is more complicated and time-consuming, but it increases the prediction accuracy.
• One-day-ahead hourly wind speed predictions from a forecasting platform feature a large error, so an RFR model is used to correct the forecasted values. The proposed correction model decreases the wind power forecasting error by 2% to 3% in terms of the MRE value.
The remainder of this paper is organized as follows. Section 2 describes the existing ensemble methods. Section 3 details the proposed optimal ensemble method; the five single models are also described in this section. Section 4 describes the test results for a 3.6 MW wind power generation system. Conclusions are given in Section 5.

Ensemble Forecasting Methods
An EFM combines several forecasting models to increase forecasting accuracy and is widely used for meteorological forecasting. The general ensemble forecasting methods are described below.

Weighted Average Method
The weighted average method generates the prediction result by averaging the predicted outputs of each model, as [23,30]:

ŷ = (1/T) ∑_{i=1}^{T} ŷ_i   (1)

where T is the number of prediction models and ŷ_i is the output of the ith prediction model.

Weighted Sum Method
The weighted sum method generates the prediction result by aggregating the outputs of each sub-model with dissimilar weights [24], as:

ŷ = ∑_{i=1}^{T} w_i ŷ_i   (2)

where w_i is the weight that is assigned to the ith sub-model.

LASSO Regression Method
The LASSO regression is a regularization method that prevents overfitting [25,26]. A LASSO regression performs feature selection to determine the predictors that contribute significantly to the model; models that contribute to a lesser extent are assigned lower weights. The LASSO regression method is expressed as:

ŷ = ∑_{i=1}^{T} w_i ŷ_i   (3)

The weights in (3) are calculated as:

w = argmin_w { ∑_{j=1}^{N} (y_j − ∑_{i=1}^{T} w_i ŷ_{i,j})² + α ‖w‖_1 }   (4)

where ‖w‖_1 is the ℓ1 norm of the weight vector (the sum of the absolute weights) and α ≥ 0 is a penalty parameter that controls the amount of shrinkage. The greater the value of α, the greater the amount of shrinkage, so the coefficients are more robust to collinearity.

Ridge Regression Method
Like the LASSO regression method, a ridge regression penalizes the weights, but it uses the squared ℓ2 norm of the weight vector instead of the ℓ1 norm [26,27], as:

w = argmin_w { ∑_{j=1}^{N} (y_j − ∑_{i=1}^{T} w_i ŷ_{i,j})² + α ‖w‖_2² }   (5)
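As a concrete illustration, the following sketch fits LASSO and ridge combiners over the outputs of three hypothetical single models on synthetic data. The data, noise levels and alpha values are assumptions for demonstration only, not the models or settings used in this paper.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
y = rng.uniform(0, 3.6, 200)  # synthetic "actual" wind power (MW)

# stack the predictions of three hypothetical single models as columns,
# with increasing noise so the models have different accuracy
preds = np.column_stack([y + rng.normal(0, s, 200) for s in (0.2, 0.4, 0.6)])

lasso = Lasso(alpha=0.01).fit(preds, y)  # alpha plays the role of the shrinkage penalty
ridge = Ridge(alpha=1.0).fit(preds, y)

# the fitted coefficients act as ensemble weights for the single models;
# LASSO tends to shrink the weight of the noisiest model toward zero
print(lasso.coef_, ridge.coef_)
```

In practice the noisier models receive smaller weights, which is the behaviour the shrinkage penalty is meant to produce.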

Constrained Least Squares Regression Method
A constrained least squares regression minimizes the sum of the squared errors between the actual outputs and the combined estimated outputs of several single models, subject to constraints on the combination weights [28], where α is a penalty parameter for individual models that are biased.

Chaos Local Search Jaya (CLSJaya) Algorithm
CLSJaya uses the Jaya algorithm and CLS to achieve the optimal weight distribution for each single model [29]. Jaya is a swarm-based heuristic algorithm that iteratively updates particle solutions towards the global best solution and away from the global worst solution as:

x_i(t+1) = x_i(t) + rand_1 (x_best(t) − |x_i(t)|) − rand_2 (x_worst(t) − |x_i(t)|)

where x_i(t) is the value of the ith particle at the tth iteration, x_best(t) is the best particle at the tth iteration, x_worst(t) is the worst particle at the tth iteration and rand_1 and rand_2 are uniform random numbers. The Jaya algorithm is not well suited to a local search. To solve this problem, CLS is used to enrich the searching behavior and accelerate the local convergence speed of the Jaya algorithm as [29]:

γ_i(t+1) = δ γ_i(t) (1 − γ_i(t))

where γ_i(t+1) is the ith chaotic variable at the (t+1)th iteration, δ = 4 and γ_i(0) ∉ {0.25, 0.5, 0.75}.

Stacking Method
The stacking method is an ensemble learning technique that uses a meta-learner to combine the prediction results of multiple models to establish a new prediction model [31,32]. Any machine learning algorithm, such as KNN, SVR, RNN, or LSTM, can be used as the meta-learner. Unlike the stacking method, this study uses an SBI algorithm to optimize the weight distribution for each model to generate accurate predictions.

The Proposed Method
In contrast to a traditional stacking method, the proposed method uses an SBI algorithm to determine the weight distribution for each single model. Figure 1 shows the structure of the proposed method. A preliminary forecast is generated by each single model. The final forecast is produced by combining the weighted outputs of each single model. Described below are the k-means method, the five single models, the optimization algorithms (PSO, SSA and WOA), and the scheme for using an SBI algorithm to optimize the weight for each single model.

The k-Means Method
The k-means method was developed by Lloyd [33]. It is an unsupervised clustering technique that is mainly used for cluster analysis and data classification. For a set of observation data (x_1, x_2, . . ., x_n), the k-means clustering method divides the n observation data points into k categories by minimizing:

J = ∑_{j=1}^{n} ∑_{i=1}^{k} w_ji ‖x_j − R_i‖²

where x_j is the jth observation, w_ji is the weight of the jth observation with respect to the ith cluster center, R_i is the ith cluster center and ‖·‖ is the Euclidean distance. w_ji and R_i are individually expressed as:

w_ji = 1 if cluster i gives the minimum ‖x_j − R_i‖², and w_ji = 0 otherwise
R_i = ∑_{j=1}^{n} w_ji x_j / ∑_{j=1}^{n} w_ji

These expressions show that, for the minimum Euclidean distance, the n observation data points are divided into k categories. For this study, the wind power data is divided into five categories in terms of the magnitude of the wind.
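The clustering step can be sketched as follows; the synthetic power samples and the scikit-learn usage are assumptions for illustration, not the paper's dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# hourly wind power samples (MW) for a 3.6 MW turbine, synthetic stand-in
power = rng.uniform(0.0, 3.6, size=(500, 1))

# divide the samples into five categories, as in the proposed method
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(power)
labels = km.labels_                            # category index (0..4) per sample
centers = sorted(km.cluster_centers_.ravel())  # centres ordered from low to high power
print(centers)
```

Each new observation can then be assigned to the category whose centre is nearest, which is how the class of a future prediction point is selected later in the paper.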
Five Single Models

KNN
K-nearest neighbors (KNN) is a supervised learning method that is one of the simplest machine learning algorithms. KNN is used for classification and regression problems, for which data must be divided into various categories or the relationship between input and output variables must be modeled. Determining the best K value is difficult and complex because it is determined by experiment. Details of the KNN are given in [34]. The KNN algorithm works as follows:
• The predefined distance between the training and testing datasets is calculated. The Manhattan distance is widely chosen as the distance measure.

• The K value with the minimum distance from the training datasets is used.

• The final wind power is predicted using a weighted average method.
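The three KNN steps above can be sketched with scikit-learn's KNeighborsRegressor, which supports the Manhattan distance and distance-weighted averaging. The toy power curve and the choice K = 5 are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
# features: wind speed (m/s) and wind direction (degrees)
X = rng.uniform([0, 0], [25, 360], size=(300, 2))
# toy cubic power curve capped at the 3.6 MW plant capacity
y = np.clip(0.006 * X[:, 0] ** 3, 0, 3.6)

knn = KNeighborsRegressor(n_neighbors=5,        # the K value
                          metric="manhattan",   # predefined distance measure
                          weights="distance")   # weighted average of the K neighbors
knn.fit(X, y)
print(knn.predict([[10.0, 180.0]]))  # forecast for one wind-speed/direction pair
```

The distance-weighted average keeps the prediction inside the range of the neighbours' power values, so the output never exceeds the plant capacity.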

RNN
Recurrent neural networks (RNN) were developed in 1986 [35] and are used in handwriting recognition systems. An RNN describes the dynamic behavior of a time series and transmits the state through its own network, so it accepts a wider range of time series inputs. Figure 2 shows the RNN architecture. The relationship between the input and output is expressed as [36]:

h_t = σ(w x_t + u h_{t−1})
y_t = σ(v h_t)

where x_t is the input, y_t is the output, h_{t−1} is the output of the previous hidden layer and w, u and v are the parameter vectors.
An RNN can be regarded as a neural network that is unrolled in the time domain. Each node in the plot is connected through a unidirectional connection to a node in the next successive layer. Every node has a time-varying, real-valued stimulus, and each connection has a real-valued weight that can be modified. Input nodes receive data from outside the network, hidden nodes modify data during the training process from input to output, and output nodes mainly produce the network results. An RNN also uses historical prediction information as part of the input. However, the gradient vanishes for long historical sequences, so longer historical information does not affect the prediction results.

LSTM
Long short-term memory (LSTM) is a time recurrent neural network that was developed in 1997 [37]. An LSTM is used for processing and predicting important information that features very long intervals and delays in the time series. An LSTM is better suited to longer time series than an RNN. Figure 3 shows the LSTM architecture. The relationship between related nodes is expressed as:

f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where x_t is the input, f_t is the forget gate, i_t is the input gate, o_t is the output gate, c̃_t is a transfer function (the candidate state), c_t is the cell state, W is an input weight vector, U is an output weight vector for the previous stage, and b is a biased weight vector. An LSTM is also an intelligent network unit that can memorize values for an indefinite length of time. The gates in the block determine whether the input is sufficiently important to be remembered and whether it can be output. If the generated value of the forget gate is close to zero, the value that is remembered in the block is forgotten. Similarly, the generated value of the output gate determines whether the value in the block memory can be output.
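A minimal sketch of one LSTM step, implementing the gate equations above directly in NumPy; the stacked parameter layout, hidden size and random weights are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step: forget, input and output gates, candidate state,
    cell state update and hidden output. W, U, b stack the parameters
    of the four gates (an assumed layout, for illustration)."""
    z = W @ x_t + U @ h_prev + b      # pre-activations for all four gates, shape (4H,)
    H = h_prev.size
    f = sigmoid(z[0:H])               # forget gate
    i = sigmoid(z[H:2 * H])           # input gate
    o = sigmoid(z[2 * H:3 * H])       # output gate
    c_tilde = np.tanh(z[3 * H:4 * H]) # candidate cell state
    c = f * c_prev + i * c_tilde      # new cell state
    h = o * np.tanh(c)                # new hidden output
    return h, c

rng = np.random.default_rng(3)
H, D = 4, 2  # hidden size; two inputs (e.g. wind speed and direction)
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
print(h.shape, c.shape)
```

A forget-gate value near zero erases the corresponding component of the cell state, which is exactly the "forgetting" behaviour described above.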

SVR
Support vector regression (SVR) was proposed by Cortes and Vapnik in 1995 [38]. It is used for data classification and regression analysis. An SVR is widely used for image recognition, gene analysis, font recognition, fault diagnosis and load forecasting. Figure 4 shows an SVR hyperplane, which divides the data in a high-dimensional space, as [39]:

min (1/2)‖u‖² + σ ∑_{k=1}^{n} φ_k

where u is the unit normal vector for the hyperplane, h is the distance from the origin to the hyperplane, n is the number of training data points, φ_k is a slack variable, ∑ φ_k is a penalty function, σ is the weight of the penalty function, x_k is an input data set, and H(x_k) is a nonlinear mapping function.

SVR is expressed as a dual optimization problem, as:

max −(1/2) ∑_{k=1}^{n} ∑_{l=1}^{n} (α_k − α_k*)(α_l − α_l*) H(x_k)^T H(x_l) − ε ∑_{k=1}^{n} (α_k + α_k*) + ∑_{k=1}^{n} y_k (α_k − α_k*)   (24)

subject to ∑_{k=1}^{n} (α_k − α_k*) = 0 and 0 ≤ α_k, α_k* ≤ σ. The term H(x_k)^T H(x_l) in (24) is defined as a kernel function K(x_k, x_l) and must satisfy:

∫∫ K(x_k, x_l) g(x_k) g(x_l) dx_k dx_l ≥ 0

where g(x) is an integrable function. This study uses a radial basis function as the kernel function:

K(x_k, x_l) = exp(−‖x_k − x_l‖² / (2ε²))

where ε is a dilation parameter.
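A hedged sketch of an RBF-kernel SVR fit on a toy wind-speed/power relationship; the cubic power curve and the C and epsilon settings are illustrative assumptions, not the paper's tuned parameters.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
speed = rng.uniform(0, 25, 300)  # wind speed samples (m/s)
# toy cubic power curve capped at 3.6 MW, with measurement noise
power = np.clip(0.006 * speed ** 3, 0, 3.6) + rng.normal(0, 0.05, 300)

# the RBF kernel plays the role of K(x_k, x_l); C weights the penalty term
model = SVR(kernel="rbf", C=10.0, epsilon=0.05)
model.fit(speed.reshape(-1, 1), power)
print(model.predict([[12.0]]))  # predicted power near the capped region
```

The RBF kernel lets the regression follow the highly nonlinear speed-power relationship without an explicit mapping H(x).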

RFR
The random forest regression (RFR) model is composed of multiple regression trees. Each decision tree is an independent prediction model that is uncorrelated with the other trees. The RFR can be used for discrete and continuous data and can also be used for unsupervised clustering learning and outlier detection. Figure 5 shows a schematic diagram of an RFR algorithm. The steps for an RFR algorithm are described as follows [40]:
• n sub-training data sets, S_1, S_2, . . ., S_n, are randomly generated from the historical data sets.
• CARTs (classification and regression trees) are used to train each set of sub-training data. Some features are extracted and clustered in this step.
• n decision tree models that are used for individual prediction are generated.
• The average of the leaf nodes from the training data is treated as the prediction output of each CART.
• The final prediction using an RFR is the average of all prediction outputs of each CART.
Table 1 shows a brief comparison among the five single models.

Table 1. A brief comparison among the five single models.
Method: KNN. Advantage: implementation is simple and is robust to noisy training data. Disadvantage: the value of K must be determined, which may be complex.
Method: RNN. Advantage: can accept a wider range of time series inputs. Disadvantage: there is a gradient vanishing phenomenon for longer historical data.
Method: LSTM. Advantage: performs better than an RNN for longer time series. Disadvantage: predicts well only for a short time horizon.
Method: SVR. Advantage: fits well for a highly nonlinear domain. Disadvantage: modelling is significantly affected by noise.
Method: RFR. Advantage: can be used for discrete and continuous data and can also be used for outlier detection. Disadvantage: may converge to a local optimal solution.
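The RFR steps above can be sketched with scikit-learn's RandomForestRegressor, which bootstraps sub-training sets and averages the trees' outputs; the synthetic data and tree count are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
# features: wind speed (m/s) and wind direction (degrees)
X = rng.uniform([0, 0], [25, 360], size=(400, 2))
y = np.clip(0.006 * X[:, 0] ** 3, 0, 3.6)  # toy power curve, 3.6 MW cap

# n_estimators regression trees, each trained on a bootstrap sub-training set
rfr = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(rfr.predict([[6.0, 90.0]]))  # final forecast = average of the trees' outputs
```

Because each tree sees a different bootstrap sample, averaging their outputs reduces the variance of any single tree's prediction.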

The Optimization Algorithms
Many optimization algorithms can be used to solve weight distribution optimization problems. This study uses swarm-based intelligence methods, such as the PSO, SSA and WOA, to determine the weighting value for each single model.

PSO
The particle swarm optimization (PSO) algorithm was developed by Kennedy and Eberhart in 1995 [41]. The PSO simulates the behavior of fish swimming and birds flying as a simplified social system. Each variable (or particle) modifies its position using its previous best position and the best position for the swarm as:

v_i^d(t+1) = w v_i^d(t) + r_1 rand_1 (x_{i,best}^d(t) − x_i^d(t)) + r_2 rand_2 (s_best^d(t) − x_i^d(t))   (28)
x_i^d(t+1) = x_i^d(t) + v_i^d(t+1)   (29)

where v_i^d(t+1) is the velocity of the ith particle at the (t+1)th iteration, i = 1, 2, . . ., P, P is the population size, d = 1, 2, . . ., D, D is the dimension of the variable, w is the weighting value, v_i^d(t) is the previous velocity, r_1 and r_2 are the parameters for self-cognition and the swarm, respectively, rand_1 and rand_2 are random numbers with a uniform distribution, x_{i,best}(t) is the best position for the ith particle at the tth iteration, s_best(t) is the best position for the swarm at the tth iteration, x_i^d(t+1) is the position of the ith particle at the (t+1)th iteration and x_i^d(t) is the previous position.

SSA
The salp swarm algorithm (SSA) was proposed by Mirjalili et al. in 2017 [42]. It simulates the group activities of a salp swarm chain. The SSA performs exploration and exploitation during the optimization process. During the foraging process, salps naturally form a group chain structure, as shown in Figure 6, in which the individuals are either the leader salp or follower salps. The leader salp swims ahead, guides the whole group forward and updates its swimming direction depending on the position of the food. The other salps are called follower salps, and they update their positions depending on the position of the leader salp. The leader salp updates its position as:

x_1^j(t+1) = x_best^j + c_1 ((x_{1,max}^j − x_{1,min}^j) c_2 + x_{1,min}^j),  c_3 ≥ 0.5
x_1^j(t+1) = x_best^j − c_1 ((x_{1,max}^j − x_{1,min}^j) c_2 + x_{1,min}^j),  c_3 < 0.5   (30)

where x_1^j(t+1) is the position of the leader salp in the jth dimension at the (t+1)th iteration, x_best^j is the best position for the jth variable, x_{1,min}^j and x_{1,max}^j are the lower and upper limits for the jth variable, the parameters c_2 and c_3 are uniform random numbers and c_1 maintains a balance between exploration and exploitation and is expressed as:

c_1 = 2 e^{−(4t/t_m)²}   (31)

where t is the current iteration and t_m is the maximum number of iterations. When the position of the leader salp is updated, the positions of the follower salps are updated as:

x_i^j(t+1) = (1/2) a t² + v_0 t   (32)

where i = 2, 3, . . ., N_s, N_s is the number of follower salps, v_0 is the initial velocity and a is the acceleration.
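A minimal NumPy sketch of using PSO, in the spirit of (28) and (29), to optimize ensemble weights for one hour slot. The synthetic predictions, the normalization of the weights to a convex combination and the PSO constants are assumptions for illustration, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.uniform(0, 3.6, 100)  # actual wind power for one hour slot (synthetic)
# five hypothetical single-model predictions with different noise levels
preds = np.column_stack([y + rng.normal(0, s, 100) for s in (0.1, 0.3, 0.5, 0.7, 0.9)])

def fitness(w):
    w = np.abs(w) / np.abs(w).sum()       # normalise to a convex combination
    return np.mean((preds @ w - y) ** 2)  # MSE of the weighted ensemble

P, D, iters = 20, 5, 100
x = rng.uniform(0, 1, (P, D))
v = np.zeros((P, D))
pbest = x.copy()
pbest_f = np.array([fitness(p) for p in x])
gbest = pbest[pbest_f.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.uniform(size=(P, D)), rng.uniform(size=(P, D))
    v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)  # eq. (28)-style update
    x = x + v                                                      # eq. (29)-style update
    f = np.array([fitness(p) for p in x])
    better = f < pbest_f
    pbest[better], pbest_f[better] = x[better], f[better]
    gbest = pbest[pbest_f.argmin()].copy()

w_opt = np.abs(gbest) / np.abs(gbest).sum()
print(w_opt, fitness(gbest))
```

The optimized weights concentrate on the most accurate single models, which is the intended effect of the weight-distribution optimization.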


WOA
The whale optimization algorithm (WOA) was developed by Mirjalili and Lewis in 2016 [43]. It simulates the hunting strategy of the humpback whale and uses the encircling prey, bubble net attack, and search for prey strategies, as described in the following:

• Encircling prey
Humpback whales encircle prey when they find the location of the prey as follows:

→D = |→A · →P_best(t) − →P(t)|
→P(t+1) = →P_best(t) − →B · →D   (35)

where →P(t) is the current position vector, →P_best(t) is the previous best position vector, →B (= 2→b · r − →b) and →A (= 2 · r) are coefficient vectors, r is a uniform random vector, →b gradually decreases from 2 to 0, so the magnitude of →B eventually lies between 0 and 1, and "·" represents an inner product operation.

• Bubble net attack
Surrounding the prey is the most common attack strategy of humpback whales. A humpback whale also hunts prey using the bubble net attack as:

→P(t+1) = →D′ · exp(e l) · cos(2πl) + →P_best(t)   (36)

where →D′ (= |→P_best(t) − →P(t)|) is the distance between the humpback whale and the prey, e is a constant that defines the shape of the logarithmic spiral and l and p are random numbers between 0 and 1.

• Search for prey
To increase exploration, humpback whales use |→B| > 1 to avoid falling into local optima as:

→D = |→A · →P_rand − →P(t)|
→P(t+1) = →P_rand − →B · →D   (37)

where →P_rand is a position vector that is randomly selected from the swarm.
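The three WOA strategies can be sketched as a compact minimizer; the function names, parameter choices and the simple sphere objective are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def woa_minimize(f, dim, lo, hi, pop=20, iters=200, seed=7):
    """Minimal WOA sketch: encircling prey, spiral bubble-net attack and
    random search for prey, applied to a generic objective f."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, (pop, dim))
    best = min(X, key=f).copy()
    for t in range(iters):
        b = 2 - 2 * t / iters                        # decreases from 2 to 0
        for i in range(pop):
            r = rng.uniform(size=dim)
            B = 2 * b * r - b                        # coefficient vectors
            A = 2 * rng.uniform(size=dim)
            if rng.uniform() < 0.5:
                if np.all(np.abs(B) < 1):            # encircling prey (exploit)
                    X[i] = best - B * np.abs(A * best - X[i])
                else:                                # search for prey (explore)
                    rand_pos = X[rng.integers(pop)]
                    X[i] = rand_pos - B * np.abs(A * rand_pos - X[i])
            else:                                    # spiral bubble net attack
                l = rng.uniform(-1, 1, size=dim)
                D = np.abs(best - X[i])
                X[i] = D * np.exp(l) * np.cos(2 * np.pi * l) + best
            X[i] = np.clip(X[i], lo, hi)
            if f(X[i]) < f(best):
                best = X[i].copy()
    return best

sphere = lambda x: float(np.sum(x ** 2))  # simple test objective
best = woa_minimize(sphere, dim=3, lo=-5, hi=5)
print(best, sphere(best))
```

The same minimizer can be pointed at the ensemble-weight fitness function instead of the sphere objective.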

The Scheme for Optimizing Weight Distribution
In contrast to a traditional stacking method, the proposed method uses an optimization algorithm to determine the weight distribution for each single model. A preliminary forecast is generated by every single model. The final forecast is produced by combining the weighted outputs of each single model. The steps for using an optimization algorithm to determine the weight for each single model are described in the following:
Step 1: The initial position of each particle is randomly generated as:

x_{i,j}(0) = x_{i,min} + rand · (x_{i,max} − x_{i,min}), i = 1, 2, . . ., S, j = 1, 2, . . ., P

where x_{i,j}(0) is the initial position of the ith variable of the jth feasible solution, x_{i,max} and x_{i,min} are the maximum and minimum positions, respectively, rand ∈ [0,1] is the value of the uniform distribution function, S is the number of variables, and P is the number of feasible solutions in the group. The position of the jth feasible solution is expressed as:

x_j = [w_{1,h}, w_{2,h}, . . ., w_{5,h}], h = 1, 2, . . ., 24, j = 1, 2, . . ., P

where w_{i,h} is the weight of the ith prediction model at the hth hour. This study first generates feasible solutions for the first hour (i.e., x_j = [w_{1,1}, w_{2,1}, . . ., w_{5,1}], j = 1, 2, . . ., P). After optimization, the weight distribution of each single model for the remaining hours is optimized successively.
Step 2: The fitness value for each initial feasible solution is calculated, and the position of the best initial feasible solution is recorded. The fitness value for the jth feasible solution at the hth hour is expressed as:

F_{j,h} = ∑_{i=1}^{N} (Y_{i,h} − Ŷ_{i,h})²,  h = 1, 2, . . ., 24, j = 1, 2, . . ., P   (42)

where Ŷ_{i,h} (i = 1, 2, . . ., N) is the estimated value of the training data for the jth feasible solution at the hth hour, Y_{i,h} is the corresponding actual value of the training data, and N is the number of training data points.
Step 3: A position updating strategy is used in this step.
• PSO: use (28) and (29) to modify the velocity and position;
• SSA: use (30) and (32) to update the positions of the leader salp and the follower salps, respectively;
• WOA: use the encircling prey, bubble net attack and search for prey strategies to update the positions of the humpback whales, as shown in (35) to (37).
Step 4: The fitness value for each updated position is calculated using (42). The position with the best fitness value is selected for the next generation.
Step 5: If the maximum number of iterations is reached, the method determines whether the 24 h weighting optimization is complete. If it is, the optimal 24 h weighting solution is output; otherwise, Steps 3 to 5 are repeated.
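Steps 1 to 5 can be sketched as the following per-hour loop; a simple random search stands in for the PSO/SSA/WOA position-updating step, and the synthetic training tensors are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

def fitness(w, preds, actual):
    """Step-2 fitness: squared error of the weighted ensemble for one hour."""
    return float(np.sum((preds @ w - actual) ** 2))

# hypothetical training tensors: 24 hours x N days x 5 single models
N, H, M = 30, 24, 5
actual = rng.uniform(0, 3.6, (H, N))
preds = actual[..., None] + rng.normal(0, 0.3, (H, N, M))

weights = np.zeros((H, M))
for h in range(24):                            # optimise each hour separately
    best_w, best_f = None, np.inf
    for _ in range(2000):                      # stand-in for the SBI position updates
        w = rng.dirichlet(np.ones(M))          # candidate weights summing to one
        f = fitness(w, preds[h], actual[h])
        if f < best_f:                         # keep the best fitness (Steps 4-5)
            best_w, best_f = w, f
    weights[h] = best_w

print(weights.shape)                           # one weight vector per hour
```

The result is a 24 x 5 weight matrix, one weight vector per hour, which mirrors the hourly weight tables reported for the WOA later in the paper.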

Data Pre-Processing
The proposed method was used for a 3.6 MW wind turbine power generation system located in Changhua, Taiwan. Data was collected from December 2019 to September 2021, giving a total of 11,527 hourly data points after outliers and missing data were eliminated. Of these 11,527 hourly data points, 10,951 are used to construct and validate the five single models and the remaining 576 (a total of 24 days, distributed over each month) are used for testing. The data includes wind power, wind speed, and wind direction. Figure 7 shows the schematic diagram of class selection for future prediction points. If the future wind speed prediction for the first hour is h_1, the Euclidean distance between h_1 and each of the five cluster centers is calculated, and the class with the shortest Euclidean distance is chosen for h_1. The five single models then use the same class of prediction model that is constructed in the training stage to generate a preliminary forecast. The wind speed data is measured at a height of 10 m. To obtain the wind speed at the turbine hub height of 67 m, the following conversion formula is used [44]:

v_2 = v_1 (z_2/z_1)^α

where v_1 and v_2 are the wind speeds at heights z_1 = 10 (m) and z_2 = 67 (m), respectively, and α is the surface friction coefficient, whose value is obtained by experiment. The α value is low in smooth areas and high in rough, obstructed areas; α generally has a value between 0.1 and 0.4. For this study, α is 0.2. The program was run on a Windows 11 PC using Python. Figure 8 shows the curves for the wind power data before and after pre-processing. A Pearson correlation coefficient is used to determine the effects of wind speed and wind direction on wind power. The k-means method is used to classify the historical wind power data into several categories. Figure 9 shows an elbow curve for the collected wind power data. The sum of squared errors (SSE = ∑_{i=1}^{n} (y_i − ŷ_i)²) decreases as the number of clusters increases. When the number of clusters is greater than 5, the SSE decreases slowly. As shown in Figure 10, the historical wind power data are then divided into five categories, namely breeze (class 1), moderate wind (class 2), cool wind (class 3), strong wind (class 4), and powerful wind (class 5). To illustrate the impact of data classification on prediction accuracy, the five single models are also used to establish individual prediction models without classifying the data. Table 2 shows the prediction error before and after classification. Classifying the data into five categories increases prediction accuracy by 12% to 31%. Table 3 shows the correlation coefficient values before and after pre-processing. After data pre-processing, the correlation coefficient values between the weather variables and wind power are greater. As shown in this table, wind speed has a great effect on wind power, and there is only a small mutual correlation between wind speed and wind direction. In this study, wind speed and wind direction are used as explanatory variables to establish each single prediction model. Table 4 shows the number of data points in every category that are used for training, validation, and testing. Table 5 shows the parameter settings for every single model. To determine forecasting accuracy, the mean relative error (MRE) is used:
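These pre-processing steps can be sketched as follows; the function names and interfaces are illustrative and not part of the paper's code:

```python
import numpy as np

def convert_wind_speed(v1, z1=10.0, z2=67.0, alpha=0.2):
    """Extrapolate wind speed from the measurement height z1 (m) to the hub
    height z2 (m) using the power-law profile v2 = v1 * (z2 / z1) ** alpha."""
    return v1 * (z2 / z1) ** alpha

def select_class(h1, centers):
    """Return the index of the cluster center nearest (Euclidean distance) to
    the predicted wind speed h1, as in the class-selection step of Figure 7."""
    centers = np.asarray(centers, dtype=float)
    return int(np.argmin(np.abs(centers - h1)))
```

With α = 0.2, for example, a 10 m measurement of 5 m/s maps to roughly 7.3 m/s at the 67 m hub height.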

MRE = (1/N) ∑_{i=1}^{N} |y_i − ŷ_i| / y_cap × 100%

where y_i is the ith actual value, ŷ_i is the ith estimated value, y_cap is the capacity of the wind power generation system, and N is the number of estimation points.
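A minimal sketch of the MRE computation (the function name and interface are illustrative; y_cap is 3.6 MW for the system studied here):

```python
import numpy as np

def mean_relative_error(y_true, y_pred, y_cap):
    """MRE (%) = (1/N) * sum(|y_i - y_hat_i|) / y_cap * 100, where y_cap is
    the capacity of the wind power generation system."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)) / y_cap * 100.0)
```

Normalizing by the fixed capacity rather than each actual value keeps the metric well defined at hours with near-zero generation.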

Forecasting Results
Five machine learning methods (KNN, RNN, LSTM, SVR, and RFR) are used to establish individual prediction models for each grade of wind, in order to generate a preliminary forecast. The inputs for each model are wind speed and wind direction, and the output is wind power. Table 6 shows the validation results (MRE%) for every single prediction model. Every single model produces good predictions using the validation data, which demonstrates that these models do not overfit and can be used for preliminary prediction. Table 7 shows the parameter settings for every optimization algorithm. These parameters are tuned by experiment. Figures 11-13, respectively, show the optimization curves for the PSO, SSA, and WOA methods. The mean squared error (MSE) is used to evaluate the convergence characteristic:

MSE = (1/N) ∑_{i=1}^{N} (y_i − ŷ_i)²

Each plot contains 24 (hourly) optimization curves. In order to easily observe the convergence characteristics, the curves for the 51st to 80th iterations are magnified. The respective average converged MSE values for PSO, SSA, and WOA are 8.89 × 10⁻⁸, 1.10 × 10⁻¹⁰, and 1.53 × 10⁻⁶. After 150 iterations, the convergence time for 24 h is 101.08 s for the PSO, while the SSA and WOA require 113.32 s and 134.08 s, respectively. Table 8 shows the respective weights for each individual model for the 24 h using the WOA method. A weight of zero signifies a prediction model that has no effect on the output. Similar weight matrices are generated using the PSO and SSA methods. The Taiwanese Central Weather Bureau (TCWB) only provides 3-h-ahead wind speed predictions, so this data is not suitable for one-day-ahead hourly wind power forecasting. Solcast is a forecasting platform that offers meteorological predictions, including temperature, wind speed, wind direction, and humidity, at different resolutions, given the latitude and longitude of a location [45]. However, the wind speed prediction that is provided by Solcast features a 16.31% forecasting error, compared to the actual measured wind speed. A correction model is therefore constructed to increase the prediction accuracy for wind speed. The RFR model, which gives better results than the other single models for the Solcast forecasting data, is used to correct the Solcast predictions. During training, the inputs are the Solcast predictions for wind speed and wind direction, and the output is the actual measured wind speed. After training, the forecasting error for wind speed is reduced from 16.31% to 4.56%. The RFR model for wind speed correction is then used for one-day-ahead hourly wind speed prediction based on the Solcast forecasting data.
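The weight-optimization stage described above (PSO, SSA, or WOA minimizing the ensemble MSE) can be sketched with a minimal PSO. The hyperparameter values here are illustrative, not the paper's tuned settings, and one particle is seeded with equal weights as a baseline:

```python
import numpy as np

rng = np.random.default_rng(0)

def ensemble_mse(weights, preds, target):
    """MSE of the weighted sum of the single-model predictions."""
    return float(np.mean((preds.T @ weights - target) ** 2))

def pso_weights(preds, target, n_particles=30, n_iter=150, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO searching for the weight vector (one weight per single
    model) that minimizes the ensemble MSE.  Weights are kept in [0, 1]
    and normalized to sum to 1."""
    n_models = preds.shape[0]
    pos = rng.random((n_particles, n_models))
    pos[0] = 1.0 / n_models                      # equal-weight baseline particle
    pos /= pos.sum(axis=1, keepdims=True)
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([ensemble_mse(p, preds, target) for p in pos])
    g = np.argmin(pbest_val)
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        pos /= pos.sum(axis=1, keepdims=True) + 1e-12
        vals = np.array([ensemble_mse(p, preds, target) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        if vals.min() < gbest_val:
            g = np.argmin(vals)
            gbest, gbest_val = pos[g].copy(), vals[g]
    return gbest, gbest_val
```

The final hourly prediction is then the weighted sum `preds.T @ gbest`; the SSA and WOA differ only in the position-update rule.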
Figure 14 shows the curves for the forecasting results for four different testing days using the PSO, SSA, and WOA methods. Table 9 shows the forecasting results for the single and ensemble models using the corrected wind speed data. For the 24 test datasets, the respective average MRE values for KNN, RNN, LSTM, SVR, RFR, and the ensemble-based PSO, SSA, and WOA are 5.8091%, 5.7423%, 5.7622%, 5.7726%, 5.8937%, 5.7403%, 5.7359%, and 5.7413%. The optimized ensemble methods give a more accurate forecast than the single models. Table 10 shows the number of maximum and minimum MRE values for the single and ensemble models. In terms of the number of maximum MRE values, RFR gives a less accurate forecast than the other single and ensemble models. The RNN model and the ensemble models do not produce the worst prediction for any of the test datasets. In terms of the number of minimum MRE values, KNN gives the most accurate forecast for seven test datasets, while RNN, RFR, and the ensemble WOA each give the most accurate forecast for three test datasets. Table 11 shows the forecasting results for the average MRE using actual data, Solcast forecasting data, Solcast forecasting data with 10% random error, and the forecasting data after correction. If the actual measured wind speed and wind direction are used, the optimized ensemble models give a more accurate forecast than the single models, except for the KNN model. If the predicted wind speed and wind direction data provided by Solcast are used, SVR gives a more accurate result than all other models. The optimized ensemble models also allow accurate forecasts.
To simulate the inaccuracy of weather forecasts, a random error with a normal distribution is added to the Solcast forecasting data [46]. During the experiment, random errors of 10%, 20%, 30%, and 40% are tested. The random error of 10%, which allows the most accurate forecast, is applied to the Solcast forecasting data. As shown in Table 11, these forecasting results are slightly better than the results using the Solcast prediction data. If the proposed wind speed correction model is used, the wind power forecasting errors are reduced by an MRE value of about 2~3% for each single and ensemble model. This case mainly demonstrates that an optimized ensemble method can be better than a single prediction method. Table 12 compares the SBI methods with other ensemble methods using LASSO [25] and ridge regression [27], which use a Bayesian optimization algorithm to determine the weight distribution for each single model. This case mainly highlights that the SBI methods, namely PSO, SSA, and WOA, give more accurate forecasts than the LASSO and ridge regression methods.
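The noise-injection test can be sketched as below. The exact parameterization of the random error is an assumption (zero-mean Gaussian noise with standard deviation equal to the error level times each forecast value), since the paper does not spell it out:

```python
import numpy as np

def add_forecast_error(wind_speed, error_level=0.10, seed=0):
    """Perturb a forecast series with zero-mean normal noise whose standard
    deviation is error_level (e.g., 0.10 for 10%) of each forecast value,
    simulating weather-forecast inaccuracy."""
    rng = np.random.default_rng(seed)
    wind_speed = np.asarray(wind_speed, dtype=float)
    noise = rng.normal(0.0, error_level * np.abs(wind_speed))
    return wind_speed + noise
```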

Discussion
An SBI method that is used to optimize the weight distribution for each single model gives a more accurate wind power forecast than the single and other ensemble prediction models. The forecasting results allow the following observations:

• Five single models, including KNN, RNN, LSTM, SVR, and RFR, are used to produce a preliminary forecast. More machine learning models could be used as single models to avoid overestimation and to increase forecasting accuracy.

• As shown in Tables 9 and 10, the optimized ensemble models do not give the best forecast on every test dataset, but no maximum MRE value is produced using the ensemble models.

• There is a high correlation between the wind speed and wind power data, so the accuracy of the wind speed prediction significantly affects the wind power forecast. Compared to the Solcast prediction results, a decrease of about 2~3% in the MRE value is obtained using the proposed wind speed correction model.

• This study uses an RFR model to decrease the wind speed prediction error from 16.31% to 4.56%. A more accurate prediction method could be used to increase the forecasting accuracy of wind speed, such as those of previous studies in [5][6][7][8][9][10][11][12].

• The LASSO and ridge regression methods use a Bayesian optimization algorithm to determine the weight distribution for each single model. The proposed method uses SBI algorithms to optimize the weight distribution and allows a more accurate prediction.
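For comparison, a ridge-regression ensemble baseline of the kind compared in Table 12 can be written in closed form. This is a sketch under stated assumptions: it omits the Bayesian tuning of the regularization strength used in the cited works, and the matrix layout (columns are the single-model predictions) is illustrative:

```python
import numpy as np

def ridge_weights(preds, target, lam=1e-2):
    """Closed-form ridge regression: w = (P^T P + lam*I)^{-1} P^T y, where the
    columns of P are the single-model predictions and y is the measured wind
    power.  The coefficients act as the ensemble weight for each model."""
    n_models = preds.shape[1]
    A = preds.T @ preds + lam * np.eye(n_models)
    return np.linalg.solve(A, preds.T @ target)
```

LASSO has no closed form; it is solved iteratively (e.g., by coordinate descent) with an L1 penalty that can zero out the weights of weak models.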

Conclusions
An optimized ensemble model for one-day-ahead hourly wind power forecasting is proposed to increase the forecasting accuracy of single prediction models. The proposed method first divides historical wind power data into five different categories. Five single models, including KNN, RNN, LSTM, SVR, and RFR, are used to establish individual prediction models for each category of data, in order to produce a preliminary forecast. The final prediction is generated using a swarm-based intelligence tool to determine the weight distribution for each single model. The wind speed prediction provided by a forecasting platform features a 16.31% forecasting error; an RFR model is used to reduce this error from 16.31% to 4.56%. Testing with a 3.6 MW wind power generation system shows that the optimized ensemble method gives a more accurate forecast than the single models, and the ensemble models never produce the worst prediction on any test dataset. Using the proposed wind speed correction model, the wind power forecasting error is reduced by an MRE value of about 2~3% for each single and ensemble model. The proposed method also allows more accurate forecasting than the LASSO and ridge regression methods. Future studies will dynamically update the weight value for each single prediction model using new wind power data, in order to further increase forecasting accuracy.

Figure 1 .
Figure 1.The structure of the proposed method.

Figure 2 .
Figure 2. The structure of an RNN.


Figure 6 .
Figure 6.The structure of a salp swarm chain.


Figure 7 .
Figure 7. Schematic diagram of class selection for future prediction points.


Figure 8 .
Figure 8. Curves for wind power data: (a) before pre-processing and (b) after pre-processing.

Figure 9 .
Figure 9. The elbow curve for collected wind power data.

Figure 10 .
Figure 10.Wind power data classification using a k-means method.


Figure 11 .
Figure 11.Convergence curves for 24 h for the PSO method.

Figure 12 .
Figure 12.Convergence curves for 24 h for the SSA method.


Figure 13 .
Figure 13.Convergence curves for 24 h for the WOA method.

Figure 14 .
Figure 14. The curves for forecasting results for 4 different testing days using ensemble methods.


Table 1 .
Comparison of the five single models.


Table 2 .
Prediction error of data before and after classification.

Table 3 .
Pearson correlation coefficient values between weather variables and wind power.

Table 4 .
The number of data points that are used for every category.

Table 5 .
Parameter settings for every single model.

Table 6 .
Validation results (MRE%) for every single prediction model.

Table 7 .
Parameter settings for every optimization algorithm.


Table 8 .
The respective weights for each individual model using a WOA for 24 h.


Table 9 .
Forecasting errors for single and ensemble models using proposed corrected data.

Table 10 .
Number of maximum and minimum MREs for single and ensemble models.

Table 11 .
Forecasting results for average MRE using actual data, Solcast prediction data and corrected data.

Table 12 .
Comparison between the proposed SBI and the other ensemble methods using LASSO and ridge regressions.