Advanced Ensemble Methods Using Machine Learning and Deep Learning for One-Day-Ahead Forecasts of Electric Energy Production in Wind Farms

Abstract: The ability to precisely forecast power generation for large wind farms is very important, since such generation is highly unstable and creates problems for Distribution and Transmission System Operators in properly preparing the power system for operation. Forecasts for the next 24 h play an important role in this process. They are also used in energy market transactions. Even a small improvement in the quality of these forecasts translates into more security for the system and savings for the economy. Using two wind farms for statistical analyses and forecasting considerably increases the credibility of the newly created prediction methods and the formulated conclusions. In the first part of our study, we analysed the available data to identify potentially useful explanatory variables for forecasting models, additionally developing new input data based on the basic data set. We demonstrate that it is better to use Numerical Weather Prediction (NWP) point forecasts for hourly lags −3, −2, −1, 0, 1, 2, 3 (an original contribution) as input data than the lags 0 and 1 that are typically used. We also show that it is better to use forecasts from two NWP models as input data. Ensemble, hybrid and single methods are used for predictions, including machine learning (ML) solutions such as Gradient-Boosted Trees (GBT), Random Forest (RF), Multi-Layer Perceptron (MLP), Long Short-Term Memory (LSTM), K-Nearest Neighbours Regression (KNNR) and Support Vector Regression (SVR). Original ensemble methods, developed for specific implementations, reduced the errors of forecast energy generation for both wind farms as compared to single methods. Predictions by the original ensemble forecasting method, called "Ensemble Averaging Without Extremes", have the lowest normalized mean absolute error (nMAE) among all tested methods. A new, original "Additional Expert Correction" further reduces the errors of energy generation forecasts for both wind farms.
The proposed ensemble methods are also applicable to short-time generation forecasting for other renewable energy sources (RES), e.g., hydropower or photovoltaic (PV) systems.


Introduction
The impact of humanity on climate change is a fact accepted by most scientists and policymakers. Renewable energy sources have become a "natural" alternative to energy sources based on fossil fuels. Among renewables, the largest increases in energy production come from wind sources. However, wind sources are known for their basic disadvantage, which is intermittent power generation. A way to overcome this drawback is to develop the best possible energy production forecasts so that Distribution and Transmission System Operators can properly prepare the power system for operation. Forecasts for the next day play an important role in this process. They are also used in energy market transactions. Even a small improvement in the quality of these forecasts translates into improved security of the system and savings for the economy. Therefore, efforts are made to improve forecast quality by:
• analysing the usefulness of various explanatory data;
• utilizing machine learning;
• preparing forecasts with single, ensemble and hybrid methods;
• analysing the influence of the point distribution of Numerical Weather Prediction (NWP) models at large wind farms;
• conducting comparative analysis of forecast quality for various wind farms.
Machine learning models have become frequently used prediction tools, not only as ensemble components, but also as standalone solutions. Decision trees with variants [17], SVR [10,20], and neural networks [8–10] are examples of quite popular predictors. With increasing average PC computational power, deep learning models gained their share of popularity, too. Among them, not only methods such as LSTM [3,6,8,11,14,16,18,21,22], GRU [18,23], or deep ESN [24] have been used in research, but also methods previously associated with image analysis, like CNN [1,2,25–27], have been incorporated into studies. In their research, Wang, Li, and Yang [19] proposed an LSTM-based encoder to achieve input attention that understands the importance of variables, Sun, Zhao, and Zhang [8] created different LSTM hybrids for wind power series of multiple time scales, while Niu et al. [23] presented Sequence-to-Sequence GRU Networks as a recurrent method of multi-step ahead prediction.
In the papers we reviewed, convolution networks were used to extract spatial information from data. In some cases [2,25,26], they were used to add a spatial aspect to temporal information. For that purpose, Yin, Ou, Huang, and Meng [28] suggested extracting both temporal and spatial information by a cascade of a CNN followed by an LSTM; in another case [27], the extracted spatial information was a replacement for lacking time information.
Data extraction by CNN can be treated as semi-automatic input inference without user involvement. Some authors, however, preferred a different approach, i.e., feature engineering and input selection based on statistical analysis. Lin and Liu [29] presented wind data correction methods according to IEC standards, Medina and Ajenjo [30] presented analysis of optimal time lags for input variables with different time horizons, while other authors presented data cleaning and imputation by Lomnaofski norm [31], extensive sensitivity analysis of input data [9], and analyses of optimal sparsity of NWP model grids [32].
Last but not least, note that all of the mentioned deep learning and ensemble solutions could either use NWP data or be created as a stack of weather forecasting models followed by energy prediction models. Since the accuracy of generated energy prediction is usually affected by the accuracy of the input data, enhanced weather forecasts could lead to improved energy prediction. Better weather forecasts could be achieved in multiple ways, e.g., de Mattos Neto et al. [33] proposed an LSTM-SVR hybrid as a means of obtaining better wind speed forecasts.

Objective and Contribution
The main objectives of this paper can be summarized as follows:
• Perform extensive statistical analysis of the time series of energy generated in two wind farms and statistical analysis of potential exogenous explanatory variables;
• Perform a very extensive analysis of the sensitivity of explanatory variables;
• Verify the accuracy of forecasts produced by single methods, hybrid methods, and ensemble methods (13 methods in total);
• Develop and verify an original ensemble method, called "Ensemble Averaging Without Extremes", and conduct an original selection of combinations of predictors for ensemble methods;
• Identify the most efficient forecasting methods from among the tested methods for data from both wind farms.
Below are listed selected contributions of this paper:
1. This research addresses forecasting for large wind farms. Although this topic frequently appears in the literature, this research has its unique values. First of all, an extensive data analysis was performed, including the time series itself, additional data, and the parameters of two different NWP models (81 inputs in total). Secondly, 13 forecasting methods for two wind farms were tested and compared.
2. Development of an original method, called "Ensemble Averaging Without Extremes". Predictions by this method yielded the best Skill Score (SS) metric and the lowest nMAE error among the tested methods; the original "Additional Expert Correction" method yielded additional improvement of the forecasts.
3. Construction of a number of different models, data scenarios and parameters resulted in testing more than 400 forecasting models. This makes this research one of the most extensive studies on the topic. The conclusions drawn from this research can be generalized, at least for Central Europe.
The remainder of this paper is organized as follows: Section 2.1 presents statistical analysis of times series and NWP data for two wind farms. The importance of the available basic input data and additional input data is discussed in Sections 2.2 and 2.3. Section 3 describes prediction methods employed and Section 4 gives evaluation criteria for assessment of forecasting quality. Extensive analysis of the results and their discussion is in Section 5. Section 6 summarizes the whole research providing the main conclusions. References are listed at the end of this paper.

Statistical Analysis
For statistical analyses, data acquired for two medium-sized European wind farms (A and B) were used. The range of the acquired data was identical for both farms and spanned from 4 April 2017 to 10 October 2019, with about 29 months in total. Rated powers for Farms A and B were 50 MW and 48.3 MW, respectively.
The following data were available for analysis: • Wind farm generation time series (forecasted variable); • Weather forecasts for wind farms location.
Records of actual meteorological parameters were not available; hence, GFS and ECMWF NWP models were used for our research instead. For ECMWF, the archived high-resolution atmospheric model was chosen (HRES) [34]. The GFS model was supplied by the Interdisciplinary Modelling Center, Warsaw University (ICM UW) [35,36]. Both models make it possible to use 4 forecast runs per day (at 0/6/12/18 UTC) with 1 h resolution and maximal horizon of 240 h. Time resolution of HRES changes, however, to a 3 h interval after 90 h horizon and to 6 h after 144 h horizon. For GFS, only the first interval change appears after reaching the 120 h horizon. For each wind farm, only weather forecasts corresponding to the respective spatial point were used. Weather source points for the ECMWF model were chosen as the points nearest to the ones appearing in a dense 1/8 × 1/8-degree grid. The same method was applied to GFS with its native spatial resolution of 0.25 × 0.25 degrees.
Data from both time series of electric energy production (Farm A and Farm B) were normalized separately for anonymization to relative units (1 relative unit is equal to the rated power of the wind farm). In turn, each time series of NWP forecast data was normalized using min–max scaling. Table 1 shows descriptive statistics for the time series of hourly electric energy generated by Wind Farms A and B considered here. The percentage distribution of electric energy generation for both wind farms is shown in Figure 1. The analysis of the electric energy generation percentiles shows that values very close to 0 made up more than 25% of the samples in both time series. Most commonly, energy generation was within the range of (0–0.1) [p.u.] for both time series.
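The two normalizations described above can be sketched as follows (a minimal illustration; function names and sample values are ours, not the paper's):

```python
def to_per_unit(generation_mwh, rated_power_mw):
    """Normalize generation to relative units (1 p.u. = rated power)."""
    return [g / rated_power_mw for g in generation_mwh]

def min_max_scale(series):
    """Min-max scale an NWP forecast series to the [0, 1] interval."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series]

# Farm A generation (rated power 50 MW) and a wind-speed forecast series.
gen_pu = to_per_unit([25.0, 50.0, 0.0], rated_power_mw=50.0)
wind = min_max_scale([2.0, 6.0, 10.0])
```

Note that per-unit scaling preserves comparability between farms of different rated power, while min–max scaling puts heterogeneous NWP variables on a common range.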
Calculated autocorrelation coefficients (ACF) of hourly generation in both time series show little daily periodicity. The autocorrelation coefficients quickly decrease over the following hours of the first day. For both time series, all autocorrelation coefficients are statistically significant (5% significance level) up to 3 days back (72 prior observations). The autocorrelation function (ACF) of the Wind Farm A energy generation time series is presented in Figure 2, and that of the Wind Farm B time series in Figure 3. Figure 4 shows the daily variability of hourly energy production [p.u.] of Wind Farms A and B. Arithmetic means of hourly energy generation for each hour of the day were calculated based on data spanning 4 April 2017 to 28 September 2018 (18 months in total), omitting the test period datetimes (1 October 2018 to 1 October 2019). For the same periods, mean hourly generations were calculated for each month, with values averaged for the months occurring twice. The Pearson linear correlation coefficient between the data of the two farms is equal to 0.950. The daily variability of electric energy generation of both wind farms is similar. Figure 5 shows the seasonal variability of electric energy generation of Wind Farms A and B. The Pearson linear correlation coefficient between the data is equal to 0.922.
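The sample ACF and the 5%-level significance bound used above can be sketched as follows (a standard textbook computation, not the paper's code; the bound is the usual large-sample approximation ±1.96/√N):

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation coefficients for lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    out = []
    for lag in range(1, max_lag + 1):
        cov = sum((series[t] - mean) * (series[t - lag] - mean)
                  for t in range(lag, n))
        out.append(cov / var)
    return out

def significance_bound(n, z=1.96):
    """Approximate 5%-significance bound for an ACF coefficient."""
    return z / math.sqrt(n)
```

A coefficient is deemed statistically significant when its absolute value exceeds `significance_bound(n)`.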
The seasonal variability of electric energy generation of both wind farms is similar. All dispersion diagrams indicate a non-linear relationship between wind speed and the yield of electricity. The exact shape, corresponding approximately to the typical power curve of a single wind turbine, cannot be clearly seen in the diagrams due to the low concentration of data points. Both of the observed disadvantageous phenomena probably result from the following reasons:
• the data include wind speed forecasts instead of actual, recorded values;
• wind speed forecasts are momentary values for a given hour, while the actual wind speed usually changes during a 1 h period;
• the data come from very large wind farms with turbines scattered across a vast territory with varying orography. For single wind turbines, the data would probably be much more concentrated.
For both wind farms, extreme outliers were treated as unreliable samples and removed from the data. Such outliers can be due to incorrect readings, missing data, or scheduled/unscheduled shutdowns of at least a part of the wind farm. Only extreme, rarely occurring outliers were removed, since large wind speed prediction errors are bound to occur within a 24 h forecast horizon. A sample with a null wind speed forecast and non-zero electricity generation can be given as an example of such NWP inaccuracy.
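A plausibility filter of the kind mentioned above could be sketched as follows (the threshold values are illustrative assumptions of ours, not the paper's):

```python
def filter_unreliable(samples, wind_eps=0.5, gen_eps=0.05):
    """Drop samples where the forecast wind speed is near zero (m/s)
    but recorded generation is clearly non-zero (p.u.).
    Thresholds are illustrative, not the paper's values."""
    return [
        s for s in samples
        if not (s["wind_forecast"] < wind_eps and s["generation_pu"] > gen_eps)
    ]

samples = [
    {"wind_forecast": 0.1, "generation_pu": 0.4},  # implausible -> dropped
    {"wind_forecast": 8.0, "generation_pu": 0.6},  # plausible  -> kept
]
clean = filter_unreliable(samples)
```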

Analysis of Importance of Available Basic Input Data for Forecasting Methods
A detailed description of the available basic set of potential input variables for forecasting models is presented in Table 2. Figure 8 presents the time points (momentary value time lags) of point weather forecasts from the GFS and ECMWF models (input data) in relation to the periods of electricity generation.

To identify the most important inputs for the prediction models, an extensive sensitivity analysis was performed for both wind farms. All 68 acquired potential input variables were included. Comparing the importance of input variables for both farms made it possible to draw general conclusions about the validity of using given variables in predictions of electricity generation from large wind farms. Figure 9 presents the consecutive steps of this analysis. Global Sensitivity Analysis (GSA) using the MLP network was performed for five models. Each trained model had 68 input variables and 1 output variable (electricity generation); the five models had 40, 50, 60, 70, and 80 hidden neurons, respectively, used the BFGS learning algorithm, a hyperbolic tangent hidden layer activation function, and a linear output layer activation function. After training each MLP model, GSA was performed and the importance of each input variable was computed. Next, for each input, the overall rating was calculated as the arithmetic mean of the GSA results obtained from the MLP models. Results of the sensitivity analysis for the potential input variables of the prediction models are shown in Figure 10.
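The final averaging-and-ranking step can be sketched as follows (variable names and scores are hypothetical; the per-model GSA values themselves come from the trained MLPs):

```python
def overall_rating(per_model_importance):
    """Average the importance of each input variable across models,
    then rank variables from most to least important."""
    names = per_model_importance[0].keys()
    n = len(per_model_importance)
    mean = {v: sum(m[v] for m in per_model_importance) / n for v in names}
    return sorted(mean.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical GSA scores from MLP models with different hidden sizes.
scores = [
    {"ECMWF_v_mod_0": 4.1, "GFS_v_mod_0": 3.0, "pressure_0": 1.1},
    {"ECMWF_v_mod_0": 3.9, "GFS_v_mod_0": 3.2, "pressure_0": 1.0},
]
ranking = overall_rating(scores)
```

Averaging across models with different hidden-layer sizes reduces the risk that a single network's idiosyncrasies dominate the importance ranking.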
The most important input variables are definitely wind speed forecasts, notably, the ones closest to the 1 h energy generation period. ECMWF NWP forecasts turned out higher in ranking than GFS forecasts, while the least important input variables were predictions of atmospheric pressure.
The importance of input variables varied between the four analytic methods used here, both in the value of the metrics and in the position in the importance ranking. The most differing results came from the SA method, due to the non-linear modelling (MLP network) used in that method. The remaining analytic methods used linear modelling; hence, their results were similar. Figure 11 contains the interrelationship matrix (Pearson linear correlation coefficients) between the four analytic methods used to determine the importance of input variables.

Analysis of Importance of Additional Input Data Created
A detailed description of an additional set of potential input variables for forecasting models, derived from mathematical transformations of the basic data, is presented in Table 3. The additional input data were created to verify their potential usefulness and importance in the forecasting process. Wind speed forecasts from either one or both NWP models are averaged to reduce the random component. Percentage differences between averaged wind speed/atmospheric pressure point forecasts for respective pairs of hourly lags are computed to include additional information about the dynamics of wind speed/atmospheric pressure in the model. The prognosis of a physical model (turbine power curve) is another piece of additional information. A third-order polynomial is used to approximate the power curve, while averaged wind speeds with lags 0 and 1 (the time bounds of the predicted periods) are the inputs to this model. Figure 12 shows the results of the importance analysis of the additional input data. The analysis was performed according to the steps in Figure 9, the same as for the basic input data, and used the 68 basic input variables (described in Table 2) and 13 additional inputs (described in Table 3). Figure 12 presents partial results: the 40 best input variables out of the total of 81. The studies have shown that the additional input data, in particular forecasts from physical models and averaged values of predictions, are highly valuable as prediction model explanatory data, as the additional inputs usually rank high in terms of importance (OR metrics). Moreover, similar results for both wind farms show the universality of our procedure for constructing the additional input data. As the next step, it was verified whether the additional data can be advantageous for different methods of electricity generation forecasting.
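The three transformations described above can be sketched as follows (a minimal illustration; polynomial coefficients and cut-in/cut-out thresholds are placeholder assumptions, not the fitted values):

```python
def ave_nwp(gfs, ecmwf):
    """Average wind-speed forecasts of two NWP models elementwise."""
    return [(a + b) / 2 for a, b in zip(gfs, ecmwf)]

def pct_diff(series, lag_a, lag_b):
    """Percentage difference between forecasts at two hourly lags."""
    return 100.0 * (series[lag_b] - series[lag_a]) / series[lag_a]

def power_curve_forecast(v, coeffs, cut_in=3.0, cut_out=25.0):
    """Cubic power-curve output in p.u.; zero outside [cut_in, cut_out].
    Coefficients here are illustrative, not the fitted ones."""
    if v < cut_in or v > cut_out:
        return 0.0
    a3, a2, a1, a0 = coeffs
    return a3 * v**3 + a2 * v**2 + a1 * v + a0

averaged = ave_nwp([4.0, 6.0], [6.0, 8.0])
dynamics = pct_diff([10.0, 11.0], 0, 1)
energy = power_curve_forecast(10.0, (0.0, 0.0, 0.05, 0.0))
```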

Two of the additional inputs in Table 3 deserve explanation:
12A E_from_producer_turbine_power_curve: forecast of electric energy production calculated from the producer's turbine power curve, estimated as a polynomial of degree 3, with ave(GFS_ECMWF)_v_mod_0-1 as input data. For input data below the cut-in and above the cut-out wind speed of the turbine, the forecast value is equal to zero.
13A E_from_power_curve(scatter_plot): forecast of electric energy production calculated from a polynomial of degree 3 estimated as the turbine power curve, with ave(GFS_ECMWF)_v_mod_0-1 as input data. The power curve was estimated from the scatter plot between electric energy production and wind speed forecasts. For input data below the cut-in and above the cut-out wind speed of the turbine, the forecast value is equal to zero.

Forecasting Methods
This section describes the proposed forecasting methods. The research used single methods as well as advanced ensemble and hybrid methods. The Persistence Model described below serves as a benchmark for the quality of the other, more advanced forecasting methods.

Single methods, using only one individual predictor, are addressed next. The general scheme is presented in Figure 13.
Figure 13. General structure of single method.
Energies 2022, 15, 1252
Persistence model. The naïve model is the simplest forecasting model. In the Persistence Model, the forecast generation value is the same as the actual energy generation value from the same hour the day before. Forecasts are calculated by Formula (1):

ŷ_t = y_(t−24), (1)

where ŷ_t is the forecast electric energy generated by the wind farm for hour t and y_(t−24) is the energy generation for the period lagged by 24 h from the forecast period t.
Physical Model. This model forecasts generated hourly power as a function of wind speed. The function has the form of a 3rd-order polynomial. Two different methods were utilized to form the 3rd-degree polynomial, separately for Wind Farm A and Wind Farm B.
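Formula (1) can be sketched directly (a minimal illustration with hourly values stored in a plain list):

```python
def persistence_forecast(history):
    """Naive 24 h ahead forecast: each of the next 24 hours repeats
    the value observed at the same hour of the previous day."""
    if len(history) < 24:
        raise ValueError("need at least 24 hourly observations")
    return history[-24:]

# Two days of hourly generation; the forecast repeats the last day.
history = list(range(48))
forecast = persistence_forecast(history)
```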

• Physical model version 1. The polynomial of degree 3 is estimated based on the turbine power curve data from the manufacturer's catalogue (turbine power for wind speeds in 1 m/s steps). For input data below the cut-in and above the cut-out wind speed of the turbine, the forecast is equal to zero. The input data depend on the variant (ave(GFS_ECMWF)_v_mod_0-1, ave_ECMWF_v_mod_0-1 or ave_GFS_v_mod_0-1).
• Physical model version 2. The polynomial of degree 3 is estimated based on the scatter plot between electric energy production and wind speed forecasts. For input data below the cut-in and above the cut-out wind speed of the turbine, the forecast value is equal to zero. The input data are ave(GFS_ECMWF)_v_mod_0-1.
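Both versions reduce to a least-squares cubic fit over (wind speed, power) points, whether those points come from the manufacturer's catalogue or the scatter plot. A generic sketch of such a fit, solving the 4×4 normal equations directly (our own minimal implementation, not the paper's code):

```python
def fit_cubic(xs, ys):
    """Least-squares fit of y = a0 + a1*x + a2*x^2 + a3*x^3.
    Solves the 4x4 normal equations by Gaussian elimination
    with partial pivoting; returns [a0, a1, a2, a3]."""
    n = 4
    # Normal equations: A[i][j] = sum(x^(i+j)), b[i] = sum(x^i * y).
    A = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum((x ** i) * y for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (b[r] - s) / A[r][r]
    return coef

# Recover a known cubic from catalogue-style points (x, x^3).
coefs = fit_cubic([1, 2, 3, 4, 5, 6], [1, 8, 27, 64, 125, 216])
```

In practice, a library routine such as `numpy.polyfit` would do the same fit; the explicit version shows the computation behind both model versions.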

K-Nearest Neighbours Regression. KNNR is a non-parametric method used for regression problems [37]. The input of the model contains the k closest training examples in the feature space. The output of the KNNR model is the property value for the object, computed as the average of the values of its k nearest neighbours. The value of k (the number of nearest neighbours) is a hyperparameter that requires tuning; the other hyperparameter for tuning is the choice of the distance metric.
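The KNNR prediction rule can be sketched in a few lines (a minimal illustration using Euclidean distance; in practice a library such as scikit-learn's `KNeighborsRegressor` would be used):

```python
def knn_regress(train_X, train_y, query, k=3):
    """Predict the target as the average of the k nearest training
    examples (squared Euclidean distance in feature space)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_X, train_y)
    )
    return sum(y for _, y in dists[:k]) / k

# 1-D toy example: the 3 nearest neighbours of 1.2 are 1, 2, and 0.
pred = knn_regress([(0,), (1,), (2,), (10,)], [0, 1, 2, 10], (1.2,), k=3)
```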
Neural Network, Type MLP. The Multi-Layer Perceptron is a classical type of ANN. Widely used over decades, it has proved its applicability as an effective non-linear or linear global approximator [13,38]. It is a feedforward ANN, usually with an input layer, one or two hidden layers, and an output layer. Originally, it used the backpropagation algorithm for supervised learning. Over the years of development, other optimisation algorithms were applied to MLP learning; among them, the BFGS method was chosen as the learning algorithm in our research. The number of neurons in the hidden layer(s) was chosen as the main hyperparameter for tuning.
Support Vector Regression. The Support Vector Machine for regression (SVR) transforms the classification task into regression by defining a tolerance region of width ε around the target [39]. The hyperparameters of SVR for tuning are the following: the regularization constant C, the tolerance ε, and the parameter σ of the Gaussian kernel.
Deep Neural Network, Type LSTM. The main difference between LSTM and traditional RNNs is the LSTM's internal cell structure. Its hidden layers contain 3 gates, namely, the input, forget, and output gates. This solution makes it possible to control the flow of information, to deal with problems such as gradient explosion and vanishing, and to take long-term dependencies into account [10]. A typical LSTM network contains an input layer followed by up to two hidden layers, finished by an output layer, with dropout layers possible between layers. The goal of the dropout mechanism is to prevent overfitting by retaining each node in the network with a probability given by a Bernoulli distribution [40]. The LSTM model contains (among others) the following hyperparameters: the number of hidden layers and of neurons in them, the activation function in each layer, the number of training epochs, the batch size, the dropout degree, the type of model optimizer, and the learning rate.
Ensemble methods, using more than one individual predictor and supported by a simple or more complex integration system for the individual forecasts, are addressed next. The simplest integration system is weighted averaging of the individual predictors. The general scheme of establishing an ensemble of predictors is presented in Figure 14. An ensemble method can use the same type of method for all predictors (e.g., Random Forest, Gradient-Boosted Trees) or different types of predictors (e.g., different single machine learning methods as predictors).

Random Forest Regression. RF is an ensemble method based on many single decision trees (models of the same type). In the regression task, the prediction of a single decision tree is the average target value of all instances associated with the given leaf node [41]. The final prediction is the average value over all n single decision trees. The regularization hyperparameters depend on the algorithm used, but those generally restricted include: the maximum depth of a single decision tree, the maximum number of levels in each decision tree, the minimum number of data points placed in a node before the node is split, the minimum number of data points allowed in a leaf node, and the maximum number of nodes. The feature set for each of the n single decision trees is formed by a random choice of k features from all available features [41,42].
Gradient-Boosted Trees for Regression. Gradient boosting refers to an ensemble method that combines several weak learners into a strong learner [41]. GBT works by sequentially adding predictors (the same type of models) to the ensemble, each one correcting its predecessor: the method fits each new predictor to the residual errors made by the previous one. The final prediction is the sum of the contributions of all n single decision trees. In comparison with random forest, this method has one additional hyperparameter, the learning rate, which scales the contribution of each tree [42,43].
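As a hedged illustration of the two tree ensembles described above, the following sketch uses the scikit-learn implementations on synthetic data; the library choice, the hyperparameter values, and the cubic power-curve-like target are illustrative assumptions, not the paper's actual setup:

```python
# Sketch: RF vs. GBT regressors on synthetic wind-like data (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(500, 3))        # e.g. wind-speed forecast features
y = np.clip(X[:, 0] ** 3 / 1000, 0, 10)      # cubic, saturating power-curve shape

# RF: average of independently grown trees; depth/leaf limits regularize.
rf = RandomForestRegressor(n_estimators=100, max_depth=8,
                           min_samples_leaf=5, random_state=0).fit(X, y)

# GBT: trees added sequentially, each fitted to the predecessor's residuals;
# learning_rate scales each tree's contribution (the extra hyperparameter).
gbt = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0).fit(X, y)

print(rf.predict(X[:2]), gbt.predict(X[:2]))
```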
Ensemble Averaging Without Extremes. This method, developed by the authors of this study, involves deleting the minimum and maximum forecasts from the set of n single predictors (different types of methods) before each calculation of the final forecast, which is the average of the forecasts from the remaining n−2 single predictors. The deletion is executed 24 times, once for each hourly forecast separately. The choice of predictors for the ensemble is based on similar levels of forecasting error and mutually independent operation [9]. The final forecast result is calculated by Formula (2).
$$\hat{y}_i=\frac{1}{n-2}\left(\sum_{k=1}^{n}\hat{y}_i^{\,k}-\min_{1\le k\le n}\hat{y}_i^{\,k}-\max_{1\le k\le n}\hat{y}_i^{\,k}\right)\qquad(2)$$

where $i$ is the forecast point, $\hat{y}_i$ is the final forecast value, $\hat{y}_i^{\,k}$ is the value forecast by predictor number $k$, and $n$ is the number of predictors in the original ensemble before the removal of the extreme forecasts from the set of results.
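A minimal sketch of the averaging rule of Formula (2), assuming the n single forecasts are available as a NumPy array (names and data are illustrative):

```python
# "Ensemble Averaging Without Extremes": for each forecast point, drop the
# minimum and maximum of the n single predictions and average the remaining n-2.
import numpy as np

def ensemble_avg_without_extremes(preds):
    """preds: array of shape (n_predictors, n_points)."""
    p = np.asarray(preds, dtype=float)
    n = p.shape[0]
    total = p.sum(axis=0) - p.min(axis=0) - p.max(axis=0)
    return total / (n - 2)

# Example: 5 predictors, one forecast point.
preds = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
print(ensemble_avg_without_extremes(preds))   # the outlier 100.0 is discarded
```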
Weighted Averaging as an Integrator of Ensemble Based on nMAE and R. This method integrates the results of selected predictors (different types of methods) into the final verdict of the ensemble. The final forecast is defined as the average of the results generated by all n predictors in the ensemble and is calculated by Formula (3) [9,39]. This method reduces the variance of forecast errors. Predictors are included in the ensemble based on two important criteria:
• the time series of forecast residuals should be as mutually distant as possible (small R values);
• the smallest nMAE errors on the validation subset.
$$\hat{y}_i=\frac{1}{n}\sum_{j=1}^{n}\hat{y}_i^{\,j}\qquad(3)$$

where $i$ is the prediction point, $\hat{y}_i$ is the final predicted value, $\hat{y}_i^{\,j}$ is the value predicted by predictor number $j$, and $n$ is the number of hybrid predictors in the ensemble.
Hybrid methods, using two or more different methods connected in series, are addressed next.
Machine learning method with additional input data from two Physical models. This hybrid method is a cascade of the two different Physical models (version 1 and version 2) with one of the five ML methods (GBT, SVR, KNNR, MLP, or LSTM). The ML component uses both forecasts of electric energy production as additional inputs. The general scheme of this hybrid method is presented in Figure 15.
Machine learning method with additional input data from two Physic This hybrid method is a cascade of two different Physical models (version 1 a 2) with one of the five ML methods (GBT, SVR, KNNR, MLP, or LSTM). ML uses both forecasts of electric energy production as an additional input. T scheme of this hybrid method is presented in Figure 15. Physical model version 1 with input data as wind speed forecast from Boosted Trees method. This hybrid method consists of the Gradient-Boo method connected in series with Physical Model Version 1. The GBT metho wind speed, while Physical Model Version 1 forecasts electric energy productio Model Version 1 yielded smaller errors than the MLP and GBT methods consi The training and testing subsets differ from each other. The training subset speed based on the manufacturer's reversed turbine power curve (third-ord mial) as additional input. This allows the method to learn effective wind sp sponding to actual values of electric energy production. In turn, the testing ave(GFS_ECMWF)_v_mod_0-1 as its additional input, since electric energy p and thus effective wind speed, would be unobtainable during the operational models. The concept of this hybrid methods is based on the assumption tha learn better on a training subset containing a precise estimate of wind speed t containing wind speed forecasts with a large random component. The genera this hybrid method is presented in Figure 16. Physical model version 1 with input data as wind speed forecast from Gradient-Boosted Trees method. This hybrid method consists of the Gradient-Boosted Trees method connected in series with Physical Model Version 1. The GBT method predicts wind speed, while Physical Model Version 1 forecasts electric energy production. Physical Model Version 1 yielded smaller errors than the MLP and GBT methods considered here. The training and testing subsets differ from each other. 
The training subset uses wind speed derived from the manufacturer's reversed turbine power curve (a third-order polynomial) as an additional input. This allows the method to learn the effective wind speed corresponding to actual values of electric energy production. In turn, the testing subset uses ave(GFS_ECMWF)_v_mod_0-1 as its additional input, since electric energy production, and thus effective wind speed, would be unobtainable during the operational work of the models. The concept of this hybrid method is based on the assumption that GBT will learn better on a training subset containing a precise estimate of wind speed than on one containing wind speed forecasts with a large random component. The general scheme of this hybrid method is presented in Figure 16.
A summary description of the thirteen tested forecasting methods is shown in Table 4. The listed methods include four types of ensemble methods, two types of hybrid ones, and seven single methods. Six methods (single/ensemble) are machine learning (ML) methods, including one deep learning method.

Additional expert correction of forecasts. Since wind turbines produce no power below the lower and above the upper limits of wind speed, a unique expert correction method is proposed. Obviously, without verification, the use of the correction would be unjustified, as it applies to wind speed forecasts with a large random component instead of real-world wind speeds. Therefore, its effectiveness and validity are verified for the selected group of best-performing methods. A robust wind estimator, ave(GFS_ECMWF)_v_mod_0-1, is used as a conditional variable for the method to adjust for the bias of individual NWP models. Whenever the wind speed forecast ave(GFS_ECMWF)_v_mod_0-1 falls below the cut-in or above the cut-out wind speed of the wind turbine, the forecast electric energy production is corrected to zero. The final prediction with expert correction is calculated by Formula (4):

$$\hat{E}_i^{\,corr}=\begin{cases}\hat{E}_i, & v_{\min}\le \hat{v}_i\le v_{\max}\\ 0, & \hat{v}_i<v_{\min}\ \text{or}\ \hat{v}_i>v_{\max}\end{cases}\qquad(4)$$

where $\hat{E}_i$ is the predicted value (electric energy production), $\hat{E}_i^{\,corr}$ is the corrected prediction, $\hat{v}_i$ is the predicted wind speed (ave(GFS_ECMWF)_v_mod_0-1), and $v_{\min}$ and $v_{\max}$ are the cut-in and cut-out wind speeds of the turbine.
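The correction of Formula (4) can be sketched as follows; the cut-in and cut-out speeds shown (3 m/s and 25 m/s) are illustrative defaults, not the studied turbines' actual limits:

```python
# "Additional Expert Correction": zero the energy forecast wherever the wind
# speed forecast is below cut-in or above cut-out.
import numpy as np

def expert_correction(energy_pred, wind_pred, v_min=3.0, v_max=25.0):
    # v_min / v_max are example cut-in / cut-out speeds (turbine-specific).
    e = np.asarray(energy_pred, dtype=float).copy()
    v = np.asarray(wind_pred, dtype=float)
    e[(v < v_min) | (v > v_max)] = 0.0
    return e

print(expert_correction([1.2, 0.8, 2.0], [2.0, 10.0, 26.0]))  # → [0.  0.8 0. ]
```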

Evaluation Criteria
Three evaluation criteria are used to test the performance of the methods, including normalized Root Mean Square Error (nRMSE), normalized Mean Absolute Error (nMAE) and normalized Mean Bias Error (nMBE).
Normalized Root Mean Square Error, which is sensitive to large error values, is calculated by Formula (5):

$$\text{nRMSE}=\frac{1}{c_{norm}}\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^{2}}\cdot 100\%\qquad(5)$$

where $\hat{y}_i$ is the predicted value (electric energy production), $y_i$ is the actual value, $c_{norm}$ is the normalizing factor (the rated power of the wind farm), and $n$ is the number of prediction points. Normalized Mean Absolute Error, a risk metric corresponding to the expected value of the absolute error, is calculated by Formula (6):

$$\text{nMAE}=\frac{1}{n\,c_{norm}}\sum_{i=1}^{n}\left|\hat{y}_i-y_i\right|\cdot 100\%\qquad(6)$$
Normalized Mean Bias Error (nMBE) captures the average bias in predictions and is calculated by Formula (7); the forecasting method overestimates if nMBE > 0 and underestimates if nMBE < 0:

$$\text{nMBE}=\frac{1}{n\,c_{norm}}\sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)\cdot 100\%\qquad(7)$$
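The three normalized metrics can be sketched as follows (expressed as fractions rather than percentages; multiply by 100 to match the values reported in the paper):

```python
# Normalized error metrics; c_norm is the wind farm's rated power.
import numpy as np

def nrmse(y_pred, y_true, c_norm):
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.sqrt(np.mean((y_pred - y_true) ** 2)) / c_norm

def nmae(y_pred, y_true, c_norm):
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(np.abs(y_pred - y_true)) / c_norm

def nmbe(y_pred, y_true, c_norm):
    y_pred, y_true = np.asarray(y_pred), np.asarray(y_true)
    return np.mean(y_pred - y_true) / c_norm   # >0: over-, <0: under-forecasting

yp, yt = [1.0, 2.0, 3.0], [1.5, 2.0, 2.5]
print(nmae(yp, yt, c_norm=10.0))
```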
Errors nRMSE and nMAE are basic measures to evaluate the accuracy of proposed models, while nMBE is only auxiliary. In the process of forecasting electric energy production in a wind farm, the changes of nRMSE and nMAE have the same trend, and the smaller the two error values, the more accurate the prediction results. Both show random and systematic errors. A large gap between nMAE and nRMSE for the results of a method indicates that predicted values are extremely distant from the measured data [44,45].
The effectiveness of the forecasting approaches is assessed considering the uncertainty and variability of forecasts [46]. For a comparative assessment of the performance of the analysed methods, the Skill Score (SS) metric was used. The proposed Skill Score metric uses two error metrics, nRMSE and nMAE, and is calculated by Formula (8). Higher SS values indicate superior prediction quality. In Formula (8), nMAE_forecast and nRMSE_forecast are the errors of the analysed method, while nMAE_reference and nRMSE_reference are the errors of the reference method (persistence method, a naive model).
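Since Formula (8) itself is not reproduced here, the sketch below assumes one plausible way of combining the two error metrics relative to the persistence reference; the exact form used in the paper may differ, so treat this as an assumption rather than the paper's definition:

```python
# Hedged sketch of a Skill Score: 1 minus the combined error ratio versus the
# reference (persistence) method. A perfect forecast gives SS = 1; matching the
# reference gives SS = 0. The exact combination in Formula (8) is assumed here.
def skill_score(nmae_f, nrmse_f, nmae_ref, nrmse_ref):
    return 1.0 - (nmae_f + nrmse_f) / (nmae_ref + nrmse_ref)

print(skill_score(0.10, 0.15, 0.20, 0.30))
```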

Results and Discussion
The range of the acquired data was identical for both wind farms and spanned from 4 April 2017 to 10 October 2019, about 29 months in total. The data were divided into three subsets: a training subset, a validation subset, and a test subset. The training and validation subsets for the period from 4 April 2017 to 30 September 2018 (17 months) were chosen at random (85% and 15%, respectively). The training subset is used for the estimation of model parameters. The validation subset is used for tuning the hyperparameters of some of the methods. The last part of the data (from 1 October 2018 to 1 October 2019, i.e., 12 months) constituted the test subset used for a one-time final evaluation of the quality of specific prediction methods on data covering all seasons.
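A sketch of the described split, assuming an hourly pandas DataFrame (the random seed and synthetic values are illustrative):

```python
# Chronological split: Apr 2017 - Sep 2018 shuffled into 85% training / 15%
# validation; Oct 2018 - Oct 2019 kept as the held-out test year.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
idx = pd.date_range("2017-04-04", "2019-10-01", freq="h")
df = pd.DataFrame({"energy": rng.random(len(idx))}, index=idx)

dev = df.loc[:"2018-09-30"]                       # training + validation period
test = df.loc["2018-10-01":]                      # final 12-month test subset
shuffled = dev.sample(frac=1.0, random_state=42)  # random 85/15 split
n_train = int(0.85 * len(shuffled))
train, val = shuffled.iloc[:n_train], shuffled.iloc[n_train:]
print(len(train), len(val), len(test))
```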
Predictions were conducted sequentially, from single methods with a limited number of input variables, through hybrid methods, to ensemble methods. Such a procedure allows us to observe differences in the quality of results depending on the complexity of particular methods and the range of input variables used. The research was done in steps in order to verify different hypotheses and to find an optimal input dataset and the best group of prediction methods.
Step 1. Hypotheses verification:
• Are the accuracies of the two designed physical models different from each other?
• Is it better to use one or two NWP models with wind speed forecasts for the physical models?
Tables 5 and 6 contain the forecast results for Wind Farms A and B, respectively. The Physical and Persistence (reference) Models were used for these predictions. The results of the two Physical Models indicate that PHYS_v1 was better fitted for both wind farms, while the results for the NWP models were ambiguous. Although using only the GFS model was clearly the least favourable option, for Wind Farm A it was better to use both NWP models, while for Wind Farm B it was better to use the ECMWF model only. In comparison, the nMAE of the Persistence Model was about twice that of both Physical Models.
Step 2. Hypotheses verification:
• Is it better to use NWP point forecasts for hourly lags −3, −2, −1, 0, 1, 2, 3 (original contribution) as input data instead of the typically used lags 0, 1?
• Is it better to use one or two NWP models as the input data source?
To verify the above, the strong GBT method, recommended by multiple papers, was used. Tables 7 and 8 present the resulting forecasts for Wind Farms A and B using the proposed GBT method with different versions of NWP input data. The research in step 2 demonstrated that the ranking of results obtained with the same combinations of input data was the same for both wind farms. The best accuracies were achieved by using both NWP models. The application of the novel and original idea of using point forecasts for hourly lags −3, −2, −1, 0, 1, 2, 3 yields clearly better results than using the typical lags 0, 1. As with the Physical Models, forecasts for Wind Farm B were less accurate than for Wind Farm A. Preliminary studies analysing the importance of input data also indicated a slightly weaker correlation between NWP forecasts for Wind Farm B than for Wind Farm A. These findings were used in the further research steps; hence, the subsequent versions of forecasts use predictions from both NWP models and point forecasts for hourly lags −3, −2, −1, 0, 1, 2, 3.
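The lagged input set can be sketched with pandas shift operations; the column name v_nwp is illustrative:

```python
# Building the lagged NWP input set: for each forecast hour t, the model also
# receives the NWP point forecasts at t-3 ... t+3.
import numpy as np
import pandas as pd

nwp = pd.DataFrame({"v_nwp": np.arange(10.0)})
lags = range(-3, 4)                      # -3, -2, -1, 0, 1, 2, 3
feats = pd.concat(
    {f"v_nwp_lag{l}": nwp["v_nwp"].shift(l) for l in lags}, axis=1
)
print(feats.iloc[4].tolist())
```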
Step 3. This step is the main, most extensive, and most labour-intensive part of the research. Forecasts of energy production were obtained from different single, hybrid, and ensemble models, including the original methods. To find proper hyperparameters for them, more than 300 hyperparameter combinations were tested using the Grid Search method. The lowest nMAE score on the validation range was used as the selection criterion. The hyperparameter search ranges and the values determined for the chosen methods are summarized in Table A1 in Appendix A. The described determinations were carried out to verify the following:
• Which method group yields the lowest prediction errors (recommended methods), and does the original Ensemble Averaging Without Extremes method developed by us belong to the recommended methods?
• Does the original proposition developed by us, additional input variables (see Table 3), reduce the prediction error?
• Does the original proposition developed by us, additional expert correction, reduce the prediction error?
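A sketch of the step-3 tuning procedure using scikit-learn's GridSearchCV with a negative-nMAE scorer on synthetic data; note that the paper selected hyperparameters on a single validation subset, whereas this sketch uses cross-validation folds for brevity:

```python
# Exhaustive grid search with (negative) nMAE as the selection criterion.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.uniform(0, 25, size=(200, 3))
y = np.clip(X[:, 0] ** 3 / 1000, 0, 10)

def neg_nmae(y_true, y_pred, c_norm=10.0):   # negated: higher is better
    return -np.mean(np.abs(y_pred - y_true)) / c_norm

grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3],
        "n_estimators": [50, 100]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0), grid,
                      scoring=make_scorer(neg_nmae), cv=3).fit(X, y)
print(search.best_params_)
```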
Tables 9 and 10 present forecasts for Wind Farms A and B resulting from the proposed single, ensemble and hybrid methods with different sets of input data. For the two best methods, results are shown with and without additional expert correction (see Formula (4)).
Tabular results were ordered by descending SS metric, which was taken as the main determinant of prediction quality, as it takes into account both nMAE and nRMSE errors.
Based on the results from Tables 9 and 10, the following conclusions can be drawn regarding the proposed single, hybrid, and ensemble methods with different sets of input data:

• The original method called "Ensemble Averaging Without Extremes" is the best for both wind farms (SS metric and nMAE error);
• For "Ensemble Averaging Without Extremes", the best-fitted solution was an ensemble of 5 methods, while a 3-method ensemble was the best for "Weighted Averaging As an Integrator of Ensemble";
• Our original method "Additional Expert Correction" resulted in a lower nMAE than predictions without correction;
• […] exception, as it yielded the lowest nMAE with a highly reduced number of input data variables;
• Values of nMBE were very low for the analysed methods, which means there was no systematic error in predictions. The lowest nMBE for both wind farms was achieved with the RF method;
• Prediction errors for Wind Farm B were bigger than for Wind Farm A, which was indicated by the results of the sensitivity analysis of potential input variables (see Figure 10).

Figures 17-20 provide two forecasts of electric energy generation for Wind Farm A made by the best method with additional expert correction for two consecutive days of each season (from autumn to summer). Figures 21-24 provide the corresponding two forecasts of electric energy generation for Wind Farm B.

Figures 17-24 show that the energy generation of both wind farms on the presented days (16 in total) is highly random. For some hours of certain days, generation is periodically close to its rated value, while for other hours generation is very low. There are also a few hour-long periods of null generation. The lowest generation and predictions among the presented 16 days occurred on 4 days of the summer months (Figures 20 and 24). It should be noted that the generation predictions have periods of both over- and under-forecasting, most commonly observable on a few consecutive samples of the time series. Moreover, the time series of generation predictions have a slightly smoothened course due to the use of the ensemble method, as ensemble methods reduce the variance of forecasts. For "Ensemble Averaging Without Extremes", the additional removal of extreme forecasts before the average forecast calculation further enhances this smoothening effect.

For both wind farms, an additional analysis of the nMAE error distribution was made. It concerned hourly periods of prediction using the best forecasting method, "Ensemble Averaging Without Extremes". The goal of the analysis was to determine whether the error magnitude depends on the forecast horizon (from 1 to 24 h) and the time of the day. Figure 25 shows the forecast error (nMAE) depending on the forecast horizon for the test subset for Wind Farm A and Wind Farm B. The nMAE values presented in Figure 25 are visibly greater for Wind Farm B, which complies with the results from Tables 9 and 10; the nMAE equals 11.3055% and 13.7552% for Wind Farms A and B, respectively. The distribution of error values shown in Figure 25 and the distribution of average energy production in individual hours shown in Figure 4a,b are very similar for both wind farms. For Wind Farm A the correlation coefficient equals 0.9331, and for Wind Farm B it equals 0.9291; both correlation coefficients are statistically significant (5% significance level). This phenomenon is related to a strong non-linear relationship between the energy forecast error and the wind speed forecast error, which results from the fact that the energy generated by a wind source is a third-degree polynomial of the wind speed.

Conclusions
Using two wind farms for statistical analyses and forecasting considerably improves the credibility of the newly created prediction methods and the formulated conclusions. The results of the study are summarized below.
Original ensemble methods, developed for researching specific implementations, reduced errors of energy generation forecasts for both wind farms as compared to single methods. The best integration system for ensemble methods for accuracy measure nMAE is a new, original integrator developed for predictions, called "Ensemble Averaging Without Extremes" (method code INT_OUT_EXT), with five methods in the ensemble. The best integration system for ensemble methods for accuracy measure nRMSE is an original integrator developed for predictions called "Weighted Averaging As an Integrator of Ensemble" (method code INT_AVE) with three methods in the ensemble.
A new, original "Additional Expert Correction" reduced errors of energy generation forecasts for both wind farms. Deep neural network LSTM is the best single method, MLP is the second best, while using SVR, KNNR, and Physical model is less favourable for both wind farms. Hybrid methods have worse accuracy measures using nMAE and nRMSE than ensemble methods for both wind farms.
Using meteorological forecasts from two NWP models (ECMWF and GFS) as input data yields better results than using a single NWP model. Using NWP point forecasts for hourly lags −3, −2, −1, 0, 1, 2, 3 (original contribution) as input data is better than using the typical lags 0, 1. Using the additionally created input data, especially input data numbers 1A, 2A, and 3A, reduces the prediction errors of most methods in comparison with the base input variables (input data numbers 1-68).
For both wind farms, a strong positive correlation was determined between the distribution of average energy production in particular hourly periods and the distribution of prediction errors (nMAE). Identifying this relationship is valuable, practical information concerning the expected value of the prediction error depending on the time of the day: the greater the average generation for a given hour, the greater the expected prediction error (nMAE). For both analysed wind farms, the greatest prediction errors are expected during evening hours, while the lowest errors are expected between 8:00 a.m. and 2:00 p.m.
Using the original SS metric to compare prediction accuracy is useful, as it allows both nMAE and nRMSE to be incorporated into the final quality assessment. Both measures are important for the end user of the prediction, as the former reflects the average error, while the latter is sensitive to large forecast errors.
More research is needed to verify, among other things, the following:
• Does prediction accuracy depend on using forecasts from more than one spot of a medium-sized wind farm, and how will forecast accuracy be affected by other factors, e.g., data with higher time resolution (such as 15 min) or using real weather measurements as input data?
• Can measurements and weather predictions be used to create a weather-error-sensitive switching regime for models of ensembles?
• Is it possible to reduce the level of the high random component in predictions?
• Will the proposed, original method of "Ensemble Averaging Without Extremes" be equally good for different types of RES predictions (e.g., photovoltaic systems and hydropower systems)?