Evaluation Metrics for Wind Power Forecasts: A Comprehensive Review and Statistical Analysis of Errors

Power generation forecasts for wind farms, especially with a short-term horizon, have been extensively researched due to the growing share of wind farms in total power generation. Detailed forecasts are necessary for the optimization of power systems of various sizes. This review and analytical paper focuses largely on a statistical analysis of forecasting errors based on more than one hundred papers on wind generation forecasts. Factors affecting the magnitude of forecasting errors are presented and discussed. Normalized root mean squared error (nRMSE) and normalized mean absolute error (nMAE) have been selected as the main error metrics considered here. A new and unique error dispersion factor (EDF) is proposed, defined as the ratio of nRMSE to nMAE. The variability of EDF depending on selected factors (size of wind farm, forecasting horizon, and class of forecasting method) has been examined; this is original research, a novelty in studies on errors of power generation forecasts for wind farms. In addition, extensive quantitative and qualitative analyses have been conducted to assess the magnitude of forecasting error depending on selected factors (such as forecasting horizon, wind farm size, and class of forecasting method). Based on these analyses and a review of more than one hundred papers, a unique set of recommendations on the preferred content of papers addressing wind farm generation forecasts has been developed. These recommendations would make it possible to conduct very precise benchmarking meta-analyses of forecasting studies described in research papers and to develop valuable general conclusions concerning the analyzed phenomena.


Introduction
The forecasting of power generation in wind farms has been an extensively explored research topic [1][2][3][4][5][6][7][8]. The growing significance of renewable energy sources (RES) and the remarkably dynamic growth of wind farms in most countries have highlighted the importance of accurate power generation forecasts due to, e.g., the increased contribution of wind farms to the overall power system. Cost-efficient and optimized management of a power system requires RES generation forecasts of the best possible accuracy. System operation optimization processes include scheduling the operation of fossil-based sources, scheduling maintenance works in the power grid, and preventive and remedial maintenance of the RES themselves. In addition to obtaining forecasts with the best accuracy, estimating the errors of these forecasts also proves important, as it translates into maintaining appropriate safety margins.
Forecasting purposes vary by time horizon [4,5,7]. The time horizon, also called the planning horizon, is a fixed point in the future at which a certain process will be evaluated or assumed to have ended. In wind energy forecasting, the time horizon affects the choice of forecasting method.

Factor: Description/Influence on Error Level

Time horizon: With increasing forecasting horizon, forecasting errors grow significantly, mainly due to the falling quality of NWP forecasts [5,6,8,10,11].
Forecasting method (complexity): Complex (ensemble or hybrid) models typically have lower forecasting errors than single methods (more details in Section 2) [1][2][3][4][5,8].
Size of system: The inertia of power production usually grows with system size. This translates into more predictable production, especially in shorter horizons.
Site (onshore/offshore): Forecasting errors for offshore wind farms should be lower than for onshore wind farms due to the distinctive characteristics of weather conditions (more stable and higher wind speeds).
Landscape: Forecasting errors for farms located in rough terrain can depend on the local landscape and meteorological features (e.g., a forest, hills, or a lake in the direct neighborhood). Terrain with as little roughness as possible is best, as it guarantees optimal generation of wind energy [12].
Location of NWP forecasting points: NWP forecasts from points more distant from the wind farm can generate larger forecasting errors. The optimal selection of NWP forecasting points and their location relative to the farm is the subject of ongoing studies [12].
Types and quantities of input data: The more information related to power generation can be used in the model, the more accurate the generation forecasts can be expected to be. Using NWP as input data is especially important (without it, generation forecast errors grow very sharply), and for horizons of more than several hours, NWPs are virtually indispensable [1,2,4,13].
NWP data sources: NWP data can vary in quality (forecasting accuracy); the more accurate the NWP forecasts, the lower the power generation forecasting error [2].
Quantity of training sets: The model should use training data encompassing at least one year. With a growing number of years, the model uses more information and better represents the seasonality and daily variability of the process (a smaller forecasting error can be expected).
Data preprocessing: Properly conducted steps to clean up and process raw input data can reduce forecasting error [3].
Data postprocessing: Elimination of impossible situations, e.g., negative power forecasts or forecasts determined as unlikely for specific input data. Elimination of such cases reduces the error [13].
Measurement data availability lag: Using up-to-date, current data allows identification and correction of errors "on the fly," e.g., using switchable forecasting models.

More attention should be paid to forecast model inputs because, whereas some factors, such as location, size of the system, or forecasting horizon, cannot be changed, the biggest reduction in error can be achieved by appropriate selection of input data, i.e., a selection that encompasses as much information as possible related to the forecast power generation time series.
Regarding the selection of input data itself, both statistical analysis and a semi-machine approach can be applied. Statistical analysis using various tests can help to draw conclusions on data interdependencies; however, the time it requires makes it impractical for big datasets. If larger quantities of data are available, expert selection of a pool of input data combinations and their subsequent review can prove to be a more practical approach. The choice of solution depends significantly on the tool; the time spent on statistical analysis can bring more benefits for tools that usually require greater optimization, e.g., ANN models.
Forecasting models require input data to predict wind power generation. The data format used by forecasting models needs to be relevant to the model itself, i.e., it must consider which external phenomena have a direct impact on wind generation. This data can be divided into NWP data and time series [14].
NWP is a multivariate dataset based on a set of physical models used to simulate conditions in the atmosphere; these models are available on both local and global scales. An NWP dataset contains information generated by power metering and predictions of several meteorological variables, such as wind speed, wind direction, temperature, humidity, air pressure, time of the day, day of the year, etc. [15,16]. As NWP is a general dataset, the main factor that affects wind-generated power is wind speed [17,18].
The time series is a univariate dataset of wind speed or wind power measured at timestamps over a certain period. To obtain a wind speed time series, a mast is usually installed at the wind farm, with an anemometer mounted at hub height.
Decomposition methods often used during the forecasting process are based on the premise that the wind power time series contains different frequency signals with different characteristics, and that modeling each of the decomposed series separately can lead to an overall improvement in forecast quality [8]. Popular techniques are discrete wavelet transform (DWT), empirical mode decomposition (EMD) [3,8], ensemble empirical mode decomposition (EEMD) [6], variational mode decomposition (VMD) [4,8], and wavelet packet transform (WPT).
Machine learning models can use NWP data as features and/or time series as inputs, where NWPs are features carrying information related to the expected output wind speed. The input matrix X contains historical wind inputs, weather information, and time data. The output vector y contains the time series and multiple values of predicted wind power with a changing prediction horizon [19].
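As a minimal sketch of this input/output layout (the function, its names, and the simple lag scheme are our own illustration, not taken from [19]):

```python
import numpy as np

def build_dataset(wind_power, nwp_speed, nwp_temp, hour_of_day, horizon=1, lags=3):
    """Assemble an input matrix X (lagged historical power, NWP weather
    features, time data) and an output vector y of power `horizon`
    steps ahead."""
    X, y = [], []
    for t in range(lags, len(wind_power) - horizon):
        row = list(wind_power[t - lags:t])                      # historical wind inputs
        row += [nwp_speed[t + horizon], nwp_temp[t + horizon]]  # weather information
        row += [hour_of_day[t + horizon]]                       # time data
        X.append(row)
        y.append(wind_power[t + horizon])
    return np.array(X), np.array(y)
```

Any regression model can then be fitted to (X, y); real studies add many more NWP variables and calendar encodings.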
When NWP data is not available as input, common practice is to use autoregression of the output variable. In some works, measured values of other variables from the recent past are also used, e.g., measured instantaneous rotor speed or measured weather parameters for a few hours in the past [20,21]. This makes it possible to take into account recent generation trends and a farm's short-term generation inertia. A practical drawback of this approach is that it cannot be used for forecasts with horizons longer than a few hours ahead. Another possible approach is using measurements from one farm to forecast another [22]. In this case, the final result depends on the weather similarity between the source and destination farms and on the tools used to translate generation from one to the other.
When only the wind speed time series is known, a technique called feature engineering can be used to fabricate new features. The goal of this technique is to derive features by executing simple calculations on the known feature, the wind speed time series, such as the standard deviation, average, minimum, or maximum wind speed over a period of time [20,23].
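A sketch of such feature fabrication from the wind speed series alone (the window length is an arbitrary choice for illustration):

```python
import numpy as np

def rolling_features(wind_speed, window=4):
    """Fabricate features from the wind speed series alone: rolling
    mean, standard deviation, minimum, and maximum over `window` steps."""
    n = len(wind_speed)
    feats = np.full((n, 4), np.nan)  # rows before a full window stay NaN
    for t in range(window - 1, n):
        w = np.asarray(wind_speed[t - window + 1:t + 1], dtype=float)
        feats[t] = [w.mean(), w.std(), w.min(), w.max()]
    return feats  # columns: mean, std, min, max
```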
In practice, the choice of the source and types of NWP forecasts strongly depends on the data cost-to-quality ratio; nevertheless, it is worthwhile to maximize both the number of NWP models with various densities of forecasting points and the number of weather parameters derived from them [12]. In some cases, the application of various models, or even bundles of models, is recommended due to the diverse information content of different models. For instance, for long-term forecasts, NWP climatic models differ from each other, and identification of the best one can be not only difficult but virtually impossible; drawing any conclusion can require an aggregation analysis of various scenarios. On the macro scale, it is equally important to select a proper forecasting point from which the meteorological variables are derived. Hence, a growing trend of extracting spatial information using various tools, e.g., CNNs, has appeared in some research papers [24][25][26].

Objective and Contribution
The main objectives of this paper can be summarized as follows:
• classify wind power forecasting techniques;
• provide a unique description of major factors affecting wind power forecasts;
• describe the performance of forecasting models;
• conduct a comprehensive review (quantitative analysis) based on more than one hundred papers;
• conduct a statistical analysis of errors (qualitative analysis).
Below are listed selected contributions of this paper:
• a proposal for a novel, unique ratio, called EDF;
• an analysis of the variability of the new EDF ratio depending on selected characteristics (size of wind farm, forecasting horizon, and class of forecasting method), and the original, novel conclusions drawn from this analysis;
• the development of a unique list of recommended content of papers addressing wind farm generation forecasts (the application of these recommendations would make it possible to conduct very accurate meta-analyses comparing various forecasting studies).
The remainder of this paper is organized as follows: Section 2 presents the classification of wind power forecasting techniques. Section 3 describes the performance of the forecasting model. Section 4 is the main part of the paper and includes a comprehensive statistical analysis (quantitative and qualitative). Discussion is provided in Section 5, and Section 6 draws the main conclusions. References are listed at the close of this paper.

Classification of Wind Power Forecasting Techniques
The following alternative methodologies are applied to wind power forecasting: naive, physical, statistical, and AI/ML methods (see Table 3). Statistical models have high precision in very short-term prediction [27]. The most used statistical model for wind forecasting is the time series model, because future levels of wind power depend on weather features but can also depend on prior values of generated wind power. The amount of wind power produced in the current hour affects the amount of wind power generated in the next hour. These models can determine conditions in time based on relationships between parameters. However, they depend on pre-set coefficient values.
AI and ML models are suitable for systems that are more complex to model, as they attempt to discover underlying relationships, and they are widely used to accurately predict wind power. Without an a priori structural hypothesis relating wind power to several historical meteorological variables, they offer strong generalization and fast computation [18,28].
Each approach mentioned above can have a high forecasting error due to inherent weaknesses, especially when wind speeds have significantly non-linear characteristics, as volatility causes complex fluctuations. In particular, the conventional single ANN model has the drawbacks of falling into local minima and overfitting, and its performance can be influenced by the initial parameters. These weaknesses cannot be easily remedied with a single method. To reduce forecasting error and obtain advanced models that can achieve higher accuracy, combinations of the methods described in Table 4 are used.
Ensemble forecasts are generated through the application of various machine learning techniques whose outputs are then merged, which reduces the risk of overestimation and aims at preserving the diversity of models. The ensemble technique can be applied in both cooperative and competitive styles.
In a cooperative ensemble, the dataset is divided into data subsets, each subset being forecast individually and then aggregated with the other sub-forecasts [29]. This technique is computationally lightweight due to less need for parameter tuning and is generally used for very short-term or short-term forecasting. Competitive ensembles build individual forecasting models with different parameters and initial values, and the results are obtained by aggregating the forecasts with techniques such as the Bayesian model average. This technique, used in [30], can cover a larger dataset and can achieve early detection of a large wind ramp before the changes in wind speed propagate to other locations. However, it is considered computationally expensive and is mostly used in medium-term and long-term forecasting.
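As an illustration of the competitive style, member forecasts can be aggregated by weighting each model by its validation error; this inverse-error weighting is our own simple stand-in for the Bayesian model average mentioned above, not the scheme of [30]:

```python
import numpy as np

def competitive_ensemble(member_forecasts, validation_errors):
    """Aggregate forecasts from independently trained members,
    weighting each member inversely to its validation error."""
    w = 1.0 / np.asarray(validation_errors, dtype=float)
    w /= w.sum()  # normalize weights to sum to one
    return np.average(np.asarray(member_forecasts, dtype=float), axis=0, weights=w)
```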
To obtain an advanced model with higher accuracy, hybrid forecasting models combine the advantages of different methods, each with its individual superior features [31]. The overall forecasting effectiveness of hybrid methods can be improved, since hybrid methods can overcome the limitations and exploit the merits of individual models by integrating two or more types of models [28].
A neural network can be used in different steps of the algorithm. For example, a CNN-based model using transfer learning addresses the problem of newly constructed farms lacking sufficient historical wind speed data to train a well-performing model by producing synthetic data [32]. In [26], the CNN is trained in layers to extract local features and relationships between the nodes, and the output layer of the CNN is set in multiple dimensions to directly forecast future wind speed.
The most common approach is to adopt the machine learning algorithm as the main forecasting tool and to perform data treatment using general techniques, as shown in [33]: variational mode decomposition (VMD) of the raw wind power series into a number of sub-layers with different frequencies; K-means, as a data mining approach, splitting the data into an ensemble of components with a similar fluctuation level for each sub-layer; and LSTM adopted as the principal forecasting engine to capture the unsteady characteristics of each component. Some authors also combine hybrid and ensemble approaches into one [34], using a hybrid technique of intelligent and heuristic algorithms that includes neural networks, wavelet transform, diverse heuristic algorithms, and fuzzy logic. The hybrid technique uses the wavelet transform to filter distortions and noise in wind power signals, with the radial basis function (RBF) neural network used as a preliminary predictor to find local solutions. With the local solution, an ensemble combining three MLP neural networks using various learning methods, along with the heuristic WIPSO, is used for the final prediction and modeling of the non-linear behavior of the wind power curve.

RMSE, MAE, and MAPE as Frequently-Used Metrics
The root mean square error (RMSE), given by Formula (1), is a quadratic scoring rule that estimates the average magnitude of error. It is the most standard function used to calculate the difference between predicted and observed values, since it reflects the level of differences between the actual and forecast values, in other words, the absolute magnitude of prediction error [35]. However, RMSE is sensitive to outliers, so its outcome can be biased if the data is not clean [36].

RMSE = sqrt( (1/N) Σ (ŷi − yi)² )    (1)

where ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples. A smaller RMSE means that the proposed model performs better.
The mean absolute error (MAE), Equation (2), corresponds to the estimated level of absolute error. It indicates the average magnitude of the difference between the actual value and the predicted value [37].

MAE = (1/N) Σ |ŷi − yi|    (2)
where ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples. MAE is not susceptible to outliers and can better reflect the actual state of prediction errors [38]. The model is deemed accurate when MAE is close to zero.
The mean absolute percentage error (MAPE), Equation (3), calculates the percentage error relative to the actual value, expressed as an average ratio, and is also commonly used to compare different models [36,39].

MAPE = (100%/N) Σ |(ŷi − yi)/yi|    (3)

where ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples.
Although RMSE is usually used to express the dispersion of the results, MAE and MAPE can indicate the deviation of the prediction [17]. The smaller the values of RMSE, MAE, and MAPE, the more accurate the forecasting model.
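For reference, the three metrics can be implemented directly from their definitions (a sketch; actual power values are assumed to be non-zero for MAPE):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Formula (1)."""
    d = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.sqrt(np.mean(d ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, Equation (2)."""
    d = np.asarray(y_pred, float) - np.asarray(y_true, float)
    return float(np.mean(np.abs(d)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, Equation (3), in percent."""
    y_t, y_p = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs((y_p - y_t) / y_t)))
```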

MSE, nMAE, nRMSE, and R2 As Occasionally Used Metrics
The mean squared error (MSE), Equation (4), simply averages the squared difference between the estimated and original values [40], which avoids the problem of errors canceling each other out and accurately reflects the actual prediction error [35].

MSE = (1/N) Σ (ŷi − yi)²    (4)

where ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples.
To quantitatively examine the prediction performance of some models, authors sometimes need to normalize MAE and RMSE; their normalized forms are given by the normalized mean absolute error (nMAE), Equation (5), and the normalized root mean squared error (nRMSE), Equation (6).

nMAE = (1/N) Σ ( |ŷi − yi| / Ci )    (5)

nRMSE = sqrt( (1/N) Σ ( (ŷi − yi) / Ci )² )    (6)
where Ci is the operating capacity at time point i, ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples. In general, smaller values of these metrics indicate that the corresponding solution offers less deviation of prediction performance [41].
The R-square or coefficient of determination (R2), Equation (7), is the proportion of the variance in the dependent variable that is predictable from the independent variable(s) [42,43]. It indicates the level of correlation between the predicted and actual values and helps to select the model with the highest forecasting accuracy [44]. It is mostly used for datasets of large amplitudes [17].

R2 = 1 − Σ (yi − ŷi)² / Σ (yi − ȳ)²    (7)
where ȳ is the average of the actual values, ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples. An R2 closer to one indicates more accurate forecasting. It can also be negative, denoting an arbitrarily worse predicting model [20].
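The normalized metrics, and the EDF ratio proposed in this paper (EDF = nRMSE/nMAE), can be sketched as follows, assuming for simplicity a constant operating capacity:

```python
import numpy as np

def nrmse(y_true, y_pred, capacity):
    """Normalized RMSE, Equation (6), with errors scaled by capacity."""
    e = (np.asarray(y_pred, float) - np.asarray(y_true, float)) / capacity
    return float(np.sqrt(np.mean(e ** 2)))

def nmae(y_true, y_pred, capacity):
    """Normalized MAE, Equation (5), with errors scaled by capacity."""
    e = (np.asarray(y_pred, float) - np.asarray(y_true, float)) / capacity
    return float(np.mean(np.abs(e)))

def edf(y_true, y_pred, capacity):
    """Error dispersion factor: the ratio nRMSE / nMAE (always >= 1,
    since the root mean square bounds the mean absolute value)."""
    return nrmse(y_true, y_pred, capacity) / nmae(y_true, y_pred, capacity)
```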

R, PICP, PINAW, sMAPE, MRE, and TIC As Seldom Used Metrics
The Pearson linear correlation coefficient (R or CC), Equation (8), is a metric that determines the relationship between inputs and outputs by measuring the linear dependence between results and observations [9,37].

R = Σ (ŷi − ŷmean)(yi − ymean) / sqrt( Σ (ŷi − ŷmean)² · Σ (yi − ymean)² )    (8)

where ŷi is the predicted value, yi is the actual value, ŷmean and ymean are the averages of the predicted and actual values, and N is the number of prediction points or samples. The possible R score ranges between 1 and −1, with 1 representing the strongest positive correlation and −1 the strongest negative one [45].
Prediction interval coverage probability (PICP), Formula (9), measures the ability of the constructed confidence interval to cover the target values for prediction intervals.

PICP = (1/N) Σ ci,  where ci = 1 if Li ≤ yi ≤ Ui and ci = 0 otherwise    (9)
where Li and Ui are the lower and upper bounds, respectively, of the prediction interval, yi is the actual value, and N is the number of prediction points or samples. The greater the PICP, the more reliable the prediction values [46,47].
Prediction interval normalized average width (PINAW), Equation (10), is used to measure the width of the prediction intervals for a given length of the prediction interval.

PINAW = (1/(N (tmax − tmin))) Σ (Ui − Li)    (10)
where tmin and tmax are the minimum and maximum values of the predicted values, and Li and Ui are the lower and upper bounds, respectively, of the prediction values [48].
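Both interval metrics can be sketched as follows (here PINAW is normalized by the range of the target series, a common variant; the text above normalizes by the range of the predicted values instead):

```python
import numpy as np

def picp(y_true, lower, upper):
    """Fraction of actual values covered by the prediction intervals, Equation (9)."""
    y = np.asarray(y_true, float)
    covered = (np.asarray(lower, float) <= y) & (y <= np.asarray(upper, float))
    return float(np.mean(covered))

def pinaw(y_true, lower, upper):
    """Average interval width normalized by the data range, Equation (10)."""
    y = np.asarray(y_true, float)
    width = np.asarray(upper, float) - np.asarray(lower, float)
    return float(np.mean(width) / (y.max() - y.min()))
```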
The symmetric mean absolute percentage error (sMAPE), Formula (11), a variation of MAPE, is used to describe the relative error of a set of forecasts and their labels as a percentage [36,37].

sMAPE = (100%/N) Σ |ŷi − yi| / ((|yi| + |ŷi|)/2)    (11)

where ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples.
The Theil inequality coefficient (TIC), Equation (12), measures how closely the predicted series follows the actual series, with values between 0 (perfect fit) and 1.

TIC = sqrt( (1/N) Σ (ŷi − yi)² ) / ( sqrt( (1/N) Σ ŷi² ) + sqrt( (1/N) Σ yi² ) )    (12)

where ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples. The smaller the TIC value, the stronger the prediction ability [50].
The mean relative error (MRE), Equation (13), calculates the magnitude of the difference between predicted and actual values [51], where ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples.
The mean bias error (MBE), Equation (14), gives the average bias of the prediction. It is used to determine whether the predicted value is underestimated (MBE < 0) or overestimated (MBE > 0) [1,2,52].

MBE = (1/N) Σ (ŷi − yi)    (14)

where ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples. This metric is useful for identifying the need to add extra calibration steps to the model.

Interesting Usage of Other Metrics
Some studies use prediction accuracy metrics such as MSE, or combinations of MAE and RMSE, to create the fitness function, as the fitness function directly affects the convergence of the algorithms and the optimal solution [17,35,49].
MAPE is a commonly used evaluation metric, but it generates infinite values when the actual value yi is zero or close to zero. To avoid this problem, the mean arctangent absolute percentage error (MAAPE), Equation (15), is used.

MAAPE = (1/N) Σ arctan( |(yi − ŷi)/yi| )    (15)

where MAAPE ranges from 0 to π/2, ŷi is the predicted value, yi is the actual value, and N is the number of prediction points or samples. A smaller MAAPE indicates a smaller forecasting error [9].
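A direct implementation of Equation (15) (a sketch):

```python
import numpy as np

def maape(y_true, y_pred):
    """Mean arctangent absolute percentage error; bounded by pi/2
    even when actual values approach zero."""
    y_t, y_p = np.asarray(y_true, float), np.asarray(y_pred, float)
    with np.errstate(divide="ignore"):  # |e/0| -> inf, and arctan(inf) = pi/2
        return float(np.mean(np.arctan(np.abs((y_t - y_p) / y_t))))
```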
To compare the predictive performance of the models, promoting percentages (P) are applied to different metrics, Equation (16).
The metrics MAE, MSE, and RMSE are usually used in deterministic forecasting methods. Probabilistic forecasting can be more complicated, due to the influence of external factors, and is better analyzed based on the verification of the quantile forecasts given by PICP and PINAW [56]. For a comparative assessment of the performance of the analyzed methods, the skill score (SS) metric is useful. The skill score uses one error metric, nRMSE (Equation (17)) or nMAE (Equation (18)), or two error metrics, nRMSE and nMAE together, in which case it is calculated by Equation (19) [2]. Higher SS values indicate superior prediction quality. An advantage of using a skill score is the ability to compare the forecasting quality of various systems, using the reduction in forecasting error relative to the reference method (persistence method, naive model) as the quality indicator.

SS = 1 − nRMSEforecast / nRMSEreference    (17)

where nRMSEforecast is the error of the analyzed method, and nRMSEreference is the error of the reference method (persistence method, naive model).

SS = 1 − nMAEforecast / nMAEreference    (18)

where nMAEforecast is the error of the analyzed method, and nMAEreference is the error of the reference method (persistence method, naive model).
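The skill score computation of Equations (17) and (18) reduces to a one-line sketch:

```python
def skill_score(err_forecast, err_reference):
    """SS = 1 - err/err_ref; err may be nRMSE (Eq. 17) or nMAE (Eq. 18),
    with the persistence (naive) model as the reference."""
    return 1.0 - err_forecast / err_reference
```

A score of 0 means no improvement over persistence; a score of 1 would mean a perfect forecast.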

Comprehensive Statistical Analysis
Out of 106 papers, statistical analysis was conducted on those which applied nRMSE and nMAE errors, or for which these two error metrics could be calculated from the rated power of the system and the reported levels of RMSE and MAE. In addition, based on the content of those papers, crucial details (factors) of the studies were selected to enable statistical quantitative analysis, error analysis, and analysis of their relationship with other factors. Table 5 (onshore systems, data from 60 papers) and Table 6 (offshore systems, data from six papers) contain sets of selected information from the studies presented in the papers. In addition to the papers mentioned in Tables 5 and 6, in Section 4.1 (Comprehensive Quantitative Review) below we also use, in certain analyses, information from those papers which did not provide nRMSE and nMAE error values or for which these values could not be calculated due to the absence of information on the rated power of the system. These papers are the following: [26,.

Comprehensive Quantitative Review
Based on data from 116 papers, statistical analysis was conducted to determine, among other things: the frequency of use of various error metrics, the classes of forecasting methods, the distinct types of input variables for forecasting models, the ranges of rated power of the systems subject to forecasting, the locations of those systems, and typical forecasting horizons. The analysis in this subsection excludes papers that provided less reliable or no data.
Figure 1 presents the outcome of a statistical analysis of the number of forecasting studies concerning wind power generation in particular regions of the world, based on the research papers analyzed here. What is remarkable is the very uneven distribution of studies across regions of the world. Special attention must be drawn to China, with by far the largest number of papers addressing wind farm generation forecasting. The second most represented country is the United States.

The performance of a wind power forecasting model is measured with different statistical metrics. These metrics quantify the prediction error of a model, providing the accuracy between the predicted values and the measured data [65]. It is difficult to make a comprehensive evaluation using a single error index, and, as Figure 2 shows, studies can consider up to six different metrics to evaluate and compare performance and to quantify forecasting errors [20,40]. In general, however, only 2 or 3 statistical metrics are used in model validation. In some cases, authors do not even specify the metric used to evaluate the performance of the model. The combination of statistical metrics, presented in Figure 2, varies by study; the metrics used to evaluate the performance of each model differ from study to study.
Figure 3 shows a summary of the quantifiers used in the studies analyzed; these metrics are split into four groups (see Table 7). RMSE, MAE, and MAPE are popular accuracy metrics due to their ease of interpretation by decision-makers and participants in energy markets. Unlike mean bias errors, these metrics do not falsify the average quality of forecasts by compensating over-forecasting with under-forecasting. Moreover, they give a decent estimate of the average error one can expect from forecasts at each prediction step [123]. Because it squares the error, RMSE is more sensitive to high values in the error time series, which makes it a good metric for detecting extreme error values. In turn, MAE does not additionally magnify extreme error values and is the closest to the most naturally expected type of error, the mean error. Unlike the two previous metrics, MAPE does not depend on the scale of the data, which makes it useful for comparing objects of different scales, e.g., errors for prosumer wind turbines and for very large wind farms. It also shows how accurate a model is relative to the changing momentary real value. This metric is, however, susceptible to zero or small generation values appearing in its denominator [124]. The result can be either an indefinite expression or a substantial error at a given step, whose value is then reflected in the final average. For these reasons, we recommend not using MAPE.
Figure 3 also shows the rare use of metrics other than the three most frequent ones. Usually, they are roots or derivatives of those three, used to remedy their drawbacks; e.g., nRMSE and nMAE add comparability between objects of different scales, which cannot easily be achieved without normalization of the time series. The coefficient of determination is also used relatively frequently. It describes not how well a model predicts, but how much of the modeled process is actually captured by the model.
Forecasting methods were classified into single, ensemble, and hybrid methods, and we calculated how frequently each of those classes provided the best method (lowest nRMSE) in the studies described in the papers reviewed here. Figure 4 presents the outcome of this analysis. A hybrid method was most often the best class of forecasting methods (almost 44%). The quite substantial percentage (almost 25%) of studies in which a single method was the best is surprising. This can be explained by the fact that some papers proposed single methods only, without comparing them to other classes (ensemble, hybrid). Additionally, note that in some cases ensemble and hybrid methods share the characteristics of both classes; for instance, a general hybrid model may also contain model(s) from the ensemble class. A comparison of the forecast quality of single, ensemble, and hybrid methods is presented in Section 4.2.2 (analysis of errors and EDF depending on the class of forecasting methods).
Energies 2022, 15, 9657 20 of 39
Based on information from the papers that provided rated powers, the percentage of farms for which generation forecasts were conducted was calculated for specific ranges of rated power. By far the most frequently studied were systems sized from more than 10 MW to 100 MW, with the second largest group being systems sized up to 10 MW (Figure 5). The dominance of the former range is probably due to the fact that it is the most frequent power range among wind farms; on the other hand, for very small (prosumer) systems, power generated from wind turbines is forecast much less frequently.
The percentage of farms for which generation forecasts were conducted was also calculated by forecasting horizon (Figure 6). By far the most frequent forecasting horizons are "24 h" and "few steps", with "one step" (5 min, 10 min, 15 min, or 1 h) being slightly less frequent. Forecasts with horizons of more than 24 h are clearly rare. The reason may be more difficult access to NWP data with such horizons and awareness of the loss of quality of such forecasts, especially compared to horizons of a few steps ahead.
The frequency of use of various sets of input data in the forecasting models described in the papers was also calculated (Figure 7). Lagged generation values of the forecast time series are clearly used most frequently. NWP data and weather measurements are used only slightly less frequently. Other types of input data are used at least ten times less frequently than the three types mentioned above (or only incidentally). Such infrequent use of inputs like lagged NWP data, time variables, and generation statistics (statistics of the forecast time series) in forecasting models is surprising.

Comprehensive Error Analysis
Out of the 116 papers, error analysis was conducted on those which reported both nRMSE and nMAE, or for which these two metrics could be calculated from the rated power of the system and the reported RMSE and MAE values (errors before normalization). In addition, the quotient of nRMSE to nMAE was calculated for each case. A new, unique metric, the error dispersion factor (EDF), has thus been introduced into the analyses; it is described by Formula (20). EDF is therefore a combination of two frequently used error metrics. The statistical analyses in Sections 4.2, 4.2.1, and 4.2.2 address, among other things, the potential usefulness of EDF in analyses of wind power forecasts.
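Assuming, per the definition above, that EDF is the quotient of nRMSE and nMAE (Formula (20)), the rated-power normalization cancels in the ratio, so EDF reduces to RMSE/MAE of the raw error series. A minimal sketch:

```python
import math

def edf(errors):
    # Error dispersion factor: nRMSE / nMAE; the rated-power
    # normalization cancels, leaving RMSE / MAE of the error series.
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    return rmse / mae

print(edf([1.0, -1.0, 1.0, -1.0]))  # constant |error|: minimum value 1.0
print(edf([0.1, -0.2, 3.0, -0.1]))  # dispersed errors: well above 1
```

Since RMSE is always at least as large as MAE for the same error series, EDF is bounded below by 1.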
The analysis in this subsection excludes papers that provided less reliable data (abnormal errors, abnormal error quotients); abnormal phenomena are addressed in Section 5. Table 8 presents basic statistics, and Figure 8 visualizes selected statistics. The averages in Table 8 are slightly larger than the medians for nRMSE, nMAE, and EDF alike. The dispersion of errors is remarkably high: the maximum-to-minimum quotient is more than 17 for nRMSE and more than 25 for nMAE. Such large dispersion can be partly explained by the different forecasting horizons (from 10 min to 72 h).

Analysis of Errors by Forecasting Horizon
Figures 9 and 10 present nRMSE and nMAE errors, respectively, in ascending order, based on the papers in which these error metrics were provided (including those which did not provide the rated power of the system). Information on the forecasting horizon is also given. Forecasts with longer horizons display significantly larger nRMSE errors, which is unsurprising: the accuracy of wind speed forecasts decreases with increasing forecasting horizon.

Figure 11 presents how the magnitude of error depends on the forecasting horizon. This figure summarizes the information from Figures 9 and 10: average values of both metrics were calculated for selected forecasting horizons. In general, average errors grow with increasing forecasting horizon, although for the 24 h horizon the average errors are slightly larger than for the 48 h horizon. This is probably because significantly fewer papers described forecasts with a 48 h horizon than with a 24 h horizon (a random element of lower errors from a small number of samples). By far the largest average errors were those for the 72 h horizon, more than two and a half times larger than for the 24 h and 48 h horizons. For the "one step" horizon, average errors are half the average errors for the 24 h horizon. This information has large practical significance: it shows what magnitude of normalized errors should be expected for a given forecasting horizon. Please note that the averages calculated for the 48 h and 72 h horizons may not be fully representative due to the small number of samples.
To determine precisely whether there is a statistically significant relationship between the forecasting horizon and error magnitudes, numerical forecasting horizons were selected (1/6 h, 1/4 h, 1/2 h, 1 h, 6 h, 12 h, 24 h, 48 h, and 72 h), which enabled us to calculate the Pearson linear correlation. The statistical analysis concluded a statistically significant (5% level of significance) positive linear correlation between the forecasting horizon and the magnitude of nRMSE error (R = 0.347): the nRMSE error grows with increasing forecasting horizon. Figure 12 presents how the magnitude of nRMSE error depends on the forecasting horizon.
The statistical analysis also concluded a statistically significant (5% level of significance) positive linear correlation between the forecasting horizon and the magnitude of nMAE error (R = 0.410): the nMAE error grows with increasing forecasting horizon. Figure 13 presents how the magnitude of nMAE error depends on the forecasting horizon. It is worth emphasizing that the linear correlation between the forecasting horizon and the magnitude of error is slightly larger for the nMAE error metric than for nRMSE.
The statistical analysis concluded a statistically insignificant (5% level of significance) negative linear correlation between the forecasting horizon and EDF (R = −0.196): the EDF slightly decreases with increasing forecasting horizon. Figure 14 presents how EDF depends on the forecasting horizon.
For forecasts with very short horizons (from 10 min to 1 h), the average EDF is 1.422; it is 1.3163 for the 6 h horizon and falls to 1.2724 for the 24 h horizon. For the 48 h and 72 h horizons, the samples are too few to calculate reliable averages.
In addition, a statistical analysis omitting the 48 h and 72 h horizons concluded a negative correlation between the forecasting horizon and EDF (R = −0.283): not very large, but statistically significant (15% level of significance).
The EDF (Figure 14 and Formula (20)) shows the average variability of the moduli of errors regardless of the magnitude of the error. If the absolute errors on all samples are the same, this ratio reaches its minimum value of 1. The larger the deviation of errors on particular samples from the average, the larger the ratio. This resembles the behavior of the standard deviation determined for the moduli of errors, with the difference that the standard deviation reaches a minimum value of zero and its dynamics are much larger, depending significantly on particular samples. For the EDF, the dynamics of values are smaller, which better illustrates the variability of errors across the sample pool. It should also be noted that the EDF is in fact the ratio of the square root of the second moment of the error to the first moment of the error.
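The link between EDF and the standard deviation of the error moduli can be made explicit. Since E[e^2] = Var(|e|) + (E|e|)^2, it follows that EDF = sqrt(1 + (sigma/MAE)^2), where sigma is the population standard deviation of the absolute errors; hence EDF equals 1 exactly when all error moduli are identical. A small numerical check of this identity (our derivation, on illustrative error values):

```python
import math

def edf(errors):
    # EDF as the ratio RMSE / MAE of the error series.
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    return rmse / mae

def std_of_moduli(errors):
    # Population standard deviation of the absolute errors.
    n = len(errors)
    mods = [abs(e) for e in errors]
    mean = sum(mods) / n
    return math.sqrt(sum((m - mean) ** 2 for m in mods) / n)

errors = [0.3, -1.1, 2.4, -0.2, 0.9]  # illustrative error series
mae = sum(abs(e) for e in errors) / len(errors)
lhs = edf(errors)
rhs = math.sqrt(1 + (std_of_moduli(errors) / mae) ** 2)
print(lhs, rhs)  # identical up to floating-point rounding
```

This also explains the smaller dynamics noted above: EDF grows only with the ratio of the dispersion of moduli to their mean, not with their absolute scale.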
The decreasing level of EDF with a rising forecasting horizon means that the variability of error decreases as the forecasting horizon increases. This is probably due, on the one hand, to the growing error and, on the other, to the averaging nature of forecasting models for longer horizons, which stabilizes errors around certain values.
It is worth noting that the statistical analysis of hourly wind speed values presented in [1] concluded that the variance of wind speed forecasts for horizons ranging from 1 to 24 h was 3.121, while for 25 to 48 h horizons it was lower, at 3.063.

Analysis of Errors and EDF Depending on the Class of Forecasting Methods
Some of the 116 papers analyzed here provide the forecasting error of a method from the "single method" class. The primary objective of the analysis was to investigate the percentage error reduction achieved by the best (proposed) method from the ensemble or hybrid class relative to the single method with the largest forecasting error (excluding the outcome of the naive method). Figure 15 presents, in descending order, the percentage reductions of nRMSE and nMAE of the best methods relative to single methods. What is remarkable is the very wide dispersion of percentage error reductions. For nRMSE, the largest percentage reduction is 80.02%, and the smallest is 2.76%. Similar observations apply to the dispersion of nMAE.

Figure 16 presents the average percentage improvement of the hybrid and ensemble methods relative to the single method for the nRMSE and nMAE error metrics. The percentage improvement is much larger for hybrid methods than for ensemble methods; however, the number of cases (19 for hybrid methods and 14 for ensemble methods) is too small to generalize this observation. Unfortunately, only a small proportion of the papers reviewed here report the forecasting error of a naive (persistence) method; such an error would be the best benchmark for the level of improvement achieved by other proposed methods, including single methods. Forecasting methodology assumes that a forecasting method is valuable if its error is smaller than the error of the naive method. Six papers provide errors for the naive method. Figure 17 presents, in descending order, the percentage reductions of nRMSE relative to the naive method for six cases (pairs of nRMSE and nMAE).
Figure 17. Percentage improvement of the best method relative to the naive method for nRMSE and nMAE error metrics.
The average percentage reduction calculated for the six cases is 60.53% for nRMSE and 63.79% for nMAE. Both averages are therefore much larger than the corresponding values calculated for nRMSE and nMAE reductions when the errors of the best methods are compared to single methods. Nevertheless, in a small number of cases, the percentage reductions of nRMSE and nMAE for the best method relative to single methods are large and similar to those of the best method relative to the naive method. This means that some single methods referred to in the literature are only marginally better than naive methods.
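A sketch of this benchmarking logic, using an illustrative generation series and hypothetical model forecasts (not data from the reviewed studies): the naive (persistence) forecast simply repeats the last observed value, and the percentage reduction is computed from the two nRMSE values.

```python
import math

def nrmse(actual, forecast, rated_power):
    n = len(actual)
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n) / rated_power

rated = 20.0                                      # MW, hypothetical farm
power = [5.0, 6.5, 8.0, 7.0, 4.5, 3.0, 4.0, 6.0]  # MW, illustrative series

# Naive (persistence) one-step forecast: the next value equals the last one.
actual = power[1:]
naive = power[:-1]
# Stand-in for a proposed model's forecasts over the same steps.
model = [6.2, 7.7, 7.3, 4.9, 3.3, 3.7, 5.6]

e_naive = nrmse(actual, naive, rated)
e_model = nrmse(actual, model, rated)
reduction = 100 * (1 - e_model / e_naive)
print(f"nRMSE naive={e_naive:.4f}, model={e_model:.4f}, reduction={reduction:.1f}%")
```

A method with a reduction near zero (or negative) adds no value over persistence, which is exactly the test the reviewed papers rarely make possible.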
The second objective of our analysis was to compare the EDF of the best (proposed ensemble or hybrid) method and of a single method. Based on 33 cases (pairs of ratios), we determined that in 77% of cases the EDF for the best method is larger than the EDF for the single method used in the respective study; this is observed more frequently for larger values of those ratios. The Pearson coefficient of linear correlation (R) between the ratios for the best method and for the single method is 0.737.
Figure 18 presents pairs of EDF values sorted in descending order by the EDF of the best method. The average EDF for the best method is 1.432, and the median is 1.364. The average EDF for the single method is 1.352, and the median is 1.294. Therefore, both the average and the median are clearly larger for the best method. Neither series has a normal distribution (a Shapiro-Wilk test was conducted). The Wilcoxon signed-rank test was therefore applied to the analysis of pairs; it concluded that there are statistically significant differences between the pairs in the two series (they have different expected values), so the differences between the medians are statistically significant.
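The paired comparison above can be sketched as follows. The EDF values below are illustrative, not the study's data, and the signed-rank statistic is hand-rolled with the large-sample normal approximation (in practice one would typically use a statistics library such as scipy.stats for both the Shapiro-Wilk and Wilcoxon tests):

```python
import math

def wilcoxon_signed_rank_z(x, y):
    # Wilcoxon signed-rank z statistic (large-sample normal approximation).
    # Zero differences are dropped; tied |differences| get average ranks.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j + 2) / 2  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mu) / sigma

# Illustrative paired EDF values (best method vs. single method).
edf_best = [1.61, 1.48, 1.40, 1.37, 1.35, 1.30, 1.52, 1.44, 1.28, 1.39]
edf_single = [1.42, 1.35, 1.33, 1.29, 1.31, 1.27, 1.38, 1.30, 1.24, 1.26]

z = wilcoxon_signed_rank_z(edf_best, edf_single)
print(z)  # |z| > 1.96 suggests a significant paired difference at the 5% level
```

The test is chosen over a paired t-test precisely because, as in the study, the series need not be normally distributed.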
An interesting conclusion can be drawn from our analysis: the variability of the moduli of errors for the best methods (smallest forecasting errors) is typically larger than for the "single method" class (much larger forecasting errors). The moduli of errors in the "single method" class are much larger and much closer to each other than in the best (hybrid or ensemble) methods. In some studies, a single method could also use slightly less information (a different set of input data), which can also affect the characteristics of the errors (magnitude and variability).

Analysis of Errors Based on System Size
Our analysis covered the studies which provided nRMSE, nMAE, and the size of the system. Statistical analysis did not reveal a statistically significant (5% level of significance) linear correlation between the size of the system (rated power) and the nRMSE and nMAE errors (R = −0.110 and R = −0.111, respectively). In theory, errors should grow with increasing size of the system due to much less uniform weather conditions (wind speed) across wind farms occupying extensive areas, the usual reliance on point-based meteorological forecasts, the wake effect, and other factors affecting the farm which are more difficult to represent when they overlap in the same space. The Pearson coefficient of linear correlation (R) between nRMSE and nMAE is 0.994 (5% level of significance), which means that these error metrics behave very similarly. The details are presented in Figure 19. In addition, there is a large dispersion of error magnitudes for systems of similar sizes, which can be due to different sets of input data (different quality of information).
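The correlation checks reported above can be sketched with `scipy.stats.pearsonr`, which returns both R and the p-value used for the 5% significance decision. The data below are synthetic stand-ins, not the reviewed papers' values:

```python
# Illustrative significance check of linear correlations (hypothetical data:
# rated power in MW, nRMSE/nMAE in % of rated power).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
rated_power = rng.uniform(2, 400, size=60)            # MW
nrmse = rng.normal(12, 4, size=60).clip(min=1)        # essentially unrelated to size
nmae = 0.7 * nrmse + rng.normal(0, 0.3, size=60)      # strongly tied to nRMSE

r_size, p_size = stats.pearsonr(rated_power, nrmse)
r_metrics, p_metrics = stats.pearsonr(nrmse, nmae)

# p >= 0.05 means the correlation with system size is not statistically
# significant at the 5% level, mirroring R = -0.110 in the text.
print(f"size vs nRMSE:  R = {r_size:+.3f}, p = {p_size:.3f}")
print(f"nRMSE vs nMAE:  R = {r_metrics:+.3f}, p = {p_metrics:.3g}")
```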
The statistical analysis concluded an insignificant (5% level of significance), marginally positive linear correlation between the size of the system (rated power) and EDF (R = 0.039). The EDF usually varies between 1 and 2, with an average value of 1.35. The details are presented in Figure 20.

The number of papers addressing forecasting for offshore farms is small: they constitute less than 6% of the 116 papers subject to this analysis. Only six papers (Table 6) provide nRMSE or nMAE, which is too few to conduct an accurate statistical analysis. Figure 21 compares nRMSEs for two forecasting horizons (averages of the errors provided in the papers) for offshore and onshore farms. Offshore farms have smaller forecasting errors than onshore farms. This is expected, as it results from the more stable and stronger winds at offshore locations; in addition, these are typically very large systems. For the 1 h horizon, the sizable difference may result from the fact that the average for offshore farms was based on only two error values and that some onshore forecasts did not use meteorological forecasts of wind speed (larger forecasting errors occur in such cases). In real terms, at the 24 h horizon, nRMSE at onshore farms can be about twice as large (assuming that meteorological forecasts are used in both locations).

Discussion
A comprehensive review and statistical analysis of errors based on an extensive selection of 116 papers allowed us to establish, using actual figures, correlations between the magnitude of error and selected factors. The quantitative analysis provides an aggregate assessment of how frequently various categories of (quite diverse) forecasting methods are applied, what the typical input data are (meteorological forecasts are typically used for horizons above 6 h), and how often various forecasting horizons are used (typical horizons range from 1 h to 24 h).
The analyses showed that some papers provided incomplete data, which prevented them from being used in an aggregate meta-analysis of studies; this applies, in particular, to the error metrics (nRMSE and nMAE).
In addition, several untypical (extreme) nRMSE and nMAE error levels have been identified which, due to their extreme dissimilarity to the remaining data of the same class (forecasting error being too large or too small), have been excluded by expert judgment from the analyses presented in Section 4.2 (Comprehensive Error Analysis). Figure 22 presents the variability of the nRMSE error in the papers reviewed here.
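The exclusion of extreme error levels described above relied on expert judgment. As a reproducible complement (not the method used in the paper), a standard 1.5 × IQR screen could flag such values; a minimal sketch with hypothetical nRMSE data:

```python
# Flag outlying error values with the common 1.5*IQR rule.
# The nRMSE values below are illustrative, not the reviewed papers' data.
import numpy as np

nrmse = np.array([4.2, 6.1, 7.5, 8.0, 9.3, 10.1, 11.4, 12.0, 13.2, 38.5])  # %

q1, q3 = np.percentile(nrmse, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lo, hi] are candidates for exclusion (here, the 38.5% entry).
outliers = nrmse[(nrmse < lo) | (nrmse > hi)]
print(outliers)
```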

A novel, unique ratio called the EDF has been explored. The EDF shows the average variability of the moduli of error regardless of the magnitude of error. The variability of this new ratio was analyzed as a function of selected characteristics (size of wind farm, forecasting horizon, and class of forecasting method). There is a small but statistically significant negative correlation between the forecasting horizon and EDF. Additionally, the EDF for the best forecasting method is larger than the EDF for the single forecasting method. The analysis concluded an insignificant, marginally positive linear correlation between the size of the system (rated power) and EDF.
Statistical analysis additionally identified an untypical value of EDF in one paper. Statistical data from the reviewed papers for which EDF could be calculated show that EDF levels range from 1.028 to 7.478, although the vast majority fall between 1 and 2 (this range seems the most credible; the minimum possible value of the ratio is 1). The outcome of our analysis is presented in Figure 23.

Based on our analysis of the papers, in our subjective assessment, to maximize the quality of aggregate meta-analyses of studies addressing power generation forecasting in wind farms, a research paper should contain the following items (our recommendations):
• mandatory use of normalized error metrics for the assessment of forecasting quality, nRMSE, Formula (5), and nMAE, Formula (6), accompanied by a description of the normalization method (normalization by the rated power of the system is recommended), which would enable comparative assessment of the quality of studies of systems of various sizes (regardless of how big the wind farm is);
• no use of the MAPE metric, which is susceptible to substantial error for small or zero generation;
• mandatory use of a forecast produced by the naive (persistence) method, to enable assessment of the quality of the best model proposed by the authors relative to a reference model; in such a case, calculation of the skill score metric by Formula (17), (18), or (19) is additionally recommended;
• strict, precise information on the forecasting horizon(s), including whether the forecast is made in one step (e.g., from 1 h to 24 h) or one step ahead with follow-up predictions (in which case forecasting errors are typically smaller);
• strict, precise information on the set(s) of input data for the proposed model(s);
• the source of the meteorological forecasts (GFS, ECMWF, or other), if used in the forecasting model;
• an explicit statement on whether the forecasting model uses meteorological forecasts and/or on-site measurements of weather conditions at the wind farm;
• the range of the training, validation, and testing data, and how the data are divided into these subsets;
• details of the location, unless confidential (location and size of the wind farm, and landscape features prevalent at the location).
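The recommended metrics can be sketched as follows, assuming normalization by the rated power of the system and a skill score of the common form 1 − error_model/error_persistence; the paper's exact Formulas (5), (6), and (17)-(19) may differ in detail. All values below are hypothetical:

```python
# Sketch of nRMSE, nMAE, EDF, and a persistence-based skill score,
# assuming normalization by rated power. Data are hypothetical.
import numpy as np

def nrmse(actual, forecast, rated_power):
    """Root mean squared error normalized by rated power, in %."""
    return 100.0 * np.sqrt(np.mean((actual - forecast) ** 2)) / rated_power

def nmae(actual, forecast, rated_power):
    """Mean absolute error normalized by rated power, in %."""
    return 100.0 * np.mean(np.abs(actual - forecast)) / rated_power

rated = 50.0                                          # MW, hypothetical farm
actual = np.array([12.0, 20.5, 33.0, 41.2, 25.6])     # MW, observed generation
model = np.array([13.1, 19.0, 30.2, 43.5, 24.0])      # MW, proposed model
naive = np.roll(actual, 1)                            # persistence: previous value
naive[0] = actual[0]

e_rmse, e_mae = nrmse(actual, model, rated), nmae(actual, model, rated)
edf = e_rmse / e_mae          # error dispersion factor; >= 1 since RMSE >= MAE
skill = 1.0 - e_rmse / nrmse(actual, naive, rated)    # > 0: model beats persistence
print(f"nRMSE={e_rmse:.2f}%  nMAE={e_mae:.2f}%  EDF={edf:.3f}  skill={skill:.3f}")
```

Because the root mean square always dominates the mean absolute value, the EDF computed this way cannot fall below 1, which matches the observation in the text that the minimum value of the ratio is 1.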

Conclusions
This paper is the outcome of a comprehensive review and statistical analysis of errors using more than one hundred research papers.The quantitative analyses allowed us to assess the distribution of frequency of application of selected parameters in research studies (including the number and type of error metrics, forecasting horizon, rated power of the system, classes of forecasting methods, and location of the forecast systems).
Our qualitative analyses allowed us to provide an aggregate assessment of power generation forecasting in wind farms, including how selected factors affect the magnitude of forecasting errors. In addition, the rationale for using complex (ensemble, hybrid) forecasting methods instead of single methods was verified by examining how much they improve the quality of forecasts.
Notably, only 6 of 116 papers addressed power generation forecasts in offshore farms, which means that such research should intensify going forward, although the gap is partly due to a significantly smaller number of such systems compared with onshore farms. The offshore location of a farm involves a number of distinct characteristics (such as a surface with exceptionally low roughness, significantly higher wind speeds, and more stable power generation), and the magnitude of forecasting errors is significantly smaller. Due to the small number of offshore-related papers, our analysis was much more constrained.
In our view, research on topics related to aggregate statistical analyses (meta-analyses) should continue. We are planning to increase the number of reviewed papers at least two- or three-fold in the future. Such a number will enable us to conduct a more precise statistical assessment of a large number of factors affecting the magnitude of forecasting error and to expand the analyses related to the EDF factor proposed by us. In our view, it is crucial that published papers on generation forecasts in wind farms contain the information from our recommended list, to enable the necessary analyses.

Figure 1. Total number of farms analyzed in 116 papers.

Figure 2. Number of statistical metrics used per article to evaluate the performance of forecasting models (evaluation of 114 articles).

Figure 3. Common ways to measure and evaluate the error of models predicting quantitative data.

Figure 5. Ranges of rated powers of wind farms.

Figure 6. Frequency of forecasts from different forecasting horizons.

Figure 7. Input data categories by frequency of use.

Figure 8. Selected statistics of errors and error quotients.

Figure 9. nRMSE errors with a note on forecasting horizon, in ascending order.

Figure 10. nMAE errors with a note on forecasting horizon, in ascending order.

Figure 11 presents how the magnitude of error depends on the forecasting horizon. This figure summarizes the information from Figures 9 and 10: average values of both metrics were calculated for selected forecasting horizons. In general, average errors grow with increasing forecasting horizon, although for the 24 h horizon average errors are slightly larger than for the 48 h horizon. This is probably because significantly fewer papers described forecasts with a 48 h horizon than with a 24 h horizon (a random element of lower errors from a small number of samples). By far the largest were the average errors for the 72 h horizon, more than two and a half times larger than for the 24 h and 48 h horizons. For the "one step" horizon, average errors are half the average errors for the 24 h horizon. This information has large practical significance: it shows what magnitude of normalized errors should be expected for the respective forecasting horizon. Please note that the averages calculated for the 48 h and 72 h horizons may not be fully representative due to the small number of samples.

Figure 11. Magnitudes of error by forecasting horizon.

Figure 14. Dependence of EDF on forecasting horizon.

Figure 15. (a) Percentage reduction in nRMSE for the best method relative to the single method; (b) percentage reduction in nMAE for the best method relative to the single method.

Figure 16. Average percentage improvement of the hybrid and ensemble methods relative to the single method for the nRMSE and nMAE error metrics.

Figure 18. Pairs of EDF sorted in descending order by the level of quotients for the best method.

Figure 17. Percentage improvement of the best method relative to the naive method for the nRMSE and nMAE error metrics.

Figure 19. Magnitude of error depending on the rated power of the system.

Figure 20. Magnitude of EDF depending on the rated power of the system.

Figure 21. Magnitude of nRMSEs depending on farm location and forecasting horizon.

Figure 22. Variability of nRMSEs in the reviewed papers, with less reliable values marked in red.

Figure 23. Variability of EDF in the reviewed papers, with the less credible value marked in red.


Table 2. Description of major factors affecting wind power forecasts.

Table 3. Classification of forecasting techniques.

Table 4. Classification of complex forecasting techniques.

Table 5. Summary of errors obtained for onshore wind power forecasting models.
* recalculated from RMSE or MAE and nominal system power; ** approximated values from graphs; *** nominal system power only applies to the error for a specific case rather than all cases.

Table 6. Summary of errors obtained for offshore wind power forecasting models.
* recalculated from RMSE or MAE and nominal system power; *** nominal system power only applies to the error for a specific case rather than all cases.

Table 7. Frequency of use of model performance evaluation metrics.

Table 8. Descriptive statistics of errors and error quotients.