Sensitivity Studies for a Hybrid Numerical–Statistical Short-Term Wind and Gust Forecast at Three Locations in the Basque Country (Spain)

: This study evaluates the performance of statistical models applied to the output of numerical models for short-term (1–24 h) hourly wind forecasts at three locations in the Basque Country. The target variables are horizontal wind components and the maximum wind gust at 3 h intervals. Statistical approaches such as persistence, analogues, linear regression, and random forest (RF) are used. The veriﬁcation statistics used are coe ﬃ cient of determination (R 2 ) and root mean square error (RMSE). Statistical models use three inputs: (1) Local wind observations; (2) extended EOFs (empirical orthogonal functions) derived from past local observations and ERA-Interim variables in a previous 24-h period covering a domain around the area of study; and (3) wind forecasts provided by ERA-Interim. Results indicate that, for horizons less than 1–4 h, persistence is the best model. For longer predictions, RF provides the best forecasts. For horizontal components at 4–24 h horizons, RF slightly outperformed ERA-Interim wind forecasts. For gust, RF performs better than ERA-Interim for all the horizons. Persistence is the most inﬂuential factor for 2–5 h. Beyond this horizon, predictors from the ERA-Interim wind forecasts led the contribution. Hybrid numerical–statistical methods can be used to improve short-term wind forecasts.

with non-stationarity in wind dynamics. A comparison of machine learning models for forecasting wind during the next hour was presented by Feng et al. [11]. Along these lines, ARIMA-ANN and ARIMA-Kalman Filter models have also been used [47] for 3 h forecasting. Hourly observations were used as target variables with ANN and genetic expression programming (GEP) [48]. A method similar to analogues [49] was also applied to forecast wind speed in the next hour [50]. Even [51] monthly mean wind speed has been forecasted combining time-series with ANN.
In summary, long-term (beyond a few days) forecasts are commonly issued by means of a global NWF followed by statistical post-processing [7]. While for time horizons of a few hours, forecast time-series based statistical models are usually applied, we propose to use both approaches in a hybrid wind forecast system.
Thus, we follow the well-stablished technique of testing the hybrid numerical-statistical methods by using data from a frozen assimilation system. This allows us to perform a sensitivity study of the relative merits of different statistical techniques under controlled conditions without introducing spurious modifications into the results due to changes in the configuration of the operational NWF system [8,18,26,27,31,33,39,52].
A comparison of more than 14 studies based on statistical methods for wind forecasting can be found in Okumus and Dinler [12] and references therein. As mentioned in that review, these kinds of wind-forecasting studies are sometimes difficult to compare. The root mean square error (RMSE) and the coefficient of determination (R 2 ) are widely used in this field of research [8,18,26,27,31,33,42]. The main advantage of using R 2 over the plain correlation is that the coefficient of determination is an indicator that represents the fraction of the overall variability explained by the model. A complete evaluation requires additional information on the error, either expressed in absolute terms (RMSE) or in percentage terms.
For this study, hybrid models were constructed using local observations and ERA-Interim data for short-term forecasting of wind speed and wind gust up to 24 h in the future at three selected locations in the Basque Country during 2007-2014 ( Figure 1). The statistical algorithms used were analogues, linear regression, and random forest (RF). They are here applied to locations over the ocean, the coast, and the interior (see Figure 1).
Taking into account the results in aforementioned studies, our objectives were: First, a comparison of the performance of different hybrid numerical-statistical techniques in the forecasting of u and v components for three observatories located in the ocean, at the coast, and in the interior of the Basque Country in the Bay of Biscay. Secondly, a comparison of the performance of the best statistical model with two reference forecasts, the nearest grid cell direct model output from the coarse-scale NWF model, which provided the predictors and persistence. The rationale behind the use of these reference models (ERA-Interim and persistence) is that any forecasting effort for a certain number of hours in the future is justified only if the new model outperforms existing ones. The comparison of performances is presented both for hourly wind data and wind gusts. The robustness of the method is evaluated against the selection of the domain and by applying it to three different sites which present different wind regimes. Finally, the most significant predictors are also identified. Section 2 presents the datasets and method used in this study. Section 3 describes the results, Section 4 discusses the conclusions of this study, and Section 5 presents a discussion along with a future outlook. evaluation requires additional information on the error, either expressed in absolute terms (RMSE) or in percentage terms.
For this study, hybrid models were constructed using local observations and ERA-Interim data for short-term forecasting of wind speed and wind gust up to 24 h in the future at three selected locations in the Basque Country during 2007-2014 ( Figure 1). The statistical algorithms used were analogues, linear regression, and random forest (RF). They are here applied to locations over the ocean, the coast, and the interior (see Figure 1).

Data
The Basque country is located in the Bay of Biscay, an area with elevations that may reach 1000 m just 30 km away from the coast. The topographic gradient is moderately steep, and the wind regime is subject to important coastal influences. Thus, three locations are chosen for this study ( Figure 1). They represent ocean, coastal, and interior wind regimes, as can be seen in their wind roses ( Figure 2). One of them is a buoy (Bilbao Bizkaia), representing an oceanic regime. The other two sites were selected to compare the results of the sea with the coast and an inland zone in the area of interest.
Atmosphere 2020, 11, x FOR PEER REVIEW 4 of 22 Taking into account the results in aforementioned studies, our objectives were: First, a comparison of the performance of different hybrid numerical-statistical techniques in the forecasting of u and v components for three observatories located in the ocean, at the coast, and in the interior of the Basque Country in the Bay of Biscay. Secondly, a comparison of the performance of the best statistical model with two reference forecasts, the nearest grid cell direct model output from the coarse-scale NWF model, which provided the predictors and persistence. The rationale behind the use of these reference models (ERA-Interim and persistence) is that any forecasting effort for a certain number of hours in the future is justified only if the new model outperforms existing ones. The comparison of performances is presented both for hourly wind data and wind gusts. The robustness of the method is evaluated against the selection of the domain and by applying it to three different sites which present different wind regimes. Finally, the most significant predictors are also identified. Section 2 presents the datasets and method used in this study. Section 3 describes the results, Section 4 discusses the conclusions of this study, and Section 5 presents a discussion along with a future outlook.

Data
The Basque country is located in the Bay of Biscay, an area with elevations that may reach 1000 m just 30 km away from the coast. The topographic gradient is moderately steep, and the wind regime is subject to important coastal influences. Thus, three locations are chosen for this study ( Figure 1). They represent ocean, coastal, and interior wind regimes, as can be seen in their wind roses ( Figure 2). One of them is a buoy (Bilbao Bizkaia), representing an oceanic regime. The other two sites were selected to compare the results of the sea with the coast and an inland zone in the area of interest. The Bilbao Bizkaia buoy (43.64° N, 3.05° W) is run by the Spanish Puertos del Estado agency (Puertos del Estado) and is located offshore, approximately 30 km from the coast. As can be seen from the wind roses, it is not affected by sea breezes and, since there are no natural obstacles around it, the wind tends to flow freely. The second location is Punta Galea, at the coastline, at the top of a cliff (43.38° N and 3.04° W, 61 m elevation above sea level). Punta Galea is exposed to sea-land interactions affecting the wind regime, such as sea breezes. The third wind sensor is located in Alegria, on a plateau in the south of the Basque Country (42.84° N and 2.52° W, 545 m elevation), 60 km inland. The dominant wind pattern is west-east but the highest wind corresponds to a south- The Bilbao Bizkaia buoy (43.64 • N, 3.05 • W) is run by the Spanish Puertos del Estado agency (Puertos del Estado) and is located offshore, approximately 30 km from the coast. As can be seen from the wind roses, it is not affected by sea breezes and, since there are no natural obstacles around it, the wind tends to flow freely. The second location is Punta Galea, at the coastline, at the top of a cliff (43.38 • N and 3.04 • W, 61 m elevation above sea level). Punta Galea is exposed to sea-land interactions affecting the wind regime, such as sea breezes. The third wind sensor is located in Alegria, on a plateau in the south of the Basque Country (42.84 • N and 2.52 • W, 545 m elevation), 60 km inland. The dominant wind pattern is west-east but the highest wind corresponds to a south-west direction. Both inland stations are run by the Basque Meteorological Service (Euskalmet).
At each location, u, v, and wind gust at the sensor height above ground level (Bilbao Bizkaia: 3 m, Punta Galea: 12 m, and Alegria: 10 m) are measured. u and v were calculated using only the last measurements (10 min) prior to the hourly time step at the three locations. These are the operational averages regularly implemented and subsequently made public by the institutions running the sensors. The zonal (u) and meridional (v) projections of the wind vector were obtained from magnitude and direction recorded at the three locations.
A second group of target variables was used at the three locations: Hourly maximum wind gusts. The wind gusts were computed from the maximum value of the gust recorded in the last 10 min at Punta Galea and Alegria. The maximum value for the Bilbao Bizkaia buoy was calculated using the measurements taken over the last 10 min of each hourly step because the buoy records data every hour while the meteorological stations do this every 10 min. Only the magnitude of wind gust was used. To compare the data with the ERA-Interim forecast, the wind gust since previous post-processing (3 h period) was calculated; this will hereinafter be referred to simply as wind gust. All the data span the period 2007-2014.
Data from ECMWF's ERA-Interim reanalysis [21,53,54] were used. The variables selected as likely to impact upcoming wind speed were mean sea level pressure (Msl); the zonal component of wind at Additionally, u 10 and v 10 and wind gusts in forecast mode from ERA-Interim for the nearest grid points from the three chosen locations were also used. From the initial conditions at t = 0000 UTC and t = 1200 UTC, ERA-Interim forecasts of these variables are available for t + k, where k = (3,6,9,12,15,18,21,24) h in the future in the meteorological archival and retrieval system (MARS) server of the ECMWF. All these variables were used to feed the different wind forecasting models as if the statistical models were using forecasts from an operational model. A second group of target variables was used at the three locations: Hourly maximum wind gusts. The wind gusts were computed from the maximum value of the gust recorded in the last 10 min at Punta Galea and Alegria. The maximum value for the Bilbao Bizkaia buoy was calculated using the measurements taken over the last 10 min of each hourly step because the buoy records data every hour while the meteorological stations do this every 10 min. Only the magnitude of wind gust was used. To compare the data with the ERA-Interim forecast, the wind gust since previous postprocessing (3 h period) was calculated; this will hereinafter be referred to simply as wind gust. All the data span the period 2007-2014.
Data from ECMWF's ERA-Interim reanalysis [21,53,54] were used. The variables selected as likely to impact upcoming wind speed were mean sea level pressure (Msl); the zonal component of wind at 10 m (m/s) (u10); the meridional component of wind at 10 m (v10); and temperature at 2 m (T2). The domain selected ( Figure 3) is a rectangle spanning (49.50° N, 12° W) to (36° N, 7.5° E) with a spatial resolution of 0.75° × 0.75°. Analyses at 6 h time steps for the 2007-2014 period were downloaded from ECMWF. Additionally, u10 and v10 and wind gusts in forecast mode from ERA-Interim for the nearest grid points from the three chosen locations were also used. From the initial conditions at t = 0000 UTC and t = 1200 UTC, ERA-Interim forecasts of these variables are available for t + k, where k = (3,6,9,12,15,18,21,24) h in the future in the meteorological archival and retrieval system (MARS) server of the ECMWF. All these variables were used to feed the different wind forecasting models as if the statistical models were using forecasts from an operational model.

Method
In order to achieve a rigorous evaluation of the performance of statistical models, the original database was split into two datasets, one to fit and train the models (2007)(2008)(2009)(2010) and the other to test the models on independent data (2011-2014). RF uses train and test datasets [18,55]; this method does not require validation datasets, such as the neural network method. The number of cases is shown in Table 1.

Method
In order to achieve a rigorous evaluation of the performance of statistical models, the original database was split into two datasets, one to fit and train the models (2007)(2008)(2009)(2010) and the other to test the models on independent data (2011-2014). RF uses train and test datasets [18,55]; this method does not require validation datasets, such as the neural network method. The number of cases is shown in Table 1. The raw input data were pre-processed as follows. At a given t time, each model at the different lead times was fed with:

1.
Observations of u and v, and the wind gust at each sensor height when t = 0; 2.
ERA-Interim forecasts (u, v, wind gust) for the nearest grid points from the selected locations ( Figure 1). This NWF is based on a four-dimensional variational assimilation analysis, with a time window of 12 h, and produces forecasts with time steps that range from 3 to 24 h in the future, as explained in Section 2.1; 3.
Time-lagged wind observation at t − 0, t − 6, t − 12, and t − 18 h. Time-lagged observations and ERA-Interim analyses (see domain in Figure 3) were used. In order to reduce the dimensionality of the time-lagged ERA-Interim variables (Msl, u 10 , v 10 , and T 2 ), extended empirical orthogonal functions (extEOFs) of the original variables were calculated [56,57]. In this way, both spatial and temporal patterns can be captured in the space corresponding to the principal components, and the leading extEOFs hold the highest fractions of the total variance. The resulting extended principal components have not been rotated, since they were just used for compressing the information and they are, therefore, orthogonal. Extended EOFs have been applied, dealing with different geophysical variables, such as waves [52,58], ENSO events [59], and surface moisture flux and precipitation [55]. In the case of this study, important associations (correlations) were detected between variables, in addition to non-negligible autocorrelations and spatial correlations throughout the selected spatial and time domain. Due to the different physical quantities (K, m/s and so on) of the variables involved, at an initial stage they were standardized (mean = 0, variance = 1), and the final number of leading extEOFs used for this study was 26, selected under the condition of jointly retaining at least 90% of the overall variance, following the final criteria developed by authors for similar geophysical studies after careful evaluation of different alternatives [52,55].
A summary of the raw input data is shown below ( Table 2). Table 2. Summary description of the statistical model raw inputs Input.

Name Variables Source Time
Last obs Three statistical techniques were used: Analogues, linear regression, and RF. Linear regression is a common method in statistical prediction. Second, the analogues method is a non-linear technique that has been often used since the early days of meteorological forecasting [49]. The third technique was random forests (RF), a machine learning technique that we wished to compare with the two other, simpler methods.
For each target variable and statistical technique, a specific model was built for each of the forecasting horizons at t + k (k = 1, 2, . . . , 24) h. This means that, for u and v, a total of 432 models were constructed (two target variables times three locations times three techniques times 24 hourly steps). In the case of the wind gust, only the best technique identified for u and v was used RF. For wind gust, Atmosphere 2020, 11, 45 7 of 22 72 models (one target variable times three locations times one technique times 24 hourly steps) were built. As a result, for this study, a total number of 504 models were developed and evaluated.
In the analogue technique [49], for each case in the test period, the most similar situation is sought in the training period and this becomes the wind prediction at t + h hours. Patterns can be selected using various similarity metrics [60], and, for this study, both Euclidean and cosine norms were applied. Given the state of the predictors to be used by the vector x = (x 1 , . . . , x p ) with p the dimension of the state-space, the Euclidean distance between vectors x and y is defined by: whilst the cosine distance is defined by: Random forest is a predictive algorithm that uses bootstrap technique to combine different decision trees, where each tree is built with observation data and random variables. RF [61] was considered due to its ability to mathematically capture highly non-linear relationships, like those known to be involved in the physics of wind forecasting. RF is a development of classification and regression trees (CART [62]), where the regression trees are randomly perturbed and a forest of perturbed trees is created. Trees are perturbed at two levels:

1.
A number of bootstrap samples from original data are drawn and used to feed the different regression trees; 2.
At each tree, each node is split using the best among a subset of m predictors randomly chosen at that node.
A key advantage of RF over other regression techniques is that it is free from overfitting, due to the use of the strong law of large numbers [61]. In RF, it is possible to estimate how important a given input is by calculating its percentage increase in the mean square error of the regression, if that particular predictor, is removed. This ranks the most influential inputs [63]. For this work, this importance ranking was estimated from the training dataset and, following the general procedure, the RF outputs were estimated as the average of the outputs of the trees [64,65].
In the case of the classical multiple linear regression, the candidate input variables are incorporated into the equation following the Akaike information criterion (AIC) [66]. Performance comparison for other atmospheric fields indicates that non-linear algorithms like neural networks outperform, but not overwhelmingly so, more simple methods like linear regression [67] and analogues [49,68]. RF is perhaps more difficult to implement than analogues or linear regression, but it has been successfully employed in other studies [55,69,70]. In particular, for the variable studied here-wind- [18] RF was used, as well as other machine learning techniques, to forecast wind speed. This latter study is the most similar to ours, because both forecast wind for same locations and used NWF model predictors for the statistical models. Details on most mathematical aspects of RF can be found in the literature [61,71]. For comparison purposes, in addition to the abovementioned models, the plain forecasts provided by ERA-Interim for time t + k (k = 3 h, 6 h, 9 h, . . . , 24 h), as well as persistence were also considered as additional models for each location and forecasting horizon.
Forecast accuracy was evaluated against observations exclusively using data belonging to the test period. The number of cases ranges from 1896 to 2231 (Table 1). The performance assessment was carried out by means of the usual indicators typically employed in the literature: R 2 and RMSE [18,26,42,72]. In this study, a given model will be considered superior to another if both indicators, at a 95% confidence level, are better. All the comparisons using these indicators were assessed at a 95% confidence level using boundaries computed by bootstrap resampling [73]. In the process of selecting bootstrap samples, the time order of the series was taken into account.

Model Performance for u and v
First, in this section, the results of three statistical models (linear regression, analogues, and RF) have been shown and compared for u and v targets. Then, the best result has been compared against persistence and ERA-Interim forecasts. The sensitivity to the domain size is also verified and finally the relevant inputs are identified.

Statistical Models
The forecasts by the three statistical models (linear regression, analogues, and RF) were compared in order to clarify which technique performed better. The goal was to identify the best general technique.
The results of this analysis are summarized in Tables 3 and 4 and Figure 4. These tables and figure show a performance comparison for the statistical models built using the three techniques mentioned above. It can be seen that models based on RF, analogues, and linear regression tend to perform similarly for all the forecasting horizons (1 h, 2 h, . . . , 24 h) and only marginal-though statistically significant-improvements can be detected, depending on the location and forecasting horizons. In the case of one location (Bilbao Bizkaia), linear regression-based models performed slightly better for more forecasting horizons, while RF-based models exhibited a marginal improvement in the case of Punta Galea and Alegria. So, in general terms, RF-based models appear to perform best at the three locations. Table 3. R 2 verification index of u and v wind components for the steps 1, 6, 12, 18, and 24 h, for three locations (Bilbao Bizkaia, Punta Galea, and Alegria), and three statistical methods (random forest, analogs, and linear regression). The best verification values are typed using bold font.  The forecasts by the three statistical models (linear regression, analogues, and RF) were compared in order to clarify which technique performed better. The goal was to identify the best general technique.
The results of this analysis are summarized in Tables 3 and 4 and Figure 4. These tables and figure show a performance comparison for the statistical models built using the three techniques mentioned above. It can be seen that models based on RF, analogues, and linear regression tend to perform similarly for all the forecasting horizons (1 h, 2 h, …, 24 h) and only marginal-though statistically significant-improvements can be detected, depending on the location and forecasting horizons. In the case of one location (Bilbao Bizkaia), linear regression-based models performed slightly better for more forecasting horizons, while RF-based models exhibited a marginal improvement in the case of Punta Galea and Alegria. So, in general terms, RF-based models appear to perform best at the three locations.

Model Evaluation
The next step was to compare the results of the statistical models based on RF (the best technique identified in the previous step) against persistence and ERA-Interim forecasts for the nearest grid point. The results can be seen in Figure 5 and Tables 5 and 6 for the Bilbao Bizkaia buoy, Punta Galea, and Alegria locations. The errors of the physics-based ERA-Interim forecasts remain almost constant up to the 24 h lag forecast [74].    If u and v are jointly analyzed, the results indicate that for horizons shorter than 2-4 h, neither the RF-based models nor ERA-Interim forecasts outperform persistence. Beyond this horizon, up to 24 h in the future, for the Bilbao Bizkaia location, the RF-based models exhibit higher R 2 values and lower RMSE than persistence or ERA-Interim models for u forecasts. In the case of v, both RF and ERA-Interim clearly outperform persistence. Considering that wind prediction involves a joint forecast of u and v, the results for Bilbao Bizkaia indicate that, below a forecasting horizon of 4 h, persistence is the best option, and, from 4 h to 24 h in the future, RF-based models outperform the rest, but that both are not always significantly better than ERA-Interim forecasts.
A similar pattern can be observed for the other two locations, Punta Galea and Alegria, although persistence outperformed the other techniques for a shorter forecast horizon (the first 2-4 h). For longer horizons, the ERA-Interim and RF-based models performed better, although, for v, RF models exhibited somewhat better results than ERA-Interim forecasts. For u forecasts, RF and ERA-Interim performed similarly, and a joint evaluation of wind forecasts once again indicates a slightly better overall performance of RF-based models. In all cases, it must be noted that the RMSE were lower for the different techniques and forecasting horizons for Alegria, but this was due to the lower wind speed values recorded at this location.
A seasonally stratified analysis (not shown) confirmed the abovementioned ranking, although in summer, since wind is generally weaker, the error values described by the RMSE are also smaller. Table 5. R 2 index (95% confidence interval) of u and v wind components for the steps 1, 6, 12, 18, and 24 h, for three locations (Bilbao Bizkaia, Punta Galea, and Alegria), and RF, ERA-Interim forecasts, and persistence models. Highest values of R 2 are highlighted by means of bold font.

Sensitivity to the Domain Size
One of the input groups was the extEOF derived from the ERA-Interim values of different variables, plus local observations, as explained in Section 2.2. As mentioned above, the geographical boundaries of those variables used to calculate the extEOF were (49.50 • N, 12 • W) to (36 • N, 7.5 • E) (Figure 3). The reason that these boundaries were selected is that analysis of the variables inside this domain could allow us to capture the large-scale synoptic fields influencing the wind field at the three locations.
However, it was necessary to evaluate how the size of the selected domain could affect the results. To that end, a new set of extEOF was derived from the same variables, but defined for a smaller domain, from (45 • N, 12 • W) to (40 • N, 0 • E) ( Figure 6). These new extEOFs were used to build the RF models again, and their performance was evaluated following the same method as before.
Atmosphere 2020, 11, x FOR PEER REVIEW 12 of 22 ( Figure 3). The reason that these boundaries were selected is that analysis of the variables inside this domain could allow us to capture the large-scale synoptic fields influencing the wind field at the three locations. However, it was necessary to evaluate how the size of the selected domain could affect the results. To that end, a new set of extEOF was derived from the same variables, but defined for a smaller domain, from (45° N, 12° W) to (40° N, 0° E) ( Figure 6). These new extEOFs were used to build the RF models again, and their performance was evaluated following the same method as before.
The rationale behind this testing of the domain is that the observed wind field in the Basque Country is, as is common in extratropical areas, very likely close to geostrophy. The ability of a smaller domain to provide information at this level must be checked. The use of a smaller domain would make calculations more simple, quicker, and straightforward. For this reason, a much smaller domain has also been considered. The results of the RF-based models obtained using the two sets of extEOFs derived from the two domains can be seen in Figure 7. This figure clearly shows that the statistical indicators of performance for R 2 and RMSE (95% confidence boundaries) are entirely overlapping, thus indicating a similar performance. For this reason, it can be concluded that there is negligible sensitivity of results to a domain size in the range analyzed in this paper. The rationale behind this testing of the domain is that the observed wind field in the Basque Country is, as is common in extratropical areas, very likely close to geostrophy. The ability of a smaller domain to provide information at this level must be checked. The use of a smaller domain would make calculations more simple, quicker, and straightforward. For this reason, a much smaller domain has also been considered.
The results of the RF-based models obtained using the two sets of extEOFs derived from the two domains can be seen in Figure 7. This figure clearly shows that the statistical indicators of performance for R 2 and RMSE (95% confidence boundaries) are entirely overlapping, thus indicating a similar performance. For this reason, it can be concluded that there is negligible sensitivity of results to a domain size in the range analyzed in this paper. Atmosphere 2020, 11, x FOR PEER REVIEW 13 of 22

Identification of Relevant Inputs
The RF-based models were identified as those that, in general terms, performed best. Thus, these models were compared with the ERA-Interim and persistence forecasts. However, it should be noted that, for the three locations, the statistical models built with RF roughly identify the same three groups of inputs as being the most influential. These predictors are current local wind observation (u, v); the first, second, fourth, eighth, and ninth extEOFs; and the ERA-Interim forecasts for the nearest grid points. Figure 8 shows the most influential input for the different horizons for RF-based models.

Identification of Relevant Inputs
The RF-based models were identified as those that, in general terms, performed best. Thus, these models were compared with the ERA-Interim and persistence forecasts. However, it should be noted that, for the three locations, the statistical models built with RF roughly identify the same three groups of inputs as being the most influential. These predictors are current local wind observation (u, v); the first, second, fourth, eighth, and ninth extEOFs; and the ERA-Interim forecasts for the nearest grid points. Figure 8 shows the most influential input for the different horizons for RF-based models.

Identification of Relevant Inputs
The RF-based models were identified as those that, in general terms, performed best. Thus, these models were compared with the ERA-Interim and persistence forecasts. However, it should be noted that, for the three locations, the statistical models built with RF roughly identify the same three groups of inputs as being the most influential. These predictors are current local wind observation (u, v); the first, second, fourth, eighth, and ninth extEOFs; and the ERA-Interim forecasts for the nearest grid points. Figure 8 shows the most influential input for the different horizons for RF-based models.  For predictions for shorter horizons than 2-5 h, the most influential variable tends to be the current observation. This indicates that for short horizons, the system's memory accounts for the highest fraction of overall predictability achieved by the models, which do not outperform persistence in the first 2-4 h. For medium-range predictions, persistence, ERA-Interim forecasts, and extEOFs tend to be either the most, or second most, influential inputs. Finally, for forecasting horizons longer than 7-9 h, the wind forecasts of ERA-Interim for the nearest grid points are the most influential predictors. All these conclusions are for the most important variable in each model, but if the second one (or even the third) is considered (results not shown), extEOFs become more influential.

Model Performance for Wind Gusts
Considering the results obtained under Section 3.1, the model evaluation against persistence and ERA-Interim forecasts and the identification of important inputs for the downscaling model presented in the previous subsections. The best-performing one has been used for wind gust variable.

Model Evaluation
A similar general approach to that described above was also applied to construct the forecasting models at the same three locations, with wind gust as the target variable. The forecasting horizons were also the same (from 1 to 24 h in the future), but, since in most cases RF seemed to perform somewhat better for u and v, this was the only technique analyzed in a more detailed way. Similar predictors were also used in the models ( Table 2). For this case, the inputs were: 1.
All the information up to t = 0, last 3 h of observed wind gusts, and the same set of extEOFs corresponding to the initial domain and variables; 2.
Instead of using the u and v values observed at time t = 0, the selected input was the maximum wind gust since previous post-processing; 3.
The ERA-Interim forecasts were those corresponding to the upcoming maximum wind gusts. These forecasts involved 3 h time steps from t + 3 to t + 24 h.
Therefore, for this target variable a total of 72 RF-based models were evaluated, the results of which are shown in Figure 9 and Tables 7 and 8. It can be seen that the best choice for modelling the first 1-2 h was persistence. From this time onwards, the results indicate that, for the three locations and all forecasting horizons, the RF-based models exhibited a significantly smaller error than either persistence or ERA-Interim forecasts. In general terms, wind gust results are better than u and v's previously obtained results.
Atmosphere 2020, 11, x FOR PEER REVIEW 14 of 22 For predictions for shorter horizons than 2-5 h, the most influential variable tends to be the current observation. This indicates that for short horizons, the system's memory accounts for the highest fraction of overall predictability achieved by the models, which do not outperform persistence in the first 2-4 h. For medium-range predictions, persistence, ERA-Interim forecasts, and extEOFs tend to be either the most, or second most, influential inputs. Finally, for forecasting horizons longer than 7-9 h, the wind forecasts of ERA-Interim for the nearest grid points are the most influential predictors. All these conclusions are for the most important variable in each model, but if the second one (or even the third) is considered (results not shown), extEOFs become more influential.

Model Performance for Wind Gusts
Considering the results obtained under Section 3.1, the model evaluation against persistence and ERA-Interim forecasts and the identification of important inputs for the downscaling model presented in the previous subsections. The best-performing one has been used for wind gust variable.

Model Evaluation
A similar general approach to that described above was also applied to construct the forecasting models at the same three locations, with wind gust as the target variable. The forecasting horizons were also the same (from 1 to 24 h in the future), but, since in most cases RF seemed to perform somewhat better for u and v, this was the only technique analyzed in a more detailed way. Similar predictors were also used in the models (Table 2). For this case, the inputs were: 1. All the information up to t = 0, last 3 h of observed wind gusts, and the same set of extEOFs corresponding to the initial domain and variables; 2. Instead of using the u and v values observed at time t = 0, the selected input was the maximum wind gust since previous post-processing; 3. The ERA-Interim forecasts were those corresponding to the upcoming maximum wind gusts.
These forecasts involved 3 h time steps from t + 3 to t + 24 h.
Therefore, for this target variable a total of 72 RF-based models were evaluated, the results of which are shown in Figure 9 and Tables 7 and 8. It can be seen that the best choice for modelling the first 1-2 h was persistence. From this time onwards, the results indicate that, for the three locations and all forecasting horizons, the RF-based models exhibited a significantly smaller error than either persistence or ERA-Interim forecasts. In general terms, wind gust results are better than u and v's previously obtained results.   Table 7. R 2 index (95% confidence interval) of wind gust for the steps 1, 6, 12, 18, and 24 h, for three locations (Bilbao Bizkaia, Punta Galea, and Alegria), and RF, ERA-Interim forecasts, and persistence models. The results corresponding to the best model are highlighted using bold font.

Identification of Important Inputs
The most important input for maximum wind gust forecasts over the last 3 h period is shown in Figure 10 for different lead times. The previous pattern was confirmed; in the first 4 h, the selected input was the last observation, and in the last 10 h, it was the ERA-Interim forecast for the nearest grid points. The results for the u wind component and wind gust were very similar. The horizons for persistence and ERA-Interim forecast were almost the same. They only differed in the selection of extEOFs for medium-range predictions. In the same way as previously, all these conclusions are for the most influential inputs in each model, but, if the second one (or even the third) is considered (results not shown), extEOFs gain relevance (results not shown). Atmosphere 2020, 11, x FOR PEER REVIEW 16 of 22

Discussion
The approach used in this paper follows the general pattern of using forecasts provided by various types of numerical models (such as the prognostic model used in the ERA-Interim reanalysis) followed by a statistical modelling stage based on different mathematical algorithms [33]. In our case, as shown above, the models include additional candidate inputs, like extEOFs plus local observations. The comparison between the performance of analogues, linear regression, and RF indicates that RF outperforms linear regression and analogues, but not in all cases, and never overwhelmingly.
Since the persistence of wind is not null, any modelling effort must also outperform persistence (Okumus and Dinler, 2016). Persistence, and ERA-Interim forecasts (playing the role of the surrogate of an operational NWF system) represent readily available forecasting models. The performance comparison between the best statistical technique (RF) and the plain use of persistence or ERA-Interim forecasts indicates that RF outperforms the others beyond the time horizon of 1-4 h. On the other hand, persistence is always a better option than the methods tested here for up to 1-4 h into the future. Preliminary studies with a low number of cases (10 days) indicate that a modified general regression neural network could beat persistence, even for short-term horizons (below 5 h). Other studies [9,41,48], using different methods, have found that RMSE values are better than for plain persistence, even for the first hour. However, in these studies, the number of cases is not very high, and the authors do not assess the statistical significance of the performance differences. All this indicates that the specific conditions in which persistence is outperformed for very short-term forecasting horizons are yet to be fully understood, and further research in this area may be needed. In our study, for all predictands and forecasts up to 2-5 h in the future, the most influential input is the current local observation. Thus, the system's memory plays a key role in building the prediction and explains why persistence is not beaten in these cases in the immediate short-term. Beyond this horizon, ERA-Interim forecasts are the most important predictor. The group of extEOFs (but not the same ones in all cases) also plays an important role, as the second/third most influential input, thus confirming the need to use all three general groups of inputs.
These results are in general agreement with earlier studies [7], even for the same area [18], where different statistical approaches applied to outputs from numerical models were used. Soman et al. [7] describe that, for horizons longer than 6 h, hybrid structures, such as the numerical model with postprocessing, or the numerical model plus time series techniques, also exhibit better results.

Discussion
The approach used in this paper follows the general pattern of using forecasts provided by various types of numerical models (such as the prognostic model used in the ERA-Interim reanalysis) followed by a statistical modelling stage based on different mathematical algorithms [33]. In our case, as shown above, the models include additional candidate inputs, like extEOFs plus local observations. The comparison between the performance of analogues, linear regression, and RF indicates that RF outperforms linear regression and analogues, but not in all cases, and never overwhelmingly.
Since the persistence of wind is not null, any modelling effort must also outperform persistence (Okumus and Dinler, 2016). Persistence, and ERA-Interim forecasts (playing the role of the surrogate of an operational NWF system) represent readily available forecasting models. The performance comparison between the best statistical technique (RF) and the plain use of persistence or ERA-Interim forecasts indicates that RF outperforms the others beyond the time horizon of 1-4 h. On the other hand, persistence is always a better option than the methods tested here for up to 1-4 h into the future. Preliminary studies with a low number of cases (10 days) indicate that a modified general regression neural network could beat persistence, even for short-term horizons (below 5 h). Other studies [9,41,48], using different methods, have found that RMSE values are better than for plain persistence, even for the first hour. However, in these studies, the number of cases is not very high, and the authors do not assess the statistical significance of the performance differences. All this indicates that the specific conditions in which persistence is outperformed for very short-term forecasting horizons are yet to be fully understood, and further research in this area may be needed. In our study, for all predictands and forecasts up to 2-5 h in the future, the most influential input is the current local observation. Thus, the system's memory plays a key role in building the prediction and explains why persistence is not beaten in these cases in the immediate short-term. Beyond this horizon, ERA-Interim forecasts are the most important predictor. The group of extEOFs (but not the same ones in all cases) also plays an important role, as the second/third most influential input, thus confirming the need to use all three general groups of inputs.
These results are in general agreement with earlier studies [7], even for the same area [18], where different statistical approaches applied to outputs from numerical models were used. Soman et al. [7] describe that, for horizons longer than 6 h, hybrid structures, such as the numerical model with post-processing, or the numerical model plus time series techniques, also exhibit better results.
Santamaría-Bonfil et al. [10] and Hu et al. [50] reached the same conclusion for wind speed forecasting. More specifically, their models are based on time series with SVM and analogue methods, respectively. The authors indicate that persistence is the best model up to the 5 h horizons, signifying that they reached broadly similar conclusions to those in this study.
RF is one of the algorithms considered by Rozas-Larraondo et al. [18], but they issued forecasts only up to a 3 h lead time. In general terms, these authors combined the outputs of an NWF-GFS in this case-with METAR data as additional inputs to build a statistical model. In our study, we have extended the method to another variable (wind gust), statistical prediction techniques (analogues and linear regression), longer forecast lead times, and different data compression techniques, with the use of predictors such as extEOFs. Our results for longer lead times support the suggestion that RF-based statistical downscaling models constitute an effective tool for providing short-time wind forecasts, particularly when an NWF predictor is included in lead times longer than 3 h. Since we found that these results hold for three different locations (ocean, coast, and interior), with different wind regimes and an assessment of errors to a 95% confidence level by means of bootstrapping, we think the results are quite robust.
It must be noted that the models shown in this paper were built for three locations, representative of different areas in the region: Sea, coast, and inland. The surroundings of the three locations and the wind regimes are all very different. It must, however, be highlighted that the results show a common pattern with regard to the best statistical technique (RF) to be used, for the forecasting horizons and also for the most influential input groups. Despite this general common pattern, the particular aspects of the RF models built for the individual sites, such as the specific inputs selected at each location for each forecasting horizon (Figures 8 and 10), are different. Moreover, similarly to other statistical forecasting models [8,34,39], especially for sites where complex topography is an important factor, the RF models developed here are not exchangeable, but rather site-dependent and valid only for their specific locations. Okumus and Dinler [12] also present the idea that results using similar approaches to that in this study depend on target location and data.
The distance from the locations and the nearest ERA-Interim grid points, around 14-25 km, could impact ( Figure 1) the results. By applying these techniques to a higher-resolution operational model, better results from the model can be expected, because the model output would be available at a smaller distance from the wind station. As mentioned above, the wind forecasts provided by the model used in the reanalysis play an important role in building the predictions [31]. The impact of the distance between grids and points needs to be assessed with further studies, and with new reanalyses, like ERA5, since operational models are often run at much higher resolutions. In any case, if higher-resolution models were used, better inputs would also be introduced in the statistical model. Thus, it can be expected that the RF-based models' results presented in this paper would be relatively robust to the spatial resolution of the NWF model, which provides the inputs. However, inhomogeneities in the archives derived from operational models (not the ones used in reanalyses) make this somewhat problematic when evaluating a relatively long time period (eight years), such as the period in this study.
Since meteorological services already have access to the numerical outputs from global and regional numerical models, the availability of advanced statistical methods coupled to numerical outputs, such as the one presented in this study, is important, because it allows a local wind prediction system to be developed with a lead time of several hours (1-24 h). From the perspective of meteorological agency, forecasts are very important, but especially so in emergency cases: To more accurately forecast wind speed and direction for the next 24 h when strong winds are expected, and when wind gusts are expected to surpass the threshold levels corresponding to a warning that should be issued, because, on those occasions, the wind could result in dangerous situations for the population [1,4].
Regarding other applications, it should be highlighted that the wind energy sector already uses [13] deep techniques to improve the renewable energy forecasts. Having access to wind forecasts that are as accurate as possible could help better integrate wind energy into the electric grid [6,75]. In this field, short ranges of between 10 min and 2 h are customarily used for controlling and regulating turbines. According to our results, further study is necessary at this point, because the proposed models perform more poorly than persistence for this horizon. On the other hand, electricity markets require longer prediction ranges, usually between 13 and 37 h in the future, with hourly temporal resolution. In this case, we propose that from 13 to 24 h RF forecasting should be used, while, from 24 to 37 h forecasts, an NWF model is probably the best tool (valid for ERA-Interim, at least). We hope that the method proposed in this paper can be used to optimize the position in the electricity auction market of the stakeholders in the wind energy sector, although future studies should evaluate this in detail, considering the drawbacks of higher resolution (operational models) versus longer homogeneous training length (reanalyses).

Conclusions
Although applied to three particular locations, the present study contains a complete method for short-term wind prediction that can easily be transferred and potentially applied elsewhere. This method involves a machine-learning technique like RF and, in addition to numerical forecasts (provided here by ERA-Interim as hindcasts) and local wind observations, it incorporates extended EOFs as additional inputs. The result is wind forecasts at hourly intervals up to 24 h in the future.
Although previously discussed in general terms, the forecasting errors of the RF-based models presented in this paper cannot be directly compared to results from previous studies. Nevertheless, it can be stated that, in absolute terms, the performance is, in general, similar to the one reported by other studies, and that errors are lower than the plain use of either persistence or direct model output from ERA-Interim. For this reason, in our study, RF-based models provide clear added value and represent a significant improvement over the use of the above-mentioned readily available forecasts for forecast horizons of 4-24 h, and sometimes 1-24 h forecasting horizons.
Below this forecasting range, the memory of the system (persistence) is the best predictor. The comparison between three different types of techniques (RF, linear regression, and analogues) evaluated in this study indicates that, in general, RF performs slightly better than the others, but not always and not for all forecasting horizons.
The results are common to the three analysed variables (u, v, and wind gust), forecasting horizons (24), and sites (three). Since the analysis covered many input cases (over 1800), many years, different locations (ocean, coastal, and inland places), and statistical techniques, we can consider this a robust outcome. The next step in this research path, once the u and v wind components and wind gust have been analyzed, will be to apply the same method to extreme events. This requires a longer training database, due to a very low probability of this type of event, but a similar approach to that presented in this paper, based on RF and similar types of inputs, looks highly promising. ERA5 and UERRA reanalyses [54] have recently been made available with 0.3 • and 1 h (ERA5) and 11 km × 11 km and 6 h (UERRA) spatial and temporal resolutions for a long time span: (1979-2018) for ERA5; and (1962-2018) for UERRA. As in the ERA-Interim model, forecasts are also available, so ERA5 or UERRA, along with RF, will be used in future studies. The prediction of extreme events and the appropriate management of this information represent important challenges for meteorological services and insurance companies. Furthermore, RF can also yield probabilistic predictions, something that is useful when different alert levels have to be launched to protect lives and properties. In this paper, different statistical models were compared for the short-term forecasting of wind and wind forecasts over the ocean, the coast, and an interior site in the Basque Country. The design of this study follows the reasoning presented in many references already discussed in the introduction to this paper [8,12,26,27,31,33]. This has also been the authors' experience in previous papers [52,76]. It is a common result in the literature that the use of a statistical model, which is fed with data provided by a numerical weather forecast model, yields better results than the use of a statistical model alone or a numerical model alone. There are interesting new developments in the field of statistical wind forecast [45,46]. Since the regime-switch