Wind Speed Forecast Based on Post-Processing of Numerical Weather Predictions Using a Gradient Boosting Decision Tree Algorithm

With the large-scale development of wind energy, wind power forecasting plays a key role in power dispatching in the electric power grid, as well as in the operation and maintenance of wind farms. The most important technology for wind power forecasting is wind speed forecasting. The current mainstream methods for wind speed forecasting combine mesoscale numerical meteorological models with a post-processing system. Our work uses the WRF model to obtain the numerical weather forecast and the gradient boosting decision tree (GBDT) algorithm to improve the near-surface wind speed post-processing results of the numerical weather model. We calculate the feature importance of GBDT in order to find out which features most affect the post-processed wind speed results. The results show that, after using about 300 features at different height and pressure layers, the GBDT algorithm can output more accurate wind speed forecasts than the original WRF results and other post-processing models such as decision tree regression (DTR) and multi-layer perceptron regression (MLPR). Using GBDT, the root mean square error (RMSE) of wind speed, 2.7–3.5 m/s in the original WRF results, can be reduced by 1–1.5 m/s, which is better than DTR and MLPR. Meanwhile, the index of agreement (IA) can be improved by 0.10–0.20, the correlation coefficient by 0.10–0.18, and the Nash–Sutcliffe efficiency coefficient (NSE) by −0.06–0.6. It can also be found that the feature which most affects the GBDT results is the near-surface wind speed. Other variables, such as forecast month, forecast time, and temperature, also affect the GBDT results.


Introduction
Among the renewable energy technologies currently developed, wind power is a renewable energy with mature technology and large-scale development prospects. One of the key technologies for the development of wind power is forecasting of the amount of power generated by wind farms. As the output power of a wind turbine is directly related to wind speed, wind power forecasts strongly depend on wind speed forecasts.
In the development of wind power forecast technology for wind farms, mesoscale model simulation is a useful method for wind speed forecasting. Rife et al. [1] used the mesoscale numerical weather prediction (NWP) model MM5 to predict the low-level wind in the boundary layer. Storm et al. [2] evaluated the performance of the Weather Research and Forecast (WRF) model in predicting the low-level jet (LLJ) and pointed out that the LLJ results simulated by the WRF model were similar to observations. This result indicates that the mesoscale model has the ability to capture some characteristics of boundary layer wind. Marquis et al. [3] pointed out that the wind speed predicted by the WRF model can help wind resources be better developed and utilized. Foley et al. [4] discussed methods for wind power generation forecasting and mentioned that the use of mesoscale models for dynamic downscaling of large-scale forecasts is the basis of wind power forecasting for wind farms. Zhao et al. [5] presented a day-ahead wind power forecasting system in which the WRF model was used to forecast the wind speed. Mahoney et al. [6], Stathopoulos et al. [7], and Wyszogrodzki et al. [8] did similar work; mesoscale models are widely used in wind power forecasting.
Based on the NWP model, some research has focused on improving wind speed forecasting results. Deppe et al. [9] used aggregated results of WRF model simulations with different planetary boundary layer (PBL) schemes to improve turbine-height wind speed forecasts. Tateo et al. [10] also used the ensemble method of combining different PBL scheme results. Cheng et al. [11] and Marjanovic et al. [12] explored the impact of the choice of physical schemes on wind forecasts, and data assimilation is also an effective way to improve wind speed forecasts [13][14][15][16][17][18][19][20].
In the current status of wind speed forecasting, combining a mesoscale numerical model with post-processing algorithms is an efficient method [21,22]. Such mesoscale numerical models mainly include WRF model or other NWP models, and the post-processing algorithms mainly include statistical methods and machine learning methods.
The statistical methods used have been developed over many years. The Model Output Statistics (MOS) method has been widely used for a long time [23][24][25][26][27]. In addition to the MOS method, there have also been some studies related to error correction by analyzing the systematic errors of the numerical model. Stensrud et al. [28] compared the grid point temperature data of the mesoscale model with observed data and obtained the systematic deviation of the model. After subtracting the average temperature deviation from the model results, the corrected temperature was closer to the observed value. Another work, by Stensrud et al. [29], integrated ensemble predictions into this error correction method. That work used the results of 23 ensemble members; a simple seven-day continuous average was used to correct the deviation, and the deviation-corrected ensemble result was compared with the temperature at 2 m above ground level (AGL). The corrected result was better than that of the MOS method. Eckel et al. [30] and Hacker et al. [31] also carried out similar studies related to ensemble forecasting and the reduction of systematic error. In addition, other statistical methods, such as the Kalman filter, have also been widely used in numerical weather model post-processing [32][33][34][35][36].
In recent years, machine learning algorithms have been applied to wind forecasting in wind farms. Li et al. [37] used three typical neural networks to predict the 1-h-ahead wind speed. Ishak et al. [38] obtained a mesoscale weather forecast using the MM5 (fifth-generation Penn State/NCAR Mesoscale Model) model, and multiple linear regression (MLR), support vector machine (SVM), and artificial neural network (ANN) models were used to process the mesoscale model results and output the wind speed. The results showed that the SVM algorithm performed the best due to its ability to capture non-linear relations. Sweeney et al. [39] combined several post-processing methods (short-term bias correction, diurnal cycle correction, linear least-squares, Kalman filter, mean and variance correction, directional-bias forecast, and ANN) and proposed a combined post-processing method to reduce the error in the wind speed forecast. Zjavka et al. [40] used a polynomial neural network method to process the output of a mesoscale numerical meteorological model and obtained a more accurate wind speed forecast. Zhao et al. [41] built a wind speed forecast system based on WRF ensemble results and post-processing algorithms. Zhao et al. [42] combined non-linear and non-parametric algorithms to correct the wind speed output of the numerical model. Papayiannis et al. [43] proposed a new method based on optimal transportation theory to fit the observed wind speed and model results.
From the previous works, it can be seen that the mainstream method for wind speed prediction uses a numerical weather model to predict the meteorological features, selects some of the features as the input of a post-processing algorithm, and then uses the post-processing algorithm to output the corrected wind speed.
In this paper, the Weather Research and Forecast (WRF) model was used for numerical weather prediction in wind farm areas. The gradient boosting decision tree (GBDT) method was used to perform the post-processing task and output the post-processed wind speed results. Two additional machine learning models were used for comparison. By comparing the RMSE of the WRF model's wind speed output with the RMSE of the post-processing models' wind speed output, we evaluate the performance of GBDT as a post-processing method. Section 2 mainly introduces the WRF model, observation data, and GBDT algorithm. Section 3 analyzes the results of GBDT and its feature importance distribution. Section 4 presents our main conclusions.

Numerical Weather Model
The numerical weather prediction model we used in our tests is the Weather Research and Forecast (WRF) model (Version 3.9.1) [44]. We designed a nested-grid simulation of the WRF model, where the outer grid has 25 km horizontal resolution and the inner grid has 5 km horizontal resolution. Figure 1 shows domains 01 and 02 of our simulation. Figure 2 shows the domain 02 area, which contains some wind observation towers; these towers were used to evaluate the forecasting results of WRF and measure the wind speed forecast improvement of the post-processing algorithm.
In China, every wind farm needs to provide wind power forecasts to the power grid. Every morning (UTC + 8 h), the wind farm needs to make a forecast of its own power generation for the next few days. The forecast time starts at 00:00 (UTC + 8 h) of the next day and lasts for several days. Therefore, our WRF model starts running at 00:00 (UTC) every day, and the result starting at 16:00 UTC (+1 day, 00:00 UTC + 8 h) every day is taken as the wind speed forecast of the wind farm, where the output interval of the WRF model is 10 min. Figure 3 shows the WRF running time configuration.

The driving data of the WRF model are from the final operational global analysis data (FNL) [45], which provide the initial field and boundary forcing of the WRF model. The FNL data were provided by the National Centre for Environmental Prediction (NCEP) and have 1 degree of spatial resolution and a 6-h temporal resolution. Table 1 shows the domain configuration and parameter settings of the WRF model; in our study period, the models that were run every day used the same set of configurations.

After obtaining the model results, we interpolated the wind field to heights of 10, 30, 50, and 70 m using linear interpolation. We also interpolated the wind field to each wind tower's location using bilinear interpolation. We used the interpolated results for comparison with the observed wind speed.
The ETA values of the near-surface layers were 1.0000, 0.9960, 0.9920, 0.9900, 0.9851 . . . The average altitudes of the near-surface model layers in the domain were 16.07, 48.22, 72.36, 101.16, and 134.84 m. It can be seen that there are model layers near 50 m and 70 m. The heights of the wind tower data are 10, 30, 50, and 70 m. For the 10 m data, we used the 10 m wind speed output of the WRF model. For data at other heights, we used linear interpolation in the vertical direction. Despite the nonlinear characteristics of the wind in the boundary layer atmosphere, the existence of model layers near the heights of 50 m and 70 m means that the nonlinear changes in the boundary layer have no significant effect on the interpolated results at 50 m and 70 m.
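As a concrete illustration of this vertical interpolation step, the following sketch interpolates a wind profile from the model-layer heights quoted above to the tower measurement heights. The wind speeds assigned to the model layers are hypothetical values for illustration, not results from this study:

```python
import numpy as np

# Average model-layer heights (m) from the text; the wind speeds at those
# layers are hypothetical, illustrative values (m/s).
layer_heights = np.array([16.07, 48.22, 72.36, 101.16, 134.84])
layer_wind = np.array([4.0, 5.1, 5.6, 6.0, 6.3])

# Vertical linear interpolation to the tower measurement heights (m).
tower_heights = np.array([30.0, 50.0, 70.0])
tower_wind = np.interp(tower_heights, layer_heights, layer_wind)
```

Because model layers bracket the 50 m and 70 m tower levels closely, the linear approximation between adjacent layers introduces little error there.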

Wind Observation Data
In order to evaluate the results of the post-processing algorithm, we used wind speed observation data from 14 wind observation towers to evaluate the model results and the post-processing results. The geographical distribution of the wind towers is shown in Figure 2. These wind towers are distributed in the coastal areas of Jiangsu, China, where wind power has been widely developed. Table 2 shows the terrain height, location, and sensor parameters of the 14 wind towers. The time interval of the wind tower observation data is 10 min. The observation period lasted from June 2009 to May 2010. Each wind tower had observation data at different heights near the ground: 10 m above ground level (AGL), 30 m AGL, 50 m AGL, and 70 m AGL.

Results Measurements
In order to evaluate the results of the different tests, the following metrics were calculated for the WRF model results and the post-processing results.

Root Mean Square Error
The root mean square error (RMSE) is widely used in NWP to evaluate the error of wind speed and other meteorological variables. The RMSE is calculated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(M_i - O_i\right)^2}$$

where n is the number of observations, M_i is the wind speed in the model results or post-processing results, and O_i is the wind speed observation.

Index of Agreement
Index of agreement (IA) is a standardized measure of the degree of model prediction error [46][47][48]. It can be calculated by

$$\mathrm{IA} = 1 - \frac{\sum_{i=1}^{n}\left(M_i - O_i\right)^2}{\sum_{i=1}^{n}\left(\left|M_i - \overline{O}\right| + \left|O_i - \overline{O}\right|\right)^2}$$

where \(\overline{O}\) is the average value of the wind speed observations. The index of agreement varies between 0 and 1, where a value of IA close to 1 indicates well-matched results, and 0 indicates no agreement at all.
The index of agreement can detect additive and proportional differences in the observed and simulated means and variances [49].

Correlation Coefficient
The Pearson correlation coefficient (R) is also widely used to evaluate the performance of wind speed simulations in NWP. It reflects the correlation between the wind speed simulation series and the observation series:

$$R = \frac{\sum_{i=1}^{n}\left(M_i - \overline{M}\right)\left(O_i - \overline{O}\right)}{\sqrt{\sum_{i=1}^{n}\left(M_i - \overline{M}\right)^2}\sqrt{\sum_{i=1}^{n}\left(O_i - \overline{O}\right)^2}}$$

where \(\overline{M}\) is the average value of the simulated wind speed.

Nash-Sutcliffe Efficiency Coefficient
The Nash-Sutcliffe efficiency coefficient (NSE) is used to assess the predictive ability of a numerical model [50]. NSE can be calculated as

$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n}\left(M_i - O_i\right)^2}{\sum_{i=1}^{n}\left(O_i - \overline{O}\right)^2}$$

NSE ranges from −∞ to 1. A result of 1 (NSE = 1) means that the model results match the observations perfectly. A result of 0 (NSE = 0) means that the model results are as accurate as the mean of the observed data, and a less-than-zero result (NSE < 0) means that the mean of the observed data is a better predictor than the model.
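The four metrics above can be sketched in NumPy as follows (a minimal illustration; `m` denotes the model or post-processing series and `o` the observation series, and the function names are ours):

```python
import numpy as np

def rmse(m, o):
    """Root mean square error between model (m) and observed (o) series."""
    m, o = np.asarray(m, float), np.asarray(o, float)
    return np.sqrt(np.mean((m - o) ** 2))

def index_of_agreement(m, o):
    """Willmott's index of agreement, bounded in [0, 1]."""
    m, o = np.asarray(m, float), np.asarray(o, float)
    denom = np.sum((np.abs(m - o.mean()) + np.abs(o - o.mean())) ** 2)
    return 1.0 - np.sum((m - o) ** 2) / denom

def pearson_r(m, o):
    """Pearson correlation coefficient between the two series."""
    m, o = np.asarray(m, float), np.asarray(o, float)
    return np.corrcoef(m, o)[0, 1]

def nse(m, o):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    m, o = np.asarray(m, float), np.asarray(o, float)
    return 1.0 - np.sum((m - o) ** 2) / np.sum((o - o.mean()) ** 2)
```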

Gradient Boosting Decision Tree
In our study, gradient boosting decision tree (GBDT) is used to conduct the post-processing of WRF model output. The original results of the WRF model were processed by horizontal and vertical interpolation into values of meteorological variables at different heights of each wind tower. In the GBDT model training process, these meteorological variables are used as input, and the wind speed observations are used as output.

Ensemble Learning Approach: Boosting
The GBDT algorithm is a kind of ensemble learning algorithm. Ensemble learning is not a specific machine learning algorithm but completes the learning task by building and combining multiple machine learners; we call these learners "weak learners" and the combined learner a "strong learner" [51].
Boosting is one such ensemble method [52,53]. At the beginning of boosting training, a weak learner is trained on a training data set with initial weights. The weights of the training data set are then updated according to the learning error of the weak learner, such that the training samples with high learning errors receive higher weights. These poorly fitted samples get more attention from the weak learners later in the training process. After updating the weights, the boosting algorithm continues to train new weak learners based on the updated training data set. This is repeated until the number of weak learners reaches a predetermined number; finally, these weak learners are integrated through a combination strategy to obtain the final strong learner. Figure 4 shows the overall process of the boosting algorithm.
In the history of the development of the boosting algorithm, AdaBoost (adaptive boosting) [54,55] is a common algorithm. In each iteration, AdaBoost calculates the error of each sample, computes a new weight based on the error of each sample, and performs the next iteration. Unlike AdaBoost, GBDT determines the new weight distribution by calculating the error gradient in each iteration. Therefore, the GBDT algorithm can achieve more accurate results than the AdaBoost algorithm.

Classification and Regression Tree (CART)
The gradient boosting decision tree (GBDT) method is a kind of boosting algorithm. GBDT uses a CART (classification and regression tree) decision tree as its weak learner. A CART decision tree can take categorical and numerical features as its input and can be used for classification and regression tasks.
The decision tree algorithm is a classic algorithm in machine learning and was proposed by Quinlan [56]. In the history of decision tree algorithm development, ID3 decision trees and C4.5 decision trees are important types of decision tree algorithms [57]. The C4.5 algorithm addresses the deficiencies of the ID3 algorithm; at the same time, C4.5 has its own shortcomings, such as the inability to perform regression and low computational efficiency. Compared with the C4.5 algorithm, CART has higher computational efficiency and can handle regression problems [58].
For categorical features, the CART decision tree calculates the Gini coefficient [58] to select split features and decide how to split the features in the tree branches. The ID3 and C4.5 algorithms use information gain to select features; CART uses Gini coefficients instead of information gain ratios. The Gini coefficient represents the impurity of the model: the smaller the Gini coefficient, the lower the impurity and the better the feature.
Suppose there are K categories in a feature, and the probability of the kth category is p_k. Then, the expression of the Gini coefficient is

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k\left(1 - p_k\right) = 1 - \sum_{k=1}^{K} p_k^2$$

For a given sample D, suppose there are K categories, and the number of samples in the kth category is C_k. Then, the expression of the Gini coefficient of the sample D is

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2$$

For a sample D, if D is divided into two parts, D_1 and D_2, by a value a of a feature A, then, under the condition of feature A, the Gini coefficient of D is

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\mathrm{Gini}(D_2)$$

For the numerical features in the regression problem, the CART decision tree uses a measurement of the sum of squared errors. The measurement goal of the CART regression tree is: for a feature A, a partition point s divides the data set into D_1 and D_2; we need to find the feature and feature value partition point that minimize the sum of the squared errors of the respective sets D_1 and D_2:

$$\min_{A,\,s}\left[\min_{c_1}\sum_{x_i \in D_1}\left(y_i - c_1\right)^2 + \min_{c_2}\sum_{x_i \in D_2}\left(y_i - c_2\right)^2\right]$$

where c_1 is the mean value of the targets in the data set D_1, and c_2 is the mean value of the targets in the data set D_2.
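The two split criteria above can be illustrated with a short sketch (the helper names are ours; the exhaustive search over partition points mirrors the CART procedure described in the text, not this paper's actual code):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """For a numerical feature x, find the partition point s minimizing the
    summed squared error of the subsets D1 (x <= s) and D2 (x > s)."""
    best_s, best_err = None, np.inf
    for s in np.unique(x)[:-1]:          # candidate partition points
        left, right = y[x <= s], y[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err
```

Splitting a perfectly separable target (e.g. y = 1 below x = 3 and y = 5 above) drives the summed squared error to zero at the true partition point.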

Training Process of GBDT
Unlike other boosting algorithms, GBDT iterates by calculating gradients [59,60]. Using the CART decision tree, we can perform multiple training iterations, where each iteration trains a new CART decision tree based on the training data after the updating of the weights. The GBDT algorithm can be expressed by the formula

$$f(x) = \sum_{t=1}^{T} \gamma_t h_t(x)$$

where f(x) is the final strong learner, h_t(x) is the weak learner of each iteration, and γ_t is the weight of each weak learner in the strong learner. We suppose that the strong learner is f_t(x) when iterating to the tth round during the GBDT iteration process and that the loss function is L(y, f_t(x)), where x is the input data (WRF output) and y is the label (wind speed observation).

When iterating to a new step, GBDT builds the strong learner in a greedy way

$$f_t(x) = f_{t-1}(x) + \gamma_t h_t(x)$$

The newly added CART decision tree h_t(x) tries to minimize the loss L, where the new loss function is

$$L_t = \sum_{i=1}^{n} L\left(y_i,\, f_{t-1}(x_i) + h_t(x_i)\right)$$

and the new CART decision tree is

$$h_t = \arg\min_{h} \sum_{i=1}^{n} L\left(y_i,\, f_{t-1}(x_i) + h(x_i)\right)$$

GBDT's iteration process intends to solve this minimization problem numerically using steepest descent: the steepest descent direction is the negative gradient of the loss function evaluated at the current model f_{t-1}, which can be calculated for any differentiable loss function. When performing the tth round of CART decision tree training, the loss of each sample is used to calculate the negative gradient, and the negative gradient of the loss function of the ith sample can be expressed as

$$r_{ti} = -\left[\frac{\partial L\left(y_i, f(x_i)\right)}{\partial f(x_i)}\right]_{f = f_{t-1}}$$

where the weight γ_t of the weak learner is chosen using line search

$$\gamma_t = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i,\, f_{t-1}(x_i) + \gamma h_t(x_i)\right)$$

Figure 5 shows the training process of GBDT.

By using (x_i, r_{ti}) (i = 1, 2, 3 . . . n), a CART decision tree can be trained, and the tth CART decision tree in the integrated model is obtained as

$$h_t(x) = \sum_{j=1}^{J} c_{tj}\, I\left(x \in R_{tj}\right)$$

where R_{tj} is the region of the jth leaf of the tth decision tree and the c_{tj} together form the leaf combination c_t of the tth decision tree.
After we obtain the tth decision tree, we can update the strong learner

$$f_t(x) = f_{t-1}(x) + \sum_{j=1}^{J} c_{tj}\, I\left(x \in R_{tj}\right)$$

where I(x ∈ R_{tj}) indicates whether input x falls into the jth leaf, so the added term is the leaf output for input x.
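The training loop above can be condensed into a from-scratch sketch using depth-1 trees (stumps) as weak learners and the squared loss, for which the negative gradient r_{ti} reduces to the residual y − f(x). This is an illustrative toy implementation with names of our choosing, not the paper's LightGBM setup:

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 CART regression tree (stump) to residuals r on feature x."""
    best = None
    for s in np.unique(x)[:-1]:                      # candidate partition points
        left, right = r[x <= s], r[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, cl, cr = best
    return s, cl, cr

def stump_predict(x, stump):
    s, cl, cr = stump
    return np.where(x <= s, cl, cr)

def gbdt_fit(x, y, n_trees=100, lr=0.1):
    """Each round fits a stump to the negative gradient of the squared loss,
    which is simply the residual y - f(x)."""
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    stumps = []
    for _ in range(n_trees):
        residual = y - pred                          # negative gradient r_ti
        st = fit_stump(x, residual)
        pred += lr * stump_predict(x, st)            # update the strong learner
        stumps.append(st)
    return f0, stumps

def gbdt_predict(x, f0, stumps, lr=0.1):
    pred = np.full(x.shape[0], f0, dtype=float)
    for st in stumps:
        pred += lr * stump_predict(x, st)
    return pred
```

With a fixed shrinkage `lr` in place of the line-searched γ_t, the residual of a separable target shrinks geometrically with each added tree.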

Feature Importance of GBDT
After training the GBDT model, we can calculate the feature importance distribution of the GBDT model to obtain each feature's importance. During the branching of the decision tree, the variable to be split and the split value of the variable are determined by calculating the information gain. The information gain can be expressed as I(A, D), where A is the features and D is the data samples. After all decision trees are constructed, the importance (or contribution value) of a feature a can be obtained by calculating the information gain of the feature over the trees of the GBDT strong learner and dividing by the total frequency of the feature in all of the trees:

$$\mathrm{FI}_a = \frac{1}{N_a}\sum_{\text{splits on } a} I(a, D)$$

where N_a is the total frequency of the feature a in all trees.
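A simplified, gain-style version of this importance calculation can be sketched for stump weak learners as follows. It accumulates the squared-error reduction credited to each feature and normalizes the totals to sum to 1; this illustrates the idea rather than reproducing LightGBM's exact frequency-normalized definition, and all names are ours:

```python
import numpy as np

def stump_gain(x, r):
    """Best squared-error reduction obtainable by one split on feature x."""
    base = ((r - r.mean()) ** 2).sum()
    best = 0.0
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        best = max(best, base - err)
    return best

def fit_best_stump(x, r):
    """Return (s, c_left, c_right) for the error-minimizing split of x."""
    best = None
    for s in np.unique(x)[:-1]:
        left, right = r[x <= s], r[x > s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, cl, cr = best
    return s, cl, cr

def feature_importance(X, y, n_rounds=20, lr=0.3):
    """Boost stumps; credit each round's split gain to the chosen feature,
    then normalize so the importances sum to 1."""
    n, d = X.shape
    pred = np.full(n, y.mean())
    gain = np.zeros(d)
    for _ in range(n_rounds):
        r = y - pred
        gains = [stump_gain(X[:, j], r) for j in range(d)]
        j = int(np.argmax(gains))            # greedy: split on the best feature
        s, cl, cr = fit_best_stump(X[:, j], r)
        gain[j] += gains[j]
        pred += lr * np.where(X[:, j] <= s, cl, cr)
    return gain / gain.sum()
```

An informative feature collects all the gain, while a constant feature (never splittable) ends with zero importance.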

Features Selection and Parameters Setting of GBDT
After we obtain the WRF model forecast results for the 0-24 h, 24-48 h, and 48-72 h forecasts, we need to decide which variables to use as the input features of GBDT. The output data interval of the WRF model is 10 min. Therefore, we take the output of each moment of the WRF model as one record of input data for GBDT.
In fitting the near-surface wind speeds, we used multiple variables at different heights and pressure layers as the input features of the GBDT model. We used the linear interpolation method to interpolate the model results onto the different layers. These variables consist of wind speed, wind direction, temperature, pressure, height, absolute vorticity (avo), and potential vorticity (pvo). Table 3 shows the variables at different pressure layers, the pressure layers being 850, 700, 500, and 300 hPa. Table 4 shows the variables at different height layers, with 29 height layers from 10 m to 5000 m; as our post-processing output result is the near-surface wind speed, we used more height layers in the lower atmosphere.

Table 3. GBDT input features at different pressure layers.

Variables at each height level: wind speed, wind direction, temperature, pressure, avo, pvo.

As the GBDT model is based on the CART decision tree, which can effectively deal with categorical features, we added several categorical features to the input of the GBDT model. Table 5 shows the categorical features used in GBDT. For the date of each forecast record, we took 'month' as a categorical feature, ranging from January to December. For the time of each forecast record, we took 'hour' as a categorical feature, ranging from 1 to 24, which indicates the forecast record at different hours of the day. When we completed the feature engineering, we built the GBDT ensemble model to carry out the post-processing work. We used LightGBM [61] to build our GBDT models, where the input of LightGBM was the features obtained from the output of each forecast time of the WRF model and the training target of LightGBM was the wind speed observations from the wind towers.
We divided the total forecast records into two parts: one was training data, used to train the GBDT model, and the other was testing data, used to evaluate the error of the GBDT model.
In the WRF results and observation data, the interval between adjacent data records is 10 min. The state of the atmosphere is unlikely to change much within 10 min, so adjacent data records are very similar. Therefore, if all the data were randomly divided into training data and test data, as in the traditional machine learning approach, the results would not reflect the true performance of the model. Based on this, we chose to use entire days of data as the training set or test set.
In each month, we chose the data on days 3, 7, 11, 15, 19, 23, and 27 as test data and the data on the other days as training data. Table 6 shows the train and test data split within one month. For each wind tower, we trained the GBDT model on the WRF output and observed wind speed at different forecast times (0-24 h, 24-48 h, and 48-72 h) and different heights (10, 30, 50, and 70 m). The days in parentheses (29), (30), and (31) indicate that some months may not have these dates.
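The day-based split described above can be sketched as follows (the helper name is ours; all 10-min records whose day-of-month is in the test list go to the test set, so no day straddles the boundary):

```python
import numpy as np

def split_by_day(day_of_month, test_days=(3, 7, 11, 15, 19, 23, 27)):
    """Return (train_idx, test_idx) index arrays so that every record of a
    given day falls entirely on one side of the train/test split."""
    day_of_month = np.asarray(day_of_month)
    test_mask = np.isin(day_of_month, test_days)
    return np.where(~test_mask)[0], np.where(test_mask)[0]
```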
Before the training of the GBDT models, we had to set the parameters of the GBDT models. Our GBDT models mainly needed two parameters to be tuned: the number of leaves and the minimum number of data in each leaf. These two parameters are both related to the goodness of fit and can reduce overfitting. Table 7 shows the parameter configurations of LightGBM and the tuning ranges of the two parameters. As shown in Table 7, two parameters, the number of leaves and the minimum data in leaf, needed to be tuned. We set a pairwise combination of the values of the two parameters, where the number of leaves took values of 10, 20, 40, 80, and 160, and the minimum data in leaf took values of 10, 20, 40, and 80.
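The pairwise combination of the two tuned values can be enumerated as a simple grid; the keys follow LightGBM's `num_leaves` and `min_data_in_leaf` parameter names:

```python
from itertools import product

# Tuning ranges for the two LightGBM parameters considered in the text.
num_leaves_values = [10, 20, 40, 80, 160]
min_data_in_leaf_values = [10, 20, 40, 80]

# Pairwise combinations: 5 x 4 = 20 candidate configurations.
param_grid = [{"num_leaves": L, "min_data_in_leaf": D}
              for L, D in product(num_leaves_values, min_data_in_leaf_values)]
```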

Models Used for Comparison
Most previous machine-learning-based post-processing algorithms used an artificial neural network (ANN) as the machine learning model or as part of an ensemble model. In addition, the basic model of our GBDT algorithm is the CART decision tree regression (DTR). Based on this, we chose multi-layer perceptron regression (MLPR) and DTR as the models for comparison, in order to show the performance improvement of GBDT over traditional machine learning models. The DTR and MLPR models used the same training and test data as GBDT, and the RMSE of each model was calculated to compare the performance of the post-processing results of WRF. Table 8 lists the parameter settings of the MLPR model and Table 9 those of the DTR model. For the significance test, a two-sample Student's t-test was used to calculate the p-value of the hypothesis H(test1 & test2): S_test1 = S_test2, where H(test1 & test2) is the t-test hypothesis of the two tests, S_test1 is a statistical variable of test 1, and S_test2 is the same statistical variable of test 2. If the p-value is less than 0.01, the two results are significantly different at the 99% confidence level, while a p-value less than 0.05 means that the difference passes the 95% confidence level.
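The significance test above can be sketched with SciPy's two-sample t-test. The sample values below are synthetic stand-ins for the real per-tower statistics, chosen only to illustrate the procedure.

```python
# Sketch of the two-sample Student's t-test used for the significance tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
s_test1 = rng.normal(3.0, 0.3, 50)  # hypothetical statistic samples of test 1
s_test2 = rng.normal(1.8, 0.3, 50)  # hypothetical statistic samples of test 2

# Null hypothesis H(test1 & test2): S_test1 = S_test2 (equal means).
t_stat, p_value = stats.ttest_ind(s_test1, s_test2)
significant_99 = p_value < 0.01  # difference passes the 99% confidence level
```

With clearly separated samples, as here, the p-value falls well below 0.01 and the null hypothesis of equal means is rejected.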

GBDT Parameters Tuning Results
We used the WRF model output and the 10 m wind speed observations of tower '10001' to perform the GBDT parameter tuning. The wind speed forecast time of the tuning work was 0-24 h. Figure 6 shows the tuning results for the parameters 'number of leaves' and 'minimum data in leaf'. From Figure 6, we can see that when 'L (number of leaves)' is the same, the curves for different 'D (minimum data in leaf)' are very close. The larger 'L' is, the faster the MSE curve decreases; thus, a larger number of leaves can reduce the MSE and achieve better post-processing results. Table 10 presents the results of parameter tuning after 2000 iterations. For each L and D, we report the results on the training data (train) and the test data (val). It can be seen from Table 10 that, although increasing 'L' reduces the MSE of both the training and test data, it causes overfitting. When 'L' is kept unchanged and 'D' is increased, 'val' shows no obvious change while the MSE of 'train' increases, indicating that increasing 'D' can weaken overfitting. Combining these results, we set the number of leaves to 80 and the minimum data in leaf to 80 in the GBDT wind speed post-processing model.

Post-Processing Results
After parameter tuning, we used the training data in Table 6 to train the GBDT model for the 0-24 h, 24-48 h, and 48-72 h wind speed forecasts at different heights of different towers and evaluated the results on the test data. We calculated the RMSE, IA, R, and NSE of the wind speed output of the WRF model and of the post-processing results using the test data sets. Appendix A contains the RMSE results of all wind towers, including the 0-24 h results (Table A1). Figure 9 shows the average NSE value of the 14 towers in each test. From Figure 9, we can see that the NSE results of the WRF model are close to zero, which means that the simulation results are close to the average level of the observations; that is, the overall results are credible, but the simulation errors are large. Therefore, it is necessary to post-process the model results. From Figures 7-10, we can see that the RMSE of the WRF output is about 2.7-3.5 m/s, the IA is about 0.61-0.75, the NSE is about −0.35 to 0.15, and the R is about 0.51-0.67. The increasing RMSE, as well as the decreasing IA, R, and NSE, between 0-24 h, 24-48 h, and 48-72 h indicates that the error of the WRF forecast increases with forecast time.
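For reference, the four verification metrics used here can be written compactly with NumPy; these definitions (Willmott's index of agreement for IA, Pearson correlation for R, Nash-Sutcliffe efficiency for NSE) follow their standard forms, which we assume match the paper's usage.

```python
# Reference implementations of the verification metrics: RMSE, index of
# agreement (IA), Pearson correlation (R), and Nash-Sutcliffe efficiency (NSE).
# f = forecast values, o = observed values (NumPy arrays).
import numpy as np

def rmse(f, o):
    return float(np.sqrt(np.mean((f - o) ** 2)))

def ia(f, o):
    # Willmott's index of agreement: 1 is a perfect match.
    om = o.mean()
    return float(1 - np.sum((f - o) ** 2)
                 / np.sum((np.abs(f - om) + np.abs(o - om)) ** 2))

def pearson_r(f, o):
    return float(np.corrcoef(f, o)[0, 1])

def nse(f, o):
    # NSE = 0 means the forecast is no better than the observed mean.
    return float(1 - np.sum((f - o) ** 2) / np.sum((o - o.mean()) ** 2))
```

A perfect forecast gives RMSE = 0 and IA = R = NSE = 1, while an NSE near zero corresponds to the "close to the average level of observations" behaviour of the raw WRF output noted above.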
From Table 11, we can see that all the significance test results for RMSE, IA, and R passed the 99% confidence level, which means the improvement of the GBDT model in RMSE, IA, and R is significant. However, some significance test results for NSE did not pass the 95% confidence level, especially in the 24-48 h and 48-72 h forecasts, meaning that the NSE improvement is not statistically significant in those cases.
By comparing the results of WRF and the post-processing models, it can be seen that each post-processing method reduces the RMSE within a certain range. The RMSE of the GBDT results is smaller than that of MLPR and DTR, which shows that GBDT reduces the RMSE more. The reduction in RMSE by the GBDT model is between 1 and 1.5 m/s, compared with 0.2-1 m/s for DTR and 0.7-1.2 m/s for MLPR. For the IA, R, and NSE results, GBDT achieved the highest values among the four tests. Compared with WRF, GBDT greatly improved these three metrics (IA by 0.10-0.20, R by 0.10-0.18, and NSE by −0.06 to 0.6). The reduction in RMSE, as well as the improvement in IA and R, indicates that GBDT can fit the near-surface wind speed with a smaller error than the other two post-processing models. Thus, GBDT can be used to post-process the WRF model output to forecast the near-surface wind speed.
In order to further study the error changes of WRF and GBDT in different months, we calculated the average RMSE of the wind speed forecast for each month. Figure 11 shows the RMSE results in different months, including the monthly changes in the RMSE of WRF and GBDT and the percentage reduction in RMSE of GBDT relative to the WRF results. From Figure 11, we can see that the RMSE of the WRF forecast results varies greatly between months, with large RMSE in March, April, June, July, and December. At the same time, the RMSE of the GBDT results varies less across months. In the months when the RMSE of the WRF results is large, GBDT reduces the RMSE more, so that the final RMSE is roughly the same in each month.
In order to compare the post-processing effects of the GBDT model at different times, we calculated the RMSE and IA for different hours in different months. Figure 12 shows the RMSE and Figure 13 the IA in different months and hours. From Figures 12 and 13, we can see that in July, September, October, and December, the forecast results at hours 12-24 are worse than those at hours 0-12. Compared with WRF, the GBDT results did not show larger errors during these poorly forecast periods. It can be seen from Figures 12c and 13c that when the forecast results have larger errors, GBDT reduces the RMSE and improves the IA more than at other times.

Weibull Distributions
In general cases, the distribution of near-surface wind speed can be fitted using the Weibull distribution [62]. The density function of the Weibull distribution is

f(x; k, λ) = (k/λ)(x/λ)^(k−1) exp[−(x/λ)^k], x ≥ 0,

where x is the wind speed, k > 0 is the shape parameter, and λ > 0 is the scale parameter of the Weibull distribution.
The two parameters of the Weibull distribution can be used to determine whether the distribution of the NWP results or the post-processing results is similar to the observations. We used the 10 m wind speed results of the test data to fit the Weibull distributions of WRF, GBDT, DTR, and MLPR; the Weibull distributions are shown in Figure 14, and the parameters of these distributions are listed in Table 12.
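Fitting the two parameters can be sketched with SciPy, whose weibull_min distribution uses the shape c for k and 'scale' for λ; fixing the location at zero recovers the two-parameter form given above. The synthetic sample below is an illustrative assumption.

```python
# Sketch of fitting the two-parameter Weibull distribution to wind speeds.
import numpy as np
from scipy.stats import weibull_min

# Synthetic 10 m wind speeds drawn from a known Weibull(k=2, lambda=6).
speeds = weibull_min.rvs(2.0, loc=0, scale=6.0, size=5000, random_state=0)

# Fix loc=0 so that only the shape k and the scale lambda are estimated,
# matching the two-parameter density f(x; k, lambda).
k, loc, lam = weibull_min.fit(speeds, floc=0)
```

Comparing the fitted (k, λ) pairs of the model outputs against those of the observations is exactly the comparison reported in Table 12.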

From Figure 14 and Table 12, we can see that, relative to the original WRF results, the Weibull distribution curves of the post-processing models are closer to the observations. Among the post-processing models, the curves of the GBDT models are closest to the observations, both in shape and peak value. These results show that the GBDT post-processing model captures the wind speed distribution better than the other post-processing models and the original WRF output.

GBDT Feature Importance Results
After training the GBDT models, we calculated the feature importance of each GBDT model. We first calculated the feature importance distribution of each GBDT model and then averaged the feature importance distributions over the 14 towers. As the results in Figure 7 showed that the forecast results for 0-24 h, 24-48 h, and 48-72 h were not significantly different, we averaged the results of these three forecast times at different heights (10 m, 30 m, 50 m, and 70 m). Figure 15 shows the feature importance results. It can be seen from Figure 15 that the near-surface wind speed features had the highest importance among all the input features.

At the same time, we can also see that the two features 'month' and 'hour' were also very important: in Figure 15, they are among the top six features by importance. This means that changes in near-surface wind speed were related to the forecast month and the forecast hour. We input 'month' and 'hour' into GBDT as two categorical features, such that these two features could contribute to the GBDT result.
Although the near-surface wind speed features contributed most to the GBDT results, the 'other' component still comprised about half of the importance in Figure 15. This means the remaining features also contributed to the final post-processing results. Thus, even if the effect of a single feature is limited, a large number of features can still have a strong effect on the result.

Feature Importance Sensitivity Tests
In order to further investigate the effect of different features on the results, we set up sensitivity tests on the input features. We set up three sets of tests with different input features, keeping the GBDT model parameters of each sensitivity test the same. Table 13 shows the input features of the different sensitivity tests: Test 1 uses all the features as input, Test 2 uses the 'other' features as input, and Test 3 uses the near-surface wind speed, 'hour', and 'month' features as input.

Figure 16 shows the average RMSE and Figure 17 the average IA. We also performed significance tests between Tests 1-3; Table 14 shows the results. The p-values of all the significance tests are less than 0.01, meaning that all the results passed the 99% confidence level. From Figures 16 and 17, we can see that Test 1 has the smallest error and the highest IA, which means that we obtain the best post-processing wind speed when all the features are input into the GBDT model. Compared to Test 2, Test 3 has a lower RMSE and higher IA, so the near-surface wind speed, 'hour', and 'month' features have a greater impact on the post-processing results than the 'other' features. The significant improvement from Test 3 to Test 1 indicates that it is necessary to add the 'other' features to the input of the GBDT post-processing model.

GBDT Feature Split Value Distributions
In the feature splitting process for numerical features, each split has a split value. The distribution of feature split values depends on the distribution of the feature values and on the feature's contribution distribution.
If the feature split values are highly distributed over an interval, it means that: (1) the feature value has wide distribution in that interval; and (2) a change of feature value in that interval will have a great effect on the GBDT result.
In Section 3.4, we found that the WRF model's wind speed at 10 m had the highest importance for the 10 m wind speed GBDT output. We plot the distributions of the 10 m wind speed observations and the distributions of the 10 m WRF wind speed feature split values in Figure 18. For the wind speed observation distributions, we used the Weibull distribution function to fit the 10 m wind speed observations. Figure 18 shows the distributions for all the towers (towers 10001-10014). From Figure 18, we find that the distribution of the wind speed observations at 10 m is roughly similar to the distribution of the feature split values; this is because the distribution of the WRF output wind speed was similar to the wind speed observations. However, in the high-speed region (wind speed > 8 m/s), the Weibull distribution decreases significantly but the feature split value distribution remains high. These distributions indicate that a change of feature value in the high wind speed region still has an impact on the GBDT results. They also indicate that, if the WRF model simulates high wind speeds inaccurately, large errors will occur in the GBDT results. This suggests that, in the area we studied, improving the WRF model's simulation performance for high wind speeds can improve the overall wind speed prediction performance.

Conclusions
In this work, based on the WRF model, we conducted a one-year wind speed forecast for the coastal area of Jiangsu, China. According to the power grid's requirements for wind power forecasting in wind farms, we obtained 0-24 h, 24-48 h, and 48-72 h wind speed forecasts for wind power forecasting. Based on the NWP forecast results, we extracted multiple variables from the WRF output, at different height and pressure levels, for each moment. We built a GBDT regression model to correct the near-surface wind speed output of the WRF model and compared its performance with two other post-processing methods. Finally, we analyzed the feature importance in the GBDT model and determined which features had a greater impact on its results. Our main conclusions are as follows:
The Weibull distributions of the 10 m wind speed results show that, after post-processing, the wind speed distributions were closer to the observations, and the GBDT model performed best in fitting the near-surface wind.
The root mean square error (RMSE) of the wind speed forecast for the wind farm was approximately 2.7-3.5 m/s for different wind towers and at different levels. Wind speed errors cause greater errors in wind power forecasting, which affects the operations of wind farms and power grids. After GBDT model correction, the RMSE of the wind speed was reduced by 1-1.5 m/s on the test data sets. The IA and R results also show that GBDT can improve these indices: the IA was improved by 0.10-0.20 and R by 0.10-0.18. These improvements indicate that the GBDT model, using a large number of features as input, can reduce the wind speed forecast error of the wind farm.
In different months and at different times, the error of the WRF results varies greatly. In the months and times with large errors, the GBDT model achieves a larger error reduction, so that the error distribution of the final wind speed forecast results does not differ significantly across months and times.
By analyzing the feature importance of each GBDT model, we found that the distribution of feature importance is different for the correction models at different heights. At the same time, we found that two categorical features, 'month' and 'hour', also had a great impact on the result of the correction. This shows that WRF simulation errors have some characteristics in different months and at different hours of the day.
From the feature split distribution of the 10 m WRF output wind speed, it can be seen that the distribution of the feature split values and the Weibull distribution of the wind speed do not completely coincide; the distribution of the split values in the high wind speed region is greater than the Weibull distribution of the wind speed. This result shows that the decision trees branch frequently when the wind speed is high. Thus, high wind speeds simulated by the WRF model have a great impact on the GBDT results and can easily cause errors.
There have been many studies which used a numerical weather model to make weather forecasts and applied post-processing algorithms to correct the forecast. Such post-processing algorithms used only a few features as input, for two main reasons: (1) in some machine learning algorithms, such as ANN and SVM, too many input features will negatively affect the algorithm's performance through less-effective features; and (2) in some algorithms, if the number of input features is too large, the amount of computation increases quadratically or exponentially, making the model impossible to train in a short time. The GBDT algorithm we used does not have these problems: it can take a huge number of features as input, pick out the important features, and ignore the less important ones. Our results showed that even the most important feature (near-surface wind speed) takes up only a small portion of the total importance. The GBDT results show that smaller errors can be achieved with more features.
For categorical features, such as forecast month and forecast hour, it is very difficult to input them into post-processing algorithms such as neural networks and other statistical methods. However, with a decision tree, categorical features can be processed easily. Therefore, the GBDT algorithm also has an advantage over other algorithms in dealing with categorical features. Our results showed that the feature importance of forecast month and forecast hour was high; thus, categorical features such as forecast month and forecast hour improved our model's performance.
Another advantage of GBDT is that it can calculate feature importance by analyzing the information gain as the decision trees are split. Therefore, we can determine which features are important and which are less so.
In summary, a more effective method for wind speed forecasting of wind farms is to use a mesoscale meteorological model to forecast wind speed, followed by use of the GBDT algorithm to correct the model simulation results.