A Novel Hybrid Machine Learning Model for Wind Speed Probabilistic Forecasting

Abstract: Accurately capturing wind-speed fluctuations and quantifying their uncertainties has important implications for energy planning and management. This paper proposes a novel hybrid machine learning model for the probabilistic prediction of wind speed. The model couples the light gradient boosting machine (LGB) model with the Gaussian process regression (GPR) model, where the LGB model provides high-precision deterministic wind-speed predictions and the GPR model provides reliable probabilistic predictions. The proposed model was applied to predict wind speeds for a real wind farm in the United States. Eight contrasting models are compared in terms of deterministic prediction and probabilistic prediction, respectively. The experimental results show that the LGB-GPR model improves the point-forecast accuracy (RMSE) by up to 20.0% and the probabilistic-forecast reliability (CRPS) by up to 21.5% compared to a single GPR model. This research is of great significance for improving the reliability of wind-speed probabilistic predictions and for the sustainable development of new energy.


Introduction
With the rapid development of science and technology industries, fossil fuel energy has been consumed in large quantities, which has led to a series of problems, such as the greenhouse effect, resource shortages, and environmental pollution [1,2]. To alleviate the existing energy crisis and resource-scheduling problems, attention has turned to the development of renewable and nonpolluting new energies. Wind energy is one of the most common sources of energy in nature and accounts for a large proportion of renewable-energy development [3,4]. Wind-speed prediction is also important for the wind-resistant design of bridges [5] and railway infrastructure [6]. However, the nonstationarity, nonlinearity, and intermittency of wind-energy resources lead to uneven wind-power output of wind turbines, which hinders grid security maintenance, power quality, and power scheduling and planning [7,8]. For wind-power generation systems, the difficulty and cost of storing electricity are much higher than those of generating it, and most generators operate in the form of direct power generation [9,10]. This is very different from traditional regulated energy sources, such as hydroelectricity [11,12]. Another difficulty with wind-power generation is the integration of wind power into the grid [13]. Integrating large amounts of intermittent energy into the grid leads to unbalanced and unstable power-frequency regulation ranges and peak values during periods of peak electricity consumption [14]. Therefore, finding an accurate and robust wind-speed prediction method has long been an important research direction for related scholars.
However, most existing studies focus on improving the accuracy of wind-speed predictions while ignoring the quantification of uncertainty in wind-speed series. Accurately capturing the probability distribution of wind-speed sequences can provide richer decision-making information for dispatchers, which is conducive to efficient planning and rational allocation of resources. Therefore, it is necessary to carry out wind-speed probability prediction research. This paper develops a novel hybrid model, LGB-GPR, which combines the light gradient boosting machine (LGB) and Gaussian process regression (GPR) for wind-speed forecasting and for quantifying forecast uncertainty. The LGB model can provide accurate deterministic wind-speed predictions but cannot quantify wind-speed uncertainty. In contrast, the GPR model can quantify the uncertainty of wind speed, but its prediction accuracy is poor. Fusing the two models exploits the advantages of both. The innovations and main contributions of this paper are summarized as follows: (1) A new machine learning method named LGB is used to predict wind-speed sequences, which can provide accurate wind-speed prediction results. (2) A novel hybrid model combining LGB and GPR is proposed for wind-speed probability prediction. (3) The proposed hybrid model is applied to a real case in the United States and compared with eight contrasting models.
The rest of the paper is organized as follows: Section 2 briefly describes LGB, GPR, and the LGB-GPR model. The evaluation metrics are given in Section 3. The data and experimental setup are presented in Section 4. Comparative results and discussion are presented in Section 5. Finally, conclusions and future research directions are given in Section 6.

Methodology
To solve the wind-speed probability prediction problem, this paper proposes a new hybrid model, LGB-GPR. In this section, we first describe the formulation and principles of the LGB model and the GPR model, respectively. Then, how to couple the LGB model and the GPR model to obtain reliable wind-speed probabilistic prediction results is described in detail.

Light Gradient Boosting Machine
Light gradient boosting machine (LGB) is an improved gradient boosting decision tree (GBDT) model proposed by Microsoft. It addresses the slow training speed and large memory usage of the traditional GBDT model on data with large volume and high feature dimension, while at the same time achieving higher accuracy.
LGB is a boosted tree model with decision trees as base functions. The final prediction is obtained by linearly adding the predictions of multiple decision trees.

Model Formulation
Given a dataset $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$, $x_i \in X$, $y_i \in Y$, where $x_i$ is an n-dimensional feature vector, $X$ is the input space, $y_i$ is the one-dimensional label, $Y$ is the output space, and $N$ is the number of samples, the model can be expressed as follows:

$$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$$

where $T(x; \Theta_m)$ represents a single binary regression tree, $\Theta_m$ is the parameter of the tree, and $M$ is the number of trees. If the input space $X$ is divided into $J$ independent regions $R_1, R_2, \ldots, R_J$, with a fixed output value $c_j$ corresponding to each region, the regression tree can be expressed as:

$$T(x; \Theta) = \sum_{j=1}^{J} c_j I(x \in R_j)$$

where $\Theta = \{(R_1, c_1), (R_2, c_2), \ldots, (R_J, c_J)\}$ denotes the divided regions of the tree and the output values on the corresponding regions, and $J$ is the complexity of the tree, that is, the number of its leaf nodes.

Model Optimization Mechanism
Compared with the traditional GBDT model, the optimizations of the LGB model mainly include the following points:
(1) Gradient-based One-Side Sampling (GOSS): Without changing the distribution of the sample data, samples with small gradients are eliminated, and only the remaining samples with larger gradients are retained to estimate the information gain, thereby reducing the number of training samples. Since samples with smaller gradients contribute little to the information gain, GOSS makes the LGB model faster while preserving accuracy.
(2) Exclusive Feature Bundling (EFB): In practical applications, high-dimensional data is often sparse. The LGB model adopts the histogram algorithm to merge mutually exclusive features after discretizing continuous features, forming new features that reduce the feature dimension, reduce memory usage, and speed up model training.
(3) Leaf-wise Tree Growth with Depth Limit: The level-wise tree growth adopted by most decision tree models is replaced with a leaf-wise growth strategy. Instead of splitting every leaf node at a given level, only the leaf node with the largest split gain is split, which reduces unnecessary overhead. With the same number of splits, the leaf-wise strategy is more accurate than the level-wise one. LGB avoids model overfitting by setting a maximum tree depth parameter. The growth diagram of the decision tree is shown in Figure 1; a parameter-level sketch of these mechanisms is given below.
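The following minimal lightgbm configuration sketch shows how the three mechanisms above map onto concrete parameters; all values are illustrative placeholders, not the tuned configuration of this paper (see Table 1 for the tuned hyperparameters).

```python
import lightgbm as lgb

# Illustrative parameters only; not the values tuned in this study.
params = {
    "objective": "regression",
    "boosting_type": "goss",  # (1) Gradient-based One-Side Sampling
    "max_bin": 255,           # (2) histogram discretization of continuous features
    "num_leaves": 31,         # (3) leaf-wise growth: leaves per tree
    "max_depth": 8,           # (3) depth limit that curbs overfitting
    "learning_rate": 0.05,
    "n_estimators": 500,      # number of boosted trees M
}
model = lgb.LGBMRegressor(**params)
```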

Model Implementation Process
The flow of the complete LGB model is as follows:
(1) Initialize by finding the constant value that minimizes the overall loss function:

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c)$$

where $L(\cdot)$ is the loss function. At this point, the model is a tree with only one root node.
(2) For $m = 1, 2, \ldots, M$:

(a) For $i = 1, 2, \ldots, N$, estimate the residual by the negative gradient of the loss function:

$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$

(b) Fit a regression tree to $r_{mi}$ to obtain the leaf-node regions $R_{mj}$ of the $m$-th tree, $j = 1, 2, \ldots, J$.

(c) For $j = 1, 2, \ldots, J$, estimate the value of each leaf-node region using a linear search that minimizes the loss function:

$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L\left(y_i, f_{m-1}(x_i) + c\right)$$

(d) Update the model iteratively with the following formula:

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I\left(x \in R_{mj}\right)$$

(3) Obtain the final model:

$$\hat{f}(x) = f_M(x) = \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I\left(x \in R_{mj}\right)$$
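As a concrete illustration of steps (1)-(3), the following self-contained Python sketch implements the boosting loop under a squared-error loss, for which the negative gradient is simply the residual; it is a didactic sketch, not lightgbm's optimized implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbdt(X, y, M=100, lr=0.1, max_leaf_nodes=31):
    # (1) initialize with the constant minimizing squared-error loss: the mean
    f0 = float(y.mean())
    pred = np.full(len(y), f0)
    trees = []
    for m in range(M):
        r = y - pred                       # (a) negative gradient = residual
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)
        tree.fit(X, r)                     # (b) fit tree, leaf regions R_mj
        # (c)+(d): for squared error, the optimal leaf value c_mj is the mean
        # residual in each region, which is exactly the tree's leaf prediction;
        # it is applied with a shrinkage factor lr
        pred = pred + lr * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gbdt(f0, trees, X, lr=0.1):
    # (3) final model: initial constant plus the sum of all tree contributions
    return f0 + lr * sum(t.predict(X) for t in trees)
```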

Gaussian Process Regression
Gaussian process regression is a nonparametric model that places a Gaussian process prior over functions to perform regression analysis on data. The basic principle is as follows: assuming that the learning samples follow a Gaussian distribution, the posterior distribution of the random variable is estimated from the assumed Gaussian prior via the Bayesian principle, and the model parameters are estimated by the maximum-likelihood method or Monte Carlo sampling. A Gaussian process regression model is then constructed to obtain probabilistic predictions that follow a Gaussian distribution. The schematic diagram is shown in Figure 2.

In Figure 2, $X = [x_1, x_2, \ldots, x_n]$ represents the n-dimensional input feature vector, and $Y = [y_1, y_2, \ldots, y_n]$ represents the predictor variable. Suppose $x$ and $y$ form the following regression model:

$$y = f(x) + \varepsilon$$

where $\varepsilon$ is the noise and obeys a normal distribution with mean 0 and variance $\sigma_n^2$, and $n$ is the input feature dimension. The prior distribution of $y_{\text{train}}$ is:

$$y_{\text{train}} \sim N\left(0, \Sigma(X_{\text{train}}, X_{\text{train}}) + \sigma_n^2 I_n\right)$$

where $\Sigma(X_{\text{train}}, X_{\text{train}})$ is an $n \times n$ symmetric positive-definite covariance matrix and $I_n$ is the $n$-dimensional identity matrix. The detailed expression of $\Sigma(X_{\text{train}}, X_{\text{train}})$ is as follows:

$$\Sigma(X_{\text{train}}, X_{\text{train}}) = \begin{bmatrix} \text{cov}_{1,1} & \cdots & \text{cov}_{1,n} \\ \vdots & \ddots & \vdots \\ \text{cov}_{n,1} & \cdots & \text{cov}_{n,n} \end{bmatrix}$$

where $\text{cov}_{i,j}$ represents the covariance between feature $i$ and feature $j$. A Gaussian process kernel function $\kappa$ is introduced to model the covariance between the feature dimensions, $\Sigma(X_{\text{train}}, X_{\text{train}}) = (\kappa_{ij})$. In this paper, the radial basis kernel function is used:

$$\kappa(x_i, x_j) = \sigma^2 \exp\left(-\frac{1}{2}(x_i - x_j)^{T} M (x_i - x_j)\right)$$

where $\sigma$ is the hyperparameter of the radial basis kernel, and $M$ is the matrix that characterizes the anisotropy. The joint Gaussian distribution of $y_{\text{train}}$ and $y_{\text{test}}$ is as follows:

$$\begin{bmatrix} y_{\text{train}} \\ y_{\text{test}} \end{bmatrix} \sim N\left(0, \begin{bmatrix} \Sigma(X_{\text{train}}, X_{\text{train}}) + \sigma_n^2 I_n & \Sigma(X_{\text{train}}, X_{\text{test}}) \\ \Sigma(X_{\text{test}}, X_{\text{train}}) & \Sigma(X_{\text{test}}, X_{\text{test}}) \end{bmatrix}\right)$$

where $\Sigma(X_{\text{train}}, X_{\text{test}}) = \Sigma(X_{\text{test}}, X_{\text{train}})^{T}$ is the covariance matrix between the training-set feature input $X_{\text{train}}$ and the test-set feature input $X_{\text{test}}$, and $\Sigma(X_{\text{test}}, X_{\text{test}})$ is the internal covariance matrix of the test-set feature input.
The posterior distribution of the predicted value $y_{\text{test}}$ of the test set can be obtained by Bayesian inference:

$$\overline{y}_{\text{test}} = \Sigma(X_{\text{test}}, X_{\text{train}})\left[\Sigma(X_{\text{train}}, X_{\text{train}}) + \sigma_n^2 I_n\right]^{-1} y_{\text{train}}$$

$$\sigma^2_{y_{\text{test}}} = \Sigma(X_{\text{test}}, X_{\text{test}}) - \Sigma(X_{\text{test}}, X_{\text{train}})\left[\Sigma(X_{\text{train}}, X_{\text{train}}) + \sigma_n^2 I_n\right]^{-1} \Sigma(X_{\text{train}}, X_{\text{test}})$$

where $\overline{y}_{\text{test}}$ is the predicted mean of the test set and $\sigma^2_{y_{\text{test}}}$ is the variance of the Gaussian distribution.
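A minimal NumPy sketch of these posterior equations is given below, assuming an isotropic radial basis kernel (i.e., taking the anisotropy matrix $M$ as a scaled identity); it is illustrative only, not the GPy implementation used in this paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, length=1.0):
    # kappa(x_i, x_j) = sigma_f^2 * exp(-||x_i - x_j||^2 / (2 * length^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-0.5 * d2 / length ** 2)

def gpr_posterior(X_tr, y_tr, X_te, noise_var=0.1):
    K = rbf_kernel(X_tr, X_tr) + noise_var * np.eye(len(X_tr))
    K_s = rbf_kernel(X_tr, X_te)      # Sigma(X_train, X_test)
    K_ss = rbf_kernel(X_te, X_te)     # Sigma(X_test, X_test)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_tr       # posterior predictive mean
    cov = K_ss - K_s.T @ K_inv @ K_s  # posterior predictive covariance
    return mean, np.diag(cov)         # mean and per-point variance
```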

LGB-GPR
On the basis of the deterministic forecast results produced by the LGB model, the GPR method is applied to obtain the LGB-GPR model, which provides both interval forecasts and probabilistic forecasts. GPR is implemented with the GPy 1.9.9 framework in Python, and LGB with the lightgbm 3.3.1 framework in Python. All models were run on an Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz. The flowchart of LGB-GPR prediction is presented in Figure 3. In Figure 3, $X^{ta}_1, X^{ta}_2, \ldots, X^{ta}_{Ta}$ represents the original training-set feature input, and $X^{te}_1, X^{te}_2, \ldots, X^{te}_{Te}$ represents the original test-set feature input. $y^{ta}_{1,1}, y^{ta}_{1,2}, \ldots, y^{ta}_{1,Ta}$ and $y^{te}_{1,1}, y^{te}_{1,2}, \ldots, y^{te}_{1,Te}$ represent the training-set and test-set results predicted by the trained LGB model, respectively. $y^{ta}_1, y^{ta}_2, \ldots, y^{ta}_{Ta}$ represents the training-set observations, $y^{te}_1, y^{te}_2, \ldots, y^{te}_{Te}$ represents the test-set observations, and $y^{te}_{2,1}, y^{te}_{2,2}, \ldots, y^{te}_{2,Te}$ represents the test-set results predicted by the GPR model.
The prediction steps based on the LGB-GPR model are as follows (a code sketch follows the list):
Step 1: Train the LGB model with $X^{ta}_1, X^{ta}_2, \ldots, X^{ta}_{Ta}$ and $y^{ta}_1, y^{ta}_2, \ldots, y^{ta}_{Ta}$ as features and labels, respectively.
Step 2: Feed the training-set and test-set features into the trained LGB model to obtain the deterministic predictions $y^{ta}_{1,1}, y^{ta}_{1,2}, \ldots, y^{ta}_{1,Ta}$ and $y^{te}_{1,1}, y^{te}_{1,2}, \ldots, y^{te}_{1,Te}$.
Step 3: Train the GPR model with the LGB training-set predictions as features and the training-set observations $y^{ta}_1, y^{ta}_2, \ldots, y^{ta}_{Ta}$ as labels.
Step 4: Feed the LGB test-set predictions into the trained GPR model to obtain the probabilistic predictions $y^{te}_{2,1}, y^{te}_{2,2}, \ldots, y^{te}_{2,Te}$.
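The following sketch assembles the four steps using the lightgbm and GPy frameworks named above; the synthetic arrays and hyperparameter values are placeholders, not the tuned settings used in the experiments.

```python
import numpy as np
import GPy
import lightgbm as lgb

# Synthetic stand-ins for the wind-speed features and labels (assumed shapes).
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(800, 12)), rng.normal(size=(200, 12))
y_train = rng.normal(size=800)

# Step 1: train LGB on the original features and labels.
lgb_model = lgb.LGBMRegressor(n_estimators=500, num_leaves=31)
lgb_model.fit(X_train, y_train)

# Step 2: LGB deterministic predictions for the training and test sets.
y1_train = lgb_model.predict(X_train).reshape(-1, 1)
y1_test = lgb_model.predict(X_test).reshape(-1, 1)

# Step 3: train GPR with LGB training predictions as features and the
# observations as labels (RBF kernel, maximum-likelihood hyperparameters).
gpr = GPy.models.GPRegression(y1_train, y_train.reshape(-1, 1),
                              GPy.kern.RBF(input_dim=1))
gpr.optimize()

# Step 4: Gaussian probabilistic forecast for the test set.
mu, var = gpr.predict(y1_test)  # predictive mean and variance
```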

Deterministic Forecasting Evaluation Metrics
The coefficient of determination ($R^2$), root mean square error (RMSE), and mean absolute percentage error (MAPE) are employed to evaluate the deterministic forecasting results:

$$R^2 = 1 - \frac{\sum_{i=1}^{m}(Y_i - y_i)^2}{\sum_{i=1}^{m}\left(Y_i - \overline{Y}\right)^2}$$

$$\text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - Y_i)^2}$$

$$\text{MAPE} = \frac{1}{m}\sum_{i=1}^{m}\left|\frac{y_i - Y_i}{Y_i}\right|$$

where $y_i$ and $Y_i$ are the $i$-th prediction and observation, respectively, and $m$ is the number of validation samples. The smaller the MAPE or RMSE, the better the prediction results; the closer $R^2$ is to 1, the better the prediction results.
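For reference, the three metrics follow directly from their definitions; the NumPy transcription below is illustrative, and the function names are assumptions.

```python
import numpy as np

def r2(y_pred, y_obs):
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - np.mean(y_obs)) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse(y_pred, y_obs):
    return np.sqrt(np.mean((y_pred - y_obs) ** 2))

def mape(y_pred, y_obs):
    return np.mean(np.abs((y_pred - y_obs) / y_obs))
```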

Probabilistic Forecasting Evaluation Metric
The interval coverage probability coefficient (ICPC) and continuous ranked probability score (CRPS) are employed to evaluate the probabilistic forecasting results:

$$\text{ICPC} = \frac{1}{\overline{\alpha}} \cdot \frac{1}{T} \sum_{t=1}^{T} I\left(L^{\alpha}_t \le y_t \le U^{\alpha}_t\right)$$

$$\text{CRPS} = \frac{1}{T} \sum_{t=1}^{T} \int_{-\infty}^{+\infty} \left[F_t(x) - H(x - y_t)\right]^2 dx$$

where $I(\cdot)$ indicates the 0-1 function, which outputs 1 when the condition is met and 0 otherwise; $L^{\alpha}_t$ and $U^{\alpha}_t$ indicate the lower and upper boundaries of the prediction interval at period $t$ under confidence degree $\alpha$, respectively; $\overline{\alpha}$ indicates the mean of the confidence degree; $F(\cdot)$ indicates the cumulative distribution function; and $H(\cdot)$ indicates the Heaviside function. The closer the ICPC is to 1, the better the prediction interval. The smaller the CRPS, the better the uncertainty quantification results.
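Because each GPR forecast is Gaussian, the CRPS integral above admits a well-known closed form; the sketch below uses it to average the score over a test series (function and variable names are illustrative).

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, y):
    # Closed form for a Gaussian forecast N(mu, sigma^2) and observation y:
    # CRPS = sigma * [z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi)], z = (y - mu)/sigma
    z = (y - mu) / sigma
    per_step = sigma * (z * (2 * norm.cdf(z) - 1)
                        + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return np.mean(per_step)  # average over the test series
```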

Case Introduction
To test the proposed model, a real US wind farm located at longitude 103° W and latitude 38° N was used for the case study. Hourly meteorological data from 1 January 2013 to 1 January 2014, including wind speed, wind direction, temperature, dew point, relative humidity, and precipitable amount, were collected as the research dataset. In this study, the historical meteorological variables of the previous 3 h were used to predict the wind speed of the next 1 h. Furthermore, to prevent overfitting, a time-series sliding-window method was used to partition the dataset. The sliding window covers 80% of the original data, and the window slides backwards by 10% each time. Within a sliding window, the data is split into training and test sets with a ratio of 6:2. A schematic diagram of the partition of the dataset is shown in Figure 4.
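A minimal sketch of this partition scheme is given below; the 75% train fraction follows from the 6:2 split, and the generator interface is an assumption for illustration.

```python
import numpy as np

def sliding_windows(series, window_frac=0.8, step_frac=0.1, train_frac=0.75):
    # Windows cover 80% of the series and slide by 10% each time;
    # each window is split 6:2 (i.e., 75% / 25%) into train and test sets.
    n = len(series)
    w, s = int(n * window_frac), int(n * step_frac)
    for start in range(0, n - w + 1, s):
        window = series[start:start + w]
        split = int(len(window) * train_frac)
        yield window[:split], window[split:]
```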

Data Normalization
To eliminate the influence of unit and scale differences between the meteorological variables, normalized data were used for all calculations in the study. The normalization formula is as follows:

$$x_{\text{norm}} = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the feature, $x_{\text{norm}}$ is the normalized value, and $x_i$ is the $i$-th observation. The normalized values lie in the range [0, 1].
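In code this is a one-line transformation; the helper below is a minimal NumPy sketch.

```python
import numpy as np

def min_max_normalize(x):
    # Maps a feature vector into [0, 1] using its own min and max.
    return (x - np.min(x)) / (np.max(x) - np.min(x))
```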

Feature Selection
This study involves a large amount of meteorological feature data. To reduce the computational complexity of the model and the influence of redundant features, this study uses the maximal information coefficient (MIC) to initially screen the feature vectors. The expression of MIC is as follows:

$$\text{MIC}(D) = \max_{xy < B(n)} \frac{\max_{G} I(D|G)}{\log_2 \min(x, y)}$$

where $D$ is a set of ordered pairs, $G$ is an $x$-by-$y$ grid partition, $D|G$ represents the probability distribution of the data $D$ on the grid $G$, $I(D|G)$ is the information coefficient, and $B(n)$ bounds the grid size. The range of MIC values is [0, 1], and the closer the value is to 1, the stronger the correlation. In this paper, the correlation-coefficient thresholds for the meteorological features are set to 0.1 and 0.2. Figure 5 shows a schematic diagram of the feature-selection results; the gray part is the high-correlation interval, and the feature vectors located in it are selected. After feature selection, feature sequences such as dew point, relative humidity, wind direction, and wind speed are used to train the model.
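One possible implementation of this screening uses the minepy library (the paper does not state which implementation was used); the threshold and MINE parameters below are illustrative.

```python
import numpy as np
from minepy import MINE

def select_features(X, y, names, threshold=0.2):
    # Keep the features whose MIC with the target exceeds the threshold.
    mine = MINE(alpha=0.6, c=15)  # default MINE estimator parameters
    kept = []
    for j, name in enumerate(names):
        mine.compute_score(X[:, j], y)
        if mine.mic() >= threshold:
            kept.append(name)
    return kept
```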

Model Selection and Hyperparameter Optimization
To verify the superiority of the proposed model, eight models, including linear regression (LR), random forest (RF), SVR, GPR, ANN, LSTM, LGB, and LGB-GPR, were used for comparison. Among them, LR, RF, SVR, ANN, LSTM, and LGB are point prediction models, from which only point prediction results can be obtained, while GPR and LGB-GPR are probabilistic prediction models, from which probabilistic prediction results can be obtained. Furthermore, to improve the performance of the models, all models in this study adopted a Bayesian optimization algorithm to optimize the hyperparameters; one possible setup is sketched below. Table 1 shows the detailed hyperparameters of the eight models. The following sections perform a statistical analysis of the performance of each model on multiple datasets.
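As one possible realization (the paper does not name the optimization library), the Bayesian hyperparameter search for the LGB model could be set up with scikit-optimize; all search ranges and data below are illustrative, not the authors' settings.

```python
import numpy as np
import lightgbm as lgb
from skopt import BayesSearchCV

# Synthetic stand-ins for the training arrays (assumed shapes).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(800, 12)), rng.normal(size=800)

search = BayesSearchCV(
    lgb.LGBMRegressor(),
    {
        "num_leaves": (15, 127),                      # leaf-wise complexity
        "max_depth": (3, 12),                         # depth limit
        "learning_rate": (1e-3, 0.3, "log-uniform"),
        "n_estimators": (100, 1000),
    },
    n_iter=20,  # number of Bayesian-optimization evaluations
    cv=3,
)
search.fit(X_train, y_train)
print(search.best_params_)
```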

Deterministic Prediction Result Evaluation
In this section, three indicators, R 2 , RMSE, and MAPE, were used to evaluate the accuracy of the model's deterministic forecast results. The scoring results of the three metrics for different models on different datasets are shown in Tables 2-4. The best scores in the table are highlighted in bold.
It can be seen from Table 2 that the average $R^2$ value of LGB reaches 0.958, which is significantly higher than ANN's 0.94, RF's 0.941, and LSTM's 0.953. On the second dataset, LGB achieved the highest $R^2$ value of 0.963, higher than, for example, SVR's 0.878 and GPR's 0.936. The above results show that the LGB model has higher $R^2$ values than the other models on all three datasets. As can be seen from Table 3, compared with the LGB model, LGB-GPR improves the accuracy on the three datasets by 1.6%, 1.7%, and 1.8%, respectively. Compared with the GPR model, LGB-GPR improves the accuracy by 24.9%, 24.1%, and 6.9% on the three datasets, respectively. These results show that combining the LGB and GPR models effectively improves the prediction accuracy. In Table 4, the performance of LGB is second only to LGB-GPR, achieving MAPE scores of 0.156, 0.130, and 0.119 on the three datasets. This shows that the LGB model itself has excellent wind-speed prediction performance, and the combination with GPR further improves its prediction accuracy. Figure 6 shows the prediction results of each comparison model on the three datasets for typical time periods. It can be seen from the figure that the LGB-GPR model effectively fits the fluctuation trend of future wind speed on all three datasets, which once again illustrates the superiority of LGB-GPR. Among all models, the predictions of the SVR model deviate significantly from the actual wind-speed values, which is consistent with the results in Tables 2-4. In conclusion, the deterministic prediction score of the LGB-GPR model surpasses all the comparison models, which indicates that the LGB-GPR model can provide high-accuracy deterministic wind-speed prediction results.

Probability Prediction Result Evaluation
In this section, two indicators, ICPC and CRPS, were used to evaluate the reliability of the model's probabilistic forecast results. The scoring results of the two metrics for the different models on the different datasets are shown in Tables 5 and 6, with the best scores highlighted in bold. The ICPC indicator measures the appropriateness of the prediction interval of a probabilistic forecasting model. It can be seen from Table 5 that the ICPC values of the LGB-GPR model on the three datasets are 8.9%, 2.0%, and 23.9% higher than those of the GPR model, respectively. This shows that the probability prediction interval of the LGB-GPR model is more reasonable than that of the GPR model and can better describe the uncertainty of the wind-speed prediction within a reasonable range. The CRPS index comprehensively measures the accuracy and reliability of a probabilistic prediction model. It can be seen from Table 6 that the CRPS values of the LGB-GPR model on the three datasets are lower than those of the GPR model, with improvements of 26%, 24.9%, and 10.3%, respectively. This result verifies the superiority of the proposed LGB-GPR model in the comprehensive performance of probabilistic forecasting. Figure 7 shows the probabilistic prediction results of the probabilistic models on the three datasets. It can be seen from the figure that the probability prediction interval of the GPR model is significantly wider than that of the LGB-GPR model, which indicates that the GPR model tends to predict with greater uncertainty. The narrower prediction interval of the LGB-GPR model indicates that it quantifies the uncertainty of wind speed more reliably. To sum up, the probabilistic prediction results of LGB-GPR are better than those of the comparison model, GPR, which indicates that the LGB-GPR model can provide reliable probabilistic prediction results.

Conclusions
In this paper, the GPR model is introduced on the basis of the LGB model, and a wind-speed probability prediction model based on LGB-GPR is proposed. The LGB model provides accurate deterministic wind-speed predictions, and the GPR model provides reliable probabilistic predictions; combining the two exploits their respective advantages. This paper verifies the superiority of the LGB-GPR model in terms of both deterministic and probabilistic evaluation indices. The results show that the LGB-GPR model outperforms all contrasting models. The LGB-GPR model can not only provide high-precision deterministic forecasts, but also effectively quantify the uncertainty in wind-speed forecasting, providing wind-farm dispatchers with rich decision-making information.