Short-Term Direct Probability Prediction Model of Wind Power Based on Improved Natural Gradient Boosting

Abstract: Wind energy has been widely used in renewable energy systems, but its strong randomness and fluctuation make the power output uncertain. A probabilistic prediction that provides uncertainty information is the key to solving this problem. In this paper, a short-term direct probabilistic prediction model of wind power is proposed. First, the initial data set is preprocessed by a box plot and gray correlation analysis. Then, a generalized method is proposed to calculate the natural gradient, and an improved natural gradient boosting (NGBoost) model is built on this method. Finally, blending fusion is used to enhance the learning effect of the improved NGBoost. The model is validated with measured data from the Dalian Tuoshan wind farm in China. The results show that, under the specified confidence level, compared with the single NGBoost metamodel and other short-term direct probability prediction models, the proposed model can reduce the average width of prediction intervals while ensuring a higher forecast area coverage probability, and can be used to build new efficient and intelligent energy power systems.


Introduction
With the low-carbon development of energy, the penetration rate of renewable energy represented by wind power has increased year by year [1]. Due to the strong randomness and fluctuation of wind energy, a point prediction alone cannot provide complete uncertainty information, and its results are biased [2], which challenges safe and stable operation [3]. In order to build an efficient and intelligent new energy power system, effectively adjust the scheduling plan, and expand the advantages of wind power bidding and grid connection, it is crucial to perform an accurate probability prediction of wind power [4].
There are two methods for calculating wind power [5]. One is to calculate with fixed formulas based on meteorological data and their internal relationships. Using numerical weather prediction (NWP) and the geographical factors of wind farms, the data are transformed into physical equations for prediction [6,7]. However, this method requires a lot of historical data and is more suitable for medium-term or long-term forecasting [8,9]. It is also not universal, because the formulas relating meteorological data to wind power differ between wind farms [10,11]. Therefore, in actual engineering applications, machine learning modeling methods are often used. These models are trained on measured data, and the nonlinear relationships between the input and output data are obtained through data mining, which ensures the universality and robustness of the wind power calculation [12,13]. Poncela et al. used the maximum likelihood estimation method to fit the wind power sequence for ultra-short-term wind power prediction [14]. Zhang et al. used a least squares wavelet support vector machine (LSSVM) to obtain prediction parameters [15]. Villacorta et al. used the autoregressive integrated moving average (ARIMA) time series forecasting model to predict wind power time series data [16]. These methods have the advantages of fast calculation speed, good estimation in nonlinear systems, and high accuracy [17,18]. However, they cannot provide complete uncertainty information.
Probabilistic prediction methods of wind power can usually be divided into two categories: indirect prediction and direct prediction. Indirect prediction builds on a point prediction model and calculates the probability distribution of the point prediction error to realize the probability prediction indirectly. The main models include neural networks [19], extreme learning machines [20], nonparametric kernel density estimation [21], and so forth. Although indirect prediction is widely used, it depends excessively on the accuracy of the point prediction model and requires large sample sizes. Direct prediction assumes the form of the probability distribution of wind power and establishes a machine learning model to solve for the corresponding parameters, which realizes dynamic estimation. The main models include quantile regression [22], sample entropy [23], sparse Bayesian learning machines [24], warped Gaussian process regression [25], etc. Although these methods implement probabilistic prediction directly, the learning model structure is complicated and training commonly takes a few hours.
In recent years, ensemble learning represented by boosting algorithms has received extensive attention in the wind power, photovoltaic, and load forecasting fields. Many researchers have improved traditional machine learning algorithms based on the AdaBoost algorithm, which significantly reduces the root mean square error of point prediction and demonstrates the advantages of ensemble learning for point prediction problems [26-29]. Xie et al. combined a gradient boosting decision tree (GBDT) with a Bayesian optimization algorithm to predict photovoltaic output and significantly shortened the running time of the forecasting model [30]. Liu et al. combined XGBoost and stacking fusion for short-term load forecasting, which significantly enhanced the model's ability to predict electricity load in different seasons [31,32]. Although the above boosting algorithms offer high solution accuracy, short running time, and strong generalization ability, they are only suitable for point prediction problems that care only about the expected value of the output. As a result, they cannot be applied to probability prediction problems that aim to obtain complete statistical information.
Aiming at the application defects of boosting algorithms in probabilistic predictions, the Stanford University team led by Andrew Y. Ng proposed the NGBoost model [33]. Although it extends boosting algorithms to probabilistic prediction, it still has the following shortcomings for the short-term direct probability prediction of wind power. (1) The model lacks data preprocessing, which weakens its generalization ability and robustness across different wind farms. (2) The calculation principle of the natural gradient is complicated, which makes practical engineering applications difficult. (3) The NGBoost metamodel is too elementary to guarantee the accuracy and sharpness of the probability prediction.
Based on the above analysis, a new improved methodology based on the NGBoost metamodel is proposed in this paper. This method is well suited to the short-term direct probability prediction of wind power. The establishment of the model includes the following steps.
(1) Preprocess the initial data set with the box plot to eliminate abnormal values, and use a gray correlation analysis to extract strongly correlated meteorological variables.
(2) Use the generalized natural gradient calculation method to improve the NGBoost metamodel.
(3) Use blending fusion to further strengthen the model learning effect.
The comparative analysis based on the measured data of the Dalian Tuoshan wind farm in China verifies the effectiveness and advantages of the model in this paper.

Establish Model
Suppose that the model data set $D$ contains $n_D$ samples and $m$ features, written as $D = \{(x_i, y_i)\}$ with $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}$. Here $x_i$ is the feature vector of the $i$th sample and $y_i$ is the corresponding label value (true value), where $i \in \{1, \dots, n_D\}$. Based on this hypothesis, the specific principles of the model are explained as follows.

Data Preprocessing
Under actual engineering conditions, the initial data set contains many outliers, which cause errors in the final prediction. Therefore, this paper first uses the box plot to eliminate outliers.
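The box-plot screening can be sketched as follows. This is a minimal illustration rather than the procedure used in the experiments: the whisker factor of 1.5, the function names `boxplot_limits` and `drop_outliers`, and the sample values are all assumptions made for the example.

```python
from statistics import quantiles


def boxplot_limits(values, whisker=1.5):
    """Return the (lower, upper) truncation limits of a box plot.

    whisker=1.5 is the conventional box-plot factor; it is an assumption here.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


def drop_outliers(rows, columns, whisker=1.5):
    """Keep the rows whose value in every checked column lies inside the limits."""
    limits = {c: boxplot_limits([r[c] for r in rows], whisker) for c in columns}
    return [
        r for r in rows
        if all(limits[c][0] <= r[c] <= limits[c][1] for c in columns)
    ]


# Hypothetical samples; the last row carries an abnormal wind speed.
rows = [
    {"wind_speed": 5.0, "temperature": -2.0},
    {"wind_speed": 5.5, "temperature": -1.0},
    {"wind_speed": 6.0, "temperature": 0.0},
    {"wind_speed": 6.5, "temperature": 1.0},
    {"wind_speed": 7.0, "temperature": 2.0},
    {"wind_speed": 40.0, "temperature": 0.5},
]
cleaned = drop_outliers(rows, ["wind_speed", "temperature"])
print(len(rows), "->", len(cleaned))
```

A whole sample is dropped as soon as one of its variables is abnormal, so the remaining rows stay aligned across all input variables and the wind power label.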
Wind power is related to meteorological variables such as temperature and wind speed [34,35]. However, the correlation between meteorological variables and wind power differs between wind farms. Hence, this paper employs a gray correlation analysis to calculate the degree of correlation and choose variables. A threshold φ is set, and the variables whose degree of association exceeds φ are used in the model. The specific steps are as follows [36].
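1. Normalize the time series of each variable. Taking the $k$th of $n$ meteorological variables as the comparison sequence $x_k(t)$ and the wind power sequence as the reference sequence $x_0(t)$, the absolute difference sequence $\Delta_k(t)$ between the two sequences is calculated by Equation (1), where $k \in \{1, \dots, n\}$.
$$\Delta_k(t) = \left| x_0(t) - x_k(t) \right| \quad (1)$$
2. Calculate the correlation coefficient $\xi_k(t)$:
$$\xi_k(t) = \frac{\min_k \min_t \Delta_k(t) + \rho \max_k \max_t \Delta_k(t)}{\Delta_k(t) + \rho \max_k \max_t \Delta_k(t)} \quad (2)$$
where $\min(\cdot)$ and $\max(\cdot)$ denote the minimum and maximum value of the sequence, and $\rho$ is the resolution coefficient.
3. Solve the degree of association $r_k$:
$$r_k = \frac{1}{T_n} \sum_{t=1}^{T_n} \xi_k(t) \quad (3)$$
where $T_n$ is the sequence length.
4. Set the threshold $\varphi$ and select the variables whose $r_k$ is over the threshold as a new data set.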
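The four steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: min-max normalization is assumed for step 1, the minimum and maximum differences are taken over all variables and time steps, and the function names and sample sequences are hypothetical.

```python
def min_max_normalize(seq):
    """Scale a sequence to [0, 1]; assumes the sequence is not constant."""
    lo, hi = min(seq), max(seq)
    return [(v - lo) / (hi - lo) for v in seq]


def gray_relational_degrees(reference, comparisons, rho=0.5):
    """Return the degree of association of each comparison sequence.

    reference   -- wind power sequence
    comparisons -- dict mapping a variable name to its sequence
    rho         -- resolution coefficient
    """
    ref = min_max_normalize(reference)
    # Step 1: absolute difference sequences.
    deltas = {
        name: [abs(r - c) for r, c in zip(ref, min_max_normalize(seq))]
        for name, seq in comparisons.items()
    }
    all_deltas = [d for ds in deltas.values() for d in ds]
    d_min, d_max = min(all_deltas), max(all_deltas)
    degrees = {}
    for name, ds in deltas.items():
        # Step 2: correlation coefficient at every time step.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in ds]
        # Step 3: degree of association is the mean coefficient.
        degrees[name] = sum(coeffs) / len(coeffs)
    return degrees


# Hypothetical sequences: wind speed tracks power, temperature does not.
power = [0.0, 1.0, 2.0, 3.0, 4.0]
variables = {
    "wind_speed": [1.0, 2.0, 3.0, 4.0, 5.0],
    "temperature": [4.0, 0.0, 3.0, 1.0, 2.0],
}
degrees = gray_relational_degrees(power, variables)
# Step 4: keep the variables whose degree exceeds the threshold.
threshold = 0.8
selected = [name for name, r in degrees.items() if r > threshold]
print(degrees, selected)
```

The sketch divides by the largest difference, so it fails if every comparison sequence coincides exactly with the reference after normalization; real data would need a guard for that degenerate case.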

Improved NGBoost
The key to NGBoost is the natural gradient. However, its related concepts and calculation are extremely complex, which hinders its popularization and application in actual engineering. Focusing on the process of solving the natural gradient, this paper adopts an improved approach, which establishes a connection between the general gradient and the natural gradient through Fisher information. The specific principles are as follows.
A scoring function $S(\theta, y_i)$ is established based on the Shannon information of $y_i$:
$$S(\theta, y_i) = -\log P_\theta(y_i) \quad (4)$$
where $P_\theta(y_i)$ is the probability value of $y_i$ and $\theta$ is the parameter vector of the predicted probability distribution.
Let $-\log P_\theta(y_i) = f(\theta)$ and perform a Taylor expansion of $f(\theta + d\theta')$. For convenience of calculation, the third-order and higher terms are discarded:
$$f(\theta + d\theta') \approx f(\theta) + \nabla f(\theta)^{T} d\theta' + \frac{1}{2}\, d\theta'^{T}\, \nabla^{2} f(\theta)\, d\theta' \quad (5)$$
where $d\theta'$ is the infinitesimal step vector by which $\theta$ moves along $\tilde{\nabla} S(\theta, y_i)$, and $\tilde{\nabla}$ denotes the natural gradient.
Convert the Euclidean space into a statistical manifold and treat Equation (5) in this Riemannian space, whose metric is
$$I(\theta) = \mathbb{E}_{y \sim P_\theta}\left[\nabla_\theta S(\theta, y)\, \nabla_\theta S(\theta, y)^{T}\right]$$
where $I(\theta)$ is the Riemann metric of the statistical manifold at $\theta$, which characterizes the Fisher information carried by $P_\theta(y_i)$. In this way, the natural gradient $\tilde{\nabla} S(\theta, y_i)$ can be calculated through the general gradient:
$$\tilde{\nabla} S(\theta, y_i) \propto I(\theta)^{-1}\, \nabla S(\theta, y_i) \quad (10)$$
An improved NGBoost model can be established based on Formula (10) by the following steps.
(1) Take $\theta_0$ as the initial parameter vector.
(2) Assuming that the calculation is carried out in the $m$th iteration, use the ordinary gradient to calculate $y_i$ and its corresponding parameter vector $\theta_i^{(m)}$.
(3) Calculate the natural gradient $\tilde{\nabla} S(\theta_i^{(m)}, y_i)$ and generate a new set of base learners along this natural gradient direction, so as to realize the parameter vector update.
The final prediction result can be expressed as Formula (11):
$$\theta = \theta_0 - \beta \sum_{m=1}^{M} \alpha\, B^{(m)}(x) \quad (11)$$
where $\alpha$ is the scale factor, $\beta$ is the unified learning rate, and $B^{(m)}$ is the unified representation of the base learner. For example, the calculation example of this paper predicts the probability of wind power. Although the overall change process of wind power is non-Gaussian, according to the literature [33], the value of each sample point can be assumed to follow a Gaussian distribution. Hence, $\theta$ can be expressed as $(\mu, \sigma)$, and the $m$th training stage of $\theta$ corresponds to two base learners $B_\mu^{(m)}$ and $B_\sigma^{(m)}$, that is, $B^{(m)} = (B_\mu^{(m)}, B_\sigma^{(m)})$.
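For the Gaussian case, the link between the ordinary gradient and the natural gradient can be checked numerically. The sketch below is an illustration under stated assumptions, not the implementation used in this paper: it parameterizes the distribution directly by (mu, sigma), uses the closed-form Fisher information of a Gaussian, which is diagonal in this parameterization, and applies one update directly to the parameters instead of fitting base learners to the natural gradient. The function names are hypothetical; the scale factor 0.5 and learning rate 0.01 are the values reported in the parameter settings.

```python
import math


def log_score(mu, sigma, y):
    """Shannon information -log P(y) of y under a Gaussian N(mu, sigma^2)."""
    return 0.5 * math.log(2.0 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)


def ordinary_gradient(mu, sigma, y):
    """Partial derivatives of the log score with respect to mu and sigma."""
    d_mu = -(y - mu) / sigma ** 2
    d_sigma = 1.0 / sigma - (y - mu) ** 2 / sigma ** 3
    return d_mu, d_sigma


def fisher_information(sigma):
    """Diagonal entries of the Gaussian Fisher information for (mu, sigma)."""
    return 1.0 / sigma ** 2, 2.0 / sigma ** 2


def natural_gradient(mu, sigma, y):
    """Ordinary gradient premultiplied by the inverse Fisher information."""
    g_mu, g_sigma = ordinary_gradient(mu, sigma, y)
    i_mu, i_sigma = fisher_information(sigma)
    return g_mu / i_mu, g_sigma / i_sigma


mu, sigma, y = 0.0, 2.0, 1.0
nat_mu, nat_sigma = natural_gradient(mu, sigma, y)

# One descent step along the natural gradient.
scale_factor, learning_rate = 0.5, 0.01
mu_next = mu - learning_rate * scale_factor * nat_mu
sigma_next = sigma - learning_rate * scale_factor * nat_sigma
print(nat_mu, nat_sigma, log_score(mu, sigma, y), log_score(mu_next, sigma_next, y))
```

Because the Fisher information is diagonal here, inverting it reduces to two divisions; a distribution with correlated parameters would require solving a small linear system instead.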

Blending Fusion
The fusion of the metamodel can not only strengthen the learning effect, but also avoid excessive redundancy in the overall model. In recent years, model fusion, especially stacking fusion [31,37], has been widely used in solving prediction problems. However, stacking fusion is too complicated, and it suffers from data traversal, in which the training data refer to global statistics during the training process, so it is not suitable for probability prediction. Therefore, blending fusion is adopted, since it is simple and overcomes the problem of data traversal [38]. The schematic diagram of blending fusion is shown in Figure 1.
All in all, the establishment process of the model proposed in this article can be summarized in the following three steps. First, the initial data set is preprocessed by a box plot and gray correlation analysis to detect outliers and screen feature variables. Then, the revised data set is passed to the improved NGBoost metamodel, which uses the generalized natural gradient calculation method proposed in this paper. Finally, blending fusion is used to enhance the learning effect of the improved NGBoost. The overall flow chart of establishing the model is shown in Figure 2.
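The hold-out idea behind blending fusion can be sketched as follows. This is a simplified illustration, not the fusion scheme of this paper: the base learners are single-feature least-squares lines standing in for the improved NGBoost metamodels, the second-level learner is an inverse-error weighted average fitted on the hold-out part, and all names and data are hypothetical. What it does preserve is the key property that the base learners never see the hold-out samples used to fit the second level.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b * x; a stand-in for a base learner."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_blend(features, targets, holdout_fraction=0.4):
    """Fit base learners on the first part and blending weights on the rest."""
    split = int(round(len(features) * (1 - holdout_fraction)))
    train_x, hold_x = features[:split], features[split:]
    train_y, hold_y = targets[:split], targets[split:]
    n_features = len(features[0])
    # Level 0: one base learner per feature, fitted on the training part only.
    base = [fit_line([row[j] for row in train_x], train_y) for j in range(n_features)]
    # Level 1: weights from the mean squared error on the hold-out part.
    mse = [
        sum((model(row[j]) - y) ** 2 for row, y in zip(hold_x, hold_y)) / len(hold_y)
        for j, model in enumerate(base)
    ]
    inverse = [1.0 / (e + 1e-12) for e in mse]
    weights = [v / sum(inverse) for v in inverse]

    def predict(row):
        return sum(w * model(row[j]) for j, (w, model) in enumerate(zip(weights, base)))

    return predict, weights


# Hypothetical data: the target depends on feature 0 only.
features = [[float(i), float(v)] for i, v in enumerate([3, 1, 4, 1, 5, 9, 2, 6, 5, 3])]
targets = [2.0 * row[0] + 1.0 for row in features]
predict, weights = fit_blend(features, targets)
print(weights, predict([10.0, 0.0]))
```

The default hold-out fraction of 0.4 mirrors the 60%/40% division of the original training set described in the calculation example.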

Evaluation Indicators
In order to objectively quantify the effectiveness and advantages of the model proposed in this paper, the forecast area coverage probability (denoted IF) and the proportion of average width of prediction interval (denoted IP) are used as basic indicators of the accuracy and sharpness of the prediction results, respectively. Because IF and IP conflict with each other, a composite score (denoted IC) is established as the final indicator [39]. The specific calculation methods of the above indicators are described as follows.

Forecast area coverage
The reliability of the model is quantified by introducing the IF to measure the accuracy of the probabilistic prediction results.This indicator is based on the number of actual values falling within the confidence interval.The larger the IF, the more accurate the model.
$$I_F = \frac{1}{N_t} \sum_{i=1}^{N_t} c_i \quad (12)$$
where $N_t$ is the number of predicted samples and $c_i$ is a Boolean mark of whether the $i$th sample falls within the confidence interval: samples that fall in the confidence interval are counted as 1 and those that do not are counted as 0.

Proportion of average width of prediction interval
IP is introduced to measure the sharpness of the probabilistic prediction results. It prevents the pure pursuit of IF from producing an excessively wide confidence interval, for which the prediction results lose their reference value. The larger the IP, the wider the confidence interval, the less sharp the prediction distribution, and the worse the prediction effect.
$$I_P = \frac{1}{N_t\, I_{P0}} \sum_{i=1}^{N_t} \left[ U_i - L_i \right] \quad (13)$$
where $I_{P0}$ is the width of the confidence interval under the initial parameters, and $U_i$ and $L_i$ are the upper and lower limits of the confidence interval corresponding to the $i$th prediction sample.
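The two basic indicators can be computed as in the following sketch. It is an illustration under stated assumptions: the prediction intervals are central intervals of a Gaussian predictive distribution, the initial width is taken as the range of the observed values, and the function names and sample values are hypothetical.

```python
from statistics import NormalDist


def gaussian_interval(mu, sigma, confidence=0.95):
    """Central confidence interval of a Gaussian predictive distribution."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mu - z * sigma, mu + z * sigma


def coverage_probability(actual, intervals):
    """IF: share of actual values that fall inside their interval."""
    hits = [1 if lo <= y <= hi else 0 for y, (lo, hi) in zip(actual, intervals)]
    return sum(hits) / len(hits)


def average_width_proportion(intervals, initial_width):
    """IP: mean interval width relative to the initial interval width."""
    return sum(hi - lo for lo, hi in intervals) / (len(intervals) * initial_width)


# Hypothetical actual values and predicted Gaussian parameters.
actual = [1.0, 2.0, 3.0, 10.0]
predicted_mu = [1.1, 1.8, 3.2, 4.0]
predicted_sigma = [0.5, 0.5, 0.5, 0.5]

intervals = [gaussian_interval(m, s) for m, s in zip(predicted_mu, predicted_sigma)]
initial_width = max(actual) - min(actual)  # assumed definition of the initial width
i_f = coverage_probability(actual, intervals)
i_p = average_width_proportion(intervals, initial_width)
print(i_f, i_p)
```

Widening every interval raises both values at once, which is why neither indicator is meaningful on its own and a composite score is needed.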

Overall score
IC is introduced to comprehensively evaluate IF and IP. The higher the IC, the better the overall performance of the model in narrowing the prediction intervals while ensuring accuracy.

Calculation Example and Model Parameter Description
The effectiveness of the improved NGBoost model proposed in this paper is analyzed using the actual supervisory control and data acquisition (SCADA) data of the Dalian Tuoshan wind farm in China in 2019 as a calculation example. The data sampling interval is 15 min, and 960 samples from 1 January to 10 January are taken to form the initial data set. The original data visualization is shown in Figure 3a, and the number of samples is discussed in Appendix A. The input meteorological variables are the wind direction (angle data, unit: °, measured at the hub center of the wind turbine), temperature (unit: °C), humidity (unit: %RH), air pressure (unit: Pa), and wind speed (unit: m/s). The output variable is the wind power (unit: MW). It can be seen from Figure 3a that the wind power itself and the related meteorological variables have strong randomness and volatility. The short-term direct probability prediction of wind power aims to extract this uncertainty information. The data preprocessing step proposed in this paper enhances the robustness of the data within the observed window and yields valid samples.
The initial data set was preprocessed according to the methods proposed in Section 2.1 of this paper. First, a box diagram of the input weather variables was drawn, as shown in Figure 3b. It can be seen from Figure 3b that the wind speed and temperature variables contained many abnormal values. After removing them, 902 valid samples remained. Then the correlation degree between each meteorological variable and wind power was calculated through a gray correlation analysis (the resolution coefficient takes its default value of 0.5), and the stacked histogram was drawn as shown in Figure 3c. It can be seen from Figure 3c that the meteorological variables strongly correlated with the wind power fluctuation sequence were wind speed, air pressure, humidity, and wind direction. Influenced by the local microclimate in the northeast of China, the correlation of temperature was the lowest. Under the constraint of a higher threshold (φ = 0.8 in this paper), the temperature variable was discarded.
The preprocessed data were normalized to eliminate the influence of dimensions on the calculation results. The data from day 1 to day 9 were set as the original training set. The experiment found (see Appendix B for experimental details and results analysis) that a higher proportion of the DA test set helps overcome model overfitting. Therefore, this paper assigned 60% of the original training set to the subtraining set DT and the remaining 40% to the test set DA. The wind power on day 10 was set as the original forecast set DP for short-term forecasting. V was set to 8.
In this paper, the classification and regression tree (CART) is selected as the base learner. Its basic structure and principle can be found in the literature [31]. The main parameter settings of the model are as follows. The maximum depth is 5 (see Appendix C for related discussion). The number of iterations M of the improved NGBoost is 400. M limits the total number of iterations to prevent training from falling into an infinite loop; in our experience, the model acquires good prediction results within 400 iterations. The scale factor is 0.5, which prevents the local approximation from moving far away from the current parameter position during the calculation, which could lead to training failure. The learning rate β is set with reference to traditional boosting algorithms, where it is generally 0.1 or 0.01. A smaller learning rate helps overcome model overfitting, so this paper sets it to 0.01. The initial parameters are the values that construct a rectangular area bounded by the upper and lower limits of the data set.
It should be noted that: (1) To meet the needs of graph visualization, the graphs in this paper were drawn at the 95% confidence level. (2) The model in this paper was built on same-period data to establish a complete and accurate mapping relationship between the input data and the output data. However, the prediction and comparison in this paper were based on historical data in order to simulate actual engineering conditions. Through the comparison between the forecast data and the actual data and the analysis of the indexes, the validity and advantages of the model proposed in this paper are further verified. (3) In Figure 3b, min and max represent the lower and upper limits of data truncation, which were calculated by Formula (15); Q1 and Q3 represent the lower and upper quartiles, respectively, and IQR = Q3 − Q1.

Effectiveness Analysis of Model Improvements
Different from the NGBoost metamodel proposed in the literature [33], the model proposed in this paper has the following innovations and improvements. (1) Data preprocessing is added. (2) Natural gradients are calculated through general gradients. (3) Blending fusion is used to strengthen the model training effect. In order to verify the effectiveness of these improvements, controlled-variable experiments were conducted. The visualization of the experimental results is shown in Figure 4. The prediction effect with and without data preprocessing is compared in Figure 4a. It can be seen from Figure 4a that when the data set is not preprocessed, the abnormal data and the weakly correlated input data represented by the temperature directly reduce the overall prediction level, and the confidence interval of the prediction under the same confidence level shifts and expands. This shows that weakly correlated data and outliers should not be used as training data for the model, and that the data preprocessing step added in this paper is reasonable and effective.
The fusion model comparison is shown in Figure 4b. It can be seen from Figure 4b that stacking fusion, which cannot overcome the problem of data traversal, is not suitable for the probability prediction problem: its prediction results have a large deviation. The blending fusion adopted in this paper exhibits better prediction performance in comparison.
The comparison of the model prediction effects between the improved NGBoost in this paper and the NGBoost metamodel proposed in [33] is shown in Figure 4c. It can be seen from Figure 4c that although the NGBoost proposed in [33] has a good probability prediction effect, the prediction effect of the model proposed in this paper is further improved. It is more suitable for practical engineering applications of the short-term direct probability prediction of wind power.
The natural gradient calculation method proposed in the literature [33] and the calculation method proposed in this paper are each used to calculate the example data. Based on the evaluation indicators proposed in this paper, the calculation results under different confidence levels are shown in Table 1.
It can be seen from Table 1 that the natural gradient calculation method proposed in this paper gives results similar to those of the method proposed in [33]. However, the method in this paper calculates the natural gradient through ordinary gradients, which simplifies the calculation process and is more suitable for practical promotion and application in engineering than direct calculation.
In order to further illustrate the advantages of the improved NGBoost (model 1) proposed in this paper, the kernel extreme learning machine model (model 2) proposed in [40] and the naive Bayes combination model (model 3) proposed in [41] are selected for comparison. The prediction results are evaluated according to the indicators proposed in Section 3. The comparison of the prediction confidence intervals of each model is shown in Figure 5. The index values of each model under different confidence levels are recorded in Table 2.
It can be seen from Figure 5 and Table 2 that as the confidence level increases, the IF of each model increases, and the IP also gradually increases. Under the same confidence level, the IF of model 1 is larger, indicating that there is a smaller deviation in the mean of the prediction results. Meanwhile, the IP of model 1 is smaller than that of models 2 and 3, indicating that its prediction intervals are narrower. All in all, the model proposed in this paper can reduce IP while ensuring a larger IF and obtain a higher IC. It exhibits a better performance in solving the probability prediction problem.

Appendix B
Overfitting refers to the phenomenon in which a model makes its hypothesis excessively strict in order to obtain a consistent hypothesis. Specifically, the model shows a very high overall score on the training set but a low overall score when predicting the test set. Overcoming overfitting is a core task in designing predictive models. Although the blending fusion adopted in this paper overcomes the problem of data traversal and has the advantages of simplicity, efficiency, and a higher overall score compared with the stacking model, it is still necessary to discuss how to overcome its overfitting.
For this reason, this paper analyzes the root cause of the overfitting of the blending fusion. The original training set is divided into the DT subtraining set and the DA test set according to a given proportion, and the experiment found that the proportion of the DA test set directly affects the degree of overfitting of the final model. By continuously adjusting the ratio of DT to DA and using the overall score proposed in this paper as the index, the score curve for different ratios between the training set and the test set is drawn, as shown in Figure A2. It can be seen from Figure A2 that as the proportion of the DA test set increases, the score of the training set does not change much, but the score of the test set increases significantly, showing that the degree of model overfitting gradually decreases. That is, a higher proportion of the DA test set is more conducive to overcoming model overfitting.

Appendix C
There are many parameters involved in the model training process. Among them, the CART depth must be specified, and it directly affects the training time and effect of the model. A larger value significantly extends the model training time, and a smaller value may cause a larger model error. Figure A3 compares Shannon's information and the training time when the CART is given different depths. It can be seen from Figure A3 that the model training time increases exponentially with the CART depth. However, a larger depth does not reduce the amount of Shannon's information and may even have a negative effect. The maximum depth in this paper is set to the value that yields the minimum Shannon information.

Figure 3. Data analysis charts. (a) Original data visualization; (b) box diagram of input weather variables; (c) stacked histogram of input weather variables.

Figure 4. Improved model validity comparison verification. (a) Comparison with and without data preprocessing; (b) fusion model comparison; (c) comparison of the models.

Figure A1. Curve of overall score for different numbers of input samples.

Figure A2. Score curve for different ratios.

Figure A3. Comparison of different CART depths.

Table 1. Model comparison data for different natural gradient calculation methods.

Table 2. Model comparison at different confidence levels.