Ensemble Recurrent Neural Network Based Probabilistic Wind Speed Forecasting Approach

Wind energy is a commonly utilized renewable energy source, due to its merits of extensive distribution and rich reserves. However, as wind speed fluctuates violently and uncertainly at all times, wind power integration may affect the security and stability of power system. In this study, we propose an ensemble model for probabilistic wind speed forecasting. It consists of wavelet threshold denoising (WTD), recurrent neural network (RNN) and adaptive neuro fuzzy inference system (ANFIS). Firstly, WTD smooths the wind speed series in order to better capture its variation trend. Secondly, RNNs with different architectures are trained on the denoising datasets, operating as sub-models for point wind speed forecasting. Thirdly, ANFIS is innovatively established as the top layer of the entire ensemble model to compute the final point prediction result, in order to take full advantages of a limited number of deep-learning-based sub-models. Lastly, variances are obtained from sub-models and then prediction intervals of probabilistic forecasting can be calculated, where the variances inventively consist of modeling and forecasting uncertainties. The proposed ensemble model is established and verified on less than one-hour-ahead ultra-short-term wind speed forecasting. We compare it with other soft computing models. The results indicate the feasibility and superiority of the proposed model in both point and probabilistic wind speed forecasting.


Introduction
The demand of renewable energy application is growing stronger over the recent years, in response to increasingly high energy consumption and serious environment pollutions [1].The global renewable energy capacity has exceeded 1800 GW by the end of 2015 [2,3].Among those energy sources, wind power is a typical kind, which is widely distributed and easy to access and it has become one of the fastest developing sources.Nevertheless, the chaotic nature of wind speed is unavoidable, which restricts the application of in-grid wind power.As wind fluctuates arbitrarily and uncertainly, wind power integration will challenge the security and stability of power system operations.Therefore, an accurate wind speed forecasting is expected for the development of large-scale wind power utilization, which can benefit wind power integrated system.There are customarily three major methodologies for wind energy forecasting, that is, physical, statistical and soft computing methodologies [4][5][6].Firstly, physical models are established based on numerous physical variables, including geographical and meteorological factors.As the formation of wind is influenced by many elements, for example, surface temperature and sunshine duration, physical models specialize in long-term forecasting and trend forecasting [7,8].However, physical methodologies usually take a great deal of computation time due to the large number of inputs and their accuracies depend on the results from numerical weather prediction (NWP).Secondly, statistical models are aimed at developing relationships between the historical (input) and future (output) wind energy data.Auto regressive moving average (ARMA), auto regressive integrated moving average (ARIMA) and their hybrid models are most frequently utilized in statistical prediction methodologies [9][10][11][12].Lastly, soft computing methodologies are most widely used in renewable energy forecasting [13].Among those methods, artificial neural network (ANN) is a typical soft computing model.In Reference [14][15][16], polynomial ANN, radial basis function (RBF) ANN and physical hybrid (PH) ANN are introduced for wind and solar energy forecasting.Besides, there are massive other forecasting models proved efficient in literatures.In Reference [17][18][19], support vector machine (SVM) is introduced to achieve better wind power forecasting results when the number of training samples are limited.Extreme learning machine (ELM) is a recently proposed model that has a high training speed and is suitable for ultra-short online forecasting [20][21][22].Adaptive neuro-fuzzy inference system (ANFIS) is a simpler model than ANN while can also approximate any non-linear function, which is proven feasible in wind speed forecasting [23].Nevertheless, those models have limited generalization performance and can be hardly trained on a large amount of training data.
As a result, deep learning theory has been developed in order to overcome the shallow learning abilities of traditional soft computing methods.As they can be trained on a large number of samples, they are ideally suitable for big data analysis and have emerged their superiorities in the field of renewable energy forecasting.Stacked auto-encoder (SAE) and deep belief network (DBN) are two models that solve the optimization problem in deep multi-layer perceptron (MLP), which have successfully been applied in wind speed and power forecasting [24][25][26].In Reference [27], deep convolutional neural network (CNN) is introduced to predict wind power using two-dimensional inputs.Besides, recurrent neural network (RNN) is also a new deep-learning-based model, which holds the ability to learn temporal correlations in a complete sequence.It has also been introduced to point wind speed forecasting [28,29].Hence, RNN is chosen as the sub-model in this study for probabilistic wind speed forecasting and it will be compared with other soft computing methods.
Moreover, in order to combine the advantages of different models, extensive hybrid forecasting models have been proposed.In addition to pre-processing and post-processing, the ensemble of sub-models is an efficient approach [30], which is also called competitive ensemble forecasting [31].The final forecasting result is usually calculated via an averaging or weighting procedure [32].This method has two advantages: first, the prediction error of a sole model can be reduced through competitions in sub-models; second, a confidence level for probabilistic forecasting can be obtained based on the variance of those sub-models.In Reference [33], an ensemble model based on Gaussian process regression (GPR) and ANN possesses a superb precision in short-term wind power forecasting.In Reference [34], adaptive boosting (AdaBoost) is combined with ELM for multi-step-ahead wind speed forecasting.A wind power forecasting model mixed bagging and ANN is proposed in Reference [35].Those models share a common characteristic that the number of sub-models is considerable.However, it costs a lot of computation time to train a sole deep-learning-based sub-model.In order to ensure the performance of ensemble with a small number of RNNs, an ANFIS model is built as the top layer of the ensemble forecasting model in this study.A combined result of those sub-models is achieved through the learnt fuzzy rules, rather than a simple weighting approach.
In this study, a hybrid ensemble model is proposed for short-term probabilistic wind speed forecasting, which consists of wavelet threshold denoising (WTD), RNN and ANFIS.The WTD is a pre-processing approach to decrease the fluctuations in wind speed datasets.RNN and ANFIS are utilized to build an ensemble probabilistic forecasting model.The main contributions of this study are listed as follows: 1.
The ANFIS model is innovatively established as the top layer of the ensemble model to calculate a better forecasting result rather than just an averaged value from sub-models.

2.
In order to maintain diversities of sub-models and take advantages of ensemble learning, different architectures are introduced to build RNNs, including long short-term memory (LSTM), gated recurrent unit (GRU) and dropout layer.

3.
The variances collected from sub-models are introduced to probabilistic forecasting problem, which inventively consist of two parts, namely modeling uncertainties and forecasting uncertainties.
We compare the proposed WTD-RNN-ANFIS model with several soft computing models in various point and probabilistic forecasting cases, in order to prove its feasibility and superiority.The rest of this paper is structured as follows.Section 2 introduces the utilized datasets, as well as detailed architecture and algorithms of the probabilistic forecasting model.Section 3 presents concise forecasting results based on the proposed approach.Section 4 contains the detailed discussion and comparisons in probabilistic forecasting performance.Finally, a conclusion is drawn in Section 5.

Datasets
The training and testing datasets utilized in this study are based on wind speed data measured in the interval of 15 min at the height of 50 m, which are collected by the wind tower of National Renewable Energy Laboratory (NREL) National Wind Technology Center (NWTC) [36].The tower is located in the latitude of 39.91 • N, the longitude of 105.23 • W and the elevation of 1855 m.When the measured data is missing or incorrect due to equipment failure, it can be substituted by the mean value obtained from previous or subsequent points, or by the value calculated from intelligent imputation technologies for example, decision tree [37].
In Figure 1, the monthly averaged values of wind speed from 2015 to 2017 are exhibited, from which it can be found that wind speed varies under a certain seasonal regularity.The averaged values of wind speed in May, June, July and August are smaller than those in November, December, January and February.In this study, the wind speed datasets in 2015 and 2016 are used as training samples and those in 2017 are for the test phase.Specially, according to the above analysis, we choose four testing cases of different seasons.The wind speed data on 22 March 2017, 23 June 2017, 22 September 2017 and 23 December 2017 are chosen to be testing samples as representative dates of spring, summer, autumn and winter seasons, in order to fully validate and compare the performances of different forecasting models.
Energies 2018, 11,1958 3 of 23 utilized to build an ensemble probabilistic forecasting model.The main contributions of this study are listed as follows: 1.The ANFIS model is innovatively established as the top layer of the ensemble model to calculate a better forecasting result rather than just an averaged value from sub-models. 2. In order to maintain diversities of sub-models and take advantages of ensemble learning, different architectures are introduced to build RNNs, including long short-term memory (LSTM), gated recurrent unit (GRU) and dropout layer.3. The variances collected from sub-models are introduced to probabilistic forecasting problem, which inventively consist of two parts, namely modeling uncertainties and forecasting uncertainties.
We compare the proposed WTD-RNN-ANFIS model with several soft computing models in various point and probabilistic forecasting cases, in order to prove its feasibility and superiority.The rest of this paper is structured as follows.Section 2 introduces the utilized datasets, as well as detailed architecture and algorithms of the probabilistic forecasting model.Section 3 presents concise forecasting results based on the proposed approach.Section 4 contains the detailed discussion and comparisons in probabilistic forecasting performance.Finally, a conclusion is drawn in Section 5.

Datasets
The training and testing datasets utilized in this study are based on wind speed data measured in the interval of 15 min at the height of 50 m, which are collected by the wind tower of National Renewable Energy Laboratory (NREL) National Wind Technology Center (NWTC) [36].The tower is located in the latitude of 39.91° N, the longitude of 105.23°W and the elevation of 1855 m.When the measured data is missing or incorrect due to equipment failure, it can be substituted by the mean value obtained from previous or subsequent points, or by the value calculated from intelligent imputation technologies for example, decision tree [37].
In Figure 1, the monthly averaged values of wind speed from 2015 to 2017 are exhibited, from which it can be found that wind speed varies under a certain seasonal regularity.The averaged values of wind speed in May, June, July and August are smaller than those in November, December, January and February.In this study, the wind speed datasets in 2015 and 2016 are used as training samples and those in 2017 are for the test phase.Specially, according to the above analysis, we choose four testing cases of different seasons.The wind speed data on 22 March 2017, 23 June 2017, 22 September 2017 and 23 December 2017 are chosen to be testing samples as representative dates of spring, summer, autumn and winter seasons, in order to fully validate and compare the performances of different forecasting models.

Architecture of the Entire Forecasting Model
The probabilistic wind speed forecasting model proposed in this study consists of WTD, RNN and ANFIS.Firstly, WTD is used to decompose and smooth historical time series of wind speed in

Architecture of the Entire Forecasting Model
The probabilistic wind speed forecasting model proposed in this study consists of WTD, RNN and ANFIS.Firstly, WTD is used to decompose and smooth historical time series of wind speed in order to reduce its volatility.In that case, it is easier for a forecasting model to capture the variation trend of wind speed.Secondly, six RNNs with dissimilar architectures and parameters, namely sub-models, are established and trained for prediction.Thirdly, an ANFIS is utilized as the top layer of the ensemble model.The outputs of those RNNs are entered into the ANFIS and the final forecasting result is attained from its output.Finally, variances are gathered from different sub-models, along with errors between the final predicted value and the actual value, so prediction intervals of different confidence levels can be calculated.The structure of the entire WTD-RNN-ANFIS model is shown in Figure 2.
order to reduce its volatility.In that case, it is easier for a forecasting model to capture the variation trend of wind speed.Secondly, six RNNs with dissimilar architectures and parameters, namely sub-models, are established and trained for prediction.Thirdly, an ANFIS is utilized as the top layer of the ensemble model.The outputs of those RNNs are entered into the ANFIS and the final forecasting result is attained from its output.Finally, variances are gathered from different sub-models, along with errors between the final predicted value and the actual value, so prediction intervals of different confidence levels can be calculated.The structure of the entire WTD-RNN-ANFIS model is shown in Figure 2.

Wavelet Threshold Denoising (WTD)
WTD is based on wavelet transform (WT), which is a time-scale approach for signal processing.As WT holds the characteristics of local feature representation and multi-resolution analysis [38], it has been proved effective in dealing with non-stationary series and time varying problems [39,40].A mother wavelet ϕ(t) is essential in WT.It can be scaled and time-shifted via scale factor a and shift factor b, producing a set of child wavelets to extract features under different resolutions.Moreover, WT includes two categories: continuous wavelet transform (CWT) and discrete wavelet transform (DWT).DWT has scale and shift factors in discrete forms, which reduces computation cost with little loss of signal information.It is especially suitable for sampled signals as well.The wavelet function of DWT is expressed as follows [41]: where j and k are integer-valued scale and shift factors, respectively; t represents the time variable; and ϕ(t) is the mother wavelet.The wavelet functions are binary wavelets and they are utilized to conduct DWT as [41]: where ϕ * (•) represents a convolution operation using wavelet function and x(t) is the sampled signal.Mallat algorithm is commonly used for DWT, where scale functions and wavelet functions are operating as low and high pass filters, respectively.The multi-level decomposition of the algorithm

Wavelet Threshold Denoising (WTD)
WTD is based on wavelet transform (WT), which is a time-scale approach for signal processing.As WT holds the characteristics of local feature representation and multi-resolution analysis [38], it has been proved effective in dealing with non-stationary series and time varying problems [39,40].A mother wavelet ϕ(t) is essential in WT.It can be scaled and time-shifted via scale factor a and shift factor b, producing a set of child wavelets to extract features under different resolutions.Moreover, WT includes two categories: continuous wavelet transform (CWT) and discrete wavelet transform (DWT).DWT has scale and shift factors in discrete forms, which reduces computation cost with little loss of signal information.It is especially suitable for sampled signals as well.The wavelet function of DWT is expressed as follows [41]: where j and k are integer-valued scale and shift factors, respectively; t represents the time variable; and ϕ(t) is the mother wavelet.The wavelet functions are binary wavelets and they are utilized to conduct DWT as [41]: where ϕ * (•) represents a convolution operation using wavelet function and x(t) is the sampled signal.Mallat algorithm is commonly used for DWT, where scale functions and wavelet functions are operating as low and high pass filters, respectively.The multi-level decomposition of the algorithm is shown in Figure 3.After decomposition, approximation and detail components are obtained, which contain low-frequency and high-frequency information, respectively [41]: where ψ j,k and ϕ j,k are the scale and wavelet functions, respectively; A j,k is called the approximation component or scale coefficient and D j,k is the detail component or wavelet coefficient.As the Gaussian white noise in a signal is discontinuous, its coefficients in wavelet domain, which are centered in detail components, also follow the Gaussian distribution.Therefore, coefficients of noise are smaller than those of the effective signal.In this case, WTD is able to utilize a certain threshold to extract the noise and sets its coefficients to be zero.
Energies 2018, 11, 1958 5 of 23 is shown in Figure 3.After decomposition, approximation and detail components are obtained, which contain low-frequency and high-frequency information, respectively [41]: where ψj,k and ϕj,k are the scale and wavelet functions, respectively; Aj,k is called the approximation component or scale coefficient and Dj,k is the detail component or wavelet coefficient.As the Gaussian white noise in a signal is discontinuous, its coefficients in wavelet domain, which are centered in detail components, also follow the Gaussian distribution.Therefore, coefficients of noise are smaller than those of the effective signal.In this case, WTD is able to utilize a certain threshold to extract the noise and sets its coefficients to be zero.
Based on the above explanations, the procedure of WTD for wind speed is simply summed up into three steps: 1. Decompose the original time series of wind speed using WT; 2. Calculate the threshold to distinguish the noise from the effective signal; 3. Set the wavelet coefficients of noise to be zero and reconstruct the wind speed series.
It also should be mentioned that there are several parameters in WTD to be determined artificially: the wavelet basis function, the number of decomposition levels, threshold computation approach and denoising approach.

Recurrent Neural Network (RNN)
RNN has a strong power to handle a sequence with temporal correlation and it has been widely utilized in solving time-varying problems [42,43], especially natural language processing [44].Unlike a common neural network that has no connections within hidden layers, RNN is able to connect hidden layers with the former ones circularly.Those hidden layers of RNN, which are also named recurrent units, can save historical information from the sequence.The sequential structure of a one-hidden-layer RNN is shown in Figure 4.It can be unfolded to several networks with hidden layers connected, each of which has a point input of a sequence.Therefore, a RNN possesses an extremely deep architecture, the number of whose hidden layers is equal to the length of input sequence.

Output y2
Hidden  Based on the above explanations, the procedure of WTD for wind speed is simply summed up into three steps: Decompose the original time series of wind speed using WT; 2.
Calculate the threshold to distinguish the noise from the effective signal; 3.
Set the wavelet coefficients of noise to be zero and reconstruct the wind speed series.
It also should be mentioned that there are several parameters in WTD to be determined artificially: the wavelet basis function, the number of decomposition levels, threshold computation approach and denoising approach.

Recurrent Neural Network (RNN)
RNN has a strong power to handle a sequence with temporal correlation and it has been widely utilized in solving time-varying problems [42,43], especially natural language processing [44].Unlike a common neural network that has no connections within hidden layers, RNN is able to connect hidden layers with the former ones circularly.Those hidden layers of RNN, which are also named recurrent units, can save historical information from the sequence.The sequential structure of a one-hidden-layer RNN is shown in Figure 4.It can be unfolded to several networks with hidden layers connected, each of which has a point input of a sequence.Therefore, a RNN possesses an extremely deep architecture, the number of whose hidden layers is equal to the length of input sequence.
Energies 2018, 11, 1958 5 of 23 is shown in Figure 3.After decomposition, approximation and detail components are obtained, which contain low-frequency and high-frequency information, respectively [41]: where ψj,k and ϕj,k are the scale and wavelet functions, respectively; Aj,k is called the approximation component or scale coefficient and Dj,k is the detail component or wavelet coefficient.As the Gaussian white noise in a signal is discontinuous, its coefficients in wavelet domain, which are centered in detail components, also follow the Gaussian distribution.Therefore, coefficients of noise are smaller than those of the effective signal.In this case, WTD is able to utilize a certain threshold to extract the noise and sets its coefficients to be zero.
Based on the above explanations, the procedure of WTD for wind speed is simply summed up into three steps: 1. Decompose the original time series of wind speed using WT; 2. Calculate the threshold to distinguish the noise from the effective signal; 3. Set the wavelet coefficients of noise to be zero and reconstruct the wind speed series.
It also should be mentioned that there are several parameters in WTD to be determined artificially: the wavelet basis function, the number of decomposition levels, threshold computation approach and denoising approach.

Recurrent Neural Network (RNN)
RNN has a strong power to handle a sequence with temporal correlation and it has been widely utilized in solving time-varying problems [42,43], especially natural language processing [44].Unlike a common neural network that has no connections within hidden layers, RNN is able to connect hidden layers with the former ones circularly.Those hidden layers of RNN, which are also named recurrent units, can save historical information from the sequence.The sequential structure of a one-hidden-layer RNN is shown in Figure 4.It can be unfolded to several networks with hidden layers connected, each of which has a point input of a sequence.Therefore, a RNN possesses an extremely deep architecture, the number of whose hidden layers is equal to the length of input sequence.

Output y2
Hidden  In this study, RNNs are established and trained as sub-models for ensemble forecasting.In order to make full use of ensemble technology and enhance forecasting accuracy, it is of great significance to keeping diversities of sub-models.In order to deal with this issue, several approaches are considered and discussed as follows:

•
Choose different datasets to train sub-models, for example, bagging technology.In bagging, a random sampling with replacement is operated for training datasets.Thus, datasets may contain duplicated samples or not [45].However, a quite large number of sub-models is essential for this method, whereas it is expensive in terms of computation cost for deep-learning-based models.

•
Set dissimilar parameters for sub-models, for example, the numbers of neural nodes and hidden layers in ANN.Nevertheless, it is hard in itself to set those parameters as they are often determined by trial and error of many times in practical application.

•
Establish diverse models or models with different architectures, for example, different recurrent units in RNN.

•
Adopt dropout technology in ANN.Dropout is a state-of-art method to regularize fixed-sized models and prevent them from over-fitting [46].As model training has a certain randomness with dropout, diversities of sub-models are generated.
Accordingly, two recurrent architectures are utilized to build sub-models of RNNs in this study, namely LSTM and GRU.Dropout is used as well and it is in the form of layer structures in RNN.

Dropout Layer
The structure of a dropout layer is shown in Figure 5. Several connections of neural nodes between two hidden layers are blocked with a certain dropout rate p in one iteration and those nodes are not able to participate in training [47].
During the training phase, a certain proportion (p) of the nodes are abandoned randomly, whose weights W are not updated.Therefore, different parts of an ANN are trained in different iterations.While the testing stage, a 'thinner' network is obtained, where all nodes are connected with the new weights pW.
Energies 2018, 11,1958 6 of 23 In this study, RNNs are established and trained as sub-models for ensemble forecasting.In order to make full use of ensemble technology and enhance forecasting accuracy, it is of great significance to keeping diversities of sub-models.In order to deal with this issue, several approaches are considered and discussed as follows:

•
Choose different datasets to train sub-models, for example, bagging technology.In bagging, a random sampling with replacement is operated for training datasets.Thus, datasets may contain duplicated samples or not [45].However, a quite large number of sub-models is essential for this method, whereas it is expensive in terms of computation cost for deep-learning-based models.

•
Set dissimilar parameters for sub-models, for example, the numbers of neural nodes and hidden layers in ANN.Nevertheless, it is hard in itself to set those parameters as they are often determined by trial and error of many times in practical application.

•
Establish diverse models or models with different architectures, for example, different recurrent units in RNN.

•
Adopt dropout technology in ANN.Dropout is a state-of-art method to regularize fixed-sized models and prevent them from over-fitting [46].As model training has a certain randomness with dropout, diversities of sub-models are generated.
Accordingly, two recurrent architectures are utilized to build sub-models of RNNs in this study, namely LSTM and GRU.Dropout is used as well and it is in the form of layer structures in RNN.

Dropout Layer
The structure of a dropout layer is shown in Figure 5. Several connections of neural nodes between two hidden layers are blocked with a certain dropout rate p in one iteration and those nodes are not able to participate in training [47].
During the training phase, a certain proportion (p) of the nodes are abandoned randomly, whose weights W are not updated.Therefore, different parts of an ANN are trained in different iterations.While the testing stage, a 'thinner' network is obtained, where all nodes are connected with the new weights pW.

LSTM
A simple RNN shown in Figure 4 has numerous connections between current and previous hidden layers.Thus, training such a network becomes quite difficult, where the vanishing gradient problem will arise.In order to overcome the problem, LSTM was proposed in 1997 [48,49] and has been improved during the recent rapid development of deep learning technology [50].It is a building unit for layers of RNN, which is able to remember short-term memory that lasts for a long period of time.The structure of a LSTM block is presented in Figure 6, which consists of four major parts, namely input gate, forget gate, memory cell and output gate.

LSTM
A simple RNN shown in Figure 4 has numerous connections between current and previous hidden layers.Thus, training such a network becomes quite difficult, where the vanishing gradient problem will arise.In order to overcome the problem, LSTM was proposed in 1997 [48,49] and has been improved during the recent rapid development of deep learning technology [50].It is a building unit for layers of RNN, which is able to remember short-term memory that lasts for a long period of time.The structure of a LSTM block is presented in Figure 6, which consists of four major parts, namely input gate, forget gate, memory cell and output gate.From the shown structure of LSTM, it is seen that two parallel lines are working to deal with the hidden layer information and memory, which is the most significant variance from a simple RNN.Hidden layer line h computes its output value based on the input and historical hidden information and its computation result is sent to both the next layer and memory line c.The memory line receives those results and forgets redundant ones, producing a modified output to affect hidden layer line.The realizations of the above functions are decided by these controllable gates with dissimilar architectures and equations [29]: Input gate.It receives information from the previous hidden layer and the current input.Then it computes to obtain an output with the following equation: where it is the output of input gate; xt and ht−1 are the current input and the previous hidden layer output, respectively; wxi and whi are weights for inputs xt and ht−1, respectively; and bi is the bias of the input gate.σ is the activation function and a soft sign function is adopted in this study as: where x denotes an independent variable of the activation function.Moreover, a temporary memory is also achieved via the input gate: where wxc, whc and bc are the weight for xt, weight for ht−1 and bias for t c , respectively.Forget gate.The output of the forget gate has the similar computation formula as the input gate with different weights (wxf, whf) and bias (bf): Memory cell.The current memory ct is calculated under the following equation: where ct−1 is the previous output of memory cell.Output gate.Its result is determined by the current input, the current memory and the previous hidden layer output.The calculation formulas can be described as follows: From the shown structure of LSTM, it is seen that two parallel lines are working to deal with the hidden layer information and memory, which is the most significant variance from a simple RNN.Hidden layer line h computes its output value based on the input and historical hidden information and its computation result is sent to both the next layer and memory line c.The memory line receives those results and forgets redundant ones, producing a modified output to affect hidden layer line.The realizations of the above functions are decided by these controllable gates with dissimilar architectures and equations [29]: Input gate.It receives information from the previous hidden layer and the current input.Then it computes to obtain an output with the following equation: where i t is the output of input gate; x t and h t−1 are the current input and the previous hidden layer output, respectively; w xi and w hi are weights for inputs x t and h t−1 , respectively; and b i is the bias of the input gate.σ is the activation function and a soft sign function is adopted in this study as: where x denotes an independent variable of the activation function.Moreover, a temporary memory is also achieved via the input gate: where w xc , w hc and b c are the weight for x t , weight for h t−1 and bias for c t , respectively.Forget gate.The output of the forget gate has the similar computation formula as the input gate with different weights (w xf , w hf ) and bias (b f ): Memory cell.The current memory c t is calculated under the following equation: where c t−1 is the previous output of memory cell.
Output gate.Its result is determined by the current input, the current memory and the previous hidden layer output.The calculation formulas can be described as follows: where o t and h t denote the outputs of the output gate and the current hidden layer, respectively; w xo , w ho and b o are the weight for x t , the weight for h t−1 and bias for o t , respectively.

GRU
GRU, which is a recently proposed architecture of RNN, is a simplified variant of LSTM [51].There are mainly two changes in GRU.First, the input and forget gates of LSTM are merged in the model, producing a sole update gate.Second, the two lines in LSTM, that is, the memory and the hidden layer, are also combined together.The integrate structure of GRU is shown in Figure 7.
The following equations are designed for calculations in GRU [52]: = tanh( ) where ot and ht denote the outputs of the output gate and the current hidden layer, respectively; wxo, who and bo are the weight for xt, the weight for ht−1 and bias for ot, respectively.

GRU
GRU, which is a recently proposed architecture of RNN, is a simplified variant of LSTM [51].There are mainly two changes in GRU.First, the input and forget gates of LSTM are merged in the model, producing a sole update gate.Second, the two lines in LSTM, that is, the memory and the hidden layer, are also combined together.The integrate structure of GRU is shown in Figure 7.The following equations are designed for calculations in GRU [52]: where xt and ht−1 are the current input and the previous output of GRU, respectively; wxz, whz, wxr, whr, wxh and whh are weights in GRU; bz, br and bh are biases; zt, rt and are outputs during intermediate procedures; and ht is the final output of the GRU block.

Proposed RNN Structure
In this study, two kinds of RNNs are utilized, that is, the LSTM and the GRU networks.Each RNN has four computation layers, including two recurrent units and two fully-connected layers (also named dense layers).As wind speed data are sampled in the interval of 15 min, a time sequence of wind speed with 48 points is chosen as an input sample by trial and error, which contains historical information of 12 h.Therefore, the input size of the proposed RNNs is 48 × 1.Those deep-learning-based models are established through Tensorflow 1.4.0 [53] and Keras 2.0.8 [54] under Python 3.5.4and they are trained on a standard PC under Windows 10 operating system with Z270-A motherboard, an Intel Core i7-7700K 4.2 GHz CPU, a NVIDIA GTX 1080 GPU and 16.0 GB of RAM.CUDA Toolkit 9.1 is also utilized.The structures and parameters of RNNs are listed in Tables 1 and  2, where rectified linear unit (ReLU) and sigmoid are two adopted activation functions: From the model structures, the total numbers of trainable parameters in the LSTM and GRU networks are 52,033 and 39,553, respectively.As the number of training samples in this study is more

Proposed RNN Structure
In this study, two kinds of RNNs are utilized, that is, the LSTM and the GRU networks.Each RNN has four computation layers, including two recurrent units and two fully-connected layers (also named dense layers).As wind speed data are sampled in the interval of 15 min, a time sequence of wind speed with 48 points is chosen as an input sample by trial and error, which contains historical information of 12 h.Therefore, the input size of the proposed RNNs is 48 × 1.Those deep-learning-based models are established through Tensorflow 1.4.0 [53] and Keras 2.0.8 [54] under Python 3.5.4and they are trained on a standard PC under Windows 10 operating system with Z270-A motherboard, an Intel Core i7-7700K 4.2 GHz CPU, a NVIDIA GTX 1080 GPU and 16.0 GB of RAM.CUDA Toolkit 9.1 is also utilized.The structures and parameters of RNNs are listed in Tables 1 and 2, where rectified linear unit (ReLU) and sigmoid are two adopted activation functions: From the model structures, the total numbers of trainable parameters in the LSTM and GRU networks are 52,033 and 39,553, respectively.As the number of training samples in this study is more than 70 thousand, the over-fitting problem is effectively avoided.Besides, the computation time for training a RNN sub-model is approximately 380 s (6.3 min), which meets the needs of engineering application.If those sub-models are trained parallel, the entire ensemble model is also suitable for online training and forecasting.

Adaptive Neuro Fuzzy Inference System (ANFIS) Based Ensemble Approach
The common practice for building ensemble models is to calculate an average value from prediction results of different sub-models.Hence, a quite large number of sub-models is essential.However, due to the long training time of deep-learning-based sub-models, the number of those models is actually limited.Based on this situation, an ANFIS is built in this study as the top layer of the entire ensemble model, in order to fully analyze the small number of sub-models.The final prediction result will be decided by the fuzzy inference system (FIS), which is the output of the ANFIS model.
Generally, there are three methods that can establish FIS of an ANFIS model [55].In consideration of the number of model inputs, an ANFIS with fuzzy-c-means (ANFIS-FCM) is chosen to be utilized.A diagram of ANFIS-FCM is exhibited in Figure 8.The number of fuzzy rules is equal to that of clusters obtained by FCM, as well as that of membership functions (MFs) for each input.In this study, the number of inputs is 6, which is the same number of sub-models.And the number of clusters should be set artificially, which is chosen to be 4 by trial and error.Based on the architecture of ANFIS-FCM in Figure 8, there are five layers in this model and their computation formulas can be described as follows [56].
1.In layer 1, input variables are fuzzified via MFs: where x and O are the input and the ith output of layer 1, respectively; μA, μB and μC are MFs.2. In layer 2, namely rule layer, the outputs from the previous layer are multiplied together and the firing strengths of rules are calculated: 3. Layer 3 normalizes the firing strength of each rule as: 4. Layer 4 is a defuzzification layer, where adaptive nodes are utilized to calculate the weighted contributions of different rules: where pi, qi, ri and si are called consequent parameters. 5.In layer 5, the final output is obtained by summing contributions of rules: where the output is adopted as the point forecasting result of the entire ensemble model.A hybrid algorithm is commonly used to train ANFIS models.The trainable parameters in ANFIS contain premise parameters in MFs and consequent parameters in layer 4, which are determined via the least square fitting and the gradient descent algorithm, respectively.In this study, the number of inputs is 6, which is the same as that of sub-models.The number of clusters should be set artificially, which is chosen to be 4 by trial and error.The numbers of nodes in layer 2-4 remain the same, which are equal to that of clusters.As a result, the structure of the utilized ANFIS-FCM can be decided, as is shown in Table 3.Based on the architecture of ANFIS-FCM in Figure 8, there are five layers in this model and their computation formulas can be described as follows [56].

1.
In layer 1, input variables are fuzzified via MFs: where x and O 1,i are the input and the ith output of layer 1, respectively; µ A , µ B and µ C are MFs.

2.
In layer 2, namely rule layer, the outputs from the previous layer are multiplied together and the firing strengths of rules are calculated: 3. Layer 3 normalizes the firing strength of each rule as: Layer 4 is a defuzzification layer, where adaptive nodes are utilized to calculate the weighted contributions of different rules: where p i , q i , r i and s i are called consequent parameters. 5.
In layer 5, the final output is obtained by summing contributions of rules: where the output is adopted as the point forecasting result of the entire ensemble model.
A hybrid algorithm is commonly used to train ANFIS models.The trainable parameters in ANFIS contain premise parameters in MFs and consequent parameters in layer 4, which are determined via the least square fitting and the gradient descent algorithm, respectively.In this study, the number of inputs is 6, which is the same as that of sub-models.The number of clusters should be set artificially, which is chosen to be 4 by trial and error.The numbers of nodes in layer 2-4 remain the same, which are equal to that of clusters.As a result, the structure of the utilized ANFIS-FCM can be decided, as is shown in Table 3.

Probabilistic Forecasting Approach
As wind speed varies randomly and violently, a probabilistic forecasting model can provide more effective information of its future state.In order to measure the range of prediction errors, the uncertainties from the modeling and the forecasting procedures are considered.
The modeling uncertainty is obtained by the variance in sub-models.In the proposed WTD-RNN-ANFIS model, there are six prediction results from RNNs and a result from ANFIS: where y e (t) is the ensemble output of the ANFIS, A(•) is the ANFIS computation, y si (t) is the output of the ith sub-model and t is the time to be predicted.Therefore, the modeling uncertainty can be described as the variance: The final prediction value can hardly be equal to the actual value.As a result, the forecasting uncertainty is produced based on the variance of their differences.The uncertainty is estimated at the training phase, where the actual outputs of the training samples are y(1), y(2), . . ., y(s).The mean and variance of forecasting differences can be easily calculated as follows: where s is number of training samples; the variance σ 2 f denotes the forecasting uncertainty, which is a constant obtained during the training stage.
It is assumed that the above two uncertainties are independent.Therefore, the total uncertainty of the proposed model is the sum of those two parts: Based on the uncertainty, the bounds of the prediction interval for confidence level 100•(1−α)% are obtained under the following equations: where l α (t) and u α (t) are the lower and upper bounds of the confidence interval, respectively; z α/2 is the critical value of a Gaussian distribution.

Forecasting Results
In this study, the proposed WTD-RNN-ANFIS model is established to predict wind speed for different steps ahead of time.Other models, including SVM and three-layer ANN, are utilized and compared with the proposed model.Both the point and probabilistic forecasting results are presented based on several performance criteria.

Criteria for Point Forecasting
Root mean square error (RMSE), mean absolute error (MAE) and normalized mean absolute percentage error (NMAPE) are adopted performance criteria for point forecasting in this study.Their calculation equations are as follows: where n is the number of testing samples.A better prediction performance is achieved when RMSE, MAE and NMAPE are smaller.

Criteria for Probabilistic Forecasting
In this study, average coverage error (ACE) and interval sharpness (IS) are two criteria for evaluating the performance of probabilistic forecasting.ACE is an indicator to appraise the reliability of prediction interval, which has the following equation: where c t is the indicative function of coverage and its calculation equation is as: From the equation of ACE, a value close to zero denotes the high reliability of prediction interval.
As an infinitely wide prediction interval is meaningless, IS is another indicator contrary to the coverage rate of interval, which measures the accuracy of probabilistic forecasting.It can be calculated under the following equation: It can be found that IS α is always equal to a negative value.A great prediction interval obtains high reliability with a narrow width, which has a small absolute value of IS.

Denoising Results
The best parameters of WTD are tried and decided when the highest forecasting accuracy is acquired.In this study, the wavelet basis function, the number of decomposition levels are chosen to be sym4 and 2, respectively.The threshold computation and denoising approaches are selected to be the heuristic algorithm and the soft thresholding method, respectively.Those decomposed series of wind speed are used to construct training and testing samples.In order to display the denoising results, the denoising series in training samples of four days, that is, 1 March, 1 June, 1 September and 1 December in 2016, are presented as examples in Figure 9.

Point Forecasting Results
In this study, the proposed WTD-RNN-ANFIS model is validated on forecasting wind speed under multi-step-ahead situations, that is, 1-step-ahead (15 min), 2-step-ahead (30 min) and 3-stepahead (45 min).It is compared with three-layer ANN, SVM, RNN, WTD-ANN, WTD-SVM and WTD-RNN.Specially, the performance criteria in WTD-RNN are the averaged values of the six sub-

Point Forecasting Results
In this study, the proposed WTD-RNN-ANFIS model is validated on forecasting wind speed under multi-step-ahead situations, that is, 1-step-ahead (15 min), 2-step-ahead (30 min) and 3-step-ahead (45 min).It is compared with three-layer ANN, SVM, RNN, WTD-ANN, WTD-SVM and WTD-RNN.Specially, the performance criteria in WTD-RNN are the averaged values of the six sub-models in the proposed WTD-RNN-ANFIS, in order to validate the ensemble method that is based on ANFIS.Besides, as mentioned before in Section 2, the wind speed data at 22 March, 23 June, 22 September and 23 December in 2017 are chosen as four testing cases of different seasons.
The point forecasting results are presented in Table 4, where the best criteria of each case are emphasized in bold and the comparisons of those criteria are shown in Figure 10.From the results, it can be found that WTD is a great pre-processing method in wind speed forecasting, as it reduces the uncertainties of wind speed variation.The comparisons indicate that SVM performs worse than ANN and RNN.Besides, SVM costs rather long computation time when trained on a large number of training samples, which is not suitable for online training of short-term forecasting.It is noted that RNN usually achieves better performance than ANN whether or not WTD is utilized.The proposed WTD-RNN-ANFIS obtains the best forecasting results in all four cases.Therefore, it is a perfect ensemble model for short-term wind speed forecasting rather than just calculating an averaged output from different sub-models.Moreover, the prediction curves of 1-step-ahead point forecasting using several representative models are presented as examples in Figure 11.The curve of the proposed WTD-RNN-ANFIS fits the actual value best, which denotes the superb point forecasting performance of the model.  1 The numbers in bold are the best criteria of each studied case.

Probabilistic Forecasting Results
In order to validate the feasibility of the proposed WTD-RNN-ANFIS model for probabilistic wind speed forecasting, we establish it to calculate the prediction intervals of wind speed under 1~3 steps ahead timescales.Different prediction interval nominal confidences (PINCs) are verified in this study, including 85%, 90% and 95%.The comparison results based on criteria ACE and IS are presented in Tables 5 and 6.From the results, the criteria ACE and IS maintain within certain ranges in all verified time scales, which indicates that the proposed WTD-RNN-ANFIS model achieves a rather high reliability in ultra-short probabilistic forecasting.Moreover, prediction intervals of 1-stepahead forecasting obtained by the proposed model under four cases are shown as examples in Figure 12.It can be discovered that the intervals cover the actual series of wind speed perfectly

Probabilistic Forecasting Results
In order to validate the feasibility of the proposed WTD-RNN-ANFIS model for probabilistic wind speed forecasting, we establish it to calculate the prediction intervals of wind speed under 1~3 steps ahead timescales.Different prediction interval nominal confidences (PINCs) are verified in this study, including 85%, 90% and 95%.The comparison results based on criteria ACE and IS are presented in Tables 5 and 6.From the results, the criteria ACE and IS maintain within certain ranges in all verified time scales, which indicates that the proposed WTD-RNN-ANFIS model achieves a rather high reliability in ultra-short probabilistic forecasting.Moreover, prediction intervals of 1-step-ahead forecasting obtained by the proposed model under four cases are shown as examples in Figure 12.It can be discovered that the intervals cover the actual series of wind speed perfectly.The numbers in bold are the best criteria of each studied case.
From the comparison results, the two QR models obtains the worst criterion of ACE and GPR acquires the worst criterion of IS.According to the comparison between QR and GPR, QR merely achieves better IS than GPR whereas it can hardly cover the actual wind speed series.The reason may be that QR extremely concentrates on achieving the minimum quantile loss function so that it ignores the coverage rate.It is to say otherwise that machine-learning-based QR model reveals no significant merits in probabilistic forecasting.As a result, the QR models in Reference [57,58] adopt kernel destiny estimation method to ameliorate this situation and improve forecasting performance.Besides, those two kinds of models, QR and GPR, calculate prediction intervals directly based on the distribution of historical time series whereas the unavoidable forecasting errors in models are not included.On the contrary, we involve forecasting errors as forecasting uncertainties in our proposed probabilistic forecasting approach and test two ensemble models, that is, the model proposed by Lee et al. [60] and WTD-RNN-ANFIS.The two models achieve better ACE and slightly improve the IS.Hence, the proposed probabilistic forecasting approach increases the prediction reliability of models without enlarging their widths of prediction intervals.Moreover, WTD-RNN-ANFIS performs better than the ensemble model in Reference [60], probably due to the denoising procedure of WTD and the strong learning ability of RNN.
As the forecasting accuracy of wind speed further increases owing to the proposed WTD-RNN-ANFIS model, especially ultra-short-term forecasting, it can help the operators of wind power integrated system to better guide the real-time scheduling and dispatching of electrical power system.In this study, we mainly introduce state-of-the-art technologies of ensemble learning and deep learning into the field of renewable energy forecasting, in order to improve the operation reliability and security of power system.If the acquisition and storage technology for real-time data can be improved, as well as the calculation capability in power system, its operators will benefit much more from the contributions of the study.Moreover, we provide a feasible probabilistic forecasting approach based on deep learning without adding too much calculation burden.However, it still remains limitations that merely one kind of sub-model is utilized in ensemble and the prediction accuracy is not greatly enhanced due to the rapid fluctuation of wind speed.Therefore, it is of great application value to research and verify more deep-learning-based models for renewable energy forecasting and it will be involved in our future work.

Conclusions
An ensemble WTD-RNN-ANFIS model is proposed in this study for probabilistic wind speed forecasting.The model is based on wavelet decomposition, deep learning and ensemble learning technologies.It is aimed at less than one-hour-ahead ultra-short-term forecasting of wind speed, which utilizes historical wind speed data as inputs.We compare the proposed model with other commonly-used machine learning models to predict wind speed based on four cases of different seasons and the prediction results are in both point and interval forms.
From the comparisons, WTD evidently reduces the volatilities and uncertainties in wind speed series, so that forecasting models can better capture the variation trend of wind speed and improve their prediction accuracies.A RNN model, which are based on deep learning technology, is able to learn the correlations among data points in an entire time series.It is proved to achieve better results in wind speed forecasting than shallow ANN in this study.Besides, ANFIS successfully increases prediction accuracy of the ensemble model using only a small number of sub-models.It performs better than the ensemble model that merely calculates an averaged output value from sub-models.
Specifically, based on the overall testing datasets, the criteria RMSE, MAE and NMAPE of the proposed model in 15-min-ahead forecasting reach 0.9678 m/s, 0.6516 m/s and 4.9221%, respectively.Those criteria in 30-min-ahead and 45-min-ahead forecasting are 1.3079 m/s and 1.5643 m/s, 0.8774 m/s and 1.0671 m/s, 6.6276% and 8.0608%, respectively.The above prediction results indicate the superiority of the proposed ensemble model in ultra-short point wind speed forecasting.The high forecasting accuracy of the proposed model can meet the needs of practical engineering projects, which can benefit the secure operations of wind power integrated system.Moreover, the best ACE and IS of the proposed model tested on the overall datasets are −0.21% and −0.58, respectively.Comparing to traditional QR models, it achieves more reliable prediction intervals, demonstrating its excellent performance in ultra-short probabilistic wind speed forecasting.
As wind speed fluctuates rapidly and arbitrarily, the improvement in performance of prediction models is limited.Therefore, it is desired that in the future advanced signal decomposition and denoising technologies can be adopted and merged into hybrid wind speed forecasting methods.Besides, the ensemble model proposed in this study merely utilizes RNNs as sub-models.It remains to be studied further that other state-of-the-art deep-learning-based models are combined for ensemble learning.

Figure 1 .
Figure 1.The monthly averaged wind speed of datasets in years 2015-2017.

Figure 1 .
Figure 1.The monthly averaged wind speed of datasets in years 2015-2017.

Figure 2 .
Figure 2. The entire architecture of the proposed probabilistic forecasting model.

Figure 2 .
Figure 2. The entire architecture of the proposed probabilistic forecasting model.

Figure 5 .
Figure 5.A dropout layer with rate p = 0.25.(a) dropout during the training phase; (b) dropout during the testing phase.

Figure 5 .
Figure 5.A dropout layer with rate p = 0.25.(a) dropout during the training phase; (b) dropout during the testing phase.

Figure 6 .
Figure 6.The structure of a long short-term memory (LSTM) block.

Figure 6 .
Figure 6.The structure of a long short-term memory (LSTM) block.

Figure 7 .
Figure 7.The structure of a gated recurrent unit (GRU) block.

Figure 7 .
Figure 7.The structure of a gated recurrent unit (GRU) block.

Figure 8 .
Figure 8.An example model of adaptive neural fuzzy inference system-fuzzy c-means (ANFIS-FCM) with 3 inputs and 2 clusters.

Figure 8 .
Figure 8.An example model of adaptive neural fuzzy inference system-fuzzy c-means (ANFIS-FCM) with 3 inputs and 2 clusters.
where x t and h t−1 are the current input and the previous output of GRU, respectively; w xz , w hz , w xr , w hr , w xh and w hh are weights in GRU; b z , b r and b h are biases; z t , r t and are outputs during intermediate procedures; and h t is the final output of the GRU block.

Table 1 .
The structure of the proposed LSTM network.

Table 2 .
The structure of the proposed GRU network.

Table 3 .
The structure of the utilized ANFIS-FCM.

Table 3 .
The structure of the utilized ANFIS-FCM.

Table 4 .
The performance criteria of different models for point wind speed forecasting.

Table 7 .
Comparisons of the criterion ACE (%) for probabilistic wind speed forecasting.The numbers in bold are the best criteria of each studied case.

Table 8 .
Comparisons of the criterion IS for probabilistic wind speed forecasting.