Short-Term Forecasting of Wind Energy: A Comparison of Deep Learning Frameworks

Wind energy has been recognized as the most promising and economical renewable energy source, attracting increasing attention in recent years. However, considering the variability and uncertainty of wind energy, accurate forecasting is crucial to propel high levels of wind energy penetration within electricity markets. In this paper, a comparative framework is proposed where a suite of long short-term memory (LSTM) recurrent neural networks (RNN) models, inclusive of standard, bidirectional, stacked, convolutional, and autoencoder architectures, are implemented to address the existing gaps and limitations of reported wind power forecasting methodologies. These integrated networks are implemented through an iterative process of varying hyperparameters to better assess their effect, and the overall performance of each architecture, when tackling one-hour to three-hours ahead wind power forecasting. The corresponding validation is carried out through hourly wind power data from the Spanish electricity market, collected between 2014 and 2020. The proposed comparative error analysis shows that, overall, the models tend to showcase low error variability and better performance when the networks are able to learn in weekly sequences. The model with the best performance in forecasting one-hour ahead wind power is the stacked LSTM, implemented with weekly learning input sequences, with an average MAPE improvement of roughly 6, 7, and 49%, when compared to standard, bidirectional, and convolutional LSTM models, respectively. In the case of two to three-hours ahead forecasting, the model with the best overall performance is the bidirectional LSTM implemented with weekly learning input sequences, showcasing an average improved MAPE performance from 2 to 23% when compared to the other LSTM architectures implemented.


Introduction
The rising concerns over global warming, energy security, and the impact of Greenhouse gas (GHG) emissions on the global economy have increased interest in developing efficient renewable energy sources for the rapid replacement of fossil fuels [1]. Governments around the world have stepped up their climate ambitions. By the end of 2020, 28 countries had issued "climate emergency" declarations, many of which were accompanied by plans and targets to transition to more renewable-based energy systems [2]. Wind power has emerged as the most promising and economical renewable energy source. The global wind power market expanded 19% in 2019 to 60 GW, for a total global capacity of 650 GW (621 GW onshore and the rest offshore) [2]. While economic factors are the driving force for the steady growth of wind power, the variability and uncertainty of wind energy pose challenges in the operation, scheduling, and planning of power systems, which is preferred, whereas GRU is selected in some classification tasks where less memory consumption and faster operation is required [19][20][21]. For this reason, as will be seen, most of the state-of-the-art research in wind power forecasting, using deep learning, involves the LSTM approach.
In the literature, different wind power forecasting strategies have been developed by implementing diverse LSTM models. In [22], medium-term wind power forecasting was performed using LSTM networks on historical wind power data and NWP data. In this case, the latter data type was first analyzed by principal component analysis (PCA) to narrow down the meteorological inputs. The study produced similar results for both LSTM and PCA-LSTM results; however, the latter hybrid model slightly outperformed the LSTM, yet both proved to be superior when compared to SVM models and backpropagation (BP) neural networks performance metrics. In [23], short-term wind power forecasting was developed, for one to five steps ahead, based on discrete wavelet transform (DWT) and LSTM networks. DWT decomposed original wind power data from three different wind farms into low-frequency and high-frequency sub-signals, such that independent LSTMs approximate the temporal dynamic behaviors of the sub-signals, respectively. The proposed DWT LSTM model proved to perform better when compared to RNN and BP methods alone; however, when steps ahead increased, the performance metrics deteriorated. In [24], a combined model consisting of the variational mode decomposition (VMD), convolutional LSTM network (ConvLSTM), and error analysis was conducted for short-term wind power forecasting. The proposed VMD-ConvLSTM-LSTM model was implemented on historical output data series from two wind turbines and two wind farms. The proposed model proved to outperform traditional forecasting algorithms; however, model parameters were selected by expertise given the large computational cost associated with the estimation of optimal parameters.
Additional strategies have been implemented in related research areas, such as the wind speed forecasting field. In [25], a model combining ARIMA and dynamic evolving neural-fuzzy inference system (DENFIS) for wind speed forecasting has been proposed, reporting RMSE values of 0.07 and 0.13 for 2 and 6 hour ahead forecasting, respectively. As a reported drawback, DENFIS needs prior assumptions about data and requires domain knowledge to define the associated parameters [26]. Other ML-based models have been proposed to improve the forecasting accuracy. Specifically, in [27], a hybrid approach based on the CRO (coral reefs optimization) algorithm and the extreme learning machine (ELM) is presented. The CRO was implemented to reduce the number of variables and the ELM was used to provide the wind speed prediction. The main limitation of this approach is that the performance is very unstable. It is very difficult to find the best parameters such that the training and testing accuracy is maximum for a given problem [28]. In order to improve the error/accuracy metrics, DL-based models have been proposed.
In particular, in [29], a multivariate stacked LSTM model is proposed. Historical wind speed, wind direction, temperature, humidity, pressure, dew point, and solar radiation are used to predict future short-term wind speed. The study proposes a model with two LSTM layers stacked after the input layer, with 64 neurons each. The results proved to outperform multiple competing statistical methods, including multiple linear regression, LASSO, and ridge-based strategies. In the same line, in [30], a hybrid particle swarm optimization-ant lion optimizer (hPSO-ALO) was used for training the weight values of LSTM network. However, limitations associated to ALO-based algorithms are related to the long run time due to the random ant walking model and the associated exploration process, which usually is not to the extent of obtaining optimal solutions [31]. In [32], a novel deep learning model is proposed to forecast regional short-term wind speed. The study proposes a CNN-LSTM model, where the CNN acts as encoder and decoder, while the LSTM acts as the temporal predictor to forecast the previously extracted and translated deep features. The proposed CNN-LSTM model showed that the prediction errors decreased as the models gradually became more sophisticated, outperforming the persistence model, ANN, and LSTM with a 32, 27, and 18% RMSE decrease, respectively. In [33], a hybrid VMD-DE-ESN model is proposed incorporating variational mode decomposition (VMD) as the decomposition module, differential evolution (DE) as the optimization module, and echo state network (ESN) as forecasting module of decomposed subseries of wind speed. Wind speed data collected in March, June, September, and December (2018)  In [34], a novel genetic algorithm (GA) coupled with LSTM model is proposed for one-hour ahead wind power forecasting for seven wind farms in Europe. The proposed model governs the optimization of input learning sequence size and number of neurons of LSTM layers by exploiting the strength of the bio-inspired framework of genetic algorithm. When compared against support vector regressor (SVR) and LSTM models, the GLSTM, on average, improves wind power predictions by 12-30% in RMSE values. In [35], a deep sequence-to-sequence long short-term memory regression (STSR-LSTM) is proposed for wind power forecasting at three different timescales (week-ahead, day-ahead, 15 minahead) for three distinct wind farms. The data applied in this study were collected from three sites in Belgium, two of which are offshore wind farms (sites 1 and 3). Results are compared to Levenberg-Marquardt backpropagation neural network (LMBNN) and BRNN, where the best MAPE value of 1.897% was achieved for site 3 in the week-ahead forecast timescale; however, the proposed STSR-LSTM model consistently proved to be superior across all timescales and sites when compared to LMBNN and BRNN.
Although the verified performance of LSTM and hybrid LSTM models surpass statistical and ML methods, the above literature review highlights both the advancements and existing gaps in wind power forecasting. Most studies have focused on wind speed forecasting, rather than wind power, and few studies have focused on comparing and evaluating different DL models (LSTM, bidirectional-LSTM, convolutional-LSTM, stacked-LSTM, autoencoder-LSTM). To the best of our knowledge, a comparative analysis of short-term prediction results, given the estimation of optimum hyperparameters for each LSTM model, has not been applied to wind power forecasting. In addition, although bidirectional, convolutional-LSTM, and autoencoder-LSTM have been widely used to forecast meteorological time series, they have not been deeply analyzed in the case of wind power forecasting for single and multiple steps ahead. We aim to fill these research gaps by the evaluation of a set of competitive LSTM-based models, applied in the field of wind power forecasting (vanilla LSTM, stacked, and bidirectional), as a baseline to compare the results. The evaluation of the models has been carried out considering statistical error measures reported in the field (MAPE, RMSE). In addition, it is important to highlight that, unlike most of the papers available in the literature, which deal with the prediction in a turbine or a group of turbines, in this study, the wind power forecasting is carried out for the whole system. In this way, the historical data in this study case not only depend on wind resource availability in all the farms in the power system but also the balance between offer and supply in the electricity market. To the best of our knowledge, the models analyzed in this work have not been evaluated before tackling this problem.
In summary, in this paper, hourly wind power data is processed through a suite of LSTM RNN-based models for single and multiple steps ahead (up to three) forecasting. Moreover, these models are evaluated and compared through real data from the Spanish electricity market. A descriptive analysis of the time series data is presented, and a comparative error analysis is produced given the diverse modeling strategies employed, where significant hyperparameters, such as the number of iterations, memory units, batch size, and size of the input learning sequence, were implemented through an iterative process aiming to estimate the model with the best predictive capabilities. The structure of this work is as follows. Section 2.1 describes the dataset analyzed and the implementation methodology used in this work. Section 2.2 focuses on detailed descriptions of the DL techniques and the definition of the evaluation metrics. Section 3 compares the results and metrics, and finally, in Section 4, conclusions are drawn and future work is discussed.

Dataset Description
Unlike most of the papers available in the literature that deal with the prediction in a turbine or a group of turbines, in this paper, the performance of a suite of LSTM RNNbased models is evaluated on a whole system wind power forecasting. In other words, the historical data in our study case not only depend on wind resource availability in all power system farms but also the balance between offer and supply in the particular electricity market. In this way, remarkable particularities make the Spanish study case attractive to evaluate the performance of forecasting techniques. First, Spain is the fifth country in the world in terms of installed wind power, after China, the US, Germany, and India. There are currently 1265 wind farms installed in 16 out of the 17 states (autonomous communities), covering 21.9% of the energy consumed in the country, making the participation of wind power in the Spanish electricity market substantial [36]. Second, the time series of Spain exhibits a high level of seasonality and trend because of the intrinsic nature of wind resources in the region. Finally, because of the interconnections with Portugal and France, the dataset reflects the behavior of other markets with different energy mixes and the state of the cross-border interconnections. These characteristics make this a challenging study case to evaluate the suitability of the LSTM RNN-based models in forecasting.
In this context, the original dataset for this study covers approximately six and a half years of hourly wind power observations in Spain, collected from the Spanish Electric Network [37]. Specifically, the wind power time series was recorded from 1 January 2014 to 30 May 2020 and reported in megawatts-hour (MWh) for a total of 56,231 observations. Figure 1 represents the wind power data processed and analyzed in this work. Firstly, in order to remove innovational outliers, additive outliers, or level shifts, the approach described in [38] was used for the automatic detection and removal of outliers in the time series. This strategy uses spatial outlier detection by implementing the variogram method, where outliers are considered those that considerably deviate from the majority of the data. This methodology assumes that the outliers are not correlated in time and space, and can be modeled using an α-stable distribution. The selection of this approach was based on its extensive use in the time series preprocessing [39]. Taking this approach into account, no outlier was found in the analysis of the whole time series. Initial observations on the data suggest that there are no changes in trend over the years. In addition, there are relevant peaks at the beginning of each year, marking the annual seasonality of the wind power time series. These ideas are explored in more detail below. In order to describe the wind power time series analyzed in this work, in Figure 2, the sample descriptive statistics, its corresponding histogram distribution, and the Quantile-Quantile (QQ) plot are summarized. Based on the reported results, it can be seen that the skewness value describes a non-normal wind power time series with a right-skewed distribution. Moreover, the calculated Kurtosis value, being less than three, indicates the data are light-tailed and flatter than a normal distribution (platykurtic).

Deep Learning Strategies for Time Series Forecasting Using LSTM
Deep learning (DL) is a subset of machine learning (ML) that uses restructured multilayered ANNs to deliver improved accuracy in diverse tasks, such as object detection, speech recognition, language translation, among many other applications. Given the success of DL models in tackling classification and regression problems, research continues to move towards implementing diverse DL model architectures for a plethora of applications. A particular DL model architecture has gained popularity over the years given its success in handling long and short-term time dependencies and its ability to work with raw time series data without extensive feature engineering or data preprocessing. This DL model is known as RNNs.

LSTM RNNs Based Models Description
RNNs are dynamic systems that efficiently use the temporal information of a model's input sequences. For this reason, they are a widely used tool for time series forecasting. RNNs are derived from feed-forward neural networks (FFNN), where the network is forward propagated according to the number of time steps per sample. RNNs are composed by a memory gate that allows the network to process sequential data, maintaining previous inputs to predict the outputs. The RNN model updates its memory, also known as its recurrent hidden state h t , as follows: where x = (x 1 , x 2 , . . . , x t ) is a sequence of length t, σ is a nonlinear function such as sigmoid or rectified linear unit (ReLu), W x and W h are weight matrices, and b t is a constant bias value [40]. Standard RNNs suffer from short-term memory and are limited to look back in time, such that the larger the size of the input learning sequence, the less it learns. Moreover, RNNs struggle with long sequential data, leading to exploding and vanishing gradients, where the former refers to the assignment of high importance to the weight matrices without any reason and the latter refers to significantly small gradient values that prevent the RNN model from learning further [40].
LSTM networks address the short-term memory problem of RNNs that leads to vanishing and exploding backpropagation error signals that either grow or shrink with every time step in RNNs. In order to overcome this problem, gated units to control the information flow have been proposed. Specifically, LSTM units have reported successful results in different fields to forecast time series [8,11]. LSTM networks are capable of modeling long-term dependencies in time series, learning more than 1000-time steps, whereas RNNs are limited to 10 steps [41]. Given the competency of LSTM networks in handling long and short-term time dependencies, the study developed in this work utilizes a suite of LSTM-based models to better capture the real-time dynamics of wind energy.

Vanilla LSTM
The traditional LSTM network structure is composed of three gates: the forget gate, the input gate, and the output gate. The forget gate determines which information is valuable enough to be maintained within the network, assigning a value between 0 and 1. Values close to 0 are discarded, while values close to 1 are kept. The input gate decides what information is allowed to flow into the memory cell and be stored there.
Equations (3) and (4) below represent this decision, where k t produces a vector of new values to be added to the memory cell based on the update signal given by i t . The network is then updated as defined in Equation (5), where c t (long-term state) considers whether to keep or forget the previous values and add new values. Lastly, the output of h t (short-term state) is also composed by two layers. Here, o t is the output value given by the σ activation, which determines what values from the memory cell are to be considered as output, followed by a tanh layer with an output range [−1, 1] [42]. Equations (2)- (7) show how the network is updated, which was also illustrated in Figure 3.
where f t , i t , k t , o t are the output values of the forget gate, input gate, update signal, and output gate, respectively, and their input values are x t at the current time t and the output value h t−1 at time t − 1. W f ,i,k,o are weight matrices and b f ,i,k,o are bias vectors corresponding to the respective gates. σ is a nonlinear activation function and c t is the memory unit.

Bidirectional LSTM
While the vanilla LSTM takes unidirectional processing of the information from left to right, the bidirectional model learns the input time series sequences both forward and backward, ultimately combining both interpretations. The data are processed through independent networks, where one network will process the time series data from left to right, while the other network will process the data from right to left. In this way, each network holds information at time t, while also holding information from either the past or the future. In this case, the hidden state equation is redefined to incorporate the forward layers, backward layers, and combined output. The forward layer's output sequence − → h is computed by going through the input sequence from left to right (t − 1 to t − n). In contrast, the backward layer's output is computed from right to left using the reverse inputs. Ultimately, both outputs are computed as previously defined in Equations (2)- (7). Figure 4 exemplifies the incorporation of the direction for the forward and backward layers. The concatenated output is defined in Equation (8), where W− → h y and W← − h y are the weights of the forward and backward networks, respectively, and b y is the bias of the output layer. Equation (8) can be generalized for multi-step forecasting problems such that y t would be otherwise computed given by y t+1 , y t+2 , and so on.

Stacked LSTM
A stacked LSTM refers to a model architecture composed of multiple LSTM layers, making the model deeper. Under a standard LSTM model, the layer would receive an input sequence and output a value or a specified number of values to be predicted. However, under a stacked model, as it is shown in Figure 5, the output of a hidden LSTM layer is not only propagated forward through time but also used as one of the inputs for the next LSTM hidden layer. In this way, the hidden states are a function of all previous hidden states. Consequently, the l-th layer can be updated by the Equations (9)- (14).
Based on this configuration, the input associated to the first layer is the raw time series: while its respective output is an input abstraction, which is then fed as a hierarchical feature to the next LSTM layer. The final layer is composed by the number of neurons corresponding to the number of time steps predicted through the model. Different advantages reported for this methodology include that a stacked configuration allows the neural network to learn features from the raw time series in different components at each time step [43]. The parameters associated with the neural network are spread across the whole model's space, which accelerates the algorithm's convergence and fine-tunes the nonlinear operations applied to the time series. Figure 5. Stacked LSTM architecture.

Convolutional LSTM
Convolutional neural networks (CNNs) have been widely used for data feature extraction with grid-like configurations, such as images. The CNN architecture was initially developed for two-dimensional data, allowing the model to learn patterns hierarchically such that the patterns learned will be recognized throughout the network. Presently, onedimensional CNNs have become more popular given their application to sequence data, such as text and time series. However, their performance is suboptimal when compared to other models, such as LSTMs [44]. CNNs are usually composed by three stacked layers: convolutional layer, pooling layer, and fully connected layer. First, the input layer holds the input vector of values, followed by the convolutional layer where features are extracted through a filter or kernel, to be output into a fully connected layer [45]. The basic structure of CNNs is portrayed in Figure 6. A LSTM network can be connected to the dense or fully connected layer of the CNN configuration as portrayed, such that the convolutional LSTM network would be updated in a similar way as in the vanilla LSTM, replacing the matrix addition and dot product operators by convolutional operators.  Figure 6. Example of a CNN with six convolutional layers.
A convolutional LSTM RNN (ConvLSTM) incorporates the convolutional structures of the CNN in both the input-to-state and state-to-state transitions. The network is redefined such that the operators previously defined in (2)-(7) are replaced by the convolution operator ( * ) and the Hadamard product ( ) as exemplified in Equations (16)-(20), where the latter operator keeps the constant bias property of the cells. As it can be noted, these equations differ from the standard LSTM network in that, in order to extend the LSTM network's capability to remember or forget information throughout the convolutional LSTM configuration, the information from the cell state is included in all output computations for each gate to take into account the analogous transitions between the states. Moreover, the weight matrices of the respective gates now include a convolution operator in order to incorporate the function of the filter. The network is updated as follows: 2.2.6. Autoencoder LSTM The autoencoder (AE) architecture consists of three sequentially connected layers defined as the encoder, the code, and the decoder. AEs extract features from input data in an unsupervised manner and are often used as generative models [44]. The AE architecture is illustrated in Figure 7.
The encoder maps high-dimensional input data into a low-dimensional representation, where the decoder is then responsible for reconstructing the data. This architecture allows the decoder weights to be tuned first, although both the encoder and decoder are trained simultaneously where, ultimately, the model minimizes the difference between the reconstructed output and the original input data. The encoded features e t are given by: The decoded values given by the original input data x are given by: where σ is the selected activation function, W e is the encoder weight matrix, b e is the encoder bias vector, W d is the decoder weight matrix, and b d is the decoder bias [46]. According to the architecture of AE models, LSTM units can be included in order to process time series, where multiple sequences are taken as input and later as output. This model presents special characteristics to deal with multi-step ahead forecasting, where the output predictions mapped represent several hours to be forecasted at a time.

Implementation Methodology
Upon completion of the exploratory data analysis, all observations were scaled [0 to 1] for the sake of the networks' speed, learning rate, and convergence. A data split of 70:30 for training and testing sets is implemented, such that we are able to evaluate the models' predictive performance for up to two years (at least two periods of annual seasonality). Each model was implemented considering the mathematical background explained in Section 2.2, following the pseudocode presented in Algorithms 1-5. For the LSTM models to learn the mapping function of our historical wind power observations sequence, the original sequence must be transformed into multiple time-series samples from which the LSTM learns. The network maps these sequences as input to an output observation (one-step ahead) or to multiple output observations (multi-steps ahead). Input learning sequence size is based on the weekly seasonal behavior of wind power time series [47]. Considering these characteristics, weekly, monthly, and quarterly time scales have been considered in the preparation of the input data. First, we consider a transformation where 168-time steps are used as input, such that the network learns by week. Then, 720-time steps are proposed, such that the model's input sequence is equivalent to roughly a month worth of observations. Lastly, we consider a quarterly transformation with 2160-time steps used as input. It must be noted that a larger input sequence size of 8760 steps (annual steps) is not considered due to implications that the size of the time window has on the availability of data for the validation of the models, as we would only have very limited observations as part of the testing dataset. Moreover, the annual input sequence would also require an increasing number of neurons in the hidden layers, which would positively affect the computational times of the networks, which makes the implementation in a timely manner unfeasible.

Algorithm 1 Vanilla LSTM
1: x ← RESHAPE(x, shape(168, 720, 2160)) 2: y ← RESHAPE(y, shape(168, 720, 2160), (1,2,3)) 3 TRAIN_DATA, VALID_DATA ← TEST_TRAIN_SPLIT(e,0.7) 10: for i, j, k ← EPOCHS, Batch Size, Neurons do 11: for data ← TRAIN_DATA do 12: 13: prediction_results ← lstm(x,y, i,j,k) 14: cost← CALCULATE_COST(prediction_results, y) 15: optimizer← ADAM_OPTIMIZER(learning_rate = 0.001).MIN(MSE) 16 TRAIN_DATA, VALID_DATA ← TEST_TRAIN_SPLIT(e,0.7) 10: for i, j, k ← EPOCHS, Batch Size, Neurons do 11: for data ← TRAIN_DATA do 12: 13: prediction_results ← bilstm(x, y, i, j, k) 14: cost← CALCULATE_COST(prediction_results, y) 15: optimizer← ADAM_OPTIMIZER(learning_rate = 0.001).MIN(MSE) 16 TRAIN_DATA, VALID_DATA ← TEST_TRAIN_SPLIT(e,0.7) 12: for i, j, k ← EPOCHS, Batch Size, Neurons do 13: for data ← TRAIN_DATA do 14: X ← data[0], Y ← data [1] 15: prediction_results ← aulstm(x, y, i, j, k) 16: cost← CALCULATE_COST(prediction_results, y) 17: optimizer← ADAM_OPTIMIZER(learning_rate =0.001 ).MIN(cost) 18: end for 19: end for 20: end for The models are first evaluated, given the different input learning sequences for onestep ahead forecasting. An iterative training process is first implemented to find the DL-based model and configuration with the best performance. Then, the models and their respective configurations, yielding the lowest possible error rate, are selected to evaluate their performance when facing three-steps ahead forecasting. The iterative process is only completed for one-step ahead forecasting due to the exhaustive computational load that the forecasting of three steps entails, considering the different DL architectures. Lastly, we are able to conduct a comparative analysis to evaluate the performance of the three-steps ahead forecasting models against an AE-LSTM model, which is commonly used to tackle multi-step forecasts rather than one step. The general implementation process is shown in Figure 8. Since batch size relates to the number of training samples to be considered at a time for updating the network, batch size may affect the network speed, learning rate, and convergence. Therefore, different batch sizes are considered according to the number of samples obtained through the selected time windows, such that the samples are divisible by the proposed batch sizes. As such, models set up according to weekly and monthly learning sequences are implemented with batch sizes 12, 24, and 46. In contrast, models set up through a quarterly learning sequence are implemented with batch sizes 12, 24, and 48. Likewise, we select a different number of memory units for the LSTM layers. Since the initial model configurations are based on increasing learning sequence sizes, an increasing number of units is proposed to understand whether more complex models are able to better solve the problem at hand. The number of neurons in each hidden layer play a key role in the neural network training and impact the prediction accuracy [23]. The more the number of neurons in each layer, the more complex the model. Lastly, a varying number of training cycles is proposed (100-300 epochs). In this way, the iterative process will be performed such that the models' training cycles and batch sizes vary, so the networks are trained with different patterns. Given all the possible hyperparameter configurations portrayed in Table 1, and according to the number of steps in the input sequence, there are 27 possible combinations for each, at minimum, given the number of epochs, size of the batches, and the number of neurons.
The activation function employed is ReLu, as its gradient does not tend to suffer from gradient vanishing or diffusion [46]. In addition, models are trained using the Adam optimization algorithm [48], where the default recommended parameters are used (learning rate = 0.001, β 1 = 0.9, β 2 = 0.999, = 1 × 10 −8 ), unless otherwise stated and required by the model. The Adam optimizer is useful in handling sparse and noisy problems as it combines the properties of AdaGrad and RMSProp optimizers, and is applicable to LSTM time series forecasting as it does not require a stationary objective [48,49]. Additionally, the selection of this model is based on the reported advantages of its implementation to support DL-based models: straightforward implementation without the need of extensive data preprocessing and decomposition methods, small memory requirement, invariance to the diagonal re-scalation of gradients, suitability for models that involve large datasets, and/or large number of parameters [48,49]. Finally, the mean square error (MSE) is the selected loss function to be minimized, based on the successful results reported of this loss function in conjunction with the Adam optimizer [49][50][51].

Evaluation Metrics
The performance evaluation of the different LSTM models is performed by comparing the root mean square error (RMSE), the mean absolute percentage error (MAPE), the mean absolute error (MAE), and the coefficient of determination (R), defined as: 26) where N is the number of predicted samples, x i andx i are the real and predicted values, respectively, andx i is the real values' mean. RMSE provides the overall agreement between estimated and original data, avoiding error compensation, while the performance in actual units is achieved through root implementation [34]. RMSE has been considered one of the most commonly used indicators for power forecasting in renewable energy stations because it can be used effectively with the satisfaction of objectivity and symmetry criterion. In addition, it can be used uniformly for different measurement durations [52]. In contrast to RMSE, the MAE metric weights all values equally, and thus does not add additional weight to extreme forecasting events. This metric has been widely used in the renewable energy industry to assess the effectiveness of the forecast [53]. MAPE, on the other hand, represents the quantum of error relative to the actual data expressed as percentage, thereby directly indicating the accuracy of the model [54]. This metric has been proposed as the most widely used unit-free accuracy measure for forecasting tasks and it has been recommended by many textbooks, M-competition, and the previous literature [55,56]. Finally, the accuracy is measured using the coefficient of determination R 2 . This metric compares the sum of the square of the residuals with the total sum of squares, which is proportional to the variance of the data. These evaluation metrics have been selected because they are the most used performance measures in the field of short-term wind power forecasting using deep learning techniques [4,5,8,23].

Wind Power Forecasting-Experimental Results
Based on the methodology described in the previous section, the experiments for each model were carried out. It is important to note that every model was compiled in Python 3.8 with Tensor Flow 2.4.0 backend and Intel(R) Core(TM) i7-8565U.

One-Step Forecasting
First, single-step ahead forecasting strategies were assessed. Single, bi-directional, stacked, and convolutional LSTM RNNs were implemented according to the previously defined implementation process, varying the number of input neurons with only one output defined by the value to be predicted.
Given the hyperparameter configurations portrayed in Table 1, and according to the number of steps in the input sequence, there are 27 possible combinations for each given the number of epochs, size of the batches, and the number of neurons in the hidden layer. The first technique implemented was the vanilla LSTM. The performance of each model is illustrated in Figures S1-S3 of the Supplementary Material. For this approach, the smallest MAPE error of 4.30% was recorded under a configuration of 168-time steps corresponding to a weekly input learning, 200 iterations, batch size of 46, and 32 neurons in the hidden layer. The models implemented using monthly input sequences produced errors between 4.61 and 8.49%, while the quarterly input sequence models reported errors between 4.84 and 12.21%. It is possible to see that, as the size of the input learning sequences increase, so do their corresponding error metrics.
The bidirectional LSTM models were implemented using all possible configuration parameters referenced in Table 1. Once again, as it can be seen in Figures S4-S6 of the Supplementary Material, the best performing model configuration, with a reported MAPE of 4.44%, is given by the model that learned by weekly time steps (168 steps) implemented with 200 epochs, batch size of 46, and 32 neurons. However, the highest reported error rates were found within the implementation with monthly time steps (720 steps), where the errors ranged between 4.60 and 8.91%. In comparison, the models implemented with quarterly time steps (2160 steps) produced a 4.79-7.04% range. Overall, the models initiated with weekly time steps maintained values between 4.44 and 5.53%, resulting in lower variability among the reported metrics. This architecture maintained similar metrics for the weekly and monthly input sequence implementations, and it improved the resulting error range under the quarterly input steps. In contrast, the range for the vanilla for the quarterly input sequence (2160 steps) was 4.84-12.21%, the bidirectional reported a range of 4.79-7.04%, resulting in lower variability.
In addition, a stacking architecture with two LSTM layers is implemented, where each layer is independent, and the number of neurons in each layer may vary between 32 or 64 neurons, being these two values the most commonly employed under stacked architectures [57,58]. Under this configuration, we exclude the implementation of 128 neurons for the sake of implementation times as a higher number of neurons may prevent the models from finding a solution within a suitable time frame. A dense layer then connects the layers with one neuron for a single time step output. Given all the possible parameter configurations portrayed in Table 2, there are a total of 108 possible combinations given the size of the input sequence, iterations, size of the batches, and the number of neurons in the first and second layers.  32 neurons on the first layer and a varying number of neurons on the second layer. The lowest recorded error of 4.37% was produced by the model implemented with 32 memory units in the first and second layers, 300 epochs, a batch size of 12, and a weekly input sequence (168 steps). The models implemented under the monthly input sequence yielded errors ranging from 4.68 to 8.82%, while the quarterly sequence approach produced errors ranging between 4.75 and 9.37%. The results for the stacked model with 64 neurons on the first layer are portrayed below in Figures S10-S12 of the Supplementary Material. Once again, the lowest recorded MAPE of 4.50% was produced by the model configured with an input sequence of size 168, 32 neurons in the second layer, 300 epochs, and batch size of 12. The models implemented under the monthly input sequence yielded MAPE errors ranging from 4.70 to 8.99%, while the quarterly sequence approach yielded errors ranging between 4.73 and 14.13%. Overall, the models implemented with a weekly input sequence (168 steps) produced the lowest MAPEs ranging between 4.37 and 5.13%. The lowest reported error (4.37%) was produced by the model configured with 32 neurons on the first and second layers, 300 epochs, and a batch size of 12. This architecture achieves a slightly lower variability in relation to weekly input sequence (168 steps) implementations when compared to the vanilla (4.30-5.38%) and bidirectional (4.44-5.53%) models.
The Conv-LSTM implementation, on the other hand, considers some fixed parameters based on the results reported previously in the literature. As such, for these experiments, 64 filters were used in the convolutional layer and a max-pooling kernel size of 2. This parameter configuration has been previously employed in studies of short-term wind power forecasting [42]. This model required a further split of the input sequences into subsequences to take advantage of the feature extraction capability. Given the already established data splits (168, 720, 2160 input sequence), each sample is split into smaller samples. Our selection splits every sample by 24 h for the model to read subsequences equivalent to daily data. In this way, the number of subsequences per sample, according to input sequences of sizes 168 (weekly), 720 (monthly), and 2160 (quarterly), is 7, 30, and 90, respectively. Given the additional subsequences, only the models initialized with weekly input time steps (168 steps) were processed considering all the other possible configurations of the hyperparameters referenced in Table 1. This was due to the significant increase in computational times of this architecture that made their implementation infeasible. The results, using the full suite of possible configurations given a weekly input learning sequence (168 steps), were produced in roughly 22 h ( Figure S13 of the Supplementary Material). However, the results given monthly and quarterly input time steps, implemented with 100 epochs only, took 27 and 57 h, respectively (Figures S14 and S15 of the Supplementary Material). Under these Conv-LSTM configurations, the MAPE results range from 5.45 to 10.15, 5.31 to 7.31, and 5.87 to 6.81%, given weekly (168 steps), monthly (720 steps), and quarterly (2160 steps) input learning sequences, respectively. While the previously analyzed architectures produced the lowest error variability under weekly input sequences, the Conv-LSTM produced the lowest error variability according to the quarterly input sequence implementations (5.87-6.81%). The lowest reported MAPE of 5.31% was found under monthly input time steps (720 steps), 100 epochs, batch size of size 12, and 128 neurons.
For one-step ahead wind power forecasting, according to the overall performance results recorded in Table 3 and portrayed in Figure 9, it can be observed that the majority of the architectures produce better results when implemented with an input learning sequence of 168 steps (weekly time steps). Further, it is possible to observe that the trade-off between the error metric results and implementation times is not valuable enough to choose a more complex nor deeper model (stacked-LSTM or Conv-LSTM). While the standard and bidirectional models yielded MAPEs between 4.30-12.21 and 4.44-8.91%, the stacked and convolutional architectures produced errors between 4.37-14.13 and 5.45-10.15%, respectively. Additionally, standard and bidirectional models were able to implement all proposed hyperparameter configurations, referenced in Table 1, in approximately 23 and 40 h, respectively. In contrast, the stacked and convolutional models involved implementations between 32 and 106 h, respectively. As the size of the input time steps increased, so did the associated error variability and the implementation times. However, the increase in execution times is also positively related to the number of neurons and the complexity of the models.

One-to-Three-Steps Forecasting
Given the results of the one-step forecasting models implemented in the previous section, now, the model configurations that produced the lowest errors are selected to carry out a forecasting three steps ahead. Lastly, an autoencoder LSTM architecture is introduced for this task to compare its performance against the previously studied models. This DL architecture is only presented for the three-steps ahead forecasting task due to its inner configurations, where multiple neurons are found at the input and output layers.
The one-step LSTM model that yielded the best predictive performance had a configuration of weekly input time steps (168 steps), 200 iterations, batch size of 46, and 32 neurons in the hidden layer, with an error rate of 4.30%. Based on this configuration, the model was updated such that there are three neurons in the output layer, instead of 1, to produce the three-steps ahead predictions. The reported MAPE and RMSE increased with each further time step predicted, as has been reported in the literature, ranging from 4.47 to 13.61% and from 269.64 to 745.22 MW, respectively (Figure S16 of the Supplementary Material).
In the same way, the best performing one-step BiLSTM model configuration, with a reported MAPE of 4.44%, was given by a configuration of 168 input time steps implemented with 200 epochs, batch size of 46, and 32 neurons. Based on this configuration, the model was updated such that there are three neurons in the output layer, instead of 1, to produce the three-steps ahead predictions. The three-steps ahead forecast we implemented with this configuration produced almost identical results for the first hour predicted, yet it also proved that the MAPE errors increase with each additional hour forecasted, ranging from 4.50 to 13.27%, which translates into 267.95-741.82 MWh.
Under the lowest error yielding configuration for the stacked LSTM referenced in Table 3, the model was updated such that there are three neurons in the output layer, instead of 1, to produce the three-steps ahead predictions. The stacked architecture has produced a lower MAPE, MAE, and RMSE for the first step predicted in comparison to the vanilla and bidirectional three-steps ahead models; however, these previous implementations showcase a better performance for the second and third hours predicted.
According to the Conv-LSTM results presented in Table 3, the model was updated such that there are three neurons in the output layer, instead of 1, to produce the three-steps ahead predictions. The model's performance for this task proves to be the highest erroryielding network compared to the previous architectures, as all error metrics for each step have virtually doubled under this configuration, with a MAPE between 8.24 and 17.21% and a RMSE between 463.22 and 962.18 MWh.
Finally, the AE-LSTM networks are introduced given their encoder and decoder architecture, which is especially helpful in multi-step forecasting given their generative capabilities. The recorded metrics for this first implementation show that there is little variation among the models implemented with a different number of neurons in the hidden layer. Overall, the hyperparameter configuration yields the lowest MAPE by the implementations set up with a batch of size 46 and 32 neurons, ranging between 4.24 and 13.03%. At the same time, the reported RMSE for this model produced negligible differences in terms of MWh. For executing the models according to a monthly input sequence (720 steps), 32 and 64 neurons were considered during the parameters tuning process. Lastly, the models with quarterly-sized input sequences (2160 steps) had to be trimmed down due to the extreme computation time that made the implementation unfeasible using more than 32 neurons in the hidden LSTM layer. Results are portrayed in Figure S22 of the Supplementary Material. The lowest reported MAPE ranges between 4.19 and 12.60% and is related to the model with 32 neurons and a batch size of 12.
As a performance summary, the three-steps ahead forecasting executions suffered from high implementation times and computational load, and the overall observed metrics variations did not justify the selection of deeper nor more complex models. Table S1 in the Supplementary Material shows the overall performance of the AE-LSTM model implemented with varying hyperparameters. The implementation times under the AE-LSTM model architecture are extremely high; however, the vast majority of the recorded metrics produced similar results regardless of the input learning sequence size and suffered less variations among the metrics when compared to the reported performance of the other models. Table 4 shows the overall performance of the vanilla, stacked, bidirectional, autoencoder, and convolutional models implemented given the configurations that yielded the lowest error, also including reported metrics for MARS and M5Tree models, as statistical baseline models. In summary, it can be observed in those results that for multi-steps ahead (2-3) the bidirectional LSTM shows the best performance, providing the best trade-off between computation time and lowest recorded MAE, MAPE, and RMSE for the different number of steps ahead considered in this study. These results can be explained considering that bidirectional LSTM-based models can exploit the temporal correlations in the time series data in both forward and backward directions. In addition, taking into account the presented study case, where the time series have a characteristic seasonality, the bidirectional approach provides better feature representation as compared to vanilla and stacked LSTM structures. Regarding to the R 2 results, the vanilla, stacked, bidirectional, and autoencoder LSTM models have similar values, with slightly better performance for the vanilla LSTM. Figure 10 shows the scatter diagrams of the obtained results for 1-step, 2-steps, and 3-steps ahead predictions. According to the results, it can be seen that the vanilla, stacked, bidirectional, and autoencoder LSTM models present the best performance, while the convolutional LSTM has comparable results to MARS and M5Tree. As expected, the best results were obtained in T + 1 prediction, while T + 2 and T + 3 predictions followed the T + 1, respectively. In general, the results do not reveal significant differences between autoencoders, bidirectional, and LSTM models, according to the determination coefficient. The scatter plots clearly revealed that these last models' outputs are more aligned to the 1:1 perfect line for T + 1 prediction, while for T + 2 and T + 3, the determination coefficient is smaller.

Conclusions
The suite of LSTM RNN-based models presented in this study, along with the diverse patterns in which each architecture was implemented, allowed for a vast comparative analysis for one-step to three-steps ahead wind power forecasting. LSTM models configured in standard, bidirectional, stacked, encoder/decoder, and also coupled with a CNN model, were executed with varying input learning sequences defined as weekly (168 time steps), monthly (720 time steps), and quarterly (2160 time steps). In addition, the diverse model architectures were implemented through an iterative process, where different hyperparameters were varied, such that the models learned in different patterns. This implementation process was executed in order to tackle one-step and three-steps ahead wind power predictions.
For one-step ahead wind power forecasting, the results showed that the majority of the architectures produced better results when implemented with an input learning sequence of 168 steps (weekly time steps). In particular, while the standard and bidirectional models yielded MAPEs between 4.30-12.21 and 4.44-8.91%, the stacked and convolutional architectures produced errors between 4.37-14.13 and 5.45-10.15%, respectively. The three-steps ahead forecasting executions suffered from high implementation times and computational load, and the overall observed metrics variations did not justify the selection of deeper nor more complex models.
The model with the best performance in forecasting one-hour ahead wind power is the stacked LSTM, implemented with weekly learning input sequences, with an average MAPE improvement of roughly 6, 7, and 49%, when compared to standard, bidirectional, and convolutional LSTM models, respectively. In the case of two-or three-hours ahead forecasting, the model with the best overall performance is the bidirectional LSTM implemented with weekly learning input sequences, showcasing improved performance from 2 to 23% when compared to the other LSTM architectures implemented.
While the study completed within this work fills in some of the research gaps concerning DL comparative analysis and hyperparameter optimization, we have laid the blueprint for future research focused on LSTM-RNN architectures for short-term forecasting in the field of wind energy applications. In order to complement the presented comparative analysis, other study cases could be analyzed to determine the performance of these techniques under different wind power and electricity market dynamics, as well as different weight of the wind resources in the energy mix in a power system. Given the increasing implementation times and computational load of the more complex models, future works may consider the incorporation of parallel or high performance computing (HPC) in order to efficiently solve the forecasting tasks at hand, consider more hyperparameters to be included in the iterative process, and assess even deeper models.
As a final conclusion, it is well known that the high variability of wind power makes it difficult to predict their energy output. In fact, the unpredictability of the wind energy could reduce system reliability and increase power curtailments in power systems. Consequently, additional support mechanisms and policies are needed to deal with wind power uncertainty. In this way, accurate wind power forecasting based on DL strategies could boost the security of electricity supply and help guide the policies and decision making in power systems. Precise forecasting helps to design policies estimating suitable reinforcements and reserves needed in the power system according to the wind energy shortages. Furthermore, it could help build realistic scenarios to develop support mechanisms to reduce the firms' risk exposure in the short-term electricity markets.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: