Multi-Step Short-Term Wind Speed Prediction using a Residual Dilated Causal Convolutional Network with Nonlinear Attention



Introduction
With increasing concern about global warming and the pollution caused by overuse of fossil fuels, leading world organizations and countries are encouraging renewable energy sources such as wind and solar power. According to the U.S. Energy Information Administration, renewable energy contributes over 18% of total energy production, with solar and wind energy contributing almost 10% [1]. Wind energy is mostly generated at onshore or offshore clusters of wind turbines known as wind farms. In order to manage future electrical power production efficiently, these wind farms require prior knowledge of future wind conditions. The wind speed prediction problem can be classified into several temporal ranges. According to [2], there are four temporal ranges for forecasting: (1) long-term, with a range of a week to a year or more ahead; (2) medium-term, for two days to a week ahead; (3) short-term, for one hour to two days ahead; and (4) very short-term, from a few seconds to one hour. Accurate short-term prediction is used for economic load dispatch planning, such as increasing and decreasing the wind power supply, and for preplanning distribution from the wind farms. Short-term wind speed prediction is also vital for advanced scheduling of the cleaning, maintenance, and safety checks of wind turbines during low-wind conditions.
Models in this research area are typically trained and tested on a data split of training, validation, and testing sets. In general, the LSTM model is very popular for wind speed prediction. The work in [18] introduced a modification to the LSTM structure to better predict weather contextual data and decrease the naive character of the generic LSTM. By introducing an adaptive compression algorithm, cell regulation, and rectified linear units (ReLU) to the generic LSTM, the improved model was able to increase the prediction accuracy for wind power. The model was validated using an 80% training split and a 20% testing split. Discrete wavelet transformation (DWT) was used as a data preprocessing method in [19], where several LSTMs were trained on the z-score normalized values of different decomposition levels of the DWT of the original wind signal, and the z-score denormalized predictions of each decomposition level were added to obtain the final prediction. The DWT-based approach was also compared against generic LSTM, recurrent neural network (RNN), and back-propagation (BP) models with a training, validation, and testing split of 70%, 20%, and 10%, respectively. The models were optimized using Adam, a first-order optimization algorithm. The results confirmed that the LSTM model trained with DWT decompositions performed better than the other models on short-term wind speed prediction. The deep learning method proposed in [20] used wavelet packet decomposition, CNN, and CNN-LSTM models for wind speed prediction. A hybrid model was reported in [21] that incorporated double decomposition and error correction with the LSTM model to improve one-step-ahead prediction of short-term wind speed. In [22], the authors used different configurations of artificial neural networks (ANN) to predict hourly wind speed time series. In a different approach, an ELM with a kernel mean p-power error loss [23] was used to improve wind speed prediction accuracy: the authors argued that an ELM based on second-order statistics was not suitable for non-linear and non-Gaussian datasets, so a new loss function was introduced, and PCA was applied to remove some of the redundant data components.
In a more recent paper [24], a direct multi-step-ahead prediction model was developed by combining LSTM with ensemble empirical mode decomposition (EEMD) and fuzzy entropy, and it was compared against the SVM, BP, and generic LSTM models. Another ensemble forecasting method based on a multi-objective optimization algorithm was proposed in [25] for wind speed forecasting; it combined the back-propagation neural network (BPNN), RBF, general regression neural networks (GRNN), and the wavelet neural network (WNN), with singular spectrum analysis for data preprocessing.
Since most of the recent research on wind-related prediction focuses on LSTM-based architectures or on improving traditional models, another type of machine learning architecture, the convolutional neural network (CNN), is also an alternative for time series forecasting. Unlike LSTM or conventional models, CNN requires little to no data preprocessing. The work in [26] introduced a 1D CNN-based forecasting model for direct multi-step-ahead prediction of short-term wind speed data. The model was trained on 75% of the dataset, tested on the remaining 25%, and outperformed the SVM, radial function (RF), decision tree (DT), and MLP models on 11 datasets. Apart from wind speed prediction, CNN architectures have traditionally been used in time series classification. In [27], it was shown that a CNN model could automatically extract a suitable internal structure to produce deep features and outperform state-of-the-art methods on real-world datasets. CNN architectures, unlike other feature-based models, do not require sophisticated feature engineering [28][29][30]. These papers provided supportive results and reviews showing that 1D CNN can be used as an automatic feature extraction mechanism for one-dimensional data. Several studies have presented innovations in CNN to improve its suitability for time series data. Some hybrid models combining the CNN architecture with the gated recurrent unit (GRU) have also been used for wind speed prediction. In [31], a CNN-GRU model was introduced for short-term wind prediction that used CNN as a feature extraction method for multivariate weather data and then fed the features to the GRU model to make predictions. A model called the dilated causal convolutional neural network (DCCNN), also known as the deep temporal convolutional network (DTCN), has been used in multiple time series applications, including multivariate prediction [32], real-time water level prediction [33], and seismic event prediction [34]. Moving a step further, WaveNet, a network comprised of residual DCCNN units, was proposed in [35] for raw audio generation. Another CNN architecture, widely used in medical image segmentation, is called U-net. U-net is named after its U-shaped structure, in which low-level features are fused with high-level features to learn cross-context information. The works in [36,37] give an insight into the success of U-net in image segmentation. Attention modules are sometimes used to improve the performance of U-nets; the works in [38][39][40] propose attention modules for U-net architectures in medical image applications.
The application of attention modules in time series models is also an active field of research. The works in [41,42] proposed the application of attention in the LSTM architecture for time series prediction. Attention models have also been used with CNN architectures and time series data: the work in [43] proposed an attention-gated CNN for sentence classification, and the work in [44] introduced a temporal causal discovery framework (TCDF) for learning causal relationships in time series data. However, many of the innovative CNN architectures mentioned above have not yet been explored for renewable energy applications, and existing studies are often focused on learning linear causal relationships from training data.
We exploited the time series feature extraction power of the DCCNN and the computational efficiency and low data requirement of U-net to discover the complex underlying relationships in wind speed time series. We used a nonlinear attention module to merge the low-level features of a residual DCCNN U-net (ResUnet) with its high-level features. We used a sliding window method to train our models, using eight hours of wind speed data with a sampling interval of 10 minutes to predict the next eight hours of wind speed. We also compared our model against the naive prediction method and the LSTM, GRU, residual DCCNN (SeriesNet), and residual DCCNN U-net (ResUnet) architectures. In summary, our contributions are as follows.


We present "ResAUnet", a residual dilated convolutional network based on U-net that uses nonlinear attention to merge the causal features of low-level residual blocks with high-level residual blocks to learn the nonlinear causal relationship in wind speed data for short-term wind prediction.  We evaluate the performance of our proposed model and several other time series prediction models on six real-world wind speed datasets with different probability distributions using multiple time series error metrics. We organize the remainder of this paper as follows. Section 2 presents the details of LSTM, GRU, residual dilated CNN, ResUnet, and the proposed model (ResAUnet). It also provides detailed information of the hyperparameters used for each model. In Section 2.6, we describe the wind speed dataset used for the evaluation of the model. Section 3 presents the model training method used in the presented research and the performance metrics used for the accuracy comparison of different models. We detail the results of this research at the end of Section 3. Conclusions of this study are given in Section 4.

Prediction models and method
In this section, we describe the different wind speed prediction models used for benchmarking, as well as the proposed model. We start with a brief introduction of LSTM, GRU, DCCNN, and attention. Then, we move on to the details of the proposed model.

Long Short-Term Memory and Gated Recurrent Unit
Long short-term memory (LSTM) networks are a variation of RNN modules introduced to overcome the long-term dependency problem of RNN. LSTM, introduced by [45], is capable of learning long-term dependencies and is used by state-of-the-art time series prediction models. The concept behind the LSTM model is a memory cell that maintains its state over time, together with non-linear gating units that regulate the information flowing in and out of the memory cell. In order to utilize all the information present in the time series data, the dataset is normally arranged chronologically before being fed into the LSTM. Figure 1 shows the structure of an LSTM network, where every cell's output, also known as the control state, h, is connected to the next cell's input, x. The cell state C is also shared with the next LSTM cell in the chain. An LSTM cell contains an input gate, i, which decides whether the information originating from the new observation at the current timestep t and the output of the previous timestep t-1 will be stored in the cell state. A forget gate, f, then selectively forgets some of the past trends and other time series factors. Finally, the output gate, o, determines the output and the cell state for the current timestep. All the gates have their own weight matrix, W, and bias vector, b. The training process of LSTM can be written as the following equations.
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (1)
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (2)
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ (3)
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (4)
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (5)
$h_t = o_t \odot \tanh(C_t)$ (6)

In the above equations, the $\odot$ operator represents the Hadamard product [46], also known as element-wise multiplication of two matrices. In Equations 1, 2, and 5, $\sigma$ represents a sigmoid activation. Similarly, in Equations 3 and 6, $\tanh$ represents a hyperbolic tangent activation. The equations for the sigmoid and hyperbolic tangent activations are as follows.
$\sigma(x) = \frac{1}{1 + e^{-x}}$ (7)
$\tanh(x) = 1 - \frac{2}{1 + e^{2x}}$ (8)

The length of the LSTM chain can be determined by the length of the output/prediction steps. A fully-connected (FC) layer follows the LSTM chain to output the temporal features of the prediction steps. The equation for each neuron in an FC layer with linear activation can be written as follows.

$y_k = \sum_{i=1}^{m} w_{ki}\, x_i$ (9)
where $y_k$ is the output of the k-th neuron, $x_i$ is the i-th input, $w_{ki}$ is the weight for the i-th input, and m is the total number of input variables. The FC layer at the end of the LSTM chain finds the complex relationship between the LSTM outputs and the training variables to further improve the accuracy of the architecture. The models used for comparison in this paper use the backpropagation algorithm for training. The backpropagation algorithm, also known as the backpropagation of errors [47], is a chain-rule method: using an optimizer (often a gradient-based one), after each forward pass through the network, the algorithm applies a backward pass to adjust the model's parameters, such as weights and biases. For the LSTM architecture, we used the first-order gradient-based stochastic optimization algorithm Adam [48]. The name Adam is derived from adaptive moment estimation, because the algorithm uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network. For more details on how Adam optimization works, one may refer to the referenced paper.
Gated recurrent units (GRU) were first proposed by [49] as a simplified and efficient modification of traditional RNN and LSTM cells. GRU models have been successfully applied to sequence modelling problems in the past [50,51], which makes them good candidates as benchmark models for our paper. Unlike LSTM, GRU has only two gates, the update gate and the reset gate; since there is no output gate, GRU has no control over the memory content of the unit. Due to its simpler gating units and lighter modulation of the information flow inside the unit, GRU has fewer trainable parameters than LSTM. Furthermore, the lack of a forget gate gives GRU the ability to keep information from distant past steps, without discarding it through time or forgetting information that is irrelevant to the prediction. A simple GRU block is presented in Figure 2 below. An update gate, $z_t$, for timestep t helps the model determine the amount of previous information that needs to be passed to the next unit. The advantage of the update gate is that it copies all the past information and helps eliminate the vanishing gradient problem. The reset gate, $r_t$, is a replacement for the forget gate of LSTM and determines the amount of information to forget/reset. The equations of GRU can be written as follows.

$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$ (10)
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$ (11)
$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$ (12)
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ (13)
In Figure 3, the LSTM and GRU models used as benchmarks for our proposed model are illustrated. The LSTM model is comprised of two hidden layers, each an LSTM chain of 50 cells (the number of cells corresponds to the number of input timesteps; please refer to Section 5 for more information). Finally, an FC layer consisting of 50 neurons with linear activation was used as the output layer. Similarly, we used a GRU model with two hidden layers of 50 GRU units each, followed by an FC layer of 50 neurons with linear activation. The number of hidden layers for the LSTM model was determined using the Tabu search method; please refer to [52] for more details about architecture search using this method. The complete parameter information of the LSTM and GRU models is listed in Table 1 and Table 2, respectively.
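As an illustration only, the benchmark configuration described above (two recurrent layers of 50 units followed by a 50-neuron linear FC layer, trained with Adam on MAE) could be assembled in Keras roughly as follows; this is a minimal sketch, not the authors' code, and the exact return-sequence wiring is an assumption.

```python
# Sketch of the LSTM/GRU benchmark models described in Section 2.1 (an assumption
# about the wiring, not the authors' implementation).
import tensorflow as tf
from tensorflow.keras import layers, models

N_STEPS = 50  # input/output window length used in the paper

def build_recurrent_benchmark(cell="lstm"):
    rnn = layers.LSTM if cell == "lstm" else layers.GRU
    inputs = layers.Input(shape=(N_STEPS, 1))                # univariate wind speed window
    x = rnn(50, return_sequences=True)(inputs)               # first hidden recurrent layer
    x = rnn(50)(x)                                           # second hidden recurrent layer
    outputs = layers.Dense(N_STEPS, activation="linear")(x)  # FC output layer, direct multi-step forecast
    model = models.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0075), loss="mae")
    return model

lstm_model = build_recurrent_benchmark("lstm")
gru_model = build_recurrent_benchmark("gru")
```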

Convolutional Neural Networks
Convolutional neural networks (CNN) are known for their outstanding performance on image (2D data) related machine learning tasks, such as image segmentation, object recognition, and super resolution. A modified version of CNN, known as 1D CNN [30,53], has been developed for time series and other 1D data (sound, vibration, sentences, etc.). CNN layers, as shown in Figure 4(a), are comprised of filters that capture the local correlation of nearby data points instead of using a conventional full connection. The filters are made up of convolutional kernels that share the same weights. Each 1D CNN layer can be mathematically represented as follows:

$y_j^{k} = f\left(\sum_{i} w_{ij}^{k} * x_i^{k-1} + b_j^{k}\right)$ (14)

where $*$ is the convolution operation, y is the output, x is the input, $w^{k}$ is the weight matrix, and $b^{k}$ is the bias matrix of the k-th layer. In Equation (14), i and j index the filters and neurons, respectively, and f is the activation function. The rectified linear unit (ReLU) is one of the most widely used activation functions for CNN layers. A modified version of ReLU known as the scaled exponential linear unit (SELU) was proposed by [54], which induces self-normalization properties on CNN layers. Furthermore, SELU avoids the vanishing gradient problem while normalizing the network and speeding up the training process. The equations for ReLU and SELU are as follows:

$\mathrm{ReLU}(x) = \max(0, x)$ (15)
$\mathrm{SELU}(x) = \lambda \begin{cases} x & x > 0 \\ \alpha\,(e^{x} - 1) & x \le 0 \end{cases}$ (16)

The authors of [54] calculated the values of $\lambda$ and $\alpha$ as 1.05070098 and 1.67326324, respectively.
Causal convolution [55] is a restrictive variant of CNN in which the timestep order must not be violated. In a causal CNN, the output at any timestep t only uses information from previous timesteps. Figure 4(b) illustrates the flow of information in a causal CNN structure. Because of this time dependency characteristic, causal CNNs are used for time series classification and prediction tasks. Mathematically, a layer of causal CNN can be represented as follows:

$y_j^{k}(t) = f\left(\sum_{i} \sum_{\tau \ge 0} w_{ij}^{k}(\tau)\, x_i^{k-1}(t - \tau) + b_j^{k}\right)$ (17)

where y is the output, $x^{k-1}$ is the input to the layer, $w^{k}$ is the weight matrix, and $b^{k}$ is the bias matrix of the k-th layer. In Equation (17), i and j index the filters and neurons, respectively, and the convolution only reaches backwards in time. Unlike standard convolution, causal convolution guarantees that only currently available data are used and that future values are excluded during the training process. Thus, the final prediction of the causal architecture, $p(x_{t+1} \mid x_1, x_2, \ldots, x_t)$, does not depend on the future timesteps $x_{t+1}, x_{t+2}, \ldots, x_T$. Causal convolution models make predictions in a sequential manner, and the predictions of a layer are fed into the next layer to predict the next sequence. Unlike RNN, causal convolution models do not have recurrent connections, and all the conditional predictions can be made in parallel, hence they are faster to train than RNNs.
A dilated causal convolution, also known as à trous convolution, is a variant of CNN where the filter is applied over an area larger than its kernel length by skipping inputs at certain steps. This technique was introduced by [56] to expand the size of a CNN's receptive field on images without modifying the data structure. The same technique can be applied to time series data [57,58] to effectively expand the covered input length without enlarging the neural network structure, as shown in Figure 4(c). Due to their sparse connections and weight sharing mechanism, dilated convolution networks can automatically learn translationally-invariant features from longer input time series while having fewer trainable parameters than a conventional CNN.
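The causal and dilated behaviour described above maps directly onto Keras' Conv1D layer; the following short fragment is an illustrative sketch (the filter count is an assumption) showing how `padding="causal"` and `dilation_rate` enforce the two properties.

```python
# Sketch: causal vs. dilated-causal 1D convolutions on a univariate series.
# With padding="causal", the output at timestep t only sees inputs <= t;
# dilation_rate skips inputs to enlarge the receptive field without adding parameters.
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 50, 1))             # (batch, timesteps, channels)

causal = layers.Conv1D(filters=32, kernel_size=3, padding="causal")(x)
dilated = layers.Conv1D(filters=32, kernel_size=3, padding="causal",
                        dilation_rate=4)(x)

print(causal.shape, dilated.shape)            # both keep the 50-step length: (1, 50, 32)
```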

Residual Dilated Convolutional Neural Network
Residual convolutional neural networks were first proposed by [59] to tackle the problem of training deep learning models. Intuitively, by adding more layers, a CNN architecture should learn more complex functions and improve the prediction accuracy. However, the authors of [59] noted that deeper CNN models are not necessarily better at learning features: during the training of deeper models, a degradation problem was uncovered. The paper reported that this degradation was not due to overfitting, because adding more layers to the network counterintuitively increased the training error. With increasing depth of the CNN model, the accuracy of the model becomes saturated and then starts degrading, which points to an underlying problem in how gradient descent-based optimization algorithms work. The authors proposed a shortcut connection that skips a few convolution layers and copies the identity of the input to the input of the following layer. The shortcut identity connections were designed so that all the information from the previous layer is always passed through to the next connected layer. In the residual learning method, for the desired mapping function $H(x)$ of a few stacked convolution layers, the shortcut connections force the network to learn the residual function $F(x) = H(x) - x$ instead. Thus, the underlying mapping function for stacked layers with an identity shortcut, also known as the residual block, becomes $H(x) = F(x) + x$. Therefore, for a residual block, the mathematical formulation can be written as follows:

$y = F(x, \{W\}) + x$ (18)

where x and y are the input and output of the residual block. According to the above function, the dimensions of x and $F(x, \{W\})$ must be the same; otherwise, a linear projection can be applied to match the dimensions. This modification of the traditional CNN architecture for image data has also been applied to time series data by replacing the 2D convolution layers with 1D convolution layers. WaveNet was proposed by [35] to generate raw audio waveforms; the WaveNet architecture consists of conditional residual blocks and skipped connections, and dilated convolution layers are used instead of standard convolution layers to expand the receptive field of the network. Several variations of residual blocks have been proposed for different time series tasks. The work in [34] proposed a CNN-LSTM hybrid network with dilated convolution residual blocks (Figure 5(a)); the residual block proposed for the CNN-LSTM model is comprised of a dilated convolution layer followed by a ReLU activation and a dropout layer for better generalization. Another variation with skipped connections and the self-normalizing SELU activation function was proposed by [60]. In this residual block (Figure 5(b)), ReLU activation was replaced by SELU activation to remove bias from the network and improve generalization thanks to SELU's self-normalization properties [54]. As noted by the author, unlike other time series prediction models, this architecture does not require data preprocessing to improve performance.
We used the generative time series forecasting model SeriesNet, proposed by [60], as one of our benchmark models; it is a variant of the residual DCCNN. We also tested the CNN-LSTM hybrid model during our experiments, but it was not able to converge using our sliding window training method (refer to Section 4) and was therefore excluded from our benchmark list. The complete architecture of SeriesNet is shown in Figure 6, and the hyperparameters of the residual blocks and of the complete architecture are listed in Table 3 and Table 4, respectively. As illustrated in Figure 6, the final predictions of SeriesNet were made using the parameterized skipped connections of the residual blocks. As noted by the author, the model uses skipped connections instead of only the final output of the residual blocks to ensure that the latent representation of the dilated convolution layers is not overly influenced by past trends that may not be relevant to the present prediction. L2 regularization was applied to every convolution layer to further improve generalization, and a truncated normal distribution was used for the kernel initialization of each layer for faster convergence. In this architecture, the dilated convolution layers have a fixed number of filters (f: 32) with a constant kernel size of three (s: 3), but an increasing number of dilations (d) in the order of 1, 2, 4, 8, 16, 32, and 64 for the 7 residual blocks. Furthermore, an 80% dropout rate was added to the skipped connections of the last two residual blocks. As listed in Table 3, the residual block had two output layers, the residual connection a and the skipped connection b. In Table 4, the connection column corresponds to the layer id, where a and b are the residual and skipped connections of the residual blocks. Layer id 7 had only 96 trainable parameters, as the residual connection of the seventh residual block was not used during training. The element-wise sum of the skipped connections of each residual block was followed by a ReLU activation. The output layer was a 1 × 1 convolution layer with linear activation. Thus, the lengths of the input and output data were the same, and forecasts were made in a direct manner.
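A minimal sketch of a SeriesNet-style residual block and the seven-block stack described above is given below. The filter count (32), kernel size (3), dilation sequence, and dropout on the last two skipped connections follow the description in the text; the 1 × 1 projection layers and initializer settings are assumptions based on the published SeriesNet design, not a reproduction of the authors' code.

```python
# Sketch of a SeriesNet-style residual block with two outputs: the residual
# connection "a" and the parameterized skipped connection "b".
from tensorflow.keras import layers, models, regularizers, initializers

def residual_block(x, dilation, filters=32, kernel_size=3, l2_reg=0.001):
    conv = layers.Conv1D(filters, kernel_size, padding="causal", dilation_rate=dilation,
                         kernel_initializer=initializers.TruncatedNormal(stddev=0.05),
                         kernel_regularizer=regularizers.l2(l2_reg))(x)
    conv = layers.Activation("selu")(conv)              # self-normalizing activation
    skip = layers.Conv1D(1, 1, padding="same")(conv)    # skipped connection "b"
    res = layers.Conv1D(1, 1, padding="same")(conv)     # project back to one channel
    res = layers.Add()([x, res])                        # residual connection "a": y = F(x) + x
    return res, skip

def build_seriesnet(n_steps=50):
    inputs = layers.Input(shape=(n_steps, 1))
    x, skips = inputs, []
    for d in [1, 2, 4, 8, 16, 32, 64]:                  # seven blocks with doubling dilation
        x, s = residual_block(x, dilation=d)
        if d >= 32:                                     # 80% dropout on the last two skipped connections
            s = layers.Dropout(0.8)(s)
        skips.append(s)
    out = layers.Activation("relu")(layers.Add()(skips))
    out = layers.Conv1D(1, 1, activation="linear")(out) # 1x1 output layer, same length as the input
    return models.Model(inputs, out)
```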

Residual U-Net Architecture
The U-net architecture was proposed by [61] for biomedical image segmentation. The key feature of this architecture is the ability to use low-level features while retaining high-level semantic information. The idea behind U-net is similar to that of residual networks: in U-net, the low-level features are copied to the corresponding high-level features to create a path for information to propagate between low- and high-level convolution layers. This method allows backpropagation of errors between layers and makes training deep learning models easier. Using ideas from U-net and residual networks, some researchers have proposed hybrid models that replace the plain convolution layers of U-net with residual blocks. The work in [62] proposed a deep residual U-net architecture (ResUnet) for extracting roads from aerial images. In the ResUnet architecture, residual blocks are used to ease training, and the long skipped connections facilitate the flow of information between low and high levels without degradation. ResUnet architectures are also robust to the noise present in training data, as shown by [36], where ResUnet outperformed the residual networks. In [37], the authors presented experiments with the ResUnet architecture and residual block structure to improve semantic segmentation of satellite images. In time series applications, U-net architectures have been applied to model traffic information [63]. A spatio-temporal U-net (ST-UNet) was proposed for modeling graph-structured time series; by using dilated recurrent skip connections, the ST-UNet architecture can extract multi-resolution temporal dependencies.
In this paper, we investigated a ResUnet architecture consisting of the residual blocks from SeriesNet. As shown in Figure 7, we used the bottom 6 residual blocks as in SeriesNet and then used the 7th block (the one with the highest dilation rate) as a bridge to transfer information to the upper 6 residual blocks. This architecture resembles an encoder-decoder network [64]: the number of dilations increases until the highest dilation of 64 steps is reached at the bridge, and then a decreasing number of dilations is applied to the residual blocks. The number of residual blocks before and after the bridge was kept the same. Long skipped connections were applied from the bottom residual blocks to the corresponding top residual blocks. The output of each bottom block was copied and concatenated with the input of the top block with the same number of dilations; thus, the temporal information due to dilation could be recovered or enhanced by the top-level residual blocks. Similar to SeriesNet, we used the sum of the skipped connections of each residual block followed by a ReLU activation to make the final prediction. Dropout was applied at the skipped connection of the sixth residual block to decrease the influence of past trends. By using a 1 × 1 convolution layer with linear activation as the output layer, the number of output steps was equal to the number of input steps, and the predictions were made in a direct manner. After some tests, we found a dropout rate of 80% to be suitable for the ResUnet architecture. The hyperparameters of ResUnet are listed in Table 5. For the residual blocks' hyperparameters, refer to Table 3.
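The U-shaped wiring described above could be sketched as follows, reusing the `residual_block()` from the previous listing. This is only an illustration under stated assumptions: in particular, how the paper handles the widened channel dimension after each long skipped connection is not specified, so the sketch simply projects the concatenation back to one channel before the next residual block.

```python
# Sketch of the ResUnet wiring of Figure 7: six encoder blocks, a bridge block
# with dilation 64, and six decoder blocks fed by long skipped connections.
from tensorflow.keras import layers, models

def build_resunet(n_steps=50):
    inputs = layers.Input(shape=(n_steps, 1))
    x, skips, encoder_outs = inputs, [], []

    for d in [1, 2, 4, 8, 16, 32]:                       # bottom (encoder) residual blocks
        x, s = residual_block(x, dilation=d)             # residual_block from the previous listing
        if d == 32:                                      # dropout at the sixth block's skipped connection
            s = layers.Dropout(0.8)(s)
        skips.append(s)
        encoder_outs.append(x)

    x, s = residual_block(x, dilation=64)                # bridge block with the highest dilation
    skips.append(s)

    for d in [32, 16, 8, 4, 2, 1]:                       # top (decoder) residual blocks
        low = encoder_outs.pop()                         # encoder output with the same dilation
        x = layers.Concatenate(axis=-1)([low, x])        # long skipped connection (two channels)
        # Assumption: project the merged features back to one channel so the
        # SeriesNet-style residual add remains valid.
        x = layers.Conv1D(1, 1)(x)
        x, s = residual_block(x, dilation=d)
        skips.append(s)

    out = layers.Activation("relu")(layers.Add()(skips))
    out = layers.Conv1D(1, 1, activation="linear")(out)
    return models.Model(inputs, out)
```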

The Proposed Model
The ResUnet structure allows information to flow from the lower-level residual blocks to the higher-level ones. However, this alone is not enough to obtain a substantial gain on time series data. Since the residual blocks proposed in SeriesNet work best for univariate time series data, we noticed a major drawback in the ResUnet architecture: when merging the low-level features with the high-level ones, the dimension of the data is modified, and the consistent univariate data flow of SeriesNet becomes multivariate because of the concatenation of two one-dimensional outputs. We experimented with averaging, element-wise multiplication, and addition to overcome this drawback, but the results were poorer than those of the ResUnet discussed in Section 2.4. We therefore focused our efforts on attention blocks for time series data.
The attention mechanism is typically used in RNN architectures to improve model performance. The works in [65][66][67] are a few examples where attention blocks were proposed and used with LSTM architectures for time series forecasting. Attention blocks have also been used with CNN architectures [40,43,68,69] for image classification and time series data. One of the noteworthy contributions of the attention mechanism to time series forecasting can be found in [70]. We investigated this architecture for our application, but the results were not better than SeriesNet, so we excluded it from our benchmark list.
The proposed non-linear attention mechanism (Figure 8) consisted of two FC layers of equal length (50 neurons) with a dropout applied between them. The attention mechanism received a concatenated input X consisting of two column vectors; a row-wise average was then calculated before feeding the data to the first fully-connected layer. The row-wise average was used to compute the average mapping of the residual blocks for each time series element without violating the timestep order, i.e., preserving the temporal information. A dropout was applied to the output of the first FC layer before it was fed into the next FC layer to prevent saturation of the FC layers during training. Finally, the output of the second FC layer was reshaped to comply with the CNN layers' input data format. We applied a sigmoid activation to the FC layers to ensure non-linearity in the FC layer outputs.
For the given input matrix with i rows and j columns, we can represent our non-linear attention block as follows:

$\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} X_{ij}$ (19)
$h_1 = \sigma(W_1 \bar{x} + b_1)$ (20)
$h_2 = \sigma(W_2\,(h_1 \odot M) + b_2)$ (21)
$A = \mathrm{reshape}(h_2,\ n \times 1)$ (22)

where $\bar{x}$ is the row-wise mean of the input matrix X and n is the number of elements in a row of the matrix. $h_1$ and $h_2$ are the outputs of the first and second FC layers. In Equation 21, $(h_1 \odot M)$ represents the dropout applied to the input of the second FC layer, where $\odot$ is the element-wise (Hadamard) product, $m_{ij}$ are the elements of the mask matrix M, and each $m_{ij}$ follows a Bernoulli distribution, $\mathrm{Bernoulli}(1-p)$, with dropout probability p. A is the output of the attention block, where the output vector of the second FC layer is reshaped from 1 × n to n × 1 to conform with the standard input format of the 1D convolution layer. The dropout rate applied to the output of the first FC layer in this mechanism was kept at 50%. We used the proposed nonlinear attention blocks in the ResUnet discussed in Section 2.4 with minor modifications, as shown in Figure 9. We applied the attention blocks to the inputs of the higher-level residual blocks, transforming the concatenated vectors into a single output vector by applying the row-wise average and finding the nonlinear relationship between the timestep elements using FC layers with sigmoid activation. The residual attention U-net (ResAUnet) architecture's skipped connections were similar to those of ResUnet, with the exception that an attention block was also used to bypass the bridge connection of the residual block with the highest dilation value. In our proposed architecture, the feature mapping of a lower-level residual block was combined with the feature mapping of the higher-level one and then fed into the residual block with the same dilation value as the corresponding lower-level block. In order to limit the over-influence of past trends on the training of the higher-level residual blocks, we applied a dropout of 80% before the bridge connection and also at the skipped output of the 7th residual block. Combined with nonlinear attention, this mechanism of information flow allowed the architecture to learn complex relationships between the causal outputs of different dilation levels. Similar to SeriesNet, the number of output timesteps was equal to the number of input timesteps (i.e., 50). The hyperparameters of the proposed attention block and of ResAUnet are listed in Tables 6 and 7, respectively.
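A minimal sketch of the attention block, following the description above (row-wise mean, two 50-neuron sigmoid FC layers with 50% dropout in between, and a reshape back to the Conv1D format), is shown below; it is an illustration of the mechanism, not the authors' implementation.

```python
# Sketch of the proposed nonlinear attention block (Figure 8).
import tensorflow as tf
from tensorflow.keras import layers

def nonlinear_attention(merged, n_steps=50):
    # merged: (batch, n_steps, 2) -- low-level skip concatenated with high-level features
    avg = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1))(merged)  # row-wise mean -> (batch, n_steps)
    h1 = layers.Dense(n_steps, activation="sigmoid")(avg)              # first FC layer (Eq. 20)
    h1 = layers.Dropout(0.5)(h1)                                       # 50% dropout between the FC layers
    h2 = layers.Dense(n_steps, activation="sigmoid")(h1)               # second FC layer (Eq. 21)
    return layers.Reshape((n_steps, 1))(h2)                            # back to (batch, n_steps, 1) (Eq. 22)
```

In the ResAUnet wiring, a block like this would replace the plain concatenation-plus-projection step of the ResUnet sketch above, e.g. `x = nonlinear_attention(layers.Concatenate(axis=-1)([low, x]))`, restoring a univariate feature stream before each higher-level residual block.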

Wind Speed Data
For the comparison of the models presented in Section 2, we used an open access dataset published by [71] under a Creative Commons license at https://www.sciencedirect.com/science/article/pii/S2352340919306456. The published dataset provides wind speed, wind direction, and wind power data for 12 different sites in the state of Tamil Nadu in India. The measurements were recorded using anemometers and wind vanes at a height of 100 meters at a regular interval of 10 minutes throughout the year. The dataset consists of the measurements for the years 2014, 2015, and 2016 for all 12 locations. To compare the models' performance, we chose the first six locations and 6050 wind speed timesteps from the datasets, which corresponds to the timeline of January to mid-February of 2014. The statistical features and collection site details of the data used in this paper are listed in Table 8 below, followed by the probability density plots calculated using kernel density estimation [72], presented in Figure 10.
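For illustration, a density plot like Figure 10 can be produced with a Gaussian kernel density estimate; the file and column names below are placeholders, as the published dataset's exact layout is not reproduced here.

```python
# Sketch: estimating the wind speed probability density with a Gaussian KDE.
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

speeds = pd.read_csv("site1_2014.csv")["wind_speed"].to_numpy()[:6050]  # first 6050 timesteps (assumed layout)
kde = gaussian_kde(speeds)                                # Gaussian kernel density estimate
grid = np.linspace(speeds.min(), speeds.max(), 200)
density = kde(grid)                                       # density values to plot against grid
```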

Training Method
In general, the performance of an ANN model is measured using the k-fold validation method. This method is effective in determining the model's performance when the data used for training are not time-dependent, e.g., image data, non-time series regression, etc. Another method used for testing time series prediction models is the training and testing split, where the data are split into a training part and a testing part. This method requires a large historical dataset that can be used to train complex models at training time; the model is then used to predict on the testing split to validate the model's accuracy. Many of the studies referred to in Section 1 used this method for wind speed prediction evaluation. However, in real-world applications where historical data for a distant past horizon are not available, the training and testing split method cannot be applied successfully. Furthermore, the presence of stochastic trends in time series datasets makes it difficult to verify the performance of such a model for distant future timesteps. Another method of training a time series prediction model is known as expanding window validation. In the expanding window validation method [73,74], the model is retrained every time new data are available. This method is popular for validating prediction models for financial and economic time series data. We investigated this method, but the ever-increasing size of the training window made it impractical for our study.
In our research, we used the sliding window technique [75][76][77][78], also known as the rolling window, moving window, or walk-forward technique. The sliding window technique is more robust to stochastic changes in the data trend and can be applied to smaller datasets, as the window size is small. As shown in Figure 11, we used a training window of 21 sets of 50 wind speed timesteps to train our model and then predicted the next 50 timesteps using the last set of data in the training window. For every new prediction window, the training window took in a new set of 50 timesteps and discarded the oldest set of timesteps. The intuition behind this technique is that by discarding old data from the training window, we can limit the influence of distant past trends during model training and promote the learning of new trends in the data. This method also reduces the training time of the model, as the number of training sets is always fixed. The training window size of 21 × 50 timesteps was chosen after testing increasing window sizes in multiples of 3, which was also the batch size used for training all models.
In order to facilitate a fair comparison of all benchmark models, we also used the results of a simple naive prediction model. The naive prediction model used in our benchmark uses the current set of timesteps t to predict the next set t+1 as $\hat{y}_{t+1} = y_t$. We used the stochastic optimizer Adam [48] to train and retrain all the models (LSTM, GRU, SeriesNet, ResUnet, and ResAUnet) with a learning rate of 0.0075, an exponential decay rate of the first moment estimate equal to 0.9, an exponential decay rate of the second moment estimate equal to 0.999, and an epsilon value of $10^{-8}$. Models were trained for 50 epochs on each training window with a batch size of 3. The loss function used for training the different models was the mean absolute error (MAE). For each wind speed site, the models' weights were reinitialized to default values in order to prevent the influence of trends from previous wind sites on the next site's prediction. The training and prediction method used for the models in this study is shown as a flowchart in Figure 12.
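The sliding-window procedure described above could be organized roughly as in the sketch below. It is an assumption-laden illustration: `build_model()` stands for any of the benchmarked architectures, and how the 21 sets inside a window are paired into input/target examples is an assumption rather than the authors' exact scheme.

```python
# Sketch of the sliding-window (walk-forward) training and prediction loop.
import numpy as np

SET_LEN, WINDOW_SETS, EPOCHS, BATCH = 50, 21, 50, 3

def naive_forecast(last_set):
    # Naive benchmark: the next set of 50 timesteps is predicted to equal the current set.
    return last_set.copy()

def walk_forward(series, build_model):
    sets = series[: len(series) // SET_LEN * SET_LEN].reshape(-1, SET_LEN)
    model = build_model()                                  # weights reinitialized per wind site
    predictions, naive_preds = [], []
    for start in range(len(sets) - WINDOW_SETS):
        window = sets[start : start + WINDOW_SETS]
        # Pair each set in the window with the set that follows it (an assumption).
        x, y = window[:-1, :, None], window[1:]
        model.fit(x, y, epochs=EPOCHS, batch_size=BATCH, verbose=0)
        last = window[-1]                                  # last set predicts the next 50 steps
        predictions.append(model.predict(last[None, :, None], verbose=0).ravel())
        naive_preds.append(naive_forecast(last))
    return np.array(predictions), np.array(naive_preds)
```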

Performance Metrics for Forecast Accuracy
Usually, the performance of a regression model is judged using the root mean squared error (RMSE), the mean absolute error (MAE), and the mean absolute percentage error (MAPE). However, when it comes to time series prediction models, MAE, RMSE, and MAPE alone are not sufficient to evaluate performance. RMSE depends on the scale of the dependent variable and is sensitive to large deviations. MAPE is a scale-independent measure, but it has the limitation of being infinite or undefined if there is a zero value in the series; furthermore, MAPE can have an extremely skewed distribution when the actual values are very close to zero. Therefore, we also used the "symmetric" MAPE (SMAPE) proposed for the Makridakis competition (M-3 competition) [79] for time series prediction models. We also investigated the mean squared log error (MSLE) for our performance evaluation. Since MSLE measures the ratio between the actual value and the predicted value, large errors are not penalized if both the actual and predicted values are large numbers; furthermore, MSLE penalizes underestimates more than overestimates. Another error metric we used is the normalized root mean squared error (NRMSE). NRMSE is a normalized version of RMSE, which facilitates comparison between datasets and models with different scales. We also used the unscaled mean bounded relative absolute error (UMBRAE) [80] to compare the accuracy of the different models using the naive prediction results as the benchmark. UMBRAE is based on the mean bounded relative absolute error (MBRAE), but the final result is expressed as a ratio rather than a relative error. If UMBRAE is larger than one, the model has worse accuracy than the benchmark, and values smaller than one indicate better accuracy than the benchmark. UMBRAE is less sensitive to outliers and is a symmetric and scale-independent measure of a model's accuracy. The equations for calculating the performance/error metrics used in this paper are as follows.
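(The following restates the standard definitions of these metrics for reference; the normalization of NRMSE by the mean observed value and the exact UMBRAE construction follow common usage and [80], and should be read as assumptions about the paper's precise formulation rather than reproductions of it.)

$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n}\lvert y_t - \hat{y}_t \rvert$

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(y_t - \hat{y}_t)^2}, \qquad \mathrm{NRMSE} = \frac{\mathrm{RMSE}}{\bar{y}}$

$\mathrm{MSLE} = \frac{1}{n}\sum_{t=1}^{n}\big(\ln(1 + y_t) - \ln(1 + \hat{y}_t)\big)^2$

$\mathrm{MAPE} = \frac{100}{n}\sum_{t=1}^{n}\left\lvert \frac{y_t - \hat{y}_t}{y_t} \right\rvert, \qquad \mathrm{SMAPE} = \frac{100}{n}\sum_{t=1}^{n}\frac{\lvert y_t - \hat{y}_t \rvert}{(\lvert y_t \rvert + \lvert \hat{y}_t \rvert)/2}$

$\mathrm{UMBRAE} = \frac{\mathrm{MBRAE}}{1 - \mathrm{MBRAE}}, \qquad \mathrm{MBRAE} = \frac{1}{n}\sum_{t=1}^{n}\frac{\lvert y_t - \hat{y}_t \rvert}{\lvert y_t - \hat{y}_t \rvert + \lvert y_t - y^{*}_t \rvert}$

Here $\bar{y}$ denotes the mean of the observed values.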
where $y_t$ is the actual value, $\hat{y}_t$ is the value predicted by the model, $y^{*}_t$ is the value predicted by the naive prediction method, and n is the total number of predictions.

Comparison
We trained and tested the proposed model and the benchmark models on six wind speed datasets. The models involved in this research were the naive, LSTM, GRU, SeriesNet, ResUnet, and ResAUnet models. No preprocessing was applied to the wind speed data in order to test the models' performance on raw wind speed data. Each model's input series length was 50, and 50-step-ahead prediction was evaluated in this paper.
In Table 9, the prediction errors, including MAE, RMSE, MSLE, MAPE, SMAPE, NRMSE, and UMBRAE, of the six models are presented for the different wind speed measurement sites. The comparison for the six sites is summarized below. From these results, together with the statistical features listed in Table 8, we could conclude that the nonlinear attention used in ResAUnet helped to improve the prediction accuracy for wind speed data.

Prediction and error plots
To further verify the performance of ResAUnet, the prediction plots of each model with the corresponding ground truth data for the six wind sites are provided in Figures 13, 14, 15, 16, 17, and 18. Moreover, Figure 19 provides the NRMSE of the six models for each prediction step (sample set) on each wind site. The initial 21 sets of timesteps were excluded from the plots since they did not contribute to the evaluation of the models and were only used for training purposes. The plots consist of 5000 wind speed timesteps. All predictions were made using the out-of-time method, and the models were not trained on the data on which the predictions were made. Through Figures 13 to 18, the prediction trends of the different models are observable. The LSTM and GRU models were able to capture the overall trend of the wind speed data. The SeriesNet and ResUnet models evolved over time on the training sets to also capture some of the local trends, which is most visible in Figures 14, 15, and 16. Overall, ResAUnet was able to capture local nonlinear trends much faster and more accurately than the other models. The evolution of ResAUnet can be clearly seen in Figures 13, 14, 15, and 16, where the model was able to accurately follow the local trend of the wind speed data much faster than the other models. The NRMSE of each prediction step is plotted in Figure 19, where ResAUnet had the lowest NRMSE among all the models used in this study throughout the validation process.

Training time evaluation
The number of parameters and the models' architectures have a great influence on the training complexity. For RNN-based models like LSTM and GRU, the recurrent connections increase the training time, as the data flow is sequential between the different blocks of the same chain. CNN models, however, do not have this sequential data flow, and all the predictions can be made in parallel. Parallel data flow enables CNN models to train faster than their RNN counterparts. We report the average training time of each model used in the study in Table 10 below. The experiments were conducted on a notebook PC with a 7th generation Intel i5 CPU and 8 GB of RAM. The TensorFlow API for the Python programming language was used to create and train the models on the notebook's Nvidia GeForce 920MX dedicated GPU.

Conclusions
Improving the accuracy of short-term wind speed prediction is one of the most important factors for improving energy conversion efficiency. By accurately predicting distant future timesteps (in this case, 50 steps ahead), efficient power management and planning of repair/cleaning schedules can be done in advance. In this paper, a residual U-net architecture of dilated convolution layers with a nonlinear attention block was proposed for 50-step-ahead prediction of wind speed data using the previous 50 steps. The proposed model was validated using the sliding window technique and was evaluated against several other models, including naive prediction. The models were tested on the wind speed data of six different sites. From this study, it can be concluded that:

- CNN architectures consisting of residual blocks made of dilated convolution layers had higher prediction accuracy than RNN-based architectures like LSTM and GRU. SeriesNet showed an overall performance gain of 3% to 17% in MAPE compared to LSTM, which means that residual blocks of dilated convolution layers are effective for learning wind speed trends without preprocessing the data. Furthermore, the CNN architectures required less training time than LSTM and GRU.
- The nonlinear attention blocks applied in ResAUnet provided an advantage over SeriesNet in predicting wind speed for all wind speed datasets used in this study. ResAUnet performed better on each wind site's data: a performance gain of up to 17.3% in MAPE, 14.6% in NRMSE, and 15.4% in UMBRAE was observed over the SeriesNet model. Furthermore, ResAUnet could adapt to the local trends in the wind speed data faster than the other models used in this study.
- LSTM, GRU, SeriesNet, ResUnet, and ResAUnet could all follow the trend in complex and random wind speed data. However, for 50-step-ahead prediction, the dilated convolution-based architectures showed better performance in following the intermediate and local trends. By retraining the model using the sliding window technique, the ResAUnet model adapted to the intermediate trends of the wind speed data faster than the other models.
Overall, the proposed model with nonlinear attention blocks can provide more reliable and accurate 50-step-ahead short-term wind speed prediction for wind farm management systems and improve the efficiency of wind turbine maintenance, cleaning, and operational planning. Even though the proposed model had a clear advantage in 50-step-ahead prediction using raw wind speed data, other environmental factors such as humidity, pressure, wind direction, solar radiation, temperature, and turbulence have a great effect on wind speed. Therefore, our future studies will incorporate these influencing factors into multi-step-ahead short-term wind speed prediction. We will also explore other machine learning architectures in order to further improve the prediction accuracy.