Wind Speed Forecasting Using Attention-Based Causal Convolutional Network and Wind Energy Conversion

As an effective renewable energy source, wind energy has received attention because it is sustainable. Accurate wind speed forecasting can pave the way to the goal of sustainable development. However, current methods ignore the temporal characteristics of wind speed, which leads to inaccurate forecasting results. In this paper, we propose a novel SSA-CCN-ATT model to forecast the wind speed. Specifically, singular spectrum analysis (SSA) is first applied to decompose the original wind speed into several sub-signals. Secondly, we build a new deep learning CCN-ATT model that combines a causal convolutional network (CCN) and an attention mechanism (ATT). The causal convolutional network is used to extract the information in the wind speed time series. After that, the attention mechanism is employed to focus on the important information. Finally, a fully connected neural network layer is employed to get the wind speed forecasting results. Three experiments on four datasets show that the proposed model performs better than the other comparative models. Compared with different comparative models, the maximum improvement percentage of MAPE reaches 26.279%, and the minimum is 5.7210%. Moreover, a wind energy conversion curve was established by simulating historical wind speed data.


Introduction
The utilization of renewable energy is the way to achieve sustainable development. Therefore, renewable energy draws more and more attention from both industry and academia. Among renewable sources, wind power generation is one of the most promising and has been widely used in the past decades. It is estimated that the reserve of wind power generation is more than 400 million MW (megawatts), which greatly exceeds the 18 million MW primary energy supply in advance [1]. However, the basic scientific research of onshore wind power development lags behind industrial development [2]. Due to the lack of early evaluation of wind energy resources, the utilization rate of wind energy resources is low and the reduction rate of power generation is high, which hinders the further development of wind energy [3]. Accurate wind speed prediction can provide adequate decision-making information for wind farm management and energy scheduling [4]. Therefore, wind speed prediction is crucial for the design and installation of large wind farms and essential for maintaining the reliability and safe operation of the power network [5]. However, accurate wind speed prediction is a challenge due to the volatility and diversification of wind speed.

Existing Methods to Forecast Wind Speed
In order to obtain accurate forecasting results, a series of strategies have been proposed by researchers. These models can be divided into four categories: (1) physical

Our Contribution
Based on the above literature review, we can draw the following conclusions: (a) The method based on deep learning is effective for wind speed forecasting. (b) Decomposition technology can improve the prediction performance of a model. (c) The methods based on a hybrid model are better than an individual model. Given the above conclusions, we propose a novel wind speed prediction model. We first use SSA to decompose the wind speed time series into several components. Then we design a deep neural network model based on CCN and the attention mechanism to predict wind speed. The proposed neural network has two CCN layers and one attention layer. The CCN layers extract temporal information from the time series and eliminate the impact of future data. The attention layer is used to focus on the important information for wind speed forecasting. The main innovations and contributions of this paper are as follows:
1. The SSA decomposition method is used to decompose the wind speed value into several different sub-signals, and the forecasting accuracy of the prediction model is further improved by using the characteristics of each sub-signal.
2. A new model for short-term wind speed prediction is proposed, which uses CCN to extract features and employs the attention mechanism to make predictions from the extracted features.
3. In order to verify the performance of wind speed signal extraction by SSA, we adopt different decomposition technologies and put the decomposed sub-signals into our proposed model to evaluate the performance of each decomposition technology.
4. To verify the effectiveness of the proposed model, we use data from four different time periods and ten comparison models, and evaluate the performance of the related models in different prediction intervals.
The rest of this paper is organized as follows: Section 2 introduces the theory of SSA, CCN, and the attention mechanism. In Section 3, the proposed SSA-CCN-ATT model is presented. In Section 4, three experiments using four datasets are conducted. Section 5 presents the discussion of the comparative models. Section 6 summarizes the whole paper. Finally, a Nomenclature section lists the abbreviations used in the paper.

Methodology
This section introduces the methods used in this paper, including SSA, CCN, and attention mechanism.

Singular Spectrum Analysis
SSA is a nonparametric spectrum estimation method that decomposes a time series into several meaningful components. SSA does not need any prior knowledge about the time series [43]. The method consists of two phases: decomposition and reconstruction. The specific steps are as follows.

Embedding
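Let a one-dimensional sequence of length N be $X = [x_1, x_2, \ldots, x_N]$, and let the positive integer L be the length of the sliding window, 1 < L < N. The embedding operation constructs K lagged vectors from the original sequence X, as follows:
$$X_i = [x_i, x_{i+1}, \ldots, x_{i+L-1}]^{T}, \quad i = 1, 2, \ldots, K,$$
where $K = N - L + 1$. The result of the mapping forms the trajectory matrix M:
$$M = [X_1, X_2, \ldots, X_K] = \begin{bmatrix} x_1 & x_2 & \cdots & x_K \\ x_2 & x_3 & \cdots & x_{K+1} \\ \vdots & \vdots & \ddots & \vdots \\ x_L & x_{L+1} & \cdots & x_N \end{bmatrix}$$

Singular value decomposition
SVD is used to decompose the trajectory matrix. Singular value decomposition is a classical matrix decomposition method in matrix theory, and the decomposition formula is as follows:
$$M = \sum_{i=1}^{d} \lambda_i U_i V_i^{T},$$
where d is the number of non-zero singular values of the trajectory matrix M, $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d$ are the singular values in descending order, $U_i$ is called the left singular vector, and $V_i$ is called the right singular vector.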

Grouping
The purpose of grouping is to separate the additive components in the signal. If the original signal is to be denoised, the grouping operation expresses the trajectory matrix M constructed from the original sequence X as the sum of the useful signal S and the noise E:
$$M = S + E$$
We use SSA to analyze time series with potential structure. It is generally considered that the first r (r < d) large singular values reflect the main energy of the signal, while the last d − r small singular values are considered to be noise components. Thus, the grouping operation is to determine the appropriate r value to achieve signal-to-noise separation.

Diagonal averaging
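The purpose of diagonal averaging is to transform each matrix obtained by grouping into a sequence of length N. Let $Y \in \mathbb{R}^{L \times K}$ represent any matrix after grouping, with elements $y_{ij}$, 1 ≤ i ≤ L, 1 ≤ j ≤ K. Let $L^{*} = \min(L, K)$, $K^{*} = \max(L, K)$, and $N = L + K - 1$, and let $y^{*}_{ij} = y_{ij}$ if L < K and $y^{*}_{ij} = y_{ji}$ otherwise. The time series $y_{rc} = [y_{rc,1}, \ldots, y_{rc,N}]$ corresponding to the matrix Y is calculated with Equation (4), written here in the standard form of diagonal averaging:
$$y_{rc,k} = \begin{cases} \dfrac{1}{k} \sum_{m=1}^{k} y^{*}_{m,\,k-m+1}, & 1 \leq k < L^{*} \\ \dfrac{1}{L^{*}} \sum_{m=1}^{L^{*}} y^{*}_{m,\,k-m+1}, & L^{*} \leq k \leq K^{*} \\ \dfrac{1}{N-k+1} \sum_{m=k-K^{*}+1}^{N-K^{*}+1} y^{*}_{m,\,k-m+1}, & K^{*} < k \leq N \end{cases} \tag{4}$$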
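
The four SSA steps above (embedding, singular value decomposition, grouping, and diagonal averaging) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `ssa_decompose` and its `groups` argument are hypothetical, and by default every singular triple forms its own group.

```python
import numpy as np

def ssa_decompose(x, L, groups=None):
    """Basic SSA sketch: returns one reconstructed series per group.

    x      : 1-D series of length N
    L      : window length, 1 < L < N
    groups : list of index lists over the singular triples (hypothetical
             argument); default is one group per singular triple.
    """
    x = np.asarray(x, dtype=float)
    N = x.size
    K = N - L + 1
    # Embedding: column i is the lagged vector [x_i, ..., x_{i+L-1}].
    M = np.column_stack([x[i:i + L] for i in range(K)])
    # Singular value decomposition of the trajectory matrix.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    if groups is None:
        groups = [[i] for i in range(s.size)]
    components = []
    for g in groups:
        # Grouping: sum the rank-one matrices that belong to this group.
        Y = (U[:, g] * s[g]) @ Vt[g, :]
        # Diagonal averaging: mean over each anti-diagonal (i + j constant).
        flipped = Y[::-1]
        comp = [flipped.diagonal(k).mean() for k in range(-(L - 1), K)]
        components.append(comp)
    return np.array(components)
```

For example, `ssa_decompose(x, L=20, groups=[[0, 1], list(range(2, 20))])` would split a series into a two-triple "signal" part and a remaining "noise" part; the choice of r = 2 here is purely illustrative. Because the groups partition all singular triples, the returned components sum back to the original series.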

Causal Convolution Network
Causal convolution was proposed to capture the information in time series effectively [44]. Unlike traditional one-dimensional convolution, as shown in Figure 1, causal convolution only considers the local property on the left (previous data samples), so information from future data samples cannot affect the output at a given time step [45].
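
The left-only dependence described above can be illustrated with a single-channel causal convolution in NumPy. This is a minimal sketch for intuition; the function name `causal_conv1d` is hypothetical, and the series is assumed to be zero before its first sample.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: y[t] = sum_k w[k] * x[t - k], with x[t] = 0 for t < 0."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    k = w.size
    # Pad on the left only, so the output at time t never sees x[t + 1], x[t + 2], ...
    x_pad = np.concatenate([np.zeros(k - 1), x])
    # The window ending at time t is x_pad[t : t + k]; reversing w aligns w[0] with x[t].
    return np.array([x_pad[t:t + k] @ w[::-1] for t in range(x.size)])
```

Changing the last input sample only changes the last output sample, which is exactly the property that prevents future data from leaking into earlier predictions.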

For the sequence problem, the main abstraction is to predict $y_t$ according to $x_1, x_2, \ldots, x_t$ and $y_1, y_2, \ldots, y_{t-1}$, so that $y_t$ is close to the actual value, where x is the feature value and y is the target value. Another causal convolution network is the dilated causal convolution, which can obtain a larger receptive field [46]. Only standard causal convolution is used in this experiment.

Attention Mechanism
The basic mechanism of attention is to imagine the components in the original data (Source) as a series of <Key, Value> pairs. Given a target element (Query), the weight coefficient of the Value corresponding to each Key is obtained by calculating the similarity or correlation between the Query and that Key, and the Values are then summed with these weights to obtain the final attention value. The essential idea can be written as the following formula:
$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} \mathrm{Similarity}(\mathrm{Query}, \mathrm{Key}_i) \cdot \mathrm{Value}_i,$$
where $L_x$ is the number of <Key, Value> pairs in Source.
The specific calculation process of the attention mechanism can be summarized in three stages. In the first stage, the weight coefficient is calculated according to the Query and the Key. Different functions and computing mechanisms can be introduced to calculate the similarity or correlation between the Query and a given $\mathrm{Key}_i$. The most common methods include taking the vector dot product of the two, taking their cosine similarity, or introducing an additional neural network for the evaluation. In this experiment, we use the vector dot product. The formula is as follows:
$$\mathrm{Sim}_i = \mathrm{Query} \cdot \mathrm{Key}_i$$
In the second stage, the original scores of the first stage are normalized with the SoftMax calculation method. On the one hand, the original scores are normalized into a probability distribution in which the weights of all elements sum to 1; on the other hand, the weights of important elements are highlighted through the inherent mechanism of SoftMax. Generally, the following formula is used for the calculation:
$$a_i = \mathrm{SoftMax}(\mathrm{Sim}_i) = \frac{e^{\mathrm{Sim}_i}}{\sum_{j=1}^{L_x} e^{\mathrm{Sim}_j}}$$
In the third stage, the Values are weighted according to the weight coefficients $a_i$ to get the attention value. The formula is as follows:
$$\mathrm{Attention}(\mathrm{Query}, \mathrm{Source}) = \sum_{i=1}^{L_x} a_i \cdot \mathrm{Value}_i$$
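
The three stages (dot-product similarity, SoftMax normalization, weighted sum of Values) map directly onto a few NumPy operations. The following is a minimal sketch; the function name `dot_product_attention` and the array shapes are assumptions made for illustration.

```python
import numpy as np

def dot_product_attention(query, keys, values):
    """query: (d,), keys: (n, d), values: (n, m) -> (attention value (m,), weights (n,))."""
    query = np.asarray(query, dtype=float)
    keys = np.asarray(keys, dtype=float)
    values = np.asarray(values, dtype=float)
    # Stage 1: similarity between the Query and every Key (vector dot product).
    scores = keys @ query
    # Stage 2: SoftMax normalization; subtracting the maximum only improves
    # numerical stability and does not change the resulting weights.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Stage 3: weighted sum of the Values.
    return weights @ values, weights
```

When all Keys are identical, every score is equal, the weights become uniform, and the attention value reduces to the plain average of the Values.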

The Proposed SSA-CCN-ATT Model
In this paper, we propose a new wind speed forecasting model, which includes the SSA method for signal decomposition of the data, a two-layer CCN network for feature extraction from the sub-signals, an attention mechanism to give high weight to the more important features, and a fully connected neural network to get the final output. The model structure is shown in Figure 2. The design and specific steps are summarized as follows:
1. Data preprocessing. Considering the nonlinearity and volatility of wind speed data, we use SSA to process the original wind speed. SSA has a strict mathematical theory and few parameters, and it can efficiently extract the trend, periodic, and quasi-periodic information of the signals.
2. Sample construction. The wind speed data is divided into two datasets: the training set and the testing set. The training set is used to train the CCN-ATT network, whereas the testing set is used to evaluate the proposed forecasting model.
3. CCN-ATT network forecasts. The de-noised wind speed time series is fed to the CCN-ATT network, which has two CCN layers, one attention layer, and one fully connected layer. CCN is a highly noise-resistant model that extracts nonlinear spatial features from wind speed; the attention mechanism further increases its extraction efficiency. Finally, the fully connected layer is employed to obtain the forecasting result.
4. Evaluation. To study the efficiency of the proposed model, a comprehensive evaluation module that includes four evaluation metrics, the DM test, and improvement ratio analysis is designed to analyze the forecasting results.
5. Wind energy conversion and uncertainty analysis. Based on the wind energy conversion curve and the wind speed forecasting value, the calculated electricity generation for wind turbines and the forecasting interval method are used to analyze the uncertainty of the wind energy conversion process.
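
To make the data flow of step 3 concrete, the following NumPy sketch runs one forward pass through two causal convolution layers, an attention layer, and a fully connected output. It is only a shape-level illustration with random, untrained weights. Several details are assumptions that the text does not specify: the ReLU activation, the number of filters, the use of the last time step as the attention Query, and all function and parameter names. The kernel sizes 10 and 12 are taken from the model-training description.

```python
import numpy as np

def causal_conv_layer(x, w, b):
    """x: (T, C_in), w: (k, C_in, C_out), b: (C_out,) -> (T, C_out), ReLU assumed."""
    k = w.shape[0]
    # Left padding only, so step t depends on steps <= t.
    x_pad = np.vstack([np.zeros((k - 1, x.shape[1])), x])
    out = np.stack([np.einsum("kc,kcd->d", x_pad[t:t + k], w)
                    for t in range(x.shape[0])])
    return np.maximum(out + b, 0.0)

def attention_pool(h):
    """Dot-product attention over time; the last step is the Query (assumption)."""
    scores = h @ h[-1]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ h

def init_params(c_in=1, filters=8, seed=0):
    """Random weights for illustration only; `filters=8` is a hypothetical choice."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0, 0.1, (10, c_in, filters)), "b1": np.zeros(filters),
        "w2": rng.normal(0, 0.1, (12, filters, filters)), "b2": np.zeros(filters),
        "w_out": rng.normal(0, 0.1, filters), "b_out": 0.0,
    }

def ccn_att_forward(x, params):
    """x: (T, C_in) window of a normalized sub-signal -> scalar forecast."""
    h = causal_conv_layer(x, params["w1"], params["b1"])
    h = causal_conv_layer(h, params["w2"], params["b2"])
    context = attention_pool(h)
    return float(context @ params["w_out"] + params["b_out"])
```

In practice the weights would be learned, for example with the Adam optimizer mentioned in the experimental setup; the sketch only shows how a window of shape (T, C_in) is reduced to a single forecast value.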

Experimental Results
In this section, the experimental design and result analysis are conducted. We first introduce the basic information of the datasets used in the experiments. Then, the evaluation criteria and parameters of the model are described. Finally, the prediction results are presented.

Dataset Information
The wind speed data used in this study are taken from the NWTC (National Wind Technology Center) of NREL (National Renewable Energy Laboratory). The data were collected every two seconds, and an average value was recorded every minute [4]. The wind speeds at six heights were measured and recorded, which were 2 m, 5 m, 10 Figure 3, the horizontal axis represents the time of samples, and the vertical axis represents the wind speed. The description and statistical information of the collected wind speed data sets are shown in Table 2.
In order to analyze time series with potential structure, four decomposition methods are selected in the experiment, namely EMD, EEMD, EWT, and SSA. Through experimental comparison (see Section 4.2.1), we selected SSA as our decomposition method. In this paper, we use the psts [48] to implement SSA. When applying SSA to decompose the data, we set the number of sub-signals to 14. That is, there are 14 sub-signals in each decomposition chart. Figure 4 shows the SSA decomposition results of the four datasets. In Figure 4, the top left figure is the original wind speed data, and the other figures are arranged in the order of the diagonal averaging.

Experimental Design
After data decomposition, we used the decomposed time series to train our model and used some evaluation criteria to evaluate our model. This section will introduce the evaluation criteria we used, the training process, and the experiment's design.

Model Training
Before model training, we normalized the decomposed data with a linear function (Min-Max scaling). The linear function converts the original data to the range [0, 1]. The normalization formula is as follows:
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}},$$
where $X_{norm}$ is the normalized data, X is the original data, and $X_{max}$ and $X_{min}$ are the maximum and minimum values of the original dataset, respectively. After normalization, the CCN-ATT network is applied to predict the wind speed, and the above four datasets are used to verify the performance of the network. In the experiment, two layers of CCN are used to extract the features, and the convolution kernel sizes are 10 and 12, respectively. The extracted features are the input of the attention mechanism. The output of the attention mechanism is passed into a fully connected neural network, which produces the predicted value.
The whole process of the model training can be described in the following three steps:
1. The one-dimensional wind speed is decomposed into 14 one-dimensional sub-signals by SSA to eliminate the randomness of the original data.
2. The first 4800 samples obtained in the first step are used as the training set, and 10% of the samples in the training set are used as the validation set. The last 500 samples are used as the testing set. Min-Max scaling is used to normalize the training set and the testing set, respectively.
3. The input length is 14. When making one-step forecasting, the i-th sample to the (i + 14)-th sample are used to predict the (i + 15)-th sample. When making two-step forecasting, the i-th sample to the (i + 14)-th sample are used to predict the (i + 16)-th sample. When making three-step forecasting, the i-th sample to the (i + 14)-th sample are used to predict the (i + 17)-th sample.
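
The normalization and sample-construction steps can be sketched as follows. The helper names `min_max_scale` and `make_windows` are hypothetical. The sketch assumes a window of `input_len` consecutive samples and takes the h-step target to be the sample h positions after the last element of the window; the exact index convention should be checked against the description above.

```python
import numpy as np

def min_max_scale(x):
    """Scale a series to [0, 1]; also return the minimum and maximum for inversion."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def make_windows(series, input_len=14, horizon=1):
    """Build (inputs, targets) pairs for h-step-ahead forecasting.

    Row i of X holds series[i : i + input_len]; y[i] is the sample located
    `horizon` positions after the last element of that window.
    """
    series = np.asarray(series, dtype=float)
    n = series.size - input_len - horizon + 1
    X = np.stack([series[i:i + input_len] for i in range(n)])
    y = series[input_len + horizon - 1:input_len + horizon - 1 + n]
    return X, y
```

With the split described above, one would scale `series[:4800]` and `series[-500:]` separately and call `make_windows` on each part with `horizon` set to 1, 2, or 3.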

Experimental Setup
In order to verify the effectiveness of the proposed model, we conduct three experiments. The detailed information of the three experiments and the comparison models is shown in Table 3. Among the three experiments, Experiment I is designed to determine which algorithms are suitable for feature extraction; Experiment II compares the SSA-CCN-ATT model with models that use different decomposition methods; Experiment III compares the SSA-CCN-ATT model with some classic individual models. All the models are implemented in Python using the Keras framework. The parameter settings of ANN, SVR, CCN, LSTM, and GRU are shown in Table 4. From Table 3, we can see that there are ten comparison models: six hybrid models and four individual models. The four individual models do not use data decomposition technology and take the original wind speed values directly as input, while the six hybrid models use a data decomposition technique. The wind speed data of all models are normalized by Min-Max normalization to the range [0, 1]. All models are trained with the Adam optimizer except SVR.

Evaluation Criteria
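In order to evaluate the accuracy of the proposed model, this section introduces some commonly used evaluation indicators, including the MAE, MSE, MAPE, and $R^2$. Their specific formulas are as follows:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$
$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$
where n is the number of samples, $y_i$ is the real target wind speed value, $\hat{y}_i$ is the predicted wind speed value, and $\bar{y}$ represents the average of the target values. For these evaluation criteria, the smaller the values of MAE, MAPE, and MSE and the larger the value of $R^2$, the better the model.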
In addition, to compare the wind speed prediction performance of different models, this paper also introduces the improvement ratio of MAPE ($P_{MAPE}$). It is defined as follows:
$$P_{MAPE} = \frac{MAPE_2 - MAPE_1}{MAPE_2} \times 100\%,$$
where $MAPE_1$ represents the MAPE of our proposed model, and $MAPE_2$ represents the MAPE of a comparison model. When $P_{MAPE}$ is positive, the forecasting effect of our model is better than that of the comparison model. Otherwise, the forecasting effect is worse.
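
The evaluation criteria above translate into short NumPy functions. This is a minimal sketch with hypothetical function names; it assumes the true values contain no zeros, since MAPE divides by them.

```python
import numpy as np

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y, float) - np.asarray(y_hat, float))))

def mse(y, y_hat):
    return float(np.mean((np.asarray(y, float) - np.asarray(y_hat, float)) ** 2))

def mape(y, y_hat):
    """Mean absolute percentage error in percent; y must not contain zeros."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs((y - y_hat) / y)) * 100.0)

def r2(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def p_mape(mape_proposed, mape_comparison):
    """Improvement ratio of MAPE in percent; positive means the proposed model is better."""
    return (mape_comparison - mape_proposed) / mape_comparison * 100.0
```

For instance, a proposed model with a MAPE of 8% against a comparison model with a MAPE of 10% gives an improvement ratio of 20%.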

Result Analysis
In this section, three experiments are conducted to verify the proposed model from different aspects.

Result Analysis of Experiment I
To validate the accuracy and stability of the SSA-CCN-ATT model, its forecasting results are compared with those of three other attention-based models. This experiment is designed to analyze the impact of feature extraction on forecasting performance and to select a more effective feature extraction algorithm. The evaluation metrics of the three comparison models and the proposed model can be seen in Table 5.

Remark 1.
The performance of the attention-based model using CCN is better than that of the attention-based models using other feature extraction technologies. The results show that CCN is more suitable for our proposed model.

Result Analysis of Experiment II
The effects of SSA and other decomposition methods, including EMD, EEMD, and EWT, are compared in this experiment. For this purpose, we keep the prediction model fixed and vary only the decomposition method. Figure 6 shows the forecasting results of the four models. It can be seen from the bar chart that the three indices of our model are lower than those of the other three models. Table 6 lists the results of the four indices for the proposed model and the other three models. From Table 6, we can get the following results.
1. Similar to Experiment I, the proposed model gets the best evaluation metrics values for dataset 1, dataset 3, and dataset 4. The other three comparison models get their best results for different evaluation metrics and different steps of forecasting.
2. For dataset 1, in three-step forecasting, the worst forecasting model is EMD-CCN-ATT.
3. For dataset 2, EMD-CCN-ATT gets the best MAE and MSE in one-step forecasting, and EEMD-CCN-ATT gets the best MSE and $R^2$ in two-step forecasting.
4. For dataset 3 and dataset 4, in three-step forecasting, the worst forecasting model is EMD-CCN-ATT.

Remark 2.
The performance of the CCN-ATT model using SSA decomposition technology is better than that of the CCN-ATT model using EMD, EEMD, and EWT decomposition technology. The results can fully prove that SSA decomposition technology is more suitable for our proposed model. SSA can efficiently extract the trend, periodic, and quasi-periodic information of the signals.

Result Analysis of Experiment III
In this experiment, we also use four datasets and compare the SSA-CCN-ATT model with some classic individual models, namely, ANN, SVR, CCN, and LSTM. Their parameters are the same as those of Experiment I. Unlike in Experiment I and Experiment II, the original data are used directly as the input of the individual models. Figure 7 shows a radar chart of three indices. From Table 7 and Figure 7, we can draw the following conclusions:
1. For dataset 1, the proposed model achieves the best results for every step of forecasting. Among the four comparison models, LSTM performs best because it has the best values in terms of the four indices.
2. For dataset 2, the best forecasting method differs for different steps of forecasting. In general, LSTM is the best forecasting model in terms of the four indices, but in one-step forecasting, CCN has the best MSE and $R^2$ values.
4. Similar to dataset 3, the performance of SSA-CCN-ATT is better than that of the other four individual models. From the comparison of CCN, SVR, ANN, and LSTM, it can be seen that LSTM always obtains the best values among the four for one-step and two-step forecasting.

Remark 3.
Compared with the four individual models, the proposed SSA-CCN-ATT model can get the most accurate forecasting results. Among the four individual models, LSTM is relatively better than the other three individual models. As the number of forecasting steps increases, the forecasting performance of all models becomes worse.

Discussion
In this section, we will conduct the Diebold-Mariano (DM) test on ten comparison models and analyze the improvement ratio of our model relative to the comparison models.

Significance of the Proposed Model
To verify the forecasting performance of the proposed SSA-CCN-ATT model, we conducted the DM test. Given the significance level α, the null hypothesis $H_0$ states that there is no significant difference in forecasting performance between the proposed model and the reference model, while $H_1$ rejects this hypothesis. The related hypotheses are as follows:
$$H_0: E\left[L(\varepsilon_i^1)\right] = E\left[L(\varepsilon_i^2)\right]$$
$$H_1: E\left[L(\varepsilon_i^1)\right] \neq E\left[L(\varepsilon_i^2)\right]$$
where L is the loss function of the forecasting errors, and $\varepsilon_i^p$, p = 1, 2, are the forecasting errors of the two compared models.
Moreover, the DM test statistic is defined by:
$$DM = \frac{\frac{1}{n} \sum_{i=1}^{n} d_i}{\sqrt{S^2 / n}}, \quad d_i = L(\varepsilon_i^1) - L(\varepsilon_i^2),$$
where $S^2$ is an estimate of the variance of $d_i$ and n is the number of forecasting errors. Given the significance level α, the calculated DM value is compared with $Z_{\alpha/2}$ and $-Z_{\alpha/2}$, where $Z_{\alpha/2}$ is the upper (positive) Z-value from the standard normal table corresponding to half of the desired α level of the test. $H_0$ is accepted if the DM statistic falls into the interval $[-Z_{\alpha/2}, Z_{\alpha/2}]$. Otherwise, $H_0$ is rejected, which indicates that there is a significant difference between the forecasting performances of the proposed model and the comparison model. Table 8 shows the DM test results. From Table 8, we can see that EWT-CCN-ATT on dataset 1 in one-step forecasting has the smallest value, 1.8028, which is still greater than $Z_{0.1/2} = 1.645$. Consequently, the null hypothesis is rejected at the 10% significance level for every comparison model; that is, the alternative hypothesis is accepted at the 90% confidence level. This confirms that the proposed SSA-CCN-ATT model performs significantly better than the ten comparison models.
Table 9 shows the results of $P_{MAPE}$ on the four datasets. Based on the details provided in Table 9, the forecasting results of the proposed SSA-CCN-ATT model are better than those of any other comparison model. Compared with different attention-based models, the maximum improvement percentage of MAPE reaches 23.345%, and the minimum improvement is 5.7210%. The $P_{MAPE}$ values between the proposed SSA-CCN-ATT and SSA-LSTM-ATT are $P_{MAPE}^{1\text{-step}} = 16.445\%$, $P_{MAPE}^{2\text{-step}} = 11.416\%$, and $P_{MAPE}^{3\text{-step}} = 14.470\%$ for dataset 1. In comparison with some classic individual prediction models, the improvement percentage of MAPE has a maximum value of 40.488% and a minimum value of 6.7730%. This shows that the proposed model has better prediction accuracy and prediction stability than the comparison models.
Compared with different decomposition models, the maximum improvement percentage of MAPE reaches 26.279%, and the minimum is 6.1830%.
Table 9. Improvement ratio of MAPE generated by the SSA-CCN-ATT from four datasets.
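
The DM test described in this section can be sketched as follows. This is a simplified illustration, not the paper's code: the function name `dm_test` is hypothetical, the loss is assumed to be the squared error, and the variance of the loss differential is estimated with the plain sample variance, which ignores the autocorrelation that multi-step forecast errors usually exhibit.

```python
import math
import numpy as np

def dm_test(errors_1, errors_2, loss=np.square):
    """Return (DM statistic, two-sided p-value under the standard normal).

    errors_1, errors_2 : forecasting errors of the two compared models
    loss               : loss function L applied element-wise (squared error assumed)
    """
    d = loss(np.asarray(errors_1, dtype=float)) - loss(np.asarray(errors_2, dtype=float))
    n = d.size
    # Simplification: sample variance of d, no autocorrelation correction.
    dm = d.mean() / math.sqrt(d.var(ddof=1) / n)
    # Two-sided p-value: 2 * (1 - Phi(|DM|)) = erfc(|DM| / sqrt(2)).
    p_value = math.erfc(abs(dm) / math.sqrt(2.0))
    return dm, p_value
```

With this sign convention, a positive DM value means that the first set of errors has the larger average loss, and |DM| > 1.645 corresponds to rejecting the null hypothesis at the 10% significance level.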