Forecasting Day-Ahead Hourly Photovoltaic Power Generation Using Convolutional Self-Attention Based Long Short-Term Memory

The problem of Photovoltaic (PV) power generation forecasting is becoming crucial as the penetration level of Distributed Energy Resources (DERs) increases in microgrids and Virtual Power Plants (VPPs). To improve the stability of power systems, a fair amount of research has been devoted to increasing prediction performance in practical environments through statistical, machine learning, deep learning, and hybrid approaches. Despite these efforts, forecasting PV power generation remains challenging in power system operations, since existing methods show limited accuracy and thus are not practical enough to be widely deployed. Many existing methods that use long historical data suffer from the long-term dependency problem and cannot produce high prediction accuracy because they fail to fully utilize all features of long sequence inputs. To address this problem, we propose a deep learning-based PV power generation forecasting model called Convolutional Self-Attention based Long Short-Term Memory (LSTM). By using the convolutional self-attention mechanism, we can significantly improve prediction accuracy by capturing the local context of the data and generating keys and queries that fit the local context. To validate the applicability of the proposed model, we conduct extensive experiments on both PV power generation forecasting using a real-world dataset and power consumption forecasting. The experimental results of power generation forecasting using the real-world datasets show that the Mean Absolute Percentage Errors (MAPEs) of the proposed model are lower by 7.7%, 6%, and 3.9% than those of the Deep Neural Network (DNN), LSTM, and LSTM with the canonical self-attention, respectively. As for power consumption forecasting, the proposed model exhibits 32%, 17%, and 44% lower MAPE than the DNN, LSTM, and LSTM with the canonical self-attention, respectively.


Introduction
Microgrids and Virtual Power Plants (VPPs) are two remarkable solutions for a reliable supply of electricity in a power system [1]. Microgrids are power systems comprising Distributed Energy Resources (DERs) and electricity end-users, possibly with controllable elastic loads, all deployed across a limited geographic area [2]. The concept of VPP was proposed in [3]. A VPP is a flexible representation of a portfolio of DERs that can be used to make contracts in the wholesale market and to offer services to the system operator [4]. The increasing penetration of intermittent and variable renewable energy resources (e.g., wind and solar) has significantly complicated energy system

LSTM-Based Approach to Predict PV Power Generation
RNNs were proposed to model time series in an autoregressive manner [31,32]. However, RNNs are difficult to train due to the vanishing and exploding gradient problems [33]. LSTM, a special kind of RNN, together with gradient clipping, alleviates the vanishing and exploding gradient problems and also addresses the complex and artificial long-term dependency problem [34]. In Figure 2, (a) shows the interior of the LSTM block and (b) shows the LSTM-based model architecture for predicting PV power generation. In Figure 2b, two LSTM layers are stacked; each LSTM layer has 256 units, and the model outputs prediction results for the next 24 h through a fully connected layer.
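The gradient clipping mentioned above can be sketched in a few lines; the following is a minimal numpy illustration of clipping by global norm (the threshold of 1.0 is illustrative, not a value from the paper):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their global L2 norm does not
    exceed max_norm; gradients already within bounds are left unchanged."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [g * scale for g in grads]

# Example: a gradient of norm 5.0 is rescaled to norm 1.0
grads = [np.array([3.0, 4.0])]
clipped = clip_by_global_norm(grads, 1.0)
print(np.linalg.norm(clipped[0]))  # -> 1.0
```

Rescaling by the global norm (rather than clipping each component) preserves the direction of the gradient while bounding the size of each update step.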


Features and Their Notations
In this paper, we aim to perform a day-ahead hourly forecasting task using both historical and future data. We consider PV power generation and weather measurements as historical data and weather forecast data as the future data. Since we use the historical data and future data, we need to extract common features of weather measurement data and weather forecast data. To do this, we use the humidity, rainfall, cloudiness, temperature and wind speed data. Table 1 shows the notation for the measurement corresponding to each feature of each day.
We use the hourly data on the elevation angle of the sun on each day, E_n, and the hour value, K_n, as shown in Table 1. Since the elevation angle of the sun is calculated by the formula in [35], we can use the future elevation angle of the sun without forecasting it. Equation (1) gives the elevation angle of the sun:
sin E = (sin δ × sin φ) + (cos δ × cos φ × cos H),   (1)
where δ is the declination of the sun, φ is the latitude, and H is the hour angle. Note that each feature consists of 24 hourly values. P_n denotes the set of PV power generation values for the 24 h of the n-th day, and p_{n,t} denotes the PV power generation value on day n at hour t (0 ≤ t ≤ 23). For example, P_3 is the set of PV power generation values from 0:00 to 23:00 on day 3, and p_{3,5} is the PV power generation value at 5:00 on day 3. The same rules apply to the rest of the features.
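Because δ, φ, and H are all known in advance for any future hour, Equation (1) can be evaluated directly, which is why this feature needs no forecasting. A minimal Python sketch (angles in degrees for readability):

```python
import math

def solar_elevation_deg(declination_deg, latitude_deg, hour_angle_deg):
    """Elevation angle E of the sun from
    sin E = sin(delta)*sin(phi) + cos(delta)*cos(phi)*cos(H)."""
    d = math.radians(declination_deg)
    p = math.radians(latitude_deg)
    h = math.radians(hour_angle_deg)
    sin_e = math.sin(d) * math.sin(p) + math.cos(d) * math.cos(p) * math.cos(h)
    return math.degrees(math.asin(sin_e))

# At the equator (phi = 0) on an equinox (delta = 0) at solar noon (H = 0),
# the sun is directly overhead:
print(solar_elevation_deg(0.0, 0.0, 0.0))  # -> 90.0
```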
In this paper, we construct the input using the past 5 days of PV power generation to predict day-ahead hourly PV power generation. In Sections 3.1 and 3.2, we explain how the input and output data are constructed using the remaining weather measurement and forecast data, and how they are divided into the training and testing phases.
R_n = [r_{n,0}, r_{n,1}, ..., r_{n,23}]
C_n = [c_{n,0}, c_{n,1}, ..., c_{n,23}]
T_n = [q_{n,0}, q_{n,1}, ..., q_{n,23}]
W_n = [w_{n,0}, w_{n,1}, ..., w_{n,23}]
K_n = [k_{n,0}, k_{n,1}, ..., k_{n,23}]
E_n = [e_{n,0}, e_{n,1}, ..., e_{n,23}]
Figure 3 shows the inputs and outputs of the training phase. The dotted-blue region represents the inputs of the model, while the dotted-red rectangle denotes the output of the model.

Testing Phase
For PV power generation values, we use a 5-day sequence from Day d-5 to Day d-1. On the other hand, for the other features except PV power generation, we use a 5-day sequence from Day d-4 to Day d. This is to construct a 5-day input sequence for all features. The outputs of the model are the predicted P̂_d.
Figure 4 shows the inputs and outputs of the testing phase. Similar to the training phase, the input sequences have the same structure, as shown in the dotted-blue region. The input sequence of PV power generation is from Day d-5 to Day d-1, while the other features are from Day d-4 to Day d. However, the weather data on Day d are replaced by the weather data forecasted on Day d-1, as shown in the dotted-green rectangle. This is because the future weather data are not known during the testing phase. The dotted-red rectangle represents the predicted value of P̂_d.
Figure 5 shows an example of the input sequences for the testing phase, assuming that we are trying to forecast Day d. As shown in Figure 5, we use a 5-day sequence of PV power generation from Day d-5 to Day d-1, where the ground truth and the testing input are exactly the same.
As shown in the blue box in Figure 5, the forecast data that replace the measured weather data on Day d differ from the ground truth. Note that the actual values (the black line) are quite different from the weather forecasts (the red line) due to the inaccuracy of weather forecasts. Nevertheless, we observe that weather forecast data play an important role in the prediction of PV power generation by giving the model valuable information about the future.
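The testing-phase input construction described above can be sketched as follows. The array names and layout are illustrative assumptions, not the authors' actual data pipeline, and only a single weather feature is shown:

```python
import numpy as np

def build_test_input(pv, weather_meas, weather_fcst, d):
    """Build one testing-phase input for target day d (hypothetical layout).
    pv:           (num_days, 24) historical PV power generation
    weather_meas: (num_days, 24) one measured weather feature
    weather_fcst: (num_days, 24) forecasts of that feature for each day,
                  issued the day before
    Returns a (5, 24, 2) array of 5-day windows of [PV, weather]."""
    pv_win = pv[d - 5:d]                      # PV: Day d-5 ... Day d-1
    w_win = weather_meas[d - 4:d].copy()      # weather: Day d-4 ... Day d-1
    # Day d's weather is unknown at test time -> substitute the forecast
    w_win = np.vstack([w_win, weather_fcst[d]])
    return np.stack([pv_win, w_win], axis=-1)

pv = np.arange(10 * 24, dtype=float).reshape(10, 24)
wm = np.zeros((10, 24))
wf = np.ones((10, 24))
x = build_test_input(pv, wm, wf, 9)
print(x.shape)  # -> (5, 24, 2)
```

The key detail is the one-day offset between the PV window (ending on Day d-1) and the weather window (ending on Day d, with the last row taken from the forecast).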

Proposed Method
In this paper, we propose the Convolutional Self-Attention LSTM model for day-ahead hourly forecasting of PV power generation. Although LSTM and gradient clipping solve the problems of the RNN, such as vanishing and exploding gradients, the model may still fail to utilize all features of long sequence inputs, such as data for the past 120 h [36]. Recently, attention methods that emphasize the important features of a long input sequence regardless of their distance have been widely used [37][38][39]. In the case of LSTM with attention, the amount of lag is an important factor in reducing loss. Additionally, LSTM with attention has been shown to outperform regular LSTM [40]. On the other hand, the performance of some deep learning models may deteriorate due to the attention technique. The canonical self-attention is one of the attention techniques that cause optimization problems for the model by generating inappropriate keys and queries that do not fit the local context. To address this problem, we apply causal convolution to our model. By applying the causal convolution, we can improve prediction performance by generating queries and keys that are more aware of the local context and can thus compute their similarities based on local context information, e.g., local shapes, instead of point-wise values [15]. In Figure 6, (a) shows the calculation of the similarity between queries and keys with point-wise values, while (b) shows the calculation of the similarity between queries and keys with the local context. In addition, we construct a residual connection between the result of the convolutional self-attention and the input, and also apply normalization [41]. Then, through two LSTM hidden layers with 256 units each, followed by a fully connected layer, the predicted hourly PV power generation values are generated for the next day (24 h).
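To make the mechanism concrete, the following numpy sketch generates queries and keys with a causal convolution of kernel size k and values with a kernel-size-1 convolution, then applies scaled dot-product attention. The shapes, random initialization, and single-head layout are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_conv1d(x, w):
    """Causal 1-D convolution: the output at time t sees only x[t-k+1 .. t].
    x: (T, d_in), w: (k, d_in, d_out)."""
    k = w.shape[0]
    T = x.shape[0]
    xp = np.vstack([np.zeros((k - 1, x.shape[1])), x])  # left-pad with zeros
    return np.stack([np.einsum('kd,kde->e', xp[t:t + k], w) for t in range(T)])

def conv_self_attention(x, wq, wk, wv):
    """Queries/keys from a causal convolution (kernel k > 1, local context);
    values from a kernel-1 convolution (point-wise)."""
    Q = causal_conv1d(x, wq)
    K = causal_conv1d(x, wk)
    V = causal_conv1d(x, wv)                 # wv has kernel size 1
    scores = Q @ K.T / np.sqrt(Q.shape[1])   # scaled dot-product similarity
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)        # row-wise softmax
    return A @ V

T, d, k = 120, 8, 3                          # e.g., 120 hourly steps, k = 3
x = rng.standard_normal((T, d))
wq = rng.standard_normal((k, d, d))
wk = rng.standard_normal((k, d, d))
wv = rng.standard_normal((1, d, d))
out = conv_self_attention(x, wq, wk, wv)
print(out.shape)  # -> (120, 8)
```

With k = 1 this reduces to the canonical self-attention, where similarities are computed from point-wise values; with k > 1 each query and key summarizes a short local window, so the attention scores compare local shapes instead of single points.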


Experiment
In this paper, we use hourly PV power generation data from 1 October 2017 to 30 March 2020 of the PV power plant in Jebi-ri, Gujeong-myeon, Gangneung-si, Gangwon-do, Korea. We obtain power generation data from a PV power plant in Jebi-ri. The maximum power output of each PV panel used in the Jebi-ri power plant is 150 kW. We also collect hourly weather measurement data for the same period as PV power generation measurement and 3-hourly weather forecast data to be used for testing from the Korean Meteorological Administration (KMA) [42]. Linear interpolation is used to transform the 3-hourly weather forecast data into hourly data. As mentioned in Sections 3.1 and 3.2, we use the training dataset from October 2017 to February 2020 and the test dataset from 2 March 2020 to 30 March 2020.
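The 3-hourly-to-hourly conversion mentioned above is a plain linear interpolation; a minimal numpy sketch with made-up temperature values:

```python
import numpy as np

# 3-hourly forecast values (e.g., temperature at hours 0, 3, 6, ..., 21)
hours_3h = np.arange(0, 24, 3)
temps_3h = np.array([2.0, 1.5, 3.0, 7.0, 10.0, 9.0, 6.0, 4.0])

# Linearly interpolate onto an hourly grid
hours_1h = np.arange(24)
temps_1h = np.interp(hours_1h, hours_3h, temps_3h)

print(temps_1h[:4])  # hours 0-3 go linearly from 2.0 down to 1.5
```

Note that `np.interp` holds the last known value beyond hour 21; how the tail of the day is handled in the actual pipeline is not specified in the paper.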

Find k of Convolutional Self-Attention
As shown in Figure 7 in Section 4, we generate queries, keys and values using the convolutional self-attention. The convolution layer with kernel size 1 generates the values, while the convolution layer with kernel size k generates the queries and keys. Different values of the kernel size k may lead to different queries and keys, and thus may have a non-trivial impact on the overall performance of the proposed Convolutional Self-Attention LSTM model. Therefore, we conduct nine sets of experiments varying k from 1 to 9 and measure the Mean Square Error (MSE) of the model to find a suitable k value for the data. Table 2 shows the MSE results according to the k values. Since these experiments show that the lowest MSE is achieved at k = 3, we set k to 3.
MAPE = (100/n) × Σ_{t=1}^{n} |y_true,t − y_pred,t| / y_true,t,
where n represents the number of data points, y_true,t represents the t-th actual PV power generation value, and y_pred,t represents the t-th predicted PV power generation value.
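These error metrics can be computed directly; a minimal numpy sketch (in practice, hours with zero actual generation would need to be excluded from the MAPE):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (assumes nonzero actual values)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / y_true)

def mse(y_true, y_pred):
    """Mean Square Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

y_true = [100.0, 200.0, 50.0]
y_pred = [90.0, 220.0, 55.0]
print(mape(y_true, y_pred))  # -> 10.0 (up to float rounding)
print(mse(y_true, y_pred))   # -> 175.0
```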

Experimental Results
In this section, we assess the efficiency and effectiveness of the proposed Convolutional Self-Attention LSTM model by comparing it against three widely used models: DNN (described in Section 2.1), LSTM (described in Section 2.2), and LSTM with the canonical self-attention. Figure 8 shows the PV power generation forecasting results of each method, i.e., DNN, LSTM, LSTM with the canonical self-attention, and our proposed Convolutional Self-Attention LSTM model, respectively. Figure 8 presents five predicted days for the four methods. The proposed Convolutional Self-Attention LSTM model shows a lower forecasting error, and consequently higher forecasting accuracy, than the DNN, LSTM, and LSTM with the canonical self-attention models. The average forecasting errors for the entire test set are summarized in Table 3. The MAPE of our proposed Convolutional Self-Attention model is lower by 7.7%, 6%, and 3.9% than those of the DNN, LSTM, and LSTM with the canonical self-attention, respectively.
Figure 9 shows the actual PV power generation and the predictions for five consecutive days. In general, the prediction results are similar to the actual PV power generation. As shown in the blue box in Figure 9, although PV power generation on the third day is lower than on the other days for several reasons, such as poor weather conditions, our proposed Convolutional Self-Attention LSTM model predicts a similar pattern on that day. This means that our proposed model works well even under various weather changes. However, differences between the red line and the black line are also seen. These errors are mainly due to a lack of data, that is, patterns that do not exist in the past. Another important factor causing the error is uncertainty in weather forecasts; the uncertainty of weather forecasts can inject incorrect information into a forecast model.

Effect of Historical and Future Data
To further analyze the effect of historical and future data, we magnify the blue box in Figure 9 as shown in Figure 10. The historical and future data for predicting the PV power generation on the third day is shown in Figure 10a and its result is depicted in Figure 10b. Note that Figure 10b is the same as the blue box in Figure 9.
As shown in the green box in Figure 10a, the curve of PV power generation shows large fluctuations, which produce a high information gain that is helpful for accurate prediction. In addition, as shown in the blue box in Figure 10a, the weather forecast (the red line) predicted rain with a precipitation of 8 mm on day 10. Based on this information, the proposed model can predict an accurate amount of PV power generation, as shown in Figure 10b, by capturing historical features and also acquiring helpful information from weather forecasts.

On the other hand, Figure 10d shows an inaccurate case. In this case, the curve of PV power generation shows consistent patterns for 5 days, which provide little information for predicting a sudden decline in PV power generation. This low information gain can have a negative impact on a prediction model. In addition, because the weather forecast in the blue box in Figure 10c is inaccurate, the proposed model predicts a fairly different amount of PV power generation from the actual PV power generation, as shown by the red line in Figure 10d.

NSW Electricity Load Data
In order to validate our proposed model, we conduct an experiment using another type of time series data, the NSW (New South Wales) electricity load data [43]. The NSW electricity load data include the hour, dry bulb temperature, dew point temperature, wet bulb temperature, humidity, electricity price, and system load, measured every 30 min from 1 January 2006 to 1 January 2011. For the NSW data, since future data such as weather forecasts are unavailable, we use only the past 5-day data as input to predict the next 1-day electricity load. Figure 11 shows the forecasting results for seven consecutive days. As shown in Figure 11, the proposed Convolutional Self-Attention LSTM model predicts a pattern closer to the actual electricity load than the other methods. The MAPE of the proposed model is lower by 32%, 17%, and 44% than those of the DNN, LSTM, and LSTM with the canonical self-attention, respectively. This result indicates that the proposed Convolutional Self-Attention LSTM model can forecast not only PV power generation but also power consumption.

Figure 11. Forecast results of NSW (New South Wales) electricity load.

Conclusions
In this paper, we propose the Convolutional Self-Attention LSTM model for forecasting day-ahead hourly PV power generation by applying the convolutional self-attention technique. Unlike previous methods that use the canonical self-attention, our proposed method generates queries and keys that are more aware of the local context and thus improves prediction performance by computing their similarities with local context information. In addition to the convolutional self-attention, our model uses both historical and future data by replacing the past weather data with the future weather forecast data during the testing phase.
We compare our proposed model with existing methods: the DNN-based model, the LSTM-based model, and LSTM with the canonical self-attention. Extensive experiments using real-world datasets show that the MAPE of the proposed Convolutional Self-Attention LSTM model is lower by 7.7%, 6%, and 3.9% than those of the DNN, LSTM, and LSTM with the canonical self-attention, respectively. In addition, to validate the applicability of the proposed model, we conduct another set of experiments on forecasting power consumption using the NSW electricity load data. In these experiments, we observe that the proposed Convolutional Self-Attention LSTM model reduces the MAPE by 32%, 17%, and 44% compared to the DNN, LSTM, and LSTM with the canonical self-attention-based models, respectively.
For future work, we plan to extend our method to harsh conditions where the daily PV power generation curve is not parabolic and its deviation is quite severe due to poor weather conditions. We also plan to extend our model to predict over various forecast horizons ranging from 5 min to half an hour.
Author Contributions: D.Y. designed the Convolutional Self-Attention LSTM based model, conducted the experiments, and prepared the manuscript as the first author. W.C. and M.K. led the project and research. L.L. assisted with the research and contributed to writing and revising the manuscript. All authors discussed the results of the experiments and were involved in preparing the manuscript. All authors have read and agreed to the published version of the manuscript.