Vision Transformer-Based Photovoltaic Prediction Model

Abstract: Sensing cloud movement has always been a difficult problem in photovoltaic (PV) prediction. The information used by current PV prediction methods makes it challenging to accurately perceive cloud movements. When clouds obstruct the sun, actual PV power generation drops significantly, and the PV prediction network cannot respond in time, resulting in a significant decrease in prediction accuracy. To overcome this problem, this paper develops a vision transformer model for PV prediction, in which the target PV sensor information and auxiliary information from the surrounding PV sensors are used as input data. By using the auxiliary information of the surrounding PV sensors together with their spatial location information, our model can sense the movement of clouds in advance. The experimental results confirm the effectiveness and superiority of our model.


Introduction
When solar energy is used in the grid, the output of PV power generation is intermittent due to meteorological factors such as changes in solar radiation. Solar energy depends on local climatic conditions and cloud dynamics. This uncertainty affects the accuracy of PV predictions: clouds blocking sunlight lead to a sharp drop in radiation intensity, which the model cannot anticipate, resulting in a large difference between the model's prediction and the actual results and thereby reducing accuracy.
There are two primary PV prediction approaches in current research: traditional machine learning methods and deep learning algorithms. Traditional machine learning methods include, but are not limited to, support vector machines, decision trees, random forests, and hidden Markov models. In most machine learning methods, feature extraction is independent of the network model: known features, or features [1] that experts in a specific field believe are important for completing a specific task, must be manually extracted from the original data. In contrast, deep learning algorithms use an end-to-end approach, in which neural network layers extract deep or abstract features from large, complex datasets. Compared to traditional machine learning methods, most deep learning algorithms do not rely on hand-selected features; they extract relevant features from datasets automatically, without requiring expert domain-specific knowledge [2].
Although many models have achieved good results in PV prediction, their performance is still insufficient. For example, previous PV prediction models could not perceive that cloud movement blocking the sun causes a rapid decline in power generation, which reduces the overall accuracy of the prediction. To overcome this problem, we propose in this paper a PV prediction model based on the vision transformer (VIT) [3], which has been successfully applied in many computer vision tasks. In our model, we use the target PV sensor and the surrounding PV sensors as input data. The PV sensors are located in Panyu, Guangdong Province, China. Since the target sensor is located inside the ring of surrounding sensors, as shown in Figure 1, the surrounding sensors can perceive cloud movement in advance. We used the number 1 sensor as the target sensor and the remaining 8 as auxiliary sensors; the information collected by all sensors was utilized for training and testing. To effectively capture this early information, we adopted the multi-head self-attention (MSA) mechanism to exploit the auxiliary information from the surrounding PV sensors. Moreover, we considered the impact of the positional information of the surrounding PV sensors and the target PV sensor. To verify the validity of the model, we conducted a large number of comparative and ablation experiments. The contributions of this paper are summarized as follows: (1) We developed the VIT model for PV prediction, which utilizes auxiliary information from the surrounding PV sensors to help the target sensor anticipate cloud movement in advance.
(2) Incorporating the geographic information of PV sensors into our model further enhances the prior knowledge needed to improve the PV prediction performance.
(3) Many comparative and ablation experiments confirm the effectiveness and superiority of our model.
The rest of the paper is organized as follows. Section 1 presents the literature review. Section 2 presents the methodology. Section 3 presents the results. Section 4 presents the conclusions and future prospects.

Related Works

Traditional Machine Learning for PV Prediction
Yang et al. proposed an autoregressive linear model, which was further extended to include vector autoregressive and vector autoregression models, based on the traditional autoregressive model [4]. Cavalcante et al. performed PV prediction by combining the vector autoregressive model with the minimum absolute shrinkage and selection operator framework [5]. Peder et al. used the autoregressive model with exogenous inputs to predict hourly values of solar PV power generation [6]. Zeng et al. proposed a radial basis function neural network-based model for short-term solar power prediction [7]. Hugo et al. compared the k-nearest neighbor, artificial neural network, and other solar PV power prediction models [8]. Bouzerdoum et al. proposed a SARIMA-SVM hybrid model for the time series forecasting of solar PV power generation [9]. Wu et al. combined the autoregressive integrated moving average model, SVM, artificial neural network, and fuzzy inference system to predict solar PV power generation [10]. The integration method is also popular in PV forecasting. Rana et al. integrated neural networks and support vector regression to make short-term predictions of PV power generation [11]. Asrari et al. proposed an artificial neural network to forecast solar PV power generation one hour in advance [12]. Shang et al. improved SVR and enhanced empirical mode decomposition for predicting solar PV power generation [13]. Behera et al. used the extreme learning machine to predict the PV power at intervals of 15 min, 30 min, and 60 min, respectively [14]. Eseye et al. applied a hybrid prediction model combining particle swarm optimization and SVM for the short-term power prediction of actual microgrid PV systems [15]. Although the above models have achieved good results in PV prediction, there are still deficiencies. The data used in machine learning methods need to be manually screened or supported by prior knowledge, which is laborious. Compared to machine learning methods, deep learning networks achieve good results in most cases.

Deep Learning for PV Prediction
Jeong et al. used a convolutional neural network (CNN) to extract spatiotemporal correlations by superimposing PV signals into images and reordering them based on their real-world locations [16]. In addition, Shih et al. introduced an attention mechanism to capture the spatial correlation among PV nodes [17]. Simeunović et al. used the graph-convolutional transformer for PV prediction, which uses an attention mechanism [18]. Li et al. proposed a hybrid short-term PV power plant model, which combines a time-series generative adversarial network, K-medoids, and CNN-GRU [19]. To address the volatility and instability of PV power generation, Zhu et al. used a PV prediction model that combines the k-means technique with a long short-term memory (LSTM) network [20]. By using the attention layers of two LSTM neural networks, Zhou et al. found more important input features in PV prediction [21]. Qu et al. proposed a new hybrid model based on the gated recurrent unit to predict distributed PV power generation [22]. Basset et al. introduced a new deep learning architecture called PV-Net for short-term forecasting of day-ahead PV energy [23]. Perez et al. proposed an intra-day prediction model that does not require training or real-time data measurements [24]. Guermoui et al. used a novel decomposition method to decompose PV power into intrinsic functions and used extreme learning machines for prediction [25]. Korkmaz proposed a new CNN structure called Solar-Net for the short-term prediction of PV output power under dissimilar weather seasons and conditions [26]. Sharma et al. applied a hybrid deep learning framework for PV prediction, which consists of a long short-term memory layer and a maximal overlap discrete wavelet transform model [27]. Cannizzaro et al. proposed a new method for predicting solar radiation by combining variational mode decomposition, two CNNs, random forests, and LSTM networks [28]. The above deep network models have greatly improved the accuracy of PV prediction, but most of these networks use only historical PV information and local climate information as input. This makes it difficult for the network to perceive cloud movement, which is very important for PV prediction.

Motivation
To solve the above problems, this paper develops a PV prediction model based on the vision transformer framework. The model uses the PV power information from both the target sensor and the auxiliary sensors as input, and integrates a geographic information matrix into the self-attention layer to weight the information from the auxiliary sensors, so that the PV prediction network can perceive cloud movement and improve prediction accuracy.

Model
The proposed model is shown in Figure 2. The input X ∈ R^(l×w) of the model is the information of the PV sensors, where l is the length of the time sequence and w is the number of PV sensors. The output Y ∈ R^(l×1) of the model is the PV prediction sequence of the target PV sensor. We prepend a learnable sequence z_0^0 to the input X for prediction, and we use a trainable linear projection A to map the input X before fusing it with the position embedding A_pos. The procedure of adding the position embedding can be represented as follows:

Z_0 = [z_0^0; x^1 A; x^2 A; …; x^w A] + A_pos, (1)

where x^i denotes the series of the i-th PV sensor and Z_0 ∈ R^((w+1)×D) represents the complete sequence after adding the learnable predictive token and the position embedding. In the following, MSA is used to exploit the auxiliary information of the surrounding PV sensors, which is represented as follows:

Z'_ι = MSA(LN(Z_(ι−1))) + Z_(ι−1), ι = 1, …, N, (2)

Z_ι = MLP(LN(Z'_ι)) + Z'_ι, ι = 1, …, N, (3)

where LN(•) is the layer normalization module, MSA(•) is the MSA module, and MLP(•) is the multi-layer perception (MLP) module. Z'_ι ∈ R^((w+1)×D) and Z_ι ∈ R^((w+1)×D) denote the ι-th middle variable and output variable, respectively. After the N transformer encoder layers, the PV prediction of the target sensor is obtained from Z_N:

Y = LN(z_N^0), (4)
where z_N^0 represents the output state of our learnable prediction token z_0^0 after passing through the transformer encoder layers, and Y is the output of the model, i.e., the PV prediction sequence of the target PV sensor. The input of the model is normalized by layer norm in the transformer encoder.
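The forward pass described by (1)–(4) can be sketched in NumPy with random weights, purely to check shapes and data flow. The dimensions l, w, D, the number of layers N, and the single-head simplification of MSA are illustrative assumptions, not the paper's actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
l, w, D, N = 16, 8, 32, 2          # sequence length, #sensors, embed dim, #encoder layers

def layer_norm(z, eps=1e-5):
    # LN(.): normalize each token over the feature dimension
    return (z - z.mean(-1, keepdims=True)) / np.sqrt(z.var(-1, keepdims=True) + eps)

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

X = rng.normal(size=(l, w))        # raw PV series of the w sensors
A = rng.normal(size=(l, D)) * 0.1  # trainable linear projection in eq. (1)
z00 = rng.normal(size=(1, D))      # learnable predictive token
A_pos = rng.normal(size=(w + 1, D)) * 0.1

# Eq. (1): tokenize each sensor's series, prepend the token, add positions
Z = np.vstack([z00, X.T @ A]) + A_pos          # (w+1, D)

def msa(z):
    # single-head stand-in for the MSA module
    Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    return softmax(q @ k.T / np.sqrt(D)) @ v

def mlp(z):
    W1 = rng.normal(size=(D, 2 * D)) * 0.1
    W2 = rng.normal(size=(2 * D, D)) * 0.1
    return np.maximum(z @ W1, 0) @ W2

for _ in range(N):                 # eqs. (2)-(3): pre-norm residual blocks
    Z = Z + msa(layer_norm(Z))
    Z = Z + mlp(layer_norm(Z))

y = layer_norm(Z)[0]               # eq. (4): read the prediction off token z_N^0
print(y.shape)                     # a final head would map this D-vector to the forecast
```

A trained model would additionally map the D-dimensional token state to the l-step output sequence; that head is omitted here.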

Multi-Head Self-Attention
Self-attention [29] is a popular neural network module. For each sequence in the input Z_0 ∈ R^((w+1)×D) of the transformer encoder, we generate the query, key, and value vectors through three learnable weight matrices, denoted as U_QKV = [W_Q, W_K, W_V]:

[Q, K, V] = Z_0 U_QKV, U_QKV ∈ R^(D×3D_h), (5)

where Q ∈ R^((w+1)×D_h), K ∈ R^((w+1)×D_h), and V ∈ R^((w+1)×D_h) are the three vectors obtained by multiplying the input sequence Z_0 by the corresponding matrices in U_QKV. We then calculate the attention weights and the weighted sum of all values V within the sequence:

A = softmax(QK^T / √D_h), (6)

SA(Z_0) = AV, (7)

where SA(•) is the self-attention computation. MSA denotes the splicing of multiple SA operations: we perform k self-attention operations in parallel to form k heads and map the spliced output of the k heads:

MSA(Z_0) = [SA_1(Z_0); SA_2(Z_0); …; SA_k(Z_0)] U_msa, (8)

where U_msa ∈ R^(kD_h×D) is a weight matrix. To keep the number of calculations and parameters unchanged when changing k, D_h in (5) is usually set to D/k.
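Equations (5)–(8) can be sketched directly in NumPy. The token count, model dimension, and head count below are illustrative assumptions chosen so that D_h = D/k:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tok, D, k = 9, 32, 4             # tokens (w+1), model dim, heads
Dh = D // k                        # D_h = D/k keeps cost constant in k

Z = rng.normal(size=(n_tok, D))
U_qkv = rng.normal(size=(k, D, 3 * Dh)) * 0.1   # per-head [W_Q, W_K, W_V], eq. (5)
U_msa = rng.normal(size=(k * Dh, D)) * 0.1      # output projection, eq. (8)

def softmax(a):
    e = np.exp(a - a.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

heads = []
for h in range(k):
    Q, K, V = np.split(Z @ U_qkv[h], 3, axis=-1)  # eq. (5)
    A = softmax(Q @ K.T / np.sqrt(Dh))            # eq. (6): attention weights
    heads.append(A @ V)                           # eq. (7): SA(Z)
out = np.concatenate(heads, axis=-1) @ U_msa      # eq. (8): splice heads and project
print(out.shape)
```

Note that each row of the attention matrix A sums to 1, so each output token is a convex combination of the value vectors.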

Input Embedding and Position Embedding
We incorporate predictive input embedding and position embedding into the input sequence of our model. When embedding, we not only add a learnable blank token, as in the VIT model, but also a geographic location information matrix for the PV sensors. The geographic information matrix helps the model from the beginning of the prediction, allowing it to adjust its parameters according to the geographical locations of the PV sensors and make the prediction more accurate. The close-range and long-range geographic location information matrices are defined in (9) and (10),
where S_ij denotes the value of row i and column j in the close-range geographic information matrix, and S'_ij denotes the value of row i and column j in the long-range geographic information matrix. d_i^j denotes the distance from sensor i to sensor j, and d_k denotes half of the distance between the target PV sensor and the furthest PV sensor. In the experiments, we set four lead times. When the prediction time was short, such as 60 s or 180 s, the close-range PV sensor information was relatively useful; hence, we increased the weight of the close-range PV sensor information. When the prediction time was long, such as 300 s or 600 s, the long-range PV sensor information was relatively useful, so we increased the weight of the long-range PV sensor information.
A_geo = ReLU(S O_s), (11)

where S ∈ R^(w×w) is the geographic information matrix, which is mapped to the D dimension by O_s ∈ R^(w×D).
A_pos = A_learn + A_geo, (12)

where A_learn ∈ R^((w+1)×D) is a learnable position embedding, and A_geo ∈ R^((w+1)×D) is the geographic information matrix calculated using the distances between the PV sensors; its initial dimension is w × w, which is mapped to the D dimension. Since we prepend a learnable predictive token to the input X, we must also add a corresponding row for its position encoding and geographic information to A_learn and A_geo to keep the dimensions consistent; here, we add an initialized random value.
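The construction of A_geo and A_pos in (11)–(12) can be sketched as follows. The exact forms of the matrices in (9) and (10) are not reproduced here, so the threshold-based 0/1 weighting around d_k below is a hypothetical stand-in, as are the sensor coordinates and dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
w, D = 9, 32                                   # sensors, embed dim (illustrative)

coords = rng.uniform(0, 1000, size=(w, 2))     # hypothetical sensor positions (m)
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)  # d_i^j
d_k = dist[0].max() / 2                        # half the target-to-furthest distance

# Hypothetical stand-ins for eqs. (9)-(10): up-weight near sensors (short
# horizons) or far sensors (long horizons) relative to the threshold d_k.
S_close = (dist <= d_k).astype(float)
S_far = (dist > d_k).astype(float)

O_s = rng.normal(size=(w, D)) * 0.1            # projection of S to D dims
A_geo = np.maximum(S_close @ O_s, 0)           # eq. (11): A_geo = ReLU(S O_s)

A_learn = rng.normal(size=(w + 1, D)) * 0.1    # learnable position embedding
token_geo = rng.normal(size=(1, D)) * 0.1      # random init for the predictive token
A_pos = A_learn + np.vstack([token_geo, A_geo])  # eq. (12)
print(A_pos.shape)
```

Swapping S_close for S_far switches the weighting from the short-horizon to the long-horizon setting.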

Loss Function
We use the mean squared error (MSE) loss function; the loss is backpropagated from the output of the model to adjust its parameters. MSE is a good measure of the mean error and the degree of variation in the evaluation data. The specific formula is as follows:

MSE = (1/n) Σ_{i=1}^{n} (Y_i − Ŷ_i)^2, (13)

where Y_i is the ground-truth value collected by the target PV sensor, and Ŷ_i is the PV prediction value output by the model.
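Equation (13) is the standard mean squared error; a minimal sketch (the sample values are made up for illustration):

```python
import numpy as np

def mse(y_true, y_pred):
    # eq. (13): mean of squared residuals between ground truth and prediction
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# residuals are -10, +5, -5, so MSE = (100 + 25 + 25) / 3 = 50
print(mse([100.0, 120.0, 90.0], [110.0, 115.0, 95.0]))  # 50.0
```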

Experimental Details
To prevent the results from being accidental due to the different initial values of each model, we repeated each of the following experiments several times to ensure the stability of the results and reduce uncertainty.
The flow chart of the prediction method is shown in Figure 3. First, we used the PV sensors for data acquisition, and then cleaned and normalized the data. Data cleaning includes deleting abnormal data and filling in missing values using interpolation. We used the max-min normalization method. During both training and testing, our model took as input the information from the target sensor, the auxiliary sensors, and the geographic information matrix. For a description of the geographic information matrix, please see Section 2.3.
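The cleaning and max-min normalization steps above can be sketched as follows (marking abnormal samples as NaN before interpolation is an assumption about the preprocessing convention; the sample series is made up):

```python
import numpy as np

def clean_and_normalize(x):
    """Interpolate missing samples, then max-min normalize to [0, 1]."""
    x = np.asarray(x, dtype=float)
    bad = np.isnan(x)                          # abnormal/missing samples marked as NaN
    x[bad] = np.interp(np.flatnonzero(bad), np.flatnonzero(~bad), x[~bad])
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo), (lo, hi)      # keep (lo, hi) to denormalize later

series = np.array([0.0, 10.0, np.nan, 30.0, 40.0])
norm, (lo, hi) = clean_and_normalize(series)
print(norm)   # [0.   0.25 0.5  0.75 1.  ]
```

Keeping the (min, max) pair allows the denormalized errors reported in the tables to be recovered from normalized predictions.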

Datasets
We used our own PV sensors to collect data and build a PV dataset. The dataset contains data collected every day from 2019 to 2021, with a time resolution of 1 s, from seven PV sensors placed in Panyu, Guangzhou, Guangdong Province, China. The PV sensors were supplied by Hebei Pingao Electronic Technology Co., Ltd. (Handan, China) and collect real-time optical radiance data at a sampling frequency of 1 s. Since the solar irradiance before sunrise and after sunset is negligible, we used data from between 9:00 a.m. and 3:00 p.m. After excluding missing data, about 2 million data points in our PV data could be used for prediction. We shuffled the days of the total dataset and divided it evenly into two datasets (dataset1 and dataset2). In dataset1, we divided the data into training, validation, and test sets at a ratio of 7:1:2; dataset2 was processed in the same way.
During training, we shuffled the order of the days but preserved the integrity and continuity of each day's data.
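The day-wise shuffle and 7:1:2 split described above can be sketched as follows; the seed and the representation of a "day" as one list element are assumptions for illustration:

```python
import random

def split_days(days, seed=0):
    """Shuffle whole days, then split 7:1:2 into train/val/test.
    Each element of `days` is one full day's record, so intra-day
    continuity is preserved while the day order is randomized."""
    days = list(days)
    random.Random(seed).shuffle(days)
    n = len(days)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return (days[:n_train],
            days[n_train:n_train + n_val],
            days[n_train + n_val:])

train, val, test = split_days(range(100))
print(len(train), len(val), len(test))  # 70 10 20
```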
BP: The BP (backpropagation) neural network is a feedforward neural network trained with the backpropagation algorithm.
CNN: The CNN has structural characteristics such as local connectivity, weight sharing, and downsampling; weight sharing makes its structure more similar to that of a biological neural network.
RNN: The RNN baseline is a recurrent neural network with LSTM units, which efficiently handles sequence problems.
LSTMs: A combination forecasting model using the LSTM network, optimized by the ant lion optimization algorithm, based on ensemble empirical mode decomposition and the K-means clustering algorithm.
We trained all models using the Adam optimizer [32], with the initial learning rate set to 0.0001. The learning rate was decayed stepwise with the training epoch: at epochs 2, 4, 6, 8, 10, 15, and 20, it was reduced to 5 × 10^−5, 1 × 10^−5, 5 × 10^−6, 1 × 10^−6, 5 × 10^−7, 1 × 10^−7, and 5 × 10^−8, respectively. Compared to the SGD optimizer, we found that the Adam optimizer worked well for the various models in our setup. The batch_size was generally set to 32; it can be reduced when memory is low, with little impact on the prediction results.

Experimental Results
To ensure the stability and authenticity of the experimental results, we conducted several experiments on the two datasets. The results are shown in Tables 1 and 2, where we list the mean and standard deviation of the errors for each model on the two datasets. When the prediction time was 60 s, the error of our model was not much different from that of the comparison models. This is because, for short prediction times, the model increases the weight of the target sensor information and pays less attention to the auxiliary sensor information. However, as the prediction time increased, our model had much lower errors than the comparison models. In terms of stability, measured by the standard deviation, our model is highly stable. The performance of our model greatly improved after adding auxiliary information from the surrounding PV sensors; in particular, as the lead time increased, our model exhibited a slower decline in prediction accuracy than the other models. As the prediction time increased to 300 s and 600 s, a significant gap appeared between the MSE and MAPE values of the models. The reason for this large gap is the difference in the models' ability to assess the decline time of the PV curve. Figure 4 shows how each model predicts the curve at 300 s: when the real PV curve fell, our model reacted more quickly to sense the drop, whereas the other models required a delay of about 200-250 s. When the prediction time reached 600 s, as seen in Figure 5, the time gap between the models in sensing the decline of the real PV curve was even greater. Our model, however, incorporates information from the auxiliary PV sensors, reducing the time it takes to sense the decline in the PV curve, and can even anticipate it in advance. The x-coordinates of all figures represent Beijing time (UTC + 8:00), and the y-coordinates represent power generation.

Ablation Experiments
Verifying the validity of the auxiliary information: To verify the effectiveness of the proposed fusion of auxiliary sensors, we performed an ablation experiment comparing the network's predictions with and without auxiliary sensor information. The comparison results are given in Tables 3 and 4, and the prediction curves are shown in Figures 6 and 7. The results show that when the model does not receive auxiliary information as input, the prediction accuracy is greatly reduced, becoming similar to that of the other comparison models.

Which model output is better? In the VIT model, the output is a token added to the image's input sequence, which contains the classification information of the image. However, this method is not necessarily suitable for PV forecasting. To verify its effectiveness, we compared three outputs of the model: the output token, the mean of the output time sequence, and the output sequence obtained by maximum pooling over the time sequence. The detailed comparison results are shown in Tables 5 and 6.

The number of encoder layers used to extract features: In the experimental results, after adding the auxiliary information from the surrounding PV sensors, the lead time predicted by the model greatly improved, but not to a satisfactory level. We assumed that this could be due to an insufficient number of layers in the encoder's feature extraction. We therefore set up this ablation experiment to find a suitable number of feature extraction layers; the experimental results are shown in Figures 8 and 9. In this experiment, we fixed the parameters of the model and changed only the number of encoder layers, to prevent unfair results from different initial values.

Conclusions
In this paper, the VIT model was improved to enable it to predict PV power directly. To deal with the reduction in radiation intensity caused by clouds occluding the sun, which leads to a decrease in prediction accuracy, auxiliary PV sensor information was added to the network input, effectively improving the accuracy of PV prediction; the actual prediction curves show that the network can perceive cloud motion in advance and respond accordingly. In the comparison experiments, when the prediction time was 300 s, our MSE and MAPE were 9245.2 and 8.94%, respectively; compared to the best comparison network, our MSE was reduced by 18% and the accuracy improved by 4%. When the prediction time was 600 s, our MSE and MAPE were 23,763 and 16.45%, respectively; compared to the best comparison network, our MSE decreased by 11% and the accuracy increased by 4%.
Although these results are encouraging, there are still some shortcomings.For example, the network's ability to predict significant decreases in light radiation intensity is not yet at a satisfactory level.We hope to further explore the fusion between auxiliary PV sensor information and target PV sensor information in the future, to enhance the network's ability to predict more accurately and improve the overall accuracy of PV forecasting.

LN(•)  layer normalization
SA(•)  self-attention computation
MSA(•)  multi-head self-attention
MLP(•)  multi-layer perception
U_QKV = [W_Q, W_K, W_V]  the weight matrices of Q, K, V in self-attention
S ∈ R^(w×w)  geographic information matrix

Figure 1 .
Figure 1. The positional distribution map of the PV sensors, where 1 is the target sensor and the rest are the auxiliary sensors.


Figure 2 .
Figure 2. The structure of the proposed network. As shown in the figure, the input of the network consists of X, A_pos, and the learnable predictive token z_0^0, where X represents the information sequence of the photovoltaic sensors and A_pos is the position encoding of X. The input of the model is normalized by layer norm in the transformer encoder.

Figure 5 .
Figure 5. (a-d) display the prediction curves of each model at a prediction time of 600 s.

Figure 6 .
Figure 6. (a,b) show the model prediction curves with and without auxiliary information input, respectively, at a prediction time of 300 s (line A represents the prediction with auxiliary information and line B the prediction without auxiliary information).

Figure 7 .
Figure 7. (a,b) show the model prediction curves with and without auxiliary information input, respectively, at a prediction time of 600 s (line A represents the prediction with auxiliary information and line B the prediction without auxiliary information).

Figure 8 .
Figure 8. (a) The MSE and (b) the MAPE of our model with different numbers of encoder layers (prediction time 300 s; both MSE and MAPE are calculated using denormalized values).

Figure 9 .
Figure 9. (a) The MSE and (b) the MAPE of our model with different numbers of encoder layers (prediction time 600 s; both MSE and MAPE are calculated using denormalized values).

Table 1 .
The average MSE and MSE standard deviation for the two datasets; STD represents the standard deviation of the MSE, and AMSE represents the average MSE (both are calculated using denormalized values; the number of trials is 10).

Table 2 .
The average MAPE and MAPE standard deviation for the two datasets; STD represents the standard deviation of the MAPE, and AMAPE represents the average MAPE (both are calculated using denormalized values; the number of trials is 10).

Table 3 .
The average MSE and MSE standard deviation of the model with and without auxiliary information on the two datasets; STD represents the standard deviation of the MSE, and AMSE represents the average MSE (both are calculated using denormalized values; the number of trials is 10).

Table 4 .
The average MAPE and MAPE standard deviation of the model with and without auxiliary information on the two datasets; STD represents the standard deviation of the MAPE, and AMAPE represents the average MAPE (both are calculated using denormalized values; the number of trials is 10).

Table 5 .
The average MSE and MSE standard deviation of the different output modes on the two datasets; STD represents the standard deviation of the MSE, and AMSE represents the average MSE (both are calculated using denormalized values; the number of trials is 10).

Table 6 .
The average MAPE and MAPE standard deviation of the different output modes on the two datasets; STD represents the standard deviation of the MAPE, and AMAPE represents the average MAPE (both are calculated using denormalized values; the number of trials is 10).