Article

SP-Transformer: A Medium- and Long-Term Photovoltaic Power Forecasting Model Integrating Multi-Source Spatiotemporal Features

1 Power Grid Planning and Research Center, Guizhou Power Grid Co., Ltd., 38 Ruijin South Road, Nanming District, Guiyang 550003, China
2 School of Computer Science (Computer Science and Technology), Nanjing University of Information Science and Technology, No. 219 Ningliu Road, Jiangbei New District, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(21), 11846; https://doi.org/10.3390/app152111846
Submission received: 17 September 2025 / Revised: 29 October 2025 / Accepted: 3 November 2025 / Published: 6 November 2025

Abstract

To address the challenges of weak spatial and temporal correlation in medium- and long-term photovoltaic (PV) power data, as well as the data redundancy and low forecasting efficiency caused by long forecasting horizons, this paper proposes the SP-Transformer (spatiotemporal probsparse Transformer), a Transformer-based medium- and long-term PV power forecasting method designed to effectively capture the spatiotemporal correlation between meteorological and geographical elements and PV power. The method embeds the geographic location information of PV sites into the model through spatiotemporal positional encoding and designs a spatiotemporal probsparse self-attention mechanism, which reduces model complexity while allowing the model to better capture the spatiotemporal correlation within the input data. To further enhance the model's ability to capture and generalize potential patterns in complex PV power data, this paper proposes a feature pyramid self-attention distillation module to ensure the accuracy and robustness of the model in long-term forecasting tasks. The SP-Transformer performs well in the PV power forecasting task, with a medium-term (48 h) forecasting accuracy of 93.8% and a long-term (336 h) forecasting accuracy of 90.4%, both of which are better than all the comparative algorithms involved in the experiments.

1. Introduction

Medium and long-term photovoltaic (PV) power forecasting refers to the prediction of electricity generation by photovoltaic power systems over a period ranging from several days to months or even longer. It plays a significant role in energy planning, power system operations, and energy investment [1].
Compared with short-term photovoltaic (PV) power forecasting, the power generation observed over medium- and long-term horizons displays more pronounced cyclical patterns. On a daily basis, the power output curve tends to follow a similar trend, reflecting a clear diurnal cycle. In addition, seasonal changes in solar elevation angle and daylight duration further influence PV output, resulting in a seasonal cyclical pattern [2].
Medium- and long-term PV power forecasting faces two core challenges in spatial–temporal dimensions: first, its longer temporal span and broader spatial scale demand models with more sophisticated spatiotemporal feature extraction capabilities [3]; second, regional differences in geographical location require models to adapt to varying meteorological conditions across different areas.
Current photovoltaic (PV) power forecasting methods can be broadly categorized into statistical methods [4], machine learning methods [5], and deep learning methods [6].
Traditional statistical approaches enable intuitive interpretation of the relationship between power output and its influencing factors, thereby facilitating in-depth understanding of the key drivers of power fluctuations [7]. They perform well on small-scale datasets with short temporal spans, rendering them suitable for numerous practical application scenarios [8]. However, statistical methods are generally based on linear assumptions, which restricts their capacity to model complex nonlinear relationships. Additionally, they require high-quality data and dense sampling frequencies, while exhibiting poor responsiveness to unforeseen events or future uncertainties [9]. Consequently, such methods have inherent limitations for medium- and long-term PV power forecasting tasks.
Machine learning (ML) methods, trained on large historical datasets, can automatically capture the complex nonlinear relationships in photovoltaic (PV) power forecasting and demonstrate strong generalization capabilities [10], leading to their widespread adoption. However, their effectiveness, particularly in medium- and long-term forecasting, typically demands extensive datasets and substantial computational resources to model the requisite long-term time series and variables [11]. Moreover, when applied to such large-scale data, ML models are prone to overfitting [12]—a challenge that becomes especially pronounced in medium- and long-term scenarios.
Deep learning models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks incorporate memory functions designed to capture contextual information in time series data, such as seasonal variations and cyclical trends. This capability enhances their performance in predicting medium- and long-term photovoltaic (PV) power generation [13]. Furthermore, by leveraging multilayer architectures, these methods can model complex nonlinear relationships in PV data and integrate various meteorological factors (e.g., sunlight, temperature, wind speed), thereby improving prediction accuracy and adaptability [14]. However, recent studies highlight limitations in their ability to model long-term dependencies [15]. The sequential computation of LSTM restricts parallel processing and can lead to error accumulation in very long sequences. Additionally, while proficient at temporal feature extraction, LSTMs exhibit limited capacity for comprehensively modeling multidimensional spatiotemporal features.
Most existing photovoltaic (PV) power forecasting methods are designed for short-term tasks [16], with performance often degrading in medium- to long-term scenarios [17]. The Transformer architecture addresses this by capturing long-range dependencies via its self-attention mechanism, which processes all sequence positions simultaneously. This contrasts with the sequential processing of RNNs and LSTMs, enabling more efficient training and inference on long sequences.
Transformer-based models are increasingly applied to medium- and long-term photovoltaic (PV) power forecasting. A summary of representative methods is provided in Table 1, which outlines their key contributions and limitations. Ran et al. [18] proposed a hybrid model combining adaptive noise, complete ensemble empirical mode decomposition (CEEMD), sample entropy, and Transformer, utilizing an attention mechanism to address long-term memory loss and evaluate decomposition strategies. However, this approach relies solely on time-series data and overlooks meteorological and geographical factors, thus lacking spatiotemporal characterization of PV power variations. Cao et al. [19] introduced an LSTM-Informer model based on an improved Stacking ensemble algorithm, integrating long short-term memory (LSTM) and Informer as base learners and replacing k-fold cross-validation with time series cross-validation. While effective, the multi-layer encoder-decoder structure may lead to parameter redundancy and overfitting in large-scale spatiotemporal data modeling, constraining the model’s capacity to learn complex patterns and limiting its performance and generalization in medium- and long-term forecasting. Zhang et al. [20] developed a Transformer-based model integrated with graph convolutional networks (GCNs) for power grid load forecasting, employing a feedforward neural network for final predictions. Although graph structure is incorporated, the global self-attention mechanism applies uniform weighting across all time steps, which may overlook local variations, impair key spatiotemporal feature extraction, and increase computational cost and model complexity. Xu et al. [21] designed a cloud image-based prediction framework that integrates a Vision Transformer model with a gated recurrent unit (GRU) encoder for high-dimensional latent feature analysis, using a multi-layer perceptron (MLP) for step-by-step PV power predictions. Zhang et al. [22] proposed a Transformer-based prediction framework that combines images with quantitative solar irradiance measurements and optimizes the current mainstream architecture using gating mechanisms.
To address these limitations, this paper proposes the SP-Transformer, an enhanced Transformer-based model specifically designed for PV power forecasting. It tackles three key challenges: effectively integrating meteorological and geographical features beyond temporal information in positional encoding; capturing the non-stationary characteristics and sudden fluctuations of PV power that threaten grid stability; and mitigating the quadratic complexity growth of self-attention to maintain feasible computational efficiency with large-scale medium- and long-term data.
The main contributions of this paper are as follows:
  • In this paper, a spatiotemporal position encoding method is proposed. By embedding encoded vectors containing spatial information of sites into the input meteorological time series, the model is able to more accurately capture the spatiotemporal dependencies between different locations. This enhancement improves prediction accuracy and effectively mitigates the impact of abrupt PV power fluctuations caused by spatial differences on the power system.
  • This study proposes a spatiotemporal probsparse self-attention mechanism, which enhances the accuracy and efficiency of PV power forecasting by incorporating the Haversine distance metric and a probabilistic sparsity strategy.
  • To address the issue of low efficiency in medium- to long-term photovoltaic power forecasting, this paper proposes a feature pyramid self-attention distillation module (FPSA). The FPSA employs multi-scale depthwise separable convolutions to construct a hierarchical feature pyramid structure, which ensures efficient feature extraction and comprehensive information transmission, thereby significantly reducing information loss and enhancing model stability. The proposed module effectively captures latent spatiotemporal patterns in photovoltaic power data, achieving high prediction accuracy and strong generalization capability under complex environmental conditions. This provides a solid foundation for tackling the key challenges associated with long-term forecasting.

2. Materials and Methods

2.1. Materials

2.1.1. Dataset

The dataset used in this study includes photovoltaic power data and meteorological data, covering the time span from March 2022 to February 2023. In the dataset, 70% of the data is allocated for the training set, 20% for the testing set, and the remaining 10% for the validation set. To eliminate instances of nearly zero photovoltaic power during nighttime, this study only selected data from 8 AM to 6 PM each day—this time window is determined based on the effective sunshine duration of the study area during the research period and can well cover the high-efficiency interval of photovoltaic power generation. The photovoltaic power data is sourced from the open dataset provided by the Belgian electricity supplier Elia, and includes data from 108 sites. This paper utilized meteorological data from the WRF model provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), which includes information on temperature, humidity, wind speed, wind direction, cloud water content, cloud ice content, and solar irradiance. The meteorological data has a temporal resolution of 1 h and a spatial resolution of 1 km.
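As a concrete illustration of the preprocessing described above, the following sketch applies the daytime filter and a chronological 70/20/10 split. It assumes the merged PV power and meteorological records are stored in a pandas DataFrame with a timestamp column; this is an illustrative assumption, not the exact pipeline used in this study.

```python
import pandas as pd

def prepare_splits(df: pd.DataFrame, train_frac: float = 0.7, test_frac: float = 0.2):
    """Keep 08:00-18:00 records and split chronologically into train/test/validation."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    # Retain only the effective daylight window (hourly resolution, 08:00-18:00).
    day = df[df["timestamp"].dt.hour.between(8, 18)].sort_values("timestamp")

    n = len(day)
    n_train, n_test = int(n * train_frac), int(n * test_frac)
    train = day.iloc[:n_train]                    # 70% for training
    test = day.iloc[n_train:n_train + n_test]     # 20% for testing
    val = day.iloc[n_train + n_test:]             # remaining 10% for validation
    return train, test, val
```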

2.1.2. Experimental Setup and Scheme

The experimental setup used in this study features an Intel(R) Core(TM) i9-10900X processor, 32 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti GPU, with the operating system being Ubuntu 18.04. The environment runs Python 3.6 with PyTorch. All models are trained using the Adam optimizer for 100 epochs. Preliminary experiments show that the validation loss stabilizes after 80-90 epochs with no significant improvement thereafter; therefore, training for 100 epochs effectively prevents underfitting while avoiding redundant computation. The batch size is set to 64 to balance training efficiency and memory constraints; this configuration is fully supported by the RTX 2080 Ti GPU (11 GB) without memory overflow. Increasing the batch size to 128 leads to unstable gradient updates, whereas reducing it to 32 substantially increases training time. The initial learning rate is set to 0.001, as determined through preliminary testing: higher values (e.g., 0.01) cause divergence during early training, while lower values (e.g., 0.0001) result in slow convergence.
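The training configuration described above can be summarized in the following minimal PyTorch sketch; the placeholder model and dataset are illustrative stand-ins rather than the SP-Transformer and the Elia/ECMWF data.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Illustrative placeholders; the actual model and data are described elsewhere in this paper.
model = nn.Linear(9, 1)                                            # stands in for the SP-Transformer
dataset = TensorDataset(torch.randn(640, 9), torch.randn(640, 1))
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)    # batch size 64

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)          # initial learning rate 0.001
criterion = nn.MSELoss()

for epoch in range(100):                                           # 100 training epochs
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```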

2.2. Model Method

This paper proposes the SP-Transformer, a Transformer-based model that effectively captures the spatiotemporal dependencies among meteorological, geographical, and PV power data, aiming to improve the accuracy and efficiency of medium- and long-term PV power forecasting. The model architecture is shown in Figure 1.
The SP-Transformer leverages spatiotemporal positional encoding to better model the complex relationships in PV power data. By embedding key geographical information, this mechanism enhances the model’s capacity to learn intricate spatiotemporal patterns and improves its understanding of inter-site dependencies, thereby providing richer context for forecasting.
To further reduce both time and space complexity without compromising accuracy and stability, the spatiotemporal probsparse self-attention mechanism is introduced. By selectively attending to regions most relevant to the forecasting task, this mechanism captures essential spatiotemporal correlations more efficiently, improving prediction accuracy and enhancing scalability to large-scale PV power data.
The SP-Transformer also incorporates a feature pyramid self-attention distillation module to reduce information loss and enhance model stability through multi-scale feature extraction and fusion. This approach enables more comprehensive modeling of spatiotemporal patterns across scales, improving adaptability to diverse forecasting scenarios. The distillation mechanism further ensures greater accuracy and consistency in long-term predictions.

2.2.1. Spatiotemporal Position Encoding

In Transformer models, positional encoding compensates for the lack of inherent sequential structure, unlike RNN or LSTM. As Transformers rely solely on attention mechanisms, they cannot differentiate token order without positional cues. Traditional positional encoding captures relative positions in one-dimensional temporal sequences but fails to reflect spatial relationships. In medium- and long-term PV power forecasting, this limits the model to temporal dependencies, neglecting the spatial correlations among different nodes.
To enable the model to further extract the relative spatial position information of different photovoltaic (PV) sites, this study adds a spatiotemporal position encoding to the input sequence based on the aforementioned foundation. The spatiotemporal position encoding is composed of scalar projection, temporal position encoding (including local time stamp and global time stamp), and spatial position encoding, as illustrated in Figure 2. After the input sequence undergoes the spatiotemporal position encoding process, the input vector of the model is obtained, and the specific process is shown in Equation (1):
X_{\mathrm{en}}[i]^{t} = \alpha u_{i}^{t} + \mathrm{PE}_{L_x \times (t-1)+i} + \sum_{p} \left[ \mathrm{SE}_{L_x \times (t-1)+i} \right]_{p} + \mathrm{SSE}_{L_x \times (t-1)+i}    (1)
where u_i^t represents the scalar projection, and α serves as a scaling factor that balances the magnitude of the scalar projection against the other positional and feature encodings. The value of α depends on whether the input sequence has been normalized. If the input data have been normalized, α can be set to 1; otherwise, α should be chosen according to the scale of the input data so that the feature dimensions have consistent magnitudes and no single group of features dominates the attention computation. The scalars in this paper include historical photovoltaic power, temperature, humidity, horizontal wind speed, vertical wind speed, wind direction, cloud water content, cloud ice content, and solar irradiance. PE denotes the local time stamp, SE the global time stamp, and SSE the spatial position encoding. t denotes the time step, L_x is the length of the input scalar sequence, i is the current position, and p indexes the global time stamps.
The local time stamp corresponds to the positional encoding of the Transformer, as shown in Equations (2) and (3):
PE_{(pos,\,2j)} = \sin\!\left( pos \big/ (2L_x)^{2j/d_{\mathrm{model}}} \right)    (2)
PE_{(pos,\,2j+1)} = \cos\!\left( pos \big/ (2L_x)^{2j/d_{\mathrm{model}}} \right)    (3)
where pos represents the position of the data point at the current moment, 2j indexes the even feature dimensions, 2j + 1 indexes the odd feature dimensions, and d_model is the feature dimension of the input sequence.
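A minimal sketch of this local time stamp is given below; it follows Equations (2) and (3) directly and assumes an even feature dimension.

```python
import numpy as np

def local_time_stamp(L_x: int, d_model: int) -> np.ndarray:
    """Sinusoidal local time stamp following Equations (2) and (3)."""
    assert d_model % 2 == 0, "assumes an even feature dimension"
    pe = np.zeros((L_x, d_model))
    pos = np.arange(L_x)[:, None]                 # position of each data point
    two_j = np.arange(0, d_model, 2)[None, :]     # even dimension indices 2j
    angle = pos / np.power(2 * L_x, two_j / d_model)
    pe[:, 0::2] = np.sin(angle)                   # even dimensions, Equation (2)
    pe[:, 1::2] = np.cos(angle)                   # odd dimensions, Equation (3)
    return pe
```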
The global time stamp selects hierarchical timestamps, which helps to enhance the model’s ability to capture long-term dependencies. Considering that photovoltaic power is minimal during the night and early morning, this paper selects data from 8:00 to 18:00 each day as input. Additionally, since photovoltaic power generation is primarily influenced by seasonal changes and the alternation of day and night, the impact of annual, monthly, and daily time features on photovoltaic power is not significant. Therefore, this paper chooses season and hour as the global time stamps. This choice reduces redundant temporal noise, enhances computational efficiency, and focuses the model’s attention on the most relevant time scales affecting solar radiation and PV output patterns.
This paper adopts latitude and longitude coordinates as spatial position encoding, as the geographical location of PV stations influences factors such as sunlight duration and solar incidence angle. Incorporating these coordinates enables the model to better capture spatial differences and more accurately reflect variations in solar radiation and related variables.
Spatial position encoding enhances the model’s ability to capture spatial correlations, as neighboring stations often share similar lighting and weather patterns. By incorporating spatial relationships, the model can more accurately predict future power output.
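The composition in Equation (1) can be illustrated with the sketch below, which sums the scalar projection, the local time stamp, the global time stamps (season and hour), and a latitude/longitude spatial encoding. The specific layer choices (learned embeddings for season and hour, a linear projection of the coordinates) are assumptions made for illustration, not the exact implementation.

```python
import torch
from torch import nn

class SpatiotemporalEncoding(nn.Module):
    """Illustrative composition of Equation (1); layer choices are assumptions."""

    def __init__(self, n_features: int, d_model: int, alpha: float = 1.0):
        super().__init__()
        self.alpha = alpha
        self.value_proj = nn.Linear(n_features, d_model)   # scalar projection u
        self.season_emb = nn.Embedding(4, d_model)         # global time stamp: season
        self.hour_emb = nn.Embedding(24, d_model)          # global time stamp: hour
        self.space_proj = nn.Linear(2, d_model)            # spatial encoding from (lat, lon)

    def forward(self, x, season, hour, latlon, local_pe):
        # x: (B, L, n_features); season, hour: (B, L) integer tensors;
        # latlon: (B, L, 2); local_pe: (L, d_model) from the sinusoidal encoding above.
        return (self.alpha * self.value_proj(x)
                + local_pe.unsqueeze(0)
                + self.season_emb(season)
                + self.hour_emb(hour)
                + self.space_proj(latlon))
```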

2.2.2. Spatiotemporal Probsparse Self-Attention Mechanism

The self-attention mechanism in Transformer models incurs high computational complexity when processing long sequences, as each position attends to all others. This quadratic growth in complexity hampers efficiency. However, not all data points are strongly correlated. To address this, we propose a spatiotemporal probsparse self-attention mechanism that combines a Haversine distance metric and a probabilistic sparsity strategy. The Haversine metric captures spatial correlations by identifying geographically proximate “active” points, while the probsparse self-attention mechanism uses KL divergence to focus on the most informative time steps. These components work in parallel to address sparsity in both spatial and temporal dimensions. By combining them, the model concentrates computational resources on the most significant spatiotemporal interactions, enhancing its ability to capture long-range dependencies and sudden fluctuations in PV power over time.
Adjacent photovoltaic power stations are typically influenced by similar meteorological conditions and environmental factors, such as similar terrain and solar incidence angles. By leveraging the power data from nearby stations, the model can learn these shared pieces of information, thereby better capturing spatial correlations. To select nearby active points of photovoltaic power stations in the spatial dimension, the spatiotemporal probsparse self-attention mechanism is employed. The equation is as follows:
d(i,j) = 2r \cdot \arcsin\!\left( \sqrt{ \sin^2\!\left( \frac{lat_i - lat_j}{2} \right) + \cos(lat_i)\cdot\cos(lat_j)\cdot\sin^2\!\left( \frac{lon_i - lon_j}{2} \right) } \right)    (4)
D(i,j) = \frac{1}{L}\sum_{j=1}^{L} d(i,j) - d(i,j)    (5)
where d denotes the distance between two photovoltaic sites, r is the average radius of the Earth, lon_i, lat_i are the longitude and latitude of the target site, and lon_j, lat_j are the longitude and latitude of the selected site. L denotes the total number of photovoltaic power stations. D is defined as the difference between the mean distance of the target site to all other sites and the distance from the target site to the selected site. If D is greater than 0, the selected site is considered an active point that may influence the photovoltaic power forecasting of the target site. Global attention is calculated for the photovoltaic power data of the selected active points and the target site at each moment, resulting in photovoltaic power data that takes into account spatial location information.
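The spatial selection step of Equations (4) and (5) can be sketched as follows; the function names and the kilometre value used for the Earth's radius are illustrative assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0   # assumed mean Earth radius

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance between two sites, Equation (4); inputs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat1 - lat2) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon1 - lon2) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def active_points(coords, target):
    """Select active neighbour sites for a target site, Equation (5):
    sites closer than the target's mean distance to all sites (D > 0)."""
    lat_t, lon_t = coords[target]
    d = np.array([haversine(lat_t, lon_t, lat, lon) for lat, lon in coords])
    D = d.mean() - d
    return [j for j, Dj in enumerate(D) if Dj > 0 and j != target]
```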
The traditional Transformer model utilizes self-attention mechanisms with a time complexity of O(L^2), which leads to high memory consumption and low computational efficiency when processing long sequence data. The probsparse self-attention mechanism addresses this issue by calculating the difference between the target point's attention distribution and a uniform distribution, thereby identifying points that significantly contribute to the attention computation while ignoring others. This approach reduces the time complexity of the Transformer model from O(L^2) to O(L ln L), substantially enhancing the model's performance in predicting long sequence data. The probsparse self-attention mechanism utilizes the Kullback-Leibler (KL) divergence to measure the discrepancy between the attention probability distribution and the uniform distribution. The specific equation is as follows:
\bar{M}(q_i, K) = \max_{j}\left\{ \frac{q_i k_j^{\top}}{\sqrt{d}} \right\} - \frac{1}{L_K}\sum_{j=1}^{L_K} \frac{q_i k_j^{\top}}{\sqrt{d}}    (6)
where q represents the query, K represents the key, L_K represents the total number of keys, and M̄(q_i, K) represents the sparsity measure for the i-th query. If M̄(q_i, K) for the i-th query is large, the query is considered to contribute more to the attention. Selecting the several points with the largest M̄(q_i, K) can approximate the attention probability distribution. The final equation for the spatiotemporal probsparse self-attention mechanism is:
A(Q, K, V) = \mathrm{Softmax}\!\left( \frac{\bar{Q} K^{\top}}{\sqrt{d}} \right) V    (7)
where Q̄ is a sparse matrix that contains only the u queries selected by the spatiotemporal probsparse self-attention mechanism, with the other positions filled with zeros, where u = c · ln(L_Q) and c is a constant sampling factor. The spatiotemporal probsparse self-attention mechanism selects active points in both the spatial and temporal dimensions, thereby limiting the model's complexity to O(L_K ln L_Q). This approach not only captures the spatiotemporal correlations within the input data more effectively but also focuses on regions that are critical for the prediction task, thereby enhancing the prediction accuracy with a more efficient computational complexity. This design makes the model more adaptable to large-scale photovoltaic power data, enhancing its feasibility in practical application scenarios.
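A simplified NumPy sketch of the probsparse computation in Equations (6) and (7) follows. For clarity it scores every query against every key, whereas the actual mechanism samples keys to keep the complexity at O(L_K ln L_Q); the zero filling of non-selected queries follows the description above, and the sampling factor c is an illustrative value.

```python
import numpy as np

def probsparse_attention(Q, K, V, c: float = 5.0):
    """Sketch of Equations (6)-(7): score each query by (max - mean) of its scaled
    dot products, keep the top u = c * ln(L_Q) queries, and attend only with them."""
    L_Q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                     # (L_Q, L_K)
    M = scores.max(axis=1) - scores.mean(axis=1)      # sparsity measure, Equation (6)

    u = min(L_Q, max(1, int(c * np.log(L_Q))))        # number of retained queries
    top = np.argsort(M)[-u:]                          # most informative queries

    out = np.zeros((L_Q, V.shape[1]))                 # non-selected queries stay zero-filled
    w = np.exp(scores[top] - scores[top].max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                 # softmax over keys, Equation (7)
    out[top] = w @ V
    return out
```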

2.2.3. Feature Pyramid Self-Attention Distillation Module

After passing through the spatiotemporal probsparse self-attention layer, the input sequence produces sparse features. However, redundant values remain in the feature map. To highlight key features and reduce computational cost, feature distillation is an effective solution. Some models use max-pooling to compress attention blocks, halving the input sequence at each encoder layer and concatenating all outputs. However, max-pooling only retains the maximum value, discarding potentially important sub-maximal features. Similarly, directly applying downsampling can result in a substantial loss of long-term dependencies. The Feature Pyramid Self-Attention Distillation Module is utilized to more effectively extract dominant key features, as shown in Figure 3.
In this figure, D represents the input scalar, L denotes the length of the time series, n refers to the number of multi-head attention heads, and i indicates the number of encoder layers. From the second stacked encoder layer onward, the original input sequence is downsampled using depthwise separable convolution. Each layer uses a different convolution kernel size, starting from a base size of 1 with a stride of 2; the kernel size used in layer i equals the base size plus the product of the layer index and the stride, so that shallow layers focus on local details while deeper layers extract a broader range of global features.
By utilizing multi-scale convolution along the temporal dimension, the network is enabled to focus on temporal information of varying lengths. In the channel dimension, a 1 × 1 convolution is employed to extract cross-features among different elements, thereby enriching the network’s receptive field and allowing for a more comprehensive understanding of the structure and content of the input sequence. The multi-level convolutional network can learn more abstract and high-level features, which aids the network in establishing an understanding of complex patterns and objects.
Furthermore, convolutional downsampling is applied to the attention blocks in each Encoder layer. Compared to max-pooling, convolution more effectively captures local features of the input data. By stacking multiple convolutional and pooling layers, the model can adapt to features at different scales and learn deeper hierarchical representations. The feature pyramid self-attention distillation process is shown in Equation (8).
X_{j+1}^{i} = \mathrm{ELU}\!\left( \mathrm{DSConv}\!\left( [X_{j}^{i}]_{\mathrm{AB}} \right) \right)    (8)
where i represents the index of the Encoder layer, j denotes the index of the distillation layer within each Encoder, [·]_{AB} indicates the self-attention block operation, DSConv refers to depthwise separable convolution, and ELU is the activation function. Through the distillation process, the model can extract key features from the stacked Encoders while reducing redundant information. Finally, the outputs of the stacked Encoders are concatenated along the channel dimension to form the output of the Encoder. The feature map produced by the Encoder is then processed through two stacked Decoder layers to obtain the final photovoltaic power prediction value.
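One distillation step of Equation (8) can be sketched as a depthwise separable convolution over the temporal dimension followed by an ELU activation, as below. The kernel-size schedule and padding choices are assumptions based on the description above, not the exact implementation.

```python
import torch
from torch import nn

class DistillLayer(nn.Module):
    """One distillation step of Equation (8): depthwise separable convolution along
    the temporal dimension plus ELU, halving the sequence length (stride 2)."""

    def __init__(self, d_model: int, layer_idx: int, base_kernel: int = 1, stride: int = 2):
        super().__init__()
        kernel = base_kernel + layer_idx * stride      # assumed kernel-size growth rule
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size=kernel,
                                   stride=stride, padding=kernel // 2, groups=d_model)
        self.pointwise = nn.Conv1d(d_model, d_model, kernel_size=1)   # 1x1 cross-channel mixing
        self.act = nn.ELU()

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)                          # to (batch, d_model, seq_len) for Conv1d
        x = self.act(self.pointwise(self.depthwise(x)))
        return x.transpose(1, 2)                       # back to (batch, ~seq_len/2, d_model)
```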

2.3. Experimental Method

2.3.1. Comparative Experimental Setup

The proposed model was benchmarked against established time series forecasting methods—including LSTM, Transformer [23], Log Transformer [24], Informer [25], and Fedformer [26]—to evaluate its performance in photovoltaic power prediction. All models were trained and evaluated under identical conditions: the dataset was split into training and test sets, with hyperparameters tuned on a validation set. Model performance was assessed quantitatively using Root Mean Square Error (RMSE) and Mean Absolute Error (MAE), which provide complementary measures of prediction accuracy and stability.

2.3.2. Ablation Experimental Setup

To validate the effectiveness of each module in the model, this study conducted ablation experiments, systematically removing specific components and observing their impact on overall performance. The complete SP-Transformer, which includes all modules, served as the baseline model. In Model 1, the spatiotemporal positional encoding was replaced with sequential positional encoding. In Model 2, the spatiotemporal probsparse self-attention mechanism was substituted with a global self-attention mechanism. In Model 3, the feature pyramid self-attention distillation module was replaced with the decoder structure from the Transformer. RMSE and MAE were utilized to evaluate the prediction performance of each model.

2.4. Evaluation Method

This study employs RMSE and MAE as two metrics to evaluate the performance of the photovoltaic power prediction model. The Equations for RMSE and MAE are as follows:
\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2} }    (9)
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|    (10)
where ŷ_i represents the predicted photovoltaic power of the i-th sample, y_i denotes the corresponding actual photovoltaic power, and N is the number of samples.
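For reference, the two metrics can be computed directly as follows.

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root Mean Square Error, Equation (9)."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Error, Equation (10)."""
    return float(np.mean(np.abs(y_true - y_pred)))
```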

3. Results and Discussion

3.1. Comparative Experimental Results

To validate the advantages of the proposed SP-Transformer in the task of photovoltaic power forecasting, this paper compares it with several commonly used time series forecasting methods, including LSTM, Transformer, LogTransformer, Informer, and FEDformer. By gradually increasing the prediction time steps, the performance of each model in long-sequence forecasting tasks is systematically evaluated.
The RMSE values of the photovoltaic power prediction results from the aforementioned methods on the testing set are visually represented in Figure 4. From the figure, it can be observed that as the prediction time steps are extended, most models begin to exhibit significant deviations in their predictions. However, the SP-Transformer, utilizing the spatiotemporal probsparse self-attention mechanism, is better able to capture the spatiotemporal relationships between photovoltaic power and geographical as well as meteorological factors. By selectively focusing on regions that have a critical impact on the prediction task, the model enhances prediction accuracy and stability while maintaining a more efficient computational complexity. Consequently, compared to other models, the SP-Transformer demonstrates stable predictive performance across all time steps.
Figure 5 displays a comparison of the prediction results for the next 40 time steps from each model at the 48-h and 336-h nodes. Figure 6 further illustrates the declining trend in prediction accuracy of these models from the 48-h to the 336-h forecast horizon.
As illustrated, while some models achieve performance comparable to the SP-Transformer at a 48-h horizon, a significant accuracy decline is observed in the 336-h task for all models except the SP-Transformer. The SP-Transformer exhibits the smallest accuracy reduction (10%), substantially outperforming the second-best FEDformer (24% drop) and the LSTM (32% drop). This superior long-term stability stems from its integrated architectural innovations: the spatiotemporal positional encoding embeds geographical context, enabling superior modeling of cross-site dependencies and meteorological and geographical relationships; the spatiotemporal probsparse self-attention mechanism reduces computational complexity by focusing on pivotal spatiotemporal "active points", enhancing both efficiency and predictive focus; and the feature pyramid self-attention distillation module, leveraging multi-scale depthwise separable convolutions, mitigates information loss and stabilizes learning. Collectively, these components systematically address the challenges of weak long-range correlations and low forecasting efficiency, establishing the SP-Transformer's state-of-the-art performance in medium- and long-term PV power forecasting.
Table 2 presents the results of the comparative experiments for each model. This paper utilized RMSE and MAE as evaluation metrics, with the best result for each metric highlighted in bold.
Ten random data sets were selected from the testing set to conduct experiments on each model, and the RMSE and MAE metrics were averaged over the ten trials. From the data in Table 2, it is evident that the SP-Transformer outperforms the other methods. The RMSE for the SP-Transformer at 48 h is 0.761, which is 4.2% lower than that of FEDformer [26], 10.8% lower than Informer [25], 20.3% lower than LogTransformer [24], and 33.9% lower than Transformer [23]. The SP-Transformer demonstrates high prediction accuracy across all four prediction time steps, maintaining an RMSE of 1.061 at 336 h. Although this represents an increase in error compared to the 48-h prediction, the growth trend is steady and gradual, remaining lower than that of the other methods.
Overall, the SP-Transformer achieves the highest accuracy in predicting photovoltaic power and exhibits the most stable performance in long sequence predictions.

3.2. Ablation Experimental Results

To isolate the effects of individual components, a series of ablations were performed against a Transformer baseline. Model 1 utilized sequential instead of spatiotemporal positional encoding; Model 2 applied global self-attention in place of the spatiotemporal probsparse self-attention mechanism; and Model 3 was modified by swapping its feature pyramid self-attention distillation module for a standard Transformer decoder.
Figure 7 displays the prediction results of each model from the ablation experiments at the 48-h and 336-h marks, while Figure 8 illustrates the decline in prediction accuracy for these models as the forecast horizon extends from 48 h to 336 h.
As shown in the figure, the complete SP-Transformer attained prediction accuracies of 93.8% and 90.4% for the 48-h and 336-h forecasts, respectively. While Models 1–3, each ablating a key component, still exceeded the standard Transformer’s performance, they exhibited clear performance degradation. This result underscores the collective contribution of all components to forecasting accuracy and efficiency. The synergistic interaction of these modules bolsters model stability and generalization across time horizons, yielding SP-Transformer’s consistent superiority in mid- to long-term forecasting tasks.
The ablation experiment confirms the contribution of each proposed component. Model 1, using standard sequential encoding, was consistently outperformed by the full SP-Transformer, underscoring the importance of spatiotemporal position encoding in capturing geographic relationships for accurate medium- to long-term forecasting. Similarly, replacing the spatiotemporal probsparse self-attention mechanism in Model 2 reduced accuracy, validating its role in addressing data redundancy through selective focus on critical regions to enhance model adaptability. Finally, Model 3 exhibited performance degradation when its feature pyramid self-attention distillation module was replaced, demonstrating this module’s efficacy in reducing information loss and ensuring stability for robust long-term predictions.
The RMSE values of the photovoltaic power prediction results from each model on the testing set are visually represented in Figure 9.
As shown in Table 3, all proposed model variants demonstrated performance superior to the standard Transformer in the ablation studies. The complete SP-Transformer delivered the best performance across all forecasting horizons.
Model 1 achieved reductions in RMSE and MAE of 25.8% and 28.8% (48-h) and 1.8% and 4.3% (336-h), confirming that the introduced spatiotemporal positional encoding effectively captures inter-site correlations and enhances modeling of complex spatiotemporal features.
Model 2 achieved RMSE and MAE reductions of 16.9% and 23.3% (48-h) and 11.3% and 16.1% (336-h), demonstrating the efficacy of the spatiotemporal probsparse self-attention mechanism. By combining the Haversine distance metric with the KL divergence measure, the mechanism identifies the most contributive "active points" across spatiotemporal dimensions, thereby enhancing the capture of critical dependencies while improving computational efficiency.
Model 3 achieved the most substantial performance gains, reducing RMSE and MAE by 30.4% and 29.2% (48-h) and by 13.8% and 14.5% (336-h), respectively. These improvements are attributed to its feature pyramid self-attention distillation module, which enhances stability and generalization through multi-scale feature extraction, thereby mitigating information loss and addressing the low efficiency of traditional models in long-term forecasting.

3.3. Discussion

This study systematically addresses the persistent challenges in medium- and long-term photovoltaic power forecasting through the development of the SP-Transformer architecture.
The incorporation of spatiotemporal positional encoding represents a fundamental advancement in geospatial–temporal representation learning for distributed photovoltaic systems. This encoding scheme effectively embeds the relative geospatial configuration of generation sites within the temporal evolution of power generation, establishing a coherent physical basis for the model’s reasoning process. Empirical evidence demonstrates that this approach yields a substantial 33.9% error reduction in 48-h forecasting compared to the baseline Transformer architecture, underscoring the critical importance of explicit spatial representation in renewable energy forecasting paradigms.
The proposed spatiotemporal probsparse self-attention mechanism addresses the computational complexity barrier inherent in conventional self-attention frameworks while simultaneously enhancing predictive accuracy. By implementing a strategic sampling methodology across spatiotemporal dimensions, the mechanism achieves computational efficiency without compromising representational capacity. The observed 4.2% error reduction in short-term forecasting, coupled with significantly constrained error progression in extended forecasting horizons (manifesting as merely 10% error increase versus 32% in LSTM architectures), indicates superior temporal coherence in capturing long-range dependencies characteristic of meteorological systems.
Architectural robustness is further augmented through the feature pyramid self-attention distillation module, which implements hierarchical feature preservation across multiple temporal scales. This design characteristic proves particularly valuable in 336-h forecasting scenarios, where the model demonstrates exceptional stability with only 13.9% error escalation, substantially outperforming contemporary approaches such as FEDformer (24% error increase). The module’s capacity to maintain predictive integrity across extended temporal horizons suggests effective mitigation of information degradation typically encountered in long-sequence forecasting.

4. Conclusions

This paper presents the SP-Transformer model. By introducing spatiotemporal position encoding, the model deeply integrates photovoltaic (PV) site data with spatiotemporal information, effectively improving its ability to capture relative spatial relationships between different sites and its sensitivity to geographic differences, thereby enhancing the understanding of complex spatiotemporal dependencies. By designing a spatiotemporal ProbSparse self-attention mechanism, it selects key information points in the spatiotemporal dimension, which not only reduces model complexity and preserves computational efficiency but also strengthens the capture of spatiotemporal correlations, enabling it to adapt to large-scale PV data application scenarios. By constructing a feature pyramid-based self-attention distillation module, it reduces information loss through multi-scale feature extraction, significantly improving model stability and allowing for flexible adaptation to different prediction scenarios. Experimental results demonstrate that the SP-Transformer exhibits excellent performance in medium- and long-term PV power forecasting: compared with traditional sequential models (e.g., LSTM), it reduces the error by approximately 70% in the 336-h forecasting task, highlighting the inherent limitations of recurrent architectures in capturing long-term spatiotemporal dependencies; compared with the latest Transformer variants (e.g., FEDformer), it improves the prediction accuracy by about 5.5% in the 336-h forecasting task, and reduces computational resource consumption by 30% due to the sparse attention mechanism. This effectively balances forecasting efficiency and accuracy, successfully solving the key problem in practical PV forecasting applications.

Author Contributions

Conceptualization, B.W. and Y.Z.; data curation, J.F. and J.H.; formal analysis, B.W.; funding acquisition, J.C., J.F. and L.T.; investigation, L.T.; methodology, B.W., J.F. and J.H.; project administration, Y.Z.; resources, J.F.; software, J.C. and J.F.; supervision, L.T.; validation, J.C.; visualization, J.H. and L.T.; writing—original draft, B.W.; writing—review and editing, L.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Plan of China Scientific and Technological Innovation 2030–“New Generation Artificial Intelligence” Major Project (2021ZD0102100).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

We sincerely thank our colleagues at the Power Grid Planning and Research Center of Guizhou Power Grid Co., Ltd. and Nanjing University of Information Science and Technology for their valuable support and collaboration in this research.

Conflicts of Interest

Authors Bin Wang, Julong Chen, Yongqing Zhu, Junqiu Fan, and Jiang Hu are employed by the Power Grid Planning and Research Center, Guizhou Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Qiu, T.; Wang, L.; Lu, Y.; Zhang, M.; Qin, W.; Wang, S.; Wang, L. Potential assessment of photovoltaic power generation in China. Renew. Sustain. Energy Rev. 2022, 154, 111900. [Google Scholar] [CrossRef]
  2. Yang, Y.; Che, J.; Deng, C.; Li, L. Sequential grid approach based support vector regression for short-term electric load forecasting. Appl. Energy 2019, 238, 1010–1021. [Google Scholar] [CrossRef]
  3. VanDeventer, W.; Jamei, E.; Thirunavukkarasu, G.S.; Seyedmahmoudian, M.; Soon, T.K.; Horan, B.; Mekhilef, S.; Stojcevski, A. Short-term PV power forecasting using hybrid GASVM technique. Renew. Energy 2019, 140, 367–379. [Google Scholar] [CrossRef]
  4. Zhu, C.; Wang, M.; Guo, M.; Deng, J.; Du, Q.; Wei, W.; Zhang, Y. Innovative approaches to solar energy forecasting: Unveiling the power of hybrid models and machine learning algorithms for photovoltaic power optimization. J. Supercomput. 2025, 81, 20. [Google Scholar] [CrossRef]
  5. Nguyen, R.; Yang, Y.; Tohmeh, A.; Yeh, H.G. Predicting PV power generation using SVM regression. In Proceedings of the 2021 IEEE Green Energy and Smart Systems Conference (IGESSC), Long Beach, CA, USA, 1–2 November 2021; pp. 1–5. [Google Scholar] [CrossRef]
  6. Zhou, Z.; Liu, L.; Dai, N.Y. Day-ahead power forecasting model for a photovoltaic plant in Macao based on weather classification using SVM/PCC/LM-ANN. In Proceedings of the 2021 IEEE Sustainable Power and Energy Conference (iSPEC), Nanjing, China, 23–25 December 2021; pp. 775–780. [Google Scholar] [CrossRef]
  7. Niu, D.; Wang, K.; Sun, L.; Wu, J.; Xu, X. Short-term photovoltaic power generation forecasting based on random forest feature selection and CEEMD: A case study. Appl. Soft Comput. 2020, 93, 106389. [Google Scholar] [CrossRef]
  8. Ali, M.; Prasad, R.; Xiang, Y.; Khan, M.; Farooque, A.A.; Zong, T.; Yaseen, Z.M. Variational mode decomposition based random forest model for solar radiation forecasting: New emerging machine learning technology. Energy Rep. 2021, 7, 6700–6717. [Google Scholar] [CrossRef]
  9. Gao, Y.; Wang, J.; Guo, L.; Peng, H. Short-Term Photovoltaic Power Prediction Using Nonlinear Spiking Neural P Systems. Sustainability 2024, 16, 1709. [Google Scholar] [CrossRef]
  10. Huang, X.; Li, Q.; Tai, Y.; Chen, Z.; Liu, J.; Shi, J.; Liu, W. Time series forecasting for hourly photovoltaic power using conditional generative adversarial network and Bi-LSTM. Energy 2022, 246, 123403. [Google Scholar] [CrossRef]
  11. Wang, L.; Mao, M.; Xie, J.; Liao, Z.; Zhang, H.; Li, H. Accurate solar PV power prediction interval method based on frequency-domain decomposition and LSTM model. Energy 2023, 262, 125592. [Google Scholar] [CrossRef]
  12. Hochreiter, S. Long Short-term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  13. Qu, J.; Qian, Z.; Pei, Y. Day-ahead hourly photovoltaic power forecasting using attention-based CNN-LSTM neural network embedded with multiple relevant and target variables prediction pattern. Energy 2021, 232, 120996. [Google Scholar] [CrossRef]
  14. Limouni, T.; Yaagoubi, R.; Bouziane, K.; Guissi, K.; Baali, E.H. Accurate one step and multistep forecasting of very short-term PV power using LSTM-TCN model. Renew. Energy 2023, 205, 1010–1024. [Google Scholar] [CrossRef]
  15. Huang, S.; Liu, Y.; Zhang, F.; Li, Y.; Li, J.; Zhang, C. CrossWaveNet: A dual-channel network with deep cross-decomposition for Long-term Time Series Forecasting. Expert Syst. Appl. 2024, 238, 121642. [Google Scholar] [CrossRef]
  16. Wang, S.; Ma, J. A novel GBDT-BiLSTM hybrid model on improving day-ahead photovoltaic prediction. Sci. Rep. 2023, 13, 15113. [Google Scholar] [CrossRef]
  17. Kushwaha, V.; Pindoriya, N.M. A SARIMA-RVFL hybrid model assisted by wavelet decomposition for very short-term solar PV power generation forecast. Renew. Energy 2019, 140, 124–139. [Google Scholar] [CrossRef]
  18. Ran, P.; Dong, K.; Liu, X.; Wang, J. Short-term load forecasting based on CEEMDAN and Transformer. Electr. Power Syst. Res. 2023, 214, 108885. [Google Scholar] [CrossRef]
  19. Cao, Y.; Liu, G.; Luo, D.; Bavirisetti, D.P.; Xiao, G. Multi-timescale photovoltaic power forecasting using an improved Stacking ensemble algorithm based LSTM-Informer model. Energy 2023, 283, 128669. [Google Scholar] [CrossRef]
  20. Zhang, Q.; Chen, J.; Xiao, G.; He, S.; Deng, K. TransformGraph: A novel short-term electricity net load forecasting model. Energy Rep. 2023, 9, 2705–2717. [Google Scholar] [CrossRef]
  21. Xu, S.; Zhang, R.; Ma, H.; Ekanayake, C.; Cui, Y. On vision transformer for ultra-short-term forecasting of photovoltaic generation using sky images. Sol. Energy 2024, 267, 112203. [Google Scholar] [CrossRef]
  22. Zhang, L.; Wilson, R.; Sumner, M.; Wu, Y. Advanced multimodal fusion method for very short-term solar irradiance forecasting using sky images and meteorological data: A gate and transformer mechanism approach. Renew. Energy 2023, 216, 118952. [Google Scholar] [CrossRef]
  23. Vaswani, A. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  24. Li, S.; Jin, X.; Xuan, Y.; Zhou, X.; Chen, W.; Wang, Y.X.; Yan, X. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. arXiv 2019, arXiv:1907.00235. [Google Scholar] [CrossRef]
  25. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar] [CrossRef]
  26. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar] [CrossRef]
Figure 1. Overall Architecture of SP-Transformer.
Figure 2. The structure of Spatiotemporal Position Encoding.
Figure 3. The structure of Feature Pyramid Self-Attention Distillation Module.
Figure 4. Visualization of RMSE in Comparative Experiments.
Figure 5. Comparison of Prediction Results for Future 40 Time Steps at 48-h and 336-h Nodes Across Models.
Figure 6. Prediction Accuracy Degradation in Comparative Experiments.
Figure 7. Prediction Results of Ablation Experiments for Each Model at 48-h and 336-h Nodes.
Figure 8. Prediction Accuracy Degradation in Ablation Experiments.
Figure 9. Visualization of RMSE in Ablation Experiments.
Table 1. Overview of Transformer-Based Medium- and Long-Term PV Power Forecasting Methods.

| References | Contribution Points | Shortcomings |
|---|---|---|
| [18] | A hybrid model combining adaptive noise, complete ensemble empirical mode decomposition, sample entropy, and Transformer. | Positional encoding captures only temporal information and does not consider the impact of the relative positions of photovoltaic sites on the prediction results (shared by [18,19]). |
| [19] | An LSTM-Informer model based on an improved Stacking ensemble algorithm. | As above. |
| [20] | A combined Transformer and graph convolutional network model for power grid load forecasting. | The fully connected self-attention mechanism has quadratic complexity in the sequence length, greatly increasing the computational burden when processing long sequence data (shared by [20,21,22]). |
| [21] | A prediction framework integrating a Vision Transformer model and a gated recurrent unit. | As above. |
| [22] | A Transformer-based prediction framework that combines images with quantitative measurements of solar irradiance. | As above. |
Table 2. Comparative Experimental Results.

| Method | 48 h RMSE | 48 h MAE | 96 h RMSE | 96 h MAE | 168 h RMSE | 168 h MAE | 336 h RMSE | 336 h MAE |
|---|---|---|---|---|---|---|---|---|
| Ours | 0.761 | 0.558 | 0.784 | 0.562 | 0.905 | 0.655 | 1.061 | 0.768 |
| FEDformer [26] | 0.795 | 0.659 | 0.885 | 0.725 | 0.987 | 0.799 | 1.123 | 0.924 |
| Informer [25] | 0.854 | 0.663 | 0.913 | 0.706 | 1.053 | 0.811 | 1.096 | 0.844 |
| LogTransformer [24] | 0.956 | 0.714 | 1.002 | 0.739 | 1.084 | 0.847 | 1.209 | 0.952 |
| Transformer [23] | 1.151 | 0.932 | 1.162 | 0.914 | 1.394 | 0.952 | 1.232 | 0.976 |
| LSTM [12] | 2.360 | 1.934 | 2.363 | 1.946 | 2.418 | 1.934 | 4.384 | 3.494 |
Table 3. Ablation Experimental Results.

| Method | 48 h RMSE | 48 h MAE | 96 h RMSE | 96 h MAE | 168 h RMSE | 168 h MAE | 336 h RMSE | 336 h MAE |
|---|---|---|---|---|---|---|---|---|
| Ours | 0.761 | 0.558 | 0.784 | 0.562 | 0.905 | 0.655 | 1.061 | 0.768 |
| Model 1 | 0.854 | 0.663 | 0.913 | 0.706 | 1.053 | 0.811 | 1.209 | 0.952 |
| Model 2 | 0.956 | 0.714 | 1.002 | 0.737 | 1.139 | 0.874 | 1.096 | 0.835 |
| Model 3 | 0.800 | 0.659 | 0.885 | 0.706 | 0.987 | 0.800 | 1.061 | 0.835 |
| Transformer | 1.151 | 0.932 | 1.162 | 0.959 | 1.190 | 0.952 | 1.232 | 0.995 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

