In this work, a Transformer-based architecture integrated with ASSA is employed to investigate its impact on PV power generation forecasting. This section presents the structure and parameter settings of the Transformer model, explains the mechanism and advantages of ASSA, and designs four algorithmic configurations (a baseline Transformer, its integration with ASSA, and two further ASSA-based variants) to evaluate the performance improvements provided by ASSA.
2.1. Transformer, Adaptive Sparse Self-Attention, and Feature Refinement Feedforward Network
The Transformer is a well-known deep learning architecture based on self-attention, consisting of an encoder and a decoder, as illustrated in Figure 1. It models sequential dependencies through the integration of multi-head attention and a position-wise feedforward network. After embedding and positional encoding, the input passes through stacked layers, each containing a self-attention module and a feedforward network with residual connections and layer normalization to stabilize training. The decoder extends this design by adding an extra attention module to process encoder outputs and uses masking to ensure autoregressive generation. The Transformer architecture shown in Figure 1 serves as the baseline in this study, with its mathematical formulation provided in Appendix A.1.
Adaptive sparse self-attention (ASSA) aims to mitigate noise interactions from irrelevant regions in the Transformer while preserving informative features [28]. Since the Transformer attends to all tokens in the feature map, it may involve redundant or irrelevant regions during the computation process [29,30]. To address this issue, the core innovation of ASSA lies in adaptively combining two complementary branches, which jointly suppress noise and redundancy while maintaining the integrity of critical information. A squared ReLU-based self-attention mechanism (SSA) is introduced to suppress features with low query–key matching scores, while a dense self-attention branch (DSA) applies a softmax operation in parallel to preserve essential information and compensate for potential over-sparsity in the SSA, as shown in Figure 2. In essence, ASSA enables the model to adaptively control the sparsity of attention in task-specific scenarios, thereby enhancing robustness and feature selectivity. Unlike conventional sparsification techniques that rely on selection or restriction, ASSA achieves sparsity through adaptive fusion, effectively balancing sparse and dense representations to yield task-adaptive soft sparsity. Consequently, ASSA serves not merely as a sparsification method but as an attention regulation mechanism. The core mathematical formulation of ASSA is provided in Appendix A.2.
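For illustration, a minimal single-head PyTorch sketch of the ASSA computation is given below. The class name, the scalar fusion weights, and the small normalization term added to the sparse branch are assumptions made for readability; the exact formulation adopted in this study is that of Ref. [28], reproduced in Appendix A.2.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASSAttention(nn.Module):
    """Minimal single-head sketch of adaptive sparse self-attention (ASSA).

    Two branches share the same Q/K/V: a squared-ReLU branch (SSA) that
    zeroes out low query-key matching scores, and a softmax branch (DSA)
    that keeps dense interactions. Their outputs are fused with two
    learnable scalars normalized by a softmax, giving task-adaptive
    "soft" sparsity. The scalar fusion and the stabilizing normalization
    below are assumptions of this sketch.
    """

    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.w_v = nn.Linear(d_model, d_k, bias=False)
        self.scale = d_k ** -0.5
        # Learnable fusion weights for the sparse and dense branches.
        self.branch_logits = nn.Parameter(torch.zeros(2))
        self.out = nn.Linear(d_k, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale

        # SSA branch: squared ReLU suppresses weak query-key matches.
        sparse_attn = F.relu(scores) ** 2
        sparse_attn = sparse_attn / (sparse_attn.sum(dim=-1, keepdim=True) + 1e-6)

        # DSA branch: standard softmax keeps dense information.
        dense_attn = F.softmax(scores, dim=-1)

        # Adaptive fusion of the two branches.
        alpha = F.softmax(self.branch_logits, dim=0)
        attn = alpha[0] * sparse_attn + alpha[1] * dense_attn
        return self.out(torch.matmul(attn, v))

Because the two branch weights are trained jointly with the rest of the network, the effective degree of sparsity adapts to the data rather than being fixed in advance.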
Compared with the conventional multi-head self-attention mechanism in the Transformer, ASSA may offer several advantages in PV forecasting. By adaptively blending a sparse branch with the dense attention matrix, ASSA suppresses the contribution of redundant query–key interactions while introducing only a marginal computational overhead. It enhances the modeling of long-term dependencies in PV time-series data by selectively focusing on the most informative time steps, thereby mitigating noise interference. In addition, the sparsity and adaptivity of ASSA alleviate the risk of overfitting to short-term fluctuations, which can improve the generalization ability across different weather and illumination conditions. Therefore, this study investigates the application of the ASSA–Transformer architecture for PV forecasting. Specifically, we replace the standard self-attention mechanism in the Transformer with ASSA and employ a series of ASSA–Transformer variants for comparative analysis.
We also introduce the feature refinement feedforward network (FRFN) to optimize the feature representation by enhancing informative features and reducing redundancy along the channel dimension [28]. The main components of the FRFN include partial convolution (PartialConv), a gating mechanism, residual connections, and a nonlinear FNN, as illustrated in Figure 3. Specifically, the FRFN strengthens relevant feature elements through partial convolution and employs a gating mechanism to alleviate the computational burden caused by redundant information. Partial convolution applies lightweight operations to only a subset of input channels, thereby retaining key features while filtering out redundancy. This design substantially reduces the number of parameters and computational complexity, making it well suited to processing large-scale data [31]. The mathematical description of the FRFN is given in Appendix A.3.
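A compact sketch of how partial convolution, gating, and the residual pathway can be combined for one-dimensional sequences is shown below. The channel split ratio, layer widths, and module name are assumptions made for illustration and do not reproduce the exact FRFN of Ref. [28] given in Appendix A.3.

import torch
import torch.nn as nn

class FRFN1D(nn.Module):
    """Illustrative sketch of a feature refinement feedforward block for
    1-D sequences: a partial convolution refines only the first `n_conv`
    channels, a gated linear branch damps redundant features, and a
    residual connection preserves the input.
    """

    def __init__(self, d_model: int, hidden: int, conv_ratio: float = 0.25):
        super().__init__()
        self.n_conv = max(1, int(d_model * conv_ratio))  # assumed split ratio
        # Depth-wise 1-D convolution applied to a subset of channels only.
        self.partial_conv = nn.Conv1d(self.n_conv, self.n_conv,
                                      kernel_size=3, padding=1, groups=self.n_conv)
        # Gating: one linear branch produces values, the other a gate.
        self.value_proj = nn.Linear(d_model, hidden)
        self.gate_proj = nn.Linear(d_model, hidden)
        self.out_proj = nn.Linear(hidden, d_model)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        refined, untouched = x[..., :self.n_conv], x[..., self.n_conv:]
        refined = self.partial_conv(refined.transpose(1, 2)).transpose(1, 2)
        x_ref = torch.cat([refined, untouched], dim=-1)
        # Gated feedforward: the sigmoid gate suppresses uninformative features.
        gated = self.act(self.value_proj(x_ref)) * torch.sigmoid(self.gate_proj(x_ref))
        return x + self.out_proj(gated)  # residual connection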
The FRFN is well suited to time-series forecasting because of its adaptive feature refinement capabilities. The partial convolution operation enables dynamic feature selection by directing the model to the most informative input feature combinations at each time step. Concurrently, the gating mechanism acts as a data-dependent temporal filter that amplifies critical periods (e.g., weather transitions) and suppresses noisy intervals. This combination of sparse feature processing and adaptive gating effectively refines the complex and noisy spatiotemporal patterns in PV power generation data, yielding more robust forecasts than standard FNNs. Since this study focuses on one-dimensional time series, where temporal dependencies are critical, the position-wise FNN in Figure 1 is replaced by the FRFN, forming a variant of the ASSA–Transformer. Rather than directly adopting the complete ASSA FRFN algorithm from Ref. [28], this modification preserves the core FRFN functionality while placing greater emphasis on temporal feature processing.
2.2. Model Architecture
In this work, the Transformer-based ASSA algorithm and its variants are constructed for performance evaluation in PV power forecasting. The baseline is established using the standard Transformer described in Section 2.1. A simplified ASSA–Transformer is then obtained by replacing the multi-head attention with ASSA while the rest of the architecture is retained. Subsequently, the position-wise FFN is substituted with a deeper neural network for enhanced time-series feature extraction and is further replaced with the FRFN to assess the impact of a more complex architecture. The architectural designs of these ASSA–Transformer variants are presented in detail in the remainder of this subsection.
The original input data is denoted as $X \in \mathbb{R}^{N \times D}$, where $N$ is the number of time steps and $D$ is the number of features at each step. The PV generation target is placed in the last column, while the remaining columns correspond to environmental variables. To ensure consistency in the feature magnitudes, improve feature learning efficiency, and enhance prediction accuracy, the data is normalized as
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}},\qquad(1)$$
where $X_{\min}$ and $X_{\max}$ represent the minimum and maximum values in $X$, respectively. This linear mapping transforms the original data into the interval [0, 1], mitigating the influence of feature scale differences on model training. Then, the normalized data is prepared for input into the model. Assuming an input window with a length $W$, a prediction step length of $L$, and a batch size of $B$, the algorithm input is defined as $X_{\mathrm{in}} \in \mathbb{R}^{B \times W \times D}$.
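For clarity, the sketch below illustrates the normalization of Equation (1) and the construction of sliding input windows and prediction targets. The per-feature scaling and the helper names are assumptions made for illustration, not the exact preprocessing pipeline of this study.

import numpy as np

def min_max_normalise(X: np.ndarray):
    """Scale each feature of X (N x D) into [0, 1] in the spirit of
    Equation (1); per-feature treatment is an assumption of this sketch."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min + 1e-12), x_min, x_max

def make_windows(X_norm: np.ndarray, window: int, horizon: int):
    """Slice the normalized series into (input window, target) pairs.

    Returns inputs of shape (num_samples, W, D) and targets of shape
    (num_samples, L), where the target is read from the last column
    (the PV generation). Assumes the series is longer than W + L.
    """
    inputs, targets = [], []
    for t in range(len(X_norm) - window - horizon + 1):
        inputs.append(X_norm[t:t + window])
        targets.append(X_norm[t + window:t + window + horizon, -1])
    return np.stack(inputs), np.stack(targets)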
To incorporate spatiotemporal information into the input sequence, the Transformer embeds and positionally encodes the data at each time step, providing the model with both semantic information and temporal position context. Each time step in the input sequence typically consists of multiple numerical values, which are projected into a specified embedding space. The input embedding can be expressed as
$$E = X_{\mathrm{in}} W_e + PE,\qquad(2)$$
where $W_e \in \mathbb{R}^{D \times d_{\mathrm{model}}}$ is the embedding weight matrix, and $d_{\mathrm{model}}$ denotes the embedding dimension. The bias vector $PE$ encodes positional information using sine and cosine functions, assigning a unique code to each position in the sequence. Specifically, for position $p$ and dimension $i$,
$$PE(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d_{\mathrm{model}}}}\right),\qquad PE(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d_{\mathrm{model}}}}\right).\qquad(3)$$
The positional encoding is directly added to the input embedding, injecting temporal information while preserving the original feature values. In this study, the embedding dimension is set to $d_{\mathrm{model}} = 32$, resulting in an embedded input of size $B \times W \times 32$. Next, linear projections are applied to the embedded data $E$ to generate the inputs for the attention mechanism of the Transformer: the query matrix $Q$, the key matrix $K$, and the value matrix $V$. This can be expressed as
$$Q = E W_Q,\qquad K = E W_K,\qquad V = E W_V,\qquad(4)$$
where $W_Q, W_K, W_V \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}$ are learnable projection matrices, and $d_k$ denotes the dimension of each attention subspace.
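The sketch below illustrates Equations (2)-(4) in PyTorch: a sinusoidal positional code added to a linear embedding, followed by the three attention projections. The module name and the per-head dimension $d_k = 8$ (inferred from the 4-head, 32-dimensional configuration) are assumptions of this sketch.

import math
import torch
import torch.nn as nn

def sinusoidal_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sine/cosine positional encoding following Equation (3)."""
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div)
    pe[:, 1::2] = torch.cos(position * div)
    return pe  # (seq_len, d_model)

class InputProjection(nn.Module):
    """Embed the raw window and form Q, K, V as in Equations (2)-(4)."""

    def __init__(self, n_features: int, d_model: int = 32, d_k: int = 8):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)  # plays the role of W_e
        self.w_q = nn.Linear(d_model, d_k, bias=False)
        self.w_k = nn.Linear(d_model, d_k, bias=False)
        self.w_v = nn.Linear(d_model, d_k, bias=False)

    def forward(self, x_in: torch.Tensor):
        # x_in: (B, W, D); the positional code is added to the embedding.
        pe = sinusoidal_encoding(x_in.size(1), self.embed.out_features).to(x_in.device)
        e = self.embed(x_in) + pe
        return self.w_q(e), self.w_k(e), self.w_v(e)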
The first ASSA–Transformer architecture is constructed by replacing the self-attention component of the standard Transformer with the ASSA mechanism described in Section 2.1. The query, key, and value matrices $Q$, $K$, and $V$ from Equation (4) are directly substituted into Equations (A5) and (A6), and the attention output from Equation (A7) is fed into the remaining Transformer layers. In other words, the multi-head attention module in Figure 1 is replaced with ASSA, while the rest of the Transformer architecture remains unchanged. For simplicity, this architecture will be referred to as "ASSA" in the following text.
To explore the potential of the ASSA–Transformer, while keeping the other parts unchanged, we replace the position-wise FNN (Equation (A4)) with a network that can capture and learn time-series features more deeply, which can be expressed as
$$\mathrm{DeepFFN}(x) = \mathrm{LayerNorm}\!\left(x + \mathrm{GELU}\!\big(\mathrm{GELU}(x W_1 + b_1)\, W_2 + b_2\big) W_3\right),\qquad(5)$$
where $b_1$ and $b_2$ are bias terms, and GELU denotes the Gaussian Error Linear Unit activation function. In this design, $W_1$ first expands the dimensionality, $W_2$ subsequently reduces it, and $W_3$ projects the representation back to the original space, thereby forming a "bottleneck" structure that expands and contracts the feature space. This architecture enhances the model's representational capacity while keeping the final output dimensionality consistent with the input for residual connections. The two successive GELU activations introduce nonlinearity, allowing the network to approximate highly complex functions. Finally, the incorporation of layer normalization ensures stable training and facilitates efficient information propagation through the residual pathway. The resulting architecture, i.e., the ASSA–Transformer equipped with DeepFFN, will hereafter be referred to as "ASSA Deep" in subsequent discussions.
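A minimal sketch of such a deeper feedforward block is given below. The expansion and reduction widths chosen here are illustrative assumptions rather than the exact hidden sizes used in this study.

import torch
import torch.nn as nn

class DeepFFN(nn.Module):
    """Sketch of the deeper position-wise feedforward block of "ASSA Deep"
    following Equation (5): expand, contract, project back, then apply a
    residual connection and layer normalization.
    """

    def __init__(self, d_model: int = 32, d_expand: int = 128, d_reduce: int = 64):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_expand)              # expands the dimensionality
        self.w2 = nn.Linear(d_expand, d_reduce)             # contracts it again
        self.w3 = nn.Linear(d_reduce, d_model, bias=False)  # back to d_model
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.w1(x))
        h = self.act(self.w2(h))
        # Residual pathway plus layer normalization for stable training.
        return self.norm(x + self.w3(h))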
The final variant replaces the position-wise FNN component of the ASSA–Transformer with the FRFN described in Section 2.1. This architecture is hereafter referred to as the "ASSA FRFN" model. In our study, we employ the three architectures, "ASSA", "ASSA Deep", and "ASSA FRFN", for PV forecasting and systematically compare their predictive accuracy against the baseline Transformer and other benchmark algorithms. Their structures and differences are shown in Figure 4, and Table 1 summarizes the structural and functional differences among these three architectures and the Transformer. From a computational complexity perspective, the four algorithms display a clear efficiency hierarchy. The baseline Transformer is dominated by the FFN layer, which accounts for about 73% of the total computation, while the ASSA–Transformer increases the complexity by only 3% through adaptive sparse attention. ASSA Deep enhances the feature extraction with a three-layer FFN but incurs a 76% increase in complexity, making it the most expensive variant. In contrast, ASSA FRFN introduces partial convolutions and gating, reducing the complexity by 18% compared with DeepFFN, with a convolutional cost of only 1.2% of that of the standard FFN.
All algorithms ultimately share the same output layer, formulated as
$$\hat{Y} = \mathrm{ReLU}\big(\mathrm{Flatten}(Z)\, W_{o1} + b_{o1}\big) W_{o2} + b_{o2}.\qquad(6)$$
The output layer first takes the feature tensor $Z \in \mathbb{R}^{B \times W \times d_{\mathrm{model}}}$ from the decoder and flattens it into a two-dimensional matrix $Z_{\mathrm{flat}} \in \mathbb{R}^{B \times (W \cdot d_{\mathrm{model}})}$. Subsequently, two linear transformations are applied. The first employs the weights $W_{o1}$ and bias $b_{o1}$ to project the features into a latent space, followed by the ReLU activation function to introduce nonlinearity. The second transformation compresses the features into the prediction space using $W_{o2}$ and $b_{o2}$, producing the final prediction $\hat{Y} \in \mathbb{R}^{B \times L}$. This design preserves spatiotemporal feature information through flattening, enhances the representational power with nonlinear transformations, and provides a unified prediction interface for all algorithms. Since the input data is normalized prior to training, the prediction results must be denormalized to restore them to the original measurement scale. The final prediction $\hat{Y}$ can be denormalized as
$$\hat{Y}_{\mathrm{denorm}} = \hat{Y}\,(X_{\max} - X_{\min}) + X_{\min}.\qquad(7)$$
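The following sketch expresses the shared output head and the denormalization step of Equations (6) and (7) in PyTorch. The latent width and the function names are illustrative assumptions.

import torch
import torch.nn as nn

class OutputHead(nn.Module):
    """Shared output layer: flatten the decoder features, apply two
    linear maps with a ReLU in between (Equation (6))."""

    def __init__(self, seq_len: int, d_model: int, horizon: int, latent: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(seq_len * d_model, latent)
        self.fc2 = nn.Linear(latent, horizon)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        z_flat = z.flatten(start_dim=1)                 # (B, W * d_model)
        return self.fc2(torch.relu(self.fc1(z_flat)))   # (B, L)

def denormalise(y_hat: torch.Tensor, y_min: float, y_max: float) -> torch.Tensor:
    """Invert the min-max scaling of the PV generation column (Equation (7))."""
    return y_hat * (y_max - y_min) + y_min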
Based on the theoretical framework of the ASSA–Transformer, the complete procedure employed in our study is summarized in Algorithm 1. We use Optuna to automatically optimize the hyperparameters of the standard Transformer, including the learning rate, dropout rate, and embedding dimension, among others, and apply the resulting settings consistently across the Transformer-based ASSA, ASSA Deep, and ASSA FRFN architectures to ensure fair comparability. All models use the same input sequence length, prediction horizon, and standardized five-dimensional input features. The network architecture is uniformly configured with an embedding dimension of 32, a feedforward hidden dimension of 64, and an encoder–decoder structure composed of 4 attention heads and 3 Transformer layers. The training settings are also aligned, including a dropout ratio of 0.01, a learning rate of 0.001, a batch size of 32, and 50 training epochs, with input features normalized to the range [0, 1]. The loss function is defined as the Huber loss with threshold $\delta$, which behaves as the mean squared error (MSE) for deviations smaller than $\delta$ and approximates the mean absolute error (MAE) for larger deviations, thereby enhancing the robustness to outliers. This rigorous consistency guarantees that the observed performance differences arise solely from the core architectural variations, such as standard attention versus ASSA attention and standard FFN versus DeepFFN and FRFN, rather than from incidental hyperparameter choices. Consequently, the experimental setup provides a reliable foundation for evaluating the impact of architectural innovations.
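To make the shared configuration concrete, the snippet below collects the fixed training settings listed above into a single dictionary and illustrates the two regimes of the Huber loss. The dictionary name and the delta value shown (PyTorch's default) are illustrative assumptions, since the study's exact threshold is not reproduced here.

import torch
import torch.nn as nn

# Shared training settings applied to all Transformer-based variants
# (values taken from the text above; the dictionary itself is illustrative).
TRAIN_CONFIG = {
    "embedding_dim": 32,
    "ffn_hidden_dim": 64,
    "attention_heads": 4,
    "transformer_layers": 3,
    "dropout": 0.01,
    "learning_rate": 1e-3,
    "batch_size": 32,
    "epochs": 50,
}

# Huber loss: quadratic (MSE-like) for small residuals, linear (MAE-like)
# for large ones; delta = 1.0 is used here only to illustrate the behaviour.
criterion = nn.HuberLoss(delta=1.0)
zero = torch.zeros(1)
print(criterion(torch.tensor([0.1]), zero))  # 0.5 * 0.1**2 = 0.005 (quadratic regime)
print(criterion(torch.tensor([5.0]), zero))  # 1.0 * (5.0 - 0.5) = 4.5 (linear regime)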
Algorithm 1 Transformer with ASSA Variants for PV Time-Series Prediction.
1: Input: time-series data $X_{\mathrm{in}} \in \mathbb{R}^{B \times W \times D}$
2: Parameters: embedding dim $d_{\mathrm{model}} = 32$, attention dim $d_k$, FFN dim $= 64$, weight matrices $W_e, W_Q, W_K, W_V, W_{o1}, W_{o2}$
3: procedure EMBEDDING($X_{\mathrm{in}}$)
4:   $E \leftarrow X_{\mathrm{in}} W_e$
5:   $E \leftarrow E + PE$
6:   return $E$
7: end procedure
8: procedure ATTENTIONPROJECTION($E$)
9:   $Q \leftarrow E W_Q$, $K \leftarrow E W_K$, $V \leftarrow E W_V$
10:  return $Q, K, V$
11: end procedure
12: procedure ENCODERBLOCK($E$)
13:   $H \leftarrow E$
14:   for each layer $l = 1$ to $N$ do
15:     $Q, K, V \leftarrow$ ATTENTIONPROJECTION($H$)
16:     $H \leftarrow \mathrm{LayerNorm}(H + \mathrm{ASSA}(Q, K, V))$
17:     $H \leftarrow \mathrm{LayerNorm}(H + \mathrm{FFN}(H))$   (FFN, DeepFFN, or FRFN depending on the variant)
18:   end for
19:   return encoded features $H_{\mathrm{enc}}$
20: end procedure
21: procedure DECODERBLOCK($H_{\mathrm{enc}}$)
22:   Similar to ENCODERBLOCK with
23:   1. Masked attention
24:   2. Encoder–decoder attention using $H_{\mathrm{enc}}$
25:   return decoded features $Z$
26: end procedure
27: procedure OUTPUTLAYER($Z$)
28:   $Z_{\mathrm{flat}} \leftarrow \mathrm{Flatten}(Z)$
29:   $\hat{Y} \leftarrow \mathrm{ReLU}(Z_{\mathrm{flat}} W_{o1} + b_{o1}) W_{o2} + b_{o2}$
30:   return predictions $\hat{Y}$
31: end procedure
32: $E \leftarrow$ EMBEDDING($X_{\mathrm{in}}$)
33: $H_{\mathrm{enc}} \leftarrow$ ENCODERBLOCK($E$)
34: $Z \leftarrow$ DECODERBLOCK($H_{\mathrm{enc}}$)
35: $\hat{Y} \leftarrow$ OUTPUTLAYER($Z$)
To evaluate the performance of ASSA and its variants, we compare them with several baseline models, including Long Short-Term Memory (LSTM) [32], bidirectional LSTM (BiLSTM) [33], a Gated Recurrent Unit (GRU) [34], and a Temporal Convolutional Network (TCN) [35]. The LSTM model employs a two-layer architecture with 128 hidden units, trained using the Adam optimizer with a learning rate of 0.001 and a batch size of 32. The BiLSTM model uses the same configuration as the LSTM, except that it has a two-layer bidirectional structure with 128 hidden units in each direction. The GRU model consists of a two-layer structure with 128 fully connected units, utilizes the ReLU activation function, applies a dropout rate of 0.2, and is trained with the Adam optimizer using a learning rate of 0.001 and a batch size of 32. The TCN model adopts a three-layer convolutional architecture with a kernel size of 3 and a dilation base of 2. Its fully connected layers are configured with 64, 128, and 72 units, and it also uses a dropout rate of 0.2, the Adam optimizer with a learning rate of 0.001, and a batch size of 32. All of the experiments are conducted on PyTorch 2.6.0 with Python 3.10.11, utilizing an NVIDIA GeForce RTX 4060 Ti (16 GB) GPU. The results of these algorithms were obtained in our previous study on long-term forecasting [36]. In this short-term forecasting study, the Transformer has been adapted to the ASSA–Transformer framework and optimized accordingly, so its results differ from the previous work.