2.1. Variational Mode Decomposition
PV power time-series data are influenced by factors such as solar irradiance cycles and sudden weather changes, exhibiting both seasonal trends and minute-level fluctuations. Traditional single-scale modeling methods struggle to capture global patterns and local details simultaneously. To extract multi-scale features more effectively, this study adopts Variational Mode Decomposition (VMD) to decompose the original data into several modal components with distinct frequency characteristics, where each mode corresponds to variations in the data at a specific time scale. In this way, the dynamic characteristics of the time series across different scales can be revealed more clearly. These modal components are then used as input features for the prediction model, enabling it to fully exploit multi-scale information during training, thereby improving adaptability to PV power fluctuations and enhancing prediction accuracy [16,17].
VMD decomposes the original PV time series into K Intrinsic Mode Functions (IMFs). An IMF is defined as a function of the input signal, as shown in Equation (1):

$$u_k(t) = A_k(t)\cos\big(\phi_k(t)\big) \tag{1}$$

In Equation (1), $A_k(t)$ represents the instantaneous amplitude with $A_k(t) \ge 0$, while $\phi_k(t)$ denotes the phase. The modal function is $u_k(t)$. To obtain the instantaneous frequency $\omega_k(t)$, Equation (2) differentiates $\phi_k(t)$:

$$\omega_k(t) = \frac{\mathrm{d}\phi_k(t)}{\mathrm{d}t} \tag{2}$$
Based on the Gaussian smoothness of the demodulated signal, the bandwidth of each mode is estimated, and the constrained variational problem of Equations (3) and (4) is constructed:

$$\min_{\{u_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \tag{3}$$

$$\text{s.t.} \quad \sum_{k=1}^{K} u_k(t) = f(t) \tag{4}$$

In Equation (3), $\{u_k\} = \{u_1, u_2, \dots, u_K\}$ represents the set of $K$ modal function components, $\{\omega_k\} = \{\omega_1, \omega_2, \dots, \omega_K\}$ denotes the set of center frequencies corresponding to each mode, and $\left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2$ represents the squared norm of the gradient of the demodulated signal. In Equation (4), $f(t)$ is the original sequence.
By introducing a quadratic penalty term $\alpha$ and a Lagrange multiplier $\lambda(t)$, the problem is transformed into an unconstrained variational problem, as shown in Equation (5):

$$L\big(\{u_k\},\{\omega_k\},\lambda\big) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| f(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\, f(t) - \sum_{k=1}^{K} u_k(t) \right\rangle \tag{5}$$

Here, $L$ denotes the augmented (generalized) Lagrangian function; the quadratic penalty term $\alpha$ ensures the fidelity of signal reconstruction, while the Lagrange multiplier $\lambda(t)$ guarantees the strictness of the constraint conditions.
In traditional VMD, feature extraction from the time series remains prone to significant accuracy loss due to high-frequency noise. Therefore, this study integrates correlation analysis with conventional VMD. To avoid over-decomposition or under-decomposition, the correlation between each component and the original sequence is calculated, weakly correlated or uncorrelated IMFs are removed, and the remaining components are reconstructed into a denoised sequence. The correlation coefficient between each mode and the original sequence is defined in Equation (6):
$$\rho_i = \frac{\sum_{t=1}^{N}\big(x(t)-\bar{x}\big)\big(u_i(t)-\bar{u}_i\big)}{\sqrt{\sum_{t=1}^{N}\big(x(t)-\bar{x}\big)^2}\,\sqrt{\sum_{t=1}^{N}\big(u_i(t)-\bar{u}_i\big)^2}} \tag{6}$$

where $\rho_i$ denotes the correlation coefficient, $x(t)$ represents the original time series, and $u_i(t)$ refers to the $i$-th decomposed component. The value of $\rho_i$ can be used to assess the correlation between each mode and the original sequence: strongly correlated components are retained for reconstruction, while weakly correlated ones are removed.
Table 1 shows the correspondence between correlation coefficient ranges and correlation strength [18,19].
To ensure effective decomposition and accurate subsequent analysis, we carefully optimized and selected the key VMD parameters. The bandwidth constraint parameter (alpha) was set to 2000 to moderately compress the bandwidth of each mode, thereby clearly distinguishing different frequency components while preventing the loss of important fluctuation details due to excessive smoothing. The number of modes (K) is the most critical parameter in VMD. Through exploratory experiments, we found that the decomposition is insufficient when K < 7, while noticeable "mode aliasing" begins to appear when K > 7. Therefore, we ultimately set K = 7 to achieve the best balance between effective signal separation and the avoidance of mode aliasing. Additionally, we set the convergence tolerance (tol) to 1 × 10⁻⁷ to ensure that the algorithm converges to a sufficiently precise result.
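For reference, the following is a minimal sketch of this decomposition step. It assumes the open-source vmdpy package (the paper does not name its VMD implementation) and a synthetic placeholder series; alpha, K, and tol follow the settings above, while tau, DC, and init are illustrative defaults.

```python
import numpy as np
from vmdpy import VMD  # assumed implementation; pip install vmdpy

# Illustrative PV power series (replace with the real 15-min resolution data).
rng = np.random.default_rng(0)
t = np.arange(0, 4 * 96)                      # four days at 96 samples/day
pv_power = np.clip(np.sin(np.pi * (t % 96) / 96), 0, None) + 0.05 * rng.standard_normal(t.size)

alpha = 2000      # bandwidth constraint, as described above
K = 7             # number of modes, as described above
tol = 1e-7        # convergence tolerance, as described above
tau = 0.0         # noise tolerance; illustrative default
DC = 0            # no DC mode imposed; illustrative default
init = 1          # uniform initialization of center frequencies; illustrative default

# u: (K, N) array of IMFs, u_hat: their spectra, omega: center frequencies per iteration
u, u_hat, omega = VMD(pv_power, alpha, tau, K, DC, init, tol)
print(u.shape)    # (7, N)
```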
After determining the decomposition parameters, we filtered out noise by calculating the correlation coefficients between each IMF component and the original sequence. To identify the optimal correlation threshold, we conducted a sensitivity analysis. Using the TCN-TST-BiGRU model architecture detailed in Section 2.2, we preprocessed the data with different thresholds and evaluated the model's predictive performance on the validation set. The experimental results are presented in Table 2.
As shown in Table 2, the model achieved the lowest average absolute error on the validation set when the threshold was set to 0.2. This indicates that this threshold strikes the optimal balance between effectively filtering high-frequency noise modes and preserving essential signal details. Therefore, this study adopts 0.2 as the correlation threshold for IMF screening. Finally, the retained mode components are reconstructed into a denoised sequence for subsequent model training. The reconstruction equation is given in Equation (7):

$$\hat{x}(t) = \sum_{i \,:\, \rho_i \ge 0.2} u_i(t) \tag{7}$$

Here, $\hat{x}(t)$ represents the reconstructed sequence after denoising, $i$ indexes the components that satisfy the threshold, and $\rho_i$ represents the correlation coefficient between the $i$-th mode and the original sequence.
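The screening and reconstruction of Equations (6) and (7) can be sketched as follows, assuming `imfs` is the (K, N) array of modes produced by the decomposition step and `signal` is the original series; the 0.2 threshold follows Table 2, and screening on the absolute correlation is an assumption.

```python
import numpy as np

def screen_and_reconstruct(imfs: np.ndarray, signal: np.ndarray, threshold: float = 0.2):
    """Keep IMFs whose correlation with the original signal meets the
    threshold (Eq. (6)) and sum them into a denoised sequence (Eq. (7))."""
    n = imfs.shape[1]
    signal = signal[:n]                         # VMD implementations may trim the series
    rho = np.array([np.corrcoef(imf, signal)[0, 1] for imf in imfs])
    keep = np.abs(rho) >= threshold             # screen out weakly correlated modes
    denoised = imfs[keep].sum(axis=0)           # reconstruction of Eq. (7)
    return denoised, rho, keep

# Example usage with the arrays from the previous sketch:
# denoised, rho, keep = screen_and_reconstruct(u, pv_power, threshold=0.2)
```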
Figure 1 clearly illustrates the complete workflow: from raw data input through VMD, IMF screening based on correlation coefficients, to missing value imputation using random forest.
2.2. Ultra-Short-Term Photovoltaic Power Prediction Model
Unlike traditional stacked hybrid models, this paper proposes a TCN-TST-BiGRU hybrid model with a parallel dual-channel architecture for ultra-short-term photovoltaic power forecasting. As shown in Figure 2, the model innovatively integrates three core components: the Temporal Convolutional Network (TCN), the Temporal Shift Transformer (TST), and the Bidirectional Gated Recurrent Unit (BiGRU). The model adopts a symmetrical parallel design to achieve balanced and comprehensive feature extraction:
Channel 1 (Local and Long-Term Feature Channel): In this channel, the TCN first efficiently captures local dynamic features in the sequence through dilated causal convolution. The output is then fed into the TST module, which utilizes its built-in multi-head self-attention mechanism to accurately model long-term dependencies in the data, such as the daily cyclical patterns of photovoltaic power.
Channel 2 (Global Context Channel): In the parallel channel, the BiGRU captures bidirectional temporal context information across the entire sequence from a global perspective through its forward and backward recurrent structures.
Both channels receive identical raw input data, and their extracted features are concatenated at the end of the model. The resulting high-dimensional features are then fed into an attention fusion layer, which dynamically assigns weights to different features to highlight the information most critical to the final prediction. This design enables the model to analyze the time series holistically and without bias across multiple dimensions, significantly improving prediction accuracy and robustness.
This symmetrical parallel architecture prevents the model from favoring any single type of feature extractor. The TCN-TST branch excels at capturing local details and long-term periodic patterns (such as diurnal variations), while the BiGRU branch captures global contextual dependencies. The final Attention Fusion Layer acts as a dynamic balancer, adaptively adjusting the contribution weights of the two branches based on input data characteristics. This symmetrical complementarity between local and global features is key to achieving high precision in the model.
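The following PyTorch sketch illustrates this parallel dual-channel layout under simplifying assumptions: plain convolutions stand in for the TCN branch, a standard Transformer encoder layer stands in for the TST module, a multi-head attention layer stands in for the attention fusion layer, and all layer sizes are placeholders rather than the hyperparameters of Table 3.

```python
import torch
import torch.nn as nn

class DualChannelModel(nn.Module):
    """Illustrative sketch of the parallel TCN-TST-BiGRU layout; layer sizes are
    assumptions, not the paper's exact hyperparameters (see Table 3)."""
    def __init__(self, n_features: int, d_model: int = 64, horizon: int = 1):
        super().__init__()
        # Channel 1: placeholder convolutions standing in for the TCN branch
        # (a causal residual block is sketched in Section 2.2.1), followed by a
        # Transformer encoder layer standing in for the TST module.
        self.tcn = nn.Sequential(
            nn.Conv1d(n_features, d_model, kernel_size=3, padding="same", dilation=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding="same", dilation=2),
            nn.ReLU(),
        )
        self.tst = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Channel 2: bidirectional GRU over the same raw input sequence.
        self.bigru = nn.GRU(n_features, d_model // 2, batch_first=True, bidirectional=True)
        # Attention fusion (stand-in) over the concatenated channel features, then a head.
        self.fusion = nn.MultiheadAttention(embed_dim=2 * d_model, num_heads=8, batch_first=True)
        self.head = nn.Linear(2 * d_model, horizon)

    def forward(self, x):                                  # x: (batch, time, features)
        h1 = self.tcn(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, d_model)
        h1 = self.tst(h1)                                  # long-term dependencies
        h2, _ = self.bigru(x)                              # (batch, time, d_model)
        h = torch.cat([h1, h2], dim=-1)                    # concatenate channel features
        fused, _ = self.fusion(h, h, h)                    # dynamic feature re-weighting
        return self.head(fused[:, -1, :])                  # predict from the last step

model = DualChannelModel(n_features=5)
print(model(torch.randn(2, 96, 5)).shape)                  # torch.Size([2, 1])
```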
Key architectural parameters of the model: To ensure the reproducibility of this study, we detail the critical hyperparameters of each model component in Table 3 below. These parameters were determined through extensive grid search experiments on the validation set (see Section 3.2.1).
2.2.1. Temporal Convolutional Network (TCN)
The TCN is constructed with dilated causal convolutions and employs exponentially increasing dilation factors to significantly expand the receptive field while reducing network depth [20]. This design enables efficient capture of long-term temporal trends in the input data. Unlike traditional Convolutional Neural Networks (CNNs), where the receptive field grows linearly with network depth and kernel size, the dilated convolution structure of the TCN allows the receptive field to expand exponentially with layer depth. Consequently, the TCN can cover extremely long historical sequences with only a few layers while avoiding the vanishing gradient problem.
When the initial dilation factor is set to $d$ and the convolution kernel size is set to $k$, the receptive field $r$ of the output layer can be calculated using Equation (8):

$$r = 1 + (k-1)\,d\,\big(2^{L}-1\big) \tag{8}$$

Here, $L$ denotes the number of dilated causal convolution layers, with the dilation factor doubling from layer to layer. By adjusting the values of $d$ and $k$, the receptive field can be significantly expanded to extract long-term temporal features from the input data while maintaining a relatively shallow network architecture. Given a one-dimensional input sequence $X = \{x_1, x_2, \dots, x_T\}$ and an $m$-dimensional convolutional filter $F = \{f_1, f_2, \dots, f_m\}$, the dilated causal convolution result at time step $t$ can be expressed by Equation (9) [21,22,23]:

$$F(x_t) = \sum_{i=1}^{m} f_i \cdot x_{t-(m-i)\,d} \tag{9}$$
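As an illustration of Equation (8), the short snippet below evaluates the receptive field for a stack of dilated causal layers, assuming the dilation doubles per layer; the kernel size and base dilation shown are illustrative values, not the paper's configuration.

```python
def receptive_field(k: int, d: int, L: int) -> int:
    """Receptive field of L stacked dilated causal conv layers (Eq. (8)),
    assuming the dilation starts at d and doubles at each layer."""
    return 1 + (k - 1) * d * (2 ** L - 1)

# Illustrative values only: kernel size 3, base dilation 1.
for L in range(1, 6):
    print(L, receptive_field(k=3, d=1, L=L))   # 3, 7, 15, 31, 63
```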
As illustrated in Figure 3, by combining dilated causal convolution with residual stacking, the TCN is able to cover extremely long historical contexts, with its receptive field expanding exponentially. This design avoids the gradient vanishing/exploding problems that typically occur in RNNs as the number of recurrent steps increases. The residual structure not only expands the receptive field but also reduces parameter redundancy. Moreover, the identity mapping in the residual design prevents performance degradation in deep networks while saving computational resources.
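A minimal sketch of one such residual block, with left-only (causal) padding and an identity/1×1 skip connection, is shown below; the channel sizes, kernel size, and number of stacked blocks are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DilatedCausalBlock(nn.Module):
    """One TCN residual block: two dilated causal convolutions plus an identity
    (or 1x1) skip connection. Left-only padding keeps the convolution causal."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # pad on the left only
        self.conv1 = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(out_ch, out_ch, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()
        self.skip = nn.Conv1d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):                                # x: (batch, channels, time)
        y = self.relu(self.conv1(nn.functional.pad(x, (self.pad, 0))))
        y = self.relu(self.conv2(nn.functional.pad(y, (self.pad, 0))))
        return self.relu(y + self.skip(x))               # residual connection

# Stack blocks with exponentially increasing dilation: 1, 2, 4, 8, ...
tcn = nn.Sequential(*[DilatedCausalBlock(8 if i == 0 else 32, 32, dilation=2 ** i)
                      for i in range(4)])
print(tcn(torch.randn(2, 8, 96)).shape)                  # torch.Size([2, 32, 96])
```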
2.2.2. Temporal Shift Transformer (TST)
To enhance the joint modeling capability of long-term dependencies and local abrupt variations, this study introduces the Temporal Shift Transformer (TST) module on top of TCN. The module integrates the Temporal Shift Module (TSM) with a multi-head self-attention mechanism, thereby significantly improving the representation of temporal dynamic features.
Specifically, the TST employs the TSM to introduce cross-time-step feature interactions within the input sequence, which substantially strengthens the representation of local temporal features. For an input feature sequence $X \in \mathbb{R}^{T \times d}$ (where $T$ denotes the number of time steps and $d$ is the feature dimension), the TSM defines a shift offset $n$ according to the PV data interval $\Delta t = 15$ min, as given in Equation (10):
The time-shifted sequence $X'$ is fed into the multi-head self-attention mechanism. First, the input is transformed by three learnable linear projection matrices $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$, which compute the query ($Q_i$), key ($K_i$), and value ($V_i$) for the $i$-th attention head among the $h$ heads [24].
Next, the $h$ attention heads compute their respective outputs $\mathrm{head}_i$ in parallel, as shown in Equation (11):

$$\mathrm{head}_i = \operatorname{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i \tag{11}$$

Here, $d_k$ is the key vector dimension of each head.
Finally, the outputs from all heads are concatenated and fused through a final linear projection matrix $W^{O}$ to produce the multi-head self-attention output $Z$, as shown in Equation (12):

$$Z = \operatorname{Concat}\big(\mathrm{head}_1, \dots, \mathrm{head}_h\big)\, W^{O} \tag{12}$$
The output of the TST module is a feature tensor that deeply integrates both local and global temporal dependencies. This output is then combined with the features extracted by the parallel BiGRU channel for final fusion.
In the TST module, we implemented eight attention heads. This configuration was determined by two key considerations. First, the eight-head setup is a well-established and validated choice in the Transformer architecture, enabling parallel learning of temporal dependencies across multiple representation subspaces to form a robust "expert committee". Second, this configuration aligns exactly with our 64-dimensional embedding space (64 dimensions / 8 heads = 8 dimensions per head), ensuring that each head can learn features within a sufficiently large subspace. Our preliminary experiments confirmed that this setup enables the model to achieve both robustness and superior performance.
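A compact sketch of such a TST block is given below: part of the channels is shifted along the time axis (TSM) before 8-head self-attention is applied on a 64-dimensional embedding, matching the configuration above; the shift offset n = 1 and the fraction of shifted channels are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalShiftTransformer(nn.Module):
    """Sketch of a TST block: shift part of the channels by ±n steps along time
    (TSM), then apply multi-head self-attention. d_model=64 and 8 heads follow
    the configuration above; the shift offset and shifted fraction are assumptions."""
    def __init__(self, d_model: int = 64, n_heads: int = 8, n_shift: int = 1):
        super().__init__()
        self.n_shift = n_shift
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def temporal_shift(self, x):                  # x: (batch, time, d_model)
        d = x.size(-1) // 4
        shifted = x.clone()
        # shift the first quarter of channels backward in time ...
        shifted[:, self.n_shift:, :d] = x[:, :-self.n_shift, :d]
        # ... and the second quarter forward, mixing neighbouring time steps
        shifted[:, :-self.n_shift, d:2 * d] = x[:, self.n_shift:, d:2 * d]
        return shifted

    def forward(self, x):
        x_shift = self.temporal_shift(x)
        z, _ = self.attn(x_shift, x_shift, x_shift)   # Eqs. (11)-(12)
        return self.norm(x + z)                       # residual connection

tst = TemporalShiftTransformer()
print(tst(torch.randn(2, 96, 64)).shape)              # torch.Size([2, 96, 64])
```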
2.2.3. BiGRU Neural Network
Bidirectional Gated Recurrent Unit (BiGRU) employs reset gates and update gates to realize a gating mechanism that dynamically regulates the flow of temporal information. The reset gate controls the degree of association between the current input and the historical state, while the update gate determines whether the previous memory should be retained. This gating design significantly enhances the efficiency of modeling complex temporal dependencies [25,26,27]. The BiGRU architecture consists of an input layer, a forward GRU layer, a backward GRU layer, and an output layer. At each time step, the input layer simultaneously passes data to both the forward and backward GRU layers, allowing information to flow in two opposite directions. The output sequence is jointly determined by the two GRU networks, enabling the model to capture both past and future dependencies. The overall network structure of BiGRU is illustrated in Figure 4.
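For illustration, a bidirectional GRU layer in PyTorch concatenates the forward and backward hidden states at every time step; the input and hidden sizes below are placeholder values.

```python
import torch
import torch.nn as nn

# Bidirectional GRU: the forward and backward passes each produce a hidden state
# per time step, and the two are concatenated in the output.
bigru = nn.GRU(input_size=5, hidden_size=32, batch_first=True, bidirectional=True)

x = torch.randn(2, 96, 5)          # (batch, time steps, features)
out, h_n = bigru(x)
print(out.shape)                   # torch.Size([2, 96, 64]) -> forward ⊕ backward states
print(h_n.shape)                   # torch.Size([2, 2, 32])  -> last hidden state per direction
```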
2.2.4. Self-Attention Mechanism
Photovoltaic (PV) power generation is influenced by multiple coupled factors such as solar irradiance, geographical environment, and equipment conditions, exhibiting strong nonlinearity, volatility, and randomness.
Traditional time-series models (e.g., TCN, BiLSTM) face two major limitations in feature extraction, namely reliance on sequential order and static weight allocation. To address this issue, the proposed model incorporates a self-attention mechanism in the output layer. By computing global similarity among features, self-attention enables dynamic weighting, thereby eliminating the dependence on a fixed sequential order, strengthening the representation of critical features, and capturing deeper correlations. This ultimately enhances the accuracy of power prediction. The principle of the self-attention mechanism is illustrated in Figure 5.
The attention calculation formula is shown in Equation (13):

$$Q = W^{Q} a, \quad K = W^{K} a, \quad V = W^{V} a, \qquad b = \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{13}$$

Here, $a$ is the input vector, $b$ is the output vector, $W^{Q}$, $W^{K}$, and $W^{V}$ are the corresponding weight matrices, $Q$ is the query matrix, $K$ is the key matrix, $V$ is the value matrix, and $d_k$ is the dimension of the key vector.
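A minimal sketch of Equation (13) with randomly initialized projection matrices is shown below; the sequence length and feature dimensions are illustrative.

```python
import torch

def self_attention(a: torch.Tensor, w_q: torch.Tensor, w_k: torch.Tensor, w_v: torch.Tensor):
    """Scaled dot-product self-attention of Eq. (13): b = softmax(QK^T / sqrt(d_k)) V."""
    q, k, v = a @ w_q, a @ w_k, a @ w_v            # project inputs into Q, K, V
    d_k = k.size(-1)
    scores = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)
    return scores @ v                              # output vectors b

# Illustrative dimensions: 96 time steps with 64-dimensional features.
a = torch.randn(96, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
b = self_attention(a, w_q, w_k, w_v)
print(b.shape)                                     # torch.Size([96, 64])
```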