1. Introduction
The accelerating global shift toward sustainable energy solutions has positioned photovoltaic (PV) systems at the heart of renewable power generation strategies. Their adoption has been driven by several factors, including minimal operating costs, environmental sustainability, and scalability across applications ranging from residential rooftops to large-scale solar farms. However, limitations remain: PV generation is intermittent and non-linear, being strongly influenced by highly variable meteorological factors such as cloud cover, solar irradiance, temperature, and atmospheric pressure [1,2].
Accurate short-term forecasting of PV power generation is imperative for applications [3,4] such as grid load management, reserve scheduling, energy trading, and real-time control in smart grids [5]. Traditional forecasting methods, such as physical models (e.g., numerical weather prediction, irradiance-based models) and statistical models (e.g., ARIMA, exponential smoothing), usually fail to capture the non-stationary, multiscale, and non-linear nature of PV data [1,6]. These models typically assume time-invariant relationships and linearity, which makes them less suitable for highly volatile, high-resolution solar data, especially during fast-changing weather transitions.
With the advent of data-driven learning paradigms, deep learning (DL) models [7,8] have become state-of-the-art time series forecasting tools due to their ability to learn implicit non-linear mappings directly from data [6]. In particular, Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) [9] and Gated Recurrent Units (GRUs) [10], have been shown to provide improved performance in sequential dependency modeling. Convolutional Neural Networks (CNNs) [11] and Temporal Convolutional Networks (TCNs) [12] are better at extracting local temporal and spatial features and can be trained more efficiently. Nonetheless, standalone deep learning models, though powerful, are susceptible to overfitting, limited flexibility across temporal scales, and poor interpretability under complex real-world scenarios.
To address these limitations, current studies highlight the strength of hybrid deep learning models, which employ multiple learning processes and preprocessing techniques to identify both long-term trends and high-frequency fluctuations. These include signal decomposition algorithms (e.g., Empirical Mode Decomposition (EMD) [13], Variational Mode Decomposition (VMD) [14], and Wavelet Transform (WT) [15]), which decompose the input signal into multiple frequency components and isolate meaningful features from noise. Similarly, attention mechanisms, particularly self-attention and channel attention modules such as Efficient Channel Attention for Deep Convolutional Neural Networks (ECANet) [16], have proven highly effective in concentrating the model’s attention on the most relevant areas of the input signal, thereby improving both interpretability and accuracy.
However, despite these advances, there remains a research gap: many recent hybrid models do not combine multiresolution decomposition, attention-enhanced feature learning, and recurrent temporal modeling within a unified, end-to-end architecture. Additionally, such models often depend heavily on external meteorological inputs or complex preprocessing routines, which limit their portability and real-time deployment.
To address this gap, we propose a novel hybrid deep learning framework that integrates multiscale signal decomposition, global attention mechanisms, local temporal convolution, adaptive channel recalibration, and sequential memory encoding in a single pipeline for short-term PV forecasting. The full architectural design and evaluation procedures are detailed in the following sections.
The main objectives of this study are the following:
Design an explainable and modular hybrid model that captures both global and local temporal dependencies in high-resolution PV data.
Benchmark the proposed model against more than 18 architectures, including both external models and internal ablation variants.
Demonstrate its applicability to real-time grid operations such as energy dispatching, reserve management, and short-term load balancing.
The rest of the paper is organized as follows:
Section 2 presents the dataset, preprocessing, and methodological approach.
Section 3 describes the experimental setup and the evaluation metrics.
Section 4 analyzes and discusses the results and the model’s performance.
Section 5 concludes this study and outlines future research directions.
1.1. Related Work
Recent advances in short-term PV prediction have focused on hybrid deep learning approaches that integrate signal decomposition, attention mechanisms, and sequential modeling. These studies aim to mitigate the natural volatility and non-stationarity of PV data through various architectural and preprocessing enhancements.
The authors of [16] proposed a physics-constrained deep learning model incorporating Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) and CNN-LSTM with a novel physics-aware loss function that penalizes physically unreasonable results, such as negative predictions during non-generation periods. Their model demonstrated improved accuracy and robustness under adverse weather conditions on real PV datasets. However, its use of external meteorology inputs and a pre-specified decomposition pipeline may limit scalability across different geographies and PV installations.
A CEEMDAN–Informer hybrid model was also introduced in [17] for hourly PV prediction, using CEEMDAN to reduce non-stationarity and the Informer transformer for sparse long-sequence modeling. Their model achieved remarkable accuracy gains, with the MAE and RMSE reduced by up to 37.5% and 34.74%, respectively, over standard LSTM, GRU, and Transformer models. Nevertheless, its reliance on high-resolution meteorological features restricts its use in ultra-short-term or sensor-limited scenarios.
In [18], the authors introduced CGaformer, a CNN1D model supported by Global Additive Attention (GADAttention) and Auto-Correlation mechanisms for efficient long-range temporal learning. CGaformer attains state-of-the-art forecasting performance and computational efficiency with moderate architectural complexity; nevertheless, it continues to depend on exogenous weather data for optimal performance.
A multi-stage hybrid approach was presented in [19], combining Isolation Forest (IF) for outlier removal, a Non-linear Probabilistic CNN (NPCNN) for feature extraction, and Wavelet-KNN for reducing variance post-prediction. Applied to a national PV demonstration dataset, the model reduced the MAE and RMSE by approximately 14.3% and 13.1% compared to LSTM. However, the layered architecture introduces inference latency and implementation complexity.
The authors of [20] proposed DSEHM, a dynamic selective ensemble of AttnGRU, Informer, TCN, and XGBoost, with scenario-adaptive weighting according to irradiance and temporal variability. Tested on smart grid data, DSEHM reduced the MAE by 22.55% and the RMSE by 19.01%, but at the cost of coordination and computational overhead.
The authors of [21] developed MHA-BiLSTM-CNN, a two-stream hybrid framework using a CNN for local pattern extraction and a BiLSTM for modeling long-range dependencies. MHA-based fusion enhanced explainability and reduced the MAE by 21.4% compared to baselines. However, the model is sensitive to the hyperparameter tuning of its many modules and requires adaptation for short input sequences.
EDCPSO-MQLSTM, a probabilistic forecasting model combining a Multi-Quantile LSTM with Evolutive Distributed Chaotic Particle Swarm Optimization, was proposed in [22]. The model targets ultra-short-term prediction with high Prediction Interval Coverage Probability (PICP of 94.2%) and improved pinball loss, but it incurs a high computational burden due to the optimization procedure. This highlights the growing importance of uncertainty-aware forecasting approaches in ultra-short-term PV prediction.
A recent contribution in [23] explored MIC-WSO-TCN, a hybrid architecture incorporating the Maximal Information Coefficient (MIC) for feature selection, White Shark Optimizer (WSO) for hyperparameter tuning, and a TCN for temporal learning. Although achieving MAE and RMSE reductions of up to 38.1% and 29.9%, respectively, the model’s dependence on accurate meteorological inputs and metaheuristic tuning increases its operational complexity.
The authors of [13] proposed EMD–GRU–Attention (EGA), a deep learning model for solar radiation forecasting that combines Kalman denoising, Empirical Mode Decomposition, and an attention-augmented GRU. EGA achieved RMSE and MAE gains of 17.94% and 15.2%, respectively, but suffered from heavy preprocessing requirements and its assumption of a persistent input signal.
Complementing these efforts, Reference [24] reported a TCN–ECANet–GRU hybrid that embeds Efficient Channel Attention within TCN residual blocks prior to a GRU sequencer. Trained on 5 min multivariate PV/meteorological measurements from a real plant, the model achieved season-best normalized errors (e.g., RMSE = 0.0195; MAE = 0.0128; R2 ≈ 0.997 in winter) and consistently outperformed SVR, GRU, TCN, CNN-GRU, and TCN-GRU across single- and multi-step horizons. The findings highlight the utility of channel recalibration before gated recurrence; limitations include reliance on exogenous meteorological inputs, single-site/single-year evaluation, and occasional non-physical evening outputs, which may constrain generalization under sensor outages or across geographies.
In parallel, Transformer–GRU hybrids have been explored for PV and irradiance forecasting and, more broadly, for long-horizon sequence learning. Reference [25] introduced a GRU–Temporal Fusion Transformer with a DILATE loss that achieved MAE = 1.19, MSE = 2.08, and RMSE = 1.44 on the “Daily Power Production of Solar Panels” dataset, with Diebold–Mariano tests (p < 0.05) confirming significant gains over XGBoost, N-HiTS, and N-BEATS. Reference [26] presented a transformer-infused recurrent architecture (attention-augmented BiLSTM encoder with a GRU decoder) reaching R2 = 0.9983, RMSE = 0.0140, and MAE = 0.0092 on solar irradiance data, outperforming the ANN, GRU, BiLSTM, BiLSTM–GRU, and vanilla Transformer. These findings indicate that combining global self-attention with lightweight gated recurrence captures complementary dependencies more effectively than purely recurrent or purely transformer models; the remaining limitations include sensitivity to sequence length and the need for careful feature engineering or exogenous inputs when transferring across sites and horizons.
While these works contribute substantially to the field, they often suffer from one or more of the following drawbacks: reliance on exogenous meteorological data, non-modular and complex architectures, and the absence of a unified mechanism that can capture global, local, and multiscale temporal features in an end-to-end manner.
The proposed WT–Transformer–TCN–ECANet–GRU model builds on prior work via the following:
It is meteorology-independent, reducing reliance on exogenous weather sensors.
It combines wavelet decomposition, self-attention, local convolution, and recurrent memory in a unified pipeline, allowing for comprehensive multiscale feature extraction and learning.
Its modularity improves interpretability, scalability, and training efficiency, especially under high-frequency PV data conditions.
This study aims to improve upon existing methods by enhancing accuracy and robustness while also addressing real-world deployment feasibility for smart grid forecasting tasks. The architecture offers a practical and unified response to the challenges repeatedly identified in the recent literature.
1.2. Proposed Hybrid Model: WT–Transformer–TCN–ECANet–GRU
WT–Transformer–TCN–ECANet–GRU is a novel hybrid deep architecture specifically designed for PV power forecasting. The proposed model integrates multiple modeling paradigms (frequency domain analysis, attention mechanisms, and sequential learning) to effectively capture global and local temporal dependencies in time series data. The architecture is composed of a structured pipeline of specialized modules, each designed to extract, transform, and refine high-fidelity temporal features.
As illustrated in Figure 1, the model processes input data with the following components, each designed to handle a particular level of temporal abstraction:
Wavelet Transform (WT).
Linear Input Projection.
Transformer Encoder.
Temporal Convolutional Network (TCN).
Efficient Channel Attention (ECANet).
Gated Recurrent Unit (GRU).
Fully Connected (FC) Layer.
Their combination allows the architecture to effectively extract multiscale patterns, long-term relationships, and essential temporal features, crucial to high-quality PV forecasting, as detailed below.
The end-to-end architecture is mathematically defined as follows:

ŷt = fFC(fGRU(fECA(fTCN(fTrans(fProj(fWT(Xt)))))))

where
Xt: Multivariate input time series at t.
ŷt: PV power predicted value.
fWT(Xt): WT extracting time–frequency components.
fProj: Linear projection mapping the decomposed signal into a common latent space.
fTrans: Transformer encoder capturing global long-range dependencies.
fTCN: TCN for local and mid-term features.
fECA: ECANet model that emphasizes informative features.
fGRU: GRU encoding sequential temporal dependencies.
fFC: Fully connected output layer producing the final prediction.
Wavelet Transform (WT): The proposed architecture starts with a wavelet transformation of the input time series, which decomposes the signal into multiple frequency sub-bands. Using the wavelet’s ability to perform multiscale time–frequency analysis allows the model to capture both long-term trends and short-term fluctuations in the data. By transforming the raw time series into a set of approximation and detail coefficients (at different resolution levels) [27], the WT block provides an insightful representation of the input that isolates disparate temporal dynamics [15].
The wavelet decomposition multiresolution output is then forwarded to the subsequent projection layer, which ensures that downstream modules can use these extracted features.
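As a concrete illustration, the following minimal sketch shows how such a multiscale representation could be obtained with PyWavelets, assuming the continuous Morlet wavelet and the scale range 1–30 reported in Section 2.2; the function name and the toy series are illustrative, not taken from the study's code.

```python
# Minimal sketch of the wavelet stage, assuming the CWT settings of Section 2.2
# (Morlet wavelet, scales 1-30) and PyWavelets as the backend.
import numpy as np
import pywt

def wavelet_features(signal: np.ndarray, scales=np.arange(1, 31)) -> np.ndarray:
    """Return a (len(signal), n_scales) matrix of CWT coefficients."""
    coeffs, _freqs = pywt.cwt(signal, scales, "morl")   # coeffs: (n_scales, len(signal))
    return coeffs.T                                      # one multiscale vector per time step

# Toy 5-min PV-like trace used only to show the shapes involved
pv = np.clip(np.sin(np.linspace(0, 2 * np.pi, 288)), 0, None)
features = wavelet_features(pv)                          # shape (288, 30)
```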
A linear input projection layer (a fully connected mapping) then converts the wavelet-transformed features to the dimension required by the Transformer encoder. Like an embedding layer, it takes the multiscale coefficients output by the WT stage and maps them into a feature space of fixed dimension suitable for attention-based encoding. This transformation is expressed as follows:

Zt = Wp fWT(Xt) + bp

where
Zt represents the projected input features.
Wp and bp are the weight matrix and bias of the projection layer, respectively.
This operation ensures that all time steps and sub-sequences are mapped as vectors in a common latent space such that the attention mechanism can effectively coordinate with all features. The projected features (typically with positional encoding added in practice) are used as input to the Transformer encoder.
Transformer Encoder: The projected time series features are then passed through a Transformer encoder module, which is designed to learn long-range dependencies through self-attention mechanisms [28,29].
The process starts by initializing the hidden representation at the first layer with the projected input:

H(0) = Zt

Then, each subsequent layer (for l = 1, …, L) updates the representation using the following:

H(l) = FFN(MSA(H(l−1)))
MSA executes scaled dot-product attention across multiple heads to enable the model to attend to different positions in the sequence.
FFN is typically a two-layer feed-forward network with ReLU activation, applied at each time step independently.
This formulation provides context for each position in the input sequence by considering its relationship to all other positions using attention and then refines it through a non-linear transformation.
The multi-head self-attention sub-layer allows the model to capture patterns in the entire sequence simultaneously, thereby modeling global temporal relationships (e.g., seasonality or long trends) without recurrence. Attention-based encoding eliminates the vanishing gradient and memory bottleneck of traditional RNNs by enabling direct interaction between distant time steps [29]. The Transformer encoder thus produces a richer representation of the input sequence, endowing it with both content and long-range context of the time series. The output is then passed to the TCN module for further temporal feature extraction.
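A minimal PyTorch sketch of the projection and Transformer-encoder stages is shown below, assuming the hyperparameters listed in Section 2.2 (d_model = 64, 4 heads, 2 encoder layers); positional encoding is omitted for brevity, and the class name is illustrative.

```python
# Sketch of f_Proj and f_Trans, assuming d_model=64, 4 heads, 2 layers (Section 2.2).
import torch
import torch.nn as nn

class GlobalEncoder(nn.Module):
    def __init__(self, in_dim: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)            # f_Proj: map WT coefficients to latent space
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # f_Trans

    def forward(self, x):                                 # x: (batch, seq_len, in_dim)
        z = self.proj(x)                                  # Z_t = W_p x + b_p
        return self.encoder(z)                            # globally contextualized features

h = GlobalEncoder(in_dim=30)(torch.randn(8, 64, 30))      # e.g., 30 wavelet scales, seq_len 64
```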
Temporal Convolutional Network (TCN): The TCN module is included to capture local and medium-term temporal patterns that are not necessarily learned explicitly by the Transformer’s global attention [23]. A TCN consists of multiple layers of dilated causal one-dimensional convolutions, and the output of every layer has the same length as the input. Each layer in the TCN processes its input by updating an internal hidden state representation, which evolves hierarchically through the stacked convolutional layers.
The hidden state at layer l is calculated through 1D causal convolution as follows:

H(l) = ReLU(W(l) ∗d H(l−1) + b(l))

where
W(l) and b(l) are, respectively, the weight and bias parameters of layer l.
∗d: The dilated causal convolution operator.
ReLU: The activation function.
This formulation enables an exponentially large receptive field while preserving the chronological order of information. In practical application, dilated causal convolutions allow the TCN to learn short-to-mid-range dependencies and recurring patterns (e.g., spikes or seasonality at an integer interval) that complement the Transformer’s capabilities. With the support of a stacked TCN and residual connections coupled with appropriate dilation factors, hierarchical temporal features are extracted from the Transformer’s output, and these local trends and high-frequency pattern representations are thereafter forwarded to the attention module for refinement.
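The dilated causal convolution described above can be sketched as follows; kernel size 3 follows Section 2.2, while the residual wiring and the dilation schedule (1, 2, 4) are assumptions based on standard TCN practice rather than the authors' exact configuration.

```python
# Illustrative dilated causal convolution block; kernel size 3 as in Section 2.2,
# residual connection and dilation schedule are standard TCN assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation           # left padding keeps the layer causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                 # x: (batch, channels, seq_len)
        y = F.pad(x, (self.pad, 0))                       # pad only on the left (no future leakage)
        y = torch.relu(self.conv(y))
        return y + x                                      # residual connection preserves length

tcn = nn.Sequential(CausalBlock(64, dilation=1), CausalBlock(64, dilation=2),
                    CausalBlock(64, dilation=4))
out = tcn(torch.randn(8, 64, 64))                         # output length equals input length
```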
Efficient Channel Attention (ECANet): After the TCN, an Efficient Channel Attention (ECA) mechanism is employed to adaptively recalibrate the channel-wise features. Specifically, following adaptive average pooling of the TCN output, a small convolution (with kernel size k) is used to capture local cross-channel interactions and produce an attention weight for each feature channel. The kernel size k is determined adaptively as a function of the number of channels, ensuring that the attention mechanism scales with the feature dimensionality [30]. This approach significantly reduces complexity compared to a standard Squeeze-and-Excitation module or Convolutional Block Attention Module (CBAM) while effectively highlighting the most informative channels. The ECA module adds only k trainable parameters yet sacrifices little performance compared to heavier attention blocks. By reweighting the TCN output channels according to their importance, ECANet allows the model to concentrate on important features (e.g., emphasizing specific frequency bands or strongly predictive sensors) and suppress less critical information. The attention-weighted feature sequence is then passed to the GRU layer.
Channel-wise attention weight computation:

ω = σ(Conv1Dk(AvgPool(F)))

Attention-weighted output:

F̃ = ω ⊙ F

where
σ: The sigmoid activation function.
Conv1Dk: A one-dimensional convolution with kernel size k.
AvgPool: Channel-wise average pooling.
F: The output feature map from the TCN module.
⊙: The element-wise (Hadamard) product.
ω: The learned channel-wise attention scores.
F̃: The reweighted feature map passed to the GRU layer.
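A compact sketch of this recalibration step is given below; the adaptive kernel-size heuristic based on the channel count follows the original ECA-Net formulation and is an assumption about how k is chosen here.

```python
# ECA sketch: average pooling, a size-k 1D convolution across channels, sigmoid gating.
# The adaptive kernel-size rule follows the ECA-Net paper and is assumed here.
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1                         # force an odd kernel size
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                 # x: (batch, channels, seq_len), i.e. F
        w = self.pool(x)                                  # AvgPool(F): (batch, channels, 1)
        w = self.conv(w.transpose(1, 2))                  # Conv1D_k across the channel axis
        w = torch.sigmoid(w.transpose(1, 2))              # omega: channel attention scores
        return x * w                                      # F_tilde = omega (Hadamard) F

refined = ECA(64)(torch.randn(8, 64, 64))
```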
Gated Recurrent Unit (GRU): The GRU layer is added to capture any remaining sequential dependencies and to integrate information over time using gating mechanisms. In this architecture, the GRU takes the ECA-weighted feature sequence as input and processes it step by step, maintaining an internal hidden state [31]. The gating structure enables the GRU to learn to retain or discard long-term information as needed while modeling complex temporal dynamics and noise in the time series. Compared to standard LSTMs [32], GRUs require fewer parameters (since there is no separate output gate) but often achieve equivalent performance. This makes the GRU an attractive, computationally inexpensive choice for adding a recurrent inductive bias on top of the attention and convolutional representations.
The internal computations of the GRU are defined as follows:
Update gate: This gate controls how much of the previous hidden state should be retained.

zt = σ(Wz xt + Uz ht−1)

Reset gate: This gate determines how much past information to forget when computing the new memory content.

rt = σ(Wr xt + Ur ht−1)

Candidate hidden state: The proposed new memory content, influenced by the reset gate.

h̃t = tanh(Wh xt + Uh (rt ⊙ ht−1))

Final hidden state update: A linear interpolation between the previous hidden state and the new candidate, controlled by the update gate.

ht = zt ⊙ ht−1 + (1 − zt) ⊙ h̃t

where
xt: The input at time step t.
ht−1: The previous hidden state.
σ: The sigmoid activation function.
tanh: The hyperbolic tangent activation function.
⊙: The element-wise (Hadamard) product.
Wz, Wr, Wh and Uz, Ur, Uh: The trainable weight matrices for the input and hidden state, respectively.
The GRU output (e.g., the final hidden state or sequence of states, based on the task) is a concise summary of the time series that is informed by both long-term contexts and recent observations.
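In implementation terms, this stage can be sketched with PyTorch's built-in GRU, assuming the single-layer, batch-first configuration with hidden size 64 given in Section 2.2; keeping only the final hidden state is one possible choice.

```python
# Sketch of the recurrent stage: single-layer batch-first GRU with hidden size 64
# (Section 2.2); the final hidden state summarizes the ECA-weighted sequence.
import torch
import torch.nn as nn

gru = nn.GRU(input_size=64, hidden_size=64, num_layers=1, batch_first=True)
seq = torch.randn(8, 64, 64)          # (batch, seq_len, features) from the ECANet stage
outputs, h_n = gru(seq)               # outputs: hidden state at every step; h_n: last state
summary = h_n[-1]                     # (batch, 64) representation fed to the FC layer
```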
Fully Connected Layer: The last component of the model is a fully connected output layer that projects the GRU representation to the target prediction. This layer (essentially a linear combination of the GRU output features) produces the prediction in the desired form. The linear output layer keeps the end-to-end model differentiable and allows the earlier layers (WT to GRU) to be trained to optimize prediction accuracy. The final model output is obtained by linearly projecting the final hidden state with the output layer’s weights and bias:

ŷt = Wo ht + bo
In summary, the WT–Transformer–TCN–ECANet–GRU architecture integrates powerful modules—wavelet-based decomposition for multiscale feature extraction, self-attention for capturing long-range dependency, convolution for learning local patterns, channel attention for weighing feature importance, and gated recurrence for sequential inference—into a single, end-to-end pipeline aimed at capturing every facet of temporal dynamics in the data. Each component addresses a specific aspect of the time series learning problem, and together they contribute to robust predictive performance.
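To make the data flow concrete, the sketch below chains the illustrative modules defined earlier in this section (GlobalEncoder, CausalBlock, ECA) in the stated order; the layer sizes and the single-output head are assumptions, and the wavelet stage is assumed to be applied beforehand as in the CWT sketch above.

```python
# End-to-end sketch of the hybrid pipeline (Projection -> Transformer -> TCN ->
# ECANet -> GRU -> FC), reusing the GlobalEncoder, CausalBlock, and ECA sketches
# defined earlier in this section; sizes are illustrative, not the authors' code.
import torch
import torch.nn as nn

class HybridForecaster(nn.Module):
    def __init__(self, n_scales: int = 30, d_model: int = 64):
        super().__init__()
        self.encoder = GlobalEncoder(n_scales, d_model)            # f_Proj + f_Trans
        self.tcn = nn.Sequential(CausalBlock(d_model, dilation=1),
                                 CausalBlock(d_model, dilation=2)) # f_TCN
        self.eca = ECA(d_model)                                    # f_ECA
        self.gru = nn.GRU(d_model, d_model, batch_first=True)      # f_GRU
        self.fc = nn.Linear(d_model, 1)                            # f_FC

    def forward(self, wt_coeffs):                                  # (batch, seq_len, n_scales)
        h = self.encoder(wt_coeffs)                                # global context
        h = self.tcn(h.transpose(1, 2))                            # local patterns (channels first)
        h = self.eca(h).transpose(1, 2)                            # channel recalibration
        _, h_n = self.gru(h)                                       # sequential memory
        return self.fc(h_n[-1]).squeeze(-1)                        # one predicted value per sample

y_hat = HybridForecaster()(torch.randn(8, 64, 30))                 # wavelet features in, forecasts out
```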
1.3. Novelty and Distinction from Other Hierarchical Hybrid Architectures
While several hybrid forecasting models adopt hierarchical flows (e.g., Transformer–GRU, TCN–GRU–ECANet, CNN–LSTM), the proposed WT–Transformer–TCN–ECANet–GRU implements a hierarchical and interdependent feature processing flow that differs both structurally and functionally from prior work [24,26].
Structural Novelty: In typical hierarchical architectures, modules follow a linear, feed-forward sequence, with each block processing the output of its predecessor in isolation. Representative cases are the following: Transformer–GRU [25], where global attention is applied before sequential modeling but explicit extraction of local patterns and channel recalibration are skipped; TCN–GRU–ECANet [24], where local modeling and channel attention are present but global context and multiresolution frequency features are not taken into account; and CNN–LSTM [16], where spatial/feature extraction and recurrent modeling coexist but frequency decomposition and attention mechanisms are not explicitly combined. By contrast, our model embeds multiresolution wavelet decomposition inside the learnable stack rather than as a fixed pre-step. The WT coefficients are linearly projected into a shared latent space and then passed to the Transformer, enabling decomposed representations to interact with downstream modules—Transformer (global dependencies), TCN (local temporal patterns), ECANet (channel recalibration), and GRU (sequential memory)—within one trainable graph.
Functional Interdependence: Transformer outputs (global context) condition the subsequent TCN stage (local features); ECANet then recalibrates channels before the GRU integrates a calibrated multiscale sequence rather than raw features. This global-to-local conditioning pathway ensures that each stage operates on contextually enriched inputs, unlike the baselines above, where modules are connected serially but not mutually informed. Baselines such as Transformer–GRU, TCN–GRU–ECANet, and CNN–LSTM lack either the multiresolution decomposition, the explicit global-to-local conditioning, or the inter-stage channel recalibration, resulting in weaker coupling between hierarchical stages.
Figure 2 compares the proposed WT–Transformer–TCN–ECANet–GRU with three representative baselines to illustrate structural and functional differences. All diagrams use a consistent visual grammar (block shapes and arrows) for fair comparison.
Figure 2a: Proposed WT–Transformer–TCN–ECANet–GRU: Input → WT → Linear Projection → Transformer → TCN → ECANet → GRU → FC → Output (ŷ). The WT’s outputs are projected to a shared latent space and fused with Transformer features before local extraction. ECANet performs channel-wise recalibration before the GRU, ensuring memory integrates a selectively emphasized representation.
Figure 2b: Transformer–GRU baseline: Input → Linear Projection → Transformer → GRU → FC → Output (ŷ). This captures long-range dependencies but lacks the WT and TCN/ECANet; the GRU receives sequence-level attention features without explicit multiscale/local refinement.
Figure 2c: TCN–GRU–ECANet baseline: Input → TCN → ECANet → GRU → FC → Output (ŷ). This models local patterns and applies channel recalibration but omits global attention and frequency domain decomposition; long-range dependencies are implicit.
Figure 2d: CNN–LSTM baseline: Input → CNN → LSTM → FC → Output (ŷ). This learns local spatial/temporal patterns and sequential dependencies but omits frequency decomposition, explicit attention mechanisms, and global-to-local feature conditioning.
The proposed pipeline is not a mere reordering of blocks; it integrates decomposition (WT), global attention (Transformer), local convolution (TCN), channel attention (ECANet), and recurrent memory (GRU) into a single, coupled hierarchy in which later stages consume context-conditioned and channel-reweighted features. This is supported by the ablation study results (Section 3.1.2), where relocating or removing modules consistently causes statistically significant performance degradation, reinforcing that the novelty lies in hierarchical interdependence rather than arbitrary block permutation.
2. Materials and Methods
Building on a rich, high-resolution PV production dataset (described below), we developed a novel hybrid forecasting model that captures both global and local temporal features. The dataset, preprocessing steps, and model configuration are detailed in the following subsections.
2.1. Dataset Description
This study uses a complete dataset of photovoltaic (PV) panel production from the presidential building of Ibn Tofail University in Kenitra, Morocco. The data span one year, from 1 January 2022 to 31 December 2022, and consist of measurements taken every five minutes, yielding 95,885 records.
Each input contains the following:
To better understand the dataset’s behavior, Figure 3 presents the average daily PV production pattern, showing the typical diurnal shape with zero values during nighttime and a single production peak around midday.
Key statistics of the dataset include the following:
Mean production: ~3281 Wh; median: ~2742 Wh.
Minimum value: 0 Wh in nighttime hours; maximum: 18,044 Wh in peak sunlight.
Standard deviation: ~3419 Wh.
These statistics confirm the strong intraday variability in and skewness of PV production, characteristic of solar irradiance-driven systems.
Before modeling, basic preprocessing steps were applied:
Missing Data Handling: Bad or missing measurements were interpolated or removed to maintain the continuity of the time series.
Normalization: The PV power values were normalized using RobustScaler before being input to the neural networks, which scales the PV data by removing the median and dividing by the interquartile range. This approach is robust to outliers and helps stabilize deep learning model training.
Train–Test Split: Following standard time series practice, the earliest part of the timeline is used for training and the most recent part for testing, so that the test data temporally succeed the training data and mimic real forecasting conditions; in this study, 70% of the dataset is used for training and 30% is reserved for testing. A minimal sketch of these preprocessing steps is given below.
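The following sketch illustrates these steps under the assumption of a simple sliding-window, one-step-ahead setup with sequence length 64 (as in Section 2.2); the random series is a placeholder for the actual PV measurements.

```python
# Sketch of the preprocessing pipeline: RobustScaler fitted on the training portion,
# a chronological 70/30 split, and sliding windows of length 64 (Section 2.2).
import numpy as np
from sklearn.preprocessing import RobustScaler

SEQ_LEN = 64
pv_values = np.random.rand(95_885, 1)                     # placeholder for the PV power column

split = int(0.7 * len(pv_values))                         # earliest 70% for training
scaler = RobustScaler().fit(pv_values[:split])            # median/IQR from training data only
scaled = scaler.transform(pv_values).ravel()

def make_windows(series: np.ndarray, seq_len: int = SEQ_LEN):
    X = np.stack([series[i:i + seq_len] for i in range(len(series) - seq_len)])
    y = series[seq_len:]
    return X, y                                           # inputs and one-step-ahead targets

X_train, y_train = make_windows(scaled[:split])
X_test, y_test = make_windows(scaled[split - SEQ_LEN:])   # keep context across the split boundary
```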
2.2. Hyperparameters
A single set of hyperparameters was adopted for all models to provide a consistent and fair comparison. The models used a sequence length of 64 and a hidden size of 64 for both recurrent and convolutional modules. The Transformer had an embedding dimension (d_model) of 64, 4 attention heads, and 2 encoder layers; the TCN used a kernel size of 3 for spatial–temporal feature extraction; and the GRU and LSTM recurrent modules were fixed to a single layer, operating in batch-first order.
Data preprocessing was standardized for all experiments with the RobustScaler. For signal decomposition, the CWT was chosen with the Morlet wavelet because of its favorable time–frequency localization properties and effectiveness in modeling irradiance signals characterized by smooth oscillatory patterns. Scale widths were fixed from 1 to 30, as empirically validated for the highest resolution granularity. This wavelet was selected over alternatives such as Daubechies, Symlets, and Coiflets, which, while effective for sharp discontinuities, lack the smooth spectral representation offered by Morlet. The resulting approximation and detail coefficients provided denoised multiresolution signal representations that enhanced the detection of temporal patterns for downstream modules such as the Transformer and TCN layers, reducing the learning burden and improving convergence stability as well as generalization. Other decomposition methods used included VMD (K = 5; α = 2000; tol = 1 × 10−7), EMD using PyEMD settings, and EWT with the same CWT–Morlet configuration. For Prophet-based hybrid models, parameters were fixed to daily_seasonality = True; growth = ‘linear’; and seasonality_mode = ‘additive’, in order to focus the neural component on residual modeling.
To determine these hyperparameter settings, initial tuning with a limited grid search and empirical validation was conducted. This helped balance model complexity, convergence speed, and generalization performance between architectures. Other techniques, such as dropout regularization and batch normalization, were experimentally tested during the exploratory phase but were ultimately omitted due to their negligible impact on validation performance.
Although early stopping was not applied during final training, the training loss curves in Figure 4 show that the models achieved stable convergence between epochs 600 and 800. Thus, the learning rate (0.0005) and a 1000-epoch schedule were preserved to ensure complete learning convergence without oscillations or overfitting.
All experiments were performed in Google Colab using a TPU v2-2 backend with 334 GB RAM, Intel® Xeon® CPU @ 2.00 GHz (96 vCPUs, 48 physical cores), Python 3.11.13, and PyTorch/XLA. Training the proposed WT–Transformer–TCN–ECANet–GRU model on the full dataset (~95,000 five-min resolution samples) required approximately 10.6 h (38,056 s). Inference on the complete test set was executed in under 4 s using batched processing, corresponding to an average latency of ~0.674 ms per prediction when batches are used, which demonstrates strong feasibility for near real-time PV forecasting tasks.
Despite the higher computational demands introduced by the integration of wavelet decomposition, Transformer attention, temporal convolution, and recurrent memory, the model significantly outperformed baseline and hybrid alternatives. This confirms the advantage of the proposed architecture for real-world deployment in precision-sensitive smart grid systems.
Table 1 summarizes a comprehensive comparison of the hyperparameter settings and architectural components used by each model, including the type of decomposition method (if applied), Transformer configuration, convolution and attention modules, recurrent architecture, and corresponding implementation notes. The table shows the distinction between baseline models (e.g., GRU, CNN-LSTM), hybrid models (e.g., Prophet–GRU, Transformer–GRU), and the proposed models with multiple architectural enhancements. The aim within this comparative framework is to lay bare the contribution of each component—decomposition, attention, convolution, and recurrence—to photovoltaic power forecasting tasks.
Figure 5 illustrates the end-to-end workflow of the proposed WT–Transformer–TCN–ECANet–GRU forecasting system, following the implemented operational logic. The workflow begins with loading the high-frequency PV production dataset and proceeds with data preprocessing steps, including robust scaling and the WT, which provides a multiscale signal decomposition. This step enhances noise robustness and emphasizes informative time–frequency patterns.
After that, the data are transformed into input sequences suitable for deep learning while preserving the temporal structure. These sequences are passed through the main forecasting model, implemented as a single hybrid module combining the WT, Transformer, TCN, ECANet, and GRU.
The model is trained iteratively over a number of epochs using the Adam optimizer and Mean Squared Error (MSE) loss. Once converged, the trained model generates predictions on the test set, which are inverse-transformed to the original scale. Final predictions and actual values are exported for evaluation and visualization, completing the predictive cycle.
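The training procedure can be sketched as follows, matching the optimizer, loss, learning rate, and epoch budget reported in Section 2.2; full-batch updates, placeholder tensors, and the HybridForecaster class from the sketch in Section 1.2 are simplifying assumptions.

```python
# Illustrative training loop: Adam, MSE loss, lr = 0.0005, 1000 epochs (Section 2.2).
# Full-batch updates and random placeholder tensors are simplifications; the
# HybridForecaster refers to the sketch in Section 1.2.
import torch
import torch.nn as nn

model = HybridForecaster()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss_fn = nn.MSELoss()

X = torch.randn(256, 64, 30)          # placeholder wavelet-feature windows
y = torch.randn(256)                  # placeholder scaled targets

for epoch in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    preds = model(X)                  # inverse-transform with the fitted scaler afterwards
```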
This modularized data-to-output flow provides a full pipeline for time series prediction in solar energy systems with an emphasis on reproducibility, interpretability, and robustness. It is a conceptual and implementation-level guide for the application of high-performance hybrid deep learning systems to real-world energy forecasting problems.
2.3. Evaluation Criteria
To assess the performance of the proposed model and compare it with other architectures, we employed four commonly accepted statistical metrics in time series forecasting:
The MAE measures the average magnitude of forecast errors without considering their direction. It is intuitive and scales linearly with the size of the error, making it easy to interpret [33,34,35]. The equation of the MAE is as follows:

MAE = (1/n) ∑ |yi − ŷi|

where yi and ŷi denote the i-th observed and predicted PV power values and n is the number of test samples.
The MSE penalizes large errors more heavily, making it useful when large deviations are particularly undesirable; however, it is sensitive to outliers [33,34]. It is defined as follows:

MSE = (1/n) ∑ (yi − ŷi)²
The RMSE is the square root of the MSE and shares the same units as the original data. The RMSE provides a balanced view between the MAE and MSE with an interpretable scale of error [12,33,35]. The equation of the RMSE is as follows:

RMSE = √((1/n) ∑ (yi − ŷi)²)
R2 determines the proportion of the variance in the observed data accounted for by the model; a value closer to 1 indicates better explanatory power [34]. It is computed as follows:

R2 = 1 − ∑ (yi − ŷi)² / ∑ (yi − ȳ)²

where ȳ is the mean of the observed values.
These measures were chosen to provide an aggregate evaluation on multiple dimensions of forecasting accuracy: error size (MAE), error variability (MSE and RMSE), and model explanatory power (R2). All metrics were evaluated on the same test set using the model predictions and real PV power values.
The selection of the MAE, MSE, RMSE, and R2 gives a comprehensive assessment of accuracy and variability while avoiding the pitfalls of measures such as the MAPE (Mean Absolute Percentage Error) and SMAPE (Symmetric Mean Absolute Percentage Error), which tend to be unstable or undefined when actual values approach zero, a common situation in PV forecasting, particularly during nighttime hours. R2 is scale-invariant, while the MAE, MSE, and RMSE are scale-dependent yet stable, robust in the presence of near-zero values, and widely used in the time series forecasting literature.
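For completeness, the four metrics can be computed directly from the predicted and observed series, as in the short NumPy sketch below.

```python
# Direct NumPy computation of the four reported metrics.
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_true - y_pred
    mae = np.mean(np.abs(err))                                    # average absolute error
    mse = np.mean(err ** 2)                                       # penalizes large deviations
    rmse = np.sqrt(mse)                                           # same units as the data
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

print(evaluate(np.array([0.0, 2.0, 4.0]), np.array([0.1, 1.8, 4.3])))  # toy example
```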
In addition to these error metrics, statistical significance and effect size analyses were conducted to determine whether performance differences between model variants were due to genuine architectural effects or random variation.
The Wilcoxon signed-rank test is a nonparametric test for comparing two paired samples without assuming normality. The test statistic is the following:

W = min(W⁺, W⁻)

where W⁺ and W⁻ are the sums of the ranks of the positive and negative paired differences, respectively. Small W values with p < 0.05 indicate a statistically significant median difference.
The paired t-test is a parametric test used when the paired differences are normally distributed. The test statistic is the following:

t = d̄ / (sd / √n)

where d̄ is the mean of the paired differences, sd their standard deviation, and n the number of pairs; p < 0.05 indicates a significant mean difference.
Cohen’s d quantifies the practical effect size using a pooled standard deviation:

d = (x̄1 − x̄2) / sp, with sp = √(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2))

where
x̄1, x̄2: The sample means.
s1, s2: The standard deviations.
n1, n2: The sample sizes.
Conventional thresholds interpret d ≈ 0.2 (small), 0.5 (medium), and 0.8 (large).
The selection of paired t-tests and Wilcoxon signed-rank tests is motivated by the need to carefully investigate whether observed performance differences between models are statistically significant under different assumptions regarding underlying distributions. The Wilcoxon test is robust against non-normality and small sample sizes and is appropriate for comparisons of models where the distributions of errors are non-normal, whereas paired t-tests have greater statistical power when normality holds approximately. These are supplemented by measures of Cohen’s d, which provide the practical size of the difference such that statistically significant results also possess meaningful effect sizes within the context of PV forecasting.
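A brief sketch of how these tests and the effect size can be computed with SciPy is given below; pairing the per-sample absolute errors of two models is an assumption about how the comparisons were set up.

```python
# Significance and effect-size sketch with SciPy; per-sample absolute errors of two
# models are assumed as the paired observations.
import numpy as np
from scipy import stats

errors_a = np.abs(np.random.randn(200))           # placeholder errors, model A
errors_b = np.abs(np.random.randn(200) * 1.2)     # placeholder errors, model B

w_stat, w_p = stats.wilcoxon(errors_a, errors_b)  # nonparametric paired test
t_stat, t_p = stats.ttest_rel(errors_a, errors_b) # parametric paired t-test

# Cohen's d with a pooled standard deviation
n1, n2 = len(errors_a), len(errors_b)
s_pooled = np.sqrt(((n1 - 1) * errors_a.std(ddof=1) ** 2 +
                    (n2 - 1) * errors_b.std(ddof=1) ** 2) / (n1 + n2 - 2))
cohens_d = (errors_a.mean() - errors_b.mean()) / s_pooled
```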
4. Discussion
The enhanced potential of the proposed WT–Transformer–TCN–ECANet–GRU framework stems from the complementary deployment of specialized modules designed to address specific attributes of short-term PV power forecasting. By combining multiresolution decomposition, global attention, localized extraction of temporal patterns, and memory-efficient sequential mechanisms, the model effectively leverages both short- and long-range temporal dependencies inherent in non-stationary PV time series.
Wavelet-based decomposition (WT) is instrumental in filtering high-frequency noise while preserving signal characteristics across a variety of temporal scales. This ability, supported by the robust gains reported in Section 3.1.2 and Section 3.3, is particularly helpful in the presence of fluctuating weather and during transition phases, where prediction is most difficult.
The attention mechanisms, and specifically ECANet, offer a lightweight yet efficient way of recalibrating channels. As demonstrated in Section 3.3, ECANet retains higher precision and lower latency compared to heavier options such as full self-attention, while the Transformer encoder complements it by preserving long-range relations without recurrence, addressing the limitations of traditional RNN-based architectures.
The TCN layer enhances the mid-term and local fluctuation detection capability of the model. Its removal, as observed in the ablation results, leads to measurable error increases, confirming its role in refining localized temporal features.
Finally, the GRU allows for stable and efficient temporal memory integration and offers competitive or superior accuracy to LSTM with fewer parameters. When the GRU is removed, substantial degradation is observed, underscoring its importance in managing complex, non-linear diurnal transitions.
The overall results reinforce the ability of hybrid structures to address the inherent non-stationarity and intermittency of PV time series. By incorporating different temporal processing mechanisms in one model, the proposed WT–Transformer–TCN–ECANet–GRU delivers robust performance and practical applicability, making it a promising candidate for intelligent energy forecasting applications.
5. Conclusions
This study presented a novel hybrid deep learning architecture, WT–Transformer–TCN–ECANet–GRU, for short-term PV power forecasting. By integrating wavelet-based decomposition, temporal convolution, channel-wise attention, and recurrent memory units, the model captures both high-frequency variability and long-term dependencies in solar generation data.
When tested on a large-scale five-minute resolution dataset (~95,000 samples), the model yielded MAE = 209.36 W, RMSE = 616.53 W, and R2 = 0.96884 with just 0.142 M parameters, requiring less than 4 s for inference on the full test set (an average per-sample latency of ~0.674 ms), which supports its applicability to near real-time PV forecasting without resorting to large-scale networks.
The key contribution of this study is that it demonstrates that architectural innovation, not model size growth, can provide significant accuracy improvements under various weather scenarios. It advances the state of the art in PV forecasting as a computationally efficient yet high-performing solution applicable to grid stability, energy trading, and renewable integration.
Future work will extend validation to geographically and climatically diverse datasets, incorporate probabilistic forecasts for uncertainty quantification, and examine deployment on resource-limited devices (e.g., Raspberry Pi, Jetson Nano). Runtime benchmarking on standard platforms will also be conducted to facilitate fair cross-study comparisons.
Overall, the proposed model offers an accurate and computationally efficient solution to existing PV forecasting challenges, with considerable relevance to real-world smart grid scenarios.