A Multi-Scale CNN-Transformer Network with Residual Correction for Ultra-Short-Term Photovoltaic Power Forecasting

Ye, Xiao; Yin, Jun; Zhang, Jiajia; Li, Anping; Liu, Zhibo; Chen, Bin; Yang, Jingyao; Li, Shilei; Li, Hongmei

doi:10.3390/pr14050759


by Xiao Ye ¹,², Jun Yin ³, Jiajia Zhang ¹, Anping Li ³, Zhibo Liu ¹, Bin Chen ², Jingyao Yang ², Shilei Li ² and Hongmei Li ²,*

¹ China Energy Engineering Group Anhui Electric Power Design Institute Co., Ltd., Hefei 230601, China
² School of Electrical Engineering and Automation, Hefei University of Technology, Hefei 230009, China
³ Huaihe Energy and Power Group Co., Ltd., Huainan 232000, China
* Author to whom correspondence should be addressed.

Processes 2026, 14(5), 759; https://doi.org/10.3390/pr14050759

Submission received: 27 January 2026 / Revised: 14 February 2026 / Accepted: 24 February 2026 / Published: 26 February 2026

(This article belongs to the Section AI-Enabled Process Engineering)


Abstract

Accurate photovoltaic (PV) power forecasting is essential for the reliable integration of renewable energy into electrical grids. This paper proposes a novel Multi-Scale CNN-Transformer network with Residual Correction (MSCT-RCM) for ultra-short-term PV power forecasting. The model integrates parallel multi-scale convolutional neural networks (CNNs) to extract local temporal features, a Transformer encoder to capture long-range dependencies, and a Residual Correction Module (RCM) that dynamically refines predictions using historical error patterns. A two-stage training strategy is employed to stabilize learning and enhance performance. Experimental evaluation on two years of operational data from a large-scale PV plant demonstrates that the proposed model achieves an R² value of 0.9944 for 15-minute-ahead forecasts and reduces mean absolute error (MAE) and root mean square error (RMSE) by over 50% in one-hour-ahead predictions compared to benchmark models. The MSCT-RCM model therefore exhibits strong potential for deployment in scenarios requiring high-precision predictions, such as smart grid scheduling.

Keywords:

ultra-short-term photovoltaic power forecasting; multi-scale convolutional neural network; transformer; residual correction; two-stage training

1. Introduction

The rapid deployment of photovoltaic (PV) systems worldwide has emphasized the necessity of accurate PV power forecasting to ensure grid stability, optimize energy dispatch, and support electricity market operations [1,2,3]. PV output is highly stochastic and intermittent due to rapidly changing meteorological conditions and spatial heterogeneity [4,5]. Such variability introduces considerable uncertainty into power system operations. Accurate forecasting in short-term and ultra-short-term scenarios is critical for minimizing reserve capacity requirements and enhancing overall system reliability.

PV power forecasting is a typical time series regression problem, and existing methods can generally be grouped into three categories: traditional statistical methods, machine learning methods, and deep learning methods. Early studies on PV power forecasting primarily adopted statistical methods such as Autoregressive (AR), Autoregressive Moving Average (ARMA), or Autoregressive Integrated Moving Average (ARIMA) [6,7,8,9]. These methods can achieve satisfactory forecasting performance on short-term stationary sequences and offer high computational efficiency. However, due to the strongly nonlinear, multi-peak, and intermittent characteristics of PV power data, their prediction accuracy is limited under complex weather conditions [10]. Introducing external meteorological variables such as temperature, wind speed, and solar irradiance for multivariate regression alleviates these limitations to some extent. However, these methods still struggle to effectively capture long-term temporal dependencies and multi-scale features.

Machine learning methods such as Support Vector Regression (SVR), Gradient Boosted Decision Trees (GBDT), and Random Forests (RF) are capable of effectively modeling nonlinear relationships and demonstrate a certain degree of generalization ability in scenarios involving small samples or high-dimensional features [11,12,13]. To enhance the sensitivity of these methods to multi-scale features, they are often combined with signal decomposition techniques such as Empirical Mode Decomposition (EMD), Ensemble Empirical Mode Decomposition (EEMD), or Variational Mode Decomposition (VMD). In these methods, the original time series is decomposed into components of different frequency bands before performing the prediction [14,15,16,17]. Although these methods alleviate the issues of nonlinearity and multi-scale variability to some extent, the decomposition process often suffers from parameter sensitivity and high computational complexity, and cumulative errors may occur under extreme weather conditions [18].

In recent years, deep learning methods have demonstrated remarkable advantages in PV power forecasting. Convolutional Neural Networks (CNNs) are capable of effectively capturing local temporal patterns [19,20,21,22]. Recurrent Neural Networks (RNNs) and their variants, such as Long Short-Term Memory (LSTM) networks and Gated Recurrent Unit (GRU), possess inherent advantages in modeling long-term dependencies and temporal dynamics [23,24,25,26,27]. Models based on attention mechanisms and the Transformer architecture exhibit powerful capabilities in capturing long-term temporal dependencies. They can effectively uncover complex feature interactions and significantly improve forecasting accuracy [28,29,30]. In addition, hybrid models that combine LSTM, BiLSTM, and Transformer architectures have further enhanced power forecasting performance [31,32]. However, challenges remain regarding model generalization ability and robustness under extreme meteorological conditions [33].

In summary, PV power forecasting methods have evolved from traditional statistical approaches to machine learning and, more recently, to deep learning techniques. Traditional statistical methods offer high computational efficiency but struggle to capture nonlinear and multi-scale features. Machine learning approaches improve nonlinear modeling capabilities but remain limited in capturing long-term dependencies. Deep learning methods significantly enhance forecasting performance; however, challenges persist regarding model complexity and generalization ability.

Regarding the multi-scale characteristics of PV power data, signal decomposition methods such as EMD or VMD can separate components of different frequency bands, enabling forecasting models to capture temporal features more clearly. However, these decomposition techniques often introduce issues such as high parameter sensitivity, increased computational complexity, and potential information leakage if the validation process is not sufficiently rigorous [34]. Residual connections, attention enhancement modules, and multi-scale convolution operations can strengthen the forecasting model’s responsiveness to high-frequency fluctuations and local anomalies [35,36]. Moreover, historical prediction residuals or error information, as auxiliary input, can help alleviate the problem of error accumulation in multi-step prediction [37]. Furthermore, to balance forecasting accuracy with practical deployment requirements, strategies such as lightweight network design, parameter-sharing mechanisms, and model compression have gradually attracted research attention [38].

To achieve high-precision ultra-short-term PV power forecasting, this paper proposes a forecasting scheme that integrates multi-scale CNN, Transformer architecture, and residual correction. The design concept first employs a multi-scale CNN to extract local temporal features, and then utilizes a Transformer backbone to model long-term temporal dependencies, generating high-quality baseline predictions. On this basis, a Residual Correction Module (RCM) is designed, which iteratively refines the baseline predictions by incorporating historical residual information.

The main contributions of this paper are summarized as follows:

Multi-scale feature extraction and global dependency modeling: A hybrid architecture integrating multi-scale CNN and Transformer is proposed to jointly capture local temporal dynamics and global dependencies, enabling high-precision forecasting across multiple time scales.
Residual Correction Module: The RCM is designed to mitigate error accumulation in multi-step forecasting. By incorporating historical error data to refine baseline predictions, it enhances forecasting accuracy, stability, and robustness under complex meteorological conditions.
Two-Stage Training Strategy: A two-stage training strategy is introduced, where the CNN-Transformer baseline model is first pre-trained independently for robust feature extraction, then jointly optimized with the RCM. This strategy stabilizes convergence and improves multi-step forecasting performance.

The remainder of the paper is organized as follows. Section 2 presents the proposed ultra-short-term PV power forecasting scheme, Section 3 describes the model training process, and Section 4 presents experimental results and analysis. Finally, Section 5 concludes the paper and highlights directions for future research.

2. Proposed Scheme

This study proposes an ultra-short-term PV power forecasting scheme that integrates a multi-scale CNN, Transformer backbone-based global dependency modeling, and a residual correction mechanism. By means of multi-scale feature extraction and dynamic residual correction, the proposed scheme effectively addresses the high nonlinearity and multi-scale characteristics of PV power sequences under complex meteorological conditions.

The overall workflow diagram of the proposed scheme is shown in Figure 1, which consists of five main steps:

Data Preparation and Preprocessing: Raw PV power and meteorological data undergo cleaning, outlier handling, and normalization to form a continuous, complete, and high-quality input feature matrix, laying a reliable data foundation for subsequent modeling.
Feature Engineering: This step includes feature enhancement using EMD to decompose numerical variables into multi-scale components, and feature selection based on Gradient Boosting Regression (GBR) to rank and select the top 10 features. This process improves the multi-scale representation of inputs and reduces redundancy, as detailed in Section 2.1.
Multi-Scale CNN-Transformer Model Construction: The model architecture integrates multi-scale CNN branches and a Transformer encoder. The CNN branches extract local temporal features at different resolutions using parallel convolutional kernels, capturing short-term fluctuations, medium-term trends, and long-term patterns. The Transformer backbone then models global dependencies to generate baseline predictions, as described in Section 2.2.
Residual Correction Module (RCM): Dynamic correction is performed using historical residuals. The interaction between baseline predictions and historical residuals is captured by a lightweight Transformer encoder, and the prediction results are iteratively optimized.
Final Prediction Output: After fusing the baseline prediction with the residual correction output, high-precision ultra-short-term photovoltaic power prediction results are generated.

2.1. Data Preprocessing and Feature Enhancement

The data preprocessing pipeline is meticulously designed to prevent information leakage by strictly adhering to the principle of “independent training set processing—test set parameter reuse.” The workflow comprises the following sequential steps: (1) raw data collection; (2) data cleaning and outlier correction; (3) temporal sequence-oriented splitting into training and test sets; (4) training set-exclusive feature enhancement via EMD; (5) training set-exclusive feature selection using GBR; (6) calculation of feature extrema based solely on the training set; (7) normalization of the training set; (8) normalization of the test set by reusing the training set-derived extrema; and (9) construction of the input feature matrix.

(1): Data Cleaning and Outlier Handling

For records with a missing rate below 0.5%, linear interpolation is used for imputation. Outliers (e.g., power records exceeding the plant’s rated capacity or below zero) are removed or corrected. This step is applied to the entire raw dataset prior to train–test splitting to maintain data quality without information leakage.

(2): Feature Enhancement and Selection

To enhance the multi-scale expression capability of features, EMD is used to perform multi-scale decomposition on each numerical feature. Based on EMD, the time series x(t) is decomposed into several Intrinsic Mode Functions (IMFs) and a residual term, which can be expressed as follows:

x(t) = \sum_{i=1}^{n} \mathrm{IMF}_{i}(t) + r_{n}(t)

(1)

where $\mathrm{IMF}_{i}(t)$ denotes the i-th Intrinsic Mode Function (i = 1, …, n) and $r_{n}(t)$ is the residual term. The first two IMF components capture short-term fluctuations and medium-term trends, respectively. These two components are selected as enhanced features and concatenated with the original training set features to construct an enhanced feature set.

Subsequently, GBR is applied to the training set’s enhanced feature set to rank feature importance. The feature selection rules derived from the training set are directly reused for the test set without recalculation.

(3): Normalization and Dataset Partitioning

To eliminate the impact of different dimensions on training, all input features are mapped to the interval [0, 1] using Min–Max normalization, which is expressed as follows.

x_{norm} = \frac{x - x_{min,train}}{x_{max,train} - x_{min,train}}

(2)

where $x$ represents the original value of the feature, $x_{min,train}$ and $x_{max,train}$ represent the minimum and maximum values of the feature in the training set, and $x_{norm}$ represents the normalized feature value after applying the Min–Max normalization.

The normalization parameters are calculated exclusively based on the training set. The test set is normalized using the same parameters to avoid information leakage. Finally, the normalized features of the training and test sets are divided into input sequences and target outputs using the sliding window method to implement multi-step prediction.
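The leakage-free normalization and sliding-window construction described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: the function names, the toy values, and the handling of constant features are assumptions made for the example.

```python
def fit_min_max(train_rows):
    """Return per-feature (min, max) computed from the training rows only."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, extrema):
    """Scale rows with previously fitted extrema (Equation (2))."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (lo, hi) in zip(row, extrema):
            span = hi - lo
            # Assumption: a constant training feature maps to 0.0 to avoid division by zero.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


def sliding_windows(features, target, input_len, horizon):
    """Pair each window of `input_len` feature rows with the next `horizon` targets."""
    samples = []
    last_start = len(features) - input_len - horizon
    for start in range(last_start + 1):
        x = features[start:start + input_len]
        y = target[start + input_len:start + input_len + horizon]
        samples.append((x, y))
    return samples


# Toy data: two features per time step (values are illustrative only).
train_raw = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0], [5.0, 20.0], [0.0, 10.0], [10.0, 30.0]]
test_raw = [[2.5, 15.0], [12.0, 35.0]]

extrema = fit_min_max(train_raw)
train_norm = apply_min_max(train_raw, extrema)
test_norm = apply_min_max(test_raw, extrema)  # reuses the training extrema

train_power = [row[0] for row in train_norm]  # first column stands in for PV power
windows = sliding_windows(train_norm, train_power, input_len=4, horizon=1)
```

Because the test set reuses the training extrema, a test value outside the training range maps outside [0, 1] (the second test row here); that is the expected cost of keeping test statistics out of the preprocessing.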

2.2. Proposed MSCT-RCM Model Architecture

The MSCT-RCM architecture comprises three core modules: a multi-scale convolution branch, a Transformer backbone network, and an RCM, as shown in Figure 2.

(1): Multi-scale convolution branch

Ultra-short-term PV power sequences contain dynamics at multiple temporal scales, from short-term fluctuations to medium- and long-term trends. To capture these scales simultaneously in the feature extraction stage, three parallel one-dimensional convolution branches are designed, with convolution kernel sizes k of 3, 5, and 7.

Let the input sequence be

X = [x_{1}, x_{2}, \dots, x_{L}] \in ℝ^{L \times F}

(3)

where L is the length of the input sequence and F is the feature dimension.

Each convolution branch is calculated as:

H_{k} = \mathrm{ReLU}(\mathrm{Conv1D}_{k}(X))

(4)

where $H_{k} \in ℝ^{L \times C}$ and C is the number of convolutional output channels per branch.

The outputs of the three branches are concatenated along the channel dimension to form a combined feature map:

H_{cnn} = \mathrm{Concat}(H_{3}, H_{5}, H_{7}) \in ℝ^{L \times 3C}

(5)

The multi-scale convolutional branch is designed to preserve local dynamic characteristics while providing the Transformer backbone network with a richer representation of contextual features.
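Equations (3) to (5) can be illustrated with a small dependency-free sketch. It assumes zero ("same") padding so that each branch keeps the sequence length L, which is consistent with $H_{k} \in ℝ^{L \times C}$ but is not stated explicitly in the text. The helper names and the random weights are hypothetical; a trained model would learn the filters.

```python
import random


def conv1d_same(x, weights, bias):
    """1-D convolution over time with zero padding so the output keeps length L.

    x:       L rows of F features
    weights: C filters, each k rows of F coefficients (k odd)
    bias:    C offsets
    """
    seq_len, n_feat = len(x), len(x[0])
    k = len(weights[0])
    pad = k // 2
    zero_row = [0.0] * n_feat
    padded = [zero_row] * pad + x + [zero_row] * pad
    out = []
    for t in range(seq_len):
        window = padded[t:t + k]
        row = []
        for filt, b in zip(weights, bias):
            acc = b
            for w_row, x_row in zip(filt, window):
                acc += sum(w * v for w, v in zip(w_row, x_row))
            row.append(max(acc, 0.0))  # ReLU, Equation (4)
        out.append(row)
    return out


def multi_scale_features(x, branches):
    """Concatenate branch outputs along the channel axis (Equation (5))."""
    outputs = [conv1d_same(x, w, b) for w, b in branches]
    return [sum((branch[t] for branch in outputs), []) for t in range(len(x))]


def random_branch(k, n_feat, n_channels, rng):
    """Hypothetical random initialisation; a trained model would learn these."""
    weights = [[[rng.uniform(-0.1, 0.1) for _ in range(n_feat)] for _ in range(k)]
               for _ in range(n_channels)]
    return weights, [0.0] * n_channels


rng = random.Random(0)
L, F, C = 4, 10, 32  # four input steps, ten features, 32 channels per branch
x = [[rng.random() for _ in range(F)] for _ in range(L)]
branches = [random_branch(k, F, C, rng) for k in (3, 5, 7)]
h_cnn = multi_scale_features(x, branches)  # L rows of 3C = 96 channels
```

Each time step ends up with 3C = 96 channels, 32 from each kernel size, so the three receptive fields are available side by side to the layers that follow.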

(2): Transformer Backbone Network

To capture the global dependencies of the input sequence and enhance the multi-step prediction capability, the proposed MSCT-RCM model incorporates the Transformer backbone network.

The Transformer models global dependencies. The concatenated features H_cnn and original input X are fused:

X_{fusion} = \mathrm{Concat}(X, H_{cnn})

(6)

Multi-scale CNN branches extract local temporal features but rely on the Transformer to model long-range dependencies across the entire sequence. The concatenation of multi-scale convolutional features with the original input is motivated by the need to retain fine-grained raw information (e.g., irradiance mutation points) while integrating structured local features. This dual-branch fusion strategy complements “raw data details” with “extracted structured patterns,” which has been validated as effective in renewable energy forecasting tasks for adapting to extreme weather conditions.

A linear projection maps X_fusion to the hidden dimension d_model = 96:

X_{proj} = X_{fusion} \cdot W_{proj} + b_{proj}

(7)

The linear projection layer serves two functions. First, it maps the concatenated high-dimensional features ($X_{fusion} \in ℝ^{L \times (F + 3C)}$, where F = 10 is the raw feature dimension and C = 32 is the number of output channels of a single CNN branch) to the unified hidden dimension of the Transformer (d_model = 96), providing a suitable dimension for efficient computation of the self-attention mechanism. Second, it enhances feature discriminability via learnable parameters to improve subsequent dependency modeling.
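As a worked check of the dimensions quoted above (F = 10, C = 32, d_model = 96), the width of the fused input and the size of the projection layer follow directly from Equations (6) and (7). The parameter count is derived here for illustration and is not a figure reported in the paper:

```latex
F + 3C = 10 + 3 \times 32 = 106, \qquad
W_{proj} \in \mathbb{R}^{106 \times 96}, \qquad
\#\text{params} = 106 \times 96 + 96 = 10272
```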

Learnable positional encoding PE is added:

X_{embed} = X_{proj} + PE

(8)

Positional encoding is critical for temporal sequence modeling. Given the dynamic temporal dependencies of PV power data, learnable positional encoding ($PE \in ℝ^{L \times d_{model}}$) is adopted instead of fixed sinusoidal encoding. This choice enables adaptive learning of PV-specific temporal patterns without presetting period parameters.

After embedding with positional encoding, the feature sequence passes through N = 3 Transformer encoder layers. Each layer consists of multi-head self-attention (MHA) and a feed-forward network (FFN), with residual connections and layer normalization to stabilize training. The key operations are summarized as follows:

\mathrm{MultiHead} = \mathrm{Attention}(X_{embed}) \cdot W_{O}

(9)

Z_{enc} = \mathrm{LayerNorm}(X_{embed} + \mathrm{MultiHead} + \mathrm{FFN}(X_{embed}))

(10)

Z = \mathrm{TransformerEncoder}(Z_{enc}; N = 3, h = 4, d_{model} = 96, dropout = 0.1)

(11)

where $W_{proj} \in ℝ^{(F + 3C) \times d_{model}}$ and $b_{proj} \in ℝ^{d_{model}}$ are projection parameters; $W_{O} \in ℝ^{d_{model} \times d_{model}}$ is the MHA fusion weight; h = 4 is the number of attention heads; and $Z[-1] \in ℝ^{d_{model}}$ is the global feature representation.

The last temporal step hidden state Z[−1] is extracted to generate baseline predictions, as it aggregates all preceding temporal information to capture the cumulative effect of long-range dependencies. This strategy avoids additional pooling operations, balancing modeling accuracy and computational efficiency for ultra-short-term forecasting. The baseline prediction is computed as:

\hat{y}_{baseline} = Z[-1] \cdot W_{fc} + b_{fc}

(12)

where $W_{fc} \in ℝ^{d_{model} \times H}$ and $b_{fc} \in ℝ^{H}$ are fully connected layer parameters, and H denotes the prediction horizon length.

(3): Residual Correction Module (RCM)

Multi-step forecasting often suffers from error accumulation, leading to degraded accuracy over longer horizons. To mitigate this, an RCM is integrated into the model. The RCM leverages historical residual information to dynamically refine the baseline predictions. Let the historical residual vector for a given sample be denoted as:

E_{hist} = [e_{1}, e_{2}, \dots, e_{H}] \in ℝ^{H}

(13)

where e_i represents the residual corresponding to prediction step i (i = 1, …, H). It is critical to note that $E_{hist}$ is not the actual error of the current test sample, which is unknown during forecasting. Instead, it is an error pattern retrieved from a pre-built error library, constructed offline using the predictions of the pre-trained baseline model on the training set. During inference, for an input corresponding to a forecast horizon H, the model retrieves a single historical residual sequence $E_{hist}$ of length H based on the similarity between the current input features and the records in the error library.

This single-sequence retrieval approach, as opposed to a step-wise one, is employed for two main reasons: (1) Temporal Consistency: It preserves the inherent temporal dependency and evolution pattern of errors across the forecast horizon, which is crucial for coherent correction in ultra-short-term forecasting. (2) Computational Efficiency: It requires only one similarity search and fusion operation per forecast instance, significantly reducing the inference latency.

The residual library is constructed offline using training set data, with the core process as follows:

Using the pre-trained CNN-Transformer baseline model from Phase I, multi-step predictions are performed on all training samples. The prediction residual for each sample is calculated as $e_{i} = y_{i} - \hat{y}_{baseline,i}$ (where $y_{i}$ is the true value and $\hat{y}_{baseline,i}$ is the baseline prediction).
For each training sample, the input feature vector and the corresponding residual sequence $E_{i} = [e_{i,1}, e_{i,2}, \dots, e_{i,H}]$ are extracted (where H is the number of prediction steps: H = 1 for 15 min prediction, H = 2 for 30 min prediction, H = 4 for 1 h prediction) to form a “feature vector–residual sequence” pair.
Min–Max normalization is applied to the feature vectors of all training samples (reusing extremum parameters from the training set preprocessing stage) to build a standardized feature library; residual sequences are directly stored in the residual library, whose scale matches the number of training samples.

Strict separation between the training and test sets is maintained during the residual library construction, with the library relying solely on training set data and the pre-trained baseline model (no test set information is involved) to ensure the fairness of the validation process.
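The offline library construction can be sketched as below. This is an illustrative outline only: `build_residual_library` and `persistence_baseline` are hypothetical names, and the toy persistence forecaster merely stands in for the pre-trained Phase I model, which is not reproduced here.

```python
def build_residual_library(samples, baseline_predict):
    """Pair each training feature vector with its baseline residual sequence.

    samples:          iterable of (feature_vector, y_true) drawn from the training set only
    baseline_predict: callable returning the H-step baseline forecast
                      (hypothetical stand-in for the pre-trained Phase I model)
    """
    library = []
    for features, y_true in samples:
        y_hat = baseline_predict(features)
        residual = [t - p for t, p in zip(y_true, y_hat)]  # e = y - y_hat
        library.append((list(features), residual))
    return library


def persistence_baseline(features, horizon=2):
    """Toy baseline: repeat the last feature value for every forecast step."""
    return [features[-1]] * horizon


# Two normalized training samples with H = 2 (illustrative values).
train_samples = [([0.2, 0.4], [0.5, 0.6]), ([0.6, 0.5], [0.4, 0.3])]
library = build_residual_library(train_samples, persistence_baseline)
```

The library holds one entry per training sample, which matches the statement that its scale equals the number of training samples.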

The residual retrieval in the test phase adopts a “K-Nearest Neighbors (KNN) + Cosine Similarity” strategy. The specific retrieval process is as follows:

Similarity Metric: Cosine similarity is used to calculate the feature similarity between test samples and training samples in the residual library, with the formula:

sim(X_{test}, X_{train,j}) = \frac{X_{test} \cdot X_{train,j}}{‖X_{test}‖ \cdot ‖X_{train,j}‖}

(14)

where $X_{test}$ is the standardized feature vector of the test sample, $X_{train,j}$ is the standardized feature vector of the j-th training sample in the residual library, ⋅ denotes the dot product, and ‖⋅‖ denotes the L2 norm.

Retrieval Algorithm and K-Value Optimization: The KNN algorithm is adopted, and the number of neighbors K is determined via cross-validation on the training set—four settings (K = 3, 5, 7, 9) are tested for prediction performance. Results show that the model achieves the minimum MAE with moderate computational complexity when K = 5, thus the optimal number of neighbors is determined as K = 5.

Residual Sequence Fusion: For the selected K = 5 residual sequences, a weighted average is used to generate the final historical residual sequence $E_{hist}$, with the weights given by the cosine similarity of the corresponding training samples (higher similarity corresponds to higher weight):

E_{hist} = \frac{\sum_{j=1}^{K} sim(X_{test}, X_{train,j}) E_{train,j}}{\sum_{j=1}^{K} sim(X_{test}, X_{train,j})}

(15)

where $E_{train,j}$ is the residual sequence of the j-th training sample in the library, and the weights are normalized to sum to 1 to ensure numerical stability.
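The retrieval and fusion steps in Equations (14) and (15) amount to a brute-force nearest-neighbour search followed by a similarity-weighted mean. The sketch below assumes non-negative feature vectors (as produced by Min–Max scaling), so the similarities are non-negative. The function names, the toy library, and the zero-similarity fallback are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Equation (14): dot product divided by the product of L2 norms."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def retrieve_residual(query, library, k=5):
    """Equation (15): similarity-weighted mean of the k most similar residual sequences.

    library: list of (feature_vector, residual_sequence) pairs built from training data.
    """
    scored = [(cosine_similarity(query, feats), resid) for feats, resid in library]
    top = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    horizon = len(top[0][1])
    if total == 0.0:
        return [0.0] * horizon  # assumption: fall back to "no correction"
    return [sum(sim * resid[h] for sim, resid in top) / total for h in range(horizon)]


# Toy library with H = 2; the numbers are illustrative, not taken from the paper.
library = [
    ([1.0, 0.0], [0.10, 0.20]),
    ([0.0, 1.0], [-0.30, -0.10]),
    ([1.0, 1.0], [0.00, 0.05]),
]
e_hist = retrieve_residual([1.0, 0.0], library, k=2)
```

With k = 2 the query selects the first and third entries (similarities 1 and 1/√2), and the orthogonal second entry contributes nothing. A linear scan like this is adequate for a library of a few thousand samples.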

The residual correction process is as follows:

1.: Residual encoding

The mapping of the historical residuals to a token is expressed as:

E_{embed} = \mathrm{Linear}(E_{hist}) ⊙ \mathrm{ReLU}

(16)

where $E_{hist} \in ℝ^{H}$ is the retrieved historical residual sequence, Linear is a linear projection layer (weight $W_{e} \in ℝ^{H \times d_{rcm}}$, bias $b_{e} \in ℝ^{d_{rcm}}$), d_rcm = 96 (consistent with the Transformer hidden dimension), $E_{embed} \in ℝ^{d_{rcm}}$ is the encoded residual embedding vector, and ⊙ denotes element-wise product.

2.: Sequence concatenation

The baseline prediction sequence $\hat{y}_{baseline} \in ℝ^{H}$ is mapped to a d_rcm-dimensional vector via a linear projection layer ($W_{y} \in ℝ^{H \times d_{rcm}}$) and then concatenated with the residual embedding vector $E_{embed}$ along the feature dimension. The baseline prediction token $\hat{y}_{baseline}$ is set as the first token, and the concatenation is expressed as:

S = \mathrm{Concat}(\hat{y}_{baseline}, E_{embed})

(17)

3.: Lightweight residual Transformer encoder

The interaction between the baseline prediction and historical residuals is captured through a single-layer Transformer Encoder, which includes MHA and FFN. The operations are expressed as follows:

S_{attn} = \mathrm{MHA}(S, S, S; mask)

(18)

S_{norm1} = \mathrm{LayerNorm}(S + S_{attn})

(19)

S_{ffn} = \mathrm{FFN}(S_{norm1}) = \mathrm{ReLU}(S_{norm1} \cdot W_{1} + b_{1}) \cdot W_{2} + b_{2}

(20)

S_{enc} = \mathrm{LayerNorm}(S_{norm1} + S_{ffn})

(21)

where mask is a diagonal mask to prevent cross-position information leakage; $W_{1} \in ℝ^{d_{rcm} \times 2d_{rcm}}$ and $W_{2} \in ℝ^{2d_{rcm} \times d_{rcm}}$ are the FFN layer weights; b₁ and b₂ are the corresponding biases; and $S_{enc} \in ℝ^{2 \times d_{rcm}}$ is the encoded sequence.

4.: Residual generation

The first token of $S_{enc}$ is mapped through a multi-layer perceptron (MLP) to obtain the residual correction:

Δy = \mathrm{MLP}(S_{enc}[0]) = \mathrm{Linear}(\mathrm{ReLU}(\mathrm{Linear}(S_{enc}[0])))

(22)

where the MLP includes two linear layers (intermediate dimension = 128), and $Δy \in ℝ^{H}$ has the same dimension as the baseline prediction sequence $\hat{y}_{baseline}$.

5.: Final prediction output

The residual correction is added to the baseline prediction to obtain the final prediction:

\hat{y}_{final} = \hat{y}_{baseline} + Δy

(23)

This design enables adaptive correction of systematic errors under complex weather conditions, significantly improving the accuracy and stability of multi-step prediction.

3. Model Architecture and Its Two-Stage Training Strategy

The proposed MSCT-RCM model consists of a multi-scale convolutional branch, Transformer backbone network, and RCM. The structure and core hyperparameters are shown in Table 1.

To ensure stable convergence and robust performance, a two-stage training strategy is proposed, separating baseline model pre-training from joint optimization with RCM.

3.1. Phase I: Baseline Pre-Training

The goal of Phase I is to train the CNN-Transformer backbone to generate reliable baseline predictions, laying a foundation for subsequent residual correction. The optimization objective minimizes the mean squared error (MSE) between baseline predictions and true values:

\mathcal{L}_{1} = \frac{1}{N} \sum_{i=1}^{N} ‖\hat{y}_{baseline,i} - y_{i}‖_{2}^{2}

(24)

where N is the number of training samples, $\hat{y}_{baseline,i}$ is the baseline prediction of the i-th sample, and $y_{i}$ is the corresponding true value. This phase focuses on optimizing feature extraction and global dependency modeling, avoiding interference from residual correction to prevent premature overfitting.

3.2. Phase II: Joint Training

In Phase II, the pre-trained backbone parameters are used as initial values, and the entire MSCT-RCM model (backbone + RCM) is jointly optimized. The objective combines the final prediction accuracy with a regularization term on the residual correction to avoid over-correction:

\mathcal{L}_{joint} = \frac{1}{N} \sum_{i=1}^{N} \left( ‖\hat{y}_{final,i} - y_{i}‖_{2}^{2} + λ ‖Δy_{i}‖_{2}^{2} \right)

(25)

where $\hat{y}_{final,i} = \hat{y}_{baseline,i} + Δy_{i}$ (Δy_i is the residual correction term for the i-th sample), and λ = 0.01 is the regularization coefficient. Joint training enables adaptive matching between the baseline model and the RCM, enhancing the model’s ability to utilize historical residual information for dynamic error correction.
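The two objectives can be compared on a toy example. The sketch below evaluates Equation (24) and Equation (25), assuming that the regularization term is averaged over samples together with the error term. The function names and the numbers are illustrative only.

```python
def squared_l2(vec):
    """Squared Euclidean norm of a vector."""
    return sum(v * v for v in vec)


def baseline_loss(y_base, y_true):
    """Equation (24): mean over samples of the squared L2 error."""
    n = len(y_true)
    return sum(squared_l2([p - t for p, t in zip(pred, true)])
               for pred, true in zip(y_base, y_true)) / n


def joint_loss(y_base, delta, y_true, lam=0.01):
    """Equation (25): final-prediction error plus a penalty on the correction size."""
    n = len(y_true)
    total = 0.0
    for base, corr, true in zip(y_base, delta, y_true):
        final = [b + c for b, c in zip(base, corr)]  # Equation (23)
        total += squared_l2([f - t for f, t in zip(final, true)]) + lam * squared_l2(corr)
    return total / n


# Two samples with H = 2 (illustrative values).
y_true = [[1.0, 2.0], [3.0, 4.0]]
y_base = [[0.5, 2.0], [3.0, 3.0]]
delta = [[0.5, 0.0], [0.0, 0.5]]

phase1 = baseline_loss(y_base, y_true)      # 0.625
phase2 = joint_loss(y_base, delta, y_true)  # 0.1275
```

In this example the corrections remove most of the baseline error, while the λ term adds only 0.0025 per sample. The penalty matters mainly when a correction is large relative to the error it removes.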

The algorithm flow of the two-stage training MSCT-RCM model is shown in Algorithm 1.

Algorithm 1: MSCT-RCM Two-Phase Training

Input: historical sequence X, ground truth y
Output: trained full model parameters θ_full

Initialize CNN-Transformer backbone θ_base
# Phase I: Baseline Pre-training
for epoch in range(E₁):
    for (x_batch, y_batch) in training_data:
        y_pred = BaselineModel(x_batch; θ_base)
        loss = MSE(y_pred, y_batch)
        Update θ_base via AdamW optimizer
    Save best θ_base
Initialize full model θ_full with θ_base
# Phase II: Joint Training
for epoch in range(E₂):
    for (x_batch, E_hist_batch, y_batch) in training_data:
        y_pred_batch = FullModel(x_batch, E_hist_batch; θ_full)
        Δy_batch = y_pred_batch − BaselineModel(x_batch; θ_base)
        loss = MSE(y_pred_batch, y_batch) + λ⋅MSE(Δy_batch, 0)
        Update θ_full via AdamW optimizer
    Save best θ_full
return θ_full

4. Case Study and Comparative Results

The proposed ultra-short-term PV power forecasting scheme based on the MSCT-RCM model is validated using two years of field data (2018 to 2019) from a large PV power station in northern China. Through experiments with different prediction horizons and input step sizes, comparative experiments with mainstream benchmark models, and ablation analysis, the prediction accuracy and robustness of the proposed scheme, as well as the effectiveness of the key modules in the MSCT-RCM model, are systematically evaluated.

4.1. Dataset and Sample Construction

The experimental data, sampled at 15 min intervals, includes PV power output and multi-dimensional meteorological variables: air temperature, cloud opacity, dew point temperature, direct normal irradiance (DNI), diffuse horizontal irradiance (DHI), global horizontal irradiance (GHI), global tilted irradiance (GTI), relative humidity, surface pressure, wind speed, and wind direction. To ensure validity, only data from the daily operational period (8:00 AM to 6:00 PM) were selected, excluding nighttime intervals with zero power output.

Considering the data sampling frequency, every four consecutive 15 min sampling points were used as inputs. Based on the two-year time span, the total number of samples was approximately 8030. The samples were divided into training and test sets in an 8:2 ratio, with the training set containing 6424 samples and the test set containing 1606 samples. The division maintained chronological continuity, with no overlap between the training and test sets, ensuring fair experimentation and reliable results.
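A chronological 8:2 split of this kind can be sketched in a few lines. The function name is hypothetical, and integer placeholders stand in for the time-ordered samples.

```python
def chronological_split(samples, train_parts=8, total_parts=10):
    """Split time-ordered samples without shuffling; the earliest block is used for training."""
    cut = len(samples) * train_parts // total_parts  # integer arithmetic avoids float rounding
    return samples[:cut], samples[cut:]


samples = list(range(8030))  # placeholders standing in for time-ordered samples
train, test = chronological_split(samples)
```

For 8030 samples this reproduces the 6424/1606 partition quoted above, with every training sample preceding every test sample.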

4.2. Experimental Environment

All experiments are conducted on a computer with an Intel Xeon 3.0 GHz CPU, 64 GB DDR4 memory, and an NVIDIA RTX 2080Ti GPU (11 GB video memory). The operating system is Windows 11, the deep learning framework is PyTorch 2.1, and the CUDA version is 12.1.

4.3. Evaluation Metrics

To quantify the power prediction performance of the proposed model from multiple perspectives, the following three evaluation metrics are used.

1. Mean Absolute Error (MAE)

\mathrm{MAE} = \frac{1}{N} \sum_{i = 1}^{N} \left| y_{i} - \hat{y}_{i} \right|

(26)

where y_{i} and \hat{y}_{i} denote the measured and predicted power of the i-th sample, and N is the number of samples. MAE measures the average absolute deviation of the predictions in kW and provides an intuitive assessment of the power prediction accuracy of the proposed model.

2. Root Mean Square Error (RMSE)

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(y_{i} - \hat{y}_{i})}^{2}}

(27)

RMSE imposes a more severe penalty on larger errors, making it suitable for evaluating the power prediction performance of the proposed model under extreme weather conditions.

3. Coefficient of Determination (R²)

R^{2} = 1 - \frac{\sum_{i = 1}^{N} {(y_{i} - \hat{y}_{i})}^{2}}{\sum_{i = 1}^{N} {(y_{i} - \bar{y})}^{2}}

(28)

where \bar{y} is the mean of the measured values and R² represents the proportion of variance explained by the model. The closer the value is to 1, the better the fit.
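The three metrics in Equations (26) to (28) can be written directly in code. The following is a minimal pure-Python sketch; the function names and the toy values are illustrative and are not taken from the paper’s experiments.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, Equation (26)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, Equation (27)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, Equation (28)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values in kW, chosen only to make the arithmetic easy to follow.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

toy_mae = mae(y_true, y_pred)    # absolute errors 2, 2, 3, 1
toy_rmse = rmse(y_true, y_pred)  # squared errors 4, 4, 9, 1
toy_r2 = r2(y_true, y_pred)      # 1 - 18 / 500
```

Because RMSE squares each error before averaging, the single 3 kW miss in the toy data pulls RMSE above MAE, which is the penalty on larger errors described in the text.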

4.4. Baseline Models and Hyperparameters

To comprehensively evaluate the ultra-short-term PV power forecasting performance of the proposed scheme, six mainstream benchmark deep learning models, namely LSTM [24], BiLSTM [32], GRU [25], standard Transformer [28], TCN-LSTM [39], and Transformer-BiLSTM [40], were selected for comparison. Each benchmark model was parameter-tuned to ensure fairness in the comparison. Their key architectures and hyperparameters are shown in Table 2.

5. Experimental Results and Analysis

(1): Impact of Feature Quantity

The influence of feature quantity on the forecasting performance of the proposed MSCT-RCM model was systematically evaluated, with the results illustrated in Figure 3. The model’s performance was assessed using the top-ranked features identified by the GBR algorithm, as detailed in Table 3.

As the number of input features increases from 5 to 10, a consistent improvement in forecasting accuracy is observed across all prediction horizons (15 min, 30 min, and 1 h). Specifically, both the MAE and RMSE exhibit a significant decreasing trend, while the R² value improves notably. This enhancement indicates that incorporating a richer set of highly correlated features strengthens the model’s representational capacity. However, beyond the threshold of 10 features, the model performance begins to degrade slightly. This decline is primarily attributed to the introduction of redundant or noisy information, which can impair the model’s generalization capability. Consequently, utilizing the top 10 features achieves an optimal balance between informational richness and model complexity, and this configuration is therefore adopted as the standard input for the proposed scheme.

(2): Sensitivity Analysis of Different Input Sequence Lengths

The performance impact of different input sequence lengths on the proposed MSCT-RCM model at three forecast horizons is summarized in Table 4 and Figure 4. The results show that increasing the input length improves accuracy up to about 12 steps, beyond which the gain is negligible. For 15 min forecasts, the MAE decreases from 2.75 kW (seq_len = 4) to 1.91 kW (seq_len = 12), and the R² increases from 0.9766 to 0.9944. Further extending the sequence to 16 or 24 steps only slightly improves performance (MAE of 1.89 kW and 1.88 kW, respectively). Similar trends are observed for the 30-minute and 1-hour horizons, suggesting that input lengths of 12–16 steps achieve a balance between capturing historical dependencies and computational efficiency.
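The diminishing return can be made concrete by computing the MAE reduction per additional input step from the 15 min column of Table 4. The sketch below is illustrative; the function name `marginal_gain_per_step` is hypothetical, and only the MAE values come from the paper.

```python
# 15 min MAE (kW) by input length, copied from Table 4.
mae_by_len = {4: 2.75, 8: 2.24, 12: 1.91, 16: 1.89, 24: 1.88}


def marginal_gain_per_step(table):
    """MAE reduction per additional input step between consecutive settings."""
    lengths = sorted(table)
    gains = {}
    for prev, curr in zip(lengths, lengths[1:]):
        gains[(prev, curr)] = (table[prev] - table[curr]) / (curr - prev)
    return gains


gains = marginal_gain_per_step(mae_by_len)
for (prev, curr), gain in gains.items():
    print(f"{prev:>2} -> {curr:>2} steps: {gain:.4f} kW per added step")
```

Going from 8 to 12 steps still buys more than 0.08 kW per added step, whereas going from 12 to 16 buys about 0.005 kW per step, which supports treating 12 steps as the practical knee of the curve.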

(3): Convergence Stability Analysis

To verify the convergence stability of the proposed two-stage training strategy, MSCT-RCM (two-stage training) and a single-stage end-to-end training model are trained repeatedly with 5 different random seeds (Seed = 123, 456, 789, 1011, 1213). The mean and standard deviation of MAE/RMSE for 15 min predictions are calculated to evaluate training stability, with statistical results presented in Table 5.

Based on the statistical results in Table 5, the standard deviation of MAE for two-stage training is only 0.032 kW and that of RMSE is 0.045 kW, both markedly lower than for single-stage training (MAE standard deviation 0.087 kW, RMSE standard deviation 0.102 kW). This indicates that the two-stage strategy reduces training variance and improves convergence stability. The superior stability stems from the Phase I independent pre-training of the CNN-Transformer baseline model, which ensures stable feature extraction and global dependency modeling capabilities before the RCM error correction mechanism is introduced. This avoids overfitting caused by premature intervention of the RCM and enables more accurate learning of the mapping relationship between historical residuals and prediction errors during Phase II joint optimization.
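The mean ± standard deviation summary used in Table 5 can be computed as sketched below. The per-seed values here are hypothetical placeholders, not the paper’s individual runs (which are not reported), and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize(per_seed_values):
    """Return (mean, sample standard deviation) over repeated runs."""
    return mean(per_seed_values), stdev(per_seed_values)


# Hypothetical per-seed 15 min MAE values in kW, one per random seed.
# These are placeholders for illustration, NOT results from the paper.
two_stage_mae = [1.88, 1.90, 1.91, 1.93, 1.95]
single_stage_mae = [2.58, 2.63, 2.68, 2.74, 2.79]

two_stage_mean, two_stage_std = summarize(two_stage_mae)
single_stage_mean, single_stage_std = summarize(single_stage_mae)

print(f"two-stage:    {two_stage_mean:.2f} ± {two_stage_std:.3f} kW")
print(f"single-stage: {single_stage_mean:.2f} ± {single_stage_std:.3f} kW")
```

With only five seeds, the choice between `stdev` and `pstdev` changes the reported spread by roughly 12%, so the estimator is worth stating when such a table is reproduced.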

(4): Comparative Analysis with Baseline Models

Table 6 and Figure 5 show the performance comparison of the MSCT-RCM model with benchmark models at different prediction horizons. Overall, the MSCT-RCM model outperforms all baseline models across all evaluation metrics, demonstrating exceptional accuracy in ultra-short-term power forecasting.

Specifically, for 15 min predictions, the proposed MSCT-RCM achieves an MAE of 1.91 kW and an RMSE of 2.73 kW, representing reductions of 46.3% and 45.2%, respectively, compared to the best-performing benchmark model Transformer-BiLSTM (MAE: 3.56 kW; RMSE: 4.98 kW). The corresponding R² value reaches 0.9944, indicating near-ideal goodness-of-fit. This advantage persists for 1 h predictions, where MSCT-RCM achieves MAE and RMSE values of 3.22 kW and 5.20 kW, corresponding to 47.4% and 37.7% improvements over Transformer-BiLSTM (6.12 kW and 8.35 kW), while maintaining a substantially higher R² value of 0.9799 compared to all other models.
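The percentage reductions quoted above follow from the Table 6 values as (baseline − proposed) / baseline. A short check, with the helper name `pct_reduction` chosen for illustration:

```python
def pct_reduction(baseline, proposed):
    """Relative error reduction of the proposed model versus a baseline, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# (Transformer-BiLSTM, MSCT-RCM) error values in kW, copied from Table 6.
cases = {
    "15 min MAE": (3.56, 1.91),
    "15 min RMSE": (4.98, 2.73),
    "1 h MAE": (6.12, 3.22),
    "1 h RMSE": (8.35, 5.20),
}

reductions = {
    name: round(pct_reduction(baseline, proposed), 1)
    for name, (baseline, proposed) in cases.items()
}
for name, value in reductions.items():
    print(f"{name}: {value}% lower than Transformer-BiLSTM")
```

The same function applies to any other row of Table 6 when a comparison against a different benchmark is needed.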

This significant performance improvement is attributed to the synergistic effect of the MSCT-RCM’s core modules: multi-scale CNN branches effectively capture local short-term fluctuations and medium-term trends, enhancing feature representation; the Transformer backbone network models global temporal dependencies to ensure stable predictions under complex dynamic conditions; and the RCM dynamically adjusts predictions using historical residual patterns, mitigating error accumulation in multi-step forecasting.

(5): Cross-Dataset Generalization Analysis

To verify the model’s generalization ability, three additional real operational datasets covering different climatic zones and PV technologies are introduced (key information: Southern Subtropical Monsoon Climate (SSMC)—Polycrystalline Silicon 50 MW, Northwest Arid and Semi-Arid Climate (NASAC)—Thin-Film PV 80 MW, Southwest Plateau Mountain Climate (SPMC)—Monocrystalline Silicon + Tracking System 120 MW).

As summarized in Table 7, the proposed MSCT-RCM model demonstrates consistently superior and stable performance across all three datasets. Specifically, for 15 min forecasts, its MAE values are 1.96 kW (SSMC), 2.27 kW (NASAC), and 2.08 kW (SPMC), all below 2.3 kW; RMSE values are 2.81 kW, 3.31 kW, and 3.02 kW, respectively, remaining under 3.5 kW; and R² values exceed 0.985, ranging from 0.9895 to 0.9938. In contrast, every benchmark model exhibits significant performance degradation on unfamiliar datasets, particularly under the Northwest Arid conditions with high irradiance fluctuations. For instance, LSTM’s MAE increases by approximately 28.7% from 7.52 kW (SSMC) to 9.68 kW (NASAC), whereas MSCT-RCM shows only a minimal deviation of 15.8% (from 1.96 kW to 2.27 kW). This pronounced disparity underscores the proposed model’s stronger generalization capacity, which stems from its ability to capture universal spatio-temporal patterns in PV generation that are less sensitive to localized climate or technology specifics.
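The LSTM and MSCT-RCM figures above can be extended to every model in Table 7 by computing the relative MAE increase from SSMC to NASAC. The sketch below uses the table values as given; the names `pct_increase` and `most_stable` are illustrative.

```python
# 15 min MAE in kW from Table 7, as (SSMC, NASAC) pairs.
mae_ssmc_nasac = {
    "LSTM": (7.52, 9.68),
    "BiLSTM": (5.45, 6.98),
    "GRU": (6.95, 8.70),
    "Transformer": (4.85, 6.40),
    "TCN-LSTM": (4.02, 5.50),
    "Transformer-BiLSTM": (3.68, 5.10),
    "MSCT-RCM": (1.96, 2.27),
}


def pct_increase(reference, value):
    """Relative increase of value over reference, in percent."""
    return 100.0 * (value - reference) / reference


increase = {
    model: pct_increase(ssmc, nasac)
    for model, (ssmc, nasac) in mae_ssmc_nasac.items()
}
most_stable = min(increase, key=increase.get)

for model, value in sorted(increase.items(), key=lambda item: item[1]):
    print(f"{model:<20s} {value:5.1f}%")
```

Computed this way, every benchmark model shows an increase of at least 25% between the two sites, while MSCT-RCM shows the smallest relative change.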

(6): Prediction Results at Different Time Periods

Figure 6 illustrates representative prediction outcomes of the proposed MSCT-RCM model across three forecasting horizons: 15 min, 30 min, and 1 h. The results indicate a strong alignment between the predicted and actual PV power generation sequences. For the 15-minute-ahead predictions, the forecasted values closely align with the actual measurements, reflecting the model’s high fidelity in capturing short-term local variations. As the prediction horizon extends to 30 min and 1 h, minor deviations emerge at certain peaks and troughs; however, the overall trend remains consistent, and the prediction errors are notably lower than those of the baseline models. These observations validate that the RCM effectively incorporates historical error information to refine predictions, thereby enhancing dynamic responsiveness and reducing temporal lag or systematic bias under rapidly fluctuating weather patterns. Consequently, the proposed model achieves highly accurate and robust ultra-short-term PV power forecasting across diverse meteorological conditions. A qualitative analysis of the remaining errors reveals that they are primarily concentrated during periods of abrupt weather transitions, such as sudden cloud cover changes. These residuals reflect the inherent challenge of “concept drift” in electrical data streams, where the underlying statistical properties of the generation modalities shift due to non-stationary meteorological environments [41].

(7): Ablation Study

Ablation experiments were conducted to quantify the individual contributions of the multi-scale CNN branch, Transformer backbone, and Residual Correction Module (RCM) to the overall model performance for the 15 min prediction horizon. The results are summarized in Table 8.

Single-Module Configurations (Without Dual-Module Synergy): When employing only a single module—the pure Transformer (without CNN and RCM) or the pure CNN (without Transformer and RCM)—the model exhibits high prediction errors, with MAE values of 4.37 kW and 4.31 kW, respectively, and R² values of 0.9708 and 0.9715. This indicates that neither module alone can effectively capture both local fine-grained features and global long-range temporal dependencies, leading to significant performance degradation.

Without Multi-Scale CNN Branch (w/o CNN): Removing the multi-scale CNN branch results in a notable performance decline. The MAE increases from 1.91 kW (full model) to 3.07 kW, RMSE rises from 2.73 kW to 4.22 kW, and R² decreases to 0.9853. These results verify the critical role of the multi-scale CNN in extracting local temporal features at different resolutions, which provides enriched contextual inputs for the Transformer backbone and enhances feature representation capability.

Without Transformer Backbone Network (w/o Transformer): Omitting the Transformer backbone leads to an MAE of 2.89 kW, RMSE of 4.08 kW, and R² of 0.9867. This demonstrates the Transformer’s essential function in modeling global temporal dependencies. Without it, the model fails to capture cross-time-step feature correlations, particularly degrading the accuracy of predictions that rely on long-range dependencies.

Without Residual Correction Module (w/o RCM): Excluding the RCM increases the MAE from 1.91 kW to 2.41 kW, RMSE from 2.73 kW to 3.49 kW, and reduces R² to 0.9902. This confirms the importance of the RCM in dynamically correcting baseline prediction errors by leveraging historical residual patterns, thereby mitigating error accumulation in multi-step forecasting and enhancing model stability.

Full Model (MSCT-RCM): The complete model achieves optimal performance, with an MAE of 1.91 kW, RMSE of 2.73 kW, and R² of 0.9944. This validates the synergistic effect of the three components: the multi-scale CNN provides local features, the Transformer models global dependencies, and the RCM corrects systematic errors. The integration ensures a balance between accuracy and robustness, with the full model significantly outperforming all ablated configurations.

To further validate the rationality of key design choices in the Transformer backbone, additional ablation experiments are conducted, with results shown in Table 9. The “baseline scheme” (concatenation of raw input and CNN features + learnable positional encoding + last time-step hidden state) outperforms alternative designs in all prediction horizons. Specifically:

Removing the concatenation strategy (using only CNN features) increases 15 min MAE by 4.7%, as fine-grained raw information is lost.

Replacing learnable positional encoding with sinusoidal encoding degrades 30 min prediction MAE by 3.9%, reflecting poor adaptation to dynamic temporal dependencies.

Using average pooling instead of the last time-step hidden state increases 15 min MAE by 5.8%, as cumulative long-range dependencies are not fully captured.

To further verify the effectiveness of the KNN retrieval strategy in the RCM, an additional ablation experiment is conducted to compare the impact of different retrieval strategies on model performance. The results are shown in Table 10, where K denotes the number of neighbors in the KNN algorithm. Experimental results show that the proposed “K = 5 + Cosine Similarity” strategy achieves the lowest MAE (1.91 kW) for 15 min prediction, which is 21.4% lower than random residual selection (MAE = 2.43 kW). This confirms that the designed retrieval mechanism can effectively retrieve historical residual patterns correlated with the current input features, thereby improving the correction accuracy of the RCM.
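The retrieval step compared in Table 10 can be illustrated with a small top-K cosine-similarity lookup. This is a minimal sketch under stated assumptions: the memory layout (a list of feature vector and residual pairs), the function names, and the toy vectors are hypothetical, and how the retrieved residuals are then consumed by the RCM is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_a * norm_b)


def retrieve_residuals(query, memory, k=5):
    """Return the residuals of the k stored samples most similar to the query.

    memory is a list of (feature_vector, residual) pairs from past predictions.
    """
    ranked = sorted(
        memory,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    return [residual for _, residual in ranked[:k]]


# Hypothetical memory bank: 2-D features with their past residuals in kW.
memory = [
    ([1.0, 0.0], 0.5),
    ([0.9, 0.1], 0.4),
    ([0.0, 1.0], -2.0),
    ([0.7, 0.7], 0.1),
    ([-1.0, 0.0], 3.0),
    ([0.8, 0.2], 0.3),
]
query = [1.0, 0.05]

top3 = retrieve_residuals(query, memory, k=3)
top5 = retrieve_residuals(query, memory, k=5)
```

Cosine similarity compares the direction of feature vectors and ignores their magnitude, which is one plausible reason it could behave differently from Euclidean distance in Table 10, although the paper does not state the cause.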

6. Conclusions

This paper presents the MSCT-RCM model, a novel deep learning framework for ultra-short-term photovoltaic power forecasting. The proposed architecture seamlessly integrates multi-scale convolutional networks for local feature extraction, a Transformer backbone for global temporal dependency modeling, and a residual correction module for error compensation. Experimental results demonstrate that the model significantly outperforms mainstream benchmarks across multiple forecasting horizons. Ablation studies confirm the synergistic effects of all components, with the complete architecture achieving optimal performance. The proposed scheme demonstrates strong potential for practical applications in smart grid scheduling and energy management systems. Future work will focus on enhancing the model’s generalization capability under diverse meteorological conditions and optimizing its computational efficiency for real-time deployment.

Author Contributions

Conceptualization, X.Y.; Methodology, X.Y. and A.L.; Software, J.Z., Z.L. and H.L.; Validation, J.Z., B.C. and H.L.; Formal analysis, X.Y., J.Z. and S.L.; Investigation, X.Y. and J.Y. (Jingyao Yang); Resources, J.Y. (Jun Yin), A.L. and J.Y. (Jingyao Yang); Data curation, Z.L.; Writing—original draft, B.C.; Writing—review & editing, X.Y. and S.L.; Visualization, J.Y. (Jun Yin); Supervision, J.Y. (Jun Yin), A.L. and J.Y. (Jingyao Yang). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant No. 51877064) and in part by the Anhui Provincial Key Research and Development Program (Grant No. 1804a09020092). The Article Processing Charge (APC) was funded by these grants.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Authors Jun Yin and Anping Li were employed by the company Huaihe Energy and Power Group Co., Ltd. Authors Xiao Ye, Jiajia Zhang and Zhibo Liu were employed by the company China Energy Engineering Group Anhui Electric Power Design Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

PV	Photovoltaic
CNN	Convolutional Neural Network
RCM	Residual Correction Module
EMD	Empirical Mode Decomposition
GBR	Gradient Boosting Regression
LSTM	Long Short-Term Memory
Transformer	Transformer Architecture
KNN	K-Nearest Neighbors

References

1. Tian, J.; Ooka, R.; Lee, D. Multi-scale solar radiation and photovoltaic power forecasting with machine learning algorithms in urban environment: A state-of-the-art review. J. Clean. Prod. 2023, 426, 139040.
2. Santana, E.J.; Silva, R.P.; Zarpelao, B.B.; Barbon Junior, S. Detecting and Mitigating Adversarial Examples in Regression Tasks: A Photovoltaic Power Generation Forecasting Case Study. Information 2021, 12, 394.
3. Lara-Benítez, P.; Carranza-Garcia, M.; Luna-Romera, J.M.; Riquelme, J.C. Temporal Convolutional Networks Applied to Energy-Related Time Series Forecasting. Appl. Sci. 2020, 10, 2322.
4. Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191.
5. Sarmas, E.; Spiliotis, E.; Stamatopoulos, E.; Marinakis, V.; Doukas, H. Short-term photovoltaic power forecasting using meta-learning and numerical weather prediction independent Long Short-Term Memory models. Renew. Energy 2023, 216, 118997.
6. Luo, P.; Li, C.; Kang, D.; Zhang, F.; Lv, Q. PMWC: A hybrid framework based causal inference and multi-scale feature fusion for day-ahead PV power forecasting. Renew. Energy 2026, 257, 124753.
7. Gupta, M.; Arya, A.; Varshney, U.; Mittal, J.; Tomar, A. A review of PV power forecasting using machine learning techniques. Prog. Eng. Sci. 2025, 2, 100058.
8. Dong, Z.; Tian, Z.; Lv, S. A short-term power load forecasting system based on data decomposition, deep learning and weighted linear error correction with feedback mechanism. Appl. Soft Comput. 2024, 162, 111863.
9. Libra, M.; Kozelka, M.; Šafránková, J.; Belza, R.; Poulek, V.; Beránek, V.; Sedlacek, J.; Zholobov, M.; Subrt, T.; Severová, L. Agrivoltaics: Dual usage of agricultural land for sustainable development. Int. Agrophys. 2024, 38, 121–126.
10. Luan, J.; Li, Q.; Qiu, Y.; Liu, W. Ensemble learning unlocking point load forecasting accuracy: A novel framework based on two-stage data preprocessing and improved multi-objective optimisation strategy. Comput. Electr. Eng. 2025, 124, 110282.
11. Akhter, M.N.; Mekhilef, S.; Mokhlis, H.; Almohaimeed, Z.M.; Muhammad, M.A.; Khairuddin, A.S.M. An Hour-Ahead PV Power Forecasting Method Based on an RNN-LSTM Model for Three Different PV Plants. Energies 2022, 15, 2243.
12. Bae, D.-J.; Kwon, B.-S.; Song, K.-B. XGBoost-Based Day-Ahead Load Forecasting Algorithm Considering Behind-the-Meter Solar PV Generation. Energies 2021, 15, 128.
13. Scott, C.; Ahsan, M.; Albarbar, A. Machine learning for forecasting a photovoltaic (PV) generation system. Energy 2023, 278, 127807.
14. Wang, B.; Wang, J. Energy futures and spots prices forecasting by hybrid SW-GRU with EMD and error evaluation. Energy Econ. 2020, 90, 104827.
15. Li, G.; Wei, X.; Yang, H. Decomposition integration and error correction method for photovoltaic power forecasting. Measurement 2023, 208, 112462.
16. Yu, M.; Niu, D.; Wang, K.; Du, R.; Yu, X.; Sun, L.; Wang, F. Short-term photovoltaic power point-interval forecasting based on double-layer decomposition and WOA-BiLSTM-Attention and considering weather classification. Energy 2023, 275, 127348.
17. Wang, L.; Mao, M.; Xie, J.; Liao, Z.; Zhang, H.; Li, H. Accurate solar PV power prediction interval method based on frequency-domain decomposition and LSTM model. Energy 2023, 262, 125592.
18. Wang, H.; Lei, Z.; Zhang, X.; Zhou, B.; Peng, J. A review of deep learning for renewable energy forecasting. Energy Convers. Manag. 2019, 198, 111799.
19. Sun, Y.; Venugopal, V.; Brandt, A.R. Short-term solar power forecast with deep learning: Exploring optimal input and output configuration. Sol. Energy 2019, 188, 730–741.
20. Nguyen-Duc, T.; Do-Dinh, H.; Fujita, G.; Tran-Thanh, S. Multi 2D-CNN-based model for short-term PV power forecast embedded with Laplacian Attention. Energy Rep. 2024, 12, 2086–2096.
21. Abou Houran, M.; Bukhari, S.M.S.; Zafar, M.H.; Mansoor, M.; Chen, W. COA-CNN-LSTM: Coati optimization algorithm-based hybrid deep learning model for PV/wind power forecasting in smart grid applications. Appl. Energy 2023, 349, 121638.
22. Agga, A.; Abbou, A.; Labbadi, M.; El Houm, Y. Short-term self consumption PV plant power production forecasts based on hybrid CNN-LSTM, ConvLSTM models. Renew. Energy 2021, 177, 101–112.
23. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. In Cognitive Modeling; MIT Press: Cambridge, MA, USA, 1986.
24. Mellit, A.; Pavan, A.M.; Lughi, V. Deep learning neural networks for short-term photovoltaic power forecasting. Renew. Energy 2021, 172, 276–288.
25. Dai, Y.; Wang, Y.; Leng, M.; Yang, X.; Zhou, Q. LOWESS smoothing and Random Forest based GRU model: A short-term photovoltaic power generation forecasting method. Energy 2022, 256, 124661.
26. Negash, T.; Weldemikael, N.; Ghebregziabiher, M.; Tedla, Y.; István, S.; István, F. Addressing photovoltaic (PV) forecasting challenges: Satellite-driven data models for predicting actual PV generation using hybrid (LSTM-GRU) model. Energy Rep. 2025, 14, 2141–2156.
27. Ait Mansour, A.; Tilioua, A.; Touzani, M. Bi-LSTM, GRU and 1D-CNN models for short-term photovoltaic panel efficiency forecasting case amorphous silicon grid-connected PV system. Results Eng. 2024, 21, 101886.
28. Wu, Q.; Han, C. A novel framework for ultra-short-term photovoltaic power forecasting based on improved transformer and weather pattern recognition. Sol. Energy 2026, 304, 114183.
29. Ma, Y.; Li, F.; Zhang, H.; Fu, G.; Yi, M. Two-stage photovoltaic power forecasting method with an optimized transformer. Glob. Energy Interconnect. 2024, 7, 812–824.
30. Zhang, Z.; Huang, X.; Li, C.; Cheng, F.; Tai, Y. CRAformer: A cross-residual attention transformer for solar irradiation multistep forecasting. Energy 2025, 320, 135214.
31. Cheikh, G.; Ammar, B.; Benhadj, N.; Bentounes, K.A.; Bentounes, H.A.; Ksibi, A. Transformer-based deep neural networks for short-term solar power prediction in the Middle East and North Africa regions. Eng. Appl. Artif. Intell. 2025, 160, 111848.
32. Liu, M.; Rao, S.; Huang, M.; Deng, S. Short-term photovoltaic power forecasting based on improved transformer with feature enhancement. Sustain. Energy Grids Netw. 2025, 43, 101759.
33. Yuan, L.; Wang, X.; Sun, Y.; Liu, X.; Dong, Z.Y. Multistep photovoltaic power forecasting based on multi-timescale fluctuation aggregation attention mechanism and contrastive learning. Int. J. Electr. Power Energy Syst. 2025, 164, 110389.
34. Cui, S.; Lyu, S.; Ma, Y.; Wang, K. Improved informer PV power short-term prediction model based on weather typing and AHA-VMD-MPE. Energy 2024, 307, 132766.
35. Bai, M.; Zhou, G.; Yao, P.; Dong, F.; Chen, Y.; Zhou, Z.; Yang, X.; Liu, J.; Yu, D. Deep multi-attribute spatial–temporal graph convolutional recurrent neural network-based multivariable spatial–temporal information fusion for short-term probabilistic forecast of multi-site photovoltaic power. Expert Syst. Appl. 2025, 279, 127458.
36. Chen, S.; Wan, H.; Peng, B.; Quan, R.; Chang, Y.; Derigent, W. Accurate multi-step wind and solar power forecasting based on multi-scale convolutional Kolmogorov-Arnold network and improved Lemming-optimized attention fusion. Eng. Appl. Artif. Intell. 2026, 163, 112832.
37. Wu, X.; Wu, R.; Wu, S.; Li, W.; Chen, H.; Tong, N. Short-term PV prediction using multiperiod similar days and TimeGAN-inception. Int. J. Electr. Power Energy Syst. 2025, 172, 111287.
38. Li, Z.; Ye, L.; Song, X.; Luo, Y.; Pei, M.; Wang, K.; Yu, Y.; Tang, Y. Heterogeneous Spatiotemporal Graph Convolution Network for Multi-Modal Wind-PV Power Collaborative Prediction. IEEE Trans. Power Syst. 2024, 39, 5591–5608.
39. Zhan, Y.; Wang, X.; Xu, Y.; Li, W. A hybrid TCN-LSTM-attention framework for multi-scenario short-term photovoltaic power forecasting incorporating physics-informed neural network strategy. Energy 2026, 344, 139968.
40. Wang, T.; Xu, Y.; Qin, Y.; Wang, X.; Zheng, F.; Li, W. Short-term PV forecasting of multiple scenarios based on multi-dimensional clustering and hybrid transformer-BiLSTM with ECPO. Energy 2025, 334, 137654.
41. Ismail, I.; Azeem, A.; Shamim, S.; Shamim, S. Comparative analysis of concept drift detection algorithms for electrical streams across different generation modalities. In Sustainable and Eco-Friendly Process Management; Apple Academic Press: Boca Raton, FL, USA, 2025; pp. 307–316.

Figure 1. The overall workflow of ultra-short-term photovoltaic power prediction scheme.

Figure 2. MSCT-RCM architecture.

Figure 3. Prediction performance under varying feature counts.

Figure 4. Prediction performance under varying input sequence lengths.

Figure 5. Performance trends across different time horizons.

Figure 6. Predicted versus actual photovoltaic power under different forecasting horizons.

Table 1. Architecture and Key Hyperparameters of Proposed MSCT-RCM Model.

Module	Configuration
Multi-Scale CNN Branch	Kernel sizes: {3, 5, 7}; Output channels per branch: 32; Activation: ReLU
Transformer Backbone	Hidden size d_model: 96; Number of heads: 4; Layers: 3; Dropout: 0.1
Residual Correction Module	Lightweight Transformer with 1 layer; Hidden size: 96; Dropout: 0.1
Optimizer	AdamW; Initial learning rate: 0.001; Weight decay: 1 × 10⁻⁵
Learning Rate Schedule	Cosine Annealing; Warm-up: 6 epochs
Training Epochs	Phase I: 40 (main pre-training); Phase II: 80 (joint training)
Batch Size	64
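The learning rate schedule listed in Table 1 (initial rate 0.001, cosine annealing, 6 warm-up epochs) can be sketched as a plain function of the epoch index. The linear form of the warm-up, the minimum rate of zero, and the application of one schedule to the 40 Phase I epochs are assumptions of this sketch; the table does not specify them, nor whether the schedule restarts in Phase II.

```python
import math


def learning_rate(epoch, total_epochs, base_lr=1e-3, warmup_epochs=6, min_lr=0.0):
    """Linear warm-up followed by cosine annealing (epoch is 0-indexed).

    The linear warm-up shape and min_lr = 0 are assumptions of this sketch.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Schedule over the 40 Phase I epochs listed in Table 1.
phase1 = [learning_rate(epoch, total_epochs=40) for epoch in range(40)]

for epoch in (0, 5, 6, 20, 39):
    print(f"epoch {epoch:>2}: lr = {phase1[epoch]:.6f}")
```

In a PyTorch training loop the same shape could be obtained by passing a function like this to a lambda-based scheduler, but that wiring is left out here because the paper does not describe it.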

Table 2. Baseline Models and Hyperparameters.

Model	Key Architecture	Hyperparameters
LSTM	Two-layer LSTM; Hidden size: 100	Learning rate: 0.001; Epochs: 200; Batch: 64
BiLSTM	Three-layer bidirectional LSTM; Hidden: 32; Dropout: 0.1188	Learning rate: 0.001; Epochs: 200; Batch: 64
GRU	Single GRU layer; Hidden size: 100; Output: Dense with Sigmoid	Learning rate: 0.001; Epochs: 200; Batch: 64
Transformer	Four-layer encoder; Hidden size: 256; Heads: 8; Dropout: 0.1	Learning rate: 0.001; Epochs: 200; Batch: 64
TCN-LSTM	8-layer TCN; 2-layer LSTM; Multi-head self-attention (4 heads); TCN for global/local feature extraction, LSTM for temporal dependency modeling	Learning rate: 0.001; Epochs: 200; Batch: 64; Activation: GELU; LSTM hidden size: 64; Attention head dimension: 32
Transformer-BiLSTM	4-layer Transformer encoder; 2-layer bidirectional LSTM; Linear transition layer; Dense output layer	Learning rate: 0.001; Epochs: 200; Batch: 64; Transformer hidden size: 128; Attention heads: 4; FFN dimension: 256; BiLSTM hidden size: 64; GELU

Table 3. Top-10 Features Selected by GBR and Their Importance Scores.

Feature Name	Abbreviation	GBR Importance Score
Global Tilted Irradiance	GTI	0.186
Direct Normal Irradiance	DNI	0.162
Historical PV Power	—	0.145
Air Temperature	—	0.113
Cloud Opacity	—	0.098
Relative Humidity	—	0.087
Global Horizontal Irradiance	GHI	0.074
Dew Point Temperature	—	0.062
Wind Speed	—	0.051

Table 4. Performance with Different Input Sequence Lengths.

seq_len	15 Min (MAE/RMSE/R²)	30 Min (MAE/RMSE/R²)	1 h (MAE/RMSE/R²)
4	2.75/4.46/0.9766	3.02/5.23/0.9660	4.14/6.93/0.9545
8	2.24/3.21/0.9895	2.69/4.30/0.9770	3.93/6.14/0.9653
12	1.91/2.73/0.9944	2.56/3.94/0.9885	3.22/5.20/0.9799
16	1.89/2.73/0.9946	2.56/3.91/0.9785	3.20/5.16/0.9881
24	1.88/2.70/0.9946	2.53/3.91/0.9788	3.17/5.16/0.9881

Table 5. Performance Variance Under Different Random Seeds.

Training Strategy	MAE (Mean ± Std, kW)	RMSE (Mean ± Std, kW)	R² (Mean ± Std)
Two-stage Training	1.91 ± 0.032	2.73 ± 0.045	0.9944 ± 0.0008
Single-stage Training	2.68 ± 0.087	3.85 ± 0.102	0.9867 ± 0.0021

Table 6. Comparative Performance of Benchmark Models at Different Forecasting Horizons.

Model	15 Min (MAE/RMSE/R²)	30 Min (MAE/RMSE/R²)	1 h (MAE/RMSE/R²)
LSTM	7.16/9.34/0.9352	8.29/10.82/0.9131	9.27/12.07/0.8919
BiLSTM	5.17/6.85/0.9652	6.44/8.54/0.9459	7.35/9.80/0.9288
GRU	6.82/8.57/0.9455	7.97/10.02/0.9255	8.82/11.22/0.9065
Transformer	4.71/6.26/0.9709	5.60/7.72/0.9557	7.66/10.02/0.9255
TCN-LSTM	3.89/5.37/0.9789	4.72/6.68/0.9675	6.54/8.83/0.9398
Transformer-BiLSTM	3.56/4.98/0.9815	4.35/6.21/0.9712	6.12/8.35/0.9452
MSCT-RCM	1.91/2.73/0.9944	2.56/3.94/0.9885	3.22/5.20/0.9799

Table 7. Cross-Dataset Generalization Performance Comparison (15-minute-ahead forecast).

Model	SSMC (MAE/RMSE/R²)	NASAC (MAE/RMSE/R²)	SPMC (MAE/RMSE/R²)
LSTM	7.52/9.68/0.9285	9.68/12.25/0.8975	7.95/10.32/0.9172
BiLSTM	5.45/6.98/0.9630	6.98/9.05/0.9338	5.88/7.58/0.9485
GRU	6.95/8.70/0.9420	8.70/10.80/0.9145	7.35/9.25/0.9260
Transformer	4.85/6.40/0.9680	6.40/8.35/0.9440	5.25/6.95/0.9565
TCN-LSTM	4.02/5.50/0.9765	5.50/7.35/0.9588	4.38/6.02/0.9668
Transformer-BiLSTM	3.68/5.10/0.9790	5.10/6.85/0.9650	3.98/5.55/0.9718
MSCT-RCM	1.96/2.81/0.9938	2.27/3.31/0.9895	2.08/3.02/0.9912

Table 8. Ablation Study Results.

Configuration	MAE (kW)	RMSE (kW)	R²
w/o CNN + RCM (pure Transformer)	4.37	5.96	0.9708
w/o Transformer + RCM (pure CNN)	4.31	5.88	0.9715
w/o CNN (Transformer + RCM)	3.07	4.22	0.9853
w/o Transformer (CNN + RCM)	2.89	4.08	0.9867
w/o RCM (CNN + Transformer only)	2.41	3.49	0.9902
Full MSCT-RCM	1.91	2.73	0.9944

Table 9. Ablation Experiment Results of Transformer Architectural Design Choices.

Design Scheme	15 Min (MAE/RMSE/R²)	30 Min (MAE/RMSE/R²)	1 h (MAE/RMSE/R²)
Baseline Scheme (Concatenation + Learnable PE + Last Time Step)	1.91/2.73/0.9944	2.56/3.94/0.9885	3.22/5.20/0.9799
No Concatenation (CNN Features Only)	2.00/2.89/0.9931	2.68/4.11/0.9867	3.36/5.42/0.9772
Sinusoidal PE (Replacing Learnable PE)	1.98/2.84/0.9935	2.66/4.05/0.9876	3.33/5.38/0.9778
Average Pooling (Replacing Last Time Step)	2.02/2.91/0.9929	2.71/4.15/0.9862	3.39/5.45/0.9769

Table 10. Ablation Experiment Results of RCM Retrieval Strategies.

Retrieval Strategy	MAE (kW)	RMSE (kW)	R²
Random Residual Selection (No Retrieval)	2.43	3.57	0.9872
K = 3 + Cosine Similarity	1.98	2.89	0.9931
K = 5 + Cosine Similarity (Proposed Method)	1.91	2.73	0.9944
K = 7 + Cosine Similarity	1.93	2.78	0.9940
K = 9 + Cosine Similarity	3.24	4.44	0.9935
K = 5 + Euclidean Distance	2.05	2.97	0.9925

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
