Article

Short-Term Power Load Forecasting Based on CEEMDAN-WT-VMD Joint Denoising and BiTCN-BiGRU-Attention

1 School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 School of Electronic and Electrical Engineering, Minnan University of Science and Technology, Quanzhou 362700, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(9), 1871; https://doi.org/10.3390/electronics14091871
Submission received: 1 April 2025 / Revised: 29 April 2025 / Accepted: 30 April 2025 / Published: 4 May 2025

Abstract

Short-term power load forecasting is crucial for safe grid operation. To address the insufficiency of traditional decomposition methods in suppressing high-frequency noise within multi-source noisy time series, this study proposes a hybrid forecasting model integrating CEEMDAN-WT-VMD joint denoising with a BiTCN-BiGRU-Attention architecture. The methodology comprises three stages: (1) CEEMDAN decomposition of raw load data to mitigate mode mixing and extract stationary IMF components; (2) wavelet threshold denoising to filter high-frequency interference while preserving and reconstructing low-frequency signals; (3) secondary feature decomposition using Variational Mode Decomposition (VMD) to enhance data stability. A hybrid architecture combines a Bidirectional Temporal Convolutional Network (BiTCN) for long-term dependency capture, a Bidirectional Gated Recurrent Unit (BiGRU) for dynamic feature extraction, and an attention mechanism for key pattern emphasis. The final load forecasting value is generated by progressively accumulating predictions of decomposed components. Empirical analysis of power load data from a region in Australia, based on horizontal and vertical comparative experiments, demonstrates that the proposed hybrid method achieves significant improvements in both forecasting accuracy and stability compared with other frontier hybrid models.

1. Introduction

In the rapidly evolving power industry, the accuracy of power load forecasting is crucial for the safe operation and efficient dispatching of power grids. While existing forecasting methods have made progress, they still face critical challenges in handling complex time-series data with multi-source noise and non-stationary characteristics [1,2]. Traditional approaches, such as time series analysis [3,4,5], regression models [6], and similar-day approaches [7], suffer from simplistic structures and poor nonlinear fitting capabilities, making them inadequate for modern grid demands. Artificial intelligence (AI)-based methods, including support vector machines (SVM) [8,9], random forests [10], and back-propagation neural networks (BP) [11], show improved performance but remain constrained by unidirectional information flow in architectures like standard recurrent neural networks (RNN), which fail to capture bidirectional temporal dependencies [12]. Hybrid models combining convolutional neural networks (CNN) [13,14] and long short-term memory (LSTM) [15] partially address these limitations but still overlook the inherent nonlinearity and noise interference in load data. Recent advancements in multimodal data fusion [16] and multiscale feature decomposition [17] have demonstrated the potential of hierarchical denoising strategies in complex time-series analysis. Emerging AI techniques in power distribution, such as adaptive genetic algorithms for resource allocation [18] and multi-sampling neural networks for fault detection [19], highlight the importance of hybrid architectures combining signal processing and deep learning. These studies motivate our integration of hierarchical denoising with bidirectional temporal modeling.
Recent advancements in decomposition techniques, such as empirical mode decomposition (EMD) [20,21] and ensemble EMD (EEMD) [22,23], aim to mitigate noise through signal decomposition. However, these methods are plagued by mode mixing, leading to degraded decomposition quality and unreliable subsequence reconstruction. Variational mode decomposition (VMD) [24,25] partially alleviates this issue but struggles with high-frequency intrinsic mode functions (IMFs) contaminated by noise, which directly impair prediction accuracy. Furthermore, existing hybrid models often adopt sequential decomposition–prediction frameworks without joint denoising, resulting in residual noise propagation and suboptimal feature extraction [26].
To address these challenges, this paper proposes a hybrid forecasting model integrating CEEMDAN-WT-VMD joint denoising with a BiTCN-BiGRU-Attention architecture. The key innovations and contributions are summarized as follows:
  • Hierarchical denoising strategy: A CEEMDAN-WT-VMD framework is designed to suppress multi-source noise. CEEMDAN first decomposes raw load data into stable IMFs while mitigating mode mixing. Wavelet thresholding (WT) then filters high-frequency noise components, and VMD performs secondary decomposition to enhance feature stability. This joint approach resolves the limitations of single-stage decomposition and residual noise in existing methods.
  • Bidirectional spatiotemporal modeling: A BiTCN-BiGRU-Attention network is developed to exploit bidirectional temporal dependencies. The BiTCN captures multi-scale historical patterns through dilated causal convolutions, while the BiGRU extracts forward–backward dynamic features. An attention mechanism dynamically weights critical temporal states, overcoming the unidirectional information bottleneck in traditional RNN-based models.
  • Validation on real-world data: Experiments on Australian power load data demonstrate superior performance, with the proposed model achieving a 0.65% MAPE, significantly outperforming benchmarks like VMD-BiTCN-BiGRU-Attention (1.21% MAPE) and non-denoised hybrid models (1.64% MAPE). This validates the effectiveness of the joint denoising and bidirectional architecture in improving both accuracy and robustness.
The remainder of this paper is organized as follows: Section 2 details the CEEMDAN-WT-VMD denoising methodology. Section 3 introduces the BiTCN-BiGRU-Attention model. Section 4 and Section 5 present experimental results and comparative analyses. Conclusions are drawn in Section 6.

2. Denoising and Decomposition of Power Load Time Series

2.1. Principle of CEEMDAN

To effectively address the issues of mode mixing and reconstruction errors in traditional empirical mode decomposition (EMD) and ensemble EMD (EEMD) methods, Torres et al. proposed the Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN) [27]. This method decomposes complex nonlinear and non-stationary time-series data into a series of intrinsic mode functions (IMFs) with distinct frequency characteristics by adaptively adding noise. The computational steps of CEEMDAN are as follows:
Step 1: CEEMDAN stabilizes signal decomposition by injecting adaptive white noise into the original signal. This process can be represented by Equation (1):
$$s_i(t) = \varphi(t) + \alpha\, n_i(t)$$
where $\varphi(t)$ is the original signal, $n_i(t)$ is the adaptively generated white noise sequence, $\alpha$ is the parameter controlling the noise amplitude, and $s_i(t)$ is the noise-injected signal. This step is repeated multiple times with different noise sequences.
Step 2: Perform EMD on each noise-injected signal to obtain its IMFs. The $j$-th IMF component of the $i$-th signal is denoted as $c_{ij}(t)$.
Step 3: Compute the mean of the corresponding IMFs across all noise-injected decomposition results to obtain the final j -th IMF component from CEEMDAN, as shown in Equation (2):
$$\mathrm{IMF}_j(t) = \frac{1}{2n} \sum_{i=1}^{2n} c_{ij}(t)$$
where $\mathrm{IMF}_j(t)$ represents the $j$-th IMF component derived from CEEMDAN.
Step 4: Classify IMFs into high- and low-frequency components using a reconstruction algorithm. For example, IMF1 represents the first component, IMF1 + IMF2 represents the second composite component, and so on.
Step 5: Calculate the mean of components from IMF1 to IMFj and perform a t-test to determine whether the mean significantly deviates from zero. The t-test statistic is expressed in Equation (3):
$$t = \frac{\bar{X}_j}{\sigma_j / \sqrt{m}}$$
where $\bar{X}_j$ is the sample mean of the $j$-th IMF component, $\sigma_j$ is its sample standard deviation, and $m$ is the number of data points in the component.
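For illustration, Steps 1–3 can be reproduced with the open-source PyEMD package (an assumption of this note; the paper does not specify an implementation). Parameter values below are illustrative, and a synthetic series stands in for the non-public load data:

```python
import numpy as np
from PyEMD import CEEMDAN  # pip install EMD-signal

# synthetic stand-in for a load series (the real data are not public)
t = np.linspace(0, 4, 2048)
load = 100 + 20 * np.sin(2 * np.pi * t) + 5 * np.random.randn(t.size)

ceemdan = CEEMDAN(trials=100, epsilon=0.005)  # noise realizations / amplitude
imfs = ceemdan(load)                          # array of shape (n_imfs, len(load))
print(imfs.shape)
```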

2.2. CEEMDAN Combined with Wavelet Threshold Denoising

Wavelet threshold denoising (WT) is a signal denoising method based on wavelet decomposition and reconstruction [28], which effectively removes high-frequency noise while preserving critical information features in the signal. The specific workflow for integrating CEEMDAN with wavelet threshold denoising is illustrated in Figure 1.
The denoising process in Figure 1 includes the following steps:
  1. Signal Decomposition: The noisy load data are decomposed into multiple scales using CEEMDAN, yielding $N$ IMF components, each containing distinct frequency characteristics of the signal.
  2. Noise Processing: High-frequency IMF components are further denoised via wavelet thresholding to obtain refined components $\overline{\mathrm{IMF}}_1 \sim \overline{\mathrm{IMF}}_k$. A hybrid soft-hard threshold method is adopted to mitigate the limitations of single-threshold approaches. The soft threshold function is defined as Equation (4):
$$\bar{\sigma}_{j,i} = \begin{cases} \mathrm{sgn}(\sigma_{j,i}) \left( \left| \sigma_{j,i} \right| - \lambda \right), & \left| \sigma_{j,i} \right| \ge \lambda \\ 0, & \left| \sigma_{j,i} \right| < \lambda \end{cases}$$
  3. Wavelet Decomposition: An appropriate wavelet basis function is selected, and the decomposition level is determined based on signal characteristics. Wavelet decomposition generates coefficients $\sigma_{j,i}$, comprising noise coefficients $v_{j,i}$ and target signal coefficients $u_{j,i}$.
  4. Signal Reconstruction: The denoised high-frequency components $\overline{\mathrm{IMF}}_1 \sim \overline{\mathrm{IMF}}_k$ are combined with the remaining low-frequency components for cumulative reconstruction, producing the final denoised power load data. The reconstruction formula is expressed as Equation (5):
$$g = \sum_{i=1}^{k} \overline{\mathrm{IMF}}_i + \sum_{i=k+1}^{n} \mathrm{IMF}_i$$
  5. Denoising Effect Quantification: The signal-to-noise ratio (SNR) is used to evaluate denoising performance, calculated via Equation (6):
$$\mathrm{SNR}_{\mathrm{dB}} = 10 \cdot \log_{10} \left( P_{\mathrm{signal}} / P_{\mathrm{noise}} \right)$$
where $P_{\mathrm{signal}}$ and $P_{\mathrm{noise}}$ represent the power of the signal and noise, respectively, computed as the mean squared values of the signal and noise components.
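The per-IMF thresholding step can be sketched with PyWavelets using the sym5 basis and three decomposition levels chosen in Section 5.2; the universal threshold $\lambda = \sigma\sqrt{2\ln N}$ used below is a common default assumed here, not a choice stated in the paper:

```python
import numpy as np
import pywt

def wavelet_denoise(imf, wavelet="sym5", level=3):
    """Soft-threshold the detail coefficients of one high-frequency IMF."""
    coeffs = pywt.wavedec(imf, wavelet, level=level)
    # noise level estimated from the finest-scale details (common assumption)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    lam = sigma * np.sqrt(2.0 * np.log(len(imf)))
    coeffs[1:] = [pywt.threshold(c, lam, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(imf)]
```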

2.3. Theory of VMD Algorithm

Variational Mode Decomposition (VMD) is an advanced signal decomposition method that iteratively optimizes a variational model to precisely determine the center frequency and bandwidth of each mode, enabling frequency-domain decomposition and effective component extraction [29]. VMD decomposes the input signal x t into K mode functions u k t , aiming to minimize the bandwidth of each mode while dynamically updating the center frequencies via the Alternating Direction Method of Multipliers (ADMM). This achieves decomposition of non-stationary signals. The mathematical formulation is as follows:
The input signal x t is expressed as the superposition of K mode functions:
$$x(t) = \sum_{k=1}^{K} u_k(t)$$
In the variational framework, the objective function seeks to minimize the bandwidth of the modes:
$$\min_{\{u_k\}, \{\omega_k\}} \sum_{k=1}^{K} \left\| \partial_t \left[ \left( u_k(t) + jH\{u_k(t)\} \right) e^{-j\omega_k t} \right] \right\|_2^2$$
where $H\{u_k(t)\}$ denotes the Hilbert transform of $u_k(t)$, and $\omega_k$ represents the center frequency of the $k$-th mode. To ensure physical meaningfulness of the decomposition, the following constraint is added:
$$s(t) = \sum_{k=1}^{K} u_k(t)$$
By introducing Lagrange multipliers, the optimization objective and constraints are combined into the augmented Lagrangian function:
$$L\left(\{u_k\}, \{\omega_k\}, \lambda\right) = \alpha \sum_{k=1}^{K} \left\| \partial_t \left[ \left( u_k(t) + jH\{u_k(t)\} \right) e^{-j\omega_k t} \right] \right\|_2^2 + \left\| s(t) - \sum_{k=1}^{K} u_k(t) \right\|_2^2 + \left\langle \lambda(t),\; s(t) - \sum_{k=1}^{K} u_k(t) \right\rangle$$
Finally, ADMM alternately updates the mode functions u k t and center frequencies ω k until the convergence criteria are met, completing the decomposition.
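A minimal sketch of this procedure with the vmdpy package (an assumed implementation; the paper does not name a library), with K = 4 matching the choice justified in Section 5.3:

```python
import numpy as np
from vmdpy import VMD  # pip install vmdpy

# synthetic two-tone test signal standing in for the denoised load series
t = np.linspace(0, 1, 1000)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

# alpha: bandwidth penalty; tau: noise tolerance; K: number of modes;
# DC: no DC mode imposed; init=1: uniform center-frequency init; tol: convergence
alpha, tau, K, DC, init, tol = 2000, 0.0, 4, 0, 1, 1e-7
u, u_hat, omega = VMD(signal, alpha, tau, K, DC, init, tol)
# u: (K, N) reconstructed modes; omega: center-frequency trajectory per iteration
```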

3. Short-Term Power Load Forecasting Model

3.1. BiTCN Neural Network Architecture

The Bidirectional Temporal Convolutional Network (BiTCN) is a deep learning model designed for processing sequential data, combining the strengths of temporal convolutional networks (TCN) and bidirectional information flow to significantly enhance modeling capabilities. TCN employs causal convolution to ensure that each element of the output sequence depends only on current and previous inputs, while dilated convolution expands the receptive field to capture longer-term historical dependencies [30]. Building on this, BiTCN incorporates bidirectional information flow, simultaneously leveraging historical and future data, thereby substantially improving time-series modeling performance.
The causal convolution in BiTCN can be expressed as follows:
$$f(x) = \sum_{k=0}^{K-1} \omega_k \, x_{t - d \cdot k}$$
where $\omega_k$ represents the convolutional kernel weights, $x_{t - d \cdot k}$ denotes the input value at timestep $t - d \cdot k$, and $d$ is the dilation factor. The architecture of the dilated causal convolutional network is illustrated in Figure 2.
Figure 3 demonstrates the BiTCN architecture, which incorporates multiple layers of dilated causal convolutions, GELU activation functions, dropout steps, and fully connected layers. This design enhances the model’s ability to extract features from long-sequence data while maintaining computational efficiency.
As shown in Figure 2, the fundamental building block of BiTCN employs dilated causal convolutions with bidirectional processing. These blocks are then systematically stacked in the full BiTCN architecture (Figure 3), where the forward path captures historical dependencies and the backward path extracts future-context features. The concatenated outputs from both directions enable comprehensive temporal pattern learning.
Figure 3 illustrates the BiTCN model used in this study. The model employs two TCN modules (forward and backward) to extract temporal features from sequential data. Each TCN module includes dilated causal convolution, batch normalization, Leaky ReLU activation, and dropout layers to strengthen the model’s capacity to capture temporal dependencies. The outputs of these modules are concatenated to integrate bidirectional temporal information, thereby improving the model’s forecasting accuracy.
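As a rough sketch of the idea (not the authors' exact architecture), a bidirectional TCN block can be built in PyTorch by running one causal branch on the original sequence and one on the time-reversed sequence, then concatenating:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D causal convolution: output at step t sees only inputs up to t."""
    def __init__(self, c_in, c_out, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))    # left-pad only => causal

class BiTCNBlock(nn.Module):
    """One bidirectional block: forward causal branch plus a branch run on the
    time-reversed sequence; outputs are concatenated channel-wise."""
    def __init__(self, c_in, c_out, dilation):
        super().__init__()
        self.fwd = CausalConv1d(c_in, c_out, dilation=dilation)
        self.bwd = CausalConv1d(c_in, c_out, dilation=dilation)
        self.act = nn.LeakyReLU()
        self.drop = nn.Dropout(0.2)                  # dropout rate from Section 5.4

    def forward(self, x):
        h_f = self.fwd(x)
        h_b = torch.flip(self.bwd(torch.flip(x, dims=[-1])), dims=[-1])
        return self.drop(self.act(torch.cat([h_f, h_b], dim=1)))

# usage: stack blocks with dilation rates [1, 2, 4] as in Section 5.4
block = BiTCNBlock(c_in=7, c_out=64, dilation=1)     # 7 input features (Table 1)
out = block(torch.randn(32, 7, 48))                  # -> (32, 128, 48)
```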

3.2. BiGRU Neural Network

The Bidirectional Gated Recurrent Unit (BiGRU) is an enhanced network architecture that introduces bidirectional information flow based on the traditional GRU. It captures both historical and future trend information in time-series data, enabling a more comprehensive exploration of the latent dynamic characteristics of sequential data [31].
The BiGRU architecture, as shown in Figure 4, consists of a forward GRU and a backward GRU. The forward GRU extracts historical features from past timesteps of the time series, while the backward GRU captures future information by processing the input sequence in reverse order. By concatenating the outputs of both directions, BiGRU integrates temporal features more comprehensively.
To further improve prediction accuracy, BiGRU is employed to process the feature vectors output by BiTCN. Specifically, BiGRU utilizes its intrinsic memory units and gating mechanisms to learn dynamic changes in the input sequence from both forward and reverse directions. The state update equations are as follows:
$$\overrightarrow{h}_t = \mathrm{GRU}\left( X_t, \overrightarrow{h}_{t-1} \right)$$
$$\overleftarrow{h}_t = \mathrm{GRU}\left( X_t, \overleftarrow{h}_{t+1} \right)$$
$$h_t = w_t \overrightarrow{h}_t + v_t \overleftarrow{h}_t + b_t$$
where $X_t$ is the input; $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ represent the forward and backward hidden layer outputs at timestep $t$, respectively; $w_t$ and $v_t$ denote the output weights of the forward and backward hidden layers; $b_t$ is the bias term; and $h_t$ is the final integrated output vector.
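In PyTorch terms this corresponds to a GRU with bidirectional=True; the sketch below assumes a 256-dimensional BiTCN output and the 128-unit first BiGRU layer reported in Section 5.4:

```python
import torch
import torch.nn as nn

# a minimal sketch: input size (256) assumes a 128-channel bidirectional
# BiTCN output; hidden_size matches the first BiGRU layer in Section 5.4
bigru = nn.GRU(input_size=256, hidden_size=128, batch_first=True,
               bidirectional=True)
x = torch.randn(32, 48, 256)   # (batch, timesteps, BiTCN feature channels)
out, h_n = bigru(x)            # out: (32, 48, 256) = forward/backward states concatenated
```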

3.3. Attention Mechanism

The attention mechanism mimics the human brain’s information processing by assigning weights to input features in neural networks, emphasizing critical features while suppressing irrelevant ones, thereby enhancing the model’s information filtering and decision-making capabilities. In short-term load forecasting, this mechanism assigns varying weights to load values based on temporal proximity, prioritizing recent data to rapidly capture critical information and improve prediction accuracy [32]. Figure 5 illustrates the structure of the attention mechanism, which generates a distribution vector reflecting feature importance.
  1. Attention Scoring Function:
$$e_t = v \tanh \left( W h_t + b \right)$$
  2. Softmax Function:
$$\alpha_j = \mathrm{softmax}(e_t) = \frac{\exp(e_t)}{\sum_{j=1}^{m} \exp(e_j)}$$
  3. Output Calculation:
$$y_i = \sum_{j=1}^{m} \alpha_j h_j$$
Here, the attention scoring function $e_t$ operates on the hidden state $h_t$, where $v$ and $W$ denote attention weight parameters, $b$ is the bias term, and $m$ represents the dimension of the input vector.
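A compact sketch of this additive attention in PyTorch (hypothetical class name; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Additive attention over hidden states, a sketch of the equations above."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.W = nn.Linear(hidden_dim, hidden_dim)   # implements W h_t + b
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, h):                     # h: (batch, time, hidden)
        e = self.v(torch.tanh(self.W(h)))     # scores e_t: (batch, time, 1)
        alpha = torch.softmax(e, dim=1)       # weights across timesteps
        return (alpha * h).sum(dim=1)         # weighted output vector
```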

4. Hybrid CWVMD-BiTCN-BiGRU-Attention Forecasting

4.1. Input Feature Settings for Load Forecasting Model

As shown in Figure 6, the measured power load data exhibit significant volatility and randomness with weak periodic trends, which is associated with diverse noise interference (e.g., weather factors and date types).
The input feature set X adopted in this study includes historical load, dew point temperature (DPT), dry bulb temperature (DBT), wet bulb temperature (WBT), humidity (HMY), and date type, while y represents the historical load data. The input features are detailed in Table 1.
During model training and prediction, the time series of each IMF component is directly concatenated with external features (e.g., DBT, DPT, WBT, HMY, HOUR, etc.) at the input layer. For example, the input feature vector for the $k$-th IMF component $\mathrm{IMF}_k(t)$ is as follows:
$$X_k(t) = \left[ \mathrm{IMF}_k(t), \mathrm{DBT}(t), \mathrm{DPT}(t), \mathrm{WBT}(t), \mathrm{HMY}(t), \mathrm{HOUR}(t), \mathrm{DateType}(t) \right]$$
This process is completed during the data preprocessing stage to ensure alignment between IMF values and corresponding external features at each timestep. The input of the “BiTCN-BiGRU-Attention” module in Figure 7 and Figure 8 is the aforementioned concatenated multi-dimensional feature Xk(t), rather than solely the IMF sequence.
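A minimal sketch of this preprocessing step (hypothetical function name; inputs are assumed to be equal-length, time-aligned 1-D arrays):

```python
import numpy as np

def build_inputs(imf_k, dbt, dpt, wbt, hmy, hour, date_type):
    """Stack the k-th IMF with time-aligned external features, one row per timestep."""
    return np.column_stack([imf_k, dbt, dpt, wbt, hmy, hour, date_type])
```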

4.2. CWVMD-BiTCN-BiGRU-Attention Hybrid Forecasting Model

The architecture of the CWVMD-BiTCN-BiGRU-Attention hybrid forecasting model is illustrated in Figure 7.
The model structure in Figure 7 includes the following steps:
  • Decomposition: The denoised load time series is decomposed into multiple IMF components using the CWVMD method.
  • Model Training: For each IMF component, a BiTCN-BiGRU-Attention neural network prediction model is established. Model parameters are initialized and optimized via the Adam optimizer, adjusting weight parameters to enhance performance.
  • Iterative Forecasting: The trained models generate predictions iteratively, aggregating results to produce the final forecast.
The workflow of the BiTCN-BiGRU-Attention neural network is further detailed in Figure 8.
The load time-series data decomposed into 4 IMF components via the CWVMD joint denoising and decomposition framework (see Section 5.3 for methodological details) are utilized as independent input channels (labeled 1–4) and fed into the BiTCN-BiGRU-Attention network for feature extraction and prediction. The model structure in Figure 8 comprises the following stages:
  • BiTCN Layer: IMF components are fed into the BiTCN layer, which consists of forward and reverse TCN modules. These modules process sequential data bidirectionally through dilated causal convolution, batch normalization, Leaky ReLU activation, and dropout operations to extract temporal features.
  • BiGRU Layer: The output vectors from BiTCN serve as inputs to the BiGRU layer, which includes forward and reverse GRU modules to capture long-term dependencies in the time series.
  • Attention Layer: The outputs from BiGRU are weighted by the attention mechanism to emphasize critical temporal patterns.
  • Output Layer: The weighted features are passed to the output layer to generate the final prediction.

4.3. Evaluation Metrics

The performance of the forecasting model is evaluated based on the errors between the true and predicted values. This study adopts three metrics for assessment: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). The formulas are defined as follows:
$$\mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} \left| e_k \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} e_k^2}$$
$$\mathrm{MAPE} = \frac{100\%}{N} \sum_{k=1}^{N} \left| \frac{e_k}{a_k} \right|$$
where
  • $N$: Total number of test samples.
  • $e_k$: Error between the $k$-th predicted value and the corresponding true value.
  • $a_k$: Actual (true) value of the $k$-th sample.
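These three metrics can be computed directly in NumPy; a minimal sketch:

```python
import numpy as np

def mae(actual, pred):
    return np.mean(np.abs(actual - pred))

def rmse(actual, pred):
    return np.sqrt(np.mean((actual - pred) ** 2))

def mape(actual, pred):
    # percentage error relative to the actual value a_k
    return 100.0 * np.mean(np.abs((actual - pred) / actual))
```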

5. Case Analysis

5.1. Data Source and Preprocessing

To validate the effectiveness of the proposed algorithm, load data from a region in Australia spanning May 2020 to July 2021 are utilized. The data are sampled at 30 min intervals, yielding 48 data points per day and a total of 21,936 data points. To align with the requirements of the forecasting model, a sliding window of width 48 is applied: the previous 48 load values are used as "features" and the load value at the next timestep is designated as the "label". The load data are normalized using the min-max normalization method, as defined by Equation (13):
$$X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
where $X$ is the raw data, $X_{\min}$ and $X_{\max}$ are the minimum and maximum values in the dataset, respectively, and $X_{\mathrm{norm}}$ is the normalized data.
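A minimal sketch of the windowing and normalization described above (function names are illustrative):

```python
import numpy as np

def min_max_normalize(x):
    """Equation (13): scale a series to [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def make_windows(series, width=48):
    """Previous `width` points as features, next point as label."""
    X = np.stack([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]
    return X, y

# usage: X, y = make_windows(min_max_normalize(load_series), width=48)
```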

5.2. Data Denoising

The CEEMDAN decomposition is applied to 4416 power load data points from 1 May to 1 August 2020. Figure 9 illustrates the decomposition results of the original signal and nine intrinsic mode functions (IMFs). IMF1 to IMF9 are arranged in descending order of frequency and complexity, revealing clearer and more intuitive variation patterns compared to the original sequence, with distinct trends. IMF1 represents the highest-frequency and most complex components of the time series, while subsequent IMFs gradually exhibit simpler and smoother patterns.
In Figure 9, noise signals are concentrated in high-frequency IMF components. Using a reconstruction algorithm, high- and low-frequency components are differentiated. As shown in Table 2, the t-test value for IMF4 is 0.03 (corresponding to 1.28%), which significantly deviates from zero. Thus, IMF1, IMF2, and IMF3 are classified as high-frequency components, while the remaining are low-frequency.
To eliminate high-frequency noise and extract primary signal features, wavelet threshold denoising is applied to each high-frequency IMF component, while low-frequency components retain critical load characteristics. The sym5 wavelet basis function, 3 decomposition levels, and a soft threshold method are selected. Figure 10 compares the denoising effects on IMF1–IMF3:
The denoised results show that IMF-1 is dominated by noise, with a signal-to-noise ratio (SNR) of only 0.31 dB. IMF-2 and IMF-3 gradually exhibit primary signal characteristics, achieving SNRs of 16.42 dB and 48.98 dB, respectively. After reconstruction, the denoised signal achieves an SNR of 227.10 dB (compared to 41.06 dB before denoising), demonstrating significant quality improvement (Table 3).

5.3. VMD Decomposition

Using the power load data from 1 May to 1 August 2020, the VMD results are illustrated in Figure 11.
The optimal number of decomposition modes K is determined by analyzing the central frequencies of the modes. Over-decomposition occurs when the central frequencies of two adjacent modes differ by a value smaller than a conventional threshold. The central frequencies of the IMF components for different K values are listed in Table 4.
From the central frequency data in Table 4, when K = 5, the central frequencies of IMF-3 and IMF-4 are 0.041 Hz and 0.044 Hz, respectively, with a difference of only 0.003 Hz, which is significantly smaller than the conventional threshold. This indicates over-decomposition at K = 5.
When K = 4, the central frequencies of IMF-3 and IMF-4 are 0.043 Hz and 0.055 Hz, respectively, yielding a difference of 0.012 Hz. This confirms no significant over-decomposition and demonstrates effective decomposition, whereas K = 3 may inadequately decompose low-frequency components (IMF-1 at 7.43 × 10−6 Hz) and K = 6 introduces redundant modes (IMF-6 at 0.065 Hz vs. IMF-5 at 0.045 Hz, difference: 0.020 Hz) without substantial improvement in decomposition efficiency. Therefore, K = 4 is selected as the optimal decomposition level.

5.4. Model Parameters and Prediction Results of BiTCN-BiGRU-Attention

The dataset spanning May 2020 to July 2021 (21,936 data points) is divided into an 80% training set and a 20% test set. The BiTCN network employs a convolutional kernel size of 3, a batch size of 256, a learning rate of 0.001, a dropout rate of 0.2 (to prevent overfitting), and dilation rates of [1, 2, 4]. The convolutional layer is configured with 128 neurons.
The output vectors of the BiTCN model serve as inputs to the BiGRU layer. The BiGRU model architecture includes two layers: the first layer contains 128 neurons, and the second layer contains 32 neurons. The model is trained for 100 epochs.
During training, we adopted an early stopping strategy based on validation loss (patience = 10), meaning training is terminated if the validation loss does not decrease for 10 consecutive epochs. Experimental results show that training for most VMD components converges within 60–80 epochs. Therefore, the setting of 100 epochs is sufficient to cover the training needs of all components while avoiding overfitting.
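The loop below is a schematic of this stopping rule; fit_with_early_stopping, train_one_epoch, and evaluate are hypothetical names, with the latter two standing in for the actual training and validation routines:

```python
def fit_with_early_stopping(train_one_epoch, evaluate, max_epochs=100, patience=10):
    """Train until validation loss fails to improve for `patience` epochs.
    `train_one_epoch` and `evaluate` are caller-supplied callables."""
    best_val, wait = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = evaluate()
        if val_loss < best_val:
            best_val, wait = val_loss, 0   # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:           # 10 stagnant epochs: stop training
                break
    return best_val
```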
We compared the impact of different dropout rates (0.1, 0.2, 0.3, 0.5) on model performance (see Table 5) and found that a dropout rate of 0.2 effectively suppresses overfitting while minimizing the impact on prediction accuracy.
A higher dropout rate (e.g., 0.5) leads to underfitting (higher errors on both training and validation sets), whereas a rate of 0.1 fails to effectively prevent overfitting (validation errors exhibit significant fluctuations).

5.5. Ablation Experiments and Prediction Model Comparison

After finalizing the parameters of the CWVMD-BiTCN-BiGRU-Attention model, horizontal ablation experiments are conducted by comparing it with the following models: CWVMD-BiTCN-BiGRU, CWVMD-BiTCN-Attention, CWVMD-BiGRU-Attention, CWVMD-BiTCN, and CWVMD-BiGRU. The prediction curves of these models are compared in Figure 12, and their evaluation metrics are summarized in Table 6.
According to the comparison of evaluation metrics in Table 6, the CWVMD-BiTCN-BiGRU-Attention model achieves an MAPE of 0.65%, which is 0.13%, 0.25%, 0.18%, 0.37%, and 0.50% lower than the CWVMD-BiTCN-BiGRU, CWVMD-BiTCN-Attention, CWVMD-BiGRU-Attention, CWVMD-BiTCN, and CWVMD-BiGRU models, respectively. Simultaneously, the RMSE and MAE values of the CWVMD-BiTCN-BiGRU-Attention model are also significantly lower than those of other models, demonstrating its superior accuracy and generalization capability in load forecasting.
From the perspective of error metric comparisons, the improvements in the CWVMD-BiTCN-BiGRU-Attention model stem from the synergistic effects of multiple key mechanisms. The BiTCN component employs causal convolutions and dilated convolutions to model multi-scale temporal dependencies, enhancing the extraction of time-series features. Concurrently, the BiGRU model, integrated with an Attention mechanism, further optimizes the feature selection process by dynamically allocating higher weights to critical features, thereby improving prediction accuracy.
To validate the advantages of the proposed model in short-term load forecasting, a vertical comparison was conducted against three benchmark models: the non-denoised hybrid model BiTCN-BiGRU-Attention, the VMD-BiTCN-BiGRU-Attention model incorporating VMD, and the WT-VMD-BiTCN-BiGRU-Attention model utilizing joint WT and VMD. As illustrated in Figure 13, the evaluation metrics for the predictions of all models are summarized in Table 7, demonstrating the superior performance of the proposed approach in both accuracy and robustness.
VMD alone fails to handle initial noise pollution, as high-frequency noise distorts its mode decomposition. WT-VMD (without CEEMDAN) loses subtle temporal features due to premature denoising, increasing MAPE by 0.40%.
In contrast, CWVMD’s phased processing preserves both macro trends (via CEEMDAN) and micro dynamics (via WT-VMD).
The results in Table 7 show that the proposed model reduces MAPE by 0.99%, 0.56%, and 0.40% compared to BiTCN-BiGRU-Attention, VMD-BiTCN-BiGRU-Attention, and WT-VMD-BiTCN-BiGRU-Attention, respectively. The significant reductions in RMSE and MAE further confirm that the CWVMD method (CEEMDAN-WT-VMD joint denoising and decomposition) effectively extracts stationary features from load time series, mitigates high-frequency noise interference, and substantially improves forecasting precision.

6. Conclusions

To address the challenges of insufficient prediction accuracy caused by non-stationary data and noise interference in short-term power load forecasting, this study proposes a hybrid short-term power load forecasting model, CEEMDAN-Wavelet-VMD-BiTCN-BiGRU-Attention, which integrates CEEMDAN (Complete Ensemble Empirical Mode Decomposition with Adaptive Noise) with wavelet threshold denoising and VMD (Variational Mode Decomposition). The key conclusions are as follows:
  • The original load data are decomposed into multiple IMF (Intrinsic Mode Function) components using CEEMDAN. Wavelet threshold denoising is applied to remove noise from high-frequency components, followed by secondary feature extraction via VMD on the denoised signal. This process generates load time series with enhanced stationarity and periodicity, enabling the BiTCN model to better capture long-term dependencies and improve fitting capability.
  • The VMD-derived IMF components are fed into the BiTCN network using time windows, leveraging BiTCN’s strength in extracting latent features and long-range dependencies. This step provides more accurate feature representations for subsequent modeling.
  • The feature vectors extracted by BiTCN are used as inputs for the BiGRU model, which further models complex temporal relationships and nonlinear characteristics of load data. Concurrently, the Attention mechanism dynamically focuses on critical features, enhancing prediction accuracy and model robustness.
Compared to benchmark models, the proposed framework achieves significant improvements in forecasting precision, reducing MAPE by 60.4% (from 1.64% to 0.65%) and RMSE by 65.5% (from 165.39 MW to 57.01 MW) over the non-denoised baseline model. The hierarchical denoising strategy (CEEMDAN-WT-VMD) contributes to a 46.3% MAPE reduction compared to standalone VMD.
Future research will focus on the following: (1) realizing real-time adaptive decomposition through streaming CEEMDAN-VMD integration; (2) developing a multivariate attention mechanism to analyze the weather-load coupling relationship; (3) exploring edge-computing deployment based on neural architecture search.

Author Contributions

Conceptualization, X.G. and Y.G.; methodology, X.G.; software, Y.Z.; validation, Y.Z. and X.S.; formal analysis, X.G. and W.S.; investigation, X.G.; resources, Y.G.; data curation, X.G.; writing—original draft preparation, X.G.; writing—review and editing, Y.G. and W.S.; visualization, Y.Z.; supervision, X.S.; project administration, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC), grant number 62171271.

Data Availability Statement

Due to privacy restrictions, the data for this study cannot be made publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Z. Energy system optimization based on fuzzy decision support system and unstructured data. Energy Inform. 2024, 7, 82. [Google Scholar] [CrossRef]
  2. Zaboli, A.; Kasimalla, S.R.; Park, K.; Hong, Y.; Hong, J. A Comprehensive Review of Behind-the-Meter Distributed Energy Resources Load Forecasting: Models. Energies 2023, 17, 2534. [Google Scholar] [CrossRef]
  3. Ullah, K.; Ahsan, M.; Hasanat, S.M.; Haris, M.; Yousaf, H.; Raza, S.F.; Tandon, R.; Abid, S.; Ullah, Z. Short-Term Load Forecasting: A Comprehensive Review and Simulation Study with CNN-LSTM Hybrids Approach. IEEE Access 2024, 12, 111858–111881. [Google Scholar] [CrossRef]
  4. Buratto, W.G.; Muniz, R.N.; Nied, A.; González, G.V. Seq2Seq-LSTM With Attention for Electricity Load Forecasting in Brazil. IEEE Access 2024, 12, 30020–30029. [Google Scholar] [CrossRef]
  5. Das, S.; Fouda, M.M.; Abdo, M.G. Short-Term Load Forecasting Using GRU-LGBM Fusion. In Proceedings of the 2024 International Conference on Smart Applications, Communications and Networking (SmartNets), Washington DC, USA, 28–30 May 2024; pp. 1–6. [Google Scholar]
  6. Wang, X.; Wang, H.; Bhandari, B.; Cheng, L. AI-Empowered Methods for Smart Energy Consumption: A Review of Load Forecasting, Anomaly Detection and Demand Response. Int. J. Precis. Eng. Manuf. Technol. 2024, 11, 963–993. [Google Scholar] [CrossRef]
  7. Amini, H.; Mehrizi-Sani, A. Similar Days Fuzzy Clustering Load Forecasting. In Proceedings of the IECON 2024—50th Annual Conference of the IEEE Industrial Electronics Society, Chicago, IL, USA, 3–6 November 2024; pp. 1–5. [Google Scholar] [CrossRef]
  8. Wang, Z.-X.; Ku, Y.-Y.; Liu, J. The Power Load Forecasting Model of Combined SaDE-ELM and FA-CAWOA-SVM Based on CSSA. IEEE Access 2024, 12, 41870–41882. [Google Scholar] [CrossRef]
  9. Chen, S.; Hao, C.; Dehghanian, P. Short-Term Load Forecasting Model Based on ICCEEMDAN-MRMR and CNN-SVM. In Proceedings of the 2024 4th International Conference on Energy Engineering and Power Systems (EEPS), Hangzhou, China, 9–11 August 2024; pp. 385–389. [Google Scholar] [CrossRef]
  10. Magalhães, B.; Bento, P.; Pombo, J.; Calado, M.D.R.; Mariano, S. Short-Term Load Forecasting Based on Optimized Random Forest and Optimal Feature Selection. Energies 2023, 17, 1926. [Google Scholar] [CrossRef]
  11. Wang, Q.; Peng, C.; Zhou, Y. Short-term power load forecasting based on improved back propagation neural network. In Proceedings of the SPIE 13395, International Conference on Optics, Electronics, and Communication Engineering (OECE 2024), 133952R, Foshan, China, 12 November 2024; pp. 698–705. [Google Scholar]
  12. Jiang, B.; Yang, H.; Wang, Y.; Liu, Y.; Geng, H.; Zeng, H.; Ding, J. Dynamic Temporal Dependency Model for Multiple Steps Ahead Short-Term Load Forecasting of Power System. IEEE Trans. Ind. Appl. 2024, 60, 5244–5254. [Google Scholar] [CrossRef]
  13. Wu, L.; Kong, C.; Hao, X.; Chen, W. A Short-Term Load Forecasting Method Based on GRU-CNN Hybrid Neural Network Model. Math. Probl. Eng. 2020, 1, 1428104. [Google Scholar] [CrossRef]
  14. Ahranjani, Y.K.; Beiraghi, M.; Ghanizadeh, R. Short time load forecasting for Urmia city using the novel CNN-LTSM deep learning structure. Electr. Eng. 2025, 107, 1253–1264. [Google Scholar] [CrossRef]
  15. Murali, S.; Saini, P.; Abhinav, K.; Shankar, R.; Parida, S.K. Improved LSTM-Based Load Forecasting Embedded 3DOF (FOPI)-FOPD Controller for Proactive Frequency Regulation in Power System. IEEE Trans. Ind. Appl. 2024, 60, 8213–8227. [Google Scholar] [CrossRef]
  16. Guo, D.; Zhang, Z.; Yang, B.; Zhang, J.; Yang, H.; Lin, Y. Integrating spoken instructions into flight trajectory prediction to optimize automation in air traffic control. Nat. Commun. 2024, 15, 15. [Google Scholar] [CrossRef] [PubMed]
  17. Chen, H.; Sun, Y.; Li, X.; Zheng, B.; Chen, T. Dual-Scale Complementary Spatial-Spectral Joint Model for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2025, 18, 6772–6789. [Google Scholar] [CrossRef]
  18. Mateev, V.; Marinova, I. Modified chromosome pooling genetic algorithm for resource allocation optimization. AIP Conf. Proc. 2023, 2939, 100010. [Google Scholar] [CrossRef]
  19. Marin-Quintero, J.; Orozco-Henao, C.; Bretas, A.S.; Velez, J.C.; Herrada, A.; Barranco-Carlos, A.; Percybrooks, W.S. Adaptive Fault Detection Based on Neural Networks and Multiple Sampling Points for Distribution Networks and Microgrids. J. Mod. Power Syst. Clean Energy 2022, 10, 1648–1657. [Google Scholar] [CrossRef]
  20. Tan, D.; Tang, Z.; Zhou, F.; Xie, Y. A Novel Hybrid Model Based on EMD-Improved TCN-Improved TST for Short-Term Railway Traction Load Forecasting. IEEE Trans. Transp. Electrif. 2024, 11, 6418–6427. [Google Scholar] [CrossRef]
  21. Dong, Y.; Meng, Q.; Fu, J.; Ai, X.; Chen, Z.; Yin, Y.; Liu, K.; Zhang, X. Ultra-Short-Term Load Forecasting Based on EMD Decomposition and CNN-BiGRU Heterogeneous Computing Model. In Proceedings of the 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA), Shenyang, China, 26–28 January 2024; pp. 1266–1272. [Google Scholar] [CrossRef]
  22. Dong, Y.; Yang, C.; Meng, Q.; Ai, X.; Yin, Y.; Liu, K.; Fu, J.; Chen, Z. Ultra-short-term load forecasting based on the combination of EEMD and Autoformer multi-model. In Proceedings of the 2024 IEEE 4th International Conference on Power, Electronics and Computer Applications (ICPECA), Shenyang, China, 26–28 January 2024; pp. 1273–1279. [Google Scholar] [CrossRef]
  23. Yang, M.; Liu, Y.; Mu, Y.; Li, Z.; Zhang, H.; Chen, W.; Rong, F. Short-Term Power Load Forecasting Model Based on EEMD-SE-ERCNN. In Proceedings of the 2024 7th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), Hangzhou, China, 15–17 August 2024; pp. 893–898. [Google Scholar] [CrossRef]
  24. Ye, H.; Zhu, Q.; Zhang, X. Short-Term Load Forecasting for Residential Buildings Based on Multivariate Variational Mode Decomposition and Temporal Fusion Transformer. Energies 2023, 17, 3061. [Google Scholar] [CrossRef]
  25. Duan, Q.; He, X.; Chao, Z.; Tang, X.; Li, Z. Short-term power load forecasting based on sparrow search algorithm-variational mode decomposition and attention-long short-term memory. Int. J. Low-Carbon Technol. 2024, 19, 1089–1097. [Google Scholar] [CrossRef]
  26. Xiang, X.; Yuan, T.; Cao, G.; Zheng, Y. Short-Term Electric Load Forecasting Based on Signal Decomposition and Improved TCN Algorithm. Energies 2023, 17, 1815. [Google Scholar] [CrossRef]
  27. Heng, L.; Hao, C.; Nan, L.C. Load forecasting method based on CEEMDAN and TCN-LSTM. PLoS ONE 2024, 19, e0300496. [Google Scholar] [CrossRef]
  28. Liu, J.; Luan, J.; Tai, Y.; Gao, C.; Li, S.; Jiao, L. Application of Signal Averaging CEEMDAN and Adaptive Wavelet Threshold Denoising in Signal Processing. In Proceedings of the 2024 5th International Conference on Machine Learning and Computer Application (ICMLCA), Hangzhou, China, 18–20 October 2024; pp. 535–538. [Google Scholar] [CrossRef]
  29. Liu, W.; Hua, F.; Cui, Y.; Xu, Y.; Liu, H. An Optimized Power Load Forecasting Algorithm Based on VMD-SMA-LSTM. Energy Sci. Eng. 2025, 9, 2857. [Google Scholar] [CrossRef]
  30. Tian, C.; Liu, Q.; Tian, C.; Ma, X.; Feng, Y.; Han, D. Interval Prediction of Air Conditioning Load Based on Quantile Regression BiTCN-BiGRU-Attention Model. In Proceedings of the 2024 China Automation Congress (CAC), Qingdao, China, 1–3 November 2024; pp. 1–6. [Google Scholar] [CrossRef]
  31. Ullah, K.; Shakir, D.; Abid, U.; Alahmari, S.; Aslam, S.; Ullah, Z. Hybrid BiGRU-CNN Model for Load Forecasting in Smart Grids with High Renewable Energy Integration. IET Gener. Transm. Distrib. 2024, 19, e70060. [Google Scholar] [CrossRef]
  32. Zhang, X.; Ye, J.; Gao, L.; Ma, S.; Xie, Q.; Huang, H. Short-term wind power prediction based on ICEEMDAN decomposition and BiTCN–BiGRU-multi-head self-attention model. Electr. Eng. 2025, 107, 2645–2662. [Google Scholar] [CrossRef]
Figure 1. Denoising workflow for power load data.
Figure 2. Bidirectional dilated causal convolutional network structure.
Figure 3. BiTCN model structure.
Figure 4. BiGRU structure.
Figure 5. Attention mechanism structure.
Figure 6. Measured power load data.
Figure 7. CWVMD-BiTCN-BiGRU-Attention hybrid forecasting model.
Figure 8. BiTCN-BiGRU-Attention model structure.
Figure 9. CEEMDAN decomposition results.
Figure 10. Wavelet threshold denoising results. (a) IMF-1 denoising comparison, (b) IMF-2 denoising comparison, (c) IMF-3 denoising comparison.
Figure 11. VMD results.
Figure 12. Prediction curve comparison of different models.
Figure 13. Prediction curve comparison of denoising strategies.
Table 1. Input feature set.

Feature   Description             Value Range
LOAD      Previous day load       Load sequence values y
HOUR      Hour of the day         00:00–23:00 (0.5 h steps)
DBT       Dry bulb temperature    Real-time temperature values
DPT       Dew point temperature   Real-time temperature values
WBT       Wet bulb temperature    Real-time temperature values
HMY       Humidity                Real-time humidity values
Table 2. Classification of high- and low-frequency IMF components.

Component        IMF_1   IMF_2   IMF_3   IMF_4   IMF_5   IMF_6
Percentage (%)   1.03    0.56    6.19    1.28    0.02    1.25
Category         High    High    High    Low     Low     Low
Table 3. SNR comparison of IMF components.

Component              SNR (Pre-Denoising, dB) 1   SNR (Post-Denoising, dB)
IMF-1                  0.10                        0.31
IMF-2                  2.34                        16.42
IMF-3                  5.67                        48.98
Reconstructed Signal   41.06                       227.10

1 Pre-denoising SNR values are estimated based on the energy ratio of high-frequency noise to total signal power.
Table 4. Central frequencies of mode components (Hz).

Mode    K = 3         K = 4         K = 5         K = 6
IMF-1   7.43 × 10−6   7.01 × 10−6   6.98 × 10−6   6.95 × 10−6
IMF-2   0.021         0.021         0.020         0.019
IMF-3   0.046         0.043         0.041         0.040
IMF-4   —             0.055         0.044         0.042
IMF-5   —             —             0.064         0.045
IMF-6   —             —             —             0.065
Table 5. Impact of different dropout rates on model performance (taking IMF1 as an example).

Dropout   Train RMSE (MW)   Val RMSE (MW)   Test RMSE (MW)
0.1       45.2              62.8            63.5
0.2       48.6              58.3            57.0
0.3       52.1              59.7            58.9
0.5       60.4              65.2            64.1
Table 6. Comparison of evaluation metrics across models.

Model                          RMSE (MW)   MAPE (%)   MAE (MW)
CWVMD-BiGRU                    130.84      1.15       115.32
CWVMD-BiTCN                    115.63      1.02       102.67
CWVMD-BiGRU-Attention          85.42       0.83       79.45
CWVMD-BiTCN-Attention          93.54       0.90       83.85
CWVMD-BiTCN-BiGRU              79.84       0.78       76.71
CWVMD-BiTCN-BiGRU-Attention    57.01       0.65       54.62
Table 7. Comparison of denoising strategies.

Model                           RMSE (MW)   MAPE (%)   MAE (MW)
BiTCN-BiGRU-Attention           165.39      1.64       144.79
VMD-BiTCN-BiGRU-Attention       109.15      1.21       98.21
WT-VMD-BiTCN-BiGRU-Attention    85.63       1.05       84.37
CWVMD-BiTCN-BiGRU-Attention     57.01       0.65       54.62
