A VMD–Bayesian-Optimized XGBoost–BiLSTM Hybrid Model for Short-Term Load Forecasting

Xu, Tianqi; He, Jie; Li, Yan; Li, Xiaolan; Tang, Ju

doi:10.3390/electronics15122507


by Tianqi Xu ¹,², Jie He ¹, Yan Li ¹,²,*, Xiaolan Li ¹ and Ju Tang ¹

¹ Key Laboratory of Cyber-Physical Power System of Yunnan Colleges and Universities, the School of Electrical and Information Technology, Yunnan Minzu University, Kunming 650504, China

² Yunnan Key Laboratory of Unmanned Autonomous Systems, Kunming 650504, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2507; https://doi.org/10.3390/electronics15122507

Submission received: 14 May 2026 / Revised: 1 June 2026 / Accepted: 5 June 2026 / Published: 7 June 2026

(This article belongs to the Section Power Electronics)


Abstract

Accurate short-term load forecasting is essential for reliable power system operation under increasingly nonlinear, volatile, and multi-scale load patterns. This study proposes a VMD–BayesXGB–BiLSTM hybrid forecasting framework that integrates time-series-cross-validation-based variational mode decomposition (VMD), Bayesian-optimized XGBoost (BayesXGB), and BiLSTM residual correction. First, abnormal values in the raw load and explanatory variables are detected using the $3 σ$ criterion and corrected by cubic spline interpolation. Then, VMD parameters are selected only within the training sequence, and leakage-free VMD features are generated from historical input windows, avoiding the use of future information. BayesXGB is employed as the primary forecasting model to capture nonlinear relationships between historical load, VMD-derived multi-scale features, and external variables. Finally, a stacked BiLSTM module learns temporal patterns from historical BayesXGB predictions and residuals, and the predicted residual correction is added to the preliminary forecast. Experiments on an Australian electricity load dataset show that the proposed model achieves an RMSE of 122.1003, an MAE of 90.7386, a MAPE of 1.0269%, and an R² of 0.9921, outperforming all compared baseline models while maintaining sub-millisecond inference per sample.

Keywords: Bayesian optimization; BiLSTM; short-term load forecasting; variational mode decomposition; XGBoost

1. Introduction

Short-term load forecasting is an important task for power system dispatch, operational planning, electricity market decision-making, and grid reliability management [1]. Accurate load forecasts help system operators allocate generation resources more effectively, reduce operational uncertainty, and maintain the stability of power supply. With the development of new power systems and the increasing integration of renewable energy, electricity load profiles have become more nonlinear, volatile, and multi-scale. These changes pose significant challenges to the accuracy and robustness of short-term load forecasting models [2,3,4].

Existing studies on short-term load forecasting have mainly developed along three directions: statistical forecasting, machine learning, and deep learning. Statistical models, including ARIMA and Kalman filtering, are easy to implement and interpret, but their performance is often limited when load data contain strong nonlinear and non-stationary patterns [5,6,7,8]. Machine learning methods, such as support vector regression, random forests, and XGBoost, can model nonlinear relationships between meteorological variables, electricity price, historical load, and future demand [9,10,11]. Among them, XGBoost has shown strong performance in regression tasks because of its ensemble learning structure and regularization mechanism. However, tree-based models do not directly learn sequential dependencies from time-series inputs. In contrast, deep learning models, such as RNN, LSTM, GRU, BiLSTM, attention-based networks, and convolutional-recurrent hybrid models, are able to extract temporal patterns from historical sequences [12,13,14,15]. Recently, Cavus et al. [16] developed a residual-normalized GRU-based forecasting framework, showing that recurrent learning, residual modeling, and normalization mechanisms can improve the robustness of short-term energy forecasting under non-stationary conditions. Nevertheless, deep learning models may still be affected by high-frequency noise, complex fluctuations, and relatively high training costs.

To reduce the influence of high-frequency noise and non-stationary fluctuations, signal decomposition methods have often been combined with forecasting models. EMD and EEMD can decompose complex load sequences into several frequency-related components, but they may be affected by mode mixing, endpoint effects, and empirical settings [17,18,19]. Variational Mode Decomposition (VMD) provides a more stable decomposition framework and has been applied to short-term load forecasting tasks [20]. Through decomposition, the original load sequence can be transformed into several relatively stable sub-sequences, which provides a useful basis for subsequent nonlinear forecasting.

The forecasting stage after decomposition also affects the overall performance of hybrid load forecasting models. For example, Nabavi et al. [21] combined DWT with LSTM for electricity load forecasting and showed that the DWT-LSTM model achieved higher forecasting accuracy than several benchmark models. Their study demonstrates the effectiveness of combining signal decomposition with deep learning for load forecasting. However, after a primary prediction is obtained, the remaining errors may still contain nonlinear and time-dependent patterns. Therefore, residual correction can be introduced as a supplementary step to calibrate preliminary forecasts and improve the final forecasting results [22].

Based on the above discussion, this paper proposes a VMD–BayesXGB–BiLSTM hybrid model for short-term load forecasting. Specifically, the proposed framework combines TSCV-based VMD parameter selection, Bayesian-optimized XGBoost forecasting, and BiLSTM-based residual correction into a unified forecasting pipeline. This design progressively reduces the complexity of the load sequence, improves the nonlinear prediction capability of the primary model, and further calibrates residual temporal errors.

The main contributions of this study are summarized as follows:

TSCV-based parameter selection for VMD: To reduce the dependence on empirical parameter settings in decomposition-based forecasting, a time-series cross-validation strategy is introduced for VMD parameter selection. By selecting the number of modes and penalty factor through validation across temporal folds, the decomposition process becomes more adaptive to the multi-scale characteristics of the load sequence.
Enhanced forecasting via BayesXGB: Bayesian optimization is introduced to tune the hyperparameters of the XGBoost regressor, with mean absolute error (MAE) used as the optimization objective. By incorporating VMD-derived multi-scale features, the BayesXGB model is used as the primary forecasting model to capture nonlinear relationships in the load data.
BiLSTM-based residual correction mechanism: A BiLSTM correction module is introduced after the BayesXGB forecasting stage to model the temporal patterns contained in prediction residuals. Historical BayesXGB predictions and residual sequences are used as inputs, and the predicted residual correction is added to the BayesXGB output to obtain the final forecast.

2. Methods

2.1. VMD

VMD is an adaptive mode decomposition method for analyzing non-stationary, nonlinear signals. It transforms the signal decomposition process into solving a constrained variational optimization problem. Compared to EMD, VMD typically yields more stable results with superior noise resistance, effectively resolving the mode mixing issues inherent in EMD [23]. The constrained variational model is formulated as follows:

\begin{cases} \min_{\{u_{k}\}, \{ω_{k}\}} \sum_{k} {\left\| \partial_{t} \left[ \left( δ (t) + \frac{j}{π t} \right) * u_{k} (t) \right] e^{- j ω_{k} t} \right\|}_{2}^{2} \\ \mathrm{s.t.} \; \sum_{k} u_{k} (t) = f (t) \end{cases}

(1)

where

$t$ represents time; $f (t)$ denotes the input raw load sequence;
$u_{k} (t)$ is the $k$ -th mode component;
$δ (t)$ is the unit impulse function;
$*$ denotes convolution and $j$ is the imaginary unit;
$ω_{k}$ represents the center frequency of the $k$ -th mode component.

Since solving the constrained variational problem directly is relatively challenging, it is transformed into an unconstrained optimization problem via the introduction of the Lagrangian function.

L (\{u_{k}\}, \{ω_{k}\}, λ) = α \sum_{k = 1}^{K} {\left\| \partial_{t} \left[ \left( δ (t) + \frac{j}{π t} \right) * u_{k} (t) \right] e^{- j ω_{k} t} \right\|}_{2}^{2} + {\left\| f (t) - \sum_{k = 1}^{K} u_{k} (t) \right\|}_{2}^{2} + \left\langle λ (t), f (t) - \sum_{k = 1}^{K} u_{k} (t) \right\rangle

(2)

where

$λ (t)$ represents the Lagrange multiplier;
$α$ is the quadratic penalty factor.

Subsequently, the Alternating Direction Method of Multipliers (ADMM) is employed to iteratively solve the model until convergence is achieved. The update formulas for the modes and center frequencies are as follows:

Intrinsic Mode Function (IMF) Update:

{\hat{u}}_{k}^{n + 1} (ω) = \frac{[\hat{f} (ω) - \sum_{i \neq k} {\hat{u}}_{i} (ω) + \frac{\hat{λ} (ω)}{2}]}{1 + 2 α {(ω - ω_{k})}^{2}}

(3)

Here, $\hat{f} (ω)$, ${\hat{u}}_{k} (ω)$ and $\hat{λ} (ω)$ represent the Fourier transforms of $f (t)$, $u_{k} (t)$ and $λ (t)$, respectively.

Center Frequency Update:

ω_{k}^{n + 1} = \frac{\int_{0}^{\infty} ω {|{\hat{u}}_{k}^{n + 1} (ω)|}^{2} d ω}{\int_{0}^{\infty} {|{\hat{u}}_{k}^{n + 1} (ω)|}^{2} d ω}

(4)
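As a rough illustration of Equations (3) and (4), the following NumPy sketch performs one update of a single mode on a discretized, non-negative frequency grid. The function name, the array layout, and the replacement of the integrals by sums are assumptions made here for illustration; this is not the authors' implementation, and a full VMD solver also updates the multiplier and checks convergence.

```python
import numpy as np

def vmd_update_mode(f_hat, u_hat, lam_hat, omega, freqs, k, alpha):
    """One update of mode k (Eq. 3) and of its center frequency (Eq. 4).

    f_hat   : (N,) spectrum of the signal on the grid `freqs` (freqs >= 0)
    u_hat   : (K, N) current mode spectra
    lam_hat : (N,) spectrum of the Lagrange multiplier
    omega   : (K,) current center frequencies
    """
    # Sum of all other modes, i != k
    others = u_hat.sum(axis=0) - u_hat[k]
    # Eq. (3): filter the remaining spectrum around the current center frequency
    denom = 1.0 + 2.0 * alpha * (freqs - omega[k]) ** 2
    u_k = (f_hat - others + lam_hat / 2.0) / denom
    # Eq. (4): power-weighted mean frequency (integrals replaced by sums)
    power = np.abs(u_k) ** 2
    omega_k = np.sum(freqs * power) / np.sum(power)
    return u_k, omega_k
```

A larger `alpha` narrows the band kept around `omega[k]`, which is why the penalty factor controls the bandwidth of each mode.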

2.2. Principles of XGBoost

XGBoost is an improved implementation of the gradient boosting decision tree (GBDT) algorithm [24,25]. GBDT performs classification or regression by integrating multiple weak learners into a strong ensemble learner. The weak learners integrated into XGBoost are typically CART (Classification and Regression Tree) regression trees [26], and its ensemble model is expressed as follows:

{\hat{y}}_{i} = \sum_{k = 1}^{n} f_{k} (x_{i}), f_{k} \in F

(5)

where

${\hat{y}}_{i}$ represents the predicted value for the $i$ -th sample;
$f_{k} (\cdot)$ denotes the functional relationship between the $k$ -th tree structure and leaf weights;
$n$ represents the total number of CART regression trees;
$x_{i}$ denotes the feature vector of $i$ -th sample;
$F$ represents the space of all possible CART regression trees.

After the model training is completed, the final prediction for a given sample is obtained by aggregating the outputs of all individual trees. The iterative accumulation process for the inference stage is detailed in Algorithm 1.

Algorithm 1. Prediction Iteration Process for XGBoost

Input: Input features $x_{i}$, number of trees $n$, and trained learners $\{f_{1}, f_{2}, \dots, f_{n}\}$

Output: Final prediction value ${\hat{y}}_{i}^{(n)}$

1: Initialize prediction value: ${\hat{y}}_{i}^{(0)} \leftarrow 0$

2: for $k = 1$ to $n$ do

3: Compute learner output: $\mathrm{current\_output} \leftarrow f_{k} (x_{i})$

4: ${\hat{y}}_{i}^{(k)} \leftarrow {\hat{y}}_{i}^{(k - 1)} + \mathrm{current\_output}$

5: end for

6: return ${\hat{y}}_{i}^{(n)}$

Among these,

${\hat{y}}_{i}^{(k)}$ represents the predicted value after $k$ training iterations;
${\hat{y}}_{i}^{(k - 1)}$ denotes the cumulative prediction retained from the previous iteration;
$f_{k} (x_{i})$ signifies the newly added base learner at the $k$ -th iteration.
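The accumulation in Algorithm 1 can be sketched in a few lines of Python. Here each trained tree is represented by a plain callable, which is a simplification assumed for illustration; library implementations typically also add a base score to the sum.

```python
from typing import Callable, Sequence

def boosted_predict(x, learners: Sequence[Callable]) -> float:
    """Accumulate the outputs of the trained learners for one sample (Algorithm 1)."""
    y_hat = 0.0                 # step 1: initial prediction
    for f_k in learners:        # steps 2-5: add each learner's output in turn
        y_hat += f_k(x)
    return y_hat                # step 6: final prediction

# Toy stand-ins for three trained trees (hypothetical)
toy_learners = [lambda x: 0.5 * x[0], lambda x: 1.0, lambda x: -0.25 * x[1]]
prediction = boosted_predict((2.0, 4.0), toy_learners)
```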

To obtain the optimal functional form of each $f_{k}$ during the training phase, XGBoost approximates the loss function by a second-order Taylor expansion. The objective function is formulated as

\begin{cases} L_{\mathrm{loss}} = \sum_{i = 1}^{n} ϕ (y_{i}, {\hat{y}}_{i}^{(k)}) + \sum_{j = 1}^{k} Ω (f_{j}) \\ Ω (f) = γ T + \frac{1}{2} λ {‖ω‖}^{2} \end{cases}

(6)

In Equation (6), $ϕ$ is the loss function, representing the error between the actual value and the predicted value; $Ω (f)$ is the regularization term, where $γ$ and $λ$ are penalty coefficients, $T$ is the number of leaf nodes, and $ω$ is the vector of leaf weights. The regularization term primarily serves to prevent model overfitting.

Performing a second-order Taylor expansion at $f_{k} = 0$ yields an objective function that can be approximated as

τ^{(k)} ≃ \sum_{i = 1}^{n} [ϕ (y_{i}, {\hat{y}}_{i}^{(k - 1)}) + g_{i} f_{k} (x_{i}) + \frac{1}{2} h_{i} f_{k}^{2} (x_{i})] + Ω (f_{k})

(7)

where

$g_{i} = \frac{𝜕}{𝜕 {\hat{y}}_{i}^{(k - 1)}} ϕ (y_{i}, {\hat{y}}_{i}^{(k - 1)})$ denotes the first derivative of the loss function;
$h_{i} = \frac{𝜕^{2}}{𝜕 {({\hat{y}}_{i}^{(k - 1)})}^{2}} ϕ (y_{i}, {\hat{y}}_{i}^{(k - 1)})$ represents the second derivative of the loss function.

Let $q (x_{i})$ denote the leaf to which sample $i$ is assigned and $I_{j} = \{i : q (x_{i}) = j\}$ the set of samples in leaf $j$. Dropping the constant term $ϕ (y_{i}, {\hat{y}}_{i}^{(k - 1)})$ from Equation (7) and grouping all samples that belong to the same leaf node gives

\begin{array}{l} L_{\mathrm{loss}} ≃ \sum_{i = 1}^{n} \left[ g_{i} f_{k} (x_{i}) + \frac{1}{2} h_{i} f_{k}^{2} (x_{i}) \right] + Ω (f_{k}) \\ = \sum_{i = 1}^{n} \left[ g_{i} ω_{q (x_{i})} + \frac{1}{2} h_{i} ω_{q (x_{i})}^{2} \right] + γ T + \frac{1}{2} λ \sum_{j = 1}^{T} ω_{j}^{2} \\ = \sum_{j = 1}^{T} \left[ \left( \sum_{i \in I_{j}} g_{i} \right) ω_{j} + \frac{1}{2} \left( \sum_{i \in I_{j}} h_{i} + λ \right) ω_{j}^{2} \right] + γ T \end{array}

(8)

Define $G_{j} = \sum_{i \in I_{j}} g_{i}$ and $H_{j} = \sum_{i \in I_{j}} h_{i}$. After taking the derivative with respect to each leaf weight $ω_{j}$ and setting it to zero, the minimum value of the regularized objective can be obtained as

L_{\mathrm{loss}} ≅ - \frac{1}{2} \sum_{j = 1}^{T} \frac{G_{j}^{2}}{H_{j} + λ} + γ T

(9)
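For completeness, the step from Equation (8) to Equation (9) can be written out for a single leaf. The symbol $ω_{j}^{*}$ for the optimal leaf weight is notation introduced here; each leaf contributes an independent quadratic in $ω_{j}$, so it can be minimized separately.

```latex
\frac{\partial}{\partial \omega_j}
\left[ G_j \omega_j + \tfrac{1}{2} (H_j + \lambda)\, \omega_j^{2} \right]
  = G_j + (H_j + \lambda)\, \omega_j = 0
\;\Longrightarrow\;
\omega_j^{*} = -\frac{G_j}{H_j + \lambda},
\qquad
G_j \omega_j^{*} + \tfrac{1}{2} (H_j + \lambda)\, (\omega_j^{*})^{2}
  = -\frac{1}{2}\, \frac{G_j^{2}}{H_j + \lambda}.
```

Summing the last expression over the $T$ leaves and adding $γ T$ gives Equation (9).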

2.3. BiLSTM Module

2.3.1. LSTM

Long Short-Term Memory (LSTM) is an improved variant of Recurrent Neural Networks (RNNs), which mitigates the gradient explosion and vanishing gradient problems that can arise in traditional RNN architectures [27,28]. As shown in Figure 1, LSTM primarily consists of an input gate, an output gate, and a forget gate.

Its structural formula is as follows:

\begin{cases} f_{t} = σ (W_{f} [h_{t - 1}, x_{t}] + b_{f}) \\ i_{t} = σ (W_{i} [h_{t - 1}, x_{t}] + b_{i}) \\ {\bar{C}}_{t} = \tanh (W_{c} [h_{t - 1}, x_{t}] + b_{c}) \\ C_{t} = f_{t} ⊙ C_{t - 1} + i_{t} ⊙ {\bar{C}}_{t} \\ o_{t} = σ (W_{o} [h_{t - 1}, x_{t}] + b_{o}) \\ h_{t} = o_{t} ⊙ \tanh (C_{t}) \end{cases}

(10)

where

$f_{t}$ denotes the forget gate, which determines the proportion of information from the previous state to be discarded;
$σ$ represents the sigmoid activation function;
$W_{f}$ , $W_{i}$ , $W_{c}$ , $W_{o}$ are the weight matrices corresponding to the forget gate, input gate, candidate cell state, and output gate, respectively;
$x_{t}$ is the input value at the current time step;
$b_{f}$ , $b_{i}$ , $b_{c}$ , $b_{o}$ signify the bias vectors for each respective component;
$i_{t}$ is the input gate, regulating the flow of new information into the cell state;
${\bar{C}}_{t}$ is the candidate cell state value;
$\tanh$ is the hyperbolic tangent activation function;
$C_{t}$ is the cell state value of the current hidden layer;
$C_{t - 1}$ is the cell state value of the hidden layer in the previous state;
$o_{t}$ is the output gate, which controls the amount of information to be transferred from the cell state to the hidden state;
$h_{t}$ is the hidden-layer output, representing the final information output of the LSTM cell at the current step.
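To make the gate equations concrete, the following NumPy sketch evaluates one LSTM step. It assumes that each weight matrix acts on the concatenation of $h_{t - 1}$ and $x_{t}$; the dictionary layout for the weights and biases is a hypothetical convenience, not something specified in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step (Eq. 10).

    W[g] has shape (hidden, hidden + input) and b[g] has shape (hidden,)
    for g in {"f", "i", "c", "o"}.
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])       # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])       # input gate
    c_bar = np.tanh(W["c"] @ z + b["c"])     # candidate cell state
    c_t = f_t * c_prev + i_t * c_bar         # new cell state
    o_t = sigmoid(W["o"] @ z + b["o"])       # output gate
    h_t = o_t * np.tanh(c_t)                 # hidden-layer output
    return h_t, c_t
```

With all weights and biases set to zero, every gate equals 0.5 and the candidate state is zero, so the cell state is simply halved at each step; this is a convenient sanity check for an implementation.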

2.3.2. BiLSTM

A BiLSTM network combines a forward LSTM and a backward LSTM, which partially addresses the limitation that a standard LSTM processes a sequence in only one temporal direction. Features are thus extracted from both the forward and backward directions of the input sequence, which can enhance the model’s prediction accuracy [29]. The BiLSTM architecture is illustrated in Figure 2.

3. Research Methods and Model Architecture

3.1. Model Architecture

The overall architecture of the proposed VMD–BayesXGB–BiLSTM forecasting model is shown in Figure 3. The framework consists of four main stages: data preprocessing, VMD-based feature construction, BayesXGB forecasting, and BiLSTM-based residual correction.

First, the raw electricity load data and explanatory variables are preprocessed. The preprocessing procedure includes abnormal data detection, missing-value treatment, interpolation, and feature standardization. After preprocessing, the VMD parameters are selected using a TSCV strategy. The selected VMD parameters are then used to generate multi-scale IMFs from the historical load sequence. To avoid the use of future information, VMD feature construction is performed based on historical input windows. In other words, for each prediction sample, only the historical observations before the target time step are used for decomposition, while the target point and future load values are excluded.

Window-based samples are constructed for the primary BayesXGB forecasting stage using historical observations before the target time step. The input feature vector of BayesXGB consists of historical load-window features, VMD-derived multi-scale load features, and external explanatory variables. Based on these features, the BayesXGB model generates the preliminary one-step-ahead load prediction.

After the preliminary prediction is obtained, a separate historical sequence of BayesXGB predictions and residuals is constructed for the BiLSTM residual correction module. Therefore, the forecasting window used for the primary BayesXGB model and the sequence length used for BiLSTM residual correction refer to two different stages of the proposed framework. Their specific values are reported in the experimental settings.

The BiLSTM residual correction module then refines the preliminary forecast. Let $y_{t}$ denote the actual load value at time step $t$, and let ${\hat{y}}_{\mathrm{BayesXGB}, t}$ denote the preliminary prediction generated by the BayesXGB model. The residual between the actual load and the BayesXGB prediction is defined as

e_{t} = y_{t} - {\hat{y}}_{\mathrm{BayesXGB}, t}

(11)

The residual sequence may still contain nonlinear and time-dependent patterns that are not fully captured by the primary BayesXGB model. Therefore, the BiLSTM correction module is used to learn the temporal characteristics of the residuals. For each time step t, the historical BayesXGB prediction sequence and the corresponding historical residual sequence are constructed as

[{\hat{y}}_{\mathrm{BayesXGB}, t - L}, \dots, {\hat{y}}_{\mathrm{BayesXGB}, t - 1}]

(12)

and

[e_{t - L}, \dots, e_{t - 1}]

(13)

where $L$ denotes the residual correction sequence length. These two sequences are combined as the input features of the BiLSTM residual correction module. Therefore, the input tensor of the BiLSTM model is organized as $N \times L \times d$, where $N$ is the batch size, $L$ is the sequence length, and $d$ is the feature dimension.

No teacher forcing is used in the BiLSTM residual correction module. During both training and testing, the residual correction at time step $t$ is estimated only from historical BayesXGB predictions and historical residuals before $t$. Therefore, the target load value and future residual information are not used to estimate the residual correction for the current prediction step.

The BiLSTM correction model outputs the predicted residual correction term ${\hat{e}}_{t}$. The final corrected forecast is obtained by adding the predicted residual correction to the preliminary BayesXGB prediction:

{\hat{y}}_{\mathrm{final}, t} = {\hat{y}}_{\mathrm{BayesXGB}, t} + {\hat{e}}_{t}

(14)

In this way, the BiLSTM module provides an additive correction to the BayesXGB preliminary forecast. Only historical BayesXGB predictions and historical residuals from previous time steps are used to estimate the residual correction at time step $t$.

To further clarify the residual correction process, Figure 4 illustrates the data flow from the BayesXGB preliminary prediction to the final BiLSTM-corrected forecast.
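The construction of the correction inputs in Equations (11) to (14) can be sketched as follows. The function names are hypothetical, and the default length of 48 is taken from the BiLSTM settings reported later in the paper; the sketch only builds the tensors and does not train a network.

```python
import numpy as np

def build_residual_windows(y_true, y_xgb, L=48):
    """Return inputs of shape (N, L, 2) and target residuals e_t of shape (N,)."""
    y_true = np.asarray(y_true, dtype=float)
    y_xgb = np.asarray(y_xgb, dtype=float)
    e = y_true - y_xgb                                  # Eq. (11)
    X, target = [], []
    for t in range(L, len(e)):
        # Only steps t-L .. t-1 are used, so e_t itself never enters the input
        X.append(np.stack([y_xgb[t - L:t], e[t - L:t]], axis=-1))   # Eqs. (12)-(13)
        target.append(e[t])
    return np.asarray(X), np.asarray(target)

def corrected_forecast(y_xgb_t, e_hat_t):
    """Eq. (14): add the predicted residual to the preliminary forecast."""
    return y_xgb_t + e_hat_t
```

Each sample pairs the previous L preliminary forecasts with the previous L residuals, which gives the feature dimension d = 2.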

3.2. Model Evaluation Metrics

This paper employs four commonly used evaluation metrics in load forecasting tasks to assess the model’s predictive performance. These metrics are Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and $R^{2}$.

The specific calculation formulas for each metric are as follows:

\begin{cases} \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{\mathrm{true}, i} - y_{\mathrm{pred}, i})}^{2}} \\ \mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |y_{\mathrm{true}, i} - y_{\mathrm{pred}, i}| \\ \mathrm{MAPE} = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{y_{\mathrm{true}, i} - y_{\mathrm{pred}, i}}{y_{\mathrm{true}, i}} \right| \times 100 \% \\ R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{\mathrm{true}, i} - y_{\mathrm{pred}, i})}^{2}}{\sum_{i = 1}^{n} {(y_{\mathrm{true}, i} - {\bar{y}}_{\mathrm{true}})}^{2}} \end{cases}

(15)

where

$y_{\mathrm{true}}$ denotes the actual load value;
$y_{\mathrm{pred}}$ denotes the forecasted load value;
$n$ denotes the sample size;
${\bar{y}}_{\mathrm{true}}$ denotes the mean of the true load values.
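A minimal NumPy sketch of the four metrics in Equation (15) is given below. The function name and the dictionary return type are choices made here for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE (in percent) and R^2 as defined in Eq. (15)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0    # undefined if any y_true is 0
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}

metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
```

In this toy case every error has magnitude 10, so RMSE and MAE are both 10, while MAPE weights the same absolute error more heavily at low load values.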

4. Experimental Setup

4.1. Dataset Description and Preprocessing

The experiments were conducted using Python 3.7. A publicly available Australian electricity load dataset was used for simulation analysis. The dataset was obtained from the Australian Electricity Load and Price Forecasting Dataset repository, covering the period from 00:30 on 1 January 2006 to 00:00 on 1 January 2011, with a 30 min sampling interval. In addition to the electricity load, five explanatory variables were included: dry-bulb temperature, dew-point temperature, wet-bulb temperature, humidity, and electricity price. Figure 5 presents the time series plot of the electricity load for the Australian dataset.

As shown in Figure 5, the electricity load exhibits clear periodic fluctuations and seasonal variation. This indicates that electricity demand is affected by temporal patterns and meteorological factors, which increases the difficulty of short-term load forecasting.

Before model training, the raw dataset was processed through four main steps: outlier detection, outlier handling, feature standardization, and VMD-based feature construction. Sections 4.1.1 to 4.1.4 describe these preprocessing steps in detail. After preprocessing, supervised learning samples were constructed using a sliding window. The input window length for the primary forecasting model was set to 24, corresponding to 12 h of historical observations, and the forecasting horizon was set to one step, corresponding to a 30 min-ahead prediction.

After window-sample construction, all samples were divided chronologically into training and testing sets. The first 80% of the samples were used as the training set, and the remaining 20% were used as the testing set. In total, 87,624 supervised samples were constructed, including 70,099 training samples and 17,525 testing samples. The training set was used for model fitting and parameter selection, while the testing set was reserved only for final performance evaluation.
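The sample counts above can be reproduced with a short sliding-window sketch. It assumes a series of 87,648 half-hourly points, which is what the stated period and sampling interval imply (1826 days × 48), and it windows only the load series; the paper's actual feature vector also contains VMD-derived features and external variables. The function names are hypothetical.

```python
import numpy as np

def make_windows(load, window=24):
    """X[i] holds `window` past values; y[i] is the next value (one step ahead)."""
    load = np.asarray(load, dtype=float)
    n = len(load) - window
    X = np.stack([load[i:i + window] for i in range(n)])
    y = load[window:]
    return X, y

def chronological_split(X, y, train_ratio=0.8):
    """Split without shuffling so the test set lies strictly after the training set."""
    n_train = int(len(X) * train_ratio)
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])

# Placeholder series with the assumed length of the dataset
load = np.arange(87648, dtype=float)
X, y = make_windows(load, window=24)
(train_X, train_y), (test_X, test_y) = chronological_split(X, y)
```

With a 24-step window, 87,648 observations yield 87,624 samples, and the 80/20 chronological split gives 70,099 training and 17,525 testing samples, matching the figures reported in the text.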

For model tuning, validation was performed on the training set rather than using the testing set. The detailed validation strategies and hyperparameter settings for VMD, BayesXGB, and BiLSTM are reported in Section 4.2.

4.1.1. Outlier Detection

To mitigate the impact of dirty data caused by sensor failures, this paper employs the $3 σ$ rule to identify outliers. This outlier detection method is based on the statistical properties of the normal distribution.

Assuming the data follow a normal distribution $N (μ, σ^{2})$, the data exhibit the following probabilistic pattern:

Approximately 68.27% of the data falls within the interval $[μ - σ, μ + σ]$;
Approximately 95.45% of the data falls within the interval $[μ - 2 σ, μ + 2 σ]$;
Approximately 99.73% of the data falls within the interval $[μ - 3 σ, μ + 3 σ]$.

Since the $3 σ$ range encompasses nearly all (99.73%) of normally distributed data, a value falls outside $[μ - 3 σ, μ + 3 σ]$ with a probability of only about 0.27%. The $3 σ$ criterion therefore treats such values as likely outliers.

4.1.2. Outlier Handling

After outlier detection, abnormal values were replaced using cubic spline interpolation. This method was adopted because it can preserve the local trend and smoothness of the time series better than simple mean imputation or linear interpolation.

For cubic spline interpolation, the original data sequence is divided into several intervals, and each interval is represented by a cubic polynomial:

S_{i} (x) = a_{i} x^{3} + b_{i} x^{2} + c_{i} x + d_{i}

where $i$ denotes the interval index, and $a_{i}$, $b_{i}$, $c_{i}$, $d_{i}$ are the polynomial coefficients for the $i$-th interval.

Continuity Constraints: Polynomials in adjacent intervals must satisfy the following at nodes:

Function value continuity:

S_{i} (x_{i + 1}) = S_{i + 1} (x_{i + 1})

First derivative continuity:

S_{i}^{\prime} (x_{i + 1}) = S_{i + 1}^{\prime} (x_{i + 1})

Second derivative continuity:

S_{i}^{\prime\prime} (x_{i + 1}) = S_{i + 1}^{\prime\prime} (x_{i + 1})

Boundary conditions: To ensure a unique solution, boundary constraints must be specified.

After applying the $3 σ$-based outlier detection procedure, abnormal values were detected and corrected by cubic spline interpolation. Specifically, 195 outliers were detected in dry-bulb temperature, 55 in dew-point temperature, 217 in humidity, 174 in electricity price, and 124 in electricity load, corresponding to outlier ratios of 0.2225%, 0.0628%, 0.2476%, 0.1985%, and 0.1415%, respectively. No outliers were detected in wet-bulb temperature. Since the outlier ratios for all variables were below 0.25%, the preprocessing step mainly corrected abnormal local observations while preserving the overall structure of the original dataset.
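One possible way to combine the $3 σ$ rule with cubic spline replacement is sketched below using SciPy's `CubicSpline`. Two choices here are assumptions rather than details from the paper: the mean and standard deviation are computed over the whole array passed in, and the spline uses SciPy's default not-a-knot boundary condition.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def three_sigma_mask(x):
    """True where a value lies outside [mu - 3*sigma, mu + 3*sigma]."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > 3.0 * sigma

def replace_outliers_with_spline(x):
    """Replace flagged points by a cubic spline fitted through the remaining points."""
    x = np.asarray(x, dtype=float)
    mask = three_sigma_mask(x)
    idx = np.arange(len(x))
    spline = CubicSpline(idx[~mask], x[~mask])   # default bc_type is 'not-a-knot'
    cleaned = x.copy()
    cleaned[mask] = spline(idx[mask])
    return cleaned, mask

# Toy series: a smooth daily-like wave with one injected spike
series = 100.0 + 10.0 * np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
truth_at_50 = series[50]
series[50] = 1000.0
cleaned, mask = replace_outliers_with_spline(series)
```

Because the spline is fitted only through unflagged points, the replaced value follows the local curvature of its neighbors, which is the property the text contrasts with mean or linear imputation.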

4.1.3. Feature Standardization

To eliminate the influence of different feature scales on model training, feature standardization was applied before model fitting. The mean and standard deviation were calculated from the training set and then applied to both the training and testing sets. This procedure ensures that the testing data are transformed using only information obtained from the training set.

The standardization formula (Z-score normalization) is expressed as

x_{s c a l e d} = \frac{x - μ}{σ}

(16)

where $x$ represents the original feature value; $μ$ denotes the mean of the feature across the training set; $σ$ indicates the standard deviation of the feature across the training set; and $x_{\mathrm{scaled}}$ signifies the standardized feature value.
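The train-only fitting of Equation (16) can be sketched as follows. The function names are hypothetical, and the guard for constant features is a safeguard added here, not something described in the paper.

```python
import numpy as np

def fit_standardizer(X_train):
    """Estimate per-feature mean and standard deviation from the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # added safeguard for constant features
    return mu, sigma

def standardize(X, mu, sigma):
    """Eq. (16), applied with statistics that were fitted on the training set."""
    return (X - mu) / sigma

X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[5.0, 20.0]])
mu, sigma = fit_standardizer(X_train)
X_train_s = standardize(X_train, mu, sigma)
X_test_s = standardize(X_test, mu, sigma)
```

The test rows are transformed with the training statistics, so standardized test values are not constrained to zero mean or unit variance.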

4.1.4. VMD Feature Construction

VMD was used to extract multi-scale load features from the historical input windows. The VMD parameters were determined using only the training set, with the detailed selection procedure described in Section 4.2.1. For each prediction sample, VMD was applied only to the historical load window before the target time step, so the target value and future observations were not involved in feature construction.

Figure 6 shows the VMD results of a representative load segment from the training set.

4.2. Hyperparameter Settings

The main experimental settings include VMD parameter selection, Bayesian optimization of XGBoost hyperparameters for constructing BayesXGB, and the configuration of the BiLSTM residual correction module. All parameter tuning procedures were performed using only the training set, while the testing set was used only for final performance evaluation.

4.2.1. VMD Parameter Selection

For VMD, the number of modes $K$ and the penalty factor $α$ were selected only within the training sequence. To reduce computational cost while preserving temporal order, the last 3000 consecutive observations from the training sequence were used as the parameter selection sample. The number of modes $K$ was searched from 3 to 14, and the penalty factor $α$ was selected from {1000, 2000, 3000, 4000, 5000, 6000, 7000}.

For each candidate pair of $K$ and $α$, three-fold time-series cross-validation was applied. The validation window length of each fold was 750 observations. The reconstruction mean squared error was calculated in each validation fold, and the average reconstruction MSE across the three folds was used as the final selection criterion. The mathematical procedure of the TSCV-based VMD parameter selection is summarized in Algorithm 2.

Algorithm 2. Time-series cross-validation-based VMD parameter selection

Input: Training load sequence $X_{\mathrm{train}} = \{x_{1}, x_{2}, \dots, x_{T}\}$, sampling length $S = 3000$, candidate mode number set $K_{c} = \{3, 4, \dots, 14\}$, candidate penalty factor set $α_{c} = \{1000, 2000, \dots, 7000\}$, and number of time-series cross-validation folds $F = 3$.

Output: Optimal VMD parameter pair $(K^{*}, α^{*})$.

1: Select the last $S$ consecutive observations from the training load sequence as the VMD parameter selection sample: $X_{s} = \{x_{T - S + 1}, x_{T - S + 2}, \dots, x_{T}\}$

2: Split $X_{s}$ into $F = 3$ time-series cross-validation folds while preserving chronological order, $X_{s} = \{V_{1}, V_{2}, V_{3}\}$, where each validation fold $V_{f}$ contains $N_{f} = 750$ observations.

3: For each candidate parameter pair $(K, α) \in K_{c} \times α_{c}$, apply VMD to $V_{f}$, reconstruct the validation signal by summing all decomposed modes and the residual component, and compute the fold-wise reconstruction mean squared error: ${\mathrm{MSE}}_{f} (K, α) = \frac{1}{N_{f}} \sum_{i = 1}^{N_{f}} {(x_{i, f} - {\hat{x}}_{i, f} (K, α))}^{2}$

4: Compute the average reconstruction error across all folds: $\overline{\mathrm{MSE}} (K, α) = \frac{1}{F} \sum_{f = 1}^{F} {\mathrm{MSE}}_{f} (K, α)$

5: Select the parameter pair with the minimum average reconstruction error: $(K^{*}, α^{*}) = \arg \min_{K \in K_{c}, α \in α_{c}} \overline{\mathrm{MSE}} (K, α)$

6: Use the selected $K^{*} = 14$ and $α^{*} = 1000$ for leakage-free VMD feature construction, where VMD is applied only to the historical input window of each prediction sample.
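The grid search in Algorithm 2 can be sketched as below. Several points are assumptions made for illustration: the decomposition routine is passed in as a hypothetical callable `decompose(signal, K, alpha)` returning an array of modes, the folds are taken as the last F consecutive blocks of N_f points within the selection sample, and the reconstruction sums the modes only. The toy decomposer exists solely to make the sketch runnable and is not VMD.

```python
import itertools
import numpy as np

def select_vmd_params(x_train, decompose, K_grid, alpha_grid, S=3000, F=3, N_f=750):
    """Grid search for (K, alpha) by average reconstruction MSE over F chronological folds."""
    x_s = np.asarray(x_train, dtype=float)[-S:]          # step 1: last S observations
    n = len(x_s)
    # Step 2 (assumed layout): the last F * N_f points, cut into F consecutive folds
    folds = [x_s[n - (F - f) * N_f: n - (F - f - 1) * N_f] for f in range(F)]
    best, best_mse = None, np.inf
    for K, alpha in itertools.product(K_grid, alpha_grid):
        # Steps 3-4: fold-wise reconstruction MSE, then the average over folds
        mses = [np.mean((v - decompose(v, K, alpha).sum(axis=0)) ** 2) for v in folds]
        avg = float(np.mean(mses))
        if avg < best_mse:                               # step 5: arg min
            best, best_mse = (K, alpha), avg
    return best, best_mse

def toy_decompose(v, K, alpha):
    """Stand-in decomposer (not VMD): K equal modes whose sum slightly under-reconstructs v."""
    return np.tile(v / (K + alpha * 1e-3), (K, 1))

signal = np.arange(4000, dtype=float) + 1.0
best, best_mse = select_vmd_params(signal, toy_decompose, [3, 4], [1000, 2000])
```

In practice `decompose` would wrap an actual VMD implementation; only the selection logic is illustrated here.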

As shown in Table 1, $K = 14$ and $α = 1000$ were selected as the optimal parameter pair in all three validation folds. The average reconstruction MSE was 388.5164, and the VMD parameter search time was 48.2403 s. These fold-wise results indicate that the selected VMD parameters were stable under the time-series cross-validation scheme.

The reconstruction MSE was used because VMD serves as an unsupervised feature construction step before forecasting. This criterion was adopted to select a decomposition setting that preserves the main information of the original load sequence while generating multi-scale components. The forecasting contribution of the selected VMD features was further evaluated through the ablation study in Section 4.3.

4.2.2. BayesXGB Hyperparameter Optimization

The hyperparameters of XGBoost were optimized using Bayesian optimization. During the optimization process, five-fold time-series cross-validation was performed, and MAE was used as the optimization objective. The optimization objective is defined as

\mathrm{MAE} = \frac{1}{N} \sum_{i = 1}^{N} | y_{i} - {\hat{y}}_{i} |

(17)

where $y_{i}$ represents the actual electricity load value; ${\hat{y}}_{i}$ denotes the predicted load value generated by XGBoost; and $N$ denotes the number of validation samples in each time-series cross-validation fold. The Bayesian optimization process searches for the hyperparameter combination that minimizes the cross-validation MAE. The optimized XGBoost hyperparameters are shown in Table 2.
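The quantity minimized by the Bayesian optimizer can be sketched as a function that returns the average validation MAE over time-ordered folds. The expanding-window fold layout and the `make_model` factory are assumptions made for illustration; the paper does not state which fold scheme or optimization library was used.

```python
import numpy as np

def time_series_folds(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs with an expanding training window (assumed scheme)."""
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_idx = np.arange(0, k * fold)
        val_idx = np.arange(k * fold, (k + 1) * fold)
        yield train_idx, val_idx

def cv_mae(make_model, X, y, n_splits=5):
    """Average validation MAE (Eq. 17) over the folds; the value to be minimized."""
    scores = []
    for tr, va in time_series_folds(len(X), n_splits):
        model = make_model()                 # fresh model for every fold
        model.fit(X[tr], y[tr])
        scores.append(np.mean(np.abs(y[va] - model.predict(X[va]))))
    return float(np.mean(scores))
```

Here `make_model` is any zero-argument callable returning an object with `fit` and `predict`, for example a function that builds an XGBoost regressor from one candidate hyperparameter set. The optimizer would call `cv_mae` once per candidate and keep the set with the lowest value.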

4.2.3. BiLSTM Residual Correction Settings

The BiLSTM residual correction module was configured as a stacked four-layer bidirectional LSTM structure. Each BiLSTM layer contains both a forward LSTM and a backward LSTM, and the output sequence of each layer is passed to the next BiLSTM layer. No residual skip connection is used between BiLSTM layers.

The BiLSTM residual correction module used a sequence length of 48, corresponding to one day of half-hour historical information. This differs from the 24-step input window used for the primary BayesXGB forecasting stage. The 24-step window was used to construct the first-stage forecasting samples, whereas the 48-step sequence was used to construct historical BayesXGB prediction and residual sequences for residual correction.

The input features of the BiLSTM residual correction module include historical BayesXGB predictions and their corresponding historical residuals. Therefore, the input tensor can be represented as $N \times 48 \times 2$, where $N$ is the batch size, 48 is the sequence length, and 2 denotes the feature dimension.
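The construction of such input windows can be sketched as follows. This is an illustrative sketch only: the residual is assumed to be the actual load minus the BayesXGB prediction, the correction target is assumed to be the residual at the step that follows each window, and the names `build_residual_windows` and `corrected_forecast` are hypothetical.

```python
def build_residual_windows(actual, preliminary, seq_len=48):
    """Build BiLSTM inputs of shape (N, seq_len, 2) and residual targets.

    Each window holds [prediction, residual] pairs for the seq_len steps that
    precede the target step, so no information from the target step is used.
    """
    if len(actual) != len(preliminary):
        raise ValueError("actual and preliminary must have the same length")
    # Assumed sign convention: residual = actual - preliminary forecast.
    residuals = [a - p for a, p in zip(actual, preliminary)]
    windows, targets = [], []
    for t in range(seq_len, len(actual)):
        windows.append(
            [[preliminary[i], residuals[i]] for i in range(t - seq_len, t)]
        )
        targets.append(residuals[t])
    return windows, targets


def corrected_forecast(preliminary_value, predicted_residual):
    """Final forecast: preliminary forecast plus the predicted residual."""
    return preliminary_value + predicted_residual


# Toy data (hypothetical): 60 half-hourly steps with a constant error of 1.0.
toy_actual = [100.0 + i for i in range(60)]
toy_preliminary = [value - 1.0 for value in toy_actual]
toy_windows, toy_targets = build_residual_windows(toy_actual, toy_preliminary)
print(len(toy_windows), len(toy_windows[0]), len(toy_windows[0][0]))
```

With 60 steps and a sequence length of 48, the sketch yields 12 windows, each of size 48 × 2, which matches the $N \times 48 \times 2$ layout described above.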

The detailed BiLSTM parameter settings are shown in Table 3.

4.3. Ablation Study

To evaluate the contribution of each component in the proposed framework, four ablation models were constructed: BayesXGB, VMD–BayesXGB, BayesXGB–BiLSTM, and VMD–BayesXGB–BiLSTM. Here, BayesXGB denotes the Bayesian-optimized XGBoost regressor. The same optimized BayesXGB hyperparameter setting was used for all BayesXGB-based ablation models to ensure a consistent comparison. BayesXGB was used as the baseline model; VMD–BayesXGB introduced the VMD-based feature construction module; BayesXGB–BiLSTM introduced the BiLSTM-based residual correction module; and VMD–BayesXGB–BiLSTM represented the complete proposed model.

As shown in Table 4, introducing VMD-based feature construction substantially improved the baseline BayesXGB model. Compared with BayesXGB, VMD–BayesXGB reduced RMSE, MAE, and MAPE by 53.33%, 52.85%, and 53.35%, respectively, while increasing $R^{2}$ from 0.9495 to 0.9890. This indicates that VMD-derived multi-scale features can effectively reduce the complexity of the original load sequence and improve the forecasting capability of BayesXGB.

The BiLSTM residual correction module also contributed to forecasting improvement. Compared with BayesXGB, BayesXGB–BiLSTM reduced RMSE, MAE, and MAPE by 39.27%, 44.77%, and 45.66%, respectively. This suggests that the residual sequence after the primary BayesXGB forecasting stage still contains useful temporal information that can be learned by the BiLSTM correction module.

The complete VMD–BayesXGB–BiLSTM model achieved the best overall performance among the ablation models, with an RMSE of 122.1003, an MAE of 90.7386, a MAPE of 1.0269%, and an $R^{2}$ of 0.9921. Compared with VMD–BayesXGB, the complete model further reduced RMSE, MAE, and MAPE by 15.14%, 16.72%, and 16.72%, respectively. Compared with the baseline BayesXGB model, the corresponding reductions reached 60.40%, 60.73%, and 61.15%, respectively. These results demonstrate that VMD-based feature construction and BiLSTM-based residual correction provide complementary improvements to the proposed forecasting framework.
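The percentage reductions quoted in this section can be reproduced from Table 4 with the relative-reduction formula $(E_{\mathrm{ref}} - E_{\mathrm{new}}) / E_{\mathrm{ref}} \times 100\%$. The short Python sketch below does this; the dictionary layout and function names are illustrative choices rather than part of the study, and the model names use plain hyphens inside the code.

```python
# Error metrics copied from Table 4 (ablation study).
TABLE_4 = {
    "BayesXGB": {"RMSE": 308.3163, "MAE": 231.0695, "MAPE": 2.6434},
    "VMD-BayesXGB": {"RMSE": 143.8891, "MAE": 108.9592, "MAPE": 1.2331},
    "BayesXGB-BiLSTM": {"RMSE": 187.2390, "MAE": 127.6176, "MAPE": 1.4363},
    "VMD-BayesXGB-BiLSTM": {"RMSE": 122.1003, "MAE": 90.7386, "MAPE": 1.0269},
}


def relative_reduction(reference, candidate):
    """Percentage by which `candidate` lowers an error relative to `reference`."""
    return (reference - candidate) / reference * 100.0


def reductions(reference_model, candidate_model, table=TABLE_4):
    """Relative reductions in RMSE, MAE, and MAPE, rounded to two decimals."""
    return {
        metric: round(
            relative_reduction(
                table[reference_model][metric], table[candidate_model][metric]
            ),
            2,
        )
        for metric in ("RMSE", "MAE", "MAPE")
    }


vmd_vs_base = reductions("BayesXGB", "VMD-BayesXGB")
bilstm_vs_base = reductions("BayesXGB", "BayesXGB-BiLSTM")
full_vs_vmd = reductions("VMD-BayesXGB", "VMD-BayesXGB-BiLSTM")
full_vs_base = reductions("BayesXGB", "VMD-BayesXGB-BiLSTM")
print(vmd_vs_base, bilstm_vs_base, full_vs_vmd, full_vs_base)
```

For example, the RMSE reduction of the complete model relative to BayesXGB is $(308.3163 - 122.1003) / 308.3163 \times 100\% \approx 60.40\%$.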

Figure 7 provides a visual comparison of RMSE, MAE, MAPE, and $R^{2}$ among the ablation models. The complete VMD–BayesXGB–BiLSTM model consistently achieved the lowest RMSE, MAE, and MAPE, as well as the highest $R^{2}$, further confirming the effectiveness of the proposed hybrid structure.

Figure 8 presents the prediction curves of the ablation models on a representative test segment. The predicted curves generally follow the overall trend of the actual load. Although local differences exist among the models, the quantitative results in Table 4 show that VMD–BayesXGB–BiLSTM achieves the lowest overall prediction error on the testing set.

4.4. Comparative Evaluation

To further evaluate the predictive performance of the proposed model, it was compared with several representative baseline models, including LSTM, VMD–LSTM, BayesXGB, VMD–BayesXGB, BiLSTM, VMD–BiLSTM, Attention–BiLSTM, VMD–Attention–BiLSTM, and DWT–LSTM. Here, BayesXGB denotes the XGBoost regressor whose hyperparameters were tuned by Bayesian optimization. The proposed model is denoted as VMD–BayesXGB–BiLSTM.

To ensure consistency, all models were evaluated using the same chronological training-testing split, input window length, forecasting horizon, and evaluation metrics. For the BayesXGB-based models, the same optimized hyperparameter setting was used. The LSTM-, BiLSTM-, and attention-based baseline models were trained using the same optimizer, batch size, maximum number of epochs, and early-stopping strategy. Therefore, the comparative evaluation focuses on the forecasting performance of different model structures under a consistent experimental setting.

The experimental results are shown in Table 5 and Figure 9.

As shown in Table 5, the proposed VMD–BayesXGB–BiLSTM model achieved the best performance among all compared models, with an RMSE of 122.1003, an MAE of 90.7386, a MAPE of 1.0269%, and an $R^{2}$ of 0.9921. These results indicate that the proposed hybrid framework can more accurately capture the nonlinear, non-stationary, and temporal characteristics of electricity load data.

The standalone recurrent neural network models (LSTM, BiLSTM, and Attention–BiLSTM) produced large prediction errors, with RMSE values between 619.7955 and 673.1252 and $R^{2}$ values below 0.80. In contrast, their VMD-based counterparts achieved much better performance, with RMSE values between 137.3648 and 185.1763, indicating that VMD-based feature construction is effective for reducing the complexity of the original load sequence and extracting useful multi-scale information for downstream forecasting models.

Among all baseline models, VMD–LSTM achieved the strongest performance, with an RMSE of 137.3648, an MAE of 104.7501, a MAPE of 1.2105%, and an $R^{2}$ of 0.9899. Nevertheless, the proposed VMD–BayesXGB–BiLSTM model further reduced RMSE, MAE, and MAPE by 11.11%, 13.38%, and 15.17%, respectively, compared with VMD–LSTM. This demonstrates that combining BayesXGB forecasting with BiLSTM residual correction can provide additional improvement beyond VMD-based recurrent forecasting alone.

Compared with VMD–BayesXGB, the proposed model further reduced RMSE from 143.8891 to 122.1003, MAE from 108.9592 to 90.7386, and MAPE from 1.2331% to 1.0269%. The corresponding reductions were 15.14%, 16.72%, and 16.72%, respectively. This improvement demonstrates that the residual sequence after the primary BayesXGB forecasting stage still contains useful temporal information, which can be further modeled by the BiLSTM residual correction module.

The proposed model also outperformed other decomposition-based and hybrid baselines. Compared with VMD–Attention–BiLSTM, it reduced RMSE, MAE, and MAPE by 27.68%, 26.11%, and 25.70%, respectively. Compared with DWT–LSTM, it reduced RMSE, MAE, and MAPE by 58.85%, 62.83%, and 63.05%, respectively. These comparisons show that the proposed framework provides more accurate forecasting performance than both attention-enhanced recurrent models and DWT-based recurrent models on the tested dataset.

Overall, the comparative results demonstrate that the proposed VMD–BayesXGB–BiLSTM model achieves superior predictive performance on the tested Australian electricity load dataset. However, since the experiments were conducted on a single dataset, broader generalizability should be further verified using additional datasets from different regions or power systems in future work.

4.5. Computational Cost Analysis

In addition to forecasting accuracy, computational cost is also an important factor for practical short-term load forecasting applications. Therefore, the training time, total inference time, inference time per sample, and number of trainable neural parameters were compared among all models used in the comparative evaluation. The results are shown in Table 6.

It should be noted that the values in Table 6 mainly report the model training and inference costs after feature construction. In this study, VMD and DWT feature construction were treated as offline preprocessing steps. For leakage-free VMD feature construction, decomposition was applied independently to each historical input window to avoid the use of future information. Therefore, the offline feature generation cost should be considered separately from the online inference cost.
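The per-window strategy can be sketched as follows. The decomposition used here is a simple causal moving-average split that stands in for VMD purely to keep the example self-contained; it is not the method used in the study, and the function names and the choice of the last component value as a feature are likewise assumptions. The window length of 24 follows the input window of the primary forecasting stage.

```python
def moving_average_split(window, span=3):
    """Stand-in decomposer (NOT VMD): causal moving-average trend plus remainder."""
    trend = []
    for i in range(len(window)):
        chunk = window[max(0, i - span + 1): i + 1]
        trend.append(sum(chunk) / len(chunk))
    remainder = [x - t for x, t in zip(window, trend)]
    return [trend, remainder]


def windowed_features(series, window_len=24, decompose=moving_average_split):
    """For each forecast origin t, decompose only series[t - window_len:t].

    Observations at or after t never enter the decomposition, so the
    resulting features cannot leak future information.
    """
    features = []
    for t in range(window_len, len(series) + 1):
        window = series[t - window_len:t]
        components = decompose(window)
        # Hypothetical feature choice: most recent value of each component.
        features.append([component[-1] for component in components])
    return features


# Two toy series that are identical up to index 29 and differ afterwards.
series_a = [float(i % 5) for i in range(40)]
series_b = series_a[:30] + [value + 10.0 for value in series_a[30:]]
features_a = windowed_features(series_a)
features_b = windowed_features(series_b)

# Features for origins t <= 30 depend only on the shared history.
print(features_a[:7] == features_b[:7])
```

Because every window is decomposed separately, the number of decompositions grows with the number of samples, which is why this cost is reported separately from online inference.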

As shown in Table 6, BayesXGB required the lowest computational cost, with a training time of 30.0264 s and an inference time of 0.0021 ms per sample. After introducing VMD-based feature construction, VMD–BayesXGB still maintained a very low inference time of 0.0078 ms per sample, although its training time increased to 379.6089 s. This indicates that VMD-derived features can greatly improve forecasting accuracy while preserving low online inference cost for the BayesXGB-based model.

The recurrent neural network models required higher computational costs because of their trainable neural parameters. In particular, VMD–BiLSTM and Attention–BiLSTM required 4820.4418 s and 5260.5649 s for training, respectively, with inference times above 1 ms per sample. In comparison, the proposed VMD–BayesXGB–BiLSTM model required 3137.0643 s for training and 0.9559 ms per sample for inference. Although the proposed model contains 1,602,049 trainable neural parameters and is more complex than the tree-based models, its online inference time remains below 1 ms per sample.

The additional computational cost of the proposed model mainly comes from the BiLSTM residual correction module. However, this cost is accompanied by clear accuracy improvements. Compared with VMD–BayesXGB, the proposed model reduced RMSE from 143.8891 to 122.1003, MAE from 108.9592 to 90.7386, and MAPE from 1.2331% to 1.0269%, while training time increased from 379.6089 s to 3137.0643 s and inference time from 0.0078 ms to 0.9559 ms per sample. Therefore, the proposed model trades increased computational cost for higher forecasting accuracy.

Figure 10 further illustrates the accuracy–cost trade-off of different forecasting models by comparing RMSE with inference time per sample. In this figure, models located closer to the lower-left region are preferable because they achieve lower prediction error and lower online inference cost simultaneously. BayesXGB and VMD–BayesXGB have the lowest inference time, but their RMSE values are higher than that of the proposed model. Several recurrent neural network baselines require comparable or higher inference time while still producing larger prediction errors. The proposed VMD–BayesXGB–BiLSTM model achieves the lowest RMSE among all compared models while maintaining an inference time below 1 ms per sample, indicating a favorable balance between forecasting accuracy and online inference efficiency.
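One way to make the "lower-left" reading of Figure 10 concrete is to list the models that are not dominated in both RMSE and inference time, i.e., for which no other model is at least as good on both criteria and strictly better on one. The sketch below applies this check to the values in Tables 5 and 6; the non-dominance framing and the function name `pareto_front` are illustrative additions and are not part of the original analysis.

```python
# (RMSE, inference time per sample in ms), copied from Tables 5 and 6.
MODELS = {
    "LSTM": (661.0743, 0.2562),
    "VMD-LSTM": (137.3648, 0.2519),
    "BayesXGB": (308.3163, 0.0021),
    "VMD-BayesXGB": (143.8891, 0.0078),
    "BiLSTM": (673.1252, 0.4965),
    "VMD-BiLSTM": (185.1763, 1.4586),
    "Attention-BiLSTM": (619.7955, 1.3994),
    "VMD-Attention-BiLSTM": (168.8346, 0.5224),
    "DWT-LSTM": (296.7398, 0.2528),
    "VMD-BayesXGB-BiLSTM": (122.1003, 0.9559),
}


def pareto_front(points):
    """Names of models that no other model beats on both RMSE and time."""
    front = []
    for name, (rmse, ms) in points.items():
        dominated = any(
            other_rmse <= rmse
            and other_ms <= ms
            and (other_rmse < rmse or other_ms < ms)
            for other, (other_rmse, other_ms) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(MODELS)
print(front)
```

Under this check, the non-dominated models are BayesXGB, VMD–BayesXGB, VMD–LSTM, and the proposed VMD–BayesXGB–BiLSTM, which is consistent with the observation that the proposed model attains the lowest RMSE at a higher, but still sub-millisecond, inference cost.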

From a practical perspective, VMD–BayesXGB may be more suitable for scenarios with strict computational constraints because it provides a large improvement over BayesXGB while maintaining very low inference time. In contrast, the complete VMD–BayesXGB–BiLSTM model is more suitable for applications where higher forecasting accuracy is prioritized and sufficient computational resources are available. Future work may focus on reducing the computational burden of the residual correction module and the VMD feature construction stage, for example, by using lightweight recurrent structures, more efficient decomposition strategies, or model compression techniques.

5. Conclusions

To improve short-term electricity load forecasting under nonlinear, volatile, and multi-scale load conditions, this study proposed a VMD–BayesXGB–BiLSTM hybrid forecasting model. In the proposed framework, abnormal values were detected using the $3\sigma$ criterion and corrected by cubic spline interpolation, while feature standardization was applied to reduce the influence of different variable scales. VMD parameters were selected using time-series cross-validation within the training sequence, and leakage-free VMD features were generated from historical input windows. BayesXGB was then used as the primary forecasting model, and a BiLSTM residual correction module was introduced to further model temporal patterns in the prediction residuals.

Experimental results on an Australian electricity load dataset show that the proposed model achieved an RMSE of 122.1003, an MAE of 90.7386, a MAPE of 1.0269%, and an $R^{2}$ of 0.9921, outperforming all compared baseline models. The ablation study confirms that both VMD-based feature construction and BiLSTM-based residual correction contribute to the forecasting improvement. Compared with BayesXGB, the complete model reduced RMSE, MAE, and MAPE by 60.40%, 60.73%, and 61.15%, respectively. The comparative evaluation further shows that the proposed model performed better than VMD–LSTM, VMD–BiLSTM, VMD–Attention–BiLSTM, and DWT–LSTM. In addition, the online inference time of the proposed model remained below 1 ms per sample, indicating a practical balance between forecasting accuracy and inference efficiency.

Future work will focus on testing the model on additional load datasets, reducing the computational cost of VMD feature construction and BiLSTM residual correction, and further evaluating statistical robustness through repeated runs and significance testing.

Author Contributions

Conceptualization, T.X. and J.H.; methodology, T.X.; software, J.H.; validation, J.H., X.L., and J.T.; formal analysis, Y.L.; investigation, T.X.; resources, T.X., J.H.; data curation, X.L.; writing—original draft preparation, T.X.; writing—review and editing, J.H.; visualization, Y.L.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Young Academic and Technical Leaders Program of Yunnan Province, China (Grant No. 202305AC160077) and the National Natural Science Foundation of China (Grant No. 62062068).

Data Availability Statement

The data presented in this study are available at https://gitcode.com/Universal-Tool/6f8ba (accessed on 7 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Özdemır, Ş.; Demır, Y.; Yildirim, Ö. The Effect of Input Length on Prediction Accuracy in Short-Term Multi-Step Electricity Load Forecasting: A CNN-LSTM Approach. IEEE Access 2025, 13, 28419–28432.
2. Duan, P.; Jiao, H.; Sun, J.; Han, A.; Dai, Z.; Cheng, L.; Chen, X. Research on Load Forecasting Based on Bayesian Optimized CNN-LSTM Neural Network. Energies 2025, 18, 6217.
3. Saxena, A.; Shankar, R.; El-Saadany, E.F.; Kumar, M.; Al Zaabi, O.; Al Hosani, K.; Muduli, U.R. Intelligent Load Forecasting and Renewable Energy Integration for Enhanced Grid Reliability. IEEE Trans. Ind. Appl. 2024, 60, 8403–8417.
4. Wen, X.; Liao, J.; Niu, Q.; Shen, N.; Bao, Y. Deep Learning-Driven Hybrid Model for Short-Term Load Forecasting and Smart Grid Information Management. Sci. Rep. 2024, 14, 13720.
5. Karamolegkos, S.; Koulouriotis, D.E. Advancing Short-Term Load Forecasting with Decomposed Fourier ARIMA: A Case Study on the Greek Energy Market. Energy 2025, 325, 135854.
6. Takeda, H.; Tamura, Y.; Sato, S. Using the Ensemble Kalman Filter for Electricity Load Forecasting and Analysis. Energy 2016, 104, 184–198.
7. Zhu, S.; Ma, H.; Chen, L.; Wang, B.; Wang, H.; Li, X.; Gao, W. Short-Term Load Forecasting of an Integrated Energy System Based on STL-CPLE with Multitask Learning. Prot. Control Mod. Power Syst. 2024, 9, 71–92.
8. Hnin, S.W.; Karnjana, J.; Kohda, Y.; Jeenanunta, C. A Hybrid K-Means and KNN Approach for Enhanced Short-Term Load Forecasting Incorporating Holiday Effects. Energy Rep. 2024, 12, 5942–5959.
9. Luo, J.; Hong, T.; Gao, Z.; Fang, S.C. A Robust Support Vector Regression Model for Electric Load Forecasting. Int. J. Forecast. 2023, 39, 1005–1020.
10. Magalhães, B.; Bento, P.; Pombo, J.; Calado, M.d.R.; Mariano, S. Short-Term Load Forecasting Based on Optimized Random Forest and Optimal Feature Selection. Energies 2024, 17, 1926.
11. You, W.; Guo, D.; Wu, Y.; Li, W. Multiple Load Forecasting of Integrated Energy System Based on Sequential-Parallel Hybrid Ensemble Learning. Energies 2023, 16, 3268.
12. Shi, H.; Xu, M.; Li, R. Deep Learning for Household Load Forecasting—A Novel Pooling Deep RNN. IEEE Trans. Smart Grid 2018, 9, 5271–5280.
13. Buratto, W.G.; Muniz, R.N.; Nied, A.; González, G.V. Seq2Seq-LSTM With Attention for Electricity Load Forecasting in Brazil. IEEE Access 2024, 12, 30020–30029.
14. Kong, W.; Dong, Z.Y.; Jia, Y.; Hill, D.J.; Xu, Y.; Zhang, Y. Short-Term Residential Load Forecasting Based on LSTM Recurrent Neural Network. IEEE Trans. Smart Grid 2017, 10, 841–851.
15. Hasanat, S.M.; Ullah, K.; Yousaf, H.; Munir, K.; Abid, S.; Bokhari, S.A.S.; Aziz, M.M.; Naqvi, S.F.M.; Ullah, Z. Enhancing Short-Term Load Forecasting With a CNN-GRU Hybrid Model: A Comparative Analysis. IEEE Access 2024, 12, 184132–184141.
16. Cavus, M.; Jiang, J.; Allahham, A. Deep Multi-Task Forecasting of Net-Load and EV Charging with a Residual-Normalised GRU in IoT-Enabled Microgrids. Energies 2026, 19, 311.
17. Aziukovskyi, O.; Hnatushenko, V.; Kashtan, V.; Polyanska, A.; Jamróz, A. Intelligent Electricity Load Forecasting Method Using ARIMA-LSTM-Random Forest. Inżynieria Miner. 2025, 1, 1.
18. Lotfipoor, A.; Patidar, S.; Jenkins, D.P. Deep Neural Network with Empirical Mode Decomposition and Bayesian Optimisation for Residential Load Forecasting. Expert Syst. Appl. 2024, 237, 121355.
19. Xu, Y.; Ji, X.; Zhu, Z. A Photovoltaic Power Forecasting Method Based on the LSTM-XGBoost-EEDA-SO Model. Sci. Rep. 2025, 15, 30177.
20. Wen, Y.; Pan, S.; Li, X.X.; Li, Z.B. Highly Fluctuating Short-Term Load Forecasting Based on Improved Secondary Decomposition and Optimized VMD. Sustain. Energy Grids Netw. 2024, 37, 101270.
21. Nabavi, S.A.; Mohammadi, S.; Motlagh, N.H.; Tarkoma, S.; Geyer, P. Deep Learning Modeling in Electricity Load Forecasting: Improved Accuracy by Combining DWT and LSTM. Energy Rep. 2024, 12, 2873–2900.
22. Wang, Z.G.; Li, W.J.; Wang, S.T.; Shi, Y.; Han, J.J. Enhancing Load Forecasting for Large Industrial Users Through Feature Preference and Error Correction. IEEE Access 2024, 12, 98647–98659.
23. Lai, Y.B.; Wang, Q.F.; Chen, G.; Bai, Y.; Zhao, P.Y.; Liao, X.J.; Wu, S.; Men, C.Y.; Sun, Q. Short-Term Power Load Prediction Method Based on VMD and EDE-BiLSTM. IEEE Access 2025, 13, 10481–10488.
24. Luo, S.C.; Wang, B.S.; Gao, Q.Z.; Wang, Y.B.; Pang, X.F. Stacking Integration Algorithm Based on CNN-BiLSTM-Attention with XGBoost for Short-Term Electricity Load Forecasting. Energy Rep. 2024, 12, 2676–2689.
25. Song, K.-M.; Kim, T.-G.; Cho, S.-M.; Song, K.-B.; Yoon, S.-G. XGBoost-Based Very Short-Term Load Forecasting Using Day-Ahead Load Forecasting Results. Electronics 2025, 14, 3747.
26. Yao, X.; Fu, X.; Zong, C. Short-Term Load Forecasting Method Based on Feature Preference Strategy and LightGBM-XGboost. IEEE Access 2022, 10, 75257–75268.
27. Wang, H.; Huang, S.; Yin, Y.; Gu, T. Short-Term Load Forecasting Based on Pelican Optimization Algorithm and Dropout Long Short-Term Memories-Fully Convolutional Neural Network Optimization. Energies 2024, 17, 6115.
28. Zhong, B. Deep Learning Integration Optimization of Electric Energy Load Forecasting and Market Price Based on the ANN–LSTM–Transformer Method. Front. Energy Res. 2023, 11, 1292204.
29. Liu, X.; Song, J.; Tao, H.; Wang, P.; Mo, H.; Du, W. Quarter-Hourly Power Load Forecasting Based on a Hybrid CNN-BiLSTM-Attention Model with CEEMDAN, K-Means, and VMD. Energies 2025, 18, 2675.
30. Wu, S.; Cai, H. Short-Term Power Load Prediction of VMD-LSTM Based on ISSA Optimization. Appl. Sci. 2025, 15, 5037.
31. Chen, J.; Ma, W.; Chen, Z. Ultra-Short-Term Electric Load Forecasting Based on VMD-BiLSTM Model. Adv. Eng. Technol. Res. 2023, 8, 865.

Figure 1. Architecture of the LSTM cell.

Figure 2. Architecture of the BiLSTM network.

Figure 3. Overall framework of the proposed VMD–BayesXGB–BiLSTM forecasting model.

Figure 4. Residual correction process from a BayesXGB output to the final BiLSTM-corrected forecast.

Figure 5. Time-series plot of the Australian electricity load dataset.

Figure 6. VMD results of a representative load segment from the training set.

Figure 7. Performance comparison in the ablation study.

Figure 8. Prediction results of the ablation models.

Figure 9. Comparison of performance metrics across different models.

Figure 10. Accuracy–cost trade-off of different forecasting models in terms of RMSE and inference time per sample.

Table 1. VMD parameter selection settings and fold-wise validation results.

Panel A. VMD parameter search settings
Item		Setting
Parameter selection data		Training sequence
Sampling strategy		Last 3000 consecutive observations from the training sequence
Number of TSCV folds		3
Validation window length		750 observations
$K$ candidate range		3–14
$α$ candidate set		{1000, 2000, 3000, 4000, 5000, 6000, 7000}
Selection criterion		Average reconstruction MSE across TSCV folds
Optimization method		Grid search
Final selected $K$		14
Final selected $α$		1000
Average reconstruction MSE		388.5164
VMD parameter search time		48.2403 s
Panel B. Fold-wise VMD validation results
Fold	Validation window length	Selected $K$	Selected $α$	Validation reconstruction MSE
Fold 1	750	14	1000	443.9469
Fold 2	750	14	1000	325.0876
Fold 3	750	14	1000	396.5147
Average	-	-	-	388.5164

Table 2. Optimized hyperparameters of BayesXGB.

Optimization Parameter	Search Range	Final Optimization Result
colsample_bytree	(0.7, 1.0)	0.7000
gamma	(0, 5)	3.5828
learning_rate	(0.01, 0.3)	0.0593
max_depth	(6, 12)	9
min_child_weight	(1, 5)	5
n_estimators	(100, 500)	500
reg_alpha	(0, 10)	10.0000
reg_lambda	(1, 20)	16.2920
subsample	(0.7, 1.0)	0.7618
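For reference, the search ranges and selected values in Table 2 can be written as plain Python dictionaries, as in the hedged sketch below. The check for values lying on a range boundary is an illustrative addition and not part of the original analysis; because Table 2 reports values to four decimal places, a value shown on a boundary may differ from it by less than the displayed precision.

```python
# Search ranges and selected values copied from Table 2.
SEARCH_SPACE = {
    "colsample_bytree": (0.7, 1.0),
    "gamma": (0.0, 5.0),
    "learning_rate": (0.01, 0.3),
    "max_depth": (6, 12),
    "min_child_weight": (1, 5),
    "n_estimators": (100, 500),
    "reg_alpha": (0.0, 10.0),
    "reg_lambda": (1.0, 20.0),
    "subsample": (0.7, 1.0),
}

SELECTED = {
    "colsample_bytree": 0.7000,
    "gamma": 3.5828,
    "learning_rate": 0.0593,
    "max_depth": 9,
    "min_child_weight": 5,
    "n_estimators": 500,
    "reg_alpha": 10.0000,
    "reg_lambda": 16.2920,
    "subsample": 0.7618,
}


def within_ranges(space, selected):
    """True if every selected value lies inside its search range."""
    return all(low <= selected[name] <= high for name, (low, high) in space.items())


def on_boundary(space, selected, tol=1e-9):
    """Names of parameters whose selected value equals a range endpoint."""
    return sorted(
        name
        for name, (low, high) in space.items()
        if abs(selected[name] - low) <= tol or abs(selected[name] - high) <= tol
    )


all_inside = within_ranges(SEARCH_SPACE, SELECTED)
boundary_params = on_boundary(SEARCH_SPACE, SELECTED)
print(all_inside, boundary_params)
```

The keys match the parameter names of the scikit-learn-style XGBoost regressor, so a dictionary such as `SELECTED` could be passed to it as keyword arguments. At the reported precision, `colsample_bytree`, `min_child_weight`, `n_estimators`, and `reg_alpha` sit on an endpoint of their search ranges.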

Table 3. BiLSTM parameter settings.

Parameter	Parameter Setting
BiLSTM structure	4-layer Bidirectional LSTM
Layer connection	Output sequence of each BiLSTM layer is passed to the next layer
Hidden Units	256 (Layer 1), 128 (Layer 2), 96 (Layer 3), 64 (Layer 4)
Sequence Length	48
Input features	Historical BayesXGB prediction and historical residual
Input tensor shape	$N \times 48 \times 2$
Optimizer	Adam
Initial Learning Rate	0.001
Batch Size	64
Training Epochs	100
Dropout Rate	0.4 for BiLSTM layers; 0.3 after the fourth BiLSTM layer; 0.2 for fully connected layers
Fully Connected Layer	Two dense layers with 64 and 32 units, ReLU activation
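As a rough cross-check of the model size, the sketch below estimates the number of trainable parameters implied by the settings in Table 3. Several details are assumptions not stated in the source: the forward and backward outputs of each layer are concatenated, each LSTM gate has a single bias vector, and the two dense layers are followed by a one-unit output layer. The function names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of one LSTM direction (four gates, one bias each)."""
    return 4 * (units * (input_dim + units) + units)


def bilstm_stack_params(input_dim, hidden_units):
    """Parameters of stacked BiLSTM layers and the final feature dimension."""
    total = 0
    for units in hidden_units:
        total += 2 * lstm_params(input_dim, units)  # forward + backward
        input_dim = 2 * units  # assumed: directions are concatenated
    return total, input_dim


def dense_params(input_dim, layer_units):
    """Parameters of consecutive fully connected layers (weights plus biases)."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units
        input_dim = units
    return total


# Two input features; hidden units 256, 128, 96, 64 as listed in Table 3.
recurrent, feature_dim = bilstm_stack_params(2, [256, 128, 96, 64])
# Dense layers with 64 and 32 units, plus an assumed one-unit output layer.
head = dense_params(feature_dim, [64, 32, 1])
estimated_total = recurrent + head
print(recurrent, head, estimated_total)
```

Under these assumptions the estimate is 1,599,873 parameters, within about 0.14% of the 1,602,049 reported in Table 6. The remaining 2,176 parameters would have to come from details that the tables do not list, so the sketch should not be read as the exact architecture.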

Table 4. Results of the ablation study.

Model	RMSE	MAE	MAPE(%)	$R^{2}$
BayesXGB	308.3163	231.0695	2.6434	0.9495
VMD–BayesXGB	143.8891	108.9592	1.2331	0.9890
BayesXGB–BiLSTM	187.2390	127.6176	1.4363	0.9814
VMD–BayesXGB–BiLSTM	122.1003	90.7386	1.0269	0.9921

Table 5. Experimental comparison of different models.

Model	RMSE	MAE	MAPE(%)	$R^{2}$
LSTM	661.0743	518.8267	5.9788	0.7682
VMD–LSTM [30]	137.3648	104.7501	1.2105	0.9899
BayesXGB	308.3163	231.0695	2.6434	0.9495
VMD–BayesXGB	143.8891	108.9592	1.2331	0.9890
BiLSTM	673.1252	525.9073	5.9155	0.7596
VMD–BiLSTM [31]	185.1763	154.9294	1.7540	0.9730
Attention–BiLSTM	619.7955	487.2397	5.5849	0.7962
VMD–Attention–BiLSTM	168.8346	122.8087	1.3821	0.9849
DWT–LSTM [21]	296.7398	244.1437	2.7794	0.9533
VMD–BayesXGB–BiLSTM (Proposed)	122.1003	90.7386	1.0269	0.9921

Table 6. Computational cost comparison of different models.

Model	Training Time (s)	Total Inference Time (s)	Inference Time per Sample (ms)	Trainable Neural Parameters
LSTM	1091.7144	4.4910	0.2562	132,481
VMD–LSTM	1397.0314	4.4148	0.2519	139,649
BayesXGB	30.0264	0.0367	0.0021	0
VMD–BayesXGB	379.6089	0.1380	0.0078	0
BiLSTM	1671.0437	8.7020	0.4965	346,305
VMD–BiLSTM	4820.4418	25.5631	1.4586	360,641
Attention–BiLSTM	5260.5649	24.5254	1.3994	350,018
VMD–Attention–BiLSTM	2899.6422	9.1550	0.5224	364,354
DWT–LSTM	709.6795	4.4305	0.2528	134,017
VMD–BayesXGB–BiLSTM (Proposed)	3137.0643	16.7080	0.9559	1,602,049
