4.1. Dataset Description and Preprocessing
The experiments were conducted using Python 3.7. A publicly available Australian electricity load dataset was used for simulation analysis. The dataset was obtained from the Australian Electricity Load and Price Forecasting Dataset repository, covering the period from 00:30 on 1 January 2006 to 00:00 on 1 January 2011, with a 30 min sampling interval. In addition to the electricity load, five explanatory variables were included: dry-bulb temperature, dew-point temperature, wet-bulb temperature, humidity, and electricity price.
Figure 5 presents the time series plot of the electricity load for the Australian dataset.
As shown in Figure 5, the electricity load exhibits clear periodic fluctuations and seasonal variation. This indicates that electricity demand is affected by temporal patterns and meteorological factors, which increases the difficulty of short-term load forecasting.
Before model training, the raw dataset was processed through four main steps: outlier detection, outlier handling, feature standardization, and VMD-based feature construction.
Section 4.1.1, Section 4.1.2, Section 4.1.3 and Section 4.1.4 describe these preprocessing steps in detail. After preprocessing, supervised learning samples were constructed using a sliding window. The input window length for the primary forecasting model was set to 24, corresponding to 12 h of historical observations, and the forecasting horizon was set to one step, corresponding to a 30 min-ahead prediction.
After window-sample construction, all samples were divided chronologically into training and testing sets. The first 80% of the samples were used as the training set, and the remaining 20% were used as the testing set. In total, 87,624 supervised samples were constructed, including 70,099 training samples and 17,525 testing samples. The training set was used for model fitting and parameter selection, while the testing set was reserved only for final performance evaluation.
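The sample counts above can be checked with a short sketch of the sliding-window construction and the chronological split. This is a minimal illustration using only the Python standard library; the function names `make_windows` and `chronological_split` are hypothetical, and the series holds placeholder integers rather than real load values.

```python
def make_windows(series, window=24, horizon=1):
    """Build (inputs, targets) with a sliding window.

    Each input holds `window` consecutive past values; the target is the
    value `horizon` steps after the last input value.
    """
    inputs, targets = [], []
    for end in range(window, len(series) - horizon + 1):
        inputs.append(series[end - window:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets


def chronological_split(n_samples, train_fraction=0.8):
    """Return (n_train, n_test) for an ordered split without shuffling."""
    n_train = int(n_samples * train_fraction)
    return n_train, n_samples - n_train


# 1 January 2006 to 1 January 2011 spans 1826 days (2008 is a leap year),
# which gives 1826 * 48 half-hourly observations.
n_observations = 1826 * 48
dummy_series = list(range(n_observations))  # placeholder values, not real load data

X, y = make_windows(dummy_series, window=24, horizon=1)
n_train, n_test = chronological_split(len(X))
print(len(X), n_train, n_test)
```

With 87,648 half-hourly observations and a 24-step window, this yields 87,624 samples, split into 70,099 training and 17,525 testing samples, matching the counts reported above.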
For model tuning, validation was performed on the training set rather than using the testing set. The detailed validation strategies and hyperparameter settings for VMD, BayesXGB, and BiLSTM are reported in Section 4.2.
4.1.1. Outlier Detection
To mitigate the impact of dirty data caused by sensor failures, this paper employs the $3\sigma$ rule to identify outliers. This outlier detection method is based on the statistical properties of the normal distribution.
Assuming the data follow a normal distribution $N(\mu, \sigma^2)$, the data distribution exhibits strict probabilistic patterns:
Approximately 68.27% of the data falls within the interval $(\mu - \sigma, \mu + \sigma)$;
Approximately 95.45% of the data falls within the interval $(\mu - 2\sigma, \mu + 2\sigma)$;
Approximately 99.73% of the data falls within the interval $(\mu - 3\sigma, \mu + 3\sigma)$.
Since the $(\mu - 3\sigma, \mu + 3\sigma)$ range encompasses nearly all (99.73%) of the normal data, the $3\sigma$ criterion stipulates that data falling outside this range is highly likely to be an outlier (such values occur with a probability of only 0.27%, a low-probability event).
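The $3\sigma$ criterion can be sketched as follows. This is a minimal illustration rather than the exact implementation used in the experiments: the function name `three_sigma_outliers` is hypothetical, the population standard deviation is an assumption (the text does not say whether the sample or population form was used), and the readings are synthetic.

```python
import statistics


def three_sigma_outliers(values):
    """Return indices of values outside [mu - 3*sigma, mu + 3*sigma]."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    lower, upper = mu - 3 * sigma, mu + 3 * sigma
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Synthetic readings with one injected spike (illustrative, not the paper's data).
readings = [20.0 + 0.1 * (i % 7) for i in range(200)]
readings[57] = 95.0
outlier_indices = three_sigma_outliers(readings)
print(outlier_indices)
```

Only the injected spike at index 57 is flagged. The mean and standard deviation are computed from the contaminated series itself, so a very large spike widens the interval; this is a known limitation of the rule.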
4.1.2. Outlier Handling
After outlier detection, abnormal values were replaced using cubic spline interpolation. This method was adopted because it can preserve the local trend and smoothness of the time series better than simple mean imputation or linear interpolation.
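On each interval $[x_i, x_{i+1}]$, the cubic spline is defined as the piecewise polynomial
$$S_i(x) = a_i + b_i (x - x_i) + c_i (x - x_i)^2 + d_i (x - x_i)^3$$
(where $i$ denotes the interval index, and $a_i$, $b_i$, $c_i$, $d_i$ are the polynomial coefficients for the $i$-th interval). At each interior knot $x_{i+1}$, adjacent pieces satisfy the following conditions.
Function value continuity: $S_i(x_{i+1}) = S_{i+1}(x_{i+1})$
First derivative continuity: $S_i'(x_{i+1}) = S_{i+1}'(x_{i+1})$
Second derivative continuity: $S_i''(x_{i+1}) = S_{i+1}''(x_{i+1})$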
After applying the $3\sigma$-based outlier detection procedure, abnormal values were detected and corrected by cubic spline interpolation. Specifically, 195 outliers were detected in dry-bulb temperature, 55 in dew-point temperature, 217 in humidity, 174 in electricity price, and 124 in electricity load, corresponding to outlier ratios of 0.2225%, 0.0628%, 0.2476%, 0.1985%, and 0.1415%, respectively. No outliers were detected in wet-bulb temperature. Since the outlier ratios for all variables were below 0.25%, the preprocessing step mainly corrected abnormal local observations while preserving the overall structure of the original dataset.
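The replacement step can be illustrated with a small pure-Python spline. This is a sketch under stated assumptions: the boundary condition is not specified in the text, so natural boundary conditions (zero second derivative at both ends) are assumed here, and the names `natural_cubic_spline` and `replace_outliers` are hypothetical. The spline is fitted on the retained points and evaluated at the flagged positions.

```python
import bisect
import math


def natural_cubic_spline(xs, ys):
    """Return a callable natural cubic spline through the points (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)       # second derivatives at the knots; m[0] = m[n] = 0
    diag = [0.0] * (n + 1)
    rhs = [0.0] * (n + 1)
    for i in range(1, n):
        diag[i] = 2.0 * (h[i - 1] + h[i])
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    for i in range(2, n):     # forward elimination of the tridiagonal system
        w = h[i - 1] / diag[i - 1]
        diag[i] -= w * h[i - 1]
        rhs[i] -= w * rhs[i - 1]
    for i in range(n - 1, 0, -1):  # back substitution
        m[i] = (rhs[i] - h[i] * m[i + 1]) / diag[i]

    def evaluate(x):
        i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
        left, right = x - xs[i], xs[i + 1] - x
        return (m[i] * right ** 3 / (6.0 * h[i])
                + m[i + 1] * left ** 3 / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * right
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * left)

    return evaluate


def replace_outliers(values, outlier_indices):
    """Replace flagged values with spline estimates from the retained points."""
    flagged = set(outlier_indices)
    keep = [i for i in range(len(values)) if i not in flagged]
    spline = natural_cubic_spline([float(i) for i in keep], [values[i] for i in keep])
    return [spline(float(i)) if i in flagged else v for i, v in enumerate(values)]


# Smooth synthetic series with one corrupted point (illustrative only).
series = [math.sin(0.3 * i) for i in range(30)]
series[12] = 5.0
cleaned = replace_outliers(series, [12])
print(round(cleaned[12], 3), round(math.sin(0.3 * 12), 3))
```

The corrected value follows the local curvature of the neighbouring points, which is the property that motivates the choice of spline interpolation over mean imputation.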
4.1.3. Feature Standardization
To eliminate the influence of different feature scales on model training, feature standardization was applied before model fitting. The mean and standard deviation were calculated from the training set and then applied to both the training and testing sets. This procedure ensures that the testing data are transformed using only information obtained from the training set.
The standardization formula (Z-score normalization) is expressed as
$$x' = \frac{x - \mu}{\sigma},$$
where $x$ represents the original feature value; $\mu$ denotes the mean of the feature across the training set; $\sigma$ indicates the standard deviation of the feature across the training set; and $x'$ signifies the standardized feature value.
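The train-only fitting described above can be sketched as follows. The helper names `fit_standardizer` and `standardize` are hypothetical, the population standard deviation is an assumption, and the values are toy numbers rather than dataset features.

```python
import statistics


def fit_standardizer(train_values):
    """Estimate the mean and standard deviation from the training set only."""
    mu = statistics.mean(train_values)
    sigma = statistics.pstdev(train_values)  # population form (assumption)
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply the Z-score transform with previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


train = [10.0, 12.0, 14.0, 16.0, 18.0]
test = [20.0, 22.0]

mu, sigma = fit_standardizer(train)      # statistics come from `train` alone
train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)    # test data reuse the training statistics
print(mu, round(sigma, 4), [round(z, 4) for z in test_z])
```

The standardized test values need not have zero mean or unit variance. This is expected, since no test-set statistics enter the transform.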
4.1.4. VMD Feature Construction
VMD was used to extract multi-scale load features from the historical input windows. The VMD parameters were determined using only the training set, with the detailed selection procedure described in Section 4.2.1. For each prediction sample, VMD was applied only to the historical load window before the target time step, so the target value and future observations were not involved in feature construction.
Figure 6 shows the VMD results of a representative load segment from the training set.
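The leakage-free property of window-wise feature construction can be illustrated with the sketch below. The decomposition used here is deliberately a stand-in, not VMD: `moving_average_split` is a hypothetical placeholder that splits a window into a smoothed trend and a remainder, and any decomposition routine with the same interface could be passed as `decompose`. The point of the sketch is only the indexing: features for target index `t` are computed from `load[t - window:t]`.

```python
def moving_average_split(window, span=4):
    """Stand-in decomposition (NOT VMD): causal moving-average trend plus remainder."""
    trend = []
    for i in range(len(window)):
        chunk = window[max(0, i - span + 1):i + 1]
        trend.append(sum(chunk) / len(chunk))
    remainder = [v - t for v, t in zip(window, trend)]
    return [trend, remainder]


def windowed_features(load, window=24, decompose=moving_average_split):
    """Build one feature row per target index t using load[t - window:t] only."""
    rows = []
    for t in range(window, len(load)):
        history = load[t - window:t]  # excludes load[t] and all later values
        modes = decompose(history)
        rows.append([value for mode in modes for value in mode])
    return rows


# Synthetic load values (illustrative only).
load = [float((i * 37) % 101) for i in range(60)]
rows = windowed_features(load)

# Perturbing a later observation must not change features of earlier targets.
perturbed = list(load)
perturbed[40] = 1e6
rows_perturbed = windowed_features(perturbed)
print(rows[:17] == rows_perturbed[:17], rows[17] == rows_perturbed[17])
```

Rows for targets up to index 40 are unchanged by the perturbation, while the first row whose history contains index 40 differs. Decomposing the full series once and slicing afterwards would not generally pass this check.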
4.2. Hyperparameter Settings
The main experimental settings include VMD parameter selection, Bayesian optimization of XGBoost hyperparameters for constructing BayesXGB, and the configuration of the BiLSTM residual correction module. All parameter tuning procedures were performed using only the training set, while the testing set was used only for final performance evaluation.
4.2.1. VMD Parameter Selection
For VMD, the number of modes and the penalty factor were selected only within the training sequence. To reduce computational cost while preserving temporal order, the last 3000 consecutive observations from the training sequence were used as the parameter selection sample. The number of modes was searched from 3 to 14, and the penalty factor was selected from {1000, 2000, 3000, 4000, 5000, 6000, 7000}.
For each candidate pair of the number of modes $K$ and the penalty factor $\alpha$, three-fold time-series cross-validation was applied. The validation window length of each fold was 750 observations. The reconstruction mean squared error was calculated in each validation fold, and the average reconstruction MSE across the three folds was used as the final selection criterion. The mathematical procedure of the TSCV-based VMD parameter selection is summarized in Algorithm 2.
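| Algorithm 2. Time-series cross-validation-based VMD parameter selection |
| Input: training load sequence; candidate mode numbers $K \in \{3, \dots, 14\}$; candidate penalty factors $\alpha \in \{1000, 2000, \dots, 7000\}$. |
| Output: selected parameter pair $(K^*, \alpha^*)$. |
| 1. Take the last 3000 consecutive observations of the training sequence as the parameter selection sample. |
| 2. Construct three time-series cross-validation folds, each with a validation window of 750 observations. |
| 3. For each candidate pair $(K, \alpha)$ and each fold, apply VMD and compute the reconstruction MSE in the validation fold. |
| 4. Average the reconstruction MSE of each candidate pair across the three folds. |
| 5. Select the pair $(K^*, \alpha^*)$ with the lowest average reconstruction MSE. |
| 6. Use $(K^*, \alpha^*)$ for leakage-free VMD feature construction, where VMD is applied only to the historical input window of each prediction sample. |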
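The fold layout and grid search can be sketched as follows. Several points are assumptions: the text does not state whether the training portion of each fold expands or slides, so an expanding window is assumed here; `expanding_window_folds` and `select_parameters` are hypothetical names; and `toy_score` is a dummy criterion standing in for the VMD reconstruction MSE, so its selected pair carries no meaning for the actual experiment.

```python
def expanding_window_folds(n_obs, n_folds=3, val_len=750):
    """Return (train_range, val_range) pairs with validation blocks at the end."""
    folds = []
    for k in range(n_folds):
        val_start = n_obs - (n_folds - k) * val_len
        folds.append((range(0, val_start), range(val_start, val_start + val_len)))
    return folds


def select_parameters(sample, mode_grid, penalty_grid, score):
    """Return the (K, alpha) pair with the lowest fold-averaged score."""
    folds = expanding_window_folds(len(sample))
    best_pair, best_value = None, float("inf")
    for k_modes in mode_grid:
        for alpha in penalty_grid:
            fold_scores = [score(sample, train_idx, val_idx, k_modes, alpha)
                           for train_idx, val_idx in folds]
            mean_score = sum(fold_scores) / len(fold_scores)
            if mean_score < best_value:
                best_pair, best_value = (k_modes, alpha), mean_score
    return best_pair, best_value


def toy_score(sample, train_idx, val_idx, k_modes, alpha):
    """Dummy criterion (NOT a VMD reconstruction MSE), used only to run the search."""
    return abs(k_modes - 5) + abs(alpha - 3000) / 1000.0


sample = [0.0] * 3000  # placeholder for the last 3000 training observations
folds = expanding_window_folds(len(sample))
print([(len(train_idx), len(val_idx)) for train_idx, val_idx in folds])

best_pair, best_value = select_parameters(
    sample, range(3, 15), range(1000, 8000, 1000), toy_score)
print(best_pair, best_value)
```

Under the expanding-window assumption, the 3000-observation sample gives training lengths of 750, 1500, and 2250, each followed by a 750-observation validation window. The grid contains 12 × 7 = 84 candidate pairs.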
As shown in Table 1, $K = 14$ and $\alpha = 1000$ were selected as the optimal parameter pair in all three validation folds. The average reconstruction MSE was 388.5164, and the VMD parameter search time was 48.2403 s. These fold-wise results indicate that the selected VMD parameters were stable under the time-series cross-validation scheme.
The reconstruction MSE was used because VMD serves as an unsupervised feature construction step before forecasting. This criterion was adopted to select a decomposition setting that preserves the main information of the original load sequence while generating multi-scale components. The forecasting contribution of the selected VMD features was further evaluated through the ablation study in Section 4.3.
4.2.2. BayesXGB Hyperparameter Optimization
The hyperparameters of XGBoost were optimized using Bayesian optimization. During the optimization process, five-fold time-series cross-validation was performed, and MAE was used as the optimization objective. The optimization objective is defined as
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,$$
where $y_i$ represents the actual electricity load value; $\hat{y}_i$ denotes the predicted load value generated by XGBoost; and $n$ denotes the number of validation samples in each time-series cross-validation fold. The Bayesian optimization process searches for the hyperparameter combination that minimizes the cross-validation MAE. The optimized XGBoost hyperparameters are shown in Table 2.
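The cross-validation objective can be sketched as below. Averaging the per-fold MAE with equal weights is an assumption, since the text does not state how fold scores are aggregated; the function names are hypothetical and the numbers are toy values. In practice, the returned value would be what the Bayesian optimizer minimizes for each candidate hyperparameter setting.

```python
def mean_absolute_error(actual, predicted):
    """MAE over one validation fold."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def cross_validated_mae(fold_results):
    """Equal-weight average of per-fold MAE.

    `fold_results` is a list of (actual, predicted) pairs, one per fold.
    """
    fold_maes = [mean_absolute_error(a, p) for a, p in fold_results]
    return sum(fold_maes) / len(fold_maes)


# Toy fold outputs (illustrative only).
fold_results = [
    ([100.0, 200.0], [110.0, 190.0]),
    ([300.0, 400.0, 500.0], [300.0, 420.0, 490.0]),
]
objective = cross_validated_mae(fold_results)
print(objective)
```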
4.2.3. BiLSTM Residual Correction Settings
The BiLSTM residual correction module was configured as a stacked four-layer bidirectional LSTM structure. Each BiLSTM layer contains both a forward LSTM and a backward LSTM, and the output sequence of each layer is passed to the next BiLSTM layer. No residual skip connection is used between BiLSTM layers.
The BiLSTM residual correction module used a sequence length of 48, corresponding to one day of half-hour historical information. This differs from the 24-step input window used for the primary BayesXGB forecasting stage. The 24-step window was used to construct the first-stage forecasting samples, whereas the 48-step sequence was used to construct historical BayesXGB prediction and residual sequences for residual correction.
The input features of the BiLSTM residual correction module include historical BayesXGB predictions and their corresponding historical residuals. Therefore, the input tensor has shape $(B, 48, 2)$, where $B$ is the batch size, 48 is the sequence length, and 2 denotes the feature dimension.
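The construction of the $(B, 48, 2)$ input can be sketched with nested lists. Two details are assumptions not stated in the text: the residual is taken as actual minus predicted, and the correction target for each sequence is the residual at the next time step. The function name `build_residual_sequences` is hypothetical and the values are synthetic.

```python
def build_residual_sequences(actual, predicted, seq_len=48):
    """Return (inputs, targets) for residual correction.

    inputs[i] is a list of `seq_len` steps, each [prediction, residual];
    targets[i] is the residual at the step right after the sequence (assumption).
    """
    residuals = [a - p for a, p in zip(actual, predicted)]  # sign convention assumed
    inputs, targets = [], []
    for t in range(seq_len, len(actual)):
        steps = [[predicted[j], residuals[j]] for j in range(t - seq_len, t)]
        inputs.append(steps)
        targets.append(residuals[t])
    return inputs, targets


# Synthetic first-stage outputs with a constant bias of 1.5 (illustrative only).
actual = [float(i) for i in range(100)]
predicted = [a - 1.5 for a in actual]

inputs, targets = build_residual_sequences(actual, predicted)
shape = (len(inputs), len(inputs[0]), len(inputs[0][0]))
print(shape)
```

For 100 time steps and a 48-step sequence, the resulting shape is (52, 48, 2). A deep learning framework would consume the same layout as a three-dimensional array.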
The detailed BiLSTM parameter settings are shown in Table 3.
4.3. Ablation Study
To evaluate the contribution of each component in the proposed framework, four ablation models were constructed: BayesXGB, VMD–BayesXGB, BayesXGB–BiLSTM, and VMD–BayesXGB–BiLSTM. Here, BayesXGB denotes the Bayesian-optimized XGBoost regressor. The same optimized BayesXGB hyperparameter setting was used for all BayesXGB-based ablation models to ensure a consistent comparison. BayesXGB was used as the baseline model; VMD–BayesXGB introduced the VMD-based feature construction module; BayesXGB–BiLSTM introduced the BiLSTM-based residual correction module; and VMD–BayesXGB–BiLSTM represented the complete proposed model.
As shown in Table 4, introducing VMD-based feature construction substantially improved the baseline BayesXGB model. Compared with BayesXGB, VMD–BayesXGB reduced RMSE, MAE, and MAPE by 53.33%, 52.85%, and 53.35%, respectively, while increasing $R^2$ from 0.9495 to 0.9890. This indicates that VMD-derived multi-scale features can effectively reduce the complexity of the original load sequence and improve the forecasting capability of BayesXGB.
The BiLSTM residual correction module also contributed to forecasting improvement. Compared with BayesXGB, BayesXGB–BiLSTM reduced RMSE, MAE, and MAPE by 39.27%, 44.77%, and 45.66%, respectively. This suggests that the residual sequence after the primary BayesXGB forecasting stage still contains useful temporal information that can be learned by the BiLSTM correction module.
The complete VMD–BayesXGB–BiLSTM model achieved the best overall performance among the ablation models, with an RMSE of 122.1003, an MAE of 90.7386, a MAPE of 1.0269%, and an $R^2$ of 0.9921. Compared with VMD–BayesXGB, the complete model further reduced RMSE, MAE, and MAPE by 15.14%, 16.72%, and 16.72%, respectively. Compared with the baseline BayesXGB model, the corresponding reductions reached 60.40%, 60.73%, and 61.15%, respectively. These results demonstrate that VMD-based feature construction and BiLSTM-based residual correction provide complementary improvements to the proposed forecasting framework.
Figure 7 provides a visual comparison of RMSE, MAE, MAPE, and $R^2$ among the ablation models. The complete VMD–BayesXGB–BiLSTM model consistently achieved the lowest RMSE, MAE, and MAPE, as well as the highest $R^2$, further confirming the effectiveness of the proposed hybrid structure.
Figure 8 presents the prediction curves of the ablation models on a representative test segment. The predicted curves generally follow the overall trend of the actual load. Although local differences exist among the models, the quantitative results in Table 4 show that VMD–BayesXGB–BiLSTM achieves the lowest overall prediction error, indicating more stable forecasting performance on the testing set.
4.4. Comparative Evaluation
To further evaluate the predictive performance of the proposed model, it was compared with several representative baseline models, including LSTM, VMD–LSTM, BayesXGB, VMD–BayesXGB, BiLSTM, VMD–BiLSTM, Attention–BiLSTM, VMD–Attention–BiLSTM, and DWT–LSTM. Here, BayesXGB denotes the XGBoost regressor whose hyperparameters were tuned by Bayesian optimization. The proposed model is denoted as VMD–BayesXGB–BiLSTM.
To ensure consistency, all models were evaluated using the same chronological training-testing split, input window length, forecasting horizon, and evaluation metrics. For the BayesXGB-based models, the same optimized hyperparameter setting was used. The LSTM-, BiLSTM-, and attention-based baseline models were trained using the same optimizer, batch size, maximum number of epochs, and early-stopping strategy. Therefore, the comparative evaluation focuses on the forecasting performance of different model structures under a consistent experimental setting.
As shown in Table 5, the proposed VMD–BayesXGB–BiLSTM model achieved the best performance among all compared models, with an RMSE of 122.1003, an MAE of 90.7386, a MAPE of 1.0269%, and an $R^2$ of 0.9921. These results indicate that the proposed hybrid framework can more accurately capture the nonlinear, non-stationary, and temporal characteristics of electricity load data.
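For reference, the four evaluation metrics can be computed as in the sketch below. It uses the standard definitions of RMSE, MAE, MAPE (in percent), and the coefficient of determination; the function name `evaluate_forecast` is hypothetical, and the values are toy numbers rather than model outputs from this study.

```python
import math


def evaluate_forecast(actual, predicted):
    """Return RMSE, MAE, MAPE (in percent), and R2 for one set of forecasts."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


# Toy values (illustrative only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = evaluate_forecast(actual, predicted)
print({name: round(value, 4) for name, value in metrics.items()})
```

MAPE divides by the actual value, so it is only well defined when the load is strictly non-zero, which holds for system-level electricity demand.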
The standalone recurrent neural network models, including LSTM, BiLSTM, and Attention–BiLSTM, produced relatively large prediction errors. In contrast, their VMD-based counterparts achieved much better performance, indicating that VMD-based feature construction is effective for reducing the complexity of the original load sequence and extracting useful multi-scale information for downstream forecasting models.
Among all baseline models, VMD–LSTM achieved the strongest performance, with an RMSE of 137.3648, an MAE of 104.7501, a MAPE of 1.2105%, and an $R^2$ of 0.9899. Nevertheless, the proposed VMD–BayesXGB–BiLSTM model further reduced RMSE, MAE, and MAPE by 11.11%, 13.38%, and 15.17%, respectively, compared with VMD–LSTM. This demonstrates that combining BayesXGB forecasting with BiLSTM residual correction can provide additional improvement beyond VMD-based recurrent forecasting alone.
Compared with VMD–BayesXGB, the proposed model further reduced RMSE from 143.8891 to 122.1003, MAE from 108.9592 to 90.7386, and MAPE from 1.2331% to 1.0269%. The corresponding reductions were 15.14%, 16.72%, and 16.72%, respectively. This improvement demonstrates that the residual sequence after the primary BayesXGB forecasting stage still contains useful temporal information, which can be further modeled by the BiLSTM residual correction module.
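The reported reductions follow from the relative change between the two error values, as the short check below shows. It uses the VMD–BayesXGB and proposed-model values quoted in the text; the helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


# Error values quoted in the text for the two models.
vmd_bayesxgb = {"RMSE": 143.8891, "MAE": 108.9592, "MAPE": 1.2331}
proposed = {"RMSE": 122.1003, "MAE": 90.7386, "MAPE": 1.0269}

reductions = {
    name: round(relative_reduction(vmd_bayesxgb[name], proposed[name]), 2)
    for name in vmd_bayesxgb
}
print(reductions)
```

The computed values are 15.14% for RMSE and 16.72% for both MAE and MAPE, in agreement with the figures stated above.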
The proposed model also outperformed other decomposition-based and hybrid baselines. Compared with VMD–Attention–BiLSTM, it reduced RMSE, MAE, and MAPE by 27.68%, 26.11%, and 25.70%, respectively. Compared with DWT–LSTM, it reduced RMSE, MAE, and MAPE by 58.85%, 62.83%, and 63.05%, respectively. These comparisons show that the proposed framework provides more accurate forecasting performance than both attention-enhanced recurrent models and DWT-based recurrent models on the tested dataset.
Overall, the comparative results demonstrate that the proposed VMD–BayesXGB–BiLSTM model achieves superior predictive performance on the tested Australian electricity load dataset. However, since the experiments were conducted on a single dataset, broader generalizability should be further verified using additional datasets from different regions or power systems in future work.
4.5. Computational Cost Analysis
In addition to forecasting accuracy, computational cost is also an important factor for practical short-term load forecasting applications. Therefore, the training time, total inference time, inference time per sample, and number of trainable neural parameters were compared among all models used in the comparative evaluation. The results are shown in Table 6.
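One way to obtain the total and per-sample inference times is sketched below. This is a hedged illustration, not the measurement protocol of this study: the helper `time_inference` is hypothetical, `dummy_predict` stands in for a trained model, and a single timed batch call is assumed (repeated runs and warm-up are omitted for brevity).

```python
import time


def time_inference(predict, samples):
    """Time one batch prediction call and derive the per-sample cost."""
    start = time.perf_counter()
    predictions = predict(samples)
    total_s = time.perf_counter() - start
    return {
        "n_samples": len(samples),
        "n_predictions": len(predictions),
        "total_s": total_s,
        "per_sample_ms": 1000.0 * total_s / len(samples),
    }


def dummy_predict(batch):
    """Placeholder model: predicts the mean of each input window."""
    return [sum(window) / len(window) for window in batch]


# Placeholder inputs: 1000 windows of 24 values each (illustrative only).
samples = [[float(i + j) for j in range(24)] for i in range(1000)]
timing = time_inference(dummy_predict, samples)
print(timing["n_samples"], timing["per_sample_ms"])
```

The per-sample time is the total time divided by the number of samples and converted to milliseconds, so the two quantities carry the same information once the test-set size is fixed.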
It should be noted that the values in Table 6 mainly report the model training and inference costs after feature construction. In this study, VMD and DWT feature construction were treated as offline preprocessing steps. For leakage-free VMD feature construction, decomposition was applied independently to each historical input window to avoid the use of future information. Therefore, the offline feature generation cost should be considered separately from the online inference cost.
As shown in Table 6, BayesXGB required the lowest computational cost, with a training time of 30.0264 s and an inference time of 0.0021 ms per sample. After introducing VMD-based feature construction, VMD–BayesXGB still maintained a very low inference time of 0.0078 ms per sample, although its training time increased to 379.6089 s. This indicates that VMD-derived features can greatly improve forecasting accuracy while preserving low online inference cost for the BayesXGB-based model.
The recurrent neural network models required higher computational costs because of their trainable neural parameters. In particular, VMD–BiLSTM and Attention–BiLSTM required 4820.4418 s and 5260.5649 s for training, respectively, with inference times above 1 ms per sample. In comparison, the proposed VMD–BayesXGB–BiLSTM model required 3137.0643 s for training and 0.9559 ms per sample for inference. Although the proposed model contains 1,602,049 trainable neural parameters and is more complex than the tree-based models, its online inference time remains below 1 ms per sample.
The additional computational cost of the proposed model mainly comes from the BiLSTM residual correction module. However, this cost is accompanied by clear accuracy improvements. Compared with VMD–BayesXGB, the proposed model reduced RMSE from 143.8891 to 122.1003, MAE from 108.9592 to 90.7386, and MAPE from 1.2331% to 1.0269%. Therefore, the proposed model provides a trade-off between higher forecasting accuracy and increased computational complexity.
Figure 10 further illustrates the accuracy–cost trade-off of different forecasting models by comparing RMSE with inference time per sample. In this figure, models located closer to the lower-left region are preferable because they achieve lower prediction error and lower online inference cost simultaneously. BayesXGB and VMD–BayesXGB have the lowest inference time, but their RMSE values are higher than that of the proposed model. Several recurrent neural network baselines require comparable or higher inference time while still producing larger prediction errors. The proposed VMD–BayesXGB–BiLSTM model achieves the lowest RMSE among all compared models while maintaining an inference time below 1 ms per sample, indicating a favorable balance between forecasting accuracy and online inference efficiency.
From a practical perspective, VMD–BayesXGB may be more suitable for scenarios with strict computational constraints because it provides a large improvement over BayesXGB while maintaining very low inference time. In contrast, the complete VMD–BayesXGB–BiLSTM model is more suitable for applications where higher forecasting accuracy is prioritized and sufficient computational resources are available. Future work may focus on reducing the computational burden of the residual correction module and the VMD feature construction stage, for example, by using lightweight recurrent structures, more efficient decomposition strategies, or model compression techniques.