Enhanced Short-Term Load Forecasting Based on Adaptive Residual Fusion of Autoformer and Transformer
Abstract
1. Introduction
- (1) Designing sophisticated deep learning architectures often incurs significant computational cost and development time. Paradoxically, increasing model complexity does not universally translate into superior forecasting precision. This study presents a lightweight ensemble strategy that fuses the forecasting outputs of pre-trained Autoformer and Transformer models through dynamic residual correction, demonstrating that adaptive fusion of specialized models achieves higher prediction accuracy than monolithic networks and embodying the principle that combined modeling expertise outperforms individual architectural complexity.
- (2) Unlike traditional ensemble learning methods, this study introduces a dynamic residual correction mechanism that explicitly leverages the complementary strengths of specialized models. Through a validation-driven model prioritization strategy, the framework designates the model with the smaller validation forecasting error as the primary backbone and repurposes the other as an error corrector that adaptively compensates for discrepancies in the backbone's forecasts.
- (3) A data-driven adaptive weight matrix is adopted for result-level fusion, dynamically combining the predictions of the Autoformer and Transformer. This matrix outperforms traditional fixed-weight ensembles by prioritizing each model's strengths and suppressing redundant errors, thereby achieving higher forecasting accuracy. A minimal sketch of this validation-driven fusion appears after this list.
- (4) Leveraging the inherent differences in the feature extraction mechanisms of the Transformer and Autoformer, their STELF results can be complementary at certain times. The residual fusion network fuses the two models' forecasting results to achieve error cancellation and to track load fluctuation trends accurately, thereby improving forecasting performance. In single-region STELF experiments, the Transformer outperformed the Autoformer, and ATRFN achieved 14.49% lower mean squared error (MSE), 10.51% lower mean absolute error (MAE), and 9.16% lower mean absolute percentage error (MAPE) than the Transformer. In multi-region STELF experiments, where the Autoformer outperformed the Transformer, ATRFN still reduced MSE, MAE, and MAPE by 5.22%, 2.77%, and 5.88%, respectively, relative to the Autoformer. Results on both datasets consistently demonstrate the effectiveness of ATRFN for STELF tasks; moreover, ATRFN outperformed all baseline models, further validating its superiority across different regional scenarios.
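The validation-driven prioritization described in contributions (2) and (3) reduces to a simple comparison on the validation split. Below is a minimal sketch, assuming each pre-trained model exposes a `predict()` method; `select_backbone` and its signature are illustrative, not the paper's code:

```python
import numpy as np

def select_backbone(model_a, model_b, val_inputs, val_targets):
    """Hypothetical helper: the model with the smaller validation MSE
    becomes the backbone; the other is repurposed as the residual
    corrector that patches the backbone's forecast errors."""
    err_a = np.mean((model_a.predict(val_inputs) - val_targets) ** 2)
    err_b = np.mean((model_b.predict(val_inputs) - val_targets) ** 2)
    if err_a <= err_b:
        return model_a, model_b  # backbone, corrector
    return model_b, model_a
```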
2. Methodology
2.1. Autoformer
2.2. Transformer
2.3. Autoformer-Transformer Residual Fusion Network
- (1) The original electricity load data were partitioned into training, validation, and test sets and then normalized.
- (2) The Transformer and Autoformer were trained separately and saved independently.
- (3) Electricity load forecasting (ELF) was then performed with the trained Transformer and Autoformer, respectively.
- (4) The two sets of prediction results were fed into the residual fusion network (RFN) for training.
- (5) The final trained ATRFN was employed as the ultimate ELF model; the overall pipeline is sketched after this list.
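Assuming hypothetical helpers `train_model`, `targets_of`, and a `ResidualFusionNetwork` class that encapsulate windowing and optimization, steps (1)-(5) can be sketched as:

```python
import numpy as np

def atrfn_pipeline(raw_series: np.ndarray):
    """Sketch of the five-step procedure; helper names are illustrative."""
    # (1) Partition chronologically, then z-score normalize using
    #     training statistics only, so no leakage from val/test.
    n = len(raw_series)
    train, val, test = np.split(raw_series, [int(0.7 * n), int(0.8 * n)])
    mu, sd = train.mean(), train.std()
    train, val, test = (train - mu) / sd, (val - mu) / sd, (test - mu) / sd

    # (2) Train and checkpoint the two base models independently.
    transformer = train_model("transformer", train)
    autoformer = train_model("autoformer", train)

    # (3) Forecast with each pre-trained model.
    pred_t, pred_a = transformer.predict(val), autoformer.predict(val)

    # (4) Train the residual fusion network on the paired predictions,
    #     supervised by the validation targets.
    rfn = ResidualFusionNetwork()
    rfn.fit(pred_t, pred_a, targets_of(val))

    # (5) The trained ATRFN (frozen base models + RFN) is the final model.
    return rfn.predict(transformer.predict(test), autoformer.predict(test))
```

The 70/10/20 split ratio here is an assumption for illustration; the paper's own partition sizes apply in practice.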
3. Results and Discussion
3.1. Evaluation Metrics
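For reference, the three metrics reported throughout carry their standard definitions, with $y_i$ the observed load, $\hat{y}_i$ the forecast, and $n$ the number of forecast points (lower is better for all three):

```latex
\mathrm{MSE}  = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|, \qquad
\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|
```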
3.2. Univariate Electricity Load Forecasting
3.3. Multivariate Electricity Load Forecasting
3.4. Ablation Study on Fusion Methods
3.5. Discussion
- (1) Unlike traditional feature fusion, which requires aligning and reconstructing the internal architectures of the Transformer and Autoformer, ATRFN acts directly on the two models' outputs through the RFN, achieving fusion in a lightweight "backbone screening and residual correction" manner.
- (2) The proposed ATRFN deepens the fusion granularity to the feature dimension, dynamically screening the two models' outputs through an adaptive weight matrix. This retains the long-range trend features captured by the Autoformer while enhancing the local abrupt-change features to which the Transformer is sensitive.
- (3) ATRFN follows a "superior model first" principle: the output of the model with the smaller validation-set error serves as the feature backbone, while the output of the other model corrects details in residual form. This design keeps gradient propagation stable through the identity-mapping property of residual connections and avoids the "redundant information interference" caused by directly concatenating the two models' output features, thereby improving the utilization efficiency of the feature space (see the sketch following this list).
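A minimal PyTorch sketch of such a fusion module is given below. The layer sizes, the sigmoid gating form, and the class name are illustrative assumptions; only the overall structure (adaptive per-step weights plus a residual corrector on top of the backbone path) mirrors the description above:

```python
import torch
import torch.nn as nn

class ResidualFusionNetwork(nn.Module):
    """Sketch of an RFN: gated blend of two forecasts + residual refinement."""

    def __init__(self, horizon: int = 96, hidden: int = 128):
        super().__init__()
        # Adaptive weight matrix: derived from both predictions, so the
        # blend adapts to the local characteristics of each input window.
        self.gate = nn.Sequential(
            nn.Linear(2 * horizon, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon), nn.Sigmoid(),
        )
        # Residual corrector refining the gated blend.
        self.corrector = nn.Sequential(
            nn.Linear(2 * horizon, hidden), nn.ReLU(),
            nn.Linear(hidden, horizon),
        )

    def forward(self, backbone_pred: torch.Tensor, corrector_pred: torch.Tensor):
        # Inputs: (batch, horizon); the backbone is whichever base model
        # achieved the smaller validation error.
        pair = torch.cat([backbone_pred, corrector_pred], dim=-1)
        w = self.gate(pair)  # adaptive per-step weights in (0, 1)
        blended = w * backbone_pred + (1.0 - w) * corrector_pred
        # The identity path keeps gradients stable; the corrector only
        # adds a residual refinement on top of the blended forecast.
        return blended + self.corrector(pair)
```

Calling `ResidualFusionNetwork()(pred_backbone, pred_corrector)` on two `(batch, 96)` tensors yields the gated blend plus residual correction.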
- (1) The selection of the backbone model and the weight initialization depend heavily on validation-set errors. If the validation set fails to reflect the true distribution of the target data, the backbone may be misjudged and the initial weights biased.
- (2) The weight matrix is computed from global feature statistics without time-step-level contextual information, which may reduce the correction gain during certain periods.
- (3) The residual correction mechanism implicitly assumes that the two models' errors are independent, yet systematic co-bias remains a risk: when the Transformer and Autoformer both produce large prediction errors simultaneously, residual correction may fail for lack of an effective correction signal.
4. Conclusions
- (1) The proposed fusion model avoids joint optimization of the two models' internal architectures through an independent pretraining and result-level fusion strategy, reducing joint training overhead while maximally preserving the respective strengths of the Transformer and Autoformer.
- (2) The proposed fusion model employs a dynamic weight matrix as an interaction bridge between the Transformer and Autoformer; it analyzes the local feature characteristics of the input data and assigns differentiated weights to the two models' outputs during fusion, effectively mitigating the blindness of fixed weights.
- (3) By selecting the better-performing model on the validation set as the feature backbone and using the other model to supplement and correct local details in residual form, the fusion model provides a guaranteed performance baseline. In univariate forecasting experiments, the MSE and MAE of the proposed fusion model are 14.49% and 10.51% lower, respectively, than those of the Transformer; in multivariate forecasting experiments, its MSE and MAE are 5.22% and 2.77% lower, respectively, than those of the Autoformer.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Full Term |
|---|---|
| ABPE | Average absolute prediction errors |
| ATRFN | Autoformer-Transformer residual fusion network |
| BiLSTM | Bidirectional long short-term memory network |
| CNNs | Convolutional neural networks |
| GRU | Gated recurrent unit |
| LSTM | Long short-term memory network |
| MAE | Mean absolute error |
| MAPE | Mean absolute percentage error |
| MSE | Mean squared error |
| STELF | Short-term electricity load forecasting |
References
Hyperparameter settings of ATRFN and the baseline models:

| Parameters | ATRFN | LSTM | BiLSTM | Autoformer | Transformer | Informer | Reformer |
|---|---|---|---|---|---|---|---|
| Input size | 672 | 672 | 672 | 672 | 672 | 672 | 672 |
| Start token size | 48 | - | - | 48 | 48 | 48 | 48 |
| Forecasting size | 96 | 96 | 96 | 96 | 96 | 96 | 96 |
| Dimension of model | 512 | - | - | 512 | 512 | 512 | 512 |
| Dimension of FCN | 2048 | - | - | 2048 | 2048 | 2048 | 2048 |
| Hidden size | 128 | 512 | 512 | - | - | - | - |
| Num_layers | - | 3 | 3 | - | - | - | - |
| Encoder layers | 2 | - | - | 2 | 2 | 2 | 2 |
| Decoder layers | 1 | - | - | 1 | 1 | 1 | 1 |
| Attention heads | 8 | - | - | 8 | 8 | 8 | 8 |
| Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| Bucket size | - | - | - | - | - | - | 4 |
| N_hashes | - | - | - | - | - | - | 4 |
| Learning rate | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0001 |
| Train epochs | 100 | 500 | 500 | 10 | 10 | 10 | 10 |
Univariate (single-region) STELF results:

| Metric | ATRFN | LSTM | BiLSTM | Autoformer | Informer | Transformer | Reformer |
|---|---|---|---|---|---|---|---|
| MSE | 0.0472 | 0.0954 | 0.0595 | 0.1476 | 0.0536 | 0.0552 | 0.1519 |
| MAE | 0.1149 | 0.1746 | 0.1448 | 0.2556 | 0.1296 | 0.1284 | 0.2657 |
| MAPE | 0.5940 | 0.7997 | 0.7638 | 1.4552 | 0.6456 | 0.6539 | 1.2890 |
Multivariate (multi-region) STELF results and computational cost:

| Metric | ATRFN | LSTM | BiLSTM | Autoformer | Informer | Transformer | Reformer |
|---|---|---|---|---|---|---|---|
| MSE | 0.1869 | 0.2875 | 0.2925 | 0.1972 | 0.2750 | 0.2590 | 0.3014 |
| MAE | 0.3017 | 0.3663 | 0.3742 | 0.3103 | 0.3744 | 0.3593 | 0.3921 |
| MAPE | 3.1772 | 3.2704 | 3.2913 | 3.3757 | 3.4695 | 3.4167 | 3.4735 |
| Inference Time (ms/batch) | 7.30 | 5.40 | 5.34 | 5.69 | 6.06 | 5.34 | 5.55 |
| CPU Peak Memory (MB) | 26.99 | 1214.43 | 1214.56 | 26.83 | 26.83 | 26.92 | 26.83 |
Ablation study on fusion methods:

| Metric | ATRFN | Simple Averaging | Fixed Linear Fusion |
|---|---|---|---|
| MSE | 0.1869 | 0.2091 | 0.1973 |
| MAE | 0.3017 | 0.3103 | 0.3045 |
| MAPE | 3.1772 | 3.3921 | 2.9837 |