Article

Short-Term Power Load Forecasting Using an Improved Model Integrating GCN and Transformer

1 School of Information and Communication Engineering, Hainan University, Haikou 570228, China
2 School of Computing and Artificial Intelligence, Hainan College of Software Technology, Qionghai 571499, China
3 School of Tourism, Hainan Normal University, Haikou 570228, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7003; https://doi.org/10.3390/app15137003
Submission received: 10 May 2025 / Revised: 12 June 2025 / Accepted: 18 June 2025 / Published: 21 June 2025

Abstract

Improving the accuracy of power load forecasting is an important step in optimizing power systems. Most existing studies on the short-term load forecasting task suffer from insufficient extraction of multi-scale features. To improve prediction accuracy, this study therefore designs a short-term power load forecasting model integrating a multi-scale GCN with an improved Transformer, as well as a prediction method based on this model. First, multi-feature power load data were collected. Second, the random forest algorithm was used to preprocess the data. Next, the multi-scale GCN was utilized to model the multi-scale spatio-temporal features in the power load data. The data processed by the multi-scale GCN were then input into the improved, MLLA-based Transformer module to extract long-term temporal dependencies. Subsequently, comparative experiments and ablation experiments were conducted on three public power datasets. The experimental results show that, compared to the baseline models, on the ETTh1 dataset the RMSE of the proposed model decreased by up to 0.314, the MAE decreased by up to 0.304, and the R2 improved by up to 9.45%; on the ETTm1 dataset, the RMSE decreased by up to 0.266, the MAE decreased by up to 0.231, and the R2 improved by up to 3.3%; and on the Australian dataset, the RMSE decreased by up to 494.366, the MAE decreased by up to 493.127, and the R2 improved by up to 54%, verifying the superiority and effectiveness of the proposed model.

1. Introduction

Driven by the global wave of sustainable development, China has officially initiated the implementation process of its “dual carbon” strategy, and the energy industry is undergoing profound transformation and reform. Against this backdrop, building an accurate energy demand forecasting system has become a core supporting element for energy enterprises to optimize the supply structure, achieve efficient resource allocation, and control operating costs.
Electricity, as the central carrier of the modern energy system, has special strategic value in demand forecasting. Different from other forms of energy, the physical characteristic of electricity being used immediately upon generation determines the extreme importance of the balance between supply and demand. The dynamic evolution of social production and living patterns may cause instantaneous fluctuations in electricity demand, and this characteristic amplifies the risk of resource waste caused by the mismatch between supply and demand. Establishing a high-precision power load forecasting mechanism is becoming a key approach for power enterprises to solve the problem of resource scheduling: through forward-looking demand analysis, enterprises can formulate more refined power generation plans, implement differentiated resource allocation strategies, and establish a demand-side response regulation mechanism, ultimately achieving a qualitative improvement in the utilization efficiency of power resources.
The construction of this predictive ability is not only related to the operational efficiency of individual enterprises but will also reshape the value creation model of the entire power industry. When precise demand insight is deeply integrated with intelligent dispatching technology, the power supply system will have stronger and more flexible adaptability, which can not only meet the dynamic demands of economic and social development but also effectively reduce system redundancy costs, laying an important data foundation for building a new type of power system. Power load forecasting refers to the process by which researchers predict the power demand at a certain future time through the use of mathematical models. According to the length of the prediction time step, the power load forecasting task can be divided into four types [1]: ultra-short-term predictions focus on time granularities ranging from minutes to hours; short-term predictions are typically made on a daily or weekly basis; medium-term predictions cover time spans from weeks to months; and long-term predictions are strategic judgments made on an annual basis. This time-domain division is deeply consistent with the operational characteristics of the power system. Compared to the fundamental supporting role of medium-term forecasts for production plans, short-term forecast results directly affect the start-up and shutdown decisions of generating units and the allocation of reserve capacity. In particular, the precise analysis and judgment of short-term predictions have become the key basis for power generation enterprises to optimize fuel procurement strategies and reduce start-stop losses. At the same time, they also provide important technical support for power grid dispatching institutions to implement peak-valley regulation and demand response.
Since power load forecasting was proposed, the academic community has established a diversified methodological system. Because the power load sequence typically exhibits multiple sources of uncertainty, existing research mainly follows two technical paths: traditional statistical models and deep learning architectures [2]. At the traditional methodological level, the Alberg team developed a non-seasonal prediction model and a sliding window algorithm based on the ARIMA framework, which were successfully applied to power demand prediction scenarios [3]. Sadaei achieved the fitting and optimization of the load curve by improving the ARMA model [4]. Furthermore, several researchers have attempted to apply feature selection methods to multiple sequences of exogenous variables in prediction scenarios [5,6,7]. However, studies in recent years have shown that, because they better fit the nonlinear features in power load data, modern deep learning methods have an accuracy advantage in the power load forecasting task. In the early years, RNN-series models were the mainstream models in this task. Among studies of power load forecasting based on RNN-family models, Vermaak [8] and Tang et al. [9] proposed two different RNN-based methods for the power load forecasting task. Tan et al. proposed a hybrid model for power load forecasting that takes LSTM as the main body and simultaneously introduces ensemble strategies such as Bagging and Boosting [10]. The experimental results show that this method can effectively improve the accuracy of power load forecasting. Sharma et al. developed a new model composed of FitzHugh-Nagumo (FHN) units, RNNs, and feedforward neural networks and applied it to power load forecasting [11]. The research showed that this model demonstrates excellent forecasting performance. In recent years, the research focus has shifted to graph neural networks (GNNs) and Transformer architectures [12,13,14,15]. Saeed et al. attempted to integrate GNNs and the Transformer for power load forecasting and achieved good results [16]. Although the abovementioned methods have continuously improved prediction accuracy, existing studies still lack systematic consideration of the multi-scale characteristics of load data. The heterogeneous features presented by exogenous variables and load sequences in different time dimensions have not been fully deconstructed, which leaves an important opening for subsequent methodological innovation [17,18]. Therefore, to address this gap, this paper proposes a multi-scale graph Transformer method for predicting short-term power load, with the following main research content:
(1)
A prediction method for short-term power load forecasting is proposed, built around the proposed multi-scale graph Transformer model. The model combines a multi-scale graph convolution module with a Transformer, uniting the advantages of the two networks, and can effectively extract spatio-temporal features at different scales as well as the temporal dependencies between scales.
(2)
The model introduces MLLA. This attention mechanism enables the model to keep its computational complexity linear while further modeling global features and processing long time series data.
(3)
The performance of the method was evaluated through experiments on three datasets. Compared to the other methods, the method proposed in this paper shows better performance.

2. Related Works

Power load forecasting essentially involves using data analysis methods to identify the key factors that affect power load and constructing models for prediction. Since the emergence of the power load forecasting task, a variety of methods have been applied to it. Early power load forecasting relied mainly on traditional methods. Owing to the limitations of the data collection equipment and the power supply-demand relationship at the time, data were limited and influencing factors were few. Researchers therefore often used traditional, simple methods, such as ARMA [19], ARIMA [20], and Exponential Smoothing [21], for prediction. These methods are simple to use and have low time complexity. However, with the upgrading of data collection equipment and the growing complexity of power relationships, traditional prediction methods are no longer adequate. People have therefore gradually introduced new methods, such as machine learning and the wavelet transform. Li et al. proposed an innovative power load forecasting method [22]. This method ingeniously utilizes the wavelet transform to decompose power data, then builds models for the different component data, and finally fuses the results to obtain the final forecast. This method significantly improves the accuracy of power load forecasting. Ceperic et al. improved the support vector regression (SVR) model and optimized its parameters using the PSO method, which also enhanced the prediction effect [23]. The work conducted by Hu is similar to that of Ceperic [24]: both used the PSO method to improve the SVR model to form a new power load forecasting method. The difference is that Hu adopted the memetic algorithm of the improved PSO to optimize the parameters of the SVR model. The experimental results prove that this method significantly improves prediction accuracy.
The abovementioned methods have undoubtedly made considerable progress and performed well in the field of power load forecasting. However, it is rather difficult for them to achieve precise feature mining and modeling as the amount of power load data increases. Researchers have therefore introduced deep learning methods. Vermaak et al. and Tang et al. both used Recurrent Neural Networks (RNNs) for power load forecasting [8,9]. Their experimental results indicated that the use of RNNs could improve forecasting accuracy. Abumohsen et al. compared the results of various methods, such as Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), and RNNs, in power load forecasting and found that the GRU performed best [25]. L'Heureux et al. applied the Transformer to power load forecasting [26]. The experimental results show that this method is superior to others. Liu et al. made improvements based on Temporal Convolutional Networks (TCNs) and DenseNet and proposed the Densenet-iTCN model for power load forecasting [27]. The experimental results verified that this method was superior to the baseline models. Zhu et al. effectively captured the influence of exogenous factors and time steps on the peak value in power load forecasting by using dual attention [28]. Lin et al. proposed an attention model that can adaptively select the characteristics of power load and explored the influence of the time step size [29]. Niu et al. added self-attention to enhance information transfer based on Convolutional Neural Networks (CNNs) and BiGRU, which helps uncover the relationships among multiple factors in the power load dataset [30].
These methods have demonstrated strong performance and achieved good prediction results. However, deficiencies remain. Because they fail to consider the internal differences across time scales and the long-term sequence relationships in power load forecasting data, these methods are still not precise enough when learning and modeling the data. Therefore, this paper presents a power load forecasting method that uses MSGNet [31] and the Transformer [32] as its basic framework and introduces MLLA, making the model's extraction of the multi-scale spatio-temporal features of the data more accurate and thereby improving forecasting accuracy.
The structure of this article is as follows: Section 1 provides an overview of the research background and briefly summarizes the innovations of the research. Section 2 systematically reviews the relevant literature, explores the deficiencies of existing research, and outlines the direction of the subsequent research. Section 3 introduces the proposed method and the background models it builds on. Section 4 presents a detailed analysis and interpretation of the results of the parameter experiments and comparative tests. Section 5 concludes the paper and outlines future research directions.

3. Methods

3.1. Preliminary Knowledge

To elaborate on the method proposed in this paper more clearly, in this section, we briefly explain the models it builds on.

3.1.1. MSGNet

MSGNet [31] is a multi-scale graph convolutional network proposed by Cai et al. in 2023. The model first embeds the data to obtain the timestamp and positional information within it. The embedded data are then fed into the scale graph block, where an FFT is applied to identify the dominant frequencies of the data and determine the number of scales. A corresponding spatio-temporal model is then constructed for each scale to model the spatio-temporal characteristics of the data at that scale. The model also uses multi-head attention to learn the long-term dependencies between different scales. Finally, the data enter the prediction layer, which outputs the final result. The formulas for determining the scales and constructing the spatio-temporal graph are, respectively, as follows [31]:
$F = \mathrm{Avg}(\mathrm{Amp}(\mathrm{FFT}(X_{\mathrm{emb}})))$ (1)
$f_1, \ldots, f_k = \mathop{\arg\mathrm{Topk}}_{f_{*} \in \{1, \ldots, L/2\}}(F), \quad s_i = \frac{L}{f_i}$ (2)
$A^{i} = \mathrm{Softmax}(\mathrm{ReLU}(E_1^{i} (E_2^{i})^{T}))$ (3)
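As a concrete illustration of Formulas (1) and (2), the following Python sketch (function and variable names are our own; this is a minimal sketch, not MSGNet's actual implementation) detects the dominant scales of a batch of embedded series via the FFT:

```python
import torch

def detect_scales(x_emb: torch.Tensor, k: int = 3):
    """Sketch of MSGNet-style scale detection via FFT (Formulas (1)-(2)).

    x_emb: embedded input of shape (batch, length, channels).
    Returns the top-k dominant frequencies, their periods s_i = L / f_i,
    and the corresponding amplitudes F_{f_i}.
    """
    L = x_emb.shape[1]
    # Amplitude spectrum along the time axis, averaged over batch and channels.
    amp = torch.abs(torch.fft.rfft(x_emb, dim=1)).mean(dim=(0, 2))
    amp[0] = 0.0  # ignore the DC component, approximating f* in {1, ..., L/2}
    amplitudes, freqs = torch.topk(amp, k)       # F_{f_i} and f_1, ..., f_k
    scales = (L // freqs.clamp(min=1)).tolist()  # s_i = L / f_i
    return freqs.tolist(), scales, amplitudes

# Example: a batch of 2 series of length 96 with 6 channels.
x = torch.randn(2, 96, 6)
print(detect_scales(x, k=3))
```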

3.1.2. MLLA

Mamba-Like Linear Attention (MLLA) [33] is a lightweight attention mechanism inspired by the State Space Model (SSM) [34]. Its core idea is to combine selective state scanning with linear attention computation, significantly reducing the computational complexity while maintaining global modeling capability. This mechanism integrates the structural design of the Mamba block into the linear attention (LA) block, reducing the complexity from quadratic to linear, O(N). Its advantage lies in maintaining linear computational complexity while improving the global modeling ability and inference speed of the model. The structure of MLLA is shown in Figure 1.
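To make the complexity reduction concrete, the following sketch shows the linear-attention core that MLLA builds on; the elu+1 feature map and all names here are illustrative assumptions, not the exact MLLA kernel. Computing the K^T V summary first makes the cost linear in sequence length:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Sketch of a linear-attention core (q, k, v: (batch, length, dim)).

    Computing (K^T V) first costs O(N * d^2) instead of the O(N^2 * d)
    of softmax attention. The elu+1 feature map keeps entries positive,
    a common choice in linear attention (an assumption here).
    """
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", k, v)               # O(N) summary of K^T V
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(2, 128, 64)
out = linear_attention(q, k, v)                           # shape (2, 128, 64)
```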

3.1.3. Transformer

Transformer [32] was first proposed by Google’s research team in 2017; since then, multiple variants, including Bert [35], Informer [36], and Reformer [37], have been developed. Its architecture is entirely based on the self-attention mechanism, completely changing the paradigm where traditional sequence modeling relies on RNNs or CNNs. Its core structure is composed of a stack of encoders and decoders. The encoder contains multi-layer and multi-head attention modules and a feedforward layer, and it captures the global dependencies of the input sequence through parallel computing. The decoder introduces masked multi-head attention on the basis of the encoder to ensure that predictions rely only on known information. Each module adopts Residual Connection and Layer Normalization to optimize the training stability. Among them, the calculation formula of multi-head attention is as follows [32]:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (4)
$\mathrm{head}_i = \mathrm{Attention}(XW_i^{Q}, XW_i^{K}, XW_i^{V})$ (5)
$\mathrm{Multihead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$ (6)
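For reference, Formulas (4)-(6) can be realized in a few lines of PyTorch; the sketch below writes out the scaled dot-product of Formula (4) and then uses the built-in multi-head attention module for the head splitting and output projection of Formulas (5)-(6):

```python
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V, as in Formula (4)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Formulas (5)-(6) via PyTorch's built-in module (batch-first inputs).
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 32, 64)        # (batch, sequence, embedding)
out, weights = mha(x, x, x)       # self-attention: Q = K = V = X
```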

3.2. MSGNet-MLLA-Transformer

3.2.1. Problem Formulation

This paper focuses on short-term power load forecasting. As a subtask of multivariate time series forecasting, the problem can be described as follows: Let $X_t^i$ represent the state of the $i$-th influence sequence of the power load at the $t$-th time step. Then, $X^i = \{X_1^i, \ldots, X_T^i\}$ represents the data of the $i$-th influence sequence at all historical moments, where $T$ is the historical time length, and $X = \{X^1, \ldots, X^n\}$ represents all the sequences that influence the power load, where $n$ is the number of sequences. If the future values of the power load target sequence $Y$ are predicted by the model through a sliding time window of size $w$, then $\hat{Y}_{(t+1:t+l)} = f(X_{(t-w+1:t)}, Y_{(t-w+1:t)}; \theta)$, where $\theta$ denotes the learnable parameters of the model, $t$ is the observation moment, $l$ is the prediction step size, $X_{(t-w+1:t)}$ and $Y_{(t-w+1:t)}$ are the historical influence-sequence and target-sequence data intercepted by the sliding time window, respectively, and $f$ is the model proposed in this paper.
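The sliding-window construction above can be sketched as follows (a minimal illustration with hypothetical array shapes, not the paper's data loader):

```python
import numpy as np

def sliding_windows(X, y, w: int, l: int):
    """Build (input, target) pairs for the formulation in Section 3.2.1.

    X: exogenous sequences of shape (T, n); y: target load of shape (T,).
    Each sample uses a window of w past steps of X and y to predict the
    next l steps of y. Variable names mirror the text.
    """
    inputs, past_y, future_y = [], [], []
    for t in range(w, len(y) - l + 1):
        inputs.append(X[t - w:t])      # X_{(t-w+1 : t)}
        past_y.append(y[t - w:t])      # Y_{(t-w+1 : t)}
        future_y.append(y[t:t + l])    # Y_{(t+1 : t+l)}, the prediction target
    return np.stack(inputs), np.stack(past_y), np.stack(future_y)

X = np.random.rand(1000, 5)            # e.g., 5 exogenous series after selection
y = np.random.rand(1000)
xs, ys_hist, ys_future = sliding_windows(X, y, w=12, l=6)
```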

3.2.2. Method Framework

The process of the prediction method proposed in this paper is shown in Figure 2.
Specifically, this method consists of ten steps.
Step 1. Data collection. Collect weather data and power data in hourly or daily time units to form a power load dataset.
Step 2. Preprocess the collected data, which specifically includes handling missing values, feature selection, normalization, etc.
Step 3. Data embedding. Perform timestamp and position embedding on the preprocessed data to extract the information within it.
Step 4. Use the fast Fourier operation to determine the scale information in the data. This step is achieved through Formulas (1) and (2).
Step 5. Construct multi-scale spatio-temporal graphs: Based on the scales determined in Step 4, a spatio-temporal graph is constructed for each scale according to Formula (3), thereby modeling the data structure within different scales of the data.
Step 6. Construct the MLLA-Transformer encoding layer: To model the long-term sequence dependency relationships existing among data at different scales, the MLLA-Transformer encoding layer is constructed. The specific structure is, in sequence, the multi-head MLLA layer, the normalization layer, the feedforward layer, and the normalization layer. Among them, the multi-head MLLA layer is used to learn the internal relationships within long sequences between data scales.
Step 7. Construct the MLLA-Transformer decoding layer. The specific structure is as follows: masked multi-head MLLA layer, normalization layer, multi-head MLLA attention layer, normalization layer, feedforward layer, and normalization layer. The use of multi-head MLLA helps the model extract the deep global information in the data and calculate the similarity within the sequence without increasing the computational consumption.
Step 8. The amplitudes corresponding to each scale obtained by the FFT are passed through a Softmax layer and then multiplied element-wise with the outputs of the corresponding multi-scale Transformers to obtain the final model result (a sketch of this fusion is given after Step 10). The specific formula is as follows:
$\mathrm{Out} = \sum_{i=1}^{k} \mathrm{Softmax}(F_{f_i}) \cdot \mathrm{Transformer}_{\mathrm{out}}^{i}$ (7)
In the formula, $\mathrm{Transformer}_{\mathrm{out}}^{i}$ denotes the Transformer output corresponding to the $i$-th scale, and $F_{f_i}$ is the amplitude corresponding to scale $i$, calculated by Formulas (1) and (2).
Step 9. The model is trained and iterated N times, and the training results are verified and evaluated.
Step 10. Test the model and output the final result.
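As a small illustration of the Step 8 fusion in Formula (7), the following sketch (assuming the per-scale Transformer outputs are already available; all names are illustrative) weights each scale's output by its softmax-normalized FFT amplitude:

```python
import torch

def fuse_scale_outputs(transformer_outs, scale_amplitudes):
    """Sketch of the Step 8 fusion in Formula (7).

    transformer_outs: list of k tensors, each (batch, horizon, channels),
    one per scale; scale_amplitudes: tensor of the k FFT amplitudes F_{f_i}
    from Formulas (1)-(2). The amplitudes are softmax-normalized and used
    as weights for the per-scale Transformer outputs.
    """
    weights = torch.softmax(scale_amplitudes, dim=0)           # (k,)
    stacked = torch.stack(transformer_outs, dim=0)             # (k, B, H, C)
    return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)    # (B, H, C)

outs = [torch.randn(8, 6, 1) for _ in range(3)]                # k = 3 scales
amps = torch.tensor([4.2, 2.9, 1.1])
y_hat = fuse_scale_outputs(outs, amps)
```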

4. Analysis of Experimental Results

4.1. Data Source

A total of three datasets were used in the experiments of this paper, namely, the ETTh1 and ETTm1 sub-datasets of the ETT dataset and the public power load dataset of a certain region in Australia.
ETT Dataset (https://github.com/zhouhaoyi/ETDataset, accessed on 1 April 2025): This dataset was collected from a certain area in Xinjiang, China, with a time period spanning from July 2016 to July 2018, and contains information such as power load and oil temperature. There are two time granularity levels for sub-datasets in the dataset: hourly ETTh1 and ETTh2 and minute-level ETTm1 and ETTm2. The hourly sub-datasets all contain 17,420 pieces of data, while the minute-level sub-datasets contain 69,680 pieces of data. In this study, we selected ETTh1 and ETTm1 as the experimental datasets.
Australian Electricity Load Dataset (https://gitcode.com/qq_42998340/Australia, accessed on 1 April 2025): This dataset contains the electricity load of a certain region in Australia. It contains six variable sequences (dry bulb temperature, dew point temperature, wet bulb temperature, temperature, electricity price, and electricity load). The dataset spans from 1 January 2006 to 1 January 2011 and is recorded every 0.5 h, for a total of 87,648 records.
For the three datasets in this paper, the training set, validation set, and test set are divided in a ratio of 7:1:2.
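As a small check, a chronological 7:1:2 split (splitting in time order is the usual convention for forecasting; the exact procedure is our assumption) reproduces the ETTh1 division reported in Table 1:

```python
def split_sizes(n: int, ratios=(0.7, 0.1, 0.2)):
    """Chronological 7:1:2 split of an n-step series (a sketch).

    Time series are split in order rather than shuffled, so the test set
    always follows the training and validation sets in time.
    """
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return n_train, n_val, n - n_train - n_val

print(split_sizes(17420))  # ETTh1: (12194, 1742, 3484), matching Table 1
```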
Table 1 presents each data source, including the prediction targets and the specific division of the dataset.

4.2. Data Preprocessing

Before making predictions, it is necessary to preprocess the data first, which is helpful to improve the prediction performance. The commonly used data preprocessing steps mainly include missing value filling and data normalization processing. Since the MLLA in the method proposed in this paper can only accept the input of an even number of sequences, it is also necessary to perform feature selection and dimension reduction processing on the dataset containing an odd number of sequences.
Since there are no missing data in the three datasets used in this study, the data preprocessing in this paper includes two steps: dimension reduction for the datasets containing an odd number of sequences, and normalization. To reduce the dimension of the data, the random forest method is adopted. The essence of the random forest method is to ensemble decision trees for dimensionality reduction, calculate the relevance of the data by using the out-of-bag (OOB) error, and sort and filter the sequences [38].
Specifically, a decision tree is generated first to select the input sequence and the segmentation points on the sequence. Therefore, the space containing the input sequence can be divided into two regions by the selected sequence and the segmentation points on the sequence. When the features are discrete, the two regions can be obtained by the following formula:
$K_1(s, p) = \{x_m \mid x_m^{s} = p\}, \quad K_2(s, p) = \{x_m \mid x_m^{s} \neq p\}$ (8)
When the features are continuous, the two regions can be obtained by the following formula:
$K_1(s, p) = \{x_m \mid x_m^{s} \leq p\}, \quad K_2(s, p) = \{x_m \mid x_m^{s} > p\}$ (9)
In the above formulas, $K_1$ and $K_2$ represent the two regions obtained through division, $s$ is the selected feature sequence, and $p$ is the segmentation point.
The operation of dividing the area is achieved through the following formula:
$\min_{(s, p)} \left[ \min_{d_1} \sum_{x_i \in K_1(s, p)} (y_i - d_1)^2 + \min_{d_2} \sum_{x_i \in K_2(s, p)} (y_i - d_2)^2 \right]$ (10)
In the formula, $d_1$ and $d_2$ are the predicted values of the data in the two regions.
Next, the above operation is repeated on the divided regions until the value of Formula (10) no longer decreases, at which point the decision tree is obtained.
Therefore, the steps of feature selection by the random forest algorithm, an ensemble of such decision trees, are as follows (a sketch is given after this list):
  • Calculate the importance of each sequence and sort the sequences in descending order.
  • Determine the elimination ratio for each round and eliminate sequences based on the importance calculated in the first step to obtain a new sequence dataset.
  • Repeat the above steps on the new dataset until only the predetermined number of feature sequences remains.
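A minimal sketch of this screening with scikit-learn follows; the paper ranks features via the OOB error, while impurity-based importance is used here as a readily available stand-in, and all names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_feature_selection(X, y, feature_names, n_keep: int = 5):
    """Sketch of importance-based feature screening with a random forest.

    A forest is fit on the exogenous sequences X (shape (T, n)) against
    the target load y, features are ranked by impurity-based importance,
    and only the n_keep most important ones are retained.
    """
    rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]   # descending importance
    kept = [feature_names[i] for i in order[:n_keep]]
    return kept, X[:, order[:n_keep]]

X = np.random.rand(500, 6)                              # six exogenous series
y = np.random.rand(500)
names = ["HUFL", "HULL", "MUFL", "MULL", "LUFL", "LULL"]
kept, X_reduced = rf_feature_selection(X, y, names)
```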
Since both the ETTh1 and ETTm1 datasets contain an odd number of sequences, we performed dimensionality reduction selection on the exogenous variable sequences, transforming the odd-number sequence dataset into an even-number sequence dataset.
The sequences obtained from the ETTh1 and ETTm1 datasets after random forest dimensionality reduction are shown in Figure 3.
As shown in Figure 3, among the six exogenous variables in the ETTh1 dataset, we screened out the five exogenous variables that are of the highest importance for the prediction of the target sequence. Their ranking based on importance, from high to low, is HULL, MULL, LUFL, MUFL, and HUFL. Among the six exogenous variables in the ETTm1 dataset, the five most important ones that were screened out, in sequence, are MUFL, HUFL, LUFL, HULL, and MULL. After feature screening, both the ETTh1 and ETTm1 datasets contained 5 exogenous variables and 1 target variable, which can be further processed using MLLA.

4.3. Experimental Setup

4.3.1. Baseline

To verify the superiority and effectiveness of the proposed method, we set up a comparative experiment with seven baseline methods, namely, ARIMA, GRU [39], TCN [40], GRU-Attention [41], TCN-Attention [42], MSGNet [31], and MrCAN [43]. GRU is a variant of RNNs that makes predictions by extracting the temporal features of time series; Becerra-Rico et al. applied it to multivariate time series prediction in 2020 [39]. The TCN is a temporal convolutional network; the dilated causal convolution in its structure can handle scale changes in the sequence while extracting temporal features. The GRU-Attention model is a deep learning framework that integrates GRU and the attention mechanism. Compared to the traditional GRU model, GRU-Attention can not only capture local temporal patterns but also establish global dependencies. The TCN-Attention model integrates TCNs and the attention mechanism: it captures the temporal characteristics of sequence data through TCNs and then dynamically focuses on the key information using attention. MSGNet is a multi-scale prediction method. It uses the FFT to extract the dominant frequencies of the sequence data and determine the number of scales k, and it constructs a spatio-temporal graph for each scale to extract multi-scale spatio-temporal features. MrCAN utilizes a small-sample learning module and a spatio-temporal relationship learning module to learn the relationships among sample data and the temporal and spatial relationships within the same sample.

4.3.2. Loss Function

This paper selects MSE as the loss function, and its formula is as follows:
$\mathrm{MSE} = \frac{1}{m} \sum_{t=1}^{m} (y_t - \hat{y}_t)^2$ (11)
In order to evaluate the experimental results, RMSE, MAE, and $R^2$ are adopted as the experimental evaluation indicators in this paper. The specific calculation methods are as follows.
$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{t=1}^{m} (y_t - \hat{y}_t)^2}$ (12)
$\mathrm{MAE} = \frac{1}{m} \sum_{t=1}^{m} \left| y_t - \hat{y}_t \right|$ (13)
$R^2 = 1 - \frac{\sum_t (y_t - \hat{y}_t)^2}{\sum_t (\bar{y} - y_t)^2}$ (14)
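For completeness, Formulas (12)-(14) can be computed directly; the following sketch evaluates a toy prediction:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute the RMSE, MAE, and R^2 metrics of Formulas (12)-(14)."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "R2": r2}

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(evaluate(y_true, y_pred))  # {'RMSE': 0.158..., 'MAE': 0.15, 'R2': 0.98}
```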

4.3.3. Experimental Platform

The experiments were carried out using the PyTorch 1.13.1 library on the Python 3.9.18 platform. A laptop with an Intel Core i5 processor, 16 GB of memory, and an NVIDIA RTX 4060 graphics card was used as the experimental equipment.
For the model in this paper, the parameter settings for the load forecasting are shown in Table 2.

4.4. Sliding Time Window Size Experiment

The size of the sliding time window is a tunable parameter and has a significant impact on the predictive performance of the model. Therefore, in order to obtain a better prediction effect, we set up a sliding time window size experiment. The window size was set successively to 6, 9, 12, 18, 24, 30, and 42; single-step prediction was conducted on the three datasets, and the trend of the results was observed to find the optimal sliding window size.
Figure 4 shows the influence of the sliding time window size on the model’s prediction performance based on different datasets. For the ETTh1 dataset (Figure 4a), RMSE and MAE reach their lowest values when the window size is 12, while R2 reaches its highest value, indicating that 12 is the best prediction window. For the ETTm1 dataset (Figure 4b), RMSE and MAE are the lowest when the window size is 18, and R2 is the highest. It is determined that 18 is the optimal window. For the Australian power load dataset (Figure 4c), RMSE and MAE are the lowest when the window size is 24, while R2 is the highest. Therefore, 24 is selected as the best prediction window size. Overall, the optimal prediction window sizes corresponding to each dataset are 12, 18, and 24, respectively.

4.5. Comparative Experiment

The goal of the comparative experiment was to verify the superiority and effectiveness of the proposed method. In this paper, we set up comparative experiments using three datasets to verify and analyze the superiority and effectiveness of the proposed MSGNet-MLLA-Transformer method. Regarding the presentation of the results, since the test sets of the three datasets have a large amount of predicted data, the overall image cannot intuitively show the performance differences of each model. Therefore, we chose the interval with more intuitive differences for the presentation of the results.
The visualization results of the target sequences of the three datasets are shown in Figure 5. From Figure 5, we can see that the target sequences of the three datasets have different degrees of periodicity, and the periodicity is most obvious in the target sequence of the Australian electricity load dataset.

4.5.1. ETTh1 Dataset Experiment

Table 3 shows the comparison of RMSE, MAE, and R2 between MSGNet-MLLA-Transformer and mainstream models under different prediction step sizes (1, 3, 6, and 12 steps) on the ETTh1 dataset. The experimental data show that the model in this paper has systematic advantages in all prediction scenarios: its RMSE is, on average, approximately 0.04 lower than the optimal benchmark, and its R2 is, on average, 1.0% higher. The ranking of model performance is as follows: Ours > MrCAN > MSGNet > GRU/TCN (GRU has the weakest performance) > ARIMA, and the visualization results in Figure 6, Figure 7, Figure 8 and Figure 9 further verify this ranking. In the visualized results, we present only the models other than ARIMA, since the differences among them are relatively small. This advantage stems from the fact that MSGNet-MLLA-Transformer can jointly achieve multi-scale parsing of spatio-temporal features and long-term dependency capture. In contrast, although MrCAN is good at joint spatio-temporal learning and sample relationship modeling, its insufficient multi-scale decomposition leads to error accumulation. The MSGNet prototype has advantages in spatial multi-scale feature extraction but is limited by bottlenecks in sample relationship modeling and long-term dependency capture. Traditional models such as GRU and TCN have obvious disadvantages in scenarios with strong spatio-temporal coupling due to the lack of effective spatial feature extraction and cross-sample relationship modeling capabilities (their performance improves after adding the attention mechanism). These findings not only verify the necessity of joint multi-dimensional feature modeling for power prediction but also highlight the importance of collaborative optimization in dimensions such as spatio-temporal feature decoupling, the balance of long- and short-term dependencies, and sample relationship mining.

4.5.2. ETTm1 Dataset Experiment

Table 4 comprehensively compares the RMSE, MAE, and R2 performances of MSGNet-MLLA-Transformer and the benchmark models under different prediction step sizes (1, 3, 6, and 12 steps) on the ETTm1 dataset. In the visualized results, we present only the models other than ARIMA, since the differences among them are relatively small. Combined with the analysis in Figure 10, Figure 11, Figure 12 and Figure 13, it can be seen that in the short- and medium-term predictions (steps 1, 3, and 6), all indicators of this model are superior to those of traditional models such as GRU and TCN (with a maximum increase of 1.6% in R2), while in the 12-step long-term prediction, the accuracy is comparable to that of the MSGNet prototype. This phenomenon indicates that for the ETTm1 dataset, capturing the multi-scale spatio-temporal coupling characteristics of the power load (such as minute-level fluctuations and hour-level cycles) contributes significantly more to prediction accuracy than long-term dependency modeling capability. This explains why MSGNet and its improvement, MSGNet-MLLA-Transformer (with stronger multi-scale feature extraction capability), perform best. Although the overall performance of the MrCAN model is not satisfactory, its spatial attention mechanism is still valuable in capturing regional correlations.

4.5.3. Australian Electricity Load Dataset Experiment

In the visualized results, we present only the models other than ARIMA, since the differences among them are relatively small. As shown in Table 5 and Figure 14, Figure 15, Figure 16 and Figure 17, in the comparative experiments with different prediction steps on the Australian power load dataset, the MSGNet-MLLA-Transformer method proposed in this paper achieves lower RMSE and MAE than all baseline models and a higher R2, giving the best performance. The ranking of model performance is as follows: Ours > MSGNet > MrCAN > TCN-Attention/GRU-Attention > TCN/GRU (GRU is comparable to MSGNet in single-step prediction) > ARIMA. We believe this phenomenon shows that capturing the spatio-temporal characteristics and long-term dependencies of the data is crucial for improving prediction performance. The method proposed in this paper achieves the best performance by effectively decoupling spatio-temporal features and optimizing multi-scale dynamic interaction. Although MSGNet is good at multi-scale spatio-temporal modeling, its capture of long-term dependencies is insufficient. MrCAN lacks an explicit multi-scale decomposition mechanism, and GRU/TCN and their attention-enhanced variants perform worse due to weak spatial modeling ability, limited receptive fields, or insufficient decoupling of spatio-temporal features. These results highlight the crucial roles of collaborative optimization, spatio-temporal feature decoupling, and multi-scale interaction in the performance of power load forecasting models.

4.6. Ablation Experiment

The purpose of the ablation experiment is to verify the validity of the components in the model. In the experiments of this paper, in order to test the effectiveness of MLLA and the Transformer components improved based on MLLA, we designed two variants: (1) removing MLLA (-w/o MLLA) and (2) removing the MLLA-Transformer component (-w/o MLLA-Transformer).
Figure 18 shows the index results of the ablation experiments conducted on three datasets. It can be found from the figure that when the MLLA or the MLLA-Transformer component is removed, the prediction accuracy of the model for the data decreases, which also verifies the effectiveness of the proposed model.

4.7. Robustness Analysis

To test the robustness of the model, we adopted a fixed-proportion generation method to inject outliers into the three datasets, simulating real-world data, and tested the model's performance on them.
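A sketch of such fixed-proportion outlier injection is given below; the proportion and magnitude are illustrative assumptions, since the paper does not state its exact settings:

```python
import numpy as np

def inject_outliers(series, proportion: float = 0.05, scale: float = 3.0, seed: int = 0):
    """Sketch of fixed-proportion outlier injection for the robustness test.

    A fixed share of time steps is perturbed by +/- scale standard
    deviations of the series; proportion and scale are illustrative.
    """
    rng = np.random.default_rng(seed)
    noisy = series.copy()
    n_out = int(len(series) * proportion)
    idx = rng.choice(len(series), size=n_out, replace=False)
    noisy[idx] += rng.choice([-1.0, 1.0], size=n_out) * scale * series.std()
    return noisy, idx

load = np.random.rand(1000)
noisy_load, outlier_idx = inject_outliers(load, proportion=0.05)
```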
Table 6 presents the results of the robustness analysis of our model on the three datasets. It can be seen that when noisy inputs are introduced, the model maintains predictive performance similar to that obtained with noise-free inputs. Figure 19 shows the fitting curves of the robustness experiment on the ETTh1 dataset. To observe the fit clearly, we selected the interval from 1000 to 1500, where the displayed outliers are relatively dense. The figure shows that the model also fits the data well when facing outliers.

4.8. Computational Cost

We compared the training time and inference time for six-step prediction of the proposed MSGNet-MLLA-Transformer model with those of MSGNet and MrCAN, using 100 samples from the Australian power load dataset. The results are shown in Table 7. Relative to the two baselines, the training time and the inference time of our model occupy different positions in the ranking. Considering the computational cost and the predictive performance together, the method proposed in this paper has certain advantages.

5. Conclusions

To address the problem that existing short-term load forecasting research often ignores the modeling of multi-scale spatio-temporal features, an MSGNet-MLLA-Transformer method is proposed for this task scenario. Specifically, the MSGNet-MLLA-Transformer proposed in this paper integrates MSGNet with a Transformer whose attention is enhanced by MLLA. The sliding time window experiments on the ETTh1, ETTm1, and Australian power load datasets all show that too small a time window cannot improve the prediction performance of the model, while an overly large time window degrades it; the most suitable window sizes are 12, 18, and 24, respectively. The results of the comparative and ablation experiments show that the prediction performance of the deep learning models is superior to that of the ARIMA statistical model, and models able to extract multi-scale spatio-temporal features are superior to models that extract only temporal dependencies or only spatial features. Among the multi-scale spatio-temporal feature extraction models, the model proposed in this paper can jointly achieve multi-scale parsing of spatio-temporal features and capture of long-term dependencies, thus achieving the best results.
Due to the limitations of our research goals and space, the interpretability module and probability prediction module were not systematically integrated. In the future, we will further carry out research on interpretability, probability prediction, and other aspects to resolve the problems of insufficient post-event interpretability and the insufficient uncertainty prediction ability of the model in relation to practical applications.
  • We plan to introduce SHAP value analysis to quantify the marginal contribution of input features (such as historical load and spatio-temporal nodes) to the prediction results;
  • We aim to visualize cross-scale attention maps and analyze the attention patterns of the model at different time resolutions (such as minute-level fluctuations and hourly cycles);
  • By using probability prediction methods, we aim to obtain the interval probability distribution of the predicted values to enhance the practical application value of the prediction results.

Author Contributions

Methodology, M.W. and C.C.; software, M.W., W.F. and C.C.; validation, M.W. and W.F.; formal analysis, M.W., W.F. and C.C.; data curation, M.W., W.F. and C.C.; writing—original draft preparation, M.W., W.F., X.L. and C.C.; writing—review and editing, M.W., W.F., X.L. and C.C.; visualization, M.W., X.L. and C.C.; supervision, M.W. and Y.L.; project administration, M.W. and Y.L.; and funding acquisition, M.W. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the “Rising Star in the South China Sea” project of Hainan Province (NHXXRCXM202322) and the Education Department of Hainan Province (Hnky2024-76).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study can be found at https://github.com/zhouhaoyi/ETDataset (accessed on 1 April 2025) and https://gitcode.com/qq_42998340/Australia (accessed on 1 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mustapha, M.; Mustafa, M.; Khalid, S.; Abubakar, I.; Shareef, H. Classification of electricity load forecasting based on the factors influencing the load consumption and methods used: An overview. In Proceedings of the 2015 IEEE Conference on Energy Conversion (CENCON 2015), Johor Bahru, Malaysia, 19–20 October 2015. [Google Scholar]
  2. Klyuev, R.; Morgoev, I.; Morgoeva, A.; Gavrina, O.A.; Martyushev, N.V.; Efremenkov, E.A.; Mengxu, Q. Methods of forecasting electric energy consumption: A literature review. Energies 2022, 15, 8919. [Google Scholar] [CrossRef]
  3. Alberg, D.; Last, M. Short-term load forecasting in smart meters with sliding window-based ARIMA algorithms. Vietnam J. Comput. Sci. 2018, 5, 241–249. [Google Scholar] [CrossRef]
  4. Sadaei, H.; Guimarães, F.; Silva, C.J.; Lee, M.H.; Eslami, T. Short-term load forecasting method based on fuzzy time series, seasonality and long memory process. Int. J. Approx. Reason. 2017, 83, 196–217. [Google Scholar] [CrossRef]
  5. Xing, Q.; Huang, X.; Wang, J.; Wang, S. A novel multivariate combined power load forecasting system based on feature selection and multi-objective intelligent optimization. Expert Syst. Appl. 2024, 244, 122970. [Google Scholar] [CrossRef]
  6. Liu, W.; Mao, Z. Short-term photovoltaic power forecasting with feature extraction and attention mechanisms. Renew. Energy 2024, 226, 120437. [Google Scholar] [CrossRef]
  7. Fan, G.; Han, Y.; Li, J.; Peng, L.; Yeh, Y.; Hong, W. A hybrid model for deep learning short-term power load forecasting based on feature extraction statistics techniques. Expert Syst. Appl. 2024, 238, 122012. [Google Scholar] [CrossRef]
  8. Vermaak, J.; Botha, E. Recurrent neural networks for short-term load forecasting. IEEE Trans. Power Syst. 1998, 13, 126–132. [Google Scholar] [CrossRef]
  9. Tang, X.; Dai, Y.; Liu, Q.; Dang, X.; Xu, J. Application of bidirectional recurrent neural network combined with deep belief network in short-term load forecasting. IEEE Access 2019, 7, 160660–160670. [Google Scholar] [CrossRef]
  10. Tan, M.; Yuan, S.; Li, S.; Su, Y.; Li, H.; He, F. Ultra-short-term industrial power demand forecasting using LSTM based hybrid ensemble learning. IEEE Trans. Power Syst. 2020, 35, 2937–2948. [Google Scholar] [CrossRef]
  11. Sharma, V.; Srinivasan, D. A hybrid intelligent model based on recurrent neural networks and excitable dynamics for price prediction in deregulated electricity market. Eng. Appl. Artif. Intell. 2013, 26, 1562–1574. [Google Scholar] [CrossRef]
  12. Zhang, J.; Li, H.; Cheng, P.; Yan, J. Interpretable Wind Power Short-Term Power Prediction Model Using Deep Graph Attention Network. Energies 2024, 17, 384. [Google Scholar] [CrossRef]
  13. Xie, Y.; Zheng, J.; Taylor, G.; Hulak, D. A short-term wind power prediction method via self-adaptive adjacency matrix and spatiotemporal graph neural networks. Comput. Electr. Eng. 2024, 120, 109715. [Google Scholar] [CrossRef]
  14. Mo, S.; Wang, H.; Li, B.; Xue, Z.; Fan, S.; Liu, X. Powerformer: A temporal-based transformer model for wind power forecasting. Energy Rep. 2024, 11, 736–744. [Google Scholar] [CrossRef]
  15. Xiang, L.; Fu, X.; Yao, Q.; Zhu, G.; Hu, A. A novel model for ultra-short term wind power prediction based on Vision Transformer. Energy 2024, 294, 130854. [Google Scholar] [CrossRef]
  16. Saeed, F.; Rehman, A.; Shah, H.A.; Diyan, M.; Chen, J.; Kang, J.-M. SmartFormer: Graph-based transformer model for energy load forecasting. Sustain. Energy Technol. Assess. 2025, 73, 104133. [Google Scholar] [CrossRef]
  17. Guo, X.; Zhao, Q.; Zheng, D.; Ning, Y.; Gao, Y. A short-term load forecasting model of multi-scale CNN-LSTM hybrid neural network considering the real-time electricity price. Energy Rep. 2020, 6, 1046–1053. [Google Scholar] [CrossRef]
  18. Yin, L.; Xie, J. Multi-temporal-spatial-scale temporal convolution network for short-term load forecasting of power systems. Appl. Energy 2021, 283, 116328. [Google Scholar] [CrossRef]
  19. Pappas, P.; Ekonomou, L.; Karamousantas, D.; Chatzarakis, G.E.; Katsikas, S.K.; Liatsis, P. Electricity demand loads modeling using AutoRegressive Moving Average (ARMA) models. Energy 2008, 33, 1353–1360. [Google Scholar] [CrossRef]
  20. Shi, J.; Qu, X.; Zeng, S. Short-Term Wind Power Generation Forecasting: Direct Versus Indirect Arima-Based Approaches. Int. J. Green Energy 2011, 8, 100–112. [Google Scholar] [CrossRef]
  21. Taylor, J. Short-Term Load Forecasting With Exponentially Weighted Methods. IEEE Trans. Power Syst. 2012, 27, 458–464. [Google Scholar] [CrossRef]
  22. Li, S.; Goel, L.; Wang, P. An ensemble approach for short-term load forecasting by extreme learning machine. Appl. Energy 2016, 170, 22–29. [Google Scholar] [CrossRef]
  23. Ceperic, E.; Ceperic, V.; Baric, A. A Strategy for Short-Term Load Forecasting by Support Vector Regression Machines. IEEE Trans. Power Syst. 2013, 28, 4356–4364. [Google Scholar] [CrossRef]
  24. Hu, Z.; Bao, Y.; Xiong, T. Comprehensive learning particle swarm optimization based memetic algorithm for model selection in short-term load forecasting using support vector regression. Appl. Soft Comput. 2014, 25, 15–25. [Google Scholar] [CrossRef]
  25. Abumohsen, M.; Owda, A.; Owda, M. Electrical load forecasting using LSTM, GRU, and RNN algorithms. Energies 2023, 16, 2283. [Google Scholar] [CrossRef]
  26. L’Heureux, A.; Grolinger, K.; Capretz, M. Transformer-based model for electrical load forecasting. Energies 2022, 15, 4993. [Google Scholar] [CrossRef]
  27. Liu, M.; Qin, H.; Cao, R.; Deng, S. Short-Term Load Forecasting Based on Improved TCN and DenseNet. IEEE Access 2022, 10, 115945–115957. [Google Scholar] [CrossRef]
  28. Zhu, K.; Li, Y.; Mao, W.; Li, F.; Yan, J. LSTM enhanced by dual-attention-based encoder-decoder for daily peak load forecasting. Electr. Power Syst. Res. 2022, 208, 107860. [Google Scholar] [CrossRef]
  29. Lin, J.; Ma, J.; Zhu, J.; Cui, Y. Short-term load forecasting based on LSTM networks considering attention mechanism. Int. J. Electr. Power Energy Syst. 2022, 137, 107818. [Google Scholar] [CrossRef]
  30. Niu, D.; Yu, M.; Sun, L.; Gao, T.; Wang, K. Short-term multi-energy load forecasting for integrated energy systems based on CNN-BiGRU optimized by attention mechanism. Appl. Energy 2022, 313, 118801. [Google Scholar] [CrossRef]
  31. Cai, W.; Liang, Y.; Liu, X.; Feng, J.; Wu, Y. MSGNet: Learning multi-scale inter-series correlations for multivariate time series forecasting. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI2024), Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Song, J.; Song, S.; Zheng, B.; Huang, G. Demystify Mamba in Vision: A Linear Attention Perspective. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  34. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the 2024 International Conference on Machine Learning (ICML2024), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  35. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP2018), Brussels, Belgium, 1–4 November 2018. [Google Scholar]
  36. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence timeseries forecasting. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI2021), Vancouver, BC, Canada, 2–6 February 2021. [Google Scholar]
  37. Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The efficient transformer. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  38. Peng, L.; Wang, L.; Ai, X.; Zeng, Y. Forecasting tourist arrivals via random forest and long short-term memory. Cogn. Comput. 2021, 13, 125–138. [Google Scholar] [CrossRef]
  39. Becerra-rico, J.; Aceves-fernandez, M.; Esquivel-escalante, K.; Pedraza-Ortega, J.C. Airborne Particle Pollution Predictive Model Using Gated Recurrent Unit (GRU) Deep Neural Networks. Earth Sci. Inform. 2020, 13, 821–834. [Google Scholar] [CrossRef]
  40. Bai, S.; Kolter, Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  41. Jung, S.; Moon, J.; Park, S.; Hwang, E. An Attention-Based Multi Layer GRU Model for Multi-Step-Ahead Short-Term Load Forecasting. Sensors 2021, 21, 1639. [Google Scholar] [CrossRef] [PubMed]
  42. Li, L.; Lin, S.; Jia, J. Short-term Load Forecasting Based on TCN-Attention Neural Network. Electr. Power Inf. Commun. Technol. 2023, 21, 10–16. [Google Scholar]
  43. Zhang, J.; Dai, Q. MrCAN: Multi-relations aware convolutional attention network for multivariate time series forecasting. Inf. Sci. 2023, 643, 119277. [Google Scholar] [CrossRef]
  44. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference of Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Figure 1. MLLA structure.
Figure 1. MLLA structure.
Applsci 15 07003 g001
Figure 2. MSGNet-MLLA-Transformer method flow.
Figure 2. MSGNet-MLLA-Transformer method flow.
Applsci 15 07003 g002
Figure 3. Sequences left after random forest feature selection processing based on the ETTh1 and ETTm1 datasets, listed as (a) ETTh1 and (b) ETTm1.
Figure 3. Sequences left after random forest feature selection processing based on the ETTh1 and ETTm1 datasets, listed as (a) ETTh1 and (b) ETTm1.
Applsci 15 07003 g003
Figure 4. Plot of time window metrics results on three datasets, listed as (a) ETTh1, (b) ETTm1, and (c) Australian electricity load.
Figure 4. Plot of time window metrics results on three datasets, listed as (a) ETTh1, (b) ETTm1, and (c) Australian electricity load.
Applsci 15 07003 g004
Figure 5. Visualization of the target sequences of the three datasets, listed as (a) ETTh1; (b) ETTm1; and (c) Australian electricity load.
Figure 5. Visualization of the target sequences of the three datasets, listed as (a) ETTh1; (b) ETTm1; and (c) Australian electricity load.
Applsci 15 07003 g005
Figure 6. Comparison chart fitting the 1-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 6. Comparison chart fitting the 1-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g006aApplsci 15 07003 g006b
Figure 7. Comparison chart fitting the 3-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 7. Comparison chart fitting the 3-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g007aApplsci 15 07003 g007b
Figure 8. Comparison chart fitting the 6-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 8. Comparison chart fitting the 6-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g008aApplsci 15 07003 g008b
Figure 9. Comparison chart fitting the 12-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 9. Comparison chart fitting the 12-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g009aApplsci 15 07003 g009b
Figure 10. Comparison chart fitting the 1-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 10. Comparison chart fitting the 1-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g010aApplsci 15 07003 g010b
Figure 11. Comparison chart fitting the 3-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 11. Comparison chart fitting the 3-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g011aApplsci 15 07003 g011b
Figure 12. Comparison chart fitting the 6-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 12. Comparison chart fitting the 6-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g012aApplsci 15 07003 g012b
Figure 13. Comparison chart fitting the 12-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 13. Comparison chart fitting the 12-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g013aApplsci 15 07003 g013b
Figure 14. Comparison chart fitting the 1-step prediction experimental model of the Australian electricity load dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 14. Comparison chart fitting the 1-step prediction experimental model of the Australian electricity load dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g014aApplsci 15 07003 g014b
Figure 15. Fitting comparison of the 3-step prediction results of each model on the Australian electricity load dataset: (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; (g) Ours.
Figure 16. Fitting comparison of the 6-step prediction results of each model on the Australian electricity load dataset: (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; (g) Ours.
Figure 17. Fitting comparison of the 12-step prediction results of each model on the Australian electricity load dataset: (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; (g) Ours.
Figure 18. Ablation experiment metric results for the three datasets: (a) ETTh1; (b) ETTm1; (c) Australian electricity load.
Figure 19. Robustness analysis results for the ETTh1 dataset: (a) 1-step; (b) 3-step; (c) 6-step; (d) 12-step.
Table 1. Dataset description.

| Dataset | Prediction Target | Dataset Division (Training Set:Validation Set:Test Set) |
|---|---|---|
| ETTh1 | OT | 12,194:1742:3484 |
| ETTm1 | OT | 48,776:6968:13,936 |
| Australian Electricity Load | Electricity Load | 61,354:8765:17,529 |
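The divisions in Table 1 are contiguous segments of each series. As a minimal, hypothetical sketch (the paper does not publish its data-loading code, so the array name and the assumption of a chronological, unshuffled split are ours), the ETTh1 division can be reproduced as follows:

```python
import numpy as np

def chronological_split(series: np.ndarray, n_train: int, n_val: int, n_test: int):
    """Split a time series into contiguous train/validation/test segments,
    preserving temporal order (no shuffling)."""
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# ETTh1 division from Table 1: 12,194 : 1742 : 3484 samples.
# `etth1_values` is placeholder random data standing in for the loaded series.
etth1_values = np.random.randn(12194 + 1742 + 3484)
train, val, test = chronological_split(etth1_values, 12194, 1742, 3484)
print(train.shape, val.shape, test.shape)  # (12194,) (1742,) (3484,)
```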
Table 2. Model parameter settings.

| Parameter | Value |
|---|---|
| Learning rate | 0.001 |
| Batch size | 128 |
| Optimizer | Adam [44] |
| Epochs | 200 |
| Dropout | 0.2 |
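For illustration, the hyperparameters in Table 2 map onto a PyTorch training setup as sketched below. The network here is a stand-in placeholder, not the authors' GCN-Transformer implementation, and the MSE loss is an assumption; only the named hyperparameter values come from the table.

```python
import torch
import torch.nn as nn

# Hyperparameter values taken from Table 2.
LEARNING_RATE = 0.001
BATCH_SIZE = 128
EPOCHS = 200
DROPOUT = 0.2

# Placeholder network; the actual model is the multi-scale GCN plus
# MLLA-improved Transformer described in the paper.
model = nn.Sequential(
    nn.Linear(7, 64),   # 7 input variables, as in the ETT datasets
    nn.ReLU(),
    nn.Dropout(DROPOUT),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  # Adam [44]
criterion = nn.MSELoss()  # assumed training loss, not stated in Table 2

def train_one_epoch(loader):
    """One pass over a DataLoader yielding (input, target) batches."""
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```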
Table 3. RMSE, MAE, and R² evaluation metrics of the ETTh1 dataset.

| Metric | Method | 1-step | 3-step | 6-step | 12-step |
|---|---|---|---|---|---|
| RMSE | ARIMA | 1.900 | 2.418 | 2.732 | 3.348 |
| | GRU | 0.942 | 1.107 | 1.266 | 1.702 |
| | TCN | 0.806 | 1.206 | 1.511 | 1.968 |
| | GRU-Attention | 0.852 | 0.915 | 1.239 | 1.671 |
| | TCN-Attention | 0.730 | 1.015 | 1.313 | 1.736 |
| | MrCAN | 0.705 | 1.002 | 1.283 | 1.688 |
| | MSGNet | 0.741 | 0.931 | 1.210 | 1.662 |
| | Ours | 0.664 | 0.904 | 1.197 | 1.656 |
| MAE | ARIMA | 1.691 | 2.112 | 2.493 | 3.016 |
| | GRU | 0.772 | 0.745 | 0.940 | 1.311 |
| | TCN | 0.601 | 0.946 | 1.143 | 1.542 |
| | GRU-Attention | 0.668 | 0.664 | 0.913 | 1.272 |
| | TCN-Attention | 0.517 | 0.758 | 1.031 | 1.383 |
| | MrCAN | 0.510 | 0.746 | 0.974 | 1.312 |
| | MSGNet | 0.559 | 0.677 | 0.887 | 1.263 |
| | Ours | 0.481 | 0.642 | 0.863 | 1.262 |
| R² | ARIMA | 0.731 | 0.683 | 0.554 | 0.328 |
| | GRU | 0.925 | 0.913 | 0.865 | 0.756 |
| | TCN | 0.945 | 0.877 | 0.807 | 0.675 |
| | GRU-Attention | 0.939 | 0.929 | 0.871 | 0.767 |
| | TCN-Attention | 0.951 | 0.913 | 0.855 | 0.746 |
| | MrCAN | 0.958 | 0.915 | 0.861 | 0.760 |
| | MSGNet | 0.953 | 0.927 | 0.876 | 0.768 |
| | Ours | 0.963 | 0.931 | 0.880 | 0.769 |
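RMSE, MAE, and R² in Tables 3–5 follow their standard definitions; a self-contained sketch of how they can be computed is given below (the paper does not show its evaluation code, so this is our own reference implementation of the textbook formulas).

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy usage with hypothetical arrays:
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Because R² measures improvement over a mean-only baseline, it can fall below zero when a model fits worse than simply predicting the series mean, which is why ARIMA reaches −0.494 at the 12-step horizon in Table 5.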
Table 4. RMSE, MAE, and R² evaluation metrics of the ETTm1 dataset.

| Metric | Method | 1-step | 3-step | 6-step | 12-step |
|---|---|---|---|---|---|
| RMSE | ARIMA | 1.432 | 2.857 | 3.586 | 4.338 |
| | GRU | 0.460 | 0.640 | 0.751 | 0.951 |
| | TCN | 0.385 | 0.690 | 0.879 | 0.999 |
| | GRU-Attention | 0.510 | 0.658 | 0.894 | 0.963 |
| | TCN-Attention | 0.678 | 0.857 | 0.943 | 0.956 |
| | MrCAN | 0.378 | 0.606 | 0.728 | 0.913 |
| | MSGNet | 0.365 | 0.514 | 0.735 | 0.855 |
| | Ours | 0.346 | 0.470 | 0.613 | 0.855 |
| MAE | ARIMA | 1.217 | 2.625 | 3.398 | 4.164 |
| | GRU | 0.370 | 0.518 | 0.577 | 0.719 |
| | TCN | 0.292 | 0.551 | 0.642 | 0.731 |
| | GRU-Attention | 0.378 | 0.500 | 0.670 | 0.710 |
| | TCN-Attention | 0.580 | 0.640 | 0.799 | 0.704 |
| | MrCAN | 0.283 | 0.471 | 0.560 | 0.676 |
| | MSGNet | 0.273 | 0.352 | 0.531 | 0.611 |
| | Ours | 0.252 | 0.320 | 0.415 | 0.603 |
| R² | ARIMA | 0.736 | 0.682 | 0.545 | 0.331 |
| | GRU | 0.966 | 0.955 | 0.942 | 0.913 |
| | TCN | 0.970 | 0.949 | 0.935 | 0.906 |
| | GRU-Attention | 0.958 | 0.953 | 0.932 | 0.908 |
| | TCN-Attention | 0.951 | 0.938 | 0.925 | 0.923 |
| | MrCAN | 0.971 | 0.959 | 0.945 | 0.920 |
| | MSGNet | 0.972 | 0.967 | 0.942 | 0.928 |
| | Ours | 0.979 | 0.971 | 0.958 | 0.928 |
Table 5. RMSE, MAE, and R² evaluation metrics of the Australian electricity load dataset.

| Metric | Method | 1-step | 3-step | 6-step | 12-step |
|---|---|---|---|---|---|
| RMSE | ARIMA | 1114.640 | 1497.295 | 2427.290 | 2969.202 |
| | GRU | 132.349 | 388.518 | 753.815 | 1157.247 |
| | TCN | 319.052 | 477.217 | 856.929 | 1277.548 |
| | GRU-Attention | 126.481 | 335.018 | 583.403 | 980.216 |
| | TCN-Attention | 180.761 | 381.154 | 591.088 | 1045.967 |
| | MrCAN | 132.296 | 307.270 | 576.016 | 884.175 |
| | MSGNet | 132.257 | 310.001 | 564.320 | 828.189 |
| | Ours | 109.301 | 292.855 | 518.900 | 783.182 |
| MAE | ARIMA | 1099.937 | 1276.531 | 2101.269 | 2735.886 |
| | GRU | 99.840 | 279.779 | 552.986 | 878.398 |
| | TCN | 235.910 | 390.566 | 708.918 | 1067.904 |
| | GRU-Attention | 94.676 | 242.491 | 417.174 | 730.904 |
| | TCN-Attention | 134.382 | 275.569 | 425.921 | 819.309 |
| | MrCAN | 100.884 | 218.761 | 414.341 | 650.945 |
| | MSGNet | 99.612 | 224.166 | 410.227 | 658.960 |
| | Ours | 81.107 | 216.314 | 380.461 | 574.777 |
| R² | ARIMA | 0.426 | 0.391 | 0.283 | −0.494 |
| | GRU | 0.980 | 0.920 | 0.699 | 0.290 |
| | TCN | 0.936 | 0.879 | 0.611 | 0.135 |
| | GRU-Attention | 0.982 | 0.941 | 0.816 | 0.491 |
| | TCN-Attention | 0.972 | 0.923 | 0.819 | 0.420 |
| | MrCAN | 0.980 | 0.950 | 0.824 | 0.585 |
| | MSGNet | 0.980 | 0.949 | 0.831 | 0.636 |
| | Ours | 0.984 | 0.954 | 0.857 | 0.675 |
Table 6. Robustness analysis metrics across the three datasets.

| Dataset | Metric | 1-step | 3-step | 6-step | 12-step |
|---|---|---|---|---|---|
| ETTh1 | RMSE | 1.800 | 1.963 | 2.301 | 2.567 |
| | MAE | 0.872 | 0.909 | 1.327 | 1.589 |
| | R² | 0.932 | 0.924 | 0.868 | 0.755 |
| ETTh2 | RMSE | 0.533 | 0.617 | 0.728 | 0.981 |
| | MAE | 0.383 | 0.472 | 0.590 | 0.803 |
| | R² | 0.966 | 0.959 | 0.954 | 0.931 |
| Australian Electricity Load | RMSE | 216.247 | 401.082 | 665.439 | 892.131 |
| | MAE | 193.460 | 314.960 | 582.718 | 719.004 |
| | R² | 0.973 | 0.948 | 0.836 | 0.682 |
Table 7. Computational cost of MSGNet-MLLA-Transformer on the Australian electricity load dataset.

| Method | Training Time | Inference Time | RMSE (12-step) |
|---|---|---|---|
| Ours | 0.178 s | 0.0457 s | 783.182 |
| MSGNet | 0.266 s | 0.084 s | 828.189 |
| MrCAN | 0.087 s | 0.025 s | 884.175 |
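The paper does not specify how the times in Table 7 were measured (e.g., per batch or per epoch, CPU or GPU). A generic wall-clock timing sketch, under those caveats and with a hypothetical helper name, looks like this:

```python
import time
import torch

def average_inference_time(model: torch.nn.Module, x: torch.Tensor,
                           n_runs: int = 100) -> float:
    """Average wall-clock inference time per forward pass, in seconds."""
    model.eval()
    with torch.no_grad():
        _ = model(x)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            _ = model(x)
            # On GPU, torch.cuda.synchronize() would be needed here for
            # accurate timings, since CUDA kernels launch asynchronously.
        elapsed = time.perf_counter() - start
    return elapsed / n_runs
```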