Article

Short-Term Power Load Forecasting Using an Improved Model Integrating GCN and Transformer

1 School of Information and Communication Engineering, Hainan University, Haikou 570228, China
2 School of Computing and Artificial Intelligence, Hainan College of Software Technology, Qionghai 571499, China
3 School of Tourism, Hainan Normal University, Haikou 570228, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7003; https://doi.org/10.3390/app15137003
Submission received: 10 May 2025 / Revised: 12 June 2025 / Accepted: 18 June 2025 / Published: 21 June 2025

Abstract

Improving the accuracy of power load forecasting is an important step in optimizing power systems. Most existing studies on the short-term load forecasting task suffer from insufficient extraction of multi-scale features. To improve prediction accuracy, this study therefore designs a short-term power load forecasting model integrating a multi-scale GCN with an improved Transformer, as well as a prediction method based on this model. First, multi-feature power load data were collected. Second, the random forest algorithm was used to preprocess the data. Next, the multi-scale GCN was utilized to model the multi-scale spatio-temporal features in the power load data. The data processed by the multi-scale GCN were then input into the improved, MLLA-based Transformer module to extract long-term temporal dependencies. Subsequently, comparative experiments and ablation experiments were conducted on three public power datasets. The experimental results show that, compared to the baseline models, on the ETTh1 dataset the RMSE of the proposed model decreased by up to 0.314, the MAE decreased by up to 0.304, and the R2 improved by up to 9.45%; on the ETTm1 dataset, the RMSE decreased by up to 0.266, the MAE decreased by up to 0.231, and the R2 improved by up to 3.3%; and on the Australian dataset, the RMSE decreased by up to 494.366, the MAE decreased by up to 493.127, and the R2 improved by up to 54%, verifying the superiority and effectiveness of the proposed model.

1. Introduction

Driven by the global wave of sustainable development, China has officially initiated the implementation process of its “dual carbon” strategy, and the energy industry is undergoing profound transformation and reform. Against this backdrop, building an accurate energy demand forecasting system has become a core supporting element for energy enterprises to optimize the supply structure, achieve efficient resource allocation, and control operating costs.
Electricity, as the central carrier of the modern energy system, has special strategic value in demand forecasting. Different from other forms of energy, the physical characteristic of electricity being used immediately upon generation determines the extreme importance of the balance between supply and demand. The dynamic evolution of social production and living patterns may cause instantaneous fluctuations in electricity demand, and this characteristic amplifies the risk of resource waste caused by the mismatch between supply and demand. Establishing a high-precision power load forecasting mechanism is becoming a key approach for power enterprises to solve the problem of resource scheduling: through forward-looking demand analysis, enterprises can formulate more refined power generation plans, implement differentiated resource allocation strategies, and establish a demand-side response regulation mechanism, ultimately achieving a qualitative improvement in the utilization efficiency of power resources.
The construction of this predictive ability is not only related to the operational efficiency of individual enterprises but will also reshape the value creation model of the entire power industry. When precise demand insight is deeply integrated with intelligent dispatching technology, the power supply system will have stronger and more flexible adaptability, which can not only meet the dynamic demands of economic and social development but also effectively reduce system redundancy costs, laying an important data foundation for building a new type of power system. Power load forecasting refers to the process by which researchers predict the power demand at a certain future time through the use of mathematical models. According to the length of the prediction time step, the power load forecasting task can be divided into four types [1]: ultra-short-term predictions focus on time granularities ranging from minutes to hours; short-term predictions are typically made on a daily or weekly basis; medium-term predictions cover time spans from weeks to months; and long-term predictions are strategic judgments made on an annual basis. This time-domain division is deeply consistent with the operational characteristics of the power system. Compared to the fundamental supporting role of medium-term forecasts for production plans, short-term forecast results directly affect the start-up and shutdown decisions of generating units and the allocation of reserve capacity. In particular, the precise analysis and judgment of short-term predictions have become the key basis for power generation enterprises to optimize fuel procurement strategies and reduce start-stop losses. At the same time, they also provide important technical support for power grid dispatching institutions to implement peak-valley regulation and demand response.
Since power load forecasting was proposed, the academic community has established a diversified methodological system. Because the power load sequence typically exhibits multiple sources of uncertainty, existing research mainly follows two technical paths: traditional statistical models and deep learning architectures [2]. At the traditional methodological level, the Alberg team developed a non-seasonal prediction model and a sliding window algorithm based on the ARIMA framework, which were successfully applied to power demand prediction scenarios [3]. Sadaei achieved the fitting and optimization of the load curve by improving the ARMA model [4]. Furthermore, several researchers have attempted to apply feature selection methods to multiple sequences of exogenous variables in prediction scenarios [5,6,7]. However, studies in recent years have shown that, because they better fit the nonlinear features in power load data, modern deep learning methods have an accuracy advantage in the power load forecasting task. In the early years, RNN-series models were the mainstream models in this task. Among studies of power load forecasting based on RNN-family models, Vermaak [8] and Tang et al. [9] proposed two different RNN-based methods for the power load forecasting task. Tan et al. proposed a hybrid model for power load forecasting that takes LSTM as the main body and simultaneously introduces ensemble strategies such as Bagging and Boosting [10]. The experimental results show that this method can effectively improve the accuracy of power load forecasting. Sharma et al. developed a new model composed of FitzHugh-Nagumo (FHN) units, RNNs, and feedforward neural networks and applied it to power load forecasting [11]. The research showed that this model demonstrates excellent forecasting performance. In recent years, the research focus has shifted to graph neural networks (GNNs) and Transformer architectures [12,13,14,15]. Saeed et al. attempted to integrate GNNs and the Transformer for power load forecasting and achieved good results [16]. Although the abovementioned methods have continuously improved prediction accuracy, existing studies still lack systematic consideration of the multi-scale characteristics of load data. The heterogeneous features presented by exogenous variables and load sequences in different time dimensions have not been fully deconstructed, which leaves an important opening for subsequent methodological innovation [17,18]. Therefore, to address this gap, this paper proposes a multi-scale graph Transformer method for predicting short-term power load, with the following main research content:
(1)
A prediction method for short-term power load forecasting is proposed, built around the proposed multi-scale graph Transformer model. The model combines a multi-scale graph convolution module with a Transformer, uniting the advantages of the two networks, and can effectively extract spatio-temporal features at different scales as well as the temporal dependencies between scales.
(2)
The model introduces MLLA. This attention mechanism enables the model to keep its computational complexity linear while further modeling global features and processing long time series data.
(3)
The performance of the method was evaluated through experiments on three datasets. Compared to the other methods, the method proposed in this paper shows better performance.

2. Related Works

Power load forecasting essentially involves using data analysis methods to identify the key factors that affect power load and constructing models for prediction. Since the emergence of the power load forecasting task, a variety of methods have been applied to it. Early power load forecasting relied mainly on traditional methods. Owing to the limitations of the data collection equipment and the power supply-demand relationship at the time, data were limited and influencing factors were few. Researchers therefore often used traditional, simple methods, such as ARMA [19], ARIMA [20], and Exponential Smoothing [21], for prediction. These methods are simple to use and have low time complexity. However, with the upgrading of data collection equipment and the growing complexity of power relationships, traditional prediction methods are no longer adequate. People have therefore gradually introduced new methods, such as machine learning and the wavelet transform. Li et al. proposed an innovative power load forecasting method [22]. This method ingeniously utilizes the wavelet transform to decompose power data, then builds models for the different component data, and finally fuses the results to obtain the final forecast. This method significantly improves the accuracy of power load forecasting. Ceperic et al. improved the support vector regression (SVR) model and optimized its parameters using the PSO method, which also enhanced the prediction effect [23]. The work conducted by Hu is similar to that of Ceperic [24]: both used the PSO method to improve the SVR model to form a new power load forecasting method. The difference is that Hu adopted the memetic algorithm of the improved PSO to optimize the parameters of the SVR model. The experimental results prove that this method significantly improves prediction accuracy.
The abovementioned methods have undoubtedly made considerable progress and performed well in the field of power load forecasting. However, it is rather difficult for them to achieve precise feature mining and modeling as the amount of power load data increases. Researchers have therefore introduced deep learning methods. Vermaak et al. and Tang et al. both used Recurrent Neural Networks (RNNs) for power load forecasting [8,9]. Their experimental results indicated that the use of RNNs could improve forecasting accuracy. Abumohsen et al. compared the results of various methods, such as Long Short-Term Memory (LSTM), the Gated Recurrent Unit (GRU), and RNNs, in power load forecasting and found that the GRU performed best [25]. L'Heureux et al. applied the Transformer to power load forecasting [26]. The experimental results show that this method is superior to others. Liu et al. made improvements based on Temporal Convolutional Networks (TCNs) and DenseNet and proposed the Densenet-iTCN model for power load forecasting [27]. The experimental results verified that this method was superior to the baseline models. Zhu et al. effectively captured the influence of exogenous factors and time steps on the peak value in power load forecasting by using dual attention [28]. Lin et al. proposed an attention model that can adaptively select the characteristics of power load and explored the influence of the time step size [29]. Niu et al. added self-attention to enhance information transfer based on Convolutional Neural Networks (CNNs) and BiGRU, which helps uncover the relationships among multiple factors in the power load dataset [30].
These methods have demonstrated strong performance and achieved good prediction results. However, deficiencies remain. Because they fail to consider the internal differences across time scales and the long-term sequence relationships in power load forecasting data, these methods are still not precise enough when learning and modeling the data. Therefore, this paper presents a power load forecasting method that uses MSGNet [31] and the Transformer [32] as its basic framework and introduces MLLA, making the model's extraction of the multi-scale spatio-temporal features of the data more accurate and thereby improving forecasting accuracy.
The structure of this article is as follows: Section 1 provides an overview of the research background and briefly summarizes the innovations of the research. Section 2 systematically reviews the relevant literature, explores the deficiencies of existing research, and outlines the direction of the subsequent research. Section 3 introduces the proposed method and the background models it builds on. Section 4 presents a detailed analysis and interpretation of the results of the parameter experiments and comparative tests. Section 5 concludes the paper and outlines future research directions.

3. Methods

3.1. Preliminary Knowledge

To elaborate on the method proposed in this paper more clearly, in this section, we briefly explain the models it builds on.

3.1.1. MSGNet

MSGNet [31] is a multi-scale graph convolutional network proposed by Cai et al. in 2023. The model first embeds the data to obtain the timestamp and positional information within it. The embedded data are then fed into the scale graph block, where an FFT is applied to identify the dominant frequencies of the data and determine the number of scales. A corresponding spatio-temporal model is then constructed for each scale to model the spatio-temporal characteristics of the data at that scale. The model also uses multi-head attention to learn the long-term dependencies between different scales. Finally, the data enter the prediction layer, which outputs the final result. The formulas for determining the scales and constructing the spatio-temporal graph are, respectively, as follows [31]:
$F = \mathrm{Avg}(\mathrm{Amp}(\mathrm{FFT}(X_{\mathrm{emb}})))$ (1)
$f_1, \ldots, f_k = \mathop{\arg\mathrm{Topk}}_{f_{*} \in \{1, \ldots, L/2\}}(F), \quad s_i = \frac{L}{f_i}$ (2)
$A^{i} = \mathrm{Softmax}(\mathrm{ReLU}(E_1^{i} (E_2^{i})^{T}))$ (3)
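As a concrete illustration of Formulas (1) and (2), the following Python sketch (function and variable names are our own; this is a minimal sketch, not MSGNet's actual implementation) detects the dominant scales of a batch of embedded series via the FFT:

```python
import torch

def detect_scales(x_emb: torch.Tensor, k: int = 3):
    """Sketch of MSGNet-style scale detection via FFT (Formulas (1)-(2)).

    x_emb: embedded input of shape (batch, length, channels).
    Returns the top-k dominant frequencies, their periods s_i = L / f_i,
    and the corresponding amplitudes F_{f_i}.
    """
    L = x_emb.shape[1]
    # Amplitude spectrum along the time axis, averaged over batch and channels.
    amp = torch.abs(torch.fft.rfft(x_emb, dim=1)).mean(dim=(0, 2))
    amp[0] = 0.0  # ignore the DC component, approximating f* in {1, ..., L/2}
    amplitudes, freqs = torch.topk(amp, k)       # F_{f_i} and f_1, ..., f_k
    scales = (L // freqs.clamp(min=1)).tolist()  # s_i = L / f_i
    return freqs.tolist(), scales, amplitudes

# Example: a batch of 2 series of length 96 with 6 channels.
x = torch.randn(2, 96, 6)
print(detect_scales(x, k=3))
```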

3.1.2. MLLA

Mamba-Like Linear Attention (MLLA) [33] is a lightweight attention mechanism inspired by the State Space Model (SSM) [34]. Its core idea is to combine selective state scanning with linear attention computation, significantly reducing the computational complexity while maintaining global modeling capability. This mechanism integrates the structural design of the Mamba block into the linear attention (LA) block, reducing the complexity from quadratic to linear, O(N). Its advantage lies in maintaining linear computational complexity while improving the global modeling ability and inference speed of the model. The structure of MLLA is shown in Figure 1.
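To make the complexity reduction concrete, the following sketch shows the linear-attention core that MLLA builds on; the elu+1 feature map and all names here are illustrative assumptions, not the exact MLLA kernel. Computing the K^T V summary first makes the cost linear in sequence length:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Sketch of a linear-attention core (q, k, v: (batch, length, dim)).

    Computing (K^T V) first costs O(N * d^2) instead of the O(N^2 * d)
    of softmax attention. The elu+1 feature map keeps entries positive,
    a common choice in linear attention (an assumption here).
    """
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bnd,bne->bde", k, v)               # O(N) summary of K^T V
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)  # normalizer
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(2, 128, 64)
out = linear_attention(q, k, v)                           # shape (2, 128, 64)
```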

3.1.3. Transformer

Transformer [32] was first proposed by Google’s research team in 2017; since then, multiple variants, including Bert [35], Informer [36], and Reformer [37], have been developed. Its architecture is entirely based on the self-attention mechanism, completely changing the paradigm where traditional sequence modeling relies on RNNs or CNNs. Its core structure is composed of a stack of encoders and decoders. The encoder contains multi-layer and multi-head attention modules and a feedforward layer, and it captures the global dependencies of the input sequence through parallel computing. The decoder introduces masked multi-head attention on the basis of the encoder to ensure that predictions rely only on known information. Each module adopts Residual Connection and Layer Normalization to optimize the training stability. Among them, the calculation formula of multi-head attention is as follows [32]:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (4)
$\mathrm{head}_i = \mathrm{Attention}(XW_i^{Q}, XW_i^{K}, XW_i^{V})$ (5)
$\mathrm{Multihead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$ (6)
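For reference, Formulas (4)-(6) can be realized in a few lines of PyTorch; the sketch below writes out the scaled dot-product of Formula (4) and then uses the built-in multi-head attention module for the head splitting and output projection of Formulas (5)-(6):

```python
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V, as in Formula (4)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Formulas (5)-(6) via PyTorch's built-in module (batch-first inputs).
mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 32, 64)        # (batch, sequence, embedding)
out, weights = mha(x, x, x)       # self-attention: Q = K = V = X
```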

3.2. MSGNet-MLLA-Transformer

3.2.1. Problem Formulation

This paper focuses on short-term power load forecasting. As a subtask of multivariate time series forecasting, the problem can be described as follows: Let $X_t^i$ represent the state of the $i$-th influence sequence of the power load at the $t$-th time step. Then, $X^i = \{X_1^i, \ldots, X_T^i\}$ represents the data of the $i$-th influence sequence at all historical moments, where $T$ is the historical time length, and $X = \{X^1, \ldots, X^n\}$ represents all the sequences that influence the power load, where $n$ is the number of sequences. If the future values of the power load target sequence $Y$ are predicted by the model through a sliding time window of size $w$, then $\hat{Y}_{(t+1:t+l)} = f(X_{(t-w+1:t)}, Y_{(t-w+1:t)}; \theta)$, where $\theta$ denotes the learnable parameters of the model, $t$ is the observation moment, $l$ is the prediction step size, $X_{(t-w+1:t)}$ and $Y_{(t-w+1:t)}$ are the historical influence-sequence and target-sequence data intercepted by the sliding time window, respectively, and $f$ is the model proposed in this paper.
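The sliding-window construction above can be sketched as follows (a minimal illustration with hypothetical array shapes, not the paper's data loader):

```python
import numpy as np

def sliding_windows(X, y, w: int, l: int):
    """Build (input, target) pairs for the formulation in Section 3.2.1.

    X: exogenous sequences of shape (T, n); y: target load of shape (T,).
    Each sample uses a window of w past steps of X and y to predict the
    next l steps of y. Variable names mirror the text.
    """
    inputs, past_y, future_y = [], [], []
    for t in range(w, len(y) - l + 1):
        inputs.append(X[t - w:t])      # X_{(t-w+1 : t)}
        past_y.append(y[t - w:t])      # Y_{(t-w+1 : t)}
        future_y.append(y[t:t + l])    # Y_{(t+1 : t+l)}, the prediction target
    return np.stack(inputs), np.stack(past_y), np.stack(future_y)

X = np.random.rand(1000, 5)            # e.g., 5 exogenous series after selection
y = np.random.rand(1000)
xs, ys_hist, ys_future = sliding_windows(X, y, w=12, l=6)
```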

3.2.2. Method Framework

The process of the prediction method proposed in this paper is shown in Figure 2.
Specifically, this method consists of ten steps.
Step 1. Data collection. Collect weather data and power data in hourly or daily time units to form a power load dataset.
Step 2. Preprocess the collected data, which specifically includes handling missing values, feature selection, normalization, etc.
Step 3. Data embedding. Perform timestamp and position embedding on the preprocessed data to extract the information within it.
Step 4. Use the fast Fourier operation to determine the scale information in the data. This step is achieved through Formulas (1) and (2).
Step 5. Construct multi-scale spatio-temporal graphs: Based on the scales determined in Step 4, a spatio-temporal graph is constructed for each scale according to Formula (3), thereby modeling the data structure within different scales of the data.
Step 6. Construct the MLLA-Transformer encoding layer: To model the long-term sequence dependency relationships existing among data at different scales, the MLLA-Transformer encoding layer is constructed. The specific structure is, in sequence, the multi-head MLLA layer, the normalization layer, the feedforward layer, and the normalization layer. Among them, the multi-head MLLA layer is used to learn the internal relationships within long sequences between data scales.
Step 7. Construct the MLLA-Transformer decoding layer. The specific structure is as follows: masked multi-head MLLA layer, normalization layer, multi-head MLLA attention layer, normalization layer, feedforward layer, and normalization layer. The use of multi-head MLLA helps the model extract the deep global information in the data and calculate the similarity within the sequence without increasing the computational consumption.
Step 8. The amplitudes corresponding to each scale obtained by the FFT are passed through a Softmax layer and then multiplied element-wise with the outputs of the corresponding multi-scale Transformers to obtain the final model result (a sketch of this fusion is given after Step 10). The specific formula is as follows:
$\mathrm{Out} = \sum_{i=1}^{k} \mathrm{Softmax}(F_{f_i}) \cdot \mathrm{Transformer}_{\mathrm{out}}^{i}$ (7)
In the formula, $\mathrm{Transformer}_{\mathrm{out}}^{i}$ denotes the Transformer output corresponding to the $i$-th scale, and $F_{f_i}$ is the amplitude corresponding to scale $i$, calculated by Formulas (1) and (2).
Step 9. The model is trained and iterated N times, and the training results are verified and evaluated.
Step 10. Test the model and output the final result.
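As a small illustration of the Step 8 fusion in Formula (7), the following sketch (assuming the per-scale Transformer outputs are already available; all names are illustrative) weights each scale's output by its softmax-normalized FFT amplitude:

```python
import torch

def fuse_scale_outputs(transformer_outs, scale_amplitudes):
    """Sketch of the Step 8 fusion in Formula (7).

    transformer_outs: list of k tensors, each (batch, horizon, channels),
    one per scale; scale_amplitudes: tensor of the k FFT amplitudes F_{f_i}
    from Formulas (1)-(2). The amplitudes are softmax-normalized and used
    as weights for the per-scale Transformer outputs.
    """
    weights = torch.softmax(scale_amplitudes, dim=0)           # (k,)
    stacked = torch.stack(transformer_outs, dim=0)             # (k, B, H, C)
    return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)    # (B, H, C)

outs = [torch.randn(8, 6, 1) for _ in range(3)]                # k = 3 scales
amps = torch.tensor([4.2, 2.9, 1.1])
y_hat = fuse_scale_outputs(outs, amps)
```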

4. Analysis of Experimental Results

4.1. Data Source

A total of three datasets were used in the experiments of this paper, namely, the ETTh1 and ETTm1 sub-datasets of the ETT dataset and the public power load dataset of a certain region in Australia.
ETT Dataset (https://github.com/zhouhaoyi/ETDataset, accessed on 1 April 2025): This dataset was collected from a certain area in Xinjiang, China, with a time period spanning from July 2016 to July 2018, and contains information such as power load and oil temperature. There are two time granularity levels for sub-datasets in the dataset: hourly ETTh1 and ETTh2 and minute-level ETTm1 and ETTm2. The hourly sub-datasets all contain 17,420 pieces of data, while the minute-level sub-datasets contain 69,680 pieces of data. In this study, we selected ETTh1 and ETTm1 as the experimental datasets.
Australian Electricity Load Dataset (https://gitcode.com/qq_42998340/Australia, accessed on 1 April 2025): This dataset contains the electricity load of a certain region in Australia. It contains six variable sequences (dry bulb temperature, dew point temperature, wet bulb temperature, temperature, electricity price, and electricity load). The dataset spans from 1 January 2006 to 1 January 2011 and is recorded every 0.5 h, for a total of 87,648 records.
For the three datasets in this paper, the training set, validation set, and test set are divided in a ratio of 7:1:2.
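As a small check, a chronological 7:1:2 split (splitting in time order is the usual convention for forecasting; the exact procedure is our assumption) reproduces the ETTh1 division reported in Table 1:

```python
def split_sizes(n: int, ratios=(0.7, 0.1, 0.2)):
    """Chronological 7:1:2 split of an n-step series (a sketch).

    Time series are split in order rather than shuffled, so the test set
    always follows the training and validation sets in time.
    """
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return n_train, n_val, n - n_train - n_val

print(split_sizes(17420))  # ETTh1: (12194, 1742, 3484), matching Table 1
```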
Table 1 presents each data source, including the prediction targets and the specific division of the dataset.

4.2. Data Preprocessing

Before making predictions, it is necessary to preprocess the data first, which is helpful to improve the prediction performance. The commonly used data preprocessing steps mainly include missing value filling and data normalization processing. Since the MLLA in the method proposed in this paper can only accept the input of an even number of sequences, it is also necessary to perform feature selection and dimension reduction processing on the dataset containing an odd number of sequences.
Since there are no missing data in the three datasets used in this study, the data preprocessing in this paper includes two steps: dimension reduction for the datasets containing an odd number of sequences, and normalization. To reduce the dimension of the data, the random forest method is adopted. The essence of the random forest method is to ensemble decision trees for dimensionality reduction, calculate the relevance of the data by using the out-of-bag (OOB) error, and sort and filter the sequences [38].
Specifically, a decision tree is generated first to select the input sequence and the segmentation points on the sequence. Therefore, the space containing the input sequence can be divided into two regions by the selected sequence and the segmentation points on the sequence. When the features are discrete, the two regions can be obtained by the following formula:
$K_1(s, p) = \{x_m \mid x_m^{s} = p\}, \quad K_2(s, p) = \{x_m \mid x_m^{s} \neq p\}$ (8)
When the features are continuous, the two regions can be obtained by the following formula:
$K_1(s, p) = \{x_m \mid x_m^{s} \leq p\}, \quad K_2(s, p) = \{x_m \mid x_m^{s} > p\}$ (9)
In the above formulas, $K_1$ and $K_2$ represent the two regions obtained through division, $s$ is the selected feature sequence, and $p$ is the segmentation point.
The operation of dividing the area is achieved through the following formula:
$\min_{(s, p)} \left[ \min_{d_1} \sum_{x_i \in K_1(s, p)} (y_i - d_1)^2 + \min_{d_2} \sum_{x_i \in K_2(s, p)} (y_i - d_2)^2 \right]$ (10)
In the formula, $d_1$ and $d_2$ are the predicted values of the data in the two regions.
Next, the above operation is repeated on the divided regions until the value of Formula (10) no longer decreases, at which point the decision tree is obtained.
Therefore, the steps of feature selection by the random forest algorithm, an ensemble of such decision trees, are as follows (a sketch is given after this list):
  • Calculate the importance of each sequence and sort the sequences in descending order.
  • Determine the elimination ratio for each round and eliminate sequences based on the importance calculated in the first step to obtain a new sequence dataset.
  • Repeat the above steps on the new dataset until only the predetermined number of feature sequences remains.
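A minimal sketch of this screening with scikit-learn follows; the paper ranks features via the OOB error, while impurity-based importance is used here as a readily available stand-in, and all names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_feature_selection(X, y, feature_names, n_keep: int = 5):
    """Sketch of importance-based feature screening with a random forest.

    A forest is fit on the exogenous sequences X (shape (T, n)) against
    the target load y, features are ranked by impurity-based importance,
    and only the n_keep most important ones are retained.
    """
    rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]   # descending importance
    kept = [feature_names[i] for i in order[:n_keep]]
    return kept, X[:, order[:n_keep]]

X = np.random.rand(500, 6)                              # six exogenous series
y = np.random.rand(500)
names = ["HUFL", "HULL", "MUFL", "MULL", "LUFL", "LULL"]
kept, X_reduced = rf_feature_selection(X, y, names)
```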
Since both the ETTh1 and ETTm1 datasets contain an odd number of sequences, we performed dimensionality reduction selection on the exogenous variable sequences, transforming the odd-number sequence dataset into an even-number sequence dataset.
The sequences obtained from the ETTh1 and ETTm1 datasets after random forest dimensionality reduction are shown in Figure 3.
As shown in Figure 3, among the six exogenous variables in the ETTh1 dataset, we screened out the five exogenous variables that are of the highest importance for the prediction of the target sequence. Their ranking based on importance, from high to low, is HULL, MULL, LUFL, MUFL, and HUFL. Among the six exogenous variables in the ETTm1 dataset, the five most important ones that were screened out, in sequence, are MUFL, HUFL, LUFL, HULL, and MULL. After feature screening, both the ETTh1 and ETTm1 datasets contained 5 exogenous variables and 1 target variable, which can be further processed using MLLA.

4.3. Experimental Setup

4.3.1. Baseline

To verify the superiority and effectiveness of the proposed method, we set up a comparative experiment with seven baseline methods, namely, ARIMA, GRU [39], TCN [40], GRU-Attention [41], TCN-Attention [42], MSGNet [31], and MrCAN [43]. GRU is a variant of RNNs that makes predictions by extracting the temporal features of time series; Becerra-Rico et al. applied it to multivariate time series prediction in 2020 [39]. The TCN is a temporal convolutional network; the dilated causal convolution in its structure can handle scale changes in the sequence while extracting temporal features. The GRU-Attention model is a deep learning framework that integrates GRU and the attention mechanism. Compared to the traditional GRU model, GRU-Attention can not only capture local temporal patterns but also establish global dependencies. The TCN-Attention model integrates TCNs and the attention mechanism: it captures the temporal characteristics of sequence data through TCNs and then dynamically focuses on the key information using attention. MSGNet is a multi-scale prediction method. It uses the FFT to extract the dominant frequencies of the sequence data and determine the number of scales k, and it constructs a spatio-temporal graph for each scale to extract multi-scale spatio-temporal features. MrCAN utilizes a small-sample learning module and a spatio-temporal relationship learning module to learn the relationships among sample data and the temporal and spatial relationships within the same sample.

4.3.2. Loss Function

This paper selects MSE as the loss function, and its formula is as follows:
$\mathrm{MSE} = \frac{1}{m} \sum_{t=1}^{m} (y_t - \hat{y}_t)^2$ (11)
In order to evaluate the experimental results, RMSE, MAE, and $R^2$ are adopted as the experimental evaluation indicators in this paper. The specific calculation methods are as follows.
$\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{t=1}^{m} (y_t - \hat{y}_t)^2}$ (12)
$\mathrm{MAE} = \frac{1}{m} \sum_{t=1}^{m} \left| y_t - \hat{y}_t \right|$ (13)
$R^2 = 1 - \frac{\sum_t (y_t - \hat{y}_t)^2}{\sum_t (\bar{y} - y_t)^2}$ (14)
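For completeness, Formulas (12)-(14) can be computed directly; the following sketch evaluates a toy prediction:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute the RMSE, MAE, and R^2 metrics of Formulas (12)-(14)."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "R2": r2}

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
print(evaluate(y_true, y_pred))  # {'RMSE': 0.158..., 'MAE': 0.15, 'R2': 0.98}
```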

4.3.3. Experimental Platform

The experiments were carried out using the PyTorch 1.13.1 library on the Python 3.9.18 platform. A laptop with an Intel Core i5 processor, 16 GB of memory, and an NVIDIA RTX 4060 graphics card was used as the experimental equipment.
For the model in this paper, the parameter settings for the load forecasting are shown in Table 2.

4.4. Sliding Time Window Size Experiment

The size of the sliding time window is a tunable parameter and has a significant impact on the predictive performance of the model. Therefore, in order to obtain a better prediction effect, we set up a sliding time window size experiment. The window size was set successively to 6, 9, 12, 18, 24, 30, and 42; single-step prediction was conducted on the three datasets, and the trend of the results was observed to find the optimal sliding window size.
Figure 4 shows the influence of the sliding time window size on the model’s prediction performance based on different datasets. For the ETTh1 dataset (Figure 4a), RMSE and MAE reach their lowest values when the window size is 12, while R2 reaches its highest value, indicating that 12 is the best prediction window. For the ETTm1 dataset (Figure 4b), RMSE and MAE are the lowest when the window size is 18, and R2 is the highest. It is determined that 18 is the optimal window. For the Australian power load dataset (Figure 4c), RMSE and MAE are the lowest when the window size is 24, while R2 is the highest. Therefore, 24 is selected as the best prediction window size. Overall, the optimal prediction window sizes corresponding to each dataset are 12, 18, and 24, respectively.

4.5. Comparative Experiment

The goal of the comparative experiment was to verify the superiority and effectiveness of the proposed method. In this paper, we set up comparative experiments using three datasets to verify and analyze the superiority and effectiveness of the proposed MSGNet-MLLA-Transformer method. Regarding the presentation of the results, since the test sets of the three datasets have a large amount of predicted data, the overall image cannot intuitively show the performance differences of each model. Therefore, we chose the interval with more intuitive differences for the presentation of the results.
The visualization results of the target sequences of the three datasets are shown in Figure 5. From Figure 5, we can see that the target sequences of the three datasets have different degrees of periodicity, and the periodicity is most obvious in the target sequence of the Australian electricity load dataset.

4.5.1. ETTh1 Dataset Experiment

Table 3 shows the comparison of RMSE, MAE, and R2 between MSGNet-MLLA-Transformer and mainstream models under different prediction step sizes (1, 3, 6, and 12 steps) on the ETTh1 dataset. The experimental data show that the model in this paper has systematic advantages in all prediction scenarios: its RMSE is, on average, approximately 0.04 lower than the optimal benchmark, and its R2 is, on average, 1.0% higher. The ranking of model performance is as follows: Ours > MrCAN > MSGNet > GRU/TCN (GRU has the weakest performance) > ARIMA, and the visualization results in Figure 6, Figure 7, Figure 8 and Figure 9 further verify this ranking. In the visualized results, we present only the models other than ARIMA, since the differences among them are relatively small. This advantage stems from the fact that MSGNet-MLLA-Transformer can jointly achieve multi-scale parsing of spatio-temporal features and long-term dependency capture. In contrast, although MrCAN is good at joint spatio-temporal learning and sample relationship modeling, its insufficient multi-scale decomposition leads to error accumulation. The MSGNet prototype has advantages in spatial multi-scale feature extraction but is limited by bottlenecks in sample relationship modeling and long-term dependency capture. Traditional models such as GRU and TCN have obvious disadvantages in scenarios with strong spatio-temporal coupling due to the lack of effective spatial feature extraction and cross-sample relationship modeling capabilities (their performance improves after adding the attention mechanism). These findings not only verify the necessity of joint multi-dimensional feature modeling for power prediction but also highlight the importance of collaborative optimization in dimensions such as spatio-temporal feature decoupling, the balance of long- and short-term dependencies, and sample relationship mining.

4.5.2. ETTm1 Dataset Experiment

Table 4 comprehensively compares the RMSE, MAE, and R2 performances of MSGNet-MLLA-Transformer and the benchmark models under different prediction step sizes (1, 3, 6, and 12 steps) on the ETTm1 dataset. In the visualized results, we present only the models other than ARIMA, since the differences among them are relatively small. Combined with the analysis in Figure 10, Figure 11, Figure 12 and Figure 13, it can be seen that in the short- and medium-term predictions (steps 1, 3, and 6), all indicators of this model are superior to those of traditional models such as GRU and TCN (with a maximum increase of 1.6% in R2), while in the 12-step long-term prediction, the accuracy is comparable to that of the MSGNet prototype. This phenomenon indicates that for the ETTm1 dataset, capturing the multi-scale spatio-temporal coupling characteristics of the power load (such as minute-level fluctuations and hour-level cycles) contributes significantly more to prediction accuracy than long-term dependency modeling capability. This explains why MSGNet and its improvement, MSGNet-MLLA-Transformer (with stronger multi-scale feature extraction capability), perform best. Although the overall performance of the MrCAN model is not satisfactory, its spatial attention mechanism is still valuable in capturing regional correlations.

4.5.3. Australian Electricity Load Dataset Experiment

In the visualized results, we present only the models other than ARIMA, since the differences among them are relatively small. As shown in Table 5 and Figure 14, Figure 15, Figure 16 and Figure 17, in the comparative experiments with different prediction steps on the Australian power load dataset, the MSGNet-MLLA-Transformer method proposed in this paper achieves lower RMSE and MAE than all baseline models and a higher R2, giving the best performance. The ranking of model performance is as follows: Ours > MSGNet > MrCAN > TCN-Attention/GRU-Attention > TCN/GRU (GRU is comparable to MSGNet in single-step prediction) > ARIMA. We believe this phenomenon shows that capturing the spatio-temporal characteristics and long-term dependencies of the data is crucial for improving prediction performance. The method proposed in this paper achieves the best performance by effectively decoupling spatio-temporal features and optimizing multi-scale dynamic interaction. Although MSGNet is good at multi-scale spatio-temporal modeling, its capture of long-term dependencies is insufficient. MrCAN lacks an explicit multi-scale decomposition mechanism, and GRU/TCN and their attention-enhanced variants perform worse due to weak spatial modeling ability, limited receptive fields, or insufficient decoupling of spatio-temporal features. These results highlight the crucial roles of collaborative optimization, spatio-temporal feature decoupling, and multi-scale interaction in the performance of power load forecasting models.

4.6. Ablation Experiment

The purpose of the ablation experiment is to verify the validity of the components in the model. In the experiments of this paper, in order to test the effectiveness of MLLA and the Transformer components improved based on MLLA, we designed two variants: (1) removing MLLA (-w/o MLLA) and (2) removing the MLLA-Transformer component (-w/o MLLA-Transformer).
Figure 18 shows the index results of the ablation experiments conducted on three datasets. It can be found from the figure that when the MLLA or the MLLA-Transformer component is removed, the prediction accuracy of the model for the data decreases, which also verifies the effectiveness of the proposed model.

4.7. Robustness Analysis

To test the robustness of the model, we adopted a fixed-proportion generation method to inject outliers into the three datasets, simulating real-world data, and tested the model's performance on them.
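A sketch of such fixed-proportion outlier injection is given below; the proportion and magnitude are illustrative assumptions, since the paper does not state its exact settings:

```python
import numpy as np

def inject_outliers(series, proportion: float = 0.05, scale: float = 3.0, seed: int = 0):
    """Sketch of fixed-proportion outlier injection for the robustness test.

    A fixed share of time steps is perturbed by +/- scale standard
    deviations of the series; proportion and scale are illustrative.
    """
    rng = np.random.default_rng(seed)
    noisy = series.copy()
    n_out = int(len(series) * proportion)
    idx = rng.choice(len(series), size=n_out, replace=False)
    noisy[idx] += rng.choice([-1.0, 1.0], size=n_out) * scale * series.std()
    return noisy, idx

load = np.random.rand(1000)
noisy_load, outlier_idx = inject_outliers(load, proportion=0.05)
```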
Table 6 presents the results of the robustness analysis of our model on the three datasets. It can be seen that when noisy inputs are introduced, the model maintains predictive performance similar to that obtained with noise-free inputs. Figure 19 shows the fitting curves of the robustness experiment on the ETTh1 dataset. To observe the fit clearly, we selected the interval from 1000 to 1500, where the displayed outliers are relatively dense. The figure shows that the model also fits the data well when facing outliers.

4.8. Computational Cost

We compared the training time and inference time for six-step prediction of the proposed MSGNet-MLLA-Transformer model with those of MSGNet and MrCAN, using 100 samples from the Australian power load dataset. The results are shown in Table 7. Relative to the two baselines, the training time and the inference time of our model occupy different positions in the ranking. Considering the computational cost and the predictive performance together, the method proposed in this paper has certain advantages.

5. Conclusions

To address the problem that existing short-term load forecasting research often ignores the modeling of multi-scale spatio-temporal features, an MSGNet-MLLA-Transformer method is proposed for this task scenario. Specifically, the MSGNet-MLLA-Transformer proposed in this paper integrates MSGNet with a Transformer whose attention is enhanced by MLLA. The sliding time window experiments on the ETTh1, ETTm1, and Australian power load datasets all show that too small a time window cannot improve the prediction performance of the model, while an overly large time window degrades it; the most suitable window sizes are 12, 18, and 24, respectively. The results of the comparative and ablation experiments show that the prediction performance of the deep learning models is superior to that of the ARIMA statistical model, and models able to extract multi-scale spatio-temporal features are superior to models that extract only temporal dependencies or only spatial features. Among the multi-scale spatio-temporal feature extraction models, the model proposed in this paper can jointly achieve multi-scale parsing of spatio-temporal features and capture of long-term dependencies, thus achieving the best results.
Due to the limitations of our research goals and space, the interpretability module and probability prediction module were not systematically integrated. In the future, we will further carry out research on interpretability, probability prediction, and other aspects to resolve the problems of insufficient post-event interpretability and the insufficient uncertainty prediction ability of the model in relation to practical applications.
  • We plan to introduce SHAP value analysis to quantify the marginal contribution of input features (such as historical load and spatio-temporal nodes) to the prediction results;
  • We aim to visualize cross-scale attention maps and analyze the attention patterns of the model at different time resolutions (such as minute-level fluctuations and hourly cycles);
  • By using probability prediction methods, we aim to obtain the interval probability distribution of the predicted values to enhance the practical application value of the prediction results.

Author Contributions

Methodology, M.W. and C.C.; software, M.W., W.F. and C.C.; validation, M.W. and W.F.; formal analysis, M.W., W.F. and C.C.; data curation, M.W., W.F. and C.C.; writing—original draft preparation, M.W., W.F., X.L. and C.C.; writing—review and editing, M.W., W.F., X.L. and C.C.; visualization, M.W., X.L. and C.C.; supervision, M.W. and Y.L.; project administration, M.W. and Y.L.; and funding acquisition, M.W. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the “Rising Star in the South China Sea” project of Hainan Province (NHXXRCXM202322) and the Education Department of Hainan Province (Hnky2024-76).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study can be found at https://github.com/zhouhaoyi/ETDataset (accessed on 1 April 2025) and https://gitcode.com/qq_42998340/Australia (accessed on 1 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mustapha, M.; Mustafa, M.; Khalid, S.; Abubakar, I.; Shareef, H. Classification of electricity load forecasting based on the factors influencing the load consumption and methods used: An overview. In Proceedings of the 2015 IEEE Conference on Energy Conversion (CENCON 2015), Johor Bahru, Malaysia, 19–20 October 2015. [Google Scholar]
  2. Klyuev, R.; Morgoev, I.; Morgoeva, A.; Gavrina, O.A.; Martyushev, N.V.; Efremenkov, E.A.; Mengxu, Q. Methods of forecasting electric energy consumption: A literature review. Energies 2022, 15, 8919. [Google Scholar] [CrossRef]
  3. Alberg, D.; Last, M. Short-term load forecasting in smart meters with sliding window-based ARIMA algorithms. Vietnam J. Comput. Sci. 2018, 5, 241–249. [Google Scholar] [CrossRef]
  4. Sadaei, H.; Guimarães, F.; Silva, C.J.; Lee, M.H.; Eslami, T. Short-term load forecasting method based on fuzzy time series, seasonality and long memory process. Int. J. Approx. Reason. 2017, 83, 196–217. [Google Scholar] [CrossRef]
  5. Xing, Q.; Huang, X.; Wang, J.; Wang, S. A novel multivariate combined power load forecasting system based on feature selection and multi-objective intelligent optimization. Expert Syst. Appl. 2024, 244, 122970. [Google Scholar] [CrossRef]
  6. Liu, W.; Mao, Z. Short-term photovoltaic power forecasting with feature extraction and attention mechanisms. Renew. Energy 2024, 226, 120437. [Google Scholar] [CrossRef]
  7. Fan, G.; Han, Y.; Li, J.; Peng, L.; Yeh, Y.; Hong, W. A hybrid model for deep learning short-term power load forecasting based on feature extraction statistics techniques. Expert Syst. Appl. 2024, 238, 122012. [Google Scholar] [CrossRef]
  8. Vermaak, J.; Botha, E. Recurrent neural networks for short-term load forecasting. IEEE Trans. Power Syst. 1998, 13, 126–132. [Google Scholar] [CrossRef]
  9. Tang, X.; Dai, Y.; Liu, Q.; Dang, X.; Xu, J. Application of bidirectional recurrent neural network combined with deep belief network in short-term load forecasting. IEEE Access 2019, 7, 160660–160670. [Google Scholar] [CrossRef]
  10. Tan, M.; Yuan, S.; Li, S.; Su, Y.; Li, H.; He, F. Ultra-short-term industrial power demand forecasting using LSTM based hybrid ensemble learning. IEEE Trans. Power Syst. 2020, 35, 2937–2948. [Google Scholar] [CrossRef]
  11. Sharma, V.; Srinivasan, D. A hybrid intelligent model based on recurrent neural networks and excitable dynamics for price prediction in deregulated electricity market. Eng. Appl. Artif. Intell. 2013, 26, 1562–1574. [Google Scholar] [CrossRef]
  12. Zhang, J.; Li, H.; Cheng, P.; Yan, J. Interpretable Wind Power Short-Term Power Prediction Model Using Deep Graph Attention Network. Energies 2024, 17, 384. [Google Scholar] [CrossRef]
  13. Xie, Y.; Zheng, J.; Taylor, G.; Hulak, D. A short-term wind power prediction method via self-adaptive adjacency matrix and spatiotemporal graph neural networks. Comput. Electr. Eng. 2024, 120, 109715. [Google Scholar] [CrossRef]
  14. Mo, S.; Wang, H.; Li, B.; Xue, Z.; Fan, S.; Liu, X. Powerformer: A temporal-based transformer model for wind power forecasting. Energy Rep. 2024, 11, 736–744. [Google Scholar] [CrossRef]
  15. Xiang, L.; Fu, X.; Yao, Q.; Zhu, G.; Hu, A. A novel model for ultra-short term wind power prediction based on Vision Transformer. Energy 2024, 294, 130854. [Google Scholar] [CrossRef]
  16. Saeed, F.; Rehman, A.; Shah, H.A.; Diyan, M.; Chen, J.; Kang, J.-M. SmartFormer: Graph-based transformer model for energy load forecasting. Sustain. Energy Technol. Assess. 2025, 73, 104133. [Google Scholar] [CrossRef]
  17. Guo, X.; Zhao, Q.; Zheng, D.; Ning, Y.; Gao, Y. A short-term load forecasting model of multi-scale CNN-LSTM hybrid neural network considering the real-time electricity price. Energy Rep. 2020, 6, 1046–1053. [Google Scholar] [CrossRef]
  18. Yin, L.; Xie, J. Multi-temporal-spatial-scale temporal convolution network for short-term load forecasting of power systems. Appl. Energy 2021, 283, 116328. [Google Scholar] [CrossRef]
  19. Pappas, P.; Ekonomou, L.; Karamousantas, D.; Chatzarakis, G.E.; Katsikas, S.K.; Liatsis, P. Electricity demand loads modeling using AutoRegressive Moving Average (ARMA) models. Energy 2008, 33, 1353–1360. [Google Scholar] [CrossRef]
  20. Shi, J.; Qu, X.; Zeng, S. Short-Term Wind Power Generation Forecasting: Direct Versus Indirect Arima-Based Approaches. Int. J. Green Energy 2011, 8, 100–112. [Google Scholar] [CrossRef]
  21. Taylor, J. Short-Term Load Forecasting With Exponentially Weighted Methods. IEEE Trans. Power Syst. 2012, 27, 458–464. [Google Scholar] [CrossRef]
  22. Li, S.; Goel, L.; Wang, P. An ensemble approach for short-term load forecasting by extreme learning machine. Appl. Energy 2016, 170, 22–29. [Google Scholar] [CrossRef]
  23. Ceperic, E.; Ceperic, V.; Baric, A. A Strategy for Short-Term Load Forecasting by Support Vector Regression Machines. IEEE Trans. Power Syst. 2013, 28, 4356–4364. [Google Scholar] [CrossRef]
  24. Hu, Z.; Bao, Y.; Xiong, T. Comprehensive learning particle swarm optimization based memetic algorithm for model selection in short-term load forecasting using support vector regression. Appl. Soft Comput. 2014, 25, 15–25. [Google Scholar] [CrossRef]
  25. Abumohsen, M.; Owda, A.; Owda, M. Electrical load forecasting using LSTM, GRU, and RNN algorithms. Energies 2023, 16, 2283. [Google Scholar] [CrossRef]
  26. L’Heureux, A.; Grolinger, K.; Capretz, M. Transformer-based model for electrical load forecasting. Energies 2022, 15, 4993. [Google Scholar] [CrossRef]
  27. Liu, M.; Qin, H.; Cao, R.; Deng, S. Short-Term Load Forecasting Based on Improved TCN and DenseNet. IEEE Access 2022, 10, 115945–115957. [Google Scholar] [CrossRef]
  28. Zhu, K.; Li, Y.; Mao, W.; Li, F.; Yan, J. LSTM enhanced by dual-attention-based encoder-decoder for daily peak load forecasting. Electr. Power Syst. Res. 2022, 208, 107860. [Google Scholar] [CrossRef]
  29. Lin, J.; Ma, J.; Zhu, J.; Cui, Y. Short-term load forecasting based on LSTM networks considering attention mechanism. Int. J. Electr. Power Energy Syst. 2022, 137, 107818. [Google Scholar] [CrossRef]
  30. Niu, D.; Yu, M.; Sun, L.; Gao, T.; Wang, K. Short-term multi-energy load forecasting for integrated energy systems based on CNN-BiGRU optimized by attention mechanism. Appl. Energy 2022, 313, 118801. [Google Scholar] [CrossRef]
  31. Cai, W.; Liang, Y.; Liu, X.; Feng, J.; Wu, Y. MSGNet: Learning multi-scale inter-series correlations for multivariate time series forecasting. In Proceedings of the 38th AAAI Conference on Artificial Intelligence (AAAI2024), Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  32. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Song, J.; Song, S.; Zheng, B.; Huang, G. Demystify Mamba in Vision: A Linear Attention Perspective. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024. [Google Scholar]
  34. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. In Proceedings of the 2024 International Conference on Machine Learning (ICML2024), Vienna, Austria, 21–27 July 2024. [Google Scholar]
  35. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP2018), Brussels, Belgium, 1–4 November 2018. [Google Scholar]
  36. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence timeseries forecasting. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI2021), Vancouver, BC, Canada, 2–6 February 2021. [Google Scholar]
  37. Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The efficient transformer. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  38. Peng, L.; Wang, L.; Ai, X.; Zeng, Y. Forecasting tourist arrivals via random forest and long short-term memory. Cogn. Comput. 2021, 13, 125–138. [Google Scholar] [CrossRef]
  39. Becerra-rico, J.; Aceves-fernandez, M.; Esquivel-escalante, K.; Pedraza-Ortega, J.C. Airborne Particle Pollution Predictive Model Using Gated Recurrent Unit (GRU) Deep Neural Networks. Earth Sci. Inform. 2020, 13, 821–834. [Google Scholar] [CrossRef]
  40. Bai, S.; Kolter, Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  41. Jung, S.; Moon, J.; Park, S.; Hwang, E. An Attention-Based Multi Layer GRU Model for Multi-Step-Ahead Short-Term Load Forecasting. Sensors 2021, 21, 1639. [Google Scholar] [CrossRef] [PubMed]
  42. Li, L.; Lin, S.; Jia, J. Short-term Load Forecasting Based on TCN-Attention Neural Network. Electr. Power Inf. Commun. Technol. 2023, 21, 10–16. [Google Scholar]
  43. Zhang, J.; Dai, Q. MrCAN: Multi-relations aware convolutional attention network for multivariate time series forecasting. Inf. Sci. 2023, 643, 119277. [Google Scholar] [CrossRef]
  44. Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference of Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Figure 1. MLLA structure.
Figure 1. MLLA structure.
Applsci 15 07003 g001
Figure 2. MSGNet-MLLA-Transformer method flow.
Figure 2. MSGNet-MLLA-Transformer method flow.
Applsci 15 07003 g002
Figure 3. Sequences left after random forest feature selection processing based on the ETTh1 and ETTm1 datasets, listed as (a) ETTh1 and (b) ETTm1.
Figure 3. Sequences left after random forest feature selection processing based on the ETTh1 and ETTm1 datasets, listed as (a) ETTh1 and (b) ETTm1.
Applsci 15 07003 g003
Figure 4. Plot of time window metrics results on three datasets, listed as (a) ETTh1, (b) ETTm1, and (c) Australian electricity load.
Figure 4. Plot of time window metrics results on three datasets, listed as (a) ETTh1, (b) ETTm1, and (c) Australian electricity load.
Applsci 15 07003 g004
Figure 5. Visualization of the target sequences of the three datasets, listed as (a) ETTh1; (b) ETTm1; and (c) Australian electricity load.
Figure 5. Visualization of the target sequences of the three datasets, listed as (a) ETTh1; (b) ETTm1; and (c) Australian electricity load.
Applsci 15 07003 g005
Figure 6. Comparison chart fitting the 1-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 6. Comparison chart fitting the 1-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g006aApplsci 15 07003 g006b
Figure 7. Comparison chart fitting the 3-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 7. Comparison chart fitting the 3-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g007aApplsci 15 07003 g007b
Figure 8. Comparison chart fitting the 6-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 8. Comparison chart fitting the 6-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g008aApplsci 15 07003 g008b
Figure 9. Comparison chart fitting the 12-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 9. Comparison chart fitting the 12-step prediction experimental model of the ETTh1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g009aApplsci 15 07003 g009b
Figure 10. Comparison chart fitting the 1-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 10. Comparison chart fitting the 1-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g010aApplsci 15 07003 g010b
Figure 11. Comparison chart fitting the 3-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 11. Comparison chart fitting the 3-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g011aApplsci 15 07003 g011b
Figure 12. Comparison chart fitting the 6-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 12. Comparison chart fitting the 6-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g012aApplsci 15 07003 g012b
Figure 13. Comparison chart fitting the 12-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 13. Comparison chart fitting the 12-step prediction experimental model of the ETTm1 dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g013aApplsci 15 07003 g013b
Figure 14. Comparison chart fitting the 1-step prediction experimental model of the Australian electricity load dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Figure 14. Comparison chart fitting the 1-step prediction experimental model of the Australian electricity load dataset, listed as (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; and (g) Ours.
Applsci 15 07003 g014aApplsci 15 07003 g014b
Figure 15. Fitting comparison of the 3-step prediction results of each model on the Australian electricity load dataset: (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; (g) Ours.
Figure 16. Fitting comparison of the 6-step prediction results of each model on the Australian electricity load dataset: (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; (g) Ours.
Figure 17. Fitting comparison of the 12-step prediction results of each model on the Australian electricity load dataset: (a) GRU; (b) TCN; (c) GRU-Attention; (d) TCN-Attention; (e) MrCAN; (f) MSGNet; (g) Ours.
Figure 18. Ablation experiment metric results for the three datasets: (a) ETTh1; (b) ETTm1; (c) Australian electricity load.
Figure 19. Robustness analysis results for the ETTh1 dataset: (a) 1-step; (b) 3-step; (c) 6-step; (d) 12-step.
Table 1. Dataset description.

| Dataset | Prediction Target | Dataset Division (Training Set:Validation Set:Test Set) |
|---|---|---|
| ETTh1 | OT | 12,194:1742:3484 |
| ETTm1 | OT | 48,776:6968:13,936 |
| Australian Electricity Load | Electricity Load | 61,354:8765:17,529 |
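The divisions in Table 1 are contiguous segments of each series. As a minimal, hypothetical sketch (the paper does not publish its data-loading code, so the array name and the assumption of a chronological, unshuffled split are ours), the ETTh1 division can be reproduced as follows:

```python
import numpy as np

def chronological_split(series: np.ndarray, n_train: int, n_val: int, n_test: int):
    """Split a time series into contiguous train/validation/test segments,
    preserving temporal order (no shuffling)."""
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# ETTh1 division from Table 1: 12,194 : 1742 : 3484 samples.
# `etth1_values` is placeholder random data standing in for the loaded series.
etth1_values = np.random.randn(12194 + 1742 + 3484)
train, val, test = chronological_split(etth1_values, 12194, 1742, 3484)
print(train.shape, val.shape, test.shape)  # (12194,) (1742,) (3484,)
```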
Table 2. Model parameter settings.

| Parameter | Value |
|---|---|
| Learning rate | 0.001 |
| Batch size | 128 |
| Optimizer | Adam [44] |
| Epochs | 200 |
| Dropout | 0.2 |
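For illustration, the hyperparameters in Table 2 map onto a PyTorch training setup as sketched below. The network here is a stand-in placeholder, not the authors' GCN-Transformer implementation, and the MSE loss is an assumption; only the named hyperparameter values come from the table.

```python
import torch
import torch.nn as nn

# Hyperparameter values taken from Table 2.
LEARNING_RATE = 0.001
BATCH_SIZE = 128
EPOCHS = 200
DROPOUT = 0.2

# Placeholder network; the actual model is the multi-scale GCN plus
# MLLA-improved Transformer described in the paper.
model = nn.Sequential(
    nn.Linear(7, 64),   # 7 input variables, as in the ETT datasets
    nn.ReLU(),
    nn.Dropout(DROPOUT),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)  # Adam [44]
criterion = nn.MSELoss()  # assumed training loss, not stated in Table 2

def train_one_epoch(loader):
    """One pass over a DataLoader yielding (input, target) batches."""
    model.train()
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```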
Table 3. RMSE, MAE, and R² evaluation metrics of the ETTh1 dataset.

| Metric | Method | 1-step | 3-step | 6-step | 12-step |
|---|---|---|---|---|---|
| RMSE | ARIMA | 1.900 | 2.418 | 2.732 | 3.348 |
| | GRU | 0.942 | 1.107 | 1.266 | 1.702 |
| | TCN | 0.806 | 1.206 | 1.511 | 1.968 |
| | GRU-Attention | 0.852 | 0.915 | 1.239 | 1.671 |
| | TCN-Attention | 0.730 | 1.015 | 1.313 | 1.736 |
| | MrCAN | 0.705 | 1.002 | 1.283 | 1.688 |
| | MSGNet | 0.741 | 0.931 | 1.210 | 1.662 |
| | Ours | 0.664 | 0.904 | 1.197 | 1.656 |
| MAE | ARIMA | 1.691 | 2.112 | 2.493 | 3.016 |
| | GRU | 0.772 | 0.745 | 0.940 | 1.311 |
| | TCN | 0.601 | 0.946 | 1.143 | 1.542 |
| | GRU-Attention | 0.668 | 0.664 | 0.913 | 1.272 |
| | TCN-Attention | 0.517 | 0.758 | 1.031 | 1.383 |
| | MrCAN | 0.510 | 0.746 | 0.974 | 1.312 |
| | MSGNet | 0.559 | 0.677 | 0.887 | 1.263 |
| | Ours | 0.481 | 0.642 | 0.863 | 1.262 |
| R² | ARIMA | 0.731 | 0.683 | 0.554 | 0.328 |
| | GRU | 0.925 | 0.913 | 0.865 | 0.756 |
| | TCN | 0.945 | 0.877 | 0.807 | 0.675 |
| | GRU-Attention | 0.939 | 0.929 | 0.871 | 0.767 |
| | TCN-Attention | 0.951 | 0.913 | 0.855 | 0.746 |
| | MrCAN | 0.958 | 0.915 | 0.861 | 0.760 |
| | MSGNet | 0.953 | 0.927 | 0.876 | 0.768 |
| | Ours | 0.963 | 0.931 | 0.880 | 0.769 |
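RMSE, MAE, and R² in Tables 3–5 follow their standard definitions; a self-contained sketch of how they can be computed is given below (the paper does not show its evaluation code, so this is our own reference implementation of the textbook formulas).

```python
import numpy as np

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy usage with hypothetical arrays:
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```

Because R² measures improvement over a mean-only baseline, it can fall below zero when a model fits worse than simply predicting the series mean, which is why ARIMA reaches −0.494 at the 12-step horizon in Table 5.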
Table 4. RMSE, MAE, and R² evaluation metrics of the ETTm1 dataset.

| Metric | Method | 1-step | 3-step | 6-step | 12-step |
|---|---|---|---|---|---|
| RMSE | ARIMA | 1.432 | 2.857 | 3.586 | 4.338 |
| | GRU | 0.460 | 0.640 | 0.751 | 0.951 |
| | TCN | 0.385 | 0.690 | 0.879 | 0.999 |
| | GRU-Attention | 0.510 | 0.658 | 0.894 | 0.963 |
| | TCN-Attention | 0.678 | 0.857 | 0.943 | 0.956 |
| | MrCAN | 0.378 | 0.606 | 0.728 | 0.913 |
| | MSGNet | 0.365 | 0.514 | 0.735 | 0.855 |
| | Ours | 0.346 | 0.470 | 0.613 | 0.855 |
| MAE | ARIMA | 1.217 | 2.625 | 3.398 | 4.164 |
| | GRU | 0.370 | 0.518 | 0.577 | 0.719 |
| | TCN | 0.292 | 0.551 | 0.642 | 0.731 |
| | GRU-Attention | 0.378 | 0.500 | 0.670 | 0.710 |
| | TCN-Attention | 0.580 | 0.640 | 0.799 | 0.704 |
| | MrCAN | 0.283 | 0.471 | 0.560 | 0.676 |
| | MSGNet | 0.273 | 0.352 | 0.531 | 0.611 |
| | Ours | 0.252 | 0.320 | 0.415 | 0.603 |
| R² | ARIMA | 0.736 | 0.682 | 0.545 | 0.331 |
| | GRU | 0.966 | 0.955 | 0.942 | 0.913 |
| | TCN | 0.970 | 0.949 | 0.935 | 0.906 |
| | GRU-Attention | 0.958 | 0.953 | 0.932 | 0.908 |
| | TCN-Attention | 0.951 | 0.938 | 0.925 | 0.923 |
| | MrCAN | 0.971 | 0.959 | 0.945 | 0.920 |
| | MSGNet | 0.972 | 0.967 | 0.942 | 0.928 |
| | Ours | 0.979 | 0.971 | 0.958 | 0.928 |
Table 5. RMSE, MAE, and R² evaluation metrics of the Australian electricity load dataset.

| Metric | Method | 1-step | 3-step | 6-step | 12-step |
|---|---|---|---|---|---|
| RMSE | ARIMA | 1114.640 | 1497.295 | 2427.290 | 2969.202 |
| | GRU | 132.349 | 388.518 | 753.815 | 1157.247 |
| | TCN | 319.052 | 477.217 | 856.929 | 1277.548 |
| | GRU-Attention | 126.481 | 335.018 | 583.403 | 980.216 |
| | TCN-Attention | 180.761 | 381.154 | 591.088 | 1045.967 |
| | MrCAN | 132.296 | 307.270 | 576.016 | 884.175 |
| | MSGNet | 132.257 | 310.001 | 564.320 | 828.189 |
| | Ours | 109.301 | 292.855 | 518.900 | 783.182 |
| MAE | ARIMA | 1099.937 | 1276.531 | 2101.269 | 2735.886 |
| | GRU | 99.840 | 279.779 | 552.986 | 878.398 |
| | TCN | 235.910 | 390.566 | 708.918 | 1067.904 |
| | GRU-Attention | 94.676 | 242.491 | 417.174 | 730.904 |
| | TCN-Attention | 134.382 | 275.569 | 425.921 | 819.309 |
| | MrCAN | 100.884 | 218.761 | 414.341 | 650.945 |
| | MSGNet | 99.612 | 224.166 | 410.227 | 658.960 |
| | Ours | 81.107 | 216.314 | 380.461 | 574.777 |
| R² | ARIMA | 0.426 | 0.391 | 0.283 | −0.494 |
| | GRU | 0.980 | 0.920 | 0.699 | 0.290 |
| | TCN | 0.936 | 0.879 | 0.611 | 0.135 |
| | GRU-Attention | 0.982 | 0.941 | 0.816 | 0.491 |
| | TCN-Attention | 0.972 | 0.923 | 0.819 | 0.420 |
| | MrCAN | 0.980 | 0.950 | 0.824 | 0.585 |
| | MSGNet | 0.980 | 0.949 | 0.831 | 0.636 |
| | Ours | 0.984 | 0.954 | 0.857 | 0.675 |
Table 6. Robustness analysis metrics across the three datasets.

| Dataset | Metric | 1-step | 3-step | 6-step | 12-step |
|---|---|---|---|---|---|
| ETTh1 | RMSE | 1.800 | 1.963 | 2.301 | 2.567 |
| | MAE | 0.872 | 0.909 | 1.327 | 1.589 |
| | R² | 0.932 | 0.924 | 0.868 | 0.755 |
| ETTh2 | RMSE | 0.533 | 0.617 | 0.728 | 0.981 |
| | MAE | 0.383 | 0.472 | 0.590 | 0.803 |
| | R² | 0.966 | 0.959 | 0.954 | 0.931 |
| Australian Electricity Load | RMSE | 216.247 | 401.082 | 665.439 | 892.131 |
| | MAE | 193.460 | 314.960 | 582.718 | 719.004 |
| | R² | 0.973 | 0.948 | 0.836 | 0.682 |
Table 7. Computational cost of MSGNet-MLLA-Transformer on the Australian electricity load dataset.

| Method | Training Time | Inference Time | RMSE (12-step) |
|---|---|---|---|
| Ours | 0.178 s | 0.0457 s | 783.182 |
| MSGNet | 0.266 s | 0.084 s | 828.189 |
| MrCAN | 0.087 s | 0.025 s | 884.175 |
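The paper does not specify how the times in Table 7 were measured (e.g., per batch or per epoch, CPU or GPU). A generic wall-clock timing sketch, under those caveats and with a hypothetical helper name, looks like this:

```python
import time
import torch

def average_inference_time(model: torch.nn.Module, x: torch.Tensor,
                           n_runs: int = 100) -> float:
    """Average wall-clock inference time per forward pass, in seconds."""
    model.eval()
    with torch.no_grad():
        _ = model(x)  # warm-up run, excluded from timing
        start = time.perf_counter()
        for _ in range(n_runs):
            _ = model(x)
            # On GPU, torch.cuda.synchronize() would be needed here for
            # accurate timings, since CUDA kernels launch asynchronously.
        elapsed = time.perf_counter() - start
    return elapsed / n_runs
```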