Short-Term Multi-Energy Load Forecasting Method Based on Transformer Spatio-Temporal Graph Neural Network

Zhou, Heng; Ai, Qing; Li, Ruiting

doi:10.3390/en18174466


by Heng Zhou ¹, Qing Ai ¹,* and Ruiting Li ²

¹ College of Intelligent Systems Science and Engineering, Hubei Minzu University, Enshi 445000, China
² Hubei Xuan’en Power Supply Company, Xuanen 445500, China
* Author to whom correspondence should be addressed.

Energies 2025, 18(17), 4466; https://doi.org/10.3390/en18174466

Submission received: 28 June 2025 / Revised: 13 August 2025 / Accepted: 20 August 2025 / Published: 22 August 2025


Abstract

To tackle the limitations of existing methods in simultaneously modeling long-term dependencies in the time dimension and nonlinear interactions in the feature dimension, as well as their inability to fully reflect the impact of real-time load changes on spatial dependencies, a short-term multi-energy load forecasting method based on a Transformer Spatio-Temporal Graph neural network (TSTG) is proposed. This method employs a multi-head spatio-temporal attention module to model long-term dependencies in the time dimension and nonlinear interactions in the feature dimension in parallel across multiple subspaces. Additionally, a dynamic adaptive graph convolution module is designed to construct adaptive adjacency matrices by combining physical topology and feature similarity, dynamically adjusting node connection weights based on real-time load characteristics to more accurately characterize the spatial dynamics of multi-energy interactions. Furthermore, TSTG adopts an end-to-end spatio-temporal joint optimization framework, achieving synchronous extraction and fusion of spatio-temporal features through an encoder–decoder architecture. Experimental results show that TSTG significantly outperforms existing methods in short-term load forecasting tasks, providing an effective solution for refined forecasting in integrated energy systems.

Keywords: multi-energy load forecasting; transformer; graph neural network; integrated energy systems; deep learning

1. Introduction

Driven by the “dual-carbon” strategy, Integrated Energy Systems (IES) have become the core carrier for improving energy utilization efficiency and reducing carbon emissions through the collaborative optimization and flexible conversion of multiple energy sources such as electricity, heat, and gas [1,2,3]. However, the dynamic coupling characteristics of multi-energy loads in IES (such as the time delay effect of cogeneration units and the spatio-temporal interaction of distributed photovoltaics and heat pumps) make short-term load forecasting a great challenge [4]. High-precision forecasting over horizons such as the next 6–24 h is a key technology supporting IES real-time scheduling, demand response, and fault warning. Studies have shown that every 1% reduction in load forecasting error can reduce the system reserve capacity demand by about 3–7% and significantly reduce the wind and solar curtailment rate [5,6]. Therefore, it is of great theoretical and practical value to develop prediction models that can simultaneously model complex spatio-temporal dependencies and cross-energy interactions.

Under the operating environment of a modern power grid, the power load is affected by various external conditions such as holidays, environmental factors, and meteorological factors and presents strong uncertainty and nonlinear characteristics [7,8]. Traditional load forecasting methods have limitations in dealing with these complex scenarios and are gradually being replaced by deep learning-based methods. Among existing deep learning models, Recurrent Neural Networks (RNNs) have been widely used in load prediction because of their ability to remember time series features [9]. However, it is difficult for traditional RNNs to capture multi-scale features, which limits model performance when processing complex time series data. Therefore, Zhao et al. [10] proposed an RNN model based on a residual structure for short-term power load prediction, which effectively improved the prediction accuracy. With the further development of deep learning techniques, two important variants of the RNN, Gated Recurrent Units (GRUs) and Long Short-Term Memory networks (LSTMs), have also been successfully applied in the field of power load forecasting [3,11,12,13]. Chen et al. [14] proposed a combined prediction model based on LSTM and XGBoost and verified the superiority of this model compared with traditional methods through experiments. Combining a convolutional neural network with a bidirectional gated recurrent unit (CNN-BiGRU-NN), Zeng et al. [15] first used the CNN to extract effective features from the historical load, meteorological factors, and date information and then fed them into the BiGRU to forecast the load, which significantly improved the prediction accuracy.

Many methods [16,17,18] have been proposed to improve the accuracy and reliability of multivariate load forecasting. These methods mainly focus on improving traditional neural network models, introducing multi-task learning frameworks, and adopting advanced deep learning techniques. For example, Tieyan et al. [19] proposed a comprehensive load prediction model for multi-energy systems based on an improved Markov chain neural network. By combining a Markov chain with a neural network, the ability to capture load changes in a multi-energy system is enhanced, thus improving the prediction accuracy of the model. Tan et al. [20] proposed a joint load prediction model based on multi-task learning and a least squares support vector machine for the electric–hot–cold–gas integrated energy system. The model uses the multi-task learning framework to fully consider the coupling relationship between different energy loads and can realize the joint prediction of multiple energy loads, thus improving the generalization ability of the model. To improve the prediction ability of the model for nonlinear and non-stationary load data, Jing et al. [21] proposed a multi-task learning model based on adaptive local mean decomposition and a long short-term memory network. By decomposing the load data, features of different scales can be obtained. In addition, to effectively capture the time series features in the load data, Wang et al. [22] proposed using the Transformer model to forecast the multi-energy load in the integrated energy system, taking advantage of the Transformer's strength in processing sequence data to further improve the prediction accuracy. Other methods first build a model of the relationship between multiple loads and then carry out the load prediction. For example, Wu et al. [23] proposed an interpretable framework based on coupling features and multi-task learning, which enables the model to provide high-precision prediction while revealing the interaction between different energy loads, enhancing the transparency and credibility of the model. However, the aforementioned methods have difficulty modeling the long-term dependence in the time dimension and the nonlinear interaction in the feature dimension at the same time and cannot fully reflect the effect of real-time load changes on spatial dependence.

Recent studies have shown that Graph Neural Networks (GNNs) and Transformers offer powerful capabilities for capturing complex spatio-temporal patterns. GNNs are adept at modeling spatial dependencies in graph-structured data and have been successfully applied in traffic prediction and social networks [24,25]. Transformers, with their self-attention mechanism, are effective in learning long-range dependencies and have achieved state-of-the-art results in time series forecasting, language modeling, and energy demand prediction [26,27,28]. Combining GNNs and Transformers enables joint modeling of spatial topology and temporal dynamics, showing great promise in various spatio-temporal tasks [29,30]. In particular, recent Transformer-based models such as Informer [31], Autoformer [32], FEDformer [33], Reformer [34], and Pyraformer [35] have been proposed to enhance efficiency, improve the long-sequence modeling capability, and capture seasonal-trend components for time series forecasting. Alongside these, MLP-based approaches including LightTS [36], TiDE [37], and TSMixer [38] have emerged as lightweight yet competitive alternatives for multivariate forecasting. These strengths make them highly suitable for short-term multi-energy load forecasting, where spatial interactions and temporal evolution are deeply coupled.

Therefore, a short-term multi-energy load prediction model based on a Transformer Spatio-Temporal Graph neural network (TSTG) is proposed. It can dig deep into the complex relationships among features and the multi-level and multi-dimensional relationships in the time dimension. Compared to traditional methods, this method is able to capture diverse dependencies more precisely, especially modeling interactions over time periods and different energy types in spatio-temporal data. Through a multi-head attention mechanism, the model can focus on the relationship between different features and time steps in multiple subspaces at the same time, so as to improve the prediction ability of short-term load fluctuations. In addition, the graph convolutional network is used to capture and integrate the spatial dependencies between features and effectively transfer and aggregate the relevant information between features, so that the model can rely on global information rather than single features when making predictions. Through this efficient information transmission, more accurate feature representation is provided for the model, and the overall prediction accuracy is improved.

The main contributions of this paper can be summarized as follows:

We introduce a unified spatio-temporal prediction framework (TSTG) that leverages the strengths of Transformers and Graph Neural Networks, effectively capturing both temporal dynamics and spatial dependencies inherent in multi-energy systems.
To enhance temporal and feature-level modeling, we design a novel multi-head spatio-temporal attention mechanism that incorporates mutual information, enabling the model to uncover complex long-range dependencies and feature interactions across different subspaces.
We construct a dual-modality graph learning module that adaptively fuses physical topologies and data-driven similarities. This dynamic graph convolution approach allows the model to adjust edge weights in real-time, thereby reflecting evolving multi-energy correlations more precisely.
A spatio-temporal joint optimization strategy is developed, where attention and graph structures are co-trained within an encoder–decoder architecture. This promotes mutual enhancement between spatial aggregation and temporal encoding, improving both interpretability and predictive performance.
Through comprehensive experiments on real-world datasets, we show that the TSTG consistently outperforms existing baselines across various forecasting horizons and load types, achieving remarkable improvements in MAE, RMSE, and R² metrics for electric, cooling, and heating demands.

2. Materials and Methods

2.1. Overall Framework of TSTG

In this section, we propose a short-term multi-energy load forecasting framework named Transformer Spatio-Temporal Graph neural network (TSTG), as illustrated in Figure 1. The model adopts an encoder–decoder architecture, enabling the joint extraction and fusion of spatio-temporal features.

Given a historical input sequence $X \in \mathbb{R}^{T_{hist} \times N \times D}$, where $T_{hist}$ is the number of time steps, $N$ the number of nodes, and $D$ the feature dimension, the data are first projected into a latent space. Each encoder layer consists of two key modules:

Multi-head spatio-temporal attention, which captures long-term temporal dependencies and feature-level interactions in parallel.
Dynamic adaptive graph convolution, which builds an adaptive graph based on both physical topology and real-time load similarities.

Residual connections and layer normalization are used throughout to maintain stability. The decoder utilizes encoder outputs as memory to generate predictions, enabling effective modeling of complex spatio-temporal patterns in multi-energy systems. The full inference pipeline of the TSTG is described in Algorithm 1.

Algorithm 1: TSTG-based Short-term Multi-energy Load Forecasting

2.2. A Framework for Short-Term Multi-Energy Load Forecasting in Transformer

To solve the problem of the low accuracy of short-term multi-energy load prediction based on a spatio-temporal graph neural network in an integrated energy system, a short-term multi-energy load prediction model (TSTG) based on the Transformer spatio-temporal graph neural network is constructed. The framework diagram is shown in Figure 1.

The overall framework is based on an encoder–decoder structure that achieves high-precision prediction through multi-level spatio-temporal feature fusion. In each encoder and decoder layer, the multi-head spatio-temporal attention module and the dynamic adaptive graph convolution module are connected by residual connections and layer normalization. A second multi-head spatio-temporal attention module in each decoder layer receives the encoder output as historical memory. The input to the model is the historical multi-energy load series $X \in \mathbb{R}^{T_{hist} \times N \times D}$, where $T_{hist}$ is the number of historical time steps, $N$ is the number of energy nodes (including physical nodes and virtual nodes), and $D$ is the feature dimension. A linear projection layer maps the original input to a high-dimensional latent space, $H \in \mathbb{R}^{T_{hist} \times N \times d_{model}}$, to enhance the feature representation.

2.3. Multi-Head Space–Time Attention Module

The core idea of this module is to extend traditional single-mode attention to a time-feature dual-channel parallel architecture and to introduce Mutual Information (MI) as a quantitative index of nonlinear correlation. Its structure is shown in Figure 2. Specifically, given the input hidden state $H \in \mathbb{R}^{T_{hist} \times N \times d_{model}}$ of the encoder layer, where $T_{hist}$ is the number of historical time steps, $N$ is the number of energy nodes, and $d_{model}$ is the hidden layer dimension, it is first decomposed by linear projection into a time slice sequence $\{H_{t} \mid t = 1, \dots, T_{hist}\}$ and a set of feature slices $\{H^{d} \mid d = 1, \dots, D\}$, corresponding to the subspaces of the time and feature dimensions, respectively. On this basis, the bidirectional attention weight matrices are constructed: the time attention matrix $A^{time} \in \mathbb{R}^{T_{hist} \times T_{hist}}$ captures the dependence between time steps, and the feature attention matrix $A^{feat} \in \mathbb{R}^{D \times D}$ quantifies the interaction intensity between different energy types.

For the time attention calculation, traditional methods usually measure the correlation between time steps $t_{i}$ and $t_{j}$ through dot-product similarity, but such a linear measure has difficulty describing the asymmetry and hysteresis effects of load fluctuations. To this end, this module introduces mutual information as a supplementary correlation measure, which can be formalized as

$$\mathrm{MI}(H_{t_{i}}, H_{t_{j}}) = \mathbb{E}_{p(H_{t_{i}}, H_{t_{j}})} \left[ \log \frac{p(H_{t_{i}}, H_{t_{j}})}{p(H_{t_{i}})\, p(H_{t_{j}})} \right], \tag{1}$$

where $p(H_{t_{i}}, H_{t_{j}})$ is the joint distribution of the hidden states at time steps $t_{i}$ and $t_{j}$, and $p(H_{t_{i}})$ and $p(H_{t_{j}})$ are the marginal distributions. For efficient computation, a neural-network-based mutual information estimator is used to approach the lower bound of the KL divergence through adversarial training. Finally, the temporal attention weight is determined by a linear combination of the scaled dot product and the mutual information:

$$A_{t_{i}, t_{j}}^{time} = \mathrm{Softmax}\left( \frac{Q_{t_{i}}^{time} (K_{t_{j}}^{time})^{T}}{\sqrt{d_{k}}} + \lambda_{time} \times \mathrm{MI}(H_{t_{i}}, H_{t_{j}}) \right), \tag{2}$$

where $Q^{time} = H W_{Q}^{time}$ and $K^{time} = H W_{K}^{time}$ are the query and key matrices of the time dimension, $W_{Q}^{time}, W_{K}^{time} \in \mathbb{R}^{d_{model} \times d_{k}}$ are the learnable parameters, and $\lambda_{time}$ is the adaptive adjustment coefficient. Similarly, the calculation of the feature attention weights introduces mutual information between features:

$$A_{d_{m}, d_{n}}^{feat} = \mathrm{Softmax}\left( \frac{Q_{d_{m}}^{feat} (K_{d_{n}}^{feat})^{T}}{\sqrt{d_{k}}} + \lambda_{feat} \times \mathrm{MI}(H_{d_{m}}, H_{d_{n}}) \right), \tag{3}$$

where $Q^{feat} = H W_{Q}^{feat}$, $K^{feat} = H W_{K}^{feat}$, and $W_{Q}^{feat}, W_{K}^{feat} \in \mathbb{R}^{d_{model} \times d_{k}}$ are projection matrices independent of the time dimension. Through the above design, the model is able to simultaneously capture long-range patterns in the time dimension (such as inter-day periodicity) and nonlinear coupling in the feature dimension (such as the asymmetric effect of abrupt temperature changes on the electric and heating loads).
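The form of Eq. (2) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the MI matrix `mi` is a random symmetric stand-in for the output of a learned estimator, the head dimension `d_k = 16` and the weight `lam = 0.5` are assumed values, and the function names are hypothetical. The sketch only shows how an additive MI term enters the attention logits before the Softmax.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def mi_biased_attention(H, W_q, W_k, mi, lam):
    """Attention weights of the form softmax(Q K^T / sqrt(d_k) + lam * MI).

    H   : (T, d_model) hidden states of one node over T time steps
    W_q : (d_model, d_k) query projection
    W_k : (d_model, d_k) key projection
    mi  : (T, T) pairwise MI estimates (placeholder input, not estimated here)
    lam : scalar weight on the MI term
    """
    Q = H @ W_q
    K = H @ W_k
    d_k = W_q.shape[1]
    scores = Q @ K.T / np.sqrt(d_k) + lam * mi
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
T, d_model, d_k = 24, 64, 16            # d_k is an assumed value
H = rng.normal(size=(T, d_model))
W_q = rng.normal(size=(d_model, d_k)) * 0.1
W_k = rng.normal(size=(d_model, d_k)) * 0.1
mi = np.abs(rng.normal(size=(T, T)))    # stand-in for a learned MI estimator
mi = (mi + mi.T) / 2                    # MI is symmetric in its arguments
A_time = mi_biased_attention(H, W_q, W_k, mi, lam=0.5)
```

Setting `lam=0` recovers ordinary scaled dot-product attention, which makes the MI term easy to ablate in such a sketch.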

To further enhance the representational capacity of the model, a multi-head parallelization mechanism is adopted in this module. Specifically, time and feature attention are split into $h$ heads, each of which learns different levels of spatio-temporal dependence patterns. For the $k$-th head, the output can be expressed as

$$\mathrm{head}^{k} = \mathrm{Concat}(A^{time, k} V^{time}, A^{feat, k} V^{feat}), \tag{4}$$

where $V^{time} = H W_{V}^{time}$ and $V^{feat} = H W_{V}^{feat}$ are the value matrices, and $W_{V}^{time}, W_{V}^{feat} \in \mathbb{R}^{d_{model} \times d_{v}}$ are the projection parameters. The outputs of all heads are concatenated and linearly transformed to obtain the final attention representation:

$$\mathrm{MI\text{-}MHA}(H) = \mathrm{Concat}(\mathrm{head}^{1}, \dots, \mathrm{head}^{h}) W^{O}, \tag{5}$$

where $W^{O} \in \mathbb{R}^{h d_{v} \times d_{model}}$ represents the output matrix, which is used to ensure that the dimensions match. Through the multi-head mechanism, the model can focus on diversified patterns such as local fluctuations, global trends, and cross-energy interactions, thereby improving the modeling ability of complex spatio-temporal dynamics.

2.4. Dynamic Adaptive Graph Convolution Module

By combining physical topological priors and feature-driven similarity, the module constructs a dual-modal graph structure to achieve fine-grained modeling of the spatial dependencies among multi-energy nodes. Its structure is shown in Figure 3. Assume that the energy system consists of $N$ nodes, including physical nodes (such as substation $v_{1}$ and gas storage tank $v_{2}$) and virtual nodes (such as the electricity, heat, and gas loads $v_{type}$). The adjacency matrix $A \in \mathbb{R}^{N \times N}$ of the module is composed of a static topology matrix $A_{static}$ and a dynamic feature matrix $A_{dynamic}$. The static topology matrix $A_{static}$ defines the static adjacency weights based on the physical connections of the energy network. If the nodes $v_{i}$ and $v_{j}$ are directly connected through a power grid bus or heat pipeline, the basic weight $A_{i, j}^{static}$ is set to 1. If a device such as a transformer or valve is present, it is set to a continuous value based on the impedance or pressure drop factor.

To capture the changes in spatial dependence caused by load fluctuations, the dynamic weights are calculated from the mutual information similarity between node features. Given the node feature matrix $X \in \mathbb{R}^{N \times D}$ at the current time step, where $D$ represents the feature dimension, the feature interaction is first enhanced by a bilinear transformation with shared parameters:

$$X^{'} = \mathrm{ReLU}(X W_{a}) \cdot \mathrm{ReLU}(X W_{b})^{T}, \tag{6}$$

where $W_{a}, W_{b} \in \mathbb{R}^{D \times d_{k}}$ are learnable projection matrices, and $\cdot$ represents matrix multiplication. Then, the mutual information score for the node pair $(v_{i}, v_{j})$ is computed as

$$A_{dynamic}^{i, j} = \mathrm{MI}(X_{i}^{'}, X_{j}^{'}) = \mathbb{E}_{p(X_{i}^{'}, X_{j}^{'})} \left[ \log \frac{p(X_{i}^{'}, X_{j}^{'})}{p(X_{i}^{'})\, p(X_{j}^{'})} \right]. \tag{7}$$

In addition, to avoid negative interference, ReLU activation and symmetric normalization are performed on $A_{dynamic}$. To ensure that each node adaptively attends to others with normalized importance, a row-wise Softmax is applied to the similarity matrix to generate a dense dynamic matrix, and the updated dynamic matrix $A_{dynamic}^{'}$ is given by

$$A_{dynamic}^{'} = \mathrm{Softmax}(\mathrm{ReLU}(A_{dynamic})). \tag{8}$$

Then, the static and dynamic matrices are combined by weighted superposition, with numerical stability ensured by degree matrix normalization:

$$A = \alpha \cdot \tilde{A}_{static} + (1 - \alpha) \cdot \tilde{A}_{dynamic}, \tag{9}$$

where $\tilde{A}_{static} = D_{static}^{-\frac{1}{2}} A_{static} D_{static}^{-\frac{1}{2}}$ and $\tilde{A}_{dynamic} = D_{dynamic}^{-\frac{1}{2}} A_{dynamic}^{'} D_{dynamic}^{-\frac{1}{2}}$ are the normalized matrices, and $\alpha \in [0, 1]$ is the adaptive fusion coefficient, which is dynamically adjusted by learnable parameters.
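The fusion in Eqs. (8) and (9) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the degree matrix is taken to be the diagonal matrix of row sums, `alpha` is passed in as a fixed scalar rather than learned, the raw score matrix stands in for the MI scores of Eq. (7), and the 4-node chain topology and all function names are hypothetical.

```python
import numpy as np

def row_softmax(M):
    """Row-wise softmax, so every row sums to 1."""
    shifted = M - M.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sym_normalize(A):
    """D^{-1/2} A D^{-1/2}, with D the diagonal matrix of row sums (assumed)."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    inv_sqrt[nonzero] = deg[nonzero] ** -0.5   # isolated nodes keep a zero scale
    return A * inv_sqrt[:, None] * inv_sqrt[None, :]

def fuse_adjacency(A_static, A_dynamic, alpha):
    """Weighted fusion of the normalized static and dynamic graphs."""
    A_dyn = row_softmax(np.maximum(A_dynamic, 0.0))   # Eq. (8): Softmax(ReLU(.))
    return alpha * sym_normalize(A_static) + (1.0 - alpha) * sym_normalize(A_dyn)  # Eq. (9)

# Hypothetical 4-node chain topology and random stand-in similarity scores.
A_static = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
A_fused = fuse_adjacency(A_static, scores, alpha=0.6)
```

Under the row-sum assumption, a row-wise Softmax output already has unit degree, so the symmetric normalization leaves the dynamic matrix unchanged in this sketch and only the static matrix is actually rescaled.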

Based on the fused adjacency matrix $A$, the module performs multi-level spatial information aggregation. For the layer-$l$ input node features $H^{(l)} \in \mathbb{R}^{N \times d_{model}}$, the update process is as follows:

$$H^{(l + 1)} = \sigma \left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right), \tag{10}$$

where $\tilde{A} = A + I$ denotes the adjacency matrix with added self-loops to avoid information loss, $\tilde{D}$ represents the corresponding degree matrix with $\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$, $W^{(l)} \in \mathbb{R}^{d_{model} \times d_{model}}$ is the trainable weight matrix, and $\sigma(\cdot)$ represents the GELU activation function. To enhance the expressiveness of the model, a residual connection and layer normalization are introduced:

$$H_{out}^{(l + 1)} = \mathrm{LayerNorm}(H^{(l)} + H^{(l + 1)}). \tag{11}$$
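A single layer of Eqs. (10) and (11) can be sketched as below. The sketch makes several simplifying assumptions that are not stated in the source: GELU is computed with its common tanh approximation, layer normalization omits the learnable gain and bias, and the adjacency and weight matrices are random placeholders. The function names are hypothetical.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def graph_conv_layer(H, A, W):
    """One graph convolution step followed by a residual connection and layer norm.

    H : (N, d_model) node features
    A : (N, N) non-negative fused adjacency matrix
    W : (d_model, d_model) weight matrix
    """
    N = A.shape[0]
    A_tilde = A + np.eye(N)                    # add self-loops
    deg = A_tilde.sum(axis=1)                  # >= 1 because A is non-negative
    d_inv_sqrt = deg ** -0.5
    A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    H_next = gelu(A_hat @ H @ W)               # Eq. (10)
    return layer_norm(H + H_next)              # Eq. (11)

rng = np.random.default_rng(0)
N, d_model = 20, 64
H = rng.normal(size=(N, d_model))
A = np.abs(rng.normal(size=(N, N)))            # placeholder for the fused adjacency
W = rng.normal(size=(d_model, d_model)) * 0.1
H_out = graph_conv_layer(H, A, W)
```

Keeping the weight matrix square, as in Eq. (10), is what allows the residual sum `H + H_next` without an extra projection.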

2.5. End-to-End Space–Time Joint Optimization Framework

This method achieves deep coupling and joint optimization of spatio-temporal features through an encoder–decoder architecture. The multi-head spatio-temporal attention module and the dynamic adaptive graph convolution module are embedded into a unified framework. In the encoder, time-feature dependency modeling and spatial information aggregation are carried out together through a cascaded residual structure: the input features first pass through the multi-head spatio-temporal attention module, where the cross-time-step and cross-energy nonlinear correlations are quantified by mutual information and diversified spatio-temporal patterns are extracted in parallel in the multi-head space. Then, the dynamic adaptive graph convolution module builds a dynamic graph structure based on physical topology and real-time feature similarity and aggregates neighborhood node information through spectral graph convolution to explicitly model the spatial interactions of the energy network. The outputs of the two modules are fused by a residual connection and layer normalization, which not only preserves the integrity of the original features but also mitigates the vanishing gradient problem. The decoder uses the higher-order spatio-temporal features output by the encoder as historical memory to interact with the decoder hidden state, thereby dynamically adjusting the spatio-temporal weight distribution during the prediction process.

The core advantage of the framework is that it realizes the bidirectional collaborative optimization of spatio-temporal features. On the one hand, the multi-head spatio-temporal attention module provides node representation rich in temporal dynamic and feature interaction information for the dynamic adaptive graph convolution module through the mutual information enhanced attention mechanism, so that the spatial aggregation can perceive the timing law of load evolution. On the other hand, the spatial enhancement features of the output of the dynamic adaptive graph convolution module feed back the attention calculation and guide the model to pay attention to the physically closely related energy nodes through the dynamic graph structure, so as to improve the physical rationality of the time dimension modeling. In addition, the end-to-end training strategy synchronously optimizes the attention weight, graph structure parameters, and projection matrix through joint backpropagation to avoid the accumulation of errors caused by phased training.

3. Experiment and Results

In Section 3, the training process of the prediction model is elaborated, and the case study and its related results are introduced and analyzed.

3.1. Dataset

The dataset used in this study is derived from the Campus Metabolism Project at Arizona State University (ASU), Tempe campus—an initiative focused on the real-time monitoring and optimization of energy and resource usage to support campus sustainability. The dataset covers a continuous time span from 1 January 2017 to 28 January 2019, totaling 1095 days of recorded data. It includes hourly measurements of electricity consumption (in kilowatts), cooling load (in ton-hours), and heating demand (in BTU/hour) across multiple campus buildings and facilities.

In addition to multi-energy load data, the dataset also contains weather information (e.g., temperature, humidity, wind speed) and campus activity data (e.g., class schedules, occupancy patterns), enabling analysis of external factors influencing energy usage. All data were collected via a network of smart meters, water meters, environmental sensors, and infrastructure management systems. The dataset represents a geographically localized yet functionally diverse energy system, making it suitable for testing models targeting integrated multi-energy forecasting in real-world scenarios.

Although the dataset is not publicly hosted in a standardized repository, it is available for research purposes through the Campus Metabolism platform and can be accessed upon request.

3.2. Evaluation Index

The error measures include the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and coefficient of determination (R²), which measure the deviation between the predicted and actual values from different angles. The RMSE is the square root of the mean squared difference between the predicted and actual values; the smaller the RMSE, the higher the prediction accuracy. Because of the squaring operation, the RMSE is sensitive to outliers and is suitable for scenarios where large errors need to be penalized. The MAE is the average of the absolute differences between the predicted and actual values; the smaller the MAE, the higher the prediction accuracy. Because it uses absolute values, the MAE is less sensitive to outliers than the RMSE and is suitable for scenarios where all errors should be weighted equally. The MAPE provides a scale-independent evaluation of prediction accuracy, which is especially useful for interpreting performance across different load magnitudes. The value of R² typically ranges from 0 to 1 (it can be negative when a model fits worse than predicting the mean), and the closer it is to 1, the better the model explains the data. Together, the RMSE, MAE, MAPE, and R² evaluate model performance along multiple dimensions, providing a more comprehensive and reliable performance measurement.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}, \tag{12}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - \hat{y}_{i} |, \tag{13}$$

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i = 1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right| \times 100\%, \tag{14}$$

$$R^{2} = 1 - \frac{SS_{res}}{SS_{tot}}, \tag{15}$$

where $y_{i}$ is the $i$-th actual value, $\hat{y}_{i}$ is the $i$-th predicted value, and $n$ is the number of samples. $SS_{res} = \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}$ is the residual sum of squares, $SS_{tot} = \sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}$ is the total sum of squares, and $\bar{y}$ is the mean of the actual observations.
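For reference, the four metrics in Eqs. (12) to (15) can be written as a short pure-Python sketch. The function names and the toy values are illustrative only and are not taken from the experiments; the MAPE function assumes that no actual value is zero.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (12)."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (13)."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, Eq. (14); assumes no zero actuals."""
    n = len(y_true)
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / n * 100.0

def r2(y_true, y_pred):
    """Coefficient of determination, Eq. (15)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values: every prediction is off by exactly 10 units.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
scores = {
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "R2": r2(y_true, y_pred),
}
```

In this toy case the RMSE and MAE coincide because all absolute errors are equal, while the MAPE weights the same absolute error more heavily at small load values.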

3.3. Experimental Settings

In the experiment, precipitation and temperature were selected as the key auxiliary features of meteorological factors to describe the influence of weather conditions on energy demand. In addition, considering that holidays and weekends often lead to significant changes in energy consumption patterns, holiday identifiers (1 for holidays and 0 for non-holidays) and weekend identifiers (1 for weekends and 0 for non-weekends) were introduced as additional auxiliary features. To further capture the impact of seasonal and daily cycle changes on the load, the month, day, and hour are also incorporated into the input variables of the model. In order to effectively evaluate and optimize the model performance, the multi-energy load prediction dataset is divided into a training set, validation set, and testing set according to the ratio of 7:1:2. This partitioning method can not only ensure that the model has good fitting ability on known data but also verify its generalization performance on unknown data, so as to comprehensively evaluate the practicality of the model.
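The calendar features and the 7:1:2 partition described above could be prepared along the following lines. This is a sketch under assumptions: the source does not say whether the split is chronological or shuffled, so a chronological split (common for forecasting) is assumed here, and the holiday set, function names, and sample count are hypothetical.

```python
from datetime import date, datetime

def calendar_features(ts, holidays):
    """Month, day, hour, and binary weekend/holiday flags for one timestamp."""
    return {
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "is_weekend": 1 if ts.weekday() >= 5 else 0,   # Saturday=5, Sunday=6
        "is_holiday": 1 if ts.date() in holidays else 0,
    }

def chronological_split(n_samples, ratios=(7, 1, 2)):
    """Index ranges for train/validation/test, kept in time order (assumed)."""
    total = sum(ratios)
    train_end = n_samples * ratios[0] // total
    val_end = n_samples * (ratios[0] + ratios[1]) // total
    return range(0, train_end), range(train_end, val_end), range(val_end, n_samples)

holidays = {date(2017, 1, 1)}                      # hypothetical holiday calendar
feats = calendar_features(datetime(2017, 1, 1, 0), holidays)
train_idx, val_idx, test_idx = chronological_split(1000)
```

Keeping the test indices strictly after the training indices avoids leaking future load values into training, which is the usual reason for preferring a time-ordered split in forecasting work.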

The hyperparameters used in our model are as follows: the number of nodes $N = 20$, determined by the actual number of buildings in the dataset; the number of input modalities $D = 3$ (electricity, cooling, and heating), set according to the available energy types in the dataset; the historical input length $T_{hist} = 24$, chosen to represent one full day of hourly measurements based on prior studies in short-term load forecasting; the model hidden size $d_{model} = 64$, selected through empirical tuning to balance prediction accuracy and computational efficiency; the temporal encoder depth is set to 3 following common practice in Transformer-based time series models; and the spatial GNN hidden dimension $d_{n} = 32$, determined through preliminary experiments to effectively capture spatial dependencies while controlling model complexity.

3.4. Contrast Experiment

In order to comprehensively evaluate the performance of the TSTG model in 6 h, 12 h, 24 h, and 96 h continuous time prediction tasks, the TSTG was compared with eleven different benchmark methods, and the experimental results are shown in Table 1. These benchmark methods cover models based on the Transformer architecture, such as Transformer [39], Informer [31], Autoformer [32], FEDformer [33], Reformer [34], and Pyraformer [35]; MLP-based models such as LightTS [36], TiDE [37], and TSMixer [38]; and classical statistical models such as ARIMA [40] and Prophet [41], which are still widely used in industry. Including ARIMA and Prophet allows us to assess the performance gap between traditional time-series approaches and modern deep learning methods in the context of multivariate and spatio-temporal energy load forecasting.

As can be seen from Table 1, the TSTG shows a significant performance advantage across all four time spans of the prediction task. As shown in Figure 4, a comparison of the predicted 24 h electrical load demonstrates that the trajectory produced by TSTG is more consistent with the real data. In particular, compared with Autoformer, the TSTG shows substantial improvements. In power load forecasting, the mean absolute error (MAE) and root mean square error (RMSE) are reduced by 44.98% and 38.19%, respectively, and the coefficient of determination ($R^{2}$) increases by 8.77%. Meanwhile, the mean absolute percentage error (MAPE) is reduced by 39.62%, indicating a substantial improvement in relative prediction accuracy. In the cooling load prediction, the MAE and RMSE decreased by 51.20% and 47.94%, respectively, $R^{2}$ increased by 3.45%, and the MAPE dropped by 51.35%, confirming that the TSTG achieved higher stability and lower relative deviation. In the heating load forecast, the MAE and RMSE decreased by 49.19% and 45.97%, respectively, $R^{2}$ increased by 8.09%, and the MAPE decreased by 49.19%, further demonstrating the robustness of the TSTG across different types of loads.

When compared with TiDE, the TSTG also showed notable advantages: the MAE and RMSE decreased by 36.10% and 33.33%, respectively, and R² increased by 6.29% in power load forecasting. The MAPE decreased by 35.29%, reflecting improved proportional accuracy. In the cooling load prediction, the MAE and RMSE decreased by 29.82% and 25.97%, respectively, R² increased by 1.02%, and the MAPE decreased by 29.87%. In the heating load forecast, the MAE and RMSE decreased by 23.58% and 22.53%, respectively, R² increased by 4.21%, and the MAPE decreased by 25.20%. Compared with the classical statistical baselines, the performance gap is even more pronounced: relative to ARIMA, the TSTG achieves average reductions in the MAE, RMSE, and MAPE of over 40% across all loads and horizons, along with a consistent increase in R²; similar trends are observed when compared with Prophet, with the TSTG surpassing it in both absolute error metrics and relative accuracy. These results confirm the superior capability of the TSTG in capturing complex nonlinear dependencies and spatial–temporal interactions in multi-energy load forecasting tasks.
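The four metrics used in this comparison (MAE, RMSE, MAPE, and R²) can be computed as in the following sketch. It uses the standard textbook definitions, with MAPE expressed as a fraction, which appears to match the scale of the values in Table 1. The function name and the toy numbers are illustrative and not taken from the paper; Section 3.2 remains the authoritative definition.

```python
import math


def forecast_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (as a fraction) and R^2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when a true value is zero; this sketch assumes nonzero loads.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Toy example with made-up numbers (not from the paper's dataset).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
metrics = forecast_metrics(y_true, y_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

MAE and RMSE are in the units of the (normalized) load, whereas MAPE and R² are scale-free, which is why the text reports both kinds of metric.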

(1): Computational efficiency analysis

In terms of the computational efficiency, Table 2 reports the training and inference time. As expected for classical statistical models with small parameter spaces, ARIMA and Prophet exhibit the lowest wall-clock time. Among deep learning baselines, LightTS is the fastest, whereas FEDformer and Pyraformer incur higher costs due to their heavier architectures. Our TSTG attains a balanced efficiency profile while consistently outperforming all competitors on the MAE, RMSE, R², and MAPE across horizons. Importantly, ARIMA/Prophet must be re-fitted per series and frequently re-tuned under rolling multi-step evaluation, which scales poorly in multi-load settings. In contrast, the TSTG trains once to jointly model multiple loads and horizons, supports batched GPU inference, and keeps sub-second latency, making it more suitable for real-time large-scale deployment where both accuracy and throughput are required.
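The text does not describe how the wall-clock times in Table 2 were measured. One common way to obtain a per-batch inference latency is sketched below; the helper `mean_latency` and the stand-in model `naive_predict` are hypothetical and only illustrate the measurement pattern (a warm-up call followed by an average over repeated calls).

```python
import time


def mean_latency(predict, batch, repeats=20):
    """Average wall-clock seconds per call of predict(batch)."""
    predict(batch)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(repeats):
        predict(batch)
    return (time.perf_counter() - start) / repeats


# Hypothetical stand-in for a trained forecaster: repeats the last observed value.
def naive_predict(batch):
    return [series[-1] for series in batch]


batch = [[0.2, 0.3, 0.25], [0.5, 0.4, 0.45], [0.1, 0.15, 0.12]]
latency = mean_latency(naive_predict, batch)
print(f"mean latency: {latency:.6f} s per batch")
```

When timing a GPU model, the device would also need to be synchronized before reading the clock; that step is omitted here because the stand-in runs on the CPU.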

(2): Multi-energy load coupling analysis

There is a significant correlation between the power load, cooling load, and heating load. This correlation gives the model a more comprehensive view of load characteristics in multi-energy systems. To explore the influence of these interactions on the prediction accuracy, joint forecasting of the three loads was compared with independent forecasting of each load. Specifically, we conducted two case studies:

Case 1: Ignoring the coupling between the power load, cooling load, and heating load, each type of load is independently predicted. In this case, the model uses weather information and date information as auxiliary variables.

Case 2: Here, we fully consider the interaction of the three load types and implement joint forecasting.

As shown in Table 3, the joint forecasting method in Case 2 performs better than the independent forecasting method in Case 1. Averaged over the four horizons, the MAE and RMSE of the power load are reduced by 7.64% and 1.98%, respectively, and R² is increased by 0.37% compared with Case 1. For the cooling load, the MAE and RMSE decreased by 18.18% and 6.78%, respectively, and R² increased by 0.38%. The MAE and RMSE of the heating load decreased by 15.22% and 3.91%, respectively, and R² increased by 0.23%. These results show that the joint multi-energy load forecasting method can not only capture the internal relationships between different energy loads but also improve the generalization ability of the forecasting model, so that it reflects the actual energy demand pattern more accurately.
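The percentages in this comparison can be reproduced from Table 3. Averaging each metric over the 6 h, 12 h, 24 h, and 96 h horizons before taking the relative change yields the electric-load figures quoted above; the helper names below are hypothetical, and the numbers are the electric-load rows of Table 3.

```python
def mean(values):
    return sum(values) / len(values)


def relative_change(baseline, candidate):
    """Percentage change of candidate relative to baseline, both averaged over horizons."""
    b, c = mean(baseline), mean(candidate)
    return 100.0 * (c - b) / b


# Electric-load rows of Table 3 for the 6 h, 12 h, 24 h and 96 h horizons.
case1 = {"MAE": [0.123, 0.126, 0.143, 0.158],
         "RMSE": [0.160, 0.163, 0.181, 0.202],
         "R2": [0.963, 0.960, 0.951, 0.941]}
case2 = {"MAE": [0.118, 0.120, 0.131, 0.139],
         "RMSE": [0.159, 0.163, 0.178, 0.192],
         "R2": [0.965, 0.963, 0.956, 0.945]}

for name in ("MAE", "RMSE", "R2"):
    print(f"{name}: {relative_change(case1[name], case2[name]):+.2f}%")
# MAE: -7.64%
# RMSE: -1.98%
# R2: +0.37%
```

The same function applied to the cooling and heating columns gives the remaining percentages in the paragraph.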

(3): Auxiliary information analysis

The power load, cooling load, and heating load are significantly correlated with the meteorological conditions. To evaluate the value of meteorological and calendar data for multi-energy load forecasting, we designed and implemented the following four different experimental cases:

Case 1: Only historical data for the electrical load, cooling load, and heating load are used for forecasting, without any external auxiliary information.

Case 2: Based on Case 1, meteorological data are introduced as an additional feature of the forecast model.

Case 3: Based on Case 1, calendar data are introduced as an additional feature so that the model can account for date-related factors.

Case 4: Directly building on Case 1, we introduce both meteorological and calendar data as additional features of the forecast model.
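The four cases differ only in which feature groups are fed to the model, which can be sketched as follows. The sketch reads Case 3 as historical loads plus calendar data, as the later comparison between calendar and meteorological data implies. The group names, the helper `build_inputs`, and the toy values are all hypothetical.

```python
# Hypothetical feature-group names; the dataset's actual column names are not given here.
CASE_FEATURE_GROUPS = {
    1: ("loads",),
    2: ("loads", "weather"),
    3: ("loads", "calendar"),
    4: ("loads", "weather", "calendar"),
}


def build_inputs(groups, case):
    """Concatenate, per time step, the feature vectors of the groups used in a case."""
    selected = [groups[name] for name in CASE_FEATURE_GROUPS[case]]
    return [[value for row in rows for value in row] for rows in zip(*selected)]


# Two time steps of made-up data: three loads, two weather variables, two calendar flags.
groups = {
    "loads": [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]],
    "weather": [[21.0, 0.4], [22.5, 0.5]],
    "calendar": [[1, 0], [1, 0]],
}

for case in (1, 2, 3, 4):
    print(case, len(build_inputs(groups, case)[0]))
```

Keeping the case definitions in one mapping makes it explicit that the model and training procedure stay fixed while only the inputs change.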

Table 4 summarizes the results of the four cases for a prediction length of 96; adding auxiliary information improves the forecasts, especially in Case 4. For the electric load, the MAE and RMSE in Case 2 decreased by 4.58% and 2.96%, respectively, and R² increased by 0.21% compared with Case 1. In Case 3, the MAE and RMSE decreased by 2.61% and 3.94%, respectively, and R² increased by 0.32% compared with Case 1. Notably, Case 3 attains a lower RMSE and a higher R² than Case 2, which suggests that the cyclical features contained in calendar data (such as working days, holidays, and seasonal changes) may have more complex and potentially regular effects on multi-energy loads than meteorological data. This is consistent with the design of the TSTG: the dynamic adaptive graph convolution module adjusts the node connection weights according to real-time load characteristics, so it can capture the spatial dynamics driven by calendar data more flexibly, while the multi-head spatio-temporal attention module extracts the long-term periodic patterns in calendar data by modeling the dependencies across time spans and feature dimensions in parallel.

When meteorological and calendar data are introduced together (Case 4), the MAE and RMSE decreased by 9.15% and 5.42%, respectively, and R² increased by 0.21% compared with Case 1. This suggests that, through its end-to-end spatio-temporal joint optimization framework, the TSTG can integrate the physical topological features extracted by the encoder (such as the energy network structure) with the calendar and meteorological features in the decoder, which supports the advantages of the TSTG method in dealing with complex spatio-temporal dependencies and dynamic variable interactions.

3.5. Ablation Experiment

In order to evaluate the effectiveness of the dynamic adaptive graph convolution module, it is replaced by static graph convolution (-StaticGCN) and traditional graph attention network (-GAT), respectively, in the TSTG model design, and the number of graph convolution layers and parameter configurations are kept the same. The experimental results when the prediction length is 96 are shown in Table 5. The results show that dynamic adaptive graph convolution is significantly superior to static GCN and GAT in power load and heating load prediction. Although the performance of static GCN in cooling load prediction is close to that of dynamic adaptive graph convolution, it is unable to adjust the node weights according to real-time load characteristics, resulting in a high volatility of prediction errors in complex scenarios (such as extreme weather). In addition, the computational complexity of GAT is higher than that of dynamic adaptive graph convolution, which further validates the efficiency advantage of the proposed method.
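The difference between a static and a dynamic adjacency matrix can be illustrated with a small sketch that blends a fixed physical topology with a similarity matrix recomputed from the current node features. The convex combination with weight `alpha`, the use of cosine similarity, and the row normalization are assumptions made for illustration only; the paper's own construction is the one given in Section 2.4, and all function names here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def row_normalize(matrix):
    """Scale each row to sum to 1 so that edge weights are comparable across nodes."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([w / total if total > 0 else 0.0 for w in row])
    return result


def adaptive_adjacency(a_phys, node_features, alpha=0.5):
    """Blend a fixed topology with similarity computed from the current node features."""
    n = len(a_phys)
    # Negative similarities are clipped to zero so that all edge weights stay non-negative.
    sim = [[max(cosine_similarity(node_features[i], node_features[j]), 0.0)
            for j in range(n)] for i in range(n)]
    fused = [[alpha * a_phys[i][j] + (1 - alpha) * sim[i][j] for j in range(n)]
             for i in range(n)]
    return row_normalize(fused)


# Made-up example: three nodes, with no physical link between nodes 0 and 2.
a_phys = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
features_now = [[0.9, 0.8, 0.7], [0.85, 0.8, 0.75], [0.1, 0.9, 0.2]]
a_dyn = adaptive_adjacency(a_phys, features_now)
for row in a_dyn:
    print([round(w, 3) for w in row])
```

In this toy example, nodes 0 and 2 have no physical link, yet they receive a nonzero weight because their current features are similar. A static GCN with `a_phys` alone would keep that weight at zero regardless of the load state.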

To further analyze the importance of the multi-head spatio-temporal attention module to the TSTG, two sets of comparative experiments were designed. First, the multi-head spatio-temporal attention module was removed (-WO-MSA), and only the dynamic adaptive graph convolution module was retained for spatio-temporal feature extraction. Second, the multi-head spatio-temporal attention module (MSA) was replaced with the traditional self-attention module (-SA), so that the standard Transformer self-attention mechanism handles the spatio-temporal dependencies. The experimental results are shown in Table 5. The TSTG is superior to -WO-MSA and -SA in the MAE, RMSE, and R². Specifically, when the multi-head spatio-temporal attention module is removed, the model’s ability to capture long-term dependencies in the time dimension decreases (the electric-load MAE increases by 5.04%); when it is replaced with the traditional self-attention module, the model’s representation of nonlinear interactions in the feature dimension is limited by the lack of multi-subspace parallel modeling (the electric-load RMSE increases by 2.60%).
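The idea of attending along the time axis and along the feature axis in parallel can be illustrated with a toy, single-head sketch. It uses plain scaled dot-product self-attention without learned projections or the mutual-information term, and it fuses the two views by simple averaging. These are simplifying assumptions rather than the paper's module, and the function names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(x):
    return [list(col) for col in zip(*x)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def self_attention(x):
    """Scaled dot-product self-attention over the rows of x, without learned projections."""
    scale = math.sqrt(len(x[0]))
    scores = [[s / scale for s in row] for row in matmul(x, transpose(x))]
    weights = [softmax(row) for row in scores]
    return matmul(weights, x)


def temporal_feature_attention(x):
    """Attend over time steps and over features separately, then average the two views."""
    temporal = self_attention(x)                       # uses a T x T weight matrix
    feature = transpose(self_attention(transpose(x)))  # uses a D x D weight matrix
    return [[0.5 * (t + f) for t, f in zip(t_row, f_row)]
            for t_row, f_row in zip(temporal, feature)]


# Made-up input with T = 4 time steps and D = 3 features.
x = [[0.2, 0.5, 0.1], [0.3, 0.4, 0.2], [0.6, 0.1, 0.3], [0.5, 0.2, 0.4]]
out = temporal_feature_attention(x)
print(len(out), len(out[0]))  # same T x D shape as the input
```

A standard self-attention layer corresponds to the `temporal` branch only, so it has no separate weight matrix describing how features interact; that branch is what the -SA variant would retain under this simplified reading.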

As can be seen from Table 5, the full TSTG, which combines the multi-head spatio-temporal attention module and the dynamic adaptive graph convolution module, performs better than the variants that add only one of the two modules (-MSA and -DynamicGCN). For the electric load, the MAE and RMSE of the TSTG are 2.11% and 1.54% lower than those of -DynamicGCN, respectively, and R² is 0.32% higher. Compared with -MSA, the MAE and RMSE of the TSTG decreased by 1.42% and 0.52%, respectively, and R² increased by 0.21%. This shows that the synergy between the dynamic adaptive graph convolution module and the multi-head spatio-temporal attention module improves the overall performance of the model.

4. Discussion

Although the proposed TSTG model demonstrates superior performance in short-term multi-energy load forecasting across diverse scenarios, several key aspects merit further discussion to assess its robustness, interpretability, and deployment potential.

4.1. Modeling Capacity and Generalization

The TSTG effectively captures long-range temporal dependencies and nonlinear inter-energy interactions through its dual-attention mechanism and dynamic graph structure. This is particularly evident in our ablation studies (see Table 5), where the removal of either the mutual-information-enhanced attention or dynamic graph module leads to a noticeable drop in performance, confirming their synergistic contribution.

However, the model assumes that historical spatio-temporal patterns are sufficiently representative of future trends. In real-world systems, this assumption may break down in the presence of abrupt external disturbances—e.g., extreme weather, outages, or policy shifts. While the dynamic graph offers some adaptivity by adjusting edge weights based on feature similarity, its ability to generalize under distribution shifts remains limited without explicit domain adaptation mechanisms.

4.2. Methodological Considerations

From a methodological standpoint, the TSTG strikes a balance between accuracy and architectural complexity. Unlike static GCN-based models, it dynamically reconstructs the spatial graph at each time step using a mutual information estimator. This allows it to reflect real-time interactions among heterogeneous energy nodes (e.g., electric, heating, cooling loads), which improves the model’s physical realism.
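To make the notion of a mutual information estimator concrete, the sketch below implements a simple histogram (plug-in) estimate between two load series. The paper does not specify its estimator in this section, so the binning scheme, the function name, and the toy series are assumptions chosen for illustration.

```python
import math
from collections import Counter


def mutual_information(x, y, bins=4):
    """Plug-in estimate of I(X; Y) in nats from equal-width histograms."""
    def discretize(values):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # constant series fall into a single bin
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        p_ij = count / n
        mi += p_ij * math.log(p_ij / ((px[i] / n) * (py[j] / n)))
    return mi


# Made-up normalized load series (not from the paper's dataset).
electric = [0.42, 0.45, 0.51, 0.60, 0.72, 0.80, 0.77, 0.65]
cooling = [0.20, 0.22, 0.30, 0.41, 0.55, 0.63, 0.58, 0.44]
print(round(mutual_information(electric, cooling), 3))
```

With only a handful of samples per bin, a plug-in estimate of this kind tends to be biased upward, so short windows call for fewer bins or a more robust estimator.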

However, this adaptivity comes with increased computational overhead. The time complexity of the attention module is $O(T^{2} + D^{2})$ per layer due to temporal–feature parallel attention, and the graph similarity computation adds $O(N^{2})$ per update. In large-scale systems, this may limit real-time applicability unless further approximations (e.g., sparsification, low-rank projections) are introduced. Moreover, the use of mutual information, although effective in enhancing correlation modeling, may be sensitive to estimation bias if the input distribution is noisy or insufficiently sampled.
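The growth rates quoted above can be made concrete with a few counts. The sketch below tallies the score-matrix entries of one attention layer and the node pairs of a dense graph update, and compares the latter with the number of edges kept under a top-k sparsification. The function names, the choice of k = 8, and the use of D = 64 are illustrative assumptions, not settings reported by the paper.

```python
def attention_score_entries(t_len, d_feat):
    """Entries in the temporal (T x T) and feature (D x D) score matrices of one layer."""
    return t_len ** 2 + d_feat ** 2


def dense_graph_pairs(n_nodes):
    """Pairwise similarities evaluated when the full graph is rebuilt."""
    return n_nodes ** 2


def topk_graph_edges(n_nodes, k):
    """Edges kept if each node retains only its k strongest neighbours."""
    return n_nodes * min(k, n_nodes)


print(attention_score_entries(24, 64))  # T = 24, D = 64 taken purely for illustration
for n in (3, 100, 10_000):
    print(n, dense_graph_pairs(n), topk_graph_edges(n, k=8))
```

For the three-node system studied here the quadratic term is negligible, but at ten thousand nodes the dense update involves 10^8 pairs against 8 × 10^4 retained edges. Top-k pruning by itself reduces the cost of the subsequent graph convolution; avoiding the full pairwise similarity computation would additionally require an approximate neighbour search.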

4.3. Future Research Directions

Future work will explore lightweight architectures for real-time deployment, incorporate uncertainty quantification (e.g., via Bayesian networks or ensemble learning), and integrate additional external variables such as real-time pricing, policy adjustments, or user-side flexibility. Furthermore, domain adaptation or transfer learning may help extend the model’s generalization to unseen regions or energy systems with limited data.

5. Conclusions

This paper proposes a short-term multi-energy load forecasting method based on a Transformer Spatio-Temporal Graph neural network (TSTG), which effectively integrates multi-head spatio-temporal attention and dynamic adaptive graph convolution to jointly model temporal dependencies, feature interactions, and spatial correlations. The experimental results show that the proposed model significantly outperforms the mainstream baselines in terms of the prediction accuracy. For instance, compared with Autoformer, TSTG reduces the MAE and RMSE of electric load forecasting by 44.96% and 41.80%, respectively, and achieves similar improvements for cooling and heating load forecasting. The model consistently obtains R² values above 0.95 across all energy types, demonstrating strong fitting capability and generalization performance. Furthermore, ablation studies confirm that each core component contributes to the accuracy improvements. These results validate the effectiveness and practicality of the proposed method in complex multi-energy systems without relying on speculative assumptions.

Author Contributions

Conceptualization, H.Z. and Q.A.; methodology, H.Z. and Q.A.; software, H.Z.; validation, H.Z. and R.L.; formal analysis, H.Z.; investigation, H.Z.; resources, Q.A.; data curation, R.L.; writing—original draft preparation, H.Z.; writing—review and editing, Q.A. and R.L.; visualization, H.Z.; supervision, Q.A.; project administration, Q.A.; funding acquisition, Q.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific Research Innovation Project Fund of Hubei Minzu University (project: Research on the Long-term and Short-Term Prediction Method of Multi-Load Based on Spatio-temporal Dependence and Nonlinear Relationship Modeling; grant number MYK2025034).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Ruiting Li was employed by the Hubei Xuan’en Power Supply Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

IES	Integrated Energy Systems
RNN	Recurrent Neural Networks
GRU	Gated Recurrent Unit
LSTM	Long Short-Term Memory
TSTG	Transformer Spatio-Temporal Graph neural network
RMSE	Root Mean Square Error
MAE	Mean Absolute Error
MSE	Mean Squared Error

References

1. Fan, J.; Meng, Z.; Shan, M.; Lv, L. A review of key technologies and international standardization of multi energy coupling systems. Power Syst. Clean Energy 2023, 39, 1–9.
2. Wu, J.; Yan, J.; Jia, H.; Hatziargyriou, N.; Djilali, N.; Sun, H. Integrated Energy Systems. Appl. Energy 2016, 167, 155–157.
3. Fan, J.; Zhuang, W.; Xia, M.; Fang, W.; Liu, J. Optimizing attention in a transformer for multihorizon, multienergy load forecasting in integrated energy systems. IEEE Trans. Ind. Inform. 2024, 20, 10238–10248.
4. Wang, S.; Wang, S.; Chen, H.; Gu, Q. Multi-energy load forecasting for regional integrated energy systems considering temporal dynamic and coupling characteristics. Energy 2020, 195, 116964.
5. Lv, G.; Cao, B.; Dexiang, J.; Wang, N.; Li, J.; Chen, G. Optimal scheduling of regional integrated energy system considering integrated demand response. CSEE J. Power Energy Syst. 2021, 10, 1208–1219.
6. Wu, H.; Xu, Z. Multi-energy load forecasting in integrated energy systems: A spatial-temporal adaptive personalized federated learning approach. IEEE Trans. Ind. Inform. 2024, 20, 12262–12274.
7. Fallah, S.N.; Ganjkhani, M.; Shamshirband, S.; Chau, K.w. Computational intelligence on short-term load forecasting: A methodological overview. Energies 2019, 12, 393.
8. Jaramillo, M.; Carrión, D. An Adaptive Strategy for Medium-Term Electricity Consumption Forecasting for Highly Unpredictable Scenarios: Case Study Quito, Ecuador during the Two First Years of COVID-19. Energies 2022, 15, 8380.
9. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271.
10. Zhao, J.; Cheng, P.; Hou, J.; Fan, T.; Han, L. Short-term load forecasting of multi-scale recurrent neural networks based on residual structure. Concurr. Comput. Pract. Exp. 2023, 35, e7551.
11. Kim, T.G.; Yoon, S.G.; Song, K.B. Very Short-Term Load Forecasting Model for Large Power System Using GRU-Attention Algorithm. Energies 2025, 18, 3229.
12. Dakheel, F.; Çevik, M. Optimizing Smart Grid Load Forecasting via a Hybrid Long Short-Term Memory-XGBoost Framework: Enhancing Accuracy, Robustness, and Energy Management. Energies 2025, 18, 2842.
13. Jaramillo, M.; Pavón, W.; Jaramillo, L. Adaptive forecasting in energy consumption: A bibliometric analysis and review. Data 2024, 9, 13.
14. Chen, Z.; Liu, J.; Li, C.; Ji, X.; Li, D.; Huang, Y.; Di, F.; Gao, X.; Xu, L. Short-term power load forecasting based on combined LSTM-XGBoost model. Power Syst. Technol. 2020, 44, 614–620.
15. Wu, L.; Kong, C.; Hao, X.; Chen, W. A short-term load forecasting method based on GRU-CNN hybrid neural network model. Math. Probl. Eng. 2020, 2020, 1428104.
16. Chen, W.; Rong, F.; Lin, C. A multi-energy loads forecasting model based on dual attention mechanism and multi-scale hierarchical residual network with gated recurrent unit. Energy 2025, 320, 134975.
17. Zhuang, W.; Xi, Q.; Lu, C.; Liu, R.; Qiu, S.; Xia, M. A novel trend and periodic characteristics enhanced decoupling framework for multi-energy load prediction of regional integrated energy systems. Electr. Power Syst. Res. 2024, 237, 111028.
18. Pentsos, V.; Tragoudas, S.; Wibbenmeyer, J.; Khdeer, N. A hybrid LSTM-Transformer model for power load forecasting. IEEE Trans. Smart Grid 2025, 16, 2624–2634.
19. Tieyan, Z.; Hening, L.; Qian, H.; Xuan, K.; Shengyu, G.; Xiaochen, Y.; Huan, H. Integrated load forecasting model of multi-energy system based on Markov chain improved neural network. In Proceedings of the 2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), Qiqihar Shi, China, 28–29 April 2019; pp. 454–457.
20. Tan, Z.; De, G.; Li, M.; Lin, H.; Yang, S.; Huang, L.; Tan, Q. Combined electricity-heat-cooling-gas load forecasting model for integrated energy system based on multi-task learning and least square support vector machine. J. Clean. Prod. 2020, 248, 119252.
21. Jing, O.; Lu, Y.; Kang, Y.; Zhao, Y.; Pan, G. Short term load forecasting of integrated energy system based on ALIF-LSTM multi task learning. J. Sol. Energy 2022, 43, 499–507.
22. Wang, C.; Wang, Y.; Ding, Z.; Zheng, T.; Hu, J.; Zhang, K. A transformer-based method of multienergy load forecasting in integrated energy system. IEEE Trans. Smart Grid 2022, 13, 2703–2714.
23. Wu, K.; Gu, J.; Meng, L.; Wen, H.; Ma, J. An explainable framework for load forecasting of a regional integrated energy system based on coupled features and multi-task learning. Prot. Control Mod. Power Syst. 2022, 7, 1–14.
24. Huang, S.; Song, H.; Jiang, T.; Telikani, A.; Shen, J.; Zhou, Q.; Yong, B.; Wu, Q. DST-GTN: Dynamic Spatio-Temporal Graph Transformer Network for Traffic Forecasting. arXiv 2024, arXiv:2404.11996.
25. Sharma, K.; Lee, Y.C.; Nambi, S.; Salian, A.; Shah, S.; Kim, S.W.; Kumar, S. A survey of graph neural networks for social recommender systems. ACM Comput. Surv. 2024, 56, 1–34.
26. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. Timesnet: Temporal 2d-variation modeling for general time series analysis. arXiv 2022, arXiv:2210.02186.
27. Cui, Y.; Li, Z.; Wang, Y.; Dong, D.; Gu, C.; Lou, X.; Zhang, P. Informer model with season-aware block for efficient long-term power time series forecasting. Comput. Electr. Eng. 2024, 119, 109492.
28. Kim, H.J.; Kim, D.; Tak, H.; Lee, J.Y. Global-local attention-enabled multiple decoder Transformer for multi-energy load forecasting in user-level integrated energy system. Appl. Energy 2025, 396, 126255.
29. Huang, L.; Mao, F.; Zhang, K.; Li, Z. Spatial-temporal convolutional transformer network for multivariate time series forecasting. Sensors 2022, 22, 841.
30. Yuan, C.; Zhao, K.; Kuruoglu, E.E.; Wang, L.; Xu, T.; Huang, W.; Zhao, D.; Cheng, H.; Rong, Y. A survey of graph transformers: Architectures, theories and applications. arXiv 2025, arXiv:2502.16533.
31. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 11106–11115.
32. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430.
33. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286.
34. Kitaev, N.; Kaiser, Ł.; Levskaya, A. Reformer: The efficient transformer. arXiv 2020, arXiv:2001.04451.
35. Liu, S.; Yu, H.; Liao, C.; Li, J.; Lin, W.; Liu, A.X.; Dustdar, S. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In Proceedings of the International Conference on Learning Representations, Online, 25 April 2022.
36. Campos, D.; Zhang, M.; Yang, B.; Kieu, T.; Guo, C.; Jensen, C.S. Lightts: Lightweight time series classification with adaptive ensemble distillation. Proc. ACM Manag. Data 2023, 1, 1–27.
37. Das, A.; Kong, W.; Leach, A.; Mathur, S.; Sen, R.; Yu, R. Long-term forecasting with tide: Time-series dense encoder. arXiv 2023, arXiv:2304.08424.
38. Ekambaram, V.; Jati, A.; Nguyen, N.; Sinthong, P.; Kalagnanam, J. Tsmixer: Lightweight mlp-mixer model for multivariate time series forecasting. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Long Beach, CA, USA, 6–10 August 2023; pp. 459–469.
39. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. Available online: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html (accessed on 19 August 2025).
40. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015.
41. Taylor, S.J.; Letham, B. Forecasting at scale. Am. Stat. 2018, 72, 37–45.

Figure 1. Overall framework of the proposed Transformer Spatio-Temporal Graph neural network (TSTG) for short-term multi-energy load forecasting. The model follows an encoder–decoder architecture, where each encoder layer integrates a multi-head spatio-temporal attention module and a dynamic adaptive graph convolution module. The attention module captures long-range dependencies in both temporal and feature dimensions by combining scaled dot-product attention with mutual information. The graph module dynamically updates the spatial adjacency matrix by fusing physical topology and feature similarity.

Figure 2. The structure of the multi-head spatio-temporal attention module. It performs dual-channel attention along temporal and feature dimensions in parallel, enhanced by mutual information to capture nonlinear dependencies. Multiple attention heads extract diverse spatio-temporal patterns and are fused to form the final representation.

Figure 3. Structure of the dynamic adaptive graph convolution module, which fuses physical topology and feature similarity to model dynamic spatial dependencies.

Figure 4. The comparison of the 24 h electric load forecast.

Table 1. The accuracy of multi-energy load forecasting using different methods.

Method	Horizon (h)	Electric Load				Cooling Load				Heating Load
Method	Horizon (h)	MAE	RMSE	MAPE	$R^{2}$	MAE	RMSE	MAPE	$R^{2}$	MAE	RMSE	MAPE	$R^{2}$
FEDformer	6	0.210	0.269	0.105	0.895	0.166	0.217	0.111	0.958	0.175	0.251	0.097	0.907
	12	0.180	0.231	0.090	0.922	0.121	0.161	0.081	0.977	0.161	0.217	0.089	0.930
	24	0.223	0.284	0.112	0.882	0.173	0.223	0.115	0.955	0.186	0.263	0.103	0.896
	96	0.240	0.306	0.120	0.859	0.186	0.239	0.124	0.947	0.205	0.239	0.114	0.947
LightTS	6	0.224	0.288	0.112	0.879	0.121	0.164	0.081	0.976	0.142	0.206	0.079	0.938
	12	0.263	0.335	0.132	0.837	0.142	0.190	0.095	0.967	0.156	0.229	0.087	0.923
	24	0.232	0.295	0.116	0.873	0.125	0.168	0.083	0.975	0.136	0.197	0.109	0.942
	96	0.247	0.318	0.124	0.848	0.141	0.190	0.094	0.967	0.143	0.213	0.079	0.922
Pyraformer	6	0.201	0.258	0.100	0.903	0.093	0.127	0.062	0.985	0.162	0.200	0.090	0.941
	12	0.231	0.292	0.116	0.875	0.111	0.149	0.074	0.980	0.145	0.187	0.081	0.949
	24	0.234	0.297	0.117	0.871	0.111	0.147	0.074	0.980	0.154	0.196	0.086	0.942
	96	0.304	0.379	0.152	0.785	0.145	0.196	0.097	0.965	0.168	0.221	0.093	0.916
TiDE	6	0.215	0.280	0.108	0.886	0.117	0.161	0.078	0.977	0.129	0.190	0.072	0.947
	12	0.205	0.265	0.102	0.897	0.111	0.152	0.074	0.979	0.124	0.183	0.069	0.951
	24	0.205	0.267	0.102	0.895	0.112	0.154	0.077	0.979	0.123	0.182	0.068	0.950
	96	0.210	0.272	0.105	0.889	0.116	0.157	0.077	0.977	0.124	0.182	0.069	0.943
Autoformer	6	0.213	0.277	0.106	0.889	0.156	0.204	0.104	0.962	0.173	0.249	0.096	0.908
	12	0.203	0.262	0.102	0.900	0.161	0.211	0.107	0.960	0.181	0.255	0.101	0.904
	24	0.221	0.288	0.110	0.879	0.166	0.219	0.111	0.956	0.185	0.261	0.103	0.897
	96	0.286	0.362	0.143	0.804	0.205	0.261	0.137	0.937	0.198	0.281	0.110	0.863
ARIMA	6	0.149	0.216	0.093	0.880	0.089	0.134	0.081	0.960	0.111	0.172	0.085	0.920
	12	0.150	0.218	0.094	0.880	0.091	0.137	0.083	0.960	0.113	0.175	0.087	0.915
	24	0.164	0.238	0.103	0.870	0.101	0.152	0.092	0.950	0.118	0.183	0.091	0.905
	96	0.174	0.252	0.109	0.850	0.093	0.140	0.084	0.955	0.114	0.177	0.088	0.900
Prophet	6	0.133	0.184	0.083	0.920	0.080	0.116	0.073	0.985	0.099	0.149	0.076	0.950
	12	0.134	0.185	0.084	0.920	0.082	0.119	0.074	0.985	0.101	0.152	0.078	0.947
	24	0.147	0.203	0.092	0.900	0.091	0.132	0.083	0.987	0.105	0.158	0.081	0.940
	96	0.156	0.215	0.098	0.890	0.083	0.120	0.075	0.990	0.102	0.153	0.079	0.935
TSTG	6	0.118	0.159	0.059	0.965	0.069	0.101	0.046	0.993	0.087	0.130	0.048	0.978
	12	0.120	0.163	0.060	0.963	0.073	0.106	0.049	0.991	0.090	0.131	0.050	0.975
	24	0.131	0.178	0.066	0.956	0.081	0.114	0.054	0.989	0.094	0.141	0.052	0.972
	96	0.139	0.192	0.070	0.945	0.074	0.105	0.049	0.992	0.091	0.138	0.051	0.969

Table 2. Training and inference time comparison using different methods.

Method	Training Time (s)	Inference Time (s)
FEDformer	3600	0.95
LightTS	1800	0.65
Pyraformer	3200	0.85
TiDE	2400	0.70
Autoformer	3000	0.88
ARIMA	900	0.50
Prophet	1200	0.55
TSTG	2700	0.72

Table 3. The comparison between combined load forecasting and individual forecasting results.

Case	Horizon (h)	Electric Load			Cooling Load			Heating Load
Case	Horizon (h)	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$
Case 1	6	0.123	0.160	0.963	0.084	0.106	0.989	0.097	0.132	0.975
	12	0.126	0.163	0.960	0.089	0.110	0.988	0.103	0.139	0.973
	24	0.143	0.181	0.951	0.096	0.122	0.986	0.113	0.146	0.968
	96	0.158	0.202	0.941	0.094	0.119	0.987	0.114	0.145	0.969
Case 2	6	0.118	0.159	0.965	0.069	0.101	0.993	0.087	0.130	0.978
	12	0.120	0.163	0.963	0.073	0.106	0.991	0.090	0.131	0.975
	24	0.131	0.178	0.956	0.081	0.114	0.989	0.094	0.141	0.972
	96	0.139	0.192	0.945	0.074	0.105	0.992	0.091	0.138	0.969

Table 4. The comparison with different auxiliary information.

Case	Electric Load			Cooling Load			Heating Load
Case	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$
Case 1	0.153	0.203	0.943	0.093	0.115	0.986	0.110	0.146	0.967
Case 2	0.146	0.197	0.945	0.087	0.111	0.990	0.099	0.139	0.970
Case 3	0.149	0.195	0.946	0.084	0.108	0.993	0.096	0.137	0.972
Case 4	0.139	0.192	0.945	0.074	0.105	0.992	0.091	0.138	0.969

Table 5. Ablation experiments for each module.

Method	Electric Load			Cooling Load
Method	MAE	RMSE	$R^{2}$	MAE	RMSE	$R^{2}$
-StaticGCN	0.152	0.208	0.932	0.081	0.112	0.987
-GAT	0.148	0.201	0.938	0.078	0.108	0.990
-DynamicGCN	0.142	0.195	0.942	0.075	0.106	0.991
-WO-MSA	0.146	0.199	0.940	0.077	0.109	0.988
-SA	0.144	0.197	0.941	0.076	0.107	0.989
-MSA	0.141	0.193	0.943	0.075	0.106	0.991
TSTG	0.139	0.192	0.945	0.074	0.105	0.992

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

