3.1. Dataset
The dataset used in this study is derived from the Campus Metabolism Project at Arizona State University (ASU), Tempe campus—an initiative focused on the real-time monitoring and optimization of energy and resource usage to support campus sustainability. The dataset covers a continuous time span from 1 January 2017 to 28 January 2019, totaling 1095 days of recorded data. It includes hourly measurements of electricity consumption (in kilowatts), cooling load (in ton-hours), and heating demand (in BTU/hour) across multiple campus buildings and facilities.
In addition to multi-energy load data, the dataset also contains weather information (e.g., temperature, humidity, wind speed) and campus activity data (e.g., class schedules, occupancy patterns), enabling analysis of external factors influencing energy usage. All data were collected via a network of smart meters, water meters, environmental sensors, and infrastructure management systems. The dataset represents a geographically localized yet functionally diverse energy system, making it suitable for testing models targeting integrated multi-energy forecasting in real-world scenarios.
Although the dataset is not publicly hosted in a standardized repository, it is available for research purposes through the Campus Metabolism platform and can be accessed upon request.
3.4. Contrast Experiment
In order to comprehensively evaluate the performance of the TSTG model in 6 h, 12 h, 24 h, and 96 h continuous time prediction tasks, the TSTG was compared with eleven different benchmark methods, and the experimental results are shown in Table 1. These benchmark methods cover models based on the Transformer architecture, such as Transformer [39], Informer [31], Autoformer [32], FEDformer [33], Reformer [34], and Pyraformer [35]; MLP-based models such as LightTS [36], TiDE [37], and TSMixer [38]; and classical statistical models such as ARIMA [40] and Prophet [41], which are still widely used in industry. Including ARIMA and Prophet allows us to assess the performance gap between traditional time-series approaches and modern deep learning methods in the context of multivariate and spatio-temporal energy load forecasting.
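The four prediction lengths can be illustrated with a sliding-window split of an hourly series. The following is a minimal sketch under stated assumptions: the helper `make_windows`, the input length `LOOKBACK = 96`, and the synthetic series are hypothetical and are not taken from the experimental setup.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[float], lookback: int, horizon: int
) -> List[Tuple[List[float], List[float]]]:
    """Split an hourly series into (history, target) pairs.

    Each pair holds `lookback` past values and the `horizon` values
    that immediately follow them.
    """
    if lookback <= 0 or horizon <= 0:
        raise ValueError("lookback and horizon must be positive")
    samples = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        split = start + lookback
        history = list(series[start:split])
        target = list(series[split:split + horizon])
        samples.append((history, target))
    return samples


# Hypothetical hourly load values; the real data are not reproduced here.
hourly_load = [float(v) for v in range(200)]
LOOKBACK = 96  # assumed input length, not stated in the text

# One set of training samples per prediction length used in the comparison.
windows_per_horizon = {
    h: make_windows(hourly_load, LOOKBACK, h) for h in (6, 12, 24, 96)
}
```

Longer horizons leave fewer complete windows for the same series length, which is one reason the 96 h task is typically the hardest to fit.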
As can be seen from Table 1, the TSTG shows a significant performance advantage across all four time spans of the prediction task. As shown in Figure 4, a comparison of the predicted 24 h electrical load demonstrates that the trajectory produced by TSTG is more consistent with the real data. In particular, compared with Autoformer, the TSTG shows substantial improvements. In power load forecasting, the mean absolute error (MAE) and root mean square error (RMSE) are reduced by 44.98% and 38.19%, respectively, and the coefficient of determination (R²) increases by 8.77%. Meanwhile, the mean absolute percentage error (MAPE) is reduced by 39.62%, indicating a substantial improvement in relative prediction accuracy. In the cooling load prediction, the MAE and RMSE decreased by 51.20% and 47.94%, respectively, R² increased by 3.45%, and the MAPE dropped by 51.35%, confirming that the TSTG achieved higher stability and lower relative deviation. In the heating load forecast, the MAE and RMSE decreased by 49.19% and 45.97%, respectively, R² increased by 8.09%, and the MAPE decreased by 49.19%, further demonstrating the robustness of the TSTG across different types of loads.
When compared with TiDE, the TSTG also showed notable advantages: the MAE and RMSE decreased by 36.10% and 33.33%, respectively, and R² increased by 6.29% in power load forecasting. The MAPE decreased by 35.29%, reflecting improved proportional accuracy. In the cooling load prediction, the MAE and RMSE decreased by 29.82% and 25.97%, respectively, R² increased by 1.02%, and the MAPE decreased by 29.87%. In the heating load forecast, the MAE and RMSE decreased by 23.58% and 22.53%, respectively, R² increased by 4.21%, and the MAPE decreased by 25.20%. Compared with the classical statistical baselines, the performance gap is even more pronounced: relative to ARIMA, the TSTG achieves average reductions in the MAE, RMSE, and MAPE of over 40% across all loads and horizons, along with a consistent increase in R²; similar trends are observed when compared with Prophet, with the TSTG surpassing it in both absolute error metrics and relative accuracy. These results confirm the superior capability of the TSTG in capturing complex nonlinear dependencies and spatial–temporal interactions in multi-energy load forecasting tasks.
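For reference, the four metrics and the percentage changes quoted above can be computed as in the sketch below. It assumes the reported percentages are relative changes with respect to the baseline model; the function names and the toy values are illustrative and do not come from Table 1.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error in percent; y_true must be non-zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Toy values for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
baseline_pred = [110.0, 180.0, 330.0, 380.0]
proposed_pred = [105.0, 190.0, 315.0, 390.0]

mae_drop = relative_reduction(mae(y_true, baseline_pred), mae(y_true, proposed_pred))
# A relative increase in R² is the negated relative reduction.
r2_gain = -relative_reduction(r2(y_true, baseline_pred), r2(y_true, proposed_pred))
```

Because R² is usually already close to 1, its relative gains are much smaller than the error reductions, which matches the pattern in the figures reported above.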
(1) Computational efficiency analysis
In terms of the computational efficiency, Table 2 reports the training and inference time. As expected for classical statistical models with small parameter spaces, ARIMA and Prophet exhibit the lowest wall-clock time. Among deep learning baselines, LightTS is the fastest, whereas FEDformer and Pyraformer incur higher costs due to their heavier architectures. Our TSTG attains a balanced efficiency profile while consistently outperforming all competitors on the MAE, RMSE, R², and MAPE across horizons. Importantly, ARIMA/Prophet must be re-fitted per series and frequently re-tuned under rolling multi-step evaluation, which scales poorly in multi-load settings. In contrast, the TSTG trains once to jointly model multiple loads and horizons, supports batched GPU inference, and keeps sub-second latency, making it more suitable for real-time large-scale deployment where both accuracy and throughput are required.
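Training and inference wall-clock time of the kind reported here can be measured as in the following sketch. The helper `timed` and the hour-of-day mean model are hypothetical stand-ins chosen to keep the example self-contained; they are not any of the models compared in Table 2.

```python
import time
from typing import Any, Callable, List, Sequence, Tuple


def timed(fn: Callable[..., Any], *args: Any, repeats: int = 1) -> Tuple[Any, float]:
    """Call fn(*args) `repeats` times; return the last result and mean seconds per call."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    elapsed = (time.perf_counter() - start) / repeats
    return result, elapsed


def fit_hourly_mean(series: Sequence[float]) -> List[float]:
    """Stand-in 'training' step: average load for each hour of the day."""
    sums, counts = [0.0] * 24, [0] * 24
    for i, value in enumerate(series):
        sums[i % 24] += value
        counts[i % 24] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def predict_hourly_mean(profile: Sequence[float], start_hour: int, horizon: int) -> List[float]:
    """Stand-in 'inference' step: repeat the daily profile over the horizon."""
    return [profile[(start_hour + k) % 24] for k in range(horizon)]


# Thirty synthetic days of hourly data (illustrative only).
series = [float(i % 24) for i in range(24 * 30)]

profile, train_seconds = timed(fit_hourly_mean, series)
forecast, infer_seconds = timed(
    predict_hourly_mean, profile, len(series) % 24, 96, repeats=100
)
```

Averaging inference over many repeats reduces timer noise for sub-second calls. For GPU models the device would also need to be synchronized before reading the clock, which this sketch does not cover.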
(2) Multi-energy load coupling analysis
There is a significant correlation between the power load, cooling load, and heating load. This correlation gives the model a more comprehensive view of the load characteristics in multi-energy systems. To explore the influence of these interactions on the prediction accuracy, joint forecasting and independent forecasting were compared. Specifically, we conducted two case studies:
Case 1: Ignoring the coupling between the power load, cooling load, and heating load, each type of load is independently predicted. In this case, the model uses weather information and date information as auxiliary variables.
Case 2: Here, we fully consider the interaction of the three load types and implement joint forecasting.
As shown in Table 3, the joint forecasting method in Case 2 outperforms the independent forecasting method in Case 1 on every metric. Compared with Case 1, the MAE and RMSE of the power load are reduced by 7.64% and 1.98%, respectively, and R² is increased by 0.37%. For the cooling load, the MAE and RMSE decreased by 18.18% and 6.78%, respectively, and R² increased by 0.38%. The MAE and RMSE of the heating load decreased by 15.22% and 3.91%, respectively, and R² increased by 0.23%. These results indicate that joint multi-energy load forecasting can capture the internal relationship between different energy loads and improve the generalization ability of the forecasting model, so that it reflects the actual energy demand more accurately.
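The correlation between the three loads noted at the start of this analysis can be quantified with the Pearson coefficient. The sketch below uses short, made-up load series purely for illustration; the values and the resulting coefficients say nothing about the actual dataset.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly readings for one afternoon (not real measurements).
loads = {
    "power": [520.0, 540.0, 610.0, 680.0, 700.0, 650.0],
    "cooling": [300.0, 320.0, 400.0, 470.0, 480.0, 430.0],
    "heating": [90.0, 85.0, 60.0, 40.0, 35.0, 50.0],
}

# Pairwise coefficients between every pair of load types.
corr: Dict[Tuple[str, str], float] = {
    (a, b): pearson(loads[a], loads[b]) for a, b in combinations(loads, 2)
}
```

Strong positive or negative coefficients between load types are the kind of coupling that joint forecasting can exploit and that independent forecasting ignores.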
(3) Auxiliary information analysis
The power load, cooling load, and heating load are significantly correlated with the meteorological conditions. To evaluate the value of meteorological and calendar data for multi-energy load forecasting, we designed and implemented the following four different experimental cases:
Case 1: Only historical data for the electrical load, cooling load, and heating load are used for forecasting, without any external auxiliary information.
Case 2: Based on Case 1, meteorological data are introduced as an additional feature of the forecast model.
Case 3: Based on Case 1, calendar data are introduced as an additional feature so that the model can account for date-related factors.
Case 4: Directly building on Case 1, we introduce both meteorological and calendar data as additional features of the forecast model.
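The four configurations can be expressed as feature-selection flags, as in the sketch below. It assumes that Case 3 adds only calendar data to Case 1, as the later comparison between Case 2 and Case 3 implies. All column names are hypothetical placeholders rather than the dataset's actual field names.

```python
from typing import Dict, List

# Hypothetical column names; the dataset's real field names are not given.
LOAD_COLUMNS = ["electricity_kw", "cooling_ton_hours", "heating_btu_per_hour"]
WEATHER_COLUMNS = ["temperature", "humidity", "wind_speed"]
CALENDAR_COLUMNS = ["hour_of_day", "day_of_week", "is_holiday"]

# Which auxiliary feature groups each experimental case uses.
CASES: Dict[str, Dict[str, bool]] = {
    "case1": {"weather": False, "calendar": False},
    "case2": {"weather": True, "calendar": False},
    "case3": {"weather": False, "calendar": True},  # assumed: calendar only
    "case4": {"weather": True, "calendar": True},
}


def input_columns(case: str) -> List[str]:
    """Return the model input columns for one experimental case."""
    flags = CASES[case]
    columns = list(LOAD_COLUMNS)  # historical loads are always included
    if flags["weather"]:
        columns += WEATHER_COLUMNS
    if flags["calendar"]:
        columns += CALENDAR_COLUMNS
    return columns


def select_features(record: Dict[str, float], case: str) -> List[float]:
    """Pick the values of one hourly record that the given case feeds to the model."""
    return [record[name] for name in input_columns(case)]
```

Keeping the load columns fixed and toggling only the auxiliary groups means any change in accuracy between cases can be attributed to the added information.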
As shown in Table 4, which summarizes the experimental results of the four cases when the prediction length is 96, the addition of auxiliary information improves the performance of the prediction model, especially in Case 4. Specifically, compared with Case 1, the MAE and RMSE in Case 2 decreased by 4.58% and 2.96%, respectively, and R² increased by 0.21%. Compared with Case 1, the MAE and RMSE in Case 3 decreased by 2.61% and 3.94%, respectively, and R² increased by 0.32%. It is worth noting that Case 3 achieves a larger RMSE reduction and R² gain than Case 2, although its MAE reduction is smaller, which suggests that the cyclical features contained in calendar data (such as working days, holidays, and seasonal changes) may have more complex and potentially regular effects on multi-energy loads than meteorological data. This is consistent with the design of the TSTG: the dynamic adaptive graph convolution module can adjust the node connection weights according to real-time load characteristics, so as to capture the spatial dynamics driven by calendar data more flexibly, and the multi-head spatio-temporal attention module can extract the long-term periodic patterns in calendar data by modeling the dependencies between different time spans and feature dimensions in parallel.
Further analysis of the experimental results of Case 4 shows that, when meteorological and calendar data were introduced at the same time, the MAE and RMSE decreased by 9.15% and 5.42%, respectively, and R² increased by 0.21% compared with Case 1. This suggests that the TSTG method can integrate the physical topological features extracted by the encoder (such as the energy network structure) with the calendar and meteorological features in the decoder through the end-to-end space–time joint optimization framework, and it supports the advantages of the TSTG method in dealing with complex spatio-temporal dependencies and dynamic variable interactions.
3.5. Ablation Experiment
To evaluate the effectiveness of the dynamic adaptive graph convolution module, we replace it in the TSTG model with a static graph convolution (-StaticGCN) and with a traditional graph attention network (-GAT), keeping the number of graph convolution layers and the parameter configuration the same. The experimental results when the prediction length is 96 are shown in Table 5. The results show that dynamic adaptive graph convolution is significantly superior to static GCN and GAT in power load and heating load prediction. Although the performance of static GCN in cooling load prediction is close to that of dynamic adaptive graph convolution, it cannot adjust the node weights according to real-time load characteristics, resulting in a high volatility of prediction errors in complex scenarios (such as extreme weather). In addition, the computational complexity of GAT is higher than that of dynamic adaptive graph convolution, which further supports the efficiency advantage of the proposed method.
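The difference between a fixed and an input-dependent adjacency can be illustrated with a toy three-node graph (power, cooling, heating). This sketch is not the TSTG formulation, which is not reproduced in this section. It assumes, purely for illustration, that dynamic edge weights are a row-wise softmax over the cosine similarity of each node's recent window.

```python
import math
from typing import List

Matrix = List[List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def row_softmax(matrix: Matrix) -> Matrix:
    """Normalize each row so that its weights are positive and sum to 1."""
    out = []
    for row in matrix:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def dynamic_adjacency(node_features: Matrix) -> Matrix:
    """Illustrative input-dependent adjacency built from the current windows."""
    scores = [[cosine(fi, fj) for fj in node_features] for fi in node_features]
    return row_softmax(scores)


def propagate(adj: Matrix, node_features: Matrix) -> Matrix:
    """One graph-convolution-style aggregation step: adj @ node_features."""
    n, width = len(node_features), len(node_features[0])
    return [
        [sum(adj[i][j] * node_features[j][k] for j in range(n)) for k in range(width)]
        for i in range(len(adj))
    ]


# A static adjacency stays the same whatever the inputs are.
STATIC_ADJ: Matrix = [[1.0 / 3.0 for _ in range(3)] for _ in range(3)]

# Hypothetical normalized windows for (power, cooling, heating).
normal_day = [[0.5, 0.6, 0.7], [0.4, 0.5, 0.6], [0.3, 0.2, 0.1]]
heat_wave = [[0.7, 0.9, 1.0], [0.8, 0.9, 1.0], [0.05, 0.05, 0.05]]

adj_normal = dynamic_adjacency(normal_day)
adj_heat_wave = dynamic_adjacency(heat_wave)
static_out = propagate(STATIC_ADJ, normal_day)
dynamic_out = propagate(adj_normal, normal_day)
```

With the static matrix, every node receives the same fixed mixture in both scenarios, whereas the dynamic weights shift when the load profiles change. That adaptability is the property the ablation isolates.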
To further analyze the importance of the multi-head spatio-temporal attention module to the TSTG, two sets of comparative experiments were designed. First, the multi-head spatio-temporal attention module was removed (-WO-MSA), and only the dynamic adaptive graph convolution module was retained for spatio-temporal feature extraction. Then, the multi-head spatio-temporal attention module (MSA) was replaced with the traditional self-attention module (-SA), so that the standard Transformer self-attention mechanism was used to handle spatio-temporal dependence. The experimental results are shown in Table 5. The TSTG is superior to -WO-MSA and -SA in the MAE, RMSE, and R². Specifically, when the multi-head spatio-temporal attention module is removed, the model's ability to capture long-term dependence in the time dimension decreases significantly (the MAE increases by 5.04%), while when it is replaced by the traditional self-attention module, the model's representation of nonlinear interactions in the feature dimension is limited by the lack of multi-subspace parallel modeling (the RMSE increases by 2.60%).
As can be seen from the table, the full model, which combines the multi-head spatio-temporal attention module and the dynamic adaptive graph convolution module, performs better than the variants that include only one of the two modules (-MSA and -DynamicGCN, respectively). The MAE and RMSE of the TSTG are 2.11% and 1.54% lower than those of -DynamicGCN, respectively, and R² is 0.32% higher. Compared with -MSA, the MAE and RMSE of the TSTG decreased by 1.42% and 0.52%, respectively, and R² increased by 0.21%. This shows that the synergistic effect of the dynamic adaptive graph convolution module and the multi-head spatio-temporal attention module improves the overall performance of the model.