Article

Multi-Scale Spatiotemporal Fusion and Steady-State Memory-Driven Load Forecasting for Integrated Energy Systems

1 School of Automation, Jiangsu University of Science and Technology, Zhenjiang 212000, China
2 School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
* Author to whom correspondence should be addressed.
Information 2026, 17(3), 309; https://doi.org/10.3390/info17030309
Submission received: 12 February 2026 / Revised: 11 March 2026 / Accepted: 21 March 2026 / Published: 23 March 2026

Abstract

Load forecasting for Integrated Energy Systems (IESs) is critical to enabling multi-energy coordinated optimization and low-carbon scheduling. With multiple load types and multi-site, high-dimensional heterogeneous data, global learning remains challenging because spatiotemporal coupling features are insufficiently represented. To address the multi-source heterogeneous characteristics of IES loads, this paper designs a Spatiotemporal Topology Encoder that maps load data into a tensorized multi-energy spatiotemporal topological representation via fuzzy classification and multi-scale ranking. In parallel, we construct a MultiScale Hybrid Convolver to extract multi-scale, multi-level global spatiotemporal features of the multi-energy load representation. We further develop a Temporal Segmentation Transformer and a Steady-State Exponentially Gated Memory Unit, and design a jointly optimized forecasting model that enforces global dynamic correlation and local steady-state preservation. Altogether, we propose a multi-scale spatiotemporal fusion and steady-state memory-driven load forecasting method for integrated energy systems (MSTF-SMDN). Extensive experiments on a public real-world dataset from Arizona State University demonstrate the superiority of the proposed approach: compared to the strongest baseline, MSTF-SMDN reduces cooling load RMSE by 16.09%, heating load RMSE by 12.97%, and electric load RMSE by 6.14%, while achieving R² values of 0.99435, 0.98701, and 0.96722, respectively, confirming its feasibility, efficiency, and promising potential for multi-energy load forecasting in IESs.

1. Introduction

Against the backdrop of profound restructuring of the global energy mix [1] and large-scale integration of renewable energy [2], Integrated Energy Systems (IESs) [3] have emerged as a critical technical pathway for achieving efficient, low-carbon, and resilient energy operation. By organically coupling multiple energy carriers—electricity, heating, and cooling—and increasingly extending to freshwater production within the water–energy nexus paradigm [4,5], IESs mitigate short-term volatility and supply–demand uncertainty in power [6], heat [7], and cooling [8] subsystems, while dynamic routing and energy storage improve overall utilization efficiency [9] and operational economic performance. In modern energy infrastructure, accurate load forecasting is indispensable for stochastic dispatching and multi-timescale scheduling optimization [10,11], data-driven grid security assessment [10], and flexibility resource allocation [11]. Extracting high-resolution decision support from large-scale, multi-source heterogeneous information has therefore become a key research topic in power and energy engineering.
In practice, IES load forecasting faces several intertwined challenges. First, IESs encompass multiple load types (e.g., electricity, heating, cooling) and involve coordinated data acquisition from diverse channels, including sensor networks [12], distributed energy monitoring systems, and building energy management platforms. The resulting multi-source data exhibit pronounced heterogeneity in granularity, timeliness, and credibility, demanding effective information fusion strategies—as demonstrated even in electromechanical diagnostics [13]—and robust data quality assurance techniques [14] as reliable baselines. Second, along the temporal dimension, load time series display intra-day periodic fluctuations, weekly cycles, seasonal patterns, and long-term trends, inducing time-shift characteristics and inertia-driven delays [15] that resemble the complex volatility patterns observed in environmental time-series forecasting [16]. Third, along the spatial dimension, different load types form multi-level coupling through the energy network topology, giving rise to cross-carrier coupled transfer effects [3]. These intertwined temporal and spatial complexities pose fundamental questions regarding multi-scale feature extraction, coupling-mechanism modeling, and robust handling of data quality that require deeper exploration and innovative solutions.
In recent years, IES load forecasting has evolved from foundational architectural innovations to increasingly sophisticated system-level applications. Early methods largely relied on traditional machine learning (ML). Idowu et al. [17] employed support vector machines, regression trees, and feedforward neural networks for building heat load forecasting. While effective for simpler tasks, these methods degrade markedly in accuracy and generalization when confronted with complex spatiotemporal dependencies and multi-energy coupling. To address this, Rodrigues et al. [18] introduced a method combining functional clustering with ensemble learning; however, computational efficiency remains a challenge as data scale grows. Tan et al. [19] proposed a joint forecasting model based on multi-task learning and least squares support vector machines (LSSVM), improving accuracy by sharing weights across electricity, heat, cooling, and gas. Subsequently, Alsharekh et al. [20] presented an approach combining evolutionary algorithms with data decomposition and wavelet transforms for short-term load forecasting (STLF), though limitations persist for high-dimensional complex data. Overall, both traditional statistical models [21] and classical ML approaches [22] often struggle to simultaneously capture multimodal features and long-range dependencies, leading to insufficient feature exploitation and poor generalization in high-dimensional nonlinear settings [23].
Deep learning, with its powerful nonlinear fitting and hierarchical feature extraction capabilities, has become the mainstream paradigm for IES load forecasting. A critical line of research focuses on spatiotemporal feature modeling. Chen et al. [24] proposed a multi-scale convolutional neural network (CNN) combined with long short-term memory (LSTM) networks, achieving strong performance in multi-energy load forecasting but leaving practical issues of computational efficiency and scalability unresolved. Zhao et al. [25] introduced a multi-step residential load forecasting method based on a graph attention mechanism and a Transformer model, which improved multi-step accuracy by fusing spatiotemporal graph information. Banerjee et al. [26] proposed a Spatial–Temporal Synchronous Graph Transformer network (STSGT), which synchronously captures spatial and temporal dependencies via multi-head self-attention operating on a synchronous spatiotemporal graph; however, this framework was originally designed for epidemiological forecasting and does not address the unique multi-energy coupling present in IES.
Another important direction addresses multi-scale and multi-task learning for capturing cross-energy-carrier correlations. Song et al. [27] proposed a multi-stage LSTM federated forecasting method that jointly models multi-load interactions under multi-time-scale settings, markedly improving accuracy. More recently, Song et al. [28] transformed multi-energy load forecasting into a hierarchical multi-task learning problem with spatiotemporal attention, designing a gated temporal convolutional network to analyze coupling relationships among energy sources. While this approach effectively improves forecasting accuracy through task hierarchy, it processes temporal and spatial dependencies in a separate manner, potentially missing the fine-grained interactions that arise from the topological structure of energy consumption units. Although multi-scale feature extraction techniques from adjacent domains—such as dynamic trend fusion for traffic prediction [29] and wavelet-guided frequency decomposition for fault prediction [30]—offer transferable methodological insights, they have not been specifically tailored to the multi-energy coupling inherent in IES. More critically, the recurrent memory mechanisms underlying most existing IES forecasting models face fundamental limitations. Recent advances in extended recurrent architectures, such as xLSTM [31] with exponential gating and stabilized memory mixing, have demonstrated improved sequence modeling in language tasks; neural ordinary differential equation (ODE) approaches [32] and log-domain gating stabilization [31] have also been explored for gradient stabilization in deep recurrent networks. However, none of these innovations have been specifically designed for the multi-source heterogeneous data fusion, spatiotemporal topology encoding, and multi-energy coupling requirements unique to IES load forecasting.
Despite significant progress, several critical limitations persist in the state of the art. (1) Existing data representation methods for IES typically flatten multi-source heterogeneous inputs into simple vector concatenations, failing to encode the functional heterogeneity of energy consumption units and their geospatial distribution topology. (2) While multi-scale convolutional and Transformer-based architectures have shown promise, they have not been jointly optimized to exploit both fine-grained local structures and broad global associations simultaneously within a unified framework. (3) Recurrent models such as LSTM and GRU suffer from gradient vanishing/explosion in long-horizon settings. Although recent advances—including xLSTM [31] with exponential gating and novel memory structures, neural ODE approaches [32] with continuous-time dynamics, and log-domain gating stabilization—have partially addressed these issues in general sequence modeling, they have not been specifically designed for the multi-energy coupling, heterogeneous data fusion, and numerical stability requirements particular to long-horizon IES load forecasting, leaving a critical methodological gap. (4) Ablation studies in prior works are often incomplete, and the contribution of individual components (e.g., topology encoding, multi-scale fusion, and memory mechanisms) is rarely quantified independently.
To address these limitations, we propose a Multi-Scale Spatiotemporal Fusion with Steady-State Memory-Driven Network (MSTF-SMDN) tailored for real-world IES load forecasting tasks. Considering the functional heterogeneity of energy consumption units and the geospatial distribution topology, we design a Spatiotemporal Topology Encoder (STG) that maps multi-source heterogeneous time series into a tensorized multi-energy spatiotemporal topological representation via fuzzy classification and multi-scale proximity ranking, enabling unified embedding across temporal and spatial dimensions. We then construct a MultiScale Hybrid Convolver (MSHC) that combines depthwise separable convolutions with dynamic channel fusion to simultaneously capture fine-grained local structures and broad global associations at multiple scales. Furthermore, we develop a Temporal Segmentation Transformer (TST) and a Steady-State Exponentially Gated Memory Unit (SEGM), proposing a global–local collaborative forecasting algorithm that captures long-range temporal dependencies through Transformer-based attention and models local steady-state dynamics through numerically stabilized exponential gating. Extensive experiments on a public real-world dataset demonstrate that MSTF-SMDN achieves consistent and substantial improvements over both classical and state-of-the-art baselines: compared to the strongest baseline (TimesNet), our method reduces cooling load RMSE by 16.09%, heating load RMSE by 12.97%, and electric load RMSE by 6.14%, while achieving R² values of 0.99435, 0.98701, and 0.96722 for the three load types, respectively.
The main contributions are summarized as follows:
  • We design a Spatiotemporal Topology Encoder that overcomes the limitations of conventional flat data representations by reconstructing multi-source heterogeneous data into a tensorized spatiotemporal topological representation. Through fuzzy functional classification and multi-scale spatial proximity ranking, the encoder preserves both the functional attributes and geospatial distribution of energy consumption units, capturing the dynamic evolution of load data across temporal and spatial dimensions.
  • We propose a MultiScale Hybrid Convolver that integrates depthwise separable convolutions with dynamic channel fusion to synchronously extract deep interactions between local details and global patterns across multiple scales (3 × 3, 5 × 5, 7 × 7), significantly enhancing the multidimensionality and discriminative power of load feature representations.
  • We integrate a Temporal Segmentation Transformer with a Steady-State Exponentially Gated Memory Unit, where the former captures global temporal dependencies via multi-head self-attention on time-series segments and the latter models local steady-state dynamics via log-domain exponential gating that provably stabilizes gradient propagation. Their synergy yields a global–local jointly optimized forecasting model with notable improvements in accuracy and stability.

2. Proposed Method

We propose a load forecasting algorithm tailored to multi-site, multi-load-type, high-dimensional heterogeneous data, namely, the Multi-Scale Spatiotemporal Fusion with Steady-State Memory-Driven Network (MSTF-SMDN). Figure 1 illustrates the overall architecture of the proposed method.
The proposed MSTF-SMDN comprises four modules:
  • Spatiotemporal Topology Encoder (STG): Considering functional heterogeneity of energy consumption units and geospatial topological distribution, it reorganizes and generates a tensorized multi-energy spatiotemporal topological representation, enabling high-dimensional embedding of multi-source heterogeneous time series.
  • MultiScale Hybrid Convolver (MSHC): A parallel heterogeneous convolutional architecture that extracts local structural features and global associations across multiple scales, providing multidimensional load feature representations.
  • Temporal Segmentation Transformer (TST): It slices the time series, then encodes the segments with a Transformer to model long-range dynamics and capture long-term dependencies effectively.
  • Steady-State Exponentially Gated Memory Unit (SEGM): It enhances modeling of short-term dependencies and micro-scale perturbations, handling spatiotemporal variability efficiently in complex environments.
Through the synergy of these modules, we construct an efficient and accurate IES load forecasting framework that delivers fine-grained decision support in highly dynamic settings. Concretely, information flows as follows: The STG first reorganizes raw multi-source heterogeneous time series into a tensorized 2D grid T that encodes both functional category memberships and geospatial proximity. This structured representation is then fed to the MSHC, which applies parallel multi-scale convolutions (3 × 3, 5 × 5, and 7 × 7) to extract spatial features at different granularities, from fine-grained local load clusters to broad cross-category coupling patterns. The resulting fused feature map $\hat{F}$ is flattened and segmented by the TST, which captures long-range temporal dependencies across segments via multi-head self-attention. Finally, the segment-level embeddings are projected back to step-level vectors and fed to the SEGM, which recurrently processes each time step to model local steady-state dynamics and short-term perturbations. The final prediction is produced by a linear output layer applied to the last hidden state of the SEGM.

2.1. Spatiotemporal Topology Encoder

In this subsection, we design a Spatiotemporal Topology Encoder (STG) using an affine multi-channel, image-like encoding mechanism to collaboratively map multi-source heterogeneous spatiotemporal information. The STG reconstructs multi-energy load data into a multi-energy spatiotemporal topological representation, precisely capturing functional attributes of energy consumption units and their spatial topology, thereby providing accurate feature expressions for multidimensional spatiotemporal modeling.

2.1.1. Fuzzy Classification of Functional Categories of Energy Consumption Units

In IES, differences in functional attributes [33] of energy consumption units directly affect their load time-series characteristics [34], peak–valley distributions, and periodic patterns [35]. To finely characterize such functional heterogeneity, we introduce a fuzzy classification approach to express latent membership of units to functional categories.
First, for the i-th energy consumption unit, we extract time-series statistics of its load data over a specified horizon to obtain its load feature vector (see Appendix A, Appendix A.1 for formulas). On this basis, we apply fuzzy C-means (FCM) clustering [36] to categorize the unit’s functional type. We select K initial cluster centers from the feature space and compute each unit’s fuzzy membership degree to each functional category (see Appendix A.2), where $u_{ik}$ denotes the membership of the i-th unit to the k-th cluster center, and $m > 1$ is the fuzzifier controlling membership decay. In this study, we set the fuzzifier $m = 2$ following standard FCM practice, and determine $K = 4$ based on silhouette coefficient analysis of the campus energy consumption data, which identifies four dominant functional categories (teaching, research, residential, and sports/recreation).
To validate the choice of K, we conduct a sensitivity analysis by varying K from 2 to 6 over the 121 energy consumption units (Table 1). Although $K = 2$ yields the highest silhouette score [37] (0.5237), it provides only a coarse binary partition that cannot distinguish the four functionally distinct building categories present on campus. Among multi-class settings ($K \geq 3$), $K = 4$ achieves the highest silhouette coefficient (0.3988), indicating the best-separated and most cohesive clustering structure. The Calinski–Harabasz index [38] decreases monotonically with K, reflecting the natural trade-off between granularity and compactness. Furthermore, a stability analysis with 10 random FCM initializations at $K = 4$ yields a coefficient of variation of 0.00%, confirming that the clustering is highly robust to initialization.
To optimize clustering, we further update the cluster centers (Appendix A.3) and assess convergence by the change in centers (Appendix A.4). Following the maximum-membership principle, we determine each unit’s dominant category $c_i^*$ (Appendix A.5). Units sharing the same dominant category form the set $S_c$ (Appendix A.6), yielding the function-oriented energy consumption unit collection $\{(S_c, v_c)\}_{c=1}^{K}$. Through this process, we establish a function-oriented, block-wise topological partition, such that units of the same category exhibit highly cohesive spatial distributions of load characteristics.
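As an illustration, the FCM membership and center updates referenced above (Appendix A.2–A.3) can be sketched as follows; the toy features, initialization, and the helper name `fcm_step` are hypothetical stand-ins for the unit load feature vectors of Appendix A.1:

```python
import numpy as np

def fcm_step(X, V, m=2.0, eps=1e-10):
    """One fuzzy C-means iteration: update memberships, then centers.

    X: (n, d) unit feature vectors; V: (K, d) cluster centers.
    Returns (U, V_new) where U[i, k] is the membership u_ik.
    """
    # Pairwise distances between units and centers, shape (n, K).
    D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=-1) + eps
    # Standard FCM membership: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1)).
    p = 2.0 / (m - 1.0)
    U = 1.0 / np.sum((D[:, :, None] / D[:, None, :]) ** p, axis=-1)
    # Weighted center update: v_k = sum_i u_ik^m x_i / sum_i u_ik^m.
    W = U ** m
    V_new = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, V_new

rng = np.random.default_rng(0)
# Two well-separated synthetic "functional categories" of 20 units each.
X = np.vstack([rng.normal(0, 0.2, (20, 3)), rng.normal(3, 0.2, (20, 3))])
V = np.vstack([X[0], X[20]])          # one seed center per group
for _ in range(30):
    U, V = fcm_step(X, V)
dominant = U.argmax(axis=1)           # maximum-membership principle -> c_i^*
```

On this toy data the memberships per unit sum to one and the dominant-category assignment cleanly recovers the two groups, mirroring the block-wise partition described above.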

2.1.2. Multi-Scale Ranking by Intra-Class Spatial Proximity

Within a functional category, spatial distributions of units reflect not only similarity of load characteristics but also regional aggregation effects [39] and geographic factors [40]. To ensure spatial consistency and interpretability within categories, we propose a multi-scale ranking method by spatial proximity that preserves inter-unit spatial interaction features.
Define the layered distance of the i-th unit in the function-oriented energy consumption unit set as $d_i^{(l)} = \lVert p_i - g_c^{(l)} \rVert$ for $l = 1, \ldots, M$ (see also Appendix A.7), where $p_i$ denotes the geospatial coordinates of unit i, $g_c^{(l)}$ is the l-th level centroid of category c, and M is the number of spatial resolution levels. By weighted aggregation across multiple scales, the composite distance is obtained as $d_i^{\mathrm{comp}} = \sum_{l=1}^{M} w_l\, d_i^{(l)}$ (Appendix A.8), where the scale weights $\{w_l\}$ are set proportionally to the inverse of each scale’s variance, thereby balancing fine- and coarse-grained proximity contributions. Sorting in ascending order by the composite distance yields the ordered set (Appendix A.9). Each unit’s spatial proximity is discretely mapped to a ranking index (Appendix A.10). This yields the proximity-ordered collection of function-oriented unit sets. After ranking, the geographic proximity within the same functional category is encoded as spatial adjacency via the composite distance, ensuring that intra-category spatial distributions exhibit high cohesion and rationality.
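A minimal sketch of this multi-scale ranking, assuming Euclidean layered distances and inverse-variance scale weights as stated above; the toy coordinates and the helper name `proximity_rank` are illustrative only:

```python
import numpy as np

def proximity_rank(P, centroids, w=None):
    """Rank units of one functional category by multi-scale proximity.

    P: (n, 2) unit coordinates p_i; centroids: (M, 2) level centroids g_c^(l).
    Returns indices that sort units by the composite distance d_i^comp.
    """
    # Layered distances d_i^(l) = ||p_i - g_c^(l)||, one column per level.
    D = np.linalg.norm(P[:, None, :] - centroids[None, :, :], axis=-1)  # (n, M)
    if w is None:
        # Inverse-variance weights, normalized to sum to one.
        inv = 1.0 / (D.var(axis=0) + 1e-12)
        w = inv / inv.sum()
    d_comp = D @ w                      # weighted aggregation over scales
    return np.argsort(d_comp)           # ascending order -> ranking indices

P = np.array([[5.0, 5.0], [1.0, 1.0], [3.0, 3.0]])
centroids = np.array([[0.0, 0.0], [0.5, 0.5]])  # two resolution levels
order = proximity_rank(P, centroids)
```

Here the unit nearest to both level centroids receives rank 0, so nearby units end up adjacent once the ordered set is written into the grid.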

2.1.3. Grid Matrix Filling and Optimization

We reconstruct the proximity-ordered set S prox into a composite grid matrix of “temporal information × functional channels × spatial grid.” The goal is to efficiently represent spatiotemporal characteristics via structured spatial ordering and embeddings, thereby supporting multidimensional spatiotemporal modeling.
Define the multi-energy spatiotemporal topological representation T, whose size N is determined by the maximum class capacity across the proximity-ordered sets, i.e., $N = \max_c |S_c|$. The filling start position for the k-th category is given in Appendix A.11. For a unit, its grid coordinates are obtained by mapping the spatial ranking index (Appendix A.12). This mapping ensures that units within the same category are arranged compactly by geographic proximity while units from different categories are effectively separated. Zeros are filled in the inter-category gaps to avoid feature interference; this strategy improves spatial continuity and ensures the effective placement of energy consumption units within the grid matrix.
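The filling strategy can be illustrated with a deliberately simplified sketch that assumes one grid row per functional category and zero padding in the inter-category gaps; the exact coordinate mapping lives in Appendices A.11–A.12, so this is only a schematic:

```python
import numpy as np

def fill_grid(ranked_sets, N):
    """Place proximity-ranked units of K categories into a K x N grid.

    ranked_sets: per-category lists of unit ids, already proximity-ordered.
    Row k holds category k, packed from the left; trailing cells stay zero,
    so unit ids must start at 1 (0 marks an empty padding cell).
    """
    T = np.zeros((len(ranked_sets), N), dtype=int)
    for k, units in enumerate(ranked_sets):
        T[k, :len(units)] = units       # compact, proximity-ordered placement
    return T

# Category 0 has three units ranked [3, 1, 2]; category 1 has two units.
T = fill_grid([[3, 1, 2], [5, 4]], N=3)
```

The width N equals the largest category size, matching $N = \max_c |S_c|$, and the zero cell separates the two categories' feature blocks.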
The proposed STG preserves temporal information while enabling non-rigid categorization of units, embedding geospatial distribution across multi-energy channels. With hierarchical spatial ranking, geographic proximity is quantified within function-oriented unit sets, and the resulting multi-energy spatiotemporal topological representation T effectively captures temporal features of multi-source heterogeneous spatiotemporal dynamics.
Remark on data partitioning. To ensure strict separation between training and evaluation, all parameters of the STG (including the FCM cluster centers $\{v_c\}$, membership degrees $\{u_{ik}\}$, and the composite distance rankings used for grid filling) are computed exclusively from the training partition. During validation and testing, the learned cluster centers and ranking indices are frozen and directly applied to the corresponding data slices, thereby preventing any information leakage from the test set into the encoder’s construction.
Remark on data-driven design rationale. The proposed MSTF-SMDN adopts a purely data-driven approach rather than incorporating explicit physical equations (e.g., thermodynamic balance or network flow constraints). This design choice is motivated by two considerations. First, real-world IES often lack complete physical parameter specifications for every building and subsystem, making first-principles modeling impractical at scale. Second, the STG implicitly captures domain-relevant physical properties: by clustering energy consumption units according to their load statistics, it groups buildings with similar thermal inertia, occupancy patterns, and HVAC characteristics, whereas the spatial proximity ranking reflects geographic factors such as shared climate exposure and district-level energy coupling. Future work will explore physics-informed hybrid architectures that embed thermodynamic constraints as inductive biases alongside the data-driven modules.

2.2. MultiScale Hybrid Convolver

IES loads display high spatial heterogeneity [41] and hierarchical structure, often combining local clustering [42] with broad diffusion patterns. We design a MultiScale Hybrid Convolver by jointly optimizing depthwise separable convolutions and dynamic channel fusion, establishing multi-scale coverage and channel decoupling to achieve efficient multi-scale feature extraction and fusion.
Justification for 2D convolution on the topological grid. Although the grid matrix T is constructed via a raster scan of sorted units, the resulting 2D layout encodes two distinct forms of adjacency that neither 1D convolution nor standard Graph Neural Networks (GNNs) can simultaneously exploit. Along the horizontal axis, neighboring cells share the same functional category and are ranked by geographic proximity, capturing intra-class spatial correlations. Along the vertical axis, neighboring cells may belong to different functional categories at similar spatial positions, enabling the 2D kernels to model cross-category energy coupling (e.g., co-located electricity and heating loads). A 1D convolution would only capture intra-class sequential correlations, while a GNN—though capable of modeling arbitrary graph topologies—requires predefined or learned adjacency matrices and cannot directly leverage the regular grid structure for efficient multi-scale receptive-field expansion via standard kernel sizes (3 × 3, 5 × 5, 7 × 7). The 2D convolutional architecture, therefore, achieves a favorable trade-off between topological expressiveness and computational efficiency.
To further substantiate this argument quantitatively, Table 2 compares the theoretical computational complexity of the proposed MSHC with representative GNN alternatives (GCN and GAT) under matched capacity settings.
MSHC incurs higher FLOPs (56.29M vs. 6M for GCN/GAT) and more parameters (233,088 vs. 27,000). However, we argue that this additional cost is well justified by the resulting accuracy gains (see the ablation study in Section 3.3.2, where removing the MSHC degrades cooling RMSE by 20.75%). Three design-level factors further favor the convolutional approach: (1) the regular 11 × 11 grid enables highly optimized GPU convolution kernels, yielding favorable wall-clock throughput despite higher theoretical FLOPs; (2) MSHC natively provides multi-scale receptive fields (3 × 3, 5 × 5, 7 × 7) in a single pass, whereas GNNs require multiple stacked layers to emulate comparable coverage; and (3) GCN/GAT require either a predefined or data-driven adjacency matrix; the latter adds $N^2 = 13{,}225$ learnable parameters and an additional structural hyperparameter (edge sparsity threshold) that must be tuned.
First, we define a set of multi-scale convolution kernels $\{s_1, s_2, s_3\} = \{3, 5, 7\}$ and design a parallel, heterogeneous multi-scale convolutional architecture. The multi-energy spatiotemporal topological representation T is convolved with kernels of sizes 3 × 3, 5 × 5, and 7 × 7. Each scale extracts features independently to capture, in order, fine-grained details, mid-range associations, and global spatial information. The convolution is computed as in Equation (1).
$$F_s(c_{\mathrm{out}}, x, y) = \sum_{c_{\mathrm{in}}} \sum_{i=-p_s}^{p_s} \sum_{j=-p_s}^{p_s} W_s(c_{\mathrm{out}}, c_{\mathrm{in}}, i, j) \cdot T(c_{\mathrm{in}}, x+i, y+j) + b_s(c_{\mathrm{out}})$$

Here, $F_s$ denotes the output feature map at scale s; $W_s$ is the kernel parameter tensor at scale s; i and j are spatial indices inside the kernel, representing offsets from the center position, with ranges determined by $p_s = \lfloor s/2 \rfloor$; $p_s$ is the symmetric padding radius ensuring the output size matches the input; $c_{\mathrm{in}}$ and $c_{\mathrm{out}}$ are the input/output channel indices; $b_s$ is the bias term.
For each scale s, a 1 × 1 pointwise convolution is applied to realize channel-dimension expansion, as in Equation (2):
$$G_s(c_{\mathrm{out}}', x, y) = \sum_{c_{\mathrm{out}}} W_{1\times 1}(c_{\mathrm{out}}', c_{\mathrm{out}}) \cdot F_s(c_{\mathrm{out}}, x, y) + b_{1\times 1}(c_{\mathrm{out}}')$$
To further enhance the flexibility and adaptivity of multi-scale feature fusion, we introduce learnable scale weights $\alpha_s$ to adjust the contribution of each scale. The weights are computed from the global average pooling of each scale’s feature map and then normalized with a Softmax, as in Equations (3) and (4).

$$z_s = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} G_s(:, x, y)$$

$$\alpha_s = \frac{\exp(z_s)}{\sum_{s'} \exp(z_{s'})}$$
Finally, the fused multi-scale feature map $\hat{F}$ is obtained as in Equation (5):

$$\hat{F} = \sum_{s} \alpha_s \cdot G_s$$
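A compact numerical sketch of the parallel multi-scale convolution and softmax scale fusion of Equations (1)–(5), restricted to a single channel for brevity; the kernel values and the 11 × 11 input are illustrative, not the trained parameters:

```python
import numpy as np

def conv2d_same(T, W, b):
    """Single-channel 'same' 2D convolution (Eq. (1) with one in/out channel)."""
    s = W.shape[0]
    p = s // 2                                   # p_s = floor(s / 2)
    Tp = np.pad(T, p)                            # symmetric zero padding
    H, Wd = T.shape
    out = np.empty_like(T, dtype=float)
    for x in range(H):
        for y in range(Wd):
            out[x, y] = np.sum(W * Tp[x:x + s, y:y + s]) + b
    return out

def mshc(T, kernels):
    """Fuse parallel multi-scale feature maps with softmax scale weights."""
    G = [conv2d_same(T, W, 0.0) for W in kernels]       # per-scale maps G_s
    z = np.array([g.mean() for g in G])                 # global average pool
    alpha = np.exp(z - z.max())                         # Eq. (4), stabilized
    alpha /= alpha.sum()
    return sum(a * g for a, g in zip(alpha, G)), alpha  # Eq. (5)

rng = np.random.default_rng(1)
T = rng.normal(size=(11, 11))                           # toy topological grid
kernels = [np.full((s, s), 1.0 / s**2) for s in (3, 5, 7)]
F_hat, alpha = mshc(T, kernels)
```

The fused map keeps the input's spatial size thanks to the `same` padding, and the scale weights form a proper convex combination over the three receptive fields.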

2.3. Temporal Segmentation Transformer

We propose a slicing–embedding–position-encoding pipeline for time series, combined with the modeling power of a Transformer, to form a Temporal Segmentation Transformer (TST) that captures global dependencies and multi-scale dynamics.
First, we stack the multi-scale fused feature map $\hat{F}$ produced by the MultiScale Hybrid Convolver column-wise—preserving the temporal ordering along the first axis—and flatten the spatial dimensions into a univariate time series of length T. This column-major flattening ensures that temporally adjacent entries remain contiguous, thereby preserving the causal structure required by downstream sequence modeling. We set the segment length $L = 16$ (corresponding to a 4 h context window at 15 min resolution), which aligns with the dominant intra-day periodicity observed in campus energy consumption patterns and provides sufficient local context for the Transformer encoder without incurring excessive computational overhead. A sensitivity analysis over $L \in \{8, 12, 16, 24, 32\}$ (Section 3.3.3) empirically confirms that $L = 16$ yields the best multi-load trade-off. Using a sliding window of length L and stride F, we split the resulting series into subsequences:
$$x_p^{(i)} = x_{(i-1)F+1 \,:\, (i-1)F+L}, \quad i = 1, \ldots, P$$

where the number of subsequences is $P = \lfloor (T - L)/F \rfloor + 1$. To avoid an incomplete segment at the tail, we pad by repeating the last observation, producing the padded segment $x_p^{(P)}$:

$$x_p^{(P)} = \big[\, x_{(P-1)F+1}, \ldots, x_T, x_T, \ldots, x_T \,\big]$$
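The slicing-and-padding step can be sketched as follows on a scalar toy series; the helper name `segment` is illustrative and mirrors the windowing and repeat-padding formulas above up to zero-based indexing:

```python
import numpy as np

def segment(x, L, F):
    """Split series x into length-L windows with stride F, repeat-padding the tail."""
    T = len(x)
    P = (T - L) // F + 1                       # number of full windows
    segs = [x[i * F : i * F + L] for i in range(P)]
    tail_start = P * F
    if tail_start < T:                         # leftover steps -> padded segment
        tail = x[tail_start:]
        segs.append(np.concatenate([tail, np.full(L - len(tail), x[-1])]))
    return np.stack(segs)

x = np.arange(10.0)                            # T = 10
S = segment(x, L=4, F=3)                       # 3 full windows + 1 padded tail
```

For T = 10, L = 4, F = 3 this yields windows [0..3], [3..6], [6..9], plus a tail segment whose missing positions repeat the last observation.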
After slicing and padding, each subsequence is linearly projected to a fixed-dimensional vector to form the embedding $e^{(i)}$:

$$e^{(i)} = W_e\, x_p^{(i)} + b_e + \mathrm{PE}(i)$$
where $W_e$ and $b_e$ are learnable projection parameters, and $\mathrm{PE}(i)$ is the positional encoding computed with sine–cosine functions:

$$\mathrm{PE}(\mathrm{pos}, 2k) = \sin\!\left(\frac{\mathrm{pos}}{10000^{2k/d}}\right)$$

$$\mathrm{PE}(\mathrm{pos}, 2k+1) = \cos\!\left(\frac{\mathrm{pos}}{10000^{2k/d}}\right)$$
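The sine–cosine encoding admits a direct vectorized sketch (the dimensions here are illustrative):

```python
import numpy as np

def positional_encoding(P, d):
    """Sine-cosine positional encodings PE(pos, .) for P segment positions."""
    pos = np.arange(P)[:, None]                  # segment index "pos"
    k = np.arange(d // 2)[None, :]               # frequency index k
    angle = pos / (10000.0 ** (2 * k / d))
    PE = np.zeros((P, d))
    PE[:, 0::2] = np.sin(angle)                  # even dimensions: sin
    PE[:, 1::2] = np.cos(angle)                  # odd dimensions: cos
    return PE

PE = positional_encoding(P=8, d=16)
```

At position 0 the sine channels are 0 and the cosine channels are 1, and all entries stay within [-1, 1], so the encoding can be added to the segment embeddings without rescaling.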
This encoding preserves the relative order of segments and improves accuracy and robustness. The embeddings are fed to the Transformer encoder. Based on the associations among the query (Q), key (K), and value (V) matrices, the encoder adaptively assigns weights to segments via multi-head self-attention:
$$Q = E W^Q, \quad K = E W^K, \quad V = E W^V$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

$$\mathrm{head}_j = \mathrm{Attention}\big(Q W_j^Q, K W_j^K, V W_j^V\big)$$

$$\mathrm{MultiHead}(Q, K, V) = [\,\mathrm{head}_1; \ldots; \mathrm{head}_h\,]\, W^O$$

$$h = \mathrm{LayerNorm}\big(E + \mathrm{MultiHead}(Q, K, V)\big)$$
After the multi-head attention sublayer, a position-wise feed-forward network with two linear transforms and a ReLU activation produces the TST output:
$$\mathrm{FFN}(h) = \max(0,\, h W_1 + b_1)\, W_2 + b_2$$

where $W_1, W_2$ are the feed-forward weights and $b_1, b_2$ are biases. Layer normalization and residual connections are applied after each sublayer, regularizing the multi-layer Transformer encoder and ensuring stable training.
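For concreteness, single-head scaled dot-product attention over segment embeddings can be sketched as follows; the shapes and random weights are illustrative, and the full TST uses h heads plus the residual LayerNorm and FFN described above:

```python
import numpy as np

def softmax(z):
    """Numerically stable row-wise softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over segment embeddings E."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))   # (P, P) segment-to-segment weights
    return A @ V, A

rng = np.random.default_rng(2)
P, d = 6, 8                               # 6 segments, embedding dim 8
E = rng.normal(size=(P, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, A = attention(E, Wq, Wk, Wv)
```

Each row of the attention matrix is a probability distribution over segments, which is how the encoder adaptively reweights long-range temporal context.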
Dimension bridging from TST to SEGM. The TST processes the flattened time series in coarse-grained segments of length $L$, producing a sequence of $P$ segment-level embeddings $\{ e^{(1)}, \ldots, e^{(P)} \} \in \mathbb{R}^{P \times d}$. To interface with the SEGM, which operates at the fine-grained individual time-step level, we apply a learned linear projection $W_{\mathrm{bridge}} \in \mathbb{R}^{d \times (L \cdot d_h)}$ followed by a reshape operation that maps each segment embedding back to $L$ time-step vectors of dimension $d_h$:
$$\left[ x_{(i-1)L+1}, \ldots, x_{iL} \right] = \mathrm{Reshape}\!\left( e^{(i)} W_{\mathrm{bridge}} + b_{\mathrm{bridge}} \right), \quad i = 1, \ldots, P$$
where $d_h$ is the SEGM hidden dimension. This projection restores the temporal resolution from segment level to step level, producing a sequence of $T = P \times L$ vectors that serve as the input to the SEGM’s recurrent unrolling. The linear projection is jointly trained end-to-end, ensuring that the segment-level global representations captured by the TST are seamlessly decoded into step-level inputs for the SEGM’s local steady-state modeling.
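As a concrete illustration of the projection-and-reshape bridge, the following plain-Python sketch maps P segment embeddings of dimension d to P × L step-level vectors of dimension d_h. All sizes and names (`bridge`, `W`, `b`) are toy assumptions for illustration, not the trained parameters.

```python
import random

def bridge(segment_embeddings, W, b, L, d_h):
    """Project each segment embedding (dim d) to L*d_h values, then reshape
    into L step-level vectors of dimension d_h."""
    steps = []
    for e in segment_embeddings:                  # e: list of length d
        flat = [sum(e[r] * W[r][c] for r in range(len(e))) + b[c]
                for c in range(L * d_h)]          # e @ W_bridge + b_bridge
        for t in range(L):                        # reshape: one vector per step
            steps.append(flat[t * d_h : (t + 1) * d_h])
    return steps                                  # length P * L = T

random.seed(0)
P, d, L, d_h = 3, 4, 2, 5                         # toy sizes (hypothetical)
E = [[random.gauss(0, 1) for _ in range(d)] for _ in range(P)]
W = [[random.gauss(0, 1) for _ in range(L * d_h)] for _ in range(d)]
b = [0.0] * (L * d_h)
steps = bridge(E, W, b, L, d_h)
print(len(steps), len(steps[0]))                  # P*L step vectors of dim d_h
```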

2.4. Steady-State Exponentially Gated Memory Unit

We introduce a Steady-State Exponentially Gated Memory (SEGM) with an exponential gating mechanism and a stabilization-oriented state update, improving numerical stability and gradient flow over traditional LSTM [43] when handling high-dimensional, multimodal data. The gating architecture of SEGM builds upon the exponential gating and log-domain stabilization principles introduced in xLSTM’s sLSTM variant [31], while incorporating domain-specific adaptations for multi-energy load forecasting: (i) a 2D convolutional formulation that operates on the spatiotemporal grid T rather than 1D sequences, and (ii) integration with the upstream MSHC and TST modules for end-to-end spatiotemporal feature processing. SEGM consists of gate definitions, stabilized state updating, temporal unrolling, and output mapping.

2.4.1. Gating Mechanism

In the Steady-State Exponentially Gated Memory (SEGM), the gating mechanism comprises an input gate, a forget gate, an output gate, and a candidate state. Together, they regulate the flow and update of state information, thereby modulating how current and historical states affect the output. The computations are as follows:
(1) Input gate. Following the exponential gating formulation of sLSTM [31], the input gate $i_t$ applies the exponential function to the current input $x_t$ and the previous hidden state $h_{t-1}$:
$$i_t = \exp\!\left( W_{ix} x_t + W_{ih} h_{t-1} + b_i \right)$$
where $W_{ix}$ is the input projection matrix, $W_{ih}$ is the recurrent projection matrix, and $b_i$ is the bias term.
(2) Forget gate. The forget gate controls how the previous state $c_{t-1}$ influences the current update. As in sLSTM [31], the forget gate also employs exponential activation:
$$f_t = \exp\!\left( W_{fx} x_t + W_{fh} h_{t-1} + b_f \right)$$
where $W_{fx}$ and $W_{fh}$ are the input and recurrent projection matrices, and $b_f$ is the bias.
(3) Output gate. The output gate determines the degree to which the current state contributes to the output, retaining the standard sigmoid activation from classical LSTM [43]:
$$o_t = \sigma\!\left( W_{ox} x_t + W_{oh} h_{t-1} + b_o \right)$$
(4) Candidate state. Following standard LSTM practice [43], a nonlinear transform computes the potential state update at the current step:
$$\tilde{c}_t = \tanh\!\left( W_{cx} x_t + W_{ch} h_{t-1} + b_c \right)$$
Remark on exponential gating. The input gate $i_t$ and forget gate $f_t$ deliberately employ the exponential function $\exp(\cdot)$ rather than the conventional sigmoid $\sigma(\cdot)$. While sigmoid gates are bounded within $(0, 1)$ and thus numerically safe, they inherently limit the dynamic range of gating signals and contribute to gradient saturation in long-horizon scenarios. The exponential activation produces unbounded positive values that enable finer-grained control over information flow. Although raw exponential values risk numerical overflow, this issue is explicitly resolved by the stabilized state update mechanism below, which operates in the log domain and normalizes all gate contributions through a running maximum $m_t$ (Equation (22)), ensuring that effective gate values remain bounded and well-conditioned throughout training.

2.4.2. Stabilized State Update

The stabilized state update mechanism below follows the log-domain normalization approach introduced in sLSTM [31]. The goal is to shift and scale gate values to alleviate gradient explosion, ensure numerical stability of the network in high-dimensional spaces, and enhance modeling of long-range dependencies. Details are as follows.
(1) Stable state. By operating in the log domain, multiplicative gating is converted into additive comparisons, which effectively avoids numerical overflow. A max operator selects the dominant information flow, ensuring effective updates: when the influence of the current input exceeds the accumulated historical influence, the model prioritizes the current input; otherwise, it preserves historical memory (Equation (22)).
$$m_t = \max\!\left( \log f_t + m_{t-1},\ \log i_t \right)$$
(2) Normalized gating. By shifting and scaling with the stable state, we limit the growth rate of state information, control the learning pace, and prevent over-updates from harming the model (Equations (23) and (24)).
$$f_t' = \exp\!\left( \log f_t + m_{t-1} - m_t \right)$$
$$i_t' = \exp\!\left( \log i_t - m_t \right)$$
This procedure adjusts the original gate values to guarantee smooth information transmission.
(3) State update equations. After integrating the normalized gates, the current state is computed by weighted accumulation of old and new information, covering information accumulation, state update, and output computation (Equations (25)–(27)).
$$c_t = f_t' \odot c_{t-1} + i_t' \odot \tilde{c}_t$$
$$n_t = f_t' \odot n_{t-1} + i_t'$$
$$h_t = o_t \odot \frac{c_t}{n_t}$$
Here, $n_t$ records the cumulative intensity of state updates, which is physically equivalent to a time-varying learning rate; $c_t / n_t$ provides adaptive scaling of the state, so the hidden state $h_t$ can be interpreted as a weighted average of historical information, as originally formulated in sLSTM [31]. This update process fuses the current input with the previous hidden state, balancing historical information and new input. While the gating and stabilization equations (Equations (18)–(27)) share the same mathematical form as the sLSTM variant of xLSTM [31], our contribution lies in the following: (i) embedding SEGM within a pipeline where the upstream STG and MSHC encode raw multi-source data into structured 2D spatiotemporal grid representations, so that the recurrent unit operates on features that already encode spatial topology and multi-scale patterns; (ii) bridging the segment-level TST output back to step-level inputs via a learned projection (Equation (17)), thereby unifying global attention and local recurrence in a single end-to-end framework; (iii) providing a gradient stability analysis specific to multi-energy load forecasting (see below).
Theoretical analysis of gradient stability. The term “steady-state” in SEGM refers to the numerical steady-state property of the gradient flow rather than thermodynamic equilibrium. We now demonstrate why SEGM achieves more stable gradient propagation than standard LSTM and GRU.
In a standard LSTM, the gradient of the loss $\mathcal{L}$ with respect to the cell state at time $\tau < t$ involves the product $\prod_{k=\tau+1}^{t} f_k$, where each forget gate $f_k \in (0, 1)$ is bounded by the sigmoid function. For long sequences, this product either vanishes exponentially (when $f_k < 1$) or requires careful bias initialization to remain near unity, creating a fragile training regime.
In SEGM, we replace the multiplicative gate chain with an additive log-domain accumulation. Substituting the normalized forget gate into the chain rule for the state update equations, the gradient satisfies the following:
$$\frac{\partial \mathcal{L}}{\partial c_\tau} = \frac{\partial \mathcal{L}}{\partial c_t} \cdot \prod_{k=\tau+1}^{t} f_k' = \frac{\partial \mathcal{L}}{\partial c_t} \cdot \prod_{k=\tau+1}^{t} \exp\!\left( \log f_k + m_{k-1} - m_k \right)$$
The key insight is that the stable state $m_t$ acts as a running normalizer: by construction, $m_t \geq \log f_t + m_{t-1}$ and $m_t \geq \log i_t$ (Equation (22)), so the normalized forget gate satisfies $f_t' \leq 1$ and the normalized input gate satisfies $i_t' \leq 1$. This bounded normalization ensures that the gradient product $\prod_k f_k'$ neither explodes nor vanishes abruptly, achieving a numerical “steady state” where information flows smoothly across time steps. In contrast to GRU, which uses a single sigmoid-gated interpolation and lacks explicit state normalization, SEGM’s log-domain formulation provides a principled mechanism for long-range gradient preservation.
More importantly, the cumulative intensity $n_t$ and the normalizing division $c_t / n_t$ in Equation (27) serve as an adaptive scaling mechanism: the hidden state $h_t$ represents a weighted average rather than a weighted sum of accumulated information, preventing unbounded state growth. This is analogous to the running mean in batch normalization, but applied to the temporal dimension of the recurrent state. We note that this “steady-state” property is fundamentally different from differential neural dynamics studied in other application domains; our design specifically targets the numerical stability requirements of long-horizon multi-energy load forecasting.
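The stabilized update can be illustrated numerically. Below is a minimal scalar (single-unit) sketch in plain Python (the study's implementation is in MATLAB); `z_i`, `z_f`, `z_o`, `z_c` are hypothetical gate pre-activations standing in for the learned affine transforms. With pre-activations this large, a naive exponential gate would overflow a double, while the log-domain form keeps the normalized gates bounded by 1.

```python
import math

def segm_step(z_i, z_f, z_o, z_c, c_prev, n_prev, m_prev):
    """One scalar SEGM update; z_* are gate pre-activations (W x + W h + b)."""
    log_i = z_i                                   # log i_t, since i_t = exp(z_i)
    log_f = z_f                                   # log f_t
    m_t = max(log_f + m_prev, log_i)              # stable state (Eq. 22)
    f_n = math.exp(log_f + m_prev - m_t)          # normalized forget gate <= 1
    i_n = math.exp(log_i - m_t)                   # normalized input gate  <= 1
    o_t = 1.0 / (1.0 + math.exp(-z_o))            # sigmoid output gate
    c_tilde = math.tanh(z_c)                      # candidate state
    c_t = f_n * c_prev + i_n * c_tilde            # Eq. 25
    n_t = f_n * n_prev + i_n                      # Eq. 26 (cumulative intensity)
    h_t = o_t * c_t / n_t                         # Eq. 27 (weighted average)
    return h_t, c_t, n_t, m_t, f_n, i_n

# Unroll 100 steps with huge pre-activations: math.exp(600) alone would raise
# OverflowError, but the stabilized update never evaluates it directly.
c, n, m = 0.0, 1.0, 0.0
for t in range(100):
    h, c, n, m, f_n, i_n = segm_step(z_i=500.0, z_f=600.0, z_o=0.5, z_c=0.3,
                                     c_prev=c, n_prev=n, m_prev=m)
    assert f_n <= 1.0 and i_n <= 1.0              # bounds used in the analysis
print(h, n)
```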

2.4.3. Temporal Unrolling and Output Mapping

The temporal unrolling equations effectively handle information flow and state updates in sequential data. In each network layer, the computation at time step $t$ depends on the input $x_t$ and the hidden state $h_{t-1}$ from the previous time step. The layer-wise input is defined as follows:
$$x_t^{(l)} = \begin{cases} x_t, & l = 1 \\ h_t^{(l-1)}, & l > 1 \end{cases}$$
Given the current layer input $x_t^{(l)}$ and the previous hidden state $h_{t-1}^{(l)}$, memory state $c_{t-1}^{(l)}$, cumulative intensity $n_{t-1}^{(l)}$, and stable state $m_{t-1}^{(l)}$, the improved Steady-State Exponentially Gated Memory (SEGM) updates the current hidden state $h_t^{(l)}$, memory state $c_t^{(l)}$, cumulative intensity $n_t^{(l)}$, and stable state $m_t^{(l)}$.
The final-layer hidden state $h_t^{(N_L)}$ (where $N_L$ denotes the number of SEGM layers) is passed through a fully connected layer for compression and mapping to produce the SEGM output, which serves as the final prediction of MSTF-SMDN:
$$\hat{y}_t = W_o h_t^{(N_L)} + b_o$$
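The layer-wise unrolling amounts to a loop in which each layer's hidden sequence becomes the next layer's input. The sketch below illustrates only this wiring; `cell` is a deliberately simplified placeholder (an exponential moving average), not the full SEGM update.

```python
def unroll(x_seq, n_layers, cell, h0=0.0):
    """x_t^(1) = x_t; x_t^(l) = h_t^(l-1) for l > 1."""
    seq = list(x_seq)
    for layer in range(n_layers):
        h = h0
        out = []
        for x_t in seq:
            h = cell(x_t, h)       # placeholder for the SEGM state update
            out.append(h)
        seq = out                  # hidden states feed the next layer
    return seq                     # final-layer hidden states h_t^(N_L)

# Toy cell: exponential moving average (illustration only).
ema = lambda x, h: 0.5 * x + 0.5 * h
ys = unroll([1.0, 0.0, 0.0], n_layers=2, cell=ema)
print(ys)  # final-layer hidden sequence
```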
By coupling multi-scale spatiotemporal fusion with the steady-state memory mechanism, MSTF-SMDN addresses the challenges posed by high-dimensional heterogeneous and multi-source data, providing accurate and robust load forecasts that support fine-grained scheduling and optimization in Integrated Energy Systems.

3. Case Study

3.1. Experimental Environment

To verify the feasibility and effectiveness of the proposed Multi-Scale Spatiotemporal Fusion with Steady-State Memory-Driven Network (MSTF-SMDN), we conduct experiments on real data. The dataset comes from the Campus Metabolism web platform of Arizona State University (Tempe campus), covering January 2019 to March 2020, i.e., over 14 months of continuous high-frequency observations. It includes 121 basic energy consumption units (e.g., teaching, research, and sports facilities) with three energy carriers: electricity, heating, and cooling. The sampling resolution is 15 min. Standard data preprocessing steps were applied: missing values were filled via linear interpolation, and outliers were filtered before model training. These data are used for the IES load forecasting task in this paper. All experiments are run on a computer with an Intel(R) Core(TM) i5-1035G1 quad-core CPU and an NVIDIA GeForce MX350 GPU, implemented in MATLAB R2023a. We use root mean squared error (RMSE), mean absolute error (MAE), and coefficient of determination (R2) as evaluation metrics [21] to assess overall performance.
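The three evaluation metrics follow their standard definitions; the following plain-Python sketch (the study's implementation is in MATLAB) shows the exact formulas used, with toy values for illustration.

```python
import math

def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```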

3.2. Hyperparameter Configuration

The dataset is split chronologically into training/validation/test sets with a 70%/10%/20% ratio. To ensure MSTF-SMDN performs efficiently and robustly under complex spatiotemporal data, we choose hyperparameters via experiments or empirical settings. The final configuration is summarized in Table 3.
We adopt a single-step forecasting strategy. Each load series has a 15 min interval, i.e., known cooling/heating/electric loads are used to predict the next 15 min load. Although 15 min-ahead forecasting supports real-time control and demand-response applications, we acknowledge that many practical IES operations also require longer-horizon predictions (e.g., 1 h or day-ahead) for scheduling and dispatch. The proposed architecture is compatible with multi-step extensions—either through autoregressive rollout or direct multi-output projection—and we regard multi-horizon evaluation as an important direction for future work (see Section 4).

3.3. Results and Analysis

To comprehensively validate feasibility, effectiveness, and superiority, we design three experiments:
  • Comparison with state-of-the-art load forecasting methods to evaluate modeling and forecasting performance.
  • Ablation study to remove key modules one by one and quantify each module’s contribution.
  • Sensitivity analysis of the TST segment length L to empirically justify the chosen hyperparameter value.

3.3.1. Experiment 1: Benchmark Comparison

We compare MSTF-SMDN with five representative deep learning baselines spanning both classical and state-of-the-art architectures:
  • PatchTST-BiLSTM [44] (self-constructed): Combines a patch-based Transformer with bidirectional LSTM to model global and bidirectional dependencies; hidden size 32.
  • Informer [45]: A Transformer-based model with ProbSparse self-attention to reduce complexity; 4 attention layers.
  • iTransformer [23]: An inverted Transformer that applies attention across variates rather than time steps, capturing inter-series correlations; 4 layers.
  • TimesNet [46]: Transforms 1D time series into 2D tensors via multi-period analysis and applies inception-style 2D convolutions; 3 layers.
  • Crossformer [47]: Employs cross-dimension attention with a two-stage architecture to model both temporal and variate dependencies; 3 encoder layers.
On the public dataset, the results (Table 4) report RMSE, R2, and MAE for cooling, heating, and electric loads. Detailed prediction curves (Figure A1), training loss history (Figure A2), scatter plots (Figure A3), error heatmaps (Figure A4), and error distributions (Figure A5) are provided in Appendix B.
MSTF-SMDN consistently outperforms all baselines on cooling, heating, and electric loads. Among the baselines, TimesNet achieves the best overall performance owing to its multi-period 2D convolution mechanism that effectively captures temporal patterns at multiple scales. iTransformer outperforms the standard Informer by leveraging inter-variate attention, yet it lacks the spatiotemporal topology encoding that enables our method to exploit the functional structure of energy consumption units. Crossformer’s cross-dimension attention provides moderate improvements over PatchTST-BiLSTM, but its generic variate-mixing architecture cannot capture the domain-specific spatial adjacency encoded by the STG. Our MSTF-SMDN achieves the largest margins on all three load types: the STG unifies multi-source heterogeneous data into a topological representation; the MSHC and TST jointly capture fine-grained local structures and broad global associations across scales; and the SEGM maintains high accuracy under diverse climatic conditions through its numerically stabilized gating mechanism.
To further assess the reliability of the proposed method, we visualize the prediction intervals in Figure 2. The 95% confidence intervals are constructed via Monte Carlo Dropout (MC-Dropout) [48]: during inference, dropout is kept active, and the model is evaluated M = 50 times to obtain an ensemble of predictions, from which the mean and the 2.5th/97.5th percentiles are computed. The resulting intervals tightly envelope the ground truth curves for cooling, heating, and electric loads, indicating that MSTF-SMDN not only provides accurate point forecasts but also robustly quantifies aleatoric uncertainty under fluctuating conditions. This capability is particularly valuable for risk-aware scheduling in IES [10]. We note that MC-Dropout provides a post-hoc approximation of predictive uncertainty without modifying the training objective. A fully integrated probabilistic framework—such as Bayesian neural networks, deep ensembles, or quantile regression—could yield more principled uncertainty estimates and is considered a promising direction for future research.
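The MC-Dropout interval construction can be sketched generically: run the stochastic forward pass M times with dropout kept active, then take the ensemble mean and the empirical 2.5th/97.5th order statistics. The sketch below uses a hypothetical `stochastic_predict` callable with Gaussian noise standing in for dropout variability; it is not the trained model.

```python
import random

def mc_dropout_interval(stochastic_predict, m=50, alpha=0.05):
    """Run the model m times with dropout active; return the ensemble mean and
    the empirical (alpha/2, 1 - alpha/2) interval via order statistics."""
    preds = sorted(stochastic_predict() for _ in range(m))
    mean = sum(preds) / m
    lo = preds[max(0, int((alpha / 2) * m))]          # ~2.5th percentile
    hi = preds[min(m - 1, int((1 - alpha / 2) * m))]  # ~97.5th percentile
    return mean, lo, hi

random.seed(42)
# Hypothetical stochastic forward pass: point forecast + dropout-induced noise.
forward = lambda: 100.0 + random.gauss(0.0, 2.0)
mean, lo, hi = mc_dropout_interval(forward, m=50)
print(round(mean, 2), lo < mean < hi)
```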

3.3.2. Experiment 2: Ablation Study

To quantify the contribution of each component, we evaluate six configurations by progressively removing or replacing key modules (Table 5).
The results reveal clear contributions from each module:
  • SEGM vs. LSTM (comparing MSHC+SEGM with MSHC+LSTM): Replacing LSTM with SEGM reduces cooling RMSE by 3.97% and heating RMSE by 18.08%, confirming the advantage of the exponentially gated memory with log-domain stabilization over standard LSTM gating.
  • TST contribution (comparing MSHC+TST with MSHC+LSTM): Adding the Temporal Segmentation Transformer yields consistent improvements across all load types, validating its ability to capture global temporal dependencies via segment-level attention.
  • MSHC contribution (comparing Ours with TST+SEGM): Incorporating the MultiScale Hybrid Convolver produces the largest single-module improvement, with cooling RMSE dropping from 366.01 to 290.04 (↓20.75%), demonstrating the critical role of multi-scale spatial feature extraction.
  • Adaptive fusion vs. equal-weight fusion (comparing Ours with Equal-Weight Fusion): Replacing the learned 1 × 1 convolution-based fusion with naïve equal-weight averaging degrades electric RMSE by +18.80% (681.29 → 809.34) and cooling RMSE by +31.28% (290.04 → 380.77). Heating RMSE decreases marginally by 4.56% (0.2322 → 0.2216), likely because heating load dynamics are dominated by slow thermal inertia rather than multi-scale spatial patterns, making them less sensitive to the fusion strategy. Overall, adaptive fusion yields substantial gains on two out of three load types and maintains competitive heating accuracy, confirming the value of the learned 1 × 1 convolution-based channel weighting (Equation (5)) for load types with complex multi-scale spatiotemporal coupling.
To further validate the model’s effectiveness on challenging load types, we visualize the improvement in heating load forecasting in Figure 3. Heating load is often harder to predict due to thermal inertia and complex coupling with external factors [7]. The figure demonstrates that MSTF-SMDN significantly reduces prediction residuals compared to the baseline, particularly during peak demand periods.

3.3.3. Experiment 3: TST Segment Length Sensitivity Analysis

To investigate how the segment length L of the Temporal Segmentation Transformer (TST) affects forecasting performance, we conduct a sensitivity analysis by varying $L \in \{8, 12, 16, 24, 32\}$ while keeping all other hyperparameters fixed. Table 6 reports the RMSE and R2 for each load type under different segment lengths.
The results reveal that each load type responds differently to the segment length:
  • Electric load favors shorter segments: RMSE increases monotonically from 668.44 ( L = 8 ) to 695.43 ( L = 32 ). Shorter segments produce more tokens for the Transformer encoder, enabling finer-resolution attention that better tracks rapid electrical fluctuations.
  • Heating load is largely insensitive to L, with RMSE varying within a narrow band of 0.2319–0.2401 (relative variation < 3.5%). This stability reflects the high thermal inertia of heating systems, whose slow dynamics are captured adequately regardless of segment granularity.
  • Cooling load displays a clear U-shaped pattern: the minimum RMSE of 290.04 occurs at L = 16 , with both shorter ( L = 8 : 317.99) and longer ( L = 32 : 333.56) segments degrading performance, indicating that L = 16 best matches the intra-day periodic patterns of campus cooling demand.
Because the three load types favor different segment lengths, the choice of L necessarily involves a trade-off. Setting $L = 16$ achieves the best cooling RMSE (290.04, R2 = 0.99435) and near-optimal heating accuracy (RMSE = 0.2322, only +0.13% above the best at $L = 12$), at the cost of a modest +1.92% increase in electric RMSE relative to $L = 8$. Given that cooling load exhibits the widest RMSE variation across segment lengths (ΔRMSE = 43.52, compared to 27.00 for electric and 0.0082 for heating), optimizing for cooling yields the largest absolute benefit. These results confirm that $L = 16$ provides the most favorable overall balance among the three energy carriers.

4. Conclusions

We propose a multi-scale spatiotemporal fusion and steady-state memory-driven IES load forecasting method (MSTF-SMDN) by designing four tightly coupled modules: a Spatiotemporal Topology Encoder (STG), a MultiScale Hybrid Convolver (MSHC), a Temporal Segmentation Transformer (TST), and a Steady-State Exponentially Gated Memory Unit (SEGM). The STG addresses the challenge of unified spatiotemporal embedding of multi-source heterogeneous data through fuzzy functional classification and multi-scale proximity ranking. The MSHC extracts fine-grained local structures and broad global associations at multiple scales via parallel heterogeneous kernels (3 × 3, 5 × 5, 7 × 7) with dynamic channel fusion. The TST captures long-range temporal dependencies through segment-level multi-head self-attention, while the SEGM models local steady-state dynamics via log-domain exponential gating that provably stabilizes gradient propagation.
Extensive experiments on the Arizona State University Campus Metabolism dataset demonstrate that MSTF-SMDN achieves consistent and substantial improvements over five representative baselines. Compared to the strongest baseline (TimesNet), MSTF-SMDN reduces cooling load RMSE by 16.09%, heating load RMSE by 12.97%, and electric load RMSE by 6.14%, while achieving R2 values of 0.99435, 0.98701, and 0.96722, respectively. Ablation studies confirm that each module contributes meaningfully: the MSHC provides the largest single-module improvement (↓20.75% cooling RMSE), while the SEGM outperforms standard LSTM by ↓18.08% on heating load. Furthermore, replacing the learned adaptive fusion with equal-weight averaging increases electric and cooling RMSE by 18.80% and 31.28%, respectively, validating the necessity of the 1 × 1 convolution-based adaptive channel fusion mechanism. A sensitivity analysis of the TST segment length $L \in \{8, 12, 16, 24, 32\}$ reveals that $L = 16$ achieves the best cooling load RMSE (290.04) while maintaining near-optimal performance on electric and heating loads, confirming it as the optimal multi-energy trade-off.
Despite these promising results, several limitations remain. First, the current evaluation is limited to a single campus-scale dataset located in the hot-arid climate zone of Tempe, Arizona; the claim of broad applicability should be tempered by the recognition that location-specific factors—including climate zone, building stock composition, occupancy culture, and energy pricing structures—can significantly affect load patterns and model transferability. Generalization to larger-scale, multi-district IES across diverse climatic and socioeconomic contexts requires further validation on geographically varied datasets. Second, the model currently adopts a single-step forecasting strategy with a 15 min horizon; extension to multi-step and multi-horizon forecasting (e.g., 1 h, day-ahead) warrants investigation, as many practical IES scheduling and dispatch tasks rely on longer lead times. Third, the computational cost of the full pipeline may limit deployment on resource-constrained edge devices. Fourth, the current uncertainty quantification relies on post-hoc MC-Dropout rather than a probabilistic framework integrated into training.
Future work will pursue the following directions: (1) validating the framework on multiple geographically and climatically diverse datasets to establish robustness across different IES configurations and weather regimes; (2) incorporating federated or privacy-preserving learning frameworks to enable multi-site model training under data-security constraints; (3) enhancing responsiveness to sudden load fluctuations and extreme events through online adaptation mechanisms; (4) optimizing computational efficiency via model compression and knowledge distillation to support real-time deployment; (5) extending the framework to multi-horizon forecasting with integrated probabilistic training (e.g., deep ensembles, Bayesian layers) for principled uncertainty quantification in risk-aware energy dispatch; (6) broadening the scope of the integrated energy system to include freshwater production under the water–energy nexus, thereby capturing desalination and water-treatment demands that influence overall system sizing and optimization.

Author Contributions

Conceptualization, Y.L. and L.B.; methodology, Y.L.; software, Y.L.; validation, Y.L., X.S. and J.T.; formal analysis, Y.L.; investigation, Y.L. and L.B.; resources, L.B.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, L.B. and X.S.; visualization, Y.L. and J.T.; supervision, L.B.; project administration, L.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study were obtained from the Arizona State University Campus Metabolism platform (https://cm.asu.edu). Note: this platform has since been decommissioned; the dataset was downloaded prior to its closure during the study period (January 2019–March 2020). The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
IES  Integrated Energy System
MSTF-SMDN  Multi-Scale Spatiotemporal Fusion and Steady-State Memory-Driven Network
STG  Spatiotemporal Topology Encoder
MSHC  Multi-Scale Hybrid Convolver
TST  Temporal Segmentation Transformer
SEGM  Steady-State Exponentially Gated Memory Unit
FCM  Fuzzy C-Means
CNN  Convolutional Neural Network
LSTM  Long Short-Term Memory
GRU  Gated Recurrent Unit
GNN  Graph Neural Network
STLF  Short-Term Load Forecasting
LSSVM  Least Squares Support Vector Machine
RMSE  Root Mean Squared Error
MAE  Mean Absolute Error
ML  Machine Learning
MC-Dropout  Monte Carlo Dropout

Appendix A. Spatiotemporal Topology Encoder Derivation Formulas

Appendix A.1. Load Feature Vector (ϕi)

$$\phi_i = \left[ \bar{L}_i^{(1)}, \sigma_i^{(1)}, \bar{L}_i^{(2)}, \sigma_i^{(2)}, \ldots, \bar{L}_i^{(K)}, \sigma_i^{(K)} \right] \in \mathbb{R}^{2K}$$

Appendix A.2. Fuzzy Membership Degree

$$u_{ik} = \frac{1}{\sum_{j=1}^{K} \left( \dfrac{\| \phi_i - v_k \|}{\| \phi_i - v_j \|} \right)^{\frac{2}{m-1}}}$$

Appendix A.3. Cluster Center Update

$$v_k = \frac{\sum_{i=1}^{D} u_{ik}^m \phi_i}{\sum_{i=1}^{D} u_{ik}^m}$$

Appendix A.4. Convergence Condition

$$\max_k \left\| v_k^{(\mathrm{new})} - v_k^{(\mathrm{old})} \right\| < \varepsilon$$
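Appendices A.2–A.4 together form one standard Fuzzy C-Means iteration: update memberships, update centers, and stop when the largest center shift falls below the tolerance. The following compact plain-Python sketch runs the iteration on toy 1-D feature vectors (the actual feature vectors are 2K-dimensional).

```python
def fcm(points, centers, m=2.0, eps=1e-6, max_iter=100):
    """Fuzzy C-Means: alternate membership and center updates until the
    largest center shift falls below eps."""
    K = len(centers)
    for _ in range(max_iter):
        # Membership: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = []
        for p in points:
            d = [abs(p - v) + 1e-12 for v in centers]   # guard against d = 0
            U.append([1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(K)) for k in range(K)])
        # Center update: v_k = sum_i u_ik^m * phi_i / sum_i u_ik^m
        new = [sum(U[i][k] ** m * points[i] for i in range(len(points))) /
               sum(U[i][k] ** m for i in range(len(points))) for k in range(K)]
        shift = max(abs(a - b) for a, b in zip(new, centers))
        centers = new
        if shift < eps:                                  # convergence condition
            break
    return centers, U

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
centers, U = fcm(pts, centers=[0.5, 4.5])
print([round(c, 2) for c in centers])
```

On this toy data the centers converge near 0.1 and 5.1, and each point's memberships sum to one by construction.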

Appendix A.5. Dominant Category

$$c_i^* = \arg\max_k u_{ik}$$

Appendix A.6. Category Set

$$S_c = \{ i \mid c_i^* = c \}$$

Appendix A.7. Layered Distance

$$d_i^{(l)} = \left\| p_i - g_c^{(l)} \right\|, \quad l = 1, \ldots, M$$
where $p_i$ denotes the geospatial coordinates of unit $i$, $g_c^{(l)}$ is the $l$-th level centroid of category $c$, and $M$ is the number of spatial resolution levels.

Appendix A.8. Composite Distance

$$d_i^{\mathrm{comp}} = \sum_{l=1}^{M} w_l d_i^{(l)}$$

Appendix A.9. Ordered Set

$$S_c^{\mathrm{ord}} = \mathrm{sort}\!\left( S_c, d_i^{\mathrm{comp}}, \text{ascending} \right)$$

Appendix A.10. Ranking Index

$$r_i = \mathrm{rank}\!\left( i \in S_c^{\mathrm{ord}} \right), \quad i \in S_c$$

Appendix A.11. Filling Start Position

$$\mathrm{start}_k = \sum_{j=1}^{k-1} | S_j |$$

Appendix A.12. Grid Coordinate Mapping

$$(\mathrm{row}_i, \mathrm{col}_i) = \left( \left\lfloor \frac{\mathrm{start}_k + r_i - 1}{N} \right\rfloor + 1,\ \left( \mathrm{start}_k + r_i - 1 \right) \bmod N + 1 \right)$$
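The category-wise filling of the N × N grid (Appendices A.11 and A.12) can be sketched directly: each category occupies a contiguous run of cells in row-major order, and a unit's within-category rank fixes its cell. A minimal plain-Python sketch with toy category sizes:

```python
def grid_coordinates(category_sizes, N):
    """Map each unit's (category k, within-category rank r) to 1-indexed
    (row, col) on an N x N grid, filling categories contiguously in
    row-major order."""
    coords = {}
    start = 0                                   # start_k = sum of earlier |S_j|
    for k, size in enumerate(category_sizes, start=1):
        for r in range(1, size + 1):            # rank r_i within category k
            idx = start + r - 1
            coords[(k, r)] = (idx // N + 1, idx % N + 1)
        start += size
    return coords

coords = grid_coordinates(category_sizes=[3, 4], N=3)
print(coords[(1, 1)], coords[(2, 1)], coords[(2, 4)])
```

With two categories of sizes 3 and 4 on a 3 × 3 grid, category 1 fills the first row and category 2 continues from the second row onward.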

Appendix B. Experimental Figures

Figure A1. The multi-load forecasting results of the proposed algorithm.
Figure A1. The multi-load forecasting results of the proposed algorithm.
Information 17 00309 g0a1
Figure A2. The loss curve of the training process of the proposed algorithm.
Figure A2. The loss curve of the training process of the proposed algorithm.
Information 17 00309 g0a2
Figure A3. The scatter plot of the prediction results of the proposed algorithm.
Figure A3. The scatter plot of the prediction results of the proposed algorithm.
Information 17 00309 g0a3
Figure A4. Prediction error heatmap (time vs. load type).
Figure A4. Prediction error heatmap (time vs. load type).
Information 17 00309 g0a4
Figure A5. The histogram of prediction errors for the proposed algorithm.
Figure A5. The histogram of prediction errors for the proposed algorithm.
Information 17 00309 g0a5

References

  1. IEA. World Energy Outlook 2023; International Energy Agency: Paris, France, 2023.
  2. Impram, S.; Nese, S.V.; Oral, B. Challenges of renewable energy penetration on power system flexibility: A survey. Energy Strategy Rev. 2020, 31, 100539. [Google Scholar] [CrossRef]
  3. Mancarella, P. MES (multi-energy systems): An overview of concepts and evaluation models. Energy 2014, 65, 1–17. [Google Scholar] [CrossRef]
  4. Kumar, P.; Date, A.; Shabani, B. Techno-economic analysis of an integrated desalination-renewable-hydrogen system for zero-emission freshwater and electricity production. Energy Convers. Manag. 2026, 353, 121231. [Google Scholar] [CrossRef]
  5. Traisak, O.; Kumar, P.; Das, R.K. Integrated Thermoelectric Power Generation and Membrane-Based Water Desalination Using Low-Grade Thermal Energy. Energies 2026, 19, 1054. [Google Scholar] [CrossRef]
  6. Kroposki, B.; Johnson, B.; Zhang, Y.; Gevorgian, V.; Denholm, P.; Hodge, B.-M.; Hannegan, B. Achieving a 100% renewable grid: Operating electric power systems with extremely high levels of variable renewable energy. IEEE Power Energy Mag. 2017, 15, 61–73. [Google Scholar] [CrossRef]
  7. Gaur, A.S.; Fitiwi, D.Z.; Curtis, J. Heat pumps and our low-carbon future: A comprehensive review. Energy Res. Soc. Sci. 2021, 71, 101764. [Google Scholar] [CrossRef]
  8. Gang, W.; Wang, S.; Xiao, F.; Gao, D.-C. District cooling systems: Technology integration, system optimization, challenges and opportunities for applications. Renew. Sustain. Energy Rev. 2016, 53, 253–264. [Google Scholar] [CrossRef]
  9. Good, N.; Zhang, L.; Navarro-Espinosa, A.; Mancarella, P. High resolution modelling of multi-energy domestic demand profiles. Appl. Energy 2015, 137, 193–210. [Google Scholar] [CrossRef]
  10. Wang, H.; Gao, F.; Chen, Q.; Bu, S.; Lei, C. Instability pattern-guided model updating method for data-driven transient stability assessment. IEEE Trans. Power Syst. 2025, 40, 1214–1227. [Google Scholar] [CrossRef]
11. Hong, T.; Fan, S. Probabilistic electric load forecasting: A tutorial review. Int. J. Forecast. 2016, 32, 914–938.
12. Zhou, K.; Fu, C.; Yang, S. Big data driven smart energy management: From big data to big insights. Renew. Sustain. Energy Rev. 2016, 56, 215–225.
13. Hang, J.; Qiu, G.; Hao, M.; Ding, S. Improved fault diagnosis method for permanent magnet synchronous machine system based on lightweight multisource information data layer fusion. IEEE Trans. Power Electron. 2024, 39, 13808–13817.
14. Du, M.; Liu, X.; Li, Z.; Lin, H. Robust mitigation strategy against dummy data attacks in power systems. IEEE Trans. Smart Grid 2023, 14, 3102–3113.
15. Denholm, P.; Mai, T.; Kenyon, R.W.; Kroposki, B.; O’Malley, M. Inertia and the Power Grid: A Guide Without the Spin; Technical Report NREL/TP-6A20-73856; National Renewable Energy Laboratory: Golden, CO, USA, 2020.
16. Wu, C.; Wang, R.; Lu, S.; Tian, J.; Yin, L.; Wang, L.; Zheng, W. Time-series data-driven PM2.5 forecasting: From theoretical framework to empirical analysis. Atmosphere 2025, 16, 292.
17. Idowu, S.; Saguna, S.; Åhlund, C.; Schelén, O. Applied machine learning: Forecasting heat load in district heating system. Energy Build. 2016, 133, 478–488.
18. Rodrigues, F.; Trindade, A. Load forecasting through functional clustering and ensemble learning. Knowl. Inf. Syst. 2018, 57, 229–244.
19. Tan, Z.; De, G.; Li, M.; Lin, H.; Yang, S.; Huang, L.; Tan, Q. Combined electricity-heat-cooling-gas load forecasting model for integrated energy system based on multi-task learning and least square support vector machine. J. Clean. Prod. 2020, 248, 119252.
20. Alsharekh, M.F.; Habib, S.; Dewi, D.A.; Albattah, W.; Islam, M.; Albahli, S. Improving the efficiency of multistep short-term electricity load forecasting via R-CNN with ML-LSTM. Sensors 2022, 22, 6913.
21. Deb, C.; Zhang, F.; Yang, J.; Lee, S.E.; Shah, K.W. A review on time series forecasting techniques for building energy consumption. Renew. Sustain. Energy Rev. 2017, 74, 902–924.
22. Smyl, S. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. Int. J. Forecast. 2020, 36, 75–85.
23. Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted transformers are effective for time series forecasting. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), Vienna, Austria, 7–11 May 2024.
24. Chen, D.; Zhao, H.; Lin, H.; Han, Y.; Hu, X.; Yuan, K.; Geng, Z. Load prediction of integrated energy systems for energy saving and carbon emission based on novel multi-scale fusion convolutional neural network. Energy 2024, 290, 130181.
25. Zhao, P.; Hu, W.; Cao, D.; Zhang, Z.; Liao, W.; Chen, Z.; Huang, Q. Enhancing multivariate, multi-step residential load forecasting with spatiotemporal graph attention-enabled transformer. Int. J. Electr. Power Energy Syst. 2024, 160, 110074.
26. Banerjee, S.; Dong, M.; Shi, W. Spatial–temporal synchronous graph transformer network (STSGT) for COVID-19 forecasting. Smart Health 2022, 26, 100348.
27. Song, X.; Chen, Z.; Wang, J.; Zhang, Y.; Sun, X. A multi-stage LSTM federated forecasting method for multi-loads under multi-time scales. Expert Syst. Appl. 2024, 253, 124303.
28. Song, J.; Yang, K.; Cai, Z.; Yang, P.; Bao, H.; Xu, K.; Meng, X.B. Multi-energy load forecasting via hierarchical multi-task learning and spatiotemporal attention. Appl. Energy 2024, 373, 123788.
29. Chen, J.; Ye, H.; Ying, Z.; Sun, Y.; Xu, W. Dynamic trend fusion module for traffic flow prediction. Appl. Soft Comput. 2025, 174, 112979.
30. Wang, H.; Li, Y.; Men, T.; Li, L. Physically interpretable wavelet-guided networks with dynamic frequency decomposition for machine intelligence fault prediction. IEEE Trans. Syst. Man Cybern. Syst. 2024, 54, 4863–4875.
31. Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. arXiv 2024, arXiv:2405.04517.
32. Chen, R.T.Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; pp. 6572–6583.
33. Miller, C.; Meggers, F. The Building Data Genome Project: An open, public dataset from non-residential building electrical meters. Energy Procedia 2017, 122, 439–444.
34. Wang, S.; Zhang, Z. Short-term multiple load forecasting model of regional integrated energy system based on QWGRU-MTL. Energies 2021, 14, 6555.
35. Jiang, L.; Wang, X.; Li, W.; Wang, L.; Yin, X.; Jia, L. Hybrid multitask multi-information fusion deep learning for household short-term load forecasting. IEEE Trans. Smart Grid 2021, 12, 5362–5372.
36. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press: New York, NY, USA, 1981.
37. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
38. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 1974, 3, 1–27.
39. Li, J.; Deng, D.; Zhao, J.; Cai, D.; Hu, W.; Zhang, M.; Huang, Q. A novel hybrid short-term load forecasting method of smart grid using MLR and LSTM neural network. IEEE Trans. Ind. Inf. 2021, 17, 2443–2452.
40. Tobler, W.R. A computer movie simulating urban growth in the Detroit region. Econ. Geogr. 1970, 46, 234–240.
41. Xing, X.; Gong, D.; Wang, Y.; Sun, X.; Zhang, Y. Acceptable cost-driven multivariate load forecasting for integrated coal mine energy systems. Appl. Energy 2025, 397, 126341.
42. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Chang, X.; Zhang, C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, 6–10 July 2020; pp. 753–763.
43. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
44. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A time series is worth 64 words: Long-term forecasting with transformers. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
45. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Palo Alto, CA, USA, 2021; Volume 35, pp. 11106–11115.
46. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. TimesNet: Temporal 2D-variation modeling for general time series analysis. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
47. Zhang, Y.; Yan, J. Crossformer: Transformer utilizing cross-dimension dependency for multivariate time series forecasting. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
48. Gal, Y.; Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1050–1059.
Figure 1. Overall structure of the proposed MSTF-SMDN algorithm.
Figure 2. Prediction intervals of the proposed MSTF-SMDN for multi-energy loads. The shaded areas represent the 95% confidence intervals, demonstrating the model’s ability to capture uncertainty and bound the true load values effectively.
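Intervals like those in Figure 2 can be obtained by repeated stochastic forward passes through a network with dropout kept active at inference time (Monte Carlo dropout, [48]). A minimal numpy sketch, with a toy stochastic predictor standing in for the trained MSTF-SMDN model (the noise model and sample count here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_predict(x, n_samples=200):
    """Toy stand-in for a network evaluated with dropout active:
    each stochastic pass returns a slightly different forecast."""
    return x + rng.normal(0.0, 0.1, size=(n_samples,) + x.shape)

x = np.linspace(0.0, 1.0, 24)          # one day of hourly "load" (toy data)
samples = mc_predict(x)                # (n_samples, 24) stochastic forecasts

point = samples.mean(axis=0)                          # point forecast
lo, hi = np.percentile(samples, [2.5, 97.5], axis=0)  # empirical 95% band

# fraction of true values bounded by the interval (should be near 1 here)
coverage = np.mean((x >= lo) & (x <= hi))
```

The 2.5th/97.5th empirical percentiles across passes give the shaded 95% band; coverage is then checked against the held-out truth.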
Figure 3. Specific improvement in heating load forecasting accuracy. The comparison highlights the reduction in residual errors achieved by the proposed MSTF-SMDN model.
Table 1. K-value sensitivity analysis for FCM clustering. Bold values indicate the selected configuration ( K = 4 ).
K | Silhouette | CH Index | Category Sizes
2 | 0.5237 | 50.39 | 22/99
3 | 0.3562 | 48.24 | 9/61/51
4 | 0.3988 | 45.32 | 23/6/39/53
5 | 0.3665 | 37.58 | 33/29/18/5/36
6 | 0.3662 | 33.25 | 35/32/3/28/4/19
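The Calinski–Harabasz index in Table 1 [38] is the ratio of between- to within-cluster dispersion, scaled by degrees of freedom; higher values favor tighter, better-separated clusters. A minimal numpy sketch on a toy two-blob example (for the paper's FCM memberships, labels would first be hardened by argmax):

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH index: (between-cluster SS / (k-1)) / (within-cluster SS / (n-k))."""
    X = np.asarray(X, dtype=float)
    n, k = len(X), len(np.unique(labels))
    overall = X.mean(axis=0)
    between = within = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        centroid = Xc.mean(axis=0)
        between += len(Xc) * np.sum((centroid - overall) ** 2)
        within += np.sum((Xc - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

# two well-separated 2-D blobs: the correct 2-way split scores a large CH
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
ch = calinski_harabasz(X, labels)
```

Scanning this (together with the silhouette score [37]) over candidate K values reproduces the kind of sensitivity sweep shown in Table 1.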
Table 2. Computational complexity comparison of spatial feature extractors.
Method | FLOPs | Parameters | Adj. Learning | Multi-Scale
MSHC (Ours) | 56.29 M | 233,088 | Not required | Native (3 × 3, 5 × 5, 7 × 7)
GCN | 6.34 M (6.37 M w/adj) | 26,624 (39,849 w/adj) | +N² = 13,225 params | Requires stacking
GAT (4 heads) | 6.64 M (6.67 M w/adj) | 27,264 (40,489 w/adj) | +N² = 13,225 params | Requires stacking
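Parameter counts like those in Table 2 follow from standard 2-D convolution bookkeeping. A hedged sketch with the paper's kernel sizes and output-channel widths from Table 3; the sequential single-channel wiring here is an illustrative assumption, not the exact MSHC topology, so the total will not match the 233,088 figure:

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """A k x k conv layer holds c_out * c_in * k * k weights plus
    one bias per output channel."""
    return c_out * c_in * k * k + (c_out if bias else 0)

# illustrative stack: kernels 3x3 / 5x5 / 7x7 with 32 / 64 / 128 output
# channels (channel wiring assumed; not the paper's exact MSHC layout)
layers = [(1, 32, 3), (32, 64, 5), (64, 128, 7)]
total = sum(conv2d_params(ci, co, k) for ci, co, k in layers)

# a graph model that learns a dense adjacency over N = 115 nodes pays
# an extra N^2 parameters, matching Table 2's "+13,225 params" overhead
adj_extra = 115 ** 2
```

The same arithmetic explains why GCN/GAT variants with a learnable adjacency gain exactly N² extra parameters (39,849 − 26,624 = 13,225 for GCN).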
Table 3. Model parameter settings.
Parameter | Value
Learning rate | 0.002
Learning-rate scheduler (decay factor) | 0.5
Batch size | 256
Epochs | 5000
Dropout rate | 0.5
L2 regularization | 1 × 10⁻⁴
Convolution kernel sizes | (3, 3), (5, 5), (7, 7)
Output channels per conv layer | 32, 64, 128
TST segment length (L) | 16
TST stride (F) | 8
TST embedding dimension (d) | 64
TST attention heads (h) | 4
TST encoder layers | 2
SEGM hidden units (d_h) | 64
SEGM layers | 3
Early stopping | 15
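The early-stopping value 15 and decay factor 0.5 from Table 3 combine into standard plateau-driven training control. A framework-agnostic sketch; the paper specifies the patience and decay factor, while the decay-every-5-stalled-epochs trigger and the toy loss curve are assumptions for illustration:

```python
class EarlyStopper:
    """Stop when validation loss fails to improve for `patience` epochs."""
    def __init__(self, patience=15):
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience   # True -> stop training

lr = 0.002                                 # initial learning rate (Table 3)
stopper = EarlyStopper(patience=15)

# toy validation-loss curve: improves for 5 epochs, then plateaus
losses = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.46] * 20
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        break                              # patience exhausted
    if stopper.bad > 0 and stopper.bad % 5 == 0:
        lr *= 0.5                          # decay factor 0.5 (Table 3)
```

On this curve training halts at epoch 19 (5 good epochs + 15 stalled), after two decays leave the learning rate at 0.0005.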
Table 4. Comprehensive comparison of experimental results. ↓ indicates decrease (improvement for RMSE/MAE); ↑ indicates increase (improvement for R2). Bold values indicate the best results.
Method | Cool. RMSE | Cool. R2 | Cool. MAE | Heat. RMSE | Heat. R2 | Heat. MAE | Elec. RMSE | Elec. R2 | Elec. MAE
PatchTST-BiLSTM | 378.9404 | 0.97493 | 261.5521 | 0.32549 | 0.88567 | 0.25896 | 746.0939 | 0.8243 | 559.6159
Informer | 376.4303 | 0.97538 | 259.2949 | 0.28585 | 0.91373 | 0.21934 | 764.5695 | 0.81549 | 590.0808
iTransformer | 358.2147 | 0.97772 | 249.8213 | 0.27143 | 0.92054 | 0.20876 | 738.5124 | 0.82793 | 558.3472
TimesNet | 345.6834 | 0.97923 | 241.3167 | 0.26682 | 0.92318 | 0.20452 | 725.8315 | 0.83381 | 545.2196
Crossformer | 365.1276 | 0.97685 | 254.5938 | 0.27824 | 0.91654 | 0.21482 | 748.2937 | 0.82319 | 565.7213
Ours | 290.0399 | 0.99435 | 211.8975 | 0.23220 | 0.98701 | 0.17330 | 681.2942 | 0.96722 | 518.6698
Ours vs. Best | ↓16.09% | ↑1.54% | ↓12.19% | ↓12.97% | ↑6.91% | ↓15.27% | ↓6.14% | ↑16.00% | ↓4.87%
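The "Ours vs. Best" row in Table 4 reports relative changes against the strongest baseline per metric. A minimal sketch of the three metrics and the percentage-reduction calculation, reproducing the headline cooling-load figure from the table's own numbers:

```python
import math

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def r2(y, yhat):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def pct_reduction(ours, best_baseline):
    """Relative decrease vs. the best baseline, as a percentage."""
    return (1.0 - ours / best_baseline) * 100.0

# cooling-load RMSE: 290.0399 (ours) vs 345.6834 (TimesNet, best baseline)
cool_gain = pct_reduction(290.0399, 345.6834)   # ~16.1% reduction
```

The same helper applied to the MAE and (inverted for R2) columns yields the remaining entries of the comparison row.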
Table 5. Results of the ablation experiment. Bold values indicate the best performance (full model).
Configuration | Cool. RMSE | Cool. R2 | Cool. MAE | Heat. RMSE | Heat. R2 | Heat. MAE | Elec. RMSE | Elec. R2 | Elec. MAE
MSHC + LSTM | 392.0117 | 0.96521 | 258.3678 | 0.32516 | 0.885 | 0.2581 | 764.5717 | 0.8155 | 590.082
MSHC + TST | 378.9455 | 0.97495 | 251.5556 | 0.28829 | 0.9104 | 0.22074 | 746.0924 | 0.8243 | 559.626
MSHC + SEGM | 376.4312 | 0.97545 | 249.2912 | 0.26636 | 0.9137 | 0.21932 | 723.4664 | 0.8409 | 586.185
TST + SEGM (w/o MSHC) | 366.0124 | 0.97641 | 248.3778 | 0.27875 | 0.9184 | 0.21885 | 723.4678 | 0.8409 | 586.181
Equal-Weight Fusion | 380.7652 | 0.99026 | 287.7996 | 0.2216 | 0.98817 | 0.1676 | 809.3435 | 0.95373 | 617.7052
Ours (Full) | 290.0399 | 0.99435 | 211.8975 | 0.23220 | 0.98701 | 0.17330 | 681.2942 | 0.96722 | 518.6698
Table 6. Sensitivity analysis of TST segment length L. Bold values indicate the selected configuration ( L = 16 ).
L | Elec. RMSE | Elec. R2 | Heat. RMSE | Heat. R2 | Cool. RMSE | Cool. R2
8 | 668.4359 | 0.96844 | 0.2343 | 0.98678 | 317.9857 | 0.99320
12 | 678.7291 | 0.96746 | 0.2319 | 0.98705 | 305.4728 | 0.99373
16 | 681.2942 | 0.96722 | 0.2322 | 0.98701 | 290.0399 | 0.99435
24 | 685.8497 | 0.96678 | 0.2401 | 0.98611 | 315.1699 | 0.99332
32 | 695.4297 | 0.96584 | 0.2325 | 0.98698 | 333.5608 | 0.99252
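The segment length L and stride F swept in Table 6 determine how many overlapping segments the Temporal Segmentation Transformer attends over: for a series of length T, that count is floor((T − L) / F) + 1. A minimal numpy sketch of the segmentation step (tail handling is simplified; the paper's exact padding scheme is not specified here):

```python
import numpy as np

def segment(series, L=16, F=8):
    """Split a 1-D series into overlapping segments of length L with
    stride F; any leftover tail shorter than L is dropped for simplicity."""
    series = np.asarray(series)
    n = (len(series) - L) // F + 1          # number of full segments
    return np.stack([series[i * F : i * F + L] for i in range(n)])

x = np.arange(96)                  # e.g. 96 time steps of load history
patches = segment(x, L=16, F=8)    # (96 - 16) / 8 + 1 = 11 segments
```

With the selected L = 16 and F = 8, consecutive segments overlap by half their length, which is what lets the encoder see both fine and coarse temporal context.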

Share and Cite

MDPI and ACS Style

Liang, Y.; Bao, L.; Sun, X.; Tang, J. Multi-Scale Spatiotemporal Fusion and Steady-State Memory-Driven Load Forecasting for Integrated Energy Systems. Information 2026, 17, 309. https://doi.org/10.3390/info17030309
