A Hybrid Temporal–Spatial Framework Incorporating Prior Knowledge for Predicting Sparse and Intermittent Item Demand

Sun, Yufang; Guo, Bing; Wu, Chase; Lyu, Rui; Kang, Hongjuan; Zhao, Mingjie; Chen, Xin; Ye, Kui

doi:10.3390/app16031381


by Yufang Sun ¹,², Bing Guo ¹,*, Chase Wu ³, Rui Lyu ², Hongjuan Kang ¹, Mingjie Zhao ¹, Xin Chen ¹ and Kui Ye ¹

¹ College of Computer Science, Sichuan University, Chengdu 610065, China
² School of Big Data and Artificial Intelligence, Chengdu Technological University, Chengdu 611730, China
³ Department of Data Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(3), 1381; https://doi.org/10.3390/app16031381

Submission received: 31 December 2025 / Revised: 26 January 2026 / Accepted: 28 January 2026 / Published: 29 January 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Accurately forecasting demand for intermittent items is essential for effective inventory control, improved service levels, and cost reduction. This study focuses on highly sparse, irregular, and volatile demand patterns and proposes a generalizable multi-source data-driven framework for intermittent demand forecasting, using automotive spare parts as a representative application scenario. The proposed framework integrates Transformer networks, multi-graph convolutional networks (GCNs), and a Mamba-based feature fusion module. The Transformer captures long-term temporal dependencies in historical demand sequences, while the multi-graph GCN incorporates prior knowledge (including traffic geography, socioeconomic indicators, and environmental attributes) to model spatial correlations across multiple supply nodes. The Mamba-based fusion module then integrates temporal and spatial features into a unified representation, enhancing predictive accuracy and robustness. Extensive experiments on real-world datasets of automotive spare parts in China show that the proposed framework exhibits competitive and often superior performance compared with TiDE, FSNet, Informer, and DLinear across multiple forecasting horizons (3-, 6-, and 9-step), as measured by RMSE, MAE, and $R^{2}$. The proposed approach provides a practical and adaptable solution for forecasting intermittent demand, offering valuable support for dynamic inventory management.

Keywords: demand forecasting; temporal-spatial modeling; prior knowledge; Mamba; feature fusion

1. Introduction

Accurate demand forecasting plays a critical role in supporting inventory control, production planning, and service-level assurance. This challenge becomes particularly pronounced for sparse and intermittent demand items, whose demand occurrences are irregular, infrequent, and highly volatile over time. Automotive spare parts constitute a representative and practically important example of such items. Their demand patterns are jointly influenced by heterogeneous vehicle usage behaviors, regional socioeconomic characteristics, and external environmental conditions, resulting in long periods of zero demand interspersed with sudden demand spikes. These characteristics substantially complicate demand prediction and inventory decision-making processes.

In practice, inaccurate forecasting of intermittent demand can lead to severe operational inefficiencies. Excessive inventory levels increase holding and obsolescence costs, while underestimation of demand may cause stockouts, maintenance delays, and service disruptions, ultimately undermining customer satisfaction and market resilience [1,2,3,4]. Consequently, developing robust forecasting methods capable of handling sparsity, intermittency, and high uncertainty is of significant practical importance—not only for the automotive industry, but also for other sectors such as aerospace, healthcare equipment, and industrial maintenance services that face similar demand characteristics.

Existing studies on spare parts demand forecasting can be broadly categorized into qualitative and quantitative approaches. Qualitative methods rely primarily on expert judgment, managerial experience, and market surveys to infer future demand trends [5]. While useful in data-scarce contexts, such approaches are inherently subjective and difficult to scale. Quantitative methods, by contrast, attempt to infer future demand directly from historical observations. Representative techniques include moving average (MA) [6], weighted moving average (WMA) [7], exponential smoothing (ES) and single exponential smoothing (SES) [8], as well as classical statistical models such as ARIMA [9]. More recently, deep learning models—particularly recurrent neural networks and LSTM-based architectures—have been introduced to capture nonlinear temporal dependencies [10,11]. Probability-based methods assuming specific parametric demand distributions have also been widely adopted for intermittent demand modeling [12].

Despite substantial progress in spatiotemporal forecasting, several fundamental limitations remain insufficiently addressed when the target problem involves sparse and intermittent demand patterns. First, although many existing models employ powerful temporal encoders, they are primarily designed for dense and continuous sequences and therefore struggle to extract meaningful signals from extremely sparse, fragmented, and low-frequency demand series. This often results in unstable predictions and poor generalization for intermittent items, which are common in spare parts and supply chain demand scenarios [13]. Second, most prior studies either treat demand series as independent or rely on implicit spatial proximity, failing to explicitly model cross-node interactions induced by shared geographic constraints, socioeconomic contexts, and environmental conditions. Such oversimplification is particularly problematic for intermittent demand, where individual time series provide limited information and spatially correlated nodes can serve as critical sources of complementary signals [14]. Third, while heterogeneous external data are increasingly available, prior knowledge embedded in built environment, socioeconomic, and natural environmental attributes is rarely incorporated into forecasting models in a structured and principled manner. As a result, existing spatiotemporal pipelines often lack robustness, interpretability, and transferability when applied to real-world sparse demand systems characterized by high uncertainty and data scarcity.

Accordingly, the central research question of this study is: how can sparse and intermittent demand be forecast accurately and robustly by effectively exploiting limited temporal observations, cross-node correlations, and heterogeneous prior knowledge within a unified modeling framework?

To explicitly tackle these limitations, this study proposes a hybrid temporal–spatial forecasting framework tailored to sparse and intermittent demand prediction through the systematic integration of prior knowledge. Specifically, a Transformer-based temporal encoder is employed to capture long-range dependencies and nonlinear dynamics while mitigating instability caused by sparse observations. To compensate for information scarcity at the individual series level, multiple feature graphs are constructed to encode prior knowledge related to geographic proximity, socioeconomic attributes, and natural environmental conditions across supply nodes, and a multi-graph GCN is used to explicitly model cross-node correlations. Finally, a Mamba-based feature fusion module integrates temporal and spatial representations into a unified predictive space, enabling robust and accurate forecasting under intermittent demand conditions.

The objective of this study is to develop a generalizable and data-driven forecasting framework that effectively models sparse and intermittent demand patterns, explicitly exploits prior spatial knowledge derived from heterogeneous external data, and improves predictive performance and stability compared with existing time-series-based approaches. To achieve this objective, the proposed framework integrates Transformer networks and graph convolutional networks (GCNs) to jointly capture temporal dynamics and spatial dependencies inherent in intermittent spare parts demand.

Specifically, the Transformer component leverages self-attention mechanisms to capture long-term dependencies and nonlinear dynamics in fragmented and irregular demand time series, enabling the extraction of informative temporal patterns from sparse historical observations. To complement temporal modeling, a multi-graph GCN is constructed to systematically incorporate prior knowledge derived from built environment, socioeconomic, and natural environmental attributes, which are designed according to the characteristics of the forecasting target and the application context. By explicitly modeling relational structures among supply nodes with similar contextual attributes or comparable demand behaviors, the proposed framework captures spatially correlated demand patterns and enriches demand representations beyond purely temporal information. This design allows the framework to flexibly accommodate different forms of prior knowledge while maintaining robustness under sparse and intermittent demand conditions.

Furthermore, a Mamba-based feature fusion module is designed to integrate the temporal representations learned by the Transformer with the spatial features extracted by the GCN. This fusion mechanism unifies multimodal information into a coherent representation, allowing the model to balance temporal dynamics and spatial context adaptively, and ultimately enhancing predictive accuracy and robustness in intermittent demand forecasting tasks.

The remainder of this paper is organized as follows. Section 2 reviews related studies. Section 3 presents the modeling approach. Section 4 analyzes and discusses the results. Section 5 concludes the paper and suggests directions for future research.

2. Related Works

Demand forecasting methods for intermittent items can generally be categorized into parametric and non-parametric approaches. Parametric approaches typically construct probabilistic distribution models and estimate their parameters based on historical or simulated demand data. For instance, Sodemann et al. [15] examined five fundamental demand distribution models, including Poisson, negative binomial, hurdle Poisson, gamma, and normal distributions, and demonstrated that the negative binomial model provided the best predictive performance. Nevertheless, when demand patterns become highly volatile and irregular, actual observations often deviate from explicit distributional assumptions, limiting the effectiveness of purely parametric approaches.

Non-parametric approaches predominantly employ machine learning and statistical algorithms to automatically identify patterns from data, and intermittent demand forecasting is often formulated as a time series prediction problem. Classical non-parametric methods include moving average (MA) [6], weighted moving average (WMA) [7], single exponential smoothing (SES) [16], Croston’s method [17], the Syntetos–Boylan approximation (SBA) [18], and various improved SBA models [19]. These approaches are generally effective when data availability is limited, but simple time series techniques remain insufficient for capturing the irregularity, sparsity, and volatility inherent in intermittent demand, leading to restricted forecasting accuracy. To overcome these challenges, deep learning techniques have been increasingly explored. For instance, LSTM networks have been applied to capture nonlinear temporal dependencies in intermittent demand data [20]. Nonetheless, single deep learning architectures still exhibit suboptimal performance when demand patterns lack strong periodicity or exhibit high variability. As a result, hybrid models that integrate the strengths of different methods have emerged as a promising research direction. Chandriah et al. [10] proposed an enhanced RNN/LSTM model with an improved Adam optimizer, demonstrating superior forecasting performance. Similarly, Fahrudin et al. [8] developed a hybrid model for intermittent demand by combining ARMA, single exponential smoothing (SES), and multilayer perceptron (MLP), achieving improved results. In another study, Hua et al. [21] introduced a hybrid approach that integrates support vector machines (SVMs) with logistic regression, effectively handling demand with discrete and irregular structures.

Beyond methodological distinctions between parametric and non-parametric approaches, a growing body of literature has emphasized that intermittent item demand is driven not only by historical sales patterns but also by heterogeneous external factors. Most existing forecasting methods, however, rely predominantly on internal demand histories for feature extraction, which limits their ability to capture the multifaceted drivers underlying sparse and irregular demand dynamics.

Empirical studies in the automotive domain provide representative evidence of this limitation. For example, Liu et al. [1] incorporated weather-related variables, including temperature, visibility, and road slipperiness, into forecasting models based on extreme learning machines (ELMs) and support vector machines (SVMs), demonstrating that environmental conditions significantly affect spare parts consumption. Similarly, Bocker et al. [22] reported that milder and wetter winters increase vehicle usage, whereas extreme heat and heavy precipitation suppress travel activity, thereby altering maintenance-related demand. In addition, built environment characteristics, such as urban density and road network structure, as well as socioeconomic conditions, including income levels and regional economic development, have been shown to influence vehicle ownership rates and operational intensity [23]. Although these studies focus on automotive applications, they highlight a general principle: intermittent demand is shaped by dynamic environmental, infrastructural, and socioeconomic contexts rather than historical demand alone.

Despite this recognition, most existing forecasting models incorporate external factors only at the individual series level and fail to explicitly model spatial similarities or semantic associations among different demand nodes. This limitation is particularly problematic for intermittent demand, where information contained in a single sparse time series is insufficient and cross-node commonalities can provide critical complementary signals.

To address this issue, graph-based models have recently gained increasing attention. Graph Neural Networks (GNNs) offer a principled framework for representing structured relationships and have been successfully applied in traffic flow forecasting [24] and logistics optimization [25]. In particular, Graph Convolutional Networks (GCNs) aggregate information from neighboring nodes to capture spatial correlations and contextual dependencies induced by shared geographic, environmental, or socioeconomic attributes [26]. These properties make GCNs especially suitable for modeling cross-node interactions in sparse and intermittent demand systems.

Despite extensive research efforts, existing intermittent demand forecasting studies still face notable limitations. In particular, most approaches primarily rely on historical demand data while insufficiently incorporating critical external drivers—such as built environment, socioeconomic conditions, and natural environmental factors—which significantly influence item consumption. Moreover, spatial correlations across supply nodes, including similarities in geographic, socioeconomic, or environmental attributes, are often neglected, limiting opportunities for knowledge transfer and improved prediction accuracy. Finally, the effective integration of heterogeneous information sources, including temporal demand, spatial features, and environmental context, remains underexplored, constraining the ability of models to fully capture the complexity of intermittent demand. These limitations underscore the necessity for a more comprehensive, multi-modal, and spatially aware modeling framework to improve the accuracy and robustness of spare parts demand forecasting.

3. Method

3.1. Overall Architecture

The proposed framework is designed as a general methodology for forecasting intermittent demand items within supply chains, which are often characterized by sparsity, discontinuity, and high variability. As illustrated in Figure 1, the model integrates heterogeneous data sources and combines Transformer-based temporal modeling with graph-based spatial learning to enhance predictive accuracy.

The Transformer component employs a self-attention mechanism to capture long-range temporal dependencies in historical demand data, thereby extracting complex temporal patterns that traditional forecasting approaches often fail to identify. The extracted temporal representations are then refined through multiple linear layers to ensure stable and normalized features.

To complement temporal modeling, multi-graph GCNs are introduced to incorporate prior knowledge in a clearly defined methodological sense. In this framework, prior knowledge is not imposed as explicit constraints on the model parameters or optimization process. Instead, it is operationalized through two distinct forms: (i) static exogenous information and (ii) structural relationships among demand items.

Static exogenous information refers to time-invariant contextual attributes that are external to the demand time series, such as geographical, socioeconomic, and environmental characteristics. These attributes are treated as auxiliary inputs and encoded into node features, rather than being directly enforced as hard constraints. In the empirical study of this work, such static exogenous variables are selected because they are widely recognized as influential factors affecting automotive spare parts consumption.

Structural prior knowledge is represented through graph structures, where nodes correspond to demand items and edges encode similarity or correlation relationships derived from static exogenous information. This formulation allows prior knowledge to be incorporated as a relational inductive bias, guiding the learning process by constraining information propagation across the graph while preserving model flexibility.

It should be emphasized that, although static exogenous information serves as the basis for graph construction, it is distinct from prior knowledge in the form of explicit modeling constraints. The current framework does not restrict parameter values or enforce predefined functional relationships; instead, it leverages prior knowledge implicitly through graph topology and feature representations.

Subsequently, temporal features from the Transformer and contextual representations derived from the multi-graph GCN are dynamically fused. A Mamba fusion module is employed to unify these heterogeneous modalities into a joint feature space, allowing the model to adaptively weigh temporal dynamics and graph-based contextual information. This multimodal integration strengthens robustness and enhances predictive performance by jointly exploiting dynamic demand patterns and static contextual priors.

Overall, the framework provides a generalizable solution for intermittent demand forecasting. While the empirical analysis in this study focuses on automotive spare parts, the architecture is flexible and can be adapted to other application domains by redefining static exogenous variables and corresponding graph structures. The pseudocode of the proposed framework is presented in Algorithm 1.

Algorithm 1 Proposed Demand Prediction Framework

Require: Historical time series $X_{t} \in \mathbb{R}^{n \times m}$, prior knowledge $A_{t} \in \mathbb{R}^{n \times k}$;
Ensure: Demand prediction $\hat{y} \in \mathbb{R}^{n}$;
1: Temporal modeling: $F_{t} \in \mathbb{R}^{n \times p_{1}} \leftarrow \mathrm{Transformer}(X_{t})$;
2: Spatial–relational modeling: $G_{t} \in \mathbb{R}^{n \times p_{2}} \leftarrow \mathrm{GCN}(A_{t})$;
3: Feature fusion: $M \in \mathbb{R}^{n \times p} \leftarrow \mathrm{MambaFusion}(F_{t}, G_{t})$;
4: Prediction: $\hat{y} \leftarrow \mathrm{PredictionLayer}(M)$;
5: return $\hat{y}$;

The proposed framework distinguishes itself from existing approaches by systematically integrating multidimensional environmental factors, including urban morphology, socioeconomic indicators, and natural conditions, which are often neglected in prior studies. Spatial dependencies and similarities across regions with comparable environmental contexts are explicitly modeled via a multi-graph GCN, enabling effective knowledge transfer and enhanced predictive performance in data-sparse areas. Moreover, heterogeneous modalities, including temporal demand series, spatial relations, and environmental features, are adaptively fused using the Mamba technique, ensuring comprehensive utilization of available information and yielding more accurate and robust demand forecasts. The prediction target in this study is the weekly aggregated demand quantity, defined as the total number of units requested for each spare part type in each city at each time step.

3.2. Time Series Feature Extraction

Despite the success of the Transformer architecture in natural language processing and, more recently, in time series modeling, its conventional design exhibits inherent limitations when applied to multivariate time series data [27]. Standard Transformers typically treat each time step as a token and focus on modeling dependencies across the temporal dimension. However, this approach neglects the variable-wise structure of time series, which is crucial in many real-world forecasting tasks, especially when multivariate interactions play a central role [28].

To address these limitations, we propose a novel architecture named STformer (Series-as-Token Transformer). Instead of treating time steps as tokens, STformer reformulates the input representation by encoding each variable’s full temporal sequence as a single token. This enables the model to focus attention across variables, thereby capturing inter-variable dependencies more effectively, while a separate module models each variable’s temporal dynamics independently. Such a design is particularly beneficial for demand forecasting in domains like automotive after-sales, where complex interactions exist between parts, time, and exogenous variables [29]. The architecture of STformer is composed of three main components: a variable-level self-attention module that learns relationships between variables, a simplified feed-forward layer to reduce computational complexity, and a single normalization layer to stabilize training.

As illustrated in Figure 2, the embedding layer of STformer differs fundamentally from that of the traditional Transformer. In conventional Transformer-based time series models, each data point across all variables at a single time step is treated as a token. This approach emphasizes temporal dependencies but often overlooks the unique temporal dynamics within each individual variable.

In contrast, STformer treats each variable's entire time series as a single token, denoted by $H \in \mathbb{R}^{T}$, where $T$ is the total number of time steps. This design ensures that the model has access to the global and sequential characteristics of each variable, enabling it to learn more robust representations across variables. Here $N$ denotes the number of nodes in the graph; each node is associated with a single time series of length $T$, so each node-level time series serves as one variable token. Given the raw input matrix $X \in \mathbb{R}^{N \times T}$, the embedding layer extracts $H_{d} \in \mathbb{R}^{T}$ for the $d$-th variable. These variable-wise sequences are then projected into a latent space using temporal encoding functions or 1D convolutional layers to form token embeddings $E \in \mathbb{R}^{N \times d}$. If this embedding layer receives the output from a previous module, $H \in \mathbb{R}^{N \times T}$ represents the intermediate representation instead of raw input.

Formally, for each variable $d$, the corresponding time series embedding $H_{d} \in \mathbb{R}^{T}$ is mapped to a fixed-dimensional token representation through a temporal projection function. In this work, the projection is implemented as a learnable transformation $f_{\theta}(\cdot)$, which maps the entire sequence into a $d$-dimensional latent vector:

$$e_{d} = f_{\theta}(H_{d}) \in \mathbb{R}^{d}. \tag{1}$$

Specifically, $f_{\theta}(\cdot)$ can be instantiated as a stack of 1D convolutional layers followed by a global pooling operation along the temporal dimension, or equivalently as a temporal encoding module that aggregates sequential information into a compact representation. This design ensures that both local temporal patterns and global sequence-level characteristics are preserved during the sequence-to-token transformation.
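One way to picture the sequence-to-token mapping in Equation (1) is a single 1D convolution followed by global average pooling over time. The sketch below is illustrative only: the function names, the hand-picked kernels (standing in for learned weights), the demand values, and the choice of one convolutional layer with average pooling are all assumptions, not the authors' implementation.

```python
def conv1d_valid(series, kernel):
    """Slide `kernel` over `series` without padding (cross-correlation)."""
    k = len(kernel)
    return [
        sum(series[t + j] * kernel[j] for j in range(k))
        for t in range(len(series) - k + 1)
    ]


def series_to_token(series, kernels):
    """Map one length-T series H_d to a token with len(kernels) entries.

    Each kernel produces one feature map; global average pooling over
    the temporal axis collapses that map to a single number.
    """
    token = []
    for kernel in kernels:
        feature_map = conv1d_valid(series, kernel)
        token.append(sum(feature_map) / len(feature_map))
    return token


# Hypothetical intermittent weekly demand for one node (T = 8).
demand = [0, 0, 3, 0, 0, 0, 5, 0]
# Two illustrative kernels, so the latent dimension is 2.
kernels = [[1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]]
token = series_to_token(demand, kernels)
```

The first kernel acts as a local sum and the second as a local difference, so even this tiny token summarizes both the demand level and its changes, regardless of the series length.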

To capture diverse interaction patterns among variables, STformer employs a multi-head self-attention mechanism at the variable level. This mechanism allows the model to attend to multiple representation subspaces in parallel, enabling it to learn different types of relationships (e.g., linear correlations, complementary effects, or temporal dependencies) between variable sequences.

To compute the variable-wise attention, each token (i.e., the embedded time series of one variable) is linearly projected into a query ($Q$), a key ($K$), and a value ($V$) representation. For attention head $i$:

$$Q_{i} = E W_{i}^{Q}, \quad K_{i} = E W_{i}^{K}, \quad V_{i} = E W_{i}^{V}, \tag{2}$$

where $E \in \mathbb{R}^{N \times d}$ denotes the embedded representations of all $N$ variables, and $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V} \in \mathbb{R}^{d \times d_{h}}$ are learnable projection matrices with per-head dimension $d_{h}$. The attention score between each pair of variables is computed using scaled dot-product attention:

$$\mathrm{Attention}(Q_{i}, K_{i}, V_{i}) = \mathrm{softmax}\left(\frac{Q_{i} K_{i}^{\top}}{\sqrt{d_{h}}}\right) V_{i}. \tag{3}$$
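The following minimal sketch applies Equations (2) and (3) for a single attention head, with attention computed across variable tokens rather than time steps. It uses plain Python lists for readability; the function names, the toy embeddings, and the identity projection matrices are assumptions chosen for illustration, and the scaling uses the per-head dimension $d_{h}$ (the column count of the projection matrices).

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def variable_attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over N variable tokens.

    E is N x d; Wq, Wk, Wv are d x d_h. Returns the N x d_h output and
    the N x N attention weights between variables.
    """
    Q, K, V = matmul(E, Wq), matmul(E, Wk), matmul(E, Wv)
    d_h = len(Wq[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_h) for s in row]) for row in scores]
    return matmul(weights, V), weights


# Three hypothetical variable tokens with d = 2.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity projections (d_h = 2) stand in for learned matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
output, weights = variable_attention(tokens, identity, identity, identity)
```

Each row of `weights` sums to one and describes how strongly one variable attends to every other variable, which is the quantity the variable-level attention module learns.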

The normalization layer in STformer plays a critical role in stabilizing training and accelerating convergence by standardizing the time series features across variables. However, in multivariate temporal datasets, variables often have unaligned timestamps due to missing entries or sampling discrepancies. Conventional normalization strategies, such as LayerNorm or BatchNorm applied across time steps, may obscure these misalignments by projecting dissimilar values into similar representations, thereby distorting inter-variable relationships.

LayerNorm normalizes the activations within a layer across the feature dimension: for each input sample, it computes the mean and standard deviation of the features, subtracts the mean, and divides by the standard deviation. While effective in many applications, LayerNorm treats each time step in a sequence independently, which can be problematic for time series data, particularly when there are temporal misalignments such as missing values or irregular sampling intervals. In such cases, applying LayerNorm across time steps may obscure important temporal relationships and distort inter-variable dependencies.
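The LayerNorm computation described above can be sketched in a few lines. This is a simplified version under stated assumptions: the learnable gain and bias of a full LayerNorm are omitted, `eps` is a small constant added for numerical stability, and the input vectors are made up.

```python
import math


def layer_norm(features, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(features) / len(features)
    var = sum((v - mean) ** 2 for v in features) / len(features)
    return [(v - mean) / math.sqrt(var + eps) for v in features]


# Two hypothetical feature vectors that differ only in scale.
low_volume = layer_norm([1.0, 2.0, 3.0, 4.0])
high_volume = layer_norm([10.0, 20.0, 30.0, 40.0])
```

The two outputs are nearly identical even though the inputs differ by a factor of ten, which illustrates how normalization can project dissimilar values into similar representations.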

3.3. Prior Data Feature Extraction

For spatiotemporal demand forecasting problems, the proposed framework can incorporate different sets of external influencing variables by integrating domain-specific prior knowledge. In the context of automotive spare parts, the core rationale for studying their demand lies in the inevitable wear and tear of vehicle components during operation. Over time, parts degrade and fail due to mechanical stress, environmental exposure, and operational conditions, necessitating timely replacements and repairs. Importantly, the rate of wear and failure is not uniform but is significantly influenced by external factors. Based on transportation geography [30,31,32], transportation economics [33], and vehicle engineering theories [34], we classify the key external factors influencing spare parts demand into three main categories: (1) built environment factors, including road slope, road network density, and intersection density, which affect vehicle dynamics and braking/acceleration patterns; (2) natural environment factors, such as temperature, precipitation, and humidity, which influence material fatigue, corrosion, and overall component lifespan; and (3) socioeconomic factors, including regional GDP, the proportion of tertiary industry, and vehicle ownership density, which shape driving frequency and intensity. Incorporating these domain-specific factors allows for a more accurate modeling of spatially heterogeneous, wear-driven spare parts demand. By constructing a spatially aware graph that encodes the relationships among regions and parts based on geographic proximity and these external factors, the dynamics of demand can be more precisely captured. Furthermore, the proposed framework is generalizable to other spatiotemporal demand forecasting problems by integrating domain-specific prior knowledge to select appropriate external influencing variables.

3.3.1. Construction of Prior Data Graph Structure

As shown in Figure 3, to model these complex spatial dependencies, we construct a prior knowledge graph that encodes both geographical proximity and domain-specific regional attributes. This graph serves as a structured foundation for subsequent graph convolution operations.

Let the prior graph be denoted as:

$$G = (V, E, A), \tag{4}$$

where $V$ is the set of nodes, each representing a specific spare part category in a specific city, i.e.,

$$v_{i} = (\mathrm{part\_type}_{k}, \mathrm{location}_{m}), \tag{5}$$

$E$ is the set of undirected edges representing spatial correlations, and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix, where $N = |V|$ is the total number of nodes.

To define edges between nodes, we consider geographical distance as the structural prior, under the assumption that cities closer in space tend to share more similar spare parts consumption behaviors due to shared infrastructure conditions and vehicular usage patterns. The geodesic distance between city i and city j is calculated using the Haversine formula:

$$d_{ij} = 2r \cdot \arcsin\left(\sqrt{\sin^{2}\left(\frac{\phi_{i} - \phi_{j}}{2}\right) + \cos(\phi_{i}) \cos(\phi_{j}) \sin^{2}\left(\frac{\lambda_{i} - \lambda_{j}}{2}\right)}\right), \tag{6}$$

where $r = 6371$ km is the Earth's average radius, and $(\phi_{i}, \lambda_{i})$, $(\phi_{j}, \lambda_{j})$ are the geographic coordinates (latitude and longitude) of cities $i$ and $j$, respectively.
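Equation (6) translates directly into code. The sketch below assumes coordinates are supplied in decimal degrees and converted to radians before the trigonometric functions are applied; the function name and the approximate city coordinates are illustrative assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # average Earth radius used in Equation (6)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    half_dphi = (phi1 - phi2) / 2
    half_dlam = math.radians(lon1 - lon2) / 2
    a = math.sin(half_dphi) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(half_dlam) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates, for illustration only.
chengdu = (30.67, 104.07)
chongqing = (29.56, 106.55)
distance = haversine_km(*chengdu, *chongqing)
```

Because the formula is symmetric in the two points, the resulting distance matrix is symmetric with a zero diagonal, so only the upper triangle needs to be computed in practice.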

The edge weights are computed using a Gaussian kernel over the spatial distances:

$$A_{ij} = \exp\left(-\frac{d_{ij}^{2}}{2\sigma^{2}}\right), \tag{7}$$

where $\sigma$ is a bandwidth hyperparameter controlling the decay of spatial influence. To prevent over-connection in dense urban areas and reduce computational overhead, we retain only the top-$k$ nearest neighbors for each node based on $d_{ij}$.
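A minimal sketch of the Gaussian kernel in Equation (7) combined with top-k sparsification is given below. Two details are assumptions, since the text does not specify them: self-loops are excluded from the neighbor search, and an edge is kept in both directions whenever either endpoint selects the other, so that the graph stays undirected. The positions, bandwidth, and function name are hypothetical.

```python
import math


def gaussian_topk_adjacency(dist, sigma, k):
    """Build a sparse, symmetric adjacency matrix from pairwise distances."""
    n = len(dist)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # The k nearest neighbours of node i, excluding i itself.
        neighbours = sorted(
            (j for j in range(n) if j != i), key=lambda j: dist[i][j]
        )[:k]
        for j in neighbours:
            weight = math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
            # Store both directions to keep the graph undirected.
            adj[i][j] = adj[j][i] = weight
    return adj


# Four hypothetical locations on a line (km), used to form distances.
positions = [0.0, 100.0, 250.0, 600.0]
dist = [[abs(p - q) for q in positions] for p in positions]
adjacency = gaussian_topk_adjacency(dist, sigma=200.0, k=1)
```

With `k=1` only the chain of nearest neighbors survives, while distant pairs such as the first and last location receive zero weight. A larger `sigma` flattens the decay and a larger `k` densifies the graph.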

In addition to structural connectivity, each node $v_{i}$ is associated with a feature vector $x_{i} \in \mathbb{R}^{F}$, representing domain-related contextual variables that potentially influence spare parts demand. We consider three regional factors:

$f_{i}^{1}$ : average daily traffic flow in the city, representing the wear intensity on vehicle components;
$f_{i}^{2}$ : vehicle ownership per capita, indicating the demand base;
$f_{i}^{3}$ : temperature fluctuation amplitude, reflecting environmental stress on parts.

To ensure numerical comparability across heterogeneous features, each dimension is normalized using min-max scaling:

{\tilde{f}}_{i}^{k} = \frac{f_{i}^{k} - min (f^{k})}{max (f^{k}) - min (f^{k})},

(8)

The normalized feature vectors are then stacked to form the final input feature matrix:

X = [{\tilde{x}}_{1}^{⊤}; {\tilde{x}}_{2}^{⊤}; \dots; {\tilde{x}}_{N}^{⊤}] \in R^{N \times F},

(9)
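A minimal sketch of Equations (8) and (9) is given below. The handling of constant columns is an assumption added for numerical safety, and the numbers in the usage note are made up purely for illustration.

```python
import numpy as np


def min_max_scale(features):
    """Column-wise min-max scaling (Equation (8)) of an N x F feature matrix."""
    features = np.asarray(features, dtype=float)
    f_min = features.min(axis=0)
    f_range = features.max(axis=0) - f_min
    # Assumption: a constant column (zero range) is mapped to 0 instead of dividing by zero.
    safe_range = np.where(f_range > 0, f_range, 1.0)
    return (features - f_min) / safe_range
```

For example, with three hypothetical cities described by (traffic flow, ownership per capita, temperature amplitude), `min_max_scale([[12000, 0.21, 18.5], [30000, 0.35, 25.0], [21000, 0.28, 9.0]])` returns the matrix X of Equation (9), with every column rescaled to the interval [0, 1].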

To facilitate spectral graph convolution, the adjacency matrix is further symmetrically normalized:

\hat{A} = D^{- 1 / 2} A D^{- 1 / 2},

(10)

where

D

is the degree matrix defined by:

D_{i i} = \sum_{j} A_{i j},

(11)
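Equations (10) and (11) can be sketched in a few lines. The treatment of isolated nodes is an assumption; with top-k neighbor selection every node normally has a positive degree, so this branch is only a safeguard.

```python
import numpy as np


def symmetric_normalize(adj):
    """Return D^{-1/2} A D^{-1/2} with D_ii = sum_j A_ij (Equations (10)-(11))."""
    adj = np.asarray(adj, dtype=float)
    degree = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    nonzero = degree > 0
    # Assumption: isolated nodes (zero degree) keep an all-zero row and column.
    inv_sqrt[nonzero] = degree[nonzero] ** -0.5
    return inv_sqrt[:, None] * adj * inv_sqrt[None, :]
```

Many GCN implementations add self-loops (A + I) before normalizing. Equation (10) as written does not, and the sketch follows the equation.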

This prior data graph integrates both topological spatial knowledge and exogenous regional attributes, allowing the subsequent graph convolutional model to learn spatially aware representations of part demand. It also introduces an interpretable inductive bias that enhances generalization in data-sparse or regionally inconsistent scenarios. The detailed construction process is shown in Algorithm 2:

Algorithm 2 Construction of Prior Data Graph Structure

Require: Cities $C = {c_{1}, c_{2}, \dots, c_{M}}$, part types $T = {t_{1}, t_{2}, \dots, t_{K}}$, geographic coordinates ${(ϕ_{i}, λ_{i})}$, regional features $F = {f_{i}^{k}}$;
Ensure: Prior graph $G = (V, E, \hat{A})$ and node feature matrix $X$;
1: Initialize $V \leftarrow \emptyset$, $E \leftarrow \emptyset$;
2: for each city $c_{i} \in C$ do
3:   for each part type $t_{k} \in T$ do
4:     Create node $v_{i} = (t_{k}, c_{i})$;
5:     Add $v_{i}$ to $V$;
6:   end for
7: end for
8: Compute distance matrix $d_{i j}$ using the Haversine formula (Equation (6));
9: for each node pair $(v_{i}, v_{j})$ do
10:   Compute edge weight $A_{i j} = exp (- \frac{d_{i j}^{2}}{2 σ^{2}})$;
11: end for
12: for each node $v_{i}$ do
13:   Retain top-k nearest neighbors based on $d_{i j}$;
14: end for
15: Compute degree matrix $D_{i i} = \sum_{j} A_{i j}$;
16: Normalize adjacency: ${\hat{A}}_{i j} = D_{i i}^{- 1 / 2} A_{i j} D_{j j}^{- 1 / 2}$;
17: for each feature dimension k do
18:   Apply min-max normalization: ${\tilde{f}}_{i}^{k} = \frac{f_{i}^{k} - min (f^{k})}{max (f^{k}) - min (f^{k})}$;
19: end for
20: Stack features to form $X = [{\tilde{x}}_{1}^{⊤}; \dots; {\tilde{x}}_{N}^{⊤}]$;
21: return $(V, E, \hat{A}, X)$;

To avoid ambiguity, we explicitly distinguish three forms of prior knowledge used in this work. First, prior knowledge as static exogenous information refers to observable, time-invariant external attributes associated with each node, such as geographical location, socioeconomic indicators, and environmental statistics. In our setting, these are city-level attributes: average traffic flow, vehicle ownership per capita, and temperature fluctuation amplitude. They are treated as fixed node attributes and used directly as inputs to the graph construction process; they provide contextual background but do not encode relationships by themselves. Second, prior knowledge as a structural relationship is represented by the prior graph, where nodes correspond to part type and city pairs and edges encode predefined relational assumptions, such as geographical proximity between cities. This form of prior knowledge defines the topology of the graph and governs how information propagates across nodes via the GCN. Third, prior knowledge as a modeling constraint refers to the way these predefined structures and relationships restrict the learning space of the model. Instead of learning arbitrary dependencies, the model is guided to respect the given graph structure and contextual similarities, thereby improving interpretability and reducing the risk of spurious correlations.

3.3.2. Graph Convolutional Network Module

After constructing the prior spatial graph structure, we employ a Graph Convolutional Network (GCN) to extract high-level spatial correlations among part demands across different cities and part types. GCN enables the propagation and aggregation of information over the graph structure, allowing each node representation to be contextually enriched by its spatial and attribute-aware neighbors.

Let the input to the GCN be the normalized adjacency matrix

\hat{A} \in R^{N \times N}

and the node feature matrix

X^{(0)} \in R^{N \times F}

, where N is the number of nodes and F is the input feature dimension. To support the multi-graph modeling assumption, we extend the GCN formulation to accommodate multiple prior adjacency matrices. Specifically, instead of a single graph, we consider a set of symmetrically normalized adjacency matrices

{{\hat{A}}^{(m)}}_{m = 1}^{M}

, where each matrix represents a distinct spatial or functional relationship among nodes (e.g., geographical adjacency, transportation connectivity, or demand similarity). For each graph, an independent graph convolution operation is performed to capture relation-specific spatial dependencies.

The propagation rule of a single GCN layer is defined as:

X^{(l + 1)} = σ (\hat{A} X^{(l)} W^{(l)}),

(12)

where

$X^{(l)} \in R^{N \times D_{l}}$ is the node feature matrix at the l-th layer;
$W^{(l)} \in R^{D_{l} \times D_{l + 1}}$ is the trainable weight matrix;
$σ (\cdot)$ is a nonlinear activation function (e.g., ReLU);
$\hat{A}$ is the symmetrically normalized adjacency matrix computed from the prior graph.
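The propagation rule of Equation (12) amounts to two matrix products followed by the activation. The sketch below assumes ReLU for the nonlinearity, as suggested in the list above; the function name is illustrative.

```python
import numpy as np


def gcn_layer(a_hat, x, w):
    """One propagation step X^{(l+1)} = ReLU(A_hat X^{(l)} W^{(l)}) as in Equation (12)."""
    return np.maximum(a_hat @ x @ w, 0.0)
```

With two connected nodes and an identity weight matrix, each node simply receives its neighbor's features, and negative entries are clipped by the ReLU.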

Accordingly, the graph convolution at layer l is applied to each adjacency matrix separately, yielding a set of intermediate representations

{X^{(l + 1, m)}}_{m = 1}^{M}

. These representations are then fused to form a unified embedding through a weighted aggregation scheme:

X^{(l + 1)} = \sum_{m = 1}^{M} α_{m} X^{(l + 1, m)},

(13)

where

α_{m}

denotes a learnable coefficient that reflects the relative contribution of the m-th graph and satisfies

\sum_{m = 1}^{M} α_{m} = 1

. This fusion mechanism enables the model to adaptively integrate complementary information from multiple graph structures.
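The weighted aggregation of Equation (13) can be sketched as follows. Two points are assumptions rather than statements from the text: the coefficients are parameterized as a softmax over free logits, which is one simple way to satisfy the sum-to-one constraint, and each graph is given its own weight matrix.

```python
import numpy as np


def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def multi_graph_gcn_layer(a_hats, x, weights, alpha_logits):
    """Per-graph convolution followed by the weighted sum of Equation (13)."""
    # Assumption: alpha = softmax(logits) keeps the coefficients positive and summing to 1;
    # the source only states the sum-to-one constraint.
    alpha = softmax(alpha_logits)
    per_graph = [np.maximum(a @ x @ w, 0.0) for a, w in zip(a_hats, weights)]
    return sum(a_m * h_m for a_m, h_m in zip(alpha, per_graph))
```

When all logits are equal, the layer reduces to a plain average of the per-graph outputs.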

Through stacking multiple GCN layers, the model aggregates information from increasingly distant neighbors, effectively capturing higher-order spatial dependencies:

H = GCN (\hat{A}, X) = X^{(L)},

(14)

where

H \in R^{N \times D}

denotes the final high-level node representation and L is the total number of graph convolutional layers.

This representation encodes both the local features (e.g., regional traffic intensity, climate) and global spatial interactions among spare parts, providing a rich semantic embedding for each part-city pair.

To enhance generalization and avoid overfitting, dropout and layer normalization techniques are optionally applied after each GCN layer:

X^{(l + 1)} = Dropout (LayerNorm (σ (\hat{A} X^{(l)} W^{(l)}))),

(15)

Finally, the learned representation

H

is passed to the downstream forecasting head (e.g., fully connected network or temporal model), which outputs the spare part demand forecast for each node.

This GCN module enables the model to incorporate complex spatial and contextual information during representation learning, leading to improved performance in demand prediction under heterogeneous urban and regional conditions.

In the construction of the part–city graph, nodes are defined as pairs of part and city, where each node represents the demand for a specific part in a particular city. Edges are formed solely on the basis of geographical proximity between cities and connect only nodes of the same part type; no direct connections are made between different parts, even when the corresponding cities are geographically close. Consequently, when nodes are ordered by part type, the adjacency matrix is block-diagonal, with one block per part type, and cross-part connections are not allowed. This design ensures that the model focuses on geographic dependencies between cities, without introducing artificial relationships between different part types. The structure of the graph aligns with real-world geographic influences and helps maintain the interpretability of the GCN-based model.
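One way to realize a part–city graph without cross-part edges is to replicate the city-level adjacency once per part type. The sketch below does this with a Kronecker product; the node ordering (grouped by part type) and the function name are assumptions made for illustration.

```python
import numpy as np


def part_city_adjacency(city_adj, num_parts):
    """Expand an M x M city graph to (K*M) x (K*M) part-city nodes without cross-part edges."""
    city_adj = np.asarray(city_adj, dtype=float)
    # Node index = part_index * M + city_index, i.e. nodes are grouped by part type.
    return np.kron(np.eye(num_parts), city_adj)
```

With this ordering, the result contains one copy of the city graph per part type on the diagonal and zeros elsewhere, so information propagates between cities for the same part but never between parts.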

The prior graph is static across time, whereas node-level demand observations evolve at a weekly resolution. During model training and inference, graph-based spatial embeddings are computed for each node and aligned with temporal features using the same node index. The spatial embeddings are broadcast along the temporal dimension so that, at each time step, temporal dynamics and spatial priors correspond to the same part–city pair.
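The alignment step can be sketched as a broadcast of the static node embeddings along the time axis. The array layout (nodes, time steps, features) and the function name are assumptions chosen for this example.

```python
import numpy as np


def align_spatial_with_temporal(spatial, temporal):
    """Repeat static node embeddings (N x d) along time to match temporal features (N x T x d)."""
    n, t, d = temporal.shape
    if spatial.shape != (n, d):
        raise ValueError("both modalities must share the node index and feature width d")
    # broadcast_to returns a read-only view; copy() makes it writable for downstream use.
    return np.broadcast_to(spatial[:, None, :], (n, t, d)).copy()
```

After this step, the two arrays have identical shapes, and position (i, t) in both refers to the same part–city pair at the same week.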

3.4. Dynamic Fusion

3.4.1. Modality Unification Based on Mamba

In multi-source spare parts demand forecasting, temporal dynamics and spatial priors often reside in fundamentally distinct representational spaces, making direct fusion suboptimal. Traditional attention-based architectures, such as Transformers, struggle to align heterogeneous modalities when they carry sparse or long-range dependencies. In this context, we propose using Mamba, a selective state space model, to unify temporal sequence features and graph-structured prior knowledge within a unified latent representation. While the Mamba state space model offers complexity advantages for long sequences, the choice of STformer was motivated by its effectiveness in capturing temporal dependencies for the relatively shorter sequence lengths in our dataset. Transformers’ attention mechanisms also better capture long-range dependencies compared to Mamba-based models.

Concretely, we treat the temporal module’s output and the GCN-derived spatial embeddings as two separate modalities. These are first linearly projected into a shared latent space, and then passed through stacked Mamba blocks. The state space formulation of Mamba allows it to maintain sequence integrity while selectively integrating cross-modality cues, resulting in a smooth yet expressive fusion representation.

To effectively achieve modality unification, we design a dedicated Mamba-based fusion block, as illustrated in Figure 4. The block is composed of a carefully constructed sequence of sub-layers that enables deep integration between temporal sequence representations and spatial prior embeddings. Specifically, the Mamba block contains two residual pathways. The first residual path encapsulates the Mamba layer, which learns dynamic representations over the fused modality inputs by modeling selective long-range dependencies. This is followed by a Layer Normalization operation to stabilize training dynamics and promote representation smoothness. The second residual path incorporates a position-wise Feedforward Network (FFN), which further enhances the non-linear expressiveness of the fused representation. This FFN is also wrapped with Layer Normalization and residual connection, allowing the network to better preserve original information while refining cross-modality interactions. This dual-residual design not only facilitates gradient flow during training but also ensures robust information preservation across layers. As a result, the Mamba block serves as the core unit for effective modality alignment and integration, bridging heterogeneous information sources under a unified dynamic state space.

Given the temporal feature sequence

X_{t} \in R^{T \times d}

and spatial prior embedding

X_{s} \in R^{T \times d}

, we first project them into a shared latent space via linear transformation:

X = {Linear}_{proj} (X_{t} + X_{s}),

(16)

The fused representation X is passed into the Mamba layer. We denote the Mamba transformation as

M (\cdot)

, resulting in:

H_{1} = LayerNorm (X + M (X)),

(17)

Next, we apply a position-wise feedforward network (FFN) with another residual connection and normalization:

H_{2} = LayerNorm (H_{1} + FFN (H_{1})),

(18)

where the FFN is defined as:

FFN (x) = ReLU (x W_{1} + b_{1}) W_{2} + b_{2},

(19)

The output

H_{2}

serves as the unified modality representation that integrates both temporal dynamics and spatial priors under the Mamba framework.
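The dual-residual structure of Equations (16)–(19) can be sketched independently of the Mamba internals. In the code below, `mixer` is a placeholder for the Mamba transformation M(·), and layer normalization is written without learnable scale and shift; both simplifications are assumptions made to keep the example self-contained.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalise each row (time step) to zero mean and unit variance; no learnable scale/shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network of Equation (19)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2


def fusion_block(x_t, x_s, w_proj, b_proj, mixer, ffn_params):
    """Equations (16)-(18) with `mixer` standing in for the Mamba transformation M(.)."""
    x = (x_t + x_s) @ w_proj + b_proj              # Eq. (16): project the summed modalities
    h1 = layer_norm(x + mixer(x))                  # Eq. (17): first residual path
    return layer_norm(h1 + ffn(h1, *ffn_params))   # Eq. (18): second residual path
```

Any sequence-to-sequence callable that preserves the shape T × d can be passed as `mixer`, which makes it straightforward to compare the Mamba layer against simpler mixers in an ablation.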

As illustrated in Figure 5, the Mamba layer is composed of two parallel branches that process the input hidden state

H_{l - 1} \in R^{T \times d}

. The goal of this block is to extract dynamic features while preserving long-range temporal dependencies.

In the first branch,

H_{l - 1}

is passed through a linear projection followed by a SiLU activation to obtain the transformed sequence x:

x = SiLU (Linear (H_{l - 1})),

(20)

In the second branch, the same input is linearly projected and then processed by a convolutional layer (to model local patterns), followed by a SiLU activation, resulting in y:

y = SiLU (Conv (Linear (H_{l - 1}))),

(21)

The signal y is then passed through the Selective State Space Model (SelectiveSSM) to capture sequence-dependent selective dynamics, yielding

\hat{y}

:

\hat{y} = SelectiveSSM (y),

(22)

Next, the outputs from both branches are combined through an element-wise fusion operation ⊗, and then mapped back to the original dimensionality via a linear layer to obtain the intermediate output

O_{1, l}

:

O_{1, l} = Linear (x \otimes \hat{y}),

(23)

To reduce redundancy and enhance generalization, a residual connection is employed, where

O_{1, l}

is subtracted from the original input

H_{l - 1}

, followed by layer normalization:

{\hat{O}}_{1, l} = LayerNorm (H_{l - 1} - Dropout (O_{1, l})),

(24)

where

{\hat{O}}_{1, l}

denotes the output of the Mamba block after residual refinement, which is then passed into subsequent layers.
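The following NumPy sketch traces Equations (20)–(24) for a single sequence. It is a simplified illustration, not the authors' implementation: the element-wise fusion ⊗ is taken to be a Hadamard product, the inner width equals d, dropout is omitted (inference mode), layer normalization has no learnable parameters, and all parameter names in the dictionary `p` are hypothetical. The selective scan uses the common diagonal recurrence with input-dependent step size, B, and C.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def softplus(x):
    return np.logaddexp(0.0, x)


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)


def causal_conv(x, kernel):
    """Per-channel causal 1-D convolution; x is (T, d), kernel is (width, d)."""
    width = kernel.shape[0]
    padded = np.vstack([np.zeros((width - 1, x.shape[1])), x])
    return np.stack([(padded[t:t + width] * kernel).sum(axis=0) for t in range(x.shape[0])])


def selective_ssm(y, a, w_b, w_c, w_delta):
    """Scan h_t = exp(delta_t A) * h_{t-1} + delta_t B_t y_t and emit C_t . h_t per channel."""
    h = np.zeros_like(a)                       # state of shape (d, n)
    out = np.empty_like(y)
    for t in range(y.shape[0]):
        delta = softplus(y[t] @ w_delta)       # (d,) input-dependent step size
        b_t, c_t = y[t] @ w_b, y[t] @ w_c      # (n,) input-dependent B_t and C_t
        h = np.exp(delta[:, None] * a) * h + (delta * y[t])[:, None] * b_t[None, :]
        out[t] = h @ c_t
    return out


def mamba_layer(h_prev, p):
    x = silu(h_prev @ p["w_x"])                                         # Eq. (20)
    y = silu(causal_conv(h_prev @ p["w_y"], p["conv"]))                 # Eq. (21)
    y_hat = selective_ssm(y, p["a"], p["w_b"], p["w_c"], p["w_delta"])  # Eq. (22)
    o = (x * y_hat) @ p["w_out"]                                        # Eq. (23)
    return layer_norm(h_prev - o)              # Eq. (24) with the sign as written, no dropout
```

Because the projections act per time step and both the convolution and the scan are causal, the output at step t depends only on inputs up to step t. The entries of `p["a"]` should be negative so that the state decays rather than grows.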

The selective state update mechanism in Mamba allows modality-specific information to be adaptively retained or suppressed over time, effectively reducing semantic mismatch between heterogeneous representations. As a result, the fused representation progressively evolves into a modality-consistent latent space, guided by task supervision without requiring additional explicit alignment losses. This implicit yet structured unification process is particularly suitable for demand forecasting scenarios, where temporal continuity and spatial priors must be jointly modeled under a coherent sequential representation.

Mamba’s role in fusion is critical for modeling selective dependencies across different modalities. The Selective State Space Model (SelectiveSSM) allows Mamba to maintain sparsity while dynamically integrating temporal and spatial representations. Unlike traditional fusion mechanisms that merge modalities in a fixed manner, Mamba adapts its fusion process to the relevance of each modality at each time step, reducing the risk of introducing irrelevant information. This selective integration ensures that the fused representation effectively captures the essential characteristics of both the temporal and spatial modalities without distorting their original dynamics.

While Mamba is highly effective for modality fusion, it is not optimized for modeling temporal dependencies within a single modality. The Transformer-based architecture, specifically STformer, excels at capturing complex temporal dynamics due to its self-attention mechanism, which allows it to model long-range temporal dependencies efficiently. Thus, we use STformer for temporal modeling and Mamba for fusion, ensuring that each model is employed for its respective strength, leading to better performance overall.

3.4.2. Dynamic Weighting Mechanism

Feature fusion has emerged as a critical topic in the field of spatiotemporal forecasting and decision modeling, especially in scenarios involving heterogeneous data sources. By integrating features from multiple modalities—such as time-series sensor data, domain-specific priors, and external contextual information—fusion techniques aim to capture the complementary and joint correlations that single-modality models often fail to exploit effectively [35,36].

Recent studies have demonstrated the potential of feature fusion in enhancing robustness and generalization in complex tasks such as traffic prediction, supply chain demand forecasting, and environmental modeling [37,38]. However, existing methods often treat all modalities equally or fuse them in a static manner, without considering the modality-specific noise levels, information content, and temporal relevance. This can lead to suboptimal predictions or even error accumulation, particularly when certain data modalities are missing, noisy, or less informative in certain contexts.

Given the inherent differences in feature distributions, reliability, and semantic emphasis across modalities, there is a pressing need for a principled and dynamic fusion framework that can adaptively weigh modalities based on their confidence and contribution. In this work, we propose an energy-based dynamic fusion strategy that leverages the uncertainty associated with each modality to guide the integration process.

In multimodal learning, modality representations originate from heterogeneous feature spaces and emphasize different semantic aspects during extraction, so they often differ substantially in feature distribution, data quality, noise sensitivity, and semantic granularity. These discrepancies arise from distinct data sources and processing mechanisms; for example, traffic signals may emphasize temporal dynamics, whereas economic or environmental indicators may encode long-term trends or exogenous influences. As a result, modality representations are typically characterized by heterogeneous levels of reliability, informativeness, and relative strength.

To address this challenge, we propose a principled dynamic fusion strategy that explicitly models modality-specific uncertainty and integrates representations based on their relative confidence. Unlike conventional fusion methods that assume equal or static importance across modalities, our approach leverages energy-based uncertainty estimation to drive an adaptive weighting mechanism, ensuring that more reliable modalities contribute more significantly to the fused output.

Formally, for each modality

d \in {1, 2, 3}

, let

x^{(d)}

denote its modality-specific representation obtained after feature extraction and alignment. We define an energy score

E (x^{(d)})

to characterize the relative confidence of modality d:

E (x^{(d)}) = - T^{(d)} \cdot log (\frac{\sum_{k = 1}^{K} exp (f_{k}^{(d)} (x^{(d)}))}{J^{(d)}}),

(25)

where

f^{(d)} (\cdot)

denotes a learnable transformation that maps modality features to a latent response space,

T^{(d)}

is a modality-specific temperature parameter controlling the smoothness of the confidence distribution, and

J^{(d)}

is a normalization constant. This formulation is adopted as a surrogate confidence estimator rather than a faithful probabilistic model of the regression target.

Lower energy values correspond to sharper and more confident modality representations, whereas higher energy values indicate increased uncertainty. To translate the energy scores into fusion weights, we compute a confidence score as:

log P (x^{(d)}) = - \frac{E (x^{(d)})}{T^{(d)}},

(26)

and apply a temperature-controlled softmax function to obtain normalized modality weights:

ω_{d} (x^{(d)}) = \frac{exp (γ log P (x^{(d)}))}{\sum_{j = 1}^{D} exp (γ log P (x^{(j)}))},

(27)

where

γ > 0

controls the sensitivity of the fusion process to differences in modality confidence. Larger values of

γ

encourage sharper emphasis on the most reliable modality, while smaller values yield a more balanced fusion across modalities.

The resulting confidence-aware weights are applied to the modality representations prior to or within the Mamba-based fusion block. In this way, the energy-based mechanism complements the Mamba fusion backbone by adaptively regulating modality contributions, ensuring that the unified representation is dominated by informative and reliable sources while remaining robust to noisy or less relevant modalities.

The final multimodal representation is computed as a weighted aggregation:

f (x) = \sum_{d = 1}^{D} ω_{d} (x^{(d)}) \cdot f_{d} (x^{(d)}),

(28)

In this formulation,

f_{d} (x^{(d)})

denotes the transformed feature from modality d, and

f (x)

is the unified representation used for downstream prediction tasks.
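Equations (25)–(28) can be traced numerically with the sketch below. The response vectors stand in for the outputs of the learnable transformation f^{(d)}, the normalization constant defaults to 1, and all function names are illustrative assumptions.

```python
import numpy as np


def energy(responses, temperature=1.0, norm_const=1.0):
    """Equation (25): E = -T * log(sum_k exp(f_k) / J) for one modality."""
    responses = np.asarray(responses, dtype=float)
    shift = responses.max()
    # log-sum-exp evaluated stably by factoring out the largest response.
    log_sum_exp = shift + np.log(np.exp(responses - shift).sum())
    return -temperature * (log_sum_exp - np.log(norm_const))


def fusion_weights(responses_per_modality, temperatures, gamma=1.0):
    """Equations (26)-(27): softmax over gamma * log P, with log P = -E / T."""
    log_p = np.array([-energy(f, t) / t for f, t in zip(responses_per_modality, temperatures)])
    z = gamma * log_p
    w = np.exp(z - z.max())
    return w / w.sum()


def fuse(features, weights):
    """Equation (28): weighted sum of the transformed modality features."""
    return sum(w * np.asarray(f, dtype=float) for w, f in zip(weights, features))
```

Substituting Equation (25) into Equation (26) gives log P = log(Σ_k exp f_k / J), so in this sketch the temperature changes the energy value but cancels out of the final weights; the sharpness of the weighting is governed by γ alone.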

This energy-based fusion strategy enables the model to dynamically adjust its reliance on each modality according to a surrogate confidence measure. By explicitly incorporating modality confidence and tuning sensitivity via the

γ

parameter, our method achieves robust and context-aware integration of heterogeneous sources, particularly under noisy or partially reliable conditions.

In multimodal spatiotemporal forecasting, different modalities often exhibit heterogeneous noise levels, data quality, and reliability, which makes static or uniform fusion strategies suboptimal. The energy-based framework provides a convenient and principled way to quantify modality-specific uncertainty and to translate it into adaptive fusion weights. By associating higher energy with less reliable or noisier modality representations, the proposed fusion mechanism naturally emphasizes more informative modalities while suppressing uncertain ones, leading to more robust regression performance under heterogeneous and partially unreliable inputs.

It is important to clarify that the proposed Mamba-based fusion and the energy-based mechanism do not represent two independent fusion strategies, but operate at different levels of the overall framework. Specifically, the Mamba module serves as the core representation unification backbone, which integrates heterogeneous temporal features and spatial prior embeddings into a shared latent space by modeling selective sequential dynamics and cross-modality interactions. Built upon the Mamba-fused representations, the energy-based mechanism is introduced as a confidence-aware modulation layer. Its purpose is to quantify modality-specific uncertainty and to adaptively adjust the contribution of each modality during the final prediction stage. Rather than replacing the fusion performed by Mamba, the energy formulation refines the unified features by emphasizing more reliable modalities and attenuating noisier ones. This hierarchical design enables effective feature unification through Mamba while enhancing robustness via uncertainty-guided fusion.

4. Experiments

4.1. Data Description

The dataset used in this study comprises maintenance order records collected from 4S dealerships of a major automobile brand across various cities in China. It covers a total of 624,796 maintenance orders spanning from January 2020 to December 2023, encompassing 26 provinces and municipalities. To validate the reliability of the proposed model, we focus on three representative auto parts: fuel injectors, electronic water pumps, and timing belts. These components are selected due to their intermittent demand characteristics and their consumption being influenced by multiple factors, including vehicle operating conditions, usage intensity, and urban environmental attributes.

In this study, the 4S stores within each city are treated as a single spatial unit of prediction. Given the characteristics of the data and practical application requirements, a one-week period is adopted as the temporal unit for forecasting. As shown in Figure 6, the consumption of the selected parts exhibits pronounced intermittency over time.

It is worth noting that the observed zero-demand periods in the dataset arise from different underlying mechanisms. Specifically, zero observations can be categorized into structural zeros and random zeros. Structural zeros correspond to periods in which demand is inherently absent due to system-level or operational constraints, such as discontinued parts, inactive service outlets, or predefined operational restrictions. In contrast, random zeros originate from stochastic fluctuations around a low but nonzero latent demand process, reflecting the irregular realization of infrequent consumption events.

In this study, the analyzed demand series are dominated by random zeros rather than structural zeros. This type of random zero-demand behavior represents the normal state of intermittent demand in spare parts supply chains and constitutes the primary focus of the proposed forecasting framework. Structural zeros, which imply an intrinsic impossibility of demand occurrence, are not explicitly modeled and fall outside the scope of the present study.

To integrate information on spare parts consumption across regions, we construct multi-dimensional feature graphs representing built environment, natural environment, and socioeconomic contexts. Built environment data are sourced from OpenStreetMap and high-resolution digital elevation models (Copernicus 30-meter DEM), which are processed using GIS tools to capture regional spatial structures. Natural environment variables, such as climate measures, and socioeconomic information, including regional economic statistics, are obtained from local statistical yearbooks. All datasets are preprocessed and standardized to ensure consistency across regions, allowing the model to capture spatial heterogeneity and contextual influences on spare parts demand.

4.2. Experimental Settings

We split the dataset chronologically into training (70%), validation (15%), and test (15%) sets, ensuring no data leakage across time. All experiments are conducted on a server with one NVIDIA T4 GPU and 128 GB memory. The models are implemented in PyTorch 2.0 and trained using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 64.

Following standard practice in time series forecasting, the dataset is split in a strictly chronological manner to avoid temporal leakage. Specifically, earlier time periods are used for training, followed by a contiguous validation set and a subsequent test set drawn from later time intervals. No random sampling across the temporal dimension is performed, ensuring that all test observations occur strictly after the training and validation data.
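A chronological 70/15/15 split over weekly time steps can be sketched as follows. The function name and the use of integer percentages are illustrative choices; the exact boundary handling used by the authors is not stated.

```python
def chronological_split(num_steps, train_pct=70, val_pct=15):
    """Return index ranges for train/validation/test in time order (no shuffling)."""
    train_end = num_steps * train_pct // 100
    val_end = num_steps * (train_pct + val_pct) // 100
    return range(0, train_end), range(train_end, val_end), range(val_end, num_steps)
```

Every validation index is larger than every training index, and every test index is larger than every validation index, which is the property that rules out temporal leakage.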

To evaluate the performance of our proposed model, we adopt three widely used regression metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and the Coefficient of Determination (

R^{2}

). The definitions are given as follows:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}},

(29)

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |,

(30)

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}},

(31)

where

y_{i}

denotes the ground truth value,

{\hat{y}}_{i}

is the predicted value, and

\bar{y}

is the mean of all ground truth values. RMSE penalizes larger errors more heavily, MAE reflects average deviation, and

R^{2}

measures the proportion of variance explained by the model.
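The three metrics in Equations (29)–(31) can be computed directly from their definitions, as in the minimal sketch below (the function name is illustrative).

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) following Equations (29)-(31)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot  # undefined when y_true is constant (ss_tot == 0)
    return rmse, mae, r2
```

For intermittent series, an evaluation window that contains only zeros has zero total variance, so R² is undefined there; this is worth keeping in mind when metrics are computed per node rather than over the pooled test set.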

Each node corresponds to the demand time series of a single part type in a single city, and this node definition is used consistently throughout the method and experiments.

To ensure strict reproducibility and robustness, all experiments are conducted under fully controlled random settings. Unless otherwise stated, a fixed random seed of 42 is used to initialize model parameters, data shuffling, and optimization processes. Furthermore, to account for the stochasticity inherent in neural network training, each experiment is independently repeated five times using different random seeds 42, 3407, 114,514, 256. For each model and dataset, the reported performance metrics (RMSE, MAE, and

R^{2}

) correspond to the mean values across these five runs, and the associated standard deviations are reported to reflect the stability and variance of the results. This evaluation protocol ensures that the reported performance gains are statistically reliable and not attributable to favorable random initialization or sampling effects.

4.3. Loss Function

To jointly optimize point forecasting accuracy and uncertainty estimation, we adopt a combined point and probabilistic loss for model training. Let

Y \in R^{N \times T_{f}}

denote the ground-truth values and

\hat{Y} \in R^{N \times T_{f}}

be the corresponding point predictions. The point-wise prediction error is measured using the mean squared error (MSE):

L_{point} = \frac{1}{N T_{f}} \sum_{i = 1}^{N} \sum_{t = 1}^{T_{f}} {({\hat{Y}}_{i, t} - Y_{i, t})}^{2} .

(32)

In addition to point estimation, the model also outputs a predictive distribution to characterize uncertainty. Specifically, the forecast for each variable at each time step is assumed to follow a Gaussian distribution parameterized by a predicted mean

μ_{i, t}

and variance

σ_{i, t}^{2}

. The probabilistic loss is defined as the negative log-likelihood (NLL):

L_{prob} = \frac{1}{N T_{f}} \sum_{i = 1}^{N} \sum_{t = 1}^{T_{f}} [\frac{{(Y_{i, t} - μ_{i, t})}^{2}}{2 σ_{i, t}^{2}} + \frac{1}{2} log σ_{i, t}^{2}] .

(33)

The final training objective is a weighted combination of the point and probabilistic losses:

L = L_{point} + λ L_{prob},

(34)

where

λ

is a hyperparameter that balances point prediction accuracy and uncertainty modeling.

The hyperparameter

λ

is tuned to balance the point prediction and probabilistic loss components. In practice, we perform a grid search over the validation set using candidate values

{0.1, 0.5, 1.0, 2.0}

and select the value that yields the best overall performance in terms of both point accuracy and uncertainty calibration. In all reported experiments, the chosen

λ

ensures a principled trade-off between minimizing prediction error and accurately modeling uncertainty.

In this work, we assume that the forecast for each variable at each time step follows a Gaussian distribution, parameterized by a predicted mean

μ_{i, t}

and variance

σ_{i, t}^{2}

. This assumption is motivated by the Central Limit Theorem, which states that suitably normalized sums of many independent random variables tend toward a Gaussian distribution; for sparse, low-count demand it should be read as an approximation. Additionally, Gaussian distributions are widely used in time-series forecasting due to their simplicity and effectiveness in modeling uncertainty. Our empirical results indicate that this assumption yields accurate point forecasts and reliable uncertainty estimates, making it a suitable choice for our approach.

It is worth noting that the predictive variance

σ_{i, t}^{2}

is estimated in a fine-grained manner, namely for each node i and each forecasting time step t, rather than being shared globally. This design allows the model to capture heteroscedastic uncertainty that varies across both spatial and temporal dimensions. To ensure numerical stability and enforce the positivity of the variance, the model predicts an unconstrained value which is subsequently transformed using a softplus activation function. In practice, a small constant

ϵ

is added to the resulting variance to avoid numerical issues when computing the logarithmic and division terms in the negative log-likelihood. Finally, we emphasize that the Gaussian negative log-likelihood is adopted as a surrogate loss for optimization purposes. While it provides a convenient and effective way to jointly learn point predictions and uncertainty estimates, we do not assume that the true data-generating process strictly follows a Gaussian distribution. Instead, the Gaussian NLL serves as a practical approximation that encourages well-calibrated uncertainty estimation during training.
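The combined objective of Equations (32)–(34), including the softplus transform and the stabilizing constant described above, can be sketched as follows. The value `eps=1e-6` and the default `lam=1.0` are placeholders chosen for illustration, not the settings reported in the experiments.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)


def combined_loss(y, y_hat, mu, raw_var, lam=1.0, eps=1e-6):
    """L = L_point + lam * L_prob from Equations (32)-(34)."""
    var = softplus(raw_var) + eps                    # positive variance per node and step
    l_point = np.mean((y_hat - y) ** 2)              # Eq. (32): MSE
    l_prob = np.mean((y - mu) ** 2 / (2.0 * var) + 0.5 * np.log(var))  # Eq. (33): Gaussian NLL
    return l_point + lam * l_prob
```

All inputs are arrays of shape N × T_f, so the variance is estimated separately for each node and forecasting step. Note that the Gaussian NLL can be negative when the predicted variance is below 1, which is expected and does not indicate an error.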

4.4. Performance Comparison of Different Models

To rigorously evaluate the effectiveness of our proposed method, we compare it against a diverse set of representative baselines, including time series forecasting models, online learning frameworks, and continual learning algorithms. These models span across various design paradigms such as linear modeling, convolutional architectures, attention mechanisms, and memory-based learning.

Experience Replay (ER) [39]: A technique originating in reinforcement learning and used for continual learning, which stores randomly selected past data samples and replays them to prevent catastrophic forgetting.
DER++ [40]: An improved version of Experience Replay, which incorporates knowledge distillation and regularization to enhance continual learning performance and generalization.
RLinear [41]: A reversible linear forecasting model incorporating Reversible Instance Normalization (RevIN) and Channel Independence (CI). These mechanisms allow it to effectively capture periodic and distributional properties in time-series data.
Informer [42]: A Transformer-based model designed for long-sequence time-series forecasting. It utilizes a ProbSparse self-attention mechanism and a generative decoder to reduce memory consumption and accelerate training.
OnlineTCN [43]: A real-time forecasting model that integrates residual convolutional filters into a TCN backbone, enabling adaptive updates in non-stationary environments.
DLinear [44]: A simple and fast linear forecasting model. It decomposes the time series into trend and residual components, each modeled by a shallow linear layer to capture global temporal patterns.
TiDE [45]: A multi-layer perceptron (MLP)-based encoder–decoder architecture. It balances the simplicity of linear models with the nonlinearity handling capability of MLPs and captures both variable interactions and long-range dependencies.
FSNet [46]: A lightweight TCN-based model for streaming time-series forecasting. It stacks dilated convolutional layers for efficient online learning without retraining, making it suitable for real-time applications.
Time-TCN [47]: A TCN architecture that performs convolutions solely along the temporal axis. However, it ignores dependencies across feature dimensions, which may limit its modeling capacity in multi-modal scenarios.
S-Mamba [48]: A time-series forecasting model built upon the Mamba selective state space architecture. It independently embeds each variate, applies a bidirectional Mamba layer to capture cross-variable dependencies, and models temporal dynamics through a feed-forward network, providing an efficient alternative to Transformer-based approaches.
RLMamba [49]: A Mamba-based encoder-only model for long-term time series forecasting that replaces self-attention with a linear-complexity state space model and incorporates residual learning to reduce redundancy.

As shown in Table 1, the proposed method achieves the best or highly competitive performance across most datasets and horizons, indicating strong forecasting accuracy and robustness. Notably, the performance advantage becomes more pronounced as the prediction horizon increases, highlighting the model’s effectiveness in capturing long-term temporal dependencies. In most settings, our approach outperforms both classical baselines (e.g., ER and DLinear) and more recent models such as TiDE and FSNet, which supports the benefit of jointly modeling temporal dynamics and structured spatial priors.

As shown in Table 2, similar trends are observed for MAE. The proposed method consistently yields the lowest MAE values across all datasets and forecasting steps, demonstrating strong generalization capability and stable error control. In particular, the error gap between our method and the baselines widens under longer forecasting horizons, suggesting that the proposed framework is more resilient to error accumulation in multi-step prediction settings.

Overall, while attention-based models (e.g., Informer) and Mamba-based baselines (e.g., S-Mamba and RLMamba) show competitive performance, they still exhibit higher error growth compared with the proposed method. These results indicate that explicitly integrating temporal modeling, spatial priors, and adaptive fusion mechanisms enables more robust and accurate demand forecasting, especially in sparse and long-horizon scenarios.

In addition to RMSE and MAE, we also evaluate the models using the coefficient of determination (R^{2}), which reflects the proportion of variance in the ground truth that is predictable from the model outputs. Table 3 reports the cumulative R^{2} scores of all compared models on three datasets (electronic water pump, fuel injector, and timing belt) at prediction horizons of 3, 6, and 9 steps. As observed, our proposed method achieves significantly higher R^{2} values across all datasets and forecast horizons. For instance, on the electronic water pump dataset, our model yields R^{2} scores of 0.961, 0.897, and 0.876 for 3-, 6-, and 9-step predictions, respectively, which surpass strong baselines such as TiDE (0.885–0.823) and FSNet (0.876–0.814). A similar pattern is evident on the other datasets, where our model achieves the best or highly competitive performance at most horizons. While models like Informer, FSNet, and TiDE perform relatively well, their prediction capability deteriorates more noticeably as the forecast horizon increases. In contrast, the R^{2} values of our model remain robust and decline more gently, indicating better generalization in longer-term forecasting. S-Mamba and RLMamba achieve higher coefficients of determination than traditional models, reflecting improved goodness-of-fit for long-term forecasts. Despite this improvement, our method maintains the highest R^{2} values across all settings, indicating stronger explanatory power and more reliable trend modeling, particularly for longer prediction horizons.
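For reference, the three metrics can be computed as in the following sketch, which uses their standard definitions. How the scores are aggregated over nodes and steps into the "cumulative" values reported in the tables is not specified here, so the flat-list interface is an assumption.

```python
import math


def rmse(y_true, y_pred) -> float:
    """Root mean squared error over paired observations."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred) -> float:
    """Mean absolute error over paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0.0:
        # A constant target (e.g., an all-zero demand window) has no variance
        # to explain, so R^2 is undefined for it.
        raise ValueError("R^2 is undefined when the ground truth is constant")
    return 1.0 - ss_res / ss_tot
```

The guard in `r2_score` matters for intermittent series: a window containing only zero demand has zero total variance, so R^{2} can only be evaluated over spans that include at least one demand change.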

4.5. Visualization of Prediction Results

To further assess the effectiveness of the proposed model, we provide a qualitative comparison of prediction outcomes through a series of visualizations. In this section, we compare the prediction performance of our method with TiDE, the strongest baseline model identified in the quantitative experiments. While numerical metrics such as RMSE, MAE, and R^{2} offer a comprehensive evaluation of accuracy, visual inspection can provide deeper insights into the temporal alignment and consistency of predictions with the actual data.

We present representative prediction curves from both models alongside the ground truth values, enabling a direct comparison of their ability to capture the underlying time-series patterns across different datasets and prediction horizons. In addition, we include scatter plots of the predicted values versus the true values for both TiDE and our method. These plots highlight the distribution and correlation between predictions and actual observations, where a closer clustering along the diagonal line indicates higher accuracy and reduced variance.

As shown in Figures 7–9, which correspond to prediction horizons of 3, 6, and 9 steps, respectively, the prediction curves generated by our approach follow the trend of the ground truth more closely than those of TiDE, especially in capturing rapid fluctuations and preserving the overall shape of the temporal dynamics. This indicates a stronger ability to adapt to high-frequency variations and suggests robustness to complex temporal patterns.

Moreover, the scatter plots further corroborate this observation. The predicted versus actual value distribution from our method exhibits a tighter alignment along the diagonal line, with a higher degree of point concentration. This suggests not only improved predictive accuracy but also reduced variance, reinforcing the superior performance of the proposed approach in both short- and long-term forecasting tasks.

The closer visual alignment of the proposed method with the ground truth can be largely attributed to the incorporation of prior knowledge as an auxiliary input. By embedding historical statistical features and domain-specific priors, such as traffic geography, socioeconomic indicators, and environmental attributes, into the model, the forecasting framework benefits from enhanced contextual awareness, which enables it to better distinguish underlying temporal patterns and mitigate noise-induced fluctuations.

This integration of prior information effectively guides the learning process, allowing the model to form more stable and informative representations, particularly in scenarios with high variability or abrupt changes. As reflected in both the trajectory comparisons and the scatter plots, the use of priors not only improves the model’s robustness but also significantly enhances its precision and consistency across different prediction horizons.

4.6. Ablation Study

To investigate the contribution of each key component in the proposed model, we conduct an ablation study by designing several controlled variants. Each variant removes a specific module from the full model, or replaces it with a simpler alternative, to evaluate its individual effect on prediction performance. For clarity and conciseness, we denote these variants using the “w/o” (without) notation.

The first variant, referred to as w/o STFormer, replaces the original spatio-temporal modeling backbone with a standard Transformer encoder. This configuration is designed to examine the advantages provided by STFormer in capturing long-range temporal dependencies while simultaneously modeling implicit spatial interactions. The second variant, denoted as w/o GCN, eliminates the multi-graph GCN component, thereby excluding structured prior knowledge such as part-level correlations and contextual dependencies derived from domain attributes. This setup evaluates the role of relational graph information in enhancing prediction robustness and contextual awareness. The third variant, named w/o Mamba, removes the Mamba fusion module and processes the time-series and prior knowledge representations independently. Without the unified multimodal fusion, the model loses the capacity to dynamically align features from different sources, which allows us to assess the effectiveness of modality-level interaction. The fourth variant, marked as w/o Dynamic Fusion, disables the dynamic fusion mechanism that adaptively balances temporal and prior knowledge features, and instead applies direct feature concatenation. This variant helps evaluate whether dynamic relevance weighting between modalities contributes significantly beyond simple integration.
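The difference between the full model and the w/o Dynamic Fusion variant can be illustrated with a generic sketch. The scalar sigmoid gate below is only a stand-in for input-dependent relevance weighting. It is not the fusion mechanism of the proposed framework, and the names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Overflow-safe logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def concat_fusion(h_temporal, h_prior):
    """Static integration: the two feature vectors are simply concatenated."""
    return list(h_temporal) + list(h_prior)


def gated_fusion(h_temporal, h_prior, gate_weights, gate_bias=0.0):
    """Dynamic integration (illustrative): a gate computed from both inputs
    decides how much each modality contributes to the fused vector.

    h_temporal and h_prior must have the same length; gate_weights has
    one entry per element of their concatenation.
    """
    joint = list(h_temporal) + list(h_prior)
    score = sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias
    gate = sigmoid(score)
    fused = [gate * a + (1.0 - gate) * b for a, b in zip(h_temporal, h_prior)]
    return fused, gate
```

With concatenation, any re-weighting of the two sources is left to downstream layers with fixed weights, whereas the gate changes with each input, which is the property the fourth ablation variant removes.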

All ablation models are evaluated under the same experimental settings as the full model. The comparative results allow for a comprehensive understanding of how each architectural element contributes to the overall performance, particularly in handling complex, sparse, and discontinuous patterns inherent in 4S store spare parts demand prediction.

As summarized in Table 4 and Figure 10, the complete model consistently yields the best performance across all datasets and evaluation metrics, indicating the effectiveness of the integrated framework. When the STFormer module is replaced with a standard Transformer (w/o STFormer), a noticeable degradation in performance is observed. The average RMSE and MAE increase to 0.514 and 0.360, respectively, while the R^{2} value drops to 0.851. This suggests that the spatial–temporal attention mechanism in STFormer plays a vital role in capturing fine-grained temporal patterns relevant to demand fluctuations. Excluding the GCN module (w/o GCN) also results in performance deterioration, although to a slightly lesser extent than the removal of STFormer. The average MAE increases to 0.354 and the RMSE reaches 0.508, reflecting the importance of modeling structural dependencies and prior correlations between spare parts. The decline in R^{2} further underscores the role of spatial information in enhancing model generalization. Removing the Mamba fusion strategy (w/o Mamba), which is responsible for aligning heterogeneous data modalities, leads to a marginal drop in predictive accuracy, with an average MAE of 0.349. While this module contributes less than STFormer and GCN in isolation, it still plays a meaningful role in harmonizing temporal features with external prior knowledge. Finally, replacing the dynamic feature fusion mechanism with a simple concatenation scheme (w/o Dynamic Fusion) leads to a consistent decline across all metrics, albeit less severe than removing either STFormer or GCN. The average RMSE and MAE increase to 0.496 and 0.344, respectively, indicating that dynamic weighting of temporal and prior features offers a tangible benefit in adaptively capturing relevant patterns. Overall, the results demonstrate that each component (temporal modeling via STFormer, structural modeling via GCN, modality fusion via Mamba, and dynamic integration) contributes to the overall performance. The combination of these components in the proposed architecture leads to synergistic improvements, particularly evident in long-range forecasting tasks, as reflected by the consistently higher R^{2} and lower error metrics.

To systematically assess the contribution of our proposed Mamba module in modality unification, we conducted an ablation study comparing our full model (denoted as OUR) with variants employing alternative sequence modeling architectures, including LSTM, GRU, and Transformer. The experiments were performed on three representative datasets, namely Electronic Water Pump, Fuel Injector, and Timing Belt, and the evaluation metrics included RMSE, MAE, and R^{2}.

In this comparison, Mamba, LSTM, GRU, and Transformer are swapped exclusively within the fusion module. The rest of the architecture, including the temporal modeling component and the spatial embedding layers, remains unchanged across all configurations, so that differences in performance can be attributed to the fusion mechanism alone. To ensure a fair comparison, we aligned the input dimensionalities and matched the number of parameters of the fusion modules across all variants.
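Parameter matching between recurrent fusion modules can be sketched as follows. The counts follow the common convention of two bias vectors per gate (as used, for example, by PyTorch's `nn.LSTM` and `nn.GRU`). The dimensions in the usage note are hypothetical and are not the sizes used in our experiments.

```python
def rnn_params(input_dim: int, hidden_dim: int, gates: int) -> int:
    """Parameters of one recurrent layer with `gates` gate blocks:
    input-to-hidden weights, hidden-to-hidden weights, and two bias vectors."""
    per_gate = input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim
    return gates * per_gate


def lstm_params(input_dim: int, hidden_dim: int) -> int:
    """Single-layer LSTM (4 gate blocks)."""
    return rnn_params(input_dim, hidden_dim, gates=4)


def gru_params(input_dim: int, hidden_dim: int) -> int:
    """Single-layer GRU (3 gate blocks)."""
    return rnn_params(input_dim, hidden_dim, gates=3)


def match_hidden_size(target_params: int, input_dim: int, param_fn, max_hidden: int = 4096) -> int:
    """Hidden size whose parameter count is closest to the target budget."""
    return min(
        range(1, max_hidden + 1),
        key=lambda h: abs(param_fn(input_dim, h) - target_params),
    )
```

For instance, with a hypothetical input dimension of 64, an LSTM with 64 hidden units has `lstm_params(64, 64)` = 33,280 parameters, and `match_hidden_size(33280, 64, gru_params)` selects 77 hidden units for the GRU (33,033 parameters), so the two variants differ by less than 1% in size.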

The results, summarized in Table 5, demonstrate that OUR outperforms all variants in the vast majority of evaluated cases across datasets and metrics. Specifically, the inclusion of the Mamba module leads to a notable reduction in RMSE and MAE while improving R^{2}, indicating that the model captures more informative and discriminative modality-specific representations. Among the alternative architectures, Transformer achieves relatively competitive performance due to its global attention mechanism, but it still falls short of OUR, particularly on the Electronic Water Pump dataset, where the temporal and contextual dependencies are critical. LSTM and GRU exhibit moderate performance, reflecting their capacity to model sequential dependencies but limited ability to fully exploit multimodal correlations and uncertainty.

4.7. Hyperparameter Analysis

To evaluate the sensitivity of the proposed fusion framework to the uncertainty control parameter γ, we conduct a series of controlled experiments on three datasets (electronic water pump, fuel injector, and timing belt) under the 9-step forecasting setting. Specifically, γ is varied over the set {0.2, 0.4, 0.6, 0.8, 1.0} to examine its impact on predictive performance, as summarized in Table 6 and Figure 11.

From the experimental results, it can be observed that the model achieves the best overall performance when γ = 0.6. In this setting, all three metrics (RMSE, MAE, and R^{2}) consistently reach optimal or near-optimal values across all three datasets. For instance, on the electronic water pump dataset, the RMSE and MAE reach 0.446 and 0.307, respectively, while the R^{2} score peaks at 0.876. Similar trends are also found on the fuel injector and timing belt datasets, where performance degrades slightly when γ deviates from 0.6 in either direction.

The results validate the role of γ as a crucial balancing factor between uncertainty sensitivity and representational stability during fusion. A lower γ (e.g., 0.2) tends to overemphasize the uncertainty modeling, leading to underutilization of potentially informative modalities, thus increasing the overall error. Conversely, a higher γ (e.g., 1.0) may reduce the adaptiveness of the fusion weights, making the model more susceptible to noisy or ambiguous features.

The observed performance peak at γ = 0.6 suggests that moderate uncertainty modulation achieves the most effective trade-off between noise suppression and information retention. This highlights the importance of tuning γ to align the fusion mechanism with the statistical characteristics of multimodal inputs, ultimately resulting in more robust and discriminative temporal representations.
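The sensitivity analysis amounts to a one-dimensional grid search, which can be sketched as follows. The callback `evaluate` is hypothetical: it stands for training and evaluating the model with a given γ and returning the three metrics. Selecting the best value by RMSE is likewise an assumption made for illustration.

```python
GAMMA_GRID = (0.2, 0.4, 0.6, 0.8, 1.0)  # values examined in this section


def sweep_gamma(evaluate, grid=GAMMA_GRID, select_by="rmse"):
    """Evaluate every gamma in the grid and return (best_gamma, all_results).

    evaluate(gamma) must return a dict with the keys 'rmse', 'mae', and 'r2'.
    Error metrics are minimized; R^2 is maximized.
    """
    results = {gamma: evaluate(gamma) for gamma in grid}
    if select_by == "r2":
        best = max(results, key=lambda g: results[g]["r2"])
    else:
        best = min(results, key=lambda g: results[g][select_by])
    return best, results
```

Keeping the full `results` dictionary, rather than only the best value, allows the shape of the sensitivity curve to be inspected, which is what distinguishes a clear optimum from a flat region.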

5. Discussion

The experimental results demonstrate that the proposed hybrid temporal–spatial framework achieves superior performance across multiple forecasting horizons for sparse and intermittent demand. Rather than merely reflecting numerical improvements, these results highlight the importance of aligning model design with the intrinsic properties of intermittent demand series.

A key observation is the clear performance advantage of the STFormer-based temporal module over the standard Transformer. Intermittent demand series are typically characterized by long zero-demand intervals, abrupt spikes, and weak periodicity, which violate the dense dependency assumptions implicitly favored by conventional self-attention mechanisms. In such settings, standard Transformers may overfit local noise or fail to propagate informative signals across long temporal gaps. By contrast, the selective state-space modeling mechanism underlying STFormer facilitates stable long-range information propagation and emphasizes robustness to sparsity, enabling more effective extraction of temporal signals from fragmented historical observations. This suggests that robustness-oriented temporal modeling is particularly critical for intermittent demand forecasting.

The incorporation of prior knowledge through multiple graphs further enhances predictive robustness by compensating for information scarcity at the individual series level. Although the relative contributions of geographic, socioeconomic, and environmental graphs are not explicitly parameterized by learnable coefficients in the current framework, their joint inclusion enables the model to exploit complementary contextual cues. Geographic proximity and socioeconomic similarity provide relatively stable structural signals that support demand inference when historical observations are sparse, while environmental factors act as dynamic modifiers that help explain short-term fluctuations. This qualitative decomposition offers interpretability by clarifying how heterogeneous prior knowledge supports demand prediction from different perspectives.

Compared with baselines such as ER, DER++, and DLinear, which primarily rely on local temporal dynamics or replay mechanisms, the proposed framework explicitly leverages cross-node correlations to alleviate the inherent information deficiency of intermittent series. This capability is particularly important when individual demand trajectories are insufficiently informative on their own. Recent deep learning models such as TiDE and Informer achieve competitive performance under specific horizons but tend to exhibit instability as the forecasting horizon increases, largely due to their limited integration of structured spatial or contextual information.

Despite these advantages, the proposed framework still exhibits prediction deviations during periods of abrupt demand changes. Such deviations are largely attributable to the intrinsic uncertainty of intermittent demand, where extreme events are weakly represented in historical data and are difficult to anticipate using data-driven models alone. In addition, spatial dependencies are modeled through predefined static graphs, implicitly assuming stable inter-node relationships. In practice, these relationships may evolve over time due to changes in infrastructure, economic conditions, or policy interventions, potentially limiting the model’s responsiveness to rapidly changing demand dynamics.

From a managerial perspective, the proposed framework should be interpreted not merely as a point forecasting tool but as a risk-aware decision support mechanism. The predicted demand trajectories can serve as early-warning signals for inventory management, enabling firms to adjust alert thresholds and buffer stock levels in anticipation of potential demand surges or depletion risks. By shifting the role of demand forecasting from exact estimation toward risk-informed planning, the framework supports more resilient and adaptive inventory management in supply chain environments characterized by sparse and intermittent demand.
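As one possible illustration of such risk-informed use, the predicted means and variances over a planning horizon can be turned into an alert threshold. The sketch below rests on strong assumptions that the framework itself does not make: forecast errors are treated as Gaussian and independent across steps. The function name and the default service level are hypothetical.

```python
import math
from statistics import NormalDist


def alert_threshold(means, variances, service_level: float = 0.95) -> float:
    """Stock level expected to cover horizon demand with the given probability.

    means and variances are per-step predictive means and variances.
    Assumes Gaussian, step-wise independent errors (illustrative only).
    """
    z = NormalDist().inv_cdf(service_level)  # standard normal quantile
    total_mean = sum(means)
    total_std = math.sqrt(sum(variances))
    return total_mean + z * total_std
```

Under these assumptions, a part whose predicted variance rises would see its threshold rise even if the mean forecast is unchanged, which is the sense in which the forecast acts as an early-warning signal. For strongly skewed intermittent demand, an empirical quantile of past forecast errors may be a safer choice than the Gaussian quantile.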

6. Conclusions

This study proposes a hybrid temporal–spatial forecasting framework that integrates Transformer networks, GCNs, and a Mamba-based feature fusion mechanism to address the challenge of sparse and intermittent demand forecasting. Using automotive spare parts as a representative application scenario, the proposed framework effectively captures long-range temporal dependencies, spatial correlations, and contextual prior knowledge derived from heterogeneous data sources.

Extensive experiments conducted on real-world business datasets demonstrate that the proposed approach outperforms state-of-the-art forecasting models in the vast majority of evaluated cases across multiple evaluation metrics and prediction horizons. The results highlight the importance of jointly modeling temporal dynamics and spatial dependencies when addressing intermittent demand, as well as the effectiveness of incorporating prior knowledge into data-driven forecasting frameworks to improve robustness and predictive accuracy.

Although this study focuses on automotive spare parts, the proposed framework is generalizable and can be extended to other supply chain contexts characterized by sparse, irregular, or low-frequency demand, such as aerospace spare parts, medical equipment maintenance, and industrial service logistics. By providing accurate and stable demand forecasts, the framework offers practical value for inventory control, resource allocation, and operational decision-making.

Future research may further enhance the proposed framework in several directions. First, incorporating adaptive or dynamic graph structures could enable the model to better capture evolving spatial relationships among supply nodes. Second, integrating real-time data streams or online learning mechanisms may improve the model’s responsiveness to sudden demand changes and external shocks. Finally, incorporating additional exogenous factors, such as policy changes or macroeconomic indicators, could further improve forecasting performance in highly uncertain environments.

Author Contributions

Conceptualization, Y.S. and B.G.; methodology, Y.S.; software, Y.S.; validation, R.L., H.K., M.Z., X.C. and K.Y.; formal analysis, Y.S.; writing—original draft preparation, Y.S.; writing—review and editing, Y.S., B.G. and C.W.; supervision, B.G. and C.W.; funding acquisition, Y.S., R.L. and B.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. U2268204 and 62172061; National Key R&D Program of China under Grant No. 2023YFB3308600; the Science and Technology Project of Sichuan Province under Grant No. 2024ZDZX0012, 2023ZHCG0011, 2021YFG0152; Sichuan Provincial Key Laboratory Open Research Project (Project No.: 2024-ScL-MC&I-005).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are not publicly available because they contain proprietary operational records and commercially sensitive demand information obtained under strict confidentiality and data-sharing agreements with an industrial partner. These constraints prohibit any form of data sharing or redistribution.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Liu, Y.; Zhang, Q.; Fan, Z.P.; You, T.H.; Wang, L.X. Maintenance spare parts demand forecasting for automobile 4S shop considering weather data. IEEE Trans. Fuzzy Syst. 2018, 27, 943–955.
2. Qiu, Q.; Qin, C.; Shi, J.; Zhou, H. Research on demand forecast of aircraft spare parts based on fractional order discrete grey model. In Proceedings of the 2019 IEEE 5th International Conference on Computer and Communications (ICCC), Chengdu, China, 6–9 December 2019; IEEE: New York, NY, USA, 2019; pp. 2212–2216.
3. Jifri, M.H.; Hassan, E.E.; Miswan, N.H. Forecasting performance of time series and regression in modeling electricity load demand. In Proceedings of the 2017 7th IEEE international conference on system engineering and technology (ICSET), Shah Alam, Malaysia, 2–3 October 2017; IEEE: New York, NY, USA, 2017; pp. 12–16.
4. Stormi, K.; Laine, T.; Suomala, P.; Elomaa, T. Forecasting sales in industrial services: Modeling business potential with installed base information. J. Serv. Manag. 2018, 29, 277–300.
5. İfraz, M.; Aktepe, A.; Ersöz, S.; Çetinyokuş, T. Demand forecasting of spare parts with regression and machine learning methods: Application in a bus fleet. J. Eng. Res. 2023, 11, 100057.
6. do Rego, J.R.; De Mesquita, M.A. Demand forecasting and inventory control: A simulation study on automotive spare parts. Int. J. Prod. Econ. 2015, 161, 1–16.
7. Alalawin, A.; Arabiyat, L.M.; Alalaween, W.; Qamar, A.; Mukattash, A. Forecasting vehicle’s spare parts price and demand. J. Qual. Maint. Eng. 2021, 27, 483–499.
8. Fahrudin, T.M.; Ambariawan, R.P.; Kamisutara, M. Demand forecasting of the automobile sales using least square, single exponential smoothing and double exponential smoothing. Petra Int. J. Bus. Stud. (IJBS) 2021, 4, 122–130.
9. Matsumoto, M.; Ikeda, A. Examination of demand forecasting by time series analysis for auto parts remanufacturing. J. Remanuf. 2015, 5, 1–20.
10. Chandriah, K.K.; Naraganahalli, R.V. RNN/LSTM with modified Adam optimizer in deep learning approach for automobile spare parts demand forecasting. Multimed. Tools Appl. 2021, 80, 26145–26159.
11. Jung, D.K.; Park, Y.S. A Study on the Demand Prediction Model for Repair Parts of Automotive After-sales Service Center Using LSTM Artificial Neural Network. J. Inf. Syst. 2022, 31, 197–220.
12. Syntetos, A.A.; Babai, M.Z.; Altay, N. On the demand distributions of spare parts. Int. J. Prod. Res. 2012, 50, 2101–2117.
13. Abu Arqub, O.; Al-Smadi, M.; Momani, S.; Hayat, T. Numerical solutions of fuzzy differential equations using reproducing kernel Hilbert space method. Soft Comput. 2016, 20, 3283–3302.
14. Kim, N.; Park, Y.; Lee, D. Differences in consumer intention to use on-demand automobile-related services in accordance with the degree of face-to-face interactions. Technol. Forecast. Soc. Change 2019, 139, 277–286.
15. Sodemann, A.A.; Ross, M.P.; Borghetti, B.J. A review of anomaly detection in automated surveillance. IEEE Trans. Syst. Man, Cybern. Part C (Appl. Rev.) 2012, 42, 1257–1272.
16. Willemain, T.R.; Smart, C.N.; Shockor, J.H.; DeSautels, P.A. Forecasting intermittent demand in manufacturing: A comparative evaluation of Croston’s method. Int. J. Forecast. 1994, 10, 529–538.
17. Croston, J.D. Forecasting and stock control for intermittent demands. J. Oper. Res. Soc. 1972, 23, 289–303.
18. Syntetos, A.A.; Boylan, J.E. On the bias of intermittent demand estimates. Int. J. Prod. Econ. 2001, 71, 457–466.
19. Babai, M.Z.; Dallery, Y.; Boubaker, S.; Kalai, R. A new method to forecast intermittent demand in the presence of inventory obsolescence. Int. J. Prod. Econ. 2019, 209, 30–41.
20. Chen, F.; Shang, D.; Zhou, G.; Ye, K.; Ren, F.; Wu, G. Collaborative multiview time series modeling for vehicle maintenance demand prediction. Sci. Rep. 2025, 15, 13058.
21. Hua, Z.; Zhang, B. A hybrid support vector machines and logistic regression approach for forecasting intermittent demand of spare parts. Appl. Math. Comput. 2006, 181, 1035–1048.
22. Böcker, L.; Prillwitz, J.; Dijst, M. Climate change impacts on mode choices and travelled distances: A comparison of present with 2050 weather conditions for the Randstad Holland. J. Transp. Geogr. 2013, 28, 176–185.
23. Pongthanaisawan, J.; Sorapipatana, C. Relationship between level of economic development and motorcycle and car ownerships and their impacts on fuel consumption and greenhouse gas emission in Thailand. Renew. Sustain. Energy Rev. 2010, 14, 2966–2975.
24. Alourani, A.; Ashfaq, F.; Jhanjhi, N.; Ali Khan, N. BiLSTM-and GNN-Based Spatiotemporal Traffic Flow Forecasting with Correlated Weather Data. J. Adv. Transp. 2023, 2023, 8962283.
25. Chen, J.F.; Wang, L.; Liang, Y.; Yu, Y.; Feng, J.; Zhao, J.; Ding, X. Order dispatching via GNN-based optimization algorithm for on-demand food delivery. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13147–13162.
26. Yao, X.; Gao, Y.; Zhu, D.; Manley, E.; Wang, J.; Liu, Y. Spatial origin-destination flow imputation using graph convolutional networks. IEEE Trans. Intell. Transp. Syst. 2020, 22, 7474–7484.
27. Su, L.; Zuo, X.; Li, R.; Wang, X.; Zhao, H.; Huang, B. A systematic review for transformer-based long-term series forecasting. Artif. Intell. Rev. 2025, 58, 80.
28. Zhu, J.; Liu, D.; Chen, H.; Liu, J.; Tao, Z. DTSFormer: Decoupled temporal-spatial diffusion transformer for enhanced long-term time series forecasting. Knowl.-Based Syst. 2025, 309, 112828.
29. Chien, C.F.; Ku, C.C.; Lu, Y.Y. Ensemble learning for demand forecast of After-Market spare parts to empower data-driven value chain and an empirical study. Comput. Ind. Eng. 2023, 185, 109670.
30. Faria, M.V.; Baptista, P.C.; Farias, T.L.; Pereira, J.M. Assessing the impacts of driving environment on driving behavior patterns. Transportation 2020, 47, 1311–1337.
31. Tao, T.; Cao, J. Ineffective built environment interventions: How to reduce driving in American suburbs? Transp. Res. Part A Policy Pract. 2024, 179, 103924.
32. Chen, J.; Liu, K.; Li, R.; Li, W.; Chen, Q. Optimising built environment to reduce car use: Spatial and attribute heterogeneity perspectives. Transp. Res. Part D Transp. Environ. 2025, 143, 104767.
33. Banister, D. Transport and economic development: Reviewing the evidence. Transp. Rev. 2012, 32, 1–2.
34. Mrozik, M.; Merkisz-Guranowska, A. Environmental assessment of the vehicle operation process. Energies 2020, 14, 76.
35. Chen, Z.; Pu, B.; Zhao, L.; He, J.; Liang, P. Divide and augment: Supervised domain adaptation via sample-wise feature fusion. Inf. Fusion 2025, 115, 102757.
36. Yan, T.; Xing, X.; Wang, D.; Tsui, K.L.; Xia, M. Interpretable degradation tensor modeling through multi-scale and multi-level time-frequency feature fusion for machine health monitoring. Inf. Fusion 2025, 117, 102935.
37. Fofanah, A.J.; Chen, D.; Wen, L.; Zhang, S. CHAMFormer: Dual heterogeneous three-stages coupling and multivariate feature-aware learning network for traffic flow forecasting. Expert Syst. Appl. 2025, 266, 126085.
38. Peng, T.; Gan, M.; Ou, Q.; Yang, X.; Wei, L.; Ler, H.R.; Yu, H. Railway cold chain freight demand forecasting with graph neural networks: A novel GraphARMA-GRU model. Expert Syst. Appl. 2024, 255, 124693.
39. Chaudhry, A.; Rohrbach, M.; Elhoseiny, M.; Ajanthan, T.; Dokania, P.K.; Torr, P.H.; Ranzato, M. On tiny episodic memories in continual learning. arXiv 2019, arXiv:1902.10486.
40. Buzzega, P.; Boschini, M.; Porrello, A.; Abati, D.; Calderara, S. Dark experience for general continual learning: A strong, simple baseline. Adv. Neural Inf. Process. Syst. 2020, 33, 15920–15930.
41. Kim, T.; Kim, J.; Tae, Y.; Park, C.; Choi, J.H.; Choo, J. Reversible instance normalization for accurate time-series forecasting against distribution shift. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021.
42. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; Volume 35, pp. 11106–11115.
43. Woo, G.; Liu, C.; Sahoo, D.; Kumar, A.; Hoi, S. CoST: Contrastive learning of disentangled seasonal-trend representations for time series forecasting. arXiv 2022, arXiv:2202.01575.
44. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11121–11128.
45. Das, A.; Kong, W.; Leach, A.; Mathur, S.; Sen, R.; Yu, R. Long-term forecasting with TiDE: Time-series dense encoder. arXiv 2023, arXiv:2304.08424.
46. Pham, Q.; Liu, C.; Sahoo, D.; Hoi, S.C. Learning fast and slow for online time series forecasting. arXiv 2022, arXiv:2202.11672.
47. Wen, Q.; Chen, W.; Sun, L.; Zhang, Z.; Wang, L.; Jin, R.; Tan, T. OneNet: Enhancing time series forecasting models under concept drift by online ensembling. Adv. Neural Inf. Process. Syst. 2023, 36, 69949–69980.
48. Wang, Z.; Kong, F.; Feng, S.; Wang, M.; Yang, X.; Zhao, H.; Wang, D.; Zhang, Y. Is Mamba effective for time series forecasting? Neurocomputing 2025, 619, 129178.
49. Wang, M.; Tong, G. RLMamba: Integrating residual learning with Mamba for long-term time series forecasting. Expert Syst. Appl. 2025, 278, 127362.

Figure 1. Overall architecture.

Figure 2. Comparison of Transformer and STformer embedding methods.

Figure 3. Example of constructing an a priori data graph structure.

Figure 4. Mamba-based fusion block.

Figure 5. Overall structure of Mamba.

Figure 6. Overview of spare parts demand in 4S stores: (a) electronic water pump; (b) fuel injector; (c) timing belt.

Figure 7. Comparison between the proposed method and TiDE for 3-step prediction. (a) Comparison of predicted results. (b) TiDE's predicted values versus actual values. (c) The proposed method's predicted values versus actual values.

Figure 8. Comparison between the proposed method and TiDE for 6-step prediction. (a) Comparison of predicted results. (b) TiDE's predicted values versus actual values. (c) The proposed method's predicted values versus actual values.

Figure 9. Comparison between the proposed method and TiDE for 9-step prediction. (a) Comparison of predicted results. (b) TiDE's predicted values versus actual values. (c) The proposed method's predicted values versus actual values.

Figure 10. Ablation study on the three datasets and their average (AVG). The prediction horizon is 9 steps. (a) Electronic Water Pump. (b) Fuel Injector. (c) Timing Belt. (d) AVG.

Figure 11. The impact of γ on model performance. (a) Electronic Water Pump. (b) Fuel Injector. (c) Timing Belt.

Table 1. Comparison of RMSE across different models and datasets at prediction horizons 3, 6, and 9. Lower RMSE indicates better performance.

RMSE	Electronic Water Pump			Fuel Injector			Timing Belt
	3	6	9	3	6	9	3	6	9
ER	0.305	0.465	0.501	0.333	0.475	0.513	0.388	0.552	0.609
DER++	0.298	0.457	0.490	0.326	0.468	0.499	0.372	0.541	0.602
Informer	0.274	0.426	0.466	0.307	0.443	0.479	0.358	0.532	0.593
OnlineTCN	0.281	0.434	0.472	0.312	0.451	0.483	0.359	0.533	0.594
DLinear	0.287	0.442	0.479	0.319	0.458	0.488	0.368	0.540	0.601
TiDE	0.265	0.417	0.459	0.295	0.438	0.472	0.342	0.514	0.570
FSNet	0.271	0.421	0.461	0.298	0.440	0.473	0.347	0.517	0.574
Time-TCN	0.292	0.451	0.489	0.322	0.461	0.496	0.369	0.547	0.605
S-Mamba	0.268	0.422	0.463	0.299	0.442	0.478	0.346	0.519	0.576
RLMamba	0.271	0.425	0.467	0.302	0.446	0.481	0.349	0.523	0.579
OURS	0.252	0.407	0.446	0.272	0.427	0.474	0.349	0.523	0.589

Table 2. Cumulative MAE of each model at prediction horizons of 3, 6, and 9 steps.

MAE	Electronic Water Pump			Fuel Injector			Timing Belt
	3	6	9	3	6	9	3	6	9
ER	0.326	0.492	0.538	0.334	0.514	0.558	0.372	0.552	0.606
DER++	0.309	0.473	0.518	0.316	0.491	0.535	0.359	0.535	0.584
Informer	0.245	0.395	0.437	0.256	0.406	0.453	0.294	0.453	0.497
OnlineTCN	0.263	0.421	0.463	0.274	0.432	0.479	0.311	0.474	0.518
DLinear	0.287	0.442	0.488	0.295	0.459	0.506	0.337	0.506	0.553
TiDE	0.213	0.348	0.387	0.219	0.359	0.398	0.261	0.405	0.446
FSNet	0.217	0.351	0.392	0.222	0.361	0.402	0.264	0.408	0.449
Time-TCN	0.273	0.429	0.472	0.281	0.441	0.486	0.321	0.487	0.532
S-Mamba	0.221	0.356	0.396	0.227	0.368	0.409	0.269	0.414	0.455
RLMamba	0.224	0.360	0.401	0.231	0.372	0.414	0.272	0.418	0.459
OURS	0.182	0.284	0.307	0.194	0.301	0.329	0.243	0.371	0.411

Table 3. Cumulative $R^{2}$ of each model at prediction horizons of 3, 6, and 9 steps.

$R^{2}$	Electronic Water Pump			Fuel Injector			Timing Belt
	3	6	9	3	6	9	3	6	9
ER	0.798	0.715	0.688	0.784	0.701	0.674	0.776	0.694	0.665
DER++	0.812	0.732	0.705	0.798	0.716	0.689	0.791	0.709	0.681
Informer	0.866	0.812	0.789	0.854	0.801	0.777	0.848	0.796	0.774
OnlineTCN	0.857	0.804	0.779	0.842	0.788	0.761	0.838	0.781	0.759
DLinear	0.847	0.793	0.764	0.831	0.778	0.752	0.825	0.769	0.745
TiDE	0.885	0.841	0.823	0.874	0.829	0.812	0.869	0.825	0.805
FSNet	0.876	0.833	0.814	0.864	0.821	0.801	0.859	0.816	0.794
Time-TCN	0.832	0.765	0.735	0.816	0.749	0.721	0.812	0.743	0.714
S-Mamba	0.889	0.845	0.826	0.878	0.833	0.815	0.872	0.828	0.808
RLMamba	0.895	0.851	0.832	0.884	0.839	0.821	0.878	0.834	0.814
OURS	0.961	0.897	0.876	0.953	0.885	0.859	0.951	0.882	0.860

Table 4. Ablation study results of different model variants on three datasets.

Dataset		Electronic Water Pump	Fuel Injector	Timing Belt	AVG
OURS	RMSE	0.446	0.474	0.589	0.503
	MAE	0.307	0.329	0.411	0.349
	$R^{2}$	0.876	0.859	0.860	0.865
w/o STFormer	RMSE	0.471	0.503	0.612	0.529
	MAE	0.328	0.351	0.436	0.372
	$R^{2}$	0.861	0.842	0.842	0.848
w/o GCN	RMSE	0.467	0.497	0.607	0.524
	MAE	0.323	0.344	0.429	0.365
	$R^{2}$	0.864	0.846	0.848	0.853
w/o Mamba	RMSE	0.460	0.488	0.598	0.515
	MAE	0.319	0.338	0.421	0.359
	$R^{2}$	0.868	0.850	0.854	0.857
w/o Dynamic Fusion	RMSE	0.454	0.483	0.594	0.510
	MAE	0.313	0.333	0.417	0.354
	$R^{2}$	0.872	0.854	0.857	0.861

Table 5. Ablation study results of different modality unification methods on three datasets.

Dataset		Electronic Water Pump	Fuel Injector	Timing Belt	AVG
OURS	RMSE	0.446	0.474	0.589	0.503
	MAE	0.307	0.329	0.411	0.349
	$R^{2}$	0.876	0.859	0.860	0.865
w/ LSTM	RMSE	0.457	0.485	0.596	0.513
	MAE	0.312	0.331	0.418	0.354
	$R^{2}$	0.872	0.853	0.855	0.860
w/ GRU	RMSE	0.460	0.490	0.601	0.517
	MAE	0.316	0.334	0.423	0.358
	$R^{2}$	0.869	0.850	0.852	0.857
w/ Transformer	RMSE	0.463	0.495	0.607	0.522
	MAE	0.320	0.338	0.428	0.362
	$R^{2}$	0.867	0.847	0.849	0.854

Table 6. Performance comparison under different values of γ on step-9 forecasting.

$γ$		0.2	0.4	0.6	0.8	1.0
Electronic water pump	RMSE	0.475	0.458	0.446	0.452	0.457
	MAE	0.329	0.316	0.307	0.314	0.317
	$R^{2}$	0.857	0.868	0.876	0.870	0.866
Fuel injector	RMSE	0.502	0.486	0.474	0.479	0.484
	MAE	0.351	0.339	0.329	0.335	0.340
	$R^{2}$	0.837	0.850	0.859	0.853	0.849
Timing belt	RMSE	0.612	0.598	0.589	0.595	0.599
	MAE	0.434	0.420	0.411	0.418	0.422
	$R^{2}$	0.834	0.848	0.860	0.851	0.847

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

