Article

A Non-Autoregressive Spatiotemporal Framework for Offline Full-Matrix Origin–Destination Forecasting in Large-Scale Metro Networks

1
Department of Electrical and Computer Engineering, Inha University, Incheon 22212, Republic of Korea
2
ITS Mobility Lab Co., Ltd., Seoul 06271, Republic of Korea
3
Department of Computer Engineering, Inha University, Incheon 22212, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(11), 5333; https://doi.org/10.3390/app16115333
Submission received: 15 April 2026 / Revised: 19 May 2026 / Accepted: 22 May 2026 / Published: 26 May 2026
(This article belongs to the Section Transportation and Future Mobility)

Abstract

Origin–destination (OD) matrix forecasting is essential for urban railway operations because it enables simultaneous understanding of the direction and magnitude of passenger flows. However, OD matrices in large-scale subway networks are difficult to predict owing to their high dimensionality and sparsity, and existing approaches often rely on station-level predictions or complex structural designs. This study addresses the offline full-matrix OD forecasting problem, where complete historical OD sequences are available at prediction time, and proposes Metro-GATF, a spatiotemporal forecasting framework that jointly models railway topology and dynamic OD interactions. The model employs a GATv2-based spatial encoder to learn static inter-station relationships and encodes time-varying interactions using sparse OD graphs. A non-autoregressive transformer decoder generates future multi-step node representations in parallel, whereas origin–destination factorization and sparsity-aware gating are used to reconstruct the full OD matrix. Experiments on minute-level AFC-based OD data from a 637-station metropolitan subway network demonstrated that Metro-GATF achieved the lowest sMAPE among the compared full-matrix models. These results indicate that the proposed framework effectively captures complex spatiotemporal OD patterns and offers a practical end-to-end framework for forecasting urban railway demand.

1. Introduction

With rapid urbanization and the resulting increase in mobility demand, subways have become a core mode of mass public transportation in major cities worldwide. Urban railway systems play a crucial role in alleviating traffic congestion and ensuring the reliability of public transit services, owing to their high capacity, punctuality, and energy efficiency. To operate such large-scale railway networks efficiently, it is essential to accurately understand and predict the spatiotemporal variations in passenger demand. In particular, origin–destination (OD) matrix forecasting is recognized as a key problem for transportation operations, demand management, resource allocation, and infrastructure planning, because it enables the simultaneous capture of directional demand between origin and destination stations [1,2,3]. Recent surveys summarizing OD flow research also identify OD prediction, construction, estimation, and forecasting as central problem domains, emphasizing that station-level demand alone is insufficient in public transportation systems such as urban railways, where network-level OD information is critical.
The Seoul metropolitan subway system is not merely a single urban rail network confined to Seoul but rather a large-scale regional railway network connecting Seoul, Incheon, and Gyeonggi Province, formed as a complex system involving multiple operating agencies [4,5]. According to Seoul’s TOPIS, Line 1 extends to Suwon, Incheon, and Cheonan, whereas Line 9 operates both express trains and local trains [4]. Recent statistics also report that the total ridership of Seoul Subway Lines 1–8 reached 2.417 billion passengers in 2024, with a daily average of approximately 6.61 million [6]. In such an environment characterized by multiple lines, operators, and large-scale demand, station-level boarding and alighting counts alone are insufficient to fully capture directional passenger flows and congestion propagation across the network. Therefore, there is a strong need for full-matrix OD forecasting that simultaneously considers both the origin and destination stations.
Recent OD forecasting studies have evolved from recurrent neural networks and graph-based models to Transformer- and MLP-based architectures [2,7,8,9,10,11,12,13,14,15,16,17,18,19]. In parallel, graph attention mechanisms such as GAT and GATv2 have provided adaptive alternatives to fixed graph convolution for spatial relationship modeling [20,21]. These approaches have improved the modeling of temporal dependencies, spatial relationships, and OD-pair interactions. However, several limitations remain from the perspective of offline full-matrix metro OD forecasting. First, station-level or selected OD-pair forecasting does not directly reconstruct the complete network-wide OD matrix. Second, online incomplete OD prediction and imputation-oriented models assume different information availability from the offline setting, where complete historical OD matrices are available at prediction time. Third, many recent models rely on complex task-specific structures, such as heterogeneous graphs, hypergraphs, or multi-view designs, which can increase model complexity and reduce the simplicity of an end-to-end forecasting framework.
To address these limitations, this study proposes Metro-GATF. We focus on the offline multi-step OD matrix forecasting problem, where complete historical OD matrices are available at prediction time. The proposed model jointly incorporates static spatial structures learned from actual railway adjacency relationships and dynamic interactions extracted from time-dependent sparse OD graphs. Specifically, the spatial module employs a GATv2-based encoder to adaptively learn the relative influence of neighboring stations from the actual railway topology. The temporal module adopts a non-autoregressive transformer decoder that uses node representations derived from historical OD matrix sequences as memory, enabling parallel generation of future multi-step representations. Finally, the prediction module combines origin–destination factorization with sparsity-aware gating to directly reconstruct future OD matrices. Thus, Metro-GATF provides a unified end-to-end framework tailored for full-matrix OD forecasting in large-scale urban railway networks.
In this study, we evaluated the proposed model on a large-scale metropolitan subway network consisting of 637 stations in the Seoul metropolitan area, using minute-level AFC-based OD data from May 2022. The experiments were conducted under a full-matrix forecasting setting, in which the entire network-level OD matrix was directly predicted. The evaluation examined whether Metro-GATF can achieve competitive predictive performance on large-scale urban railway networks without substantial reliance on complex heuristic feature engineering or multi-stage preprocessing.
The main contributions of this paper are as follows:
- We reformulate offline full-matrix metro OD forecasting as a parallel generation problem and propose a non-autoregressive framework that predicts future OD matrices without sequential autoregressive decoding.
- We propose Metro-GATF, an end-to-end spatiotemporal architecture that jointly models static railway topology and dynamic sparse OD interactions, while reconstructing future OD matrices through origin–destination factorization and sparsity-aware gating.
- We validate the proposed framework on minute-level AFC-based OD data from a 637-station metropolitan subway network and show through extensive experiments and ablation studies that parallel temporal decoding and explicit sparsity modeling are key to competitive full-matrix OD forecasting performance.
The remainder of this paper is organized as follows. Section 2 reviews related studies on OD forecasting, graph-based metro demand prediction, and Transformer-based time-series forecasting. Section 3 describes the proposed Metro-GATF framework. Section 4 presents the experimental setting and comparative results. Section 5 reports the ablation and sensitivity analyses. Section 6 discusses the main findings, limitations, and practical implications. Section 7 concludes the paper.

2. Related Works

2.1. Station-Level and Short-Term OD Flow Forecasting

Early research on urban railway demand forecasting primarily focused on station-level inflow/outflow predictions or short-term OD flow forecasting. Han et al. modeled an urban railway network as a graph and employed a spatiotemporal GCN to predict short-term passenger flows at the station level, thereby demonstrating the feasibility of learning spatial correlations across stations [7]. Cui et al. identified data lag, high dimensionality, and data malformation as key challenges in metro OD prediction and proposed an ST-LSTM framework that integrates multi-source data for short-term OD flow forecasting [19]. CNN-based approaches have also been explored for short-term metro OD demand prediction by extracting local spatial–temporal patterns from OD-related demand representations [22]. While these studies provide operationally meaningful short-term insights, they differ fundamentally from the problem of directly reconstructing a full network-wide OD matrix.

2.2. Full OD Matrix Forecasting and Graph-Based Modeling

Research on directly predicting full OD matrices has gradually expanded in public transportation and general OD forecasting literature. Toqué et al. conducted an early study using an LSTM-based recurrent neural network to predict dynamic public transportation OD matrices [2]. Wang et al. explicitly formulated the OD matrix prediction (ODMP) problem and proposed GEML, which integrates spatial and temporal components [8]. Subsequently, Hu et al. addressed probabilistic OD matrix forecasting using a dual-stage graph convolutional recurrent neural network [9], whereas Shi et al. proposed the MPGCN, which jointly leverages dynamic and static graphs [10]. In the context of urban rail systems, representative studies include MRSTN by Noursalehi et al. [3], MVPF by Zheng et al. [11], and the large-scale urban rail GCN-GRU model by Wang et al. [23]. These approaches combine graph-based structures with time-series models to enable real-time or short-term OD matrix prediction. Beyond metro OD forecasting, attention-based spatio-temporal graph models have also been explored for road traffic prediction. For example, ASTMGCNet integrates GCN and GRU with multi-scale feature extraction and dual attention mechanisms to capture complex spatial dependencies and nonlinear temporal associations in T-CPS-oriented traffic systems [24]. These studies demonstrate the effectiveness of graph-based spatial encoders for transportation demand forecasting, but many of them still rely on short-term prediction settings or recurrent temporal modeling.

2.3. Composite Structure-Based Metro OD Modeling: Hypergraph, Heterogeneous Graph, and Incomplete OD

Recent metro OD forecasting studies have introduced more sophisticated structures to address the high dimensionality, sparsity, transfer behavior, and complex spatial relationships inherent in metro OD prediction. Liu et al. proposed HIAM, a heterogeneous information aggregation framework designed for online metro systems in which complete historical OD matrices are not immediately available. This model jointly utilizes incomplete OD matrices, unfinished order vectors, and DO matrices to support online OD prediction under limited observability [12]. Shen et al. proposed ST-DAMHGN, which employs multiple spatiotemporal dynamic attentive hypergraphs to capture high-order relationships among OD pairs [13]. Tang et al. introduced DT-HGN by leveraging heterogeneous graph structures and separating direct and transfer trips into distinct graphs for public transport OD matrix prediction [14].
In addition to heterogeneous and hypergraph-based modeling, related studies have explored spatial OD flow imputation using GCNs [25] and completion- or augmentation-based short-term metro OD forecasting under limited observability [26]. More recently, ODMixer highlighted the limitations of existing methods that either mix multiple OD pairs from a station-centric perspective or consider only partial OD pairs and proposed fine-grained metro OD modeling from an all-OD-pair perspective [18]. These studies have significantly advanced the metro OD prediction literature by enriching OD representations through heterogeneous graphs, hypergraphs, incomplete-information modeling, imputation, completion, and all-pair interaction modeling. However, they also increase reliance on problem-specific structural designs, such as multi-view construction, heterogeneous graph separation, hypergraph generation, and incomplete-data completion.

2.4. Transformer-Based OD Forecasting and Positioning of This Work

Transformer-based models have significantly influenced time series forecasting from the perspective of modeling long-term dependencies. Informer introduced ProbSparse self-attention and a generative-style decoder to efficiently handle long sequence time-series forecasting [15], whereas Autoformer improved the long-term forecasting performance through a decomposition architecture and an auto-correlation mechanism [16]. This line of research has also been extended to OD matrix prediction, where ODFormer proposed a Transformer-like OD matrix forecasting (ODMF) model incorporating OD attention and PeriodSparse self-attention [17]. Graph attention mechanisms are also closely related to the spatial modeling component of this study. GAT introduced masked self-attention on graph-structured data, allowing neighboring nodes to be weighted adaptively rather than aggregated only through fixed normalized adjacency [20]. GATv2 further addressed the static-attention limitation of the original GAT by enabling query-conditioned dynamic attention [21]. Compared with conventional GCN-based neighborhood aggregation [27], graph attention-based encoders can provide more flexible modeling of railway-network dependencies by learning the relative importance of neighboring stations. Despite these advances, metro OD forecasting still requires a simple and scalable framework that simultaneously considers static spatial priors derived from railway topology, sparse time-varying OD interactions, and an offline full-matrix forecasting setting where complete historical OD sequences are available. To address this gap, this study proposes Metro-GATF, which integrates static spatial encoding based on a railway adjacency matrix, dynamic interaction modeling using time-dependent sparse OD graphs, a non-autoregressive transformer decoder, and sparsity-aware OD reconstruction. 
By adopting GATv2 as the spatial encoder, Metro-GATF aims to learn adaptive and query-conditioned neighborhood importance while maintaining a relatively simple structure compared with highly task-specific heterogeneous, hypergraph, or incomplete-information modeling approaches.
Table 1 summarizes the key characteristics of representative OD forecasting studies reviewed above, categorized by output granularity, task setting, spatial encoder type, and structural complexity.

3. Method

Figure 1 visualizes the geographic distribution of the stations included in the analyzed Seoul metropolitan subway network based on latitude and longitude coordinates. Each point represents one station, and the color indicates the station connectivity degree, computed as the sum of in-degree and out-degree in the railway adjacency graph. To reduce visual distortion, a small number of stations with abnormal coordinate values were excluded only for visualization, while the OD forecasting experiments were conducted using the full 637-station network. Stations with high connectivity degrees are labeled to highlight major network hubs. This visualization illustrates that the OD forecasting network used in this study covers a broad metropolitan spatial structure rather than being limited to a specific corridor or localized area.

3.1. Dataset

In this study, minute-level subway OD demand data from 1 May to 31 May 2022 were used, with each day represented as an OD matrix sequence $X^{d} \in \mathbb{R}^{1440 \times 637 \times 637}$. Each sequence consists of 1440 minute-level OD matrices for 637 stations, where each element $X^{d}_{t,i,j}$ denotes the number of passengers traveling from origin station $i$ to destination station $j$ at minute $t$. The stored OD counts are integer-valued, and each day is treated as a continuous minute-level OD matrix sequence. To avoid information leakage from future time periods, the dataset was divided using a chronological three-way split into training, validation, and independent test sets, as described in Section 4.
Input samples were generated using a sliding window approach. The default configuration consisted of an input length of 60 min, a prediction horizon of 30 min, and a stride of 10 min. Windows were generated only within the operational time range of 5:30 AM to midnight, reflecting actual service hours [32]. Recent transportation statistics from Seoul also indicate a pronounced demand immediately after the start of operations and during evening peak hours, supporting the operational validity of minute-level short-term forecasting settings [6]. Under this configuration, 102 samples were generated per day. To mitigate skewness in the distribution of raw OD counts, a $\log(1 + x)$ transformation was applied to both inputs and targets. Temporal information is encoded using sine–cosine positional encoding over a daily period of 1440 min [22]. Let $m$ denote the minute index within a day; the encoding is
$$\left[\sin\left(\frac{2\pi m}{1440}\right),\ \cos\left(\frac{2\pi m}{1440}\right)\right].$$
Day-of-week information is also incorporated as an additional input feature. For graph input construction, each OD matrix at a given time step is converted into a dynamic sparse graph by extracting only non-zero entries. To enhance the stability of message passing, a reverse edge is added for every edge, and edge weights are defined as $\log(1 + x)$ of the corresponding OD flow at that time step. Additionally, a fixed $637 \times 637$ static adjacency matrix representing station connectivity is used to construct a static spatial graph. The station latitude and longitude are aligned with node indices, and missing coordinates are imputed using the average coordinates of neighboring stations [28,29,31]. Finally, min–max normalization is applied to the coordinates, and the resulting values are used as the static node positional features. This study assumes an offline multi-step forecasting setting, where complete historical OD matrix sequences are available as inputs at prediction time.
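The windowing and input transforms described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the boundary rule (a window is kept only if it ends strictly before minute 1440) is an assumption chosen because it yields the 102 windows per day reported in the text.

```python
import numpy as np

MINUTES_PER_DAY = 1440
SERVICE_START = 5 * 60 + 30            # 05:30 expressed as a minute index (330)
INPUT_LEN, HORIZON, STRIDE = 60, 30, 10


def window_starts(start=SERVICE_START, end=MINUTES_PER_DAY,
                  input_len=INPUT_LEN, horizon=HORIZON, stride=STRIDE):
    """Start minutes of the sliding windows inside the service period.

    Assumption: a window is kept only if start + input_len + horizon < end.
    """
    last = end - (input_len + horizon)
    return list(range(start, last, stride))


def time_encoding(minute):
    """Sine-cosine encoding of the minute of day with a 1440-minute period."""
    angle = 2.0 * np.pi * np.asarray(minute) / MINUTES_PER_DAY
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


def make_sample(day, t0):
    """Cut one (input, target) pair from a (1440, N, N) array of OD counts.

    Both parts are returned on the log(1 + x) scale.
    """
    x = np.log1p(day[t0:t0 + INPUT_LEN].astype(np.float32))
    y = np.log1p(day[t0 + INPUT_LEN:t0 + INPUT_LEN + HORIZON].astype(np.float32))
    return x, y


starts = window_starts()
print(len(starts), starts[0], starts[-1])
```

With the default arguments, the first window starts at minute 330 (05:30) and the last at minute 1340, so its 30-minute target ends at minute 1430.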

3.2. Overall Architecture

The proposed model is designed as a hybrid spatiotemporal encoder–decoder architecture that jointly captures the static railway network and time-evolving OD flows. The inputs consist of a sequence of OD matrices over the past $T$ time steps, $X_{1:T} = \{X_1, \ldots, X_T\}$, a static station connectivity graph $G_s = (V, E_s)$, temporal encodings for both past and future steps, and day-of-week information. Here, each $X_t \in \mathbb{R}^{N \times N}$ represents the OD demand matrix at time step $t$, and $N$ denotes the number of stations. This study assumes an offline forecasting setting, in which complete historical OD matrices are available at prediction time.
Figure 2 provides an overview of the Metro-GATF architecture. The model takes dynamic sparse OD graphs, a static railway adjacency matrix, temporal encodings, and station-level node features as inputs. The spatial module combines short- and long-range GATv2 encoders for static railway topology with a dynamic GAT encoder for time-varying OD interactions. The fused spatiotemporal representations are used as historical memory, and a non-autoregressive Transformer decoder generates future station-level representations in parallel using future-step embeddings and positional encodings. The final predictor reconstructs future full OD matrices through origin–destination factorization, while the magnitude head estimates flow intensity and the gate head controls sparse zero/non-zero OD activity.
The model first generates static node-level representations for stations and then encodes static spatial structures and dynamic OD interactions based on these representations. The temporal module subsequently uses the sequence of node representations obtained from the past T time steps as memory and estimates the node representations for the next O future steps in parallel. Finally, in each step, the full OD matrix is reconstructed through pairwise interactions between origin-role embeddings and destination-role embeddings. To account for the high sparsity of OD matrices, the model jointly learns a magnitude head for predicting the flow intensity and a gate head to determine whether a given OD pair is non-zero.
From a model-complexity perspective, Metro-GATF avoids assigning independent trainable predictors to all $N \times N$ OD pairs. The high-dimensional OD output is produced from shared station-level origin and destination embeddings, and the sparsity gate is applied as an OD-activity decision layer rather than as a separate pair-specific model. Consequently, the main learnable components are concentrated in the station-level spatial encoders, temporal decoder, and shared projection heads, while the final full-matrix reconstruction still preserves directional origin–destination interactions. This design is intended to keep full-matrix prediction tractable for a 637-station network without requiring an independent parameter set for every possible OD pair.

3.3. Node Feature Construction

First, a learnable node embedding $e_i \in \mathbb{R}^{d_e}$ is assigned to each station to model its intrinsic latent characteristics. Second, to incorporate the static positional information of the stations, a normalized latitude–longitude vector $g_i \in \mathbb{R}^{2}$ is used. These coordinates are aligned according to the station index order, and missing values are imputed using the average coordinates of the neighboring stations, followed by min–max normalization. Third, the day-of-week information is represented as an embedding $w_b^{\mathrm{emb}} \in \mathbb{R}^{d_w}$, obtained from an integer weekday index provided at the sample level. This embedding is shared across all the nodes within the same sample. Accordingly, the initial static feature of node $i$ in batch $b$ is defined as
$$h^{0}_{b,i} = \left[\, e_i \,;\, g_i \,;\, w_b^{\mathrm{emb}} \,\right],$$
where $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, so that $h^{0}_{b,i} \in \mathbb{R}^{d_e + 2 + d_w}$. When geographic positional information is not used, $g_i$ is omitted, and $h^{0}_{b,i} \in \mathbb{R}^{d_e + d_w}$. It is important to note that the model does not directly use raw OD row and column vectors as node features. Instead, static node representations are constructed first, and OD flows are later incorporated through the edge structure and edge weights of the dynamic graphs. In other words, the node attributes and time-dependent mobility interactions are modeled in a decoupled manner.
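As a rough illustration of this concatenation, the sketch below builds the initial node features with NumPy. The embedding sizes (16 and 4) are arbitrary, the random tables merely stand in for learnable parameters, and coordinate imputation is omitted; none of these choices come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_e, d_w = 637, 16, 4                    # d_e and d_w are illustrative sizes
station_emb = rng.normal(size=(N, d_e))     # stand-in for the learnable e_i
weekday_emb = rng.normal(size=(7, d_w))     # stand-in for the weekday table


def minmax(coords):
    """Min-max normalize each coordinate column to [0, 1]."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    return (coords - lo) / (hi - lo)


def build_node_features(coords, weekday):
    """coords: (N, 2) lat/lon; weekday: (B,) ints in 0..6.

    Returns h0 with shape (B, N, d_e + 2 + d_w).
    """
    g = minmax(coords)
    B = len(weekday)
    static = np.concatenate([station_emb, g], axis=1)             # (N, d_e + 2)
    static = np.broadcast_to(static, (B, N, d_e + 2))
    # One weekday vector per sample, repeated for every station in that sample.
    w = np.broadcast_to(weekday_emb[weekday][:, None, :], (B, N, d_w))
    return np.concatenate([static, w], axis=-1)


coords = rng.uniform(size=(N, 2))
h0 = build_node_features(coords, np.array([0, 3]))
print(h0.shape)
```

Only the trailing weekday block differs between samples; the station embedding and coordinate blocks are identical across the batch, which matches the decoupling of node attributes from OD flows described above.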

3.4. Spatial Module

The spatial module consists of two stages: a static spatial encoder and a dynamic spatial encoder. First, the static spatial encoder operates on a fixed graph $G_s = (V, E_s)$ constructed from the railway adjacency matrix, where $V$ denotes the set of stations (nodes) and $E_s$ denotes the set of edges representing the physical connectivity between stations. These edges are shared across the entire training period. The static spatial encoder is composed of two parallel Graph Attention Network (GAT) branches: a relatively shallow short branch and a deeper long branch. GAT applies masked self-attention to graph-structured data, enabling the adaptive weighting of neighboring nodes [20]. Furthermore, GATv2 addresses the limitation of static attention in the original GAT by allowing dynamic attention that varies depending on the query node, thereby providing a more expressive model of neighborhood importance [21]. Both branches employ GATv2-based spatial encoders implemented with sequential multi-head and single-head GATv2Conv layers. The short branch is designed to capture local connectivity within a small hop range, whereas the long branch captures a broader spatial context through deeper message passing. The outputs from the two branches, $s^{\mathrm{short}}_{b,i}$ and $s^{\mathrm{long}}_{b,i}$, are concatenated and projected linearly to produce a unified static spatial representation $s_{b,i} \in \mathbb{R}^{d}$:
$$s_{b,i} = W_s \left[\, s^{\mathrm{short}}_{b,i} \,;\, s^{\mathrm{long}}_{b,i} \,\right]$$
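To make the two-branch idea concrete, here is a simplified single-head GATv2-style layer in NumPy, applied to a toy five-station line. It is a sketch under several assumptions: the paper uses GATv2Conv layers with multiple heads, whereas this version uses one head, dense adjacency, ReLU between layers, and branch depths of 1 and 3 chosen only for illustration.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gatv2_layer(h, adj, W_src, W_dst, a):
    """Single-head GATv2-style layer on a dense adjacency matrix.

    h: (N, d_in); adj[i, j] > 0 means station j is a neighbor of station i.
    Returns the updated features (N, d_out) and attention weights (N, N).
    """
    n = h.shape[0]
    mask = (adj > 0) | np.eye(n, dtype=bool)          # neighbors plus self-loop
    hs, hd = h @ W_src, h @ W_dst
    # GATv2 scoring: the nonlinearity is applied before the attention vector a,
    # so the ranking of neighbors can depend on the query node i.
    scores = leaky_relu(hd[:, None, :] + hs[None, :, :]) @ a
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ hs, alpha


def branch(h, adj, params):
    """Stack of layers; more layers reach more hops along the network."""
    for W_src, W_dst, a in params:
        h, _ = gatv2_layer(h, adj, W_src, W_dst, a)
        h = np.maximum(h, 0.0)
    return h


rng = np.random.default_rng(0)
N, d_in, d = 5, 6, 4
adj = np.zeros((N, N))
for i in range(N - 1):                                # toy line: 0-1-2-3-4
    adj[i, i + 1] = adj[i + 1, i] = 1.0


def init_layer(n_in, n_out):
    return (0.1 * rng.normal(size=(n_in, n_out)),
            0.1 * rng.normal(size=(n_in, n_out)),
            rng.normal(size=n_out))


h0 = rng.normal(size=(N, d_in))
short_params = [init_layer(d_in, d)]
long_params = [init_layer(d_in, d), init_layer(d, d), init_layer(d, d)]
s_short = branch(h0, adj, short_params)
s_long = branch(h0, adj, long_params)
W_s = rng.normal(size=(d, 2 * d))
s = np.concatenate([s_short, s_long], axis=1) @ W_s.T   # s_i = W_s [short ; long]
print(s.shape)
```

In this toy setup, the one-layer branch lets station 0 see only station 1, while the three-layer branch propagates information from up to three hops away before the two outputs are concatenated and projected.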
Next, the dynamic spatial encoder utilizes a sequence of dynamic OD graphs $G^{d}_t$, constructed for each time step $t$ within the historical input window. Each $G^{d}_t$ is a sparse graph formed by extracting only the non-zero entries from the OD matrix at time $t$, with edge weights defined as the $\log(1 + x)$-transformed OD flows. In the implementation, a reverse edge is added for every edge to enhance the stability of message passing and capture bidirectional relationships. Importantly, the node inputs to the dynamic graph are not raw traffic features but the static spatial representations $s_{b,i}$ obtained earlier. In other words, the same station representations are replicated across all past time steps and used as node signals, whereas time-varying information is encoded solely through the edge connectivity structure and edge attributes. This design reflects the assumption that intrinsic station properties remain relatively stable, whereas inter-station interactions vary over time according to OD flows.
The dynamic spatial encoder is implemented as a Dynamic GAT Encoder that incorporates edge attributes and ultimately produces dynamic representations $h^{\mathrm{dyn}}_{b,t,i} \in \mathbb{R}^{d}$ for each time step and node:
$$h^{\mathrm{dyn}}_{b,t,i} = \mathrm{GAT}_{\mathrm{dyn}}\left(s_{b,i},\, G^{d}_t\right)$$
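The conversion of one OD matrix into a dynamic sparse graph might look like the following sketch. The function name is hypothetical, and giving each reverse edge the same weight as its forward edge is an assumption; the text does not state how reverse-edge weights are set.

```python
import numpy as np


def od_to_sparse_graph(od):
    """Convert an (N, N) OD count matrix into a sparse edge list.

    Returns edge_index with shape (2, 2E) and edge_weight with shape (2E,),
    where E is the number of non-zero OD pairs.
    """
    src, dst = np.nonzero(od)                         # keep non-zero entries only
    w = np.log1p(od[src, dst].astype(np.float64))     # log(1 + x) edge weights
    # Append one reverse edge per observed edge.
    # Assumption: the reverse edge reuses the forward weight.
    edge_index = np.stack([np.concatenate([src, dst]),
                           np.concatenate([dst, src])])
    edge_weight = np.concatenate([w, w])
    return edge_index, edge_weight


od = np.array([[0, 3, 0],
               [0, 0, 0],
               [1, 0, 0]])
edge_index, edge_weight = od_to_sparse_graph(od)
print(edge_index)
print(edge_weight)
```

For this toy matrix the forward edges are 0→1 (3 trips) and 2→0 (1 trip), followed by their reverses 1→0 and 0→2. Because only non-zero pairs become edges, graph size tracks observed demand rather than all $N^2$ pairs.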

3.5. Temporal Module

The temporal module follows a transformer decoder architecture based on self-attention and cross-attention mechanisms [33]. Although the original transformer adopts an autoregressive decoding scheme [33], this study employs a non-autoregressive decoding strategy to generate multiple future steps in parallel. Specifically, the temporal module transforms spatially encoded historical sequences into memory representations that can be referenced by the Transformer decoder and uses future-step queries to simultaneously generate node states for the next $O$ time steps. Because decoding is non-autoregressive, no causal masking is applied to the decoder.
Temporal information is represented using sine–cosine encoding with a daily periodicity of 1440 min [33]. For a given minute index $m_t$, the encoding is defined as
$$c_t = \left[\sin\left(\frac{2\pi m_t}{1440}\right),\ \cos\left(\frac{2\pi m_t}{1440}\right)\right],$$
which is then projected through a linear transformation into a $d$-dimensional temporal embedding $\tau_t$ (written $\tau_{b,t}$ for sample $b$). This temporal embedding is shared across all nodes at each time step. The fused representation at time step $t$ for node $i$ is then computed by combining the dynamic spatial representation, static spatial representation, and temporal embedding as follows:
$$u_{b,t,i} = W_f \left[\, h^{\mathrm{dyn}}_{b,t,i} \,;\, s_{b,i} \,;\, \tau_{b,t} \,\right]$$
The resulting $u_{b,1:T,i}$ is used as the historical time-series memory for each node. Each batch–node pair is treated as an independent sequence of time-series tokens. Sinusoidal positional encoding and a learnable positional bias are then added to inject the temporal order information. This memory sequence is subsequently used as the input for the cross-attention mechanism in the transformer decoder.
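The fusion step and the reshaping into per-node sequences can be sketched as below. The tensor layouts, the function name, and the choice of a $d \times 3d$ fusion matrix are assumptions for illustration; positional encodings are left out.

```python
import numpy as np


def fuse_memory(h_dyn, s, tau, W_f):
    """Fuse dynamic, static, and temporal features into decoder memory.

    h_dyn: (B, T, N, d) dynamic representations per time step and node
    s:     (B, N, d)    static representations, constant over time
    tau:   (B, T, d)    temporal embeddings, shared by all nodes
    W_f:   (d, 3d)      fusion projection
    Returns memory of shape (B * N, T, d): one sequence per batch-node pair.
    """
    B, T, N, d = h_dyn.shape
    s_rep = np.broadcast_to(s[:, None, :, :], (B, T, N, d))
    tau_rep = np.broadcast_to(tau[:, :, None, :], (B, T, N, d))
    u = np.concatenate([h_dyn, s_rep, tau_rep], axis=-1) @ W_f.T
    # Move the node axis next to the batch axis, then merge the two.
    return u.transpose(0, 2, 1, 3).reshape(B * N, T, d)


rng = np.random.default_rng(0)
B, T, N, d = 2, 5, 3, 4
h_dyn = rng.normal(size=(B, T, N, d))
s_static = rng.normal(size=(B, N, d))
tau_emb = rng.normal(size=(B, T, d))
W_f = rng.normal(size=(d, 3 * d))
memory = fuse_memory(h_dyn, s_static, tau_emb, W_f)
print(memory.shape)
```

Row `b * N + i` of the result holds the length-$T$ token sequence for station $i$ in sample $b$, so the decoder's attention runs over time within each station rather than across stations.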
For future predictions, the query is initialized using a learnable future-step embedding for each $o \in \{1, \ldots, O\}$. The same positional encoding is applied, and a linearly projected sine–cosine temporal encoding of the corresponding future time step is added. Consequently, the decoder query captures both the relative position in the future horizon (which future step) and the actual temporal position in the time series (what time of day). Importantly, the future query is initialized solely from the future-step embedding and known temporal encodings; the ground-truth future OD values are not used as decoder inputs. As a result, the model generates the future node representations $z_{b,o,i} \in \mathbb{R}^{d}$ for all steps in parallel through self-attention across future steps and cross-attention over historical memory.
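The parallel decoding described above can be illustrated with a stripped-down decoder pass. This sketch assumes a single layer with residual connections and omits the learned query/key/value projections, layer normalization, feed-forward sublayer, and positional encodings of a full Transformer decoder; all names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention with no causal mask."""
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def decode_parallel(memory, step_emb, future_time_emb):
    """Generate all O future states in one pass.

    memory: (S, T, d) historical tokens for S batch-node sequences
    step_emb, future_time_emb: (O, d) future-step and time-of-day embeddings
    """
    # Queries come only from known quantities, never from future OD values.
    q = step_emb + future_time_emb
    q = np.broadcast_to(q, (memory.shape[0],) + q.shape)     # (S, O, d)
    q = q + attention(q, q, q)              # unmasked self-attention over steps
    return q + attention(q, memory, memory) # cross-attention over history


rng = np.random.default_rng(0)
S, T, O, d = 6, 5, 4, 8
memory = rng.normal(size=(S, T, d))
step_emb = rng.normal(size=(O, d))
future_time_emb = rng.normal(size=(O, d))
z = decode_parallel(memory, step_emb, future_time_emb)
print(z.shape)
```

Since no mask is applied and no step waits for the output of an earlier one, the $O$ future states come out of a single forward pass, in contrast to autoregressive decoding, which would need $O$ sequential passes.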

3.6. Predictor

The predictor reconstructs the full OD matrix from the node representations at each future time step. To this end, instead of using a single node representation directly, two separate linear projections are applied to distinguish the roles of the origin and destination.
$$o_{b,o,i} = W_O\, z_{b,o,i}$$
$$r_{b,o,j} = W_D\, z_{b,o,j}$$
At time step $o$, the flow magnitude for the origin–destination pair $(i, j)$ is computed using the inner product of the two embeddings.
$$\hat{m}_{b,o,i,j} = o_{b,o,i} \cdot r_{b,o,j}$$
A softplus function is applied to this value to obtain the final magnitude prediction.
$$\hat{Y}^{\mathrm{mag}}_{b,o,i,j} = \mathrm{softplus}\left(\hat{m}_{b,o,i,j}\right)$$
Unlike the independent regression of each OD pair, the OD matrix is reconstructed in a structured manner through latent interactions between the origin and destination-role embeddings of each station.
Simultaneously, to handle the sparsity of the OD matrix explicitly, the model introduces a separate gate head. The gate head also applies separate linear projections for the origin and destination and computes the logit as an inner product plus a bias term.
$$g^{\mathrm{logit}}_{b,o,i,j} = \tilde{o}_{b,o,i} \cdot \tilde{r}_{b,o,j} + b_g$$
The bias term $b_g$ is initialized to 2.0 to reflect a sparse prior, assuming that most OD pairs are likely to be zero during the early stage of training. After applying a sigmoid function, the probability that a given OD pair is non-zero, $\hat{p}_{b,o,i,j}$, is obtained. During inference, the magnitude prediction is retained only if $\hat{p}_{b,o,i,j} > \tau$; otherwise, it is forced to zero. The gate threshold $\tau$ (not to be confused with the temporal embedding $\tau_t$ of Section 3.5) is an inference-time operating point that converts the estimated non-zero probability into a binary OD activity decision, rather than a trainable model parameter. Because full OD matrices are highly sparse, lowering $\tau$ increases the number of predicted non-zero OD pairs, whereas raising $\tau$ yields a more conservative prediction by suppressing weakly activated OD pairs. This thresholding process is closely related to the operating-point selection problem in binary classification, where the decision threshold controls the trade-off between positive detection and false alarms [34]. In imbalanced settings, such threshold-dependent behavior should be examined carefully because aggregate error metrics and detection-oriented metrics may favor different operating points [35]. Therefore, in this study, $\tau = 0.9$ was adopted as the main setting based on a post-training sensitivity analysis over $\tau \in \{0.1, 0.2, \ldots, 0.9\}$, as reported in Section 5. Accordingly, the final prediction is defined as follows:
$$\hat{Y}_{b,o,i,j} = \begin{cases} \hat{Y}^{\mathrm{mag}}_{b,o,i,j}, & \text{if } \hat{p}_{b,o,i,j} > \tau \\ 0, & \text{otherwise} \end{cases}$$
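A compact sketch of the factorized predictor with gating follows. The weight shapes, the projection width of 4, and the random node states are illustrative assumptions; only the structure (two role projections, inner product, softplus, sigmoid gate with bias, threshold) follows the equations above.

```python
import numpy as np


def softplus(x):
    return np.logaddexp(0.0, x)             # log(1 + exp(x)), numerically stable


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict_od(z, W_O, W_D, W_O_gate, W_D_gate, b_g=2.0, tau=0.9):
    """Reconstruct one future OD matrix from node states z of shape (N, d).

    Each W_* has shape (r, d). Returns the gated prediction (N, N) and the
    estimated non-zero probability (N, N).
    """
    # Entry (i, j) is the inner product of origin role i and destination role j.
    mag = softplus((z @ W_O.T) @ (z @ W_D.T).T)
    p = sigmoid((z @ W_O_gate.T) @ (z @ W_D_gate.T).T + b_g)
    return np.where(p > tau, mag, 0.0), p


rng = np.random.default_rng(0)
N, d, r = 6, 8, 4
z = rng.normal(size=(N, d))
W_O, W_D, W_O_gate, W_D_gate = (0.5 * rng.normal(size=(r, d)) for _ in range(4))

y_strict, p = predict_od(z, W_O, W_D, W_O_gate, W_D_gate, tau=0.9)
y_loose, _ = predict_od(z, W_O, W_D, W_O_gate, W_D_gate, tau=0.1)
print(np.count_nonzero(y_strict), np.count_nonzero(y_loose))
```

The whole $N \times N$ output is produced from four $r \times d$ matrices and one scalar bias, and lowering `tau` can only keep the same number of active OD pairs or add more.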
During training, the targets were used on a $\log(1 + Y)$ scale. The gate label is defined as $\mathbb{I}(Y > 0)$, and the gate head is optimized using a weighted binary cross-entropy (BCE) loss that accounts for the positive/negative imbalance within each batch. In contrast, the magnitude head computes the smooth L1 loss only over non-zero target entries. The final training objective is defined as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{mag}} + \lambda_{\mathrm{gate}} \mathcal{L}_{\mathrm{gate}}$$
Here, $\lambda_{\mathrm{gate}} = 1.0$ is used in the current implementation. This loss design mitigates the issue where the regression loss tends to converge to trivial zero predictions in highly sparse OD matrices while improving the accuracy of magnitude prediction for OD pairs with actual flows.
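The combined objective can be sketched in plain Python over a flattened list of OD entries. Several details are assumptions, because the text does not specify them: the positive-class weight is taken as the ratio of zero to non-zero entries in the batch, the smooth L1 transition point is 1.0, and both losses are averaged. The function names are hypothetical.

```python
import math


def smooth_l1(diff: float, beta: float = 1.0) -> float:
    """Smooth L1 penalty: quadratic near zero, linear elsewhere (beta assumed)."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def softplus(x: float) -> float:
    """Stable evaluation of log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def training_loss(pred_log, gate_logits, counts, lambda_gate=1.0):
    """Return (total, magnitude loss, gate loss) for flattened OD entries."""
    labels = [1.0 if y > 0 else 0.0 for y in counts]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Assumed imbalance weighting: up-weight positives by n_neg / n_pos.
    pos_weight = n_neg / n_pos if n_pos > 0 else 1.0

    # Magnitude loss: smooth L1 on log(1 + y), over non-zero targets only.
    mag_terms = [smooth_l1(p - math.log1p(y))
                 for p, y in zip(pred_log, counts) if y > 0]
    loss_mag = sum(mag_terms) / len(mag_terms) if mag_terms else 0.0

    # Gate loss: weighted binary cross-entropy computed from logits.
    gate_terms = [pos_weight * z * softplus(-g) + (1.0 - z) * softplus(g)
                  for g, z in zip(gate_logits, labels)]
    loss_gate = sum(gate_terms) / len(gate_terms)
    return loss_mag + lambda_gate * loss_gate, loss_mag, loss_gate


# Toy batch: three zero entries and one entry with a count of 3.
counts = [0.0, 0.0, 0.0, 3.0]
pred_log = [0.2, 0.1, 0.0, math.log1p(3.0)]
gate_logits = [0.0, 0.0, 0.0, 0.0]
total, loss_mag, loss_gate = training_loss(pred_log, gate_logits, counts)
print(total, loss_mag, loss_gate)
```

Because the magnitude loss ignores zero-valued targets, the non-zero predictions at the first two entries incur no regression penalty; suppressing them is left to the gate.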
During the evaluation, the final prediction is reconstructed only for entries that pass the gate threshold, and inverse transformation is performed by clipping the predicted log values to be non-negative before applying the exponential function. In addition, a small ϵ is used in the sMAPE calculation to alleviate numerical instability near zero.

4. Experiment

The experiment was designed as a minute-level OD matrix forecasting problem to simulate a short-term demand prediction scenario applicable to real-world urban railway operations. The time-series resolution is maintained at one minute in order to preserve rapid demand fluctuations during peak hours and short-term mobility variations caused by transfers. The input window length of 60 min reflects the recent one-hour demand context, whereas the prediction horizon of 30 min provides a practical short-term forecasting window for operational decision-making without introducing excessive uncertainty in minute-level full-matrix forecasting. In addition, a hop size of 10 min is used to ensure sufficient training samples while mitigating the excessive overlaps between adjacent windows. Windows are generated only within the operational time range of 5:30 AM to midnight, reflecting actual service hours. Official guidance also indicates that subway services generally operate from approximately 5:30 AM to midnight, and thus this setting aligns with both the temporal characteristics of the dataset and the real-world operations [32].
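The windowing scheme described above (60 min input, 30 min horizon, 10 min hop) can be sketched as follows. The function name `make_windows` and the boundary rule, under which a window is kept only if its target ends within the available steps, are assumptions. The authors' exact boundary handling is not stated, so the number of windows this sketch yields per day need not match the per-day sample count reported for the dataset.

```python
def make_windows(n_steps, input_len=60, horizon=30, hop=10):
    """Return (input_start, input_end, target_end) index triples.

    Indices are minute offsets within one service day; input_end is exclusive
    for the input and is the first index of the target window.
    """
    windows = []
    start = 0
    # Keep a window only if the full input and target fit in the day.
    while start + input_len + horizon <= n_steps:
        split = start + input_len
        windows.append((start, split, split + horizon))
        start += hop
    return windows


# Toy example with 200 one-minute steps (not the real service-day length).
windows = make_windows(n_steps=200)
print(len(windows), windows[0], windows[-1])
```

Adjacent windows generated this way share 80 of their 90 minutes, which is the overlap the 10 min hop is meant to keep in check.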
The dataset was divided using a chronological three-way split to evaluate generalization on unseen future periods without random shuffling. Specifically, data from 1 May to 19 May 2022 were used for training, data from 20 May to 24 May 2022 were used for validation, and data from 25 May to 31 May 2022 were reserved as an independent test set. The validation set was used only for early stopping, hyperparameter selection, and best-checkpoint selection, while the test set was held out exclusively for final evaluation. Under the default sliding-window configuration, 102 samples were generated per day, resulting in 1938 training samples, 510 validation samples, and 714 test samples. All baseline models were evaluated under the same input length, prediction horizon, temporal resolution, and chronological split. The final performance was measured on the independent test set using MAE, RMSE, and sMAPE on the original scale. Both the inputs and targets were transformed using $\log(1 + x)$ during training, and evaluation was conducted after inverse transformation.
For Metro-GATF, owing to its architecture incorporating gate-based zero handling, the final predictions are reconstructed by first determining whether each OD pair is non-zero according to the inference rule, followed by recovering the magnitude prediction. Specifically, Metro-GATF was trained in the $\log(1 + x)$ space for both the inputs and targets. The ground-truth non-zero indicator is defined as $z = \mathbb{I}(y > 0)$, which is equivalent to the condition in which the raw OD count is greater than zero. During inference, the gate head outputs a probability $p = \sigma(g)$, and the magnitude prediction is retained only for entries where $p > \tau$, whereas all other OD pairs are set to zero. In this study, τ = 0.9 is used. The inverse transformation is then applied as $\hat{y} = \exp(\max(\hat{y}_{\log}, 0)) - 1$ and $y = \exp(y_{\log}) - 1$, where predicted log values below zero are clipped to zero prior to the reconstruction in the original scale. Furthermore, sMAPE is computed with a small constant $\epsilon = 10^{-3}$ to alleviate denominator instability near zero.
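The evaluation-time reconstruction can be sketched as below: gate first, then clip the predicted log value at zero, then invert the log transform. The sMAPE form used here, with ε added to the denominator and the result expressed in percent, is an assumption, since the exact formula is not given in this section. All function names are hypothetical.

```python
import math


def inverse_log1p(value_log: float, clip: bool = True) -> float:
    """Invert log(1 + x); optionally clip negative log values to zero first."""
    if clip:
        value_log = max(value_log, 0.0)
    return math.expm1(value_log)  # exp(v) - 1


def reconstruct(pred_log, gate_prob, tau=0.9):
    """Keep the inverse-transformed magnitude only where the gate passes."""
    return [inverse_log1p(v) if p > tau else 0.0
            for v, p in zip(pred_log, gate_prob)]


def smape(y_true, y_pred, eps=1e-3):
    """Assumed sMAPE form (percent), with eps stabilizing the denominator."""
    terms = [2.0 * abs(p - t) / (abs(t) + abs(p) + eps)
             for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(terms) / len(terms)


# Toy entries: a correct hit, a clipped negative log value, a gated-out flow.
pred_log = [math.log1p(4.0), -0.3, math.log1p(2.0)]
gate_prob = [0.97, 0.95, 0.40]
y_hat = reconstruct(pred_log, gate_prob)
y_true = [inverse_log1p(v, clip=False)
          for v in [math.log1p(4.0), 0.0, math.log1p(2.0)]]
score = smape(y_true, y_hat)
print(y_hat, score)
```

In the toy data the third entry has a true flow of 2 but is suppressed by the gate, which illustrates how a conservative threshold trades missed active pairs for fewer false positives.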
In addition to the main comparison, we conducted a post-training gate-threshold sensitivity analysis for the full Metro-GATF model using the validation set. To isolate the effect of the inference threshold from model training, the best checkpoints of the full Metro-GATF configuration corresponding to S8 in the ablation study were fixed for three random seeds, and only the gate threshold τ was varied from 0.1 to 0.9. The threshold τ = 0.9 was selected based on the validation-set sensitivity results and then fixed before evaluating the final model on the independent test set. The sensitivity results were evaluated using the same forecasting metrics as the main experiments, namely MAE, RMSE, and sMAPE, and are reported as the mean and standard deviation across the three seeds. In addition, the predicted non-zero rate was reported as an auxiliary diagnostic statistic to examine how conservatively the gate identifies active OD pairs after thresholding.
Because OD matrices exhibit a sparse structure with frequent zero or near-zero values, this study employed MAE, RMSE, and sMAPE to jointly evaluate the average absolute error, sensitivity to large errors, and relative prediction accuracy. The same three metrics were also used for the gate-threshold sensitivity analysis to maintain consistency with the main forecasting evaluation. The predicted non-zero rate was not used as an error metric but as an auxiliary diagnostic measure for interpreting the sparsity behavior of the gate. This experimental setup ensured that all the models were compared under a consistent operational forecasting protocol.
The comparison models are divided into two groups to ensure fairness in problem setting. The first group consists of full-matrix baselines that directly predict the entire 637 × 637 OD matrix, including HA, ARIMA, GCN-LSTM, Autoformer, MPGCN, and ODFormer. The second group includes a pair-level reference baseline, ST-LSTM, trained on individual OD pairs. Because ST-LSTM is not designed for full-matrix prediction but for individual OD pair forecasting, it is not directly compared within the same table as the full-matrix baselines and is instead presented as a separate reference experiment.
The methods discussed in the related works, such as HIAM, DT-HGN, ST-DAMHGN, and ODMixer, are not included in the main comparison. These approaches are often based on different problem assumptions, such as incomplete OD inference, task-specific heterogeneous or hypergraph representations, or pair-centric/local OD modeling. Therefore, they do not align directly with the offline full-matrix forecasting protocol considered in this study, which assumes a fixed station-level railway topology and complete historical OD sequences as inputs to predict the full future OD matrix. Accordingly, the main benchmark in this study is constructed around representative full-matrix baselines that operate under the same input–output protocol.
All learning-based models were tuned using the same time-ordered validation split from 20 May to 24 May 2022, and the optimal checkpoint was selected based on validation RMSE. After checkpoint selection, the selected model was evaluated once on the independent test set from 25 May to 31 May 2022. Training for all learning-based models, including Metro-GATF, was conducted on a Blackwell Max-Q Workstation Edition (96 GB VRAM). The training configuration used a batch size of 2, 200 epochs, a learning rate of $1 \times 10^{-4}$, and the Adam optimizer. In contrast, HA and ARIMA were trained using standard estimation procedures. In addition, all neural network-based models were evaluated using three random seeds, and the final performance was reported as the mean and standard deviation.

4.1. Experimental Results

4.1.1. Full-Matrix Baseline Comparison

The full-matrix benchmark in Table 2 was designed to cover a wide spectrum of models, ranging from simple repetitive-pattern baselines to OD-specific graph-based models. HA represents the simplest baseline, utilizing historical average patterns for each OD entry, and demonstrates how much predictive performance can be achieved using only a strong periodicity. ARIMA is a traditional time-series model based on linear autoregressive structures that represents non-deep-learning statistical baselines. GCN-LSTM is a representative spatiotemporal deep learning baseline that combines graph-based spatial aggregation with recurrent temporal modeling. The Autoformer represents a generic transformer-based time-series model that does not explicitly incorporate railway topology, capturing long-term dependencies through attention mechanisms. The MPGCN, on the other hand, is an OD-specific full-matrix forecasting model that jointly utilizes static and dynamic graphs and serves as a strong graph-based baseline that is closely aligned with the problem setting of this study. Accordingly, Table 2 is constructed to include simple average-based models, traditional statistical models, general deep learning time-series models, and OD-specific graph-based models, enabling comprehensive evaluation of the proposed method across multiple levels of baselines. Because the ST-LSTM is designed to directly predict the selected OD pairs rather than the full OD matrix, it is not included in Table 2 and is instead evaluated separately in a pair-level reference experiment.
As shown in Table 2, Metro-GATF provides balanced forecasting performance across the full-matrix error metrics. Compared with HA, ARIMA, GCN-LSTM, and Autoformer, Metro-GATF reduced MAE by approximately 95.5–97.9%, RMSE by approximately 75.8–95.2%, and sMAPE by approximately 88.8–98.1%. These results suggest that the proposed model more effectively captures the complex spatiotemporal variations in minute-level OD matrices than simple average-based patterns, traditional statistical methods, or generic time-series deep learning approaches alone. A comparison with ODFormer requires a more nuanced interpretation. ODFormer produced very small aggregate MAE and RMSE values, but its sMAPE reached 163.10 ± 6.01, indicating that low absolute error over a highly sparse matrix does not necessarily imply reliable relative accuracy for sparse non-zero OD demand. Therefore, sMAPE is important for assessing whether sparse non-zero OD demand is captured reliably. Metro-GATF achieves the lowest sMAPE among the compared full-matrix models.
Compared with MPGCN, Metro-GATF achieved slightly better forecasting accuracy while requiring substantially less training time. Specifically, Metro-GATF reduced MAE from 0.0172 to 0.0169, RMSE from 0.1746 to 0.1637, and sMAPE from 2.697% to 2.662%. These correspond to approximately 1.7%, 6.2%, and 1.3% reductions in MAE, RMSE, and sMAPE, respectively. More importantly for offline large-scale analysis, the training time decreased from 24.79 ± 5.73 h for MPGCN to 4.36 ± 0.34 h for Metro-GATF, corresponding to an approximately 82.4% reduction. Thus, although the error difference between the two graph-based models is modest, Metro-GATF provides a more favorable accuracy–efficiency trade-off under the evaluated full-matrix OD forecasting protocol.
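The relative reductions quoted above follow directly from the reported mean values, as the short check below shows. The helper name `relative_reduction` is an assumption; the numbers are the MPGCN and Metro-GATF means stated in the text.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# (MPGCN, Metro-GATF) mean values as reported in the text.
reported = {
    "MAE": (0.0172, 0.0169),
    "RMSE": (0.1746, 0.1637),
    "sMAPE": (2.697, 2.662),
    "training time (h)": (24.79, 4.36),
}

reductions = {name: round(relative_reduction(base, prop), 1)
              for name, (base, prop) in reported.items()}
print(reductions)
```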

4.1.2. Pair-Level Reference Experiment

In addition, this study examined the pair-level behavior of the proposed model through a pair-level reference experiment. The ST-LSTM used in this experiment is based on the short-term urban railway OD flow prediction model proposed by Cui et al. [19]. In the original work, ST-LSTM leveraged network-wide historical OD data, spatial correlation learning, and real-time inflow/outflow information to predict the target OD flows. Owing to computational constraints, the original study evaluated a subset of OD pairs rather than the full OD matrix and provided an analysis of selected individual OD flows [19]. Accordingly, in this study, ST-LSTM is reimplemented not as a full-matrix baseline but as a reference model for selected OD pairs and is used for supplementary comparison. Therefore, the results presented below should be interpreted not as a direct comparison with full-matrix forecasting models, but rather as a complementary evaluation at the level of specific OD pairs.
Furthermore, for low-flow OD pairs, the true values are frequently zero or near-zero, which can lead to instability in percentage-based metrics [36,37]. Therefore, this study presents the (16, 523) pair, where the flow volume is relatively sufficient, as a representative case, whereas the (10, 25) pair is included only as part of the supplementary analysis.
As shown in Table 3, for the (16, 523) pair, ST-LSTM achieved a lower error in MAE, whereas Metro-GATF recorded lower errors in the RMSE and sMAPE. Specifically, Metro-GATF slightly reduced the RMSE from 3.007 to 2.986 and achieved a more substantial improvement in the sMAPE from 79.29% to 58.13%. These results suggest that while pair-level models trained directly on individual OD pairs may have an advantage in certain absolute error metrics, full-matrix models trained to reconstruct the entire OD matrix can still exhibit stable relative error characteristics at the level of representative OD pairs. However, because the two models differ in their training objectives and output structures, these results should be interpreted as a pair-level supplementary analysis rather than a direct comparison.
For the auxiliary analysis, the same comparison was conducted for a low-flow OD pair, (10, 25). This pair exhibits very low traffic volumes across most time steps, with frequent occurrences of zero or near-zero values, leading to significant instability in percentage-based metrics [36,37].
As shown in Table 4, for the (10, 25) pair, Metro-GATF achieved lower MAE and sMAPE values, whereas its RMSE was higher than that of ST-LSTM. However, because this pair exhibits an extremely low overall flow, the variability of percentage-based metrics can be significantly amplified [36,37]. Therefore, these results should be interpreted not as a representative comparison but rather as a supplementary analysis for low-demand OD pairs.

5. Ablation Study

To better explain how Metro-GATF is progressively assembled into a practical full-matrix OD forecasting model, we conducted progressive core build-up ablation rather than relying only on local leave-one-out variants. Starting from a minimal backbone, we incrementally added the major components of the proposed architecture and examined how each stage changed forecasting accuracy.
The stage definitions are as follows: S0 is a minimal graph-temporal baseline that uses only a short static encoder, a simple GRU-based temporal head, and direct row-wise prediction, without factorization, dynamic OD encoding, metadata, or gating. S1 replaces the direct output head with origin–destination factorization while keeping the remaining settings unchanged. S2 extends the static encoder from a short-only branch to dual short and long, multi-scale branches. S3 further introduces the dynamic OD graph encoder such that time-varying OD interactions can be encoded on top of the static spatial backbone. S4 replaces the simple temporal head with a non-autoregressive Transformer decoder. S5 adds minute-of-day temporal features to both historical and future steps. S6 incorporates weekday embeddings. S7 further appends geographic coordinates as static node features. Finally, S8 adds a sparsity-aware gate, which corresponds to the full Metro-GATF model.
As shown in Table 5, stages S0–S3 are far from a practically useful full-matrix forecasting regime. The minimal backbone in S0 yields extremely large errors, indicating that short-range static encoding with a simple temporal head and direct prediction is insufficient for large-scale metro OD forecasting. Adding factorization to S1 does not by itself improve the result, and extending the static encoder to a dual short/long structure in S2 still leaves the model in a clearly underpowered regime. When the dynamic OD graph encoder was introduced in S3, the errors decreased substantially, suggesting that time-varying OD interactions provide useful information beyond the static railway topology. However, the performance is still far from that of the final model, implying that dynamic graph modeling alone is not sufficient.
The dominant turning point occurred at S4. Once the simple GRU-based temporal head is replaced by a non-autoregressive transformer decoder, the model enters a practically meaningful performance range. Notably, S3 and S4 share the same batch size under the progressive training schedule, which makes this transition particularly informative. This result indicates that the temporal decoder is not a minor refinement but a core requirement for large-scale offline full-matrix OD forecasting. In other words, static and dynamic graph encoders alone are not sufficient, unless they are paired with a sufficiently expressive future-step decoder.
From S4 to S7, adding time encoding, weekday embedding, and geographic coordinates does not produce measurable gains under the current 30 min forecasting setting, as the reported values are identical up to the shown precision. Therefore, these metadata features appear to play a complementary rather than dominant role in the present setup. Their contributions may become more visible under different horizons or datasets, but they are not the main drivers of the performance transition observed here.
The final improvement was obtained at S8, where the sparsity-aware gate was added to the otherwise complete model. S8 achieves the best MAE, RMSE, and sMAPE among all stages, confirming that explicit zero/non-zero modeling remains important even after the backbone and temporal decoder have already been established. Overall, the progressive build-up study revealed a clear hierarchy of contributions: a minimal backbone is insufficient, dynamic OD interaction modeling is helpful but not decisive, the transformer decoder provides the major performance transition, and the sparsity-aware gate delivers the final gain that completes the full Metro-GATF architecture.
To further examine the inference-time behavior of the sparsity-aware gate, we conducted a gate-threshold sensitivity analysis using the full Metro-GATF configuration. Because the gate threshold is not a trainable parameter but an inference-time operating point, the trained model parameters were fixed, and only τ was varied.
As shown in Table 6, increasing the gate threshold consistently reduced the three main full-matrix forecasting errors. As τ increased from 0.1 to 0.9, RMSE decreased from 0.4551 to 0.1639, MAE from 0.1836 to 0.0170, and sMAPE from 32.4330 to 2.6665. These results indicate that, under the forecasting metrics used in this study, a conservative gate threshold is beneficial for highly sparse metro OD matrices. This is because a low threshold retains many weakly activated OD pairs as non-zero predictions, which can increase false-positive flows over the large number of truly zero OD entries. In contrast, a high threshold suppresses weak predictions and reduces cumulative full-matrix errors. However, τ = 0.9 should not be interpreted as a universally optimal threshold for all operational objectives. The actual non-zero rate of the evaluated OD matrices was 1.348%, whereas the predicted non-zero rate at τ = 0.9 was 0.128%. This indicates that the selected threshold is conservative: it is effective for minimizing the full-matrix forecasting errors considered in this study, but it may fail to detect some active OD pairs if the objective is demand detection or recall-oriented monitoring.
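The mechanics of such a post-training sweep can be sketched as follows: the gate probabilities and magnitudes are held fixed, and only τ changes. The data below are synthetic and chosen solely to illustrate the procedure; they do not reproduce Table 6, and the function names are hypothetical.

```python
def predicted_nonzero_rate(probs, tau):
    """Fraction of entries whose gate probability exceeds tau."""
    return sum(1 for p in probs if p > tau) / len(probs)


def sweep(probs, mags, targets, taus):
    """Return (tau, predicted non-zero rate, MAE) for each threshold."""
    rows = []
    for tau in taus:
        preds = [m if p > tau else 0.0 for p, m in zip(probs, mags)]
        mae = sum(abs(a - b) for a, b in zip(preds, targets)) / len(targets)
        rows.append((tau, predicted_nonzero_rate(probs, tau), mae))
    return rows


# Synthetic example: two truly active entries among eight.
taus = [k / 10 for k in range(1, 10)]
probs = [0.05, 0.15, 0.35, 0.55, 0.95, 0.97, 0.25, 0.65]
mags = [0.4, 0.5, 0.6, 0.7, 2.0, 3.0, 0.5, 0.8]
targets = [0.0, 0.0, 0.0, 0.0, 2.0, 3.0, 0.0, 0.0]
rows = sweep(probs, mags, targets, taus)
for tau, rate, mae in rows:
    print(f"tau={tau:.1f}  nonzero_rate={rate:.3f}  MAE={mae:.4f}")
```

By construction, the predicted non-zero rate can only fall or stay constant as τ rises, whereas the aggregate error falls only while the suppressed entries are mostly false positives.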

6. Discussion

The experimental results suggest that direct full-matrix OD forecasting can be a practical option under the evaluated Seoul metropolitan subway setting and the 30 min short-term forecasting horizon. Full OD matrix prediction is more challenging than station-level inflow/outflow forecasting because the output dimensionality increases quadratically with the number of stations and because most possible OD pairs are zero or near-zero at a given minute. Metro-GATF addresses this issue not by treating the OD matrix as a set of fully independent regression targets, but by first generating future station-level representations and then reconstructing origin–destination relationships through structured factorization. This design reduces the burden of modeling all OD pairs independently while preserving the ability to forecast the complete network-wide OD matrix.
Comparison with baseline models provides insight into why the proposed architecture performs competitively. HA and ARIMA can exploit repetitive or linear temporal patterns but cannot explicitly represent railway topology or time-varying OD interactions. GCN-LSTM incorporates graph-based spatial aggregation and recurrent temporal modeling, but it does not explicitly address sparse full-matrix reconstruction. Autoformer provides a generic Transformer-based time-series baseline, but it does not directly encode the static railway graph or dynamic OD graph structure. ODFormer shows that low aggregate MAE and RMSE can be misleading in sparse OD matrices when relative errors on active OD pairs remain large. MPGCN is a strong graph-based OD forecasting baseline because it also considers static and dynamic graph information. Therefore, the close performance between MPGCN and Metro-GATF indicates that jointly modeling static and dynamic graph structures is important for full-matrix OD forecasting. The advantage of Metro-GATF lies in combining adaptive GATv2-based railway topology encoding, dynamic sparse OD interaction modeling, non-autoregressive future-step decoding, and sparsity-aware OD reconstruction in a unified framework with a shorter training time than MPGCN.
The ablation study further clarifies the role of the temporal decoder. The transition from the dynamic-graph stage to the Transformer-decoder stage produced the dominant performance improvement, suggesting that graph encoders alone are insufficient unless paired with an expressive future-step generation mechanism. The non-autoregressive decoder generates all future station representations in parallel using future-step embeddings and known temporal encodings, without feeding previously predicted OD matrices into subsequent prediction steps. This design can reduce the possibility of sequential error propagation and is well aligned with short-horizon multi-step OD forecasting. In addition, the operating-protocol comparison in Table 7 further supports this interpretation. Under the same Metro-GATF components and experimental settings, the proposed non-autoregressive protocol achieved lower MAE, RMSE, and sMAPE than the autoregressive decoding variant. These results suggest that parallel future-step generation is more effective than sequential future-step generation for the evaluated offline full-matrix OD forecasting task. Nevertheless, the autoregressive setting in Table 7 was implemented as an autoregressive decoding protocol rather than as a fully redesigned autoregressive Transformer architecture. Therefore, a strictly matched comparison between independently optimized autoregressive and non-autoregressive Transformer architectures remains an important direction for future work.
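The difference between the two decoding protocols can be illustrated with a deliberately simplified sketch. It uses a single-head dot-product attention without learned weights, which is an assumption made for brevity and not the architecture of Metro-GATF. In the parallel variant each future step has its own query and attends to the encoded history independently, whereas in the autoregressive variant each step consumes the previous step's output.

```python
import math


def softmax(xs):
    """Stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, memory):
    """Single-head dot-product attention of one query over encoded history."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, row)) / scale for row in memory]
    weights = softmax(scores)
    return [sum(w * row[c] for w, row in zip(weights, memory))
            for c in range(len(query))]


def decode_parallel(step_queries, memory):
    """Non-autoregressive: each future step is decoded independently."""
    return [cross_attend(q, memory) for q in step_queries]


def decode_autoregressive(first_query, memory, n_steps):
    """Contrast case: each step's query is the previous step's output."""
    outputs, query = [], first_query
    for _ in range(n_steps):
        query = cross_attend(query, memory)
        outputs.append(query)
    return outputs


# Toy encoded history (two time steps) and three hypothetical step queries.
memory = [[1.0, 0.0], [0.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out = decode_parallel(queries, memory)
# Perturbing the first step's query leaves the later steps unchanged.
altered = decode_parallel([[9.0, -3.0]] + queries[1:], memory)
chained = decode_autoregressive(queries[0], memory, n_steps=3)
print(out[1:] == altered[1:], len(chained))
```

In the parallel variant an error at one step cannot propagate to later steps, which is the property the discussion above attributes to the non-autoregressive protocol.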
The sparsity-aware gate is another important component for large-scale metro OD forecasting. In a 637-station network, the number of possible OD pairs is very large, whereas the number of active OD pairs at a given minute is relatively small. Without explicit zero/non-zero modeling, weak false-positive predictions can accumulate over the full OD matrix and increase aggregate forecasting errors. The gate head addresses this issue by separating OD-pair activity estimation from magnitude prediction. The threshold sensitivity analysis indicates that a conservative threshold is effective for minimizing MAE, RMSE, and sMAPE under the evaluated setting. Nevertheless, the selected threshold should not be regarded as universally optimal. Because the predicted non-zero rate at the selected threshold is lower than the actual non-zero rate, the threshold is better interpreted as an error-minimizing operating point rather than a recall-oriented demand detection threshold. In operational applications where detecting as many active OD pairs as possible is more important than suppressing false positives, a lower threshold or a different threshold-selection criterion may be more appropriate.
From a practical perspective, the proposed framework should be interpreted as a decision-support tool for short-term OD demand analysis rather than as an automatic schedule control system. A 30 min forecasting horizon is useful for monitoring short-term demand changes, identifying emerging OD patterns, and supporting operational awareness, but it is not intended to replace comprehensive timetable planning or emergency-response decision-making. Operational adjustments in metro systems require additional constraints, including rolling-stock availability, crew scheduling, safety margins, passenger transfer behavior, and station-level capacity. Therefore, Metro-GATF should be used as one analytical input within a broader operational decision-making process.
Computational efficiency should be interpreted together with forecasting accuracy. In Table 2, Metro-GATF required 4.36 ± 0.34 h for training, whereas MPGCN required 24.79 ± 5.73 h under the same hardware and experimental protocol. Thus, the proposed model achieved error metrics comparable to or slightly better than those of MPGCN while reducing training time by approximately 82.4%. GCN-LSTM and Autoformer trained faster than Metro-GATF, at 3.06 ± 0.01 h and 3.00 ± 0.02 h, respectively, but their forecasting errors were substantially higher. ODFormer trained fastest among the neural baselines, but its high sMAPE indicates unstable relative accuracy for sparse OD demand.
Table 8 further summarizes the model complexity and inference-efficiency results in terms of parameter count, GPU memory usage, and inference latency. GCN-LSTM and Autoformer showed lower latency than Metro-GATF, but their forecasting errors were substantially larger, indicating that low computational cost alone does not guarantee reliable full-matrix OD prediction. ODFormer had the largest number of parameters, with 263.8 M parameters, and also showed higher inference latency than Metro-GATF. Although MPGCN had the smallest parameter count, it required the largest GPU memory and exhibited the highest inference latency, which suggests that parameter count alone is insufficient for evaluating the practical complexity of full-matrix OD forecasting models.
From an architectural complexity perspective, Metro-GATF uses shared origin and destination projections rather than independent predictors for every OD pair. This design avoids assigning separate trainable predictors to all $N^2$ OD pairs while preserving directional origin–destination interactions. Although reconstructing a full $N \times N$ OD matrix at each future step remains the dominant output-size cost, Metro-GATF achieved substantially lower latency than MPGCN and required far fewer parameters than ODFormer. Therefore, the proposed model provides a favorable accuracy–efficiency trade-off for offline full-matrix OD forecasting over a 637-station metro network.
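A rough count illustrates why shared projections scale better than per-pair predictors. The hidden and projection widths of 64 are hypothetical, and the per-pair alternative is modeled as one linear scalar predictor per OD pair purely for comparison; neither figure is taken from the paper. Only the station count (637) and the 30-step horizon come from the text.

```python
def shared_projection_params(d_model: int, d_proj: int) -> int:
    """Two shared linear maps (origin and destination), each with a bias."""
    return 2 * (d_model * d_proj + d_proj)


def per_pair_params(n_stations: int, d_model: int) -> int:
    """Hypothetical alternative: one linear scalar predictor per OD pair."""
    return n_stations * n_stations * (d_model + 1)


N_STATIONS = 637          # from the text
HORIZON = 30              # forecasting steps, from the text
D_MODEL = D_PROJ = 64     # hypothetical widths, for illustration only

od_pairs = N_STATIONS * N_STATIONS
outputs_per_sample = HORIZON * od_pairs
shared = shared_projection_params(D_MODEL, D_PROJ)
per_pair = per_pair_params(N_STATIONS, D_MODEL)
print(od_pairs, outputs_per_sample, shared, per_pair)
```

The shared-projection count is independent of the number of stations, whereas the output size still grows with $N^2$ per step, consistent with the remark that reconstruction remains the dominant output-size cost.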
This study has several limitations. First, the experiments were conducted on a single Seoul metropolitan subway network over a limited observation period. Although the network is large and operationally realistic, the results do not by themselves establish general applicability across different cities, network structures, or demand regimes. Second, the study focuses on a 30 min short-term forecasting horizon. Longer horizons, such as one day or one week, may involve different demand patterns, exogenous events, and accumulated uncertainty. Third, the model mainly uses AFC-based OD demand and railway network structure; external factors such as weather, holidays, special events, land use, station-area facilities, and socioeconomic attributes were not incorporated. Finally, this study reports seed-level variability through repeated experiments but does not provide a formal Bayesian or ensemble-based parameter uncertainty analysis. Future work should evaluate the framework across multiple cities and periods, extend it to longer forecasting horizons, incorporate exogenous contextual variables, and examine uncertainty-aware OD forecasting.

7. Conclusions

This study proposed Metro-GATF, an end-to-end framework for offline full-matrix metro origin–destination forecasting. The proposed model departs from sequential autoregressive forecasting by generating future node representations in parallel through a non-autoregressive Transformer decoder. These representations are then used to reconstruct future OD matrices through origin–destination factorization and a sparsity-aware gating mechanism. In this way, Metro-GATF is designed to jointly capture static railway topology and time-varying sparse OD interactions.
Experiments were conducted using minute-level AFC-based OD data from May 2022 for 637 stations in the Seoul metropolitan subway network. Under the full-matrix forecasting setting, final evaluation on the independent test set showed that Metro-GATF achieved an MAE of 0.0169, an RMSE of 0.1637, and an sMAPE of 2.662%. The model consistently outperformed HA, ARIMA, GCN-LSTM, and Autoformer, achieved the lowest sMAPE among the compared full-matrix baselines, and showed slightly better MAE, RMSE, and sMAPE than MPGCN while requiring substantially less training time. These results support the effectiveness of Metro-GATF within the empirical setting considered in this study, namely, a single large-scale metropolitan subway network.
The ablation results further indicate that the non-autoregressive temporal decoder is a key component contributing to performance improvement, while the sparsity-aware gating mechanism provides additional benefits for highly sparse OD matrices. The additional comparison with the autoregressive decoding variant further supports the advantage of parallel future-step generation under the evaluated offline full-matrix forecasting protocol. These findings suggest that combining parallel temporal decoding with structured sparse OD reconstruction is a promising design choice for full-matrix metro OD forecasting. The gate-threshold sensitivity analysis further showed that τ = 0.9 provided the lowest and most stable errors among the evaluated thresholds according to MAE, RMSE, and sMAPE. At the same time, this threshold produced a predicted non-zero rate lower than the actual non-zero rate, indicating that it should be interpreted as a conservative error-minimizing operating point rather than a recall-optimized demand detection threshold.
Importantly, the empirical findings of this study should be interpreted within the scope of the evaluated dataset. Although the Seoul metropolitan subway network represents a large-scale real-world metro system, the experiments were conducted on a single network over a limited observation period. Therefore, the results do not by themselves establish the general applicability of Metro-GATF across different cities, network structures, or operating conditions. Rather, this study provides evidence that the proposed framework is effective in the examined large-scale metro network and offers a practical basis for further validation in broader OD forecasting scenarios.

Author Contributions

Conceptualization, S.H.K. and H.J.J.; methodology, S.H.K.; formal analysis, S.i.S.; investigation, S.i.S.; writing—original draft preparation, S.H.K.; writing—review and editing, S.H.K. and J.W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) grant funded by the Korea government (MOTIE) (RS-2022-KP002841, Development of Artificial Intelligence Vibration Monitoring System for Rotating Machinery). This work was supported by T-money Welfare Foundation. No grant number was assigned.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable. The study used pseudonymized transportation data without personally identifiable information.

Data Availability Statement

Data are unavailable due to privacy or ethical restrictions.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (GPT-5.4, OpenAI, San Francisco, CA, USA) for language refinement and manuscript editing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

Author Seong il Shin was employed by the ITS Mobility Lab. The author declares that there were no commercial or financial relationships that could be construed as a potential conflict of interest related to this work. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Rong, C.; Ding, J.; Li, Y. An interdisciplinary survey on origin-destination flows modeling: Theory and techniques. ACM Comput. Surv. 2024, 57, 1–49.
  2. Toqué, F.; Côme, E.; El Mahrsi, M.K.; Oukhellou, L. Forecasting dynamic public transport origin-destination matrices with long-short term memory recurrent neural networks. In Proceedings of the 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC); IEEE: New York, NY, USA, 2016.
  3. Noursalehi, P.; Koutsopoulos, H.N.; Zhao, J. Dynamic origin-destination prediction in urban rail systems: A multi-resolution spatio-temporal deep learning approach. IEEE Trans. Intell. Transp. Syst. 2021, 23, 5106–5115.
  4. Seoul TOPIS. Subway. Seoul Transport Operation and Information Service. Available online: https://topis.seoul.go.kr/openEngSubway.do (accessed on 6 April 2026).
  5. Korea Railroad Corp (KORAIL). KORAIL Establishes a Safety Cooperation System with Nine Metropolitan Urban Rail Operators. Available online: https://info.korail.com/info/selectBbsNttView.do?bbsNo=199&integrDeptCode=&key=911&nttNo=25101&pageIndex=24&searchCnd=SJ&searchCtgry=&searchKrwd= (accessed on 4 April 2026).
  6. Seoul Metropolitan Government. Jamsil and Seongsu Crowned as Seoul’s Busiest Subway Stations. Available online: https://english.seoul.go.kr/jamsil-and-seongsu-crowned-as-seouls-busiest-subway-stations/ (accessed on 4 April 2026).
  7. Han, Y.; Wang, S.; Ren, Y.; Wang, C.; Gao, P.; Chen, G. Predicting station-level short-term passenger flow in a citywide metro network using spatiotemporal graph convolutional neural networks. ISPRS Int. J. Geo-Inf. 2019, 8, 243.
  8. Wang, Y.; Yin, H.; Chen, H.; Wo, T.; Xu, J.; Zheng, K. Origin-destination matrix prediction via graph convolution: A new perspective of passenger demand modeling. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; Association for Computing Machinery: New York, NY, USA, 2019.
  9. Hu, J.; Yang, B.; Guo, C.; Jensen, C.S.; Xiong, H. Stochastic origin-destination matrix forecasting using dual-stage graph convolutional, recurrent neural networks. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE); IEEE: New York, NY, USA, 2020.
  10. Shi, H.; Yao, Q.; Guo, Q.; Li, Y.; Zhang, L.; Ye, J.; Li, Y.; Liu, Y. Predicting origin-destination flow via multi-perspective graph convolutional network. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE); IEEE: New York, NY, USA, 2020.
  11. Zheng, F.; Zhao, J.; Ye, J.; Gao, X.; Ye, K.; Xu, C. Metro OD matrix prediction based on multi-view passenger flow evolution trend modeling. IEEE Trans. Big Data 2023, 9, 991–1003.
  12. Liu, L.; Zhu, Y.; Li, G.; Wu, Z.; Bai, L.; Lin, L. Online metro origin-destination prediction via heterogeneous information aggregation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3574–3589.
  13. Shen, L.; Li, J.; Chen, Y.; Li, C.; Chen, X.; Lee, D.H. Short-term metro origin-destination passenger flow prediction via spatio-temporal dynamic attentive multi-hypergraph network. IEEE Trans. Intell. Transp. Syst. 2024, 25, 9945–9957.
  14. Tang, T.; Mao, J.; Liu, R.; Liu, Z.; Wang, Y.; Huang, D. Origin-destination matrix prediction in public transport networks: Incorporating heterogeneous direct and transfer trips. IEEE Trans. Intell. Transp. Syst. 2024, 25, 19889–19903.
  15. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35.
  16. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS Proceedings: San Diego, CA, USA, 2021; Volume 34.
  17. Huang, B.; Ruan, K.; Yu, W.; Xiao, J.; Xie, R.; Huang, J. ODFormer: Spatial-temporal transformers for long sequence origin-destination matrix forecasting against cross application scenario. Expert Syst. Appl. 2023, 222, 119835.
  18. Liu, Y.; Chen, B.; Zheng, Y.; Cheng, L.; Li, G.; Lin, L. ODMixer: Fine-grained spatial-temporal MLP for metro origin-destination prediction. IEEE Trans. Knowl. Data Eng. 2025, 37, 5508–5522.
  19. Cui, H.; Si, B.; Wang, J.; Zhao, B.; Pan, W. Short-term origin-destination flow prediction for urban rail network: A deep learning method based on multi-source big data. Complex Intell. Syst. 2024, 10, 4675–4696.
  20. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
  21. Brody, S.; Alon, U.; Yahav, E. How attentive are graph attention networks? In Proceedings of the International Conference on Learning Representations (ICLR), Online, 25–29 April 2022.
  22. Zhang, J.; Che, H.; Chen, F.; Ma, W.; He, Z. Short-term origin-destination demand prediction in urban rail transit systems: A channel-wise attentive split-convolutional neural network method. Transp. Res. Part C Emerg. Technol. 2021, 124, 102928.
  23. Wang, X.; Zhang, Y.; Zhang, J. Large-scale origin-destination prediction for urban rail transit network based on graph convolutional neural network. Sustainability 2024, 16, 10190.
  24. Ali, A.; Ullah, I.; Ahmad, S.; Wu, Z.; Li, J.; Bai, X. An attention-driven spatio-temporal deep hybrid neural networks for traffic flow prediction in transportation systems. IEEE Trans. Intell. Transp. Syst. 2025, 26, 14154–14168.
  25. Yao, X.; Gao, Y.; Zhu, D.; Manley, E.; Wang, J.; Liu, Y. Spatial origin-destination flow imputation using graph convolutional networks. IEEE Trans. Intell. Transp. Syst. 2020, 22, 7474–7484.
  26. Ye, J.; Zhao, J.; Zheng, F.; Xu, C. Completion and augmentation-based spatiotemporal deep learning approach for short-term metro origin-destination matrix prediction under limited observable data. Neural Comput. Appl. 2023, 35, 3325–3341.
  27. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
  28. Seoul Open Data Plaza. Seoul Metro Line-by-Line Station Information. Available online: https://data.seoul.go.kr/dataList/OA-15442/S/1/datasetView.do (accessed on 6 April 2026).
  29. Ministry of Land, Infrastructure and Transport. All Urban Rail Lines. Public Data Portal. Available online: https://www.data.go.kr/data/15122916/fileData.do (accessed on 6 April 2026).
  30. Korea Rail Data Portal. List of Operator, Line, and Station Code Information. Available online: https://data.kric.go.kr/rips/M_04_02/detail.do?id=3 (accessed on 6 April 2026).
  31. Seoul Metro. Latitude/Longitude Information for Stations on Seoul Metro Lines 1–8. Public Data Portal. Available online: https://www.data.go.kr/data/15099316/fileData.do (accessed on 6 April 2026).
  32. Jung-gu Office, Seoul. Subway. Transportation. Available online: https://www.junggu.seoul.kr/english/content.do?cmsid=14848 (accessed on 6 April 2026).
  33. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems; NeurIPS Proceedings: San Diego, CA, USA, 2017; Volume 30.
  34. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
  35. Saito, T.; Rehmsmeier, M. The precision–recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432.
  36. Hyndman, R.J.; Koehler, A.B. Another look at measures of forecast accuracy. Int. J. Forecast. 2006, 22, 679–688.
  37. Kim, S.; Kim, H. A new metric of absolute percentage error for intermittent demand forecasts. Int. J. Forecast. 2016, 32, 669–679.
Figure 1. Geographic distribution of the Seoul metropolitan subway network used in this study. Line and station metadata were integrated using official line-specific station information, system-wide urban railway network data, operator/line/station code information, and official station-coordinate data [28,29,30,31].
Figure 2. Overall architecture of Metro-GATF. The framework consists of input construction, spatial encoding, non-autoregressive temporal decoding, and sparse OD reconstruction. Dynamic sparse OD graphs, static railway adjacency, temporal encodings, and station-level node features are transformed into future node representations, which are then used by the magnitude and gate heads to generate future OD matrix predictions. Different colors indicate different functional modules, including input components, spatial encoding, feature fusion, temporal decoding, and prediction heads. Arrows represent the flow of information across the framework, while vertical separators indicate major processing stages.
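As a rough illustration of the sparse OD reconstruction stage named in the Figure 2 caption, the following NumPy sketch maps future node representations to a full OD matrix through a magnitude head and a gate head. The bilinear origin–destination factorization, the softplus and sigmoid activations, the projection matrices `w_o`, `w_d`, `g_o`, `g_d`, and the toy sizes are all assumptions made for illustration; they are not taken from the Metro-GATF implementation.

```python
import numpy as np

def softplus(z):
    # Numerically stable log(1 + exp(z)); keeps magnitudes non-negative.
    return np.logaddexp(0.0, z)

def sigmoid(z):
    # tanh form of the logistic function, stable for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def reconstruct_od(h, w_o, w_d, g_o, g_d, tau=0.9):
    """Map node representations h of shape (H, N, d) to (H, N, N) OD predictions.

    Hypothetical heads: each station is projected into an origin role and a
    destination role, and every OD entry is scored by an inner product.
    """
    mag = softplus((h @ w_o) @ np.swapaxes(h @ w_d, -1, -2))   # (H, N, N) magnitudes
    gate = sigmoid((h @ g_o) @ np.swapaxes(h @ g_d, -1, -2))   # (H, N, N) gate probabilities
    return mag * (gate > tau), gate                            # closed gates yield exact zeros

# Toy example: 30 future steps, 5 stations, 8-dim node features, rank-4 heads.
rng = np.random.default_rng(0)
h = rng.normal(size=(30, 5, 8))
w_o, w_d, g_o, g_d = (rng.normal(size=(8, 4)) for _ in range(4))
od_pred, gate = reconstruct_od(h, w_o, w_d, g_o, g_d, tau=0.9)
print(od_pred.shape)  # (30, 5, 5)
```

Under this assumed factorization, the head parameters depend only on the feature and rank dimensions rather than on the number of OD pairs, which is one reason factorized heads are attractive for a 637-station network.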
Table 1. Summary of representative OD forecasting studies by output granularity, task setting, and model architecture.

| Model | Output Granularity | Task Setting | Spatial Encoder | Temporal Encoder | Special Structure |
|---|---|---|---|---|---|
| MRSTN [3] | Full Matrix | Online | GCN | LSTM | Multi-resolution |
| HIAM [12] | Full Matrix | Online | Heterogeneous | Attention | Incomplete OD imputation |
| LSTM [2] | Full Matrix | Offline | - | LSTM | - |
| ST-GCN [7] | Station-level | Offline | GCN | GCN | - |
| GEML [8] | Full Matrix | Offline | GCN | GCN | Graph reformulation |
| GCN-RNN [9] | Full Matrix | Offline | GCN | GRU | Probabilistic output |
| MPGCN [10] | Full Matrix | Offline | GCN (static + dynamic) | GCN | Dual-graph fusion |
| MVPF [11] | Full Matrix | Offline | Multi-view GCN | GRU | Multi-view evolution trend |
| ST-DAMHGN [13] | Full Matrix | Offline | Hypergraph | Attention | Multi-hypergraph + dynamic attn |
| DT-HGN [14] | Full Matrix | Offline | Heterogeneous Graph | - | Direct/transfer decomposition |
| Informer [15] | General TS | Offline | - | Transformer (ProbSparse) | Long-seq efficiency |
| Autoformer [16] | General TS | Offline | - | Transformer + Decomp | Auto-correlation |
| ODFormer [17] | Full Matrix | Offline | - | Transformer (OD attn) | OD attention |
| ODMixer [18] | Full Matrix | Offline | - | MLP | All-OD-pair fine-grained |
| ST-LSTM [19] | OD Pair | Offline | - | LSTM | Multi-source data fusion |
| GCN-GRU [23] | Full Matrix | Offline | GCN | GRU | Large-scale |
| ASTMGCNet [24] | Traffic flow | Offline | GCN | GRU | Multi-scale + dual attention |
| CNN [22] | OD pair | Offline | CNN | CNN | Local spatiotemporal pattern extraction |
| Metro-GATF (Ours) | Full Matrix | Offline | GATv2 | Non-AR Transformer | OD factorization + sparsity gate |
Comparison of representative OD forecasting studies by output granularity, task setting, spatial encoder, and model design. Online setting indicates that complete historical OD matrices are not available at prediction time.
Table 2. Comparison of minute-level OD matrix forecasting performance.

| Model | MAE | RMSE | sMAPE | Training Time |
|---|---|---|---|---|
| HA | 0.3854 | 1.2479 | 3.35 | 4.59 m |
| ARIMA | 0.6815 | 3.4321 | 39.64 | 174 h |
| GCN-LSTM | 0.8200 ± 0.0326 | 0.8710 ± 0.0370 | 23.83 ± 0.25 | 3.06 ± 0.01 h |
| Autoformer | 0.5200 ± 0.0247 | 0.6770 ± 0.0167 | 31.48 ± 4.08 | 3.00 ± 0.02 h |
| ODFormer | 0.0017 ± 0.0003 | 0.0022 ± 0.0004 | 163.10 ± 6.01 | 1.03 ± 0.24 h |
| MPGCN | 0.0172 ± 0.0018 | 0.1746 ± 0.0134 | 2.697 ± 1.781 | 24.79 ± 5.73 h |
| Metro-GATF (Proposed) | 0.0169 ± 0.00005 | 0.1637 ± 0.0024 | 2.662 ± 0.008 | 4.36 ± 0.34 h |
The experimental setting used an input window of 60 min, a hop size of 10 min, a prediction horizon of 30 min, and a time resolution of 1 min. All results were computed on the independent test set from 25 May to 31 May 2022.
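The windowing settings in the note above (60 min input, 30 min horizon, 10 min hop, 1 min resolution) can be sketched as a simple slicing routine. This is a minimal illustration rather than the authors' preprocessing code: the function name `make_windows`, the `(T, N, N)` array layout, and the 4-station toy network are assumptions.

```python
import numpy as np

def make_windows(series, input_len=60, horizon=30, hop=10):
    """Slice a (T, N, N) minute-level OD sequence into (input, target) pairs.

    Each sample uses `input_len` consecutive minutes as input and the next
    `horizon` minutes as target; consecutive samples start `hop` minutes apart.
    """
    total = input_len + horizon
    starts = range(0, series.shape[0] - total + 1, hop)
    x = np.stack([series[s:s + input_len] for s in starts])
    y = np.stack([series[s + input_len:s + total] for s in starts])
    return x, y

# Toy example: 3 hours of 1-min OD matrices for a hypothetical 4-station network.
rng = np.random.default_rng(0)
od = rng.poisson(0.05, size=(180, 4, 4)).astype(np.float32)
x, y = make_windows(od)
print(x.shape, y.shape)  # (10, 60, 4, 4) (10, 30, 4, 4)
```

With a 10 min hop, neighbouring samples overlap heavily, so train, validation, and test splits are best made along the time axis before windowing.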
Table 3. Pair-level reference experiment of a representative OD pair (16, 523).

| Model | MAE | RMSE | sMAPE |
|---|---|---|---|
| ST-LSTM | 1.952 | 3.007 | 79.29 |
| Metro-GATF (Proposed) | 2.239 | 2.986 | 58.13 |
The experimental setting used an input window of 60 min, a hop size of 10 min, a prediction horizon of 30 min, and a time resolution of 1 min. All results were computed on the independent test set from 25 May to 31 May 2022.
Table 4. Supplementary pair-level result on a low-demand OD pair (10, 25).

| Model | MAE | RMSE | sMAPE |
|---|---|---|---|
| ST-LSTM | 0.0087 | 0.0644 | 120.5511 |
| Metro-GATF (Proposed) | 0.0060 | 0.0808 | 1.138 |
The experimental setting used an input window of 60 min, a hop size of 10 min, a prediction horizon of 30 min, and a time resolution of 1 min. All results were computed on the independent test set from 25 May to 31 May 2022.
Table 5. Comparison of minute-level OD matrix forecasting performance.

| Variant | MAE | RMSE | sMAPE |
|---|---|---|---|
| S0: minimal backbone | 1.007512 | 1.021849 | 197.597704 |
| S1: +OD factorization | 1.080527 | 1.088442 | 197.586878 |
| S2: +dual short, long static branches | 1.059364 | 1.067890 | 197.582795 |
| S3: +dynamic OD graph encoder | 0.547759 | 0.649611 | 195.990365 |
| S4: +Transformer | 0.017229 | 0.174599 | 2.696677 |
| S5: +time encoding | 0.017229 | 0.174599 | 2.696677 |
| S6: +weekday embedding | 0.017229 | 0.174599 | 2.696677 |
| S7: +geographic coordinates | 0.017229 | 0.174599 | 2.696677 |
| S8: +sparsity-aware gate (Metro-GATF) | 0.016938 | 0.163716 | 2.662081 |
The progressive build-up experiment followed the same forecasting settings. The gate threshold was set to τ = 0.9.
Table 6. Gate-threshold sensitivity analysis of Metro-GATF on the full OD network.

| τ | MAE | RMSE | sMAPE | Predicted Non-Zero Rate (%) |
|---|---|---|---|---|
| 0.1 | 0.1836 ± 0.0245 | 0.4551 ± 0.0320 | 32.4330 ± 4.0863 | 17.062 ± 2.080 |
| 0.2 | 0.1098 ± 0.0160 | 0.3583 ± 0.0270 | 18.6101 ± 2.6041 | 9.875 ± 1.358 |
| 0.3 | 0.0746 ± 0.0108 | 0.3000 ± 0.0221 | 12.2228 ± 1.7106 | 6.434 ± 0.919 |
| 0.4 | 0.0534 ± 0.0072 | 0.2577 ± 0.0176 | 8.4979 ± 1.1140 | 4.333 ± 0.622 |
| 0.5 | 0.0393 ± 0.0047 | 0.2248 ± 0.0133 | 6.0994 ± 0.6938 | 2.895 ± 0.409 |
| 0.6 | 0.0295 ± 0.0028 | 0.1985 ± 0.0091 | 4.4957 ± 0.3938 | 1.845 ± 0.252 |
| 0.7 | 0.0228 ± 0.0014 | 0.1784 ± 0.0052 | 3.4493 ± 0.1874 | 1.057 ± 0.138 |
| 0.8 | 0.0187 ± 0.0005 | 0.1660 ± 0.0019 | 2.8623 ± 0.0555 | 0.487 ± 0.057 |
| 0.9 | 0.0170 ± 0.0001 | 0.1639 ± 0.0001 | 2.6665 ± 0.0040 | 0.128 ± 0.015 |
The sensitivity analysis was conducted on the validation set using the best S8 checkpoints from three random seeds. Only the inference threshold τ was varied, while all trained model parameters were fixed. The actual non-zero rate of the evaluated OD matrices was 1.348%.
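The threshold sweep described in this note can be sketched as a pure inference-time operation: the gate probabilities and magnitudes stay fixed, and only τ changes which entries are kept. The code below is a minimal sketch under stated assumptions: the function name `evaluate_threshold`, the synthetic arrays, and in particular the sMAPE convention (entries where prediction and target are both zero contribute zero error) are illustrative choices, not the paper's exact definitions.

```python
import numpy as np

def evaluate_threshold(magnitude, gate_prob, target, tau):
    """Apply a hard gate at threshold tau and compute error metrics."""
    pred = np.where(gate_prob > tau, magnitude, 0.0)
    err = pred - target
    denom = np.abs(pred) + np.abs(target)
    # Assumed convention: 0/0 entries count as zero error, which matters
    # a great deal when almost all OD entries are empty.
    ratio = np.divide(2.0 * np.abs(err), denom,
                      out=np.zeros_like(denom), where=denom > 0)
    return {
        "MAE": float(np.abs(err).mean()),
        "RMSE": float(np.sqrt((err ** 2).mean())),
        "sMAPE": float(100.0 * ratio.mean()),
        "nonzero_rate_pct": float(100.0 * (pred > 0).mean()),
    }

# Synthetic stand-ins for fixed model outputs on a hypothetical 50-station network.
rng = np.random.default_rng(0)
target = rng.poisson(0.02, size=(30, 50, 50)).astype(float)
gate_prob = np.clip(0.6 * (target > 0) + rng.uniform(0.0, 0.5, target.shape), 0.0, 1.0)
magnitude = target + rng.uniform(0.1, 1.0, target.shape)

for tau in (0.1, 0.5, 0.9):
    print(tau, evaluate_threshold(magnitude, gate_prob, target, tau))
```

Because the magnitudes are positive, raising τ can only remove predicted non-zero entries, which matches the monotone decrease of the predicted non-zero rate in Table 6.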
Table 7. Comparison between non-autoregressive and autoregressive decoding protocols.

| Model | MAE | RMSE | sMAPE |
|---|---|---|---|
| Metro-GATF (AR) | 0.01720 ± 0.00002 | 0.1735 ± 0.0005 | 2.6937 ± 0.0018 |
| Metro-GATF (Proposed) | 0.0169 ± 0.00005 | 0.1637 ± 0.0024 | 2.662 ± 0.008 |
The autoregressive variant was evaluated by sequentially generating future steps while keeping all other Metro-GATF components and experimental settings unchanged. Values are reported as mean ± standard deviation over three random seeds on the independent test set.
Table 8. Computational efficiency and model complexity of neural full-matrix forecasting models.

| Model | Params | GPU Mem | Latency | Throughput |
|---|---|---|---|---|
| GCN-LSTM | 46.0 K | 0.38 GB | 200.0 ± 2.8 ms/sample | 5.00 ± 0.07 sample/s |
| Autoformer | 57.8 M | 0.64 GB | 128.7 ± 2.7 ms/sample | 7.77 ± 0.16 sample/s |
| ODFormer | 263.8 M | 5.08 GB | 618.4 ± 4.9 ms/sample | 1.62 ± 0.01 sample/s |
| MPGCN | 6.6 K | 34.38 GB | 2789.9 ± 25.2 ms/sample | 0.36 ± 0.00 sample/s |
| Metro-GATF (Proposed) | 1.35 M | 10.72 GB | 316.4 ± 18.1 ms/sample | 3.17 ± 0.19 sample/s |
Training time and runtime statistics were measured on the same workstation setting used in the experiments.
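The latency and throughput columns of Table 8 appear to be related by throughput ≈ 1000 / latency (in ms). The short check below recomputes the implied throughput from the reported mean latencies; the dictionary names are illustrative, and the assumption that both columns come from the same timed runs is an inference from the numbers, not something stated in the text.

```python
# Reported mean latency (ms/sample) and throughput (sample/s) from Table 8.
latency_ms = {"GCN-LSTM": 200.0, "Autoformer": 128.7, "ODFormer": 618.4,
              "MPGCN": 2789.9, "Metro-GATF": 316.4}
reported = {"GCN-LSTM": 5.00, "Autoformer": 7.77, "ODFormer": 1.62,
            "MPGCN": 0.36, "Metro-GATF": 3.17}

# Throughput implied by the mean latency alone.
implied = {model: 1000.0 / ms for model, ms in latency_ms.items()}

for model in latency_ms:
    print(f"{model:<12} implied={implied[model]:.3f}  reported={reported[model]:.2f}")
```

Small gaps between implied and reported values are expected if throughput was averaged per run, since the mean of reciprocals is at least the reciprocal of the mean.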
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kim, S.H.; Jeong, H.J.; Shin, S.i.; Kwon, J.W. A Non-Autoregressive Spatiotemporal Framework for Offline Full-Matrix Origin–Destination Forecasting in Large-Scale Metro Networks. Appl. Sci. 2026, 16, 5333. https://doi.org/10.3390/app16115333