1. Introduction
Short-term speed prediction, which involves forecasting vehicle or traffic speeds over brief future intervals, is a cornerstone technology for modern Intelligent Transportation Systems (ITS) and the advancement of autonomous vehicles [1]. With the emergence of new formats such as front warehouses, community retail, local warehouses, and hourly delivery, the integration between logistics and the national economy has deepened, significantly increasing the demand of supply chains for real-time visibility and rapid responsiveness. Smart logistics, supported by information technology, control techniques, optimization methods, and artificial intelligence, aims to reduce costs and increase efficiency across the entire supply chain through order allocation, vehicle management, route planning, and signal optimization [2]. To maintain stable operations and quickly recover from disruptions, road networks require predictability. Traffic forecasting [3], a core capability, encompasses key quantities such as traffic states, road speeds, and travel times. Accurate prediction of states and speeds provides the foundation for platoon control, route guidance, and signal optimization. Travel time prediction serves as an early indicator for scheduling and coordination. In digital-twin-driven online simulations, forecasting further supports rolling evaluation and scenario selection. For safe and resilient operations, speed prediction also enables risk identification and early warning, allowing interventions in high-risk spatiotemporal segments to reduce accidents and delays, ultimately improving punctuality and network reliability.
However, short-term road speed forecasting at the minute level faces multiple challenges in real-world environments. The first challenge lies in the heterogeneity and noise at the data level. Vehicle operation data coexist with multi-source sensor data, where missing values, measurement errors, and irregular sampling are common. The spatial coverage of the sensing network is uneven (dense in central urban areas but sparse in suburban regions), resulting in coverage gaps and biased measurements. The second challenge is the complexity of spatiotemporal coupling. Traffic data simultaneously contain static structures and dynamic evolution, with prominent cross-scale dependencies and nonlinear interactions. Deep learning has advanced spatiotemporal prediction by learning expressive, data-driven representations. Recent graph and sequence models capture spatial diffusion and temporal dependencies, improving traffic flow and speed forecasting [4]. The third challenge is nonstationarity and distribution shift. Conventional neural prediction models map vectors to vectors and typically require retraining or extensive fine-tuning when exogenous or boundary conditions change. Demand fluctuations, incidents, weather conditions, and modifications to road networks and timetables occur frequently. These changes make models trained under previous conditions prone to mismatches in new scenarios, and the costs of maintenance and retraining remain high. Consequently, there is a need for modeling paradigms that can explicitly incorporate boundary changes at the input level while maintaining stable accuracy and reducing maintenance costs when scenarios change.
To address these challenges, this study proposes a short-term road speed forecasting framework that directly integrates logistics data with traffic prediction. To the best of our knowledge, no prior research has systematically mapped supply chain information (such as warehouse and customer locations or dynamic demand volumes) into traffic speed prediction while simultaneously applying Deep Operator Network (DeepONet) learning [5] to achieve cross-scenario transferability. We conduct an initial exploration in this direction by projecting logistics demand and warehouse allocation onto the road network, creating learnable boundary conditions, and then applying an operator-learning approach to map historical sequences and contextual information to next-step speed predictions. This provides a novel perspective for bridging supply chain systems and traffic systems. At the data level, we develop a unified data and evaluation pipeline that performs alignment, validity checks, anomaly removal, and feature standardization. We then split the data into training, validation, and test sets according to different scenarios, enabling us to evaluate the models’ robustness under diverse boundary combinations. At the modeling level, we adopt a branch–trunk architecture. The branch network encodes historical speed sequences of each link to capture short-term dynamics. The trunk network encodes contemporaneous exogenous and boundary states such as inflow, outflow, density, occupancy, waiting time, and travel time that represent congestion intensity and downstream constraints. Multiplicative coupling of the two networks creates a mapping from functions to functions, allowing boundary changes to enter the inference process through input variation. This approach maintains accuracy while reducing the need for retraining when scenarios change.
We constructed six Simulation of Urban MObility (SUMO) scenarios, labeled S001–S006, based on a five-kilometer urban subnetwork, using a time step of 60 s to generate link-level data. These scenarios are driven by the Solomon dataset and differ in random seeds, total trip volumes, and order–warehouse allocation strategies. These variations produce distinct Origin–Destination (OD) combinations, which characterize the paired relationships between origins and destinations, their intensities, and their temporal distributions. These scenarios also include order quantities, vehicle counts or trip numbers, departure times, and service time windows for each OD pair. Different OD combinations determine the spatial and temporal distributions of inflows and outflows across the road network, which in turn shape congestion patterns and boundary conditions, leading to varying levels of prediction difficulty and transfer challenges. For all scenarios, we extract speed, inflow, outflow, density, occupancy, waiting time, and travel time. Inputs are constructed from twelve-step historical speeds together with six contemporaneous contextual features, while the next-step speed serves as the supervisory signal. After validity checks and anomaly filtering, approximately 1.19 million edge–time samples remain. To evaluate cross-scenario transfer, we use S001–S004 for training and validation and reserve S005–S006 as unseen test sets. Within the visible scenarios, we apply an 80/20 temporal split to ensure leakage-free evaluation that encompasses a variety of boundary conditions. To quantify the benefits of the proposed approach in modeling nonlinearities and history–context interactions, we systematically compare it with Ridge regression, Multilayer Perceptrons (MLP), Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCNs), Transformers, and Graph Neural Networks (GNNs). 
We further conduct ablation studies to verify the necessity of trunk-side exogenous variables and perform counterfactual perturbations of these variables to illustrate the model’s sensitivity and robustness to congestion transitions. Results demonstrate that the proposed design mitigates feature bias caused by heterogeneous and noisy data, improves adaptability to distribution shifts, and enhances the representation of complex spatiotemporal interactions. Challenges such as missing-data handling, explicit spatial coupling, and uncertainty quantification are discussed in the limitations and reserved for future research.
The contributions of this paper are summarized as follows:
This study constructs a unified logistics–traffic dataset by integrating Solomon demand data with SUMO-generated link-level states, producing approximately 1.2 million edge–time samples across six distinct scenarios. This dataset provides a reproducible foundation for cross-scene forecasting research.
We propose a DeepONet-based framework that decouples historical speeds, processed through a branch network, from contemporaneous exogenous and boundary states, processed through a trunk network. This approach enables boundary changes to be incorporated as functional inputs, eliminating the need for frequent retraining.
This paper systematically compares the strengths and weaknesses of the proposed method against classic and state-of-the-art models across three distinct datasets. While the results indicate that DeepONet does not outperform every baseline in every aspect, the comprehensive evaluation shows that it achieves the strongest overall performance, particularly in terms of generalization and robustness. These comparative insights provide a reference for future research in selecting appropriate modeling paradigms for complex traffic scenarios.
The remainder of the paper is organized as follows:
Section 2 reviews related work in logistics forecasting, traffic prediction, and operator learning.
Section 3 introduces the background and formal problem statement.
Section 4 describes the data design, feature construction, and the DeepONet architecture along with training protocols.
Section 5 presents the experimental results, diagnostics, and ablation studies, followed by a discussion on deployment implications.
Section 6 concludes the paper and outlines limitations and future directions.
4. Methodology
As illustrated in Figure 1, our methodology follows a three-stage pipeline: (i) data and scenario construction where Solomon demand instances are projected and simulated on a 5 km SUMO subnetwork to produce link-level edge states; (ii) feature engineering and dataset assembly that aligns, filters, and standardizes twelve-step speed histories together with contemporaneous exogenous and boundary covariates; and (iii) model learning and diagnostics using a branch–trunk Deep Operator Network that decouples short-term histories from contextual boundary inputs, followed by systematic cross-scene evaluation, ablations, and counterfactual perturbations.
4.1. Solomon Dataset as the Demand Prior
We ground the demand layer in the classical Solomon vehicle routing problem with time windows benchmarks [33]. The suite contains 56 instances with 100 customers, organized into six classes (C1, C2, R1, R2, RC1, RC2), where C/R/RC denote clustered, random, and mixed spatial layouts, and the “1” vs. “2” suffix reflects tighter vs. looser time windows, often implying a shorter vs. longer planning horizon. Each instance places 100 customers on a planar grid and follows a common schema: node index $i$, coordinates $(x_i, y_i)$, demand $q_i$, ready time $e_i$, due date $l_i$, and service duration $s_i$; the depot is node 0. File headers specify the fleet-size limit $K$ and vehicle capacity $Q$. These fields map directly to our SUMO pipeline: coordinates are projected to the network coordinate reference system and snapped to the nearest nodes and edges; depot identifiers anchor origins; time windows drive release and service scheduling to produce temporally consistent OD flows; and demands determine vehicle loading and trip counts. We use Solomon because its controlled spatial patterns and time-window tightness create diverse routing pressures and post-assignment congestion, which is essential for stress-testing forecasting models under heterogeneous boundary conditions.
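The field mapping above can be illustrated with a small parser. This is a minimal sketch, assuming the common plain-text layout of Solomon instance files (a VEHICLE header followed by seven-column CUSTOMER rows). The `Customer` class, the `parse_solomon` helper, and every numeric value in `SAMPLE` are hypothetical and serve only as illustration; they are not the pipeline or data used in this study.

```python
from dataclasses import dataclass

# Illustrative text in the usual Solomon layout; all numbers are invented.
SAMPLE = """\
EXAMPLE

VEHICLE
NUMBER     CAPACITY
  10         100

CUSTOMER
CUST NO.  XCOORD.  YCOORD.  DEMAND  READY TIME  DUE DATE  SERVICE TIME

    0      50       50        0        0        1000        0
    1      20       30       10      100         200       10
    2      70       80       25      300         450       10
"""


@dataclass(frozen=True)
class Customer:
    index: int           # node index i (0 is the depot)
    x: float             # planar coordinate
    y: float             # planar coordinate
    demand: float        # demand quantity
    ready_time: float    # start of the service time window
    due_date: float      # end of the service time window
    service_time: float  # service duration


def _is_customer_row(tokens):
    """A customer row has exactly seven numeric columns."""
    if len(tokens) != 7:
        return False
    try:
        [float(tok) for tok in tokens]
    except ValueError:
        return False
    return True


def parse_solomon(text):
    """Return (fleet limit K, vehicle capacity Q, list of Customer)."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    fleet_limit = capacity = None
    customers = []
    for i, tokens in enumerate(lines):
        if tokens[0] == "NUMBER":                 # header row above "K  Q"
            fleet_limit = int(lines[i + 1][0])
            capacity = int(lines[i + 1][1])
        elif _is_customer_row(tokens):
            customers.append(Customer(int(tokens[0]), *map(float, tokens[1:])))
    return fleet_limit, capacity, customers


K, Q, customers = parse_solomon(SAMPLE)
depot = customers[0]
```

In a pipeline like the one described, each `Customer` would then be projected to the network coordinate reference system and snapped to the nearest edge; that step depends on the network and is not shown.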
4.2. Simulation Environment and Dataset Construction
We consider an urban subnetwork of approximately 5 km imported into SUMO, and instantiate six scenarios S001–S006 that vary random seeds and trip loads to diversify demand [10]. Beyond the static network, each scenario is parameterized by logistics demand and supply. Customer requests and depot locations shape OD patterns and temporal loading, which in turn drive the edge states observed during simulation. We ingest (i) customer planar coordinates $(x_i, y_i)$, which are projected to the network coordinate reference system, (ii) demand quantity with units or weight, (iii) requested service time windows $[e_i, l_i]$, and (iv) depot or warehouse identifiers and coordinates. Orders are snapped to nearest edges and nodes and grouped into time buckets to form OD flows or discrete trips consistent with their time windows and depot assignments.
Given the OD specification, SUMO produces vehicle- and edge-level traces: (i) per-vehicle routes and traversed edge sequences, and, if needed, per-time-step positions; (ii) per-interval edge aggregates, including speed, entered, left, density, occupancy, waitingtime, and traveltime; and (iii) per-vehicle summaries. These outputs connect the logistics side (who, when, from which depot to which customer, and with how much load) to the traffic side (which edges are used, and with what speeds and queues). This integration enables supervised learning on edge dynamics under realistic boundary conditions.
Table 1 summarizes the data sources and their roles in linking logistics demand with traffic states.
From the edge data of each scenario, we extract per-edge, per-interval measurements including speed, entered, left, density, occupancy, waiting time, and travel time. To rigorously evaluate the contribution of spatial information, we construct two distinct feature sets:
Baseline Dataset without Spatial Features: This configuration focuses on temporal dynamics and local boundary conditions. The input vector concatenates 12 speed lags ($u_1, \dots, u_{12}$) and 6 contemporaneous covariates (density, occupancy, etc.) of the target edge itself, yielding an 18-dimensional input vector. This serves as the primary dataset for benchmarking temporal sequence models.
Spatial Dataset with Spatial Features: To capture network-level dependencies, we augment the baseline features with upstream and downstream context. For each target edge, we identify its immediate predecessor and successor links and append their mean speed and density to the input vector. This increases the input dimensionality to 23, allowing models to explicitly learn from spatial propagation effects.
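The baseline feature construction can be sketched as a sliding window over one edge's time series. This is an illustrative sketch only: the helper name `build_samples` is hypothetical, and the indexing convention (context taken at the target interval, lags taken strictly before it) is an assumption consistent with excluding the current speed from the inputs.

```python
def build_samples(speeds, contexts, n_lags=12):
    """Build (features, targets) for one edge.

    speeds   : list of per-interval speeds.
    contexts : list of 6-element covariate lists, aligned with speeds.
    Each feature vector is [lag 1, ..., lag n_lags] + context, so it has
    n_lags + 6 entries; the target is the speed at the same interval as
    the context, which never appears among the inputs.
    """
    features, targets = [], []
    for t in range(n_lags, len(speeds)):
        lags = [speeds[t - k] for k in range(1, n_lags + 1)]  # most recent first
        features.append(lags + list(contexts[t]))
        targets.append(speeds[t])
    return features, targets


# Toy series: 20 intervals with a dummy 6-d context at every step.
speeds = [float(v) for v in range(20)]
contexts = [[0.0] * 6 for _ in range(20)]
X, y = build_samples(speeds, contexts)
```

With 20 intervals and 12 lags, the first usable target is at index 12, so eight samples are produced, each with 18 inputs.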
We form supervised pairs $(\mathbf{x}, y)$ using these feature sets, with the scalar target being the next-step speed $y$. The combined raw simulation output contains approximately 23.3 million edge–time rows before filtering. To reduce artifacts, we retain rows satisfying validity checks, including nonnegative counts and finite speeds [10]. Because traffic in the 5 km subnetwork is sparse, a significant portion (approximately 95%) of these samples represents zero-speed or empty-road conditions, which provide limited supervisory signal for learning congestion dynamics. To focus the model on active traffic states, we filtered out these zero-value samples, resulting in a final dataset of approximately 1.19 million samples. This filtering ensures that model training is driven by meaningful traffic interactions rather than the dominant background of empty roads, and it concentrates evaluation on informative congestion dynamics. We inspected marginal speed distributions before and after filtering and found that the qualitative ordering of model performance is unchanged; including the full raw set reduces sensitivity to congestion regimes but does not alter the main comparative conclusions reported here. We exclude the current speed at time $t$ from contemporaneous features to avoid leakage; only lagged speeds are used in inputs. Standardization is fit on training scenarios and applied to validation and test to prevent target or covariate leakage [34]. We split by scenario: S001–S004 supply training and validation, with an 80/20 temporal split within each seen scene, and S005–S006 form the test set. The resulting sizes are
,
,
.
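The filtering, scenario-wise split, and training-only standardization described above can be sketched as follows. The row layout (dictionaries with `scenario`, `t`, `speed`, `entered`, `left` keys) and all helper names are assumptions made for illustration; the toy rows are synthetic and unrelated to the actual dataset.

```python
import math

SEEN = ("S001", "S002", "S003", "S004")   # training/validation scenarios
UNSEEN = ("S005", "S006")                 # held-out test scenarios


def is_valid(row):
    """Keep rows with finite, nonzero speed and nonnegative counts."""
    return (math.isfinite(row["speed"]) and row["speed"] > 0.0
            and row["entered"] >= 0 and row["left"] >= 0)


def split_by_scenario(rows, train_frac=0.8):
    """Temporal 80/20 split inside each seen scenario; unseen ones form the test set."""
    rows = [r for r in rows if is_valid(r)]
    train, val = [], []
    for name in SEEN:
        ordered = sorted((r for r in rows if r["scenario"] == name),
                         key=lambda r: r["t"])
        cut = int(len(ordered) * train_frac)   # earliest part goes to training
        train.extend(ordered[:cut])
        val.extend(ordered[cut:])
    test = [r for r in rows if r["scenario"] in UNSEEN]
    return train, val, test


def fit_standardizer(values):
    """Mean and standard deviation estimated from training values only."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, (std if std > 0 else 1.0)


def standardize(values, mean, std):
    return [(v - mean) / std for v in values]


# Toy data: ten intervals per scenario; the zero-speed row at t=0 is filtered out.
rows = [{"scenario": s, "t": t, "speed": float(t), "entered": 1, "left": 1}
        for s in SEEN + UNSEEN for t in range(10)]
train, val, test = split_by_scenario(rows)
mean, std = fit_standardizer([r["speed"] for r in train])
val_speed = standardize([r["speed"] for r in val], mean, std)
```

Because the statistics come from the training rows alone, validation and test values can fall outside the standardized training range, which is expected for unseen scenarios.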
4.3. Real-World Dataset
To validate the generalization capability of our framework beyond simulation, we utilize the METR-LA benchmark dataset [35], a widely used reference in traffic forecasting. This dataset collects traffic speed readings from 207 loop detectors on the highways of Los Angeles County, spanning a period of 4 months from 1 March 2012 to 30 June 2012.
Unlike the link-level simulation data, METR-LA provides graph-structured data where sensors are nodes in a network. The adjacency matrix is pre-computed based on the driving distance between sensors, using a Gaussian kernel thresholded to retain only strong connections. The data are aggregated to 5-min intervals, matching the typical control horizon of ITS applications. We use the standard chronological split of 70% training, 10% validation, and 20% testing. This dataset introduces real-world complexities such as sensor noise, missing values, and non-recurrent congestion events, providing a rigorous testbed for evaluating model robustness in complex, nonlinear topologies.
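As an illustration of the two preprocessing steps just mentioned, the sketch below builds a thresholded Gaussian-kernel adjacency from a distance matrix and computes a 70/10/20 chronological split. The kernel width (standard deviation of the finite distances) and the threshold value of 0.1 are assumptions borrowed from common practice, not values stated in this paper, and both function names are hypothetical.

```python
import math


def gaussian_adjacency(dist, threshold=0.1):
    """Thresholded Gaussian kernel: w_ij = exp(-(d_ij / sigma)^2) if >= threshold.

    dist is a square list of lists of driving distances; math.inf marks
    unreachable pairs. sigma is the standard deviation of finite distances.
    """
    finite = [d for row in dist for d in row if math.isfinite(d)]
    mean = sum(finite) / len(finite)
    sigma = math.sqrt(sum((d - mean) ** 2 for d in finite) / len(finite))
    n = len(dist)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if math.isfinite(dist[i][j]):
                weight = math.exp(-((dist[i][j] / sigma) ** 2))
                if weight >= threshold:      # drop weak connections
                    adj[i][j] = weight
    return adj


def chronological_split(n_steps, train_pct=70, val_pct=10):
    """Index ranges for train/validation/test in time order (integer arithmetic)."""
    a = n_steps * train_pct // 100
    b = n_steps * (train_pct + val_pct) // 100
    return range(0, a), range(a, b), range(b, n_steps)


# Toy 3-sensor example: sensors 0 and 1 are close, sensor 2 is far away.
dist = [[0.0, 1.0, 10.0],
        [1.0, 0.0, math.inf],
        [10.0, math.inf, 0.0]]
adj = gaussian_adjacency(dist)
train_idx, val_idx, test_idx = chronological_split(100)
```

Integer percentages are used in the split to avoid floating-point truncation at the boundaries.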
4.4. Baseline and Comparative Models
We compare (i) naïve persistence, defined as $\hat{y} = u_1$, where $u_1$ is the most recent speed lag [36]; (ii) Ridge regression, an L2-regularized linear model applied to the 18-dimensional input [37]; (iii) MLP operating on the same 18-dimensional input, a choice supported by modern universal-approximation results [38]; (iv) LSTM, which utilizes the same 12-step window [39]; (v) TCN, employing dilated causal convolutions on the same 12-step window [40]; (vi) Transformer, which incorporates self-attention mechanisms for time-series forecasting [41]; (vii) GNN, or Graph Neural Network, which explicitly models spatial dependencies via graph convolutions [42]. Unless noted, all models use identical splits and early stopping on the validation set [43]. All baselines consume the same feature set defined above to ensure parity.
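Ridge: We fit a linear model on the standardized input $\mathbf{x} \in \mathbb{R}^{18}$:
$$\hat{y} = \mathbf{w}^{\top}\mathbf{x} + w_0, \qquad (\mathbf{w}, w_0) = \arg\min_{\mathbf{w},\, w_0} \sum_{i=1}^{N} \left(y_i - \mathbf{w}^{\top}\mathbf{x}_i - w_0\right)^2 + \lambda \lVert \mathbf{w} \rVert_2^2,$$
with features standardized using training statistics and intercept $w_0$. The regularization strength $\lambda$ is selected on a log-grid. Ridge offers a strong linear baseline with high inference throughput.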
MLP: Two hidden layers of width 256 with Rectified Linear Unit (ReLU) activation and dropout, trained with the Adaptive Moment Estimation (Adam) optimizer, batch size 8192, for up to 30 epochs with early stopping on validation; the dropout rate and learning rate are those listed in Table 2.
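LSTM: We form a sequence $\{\mathbf{z}_k\}_{k=1}^{12}$ where each step uses the $k$-th speed lag $u_k$ and the same exogenous context $\mathbf{c}$: $\mathbf{z}_k = [u_k; \mathbf{c}] \in \mathbb{R}^{7}$, yielding an input tensor $\mathbf{Z} \in \mathbb{R}^{B \times 12 \times 7}$ for a batch of size $B$. A single-layer LSTM (hidden size 128, dropout 0.1) processes the sequence; the last hidden state feeds a linear head to predict $\hat{y}$. Optimizer: Adam with the learning rate listed in Table 2, batch size 8192, 30 epochs, and early stopping.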
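The per-step input used by the sequence models (one speed lag plus the shared six-element context) can be assembled as in the sketch below. The helper name `to_sequence` is hypothetical, and feeding the oldest lag first is an assumption about ordering that the text does not specify.

```python
def to_sequence(lags, context):
    """Turn 12 speed lags (lag 1 first) and a 6-d context into a 12 x 7 sequence.

    Every time step carries its own lagged speed followed by the same
    context vector; steps are ordered from oldest to most recent.
    """
    return [[lag] + list(context) for lag in reversed(lags)]


# Toy batch of two samples.
lags_a = [float(k) for k in range(1, 13)]      # lag 1 ... lag 12
lags_b = [10.0] * 12
context = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
batch = [to_sequence(lags_a, context), to_sequence(lags_b, context)]

shape = (len(batch), len(batch[0]), len(batch[0][0]))
```

The resulting nested list has shape (batch, 12, 7) and could be converted to a tensor by whichever deep learning library is in use.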
TCN: We use a causal Temporal Convolutional Network on the same sequence: four residual blocks with dilated convolutions, kernel size 3, 64 channels, and dropout (the rate is listed in Table 2); causal padding prevents leakage. The receptive field exceeds 12 steps and therefore covers the full window. The block output is global-pooled and passed to a linear head. Optimizer and early stopping are applied as described above.
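The receptive-field claim can be checked with a short calculation. The sketch below assumes the dilations double from block to block (1, 2, 4, 8), which is a common choice but is not confirmed by the text, and it treats the number of convolutions per residual block as a parameter because that detail is also unspecified.

```python
def receptive_field(kernel_size, dilations, convs_per_block=1):
    """Receptive field (in time steps) of stacked dilated causal convolutions.

    Each convolution with dilation d extends the field by (kernel_size - 1) * d.
    """
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)


ASSUMED_DILATIONS = (1, 2, 4, 8)   # hypothetical; four blocks as in the text

rf_single = receptive_field(3, ASSUMED_DILATIONS, convs_per_block=1)
rf_double = receptive_field(3, ASSUMED_DILATIONS, convs_per_block=2)
covers_window = rf_single > 12
```

Under these assumptions the field spans 31 steps with one convolution per block and 61 with two, so the 12-step window is covered in either case.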
Transformer: We employ a standard Transformer encoder architecture adapted for time-series forecasting. The model consists of 2 encoder layers with 4 attention heads, a model dimension of 64, and a feed-forward dimension of 256. Positional encodings are added to the input sequence to retain temporal order information.
GNN: We utilize a GNN to capture spatial dependencies. For the simulation dataset, the graph is constructed based on physical connectivity, specifically upstream and downstream links. For the METR-LA dataset, we use the predefined sensor adjacency matrix. The model consists of two Graph Convolutional Network (GCN) layers with 64 hidden units followed by a fully connected output layer.
To ensure a fair comparison, we performed a grid search over the hyperparameters of each model using the validation set. The search space covered the learning rate, batch size, and dropout rate. The final hyperparameters selected for the reported experiments are summarized in Table 2.
Table 3 summarizes the implementation-level architectural choices used for each model. In our implementation, DeepONet uses branch and trunk MLPs, each formed by two 256-unit hidden layers that project to a latent embedding of dimension $p$. When spatial features are included, the trunk input expands from 6 to 10 (adding upstream and downstream speed and density). Learning rate, batch size, and dropout were tuned via the validation grid search described above; the DeepONet latent dimension $p$ was kept at its default value for the reported experiments. The configurations listed here correspond to the concrete implementations used in the ablation and comparative evaluations reported below.
4.5. Operator-Learning Model
We model the one-step map from an edge’s recent speed history and its contemporaneous context to the next-step speed as a neural operator acting on two inputs: the 12-step lag vector $\mathbf{u} \in \mathbb{R}^{12}$ and the 6-d context $\mathbf{c} \in \mathbb{R}^{6}$. Let $b_\theta(\mathbf{u}) \in \mathbb{R}^{p}$ and $\tau_\phi(\mathbf{c}) \in \mathbb{R}^{p}$ be branch and trunk embeddings. The prediction is their inner product in a $p$-dimensional latent space:
$$\hat{y} = \langle b_\theta(\mathbf{u}), \tau_\phi(\mathbf{c}) \rangle = \sum_{k=1}^{p} b_{\theta,k}(\mathbf{u})\, \tau_{\phi,k}(\mathbf{c}), \tag{9}$$
which realizes a low-rank factorization of the operator from $(\mathbf{u}, \mathbf{c})$ to $y$ [22]. The overall architecture is illustrated in Figure 2. Architecturally, both branch and trunk are MLPs with hidden width 256, dropout, and linear $p$-dimensional projections, with $p$ kept at its default value (Table 3). Optimization uses Adam with the learning rate listed in Table 2, batch size 1024, and up to 50 epochs with early stopping on the validation set. All features are standardized using training statistics, and train/validation/test splits, random seeds, and library versions are fixed for reproducibility.
The factorized form (9) decouples temporal history from exogenous conditions and enables counterfactual analyses without retraining: varying the context $\mathbf{c}$, for instance by perturbing entered or density, changes the trunk embedding $\tau_\phi(\mathbf{c})$ while keeping the branch embedding $b_\theta(\mathbf{u})$ fixed, thus isolating the effect of boundary and context signals on the prediction $\hat{y}$. This branch–trunk inner-product realization matches the DeepONet formulation for operator learning [21], so we henceforth refer to our model as DeepONet.
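The inner-product prediction and the counterfactual use of the trunk input can be illustrated with a toy forward pass. This is a sketch under stated assumptions, not the trained model: the weights are random, the layer sizes (16 hidden units, latent dimension 8) are deliberately tiny stand-ins for the 256-unit networks, and the position of the density feature in the context vector is hypothetical.

```python
import random


def make_mlp(sizes, rng):
    """Random weights for a small MLP; sizes = [n_in, hidden, ..., n_out]."""
    return [([[rng.uniform(-0.3, 0.3) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]


def mlp_forward(layers, x):
    """ReLU on hidden layers, plain linear projection on the last layer."""
    for idx, (weights, bias) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


def deeponet_predict(branch, trunk, history, context):
    """Inner product of the branch and trunk embeddings."""
    b = mlp_forward(branch, history)
    tau = mlp_forward(trunk, context)
    return sum(bk * tk for bk, tk in zip(b, tau))


rng = random.Random(0)
branch = make_mlp([12, 16, 8], rng)    # 12 speed lags -> latent dimension 8
trunk = make_mlp([6, 16, 8], rng)      # 6 context features -> latent dimension 8

history = [0.1 * k for k in range(12)]
context = [0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
perturbed = list(context)
perturbed[2] += 1.0                    # hypothetical "density" slot, +1 std

branch_embedding = mlp_forward(branch, history)   # computed once, reused
y_base = deeponet_predict(branch, trunk, history, context)
y_cf = deeponet_predict(branch, trunk, history, perturbed)
```

Only the trunk embedding is recomputed for the perturbed context; the branch embedding is identical in both predictions, which is what makes the comparison between `y_base` and `y_cf` a clean counterfactual.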
The theoretical advantage of this operator learning formulation lies in its alignment with the physical nature of traffic flow. Traffic dynamics are fundamentally governed by partial differential equations, where the system state evolves as a function of time and space subject to boundary conditions. Standard deep learning models approximate a finite-dimensional mapping $f: \mathbb{R}^{n} \to \mathbb{R}^{m}$, effectively memorizing point-to-point correlations. In contrast, DeepONet approximates the continuous solution operator that maps the space of input functions and parameter functions to the solution space. By explicitly separating the encoding of history and context, the model learns a basis expansion of the solution operator, where the Trunk network identifies the basis functions of the traffic regimes and the Branch network computes the coefficients based on the input state. This mechanism enables robust generalization to unseen scenarios, as the model learns the underlying physical laws governing the transition between states rather than just the statistical distribution of the training data.
To further clarify the training and inference process, Algorithm 1 details the DeepONet procedure for traffic speed forecasting.
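Algorithm 1 DeepONet Training and Inference for Traffic Speed Forecasting
Require: Historical speed sequence $\mathbf{u}$, context vector $\mathbf{c}$, target speed $y$
Ensure: Trained Branch network $b_\theta$, Trunk network $\tau_\phi$
1: Initialize: parameters $\theta, \phi$ for Branch and Trunk networks
2: Hyperparameters: learning rate $\eta$, batch size $B$, latent dim $p$
3: while not converged do
4:   Sample batch of $B$ pairs $\{((\mathbf{u}_i, \mathbf{c}_i), y_i)\}_{i=1}^{B}$ from training set
5:   for $i = 1$ to $B$ do
6:     Compute Branch embedding: $\mathbf{b}_i = b_\theta(\mathbf{u}_i)$
7:     Compute Trunk embedding: $\boldsymbol{\tau}_i = \tau_\phi(\mathbf{c}_i)$
8:     Predict speed: $\hat{y}_i = \langle \mathbf{b}_i, \boldsymbol{\tau}_i \rangle$
9:     Compute loss: $\mathcal{L}_i = \ell(\hat{y}_i, y_i)$
10:   end for
11:   Update $\theta, \phi$ via Adam optimizer to minimize the batch loss $\mathcal{L}$
12: end while
13: Inference: Given new history $\mathbf{u}^{*}$ and context $\mathbf{c}^{*}$, predict $\hat{y}^{*} = \langle b_\theta(\mathbf{u}^{*}), \tau_\phi(\mathbf{c}^{*}) \rangle$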
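The loop in Algorithm 1 can be made concrete with a deliberately simplified toy version. The sketch below departs from the described setup in three labeled ways: branch and trunk are single linear maps instead of MLPs, the loss is assumed to be mean squared error, and plain full-batch gradient descent replaces Adam so that the gradients can be written by hand. All names and sizes are hypothetical.

```python
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def predict(Wb, Wt, u, c):
    """Inner product of linear branch and trunk embeddings."""
    return sum(b * t for b, t in zip(matvec(Wb, u), matvec(Wt, c)))


def mse(Wb, Wt, data):
    return sum((predict(Wb, Wt, u, c) - y) ** 2 for u, c, y in data) / len(data)


def train(data, n_u, n_c, p=2, lr=0.01, epochs=300, seed=0):
    """Full-batch gradient descent on the squared error (not Adam)."""
    rng = random.Random(seed)
    Wb = [[rng.uniform(-0.5, 0.5) for _ in range(n_u)] for _ in range(p)]
    Wt = [[rng.uniform(-0.5, 0.5) for _ in range(n_c)] for _ in range(p)]
    losses = [mse(Wb, Wt, data)]
    for _ in range(epochs):
        gb = [[0.0] * n_u for _ in range(p)]
        gt = [[0.0] * n_c for _ in range(p)]
        for u, c, y in data:
            b, tau = matvec(Wb, u), matvec(Wt, c)
            y_hat = sum(bk * tk for bk, tk in zip(b, tau))
            err = 2.0 * (y_hat - y) / len(data)
            for k in range(p):
                for j in range(n_u):
                    gb[k][j] += err * tau[k] * u[j]   # d y_hat / d Wb[k][j]
                for j in range(n_c):
                    gt[k][j] += err * b[k] * c[j]     # d y_hat / d Wt[k][j]
        for k in range(p):
            for j in range(n_u):
                Wb[k][j] -= lr * gb[k][j]
            for j in range(n_c):
                Wt[k][j] -= lr * gt[k][j]
        losses.append(mse(Wb, Wt, data))
    return Wb, Wt, losses


# Synthetic targets from a hidden "teacher" with the same bilinear form.
rng = random.Random(1)
true_Wb = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
true_Wt = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
data = []
for _ in range(40):
    u = [rng.uniform(-1, 1) for _ in range(3)]
    c = [rng.uniform(-1, 1) for _ in range(2)]
    data.append((u, c, predict(true_Wb, true_Wt, u, c)))

Wb, Wt, losses = train(data, n_u=3, n_c=2)
```

Inference after training is a single call to `predict` with a new history and context, mirroring the last step of the algorithm.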
4.6. Evaluation
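We report Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination $R^2$:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2},$$
where $y_i$ and $\hat{y}_i$ are the observed and predicted speeds and $\bar{y}$ is the mean observed speed over the $N$ evaluation samples. In brief, MAE reports the average absolute deviation in km/h and is relatively robust to outliers; RMSE reports the quadratic mean error and emphasizes large deviations, which is desirable when significant mistakes are particularly costly; and $R^2$ reports the proportion of variance explained relative to a mean-only baseline and can be negative if the model underperforms that baseline. Reporting de-standardized MAE and RMSE in km/h enables operational interpretation, while $R^2$ facilitates scale-free comparison across scenes. All metrics are computed on de-standardized speeds (km/h) [34].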
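A minimal sketch of the three metrics, including the de-standardization step, is given below. The function names are hypothetical and the numbers are toy values chosen so that the results are easy to verify by hand.

```python
import math


def destandardize(values, mean, std):
    """Map standardized predictions back to physical units."""
    return [v * std + mean for v in values]


def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def r2(y, y_hat):
    """1 - SS_res / SS_tot; negative when worse than predicting the mean."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))

r2_mean_only = r2(y_true, [2.0, 2.0, 2.0])   # mean-only baseline
r2_reversed = r2(y_true, [3.0, 2.0, 1.0])    # worse than the mean
```

The last two lines illustrate the statement above: a mean-only predictor gives an $R^2$ of zero, and a predictor worse than the mean gives a negative value.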
5. Experimental Results and Discussion
5.1. Overall Performance Comparison
Table 4 presents a comprehensive comparison of model performance across three experimental modules: (1) The SUMO Baseline, which uses temporal features only; (2) The SUMO Spatial module, which is enhanced with upstream and downstream features; and (3) The Real-World Validation using the METR-LA dataset.
In general, we observe distinct performance patterns across the three scenarios. In the linear simulation (Modules 1 and 2), temporal sequence models like LSTM and DeepONet dominate, as the system dynamics are primarily driven by local history and boundary conditions. Conversely, in the complex METR-LA network (Module 3), the advantage shifts towards architectures capable of modeling high-dimensional spatial interactions, where DeepONet and Transformer perform competitively with recent state-of-the-art models. Notably, GNNs show a significant performance jump from simulation to real-world, validating their dependency on rich graph structures. While DeepONet is competitive in the simpler simulation tasks, its true strength lies in its robustness and scalability to complex, real-world topologies, where it outperforms traditional baselines by a wide margin.
5.2. Baseline Simulation Experiments
Implementation Details
To establish a performance benchmark, we first evaluated the models on the standardized SUMO dataset without explicit spatial topology features. The input vector consisted of 19 dimensions, capturing the local temporal history from Lags 1 to 12 and instantaneous traffic variables such as density and occupancy of the target edge.
As shown in Module 1 of Table 4, the LSTM model achieved the highest accuracy with an $R^2$ score of 0.8188, slightly outperforming the Transformer and DeepONet, which achieved $R^2$ scores of 0.8152 and 0.8122, respectively. The MLP and TCN models followed with $R^2$ scores of 0.7975 and 0.7905. The linear Ridge baseline lagged significantly behind with an $R^2$ of 0.4631, confirming the nonlinear nature of the traffic dynamics. These results indicate that for a single road segment in a controlled simulation environment, the temporal autocorrelation is the dominant predictive factor. The strong performance of LSTM, Transformer and DeepONet suggests that capturing sequence dependencies and operator-level mappings provides an advantage even in this baseline setting. Furthermore, DeepONet’s performance is comparable to the specialized LSTM, demonstrating that the branch–trunk architecture effectively encodes the temporal inertia through the branch network without requiring recurrent computation.
5.3. Spatial Feature Analysis
To examine whether omitting spatial correlations limits performance, we extended the feature space to include upstream and downstream dependencies. We constructed a “Spatial” dataset, referred to as Module 2, where the input dimension was increased to 23 by appending the mean speed and density of the adjacent upstream and downstream links.
Counter-intuitively, the inclusion of these local spatial features did not improve performance in the simulation environment; in fact, we observed a decrease in $R^2$ across all models, where DeepONet and MLP scored 0.7473 and 0.7031, respectively, compared with 0.8122 and 0.7975 in the baseline setting. We attribute this to two factors:
Topology Simplicity: The simulation utilizes a linear 5 km corridor where upstream conditions are highly collinear with the local temporal history; for instance, the upstream speed provides similar information to the target edge’s own recent speed lags.
Noise Introduction: In the microscopic simulation, short-term fluctuations in adjacent links (due to individual driver behavior) may introduce stochastic noise that outweighs their predictive signal for the aggregated 5 min interval.
This result supports a critical physical interpretation: in the DeepONet framework, the boundary conditions, such as flow entering and leaving, serve as the interface for wave propagation. In the one-dimensional Lighthill–Whitham–Richards (LWR) traffic flow model, congestion waves propagate through the boundaries. By learning the operator that maps these boundary functions to the internal state, DeepONet implicitly learns the wave propagation physics. The fact that explicit spatial features did not improve performance suggests that for this linear topology, the temporal dynamics and boundary conditions were indeed sufficient to capture these effects.
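For reference, the LWR model invoked above is usually written as a scalar conservation law for vehicle density. The notation below ($\rho$ for density, $q$ for flow, $v(\rho)$ for the speed–density relation, $c(\rho)$ for the kinematic wave speed) is standard textbook notation rather than something defined in this paper, and relating it to the trunk inputs is an interpretive sketch.

```latex
\frac{\partial \rho(x,t)}{\partial t} + \frac{\partial q\big(\rho(x,t)\big)}{\partial x} = 0,
\qquad q(\rho) = \rho\, v(\rho),
\qquad c(\rho) = \frac{\mathrm{d} q}{\mathrm{d} \rho}
```

Density disturbances travel along a link at speed $c(\rho)$, forward in light traffic and backward in congestion. Under this reading, the entered and left counts of a link act as its upstream and downstream boundary flows, which is why they are natural candidates for the context supplied to the trunk network.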
However, this negative result is scientifically valuable: it suggests that DeepONet’s operator learning capability can extract the relevant information from temporal dynamics alone, without relying on explicit spatial feature engineering in simple topologies. In this module, LSTM achieved the highest performance with an $R^2$ of 0.7483, closely followed by DeepONet with 0.7473, both outperforming the Transformer ($R^2$ of 0.7310) and GNN ($R^2$ of 0.7166). This reinforces the finding that sequence modeling and operator mapping are more effective than graph-based methods for this specific linear topology. The lower performance of GNN here, with an $R^2$ of approximately 0.72, highlights a limitation of graph convolutions in sparse, linear structures where message passing offers little advantage over direct temporal modeling.
Figure 3 provides a deeper robustness analysis, showing that while GNN performance degrades significantly in unseen scenarios (Figure 3a) and in high-density regimes (Figure 3b), DeepONet maintains stable low error rates, confirming its superior generalization capabilities.
5.4. Real-World Validation
To validate the proposed approach on a complex, nonlinear network, we applied the models to the METR-LA benchmark dataset. Unlike the simulation, this dataset involves a graph of 207 sensors with complex spatial dependencies.
Here, the advantages of advanced architectures became evident. DeepONet achieved top-tier performance with an $R^2$ of 0.9172, significantly outperforming the MLP baseline with an $R^2$ of 0.8791 and surpassing the standard GNN baseline, which reached 0.8952. The Transformer also performed exceptionally well at 0.9137. DeepONet’s superior performance suggests it can capture propagation effects effectively even without explicit graph convolution layers, likely by learning the high-dimensional mapping of the system’s state.
Figure 4 visualizes this performance gap through parity plots, where DeepONet shows significantly tighter clustering around the diagonal compared to MLP and GNN, particularly in the high-speed free-flow regime.
This result confirms that while simple temporal models suffice for linear simulations, DeepONet and Transformer architectures are better suited to capturing the complex, high-dimensional spatiotemporal dynamics of real-world traffic networks. The performance gap between DeepONet/Transformer and MLP on real data, approximately 0.04 in $R^2$, supports the adoption of operator learning frameworks for practical ITS applications.
It is worth noting that the training times for MLP and LSTM in this module of approximately 4 to 6 s are significantly shorter than in the simulation experiments. This is attributed to the smaller dataset size, 34 k samples compared to 1.2 million, and the rapid convergence of these baselines, which triggered early stopping around epoch 15. Additionally, the LSTM implementation utilized a vectorized input structure to maximize GPU parallelism, avoiding the high computational cost of sequential unrolling.
The contrast in GNN performance between the simulation experiments in Module 2 and the real-world experiments in Module 3 is particularly illuminating. In the sparse, linear simulation topology, GNNs struggled with an $R^2$ of approximately 0.72, as the graph structure provided limited connectivity for effective message passing. However, in the dense, interconnected METR-LA graph, GNNs thrived, achieving an $R^2$ of 0.8952, validating their design for graph-structured data. Crucially, DeepONet performed consistently well across both regimes, demonstrating a versatility that neither pure temporal models such as LSTM nor pure spatial models such as GNN could match individually.
Figure 5 further illustrates this by comparing the time-series forecasts, where DeepONet and Transformer accurately track abrupt speed drops during rush hours, unlike the lagging baselines.
5.5. Ablation Study
To verify the contribution of each component in the DeepONet architecture, we conducted an ablation study by varying the network structure and latent dimension $p$.
Table 5 summarizes the ablation results, which were obtained on the unfiltered simulation data; this dataset contains many more zero values than the filtered dataset used in the main experiments. As shown in Figure 6, removing the Branch network, so that the model relies solely on the Trunk network and its exogenous features, leads to a performance drop of approximately 15%, confirming that the historical state trajectory encoded by the Branch network is critical for accurate forecasting. Furthermore, we analyzed the sensitivity to the latent dimension $p$. Performance degrades noticeably at small values of $p$, indicating underfitting, while increasing $p$ beyond 128 yields diminishing returns, justifying our chosen value of $p$ as a balance between accuracy and computational efficiency.
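To make the roles of the two sub-networks and of $p$ concrete, the following is a minimal sketch of a branch-trunk forward pass, not the authors' implementation. It assumes the standard DeepONet combination, a dot product of two $p$-dimensional embeddings. The layer sizes, the tanh activation, the input lengths, the value `p = 128`, and the linear head used for the trunk-only variant are all illustrative assumptions, and the weights are random rather than trained.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Random weights for a one-hidden-layer MLP (untrained, shapes only)."""
    def layer(n_in, n_out):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        return weights, [0.0] * n_out
    return layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)


def affine(weights, biases, x):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def mlp_forward(params, x):
    (w1, b1), (w2, b2) = params
    hidden = [math.tanh(h) for h in affine(w1, b1, x)]
    return affine(w2, b2, hidden)


def deeponet_forward(branch, trunk, history, exogenous, bias=0.0):
    """Full model: dot product of the branch and trunk embeddings (length p)."""
    b = mlp_forward(branch, history)    # encodes the historical speed trajectory
    t = mlp_forward(trunk, exogenous)   # encodes the exogenous features
    return sum(bk * tk for bk, tk in zip(b, t)) + bias


def trunk_only_forward(trunk, head, exogenous):
    """Ablated variant (assumed form): a linear head on the trunk embedding."""
    t = mlp_forward(trunk, exogenous)
    return sum(hk * tk for hk, tk in zip(head, t))


rng = random.Random(0)
p = 128                        # latent dimension, illustrative value
history_len, n_exog = 12, 4    # hypothetical input sizes

branch = make_mlp(history_len, 64, p, rng)
trunk = make_mlp(n_exog, 64, p, rng)
head = [rng.uniform(-1.0, 1.0) for _ in range(p)]

history = [rng.random() for _ in range(history_len)]
exogenous = [rng.random() for _ in range(n_exog)]

full_prediction = deeponet_forward(branch, trunk, history, exogenous)
ablated_prediction = trunk_only_forward(trunk, head, exogenous)
```

In this sketch the ablated variant never sees `history`, which mirrors why removing the Branch network discards the information carried by the historical state trajectory.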
In addition to architectural components, we evaluated the impact of specific trunk features. Our analysis identified density and travel time as the most critical exogenous variables.
5.6. Discussion
The experimental results highlight several key characteristics of the DeepONet framework for traffic forecasting. First, the model demonstrates remarkable robustness across varying topological complexities. In the linear SUMO simulation, it performs on par with specialized sequence models like LSTM, while in the complex METR-LA network, it achieves state-of-the-art performance comparable to Transformers and superior to standard GNNs. This suggests that the operator learning paradigm, which maps functional spaces rather than discrete points, effectively captures the underlying physical dynamics of traffic flow regardless of the specific network structure.
Second, the “Digital Twin” capability, evidenced by the recovery of the fundamental diagram in Figure 7, distinguishes DeepONet from purely statistical baselines. By learning the underlying operator, the model does not merely memorize historical patterns but internalizes the relationship between density and speed. This supports counterfactual reasoning, a critical feature for logistics planning where operators must evaluate hypothetical scenarios that may differ from historical averages. For example, operator-based one-step forecasts can be incorporated into rolling-horizon vehicle routing: by providing fast, link-level speed predictions under alternative boundary conditions, a routing engine can re-evaluate route costs in near real time and trigger dynamic rerouting or vehicle reassignment when predicted travel times exceed operational thresholds. Similarly, in depot scheduling and last-mile dispatch, these forecasts can feed ETA-aware sequencing and feasibility checks so that pickup/drop-off orders are proactively rescheduled to reduce delay propagation and improve on-time delivery rates.
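The rerouting trigger described above can be sketched as follows. This is a hypothetical illustration rather than part of the evaluated system: the function names, the link lengths and speeds, the minimum-speed floor, and the threshold values are all assumptions. A route is represented as a pair of lists, link lengths in metres and predicted link speeds in metres per second.

```python
def route_travel_time_s(link_lengths_m, predicted_speeds_mps, min_speed_mps=0.5):
    """Sum of link traversal times; speeds are floored to avoid division by zero."""
    if len(link_lengths_m) != len(predicted_speeds_mps):
        raise ValueError("one predicted speed is required per link")
    return sum(length / max(speed, min_speed_mps)
               for length, speed in zip(link_lengths_m, predicted_speeds_mps))


def should_reroute(current_route, alternative_route, threshold_s):
    """Reroute only if the current route breaches the threshold
    and the alternative is predicted to be faster."""
    current = route_travel_time_s(*current_route)
    alternative = route_travel_time_s(*alternative_route)
    return current > threshold_s and alternative < current


# Hypothetical forecasts: the second link of the current route is congested.
current_route = ([500.0, 800.0, 300.0], [12.0, 2.0, 10.0])
alternative_route = ([600.0, 900.0, 400.0], [11.0, 9.0, 10.0])

reroute = should_reroute(current_route, alternative_route, threshold_s=300.0)
```

In a rolling-horizon setting this check would be repeated at each forecast update, so a fast one-step predictor keeps the decision loop short.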
To further investigate the model’s sensitivity to specific boundary conditions, we performed a systematic perturbation analysis.
Figure 8 shows the mean predicted speed response to multiplicative scaling of each trunk feature. DeepONet exhibits physically consistent sensitivity, particularly to density and occupancy, whereas the MLP baseline often shows negligible or erratic responses, confirming the operator model’s superior ability to disentangle causal factors [5]. Regarding the sensitivity analysis in Figure 8, the nearly flat response to waiting time warrants closer interpretation. We attribute this to feature redundancy, as density and occupancy already effectively capture the congestion state in this predominantly free-flow scenario, meaning the marginal information provided by waiting time is minimal. Furthermore, while the model demonstrates robust behavior under moderate perturbations, we observed that extreme counterfactual scenarios where zero density is enforced while maintaining low speeds can yield physically inconsistent predictions. This behavior in unseen regimes highlights a limitation of pure data-driven operator learning and underscores the need for incorporating explicit physics-informed constraints in future iterations to ensure validity across the entire state space.
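The perturbation procedure can be sketched as below. The helper `perturbation_response` is a hypothetical name, and `toy_speed_model` is only a stand-in for a trained predictor: it uses a Greenshields-style linear speed-density relation with assumed parameters and deliberately ignores waiting time, so that a flat response of the kind discussed above can be seen directly.

```python
def perturbation_response(model, samples, feature_index, scales):
    """Mean prediction when one feature is multiplied by each scale,
    with all other features held at their original values."""
    responses = {}
    for scale in scales:
        total = 0.0
        for features in samples:
            perturbed = list(features)
            perturbed[feature_index] *= scale
            total += model(perturbed)
        responses[scale] = total / len(samples)
    return responses


def toy_speed_model(features):
    """Stand-in predictor (assumption): Greenshields speed-density relation.
    The waiting-time feature is intentionally unused."""
    density, _waiting_time = features
    free_flow_speed, jam_density = 30.0, 150.0
    return free_flow_speed * max(0.0, 1.0 - density / jam_density)


# Hypothetical samples: [density in veh/km, waiting time in s].
samples = [[20.0, 5.0], [40.0, 10.0], [60.0, 0.0]]
scales = [0.5, 1.0, 1.5]

density_response = perturbation_response(toy_speed_model, samples, 0, scales)
waiting_response = perturbation_response(toy_speed_model, samples, 1, scales)
```

Here the mean predicted speed falls as density is scaled up, while the waiting-time curve stays flat because the stand-in model does not use that feature. A trained model can show a similar flat curve when a feature is redundant rather than unused.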
Third, our analysis sheds light on the nature of the traffic modeling challenge. Given that traffic variables such as density, speed, and travel time are highly correlated, we investigated potential multicollinearity issues by comparing DeepONet with Ridge regression, which is robust to multicollinearity via L2 regularization. Ridge regression performed poorly on the simulation dataset, yielding an $R^2$ of approximately 0.46, but achieved high accuracy on the METR-LA dataset with an $R^2$ of around 0.90. This stark contrast indicates that the primary challenge in the simulation environment is nonlinearity, specifically the regime shifts between free-flow and congestion, rather than multicollinearity. The superior performance of DeepONet stems from its ability to model these nonlinear operator mappings, which linear models like Ridge cannot capture effectively, regardless of their robustness to collinearity.
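The argument that a regime shift, rather than collinearity, limits a linear model can be illustrated on toy data. The sketch below is not the experiment reported above: the step-shaped speed-density data, the regime boundary at 65 veh/km, and the penalty `alpha` are assumptions chosen for clarity. It fits a single-feature ridge regression in closed form, with an unpenalized intercept, and compares it with a model that simply predicts the mean speed of each regime.

```python
def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge fit for one feature with an unpenalized intercept."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / (sxx + alpha)
    return slope, y_mean - slope * x_mean


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (assumption): free flow at 30 m/s up to 60 veh/km, then 5 m/s.
density = [10.0 * i for i in range(15)]
speed = [30.0 if k <= 60.0 else 5.0 for k in density]

slope, intercept = fit_ridge_1d(density, speed, alpha=1.0)
ridge_r2 = r2(speed, [slope * k + intercept for k in density])

# Regime-aware comparison: predict the mean speed within each regime.
free = [v for k, v in zip(density, speed) if k <= 65.0]
jam = [v for k, v in zip(density, speed) if k > 65.0]
free_mean, jam_mean = sum(free) / len(free), sum(jam) / len(jam)
regime_r2 = r2(speed, [free_mean if k <= 65.0 else jam_mean for k in density])
```

With a single feature there is no collinearity at all, yet the linear fit reaches an $R^2$ of only about 0.75 on this step-shaped data, while the regime-aware predictor fits it exactly.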
However, certain limitations warrant discussion. While DeepONet outperforms MLP and Ridge regression, its training time is higher, though still competitive with LSTM. In terms of inference efficiency, by contrast, DeepONet demonstrates a clear advantage. As shown in Table 4, its inference time of 0.07 s is considerably lower than that of the Transformer at 0.33 s and the GNN at 0.18 s, making it well suited to real-time applications where low latency is critical. Additionally, unlike GNNs, which explicitly encode the adjacency matrix, DeepONet learns spatial dependencies implicitly through the Trunk network’s conditioning. While this proved effective in our experiments, it may face scalability challenges in extremely large networks where the explicit sparsity of graph convolutions offers a computational advantage. Nevertheless, the results indicate that for typical urban traffic networks, DeepONet provides a versatile and powerful alternative to existing spatiotemporal architectures.