Article

A Physics-Constrained Residual Learning Framework for Robust Freeway Traffic Prediction

1
Faculty of Maritime and Transportation, Ningbo University, Ningbo 315832, China
2
Dynamic Systems and Simulation Laboratory, Technical University of Crete, 73100 Chania, Greece
3
Collaborative Innovation Center of Modern Urban Traffic Technologies, Southeast University, Nanjing 211189, China
4
National Traffic Management Engineering & Technology Research Centre, Ningbo University Sub-Centre, Ningbo 315832, China
*
Authors to whom correspondence should be addressed.
Sustainability 2026, 18(7), 3228; https://doi.org/10.3390/su18073228
Submission received: 9 February 2026 / Revised: 11 March 2026 / Accepted: 19 March 2026 / Published: 25 March 2026
(This article belongs to the Section Sustainable Transportation)

Abstract

Improvements in the accuracy and stability of freeway traffic state prediction enable more proactive traffic control and demand management strategies, thereby reducing congestion spillover effects, unnecessary acceleration–deceleration cycles, and the resulting fuel consumption and emissions. Yet accurate prediction remains challenging due to the interplay between deterministic traffic flow mechanisms and stochastic disturbances. Purely data-driven models suffer from error accumulation under out-of-distribution conditions, while physics-based models lack the flexibility to capture nonlinear deviations. This paper proposes MDURP, a physics-constrained residual learning framework that reformulates prediction as a residual-space learning problem. A calibrated Cell Transmission Model generates a physically admissible baseline; deep learning models are then restricted to learning the residuals. Wavelet decomposition and GARCH volatility modeling address the multi-scale and heteroskedastic characteristics of these residuals. Experimental results demonstrate that MDURP consistently outperforms baseline models, reducing MAE by an average of 6.8% and RMSE by an average of 4%. The framework also suppresses long-term error accumulation, with MAPE escalation slowing from 0.79% to 0.58% per step. These gains confirm that anchoring deep learning within a physics-defined residual space enhances both accuracy and stability.

1. Introduction

Accurate freeway traffic state prediction is essential for intelligent transportation systems, supporting real-time traffic management and long-term infrastructure planning. Reliable forecasts enable proactive control strategies that mitigate congestion, reduce fuel consumption, and lower emissions—key objectives for sustainable transportation [1]. However, achieving robust prediction remains challenging due to the coexistence of deterministic traffic flow mechanisms and stochastic disturbances arising from demand variability, driver behavior, incidents, and sensing noise.
Traditional methods rely on physics-based models such as the Cell Transmission Model (CTM) or statistical filters like Kalman filtering [2]. While physically interpretable, these approaches struggle to capture complex nonlinear phenomena—abrupt demand surges, stop-and-go oscillations—that characterize real-world freeway traffic. Conversely, deep learning models excel at extracting nonlinear spatiotemporal patterns from large datasets [3] but suffer from two fundamental limitations: their performance degrades when traffic conditions deviate from historical patterns, and they lack embedded physical constraints, allowing predictions to drift into physically implausible regimes over longer horizons [4].
Recent hybrid frameworks attempt to combine these approaches but typically treat physical models and deep learning as loosely coupled modules [5]. Physical mechanisms are not explicitly enforced during learning, and neural networks continue to approximate full traffic dynamics rather than focusing on what physics-based models cannot explain.
This study argues that the central challenge lies not in how to combine models, but in how to redefine what the data-driven model is allowed to learn. Traffic dynamics can be viewed as the superposition of a physically admissible baseline, governed by conservation laws, and a stochastic deviation component driven by unmodeled behaviors and uncertainties. From this perspective, requiring neural networks to learn the entire traffic evolution is both inefficient and unnecessary.
We formulate the following research hypotheses:
Hypothesis 1 (H1): 
Confining deep learning to the residual space between observed traffic states and a physics-based baseline reduces long-horizon error accumulation compared to learning full traffic dynamics directly.
Hypothesis 2 (H2): 
Explicitly modeling the multi-scale and heteroskedastic characteristics of residuals—via wavelet decomposition and GARCH—improves prediction accuracy beyond what either standalone deep learning or physical models can achieve.
Hypothesis 3 (H3): 
The proposed physics-constrained residual learning framework enhances robustness under out-of-distribution traffic conditions by anchoring predictions to a physically admissible manifold.
The key contribution of this study lies in reformulating traffic state prediction as a residual-space learning problem under physical constraints, rather than proposing yet another hybrid architecture. By anchoring deep learning within a physics-defined solution space, MDURP improves prediction accuracy, suppresses long-horizon error accumulation, and enhances robustness under unseen traffic conditions. This paradigm provides a principled pathway for integrating physical knowledge and machine learning in traffic systems characterized by both structured dynamics and stochastic uncertainty.
The remainder of this paper is organized as follows. Section 2 reviews related work on deep learning-based traffic prediction, macroscopic traffic flow models, and hybrid approaches. Section 3 presents the MDURP framework and its methodological components. Section 4 describes the experimental setup and datasets. Section 5 reports and analyzes prediction results. Section 6 discusses system-level implications, limitations, and future research directions, followed by conclusions in Section 7.

2. Literature Review

2.1. Deep Learning-Based Traffic State Prediction

Deep learning methods have become the dominant paradigm for short-term traffic state prediction on freeways due to their strong capability in capturing nonlinear temporal dependencies and complex spatial correlations from large-scale traffic datasets. Early studies primarily focused on temporal modeling using recurrent neural networks such as LSTM [6,7,8,9] and GRU [10,11], demonstrating clear advantages over traditional statistical approaches in short-horizon freeway traffic forecasting. Subsequent research incorporated convolutional architectures [12] and graph neural networks [13,14,15,16,17,18] to explicitly model spatial interactions among roadway segments, leading to significant performance improvements in network-level freeway traffic prediction.
Recent GNN-based approaches achieve high accuracy in spatio-temporal learning, but are still predominantly data-driven and without an explicit physical anchor, making them particularly vulnerable to performance degradation when freeway traffic conditions—such as incident-induced congestion or weather-related disruptions—deviate from historical patterns [19]. Moreover, the absence of embedded physical constraints allows predictions to drift into physically implausible regimes, which is especially problematic for freeway applications where conservation laws (e.g., vehicle continuity) and fundamental diagram relationships must hold. Under long forecasting horizons or abnormal traffic conditions (e.g., sudden capacity drops or shockwave propagation), these models may produce predictions that violate basic traffic flow principles. These issues highlight that while deep learning excels at function approximation, it lacks inherent mechanisms to enforce the conservation laws and system-level traffic dynamics that govern freeway traffic behavior [20].

2.2. Cell Transmission Models

Macroscopic traffic flow models describe traffic dynamics using aggregated variables such as flow, density, and speed, providing a physically interpretable representation of traffic evolution. Among these models, the LWR [21,22,23] framework and its discrete realization, the Cell Transmission Model (CTM), remain widely used due to their conceptual simplicity, computational efficiency, and consistency with traffic flow conservation laws [24,25,26].
CTM effectively captures fundamental traffic phenomena such as shockwave propagation, congestion formation, and queue spillback on freeway segments [27]. As a result, it has been extensively applied in traffic simulation, state estimation, and control. However, as a first-order macroscopic model, CTM cannot inherently reproduce higher-order effects such as stop-and-go oscillations, stochastic capacity drops, or demand-induced volatility. These limitations lead to systematic prediction errors when CTM is applied directly to real-world traffic data [28].
Importantly, while CTM may be insufficient as a standalone predictor, it provides a physically admissible baseline trajectory that constrains traffic evolution within realistic bounds. This property makes CTM particularly suitable as a structural anchor for hybrid prediction frameworks [29].

2.3. Hybrid Traffic Prediction

To leverage the complementary strengths of physics-based and data-driven models, numerous hybrid traffic prediction approaches have been proposed. Existing hybrid methods can be broadly categorized into three groups. The first group switches between simulation models and data-driven predictors under specific events, such as incidents or adverse weather [30,31]. While effective in predefined scenarios, these approaches rely on accurate event detection and often lack real-time adaptability.
The second group uses traffic simulations to generate synthetic training data for deep learning models, enabling prediction under rare or abnormal conditions [32,33]. However, these models often underperform under normal traffic conditions and remain constrained by the fidelity of the simulation environment.
A third group loosely integrates physical models as auxiliary inputs or regularization terms within deep learning architectures [34,35]. Although these methods improve robustness to some extent, physical mechanisms are not explicitly enforced during learning. Consequently, neural networks still attempt to approximate full traffic dynamics rather than focusing on the deviations that physical models cannot capture.
In contrast to existing approaches, this study adopts a residual-space learning paradigm. Instead of combining predictions at the output level or switching models across scenarios, the proposed framework decomposes traffic dynamics into a physics-constrained baseline and a stochastic deviation component. Deep learning models are explicitly restricted to learning residuals relative to the physical baseline, fundamentally reshaping the learning task and reducing model variance.

2.4. Research Gap and Conceptual Framework

Building on the above insights, this study proposes MDURP, a physics-constrained residual learning framework for freeway traffic state prediction. By integrating CTM-based baseline prediction with structured residual modeling, MDURP embeds physical constraints directly into the learning process while retaining the flexibility of deep learning to capture complex stochastic behaviors. Unlike existing hybrid models, MDURP does not require neural networks to learn traffic flow evolution from scratch; instead, it confines learning to physically admissible deviations, enabling improved robustness and long-horizon stability.

3. Methodology

3.1. Problem Formulation

In this section, we introduce the problem of freeway traffic state prediction and describe the forecasting process within the proposed hybrid framework. Traffic state prediction is a typical time-series forecasting task, which aims to predict traffic measurements (e.g., speed or flow) over the next T time steps given historical observations from the previous T time steps. Let the traffic states observed across road segments be represented as a feature matrix X ∈ R^{N×P}, where N denotes the number of spatial nodes and P the number of features. Denoting the features at time t as X_t, the traffic state prediction problem aims to learn a mapping function f(·) that transforms T historical features into T future ones, as follows:
[X_{t−T+1}, …, X_t] → f(·) → [X_{t+1}, …, X_{t+T}]
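The sliding-window formulation above can be sketched in code. The helper name `make_windows` is hypothetical, and for simplicity each element is a scalar, whereas in the paper each time step carries an N × P feature matrix:

```python
def make_windows(series, T):
    """Build (history, future) training pairs, each of length T.
    history = [X_{t-T+1}, ..., X_t], future = [X_{t+1}, ..., X_{t+T}]."""
    pairs = []
    for t in range(T, len(series) - T + 1):
        history = series[t - T:t]
        future = series[t:t + T]
        pairs.append((history, future))
    return pairs
```

With a 5 min sampling interval, T = 9 would correspond to a 45 min history predicting a 45 min horizon.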
The proposed hybrid prediction framework, as illustrated in Figure 1, integrates physics-based modeling with deep learning to enhance traffic state forecasting.
We utilize the PEMSD7 dataset for analysis. PEMSD7 is a public highway dataset collected from the Performance Measurement System of California (PeMS). The traffic data are aggregated into 5 min intervals from 30 s samples. We select a specific road segment for this study, and the corresponding local road network is constructed by explicitly defining the road segments between the detector nodes. Historical traffic data are then used to calibrate a fundamental traffic flow model, which provides initial predictions (denoted as y). The difference between these preliminary predictions and the actual observations (Y) yields the prediction residuals (Y − y). These residuals are subjected to further analysis using deep learning methods to capture complex spatio-temporal dependencies that the physics-based model cannot fully represent. As detailed in Figure 2, the deep learning component is specifically designed to identify and compensate for nonlinear spatiotemporal interactions through hierarchical representation learning. By decomposing system dynamics into physics-constrained trends and data-driven residuals, the framework ensures that neural networks focus on learning intricate deviations (e.g., stochastic driver behavior, sensor biases, or emergent congestion patterns) while preserving interpretable baseline predictions grounded in transport theory.
The final predictions are derived by combining the physics-based outputs (y) with the deep learning residual corrections (y′), significantly improving accuracy. The traffic states for the next 1, 3, 6, and 9 steps (corresponding to 5, 15, 30, and 45 min respectively) were predicted. Meanwhile, the performance of the model was rigorously evaluated through a series of indicators, including root mean square error (RMSE) and mean absolute error (MAE), which collectively demonstrate the hybrid framework’s superior predictive capability compared to standalone physical or data-driven models.
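The two evaluation indicators are standard and can be computed as follows (a minimal sketch with hypothetical helper names):

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error over paired observations."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error over paired observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))
```

RMSE penalizes large deviations more heavily than MAE, so reporting both separates average accuracy from sensitivity to outliers.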

3.2. Model Framework

Figure 2 illustrates the framework of the MDURP model. Initially, predictions are generated using the calibrated traffic flow model, producing preliminary results denoted as y. The residuals (Y − y) represent the discrepancies between the true values and the initial predictions. These residuals, recorded at each time point, are subsequently used as training data for the deep learning model to uncover spatial and temporal dependencies within the residuals. Finally, the initial predictions (y) from the traffic flow model are combined with the outputs of the deep learning model (y′) to generate the final prediction results.
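The residual-learning pipeline just described reduces to two simple operations, sketched here with hypothetical helper names: extracting training targets r = Y − y for the deep model, and fusing the baseline with the learned correction y′:

```python
def residual_targets(observed, baseline):
    """Training targets for the deep model: r = Y - y, per segment/time step."""
    return [Y - y for Y, y in zip(observed, baseline)]

def fuse(baseline, residual_correction):
    """Final MDURP prediction: physics baseline y plus learned correction y'."""
    return [y + r for y, r in zip(baseline, residual_correction)]
```

A perfect residual model recovers the observations exactly; in practice the correction only needs to capture what the CTM baseline misses.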
This hybrid framework synergistically integrates physics-guided modeling and deep learning for traffic flow prediction. The physics-based module generates baseline estimates using domain-specific conservation laws, while a neural network processes residual signals to capture unmodeled spatiotemporal patterns and system uncertainties. Final predictions are derived through dynamic fusion of these complementary components, preserving physical mechanisms while enhancing accuracy through data-driven refinement of residual errors.
To model the residuals from the CTM baseline, we combine three techniques, each selected based on the expected statistical behavior of traffic flow residuals.
Wavelet decomposition separates the residual signal into low-frequency components—capturing structural deviations such as demand surges or incident impacts—and high-frequency components that reflect stochastic fluctuations like driver heterogeneity or sensor noise. The rationale for this multi-scale analysis is straightforward: traffic dynamics evolve across different temporal resolutions, and wavelet transforms offer a natural way to disentangle them.
On the high-frequency components, we apply GARCH (Generalized Autoregressive Conditional Heteroskedasticity) modeling to account for volatility clustering—the tendency for large fluctuations to cluster over time. This choice follows from preliminary analysis (detailed in Section 3.4.1) showing that CTM residuals exhibit time-varying variance and heavy tails, both indicative of conditional heteroskedasticity.
Finally, temporal deep learning architectures—specifically LSTM and TCN—are used to model nonlinear dependencies within low-frequency components. This is because the powerful function approximation capabilities of deep learning make it highly suitable for modeling nonlinear dependencies. LSTM is well-suited for capturing long-range temporal dependencies via its gating mechanisms, while TCN offers advantages in parallelization and gradient stability.

3.3. Physics-Based Baseline Modelling

CTM is a classic macroscopic first-order traffic flow model based on the LWR (Lighthill–Whitham–Richards) model. It divides road segments in traffic flow into multiple small units and describes the evolution of traffic density and flow using discretized flow conservation equations [36]. The core equation of CTM is as follows:
n_i^{t+1} = n_i^t + y_i^t − y_{i+1}^t
In Equation (1), n_i^t is the number of vehicles in cell i at time t; y_i^t is the number of vehicles entering cell i from cell i − 1 between time t and time t + 1.
y_i^t = min{ (L_c / l_{i−1}) n_{i−1}^t,  Q_i^t,  δ (L_c / l_i)(N_i^t − n_i^t) }
In Equation (2), L_c is the distance a standard vehicle travels in one time step under free-flow conditions; l_i is the length of cell i (l_i ≥ L_c); Q_i^t is the maximum flow rate at time t; N_i^t is the maximum number of vehicles that cell i can accommodate at time t; δ = ω/v, where v is the free-flow speed and ω is the backward wave speed, i.e., the speed at which congestion propagates upstream.
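A minimal sketch of one CTM update on a single homogeneous link, combining Equations (1) and (2). The function name and the closed-boundary assumption (no inflow into the first cell, no outflow from the last) are illustrative choices, not the paper's implementation:

```python
def ctm_step(n, Lc, l, Q, N, delta):
    """One CTM time step on a single link with closed boundaries.
    n: vehicles per cell; l: cell lengths; Q: max flows; N: jam capacities;
    delta: ratio of backward wave speed to free-flow speed."""
    cells = len(n)
    y = [0.0] * (cells + 1)  # y[i]: flow from cell i-1 into cell i; y[0], y[cells] stay 0
    for i in range(1, cells):
        y[i] = min(Lc / l[i - 1] * n[i - 1],             # sending capability of cell i-1
                   Q[i],                                  # capacity constraint
                   delta * (Lc / l[i]) * (N[i] - n[i]))   # receiving space of cell i
    # conservation: n_i^{t+1} = n_i^t + y_i^t - y_{i+1}^t
    return [n[i] + y[i] - y[i + 1] for i in range(cells)]
```

Because the update only moves vehicles between cells, the total vehicle count on the link is conserved, which is the physical constraint the baseline inherits.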
In the form of roadway convergence, the number of vehicles driving into cell (i) from cell (i − 1, U) and cell (i − 1, D) in the t-th time step is given by Equations (3) and (4), respectively.
y_{i,U}^t = mid{ S_{i−1,U}^t, R_i^t − S_{i−1,D}^t, p_{i−1,U}^t R_i^t },  if R_i^t ≤ S_{i−1,U}^t + S_{i−1,D}^t;  y_{i,U}^t = S_{i−1,U}^t,  if R_i^t > S_{i−1,U}^t + S_{i−1,D}^t
y_{i,D}^t = mid{ S_{i−1,D}^t, R_i^t − S_{i−1,U}^t, p_{i−1,D}^t R_i^t },  if R_i^t ≤ S_{i−1,U}^t + S_{i−1,D}^t;  y_{i,D}^t = S_{i−1,D}^t,  if R_i^t > S_{i−1,U}^t + S_{i−1,D}^t
In Equations (3) and (4), S_{i−1,U}^t is the number of vehicles that can depart cell (i − 1, U) in one time step:
S_{i−1,U}^t = min{ (L_c / l_{i−1,U}) n_{i−1,U}^t,  Q_{i−1,U}^t }
R_i^t is the number of vehicles that cell i can receive in one time step:
R_i^t = min{ δ (L_c / l_i)(N_i^t − n_i^t),  Q_i^t }
p_{i−1,U}^t is the proportion of the inflow to cell i that originates from cell (i − 1, U); p_{i−1,D}^t is the proportion originating from cell (i − 1, D); p_{i−1,U}^t + p_{i−1,D}^t = 1; mid{·} returns the middle of its three arguments.
The number of vehicles entering cell (i, U) and cell (i, D) from cell i − 1 in the t-th time step under the roadway diversion form is given by Equations (7)–(9).
y_i^t = min{ S_{i−1}^t,  R_{i,U}^t / β_{i,U},  R_{i,D}^t / β_{i,D} }
y_{i,U}^t = β_{i,U} y_i^t
y_{i,D}^t = β_{i,D} y_i^t
where β_{i,U} is the share of vehicles from the upstream cell entering cell (i, U), β_{i,D} is the share entering cell (i, D), and β_{i,U} + β_{i,D} = 1.
CTM is proficient at handling the homogeneity and transmission mechanisms of traffic flow on road segments and is especially suitable for simulating and analyzing the macroscopic dynamics of traffic flow. By extending the cell representation to merge and diverge geometries, the model can better capture the mechanisms of congestion propagation and dissipation.
To ensure that the CTM baseline accurately reflects the traffic characteristics of the study corridor, a genetic algorithm (GA) was used to calibrate the model parameters. The specific parameters are shown in Table 1. The objective function minimizes the mean absolute error between the CTM-simulated speeds and the observed loop detector data over the training period. MAE was chosen over RMSE for this task, as it is less sensitive to outliers and yields a more robust calibration under heterogeneous traffic conditions.
The GA was configured with a population size of 100 and run for 50 generations. Tournament selection (tournament size 3) was used for selection, simulated binary crossover (SBX) with a probability of 0.9 for crossover, and polynomial mutation with a rate of 0.1 for mutation. Preliminary experiments indicated that further increasing the population size or the number of generations offered only marginal improvements—less than a 0.5% reduction in validation error—while adding considerably to the computational cost.
Parameter bounds were defined based on physical plausibility and domain knowledge: free-flow speed [50, 70] mph, jam density [600, 900] veh/5 min, and wave speed [12, 20] mph.
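The calibration loop can be sketched as follows. This is a deliberately simplified stand-in for the paper's GA (random initialization plus Gaussian mutation around the incumbent best, rather than tournament selection, SBX crossover, and polynomial mutation), and the function and parameter names are hypothetical:

```python
import random

def calibrate_ctm(simulate, observed, bounds, pop_size=30, generations=20, seed=0):
    """Toy evolutionary search minimizing MAE between simulated and observed speeds.
    simulate(params) -> list of predicted speeds; bounds: name -> (lo, hi)."""
    rng = random.Random(seed)
    names = list(bounds)

    def mae(params):
        pred = simulate(params)
        return sum(abs(p - o) for p, o in zip(pred, observed)) / len(observed)

    best = {k: rng.uniform(*bounds[k]) for k in names}
    best_err = mae(best)
    for _ in range(generations):
        for _ in range(pop_size):
            # mutate around the incumbent best, clipped to the physical bounds
            cand = {k: min(max(best[k] + rng.gauss(0, 0.1 * (bounds[k][1] - bounds[k][0])),
                               bounds[k][0]), bounds[k][1]) for k in names}
            err = mae(cand)
            if err < best_err:
                best, best_err = cand, err
    return best, best_err
```

Keeping candidates inside the physical bounds mirrors the paper's constraint that calibrated parameters remain physically plausible (e.g., free-flow speed within [50, 70] mph).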

3.4. Multi-Scale Residual Modelling

The architecture for processing the residual data is shown in Figure 3. This architecture synergizes wavelet-based signal decomposition, volatility modeling, and deep learning for time series prediction. The raw signal is first decomposed into high-frequency (volatility-driven) and low-frequency (trend-driven) components through multi-scale wavelet analysis. The high-frequency components are processed via GARCH models to capture time-varying volatility and asymmetric responses to shocks, while deep learning extracts nonlinear spatiotemporal dependencies from low-frequency trends.

3.4.1. Analysis of the Characteristics of Residual Data

The residual data characteristics are summarized in Table 2. The residuals exhibit significant non-stationarity (short-term fluctuation = 72.2585) and heavy-tailed distribution (kurtosis = 3.7057), indicating clustered extreme variability (local range = 862.0). The low mutation density (0.0378) indicates intermittent transitions in traffic states, while the near-zero autocorrelation coefficient at lag 1 (ACF = 0.0229) reflects weak linear time dependence. Collectively, these characteristics reveal: (1) Traffic fluctuations exhibit a clustering feature; (2) the residual sequence contains multi-scale spatiotemporal structures; (3) there are nonlinear transient anomalies in the data, which need to be processed using specially designed deep learning structures.
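Two of the diagnostics in Table 2 can be reproduced from a residual series as follows. The sketch uses the Pearson kurtosis convention (normal distribution = 3), consistent with the reported value of 3.7057; the helper name is hypothetical:

```python
def residual_diagnostics(r):
    """Pearson kurtosis and lag-1 autocorrelation of a residual series."""
    n = len(r)
    mean = sum(r) / n
    d = [x - mean for x in r]
    m2 = sum(x * x for x in d) / n       # variance (second central moment)
    m4 = sum(x ** 4 for x in d) / n      # fourth central moment
    kurtosis = m4 / (m2 * m2)
    acf1 = sum(d[i] * d[i + 1] for i in range(n - 1)) / sum(x * x for x in d)
    return kurtosis, acf1
```

A kurtosis above 3 signals heavy tails, and a near-zero lag-1 ACF signals weak linear dependence, which is why purely autoregressive residual models are insufficient here.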
Figure 4 presents a comparison of residual prediction between the LSTM-only model and the Wavelet-GARCH-LSTM model. The Wavelet-GARCH-LSTM model demonstrates a substantial improvement in predicting traffic flow residuals compared to the LSTM model. Although LSTM is proficient in capturing temporal dependencies, it encounters challenges in addressing non-stationarity, volatility clustering, and multi-scale features present in traffic residuals. In contrast, the Wavelet-GARCH-LSTM model employs wavelet decomposition to distinguish high-frequency volatility from low-frequency trends, utilizes GARCH to model conditional heteroskedasticity, and applies LSTM to capture residual nonlinear dependencies. This integrated approach effectively mitigates the limitations of the LSTM model, resulting in more accurate and robust predictions by comprehensively handling non-stationarity, volatility clustering, and multi-scale features in traffic flow residuals.

3.4.2. Wavelet Decomposition

Wavelet decomposition is employed as a core tool for signal preprocessing and feature separation, with the primary objective of decomposing the original residual signal into sub-components across different frequency bands. This facilitates targeted modeling of high-frequency fluctuations and low-frequency trends in subsequent analysis [37].
The Discrete Wavelet Transform (DWT) employs the concept of Multiresolution Analysis (MRA), recursively decomposing a signal using a set of orthogonal or biorthogonal wavelet basis functions. Each level of decomposition splits the signal into:
Approximation Coefficients (A): Representing the low-frequency components of the signal, reflecting its overall trend or smooth variations.
Detail Coefficients (D): Capturing the high-frequency components, highlighting rapid changes and abrupt fluctuations within the signal.
At each decomposition level, the signal s(t) is convolved with both a low-pass filter and a high-pass filter, followed by downsampling by a factor of two. This process yields the low-frequency component A and the high-frequency component D. The low-frequency component is then recursively decomposed, forming a hierarchical structure. Mathematically, the decomposition at the j-th level can be expressed as:
A_j[n] = Σ_k h[k − 2n] A_{j−1}[k]
D_j[n] = Σ_k g[k − 2n] A_{j−1}[k]
where h[k] and g[k] are the coefficients of the low-pass and high-pass filters, respectively.
The decomposed signal can be reconstructed by inverse transformation:
s(t) = Σ_{j=1}^{J} Σ_n D_j[n] ψ_{j,n}(t) + Σ_n A_J[n] φ_{J,n}(t)
where ψ is the wavelet function and φ is the scaling function.
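A minimal one-level DWT can be sketched with the orthonormal Haar basis (the paper does not specify its wavelet, so Haar is chosen purely for illustration; the function names are hypothetical):

```python
import math

def haar_dwt(s):
    """One-level Haar DWT: approximation A (low-pass) and detail D (high-pass),
    each downsampled by 2; len(s) must be even."""
    c = 1 / math.sqrt(2)
    A = [c * (s[2 * k] + s[2 * k + 1]) for k in range(len(s) // 2)]
    D = [c * (s[2 * k] - s[2 * k + 1]) for k in range(len(s) // 2)]
    return A, D

def haar_idwt(A, D):
    """Inverse one-level Haar DWT: perfect reconstruction of the signal."""
    c = 1 / math.sqrt(2)
    s = []
    for a, d in zip(A, D):
        s += [c * (a + d), c * (a - d)]
    return s
```

Recursively applying `haar_dwt` to the approximation coefficients yields the hierarchical multiresolution structure described above; in practice, a library such as PyWavelets with a smoother basis (e.g., Daubechies) would typically be used.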

3.4.3. GARCH Volatility Modeling

In the MDURP model, the GARCH (Generalized Autoregressive Conditional Heteroskedasticity) model is used specifically for the high-frequency fluctuation components of the residual signals (e.g., the detail coefficients D_j[n] of the wavelet decomposition), dynamically modeling their volatility to capture the conditional heteroskedasticity (i.e., volatility clustering) of the residuals [38].
High-frequency residuals in settings such as traffic flow and financial returns are often characterized by "volatility clustering" (periods of high volatility alternating with periods of low volatility); GARCH quantifies this phenomenon through an autoregressive structure:
σ_t² = ω + α r_{t−1}² + β σ_{t−1}²
where σ_t² is the current conditional variance, r_{t−1}² is the squared residual of the previous step, and α and β measure the contributions of short-term shocks and long-term volatility memory, respectively.
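Given fitted parameters, the GARCH(1,1) variance recursion is straightforward to evaluate; a minimal sketch, initialized at the unconditional variance ω / (1 − α − β) and using a hypothetical function name:

```python
def garch_volatility(r, omega, alpha, beta):
    """Conditional variance path: sigma_t^2 = omega + alpha*r_{t-1}^2 + beta*sigma_{t-1}^2.
    Requires alpha + beta < 1 for a finite unconditional variance."""
    sigma2 = [omega / (1 - alpha - beta)]  # unconditional variance as starting value
    for t in range(1, len(r)):
        sigma2.append(omega + alpha * r[t - 1] ** 2 + beta * sigma2[-1])
    return sigma2
```

With zero residuals the variance decays geometrically toward ω / (1 − β)·(…), while a large shock r_{t−1} lifts σ_t² and the elevated level persists through the β term, which is exactly the clustering behavior being modeled.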

3.4.4. Deep Learning Feature Extraction

In time series modeling, the extraction of nonlinear features is one of the core tasks of deep learning. LSTM (Long Short-Term Memory Network) and TCN (Temporal Convolutional Network), as the two classical architectures, achieve the modeling of complex nonlinear relationships through gating mechanism and dilated causal convolution, respectively.
The LSTM implements long-range dependency learning through three gating units (forget gate, input gate, and output gate) and a cell state, defined by the following formulation [39]:
f_t = σ(W_f [h_{t−1}, x_t] + b_f)
i_t = σ(W_i [h_{t−1}, x_t] + b_i)
Ĉ_t = tanh(W_C [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ Ĉ_t
o_t = σ(W_o [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)
where f_t, i_t, and o_t are the outputs of the forget, input, and output gates; Ĉ_t is the candidate memory cell and C_t is the final memory cell at time t; h_t denotes the final output of the memory unit at time t; W_(·) and b_(·) are the weights and biases of the respective gate neurons; σ denotes the sigmoid function; x_t is the input at time t and h_{t−1} is the output at the previous time step t − 1; ⊙ denotes the Hadamard product (element-wise multiplication of vectors).
Sigmoid functions compress linear combinations into the (0, 1) range, enabling feature selection (e.g., the forget gate suppresses irrelevant historical information). The tanh in the candidate state Ĉ_t introduces a nonlinear transformation that enhances the model's ability to represent complex patterns. The cell state C_t transfers information across time steps through gated weighting, forming memory pathways.
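The six equations above can be traced with a scalar (hidden size 1) cell; this is an illustrative sketch for following the gating arithmetic, not the vectorized implementation used in the experiments, and all names are hypothetical:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One scalar LSTM step. W[g] = (w_h, w_x) and b[g] = bias for g in 'f','i','c','o'."""
    f = sigmoid(W['f'][0] * h_prev + W['f'][1] * x + b['f'])   # forget gate
    i = sigmoid(W['i'][0] * h_prev + W['i'][1] * x + b['i'])   # input gate
    c_hat = math.tanh(W['c'][0] * h_prev + W['c'][1] * x + b['c'])  # candidate state
    c = f * c_prev + i * c_hat                                 # new cell state
    o = sigmoid(W['o'][0] * h_prev + W['o'][1] * x + b['o'])   # output gate
    h = o * math.tanh(c)                                       # new hidden output
    return h, c
```

With all weights at zero, every gate outputs 0.5, so the cell state is simply halved each step, which makes the memory-decay role of the forget gate easy to see.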
Conventional convolutional operations work within a localized receptive field, which makes it difficult for a plain TCN to capture long-term dependencies in longer sequences. However, by using deep stacking and dilated convolutions, TCN can expand its receptive field and increase the distance over which information is transmitted, capturing long-term dependencies more efficiently. At the same time, TCN can extract features at different scales by stacking multiple convolutional layers, each processing the input with a different kernel size to capture features at different time scales. This multi-scale extraction capability also makes TCN sensitive to local dependencies in sequence data [40].
TCN achieves efficient temporal modeling via dilated causal convolutions and residual blocks.
Dilated Causal Convolution:
F(s) = Σ_{i=0}^{k−1} f(i) · x_{s−d·i}
where d is the dilation factor, which controls the receptive field size (e.g., d = 2^l yields exponential dilation); k is the convolution kernel size; and s is the current time step. The causality constraint ensures that only samples at indices s − d·i ≤ s (i.e., past or present samples) enter the sum, avoiding future information leakage.
Residual block structure (TCN core unit):
output = Activation(x + G(x))
where G(x) is the subnetwork containing the following operations:
dilated causal convolution → weight normalization → ReLU activation → Dropout
quadratic dilation causal convolution → weight normalization → Dropout
Alignment by 1 × 1 convolution when input and output dimensions are not matched.
Dilated causal convolution lets the top neuron cover a wide region of the input sequence by exponentially increasing the dilation factor d (e.g., d = 1, 2, 4 across successive layers). Residual blocks allow gradients to be passed directly back to shallow layers, supporting the construction of very deep TCNs.
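The dilated causal convolution can be sketched directly from the formula, with out-of-range taps treated as zero padding (an illustrative helper, not the framework's implementation):

```python
def dilated_causal_conv(x, f, d):
    """F(s) = sum over i of f[i] * x[s - d*i]; taps before the start count as zero."""
    k = len(f)
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(k):
            j = s - d * i          # only past/present indices are ever referenced
            if j >= 0:
                acc += f[i] * x[j]
        out.append(acc)
    return out
```

Note that every output F(s) depends only on x[j] with j ≤ s, which is the causality property that prevents future information leakage during training.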

4. Experiments

4.1. Dataset

The proposed prediction framework, MDURP, applied to various baseline prediction algorithms, is evaluated on real-world networks. The road structure of the target area is shown in Figure 5. The study area is the San Diego Freeway in California, USA, an east–west road in southwest Los Angeles. We opt to evaluate a 7.04 km section that includes 10 consecutive detector stations, 6 on-ramps and 5 off-ramps, from PM 41.97 to PM 37.27.
The PeMS dataset comprises traffic state data collected in real time by over 39,000 independent detectors distributed across the freeway systems in all major metropolitan areas of California. These data are continuously updated on the PeMS website. Spanning more than a decade, the dataset is widely used in traffic-related research due to its extensive coverage and public accessibility.
The traffic data in the PeMS dataset are derived from loop detectors, which upload collected data every 30 s. However, these raw data are not directly used; instead, they are transmitted to their corresponding monitoring stations. Each monitoring station aggregates data from all lane-specific detectors at its location to generate comprehensive traffic state data for the associated roadway segment. These aggregated data, representing the traffic state of each segment, are published on the PeMS website. The shortest available traffic sampling interval on the PeMS website is five minutes. The data we used covers a period of 44 working days from May to June in 2012.
Given the broad range of data types provided by the PeMS dataset, this study focuses only on the information relevant to the research. Table 3 illustrates the traffic state information collected by monitoring station 716674 during the time interval between 12:00 AM and 12:15 AM on 1 May 2012. The dataset is presented in English, and the original terms are retained here, accompanied by explanatory notes:
TimeStamp: The timestamp of the monitoring.
Station: The monitoring station ID.
Total Flow: The total number of vehicles passing through the station across all lanes during the 5 min interval, measured as vehicles per 5 min.
Avg Occupancy: The average lane occupancy at the monitoring station over the 5 min interval, expressed as a percentage.
Avg Speed: The average vehicle speed at the monitoring station across all lanes during the 5 min interval, measured in miles per hour.
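The 30 s-to-5 min station-level aggregation described above can be sketched as follows; the function name and the toy counts are illustrative and not part of any PeMS interface.

```python
def station_flow_5min(lane_counts_30s):
    """Aggregate lane-level 30 s vehicle counts into a station-level
    5 min Total Flow (ten 30 s intervals = 5 min)."""
    assert all(len(lane) == 10 for lane in lane_counts_30s), \
        "each lane needs ten 30 s counts"
    return sum(sum(lane) for lane in lane_counts_30s)

# Two lanes, each passing 3 vehicles per 30 s interval:
print(station_flow_5min([[3] * 10, [3] * 10]))  # → 60 veh per 5 min
```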

4.2. Prediction Setting

4.2.1. Physics Baseline Configuration

In the modeling framework of MDURP, the Cell Transmission Model (CTM) serves as the base predictor, providing the deep learning model with baseline predictions that conform to traffic dynamics laws through a physical constraint mechanism. The CTM constructs a discretized cell system based on the law of conservation of traffic flow, and its core equations can be expressed as follows:
  y_i(t) = \min\left\{ \frac{L_c}{l_{i-1}}\, n_{i-1}(t),\; Q_i(t),\; \delta\, \frac{L_c}{l_i} \big( N_i(t) - n_i(t) \big) \right\}
  \delta = \frac{w}{V_f}
The model effectively captures the bottleneck effect and the propagation of traffic waves in the road network through physical parameters such as the free-flow speed V_f, the congestion wave speed w, and the maximum capacity Q.
For a complex network of mainline and ramps, non-uniform cells are first delineated based on road geometric features (e.g., mainline cell lengths range from 710 to 1145 m), and merging/diverging rules are defined at the on-ramps and off-ramps.
The CTM simulation uses a 30 s time step and evolves the traffic state by recursively solving the conservation equations. Specifically, the flow transmitted between mainline cells is constrained by a piecewise linear function: the sending volume S_{i-1}(t) and the receiving volume R_i(t) jointly determine the actual transmitted volume y_i(t). The free-flow speed V_f, the backward wave speed w, and the maximum capacity Q are adjustable parameters optimized by a genetic algorithm.
  S_{i-1}(t) = \min\left\{ \frac{L_c}{l_{i-1}}\, n_{i-1}(t),\; Q_{i-1}(t) \right\}
  R_i(t) = \min\left\{ \delta\, \frac{L_c}{l_i} \big( N_i(t) - n_i(t) \big),\; Q_i(t) \right\}
The flow rates of the entrance cells and ramps are set from historical data to simulate the continuous input of external traffic, while the remaining cells update their states according to the conservation Equations (4) and (5). Each prediction is obtained through 10 iterations (5 min), yielding the predicted flow at each detector cell, and performance is evaluated against the measured data.
  n_i(t+1) = n_i(t) + y_i(t) - y_{i+1}(t)
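A minimal sketch of one CTM update is shown below, under simplifying assumptions: uniform cell lengths (so the L_c/l_i scaling drops out) and fixed boundary cells in place of the historically driven entrance and ramp inputs. All names are illustrative, not the actual implementation.

```python
def ctm_step(n, N, Q, delta):
    """One update of a simplified Cell Transmission Model.

    n     -- vehicle counts per cell at time t
    N     -- jam (maximum) vehicle count per cell
    Q     -- maximum flow per time step (capacity)
    delta -- w / V_f, ratio of backward wave speed to free-flow speed
    """
    y = [0.0] * (len(n) + 1)
    for i in range(1, len(n)):
        send = min(n[i - 1], Q)            # what cell i-1 can send
        recv = min(delta * (N - n[i]), Q)  # what cell i can receive
        y[i] = min(send, recv)
    n_next = list(n)
    # Conservation: n_i(t+1) = n_i(t) + y_i(t) - y_{i+1}(t)
    for i in range(1, len(n) - 1):
        n_next[i] = n[i] + y[i] - y[i + 1]
    return n_next

print(ctm_step([10, 5, 0, 0], N=20, Q=4, delta=0.5))  # → [10, 5, 4, 0]
```

In the paper's setup, such an update would run at a 30 s step, with ten iterations producing one 5 min prediction.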
Figure 6 presents an illustrative comparison between the predictions generated by the CTM and the corresponding ground-truth observations over a five-minute forecasting horizon. The mean absolute percentage error (MAPE) of the CTM predictions is 15.57%, reflecting an average deviation of approximately 15% from the actual values. It is observed that the CTM reasonably reproduces the overarching trends of the traffic flow time series, despite its inherent limitations in capturing finer-grained fluctuations.

4.2.2. Residual Modelling Setup

Residual data are obtained by subtracting the CTM prediction from the observed flow data. Statistical analysis of the residuals, summarized in Table 2, reveals significant non-stationarity and nonlinear dynamics. The short-term volatility of 72.2585 and the local extreme deviation of 862 indicate violent oscillations, and a coefficient of variation of 1.93 (assuming a mean of 37.4) shows that the dispersion at the local scale far exceeds conventional thresholds. The kurtosis coefficient of 3.7057 confirms a right-skewed, heavy-tailed distribution, with the probability density of extreme values elevated by 18–22% relative to the normal distribution. The mutation density of 0.0378 shows a sparse distribution of structural breakpoints, suggesting a quasi-continuous evolution process, while the ACF_Lag1 value of 0.0229 (p > 0.1) and the absence of Lag3/Lag5 values (NaN) reflect weak short-term autocorrelation, possibly accompanied by intermittent shifts in the covariance structure that cause traditional linear autoregressive models to fail. Taken together, the spiky heavy-tailed distribution, multi-scale fluctuations, and weak autocorrelation of the residuals call for a prediction method that integrates GARCH volatility modeling with a hybrid wavelet–neural network architecture.
The hybrid wavelet decomposition–GARCH–deep learning residual prediction model achieves high-precision prediction of complex time series through the synergistic optimization of multi-scale signal decomposition and nonlinear feature extraction. The model first decouples the original signal into high-frequency and low-frequency components using an adaptive wavelet packet decomposition algorithm: the high-frequency component characterizes sudden fluctuations and noise disturbances, and its empirical mode decomposition (EMD) intrinsic mode functions (IMFs) satisfy Equation (27):
  x(t) = \sum_{k=1}^{L} \mathrm{IMF}_k(t) + R(t)
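A one-level Haar split illustrates in miniature how such a decomposition separates a series into low-frequency structure and high-frequency fluctuations; this is a didactic stand-in, not the adaptive wavelet packet algorithm actually used.

```python
def haar_split(x):
    """One-level Haar decomposition of an even-length series into
    low-frequency (pairwise means) and high-frequency (pairwise
    half-differences) components."""
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return low, high

low, high = haar_split([4.0, 2.0, 1.0, 3.0])
print(low, high)  # [3.0, 2.0] [1.0, -1.0]
# Perfect reconstruction: x[2i] = low[i] + high[i], x[2i+1] = low[i] - high[i]
```

Each component can then be modeled separately, which is the role the wavelet stage plays in the residual pipeline.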
To address the heteroskedasticity of the high-frequency component, a GARCH (1,1) model is constructed for volatility modeling, as shown in Equation (28):
  \sigma_t^2 = \omega + \alpha\, r_{t-1}^2 + \beta\, \sigma_{t-1}^2
where the parameters are optimized by maximum likelihood estimation.
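The conditional-variance recursion of Equation (28) can be sketched as follows; the parameter values below are illustrative, whereas in the paper ω, α, and β are fitted by maximum likelihood (e.g., via a package such as `arch`).

```python
def garch11_variance(returns, omega, alpha, beta, sigma0_sq):
    """Conditional variance path of a GARCH(1,1) process:
    sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2
    """
    sig2 = [sigma0_sq]
    for r in returns[:-1]:
        sig2.append(omega + alpha * r * r + beta * sig2[-1])
    return sig2

# Illustrative parameters: a large residual at t-1 raises the forecast
# variance at t, reproducing volatility clustering.
path = garch11_variance([1.0, 2.0, 0.5],
                        omega=0.1, alpha=0.2, beta=0.5, sigma0_sq=1.0)
```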

4.2.3. Experimental Configuration

Table 4 shows the parameter settings for traffic prediction. The prediction horizon is set to 1, 3, 6, and 9 steps (corresponding to 5, 15, 30, and 45 min, respectively). Here, we adopt the CTM as the traffic flow model and LSTM and TCN as the residual prediction models. The parameters for the CTM, LSTM, and TCN are listed, respectively.
To assess the stability of the calibrated CTM parameters, a temporal cross-validation experiment was conducted. The model was recalibrated on four distinct training periods, each covering three weeks of traffic data, and subsequently evaluated on the corresponding test periods. The calibrated parameters—free-flow speed, critical density, and jam density—varied within a narrow range across the four periods, with coefficients of variation below 5%. Correspondingly, the resulting prediction performance remained consistent, with MAE fluctuations remaining within 2% of the baseline value.

5. Experimental Results and Discussion

5.1. Evaluation Metrics and Benchmarking Criteria

The MDURP outputs the flow values of each link. Prediction performance is evaluated with the mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean squared error (RMSE) of the flow in the target area, as in (29)–(31).
  \mathrm{MAE} = \frac{1}{|T| \cdot |L|} \sum_{t \in T} \sum_{l \in L} \left| p_{\mathrm{obs}}^{l}(t) - p_{\mathrm{pred}}^{l}(t) \right|
  \mathrm{MAPE} = \frac{1}{|T| \cdot |L|} \sum_{t \in T} \sum_{l \in L} \left| \frac{p_{\mathrm{obs}}^{l}(t) - p_{\mathrm{pred}}^{l}(t)}{p_{\mathrm{obs}}^{l}(t)} \right| \cdot 100\%
  \mathrm{RMSE} = \sqrt{ \frac{1}{|T| \cdot |L|} \sum_{t \in T} \sum_{l \in L} \left( p_{\mathrm{obs}}^{l}(t) - p_{\mathrm{pred}}^{l}(t) \right)^2 }
where p_obs^l(t) and p_pred^l(t) are the observed and predicted flow values of link l at time t, and T and L are the sets of time steps and links, respectively.
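Assuming observations and predictions are flattened over all (t, l) pairs, the three metrics can be computed as in the sketch below; function and variable names are illustrative.

```python
import math

def evaluate(obs, pred):
    """MAE, MAPE (%), and RMSE over flattened (time, link) samples."""
    n = len(obs)
    mae = sum(abs(o - p) for o, p in zip(obs, pred)) / n
    mape = 100.0 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / n
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / n)
    return mae, mape, rmse

mae, mape, rmse = evaluate([100.0, 200.0], [90.0, 210.0])
print(mae, mape, rmse)
```

Note that RMSE penalizes large deviations more heavily than MAE, which is why the two metrics can rank models differently.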

5.2. Performance Comparison and Analysis

To evaluate the effectiveness of the MDURP framework, we applied it to two benchmark models, LSTM and TCN, to perform traffic flow prediction over various forecasting horizons (with each step corresponding to 5 min). Table 5 and Figure 7 summarize the MAE, MAPE, and RMSE metrics obtained for both the baseline models and their MDURP-enhanced counterparts across different prediction horizons (1, 3, 6, and 9 steps, corresponding to 5, 15, 30, and 45 min, respectively). Consistent with the general principles of time series forecasting (i.e., information decay and error accumulation), all models exhibited a monotonic increase in MAE, MAPE, and RMSE as the forecasting horizon extended. Moreover, the relative performance ranking across horizons remained stable, with the MDURP-enhanced models outperforming the baseline models (LSTM/TCN), which in turn outperformed the CTM model. This outcome demonstrates the broad optimization capability of the MDURP framework.
The CTM model consistently underperformed at all horizons (MAE: 53.50→87.92; RMSE: 75.24→117.89), highlighting the limitations of purely physics-based models in capturing complex nonlinear temporal patterns. Notably, the MAPE for CTM reached 23.96% at the 9-step horizon, indicating its inability to capture dynamic trends and a superlinear error escalation as the forecasting horizon increases. Among the baseline models, TCN marginally outperformed LSTM in one-step predictions (RMSE: 45.40 vs. 45.42), likely due to the adaptability of its causal convolutional architecture to local temporal features. However, for longer-term forecasts (6+ steps), LSTM surpassed TCN, with RMSE evolving from 59.25 to 67.71 for LSTM versus 63.59 to 70.78 for TCN, suggesting that the gated mechanisms in LSTM are more effective at modeling long-term dependencies.
Integration of the MDURP framework substantially improved the performance of the baseline models across multiple scales. For one-step predictions, MDURP-TCN reduced MAE by 7.36% (from 33.58 to 31.11) and RMSE by 2.09% (from 45.40 to 44.45) compared to standalone TCN. At the 9-step horizon, MDURP-LSTM achieved a 6.48% reduction in MAE (from 51.19 to 47.87) and a 2.42% reduction in RMSE (from 67.71 to 66.07) relative to the baseline LSTM, confirming the framework's efficacy in mitigating multi-scale errors. Although the RMSE improvements are modest (2–5%), two factors likely limit the headroom for larger gains: (1) the baselines are already strong, leaving little room for large improvements; and (2) the traffic data are of high quality, which benefits baseline training. Nevertheless, the MDURP-enhanced models consistently achieve the best performance while alleviating the black-box nature of purely data-driven models, so the improvement remains meaningful.
Further analysis revealed distinct robustness patterns: LSTM-based models demonstrated superior stability in long-term forecasting (6–9 steps), benefiting from their hidden state memory mechanisms that preserve temporal context. In contrast, the fixed-length causal convolutions of the TCN imposed a receptive field constraint that accelerated performance degradation over extended horizons. Moreover, the MDURP-LSTM exhibited a slower MAPE escalation (from 8.67% to 13.29%, corresponding to a slope of 0.58% per step) compared to the baseline LSTM (from 9.89% to 16.94%, with a slope of 0.79% per step), thereby highlighting the effectiveness of MDURP in suppressing cumulative error amplification.
Collectively, these results demonstrate the significant advantages of MDURP-enhanced models in multi-step prediction tasks, particularly in curbing long-term error accumulation. The framework’s capability to improve both short-term accuracy and long-term stability provides a robust solution for addressing complex temporal forecasting challenges.
Figure 8 presents the average improvement rates of the MDURP-enhanced models relative to their respective baseline models in terms of predictive performance. Subfigures (a), (b), (c), and (d) illustrate the results for 1-step, 3-step, 6-step, and 9-step predictions, respectively. Both MDURP-LSTM and MDURP-TCN significantly outperform their baseline counterparts (LSTM and TCN) in the vast majority of cases, with all improvement rates being positive. The most notable enhancements are observed in the MAPE metric, where MDURP-LSTM achieves an average improvement of 16.14% and MDURP-TCN 15.84%, indicating MDURP’s superior efficacy in optimizing relative error. For the MAE metric, improvements are moderately lower (MDURP-LSTM: 8.37%, MDURP-TCN: 8.24%), while RMSE improvements are comparatively modest (MDURP-LSTM: 4.12%, MDURP-TCN: 5.34%), likely due to RMSE’s heightened sensitivity to large errors and MDURP’s relatively limited mitigation of such outliers.
Notably, MDURP-LSTM reaches peak improvement rates at the 3-step prediction horizon (MAE: 10.73%; MAPE: 17.30%; RMSE: 7.47%). Although MAE and RMSE improvements decline at the 9-step horizon (to 6.48% and 2.42%, respectively), MAPE sustains exceptional enhancement (21.55%). Similarly, MDURP-TCN exhibits high improvements at 3-step prediction (MAE: 9.31%; MAPE: 17.34%; RMSE: 6.82%), with peak RMSE improvement (7.45%) at 6-step prediction and robust MAPE retention (18.82%) at 9-step prediction. Collectively, these results demonstrate that MDURP achieves its most significant gains at the 3-step horizon, while maintaining substantial improvements—particularly in MAPE—at extended horizons (e.g., 9-step).
The MDURP framework integrates the physical mechanisms of the CTM model with wavelet decomposition-based frequency-domain processing, thereby substantially enhancing traffic flow prediction accuracy. Compared to baseline models (LSTM and TCN), MDURP delivers marked improvements across all three metrics (MAE, MAPE, RMSE), with MAPE showing the most pronounced gains (>15% average improvement). The improvement rates peak at the 3-step prediction horizon and moderate—yet remain significant—at longer horizons (e.g., 9-step), especially for MAPE. This underscores MDURP’s dominance in short-to-medium-term forecasting while retaining strong performance in long-term scenarios.
Figure 9 illustrates the temporal distribution of average prediction errors. The MDURP framework significantly enhances traffic flow prediction performance through a multi-resolution residual decomposition mechanism coupled with deep learning–GARCH collaborative modeling. Across all prediction horizons, the models enhanced by the MDURP framework demonstrated improved performance. Specifically, MDURP-LSTM and MDURP-TCN consistently outperformed their counterparts, LSTM and TCN, at most time points. Notably, in terms of extreme error values, the MDURP model significantly outperforms single deep learning models. The occurrence of such extremes in deep learning models may stem from encountering scenarios not present in the training dataset. This suggests that the integration of nonlinear features with physical mechanisms provides a significant advantage in handling large traffic flow fluctuations.
When the prediction horizon was set to one step, MDURP-TCN exhibited superior performance compared to MDURP-LSTM. Conversely, for prediction horizons of three, six, and nine steps, MDURP-LSTM showed better performance. This discrepancy may be attributed to the memory mechanism inherent in LSTM’s hidden states, which facilitates better long-term dependency capture.

6. Discussion

6.1. Interpreting MDURP as Residual-Space Learning Under Physical Constraints

The core contribution of the proposed MDURP framework lies not in the combination of multiple modeling techniques, but in how the prediction task itself is reformulated. Instead of requiring deep learning models to learn the full evolution of traffic states, MDURP decomposes traffic dynamics into two complementary components:
(1)
a physics-constrained baseline trajectory generated by the Cell Transmission Model (CTM),
(2)
a structured residual component capturing deviations from this trajectory.
From a system-modeling perspective, CTM defines a physically admissible solution manifold governed by conservation laws and fundamental diagram constraints. The deep learning model is then restricted to operate in the residual space orthogonal to this manifold. This design significantly reduces the hypothesis space explored by the neural network, thereby mitigating overfitting and improving generalization under unseen traffic conditions.
This residual-space learning paradigm distinguishes MDURP from existing hybrid approaches that merely fuse simulation outputs with data-driven predictions. In MDURP, the physical model does not serve as an auxiliary input or fallback option; rather, it actively constrains what the neural network is allowed to learn.
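Schematically, this division of labor can be expressed as follows; the function names are placeholders for the calibrated CTM and the trained residual model, and the snippet is a conceptual sketch rather than the actual implementation.

```python
def mdurp_predict(history, ctm_baseline, residual_model):
    """Residual-space prediction: the physics model supplies an
    admissible baseline trajectory, and the learned model only
    corrects deviations from it."""
    baseline = ctm_baseline(history)              # physics-constrained trajectory
    residual = residual_model(history, baseline)  # learned deviation
    return [b + r for b, r in zip(baseline, residual)]

# Toy stand-ins for the two components:
flow = mdurp_predict(
    history=[310.0, 325.0, 340.0],
    ctm_baseline=lambda h: [350.0, 360.0],     # hypothetical CTM output
    residual_model=lambda h, b: [12.5, -8.0],  # hypothetical correction
)
print(flow)  # [362.5, 352.0]
```

Because the residual model never replaces the baseline, its errors can only perturb, not overturn, the physically admissible trajectory.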

6.2. Statistical Structure of Residual Dynamics

Empirical analysis of CTM residuals reveals that they exhibit characteristics fundamentally different from raw traffic flow time series. Specifically, the residuals show pronounced non-stationarity, heavy-tailed distributions, and volatility clustering. These properties reflect the influence of unmodeled factors such as demand surges, driver heterogeneity, lane-changing disturbances, and sensor noise, which cannot be represented by first-order macroscopic traffic models.
Wavelet decomposition is employed to separate low-frequency structural deviations from high-frequency stochastic fluctuations, enabling targeted modeling of distinct temporal scales. The subsequent GARCH component explicitly captures volatility clustering in the high-frequency residuals, which is commonly observed during congestion onset and dissipation phases. Deep learning architectures (LSTM or TCN) are then applied to model nonlinear dependencies within each decomposed component.
Importantly, these components are not introduced as independent performance enhancements, but as structurally motivated mechanisms aligned with the statistical properties of the residual process. This design ensures that each modeling layer corresponds to a specific aspect of residual behavior, rather than arbitrarily increasing architectural complexity.

6.3. Performance Gains and Error Accumulation Suppression

Experimental results demonstrate that MDURP consistently improves prediction accuracy across short-, medium-, and long-term horizons. While standalone deep learning models already outperform CTM in short-term forecasting, their errors accumulate rapidly as the prediction horizon extends. By contrast, MDURP exhibits a noticeably slower error growth rate, particularly in terms of MAPE.
This behavior can be attributed to the anchoring effect of the physics-based baseline. Since CTM enforces physically plausible trends, long-term predictions remain bounded within realistic regimes, while deep learning focuses on correcting localized deviations. As a result, MDURP effectively suppresses error propagation that typically arises when neural networks extrapolate beyond their training distribution.

6.4. Robustness, Generalization, and Sustainability Implications

From a robustness standpoint, the MDURP framework demonstrates enhanced stability when encountering traffic conditions that differ from those observed during training. This property is particularly relevant for real-world traffic management systems, where unexpected demand fluctuations and disturbances are common.
In the context of sustainability, improved traffic state prediction accuracy and stability enable more proactive traffic control and demand management strategies, reducing congestion spillback, unnecessary acceleration–deceleration cycles, and associated fuel consumption and emissions. Although environmental impacts are not directly quantified in this study, the proposed framework contributes to sustainable transportation systems by improving the reliability of predictive inputs used in traffic optimization and control.

6.5. Practical Implications

Beyond its methodological contributions, the MDURP framework offers several practical advantages for real-world traffic management systems:
First, the framework’s enhanced long-horizon stability makes it particularly suitable for proactive control applications such as ramp metering, variable speed limits, and dynamic route guidance, where reliable predictions over 15–30 min are essential for effective strategy deployment. By suppressing error accumulation, MDURP reduces the risk of control actions being based on misleading forecasts.
Second, the modular architecture allows for incremental deployment in existing traffic management centers. Agencies can retain their calibrated CTM models as the physical backbone while gradually integrating the residual learning modules as data availability improves. This lowers the barrier to adoption compared to black-box deep learning alternatives.

6.6. Limitations and Future Research Directions

Despite its advantages, MDURP has several limitations. First, the framework relies on accurate calibration of the CTM baseline; poor calibration may degrade overall performance. Second, the current study focuses on a single freeway corridor, and further validation across heterogeneous network topologies is required. Third, while the architectural design is theoretically motivated, formal ablation studies isolating the contribution of each residual-modeling component remain an important direction for future research.
Future work will explore adaptive baseline models, simplified residual architectures, and tighter integration with real-time traffic control systems to further enhance scalability and practical applicability.

7. Conclusions

This paper proposes MDURP, a physics-constrained residual learning framework for freeway traffic state prediction. By decomposing traffic dynamics into a physically constrained baseline and a structured stochastic residual, the framework fundamentally reshapes the role of deep learning in traffic prediction. Instead of learning traffic flow evolution directly, neural networks are restricted to learning physically admissible deviations from macroscopic traffic flow theory.
The integration of CTM, wavelet decomposition, volatility modeling, and deep learning enables MDURP to capture multi-scale residual dynamics while preserving physical interpretability. Experimental results on real-world freeway data demonstrate that MDURP consistently outperforms standalone physical and data-driven models across multiple prediction horizons, with particularly strong performance in suppressing long-term error accumulation.
These findings, however, also point to certain limitations. First, the current study validates MDURP on a single freeway corridor. While this allows for controlled experimentation, it raises questions about generalization to more heterogeneous network topologies—such as urban arterials or ring roads—where traffic dynamics and residual patterns may differ. Second, although the framework’s architectural choices are theoretically motivated, the absence of formal ablation studies means the individual contribution of each component (wavelet decomposition, GARCH modeling, and temporal deep networks) has not been quantitatively isolated. Such ablation experiments would not only clarify the necessity of each module but also guide future simplifications for computational efficiency.
From a reproducibility standpoint, the reliance on a single corridor also underscores the need for validation across multiple datasets and geographic contexts. Future work should therefore extend MDURP to network-level prediction tasks, explore adaptive baseline calibration methods, and systematically evaluate the framework’s transferability. More broadly, the concept of residual-space learning under physical constraints offers a principled pathway for integrating domain knowledge with data-driven methods—one that may prove valuable beyond traffic prediction, in other engineering domains where physical laws and stochastic fluctuations coexist.

Author Contributions

Conceptualization, H.L. and P.Z.; methodology, H.L., P.Z. and M.P.; software, H.L.; validation, H.L., X.L., J.M. and Z.H.; formal analysis, H.L., X.L. and J.M.; investigation, H.L.; resources, P.Z.; data curation, H.L.; writing—original draft preparation, H.L.; writing—review and editing, P.Z., M.P. and Z.H.; visualization, H.L.; supervision, P.Z.; project administration, P.Z.; funding acquisition, P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 52272334, 52502403), the Provincial Key R&D Program of Zhejiang (2024C01180), National Key Research and Development Program of China (2017YFE0194700), EC H2020 Project (690713) and National “111” Centre on Safety and Intelligent Operation of Sea Bridges (D21013).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Prediction flow chart.
Figure 2. MDURP framework.
Figure 3. Processing of residual data.
Figure 4. Comparison of residual prediction. (a) Prediction results of LSTM-only model; (b) Prediction results of wavelet-GARCH-LSTM model.
Figure 5. Study area: San Diego Freeway, CA, USA. The red-highlighted section indicates the specific corridor selected for experimental validation. (Base map © OpenStreetMap contributors, available under the Open Database License [ODbL]).
Figure 6. Prediction results of CTM.
Figure 7. Performance comparison at different prediction time steps.
Figure 8. Performance improvement of the MDURP models in comparison with the respective baseline models.
Figure 9. Average prediction error (MAE) by time.
Table 1. Genetic algorithm configuration for CTM calibration.
Objective function: minimize MAE between CTM-simulated and observed speeds
Population size: 100
Number of generations: 50
Selection: tournament selection (tournament size = 3)
Crossover: simulated binary crossover (SBX), probability = 0.9
Mutation: polynomial mutation, rate = 0.1
Search range (V_f): [50, 70] mph
Search range (Q_max): [600, 900] veh/5 min
Search range (w, wave speed): [12, 20] mph
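The calibration procedure of Table 1 can be sketched as a compact genetic-algorithm loop. This is a minimal, self-contained illustration: the objective function is a placeholder (a real run would simulate the CTM and score MAE against observed speeds), and the SBX and mutation distribution indices (`eta` values) are assumed, since Table 1 does not report them.

```python
import numpy as np

# Search bounds from Table 1: [V_f (mph), Q_max (veh/5 min), w (mph)]
LOW  = np.array([50.0, 600.0, 12.0])
HIGH = np.array([70.0, 900.0, 20.0])

rng = np.random.default_rng(0)

def mae_objective(params):
    """Placeholder fitness. A real implementation would run the CTM with
    these parameters and return the MAE against observed speeds; this
    stand-in scores distance to an arbitrary illustrative reference."""
    ref = np.array([65.5, 750.0, 15.0])  # illustrative target only
    return np.mean(np.abs(params - ref))

def tournament(pop, fit, k=3):
    """Tournament selection with tournament size 3 (Table 1)."""
    idx = rng.choice(len(pop), size=k, replace=False)
    return pop[idx[np.argmin(fit[idx])]].copy()

def sbx(p1, p2, eta=15.0, pc=0.9):
    """Simulated binary crossover, probability 0.9 (Table 1); eta assumed."""
    if rng.random() > pc:
        return p1.copy()
    u = rng.random(p1.shape)
    beta = np.where(u <= 0.5, (2 * u) ** (1 / (eta + 1)),
                    (1 / (2 * (1 - u))) ** (1 / (eta + 1)))
    child = 0.5 * ((1 + beta) * p1 + (1 - beta) * p2)
    return np.clip(child, LOW, HIGH)

def poly_mutation(x, eta=20.0, pm=0.1):
    """Polynomial mutation with per-gene rate 0.1 (Table 1); eta assumed."""
    y = x.copy()
    for i in range(len(y)):
        if rng.random() < pm:
            u = rng.random()
            delta = (2 * u) ** (1 / (eta + 1)) - 1 if u < 0.5 \
                else 1 - (2 * (1 - u)) ** (1 / (eta + 1))
            y[i] += delta * (HIGH[i] - LOW[i])
    return np.clip(y, LOW, HIGH)

pop = rng.uniform(LOW, HIGH, size=(100, 3))   # population size 100
for gen in range(50):                         # 50 generations
    fit = np.array([mae_objective(ind) for ind in pop])
    pop = np.array([poly_mutation(sbx(tournament(pop, fit),
                                      tournament(pop, fit)))
                    for _ in range(len(pop))])

fit = np.array([mae_objective(ind) for ind in pop])
best = pop[np.argmin(fit)]
print(best)  # calibrated [V_f, Q_max, w] under the placeholder objective
```

The loop keeps every candidate inside the physically admissible ranges of Table 1, so the calibrated CTM parameters remain interpretable.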
Table 2. Characterization of residual data.
Short-Term Volatility: 72.2585
Kurtosis Coefficient: 3.7057
Local Range: 862
Abrupt Change Density: 0.0378
Autocorrelation Function: 0.0229
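Descriptors of this kind can be reproduced with simple statistics. The exact definitions used for Table 2 are not stated, so the formulas below are plausible reconstructions (standard deviation of first differences for short-term volatility, excess kurtosis, windowed max-min for local range, share of jumps beyond three sigma for abrupt-change density, lag-1 autocorrelation), applied here to synthetic heavy-tailed data rather than the actual residuals.

```python
import numpy as np

def characterize(resid, window=12, jump_sigma=3.0):
    """Hypothetical reconstruction of the Table 2 residual descriptors."""
    d = np.diff(resid)
    short_term_vol = d.std()                       # volatility of step changes
    m, s = resid.mean(), resid.std()
    kurtosis = np.mean(((resid - m) / s) ** 4) - 3.0   # excess kurtosis
    local_range = np.mean([np.ptp(resid[i:i + window]) # mean max-min per window
                           for i in range(0, len(resid) - window, window)])
    abrupt_density = np.mean(np.abs(d) > jump_sigma * d.std())  # large-jump share
    acf1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]    # lag-1 autocorrelation
    return short_term_vol, kurtosis, local_range, abrupt_density, acf1

rng = np.random.default_rng(2)
stats = characterize(rng.standard_t(df=5, size=2000) * 30.0)
print(stats)
```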
Table 3. Some traffic status information.
Timestamp | Station | Total Flow (veh/5 min) | Avg Occupancy (%) | Avg Speed (mph)
1 May 2012 0:00 | 716674 | 127 | 2.3 | 69.00
1 May 2012 0:05 | 716674 | 154 | 2.9 | 69.1
1 May 2012 0:10 | 716674 | 154 | 2.8 | 68.8
1 May 2012 0:15 | 716674 | 142 | 2.5 | 70.8
Table 4. Prediction parameter settings.
Module | Parameter | Value
CTM | Δt | 30 s
CTM | V_f | 65.5 mph (range 50–70)
CTM | w | 15 mph (range 12–20)
CTM | Q_max | 750 veh/5 min (range 600–900)
LSTM | Learning rate | 0.001
LSTM | Optimization solver | Adam
LSTM | LSTM layers | 2 layers with 64 hidden units
LSTM | Time window size | 12
LSTM | Batch size | 64
LSTM | Training epochs | 100
LSTM | Loss function | MAE
TCN | Learning rate | 0.001
TCN | Optimization solver | Adam
TCN | TCN layers | 4
TCN | Kernel size | 2
TCN | Dilation rate | {1, 2, 4, 8}
TCN | Time window size | 12
TCN | Batch size | 64
TCN | Training epochs | 100
TCN | Loss function | MAE
Wavelet decomposition | Wavelet family | db4
Wavelet decomposition | Decomposition level | 3
Wavelet decomposition | Extension mode | Symmetric
GARCH | Order parameters | (1, 1)
GARCH | Residual distribution assumption | Student's t
Note: Δt denotes the time step, which determines the discrete interval for state updates; V_f is the free-flow speed, representing the average speed of vehicles as density approaches zero; w is the shock wave speed, characterizing the upstream propagation speed of congestion; Q_max is the capacity, i.e., the maximum number of vehicles that can pass a road section per unit time, typically associated with the critical density.
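Of the modules in Table 4, the GARCH(1,1) variance recursion is compact enough to sketch directly. The parameter values (omega, alpha, beta) below are assumed for illustration, not fitted; in practice the model would be fitted with Student's-t innovations, e.g. via the `arch` package, and the residuals would first pass through a db4 wavelet decomposition, e.g. `pywt.wavedec(x, 'db4', level=3, mode='symmetric')`.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic heavy-tailed stand-in for the CTM residual series
resid = rng.standard_t(df=5, size=500) * 10.0

# GARCH(1,1): sigma2[t] = omega + alpha * e[t-1]^2 + beta * sigma2[t-1]
omega, alpha, beta = 5.0, 0.1, 0.85   # assumed values for the sketch
sigma2 = np.empty_like(resid)
sigma2[0] = resid.var()               # initialize at the sample variance
for t in range(1, len(resid)):
    sigma2[t] = omega + alpha * resid[t - 1] ** 2 + beta * sigma2[t - 1]

vol = np.sqrt(sigma2)                 # conditional volatility feature
standardized = resid / vol            # variance-stabilized residuals
print(standardized.std())
```

Feeding the conditional volatility (or the standardized residuals) to the LSTM/TCN is one way the heteroskedastic structure noted in the abstract can be made explicit to the learner.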
Table 5. Prediction performance obtained from the baseline models and the MDURP models with different measurements.
Model | MAE | MAPE (%) | RMSE
1 step
CTM | 53.50 | 15.57 | 75.24
LSTM | 33.80 | 9.89 | 45.42
TCN | 33.58 | 9.72 | 45.40
MDURP-LSTM | 31.31 | 8.67 | 44.63
MDURP-TCN | 31.11 | 8.57 | 44.45
3 steps
CTM | 78.53 | 16.72 | 95.98
LSTM | 42.41 | 12.89 | 56.22
TCN | 41.87 | 12.80 | 55.88
MDURP-LSTM | 37.86 | 10.66 | 52.02
MDURP-TCN | 37.97 | 10.58 | 52.07
6 steps
CTM | 76.89 | 19.78 | 105.49
LSTM | 45.02 | 13.18 | 59.25
TCN | 47.09 | 14.01 | 63.59
MDURP-LSTM | 41.01 | 11.42 | 56.39
MDURP-TCN | 43.08 | 11.86 | 58.85
9 steps
CTM | 87.92 | 23.96 | 117.89
LSTM | 51.19 | 16.94 | 67.71
TCN | 53.07 | 16.79 | 70.78
MDURP-LSTM | 47.87 | 13.29 | 66.07
MDURP-TCN | 49.06 | 13.63 | 67.23
The best results are bold-marked. All results are averaged over three independent runs with different random seeds; the standard deviations are less than 1% of the reported values, indicating stable performance.
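For reference, the three error measures reported in Table 5 follow their standard definitions; a minimal implementation is shown below (the MAPE form assumes nonzero observations, which holds for the flow data of Table 3).

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(y - yhat))

def mape(y, yhat):
    """Mean absolute percentage error; assumes y contains no zeros."""
    return np.mean(np.abs((y - yhat) / y)) * 100.0

def rmse(y, yhat):
    """Root mean squared error."""
    return np.sqrt(np.mean((y - yhat) ** 2))

# Illustrative values, not data from the study
y    = np.array([100.0, 120.0, 150.0, 90.0])
yhat = np.array([110.0, 115.0, 140.0, 95.0])
print(mae(y, yhat), round(mape(y, yhat), 2), round(rmse(y, yhat), 2))
```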
Lv, H.; Lou, X.; Mou, J.; Papageorgiou, M.; Huang, Z.; Zheng, P. A Physics-Constrained Residual Learning Framework for Robust Freeway Traffic Prediction. Sustainability 2026, 18, 3228. https://doi.org/10.3390/su18073228