1. Introduction
As urbanization accelerates, urban traffic demand continues to increase, and traffic signal control remains a central mechanism for mitigating recurrent congestion and improving road-network efficiency [1,2]. Recent mobility statistics further demonstrate the practical importance of this problem. The 2025 Urban Mobility Report reports that the average auto commuter in U.S. urban areas wasted 63 h per year in congestion in 2024, and that the national congestion cost reached approximately US$269 billion in 2024 [3]. At the signal-operation level, poor traffic signal timing has been reported to account for an estimated 10% of all traffic delay, corresponding to approximately 300 million vehicle-hours on major roadways, while adaptive signal control technologies can improve travel time by more than 10% on average [4]. These findings indicate that signal control is a critical operational factor in urban mobility. Classical timing plans and heuristic adaptive systems are still widely used in practice, but their effectiveness declines when traffic demand becomes strongly time-varying, spatially coupled, and non-stationary [1,2]. In recent years, reinforcement learning has become a major direction for adaptive traffic signal control because it can optimize phase switching directly from observed traffic states and long-term control objectives [1,2,5,6].
As learning-based control has progressed, research has shifted from single-intersection optimization to coordinated decision making across multiple intersections. This shift is motivated by the operational structure of arterial and grid networks: traffic progression, queue storage, and downstream receiving capacity are jointly determined by multiple adjacent signals [7]. Coordinated timing plans are therefore developed for a series of intersections rather than only for isolated intersections, and time–space analysis is commonly used to evaluate progression opportunities, delay, stops, queuing, and queue spillback [7]. In this setting, optimizing each intersection independently may reduce a local queue while disturbing platoon progression, increasing downstream spillback risk, or transferring bottlenecks to neighboring intersections. The methodological development of learning-based traffic signal control can be summarized in three stages. First, learning-based controllers began to replace fixed rules and enabled adaptive timing at individual intersections, as illustrated by pressure-based and phase-competition formulations such as PressLight [8] and Learning Phase Competition [9]. Second, researchers modeled information exchange among intersections through graph structures, hierarchical cooperation, neighborhood communication, and collaborator selection, as represented by CoLight [10], HiLight [11], FGLight [12], CoSLight [13], and X-Light [14]. Third, sequence modeling extended traffic signal control from instantaneous state response to spatio-temporal decision making over historical trajectories, where states, actions, and returns are encoded in a unified sequence interface [15,16,17]. This development reflects a transition from local reactive control to network-level coordination and trajectory-level process modeling.
Despite this progress, two core bottlenecks remain. First, most reinforcement learning methods for traffic signal control still rely on online trial-and-error interaction. Training cycles are long, and early exploration can trigger severe congestion, making deployment costly and risky [1,2,8,10]. Second, in offline or sequential settings, many models still optimize only a history-to-action mapping, that is, they recover the current action from a historical window [18,19,20,21,22]. This objective alone is often insufficient. In practice, two samples may show similar current queues and phase states, but the correct actions can still differ: one intersection may face an imminent upstream platoon arrival, while another may be blocked by downstream spillback; similarly, propagation delays can make near-identical snapshots evolve in different ways a few steps later. If training uses only action labels, the model can fit observed actions without learning the underlying short-term evolution of congestion.
From a broader spatio-temporal perspective, short-term traffic states are structured rather than random. Recent traffic prediction studies show that variables such as flow, queue length, and pressure follow stable but nontrivial short-term dynamics when spatial coupling and temporal delay are modeled explicitly [23,24,25,26,27,28]. Exploiting this structure effectively requires three capabilities: constrained modeling of spatial relations, explicit characterization of propagation delay, and stable shared-representation learning with continuous auxiliary supervision. This raises a central question for offline multi-intersection control: can short-term future inbound-queue evolution serve as a structural constraint for control learning, rather than remain a separate prediction task decoupled from action decision making?
To answer this question, we propose PR-STLight, a prediction-regularized spatio-temporal sequence framework for offline multi-intersection traffic signal control. The central idea is that near-future queue evolution should be recoverable from the shared representation. We therefore do not treat queue prediction as an isolated auxiliary output. Instead, a useful representation should recover the current action and preserve short-term congestion propagation information at the same time. Following this design, PR-STLight combines topology-aware spatio-temporal encoding with a dedicated queue-prediction branch in a staged training scheme. The main contributions are as follows:
(1) We formulate offline multi-intersection traffic signal control as a prediction-regularized spatio-temporal sequence imitation problem. In addition to action recovery, short-term future inbound-queue evolution is used as structural supervision, which alleviates the under-constrained representation learning caused by action labels alone.
(2) We develop a unified local-topology-aware spatio-temporal architecture that integrates neighborhood-constrained spatial self-attention, causal temporal self-attention, and a Topology-Recurrent Queue Predictor (TRQP). This design jointly models local coordination and temporal propagation, while enforcing representation consistency through continuous short-horizon queue extrapolation.
(3) We propose a stability-oriented two-stage optimization strategy for fixed offline replay buffers. The model first performs queue-prediction pretraining and then switches to joint control-prediction optimization. We further combine cross-trajectory future-label masking, a log1p-Huber prediction objective, and progressive auxiliary gradient injection to reduce early gradient interference and improve convergence stability.
Unlike backbone-only sequence models, PR-STLight unifies spatio-temporal encoding and queue-prediction regularization in a single framework. The resulting design is associated with improved traffic efficiency and congestion mitigation over the TransformerLight backbone, while also showing more stable optimization behavior in offline training.
The remainder of the manuscript is organized as follows.
Section 2 reviews related work on traffic signal control, traffic-flow theory, offline decision making, and spatio-temporal traffic prediction.
Section 3 formulates the offline multi-intersection control problem and defines the auxiliary prediction target.
Section 4 presents the PR-STLight architecture and the two-stage optimization strategy.
Section 5 reports the experimental setup, main comparison, demand-level and stress-test evaluations, ablation and sensitivity analyses, and limitations.
3. Problem Formulation
3.1. Problem Definition for Multi-Intersection Traffic Signal Control
The regional traffic network is modeled as a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ denotes the set of controlled intersections, $\mathcal{E}$ denotes road connections, and $N = |\mathcal{V}|$ is the number of controlled intersections. Let the control interval be $\Delta t$. Under this formulation, multi-intersection signal control is cast as a finite-horizon Markov decision process.
Figure 1 illustrates the representative intersection layout and local-topology abstraction used in this task.
At the task level, the control objective is to maximize the expected discounted cumulative return,
$$\max_{\pi} \; J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t}\, r_t\right].$$
Here, $r_t$ denotes the network-level reward at step $t$, for example, the sum or average of the intersection-level rewards under the adopted evaluation protocol; $\gamma$ is the discount factor and $T$ is the horizon length.
In the present setting, policy learning is constrained to a fixed offline trajectory dataset $\mathcal{D}$, without online environment interaction. Accordingly, the implemented training objective is behavior cloning with prediction regularization: a parameterized policy maps historical spatio-temporal context to executable phase actions, while short-term future inbound-queue trends are incorporated as representation-level regularization.
Compared with single-intersection control, multi-intersection control requires both rapid local response and network-level coordination. The policy must account for arrival-flow propagation, queue spillback, and bottleneck transfer across intersections. This requirement becomes more critical under heavy or unbalanced traffic, where the release strategy at one intersection can directly affect neighboring intersections and, in turn, the wider region. Accordingly, the target problem is offline multi-intersection traffic signal control with explicit spatial correlation and temporal delay.
3.2. State, Action, and Reward Modeling
To model intersection traffic at fine granularity, lane-level observations are used. Suppose that intersection $i$ has $n_i$ controlled inbound lanes. At time $t$, the feature vector of lane $l$ is denoted by $x_{i,l,t} \in \mathbb{R}^{d_x}$. The raw state matrix of intersection $i$ is defined as
$$X_{i,t} = [x_{i,1,t}, x_{i,2,t}, \dots, x_{i,n_i,t}]^{\top} \in \mathbb{R}^{n_i \times d_x}.$$
Here, $d_x$ denotes the lane-level feature dimension. State features include entering and leaving vehicle counts, queued vehicle counts, pressure-related descriptors, and segment-level lane statistics. The pressure-related descriptors provide information about local inbound–outbound traffic imbalance, while the reward design is kept queue-based to maintain a direct and reproducible congestion-oriented optimization objective. The current phase is not directly concatenated with the raw state vector; instead, it is injected through an independent phase embedding, preserving a functional separation between traffic observations and control context.
The action space employs discrete phase control. Let the candidate phase set at each intersection be $\mathcal{A}$. The control action of intersection $i$ at time $t$ satisfies
$$a_{i,t} \in \mathcal{A}.$$
Here, each action corresponds to selecting a phase index from the available signal phases. Additional execution constraints are examined separately in the constrained evaluation discussed in Section 5.6.
The reward function is defined as a queue-length-based negative cost, because queue accumulation directly reflects local congestion and is consistent with the main efficiency metrics used in this study. For intersection $i$ at time $t$, the instantaneous reward is defined as
$$r_{i,t} = w_q\, q_{i,t},$$
where $q_{i,t}$ denotes the total inbound queue length,
$$q_{i,t} = \sum_{l=1}^{n_i} q_{i,l,t}.$$
Here, $q_{i,l,t}$ is the queued vehicle count on inbound lane $l$ of intersection $i$, and $n_i$ is the number of controlled inbound lanes. The coefficient $w_q$ is set to $-0.25$, corresponding to the implementation setting {"queue_length": −0.25}. No additional pressure term is included in the reward, which is equivalent to setting $w_p = 0$ in the general queue–pressure formulation.
This choice keeps the reward definition simple and reproducible and avoids introducing an additional pressure-weight hyperparameter. The queue-based reward is used for PR-STLight and TransformerLight (base) under the offline sequence-learning setting. Other baselines follow their standard reward definitions and configurations unless otherwise specified.
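The queue-based reward described above can be illustrated with a minimal sketch. The function names and the `aggregate` argument are hypothetical and introduced only for illustration; the weight of −0.25 is taken from the stated implementation setting, and the lane queues in the usage note are made-up values.

```python
# Weight taken from the stated setting {"queue_length": -0.25}.
QUEUE_WEIGHT = -0.25


def intersection_reward(lane_queues, queue_weight=QUEUE_WEIGHT):
    """Queue-based reward of one intersection at one decision step.

    lane_queues: queued vehicle counts on the controlled inbound lanes.
    """
    total_inbound_queue = sum(lane_queues)
    return queue_weight * total_inbound_queue


def network_reward(per_intersection_lane_queues, aggregate="sum"):
    """Network-level reward as the sum or the average of intersection rewards.

    The choice between "sum" and "mean" is a placeholder for whichever
    aggregation the evaluation protocol adopts.
    """
    rewards = [intersection_reward(q) for q in per_intersection_lane_queues]
    if aggregate == "sum":
        return sum(rewards)
    return sum(rewards) / len(rewards)
```

For example, an intersection with lane queues `[4, 0, 2, 6]` has a total inbound queue of 12 vehicles and therefore a reward of −3.0.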
3.3. Offline Trajectories and Sequential Decision Modeling
Because online trial-and-error is costly and risky in real traffic systems, training is conducted on a fixed offline dataset. During parameter updates, PR-STLight does not generate new online interactions with CityFlow. Instead, it reads states, actions, next states, and rewards from pre-stored replay-buffer memory files, while the simulator is used only for independent policy evaluation. Let the offline trajectory set be
$$\mathcal{D} = \{\tau_m\}_{m=1}^{M},$$
where the $m$th trajectory is a control sequence of length $T_m$,
$$\tau_m = \{(X_t, a_t, r_t)\}_{t=1}^{T_m}.$$
Here, each tuple $(X_t, a_t, r_t)$ represents the joint multi-intersection state, action, and reward at decision step $t$.
The offline replay buffer is a fixed CityFlow-compatible memory prepared before model training and is not updated during parameter optimization. Each memory record contains the state, action, next-state, and reward information required for offline sequence learning. The stored actions are behavior actions obtained from fixed replay-buffer files. These buffers are either loaded from pre-generated CityFlow-compatible memory files or generated before training using the same benchmark data-collection protocol. The data-collection protocol is parameterized by the behavior controller, traffic-demand profile, and collection-round setting; once generated or loaded, the replay buffer remains fixed throughout PR-STLight training. PR-STLight is therefore optimized within the state–action coverage contained in the fixed replay buffer through behavior cloning with prediction regularization.
The memory contains complete traffic-control episodes generated under the benchmark traffic-demand profiles. Each episode lasts 3600 s, and the control interval is 15 s, so the stored data preserve temporally ordered state–action–reward trajectories. The collection-round setting controls the number of complete episodes in each replay buffer, and the corresponding configurations and replay-buffer files are provided as part of the data availability materials. These trajectories cover typical traffic regimes in the benchmark scenarios, including light–traffic intervals, queue formation, congested periods, and queue dissipation. During sequence construction, historical windows and future prediction targets are formed only within valid trajectory segments, so invalid cross-episode labels are avoided.
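The window construction described above can be sketched as follows. This is an illustrative sketch, not the released data pipeline: `build_windows`, `history_len`, and `horizon` are hypothetical names, and the routine is applied to one episode at a time so that no history window or future label crosses an episode boundary. With 3600 s episodes and a 15 s control interval, each episode contains 3600 / 15 = 240 decision steps.

```python
def build_windows(queue_series, history_len, horizon):
    """Build (history, future_labels, mask) samples inside one episode.

    queue_series: total inbound queue per decision step for one
    intersection within a single episode.
    Labels beyond the end of the episode are padded with 0.0 and
    flagged with mask value 0 so they can be excluded from the loss.
    """
    samples = []
    num_steps = len(queue_series)
    for end in range(history_len - 1, num_steps):
        history = queue_series[end - history_len + 1 : end + 1]
        labels, mask = [], []
        for k in range(1, horizon + 1):
            idx = end + k
            if idx < num_steps:
                labels.append(queue_series[idx])
                mask.append(1)
            else:
                labels.append(0.0)  # padding, ignored through the mask
                mask.append(0)
        samples.append((history, labels, mask))
    return samples
```

Calling the function separately for every episode, rather than on a concatenated buffer, is what keeps invalid cross-episode labels out of the training set in this sketch.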
PR-STLight organizes each offline trajectory segment into a reward–state–action sequence to preserve the chronological structure of the control process. For a continuous time window of length $L$, the shared backbone output is
$$H = \{h_{i,\ell}\} \in \mathbb{R}^{L \times N \times d}, \qquad \ell \in \{t-L+1, \dots, t\}, \quad i \in \{1, \dots, N\}.$$
Here, $\ell$ denotes the time-step index within the sliding window ending at decision step $t$, $N$ is the number of controlled intersections, and $d$ is the hidden dimension. The hidden representation $h_{i,\ell}$ encodes the lane-level traffic state of intersection $i$ together with phase, spatial-neighborhood, and temporal context.
Within this sequence interface, the phase decision is predicted from the shared spatio-temporal traffic representation. The stored action values in the replay buffer provide supervised phase labels for the control objective, while the serialized reward and action slots preserve the offline trajectory sequence format. Accordingly, the main control cues are extracted from historical traffic states, phase context, neighborhood interaction, and temporal order.
The control head predicts phase probabilities as
$$p_{i,t} = \mathrm{softmax}(W_c\, h_{i,t} + b_c).$$
Here, $W_c$ and $b_c$ denote the linear projection parameters of the control head.
The control objective is formulated as discrete phase classification, and the control head is optimized with cross-entropy loss,
$$\mathcal{L}_{\mathrm{ctrl}} = -\,\mathbb{E}_{(i,t) \sim \mathcal{D}}\left[\log p_{i,t}(a_{i,t})\right].$$
This objective fits the conditional action distribution on the fixed offline trajectory dataset. It is equivalent to behavior cloning under discrete phase classification and serves as the primary control target in the subsequent multi-task optimization.
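A linear control head followed by softmax and cross-entropy can be sketched in plain Python as below. The function names are hypothetical, and the sketch uses nested lists instead of tensors purely to stay dependency-free; it illustrates behavior cloning as phase classification, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of phase logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def control_head(hidden, weight, bias):
    """Map one hidden vector to phase probabilities.

    hidden: d floats; weight: one row of d floats per phase; bias: one
    float per phase.
    """
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weight, bias)
    ]
    return softmax(logits)


def behavior_cloning_loss(prob_rows, action_labels):
    """Mean cross-entropy between predicted phases and stored actions."""
    losses = [-math.log(p[a]) for p, a in zip(prob_rows, action_labels)]
    return sum(losses) / len(losses)
```

With an untrained head that outputs uniform probabilities over four phases, the loss equals log 4, which is a convenient sanity check before training.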
3.4. Definition of the Auxiliary Prediction Task
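To explicitly encode short-term traffic evolution, an auxiliary multi-step inbound-queue prediction task is introduced in addition to the control task. Consistent with the current implementation, the auxiliary target is defined at the intersection level, that is, the total inbound queue. This target is directly related to congestion accumulation and provides stable continuous supervision on a fixed offline replay buffer. Let the total inbound queue of intersection $i$ at time $t$ be
$$q_{i,t} = \sum_{l=1}^{n_i} q_{i,l,t},$$
where $q_{i,l,t}$ is the queued vehicle count on inbound lane $l$ and $n_i$ is the number of controlled inbound lanes. For the prediction horizon $K$, the corresponding multi-step label is
$$y_{i,t} = [q_{i,t+1}, q_{i,t+2}, \dots, q_{i,t+K}] \in \mathbb{R}^{K},$$
where $K$ is the number of future decision steps used for auxiliary supervision.
Let the prediction-head output be $\hat{z}_{i,t} \in \mathbb{R}^{K}$. Near the end of a trajectory, complete $K$-step labels may be unavailable; therefore, a validity mask $m_{i,t,k} \in \{0, 1\}$ is introduced. To ensure non-negativity and improve robustness to large values and outliers, predictions are transformed by softplus,
$$\hat{y}_{i,t,k} = \mathrm{softplus}(\hat{z}_{i,t,k}),$$
and optimized with a log1p-transformed Huber loss,
$$\mathcal{L}_{\mathrm{pred}} = \frac{\sum_{i,t,k} m_{i,t,k}\, \mathrm{Huber}\big(\log(1 + \hat{y}_{i,t,k}) - \log(1 + y_{i,t,k})\big)}{\sum_{i,t,k} m_{i,t,k}}.$$
Here, $m_{i,t,k}$ masks invalid future labels from incomplete trajectory suffixes. This design reduces the disturbance from long-tail and large-error samples and maintains a stable supervision scale under heavy congestion. The overall training objective is
$$\mathcal{L} = \mathcal{L}_{\mathrm{ctrl}} + \lambda(r)\, \mathcal{L}_{\mathrm{pred}},$$
where $\mathcal{L}_{\mathrm{ctrl}}$ is the control loss, $\mathcal{L}_{\mathrm{pred}}$ is the auxiliary prediction loss, and $\lambda(r)$ controls the contribution of the auxiliary task at the joint-training epoch $r$.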
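The masked log1p-Huber objective can be sketched as follows. Two details are assumptions made for illustration because the text does not fix them: the Huber threshold `delta` (set to 1.0 here) and the normalization by the number of valid labels. Function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); output is positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def huber(err, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large errors."""
    a = abs(err)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def masked_log1p_huber(raw_preds, labels, mask, delta=1.0):
    """Average Huber loss on log1p-transformed queues over valid labels.

    raw_preds: unconstrained predictor outputs.
    labels: true future queue values (padding where mask is 0).
    mask: 1 for valid future labels, 0 for missing ones.
    """
    total, count = 0.0, 0
    for z, y, m in zip(raw_preds, labels, mask):
        if m:
            err = math.log1p(softplus(z)) - math.log1p(y)
            total += huber(err, delta)
            count += 1
    return total / count if count else 0.0
```

The log1p transform compresses large queue values, so an error of a few vehicles on a queue of 100 contributes far less than the same error on a queue of 2, which matches the stated goal of keeping the supervision scale stable under heavy congestion.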
4. Method
4.1. Overall Framework
Offline multi-intersection traffic signal control must learn from fixed trajectories while still handling neighborhood coordination and delayed congestion propagation. This setting raises two modeling challenges. First, learning only from action labels may provide insufficient supervision for capturing near-future congestion evolution. Second, the model must capture both local spatial interaction and delayed temporal dynamics without introducing unstable optimization behavior. To address these challenges, we propose PR-STLight, which combines topology-aware spatio-temporal encoding, queue-prediction regularization, and staged optimization. The framework has four tightly coupled modules: input encoding, coordination-aware spatio-temporal representation learning, a queue-prediction regularization branch, and two-stage joint optimization.
As shown in Figure 2, lane-level traffic observations, phase context, and time context are first mapped into unified state representations. Neighborhood-constrained spatial attention is then used to model local coordination, and causal temporal attention captures propagation-delay dependencies across intersections. The resulting shared representation is consumed by two heads: a control head for current phase prediction and TRQP for multi-step inbound-queue prediction. During training, the queue-prediction loss acts as structural regularization for the shared representation.
Within this architecture, input encoding provides a consistent sequence interface for offline trajectories; the spatio-temporal module captures neighborhood interaction and delayed propagation; TRQP supplies continuous supervision on future queue evolution; and staged joint optimization coordinates control and auxiliary prediction for stable learning on fixed replay data.
4.2. Spatio-Temporal Representation Module
The shared backbone is designed to extract spatio-temporal representations from historical traffic observations for coordinated multi-intersection control. Because coordination and propagation are primarily reflected in traffic-state dynamics, explicit spatio-temporal enhancement is first applied to state tokens. The enhanced state tokens are then combined with reward and action tokens at the same time step and fed into the sequence backbone. Under this “state enhancement first, sequence modeling second” design, the model first captures traffic-state dependencies and then learns higher-order decision relations.
For the observation at intersection $i$ and time $t$, the lane-level traffic state matrix $X_{i,t}$ is projected into the hidden space and fused with phase and time embeddings to obtain a basic state representation,
$$z_{i,t} = W_s\, \mathrm{vec}(X_{i,t}) + W_{\mathrm{ph}}\, u_{i,t} + e_{t},$$
where $W_s$ and $W_{\mathrm{ph}}$ are the state and phase projection matrices, $u_{i,t}$ denotes the phase embedding input, and $e_{t}$ is the time embedding indexed by the time-context token. This operation aligns traffic state, control context, and temporal position in a shared feature space for subsequent dependency modeling.
Along the spatial dimension, neighborhood-constrained multi-head self-attention is used to model interactions among intersection states at the same time step. Let the state representations of all intersections at time $t$ be
$$Z_t = [z_{1,t}, z_{2,t}, \dots, z_{N,t}]^{\top} \in \mathbb{R}^{N \times d}.$$
Here, $N$ is the number of controlled intersections. To keep information exchange consistent with road topology, a spatial attention mask $\mathbf{M}_{\mathrm{sp}}$ is constructed from the adjacency relation, so each intersection attends only to itself and its first-order neighbors. The spatial enhancement is
$$\tilde{Z}_t = \mathrm{LN}\big(Z_t + \alpha\, \mathrm{MHSA}(Z_t; \mathbf{M}_{\mathrm{sp}})\big),$$
where $\mathrm{MHSA}(\cdot)$ denotes masked multi-head self-attention, $\alpha$ is a learnable residual scaling factor, and $\mathrm{LN}(\cdot)$ denotes LayerNorm.
This module has two roles. First, it captures local traffic-flow coupling, queue spillback, and short-range propagation among neighboring intersections. Second, the neighborhood mask suppresses direct interaction between distant, irrelevant nodes, thereby reducing noise from unconstrained global attention. The learnable residual scaling factor also keeps the module close to a stable residual mapping in early training, mitigating optimization oscillation when spatial and temporal attention are stacked.
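The neighborhood constraint can be illustrated with a single-head sketch that builds the mask from an edge list and computes masked attention weights. Several simplifications are assumptions for illustration: edges are treated as undirected, query and key projections are the identity, and only the attention weights are returned. The function names are hypothetical.

```python
import math


def neighborhood_mask(num_nodes, edges):
    """allowed[i][j] is True when j is i itself or a first-order neighbor.

    Edges are treated as undirected here, which is an assumption of
    this sketch.
    """
    allowed = [[i == j for j in range(num_nodes)] for i in range(num_nodes)]
    for u, v in edges:
        allowed[u][v] = True
        allowed[v][u] = True
    return allowed


def masked_attention_weights(states, allowed):
    """Scaled dot-product attention weights restricted by the mask.

    states: one d-dimensional vector per intersection (same time step).
    Disallowed pairs receive a score of -inf and hence zero weight.
    """
    dim = len(states[0])
    weights = []
    for i, query in enumerate(states):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if allowed[i][j] else float("-inf")
            for j, key in enumerate(states)
        ]
        peak = max(scores)  # finite, because self-attention is always allowed
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

On a four-intersection corridor 0-1-2-3, intersection 0 distributes all of its attention between itself and intersection 1, while intersections 2 and 3 receive exactly zero weight.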
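After spatial enhancement, causal temporal self-attention is applied along each intersection’s time axis to model traffic-state propagation and accumulation over the historical window. For intersection $i$, let the spatially enhanced sequence be
$$\tilde{Z}_i = [\tilde{z}_{i,t-L+1}, \dots, \tilde{z}_{i,t}] \in \mathbb{R}^{L \times d}.$$
Then the temporal enhancement is
$$\bar{Z}_i = \mathrm{LN}\big(\tilde{Z}_i + \mathrm{MHSA}(\tilde{Z}_i; \mathbf{M}_{\mathrm{c}})\big),$$
where $\mathbf{M}_{\mathrm{c}}$ is a causal mask that allows each time step to access only current and previous steps, preventing future-information leakage.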
Causal temporal modeling is consistent with the traffic-control decision setting: the current phase decision can depend only on current and past observations. Under this constraint, the temporal module captures delayed effects of traffic-wave propagation, queue buildup and dissipation, and phase switching. Applying temporal attention to spatially enhanced states captures two complementary aspects: how intersections interact and how these interactions evolve over time.
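The effect of the causal mask can be checked with a small single-head sketch. Identity projections and the absence of residual connections and LayerNorm are simplifications assumed here; the function names are hypothetical.

```python
import math


def causal_mask(window_len):
    """mask[q][k] is True when key step k is not later than query step q."""
    return [[k <= q for k in range(window_len)] for q in range(window_len)]


def causal_attention(sequence):
    """Causal self-attention over one intersection's history window.

    sequence: list of L vectors ordered in time. Each output vector is
    a weighted average of the current and earlier vectors only.
    """
    window_len, dim = len(sequence), len(sequence[0])
    mask = causal_mask(window_len)
    outputs = []
    for q_idx, query in enumerate(sequence):
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            if mask[q_idx][k_idx] else float("-inf")
            for k_idx, key in enumerate(sequence)
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([
            sum(e / total * key[c] for e, key in zip(exps, sequence))
            for c in range(dim)
        ])
    return outputs
```

A practical leakage test follows directly from the mask: changing the last element of the window must leave the outputs at all earlier steps unchanged.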
After two-stage enhancement in space and time, the spatio-temporal state sequence for intersection $i$ is
$$\bar{Z}_i = [\bar{z}_{i,t-L+1}, \dots, \bar{z}_{i,t}].$$
The enhanced state representation is then arranged with the reward and action entries at each time step and fed into the sequence backbone in chronological order. The resulting sequence for intersection $i$ is
$$S_i = \big(r_{i,t-L+1},\, \bar{z}_{i,t-L+1},\, a_{i,t-L+1},\; \dots,\; r_{i,t},\, \bar{z}_{i,t},\, a_{i,t}\big).$$
Here, $\bar{z}_{i,\ell}$ denotes the spatio-temporally enhanced traffic-state representation at time step $\ell$. In this sequence construction, the enhanced state tokens carry lane-level traffic dynamics, phase context, neighborhood interaction, and temporal information extracted by the spatial and temporal attention modules. Reward and action slots are inserted in chronological order to preserve the inherited TransformerLight-style token layout. The control head reads the state-position hidden representations for phase imitation, and the auxiliary queue-prediction branch uses the same shared representation for future-queue regularization.
4.3. Topology-Recurrent Queue Predictor (TRQP)
To preserve future congestion information during training, a TRQP branch is coupled with the shared spatio-temporal representation and serves as a structural regularizer rather than a standalone forecasting module. It aggregates local spatial information within each time block using road topology and recursively predicts total inbound queues over the next $K$ steps along the time axis. This mechanism encourages the shared representation to retain information useful for short-term queue propagation. The input to the predictor is
$$H \in \mathbb{R}^{L \times N \times d},$$
where $L$ is the historical block length, $N$ is the number of controlled intersections, and $d$ is the hidden dimension.
More specifically, the predictor first applies a topology-aware aggregation layer with a residual connection to each time block to refine local spatial dependencies. A GRU is then applied along the temporal axis for each intersection. Finally, a linear projection outputs predicted total inbound queues for the next $K$ steps. For the shared hidden representation $H_s \in \mathbb{R}^{N \times d}$ of time block $s$, the structural enhancement is
$$\tilde{H}_s = H_s + \beta\, \sigma\big(\hat{A}\, H_s\, W_g\big),$$
where $\hat{A}$ is the normalized adjacency matrix, $W_g$ is the graph-convolution weight matrix, $\sigma(\cdot)$ is a nonlinear transform, and $\beta$ is a learnable residual scaling factor. Let $\tilde{h}_{i,s}$ denote the feature of intersection $i$ in $\tilde{H}_s$. The temporal representation is updated recursively by
$$g_{i,s} = \mathrm{GRU}\big(\tilde{h}_{i,s},\, g_{i,s-1}\big),$$
and the multi-step prediction is obtained by linear projection,
$$\hat{z}_{i,s} = W_o\, g_{i,s} + b_o \in \mathbb{R}^{K},$$
where $W_o$ and $b_o$ denote the output-projection weight and bias of the prediction head.
Here, the prediction head outputs a future K-step queue prediction for every input time block, rather than a single prediction from the final time step only. This design provides continuous supervision on queue evolution over the full historical block, strengthens the ability of the shared representation to capture short-term congestion evolution, and supplies a stable auxiliary signal during joint optimization.
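The three TRQP steps (topology-aware aggregation with a residual connection, a GRU along time, and a linear output at every time block) can be sketched in plain Python. This is a sketch under stated assumptions, not the authors' code: the adjacency is row-normalized with self-loops, the nonlinearity is tanh, the GRU uses the common update/reset/candidate gate form, and all parameter names in `params` are hypothetical. The raw outputs would still pass through softplus before the loss.

```python
import math


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def normalized_adjacency(num_nodes, edges):
    """Row-normalized adjacency with self-loops (assumed normalization)."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1.0
    return [[a / sum(row) for a in row] for row in adj]


def topology_block(block, adj_norm, weight, beta):
    """Residual aggregation: h_i + beta * tanh(W * sum_j A[i][j] * h_j)."""
    n_nodes, dim = len(block), len(block[0])
    out = []
    for i in range(n_nodes):
        agg = [sum(adj_norm[i][j] * block[j][c] for j in range(n_nodes))
               for c in range(dim)]
        proj = matvec(weight, agg)
        out.append([h + beta * math.tanh(p) for h, p in zip(block[i], proj)])
    return out


def gru_step(x, h, p):
    """One GRU update for a single intersection."""
    z = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wz"], x), matvec(p["Uz"], h), p["bz"])]
    r = [sigmoid(a + b + c) for a, b, c in
         zip(matvec(p["Wr"], x), matvec(p["Ur"], h), p["br"])]
    n = [math.tanh(a + ri * b + c) for a, ri, b, c in
         zip(matvec(p["Wn"], x), r, matvec(p["Un"], h), p["bn"])]
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]


def trqp_forward(blocks, adj_norm, params):
    """blocks: L time blocks, each a list of N vectors of size d.

    Returns raw K-step predictions with shape L x N x K, one prediction
    per input time block rather than only for the final step.
    """
    n_nodes = len(blocks[0])
    hidden = [[0.0] * len(params["bz"]) for _ in range(n_nodes)]
    predictions = []
    for block in blocks:
        refined = topology_block(block, adj_norm, params["Wg"], params["beta"])
        hidden = [gru_step(refined[i], hidden[i], params)
                  for i in range(n_nodes)]
        predictions.append([
            [o + b for o, b in zip(matvec(params["Wo"], hidden[i]), params["bo"])]
            for i in range(n_nodes)
        ])
    return predictions
```

Emitting a prediction at every block, as the loop does, is what provides supervision over the whole history window instead of a single label at the final step.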
4.4. Two-Stage Training and Joint Optimization
To incorporate short-term queue supervision into traffic signal control learning in a stable manner, a two-stage training strategy is adopted. In the first stage, only the auxiliary prediction loss is used to pretrain the shared backbone and prediction head, allowing the model to learn a basic representation of inbound-queue dynamics. In the second stage, the control loss $\mathcal{L}_{\mathrm{ctrl}}$ and auxiliary prediction loss $\mathcal{L}_{\mathrm{pred}}$ are optimized jointly, so the shared representation supports both current action prediction and future queue modeling. The two-stage objectives are
$$\mathcal{L}_{\mathrm{stage\,1}} = \mathcal{L}_{\mathrm{pred}}, \qquad \mathcal{L}_{\mathrm{stage\,2}}(r) = \mathcal{L}_{\mathrm{ctrl}} + \lambda(r)\, \mathcal{L}_{\mathrm{pred}}.$$
Here, $\mathcal{L}_{\mathrm{ctrl}}$ is the control loss, $\mathcal{L}_{\mathrm{pred}}$ is the auxiliary prediction loss, $r$ is the joint-training epoch index, and $\lambda(r)$ is a dynamic weight for the auxiliary loss.
Because the auxiliary prediction task can introduce strong gradient interference early in joint training, progressive auxiliary gradient injection is used to smooth the influence of the prediction branch on the shared backbone. The shared hidden representation used by the auxiliary branch is reparameterized as
$$H_{\mathrm{aux}} = \rho\, H + (1 - \rho)\, \mathrm{sg}(H),$$
where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation and $\rho \in [0, 1]$ is a mixing coefficient that increases with training progress. When $\rho = 0$, the auxiliary predictor updates only prediction-branch parameters. As $\rho$ increases gradually, the prediction loss is propagated to the shared backbone in a more stable manner.
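The stop-gradient mixing leaves the forward value unchanged and scales only the gradient that reaches the backbone. The scalar sketch below makes this explicit with a hand-written gradient for a squared-error loss. The linear ramp is a hypothetical schedule chosen for illustration, since the text only states that the coefficient increases with training progress; all function names are likewise hypothetical.

```python
def linear_ramp(epoch, warmup_epochs, max_value=1.0):
    """Hypothetical schedule: grow linearly from 0 to max_value."""
    if warmup_epochs <= 0:
        return max_value
    return max_value * min(1.0, max(0.0, epoch / warmup_epochs))


def mixed_forward_and_backbone_grad(h, target, rho):
    """Scalar illustration of rho * h + (1 - rho) * stop_gradient(h).

    Loss: 0.5 * (mixed - target) ** 2.
    The stop-gradient copy has the same value as h but contributes no
    gradient, so the backbone receives rho times the upstream gradient.
    """
    mixed = rho * h + (1.0 - rho) * h  # forward value equals h
    upstream = mixed - target          # d(loss) / d(mixed)
    backbone_grad = rho * upstream     # only the rho-weighted path is live
    return mixed, backbone_grad
```

With a mixing coefficient of 0, the backbone gradient is exactly zero and only the predictor would be updated; with a coefficient of 1, the auxiliary loss reaches the backbone at full strength.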
Algorithm 1 summarizes the full training pipeline used in practice.
Algorithm 1. Two-stage training procedure of PR-STLight
Require: Offline replay buffer $\mathcal{D}$, history length $L$, prediction horizon $K$, pretraining epochs $E_{\mathrm{pre}}$, joint-training rounds $R$
Ensure: Trained control policy $\pi$
1: Initialize shared backbone parameters $\theta$, control-head parameters $\phi$, and predictor parameters $\psi$
2: for $e = 1$ to $E_{\mathrm{pre}}$ do
3:     Sample mini-batches from $\mathcal{D}$ and construct length-$L$ sequences with valid future-label masks
4:     Encode lane states, phase embeddings, and temporal embeddings
5:     Apply neighborhood-constrained spatial attention and causal temporal attention to obtain shared representation $H$
6:     Predict future inbound queues with TRQP and compute $\mathcal{L}_{\mathrm{pred}}$
7:     Update $\theta$ and $\psi$ using $\mathcal{L}_{\mathrm{pred}}$
8: end for
9: for $r = 1$ to $R$ do
10:     Sample mini-batches from $\mathcal{D}$ and build reward–state–action sequences
11:     Compute shared representation $H$ and phase logits with the backbone and control head
12:     Compute control loss $\mathcal{L}_{\mathrm{ctrl}}$ with the control head
13:     Predict $K$-step future inbound queues and compute masked prediction loss $\mathcal{L}_{\mathrm{pred}}$
14:     Set the auxiliary weight $\lambda(r)$ and backbone mixing factor $\rho(r)$
15:     Update the network using $\mathcal{L}_{\mathrm{ctrl}} + \lambda(r)\, \mathcal{L}_{\mathrm{pred}}$
16: end for
17: return $\pi$
With this two-stage strategy and progressive joint optimization, the model injects future traffic-evolution information into shared-representation learning while preserving the dominant role of the control objective. This design is expected to mitigate gradient interference and stabilize representation learning in early multi-task training.
5. Experiments
5.1. Experimental Setup
5.1.1. Simulation Platform and Datasets
Experiments are conducted on CityFlow [48], an efficient open-source simulator designed for large-scale multi-agent traffic control. CityFlow provides lane-level observations, a standardized phase-control interface, and reproducible evaluation settings, which makes it suitable for constructing offline replay buffers and conducting fair benchmark comparison. We use two real-city benchmark networks: Jinan and Hangzhou. Jinan contains 12 controlled intersections, and Hangzhou contains 16 controlled intersections. Comparative results are reported on both networks, while the ablation study focuses on Jinan. These two benchmarks provide controlled and reproducible grid-based multi-intersection settings for evaluating the offline sequence-control capability of PR-STLight.
Figure 3 shows the layouts of the Jinan and Hangzhou benchmark networks used in the experiments.
5.1.2. Comparison Methods
We compare PR-STLight with representative methods from three categories.
Rule-based baselines: FixedTime and SOTL. FixedTime uses preset phase durations, and SOTL switches phases using local waiting-based thresholds.
Online learning-based baselines: PressLight [8], CoLight [10], HiLight [11], and InitLight [29]. PressLight uses pressure-based rewards for policy learning. CoLight models multi-intersection coordination with graph attention. HiLight adopts a hierarchical reinforcement learning framework. InitLight uses adversarial inverse reinforcement learning to generate a strong initialization policy.
The offline sequence baseline is TransformerLight (base), which follows the same Transformer-style control backbone adopted in recent sequence-based traffic-signal-control studies [15,17]. However, it does not include explicit spatio-temporal enhancement or future-queue prediction regularization.
5.1.3. Metrics and Statistical Evaluation Protocol
We report two primary metrics: average travel time (ATT) and average inbound queue (AIQ). ATT evaluates global network efficiency, whereas AIQ measures congestion accumulation at inbound approaches. Both metrics are computed at the network level for each evaluation round, and lower values indicate better performance.
For stochastic learning-based methods, we conduct five independent training runs using different random seeds. For each random seed, a run-level score is first obtained by averaging the final ten evaluation rounds after convergence. The reported mean ± standard deviation is then calculated across these five run-level scores. Therefore, the reported standard deviation reflects between-run uncertainty rather than fluctuations within a single training run. FixedTime and SOTL are deterministic rule-based reference methods and are reported as mean values only.
To assess statistical reliability, we conduct two-sided paired t-tests on the seed-level scores using matched random seeds. PR-STLight is compared with TransformerLight (base), and the corresponding significance results are discussed together with the main comparison. The implementation details and main experimental settings of PR-STLight are summarized in Table 1.
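The aggregation and testing protocol can be sketched with the Python standard library. The sketch assumes the sample standard deviation (n − 1 denominator), which the text does not specify, and it returns only the paired t statistic and degrees of freedom; converting these to a p-value requires a t-distribution table or a statistics package. All function names are hypothetical, and any numbers passed to them in examples are illustrative, not results from the paper.

```python
import math
import statistics


def run_level_score(eval_rounds, last_n=10):
    """Average of the final evaluation rounds of one training run."""
    return statistics.fmean(eval_rounds[-last_n:])


def mean_std(run_scores):
    """Mean and between-run sample standard deviation across seeds."""
    return statistics.fmean(run_scores), statistics.stdev(run_scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic on matched-seed scores.

    Returns (t, degrees_of_freedom); a negative t means method A has
    the lower (better) metric on average.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.fmean(diffs) / standard_error, n - 1


def relative_reduction(baseline, improved):
    """Percentage reduction of a lower-is-better metric."""
    return (baseline - improved) / baseline * 100.0
```

With five matched seeds, the test has four degrees of freedom, so each pairwise comparison rests on five paired differences only.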
5.2. Main Comparative Results and Analysis
Table 2 and Table 3, together with Figure 4, show that PR-STLight achieves consistently strong performance across both benchmarks. In terms of ATT, PR-STLight obtains the best results on both Jinan and Hangzhou, reducing ATT by 21.27% and 22.54% relative to TransformerLight (base), respectively. For each benchmark network, seed-level paired t-tests are conducted between PR-STLight and TransformerLight (base) using matched random-seed run scores. The improvements of PR-STLight are statistically significant for both ATT and AIQ on Jinan and Hangzhou in all four tests. These results indicate that the proposed prediction-regularized spatio-temporal representation improves network-level travel efficiency under the evaluated offline control setting.
The AIQ results show a more nuanced pattern. On Jinan , CoLight achieves the lowest average inbound queue, while PR-STLight still reduces AIQ by 46.28% compared with TransformerLight (base), indicating that it remains competitive in queue mitigation while showing a clearer advantage in average travel time. This suggests that the two methods emphasize somewhat different aspects of traffic optimization on this network: CoLight is more competitive in reducing queues, whereas PR-STLight is more favorable for network-level travel-time optimization. Even so, PR-STLight still performs better than several other learning-based baselines on the queue metric, which indicates that its travel-time advantage is not obtained at the cost of severely degraded queue control.
On Hangzhou, PR-STLight performs best on both ATT and AIQ. Compared with TransformerLight (base), it reduces the queue metric by 66.29%, and its advantage is more consistent than that observed on Jinan. This result suggests that the proposed spatio-temporal representation is particularly effective on the evaluated Hangzhou benchmark, where the numbers of controlled intersections and local coordination interactions are larger than those in Jinan. The corresponding curves also show that PR-STLight stays in a relatively stable low-value region during the later training stage, which is consistent with its final quantitative advantage on Hangzhou.
A comparison with TransformerLight (base) is also informative. Since both methods share the same Transformer-style sequence backbone, the observed performance gap can be attributed mainly to the additional spatio-temporal modeling and prediction-regularized supervision introduced in PR-STLight. This comparison indicates that the performance gain does not stem solely from sequence modeling, but is also associated with the topology-aware spatio-temporal representation and future-queue structural supervision.
As shown in Figure 4, PR-STLight generally enters a lower-error region earlier than most baselines in the travel-time curves. In the queue curves, its late-stage trajectory is also relatively smooth, especially on Hangzhou. Taken together with the tabulated results, these observations are consistent with faster convergence and more stable training behavior, although the degree of advantage varies across datasets and metrics.
5.3. Evaluation Under Different Traffic Demand Levels
To further examine the robustness of PR-STLight under different traffic intensities, we conduct additional evaluations on both the Jinan and Hangzhou networks with three demand scaling factors: 0.7, 1.0, and 1.5, corresponding to low-demand, medium-demand, and oversaturated conditions, respectively. All reported values follow the statistical aggregation protocol described in Section 5.1.3. For learning-based methods, the percentages in parentheses denote relative changes compared with TransformerLight (base) under the same demand level, which helps contextualize each method against the same Transformer-style backbone reference. The detailed results are reported in Table 4 and Table 5.
As shown in Table 4 and Table 5, PR-STLight consistently improves over TransformerLight (base) across all tested demand levels.
On the Hangzhou network, PR-STLight achieves the best ATT and AIQ under low-demand, medium-demand, and oversaturated conditions. Compared with TransformerLight (base), PR-STLight reduces ATT by 46.62%, 28.88%, and 29.07% under 0.7×, 1.0×, and 1.5× demand, respectively. The corresponding AIQ reductions are 90.91%, 75.43%, and 67.81%. These results indicate that the proposed prediction-regularized spatio-temporal design remains effective as traffic pressure increases.
On the Jinan network, PR-STLight also shows substantial gains over TransformerLight (base) across all three demand levels. It reduces ATT by 23.38%, 25.25%, and 17.92%, and reduces AIQ by 61.93%, 54.49%, and 30.90% under 0.7×, 1.0×, and 1.5× demand, respectively. Under 0.7× and 1.0× demand, PR-STLight achieves the best ATT and AIQ among the compared methods. Under the oversaturated 1.5× setting, CoLight obtains the lowest ATT and AIQ, while PR-STLight still ranks second on both metrics and remains clearly better than TransformerLight (base). This result suggests that PR-STLight remains competitive under severe congestion, although the relative advantage of different methods may vary with traffic intensity and network characteristics.
Overall, the demand-level analysis shows that PR-STLight provides stable gains over its direct Transformer backbone across both networks and all tested demand levels. It achieves the strongest overall results on Hangzhou and on Jinan under low-to-medium demand, while the oversaturated Jinan case reveals that a graph-attention coordination baseline can still be more effective under severe congestion. These findings support the robustness of the proposed prediction-regularized spatio-temporal design, while keeping the performance claims aligned with the observed benchmark results.
5.4. Stress Test Under Sudden Demand Surge
To further evaluate the robustness of PR-STLight beyond recurrent congestion patterns, we add a representative sudden-demand-surge stress test on the Hangzhou network. This setting is designed to reflect a short-term demand-side disturbance, such as abrupt inflow growth or temporary crowd aggregation, while keeping the road topology and signal-control constraints unchanged.
Each evaluation episode lasts 3600 s, and the signal-control interval remains 15 s. The original traffic-flow profile is kept unchanged before and after the disturbance. During the shock window from 1200 s to 1800 s, vehicle arrivals are increased by 50%, corresponding to a 1.5× demand surge. After 1800 s, the traffic demand returns to the original benchmark profile. The road network, lane availability, signal phase definitions, action interval, reward setting, and execution rules are kept unchanged. Therefore, this experiment represents a demand-side non-recurrent disturbance rather than a capacity-side incident such as a crash or lane blockage.
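The surge construction can be illustrated with the sketch below, which operates on a plain list of vehicle arrival times. It is an assumption-laden illustration rather than the procedure used to build the surge traffic file: the function name, the half-open window [1200 s, 1800 s), and the choice to release each additional vehicle at the same time as an existing one are all hypothetical.

```python
def apply_demand_surge(arrival_times, start=1200.0, end=1800.0, factor=1.5):
    """Return sorted arrival times with arrivals in [start, end) scaled by `factor`.

    Arrivals outside the window are left untouched. Inside the window,
    extra vehicles are added deterministically: every existing arrival
    contributes (factor - 1) of an extra vehicle, and one is released
    whenever the accumulated credit reaches 1.
    """
    if factor < 1.0:
        raise ValueError("this sketch only handles demand increases")
    surged = []
    credit = 0.0
    for t in sorted(arrival_times):
        surged.append(t)
        if start <= t < end:
            credit += factor - 1.0
            while credit >= 1.0:
                surged.append(t)  # additional vehicle at the same release time
                credit -= 1.0
    return surged


# One arrival every 10 s over a 3600 s episode (illustrative profile).
baseline = list(range(0, 3600, 10))
surged = apply_demand_surge(baseline)
in_window = sum(1 for t in surged if 1200 <= t < 1800)
print(len(baseline), len(surged), in_window)
```

Only the arrival process changes in this sketch; topology, phases, and control interval are untouched, which matches the demand-side nature of the disturbance described above.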
For the offline sequence-control models, training still uses the original recurrent offline replay buffer. The sudden-surge traffic file is used only for simulator-based evaluation. No shock-specific replay buffer is generated, and no model is fine-tuned on the surge scenario. This protocol evaluates zero-shot responsiveness to an unseen demand shock rather than performance after retraining on the disrupted distribution.
As shown in Table 6, PR-STLight achieves the lowest ATT and AIQ among the compared methods under the sudden-demand-surge scenario. Since the surge trajectories are not included in the offline training buffer, the stress-test results reflect the ability of the learned policy to respond to abrupt demand changes during evaluation. Compared with TransformerLight (base), PR-STLight reduces ATT from 399.74 s to 298.82 s, corresponding to a 25.25% reduction, and reduces AIQ from 121.21 vehicles to 42.58 vehicles, corresponding to a 64.87% reduction. These results indicate that the proposed prediction-regularized spatio-temporal design remains effective when traffic demand changes abruptly.
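The reported percentages follow from the usual relative-reduction formula, (baseline − value) / baseline. The short check below reproduces them from the ATT and AIQ values quoted above; the helper name is hypothetical.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100.0


# Values quoted in the text for TransformerLight (base) vs. PR-STLight.
att = relative_reduction(399.74, 298.82)
aiq = relative_reduction(121.21, 42.58)
print(f"ATT reduction: {att:.2f}%")  # ATT reduction: 25.25%
print(f"AIQ reduction: {aiq:.2f}%")  # AIQ reduction: 64.87%
```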
In addition to the comparison with TransformerLight (base), PR-STLight also outperforms CoLight, which is the strongest external learning baseline in this stress test. Specifically, PR-STLight further reduces ATT by 1.59% and AIQ by 8.48% compared with CoLight. This provides a conservative comparison and shows that PR-STLight remains competitive even against a strong graph-attention-based coordination method under non-recurrent demand shocks.
The comparison with PR-STLight w/o TRQP further highlights the value of the future-queue prediction branch under sudden demand changes. The w/o TRQP variant preserves the spatio-temporal control backbone but removes the future-queue prediction branch. Removing TRQP increases ATT from 298.82 s to 304.48 s and AIQ from 42.58 vehicles to 47.93 vehicles. This degradation indicates that future-queue supervision helps the shared representation capture short-term queue buildup when demand changes abruptly, thereby improving robustness under demand-side shock conditions.
The curve trends in Figure 5 are consistent with the table-level results, with PR-STLight maintaining lower ATT and AIQ than the compared variants throughout the stress test. Additional incident-driven disruptions are further discussed in Section 5.6.
5.5. Component and Parameter Analysis
5.5.1. Ablation Study and Analysis
To examine the contribution of each component in PR-STLight, we conduct an ablation study on the Jinan network using average travel time as the evaluation metric. The results are summarized in Table 7, where the variants remove different combinations of the proposed components.
As shown in Table 7 and Figure 6, the three components do not contribute in the same way. Removing the spatio-temporal module (w/o ST) leads to the clearest degradation in both final performance and convergence speed: its average travel time is noticeably higher than that of PR-STLight, and its convergence round is also substantially later. This suggests that explicit spatio-temporal modeling is important not only for final control quality but also for optimization efficiency during training.
The role of TRQP is more evident in training stability than in the final mean alone. Although w/o TRQP remains close to the full model in final average travel time, its result variance is clearly larger and its convergence is later. This pattern suggests that the prediction branch provides a useful regularizing effect on the shared representation, helping the model reach a stable low-error region more reliably.
The effect of pretraining is reflected mainly in convergence behavior and late-stage stability. As shown in Table 7, w/o Pretrain obtains a slightly lower final mean ATT than the full PR-STLight model. This result shows that the pretraining stage does not necessarily lead to the lowest final ATT in this ablation setting. However, PR-STLight converges earlier and exhibits smaller late-stage variation. Its ConvRound is 6.0, compared with 11.0 for w/o Pretrain, and its ATT standard deviation is 0.46, compared with 1.20 for w/o Pretrain. These results suggest that pretraining improves optimization stability and convergence speed, with a small empirical trade-off in final mean ATT.
The small standard deviation of PR-STLight indicates that its late-stage performance is maintained more steadily across evaluation rounds. Variants that remove one or more components tend to show slower convergence, larger late-stage variation, or both.
Finally, removing both components causes the most severe degradation in final ATT. Although w/o Both has an early ConvRound, its curve quickly flattens at a high-error region and remains far from the stable low-error regime reached by PR-STLight. Taken together, these results suggest that the spatio-temporal module, the prediction branch, and the pretraining strategy play complementary roles: the spatio-temporal module supports traffic-interaction representation, the prediction branch improves representation regularity during training, and the pretraining stage improves early optimization stability and convergence behavior.
5.5.2. Sensitivity Analysis of Historical Length and Prediction Horizon
To examine the influence of the historical block length L and the auxiliary prediction horizon K, we conduct a controlled sensitivity analysis on the Jinan network. Specifically, we vary L while fixing K = 3, and vary K while fixing L = 8. All other training settings and replay buffers are kept unchanged. The results are reported as the mean and standard deviation over the final ten evaluation rounds after convergence.
As shown in Table 8, PR-STLight maintains stable performance across the tested values of L and K. The average travel time remains within a narrow range from 273.08 s to 276.06 s, and the average inbound queue also varies within a limited range. These results indicate that the prediction-regularized framework is not overly sensitive to moderate changes in the historical context length or the auxiliary prediction horizon.
When K is fixed at 3, the two shorter history lengths, including the adopted L = 8, produce very close ATT values and comparable AIQ values, whereas the longest tested history length leads to a mild degradation in both metrics (Table 8). This suggests that a relatively short historical window can already capture most useful recent traffic context, while an overly long historical window may introduce redundant or less relevant information into the sequence representation. Compared with the shorter setting, L = 8 achieves a slightly lower AIQ and a very close ATT, while providing a longer temporal context for modeling short-term queue evolution. In the implemented sequence construction, a shorter historical block can also generate more training blocks from the same offline trajectories, thereby increasing the number of mini-batch updates within each epoch. Therefore, considering performance, temporal-context coverage, and training workload, this study adopts L = 8 as a balanced setting.
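The relationship between block length and the number of training blocks can be made concrete with a small sketch. The construction below is an assumption, not the authors' implementation: it cuts one trajectory into non-overlapping history blocks of length L and pairs each with the next K steps as the auxiliary target. The function name, the values L = 4 and L = 16, and the 240-step trajectory (for example, a 3600 s episode at a 15 s control interval) are illustrative choices only.

```python
def build_blocks(queue_series, L, K):
    """Split one trajectory into (history, future) training blocks.

    Each history block holds L consecutive steps and is paired with the
    following K steps as the future-queue prediction target. Blocks do
    not overlap in their history part (stride = L, an assumption).
    """
    if L <= 0 or K <= 0:
        raise ValueError("L and K must be positive")
    blocks = []
    start = 0
    while start + L + K <= len(queue_series):
        history = queue_series[start:start + L]
        future = queue_series[start + L:start + L + K]
        blocks.append((history, future))
        start += L
    return blocks


trajectory = list(range(240))  # placeholder queue values, one per control step
for L in (4, 8, 16):
    print(L, len(build_blocks(trajectory, L, K=3)))
```

Under this construction the block count falls roughly in proportion to 1/L, so halving L about doubles the number of mini-batch samples drawn from the same offline trajectories.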
When L is fixed at 8, the results under the three tested prediction horizons are also close, indicating that PR-STLight is relatively stable under different short-term prediction horizons. The K = 1 setting provides a very short and stable prediction target, whereas the longest tested horizon introduces a longer future target with slightly higher uncertainty. The default setting K = 3 is retained as a moderate horizon because it provides richer short-term future-queue supervision than a one-step target while avoiding the additional uncertainty associated with a longer prediction horizon.
Based on these observations, the main experiments adopt L = 8 and K = 3 as a moderate configuration. This setting provides sufficient historical context, preserves short-term future-queue supervision, and keeps the auxiliary prediction task and training workload controlled.
5.6. Limitations and Future Work
The experiments in this study are conducted on two standard CityFlow benchmark networks, namely Jinan and Hangzhou. These benchmarks provide controlled and reproducible multi-intersection scenarios for comparing PR-STLight with existing baselines and analyzing the effect of prediction regularization. However, the present evaluation is still centered on grid-based networks. Extending PR-STLight to more irregular and heterogeneous urban networks will require broader offline replay buffers and more flexible topology and action modeling. Future work will examine networks with asymmetric connectivity, diverse phase configurations, uneven lane capacities, and spatially unbalanced demand patterns.
The additional experiments evaluate PR-STLight under different demand levels and a sudden-demand-surge stress test. These results provide an initial assessment of the model under demand-side variations and non-recurrent demand shocks. However, demand-side perturbations do not fully represent capacity-side disruptions, such as crashes and lane blockages, where lane availability and effective road capacity may change during operation. Future work will extend the evaluation to incident-driven scenarios by modeling incident duration, lane availability, capacity reduction, and post-incident recovery.
The current state representation mainly uses queue-, flow-, and pressure-related variables that are consistently available in the adopted CityFlow benchmark setting. This compact design helps maintain fair and reproducible comparisons with existing baselines. Richer traffic-state variables, such as occupancy, approach speed, turn ratios, and downstream blocking indices, may further improve the description of link utilization, arrival dynamics, route choice, and downstream receiving capacity. However, incorporating these variables would require additional feature extraction, normalization, replay-buffer alignment, and corresponding baseline reimplementation. Future work will therefore investigate feature-enriched variants of PR-STLight together with dedicated ablation analysis.
Practical signal operation also involves execution constraints beyond general phase selection. As an additional constrained evaluation, we tested PR-STLight on the Hangzhou network with yellow intervals and minimum-green requirements enforced during signal execution. Compared with TransformerLight (base), PR-STLight reduces ATT from 369.95 s to 312.22 s, AIQ from 91.92 vehicles to 52.41 vehicles, and phase changes per hour from 2567.50 to 2134.13. These results provide preliminary evidence that PR-STLight remains executable under basic timing constraints. Future work will further incorporate pedestrian phases, transit signal priority, and emergency vehicle preemption through priority-request features, action masks, phase-extension rules, and safety execution layers.
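A minimum-green requirement of the kind mentioned above can be enforced by a thin execution layer between the policy and the simulator. The sketch below is illustrative only: the class name and the step-based minimum-green value are hypothetical, yellow intervals are omitted for brevity, and it is not the constrained-execution code used in the experiment.

```python
class MinGreenExecutor:
    """Hold the current phase until it has been served for a minimum time.

    `min_green_steps` is expressed in control steps (hypothetical value).
    `switches` counts executed phase changes, which can be normalized to
    phase changes per hour.
    """

    def __init__(self, min_green_steps, initial_phase=0):
        self.min_green_steps = min_green_steps
        self.phase = initial_phase
        self.held = 0       # control steps the current phase has been served
        self.switches = 0   # number of executed phase changes

    def step(self, requested_phase):
        """Return the phase actually executed for this control step."""
        if requested_phase != self.phase and self.held >= self.min_green_steps:
            self.phase = requested_phase
            self.held = 0
            self.switches += 1
        self.held += 1
        return self.phase


executor = MinGreenExecutor(min_green_steps=2)
executed = [executor.step(p) for p in [1, 1, 1, 0, 0, 0]]
print(executed, executor.switches)  # [0, 0, 1, 1, 0, 0] 2
```

Requests that arrive before the minimum green has elapsed are simply deferred, so a policy that requests frequent switches produces fewer executed phase changes under this layer.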
6. Conclusions
This study proposed PR-STLight, a prediction-regularized spatio-temporal Transformer framework for offline multi-intersection traffic signal control. The main idea is to use short-term future inbound-queue evolution as structural supervision for shared representation learning, so that the model not only imitates historical phase actions but also preserves information about near-future congestion formation and propagation. By integrating neighborhood-constrained spatial attention, causal temporal attention, TRQP, and two-stage optimization, PR-STLight provides a unified framework for learning executable signal-control policies from fixed offline replay buffers.
The experimental results show that prediction-regularized spatio-temporal learning improves the effectiveness and robustness of offline signal control under the adopted CityFlow benchmark protocol. On the Jinan and Hangzhou networks, PR-STLight achieves the lowest average travel time among the compared methods and shows clear improvement over the TransformerLight backbone. Additional evaluations under different demand levels and a sudden-demand-surge stress test further indicate that the proposed design remains beneficial under traffic-demand variations and representative non-recurrent demand shocks. The ablation and sensitivity analyses also support the roles of the spatio-temporal module, the TRQP branch, and the selected historical and prediction windows, while showing that pretraining mainly contributes to optimization stability rather than guaranteeing the best final score in every setting.
From an engineering perspective, PR-STLight is most relevant to offline traffic-signal-control scenarios where historical trajectory data are available and direct online trial-and-error is undesirable. Such settings are common in multi-intersection corridors or urban grid areas where unsafe exploration, long training cycles, and congestion propagation make online reinforcement learning difficult to deploy directly. The proposed framework suggests that combining action imitation with short-term traffic-evolution supervision can be a practical way to improve offline policy learning for coordinated signal control.
The demonstrated claims remain bounded by the current experimental setting. The evaluation is based on two standard CityFlow grid benchmarks and a compact traffic-state representation. Broader deployment will require validation on irregular urban networks, richer input features, capacity-side incidents such as crashes and lane blockages, and more realistic operational constraints such as pedestrian phases, transit priority, and emergency-vehicle preemption. These extensions will be the focus of our future work toward more realistic and deployable prediction-regularized offline signal-control systems.