Prediction-Regularized Spatio-Temporal Transformer Framework for Offline Multi-Intersection Traffic Signal Control

Deng, Yueting; Li, Huale; Xia, Tong; Wang, Zhaobin; Lei, Ruoming

doi:10.3390/app16105156


by Yueting Deng ¹, Huale Li ¹, Tong Xia ¹, Zhaobin Wang ¹ and Ruoming Lei ²,*

¹ School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
² School of Media Engineering, Lanzhou University of Arts and Science, Lanzhou 730000, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(10), 5156; https://doi.org/10.3390/app16105156

Submission received: 7 April 2026 / Revised: 17 May 2026 / Accepted: 19 May 2026 / Published: 21 May 2026

(This article belongs to the Special Issue Advances in Intelligent Decision-Making Systems)


Abstract

Multi-intersection traffic signal control must jointly address local coordination and delayed traffic propagation under strongly time-varying conditions. Existing offline sequence-imitation methods mainly recover actions from historical trajectories and make limited use of short-term future traffic evolution in shared-representation learning. To address this issue, we propose PR-STLight, a prediction-regularized spatio-temporal extension of TransformerLight for offline multi-intersection traffic signal control. PR-STLight introduces short-term future inbound-queue evolution as structural supervision for shared representation learning. The model combines neighborhood-constrained spatial self-attention, causal temporal self-attention, and a Topology-Recurrent Queue Predictor (TRQP) to capture topology-aware spatio-temporal dependencies and near-future congestion dynamics. Training adopts a two-stage strategy, namely queue-prediction pretraining followed by joint control-prediction optimization, to improve optimization stability on a fixed offline replay buffer. In experiments on the adopted CityFlow benchmarks, PR-STLight obtains average travel times of 274.39 s on Jinan $3 \times 4$ and 288.09 s on Hangzhou $4 \times 4$, corresponding to 1.14% and 2.82% lower travel times than the strongest non-PR baseline, and 21.27% and 22.54% lower travel time than the TransformerLight backbone, respectively. It also achieves the lowest average inbound queue on Hangzhou and remains competitive on Jinan. These results show that PR-STLight provides an effective offline spatio-temporal sequence framework for coordinated multi-intersection signal control.

Keywords: traffic signal control; offline sequence imitation; TransformerLight; spatio-temporal representation; prediction regularization

1. Introduction

As urbanization accelerates, urban traffic demand continues to increase, and traffic signal control remains a central mechanism for mitigating recurrent congestion and improving road-network efficiency [1,2]. Recent mobility statistics further demonstrate the practical importance of this problem. The 2025 Urban Mobility Report reports that the average auto commuter in U.S. urban areas wasted 63 h per year in congestion in 2024, and that the national congestion cost reached approximately US$269 billion in 2024 [3]. At the signal-operation level, poor traffic signal timing has been reported to account for an estimated 10% of all traffic delay, corresponding to approximately 300 million vehicle-hours on major roadways, while adaptive signal control technologies can improve travel time by more than 10% on average [4]. These findings indicate that signal control is a critical operational factor in urban mobility. Classical timing plans and heuristic adaptive systems are still widely used in practice, but their effectiveness declines when traffic demand becomes strongly time-varying, spatially coupled, and non-stationary [1,2]. In recent years, reinforcement learning has become a major direction for adaptive traffic signal control because it can optimize phase switching directly from observed traffic states and long-term control objectives [1,2,5,6].

As learning-based control has progressed, research has shifted from single-intersection optimization to coordinated decision making across multiple intersections. This shift is motivated by the operational structure of arterial and grid networks: traffic progression, queue storage, and downstream receiving capacity are jointly determined by multiple adjacent signals [7]. Coordinated timing plans are therefore developed for a series of intersections rather than only for isolated intersections, and time–space analysis is commonly used to evaluate progression opportunities, delay, stops, queuing, and queue spillback [7]. In this setting, optimizing each intersection independently may reduce a local queue while disturbing platoon progression, increasing downstream spillback risk, or transferring bottlenecks to neighboring intersections. The methodological development of learning-based traffic signal control can be summarized in three stages. First, learning-based controllers began to replace fixed rules and enabled adaptive timing at individual intersections, as illustrated by pressure-based and phase-competition formulations such as PressLight [8] and Learning Phase Competition [9]. Second, researchers modeled information exchange among intersections through graph structures, hierarchical cooperation, neighborhood communication, and collaborator selection, as represented by CoLight [10], HiLight [11], FGLight [12], CoSLight [13], and X-Light [14]. Third, sequence modeling extended traffic signal control from instantaneous state response to spatio-temporal decision making over historical trajectories, where states, actions, and returns are encoded in a unified sequence interface [15,16,17]. This development reflects a transition from local reactive control to network-level coordination and trajectory-level process modeling.

Despite this progress, two core bottlenecks remain. Firstly, most reinforcement learning methods for traffic signal control still rely on online trial-and-error interaction. Training cycles are long, and early exploration can trigger severe congestion, making deployment costly and risky [1,2,8,10]. Secondly, in offline or sequential settings, many models still optimize only a history-to-action mapping, that is, they recover the current action from a historical window [18,19,20,21,22]. This objective alone is often insufficient. In practice, two samples may show similar current queues and phase states, but the correct actions can still differ: one intersection may face an imminent upstream platoon arrival, while another may be blocked by downstream spillback; similarly, propagation delays can make near-identical snapshots evolve in different ways a few steps later. If training uses only action labels, the model can fit observed actions without learning the underlying short-term evolution of congestion.

From a broader spatio-temporal perspective, short-term traffic states are structured rather than random. Recent traffic prediction studies show that variables such as flow, queue length, and pressure follow stable but nontrivial short-term dynamics when spatial coupling and temporal delay are modeled explicitly [23,24,25,26,27,28]. Exploiting this structure effectively requires three capabilities: constrained modeling of spatial relations, explicit characterization of propagation delay, and stable shared-representation learning with continuous auxiliary supervision. This raises a central question for offline multi-intersection control: can short-term future inbound-queue evolution serve as a structural constraint for control learning, rather than remain a separate prediction task decoupled from action decision making?

To answer this question, we propose PR-STLight, a prediction-regularized spatio-temporal sequence framework for offline multi-intersection traffic signal control. The central idea is that near-future queue evolution should be recoverable from the shared representation. We therefore do not treat queue prediction as an isolated auxiliary output. Instead, a useful representation should recover the current action and preserve short-term congestion propagation information at the same time. Following this design, PR-STLight combines topology-aware spatio-temporal encoding with a dedicated queue-prediction branch in a staged training scheme. The main contributions are as follows:

(1): We formulate offline multi-intersection traffic signal control as a prediction-regularized spatio-temporal sequence imitation problem. In addition to action recovery, short-term future inbound-queue evolution is used as structural supervision, which alleviates the under-constrained representation learning caused by action labels alone.
(2): We develop a unified local-topology-aware spatio-temporal architecture that integrates neighborhood-constrained spatial self-attention, causal temporal self-attention, and TRQP. This design jointly models local coordination and temporal propagation, while enforcing representation consistency through continuous short-horizon queue extrapolation.
(3): We propose a stability-oriented two-stage optimization strategy for fixed offline replay buffers. The model first performs queue-prediction pretraining and then switches to joint control-prediction optimization. We further combine cross-trajectory future-label masking, a log1p-Huber prediction objective, and progressive auxiliary gradient injection to reduce early gradient interference and improve convergence stability.

Unlike backbone-only sequence models, PR-STLight unifies spatio-temporal encoding and queue-prediction regularization in a single framework. The resulting design is associated with improved traffic efficiency and congestion mitigation over the TransformerLight backbone, while also showing more stable optimization behavior in offline training.

The remainder of the manuscript is organized as follows. Section 2 reviews related work on traffic signal control, traffic-flow theory, offline decision making, and spatio-temporal traffic prediction. Section 3 formulates the offline multi-intersection control problem and defines the auxiliary prediction target. Section 4 presents the PR-STLight architecture and the two-stage optimization strategy. Section 5 reports the experimental setup, main comparison, demand-level and stress-test evaluations, ablation and sensitivity analyses, and limitations.

2. Related Work

2.1. Traffic Signal Control and Multi-Intersection Coordination

Traffic signal control methods are usually grouped into three categories: rule-driven, heuristic adaptive, and learning-driven methods [1,2]. Early approaches such as FixedTime, GreenWave, SCOOT, and SCATS rely on preset timing plans or experience-based rules. They are mature in practice, but their adaptability is limited when traffic patterns vary strongly over time. Learning-based methods attempt to overcome this limitation by optimizing signal policies from observed traffic dynamics and long-term control objectives.

With the rise of deep reinforcement learning, research has moved from single-intersection policy learning to coordinated modeling across multiple intersections. PressLight [8] couples max-pressure control with reinforcement learning for arterial coordination. Learning Phase Competition [9] models competition among candidate phases. CoLight [10] introduces graph-attention-based information exchange for network-level cooperation. HiLight [11] adopts a hierarchical cooperative control mechanism. FGLight [12] further learns neighbor-level interaction patterns explicitly, whereas CoSLight [13] co-optimizes collaborator selection and decision making. X-Light [14] extends this line toward cross-city transfer with a Transformer-on-Transformer design. In addition, InitLight [29] improves policy initialization through adversarial inverse reinforcement learning, while recent traffic-control studies have explored multi-objective optimization [5] and model-based policy reuse [6]. These studies collectively show that effective multi-intersection control depends not only on local response, but also on how information is exchanged across intersections.

For coordinated multi-intersection control, existing methods can be summarized into two groups. One group uses cross-intersection information during training updates to improve gradient propagation. The other group explicitly uses representations from other intersections during decision making to strengthen cooperation. Common paradigms include graph neural networks, graph attention, hierarchical control, neighborhood communication, and trajectory-level Transformers. Even so, most studies still rely on online interaction with the environment, and many coordination mechanisms emphasize static or local aggregation rather than explicit delayed propagation and future-traffic regularization.

2.2. Traffic-Flow Theory and Congestion Propagation Mechanisms

Traffic signal control is closely related to traffic-flow mechanisms because signal timing determines the boundary conditions under which queues form, dissipate, and propagate. In coordinated arterial and grid networks, inefficient phase allocation can interrupt platoon progression, reduce downstream receiving capacity, and transfer congestion to adjacent links or intersections. These mechanisms are reflected in queue spillback, delayed congestion propagation, and bottleneck transfer in multi-intersection networks [7].

Recent traffic-flow studies further describe congestion formation and dissipation from the perspective of phase transitions. The congestion-boundary approach estimates the transition threshold between free-flow and congested traffic states, while kinetic traffic-flow models analyze traffic phase-transition phenomena from a mesoscopic perspective [30,31]. These studies indicate that short-term queue evolution is a structured traffic process associated with the formation, propagation, and recovery of congestion. This observation motivates the use of future inbound-queue prediction as auxiliary supervision in PR-STLight, so that the shared representation preserves information about near-future congestion evolution during offline control learning.

2.3. Offline Sequence Control and Sequential Decision Making

To avoid the long training time and high exploration risk of online reinforcement learning, offline sequence-control and sequential decision-making methods offer a safer and more practical route for traffic signal control [18,19,20,21]. Their core idea is to learn a control policy directly from a fixed trajectory dataset collected in advance, without continuous online interaction with the environment during training. In this study, the offline control problem is implemented as sequence imitation with prediction regularization rather than as value-based offline reinforcement learning.

Under the sequential decision-making paradigm, the decision problem is usually rewritten as conditional sequence modeling. The goal is to learn the conditional action distribution

$$p_{\theta}(a_{t} \mid R_{\leq t}, X_{\leq t}, a_{< t}). \tag{1}$$

Here, $R_{\leq t}$ denotes return-related conditioning tokens, such as returns-to-go or reward-context tokens. This expression provides a unified sequence interface for policy learning. Decision Transformer [32] established this formulation in general sequential decision making, and recent traffic-control studies have adapted it to signal control through STLight [15], Sequence Decision Transformer for adaptive traffic signal control [16], CrossLight [22], and lightweight Transformer-based offline-to-online control [17].
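As an illustration of the return-related conditioning tokens mentioned above, the following minimal sketch computes returns-to-go from a stored reward sequence. The function name `returns_to_go` and the optional discount `gamma` are assumptions made for this example; the text does not specify how the cited methods compute or normalize these tokens.

```python
from typing import List


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return R_t = sum_{k >= t} gamma^(k - t) * r_k for every step t.

    `gamma` is an illustrative parameter; gamma = 1.0 gives the plain
    (undiscounted) sum of remaining rewards.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each step costs O(1).
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


# Toy queue-based rewards (negative costs) for three decision steps.
print(returns_to_go([-2.0, -1.0, -3.0]))  # [-6.0, -4.0, -3.0]
```

The backward accumulation avoids recomputing the tail sum at every step, which matters little for one episode but keeps the preprocessing linear in trajectory length.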

Although offline sequential decision making is safer than online exploration, most existing studies still treat control learning as action recovery over a historical window [18,19,20,21]. Action labels provide only indirect supervision at decision time and do not directly constrain whether the model captures upcoming propagation and congestion evolution. As a result, shared-representation learning remains under-constrained. More systematic treatment is still needed on three aspects: explicit use of future traffic trends, stability of multi-task optimization, and consistency between method description and implementation.

2.4. Traffic Prediction and Spatio-Temporal Representation

Traffic prediction is commonly formulated as follows: given a historical observation sequence of length L and the road-network structure, predict the traffic state over the next K steps,

$$\hat{X}_{t+1:t+K} = f(X_{t-L+1:t}, G). \tag{2}$$

Here, $X$ denotes observed features such as flow, speed, occupancy, or queue length, and $G$ denotes road topology or node relations. As a representative spatio-temporal sequence modeling problem, traffic prediction must capture complex, nonlinear, and time-varying dependencies across both space and time.

From the temporal-modeling perspective, recent methods have moved from recurrent and convolutional designs toward Transformer-style or state-space sequence modeling. PDFormer [23] models propagation delay explicitly. Hybrid Transformer and Spatial-Temporal Self-Supervised Learning [33] combines Transformer forecasting with self-supervised pretraining. Rethinking Spatio-Temporal Transformer [24] and OpenCity [25] emphasize multi-view augmentation and foundation-model style generalization, respectively. ST-Mamba [26] further explores selective state-space modeling for traffic dynamics.

From the spatial-modeling perspective, graph-based and graph–Transformer methods have become a major technical route. Representative studies include Navigating Spatio-Temporal Heterogeneity [34], STGformer [27], DST-GTN [28], PT-TDGCN [35], and a transfer-aware spatio-temporal graph attention model [36]. These methods aim to capture spatial coupling and temporal evolution jointly. A typical formulation is

$$Z_{t} = \mathrm{GCN}(X_{t}, \hat{A}), \tag{3}$$

$$h_{t} = \mathrm{GRU}(Z_{t}, h_{t-1}), \tag{4}$$

$$\hat{y}_{t+1:t+K} = g(h_{t}), \tag{5}$$

where $\hat{A}$ denotes the normalized adjacency matrix, $Z_{t}$ is the spatially aggregated representation, and $g(\cdot)$ denotes the forecasting head.
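The spatial aggregation step in Equation (3) can be sketched as follows. This example assumes the common symmetric normalization with self-loops, $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$, which is one frequent choice and is not stated in the text; the learnable weights, the nonlinearity, and the recurrent update of Equation (4) are omitted. The helper names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalized_adjacency(adj: Matrix) -> Matrix:
    """Symmetric normalization with self-loops (an assumed convention)."""
    n = len(adj)
    # A + I: every node also aggregates its own features.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def aggregate(a_hat: Matrix, x: Matrix) -> Matrix:
    """Z = A_hat @ X, i.e. neighborhood averaging without learned weights."""
    n, d = len(x), len(x[0])
    return [[sum(a_hat[i][k] * x[k][c] for k in range(n)) for c in range(d)]
            for i in range(n)]


# Three intersections on a line: 0 - 1 - 2 (undirected toy topology).
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
a_hat = normalized_adjacency(adj)
z = aggregate(a_hat, [[2.0], [0.0], [0.0]])  # one feature per node
print(z[0][0])  # node 0 keeps half of its own value
```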

More recent prediction models improve traffic forecasting from complementary angles. Efficient Traffic Prediction through Spatio-Temporal Distillation [37] and Efficient Large-Scale Traffic Forecasting with Transformers [38] focus on scalability and compression. T-Graphormer [39], Multi-scale Spatial-Temporal Transformer [40], STICformer [41], and STPFormer [42] enrich Transformer design for long-range and pattern-aware forecasting. In addition, spatio-temporal Transformer–GCN coupling [43], improved Transformer forecasting [44], cross-domain Transformer fusion [45], urban signalized-intersection prediction [46], and a physics-guided stepwise framework for urban intersections [47] highlight the importance of richer structured representations and better supervision. Nevertheless, the main goal of most traffic prediction methods remains accurate estimation of future states, rather than direct support for control decisions. By contrast, offline traffic signal control focuses on recovering executable action rules from fixed trajectories while preserving sensitivity to short-term congestion propagation.

3. Problem Formulation

3.1. Problem Definition for Multi-Intersection Traffic Signal Control

The regional traffic network is modeled as a directed graph $G = (V, E)$, where $V = \{v_{1}, v_{2}, \dots, v_{N}\}$ denotes the set of controlled intersections, $E \subseteq V \times V$ denotes road connections, and $N = |V|$ is the number of controlled intersections. Let the control interval be $\Delta t$. Under this formulation, multi-intersection signal control is cast as a finite-horizon Markov decision process. Figure 1 illustrates the representative intersection layout and local-topology abstraction used in this task.

At the task level, the control objective is to maximize the expected discounted cumulative return,

$$J(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ \sum_{t=1}^{T} \gamma^{t-1} r_{t} \right]. \tag{6}$$

Here, $r_{t}$ denotes the network-level reward at step t, for example, the sum or average of the intersection-level rewards $\{r_{t}^{i}\}_{i=1}^{N}$ under the adopted evaluation protocol.

In the present setting, policy learning is constrained to a fixed offline trajectory dataset $D = \{\tau^{(m)}\}_{m=1}^{M}$, without online environment interaction. Accordingly, the implemented training objective is behavior cloning with prediction regularization: a parameterized policy $\pi_{\theta}$ maps historical spatio-temporal context to executable phase actions, while short-term future inbound-queue trends are incorporated as representation-level regularization.

Compared with single-intersection control, multi-intersection control requires both rapid local response and network-level coordination. The policy must account for arrival-flow propagation, queue spillback, and bottleneck transfer across intersections. This requirement becomes more critical under heavy or unbalanced traffic, where the release strategy at one intersection can directly affect neighboring intersections and, in turn, the wider region. Accordingly, the target problem is offline multi-intersection traffic signal control with explicit spatial correlation and temporal delay.

3.2. State, Action, and Reward Modeling

To model intersection traffic at fine granularity, lane-level observations are used. Suppose that intersection i has $L_{i}$ controlled inbound lanes. At time t, the feature vector of lane l is denoted by $x_{t,l}^{i}$. The raw state matrix of intersection i is defined as

$$X_{t}^{i} = [x_{t,1}^{i}, x_{t,2}^{i}, \dots, x_{t,L_{i}}^{i}] \in \mathbb{R}^{L_{i} \times d_{f}}. \tag{7}$$

Here, $d_{f}$ denotes the lane-level feature dimension. State features include entering and leaving vehicle counts, queued vehicle counts, pressure-related descriptors, and segment-level lane statistics. The pressure-related descriptors provide information about local inbound–outbound traffic imbalance, while the reward design is kept queue-based to maintain a direct and reproducible congestion-oriented optimization objective. The current phase is not directly concatenated with the raw state vector; instead, it is injected through an independent phase embedding, preserving a functional separation between traffic observations and control context.

The action space employs discrete phase control. Let the candidate phase set at each intersection be $\mathcal{A} = \{1, 2, \dots, |\mathcal{A}|\}$, where $|\mathcal{A}|$ is the number of candidate phases. The control action of intersection i at time t satisfies

$$a_{t}^{i} \in \mathcal{A}. \tag{8}$$

Here, each action corresponds to selecting a phase index from the available signal phases. Additional execution constraints are examined separately in the constrained evaluation discussed in Section 5.6.

The reward function is defined as a queue-length-based negative cost, because queue accumulation directly reflects local congestion and is consistent with the main efficiency metrics used in this study. For intersection i at time t, the instantaneous reward is defined as

$$r_{t}^{i} = -\alpha_{q} Q_{t}^{i}, \tag{9}$$

where $Q_{t}^{i}$ denotes the total inbound queue length,

$$Q_{t}^{i} = \sum_{l=1}^{L_{i}} q_{t,l}^{i,in}. \tag{10}$$

Here, $q_{t,l}^{i,in}$ is the queued vehicle count on inbound lane l of intersection i, and $L_{i}$ is the number of controlled inbound lanes. The coefficient is set to $\alpha_{q} = 0.25$, corresponding to the implementation setting {"queue_length": -0.25}. No additional pressure term is included in the reward, which is equivalent to setting $\alpha_{p} = 0$ in the general queue–pressure formulation.

This choice keeps the reward definition simple and reproducible and avoids introducing an additional pressure-weight hyperparameter. The queue-based reward is used for PR-STLight and TransformerLight (base) under the offline sequence-learning setting. Other baselines follow their standard reward definitions and configurations unless otherwise specified.
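As a small worked example of Equations (9) and (10), the sketch below computes the intersection-level reward from lane queue counts with $\alpha_{q} = 0.25$. The function names are hypothetical, and averaging over intersections for the network-level reward is only one of the two options the text mentions (sum or average); it is chosen here purely for illustration.

```python
from typing import List

ALPHA_Q = 0.25  # matches the stated setting {"queue_length": -0.25}


def intersection_reward(inbound_queues: List[int],
                        alpha_q: float = ALPHA_Q) -> float:
    """r_t^i = -alpha_q * Q_t^i, with Q_t^i the sum of inbound lane queues."""
    total_queue = sum(inbound_queues)
    return -alpha_q * total_queue


def network_reward(queues_per_intersection: List[List[int]]) -> float:
    """Average of intersection rewards (assumed aggregation for this sketch)."""
    rewards = [intersection_reward(q) for q in queues_per_intersection]
    return sum(rewards) / len(rewards)


# Two intersections with four inbound lanes each (toy numbers).
print(intersection_reward([3, 0, 5, 4]))            # -3.0
print(network_reward([[3, 0, 5, 4], [0, 0, 2, 2]]))  # -2.0
```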

3.3. Offline Trajectories and Sequential Decision Modeling

Because online trial-and-error is costly and risky in real traffic systems, training is conducted on a fixed offline dataset. During parameter updates, PR-STLight does not generate new online interactions with CityFlow. Instead, it reads states, actions, next states, and rewards from pre-stored replay-buffer memory files, while the simulator is used only for independent policy evaluation. Let the offline trajectory set be

$$D = \{\tau^{(m)}\}_{m=1}^{M}, \tag{11}$$

where the mth trajectory is a control sequence of length $T_{m}$,

$$\tau^{(m)} = \{(X_{1}, a_{1}, r_{1}), (X_{2}, a_{2}, r_{2}), \dots, (X_{T_{m}}, a_{T_{m}}, r_{T_{m}})\}. \tag{12}$$

Here, each tuple $(X_{t}, a_{t}, r_{t})$ represents the joint multi-intersection state, action, and reward at decision step t.

The offline replay buffer is a fixed CityFlow-compatible memory prepared before model training and is not updated during parameter optimization. Each memory record contains the state, action, next-state, and reward information required for offline sequence learning. The stored actions are behavior actions obtained from fixed replay-buffer files. These buffers are either loaded from pre-generated CityFlow-compatible memory files or generated before training using the same benchmark data-collection protocol. The data-collection protocol is parameterized by the behavior controller, traffic-demand profile, and collection-round setting; once generated or loaded, the replay buffer remains fixed throughout PR-STLight training. PR-STLight is therefore optimized within the state–action coverage contained in the fixed replay buffer through behavior cloning with prediction regularization.

The memory contains complete traffic-control episodes generated under the benchmark traffic-demand profiles. Each episode lasts 3600 s with a control interval of 15 s, which corresponds to 240 control intervals per episode, and the stored data preserve temporally ordered state–action–reward trajectories. The collection-round setting controls the number of complete episodes in each replay buffer, and the corresponding configurations and replay-buffer files are provided as part of the data availability materials. These trajectories cover typical traffic regimes in the benchmark scenarios, including light-traffic intervals, queue formation, congested periods, and queue dissipation. During sequence construction, historical windows and future prediction targets are formed only within valid trajectory segments, so invalid cross-episode labels are avoided.
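The within-episode sequence construction described above can be sketched as follows. This is an illustrative reading, not the released data pipeline: the function `build_samples`, the zero padding of missing labels, and the use of a single scalar queue series per episode are all assumptions. Calling the function once per episode ensures that no history window or future label crosses an episode boundary.

```python
from typing import List, Tuple

Sample = Tuple[List[float], List[float], List[int]]


def build_samples(queues: List[float], window: int, horizon: int) -> List[Sample]:
    """Slice one episode into (history, future labels, validity mask) triples.

    `queues` holds one value per decision step of a single episode.
    Labels that would fall beyond the episode end are padded with 0.0
    and flagged with mask 0 so a masked loss can ignore them.
    """
    samples: List[Sample] = []
    num_steps = len(queues)
    for t in range(window - 1, num_steps):
        history = queues[t - window + 1: t + 1]
        labels: List[float] = []
        mask: List[int] = []
        for k in range(1, horizon + 1):
            if t + k < num_steps:
                labels.append(queues[t + k])
                mask.append(1)
            else:
                labels.append(0.0)  # placeholder, never used by the loss
                mask.append(0)
        samples.append((history, labels, mask))
    return samples


# Toy episode with five decision steps, window L = 3, horizon K = 2.
for history, labels, mask in build_samples([1.0, 2.0, 3.0, 4.0, 5.0], 3, 2):
    print(history, labels, mask)
```

With a 3600 s episode and a 15 s control interval, the input series would have 240 entries, and only the last K windows of each episode carry partially masked labels.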

PR-STLight organizes each offline trajectory segment into a reward–state–action sequence to preserve the chronological structure of the control process. For a continuous time window of length L, the shared backbone output is

$$H = \{h_{\ell}^{i}\}_{\ell = t-L+1}^{t}, \quad H \in \mathbb{R}^{L \times N \times d}. \tag{13}$$

Here, $\ell \in [t-L+1, t]$ denotes the time-step index within the sliding window ending at decision step t, N is the number of controlled intersections, and d is the hidden dimension. The hidden representation $h_{\ell}^{i}$ encodes the lane-level traffic state of intersection i together with phase, spatial-neighborhood, and temporal context.

Within this sequence interface, the phase decision is predicted from the shared spatio-temporal traffic representation. The stored action values in the replay buffer provide supervised phase labels for the control objective, while the serialized reward and action slots preserve the offline trajectory sequence format. Accordingly, the main control cues are extracted from historical traffic states, phase context, neighborhood interaction, and temporal order.

The control head predicts phase probabilities as

$$o_{\ell}^{i} = W_{a} h_{\ell}^{i} + b_{a}, \quad \pi_{\theta}(a_{\ell}^{i} \mid H) = \mathrm{Softmax}(o_{\ell}^{i}). \tag{14}$$

Here, $W_{a}$ and $b_{a}$ denote the linear projection parameters of the control head.

The control objective is formulated as discrete phase classification, and the control head is optimized with cross-entropy loss,

$$\mathcal{L}_{ctrl} = -\frac{1}{L N} \sum_{\ell = t-L+1}^{t} \sum_{i=1}^{N} \log \pi_{\theta}(a_{\ell}^{i} \mid H). \tag{15}$$

This objective fits the conditional action distribution on the fixed offline trajectory dataset. It is equivalent to behavior cloning under discrete phase classification and serves as the primary control target in the subsequent multi-task optimization.
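A minimal sketch of the softmax control head output and the averaged cross-entropy of Equation (15) is given below. It starts from precomputed logits $o_{\ell}^{i}$ rather than from a trained backbone, uses 0-based phase indices (the text indexes phases from 1), and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one intersection's phase logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def control_loss(logits: List[List[List[float]]],
                 actions: List[List[int]]) -> float:
    """Mean negative log-likelihood over L time steps and N intersections.

    logits[l][i] is the phase-logit vector of intersection i at window step l;
    actions[l][i] is the stored behavior action as a 0-based phase index.
    """
    total, count = 0.0, 0
    for step_logits, step_actions in zip(logits, actions):
        for o, a in zip(step_logits, step_actions):
            total -= math.log(softmax(o)[a])
            count += 1
    return total / count


# One window step, two intersections, four phases, uniform logits:
# the loss equals log(4) regardless of the labels.
loss = control_loss([[[0.0] * 4, [0.0] * 4]], [[1, 3]])
print(loss, math.log(4))
```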

3.4. Definition of the Auxiliary Prediction Task

To explicitly encode short-term traffic evolution, an auxiliary multi-step inbound-queue prediction task is introduced in addition to the control task. Consistent with the current implementation, the auxiliary target is defined at the intersection level, that is, the total inbound queue. This target is directly related to congestion accumulation and provides stable continuous supervision on a fixed offline replay buffer. Let the total inbound queue of intersection i at time t be

$$y_{t}^{i} = \sum_{l=1}^{L_{i}} q_{t,l}^{i,in}. \tag{16}$$

For the prediction horizon K, the corresponding multi-step label vector is

$$\mathbf{y}_{t}^{i} = [y_{t+1}^{i}, y_{t+2}^{i}, \dots, y_{t+K}^{i}] \in \mathbb{R}^{K}. \tag{17}$$

Here, K is the number of future decision steps used for auxiliary supervision.

Let the prediction-head output be $\hat{\mathbf{y}}_{t}^{i}$, whose kth entry $\hat{y}_{t+k}^{i}$ estimates $y_{t+k}^{i}$. Near the end of a trajectory, complete K-step labels may be unavailable; therefore, a validity mask $m_{t,i,k} \in \{0, 1\}$ is introduced. To ensure non-negativity and improve robustness to large values and outliers, predictions are transformed by softplus $\phi(\cdot) = \log(1 + e^{(\cdot)})$ and optimized with a log1p-transformed Huber loss,

$$\mathcal{L}_{pred} = \frac{1}{\sum_{t,i,k} m_{t,i,k}} \sum_{t,i,k} m_{t,i,k} \, \mathcal{L}_{Huber}\left(\log(1 + \phi(\hat{y}_{t+k}^{i})), \log(1 + y_{t+k}^{i})\right). \tag{18}$$

Here, $m_{t,i,k}$ masks invalid future labels from incomplete trajectory suffixes. This design reduces the disturbance from long-tail and large-error samples and maintains a stable supervision scale under heavy congestion. The overall training objective is

$$\mathcal{L} = \mathcal{L}_{ctrl} + \lambda(r) \mathcal{L}_{pred}, \tag{19}$$

where $\mathcal{L}_{ctrl}$ is the control loss, $\mathcal{L}_{pred}$ is the auxiliary prediction loss, and $\lambda(r)$ controls the contribution of the auxiliary task at the joint-training epoch r.
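The masked log1p-Huber objective and the weighted combination of the two losses can be sketched as follows. Several details are assumptions made only for this example and are not given in the text: the Huber threshold `delta = 1.0`, the linear warm-up used for $\lambda(r)$, and all function names. The sketch operates on flat lists of raw predictions, targets, and mask values.

```python
import math
from typing import List


def softplus(x: float) -> float:
    """Stable softplus log(1 + e^x); keeps predicted queues non-negative."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def huber(err: float, delta: float = 1.0) -> float:
    """Quadratic near zero, linear beyond `delta` (assumed threshold)."""
    a = abs(err)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)


def prediction_loss(raw_preds: List[float], targets: List[float],
                    mask: List[int], delta: float = 1.0) -> float:
    """Masked mean of Huber errors in log1p space."""
    num, den = 0.0, 0
    for p, y, m in zip(raw_preds, targets, mask):
        if m:  # skip labels that fall beyond the trajectory end
            num += huber(math.log1p(softplus(p)) - math.log1p(y), delta)
            den += 1
    return num / den if den else 0.0


def auxiliary_weight(epoch: int, warmup_epochs: int, lam_max: float) -> float:
    """Hypothetical linear ramp for lambda(r); the actual schedule may differ."""
    return lam_max * min(1.0, epoch / warmup_epochs)


def total_loss(l_ctrl: float, l_pred: float, lam: float) -> float:
    """Weighted sum of control and auxiliary prediction losses."""
    return l_ctrl + lam * l_pred


l_pred = prediction_loss([2.0, 5.0, 0.0], [3.0, 4.0, 9.0], [1, 1, 0])
print(total_loss(1.2, l_pred, auxiliary_weight(epoch=5, warmup_epochs=10, lam_max=0.2)))
```

Working in log1p space compresses large queue values, so a heavily congested sample contributes an error on the same scale as a lightly loaded one, while the mask keeps padded labels out of both the numerator and the normalizer.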

4. Method

4.1. Overall Framework

Offline multi-intersection traffic signal control must learn from fixed trajectories while still handling neighborhood coordination and delayed congestion propagation. This setting raises two modeling challenges. First, learning only from action labels may provide insufficient supervision for capturing near-future congestion evolution. Second, the model must capture both local spatial interaction and delayed temporal dynamics without introducing unstable optimization behavior. To address these challenges, we propose PR-STLight, which combines topology-aware spatio-temporal encoding, queue-prediction regularization, and staged optimization. The framework has four tightly coupled modules: input encoding, coordination-aware spatio-temporal representation learning, a queue-prediction regularization branch, and two-stage joint optimization.

As shown in Figure 2, lane-level traffic observations, phase context, and time context are first mapped into unified state representations. Neighborhood-constrained spatial attention is then used to model local coordination, and causal temporal attention captures propagation-delay dependencies across intersections. The resulting shared representation is consumed by two heads: a control head for current phase prediction and TRQP for multi-step inbound-queue prediction. During training, the queue-prediction loss acts as structural regularization for the shared representation.

Within this architecture, input encoding provides a consistent sequence interface for offline trajectories; the spatio-temporal module captures neighborhood interaction and delayed propagation; TRQP supplies continuous supervision on future queue evolution; and staged joint optimization coordinates control and auxiliary prediction for stable learning on fixed replay data.

4.2. Spatio-Temporal Representation Module

The shared backbone is designed to extract spatio-temporal representations from historical traffic observations for coordinated multi-intersection control. Because coordination and propagation are primarily reflected in traffic-state dynamics, explicit spatio-temporal enhancement is first applied to state tokens. The enhanced state tokens are then combined with reward and action tokens at the same time step and fed into the sequence backbone. Under this “state enhancement first, sequence modeling second” design, the model first captures traffic-state dependencies and then learns higher-order decision relations.

For the observation at intersection $i$ and time $t$, the lane-level traffic state matrix $X_{t}^{i}$ is projected into the hidden space and fused with phase and time embeddings to obtain a basic state representation,

e_{t}^{i} = W_{s} X_{t}^{i} + W_{p} p_{t}^{i} + e (c_{t}),

(20)

where $W_{s}$ and $W_{p}$ are the state and phase projection matrices, $p_{t}^{i}$ denotes the phase embedding input, and $e(c_{t})$ is the time embedding indexed by the time-context token. This operation aligns traffic state, control context, and temporal position in a shared feature space for subsequent dependency modeling.
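As an illustration of Equation (20), the following minimal sketch computes the fused state representation for one intersection. It assumes that the lane-level state is flattened into a vector, that the phase input is a one-hot vector, and that the time embedding is a lookup table; the names `encode_state`, `matvec`, and `time_table` are hypothetical and do not come from the original implementation.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def encode_state(x_lane, phase_onehot, time_idx, W_s, W_p, time_table):
    """Eq. (20) sketch: e = W_s x + W_p p + e(c_t).

    x_lane       : flattened lane-level state (assumption: flattening)
    phase_onehot : one-hot phase input (assumption)
    time_idx     : integer time-context token c_t
    time_table   : lookup table of time embeddings, one row per token
    """
    state_part = matvec(W_s, x_lane)
    phase_part = matvec(W_p, phase_onehot)
    time_part = time_table[time_idx]
    return [s + p + c for s, p, c in zip(state_part, phase_part, time_part)]
```

All three terms must share the hidden dimension so that they can be summed element-wise.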

Along the spatial dimension, neighborhood-constrained multi-head self-attention is used to model interactions among intersection states at the same time step. Let the state representations of all intersections at time t be

E_{t} = [e_{t}^{1}, e_{t}^{2}, \dots, e_{t}^{N}] .

(21)

Here, $N$ is the number of controlled intersections. To keep information exchange consistent with road topology, a spatial attention mask $M^{sp}$ is constructed from the adjacency relation, so each intersection attends only to itself and its first-order neighbors. The spatial enhancement is

H_{t}^{sp} = LN (E_{t} + α_{sp} MHA (E_{t}, E_{t}, E_{t}; M^{sp})),

(22)

where $MHA(\cdot)$ denotes masked multi-head self-attention, $α_{sp}$ is a learnable residual scaling factor, and $LN(\cdot)$ denotes LayerNorm.

This module has two roles. First, it captures local traffic-flow coupling, queue spillback, and short-range propagation among neighboring intersections. Second, the neighborhood mask suppresses direct interaction between distant, irrelevant nodes, thereby reducing noise from unconstrained global attention. The learnable residual scaling factor also keeps the module close to a stable residual mapping in early training, mitigating optimization oscillation when spatial and temporal attention are stacked.
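The neighborhood constraint in Equation (22) can be sketched as follows. This is a simplified single-head version under stated assumptions: queries, keys, and values are the raw state vectors (no learned projections), LayerNorm has no learnable gain or bias, and the function names are hypothetical.

```python
import math


def neighborhood_mask(n, edges):
    """mask[i][j] is True when intersection i may attend to j
    (itself or a first-order neighbor)."""
    mask = [[i == j for j in range(n)] for i in range(n)]
    for i, j in edges:
        mask[i][j] = mask[j][i] = True
    return mask


def masked_self_attention(E, mask):
    """Single-head scaled dot-product attention with Q = K = V = E."""
    d = len(E[0])
    out = []
    for i, q in enumerate(E):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if mask[i][j] else -math.inf
            for j, k in enumerate(E)
        ]
        m = max(scores)  # finite, because self-attention is always allowed
        w = [math.exp(s - m) for s in scores]  # masked entries become 0.0
        z = sum(w)
        out.append([sum(w[j] / z * E[j][c] for j in range(len(E))) for c in range(d)])
    return out


def layer_norm(v, eps=1e-5):
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]


def spatial_enhance(E, mask, alpha_sp):
    """Eq. (22) sketch: LN(E + alpha_sp * MHA(E, E, E; M_sp))."""
    A = masked_self_attention(E, mask)
    return [layer_norm([e + alpha_sp * a for e, a in zip(er, ar)])
            for er, ar in zip(E, A)]
```

With `alpha_sp` close to zero the output reduces to the normalized input, which mirrors the near-identity residual mapping that the text associates with early training.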

After spatial enhancement, causal temporal self-attention is applied along each intersection’s time axis to model traffic-state propagation and accumulation over the historical window. For intersection i, let the spatially enhanced sequence be

H^{sp, i} = [h_{t - L + 1}^{sp, i}, \dots, h_{t}^{sp, i}],

(23)

and the temporal enhancement is

H^{tm, i} = LN (H^{sp, i} + MHA (H^{sp, i}, H^{sp, i}, H^{sp, i}; M^{tm})),

(24)

where $M^{tm}$ is a causal mask that allows each time step to access only current and previous steps, preventing future-information leakage.

Causal temporal modeling is consistent with the traffic-control decision setting: the current phase decision can depend only on current and past observations. Under this constraint, the temporal module captures delayed effects of traffic-wave propagation, queue buildup and dissipation, and phase switching. Applying temporal attention to spatially enhanced states captures two complementary aspects: how intersections interact and how these interactions evolve over time.
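The effect of the causal mask $M^{tm}$ can be checked with a small sketch. Uniform attention weights are used here as a stand-in for learned attention (an assumption made only to keep the example short); the point is that changing a future input leaves earlier outputs unchanged.

```python
def causal_mask(length):
    """mask[i][j] is True when step i may attend to step j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def causal_uniform_attention(seq, mask):
    """Average over the allowed (current and past) positions only.

    Uniform weights replace learned attention weights in this sketch.
    """
    out = []
    for i in range(len(seq)):
        allowed = [seq[j] for j in range(len(seq)) if mask[i][j]]
        out.append(sum(allowed) / len(allowed))
    return out
```

For example, with the input `[1.0, 2.0, 3.0, 4.0]` the outputs are the running means, and replacing the last element alters only the last output.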

After two-stage enhancement in space and time, the spatio-temporal state sequence for intersection i is

H^{tm, i} = [h_{t - L + 1}^{i}, \dots, h_{t}^{i}] .

(25)

The enhanced state representation is then arranged with the reward and action entries at each time step and fed into the sequence backbone in chronological order. The resulting sequence for intersection i is

Z^{i} = (r_{t - L + 1}^{i}, h_{t - L + 1}^{i}, a_{t - L + 1}^{i}, \dots, r_{t}^{i}, h_{t}^{i}, a_{t}^{i}) .

(26)

Here, $h_{ℓ}^{i}$ denotes the spatio-temporally enhanced traffic-state representation at time step $ℓ$. In this sequence construction, the enhanced state tokens carry lane-level traffic dynamics, phase context, neighborhood interaction, and temporal information extracted by the spatial and temporal attention modules. Reward and action slots are inserted in chronological order to preserve the inherited TransformerLight-style token layout. The control head reads the state-position hidden representations for phase imitation, and the auxiliary queue-prediction branch uses the same shared representation for future-queue regularization.
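The token layout of Equation (26) can be sketched as a simple interleaving step. The helper names and the tagged-tuple representation are hypothetical and serve only to show where the state positions fall in the flattened sequence.

```python
def build_sequence(rewards, states, actions):
    """Interleave (reward, state, action) entries in chronological order."""
    if not (len(rewards) == len(states) == len(actions)):
        raise ValueError("rewards, states, and actions must have equal length")
    seq = []
    for r, h, a in zip(rewards, states, actions):
        seq.extend([("reward", r), ("state", h), ("action", a)])
    return seq


def state_positions(length):
    """Indices of the state tokens in the interleaved sequence."""
    return [3 * k + 1 for k in range(length)]
```

Under this layout, a control head that reads the state-position hidden representations would index the backbone output at `state_positions(L)`.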

4.3. Topology-Recurrent Queue Predictor (TRQP)

To preserve future congestion information during training, a TRQP branch is coupled with the shared spatio-temporal representation and serves as a structural regularizer rather than a standalone forecasting module. It aggregates local spatial information within each time block using road topology and recursively predicts total inbound queues over the next K steps along the time axis. This mechanism encourages the shared representation to retain information useful for short-term queue propagation. The input to the predictor is

H^{ctrl} = \{h_{s}^{i}\}_{s = t - L + 1}^{t} \in \mathbb{R}^{L \times N \times d},

(27)

where L is the historical block length, N is the number of controlled intersections, and d is the hidden dimension.

More specifically, the predictor first applies a topology-aware aggregation layer with a residual connection to each time block to refine local spatial dependencies. A GRU is then applied along the temporal axis for each intersection. Finally, a linear projection outputs predicted total inbound queues for the next K steps. For the shared hidden representation of time block s, the structural enhancement is

G_{s} = H_{s}^{ctrl} + α_{g} Ψ (\hat{A} H_{s}^{ctrl} W_{g}),

(28)

where $\hat{A}$ is the normalized adjacency matrix, $W_{g}$ is the graph-convolution weight matrix, $Ψ(\cdot)$ is a nonlinear transform, and $α_{g}$ is a learnable residual scaling factor. Let $g_{s}^{i}$ denote the feature of intersection $i$ in $G_{s}$. The temporal representation is updated recursively by

u_{s}^{i} = GRU (g_{s}^{i}, u_{s - 1}^{i}),

(29)

and the multi-step prediction is obtained by linear projection,

\hat{y}_{s}^{i} = W_{y} u_{s}^{i} + b_{y}, \quad \hat{y}_{s}^{i} \in \mathbb{R}^{K},

(30)

where $W_{y}$ and $b_{y}$ denote the output-projection weight and bias of the prediction head.
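The topology-aware aggregation of Equation (28) can be sketched for a single time block as follows. Two choices here are assumptions, since the text only specifies a "normalized adjacency matrix" and a "nonlinear transform": the adjacency is row-normalized with self-loops, and $Ψ$ is taken to be ReLU. The GRU and output projection of Equations (29) and (30) are omitted.

```python
def normalized_adjacency(n, edges):
    """Row-normalized adjacency with self-loops (assumed normalization)."""
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
    return [[v / sum(row) for v in row] for row in A]


def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def topology_aggregate(H, A_hat, W_g, alpha_g):
    """Eq. (28) sketch: G = H + alpha_g * relu(A_hat @ H @ W_g)."""
    M = matmul(matmul(A_hat, H), W_g)
    return [[h + alpha_g * max(0.0, m) for h, m in zip(hr, mr)]
            for hr, mr in zip(H, M)]
```

Setting `alpha_g` to zero returns the input unchanged, so the aggregation acts as a residual refinement of the shared representation rather than a replacement.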

Here, the prediction head outputs a future K-step queue prediction for every input time block, rather than a single prediction from the final time step only. This design provides continuous supervision on queue evolution over the full historical block, strengthens the ability of the shared representation to capture short-term congestion evolution, and supplies a stable auxiliary signal during joint optimization.
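Because every time block receives its own $K$-step target, blocks near the end of a trajectory have incomplete futures and need a validity mask. The sketch below builds such targets and masks for one intersection and evaluates a masked mean squared error. The squared-error form is an assumption made for illustration; the function names are hypothetical.

```python
def future_queue_targets(queues, K):
    """For each step s, collect queues[s+1..s+K] and a validity mask."""
    T = len(queues)
    targets, masks = [], []
    for s in range(T):
        row, valid_row = [], []
        for k in range(1, K + 1):
            valid = s + k < T
            row.append(queues[s + k] if valid else 0.0)  # padded when invalid
            valid_row.append(valid)
        targets.append(row)
        masks.append(valid_row)
    return targets, masks


def masked_mse(pred, targets, masks):
    """Mean squared error over valid (unmasked) entries only."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, targets, masks):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0
```

Padded entries never contribute to the loss, so the supervision density decreases only for the last $K$ blocks of a trajectory.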

4.4. Two-Stage Training and Joint Optimization

To incorporate short-term queue supervision into traffic signal control learning in a stable manner, a two-stage training strategy is adopted. In the first stage, only the auxiliary prediction loss is used to pretrain the shared backbone and prediction head, allowing the model to learn a basic representation of inbound-queue dynamics. In the second stage, the control loss $L_{ctrl}$ and auxiliary prediction loss $L_{pred}$ are optimized jointly, so the shared representation supports both current action prediction and future queue modeling. The two-stage objectives are

L^{(1)} = L_{pred},

(31)

L^{(2)} = L_{ctrl} + λ (r) L_{pred} .

(32)

Here, $L_{ctrl}$ is the control loss, $L_{pred}$ is the auxiliary prediction loss, $r$ is the joint-training epoch index, and $λ(r)$ is a dynamic weight for the auxiliary loss.

Because the auxiliary prediction task can introduce strong gradient interference early in joint training, progressive auxiliary gradient injection is used to smooth the influence of the prediction branch on the shared backbone. The auxiliary weight is scheduled and the shared hidden representation used by the auxiliary branch is reparameterized as

λ (r) = λ_{min} + β (r) (λ_{max} - λ_{min}),

(33)

h_{pred} = sg (h) + β (r) (h - sg (h)),

(34)

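where $λ_{min}$ and $λ_{max}$ are the lower and upper bounds of the auxiliary weight, $sg(\cdot)$ denotes the stop-gradient operation, and $β(r) \in [0, 1]$ is a mixing coefficient that increases with training progress. In the forward pass, $h_{pred}$ always equals $h$; only the gradient that reaches the backbone through the auxiliary branch is scaled by $β(r)$. When $β(r) = 0$, the auxiliary predictor updates only prediction-branch parameters. As $β(r)$ increases gradually, the prediction loss is propagated to the shared backbone in a more stable manner.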
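Equations (33) and (34) can be illustrated with a short sketch. The linear warm-up for $β(r)$ and the parameter `warmup_rounds` are assumptions, since the text only states that $β(r)$ increases with training progress. The last function expresses the consequence of Equation (34) for backpropagation: the derivative of $h_{pred}$ with respect to $h$ is $β(r)$.

```python
def beta_schedule(r, warmup_rounds):
    """Mixing coefficient in [0, 1]; linear warm-up is an assumed schedule."""
    if warmup_rounds <= 0:
        return 1.0
    return min(1.0, max(0.0, r / warmup_rounds))


def lambda_schedule(r, warmup_rounds, lam_min, lam_max):
    """Eq. (33): lambda(r) = lam_min + beta(r) * (lam_max - lam_min)."""
    b = beta_schedule(r, warmup_rounds)
    return lam_min + b * (lam_max - lam_min)


def backbone_gradient(grad_wrt_h_pred, beta):
    """Eq. (34) consequence: the gradient reaching the backbone through
    the auxiliary branch is the branch gradient scaled by beta."""
    return [beta * g for g in grad_wrt_h_pred]
```

In an automatic-differentiation framework the same effect would typically be obtained by writing Equation (34) directly with the framework's stop-gradient (detach) operation, rather than by scaling gradients by hand.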

Algorithm 1 summarizes the full training pipeline used in practice.

Algorithm 1. Two-stage training procedure of PR-STLight

Require: Offline replay buffer $D$, history length $L$, prediction horizon $K$, pretraining epochs $E_{pre}$, joint-training rounds $R$
Ensure: Trained control policy $π_{θ}$
1: Initialize shared backbone parameters $θ_{b}$, control-head parameters $θ_{c}$, and predictor parameters $θ_{p}$
2: for $e = 1$ to $E_{pre}$ do
3:     Sample mini-batches from $D$ and construct length-$L$ sequences with valid future-label masks
4:     Encode lane states, phase embeddings, and temporal embeddings
5:     Apply neighborhood-constrained spatial attention and causal temporal attention to obtain shared representation $H$
6:     Predict future inbound queues with TRQP and compute $L_{pred}$
7:     Update $θ_{b}$ and $θ_{p}$ using $\nabla L_{pred}$
8: end for
9: for $r = 1$ to $R$ do
10:     Sample mini-batches from $D$ and build reward–state–action sequences
11:     Compute shared representation $H$ and phase logits with the backbone and control head
12:     Compute control loss $L_{ctrl}$ with the control head
13:     Predict $K$-step future inbound queues and compute masked prediction loss $L_{pred}$
14:     Set the auxiliary weight $λ(r)$ and backbone mixing factor $β(r)$
15:     Update the network using $L_{ctrl} + λ(r) L_{pred}$
16: end for
17: return $π_{θ}$

With this two-stage strategy and progressive joint optimization, the model injects future traffic-evolution information into shared-representation learning while preserving the dominant role of the control objective. This design is expected to mitigate gradient interference and stabilize representation learning in early multi-task training.

5. Experiments

5.1. Experimental Setup

5.1.1. Simulation Platform and Datasets

Experiments are conducted on CityFlow [48], an efficient open-source simulator designed for large-scale multi-agent traffic control. CityFlow provides lane-level observations, a standardized phase-control interface, and reproducible evaluation settings, which makes it suitable for constructing offline replay buffers and conducting fair benchmark comparisons. We use two real-city benchmark networks: Jinan $3 \times 4$ and Hangzhou $4 \times 4$. Jinan contains 12 controlled intersections, and Hangzhou contains 16 controlled intersections. Comparative results are reported on both networks, while the ablation study focuses on Jinan. These two benchmarks provide controlled and reproducible grid-based multi-intersection settings for evaluating the offline sequence-control capability of PR-STLight. Figure 3 shows the layouts of the Jinan $3 \times 4$ and Hangzhou $4 \times 4$ benchmark networks used in the experiments.

5.1.2. Comparison Methods

We compare PR-STLight with representative methods from three categories.

Rule-based baselines: FixedTime and SOTL. FixedTime uses preset phase durations, and SOTL switches phases using local waiting-based thresholds.

Online learning-based baselines: PressLight [8], CoLight [10], HiLight [11], and InitLight [29]. PressLight uses pressure-based rewards for policy learning. CoLight models multi-intersection coordination with graph attention. HiLight adopts a hierarchical reinforcement learning framework. InitLight uses adversarial inverse reinforcement learning to generate a strong initialization policy.

The offline sequence baseline is TransformerLight (base), which follows the same Transformer-style control backbone adopted in recent sequence-based traffic-signal-control studies [15,17]. However, it does not include explicit spatio-temporal enhancement or future-queue prediction regularization.

5.1.3. Metrics and Statistical Evaluation Protocol

We report two primary metrics: average travel time (ATT) and average inbound queue (AIQ). ATT evaluates global network efficiency, whereas AIQ measures congestion accumulation at inbound approaches. Both metrics are computed at the network level for each evaluation round, and lower values indicate better performance.

For stochastic learning-based methods, we conduct five independent training runs using different random seeds. For each random seed, a run-level score is first obtained by averaging the final ten evaluation rounds after convergence. The reported mean ± standard deviation is then calculated across these five run-level scores. Therefore, the reported standard deviation reflects between-run uncertainty rather than fluctuations within a single training run. FixedTime and SOTL are deterministic rule-based reference methods and are reported as mean values only.
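The aggregation protocol can be sketched as follows. The use of the sample standard deviation (`statistics.stdev`) is an assumption, because the text does not state whether the sample or population form is reported; the function names are hypothetical.

```python
from statistics import mean, stdev


def run_level_score(eval_rounds, last_n=10):
    """Average of the final `last_n` evaluation rounds of one run."""
    return mean(eval_rounds[-last_n:])


def summarize(runs, last_n=10):
    """Mean and between-run standard deviation over run-level scores.

    `runs` holds one list of per-round metric values per random seed.
    The sample standard deviation is an assumed choice.
    """
    scores = [run_level_score(r, last_n) for r in runs]
    return mean(scores), stdev(scores), scores
```

Because the standard deviation is taken over one score per seed, within-run fluctuations in the last ten rounds affect the result only through each run's average.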

To assess statistical reliability, we conduct two-sided paired t-tests on the seed-level scores using matched random seeds. PR-STLight is compared with TransformerLight (base), and the corresponding significance results are discussed together with the main comparison. The implementation details and main experimental settings of PR-STLight are summarized in Table 1.
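For reference, the paired t statistic on matched seed-level scores can be computed with the standard library alone, as in the sketch below. The function name is hypothetical, and any input data used with it are illustrative rather than values from this study. Obtaining the two-sided p-value additionally requires the Student t distribution, for example through `scipy.stats.ttest_rel`.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for matched seeds.

    t = mean(d) / (stdev(d) / sqrt(n)), with d the per-seed differences.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with n >= 2")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```

With five seeds per method, the test has four degrees of freedom, so the statistic is driven by how consistent the per-seed differences are rather than by the spread of each method's own scores.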

5.2. Main Comparative Results and Analysis

Table 2 and Table 3, together with Figure 4, show that PR-STLight achieves consistently strong performance across both benchmarks. In terms of ATT, PR-STLight obtains the best results on both Jinan and Hangzhou, reducing ATT by 21.27% and 22.54% relative to TransformerLight (base), respectively. For each benchmark network, seed-level paired t-tests are conducted between PR-STLight and TransformerLight (base) using matched random-seed run scores. The improvements of PR-STLight are statistically significant for both ATT and AIQ on Jinan and Hangzhou, with all four tests satisfying $p < 0.001$. These results indicate that the proposed prediction-regularized spatio-temporal representation improves network-level travel efficiency under the evaluated offline control setting.

The AIQ results show a more nuanced pattern. On Jinan $3 \times 4$, CoLight achieves the lowest average inbound queue, while PR-STLight still reduces AIQ by 46.28% compared with TransformerLight (base), indicating that it remains competitive in queue mitigation while showing a clearer advantage in average travel time. This suggests that the two methods emphasize somewhat different aspects of traffic optimization on this network: CoLight is more competitive in reducing queues, whereas PR-STLight is more favorable for network-level travel-time optimization. Even so, PR-STLight still performs better than several other learning-based baselines on the queue metric, which indicates that its travel-time advantage is not obtained at the cost of severely degraded queue control.

On Hangzhou $4 \times 4$, PR-STLight performs best on both ATT and AIQ. Compared with TransformerLight (base), it reduces the queue metric by 66.29%, and its advantage is more consistent than that observed on Jinan. This result suggests that the proposed spatio-temporal representation is particularly effective under the evaluated Hangzhou benchmark, where the number of controlled intersections and local coordination interactions are larger than those in Jinan. The corresponding curves also show that PR-STLight stays in a relatively stable low-value region during the later training stage, which is consistent with its final quantitative advantage on Hangzhou.

A comparison with TransformerLight (base) is also informative. Since both methods share the same Transformer-style sequence backbone, the observed performance gap can be attributed mainly to the additional spatio-temporal modeling and prediction-regularized supervision introduced in PR-STLight. This comparison indicates that the performance gain does not stem solely from sequence modeling, but is also associated with the topology-aware spatio-temporal representation and future-queue structural supervision.

As shown in Figure 4, PR-STLight generally enters a lower-error region earlier than most baselines in the travel-time curves. In the queue curves, its late-stage trajectory is also relatively smooth, especially on Hangzhou. Taken together with the tabulated results, these observations are consistent with faster convergence and more stable training behavior, although the degree of advantage varies across datasets and metrics.

5.3. Evaluation Under Different Traffic Demand Levels

To further examine the robustness of PR-STLight under different traffic intensities, we conduct additional evaluations on both the Jinan and Hangzhou networks with three demand scaling factors: 0.7, 1.0, and 1.5, corresponding to low-demand, medium-demand, and oversaturated conditions, respectively. All reported values follow the statistical aggregation protocol described in Section 5.1.3. For learning-based methods, the percentages in parentheses denote relative changes compared with TransformerLight (base) under the same demand level, which helps contextualize each method against the same Transformer-style backbone reference. The detailed results are reported in Table 4 and Table 5.

As shown in Table 4 and Table 5, PR-STLight consistently improves over TransformerLight (base) across all tested demand levels.

On the Hangzhou $4 \times 4$ network, PR-STLight achieves the best ATT and AIQ under low-demand, medium-demand, and oversaturated conditions. Compared with TransformerLight (base), PR-STLight reduces ATT by 46.62%, 28.88%, and 29.07% under 0.7×, 1.0×, and 1.5× demand, respectively. The corresponding AIQ reductions are 90.91%, 75.43%, and 67.81%. These results indicate that the proposed prediction-regularized spatio-temporal design remains effective as traffic pressure increases.

On the Jinan $3 \times 4$ network, PR-STLight also shows substantial gains over TransformerLight (base) across all three demand levels. It reduces ATT by 23.38%, 25.25%, and 17.92%, and reduces AIQ by 61.93%, 54.49%, and 30.90% under 0.7×, 1.0×, and 1.5× demand, respectively. Under 0.7× and 1.0× demand, PR-STLight achieves the best ATT and AIQ among the compared methods. Under the oversaturated 1.5× setting, CoLight obtains the lowest ATT and AIQ, while PR-STLight still ranks second on both metrics and remains clearly better than TransformerLight (base). This result suggests that PR-STLight remains competitive under severe congestion, although the relative advantage of different methods may vary with traffic intensity and network characteristics.

Overall, the demand-level analysis shows that PR-STLight provides stable gains over its direct Transformer backbone across both networks and all tested demand levels. It achieves the strongest overall results on Hangzhou and on Jinan under low-to-medium demand, while the oversaturated Jinan case reveals that a graph-attention coordination baseline can still be more effective under severe congestion. These findings support the robustness of the proposed prediction-regularized spatio-temporal design, while keeping the performance claims aligned with the observed benchmark results.

5.4. Stress Test Under Sudden Demand Surge

To further evaluate the robustness of PR-STLight beyond recurrent congestion patterns, we add a representative sudden-demand-surge stress test on the Hangzhou $4 \times 4$ network. This setting is designed to reflect a short-term demand-side disturbance, such as abrupt inflow growth or temporary crowd aggregation, while keeping the road topology and signal-control constraints unchanged.

Each evaluation episode lasts 3600 s, and the signal-control interval remains 15 s. The original traffic-flow profile is kept unchanged before and after the disturbance. During the shock window from 1200 s to 1800 s, vehicle arrivals are increased by 50%, corresponding to a 1.5× demand surge. After 1800 s, the traffic demand returns to the original benchmark profile. The road network, lane availability, signal phase definitions, action interval, reward setting, and execution rules are kept unchanged. Therefore, this experiment represents a demand-side non-recurrent disturbance rather than a capacity-side incident such as a crash or lane blockage.
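The surge profile described above can be sketched as a time-dependent multiplier applied to a base arrival profile. Representing demand as one arrival rate per 15 s control interval and treating the shock window as half-open (1200 s inclusive, 1800 s exclusive) are assumptions made for illustration; the function names are hypothetical and this is not the procedure used to generate the actual traffic file.

```python
def surge_factor(t, start=1200.0, end=1800.0, factor=1.5):
    """Demand multiplier at time t (seconds); 1.0 outside the shock window."""
    return factor if start <= t < end else 1.0


def apply_surge(base_rates, interval=15.0, start=1200.0, end=1800.0, factor=1.5):
    """Scale per-interval arrival rates inside the shock window.

    base_rates[k] is the arrival rate of the interval starting at k * interval.
    """
    return [rate * surge_factor(k * interval, start, end, factor)
            for k, rate in enumerate(base_rates)]
```

For a 3600 s episode with 15 s intervals, the window covers intervals 80 through 119, so 40 of the 240 intervals are scaled and the rest keep the original profile.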

For the offline sequence-control models, training still uses the original recurrent offline replay buffer. The sudden-surge traffic file is used only for simulator-based evaluation. No shock-specific replay buffer is generated, and no model is fine-tuned on the surge scenario. This protocol evaluates zero-shot responsiveness to an unseen demand shock rather than performance after retraining on the disrupted distribution.

As shown in Table 6, PR-STLight achieves the lowest ATT and AIQ among the compared methods under the sudden-demand-surge scenario. Since the surge trajectories are not included in the offline training buffer, the stress-test results reflect the ability of the learned policy to respond to abrupt demand changes during evaluation. Compared with TransformerLight (base), PR-STLight reduces ATT from 399.74 s to 298.82 s, corresponding to a 25.25% reduction, and reduces AIQ from 121.21 vehicles to 42.58 vehicles, corresponding to a 64.87% reduction. These results indicate that the proposed prediction-regularized spatio-temporal design remains effective when traffic demand changes abruptly.
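The relative reductions quoted above follow directly from the reported ATT and AIQ values, as the following short check shows. The helper name is hypothetical.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline


att_reduction = relative_reduction(399.74, 298.82)  # ATT in seconds
aiq_reduction = relative_reduction(121.21, 42.58)   # AIQ in vehicles
print(f"ATT: {att_reduction:.2f}%  AIQ: {aiq_reduction:.2f}%")
```

The same formula underlies the other percentage comparisons with TransformerLight (base) reported in this section.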

In addition to the comparison with TransformerLight (base), PR-STLight also outperforms CoLight, which is the strongest external learning baseline in this stress test. Specifically, PR-STLight further reduces ATT by 1.59% and AIQ by 8.48% compared with CoLight. This provides a conservative comparison and shows that PR-STLight remains competitive even against a strong graph-attention-based coordination method under non-recurrent demand shocks.

The comparison with PR-STLight w/o TRQP further highlights the value of the future-queue prediction branch under sudden demand changes. The w/o TRQP variant preserves the spatio-temporal control backbone but removes the future-queue prediction branch. Removing TRQP increases ATT from 298.82 s to 304.48 s and AIQ from 42.58 vehicles to 47.93 vehicles. This degradation indicates that future-queue supervision helps the shared representation capture short-term queue buildup when demand changes abruptly, thereby improving robustness under demand-side shock conditions.

The curve trends in Figure 5 are consistent with the table-level results, with PR-STLight maintaining lower ATT and AIQ than the compared variants in the stress-test process. Additional incident-driven disruptions are further discussed in Section 5.6.

5.5. Component and Parameter Analysis

5.5.1. Ablation Study and Analysis

To examine the contribution of each component in PR-STLight, we conduct an ablation study on the Jinan $3 \times 4$ network using average travel time as the evaluation metric. The results are summarized in Table 7, where the variants remove different combinations of the proposed components.

As shown in Table 7 and Figure 6, the three components do not contribute in the same way. Removing the spatio-temporal module (w/o ST) leads to the clearest degradation in both final performance and convergence speed: its average travel time is noticeably higher than that of PR-STLight, and its convergence round is also substantially later. This suggests that explicit spatio-temporal modeling is important not only for final control quality but also for optimization efficiency during training.

The role of TRQP is more evident in training stability than in the final mean alone. Although w/o TRQP remains close to the full model in final average travel time, its result variance is clearly larger and its convergence is later. This pattern suggests that the prediction branch provides a useful regularizing effect on the shared representation, helping the model reach a stable low-error region more reliably.

The effect of pretraining is reflected mainly in convergence behavior and late-stage stability. As shown in Table 7, w/o Pretrain obtains a slightly lower final mean ATT than the full PR-STLight model. This result shows that the pretraining stage does not necessarily lead to the lowest final ATT in this ablation setting. However, PR-STLight converges earlier and exhibits smaller late-stage variation. Its ConvRound is 6.0, compared with 11.0 for w/o Pretrain, and its ATT standard deviation is 0.46, compared with 1.20 for w/o Pretrain. These results suggest that pretraining improves optimization stability and convergence speed, with a small empirical trade-off in final mean ATT.

The small standard deviation of PR-STLight indicates that its late-stage performance is maintained more steadily across evaluation rounds. Variants that remove one or more components tend to show slower convergence, larger late-stage variation, or both.

Finally, removing both components causes the most severe degradation in final ATT. Although w/o Both has an early ConvRound, its curve quickly flattens at a high-error region and remains far from the stable low-error regime reached by PR-STLight. Taken together, these results suggest that the spatio-temporal module, the prediction branch, and the pretraining strategy play complementary roles: the spatio-temporal module supports traffic-interaction representation, the prediction branch improves representation regularity during training, and the pretraining stage improves early optimization stability and convergence behavior.

5.5.2. Sensitivity Analysis of Historical Length and Prediction Horizon

To examine the influence of the historical block length $L$ and the auxiliary prediction horizon $K$, we conduct a controlled sensitivity analysis on the Jinan $3 \times 4$ network. Specifically, we vary $L \in \{4, 8, 12\}$ while fixing $K = 3$, and vary $K \in \{1, 3, 5\}$ while fixing $L = 8$. All other training settings and replay buffers are kept unchanged. The results are reported as the mean and standard deviation over the final ten evaluation rounds after convergence.

As shown in Table 8, PR-STLight maintains stable performance across the tested values of L and K. The average travel time remains within a narrow range from 273.08 s to 276.06 s, and the average inbound queue also varies within a limited range. These results indicate that the prediction-regularized framework is not overly sensitive to moderate changes in the historical context length or the auxiliary prediction horizon.

When $K$ is fixed at 3, $L = 4$ and $L = 8$ produce very close ATT values and comparable AIQ values, whereas $L = 12$ leads to a mild degradation in both metrics. This suggests that a relatively short historical window can already capture most useful recent traffic context, while an overly long historical window may introduce redundant or less relevant information into the sequence representation. Compared with $L = 4$, $L = 8$ achieves a slightly lower AIQ and a very close ATT, while providing a longer temporal context for modeling short-term queue evolution. In the implemented sequence construction, a shorter historical block can also generate more training blocks from the same offline trajectories, thereby increasing the number of mini-batch updates within each epoch. Therefore, considering performance, temporal-context coverage, and training workload, this study adopts $L = 8$ as a balanced setting.

When $L$ is fixed at 8, the results under $K = 1$, $K = 3$, and $K = 5$ are also close, indicating that PR-STLight is relatively stable under different short-term prediction horizons. The setting $K = 1$ provides a very short and stable prediction target, whereas $K = 5$ introduces a longer future target with slightly higher uncertainty. The default setting $K = 3$ is retained as a moderate horizon because it provides richer short-term future-queue supervision than a one-step target while avoiding the additional uncertainty associated with a longer prediction horizon.

Based on these observations, the main experiments adopt $L = 8$ and $K = 3$ as a moderate configuration. This setting provides sufficient historical context, preserves short-term future-queue supervision, and keeps the auxiliary prediction task and training workload controlled.

5.6. Limitations and Future Work

The experiments in this study are conducted on two standard CityFlow benchmark networks, namely Jinan $3 \times 4$ and Hangzhou $4 \times 4$. These benchmarks provide controlled and reproducible multi-intersection scenarios for comparing PR-STLight with existing baselines and analyzing the effect of prediction regularization. However, the present evaluation is still centered on grid-based networks. Extending PR-STLight to more irregular and heterogeneous urban networks will require broader offline replay buffers and more flexible topology and action modeling. Future work will examine networks with asymmetric connectivity, diverse phase configurations, uneven lane capacities, and spatially unbalanced demand patterns.

The additional experiments evaluate PR-STLight under different demand levels and a sudden-demand-surge stress test. These results provide an initial assessment of the model under demand-side variations and non-recurrent demand shocks. However, demand-side perturbations do not fully represent capacity-side disruptions, such as crashes and lane blockages, where lane availability and effective road capacity may change during operation. Future work will extend the evaluation to incident-driven scenarios by modeling incident duration, lane availability, capacity reduction, and post-incident recovery.

The current state representation mainly uses queue-, flow-, and pressure-related variables that are consistently available in the adopted CityFlow benchmark setting. This compact design helps maintain fair and reproducible comparisons with existing baselines. Richer traffic-state variables, such as occupancy, approach speed, turn ratios, and downstream blocking indices, may further improve the description of link utilization, arrival dynamics, route choice, and downstream receiving capacity. However, incorporating these variables would require additional feature extraction, normalization, replay-buffer alignment, and corresponding baseline reimplementation. Future work will therefore investigate feature-enriched variants of PR-STLight together with dedicated ablation analysis.

Practical signal operation also involves execution constraints beyond general phase selection. As an additional constrained evaluation, we tested PR-STLight on the Hangzhou $4 \times 4$ network with yellow intervals and minimum-green requirements enforced during signal execution. Compared with TransformerLight (base), PR-STLight reduces ATT from 369.95 s to 312.22 s, AIQ from 91.92 vehicles to 52.41 vehicles, and phase changes per hour from 2567.50 to 2134.13. These results provide preliminary evidence that PR-STLight remains executable under basic timing constraints. Future work will further incorporate pedestrian phases, transit signal priority, and emergency vehicle preemption through priority-request features, action masks, phase-extension rules, and safety execution layers.

6. Conclusions

This study proposed PR-STLight, a prediction-regularized spatio-temporal Transformer framework for offline multi-intersection traffic signal control. The main idea is to use short-term future inbound-queue evolution as structural supervision for shared representation learning, so that the model not only imitates historical phase actions but also preserves information about near-future congestion formation and propagation. By integrating neighborhood-constrained spatial attention, causal temporal attention, TRQP, and two-stage optimization, PR-STLight provides a unified framework for learning executable signal-control policies from fixed offline replay buffers.

The experimental results show that prediction-regularized spatio-temporal learning improves the effectiveness and robustness of offline signal control under the adopted CityFlow benchmark protocol. On the Jinan $3 \times 4$ and Hangzhou $4 \times 4$ networks, PR-STLight achieves the lowest average travel time among the compared methods and shows clear improvement over the TransformerLight backbone. Additional evaluations under different demand levels and a sudden-demand-surge stress test further indicate that the proposed design remains beneficial under traffic-demand variations and representative non-recurrent demand shocks. The ablation and sensitivity analyses also support the roles of the spatio-temporal module, the TRQP branch, and the selected historical and prediction windows, while showing that pretraining mainly contributes to optimization stability rather than guaranteeing the best final score in every setting.

From an engineering perspective, PR-STLight is most relevant to offline traffic-signal-control scenarios where historical trajectory data are available and direct online trial-and-error is undesirable. Such settings are common in multi-intersection corridors or urban grid areas where unsafe exploration, long training cycles, and congestion propagation make online reinforcement learning difficult to deploy directly. The proposed framework suggests that combining action imitation with short-term traffic-evolution supervision can be a practical way to improve offline policy learning for coordinated signal control.

The demonstrated claims remain bounded by the current experimental setting. The evaluation is based on two standard CityFlow grid benchmarks and a compact traffic-state representation. Broader deployment will require validation on irregular urban networks, richer input features, capacity-side incidents such as crashes and lane blockages, and more realistic operational constraints such as pedestrian phases, transit priority, and emergency-vehicle preemption. These extensions will be the focus of our future work toward more realistic and deployable prediction-regularized offline signal-control systems.

Author Contributions

Writing—original draft, Y.D.; Writing—review and editing, H.L., Z.W. and R.L.; Conceptualization, Y.D., H.L. and Z.W.; Methodology, Y.D. and H.L.; Software, Y.D.; Formal analysis, Y.D.; Investigation, Y.D.; Validation, T.X.; Funding acquisition, H.L.; Experimental design, H.L., T.X. and Z.W.; Data curation, T.X.; Visualization, Y.D.; Supervision, R.L. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the National Natural Science Foundation of China (No. 62406251) and the Natural Science Foundation of Gansu (No. 26JRRA256).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new road-network benchmark dataset was generated in this study. The simulator configurations and traffic data used to support the findings of this work are based on publicly available CityFlow-compatible benchmark data for multi-intersection traffic signal control. The road-network and traffic-flow benchmark files for the Jinan and Hangzhou networks are available at https://github.com/wingsweihua/colight/tree/master/data (accessed on 2 April 2026). The offline replay-buffer data used for training and evaluation in this study are available at https://drive.google.com/drive/folders/1Y_gDn4l6bWl6M97WAtWnsIfuwVhDNVD6?usp=drive_link (accessed on 2 April 2026).

Acknowledgments

We thank the developers and maintainers of the datasets used in this work. The authors are grateful to the anonymous reviewers for their valuable feedback.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

TRQP	Topology-Recurrent Queue Predictor
RL	Reinforcement Learning
GCN	Graph Convolutional Network
GRU	Gated Recurrent Unit
MHA	Multi-Head Attention
ATT	Average Travel Time
AIQ	Average Inbound Queue

Figure 1. Representative intersection layout and local-topology abstraction used in the multi-intersection signal-control task. Panel (a) illustrates a four-leg signalized intersection with lane-level vehicle states. The green dashed arrow indicates the currently permitted east–west through movement, whereas the red signal markers indicate stopped approaches. Panel (b) summarizes the corresponding control abstraction: the central intersection is connected to its first-order neighboring intersections, and four candidate signal phases are defined, including bidirectional east–west through movement, bidirectional north–south through movement, and two protected left-turn phase groups. At each decision step, the controller observes the lane-level traffic state and selects one of these candidate phases, which determines the permitted movements and influences queue discharge and traffic propagation to adjacent intersections.

Figure 2. Overall framework of PR-STLight. The ellipsis denotes omitted repeated sequence elements, arrows indicate information flow between modules, colors distinguish different functional components, and the × symbol denotes blocked or masked connections.

Figure 3. Benchmark environments used in the experiments.

Figure 4. Comparative training curves on the Jinan 3 × 4 and Hangzhou 4 × 4 networks. Subfigures (a,b) show average travel time and average inbound queue on Jinan, while subfigures (c,d) show the same metrics on Hangzhou. To improve readability, only selected key methods are shown. The insets enlarge the final ten evaluation rounds to show post-convergence stability.

Figure 5. Curves under the sudden-demand-surge stress test on the Hangzhou $4 \times 4$ network. The left panel shows ATT, and the right panel shows AIQ. The insets enlarge the final ten evaluation rounds after convergence to show late-stage stability.

Figure 6. Ablation convergence curves on Jinan $3 \times 4$ using average travel time. PR-STLight reaches a stable low-error regime with fast convergence and smoother late-stage behavior. Although w/o Pretrain attains a slightly lower final mean ATT, it converges later and shows larger late-stage variation.

Table 1. Implementation details used in PR-STLight.

Category	Item	Value
Environment	Simulator	CityFlow 0.1
Environment	Benchmark networks	Jinan $3 \times 4$ (12 intersections); Hangzhou $4 \times 4$ (16 intersections)
Environment	Decision interval	15 s
Environment	Episode length	3600 s per episode
Environment	Offline replay buffer	Fixed pre-collected memory files (Jinan/Hangzhou)
Model	Hidden size d	256
Model	Transformer layers	10
Model	Attention heads	4
Model	Feed-forward inner size	512
Model	History length L	8
Model	Prediction hidden size	128
Model	Prediction horizon K	3
Software	Python	3.8.20
Software	PyTorch	2.4.1 + cu121

Table 2. Comparative results on the Jinan $3 \times 4$ network.

Method	ATT/s ↓	ATT Impr./%	AIQ/Veh ↓	AIQ Impr./%
Fixed Time	455.06	–	318.72	–
SOTL	384.87	–	213.47	–
PressLight	289.79 ± 24.39	16.85	103.46 ± 19.35	59.05
CoLight	277.55 ± 8.96	20.37	97.34 ± 24.85	61.48
HiLight	411.39 ± 2.08	−18.04	334.30 ± 6.04	−32.30
InitLight	347.03 ± 39.90	0.43	326.11 ± 75.28	−29.06
TransformerLight (base)	348.53 ± 3.32	–	252.68 ± 4.77	–
PR-STLight	274.39 ± 0.96 ***	21.27	135.73 ± 1.44 ***	46.28

Note: Learning-based methods are reported as mean ± between-run standard deviation across five independent random-seed runs; deterministic rule-based methods are reported as single values. Percentage improvements are relative to TransformerLight (base), with positive values indicating improvement. *** indicates $p < 0.001$ under a two-sided paired t-test using matched seed-level run scores. Bold values indicate the best results, underlined values indicate the second-best results, and ↓ indicates that lower values are better.
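As a worked check of the improvement columns, the short Python sketch below recomputes the relative improvement over TransformerLight (base) from the means printed in Table 2. The function name `improvement_pct` is illustrative. Because the table reports rounded means, a recomputed percentage can differ from the printed one in the last digit for some entries.

```python
def improvement_pct(baseline: float, method: float) -> float:
    """Relative improvement over the baseline in percent (lower metric is better)."""
    return (baseline - method) / baseline * 100.0


# Mean values copied from Table 2 (Jinan network).
base_att, base_aiq = 348.53, 252.68   # TransformerLight (base)
pr_att, pr_aiq = 274.39, 135.73       # PR-STLight
hilight_att = 411.39                  # HiLight, worse than the baseline

print(round(improvement_pct(base_att, pr_att), 2))       # 21.27
print(round(improvement_pct(base_aiq, pr_aiq), 2))       # 46.28
print(round(improvement_pct(base_att, hilight_att), 2))  # -18.04
```

A negative value, as for HiLight, means the method is worse than the baseline on that metric.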

Table 3. Comparative results on the Hangzhou $4 \times 4$ network.

Method	ATT/s ↓	ATT Impr./%	AIQ/Veh ↓	AIQ Impr./%
Fixed Time	497.98	–	192.90	–
SOTL	470.64	–	177.72	–
PressLight	317.42 ± 16.99	14.65	53.81 ± 14.94	40.83
CoLight	296.46 ± 1.31	20.28	37.40 ± 1.12	58.87
HiLight	403.11 ± 23.60	−8.39	100.78 ± 20.16	−10.82
InitLight	300.28 ± 2.93	19.26	72.42 ± 3.03	20.37
TransformerLight (base)	371.91 ± 1.43	–	90.94 ± 1.34	–
PR-STLight	288.09 ± 1.91 ***	22.54	30.66 ± 1.54 ***	66.29

Note: Learning-based methods are reported as mean ± between-run standard deviation across five independent random-seed runs; deterministic rule-based methods are reported as single values. Percentage improvements are relative to TransformerLight (base), with positive values indicating improvement. *** indicates $p < 0.001$ under a two-sided paired t-test using matched seed-level run scores. Bold values indicate the best results, underlined values indicate the second-best results, and ↓ indicates that lower values are better.

Table 4. Demand-level validation on the Hangzhou $4 \times 4$ network. Values are reported as mean ± standard deviation following the statistical aggregation protocol in Section 5.1.3. Percentages in parentheses denote relative changes compared with TransformerLight (base) under the same demand level. Bold values indicate the best results. The ↓ and ↑ arrows indicate decreases and increases relative to TransformerLight (base), respectively.

Demand Level	Method	ATT/s ↓	AIQ/Veh ↓
0.7× Low	SOTL	$507.61$	$144.66$
0.7× Low	PressLight	$301.02 \pm 7.28$ (↓ 42.65%)	$27.05 \pm 3.81$ (↓ 84.75%)
0.7× Low	CoLight	$290.81 \pm 1.23$ (↓ 44.60%)	$22.56 \pm 0.65$ (↓ 87.28%)
0.7× Low	TransformerLight (base)	$524.91 \pm 9.50$ (−)	$177.35 \pm 6.05$ (−)
0.7× Low	PR-STLight	$280.18 \pm 0.21$ (↓ 46.62%)	$16.12 \pm 0.13$ (↓ 90.91%)
1.0× Medium	SOTL	$470.64$	$177.72$
1.0× Medium	PressLight	$332.35 \pm 28.20$ (↓ 17.56%)	$66.22 \pm 22.47$ (↓ 42.44%)
1.0× Medium	CoLight	$299.33 \pm 1.13$ (↓ 25.75%)	$39.82 \pm 1.18$ (↓ 65.39%)
1.0× Medium	TransformerLight (base)	$403.14 \pm 7.04$ (−)	$115.05 \pm 6.10$ (−)
1.0× Medium	PR-STLight	$286.73 \pm 0.73$ (↓ 28.88%)	$28.27 \pm 0.48$ (↓ 75.43%)
1.5× Oversaturated	SOTL	$455.89$	$239.00$
1.5× Oversaturated	PressLight	$463.57 \pm 66.16$ (↑ 5.58%)	$257.49 \pm 87.89$ (↑ 29.99%)
1.5× Oversaturated	CoLight	$341.91 \pm 3.27$ (↓ 22.13%)	$103.95 \pm 7.14$ (↓ 47.52%)
1.5× Oversaturated	TransformerLight (base)	$439.06 \pm 3.20$ (−)	$198.09 \pm 3.39$ (−)
1.5× Oversaturated	PR-STLight	$311.43 \pm 0.56$ (↓ 29.07%)	$63.76 \pm 0.55$ (↓ 67.81%)

Table 5. Demand-level validation on the Jinan $3 \times 4$ network. Values are reported as mean ± standard deviation following the statistical aggregation protocol in Section 5.1.3. Percentages in parentheses denote relative changes compared with TransformerLight (base) under the same demand level. Bold values indicate the best results, underlined values indicate the second-best results, and the ↓ arrow indicates a decrease relative to TransformerLight (base).

Demand Level	Method	ATT/s ↓	AIQ/Veh ↓
0.7× Low	SOTL	$348.95$	$188.72$
0.7× Low	PressLight	$241.23 \pm 3.23$ (↓ 20.87%)	$59.79 \pm 3.72$ (↓ 55.42%)
0.7× Low	CoLight	$245.03 \pm 1.29$ (↓ 19.63%)	$66.13 \pm 1.56$ (↓ 50.69%)
0.7× Low	TransformerLight (base)	$304.86 \pm 27.77$ (−)	$134.11 \pm 33.87$ (−)
0.7× Low	PR-STLight	$233.57 \pm 0.36$ (↓ 23.38%)	$51.05 \pm 0.36$ (↓ 61.93%)
1.0× Medium	SOTL	$371.59$	$302.15$
1.0× Medium	PressLight	$276.37 \pm 3.12$ (↓ 24.14%)	$137.99 \pm 7.81$ (↓ 50.94%)
1.0× Medium	CoLight	$277.07 \pm 4.28$ (↓ 23.95%)	$141.53 \pm 7.40$ (↓ 49.68%)
1.0× Medium	TransformerLight (base)	$364.33 \pm 3.79$ (−)	$281.28 \pm 8.86$ (−)
1.0× Medium	PR-STLight	$272.33 \pm 0.48$ (↓ 25.25%)	$128.02 \pm 0.85$ (↓ 54.49%)
1.5× Oversaturated	SOTL	$462.08$	$638.96$
1.5× Oversaturated	PressLight	$473.41 \pm 56.75$ (↓ 11.08%)	$667.95 \pm 140.55$ (↓ 17.64%)
1.5× Oversaturated	CoLight	$404.40 \pm 10.73$ (↓ 24.04%)	$485.65 \pm 23.36$ (↓ 40.12%)
1.5× Oversaturated	TransformerLight (base)	$532.42 \pm 10.19$ (−)	$811.04 \pm 26.60$ (−)
1.5× Oversaturated	PR-STLight	$\underline{437.03 \pm 0.52}$ (↓ 17.92%)	$\underline{560.47 \pm 1.34}$ (↓ 30.89%)

Table 6. Performance under the sudden-demand-surge stress test on the Hangzhou $4 \times 4$ network. The surge is imposed from 1200 s to 1800 s by increasing vehicle arrivals by 50%, while the road network, lane availability, phase definitions, action interval, reward setting, and offline replay buffer remain unchanged. Values are reported as mean ± standard deviation following the statistical aggregation protocol in Section 5.1.3. Percentages in parentheses denote relative changes compared with TransformerLight (base).

Method	ATT/s ↓	AIQ/Veh ↓
PressLight	395.01 ± 47.37 (↓ 1.18%)	127.00 ± 51.07 (↑ 4.78%)
CoLight	303.65 ± 1.40 (↓ 24.04%)	46.52 ± 1.21 (↓ 61.62%)
TransformerLight (base)	399.74 ± 3.66 (−)	121.21 ± 3.55 (−)
PR-STLight w/o TRQP	304.48 ± 12.93 (↓ 23.83%)	47.93 ± 12.43 (↓ 60.46%)
PR-STLight	298.82 ± 3.50 (↓ 25.25%)	42.58 ± 3.38 (↓ 64.87%)

Remark: Reported values follow the statistical aggregation protocol in Section 5.1.3. Percentages are relative changes with respect to TransformerLight (base); downward arrows indicate reductions and upward arrows indicate increases. Lower values are better. Bold numeric values indicate the best result, and underlining indicates the second-best result.
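The surge definition in the caption of Table 6 (a 50% increase in vehicle arrivals between 1200 s and 1800 s) can be illustrated with a small Python sketch. This is one possible construction and not necessarily the procedure used to generate the stress-test flow files: the function name `surge_arrivals`, the half-open window [1200, 1800), and the choice to duplicate every second arrival are all assumptions made for the example.

```python
def surge_arrivals(arrivals, start=1200, end=1800, extra_every=2):
    """Add one extra vehicle per `extra_every` arrivals inside [start, end).

    With extra_every == 2, arrivals inside the window grow by 50%.
    Arrival times outside the window are returned unchanged.
    """
    out, seen = [], 0
    for t in arrivals:
        out.append(t)
        if start <= t < end:
            seen += 1
            if seen % extra_every == 0:
                out.append(t)  # duplicated departure time = extra vehicle
    return out


# Hypothetical uniform demand: one vehicle every 10 s over a 3600 s episode.
base = list(range(0, 3600, 10))
surged = surge_arrivals(base)

in_window = lambda xs: sum(1200 <= t < 1800 for t in xs)
print(in_window(base), in_window(surged))  # 60 90
```

Only the arrival process changes in this sketch, which mirrors the statement that the road network, phase definitions, and offline replay buffer are left untouched.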

Table 7. Ablation summary on the Jinan $3 \times 4$ network using Average Travel Time. The results follow the same statistical aggregation protocol as the main comparison. Percentage changes are calculated relative to PR-STLight.

Model Variant	ATT/s ↓	ATT Change/%	ConvRound ↓	Start Mean ↓
w/o Both	346.79 ± 2.04	26.89	7.0	801.13
w/o Pretrain	273.12 ± 1.20	−0.07	11.0	452.92
w/o TRQP	273.90 ± 1.81	0.22	15.0	489.05
w/o ST	289.87 ± 3.77	6.06	18.0	306.85
PR-STLight	273.30 ± 0.46	–	6.0	412.06

Note: Percentage changes are calculated relative to PR-STLight, where positive values indicate degradation and negative values indicate improvement. ConvRound denotes the first round at which the curve becomes stably flat, while Start Mean reflects the initial-stage metric level. Smaller values are better for ATT, ConvRound, and Start Mean. The downward arrows in the table header indicate that lower values are better. Boldface indicates the best value, and underlining indicates the second-best value.
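The note above describes ConvRound only verbally, as the first round at which the curve becomes stably flat. One possible way to operationalize such a criterion is sketched below in Python. The rule used here is an assumption made for illustration, not the definition applied in Table 7: a round counts as converged if it and every later round stay within a relative tolerance `tol` of the mean of the last `tail` rounds. The function name, `tol`, `tail`, and the example curve are all hypothetical.

```python
def conv_round(curve, tol=0.02, tail=5):
    """Return the first 1-based round from which the curve stays flat.

    "Flat" means every remaining value lies within tol * ref of ref,
    where ref is the mean of the last `tail` values (assumed criterion).
    """
    window = curve[-tail:]
    ref = sum(window) / len(window)
    for i in range(len(curve)):
        if all(abs(v - ref) <= tol * ref for v in curve[i:]):
            return i + 1
    return None  # the curve never settles within the tolerance


# Hypothetical ATT curve (seconds per evaluation round), not measured data.
att_curve = [800, 500, 300, 280, 275, 274, 273, 273, 274, 273, 273, 274]
print(conv_round(att_curve))  # 5
```

Under this rule a tighter `tol` or a noisier late-stage curve pushes the reported round later, so convergence rounds are only comparable across variants when the same criterion is applied to all of them.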

Table 8. Sensitivity analysis of PR-STLight with respect to historical block length L and prediction horizon K on the Jinan $3 \times 4$ network. The reported values are calculated as mean ± standard deviation over the final ten evaluation rounds after convergence. The downward arrows indicate that lower values are better.

Sweep	L	K	ATT/s ↓	AIQ/Veh ↓
Historical length	4	3	$273.40 \pm 1.99$	$134.47 \pm 3.62$
Historical length	8	3	$273.37 \pm 2.18$	$133.80 \pm 3.23$
Historical length	12	3	$276.06 \pm 3.32$	$138.73 \pm 5.59$
Prediction horizon	8	1	$273.08 \pm 3.83$	$133.73 \pm 6.19$
Prediction horizon	8	3	$273.37 \pm 2.18$	$133.80 \pm 3.23$
Prediction horizon	8	5	$274.46 \pm 3.83$	$135.95 \pm 7.16$
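The aggregation stated in the caption of Table 8 (mean ± standard deviation over the final ten evaluation rounds) can be reproduced with the Python standard library. The sketch below is illustrative: the function name `final_window_summary` is hypothetical, the example scores are made up, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the caption does not say which estimator is used.

```python
from statistics import mean, stdev


def final_window_summary(scores, window=10):
    """Mean and sample standard deviation over the last `window` rounds."""
    tail = scores[-window:]
    return mean(tail), stdev(tail)


# Hypothetical per-round ATT values (s): a warm-up phase followed by a
# plateau that alternates between 272 s and 274 s.
scores = [300.0] * 5 + [272.0, 274.0] * 5
m, s = final_window_summary(scores)
print(f"{m:.2f} ± {s:.2f}")  # 273.00 ± 1.05
```

Switching to `statistics.pstdev` would give the population estimator instead, which yields slightly smaller spreads for a ten-round window.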

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
