Fuzzy-Guided Exploration for Multi-Agent Reinforcement Learning in Traffic Signal Control

Ćiprovski, Dejan; Ilić, Nemanja; Božilović, Boško; Vučetić, Miljan

doi:10.3390/math14111942

¹ Artificial Intelligence Department, Vlatacom Institute, 11070 Belgrade, Serbia

² Faculty of Informatics and Computing, Singidunum University, 11000 Belgrade, Serbia

³ Department of Information Technologies, College of Applied Technical Sciences, 37000 Kruševac, Serbia

⁴ School of Computing, Union University, 11000 Belgrade, Serbia

⁵ Department of Systems Engineering and Informatics, VŠB – Technical University of Ostrava, 708 00 Ostrava, Czech Republic

* Author to whom correspondence should be addressed.

Mathematics 2026, 14(11), 1942; https://doi.org/10.3390/math14111942

Submission received: 6 May 2026 / Revised: 24 May 2026 / Accepted: 27 May 2026 / Published: 2 June 2026

(This article belongs to the Special Issue Advances in Machine Learning and Intelligent Systems)

Abstract

Traffic signal control over urban networks requires coordinating the controllers of multiple signalized intersections toward a shared goal of minimizing network-wide congestion. Multi-agent reinforcement learning (MARL) methods have shown considerable promise in this setting. The epsilon–greedy exploration strategy adopted by many of these methods treats every candidate signal phase as equally worth trying, discarding the rich domain knowledge that traffic theory already provides. This paper proposes fuzzy-guided exploration, in which a multi-criteria fuzzy inference system uses local traffic conditions, with phase pressure as its primary input, to assign each candidate phase a priority. These priorities define a sampling distribution used in place of the uniform draw. We evaluate the method across four MARL algorithms covering independent learning (IQL) and the centralized training with decentralized execution paradigm (VDN, QMIX, and QPLEX) on both a synthetic grid and a real-world network. Fuzzy-guided exploration consistently improves upon the baseline in all combinations, with tangible gains on the synthetic grid and substantially larger improvements on the real-world network. These findings demonstrate that exploration is an effective intervention point for domain-knowledge integration in cooperative MARL, and that pressure-based scoring provides a well-suited signal to serve that role in traffic signal control.

Keywords: traffic signal control; multi-agent reinforcement learning; fuzzy logic; exploration strategy; MaxPressure; epsilon–greedy

MSC: 68T05; 68T42; 03E72; 90B20

1. Introduction

Urban traffic congestion imposes substantial and growing costs on cities worldwide, driven by increasing vehicular demand against largely fixed road infrastructure. Traffic signal control (TSC) is the primary operational lever through which cities can extract additional throughput from existing networks without capital investment in new infrastructure [1]. Classical approaches to TSC, ranging from fixed-time plans to sensor-responsive controllers, are well-studied and practically deployed [2]; yet, their responses to traffic conditions are ultimately governed by predetermined rules that cannot be autonomously refined through experience. Reinforcement learning (RL) offers a complementary paradigm: rather than prescribing a fixed response, an agent learns a control policy directly from the experience of interacting with the environment [3]. RL has shown considerable promise as a framework for traffic signal control [4,5,6,7]. Despite this progress, the domain knowledge embodied in classical methods, which is principled, well-understood, and freely available, has been integrated into RL approaches to TSC unevenly, leaving certain elements of the learning process largely overlooked.

At the scale of real traffic networks, the RL problem is inherently multi-agent: each signalized intersection is typically treated as an agent observing local traffic conditions and selecting a signal phase as its action, and effective control requires these agents to act cooperatively toward a shared objective, a setting that falls naturally within the multi-agent reinforcement learning (MARL) framework [8,9]. Coordinating multiple agents raises an immediate scalability challenge: a fully centralized controller must reason over a joint action space that grows exponentially with the number of intersections, while fully independent learners suffer from non-stationarity induced by simultaneously adapting agents. Among the paradigms proposed to navigate this tradeoff, centralized training with decentralized execution (CTDE) [10] has become one of the most widely adopted frameworks in cooperative MARL, enabling agents to exploit global information during training while acting solely on local observations at execution time. Within the MARL-TSC literature, empirical studies suggest that independent decentralized methods remain competitive and robust across realistic traffic networks [4], while more recent evidence indicates that coordinated CTDE approaches yield increasing advantages as network complexity grows [11]. Beyond the choice of training paradigm, the field has incorporated traffic domain knowledge through a range of specialized contributions, including pressure-based reward and observation design [12,13], graph-based architectures that reflect network topology [14], and communication mechanisms that share local state across neighboring agents [15]. One component of the learning process, however, has remained largely unexamined: the exploration mechanism.

MARL methods for TSC predominantly rely on value-based learning, adopting ε-greedy exploration as a default inherited from single-agent RL [3] and thereby treating all candidate signal phases as equally worthy of exploration regardless of prevailing traffic conditions. Classical controllers, such as MaxPressure [16], which activates the phase maximizing the pressure differential across controlled lane pairs, demonstrate that principled, locally computable rankings of phase quality are readily available. The gap between uniform exploration and these available rankings has received no dedicated attention in the literature, despite a well-developed body of work on structured exploration in single-agent RL [17,18,19], recent recognition of the exploration gap in cooperative MARL [20], and the existence of domain-knowledge-guided learning and exploration approaches. Fuzzy domain knowledge has been integrated directly into the policy network in [21], while in [22] a fuzzy safety guide has been incorporated throughout the single-agent learning pipeline, modulating exploration and exploitation actions while shaping the reward function to direct Q-value learning in general domains; the broader pattern of blending structured domain expertise with data-driven models extends across engineering fields [23]. The TSC setting presents its own set of challenges: traffic demand is inherently time-varying, the interaction of simultaneously learning agents through shared physical infrastructure compounds this non-stationarity, and, as we discuss in Section 5, interventions that appear well-motivated often fail to produce consistent gains in practice.

This paper proposes fuzzy-guided exploration for cooperative MARL, applied to traffic signal control. Rather than replacing the ε-greedy mechanism, we retain its structure and replace only the uniform random sampling process. During exploration, each agent samples from a structured distribution over candidate phases derived from a multi-criteria fuzzy inference system [24]. The system is built around pressure-based phase scoring [16], further incorporating queue length and waiting time into a single priority score per phase. The method is non-invasive (the reward signal, observation space, and learning objective remain completely unchanged) and decentralized, as each agent computes its exploration distribution from its own local observations and locally observed network structure without communication. It is evaluated across four value-based MARL algorithms covering independent learning (IQL [25]) and the CTDE paradigm (VDN [26], QMIX [27], and QPLEX [28]) on a synthetic grid network [29] and a real-world network [4]. Fuzzy-guided exploration consistently outperforms the ε-greedy baseline across all algorithm and environment combinations, with gains that grow markedly from the synthetic grid to the real-world network.

The remainder of this paper is organized as follows. Section 2 reviews the reinforcement learning and multi-agent foundations, the CTDE paradigm and value-based cooperative MARL algorithms, and the role of exploration. In Section 3, the proposed fuzzy-guided exploration method is presented in detail, covering the TSC problem instantiation, the pressure-based phase scoring, the multi-criteria fuzzy rule system, and the integration with the ε-greedy training loop. Numerical simulations and results across different environments and algorithms are reported in Section 4. Section 5 discusses the method’s behavior and design choices. Finally, Section 6 concludes the paper.

2. Background

2.1. Reinforcement Learning and Multi-Agent Foundations

Traffic signal control is inherently a sequential decision-making problem: a control agent must repeatedly select signal phases over time, with each decision affecting both immediate traffic flow and future traffic network conditions. Reinforcement learning provides a principled framework for such problems, formalized through the concept of a Markov decision process (MDP), in which an agent repeatedly interacts with an environment and future outcomes depend only on the current state [3]. At each discrete time step $t$, the agent observes state $s_t \in S$, selects action $a_t \in A$ according to policy $\pi(a \mid s)$, transitions to a successor state $s_{t+1}$ according to the state transition function $P(\cdot \mid s_t, a_t)$, and receives scalar reward $r_t$ according to the reward function $R(s_t, a_t)$. Together with discount factor $\gamma \in [0, 1)$, which down-weights future rewards relative to immediate ones, these components define the MDP tuple $\langle S, A, P, R, \gamma \rangle$. The objective is to find a policy $\pi^{*}$ that maximizes the expected return (cumulative discounted reward):

$$J(\pi) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T} \gamma^{k} r_{k}\right], \tag{1}$$

where $\mathbb{E}_{\pi}[\cdot]$ denotes the expectation over trajectories induced by policy $\pi$ and $T$ is the episode horizon. A central object in value-based RL is the action-value function

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{T} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a\right], \tag{2}$$

which quantifies the expected return of executing action $a$ in state $s$ and thereafter following policy $\pi$. The optimal action-value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$ yields the optimal policy directly via $\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)$; value-based methods approximate $Q^{*}$ through iterative temporal-difference (TD) updates. The MDP formulation assumes the environment state $s_t$ is fully observable; when only a partial observation $o_t$ of $s_t$ is available to the agent, the problem is formalized as a partially observable MDP (POMDP), in which a memoryless policy over current observations is generally suboptimal and the agent typically conditions its policy on the history of observations and actions.
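As a small illustration of the return inside the expectation of Equation (1), the sketch below sums discounted rewards for one finite episode. The function name `discounted_return` and the sample rewards are assumptions made for this example and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over one finite episode (the term inside Eq. (1))."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate from the last reward backwards so that each step is a
    # single multiply-add: G_k = r_k + gamma * G_{k+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical three-step episode with unit rewards: 1 + 0.5 + 0.25.
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Averaging this quantity over many sampled episodes gives a Monte Carlo estimate of $J(\pi)$.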

We now extend this framework to the cooperative multi-agent setting, in which $n$ agents act simultaneously in a shared environment. We index agents by $i \in N \equiv \{1, \dots, n\}$ and use the subscript $i$ throughout to denote agent-specific quantities; the time index $t$ is retained as a subscript where both dimensions are relevant, yielding notation such as $a_{i,t}$ for the action of agent $i$ at step $t$. Each agent $i$ selects action $a_i \in A_i$ from its own action space according to its policy $\pi_i$, and the joint action $a = (a_1, \dots, a_n)$ is executed simultaneously. In fully cooperative settings, agents may share a single team reward $r_t$ or receive individual rewards shaped to align with a common objective; we adopt the former formulation, so the joint policy $\pi = (\pi_1, \dots, \pi_n)$ is optimized to maximize the expected discounted return of (1). Partial observability generalizes naturally to the multi-agent case: each agent $i$ receives a local observation $o_i$ of the environment state rather than observing it directly, making the problem a decentralized partially observable MDP (Dec-POMDP) [9]. To support decision-making under partial observability, each agent maintains an action-observation history $\tau_i$, comprising the sequence of its past observations and actions; in practice, $\tau_i$ is typically processed by a recurrent neural network, which maintains a compressed hidden state rather than storing the full history explicitly. The implications of the Dec-POMDP structure for cooperative learning are discussed in the next subsection.

2.2. Centralized Training with Decentralized Execution

The Dec-POMDP structure introduced above poses a fundamental tension for cooperative MARL: coordination requires the agents’ actions to fit together coherently, yet at execution time each agent has access only to its own local history $\tau_i$. Two extreme solutions are immediate but unsatisfactory: treating the system as a single agent with a joint action space is computationally intractable, while having each agent learn independently treats the non-stationarity induced by simultaneously learning agents as environment noise.

The paradigm of centralized training with decentralized execution (CTDE) [10,30] resolves this tension by separating the training and execution phases. During training, a centralized learner has access to global information (the state, all agent observations, and all agent actions) and uses this to learn value functions or policies that support coordination. During execution, each agent acts purely on the basis of its local history $\tau_i$, with no communication or access to the global state, adhering to the Dec-POMDP formulation. CTDE has become the dominant paradigm for cooperative MARL.

A key requirement for value-based CTDE is the Individual-Global-Max (IGM) principle [31], which links the centralized joint value function $Q_{tot}(\tau, a)$ to the agents’ local value functions $Q_i(\tau_i, a_i)$, where $\tau = (\tau_1, \dots, \tau_n)$ is the joint action-observation history. IGM requires that the joint greedy action derived from $Q_{tot}$ coincide with the collection of locally greedy actions:

$$\arg\max_{a} Q_{tot}(\tau, a) = \begin{pmatrix} \arg\max_{a_1} Q_1(\tau_1, a_1) \\ \vdots \\ \arg\max_{a_n} Q_n(\tau_n, a_n) \end{pmatrix}. \tag{3}$$

IGM is what aligns decentralized execution with centralized training: each agent can select actions greedily from its local $Q_i$, and the resulting joint action matches the joint greedy action under $Q_{tot}$.

Independent Q-Learning (IQL) [25] is the simplest baseline: each agent learns a separate $Q_i(\tau_i, a_i)$ using standard Q-learning, with no joint value function and therefore no IGM-style coordination mechanism. It nonetheless remains a competitive baseline in many practical settings [4]. Value Decomposition Networks (VDNs) [26] take a first step toward coordination by representing the joint value function as a sum of individual utilities:

$$Q_{tot}(\tau, a) = \sum_{i=1}^{n} Q_i(\tau_i, a_i), \tag{4}$$

which satisfies IGM by construction: maximizing a sum of independent per-agent terms reduces to each agent maximizing its own term. QMIX [27] generalizes VDN by observing that full additivity is not necessary for IGM: monotonicity is sufficient. QMIX represents $Q_{tot}$ as the output of a state-conditioned mixing network that takes the individual $Q_i$ as inputs, with weights constrained to be non-negative, guaranteeing that

$$\frac{\partial Q_{tot}}{\partial Q_i} \geq 0, \quad \forall i \in N, \tag{5}$$

which permits a richer class of joint value functions than VDN while preserving tractable decentralized execution. QPLEX [28] further extends the representable function class through a duplex dueling architecture that decomposes both the joint and individual value functions into value and advantage components. This formulation encodes the IGM constraint directly into the network structure rather than through a monotonicity restriction, and in its history-based form is designed to achieve the complete set of IGM-consistent joint value functions. Together, VDN, QMIX, and QPLEX form a spectrum of value-decomposition methods of increasing expressiveness, and remain the foundation on which recent work continues to build [32,33].
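The IGM property of Equation (3) can be checked by brute force on a toy example. The sketch below compares the joint greedy action with the per-agent greedy actions under the additive mixing of Equation (4) and under a monotone mixer in the spirit of Equation (5). All names and numbers are hypothetical; in particular, `monotone_mix` is a single weighted sum with fixed non-negative weights and is only a stand-in, not the state-conditioned mixing network of QMIX.

```python
import itertools


def vdn_mix(chosen_q):
    """Additive mixing as in Eq. (4): Q_tot is the sum of the chosen utilities."""
    return sum(chosen_q)


def monotone_mix(chosen_q, weights, bias):
    """Toy monotone mixer: abs() keeps every weight non-negative, so Eq. (5) holds."""
    return sum(abs(w) * q for w, q in zip(weights, chosen_q)) + bias


def local_greedy(per_agent_q):
    """Each agent maximizes its own utility independently."""
    return tuple(max(range(len(q)), key=lambda a: q[a]) for q in per_agent_q)


def joint_greedy(per_agent_q, mixer):
    """Exhaustive argmax of Q_tot over the joint action space (toy sizes only)."""
    action_sets = [range(len(q)) for q in per_agent_q]

    def q_tot(joint_action):
        return mixer([q[a] for q, a in zip(per_agent_q, joint_action)])

    return max(itertools.product(*action_sets), key=q_tot)


if __name__ == "__main__":
    # Hypothetical utilities for three agents with 3, 2, and 4 actions.
    per_agent_q = [[0.1, 0.7, 0.3], [0.5, 0.2], [-1.0, 0.4, 0.0, 0.9]]
    print(local_greedy(per_agent_q))
    print(joint_greedy(per_agent_q, vdn_mix))
    print(joint_greedy(per_agent_q, lambda q: monotone_mix(q, [0.5, 2.0, 1.0], -0.3)))
```

The exhaustive search is exponential in the number of agents, which is the cost that IGM-consistent decompositions avoid at execution time.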

In practice, all of the above methods build on the deep Q-learning framework introduced by Deep Q-Networks (DQN) [34], which contributed three ingredients essential for stable value-based learning in high-dimensional settings: a neural network to approximate the action-value function $Q^{*}$, an experience replay buffer that stores past transitions and breaks temporal correlations by sampling random mini-batches for gradient updates, and a target network, which is a periodically synchronized copy of the main network that provides stable TD targets during training. These ingredients are inherited directly by IQL, VDN, QMIX, and QPLEX, adapted for the multi-agent Dec-POMDP setting with two modifications: recurrent neural networks, which process the local history $\tau_i$ into a compressed hidden state to handle partial observability [35]; and the value decomposition architectures described above, which extend the single-agent Q-network to the cooperative multi-agent setting.

The framework described above can be complemented by prior domain knowledge about which actions are likely to be beneficial in a given state. Several integration points exist, including reward shaping, which enriches the reward signal with domain-informed terms; observation augmentation, which appends heuristic quantities to the agent’s input; and exploration guidance, which steers the distribution over actions during exploration. The first two approaches modify the learning problem itself, altering what the agent optimizes or observes, and can therefore change the nature of the learned policy in ways that are difficult to control. Exploration guidance, by contrast, is non-invasive: it leaves the reward signal, observation space, and Q-learning objective untouched, and its influence recedes as exploration gives way to exploitation and the agent’s own value estimates increasingly govern action selection. This makes exploration guidance the natural and principled point of integration for domain knowledge within the CTDE framework.

2.3. Exploration in Reinforcement Learning

The most widely used exploration strategy in value-based RL is ε-greedy action selection [3], in which the agent acts greedily with respect to its current value estimates with probability $1 - \varepsilon$, and selects a uniformly random action with probability $\varepsilon$:

$$a_t = \begin{cases} \arg\max_{a} Q(s_t, a) & \text{with probability } 1 - \varepsilon, \\ \mathrm{uniform}(A) & \text{with probability } \varepsilon. \end{cases} \tag{6}$$

In practice, $\varepsilon$ is typically annealed linearly from $1.0$ to a small final value, allowing the agent to explore broadly early in training and increasingly exploit its learned value estimates as training progresses. The ε-greedy strategy is the default exploration mechanism in deep Q-learning [34] and is adopted without modification by the MARL frameworks discussed above, where each agent independently applies ε-greedy selection with respect to its local value function $Q_i(\tau_i, a_i)$. In the cooperative MARL setting, the final $\varepsilon$ is typically kept higher than in single-agent RL, as the non-stationarity induced by co-evolving agent policies makes continued exploration beneficial throughout training.
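The selection rule of Equation (6) with a linear annealing schedule can be sketched as follows. The final value `eps_end=0.05` and the horizon `anneal_steps=50_000` are placeholder assumptions chosen for illustration, not settings reported in this paper.

```python
import random


def linear_epsilon(step, eps_start=1.0, eps_end=0.05, anneal_steps=50_000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant."""
    fraction = min(max(step / anneal_steps, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Eq. (6): uniform random action with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # Exploration branch: every action is equally likely.
        return rng.randrange(len(q_values))
    # Exploitation branch: index of the largest value estimate.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    q_values = [0.1, 0.9, 0.3]  # hypothetical local value estimates
    for step in (0, 25_000, 50_000):
        eps = linear_epsilon(step)
        print(step, round(eps, 3), epsilon_greedy(q_values, eps))
```

In the multi-agent case each agent would call `epsilon_greedy` on its own local value estimates.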

Several more sophisticated exploration strategies address settings where ε-greedy exploration is insufficient. Optimism-based methods, such as Upper Confidence Bound (UCB) [17], maintain uncertainty estimates over action values and favor under-explored actions, providing theoretical guarantees in bandit and tabular settings; however, they remain agnostic to domain structure and address a different challenge: the exploration–exploitation tradeoff. UCB exploration has been applied to TSC-MARL in [36], though as one of several algorithmic contributions. Count-based methods [18] generalize the optimism principle to high-dimensional spaces by assigning exploration bonuses to rarely visited states, while curiosity-driven methods [19] derive intrinsic reward signals from the prediction errors of a learned forward model to encourage the agent to seek out novel states. Both are primarily designed to address the hard-exploration problem, where reward signals are sparse and the agent struggles to discover any useful learning signal; in many practical domains, such as traffic signal control, where a dense reward is received at every decision step, this is not the primary challenge.

The relevant issue is instead one of efficiency: uniform random exploration treats every action as equally worthy regardless of the current state, which represents a significant source of waste in structured domains where prior knowledge can identify which actions are likely to be more worth trying. Domain-knowledge-guided exploration, on the other hand, can act as a soft prior over actions during training, influencing which transitions are collected and therefore what the agent learns from. A design requirement that must be satisfied is that all actions retain non-zero probability under the guided exploration distribution, preserving stochastic coverage of the action space. The proposed solution applies domain-knowledge-guided exploration to traffic signal control, while satisfying the non-zero probability requirement by construction.

3. Proposed Method

In the TSC setting, each of the $n$ traffic signal controllers in the network is represented by one agent $i \in N$, and the collective goal is to minimize network-wide congestion through coordinated phase selection. The action of agent $i$ is the selection of a signal phase $a_i \in A_i$; each selectable phase corresponds to a set of non-conflicting vehicle movements assigned the right of passage [4,11]. The number of selectable phases varies by intersection geometry, so the action spaces $A_i$ are, in general, heterogeneous across agents. Each agent observes local traffic conditions on its adjacent lanes, such as per-lane vehicle density, queue length, and waiting time, measurable by on-site sensors. The reward signal is any team-level measure of network-wide congestion that aligns the incentives of all agents with global performance; common choices in the TSC literature include negative cumulative waiting time, negative average delay, and negative queue length summed over all lanes in the network [4,11,29].

The proposed method replaces the uniform random draw in the exploration branch of ε-greedy action selection with a structured distribution over candidate phases derived from local traffic observations. Rather than treating all phases as equally worthy of exploration regardless of the current traffic state, the method assigns each candidate phase a priority score derived from domain knowledge and samples the exploration action from a distribution shaped by these scores. The priority computation proceeds in three sequential stages: a pressure score is first derived for each phase from lane density observations; this is then combined with queue length and waiting time through a multi-criteria fuzzy rule system into a single priority value per phase; finally, the priority vector is passed through a temperature-controlled softmax to yield a sampling distribution.
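The final stage and its place in the ε-greedy loop can be sketched as below. The text above specifies only a temperature-controlled softmax, so the exact form used here (probabilities proportional to exp(priority / temperature)), the default `temperature=0.5`, and the function names are assumptions for illustration. The priorities are taken as given; how they are computed is the subject of the following subsections.

```python
import math
import random


def softmax(priorities, temperature):
    """Map priority scores to a sampling distribution; every entry stays positive."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [p / temperature for p in priorities]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def guided_epsilon_greedy(q_values, priorities, epsilon, temperature=0.5, rng=random):
    """Epsilon-greedy whose exploration branch samples from the priority softmax."""
    if rng.random() < epsilon:
        probs = softmax(priorities, temperature)
        # Guided exploration: higher-priority phases are drawn more often,
        # but no phase has zero probability.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # The greedy branch is identical to standard epsilon-greedy.
    return max(range(len(q_values)), key=lambda a: q_values[a])


if __name__ == "__main__":
    priorities = [0.2, 0.8, 0.5]  # hypothetical per-phase priority scores
    print(softmax(priorities, temperature=0.5))
    print(guided_epsilon_greedy([0.1, 0.9, 0.3], priorities, epsilon=1.0))
```

Under this assumed form, a large temperature flattens the distribution toward the uniform draw of standard ε-greedy, while a small temperature concentrates sampling on the highest-priority phase.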

3.1. Pressure-Based Action Prioritization

Traffic pressure quantifies the degree of congestion imbalance across a signal phase: for each candidate phase, it measures the difference in traffic load between the incoming and outgoing lanes controlled by that phase. This signal provides a locally computable, interpretable ranking of phase quality—phases with high pressure correspond to approaches where activating the green signal would most effectively relieve congestion. MaxPressure [16] is the most prominent controller built on this signal: it activates at each step the phase with the highest pressure score, and has been shown to be throughput-optimal under mild conditions on arrival rates. MaxPressure has been adopted as a strong rule-based baseline in the TSC literature, where it consistently outperforms fixed-time control and remains competitive with learned controllers in a range of network configurations [4]. In the present work, pressure is not used to select a single phase deterministically but as a per-phase scoring signal: a pressure score is computed for every candidate phase and used as the first stage of the exploration priority computation.

For agent $i$ and candidate phase $a \in A_i$ (agent subscript on $a$ dropped for brevity), the pressure score is defined as

$$\Pi_i(a) = \sum_{(l^{in},\, l^{out}) \in C_i(a)} \left(\rho_{l^{in}} - \rho_{l^{out}}\right), \tag{7}$$

where $C_i(a) = \{(l^{in}, l^{out})\}$ is the set of lane-pairs controlled by phase $a$ at intersection $i$, and $\rho_l \in [0, 1]$ is the vehicle density on lane $l$, defined as the ratio of vehicle count to lane capacity. Unlike the formulation in [16], which uses raw vehicle counts, density-based pressure ensures comparability across lanes of different capacities. A positive pressure score indicates that the incoming lanes of phase $a$ have higher density than the corresponding outgoing lanes, signaling that activating phase $a$ would relieve a congested approach; a negative score indicates the reverse. Since the outgoing lanes of $i$ are physically the incoming lanes of its downstream neighbors, agent $i$’s local observations implicitly encode the congestion state of neighboring intersections, with no inter-agent communication required. As only the local sub-graph of intersection $i$ is consulted (its phases, controlled movements, and immediately adjacent lanes), the pressure computation applies without modification to networks of arbitrary size and heterogeneous intersection geometry.
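Equation (7) amounts to a sum of density differences over the lane-pairs of a phase. The sketch below shows this for a hypothetical two-phase intersection; the lane identifiers, phase names, and density values are invented for the example.

```python
def phase_pressure(lane_pairs, density):
    """Eq. (7): sum of (incoming density - outgoing density) over a phase's lane-pairs."""
    return sum(density[l_in] - density[l_out] for l_in, l_out in lane_pairs)


if __name__ == "__main__":
    # Hypothetical densities in [0, 1] (vehicle count divided by lane capacity).
    density = {
        "n_in": 0.8, "s_in": 0.6, "e_in": 0.2, "w_in": 0.1,
        "n_out": 0.3, "s_out": 0.1, "e_out": 0.5, "w_out": 0.4,
    }
    # Hypothetical phases: each maps to the (incoming, outgoing) pairs it serves.
    phases = {
        "NS": [("n_in", "s_out"), ("s_in", "n_out")],
        "EW": [("e_in", "w_out"), ("w_in", "e_out")],
    }
    for name, pairs in phases.items():
        print(name, round(phase_pressure(pairs, density), 3))
```

In this example the north-south phase has positive pressure (congested approaches feeding emptier exits) and the east-west phase has negative pressure, so a MaxPressure controller would activate north-south.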

3.2. Multi-Criteria Fuzzy Rule System

Fuzzy logic systems provide a principled framework for reasoning under uncertainty using linguistic variables and graded membership, making them a natural fit for domains where relevant quantities are continuous but most aptly described in qualitative terms such as low, moderate, or high [24]. In a fuzzy inference system, membership functions map each input variable to a degree of membership within linguistic categories; a rule base then encodes expert knowledge as if-then statements that fire to a degree proportional to how well their conditions are satisfied, and the contributions of all fired rules are aggregated into a single output through defuzzification. Traffic signal control is a particularly appropriate domain for this style of reasoning: congestion phenomena are inherently gradual, crisp thresholds separating congested from uncongested conditions are difficult to justify, and the urgency of a given phase is well captured by linguistic rules that domain experts can inspect and interpret directly [37].

The use of fuzzy membership functions also provides robustness to the imprecision inherent in sensor observations: gradual transitions between linguistic categories ensure that small perturbations in input values produce smoothly varying priority scores rather than sharp discontinuities, stabilizing the priority distribution under noisy and non-stationary operating conditions. The rule-based structure accommodates multiple congestion criteria within a single inference procedure, so that no single metric needs to be individually discriminative. For instance, a phase with moderate pressure but severely elevated waiting times can be correctly identified as urgent even when its pressure score alone would not distinguish it.

We combine pressure with queue length and waiting time as inputs. Pressure captures the flow imbalance across the intersection, queue length reflects vehicle accumulation on incoming lanes, and waiting time accounts for the experienced delay of vehicles already stopped, together providing a complementary characterization of congestion urgency. This combination represents one possible instantiation that reflects the most readily available congestion indicators; other locally observable quantities could serve a similar role. For each candidate phase $a \in A_i$, let $I_i(a) = \{l^{in} : (l^{in}, l^{out}) \in C_i(a)\}$ denote the set of incoming lanes controlled by phase $a$. The per-phase queue length $q_i(a)$ and waiting time $w_i(a)$ are obtained by averaging the corresponding per-lane observations over $I_i(a)$, ensuring comparability across phases that control different numbers of lane-pairs:

$$q_i(a) = \frac{1}{|I_i(a)|} \sum_{l^{in} \in I_i(a)} q_{l^{in}}, \tag{8}$$

$$w_i(a) = \frac{1}{|I_i(a)|} \sum_{l^{in} \in I_i(a)} w_{l^{in}}, \tag{9}$$

where $q_l \in [0, 1]$ is the queue length on lane $l$, defined as the ratio of stopped vehicles to lane capacity, and $w_l \geq 0$ is the accumulated waiting time on lane $l$.
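Equations (8) and (9) share one averaging step over the set of incoming lanes of a phase. A minimal sketch follows, with hypothetical lane names and observations; note that an incoming lane appearing in several lane-pairs is counted once, since $I_i(a)$ is a set.

```python
def incoming_lanes(lane_pairs):
    """I_i(a): distinct incoming lanes of a phase, in first-seen order."""
    return list(dict.fromkeys(l_in for l_in, _ in lane_pairs))


def phase_average(lane_pairs, per_lane_value):
    """Eqs. (8)-(9): mean of a per-lane observation over the phase's incoming lanes."""
    lanes = incoming_lanes(lane_pairs)
    if not lanes:
        raise ValueError("a phase must control at least one lane-pair")
    return sum(per_lane_value[lane] for lane in lanes) / len(lanes)


if __name__ == "__main__":
    # Hypothetical phase: lane "a" feeds two exits, lane "b" feeds one.
    pairs = [("a", "x"), ("a", "y"), ("b", "x")]
    queue = {"a": 0.4, "b": 0.8}      # stopped vehicles / lane capacity
    waiting = {"a": 12.0, "b": 30.0}  # accumulated waiting time
    print(phase_average(pairs, queue), phase_average(pairs, waiting))
```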

Each of the three input variables is fuzzified independently by mapping its value to degrees of membership in three linguistic categories $L = \{\mathrm{Low}, \mathrm{Med}, \mathrm{High}\}$ through triangular membership functions defined over the normalized domain $[0, 1]$:

$$\mu_{\mathrm{Low}}(x) = \max\left(0,\ 1 - \frac{x}{0.5}\right), \tag{10}$$

$$\mu_{\mathrm{Med}}(x) = \max\left(0,\ 1 - \frac{|x - 0.5|}{0.5}\right), \tag{11}$$

$$\mu_{\mathrm{High}}(x) = \max\left(0,\ \frac{x}{0.5} - 1\right). \tag{12}$$

Under this parameterization, Low attains full membership at $x = 0$ and vanishes at $x = 0.5$, High attains full membership at $x = 1$ and vanishes at $x = 0.5$, and Med peaks at $x = 0.5$ and vanishes at both extremes; adjacent categories overlap pairwise, ensuring smooth transitions between linguistic descriptors across the full domain $[0, 1]$, as illustrated in Figure 1. The three functions satisfy the Ruspini partition condition [38], ensuring that $\mu_{\mathrm{Low}}(x) + \mu_{\mathrm{Med}}(x) + \mu_{\mathrm{High}}(x) = 1$ for all $x \in [0, 1]$, providing a complete and consistent fuzzy partition of each variable.
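The membership functions of Equations (10)-(12) translate directly into code, and the Ruspini condition can be verified numerically on a grid. The function names and the dictionary keys below are illustrative choices.

```python
def mu_low(x):
    """Eq. (10): full membership at x = 0, zero from x = 0.5 upward."""
    return max(0.0, 1.0 - x / 0.5)


def mu_med(x):
    """Eq. (11): peak at x = 0.5, zero at x = 0 and x = 1."""
    return max(0.0, 1.0 - abs(x - 0.5) / 0.5)


def mu_high(x):
    """Eq. (12): zero up to x = 0.5, full membership at x = 1."""
    return max(0.0, x / 0.5 - 1.0)


def fuzzify(x):
    """Membership degrees of a normalized input in the three linguistic categories."""
    return {"LOW": mu_low(x), "MED": mu_med(x), "HIGH": mu_high(x)}


if __name__ == "__main__":
    print(fuzzify(0.25))
    # Ruspini partition check: the memberships sum to 1 across [0, 1].
    worst = max(abs(sum(fuzzify(i / 100).values()) - 1.0) for i in range(101))
    print(worst < 1e-12)
```

Because the three memberships always sum to one on $[0, 1]$, at least one of them is positive for any normalized input.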

Prior to fuzzification, all three inputs are brought onto a common scale by min–max normalization across the candidate phases at the current decision step:

$$\tilde{v}(a) = \frac{v(a) - \min_{a'} v(a')}{\max_{a'} v(a') - \min_{a'} v(a')}, \tag{13}$$

where $v \in \{\Pi_i, q_i, w_i\}$ and the extrema are taken over all candidate phases $a' \in A_i$; as a special case, when all phases take the same value, the input is set to $0.5$, placing it at the peak of Med and producing a neutral, uniform contribution across all rules. Since pressure, queue length, and waiting time have different natural scales and units, and pressure can additionally take negative values, the three inputs are not directly commensurable; the min–max normalization of Equation (13) addresses this by rescaling each input to $[0, 1]$ using the minimum and maximum values observed across all candidate phases. This normalization ensures that all three inputs reflect the relative standing of each phase among the available alternatives at the current step rather than absolute congestion levels, guaranteeing meaningful discrimination between candidate phases regardless of the prevailing traffic conditions. It also reduces the sensitivity of the fuzzy system to the aggregation methods used for obtaining the individual input metrics.
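A minimal sketch of Equation (13), including the degenerate case in which all phases share the same value, is given below. The function name and sample values are hypothetical; the function would be applied separately to the pressure, queue-length, and waiting-time vectors of one intersection.

```python
def min_max_normalize(values):
    """Eq. (13): rescale per-phase values to [0, 1] across the candidate phases."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All phases are tied: use the neutral value 0.5 (the peak of Med).
        return [0.5] * len(values)
    span = highest - lowest
    return [(v - lowest) / span for v in values]


if __name__ == "__main__":
    # Hypothetical pressure scores for three phases, one of them negative.
    print(min_max_normalize([-0.6, 1.0, 0.2]))
    # Tied waiting times map to the neutral value.
    print(min_max_normalize([7.0, 7.0, 7.0]))
```

Whenever the phases are not all tied, the best phase maps to 1 and the worst to 0 regardless of how congested the intersection is in absolute terms, which is the relative-standing behavior described above.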

The rule base consists of $K = |L|^3 = 27$ if-then rules covering all combinations of linguistic categories for each input, encoding the intuition that a phase becomes increasingly worthy of selection as its pressure, queue length, and waiting time increase jointly. Rule activation follows the minimum t-norm: for rule $k$, where $P_k$, $Q_k$, $W_k \in \{\text{Low}, \text{Med}, \text{High}\}$ denote the conditions on pressure, queue length, and waiting time, respectively, the activation degree is

α_k = \min\left(μ_{P_k}(\tilde{Π}_i(a)), μ_{Q_k}(\tilde{q}_i(a)), μ_{W_k}(\tilde{w}_i(a))\right),

(14)

and the priority score $f_i(a)$ for phase $a$ is obtained by weighted-average defuzzification over all $K$ rules:

f_i(a) = \frac{\sum_{k=1}^{K} α_k c_k}{\sum_{k=1}^{K} α_k},

(15)

where $c_k$ is the crisp output value of rule $k$, assigned to reflect the joint urgency implied by the conditions, and $f_i(a)$ is the resulting priority score for phase $a$; if the total activation is zero, $f_i(a)$ should default to a predefined neutral value, though this case does not arise with the membership functions defined above. The full rule base is given in Table 1, where rule output values decrease monotonically as the linguistic levels of pressure, queue, and waiting time decrease, ranging from $0.95$ for the most urgent combination (High–High–High) to $0.05$ for the least urgent (Low–Low–Low). The output values, drawn from a discrete set within $[0.05, 0.95]$ and symmetric around $0.5$, are assigned using expert-driven heuristics that enforce monotonicity and consistency, ensuring phase priority increases with the joint congestion level implied by the conditions. The assignment reflects three deliberate design principles: pressure is the primary criterion, with steps of $0.20$ between linguistic levels in most cells compared to $0.10$ for queue length and waiting time; queue length and waiting time are treated symmetrically as secondary criteria; and the neutral combination (Med–Med–Med) is anchored at $0.50$. The small irregularities visible in some cells of Table 1 arise naturally from satisfying these constraints within $[0.05, 0.95]$ simultaneously. Exact numerical choices are secondary to the monotone structure they encode: any assignment consistent with this ordering would produce qualitatively similar priority scores. More advanced designs could estimate or refine $c_k$ via data-driven methods such as least-squares optimization or gradient-based learning, enabling adaptive calibration of the rule base.
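To make Equations (14) and (15) concrete, the sketch below computes a priority score from three normalized inputs. The rule outputs used here are not those of Table 1: they are a hypothetical additive stand-in (steps of 0.20 for pressure and 0.10 each for queue length and waiting time around 0.50, giving a range of 0.10 to 0.90) that only respects the monotone ordering described in the text. The Low and Med membership forms are likewise inferred from the description, and all names are illustrative.

```python
from itertools import product
from typing import Dict, Tuple

LEVELS = ("Low", "Med", "High")
Rule = Tuple[str, str, str]


def fuzzify(x: float) -> Dict[str, float]:
    """Triangular memberships on [0, 1]; High follows Equation (12)."""
    return {
        "Low": max(0.0, 1.0 - x / 0.5),
        "Med": max(0.0, 1.0 - abs(x - 0.5) / 0.5),
        "High": max(0.0, x / 0.5 - 1.0),
    }


def illustrative_rule_outputs() -> Dict[Rule, float]:
    """HYPOTHETICAL stand-in for Table 1, not the published values."""
    step = {"Low": -1, "Med": 0, "High": 1}
    return {
        (p, q, w): 0.5 + 0.20 * step[p] + 0.10 * step[q] + 0.10 * step[w]
        for p, q, w in product(LEVELS, repeat=3)
    }


def priority(pressure: float, queue: float, wait: float,
             rules: Dict[Rule, float], neutral: float = 0.5) -> float:
    """Min t-norm activation (Eq. 14), weighted-average defuzzification (Eq. 15)."""
    mp, mq, mw = fuzzify(pressure), fuzzify(queue), fuzzify(wait)
    num = den = 0.0
    for (p, q, w), c in rules.items():
        alpha = min(mp[p], mq[q], mw[w])  # activation degree of this rule
        num += alpha * c
        den += alpha
    # Neutral fallback if no rule fires (does not occur with this partition).
    return num / den if den > 0.0 else neutral


if __name__ == "__main__":
    rules = illustrative_rule_outputs()
    print(priority(0.8, 0.5, 0.5, rules), priority(0.2, 0.5, 0.5, rules))
```

Since each input activates at most two categories, at most eight of the 27 rules contribute to any single score.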

3.3. Fuzzy-Guided Exploration Policy

The priority scores $f_i(a)$ produced by the fuzzy rule system are converted to a proper probability distribution over candidate phases via a temperature-scaled softmax:

p_i^{\text{fuzzy}}(a) = \frac{\exp(f_i(a) / β)}{\sum_{a' \in A_i} \exp(f_i(a') / β)},

(16)

where $β > 0$ is a temperature parameter controlling the sharpness of the distribution: small values of $β$ concentrate probability mass on the highest-priority phase, approaching deterministic selection, while large values flatten the distribution toward uniform. Score-based softmax exploration appears in several settings, including Boltzmann exploration based on Q-values [3] and knowledge-guided exploration based on fuzzy safety scores [22]; the proposed method uses domain-informed fuzzy priority scores specific to the TSC setting. The complete modified $ε$-greedy policy, replacing the uniform draw of Equation (6) with the fuzzy-guided distribution, is:

a_{i,t} = \begin{cases} \arg\max_{a} Q_i(τ_i, a) & \text{with probability } 1 - ε, \\ a \sim p_i^{\text{fuzzy}}(\cdot) & \text{with probability } ε. \end{cases}

(17)

The exploration rate $ε$ is annealed over training, concentrating the fuzzy guidance in the early stages when Q-value estimates are least reliable. Algorithm 1 summarizes the complete procedure for reference and reproducibility.
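The effect of the temperature in Equation (16) can be seen in a short sketch. The function name and the example scores are hypothetical; subtracting the maximum score before exponentiating is a standard numerical-stability step added here and does not change the resulting distribution.

```python
import math
from typing import List


def fuzzy_softmax(scores: List[float], beta: float) -> List[float]:
    """Temperature-scaled softmax over priority scores, Equation (16)."""
    if beta <= 0.0:
        raise ValueError("beta must be positive")
    top = max(scores)  # shift by the maximum for numerical stability
    weights = [math.exp((s - top) / beta) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    scores = [0.8, 0.5, 0.2]  # illustrative priority scores for three phases
    for beta in (0.1, 0.25, 1.0):
        print(beta, [round(p, 3) for p in fuzzy_softmax(scores, beta)])
```

Because the priority scores are bounded, the temperature alone determines how far the distribution can depart from uniform.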

Algorithm 1 Fuzzy-guided action selection, executed by each agent i at every training step t

Input: action-observation history $τ_i$, value function $Q_i$, exploration rate $ε$ (all current at step t), temperature $β$
Output: action $a_{i,t}$

1: for each candidate phase $a \in A_i$ do
2:     Compute pressure $Π_i(a)$, queue length $q_i(a)$, waiting time $w_i(a)$ ▹ Equations (7)–(9)
3: end for
4: Normalize $Π_i(a)$, $q_i(a)$, $w_i(a)$ across candidate phases ▹ Equation (13)
5: for each candidate phase $a \in A_i$ do
6:     Fuzzify normalized values ▹ Equations (10)–(12)
7:     Compute rule activations $α_k$ for each rule ▹ Equation (14)
8:     Defuzzify using $α_k$ and rule outputs $c_k$ (Table 1) to obtain $f_i(a)$ ▹ Equation (15)
9: end for
10: Compute $p_i^{\text{fuzzy}}(\cdot)$ over $A_i$ based on $f_i(a)$ and $β$ ▹ Equation (16)
11: if $\text{Uniform}(0, 1) > ε$ then ▹ Equation (17)
12:     return $\arg\max_a Q_i(τ_i, a)$
13: else
14:     return $a_{i,t} \sim p_i^{\text{fuzzy}}(\cdot)$
15: end if
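Steps 10 to 15 of Algorithm 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the priority scores are taken as already computed, the function and parameter names are hypothetical, and the random generator is passed in only to make the example reproducible.

```python
import math
import random
from typing import List, Optional


def select_action(q_values: List[float], priorities: List[float],
                  epsilon: float, beta: float,
                  rng: Optional[random.Random] = None) -> int:
    """Return a phase index following Equation (17)."""
    rng = rng or random.Random()
    if rng.random() > epsilon:
        # Exploit: greedy phase under the current Q-value estimates.
        return max(range(len(q_values)), key=q_values.__getitem__)
    # Explore: sample from the softmax of fuzzy priorities, Equation (16).
    top = max(priorities)
    weights = [math.exp((f - top) / beta) for f in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.1, 0.7, 0.3]   # illustrative Q-value estimates
    f = [0.9, 0.4, 0.2]   # illustrative fuzzy priority scores
    print([select_action(q, f, epsilon=0.5, beta=0.25, rng=rng) for _ in range(10)])
```

Setting `epsilon` to zero recovers the greedy policy used at deployment, so the fuzzy scores never need to be computed outside training.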

4. Experiments

4.1. Experimental Setup

We evaluate the proposed fuzzy-guided exploration method on two traffic networks selected to cover complementary evaluation scenarios, using standard $ε$-greedy exploration as the baseline. The first is a synthetic $3 \times 3$ grid network scenario [29], providing uniform intersection geometry and balanced demand. The second is Cologne 8 from the RESCO benchmark suite [4], a real-world network scenario derived from the city of Cologne comprising eight traffic signal controllers with heterogeneous phase structure (between two and four green phases per controller), which tests generalization to practical deployment conditions. The Cologne 8 network contains one cluster node (three closely spaced real-world junctions joined into a single traffic light program), which is handled by computing all traffic metrics on the external approach lanes, exactly as for any other intersection.

Each network node is controlled by a dedicated agent that observes a local state vector comprising the one-hot encoding of the current phase, a binary flag indicating whether a phase change is permitted, the normalized vehicle density $ρ_l \in [0, 1]$ per controlled lane $l$, and the normalized queue length $q_l \in [0, 1]$ [29]. The action space is the set of available green phases; yellow transition intervals are enforced automatically by the environment whenever the active phase changes [4,11]. The shared team reward is the negative change in cumulative vehicle waiting time across all controlled lanes between consecutive decision steps:

r_t = -\left(\sum_{l} w_{l,t} - \sum_{l} w_{l,t-1}\right).

(18)

This is the default reward in [29] and is widely used in the TSC-as-MARL literature [4].
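Read literally, Equation (18) rewards any reduction in total accumulated waiting time between two decision steps. A minimal sketch, with a hypothetical function name and made-up per-lane values:

```python
from typing import Sequence


def team_reward(wait_prev: Sequence[float], wait_now: Sequence[float]) -> float:
    """Shared reward of Equation (18) from per-lane cumulative waiting times (s)."""
    return -(sum(wait_now) - sum(wait_prev))


if __name__ == "__main__":
    # Illustrative values: total waiting time drops from 30 s to 20 s.
    print(team_reward([10.0, 20.0], [5.0, 15.0]))
```

The reward is positive when queues are being discharged and negative when waiting time accumulates, which makes it a differential rather than an absolute measure of congestion.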

All algorithms are trained for 250 episodes of 120 decision steps each (600 s of simulated traffic per episode at a 5 s control interval), following the standard CTDE training loop [39]. The $ε$-greedy baseline anneals $ε$ linearly from 1.0 to 0.1 over the warm-up and first 50 training episodes, followed by a constant $ε = 0.1$ for the remaining 200 training episodes. The fuzzy-guided variant follows the identical annealing schedule, replacing uniform random sampling with structured, domain-knowledge-driven exploration. To expose agents to diverse traffic conditions, episode start times are drawn uniformly at random from the available demand window, preventing agents from overfitting to a fixed traffic demand pattern.
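The annealing schedule can be sketched as a simple function of the episode index. The length of the warm-up phase is not stated in the text, so this sketch ignores it and anneals per episode over the first 50 training episodes; that simplification, the per-episode granularity, and the function name are all assumptions.

```python
def epsilon_at(episode: int, eps_start: float = 1.0, eps_end: float = 0.1,
               anneal_episodes: int = 50) -> float:
    """Linear decay from eps_start to eps_end, then constant at eps_end."""
    fraction = min(max(episode / anneal_episodes, 0.0), 1.0)
    return eps_start + fraction * (eps_end - eps_start)


if __name__ == "__main__":
    for episode in (0, 25, 50, 249):
        print(episode, round(epsilon_at(episode), 3))
```

Under this schedule, roughly one action in ten is still drawn from the exploration distribution during the final 200 episodes.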

The experiments are conducted using two open-source frameworks. On the simulation side, SUMO-RL [29] wraps the SUMO traffic simulator [40], exposing PettingZoo-compatible [41] multi-agent environments with configurable per-agent observations, actions, and reward functions. On the learning side, pyMARLzoo+ [42], an extension of EPyMARL [43] and PyMARL [39], provides the CTDE training infrastructure and implements the four algorithms evaluated in this work: IQL, VDN, QMIX, and QPLEX. We extend pyMARLzoo+ with two components: a fuzzy action selector that replaces uniform random exploration with structured, domain-knowledge-driven sampling; and a custom environment wrapper that bridges SUMO-RL and pyMARLzoo+, exposing the local network topology and per-agent traffic signal objects required by the fuzzy module and implementing the random episode window mechanism. Experiments are conducted on commodity hardware; each decision step requires advancing the SUMO simulation, which is computationally costly and motivates the focus on sample-efficient exploration.

The shared hyperparameters across all algorithms are as follows: learning rate $α = 5 \times 10^{-4}$, discount factor $γ = 0.99$, mini-batch size of 32 episodes, hidden dimension of 64, and two gradient update passes per collected batch; the softmax temperature for the fuzzy-guided variant is $β = 0.25$. These values were determined through extensive empirical tuning conducted under the computational constraints described above, balancing learning stability and convergence speed within the available training budget. All agents use a local recurrent network and observe their own agent identifier and last action in addition to the local traffic state. Target networks are updated via soft updates with a coefficient of $0.005$, with the exception of QPLEX, which uses periodic hard target updates every 50 training episodes. Rewards are standardized online during training.
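For readers unfamiliar with soft target updates, the sketch below shows the usual Polyak-averaging form on a flat list of parameters. It assumes the coefficient 0.005 weights the online parameters, which is the common convention but is not spelled out in the text; the function name is hypothetical and real implementations operate on network tensors.

```python
from typing import List


def soft_update(target: List[float], online: List[float],
                tau: float = 0.005) -> List[float]:
    """Polyak averaging: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


if __name__ == "__main__":
    target, online = [0.0, 1.0], [1.0, 1.0]
    for _ in range(3):
        target = soft_update(target, online)
    print(target)
```

With a small coefficient, the target network trails the online network slowly, in contrast to the periodic hard copy used for QPLEX.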

4.2. Results

Figure 2 and Figure 3 show the training curves for the $3 \times 3$ grid and Cologne 8 networks, respectively, with each subfigure corresponding to one algorithm. Each plot reports reward, total waiting time, and mean vehicle speed, averaged over steps within each episode, as a function of training episode; solid lines are means over five independent runs, and shaded bands are one standard deviation. The number of stopped vehicles, another system-level metric available in [29], is omitted from the figures as it is strongly correlated with wait time.

The $3 \times 3$ grid network represents a relatively low-complexity scenario: the uniform intersection geometry and balanced synthetic demand allow both exploration strategies to converge reliably within the training budget, as can be seen in Figure 2. Fuzzy-guided exploration consistently converges faster, and the converged performance is better across all four algorithms and all metrics. The advantage is most pronounced for QPLEX.

Results on the Cologne 8 network scenario, shown in Figure 3, reveal substantially larger differences between the two exploration strategies than observed on the $3 \times 3$ grid. Two properties distinguish these scenarios and jointly explain the difference in gains. First, the network structure: while all nine intersections of the $3 \times 3$ grid have exactly two green phases, the eight controllers in Cologne 8 have heterogeneous phase counts under the action space used [29]: two controllers have two green phases, three have three, and three have four. Under uniform random exploration, the probability that all agents simultaneously select a traffic-relevant phase (taking one such phase per agent) decreases multiplicatively across agents, falling an order of magnitude lower on Cologne 8 than on the regular grid ($(1/2)^2 \times (1/3)^3 \times (1/4)^3 = 1/6912$ versus $(1/2)^9 = 1/512$). Second, the demand characteristics: the near-uniform synthetic demand of the $3 \times 3$ grid means that phases differ little in utility, so a randomly selected phase is unlikely to be severely suboptimal. On Cologne 8, highly unbalanced real-world demand makes the cost of a randomly selected low-utility phase much higher, and the value of the domain-knowledge-based signal correspondingly greater. The compounding effect of collecting higher-quality training transitions amplifies this further, as visible in the per-algorithm training curves. The fuzzy-guided variant again converges faster and reaches better asymptotic performance across all four algorithms, but the magnitude of both effects is considerably more pronounced than on the simpler benchmark. For IQL, the $ε$-greedy baseline exhibits mid-training instability, while the fuzzy-guided variant converges steadily and maintains a persistent advantage across all metrics. In the VDN case, both methods exhibit mid-training instability characteristic of the more demanding topology, but the fuzzy-guided variant recovers more reliably and exhibits consistently better subsequent performance across all metrics. For QMIX, the picture is comparatively clean: both methods converge, but fuzzy-guided exploration maintains a persistent advantage in all metrics across the full training horizon. The most striking case is QPLEX, where $ε$-greedy fails to converge within the training budget, with reward and wait time oscillating at poor values throughout, while the fuzzy-guided variant converges to a well-defined policy. A plausible explanation lies in the nature of the duplex dueling architecture: the advantage decomposition requires meaningful contrasts between action qualities in the training data to produce reliable estimates, and uniform random exploration on a heterogeneous network with many phases fails to provide this. Fuzzy-guided exploration may introduce sufficient structure into the early training distribution for the advantage estimates to develop a consistent signal. Taken together, the results indicate that the benefits of domain-knowledge-guided exploration increase with network complexity and heterogeneity, suggesting stronger potential gains in more realistic deployment scenarios.
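The order-of-magnitude gap between the two joint probabilities quoted above can be checked with exact rational arithmetic. The phase counts are those stated in the text; the function name is hypothetical, and the calculation assumes one designated phase per agent and independent uniform draws.

```python
from fractions import Fraction
from typing import Iterable


def joint_uniform_probability(phase_counts: Iterable[int]) -> Fraction:
    """Probability that every agent draws one specific phase under uniform sampling."""
    prob = Fraction(1)
    for n in phase_counts:
        prob *= Fraction(1, n)
    return prob


# Nine two-phase controllers on the grid.
grid = joint_uniform_probability([2] * 9)
# Cologne 8: two controllers with two phases, three with three, three with four.
cologne = joint_uniform_probability([2] * 2 + [3] * 3 + [4] * 3)

if __name__ == "__main__":
    print(grid, cologne, grid / cologne)
```

The ratio evaluates to 27/2, so a jointly favorable uniform draw is 13.5 times less likely on Cologne 8 than on the grid.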

Table 2 summarizes the mean performance metrics over the final 5000 training steps, together with standard deviations, for both exploration strategies across all algorithm and environment combinations. The results confirm the trends observed in the training curves: fuzzy-guided exploration consistently outperforms the $ε$-greedy baseline across all metrics, with gains in the means that are modest but systematic on the $3 \times 3$ grid and substantially larger on Cologne 8. On the $3 \times 3$ grid, reward improvements reach up to $+11.5\%$, wait time up to $+16.4\%$, and speed up to $+1.5\%$. On Cologne 8, improvements are considerably larger, with rewards ranging from $+28.2\%$ to $+64.7\%$, wait time from $+23.7\%$ to $+69.0\%$, and speed up to $+6.0\%$. Improvements in stopped vehicles fall between wait time and speed improvements across both networks. The reward metric, although less directly interpretable in physical units than wait time or speed, is the central quantity that drives all learning: it is the signal that agents optimize throughout training, and improvements in reward directly reflect a better explored and more efficiently learned value function. Crucially, systematically higher rewards throughout training imply that fuzzy-guided exploration steers agents toward higher-quality transitions from the earliest stages, compounding into better learned policies, an effect that purely random exploration, by definition, is less likely to replicate. Therefore, the consistent reward gains across all configurations confirm that fuzzy-guided exploration leads to more effective learning, with traffic metrics providing further evidence in operationally meaningful terms. The reported standard deviations further confirm that the mean performance results are representative: fuzzy-guided exploration produces tighter cross-seed variability in most cases.

Statistical significance of the results was assessed using one-sided Mann–Whitney U tests across five independent seeds. On Cologne 8, fuzzy-guided exploration achieves significance ($p < 0.05$) in 15 of 16 metric–algorithm cells, with 11 of these at $p < 0.01$; the sole exception is VDN speed ($p = 0.210$). On the $3 \times 3$ grid, significance is reached in 13 of 16 cells (10 at $p < 0.01$), with the three exceptions being reward for IQL, VDN, and QMIX ($p = 0.421$, $0.111$, $0.111$). This divergence reflects the noisier nature of the differential reward signal: agents perform significantly better by all traffic measures, even when the reward improvement does not reach statistical significance.
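With only five seeds per group, the one-sided Mann–Whitney U test has a coarse set of attainable p-values. The sketch below enumerates all relabelings of ten pooled values to obtain an exact p-value; it is a generic permutation-style illustration, not the authors' analysis code, and the sample values are invented. In the absence of ties, the smallest attainable one-sided p-value for two groups of five is 1/252 (about 0.004).

```python
from itertools import combinations
from math import comb
from typing import Sequence


def u_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Number of (x, y) pairs with x > y, counting ties as one half."""
    return sum((xi > yi) + 0.5 * (xi == yi) for xi in x for yi in y)


def exact_one_sided_p(x: Sequence[float], y: Sequence[float]) -> float:
    """P(U >= observed) under the null, by enumerating all relabelings."""
    pooled = list(x) + list(y)
    observed = u_statistic(x, y)
    hits = 0
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        xs = [pooled[i] for i in chosen]
        ys = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        hits += u_statistic(xs, ys) >= observed
    return hits / comb(len(pooled), len(x))


if __name__ == "__main__":
    fuzzy = [6.0, 7.0, 8.0, 9.0, 10.0]    # hypothetical per-seed scores
    baseline = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical per-seed scores
    print(exact_one_sided_p(fuzzy, baseline))
```

In this setting, a result at the 0.01 level requires that at most one of the 25 cross-group pairs is ordered against the hypothesis.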

Table 2 also includes the MaxPressure controller as a standalone baseline. All RL controllers substantially reduce wait time relative to MaxPressure on both networks, with the sole exception on Cologne 8 of QPLEX $ε$-greedy, whose non-convergence leaves it marginally worse than MaxPressure; fuzzy-guided QPLEX resolves this failure completely. On stopped vehicles, all RL controllers outperform MaxPressure on the $3 \times 3$ grid; on Cologne 8, $ε$-greedy falls short of MaxPressure, and fuzzy-guided exploration reverses this, achieving lower stopped counts across all four algorithms. Across both networks, MaxPressure achieves slightly higher speed than the RL controllers, as agents optimize waiting time rather than speed. These comparisons confirm that learning adds genuine value over the heuristic baseline, and that using pressure as a phase-scoring signal during exploration is distinct from its use as a deterministic action selector in MaxPressure control.

To validate the choice of softmax temperature $β = 0.25$, we evaluated three additional values $β \in \{0.1, 0.5, 1.0\}$ on QMIX with Cologne 8, keeping all other settings fixed. Figure 4 shows the resulting training curves. At $β = 0.1$, the exploration distribution is highly concentrated, amplifying early scoring errors and leading to higher training instability, though final performance remains comparable to $β = 0.25$. At $β = 0.5$ and $β = 1.0$, the distribution flattens toward uniform, and performance degrades monotonically across traffic metrics as the structured signal is progressively diluted. The value $β = 0.25$ avoids both failure modes, achieving better final performance than $β = 0.5$ and $β = 1.0$, and better training stability than $β = 0.1$.

Finally, to assess the contribution of the multi-criteria fuzzy system within the RL setting, we compare two exploration variants on QMIX with Cologne 8: one that converts pressure scores directly to a softmax distribution (SoftMaxPressure), and one that additionally incorporates queue length and waiting time through the fuzzy system (Fuzzy-guided). SoftMaxPressure can also be viewed as a domain-knowledge variant of Boltzmann exploration [3], where pressure replaces Q-values as the scoring signal. Figure 5 shows the training curves for each variant. On reward and wait time, both variants reach comparable final values, but the SoftMaxPressure curves exhibit a sharp mid-training destabilization event (a sudden spike in wait time around step 9000, followed by a gradual recovery) that is absent in the fuzzy-guided variant. On speed, the fuzzy-guided variant performs slightly better. The interpretation is that pressure alone can be uninformative in certain traffic states, while queue length and waiting time provide complementary signals, producing a more robust exploration distribution and more stable training dynamics.

5. Discussion

The results, while consistent, should not be taken to imply that embedding domain knowledge into a MARL-TSC pipeline is straightforward. The design space has two largely independent dimensions: the component of the learning system to modify, and the choice of domain-knowledge signal to inject. Both dimensions require careful consideration, and the interaction between them is non-trivial. Among the interventions considered along the first dimension, reward shaping and observation augmentation were also explored in the early stages of this work, using the same fuzzy priority scores as the proposed method. Reward shaping added a fuzzy bonus to the team reward, annealed over training; observation augmentation appended per-agent fuzzy priority scores to the observation vector. Neither produced consistent gains. Possible explanations include the credit assignment problem inherent in cooperative MARL when individual agent fuzzy scores are aggregated into a shared team reward bonus, and the fact that the network must first learn to exploit the fuzzy observation features before benefiting from them. These findings are documented to inform potential future work along the alternative directions. In both cases, the domain knowledge must pass through additional processing before it can influence learning. Exploration proved to be the effective intervention point for structured domain-knowledge injection; one plausible reason is that it is the only component whose modification is strictly contained, as it influences only which transitions are collected during training, from which each agent learns directly.

Along the second dimension (the choice of domain-knowledge signal), the picture is similarly nuanced. Single-criterion heuristics based on queue length or waiting time alone did not produce consistent gains. Pressure scores, which account for the density differential across the lane pairs served by each phase, proved to be the signal that reliably guided exploration toward higher-quality transitions. The multi-criteria fuzzy combination of pressure, queue length, and waiting time then added further value over pressure alone, as evidenced by the ablation results reported in Section 4: beyond an improvement across training metrics, the fuzzy rule system avoided the destabilization events observed under pressure-only guidance. A plausible mechanism for this is that a single criterion can be uninformative in certain traffic states, while the multi-criteria combination provides a more robust signal across diverse conditions. The fuzzy rule system also accommodates the inherent imprecision and gradedness of traffic state descriptions. The proposed rule base represents one of many possible designs, chosen for its principled and straightforward structure, which makes it readily transferable to other domains beyond traffic signal control. The monotone structure of the output values matters more than their precise numerical values, as confirmed in additional experiments by comparison against a linearly spaced alternative spanning the same output range, which produced qualitatively indistinguishable results. More specific rule sets, further tuned to the characteristics of a particular network or traffic regime, are equally feasible and may yield additional gains in targeted deployments. Applying fuzzy-guided exploration to only a subset of agents, rather than the full network, was also investigated, and did not yield consistent improvements, suggesting that coordinated structured exploration across all agents may be important for the effect to manifest reliably.

Since the proposed method operates through the exploration channel, the exploration schedule directly influences its performance. Longer and higher exploration amplifies the advantage of fuzzy-guided exploration over the $ε$-greedy baseline: structured exploration actions remain meaningful throughout training, while uniform random actions increasingly interfere with the learned Q-function. For faster decay or lower final $ε$, the margin narrows under stable baseline behavior. The relatively high final $ε$ adopted in the experiments ensures sufficient diversity in the replay buffer throughout training. This is particularly relevant in the cooperative MARL setting, where the non-stationarity induced by co-evolving agent policies continuously alters the effective environment each agent faces, and in traffic signal control, where demand conditions may vary across episodes. A sensitivity check across a range of final $ε$ values shows that fuzzy-guided performance remains consistent, suggesting that the learned policies are generally aligned with the instilled domain-knowledge prior, rather than that the action selection is continuously influenced by the fuzzy distribution late in training.

The current design involves several fixed choices that could, in principle, be made adaptive. The softmax temperature $β$ was set based on extensive empirical evaluation, and produced consistent gains across all tested configurations; nevertheless, the optimal degree of concentration in the exploration distribution may vary depending on the design choices. Similarly, the fuzzy membership functions were defined manually using domain intuition and held fixed during training, rather than being learned or adapted online. The use of three linguistic categories (Low, Med, High) represents a deliberate simplicity choice that keeps the rule base at a manageable 27 rules; the number of linguistic categories per variable is not uniquely determined and is typically chosen based on a trade-off between interpretability, model complexity, and approximation accuracy [44], with three to five being standard in control applications. Finer granularity could, in principle, capture more nuanced traffic states, at the cost of a larger rule base and increased design effort.

Several further properties of the method are worth stating explicitly. The consistent gains across methods that differ substantially in how they handle joint value factorization suggest that the benefits of structured exploration are not tied to any particular algorithmic choice within the value-based CTDE family. It should be noted that beyond final performance, fuzzy-guided exploration consistently yields faster convergence and more stable training dynamics across seeds, as visible in the training curves, with the most pronounced case being QPLEX on Cologne 8, where the fuzzy-guided method essentially enables convergence. Additionally, since the exploration distribution is computed from each agent’s local observations and the locally observable part of the network structure, no inter-agent communication is required. The approach is therefore consistent with a broad range of training settings beyond the CTDE paradigm evaluated here. Nevertheless, its applicability to policy gradient and actor–critic approaches, where the exploration mechanism interacts differently with the learning objective, remains to be established.

Finally, the fuzzy system introduces a computational overhead relative to the $ε$-greedy baseline: at each exploration step, each agent evaluates the fixed rule base over its set of candidate phases $A_i$, performing membership function evaluation, rule activation, and weighted-average defuzzification. This amounts to a sequence of elementary arithmetic operations with time complexity $O(|A_i|)$ per agent per step. The input aggregation, computing pressure, queue length, and waiting time over the lanes serving each candidate phase, scales linearly with the number of lanes per phase. Both costs are substantially smaller than the neural network forward and backward passes required for Q-value estimation and parameter updates. The overhead is confined to training and entirely absent at deployment, where agents act greedily with respect to their learned Q-values.

6. Conclusions

This paper proposed fuzzy-guided exploration as a non-invasive mechanism for embedding classical traffic domain knowledge into the training of cooperative MARL agents for traffic signal control. The approach replaces uniform random action sampling during $ε$-greedy exploration with a structured distribution derived from pressure scores and a multi-criteria fuzzy rule system, while leaving the reward signal, observation space, and learning objective intact. Evaluated across four MARL algorithms on synthetic and real-world environments, fuzzy-guided exploration consistently outperformed the $ε$-greedy baseline across all algorithm and environment combinations. Gains were modest but consistent on the synthetic network, and substantially larger on the real-world network, suggesting that the complexity of the scenario makes uniform random exploration less effective. These findings position exploration as an effective lever for domain-knowledge integration in cooperative MARL, and establish pressure-based phase scoring as a principled and practical signal for guiding that exploration in traffic signal control.

Future work could address the fixed design choices for the membership functions by exploring learnable or data-driven construction, through clustering-based partitioning or adaptive neuro-fuzzy refinement [45], preserving interpretability while reducing manual tuning and enabling better adaptation to specific network characteristics. Further directions include investigating adaptive temperature and $ε$-decay scheduling, analogous to the experience-based heuristic fading of [46], and examining the joint effect of structured exploration and inter-agent communication [15]. Understanding how the relative contributions of individual criteria vary across different operating conditions represents an interesting research direction. Evaluation across a broader range of real-world traffic networks, as well as under dynamic demand conditions including incident scenarios, would also strengthen the empirical case.

Author Contributions

Conceptualization, M.V., N.I. and D.Ć.; methodology, M.V. and N.I.; software, D.Ć., N.I. and M.V.; validation, D.Ć., N.I. and B.B.; formal analysis, M.V. and N.I.; investigation, M.V., N.I. and D.Ć.; resources, B.B.; data curation, D.Ć. and N.I.; writing—original draft preparation, M.V., N.I. and D.Ć.; writing—review and editing, D.Ć., N.I. and B.B.; visualization, D.Ć. and N.I.; supervision, M.V.; project administration, B.B.; funding acquisition, M.V. and B.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Vlatacom Institute under the project P000 Smart City, as well as by the European Union under the REFRESH–Research Excellence For REgion Sustainability and High-tech Industries project number CZ.10.03.01/00/22_003/0000048 via the Operational Programme Just Transition, and in part by the Science Fund of the Republic of Serbia, Grant No. 7502, for the project Intelligent Multi-Agent Control and Optimization applied to Green Buildings and Environmental Monitoring Drone Swarms–ECOSwarm.

Data Availability Statement

The data presented in this study are available in https://github.com/LucasAlegre/sumo-rl and https://github.com/Pi-Star-Lab/RESCO (both links accessed on 26 May 2026).

Acknowledgments

The authors gratefully acknowledge Aleksej Makarov for his valuable ideas and discussions on the research process. The author M.V. is deeply grateful to his wife Ana, and his children, Dunja and Vukan, for their love, understanding, and continuous support, which have been a lasting source of inspiration and motivation during his work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

TSC	Traffic Signal Control
RL	Reinforcement Learning
MARL	Multi-Agent Reinforcement Learning
CTDE	Centralized Training with Decentralized Execution
IQL	Independent Q-Learning
VDN	Value Decomposition Networks
MDP	Markov Decision Process
POMDP	Partially Observable MDP
Dec-POMDP	Decentralized Partially Observable MDP
IGM	Individual-Global-Max
DQN	Deep Q-Network
UCB	Upper Confidence Bound
GRU	Gated Recurrent Unit
SUMO	Simulation of Urban MObility

References

Papageorgiou, M.; Diakaki, C.; Dinopoulou, V.; Kotsialos, A.; Wang, Y. Review of road traffic control strategies. Proc. IEEE 2003, 91, 2043–2067. [Google Scholar] [CrossRef]
Koonce, P.; Rodegerdts, L.; Lee, K.; Quayle, S.; Beaird, S.; Braud, C.; Bonneson, J.; Tarnoff, P.; Urbanik, T. Traffic Signal Timing Manual; Technical Report FHWA-HOP-08-024; Federal Highway Administration: Washington, DC, USA, 2008.
Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; The MIT Press: Cambridge, MA, USA, 2018. [Google Scholar]
Ault, J.; Sharon, G. Reinforcement Learning Benchmarks for Traffic Signal Control. In Proceedings of the Thirty-Fifth Conference on Neural Information Processing Systems (NeurIPS 2021) Datasets and Benchmarks Track, Virtual, 6–14 December 2021. [Google Scholar]
Michailidis, P.; Michailidis, I.; Lazaridis, C.R.; Kosmatopoulos, E. Traffic Signal Control via Reinforcement Learning: A Review on Applications and Innovations. Infrastructures 2025, 10, 114. [Google Scholar] [CrossRef]
Saadi, A.; Abghour, N.; Chiba, Z.; Moussaid, K.; Ali, S. A Survey of Reinforcement and Deep Reinforcement Learning for Coordination in Intelligent Traffic Light Control. J. Big Data 2025, 12, 84. [Google Scholar] [CrossRef]
Cao, K.; Yang, S.; Yang, C.; Yu, M.; Geng, J.; Jung, H. Research on Intelligent Traffic Signal Control Based on Multi-Agent Deep Reinforcement Learning. Mathematics 2026, 14, 149. [Google Scholar] [CrossRef]
Chu, T.; Wang, J.; Codecà, L.; Li, Z. Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1086–1095. [Google Scholar] [CrossRef]
Oliehoek, F.A.; Amato, C. A Concise Introduction to Decentralized POMDPs; SpringerBriefs in Intelligent Systems; Springer: Cham, Switzerland, 2016. [Google Scholar] [CrossRef]
Kraemer, L.; Banerjee, B. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 2016, 190, 82–94. [Google Scholar] [CrossRef]
Bokade, R.; Jin, X. PyTSC: A Unified Platform for Multi-Agent Reinforcement Learning in Traffic Signal Control. Sensors 2025, 25, 1302. [Google Scholar] [CrossRef]
Wei, H.; Chen, C.; Zheng, G.; Wu, K.; Gayah, V.; Xu, K.; Li, Z. PressLight: Learning Max Pressure Control to Coordinate Traffic Signals in Arterial Network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Anchorage, AK, USA, 4–8 August 2019; pp. 1290–1298. [Google Scholar] [CrossRef]
Chen, C.; Wei, H.; Xu, N.; Zheng, G.; Yang, M.; Xiong, Y.; Xu, K.; Li, Z. Toward a Thousand Lights: Decentralized Deep Reinforcement Learning for Large-Scale Traffic Signal Control. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 3414–3421. [Google Scholar] [CrossRef]
Liu, Y.; Luo, G.; Yuan, Q.; Li, J.; Jin, L.; Chen, B.; Pan, R. GPLight: Grouped Multi-Agent Reinforcement Learning for Large-Scale Traffic Signal Control. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI), Macau, China, 19–25 August 2023; pp. 199–207. [Google Scholar]
Bokade, R.; Jin, X.; Amato, C. Multi-Agent Reinforcement Learning Based on Representational Communication for Large-Scale Traffic Signal Control. IEEE Access 2023, 11, 47646–47658. [Google Scholar] [CrossRef]
Varaiya, P. Max pressure control of a network of signalized intersections. Transp. Res. Part C Emerg. Technol. 2013, 36, 177–195. [Google Scholar] [CrossRef]
Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn. 2002, 47, 235–256. [Google Scholar] [CrossRef]
Bellemare, M.G.; Srinivasan, S.; Ostrovski, G.; Schaul, T.; Saxton, D.; Munos, R. Unifying Count-Based Exploration and Intrinsic Motivation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
Pathak, D.; Agrawal, P.; Efros, A.A.; Darrell, T. Curiosity-driven Exploration by Self-supervised Prediction. In Proceedings of the 34th International Conference on Machine Learning (ICML), PMLR, Sydney, Australia, 6–11 August 2017; pp. 2778–2787. [Google Scholar]
Zhou, G.; Zhang, Z.; Fan, G. AIR: Unifying Individual and Collective Exploration in Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025. [Google Scholar]
Zhang, P.; Hao, J.; Wang, W.; Tang, H.; Ma, Y.; Duan, Y.; Zheng, Y. KoGuN: Accelerating Deep Reinforcement Learning via Integrating Human Suboptimal Knowledge. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI), Virtual, 7–15 January 2021; pp. 2291–2297. [Google Scholar] [CrossRef]
Qin, P.; Zhao, T. Knowledge Guided Fuzzy Deep Reinforcement Learning. Expert Syst. Appl. 2025, 264, 125823. [Google Scholar] [CrossRef]
Farag, W.A. Virtual multiphase flow meter for high gas/oil ratios and water-cut reservoirs via ensemble machine learning. Exp. Comput. Multiph. Flow 2025, 7, 133–148. [Google Scholar] [CrossRef]
Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353. [Google Scholar] [CrossRef]
Tan, M. Multi-Agent Reinforcement Learning: Independent versus Cooperative Agents. In Proceedings of the Tenth International Conference on Machine Learning (ICML 1993); Morgan Kaufmann: San Francisco, CA, USA, 1993; pp. 330–337. [Google Scholar]
Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W.M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J.Z.; Tuyls, K.; et al. Value-Decomposition Networks for Cooperative Multi-Agent Learning Based on Team Reward. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Stockholm, Sweden, 10–15 July 2018; pp. 2085–2087. [Google Scholar] [CrossRef]
Rashid, T.; Samvelyan, M.; Schroeder de Witt, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML), PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304. [Google Scholar]
Wang, J.; Ren, Z.; Liu, T.; Yu, Y.; Zhang, C. QPLEX: Duplex Dueling Multi-Agent Q-Learning. In Proceedings of the 9th International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
Alegre, L.N. SUMO-RL. 2019. Available online: https://github.com/LucasAlegre/sumo-rl (accessed on 26 May 2026).
Oliehoek, F.A.; Spaan, M.T.J.; Vlassis, N. Optimal and Approximate Q-value Functions for Decentralized POMDPs. J. Artif. Intell. Res. 2008, 32, 289–353. [Google Scholar] [CrossRef]
Son, K.; Kim, D.; Kang, W.J.; Hostallero, D.E.; Yi, Y. QTRAN: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning (ICML), PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5887–5896. [Google Scholar]
Rashid, T.; Farquhar, G.; Peng, B.; Whiteson, S. Weighted QMIX: Expanding Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020. [Google Scholar]
Marchesini, E.; Baisero, A.; Bhati, R.; Amato, C. On Stateful Value Factorization in Multi-Agent Reinforcement Learning. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Detroit, MI, USA, 19–23 May 2025; pp. 1445–1453. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
Hausknecht, M.; Stone, P. Deep Recurrent Q-Learning for Partially Observable MDPs. In Proceedings of the AAAI Fall Symposium on Sequential Decision Making for Intelligent Agents (AAAI-SDMIA), Arlington, VA, USA, 12–14 November 2015. [Google Scholar]
Wang, X.; Ke, L.; Qiao, Z.; Chai, X. Large-Scale Traffic Signal Control Using a Novel Multiagent Reinforcement Learning. IEEE Trans. Cybern. 2021, 51, 174–187. [Google Scholar] [CrossRef]
Koukol, M.; Zajíčková, L.; Marek, L.; Tuček, P. Fuzzy Logic in Traffic Engineering: A Review on Signal Control. Math. Probl. Eng. 2015, 2015, 979160. [Google Scholar] [CrossRef]
Ruspini, E.H. A New Approach to Clustering. Inf. Control 1969, 15, 22–32. [Google Scholar] [CrossRef]
Samvelyan, M.; Rashid, T.; Schroeder de Witt, C.; Farquhar, G.; Nardelli, N.; Rudner, T.G.J.; Hung, C.M.; Torr, P.H.S.; Foerster, J.; Whiteson, S. The StarCraft Multi-Agent Challenge. arXiv 2019, arXiv:1902.04043. [Google Scholar]
Lopez, P.A.; Behrisch, M.; Bieker-Walz, L.; Erdmann, J.; Flötteröd, Y.P.; Hilbrich, R.; Lücken, L.; Rummel, J.; Wagner, P.; Wiessner, E. Microscopic Traffic Simulation using SUMO. In Proceedings of the 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, USA, 4–7 November 2018; pp. 2575–2582. [Google Scholar] [CrossRef]
Terry, J.; Black, B.; Grammel, N.; Jayakumar, M.; Hari, A.; Sullivan, R.; Santos, L.S.; Dieffendahl, C.; Horsch, C.; Perez-Vicente, R.; et al. PettingZoo: Gym for Multi-Agent Reinforcement Learning. Adv. Neural Inf. Process. Syst. 2021, 34, 15032–15043. [Google Scholar]
Papadopoulos, G.; Kontogiannis, A.; Papadopoulou, F.; Poulianou, C.; Koumentis, I.; Vouros, G. An Extended Benchmarking of Multi-Agent Reinforcement Learning Algorithms in Complex Fully Cooperative Tasks. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), Detroit, MI, USA, 19–23 May 2025; pp. 1613–1622. [Google Scholar] [CrossRef]
Papoudakis, G.; Christianos, F.; Schäfer, L.; Albrecht, S.V. Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS), Virtual, 6–14 December 2021. [Google Scholar]
Trillas, E.; Moraga, C. Reasons for Careful Design of Fuzzy Sets. In Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013), Milan, Italy, 11–13 December 2013; pp. 140–145. [Google Scholar]
Jang, J.S.R. ANFIS: Adaptive-network-based fuzzy inference system. IEEE Trans. Syst. Man Cybern. 1993, 23, 665–685. [Google Scholar] [CrossRef]
Cheng, C.A.; Kolobov, A.; Swaminathan, A. Heuristic-Guided Reinforcement Learning. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021; Volume 34, pp. 13550–13563. [Google Scholar]

Figure 1. Triangular membership functions for Low, Med, and High over the normalized domain $[0, 1]$.

Figure 2. Training curves on the $3 \times 3$ grid network for all four algorithms. Each subfigure shows reward, total waiting time, and mean vehicle speed (left to right) averaged over steps within each episode. Solid lines are means over five independent runs; shaded bands are one standard deviation.

Figure 3. Training curves on the Cologne 8 network for all four algorithms. Layout as in Figure 2.

Figure 4. Sensitivity to softmax temperature β on QMIX, Cologne 8. Solid lines are means over five seeds; shaded bands are one standard deviation.

Figure 5. Ablation on QMIX, Cologne 8: pressure-guided exploration (pressure scores only, blue) vs. fuzzy-guided exploration (full three-criteria system, orange). Solid lines are means over five seeds; shaded bands are one standard deviation.

Table 1. Fuzzy rule output values $c_{k}$ for all 27 combinations of pressure (P), queue length (Q), and waiting time (W). For each pressure level, rows correspond to queue length and columns to waiting time. L = Low, M = Med, H = High.

		$P = H$			$P = M$			$P = L$
	$W$	H	M	L	H	M	L	H	M	L
Q	H	0.95	0.85	0.75	0.75	0.65	0.55	0.55	0.45	0.35
	M	0.85	0.75	0.65	0.65	0.50	0.40	0.45	0.30	0.20
	L	0.75	0.65	0.55	0.55	0.40	0.30	0.35	0.20	0.05
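
As an illustration of how Table 1 could be turned into a phase priority, the following Python sketch evaluates all 27 rules for one candidate phase. The rule outputs are taken from Table 1; everything else is an assumption made for this sketch and is not confirmed by the table or by Figure 1: the membership peaks at 0, 0.5, and 1, the product used to combine the three membership degrees, the weighted-average aggregation, and the names `memberships`, `RULES`, and `priority`.

```python
from itertools import product

LEVELS = ("L", "M", "H")


def memberships(x):
    """Triangular Low/Med/High degrees for x in [0, 1].

    Assumption: peaks at 0, 0.5 and 1, so the three degrees sum to 1.
    """
    x = min(max(x, 0.0), 1.0)  # clip to the normalized domain
    return {
        "L": max(0.0, 1.0 - 2.0 * x),
        "M": max(0.0, 1.0 - abs(2.0 * x - 1.0)),
        "H": max(0.0, 2.0 * x - 1.0),
    }


# Rule outputs c_k from Table 1, indexed as RULES[P][Q][W].
RULES = {
    "H": {"H": {"H": 0.95, "M": 0.85, "L": 0.75},
          "M": {"H": 0.85, "M": 0.75, "L": 0.65},
          "L": {"H": 0.75, "M": 0.65, "L": 0.55}},
    "M": {"H": {"H": 0.75, "M": 0.65, "L": 0.55},
          "M": {"H": 0.65, "M": 0.50, "L": 0.40},
          "L": {"H": 0.55, "M": 0.40, "L": 0.30}},
    "L": {"H": {"H": 0.55, "M": 0.45, "L": 0.35},
          "M": {"H": 0.45, "M": 0.30, "L": 0.20},
          "L": {"H": 0.35, "M": 0.20, "L": 0.05}},
}


def priority(pressure, queue, wait):
    """Weighted average of rule outputs for one phase (inputs in [0, 1])."""
    mp, mq, mw = memberships(pressure), memberships(queue), memberships(wait)
    num = den = 0.0
    for p, q, w in product(LEVELS, repeat=3):
        strength = mp[p] * mq[q] * mw[w]  # product combination (assumption)
        num += strength * RULES[p][q][w]
        den += strength
    return num / den  # den equals 1 under the assumed memberships


if __name__ == "__main__":
    print(priority(1.0, 1.0, 1.0))   # all inputs High
    print(priority(0.75, 0.5, 0.5))  # pressure between Med and High
```

Under these assumptions, an input that falls between two linguistic levels blends the corresponding table entries, so the priority changes smoothly as pressure, queue length, or waiting time changes.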

Table 2. Performance metrics averaged over the final 5000 training steps, reported as mean ± standard deviation over five independent seeds. Improvement (%) is computed from mean values of Fuzzy-guided relative to epsilon–greedy; positive values indicate Fuzzy-guided is better. * MaxPressure values appear in the epsilon–greedy column.

Environment	Algorithm	Metric	Epsilon–Greedy	Fuzzy-Guided	Improv. (%)
SUMO 3 $\times$ 3	MaxPressure *	Wait Time (s)	23.05 ± 0.33
		Stopped	4.12 ± 0.04
		Speed (m/s)	9.73 ± 0.05
	IQL	Reward	−0.27 ± 0.02	−0.27 ± 0.02	+0.6%
		Wait Time (s)	12.83 ± 0.70	10.72 ± 0.31	+16.4%
		Stopped	3.89 ± 0.11	3.55 ± 0.06	+8.7%
		Speed (m/s)	9.54 ± 0.07	9.64 ± 0.05	+1.1%
	VDN	Reward	−0.30 ± 0.03	−0.27 ± 0.03	+11.5%
		Wait Time (s)	12.29 ± 0.62	10.82 ± 0.20	+11.9%
		Stopped	3.89 ± 0.12	3.62 ± 0.06	+6.9%
		Speed (m/s)	9.47 ± 0.08	9.61 ± 0.03	+1.5%
	QMIX	Reward	−0.28 ± 0.03	−0.27 ± 0.02	+5.5%
		Wait Time (s)	11.89 ± 0.38	10.65 ± 0.11	+10.5%
		Stopped	3.79 ± 0.09	3.56 ± 0.03	+5.8%
		Speed (m/s)	9.54 ± 0.05	9.67 ± 0.07	+1.4%
	QPLEX	Reward	−0.31 ± 0.01	−0.28 ± 0.02	+10.1%
		Wait Time (s)	13.73 ± 1.03	11.54 ± 0.35	+15.9%
		Stopped	4.09 ± 0.11	3.79 ± 0.09	+7.4%
		Speed (m/s)	9.40 ± 0.06	9.52 ± 0.05	+1.3%
Cologne 8	MaxPressure *	Wait Time (s)	54.72 ± 5.62
		Stopped	4.08 ± 0.30
		Speed (m/s)	8.60 ± 0.05
	IQL	Reward	−0.76 ± 0.14	−0.44 ± 0.04	+41.8%
		Wait Time (s)	20.29 ± 2.14	15.28 ± 1.95	+24.7%
		Stopped	4.30 ± 0.10	3.67 ± 0.19	+14.6%
		Speed (m/s)	8.19 ± 0.09	8.38 ± 0.05	+2.4%
	VDN	Reward	−0.75 ± 0.18	−0.43 ± 0.12	+43.5%
		Wait Time (s)	20.92 ± 2.35	15.97 ± 2.89	+23.7%
		Stopped	4.42 ± 0.25	3.81 ± 0.43	+13.7%
		Speed (m/s)	8.18 ± 0.10	8.27 ± 0.14	+1.1%
	QMIX	Reward	−0.58 ± 0.13	−0.42 ± 0.05	+28.2%
		Wait Time (s)	22.63 ± 4.39	14.88 ± 1.68	+34.2%
		Stopped	4.45 ± 0.42	3.56 ± 0.30	+20.1%
		Speed (m/s)	8.20 ± 0.08	8.45 ± 0.04	+3.0%
	QPLEX	Reward	−1.63 ± 0.91	−0.57 ± 0.14	+64.7%
		Wait Time (s)	57.59 ± 20.37	17.85 ± 1.67	+69.0%
		Stopped	6.22 ± 0.56	4.06 ± 0.33	+34.7%
		Speed (m/s)	7.84 ± 0.11	8.31 ± 0.04	+6.0%
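
The Improv. (%) column can be approximately recomputed from the reported means. The Python sketch below assumes a sign convention inferred from the caption (positive means Fuzzy-guided is better): for metrics where lower is better (wait time, stopped vehicles) the reduction is divided by the baseline, and for metrics where higher is better (reward, speed) the increase is divided by the absolute baseline. The function name `improvement_pct` is hypothetical. Because Table 2 rounds the means to two decimals, recomputed values can differ slightly from the printed ones, most visibly for the reward rows.

```python
def improvement_pct(baseline, fuzzy, higher_is_better):
    """Relative improvement of Fuzzy-guided over epsilon-greedy, in percent.

    Assumed convention: the result is positive when Fuzzy-guided is better.
    """
    delta = fuzzy - baseline if higher_is_better else baseline - fuzzy
    return 100.0 * delta / abs(baseline)


if __name__ == "__main__":
    # IQL on the 3 x 3 grid, wait time (lower is better): about 16.4
    print(round(improvement_pct(12.83, 10.72, higher_is_better=False), 1))
    # QPLEX on Cologne 8, speed (higher is better): about 6.0
    print(round(improvement_pct(7.84, 8.31, higher_is_better=True), 1))
```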

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ćiprovski, D.; Ilić, N.; Božilović, B.; Vučetić, M. Fuzzy-Guided Exploration for Multi-Agent Reinforcement Learning in Traffic Signal Control. Mathematics 2026, 14, 1942. https://doi.org/10.3390/math14111942
