Sustainability-Oriented Urban Traffic System Optimization Through a Hierarchical Multi-Agent Deep Reinforcement Learning Framework
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper proposes a sustainability-oriented intelligent transportation framework, with the main contribution being a sustainability-aware network-wide signal control method (SERL-H) developed via hierarchical multi-agent reinforcement learning. The topic is timely and practically meaningful, and the idea of embedding sustainability objectives into network-level signal coordination is valuable. Overall, the manuscript is promising; however, several key methodological and presentation issues require clarification and strengthening before the work can be properly assessed and reproduced.
Major Comments
- Visualization of the hierarchical architecture of SERL-H is required.
  The manuscript introduces SERL-H as a hierarchical MARL framework, yet the hierarchical structure (e.g., levels, agent roles, temporal/spatial decomposition, information flow, and control boundaries) is not sufficiently visualized. I strongly recommend adding a dedicated architecture figure to explicitly show:
  - the hierarchy levels (high-level coordinator vs. low-level controllers, if applicable);
  - what each level observes (state), decides (action), and optimizes (reward/objective);
  - how inter-level communication/coordination is implemented (messages, aggregated variables, constraints);
  - how network-wide objectives are decomposed into intersection-level decisions.
  This figure is important for both comprehension and reproducibility.
- A clear training and deployment pipeline of the reinforcement learning strategy is missing.
  The paper currently lacks a process-level depiction of the RL workflow. Please provide a flowchart (or algorithm box) describing the complete training procedure and how the learned policy is deployed. At minimum, the workflow should clarify:
  - environment definition and simulation/interaction loop;
  - state/action/reward design, especially how “sustainability-aware” reward terms are computed;
  - hierarchical training scheme (e.g., sequential training vs. joint training; centralized training with decentralized execution, etc.);
  - update frequency for each hierarchy level;
  - convergence/stopping criteria and hyperparameter settings;
  - inference-time execution steps (real-time control loop).
  Without this, it is difficult to judge the validity of the learning setup and to reproduce the results.
- The role and placement of SUT-GNN in the overall framework is unclear and potentially redundant.
  The manuscript introduces SUT-GNN, but its functional position relative to the DRL pipeline is not well specified. Given that DRL typically includes deep neural networks that can already serve as function approximators (and may implicitly capture predictive patterns), it is essential to clarify whether SUT-GNN is:
  - (a) a separate prediction module providing auxiliary forecasts/representations that feed into DRL;
  - (b) a feature encoder replacing the standard deep network backbone in the policy/value networks;
  - (c) an additional component appended before the DRL networks (e.g., “GNN encoder → policy/value heads”); or
  - (d) used only for certain hierarchy levels or only during training (e.g., representation learning) but not in online inference.
  Please explicitly state the input/output of SUT-GNN, how it is trained (jointly or separately), and how its output is consumed by SERL-H. A simplified block diagram showing “SUT-GNN ↔ SERL-H” connections would resolve this ambiguity.
- The case study should clearly distinguish simulation-based evaluation versus real-world deployment.
  The manuscript needs to explicitly state whether the case analysis is conducted in a simulation environment, a digital twin platform, or an actual field deployment. If it is simulation-only, please clearly describe the simulator, calibration approach, and realism assumptions. If any real deployment or pilot test exists, the paper would benefit significantly from including deployment evidence such as:
  - the deployment setting (city/region, intersections, sensors, communication stack);
  - real operational constraints (detector noise, missing data, latency);
  - online performance statistics and robustness discussion.
  If real-world deployment results cannot be provided, please clearly acknowledge this limitation and discuss what is required for practical deployment (data availability, compute requirements, safety constraints, fallback strategies, etc.).
Author Response
Comment 1. “Visualization of the hierarchical architecture of SERL-H is required.”
Response: Thank you for this important suggestion. In the revision, we add a dedicated architecture figure that explicitly visualizes the hierarchical structure of SERL-H, including (i) local intersection controllers, (ii) region-level coordinators operating at a slower timescale, (iii) the adaptive graph-attention interdependency encoder, (iv) the sustainability-aware reward computation, and (v) CTDE training components (centralized critic, replay buffer) vs. decentralized execution at inference time. The figure also clarifies information flow, temporal decomposition (local step vs. regional step), and the boundaries of control (feasible action masks and safety constraints).
Comment 2. “A clear training and deployment pipeline of the reinforcement learning strategy is missing.”
Response: We agree that reproducibility requires a process-level depiction. In the revision, we add a training and deployment pipeline flowchart that shows: environment interaction loop, state/action/reward computation (including sustainability terms), hierarchical update schedule (local per step; regional every ? steps), CTDE updates, stopping criteria, and real-time inference procedure for deployment. We also expand Algorithm 1 by explicitly stating: (i) update frequencies for hierarchy levels, (ii) replay-buffer contents, (iii) evaluation checkpoints, and (iv) early-stopping/convergence criteria.
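The two-timescale update schedule mentioned in this response can be illustrated with a minimal Python sketch. It is not the manuscript's Algorithm 1: the function name `run_schedule`, the parameter `regional_period`, and the event labels are hypothetical, and the period value used below is an arbitrary illustration. The sketch only shows how local controllers could act at every step while region-level coordinators act on a slower clock.

```python
def run_schedule(num_steps, regional_period):
    """Return the ordered list of (step, level) decisions for one episode.

    Local controllers decide at every step. Regional coordinators decide
    only when the step index is a multiple of `regional_period`, which is
    an assumed, illustrative parameter.
    """
    if regional_period < 1:
        raise ValueError("regional_period must be >= 1")
    events = []
    for step in range(num_steps):
        # Slow timescale: regional guidance is refreshed first, so the
        # local decision at the same step can condition on it.
        if step % regional_period == 0:
            events.append((step, "regional"))
        # Fast timescale: every intersection acts at every step.
        events.append((step, "local"))
    return events


schedule = run_schedule(num_steps=6, regional_period=3)
regional_steps = [s for s, level in schedule if level == "regional"]
local_steps = [s for s, level in schedule if level == "local"]
```

In a CTDE setup, the same loop would also be the natural place to push transitions into the replay buffer and trigger evaluation checkpoints, but those parts are omitted here because their details are not given in this record.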
Comment 3. “The role and placement of SUT-GNN in the overall framework is unclear and potentially redundant.”
Response: Thank you; this was a helpful critique. In the revision, we reposition SUT-GNN as an optional predictor that provides short-horizon anticipatory features under partial observability. We make three clarifications: (i) SUT-GNN is not the policy/value backbone of SERL-H. (ii) It is trained separately offline on historical traffic time series; its outputs are then used as optional additional inputs (forecasts) to SERL-H. (iii) SERL-H does not require SUT-GNN to function; we keep prediction as a supporting module and explicitly state the conditions where it helps most (e.g., peak volatility, missing/noisy sensing, low V2X penetration).
To remove ambiguity, we add a small block diagram and explicit I/O definition:
Input: graph ?, historical window, sustainability covariates.
Output: predicted arrivals.
Consumption: appended to x_i_t before AGAT + actor.
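The consumption step above (forecasts appended to the local observation before the encoder and actor) can be sketched as follows. The function name `augment_observation`, the parameter `forecast_dim`, and the zero-fill behaviour when the predictor is disabled are assumptions made for illustration; the record does not say how the manuscript keeps the input dimension fixed when SUT-GNN is switched off.

```python
def augment_observation(local_obs, forecast, forecast_dim):
    """Append optional forecast features to a local observation vector.

    local_obs    : sequence of numbers observed at one intersection
    forecast     : sequence of predicted arrivals, or None if the
                   predictor is disabled
    forecast_dim : fixed number of forecast slots (assumed), so the
                   downstream network always sees the same input size
    """
    obs = [float(v) for v in local_obs]
    if forecast is None:
        # Assumption: a disabled predictor is represented by zeros.
        obs.extend([0.0] * forecast_dim)
    else:
        if len(forecast) != forecast_dim:
            raise ValueError("forecast length must equal forecast_dim")
        obs.extend(float(v) for v in forecast)
    return obs


with_forecast = augment_observation([1, 2], [3.5], forecast_dim=1)
without_forecast = augment_observation([1, 2], None, forecast_dim=2)
```

Keeping the augmented vector at a fixed length is one simple way to let the same policy run with or without the optional predictor; a learned mask or a separate input branch would be alternatives.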
Comment 4. “The case study should clearly distinguish simulation-based evaluation versus real-world deployment.”
Response: We fully agree. The revised manuscript will explicitly state that the signal-control evaluation is simulation-based (microscopic SUMO), while real-world data are used for demand/parameter calibration and for the separate prediction dataset. We add a concise explanation of: (i) simulator and network construction (grid testbed), (ii) how real data inform demand regimes/turning ratios (calibration assumptions), (iii) sensing realism assumptions (noise, missingness, V2X penetration), and (iv) what would be required for deployment (latency, safety fallback, monitoring).
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Authors,
The paper provides an analysis of the use of intelligent systems to optimize traffic signals across the entire urban network, prioritizing environmental objectives such as emission reduction and fuel efficiency, alongside traditional traffic flow indicators. The paper is of particular importance because, to prioritize sustainability, agents are trained using reward functions that balance vehicle waiting time, efficiency, and fairness with environmental objectives. I also appreciate that the limitations of your research are mentioned; indeed, priority should be given to reducing the gap between simulation and reality. However, in my opinion, your paper needs improvements, primarily regarding compliance with the “Instructions for Authors” and the “sustainability-template”. There are also a few suggestions that can make your research quicker to read and easier to understand. Please follow the comments below:
A. All figures, diagrams and tables should be inserted in the main text close to their first citation (e.g., Table 1 and Table 2). This applies to almost all figures and tables in the paper;
B. The punctuation recommended by the “sustainability-template” (“;” and “.”) for “Bulleted lists” and “Numbered lists” should be followed;
C. In order to streamline the text and highlight the specific features of the analysis, I suggest you consider the following writing style:
a. Lines 41-48:
Urban traffic signal control is difficult for at least four reasons:
- A city network is inherently multi-agent: each intersection is both a local decision point and a component of a coupled system where upstream decisions propagate to downstream queues;
- Observations are often partial and noisy due to imperfect detectors and stochastic behavior;
- Scalable coordination requires communication-efficient representations of neighborhood context that generalize beyond small grids;
- Sustainability introduces multi-objective trade-offs (efficiency, emissions, and equity) that are rarely embedded into standard RL formulations.
b. Lines 85-89:
Research relevant to sustainability-oriented urban traffic signal control spans four closely related themes:
- ITS and data-driven traffic management;
- Graph-based learning for spatio-temporal traffic modeling;
- RL for network-wide signal control with scalable coordination;
- Sustainability and socio-economic impact assessment for ITS interventions.
This section reviews each theme and positions our work.
c. Lines 175-181:
To address these gaps, we propose SERL-H: a sustainability-aware hierarchical MARL controller that:
- Uses region-level coordination to reflect heterogeneous urban contexts,
- Uses adaptive GAT to encode dynamic interdependencies under bounded neighborhoods, and
- Explicitly optimizes efficiency–environment–equity objectives.
In addition, we report results for SUT-GNN, a sustainability-enhanced spatio-temporal graph predictor, as supporting evidence that reliable anticipatory signals can be obtained during peak-hour volatility when such a module is enabled.
d. Lines 231-234:
We consider discrete phase control consistent with standard signal controllers. Each intersection i has a phase set Φ_i and a time-varying feasible action set A_{i,t} ⊆ Φ_i determined by operational constraints:
- Minimum/maximum green constraints;
- Inter-green clearance requirements (amber/all-red);
- Pedestrian clearance constraints.
e. Lines 245-251:
Equity term (regularizer). We penalize dispersion of service across user groups to discourage solutions that improve averages by systematically disadvantaging specific movements or vulnerable road users. Let G be a predefined set of service groups. Each group index g ∈ G corresponds to a disaggregated service category, such as:
- An approach/movement group (e.g., eastbound through, northbound left);
- A road-class group (major vs. minor approaches);
- A mode group (vehicles vs. pedestrians, or pedestrian crossings by leg).
f. Lines 258-263:
In addition to reward components used for learning, we report the USD-oriented outcomes for evaluation:
- Efficiency: average travel time (ATT), average delay (AVD), throughput (IT/Q);
- Environment: total emissions (TE / E_total), fuel consumption (FC);
- Safety proxies: accident risk index (ARI), conflict rate (CR);
- Socio-economic and reliability: economic productivity index (EPI), commute time variability (CTV), monetized cost savings C_savings, and environmental quality index E_qual (see Section 5.2.7).
g. Lines 270-275:
We propose SERL-H, a sustainability-aware hierarchical MARL framework for network-wide traffic signal control. SERL-H integrates:
- multi-source perception with SUT-GNN prediction features;
- an adaptive graph-attention encoder for dynamic interdependency modeling under bounded communication;
- region-level hierarchical coordination for scalable cooperation across heterogeneous urban regions.
The controller is trained under the CTDE paradigm with feasibility constraints enforced by action masking.
h. Lines 424-437:
Learning uses the sustainability-aware reward in Eq. (9), while reporting emphasizes interpretable outcomes: We report metrics aligned with efficiency, environment, and USD-oriented outcomes:
- Traffic efficiency:
  - Average Travel Time (ATT): mean travel time over trips in the evaluation set;
  - Average Vehicle Delay (AVD): additional travel time relative to free-flow (s/veh);
  - Intersection Throughput (IT): vehicles served per hour (veh/h);
- Environmental sustainability:
  - Total emissions (TE): aggregated emission mass (e.g., kg over horizon), optionally by pollutant (CO2, NOx, PM);
  - Fuel consumption (FC): aggregated fuel usage (e.g., L over horizon);
- USD-oriented indicators:
  - Safety proxies: Accident Risk Index (ARI), Conflict Rate (CR), derived from trajectory-based surrogate safety measures (reported as indices/events per hour);
  - Socio-economic and reliability: Economic Productivity Index (EPI), Commute Time Variability (CTV), and monetized cost savings C_savings as defined in Section 6 (with parameters explicitly stated where used).
D. To improve the clarity and impact of your idea, the sentence in Lines 58-62 should be expressed as follows:
SERL-H follows a centralized-training decentralized-execution (CTDE) paradigm, i.e., it combines region-level coordination with adaptive graph attention to encode time-varying network coupling under bounded neighborhood communication, and embeds sustainability objectives directly into the learning reward.
Author Response
Comment A: All figures, diagrams and tables should be inserted in the main text close to their first citation (e.g. Table 1 and Table 2). This applies for almost all figures and tables in the paper.
Response A: We have revised the manuscript layout to comply with the Sustainability template requirement that figures/tables appear close to their first citation. Specifically: 1. We moved all tables/figures to immediately follow the paragraph where they are first referenced. 2. We checked each `\ref{}` occurrence and ensured the corresponding float is placed in the same section/subsection as its first mention. 3. We adjusted float placement settings (where necessary) to reduce drifting and improve readability, while keeping the template-compliant style.
Comment B: The punctuation recommended by the “sustainability-template” (“;” and “.”) for “Bulleted lists” and “Numbered lists” should be followed.
Response B: We updated the punctuation in all bulleted and numbered lists to match the Sustainability template style: each bullet item ends with “;” and the final bullet ends with “.” We systematically reviewed the lists across sections (including RQs, difficulty reasons, constraints, equity group examples, and metric lists) and corrected punctuation for consistency.
Comment C(a): Lines 41–48: “Urban traffic signal control is difficult for at least four reasons …” (suggested list format)
Response C(a): We rewrote this paragraph into a clearer bulleted list following the suggested style and the template punctuation rules, explicitly listing the four reasons (multi-agent coupling; partial/noisy observation; communication-efficient scalable coordination; sustainability multi-objective trade-offs).
Comment C(b): Lines 85–89: “Research relevant to sustainability-oriented urban traffic signal control spans …” (suggested list format)
Response C(b): We streamlined the beginning of Related Work by introducing the section with four concise themes as a structured list (ITS/data-driven traffic management; graph-based learning; RL/MARL for network-wide control; sustainability & socio-economic assessment), and we added a final sentence that states how the section is organized and how our work is positioned.
Comment C(c): Lines 175–181: “To address these gaps, we propose SERL-H …” (suggested list format)
Response C(c): We revised the contribution summary into a compact list that highlights the three core design points: 1. region-level coordination for heterogeneous contexts; 2. adaptive GAT-based interdependency encoding under bounded neighborhoods; 3. explicit efficiency–environment–equity optimization via reward design. We also clarified that SUT-GNN is **supporting/optional** evidence for anticipatory features rather than the primary contribution of the paper.
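The third design point, explicit efficiency–environment–equity optimization via reward design, can be pictured with a generic weighted-sum reward. This is a sketch under stated assumptions and not the manuscript's Eq. (9), which is not reproduced in this record: the function names, the weight tuple, and the two hypothetical policy outcomes are all illustrative, and the three terms are assumed to be pre-normalized to comparable scales.

```python
def scalarized_reward(delay, emissions, dispersion, weights):
    """Generic weighted-sum reward (assumed form, lower cost is better).

    delay, emissions, dispersion : non-negative cost terms, assumed to be
                                   normalized to comparable scales
    weights                      : (w_eff, w_env, w_eq), hypothetical
    """
    w_eff, w_env, w_eq = weights
    return -(w_eff * delay + w_env * emissions + w_eq * dispersion)


def preferred_policy(outcomes, weights):
    """Return the name of the policy with the highest scalarized reward."""
    return max(
        outcomes,
        key=lambda name: scalarized_reward(*outcomes[name], weights),
    )


# Hypothetical (delay, emissions, dispersion) outcomes for two policies.
outcomes = {
    "efficiency_first": (10.0, 8.0, 4.0),
    "balanced": (12.0, 5.0, 2.0),
}
efficiency_only = preferred_policy(outcomes, (1.0, 0.0, 0.0))
all_objectives = preferred_policy(outcomes, (1.0, 1.0, 1.0))
```

With these made-up numbers, the ranking of the two policies flips once the environment and equity terms receive non-zero weight, which is the kind of trade-off a weight-sensitivity analysis is meant to expose.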
Comment C(d): Lines 231–234: “Action space and feasibility constraints …” (suggested list format)
Response C(d): We rewrote the action/constraint description as a bullet list explicitly enumerating: minimum/maximum green; inter-green clearance; pedestrian clearance constraints. We also added one sentence clarifying that feasibility is enforced by action masking at inference time.
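Action masking of the kind described in this response can be sketched in a few lines of Python. The sketch covers only the minimum/maximum green constraints; inter-green and pedestrian clearance are omitted. The function names, arguments, and the greedy selection rule are hypothetical and are not taken from the manuscript.

```python
def feasible_phases(num_phases, current_phase, elapsed_green,
                    min_green, max_green):
    """Return a boolean mask over phases (True = phase may be selected).

    Before `min_green` has elapsed, only the current phase is allowed.
    Once `max_green` is reached, the current phase must be left.
    Otherwise every phase is feasible.
    """
    if elapsed_green < min_green:
        return [p == current_phase for p in range(num_phases)]
    if elapsed_green >= max_green:
        return [p != current_phase for p in range(num_phases)]
    return [True] * num_phases


def masked_argmax(scores, mask):
    """Pick the highest-scoring phase among the feasible ones."""
    best = None
    for idx, (score, ok) in enumerate(zip(scores, mask)):
        if ok and (best is None or score > scores[best]):
            best = idx
    if best is None:
        raise ValueError("no feasible phase")
    return best


# Minimum green not yet served: the controller must hold phase 1.
hold_mask = feasible_phases(4, current_phase=1, elapsed_green=3,
                            min_green=5, max_green=40)
held = masked_argmax([0.9, 0.1, 0.5, 0.2], hold_mask)

# Maximum green reached: the controller must leave phase 1.
switch_mask = feasible_phases(4, current_phase=1, elapsed_green=40,
                              min_green=5, max_green=40)
switched = masked_argmax([0.2, 0.9, 0.5, 0.1], switch_mask)
```

Applying the mask before the selection step means an infeasible phase can never be chosen, regardless of how the policy scores it.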
Comment C(e): Lines 245–251: “Equity term (regularizer) … define G and g …” (suggested list format)
Response C(e): We revised the equity regularizer explanation to clearly define: (\mathcal{G}): the predefined set of service groups; (g \in \mathcal{G}): the index of one group. We then listed the three example groupings (movement/approach; road class; mode/pedestrian legs) using the recommended bullet style and punctuation. This also removes ambiguity for readers reproducing the equity metric.
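One way to read a dispersion-based equity regularizer over service groups is as the spread of per-group mean delay. The sketch below assumes population variance as the dispersion measure and delay as the service quantity; both are assumptions, since the manuscript's exact formula does not appear in this record, and the group names are hypothetical.

```python
from statistics import fmean, pvariance


def service_dispersion(group_delays):
    """Population variance of per-group mean delay.

    group_delays : dict mapping a group label g (e.g., a movement, a road
                   class, or a mode) to a list of observed delays
    Groups without observations are skipped. With fewer than two groups
    there is nothing to compare, so the dispersion is defined as 0.0.
    """
    group_means = [fmean(d) for d in group_delays.values() if d]
    if len(group_means) < 2:
        return 0.0
    return pvariance(group_means)


unequal = service_dispersion({
    "eastbound_through": [10.0, 20.0],  # mean delay 15 s
    "northbound_left": [30.0, 30.0],    # mean delay 30 s
})
equal = service_dispersion({
    "major_approach": [12.0, 18.0],
    "minor_approach": [15.0, 15.0],
})
```

A penalty proportional to this value discourages policies that lower the network average by starving one group, but, as the reviewers note elsewhere in this record, it measures service balance rather than structural equity.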
Comment C(f): Lines 258–263: “In addition to reward components … report USD oriented outcomes …” (suggested list format)
Response C(f): We reformatted the USD-oriented evaluation metrics into a structured list grouped by: Efficiency; Environment; Safety proxies; Socio-economic & reliability. We also corrected/standardized notation (e.g., TE vs. (E_{\text{total}}), (C_{\text{savings}}), (E_{\text{qual}})) and ensured the cross-reference points to the correct metrics subsection in the revised manuscript.
Comment C(g): Lines 270–275: “We propose SERL-H … integrates …” (suggested list format)
Response C(g): We revised the method overview to a short “SERL-H integrates:” list, explicitly covering: multi-source perception (+ optional SUT-GNN prediction features); adaptive graph-attention encoder under bounded communication; region-level hierarchical coordination. We also added one clarifying sentence stating that SERL-H is trained under CTDE and executed in a decentralized manner with feasibility/action masking.
Comment C(h): Lines 424–437: “Learning uses sustainability-aware reward … reporting emphasizes interpretable outcomes …” (suggested format)
Response C(h): We rewrote this metrics paragraph to improve “quick reading” by separating: learning objective (reward in Eq. (9)); reporting metrics grouped by category (traffic efficiency; environmental sustainability; USD-oriented indicators). We also improved metric definitions (units and interpretations) and ensured the acronyms are consistent with the Abbreviations list.
Comment D: To improve clarity and impact, the sentence in Lines 58–62 should be expressed as follows: “SERL-H follows a CTDE paradigm … and embeds sustainability objectives directly into the learning reward.”
Response D: We adopted the reviewer’s suggested rewriting and replaced the original sentence with the clearer CTDE-focused formulation. We also ensured terminology consistency (“bounded neighborhood communication,” “time-varying coupling,” “sustainability objectives embedded in reward”) with the Methodology section.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors!
1) While the proposed SERL-H framework combines hierarchy, graph attention, and sustainability-aware rewards, it is not fully clear which aspect constitutes the primary conceptual novelty beyond existing works such as CoSLight, MonitorLight, and hierarchical MARL approaches.
- Hierarchical coordination and CTDE MARL are well established in the literature.
- Graph attention mechanisms have already been applied to traffic signal control.
- Multi-objective reward formulations (delay + emissions) have appeared in prior studies.
The manuscript would benefit from a more explicit positioning that clarifies:
- what SERL-H enables that cannot be achieved by existing graph-based MARL with multi-objective rewards;
- whether the key contribution lies in hierarchical regional coordination, sustainability-aware evaluation, or the joint integration of these elements.
At present, the contribution risks appearing as an incremental architectural integration rather than a clearly differentiated methodological advance.
2) The treatment of sustainability, particularly the equity dimension, remains conceptually underdeveloped.
- Equity is operationalized via a dispersion-based regularizer over service groups, which is mathematically convenient but only loosely connected to established equity or accessibility frameworks in transport planning.
- Socio-economic heterogeneity is proxied through region-level context vectors, yet the manuscript does not clearly explain how these vectors relate to real socio-demographic inequalities.
The authors should:
- clarify whether the proposed equity term captures fairness, service balance, or merely variance reduction;
- discuss limitations of using service dispersion as a proxy for equity;
- better connect the sustainability framing to transport policy and planning literature, not only to ML-oriented metrics.
Without this, the sustainability claims risk being perceived as technically framed rather than substantively grounded.
3) The experimental evaluation is extensive, but several issues remain:
- The comparison mixes pure efficiency-oriented baselines with sustainability-aware SERL-H, which complicates interpretation of relative performance.
- Some claims (e.g., socio-economic cost savings, environmental quality indices) rely on monetization or composite indicators whose derivation is not fully transparent.
- The role of the auxiliary SUT-GNN predictor is ambiguous: while claimed to be optional, a significant portion of the narrative emphasizes its benefits.
The authors should:
- more clearly separate core contributions from auxiliary modules;
- justify the choice of baselines in relation to sustainability objectives;
- improve transparency in the construction of socio-economic indicators.
Although limitations are briefly acknowledged, the discussion of:
- simulation-to-reality transfer,
- data availability,
- governance and institutional adoption
remains largely generic. Given the strong sustainability framing, the paper would benefit from a more explicit discussion of real-world deployment constraints and policy relevance, beyond technical feasibility.
Author Response
Comment 1: While SERL-H combines hierarchy, graph attention, and sustainability-aware rewards, it is unclear which aspect constitutes the primary conceptual novelty beyond existing works (CoSLight, MonitorLight, hierarchical MARL). The contribution risks appearing incremental rather than clearly differentiated.
Response 1: We appreciate this important comment and agree that the novelty must be stated more explicitly. In the revised manuscript, we clarify that the contribution is not any single component (hierarchy, attention, or multi-objective reward) in isolation, but a differentiated integration targeted at sustainability-aware, heterogeneity-aware, and deployment-aware network control. Specifically, we revise the Introduction and Related Work to add an explicit Positioning and Conceptual Novelty subsection that highlights what SERL-H enables beyond existing graph-based MARL or hierarchical MARL baselines:
1. Region-conditioned hierarchical coordination under heterogeneity: SERL-H introduces region-level coordinators conditioned on region context vectors (\mathbf{u}_k), providing slow-timescale guidance that adapts coordination across heterogeneous urban subregions (e.g., demand volatility and environmental burden), which is not explicitly modeled in CoSLight/MonitorLight-style neighborhood coordination.
2. Sustainability as a control-and-evaluation bundle: We emphasize that SERL-H couples sustainability-aware learning objectives (efficiency–environment–equity) with a unified USD-oriented evaluation protocol that reports policy-facing outcomes, rather than only operational KPIs.
3. Bounded communication and feasibility-aware execution: SERL-H is designed for bounded neighborhood communication and signal-controller feasibility via action masking, making the trained policy compatible with operational constraints.
We also strengthen the experimental narrative by restructuring the Results to emphasize SERL-H as the primary contribution, and we add/expand ablation descriptions to demonstrate the necessity of *joint* design choices (hierarchy + adaptive attention + sustainability objective), rather than incremental combination.
Comment 2: The treatment of sustainability, especially equity, is conceptually underdeveloped. The dispersion-based regularizer is mathematically convenient but loosely connected to established equity/accessibility frameworks. Region-level context vectors are not clearly linked to socio-demographic inequality. The manuscript should clarify what equity captures, limitations of dispersion as a proxy, and connect sustainability framing to transport policy/planning literature.
Response 2: Thank you for this valuable critique. We agree that our original text could be misread as claiming a full equity/accessibility framework, while our intent was to implement an operational fairness proxy measurable in real-time traffic control. In the revision we make three changes:
1. Clarify interpretation and scope: We revise the equity subsection to explicitly state that the dispersion term captures *short-term service balance / operational fairness* across predefined groups (movements, road classes, modes), rather than structural equity (e.g., demographic-based accessibility or exposure justice). We explicitly distinguish “fairness proxy” vs. “planning equity.”
2. Discuss limitations transparently: We add a paragraph on limitations of service dispersion (e.g., does not encode demographic distributions, long-term accessibility to opportunities, or pollution exposure inequality), and we position it as a pragmatic first step compatible with available operational data. We also add future-work directions for richer equity metrics (accessibility- and exposure-aware indicators).
3. Explain region context vectors (\mathbf{u}_k): We strengthen the definition of (\mathbf{u}_k) to clarify that it is not treated as ground-truth socio-demographic inequality labels. Instead, it is a compact representation of *region-level heterogeneity relevant to sustainability impacts* (economic activity proxies, environmental quality measures, demand volatility), used to condition coordination policies. We explicitly acknowledge that full demographic grounding requires datasets that may be unavailable at scale and outline what data would be needed for deeper equity analysis.
Finally, we enrich the Related Work and Discussion with stronger connections to transportation policy and planning perspectives on equity, emphasizing that the paper’s equity component is technically grounded but policy-relevant as a measurable operational constraint.
Comment 3: Experimental evaluation is extensive, but: (i) mixing efficiency-oriented baselines with sustainability-aware SERL-H complicates interpretation; (ii) socio-economic cost savings and environmental quality indices rely on composite/monetized indicators whose derivation is not transparent; (iii) the auxiliary SUT-GNN predictor is ambiguous and over-emphasized despite being “optional”; (iv) discussion on simulation-to-reality transfer, data availability, governance/adoption is generic and should be strengthened.
Response 3: We appreciate these detailed suggestions and have revised the manuscript to improve clarity, transparency, and policy relevance.
(i) Baseline interpretation under sustainability objectives. We add explicit text in the Experimental Setup to justify baseline selection: efficiency-oriented and coordination-centric MARL baselines represent operational benchmarks commonly used in practice, while SERL-H targets a multi-objective sustainability reward. To avoid misinterpretation, we emphasize multi-metric reporting for all methods and add clearer discussion of trade-offs (efficiency vs. sustainability outcomes), including sensitivity to reward weights where applicable.
(ii) Transparency of monetized/composite indicators. We substantially improve reproducibility by providing explicit formulas and parameter reporting. We add a dedicated subsection defining annualized cost savings (C_{\text{savings}}) from time, fuel, and carbon externality components, and we introduce a parameter table (e.g., VOT, fuel price, SCC, annualization assumptions). For the Environmental Quality Index (E_{\text{qual}}), we specify the normalization approach and weights used to combine emissions/fuel components, so the indicator is traceable and reproducible.
(iii) Clarifying the role of SUT-GNN and separating core vs. auxiliary modules. We restructure the Results section so that SERL-H signal control is presented as the *primary* contribution first, and SUT-GNN is presented later as *supporting evidence* only. We explicitly state in both Methodology and Results that SERL-H does not depend on SUT-GNN and can operate purely with multi-source sensing inputs; when forecasting features are used, we clearly specify how they are trained and appended, and whether they are enabled in each experiment.
(iv) Strengthening discussion on deployment constraints and governance. We expand the Discussion with more concrete considerations: data requirements (detectors, optional V2X, handling missing/noisy measurements), real-time compute and latency, safety compliance and fallback control strategies, and governance/adoption issues (how policy makers can tune reward weights, how to audit attention/coordinator signals for accountability). We also sharpen the simulation-to-reality limitation and outline what would be needed for field deployment (calibration, domain adaptation, operational validation).
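The annualized cost-savings indicator described in item (ii) of this response combines time, fuel, and carbon components. A minimal sketch of one plausible form is given below; it is not the manuscript's formula, which is not reproduced in this record. The function name, argument names, linear structure, and all numeric values are assumptions chosen only to make the arithmetic traceable.

```python
def annualized_cost_savings(delay_saved_h, fuel_saved_l, co2_saved_t,
                            value_of_time, fuel_price, social_cost_carbon,
                            periods_per_year):
    """Monetized savings per year under an assumed linear model.

    delay_saved_h      : vehicle-hours of delay saved per evaluation period
    fuel_saved_l       : litres of fuel saved per evaluation period
    co2_saved_t        : tonnes of CO2 avoided per evaluation period
    value_of_time      : currency per vehicle-hour (VOT)
    fuel_price         : currency per litre
    social_cost_carbon : currency per tonne of CO2 (SCC)
    periods_per_year   : annualization factor (e.g., working days)
    """
    per_period = (value_of_time * delay_saved_h
                  + fuel_price * fuel_saved_l
                  + social_cost_carbon * co2_saved_t)
    return periods_per_year * per_period


# Hypothetical inputs: 250 * (20*100 + 1.5*50 + 100*0.5) = 250 * 2125
savings = annualized_cost_savings(
    delay_saved_h=100.0, fuel_saved_l=50.0, co2_saved_t=0.5,
    value_of_time=20.0, fuel_price=1.5, social_cost_carbon=100.0,
    periods_per_year=250,
)
```

Writing the indicator as an explicit function of named parameters is what makes a parameter table useful: a reader can substitute local values for VOT, fuel price, and SCC and recompute the figure.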
We believe these revisions materially strengthen the conceptual framing, improve transparency of reported sustainability outcomes, and clarify the paper’s core contributions relative to auxiliary modules.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
No questions now.
Author Response
Comment: No questions now.
Response: We sincerely thank the reviewer for the constructive evaluation in the previous round and for confirming that there are no further questions in Round 2. We appreciate the reviewer’s time and effort, and we have maintained careful attention to clarity, contextualization, methodological transparency, referencing, and conclusion support throughout the revised manuscript.
As part of our revisions, we have made the following improvements:
1. Abstract: We have revised the abstract to more explicitly state the study's purpose, core contribution, and the sustainability-oriented nature of the framework. Additionally, we emphasize the key advantages of the hierarchical multi-agent reinforcement learning (MARL) design, particularly how it contributes to sustainability-oriented network-wide signal optimization.
2. Title: We have refined the title to better highlight the sustainability-oriented optimization perspective and the hierarchical multi-agent reinforcement learning approach. The updated title is now: “Sustainability-Oriented Urban Traffic Systems Optimization through a Hierarchical Multi-Agent Deep Reinforcement Learning Framework”.
We hope these changes address the reviewer’s suggestions and improve the clarity and impact of the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors!
The revised manuscript presents a comprehensive and methodologically rigorous study addressing sustainability-oriented urban traffic signal control using a hierarchical multi-agent deep reinforcement learning framework. The paper reflects the current state of knowledge in intelligent transportation systems and sustainability-aware control and demonstrates a high level of technical sophistication.
The structure of the manuscript is clear and logical. The Introduction clearly motivates the research problem, formulates well-defined research questions, and positions the study within the relevant literature. The methodological framework (SERL-H) is thoroughly described, with transparent formulation of the hierarchical MARL architecture, sustainability-aware reward design, and evaluation pipeline. The experimental setup is carefully designed, and the results are clearly presented and systematically discussed. The conclusions are well supported by the reported findings and appropriately summarize the theoretical and practical implications of the study.
- The abstract is informative and well written. As a minor improvement, the authors may consider slightly sharpening the statement of the study’s purpose and core contribution, particularly by more explicitly emphasizing the sustainability-oriented nature of the proposed framework and the main advantages of the hierarchical MARL design.
- The title is generally appropriate and aligned with the content of the paper. A minor refinement could further improve its precision, for example by more clearly highlighting the sustainability-oriented optimization perspective and the hierarchical multi-agent reinforcement learning approach.
Comments on the Quality of English Language
The overall language quality is good. Nevertheless, a final careful proofreading is recommended to eliminate minor grammatical or stylistic inconsistencies and to ensure full compliance with the journal’s formatting and style guidelines.
Author Response
Comment 1: The abstract is informative and well written. As a minor improvement, the authors may consider slightly sharpening the statement of the study’s purpose and core contribution, particularly by more explicitly emphasizing the sustainability-oriented nature of the proposed framework and the main advantages of the hierarchical MARL design.
Response: Thank you. We revised the abstract to (i) state the study purpose more explicitly as sustainability-oriented network-wide signal optimization (efficiency–environment–equity), and (ii) highlight the main advantages of the hierarchical MARL design (fast local actuation + slower region-level coordination under CTDE, improved scalability/stability) as the core methodological contribution. Concretely, we strengthened the first 2–3 sentences and added a clearer “SERL-H enables…” phrasing to foreground the hierarchical mechanism and sustainability integration.
Comment 2: The title is generally appropriate and aligned with the content of the paper. A minor refinement could further improve its precision, for example by more clearly highlighting the sustainability-oriented optimization perspective and the hierarchical multi-agent reinforcement learning approach.
Response: Thank you. We refined the title to more explicitly emphasize both the sustainability-oriented optimization perspective and the hierarchical multi-agent reinforcement learning approach. The updated title used in the revised manuscript is: “Sustainability-Oriented Urban Traffic Systems Optimization through a Hierarchical Multi-Agent Deep Reinforcement Learning Framework.”
Comment 3: The overall language quality is good. Nevertheless, a final careful proofreading is recommended to eliminate minor grammatical or stylistic inconsistencies and to ensure full compliance with the journal’s formatting and style guidelines.
Response: We agree and conducted a final proofreading pass across the full manuscript. We corrected minor grammatical/stylistic inconsistencies, harmonized terminology (e.g., SERL-H, hierarchy/CTDE wording), and checked compliance with the Sustainability style (including list punctuation consistency and general formatting).

