1. Introduction
The digital transformation of cities has accelerated the adoption of artificial intelligence (AI) to optimize critical urban functions, including mobility, energy distribution, public safety, healthcare delivery, environmental monitoring, and citizen services [1,2,3,4,5]. Despite these advancements, many operational AI deployments remain single-objective, siloed, or domain-specific, where optimization focuses on isolated efficiency metrics rather than interdependent city-scale trade-offs. Urban optimization objectives frequently conflict, as demonstrated by multi-objective routing and control studies that jointly optimize delay, energy consumption, emissions, and throughput [5,6,7,8,9,10]. These competing priorities reveal a fundamental challenge: smart city decision systems must evolve from isolated optimization toward adaptive, multi-objective, governance-aware decision reasoning [9,11]. To understand this transition toward governance-aware decision intelligence, it is necessary to distinguish between three foundational layers of urban AI systems. First, multi-objective optimization addresses competing performance metrics. Second, sustainability and governance alignment introduce societal accountability into optimization objectives. Third, safe policy validation mechanisms are required before deployment in real municipal environments. The following paragraphs stage these dimensions progressively to clarify how they converge into a unified governance-centric learning framework.
At the algorithmic level, Multi-Objective Reinforcement Learning (MORL) enables learning Pareto-optimal policies under competing objectives rather than maximizing a single utility [12,13,14]. However, MORL deployments in smart-city contexts remain domain-fragmented and typically performance-centric, with sustainability alignment, accountability, and deployment safety treated as external evaluation criteria rather than intrinsic learning objectives [6,7,15,16,17]. Complementing MORL, Digital Twins (DTs) provide simulation infrastructure for scenario testing and risk-aware validation, yet most reported DT implementations focus on monitoring or forecasting and are rarely integrated as policy-validation modules within reinforcement-learning loops [18,19]. As a result, current research lacks an end-to-end pipeline that learns multi-objective urban policies, validates them safely in simulation, and supports governance-aligned auditing and selection before real-world adoption.
Several recent studies have advanced multi-objective reinforcement learning (RL) in transportation control, energy management, and surveillance optimization (e.g., [12,15,20]), while others have explored DT-enabled governance modeling and simulation-based urban planning (e.g., [21]). However, existing MORL applications remain predominantly domain-specific and performance-driven, without embedding governance-aligned reward structures or formal pre-deployment policy certification. Conversely, DT governance studies emphasize simulation fidelity and policy visualization but rarely integrate adaptive multi-objective learning or Pareto-based policy auditing within the decision loop. The proposed MORL-Smart Governance Framework (MORL-SGF) differs by explicitly unifying governance-aware MORL, DT validation, and Pareto-based accountability mechanisms within a single end-to-end learning pipeline.
To address these systemic limitations, this paper introduces the MORL-SGF, a unified architecture that integrates:
Multi-Objective Reinforcement Learning for adaptive policy learning under competing objectives [12,14];
DT simulation pipelines for risk-aware pre-deployment policy evaluation [18,22];
Governance-aligned reward shaping derived from ESG (Environmental, Social, Governance) and United Nations Sustainable Development Goals (SDGs) metrics [11,23];
Pareto-frontier policy auditing for transparent and accountable decision selection [10,16].
Unlike conventional MORL solutions that optimize performance alone, MORL-SGF extends MORL by embedding sustainability objectives within the reward structure and incorporating simulation-backed validation mechanisms [18,23]. The framework does not target a single application domain but instead establishes a cross-domain decision layer capable of governing city-scale policies under shared sustainability constraints [9,11].
It is important to clarify the scope of validation in this study. The proposed MORL-SGF framework is analytically and conceptually validated through formal modeling, architectural specification, and structured evidence synthesis rather than empirical deployment or large-scale simulation benchmarking. The objective of this work is to establish a governance-aware decision architecture and functional design blueprint, laying the groundwork for subsequent simulation-based and real-world implementation studies.
Collectively, these limitations motivate the need for a governance-aware learning-to-deployment architecture capable of integrating multi-objective optimization, sustainability alignment, and simulation-based policy validation within a unified decision framework. MORL-SGF is designed to address this integration gap systematically. The following contributions formalize the structural, algorithmic, and evaluative components of this proposed framework. This work is positioned as a design-oriented methodological framework study that formalizes governance-aware reinforcement learning architecture rather than as a survey, policy manifesto, or empirical benchmarking report.
The primary contributions of this work are:
A novel MORL-based governance framework (MORL-SGF) that unifies multi-domain urban decision learning under sustainability and governance objectives.
A governance reward shaping mechanism translating ESG and SDG indicators into formal MORL reward signals.
A Digital Twin integration protocol enabling safe, non-intrusive policy validation, risk tracing, and Pareto compliance auditing.
A policy selection and accountability model that ranks Pareto-optimal policies using sustainability-aware governance scoring.
Evidence-driven validation informed by synthesis and characterization of 79 smart city studies, demonstrating research readiness and real-world demand for governance-aware MORL systems.
The remainder of the paper proceeds as follows. Section 2 reviews foundational concepts in MORL, DTs, and governance-aware intelligent systems. Section 3 formalizes research gaps and functional system requirements. The methodology applied in this research is described in Section 4. Section 5 presents the MORL-SGF architecture, reward formulation, and decision pipeline. Section 6 synthesizes design evidence from the smart city literature. Section 7 outlines representative use cases. Section 8 discusses open challenges, followed by conclusions and future research directions in Section 9.
2. Background and Foundations
Smart city decision systems operate in highly dynamic and heterogeneous environments characterized by continuous sensing, multiple stakeholders, constrained resources, sustainability requirements, and inherently conflicting policy objectives [1,4,9]. While artificial intelligence (AI) and reinforcement learning have demonstrated strong optimization capabilities across urban domains, many deployed systems remain single-objective, siloed, or narrowly performance-driven. Such formulations fail to capture the multidimensional trade-offs required in real municipal decision-making, motivating the need for multi-objective learning, governance-aware reasoning, and safe pre-deployment validation mechanisms [10,11,16].
This section reviews the foundational concepts underpinning the proposed MORL-SGF, focusing on (i) multi-objective reinforcement learning, (ii) governance-aware AI grounded in ESG and SDG targets, and (iii) DTs as policy validation environments.
2.1. Multi-Objective Reinforcement Learning (MORL)
Reinforcement Learning aims to learn a policy π(a|s) that maximizes cumulative reward through interaction with an environment [10,13]. In real smart-city deployments, however, decision objectives are inherently multidimensional. Urban policies must simultaneously reduce emissions, minimize congestion, control energy consumption, maintain fairness across districts, and ensure safety and service continuity [7,16]. Encoding such competing objectives into a single scalar reward often leads to opaque trade-offs and hidden policy bias.
MORL extends classical RL by optimizing a vector-valued reward

R(s, a) = [R1(s, a), R2(s, a), …, Rm(s, a)] ∈ ℝ^m

and by learning a set of Pareto-optimal policies rather than a single optimal solution. This formulation preserves trade-offs explicitly and avoids collapsing heterogeneous objectives into manually weighted sums. From a mathematical perspective, the multi-objective formulation induces an m-dimensional objective space ℝ^m, where each axis corresponds to one governance-aligned reward component (e.g., efficiency, sustainability, fairness, safety, participation). Each learned policy π is therefore mapped to a coordinate vector J(π) = [J1(π), …, Jm(π)] ∈ ℝ^m, where Ji(π) denotes the expected cumulative return for objective i. The Pareto front is defined as the subset of non-dominated coordinate points in this objective space. Thus, the “coordinates” correspond to governance-relevant performance metrics, not spatial variables, and represent trade-off positions in governance-aligned objective space.
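To make the dominance relation concrete, the following minimal Python sketch filters a set of policy value vectors down to the non-dominated subset; the function names, policy scores, and three-objective setup are illustrative assumptions, not part of any published MORL-SGF implementation.

```python
import numpy as np

def dominates(j_a: np.ndarray, j_b: np.ndarray) -> bool:
    """True if value vector j_a Pareto-dominates j_b (maximization):
    at least as good on every objective, strictly better on at least one."""
    return bool(np.all(j_a >= j_b) and np.any(j_a > j_b))

def pareto_front(J: np.ndarray) -> np.ndarray:
    """Indices of non-dominated rows in J (one row per policy, m columns)."""
    keep = []
    for i, j_i in enumerate(J):
        if not any(dominates(j_k, j_i) for k, j_k in enumerate(J) if k != i):
            keep.append(i)
    return np.array(keep)

# Five hypothetical policies scored on [efficiency, sustainability, fairness]:
J = np.array([[0.9, 0.2, 0.5],
              [0.6, 0.7, 0.6],
              [0.5, 0.6, 0.5],   # dominated by the second policy
              [0.3, 0.9, 0.7],
              [0.9, 0.2, 0.4]])  # dominated by the first policy
print(pareto_front(J))           # -> [0 1 3]
```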
In smart-city contexts, MORL provides several structural advantages over scalarized reinforcement learning approaches.
Explicit trade-off preservation: Vector-valued rewards retain conflicting objectives—such as energy efficiency versus mobility throughput—without collapsing them into manually weighted aggregates, thereby maintaining transparency across heterogeneous urban metrics [24].
Pareto-based policy diversity: Learning non-dominated policies enables systematic trade-off analysis and prevents dominance by a single performance metric [10,15].
Adaptive multi-agent scalability: MORL architectures extend naturally to distributed and cooperative agents, supporting large-scale, non-stationary urban environments [25].
Governance objective integration: Multi-dimensional reward structures allow fairness, sustainability, and accountability objectives to be embedded directly during learning rather than evaluated post hoc [11,23].
Collectively, these characteristics position MORL as a promising foundation for governance-aware urban decision systems.
Despite these strengths, existing MORL deployments in smart-city research remain predominantly subsystem-oriented, optimizing transportation, energy, or communication infrastructures independently rather than as interconnected governance problems. While such domain-specific optimizations demonstrate technical performance gains, they typically omit explicit sustainability alignment, cross-sector trade-off coordination, and institutional accountability mechanisms. In contrast, governance-aware decision systems require integrated objective modeling, transparent trade-off exposure, and validation protocols that extend beyond isolated performance metrics. This comparative gap between performance-centric MORL applications and governance-centric urban requirements underscores the need for a structurally unified framework.
However, optimization alone—even when multi-objective—does not guarantee societal legitimacy or regulatory alignment. This limitation motivates the explicit incorporation of governance principles into the learning objective itself.
2.2. Governance-Aware AI and Sustainability Targets (ESG & SDGs)
Modern smart-city AI systems are no longer assessed solely on operational efficiency. Governance, accountability, fairness, and environmental impact have become mandatory design criteria for public-sector AI deployment [11,23]. Urban decision systems must increasingly satisfy three interdependent dimensions:
Environmental objectives, including emissions reduction, energy efficiency, pollution mitigation, and waste minimization;
Social objectives, such as fairness, accessibility, safety, and inclusive service delivery;
Governance objectives, encompassing transparency, explainability, regulatory compliance, and accountability.
These dimensions align directly with globally recognized sustainability frameworks, including ESG principles and the SDGs. However, many AI systems operationalize these goals only as post hoc evaluation metrics rather than embedding them into the learning objective itself.
To enable governance-aware learning, sustainability and accountability targets must be mathematically encoded into the reward structure of the decision system.
Table 1 summarizes representative governance dimensions and their corresponding interpretations in MORL-based policy design.
Despite recognition of these objectives, many existing MORL deployments treat them as after-optimization evaluation metrics, rather than embedding them into the reward function at training time [11,23].
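As a concrete illustration of such training-time encoding, the minimal sketch below maps raw ESG/SDG-style city indicators onto normalized, maximization-oriented reward components; the indicator names and value ranges are hypothetical assumptions used only to make the mapping tangible.

```python
def normalize(value: float, lo: float, hi: float, maximize: bool = True) -> float:
    """Min-max normalize to [0, 1]; flip orientation for minimization metrics."""
    x = (value - lo) / (hi - lo)
    x = min(max(x, 0.0), 1.0)
    return x if maximize else 1.0 - x

def governance_reward(indicators: dict) -> list[float]:
    """Build a vector reward [environmental, social, governance] consumed
    during learning, rather than computed as a post hoc evaluation metric."""
    r_env = normalize(indicators["co2_tons_per_day"], 100, 1000, maximize=False)
    r_soc = normalize(indicators["service_access_parity"], 0.0, 1.0)  # 1 = equal access
    r_gov = normalize(indicators["decisions_with_audit_trail"], 0.0, 1.0)
    return [r_env, r_soc, r_gov]

print(governance_reward({"co2_tons_per_day": 400,
                         "service_access_parity": 0.8,
                         "decisions_with_audit_trail": 0.95}))
```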
While embedding governance objectives into reward structures addresses normative alignment, it does not resolve deployment risk. Even governance-aware policies require structured validation before activation in complex urban environments.
2.3. Digital Twins as a Policy Validation Layer
To address the deployment and risk-validation dimension introduced above, Digital Twins provide high-fidelity virtual representations of physical urban systems that synchronize with real-world data to enable simulation, forecasting, and scenario analysis. They are increasingly adopted in urban planning, transportation modeling, energy systems, and emergency response [18,19]. However, their integration into reinforcement learning pipelines—particularly for governance validation—remains limited [22]. Recent studies suggest that DTs have the potential to support risk-aware experimentation and policy evaluation when integrated with AI pipelines [21].
When integrated with MORL, DTs provide a critical policy safety layer. They enable stress testing of learned policies under extreme or rare conditions, validation of Pareto trade-offs before deployment, estimation of long-term sustainability impacts, and early rejection of unsafe or inequitable policies. This shifts DTs from passive monitoring tools into active governance sandboxes, where policy behavior—not just infrastructure performance—is evaluated.
Conventional DT deployments focus on simulating physical system behavior or forecasting demand. In contrast, governance-aware DT integration evaluates policy outcomes, including risk exposure, fairness violations, and ESG/SDG compliance, before any real-world enactment. This distinction is essential for responsible urban AI deployment and forms a central pillar of the MORL-SGF framework.
From an algorithmic perspective, DTs can interact with reinforcement learning pipelines in three complementary roles. First, the DT functions as a high-fidelity simulated environment in which candidate policies generated by MORL agents are executed without real-world risk. Second, simulation outputs provide structured feedback signals—including performance metrics, constraint violations, and externality indicators—that can be incorporated into reward adjustment or policy filtering mechanisms. Third, the DT enables pre-deployment stress testing under adversarial or rare-event scenarios, allowing governance thresholds to be evaluated before real-world activation. In MORL-SGF, the DT is therefore not a passive visualization layer but an active policy validation module integrated into the learning-to-deployment loop.
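A schematic sketch of these three roles is given below; the DigitalTwin class, its rollout method, and the scenario names are hypothetical placeholders intended to make the interaction pattern concrete, not an API defined by this framework.

```python
import random

class DigitalTwin:
    def rollout(self, policy, scenario="nominal", steps=100):
        """Role 1: execute a candidate policy risk-free in simulation.
        Role 2: return structured feedback (return + constraint violations)."""
        random.seed(hash((id(policy), scenario)) % 2**32)  # reproducible stub
        stress = 0.3 if scenario != "nominal" else 0.0     # stubbed dynamics
        ret = sum(random.uniform(0.4, 1.0) - stress for _ in range(steps)) / steps
        return {"return": ret, "violations": int(ret < 0.2)}

def stress_test(twin, policy, scenarios=("nominal", "demand_spike", "sensor_noise")):
    """Role 3: pre-deployment stress testing across rare-event scenarios,
    producing per-scenario feedback for governance-threshold checks."""
    return {s: twin.rollout(policy, s) for s in scenarios}

print(stress_test(DigitalTwin(), policy=object()))
```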
The above foundations highlight why multi-objective learning, governance-aligned objective design, and simulation-based validation must be treated jointly in urban decision systems.
Section 3 consolidates the observed limitations into explicit research gaps and corresponding capability requirements. Furthermore, the identified limitations and their corresponding system-level requirements are consolidated in Table 2.
3. Research Gap and Motivation
Smart cities increasingly rely on autonomous, data-driven decision systems to coordinate critical urban operations, including mobility management [26,27], energy distribution [28,29], public safety monitoring [13,30], environmental sustainability [3,31], healthcare logistics [32,33,34], and long-term infrastructure planning [18,35]. Advances in artificial intelligence, reinforcement learning, and computer vision have enabled substantial performance improvements across these domains [26,36,37,38]. Despite these advances, recurring structural limitations are evident across deployments, particularly in the limited integration of governance-related objectives such as public accountability [23], social equity [16,39], sustainability [2,4,5], and transparent decision logic [13,40].
This pattern reveals a structural misalignment between algorithmic optimization objectives and the societal, regulatory, and ethical requirements of urban governance. Recent surveys indicate that contemporary smart-city AI deployments predominantly emphasize efficiency and throughput metrics, while governance integration, fairness considerations, sustainability alignment, and citizen-centric accountability mechanisms remain underrepresented [41,42]. Rather than reiterating general concerns, the following subsections formalize these deficiencies into explicit capability gaps.
3.1. Why Governance Must Become a Primary Optimization Objective
Urban policy decisions differ from classical engineering optimization in that they directly affect public welfare, require explicit justification, and involve inherently conflicting objectives that must remain transparent and auditable. Despite these realities, many AI-driven city systems continue to optimize operational key performance indicators (KPIs) such as latency, throughput, or resource utilization [15,20,43,44], while treating governance constraints as offline checks or post-deployment evaluation criteria [13,23,40]. This disconnect underscores the need to elevate governance to a primary optimization objective in urban AI systems.
3.2. Systemic Limitations of Existing Smart City AI Approaches
A cross-domain analysis of smart-city AI literature reveals five recurring and structural limitations that hinder responsible urban decision-making.
First, many reinforcement learning formulations rely on scalarized or manually weighted objectives, collapsing multi-dimensional societal trade-offs into a single reward signal [11,43,44]. This practice introduces hidden policy bias and obscures the rationale behind trade-off decisions. Second, governance-aware reward shaping is largely absent: sustainability, fairness, accountability, and social impact are rarely encoded during policy learning, leading AI systems to learn efficient yet potentially inequitable or unsustainable behaviors [2,11,39]. Third, policy interpretability and auditability remain limited. Few systems expose Pareto trade-offs or decision rationales in a form suitable for regulatory review or public accountability [13,23,40].
Fourth, pre-deployment policy stress testing is uncommon. While DTs and simulators are widely used to model infrastructure behavior, they are rarely integrated as validation checkpoints for reinforcement learning policies, allowing unsafe or non-compliant decisions to reach deployment without governance assurance [22,45,46]. Finally, many solutions remain domain-isolated, optimizing traffic, energy, safety, or environment independently, without a unified view of system-wide governance trade-offs [4,6,29].
An alternative strategy for prioritizing structural limitations or governance contributions involves multi-criteria decision-making methods such as the Analytic Hierarchy Process (AHP). AHP can assist municipal stakeholders in assigning structured pairwise preference weights to governance dimensions, thereby supporting strategic prioritization in complex urban systems [47,48,49,50]. While MORL-SGF preserves objective independence during learning, AHP-style post-learning preference structuring may complement Pareto-based selection by guiding final policy choice under stakeholder-defined priorities. This integration remains a promising direction for future research.
3.3. Research Gaps
Addressing the above limitations requires explicit capability targets that are currently missing from the literature. First, there is a lack of true Pareto-based multi-objective policy learning, where trade-offs remain transparent rather than being implicitly resolved through scalarization. Second, governance-aware reward modeling—explicitly encoding ESG, SDG, fairness, and sustainability objectives during learning—is largely absent. Third, explainable policy trade-offs suitable for public audit and regulatory approval are rarely supported. Fourth, DT environments are underutilized as policy sandboxing platforms for high-stress, pre-deployment validation. Finally, cross-domain generalization remains limited, with no unified governance layer capable of orchestrating decisions across mobility, energy, safety, and environmental systems.
These gaps collectively indicate that existing approaches lack the architectural, methodological, and governance foundations required for responsible, large-scale urban autonomy.
3.4. Motivating Vision
To operationalize governance within AI-driven decision systems, a next-generation smart-city learning framework must satisfy five core principles. Governance objectives must be optimized jointly with performance metrics rather than evaluated post hoc. Decision systems should expose a Pareto frontier of auditable policy alternatives instead of producing a single opaque policy. Policies must be validated under large-scale synthetic scenarios using DTs to ensure safety, robustness, and fairness. Decision rationale must be explainable in structured, human-interpretable formats suitable for public institutions. Finally, governance reasoning must generalize across urban domains instead of reinforcing isolated AI silos.
Together, these principles motivate a fundamental design requirement: smart-city AI must evolve from performance-centric learning to governance-centric policy intelligence.
Translating these principles into an implementable learning architecture requires explicit formulation of reward design, validation mechanisms, policy selection criteria, and cross-domain integration strategies. The subsequent section formalizes these elements as concrete research objectives and contributions of the proposed MORL-SGF framework.
3.5. Research Objectives and Contributions
Guided by the identified gaps and motivating vision, this work delivers five primary contributions, summarized in Table 3. These include a governance-aware MORL reward formulation embedding ESG and SDG metrics, a DT-based policy validation loop for safe deployment, Pareto-based policy ranking with explainability support, a unified cross-domain governance architecture, and a generalizable smart governance framework (MORL-SGF) applicable across urban infrastructures. Here, G1–G5 refer to the systemic research gaps identified in Section 3.3, namely scalarized objectives (G1), lack of governance-aware rewards (G2), absence of policy explainability (G3), lack of pre-deployment validation (G4), and domain-isolated optimization (G5).
Current smart-city AI research remains dominated by isolated, performance-oriented optimization approaches that lack governance guarantees, sustainability alignment, and policy accountability [51]. Existing learning paradigms rarely treat governance as an optimization objective, provide limited explainability, and fail to validate policies prior to deployment. These shortcomings motivate the MORL-SGF, which positions multi-objective reinforcement learning as a governance engine—validated through DTs and guided by ESG/SDG-aware reward modeling—to enable auditable, sustainable, and policy-aligned urban intelligence.
4. Methodology
The methodology adopted in this study is design-oriented and evidence-synthesized rather than empirical in the experimental sense. The proposed MORL-SGF framework is architecturally specified and analytically formalized, with validation grounded in systematic literature synthesis and structural feasibility analysis rather than real-world deployment or controlled simulation experiments. This positioning reflects the objective of establishing a governance-aware learning architecture that can guide future empirical implementations.
This study adopts a multi-phase methodological pipeline that integrates systematic evidence synthesis, governance-aligned problem modeling, multi-objective reinforcement learning design, and DT-based validation. The purpose of this methodology is to provide a transparent, reproducible foundation showing how the MORL-SGF was derived, structured, and theoretically validated.
The methodological workflow follows seven sequential steps, aligned exactly with the process illustrated in Figure 1. These steps map the evolution from evidence gathering to governance-aligned policy generation and deployment-ready validation.
4.1. Literature Search and Screening Protocol
To ensure methodological transparency and replicability, the evidence synthesis underlying this study followed a structured literature search and screening protocol. Publications were retrieved from Scopus, Web of Science, IEEE Xplore, and ScienceDirect, covering the period 2015–2025.
Search queries combined terms related to multi-objective reinforcement learning, smart cities, governance modeling, sustainability alignment, and DT validation. Representative search expressions included:
(“multi-objective reinforcement learning” OR “MORL”) AND (“smart city” OR “urban systems”) AND (“governance” OR “sustainability” OR “ESG” OR “SDG”) AND (“digital twin” OR “simulation validation”).
Inclusion criteria required that studies:
(1) Addressed reinforcement learning or AI-based optimization in urban contexts;
(2) Incorporated multi-objective or trade-off formulations;
(3) Discussed governance, sustainability, fairness, or accountability dimensions;
(4) Employed simulation or Digital Twin environments for validation.
Exclusion criteria removed purely theoretical optimization works without urban relevance, single-objective formulations lacking governance implications, non-peer-reviewed articles, and studies with insufficient methodological clarity.
After duplicate removal and abstract-level screening, 79 studies were retained for detailed governance-readiness evaluation. These studies correspond to the core peer-reviewed works cited in Sections 2–6 and form the evidentiary basis for the governance-readiness analysis.
4.2. Overview of the Methodological Pipeline
The methodology is organized into seven tightly coupled phases:
(i) Systematic literature review and evidence collection;
(ii) Governance readiness coding and gap identification;
(iii) Requirements derivation and multi-objective problem formalization;
(iv) Governance-aware reward modeling;
(v) MORL-based policy learning design;
(vi) Digital Twin–driven validation and governance-based policy filtering;
(vii) Deployment logic with continual governance alignment.
Rather than operating as isolated steps, these phases form a continuous decision pipeline in which analytical insights, formal models, and validation outcomes are progressively refined. This design ensures that governance considerations are embedded from the earliest conceptual stages and preserved through policy learning and validation.
4.3. Step 1—Systematic Literature Review
The methodological process begins with a structured systematic literature review conducted in accordance with widely accepted review guidelines [52]. Studies were collected from multiple smart-city-related domains, including transportation systems, energy management, digital health, surveillance, and urban governance [53,54,55,56,57]. To ensure methodological rigor and reproducibility, the screening and selection process followed PRISMA 2020 principles, with the full screening flow summarized in Figure 2. Figure 2 provides the PRISMA 2020 flow diagram, while Table 4 summarizes the corresponding numerical counts for clarity and reproducibility.
An initial corpus of more than 400 studies was identified through database searches and manual screening. These records were filtered based on relevance to artificial intelligence, reinforcement learning, governance modeling, and smart-city optimization. After title, abstract, and full-text screening, a final set of 79 peer-reviewed studies was retained, covering multi-objective optimization, reinforcement learning, DT environments, and governance frameworks. This evidence base serves as the empirical foundation for identifying systemic limitations in existing research and motivating the design of MORL-SGF. While the PRISMA-based screening resulted in a final corpus of 79 studies for systematic evidence synthesis, additional references were included throughout the manuscript to support background concepts, methodological foundations, and comparative discussion.
4.4. Step 2—Governance Readiness Coding & Gap Identification
Each study included in the final corpus was systematically evaluated using the Governance Readiness Index (GRI) described in Section 6.1. The coding process assessed five governance-critical dimensions: objective formulation (single versus multi-objective), level of governance integration, use of simulators or DTs, degree of explainability, and validation rigor. The design of this coding scheme was informed by prior governance and accountability analyses [23,58,59].
The aggregated coding results revealed five recurring deficiencies across the literature. First, governance objectives are rarely encoded directly into learning objectives. Second, DT usage is often limited to visualization or basic simulation rather than policy validation. Third, explainability mechanisms are typically applied post hoc rather than embedded within the learning process. Fourth, cross-domain reasoning across city subsystems remains largely absent. Finally, although MORL techniques exist, they are seldom applied for governance-oriented reasoning. These findings directly informed the functional requirements underlying the MORL-SGF architecture.
4.5. Step 3—Requirements Derivation & MOMDP Problem Formalization
Building on the gaps identified in Step 2, smart-city decision-making was formalized as a Multi-Objective Markov Decision Process (MOMDP). This formulation captures the inherently stochastic, multi-stakeholder nature of urban governance problems. The MOMDP representation incorporates state variables reflecting system congestion, energy demand, emissions, and spatial fairness disparities; action spaces corresponding to mobility control, routing, energy dispatch, or allocation decisions; and transition dynamics derived either from real-world data or DT simulations.
Crucially, the reward function is defined as a vector-valued governance-aligned objective space, rather than a scalar performance metric. This formalization establishes the mathematical foundation for the MORL component described later in Section 4 and ensures that governance trade-offs are preserved throughout the learning process.
The MOMDP formulation assumes full observability at the system level through aggregated state representations derived either from sensor networks or DT integration. While urban environments are inherently non-stationary due to demand fluctuations, seasonal variation, and evolving infrastructure conditions, non-stationarity is addressed through adaptive policy updates within the continual governance alignment stage (Step 7). The framework supports both centralized and multi-agent configurations. In centralized settings, a global policy operates over aggregated city-scale states; in multi-agent configurations, decentralized agents (e.g., traffic controllers, energy nodes) share a common governance-aware reward structure while interacting within a coordinated DT environment. This flexible modeling assumption ensures applicability across heterogeneous smart-city deployment scenarios.
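For concreteness, the MOMDP structure described above can be sketched as a typed container; the field names follow the paper's notation (S, A, P, R, γ), while the container itself and the toy deterministic transition are illustrative assumptions rather than a prescribed implementation.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class MOMDP:
    states: Sequence      # S: aggregated city-scale states
    actions: Sequence     # A: admissible control actions
    transition: Callable  # stand-in for P(s'|s, a), from data or a DT simulator
    reward: Callable      # R(s, a) -> np.ndarray of shape (m,), vector-valued
    gamma: float          # discount factor in (0, 1)
    m: int                # number of governance objectives

def demo_reward(s, a) -> np.ndarray:
    # e.g., [efficiency, sustainability, fairness, safety, participation]
    return np.array([0.7, 0.5, 0.6, 0.9, 0.4])

momdp = MOMDP(states=range(10), actions=range(3),
              transition=lambda s, a: (s + a) % 10,  # toy deterministic dynamics
              reward=demo_reward, gamma=0.95, m=5)
print(momdp.reward(0, 1).shape)  # -> (5,)
```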
4.6. Step 4—Governance-Aware Reward Modeling
To operationalize governance within the learning process, abstract sustainability and accountability principles were translated into measurable reward components. Reward design was informed by ESG and SDG indicators, municipal governance key performance indicators, and fairness and sustainability metrics reported in prior studies [39,60]. The resulting reward vector captures multiple governance dimensions, including efficiency, sustainability, fairness, safety, cost, and public participation.
To prevent dominance of any single objective, normalization and independence constraints were applied across reward dimensions. This design ensures that the MORL algorithm learns trade-offs intrinsically rather than relying on manually tuned scalar weights. The output of this stage is a governance-aware reward vector that serves as the direct input to policy learning.
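One hedged way to realize the normalization constraint is per-objective running min-max scaling, sketched below so that no single objective dominates the vector reward; the specific normalizer is an illustrative design choice, not the only admissible one.

```python
import numpy as np

class RewardNormalizer:
    """Track per-objective observed ranges and rescale each reward
    dimension to [0, 1], keeping objectives on comparable scales."""
    def __init__(self, m: int):
        self.lo = np.full(m, np.inf)
        self.hi = np.full(m, -np.inf)

    def __call__(self, r: np.ndarray) -> np.ndarray:
        self.lo = np.minimum(self.lo, r)
        self.hi = np.maximum(self.hi, r)
        span = np.where(self.hi > self.lo, self.hi - self.lo, 1.0)
        return (r - self.lo) / span

norm = RewardNormalizer(m=3)
for r in [np.array([120.0, 0.4, 3.0]), np.array([80.0, 0.9, 5.0])]:
    print(norm(r))   # raw magnitudes differ by orders of magnitude; outputs do not
```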
4.7. Step 5—MORL-Based Policy Learning Design
The MORL-SGF framework operationalizes Pareto-based multi-objective reinforcement learning as its core policy optimization mechanism. Rather than relying on scalarized reward aggregation, the framework adopts dominance-based policy learning in which vector-valued rewards are preserved throughout training. This approach enables explicit exploration of the Pareto frontier and maintains transparent trade-offs among governance-aligned objectives.
Algorithmically, the framework is compatible with established Pareto-oriented MORL implementations, including Pareto Q-learning (value-based), multi-objective actor–critic architectures, and hypervolume-guided policy optimization methods [25,27,60]. In the conceptual instantiation presented here, dominance relations are used to evaluate candidate policies during training or post-training selection, ensuring that no objective dimension is collapsed into a fixed scalar weight. This design choice preserves governance transparency and prevents hidden objective bias.
4.8. Step 6—Digital Twin Validation & Governance-Based Policy Filtering
Candidate policies are validated within a high-fidelity DT environment before any real-world consideration. Drawing on DT governance studies [18,60], this phase evaluates policies under diverse operational and stress conditions. Validation focuses on robustness, stability, fairness impact, and compliance with predefined governance thresholds.
Governance thresholds are defined as quantitative upper or lower bounds on governance-critical indicators derived from municipal regulations, ESG benchmarks, SDG targets, or policy-defined risk tolerances. Examples include maximum allowable emission increases, minimum fairness parity ratios, acceptable congestion ceilings, or budgetary constraints. These thresholds function as hard feasibility constraints during DT validation: any policy violating at least one governance constraint is removed from the candidate set Π*.
Importantly, threshold values are not assumed to be static. Within the continual governance alignment stage (Step 7), thresholds may be updated to reflect evolving policy priorities, regulatory revisions, or stakeholder-driven adjustments. Such updates trigger re-evaluation of policies within the DT environment, ensuring that governance compliance remains dynamic rather than fixed at design time.
Policies that violate governance risk constraints are filtered out, while remaining candidates are ranked using governance-aligned Pareto dominance and hypervolume metrics. The output of this stage is a validated policy set Πvalid ⊆ Π*, the subset of Pareto-optimal policies that satisfy both performance requirements and governance risk constraints under DT validation, ensuring that unsafe or non-compliant policies are excluded prior to deployment.
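The filtering step can be illustrated with a minimal sketch, assuming hypothetical threshold values and indicator names; in a real deployment these bounds would be drawn from municipal regulations or ESG/SDG targets as described above.

```python
candidates = [
    {"name": "pi_1", "emission_increase": 0.02, "fairness_parity": 0.85},
    {"name": "pi_2", "emission_increase": 0.12, "fairness_parity": 0.90},  # emissions too high
    {"name": "pi_3", "emission_increase": 0.01, "fairness_parity": 0.60},  # parity too low
]

# Hard feasibility constraints: ("max", b) = upper bound, ("min", b) = lower bound.
THRESHOLDS = {"emission_increase": ("max", 0.05),
              "fairness_parity":   ("min", 0.75)}

def satisfies(policy: dict) -> bool:
    """A policy is dropped if it violates at least one governance constraint."""
    for key, (kind, bound) in THRESHOLDS.items():
        if kind == "max" and policy[key] > bound:
            return False
        if kind == "min" and policy[key] < bound:
            return False
    return True

pi_valid = [p["name"] for p in candidates if satisfies(p)]
print(pi_valid)  # -> ['pi_1']
```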
4.9. Step 7—Deployment Logic & Continual Governance Alignment
The final methodological stage addresses real-world deployment dynamics. A continual feedback loop compares real-world performance metrics with DT predictions and evolving governance indicators. Policy parameters are iteratively adjusted to correct deviations and maintain long-term alignment with governance objectives, consistent with adaptive reinforcement learning principles in [61,62].
This step ensures that governance alignment is not treated as a static design-time constraint but as an ongoing operational requirement throughout the policy lifecycle. It is important to emphasize that the present study develops a formal architectural and methodological framework rather than an empirical implementation or simulation benchmark. The objective is to specify the structural integration of governance-aware MORL, DT validation, and Pareto-based policy auditing at the architectural level. Empirical deployment, simulation benchmarking, and prototype implementation are intentionally reserved for future work, as the primary contribution of this study lies in formal system design and governance-aligned optimization modeling.
5. MORL–Smart Governance Framework (MORL-SGF)
5.1. Framework Overview
City-scale decision-making inherently involves multiple, often conflicting objectives spanning operational performance, environmental sustainability, social equity, safety, and public accountability. Such objectives cannot be adequately addressed through traditional single-objective optimization pipelines, which typically collapse diverse societal priorities into a single utility function [3,33]. To overcome this limitation, the MORL-SGF introduces a unified architecture that integrates multi-objective reinforcement learning with governance-aware reward design and Digital Twin-based policy validation.
At its core, MORL-SGF is structured around three tightly coupled intelligence layers. First, MORL is employed to learn sets of Pareto-optimal policies rather than a single optimal solution, explicitly preserving trade-offs among competing objectives [6,63]. Second, governance-aware reward modeling embeds sustainability, equity, safety, participation, and cost considerations directly into the learning objective, ensuring that governance principles influence policy formation rather than serving only as post hoc evaluation criteria [2,39,64]. Third, a DT validation layer serves as a risk-aware sandbox in which candidate policies are stress-tested, filtered, and certified before real-world deployment [25,65].
Unlike conventional reinforcement learning pipelines that return a single opaque policy, MORL-SGF produces a portfolio of non-dominated policies, enabling public authorities to select decisions aligned with explicit governance priorities such as sustainability targets, equity mandates, safety constraints, or fiscal limitations [45,52]. The overall architecture and its boundary between the external smart-city context and the internal governance core are illustrated in Figure 3.
5.2. Problem Formulation: MOMDP Modeling
Smart-city decision processes within MORL-SGF are formalized as a Multi-Objective Markov Decision Process (MOMDP) [1,44], defined as:

M = (S, A, P, R, γ)

The MOMDP formulation follows the standard MDP structure with vector-valued rewards, as commonly adopted in the multi-objective reinforcement learning literature. Here, S denotes the system state space (e.g., congestion levels, energy demand, emissions, service access disparities); A denotes the set of admissible control actions (e.g., signal timing, routing strategies, energy dispatch decisions); P(s′|s, a) describes the stochastic state transition dynamics, derived either from historical data or from a calibrated DT simulator [5,9]; and R(s, a) is an m-dimensional reward vector. The reward function R = [R1, R2, …, Rm] is vector-valued, where each component corresponds to a distinct governance-aligned objective, and γ ∈ (0, 1) is the discount factor. All objectives are normalized to comparable ranges and oriented consistently (larger is better), with minimization objectives negated or transformed.
At every step, the MORL agent receives a multi-dimensional reward rather than a scalar objective [66]:

R(s, a) = [R1(s, a), R2(s, a), …, Rm(s, a)] ∈ ℝ^m

Here, R(s, a) denotes a vector-valued reward function R: S × A → ℝ^m, where m is the number of governance objectives. In the present framework, m = 5, corresponding to efficiency, sustainability, fairness, safety, and participation objectives. The learning objective is to derive a set of optimal policies Π* = {π1, π2, …, πK} such that no policy in the set strictly dominates another across all objectives [33,67]. This Pareto-optimal formulation ensures that improvements in one governance dimension necessarily incur trade-offs in at least one other, thereby making policy compromises explicit and auditable [14].
Let Π* denote the Pareto-optimal policy set (portfolio), and let K be the number of non-dominated policies returned by the MORL procedure, where each πi is a stationary policy mapping states to actions. During training, the MORL agent iteratively samples trajectories under the current policy, updates vector-valued value estimates, and retains non-dominated policy candidates based on Pareto dominance relations. This process continues until convergence criteria are met (e.g., stability of the Pareto front or bounded hypervolume improvement).
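The retention of non-dominated candidates can be sketched as a Pareto-archive update inside the training loop, shown below with a stubbed policy evaluation; the archive mechanics are an illustrative assumption, and any dominance-preserving variant would serve equally.

```python
import numpy as np

def dominates(a, b):
    """Pareto dominance for maximization of all objectives."""
    return bool(np.all(a >= b) and np.any(a > b))

archive = []  # list of (policy_id, value_vector) pairs: the running Pareto set

def update_archive(policy_id, J_pi):
    global archive
    if any(dominates(J, J_pi) for _, J in archive):
        return                                     # dominated candidate: discard
    archive = [(p, J) for p, J in archive if not dominates(J_pi, J)]
    archive.append((policy_id, J_pi))              # retain non-dominated candidate

rng = np.random.default_rng(0)
for it in range(20):                               # stand-in for MORL iterations
    update_archive(it, rng.uniform(size=3))        # stubbed policy evaluation
print(len(archive), "non-dominated policies retained")
```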
5.3. Governance-Aware Reward Design
A defining feature of MORL-SGF is the direct encoding of governance principles into the reward space, inspired by ESG and SDG frameworks [4,42,68]. Rather than collapsing governance objectives into a weighted scalar reward, MORL-SGF maintains them as independent reward components to preserve transparency and avoid subjective pre-weighting [38].
Efficiency rewards capture service performance metrics such as congestion reduction or latency minimization. Sustainability rewards penalize environmental externalities, including carbon emissions and energy overuse. Fairness rewards quantify spatial or demographic service disparities, while safety rewards reflect accident rates or risk exposure. Cost rewards account for operational and budgetary efficiency, and participation rewards capture citizen engagement or public feedback indicators. Operational cost is treated as a component of the efficiency objective rather than as a separate reward dimension.
By maintaining independence among these reward dimensions and applying normalization constraints, MORL-SGF ensures that trade-offs are learned rather than imposed, allowing governance priorities to emerge naturally through Pareto reasoning rather than manual tuning.
5.4. Pareto-Based Multi-Objective Policy Learning
Policy learning within MORL-SGF is conducted using MORL architectures such as actor–critic methods, Soft Actor–Critic (SAC), or Pareto Q-learning, all of which estimate vector-valued value functions [15,20,27,63]. The expected return is defined over cumulative vector rewards:

Q(s, a) = E[ ∑_{t=0}^{∞} γ^t R(st, at) | s0 = s, a0 = a ] ∈ ℝ^m

Here, Q(s, a) is a vector-valued action–value function, consistent with the vector-valued reward formulation. Policy gradients are computed in a multi-objective setting, enabling simultaneous optimization across governance dimensions. Rather than selecting a single optimal policy, the learning process yields a Pareto front of candidate solutions. To support governance-driven comparison, dominance-aware metrics such as hypervolume contribution are used to rank policies within the Pareto set [10,41]:

ΔHV(π) = HV(Π*) − HV(Π* \ {π})

where the hypervolume HV is computed with respect to a fixed reference point z_ref ∈ ℝ^m, and ΔHV(π) ranks policies by their contribution to the Pareto set. This approach preserves the diversity of governance-relevant solutions and avoids premature convergence toward policies that over-optimize a single objective at the expense of others. Policy updates are computed using multi-objective extensions of actor–critic or Q-learning methods, where vector-valued returns are handled through dominance-based evaluation or preference-conditioned optimization rather than a single scalar objective. Importantly, no scalar aggregation of objectives is performed during training. Any governance-based ranking occurs only after Pareto set generation, preserving the multi-objective learning structure while enabling accountable stakeholder selection.
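A small two-objective sketch of hypervolume-contribution ranking follows; higher-dimensional fronts would normally use an established implementation (e.g., in pymoo), and the front values and reference point here are illustrative.

```python
import numpy as np

def hypervolume_2d(points: np.ndarray, z_ref: np.ndarray) -> float:
    """Area dominated by `points` (maximization) relative to reference z_ref,
    computed by sweeping the points in descending order of objective 1."""
    pts = points[np.argsort(-points[:, 0])]
    hv, prev_y = 0.0, z_ref[1]
    for x, y in pts:
        if y > prev_y:                      # only non-dominated steps add area
            hv += (x - z_ref[0]) * (y - prev_y)
            prev_y = y
    return hv

front = np.array([[0.9, 0.2], [0.6, 0.7], [0.3, 0.9]])
z_ref = np.array([0.0, 0.0])
total = hypervolume_2d(front, z_ref)
for i in range(len(front)):                 # leave-one-out contribution dHV
    rest = np.delete(front, i, axis=0)
    print(f"policy {i}: dHV = {total - hypervolume_2d(rest, z_ref):.3f}")
```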
To mitigate Pareto-front explosion and cognitive overload, MORL-SGF supports dominance filtering, hypervolume contribution ranking, and stakeholder-conditioned preference queries to reduce candidate policies to a manageable subset. Hierarchical decomposition may also be applied, whereby governance objectives are clustered into higher-level categories prior to frontier evaluation. While full real-time frontier pruning guarantees are not formalized in the present work, scalable MORL variants and preference-conditioned policy learning represent active research directions.
5.5. Digital Twin Policy Validation and Risk Filtering
Before real-world deployment, all candidate policies are evaluated within a city-scale DT environment that replicates operational dynamics under normal and extreme conditions [31,69]. The DT enables systematic stress testing, robustness analysis, and risk assessment without exposing real citizens or infrastructure to experimental failures [18].
The Digital Twin is interfaced with the MORL agent through a simulation evaluation pipeline: candidate policies generated during training are executed within the DT environment across multiple controlled scenarios. The resulting performance vectors are logged and compared against governance thresholds before policies are admitted into the validated set. This interface decouples policy learning from real-world execution while maintaining governance-aware certification prior to deployment.
Policies are assessed for stability, governance compliance, and risk tolerance. Only those satisfying predefined governance thresholds—such as acceptable risk levels and minimum robustness criteria—are retained:

Πvalid = {π ∈ Π* : Risk(π) ≤ τ_risk and Robustness(π) ≥ τ_rob}

Here, Risk(π) is operationalized as the expected performance degradation under predefined stress-test scenarios within the DT, measured as the weighted variance or worst-case deviation of governance reward components across simulated perturbations, e.g., Risk(π) = max_{ω∈Ω} [J_nom(π) − J_ω(π)]. Robustness(π) denotes policy stability under distributional shifts, estimated as the consistency of reward performance across multiple scenario samples, e.g., Robustness(π) = min_{ω∈Ω} J_ω(π)/(J_nom(π) + ϵ). These definitions are domain-instantiable and may be computed using statistical dispersion, constraint-violation frequency, or scenario-based sensitivity metrics. Risk(π) and Robustness(π) denote DT-derived evaluation scores, and τ_risk and τ_rob are governance thresholds set by stakeholders/regulators. Ω is a predefined finite set of DT stress-test scenarios (demand spikes, sensor noise, incidents); J_nom(π) is the nominal return, J_ω(π) is the return under scenario ω, and ϵ > 0 avoids division by zero. This validation stage replaces unsafe online trial-and-error exploration with certified, simulation-backed policy approval, ensuring that only governance-compliant policies proceed to deployment [70]. Furthermore, thresholds are defined contextually by municipal governance requirements and are not fixed constants within the framework.
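Under the worst-case instantiation given above, these scores reduce to a few lines of code; the scenario returns and the specific aggregation below are illustrative assumptions.

```python
EPS = 1e-8  # avoids division by zero in the robustness ratio

def risk(j_nom: float, j_scenarios: dict) -> float:
    """Worst-case return degradation across DT stress-test scenarios."""
    return max(j_nom - j for j in j_scenarios.values())

def robustness(j_nom: float, j_scenarios: dict) -> float:
    """Worst-case consistency of return across scenarios (1.0 = no degradation)."""
    return min(j / (j_nom + EPS) for j in j_scenarios.values())

j_nom = 0.82  # nominal return of a candidate policy (illustrative)
omega = {"demand_spike": 0.71, "sensor_noise": 0.78, "incident": 0.64}
print(f"Risk = {risk(j_nom, omega):.2f}, Robustness = {robustness(j_nom, omega):.2f}")
# A policy is retained only if Risk <= tau_risk and Robustness >= tau_rob.
```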
5.6. Explainable Policy Selection for City Stakeholders
Following DT validation, decision makers are presented with a transparent portfolio of Pareto-optimal policies rather than a single recommendation [8,71]. Each policy can be compared across governance dimensions such as emissions, equity, cost, and safety, enabling explicit alignment with municipal priorities.
For example, one policy may prioritize emissions reduction at the expense of cost efficiency, while another may favor equity and safety in underserved regions. By exposing these trade-offs explicitly, MORL-SGF supports accountable, stakeholder-driven policy selection rather than algorithmic opacity [17,22].
Legitimacy, auditability, and governance alignment are operationalized in MORL-SGF through three measurable mechanisms: (i) explicit vector-valued objective representation, ensuring that all governance dimensions are preserved and inspectable; (ii) Digital Twin-based validation metrics (risk, robustness, compliance thresholds), enabling pre-deployment stress testing; and (iii) transparent Pareto portfolio exposure, allowing decision-makers to inspect trade-offs prior to selection. These mechanisms provide structural and procedural auditability rather than post hoc justification. Empirical validation at full city scale remains a future deployment-stage objective.
5.7. Deployment Feedback and Continual Policy Alignment
Once deployed, policies remain subject to continual governance alignment through real-world monitoring and feedback. Observed performance and governance metrics are compared against simulated expectations, and policy parameters are iteratively adjusted using governance-aware feedback signals derived from discrepancies between real-world and simulated outcomes [11,35]:

θ ← θ + α · Δg, where Δg = g_DT − g_real

This update rule represents a conceptual governance-aware feedback heuristic rather than a standardized reinforcement learning update, intended to illustrate how discrepancies between simulated and real-world governance outcomes may guide adaptive policy adjustment. This adaptive loop ensures long-term alignment with evolving societal goals, regulatory constraints, and urban dynamics [72,73]. When deployed performance deviates from governance targets—such as sustainability thresholds, fairness constraints, or safety metrics—the feedback term Δg drives corrective adaptation. Unlike conventional online reinforcement learning, this update mechanism is explicitly governance-aware, ensuring that policy evolution remains bounded by regulatory, ethical, and societal constraints rather than purely performance-driven optimization.
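A toy rendering of this heuristic is sketched below; the parameter vector, adaptation rate, metric values, and clipping rule are all illustrative assumptions intended only to show the direction of the corrective signal.

```python
import numpy as np

alpha = 0.1                                # adaptation rate (illustrative)
theta = np.array([0.50, 0.30])             # deployed policy parameters

g_twin = np.array([0.80, 0.90])            # DT-predicted [sustainability, fairness]
g_real = np.array([0.72, 0.93])            # observed post-deployment metrics

delta_g = g_twin - g_real                  # governance discrepancy signal
theta = theta + alpha * delta_g            # corrective adaptation step
theta = np.clip(theta, 0.0, 1.0)           # keep parameters within admissible bounds
print(theta)                               # -> [0.508 0.297]
```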
5.8. Discussion and Framework Implications
MORL-SGF provides an architecture-level integration of governance-aware MORL and DT validation, enabling explicit trade-off exposure and policy filtering prior to deployment. Its primary implication is a shift from single-policy optimization to auditable policy portfolios, where governance constraints and stakeholder priorities can be operationalized transparently. The framework is intended as a reusable blueprint that can be instantiated with different MORL algorithms and domain-specific DT environments.
6. Evidence Synthesis and Multi-Domain Insights from Smart City Literature
The design of the MORL-SGF is grounded in a systematic synthesis of 79 peer-reviewed studies, complemented by additional references used for contextual and methodological support, spanning smart transportation, energy systems, surveillance, healthcare, and urban infrastructure planning. While these studies demonstrate progress in AI, reinforcement learning, and simulation for city-scale systems (e.g., [1,2,3,5,8,10,11]), the synthesis identifies persistent limitations in governance integration, explainability, and cross-domain reasoning.
This section evaluates the governance readiness of contemporary smart-city AI research and positions MORL-SGF in relation to the identified gaps. The validation presented in this section is interpretive and synthesis-driven, aimed at demonstrating architectural readiness rather than empirical performance benchmarking.
6.1. Review Methodology and Governance Readiness Scoring
Each selected study was evaluated using a GRI designed to assess whether an AI system is suitable for accountable, deployable urban decision-making. Five governance-critical dimensions were considered:
Objective formulation (O): single- vs. multi-objective learning;
Governance integration (G): explicit modeling of fairness, sustainability, or accountability;
Digital Twin or simulator usage (D): from none to high-fidelity twins;
Explainability (X): post hoc vs. intrinsic interpretability;
Validation setting (V): theoretical, simulated, or real-world deployment.
To ensure consistency and interpretability of the GRI, each dimension is discretized on a bounded ordinal scale reflecting increasing levels of governance maturity. Specifically, objective formulation is encoded as O ∈ {0, 1}, where 0 denotes single-objective optimization and 1 denotes explicit multi-objective formulation. Governance integration, DT usage, explainability, and validation rigor are encoded as G, D, X, V ∈ {0, 1, 2}, where 0 indicates absence, 1 denotes partial or post hoc implementation, and 2 represents explicit, intrinsic, or high-fidelity integration. Under this encoding, the maximum achievable composite score is 10, corresponding to full multi-objective formulation with explicit governance modeling, high-fidelity DT validation, intrinsic explainability, and real-world or deployment-level validation.
The normalized index is defined as

GRI = (O + G + D + X + V) / 10

where GRI ∈ [0, 1] and higher values indicate stronger alignment with governance-aware and deployment-ready AI.
This scoring framework enables quantitative comparison across heterogeneous domains while preserving interpretability for policy-oriented analysis.
Given the interpretive nature of governance assessment, potential subjectivity in scoring was addressed through a structured coding rubric with explicit operational definitions for each dimension (O, G, D, X, V). For example, governance integration (G = 2) required direct embedding of fairness, sustainability, or accountability indicators within the learning objective, whereas post hoc evaluation corresponded to G = 1. Similarly, DT integration (D = 2) required active policy-loop validation rather than standalone simulation or visualization. To enhance internal consistency, a two-pass coding process was adopted, consisting of initial classification followed by structured cross-verification against predefined criteria.
While formal inter-rater reliability analysis was not conducted, the transparent scoring rubric and ordinal definitions mitigate arbitrariness and allow independent replication or reassessment by future researchers.
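Under the rubric above, the composite computation is direct; the sketch below reproduces the scoring arithmetic, with the example codings chosen purely for illustration.

```python
def gri(O: int, G: int, D: int, X: int, V: float) -> float:
    """Governance Readiness Index: ordinal codings normalized by the
    maximum composite score of 10, yielding GRI in [0, 1]."""
    assert O in (0, 1) and all(0 <= v <= 2 for v in (G, D, X, V))
    return (O + G + D + X + V) / 10.0

# A hypothetical performance-centric RL study vs. the idealized MORL-SGF
# profile reported in Section 6.3.
print(gri(0, 1, 0, 1, 1))     # -> 0.3
print(gri(1, 2, 2, 2, 1.9))   # -> 0.89
```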
6.2. Key Empirical Findings
6.2.1. Finding 1—Governance Objectives Are Rarely Embedded into Learning
Although 37% of reviewed studies adopt multi-objective formulations ([3,12,63]), only 11% explicitly encode governance objectives such as fairness, sustainability, or accountability directly into the reward function ([1,8,73]). Many systems continue to optimize classical engineering KPIs (e.g., latency, throughput, cost) without governance coupling:

P(O = 1) = 0.37, P(G = 2) = 0.11

All reported percentages were computed through frequency analysis of the 79 retained studies. Each study was coded according to the predefined ordinal rubric (O, G, D, X, V). Binary or ordinal classifications were then aggregated by counting occurrences of each category and normalizing by the total sample size (n = 79). For example, P(O = 1) = 0.37 indicates that 29 out of 79 studies explicitly adopted multi-objective formulations. No inferential statistical modeling was performed; results represent descriptive frequency analysis of governance-readiness characteristics.
6.2.2. Finding 2—Digital Twins Are Underutilized for Policy Validation
Only a limited number of works use DTs or high-fidelity simulations as part of the RL loop ([2,20,70]), leaving most approaches reliant solely on abstract simulators. In this coding, D = 2 denotes high-fidelity DT environments explicitly integrated into the policy evaluation loop rather than used solely for visualization or offline simulation. Many DT implementations remain focused on forecasting or visualization rather than policy-validation integration.
6.2.3. Finding 3—Explainability Is Mostly Post Hoc, Not Intrinsic
Interpretability frameworks appear in a few mobility and energy applications ([3,5]), but intrinsic policy explainability appears in fewer than 10% of reviewed systems:

P(X = 2) < 0.10
6.2.4. Finding 4—Cross-Domain Decision Learning Is Largely Absent
The literature remains dominated by domain-isolated optimization, with transportation [20,42,63] and energy systems [10,28,70] accounting for the majority of studies. Only 4% of systems address cross-sector trade-offs.
6.2.5. Finding 5—MORL Is Rare and Not Aligned with Governance Reasoning
Although some works explore MORL in transportation and energy [12,20,73], governance-aware Pareto selection is still absent: few MORL-based approaches link Pareto policy selection to governance or sustainability criteria.
6.3. Governance Readiness Benchmarking
Table 5 positions common research paradigms according to average Governance Readiness Index. The MORL-SGF score reflects a framework-level idealized evaluation rather than an empirical deployment benchmark.
The GRI score of 0.89 assigned to MORL-SGF represents an idealized framework-level evaluation derived directly from the ordinal scoring rubric defined in Section 6.1. Under this rubric, MORL-SGF satisfies the maximum attainable scores in objective formulation (O = 1), governance integration (G = 2), DT validation (D = 2), and intrinsic explainability (X = 2). Deployment validation is modeled at near-full maturity (V ≈ 1.9), reflecting structured DT-based validation and governance-constrained deployment logic rather than large-scale empirical field implementation. The resulting composite score, (1 + 2 + 2 + 2 + 1.9)/10 ≈ 0.89, therefore represents a theoretical upper bound on architectural alignment with governance-aware AI design.
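A worked sketch of this composite computation is shown below; the function name is illustrative, while the rubric scores and the fixed normalization constant of 10 follow the text.

```python
def governance_readiness_index(O, G, D, X, V, norm=10.0):
    """Composite GRI per the rubric: sum of ordinal dimension scores
    normalized by a fixed constant (10, following the text)."""
    return (O + G + D + X + V) / norm

# Idealized framework-level scores assigned to MORL-SGF in Section 6.3
gri = governance_readiness_index(O=1, G=2, D=2, X=2, V=1.9)
print(f"GRI(MORL-SGF) = {gri:.2f}")  # -> 0.89
```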
Importantly, this benchmarking is comparative and structural rather than empirical. The score is not derived from real-world deployment metrics but from architectural capability alignment under the defined governance maturity criteria. This distinction ensures transparency and prevents conflation between conceptual completeness and empirical validation. The improvement margin between MORL-SGF and the nearest paradigm is

$$\Delta_{\mathrm{GRI}} = \mathrm{GRI}_{\mathrm{MORL\text{-}SGF}} - \mathrm{GRI}_{\mathrm{nearest}}, \qquad \frac{\Delta_{\mathrm{GRI}}}{\mathrm{GRI}_{\mathrm{nearest}}} \approx 0.37,$$

a 37% relative increase in governance alignment. This improvement reflects the cumulative impact of governance-aware rewards, DT validation, and Pareto-based explainability.
6.4. Domain-Specific Gaps and MORL-SGF Contributions
Key shortcomings are consistent across domains, with representative examples drawn from transportation [20,42], energy [10,28,70], healthcare [33,53], surveillance [13,68], and urban systems [35,73]. MORL-SGF responds directly to these limitations by:
Embedding governance objectives into learning;
Validating policies under extreme scenarios;
Enabling auditable, stakeholder-driven policy selection;
Supporting cross-domain governance reasoning.
6.5. Key Implications and Positioning of MORL-SGF
The evidence synthesis highlights persistent gaps in governance integration, DT validation, explainability, cross-domain coordination, and governance-aligned MORL deployment. These findings provide analytical justification for the architectural components of MORL-SGF described in Section 5.
8. Challenges, Limitations, and Open Research Directions
While MORL-SGF introduces a structured path toward accountable and sustainability-driven policy learning in smart cities, deploying such a framework at scale presents persistent challenges spanning governance modeling, technical feasibility, data readiness, multi-stakeholder alignment, and real-world reliability [40,62,64,76]. Recognizing these challenges transparently is essential for a responsible research agenda, future benchmarking, and practical road-to-deployment planning. This section discusses key limitations and research opportunities across governance, learning, infrastructure, evaluation, and deployment dimensions.
8.1. Governance Representation and Quantification Challenges
Although sustainability and governance are widely endorsed goals in city planning, modeling them as mathematical objectives remains nontrivial [67,79]. Unlike latency or energy consumption metrics, governance constructs such as fairness, accountability, citizen trust, and participatory inclusion lack universally standardized or machine-readable formulations [33]. Consequently, reward shaping becomes highly context-dependent and often relies on proxy variables that can introduce bias or oversimplification. Critical studies on smart-city governance caution that technological optimization alone cannot resolve institutional accountability, democratic legitimacy, or social trust, underscoring the need for governance-aware AI frameworks that explicitly expose and manage policy trade-offs rather than obscuring them through automated decision-making [80].
The following expression illustrates a common composite governance formulation discussed in the literature; it is not used within MORL-SGF, which preserves vector-valued rewards:

$$R_{\mathrm{gov}}(s,a) \;=\; \sum_{i} w_i \, g_i(s,a), \qquad \sum_{i} w_i = 1, \quad w_i \ge 0,$$

where each $g_i$ denotes a normalized governance indicator (e.g., fairness, sustainability, or accountability). This weighted form is shown only as an optional stakeholder preference aggregation for post-learning selection, not as the MORL training reward (which remains vector-valued). However, determining the weights $w_i$ without embedding subjective political or institutional bias is itself a governance challenge. Future research must develop standardized, audit-ready governance metrics derived from regulatory frameworks such as ISO 37120 [81], the UN SDG [82] indicator taxonomy, or local municipal policy KPIs. Equally important is the construction of mechanisms that allow stakeholder-controlled adjustment of governance weights, ensuring alignment with democratic priorities instead of opaque technical defaults.
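As a minimal illustration of such stakeholder-controlled aggregation, the sketch below ranks a learned Pareto set under a supplied weight vector for post-learning selection only; the policies, indicator values, and weights are hypothetical.

```python
import numpy as np

# Governance scores of Pareto-optimal policies along illustrative
# indicators (fairness, sustainability, accountability), each in [0, 1].
pareto_policies = {
    "pi_1": np.array([0.82, 0.61, 0.70]),
    "pi_2": np.array([0.64, 0.88, 0.59]),
    "pi_3": np.array([0.71, 0.74, 0.81]),
}

def select_policy(policies, weights):
    """Post-learning scalarization: rank Pareto policies by a
    stakeholder-supplied weight vector (normalized for auditability)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    scores = {name: float(w @ g) for name, g in policies.items()}
    return max(scores, key=scores.get), scores

# Stakeholders prioritizing fairness over the other indicators
best, scores = select_policy(pareto_policies, weights=[0.5, 0.25, 0.25])
print(best, scores)
```

Because the weights act only on an already-learned Pareto set, they can be revised and re-audited without retraining, which is one way to keep preference aggregation under stakeholder control.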
Beyond weight subjectivity, governance-aware reward modeling introduces deeper ethical risks related to reward mis-specification and governance manipulation. Even when vector-valued rewards are preserved during MORL training, the selection and normalization of governance indicators may unintentionally privilege measurable proxies over latent civic values. For example, fairness metrics based solely on service distribution variance may overlook structural inequalities, while sustainability metrics may neglect long-term ecological externalities not captured in short-horizon simulations.
Additionally, governance objectives may be strategically influenced if institutional actors adjust threshold definitions or post-learning aggregation weights to favor politically convenient outcomes. Such manipulation risks transforming governance-aware AI into governance-appearing optimization without substantive accountability. Addressing these vulnerabilities requires transparent metric documentation, stakeholder-auditable reward design, periodic recalibration of governance thresholds, and independent policy review mechanisms capable of detecting metric gaming or reward exploitation.
8.2. Scalability of Multi-Objective Reinforcement Learning
Scaling MORL to real city environments remains computationally expensive, especially where high-dimensional state spaces, long planning horizons, and dense Pareto front approximations are required. MORL scalability challenges such as Pareto explosion, long horizons, and high-dimensional objectives have been highlighted in advanced RL research [26,52,83], and hierarchical or preference-based MORL architectures may mitigate the computational overhead [78,84]. The size of the non-dominated policy set grows rapidly with the number of objectives; a classical expected-case estimate for randomly distributed candidates is

$$\mathbb{E}\big[\,|\mathcal{P}|\,\big] \;=\; O\!\left(\frac{(\ln k)^{\,m-1}}{(m-1)!}\right),$$

where $k$ is the number of policy candidates and $m$ is the number of objectives. This growth compromises real-time decision-making, especially when governance objectives are expanded beyond efficiency and sustainability into auditability, equity, and citizen participation. The expression illustrates the general combinatorial growth trend of the non-dominated policy set with respect to the number of objectives rather than a strict theoretical complexity bound.
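This growth trend can be observed empirically. The sketch below filters the non-dominated set from k randomly sampled reward vectors and reports its size as m increases; the uniform sampling model is an illustrative assumption, not a property of MORL-SGF.

```python
import numpy as np

def non_dominated(points):
    """Return the rows of `points` not dominated by any other row,
    treating all objectives as maximized."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            np.all(q >= p) and np.any(q > p)
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return front

rng = np.random.default_rng(seed=0)
k = 500                     # candidate policies
for m in (2, 3, 5, 8):      # number of objectives
    rewards = rng.random((k, m))
    print(f"m = {m}: {len(non_dominated(rewards))} of {k} policies are non-dominated")
```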
Promising directions include dimensionality reduction for objective pruning, preference-based MORL to limit non-actionable Pareto regions, and hierarchical MORL where governance objectives are learned at a slower policy cycle while operational objectives are updated in real time.
While MORL-SGF provides structural governance alignment, computational scalability remains a non-trivial challenge. Multi-objective reinforcement learning inherently increases optimization complexity relative to scalarized RL due to the need to approximate or maintain a Pareto policy set rather than a single optimal solution. The computational burden grows with (i) the dimensionality of the reward vector, (ii) the size of the state–action space, and (iii) the number of non-dominated candidate policies retained during training.
In high-dimensional objective spaces, Pareto front approximation may scale super-linearly as the number of objectives increases, potentially leading to policy set expansion and increased storage and evaluation costs. Additionally, integration with DT environments introduces further computational overhead, particularly when stress-testing policies under multiple extreme scenarios or multi-agent urban simulations.
In large-scale city deployments, scalability may require distributed MORL training, objective-space pruning mechanisms, or adaptive Pareto sampling strategies to prevent combinatorial growth. These computational trade-offs highlight the need for future research on scalable governance-aware MORL implementations capable of operating under real-time urban constraints.
8.3. Reliability of Digital Twin Environments
DT fidelity concerns are documented in the smart-infrastructure and simulation-based validation literature [39,76], and the need for adversarial testing, uncertainty modeling, and cross-domain synchronization is emphasized in multiple DT studies [33,53]. DTs play a critical role in pre-deployment policy evaluation, yet their representational fidelity remains a bottleneck. If the DT fails to model rare but high-impact scenarios (e.g., flash floods, mass transit failures, sensor blackouts, civil events), policies validated in simulation may fail catastrophically in real deployment. Moreover, many city DTs are fragmented, vertically siloed, or lack real-time bidirectional data streams.
A robust validation loop ideally cycles from candidate-policy simulation through adversarial stress testing and governance auditing to staged deployment, with real-world feedback used to recalibrate the twin.
Future progress requires adaptive DTs capable of uncertainty modeling, adversarial stress testing, domain randomization, and cross-sector data fusion pipelines that maintain causal consistency between simulated and real urban dynamics.
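One possible way to operationalize such a loop is sketched below: each candidate policy is stress-tested across DT scenarios, including rare high-impact events, and must satisfy every governance threshold before staged deployment. All interfaces here (the stub twin, its rollout method, and the metric names) are hypothetical placeholders, not a specified MORL-SGF API.

```python
class StubTwin:
    """Stand-in for a high-fidelity DT interface (hypothetical)."""
    def rollout(self, policy, scenario):
        # A real twin would simulate the policy under the scenario;
        # fixed metrics are returned here purely for illustration.
        return {"equity": 0.80, "safety": 0.90}

def validate_in_digital_twin(policy, twin, scenarios, thresholds):
    """Pre-deployment gate: a policy passes only if it meets every
    governance threshold under all stress scenarios."""
    for scenario in scenarios:
        metrics = twin.rollout(policy, scenario)
        for name, minimum in thresholds.items():
            if metrics[name] < minimum:
                return False, f"violated '{name}' under scenario '{scenario}'"
    return True, "eligible for staged deployment"

ok, verdict = validate_in_digital_twin(
    policy="pi_3",
    twin=StubTwin(),
    scenarios=["flash_flood", "mass_transit_failure", "sensor_blackout"],
    thresholds={"equity": 0.70, "safety": 0.85},
)
print(ok, verdict)
```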
8.4. Explainability, Accountability, and Policy Auditability
Explainability requirements for urban AI have been previously noted in governance and fairness research [40,64,78]. Policy-level interpretability tools, such as counterfactual explanations, are increasingly recognized as essential for public institutions [34,78]. Although some of these studies focus on domain-specific digital systems rather than reinforcement learning algorithms, they provide empirical evidence that transparency, perceived fairness, and user trust are decisive factors in the acceptance of AI-enabled public-sector technologies, thereby motivating auditable and explainable policy-learning frameworks for smart-city governance. Despite MORL producing Pareto-optimal policies, explaining the rationale behind policy trade-offs to non-technical stakeholders remains a governance barrier: current MORL literature focuses on dominance ranking, yet city planning requires actionable justification. Illustrative examples of governance trade-offs include:
Why a policy sacrifices 12% of traffic efficiency to improve equity by 34% in underserved districts;
Which demographic zones benefit or lose from specific Pareto solutions;
What minimum governance threshold a policy violates if deployed.
Empirical studies on citizen acceptance of smart-city technologies demonstrate that transparency, explainability, and perceived fairness are stronger predictors of public trust than technical performance alone, reinforcing the necessity of auditable and interpretable policy-learning systems in urban governance [85]. Future research must formalize automatically generated policy audit trails, counterfactual explanations, and compliance certificates tied to governance constraints. These should resemble model cards or policy sheets, but at the policy level rather than the model level.
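A policy-level audit record could take the following illustrative form, adapting the model-card idea to Pareto policy selection; the schema and field names are assumptions for exposition, not a proposed standard.

```python
from dataclasses import dataclass, field

@dataclass
class PolicyAuditCard:
    """Illustrative policy-level analogue of a model card: records why a
    Pareto policy was selected and whom it benefits or disadvantages."""
    policy_id: str
    tradeoffs: dict            # e.g., {"traffic_efficiency": -0.12, "equity": +0.34}
    affected_zones: list       # demographic zones that gain or lose
    thresholds_violated: list  # governance constraints breached, if any
    counterfactuals: list = field(default_factory=list)  # rejected alternatives and why

card = PolicyAuditCard(
    policy_id="pi_3",
    tradeoffs={"traffic_efficiency": -0.12, "equity": +0.34},
    affected_zones=["district_7", "district_12"],
    thresholds_violated=[],
)
print(card)
```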
8.5. Data Readiness, Bias, and Social Representation Risks
Challenges of biased or incomplete urban data have been documented in the smart-city sensing and equity-focused AI literature [26,33,62,67]. Reward-level fairness calibration and demographic stress-testing extensions are necessary to prevent systemic bias propagation [78]. MORL-SGF assumes access to representative city-wide data, but many real municipal datasets contain sampling bias, sensor sparsity, missing demographic coverage, or socioeconomic blind spots. If left uncorrected, governance-aware learning may optimize biased realities rather than ideal civic outcomes. This is particularly critical when encoding fairness as a reward component, for example as a variance penalty of the form

$$r_{\mathrm{fair}} = -\,\mathrm{Var}_{z \in \mathcal{Z}}\big(q_z\big),$$

where $q_z$ denotes the service level delivered to demographic zone $z$: if the measured $q_z$ themselves reflect sensing gaps, the learned policy inherits those gaps.
Future work must introduce bias-aware reward normalization, fairness-calibrated state sampling, synthetic data augmentation for marginalized zones, and worst-case equity stress testing before policy approval.
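As a minimal sketch of bias-aware reward normalization, the fairness term below weights per-zone service levels by inverse sensor coverage so that under-sensed zones are not discounted; the weighting scheme is an illustrative assumption, not part of the framework specification.

```python
import numpy as np

def fairness_reward(service_levels, sensor_coverage):
    """Illustrative fairness term: negative coverage-weighted variance of
    per-zone service levels. Upweighting poorly sensed zones is a simple
    bias-aware correction, not a definitive formulation."""
    s = np.asarray(service_levels, dtype=float)
    w = 1.0 / np.clip(np.asarray(sensor_coverage, dtype=float), 1e-6, None)
    w = w / w.sum()                  # normalize inverse-coverage weights
    mean = np.sum(w * s)
    return -float(np.sum(w * (s - mean) ** 2))  # closer to 0 = more equitable

# Zone 2 is under-sensed (30% coverage), so its service gap counts more
print(fairness_reward([0.9, 0.6, 0.8], sensor_coverage=[1.0, 0.3, 0.8]))
```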
8.6. Synthesis and Research Outlook
The challenges identified in this section highlight that advancing governance-aware reinforcement learning for smart cities requires progress across multiple dimensions, including governance quantification, learning scalability, simulation fidelity, explainability, and data integrity. As summarized in Table 6, current limitations are not isolated technical shortcomings but systemic gaps that emerge when AI systems are deployed in complex, socio-technical urban environments [39,64,76]. The framework has not yet been implemented in a live DT environment or evaluated through numerical benchmarking, which remains a necessary next step for empirical validation.
A central challenge remains the lack of standardized, machine-interpretable governance metrics. While sustainability, fairness, and accountability are widely recognized as policy objectives, their translation into robust reward formulations continues to rely on context-specific proxies, raising concerns about bias, subjectivity, and regulatory alignment. Addressing this gap will require the development of audit-ready governance indicators grounded in international standards such as SDGs, ISO frameworks, and municipal policy KPIs, alongside mechanisms that allow stakeholder-adjustable governance preferences.
Scalability poses another critical barrier. As the number of objectives increases, MORL systems face Pareto front explosion and increased computational cost, limiting their applicability in real-time city operations. Promising research directions include preference-based MORL, hierarchical learning architectures, and objective-space reduction techniques that preserve governance relevance while maintaining tractability.
The reliability of DT environments also remains a bottleneck. Simulation fidelity gaps, limited modeling of rare but high-impact events, and fragmented cross-domain representations can undermine policy validation. Future DTs must evolve toward uncertainty-aware, adversarially tested, and causally consistent platforms capable of supporting governance-critical decision validation across multiple urban sectors.
Explainability and auditability represent equally important governance requirements. While Pareto-optimality exposes trade-offs mathematically, city administrators and regulators require interpretable policy justifications, counterfactual explanations, and compliance certificates that articulate why a policy was selected and whom it benefits or disadvantages. Embedding such policy-level explanations into MORL pipelines remains an open research challenge.
Finally, data readiness and social representation risks must be addressed to prevent governance-aware learning from reinforcing existing inequities. Bias-aware reward normalization, demographic stress testing, and synthetic data augmentation for underserved populations are essential to ensure that learned policies reflect equitable civic objectives rather than biased data realities.
Overall, addressing these challenges is essential to move MORL-SGF from a conceptual framework toward operational civic infrastructure. The value of MORL-SGF lies not only in optimizing urban systems, but in producing governable, auditable, and deployable policy portfolios that city authorities can justify, adapt, and regulate through transparent policy reasoning rather than algorithmic opacity. As governance-aware AI becomes a foundational requirement for future smart cities, MORL-SGF represents a necessary transition from performance-driven autonomy toward responsibility-driven urban intelligence.
9. Conclusions and Future Research Agenda
Smart cities require decision systems that extend beyond single-objective performance optimization toward governance-aware and accountable policy learning. This study introduced MORL-SGF, a unified architecture integrating multi-objective reinforcement learning, ESG/SDG-aligned reward modeling, DT-based policy validation, and Pareto-based policy auditing within a single decision pipeline.
From a methodological perspective, MORL-SGF formalizes governance objectives—such as sustainability, equity, safety, and accountability—as intrinsic components of vector-valued reinforcement learning rewards rather than post hoc evaluation criteria. The framework replaces single-policy optimization with Pareto-governed policy portfolios, supports structured DT validation prior to deployment, and establishes a cross-domain governance layer capable of coordinating heterogeneous urban subsystems under shared accountability constraints. The structured synthesis of 79 smart-city studies provides empirical grounding for this architectural design.
The evidence synthesis in Section 6 quantitatively substantiates the structural gap motivating this work. Within the reviewed corpus, 37% of studies adopted multi-objective formulations, yet only 11% explicitly embedded governance objectives within reinforcement learning reward functions. DT integration into policy-loop validation appeared in fewer than one-third of cases, while intrinsic explainability mechanisms were present in fewer than 10% of systems. Cross-domain governance-aware MORL implementations were rare. These findings indicate that contemporary smart-city AI research remains largely performance-centric.
Responsible deployment of MORL-SGF requires careful governance reward specification, high-fidelity DT modeling, human-in-the-loop policy selection, bias-aware data handling, and scalable MORL implementation strategies. These considerations highlight that governance-aware learning is both a technical and institutional challenge.
Future research should focus on standardized governance reward libraries aligned with SDG and ESG indicators, scalable and preference-based MORL architectures, causal and neuro-symbolic governance modeling, continual post-deployment governance adaptation, citizen-in-the-loop preference integration, and benchmarking within high-fidelity urban DT environments.
By embedding governance considerations directly into learning and validation processes, MORL-SGF provides a structured foundation for advancing toward accountable, auditable, and governance-aligned urban decision intelligence.