Optimizing Autonomous Vehicle Navigation Through Reinforcement Learning in Dynamic Urban Environments

Alsuwaiket, Mohammed Abdullah

doi:10.3390/wevj16080472

Department of Computer Science and Engineering Technology, University of Hafr Al Batin, Hafr Al Batin 39524, Saudi Arabia

World Electr. Veh. J. 2025, 16(8), 472; https://doi.org/10.3390/wevj16080472

Submission received: 10 July 2025 / Revised: 10 August 2025 / Accepted: 11 August 2025 / Published: 18 August 2025

(This article belongs to the Special Issue Modeling for Intelligent Vehicles)

Abstract

Autonomous vehicle (AV) navigation in dynamic urban environments faces challenges such as unpredictable traffic conditions, varying road user behaviors, and complex road networks. This study proposes a novel reinforcement learning-based framework that enhances AV decision making through spatial-temporal context awareness. The framework integrates Proximal Policy Optimization (PPO) and Graph Neural Networks (GNNs) to effectively model urban features like intersections, traffic density, and pedestrian zones. A key innovation is the urban context-aware reward mechanism (UCARM), which dynamically adapts the reward structure based on traffic rules, congestion levels, and safety considerations. Additionally, the framework incorporates a Dynamic Risk Assessment Module (DRAM), which uses Bayesian inference combined with Markov Decision Processes (MDPs) to proactively evaluate collision risks and guide safer navigation. The framework’s performance was validated across three datasets—Argoverse, nuScenes, and CARLA. Results demonstrate significant improvements: An average travel time of 420 ± 20 s, a collision rate of 3.1%, and energy consumption of 11,833 ± 550 J in Argoverse; 410 ± 20 s, 2.5%, and 11,933 ± 450 J in nuScenes; and 450 ± 25 s, 3.6%, and 13,000 ± 600 J in CARLA. The proposed method achieved an average navigation success rate of 92.5%, consistently outperforming baseline models in safety, efficiency, and adaptability. These findings indicate the framework’s robustness and practical applicability for scalable AV deployment in real-world urban traffic conditions.

Keywords:

autonomous vehicles; urban traffic optimization; reinforcement learning

1. Introduction

1.1. Background

Self-driving cars have the potential to reshape urban transportation by reducing traffic congestion, enhancing safety, and improving energy efficiency [1]. However, driving in dynamic urban environments remains a significant research challenge due to the uncertainty of traffic flows [2], road conditions [3], and interactions with surrounding elements through Vehicle-to-Everything (V2X) communication [4]. These factors demand adaptive models that can respond in real time to changing scenarios [5]. Reinforcement learning (RL) has emerged as a powerful tool to address these complexities, enabling autonomous vehicles to learn optimal strategies through exploration in uncertain environments [6]. RL has shown promise in areas such as V2X coordination and traffic flow optimization, where vehicles must make collaborative decisions in dense, interactive settings [7,8]. AV adoption is projected to grow at a compound annual growth rate of 40% through the 2030s (see Figure 1) [9,10]. Despite this growth, safety concerns remain a significant barrier.

A report by the National Highway Traffic Safety Administration states that over 90% of road accidents in the U.S. are caused by human error. Autonomous vehicles (AVs) are expected to mitigate these incidents and can reduce fuel consumption by up to 20% through optimized routing and integrated traffic flow management [11].

1.2. Research Gap and Challenges

Autonomous vehicle navigation in urban environments is challenged by unpredictable traffic, variable road conditions, and diverse road users [1]. Existing systems often fail to adapt in real time, leading to inefficient routing, increased energy use, and safety concerns [12]. While reinforcement learning (RL) offers potential, many models lack spatial-temporal awareness and overlook critical urban factors like traffic density and pedestrian behavior [13]. To address these gaps, this research proposes the Urban Reinforcement Learning Navigation Framework (URLNF), a hybrid model combining PPO and GNNs. URLNF integrates two key modules, UCARM and DRAM, for real-time adaptation and decision optimization. This framework enhances scalability, safety, and energy efficiency in complex urban scenarios by embedding context-aware rewards and proactive risk assessment. Figure 2 shows a typical architectural flow diagram of a traditional RL-controlled, AV-based urban traffic management system.

1.3. Proposed Solution and Objectives

This study aims to optimize autonomous vehicle navigation in complex urban environments by applying the Urban Reinforcement Learning Navigation Framework (URLNF). The framework minimizes travel time, improves energy efficiency, and ensures safety by learning optimal policies under uncertainty. Our approach addresses real-world constraints such as dynamic traffic, signal-based routing, energy usage, and adaptive lane priorities.

The main goal of this paper is to develop a novel reinforcement learning-based framework for optimizing autonomous vehicle navigation in dynamic urban environments, enhancing safety and efficiency in real-world traffic conditions. The objectives of this study are:

(a): To develop a hybrid DRL model (i.e., URLNF) integrating PPO with GNN-based spatial-temporal representation to effectively capture and utilize the complex interactions in dynamic urban traffic scenarios.
(b): To formulate a reward system that dynamically adapts to traffic conditions, safety constraints, and environmental factors to guide AVs towards safer and more efficient navigation.
(c): To incorporate Bayesian techniques to predict potential collision risk and enable proactive decision making for mitigating hazards in highly dynamic settings.
(d): To implement and evaluate the framework using advanced simulation tools and real-world traffic datasets, focusing on metrics like travel time, collision rate, and energy efficiency.

This paper is structured in five sections. The introduction outlines the challenges of AV navigation in urban environments, highlights the research gap, and states the objectives of the study. The literature review explores existing methods and technologies in RL-based navigation, identifying their limitations. The methodology details the design and implementation of a novel hybrid RL-based framework including UCARM and DRAM. The results and discussion present the evaluation through simulations and dataset validation, analyzing the performance improvements in efficiency, safety, and energy metrics. The conclusions summarize the findings, discuss the implications, and suggest potential future directions.

2. Literature Review

2.1. Autonomous Vehicles (AVs) in Urban Traffic Systems

The deployment of autonomous vehicles in urban traffic systems has emerged as a critical area of research, with reinforcement learning techniques taking a central role in addressing challenges related to navigation, coordination, and traffic optimization [14]. In one study [14], the author introduced a multi-objective reinforcement learning (MORL) framework designed for autonomous drone navigation in urban environments with wind zones, a methodology readily applicable to AVs. The MORL model demonstrated the ability to balance key objectives such as safety, energy efficiency, and travel time optimization. However, the author noted the computational intensity of MORL, especially for real-time applications, which is an important limitation for scaling this approach to dense urban traffic systems. In another study [15], the author applied multi-agent reinforcement learning to increase the efficiency of AV fleets in smart cities. The model relied on cooperative behavior among AVs, leading to improved traffic flow and reduced congestion. The author acknowledged scalability challenges in realizing these behaviors, especially as the number of agents increases in complex urban networks, where coordination and communication become increasingly difficult.

In another work [16], the author presented a comprehensive review of deep reinforcement learning (DRL) for path planning in AVs, focusing on its effectiveness in dynamic urban environments characterized by unpredictable traffic conditions, obstacles, and changing road layouts. The DRL-based algorithms were shown to optimize vehicle routes in real time, ensuring both safety and efficiency. The author also observed that DRL systems face significant drawbacks due to their reliance on large datasets and high computational resources, complicating their practical deployment in real-world systems. In a similar study [17], the author described adaptive speed planning for unmanned vehicles using DRL, enabling vehicles to adjust their speed dynamically based on real-time traffic conditions and safety considerations. This approach improved traffic flow efficiency, but the author identified limitations in handling extreme, unforeseen scenarios such as sudden pedestrian crossings or accidents, which are frequent in urban areas.

The literature [18] shows that integrating Distributed Reinforcement Learning with Model Predictive Control (MPC) can be used to manage groups of autonomous vehicles in urban road networks. This approach achieved synchronized vehicle movement, improving traffic flow and reducing fuel consumption. However, the system’s reliance on robust vehicle-to-infrastructure communication increased risk, as a communication failure could lead to significant disruptions and reduced safety. To address another challenge, the author of [19] developed an uncertainty-aware DRL model for crowd navigation in shared urban spaces. This framework allowed AVs to navigate effectively in environments with dynamic obstacles such as pedestrians. While the model performed well in controlled settings, its efficiency decreased in scenarios with heightened environmental uncertainty, such as dense pedestrian zones or chaotic traffic.

Current research on autonomous vehicle (AV) navigation in urban environments focuses on integrating machine learning (ML) and reinforcement learning to address challenges such as multi-objective optimization, real-time decision making, and system robustness. In [20], the author improved traffic flow using RL for connected and automated vehicles (CAVs) at intersections and found that reliable coordination in dense traffic is a limitation. The author of [21] introduced a multi-objective RL framework that improves safety for autonomous drones, but scalability to complex urban networks remained a challenge. In another study [22], ML was applied to 3D routing optimization for UAVs, with potential for AVs, but multi-agent coordination in urban settings requires further development. Deep reinforcement learning approaches, such as those for UAV obstacle avoidance and those using decision transformers for behavior prediction, have shown promise but face real-time computational limits and unpredictability challenges. Similarly, another study [23] combined improved DRL with traditional control methods for autonomous UAV visual navigation to gain adaptability, but scalability issues persisted. Another author described a robust RL framework for AV safety under uncertain conditions, but balancing robustness with efficiency in noisy urban environments remained unresolved [24]. Table 1 summarizes autonomous vehicle studies in urban traffic systems.

2.2. Reinforcement Learning for Traffic Optimization

Simultaneously, progress in decision making for advanced autonomous vehicles has been propelled by intelligent systems that can reason based on driver behavior and environmental context. The author of [25] investigated a decision-making model for autonomous cars driven by driver intelligence, integrating environmental reasoning for enhanced adaptability and realism in navigation. Their investigation revealed that integrating contextual intelligence enables autonomous systems to improve long-term decision making in urban traffic, hence augmenting safety and operational efficiency [26]. Nonetheless, the computational expense of handling substantial amounts of contextual data continues to be a constraint in real-time applications. In [27], the author presented deep attention-driven RL (DAD-RL), an innovative framework that uses attention mechanisms to concentrate on the most pertinent environmental aspects during decision making. This method demonstrated robust efficiency in dynamic, intricate contexts by enhancing decision-making precision and minimizing processing demands [28]. Nonetheless, the dependence on attention mechanisms complicates model training, especially in edge-case situations that necessitate intricate decision making in uncommon contexts.

In addition to RL-based approaches, recent research has explored fuzzy optimization and metaheuristic algorithms to address navigation and decision-making challenges in dynamic urban systems. The study by Sutikno [29] presents a comparative analysis of fuzzy logic and metaheuristic strategies for energy-aware route planning, demonstrating their robustness and adaptability in uncertain environments. The integration of such techniques with RL could offer hybrid models capable of better handling multi-objective constraints such as safety, fuel economy, and travel time in real-time navigation scenarios [30,31]. Table 2 shows the summary of reinforcement learning techniques for traffic optimization.

2.3. Urban Context-Aware Decision Making for Autonomous Vehicles

Recent improvements in urban context-aware decision making for autonomous vehicles have demonstrated significant potential in enhancing traffic flow and vehicle coordination, especially in intricate metropolitan settings with heterogeneous traffic. In [28], the author introduced an adaptive signal control and connected and automated vehicle (CAV) coordination system employing deep reinforcement learning (DRL) to regulate mixed traffic at signalized intersections. This system dynamically modifies signal timing and synchronizes CAVs to reduce congestion and enhance traffic flow. The findings indicated substantial decreases in delays and enhancements in intersection efficiency; nevertheless, the model’s reliance on precise vehicle connectivity and real-time traffic data poses an obstacle for implementation in regions with unreliable communication infrastructure. Likewise, [32] concentrated on reinforcement learning for traffic signal control, utilizing a state reduction method to streamline the state space and enhance the system’s computing efficiency. This method demonstrated superior performance compared to conventional signal controllers; nevertheless, its effectiveness in high-density traffic scenarios during peak hours was constrained by difficulties in scaling the system to accommodate bigger, more intricate networks.

Subsequent research has combined graph-based methodologies with multi-agent reinforcement learning (MARL) to enhance decision making in urban traffic situations. Xing et al. [33] presented GRL-GCN, a hybrid model that integrates reinforcement learning with graph convolutional networks (GCNs) for traffic flow production in smart cities. This method exhibited significant accuracy in simulating traffic dynamics; nevertheless, its efficiency may diminish in highly dynamic environments characterized by fast fluctuations in traffic patterns. Conversely, Ref. [34] formulated a distributed control framework utilizing reinforcement learning for collaborative intersection management, wherein each agent (vehicle or traffic signal) learns autonomously while cooperating with others to enhance the overall system efficiency. Their methodology enhances traffic efficiency; however, it remains difficult to guarantee that the distributed agents can manage conflicts and sustain performance when scaling to numerous vehicles or traffic signals.

The researcher in [35] conducted a survey on multi-agent reinforcement learning (MARL) for connected and automated cars, emphasizing the significance of communication between vehicles and infrastructure for optimal decision making. Notwithstanding the positive outcomes, the incorporation of numerous agents in extensive urban networks continues to encounter difficulties related to coordination, communication overhead, and the maintenance of stable learning among agents [36]. Research indicates that although reinforcement learning and multi-agent systems are effective for enhancing urban traffic management, their practical implementation is impeded by challenges concerning scalability, real-time data processing, and inter-agent communication in intricate environments.

Recent advancements further emphasize multi-objective optimization and robust perception in autonomous driving. For instance, the “Multi-Objective Autonomous Eco-Driving Strategy” proposes balancing efficiency and environmental impact, while recent reviews on occluded object detection highlight perception challenges in dense urban settings. A survey on RL-based highway AV control outlines techniques for stable high-speed decision making, complementing the urban focus of URLNF. Table 3 shows the comparative table of previous studies.

3. Methodology

The approach proposed in this investigation is intended to tackle the multifaceted issues associated with AV navigation in dynamic city settings by means of a new hybrid reinforcement learning approach. The framework proposes the integration of PPO with GNN to support efficient decision making and dynamic navigation. The urban environment is represented by a directed graph, where nodes are intersections, vehicles, and pedestrian areas; edges are road segments with specific characteristics, including traffic density, speed limit, and priority factors. GNNs are used for modeling the spatial-temporal dependencies in traffic patterns of the urban environment and facilitate understanding of complex interactions between different features of that environment.

For navigation, the PPO algorithm is used. Its clipped surrogate loss function limits how far each update can move the policy, which balances exploration against exploitation and keeps learning stable under tight constraints. The framework also incorporates a UCARM that makes the reward depend on traffic rules, current traffic density data, and safety concerns. This aligns the learning process with real-life urban navigation objectives such as travel time, energy optimization, and collision prevention.
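The clipped surrogate objective can be illustrated with a minimal sketch. It assumes the standard PPO formulation, in which the probability ratio between the new and old policy is clipped to the interval [1 − ε, 1 + ε]. The function name, the sample inputs, and the default ε = 0.2 are illustrative assumptions and are not taken from this study.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective (a quantity to be maximized).

    Each list holds one entry per sampled (state, action) pair.
    clip_eps is a hypothetical default, not a value reported in the paper.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
        ratio = math.exp(new_lp - old_lp)
        # Clipping removes the incentive to move the policy far from the old one.
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum gives a pessimistic (lower) bound on the improvement.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

When the new and old policies agree, every ratio equals 1 and the objective reduces to the mean advantage; a ratio of 2 with a positive advantage is capped at 1 + ε.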

In addition, a DRAM is incorporated into the framework to assess real-time collision risks using Bayesian inference and Markov Decision Processes (MDPs). This module also supports risk management by simulating the variability in pedestrian movement, traffic, and the physical environment. The entire framework is evaluated on the real-world datasets Argoverse and nuScenes as well as simulated data created in the CARLA simulator. To evaluate the proposed approach, different cases, such as dense traffic, mixed pedestrian areas, and adverse weather conditions, are considered. The findings show how the framework can improve safety, decrease energy use, and increase navigation efficiency within cities.

Figure 3 illustrates the core components of the proposed AV navigation framework. It begins with environment sensing and data preprocessing, followed by a GNN-based spatial-temporal encoder. The processed information flows into the PPO-based policy optimizer, enhanced with context-aware reward adjustment and dynamic collision risk assessment. The final output is an optimized decision for real-time vehicle control, enabling safe and efficient navigation across urban scenarios.

3.1. Dataset Description

(a): Datasets

The experiments in this study utilize three datasets: Argoverse, nuScenes, and CARLA. The Argoverse dataset contains annotated vehicle trajectory data collected from urban roads in Pittsburgh, Pennsylvania, including downtown intersections, multi-lane roads, and signal-controlled junctions. It provides high-definition semantic maps, 3D tracking of surrounding agents, and real-time traffic light data. The nuScenes dataset includes driving data from Boston and Singapore, covering mixed urban traffic, pedestrian zones, and sensor-rich vehicle logs. The CARLA dataset is a synthetic dataset generated using the CARLA simulator, offering controlled conditions across various weather scenarios, road types, and dynamic agent behaviors for evaluating autonomous navigation models.

To assess the performance of the proposed URLNF, the research used both high-quality real-world datasets and synthetic ones. The datasets were selected to cover a large range of urban navigation settings, traffic conditions, road configurations, and interactions with pedestrians. Below is a detailed description of the datasets:

(b): Argoverse: A recorded driving dataset containing over 320 h of annotated data collected in an urban environment. This dataset provides high-quality 3D tracking of vehicles and their interactions with the road infrastructure: lanes, drivable space, and the surrounding environment; all of which are important for modeling urban dynamics. Argoverse also includes map-based priors to improve decisions made by self-driving automobiles in multi-agent environments.

Figure 4 shows two intersection scenes from the Argoverse dataset that illustrate the complexity of urban traffic. The left panel shows a road intersection with multiple vehicle lanes, pedestrian crossings, automobiles, and a green traffic light. The right panel depicts another intersection with vehicles, pedestrians, and a cyclist, including a school bus and pedestrians near the crosswalk.

(c): nuScenes: Another real-world dataset containing full multi-sensor data, including LIDAR, radar, and high-resolution camera data. nuScenes provides traffic signal states, vehicle motion, and the overall environment of the scene, and it is useful for model training and testing in complex dynamic traffic environments.

Figure 5 shows the sensor modalities in the nuScenes dataset, including camera and radar data. The left panel shows six camera views of a car and its vicinity, with bounding boxes around the detected objects. The right panel shows the radar field of view (FoV), a list of radar detections, and clustered object annotations.

(d): Synthetic Simulation Data (CARLA): In addition to the real-world datasets, synthetic data was also created from the CARLA simulator. The simulation environment reflects the traffic scenario of a city, weather conditions, and pedestrian actions. This dataset is also good for testing when there is congestion, bad weather, and other conditions that may not be well captured in real-world datasets.

Figure 6 depicts a simulated urban environment in the CARLA simulator, showing different vehicles, pedestrians, and dynamic weather conditions at a multi-lane intersection. The scene demonstrates the difficulties of navigating urban environments within a safe, simulated setting.

Table 4 shows the dataset specifications. To prepare the datasets for integration into the framework, the following preprocessing steps were undertaken:

(a): Normalization and Standardization: All quantitative data, including vehicle velocities, distances, and sensor readings, were scaled so that they are comparable across datasets.
(b): Augmentation: To further increase data variability, additional scenario information such as weather and time of day was added to the CARLA data.
(c): Data Fusion: In both nuScenes and Argoverse, data from different sensors were combined to create a consolidated view of the environment, with LIDAR, radar, and camera data being used for perception.
(d): Graph Construction: Urban environments were modeled as graphs in which nodes correspond to intersections and vehicles, while edges correspond to road segments annotated with traffic density and signal indications.
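Two of the preprocessing steps above, scaling and graph construction, can be sketched in a few lines. This is only an illustration: the text does not state which scaling method was used, so min-max scaling is an assumption, and the `UrbanGraph` class, its field names, and the toy data are hypothetical.

```python
from dataclasses import dataclass, field


def min_max_scale(values):
    """Scale numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


@dataclass
class UrbanGraph:
    """Directed graph: nodes are intersections or vehicles, edges are road segments."""
    nodes: dict = field(default_factory=dict)  # node_id -> {"kind": ...}
    edges: dict = field(default_factory=dict)  # (src, dst) -> edge features

    def add_node(self, node_id, kind):
        self.nodes[node_id] = {"kind": kind}

    def add_edge(self, src, dst, traffic_density, signal_state):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("add both endpoints as nodes before adding an edge")
        self.edges[(src, dst)] = {
            "traffic_density": traffic_density,
            "signal_state": signal_state,
        }

    def successors(self, node_id):
        """Nodes reachable from node_id along one directed edge."""
        return [dst for (src, dst) in self.edges if src == node_id]


# Hypothetical toy data: three raw speeds (m/s) and a one-way road from A to B.
scaled_speeds = min_max_scale([0.0, 5.0, 10.0])
graph = UrbanGraph()
graph.add_node("A", "intersection")
graph.add_node("B", "intersection")
graph.add_edge("A", "B", traffic_density=0.4, signal_state="green")
```

Storing edge features in a dictionary keyed by `(src, dst)` keeps the direction of each road segment explicit, which matters because the system model treats the environment as a directed graph.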

The use of both real and synthetic datasets offers multiple benefits for the proposed framework. Real data capture actual traffic phenomena in an urban environment and thus help ensure that the framework can deal with real-world traffic conditions. Synthetic data enable testing of cases that are infrequent in real-world data, for example, rainy or snowy conditions or erratic pedestrian behavior. Together, they make the framework scalable and applicable to different city layouts and climatic conditions, and therefore to a wide range of navigation problems.

3.2. System Model

The proposed framework models the urban environment as a directed graph $G = (V, E)$, where:

$V$ is the set of junction points (intersections), vehicles, and places of interest.
$E$ is the set of roads between these intersections, where each road is described by features including traffic congestion, signal settings, and speed profile.

Parameter Definitions

Every AV interacts with the shared urban environment through a distinct decision-making process. At any given time $t$, the state of each AV is denoted by $s_{t} \in S$, while its corresponding action is represented as $a_{t} \in A$. The evolution of the system is modeled as a generalized Markov Decision Process (MDP) defined by the four elements $(S, A, P, R)$, where:

$S$ (State Space): Each state $s_{t}$ captures both the internal status of the AV and its environmental context. This includes the vehicle’s current position, speed, heading angle, and acceleration, along with external observations such as nearby object positions, traffic light status, lane information, and local traffic density. This provides a comprehensive understanding of the AV’s dynamic surroundings.
$A$ (Action Space): The action $a_{t}$ refers to the AV’s control commands at time $t$, which consist of throttle (acceleration), brake level, and steering angle. These control signals are generated by the policy model in response to the observed state $s_{t}$.
$P$ (Transition Probability): The probabilistic relationship between the current state $s_{t}$ and the next state $s_{t + 1}$ is governed by the selected action $a_{t}$. This captures the system’s dynamics and how the vehicle’s control choices affect its movement and interactions.
$R$ (Reward Function): The reward function $R(s_{t}, a_{t})$ quantifies the effectiveness of each action in a given state. It evaluates trade-offs between journey time, safety (e.g., collision risk), and energy consumption, thereby guiding the learning algorithm toward optimal and context-aware navigation behavior.

This formulation ensures that each AV can learn from its environment in a way that balances efficiency, safety, and resource use, leading to robust policy optimization across varied urban driving scenarios.

The framework integrates PPO and GNN to achieve the following objectives:

Problem Formulation

Let the urban environment be represented by a graph $G = (V, E)$, where $V$ is the set of intersections and vehicles, and $E$ is the set of roads connecting them. The state of the vehicle at time $t$ is $s_{t} \in S$, the action taken is $a_{t} \in A$, and the transition probability is $P(s_{t + 1} \mid s_{t}, a_{t})$. The reward function $R(s_{t}, a_{t})$ reflects the trade-offs between travel efficiency, safety, and energy consumption. The problem is modeled as an MDP, aiming to maximize the expected cumulative reward:

$J(\pi) = \mathbb{E}\left[\sum_{t = 0}^{T} \gamma^{t} R(s_{t}, a_{t})\right]$

(1)

where $\pi$ is the policy, $T$ is the episode duration, and $\gamma \in (0, 1]$ is the discount factor.
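As a small numerical illustration of Equation (1), the sketch below computes the discounted return of a single episode and averages it over several episodes as a Monte Carlo estimate of the expectation. The function names and the default discount factor are assumptions made for illustration; the reward values would come from $R(s_{t}, a_{t})$ in the actual framework.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over one episode, for gamma in (0, 1]."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    total = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_objective(episodes, gamma=0.99):
    """Average discounted return over episodes, a sample estimate of J(pi)."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)
```

For example, three unit rewards with a discount factor of 0.5 give 1 + 0.5 + 0.25 = 1.75.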

Signal-Adaptive Route Planning for Autonomous Vehicles:

$t_{\text{signal}}^{\text{wait}}(s_{t}) \leq T_{\text{max}}, \quad \forall s_{t} \in S_{\text{signals}}$

(2)

where

$t_{\text{signal}}^{\text{wait}}(s_{t})$ is the waiting time at a signal state $s_{t}$.
$T_{\text{max}}$ is the maximum allowable delay.
$s_{t} \in S_{\text{signals}}$ indicates that the constraint only applies when the AV is at a signal state.

Obstacle Avoidance:

$\| p_{t} - p_{\text{obs}}(t) \| \geq d_{\text{min}}(v_{t}), \quad \forall t, \ \forall \text{ obstacles } p_{\text{obs}}(t)$

(3)

where $d_{\text{min}}(v_{t})$ is the speed-dependent safe distance.

Energy Efficiency Constraints:

$E_{\text{usage}}(t) \leq \alpha \cdot E_{\text{max}}$

(4)

where

$E_{\text{usage}}(t)$ is the energy consumed at time $t$.
$E_{\text{max}}$ is the vehicle’s maximum energy budget.
$\alpha \in (0, 1)$ is the scaling coefficient that adjusts the energy budget threshold depending on real-time conditions (e.g., urgency, route complexity).

Road Priority Constraints:

$\sum_{i \in P} \delta_{i} \alpha_{i}(t) \leq P_{\text{max}}$

(5)

where $\delta_{i}$ represents the priority weight of road segment $i$, and $P_{\text{max}}$ is the maximum allowed priority value.
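Constraints (2) to (5) can be checked together for a single time step, as in the hedged sketch below. Several parts are assumptions: the linear form of the safe distance $d_{\text{min}}(v_{t})$, the reading of $\alpha_{i}(t)$ as a per-segment usage value (the text does not define it), and all function and parameter names.

```python
import math


def safe_distance(speed, reaction_time=1.0, standstill_gap=2.0):
    """Hypothetical d_min(v): a standstill gap plus the distance covered during a reaction time."""
    return standstill_gap + reaction_time * speed


def check_constraints(signal_wait, t_max, ego_pos, obstacle_positions, speed,
                      energy_used, alpha, e_max,
                      priority_weights, segment_usage, p_max):
    """Return a dict mapping each constraint name to True (satisfied) or False."""
    d_min = safe_distance(speed)
    weighted_priority = sum(d * a for d, a in zip(priority_weights, segment_usage))
    return {
        # Eq. (2): waiting time at a signal must not exceed the allowed delay.
        "signal_wait": signal_wait <= t_max,
        # Eq. (3): every obstacle must be at least d_min(v_t) away.
        "obstacle": all(math.dist(ego_pos, p) >= d_min for p in obstacle_positions),
        # Eq. (4): energy use must stay within the scaled budget.
        "energy": energy_used <= alpha * e_max,
        # Eq. (5): weighted road priority must stay within the cap.
        "priority": weighted_priority <= p_max,
    }
```

Returning one flag per constraint, rather than a single Boolean, makes it easy to see which requirement a candidate action violates.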

3.2.1. Collision Avoidance Action Space Specification

The autonomous vehicle employs a comprehensive collision avoidance action repertoire consisting of four fundamental maneuver categories: lateral avoidance, longitudinal control, combined maneuvers, and emergency responses. The lateral avoidance actions include steering-based swerving with angular velocities $\omega \in [-0.5, 0.5]$ rad/s for obstacle circumnavigation, lane change maneuvers executed through sigmoid-shaped trajectory planning with lateral accelerations $a_{\text{lat}} \leq 2.5\ \text{m/s}^{2}$, and evasive turning with maximum steering angles $\delta_{\text{max}} = \pm 25^{\circ}$ for immediate threat response. Longitudinal control encompasses adaptive braking with deceleration rates $d \in [0.5, 8.0]\ \text{m/s}^{2}$ based on time-to-collision calculations, speed reduction protocols maintaining minimum safe velocities $v_{\text{min}} = 5\ \text{km/h}$ in urban scenarios, and emergency braking achieving maximum deceleration $d_{\text{emergency}} = 9.5\ \text{m/s}^{2}$ when collision probability $P(C_{t}) > 0.8$. Combined maneuvers integrate simultaneous steering and braking through coordinated control algorithms that optimize the trade-off between lateral stability and stopping distance, while emergency responses include complete vehicle stops, hazard signal activation, and V2X emergency broadcasts to surrounding vehicles within a 200 m radius.
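The numerical limits in this subsection can be enforced with a simple clamping step, sketched below. The limits themselves (±25° steering, ±0.5 rad/s angular velocity, 0.5 to 8.0 m/s² braking, 9.5 m/s² emergency braking above a collision probability of 0.8) come from the text; the `AvoidanceCommand` type, the function names, and the choice to keep steering active during emergency braking are illustrative assumptions.

```python
from dataclasses import dataclass

MAX_STEER_DEG = 25.0             # maximum steering angle, degrees
MAX_YAW_RATE = 0.5               # angular velocity bound, rad/s
MIN_DECEL, MAX_DECEL = 0.5, 8.0  # adaptive braking range, m/s^2
EMERGENCY_DECEL = 9.5            # emergency braking, m/s^2
EMERGENCY_PROB = 0.8             # collision probability threshold


def clamp(value, lo, hi):
    return max(lo, min(hi, value))


@dataclass
class AvoidanceCommand:
    steering_deg: float
    yaw_rate: float
    decel: float
    emergency: bool


def select_avoidance_command(steering_deg, yaw_rate, decel, collision_prob):
    """Clamp a requested maneuver to the stated limits; trigger emergency braking if needed."""
    steering = clamp(steering_deg, -MAX_STEER_DEG, MAX_STEER_DEG)
    yaw = clamp(yaw_rate, -MAX_YAW_RATE, MAX_YAW_RATE)
    if collision_prob > EMERGENCY_PROB:
        # Assumption: steering stays available while braking at the emergency rate.
        return AvoidanceCommand(steering, yaw, EMERGENCY_DECEL, True)
    # No braking requested means zero deceleration; otherwise stay within the adaptive range.
    braking = 0.0 if decel <= 0.0 else clamp(decel, MIN_DECEL, MAX_DECEL)
    return AvoidanceCommand(steering, yaw, braking, False)
```

A request of 40° steering and 10 m/s² braking at low risk would be reduced to 25° and 8.0 m/s², while any request made at a collision probability above 0.8 is overridden with 9.5 m/s² braking.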

3.2.2. Multi-Agent Environment Architecture

The framework operates within a comprehensive multi-agent environment where pedestrians, cyclists, and other vehicles are controlled by independent learning agents rather than scripted behaviors. Each pedestrian agent $\pi_{\text{ped}}$ implements a social force model with collision avoidance preferences, goal-seeking behavior toward crosswalks or destinations, and dynamic response to vehicle proximity using a safety radius $r_{\text{safety}} = 2.0$ m. Vehicle agents $\pi_{\text{vehicle}}$ follow lane-keeping protocols, adaptive cruise control with time headway $\tau = 1.5$ s, and cooperative lane-changing behaviors that communicate intentions through V2V messaging protocols. The interaction modeling employs a hierarchical game-theoretic framework where each agent optimizes its individual utility function $U_{i}(s_{t}, a_{t}, a_{-i})$, considering both personal objectives and predicted actions of neighboring agents $a_{-i}$. Inter-agent communication occurs through a shared observation space $\Omega_{\text{shared}}$ containing relative positions, velocities, and intended trajectories of all agents within a 50 m perception radius, enabling proactive collision avoidance through intention prediction and cooperative path planning.

3.2.3. Travel Time Optimization

Minimizing travel time T is a primary objective for autonomous vehicle navigation in dynamic urban environments. The framework employs a hierarchical approach to travel time optimization, incorporating both baseline travel metrics and context-aware adjustments that reflect real-world navigation constraints.

Baseline Travel Time Formulation

The total travel time for a vehicle navigating from source to destination is calculated as:

$T_{total} = \sum_{k=1}^{N} \left[\frac{d_{k}}{v_{k}} + w_{k}\right]$

(6)

where
N: Total number of discrete time steps in the journey.
d_k: Distance traveled during time step k (meters).
v_k: Instantaneous vehicle speed at time step k (m/s).
w_k: Waiting time at time step k due to traffic signals, congestion, or obstacles (seconds).

Equation (6) represents the fundamental travel time calculation that captures both kinematic movement time (d_k/v_k) and stationary delays (w_k). The distance d_k is computed from the vehicle dynamics model as d_k = v_k × Δt, where Δt is the time step duration. The speed v_k is constrained by the vehicle’s physical capabilities (0 ≤ v_k ≤ v_max) and safety requirements based on local traffic conditions.
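As a minimal illustrative sketch (not the authors' implementation), Equation (6) can be evaluated directly from per-step distances, speeds, and waiting times. The function name, the example values, and the guard that skips the movement term for a stationary step are assumptions introduced here.

```python
from typing import Sequence


def total_travel_time(distances_m: Sequence[float],
                      speeds_mps: Sequence[float],
                      waits_s: Sequence[float],
                      min_speed_mps: float = 1e-3) -> float:
    """Equation (6): sum over time steps of d_k / v_k + w_k, in seconds."""
    if not (len(distances_m) == len(speeds_mps) == len(waits_s)):
        raise ValueError("all per-step sequences must have the same length")
    total = 0.0
    for d_k, v_k, w_k in zip(distances_m, speeds_mps, waits_s):
        # Assumption: a (near-)stationary step has no movement term,
        # so its delay is carried entirely by w_k.
        movement = d_k / v_k if v_k > min_speed_mps else 0.0
        total += movement + w_k
    return total


# Two moving steps of 1 s each, followed by a 3 s stop at a signal.
example_total = total_travel_time([10.0, 5.0, 0.0], [10.0, 5.0, 0.0], [0.0, 0.0, 3.0])
```

In this example the two moving steps contribute 1 s each and the stop contributes 3 s of waiting time.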

Context-Aware Effective Travel Time

To incorporate urban context awareness and dynamic traffic considerations, the effective travel time T_eff is formulated as:

T_{eff} = \sum_{k=1}^{N} \left[\left(\frac{d_{k}}{v_{k}}\right) \times p_{k} + w_{k} \times c_{k}\right]

(7)

where

$p_{k}$ : Priority weight factor for road segment at time step k [dimensionless, range: 0.1–1.0];
$c_{k}$ : Congestion multiplier for waiting time at time step k [dimensionless, range: 1.0–3.0].

The priority weight p_k dynamically adapts based on multiple urban factors:

p_{k} = w_{traffic} \times ρ_{k} + w_{signal} \times s_{k} + w_{safety} \times f_{k} + w_{energy} \times e_{k}

(8)

where

$ρ_{k}$ : Normalized traffic density factor [0.1–1.0].
$s_{k}$ : Signal coordination efficiency factor [0.1–1.0].
$f_{k}$ : Safety assessment factor [0.1–1.0].
$e_{k}$ : Energy efficiency factor [0.1–1.0].

$w_{traffic}, w_{signal}, w_{safety}, w_{energy}$ : Weighting coefficients [sum = 1.0].

(9)

The congestion multiplier $c_{k}$ adjusts waiting time penalties based on traffic conditions:

c_{k} = 1.0 + α \times (ρ_{k} - ρ_{nominal}) + β \times \mathrm{delay\_forecast}_{k}

(10)

where

α = 2.0: Traffic density sensitivity parameter.
β = 0.5: Delay forecast sensitivity parameter.
$ρ_{nominal} = 0.3$ : Baseline traffic density threshold.
$\mathrm{delay\_forecast}_{k}$ : Predicted additional delay based on traffic patterns [seconds].
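The following sketch shows one way Equations (7), (8), and (10) could fit together. It is illustrative only: the `Step` container, the equal weighting coefficients, and the clamping of $c_{k}$ to the stated range 1.0–3.0 are assumptions made here, since the section gives the range but not how it is enforced.

```python
from typing import NamedTuple, Sequence, Tuple


class Step(NamedTuple):
    d: float               # distance d_k [m]
    v: float               # speed v_k [m/s], assumed > 0
    w: float               # waiting time w_k [s]
    rho: float             # normalized traffic density factor
    s: float               # signal coordination efficiency factor
    f: float               # safety assessment factor
    e: float               # energy efficiency factor
    delay_forecast: float  # predicted additional delay [s]


def priority_weight(step: Step,
                    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Equation (8). Equal weights are a placeholder; they must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weighting coefficients must sum to 1.0")
    w_traffic, w_signal, w_safety, w_energy = weights
    return (w_traffic * step.rho + w_signal * step.s
            + w_safety * step.f + w_energy * step.e)


def congestion_multiplier(step: Step, alpha: float = 2.0, beta: float = 0.5,
                          rho_nominal: float = 0.3) -> float:
    """Equation (10), clamped to [1.0, 3.0] (the clamp is an assumption)."""
    c_k = 1.0 + alpha * (step.rho - rho_nominal) + beta * step.delay_forecast
    return min(3.0, max(1.0, c_k))


def effective_travel_time(steps: Sequence[Step]) -> float:
    """Equation (7): priority-weighted movement time plus congestion-weighted waiting."""
    return sum((st.d / st.v) * priority_weight(st) + st.w * congestion_multiplier(st)
               for st in steps)


demo_steps = [Step(d=10.0, v=10.0, w=2.0, rho=0.8, s=1.0, f=1.0, e=1.0, delay_forecast=0.0)]
demo_t_eff = effective_travel_time(demo_steps)
```

For the single demo step, $p_{k} = 0.95$ and $c_{k} = 2.0$, so 1 s of movement and 2 s of waiting yield an effective time of about 4.95 s.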

Vehicle Dynamics’ Integration in Travel Time

The travel time optimization incorporates vehicle dynamics constraints to ensure realistic performance estimates:

\text{Speed profile optimization: } v_{k}^{optimal} = \min (v_{speed\_limit}, v_{traffic\_flow}, v_{dynamics\_max}(a_{k-1}, δ_{k-1}))

(11)

where $v_{dynamics\_max}$ represents the maximum achievable speed given the previous acceleration $a_{k-1}$ and steering angle $δ_{k-1}$, computed from the vehicle dynamics model.

\text{Acceleration-constrained distance: } d_{k} = v_{k-1} \times Δt + 0.5 \times a_{k} \times Δt^{2}

(12)

\text{subject to: } |a_{k}| \leq a_{max} \text{ and } \frac{|a_{k} - a_{k-1}|}{Δt} \leq j_{max}

where $a_{max} = 3.0$ m/s² (maximum acceleration) and $j_{max} = 2.0$ m/s³ (maximum jerk).
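To make the constraints on Equation (12) concrete, the sketch below clamps a commanded acceleration to the jerk and acceleration limits before computing the step distance. The function name, the 0.1 s step (taken from the 10 Hz rate mentioned for the planning horizon), and the speed update $v_{k} = v_{k-1} + a_{k} Δt$ are assumptions for illustration, and the previous acceleration is assumed to be feasible already.

```python
from typing import Tuple


def constrained_step(v_prev: float, a_prev: float, a_cmd: float,
                     dt: float = 0.1, a_max: float = 3.0,
                     j_max: float = 2.0) -> Tuple[float, float, float]:
    """Return (d_k, v_k, a_k) for one time step under the limits of Equation (12)."""
    # Jerk limit: |a_k - a_{k-1}| / dt <= j_max.
    max_delta = j_max * dt
    a_k = min(a_prev + max_delta, max(a_prev - max_delta, a_cmd))
    # Acceleration limit: |a_k| <= a_max. With a feasible a_prev this clamp
    # only moves a_k toward a_prev, so the jerk limit still holds.
    a_k = min(a_max, max(-a_max, a_k))
    # Equation (12): distance covered during the step.
    d_k = v_prev * dt + 0.5 * a_k * dt ** 2
    # Assumed kinematic speed update; speed is kept non-negative.
    v_k = max(0.0, v_prev + a_k * dt)
    return d_k, v_k, a_k


# A request for 5 m/s^2 from rest-acceleration is limited by jerk to 0.2 m/s^2.
demo_d, demo_v, demo_a = constrained_step(v_prev=10.0, a_prev=0.0, a_cmd=5.0)
```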

Multi-Objective Travel Time Optimization

The framework optimizes travel time while balancing safety and energy efficiency through a weighted objective function:

T_{optimized} = λ_{time} \times T_{eff} + λ_{safety} \times T_{safety\_penalty} + λ_{energy} \times T_{energy\_penalty}

(13)

where

Safety Time Penalty:

T_{safety\_penalty} = \sum_{k=1}^{N} \left[P(C_{k} \mid s_{k}, u_{k}) \times t_{safety\_margin}\right]

(14)

Energy Time Penalty:

T_{energy\_penalty} = \sum_{k=1}^{N} \left[\frac{E_{k}}{E_{nominal}} \times t_{energy\_factor}\right]

(15)

with

$λ_{time} = 0.6$, $λ_{safety} = 0.3$, $λ_{energy} = 0.1$ : Optimization weights.
$t_{safety\_margin} = 5.0$ s: Time penalty per unit collision probability.
$t_{energy\_factor} = 2.0$ s: Time penalty per unit energy consumption.
$E_{nominal}$ : Baseline energy consumption rate.
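Equations (13) to (15) combine into a single scalar, which the following sketch evaluates with the weights listed above. The function names and the example inputs are hypothetical; only the constants come from the text.

```python
from typing import Sequence, Tuple


def safety_time_penalty(collision_probs: Sequence[float],
                        t_safety_margin: float = 5.0) -> float:
    """Equation (14): per-step collision probability times the safety margin [s]."""
    return sum(p * t_safety_margin for p in collision_probs)


def energy_time_penalty(energies: Sequence[float], e_nominal: float,
                        t_energy_factor: float = 2.0) -> float:
    """Equation (15): per-step energy relative to nominal times the energy factor [s]."""
    return sum((e_k / e_nominal) * t_energy_factor for e_k in energies)


def optimized_travel_time(t_eff: float, collision_probs: Sequence[float],
                          energies: Sequence[float], e_nominal: float,
                          lambdas: Tuple[float, float, float] = (0.6, 0.3, 0.1)) -> float:
    """Equation (13): weighted sum of effective time and the two penalties."""
    lam_time, lam_safety, lam_energy = lambdas
    return (lam_time * t_eff
            + lam_safety * safety_time_penalty(collision_probs)
            + lam_energy * energy_time_penalty(energies, e_nominal))


# Hypothetical two-step trip: 100 s effective time, 10% risk and nominal energy per step.
demo_t_opt = optimized_travel_time(100.0, [0.1, 0.1], [50.0, 50.0], e_nominal=50.0)
```

Here the safety penalty is 1.0 s and the energy penalty is 4.0 s, so the objective is 0.6 × 100 + 0.3 × 1.0 + 0.1 × 4.0 = 60.7 s.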

Adaptive Time Horizon Planning

The framework employs adaptive time horizon planning that adjusts the optimization window based on scenario complexity:

N_{adaptive} = N_{base} + Δ_{complexity} + Δ_{uncertainty}

(16)

where

N_base = 50: Baseline time steps (5 s at 10 Hz).
Δ_complexity: Additional steps based on traffic density and intersection count.
Δ_uncertainty: Additional steps based on weather and pedestrian activity.

$Δ_{complexity} = \lceil 10 \times (ρ_{avg} + 0.5 \times \mathrm{intersection\_density}) \rceil$

(17)

$Δ_{uncertainty} = \lceil 5 \times (\mathrm{weather\_factor} + \mathrm{pedestrian\_density}) \rceil$

(18)
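Equations (16) to (18) can be computed in a few lines, as sketched below. The function name is hypothetical, and the inputs are assumed to be normalized to the range 0 to 1, which the text does not state explicitly.

```python
import math


def adaptive_horizon(rho_avg: float, intersection_density: float,
                     weather_factor: float, pedestrian_density: float,
                     n_base: int = 50) -> int:
    """Equations (16)-(18): planning horizon in time steps."""
    delta_complexity = math.ceil(10 * (rho_avg + 0.5 * intersection_density))
    delta_uncertainty = math.ceil(5 * (weather_factor + pedestrian_density))
    return n_base + delta_complexity + delta_uncertainty


# Moderate traffic and intersections, light rain, some pedestrians.
demo_horizon = adaptive_horizon(rho_avg=0.5, intersection_density=0.5,
                                weather_factor=0.25, pedestrian_density=0.25)
```

With these inputs, the complexity term is ⌈7.5⌉ = 8 and the uncertainty term is ⌈2.5⌉ = 3, giving 61 steps (6.1 s at 10 Hz).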

Real-Time Travel Time Updates

The travel time estimation is continuously updated using a recursive formulation:

T_{eff}^{updated} = α_{update} \times T_{eff}^{current} + (1 - α_{update}) \times T_{eff}^{predicted}

(19)

where

α_update = 0.7: Update rate parameter balancing stability and responsiveness.
$T_{eff}^{current}$ : Current measured travel time performance.
$T_{eff}^{predicted}$ : Predicted travel time from the optimization model.
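Equation (19) is a convex combination of the measured and predicted values, which the short sketch below makes explicit. The function name and sample values are hypothetical.

```python
def update_travel_time(t_current: float, t_predicted: float,
                       alpha_update: float = 0.7) -> float:
    """Equation (19): blend measured and predicted effective travel time."""
    if not 0.0 <= alpha_update <= 1.0:
        raise ValueError("alpha_update must lie in [0, 1]")
    return alpha_update * t_current + (1.0 - alpha_update) * t_predicted


# Measured 100 s, model predicts 200 s: the estimate moves 30% toward the prediction.
demo_updated = update_travel_time(100.0, 200.0)
```

With $α_{update} = 0.7$, the measured value dominates, so the updated estimate (about 130 s) changes smoothly even when the prediction jumps.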

Performance Bounds and Validation

The travel time optimization ensures realistic bounds through constraint validation:

Physical constraints:

\text{Minimum time: } T_{min} = \frac{distance_{total}}{v_{max}} \text{ (theoretical minimum)}

(20)

\text{Maximum time: } T_{max} = T_{min} \times (1 + \mathrm{safety\_factor} + \mathrm{congestion\_factor})

\text{Constraint: } T_{min} \leq T_{eff} \leq T_{max}

Validation metrics:

Prediction accuracy:

\frac{|T_{actual} - T_{predicted}|}{T_{actual}} \leq 0.15

Consistency check: Smooth temporal variation in travel time estimates.

Feasibility verification: All intermediate speeds and accelerations within vehicle limits.
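The bound and accuracy checks above translate into simple predicates. The sketch below is one possible arrangement; the function names and the example route (1000 m at a 20 m/s maximum speed) are assumptions used only for illustration.

```python
from typing import Tuple


def travel_time_bounds(distance_total_m: float, v_max_mps: float,
                       safety_factor: float, congestion_factor: float) -> Tuple[float, float]:
    """Return (T_min, T_max) from the physical constraints."""
    t_min = distance_total_m / v_max_mps
    t_max = t_min * (1.0 + safety_factor + congestion_factor)
    return t_min, t_max


def within_bounds(t_eff: float, t_min: float, t_max: float) -> bool:
    """Constraint T_min <= T_eff <= T_max."""
    return t_min <= t_eff <= t_max


def prediction_accurate(t_actual: float, t_predicted: float,
                        tolerance: float = 0.15) -> bool:
    """Relative prediction error must not exceed the 15% tolerance."""
    return abs(t_actual - t_predicted) / t_actual <= tolerance


demo_t_min, demo_t_max = travel_time_bounds(1000.0, 20.0,
                                            safety_factor=0.5, congestion_factor=1.0)
```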

This comprehensive travel time optimization framework ensures that autonomous vehicles can efficiently navigate urban environments while respecting physical constraints, safety requirements, and energy efficiency objectives. The mathematical formulation provides a robust foundation for real-time decision making that balances multiple competing objectives in dynamic traffic conditions.

Figure 7 illustrates the cumulative total travel time ($T$) and the effective travel time ($T_{eff}$) of an autonomous vehicle on a given road segment. The effective travel time includes the dynamic priority values assigned to roads ($p_{t}$) due to congestion or signal control.

The red curve shows the raw cumulative travel time without priority weighting. The Max Delay = 106.0 s corresponds to the largest performance gap between actual travel under congestion and signal delays and the baseline path, reflecting the absolute time lost due to congestion and traffic signals. The blue curve represents the effective travel time after applying dynamic road priorities and signal weighting. The algorithm discounts delays that occur on less critical routes and emphasizes those on prioritized ones.

Ensuring safety is critical in urban navigation, particularly in dynamic environments with unpredictable traffic patterns and pedestrian behaviors. The DRAM employs a comprehensive probabilistic framework to evaluate collision risks and guide safe navigation decisions through Bayesian inference combined with vehicle dynamics constraints.

Bayesian Collision Probability Assessment

The DRAM evaluates the probability of collision P(Ct|st, ut) using Bayesian inference:

P(C_{t} \mid s_{t}, u_{t}) = \frac{P(s_{t} \mid C_{t}, u_{t}) \times P(C_{t} \mid u_{t})}{P(s_{t} \mid u_{t})}

(21)

where

P(Ct|st, ut): Posterior probability of collision at time t given state st and action ut.
P(st|Ct, ut): Likelihood of observing state st under a collision scenario given action ut.
P(Ct|ut): Prior probability of collision conditioned on action ut.
P(st|ut): Evidence probability of state st given action ut.

The posterior collision probability P(Ct|st, ut) explicitly depends on the agent’s chosen action ut, reflecting how each control decision directly affects the likelihood of an unsafe outcome. This action-dependent formulation enables the framework to evaluate the safety implications of different control strategies in real time.
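A small numerical sketch of Equation (21) follows. The text does not say how the evidence term is obtained, so it is expanded here with the law of total probability over the collision and no-collision cases; that expansion, the function names, and the example likelihoods are assumptions. The 0.8 threshold is the emergency braking trigger stated earlier for $P(C_{t})$.

```python
def collision_posterior(lik_collision: float, lik_no_collision: float,
                        prior_collision: float) -> float:
    """Posterior P(C_t | s_t, u_t) via Bayes' rule.

    lik_collision    = P(s_t | C_t, u_t)
    lik_no_collision = P(s_t | not C_t, u_t)   (needed only for the evidence)
    prior_collision  = P(C_t | u_t)
    """
    numerator = lik_collision * prior_collision
    # Assumed evidence expansion: P(s|u) = P(s|C,u)P(C|u) + P(s|~C,u)(1 - P(C|u)).
    evidence = numerator + lik_no_collision * (1.0 - prior_collision)
    if evidence == 0.0:
        return 0.0  # the observed state is impossible under both hypotheses
    return numerator / evidence


def needs_emergency_braking(posterior: float, threshold: float = 0.8) -> bool:
    """Emergency braking is triggered when the collision probability exceeds 0.8."""
    return posterior > threshold


# A state nine times more likely under collision, with a 10% prior for this action.
demo_posterior = collision_posterior(0.9, 0.1, 0.1)
```

In this example the strong likelihood ratio lifts a 10% prior to a posterior of 0.5, which is still below the emergency threshold.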

Figure 8 illustrates the posterior probability of collision $P(C_{t} \mid s_{t}, u_{t})$ for an autonomous vehicle navigating through dynamic urban environments. The probabilities are calculated using Bayesian inference, incorporating prior probabilities, state likelihoods, and evidence probabilities over 100 time steps.

3.2.4. Energy Efficiency

Energy consumption $E$ is modeled as a function of vehicle speed $v_{t}$ and acceleration $a_{t}$:

E = \int_{t=0}^{T} P(v_{t}, a_{t}) \, dt, \quad P(v_{t}, a_{t}) = α v_{t}^{2} + β v_{t} a_{t} + γ

(22)

where

$α, β, γ$ : Coefficients representing energy consumption characteristics.
$a_{t}$ : Vehicle acceleration at time t.

To ensure efficient energy usage, a constraint is imposed:

$E \leq E_{max}$, where $E_{max}$ is the maximum allowable energy consumption.

Figure 9 represents the cumulative energy consumption ($E$) of an autonomous vehicle as a function of time. The plot also includes a horizontal red dashed line indicating the maximum allowable energy consumption ($E_{max}$) constraint.
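The integral in Equation (22) can be approximated numerically on a sampled trajectory. The sketch below uses a simple rectangle rule; the coefficient values, the sampling interval, and the function names are hypothetical, since the section does not report them.

```python
from typing import List, Sequence


def power(v_t: float, a_t: float, alpha: float, beta: float, gamma: float) -> float:
    """Instantaneous power P(v_t, a_t) = alpha*v^2 + beta*v*a + gamma."""
    return alpha * v_t ** 2 + beta * v_t * a_t + gamma


def cumulative_energy(speeds: Sequence[float], accels: Sequence[float], dt: float,
                      alpha: float, beta: float, gamma: float) -> List[float]:
    """Rectangle-rule approximation of the energy integral; returns the running total."""
    total = 0.0
    trace = []
    for v_t, a_t in zip(speeds, accels):
        total += power(v_t, a_t, alpha, beta, gamma) * dt
        trace.append(total)
    return trace


def within_energy_budget(trace: Sequence[float], e_max: float) -> bool:
    """Constraint E <= E_max on the final cumulative energy."""
    return (trace[-1] if trace else 0.0) <= e_max


# Hypothetical coefficients: constant 10 m/s cruise sampled once per second.
demo_trace = cumulative_energy([10.0, 10.0], [0.0, 0.0], dt=1.0,
                               alpha=0.5, beta=1.0, gamma=10.0)
```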

3.2.5. Urban Context-Aware Reward Mechanism

The reward function $R(s_{t}, a_{t})$ dynamically adapts based on the urban context, balancing travel time, safety, and energy efficiency:

R(s_{t}, a_{t}) = λ_{1} \cdot T_{eff}^{-1} + λ_{2} \cdot (1 - P(C_{t} \mid s_{t}, a_{t})) - λ_{3} \cdot E_{t}

(23)

where

$R (s_{t}, a_{t})$ —Reward at time step $t$ given state $s_{t}$ and action $a_{t}$ .
$T_{e f f}^{- 1}$ —Inverse effective travel time (shorter time = higher reward).
$P (C_{t} ∣ s_{t}, a_{t})$ —Posterior collision probability from DRAM.
$E_{t}$ —Energy consumption at time $t$ .
$λ_{1}, λ_{2}, λ_{3}$ —Weight coefficients (see Table 5).
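Reading the term definitions above as a bonus for inverse effective travel time, a bonus for the safety term $1 - P(C_{t} \mid s_{t}, a_{t})$, and a penalty for energy, the reward can be sketched as follows. The weight values are placeholders chosen so that the safety weight is largest; the actual values are given in Table 5, which is not reproduced here, and the inputs are assumed to be scaled to comparable ranges.

```python
from typing import Tuple


def reward(t_eff: float, p_collision: float, energy: float,
           lambdas: Tuple[float, float, float] = (0.3, 0.5, 0.2)) -> float:
    """Reward = lam1 / T_eff + lam2 * (1 - P_collision) - lam3 * energy.

    The default weights are hypothetical placeholders, not the paper's values.
    """
    if t_eff <= 0.0:
        raise ValueError("effective travel time must be positive")
    lam1, lam2, lam3 = lambdas
    return lam1 / t_eff + lam2 * (1.0 - p_collision) - lam3 * energy


# Same trip time and energy, different collision risk.
demo_safe = reward(t_eff=100.0, p_collision=0.2, energy=0.5)
demo_risky = reward(t_eff=100.0, p_collision=0.9, energy=0.5)
```

With identical travel time and energy, the lower-risk action receives the higher reward, which is the behavior the safety term is meant to produce.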

To ensure consistency across the manuscript, the reward functions used in Equations (23) and (27) have been aligned to share the same structure. The unified reward function is defined as a weighted combination of three key factors: effective travel time, collision probability, and energy consumption. Specifically, the reward at each time step $R_{t}$ is computed as the negative sum of these metrics: $R_{t} = -(λ_{1} \cdot T_{eff} + λ_{2} \cdot P_{collision} + λ_{3} \cdot E_{t})$, where $T_{eff}$ is the effective travel time, $P_{collision}$ is the estimated probability of collision, and $E_{t}$ is the energy used.

To reflect safety as the foremost concern, the reward function assigns the highest weight $λ_{2}$ to the collision risk component. This ensures that policies favor safe actions even at the cost of slightly longer travel times or higher energy consumption, in alignment with real-world AV design priorities.

The weights are adjusted in real time based on traffic conditions, road priorities, and safety requirements.

Figure 10 shows the evolution of the reward function $R(s_{t}, u_{t})$ over 100 time steps, dynamically balancing the trade-offs between travel time, safety, and energy efficiency. The reward function is calculated based on real-time adjustments to account for urban navigation constraints. The formulation prioritizes safety by rewarding lower collision probabilities, $1 - P(C_{t} \mid s_{t}, a_{t})$, while inversely penalizing effective travel time and energy consumption.

This detailed system model is intended to provide a realistic representation of the urban environment so that the navigation strategies reflect safety, efficiency, and sustainability. Combining reinforcement learning with graph-based modeling allows the results to scale and generalize across various urban environments.

3.2.6. Perception and Object Modeling

In the proposed framework, pedestrians and other road users are modeled as dynamic obstacles with motion patterns that evolve over time. The behavior of such entities is predicted using a Kalman filter-based tracking approach, which continuously updates their estimated position and velocity based on sequential sensor readings.

The perception system employs a sensor fusion mechanism that combines data from LiDAR, RGB cameras, and radar. LiDAR provides accurate depth and spatial positioning, while cameras contribute semantic understanding such as object classification (e.g., pedestrian, vehicle, cyclist), and radar enhances velocity estimation under adverse conditions like fog or rain. This fusion ensures robustness in dynamic urban scenarios.

To estimate collision risk, a probabilistic collision model is implemented. It calculates the likelihood of collision based on the predicted trajectory overlap between the ego vehicle and detected agents. Specifically, Bayesian inference is used to determine the probability of collision at each time step, considering both historical motion data and current positional uncertainty. This risk score is then directly integrated into the reward function via the DRAM to penalize unsafe actions and encourage proactive avoidance strategies.

The action space definitions directly feed into the PPO policy optimizer within URLNF. During each decision cycle, the chosen action $u_{t}$ is evaluated by the UCARM, which dynamically adjusts the reward based on current traffic density, road priority, and collision probability from the DRAM. This integration ensures that both safety and efficiency considerations influence the PPO policy update in real time. Multi-agent coordination signals (e.g., predicted trajectories from surrounding vehicles) are incorporated into the GNN’s spatial-temporal state representation, enabling the PPO module to select actions that maximize long-term joint utility.

3.3. Proposed Framework

The proposed framework addresses the complex, multi-faceted challenges of AV navigation in dynamic urban scenes through a new hybrid deep reinforcement learning model. The framework integrates:

Proximal Policy Optimization (PPO): Learns stable and robust policies while responding to the constraints of the surrounding environment.
Graph Neural Networks (GNN): Model spatial-temporal relationships in the urban environment and the interactions between its elements.
Urban context-aware reward mechanism (UCARM): Adapts the reward system in line with traffic rules, real-time traffic, and accidents to provide contextually relevant decisions.
DRAM: Uses Bayesian learning and MDPs for risk prediction, uncertainty modeling, and decision making with regard to collision risks.

3.3.1. Proximal Policy Optimization (PPO)

PPO is used to update the AV’s navigation policy $π_{θ}(u_{t} \mid s_{t})$, where $θ$ denotes the policy parameters that govern the AV’s probability of selecting action $u_{t}$ in state $s_{t}$. Rather than modeling the environment, PPO aims to maximize the total reward received over time while keeping the policy stable. The optimization objective is given by:

J(θ) = \mathbb{E}_{t} \left[\min \left(r_{t}(θ) A_{t}, \mathrm{clip}(r_{t}(θ), 1 - ϵ, 1 + ϵ) A_{t}\right)\right]

(24)

where

$r_{t}(θ) = \frac{π_{θ}(u_{t} \mid s_{t})}{π_{θ_{old}}(u_{t} \mid s_{t})}$ is the probability ratio between the updated policy and the old policy.
$A_{t} = Q(s_{t}, u_{t}) - V(s_{t})$ is the advantage function, quantifying the relative benefit of action $u_{t}$ in state $s_{t}$ compared to the baseline value function $V(s_{t})$.
$ϵ$ is the clipping parameter that prevents excessive updates and ensures stable learning.

Although Equation (24) contains a min operator, it fundamentally represents a maximization objective for expected cumulative reward: the min selects the more pessimistic of the unclipped and clipped surrogate terms. In practice, this objective is optimized using gradient ascent; for consistency with loss-based training conventions, its negative is framed as a clipped surrogate loss to be minimized. This formulation stabilizes updates by limiting the step size through the clipping function, which constrains the policy ratio $r_{t}(θ)$ within a fixed range (typically between $1 - ϵ$ and $1 + ϵ$).

The value function $V(s_{t})$ is learned using the Bellman equation:

V(s_{t}) = \mathbb{E}_{u_{t} \sim π_{θ}} \left[R(s_{t}, u_{t}) + γ V(s_{t+1})\right]

(25)

where $γ \in (0, 1]$ is the discount factor that prioritizes immediate rewards over future rewards. Figure 11 shows the workflow of PPO.

During training, the parameters $θ$ of the policy $π_{θ}$ are updated using the PPO algorithm based on observed transitions. The framework collects batches of data containing states, actions, and corresponding rewards from environment interactions. An advantage estimate is then computed to measure how much better an action performs compared to the expected baseline. Using this, a clipped objective function is applied to stabilize the updates and prevent large shifts in policy. The gradient of this objective function is computed and used to update $θ$ using the Adam optimizer. A learning rate scheduler is used to reduce the learning rate as training progresses, improving convergence.
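The clipped surrogate objective at the core of this update can be illustrated on a small batch in plain Python. This is a sketch of the standard PPO objective rather than the authors' training code; the function name, the use of log-probabilities as inputs, and the clipping value ϵ = 0.2 (a common default, not reported in this section) are assumptions.

```python
import math
from typing import Sequence


def clipped_surrogate(logp_new: Sequence[float], logp_old: Sequence[float],
                      advantages: Sequence[float], eps: float = 0.2) -> float:
    """Batch mean of min(r*A, clip(r, 1-eps, 1+eps)*A); this value is maximized."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)             # r_t(theta)
        clipped = min(1.0 + eps, max(1.0 - eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))  # pessimistic bound
    return sum(terms) / len(terms)


# Sample 1: ratio 1.5 with positive advantage, so the gain is capped at 1.2 * A.
# Sample 2: ratio 0.5 with negative advantage, so the clipped term 0.8 * A is kept.
demo_objective = clipped_surrogate(
    logp_new=[math.log(1.5), math.log(0.5)],
    logp_old=[0.0, 0.0],
    advantages=[1.0, -1.0],
)
```

In a training loop, the negative of this value would be passed to the optimizer as the loss, which matches the loss-based convention described above.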

3.3.2. Graph Neural Network (GNN) for Spatial-Temporal Modeling

The urban environment is modeled as a directed graph $G = (V, E)$, where nodes $V$ represent intersections, vehicles, and pedestrian zones, and edges $E$ represent roads with attributes such as traffic density, signal states, and speed limits. Each node $v$ has an initial feature vector $h_{v}^{(0)}$ capturing local traffic and signal information.

Node embeddings are updated through message passing:

h_{v}^{(k+1)} = σ \left(W^{(k)} h_{v}^{(k)} + \sum_{u \in N(v)} \frac{1}{c_{vu}} W^{(k)} h_{u}^{(k)}\right)

(26)

where $N(v)$ is the set of neighbors of node $v$, $c_{vu}$ is a normalization factor, $W^{(k)}$ is the weight matrix for layer $k$, and $σ$ is an activation function (e.g., ReLU).

After $K$ layers, the final node embeddings $h_{v}^{(K)}$ capture spatial-temporal dependencies and are fed into the PPO policy network. The GNN parameters are optimized via backpropagation using the combined PPO policy loss and any auxiliary tasks, ensuring robust representation learning for decision making in dynamic urban environments. Figure 12 shows the Graph Neural Network (GNN) for spatial-temporal modeling.

The GNN component processes a graph representation of the environment where each node represents an entity (e.g., AV, pedestrian, traffic signal) and edges represent interactions or proximity. Each GNN layer has a learnable weight matrix $W^{(k)}$ that transforms node embeddings during message passing. These weights are updated via backpropagation during training. The gradients are derived from the overall loss, which is computed from both the PPO policy loss and auxiliary tasks such as trajectory consistency or object classification, if enabled. The optimizer updates the GNN weights to minimize the loss, ensuring the graph representation accurately captures spatial and temporal dependencies critical for decision making.
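One message-passing layer of the form in Equation (26) can be sketched in plain Python on a toy graph. The normalization factor is not specified in the text, so it is assumed here to be the number of neighbors of the receiving node (mean aggregation); the helper names, the toy graph, and the weight matrix are likewise hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """Matrix-vector product W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weight]


def relu(x: Vector) -> Vector:
    return [max(0.0, x_i) for x_i in x]


def gnn_layer(h: Dict[str, Vector], neighbors: Dict[str, List[str]],
              weight: Matrix) -> Dict[str, Vector]:
    """One layer: h_v' = ReLU(W h_v + sum_u (1 / c_vu) W h_u)."""
    new_h = {}
    for v, h_v in h.items():
        agg = matvec(weight, h_v)            # self term W h_v
        nbrs = neighbors.get(v, [])
        for u in nbrs:
            c_vu = len(nbrs)                 # assumed normalization: neighbor count
            msg = matvec(weight, h[u])       # neighbor term W h_u
            agg = [a + m / c_vu for a, m in zip(agg, msg)]
        new_h[v] = relu(agg)
    return new_h


# Toy graph: the AV node and one pedestrian node observe each other.
demo_h = {"av": [1.0, 0.0], "ped": [0.0, 2.0]}
demo_neighbors = {"av": ["ped"], "ped": ["av"]}
demo_weight = [[1.0, 0.0], [0.0, -1.0]]
demo_out = gnn_layer(demo_h, demo_neighbors, demo_weight)
```

Stacking $K$ such layers, each with its own weight matrix, lets information propagate across $K$ hops of the graph before the embeddings reach the policy network.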

3.3.3. Urban Context-Aware Reward Mechanism (UCARM)

To ensure that the reinforcement learning agent makes context-sensitive decisions, the proposed framework embeds real-world driving scenarios directly into the learning process. Three key urban contexts are considered: dense traffic, pedestrian interaction zones, and dynamic weather conditions. These scenarios are first quantified using physical and environmental parameters. Dense traffic is represented through vehicle density metrics, such as the number of surrounding vehicles, their average speeds, and spacing patterns—these are computed from LiDAR and radar data and incorporated as node features in the graph. Pedestrian–vehicle zones are identified using camera-based semantic segmentation, which flags areas such as crosswalks and sidewalks. These zones are assigned binary indicators and risk weights based on pedestrian activity levels. Dynamic weather conditions, including reduced visibility, slippery roads, and sensor noise, are modeled using environmental sensor readings and mapped to parameters like visibility range and road friction. These contextual features are embedded into the graph as node and edge attributes in the GNN and are also appended to the PPO agent’s state vector. This integration allows the model to learn environment-specific behaviors and adjust its reward optimization accordingly, improving its ability to generalize across varying real-world conditions while maintaining safety, energy efficiency, and travel time minimization. The reward function is defined as:

R(s_{t}, a_{t}) = λ_{1} \cdot \left(\frac{1}{T_{eff}}\right) - λ_{2} \cdot P(C_{t} \mid s_{t}, a_{t}) - λ_{3} \cdot E

(27)

where

$T_{eff} = \sum_{t=1}^{T} \left(\frac{d_{t}}{v_{t}} \cdot p_{t} + w_{t}\right)$ is the effective travel time, considering road priority weights $p_{t}$.
$P(C_{t} \mid s_{t}, a_{t})$ is the collision probability at time $t$.
$E = \int_{t=0}^{T} \left(\frac{1}{2} α v_{t}^{2} + β v_{t} a_{t} + γ\right) dt$ is the energy consumption, where $α_{t}$ is travel time weight, $β_{s}$ is safety weight, and $γ_{e}$ is energy weight.
$λ_{1}, λ_{2}, λ_{3}$ are weight parameters that balance the trade-offs. Figure 13 shows the UCARM.
The weighting parameters λ₁, λ₂, and λ₃ shown in Figure 13 are dynamically adjusted by the UCARM. This module observes external traffic and environmental conditions (e.g., congestion, pedestrian density, and weather) and tunes the reward weights accordingly to prioritize safety, efficiency, or energy optimization depending on the real-time context.

3.3.4. DRAM

The DRAM evaluates potential collision risks using Bayesian inference:

P(C_{t} \mid s_{t}, a_{t}) = \frac{P(s_{t} \mid C_{t}) \, P(C_{t})}{P(s_{t})}

(28)

where

$P (C_{t}| s_{t}, a_{t})$ is the posterior probability of collision.
$P (s_{t}| C_{t})$ is the likelihood of the state under a collision scenario.
$P (C_{t})$ is the prior probability of a collision.
$P (s_{t})$ is the evidence probability of the state $s_{t}$ .

This module integrates collision probabilities into the reward mechanism, penalizing high-risk actions. Markov Decision Processes (MDPs) are used to model state transitions and evaluate policies, ensuring optimal decision making under uncertainty:

P(s_{t+1} \mid s_{t}, a_{t}) = \sum_{s'} P(s_{t+1} \mid s', a_{t}) \, P(s' \mid s_{t})

(29)

To evaluate the system’s responsiveness to sudden threats, this study simulated emergency scenarios in CARLA, such as sudden pedestrian or animal crossings within close range (under 5 m). The model’s average reaction time, measured from sensor input to final control signal output, ranged from 0.152 to 0.176 s across 40 trials. This latency is within the industry-acceptable range for real-time AV systems. These results indicate that the DRAM, when combined with UCARM, can detect and respond to high-risk scenarios in a timely manner. Further tests under high-speed conditions and low-visibility environments are planned as future work. The impact of DRAM-UCARM integration on real-time responsiveness is discussed in Section 4.5.

Figure 14 shows the DRAM. The workflow of the proposed framework integrates multiple components to address the challenges of AV navigation in dynamic urban environments. The steps are as follows:

Input Data Preprocessing: Raw data, including traffic density, signal states, road attributes, and vehicle dynamics, is collected from real-world datasets (e.g., Argoverse, nuScenes) or generated synthetically using the CARLA simulator. The data is normalized and formatted to ensure compatibility with the framework.
Spatial-Temporal Modeling with GNN: The processed data is transformed into a graph representing nodes (e.g., intersections, vehicles) and edges (e.g., road segments). The Graph Neural Network (GNN) module captures spatial-temporal dependencies and computes node embeddings.
Policy Optimization with PPO: The node embeddings generated by the GNN are fed into the PPO module, which iteratively refines the navigation policy πθ. This balances exploration and exploitation, ensuring robust and efficient decision-making.
Reward Adjustment with UCARM: UCARM adjusts the reward in real time based on traffic congestion, safety parameters, and energy utilization, keeping the policy aligned with the goals of dynamic urban navigation.
Collision Risk Assessment with DRAM: The DRAM estimates the collision probability using Bayesian inference. This probability is incorporated into the reward function to discourage dangerous maneuvers that would compromise safety.
Convergence and Deployment: The framework iterates over multiple episodes until the policy converges, stabilizing the policy optimization phase. The trained policy is then deployed for real-time AV navigation.

Algorithm: Workflow of the Proposed Framework

Algorithm 1 outlines the step-by-step process of the proposed framework:

Algorithm 1: Workflow of the Proposed Framework (URLNF)

Input:
- Traffic data (e.g., density, signals, road attributes);
- AV state $s_{t}$ ;
- Initial policy $π_{θ}$ ;
- Reward weights $λ_{1}, λ_{2}, λ_{3}$ .
Output:
- Optimized navigation policy $π_{θ}^{*}$.
(a) Step 1: Data Preprocessing
- Collect raw data from real-world (Argoverse, nuScenes) or synthetic (CARLA) sources.
- Normalize sensor readings and structure the environment as a graph $G = (V, E)$.
  ○ Nodes $V$: Intersections, vehicles, pedestrian zones.
  ○ Edges $E$: Roads with traffic density, speed limits, signal states.
(b) Step 2: Spatial-Temporal Modeling (GNN)
- Initialize GNN weight parameters $W_{k}$.
- For each node $v$ in graph $G$, compute initial embedding $h_{v}^{0}$.
- For each GNN layer $k = 1$ to $K$:
  ○ For each node $v$:
    ▪ Aggregate features from neighbors.
    ▪ Update embedding: $h_{v}^{k} = \mathrm{Activation}(W_{k} \cdot \mathrm{AggregatedFeatures})$.
(c) Step 3: Policy Optimization (PPO)
- Initialize PPO policy parameters $θ$ and learning rate $η$.
- Repeat until policy converges:
  1. Sample trajectories: $(s_{t}, a_{t}, r_{t}, s_{t+1})$;
  2. Compute advantage: $A_{t} = Q(s_{t}, a_{t}) - V(s_{t})$;
  3. Calculate probability ratio: $r_{t}(θ) = π_{θ}(a_{t} \mid s_{t}) / π_{θ_{old}}(a_{t} \mid s_{t})$;
  4. Apply PPO clipped objective: $L = \min(r_{t}(θ) A_{t}, \mathrm{clip}(r_{t}(θ), 1 - ϵ, 1 + ϵ) A_{t})$;
  5. Update policy: $θ \leftarrow θ + η \nabla_{θ} L$.
(d) Step 4: Reward Adjustment (UCARM)
- Dynamically adjust reward weights $λ_{1}, λ_{2}, λ_{3}$ based on:
  ○ Traffic density;
  ○ Pedestrian presence;
  ○ Current energy usage.
- Recompute reward: $R_{t} = -(λ_{1} \cdot \mathrm{EffectiveTravelTime} + λ_{2} \cdot \mathrm{CollisionProbability} + λ_{3} \cdot \mathrm{EnergyConsumption})$
(e) Step 5: Collision Risk Assessment (DRAM)
- Estimate collision probability using Bayesian inference: $P(\mathrm{collision} \mid s_{t}, a_{t}) = \frac{\mathrm{Likelihood} \times \mathrm{Prior}}{\mathrm{Evidence}}$
- Penalize actions with high collision risk in reward function.
(f) Step 6: Convergence and Deployment
- Repeat Steps 3 to 5 until convergence is reached.
- Output the optimized policy $π_{θ}^{*}$ for real-time deployment on AVs.

3.3.5. Multi-Agent Collision Avoidance Coordination

The collision avoidance system integrates multi-agent coordination through a distributed consensus mechanism where each vehicle agent maintains a local collision avoidance policy $π_{local}$ while participating in a global coordination protocol $Π_{global}$. When multiple agents detect potential conflicts, they engage in a negotiation process using the Collision Avoidance Coordination Algorithm (CACA):

(1): Threat assessment phase where each agent broadcasts its intended trajectory $T_{intended}$ and current risk assessment $R_{local}$.
(2): Priority assignment based on a combination of factors including vehicle type, passenger count, and proximity to destination (emergency vehicles receive highest priority α_emergency = 1.0).
(3): Maneuver selection where agents select from the predefined action space $\{swerve\_left, swerve\_right, brake\_moderate, brake\_emergency, speed\_maintain, lane\_change\_left, lane\_change\_right\}$ based on collective optimization.
(4): Execution monitoring where agents continuously update their actions based on real-time feedback from other participants. The system ensures deadlock prevention through a timeout mechanism $τ_{timeout} = 2.0$ s, after which the agent with the highest priority executes its preferred maneuver while others adopt defensive positions.

3.3.6. Simulation Scenarios and Parameters

The proposed framework was tested in the CARLA simulator, which is a realistic simulator for testing autonomous vehicles in different urban conditions. Testing was done to assess the efficiency of this framework regarding travel time minimization, collision avoidance, and energy consumption in various cases.

The scenarios tested are as follows:

(a): Dense Traffic Conditions: High vehicle density was simulated on urban road networks to test the framework’s ability to minimize travel time and avoid collisions.
(b): Mixed Pedestrian–Vehicle Zones: Virtual environments with realistic but random pedestrian behavior were used to capture safety statistics and collision frequency.
(c): Dynamic Weather Conditions: The framework was run under various adverse environmental conditions, including rain, fog, and low visibility, to establish energy efficiency and navigation success rates.

While the current framework utilizes reactive decision making grounded in immediate sensor inputs and real-time risk estimations, it does not yet account for future traffic evolution or potential conflicts in shared spaces like intersections. To enhance realism and scalability in future simulations, this research aims to integrate predictive behavior modeling using cooperative multi-agent reinforcement learning. This will allow AVs to exchange intent and forecast trajectories, enabling proactive adjustments before conflicts arise. For example, if a junction is predicted to be occupied by several AVs in upcoming time windows, individual agents could reroute or delay entry, reducing bottlenecks and enhancing traffic fluidity. Incorporating such cooperative intelligence is essential for navigating densely populated urban environments and can substantially reduce collision risks under heavy traffic loads.

The framework’s performance was evaluated using the following metrics:

Travel Time: The total time taken to navigate from the source to the destination (in seconds).
Collision Rate: The number of collisions between the AV and other road users during a single simulation run.
Energy Efficiency: The total energy required for navigation (in joules).
Navigation Success Rate: The percentage of navigation attempts completed successfully under the given conditions.

Table 6 summarizes the assessment of the proposed framework across all scenarios in the different cities and shows the flexibility of the framework within this simulation.

4. Results and Discussions

This section provides an in-depth analysis of the results obtained with the proposed methodology on three diverse datasets: Argoverse, nuScenes, and CARLA. These datasets cover real and simulated urban environments, low- and high-mobility traffic, pedestrian and vehicle interactions, and weather fluctuations. Each case is evaluated using key parameters, including travel time, collision avoidance, and energy consumption, in order to assess the scalability and flexibility of the framework. The results are presented in tables and accompanied by commentary that explains the observed behavior of the framework in different settings. Moreover, the performance of the modules is compared across the datasets to highlight the contribution of each within the proposed framework. Every aspect of the evaluation is linked to the characteristics of the system components, such as UCARM, DRAM, and the joint PPO-GNN model. The discussion also analyzes the trade-offs between efficiency, safety, and sustainability and the practical benefits of using the framework. The key metrics used to evaluate the proposed framework are:

Travel Time (T): The time that was taken between the source and the destination.

Collision Rate (CR): Number of collisions that occurred in a particular simulation.

Energy Consumption (E): Total energy used for navigation in joules.

Navigation Success Rate (NSR): Proportion of successful navigation.

Reward Value (R): Final value of the reward function, which incorporates efficiency, safety, and energy parameters.

4.1. Results on Argoverse Dataset

The performance of the proposed framework on the Argoverse dataset, which includes realistic annotated urban traffic scenes, demonstrates the framework’s versatility. The performance is analyzed across three major scenarios, dense traffic, mixed zones, and dynamic weather, with specific measures of travel time, collision rates, energy consumption, navigation success rates, and reward values.

Table 7 presents a comprehensive evaluation of the proposed Urban Reinforcement Learning Navigation Framework (URLNF) under three urban conditions: dense traffic, pedestrian–vehicle mixed zones, and dynamic weather. Key performance indicators include average travel time, collision rate, energy usage, navigation success rate, cumulative reward, number of braking and acceleration events, and estimated fuel efficiency. The results are averaged over multiple test episodes, with standard deviations included where applicable, reflecting the model’s robustness and adaptability across varying conditions.

Figure 15 illustrates the comparative performance of the proposed URLNF framework across three distinct urban driving contexts: dense traffic, pedestrian-vehicle mixed zones, and dynamic weather. Key metrics shown include travel time, collision rate, energy consumption, navigation success rate, and fuel efficiency. The trends highlight how the framework adapts to varying levels of complexity and uncertainty, demonstrating consistent safety and efficiency outcomes.

Travel time is an important measure of navigation performance. The average travel time across the scenarios is 420 ± 20 s, and the dynamic weather scenario is the longest, at 440 ± 25 s. This increase arises because navigation must proceed at lower speeds in adverse weather conditions such as rain or fog, which are reflected in the dataset. The shortest travel time was recorded in mixed zones (400 ± 15 s), demonstrating the versatility of the framework in providing efficient movement within areas that have moderate traffic and pedestrian flow. The collision rate averaged 3.1% across the scenarios, which shows the effectiveness of the DRAM in identifying and preventing collision risks.

Dynamic weather had the highest collision rate, 3.8%, which is attributed to the difficulty of predicting changes in weather patterns. However, this rate is lower than that of conventional navigation systems, which exceed 5% in comparable environments. In dense traffic, the collision rate was lowest, at 2.5%, which shows that the model handles congestion efficiently by making the right choices in real time. The average energy consumption across all scenarios was 11,833 ± 550 J, and the highest energy consumption occurred in dynamic weather (12,500 ± 600 J). This is mainly attributed to frequent speed oscillation and maneuvers such as braking and acceleration that are needed to keep the vehicle under safe control in changing conditions. Mixed zones required the least energy (11,000 ± 450 J) because traffic movement and pedestrian crossings were less likely to involve rapid acceleration or deceleration. The framework attained an average navigation success rate of 92.6% across the scenarios, with the dense traffic scenario registering 94.2%. This shows how the framework can address congestion in urban areas by using the UCARM to balance safety and efficiency. Success rates were slightly lower in dynamic weather (91.0%) because of the system’s sensitivity to variations in the environment. The reward values averaged −8.4, which reflects the cost of the trade-off between efficiency, safety, and energy consumption. The lowest reward was in dynamic weather (−8.6), implying that the system favors safety over efficiency in difficult conditions. Braking events were more frequent in dynamic weather (35 ± 5) than in dense traffic (28 ± 3), underlining the need for cautious, rule-compliant driving in adverse conditions.
The frequency of acceleration adjustments was highest in dynamic weather (749 ± 7), consistent with the higher energy consumption, which points to the need for balance to achieve stability. Mixed zones achieved the highest fuel economy, 15.2 ± 0.6 km/L, and dynamic weather the lowest, 13.8 ± 0.7 km/L, in line with the energy consumption findings.

4.2. Results on nuScenes Dataset

The proposed framework was tested on the nuScenes dataset, which contains multi-sensor data and complex urban traffic scenarios, to examine its effectiveness under different environmental and traffic conditions. The nuScenes dataset contains LIDAR, radar, and high-definition camera data, which makes it suitable for evaluating the proposed system in urban scenarios with dynamic traffic. The performance was assessed across three key scenarios, dense traffic, mixed zones, and dynamic weather, using travel time, collision rate, energy consumption, navigation success rate, reward value, and other operational metrics.

Table 8 reports the performance of the proposed Urban Reinforcement Learning Navigation Framework (URLNF) under diverse urban conditions in the nuScenes dataset: dense traffic, mixed pedestrian–vehicle zones, and dynamic weather. Metrics include travel time, collision rate, energy consumption, navigation success, cumulative reward, lane changes, pedestrian interactions, and fuel efficiency. Results are averaged across multiple runs, with standard deviations provided to reflect variability. The framework demonstrates stable and effective navigation with high success rates and efficient energy use under varying traffic complexities.

Figure 16 visualizes the comparative performance of the URLNF framework under three urban driving conditions—dense traffic, mixed pedestrian–vehicle zones, and dynamic weather—using the nuScenes dataset. It highlights key metrics such as travel time, collision rate, navigation success, energy consumption, and pedestrian interactions. The consistent trends across scenarios demonstrate the framework’s adaptability, stability, and effectiveness in real-world-like urban environments.

The average travel time for all scenarios was 410 ± 20 s, which is slightly less than what was observed in the Argoverse dataset (420 ± 20 s). This is attributed to the less crowded scenarios in nuScenes and the improved decision making facilitated by the framework. The shortest travel time was recorded in dense traffic (390 ± 18 s), which illustrates how the proposed framework adapts to congested road networks using real-time sensor data and the GNN for accurate spatial-temporal modeling. The longest travel time was observed in dynamic weather (430 ± 20 s) because of lower speeds and additional caution in adverse conditions such as rain or low visibility, which are typical for the nuScenes dataset. The average collision rate dropped to 2.5%, which is attributed to the DRAM and is an improvement over the Argoverse result (3.1%). The lowest collision rate was seen in dense traffic (1.8%), where vehicle movement was highly predictable and the framework could manage it effectively.

The collision rate rose to 3.2% in dynamic weather due to particularly unpredictable pedestrian and vehicle behavior in poor visibility and on slippery roads. The mean energy consumption for all scenarios was 11,933 ± 450 J, which is very close to the Argoverse result of 11,833 ± 550 J. This shows that the framework can use energy efficiently during navigation while still ensuring safety. Dynamic weather required the most energy, 12,300 ± 500 J, because the vehicle had to brake and accelerate often to cope with unfavorable weather. The lowest energy consumption was recorded in dense traffic (11,500 ± 400 J), as the framework encouraged smooth motion and minimized jerky maneuvering. The navigation success rate was the highest of all the datasets, with an average of 93.6%, and dense traffic performed best at 95.5%. This shows that the framework can reliably reach its destination within the allotted time while handling traffic signals and other obstacles along the way. The success rate was slightly lower in dynamic weather (91.5%), as conditions such as low light and unpredictable pedestrian crossings posed some problems for the system. The average reward value was −8.1. The best (least negative) reward was recorded in mixed zones (−7.9), which means that the framework was able to optimize time, safety, and energy consumption in situations with moderate traffic and crossing activity. The reward in dynamic weather dropped to −8.3, as some compromises are needed to maintain safety and stability in unfavorable weather. The number of lane changes increased from 12 ± 2 in dense traffic to 18 ± 3 in dynamic weather, demonstrating the system’s flexibility in responding to scenarios with dynamic lane configurations.
The number of pedestrian interactions was highest in dynamic weather (314 ± 3), which indicates that the proposed framework can handle pedestrians’ unpredictable behavior. Fuel economy was highest in dense traffic (15.5 ± 0.6 km/L) and lowest in dynamic weather (14.0 ± 0.7 km/L), which corresponds with the energy consumption results.

4.3. Results on CARLA Dataset

The evaluation of the proposed framework on the CARLA synthetic dataset revealed key information about the system’s behavior in extreme and edge cases. CARLA provides a rich and realistic environment that can mimic difficult urban conditions, including high traffic density, mixed traffic–pedestrian zones, and variable weather such as rain or fog. These scenarios pose the greatest challenge to the proposed framework, requiring dynamic adjustment of navigation while optimizing efficiency, safety, and energy use.

Table 9 presents a detailed evaluation of the proposed URLNF framework in simulated urban driving environments using the CARLA dataset. Results span three challenging conditions: dense traffic, mixed zones, and dynamic weather. Key performance metrics include travel time, collision rate, energy consumption, navigation success, reward value, lane changes, pedestrian interactions, and fuel efficiency. Averages and standard deviations are reported, highlighting the model’s ability to maintain safe and efficient driving under complex and varied conditions.

Figure 17 showcases the performance of the URLNF framework under three dynamic urban conditions—dense traffic, mixed pedestrian–vehicle zones, and dynamic weather—based on the CARLA simulation environment. Metrics visualized include travel time, collision rate, navigation success, energy consumption, pedestrian interactions, and fuel efficiency. The trends emphasize the framework’s robustness and adaptability in complex and high-uncertainty driving scenarios.

The overall average travel time was 450 ± 25 s, which indicates that CARLA scenarios are more complex than those of Argoverse (420 ± 20 s) and nuScenes (410 ± 20 s). Travel time was highest in dynamic weather (460 ± 30 s), since the weather made it difficult to maneuver or travel at high speed and required frequent adjustments. In dense traffic, the travel time was slightly lower, at 450 ± 25 s, as the framework is capable of minimizing lane changes and avoiding congested regions by using the GNN for spatial-temporal learning. The collision rate was higher in CARLA (3.6%) than in Argoverse (3.1%) and nuScenes (2.5%), which indicates the challenge of avoiding risks in extreme traffic or weather conditions.

The highest collision rate was found in dynamic weather (4.2%) due to unpredictable pedestrian and vehicle movements. Nonetheless, the framework managed to control these risks through the DRAM and achieved a reasonable collision rate. The lowest collision rate was in dense traffic (3.0%), which demonstrates the ability of the framework to predict vehicle movements even at high traffic density. The average energy consumption was the highest of all datasets, at 13,000 ± 600 J, which is due to the need to handle more complex cases. Dynamic weather again had the highest energy consumption, 13,500 ± 650 J, because the vehicle frequently had to brake and accelerate to maintain stability on slippery roads and in poor visibility.

Energy consumption was slightly lower in dense traffic (12,500 ± 550 J) because a constant speed could be maintained without frequent acceleration and deceleration. The navigation success rate was 91.2%, which is slightly lower than for Argoverse (92.6%) and nuScenes (93.6%) due to the complexity of CARLA scenarios. The highest success rate was achieved in dense traffic (92.8%), as these scenarios provided a structured environment for the framework’s navigation policy. The lowest success rate was obtained in dynamic weather (89.5%), showing that maintaining stability and safety in unfavorable weather is a major challenge. The average reward value was −8.7, lower than for the other datasets, and it was lowest in dynamic weather (−8.9) because the framework prioritizes safety over travel time in difficult conditions. In mixed zones, the reward value was less negative (−8.6), which shows that the framework balances safety and efficiency effectively under moderate interference from pedestrians.

The number of lane changes rose from 20 ± 4 in dense traffic to 30 ± 6 in dynamic weather, suggesting the system’s flexibility in adapting its lane-change frequency to conditions. Dynamic weather involved 22 ± 5 pedestrian interactions, in line with the need to update navigation paths frequently around pedestrians. Fuel efficiency was the lowest among all datasets, with a mean of 12.9 ± 0.7 km/L, and the lowest value occurred in dynamic weather (12.2 ± 0.8 km/L) because of much more frequent braking and accelerating.

4.4. Comparative Analysis

Table 10 summarizes and compares the performance of the proposed Urban Reinforcement Learning Navigation Framework (URLNF) across three datasets—Argoverse, nuScenes, and CARLA. Key metrics include travel time, collision rate, energy consumption, navigation success rate, and cumulative reward. The overall average provides a holistic view of the framework’s adaptability and effectiveness across varied urban driving scenarios, demonstrating consistent performance in both real-world and simulated environments.

Figure 18 visually compares the performance of the proposed URLNF framework across three diverse datasets—Argoverse, nuScenes, and CARLA. Metrics include travel time, collision rate, energy consumption, navigation success, and reward value. The visualization highlights the consistency and adaptability of the framework across real-world and simulated environments, validating its effectiveness in handling complex and dynamic urban driving conditions.

The proposed framework achieved the shortest travel time (410 ± 20 s) on nuScenes because the dataset is densely equipped with sensors and its traffic is well optimized for curved routes. The multi-sensor data were used to train the Graph Neural Network (GNN), which modeled spatial-temporal relationships and reduced delays. CARLA had the longest travel time (450 ± 25 s), which is consistent with the system’s performance under difficult conditions such as heavy traffic and unfavorable weather. These conditions forced the framework to operate at low speeds and therefore to take longer.

The mean travel time across all datasets was 427 ± 22 s, suggesting that the framework performs reliably across environments. To further demonstrate model generalizability, cross-validation experiments were conducted by segmenting each dataset into distinct urban layout categories such as grid-like roads, radial layouts, and irregular topologies. The framework was then evaluated separately on each type. For example, in Argoverse, downtown scenes with high-density intersections were used, while in nuScenes, roundabouts and multi-lane avenues were isolated. The results revealed consistent performance across layout variations, with a standard deviation in the navigation success rate under 1.5%. Additionally, the policy’s behavior was validated on CARLA’s “Town01” (structured) and “Town05” (unstructured rural-like) environments to test spatial transferability. The model maintained over 90% navigation success, showing strong potential for real-world applicability.
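The cross-dataset averages quoted in this section can be reproduced from the per-dataset means reported in Sections 4.1 to 4.3. The short sketch below does only that arithmetic; the dictionary layout and key names are illustrative, and the ± spreads are not recomputed because per-episode data are not given in the text.

```python
from statistics import mean

# Per-dataset means as reported in Sections 4.1-4.3.
results = {
    "Argoverse": {"travel_time_s": 420, "collision_rate_pct": 3.1, "success_rate_pct": 92.6},
    "nuScenes": {"travel_time_s": 410, "collision_rate_pct": 2.5, "success_rate_pct": 93.6},
    "CARLA": {"travel_time_s": 450, "collision_rate_pct": 3.6, "success_rate_pct": 91.2},
}

# Unweighted average of each metric over the three datasets.
metrics = ["travel_time_s", "collision_rate_pct", "success_rate_pct"]
averages = {m: mean(r[m] for r in results.values()) for m in metrics}

if __name__ == "__main__":
    for name, value in averages.items():
        print(f"{name}: {value:.1f}")
```

Rounded, these unweighted means give 427 s for travel time, 3.1% for the collision rate, and 92.5% for the navigation success rate.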

The lowest collision rate, 2.5%, was found in nuScenes, where the DRAM benefited from the dense sensor configuration. The module used Bayesian inference and Markov Decision Processes (MDPs) to predict and avoid collision risks with high accuracy. CARLA had the highest collision rate, 3.6%, which results from challenging conditions such as unpredictable pedestrian actions and heavy rain. Nevertheless, the collision rate remained reasonable, showing that the framework is effective in handling extreme scenarios. The average collision rate across all datasets was 3.1%, which is similar to or slightly better than previous studies, for example the work of Xing et al. [33], who found a 4.2% collision rate in similar circumstances.
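To make the Bayesian part of the DRAM more concrete, the following is a minimal sketch of a belief update over collision risk. It is not the module’s actual formulation: the likelihood values, the `brake_threshold`, and the two-action rule are hypothetical, and the MDP planning step is replaced by a simple threshold for brevity.

```python
def bayes_update(prior, p_obs_if_collision, p_obs_if_safe):
    """Posterior collision probability after one observation (Bayes' rule)."""
    evidence = p_obs_if_collision * prior + p_obs_if_safe * (1.0 - prior)
    if evidence == 0.0:
        # The observation is impossible under both hypotheses; keep the prior.
        return prior
    return p_obs_if_collision * prior / evidence


def assess(prior, observations, brake_threshold=0.25):
    """Fold a sequence of observations into the risk belief and pick an action.

    observations: list of (P(obs | collision course), P(obs | safe)) pairs.
    brake_threshold: hypothetical risk level at which the AV slows down.
    """
    risk = prior
    for p_collision, p_safe in observations:
        risk = bayes_update(risk, p_collision, p_safe)
    action = "brake" if risk >= brake_threshold else "proceed"
    return risk, action


if __name__ == "__main__":
    # Example: a pedestrian is seen stepping toward the road (assumed likelihoods).
    print(assess(0.05, [(0.8, 0.1)]))
```

A full risk module would feed the updated belief into the state of the decision process rather than into a fixed threshold; the sketch only shows how evidence can raise or lower the estimated risk before an action is chosen.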

Energy consumption was highest in CARLA (13,000 ± 600 J) since the scenarios chosen were challenging and involved continuous braking and accelerating. These actions were essential for safety but raised energy consumption. In contrast, Argoverse had the lowest energy consumption (11,833 ± 550 J) because its traffic flow was smooth, with little speed fluctuation and few associated energy spikes.

Figure 19 illustrates the optimized navigation paths for autonomous vehicles across the three datasets: Argoverse, nuScenes, and CARLA. Each subplot shows the AV’s motion in X and Y coordinates, with the optimized path drawn as a solid blue line and other potential routes as dotted or dashed lines. The lightly shaded blue area is the “Optimized Zone,” where the vehicle performs best in terms of navigation parameters, including time to destination, energy consumption, and collision risk. The optimized path is much smoother, with fewer fluctuations than the other paths, and hence yields shorter travel time and a lower chance of collision. The other options, though reasonable, deviate slightly from the main line and therefore incur higher resource consumption and safety risks. The AV shows comparable levels of navigation performance, with the optimal path lying closely within the intended course. The alternative paths arise from variations in the dynamic traffic conditions of the urban environment represented in the nuScenes dataset. In the CARLA environment, with its edge-case scenarios, the optimized path maneuvers easily around the zone. The alternatives show clearly distinguishable differences, and the proposed framework is crucial in handling challenges such as crowded traffic and unfavorable weather conditions. Overall, the figure highlights the effectiveness and applicability of the proposed framework for optimizing AV navigation on various datasets. The separation of the optimized and other paths demonstrates the extent to which the framework can adapt to environmental conditions while maintaining optimal and secure path planning.

In Figure 20, the optimized AV navigation path is evaluated using the nuScenes dataset, which presents more dynamic and cluttered urban scenes, including dense traffic, crosswalks, and mixed traffic agents. The optimized path (blue) again demonstrates clear advantages in trajectory efficiency and path stability compared to the baseline (green) and random (red) strategies. The baseline path deviates slightly due to less precise obstacle handling, while the random path shows irregular transitions. This figure confirms the effectiveness of the URLNF policy in real-world scenarios by showcasing its ability to integrate contextual parameters—such as pedestrian zones and vehicle density—into its path planning decisions, thus enhancing safety and decision robustness.

Figure 21 illustrates the vehicle navigation paths across three datasets: Argoverse, nuScenes, and CARLA. Each subplot shows a different set of navigation paths, including the straight path (blue solid line), left turn (green dashed line), and right turn (red dotted line), to depict the AV’s performance at intersections and its precise maneuvering. The comparison demonstrates the ability of the proposed framework to handle different urban navigation situations with a high level of accuracy. The Argoverse and nuScenes datasets showed accurate operation in real-world scenarios, while the CARLA dataset confirmed the framework’s stability in simulated extreme conditions. These results confirm that the navigation framework is portable and flexible across various datasets and driving environments.

Figure 22 demonstrates a sophisticated multi-modal fusion architecture that addresses the critical challenge of environmental perception in autonomous vehicle navigation through a comprehensive RGB-depth-layout prediction pipeline. The framework begins with RGB imagery processing through a ResNet-50 backbone network, where convolutional layers extract hierarchical spatial features $F_{rgb} \in \mathbb{R}^{H \times W \times 512}$ from raw camera input. Simultaneously, monocular depth estimation is performed using a dedicated depth regression network that leverages the same ResNet-50 encoder but with specialized decoder layers to produce dense depth maps $D_{pred} \in \mathbb{R}^{H \times W \times 1}$, where depth values are normalized between 0 and 1, representing distances from 0 to 100 m. The innovation lies in the integration of prior environmental knowledge extraction, where semantic maps from training datasets are queried based on GPS coordinates and heading information to retrieve contextual priors $P_{prior} \in \mathbb{R}^{H \times W \times C}$, where $C$ represents the number of semantic classes, including roads, buildings, vehicles, and pedestrians. These priors are dynamically weighted based on spatial proximity and temporal consistency using a learned embedding network that maps geographic coordinates to semantic feature vectors. The self-attention fusion mechanism represents the core technical contribution, implementing a multi-head attention architecture where query matrices $Q = F_{rgb} W_{Q}$, key matrices $K = [F_{depth}; P_{prior}] W_{K}$, and value matrices $V = [F_{depth}; P_{prior}] W_{V}$ are computed through learned projection matrices. The attention weights $\alpha_{ij} = \mathrm{softmax}\left(\frac{Q_{i} K_{j}^{T}}{\sqrt{d_{k}}}\right)$ determine the importance of each depth and prior feature relative to RGB features, enabling adaptive feature selection based on scene context. The fused representation $F_{fused} = \sum_{j} \alpha_{ij} V_{j}$ incorporates both geometric constraints from depth and semantic constraints from priors, creating a comprehensive environmental understanding that captures both immediate sensor observations and historical knowledge patterns.

The integration into the PPO-GNN framework occurs through a carefully designed parameter extraction and embedding process. First, the fused features $F_{fused}$ are spatially pooled using learned attention pooling to create compact representations suitable for policy networks: $\theta_{layout} = \mathrm{GlobalAttentionPool}(F_{fused})$. Second, these layout embeddings are incorporated into the GNN architecture by augmenting node features $h_{v}$ with environmental context: $h_{v}^{enhanced} = [h_{v}^{original}; \theta_{layout}; \mathrm{spatial\_encoding}(v)]$. The GNN message passing then propagates this enriched information: $m_{uv} = \mathrm{MLP}([h_{u}^{enhanced}; h_{v}^{enhanced}; \mathrm{edge\_features}_{uv}])$, followed by node updates $h_{v}^{l+1} = \mathrm{GRU}\left(h_{v}^{l}, \sum_{u \in N(v)} m_{uv}\right)$. Finally, the PPO policy network receives the enhanced graph representation $G_{enhanced}$ along with traditional state observations $s_{t}$ to compute action probabilities: $\pi_{\theta}(a_{t} \mid s_{t}, G_{enhanced}) = \mathrm{softmax}(\mathrm{PolicyMLP}([s_{t}; \mathrm{ReadoutGNN}(G_{enhanced}); \theta_{layout}]))$.

The cross-dataset validation demonstrates the framework’s generalizability across CARLA (synthetic), nuScenes (real-world highway), and Argoverse (urban intersection) environments. Evaluation metrics include Intersection over Union (IoU) for semantic segmentation accuracy, pixel-wise accuracy for spatial precision, and downstream task performance measured through navigation success rates and collision avoidance effectiveness. The results show IoU scores of 0.87 (CARLA), 0.91 (nuScenes), and 0.84 (Argoverse), indicating robust cross-domain performance despite varying data distributions, lighting conditions, and scene complexities. The depth estimation achieves mean absolute errors of 0.12 m, 0.08 m, and 0.15 m, respectively, while the integrated PPO-GNN framework demonstrates 15% improved navigation performance compared to baseline methods that lack prior environmental integration. This comprehensive evaluation validates that the self-attention fusion mechanism successfully bridges the domain gap between datasets while maintaining the semantic consistency required for reliable autonomous navigation in diverse urban environments.
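The attention-based fusion step described for Figure 22 (queries from RGB features, keys and values from depth and prior features, softmax-weighted sum) can be sketched in a few lines. This is a single-head, dependency-free illustration under stated assumptions: features are flattened to lists of token vectors, depth and prior features share one feature width and are stacked along the token axis, and the projection matrices are supplied by the caller rather than learned. The function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(f_rgb, f_depth, p_prior, w_q, w_k, w_v):
    """Single-head cross-attention: RGB tokens attend to depth and prior tokens.

    Returns (fused, alpha), where alpha[i][j] is the weight that RGB token i
    assigns to context token j, and fused[i] = sum_j alpha[i][j] * V[j].
    """
    # Assumption: [F_depth; P_prior] stacks the two token lists.
    context = f_depth + p_prior
    q = matmul(f_rgb, w_q)
    k = matmul(context, w_k)
    v = matmul(context, w_v)
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # Q K^T
    alpha = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    fused = matmul(alpha, v)
    return fused, alpha


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    fused, alpha = fuse([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]],
                        identity, identity, identity)
    print(alpha, fused)
```

A multi-head version would run several such projections in parallel and concatenate the results; the pooling, GNN message passing, and policy head that follow in the text are not covered by this sketch.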

The datasets show how the system is able to take raw RGB inputs with no prior knowledge of depth, estimate depth, predict the layout of the environment, and assess accuracy by comparing the predicted layout with the actual layout. This analysis also validates real-time RGB-depth and layout estimation on the synthetic CARLA environment and the real Argoverse and nuScenes environments. CARLA demonstrated the framework’s reliability under worst-case conditions, while Argoverse and nuScenes confirmed that the proposed framework can tackle real-world challenges. The results demonstrate that the framework can generate repeatable and stable predictions across various aspects of urban environments.

The energy consumption attained is typically lower than in similar works: for example, Liu et al. [17] estimate that the average energy consumption of level 4–5 automated cars in dynamic, densely populated urban conditions is 13,500 J, whereas the proposed framework was more efficient by 5.9%. The navigation success rate was 93.6% in nuScenes and 91.2% in CARLA, the difference reflecting the challenging environment of the latter. On Argoverse, the framework obtained a 92.6% success rate, which demonstrates that it can perform efficiently in moderately complex real-world scenarios.

The overall success rate of 92.5% is higher than the success rate in other studies like Samma et al. [23], where the success rate of reinforcement learning-based navigation systems was 90.1% under the same circumstances.

The reward values were similar across the datasets; CARLA had the lowest value (−8.7) and nuScenes the highest (−8.1). These results suggest that the framework can find good solutions that minimize travel time, safety risks, and energy consumption in a reasonable amount of time, even in the worst case.

To further evaluate the framework, its performance was compared against three previous studies shown in Table 11.

In addition to the methods summarized in Table 11, we compared our framework with the recent work of Bilban and Inan [37], who proposed an improved PPO algorithm for AV navigation in CARLA simulations. Their approach achieved a travel time of approximately 460 s and a collision rate of 3.4%, with notable stability improvements through hyperparameter tuning. In contrast, URLNF achieved a shorter travel time (450 ± 25 s) while integrating GNN-based spatial-temporal modeling, UCARM for context-aware rewards, and DRAM for proactive risk assessment. While the collision rate was similar (3.6%), our approach demonstrated superior adaptability across multiple datasets, suggesting improved generalization potential beyond CARLA scenarios.

The proposed framework achieved, on average, 3% less travel time than approaches from the literature such as Xing et al. [33] and Liu et al. [17]. This improvement is mainly due to the route optimization provided by the GNN and PPO. The collision rates are lower than those of the previous studies; the average represents an improvement of 26.2% over Xing et al. [33], a clear indication that the DRAM is well suited to tackling collision risks. The framework’s energy consumption was 9.5% lower than in Liu et al. [17], which was made possible by UCARM discouraging waste and promoting efficiency. The proposed framework also had a 92.5% navigation success rate, 2.4 percentage points better than Samma et al. [23], which highlights its usefulness in several urban scenarios. Comparing the proposed framework with state-of-the-art methods reveals that it performs better on all crucial measures. Because the model can decrease travel time, mitigate collisions, optimize energy consumption, and achieve satisfactory navigation success, it is relevant for both real and simulated environments. The results demonstrate the effectiveness of combining reinforcement learning with a spatial-temporal graph-based model for autonomous vehicle navigation. Figure 23 shows the comparison with previous studies.
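Two of the comparisons above follow directly from figures stated in the text: the collision-rate improvement over Xing et al. [33] (4.2% versus the 3.1% average) and the success-rate gain over Samma et al. [23] (90.1% versus 92.5%). The sketch below only reproduces that arithmetic; the helper name is illustrative.

```python
def relative_reduction_pct(baseline, ours):
    """Relative reduction of a 'lower is better' metric, in percent."""
    return 100.0 * (baseline - ours) / baseline


# Collision rate: Xing et al. [33] reported 4.2%; the framework averaged 3.1%.
collision_improvement_pct = relative_reduction_pct(4.2, 3.1)

# Success rate: Samma et al. [23] reported 90.1%; the framework reached 92.5%.
# This difference is in percentage points, not a relative change.
success_gain_points = 92.5 - 90.1

if __name__ == "__main__":
    print(f"collision improvement: {collision_improvement_pct:.1f}%")
    print(f"success-rate gain: {success_gain_points:.1f} percentage points")
```

The first value rounds to 26.2% and the second to 2.4 percentage points, matching the figures quoted in the comparison.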

In addition to learning-based RL baselines, we compared the proposed URLNF framework against traditional rule-based controllers, specifically the Intelligent Driver Model (IDM) for longitudinal control and MOBIL for lane changing. As shown in Table 12, IDM + MOBIL achieved a navigation success rate of 85.4% with a collision rate of 5.8%, whereas URLNF improved the success rate by over 7 percentage points and reduced collisions by nearly half. This performance gap is attributed to URLNF’s spatial-temporal modeling and context-aware decision making, enabling adaptive responses to dynamic urban traffic that static rule-based approaches cannot match.
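For readers unfamiliar with the rule-based baseline, the standard IDM acceleration law can be sketched as follows. This is an illustration of the textbook model, not the configuration used in our experiments: all parameter defaults (`v0`, `T`, `a_max`, `b`, `s0`, `delta`) are assumed values, the clamp on the dynamic gap term is a common variant, and MOBIL lane changing is not shown.

```python
import math


def idm_acceleration(v, dv, gap, v0=13.9, T=1.5, a_max=1.0, b=2.0, s0=2.0, delta=4.0):
    """Intelligent Driver Model longitudinal acceleration (m/s^2).

    v   : ego speed (m/s)
    dv  : approach rate, ego speed minus leader speed (m/s)
    gap : bumper-to-bumper distance to the leader (m), must be > 0
    All default parameter values are illustrative assumptions.
    """
    # Desired gap: standstill distance + time headway + braking interaction.
    interaction = v * T + v * dv / (2.0 * math.sqrt(a_max * b))
    s_star = s0 + max(0.0, interaction)
    # Free-road term minus the interaction term with the leader.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


# Free road from standstill: acceleration approaches a_max.
print(idm_acceleration(v=0.0, dv=0.0, gap=1e9))
# Closing quickly on a nearby leader: strong braking (negative value).
print(idm_acceleration(v=10.0, dv=5.0, gap=10.0))
```

Because the rule depends only on the leader's gap and relative speed, it has no access to the intersection, pedestrian, or congestion context that the learned policy uses.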

Comparative Evaluation with Existing RL Models

To provide empirical evidence of the proposed framework’s advantages, we conducted a baseline comparison against standard deep reinforcement learning models: Deep Q-Networks (DQNs) and vanilla PPO without GNN integration. These models were trained under the same environmental settings using the Argoverse, nuScenes, and CARLA datasets for uniformity in evaluation.

Table 13 presents the averaged performance metrics across all scenarios. The results clearly demonstrate that URLNF outperforms both DQN and vanilla PPO in terms of travel time efficiency, collision reduction, and fuel consumption. The inclusion of Graph Neural Networks and the urban context-aware reward mechanism contributes significantly to the improved outcomes.

4.5. Ablation Study and Real-World Applicability

An ablation study was conducted to isolate the contributions of UCARM and DRAM. Removing DRAM increased collision rates by 25% relative to the full model, while removing UCARM reduced success rates and increased travel time. This confirms that both modules are essential for achieving optimal performance. Additionally, while CARLA simulations provide diverse urban scenarios, the absence of sensor-noise modeling may overstate real-world applicability. Future work will incorporate perception noise to validate safety thresholds under realistic urban sensing conditions.

The ablation study (Table 14) isolates the contributions of UCARM and DRAM to URLNF’s overall performance. The full URLNF achieved optimal results—427 ± 22 s for travel time, a 3.1% collision rate, and a 92.5% success rate—showing the synergy of both modules. Removing UCARM increased travel time and reduced success rate, confirming its role in travel efficiency and compliance with traffic rules. Removing DRAM raised the collision rate and further lowered success rate, highlighting its critical function in real-time risk mitigation. These results validate that both modules are essential for achieving superior safety and efficiency in dynamic urban navigation, directly differentiating URLNF’s hybrid approach from baseline models.

4.6. Impact of Look-Ahead Horizon Length

This section evaluates how varying the look-ahead planning horizon affects navigation performance in dense urban environments. The horizon length determines how far ahead the framework predicts and optimizes its trajectory at 10 Hz sampling. We compare success rates, travel times, and collision rates for 2 s, 5 s (baseline), and 10 s horizons.
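As a simple illustration of what each setting implies for the planner, the horizon in seconds can be converted into a number of prediction steps at the stated 10 Hz sampling rate. The function name `horizon_steps` is introduced here for illustration only.

```python
SAMPLING_HZ = 10  # sampling rate stated in the text


def horizon_steps(horizon_s: float, rate_hz: int = SAMPLING_HZ) -> int:
    """Number of discrete prediction steps covered by a look-ahead horizon."""
    return round(horizon_s * rate_hz)


steps = {h: horizon_steps(h) for h in (2, 5, 10)}
print(steps)
```

The 10 s setting therefore predicts five times as many steps as the 2 s setting, which is one way to read the trade-off between foresight and accumulated prediction error.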

Table 15 reports navigation success rate, average travel time, and collision rate for three horizon settings (2 s, 5 s, 10 s) at 10 Hz. Results are averaged over 10 runs per configuration with standard deviations for success rates. The baseline 5 s horizon provides the optimal trade-off between foresight and adaptability, yielding the highest success rate and lowest collision rate among the tested configurations.

Figure 24 shows that the 5 s horizon yields the highest navigation success rate, avoiding short-horizon deadlocks and long-horizon prediction errors.

Figure 25 indicates that average travel time is minimized at the 5 s horizon, with longer or shorter horizons slightly increasing trip duration.

Figure 26 reveals that collision rate is lowest at 5 s, confirming it as the optimal trade-off between foresight and real-time adaptability in dense urban navigation.

4.7. Collision Avoidance Action Analysis

Detailed analysis of collision avoidance maneuvers reveals distinct behavioral patterns across different threat scenarios. Lateral avoidance maneuvers comprised 34.2% of all collision avoidance actions in Argoverse, 41.7% in nuScenes, and 28.9% in CARLA, with average swerving angles of 12.3°, 8.7°, and 15.1°, respectively. Longitudinal control actions represented 45.1%, 38.2%, and 52.3% of responses, featuring mean deceleration rates of 4.2 m/s², 3.8 m/s², and 5.1 m/s² during moderate braking scenarios. Emergency braking events (deceleration > 7.0 m/s²) occurred in 8.3%, 6.1%, and 12.8% of collision scenarios, with reaction times averaging 0.156 s, 0.142 s, and 0.173 s from threat detection to brake engagement. Combined maneuvers integrating simultaneous steering and braking achieved the highest success rates (96.8%, 97.2%, and 94.5%) but required 23% longer execution times due to stability considerations. Lane change maneuvers demonstrated completion rates of 92.1%, 94.8%, and 89.3% with average lateral displacement distances of 3.2 m, 2.9 m, and 3.6 m while maintaining longitudinal speeds within 15% of target velocities.
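The two numeric criteria in this paragraph, the emergency-braking threshold and the speed tolerance during lane changes, can be written as small predicates. This is a sketch for illustration; the function names are hypothetical, and only the 7.0 m/s² threshold, the 15% tolerance, and the per-dataset mean decelerations are taken from the text.

```python
EMERGENCY_DECEL_MS2 = 7.0  # emergency-braking threshold quoted in the text (m/s^2)
SPEED_TOLERANCE = 0.15     # lane changes stayed within 15% of target speed


def is_emergency_braking(decel_ms2: float) -> bool:
    """True when a deceleration magnitude exceeds the emergency threshold."""
    return decel_ms2 > EMERGENCY_DECEL_MS2


def within_speed_tolerance(speed_ms: float, target_ms: float) -> bool:
    """True when speed deviates from the target by at most 15%."""
    return abs(speed_ms - target_ms) <= SPEED_TOLERANCE * target_ms


# Mean moderate-braking decelerations reported per dataset (m/s^2).
mean_decel = {"Argoverse": 4.2, "nuScenes": 3.8, "CARLA": 5.1}
for name, decel in mean_decel.items():
    margin = EMERGENCY_DECEL_MS2 - decel
    print(f"{name}: emergency={is_emergency_braking(decel)}, margin={margin:.1f} m/s^2")
```

All three reported means fall below the emergency threshold, consistent with their description as moderate braking.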

Figure 27 demonstrates a sophisticated collision avoidance analysis framework validated across three diverse autonomous vehicle datasets, providing empirical evidence of specific maneuver execution and effectiveness quantification that directly addresses concerns about vehicle navigation being limited to straight-line trajectories. The Argoverse dataset analysis (left panel) showcases urban intersection navigation with measured performance metrics, including a 0.156 s reaction time, 12.3-degree swerving angle, and 96.8% overall success rate, where the ground truth path (solid blue line) clearly demonstrates non-linear trajectory execution around red circular obstacles representing collision threats. The vehicle successfully executes a sequence of collision avoidance actions marked by gold star indicators: initial threat detection at coordinates (1.0, 0.8) followed by a swerving maneuver at (1.5, 1.5) involving a 12.3-degree steering angle deviation from the baseline trajectory, pedestrian avoidance through speed reduction at (2.5, 2.5), and path resumption at (3.0, 3.0), with each action demonstrating measurable deviations from straight-line motion that validate the framework’s capability to execute complex evasive maneuvers. The nuScenes dataset validation (center panel) focuses on highway merging scenarios with enhanced performance metrics showing a 0.142 s reaction time, 8.7-degree swerving angle, and 97.2% success rate, where the predicted path (red dashed line) closely follows the ground truth while maintaining safe distances from obstacles through dynamic lane change maneuvers. 
The collision avoidance sequence includes lane change initiation at (1.5, 1.0) in response to a slow-moving vehicle obstacle, cooperative merging behavior at (2.5, 2.0) that demonstrates multi-agent coordination capabilities, and final lane confirmation at (3.5, 3.0), with the predicted reinforcement learning path achieving 97.2% alignment with optimal collision avoidance trajectories compared to 73.1% for baseline methods that fail to adequately respond to dynamic traffic conditions. The CARLA dataset analysis (right panel) presents the most challenging edge-case scenarios with a 0.173 s reaction time, 15.1-degree maximum swerving angle, and 94.5% success rate under extreme complexity conditions, where the vehicle must navigate multiple simultaneous collision threats including sudden pedestrian crossings, erratic vehicle behavior, and complex urban obstacles. The technical implementation reveals that collision avoidance actions are not predetermined scripted responses but rather learned behaviors emerging from the integration of the UCARM and DRAM, where Bayesian inference continuously updates collision probability estimates

P (C_{t} | s_{t}, a_{t})

and triggers appropriate maneuver selection from the action space, including lateral avoidance (swerving), longitudinal control (braking), emergency responses, and combined maneuvers. The effectiveness zones surrounding each obstacle (depicted as light red circular areas) mark the spatial regions where different collision avoidance strategies are optimally deployed: high-effectiveness zones (closer to optimal paths) achieve 95–98% success rates, while challenging zones (requiring more complex maneuvers) maintain 87–92% effectiveness. These measurements provide quantitative evidence that the framework executes collision avoidance maneuvers that go beyond straight-line navigation and remain applicable across diverse urban traffic scenarios. Figure 27 shows the collision avoidance action analysis.
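The continuous update of the collision probability can be illustrated with a single application of Bayes' rule. This is a minimal sketch, not DRAM itself: the two-hypothesis form (collision versus safe), the likelihood values, and the function name are assumptions made for the example.

```python
def update_collision_probability(prior, p_obs_given_collision, p_obs_given_safe):
    """One Bayes-rule update of the collision probability after an observation.

    prior                 : current estimate of the collision probability
    p_obs_given_collision : likelihood of the observation if a collision is imminent
    p_obs_given_safe      : likelihood of the same observation if the scene is safe
    """
    evidence = p_obs_given_collision * prior + p_obs_given_safe * (1.0 - prior)
    if evidence == 0.0:
        return prior  # observation carries no information under either hypothesis
    return p_obs_given_collision * prior / evidence


# Hypothetical values: a low prior, then two consecutive threatening observations.
p = 0.05
history = [p]
for _ in range(2):
    p = update_collision_probability(p, p_obs_given_collision=0.9, p_obs_given_safe=0.1)
    history.append(p)
print([round(x, 3) for x in history])
```

Repeated threatening observations raise the estimate quickly, which is the behavior a maneuver-selection threshold would rely on.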

4.8. Multi-Agent Collision Avoidance Performance

The multi-agent environment validation demonstrates sophisticated coordination capabilities across 1200 simulated scenarios involving two to eight simultaneous agents. Pedestrian–vehicle interactions showed successful avoidance in 97.3% of cases, with pedestrians exhibiting predictive behavior by initiating evasive actions 2.1 s before potential collision, while vehicles maintained safe distances averaging 2.8 m during close encounters. Vehicle–vehicle coordination achieved 94.7% conflict resolution without external intervention, with agents successfully negotiating lane changes, merge scenarios, and intersection priorities through the distributed consensus mechanism. Communication effectiveness measured through V2X message success rates averaged 98.2% within the 50 m perception radius, with message latencies of 12.4 ms enabling real-time coordination. Priority-based resolution demonstrated clear hierarchical behavior: emergency vehicles achieved 100% priority compliance, passenger vehicles with higher occupancy received preference in 89.3% of conflicts, and commercial vehicles successfully coordinated in 91.8% of freight corridor scenarios. Deadlock prevention mechanisms activated in only 0.7% of multi-agent scenarios, with timeout-based resolution successfully preventing system stalls. The framework’s ability to handle complex multi-agent scenarios while maintaining individual agent autonomy validates the effectiveness of the distributed collision avoidance architecture in realistic urban traffic conditions. Figure 28 shows the multi-agent collision avoidance performance.
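The hierarchical behavior described above, emergency vehicles first and higher-occupancy vehicles next, can be sketched as a sort over conflicting agents. The `Agent` fields, the `resolve_priority` name, and the tie-break on the identifier are assumptions for illustration; the distributed consensus and timeout mechanisms are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    agent_id: str
    is_emergency: bool
    occupancy: int


def resolve_priority(agents):
    """Order conflicting agents: emergency first, then by descending occupancy.

    The final tie-break on agent_id is an assumption that keeps the order deterministic.
    """
    return sorted(agents, key=lambda a: (not a.is_emergency, -a.occupancy, a.agent_id))


conflict = [
    Agent("car_a", is_emergency=False, occupancy=1),
    Agent("ambulance", is_emergency=True, occupancy=2),
    Agent("car_b", is_emergency=False, occupancy=4),
]
order = [a.agent_id for a in resolve_priority(conflict)]
print(order)
```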

4.9. Collision Avoidance Maneuver Effectiveness

Quantitative analysis of individual collision avoidance maneuvers reveals performance variations based on scenario complexity and environmental conditions. Swerving maneuvers achieved success rates of 94.2% (simple obstacles), 89.7% (complex intersections), and 86.1% (dense traffic), with optimal swerving angles ranging from 8° to 15° depending on vehicle speed and available lateral space. Emergency braking demonstrated 98.3% effectiveness in preventing collisions when activated with a time to collision ≥ 1.2 s, degrading to 76.4% effectiveness when reaction time fell below 0.8 s. Lane change maneuvers showed dependency on traffic density, achieving 96.1% success in light traffic (≤15 vehicles/km), 88.7% in moderate traffic (15–30 vehicles/km), and 72.3% in dense traffic (>30 vehicles/km). Combined steering-braking maneuvers required an average of 3.4 s for completion but achieved the highest overall safety margins, maintaining minimum distances of 1.8 m from obstacles compared to 1.2 m for single-action responses. Speed adjustment strategies proved most effective in highway scenarios (94.8% success) compared to urban intersections (81.2% success), with optimal deceleration profiles following exponential decay functions that minimize passenger discomfort while ensuring collision avoidance. Figure 29 shows the collision avoidance maneuver effectiveness.
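Two quantities in this paragraph lend themselves to a short sketch: time to collision, computed here with the usual constant-speed definition (gap divided by closing speed), and the traffic-density bins. The function names and the assignment of the boundary values to bins are assumptions; the thresholds are those quoted in the text.

```python
import math

TTC_THRESHOLD_S = 1.2  # emergency braking was 98.3% effective at or above this value


def time_to_collision(gap_m: float, closing_speed_ms: float) -> float:
    """Constant-speed time to collision; infinite when the gap is not closing."""
    if closing_speed_ms <= 0.0:
        return math.inf
    return gap_m / closing_speed_ms


def density_class(vehicles_per_km: float) -> str:
    """Traffic-density bin used for the lane-change results."""
    if vehicles_per_km <= 15:
        return "light"
    if vehicles_per_km <= 30:
        return "moderate"
    return "dense"


ttc = time_to_collision(gap_m=12.0, closing_speed_ms=10.0)
print(ttc, ttc >= TTC_THRESHOLD_S, density_class(22))
```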

4.10. Discussions

The results show that the proposed framework meets the goal of optimizing AV navigation in terms of total travel time, safety, and energy consumption across multiple datasets and complex scenarios. UCARM in particular was instrumental in adapting the greedy algorithm to the dynamic structure and behavior of the urban environment, including real-time traffic information, road priority, and safety. This made it possible to make context-aware navigation decisions while minimizing unnecessary time on the road, thereby avoiding extra fuel consumption and an increased probability of accidents. DRAM reduced collision risks through Bayesian inference and Markov Decision Processes. By estimating possible collisions and penalizing risky actions, DRAM further improved system performance under different weather conditions and in dense traffic. The framework demonstrated its robustness and scalability by performing comparably on the real-world Argoverse and nuScenes datasets and on the synthetically designed edge-case scenarios in CARLA. On the real-world datasets, the framework improved navigation by leveraging sensor-rich data for dynamic environment sensing. In the more difficult synthetic environments, it operated under extreme conditions in which safety is the paramount concern while maintaining reasonable efficiency. The results support the proposed framework’s efficiency, and future work will extend its applicability to unstructured and highly dynamic contexts. Integrating real-time feedback mechanisms into the framework could enrich the decision-making process during interaction and allow unforeseen situations to be handled more effectively. Moreover, extending the model to cover different road networks, vehicle types, and culturally specific traffic rules would broaden its contribution to real-world self-driving navigation systems.

To assess the system’s responsiveness in critical real-time scenarios, we conducted a series of emergency simulations in the CARLA environment. These scenarios involved sudden appearances of pedestrians, animals, or cyclists crossing the vehicle’s path at close range (typically within 3–5 m), simulating urban uncertainties such as jaywalking or abrupt intrusions. The system’s reaction time was measured as the total time elapsed from the moment the threat was detected by onboard sensors to the execution of a corresponding control signal by the policy module.

Across 40 randomized trials, the model demonstrated an average reaction time between 0.152 and 0.176 s, depending on the complexity of the scene and proximity of the intruding object. This latency falls well within the commonly accepted benchmark (under 0.2 s) for safe, real-time operation in autonomous driving systems. The responsiveness is largely attributed to the integration of the DRAM with the UCARM, which together enable the system to evaluate collision probabilities and adapt actions dynamically.
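The measurement procedure amounts to differencing two timestamps per trial and comparing the result with the 0.2 s benchmark. The sketch below uses three hypothetical timestamp pairs chosen to fall inside the reported 0.152 to 0.176 s range; they are not measured data, and the function name is introduced for illustration.

```python
from statistics import mean

REAL_TIME_BENCHMARK_S = 0.2  # benchmark quoted in the text


def reaction_time(detect_t: float, control_t: float) -> float:
    """Elapsed time from threat detection to the issued control signal (s)."""
    if control_t < detect_t:
        raise ValueError("control signal cannot precede detection")
    return control_t - detect_t


# Hypothetical (detection, control) timestamps in seconds, for illustration only.
trials = [(10.000, 10.152), (22.400, 22.565), (31.050, 31.226)]
latencies = [reaction_time(d, c) for d, c in trials]
avg_latency = mean(latencies)
print(f"mean={avg_latency:.3f} s, all under benchmark: "
      f"{all(t < REAL_TIME_BENCHMARK_S for t in latencies)}")
```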

These findings suggest that the proposed framework is capable of timely evasive actions in high-risk, unpredictable environments. Nevertheless, additional evaluations are planned under more challenging conditions, such as high-speed driving, poor lighting, and adverse weather, to validate the robustness of the system in real-world deployment scenarios.

To assess scalability, we extended URLNF to a scenario with up to 10 concurrent AV agents in CARLA. The framework maintained above a 90% success rate for up to five AVs, with a linear increase in computation time per step (~6 ms/vehicle). Beyond seven AVs, success rates dropped due to inter-agent coordination complexity, indicating a need for explicit multi-agent communication protocols in future work.
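The linear cost can be put in context with a rough budget calculation. This sketch assumes that the control loop runs at the 10 Hz sampling rate mentioned in Section 4.6, giving a 100 ms step budget, and that there is no fixed overhead; both are assumptions, not measured properties of the system.

```python
PER_VEHICLE_MS = 6.0           # approximate per-vehicle cost quoted in the text
STEP_BUDGET_MS = 1000.0 / 10   # assumed: one step per 10 Hz sampling period


def step_time_ms(n_vehicles: int, per_vehicle_ms: float = PER_VEHICLE_MS,
                 fixed_overhead_ms: float = 0.0) -> float:
    """Linear per-step computation time for n concurrent AV agents."""
    return fixed_overhead_ms + per_vehicle_ms * n_vehicles


for n in (1, 5, 7, 10):
    t = step_time_ms(n)
    print(f"{n:2d} AVs: {t:5.1f} ms ({t / STEP_BUDGET_MS:.0%} of assumed budget)")
```

Under these assumptions, ten agents would use about 60 ms of a 100 ms step, which is consistent with attributing the drop in success rate to coordination complexity rather than to computation time alone.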

4.11. Limitations and Future Work

While the URLNF (Urban Reinforcement Learning Navigation Framework) demonstrates robust performance in dynamic urban environments using structured datasets like Argoverse, nuScenes, and CARLA, several limitations remain that impact its generalizability to real-world deployment scenarios.

Firstly, the current experiments are conducted within simulation platforms that, despite their sophistication, cannot fully replicate the diversity and unpredictability of real-world traffic. The tested scenarios are primarily based on predefined traffic flows and constrained event variations. This excludes a wide range of uncertainties such as unplanned road closures, spontaneous pedestrian crossings, inconsistent driver behaviors, or rapid weather changes, all of which are commonly encountered in real city environments.

Secondly, the sensory input used in the framework assumes near-perfect functioning of LiDAR, cameras, and radar modules. In practical deployment, these sensors can be affected by occlusions, heavy rain, snow, dirt, or lighting conditions. For instance, a fogged-up camera or LiDAR reflection in rain may lead to degraded perception. Such degradation is not currently modeled, making the perception module potentially fragile under harsh or unpredictable environments.

Moreover, the collision risk assessment module assumes a certain behavior distribution from surrounding agents, but in reality, human drivers may violate traffic rules or exhibit irrational maneuvers that are difficult to predict through standard models. Similarly, the energy optimization module operates under uniform battery performance assumptions without accounting for variable loads, terrain changes, or thermal effects on energy consumption.

In terms of data limitations, Argoverse was collected in US cities and nuScenes in Boston and Singapore. These datasets may not fully reflect traffic dynamics in other regions with different road infrastructures, cultural driving behaviors, or traffic control systems.

To address these gaps, future work will focus on the following directions:

Uncertainty Modeling: Incorporate robustness-aware reinforcement learning techniques to account for noisy sensor data, missing information, or false detections.
Adversarial Testing: Simulate aggressive or rule-violating agents to challenge the model’s decision-making robustness and adaptiveness in unexpected scenarios.
Weather and Environmental Variability: Integrate environmental perturbation models that emulate rain, fog, glare, and occlusions into the simulation loop.
Real-World Validation: Plan controlled pilot testing in small-scale real environments (e.g., closed campus roads or parking lots) to evaluate policy adaptation and perception reliability in live settings.
Cross-Domain Dataset Expansion: Extend training and evaluation to include datasets from varied geographies, driving cultures, and infrastructural layouts to ensure broader applicability and generalization.
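As an indication of how the uncertainty modeling direction above could start, a perturbation wrapper can inject Gaussian noise and random dropouts into an observation vector before it reaches the policy. This is a sketch under stated assumptions: the function name, the noise level, and the dropout probability are hypothetical, and realistic sensor models for rain, fog, or glare would be considerably more structured.

```python
import random


def perturb_observation(values, noise_std=0.05, dropout_prob=0.1, rng=None):
    """Return a noisy copy of an observation vector.

    Each reading is dropped (replaced by None) with probability dropout_prob,
    otherwise zero-mean Gaussian noise with standard deviation noise_std is added.
    """
    rng = rng or random.Random()
    noisy = []
    for v in values:
        if rng.random() < dropout_prob:
            noisy.append(None)  # missing detection
        else:
            noisy.append(v + rng.gauss(0.0, noise_std))
    return noisy


# Hypothetical range readings in meters; a fixed seed keeps the run repeatable.
ranges_m = [12.4, 8.9, 30.2, 5.1]
print(perturb_observation(ranges_m, rng=random.Random(0)))
```

Training or evaluating the policy on such perturbed inputs is one simple way to probe how far the reported safety margins depend on near-perfect sensing.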

These enhancements will help in evolving the current framework into a production-grade autonomous vehicle navigation system capable of safe, efficient, and adaptive operation across diverse urban environments.

5. Conclusions

This paper introduced a new method for the path planning of autonomous vehicles that addresses the challenges of dynamic urban environments. The combination of UCARM and DRAM was important in achieving a good balance of travel time, safety, and energy optimization. UCARM responds to real-time traffic situations and promotes safe and environmentally friendly navigation, while DRAM, through Bayesian inference and MDPs, estimates and avoids collision risks. The framework’s robustness and scalability were validated across three diverse datasets: Argoverse, nuScenes, and CARLA. On Argoverse and nuScenes, which represent real-world scenarios, the framework performed well by effectively exploiting sensor-rich data to navigate through traffic patterns. In CARLA, the framework handled edge cases involving extreme conditions such as dense traffic, adverse weather, and unpredictable pedestrian behavior. The findings show lower travel times and collision rates than those of previous studies, as well as consistent energy efficiency. The success of this system can be attributed to the use of PPO and a GNN to model the spatial-temporal relationships that were critical to the decision-making context of the traffic management system. The framework also opens new directions for future research. Future work will address its ability to operate in unstructured and highly dynamic environments such as rural roads or mixed-traffic systems. Feedback mechanisms for real-time learning would enable the framework to adapt to emergent situations. Moreover, generalizing the model to cover various road types, local traffic cultures, and multi-agent interactions will expand its real-world uses. This work provides a solid base for the creation of large-scale, robust autonomous navigation systems that are ready for practical utilization.

Future work can explore integrating fuzzy-based decision systems and metaheuristic optimization techniques with reinforcement learning to further enhance adaptability in highly dynamic or uncertain urban environments. Fuzzy logic can help AVs handle ambiguous scenarios, such as interpreting pedestrian intent or negotiating right-of-way in unregulated intersections, while metaheuristic algorithms like Genetic Algorithms, PSO, or Ant Colony Optimization can be used to optimize hyperparameters, reward weights, or route selections in real time. Moreover, deployment in real-world AV testbeds equipped with V2X communication infrastructure will be critical to evaluate system performance under actual traffic conditions. Implementing the proposed framework on edge devices (e.g., Nvidia Jetson, Qualcomm AI platforms) will allow low-latency decision making directly onboard AVs, supporting faster reactions in dense traffic. Additionally, future research should focus on expanding inter-agent communication protocols, enabling cooperative decision making between AVs and infrastructure. Considering cultural-specific driving behaviors (e.g., lane discipline, overtaking habits) will also improve model realism, generalizability, and policy transfer across different countries and urban settings.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available at https://www.kaggle.com/datasets/alechantson/carladataset and https://www.kaggle.com/datasets/mitanshuchakrawarty/nuscenes (accessed on 9 July 2025).

Conflicts of Interest

The author declares no conflicts of interest.

References

Orieno, O.H.; Ndubuisi, N.L.; Ilojianya, V.I.; Biu, P.W.; Odonkor, B. The future of autonomous vehicles in the u.s. urban landscape: A review: Analyzing implications for traffic, urban planning, and the environment. Eng. Sci. Technol. J. 2024, 5, 43–64.
Van Thanh, N.; Tran, T.M.L. Real-time Trajectory Planning for Autonomous Vehicles in Dynamic Traffic Environments: A Survey of Modern Algorithms and Predictive Techniques. J. Intell. Connect. Emerg. Technol. 2022, 7, 1–25.
Urmson, C.; Anhalt, J.; Bagnell, D.; Baker, C.; Bittner, R.; Clark, M.N.; Dolan, J.; Duggins, D.; Galatali, T.; Geyer, C.; et al. Autonomous driving in urban environments: Boss and the urban challenge. J. Field Robot. 2008, 25, 425–466.
Alghodhaifi, H.; Lakshmanan, S. Holistic Spatio-Temporal Graph Attention for Trajectory Prediction in Vehicle–Pedestrian Interactions. Sensors 2023, 23, 7361.
Rammohan, A. Revolutionizing Intelligent Transportation Systems with Cellular Vehicle-to-Everything (C-V2X) technology: Current trends, use cases, emerging technologies, standardization bodies, industry analytics and future directions. Veh. Commun. 2023, 43, 100638.
Chen, W.H. Perspective view of autonomous control in unknown environment: Dual control for exploitation and exploration vs reinforcement learning. Neurocomputing 2022, 497, 50–63.
Liu, W.R.; Qin, G.R.; He, Y.; Jiang, F. Distributed cooperative reinforcement learning-based traffic signal control that integrates V2X networks’ dynamic clustering. IEEE Trans. Veh. Technol. 2017, 66, 8667–8681.
Gholamhosseinian, A.; Jochen, S. A comprehensive survey on cooperative intersection management for heterogeneous connected vehicles. IEEE Access 2022, 10, 7937–7972.
Osman, M. Controlling uncertainty: A review of human behavior in complex dynamic environments. Psychol. Bull. 2010, 136, 65.
Papadimitriou, D.; Pierre, D. FUTURE MOVE: A Review of the Main Trends in the Automotive Sector at Horizon 2030 in the Great Region; University of Liège: Arlon, Belgium, 2022.
Owens, N.D.; Armstrong, A.H.; Mitchell, C.; Brewster, R. Federal Highway Administration Focus States Initiative: Traffic Incident Management Performance Measures Final Report; No. FHWA-HOP-10-010; United States Federal Highway Administration: Washington, DC, USA, 2009.
Durlik, I.; Miller, T.; Kostecka, E.; Tuński, T. Artificial Intelligence in Maritime Transportation: A Comprehensive Review of Safety and Risk Management Applications. Appl. Sci. 2024, 14, 8420.
Lakmali, R.G.N.; Genovese, P.V.; Abewardhana, A.A.B.D.P. Evaluating the Efficacy of Agent-Based Modeling in Analyzing Pedestrian Dynamics within the Built Environment: A Comprehensive Systematic Literature Review. Buildings 2024, 14, 1945.
Wu, J.H.; Ye, Y.; Du, J. Multi-objective reinforcement learning for autonomous drone navigation in urban areas with wind zones. Autom. Constr. 2024, 158, 105253.
Louati, A.; Louati, H.; Kariri, E.; Neifar, W.; Hassan, M.K.; Khairi, M.H.H.; Farahat, M.A.; El-Hoseny, H.M. Sustainable Smart Cities through Multi-Agent Reinforcement Learning-Based Cooperative Autonomous Vehicles. Sustainability 2024, 16, 1779.
Singh, J. Autonomous Vehicles and Smart Cities: Integrating AI to Improve Traffic Flow, Parking, and Environmental Impact. J. AI-Assist. Sci. Discov. 2024, 4, 65–105.
Liu, H.; Shen, Y.; Zhou, W.; Zou, Y.; Zhou, C.; He, S. Adaptive speed planning for unmanned vehicle based on deep reinforcement learning. arXiv 2024, arXiv:2404.17379.
D’Alfonso, L.; Giannini, F.; Franzè, G.; Fedele, G.; Pupo, F.; Fortino, G. Autonomous vehicle platoons in urban road networks: A joint distributed reinforcement learning and model predictive control approach. IEEE/CAA J. Autom. Sinica 2024, 11, 141–156.
Golchoubian, M.; Ghafurian, M.; Dautenhahn, K.; Azad, N.L. Uncertainty-Aware DRL for Autonomous Vehicle Crowd Navigation in Shared Space. IEEE Trans. Intell. Veh. 2024, 9, 7931–7944.
Jiang, H.; Zhang, H.; Feng, Z.; Zhang, J.; Qian, Y.; Wang, B. A Multi-Objective Optimal Control Method for Navigating Connected and Automated Vehicles at Signalized Intersections Based on Reinforcement Learning. Appl. Sci. 2024, 14, 3124.
Wu, J.H.; Ye, Y.; Du, J. Autonomous drones in urban navigation: Autoencoder learning fusion for aerodynamics. J. Constr. Eng. Manag. 2024, 150, 04024067.
Mishra, P.; Boopal, B.; Mishra, N. Real-Time 3D Routing Optimization for Unmanned Aerial Vehicle using Machine Learning. EAI Endorsed Trans. Scalable Inf. Syst. 2024, 11.
Samma, H.; El-Ferik, S. Autonomous UAV Visual Navigation Using an Improved Deep Reinforcement Learning. IEEE Access 2024, 12, 79967–79977.
Ghintab, S.S.; Hassan, M.Y. PID-like IT2FLC-Based Autonomous Vehicle Control in Urban Areas. Arab. J. Sci. Eng. 2025, 50, 11001–11017.
He, X.; Huang, W.; Lv, C. Trustworthy autonomous driving via defense-aware robust reinforcement learning against worst-case observational perturbations. Transp. Res. Part C Emerg. Technol. 2024, 163, 104632.
Ge, L.; Zhou, X.; Li, Y.; Wang, Y. Deep reinforcement learning navigation via decision transformer in autonomous driving. Front. Neurorobot. 2024, 18, 1338189.
Chowdhury, J.; Shivaraman, V.; Dangi, S.; Sujit, P.B. Deep Attention Driven Reinforcement Learning (DAD-RL) for Autonomous Decision-Making in Dynamic Environment. arXiv 2024, arXiv:2404.00340.
Li, D.; Zhu, F.; Wu, J.; Wong, Y.D.; Chen, T. Managing mixed traffic at signalized intersections: An adaptive signal control and CAV coordination system based on deep reinforcement learning. Expert Syst. Appl. 2024, 238, 121959.
Sutikno, T. Fuzzy optimization and metaheuristic algorithms. Babylon. J. Math. 2023, 59–65.
Plattner, B. Vehicle Platooning using Multi-Agent Reinforcement Learning: A Study on Autonomous Driving in the CARLA Simulator. Ph.D. Dissertation, OST Ostschweizer Fachhochschule, Buchs, Switzerland, 2023.
Wang, Y.; Jiang, J.; Li, S.; Li, R.; Xu, S.; Wang, J.; Li, K. Decision-making driven by driver intelligence and environment reasoning for high-level autonomous vehicles: A survey. IEEE Trans. Intell. Transp. Syst. 2023, 24, 10362–10381.
Savithramma, R.M.; Sumathi, R.; Sudhira, H.S. Reinforcement learning based traffic signal controller with state reduction. J. Eng. Res. 2023, 11, 100017.
Xing, H.; Chen, A.; Zhang, X. RL-GCN: Traffic flow prediction based on graph convolution and reinforcement learning for smart cities. Displays 2023, 80, 102513.
Yan, L.; Zhu, L.; Song, K.; Yuan, Z.; Yan, Y.; Tang, Y. Graph cooperation deep reinforcement learning for ecological urban traffic signal control. Appl. Intell. 2023, 53, 6248–6265.
Vieira, M.A.; Galvão, G.; Vieira, M.; Louro, P.; Vestias, M.; Vieira, P. Enhancing Urban Intersection Efficiency: Visible Light Communication and Learning-Based Control for Traffic Signal Optimization and Vehicle Management. Symmetry 2024, 16, 240.
Yadav, P.; Mishra, A.; Kim, S. A Comprehensive Survey on Multi-Agent Reinforcement Learning for Connected and Automated Vehicles. Sensors 2023, 23, 4710.
Bilban, M.; İnan, O. Optimizing Autonomous Vehicle Performance Using Improved Proximal Policy Optimization. Sensors 2025, 25, 1941.

Figure 1. Market size of AVs (billions USD) vs. years (2020–2030) [10].

Figure 2. A typical flow architectural diagram of a traditional RL-controlled AV-based urban traffic management system.

Figure 3. Research framework overview.

Figure 4. Sample scenarios from the Argoverse dataset.

Figure 5. Sensor data and field of view representations from the nuScenes dataset.

Figure 6. Urban traffic scene from the CARLA synthetic dataset.

Figure 7. Travel time optimization—total vs. effective travel time.

Figure 8. Collision avoidance—posterior collision probability over time.

Figure 9. Energy efficiency—cumulative energy consumption over time.

Figure 10. Urban context-aware reward mechanism over time.

Figure 11. Workflow of proximal policy optimization (PPO).

Figure 12. Graph Neural Network (GNN) for spatial-temporal modeling.

Figure 13. Urban context-aware reward mechanism (UCARM).

Figure 14. Dynamic Risk Assessment Module (DRAM).

Figure 15. Performance trends of the proposed framework across urban scenarios in the Argoverse dataset.

Figure 16. Performance metrics of the proposed framework across urban scenarios in the nuScenes dataset.

Figure 17. Comparative performance of the proposed framework across urban scenarios in the CARLA dataset.

Figure 18. Enhanced comparative analysis of URLNF performance across Argoverse, nuScenes, and CARLA datasets.

Figure 19. Optimized autonomous vehicle (AV) navigation paths across Argoverse, nuScenes, and CARLA datasets.

Figure 20. Comparison of vehicle behavior prediction across Argoverse, nuScenes, and CARLA datasets.

Figure 21. Comparison of vehicle navigation paths across Argoverse, nuScenes, and CARLA datasets.

Figure 22. Cross-dataset comparison of RGB imagery, depth maps, predicted layouts, and ground truth layouts.

Figure 23. Comparison with previous studies [17,23,33].

Figure 24. Navigation success rate versus look-ahead horizon length (2 s, 5 s, 10 s) at 10 Hz sampling. The smoothed curve illustrates the trend, with shaded bands representing ±1σ. The 5 s horizon achieves the highest success rate by balancing planning depth with prediction stability.

Figure 25. Average travel time versus look-ahead horizon length (2 s, 5 s, 10 s) at 10 Hz sampling. The 5 s horizon minimizes travel time by avoiding short-horizon deadlocks while reducing long-horizon prediction uncertainty.

Figure 26. Collision rate versus look-ahead horizon length (2 s, 5 s, 10 s) at 10 Hz sampling. The lowest collision rate is achieved at the 5 s horizon, highlighting its optimal trade-off between proactive risk avoidance and planning foresight.

Figure 27. Collision avoidance action analysis.

Figure 28. Multi-agent collision avoidance performance.

Figure 29. Collision avoidance maneuver effectiveness.

Table 1. Summary of autonomous vehicle studies in urban traffic systems.

Study	Dataset	Approach	Limitation
[7]	Custom simulation	Deep RL-based adaptive speed planning	Focuses only on speed, not full decision pipeline
[16]	Not specified	AI-driven integration for smart city AVs	Generalized concept, lacks implementation detail
[18]	Urban road network models	Joint distributed RL + MPC for platoons	Limited scalability for large platoons
[19]	Shared space simulation	Uncertainty-aware DRL crowd navigation	No real-world validation yet
[20]	Signalized intersection sim	Multi-objective RL control for CAVs	Limited generalization across cities
[21]	Urban drone environments	Autoencoder fusion for drone aerodynamics	Applies to aerial rather than ground vehicles

Table 2. Summary of reinforcement learning techniques for traffic optimization.

Study	Dataset	Approach	Limitation
[25]	Simulated AV data with perturbations	Robust RL against worst-case observation noise	Focuses only on observational noise, not actuator issues
[26]	Autonomous driving navigation datasets	Decision transformer for RL-based navigation	Limited to learning from offline trajectories
[27]	Custom dynamic driving environment (simulated)	Deep attention-driven RL for dynamic decision making	Lacks real-world or physical environment deployment
[29]	Not dataset-specific; optimization theory-based	Fuzzy logic and metaheuristic optimization algorithms	Not tailored for AVs; focuses on general optimization

Table 3. Comparative table of previous studies.

Studies	Technique	Results	Limitations	Findings
[14]	MORL	Optimized safety, energy efficiency, and travel time in urban drone navigation	High computational intensity limits real-time application in dense urban systems	Potential for AV application but needs scalability improvements
[16]	DRL for path planning	Optimized real-time vehicle routes ensuring safety and efficiency	Dependency on large datasets and computational resources	DRL can enhance path planning but requires better resource optimization
[17]	Adaptive speed planning using DRL	Enhanced traffic flow by dynamically adjusting vehicle speeds	Ineffective in extreme situations like sudden pedestrian crossings or accidents	Effective for dynamic speed adjustments; needs improvement for rare situations
[18]	Distributed DRL with MPC	Synchronized vehicle movements, enhancement of data, and reduced fuel consumption	Reliance on robust vehicle-to-infrastructure communication creates risk during failures	Integration of MPC with DRL shows promise but is vulnerable to communication issues
[19]	Uncertainty-aware DRL model	Effective navigation in environments with dynamic hurdles	Decreased efficiency in environments with heightened uncertainty	Suitable for controlled settings
[35]	MARL	Improved traffic flow and reduced congestion through cooperative AV behavior	Scalability challenges with increased agents in complex urban networks	Mentioned the need for robust coordination and communication mechanisms

Table 4. Dataset specifications.

Dataset	Type	Size (GB)	Key Features	Applications
Argoverse	Real-world	400	3D Tracking, Lane Geometry, Map Priors, Road Topology	Multi-Agent Navigation, Urban Road Interactions
nuScenes	Real-world	100	Multi-sensor (Lidar, Radar, Camera), Traffic Signal States	Traffic Behavior Prediction, Sensor Fusion
CARLA	Synthetic Simulation	Variable	Dynamic Traffic Patterns, Pedestrian Behaviors, Weather Variations	Edge-case Scenarios, Reinforcement Learning Testing

Table 5. Parameter descriptions.

Parameter	Description	Value/Range
$d_{t}$	Distance traveled	0.1–5 km
$v_{t}$	Vehicle speed	10–60 km/h
$w_{t}$	Waiting time at signals	0–2 min
$p_{t}$	Road priority weight	0.1–1.0
$λ_{1}$	Travel time weight	0.5, 0.2, 0.1
$λ_{2}$	Safety weight (highest priority)
$λ_{3}$	Energy efficiency weight
$u_{t}$	For travel-time weight
$β_{s}$	For safety weight
$γ_{e}$	For energy weight
$E_{max}$	Maximum allowable energy consumption	Variable
$T_{eff}^{1}$	Effective travel time considering road priority weights	Derived per Equation (7)
$P (C_{t} ∣ s_{t}, a_{t})$	Posterior collision probability from DRAM	0.0–1.0
$E_{t}$	Energy consumption at time $t$	J (joules)

Table 6. Simulation scenarios and metrics.

Scenario	Conditions	Metrics Evaluated
Dense traffic	High vehicle density	Travel time, collision rate
Mixed pedestrian zones	Random pedestrian movements	Safety, collision rate
Dynamic weather conditions	Rain, fog, and low visibility scenarios	Energy efficiency, success rate

Table 7. Performance metrics of the proposed URLNF framework across diverse urban scenarios in the Argoverse dataset.

Metric	Dense Traffic	Mixed Zones	Dynamic Weather	Average Across Scenarios
Travel Time (s)	420 ± 20	400 ± 15	440 ± 25	420 ± 20
Collision Rate (%)	2.5	3.1	3.8	3.1
Energy Consumption (J)	12,000 ± 500	11,000 ± 450	12,500 ± 600	11,833 ± 550
Navigation Success (%)	94.2	92.5	91.0	92.6
Reward Value (R)	−8.4	−8.2	−8.6	−8.4
Braking Events (Count)	28 ± 3	32 ± 4	35 ± 5	31.7 ± 4
Acceleration Events (Count)	45 ± 5	47 ± 6	49 ± 7	47.0 ± 6
Fuel Efficiency (km/L)	14.5 ± 0.5	15.2 ± 0.6	13.8 ± 0.7	14.5 ± 0.6
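
The "Average Across Scenarios" column of Table 7 can be checked with simple arithmetic. The following minimal sketch assumes the column is the unweighted mean of the three scenario values, which the source does not state explicitly; the dictionary keys and the helper name `scenario_average` are illustrative only.

```python
from statistics import mean

# Per-scenario means copied from Table 7, in the order:
# dense traffic, mixed zones, dynamic weather.
table7 = {
    "Travel Time (s)": [420, 400, 440],
    "Collision Rate (%)": [2.5, 3.1, 3.8],
    "Energy Consumption (J)": [12000, 11000, 12500],
    "Navigation Success (%)": [94.2, 92.5, 91.0],
    "Fuel Efficiency (km/L)": [14.5, 15.2, 13.8],
}


def scenario_average(values):
    """Unweighted mean across scenarios (an assumption about how the column was built)."""
    return mean(values)


averages = {metric: scenario_average(vals) for metric, vals in table7.items()}

for metric, avg in averages.items():
    print(f"{metric}: {avg:.1f}")
```

Under this assumption the computed means round to the reported averages for these five metrics. The ± terms are not recomputed here because the source does not describe how they were aggregated.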

Table 8. Evaluation of the proposed URLNF framework across scenarios using the nuScenes dataset.

Metric	Dense Traffic	Mixed Zones	Dynamic Weather	Average Across Scenarios
Travel Time (s)	390 ± 18	410 ± 22	430 ± 20	410 ± 20
Collision Rate (%)	1.8	2.4	3.2	2.5
Energy Consumption (J)	11,500 ± 400	12,000 ± 450	12,300 ± 500	11,933 ± 450
Navigation Success (%)	95.5	93.8	91.5	93.6
Reward Value (R)	−8.0	−7.9	−8.3	−8.1
Lane Changes (Count)	12 ± 2	15 ± 3	18 ± 3	15 ± 3
Pedestrian Interactions (Count)	8 ± 1	12 ± 2	14 ± 3	11 ± 2
Fuel Efficiency (km/L)	15.5 ± 0.6	14.8 ± 0.5	14.0 ± 0.7	14.8 ± 0.6

Table 9. Performance of the proposed URLNF framework across scenarios using the CARLA dataset.

Metric	Dense Traffic	Mixed Zones	Dynamic Weather	Average Across Scenarios
Travel Time (s)	450 ± 25	440 ± 20	460 ± 30	450 ± 25
Collision Rate (%)	3.0	3.5	4.2	3.6
Energy Consumption (J)	12,500 ± 550	13,000 ± 600	13,500 ± 650	13,000 ± 600
Navigation Success (%)	92.8	91.2	89.5	91.2
Reward Value (R)	−8.7	−8.6	−8.9	−8.7
Lane Changes (Count)	20 ± 4	25 ± 5	30 ± 6	25 ± 5
Pedestrian Interactions (Count)	15 ± 3	20 ± 4	22 ± 5	19 ± 4
Fuel Efficiency (km/L)	13.5 ± 0.7	13.0 ± 0.6	12.2 ± 0.8	12.9 ± 0.7

Table 10. Summary of AV performance across datasets.

Metric	Argoverse	nuScenes	CARLA	Best Value
Travel Time (s)	420 ± 20	410 ± 20	450 ± 25	nuScenes
Collision Rate (%)	3.1	2.5	3.6	nuScenes
Energy Consumption (J)	11,833 ± 550	11,933 ± 450	13,000 ± 600	Argoverse
Navigation Success (%)	92.6	93.6	91.2	nuScenes
Reward Value (R)	−8.4	−8.1	−8.7	nuScenes
Fuel Efficiency (km/L)	14.5 ± 0.6	14.8 ± 0.6	12.9 ± 0.7	nuScenes

Bold values indicate the best performance achieved for each metric across the three datasets. Lower values are better for Travel Time, Collision Rate, and Energy Consumption, while higher values are better for Navigation Success, Fuel Efficiency, and Reward Value (less negative).

Table 11. Comparison with previous studies.

Metric	Proposed Framework	Xing et al. [33]	Liu et al. [17]	Samma et al. [23]	Bilban and İnan [37]
Travel time (s)	427 ± 22	440 ± 25	450 ± 30	435 ± 28	460 ± 25
Collision rate (%)	3.1	4.2	3.8	4.0	3.4
Energy consumption (J)	12,222 ± 533	13,500	13,000	12,900	13,200
Navigation success (%)	92.5	90.5	91.0	90.1	91.8
Core method	PPO + GNN + UCARM + DRAM	DRL + MPC	Adaptive speed DRL	Improved DRL visual navigation	Improved PPO
Key differentiator	Spatial-temporal modeling, dynamic reward and risk modules	Coordinated platoon control	Speed optimization under real-time conditions	Visual adaptability in UAV and AV navigation	PPO tuned for stability, no GNN or contextual modules
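
The "Proposed Framework" column in Table 11 appears to aggregate the per-dataset averages of Table 10. The sketch below tests that reading by taking an unweighted mean over Argoverse, nuScenes, and CARLA; the averaging rule and all variable names are assumptions, not the author's stated procedure.

```python
# Per-dataset averages from Table 10, in the order: Argoverse, nuScenes, CARLA.
table10 = {
    "travel_time_s": [420, 410, 450],
    "collision_rate_pct": [3.1, 2.5, 3.6],
    "energy_j": [11833, 11933, 13000],
    "navigation_success_pct": [92.6, 93.6, 91.2],
}

# Values reported for the proposed framework in Table 11.
table11_reported = {
    "travel_time_s": 427,
    "collision_rate_pct": 3.1,
    "energy_j": 12222,
    "navigation_success_pct": 92.5,
}


def cross_dataset_mean(values):
    """Unweighted mean over the three datasets (assumed aggregation rule)."""
    return sum(values) / len(values)


computed = {name: cross_dataset_mean(vals) for name, vals in table10.items()}

for name, value in computed.items():
    gap = value - table11_reported[name]
    print(f"{name}: computed={value:.1f} reported={table11_reported[name]} gap={gap:+.1f}")
```

With this rule, travel time, collision rate, and navigation success agree with Table 11 after rounding. The energy mean comes out near 12,255 J rather than the reported 12,222 J; the source does not explain how the reported value was derived, so the difference is best treated as an open question rather than resolved here.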

Table 12. Performance comparison of URLNF and traditional rule-based controllers in dynamic urban environments.

Controller	Travel Time (s)	Collision Rate (%)	Navigation Success Rate (%)
IDM + MOBIL	470 ± 25	5.8	85.4
URLNF (Proposed)	427 ± 22	3.1	92.5
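
To make the gap between the two controllers in Table 12 easier to read, the differences can be expressed as relative reductions. This is a minimal sketch using only the mean values from the table; the helper `relative_reduction` is a hypothetical name, and the ± spreads are ignored.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Mean values from Table 12: IDM + MOBIL (baseline) vs. URLNF (proposed).
travel_gain = relative_reduction(470, 427)      # travel time, s
collision_gain = relative_reduction(5.8, 3.1)   # collision rate, %
success_gain_points = 92.5 - 85.4               # success rate, percentage points

print(f"Travel time reduction:    {travel_gain:.1f} %")
print(f"Collision rate reduction: {collision_gain:.1f} %")
print(f"Success rate increase:    {success_gain_points:.1f} points")
```

Read this way, the table corresponds to roughly a 9% shorter travel time and a 47% lower collision rate, alongside a success rate about 7 percentage points higher. These figures are derived from the reported means only and say nothing about statistical significance.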

Table 13. Comparative evaluation of URLNF vs. existing RL methods (average metrics).

Model	Travel Time (s)	Collision Rate (%)	Reward Value (R)	Fuel Efficiency (km/L)
DQN	460 ± 30	4.8	−9.2	13.2 ± 0.8
PPO	435 ± 25	3.9	−8.7	13.9 ± 0.6
URLNF	420 ± 20	3.1	−8.4	14.5 ± 0.6

Table 14. Ablation study results showing the individual impact of UCARM and DRAM modules on URLNF performance in dynamic urban navigation.

Framework Variant	Travel Time (s)	Collision Rate (%)	Success Rate (%)
URLNF (full)	427 ± 22	3.1	92.5
URLNF—UCARM only	432 ± 24	3.5	90.8
URLNF—DRAM only	435 ± 25	3.9	90.2

Table 15. Sensitivity of navigation success rate to look-ahead horizon length.

Horizon Length	Steps @ 10 Hz	Navigation Success Rate (%)	Average Travel Time (s)	Collision Rate (%)
2 s	20	86.8 ± 1.2	438 ± 24	4.6
5 s (baseline)	50	92.5 ± 0.8	427 ± 22	3.1
10 s	100	91.4 ± 1.0	430 ± 23	3.3
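
The "Steps @ 10 Hz" column follows directly from the horizon length and the sampling rate, and the preferred horizon can be read off by comparing rows. The sketch below illustrates both steps with the mean values from Table 15; the record layout and the function name `horizon_steps` are illustrative assumptions.

```python
SAMPLING_HZ = 10  # sampling rate stated in Table 15

# Mean values from Table 15 (± terms omitted).
rows = [
    {"horizon_s": 2, "success_pct": 86.8, "travel_time_s": 438, "collision_pct": 4.6},
    {"horizon_s": 5, "success_pct": 92.5, "travel_time_s": 427, "collision_pct": 3.1},
    {"horizon_s": 10, "success_pct": 91.4, "travel_time_s": 430, "collision_pct": 3.3},
]


def horizon_steps(horizon_s, hz=SAMPLING_HZ):
    """Number of planning steps covered by a look-ahead horizon."""
    return int(round(horizon_s * hz))


step_counts = [horizon_steps(r["horizon_s"]) for r in rows]

# Pick the horizon that is best on each metric independently.
best_success = max(rows, key=lambda r: r["success_pct"])
best_travel = min(rows, key=lambda r: r["travel_time_s"])
best_collision = min(rows, key=lambda r: r["collision_pct"])

print(step_counts)
print(best_success["horizon_s"], best_travel["horizon_s"], best_collision["horizon_s"])
```

All three criteria select the same 5 s row, which is consistent with the captions of Figures 24 to 26.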

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Published by MDPI on behalf of the World Electric Vehicle Association. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Alsuwaiket, M.A. Optimizing Autonomous Vehicle Navigation Through Reinforcement Learning in Dynamic Urban Environments. World Electr. Veh. J. 2025, 16, 472. https://doi.org/10.3390/wevj16080472

