Article

PAIR: A Hybrid A* with PPO Path Planner for Multi-UAV Navigation in 2-D Dynamic Urban MEC Environments

by
Bahaa Hussein Taher
1,2,*,
Juan Luo
1,
Ying Qiao
1 and
Hussein Ridha Sayegh
1
1
College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
2
College of Computer Science and Information Technology, University of Sumer, Al Rifaee 64005, Iraq
*
Author to whom correspondence should be addressed.
Drones 2026, 10(1), 58; https://doi.org/10.3390/drones10010058
Submission received: 6 December 2025 / Revised: 26 December 2025 / Accepted: 7 January 2026 / Published: 13 January 2026
(This article belongs to the Special Issue Path Planning, Trajectory Tracking and Guidance for UAVs: 3rd Edition)

Highlights

What are the main findings?
  • PAIR, which combines A* for global routing with PPO for local adjustments, reaches 100% success in nine urban test cases and beats A*, D* Lite, CBS-D*, and PSO in overall path quality, energy use, and travel time.
  • In tests with 5 to 30 moving obstacles, PAIR keeps paths shorter and replanning times lower than the others, showing its reliability for busy city drone flights.
What is the implication of the main finding?
  • Combining a fast discrete backbone with lightweight continuous RL refinements has potential for energy-efficient and low-latency navigation in cluttered MEC-inspired low-altitude airspace.
  • The PAIR framework and released 2-D MEC-inspired benchmark provide a reproducible testbed for future research on dynamic multi-UAV path planning, including extensions to full 3-D kinematics, robustness to perception uncertainty, the joint optimization of trajectories, and computation offloading in swarm-scale deployments.

Abstract

Emerging multi-unmanned aerial vehicle (multi-UAV) applications in smart cities must navigate cluttered airspace while meeting tight mobile edge computing (MEC) deadlines. Classical grid planners, including A-star (A*), D-star Lite (D* Lite), and conflict-based search with D-star Lite (CBS-D*), and metaheuristics such as particle swarm optimization (PSO), either replan too slowly in dynamic scenes or waste energy on long detours. This paper presents PPO-adjusted incremental refinement (PAIR), a decentralized hybrid planner that couples an A* global backbone with a continuous PPO refinement module for multi-UAV navigation on two-dimensional (2-D) urban grids. A* produces feasible waypoint routes, while a shared risk-aware PPO policy applies local offsets from a compact state encoding. MEC tasks are allocated by a separate heterogeneous scheduler; PPO optimizes geometric objectives (path length, risk, and a normalized propulsion-energy surrogate). Across nine benchmark scenarios with static and Markovian dynamic obstacles, PAIR achieves 100% mission success (matching the strongest baselines) while delivering the best energy surrogate (104.9 normalized units) and shortest mean travel time (207.8 s) on a reproducible 100 × 100 grid at fixed UAV speed. Relative to the strongest non-learning baseline (PSO), PAIR reduces energy by about 4% and travel time by about 3%, and yields roughly 10–20% gains over the remaining planners. An obstacle-density sweep with 5–30 moving obstacles further shows that PAIR maintains shorter paths and the lowest cumulative replanning time, supporting real-time multi-UAV navigation in dynamic urban MEC environments.

1. Introduction

Unmanned aerial vehicles (UAVs) are increasingly deployed for critical missions, from disaster response to precision agriculture, but their onboard processors cannot keep pace with the high-resolution LiDAR, video, and sensor streams required for real-time inference [1,2,3]. Deep neural networks (DNNs) deliver high-fidelity perception yet exceed UAV size, weight, and power budgets when executed locally. This has motivated several mobile edge computing (MEC) schemes that offload DNN layers to ground or edge servers [4,5,6]. At the same time, multi-UAV operation adds a second layer of complexity, requiring continuous deconfliction under dynamic obstacles and inter-agent interactions. These safety checks can severely limit real-time replanning on embedded hardware, making heavy iterative optimization impractical.
In dynamic MEC operations, path planning must account for both aerial navigation and real-time computational offloading, even as the physical and wireless environments fluctuate. For example, intermittent wireless connectivity may interrupt DNN layer uploads or downloads mid-flight [7,8], while variable network load can cause MEC task deadlines to shift unpredictably [9]. At the same time, unplanned no-fly zones [10], such as temporary airspace restrictions, or urgent edge-computing requests can force UAVs to abandon their original routes or reassign their roles during a mission [11]. These coupled challenges require planners to find safe paths around both static and moving obstacles while ensuring that each offloaded computational task meets its deadline. Thus, they must adapt continuously to link failures, deadline changes, and emergent airspace constraints in real time.
Given a 2-D urban grid G = (V, E) with static obstacles, time-varying dynamic obstacles, and N UAVs with start–goal pairs {(s_i, g_i)}_{i=1}^{N}, our goal is to generate collision-free trajectories that (i) respect no-fly regions and a minimum separation distance d_safe from dynamic obstacles and other UAVs, and (ii) minimize geometric mission cost (path length and a propulsion-energy surrogate) while maintaining real-time replanning capability. The airspace is modeled as a fixed-altitude 2-D projection for controlled, reproducible evaluation. A heterogeneous task-assignment model provides a MEC-inspired operating context by pre-assigning tasks to UAVs under CPU/memory/battery/deadline constraints; however, in the current implementation, the learning agent optimizes navigation-only signals (distance, aggregated risk, and energy surrogate), and MEC decisions are not included in the PPO reward.
Classical graph-based planners (A* (A-star), D* Lite (D-star Lite), and CBS-D* (conflict-based search with D-star Lite)) guarantee static collision-free routes but suffer from replanning latency and deadlocks under dynamic obstacles [12,13]. Metaheuristics such as Particle Swarm Optimization (PSO) continuously adapt but require careful parameter tuning and may converge slowly in cluttered urban layouts [14,15,16]. Hybrid reinforcement learning (RL) planners refine coarse waypoints through local, risk-aware corrections yet lack unified, mission-level evaluations across dynamic scenarios [17].
To address the joint challenges of airspace dynamics and MEC offloading, we propose PAIR, a decentralized hybrid framework integrating a fast A* global backbone with a continuous PPO-based refinement module.
Our key contributions are as follows:
  • MEC-inspired formulation: We model multi-UAV navigation under a MEC-inspired operating context where heterogeneous task profiles and deadlines motivate time-sensitive replanning. In our simulation experiments, task assignment is computed offline by a separate model and remains constant during navigation experiments; the PAIR learning component optimizes navigation-only signals.
  • Hybrid PAIR architecture: We propose a two-stage scheme where A* computes global routes and a PPO agent applies real-time trajectory corrections based on local risk information. In addition, a heterogeneous MEC task-assignment model (Section 3.5) computes resource-feasible task-to-UAV assignments offline subject to CPU/memory/battery and deadline constraints; these assignments are made constant during the navigation experiments.
  • Unified yet transparent evaluation: We compare PAIR against classical planners (A*, D* Lite, and CBS-D*) and PSO on nine 2-D urban grid scenarios with static/dynamic obstacles. PAIR achieves 100% success, 104.9 normalized energy units, and 207.8 s travel time, outperforming baselines in energy and latency. A ‘unified score’ summarizes path quality, with separate reporting of success, energy, and time to reveal trade-offs.
  • Reproducible 2-D benchmark and density sweep: We release a reproducible 2-D fixed-altitude benchmark with procedurally generated maps and Markovian dynamic obstacles, validated for coverage, connectivity, and clustering, together with a single-map obstacle-density sweep from 5 to 30 dynamic obstacles. The dataset is intended to be an open baseline for future multi-UAV navigation work in dynamic MEC-inspired environments, while full 3-D flight dynamics and hardware-in-the-loop validation are explicitly left to future work.
The remainder of this paper is organized as follows. Section 1 introduces the problem context and our key contributions. Section 2 reviews related work on discrete, metaheuristic, and RL-based UAV path planning. Section 3 presents our system model, problem formulation, and evaluation metrics. Section 4 details the proposed PAIR framework. Section 5 describes the simulation setup. Section 6 reports the comparative results and analysis. Section 7 discusses the observed trade-offs and practical considerations. Finally, Section 8 concludes the paper and outlines directions for future work.

2. Related Work

2.1. Collaborative DNN Partitioning and Offloading

Collaborative DNN inference hinges on dividing a network across devices to leverage the power of distributed computing. Early MEC frameworks proposed joint model partitioning and offloading between edge servers and mobile devices, targeting latency reduction and resource efficiency [6,18,19]. Fine-grained model splitting with asynchronous actor–critic methods further reduces inference delay under dynamic workloads [20,21]. Multi-UAV strategies distribute early network layers across aerial agents to accelerate inference but often overlook UAV mobility and link variability, causing suboptimal and unbalanced workloads [22,23]. However, none of these works simultaneously addresses dynamic obstacle avoidance in combination with DNN offloading. This gap is one of the key motivating factors for the integrated view of our PAIR framework.

2.2. Discrete and Incremental Path Planners

Graph-based planners remain foundational for UAV trajectories. A* finds optimal paths on static grids but requires complete replanning when obstacles shift [24], causing latency spikes in dynamic environments [12]. D* Lite, on the other hand, incrementally adjusts its routes to avoid moving obstacles but degrades under heavy dynamicity [13]. To address this drawback, Conflict-Based Search with D* Lite (CBS-D*) combines the incremental repair capabilities of D* Lite with a high-level conflict resolution mechanism to guarantee collision-free trajectories for individual agents. However, when scaled to multiple UAVs and concurrent paths, CBS-D* suffers from substantial replanning overhead: every moving obstacle or inter-agent conflict triggers both low-level D* Lite repairs and high-level conflict-tree updates. This becomes problematic in rapidly changing environments; as the number of agents or the obstacle update rate increases, latency spikes and overly conservative detours degrade both responsiveness and path optimality [17]. To overcome these replanning bottlenecks, some researchers have turned to continuous and population-based search methods.

2.3. Metaheuristic Search

Swarm-intelligence algorithms such as PSO encode waypoints as particles updated by individual and global bests; this enables continuous adaptation to non-convex, time-varying cost surfaces without grid constraints. PSO can navigate complex obstacle fields but usually requires careful tuning of its inertia and acceleration coefficients and may converge slowly or become trapped in dynamic scenarios [14,15]. While PSO can escape local minima better than pure grid methods, its lack of learned experience motivates hybrid RL planners that adapt to situational data [17,25].

2.4. RL-Driven Hybrid Planners

Hybrid planners integrate coarse discrete paths with RL-based local corrections. For example, several RL planners leverage deep Q-learning to refine A* or D* Lite trajectories, balancing distance, collision risk, and energy costs for smoother, safer routes [25,26]. These approaches adapt to moving obstacles but lack comprehensive real-time evaluations under unified mission-level criteria [27], and they usually focus on preliminary proof-of-concept scenarios or isolated metrics. Against this backdrop, this study provides a unified mission-level evaluation across realistic maps and planner classes.

2.5. Joint Path Planning and Task Offloading

In multi-UAV MEC networks, trajectory design and computation offloading are fundamentally intertwined. The chosen flight paths determine the instantaneous channel quality and transmission delays, while the spatial–temporal distribution of service requests dictates optimal waypoints and loitering patterns. Lou et al. propose a deep reinforcement learning framework that jointly learns UAV waypoints and offloading policies by embedding the deadline, energy, and security constraints directly into the RL reward function [28,29]. This cross-layer approach simultaneously optimizes UAV speeds, waypoint selection, and edge–cloud partitioning decisions to minimize end-to-end latency and energy consumption while maintaining resilient service in dynamic 6G smart-city environments. However, although these works ensure inter-UAV collision avoidance through safe-distance constraints, they assume an obstacle-free airspace and do not model more nuanced environmental obstacles (such as buildings and no-fly zones) [30]. As a result, real-time obstacle detection and avoidance remain unaddressed. This leaves a critical gap for deployments in cluttered or rapidly changing urban environments. Compared with RL-Planner [17] and recent joint trajectory/offloading DRL schemes [28,29], PAIR targets a complementary regime. RL-Planner focuses on single-UAV MEC scenarios and employs deep Q-learning to refine discrete paths but does not model multi-agent interactions or dense dynamic obstacles. Joint DRL works such as [28,29] co-optimize UAV trajectories and computation offloading decisions yet typically assume obstacle-free airspace and represent interference only through aggregate channel models.
In contrast, PAIR combines an A* global backbone with continuous PPO-based corrections in a multi-UAV setting with explicit static and dynamic obstacles, performs a unified mission-level evaluation (success rate, energy, time, and path-quality score), and integrates a heterogeneous task-assignment model to account for MEC resource heterogeneity, even though the current PPO agent is trained solely on navigation-related rewards.
To fill this gap, in Section 4 we introduce PAIR, a hybrid A* with PPO framework that combines dynamic obstacle avoidance with a MEC-aware task-assignment layer in a unified system architecture; in the present implementation, the PPO agent optimizes geometric navigation (path length, local risk, and energy), while offloading decisions and deadline satisfaction are governed by the task-assignment model rather than the RL reward.
Several recent works have explored intelligent path planning in dynamic environments using deep reinforcement learning and hybrid methods. Yang et al. propose a DRL-based planner for dynamic scenes that models the environment as an MDP and uses an improved D3QN to generate collision-free paths under moving obstacles [31]. Wu et al. combine an enhanced Informed-RRT* sampler with a dynamic window approach for local obstacle avoidance, improving path quality and real-time performance in cluttered maps [32]. Off-policy DRL schemes such as RPL-TD3 have also been applied to UAV trajectory planning with recurrent feature extractors for temporal dynamics [33]. In addition, Wang et al. introduce a reinforcement-learning-driven continuous ant colony optimizer for multi-UAV path planning in complex terrain [34], while recent surveys in Drones systematically classify multi-UAV path-planning algorithms and highlight open problems in dynamic, cluttered environments [35,36].
Compared with these studies, PAIR targets a complementary design space. It integrates a lightweight A* backbone with a shared PPO policy for continuous local refinement, focuses explicitly on multi-UAV interaction with static and Markovian dynamic obstacles on a controlled 2-D benchmark, and reports unified mission-level metrics (success rate, normalized path quality, energy, and time) across nine scenarios and an obstacle-density sweep. Rather than competing with specific DRL architectures, PAIR is intended to be a reusable hybrid baseline and open benchmark for dynamic multi-UAV navigation in MEC-inspired urban environments. Table 1 summarizes a qualitative comparison of the planners considered in this study, highlighting their main strengths and limitations.

3. System Model and Problem Statement

3.1. Environment Model

We consider a fleet of N homogeneous UAVs acting as mobile edge servers over a smart-city region discretized into a 2-D grid G = (V, E). Each grid vertex v ∈ V represents a feasible waypoint, and each edge (u, v) ∈ E denotes a straight-line flight corridor. UAV i must navigate from a fixed start s_i to a goal g_i while the following hold:
  • Dynamic Obstacles: Mobile obstacles (such as cranes, other flying objects, and temporary no-fly zones) appear unpredictably.
  • Airspace Constraints: Altitude bounds [h_min, h_max] and no-fly zones NFZ ⊂ V prohibit certain waypoints.
  • Inter-UAV Separation: A safe distance d_safe must be maintained between any two UAVs.
In this study, UAV motion is restricted to a single horizontal plane at a representative flight altitude, so the 3-D airspace is approximated by a 2-D grid projection. Altitude bounds and no-fly zones are therefore modeled as planar regions over G; extending PAIR to full 3-D kinematics is left for future work. For clarity, Table 2 lists the key symbols and variables used throughout the paper.
We use i ∈ {1, …, N} to index UAVs, j to index waypoints along a planned path, and t to index discrete simulator time steps. The start and goal locations for UAV i are denoted by s_i and g_i. Path lengths L_{A,p} are non-negative by definition, with L_{A,p} = 0 used to encode failures (collision or deadlock). The grid environment and safety constraints follow standard multi-robot planning formulations [13,17], while the MEC context and task constraints are consistent with common UAV-assisted MEC models [9,28,29].

3.2. Objectives

Define the multi-objective cost as
J = α E_total + (1 − α) T_total,  α ∈ [0, 1],
where
E_total = Σ_{i=1}^{N} Σ_{j=1}^{|P_i|−1} e(p_{i,j}, p_{i,j+1}),
T_total = Σ_{i=1}^{N} Σ_{j=1}^{|P_i|−1} τ(p_{i,j}, p_{i,j+1}),
and P_i = [p_{i,1}, …, p_{i,|P_i|}] is the discrete waypoint sequence for UAV i.
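As a concrete reading of the cost above, the sketch below evaluates J for a set of waypoint paths. The per-edge energy e(·) and time τ(·) models here are illustrative placeholders (distance-proportional at fixed speed, with assumed `speed` and `energy_per_unit` parameters); the paper leaves these functions abstract.

```python
import math

def multi_objective_cost(paths, alpha=0.5, speed=1.0, energy_per_unit=1.0):
    """Weighted mission cost J = alpha * E_total + (1 - alpha) * T_total.

    `paths` is a list of waypoint sequences [(x, y), ...], one per UAV.
    Energy and time per edge are modeled as proportional to Euclidean
    edge length (an assumption for illustration only).
    """
    e_total = t_total = 0.0
    for path in paths:
        for (x1, y1), (x2, y2) in zip(path, path[1:]):
            d = math.hypot(x2 - x1, y2 - y1)
            e_total += energy_per_unit * d   # e(p_{i,j}, p_{i,j+1})
            t_total += d / speed             # tau(p_{i,j}, p_{i,j+1})
    return alpha * e_total + (1 - alpha) * t_total
```

With alpha = 1 the cost reduces to pure energy; with alpha = 0 it reduces to pure travel time, mirroring the trade-off knob in the formulation.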

3.3. Decision Variables

  • Waypoints: P_i ⊂ V, the ordered list of grid points for UAV i.
  • Local Corrections: Δ_{i,j} ∈ R², offsets applied at waypoint p_{i,j} to avoid imminent collisions in the horizontal plane.

3.4. Constraints

1. Boundary Conditions:
p_{i,1} = s_i,  p_{i,|P_i|} = g_i,  ∀i
2. No-Fly Zones:
p_{i,j} ∉ NFZ,  ∀i, j
3. Dynamic Obstacle Avoidance: At time t, denoting the location of obstacle o by o(t),
‖p_{i,j}(t) + Δ_{i,j} − o(t)‖ ≥ d_safe,  ∀i, j, o
4. Inter-UAV Safety:
‖(p_{i,j}(t) + Δ_{i,j}) − (p_{k,ℓ}(t) + Δ_{k,ℓ})‖ ≥ d_safe,  ∀i ≠ k, ∀j, ℓ
5. Replanning Latency:
Latency_plan ≤ T_plan,max

3.5. Task Generation Model

We consider a set of heterogeneous service requests (tasks) T = {1, …, M} generated by the application layer. Each task j ∈ T is described by its CPU workload w_j (core-seconds), memory requirement m_j (MB), and deadline d_j (s) from the issue time. For realism we draw
w_j ~ U(50, 500),  m_j ~ U(100, 1000),  d_j ~ U(10, 60),
which roughly reflects lightweight to moderate IoT/MEC analytics tasks. Each UAV i has heterogeneous resources (C_i^cpu, C_i^mem, B_i^bat) in CPU units, memory units, and remaining battery capacity. We introduce a binary assignment variable
x_{i,j} = 1 if task j is assigned to UAV i, and 0 otherwise,
and impose standard capacity and deadline constraints:
Σ_{i=1}^{N} x_{i,j} = 1,  ∀j ∈ T,
Σ_{j=1}^{M} w_j x_{i,j} ≤ C_i^cpu,  ∀i,
Σ_{j=1}^{M} m_j x_{i,j} ≤ C_i^mem,  ∀i,
Σ_{j=1}^{M} e_j x_{i,j} ≤ B_i^bat,  ∀i,
τ_{i,j} x_{i,j} ≤ d_j,  ∀i, j,
where e_j is the estimated energy to execute task j and τ_{i,j} is the predicted execution time of j on UAV i.
In this work, the heterogeneous assignment model in (10)–(14) is used purely as a fixed MEC-inspired context: tasks are pre-assigned offline based on their resource profiles, and these assignments remain constant during path-planning experiments. The PPO agent is trained and evaluated only on navigation-related signals (distance, local risk, and a propulsion-energy surrogate), and offloading decisions do not appear in the RL reward. Consequently, we treat MEC as a background configuration that shapes which UAV serves which task, while the scientific contribution of PAIR lies in the hybrid path-planning algorithm itself.
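For intuition, a minimal first-fit scheduler respecting the capacity and deadline constraints above might look as follows. This is a sketch, not the paper's actual heterogeneous scheduler: the energy model e_j, the execution-time model τ_{i,j}, and all field names (`w`, `m`, `d`, `cpu_rate`, etc.) are hypothetical stand-ins.

```python
def greedy_assign(tasks, uavs):
    """First-fit assignment of tasks to UAVs under capacity/deadline
    constraints (illustrative sketch with assumed e_j and tau_ij models).

    tasks: list of dicts with w (CPU core-s), m (MB), d (deadline s).
    uavs:  list of dicts with cpu, mem, bat budgets and a cpu_rate.
    Returns the binary matrix x[i][j], or None if some task cannot fit.
    """
    x = [[0] * len(tasks) for _ in uavs]
    load = [{"cpu": 0.0, "mem": 0.0, "bat": 0.0} for _ in uavs]
    for j, t in enumerate(tasks):
        placed = False
        for i, u in enumerate(uavs):
            e_j = 0.1 * t["w"]             # assumed energy model
            tau = t["w"] / u["cpu_rate"]   # assumed execution-time model
            if (load[i]["cpu"] + t["w"] <= u["cpu"]
                    and load[i]["mem"] + t["m"] <= u["mem"]
                    and load[i]["bat"] + e_j <= u["bat"]
                    and tau <= t["d"]):
                x[i][j] = 1
                load[i]["cpu"] += t["w"]
                load[i]["mem"] += t["m"]
                load[i]["bat"] += e_j
                placed = True
                break
        if not placed:
            return None
    return x
```

Because assignments are computed offline and frozen during navigation experiments, even a simple feasibility-preserving heuristic like this suffices to provide the MEC-inspired context.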

3.6. Algorithmic Framework

We employ a two-stage hybrid scheme:
1.
Global Planner: A fast discrete backbone (such as A*) computes an initial waypoint sequence P i for each UAV i.
2.
Local Continuous Refinement: A PPO-based agent monitors P i for potential collisions or tight turns and applies continuous corrections Δ i , j as needed.

3.7. Performance Metrics

We evaluate each planner using the following:
  • Normalized Path Length (NPL): (L_actual − L_opt) / L_opt.
  • Energy Consumption (EC): Total propulsion energy E total .
  • Mission Success Rate (MSR): Fraction of UAVs reaching goals without collision.
  • Optimality Score (OS): Distance from the theoretical optimum.
  • Computation Overhead: Average replanning time per collision event.
  • Density-Sweep Scalability: How the key metrics (path length, steps to goal, and total planning time) vary as the number of dynamic obstacles increases from 5 to 30, allowing us to assess each planner’s performance under rising environmental complexity.

4. Proposed Framework

4.1. Global Planner Backbone

For each UAV, the initial path P i is generated on the grid graph G = ( V , E ) via a discrete planner, with four options to ensure robustness in dynamic settings:
Algorithms 1–3 summarize PAIR and the baseline planners.
  • A*: A classic heuristic search minimizing path length under static obstacle assumptions [12].
  • D* Lite: Incrementally repairs paths when the environment changes, avoiding full replanning [13].
  • CBS–D* Lite (Algorithm 2): Multi-agent conflict-based search over D* Lite to guarantee collision-free global paths under inter-UAV constraints [13].
  • PSO-Based Waypoint Search (Algorithm 3): Uses Particle Swarm Optimization to optimize a sequence of continuous waypoints balancing path length and obstacle clearance [14].
Algorithm 1 PAIR: decentralized PPO path planning with swarm coordination.
Require: UAV swarm size N, start/goal pairs {s_i, g_i}_{i=1}^{N}, grid G = (V, E), comms range d_comms
Ensure: Collision-free trajectories {P_i}_{i=1}^{N}
1: for all UAV i = 1, …, N in parallel do
2:   P_i ← A*(s_i, g_i)                        % global backbone
3:   N_i ← getNeighbors(i, d_comms)            % local swarm adjacency
4: end for
5: while any UAV not at goal do
6:   for all UAV i = 1, …, N in parallel do
7:     x_i ← currentPosition(i), v_i ← currentVelocity(i)
8:     p_next ← nextWaypoint(P_i)
9:     R_i ← updateRiskMap(i, N_i)             % risk field from obstacles + neighbors
10:    if isHighRisk(p_next, R_i) then
11:      s_i ← buildState(x_i, v_i, g_i, R_i)  % 6-D encoding from (15)
12:      Δ_i ← PPOAgent.correct(s_i)           % continuous offset from PPO
13:      smoothPath(P_i, Δ_i)                  % ensure feasibility
14:    else
15:      Δ_i ← 0
16:    end if
17:    executeStep(i, p_next + Δ_i)
18:    broadcast(i, x_i + Δ_i)                 % share updated pose for risk map
19:  end for
20: end while
Algorithm 2 CBS–D* Lite: integrated multi-agent path planning.
Require: start–goal pairs {(s_i, g_i)}_{i=1}^{N}, environment E, max iterations I, safe distance d_safe
Ensure: conflict-free trajectories {P_i}_{i=1}^{N}
1: Stage 1: Initial Planning
2: for all agent i = 1, …, N in parallel do
3:   P_i ← DStarLitePlan(s_i, g_i, E)
4: end for
5: Stage 2: Conflict Resolution
6: iteration ← 0
7: while iteration < I do
8:   detect conflict (i, j, t) in {P_k} under d_safe
9:   if no conflict then
10:    return {P_i}
11:  end if
12:  P_j ← rollback_and_replan(P_j, t)
13:  iteration ← iteration + 1
14: end while
15: return {P_i}
Algorithm 3 PSO: Particle Swarm Optimization path planning.
Require: start s, goal g, environment E, swarm size n, waypoints m, weights w, c_1, c_2, max speed v_max, iterations T
Ensure: collision-free, smoothed path P
1: Initialization:
2: generate {p_i}_{i=1}^{n} with m intermediate waypoints
3: initialize {v_i}_{i=1}^{n} ← 0
4: for i = 1 to n do
5:   evaluate f_i ← fitness(p_i); store personal best
6: end for
7: determine global best g_best
8: Main Loop:
9: for t = 1 to T do
10:  for i = 1 to n do
11:    update v_i; clamp to ±v_max
12:    p_i ← p_i + v_i; clamp to environment
13:    evaluate f_i; update personal/global bests
14:  end for
15: end for
16: Path Construction:
17: P_raw ← [s, g_best, g]
18: smooth P_raw to P
19: return P
We implement A*, D* Lite, CBS-D* Lite, and PSO as (i) direct baselines in our comparative study and (ii) optional backbone variants for the same simulation interface. In PAIR, A* is used as the default backbone to generate an initial discrete route, while PPO performs on-demand local refinements; the other three planners are evaluated as standalone planners (without PPO refinement) to quantify the benefit of the hybrid design under identical maps and safety checks.
Among these options, PAIR adopts A* as its default global backbone. The admissible Euclidean heuristic ensures shortest paths on static grids, the algorithm is lightweight enough for per-step replanning on embedded platforms, and the resulting paths provide a stable, interpretable reference that the learning-based refinement module can safely adjust. For the continuous correction stage we employ Proximal Policy Optimization (PPO), which directly outputs bounded continuous offsets and uses a clipped on-policy objective that is widely regarded as numerically stable for continuous-control tasks, avoiding the sensitivity of off-policy value-based methods in non-stationary multi-UAV environments.
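A minimal sketch of such an A* backbone on a 4-connected occupancy grid with the admissible Euclidean heuristic is given below, assuming unit edge costs; the paper's implementation may use a different neighborhood, cost model, or tie-breaking rule.

```python
import heapq
import math

def astar(grid, start, goal):
    """Minimal A* on a 4-connected occupancy grid (illustrative sketch).

    grid[r][c] == 1 marks a static obstacle; cells are (row, col) tuples.
    Uses the admissible Euclidean heuristic and unit step costs.
    Returns a waypoint list from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    h = lambda p: math.hypot(p[0] - goal[0], p[1] - goal[1])
    open_heap = [(h(start), 0.0, start)]
    came, g = {start: None}, {start: 0.0}
    while open_heap:
        _, gc, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = []
            while cur is not None:          # reconstruct via parent links
                path.append(cur)
                cur = came[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:   # skip obstacle cells
                continue
            ng = gc + 1.0
            if ng < g.get(nxt, float("inf")):
                g[nxt], came[nxt] = ng, cur
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None
```

Because the heuristic never overestimates the true cost on a unit grid, the returned route is shortest on static maps, which is exactly the stable reference the PPO refinement stage then perturbs locally.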

4.2. Local Continuous Refinement

To handle unpredictable, fast-moving obstacles and fine-tune energy-efficient trajectories, we introduce the PPO-Adjusted Incremental Refinement (PAIR) module. PAIR formulates on-demand local waypoint correction as a Markov Decision Process (MDP) [10]:
State 
We use a compact 6-D state encoding that matches our implementation and we keep the observation size fixed across swarm sizes:
s_t = [Δx_t, Δy_t, v̂_t^x, v̂_t^y, R̂_t(x_t), ĥ_t] ∈ R^6,
where x_t = (x_t, y_t) is the current 2-D position and g = (g_x, g_y) is the goal. Specifically, (i) Δx_t = (g_x − x_t)/L_max and Δy_t = (g_y − y_t)/L_max are normalized goal-direction offsets with L_max = √2 · grid_size (the map diagonal); (ii) v̂_t^x = v_t^x/v_max and v̂_t^y = v_t^y/v_max are normalized velocity components; (iii) R̂_t(x_t) ∈ [0, 1] is the normalized local risk scalar at the current position; and (iv) ĥ_t = h_t/L_max is the normalized Euclidean heuristic-to-go distance with h_t = ‖g − x_t‖₂.
Action 
a_t = [δ_x, δ_y],
is a bounded continuous offset in the horizontal plane, applied to refine the next discrete waypoint before execution. In our experiments the agent outputs only 2-D corrections; extending the action space to include altitude for full 3-D navigation is left to future work.
Reward 
r_t = −w_d · d(x_t, x_{t+Δt}) − w_r · R_t(x_t) − w_e · e(a_t) + r_goal,
where d(·) is the Euclidean distance term toward the next discrete waypoint, R_t(·) penalizes local risk exposure, e(a_t) = ‖a_t‖₂² is an instantaneous energy surrogate, and r_goal is a sparse terminal bonus on reaching the final goal.
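The state encoding and shaped reward can be sketched in a few lines. The weight values, goal bonus, and parameter defaults below are illustrative assumptions, not the trained configuration.

```python
import math

def build_state(pos, vel, goal, risk_norm, grid_size=100, v_max=1.0):
    """Compact 6-D PPO observation: normalized goal offset, normalized
    velocity, normalized local risk, and normalized distance-to-go."""
    l_max = math.sqrt(2) * grid_size                   # map diagonal
    dx = (goal[0] - pos[0]) / l_max
    dy = (goal[1] - pos[1]) / l_max
    vx, vy = vel[0] / v_max, vel[1] / v_max
    h = math.hypot(goal[0] - pos[0], goal[1] - pos[1]) / l_max
    return [dx, dy, vx, vy, risk_norm, h]

def step_reward(dist_to_wp, risk, action, reached_goal,
                w_d=1.0, w_r=1.0, w_e=0.1, goal_bonus=100.0):
    """Shaped reward: penalize distance to the next waypoint, local risk,
    and an ||a||^2 energy surrogate; sparse terminal bonus at the goal.
    All weights are placeholder values for illustration."""
    energy = action[0] ** 2 + action[1] ** 2
    r = -w_d * dist_to_wp - w_r * risk - w_e * energy
    return r + (goal_bonus if reached_goal else 0.0)
```

Keeping the observation at a fixed six dimensions, regardless of swarm size, is what lets a single shared policy be executed by any number of UAVs.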
At each simulator time step t, we rebuild a time-varying obstacle set (for UAV i)
O_t^(i) = O_static ∪ O_dyn(t) ∪ {x_k(t)}_{k ∈ N_i},
where O_static are fixed obstacles, O_dyn(t) are Markovian dynamic obstacles at time t (Section 5.2), and N_i is the neighbor set within communication range (used to enforce inter-UAV separation). The instantaneous risk field is then updated as
R_t(x) = Σ_{o ∈ O_t^(i)} max(0, (d_safe + r) − ‖x − o‖₂),
and the scalar observation used in Equation (15) is normalized by
R̂_t(x) = clip((R_t(x) − R_min) / (R_max − R_min), 0, 1),
where (R_min, R_max) are empirical bounds collected from training rollouts. Communication is simulated as a local broadcast of each UAV pose once per control step (i.e., every Δt), so neighbor influence enters R_t(·) at the same update rate.
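The hinge-style risk field and its min–max normalization reduce to a few lines; the `d_safe` and obstacle-radius `r` defaults below are placeholder values, not the experiment settings.

```python
import math

def risk_field(pos, obstacles, d_safe=2.0, r=0.5):
    """Instantaneous risk at `pos`: summed hinge penalties for every
    obstacle (static, dynamic, or neighbor UAV) closer than d_safe + r."""
    return sum(max(0.0, (d_safe + r) - math.hypot(pos[0] - ox, pos[1] - oy))
               for ox, oy in obstacles)

def normalize_risk(risk, r_min, r_max):
    """Min-max normalization with clipping to [0, 1]; (r_min, r_max) are
    empirical bounds collected from training rollouts."""
    z = (risk - r_min) / (r_max - r_min)
    return min(1.0, max(0.0, z))
```

Obstacles farther than d_safe + r contribute exactly zero, so the scalar observation reacts only to nearby threats, which keeps the state compact.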

4.2.1. PPO Training Details

We train the policy π θ ( a t s t ) using Proximal Policy Optimization (PPO) with the following settings:
  • Network architecture: Actor and critic are two-layer MLPs with 128 ReLU-activated units per layer.
  • Hyperparameters:
    Discount factor γ = 0.99 , GAE λ = 0.95 .
    PPO clip ϵ = 0.2 .
    Learning rates: actor 3 × 10^−4, critic 1 × 10^−3.
    Batch size: 2048, epochs per update: 50.
  • Training: Converges stably within 500,000 environment steps across varied urban scenarios.
All PPO hyperparameters are reported in Appendix A (Table A1). We use a standard PPO configuration [37,38] and keep it fixed across all benchmarks; beyond basic stability/convergence checks, we do not perform an exhaustive hyperparameter search, and a full sensitivity study is left to future work.
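For reference, the clipped surrogate at the heart of PPO (with ε = 0.2 as in the settings above) can be written as a per-sample loss. This is a standalone numeric sketch of the standard PPO objective, not the paper's full actor–critic implementation.

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate loss (negated objective):
    L = -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).

    `ratio` is the probability ratio pi_theta(a|s) / pi_theta_old(a|s)
    and `advantage` is the GAE estimate for that sample.
    """
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * advantage, clipped * advantage)
```

Clipping removes the incentive to push the ratio far from 1 in a single update, which is the property that makes PPO stable for the non-stationary multi-UAV refinement task.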

4.2.2. PAIR: Hybrid Discrete–Continuous Path Planning (Algorithm)

From a multi-agent learning perspective, PAIR adopts a simple parameter-sharing scheme rather than a full centralized-critic MAPPO formulation. A single PPO policy is trained and then executed independently by all UAVs; each agent observes only its own kinematic state and the locally aggregated risk scalar R ^ t ( x t ) described in (15). Neighboring UAVs influence the decision-making process only through this shared risk field and the occasional broadcast of updated poses, but their individual actions are neither coordinated through a joint action space nor optimized via a global value function.
During our empirical experiments, we observe that our independent, risk-aware PPO design is numerically stable for swarm sizes up to N = 9 in our benchmark scenarios, with no oscillatory behaviors or emergent deadlocks beyond those already handled by our collision checks. We use the term “decentralized” to denote decentralized execution with local observations and a shared policy. A formal convergence analysis of the resulting multi-agent dynamics is beyond the scope of this paper and is left for future work.

4.3. Unified Evaluation Metrics

To compare discrete, metaheuristic, and hybrid RL planners on a common scale, we report four primary metrics for each algorithm A: (i) a normalized path-quality score (“unified score”), (ii) mission success rate, (iii) average propulsion energy, and (iv) average travel time. The unified score is designed to be interpreted as a simple grade in [ 0 , 100 ] : on each benchmark scenario the shortest successful path receives 100 points, the longest successful path receives 0 points, and failed planners also receive 0. The overall unified score is then the average of these grades across the benchmark set. Success rate, energy, and time are reported separately so that no trade-offs are hidden.
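The scoring rule described above can be sketched as follows, with failures encoded as zero-length paths per the convention defined formally below.

```python
def unified_scores(lengths_by_algo):
    """Per-path scores and their average ("unified score") per algorithm.

    lengths_by_algo: {algo: [L_1, ..., L_P]}, with 0 encoding failure.
    On each path, the shortest successful length scores 100, the longest
    scores 0, and failures score 0. Returns {algo: mean score in [0, 100]}.
    """
    algos = list(lengths_by_algo)
    num_paths = len(next(iter(lengths_by_algo.values())))
    totals = {a: 0.0 for a in algos}
    for p in range(num_paths):
        ok = [lengths_by_algo[a][p] for a in algos if lengths_by_algo[a][p] > 0]
        lo, hi = (min(ok), max(ok)) if ok else (0.0, 0.0)
        for a in algos:
            L = lengths_by_algo[a][p]
            if L <= 0:
                score = 0.0          # failure (collision or deadlock)
            elif hi == lo:
                score = 100.0        # all successful paths tie
            else:
                score = 100.0 * (hi - L) / (hi - lo)
            totals[a] += score
    return {a: totals[a] / num_paths for a in algos}
```

Because the normalization is per-path, an algorithm that fails on some maps cannot be compensated by very short paths elsewhere; success rate, energy, and time are still reported separately.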
1. Normalized path-quality score (unified score). Let $P$ denote the total number of benchmark trajectories, and let $L_{A,p}$ be the realized path length (in grid units) produced by algorithm $A$ on path $p$. We treat $L_{A,p} = 0$ as a failure (collision or deadlock). For each path $p$, we consider only the successful planners and define
$$L_{\min,p} = \min_{B:\, L_{B,p} > 0} L_{B,p}, \qquad L_{\max,p} = \max_{B:\, L_{B,p} > 0} L_{B,p}.$$
The per-path score $S_{A,p}$ assigns 100 points to the shortest successful path (or to all successful planners if all successful paths have the same length) and scales down to 0 for failures or relatively longer paths:
$$S_{A,p} = \begin{cases} 100\,\dfrac{L_{\max,p} - L_{A,p}}{L_{\max,p} - L_{\min,p}}, & L_{A,p} > 0 \text{ and } L_{\max,p} > L_{\min,p},\\[4pt] 100, & L_{A,p} > 0 \text{ and } L_{\max,p} = L_{\min,p},\\ 0, & L_{A,p} = 0. \end{cases}$$
Because $L_{A,p}$ denotes a realized geometric path length, it satisfies $L_{A,p} \ge 0$ by construction. In our implementation, failures (collision or deadlock) are encoded as $L_{A,p} = 0$ and are assigned score 0 in Equation (22); the case $L_{A,p} < 0$ therefore cannot occur.
The overall “unified score” of algorithm $A$ is then the simple average over all benchmark paths
$$\bar{S}_A = \frac{1}{P} \sum_{p=1}^{P} S_{A,p},$$
which by construction lies in $[0, 100]$ and summarizes relative path quality across the benchmark set. Importantly, this length-based normalization is used only to aggregate path quality; the success rate, average energy, and average travel time are reported as separate metrics (Table 3) so that energy–latency trade-offs are not hidden by the unified score.
2. Mission success rate. For completeness we also report the mission success rate of algorithm $A$, defined as
$$R_A = \frac{1}{P} \sum_{p=1}^{P} \mathbf{1}\{L_{A,p} > 0\},$$
i.e., the fraction of paths completed without collision or deadlock.
3. Average energy consumption. For each trajectory we compute a propulsion-energy surrogate $E_{A,p}$ as a convex function of path length and assign a large penalty to failures:
$$E_{A,p} = \begin{cases} \alpha L_{A,p} + \beta L_{A,p}^{2}, & L_{A,p} > 0,\\ E_{\text{fail}}, & L_{A,p} = 0, \end{cases}$$
with fixed $\alpha > 0$, $\beta > 0$, and $E_{\text{fail}}$. We report the average $\bar{E}_A = \frac{1}{P} \sum_{p=1}^{P} E_{A,p}$. This should be interpreted as a normalized propulsion-cost surrogate for relative comparison between planners under the same model, not as a calibrated energy estimate for a specific airframe.
4. Average travel time. Assuming a constant UAV speed $v$ (in grid units/s), the travel time for path $p$ is
$$T_{A,p} = \begin{cases} L_{A,p}/v, & L_{A,p} > 0,\\ 0, & L_{A,p} = 0, \end{cases}$$
and we report $\bar{T}_A = \frac{1}{P} \sum_{p=1}^{P} T_{A,p}$ as the average travel time.
5. Path optimality score (per-path visualization). For some figures we additionally use a per-path optimality score
$$O_{A,p} = \begin{cases} 100\,\dfrac{L_{\min,p}}{L_{A,p}}, & L_{A,p} > 0,\\ 0, & L_{A,p} = 0, \end{cases}$$
which directly measures how close a successful path is to the best successful path on that scenario. These values are used in per-path heatmaps but are not reported as a single scalar in Table 3.
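All four primary metrics can be computed from a single matrix of realized path lengths. The NumPy sketch below is illustrative: `alpha` and `beta` are placeholder surrogate coefficients, while `e_fail = 1000` and `v = 2` grid-units/s follow the values stated in the paper.

```python
import numpy as np

def unified_metrics(lengths, alpha=1.0, beta=0.01, e_fail=1000.0, v=2.0):
    """Compute unified score, success rate, average energy, and average
    travel time from a (num_algorithms, num_paths) array of realized path
    lengths, where 0 encodes failure (collision or deadlock)."""
    L = np.asarray(lengths, dtype=float)
    ok = L > 0  # success mask, one row per algorithm

    # Per-path min/max over the successful planners only.
    lmin = np.where(ok, L, np.inf).min(axis=0)
    lmax = np.where(ok, L, -np.inf).max(axis=0)
    span = np.where(lmax > lmin, lmax - lmin, 1.0)

    # Per-path score, Eq. (22): 100 for the shortest successful path,
    # 0 for the longest or for failures; 100 to all planners on a tie.
    rel = np.where(lmax > lmin, (lmax - L) / span, 1.0)
    score = np.where(ok, 100.0 * rel, 0.0)

    energy = np.where(ok, alpha * L + beta * L**2, e_fail)  # Eq. (25)
    time = np.where(ok, L / v, 0.0)
    return {
        "unified_score": score.mean(axis=1),      # S_bar_A, Eq. (23)
        "success_rate": 100.0 * ok.mean(axis=1),  # in percent
        "avg_energy": energy.mean(axis=1),
        "avg_time": time.mean(axis=1),
    }
```

For example, with two algorithms over two paths where the first algorithm fails on path 2, the failure contributes 0 score points and the full `e_fail` penalty to that algorithm's averages, exactly as in the definitions above.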

5. Simulation Setup

We evaluate all planners on nine procedurally generated urban maps (Paths 1–9) with a constant swarm size of N = 9 UAVs for all compared algorithms, ensuring an identical multi-agent scale across the baselines and PAIR (Urban Multi-UAV Path Planning Simulation Dataset, https://doi.org/10.6084/m9.figshare.30787730). We do not sweep swarm size in this paper; all results use N = 9, and scalability to larger N is left as future work.

5.1. Map Generation and Validation

Static obstacles are placed via Poisson-disc sampling on a 1 × 1 normalized grid (100 × 100 cells) with minimum spacing r min = 0.1 . Each map is accepted only if it meets the following:
  • Coverage: obstacle density in [15%, 40%];
  • Connectivity: BFS confirms at least one collision-free path for every start–goal pair;
  • Clustering: Hopkins statistic greater than 0.75 for realistic spatial patterns.
These procedurally generated maps form a 2-D fixed-altitude benchmark rather than a photorealistic 3-D environment. Our goal is to provide a controlled, reproducible testbed where obstacle density, spatial clustering, and dynamic updates can be systematically swept, not to emulate a specific real city. To facilitate independent verification and reuse, we release all nine maps and the corresponding generator as an open “Urban Multi-UAV Path Planning Simulation Dataset” (see https://doi.org/10.6084/m9.figshare.30787730).
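A minimal version of this generate-and-validate loop, assuming simple dart-throwing Poisson-disc sampling and omitting the Hopkins clustering test, could look as follows; function names, the obstacle-count target, and the rasterization radius are ours, not taken from the released generator.

```python
import random
from collections import deque

def poisson_disc_obstacles(n_target=60, r_min=0.1, max_tries=10_000, seed=0):
    """Dart-throwing Poisson-disc sampling on the unit square: accept a
    candidate only if it is at least r_min from every accepted point."""
    rng = random.Random(seed)
    pts, tries = [], 0
    while len(pts) < n_target and tries < max_tries:
        tries += 1
        c = (rng.random(), rng.random())
        if all((c[0] - p[0])**2 + (c[1] - p[1])**2 >= r_min**2 for p in pts):
            pts.append(c)
    return pts

def rasterize(pts, cells=100, radius=0.05):
    """Mark 100x100 grid cells within `radius` of any obstacle centre as blocked."""
    grid = [[False] * cells for _ in range(cells)]
    for i in range(cells):
        for j in range(cells):
            x, y = (i + 0.5) / cells, (j + 0.5) / cells
            if any((x - px)**2 + (y - py)**2 < radius**2 for px, py in pts):
                grid[i][j] = True
    return grid

def bfs_reachable(grid, start, goal):
    """4-connected BFS used for the connectivity check."""
    n = len(grid)
    if grid[start[0]][start[1]] or grid[goal[0]][goal[1]]:
        return False
    seen, q = {start}, deque([start])
    while q:
        i, j = q.popleft()
        if (i, j) == goal:
            return True
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < n and 0 <= nj < n and not grid[ni][nj] and (ni, nj) not in seen:
                seen.add((ni, nj))
                q.append((ni, nj))
    return False

def map_is_valid(grid, pairs):
    """Accept the map only if obstacle density lies in [15%, 40%] and every
    start-goal pair is BFS-connected (Hopkins clustering check omitted here)."""
    n = len(grid)
    density = sum(map(sum, grid)) / (n * n)
    return 0.15 <= density <= 0.40 and all(bfs_reachable(grid, s, g) for s, g in pairs)
```

Maps failing either check are discarded and regenerated, which is what makes the benchmark statistics controllable.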

5.2. Dynamic Obstacle Generation

Each map includes N dyn mobile obstacles (radius r = 0.05 ):
1. Initial Placement: Sampled uniformly in $\mathcal{V} \setminus (\mathrm{NFZ} \cup \{\text{UAV start positions}\})$, with minimum distance $d_{\text{safe}} + r$ from all static obstacles and UAVs.
2. Mobility Model: A three-state Markov chain, Move [10], Pause, Detour, with parameters:
$$\begin{aligned} \text{Move}:\ & v \sim U(0, v_o),\quad \theta(t) = \theta(t-1) + \Delta\theta,\quad \Delta\theta \sim U\!\left(-\tfrac{\pi}{4}, \tfrac{\pi}{4}\right),\quad T_{\text{move}} \sim \mathrm{Exp}(\lambda_m).\\ \text{Pause}:\ & v = 0,\quad T_{\text{pause}} \sim \mathrm{Exp}(\lambda_p).\\ \text{Detour}:\ & v \sim U(0, v_o),\quad \theta \leftarrow \theta \pm \tfrac{\pi}{2},\quad T_{\text{det}} \sim \mathrm{Exp}(\lambda_d). \end{aligned}$$
Transitions occur after each timer expiry or if a Move would intersect a static obstacle (triggering Detour).
3. Boundary Reflection: Any trajectory crossing a map edge is reflected back into the domain.
4. Parameters: $r = 0.05$, $v_o = 0.02$ grid-units/step; no new obstacles spawn after $t = 0$.
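The mobility model above can be sketched as a per-step simulation loop. In this illustrative version, the exponential rate parameters `lam_m`, `lam_p`, `lam_d` are placeholders, since the paper does not report their values; the state set, heading jitter, 90-degree detour turns, and boundary reflection follow the description above.

```python
import math
import random

def simulate_obstacle(steps=200, v_o=0.02, lam_m=0.1, lam_p=0.2, lam_d=0.5, seed=0):
    """One dynamic obstacle under the three-state Markov mobility model
    (Move / Pause / Detour). Trajectories crossing a map edge are
    reflected back into the unit square [0, 1]^2."""
    rng = random.Random(seed)
    x, y = rng.random(), rng.random()
    theta = rng.uniform(-math.pi, math.pi)
    state, timer = "move", rng.expovariate(lam_m)
    traj = []
    for _ in range(steps):
        if state == "move":
            theta += rng.uniform(-math.pi / 4, math.pi / 4)  # heading jitter
            v = rng.uniform(0.0, v_o)
        elif state == "pause":
            v = 0.0
        else:  # detour: keep the +-90 degree heading applied at state entry
            v = rng.uniform(0.0, v_o)
        x += v * math.cos(theta)
        y += v * math.sin(theta)
        # Boundary reflection back into the unit square.
        if x < 0.0:
            x, theta = -x, math.pi - theta
        elif x > 1.0:
            x, theta = 2.0 - x, math.pi - theta
        if y < 0.0:
            y, theta = -y, -theta
        elif y > 1.0:
            y, theta = 2.0 - y, -theta
        traj.append((x, y))
        timer -= 1.0
        if timer <= 0.0:  # exponential timer expired: draw the next state
            state = rng.choice(["move", "pause", "detour"])
            timer = rng.expovariate({"move": lam_m, "pause": lam_p, "detour": lam_d}[state])
            if state == "detour":
                theta += rng.choice([-1, 1]) * math.pi / 2
    return traj
```

Because the per-step displacement is bounded by `v_o`, one reflection per axis suffices to keep every sampled position inside the map.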

5.3. Variable Obstacle Density Sweep

To further probe replanning performance, we run each planner on a single canonical start–goal scenario, sweeping the number of dynamic obstacles over { 5 , 10 , 15 , 20 , 25 , 30 } (animated demonstrations for each obstacle density are available on GitHub: GHRABAT/Impact-of-Obstacle-Density, https://github.com/GHRABAT/Impact-of-Obstacle-Density/tree/v1.0.0 (accessed on 6 January 2026)). This sweep acts as a controlled single-map stress test that isolates the effect of obstacle density on path length, steps to goal, and cumulative planning time, rather than a full statistical study over many urban layouts.
The obstacle-density sweep is conducted on a single canonical map (not all nine maps) in order to isolate the effect of obstacle count. Each plotted point in Section 6.4 is the mean over S = 3 random seeds controlling the dynamic-obstacle trajectories, with shaded bands indicating ±1 standard deviation. The non-monotonic behavior (especially for PSO) therefore reflects stochastic convergence on that fixed map rather than a population-wide statistic over all scenarios, and the sweep should be interpreted as a controlled single-map stress test. Extending it to multiple maps will be pursued in future work.

5.4. UAV Kinematics and Metrics

Each UAV flies at v uav = 2 grid-units/s, so a path of length L takes T = L / v uav seconds. We record mission success rate, energy consumption, travel time, and unified score (Section 4).

5.5. Benchmark Variation Grid

The nine scenarios span three representative static obstacle densities (15%, 25%, and 40%), with both clustered and dispersed spatial patterns and both static-only and mixed static+dynamic obstacle configurations. Instead of exhaustively enumerating all combinations, we select nine representative cases that cover easy, moderate, and highly cluttered layouts under static and dynamic conditions, thereby testing planner robustness across realistic extremes while keeping the benchmark size manageable for reproducible experiments.

5.6. Planner Configurations

For PAIR, we fix the inter-UAV communication radius used to form the neighbor set to d comms = 0.2 grid units (that is, 20 cells under the 100 × 100 discretization), and we keep this value constant across all planners. We compare five planners with these parameter settings:
  • A*: 4-connected grid, Euclidean heuristic, no replanning on collision.
  • D* Lite: Same heuristic, incremental replanning up to once per time step.
  • CBS-D* Lite: Conflict-Based Search over D* Lite with up to 10 banned cells and a 2-waypoint rollback window.
  • PSO: 30 particles optimizing 8 intermediate waypoints, inertia w = 0.7 , c 1 = c 2 = 1.4 , 200 iterations per replanning.
  • PAIR: Hybrid A* + PPO with
    Network: Two-layer MLPs (128 ReLU units each).
    Hyperparameters: $\gamma = 0.99$, GAE $\lambda = 0.95$, clip $\epsilon = 0.2$, actor LR $3 \times 10^{-4}$, critic LR $1 \times 10^{-3}$, batch size 2048, 50 epochs/update, 500k total steps.
    Reward weights:  ( w d , w r , w e ) = ( 1.0 , 0.5 , 0.1 ) .

5.6.1. Statistical Analysis

For each planner we evaluate nine benchmark paths (Paths 1–9 in our scenario set). Unless otherwise stated, the scalar metrics in Table 3 are reported as mean ± standard deviation across these nine scenarios. For each path and algorithm we record a binary success indicator (1 if the UAV reaches the goal without collision, 0 otherwise); the success rates in Table 3 are the mean of these indicators (in %), and the associated standard deviations quantify how success varies across maps. Propulsion energy is computed per path using the surrogate in Equation (25), including the fixed penalty E fail = 1000 (normalized units) for failed trajectories, and we report the mean ± standard deviation over all nine paths. To better capture typical mission duration, the travel times are averaged only over successful trajectories (non-collided flights); failures are excluded from the time statistics but remain penalized through the energy and unified-score metrics.
The unified scores in Table 3 correspond to the length-based scores in Equations (22) and (23), computed per path by normalizing path length over the set of successful planners, assigning a score of 0 to failed planners, and then averaging these values across the nine benchmark paths.

5.6.2. Baseline Scope

We compare PAIR against classical discrete planners (A*, D* Lite, and CBS-D* Lite) and a representative swarm-based metaheuristic (PSO), which together cover the main non-learning baselines used in UAV path planning. Incorporating additional end-to-end deep RL baselines, such as RL-Planner [17], under our nine-map benchmark and real-time constraints is left to future work.

5.7. Reward Shaping and Collision-Check Implementation

5.7.1. Reward Shaping

At each PPO correction step we emit a dense, potential-based reward:
$$r_t = -\,w_d\, d(x_t, x_{t+\Delta t}) \,-\, w_r\, R_t(x_t) \,-\, w_e\, e(a_t) \,+\, r_{\text{goal}},$$
where
  • $d(x_t, x_{t+\Delta t})$ is the Euclidean distance toward the next discrete waypoint, so reducing it rewards progress along the backbone.
  • $R_t(x) = \sum_{o \in \mathcal{O}} \max\!\left\{0,\ (d_{\text{safe}} + r) - \lVert x - o(t) \rVert\right\}$ penalizes proximity to every obstacle.
  • $e(a_t) \propto \lVert a_t \rVert^2$ is the instantaneous energy cost of offset $a_t$.
  • $r_{\text{goal}} = +10$ is a sparse bonus awarded upon reaching the final goal.
The weights ( w d , w r , w e ) = ( 1.0 , 0.5 , 0.1 ) are chosen empirically to balance path efficiency, obstacle clearance, and energy use, and are kept fixed across all experiments. A more systematic numerical ablation of these coefficients is left to future work.
The same risk field $R_t(x)$ used in the reward is also queried at the UAV position to form the observation component $\hat{R}_t(x_t)$ in (15). All dynamic obstacles and other UAVs are inserted into the obstacle set $\mathcal{O}$ when building $R_t(\cdot)$, so nearby agents influence the state only through this aggregated risk scalar rather than via an explicit neighbor list. This keeps the observation dimension fixed while still allowing the policy to react to local congestion. In the implementation reported here, the PPO agent is trained solely on these navigation-related signals (distance, aggregated risk, and a convex surrogate for propulsion energy). MEC task assignment and deadline satisfaction are handled independently by the heterogeneous task-to-UAV model in Section 3.5 and are not explicitly encoded in the RL reward. Incorporating deadline- or latency-aware terms (for example, a normalized penalty $w\,(t_{\text{rem}}/T_{\max})$ when the predicted completion time approaches a task’s deadline) is a natural extension and is left to future work.
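A direct transcription of this shaped reward might look as follows. The sign convention (penalties entered negatively) is our reading of the shaped reward, and the safety-margin constants `D_SAFE` and `R_OBS` are illustrative values, not quoted from the released code; only the weights (1.0, 0.5, 0.1) and the +10 goal bonus come from the paper.

```python
import numpy as np

W_D, W_R, W_E = 1.0, 0.5, 0.1   # (w_d, w_r, w_e) from the paper
D_SAFE, R_OBS = 0.05, 0.05      # illustrative safety margin and obstacle radius
GOAL_BONUS = 10.0               # sparse r_goal bonus

def risk_field(x, obstacles):
    """Aggregated risk R_t(x): summed hinge penalty for obstacles closer
    than d_safe + r (dynamic obstacles and other UAVs both enter this set)."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for o in obstacles:
        gap = (D_SAFE + R_OBS) - np.linalg.norm(x - np.asarray(o, dtype=float))
        total += max(0.0, gap)
    return total

def step_reward(x, next_waypoint, action_offset, obstacles, reached_goal=False):
    """Dense shaped reward: distance-to-waypoint penalty, risk penalty,
    quadratic energy cost of the PPO offset, plus a sparse goal bonus."""
    d = np.linalg.norm(np.asarray(next_waypoint, float) - np.asarray(x, float))
    e = float(np.dot(action_offset, action_offset))  # e(a_t) ~ ||a_t||^2
    r = -W_D * d - W_R * risk_field(x, obstacles) - W_E * e
    return r + (GOAL_BONUS if reached_goal else 0.0)
```

Because the risk term is a hinge on clearance, it is exactly zero whenever every obstacle is farther than `D_SAFE + R_OBS`, so the policy is only penalized inside the safety shell.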

5.7.2. Collision Detection

At each simulator step, we have the following:
  • UAV vs. obstacle:
$$\left\lVert p_i(t) + \Delta_{i,j} - o_k(t) \right\rVert < d_{\text{safe}} + r$$
  • Inter-UAV:
$$\left\lVert \left(p_i(t) + \Delta_{i,j}\right) - \left(p_\ell(t) + \Delta_{\ell,j}\right) \right\rVert < 2\, d_{\text{safe}}$$
  • Segment sampling: To catch collisions mid-edge, we sample five equally spaced points along each discrete step and repeat the same checks.
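The segment-sampling check can be implemented by linearly interpolating five equally spaced points along each discrete step and applying the clearance test at each one. The helper below covers the UAV-vs-obstacle case, with the paper's $d_{\text{safe}} + r$ threshold form and illustrative default values.

```python
import numpy as np

def segment_collision(p_start, p_end, obstacles, d_safe=0.05, r=0.05, samples=5):
    """Sample `samples` equally spaced points along one discrete step and
    test each against the clearance threshold d_safe + r, catching
    mid-edge collisions that endpoint-only checks would miss."""
    p_start = np.asarray(p_start, dtype=float)
    p_end = np.asarray(p_end, dtype=float)
    for t in np.linspace(0.0, 1.0, samples):
        q = (1.0 - t) * p_start + t * p_end     # interpolated position
        for o in obstacles:
            if np.linalg.norm(q - np.asarray(o, dtype=float)) < d_safe + r:
                return True
    return False
```

The inter-UAV case is analogous, comparing two interpolated agent positions against the $2\,d_{\text{safe}}$ threshold instead.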

5.8. Implementation Details

All simulations (environment, baselines, and PAIR) were implemented in Python 3.12.4 using NumPy 2.4.0, SciPy 1.16.3, Matplotlib 3.10.8, and PyTorch 2.9.1. As indicated in Table A1, the runtime experiments (including PPO training, evaluation, and online PPO inference during replanning) are executed on an NVIDIA Jetson Xavier NX embedded platform (6-core ARM CPU, 8 GB RAM, 384-core Volta GPU; software stack: NVIDIA JetPack 5.1.4), chosen to emulate onboard UAV compute. The trained PPO policy is used for online inference with an average forward-pass latency of ≈5 ms per PPO call (Table A1), which supports the 10 Hz replanning loop used in our benchmark. For post-processing and figure generation, we additionally use an HP ZBook 15v G5 workstation running Windows 10 Pro (Build 19045) with an Intel Core i7-8750H CPU, 16 GB RAM, and an NVIDIA Quadro P600 (4 GB), using Python 3.12.4 (Anaconda Distribution 2025.12-1, Spyder 6.1.2) to parse logs and render plots. For reproducibility, random seeds are fixed per scenario when reporting mean ± std metrics, and the obstacle-density sweep averages results over S = 3 random seeds as stated in Section 5.

6. Results and Comparative Analysis

6.1. Success-Rate Analysis

Table 3 and Figure 1 report each planner’s success rate over the nine benchmark maps. Classical A* succeeds on only one of nine maps (11.11%), failing in both dynamic and highly cluttered static scenarios due to its inability to replan. D* Lite improves this to 44.44% by incrementally repairing collisions, yet still breaks down in highly dynamic layouts where frequent updates are required. CBS-D*, PSO, and our PAIR framework all achieve perfect reliability (100%), demonstrating that conflict-based search, metaheuristic optimization, and the proposed hybrid planner can all attain mission-complete performance under the evaluated settings.

6.2. Energy Consumption and Travel Time

Figure 2 depicts the average propulsion-energy surrogate consumed by each planner (in normalized units). Purely discrete methods exhibit the highest energy demands: A* at 899.66 units and D* Lite at 594.01 units, while CBS-D* reduces this to 126.45 units thanks to occasional replanning. Note, however, that the large values for A* and D* Lite are inflated by the failure penalty $E_{\text{fail}} = 1000$ applied to collided or deadlocked trajectories in Equation (25); these numbers should therefore be interpreted as penalty-inclusive surrogate costs for relative comparison, not as physical energy measurements.
In contrast, hybrid planners achieve dramatic savings: PSO consumes 109.44 units and PAIR further lowers this to 104.91 units, a 4.1% reduction relative to PSO. Across the nine benchmark paths, the associated standard deviations are moderate (for instance, PAIR achieves 104.91 ± 44.36 units compared to 109.44 ± 47.26 units for PSO and 126.45 ± 54.67 units for CBS-D*), so the ranking in Table 3 is robust to path-to-path variability.
Figure 3 shows the corresponding average travel times. According to Table 3, A* and D* Lite achieve mean travel times of 201.4 s and 179.4 s, respectively, but only on the very small subset of paths where they succeed (11.1% and 44.4% success). Among the 100%-reliable planners, CBS-D* has the longest mean travel time (236.8 s), whereas PSO averages 214.3 s and PAIR trims this to 207.8 s, 6.5 s faster than PSO (a 3.0% improvement) and the shortest travel time under full mission success.
These results demonstrate that integrating continuous PPO-based corrections into a discrete backbone (PAIR) both maintains perfect reliability and substantially lowers energy and time compared to purely discrete or metaheuristic approaches.
In Figure 4, A* exhibits multiple deadlocks (path 9) and at least one collision (paths 1 and 2) when confronted with dynamic obstacles. These trajectories are not solely determined by geometric proximity but follow the task-assignment model in Section 3.5, which assigns each request to the UAV whose CPU, memory, battery, and deadline constraints are best matched. Consequently, some paths cross (for example, paths 1 and 2 end up on a collision course), and in a few cases a more distant UAV is dispatched instead of the nearest one (e.g., UAV 7 is closer to the task in path 9 but is not selected).
D* Lite (Figure 5) leverages incremental replanning to escape some static deadlocks, but still registers collisions in dynamic encounters (paths 1 and 2). Replanning latency causes hesitation near high-risk zones, visible as abrupt, zig-zag path segments in those areas.
CBS-D* Lite (Figure 6) resolves inter-agent conflicts more robustly than D* Lite, yet its conflict-tree updates induce pronounced detours (as seen in paths 5 and 6) and occasional collisions when dynamic obstacles intrude mid-segment.
PSO (Figure 7) generates smoother, continuous trajectories successfully avoiding both static and dynamic obstacles. However, its global waypoint optimization sometimes yields unnecessarily long loops as can be seen in path 1 just before it converges; this reflects trade-offs between exploration and path length.
In contrast, PAIR (Figure 8) combines the fast backbone of A* with PPO-based local corrections. All nine paths seamlessly navigate around moving obstacles, maintain safe separation, and incur minimal detours. The continuous refinements tighten trajectories around static obstacles while reacting instantly to dynamic intrusions. This demonstrates the superior adaptability and efficiency of PAIR in both static and dynamic settings.

6.3. Unified Score Comparison

Figure 9 and Table 3 report the unified score, the average normalized path-quality score $\bar{S}_A$ defined in (23). This scalar index summarizes how often each planner attains short paths relative to the best-performing algorithm on each scenario, while the success rate, energy consumption, and travel time are presented explicitly in the same table and figures. In terms of the normalized unified score, PAIR attains 84.56 ± 34.01 versus 73.17 ± 38.13 for PSO, with A*, D* Lite, and CBS-D* remaining well below these values (Table 3). This margin reflects the balanced improvements of PAIR across all dimensions: it maintains 100% success while simultaneously reducing both energy consumption and travel time relative to purely discrete (A*, D* Lite, CBS-D*) and metaheuristic (PSO) planners.

6.4. Impact of Obstacle Density

To quantify each planner’s scalability under rising dynamic complexity, we consider the single-scenario density sweep introduced in the “Variable Obstacle Density Sweep” subsection of Section 5, where each planner is evaluated at obstacle counts { 5 , 10 , 15 , 20 , 25 , 30 } . Because this sweep is performed on a single canonical map, it should be interpreted as a controlled stress test of relative trends rather than a full statistical characterization over an ensemble of urban layouts; replicating the sweep across multiple randomly generated maps and 3-D environments is an important direction for future work. Additional qualitative trajectory visualizations for this obstacle-density sweep (GIFs) are provided in the Supplementary Material.
Figure 10 plots the resulting path lengths. The PAIR curve increases only modestly, from about 113 to 147 grid units, demonstrating robust backbone routing with minimal detours. In contrast, CBS-D* remains flat around 156–162 units, and PSO exhibits an unusual fluctuation (114→172→145→157), which is understandable given its stochastic waypoint sampling.
Figure 11 shows the total steps to goal. PAIR grows smoothly from 82 to 199 steps, indicating predictable replanning overhead. CBS-D* increases monotonically (158→205), as each new obstacle simply adds another conflict/rollback event. PSO, however, oscillates dramatically from 182 at 10 obstacles, down to 100 at 15, and back up to 132 at 30 obstacles, revealing an unreliable convergence under clutter.
Most critically, Figure 12 reports the total cumulative replanning time per episode. For PAIR this quantity remains the lowest among all planners across the entire sweep, increasing from about $1.9 \times 10^{3}$ ms (≈1.9 s) at 5 obstacles to a worst case of roughly $1.9 \times 10^{5}$ ms (≈190 s) at 25 obstacles. At 30 obstacles, the episode reaches the goal in fewer steps on that map, so the PAIR cumulative replanning time decreases to about $2.3 \times 10^{4}$ ms (≈23 s) rather than continuing to grow with obstacle count. By comparison, CBS-D* requires up to $4.4 \times 10^{5}$ ms (≈440 s) of planning time, and PSO exceeds $3.4 \times 10^{6}$ ms (≈3400 s, nearly an hour) at 30 obstacles, making it impractical for real-time use.
Taken together, these density-sweep curves confirm that PAIR is the only planner whose latency and path optimality both scale gracefully as obstacle density increases, making it uniquely suited for real-time UAV navigation in highly dynamic environments.

7. Discussion

Our comparative analysis highlights a fundamental trade-off between responsiveness and optimality. Purely discrete planners (A*, D* Lite, CBS-D*) guarantee optimal or near-optimal grid-based routes on static maps but suffer from replanning latency and deadlocks once obstacles move (Figure 4, Figure 5 and Figure 6). The PSO metaheuristic adapts continuously yet can over-correct and produce suboptimal detours in dense layouts (Figure 7). The PAIR hybrid approach combines a fast A* backbone for coarse routing with PPO-based local corrections that preserve near-optimal static performance while reacting quickly to dynamic intrusions, yielding lower propulsion energy and shorter travel times than all baselines under the same benchmark scenarios and hyperparameter settings.
Unlike PSO, which quickly becomes infeasible for replanning (multi-minute cumulative planning times), and CBS-D* (tens to hundreds of seconds), PAIR consistently has the smallest cumulative replanning time across all obstacle densities. In our experiments its total replanning time increases from about 2 s at 5 obstacles to at most 190 s at 25 obstacles, while CBS-D* and PSO reach roughly 440 s and over 3400 s, respectively, at high densities.
The monotonic, predictable growth of PAIR in steps to goal (Figure 11) contrasts with the erratic behavior of PSO, underscoring its reliability across densities. The non-monotonic dip in the PAIR cumulative replanning time at 30 obstacles arises because the episode on that map terminates earlier than at 25 obstacles; since we accumulate planning time only up to goal completion, shorter episodes naturally yield smaller totals even at higher obstacle counts. These results confirm that PAIR not only delivers superior path quality and energy efficiency but also scales gracefully in high-density dynamic environments, preserving both real-time responsiveness and path optimality.
PAIR incurs modest overhead: discrete A* searches complete in <1 ms per step, and PPO inference runs in ≈5 ms on an embedded GPU, supporting a 10 Hz replanning rate. By contrast, CBS-D* conflict-tree updates can exceed 50 ms, and the PSO 200-iteration replanning loops are even slower.
Inter-UAV safety checks scale as $O(N^2)$, but the decentralized PPO corrections of PAIR fold neighbor information into each agent’s state through the shared risk field, avoiding global coordination bottlenecks. This makes PAIR naturally extensible to larger swarms.
In our experiments, the PAIR PPO agent is trained purely on navigation performance (path length, local risk, and a propulsion-energy surrogate), while MEC task assignment and deadline satisfaction are governed by the heterogeneous assignment model of Section 3.5. The present evaluation therefore focuses on geometric path quality under dynamic obstacles, with offloading treated as a separate upper-layer decision mechanism rather than as part of the RL objective; we view PAIR as a path-planning contribution evaluated in a MEC-inspired environment, not as an end-to-end solution for joint computation offloading and trajectory control.
A natural extension is to incorporate deadline- or latency-aware terms directly into the PPO reward (for example, a normalized penalty $w\,(t_{\text{rem}}/T_{\max})$ when the predicted completion time approaches a task’s deadline), or to co-train navigation and offloading policies in a unified multi-objective RL formulation. We leave such MEC co-optimization to future work; the present study deliberately isolates the navigation component while demonstrating the compatibility of PAIR with 6G edge-computing scenarios.
All experiments in this paper are conducted on a 2-D fixed-altitude grid abstraction with procedurally generated static obstacles and stylized Markovian dynamic obstacles, under a simplified MEC model that abstracts away physical-layer, queueing, and perception uncertainties. Consequently, our results should be interpreted as an algorithmic evaluation of navigation strategies in a MEC-inspired setting, not as an end-to-end hardware validation on specific UAV platforms or sensor pipelines. This design choice allows us to stress-test PAIR and all baselines under tightly controlled obstacle densities and map statistics, but it does not capture the full 3-D flight kinematics or real-world sensing artifacts. Validating PAIR on photorealistic 3-D simulators and standardized or real multi-UAV flight datasets is therefore an important direction for future work.
While the paper is motivated by UAV-assisted MEC, our present evaluation focuses on geometric navigation (path length, collision risk, and a propulsion-energy surrogate). The task-assignment model in Section 3.5 provides a heterogeneous MEC context but is static during flight, and MEC constraints (latency/deadlines) are not factored into the PPO reward. A more convincing end-to-end MEC study would generate dynamic task arrivals during simulated flights and couple the predicted completion latency to the navigation policy (for instance, via a deadline-aware penalty when the predicted service time approaches a task deadline). We leave this integrated co-optimization and dynamic task scenario generation to future work.

8. Conclusions and Future Work

We have presented PAIR, a multi-agent hybrid A*–PPO path-planning framework for multi-UAV deployments in dynamic MEC-inspired urban environments modeled as 2-D fixed-altitude grids with procedurally generated static obstacles and stylized Markovian dynamic obstacles. The framework combines a fast discrete A* backbone with an on-demand PPO-based continuous refinement module and achieves 100% mission success across nine algorithmically generated benchmark scenarios, while reducing the average propulsion energy to 104.9 normalized units and average travel time to 207.8 s. Under our unified evaluation, which jointly considers success rate, energy, time, and normalized path quality, PAIR consistently outperforms classical discrete planners (A*, D* Lite, CBS-D*) and a PSO metaheuristic baseline, attaining a top unified score of 84.56 without sacrificing reliability. The released 2-D benchmark and simulator are intended to be a reusable baseline for future work on dynamic multi-UAV navigation in MEC-inspired urban environments and as a stepping stone toward full 3-D and hardware-in-the-loop evaluations.
Looking ahead, several directions can further enhance the applicability and performance of PAIR:
  • Full 3-D Navigation: Extend the state and action spaces to include altitude variations, enabling obstacle avoidance in true three-dimensional urban canyons.
  • Robustness to Perception Uncertainty: Integrate noisy sensor models and adversarial perturbations into training, or adopt uncertainty-aware RL techniques to maintain performance under imperfect state estimation.
  • End-to-End Joint Learning: Co-train the discrete backbone heuristic and PPO policy within a unified curriculum, allowing global and local planners to co-adapt.
  • Swarm-Scale Coordination: Investigate federated or hierarchical PPO schemes to support densely populated UAV fleets, sharing policy improvements without centralized data exchange.
  • MEC Co-Optimization: Incorporate live network latency and computation offloading deadlines into the reward function, achieving truly integrated flight-computational planning for 6G-enabled UAV networks.
  • Formal safety guarantees: While PAIR ensures a high probability of safety, deterministic guarantees require integration with formal methods such as control barrier certificates, which we leave for future work.
  • 3-D and real-data validation: Port PAIR to full 3-D kinematics and evaluate it on standardized or real multi-UAV datasets (such as LiDAR/trajectory logs), complementing the 2-D synthetic benchmark reported here.
By pursuing these extensions, future work will bring PAIR closer to deployment in real-world UAV swarms.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/drones10010058/s1. The supplementary files include additional visual demonstrations (GIFs) and supporting experimental materials for this study.

Author Contributions

Conceptualization, B.H.T. and J.L.; methodology, B.H.T. and J.L.; software, B.H.T.; validation, B.H.T., J.L., Y.Q. and H.R.S.; formal analysis, B.H.T. and J.L.; investigation, B.H.T.; resources, J.L. and Y.Q.; data curation, B.H.T.; writing, original draft preparation, B.H.T.; writing, review and editing, J.L., Y.Q. and H.R.S.; visualization, B.H.T.; supervision, J.L. and Y.Q.; project administration, J.L. and Y.Q.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62372163 and the Science and Technology Innovation Program of Hunan Province under Grant 2024RC1033.

Data Availability Statement

The data that support the findings of this study are openly available in the “Urban Multi-UAV Path Planning Simulation Dataset (2-D Dynamic Urban MEC Scenarios)” repository on Figshare, at https://doi.org/10.6084/m9.figshare.30787730. The dataset contains the nine benchmark urban scenarios used in our experiments (static and dynamic obstacle fields, UAV start–goal pairs, and dynamic obstacle trajectories) in CSV/JSON format. Any additional information related to the data, or scripts for loading and basic visualization, is available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the College of Computer Science and Electronic Engineering at Hunan University and the College of Computer Science and Information Technology at the University of Sumer for their administrative and technical support, as well as the computing resources used in this work.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A. PPO Hyperparameters and Training Configuration

Table A1 reports the complete PPO training configuration used throughout the paper. We adopt standard PPO settings commonly used for continuous-control navigation ( γ = 0.99 , GAE λ = 0.95 , clipping ϵ = 0.2 ), and we keep them fixed across all benchmarks for reproducibility and to avoid scenario-specific overfitting. We do not claim exhaustive hyperparameter optimality; beyond basic sanity checks to ensure stable learning and convergence, a full sensitivity/ablation study is left to future work. Unless otherwise stated, the “Hardware” entry in Table A1 refers to the platform used for the runtime experiments reported in the paper (training/evaluation and online PPO inference), while a separate workstation is used only for offline plotting.
Table A1. PPO hyperparameters and training configuration.
Parameter | Value
Network architecture | 2-layer MLP, 128 ReLU units per layer
Optimizer | Adam
Actor learning rate | 3 × 10⁻⁴
Critic learning rate | 1 × 10⁻³
Discount factor (γ) | 0.99
GAE parameter (λ_GAE) | 0.95
PPO clipping parameter (ε) | 0.20
Batch size | 2048
Epochs per update | 50
Total environment steps | 500,000
Reward weights (w_d, w_r, w_e) | (1.0, 0.5, 0.1)
Training scenarios | 3 static + 3 dynamic maps
Hardware | Embedded GPU (Jetson Xavier NX)
Inference latency | ≈5 ms per PPO call
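For concreteness, the two PPO ingredients fixed in Table A1 can be sketched as follows: Generalized Advantage Estimation with γ = 0.99 and λ = 0.95, and the clipped surrogate objective with ε = 0.20. This is a minimal illustrative sketch, not the paper’s implementation; function names are our own.

```python
import numpy as np

GAMMA, LAM, EPS = 0.99, 0.95, 0.20  # values from Table A1

def gae_advantages(rewards, values, last_value, gamma=GAMMA, lam=LAM):
    """Generalized Advantage Estimation over one rollout (backward recursion)."""
    values = np.append(values, last_value)  # bootstrap with V(s_T)
    adv = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv

def clipped_surrogate(ratio, adv, eps=EPS):
    """PPO-Clip objective: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    return float(np.minimum(ratio * adv,
                            np.clip(ratio, 1 - eps, 1 + eps) * adv).mean())
```

With ratio r = 2 and advantage A = 1, the clipped term caps the objective at 1 + ε = 1.2, which is what bounds the size of each policy update.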

Appendix B. Baseline Hyperparameters and Tuning Protocol

For transparency and to ensure a fair comparison, we keep all baseline configurations fixed across the nine benchmark maps and obstacle-density sweeps. Table A2 summarizes the key hyperparameters for PSO and CBS-D* Lite. The PSO settings follow the standard recommendations for swarm-based path planning and are chosen to balance the solution quality against the per-step replanning budget, while the CBS-D* Lite parameters bound the size of the conflict tree and rollback window so that planning latency remains comparable to the PAIR 10 Hz target. Rather than exhaustively tuning each baseline, we adopt these simple, literature-inspired configurations and use them consistently in all experiments, so that the performance gains reported in Table 3 primarily reflect architectural differences rather than aggressive hyperparameter search.
We set the PSO inertia weight to w = 0.7 as a standard, literature-consistent mid-range choice that balances exploration (large inertia) against convergence (small inertia) in waypoint-based PSO planning. Following the canonical PSO formulation [16], we keep (w, c1, c2) constant across all maps to avoid per-scenario tuning advantages and to keep replanning budgets comparable.
CBS-D* Lite resolves inter-agent conflicts by expanding a conflict tree and triggering low-level repairs. In highly dynamic environments, frequent obstacle updates can repeatedly invalidate repaired segments, which increases conflict-tree churn and induces conservative detours. This mechanism explains why CBS-D* Lite remains reliable in our tests but often exhibits longer travel times and higher replanning overhead than PAIR and PSO (Table 3).
Table A2. Key hyperparameters for PSO and CBS-D* Lite baselines.
Planner | Setting
PSO | Swarm size N_p = 30 particles
PSO | Intermediate waypoints per path N_wp = 8
PSO | Inertia weight w = 0.7
PSO | Cognitive/social gains c1 = c2 = 1.4
PSO | Iterations per replanning = 200
CBS-D* Lite | Maximum banned cells per agent = 10
CBS-D* Lite | Rollback window = 2 waypoints
CBS-D* Lite | Inter-agent safety radius d_safe (as in Section 5.7)
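The canonical PSO update [16] with the Table A2 settings (w = 0.7, c1 = c2 = 1.4) can be sketched as below. The waypoint encoding (a swarm of N_wp-waypoint vectors) follows the table, but the function and variable names are illustrative and not taken from the paper’s code.

```python
import numpy as np

W, C1, C2 = 0.7, 1.4, 1.4  # inertia and cognitive/social gains from Table A2

def pso_step(pos, vel, pbest, gbest, rng):
    """One canonical PSO velocity/position update for a swarm of waypoint sets.

    pos, vel: (N_p, N_wp, 2) arrays of particle positions and velocities.
    pbest:    per-particle best positions; gbest: swarm best (broadcastable).
    """
    r1 = rng.random(pos.shape)  # i.i.d. uniform [0, 1) perturbations
    r2 = rng.random(pos.shape)
    vel = W * vel + C1 * r1 * (pbest - pos) + C2 * r2 * (gbest - pos)
    return pos + vel, vel
```

When a particle already sits at both its personal and the global best, the attraction terms vanish and the velocity simply decays by the inertia factor w = 0.7, which is the convergence behaviour the mid-range inertia choice is meant to provide.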

References

  1. Raja, G.; Manoharan, A.; Siljak, H. Ugen: UAV and GAN aided ensemble network for post disaster survivor detection through ORAN. IEEE Trans. Veh. Technol. 2024, 73, 9296–9305. [Google Scholar] [CrossRef]
  2. Wang, T.; Huang, X.; Wu, Y.; Qian, L.; Lin, B.; Su, Z. UAV swarm assisted two tier hierarchical federated learning. IEEE Trans. Netw. Sci. Eng. 2023, 11, 943–956. [Google Scholar] [CrossRef]
  3. He, J.; Wang, J.; Wang, N.; Guo, S.; Zhu, L.; Niyato, D.; Xiang, T. Preventing non-intrusive load monitoring privacy invasion: A precise adversarial attack scheme for networked smart meters. arXiv 2024, arXiv:2412.16893. [Google Scholar] [CrossRef]
  4. Tian, B.; Wang, L.; Xu, L.; Pan, W.; Wu, H.; Li, L.; Han, Z. UAV assisted wireless cooperative communication and coded caching: A multiagent two timescale DRL approach. IEEE Trans. Mob. Comput. 2023, 23, 4389–4404. [Google Scholar] [CrossRef]
  5. Qu, Y.; Sun, H.; Dong, C.; Kang, J.; Dai, H.; Wu, Q.; Guo, S. Elastic collaborative edge intelligence for UAV swarm: Architecture, challenges, and opportunities. IEEE Commun. Mag. 2023, 62, 62–68. [Google Scholar] [CrossRef]
  6. Wang, Y.; Sheng, M.; Wang, X.; Wang, L.; Li, J. Mobile edge computing: Partial computation offloading using dynamic voltage scaling. IEEE Trans. Commun. 2016, 64, 4268–4282. [Google Scholar] [CrossRef]
  7. Long, Y.; Zhao, S.; Gong, S.; Gu, B.; Niyato, D.; Shen, X. AoI aware sensing scheduling and trajectory optimization for multi UAV assisted wireless backscatter networks. IEEE Trans. Veh. Technol. 2024, 73, 15440–15455. [Google Scholar] [CrossRef]
  8. Mao, Y.; You, C.; Zhang, J.; Huang, K.; Letaief, K.B. A survey on mobile edge computing: The communication perspective. IEEE Commun. Surv. Tutor. 2017, 19, 2322–2358. [Google Scholar] [CrossRef]
  9. Zhan, C.; Hu, H.; Liu, Z.; Wang, Z.; Mao, S. Multi UAV enabled mobile edge computing for time constrained IoT applications. IEEE Internet Things J. 2021, 8, 15553–15567. [Google Scholar] [CrossRef]
  10. Trotti, F.; Farinelli, A.; Muradore, R. A Markov decision process approach for decentralized UAV formation path planning. In Proceedings of the 2024 European Control Conference (ECC), Stockholm, Sweden, 25–28 June 2024; pp. 436–441. [Google Scholar]
  11. Dai, X.; Xiao, Z.; Jiang, H.; Lui, J.C.S. UAV assisted task offloading in vehicular edge computing networks. IEEE Trans. Mob. Comput. 2024, 23, 2520–2534. [Google Scholar] [CrossRef]
  12. Chang, H.; Chen, Y.; Zhang, B.; Doermann, D. Multi UAV mobile edge computing and path planning platform based on reinforcement learning. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 6, 489–498. [Google Scholar] [CrossRef]
  13. Jin, J.; Zhang, Y.; Zhou, Z.; Jin, M.; Yang, X.; Hu, F. Conflict based search with D* Lite algorithm for robot path planning in unknown dynamic environments. Comput. Electr. Eng. 2023, 105, 108473. [Google Scholar] [CrossRef]
  14. Li, Y.; Wu, R.; Gan, L.; He, P. Development of an effective relay communication technique for multi UAV wireless network. IEEE Access 2024, 12, 74087–74095. [Google Scholar] [CrossRef]
  15. Wang, Y.; Zhu, J.; Huang, H.; Xiao, F. Bi objective ant colony optimization for trajectory planning and task offloading in UAV assisted MEC systems. IEEE Trans. Mob. Comput. 2024, 23, 12360–12377. [Google Scholar] [CrossRef]
  16. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95-International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar]
  17. Ejaz, M.; Gui, J.; Asim, M.; Affendi, M.A.E.; Fung, C.; Latif, A.A.A. RL Planner: Reinforcement learning enabled efficient path planning in multi UAV MEC systems. IEEE Trans. Netw. Serv. Manag. 2024, 21, 3317–3329. [Google Scholar] [CrossRef]
  18. Yuan, H.; Wang, M.; Bi, J.; Shi, S.; Yang, J.; Zhang, J.; Zhou, M.; Buyya, R. Cost efficient task offloading in mobile edge computing with layered unmanned aerial vehicles. IEEE Internet Things J. 2024, 11, 30496–30509. [Google Scholar] [CrossRef]
  19. Mach, P.; Becvar, Z. Mobile edge computing: A survey on architecture and computation offloading. IEEE Commun. Surv. Tutor. 2017, 19, 1628–1656. [Google Scholar] [CrossRef]
  20. Qin, P.; Fu, Y.; Xie, Y.; Wu, K.; Zhang, X.; Zhao, X. Multi agent learning based optimal task offloading and UAV trajectory planning for AGIN power IoT. IEEE Trans. Commun. 2023, 71, 4005–4017. [Google Scholar] [CrossRef]
  21. Liang, K.; Wang, Y.; Li, Z.; Zheng, G.; Wong, K.K.; Chae, C.B. Digital twin assisted deep reinforcement learning for computation offloading in UAV systems. IEEE Trans. Veh. Technol. 2025, 74, 8466–8471. [Google Scholar] [CrossRef]
  22. Cherif, B.; Ghazzai, H.; Alsharoa, A.; Besbes, H.; Massoud, Y. Aerial LiDAR based 3D object detection and tracking for traffic monitoring. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5. [Google Scholar]
  23. Li, C.; Gan, Y.; Zhang, Y.; Luo, Y. A cooperative computation offloading strategy with on demand deployment of multi UAVs in UAV aided mobile edge computing. IEEE Trans. Netw. Serv. Manag. 2023, 21, 2095–2110. [Google Scholar] [CrossRef]
  24. Wu, X.; Lei, Y.; Tong, X.; Zhang, Y.; Li, H.; Qiu, C.; Guo, C.; Sun, Y.; Lai, G. A non rigid hierarchical discrete grid structure and its application to UAVs conflict detection and path planning. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 5393–5411. [Google Scholar] [CrossRef]
  25. Zhong, L.; Zhao, J.; Luo, H.; Hou, Z. Hybrid path planning and following of a quadrotor UAV based on deep reinforcement learning. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC), Xi’an, China, 25–27 May 2024; pp. 1858–1863. [Google Scholar]
  26. Jayaweera, N.; Rajatheva, N.; Latva-aho, M. Autonomous driving without a burden: View from outside with elevated LiDAR. In Proceedings of the 2019 IEEE 89th Vehicular Technology Conference (VTC2019-Spring), Kuala Lumpur, Malaysia, 28 April–1 May 2019; pp. 1–7. [Google Scholar]
  27. Li, F.; Luo, J.; Sun, P.; Teng, S. Energy efficient UAV based data collection 3D trajectory optimization with wireless power transfer for forest monitoring. IEEE Internet Things J. 2025, 12, 24071–24082. [Google Scholar] [CrossRef]
  28. Luo, Z.; Zhang, J.; Wei, J.; Zhou, L.; Cao, K.; Zhao, H. Trajectory design and task scheduling for multi UAV aided mobile edge computing networks. In Proceedings of the 2025 IEEE Wireless Communications and Networking Conference (WCNC), Milan, Italy, 24–27 March 2025; pp. 1–6. [Google Scholar]
  29. Gao, Y.; Tao, J.; Xu, Y.; Wang, Z.; Gao, Y.; Wang, M. Improving user QoE via joint trajectory and resource optimization in multi UAV assisted MEC. IEEE Trans. Services Comput. 2025, 18, 1472–1486. [Google Scholar] [CrossRef]
  30. Yin, L.; Luo, J.; Qiu, C.; Wang, C.; Qiao, Y. Joint task offloading and resource allocation for hybrid vehicle edge computing systems. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10355–10368. [Google Scholar] [CrossRef]
  31. Yang, A.; Huan, J.; Wang, Q.; Yu, H.; Gao, S. ST-D3QN: Advancing UAV path planning with an enhanced deep reinforcement learning framework in ultra-low altitudes. IEEE Access 2025, 13, 65285–65300. [Google Scholar] [CrossRef]
  32. Wu, T.; Zhang, Z.; Jing, F.; Gao, M. A dynamic path planning method for UAVs based on improved informed-RRT* fused dynamic windows. Drones 2024, 8, 539. [Google Scholar] [CrossRef]
  33. Xie, J.; Huang, W.; Miao, J.; Li, J.; Cao, S. Off-policy deep reinforcement learning for path planning of stratospheric airship. Drones 2025, 9, 650. [Google Scholar] [CrossRef]
  34. Wang, Y.; Liu, J.; Qian, Y.; Yi, W. Path planning for multi-UAV in a complex environment based on reinforcement-learning-driven continuous ant colony optimization. Drones 2025, 9, 638. [Google Scholar] [CrossRef]
  35. Rahman, M.; Sarkar, N.I.; Lutui, R. A survey on multi-UAV path planning: Classification, algorithms, open research problems, and future directions. Drones 2025, 9, 263. [Google Scholar] [CrossRef]
  36. Meng, W.; Zhang, X.; Zhou, L.; Guo, H.; Hu, X. Advances in UAV path planning: A comprehensive review of methods, challenges, and future directions. Drones 2025, 9, 376. [Google Scholar] [CrossRef]
  37. Andrychowicz, M.; Raichuk, A.; Stańczyk, P.; Orsini, M.; Girgin, S.; Marinier, R.; Hussenot, L.; Geist, M.; Pietquin, O.; Michalski, M.; et al. What matters for on-policy deep actor-critic methods? A large-scale study. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021. [Google Scholar]
  38. Kunda, N.S.S.; Kc, P.; Pandey, M.; Kumaar, A.A.N. Reward design and hyperparameter tuning for generalizable deep reinforcement learning agents in autonomous racing. Sci. Rep. 2025, 15, 43940. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Success rates across the nine benchmark paths (unit: %).
Figure 2. Average energy consumption of pathfinding algorithms (normalized units).
Figure 3. Travel time per path for each algorithm (unit: seconds). The annotation “fail” indicates that the planner did not successfully reach the goal on that specific path.
Figure 4. Path planning visualization (A*).
Figure 5. Path planning visualization (D* Lite).
Figure 6. Path planning visualization (CBS-D* Lite).
Figure 7. Path planning visualization (PSO).
Figure 8. Path planning visualization (PAIR).
Figure 9. Average unified scores of pathfinding algorithms.
Figure 10. Path length vs. number of dynamic obstacles (unit: grid units).
Figure 11. Steps to goal vs. number of dynamic obstacles.
Figure 12. Total replanning time vs. number of dynamic obstacles (unit: milliseconds).
Table 1. Qualitative comparison of planners used in this study. A* denotes A-star; D* Lite denotes D-star Lite; CBS-D* Lite denotes Conflict-Based Search with D* Lite.
Planner | Strengths | Limitations | Ref.
A* | Optimal on static grids with admissible heuristic | No incremental repair; fails under dynamic obstacles | [12,24]
D* Lite | Incremental replanning in changing environments | Can degrade under frequent dynamic updates | [13]
CBS-D* Lite | Conflict-aware multi-agent planning | Conflict-tree overhead; conservative detours under high dynamics | [13]
PSO | Continuous waypoint optimization; smooth paths | Stochastic convergence; high replanning time when iterated | [14,16]
PAIR (A*+PPO) | Fast backbone + learned local corrections; lowest replanning time in sweep | Current PPO reward is navigation-only; MEC coupling is future work | [17]
Table 2. Nomenclature.
Symbol | Definition
P_i | A*-computed waypoint sequence for UAV i
p_{i,j} | jth waypoint in sequence P_i
Δ_{i,j} | Continuous offset applied at waypoint p_{i,j}
d_safe | Minimum safe separation distance
N_FZ | Set of no-fly-zone vertices
π_θ | PPO policy network with parameters θ
a_t | Continuous action (offset) from π_θ at time t
w_d, w_r, w_e | Reward weights for distance, risk, and energy
T | Set of generated tasks, {1, …, M}
w_j | CPU workload of task j (core-seconds)
m_j | Memory footprint of task j (MB)
d_j | Deadline of task j (s) from issuance
x_{i,j} | 1 if task j is assigned to UAV i, else 0
C_i^cpu | CPU capacity of UAV i (core-seconds)
C_i^mem | Memory capacity of UAV i (MB)
B_i^bat | Remaining battery capacity of UAV i (energy units)
τ_{i,j} | Predicted execution time of task j on UAV i (s)
e_j | Estimated energy to execute task j (energy units)
N | Number of UAVs (swarm size)
s_i, g_i | Start and goal positions of UAV i on the grid
G = (V, E) | Grid graph with vertices V and edges E
E_total | Total propulsion-energy surrogate over all UAV paths
T_total | Total travel time aggregated over all UAV paths
L_{A,p} | Path length produced by algorithm A on scenario/path p (grid units)
Δt | Simulator time step (s) used for replanning/broadcasting
Table 3. Algorithm performance on the nine benchmark paths. Values are mean ± standard deviation across the nine scenarios; the travel time is averaged over successful trajectories only.
Algorithm | Success Rate (%) | Avg. Energy (Normalized Units) | Avg. Time (s) | Unified Score
A* | 11.1 ± 33.3 | 899.66 ± 301.03 | 201.4 | 6.62 ± 19.87
D* Lite | 44.4 ± 52.7 | 594.01 ± 482.27 | 179.4 ± 74.4 | 33.62 ± 40.44
CBS-D* | 100.0 ± 0.0 | 126.45 ± 54.67 | 236.8 ± 78.4 | 16.88 ± 35.59
PSO | 100.0 ± 0.0 | 109.44 ± 47.26 | 214.3 ± 69.3 | 73.17 ± 38.13
PAIR | 100.0 ± 0.0 | 104.91 ± 44.36 | 207.8 ± 67.8 | 84.56 ± 34.01
A* succeeds on only one of the nine benchmark paths; the reported time is that single successful trajectory (the standard deviation is therefore not meaningful and is omitted).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
