1. Introduction
Unmanned aerial vehicles (UAVs) are increasingly deployed for critical missions, from disaster response to precision agriculture, but their onboard processors cannot keep pace with the high-resolution LiDAR, video, and sensor streams required for real-time inference [1,2,3]. Deep neural networks (DNNs) deliver high-fidelity perception yet exceed UAV size, weight, and power budgets when executed locally. This has motivated several mobile edge computing (MEC) schemes that offload DNN layers to ground or edge servers [4,5,6]. At the same time, multi-UAV operation adds a second layer of complexity, requiring continuous deconfliction under dynamic obstacles and inter-agent interactions. These safety checks can severely limit real-time replanning on embedded hardware, making heavy iterative optimization impractical.
In dynamic MEC operations, path planning must account for both aerial navigation and real-time computational offloading, even as the physical and wireless environments fluctuate. For example, intermittent wireless connectivity may interrupt DNN layer uploads or downloads mid-flight [7,8], while variable network load can cause MEC task deadlines to shift unpredictably [9]. At the same time, unplanned no-fly zones [10], such as temporary airspace restrictions, or urgent edge-computing requests can force UAVs to abandon their original routes or reassign their roles during a mission [11]. These coupled challenges require planners to find safe paths around both static and moving obstacles while ensuring that each offloaded computational task meets its deadline. Thus, they must adapt continuously to link failures, deadline changes, and emergent airspace constraints in real time.
Given a 2-D urban grid with static obstacles, time-varying dynamic obstacles, and N UAVs with start–goal pairs , our goal is to generate collision-free trajectories that (i) respect no-fly regions and a minimum separation distance from dynamic obstacles and other UAVs, and (ii) minimize geometric mission cost (path length and a propulsion-energy surrogate) while maintaining real-time replanning capability. The airspace is modeled as a fixed-altitude 2-D projection for controlled, reproducible evaluation. A heterogeneous task-assignment model provides a MEC-inspired operating context by pre-assigning tasks to UAVs under CPU/memory/battery/deadline constraints; however, in the current implementation, the learning agent optimizes navigation-only signals (distance, aggregated risk, and energy surrogate), and MEC decisions are not included in the PPO reward.
Classical graph-based planners (A* (A-star), D* Lite (D-star Lite), and CBS-D* (conflict-based search with D-star Lite)) guarantee static collision-free routes but suffer from replanning latency and deadlocks under dynamic obstacles [12,13]. Metaheuristics such as Particle Swarm Optimization (PSO) continuously adapt but require careful parameter tuning and may converge slowly in cluttered urban layouts [14,15,16]. Hybrid reinforcement learning (RL) planners refine coarse waypoints through local, risk-aware corrections yet lack unified, mission-level evaluations across dynamic scenarios [17].
To address the joint challenges of airspace dynamics and MEC offloading, we propose PAIR, a decentralized hybrid framework integrating a fast A* global backbone with a continuous PPO-based refinement module.
Our key contributions are as follows:
MEC-inspired formulation: We model multi-UAV navigation under a MEC-inspired operating context where heterogeneous task profiles and deadlines motivate time-sensitive replanning. In our simulation experiments, task assignment is computed offline by a separate model and remains constant during navigation experiments; the PAIR learning component optimizes navigation-only signals.
Hybrid PAIR architecture: We propose a two-stage scheme where A* computes global routes and a PPO agent applies real-time trajectory corrections based on local risk information. In addition, a heterogeneous MEC task-assignment model (Section 3.5) computes resource-feasible task-to-UAV assignments offline subject to CPU/memory/battery and deadline constraints; these assignments are held constant during the navigation experiments.
Unified yet transparent evaluation: We compare PAIR against classical planners (A*, D* Lite, and CBS-D*) and PSO on nine 2-D urban grid scenarios with static/dynamic obstacles. PAIR achieves 100% success, 104.9 normalized energy units, and 207.8 s travel time, outperforming baselines in energy and latency. A ‘unified score’ summarizes path quality, with separate reporting of success, energy, and time to reveal trade-offs.
Reproducible 2-D benchmark and density sweep: We release a reproducible 2-D fixed-altitude benchmark with procedurally generated maps and Markovian dynamic obstacles, validated for coverage, connectivity, and clustering, together with a single-map obstacle-density sweep from 5 to 30 dynamic obstacles. The dataset is intended to be an open baseline for future multi-UAV navigation work in dynamic MEC-inspired environments, while full 3-D flight dynamics and hardware-in-the-loop validation are explicitly left to future work.
The remainder of this paper is organized as follows.
Section 1 introduces the problem context and our key contributions.
Section 2 reviews related work on discrete, metaheuristic, and RL-based UAV path planning.
Section 3 presents our system model, problem formulation, and evaluation metrics.
Section 4 details the proposed PAIR framework.
Section 5 describes the simulation setup.
Section 6 reports the comparative results and analysis.
Section 7 discusses the observed trade-offs and practical considerations. Finally,
Section 8 concludes the paper and outlines directions for future work.
2. Related Work
2.1. Collaborative DNN Partitioning and Offloading
Collaborative DNN inference hinges on dividing a network across devices to exploit distributed computing. Early MEC frameworks proposed joint model partitioning and offloading between edge servers and mobile devices, targeting latency reduction and resource efficiency [6,18,19]. Fine-grained model splitting with asynchronous actor–critic methods further reduces inference delay under dynamic workloads [20,21]. Multi-UAV strategies distribute early network layers across aerial agents to accelerate inference but often overlook UAV mobility and link variability, causing suboptimal and unbalanced workloads [22,23]. However, none of these works simultaneously addresses dynamic obstacle avoidance alongside DNN offloading, which is a key motivation for the integrated view of our PAIR framework.
2.2. Discrete and Incremental Path Planners
Graph-based planners remain foundational for UAV trajectories. A* finds optimal paths on static grids but requires complete replanning when obstacles shift [24], causing latency spikes in dynamic environments [12]. D* Lite, on the other hand, incrementally adjusts its routes to avoid moving obstacles but degrades under heavy dynamicity [13]. To address this drawback, Conflict-Based Search with D* Lite (CBS-D*) combines the incremental repair capabilities of D* Lite with a high-level conflict-resolution mechanism to guarantee collision-free trajectories for individual agents. However, when scaled to multiple UAVs and concurrent paths, CBS-D* suffers from substantial replanning overhead: every moving obstacle or inter-agent conflict triggers both low-level D* Lite repairs and high-level conflict-tree updates. This becomes problematic in rapidly changing environments; as the number of agents or the obstacle update rate increases, latency spikes and overly conservative detours degrade both responsiveness and path optimality [17]. To overcome these replanning bottlenecks, some researchers have turned to continuous and population-based search methods.
2.3. Metaheuristic Search
Swarm-intelligence algorithms like PSO encode waypoints as particles updated by individual and global bests; this enables continuous adaptation to non-convex, time-varying cost surfaces without grid constraints. PSO navigates complex obstacle fields but usually requires careful tuning of inertia and acceleration coefficients and may converge slowly or become trapped in dynamic scenarios [14,15]. While PSO can escape local minima better than pure grid methods, its lack of learned experience motivates hybrid RL planners that adapt to situational data [17,25].
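The particle-update rule described above can be sketched as follows; the inertia weight (`w`), acceleration coefficients (`c1`, `c2`), and velocity clamp (`vmax`) below are illustrative defaults for exposition, not tuned values from the cited works:

```python
import numpy as np

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5, vmax=0.5, rng=None):
    """One canonical PSO update: blend inertia with pulls toward each
    particle's personal best and the swarm's global best, then clamp speed.

    pos, vel, pbest: (n_particles, dim) arrays; gbest: (dim,) array.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    vel = np.clip(vel, -vmax, vmax)   # enforce the maximum particle speed
    return pos + vel, vel
```

In a planner, each particle encodes a sequence of intermediate waypoints, and the fitness balances path length against obstacle clearance.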
2.4. RL-Driven Hybrid Planners
Hybrid planners integrate coarse discrete paths with RL-based local corrections. For example, several RL planners leverage deep Q-learning to refine A* or D* Lite trajectories, often balancing distance, collision risk, and energy costs for smoother, safer routes [25,26]. These approaches adapt to moving obstacles but lack comprehensive real-time evaluations under unified mission-level criteria [27], usually focusing on preliminary proof-of-concept scenarios or isolated metrics. Against this backdrop, this study provides a unified mission-level evaluation across realistic maps and planner classes.
2.5. Joint Path Planning and Task Offloading
In multi-UAV MEC networks, trajectory design and computation offloading are fundamentally intertwined. The chosen flight paths determine the instantaneous channel quality and transmission delays, while the spatial–temporal distribution of service requests dictates optimal waypoints and loitering patterns. Lou et al. propose a deep reinforcement learning framework that jointly learns UAV waypoints and offloading policies by embedding deadline, energy, and security constraints directly into the RL reward function [28,29]. This cross-layer approach simultaneously optimizes UAV speeds, waypoint selection, and edge–cloud partitioning decisions to minimize end-to-end latency and energy consumption while maintaining resilient service in dynamic 6G smart-city environments. However, although these works ensure inter-UAV collision avoidance through safe-distance constraints, they assume an obstacle-free airspace and do not model more nuanced environmental obstacles, such as buildings and no-fly zones [30]. As a result, real-time obstacle detection and avoidance remain unaddressed, leaving a critical gap for deployments in cluttered or rapidly changing urban environments. Compared with RL-Planner [17] and recent joint trajectory/offloading DRL schemes [28,29], PAIR targets a complementary regime. RL-Planner focuses on single-UAV MEC scenarios and employs deep Q-learning to refine discrete paths but does not model multi-agent interactions or dense dynamic obstacles. Joint DRL works such as [28,29] co-optimize UAV trajectories and computation offloading decisions yet typically assume obstacle-free airspace and represent interference only through aggregate channel models. In contrast, PAIR combines an A* global backbone with continuous PPO-based corrections in a multi-UAV setting with explicit static and dynamic obstacles, performs a unified mission-level evaluation (success rate, energy, time, and path-quality score), and integrates a heterogeneous task-assignment model to account for MEC resource heterogeneity, even though the current PPO agent is trained solely on navigation-related rewards.
To fill this gap, in
Section 4 we introduce PAIR, a hybrid A* with PPO framework that combines dynamic obstacle avoidance with a MEC-aware task-assignment layer in a unified system architecture; in the present implementation, the PPO agent optimizes geometric navigation (path length, local risk, and energy), while offloading decisions and deadline satisfaction are governed by the task-assignment model rather than the RL reward.
Several recent works have explored intelligent path planning in dynamic environments using deep reinforcement learning and hybrid methods. Yang et al. propose a DRL-based planner for dynamic scenes that models the environment as an MDP and uses an improved D3QN to generate collision-free paths under moving obstacles [31]. Wu et al. combine an enhanced Informed-RRT* sampler with a dynamic window approach for local obstacle avoidance, improving path quality and real-time performance in cluttered maps [32]. Off-policy DRL schemes such as RPL-TD3 have also been applied to UAV trajectory planning with recurrent feature extractors for temporal dynamics [33]. In addition, Wang et al. introduce a reinforcement-learning-driven continuous ant colony optimizer for multi-UAV path planning in complex terrain [34], while recent surveys in Drones systematically classify multi-UAV path-planning algorithms and highlight open problems in dynamic, cluttered environments [35,36].
Compared with these studies, PAIR targets a complementary design space. It integrates a lightweight A* backbone with a shared PPO policy for continuous local refinement, focuses explicitly on multi-UAV interaction with static and Markovian dynamic obstacles on a controlled 2-D benchmark, and reports unified mission-level metrics (success rate, normalized path quality, energy, and time) across nine scenarios and an obstacle-density sweep. Rather than competing with specific DRL architectures, PAIR is intended to be a reusable hybrid baseline and open benchmark for dynamic multi-UAV navigation in MEC-inspired urban environments.
Table 1 summarizes a qualitative comparison of the planners considered in this study, highlighting their main strengths and limitations.
3. System Model and Problem Statement
3.1. Environment Model
We consider a fleet of N homogeneous UAVs acting as mobile edge servers over a smart-city region discretized into a 2-D grid . Each grid vertex represents a feasible waypoint, and each edge denotes a straight-line flight corridor. UAV i must navigate from a fixed start to a goal while the following hold:
Dynamic Obstacles: Mobile obstacles (such as cranes, other flying objects, and temporary no-fly zones) appear unpredictably.
Airspace Constraints: Altitude bounds and no-fly zones prohibit certain waypoints.
Inter-UAV Separation: A safe distance must be maintained between any two UAVs.
In this study, UAV motion is restricted to a single horizontal plane at a representative flight altitude, so the 3-D airspace is approximated by a 2-D grid projection. Altitude bounds and no-fly zones are therefore modeled as planar regions over
G; extending PAIR to full 3-D kinematics is left for future work. For clarity, Table 2 lists the key symbols and variables used throughout the paper.
We use
to index UAVs,
j to index waypoints along a planned path, and
t to index discrete simulator time steps. The start and goal locations for UAV
i are denoted by
and
. Path lengths
are non-negative by definition, with
used to encode failures (collision or deadlock). The grid environment and safety constraints follow standard multi-robot planning formulations [13,17], while the MEC context and task constraints are consistent with common UAV-assisted MEC models [9,28,29].
3.2. Objectives
Define the multi-objective cost as
where
and
is the discrete waypoint sequence for UAV
i.
3.3. Decision Variables
Waypoints: , the ordered list of grid points for UAV i.
Local Corrections: offsets applied at waypoint to avoid imminent collisions in the horizontal plane.
3.4. Constraints
- 1.
- 2.
- 3.
Dynamic Obstacle Avoidance: At time
t, denoting the location of obstacle
o by
,
- 4.
- 5.
3.5. Task Generation Model
We consider a set of heterogeneous service requests (tasks)
generated by the application layer. Each task
is described by its CPU workload
(core-seconds), memory requirement
(MB), and deadline
(s) from the issue time. For realism we draw
which roughly reflects lightweight to moderate IoT/MEC analytics tasks. Each UAV
i has heterogeneous resources
in CPU units, memory units, and remaining battery capacity. We introduce a binary assignment variable
and impose standard capacity and deadline constraints:
where
is the estimated energy to execute task
j and
is the predicted execution time of
j on UAV
i.
In this work, the heterogeneous assignment model in (10)–(14) is used purely as a fixed MEC-inspired context: tasks are pre-assigned offline based on their resource profiles, and these assignments remain constant during path-planning experiments. The PPO agent is trained and evaluated only on navigation-related signals (distance, local risk, and a propulsion-energy surrogate), and offloading decisions do not appear in the RL reward. Consequently, we treat MEC as a background configuration that shapes which UAV serves which task, while the scientific contribution of PAIR lies in the hybrid path-planning algorithm itself.
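A minimal sketch of such an offline, held-fixed pre-assignment is shown below. The execution-time model (`cpu_work / cpu`), the flat per-core-second energy coefficient, and the greedy first-fit strategy are illustrative placeholders for the paper's estimators and assignment procedure, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Task:
    cpu_work: float   # CPU workload (core-seconds)
    mem: float        # memory requirement (MB)
    deadline: float   # deadline (s) from issue time

@dataclass
class UAV:
    cpu: float        # CPU units (cores)
    mem: float        # memory units (MB)
    battery: float    # remaining battery capacity (energy units)

ENERGY_PER_CORE_S = 0.1   # assumed energy coefficient (placeholder)

def feasible(task, uav):
    """Capacity/deadline feasibility for one (task, UAV) pair, in the spirit
    of constraints (10)-(14)."""
    exec_time = task.cpu_work / uav.cpu            # predicted execution time
    energy = ENERGY_PER_CORE_S * task.cpu_work     # estimated execution energy
    return task.mem <= uav.mem and energy <= uav.battery and exec_time <= task.deadline

def greedy_assign(tasks, uavs):
    """Assign each task to the first feasible UAV; the result stays constant
    during the navigation experiments."""
    assignment = {}
    for j, t in enumerate(tasks):
        for i, u in enumerate(uavs):
            if feasible(t, u):
                assignment[j] = i
                u.battery -= ENERGY_PER_CORE_S * t.cpu_work  # book the energy
                break
    return assignment
```

Infeasible tasks simply remain unassigned in this sketch; the paper's model enforces the full binary-assignment constraints instead.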
3.6. Algorithmic Framework
We employ a two-stage hybrid scheme:
- 1.
Global Planner: A fast discrete backbone (such as A*) computes an initial waypoint sequence for each UAV i.
- 2.
Local Continuous Refinement: A PPO-based agent monitors for potential collisions or tight turns and applies continuous corrections as needed.
3.7. Performance Metrics
We evaluate each planner using the following:
Normalized Path Length (NPL): .
Energy Consumption (EC): Total propulsion energy .
Mission Success Rate (MSR): Fraction of UAVs reaching goals without collision.
Optimality Score (OS): Distance from the theoretical optimum.
Computation Overhead: Average replanning time per collision event.
Density-Sweep Scalability: How the key metrics (path length, steps to goal, and total planning time) vary as the number of dynamic obstacles increases from 5 to 30, allowing us to assess each planner’s performance under rising environmental complexity.
4. Proposed Framework
4.1. Global Planner Backbone
For each UAV, the initial path is generated on the grid graph via a discrete planner, with four options to ensure robustness in dynamic settings:
Algorithms 1–3 summarize PAIR and the baseline planners.
A*: A classic heuristic search minimizing path length under static obstacle assumptions [
12].
D* Lite: Incrementally repairs paths when the environment changes, avoiding full replanning [
13].
CBS–D* Lite (Algorithm 2): Multi-agent conflict-based search over D* Lite to guarantee collision-free global paths under inter-UAV constraints [
13].
PSO-Based Waypoint Search (Algorithm 3): Uses Particle Swarm Optimization to optimize a sequence of continuous waypoints balancing path length and obstacle clearance [
14].
| Algorithm 1 PAIR: decentralized PPO path planning with swarm coordination. |
| Require: UAV swarm size N, start/goal pairs , grid , comms range |
| Ensure: Collision-free trajectories |
| 1: | for all UAV in parallel do |
| 2: | % global backbone |
| 3: | % local swarm adjacency |
| 4: | end for |
| 5: | while any UAV not at goal do |
| 6: | for all UAV in parallel do |
| 7: | , |
| 8: | |
| 9: | % risk field from obstacles + neighbors |
| 10: | if then |
| 11: | % 6-D encoding from (15) |
| 12: | % continuous offset from PPO |
| 13: | % ensure feasibility |
| 14: | else |
| 15: | |
| 16: | end if |
| 17: | |
| 18: | % share updated pose for risk map |
| 19: | end for |
| 20: | end while |
| Algorithm 2 CBS–D* Lite: integrated multi-agent path planning. |
| Require: start–goal pairs , environment , max iterations I, safe distance |
| Ensure: conflict-free trajectories |
|
| 1: | Stage 1: Initial Planning |
| 2: | for all agent in parallel do |
| 3: | |
| 4: | end for |
|
| 5: | Stage 2: Conflict Resolution |
| 6: | iteration |
| 7: | while iteration do |
| 8: | detect conflict in under |
| 9: | if no conflict then |
| 10: | return |
| 11: | end if |
| 12: | |
| 13: | iteration ← iteration |
| 14: | end while |
|
| 15: | return |
| Algorithm 3 PSO: Particle Swarm Optimization path planning. |
| Require: start s, goal g, environment , swarm size n, waypoints m, weights , max speed , iterations T |
| Ensure: collision-free, smoothed path P |
| 1: | Initialization: |
| 2: | generate with m intermediate waypoints |
| 3: | initialize |
| 4: | for to n do |
| 5: | evaluate ; store personal best |
| 6: | end for |
| 7: | determine global best |
|
| 8: | Main Loop: |
| 9: | for to T do |
| 10: | for to n do |
| 11: | update ; clamp to |
| 12: | ; clamp to environment |
| 13: | evaluate ; update personal/global bests |
| 14: | end for |
| 15: | end for |
|
| 16: | Path Construction: |
| 17: | |
| 18: | smooth to P |
| 19: | return P |
We implement A*, D* Lite, CBS-D* Lite, and PSO as (i) direct baselines in our comparative study and (ii) optional backbone variants for the same simulation interface. In PAIR, A* is used as the default backbone to generate an initial discrete route, while PPO performs on-demand local refinements; the other three planners are evaluated as standalone planners (without PPO refinement) to quantify the benefit of the hybrid design under identical maps and safety checks.
Among these options, PAIR adopts A* as its default global backbone. The admissible Euclidean heuristic ensures shortest paths on static grids, the algorithm is lightweight enough for per-step replanning on embedded platforms, and the resulting paths provide a stable, interpretable reference that the learning-based refinement module can safely adjust. For the continuous correction stage we employ Proximal Policy Optimization (PPO), which directly outputs bounded continuous offsets and uses a clipped on-policy objective that is widely regarded as numerically stable for continuous-control tasks, avoiding the sensitivity of off-policy value-based methods in non-stationary multi-UAV environments.
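A minimal version of this A* backbone on a 4-connected occupancy grid with the Euclidean heuristic (the configuration used by our A* baseline) is sketched below; the actual implementation additionally applies the shared safety checks and replanning hooks:

```python
import heapq, math

def astar(grid, start, goal):
    """A* on a 4-connected occupancy grid (1 = obstacle cell), Euclidean
    heuristic. Returns the waypoint list from start to goal, or None if
    no collision-free path exists."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: math.dist(p, goal)                 # admissible heuristic
    open_set = [(h(start), 0.0, start, None)]        # (f, g, cell, parent)
    came_from, g_best = {}, {start: 0.0}
    while open_set:
        _, g, cur, parent = heapq.heappop(open_set)
        if cur in came_from:                         # already expanded
            continue
        came_from[cur] = parent
        if cur == goal:                              # reconstruct the path
            path = [cur]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        x, y = cur
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if 0 <= nx < rows and 0 <= ny < cols and grid[nx][ny] == 0:
                ng = g + 1.0                         # unit edge cost
                if ng < g_best.get(nxt, float("inf")):
                    g_best[nxt] = ng
                    heapq.heappush(open_set, (ng + h(nxt), ng, nxt, cur))
    return None
```

The returned waypoint sequence is what the PPO refinement stage subsequently adjusts with bounded continuous offsets.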
4.2. Local Continuous Refinement
To handle unpredictable, fast-moving obstacles and fine-tune energy-efficient trajectories, we introduce the
PPO-Adjusted Incremental Refinement (PAIR) module. PAIR formulates on-demand local waypoint correction as a Markov Decision Process (MDP) [
10]:
- State
We use a compact 6-D state encoding that matches our implementation and we keep the observation size fixed across swarm sizes:
where
is the current 2-D position and
is the goal. Specifically (i)
and
are normalized goal-direction offsets with
(map diagonal); (ii)
and
are normalized velocity components; (iii)
is the normalized local risk scalar at the current position; and (iv)
is the normalized Euclidean heuristic-to-go distance with
.
- Action
is a bounded continuous offset in the horizontal plane, applied to refine the next discrete waypoint before execution. In our experiments the agent outputs only 2-D corrections; extending the action space to include altitude for full 3-D navigation is left to future work.
- Reward
where
is the Euclidean distance term toward the next discrete waypoint,
penalizes local risk exposure,
is an instantaneous energy surrogate, and
is a sparse terminal bonus on reaching the final goal.
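To make the state and reward concrete, the following sketch assembles the 6-D observation and the shaped navigation reward. The ordering and normalization follow the text; the reward weights (`w_d`, `w_r`, `w_e`, `bonus`) and `v_max` are placeholders, since the actual values are specified in the experimental configuration:

```python
import math

D_MAX = math.hypot(100, 100)   # map diagonal of the 100 x 100 grid

def observe(pos, goal, vel, risk, v_max=1.0):
    """6-D PPO observation: normalized goal-direction offsets, normalized
    velocity, normalized local risk, and normalized heuristic-to-go."""
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    return [dx / D_MAX, dy / D_MAX,
            vel[0] / v_max, vel[1] / v_max,
            risk,                                  # already normalized to [0, 1]
            math.hypot(dx, dy) / D_MAX]            # Euclidean heuristic-to-go

def reward(step_dist, risk, energy, reached_goal,
           w_d=1.0, w_r=1.0, w_e=0.1, bonus=100.0):
    """Shaped reward: progress toward the next waypoint minus risk exposure
    and an instantaneous energy surrogate, plus a sparse terminal bonus."""
    r = -w_d * step_dist - w_r * risk - w_e * energy
    return r + (bonus if reached_goal else 0.0)
```

Keeping the observation fixed at six dimensions is what allows a single shared policy to be executed unchanged across swarm sizes.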
At each simulator time step
t, we rebuild a time-varying obstacle set (for UAV
i)
where
are fixed obstacles,
are Markovian dynamic obstacles at time
t (
Section 5.2), and
is the neighbor set within communication range (used to enforce inter-UAV separation). The instantaneous risk field is then updated as
and the scalar observation used in Equation (
15) is normalized by
where
are empirical bounds collected from training rollouts. Communication is simulated as a local broadcast of each UAV pose once per control step (i.e., every
), so neighbor influence enters
at the same update rate.
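A simple realization of this aggregated risk scalar is sketched below; the Gaussian kernel, its width `sigma`, and the clamp bounds `r_min`/`r_max` are illustrative stand-ins for the paper's risk field and its empirically collected normalization bounds:

```python
import math

def risk_at(pos, static_obs, dynamic_obs, neighbors, sigma=2.0):
    """Aggregate local risk as a sum of Gaussian kernels centered on static
    obstacles, the current dynamic obstacles, and neighboring UAV poses."""
    sources = list(static_obs) + list(dynamic_obs) + list(neighbors)
    return sum(math.exp(-math.dist(pos, s) ** 2 / (2 * sigma ** 2))
               for s in sources)

def normalize_risk(r, r_min=0.0, r_max=5.0):
    """Clamp-normalize to [0, 1] using empirical bounds gathered from
    training rollouts (placeholder values here)."""
    return min(max((r - r_min) / (r_max - r_min), 0.0), 1.0)
```

Because neighbor poses enter `risk_at` only as additional kernel centers, a pose broadcast once per control step suffices to keep the field current.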
4.2.1. PPO Training Details
We train the policy using Proximal Policy Optimization (PPO) with the following settings:
Network architecture: Actor and critic are two-layer MLPs with 128 ReLU-activated units per layer.
Hyperparameters:
- –
Discount factor , GAE .
- –
PPO clip .
- –
Learning rates: actor , critic .
- –
Batch size: 2048, epochs per update: 50.
Training: Converges stably within 500,000 environment steps across varied urban scenarios.
All PPO hyperparameters are reported in
Appendix A (
Table A1). We use a standard PPO configuration [
37,
38] and keep it fixed across all benchmarks; beyond basic stability/convergence checks, we do not perform an exhaustive hyperparameter search, and a full sensitivity study is left to future work.
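For reference, the clipped surrogate objective at the core of these PPO updates can be written as follows, here over a minibatch of probability ratios and advantage estimates; `eps = 0.2` is a common default used for illustration, since the exact clip value is reported in Appendix A:

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    """PPO clipped surrogate: maximize E[min(r*A, clip(r, 1-eps, 1+eps)*A)].
    Returns the negated objective so it can be minimized by gradient descent.

    ratio: pi_new(a|s) / pi_old(a|s) per sample; adv: advantage estimates.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```

The clipping keeps each policy update close to the data-collecting policy, which is the property that makes on-policy refinement stable in our non-stationary multi-UAV environment.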
4.2.2. PAIR: Hybrid Discrete–Continuous Path Planning (Algorithm)
From a multi-agent learning perspective, PAIR adopts a simple parameter-sharing scheme rather than a full centralized-critic MAPPO formulation. A single PPO policy is trained and then executed independently by all UAVs; each agent observes only its own kinematic state and the locally aggregated risk scalar
described in (
15). Neighboring UAVs influence the decision-making process only through this shared risk field and the occasional broadcast of updated poses, but their individual actions are neither coordinated through a joint action space nor optimized via a global value function.
During our empirical experiments, we observe that our independent, risk-aware PPO design is numerically stable for swarm sizes up to in our benchmark scenarios, with no oscillatory behaviors or emergent deadlocks beyond those already handled by our collision checks. We use the term “decentralized” to denote decentralized execution with local observations and a shared policy. A formal convergence analysis of the resulting multi-agent dynamics is beyond the scope of this paper and is left for future work.
4.3. Unified Evaluation Metrics
To compare discrete, metaheuristic, and hybrid RL planners on a common scale, we report four primary metrics for each algorithm A: (i) a normalized path-quality score (“unified score”), (ii) mission success rate, (iii) average propulsion energy, and (iv) average travel time. The unified score is designed to be interpreted as a simple grade in : on each benchmark scenario the shortest successful path receives 100 points, the longest successful path receives 0 points, and failed planners also receive 0. The overall unified score is then the average of these grades across the benchmark set. Success rate, energy, and time are reported separately so that no trade-offs are hidden.
- 1.
Normalized path-quality score (unified score). Let
P denote the total number of benchmark trajectories, and let
be the realized path length (in grid units) produced by algorithm
A on path
p. We treat
as a failure (collision or deadlock). For each path
p, we consider only the successful planners and define
The per-path score
assigns 100 points to the shortest successful paths (or to all successful planners if all successful paths have the same length) and scales down to 0 for failures or relatively longer paths:
Because
denotes a realized geometric path length, it satisfies
by construction. In our implementation, failures (collision or deadlock) are encoded as
and are assigned score 0 in Equation (
22). Therefore, the case
cannot occur.
The overall “unified score” of algorithm
A is then the simple average over all benchmark paths
which by construction lies in
and summarizes the relative path quality across the benchmark set. Importantly, this length-based normalization is used only to aggregate the path quality; the success rate, average energy, and average travel time are reported as separate metrics (
Table 3) so that no energy–latency trade-offs are hidden by the unified score.
- 2.
Mission success rate. For completeness we also report the mission success rate of algorithm
A, defined as
i.e., the fraction of paths completed without collision or deadlock.
- 3.
Average energy consumption. For each trajectory we compute a propulsion-energy surrogate
as a convex function of path length and assign a large penalty to failures:
with fixed
,
, and
. We report the average
. This should be interpreted as a normalized propulsion-cost surrogate for relative comparison between planners under the same model, not as a calibrated energy estimate for a specific airframe.
- 4.
Average travel time. Assuming a constant UAV speed
v (in grid units/s), the travel time for path
p is
and we report
as the average travel time.
- 5.
Path optimality score (per-path visualization). For some figures we additionally use a per-path optimality score
which directly measures how close a successful path is to the best successful path on that scenario. These values are used in per-path heatmaps but are not reported as a single scalar in
Table 3.
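The scoring rules above can be sketched as follows; `math.inf` encodes failure, the linear min–max scaling follows Equation (22), and the energy coefficients and failure penalty are placeholders for the paper's fixed (elided) constants:

```python
import math

def per_path_scores(lengths):
    """Per-path unified score. `lengths` maps algorithm -> realized path
    length, with math.inf encoding failure (collision/deadlock). Shortest
    successful path gets 100, longest successful gets 0, failures get 0;
    if all successful lengths tie, they all get 100."""
    ok = {a: L for a, L in lengths.items() if math.isfinite(L)}
    scores = {a: 0.0 for a in lengths}
    if ok:
        lo, hi = min(ok.values()), max(ok.values())
        for a, L in ok.items():
            scores[a] = 100.0 if hi == lo else 100.0 * (hi - L) / (hi - lo)
    return scores

def unified_score(per_path):
    """Average of an algorithm's per-path scores over the benchmark set."""
    return sum(per_path) / len(per_path)

def energy_surrogate(L, a=1.0, b=0.01, fail_penalty=500.0):
    """Convex propulsion-cost surrogate of path length with a fixed failure
    penalty; a, b, and fail_penalty are illustrative, not calibrated."""
    return fail_penalty if not math.isfinite(L) else a * L + b * L ** 2

def travel_time(L, v=1.0):
    """Constant-speed travel time: path length (grid units) / speed."""
    return L / v
```

Reporting the four quantities side by side is what keeps the length-based unified score from masking energy or latency differences.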
5. Simulation Setup
We evaluate all planners on nine procedurally generated urban maps (Paths 1–9) with a constant swarm size of
UAVs for
all compared algorithms; this ensures identical multi-agent scale across baselines and PAIR (Urban Multi-UAV Path Planning Simulation Dataset,
https://doi.org/10.6084/m9.figshare.30787730). We do not sweep swarm size in this paper; all results use
, and scalability to larger
N is left as future work.
5.1. Map Generation and Validation
Static obstacles are placed via Poisson-disc sampling on a normalized grid (100 × 100 cells) with minimum spacing . Each map is accepted only if it meets the following:
Coverage: obstacle density in [15%, 40%];
Connectivity: BFS confirms at least one collision-free path for every start–goal pair;
Clustering: Hopkins statistic greater than 0.75 for realistic spatial patterns.
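The first two acceptance checks can be sketched as follows (the Hopkins-statistic clustering test is omitted for brevity); `1` marks an obstacle cell:

```python
from collections import deque

def density_ok(grid, lo=0.15, hi=0.40):
    """Coverage check: obstacle density must lie within [15%, 40%]."""
    cells = [c for row in grid for c in row]
    return lo <= sum(cells) / len(cells) <= hi

def connected(grid, start, goal):
    """BFS connectivity check: a collision-free 4-connected path must exist
    for the given start-goal pair, else the map is rejected."""
    rows, cols = len(grid), len(grid[0])
    if grid[start[0]][start[1]] or grid[goal[0]][goal[1]]:
        return False
    seen, q = {start}, deque([start])
    while q:
        x, y = q.popleft()
        if (x, y) == goal:
            return True
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nx < rows and 0 <= ny < cols
                    and not grid[nx][ny] and (nx, ny) not in seen):
                seen.add((nx, ny))
                q.append((nx, ny))
    return False
```

A candidate map is regenerated until it passes all three checks for every start–goal pair.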
These procedurally generated maps form a 2-D fixed-altitude benchmark rather than a photorealistic 3-D environment. Our goal is to provide a controlled, reproducible testbed where obstacle density, spatial clustering, and dynamic updates can be systematically swept, not to emulate a specific real city. To facilitate independent verification and reuse, we release all nine maps and the corresponding generator as an open “Urban Multi-UAV Path Planning Simulation Dataset” (see
https://doi.org/10.6084/m9.figshare.30787730).
5.2. Dynamic Obstacle Generation
Each map includes mobile obstacles (radius ):
- 1.
Initial Placement: Sampled uniformly in , with min-distance from all static obstacles and UAVs.
- 2.
Mobility Model: A three-state Markov chain,
Move [
10],
Pause,
Detour, with parameters:
Transitions occur after each timer or if a Move would intersect a static obstacle (triggering Detour).
- 3.
Boundary Reflection: Any trajectory crossing a map edge is reflected back into the domain.
- 4.
Parameters: , grid-units/step; no new obstacles spawn after .
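The mobility model above can be sketched as a three-state chain with boundary reflection; the transition probabilities, speed, and Detour heading re-sampling below are illustrative placeholders, since the paper's parameter table is not reproduced here:

```python
import math, random

TRANSITIONS = {  # illustrative probabilities, not the paper's values
    "Move":   {"Move": 0.7, "Pause": 0.2, "Detour": 0.1},
    "Pause":  {"Move": 0.6, "Pause": 0.3, "Detour": 0.1},
    "Detour": {"Move": 0.8, "Pause": 0.1, "Detour": 0.1},
}

def next_state(state, rng):
    states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in states]
    return rng.choices(states, weights=weights)[0]

def reflect(v, lo=0.0, hi=100.0):
    """Reflect a coordinate that crossed a map edge back into the domain."""
    if v < lo:
        return 2 * lo - v
    if v > hi:
        return 2 * hi - v
    return v

def step_obstacle(pos, heading, state, rng, speed=1.0):
    """One simulator step: advance in Move, hold in Pause, re-sample the
    heading in Detour; reflect any boundary crossing."""
    state = next_state(state, rng)
    if state == "Detour":                      # pick a fresh unit heading
        ang = rng.uniform(0.0, 2.0 * math.pi)
        heading = (math.cos(ang), math.sin(ang))
    if state == "Pause":
        return pos, heading, state
    x = reflect(pos[0] + speed * heading[0])
    y = reflect(pos[1] + speed * heading[1])
    return (x, y), heading, state
```

In the benchmark, a Detour is additionally forced whenever a Move would intersect a static obstacle; that collision check is omitted from this sketch.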
5.3. Variable Obstacle Density Sweep
To further probe replanning performance, we run each planner on a single canonical start–goal scenario, sweeping the number of dynamic obstacles over
(animated demonstrations for each obstacle density are available on GitHub: GHRABAT/Impact-of-Obstacle-Density,
https://github.com/GHRABAT/Impact-of-Obstacle-Density/tree/v1.0.0 (accessed on 6 January 2026)). This sweep acts as a controlled single-map stress test that isolates the effect of obstacle density on path length, steps to goal, and cumulative planning time, rather than a full statistical study over many urban layouts.
The obstacle-density sweep is conducted on a single canonical map (not all nine maps) to isolate the effect of obstacle count. Each plotted point is averaged over random seeds controlling dynamic-obstacle trajectories. Therefore, the non-monotonic behavior (especially for PSO) reflects stochastic convergence under that fixed map rather than a population-wide statistic over all scenarios. Extending the sweep to multiple maps will be pursued in future work.
The obstacle-density sweep in
Section 6.4 is conducted on a single canonical map. Each point is the mean over
random seeds, with shaded bands indicating
standard deviation. This sweep should therefore be interpreted as a controlled single-map stress test.
5.4. UAV Kinematics and Metrics
Each UAV flies at
grid-units/s, so a path of length
L takes
seconds. We record mission success rate, energy consumption, travel time, and unified score (
Section 4).
5.5. Benchmark Variation Grid
The nine scenarios span three representative static obstacle densities (15%, 25%, and 40%), with both clustered and dispersed spatial patterns and both static-only and mixed static+dynamic obstacle configurations. Instead of exhaustively enumerating all combinations, we select nine representative cases that cover easy, moderate, and highly cluttered layouts under static and dynamic conditions, thereby testing planner robustness across realistic extremes while keeping the benchmark size manageable for reproducible experiments.
5.6. Planner Configurations
For PAIR, we fix the inter-UAV communication radius used to form the neighbor set to 20 grid units (that is, 20 cells under the 100 × 100 discretization), and we keep this value constant across all planners. We compare five planners with these parameter settings:
A*: 4-connected grid, Euclidean heuristic, no replanning on collision.
D* Lite: Same heuristic, incremental replanning up to once per time step.
CBS-D* Lite: Conflict-Based Search over D* Lite with up to 10 banned cells and a 2-waypoint rollback window.
PSO: 30 particles optimizing 8 intermediate waypoints, inertia weight , cognitive/social gains , 200 iterations per replanning event.
PAIR: Hybrid A* + PPO with
- –
Network: Two-layer MLPs (128 ReLU units each).
- –
Hyperparameters: , GAE , clip , actor LR , critic LR , batch size 2048, 50 epochs/update, 500k total steps.
- –
Reward weights: .
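As a concrete illustration of the PSO baseline's update rule, the sketch below applies the canonical inertia/cognitive/social velocity update [16] to a flat vector of waypoint coordinates; the inertia weight and gain values here are illustrative placeholders, not the tuned settings above.

```python
import random

def pso_step(pos, vel, pbest, gbest, w=0.7, c1=1.5, c2=1.5,
             rng=random.Random(0)):
    """One canonical PSO velocity/position update over a waypoint vector.

    pos/vel: current particle position and velocity (flat coordinate lists),
    pbest/gbest: personal-best and global-best positions of the same shape.
    """
    new_vel, new_pos = [], []
    for x, v, pb, gb in zip(pos, vel, pbest, gbest):
        v_next = (w * v
                  + c1 * rng.random() * (pb - x)    # cognitive pull
                  + c2 * rng.random() * (gb - x))   # social pull
        new_vel.append(v_next)
        new_pos.append(x + v_next)
    return new_pos, new_vel
```

In the benchmark configuration this update would run for 200 iterations over a 30-particle swarm each time replanning is triggered, which is the source of the large cumulative planning times reported in Section 6.4.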
5.6.1. Statistical Analysis
For each planner we evaluate nine benchmark paths (Paths 1–9 in our scenario set). Unless otherwise stated, the scalar metrics in
Table 3 are reported as mean ± standard deviation across these nine scenarios. For each path and algorithm we record a binary success indicator (1 if the UAV reaches the goal without collision, 0 otherwise); the success rates in
Table 3 are the mean of these indicators (in %), and the associated standard deviations quantify how success varies across maps. Propulsion energy is computed per path using the surrogate in Equation (
25), including the fixed penalty
(normalized units) for failed trajectories, and we report the mean ± standard deviation over all nine paths. To better capture typical mission duration, the travel times are averaged only over successful trajectories (non-collided flights); failures are excluded from the time statistics but remain penalized through the energy and unified-score metrics.
The unified scores in
Table 3 correspond to the length-based scores in Equations (
22) and (
23), computed per path by normalizing path length over the set of successful planners, assigning a score of 0 to failed planners, and then averaging these values across the nine benchmark paths.
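The length-based unified-score procedure can be sketched as follows; the exact normalization in Equation (22) may differ in detail, so this illustrates only the per-path normalize-then-average pipeline (successful planners scored against the shortest successful path, failures scored 0).

```python
def unified_scores(lengths, success):
    """lengths[planner][path] -> path length; success[planner][path] -> 0/1.
    Returns planner -> mean normalized score over all paths (in %)."""
    planners = list(lengths)
    n_paths = len(next(iter(lengths.values())))
    totals = {p: 0.0 for p in planners}
    for k in range(n_paths):
        ok = [p for p in planners if success[p][k]]
        if not ok:
            continue  # no planner succeeded on this path
        best = min(lengths[p][k] for p in ok)  # shortest successful path
        for p in planners:
            # successful planners: best/length in (0, 1]; failures: 0
            totals[p] += (best / lengths[p][k]) if success[p][k] else 0.0
    return {p: 100.0 * totals[p] / n_paths for p in planners}
```

Under this scheme a planner is rewarded both for succeeding and for producing short paths, which is why a 100%-reliable planner with slightly longer routes can still rank below one that is both reliable and short.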
5.6.2. Baseline Scope
We compare PAIR against classical discrete planners (A*, D* Lite, and CBS-D* Lite) and a representative swarm-based metaheuristic (PSO), which together cover the main non-learning baselines used in UAV path planning. Incorporating additional end-to-end deep RL baselines, such as RL-Planner [
17], under our nine-map benchmark and real-time constraints is left to future work.
5.7. Reward Shaping and Collision-Check Implementation
5.7.1. Reward Shaping
At each PPO correction step we emit a dense, potential-based reward:
where
is the Euclidean distance toward the next discrete waypoint.
penalizes proximity to every obstacle.
is the instantaneous energy cost of offset .
is a sparse bonus awarded upon reaching the final goal.
The weights are chosen empirically to balance path efficiency, obstacle clearance, and energy use, and are kept fixed across all experiments. A more systematic numerical ablation of these coefficients is left to future work.
The same risk field
used in the reward is also queried at the UAV position to form the observation component
in (
15). All dynamic obstacles and other UAVs are inserted into the obstacle set
O when building
, so nearby agents influence the state only through this aggregated risk scalar rather than via an explicit neighbor list. This keeps the observation dimension fixed while still allowing the policy to react to local congestion. In the implementation reported here, the PPO agent is trained solely on these navigation-related signals (distance, aggregated risk, and a convex surrogate for propulsion energy). MEC task assignment and deadline satisfaction are handled independently by the heterogeneous task-to-UAV model in
Section 3.5 and are not explicitly encoded in the RL reward. Incorporating deadline- or latency-aware terms (for example, a normalized penalty
when the predicted completion time approaches a task’s deadline) is a natural extension and is left to future work.
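The reward terms above can be sketched as follows; the weights mirror the (1.0, 0.5, 0.1) setting in Table A1, while the Gaussian risk kernel and the goal-bonus magnitude are illustrative assumptions rather than the paper's exact forms.

```python
import math

# Assumed weights (Table A1) and an assumed sparse terminal bonus.
W_DIST, W_RISK, W_ENERGY = 1.0, 0.5, 0.1
GOAL_BONUS = 50.0  # placeholder magnitude

def risk_field(pos, obstacles, sigma=2.0):
    """Aggregated proximity penalty over every obstacle (dynamic obstacles
    and other UAVs are all inserted into this set), using an assumed
    Gaussian kernel. This same scalar feeds the observation component."""
    return sum(math.exp(-math.dist(pos, o) ** 2 / (2 * sigma ** 2))
               for o in obstacles)

def step_reward(pos, waypoint, obstacles, offset_energy, at_goal=False):
    d = math.dist(pos, waypoint)  # distance toward the next discrete waypoint
    r = (-W_DIST * d
         - W_RISK * risk_field(pos, obstacles)
         - W_ENERGY * offset_energy)  # instantaneous energy cost of the offset
    return r + (GOAL_BONUS if at_goal else 0.0)
```

Because nearby agents enter only through the aggregated `risk_field` scalar, the observation dimension stays fixed regardless of how many obstacles or UAVs are present.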
5.7.2. Collision Detection
At each simulator step, the following checks are performed:
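A minimal sketch of such per-step checks, assuming a minimum-separation test against static cells, dynamic obstacles, and the other UAVs (the safety radius below is an assumed placeholder; the actual value is fixed in Section 5.7):

```python
import math

SAFETY_RADIUS = 1.0  # grid units; assumed placeholder value

def collides(uav_pos, static_cells, dyn_obstacles, other_uavs,
             r=SAFETY_RADIUS):
    """True if the UAV occupies a static cell or violates the minimum
    separation from any dynamic obstacle or other UAV."""
    cx, cy = int(round(uav_pos[0])), int(round(uav_pos[1]))
    if (cx, cy) in static_cells:              # inside a static obstacle
        return True
    for o in dyn_obstacles:                   # dynamic-obstacle separation
        if math.dist(uav_pos, o) < r:
            return True
    for u in other_uavs:                      # inter-UAV separation
        if math.dist(uav_pos, u) < r:
            return True
    return False
```

A step is marked as a collision (and the trajectory as failed) as soon as this predicate returns true.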
5.8. Implementation Details
All simulations (environment, baselines, and PAIR) were implemented in Python 3.12.4 using NumPy 2.4.0, SciPy 1.16.3, Matplotlib 3.10.8, and PyTorch 2.9.1. As indicated in
Table A1, the simulation runtime experiments (including PPO training, evaluation, and online PPO inference during replanning) are executed on an NVIDIA Jetson Xavier NX embedded platform (6-core ARM CPU, 8 GB RAM, 384-core Volta GPU), chosen to emulate onboard UAV compute (software stack: NVIDIA JetPack 5.1.4). The trained PPO policy is used for online inference with an average forward-pass latency of ≈5 ms per PPO call (
Table A1); this supports the 10 Hz replanning loop used in our benchmark. For post-processing and figure generation, we additionally use an HP ZBook 15v G5 workstation running Windows 10 Pro (Build 19045) with an Intel Core i7-8750H CPU, 16 GB RAM, and an NVIDIA Quadro P600 (4 GB), using Python 3.12.4 (Anaconda Distribution 2025.12-1, Spyder 6.1.2) to parse logs and render plots. For reproducibility, random seeds are fixed per scenario when reporting mean ± std metrics, and the obstacle-density sweep averages the results over
random seeds as stated in
Section 5.
6. Results and Comparative Analysis
6.1. Success-Rate Analysis
Table 3 and
Figure 1 report each planner’s success rate over the nine benchmark maps. Classical A* succeeds on only one of nine maps (11.11%), failing in both dynamic and highly cluttered static scenarios due to its inability to replan. D* Lite improves this to 44.44% by incrementally repairing collisions, yet still breaks down in highly dynamic layouts where frequent updates are required. CBS-D*, PSO, and our PAIR framework all achieve perfect reliability (100%), demonstrating that conflict-based search, metaheuristic optimization, and the proposed hybrid planner can all attain mission-complete performance under the evaluated settings.
6.2. Energy Consumption and Travel Time
Figure 2 depicts the average propulsion-energy surrogate consumed by each planner (in normalized units). Purely discrete methods exhibit the highest energy demands: A* at 899.66 units and D* Lite at 594.01 units, while CBS-D* reduces this to 126.45 units thanks to occasional replanning. Note, however, that the large values for A* and D* Lite are inflated by the failure penalty
applied to collided/deadlocked trajectories in Equation (
25). Therefore, these numbers should be interpreted as penalty-including surrogate costs for relative comparison, not as physical energy measurements.
In contrast, hybrid planners achieve dramatic savings: PSO consumes 109.44 units and PAIR further lowers this to 104.91 units, a 4.1% reduction relative to PSO. Across the nine benchmark paths, the associated standard deviations are moderate (for instance, PAIR achieves
units compared to
units for PSO and
units for CBS-D*), so the ranking in
Table 3 is robust to path-to-path variability.
Figure 3 shows the corresponding average travel times. According to
Table 3, A* and D* Lite achieve mean travel times of 201.4 s and 179.4 s, respectively, but only on the small subset of paths where they succeed (11.1% and 44.4% success). Among the 100%-reliable planners, CBS-D* has the longest mean travel time (236.8 s), while PSO averages 214.3 s and PAIR trims this to 207.8 s, which is 6.5 s faster than PSO (a 3.0% improvement) and the shortest travel time under full mission success.
These results demonstrate that integrating continuous PPO-based corrections into a discrete backbone (PAIR) both maintains perfect reliability and substantially lowers energy and time compared to purely discrete or metaheuristic approaches.
In
Figure 4, A* exhibits multiple deadlocks (path 9) and at least one collision (paths 1 and 2) when confronted with dynamic obstacles. These trajectories are not solely determined by geometric proximity but follow the task-assignment model in
Section 3.5, which assigns each request to the UAV whose CPU, memory, battery, and deadline constraints are best matched. Consequently, some paths cross (for example, paths 1 and 2 end up on a collision course), and in a few cases a more distant UAV is dispatched instead of the nearest one (e.g., UAV 7 is closer to the task in path 9 but is not selected).
D* Lite (
Figure 5) leverages incremental replanning to escape some static deadlocks, but still registers collisions in dynamic encounters (paths 1 and 2). Its replanning latency causes hesitation near high-risk zones, visible as abrupt, zig-zag segments of the path in those areas.
CBS-D* Lite (
Figure 6) resolves inter-agent conflicts more robustly than D* Lite, yet its conflict-tree updates induce pronounced detours (as seen in paths 5 and 6) and occasional collisions when dynamic obstacles intrude mid-segment.
PSO (
Figure 7) generates smoother, continuous trajectories successfully avoiding both static and dynamic obstacles. However, its global waypoint optimization sometimes yields unnecessarily long loops as can be seen in path 1 just before it converges; this reflects trade-offs between exploration and path length.
In contrast, PAIR (
Figure 8) combines the fast backbone of A* with PPO-based local corrections. All nine paths seamlessly navigate around moving obstacles, maintain safe separation, and incur minimal detours. The continuous refinements tighten trajectories around static obstacles while reacting instantly to dynamic intrusions. This demonstrates the superior adaptability and efficiency of PAIR in both static and dynamic settings.
6.3. Unified Score Comparison
Figure 9 and
Table 3 report the unified score, the average normalized path-quality score
defined in (
23). This scalar index summarizes how often each planner attains short paths relative to the best-performing algorithm on each scenario, while the success rate, energy consumption, and travel time are presented explicitly in the same table and figures. In terms of the normalized unified score, PAIR attains
versus
for PSO, with A*, D* Lite, and CBS-D* remaining well below these values (
Table 3). This margin reflects balanced improvement of PAIR across all dimensions, as it maintains 100% success while at the same time minimizing both energy and time traveled compared to purely discrete (A*, D* Lite, CBS-D*) and metaheuristic (PSO) planners.
6.4. Impact of Obstacle Density
To quantify each planner’s scalability under rising dynamic complexity, we consider the single-scenario density sweep introduced in the “Variable Obstacle Density Sweep” subsection of
Section 5, where each planner is evaluated at obstacle counts
. Because this sweep is performed on a single canonical map, it should be interpreted as a controlled stress test of relative trends rather than a full statistical characterization over an ensemble of urban layouts; replicating the sweep across multiple randomly generated maps and 3-D environments is an important direction for future work. Additional qualitative trajectory visualizations for this obstacle-density sweep (GIFs) are provided in the
Supplementary Material.
Figure 10 plots the resulting path lengths. The PAIR curve increases only modestly, from about 113 to 147 grid units, which demonstrates robust backbone routing with minimal detours. In contrast, CBS-D* remains flat around 156–162 units, and PSO exhibits an unusual fluctuation (114→172→145→157), which is understandable given its stochastic waypoint sampling (
Figure 10).
Figure 11 shows the total steps to goal. PAIR grows smoothly from 82 to 199 steps, indicating predictable replanning overhead. CBS-D* increases monotonically (158→205), as each new obstacle simply adds another conflict/rollback event. PSO, however, oscillates dramatically from 182 at 10 obstacles, down to 100 at 15, and back up to 132 at 30 obstacles, revealing an unreliable convergence under clutter.
Most critically,
Figure 12 reports the total cumulative replanning time per episode. For PAIR this quantity remains the lowest among all planners across the entire sweep, increasing from about
ms (≈1.9 s) at 5 obstacles to a worst case of roughly
ms (≈190 s) at 25 obstacles. At 30 obstacles, the episode reaches the goal in fewer steps on that map, so the PAIR cumulative replanning time decreases to about
ms (≈23 s) rather than continuing to grow with obstacle count. By comparison, CBS-D* requires up to
ms (≈440 s) of planning time, and PSO exceeds
ms (≈3400 s, that is, nearly an hour) at 30 obstacles, making it impractical for real-time use.
Taken together, these density-sweep curves confirm that PAIR is the only planner whose latency and path optimality both scale gracefully as obstacle density increases, making it uniquely suited for real-time UAV navigation in highly dynamic environments.
7. Discussion
Our comparative analysis highlights a fundamental trade-off between
responsiveness and
optimality. Purely discrete planners (A*, D* Lite, CBS-D*) guarantee optimal or near-optimal grid-based routes on static maps but suffer from replanning latency and deadlocks once obstacles move (
Figure 4,
Figure 5 and
Figure 6). The PSO metaheuristic adapts continuously yet can over-correct and produce suboptimal detours in dense layouts (
Figure 7). The PAIR hybrid approach combines a fast A* backbone for coarse routing with PPO-based local corrections that preserve near-optimal static performance while reacting quickly to dynamic intrusions, yielding lower propulsion energy and shorter travel times than all baselines under the same benchmark scenarios and hyperparameter settings.
Unlike PSO, which quickly becomes infeasible for replanning (multi-minute cumulative planning times), and CBS-D* (tens to hundreds of seconds), PAIR consistently has the smallest cumulative replanning time across all obstacle densities. In our experiments, its total replanning time increases from about 2 s at 5 obstacles to a worst case of roughly 190 s at 25 obstacles, while CBS-D* and PSO reach roughly 440 s and over 3400 s, respectively, at high densities.
PAIR's monotonic, predictable growth in steps (
Figure 11) contrasts with the erratic behavior of PSO, underscoring its reliability across densities. The non-monotonic dip in the PAIR cumulative replanning time at 30 obstacles arises because the episode on that map terminates earlier than at 25 obstacles; since we accumulate planning time only up to goal completion, shorter episodes naturally yield smaller totals even at higher obstacle counts. These results confirm that PAIR not only delivers superior path quality and energy efficiency but also scales gracefully in high-density dynamic environments, preserving both real-time responsiveness and path optimality.
PAIR incurs modest overhead: discrete A* searches complete in <1 ms per step, and PPO inference runs in ≈5 ms on an embedded GPU, supporting a 10 Hz replanning rate. By contrast, CBS-D* conflict-tree updates can exceed 50 ms, and the PSO 200-iteration replanning loops are even slower.
Inter-UAV safety checks scale quadratically with fleet size when performed pairwise, but the PAIR decentralized PPO corrections fold neighbor positions into each agent's state through the aggregated risk field, avoiding global coordination bottlenecks. This makes PAIR naturally extensible to larger swarms.
In our experiments, the PAIR PPO agent is trained purely on navigation performance (path length, local risk, and a propulsion-energy surrogate), while MEC task assignment and deadline satisfaction are governed by the heterogeneous assignment model of
Section 3.5. Thus, the present evaluation focuses on geometric path quality under dynamic obstacles, with offloading treated as a separate upper-layer decision mechanism rather than as part of the RL objective. We therefore view PAIR as a path-planning contribution evaluated in a MEC-inspired environment, not as an end-to-end solution for joint computation offloading and trajectory control.
A natural extension is to incorporate deadline- or latency-aware terms directly into the PPO reward (for example, a normalized penalty when the predicted completion time approaches a task's deadline), or to co-train navigation and offloading policies in a unified multi-objective RL formulation. We leave such MEC co-optimization to future work; the present study deliberately isolates the navigation component and demonstrates the compatibility of PAIR with 6G edge-computing scenarios.
All experiments in this paper are conducted on a 2-D fixed-altitude grid abstraction with procedurally generated static obstacles and stylized Markovian dynamic obstacles, under a simplified MEC model that abstracts away physical-layer, queueing, and perception uncertainties. Consequently, our results should be interpreted as an algorithmic evaluation of navigation strategies in a MEC-inspired setting, not as an end-to-end hardware validation on specific UAV platforms or sensor pipelines. This design choice allows us to stress-test PAIR and all baselines under tightly controlled obstacle densities and map statistics, but it does not capture the full 3-D flight kinematics or real-world sensing artifacts. Validating PAIR on photorealistic 3-D simulators and standardized or real multi-UAV flight datasets is therefore an important direction for future work.
While the paper is motivated by UAV-assisted MEC, our present evaluation focuses on geometric navigation (path length, collision risk, and a propulsion-energy surrogate). The task-assignment model in
Section 3.5 provides a heterogeneous MEC context but is static during flight, and MEC constraints (latency/deadlines) are not factored into the PPO reward. A more convincing end-to-end MEC study would generate dynamic task arrivals during simulated flights and couple the predicted completion latency to the navigation policy (for instance, via a deadline-aware penalty when the predicted service time approaches a task deadline). We leave this integrated co-optimization and dynamic task scenario generation to future work.
8. Conclusions and Future Work
We have presented PAIR, a multi-agent hybrid A*–PPO path-planning framework for multi-UAV deployments in dynamic MEC-inspired urban environments modeled as 2-D fixed-altitude grids with procedurally generated static obstacles and stylized Markovian dynamic obstacles. The framework combines a fast discrete A* backbone with an on-demand PPO-based continuous refinement module and achieves 100% mission success across nine algorithmically generated benchmark scenarios, while reducing the average propulsion energy to 104.9 normalized units and average travel time to 207.8 s. Under our unified evaluation, which jointly considers success rate, energy, time, and normalized path quality, PAIR consistently outperforms classical discrete planners (A*, D* Lite, CBS-D*) and a PSO metaheuristic baseline, attaining a top unified score of 84.56 without sacrificing reliability. The released 2-D benchmark and simulator are intended to be a reusable baseline for future work on dynamic multi-UAV navigation in MEC-inspired urban environments and as a stepping stone toward full 3-D and hardware-in-the-loop evaluations.
Looking ahead, several directions can further enhance the applicability and performance of PAIR:
Full 3-D Navigation: Extend the state and action spaces to include altitude variations, enabling obstacle avoidance in true three-dimensional urban canyons.
Robustness to Perception Uncertainty: Integrate noisy sensor models and adversarial perturbations into training, or adopt uncertainty-aware RL techniques to maintain performance under imperfect state estimation.
End-to-End Joint Learning: Co-train the discrete backbone heuristic and PPO policy within a unified curriculum, allowing global and local planners to co-adapt.
Swarm-Scale Coordination: Investigate federated or hierarchical PPO schemes to support densely populated UAV fleets, sharing policy improvements without centralized data exchange.
MEC Co-Optimization: Incorporate live network latency and computation offloading deadlines into the reward function, achieving truly integrated flight-computational planning for 6G-enabled UAV networks.
Formal safety guarantees: While PAIR ensures a high probability of safety, deterministic guarantees require integration with formal methods such as control barrier certificates, which we leave for future work.
3-D and real-data validation: Port PAIR to full 3-D kinematics and evaluate it on standardized or real multi-UAV datasets (such as LiDAR/trajectory logs), complementing the 2-D synthetic benchmark reported here.
By pursuing these extensions, future work will bring PAIR closer to deployment in real-world UAV swarms.
Supplementary Materials
The following supporting information can be downloaded at:
https://www.mdpi.com/article/10.3390/drones10010058/s1. The supplementary files include additional visual demonstrations (GIFs) and supporting experimental materials for this study.
Author Contributions
Conceptualization, B.H.T. and J.L.; methodology, B.H.T. and J.L.; software, B.H.T.; validation, B.H.T., J.L., Y.Q. and H.R.S.; formal analysis, B.H.T. and J.L.; investigation, B.H.T.; resources, J.L. and Y.Q.; data curation, B.H.T.; writing, original draft preparation, B.H.T.; writing, review and editing, J.L., Y.Q. and H.R.S.; visualization, B.H.T.; supervision, J.L. and Y.Q.; project administration, J.L. and Y.Q.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62372163 and the Science and Technology Innovation Program of Hunan Province under Grant 2024RC1033.
Data Availability Statement
The data that support the findings of this study are openly available in the "Urban Multi-UAV Path Planning Simulation Dataset (2-D Dynamic Urban MEC Scenarios)" repository on Figshare, at
https://doi.org/10.6084/m9.figshare.30787730. The dataset contains the nine benchmark urban scenarios used in our experiments (static and dynamic obstacle fields, UAV start–goal pairs, and dynamic obstacle trajectories) in CSV/JSON format. Any additional information related to the data or scripts for loading and basic visualization is available from the corresponding author upon reasonable request.
Acknowledgments
The authors would like to thank the College of Computer Science and Electronic Engineering at Hunan University and the College of Computer Science and Information Technology at the University of Sumer for their administrative and technical support, as well as the computing resources used in this work.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A. PPO Hyperparameters and Training Configuration
Table A1 reports the complete PPO training configuration used throughout the paper. We adopt standard PPO settings commonly used for continuous-control navigation (
, GAE
, clipping
), and we keep them fixed across all benchmarks for reproducibility and to avoid scenario-specific overfitting. We do not claim exhaustive hyperparameter optimality; beyond basic sanity checks to ensure stable learning and convergence, a full sensitivity/ablation study is left to future work. Unless otherwise stated, the “Hardware” entry in
Table A1 refers to the platform used for the runtime experiments reported in the paper (training/evaluation and online PPO inference), while a separate workstation is used only for offline plotting.
Table A1.
PPO hyperparameters and training configuration.
| Parameter | Value |
|---|---|
| Network architecture | 2-layer MLP, 128 ReLU units per layer |
| Optimizer | Adam |
| Actor learning rate | |
| Critic learning rate | |
| Discount factor | 0.99 |
| GAE parameter | 0.95 |
| PPO clipping parameter | 0.20 |
| Batch size | 2048 |
| Epochs per update | 50 |
| Total environment steps | 500,000 |
| Reward weights | (1.0, 0.5, 0.1) |
| Training scenarios | 3 static + 3 dynamic maps |
| Hardware | Embedded GPU (Jetson Xavier NX) |
| Inference latency | ≈5 ms per PPO call |
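For concreteness, the actor shape listed in Table A1 corresponds to a forward pass of the following form; the observation and action dimensions here are illustrative assumptions, and the real policy additionally carries a value head and PPO's Gaussian action noise.

```python
import numpy as np

# Two-layer MLP actor with 128 ReLU units per hidden layer (Table A1).
# OBS_DIM and ACT_DIM are assumed placeholders, not the paper's values.
rng = np.random.default_rng(0)
OBS_DIM, ACT_DIM, HIDDEN = 8, 2, 128

W1 = rng.standard_normal((OBS_DIM, HIDDEN)) * 0.1
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, HIDDEN)) * 0.1
b2 = np.zeros(HIDDEN)
W3 = rng.standard_normal((HIDDEN, ACT_DIM)) * 0.1
b3 = np.zeros(ACT_DIM)

def actor_forward(obs):
    """Map an observation vector to a bounded continuous correction offset."""
    h = np.maximum(obs @ W1 + b1, 0.0)  # hidden layer 1 (ReLU)
    h = np.maximum(h @ W2 + b2, 0.0)    # hidden layer 2 (ReLU)
    return np.tanh(h @ W3 + b3)         # squash to a bounded offset
```

A network of this size has on the order of 10^4 parameters, which is consistent with the ≈5 ms per-call inference latency reported on the Jetson Xavier NX.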
Appendix B. Baseline Hyperparameters and Tuning Protocol
For transparency and to ensure a fair comparison, we keep all baseline configurations fixed across the nine benchmark maps and obstacle-density sweeps.
Table A2 summarizes the key hyperparameters for PSO and CBS-D* Lite. The PSO settings follow the standard recommendations for swarm-based path planning and are chosen to balance the solution quality against the per-step replanning budget, while the CBS-D* Lite parameters bound the size of the conflict tree and rollback window so that planning latency remains comparable to the PAIR 10 Hz target. Rather than exhaustively tuning each baseline, we adopt these simple, literature-inspired configurations and use them consistently in all experiments, so that the performance gains reported in
Table 3 primarily reflect architectural differences rather than aggressive hyperparameter search.
We set the PSO inertia weight to
as a standard, literature-consistent mid-range choice that balances exploration (large inertia) and convergence (small inertia) in waypoint-based PSO planning. Following the canonical PSO formulation [
16], we keep
constant across all maps to avoid per-scenario tuning advantages and to also keep replanning budgets comparable.
CBS-D* Lite resolves inter-agent conflicts by expanding a conflict tree and triggering low-level repairs. In highly dynamic environments, frequent obstacle updates can repeatedly invalidate repaired segments, which increases conflict-tree churn and induces conservative detours. This mechanism explains why CBS-D* Lite remains reliable in our tests but often exhibits longer travel times and higher replanning overhead than PAIR and PSO (
Table 3).
Table A2.
Key hyperparameters for PSO and CBS-D* Lite baselines.
| Planner | Setting |
|---|---|
| PSO | Swarm size particles |
| PSO | Intermediate waypoints per path |
| PSO | Inertia weight |
| PSO | Cognitive/social gains |
| PSO | Iterations per replanning |
| CBS-D* Lite | Maximum banned cells per agent |
| CBS-D* Lite | Rollback window waypoints |
| CBS-D* Lite | Inter-agent safety radius (as in Section 5.7) |
References
- Raja, G.; Manoharan, A.; Siljak, H. Ugen: UAV and GAN aided ensemble network for post disaster survivor detection through ORAN. IEEE Trans. Veh. Technol. 2024, 73, 9296–9305. [Google Scholar] [CrossRef]
- Wang, T.; Huang, X.; Wu, Y.; Qian, L.; Lin, B.; Su, Z. UAV swarm assisted two tier hierarchical federated learning. IEEE Trans. Netw. Sci. Eng. 2023, 11, 943–956. [Google Scholar] [CrossRef]
- He, J.; Wang, J.; Wang, N.; Guo, S.; Zhu, L.; Niyato, D.; Xiang, T. Preventing non intrusive load monitoring privacy invasion: A precise adversarial attack scheme for networked smart meters. arXiv 2024, arXiv:2412.16893. [Google Scholar] [CrossRef]
- Tian, B.; Wang, L.; Xu, L.; Pan, W.; Wu, H.; Li, L.; Han, Z. UAV assisted wireless cooperative communication and coded caching: A multiagent two timescale DRL approach. IEEE Trans. Mob. Comput. 2023, 23, 4389–4404. [Google Scholar] [CrossRef]
- Qu, Y.; Sun, H.; Dong, C.; Kang, J.; Dai, H.; Wu, Q.; Guo, S. Elastic collaborative edge intelligence for UAV swarm: Architecture, challenges, and opportunities. IEEE Commun. Mag. 2023, 62, 62–68. [Google Scholar] [CrossRef]
- Wang, Y.; Sheng, M.; Wang, X.; Wang, L.; Li, J. Mobile edge computing: Partial computation offloading using dynamic voltage scaling. IEEE Trans. Commun. 2016, 64, 4268–4282. [Google Scholar] [CrossRef]
- Long, Y.; Zhao, S.; Gong, S.; Gu, B.; Niyato, D.; Shen, X. AoI aware sensing scheduling and trajectory optimization for multi UAV assisted wireless backscatter networks. IEEE Trans. Veh. Technol. 2024, 73, 15440–15455. [Google Scholar] [CrossRef]
- Mao, Y.; You, C.; Zhang, J.; Huang, K.; Letaief, K.B. A survey on mobile edge computing: The communication perspective. IEEE Commun. Surv. Tutor. 2017, 19, 2322–2358. [Google Scholar] [CrossRef]
- Zhan, C.; Hu, H.; Liu, Z.; Wang, Z.; Mao, S. Multi UAV enabled mobile edge computing for time constrained IoT applications. IEEE Internet Things J. 2021, 8, 15553–15567. [Google Scholar] [CrossRef]
- Trotti, F.; Farinelli, A.; Muradore, R. A Markov decision process approach for decentralized UAV formation path planning. In Proceedings of the 2024 European Control Conference (ECC), Stockholm, Sweden, 25–28 June 2024; pp. 436–441. [Google Scholar]
- Dai, X.; Xiao, Z.; Jiang, H.; Lui, J.C.S. UAV assisted task offloading in vehicular edge computing networks. IEEE Trans. Mob. Comput. 2024, 23, 2520–2534. [Google Scholar] [CrossRef]
- Chang, H.; Chen, Y.; Zhang, B.; Doermann, D. Multi UAV mobile edge computing and path planning platform based on reinforcement learning. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 6, 489–498. [Google Scholar] [CrossRef]
- Jin, J.; Zhang, Y.; Zhou, Z.; Jin, M.; Yang, X.; Hu, F. Conflict based search with D*Lite algorithm for robot path planning in unknown dynamic environments. Comput. Electr. Eng. 2023, 105, 108473. [Google Scholar] [CrossRef]
- Li, Y.; Wu, R.; Gan, L.; He, P. Development of an effective relay communication technique for multi UAV wireless network. IEEE Access 2024, 12, 74087–74095. [Google Scholar] [CrossRef]
- Wang, Y.; Zhu, J.; Huang, H.; Xiao, F. Bi objective ant colony optimization for trajectory planning and task offloading in UAV assisted MEC systems. IEEE Trans. Mob. Comput. 2024, 23, 12360–12377. [Google Scholar] [CrossRef]
- Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the ICNN’95-International Conference on Neural Networks, Perth, WA, Australia, 27 November–1 December 1995; Volume 4, pp. 1942–1948. [Google Scholar]
- Ejaz, M.; Gui, J.; Asim, M.; Affendi, M.A.E.; Fung, C.; Latif, A.A.A. RL Planner: Reinforcement learning enabled efficient path planning in multi UAV MEC systems. IEEE Trans. Netw. Serv. Manag. 2024, 21, 3317–3329. [Google Scholar] [CrossRef]
- Yuan, H.; Wang, M.; Bi, J.; Shi, S.; Yang, J.; Zhang, J.; Zhou, M.; Buyya, R. Cost-efficient task offloading in mobile edge computing with layered unmanned aerial vehicles. IEEE Internet Things J. 2024, 11, 30496–30509.
- Mach, P.; Becvar, Z. Mobile edge computing: A survey on architecture and computation offloading. IEEE Commun. Surv. Tutor. 2017, 19, 1628–1656.
- Qin, P.; Fu, Y.; Xie, Y.; Wu, K.; Zhang, X.; Zhao, X. Multi-agent learning-based optimal task offloading and UAV trajectory planning for AGIN power IoT. IEEE Trans. Commun. 2023, 71, 4005–4017.
- Liang, K.; Wang, Y.; Li, Z.; Zheng, G.; Wong, K.K.; Chae, C.B. Digital twin-assisted deep reinforcement learning for computation offloading in UAV systems. IEEE Trans. Veh. Technol. 2025, 74, 8466–8471.
- Cherif, B.; Ghazzai, H.; Alsharoa, A.; Besbes, H.; Massoud, Y. Aerial LiDAR-based 3D object detection and tracking for traffic monitoring. In Proceedings of the 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 21–25 May 2023; pp. 1–5.
- Li, C.; Gan, Y.; Zhang, Y.; Luo, Y. A cooperative computation offloading strategy with on-demand deployment of multi-UAVs in UAV-aided mobile edge computing. IEEE Trans. Netw. Serv. Manag. 2023, 21, 2095–2110.
- Wu, X.; Lei, Y.; Tong, X.; Zhang, Y.; Li, H.; Qiu, C.; Guo, C.; Sun, Y.; Lai, G. A non-rigid hierarchical discrete grid structure and its application to UAVs conflict detection and path planning. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 5393–5411.
- Zhong, L.; Zhao, J.; Luo, H.; Hou, Z. Hybrid path planning and following of a quadrotor UAV based on deep reinforcement learning. In Proceedings of the 2024 36th Chinese Control and Decision Conference (CCDC), Xi’an, China, 25–27 May 2024; pp. 1858–1863.
- Jayaweera, N.; Rajatheva, N.; Latva-aho, M. Autonomous driving without a burden: View from outside with elevated LiDAR. In Proceedings of the 2019 IEEE 89th Vehicular Technology Conference (VTC2019-Spring), Kuala Lumpur, Malaysia, 28 April–1 May 2019; pp. 1–7.
- Li, F.; Luo, J.; Sun, P.; Teng, S. Energy-efficient UAV-based data collection 3D trajectory optimization with wireless power transfer for forest monitoring. IEEE Internet Things J. 2025, 12, 24071–24082.
- Luo, Z.; Zhang, J.; Wei, J.; Zhou, L.; Cao, K.; Zhao, H. Trajectory design and task scheduling for multi-UAV-aided mobile edge computing networks. In Proceedings of the 2025 IEEE Wireless Communications and Networking Conference (WCNC), Milan, Italy, 24–27 March 2025; pp. 1–6.
- Gao, Y.; Tao, J.; Xu, Y.; Wang, Z.; Gao, Y.; Wang, M. Improving user QoE via joint trajectory and resource optimization in multi-UAV-assisted MEC. IEEE Trans. Serv. Comput. 2025, 18, 1472–1486.
- Yin, L.; Luo, J.; Qiu, C.; Wang, C.; Qiao, Y. Joint task offloading and resource allocation for hybrid vehicle edge computing systems. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10355–10368.
- Yang, A.; Huan, J.; Wang, Q.; Yu, H.; Gao, S. ST-D3QN: Advancing UAV path planning with an enhanced deep reinforcement learning framework in ultra-low altitudes. IEEE Access 2025, 13, 65285–65300.
- Wu, T.; Zhang, Z.; Jing, F.; Gao, M. A dynamic path planning method for UAVs based on improved informed-RRT* fused dynamic windows. Drones 2024, 8, 539.
- Xie, J.; Huang, W.; Miao, J.; Li, J.; Cao, S. Off-policy deep reinforcement learning for path planning of stratospheric airship. Drones 2025, 9, 650.
- Wang, Y.; Liu, J.; Qian, Y.; Yi, W. Path planning for multi-UAV in a complex environment based on reinforcement-learning-driven continuous ant colony optimization. Drones 2025, 9, 638.
- Rahman, M.; Sarkar, N.I.; Lutui, R. A survey on multi-UAV path planning: Classification, algorithms, open research problems, and future directions. Drones 2025, 9, 263.
- Meng, W.; Zhang, X.; Zhou, L.; Guo, H.; Hu, X. Advances in UAV path planning: A comprehensive review of methods, challenges, and future directions. Drones 2025, 9, 376.
- Andrychowicz, M.; Raichuk, A.; Stańczyk, P.; Orsini, M.; Girgin, S.; Marinier, R.; Hussenot, L.; Geist, M.; Pietquin, O.; Michalski, M.; et al. What matters for on-policy deep actor-critic methods? A large-scale study. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021.
- Kunda, N.S.S.; Kc, P.; Pandey, M.; Kumaar, A.A.N. Reward design and hyperparameter tuning for generalizable deep reinforcement learning agents in autonomous racing. Sci. Rep. 2025, 15, 43940.