Adaptive Reinforcement Learning-Driven Jellyfish Search Optimizer for Cooperative Multi-UAV Path Planning Under Dynamic and Adversarial Conditions

Alotaibi, Nader; BinSaeedan, Wojdan

doi:10.3390/drones10050394



by Nader Alotaibi and Wojdan BinSaeedan *

Department of Computer Science, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432, Saudi Arabia

* Author to whom correspondence should be addressed.

Drones 2026, 10(5), 394; https://doi.org/10.3390/drones10050394

Submission received: 19 April 2026 / Revised: 16 May 2026 / Accepted: 18 May 2026 / Published: 21 May 2026

Highlights

What are the main findings?

RL-JSO integrates deep reinforcement learning, implemented as a dueling double DQN with prioritized experience replay, with the jellyfish search optimizer to learn adaptive phase selection under dynamic and adversarial conditions. The policy is trained through a nine-stage mastery-gated (competence-based) curriculum that introduces moving obstacles, AR(1) wind, GPS jamming, and communication loss in increasing levels of difficulty.
Under a fairness-controlled comparison against standard JSO, standard PSO, and an architecturally matched RL-PSO counterpart with identical reward design, state representation, network architecture, curriculum, and training budget, RL-JSO is the only method in the evaluated family to sustain a 100% collision-free rate across all four progressive difficulty campaigns. Its Cliff’s delta fitness advantage over standard JSO also increases monotonically with environmental difficulty, from a medium effect size ($|\delta| = 0.354$) under nominal conditions to a large effect size ($|\delta| = 0.689$) under full adversarial conditions.

What are the implications of the main findings?

The quantitative evidence shows that learned phase control preserves swarm coordination precisely where deterministic time scheduling begins to fail. RL-JSO maintains consistent safety constraint satisfaction and a nearly invariant cooperation score ($A \approx 0.74$, range $= 0.012$), whereas the comparator methods degrade by 17–23% as environmental stress increases. A paired inference-time ablation further supports the interpretation that adaptive phase switching is a principal contributor to the observed performance gains (Cliff’s $|\delta| \geq 0.78$, Holm-adjusted $p_{\mathrm{Holm}} < 10^{-8}$).
From a practical perspective, the framework is modular and compatible with centralized waypoint-based mission planning pipelines. Because it produces waypoint sequences compatible with standard autopilot stacks such as PX4 and ArduPilot, it provides a realistic pathway toward future hardware-in-the-loop and real-world validation in cooperative UAV missions, including disaster response, surveillance, and autonomous logistics under GNSS-degraded and communication-degraded conditions.

Abstract

Cooperative multi-UAV path planning under dynamic and adversarial conditions demands simultaneous satisfaction of safety, efficiency, and coordination constraints, yet existing swarm-intelligence and RL–swarm hybrids rely on deterministic switching rules, tabular states, and ad hoc training schedules. This paper proposes RL-JSO, a hybrid framework in which a dueling double deep Q-network with prioritized experience replay adaptively selects among the drift, passive, and active phases of a jellyfish search optimizer, replacing the deterministic time-control rule with a learned policy. The framework integrates a five-layer hierarchical safety control mechanism, a mastery-gated nine-stage curriculum, and a shared reward module that architecturally enforces fairness between RL-JSO and a paired RL-PSO counterpart. Evaluation across four progressive campaigns with 160 independent runs per algorithm yields three findings within the evaluated JSO/PSO family: RL-JSO is the only method that sustains a 100% collision-free rate across all four campaigns; its Cliff’s delta over standard JSO grows monotonically with difficulty, from medium to large; and under a composite cooperation metric its coordination score remains nearly invariant while the comparators degrade by 17–23%. A paired inference-time ablation on the trained checkpoint provides controlled evidence that adaptive phase switching, rather than the heuristic fallback layers, is a principal contributor to the observed test-time performance of the trained system.

Keywords:

unmanned aerial vehicle; cooperative swarm; path planning; reinforcement learning; jellyfish search optimization; deep Q-network; curriculum learning; adversarial environments; hierarchical safety control

1. Introduction

Unmanned aerial vehicles (UAVs) have become indispensable platforms across environmental monitoring, disaster response, surveillance, and autonomous logistics [1,2]. As missions grow in complexity, single-UAV operation is increasingly being replaced by multi-UAV swarm systems in which coordinated fleets perform tasks collaboratively. Achieving safe, efficient, and cooperative trajectory generation for a swarm remains an open problem, however, because dynamic environmental conditions, limited on-board energy, non-linear flight dynamics, and the combinatorial explosion of the multi-agent decision space all compound one another. For N UAVs with K intermediate waypoints each in three-dimensional space, the decision space is $3 \cdot N \cdot K$-dimensional and continuous, which renders exhaustive search intractable for realistic swarm sizes [3]. The difficulty is compounded further when real-world adversarial factors such as GPS jamming, communication loss, and stochastic wind disturbances are present, conditions that are increasingly relevant in GNSS-degraded and communication-degraded operational scenarios. Throughout this paper, adversarial conditions (also referred to as adverse conditions) denote the simultaneous presence of four environmental stressors: (i) dynamic obstacles traversing the workspace at stage-dependent velocities, (ii) stochastic wind generated by an AR(1) process with stage-dependent volatility, (iii) GPS jamming that injects localization noise, and (iv) communication loss that removes inter-UAV coordination data. Their intensity is controlled through the nine-stage curriculum (Section 3.6), progressing from nominal (no stressors) to fully adversarial (all four at maximum intensity).

Two broad families of algorithms have emerged as dominant paradigms for this class of problems. Swarm intelligence (SI) comprises population-based metaheuristics such as particle swarm optimization (PSO) [4], ant colony optimization [5], and the more recent jellyfish search optimizer (JSO) [6]; these algorithms excel at global exploration but rely on deterministic or rule-based behavioral switching that cannot adapt to rapidly changing conditions. Deep reinforcement learning (DRL) [7], with algorithms such as the Deep Q-Network (DQN) [8] and policy-gradient methods, supports adaptive decision-making but is prone to instability, slow convergence, and poor generalization in high-dimensional multi-agent settings.

The integration of RL and SI into hybrid frameworks has recently gained significant attention as a strategy for combining the complementary strengths of both paradigms [9,10,11,12]. The guiding insight is that RL can dynamically adjust the behavioral phases or parameters of an SI algorithm based on observed optimization state, while SI provides global search diversity that stabilizes RL’s policy learning. Recent works have demonstrated that RL-guided strategy selection inside swarm optimizers can yield substantial improvements in convergence and solution quality over their non-learned counterparts [5,13]. Despite this progress, five methodological gaps persist in the current literature. First, no prior work employs deep RL to govern JSO’s behavioral phase switching for cooperative multi-UAV planning; the JSO family has been extensively refined (with evolutionary state estimation [3], PSO hybridization [14], and composite variants [15]), but the deterministic $c_{0}$ schedule has not been replaced by a learned deep policy. Second, existing hybrid RL–SI works do not enforce reward fairness across compared algorithms, leaving observed differences confounded with reward asymmetry. Third, most existing RL-guided swarm works rely on tabular Q-learning with severely limited state representations (typically 4–32 cells) [5,9,10] that cannot encode adversarial events or high-dimensional optimization landscapes. Fourth, no prior RL–SI hybrid simultaneously models temporally correlated wind, GPS jamming, and communication loss under a progressive training protocol. Fifth, curriculum learning [16,17] has not been systematically applied to RL-guided swarm optimization despite strong theoretical motivation.

This paper addresses these gaps through the following three contributions:

A novel RL-JSO hybrid framework in which a dueling double DQN [18,19] trained with prioritized experience replay [20] governs real-time phase selection among JSO’s drift, passive, and active modes, operating on a 24-dimensional continuous state that encodes optimization progress, safety margins, and environmental conditions—information unavailable to the deterministic time-control rule, which operates on a single scalar (iteration count). To the best of the authors’ knowledge, this is the first deep RL-guided phase controller for JSO in cooperative multi-UAV path planning that jointly addresses state-conditioned phase selection, hierarchical safety enforcement, and fairness-controlled evaluation. The framework enforces a fairness-aware experimental design in which the reward function, state representation, network architecture, curriculum, and random seeds are architecturally matched between RL-JSO and its RL-PSO counterpart, substantially reducing confounding asymmetries.
A hierarchical safety control architecture with five designed override layers, of which three (warmup, stagnation fallback, and the DQN decision itself) are active in the reported experiments; the remaining two are retained as pass-through placeholders for future deployments requiring tighter temporal stability. The correctness of disabling the heuristic fallbacks at evaluation time is empirically validated through a paired inference-time ablation.
A mastery-gated curriculum of nine progressive stages, from nominal conditions to full adversarial settings, with predominantly single-factor progression in the early and intermediate stages and a controlled compound increase in the later stages. The curriculum is coupled with a comprehensive evaluation protocol spanning four progressive difficulty campaigns (160 runs per algorithm), a zero-shot scalability sweep across six swarm sizes, a seven-configuration out-of-distribution generalization study, and a paired inference-time ablation with Wilcoxon signed-rank tests, Holm correction, and Cliff’s delta effect sizes.

Research hypothesis. This work tests the hypothesis that replacing the deterministic time-control rule of JSO with a deep RL policy trained via a mastery-gated curriculum yields statistically significant improvements in trajectory quality and safety under adversarial conditions, compared to both the unmodified JSO and an identically structured RL-PSO baseline. Specifically: (H1) RL-JSO achieves lower objective function values than standard JSO with monotonically increasing effect sizes as environmental difficulty grows; (H2) RL-JSO maintains 100% collision-free trajectories under conditions where at least one deterministic or PSO-family baseline begins to fail; and (H3) the learned phase-selection policy, rather than the heuristic safety fallback layers, is the principal contributor to the observed performance.

The remainder of this paper is organized as follows. Section 2 surveys the related work and positions the present framework within it. Section 3 details the proposed framework, including the simulation environment, state and action spaces, network architecture, reward design, hierarchical safety mechanism, and mastery-gated curriculum. Section 4 reports training outcomes, the four evaluation campaigns, statistical analysis, supplementary generalization and scalability studies, and the inference-time ablation. Section 5 interprets the findings, discusses their scope, and outlines limitations. Section 6 concludes and identifies future directions.

2. Related Work

2.1. Swarm Intelligence Foundations and the JSO Family

Population-based metaheuristics for continuous optimization can be traced back to PSO [4], which models a population of candidate solutions as particles whose velocities are updated by inertia, cognitive, and social terms. Despite extensive variants, PSO retains a structural limitation in three-dimensional UAV path planning: its velocity update is purely additive, with no built-in mechanism for switching between qualitatively different search behaviors (exploitation versus dispersive exploration). In response, a family of newer SI methods explicitly partitions the search process into discrete phases selected by an external schedule. JSO [6] is a prominent member. JSO models the population as jellyfish that switch between an ocean current regime (drift toward the global best, mediated by the population mean) and an active swarm motion regime that decomposes further into a passive sub-mode (random local perturbation) and an active sub-mode (pairwise tournament with a randomly chosen peer). The transition between regimes is governed by a time-control coefficient, commonly used in the deterministic form $c_{0}(t) = 1 - t/T$, where $t$ is the current iteration and $T$ is the iteration budget, yielding the familiar explore-first/exploit-later schedule. The phase choice is already discrete, and the action space is small, making JSO a particularly natural target for learned adaptive control.

The JSO literature has expanded rapidly. Nayyef et al. [14] hybridized JSO with PSO by replacing the ocean-current operator with PSO’s velocity update, but the regime switching itself remained deterministic. Wang and Yi [15] proposed a composite JSO–PSO–GA scheme for urban-terrain UAV path planning, again with static rather than learned hybridization. Wang et al. [21] developed a multi-objective JSO with RRT-based initialization for UAV planning, focusing on initialization quality rather than online adaptation. Zeng et al. [22] developed a parallel multi-objective JSO with differential evolution mutation for multi-UAV forest firefighting. Most directly relevant, Meng et al. [3] proposed an evolutionary-state-estimation multi-strategy JSO (ESE-MSJS) with a switching framework that identifies exploration, transition, and exploitation states from population-level statistics; the framework demonstrates the value of state-aware adaptive control for JSO but relies on rule-based switching rather than a learned policy.

Across this entire JSO family, the consistent observation is that adaptive switching is valuable when available, but no published JSO variant replaces the deterministic phase schedule with a learned policy that observes the optimization state, the safety state, and the environmental state simultaneously. The present work fills this specific gap.

2.2. Deep Reinforcement Learning Components

The decision layer of the present framework rests on four established DRL components. The DQN of Mnih et al. [8] established that a deep neural network trained with experience replay and a periodically synchronized target network can stably approximate action-value functions in high-dimensional observation spaces. The double-Q correction of van Hasselt et al. [18] addresses the systematic overestimation bias of the original DQN by decoupling action selection from action evaluation; this is especially relevant for a phase-control policy where a single phase choice has only an indirect effect on episode-level fitness. The dueling architecture of Wang et al. [19] factorizes the Q-function into a state-value stream $V(s)$ and an advantage stream $A(s, a)$, recombined as $Q(s, a) = V(s) + A(s, a) - \bar{A}(s)$, where $\bar{A}(s)$ is the mean advantage over actions; in many states the relative advantage of one phase over another is small, but the absolute state value (whether the swarm is on a healthy or deteriorating trajectory) is highly informative. Prioritized experience replay (PER) [20] replaces uniform sampling with sampling proportional to the magnitude of the temporal-difference error, focusing learning effort on surprising transitions. The present framework integrates all four mechanisms as a single dueling double DQN with PER, trained with the Huber loss [23] for robustness against outlier targets.

Beyond value-based DRL, alternative paradigms have been applied to UAV control. Lillicrap et al. [24] introduced DDPG for continuous action spaces; Lowe et al. [25] extended this to multi-agent settings (MADDPG); Schulman et al. [26] proposed PPO, a trust-region-inspired policy-gradient method that has become the de facto standard for continuous-control benchmarks. The present work uses a discrete action space (phase choice) rather than a continuous control space (waypoint coordinates), so DQN-family methods are the natural choice. Crucially, the JSO population is responsible for proposing continuous trajectory candidates, while the DQN operates at the meta-level. This separation of concerns distinguishes the present framework from end-to-end DRL planners.

2.3. Reinforcement Learning for Optimizer Control

Reinforcement-learned control of metaheuristic optimizers is the most directly relevant body of related work. The literature divides naturally into two architectural directions: RL-guided SI, where an RL policy controls the behavior of an SI optimizer at decision points within an episode, and SI-enhanced RL, where SI is used to improve the training of an RL policy. The implications differ as follows: in the RL-guided SI direction, the population-based search structure remains intact, and the RL policy adds adaptivity; in the SI-enhanced RL direction, the final deployed policy is fundamentally an RL policy and the SI component disappears at deployment.

In the RL-guided SI direction, Zhang and Chen [9] integrated tabular Q-learning into particle learning behavior in PSO. Wang et al. [5] proposed QMSR-ACOR, a Q-learning-driven continuous ant colony optimizer for multi-UAV planning with a Q-table of only 32 cells. Kappagantula et al. [10] demonstrated that RL can improve move-selection in discrete PSO with a Q-table of 27 cells. Li et al. [13] presented a Q-learning-guided multi-objective PSO for UAV planning. Jing and Li [27] combined deep RL with quantum PSO for mobile robot planning, representing one of the few works that uses a deep network rather than a tabular policy. A consistent pattern across these works is that the RL component is restricted to a small tabular state space (4–32 cells), which fundamentally limits the policy’s ability to represent the joint distribution over optimization state, safety state, and environmental conditions in a 3D dynamic environment. The present work addresses this representational limit directly through a 24-dimensional continuous state vector and a deep value network.

In the SI-enhanced RL direction, Zhang et al. [11] proposed PSO-M3DDPG, in which PSO optimizes the experience sample set for a multi-agent deep deterministic policy gradient. While this approach accelerates RL convergence, the final deployed policy is a pure neural network and inherits the generalization limitations of end-to-end RL planners. Hazarika et al. [12] proposed a generative AI-augmented graph RL framework for adaptive UAV swarm optimization, also operating in the SI-enhanced RL direction. The present work follows the RL-guided SI direction explicitly.

2.4. Curriculum Learning and Adversarial Modeling

Curriculum learning [16] organizes training along a difficulty axis so that the learner is exposed to easier instances first and harder instances later. In reinforcement learning, the environment itself varies across episodes; the survey by Narvekar et al. [17] identifies mastery-gated progression—where stage promotion is conditioned on demonstrated competence rather than on a fixed wall-clock or episode count—as a particularly stable strategy. To the best of our knowledge, no prior RL–SI hybrid work uses an explicit mastery-gated curriculum of this kind.
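To make the distinction between mastery-gated and fixed-schedule promotion concrete, the following is a minimal sketch, not the promotion rule used in this paper. The window length, the success threshold, and the class name `MasteryGate` are hypothetical choices introduced only for illustration; only the nine-stage count is taken from the text.

```python
from collections import deque


class MasteryGate:
    """Promote to the next curriculum stage only after demonstrated competence."""

    def __init__(self, n_stages: int = 9, window: int = 20, threshold: float = 0.9):
        self.n_stages = n_stages    # nine stages, as in the curriculum described here
        self.window = window        # hypothetical: episodes per competence estimate
        self.threshold = threshold  # hypothetical: required success rate
        self.stage = 0
        self.history = deque(maxlen=window)

    def record(self, success: bool) -> int:
        """Log one episode outcome and return the (possibly advanced) stage index."""
        self.history.append(bool(success))
        window_full = len(self.history) == self.window
        if window_full and self.stage < self.n_stages - 1:
            if sum(self.history) / self.window >= self.threshold:
                self.stage += 1
                self.history.clear()  # competence must be shown again at each stage
        return self.stage
```

Under this sketch a learner that never reaches the threshold stays at its current stage indefinitely, which is the behavior that separates competence-based gating from promotion after a fixed episode count.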

The realism of adversarial modeling in UAV swarm research varies substantially across the literature. Most published work either omits environmental disturbance entirely or introduces wind as a time-invariant constant offset or independent Gaussian noise, neither of which captures the temporal correlation of real wind fields. The present work follows a more demanding model: wind is generated by a discrete-time first-order autoregressive (AR(1)) mean-reverting process with stage-dependent volatility, GPS jamming and communication loss are modeled as independent Bernoulli events with stage-dependent probabilities, and the evaluator’s penalty weights adapt dynamically when adversarial events are active. This composite model is consistent with the cooperative-localization regime studied by Wang et al. [28] for GNSS-denied environments. None of the directly comparable RL–SI hybrid works [5,9,10,13,27] models all three disturbance categories (wind, jamming, communication loss) simultaneously.

Recent work on cooperative UAV swarm coordination has addressed complementary aspects of the multi-agent planning problem. Feng et al. [29] surveyed the integration of game-theoretic principles with multi-agent systems for autonomous coordination, providing a taxonomy of cooperative strategies relevant to UAV swarms. Liu et al. [30] proposed an adaptive multi-UAV cooperative path planner combining multi-agent RL with rotation artificial potential fields, demonstrating the value of learned cooperation in obstacle-rich environments. Additional recent studies on UAV swarm path planning [31,32,33] further underscore the growing interest in adaptive and learning-based approaches.

2.5. Statistical Methodology and Fair Comparison

A recurrent methodological concern in the metaheuristics literature is that comparison between algorithms is often reported using only mean fitness without statistical testing, paired protocols, or effect-size estimates. The framework of Derrac et al. [34] consolidates the practices recommended for non-parametric comparison of swarm and evolutionary algorithms: paired Wilcoxon signed-rank tests for two-algorithm comparisons, Holm correction for multi-algorithm families, and explicit reporting of effect sizes alongside p-values. Cliff’s delta [35] is the recommended non-parametric effect size for skewed continuous data. The present work adopts this protocol in full. A second, less commonly addressed methodological concern is reward fairness across compared algorithms: when an RL-augmented optimizer is compared against a non-RL baseline, the optimizers must minimize the same objective function under the same observation model and stage-matched configuration. None of the closely related RL–SI hybrid works [5,9,10] explicitly enforces such a fairness contract; the present work treats it as a non-negotiable architectural constraint.
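As an illustration of two pieces of this protocol, the sketch below computes Cliff's delta and Holm-adjusted p-values using only the standard library. It is a generic textbook formulation and not the authors' analysis code; the function names are assumptions. The paired Wilcoxon signed-rank test itself is omitted and would normally come from a statistics library.

```python
from typing import List, Sequence


def cliffs_delta(x: Sequence[float], y: Sequence[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y), estimated over all cross pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted
```

For a minimization objective, a negative delta for `cliffs_delta(fitness_a, fitness_b)` indicates that algorithm A tends to reach lower fitness values than algorithm B.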

2.6. Synthesis of Research Gaps

Table 1 structures the foregoing survey along dimensions relevant to the present contribution. Synthesizing the survey, five concrete gaps motivate this work: (1) no deep-RL-guided JSO for cooperative UAV planning; (2) tabular RL representational limits in existing hybrids; (3) no reward fairness contract between RL-augmented and non-RL optimizers; (4) limited adversarial modeling; (5) absence of mastery-gated curriculum training in RL–SI hybrids. These gaps are not independent: gaps (2) and (3) together explain why most prior RL–SI hybrids report strong nominal-condition results but degrade rapidly under distribution shift, and gaps (4) and (5) together explain why the present framework’s advantage over standard JSO grows monotonically with campaign difficulty rather than appearing only at a single operating point (Section 4).

3. Materials and Methods

This section details the proposed RL-JSO framework. Section 3.1 introduces the notation and formulations used throughout. Section 3.2 describes the system architecture and simulation environment. Section 3.3 specifies the decision layer (state, action, network). Section 3.4 details the shared reward function. Section 3.5 presents the hierarchical safety mechanism. Section 3.6 describes the mastery-gated curriculum. Section 3.7 describes the training algorithm and experimental protocol. The fair-comparison protocol is presented in Section 4.1 alongside the evaluation results.

3.1. Preliminaries

3.1.1. Jellyfish Search Optimization

JSO [6] evolves a population of M candidate solutions through three motion modes. The ocean current (drift) phase drives exploitation by pulling individuals toward the global best:

$$X_{i}(t + 1) = X_{i}(t) + r \cdot (X^{*} - \beta \cdot r \cdot \mu), \tag{1}$$

where $X^{*}$ is the current global best, $\mu$ is the population mean, $\beta$ is the drift coefficient, and $r \sim U(0, 1)$. The passive motion phase provides exploration via random perturbation:

$$X_{i}(t + 1) = X_{i}(t) + c \cdot (2r - 1) \cdot (UB - LB), \tag{2}$$

with contraction parameter $c$ and per-dimension bounds $UB$ and $LB$. The active motion phase balances exploration and exploitation via a tournament with a random peer $X_{j}$:

$$X_{i}(t + 1) = X_{i}(t) + r \cdot D, \qquad D = \begin{cases} X_{j} - X_{i} & \text{if } f_{i} \geq f_{j}, \\ X_{i} - X_{j} & \text{otherwise,} \end{cases} \tag{3}$$

where $f_{i}$ and $f_{j}$ denote the fitness values of $X_{i}$ and $X_{j}$ (lower is better).

In the original formulation, phase selection is governed by a stochastic time-control coefficient; a widely used deterministic proxy is $c_{0}(t) = 1 - t/T$, yielding an explore-first/exploit-later schedule. In RL-JSO, this deterministic rule is replaced by a learned policy.
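The three updates in Equations (1)–(3) can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the two occurrences of $r$ in Equation (1) are drawn independently, the defaults `beta=3.0` and `c=0.1` are placeholder values not taken from this paper, bound handling is omitted, and all function names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def drift_step(x: Sequence[float], best: Sequence[float], mean: Sequence[float],
               beta: float, rng: random.Random) -> Vector:
    """Equation (1): pull toward the global best, offset by the scaled mean."""
    r1, r2 = rng.random(), rng.random()  # assumption: independent U(0, 1) draws
    return [xi + r1 * (b - beta * r2 * m) for xi, b, m in zip(x, best, mean)]


def passive_step(x: Sequence[float], lower: Sequence[float], upper: Sequence[float],
                 c: float, rng: random.Random) -> Vector:
    """Equation (2): random perturbation scaled by the per-dimension range."""
    return [xi + c * (2.0 * rng.random() - 1.0) * (ub - lb)
            for xi, lb, ub in zip(x, lower, upper)]


def active_step(x: Sequence[float], peer: Sequence[float], f_x: float,
                f_peer: float, rng: random.Random) -> Vector:
    """Equation (3): move toward the peer if it is at least as good, else away."""
    r = rng.random()
    sign = 1.0 if f_x >= f_peer else -1.0  # minimization: lower fitness is better
    return [xi + r * sign * (pj - xi) for xi, pj in zip(x, peer)]


def apply_phase(action: int, population: List[Vector],
                fitness: Callable[[Sequence[float]], float],
                lower: Sequence[float], upper: Sequence[float],
                beta: float = 3.0, c: float = 0.1,
                rng: Optional[random.Random] = None) -> List[Vector]:
    """Apply one phase (0 = drift, 1 = passive, 2 = active) to every member."""
    rng = rng or random.Random()
    values = [fitness(x) for x in population]
    best = population[min(range(len(population)), key=values.__getitem__)]
    dim = len(population[0])
    mean = [sum(x[d] for x in population) / len(population) for d in range(dim)]
    updated = []
    for i, x in enumerate(population):
        if action == 0:
            updated.append(drift_step(x, best, mean, beta, rng))
        elif action == 1:
            updated.append(passive_step(x, lower, upper, c, rng))
        else:
            j = rng.choice([k for k in range(len(population)) if k != i])
            updated.append(active_step(x, population[j], values[i], values[j], rng))
    return updated
```

In this sketch the `action` argument is the only interface between the phase selector and the optimizer, so a deterministic schedule and a learned policy can be swapped without touching the update rules.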

3.1.2. Reinforcement Learning and Deep Q-Networks

Reinforcement learning [7] formalizes sequential decision-making as a Markov decision process $(S, A, P, R, \gamma)$. The agent learns a policy $\pi : S \to A$ that maximizes the expected cumulative discounted reward $E[\sum_{t} \gamma^{t} r_{t}]$. Deep Q-networks (DQN) [8] approximate $Q(s, a; \theta)$ using a neural network. Double DQN [18] decouples action selection from value estimation:

$$y = r + \gamma \, Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}), \tag{4}$$

where $\theta$ and $\theta^{-}$ denote the parameters of the online and target networks, respectively. The dueling architecture [19] factorizes Q into state-value and advantage streams, $Q(s, a) = V(s) + A(s, a) - \bar{A}(s)$, and PER [20] samples transitions in proportion to the magnitude of their TD error. RL-JSO integrates all of these mechanisms.
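The two formulas in this subsection can be checked numerically with a short sketch. It operates on plain lists of Q-values instead of network outputs, and the `done` flag for terminal transitions is an assumption that does not appear in Equation (4).

```python
from typing import List, Sequence


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a) for every action a."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward: float, gamma: float,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float],
                      done: bool = False) -> float:
    """Equation (4): the online net picks the action, the target net scores it."""
    if done:  # assumption: no bootstrapping on terminal transitions
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]
```

Note that the target uses `q_target_next[best_action]` and not `max(q_target_next)`; the latter would be the original DQN target, whose overestimation bias the double-Q correction is meant to reduce.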

3.2. System Architecture and Simulation Environment

3.2.1. Architecture Overview

The framework consists of three interconnected layers. The state layer extracts a 24-dimensional state vector capturing fitness progress, safety metrics, environmental signals, and population statistics. The decision layer employs a dueling double DQN with PER to select one of three discrete actions corresponding to JSO phases (drift, passive, active), applied uniformly to all population members at each iteration. The optimization core executes the RL-selected JSO phase to update candidate trajectory solutions. Figure 1 illustrates the information flow at a high level.

The framework is deliberately centralized: a single optimizer reasons jointly about all N UAVs in the swarm, rather than running N independent optimizers. This choice is motivated by the strong inter-UAV coupling introduced by the safety constraints, particularly the inter-UAV separation penalty, which is not separable across UAVs. Decomposition approaches such as consensus-based ADMM or coordination-graph methods could in principle decouple the problem but are beyond the scope of this paper.

3.2.2. Simulation Environment

A custom Python-based simulation models UAV swarm dynamics within a rectangular workspace of dimensions $100 \times 100 \times 50$ m. Each UAV is treated at the kinematic level (position and velocity) without explicit attitude dynamics. Dynamic spherical obstacles advance via Euler integration with boundary bouncing, with velocities capped per curriculum stage. Wind is modeled as a discrete-time AR(1) mean-reverting process:

$$w(t + 1) = 0.95 \cdot w(t) + 0.05 \cdot \mu_{w} + \sigma_{w} \cdot \epsilon_{t}, \qquad \epsilon_{t} \sim N(0, I_{3}), \tag{5}$$

with stage-dependent volatility $\sigma_{w}$, where $w(t)$ is the three-dimensional wind velocity at iteration $t$, the autoregressive coefficient $\phi = 0.95$ controls temporal persistence, $(1 - \phi) \mu_{w} = 0.05 \mu_{w}$ is the mean-reversion term that pulls the process toward the long-term mean wind $\mu_{w}$ and prevents unbounded drift, and $\epsilon_{t}$ is an independent and identically distributed standard Gaussian noise vector. GPS jamming and communication loss are modeled as independent Bernoulli events with stage-dependent probabilities. When jamming is active, Gaussian noise is injected into trajectory evaluations; when communication is lost, the population mean used in JSO drift is replaced by the global best. Figure 2 shows a representative instance of the simulation environment.
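A minimal sketch of the disturbance model in Equation (5) follows, assuming per-axis independent noise as implied by the identity covariance. The function names and the probability arguments are hypothetical; the stage-dependent values of $\sigma_{w}$ and of the event probabilities are not specified in this section and are left to the caller.

```python
import math
import random
from typing import List, Sequence, Tuple


def wind_step(w: Sequence[float], mu_w: Sequence[float], sigma_w: float,
              rng: random.Random, phi: float = 0.95) -> List[float]:
    """One AR(1) update: persistence + mean reversion + Gaussian innovation."""
    return [phi * wi + (1.0 - phi) * mi + sigma_w * rng.gauss(0.0, 1.0)
            for wi, mi in zip(w, mu_w)]


def stationary_std(sigma_w: float, phi: float = 0.95) -> float:
    """Long-run per-axis standard deviation of the AR(1) process."""
    return sigma_w / math.sqrt(1.0 - phi * phi)


def sample_events(p_jam: float, p_comm_loss: float,
                  rng: random.Random) -> Tuple[bool, bool]:
    """Independent Bernoulli draws for GPS jamming and communication loss."""
    return rng.random() < p_jam, rng.random() < p_comm_loss
```

With $\phi = 0.95$ the stationary standard deviation is roughly $3.2\,\sigma_{w}$, so the per-step volatility understates the spread of wind velocities that the swarm experiences over a long episode.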

3.2.3. Trajectory Encoding and Objective

Each UAV trajectory is represented as a sequence of K intermediate waypoints bracketed by a fixed start position $s_{i}$ and a fixed goal position $g_{i}$, neither of which enters the optimization:

$$X_{i} = [\underbrace{s_{i}}_{\text{fixed}}, \underbrace{w_{i,1}, \dots, w_{i,K}}_{K \text{ optimizable waypoints}}, \underbrace{g_{i}}_{\text{fixed}}], \tag{6}$$

for $i = 1, \dots, N$ UAVs, where each $w_{i,k} \in R^{3}$. Because the start and goal are fixed, only the K intermediate waypoints are decision variables. For $N = 10$ and $K = 16$, the decision space is $3 \cdot N \cdot K = 480$-dimensional and continuous. Let $X = \{X_{1}, \dots, X_{N}\}$ denote the concatenated decision vector of all N UAV trajectories. All optimizers minimize a shared scalar fitness:

$$J(X) = \lambda_{L} L + \lambda_{E} E_{turn} + \lambda_{C} C_{total} + \lambda_{S} \bar{\alpha} + F_{form}, \tag{7}$$

where $L$ is total path length, $E_{turn}$ is turning energy, $\bar{\alpha}$ is the mean turn angle (smoothness proxy), $C_{total}$ is the composite safety cost aggregating obstacle, separation, boundary, and UAV collision penalties, and $F_{form}$ is a formation cohesion penalty. Under adversarial conditions, penalty coefficients adapt dynamically: jamming boosts the safety cost weight, communication loss boosts formation cohesion, and wind boosts the energy weight proportionally to wind magnitude. All algorithms receive identical adaptive weighting.
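The encoding in Equation (6) and two of the geometric terms in Equation (7) can be illustrated as follows. This is a sketch under assumptions: the flat decision vector is taken to be ordered UAV by UAV and waypoint by waypoint, path length is the sum of Euclidean segment lengths, and the mean turn angle is the average angle between consecutive segments. The remaining terms ($E_{turn}$, $C_{total}$, $F_{form}$) and the weights $\lambda$ are not defined in enough detail here to be reproduced, so they are omitted. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def decision_dim(n_uavs: int, n_waypoints: int) -> int:
    """Three coordinates per optimizable waypoint; start and goal are excluded."""
    return 3 * n_uavs * n_waypoints


def decode(flat: Sequence[float], starts: Sequence[Point], goals: Sequence[Point],
           n_waypoints: int) -> List[List[Point]]:
    """Rebuild [start, w_1, ..., w_K, goal] per UAV (assumed UAV-major layout)."""
    trajectories = []
    for i, (start, goal) in enumerate(zip(starts, goals)):
        base = 3 * n_waypoints * i
        waypoints = [tuple(flat[base + 3 * k: base + 3 * k + 3])
                     for k in range(n_waypoints)]
        trajectories.append([tuple(start), *waypoints, tuple(goal)])
    return trajectories


def path_length(traj: Sequence[Point]) -> float:
    """Sum of Euclidean distances between consecutive trajectory points."""
    return sum(math.dist(a, b) for a, b in zip(traj, traj[1:]))


def mean_turn_angle(traj: Sequence[Point]) -> float:
    """Average angle (radians) between consecutive segments; 0 for a straight line."""
    angles = []
    for a, b, c in zip(traj, traj[1:], traj[2:]):
        u = [q - p for p, q in zip(a, b)]
        v = [q - p for p, q in zip(b, c)]
        nu, nv = math.hypot(*u), math.hypot(*v)
        if nu == 0.0 or nv == 0.0:
            continue  # skip degenerate zero-length segments
        cos_theta = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        angles.append(math.acos(max(-1.0, min(1.0, cos_theta))))
    return sum(angles) / len(angles) if angles else 0.0
```

For the configuration quoted in the text, `decision_dim(10, 16)` reproduces the 480-dimensional decision space.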

3.3. Decision Layer

3.3.1. State Representation

The state vector $s_{t} \in R^{24}$ provides the DQN with a comprehensive observation of the optimization landscape, safety status, and environmental conditions. All features are normalized to approximately $[0, 1]$ using a running mean–variance estimator maintained with double-precision arithmetic for numerical stability. Table 2 lists the 24 features. The choice of a 24-dimensional continuous state vector directly addresses the representational limitations of tabular Q-learning approaches [5,9,10].

Two features deserve explicit definition. Hard-hit pressure (index 4) counts the number of trajectory segments whose minimum obstacle clearance falls below the hard safety threshold $d_{hard}$, normalized by a clipping constant of 20. Near-hit pressure (index 5) counts segments with clearance between the hard threshold $d_{hard}$ and the soft warning threshold $d_{soft}$, normalized by a clipping constant of 50. The former indicates imminent collision; the latter captures gray-zone proximity where trajectories are technically safe but uncomfortably close to obstacles.
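One plausible reading of these two definitions is sketched below. The interpretation of "normalized by a clipping constant" as clipping the count and then dividing by the same constant is an assumption, as is the treatment of clearances exactly equal to a threshold; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def hit_pressures(clearances: Sequence[float], d_hard: float, d_soft: float,
                  hard_clip: int = 20, near_clip: int = 50) -> Tuple[float, float]:
    """Return (hard-hit pressure, near-hit pressure), each in [0, 1].

    `clearances` holds the minimum obstacle clearance of each trajectory segment.
    """
    hard_count = sum(1 for c in clearances if c < d_hard)
    near_count = sum(1 for c in clearances if d_hard <= c < d_soft)
    # Assumption: counts are clipped at the constant, then divided by it.
    hard_pressure = min(hard_count, hard_clip) / hard_clip
    near_pressure = min(near_count, near_clip) / near_clip
    return hard_pressure, near_pressure
```

For example, with `d_hard=1.0` and `d_soft=3.0`, segment clearances of 0.5, 1.5, 2.5, and 3.5 m give one hard hit and two near hits, that is, pressures of 1/20 and 2/50.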

Although features such as separation reserve and separation ratio (or clearance reserve and clearance ratio) may appear redundant, they encode complementary information. The ratio is scale-invariant—a ratio of 1.2 communicates the same relative safety regardless of whether the threshold is 2 m or 20 m—while the reserve captures the absolute margin in meters, which governs the time available for corrective action under a given velocity bound. Including both enables the policy to generalize across curriculum stages that use different absolute thresholds while retaining stage-specific margin awareness. In preliminary ablations, removing the reserve features while keeping the ratios slowed training convergence by approximately 15%.

3.3.2. Action Space

For RL-JSO, the action space is discrete and small, consisting of three actions corresponding to the three JSO phases:

A = {0, 1, 2} = {drift, passive, active} .

(8)

At each decision point, the selected action determines which phase update equation (Equations (1)–(3)) is applied to the entire population. For the RL-PSO fair counterpart used in Section 4.1, the action space is larger: $A_{PSO} = \{0, 1, 2, 3, 4\}$, corresponding to five predefined combinations of the PSO coefficients $(w, c_{1}, c_{2})$ spanning exploration-heavy to exploitation-heavy regimes. The output head of the RL-PSO DQN is correspondingly five-dimensional; all other network components are shared.

3.3.3. Network Architecture

The dueling double DQN (Figure A1) consists of a two-hidden-layer feedforward network (256 units per layer), layer normalization [36], dropout, and dueling streams (128 units each) that separately compute $V(s)$ and $A(s, a)$. Training minimizes the Huber loss [23] weighted by PER importance sampling weights:

L (θ) = \frac{1}{B} \sum_{i = 1}^{B} w_{i} \cdot H_{δ} (y_{i} - Q (s_{i}, a_{i}; θ)),

(9)

with double DQN targets (Equation (4)) and Polyak soft updates $θ^{-} \leftarrow (1 - τ) θ^{-} + τ θ$. A schematic of the full network architecture is provided in Appendix A.
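The loss in Equation (9) and the Polyak update can be illustrated on plain Python lists. This is a minimal sketch under stated assumptions: the function names are hypothetical, the Huber threshold default of 1.0 is not given in the text, and the mean-subtracted dueling aggregation $Q = V + A - \bar{A}$ is the common convention rather than a detail confirmed by the paper.

```python
from typing import Sequence


def huber(x: float, delta: float = 1.0) -> float:
    """Huber function: quadratic for |x| <= delta, linear beyond it."""
    ax = abs(x)
    return 0.5 * x * x if ax <= delta else delta * (ax - 0.5 * delta)


def per_weighted_huber_loss(
    targets: Sequence[float],
    q_values: Sequence[float],
    weights: Sequence[float],
    delta: float = 1.0,
) -> float:
    """Equation (9): batch mean of w_i * H_delta(y_i - Q(s_i, a_i))."""
    if not targets or not (len(targets) == len(q_values) == len(weights)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(w * huber(y - q, delta) for y, q, w in zip(targets, q_values, weights))
    return total / len(targets)


def polyak_update(
    target: Sequence[float], online: Sequence[float], tau: float
) -> list[float]:
    """Soft update: theta_minus <- (1 - tau) * theta_minus + tau * theta."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def dueling_q(value: float, advantages: Sequence[float]) -> list[float]:
    """Assumed aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]
```

In a PyTorch implementation the same quantities would be computed on tensors; the list form here only makes the arithmetic explicit.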

3.4. Reward Function Design

The reward function underwent iterative refinement across five revisions, each addressing a specific failure mode observed in training logs; a summary timeline is provided in Appendix B. The final revision (v3.2) computes a shared reward signal used identically by RL-JSO and RL-PSO, imported from a single code module. The reward combines (i) a fitness improvement term that rewards reductions in best-so-far fitness, (ii) two distributed safety margin terms that reward preserving inter-UAV separation and obstacle clearance above stage-specific thresholds, (iii) a trajectory smoothness term, (iv) a cooperation term, and (v) a small per-step time penalty. The component weights are summarized in Table 3.

The separation-margin term uses a square-root deficit curve that amplifies gray-zone penalties near the threshold:

r_{sep} = - 0.65 \cdot \sqrt{max (0, 1 - \frac{s_{curr}}{d_{min}})},

(10)

where $s_{curr}$ is the current minimum inter-UAV separation and $d_{min}$ is the stage-specific threshold. This formulation resolved the persistent gray-zone failure mode where the majority of separation violations occurred near the threshold boundary under linear penalties. Measured on a reference set of near-threshold training episodes, v3.2 produces a mean per-step safety penalty magnitude approximately $5.3\times$ larger than v3.0, which characterizes the gradient of the safety signal near the threshold rather than an overall performance gain.
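To see why a square-root deficit amplifies gray-zone penalties, Equation (10) can be compared with a linear deficit of the same weight. The linear form below is a hypothetical comparison baseline, not necessarily the v3.0 reward, and the function names are illustrative.

```python
import math


def sep_penalty_sqrt(s_curr: float, d_min: float, weight: float = 0.65) -> float:
    """Equation (10): -weight * sqrt(max(0, 1 - s_curr / d_min))."""
    if d_min <= 0:
        raise ValueError("d_min must be positive")
    deficit = max(0.0, 1.0 - s_curr / d_min)
    return -weight * math.sqrt(deficit)


def sep_penalty_linear(s_curr: float, d_min: float, weight: float = 0.65) -> float:
    """Hypothetical linear baseline with the same weight, for comparison only."""
    if d_min <= 0:
        raise ValueError("d_min must be positive")
    return -weight * max(0.0, 1.0 - s_curr / d_min)


def amplification(s_curr: float, d_min: float) -> float:
    """Ratio of sqrt to linear penalty magnitude for a violating separation."""
    return sep_penalty_sqrt(s_curr, d_min) / sep_penalty_linear(s_curr, d_min)
```

Because $\sqrt{x} > x$ for $0 < x < 1$, the ratio grows as the deficit shrinks: a 1% deficit gives a tenfold larger penalty than the linear form, while a 25% deficit gives only a twofold increase. Both penalties vanish once $s_{curr} \geq d_{min}$.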

3.5. Hierarchical Safety Control

The RL agent is embedded in a hierarchical safety override architecture (Figure 3) with five designed layers, of which three are active in the reported experiments:

L1. Warmup heuristic (active): during the first few iterations of each episode, the action is drawn from a heuristic fallback schedule that allows JSO to establish an initial population distribution before the DQN has meaningful experience.
L2. Stagnation fallback (active): if the best-so-far fitness has not improved for a pre-specified number of iterations, control is handed to a heuristic rule that forces an exploratory phase, preventing the DQN from committing to a stalled exploitation strategy.
L3. Switch cooldown (pass-through): retained architecturally with the inter-phase cooldown parameter set to 0 in the reported experiments.
L4. RL decision interval (pass-through): retained architecturally, with the decision interval set to 1 iteration, so the DQN is queried at every step.
L5. DQN $ε$ -greedy decision (active): when all prior gates are inactive, the DQN selects an action via an $ε$ -greedy heuristic.

At evaluation time, the eval_disable_fallback flag (set to True) additionally bypasses L1 and L2, so only L5 governs every iteration. This design choice is empirically validated by the inference time ablation (Section 4.6): re-enabling L1 and L2 at evaluation time degrades minimum separation by 21–29% in C2–C4 with effect sizes rising to large and Holm-corrected $p < 10^{-4}$ in C4, confirming that the mature policy does not benefit from heuristic fallbacks at inference time.
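The gate ordering L1 to L5 can be sketched as a single selection function. Everything beyond the ordering and the `eval_disable_fallback` flag is an assumption made for illustration: the function name, the warmup length, the stagnation limit, and which phases the heuristics return are all hypothetical placeholders, since the text does not specify them.

```python
import random
from typing import Sequence

DRIFT, PASSIVE, ACTIVE = 0, 1, 2  # action indices from Equation (8)


def select_action(
    q_values: Sequence[float],
    iteration: int,
    stall_count: int,
    epsilon: float,
    rng: random.Random,
    *,
    warmup_iters: int = 5,          # hypothetical warmup length
    stall_limit: int = 10,          # hypothetical stagnation limit
    warmup_action: int = PASSIVE,   # hypothetical warmup heuristic output
    explore_action: int = ACTIVE,   # hypothetical "exploratory phase"
    eval_disable_fallback: bool = False,
) -> int:
    """Return a JSO phase index following the L1..L5 gate order."""
    # L1: warmup heuristic (skipped when fallbacks are disabled).
    if not eval_disable_fallback and iteration < warmup_iters:
        return warmup_action
    # L2: stagnation fallback (skipped when fallbacks are disabled).
    if not eval_disable_fallback and stall_count >= stall_limit:
        return explore_action
    # L3 and L4 are pass-through: cooldown 0 and decision interval 1.
    # L5: epsilon-greedy over the DQN's Q-values.
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Calling the function with `eval_disable_fallback=True` and `epsilon=0.0` reproduces the evaluation-time behavior described above, in which only the greedy DQN decision remains.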

3.6. Mastery-Gated Curriculum

Training employs mastery-gated progression across nine stages (Figure 4, Table 4). The design principle is predominantly single-factor progression in the early and intermediate stages—each new stage introduces one new difficulty factor while holding the others constant—followed by a controlled compound increase at a later stage. Stage promotion is conditioned on the learner achieving a stage-specific composite success threshold on a held-out validation set (minimum validation win rate, minimum collision-free rate, minimum improvement in cumulative reward). If the criterion is not met, the learner enters a recovery cycle: exploration probability is temporarily increased, additional training episodes are allocated, and re-evaluation is performed. Up to three recovery cycles are allowed per stage; if the gate is still not met after the third cycle, training is terminated by a stop_on_terminal_failure policy, and the best checkpoint observed during the failed stage is retained.
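The promotion logic with up to three recovery cycles can be sketched as a small loop. The names `Validation`, `gate_passed`, and `run_stage` are hypothetical, as is the callback `train_and_validate`, which stands in for stage training followed by held-out validation; in recovery cycles it would be expected to raise exploration and allocate extra episodes. Threshold values are not given in the text.

```python
from typing import Callable, NamedTuple


class Validation(NamedTuple):
    """The three gate criteria named in the text."""
    win_rate: float
    collision_free_rate: float
    reward_gain: float


def gate_passed(result: Validation, thresholds: Validation) -> bool:
    """The composite gate passes only if every criterion meets its minimum."""
    return all(value >= minimum for value, minimum in zip(result, thresholds))


def run_stage(
    train_and_validate: Callable[[int], Validation],
    thresholds: Validation,
    max_recovery_cycles: int = 3,
) -> tuple[bool, int]:
    """Return (promoted, recovery_cycles_used).

    Cycle 0 is the initial attempt; cycles 1..max_recovery_cycles are
    recovery cycles. A False result corresponds to terminal failure.
    """
    for cycle in range(max_recovery_cycles + 1):
        if gate_passed(train_and_validate(cycle), thresholds):
            return True, cycle
    return False, max_recovery_cycles
```

A caller that receives `(False, 3)` would then apply the stop_on_terminal_failure policy and keep the best checkpoint observed during the failed stage.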

A stage-aware exploration bump temporarily raises $ε$ at novelty transitions:

ε_{eff} = \max (ε_{sched}, ε_{f} + (ε_{b} - ε_{f}) e^{- (t - t_{b}) / τ}) .

(11)

The mechanism is self-correcting: fast-progressing optimizers reach novelty stages with higher base $ε$ and therefore receive smaller effective bumps.
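Equation (11) can be evaluated directly to see the self-correcting behavior. The symbol meanings used here are assumptions, since this section does not define them: $ε_{b}$ is read as the bump peak, $ε_{f}$ as the floor the bump decays toward, $t_{b}$ as the episode at which the bump starts, and $τ$ as the decay constant. The function name and the choice to leave the schedule untouched before $t_{b}$ are also illustrative.

```python
import math


def effective_epsilon(
    eps_sched: float,
    t: float,
    t_b: float,
    eps_b: float,
    eps_f: float,
    tau: float,
) -> float:
    """Equation (11): max(eps_sched, eps_f + (eps_b - eps_f) * exp(-(t - t_b) / tau))."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    if t < t_b:
        # Assumption: before the novelty transition only the schedule applies.
        return eps_sched
    bump = eps_f + (eps_b - eps_f) * math.exp(-(t - t_b) / tau)
    return max(eps_sched, bump)
```

With a peak of 0.3, a learner arriving at the transition with a scheduled $ε$ of 0.5 is unaffected by the bump, whereas one arriving at 0.05 is lifted to 0.3 and then decays toward the floor.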

3.7. Training Algorithm and Experimental Protocol

Table 5 summarizes the hyperparameters, all fixed across RL-JSO and RL-PSO.

Training Algorithm

Algorithm 1 presents the complete RL-JSO training procedure integrating all components described above.

Algorithm 1 RL-JSO training procedure.

1: Initialize DQN parameters ($θ, θ^{-}$), PER buffer $D$, running mean–variance estimator
2: for each curriculum stage $\in \{S1, \dots, S9\}$ do
3:     Load stage configuration; apply $ε$-bump if novel factor introduced
4:     for episode $= 1$ to stage budget do
5:         Reset environment; initialize JSO population encoding N-UAV waypoints
6:         Evaluate initial population; identify global best $X^{*}$
7:         for iteration $t = 1$ to $T$ do
8:             Advance env.: move obstacles, update wind (Equation (5)), sample adversarial events
9:             if dynamic obstacles then re-evaluate all population fitness
10:            end if
11:            Build state $s_{t}$ (Table 2); normalize via running MVE
12:            Select $a_{t}$ via hierarchical control (L1, L2 active; L3, L4 pass-through; L5 as DQN)
13:            Execute selected JSO phase; apply wind perturbation; clip bounds
14:            Evaluate $J$ (Equation (7)); compute reward $r_{t}$
15:            Store $(s_{t}, a_{t}, r_{t}, s_{t + 1}, {done}_{t})$ in $D$
16:            if $| D | \geq$ batch size and past warmup then
17:                Sample from $D$ with PER; compute double DQN targets (Equation (4)); update $θ$
18:                Soft update: $θ^{-} \leftarrow (1 - τ) θ^{-} + τ θ$
19:            end if
20:            if catastrophic safety event then switch to heuristic fallback for remainder of episode
21:            end if
22:        end for
23:        if validation interval then freeze model → greedy eval → mastery-gate decision
24:        end if
25:    end for
26: end for
27: return best validation checkpoint $θ^{*}$

At evaluation time, the trained checkpoint $θ^{*}$ is loaded, the hierarchical safety controller is configured with eval_disable_fallback=True (disabling L1 and L2), and the mature policy L5 governs every iteration greedily ($\arg\max_{a} Q (s, a; θ^{*})$). No gradient updates are performed and the replay buffer is not populated. The same procedure is used for RL-PSO, with JSO phase actions replaced by PSO parameter presets and all other components held identical.

Experimental setup. The framework was implemented in Python 3.11 using PyTorch 2.5.1 and NumPy 1.26 in an Anaconda environment. Experiments ran on a single workstation with an Intel Core i9-14900HX CPU (Intel Corporation, Santa Clara, CA, USA; 24 cores, 32 threads), 64 GB DDR5-5600 RAM, and an NVIDIA GeForce RTX 4070 Ti SUPER GPU (NVIDIA Corporation, Santa Clara, CA, USA; 16 GB GDDR6X) with CUDA 12.1, running Windows 11. The codebase comprises approximately 26,000 lines of Python across 20 modules, with a shared training infrastructure ensuring identical logic across all algorithms. RL-JSO and RL-PSO were each trained on three scenarios (S1_default, S2_shifted, S3_tighter_gap) for approximately 76 wall-clock hours per algorithm under matched compute (75.74 h for RL-JSO and 75.62 h for RL-PSO, a difference of 0.16%).

Benchmark algorithms and campaigns. Four algorithms constitute the primary benchmark: (i) RL-JSO (proposed); (ii) RL-PSO fair counterpart (DQN selects from 5 PSO presets); (iii) standard JSO with deterministic $c_{0}$ switching; and (iv) standard PSO with fixed parameters. Four progressive campaigns assess robustness across a difficulty gradient. Each campaign uses 20 seeds × 2 evaluation scenarios = 40 runs per algorithm:

C1 (nominal): static obstacles, no wind, no adversarial factors.
C2 (wind): dynamic obstacles, AR(1) wind at moderate intensity.
C3 (hard dynamic): larger and faster obstacles with wind.
C4 (full adversarial): wind, GPS jamming, and communication loss with enlarged obstacles.

Campaigns C1 and C2 are stage-matched to training stages S1 and S5, respectively, and are within the mastered distribution for RL-JSO. Campaigns C3 and C4 are stage-matched to S6 and S8, respectively. Because RL-JSO encountered S6 during training but did not satisfy the mastery gate, C3 constitutes partial-transfer evaluation, not strict zero-shot. Because RL-JSO was not trained on S7–S9 at all, C4 constitutes strict zero-shot evaluation. This distinction is maintained throughout the reported results.

Statistical protocol. Statistical significance is assessed using two-sided Wilcoxon signed-rank tests (paired by seed and scenario). For each metric within each campaign, Holm correction [34] is applied across the three pairwise comparisons of RL-JSO against {JSO, PSO, RL-PSO} to control the family-wise error rate. Effect sizes are reported as Cliff’s delta [35]: negligible ($| δ | < 0.147$), small (0.147–0.330), medium (0.330–0.474), or large ($| δ | > 0.474$), with 95% confidence intervals from 10,000 bootstrap resamples. Generalization tests evaluate seven unseen environment configurations with 20 seeds each (140 runs per algorithm, 560 runs overall). Scalability sweeps test zero-shot transfer across six swarm sizes ($N \in \{5, 10, 15, 20, 50, 100\}$) with 10 seeds per size.
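The effect-size and multiplicity steps of this protocol can be sketched in plain Python. The function names are hypothetical, and three details are assumptions: a value exactly on a band boundary is assigned to the upper band, the bootstrap resamples paired observations by index, and the interval is a simple percentile interval. The Wilcoxon test itself is omitted; in practice it would come from a statistics library.

```python
import random
from typing import Sequence


def cliffs_delta(x: Sequence[float], y: Sequence[float]) -> float:
    """Cliff's delta: (#pairs x > y minus #pairs x < y) / (len(x) * len(y))."""
    gt = sum(1 for a in x for b in y if a > b)
    lt = sum(1 for a in x for b in y if a < b)
    return (gt - lt) / (len(x) * len(y))


def magnitude(delta: float) -> str:
    """Label |delta| with the thresholds quoted in the text."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.330:
        return "small"
    if d < 0.474:  # assumption: exactly 0.474 is treated as large
        return "medium"
    return "large"


def holm_adjust(p_values: Sequence[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running
    return adjusted


def bootstrap_ci(
    x: Sequence[float],
    y: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for Cliff's delta, resampling pairs by index."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cliffs_delta([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[max(0, int((1 - alpha / 2) * n_resamples) - 1)]
    return lo, hi
```

With three comparisons per family, as in this protocol, the smallest p-value is multiplied by 3, the next by 2, and the largest by 1, with monotonicity enforced across the sorted sequence.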

4. Results

Results are reported in six stages. Section 4.1 describes the fair-comparison protocol that governs all subsequent evaluations. Section 4.2 describes the training phase: curriculum progression, mastery outcomes, and checkpoint selection. Section 4.3 describes the four evaluation campaigns. Section 4.4 presents the statistical analysis. Section 4.5 reports cooperation, generalization, and scalability. Section 4.6 presents the inference-time ablation.

4.1. Fair Comparison Protocol

To ensure scientific validity, RL-JSO and RL-PSO share identical conditions enforced architecturally (Figure 5): the identical reward module (v3.2), identical 24-dimensional state representation, identical running-mean-variance normalization, identical two-hidden-layer feature trunk, identical nine-stage curriculum, identical exploration schedule, identical evaluation budget and obstacle layouts, identical seeds, and matched wall-clock training time. The only components that differ are the algorithm-specific action heads (three JSO phases for RL-JSO versus five PSO parameter presets for RL-PSO) and the optimizer backbones themselves. Standard JSO and PSO serve as non-RL baselines using their canonical heuristics.

For each RL-augmented algorithm, the best validation checkpoint is selected by the lowest safety-penalized validation score, which combines mean fitness with weighted penalties for feasibility failures, separation violations, obstacle collisions, hard hits, and early termination, minus a success-rate bonus. The overall validation win rate (0.911 for RL-JSO, 0.578 for RL-PSO) is reported as a post-training summary statistic. All algorithms are subsequently evaluated under identical campaign configurations regardless of the training stage each algorithm reached.

4.2. Training and Curriculum Progression

Table 6 summarizes the curriculum progression. RL-JSO completed 736 training episodes and reached stage S6. Mastery-gated promotion was achieved through stages S1–S5; on stage S6 the mastery gate was not met after three recovery cycles, and training was terminated by the stop_on_terminal_failure policy. The best validation checkpoint, selected by the lowest safety-penalized validation score, corresponds to the S5-mastered stage (overall validation win rate: 0.911). Across the entire training run, RL-JSO recorded a single hard hit, and exploration decayed from $ε = 0.99$ to $ε = 0.10$.

RL-PSO reached stage S9 in 766 episodes with eight successful stage-level promotions (S1 → S2 → … → S9), but did not satisfy the S9 mastery gate after three recovery cycles; the best validation checkpoint used for RL-PSO corresponds to its hardest mastered stage (S8), with an overall validation win rate of 0.578. RL-PSO accumulated 795 hard hits during the late curriculum stages (S7–S9), reflecting a structurally less stable optimization core under the hardest conditions. An earlier independent training run of RL-JSO with a different seed reproduced structurally consistent results (S6 reached after 640 episodes, no hard hits, best fitness within 0.3% of the present run), providing cross-run evidence of training stability. A systematic multi-seed replication remains future work.

4.3. Main Evaluation Results

Table 7, Table 8, Table 9 and Table 10 report the results of the four progressive campaigns. Lower fitness is better, and CF denotes the collision-free rate. Fitness values are directly comparable across algorithms only when both algorithms achieve CF = 100%; for cells with CF < 100%, the reported mean fitness includes runs that contain infeasible solutions (obstacle penetrations) and is shown in italics with a footnote marker.

4.3.1. Campaign C1: Nominal Conditions

Under nominal conditions (Table 7), all four algorithms achieve CF = 100% with zero hard hits. RL-JSO achieves the lowest mean fitness (616,437), 15% below standard JSO, 22% below standard PSO, and 16% below RL-PSO. The advantage of RL-JSO over JSO corresponds to a medium effect size ($| δ | = 0.354$, $p_{Holm} < 0.001$).

4.3.2. Campaign C2: Wind Conditions

Under wind (Table 8), a clear bifurcation emerges between optimizer families. Both JSO-based methods retain CF = 100%, while PSO drops to 10% and RL-PSO to 2.5%. Among the two collision-free algorithms, RL-JSO achieves 20% lower mean fitness than JSO (808,503 vs. 1,013,162), with a borderline-significant medium effect ($| δ | = 0.369$, $p_{Holm} = 0.050$).

4.3.3. Campaign C3: Hard Dynamic Obstacles (Partial Transfer)

Training-coverage note. For RL-JSO, Campaign C3 represents partial transfer, not strict zero-shot: stage S6 (hard_obstacles) was encountered during training but did not satisfy the mastery gate. The policy therefore has limited prior exposure to this distribution but did not achieve mastery on it.

Under hard dynamic obstacles (Table 9), the divergence widens further. RL-JSO retains CF = 100%, while standard JSO drops slightly to 97.5%. PSO reaches 20% CF and RL-PSO only 10%. Among the (now nearly) collision-free pair, RL-JSO’s mean fitness is 55% lower than JSO’s (901,414 vs. 1,991,694), with a large effect ($| δ | = 0.628$, $p_{Holm} < 0.001$).

4.3.4. Campaign C4: Full Adversarial Conditions (Strict Zero-Shot)

Training-coverage note. For RL-JSO, Campaign C4 represents strict zero-shot transfer: stages S7 (jamming) and S8 (comm_intro) were never encountered during RL-JSO training, which terminated at S6. The policy has no prior exposure to GPS jamming or communication loss as training signals, and the reported performance reflects purely out-of-training distribution generalization.

Under full adversarial conditions (Table 10), RL-JSO maintains CF = 100% across both scenarios with zero hard hits. Standard JSO also achieves CF = 100% in aggregate, though with substantially inflated mean fitness (2,042,387) driven by separation-collapse outliers. Among the two collision-free algorithms, RL-JSO achieves 57% lower mean fitness than JSO with a large statistical effect ($| δ | = 0.689$, $p_{Holm} < 0.001$). The per-seed distribution is not driven by a small number of outliers: in S2_shifted, the gap reaches 81% (RL-JSO: 426,226 versus JSO: 2,282,272); and on the matched seed visualized in Figure A3, the per-seed gap reaches a $5.2\times$ ratio (601,471 versus 3,105,219).

Figure 6 visualizes the full fitness distribution underlying Table 10. The hatched boxes for PSO and RL-PSO mark algorithms that are not collision-free in this campaign, and therefore their apparently low fitness values include infeasible solutions and are not directly comparable to RL-JSO and JSO. Among the two collision-free algorithms, RL-JSO’s median fitness lies well below standard JSO’s, and its tail of high-fitness outliers is substantially shorter.

Fitness function fairness. To address the question of whether the comparison is fair: all four algorithms minimize the identical objective function $J (X)$ (Equation (7)), which includes path length, energy, smoothness, collision penalty, and inter-UAV separation penalty with the same weighting schedule. No algorithm receives a fitness advantage. The collision-free rate difference arises because PSO’s independent velocity updates more frequently penetrate obstacle boundaries, whereas JSO’s coherent drift operator preserves geometric feasibility. A representative trajectory comparison is provided in Appendix C: the PSO path exhibits multiple obstacle intersections that contribute to a superficially low fitness, while the RL-JSO trajectory under the same seed maintains full clearance.

A scenario-level decomposition of Campaign C4 is shown in Figure 7. The performance gap is concentrated in S2_shifted, where RL-JSO remains collision-free and reduces the mean fitness from 2,282,272 (JSO) to 426,226, an 81% reduction. In S1_default, both JSO-based methods remain collision-free and the gap is materially smaller. This decomposition supports the interpretation that the principal zero-shot difficulty in C4 arises from the combination of adversarial events with obstacle-layout shift, not from adversarial events alone.

4.3.5. Progressive Collision-Free Rate Degradation

Table 11 and Figure 8 present the CF rate matrix across all campaigns. RL-JSO achieves 100% across all conditions, while PSO-based methods degrade sharply as soon as wind is introduced in C2, and standard JSO begins to lose CF rate under C3 hard obstacles.

4.3.6. Representative Trajectory Comparison

A matched-seed trajectory pair from Campaign C4 is presented in Appendix C (Figure A3). Both algorithms achieve zero hard hits on this seed, but the safety margins differ by an order of magnitude. RL-JSO produces layered, well-separated trajectories with $s_{min} = 3.09$ m (comfortably above the 2.95 m stage threshold) and final fitness 601,471. Standard JSO achieves the same zero-hit count only by chance: its $s_{min}$ collapses to 0.10 m (two UAVs pass within ten centimeters of one another), and its final fitness is $5.2\times$ worse at 3,105,219.

4.4. Statistical Analysis

Table 12 summarizes paired Wilcoxon signed-rank tests and Cliff’s delta effect sizes for RL-JSO versus standard JSO, paired by seed and scenario. Holm correction is applied within each metric across the three pairwise comparisons of RL-JSO against {JSO, PSO, RL-PSO}. Apart from the borderline C2 fitness comparison ($p_{Holm} = 0.050$), the reported conclusions do not change under this correction because the uncorrected p-values in the relevant cells are already well below $10^{-3}$.

Three patterns are notable. First, the fitness effect size grows monotonically with campaign difficulty: medium in C1 ($| δ | = 0.354$), medium in C2 ($| δ | = 0.369$), large in C3 ($| δ | = 0.628$), and large in C4 ($| δ | = 0.689$). The C2 fitness comparison is borderline ($p_{Holm} = 0.050$). The Spearman rank correlation between campaign difficulty (C1 through C4, ranked 1–4) and the absolute fitness effect size is $ρ = 1.0$. With $n = 4$ ordered campaigns, this correlation should be interpreted as supportive descriptive evidence of a monotonic ordinal pattern, not as a formal inferential trend test; the sample is too small for meaningful significance testing of $ρ$ itself. Second, the quality metrics (path length, energy, smoothness) become strongly favorable to RL-JSO once wind and harder dynamics are introduced. Third, the separation metric shows the clearest safety scaling, improving from negligible effect in C1 to a large effect by C4 ($| δ | = 0.803$).
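The reported $ρ = 1.0$ follows directly from the four effect sizes being strictly increasing in campaign order. A minimal check, using the tie-free Spearman formula $ρ = 1 - 6 \sum d_{i}^{2} / (n (n^{2} - 1))$ and hypothetical function names:

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[int]:
    """Ranks starting at 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation via the tie-free closed form."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("ties are not handled by this formula")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


difficulty = [1, 2, 3, 4]                    # C1..C4
fitness_delta = [0.354, 0.369, 0.628, 0.689]  # |delta| values quoted above
rho = spearman_rho(difficulty, fitness_delta)
```

Any strictly increasing sequence of four values yields the same $ρ$, which is why the text treats it as descriptive rather than inferential.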

Figure 9 visualizes the effect-size progression reported in Table 12, covering the four primary metrics (fitness, minimum inter-UAV separation, path length, and energy). All four metrics converge toward large effect sizes by Campaign C4, and the $s_{min}$ trajectory grows from negligible at C1 to the largest effect observed in the study ($| δ | = 0.803$) at C4, confirming that the safety advantage of RL-JSO emerges predominantly under adversarial conditions rather than under nominal ones.

4.5. Cooperation, Generalization, and Scalability

4.5.1. Cooperation

Table 13 presents the composite cooperation score A across the four campaigns. Under the proposed metric, RL-JSO exhibits near-invariant behavior: $A \approx 0.74$ (range $= 0.012$) across C1–C4, while every other algorithm degrades by 17–23% from C1 to C4. The advantage gap scales from +1.8% at C1 to +25.9% at C4. The composite cooperation score is a weighted mean of five normalized sub-components (inter-UAV separation variance, path co-planarity, arrival-time synchronization, inter-distance stability, and goal alignment) using fixed mission-informed weights; these weights have not been validated through a formal sensitivity analysis, and the stability finding is therefore specific to this metric formulation.

Figure 10 plots the same data as Table 13 on a single panel to visualize the divergence in stability profiles. RL-JSO traces a nearly horizontal curve, while the three comparator algorithms trace similarly shaped monotonic-decline curves separated primarily by offset.

4.5.2. Out-of-Distribution Generalization

Seven unseen environment configurations (random obstacle layouts, expanded workspace bounds, faster obstacles, extreme wind) were tested with 20 seeds each, for a total of 140 runs per algorithm and 560 runs overall. All algorithms achieved CF < 11% and success < 3%, confirming that these configurations substantially exceed training coverage and represent a fundamental challenge for all evaluated approaches.

Notably, every observed success across the entire generalization experiment (RL-JSO: 3, JSO: 2, RL-PSO: 2, PSO: 1) occurred exclusively in the narrow-corridor scenario, which geometrically admits a narrow feasible passage; in the remaining six scenarios no algorithm achieved any successful run, so the fitness-based “win” counts in those scenarios are best read as tie-breaking among uniformly infeasible solutions rather than as a meaningful generalization signal. With that caveat in mind, no single algorithm dominated, and the observed pattern suggests a marginal safety advantage for RL-JSO concentrated on narrow-corridor geometries rather than broad OOD generalization.

4.5.3. Zero-Shot Scalability

Zero-shot scalability was tested at six swarm sizes ($N \in \{5, 10, 15, 20, 50, 100\}$) with 10 seeds each, for a total of 60 runs per algorithm. Table 14 reports the full comparative results.

At $N = 5$, RL-JSO is the only algorithm to achieve 100% collision-free runs, and it does so with the lowest fitness variability ($\pm 0.2 \times 10^{6}$), indicating stable zero-shot transfer to smaller swarms. PSO achieves the lowest absolute fitness at $N = 5$ but at the cost of zero collision-free runs, confirming that PSO’s fitness advantage stems from exploring infeasible regions. At $N \geq 10$, all algorithms lose collision-free guarantees, and at $N \geq 50$ fitness converges to approximately $10^{8}$, indicating a common saturation regime where the fixed evaluation budget is insufficient for the expanded search space. Notably, JSO-family algorithms (JSO and RL-JSO) consistently produce fewer separation violations than PSO-family algorithms at $N \geq 50$ (106–161 vs. 123–247), suggesting that the JSO drift operator produces more structurally coherent trajectories as swarm size grows. The $N = 10$ configuration remains the operating point where the learned policy’s advantages are most pronounced in the main staged campaigns (Table 7, Table 8, Table 9 and Table 10); the scalability stress test above is a separate zero-shot regime and should not be interpreted as equivalent to the mastered $N = 10$ campaign setting. Empirical runtime scaling fits $O (N^{0.06})$ for JSO, $O (N^{0.09})$ for RL-JSO, $O (N^{0.10})$ for PSO, and $O (N^{0.16})$ for RL-PSO, confirming that the DQN overhead adds negligible computational cost relative to the underlying optimizer.
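Scaling exponents of this kind are typically obtained from a least-squares line in log-log space, where the slope is the exponent $k$ in $t \approx c N^{k}$. The sketch below shows that fit; the function name is hypothetical, the fitting method is an assumption about how the exponents were derived, and the runtimes are synthetic values generated from a known power law, not measurements from the paper.

```python
import math
from typing import Sequence


def fit_power_law_exponent(sizes: Sequence[float], runtimes: Sequence[float]) -> float:
    """Slope of the least-squares line through (log N, log t)."""
    if len(sizes) != len(runtimes) or len(sizes) < 2:
        raise ValueError("need at least two (size, runtime) pairs")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in runtimes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


swarm_sizes = [5, 10, 15, 20, 50, 100]  # swarm sizes from the text
# Synthetic runtimes for illustration only: t = 2.0 * N**0.09.
synthetic_runtimes = [2.0 * n ** 0.09 for n in swarm_sizes]
exponent = fit_power_law_exponent(swarm_sizes, synthetic_runtimes)
```

An exponent near 0.1 means that a twentyfold increase in swarm size raises runtime by roughly 35%, since $20^{0.1} \approx 1.35$.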

4.5.4. Feature Sensitivity of the Learned Policy

To assess which state features most influence the DQN’s phase-selection decisions, gradient-based saliency analysis was performed on the trained checkpoint. For 5000 randomly sampled states, the mean absolute gradient $| \partial Q^{*} / \partial s_{j} |$ was computed for each of the 24 state dimensions. Figure 11 reports the results ranked by importance.
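The saliency statistic can be illustrated without a deep learning framework by substituting central finite differences for automatic differentiation. This is a swapped-in technique for illustration, not the procedure used in the paper, which would normally rely on autograd. The function name is hypothetical, and reading $Q^{*}$ as the scalar output being differentiated (for example the maximum Q-value over actions) is an assumption.

```python
from typing import Callable, Sequence


def mean_abs_saliency(
    q_star: Callable[[Sequence[float]], float],
    states: Sequence[Sequence[float]],
    h: float = 1e-4,
) -> list[float]:
    """Mean over states of |dQ*/ds_j|, estimated by central differences."""
    if not states:
        raise ValueError("need at least one state")
    dim = len(states[0])
    totals = [0.0] * dim
    for s in states:
        for j in range(dim):
            up, down = list(s), list(s)
            up[j] += h
            down[j] -= h
            totals[j] += abs(q_star(up) - q_star(down)) / (2.0 * h)
    return [t / len(states) for t in totals]
```

On a toy linear function such as $Q^{*}(s) = 2 s_{0} - 0.5 s_{1}$, the statistic recovers the absolute coefficients 2.0 and 0.5, which shows that it ranks features by local sensitivity rather than by the sign of their effect.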

Three findings emerge. First, the separation and clearance group exhibits the highest aggregate importance (mean $= 1.21$), with separation violations, safety persistence, and clearance reserve occupying the top three ranks. This confirms that the learned policy primarily attends to inter-UAV safety margins when selecting among JSO phases, consistent with the safety margin preference identified in the ablation study. Second, the pairing of reserve and ratio features is empirically justified: clearance reserve (rank 3, importance 1.84) is $6.8\times$ more important than clearance ratio (rank 19, 0.27), while both separation reserve (rank 14, 0.57) and separation ratio (rank 16, 0.48) contribute meaningfully, confirming that the absolute-margin and scale-invariant features encode complementary information. Third, the adversarial-event flags (jamming, communication loss) have lower direct saliency than the downstream safety indicators, suggesting that the policy responds more strongly to realized safety degradation than to event indicators alone.

4.6. Inference-Time Ablation of the Learned Policy

To probe the role of the learned policy empirically, the trained RL-JSO checkpoint (best_val_dqn.pt) was applied to the same four campaigns (C1–C4, $n = 40$ per cell) under three ablation conditions:

Random: the DQN action stream is replaced by a uniform random policy over the three JSO phases.
Fixed DRIFT: the policy is locked to always select the DRIFT phase (exploitation only).
L1+L2 enabled: layers L1 (warmup) and L2 (stagnation fallback) are re-enabled at evaluation time by setting eval_disable_fallback=False.

The wrapper preserves every other variable (checkpoint weights, evaluation budget, scenarios, obstacle layouts, reward function, and running mean–variance statistics), so any observed difference is attributable to the single intervention. The ablation uses the same evaluation seeds (100–119) as the published baseline campaigns, enabling paired Wilcoxon signed-rank tests with Holm correction and Cliff’s $δ$ with 10,000-resample bootstrap CIs. Results are summarized in Table 15.

Three conclusions follow from the ablation.

(1) Adaptive phase switching is a principal contributor. Fixing the policy to DRIFT degrades fitness by 148–216% with $| δ | \geq 0.78$ (large, $p_{Holm} < 10^{-8}$) and collapses minimum inter-UAV separation by 65–76% across all four campaigns. This is inference-time scoped: the ablation operates on the fixed checkpoint and does not isolate training-time contributions.

(2) The learned policy expresses a safety margin preference. Relative to the learned policy, uniform random switching achieves lower fitness in C2–C4 ($-21\%$ to $-28\%$), but $s_{min}$ drops by 16–18%. Both conditions are collision-free, so the learned policy’s advantage manifests as wider separation margins rather than raw fitness, a distinction that matters under real-world sensor noise and control delays.

(3) Safety layers are training-time safeguards. Re-enabling L1 + L2 at evaluation time does not significantly change fitness but degrades $s_{min}$ by up to 29% in C4 ($p_{Holm} < 10^{-4}$), validating the design choice of eval_disable_fallback=True.

Taken together, these provide controlled paired evidence that adaptive phase switching is a principal contributor to test-time behavior, the learned policy encodes a safety margin preference, and the hierarchical layers are correctly disabled at evaluation time.

5. Discussion

5.1. Interpretation and Mechanisms

RL-JSO’s benefit is not a uniform shift in nominal performance but a change in failure profile under increasing stress. Under C1, all methods remain collision-free and the RL-JSO advantage is moderate; under C2–C4, JSO-based methods separate sharply from PSO-based methods in safety, and RL-JSO further separates from standard JSO in efficiency and separation margins. The learned phase controller modulates when exploration, balanced motion, and drift-style exploitation should dominate under changing safety pressure, rather than replacing the global-search structure.

Within PSO-based methods, frequent infeasible trajectories under C2–C4 suggest that independent velocity updates provide a weaker safety prior than JSO’s coherent population-level drift operator. The learned controller amplifies this structural advantage: the Cliff’s delta over standard JSO grows precisely in the regimes where static time-control is least adequate.

5.2. Positioning Within the Literature

The inference-time ablation addresses the central limitation of prior RL–SI hybrids [5,9,10], whose tabular RL cannot represent the joint distribution over optimization state, safety state, and environmental conditions. The 24-dimensional continuous state vector represents a substantial capacity increase over 4–32 cell Q-tables, and the monotonically increasing effect size across campaigns (Table 12) supports the interpretation that this capacity matters more as the observation distribution grows in complexity.

The RL-guided SI direction contrasts with SI-enhanced RL approaches such as PSO-M3DDPG [11] and GenAI-GRL [12]. In RL-guided SI, the population-based search structure remains intact at deployment, while in SI-enhanced RL the final policy is a neural network. The present ablation provides suggestive (not causal) evidence favoring the RL-guided direction within this formulation; a matched-reward, matched-budget pure-RL baseline is important future work.

The double-layer DRL framework of Yan et al. [38] uses neural networks at both the inner and the outer layer. In contrast, RL-JSO employs JSO as the inner optimization layer rather than a second neural network, which preserves the population-based search diversity inherent to SI while adding learned adaptivity at the meta-level. The quality–robustness tradeoff observed between RL-JSO and RL-PSO—where JSO’s coherent drift operator produces structurally smoother trajectories than PSO’s independent velocity updates—is consistent with this design choice.

Positioning relative to end-to-end multi-agent RL. Methods such as MAPPO [25,26] directly learn decentralized policies that output continuous trajectory adjustments, representing a fundamentally different computational paradigm from the meta-controller approach adopted here, where the RL agent selects which optimization strategy the swarm should follow rather than generating trajectories directly. A direct comparison would conflate algorithmic merit with problem formulation: MAPPO operates in a Dec-POMDP with per-agent continuous actions, while RL-JSO selects among three discrete optimizer phases for a centralized population. Moreover, training a competitive MAPPO baseline for the specific 480-dimensional problem with all four adversarial stressors would require substantial engineering effort and hyperparameter tuning, risking an unfair under-tuned baseline. While RL-JSO demonstrates strong performance within the metaheuristic family, its competitiveness against end-to-end MARL planners on identical benchmarks remains an open question and a valuable direction for future work.

5.3. Limitations and Scope

Several limitations constrain the scope and generalizability of the findings.

Simulation fidelity. All experiments use simulation with spherical obstacles, a simplified AR(1) wind model, and kinematic-level UAV dynamics. No hardware-in-the-loop or real-world flight tests were conducted.
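
To make the "simplified AR(1) wind model" concrete, the sketch below generates a first-order autoregressive wind sequence per axis. The coefficient `phi`, the noise scale `sigma`, and the zero initial wind are illustrative assumptions; the paper's actual parameter values are not restated in this section.

```python
import random
from typing import List, Tuple


def ar1_wind(steps: int, phi: float = 0.9, sigma: float = 0.5,
             seed: int = 0) -> List[Tuple[float, float, float]]:
    """Simulate w_t = phi * w_{t-1} + eps_t independently on each axis.

    phi and sigma are hypothetical values chosen for illustration only.
    """
    if not -1.0 < phi < 1.0:
        raise ValueError("|phi| must be below 1 for a stationary AR(1) process")
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]  # assumed calm initial condition
    history = []
    for _ in range(steps):
        # Each component decays toward zero and receives fresh Gaussian noise.
        w = [phi * c + rng.gauss(0.0, sigma) for c in w]
        history.append((w[0], w[1], w[2]))
    return history


if __name__ == "__main__":
    for vec in ar1_wind(3):
        print(vec)
```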

Absence of formal convergence analysis. The work is purely empirical. The DQN introduces a state-dependent, non-stationary operator selection mechanism that invalidates standard metaheuristic convergence proofs. Empirical convergence evidence is provided through training curves and mastery gates, but a formal treatment remains an open direction shared by all existing RL–metaheuristic hybrids.

Single locked training campaign. The reported results are based on a single final training campaign. However, this campaign is the product of an iterative development process spanning 12 preliminary runs (5 early-architecture, 7 refined-architecture) in which the reward function, safety hierarchy, curriculum schedule, and network architecture were progressively refined based on observed training dynamics. The final locked configuration was then independently reproduced with a different seed, yielding structurally consistent behavior. Evaluation uses 160 independent runs with paired seeds across four campaigns. A systematic multi-seed replication of the final configuration (∼380 GPU-hours) remains future work.

Incomplete curriculum mastery. RL-JSO terminated at stage S6 and RL-PSO at S9. Campaign C3 results reflect partial transfer; Campaign C4 is strict zero-shot.

Inference-time-only ablation. The ablation isolates the contribution of phase switching within the trained checkpoint but does not separate training-time components (curriculum, reward shaping, DQN architecture).

Narrow baselines and centralized architecture. Comparisons are restricted to the JSO/PSO family; no sampling-based (RRT*), optimization-based (MPC), or multi-agent RL (MAPPO) methods are included. The centralized optimizer is not directly deployable in decentralized real-time settings.

Cooperation metric and OOD generalization. The composite cooperation score uses unvalidated fixed weights. All algorithms achieved CF < 11% in OOD experiments, so campaign robustness (C1–C4) should not be conflated with broad OOD generalization.

6. Conclusions

This paper presented RL-JSO, a hybrid framework in which a dueling double DQN governs JSO phase selection for cooperative multi-UAV path planning under adversarial conditions, constrained by hierarchical safety overrides and trained through a mastery-gated curriculum with a shared fair-comparison reward.

Four principal findings emerged. (1) RL-JSO was the only method to maintain 100% collision-free rates across all campaigns, with Cliff’s delta growing monotonically from medium ($0.354$) to large ($0.689$) over standard JSO. (2) RL-JSO’s composite cooperation score remained nearly constant ($A \approx 0.74$) while comparators degraded by 17–23%. (3) Supplementary analyses confirmed weak OOD generalization for all methods and a common scalability saturation regime. (4) Paired inference-time ablation showed that adaptive phase switching is a principal contributor ($| δ | \geq 0.78$, $p_{Holm} < 10^{-8}$), the learned policy encodes a safety margin preference, and heuristic fallback layers are correctly disabled at inference.

These findings support a deliberately scoped main claim: learned phase selection can improve the robustness of JSO within the evaluated benchmark family, under a fairness-controlled experimental design, and within the scope boundaries documented in Section 5.3.

Hypothesis evaluation. Returning to the three hypotheses stated in Section 1: H1 is supported by the monotonically growing Cliff’s delta across campaigns (medium → large, $ρ_{Spearman} = 1.0$); H2 is supported by RL-JSO’s unique 100% collision-free rate across all four progressive campaigns, including C3 where standard JSO drops to 97.5% CF and C2–C4 where PSO-family methods degrade sharply; and H3 is supported by the inference-time ablation showing catastrophic degradation under fixed-DRIFT ($| δ | \geq 0.78$, $p_{Holm} < 10^{-8}$).
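
The subscript in $p_{Holm}$ refers to Holm's step-down correction for multiple comparisons. As a minimal sketch, assuming the standard Holm-Bonferroni procedure (the authors' statistical scripts are not shown here), adjusted p-values can be computed as follows; the function name is hypothetical.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Visit p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by (m - k + 1).
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values, not results from the paper.
    print(holm_adjust([0.01, 0.04, 0.03]))
```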

Future work should prioritize seven directions. (1) A matched-reward, matched-budget pure-RL baseline is needed to formally isolate the contribution of the population-based backbone from that of the learned policy. (2) Extending the training protocol to full multi-seed studies with confidence-interval quantification would strengthen the reproducibility of the reported effect sizes. (3) Broader baseline comparisons against sampling-based planners (RRT*), optimization-based methods (MPC), and multi-agent RL frameworks (MAPPO) would position the framework within the broader multi-UAV planning landscape; such cross-paradigm comparisons must carefully control for training compute, hyperparameter tuning budget, and problem representation to ensure fairness. (4) Higher-fidelity simulation, hardware-in-the-loop validation, and real-world flight tests are needed before practical deployment. (5) Decentralized extensions and sensitivity analyses for the cooperation metric weights are natural next steps. (6) Extending the action space to support per-UAV or per-subswarm phase selection could enable heterogeneous strategies, though this raises challenges in action-space dimensionality ($3^{N}$ joint actions) and credit assignment. (7) Formal convergence and stability analysis of the RL–metaheuristic hybrid, potentially through the lens of switching dynamical systems or regret-bounded online learning, would place the empirical findings on stronger theoretical foundations.
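
To give a sense of scale for direction (6), the joint action count can be worked out for the ten-UAV setting used in the evaluation figures, assuming each UAV independently selects one of the three JSO phases:

```latex
|\mathcal{A}_{\mathrm{joint}}| = 3^{N}, \qquad N = 10 \;\Rightarrow\; 3^{10} = 59{,}049
\quad \text{versus} \quad |\mathcal{A}| = 3 \ \text{for a single shared phase.}
```

A single DQN output head over that many joint actions would be impractical, which is one reason a factored or per-subswarm formulation would likely be needed.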

Author Contributions

Conceptualization, N.A. and W.B.; methodology, N.A. and W.B.; software, N.A.; validation, N.A. and W.B.; formal analysis, N.A.; investigation, N.A. and W.B.; resources, W.B.; data curation, N.A.; writing—original draft preparation, N.A.; writing—review and editing, N.A. and W.B.; visualization, N.A.; supervision, W.B.; project administration, W.B.; funding acquisition, W.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2604).

Institutional Review Board Statement

Not applicable. This study did not involve human participants or animal subjects; all experiments were conducted entirely in a software simulation environment.

Informed Consent Statement

Not applicable. This study did not involve human participants.

Data Availability Statement

The source code, trained checkpoints, raw evaluation data, statistical analysis scripts, and a step-by-step reproduction guide are available from the corresponding author on reasonable request. A public release is planned upon acceptance of this manuscript. For clarity regarding version mapping: the reward function designated “v3.2” throughout this paper corresponds to the final reward revision in the accompanying code base, internally tagged “v8.4”. The two labels refer to the identical reward design; the difference is solely historical—paper revision numbering restarted at v1.0 for expository clarity, whereas the code retains its cumulative internal revision tag. Any future public repository release will carry this mapping explicitly in its README.

Acknowledgments

The authors acknowledge the use of AI-assisted tools, specifically large language model assistants, during the preparation of this manuscript for the purposes of code development acceleration and manuscript drafting efficiency. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

ADMM	Alternating direction method of multipliers
AR(1)	First-order autoregressive process
CF	Collision-free rate
DDPG	Deep deterministic policy gradient
DQN	Deep Q-Network
DRL	Deep reinforcement learning
GPS	Global positioning system
JSO	Jellyfish Search Optimizer
MADDPG	Multi-agent DDPG
MAPPO	Multi-agent proximal policy optimization
MDP	Markov decision process
MPC	Model predictive control
PER	Prioritized experience replay
PPO	Proximal policy optimization
PSO	Particle swarm optimization
RL	Reinforcement learning
RL-JSO	Proposed hybrid framework
RL-PSO	Fair RL-guided PSO counterpart
RRT*	Rapidly exploring random tree (asymptotically optimal)
SI	Swarm intelligence
UAV	Unmanned aerial vehicle

Appendix A. Dueling Double DQN Architecture

Figure A1 illustrates the full network architecture used as the decision layer in both RL-JSO and RL-PSO. The 24-dimensional state vector $s_{t}$ is first normalized through a running mean–variance estimator maintained with double-precision arithmetic, then processed by a shared feedforward trunk of two hidden layers (256 units each) with Layer Normalization, ReLU activations, and dropout. The output is then split into the value stream $V (s)$ (128 units → 1 output) and the advantage stream $A (s, a)$ (128 units → $| A |$ outputs) and recombined as $Q (s, a) = V (s) + A (s, a) - \bar{A} (s)$. The output dimensionality differs between the two algorithms ($| A_{JSO} | = 3$, $| A_{PSO} | = 5$); all other components are identical.
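
The two non-neural pieces of this pipeline, the running mean–variance normalizer and the dueling recombination, can be sketched in plain Python. This is an illustrative sketch only: it assumes $\bar{A}(s)$ denotes the mean advantage over actions (the standard dueling aggregation), uses Welford's update for the running statistics, and the class and function names are hypothetical rather than taken from the authors' code.

```python
import math
from typing import List, Sequence


class RunningNorm:
    """Per-feature running mean/variance via Welford's algorithm.

    Python floats are IEEE double precision, matching the text's
    double-precision requirement.
    """

    def __init__(self, dim: int, eps: float = 1e-8) -> None:
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations
        self.eps = eps

    def update(self, s: Sequence[float]) -> None:
        self.n += 1
        for j, x in enumerate(s):
            d = x - self.mean[j]
            self.mean[j] += d / self.n
            self.m2[j] += d * (x - self.mean[j])

    def normalize(self, s: Sequence[float]) -> List[float]:
        if self.n < 2:  # not enough data for a variance estimate
            return list(s)
        return [(x - m) / math.sqrt(v / self.n + self.eps)
                for x, m, v in zip(s, self.mean, self.m2)]


def dueling_q(value: float, advantages: Sequence[float]) -> List[float]:
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


if __name__ == "__main__":
    # Three actions, as in the RL-JSO head; the numbers are made up.
    print(dueling_q(1.0, [0.5, -0.5, 0.0]))
```

Subtracting the mean advantage makes the decomposition identifiable: the average of the returned Q-values equals the value-stream output.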

Figure A1. Dueling double DQN architecture. The 24-dimensional state vector is normalized via running mean–variance, processed through shared hidden layers (256 units, LayerNorm, ReLU, dropout), then split into value and advantage streams combined as $Q (s, a) = V (s) + A (s, a) - \bar{A} (s)$. The output dimensionality differs between the two algorithms: RL-JSO uses $| A | = 3$ (drift, passive, active) while RL-PSO uses $| A | = 5$ (PSO parameter presets); all other components are identical.

Appendix B. Reward Function Evolution Timeline

Figure A2 summarizes the five-revision history of the shared reward function described in Section 3.4. Each revision addressed a specific failure mode observed in the training logs of the preceding version: v1.0 suffered from unfair comparison between the RL-augmented optimizers; v2.0 unified the reward across algorithms but produced a weak safety signal near the separation threshold; v3.0 introduced safety shaping but left gray-zone violations uncorrected; v3.1 added absolute penalty terms but decoupled the reward signal from stage promotion rates; and v3.2 introduced the $\sqrt{\cdot}$ deficit curve (Equation (10)) used in all experiments reported in this paper.

Figure A2. Reward function evolution across five revisions. Each revision addressed a specific failure mode observed in the preceding version; v3.2 is the final revision used throughout the reported experiments. The version-numbering convention used in the paper (v1.0–v3.2) restarts at v1.0 for expository clarity; the accompanying code base uses a cumulative internal revision tag (v8.4) for the same final design.

Appendix C. Representative Trajectory Comparison Under Campaign C4

Figure A3 shows a matched-seed, matched-scenario trajectory pair drawn from Campaign C4 (full adversarial). The two panels plot the 10-UAV trajectories produced on evaluation seed 10008 in scenario S2_shifted. This example is representative of the per-seed quality gap summarized numerically in Section 4.3: both algorithms happen to achieve zero hard hits on this seed, but the minimum inter-UAV separation differs by more than an order of magnitude.

Figure A3. Representative 10-UAV trajectories under Campaign C4 on matched evaluation seed 10008 (scenario S2_shifted). (a) RL-JSO produces layered, well-separated trajectories ($s_{min} = 3.09$ m, final fitness 601,471). (b) Standard JSO achieves zero hard hits only by chance: $s_{min}$ collapses to $0.10$ m and final fitness is $5.2 \times$ worse (3,105,219).

Appendix D. Per-Seed Stability in S2_shifted Under Campaign C4

Figure A4 plots the per-seed fitness values for Campaign C4 in scenario S2_shifted, the scenario that drives most of the aggregate C4 gap. RL-JSO remains within a relatively narrow band across all 20 seeds, whereas standard JSO exhibits extreme seedwise variance and several high-fitness outliers above $3 \times 10^{6}$. The figure should be interpreted as a qualitative stability aid, not as a replacement for the paired statistical analysis reported in Section 4.4.

Figure A4. Per-seed fitness values in Campaign C4 for scenario S2_shifted. RL-JSO remains comparatively stable across all 20 evaluation seeds, whereas standard JSO exhibits substantial seedwise variance with several large outliers.

References

1. Yu, Z.; Si, Z.; Li, X.; Wang, D.; Song, H. A novel hybrid particle swarm optimization algorithm for path planning of UAVs. IEEE Internet Things J. 2022, 9, 22547–22558.
2. Meng, W.; Zhang, X.; Zhou, L.; Guo, H.; Hu, X. Advances in UAV path planning: A comprehensive review of methods, challenges, and future directions. Drones 2025, 9, 376.
3. Meng, K.; Chen, C.; Wu, T.; Xin, B.; Liang, M.; Deng, F. Evolutionary state estimation-based multi-strategy jellyfish search algorithm for multi-UAV cooperative path planning. IEEE Trans. Intell. Veh. 2025, 10, 2490–2507.
4. Kennedy, J.; Eberhart, R. Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks (ICNN), Perth, Australia, 27 November–1 December 1995; pp. 1942–1948.
5. Wang, Y.; Liu, J.; Qian, Y.; Yi, W. Path planning for multi-UAV in a complex environment based on reinforcement-learning-driven continuous ant colony optimization. Drones 2025, 9, 638.
6. Chou, J.-S.; Truong, D.-N. A novel metaheuristic optimizer inspired by behavior of jellyfish in ocean. Appl. Math. Comput. 2021, 389, 125535.
7. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction, 2nd ed.; MIT Press: Cambridge, MA, USA, 2018.
8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
9. Zhang, F.; Chen, Z. A novel reinforcement learning-based particle swarm optimization algorithm for better symmetry between convergence speed and diversity. Symmetry 2024, 16, 1290.
10. Kappagantula, S.; Sangubotla, R.; Varenya, V.V.S.; Gupta, S.; Arigela, S.V.; Moorthy, R.S.; D’Souza, J.M.; Bonthagorla, P.K. DPSO-Q: A reinforcement learning-enhanced swarm algorithm for solving the traveling salesman problem. Int. J. Intell. Syst. 2025, 2025, 8918171.
11. Zhang, Y.; Ding, M.; Zhang, J.; Yang, Q.; Shi, G.; Lu, M.; Jiang, F. Multi-UAV pursuit-evasion gaming based on PSO-M3DDPG schemes. Complex Intell. Syst. 2024, 10, 6867–6883.
12. Hazarika, B.; Singh, P.; Singh, K.; Cotton, S.L.; Shin, H.; Dobre, O.A.; Duong, T.Q. Generative AI-augmented graph reinforcement learning for adaptive UAV swarm optimization. IEEE Internet Things J. 2025, 12, 9508–9524.
13. Li, W.; Xiong, Y.; Xiong, Q. Reinforcement learning-guided particle swarm optimization for multi-objective unmanned aerial vehicle path planning. Symmetry 2025, 17, 1292.
14. Nayyef, H.M.; Ibrahim, A.A.; Mohd Zainuri, M.A.A.; Zulkifley, M.A.; Shareef, H. A novel hybrid algorithm based on jellyfish search and particle swarm optimization. Mathematics 2023, 11, 3210.
15. Wang, Q.; Yi, W. Composite improved algorithm based on jellyfish, particle swarm and genetics for UAV path planning in complex urban terrain. Sensors 2024, 24, 7679.
16. Bengio, Y.; Louradour, J.; Collobert, R.; Weston, J. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), Montreal, QC, Canada, 14–18 June 2009; pp. 41–48.
17. Narvekar, S.; Peng, B.; Leonetti, M.; Sinapov, J.; Taylor, M.E.; Stone, P. Curriculum learning for reinforcement learning domains: A framework and survey. J. Mach. Learn. Res. 2020, 21, 1–50.
18. van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2094–2100.
19. Wang, Z.; Schaul, T.; Hessel, M.; van Hasselt, H.; Lanctot, M.; de Freitas, N. Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 19–24 June 2016; pp. 1995–2003.
20. Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
21. Wang, W.; Liu, J.; Liu, Y. Research on UAV path planning algorithm based on multi-objective jellyfish search with adaptive RRT initialization. Sci. Rep. 2024, 14, 29927.
22. Zeng, R.; Luo, R.; Liu, B. UAV path planning for forest firefighting using optimized multi-objective jellyfish search algorithm. Mathematics 2025, 13, 2745.
23. Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101.
24. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2–4 May 2016.
25. Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6379–6390.
26. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
27. Jing, Y.; Li, W. RL-QPSO Net: Deep reinforcement learning enhanced QPSO for efficient mobile robot path planning. Front. Neurorobot. 2024, 18, 1403770.
28. Wang, Y.; Wu, H.; Xiong, X.; Chen, J. Cooperative UAV swarm localization algorithm based on probabilistic data association for GNSS-denied environments. IEEE Sens. J. 2022, 22, 22117–22128.
29. Feng, R.; Liu, S.; Huang, W.; Han, T.; Yan, B.; Wang, Z.; Niu, Y. Bridging game theory and multi-agent systems: Development status and future prospects. Prog. Aerosp. Sci. 2026, 161, 101183.
30. Liu, H.; Long, X.; Li, Y.; Yan, J.; Li, M.; Chen, C.; Gu, F.; Pu, H.; Luo, J. Adaptive multi-UAV cooperative path planning based on novel rotation artificial potential fields. Knowl. Based Syst. 2025, 317, 113429.
31. Zhang, M.; Li, N.; Chen, Y. Review of research on cooperative path planning algorithms for AUV clusters. Drones 2025, 9, 790.
32. Pan, J.; Li, Y.; Chai, R.; Xia, S.; Zuo, L. Distributed multi-UAV 3D trajectory planning based on multi-agent deep reinforcement learning. IEEE Trans. Cogn. Commun. Netw. 2026, 12, 4577–4592.
33. Wang, J.; Yu, Z.; Zhou, D.; Shi, J.; Deng, R. Vision-based deep reinforcement learning of unmanned aerial vehicle (UAV) autonomous navigation using privileged information. Drones 2024, 8, 782.
34. Derrac, J.; García, S.; Molina, D.; Herrera, F. A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms. Swarm Evol. Comput. 2011, 1, 3–18.
35. Cliff, N. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 1993, 114, 494–509.
36. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450.
37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
38. Yan, X.; Yu, G.; Huang, G.; Zhou, R.; Hao, L. Design of swarm intelligence control based on double-layer deep reinforcement learning. Appl. Sci. 2025, 15, 4337.

Figure 1. RL-JSO framework overview. The RL decision layer (dueling DQN with PER) selects a JSO phase from a 24-dimensional state vector; the optimization layer updates the population using the selected phase; the simulation environment evaluates the resulting trajectories under dynamic obstacles, AR(1) wind, and adversarial events; and the mastery-gated curriculum advances the stage configuration based on validation-time competence.

Figure 2. Representative 3D simulation environment used throughout the evaluation campaigns. The $100 \times 100 \times 50$ m workspace contains ten UAVs represented by colored trajectory curves with ground-plane projections. The trajectories connect a shared start zone, shown as the blue translucent cuboid on the left, to a shared goal zone, shown as the green translucent cuboid on the right. Yellow/gold spheres denote dynamic obstacles at their current positions, while dashed orange curves indicate their predicted obstacle paths. Stage-dependent adversarial factors are illustrated schematically: AR(1) wind as blue arrows above the workspace, GPS jamming as the purple cone, and a communication-loss event as the red cross. At runtime, these factors are stochastic and time-varying rather than static as depicted.

Figure 3. Hierarchical RL control with five designed safety override layers. In the reported experiments, L1 (warmup), L2 (stagnation fallback), and L5 (DQN decision) are active, whereas L3 and L4 are retained as pass-through placeholders. At evaluation time, eval_disable_fallback=True bypasses L1 and L2 so the mature policy governs every iteration directly.
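
The layer ordering described in the Figure 3 caption can be sketched as a small dispatch function. Everything beyond the layer names and the `eval_disable_fallback` flag is an assumption made for illustration: the warmup length, the stagnation limit, and the phases forced by L1 and L2 are hypothetical placeholders, not values from the paper.

```python
from enum import Enum
from typing import Tuple


class Phase(Enum):
    DRIFT = 0
    PASSIVE = 1
    ACTIVE = 2


def select_phase(iteration: int, stagnation_count: int, dqn_choice: Phase, *,
                 warmup_iters: int = 10, stagnation_limit: int = 15,
                 eval_disable_fallback: bool = False) -> Tuple[Phase, str]:
    """Return the phase to apply and the layer that decided it."""
    if not eval_disable_fallback:
        # L1 (warmup): force a fixed phase early on (DRIFT is a placeholder).
        if iteration < warmup_iters:
            return Phase.DRIFT, "L1"
        # L2 (stagnation fallback): override when progress has stalled
        # (ACTIVE is a placeholder choice).
        if stagnation_count >= stagnation_limit:
            return Phase.ACTIVE, "L2"
    # L3 and L4 are pass-through placeholders, so nothing happens here.
    # L5: the DQN decision is used as-is.
    return dqn_choice, "L5"


if __name__ == "__main__":
    print(select_phase(0, 0, Phase.PASSIVE))
    print(select_phase(0, 0, Phase.PASSIVE, eval_disable_fallback=True))
```

With the flag set, every iteration falls through to L5, which matches the caption's statement that the mature policy governs every iteration at evaluation time.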

Figure 4. Mastery-gated curriculum design. (Upper): nine progressive stages arranged in two tiers. The first tier (S1–S5, green) introduces one new difficulty factor per stage (dynamic obstacles, denser layouts, faster dynamics, and wind), while the second tier (S6–S9, orange) begins with a controlled compound increase and adds GPS jamming and communication loss. Key stage parameters are shown in each box; full numerical values are given in Table 4. (Lower): mastery-gate decision flow. After validation evaluation, the learner is promoted (gate passed), enters a recovery cycle with an $ε$-bump (gate failed), or training is terminated after three consecutive failures at the same stage.
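
The decision flow in the lower panel of Figure 4 can be sketched as a small state machine. The three-failure termination rule and the nine stages come from the caption; the size of the exploration bump, the starting epsilon, and the choice to stop after passing the last stage are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class GateState:
    stage: int = 1                 # current curriculum stage (S1..S9)
    consecutive_failures: int = 0  # failures at the current stage
    epsilon: float = 0.05          # exploration rate (hypothetical start)
    terminated: bool = False


def apply_gate(state: GateState, gate_passed: bool, *, last_stage: int = 9,
               max_failures: int = 3, eps_bump: float = 0.1,
               eps_cap: float = 1.0) -> GateState:
    """Update the curriculum state after one validation evaluation."""
    if state.terminated:
        return state
    if gate_passed:
        state.consecutive_failures = 0
        if state.stage < last_stage:
            state.stage += 1          # promotion to the next stage
        else:
            state.terminated = True   # assumed: curriculum is complete
        return state
    state.consecutive_failures += 1
    if state.consecutive_failures >= max_failures:
        state.terminated = True       # three consecutive failures
    else:
        # Recovery cycle: raise exploration before retrying the stage.
        state.epsilon = min(eps_cap, state.epsilon + eps_bump)
    return state


if __name__ == "__main__":
    s = GateState()
    for outcome in (True, False, False, False):
        apply_gate(s, outcome)
    print(s)
```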

Figure 5. Fair comparison protocol. All RL-augmented algorithms share an identical foundation. The only differences are the algorithm-specific action heads and the optimizer backbones.

Figure 6. C4 (full adversarial) fitness distribution across 40 runs per algorithm (20 seeds × 2 scenarios). Hollow diamonds denote means; black bars denote medians. Hatched boxes mark algorithms with $CF < 100\%$: their reported fitness includes infeasible solutions and is not directly comparable to the collision-free algorithms. RL-JSO’s distribution lies entirely below standard JSO’s interquartile range.

Figure 7. Scenario-level decomposition of Campaign C4. (Left): mean fitness in the two evaluation scenarios. (Right): collision-free rate by scenario. The dominant gap occurs in S2_shifted, where RL-JSO preserves 100% collision-free performance and reduces mean fitness by 81% relative to standard JSO; in S1_default, both JSO-based methods remain collision-free and the fitness gap is materially smaller.

Figure 8. Collision-free rate (%) across progressive difficulty Campaigns C1–C4 for the four evaluated algorithms. The grouped bars show that RL-JSO maintains $CF = 100\%$ across all conditions, whereas the PSO-based methods collapse abruptly once wind is introduced (C2 onward) and standard JSO drops to $97.5\%$ under the hard-obstacle condition (C3).

Figure 9. Effect size scaling of RL-JSO over standard JSO across Campaigns C1–C4. Cliff’s $| δ |$ magnitudes for four metrics are plotted against campaign difficulty; the shaded horizontal bands mark the conventional effect-size categories (negligible, small, medium, large). All four metrics trend upward with difficulty, and $s_{min}$ scales fastest.

Figure 10. Composite cooperation score $A \in [0, 1]$ across Campaigns C1–C4 for all four algorithms (higher is better). Right-side annotations report the C1→C4 relative drop per algorithm. RL-JSO exhibits a near-horizontal profile ($-1.6\%$), while the three comparator algorithms degrade by 17–23% across the same campaign range.

Figure 11. Gradient-based feature importance of the trained DQN policy. Each bar shows the mean $| \partial Q^{*} / \partial s_{j} |$ over 5000 sampled states, colored by feature group. The separation and clearance group dominates, confirming that the policy primarily attends to inter-UAV safety margins.
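
The importance measure in the Figure 11 caption is a mean absolute sensitivity of the Q-value to each state feature. The sketch below approximates it with central finite differences on a toy function rather than automatic differentiation through a trained network, so it is a stand-in for the technique and not the authors' procedure. The uniform state sampling, the step size `h`, and the toy `q_max` are all assumptions.

```python
import random
from typing import Callable, List, Sequence


def feature_importance(q_max: Callable[[Sequence[float]], float], dim: int,
                       n_samples: int = 5000, h: float = 1e-4,
                       seed: int = 0) -> List[float]:
    """Estimate mean |dQ/ds_j| per feature by central differences."""
    rng = random.Random(seed)
    totals = [0.0] * dim
    for _ in range(n_samples):
        # Assumed sampling scheme: features drawn uniformly from [0, 1).
        s = [rng.random() for _ in range(dim)]
        for j in range(dim):
            up = s.copy()
            up[j] += h
            down = s.copy()
            down[j] -= h
            totals[j] += abs(q_max(up) - q_max(down)) / (2.0 * h)
    return [t / n_samples for t in totals]


if __name__ == "__main__":
    # Toy stand-in for max_a Q(s, a): depends on features 0 and 1 only.
    def toy_q(s: Sequence[float]) -> float:
        return 2.0 * s[0] - 0.5 * s[1]

    print(feature_importance(toy_q, dim=3, n_samples=100))
```

For a linear toy function the estimate recovers the absolute coefficients, which is a quick sanity check before applying the same idea to a real network.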

Table 1. Comparative summary of related frameworks (2022–2025). “P” indicates partial adversarial modeling; ✓ indicates that the feature is supported; and “–” indicates that it is not reported or not applicable.

Ref.	Year	Framework/Key Contribution	MUAV	Dyn.	Adv.	Fair	Curr.
[14]	2023	HJSPSO—JSO + PSO hybrid (deterministic)	–	–	–	–	–
[21]	2024	UMOJS—multi-obj. JSO + RRT init	–	–	–	–	–
[22]	2025	PVDE-MOJS—parallel JSO + DE	✓	✓	–	–	–
[3]	2024	ESE-MSJS—state-aware rule-based switching	✓	✓	–	–	–
[15]	2024	JSO-PSO-GA—static multi-SI fusion	–	–	–	–	–
[9]	2024	RLPSO—tabular Q-guided particle learning	–	–	–	–	–
[5]	2025	QMSR-ACOR—Q-learning, 32-cell table	✓	–	–	–	–
[10]	2025	DPSO-Q—Q-table, 27 cells	–	–	–	–	–
[13]	2025	QL-MOPSO—hierarchical RL-to-PSO	–	–	–	–	–
[11]	2024	PSO-M3DDPG—PSO enhances MARL samples	✓	✓	P	–	–
[27]	2024	RL-QPSO Net—DRL + quantum PSO	–	✓	–	–	–
[12]	2025	GenAI-GRL—GenAI + graph RL	✓	✓	P	–	–
This work	2026	RL-JSO—deep RL-guided JSO phase control	✓	✓	✓	✓	✓

Table 2. 24-dimensional RL state vector.

Index	Feature	Description/Normalization
0	Obstacle clearance	Mean clearance/world diagonal
1	Swarm dispersion	Mean pairwise distance/diagonal
2	Goal distance	Mean distance-to-goal/diagonal
3	Time progress	$1 - t / T$
4	Hard-hit pressure	Hard hits/20, clipped
5	Near-hit pressure	Near hits/50, clipped
6	Separation ratio	$(s_{min} / d_{min}) / 2$
7	Clearance ratio	$(c_{min} / margin + 1) / 3$
8	Log fitness	$log (1 + J / 10^{4}) / 20$
9	Improvement signal	Recent log-improvement
10	Fitness velocity	Relative change
11	Population diversity	Coefficient of variation of fitness/2
12	Energy ratio	Turning energy/path length
13	Mean turn angle	$\bar{α} / π$
14	Obstacle density	Fraction within influence radius
15	Boundary margin	Min distance to boundary, normalized
16	Safety trend	Weighted $Δ$ (hits + separation violations)
17	Separation reserve	Signed: $> 0.5$ means safe margin
18	Clearance reserve	Signed: $> 0.5$ means safe margin
19	Safety persistence	EMA of threshold violations
20	Separation violations	Distributed count/50
21	Wind magnitude	$∥ w ∥ / 5$ , clipped
22	Jamming flag	Binary
23	Comm-loss flag	Binary
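
Several entries of Table 2 are given in closed form, so their normalization can be sketched directly. The snippet below is a minimal illustration covering only indices 3 to 8 and 21 to 23; the function name, the argument names, the choice of $[0, 1]$ as the clipping range, and the use of the natural logarithm are assumptions that the table does not state.

```python
import math

def clip01(x):
    """Clip to [0, 1]; the clipping range is an assumption."""
    return max(0.0, min(1.0, x))

def closed_form_features(t, T, hard_hits, near_hits, s_min, d_min,
                         c_min, margin, J, wind, jammed, comm_lost):
    """Return the Table 2 entries that have explicit formulas, keyed by index."""
    wind_norm = math.sqrt(sum(w * w for w in wind))
    return {
        3: 1.0 - t / T,                       # time progress
        4: clip01(hard_hits / 20.0),          # hard-hit pressure
        5: clip01(near_hits / 50.0),          # near-hit pressure
        6: (s_min / d_min) / 2.0,             # separation ratio
        7: (c_min / margin + 1.0) / 3.0,      # clearance ratio
        8: math.log(1.0 + J / 1e4) / 20.0,    # log fitness (natural log assumed)
        21: clip01(wind_norm / 5.0),          # wind magnitude
        22: 1.0 if jammed else 0.0,           # jamming flag
        23: 1.0 if comm_lost else 0.0,        # comm-loss flag
    }

features = closed_form_features(
    t=75, T=150, hard_hits=0, near_hits=5, s_min=2.8, d_min=2.8,
    c_min=2.0, margin=2.0, J=616437.0, wind=(3.0, 4.0, 0.0),
    jammed=False, comm_lost=False,
)
print(features)
```

Under this reading, a swarm sitting exactly at the minimum separation ($s_{min} = d_{min}$) maps to a separation ratio of 0.5.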

Table 3. Reward function v3.2 component weights, shared between RL-JSO and RL-PSO. The five primary components form a convex combination summing to unity; the per-step time penalty is a small fixed negative signal applied outside the convex sum.

Component	Weight
Fitness improvement	0.30
Inter-UAV separation margin	0.20
Obstacle clearance margin	0.25
Trajectory smoothness	0.15
Cooperation (swarm coordination)	0.10
Per-step time penalty	—
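
The structure described in the Table 3 caption, a convex combination of five components plus a fixed time penalty outside the sum, can be sketched as follows. The weights are taken from the table; the dictionary keys, the assumption that each component signal is already normalized, and the value of `TIME_PENALTY` are hypothetical, since the table does not report the penalty magnitude.

```python
# Weights from Table 3; key names are illustrative.
WEIGHTS = {
    "fitness_improvement": 0.30,
    "separation_margin": 0.20,
    "clearance_margin": 0.25,
    "smoothness": 0.15,
    "cooperation": 0.10,
}

# Hypothetical placeholder: the per-step penalty value is not given in Table 3.
TIME_PENALTY = 0.01

def step_reward(components, time_penalty=TIME_PENALTY):
    """Weighted sum of the five components minus a fixed per-step penalty."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return weighted - time_penalty

example = {name: 0.5 for name in WEIGHTS}  # illustrative component signals
print(step_reward(example))
```

Because the five weights sum to one, the weighted term stays within the range of the component signals, and the time penalty shifts it down by a constant.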

Table 4. Nine-stage mastery-gated curriculum. $v_{obs}$ denotes the obstacle velocity cap; J and C denote GPS jamming and communication loss probabilities, respectively.

Stage	Scale	$v_{obs}$	Wind	Adv.	New Factor
S1: easy	0.45	0	–	–	Baseline (static)
S2: easy_dyn	0.45	0.005	–	–	+Dynamic movement
S3: med_size	0.50	0.005	–	–	+Denser obstacles
S4: med_dyn	0.50	0.01	–	–	+Faster dynamics
S5: med_wind	0.50	0.01	✓	–	+Wind ( $ε$ -bump)
S6: hard	0.55	0.02	✓	–	Compound: +size, +speed
S7: jam	0.55	0.03	✓	J	+Jamming ( $ε$ -bump)
S8: comm	0.55	0.03	✓	J + C	+Comm. loss ( $ε$ -bump)
S9: full	0.58	0.05	✓	J + C	All factors elevated
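
The stage table lends itself to a small configuration structure. The sketch below encodes the rows of Table 4 and adds a generic gate that advances only when a competence metric clears a threshold. The `Stage` fields, the `advance` function, and the choice of validation win rate as the gate metric are assumptions made for illustration; the gate threshold is not listed in this table and is left as a required argument.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    scale: float      # "Scale" column
    v_obs: float      # obstacle velocity cap
    wind: bool
    jamming: bool     # "J" in the Adv. column
    comm_loss: bool   # "C" in the Adv. column

STAGES = [
    Stage("S1: easy",     0.45, 0.0,   False, False, False),
    Stage("S2: easy_dyn", 0.45, 0.005, False, False, False),
    Stage("S3: med_size", 0.50, 0.005, False, False, False),
    Stage("S4: med_dyn",  0.50, 0.01,  False, False, False),
    Stage("S5: med_wind", 0.50, 0.01,  True,  False, False),
    Stage("S6: hard",     0.55, 0.02,  True,  False, False),
    Stage("S7: jam",      0.55, 0.03,  True,  True,  False),
    Stage("S8: comm",     0.55, 0.03,  True,  True,  True),
    Stage("S9: full",     0.58, 0.05,  True,  True,  True),
]

def advance(stage_idx, validation_win_rate, threshold):
    """Return the next stage index once the gate metric reaches the threshold."""
    mastered = validation_win_rate >= threshold
    if mastered and stage_idx < len(STAGES) - 1:
        return stage_idx + 1
    return stage_idx

# Illustrative call with a hypothetical threshold of 0.8.
print(STAGES[advance(0, validation_win_rate=0.9, threshold=0.8)].name)
```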

Table 5. Key hyperparameters, fixed across RL-JSO and RL-PSO.

Category	Parameter	Value
Environment	World volume	$100 \times 100 \times 50$ m
	UAVs (N)/Waypoints (K)	10/16
	Min UAV separation ( $d_{min}$ )	2.80–2.95 (per stage)
	Obstacle safety margin	1.90–2.15 (per stage)
	Dynamic obstacles	up to 12 (spherical)
Optimizer	Population size M	30
	Iterations per episode T	150
	JSO drift coefficient $β$	3.0
	JSO contraction parameter c	0.1
DQN	Discount factor $γ$	0.99
	Hidden layers/units	2/256
	Dueling stream units	128 each
	Optimizer (Adam [37]) learning rate	$10^{- 4}$
	PER capacity/ $α$ / $β$	200,000/0.6/0.4 → 1.0
	Batch size	128
	Polyak $τ$	0.005
	$ε$ -decay horizon	45,000 steps (1.0 → 0.10)
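
Two of the DQN entries in Table 5 describe schedules rather than constants. The sketch below shows one plausible reading: a linear $ε$ decay from 1.0 to 0.10 over 45,000 steps and a Polyak (soft) target update with $τ = 0.005$. The linear shape of the decay is an assumption, as the table gives only the endpoints and the horizon, and the flat parameter lists stand in for real network weights.

```python
def linear_schedule(step, start, end, horizon):
    """Interpolate linearly from start to end over horizon steps, then hold."""
    frac = min(max(step / horizon, 0.0), 1.0)
    return start + frac * (end - start)

def epsilon_at(step):
    """Exploration rate with the Table 5 endpoints; linear shape is assumed."""
    return linear_schedule(step, start=1.0, end=0.10, horizon=45_000)

def polyak_update(target_params, online_params, tau=0.005):
    """Soft target update: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]

print(epsilon_at(0), epsilon_at(22_500), epsilon_at(45_000))
print(polyak_update([0.0, 1.0], [1.0, 1.0]))
```

The same `linear_schedule` helper could express the PER importance-sampling exponent moving from 0.4 to 1.0, although the annealing horizon for that quantity is not given in the table.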

Table 6. Training summary for RL-JSO and RL-PSO under architecturally matched configurations.

Metric	RL-JSO	RL-PSO
Episodes completed	736	766
Highest stage reached	S6	S9
Highest stage mastered	S5	S8
Training hard hits	1	795
Best validation win rate	0.911	0.578
Wall-clock time (hours)	75.74	75.62

Table 7. Campaign C1 (nominal) results. All four algorithms achieve CF = 100%; fitness values are directly comparable.

Algorithm	Mean Fitness	Median	Std	CF (%)
RL-JSO	616,437	609,807	94,483	100.0
JSO	726,046	683,503	188,312	100.0
PSO	787,163	791,227	60,331	100.0
RL-PSO	737,280	729,912	57,589	100.0

Table 8. Campaign C2 (wind) results. Rows marked with ^† have CF < 100% and include infeasible solutions; their fitness values are not directly comparable to collision-free baselines.

Algorithm	Mean Fitness	Median	Std	CF (%)
RL-JSO	808,503	660,377	416,433	100.0
JSO	1,013,162	811,254	621,717	100.0
PSO	861,150	826,690	84,513	10.0 ^†
RL-PSO	868,925	852,812	76,907	2.5 ^†

^† CF < 100%: fitness includes infeasible solutions; not directly comparable.

Table 9. Campaign C3 (hard dynamic) results. JSO drops just below 100% CF; PSO-based fitness values remain non-comparable.

Algorithm	Mean Fitness	Median	Std	CF (%)
RL-JSO	901,414	570,547	790,995	100.0
JSO	1,991,694	1,343,203	1,755,800	97.5
PSO	877,103	842,280	118,245	20.0 ^†
RL-PSO	906,837	879,456	123,087	10.0 ^†

^† CF < 100%: fitness includes infeasible solutions; not directly comparable.

Table 10. Campaign C4 (full adversarial) results. Only RL-JSO and JSO achieve CF = 100%; the comparison is meaningful only between these two.

Algorithm	Mean Fitness	Median	Std	CF (%)
RL-JSO	878,011	821,972	658,900	100.0
JSO	2,042,387	1,575,383	1,160,168	100.0
PSO	872,632	836,273	106,657	0.0 ^†
RL-PSO	915,533	868,648	156,571	2.5 ^†

^† CF < 100%: fitness includes infeasible solutions; not directly comparable.

Table 11. Collision-free rate (%) across all campaigns.

Campaign	RL-JSO	JSO	PSO	RL-PSO
C1: Nominal	100.0	100.0	100.0	100.0
C2: Wind	100.0	100.0	10.0	2.5
C3: Hard	100.0	97.5	20.0	10.0
C4: Adversarial	100.0	100.0	0.0	2.5

Table 12. Statistical comparison of RL-JSO versus standard JSO across four campaigns. Paired Wilcoxon signed-rank tests with Holm correction; Cliff’s $δ$ with effect-size labels (N = negligible, $| δ | < 0.147$; S = small, $0.147 \leq | δ | < 0.33$; M = medium, $0.33 \leq | δ | < 0.474$; L = large, $| δ | \geq 0.474$, following the thresholds of Romano et al.). Significance markers: * $p_{Holm} < 0.05$, *** $p_{Holm} < 0.001$. ^† denotes metrics where the direction favors JSO.

Metric	C1	C2	C3	C4
Fitness ↓	$0.354$ (M) ***	$0.369$ (M) *	$0.628$ (L) ***	$0.689$ (L) ***
Path length ↓	$0.121$ (N)	$0.566$ (L) ***	$0.679$ (L) ***	$0.691$ (L) ***
Energy ↓	$0.089$ (N) ^†	$0.562$ (L) ***	$0.686$ (L) ***	$0.696$ (L) ***
Smoothness ↑	$0.078$ (N) ^†	$0.561$ (L) ***	$0.684$ (L) ***	$0.700$ (L) ***
$s_{min}$ ↑	$0.108$ (N)	$0.635$ (L) ***	$0.688$ (L) ***	$0.803$ (L) ***
Near hits ↓	$0.501$ (L) ***	$0.613$ (L) ***^†	$0.168$ (S) *	$0.188$ (S)
Obstacle clearance ↑	$0.650$ (L) ***	$0.798$ (L) ***^†	$0.010$ (N) ^†	$0.158$ (S)
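
The effect-size labels and the multiple-comparison correction behind Table 12 follow standard definitions, which the sketch below illustrates. It computes Cliff's $δ$ as the difference between the proportions of cross pairs in which one sample exceeds the other, maps $| δ |$ to the N/S/M/L categories using the thresholds in the caption, and applies the Holm step-down adjustment to a list of p-values. The function names and the sample inputs are illustrative; the Wilcoxon test itself is not reproduced here, and the sign convention of `cliffs_delta` is an assumption.

```python
from itertools import product

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))

def effect_label(delta):
    """Map |delta| to the categories listed in the Table 12 caption."""
    mag = abs(delta)
    if mag < 0.147:
        return "N"
    if mag < 0.33:
        return "S"
    if mag < 0.474:
        return "M"
    return "L"

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Labels for the fitness row of Table 12 (C1 to C4).
print([effect_label(d) for d in (0.354, 0.369, 0.628, 0.689)])
# Illustrative p-values only; these are not taken from the paper.
print(holm_adjust([0.01, 0.04, 0.03]))
```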

Table 13. Composite cooperation score $A \in [0, 1]$ across campaigns (higher is better).

Campaign	RL-JSO	JSO	PSO	RL-PSO
C1: Nominal	0.745	0.732	0.728	0.719
C2: Wind	0.742	0.681	0.624	0.612
C3: Hard	0.738	0.643	0.591	0.573
C4: Adversarial	0.733	0.607	0.582	0.554
C1 → C4 drop	$- 1.6 %$	$- 17 %$	$- 20 %$	$- 23 %$
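
The last row of Table 13 can be checked from the C1 and C4 rows, reading the drop as the relative change $(A_{C4} - A_{C1}) / A_{C1}$. The short script below reproduces that arithmetic; the variable and function names are illustrative.

```python
# (C1 score, C4 score) per algorithm, copied from Table 13.
scores = {
    "RL-JSO": (0.745, 0.733),
    "JSO": (0.732, 0.607),
    "PSO": (0.728, 0.582),
    "RL-PSO": (0.719, 0.554),
}

def relative_drop_pct(c1, c4):
    """Relative change from C1 to C4, in percent (negative means degradation)."""
    return 100.0 * (c4 - c1) / c1

drops = {name: relative_drop_pct(c1, c4) for name, (c1, c4) in scores.items()}
for name, drop in drops.items():
    print(f"{name}: {drop:.1f}%")
```

Rounded, the results match the reported row: about −1.6% for RL-JSO and −17%, −20%, and −23% for JSO, PSO, and RL-PSO.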

Table 14. Zero-shot scalability across swarm sizes (N). Mean ± std over 10 independent seeds. CF% = percentage of runs with zero hard hits (collision-free). Sep. Viol. = mean separation violations per run.

N	Algorithm	Fitness ( $\times 10^{6}$ )	CF%	Hard Hits	Sep. Viol.	Energy
5	JSO	4.6 ± 8.1	70%	0.4	0	3124
	RL-JSO	2.6 ± 0.2	100%	0.0	2	8089
	PSO	0.4 ± 0.0	0%	3.0	0	6630
	RL-PSO	1.9 ± 4.8	20%	2.5	0	7218
10	JSO	22.1 ± 6.5	10%	2.7	11	18,737
	RL-JSO	21.4 ± 10.6	0%	2.8	29	19,206
	PSO	19.6 ± 23.2	0%	3.8	6	17,336
	RL-PSO	18.8 ± 24.0	0%	4.6	5	16,880
15	JSO	23.4 ± 2.2	0%	3.0	10	26,256
	RL-JSO	25.3 ± 3.7	0%	3.0	7	24,012
	PSO	11.4 ± 14.9	0%	3.3	6	25,293
	RL-PSO	17.2 ± 17.4	0%	3.1	3	23,656
20	JSO	32.1 ± 8.9	0%	3.7	32	33,473
	RL-JSO	34.3 ± 8.2	0%	4.4	26	34,990
	PSO	18.6 ± 15.9	0%	4.1	22	33,588
	RL-PSO	17.8 ± 15.4	0%	3.8	19	31,898
50	JSO	103.7 ± 10.0	0%	13.4	106	64,635
	RL-JSO	104.6 ± 12.2	0%	13.3	111	65,281
	PSO	95.9 ± 6.9	0%	13.6	123	69,348
	RL-PSO	100.0 ± 7.6	0%	14.0	129	70,882
100	JSO	99.3 ± 7.4	0%	12.1	162	98,067
	RL-JSO	112.4 ± 7.3	0%	14.5	161	97,779
	PSO	98.8 ± 14.9	0%	12.7	227	119,705
	RL-PSO	102.0 ± 13.0	0%	13.3	247	123,215
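
The CF% column in Table 14 is defined in the caption as the percentage of runs with zero hard hits, which is simple to compute from per-run counts. The sketch below shows that calculation. The per-seed list is hypothetical: it was chosen only so that its summary (70% collision-free, mean of 0.4 hard hits over 10 seeds) agrees with the JSO row at N = 5, and it is not the paper's raw data.

```python
def collision_free_rate(hard_hits_per_run):
    """Percentage of runs with zero hard hits."""
    if not hard_hits_per_run:
        raise ValueError("need at least one run")
    clean = sum(1 for hits in hard_hits_per_run if hits == 0)
    return 100.0 * clean / len(hard_hits_per_run)

def mean(values):
    return sum(values) / len(values)

# Hypothetical per-seed hard-hit counts for 10 seeds (not the paper's data).
example_runs = [0, 0, 0, 0, 0, 0, 0, 1, 1, 2]
print(collision_free_rate(example_runs), mean(example_runs))
```

The example also shows why the two columns carry different information: a low mean hard-hit count can coexist with a collision-free rate well below 100% when a few seeds account for all the hits.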

Table 15. Inference-time ablation on the RL-JSO checkpoint. Each condition is compared against the published RL-JSO baseline on the same (campaign, scenario, seed) cells using paired Wilcoxon signed-rank tests with Holm correction (significance markers: * $p_{Holm} < 0.05$, ** $p_{Holm} < 0.01$, *** $p_{Holm} < 0.001$). All $| δ |$ values in this table are computed on mean fitness; the L1 + L2 row shows $| δ | \approx 0.02$ on fitness because re-enabling the fallback layers barely perturbs the objective value, while the same condition systematically degrades the safety margins $s_{min}$ and $c_{min}$ as reported in the text.

Condition	C1 $| δ |$	C2 $| δ |$	C3 $| δ |$	C4 $| δ |$
Fixed DRIFT (fit $Δ %$ )	$+ 216 %$	$+ 159 %$	$+ 148 %$	$+ 187 %$
Fixed DRIFT ( $δ$ )	$1.00$ (L) ***	$0.93$ (L) ***	$0.80$ (L) ***	$0.91$ (L) ***
Random (fit $Δ %$ )	$+ 4 %$	$- 28 %$	$- 26 %$	$- 21 %$
Random ( $δ$ )	$0.11$ (N)	$0.52$ (L) ***	$0.41$ (M) **	$0.32$ (S) *
L1 + L2 enabled (fit $Δ %$ )	$+ 5 %$	$- 16 %$	$- 11 %$	$0 %$
L1 + L2 enabled ( $δ$ )	$0.19$ (S)	$0.24$ (S)	$0.17$ (S)	$0.02$ (N)

Share and Cite

MDPI and ACS Style

Alotaibi, N.; BinSaeedan, W. Adaptive Reinforcement Learning-Driven Jellyfish Search Optimizer for Cooperative Multi-UAV Path Planning Under Dynamic and Adversarial Conditions. Drones 2026, 10, 394. https://doi.org/10.3390/drones10050394

Appendix D. Per-Seed Stability in `S2_shifted` Under Campaign C4