Figure 1.
RL-JSO framework overview. The RL decision layer (dueling DQN with PER) selects a JSO phase from a 24-dimensional state vector; the optimization layer updates the population using the selected phase; the simulation environment evaluates the resulting trajectories under dynamic obstacles, AR(1) wind, and adversarial events; and the mastery-gated curriculum advances the stage configuration based on validation-time competence.
Figure 2.
Representative 3D simulation environment used throughout the evaluation campaigns. The m workspace contains ten UAVs represented by colored trajectory curves with ground-plane projections. The trajectories connect a shared start zone, shown as the blue translucent cuboid on the left, to a shared goal zone, shown as the green translucent cuboid on the right. Yellow/gold spheres denote dynamic obstacles at their current positions, while dashed orange curves indicate their predicted obstacle paths. Stage-dependent adversarial factors are illustrated schematically: AR(1) wind as blue arrows above the workspace, GPS jamming as the purple cone, and a communication-loss event as the red cross. At runtime, these factors are stochastic and time-varying rather than static as depicted.
Figure 3.
Hierarchical RL control with five designed safety override layers. In the reported experiments, L1 (warmup), L2 (stagnation fallback), and L5 (DQN decision) are active, whereas L3 and L4 are retained as pass-through placeholders. At evaluation time, eval_disable_fallback=True bypasses L1 and L2 so the mature policy governs every iteration directly.
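The layer ordering described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: the warmup length, the stagnation patience, and the fallback phase index are hypothetical placeholders, and only the layer names and the `eval_disable_fallback` flag come from the caption.

```python
from dataclasses import dataclass


@dataclass
class ControlConfig:
    warmup_iters: int = 10            # hypothetical L1 warmup length
    stagnation_patience: int = 15     # hypothetical L2 trigger threshold
    fallback_phase: int = 0           # hypothetical default JSO phase index
    eval_disable_fallback: bool = False


def select_phase(iteration, stagnant_iters, dqn_phase, cfg):
    """Return (phase, layer) for one optimizer iteration.

    L1 and L2 may override the DQN unless eval_disable_fallback is set;
    L3 and L4 are pass-through placeholders, so control falls to L5.
    """
    if not cfg.eval_disable_fallback:
        if iteration < cfg.warmup_iters:
            return cfg.fallback_phase, "L1"
        if stagnant_iters >= cfg.stagnation_patience:
            return cfg.fallback_phase, "L2"
    # L3 and L4: no-ops in the reported experiments.
    return dqn_phase, "L5"


train_cfg = ControlConfig()
eval_cfg = ControlConfig(eval_disable_fallback=True)
print(select_phase(0, 0, 2, train_cfg))   # warmup override during training
print(select_phase(0, 0, 2, eval_cfg))    # DQN decides at evaluation time
```

With the flag set, every iteration resolves to L5, which matches the statement that the mature policy governs evaluation directly.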
Figure 4.
Mastery-gated curriculum design. (Upper): nine progressive stages arranged in two tiers. The first tier (S1–S5, green) introduces one new difficulty factor per stage (dynamic obstacles, denser layouts, faster dynamics, and wind), while the second tier (S6–S9, orange) begins with a controlled compound increase and adds GPS jamming and communication loss. Key stage parameters are shown in each box; full numerical values are given in Table 4. (Lower): mastery-gate decision flow. After validation evaluation, the learner is either promoted (gate passed) or enters a recovery cycle with an ε-bump (gate failed); training is terminated after three consecutive failures at the same stage.
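The decision flow in the lower panel can be sketched as a small state update. This is an illustrative reading of the caption only: the mastery threshold, the size of the exploration bump, and the cap at 1.0 are assumptions, while the three-failure termination rule and the nine stages come from the caption.

```python
def gate_step(win_rate, threshold, stage, fails, epsilon,
              eps_bump=0.1, max_fails=3, last_stage=9):
    """Apply one mastery-gate decision after a validation evaluation.

    Returns (stage, consecutive_fails, epsilon, status).
    threshold and eps_bump are hypothetical values chosen by the caller.
    """
    if win_rate >= threshold:
        # Gate passed: promote and reset the failure counter.
        return min(stage + 1, last_stage), 0, epsilon, "promoted"
    fails += 1
    if fails >= max_fails:
        # Three consecutive failures at the same stage end training.
        return stage, fails, epsilon, "terminated"
    # Gate failed: stay on the stage and raise exploration for recovery.
    return stage, fails, min(1.0, epsilon + eps_bump), "recovery"


state = (6, 0, 0.10)  # stage S6, no failures yet, epsilon at its floor
for win_rate in (0.50, 0.55, 0.52):
    stage, fails, eps, status = gate_step(win_rate, 0.80, *state)
    state = (stage, fails, eps)
    print(stage, fails, round(eps, 2), status)
```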
Figure 5.
Fair comparison protocol. All RL-augmented algorithms share an identical foundation. The only differences are the algorithm-specific action heads and the optimizer backbones.
Figure 6.
C4 (full adversarial) fitness distribution across 40 runs per algorithm (20 seeds × 2 scenarios). Hollow diamonds denote means; black bars denote medians. Hatched boxes mark algorithms with a collision-free rate below 100%: their reported fitness includes infeasible solutions and is not directly comparable to that of the collision-free algorithms. RL-JSO’s distribution lies entirely below standard JSO’s interquartile range.
Figure 7.
Scenario-level decomposition of Campaign C4. (Left): mean fitness in the two evaluation scenarios. (Right): collision-free rate by scenario. The dominant gap occurs in S2_shifted, where RL-JSO preserves collision-free performance and reduces mean fitness relative to standard JSO; in S1_default, both JSO-based methods remain collision-free and the fitness gap is materially smaller.
Figure 8.
Collision-free rate (%) across the progressive-difficulty Campaigns C1–C4 for the four evaluated algorithms. RL-JSO maintains a 100% collision-free rate across all conditions, whereas the PSO-based methods collapse once wind is introduced (C2 onward) and standard JSO drops to 97.5% under the hard-obstacle condition (C3).
Figure 9.
Effect-size scaling of RL-JSO over standard JSO across Campaigns C1–C4. Cliff’s δ magnitudes for four metrics are plotted against campaign difficulty; the shaded horizontal bands mark the conventional effect-size categories (negligible, small, medium, large). All four metrics trend upward with difficulty, and scales fastest.
Figure 10.
Composite cooperation score across Campaigns C1–C4 for all four algorithms (higher is better). Right-side annotations report the C1→C4 relative drop per algorithm. RL-JSO exhibits a near-horizontal profile (a drop of about 1.6%), while the three comparator algorithms degrade by 17–23% across the same campaign range.
Figure 11.
Gradient-based feature importance of the trained DQN policy. Each bar shows the mean over 5000 sampled states, colored by feature group. The separation and clearance group dominates, confirming that the policy primarily attends to inter-UAV safety margins.
Table 1.
Comparative summary of related frameworks (2022–2025). “P” indicates partial adversarial modeling; ✓ indicates that the feature is supported; and “–” indicates that it is not reported or not applicable.
| Ref. | Year | Framework/Key Contribution | MUAV | Dyn. | Adv. | Fair | Curr. |
|---|---|---|---|---|---|---|---|
| [14] | 2023 | HJSPSO—JSO + PSO hybrid (deterministic) | – | – | – | – | – |
| [21] | 2024 | UMOJS—multi-obj. JSO + RRT init | – | – | – | – | – |
| [22] | 2025 | PVDE-MOJS—parallel JSO + DE | ✓ | ✓ | – | – | – |
| [3] | 2024 | ESE-MSJS—state-aware rule-based switching | ✓ | ✓ | – | – | – |
| [15] | 2024 | JSO-PSO-GA—static multi-SI fusion | – | – | – | – | – |
| [9] | 2024 | RLPSO—tabular Q-guided particle learning | – | – | – | – | – |
| [5] | 2025 | QMSR-ACOR—Q-learning, 32-cell table | ✓ | – | – | – | – |
| [10] | 2025 | DPSO-Q—Q-table, 27 cells | – | – | – | – | – |
| [13] | 2025 | QL-MOPSO—hierarchical RL-to-PSO | – | – | – | – | – |
| [11] | 2024 | PSO-M3DDPG—PSO enhances MARL samples | ✓ | ✓ | P | – | – |
| [27] | 2024 | RL-QPSO Net—DRL + quantum PSO | – | ✓ | – | – | – |
| [12] | 2025 | GenAI-GRL—GenAI + graph RL | ✓ | ✓ | P | – | – |
| This work | 2026 | RL-JSO—deep RL-guided JSO phase control | ✓ | ✓ | ✓ | ✓ | ✓ |
Table 2.
24-dimensional RL state vector.
| Index | Feature | Description/Normalization |
|---|---|---|
| 0 | Obstacle clearance | Mean clearance/world diagonal |
| 1 | Swarm dispersion | Mean pairwise distance/diagonal |
| 2 | Goal distance | Mean distance-to-goal/diagonal |
| 3 | Time progress | |
| 4 | Hard-hit pressure | Hard hits/20, clipped |
| 5 | Near-hit pressure | Near hits/50, clipped |
| 6 | Separation ratio | |
| 7 | Clearance ratio | |
| 8 | Log fitness | |
| 9 | Improvement signal | Recent log-improvement |
| 10 | Fitness velocity | Relative change |
| 11 | Population diversity | Coefficient of variation of fitness/2 |
| 12 | Energy ratio | Turning energy/path length |
| 13 | Mean turn angle | |
| 14 | Obstacle density | Fraction within influence radius |
| 15 | Boundary margin | Min distance to boundary, normalized |
| 16 | Safety trend | Weighted (hits + separation violations) |
| 17 | Separation reserve | Signed: positive means safe margin |
| 18 | Clearance reserve | Signed: positive means safe margin |
| 19 | Safety persistence | EMA of threshold violations |
| 20 | Separation violations | Distributed count/50 |
| 21 | Wind magnitude | , clipped |
| 22 | Jamming flag | Binary |
| 23 | Comm-loss flag | Binary |
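As an illustration of the normalizations listed in the table, the sketch below computes three of the features (indices 0, 4, and 5). The divisors 20 and 50 and the division by the world diagonal come from the table; the clipping range of [0, 1] and the function names are assumptions, and the world dimensions passed in are hypothetical.

```python
import math


def clip01(x):
    """Clip a value to [0, 1]; the range is an assumed convention."""
    return max(0.0, min(1.0, x))


def world_diagonal(dims):
    """Euclidean diagonal of a box-shaped workspace given (x, y, z) extents."""
    return math.sqrt(sum(d * d for d in dims))


def safety_features(mean_clearance, hard_hits, near_hits, dims):
    """Return features 0, 4, and 5 of the state vector as a dict."""
    return {
        "obstacle_clearance": mean_clearance / world_diagonal(dims),
        "hard_hit_pressure": clip01(hard_hits / 20.0),
        "near_hit_pressure": clip01(near_hits / 50.0),
    }


# Hypothetical workspace extents, used only to exercise the functions.
print(safety_features(mean_clearance=5.0, hard_hits=30, near_hits=25,
                      dims=(3.0, 4.0, 12.0)))
```

Scaling every entry to a bounded range keeps the 24 inputs on comparable scales for the Q-network, which is the usual motivation for this kind of normalization.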
Table 3.
Reward function v3.2 component weights, shared between RL-JSO and RL-PSO. The five primary components form a convex combination summing to unity; the per-step time penalty is a small fixed negative signal applied outside the convex sum.
| Component | Weight |
|---|---|
| Fitness improvement | 0.30 |
| Inter-UAV separation margin | 0.20 |
| Obstacle clearance margin | 0.25 |
| Trajectory smoothness | 0.15 |
| Cooperation (swarm coordination) | 0.10 |
| Per-step time penalty | — |
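The structure described in the caption, a convex combination of five components plus a fixed penalty outside the sum, can be sketched as follows. The weights are taken from the table; the penalty magnitude of 0.01 and the assumption that each component is already scaled to a common range are placeholders, since the table does not give them.

```python
WEIGHTS = {
    "fitness_improvement": 0.30,
    "separation_margin": 0.20,
    "clearance_margin": 0.25,
    "smoothness": 0.15,
    "cooperation": 0.10,
}


def step_reward(components, time_penalty=0.01):
    """Weighted sum of the five components minus a per-step penalty.

    time_penalty=0.01 is a hypothetical value; components are assumed to be
    pre-scaled scalars keyed like WEIGHTS.
    """
    # The primary weights must form a convex combination.
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return weighted - time_penalty


example = {name: 1.0 for name in WEIGHTS}
print(round(step_reward(example), 4))
```

Keeping the penalty outside the convex sum means it shifts the reward by a constant per step without changing the relative weighting of the five objectives.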
Table 4.
Nine-stage mastery-gated curriculum. The third column gives the obstacle velocity cap; J and C denote GPS jamming and communication loss probabilities, respectively.
| Stage | Scale | Vel. cap | Wind | Adv. | New Factor |
|---|---|---|---|---|---|
| S1: easy | 0.45 | 0 | – | – | Baseline (static) |
| S2: easy_dyn | 0.45 | 0.005 | – | – | +Dynamic movement |
| S3: med_size | 0.50 | 0.005 | – | – | +Denser obstacles |
| S4: med_dyn | 0.50 | 0.01 | – | – | +Faster dynamics |
| S5: med_wind | 0.50 | 0.01 | ✓ | – | +Wind (ε-bump) |
| S6: hard | 0.55 | 0.02 | ✓ | – | Compound: +size, +speed |
| S7: jam | 0.55 | 0.03 | ✓ | J | +Jamming (ε-bump) |
| S8: comm | 0.55 | 0.03 | ✓ | J + C | +Comm. loss (ε-bump) |
| S9: full | 0.58 | 0.05 | ✓ | J + C | All factors elevated |
Table 5.
Key hyperparameters, fixed across RL-JSO and RL-PSO.
| Category | Parameter | Value |
|---|---|---|
| Environment | World volume | m |
| | UAVs (N)/Waypoints (K) | 10/16 |
| | Min UAV separation | 2.80–2.95 (per stage) |
| | Obstacle safety margin | 1.90–2.15 (per stage) |
| | Dynamic obstacles | up to 12 (spherical) |
| Optimizer | Population size M | 30 |
| | Iterations per episode T | 150 |
| | JSO drift coefficient | 3.0 |
| | JSO contraction parameter c | 0.1 |
| DQN | Discount factor γ | 0.99 |
| | Hidden layers/units | 2/256 |
| | Dueling stream units | 128 each |
| | Optimizer (Adam [37]) learning rate | |
| | PER capacity/α/β | 200,000/0.6/0.4 → 1.0 |
| | Batch size | 128 |
| | Polyak τ | 0.005 |
| | ε-decay horizon | 45,000 steps (1.0 → 0.10) |
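To make the schedule-type entries concrete, the sketch below shows one way the exploration decay, the PER importance-sampling exponent annealing, and the Polyak target update could be realized with the tabulated values. The linear shape of both schedules and the `total_steps` argument are assumptions; the table gives only the endpoints, the 45,000-step horizon, and the averaging coefficient 0.005.

```python
def linear_schedule(step, start, end, horizon):
    """Linearly interpolate from start to end over horizon steps, then hold."""
    frac = min(max(step, 0) / horizon, 1.0)
    return start + frac * (end - start)


def epsilon_at(step):
    # Exploration rate: 1.0 -> 0.10 over 45,000 steps (linear shape assumed).
    return linear_schedule(step, 1.0, 0.10, 45_000)


def per_beta_at(step, total_steps):
    # PER importance-sampling exponent: 0.4 -> 1.0; total_steps is hypothetical.
    return linear_schedule(step, 0.4, 1.0, total_steps)


def polyak_update(target, online, tau=0.005):
    """Soft target update: target <- (1 - tau) * target + tau * online."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


print(round(epsilon_at(22_500), 3))            # halfway through the decay
print(round(per_beta_at(50_000, 100_000), 3))  # halfway through annealing
print(polyak_update([0.0, 1.0], [1.0, 1.0]))
```

With a coefficient of 0.005, the target network moves only 0.5% of the way toward the online network per update, which trades slower value propagation for more stable bootstrapped targets.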
Table 6.
Training summary for RL-JSO and RL-PSO under architecturally matched configurations.
| Metric | RL-JSO | RL-PSO |
|---|---|---|
| Episodes completed | 736 | 766 |
| Highest stage reached | S6 | S9 |
| Highest stage mastered | S5 | S8 |
| Training hard hits | 1 | 795 |
| Best validation win rate | 0.911 | 0.578 |
| Wall-clock time (hours) | 75.74 | 75.62 |
Table 7.
Campaign C1 (nominal) results. All four algorithms achieve a 100% collision-free (CF) rate; fitness values are directly comparable.
| Algorithm | Mean Fitness | Median | Std | CF (%) |
|---|---|---|---|---|
| RL-JSO | 616,437 | 609,807 | 94,483 | 100.0 |
| JSO | 726,046 | 683,503 | 188,312 | 100.0 |
| PSO | 787,163 | 791,227 | 60,331 | 100.0 |
| RL-PSO | 737,280 | 729,912 | 57,589 | 100.0 |
Table 8.
Campaign C2 (wind) results. Values in rows marked † correspond to CF < 100% and include infeasible solutions; they are not directly comparable to collision-free baselines.
| Algorithm | Mean Fitness | Median | Std | CF (%) |
|---|---|---|---|---|
| RL-JSO | 808,503 | 660,377 | 416,433 | 100.0 |
| JSO | 1,013,162 | 811,254 | 621,717 | 100.0 |
| PSO | 861,150 | 826,690 | 84,513 | 10.0 † |
| RL-PSO | 868,925 | 852,812 | 76,907 | 2.5 † |
Table 9.
Campaign C3 (hard dynamic) results. JSO drops just below 100% CF; PSO-based fitness values remain non-comparable.
| Algorithm | Mean Fitness | Median | Std | CF (%) |
|---|---|---|---|---|
| RL-JSO | 901,414 | 570,547 | 790,995 | 100.0 |
| JSO | 1,991,694 | 1,343,203 | 1,755,800 | 97.5 |
| PSO | 877,103 | 842,280 | 118,245 | 20.0 † |
| RL-PSO | 906,837 | 879,456 | 123,087 | 10.0 † |
Table 10.
Campaign C4 (full adversarial) results. Only RL-JSO and JSO achieve a 100% CF rate; the comparison is meaningful only between these two.
| Algorithm | Mean Fitness | Median | Std | CF (%) |
|---|---|---|---|---|
| RL-JSO | 878,011 | 821,972 | 658,900 | 100.0 |
| JSO | 2,042,387 | 1,575,383 | 1,160,168 | 100.0 |
| PSO | 872,632 | 836,273 | 106,657 | 0.0 † |
| RL-PSO | 915,533 | 868,648 | 156,571 | 2.5 † |
Table 11.
Collision-free rate (%) across all campaigns.
| Campaign | RL-JSO | JSO | PSO | RL-PSO |
|---|---|---|---|---|
| C1: Nominal | 100.0 | 100.0 | 100.0 | 100.0 |
| C2: Wind | 100.0 | 100.0 | 10.0 | 2.5 |
| C3: Hard | 100.0 | 97.5 | 20.0 | 10.0 |
| C4: Adversarial | 100.0 | 100.0 | 0.0 | 2.5 |
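The collision-free rate is the percentage of runs with zero hard hits. A minimal sketch of that computation follows; the function name and the example run lists are illustrative, and the 40-run sample size is the one stated for C4 (20 seeds × 2 scenarios), assumed here to apply to every campaign.

```python
def collision_free_rate(hard_hits_per_run):
    """Percentage of runs that recorded zero hard hits."""
    runs = len(hard_hits_per_run)
    if runs == 0:
        raise ValueError("at least one run is required")
    clean = sum(1 for hits in hard_hits_per_run if hits == 0)
    return 100.0 * clean / runs


# Illustrative run lists, not the paper's raw data.
one_failure = [0] * 39 + [2]      # 39 clean runs out of 40
one_success = [0] + [1] * 39      # 1 clean run out of 40
print(collision_free_rate(one_failure))
print(collision_free_rate(one_success))
```

With 40 runs the rate moves in steps of 2.5 percentage points, so 97.5% corresponds to a single colliding run and 2.5% to a single clean run.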
Table 12.
Statistical comparison of RL-JSO versus standard JSO across four campaigns. Paired Wilcoxon signed-rank tests with Holm correction; Cliff’s δ with effect-size labels (N = negligible, |δ| < 0.147; S = small, |δ| < 0.33; M = medium, |δ| < 0.474; L = large, |δ| ≥ 0.474, following the thresholds of Romano et al.). Asterisks mark Holm-corrected significance levels. † denotes metrics where the direction favors JSO.
| Metric | C1 | C2 | C3 | C4 |
|---|---|---|---|---|
| Fitness ↓ | (M) *** | (M) * | (L) *** | (L) *** |
| Path length ↓ | (N) | (L) *** | (L) *** | (L) *** |
| Energy ↓ | (N) † | (L) *** | (L) *** | (L) *** |
| Smoothness ↑ | (N) † | (L) *** | (L) *** | (L) *** |
| ↑ | (N) | (L) *** | (L) *** | (L) *** |
| Near hits ↓ | (L) *** | (L) ***† | (S) * | (S) |
| Obstacle clearance ↑ | (L) *** | (L) ***† | (N) † | (S) |
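For reference, Cliff's δ and the magnitude labels used in the table can be computed as in the sketch below. It uses the standard dominance definition of δ and the Romano et al. cut-offs (0.147, 0.33, 0.474); the function names and the sample values are illustrative, not the paper's data.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude_label(delta):
    """Map |delta| to N/S/M/L using the Romano et al. thresholds."""
    d = abs(delta)
    if d < 0.147:
        return "N"
    if d < 0.33:
        return "S"
    if d < 0.474:
        return "M"
    return "L"


# Illustrative fitness samples (lower is better), not taken from the paper.
method_a = [1.0, 2.0, 3.0]
method_b = [4.0, 5.0, 6.0]
delta = cliffs_delta(method_a, method_b)
print(delta, magnitude_label(delta))
```

Because δ depends only on the ordering of values, it is insensitive to the heavy right tails visible in the fitness standard deviations, which makes it a reasonable companion to the rank-based Wilcoxon test.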
Table 13.
Composite cooperation score across campaigns (higher is better).
| Campaign | RL-JSO | JSO | PSO | RL-PSO |
|---|---|---|---|---|
| C1: Nominal | 0.745 | 0.732 | 0.728 | 0.719 |
| C2: Wind | 0.742 | 0.681 | 0.624 | 0.612 |
| C3: Hard | 0.738 | 0.643 | 0.591 | 0.573 |
| C4: Adversarial | 0.733 | 0.607 | 0.582 | 0.554 |
| C1 → C4 drop | 1.6% | 17.1% | 20.1% | 22.9% |
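The C1 → C4 relative drop can be recomputed directly from the tabulated scores as (C1 − C4)/C1. The short check below does this using the three-decimal values printed in the table, so the results may differ in the last digit from figures computed on unrounded scores.

```python
# C1 and C4 composite cooperation scores as printed in the table.
SCORES = {
    "RL-JSO": (0.745, 0.733),
    "JSO": (0.732, 0.607),
    "PSO": (0.728, 0.582),
    "RL-PSO": (0.719, 0.554),
}


def relative_drop(c1, c4):
    """Relative decrease from C1 to C4, in percent."""
    return 100.0 * (c1 - c4) / c1


for name, (c1, c4) in SCORES.items():
    print(f"{name}: {relative_drop(c1, c4):.1f}%")
# RL-JSO: 1.6%
# JSO: 17.1%
# PSO: 20.1%
# RL-PSO: 22.9%
```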
Table 14.
Zero-shot scalability across swarm sizes (N). Mean ± std over 10 independent seeds. CF% = percentage of runs with zero hard hits (collision-free). Sep. Viol. = mean separation violations per run.
| N | Algorithm | Fitness () | CF% | Hard Hits | Sep. Viol. | Energy |
|---|---|---|---|---|---|---|
| 5 | JSO | 4.6 ± 8.1 | 70% | 0.4 | 0 | 3124 |
| | RL-JSO | 2.6 ± 0.2 | 100% | 0.0 | 2 | 8089 |
| | PSO | 0.4 ± 0.0 | 0% | 3.0 | 0 | 6630 |
| | RL-PSO | 1.9 ± 4.8 | 20% | 2.5 | 0 | 7218 |
| 10 | JSO | 22.1 ± 6.5 | 10% | 2.7 | 11 | 18,737 |
| | RL-JSO | 21.4 ± 10.6 | 0% | 2.8 | 29 | 19,206 |
| | PSO | 19.6 ± 23.2 | 0% | 3.8 | 6 | 17,336 |
| | RL-PSO | 18.8 ± 24.0 | 0% | 4.6 | 5 | 16,880 |
| 15 | JSO | 23.4 ± 2.2 | 0% | 3.0 | 10 | 26,256 |
| | RL-JSO | 25.3 ± 3.7 | 0% | 3.0 | 7 | 24,012 |
| | PSO | 11.4 ± 14.9 | 0% | 3.3 | 6 | 25,293 |
| | RL-PSO | 17.2 ± 17.4 | 0% | 3.1 | 3 | 23,656 |
| 20 | JSO | 32.1 ± 8.9 | 0% | 3.7 | 32 | 33,473 |
| | RL-JSO | 34.3 ± 8.2 | 0% | 4.4 | 26 | 34,990 |
| | PSO | 18.6 ± 15.9 | 0% | 4.1 | 22 | 33,588 |
| | RL-PSO | 17.8 ± 15.4 | 0% | 3.8 | 19 | 31,898 |
| 50 | JSO | 103.7 ± 10.0 | 0% | 13.4 | 106 | 64,635 |
| | RL-JSO | 104.6 ± 12.2 | 0% | 13.3 | 111 | 65,281 |
| | PSO | 95.9 ± 6.9 | 0% | 13.6 | 123 | 69,348 |
| | RL-PSO | 100.0 ± 7.6 | 0% | 14.0 | 129 | 70,882 |
| 100 | JSO | 99.3 ± 7.4 | 0% | 12.1 | 162 | 98,067 |
| | RL-JSO | 112.4 ± 7.3 | 0% | 14.5 | 161 | 97,779 |
| | PSO | 98.8 ± 14.9 | 0% | 12.7 | 227 | 119,705 |
| | RL-PSO | 102.0 ± 13.0 | 0% | 13.3 | 247 | 123,215 |
Table 15.
Inference-time ablation on the RL-JSO checkpoint. Each condition is compared against the published RL-JSO baseline on the same (campaign, scenario, seed) cells using paired Wilcoxon signed-rank tests with Holm correction (significance markers: , , ). All values in this table are computed on mean fitness; the L1 + L2 row shows on fitness because re-enabling the fallback layers barely perturbs the objective value, while the same condition systematically degrades the safety margins and as reported in the text.
| Condition | C1 | C2 | C3 | C4 |
|---|---|---|---|---|
| Fixed DRIFT (fit ) | | | | |
| Fixed DRIFT () | (L) *** | (L) *** | (L) *** | (L) *** |
| Random (fit ) | | | | |
| Random () | (N) | (L) *** | (M) ** | (S) * |
| L1 + L2 enabled (fit ) | | | | |
| L1 + L2 enabled () | (S) | (S) | (S) | (N) |
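The Holm correction applied to the Wilcoxon p-values can be sketched as a step-down adjustment. This is the textbook procedure rather than the authors' code, and the p-values in the example are made up for illustration.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the k-th smallest p-value by (m - k), counting k from 0.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons.
raw = [0.01, 0.04, 0.03]
print([round(p, 4) for p in holm_adjust(raw)])
```

A comparison is declared significant at level α when its adjusted p-value is below α, which controls the family-wise error rate across the compared cells without assuming independence.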