To verify the effectiveness and adaptability of the proposed overlapping coalition formation method for heterogeneous UAV swarms, we design multiple simulation scenarios of different scales that cover a range of task densities, resource tightness, and UAV configurations. All experiments are conducted on a Windows 11 system, equipped with an Intel Core i9-13900K processor (3.00 GHz base frequency), two NVIDIA GeForce RTX 4090 GPUs with 24 GB of VRAM each, and 128.0 GB of RAM.
5.2. Performance Evaluation
Figure 6 reports convergence and final utilities under four resource conditions. SGRL-TS consistently outperforms the strongest baseline, PGG-TS-OCF, by 3.19%, 4.49%, 6.25%, and 9.68% in the abundant, balanced, constrained, and scarce settings, respectively. It enters the efficient ascent earlier and shows a smoother plateau. The gains arise from heterogeneous hypergraph attention, which captures high-order couplings among UAVs, tasks, and coalitions; a structure-conditioned hierarchical value decomposition that yields globally comparable, monotone scores and suppresses merge-split oscillations; and budgeted MCTS under feasibility masks, which focuses expansions on high-value structures and reduces wasted search.
Baseline behavior clarifies the gaps. PGG-TS-OCF employs a parallel population search that identifies feasible overlaps early on; however, it stalls at suboptimal mixes under budget constraints and lacks a stable cross-level yardstick. HYGMA strengthens interaction modelling, but long value propagation under multiple constraints slows ascent. RCFG-DRL introduces adversarial robustness; however, nonstationarity induces mid-horizon oscillations, diverting budget from structural improvements. LocalSearch-CF and SMART use anytime stepwise moves bounded by small neighborhoods, which trap the search in local optima and cap the attainable utility. Non-overlapping SGRL-TS forbids resource reuse, which creates capacity bottlenecks and synchronization penalties and heightens sensitivity to task density and temporal perturbations.
We next examine resource utilization efficiency, task adaptability, and robustness under overload and scarcity by varying the number of UAVs from 4 to 20 under different task loads, as shown in Figure 7. Across the four task scales, the curve of SGRL-TS remains at the top and reaches a higher peak near moderate swarm sizes; the standard deviation bars show markedly lower variability compared to all baselines. Relative to the strongest competitor, PGG-TS-OCF, SGRL-TS achieves higher peak utilities by approximately 2.27%, 3.00%, 6.36%, and 9.76% at task numbers 5, 10, 15, and 20, respectively. The advantage increases as resource constraints tighten, indicating stronger resource scheduling and parallel coordination under crowded and scarce conditions.
This performance stems from an explicit treatment of the benefit–cost balance. SGRL-TS estimates timing and energy constraints online and maps them onto a unified utility scale, which steers the search toward ranges where adding UAVs yields net gains while suppressing ineffective parallelism and communication congestion as scale grows, thus avoiding performance regression at large scales. Temperature-controlled sampling and reuse of candidate structures broaden exploration in the early phase, and as convergence proceeds the search settles on low-conflict configurations. Combined with penalties and pruning for repeated assignment and resource contention, these mechanisms reduce structural oscillation and tail-phase jitter. The result is a better compromise between task completion and coordination cost, leading to higher and more stable final utility.
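The temperature-controlled sampling mentioned above can be illustrated with a minimal sketch. It assumes that candidate structures carry scalar scores and that infeasible candidates are masked out before sampling; the function names, the scores, and the two temperatures are hypothetical and are not taken from the SGRL-TS implementation.

```python
import math
import random


def masked_softmax(scores, feasible, temperature):
    """Softmax over feasible candidates only; infeasible ones get probability 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    idx = [i for i, ok in enumerate(feasible) if ok]
    if not idx:
        raise ValueError("no feasible candidate")
    top = max(scores[i] for i in idx)  # subtract the max for numerical stability
    weights = {i: math.exp((scores[i] - top) / temperature) for i in idx}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in range(len(scores))]


def sample_candidate(scores, feasible, temperature, rng=random):
    """Draw one candidate index according to the masked, tempered distribution."""
    probs = masked_softmax(scores, feasible, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


# Illustrative values only: the last candidate scores highest but is infeasible.
scores = [1.0, 2.0, 3.0, 10.0]
feasible = [True, True, True, False]
hot = masked_softmax(scores, feasible, temperature=5.0)   # broad exploration
cold = masked_softmax(scores, feasible, temperature=0.1)  # near-greedy choice
```

A high temperature spreads probability across the feasible candidates, whereas a low temperature concentrates it on the best feasible one, which matches the exploration-then-convergence behavior described in the text.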
We further test adaptability under sparse and dense tasks, reuse efficiency, and robustness to task pressure by fixing the number of UAVs at 8, 12, and 16 and increasing the number of tasks from 3 to 12. Results are shown in Figure 8. SGRL-TS stays on top across the three UAV scales and reaches a higher peak near the midrange of task counts, while the tail declines more gently and the variability remains smaller. Compared with the strongest baseline, PGG-TS-OCF, the average utility over the full range improves by about 3.86% at 16 UAVs, 2.97% at 12 UAVs, and 3.53% at 8 UAVs.
This advantage and stability arise because bidirectional hypergraph attention normalizes task selectivity and contextual suitability within mask-constrained candidate sets, enabling precise member screening and task assignment as the number of tasks increases, which suppresses ineffective overlaps and resource contention. The SHIELD nested nonlinear aggregation with cross-task interaction terms provides a monotonic and comparable global value for cooperation and competition across tasks, making diminishing marginal returns detectable as the task load grows. This concentrates resources on actions with positive net gain, producing a higher midrange peak and slower performance decay.
To assess task completion under scaling, we vary the number of tasks from 4 to 12; results are in Figure 9. Across the three experimental settings, the SGRL-TS curve remains closely aligned with the upper bound provided by the Task-completion OCF baseline, which optimizes only task fulfillment. As the number of tasks increases from 4 to 12, SGRL-TS improves the average task execution sufficiency over the entire range by approximately 2.42%, 2.63%, and 10.94% relative to PGG-TS-OCF in the three settings. Moreover, when other methods exhibit pronounced degradation at higher task counts, SGRL-TS shows a significantly slower decline and can maintain a larger fraction of tasks close to complete execution, even under severely constrained resources. This advantage primarily stems from the balance term in the reward, which discourages extreme solutions that sacrifice a subset of tasks, thereby driving the policy to maintain medium to high completion levels across more tasks as the task load increases.
Figure 10a–c report the coalition-level temporal coordination performance of all methods. The evaluation metric is the coalition synchronization sufficiency, defined as the normalized score obtained from the coalition arrival-time deviation cost according to (23); larger values indicate more synchronized coalition arrivals under the given reference time scale. As the number of tasks increases, the synchronization sufficiency of all methods decreases overall, indicating that higher task congestion makes it harder for coalitions to achieve good temporal coordination; moreover, when the number of UAVs is reduced from 15 to 5, the overall degradation in synchronization performance becomes more pronounced. The advantage of SGRL-TS in synchronization sufficiency is most pronounced in configurations with more tasks and tighter resources, suggesting that structure-guided overlapping coalition formation combined with joint value decomposition can effectively suppress the dispersion in coalition arrival times. In contrast, non-overlapping SGRL-TS and Task-completion OCF, which focuses only on the task completion rate, exhibit significantly lower synchronization sufficiency under high-load scenarios, indicating that ignoring overlapping structures or lacking explicit synchronization modeling leads to markedly degraded temporal coordination among coalitions.
Previous experiments on task completion and utility have shown that SGRL-TS achieves returns close to the Task-completion OCF upper bound and outperforms PGG-TS-OCF. This section further examines its cost side from the perspective of energy utilization. As shown in Figure 11, under four task scales, the energy-efficiency curves of SGRL-TS lie consistently above those of all baselines. Compared with PGG-TS-OCF, the average energy efficiency over the entire UAV range improves by approximately 8.09%, 8.02%, 3.05%, and 12.85%, respectively. Moreover, relative to the Task-completion OCF scheme, which optimizes only task completion, SGRL-TS achieves comparable completion levels while attaining between two and five times higher energy efficiency, thereby providing a considerably more economical way of sustaining overlapping coalition structures from the energy consumption viewpoint. Overall, this advantage primarily stems from incorporating energy safety margins and residual-resource-driven feasible-set pruning into the MCTS search guided by SHIELD evaluations, which reduces the expansion of high-cost, overlapping structures at the search level, and thus markedly improves global energy utilization without sacrificing task completion.
To examine the sensitivity of average task utility and method ranking to multi-objective weight settings, task-priority scenarios, and time/energy normalization scales, we conducted comparative experiments in a scenario with 10 tasks and 20 UAVs, as shown in Figure 12. In Figure 12a, under fixed network parameters and training configurations, we change only the weight vector at evaluation time to assess the weight sensitivity of the multi-objective design. In Figure 12b, under the default setting, we keep the training process unchanged and modify only the task-priority directed acyclic graph to construct three scenarios (Balanced priorities, Rescue priority, and Communication priority), which form the horizontal axis. In Figure 12c, we scale the normalization ranges of the synchronization reference time and the energy reference value and investigate the impact of five combinations on the final task utility; the horizontal axis corresponds to these five configurations.
Across all configurations of the three sensitivity tests, SGRL-TS consistently attains the highest average task utility, with a performance gap of approximately 2–3 percentage points relative to the best-performing baseline, PGG-TS-OCF, and exhibits substantially smaller variance, indicating overall robustness to perturbations in weights, priority settings, and normalization scales. Specifically, in Figure 12a, the completion-emphasized weight vector increases the utilities of all four methods, whereas the synchronization- and energy-emphasized weight vectors reduce the overall utilities, with a particularly pronounced impact on RCFG-DRL, while the curve of SGRL-TS exhibits only mild fluctuations. Figure 12b shows that different task-priority topologies induce only slight changes in the utilities of all methods, and SGRL-TS can still better coordinate resources and synchronization constraints in the rescue-priority scenario, maintaining a stable performance lead. In Figure 12c, scaling the synchronization reference time or the energy reference value mainly changes the absolute level of utility, and more stringent normalization has a more pronounced negative impact on RCFG-DRL, whereas both the performance and variance of SGRL-TS vary only moderately.
We analyze the per-decision computational overhead of SGRL-TS under different swarm scales and search budgets, as shown in Table 5. We reuse the SGRL-TS policy trained in the previous experiments and perform online evaluation on six task configurations in inference mode without enabling backpropagation. For each configuration, we record the wall-clock time of the HAN encoding, the SHIELD mixing, the MCTS planning, and the end-to-end decision latency over consecutive decision steps, and we collect the average number of node expansions of MCTS under feasible-region pruning (Avg. exp.).
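The per-stage timing protocol described above can be sketched as follows. This is a minimal illustration using Python's `time.perf_counter`; the three stage functions are trivial placeholders standing in for HAN encoding, SHIELD mixing, and MCTS planning, and both `profile_decisions` and the step count of 50 are hypothetical choices, not the configuration used in the experiments.

```python
import time
from statistics import mean


def profile_decisions(stages, n_steps):
    """Return mean per-stage and end-to-end wall-clock latency in milliseconds."""
    per_stage = {name: [] for name, _ in stages}
    total = []
    for _ in range(n_steps):
        step_start = time.perf_counter()
        data = None
        for name, fn in stages:
            t0 = time.perf_counter()
            data = fn(data)  # each stage consumes the previous stage's output
            per_stage[name].append((time.perf_counter() - t0) * 1e3)
        total.append((time.perf_counter() - step_start) * 1e3)
    report = {name: mean(times) for name, times in per_stage.items()}
    report["end_to_end"] = mean(total)
    return report


# Placeholder stages (assumptions): they only mimic the encode -> mix -> plan order.
def encode(_):
    return [i * 0.5 for i in range(1000)]


def mix(features):
    return sum(features)


def plan(value):
    return max(range(32), key=lambda action: (action * value) % 7)


report = profile_decisions(
    [("encode", encode), ("mix", mix), ("plan", plan)], n_steps=50
)
```

Timing each stage inside the same loop that measures the end-to-end latency keeps the stage times and the total directly comparable, since all of them come from the same decision steps.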
As shown in Table 5, in S4 the end-to-end decision latency is about 5.54 ms, whereas the lower bound under the no-search configuration (S2, B = 0) is only 2.10 ms, indicating that even with structured MCTS enabled the overall overhead remains well below the typical UAV control period, which is on the order of tens of milliseconds, and thus satisfies real-time application requirements. As the task scale increases, the HAN encoding and SHIELD mixing times increase slowly from about 0.82/0.64 ms in S1 to about 1.05/0.82 ms in S4, an approximately linear growth trend that is consistent with the complexity result given in Section 4.4; this indicates that high-order structural modeling and structure-conditioned value decomposition do not themselves become the main bottlenecks. For the same swarm scale (S2), when the search budget is increased from 0 to 32 and 64, the average number of expanded nodes grows from 0 to about 21.37 and 41.96, and the corresponding MCTS planning time increases from 0 to 1.63 ms and 3.01 ms. At the same time, Avg. exp. consistently remains clearly below the budget B, which confirms that feasible-region pruning effectively suppresses the size of the search tree and makes the MCTS computational cost approximately linearly controllable with respect to the budget. Taken together, these results show that SGRL-TS achieves both low per-step latency and good scalability within the swarm sizes and search budgets considered in this work.
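The approximately linear relation between budget and search cost can be checked directly from the figures quoted above. The short calculation below uses only the reported budgets (32 and 64), average expansions (21.37 and 41.96), and MCTS times (1.63 ms and 3.01 ms); the "time per expansion" is a derived quantity that assumes fixed overhead is negligible, which is an assumption of this sketch rather than a reported result.

```python
budgets = [32, 64]
avg_expansions = [21.37, 41.96]  # Avg. exp. quoted in the text
mcts_ms = [1.63, 3.01]           # MCTS planning time quoted in the text

# Fraction of the budget actually expanded after feasible-region pruning.
fill_ratios = [n / b for n, b in zip(avg_expansions, budgets)]

# Mean planning time per expanded node, in microseconds.
per_node_us = [t / n * 1e3 for t, n in zip(mcts_ms, avg_expansions)]

for b, fill, cost in zip(budgets, fill_ratios, per_node_us):
    print(f"B={b}: expansions/budget={fill:.3f}, time per expansion={cost:.1f} us")
```

Both budgets expand roughly two thirds of the allowed nodes, and the cost per expansion stays in the range of about 0.07 ms, so doubling the budget roughly doubles the planning time.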
In Figure 13, we evaluate all algorithms in a scenario with 10 tasks and 15 heterogeneous UAVs. Multiple independent runs are conducted under different search budgets and random seeds to sample a set of feasible overlapping coalition-structure solutions. For each solution, we first compute the task execution sufficiency and the synchronization deviation cost according to (23), and then take the average over all tasks to obtain the overall task execution sufficiency and synchronization deviation; we also record the total energy cost incurred to complete all tasks. The normalized task shortfall is then derived from the overall task execution sufficiency. Furthermore, synchronization deviation and energy cost are min–max normalized over the union of all methods and sampled solutions to obtain the normalized synchronization deviation and normalized energy cost. Consequently, all three quantities are scaled to the interval [0, 1], with smaller values being better, which facilitates multi-objective Pareto analysis in a unified cost space.
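The pooled min–max normalization described above can be sketched as follows. The key point is that the minimum and maximum are taken over the union of all methods and sampled solutions, so that normalized costs remain comparable across methods. The function name and the toy values are illustrative assumptions, not measured data.

```python
def min_max_over_union(values_by_method):
    """Min-max normalize each method's costs using bounds pooled over all methods."""
    pooled = [v for values in values_by_method.values() for v in values]
    if not pooled:
        raise ValueError("no samples to normalize")
    lo, hi = min(pooled), max(pooled)
    span = hi - lo
    if span == 0:
        # All samples are identical; map everything to 0 by convention.
        return {m: [0.0 for _ in values] for m, values in values_by_method.items()}
    return {
        m: [(v - lo) / span for v in values]
        for m, values in values_by_method.items()
    }


# Toy energy costs for two hypothetical methods (illustrative values only).
raw_energy = {"method_A": [2.0, 4.0], "method_B": [6.0, 10.0]}
norm_energy = min_max_over_union(raw_energy)
```

Normalizing each method separately would map every method's best sample to 0 and hide differences between methods, which is why the pooled bounds are used here.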
From Figure 13a, the scatter of SGRL-TS is more concentrated in the lower-left region of the plane spanned by task shortfall and synchronization deviation. Compared with the baseline methods, it yields fewer dominated solutions, that is, fewer solutions that pair a small task shortfall with a large synchronization deviation or achieve good synchronization at the price of a significantly increased task shortfall. In Figure 13b, SGRL-TS maintains a smaller task shortfall under lower energy cost, whereas other methods typically require higher energy to achieve a similar level of task completion or suffer a larger task shortfall at comparable energy, indicating a more favorable trade-off between energy and task completion. Figure 13c further shows that, in the plane spanned by energy cost and synchronization deviation, the SGRL-TS samples overall lie closer to the lower-left Pareto boundary and form a more compact "knee" region around low energy and low synchronization deviation. In contrast, the samples of the baseline methods more frequently fall outside this frontier. Taken together, these results demonstrate that SGRL-TS achieves superior Pareto performance in the joint objective space of task completion, coalition-time synchronization, and energy consumption.
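The notion of dominated solutions used in this Pareto analysis can be made concrete with a small sketch. It assumes each sampled solution is a tuple of normalized costs in which smaller is better, as in the unified cost space described earlier; the points below are illustrative values, not experimental samples.

```python
def dominates(a, b):
    """True if a is no worse than b in every cost and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the points that no other point dominates (all costs minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Toy (shortfall, synchronization deviation, energy) triples, all in [0, 1].
points = [
    (0.1, 0.8, 0.3),
    (0.2, 0.2, 0.2),
    (0.3, 0.3, 0.3),  # dominated by (0.2, 0.2, 0.2)
    (0.9, 0.1, 0.5),
]
front = pareto_front(points)
```

This quadratic-time filter is adequate for a few hundred sampled solutions; counting how many of a method's samples survive the filter gives a simple summary of how often that method reaches the frontier.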
5.3. Ablation Studies
Table 6 presents a systematic ablation of the encoder, value decomposition, and global search modules under resource-neutral and resource-tight configurations. Here, Viol. denotes the constraint violation rate, AUC is the normalized area under the curve of average task utility Util versus training iterations, and Iter@95% is defined as the training iteration at which the Util curve first reaches 95% of its steady-state mean, where the steady-state mean is computed from a moving average over the final training window. Comparing SGRL-TS with typical graph-based DRL, under the resource-tight regime Full-SGRL-TS improves Compl, Util, and Eff over GAT-QMIX by about 23.9%, 22.2%, and 26.5%, respectively, while reducing Viol by 45.9%, increasing AUC by 48.3%, and shortening Iter@95% by 36.9%. These gains do not come from merely swapping in a GAT encoder and a QMIX mixer, but from jointly exploiting three structural mechanisms: HAN explicitly encodes the high-order UAV–task–coalition hypergraph so that structural semantics enter value estimation and search rather than being limited to one-hop graph attention; SHIELD injects structure-conditioned terms within and across coalitions, enabling finer modeling of cooperative gains and resource competition and yielding a more monotone and comparable global value; and the structured MCTS uses these signals for feasibility-set pruning and structural heuristics, so that coalition configurations are optimized globally around structural priors instead of relying on local GNN outputs.
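The two convergence metrics defined above can be sketched as follows. Two details are assumptions of this sketch rather than stated definitions: the steady-state mean is approximated by the plain mean of the last `window` points, and the AUC is normalized by the iteration span times the peak utility so that it lies in [0, 1] for non-negative curves. The utility curve is a toy example.

```python
def steady_state_mean(util, window):
    """Mean of the final `window` points of the utility curve."""
    if window <= 0 or window > len(util):
        raise ValueError("window must be between 1 and len(util)")
    tail = util[-window:]
    return sum(tail) / len(tail)


def iter_at_fraction(util, window, fraction=0.95):
    """First iteration index at which util reaches fraction * steady-state mean."""
    target = fraction * steady_state_mean(util, window)
    for i, u in enumerate(util):
        if u >= target:
            return i
    return None


def normalized_auc(util):
    """Trapezoidal area under the curve (unit spacing), scaled by span * peak."""
    if len(util) < 2:
        raise ValueError("need at least two points")
    peak = max(util)
    if peak <= 0:
        raise ValueError("peak utility must be positive for this normalization")
    area = sum((a + b) / 2 for a, b in zip(util, util[1:]))
    return area / ((len(util) - 1) * peak)


# Toy utility curve (illustrative values only).
util = [0.0, 0.5, 0.8, 0.96, 1.0, 1.0, 1.0]
it95 = iter_at_fraction(util, window=3)
auc = normalized_auc(util)
```

Under these assumptions a faster-rising curve yields both a smaller Iter@95% and a larger normalized AUC, so the two metrics move together in the way the ablation comparisons rely on.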
Module-level ablations further support this view. With SHIELD-full and structured MCTS fixed, replacing the encoder with a standard GAT (A1) or HyperGCN (A2) shows that, relative to A2, the full HAN still improves Util and Eff in the resource-neutral regime by about 2.2% and 4.6%, further reduces Viol by 17.9%, increases AUC by 6.9%, and shortens Iter@95% by 11.4%; under the resource-tight regime it maintains roughly 6.0% and 6.7% gains in Util and Eff and a 20.7% reduction in violation rate, confirming that heterogeneous node types and high-order hyperedges provide additional structural information beyond conventional graph encoders. For the value decomposition, when VDN or QMIX mixing is used on top of HAN, SHIELD-lite already improves Util and Eff over QMIX-mix by about 2.2% and 4.7%, reduces Viol by 11.7%, increases AUC by 6.0%, and shortens Iter@95% by 5.7%; enabling full SHIELD further raises Util and Eff relative to QMIX-mix by 3.4% and 6.9%, decreases Viol by 22.0%, increases AUC by 8.4%, and reduces Iter@95% by 15.3%, with similar relative improvements in the resource-tight regime, indicating that the structure-conditioned nested mixer improves decomposability and credit assignment beyond monotonic mixing networks. In the global search module, removing search causes Util and Eff to drop by about 4.5% and 6.1%, Viol to increase by 34.4%, AUC to decrease by 11.3%, and Iter@95% to lengthen by 24.6% compared with Full-SGRL-TS; greedy search and plain MCTS each partially close this gap but still underperform structured MCTS in AUC and Iter@95%, showing that high-order structure-guided feasibility pruning and budget allocation are likewise crucial to achieving high AUC and fast, low-violation convergence.