Overlapping Coalition Formation for Resource Allocation in Post-Disaster Rescue UAV Swarms

Wenxin Li; Yongxin Feng; Fan Zhou; Konstantin Igorevich Kostromitin; Jian Wang; Peiying Zhang

doi:10.3390/drones9120837


¹ School of Information Science and Engineering, Shenyang Ligong University, Shenyang 110158, China

² Department of Physics of Nanoscale Systems, South Ural State University, Chelyabinsk 454080, Russia

³ Institute of Radioelectronics and Information Technologies, Ural Federal University, Yekaterinburg 620002, Russia

⁴ College of Science, China University of Petroleum (East China), Qingdao 266580, China

Drones 2025, 9(12), 837; https://doi.org/10.3390/drones9120837

This article belongs to the Topic Advanced Technologies and Applications for Unmanned Systems


Highlights

What are the main findings?

We introduce a hypergraph attention-based structured representation that captures high-order relations among tasks, UAVs, and coalitions, enabling overlapping coalition formation in heterogeneous swarms.
We develop a structure-conditioned hierarchical value decomposition coupled with limited-budget Monte Carlo Tree Search with demonstration replay and distillation, achieving global optimization and policy feedback under operational constraints.

What are the implications of the main findings?

The combination of overlapping coalitions and structure-aware policy learning promotes cross-task resource reuse, alleviates bottlenecks under tight constraints, and strengthens global coordination for post-disaster multi-tasking.
Feasible-region pruning and delay-aware search help maintain convergence efficiency and stability under latency, communication, and energy limits, pointing to a practical path toward real-world deployment.

Abstract

Unmanned aerial vehicle (UAV) swarms, equipped for distributed sensing and rapid response, can form coalitions to undertake complex missions such as post-disaster relief, communication support, and payload delivery. However, typical coalition formation methods assign each UAV to a single task, limiting cross-task resource sharing. To address this, we investigate overlapping coalition formation (OCF) for UAV swarms, where a single UAV is permitted to participate in multiple coalitions, enabling resource reuse and reducing idleness. We formulate OCF as a multi-objective combinatorial optimization problem that jointly balances task fulfillment ratio, coalition synchronization deviation, and operational cost, while explicitly accounting for inter-coalition resource contention and execution precedence. Specifically, we first construct a hypergraph representation of UAVs and tasks and employ a hypergraph attention network to capture their high-order interactions. Next, we propose a structure-aware hierarchical value decomposition method for policy learning, which progressively aggregates individual- and coalition-level information, models member complementarity and inter-coalition cooperative–competitive relations, and generates a global value estimate that is sensitive to changes in coalition structure. Furthermore, we integrate Monte Carlo Tree Search, utilizing the learned value as a heuristic to efficiently explore the feasible region, and close the loop with candidate-structure demonstration replay and policy distillation, enabling search to refine the learned policy. In multi-scale rescue simulations, the proposed approach improves task utility by up to 11.4% over the best-performing baseline and increases energy efficiency by more than 228% compared to a non-overlapping coalition variant.

Keywords:

unmanned aerial vehicle swarm; overlapping coalition formation; multi-objective combinatorial optimization

1. Introduction

With continuing advances in avionics and wireless networking, unmanned aerial vehicle (UAV) swarms can undertake cross-scale tasks including reconnaissance, search and rescue, communication assurance, and emergency power supply [1], even when terrain is constrained and infrastructure is damaged [2]. Nevertheless, swarm-level collaboration essentially depends on the organizational mechanisms of task allocation and resource configuration [3]. Under constraints on endurance, payload, and sensing capacity, single-UAV assignment paradigms struggle to satisfy the joint resource and temporal requirements of complex tasks [4]. To enhance system-level feasibility and overall benefit, forming coalitions that aggregate capabilities across platforms [5,6] becomes a necessary pathway for achieving this goal.

Resource scheduling, execution precedence, and cost constraints at the task level jointly constitute the key difficulties of coalition coordination [7]. With the rise of modular payloads and multi-source sensor fusion [8], a single platform can be configured with diverse functional modules and can simultaneously carry multiple resource types in varying quantities to execute different kinds of tasks. For example, in military settings, a UAV may consecutively perform target tracking and communication relay tasks [9], whereas in civilian settings, a rescue UAV may consecutively conduct target-detection and payload-delivery tasks. However, prevalent non-overlapping coalition formation methods typically restrict each UAV to a single coalition and a single task within the same time window. Under high task load, this restriction prolongs waiting chains, and under low load, it yields idle capabilities and fragmented utilization, which degrades task completion and energy efficiency. Therefore, we study overlapping coalition formation (OCF) [10] for heterogeneous UAV swarms in multi-task disaster-relief scenarios, as illustrated in Figure 1. Subject to feasibility constraints on energy, communication, and deadlines, OCF permits a UAV to belong to multiple task coalitions within the planning horizon, enabling cross-task capability flow and resource reuse to alleviate threshold bottlenecks and improve overall resource utilization efficiency.

Figure 1. Illustrative coalition formation modes in a heterogeneous UAV swarm. (a) Non-overlapping coalition formation where each coalition executes one task. (b) Sequential multi-task formation where a coalition serves multiple tasks in sequence. (c) Overlapping coalition formation where UAVs can participate in multiple coalitions and jointly execute several tasks over time, enabling cross-task capability flow and resource reuse.

In post-disaster contexts, OCF can be cast as a multi-objective combinatorial optimization problem with explicit resource competition [11]. Due to communication degradation and temporary restrictions, observation ranges shrink markedly, and information becomes fragmented [12]; as a result, each UAV relies primarily on local views and cannot readily grasp global dependencies across tasks and resources. Meanwhile, the number and urgency of tasks fluctuate over short intervals, and feasible routes and required preconditions change frequently [13]. Without continual updates to coalitions and resource allocations, previously feasible plans become invalid, which induces execution interruptions and response delays. Moreover, overlapping organization causes a single UAV to serve multiple tasks across time windows: consumable resources exhibit congestion effects, non-consumable capabilities are subject to occupancy and exclusivity, coalitions compete for scarce and critical capabilities, tasks display both cooperative and mutually exclusive relations, and task value becomes markedly non-additive. These factors induce credit-assignment bias and global scheduling imbalance, which makes it difficult to align locally optimal choices with system-level optimality.

Furthermore, the decision space is a joint combination of coalition structure, resource allocation, and temporal scheduling, and is high-dimensional, with a branching explosion [14]. Multiple objectives trade off against each other: higher task fulfillment often demands greater energy expenditure and tighter synchronization, whereas reducing operational cost and synchronization deviation may sacrifice fulfillment. Static weighting and local greediness are therefore often ineffective. Under partial observability and highly dynamic conditions, the central challenges are maintaining feasibility and responding rapidly to time-varying constraints, as well as achieving a globally coordinated, multi-objective balance amid competition.

To address these challenges, we propose a Structure-Guided Reinforcement Learning with Tree Search (SGRL-TS) framework. Using a hypergraph as the carrier and bidirectional attention to extract high-order relations among UAVs, task demands, and coalition capabilities, the framework consolidates local observations into structural semantics that enhance identifiability under incomplete information. It then employs structure-conditioned hierarchical value modeling to characterize intra-coalition complementarity and conflict while capturing inter-task interactions, thereby improving the robustness of value estimation under delayed and partial information. Building on this foundation, we adopt budgeted Monte Carlo Tree Search (MCTS) with feasible-region pruning based on energy, bandwidth, and deadlines to generate a small set of high-quality candidate coalition structures and to feed them back to policy learning via demonstration replay and value distillation, achieving efficient and scalable OCF under constrained communication and time-varying feasibility. Our contributions are as follows:

We formulate overlapping coalition formation on a time-varying heterogeneous hypergraph and design a masked bidirectional Hypergraph Attention Network. This network incorporates task-to-node and node-to-task attention, an operator view of higher-order propagation, and structural diversity regularization, yielding structure-aware node and task embeddings that encode task selectivity, contextual suitability, and structural roles under evolving coalition structures.
We develop Structure-aware Hierarchical Value-decomposition for Efficient Multi-task Coordination (SHIELD), which uses a structure-attribution encoder and agent-conditional Q functions together with nested nonlinear intra-coalition composition and sparsified cross-task interaction aggregation. This approach constructs a global value $Q_{tot}$ that remains monotone and comparable across partial coalition structures, supporting stable credit assignment under the multi-objective reward.
We integrate SHIELD with a structure-prior and a feasible-set pruned budgeted MCTS. Masked SHIELD evaluations, progressive widening, upper-bound and dominance pruning, and hypervolume-based Pareto selection produce a compact set of high-quality overlapping coalition structures. These structures are injected into learning via scheduled demonstration replay and structure-consistent Q-value distillation. Experiments on heterogeneous UAV–multi-task disaster-relief scenarios demonstrate consistent gains in task utility, energy efficiency, conditional near-linear scalability, and improved ablation results compared to baselines.

2. Related Work

2.1. Overlapping Coalitions: Advantages of Hypergraph Modeling over Other OCF Algorithms

In multi-agent cooperation and resource allocation, the traditional non-overlapping Coalition Formation Game (CFG) has been widely adopted. Its core assumption is that each agent can join only one coalition to complete a specific task. Guo et al. [15] designed a fast CFG solver for urgent-task scheduling based on neighborhood search and coalition reorganization, which starts from an initial partition and performs local search to find member-replacement plans that reduce task delay. Yan et al. [16] proposed a hierarchical nested CFG framework for large-scale heterogeneous UAV swarms. The method first forms task groups via priority-based clustering at the upper level and then solves coalition partitioning within each group to optimize resource matching at the lower level. Although these methods attain good local optima in specific scenarios, they rely on the non-overlapping assumption, which prevents cross-task resource reuse and yields limited adaptability of coalition partitions in heterogeneous multi-task settings.

To overcome these limitations, researchers introduced the Overlapping Coalition Formation Game (OCFG), which enables agents to participate in multiple coalitions, allowing for flexible resource reuse in heterogeneous multi-task environments. Qi et al. [17] formulated task allocation as a sequential overlapping coalition formation game in which coalitions are iteratively adjusted via a bilateral mutual-benefit transfer rule that reallocates per-UAV resources across tasks until a stable overlapping partition is reached. Zhang et al. [18] developed an OCFG for vehicular edge computing that jointly optimizes information transmission rate and energy-sharing efficiency and balances multi-coalition participation via transfer payments and a virtual-currency mechanism. Qi et al. [19] combined OCFG with an auction mechanism for UAV data-collection task assignment, treating UAVs as buyers that can participate across coalitions and iteratively maximizing revenue through bipartite matching auctions and coalition games. Overall, OCFG outperforms traditional CFG in terms of resource efficiency and task adaptivity. However, most existing OCFG studies rely on local marginal contributions or linearly weighted utilities and lack explicit modeling of high-order interactions among tasks, agents, and coalitions. As a result, policies generalize poorly in dynamic multi-task environments and coalition structures tend to be trapped in local optima.

2.2. Structure Modeling: Advantages over Non-Structured Modeling

As the complexity of multi-agent cooperation grows, structure-aware modeling has been introduced to enhance interaction perception and policy generalization. By explicitly representing inter-agent relations in graphs or geometric structures, several works have reported progress in task cooperation and system performance. HYGMA [20] constructs a latent hypergraph where dynamically grouped agents are represented as hyperedges. Coalition structures emerge implicitly as the policy attends to and aggregates over these hyperedges, an approach that avoids reliance on predefined or pairwise coalitions. Zhang et al. [21] proposed a coalition formation algorithm inspired by leaf-vein geometry, which predefines vein-like branching to assign functional roles and then uses heuristic pattern matching to adjust node layouts. This structure enhances global cooperative stability but remains rule-based and static. Liao et al. [22] presented a two-stage cooperative anti-jamming scheme for radar networks under wideband sweep jamming. During coalition formation, the entire frequency-hopping range is partitioned into mutually exclusive sub-bands, each treated as a coalition unit, and coalitions are updated via a coalition formation game. This realizes structured modeling and optimization of radar cooperation at the sub-band level.

These structured approaches strengthen cooperation awareness and deliver improved stability and generalization during resource sharing and local task switching. Nevertheless, most of them are confined to first-order adjacency and cannot capture higher-order dependencies among tasks, agents, and coalitions. Consequently, UAVs cannot accurately identify their roles across multiple coalitions in multi-task settings. In addition, structures are often treated as fixed input features rather than directly participating in policy optimization.

2.3. Global Optimization and Policy Feedback: Integrating MCTS with Local Reinforcement Learning

To move beyond local heuristics and short-sighted payoff optimization, several studies have pursued coalition generation and stabilization driven by global objectives or long-term equilibria. Taguelmimt et al. [23] treat coalition structure generation as a centralized global optimization problem and propose SMART, which first uses offline coalition selection to precompute and prune promising coalitions, and applies complementarity-based dynamic programming with gradual, integer-partition–graph search to exhaustively derive the globally optimal coalition structure under a given utility. Chen et al. [24] formulated a global-optimization CFG for multi-UAV cooperative defense by maximizing defense coverage. The approach defines coalition utility in terms of global coverage and threat-suppression ratio, and incorporates dynamic coalition reconstruction and global utility re-estimation to ensure monotonic improvement toward the global objective. It improves coverage but relies heavily on a centralized coordinator, which reduces efficiency when handling asynchronous updates and dynamic tasks. Xiong et al. [25] proposed a global coalition formation method for radar-UAV multi-target pursuit based on multi-objective cooperative games, using evolutionary games and global Nash equilibria to determine stable structures and provide long-term optimality. Although these methods improve system-wide performance through global utilities or equilibria, most rely on deterministic optimization or equilibrium derivation and lack active exploration in the combinatorial space of large-scale overlapping coalitions.

To remedy these issues, some works integrate game-theoretic models with deep reinforcement learning so as to retain long-term optimality while improving the adaptability of coalition structures. Zhou et al. [26] model coalition dynamics as a three-party repeated coalition formation game and realize coalition formation via a switch-operation–based coalition selection process, where deep reinforcement learning optimizes each party’s continuous non-coalitional strategies while the game-theoretic analysis guarantees convergence to a stable coalition partition in each time slot. Zhang et al. [27] designed a distributed DRL and CFG framework for coalition selection in federated learning with heterogeneous data. Periodic training utilizes global accuracy and energy efficiency as rewards, enabling nodes to adaptively adjust the probabilities of joining different coalitions across rounds, thereby improving global convergence and energy efficiency. However, the process predominantly relies on experience replay and local policy convergence, which limits the adaptation speed when coalition structures undergo significant changes.

Overall, it is difficult for prior work to simultaneously capture high-order structural relations and explore globally optimal coalition structures. In particular, for overlapping coalition formation with heterogeneous UAV swarms, there is a lack of mechanisms that exploit structural semantics to actively guide policy optimization and to coordinate local learning with global search. To address this gap, we propose a structure-aware and globally guided OCF method that enhances high-order interaction modeling among tasks, agents, and coalitions through a hypergraph-based representation, and achieves long-term optimality by coordinating global search with local policy learning.

3. System Model and Problem Statement

3.1. System Architecture of Overlapping Coalitions and Hypergraph-Based Representation

3.1.1. Task Resource Types

Let the task set be $T = \{T_{1}, T_{2}, \dots, T_{M}\}$. Each task $T_{m}$ requires a combination of resources. We categorize resources into two types: consumable and non-consumable. Consumable resources are physical items that are depleted during execution, such as rescue kits, medicines, or spraying agents. Their demand is $L_{m}^{(c)} \in \mathbb{R}^{d_{c}}$, where each dimension specifies the required quantity of a specific consumable. Non-consumable resources are functional capabilities that support a task without being depleted, such as communication, relay, or information acquisition. Their demand vector is $L_{m}^{(nc)} \in \mathbb{R}^{d_{nc}}$. The overall demand for task $T_{m}$ is $L_{m} = [L_{m}^{(c)}; L_{m}^{(nc)}] \in \mathbb{R}^{d_{r}}$ with $d_{r} = d_{c} + d_{nc}$. Each task further has key attributes including spatial location $p_{m} \in \mathbb{R}^{2}$, priority coefficient $β_{m} \in [0, 1]$, and maximum waiting time (deadline tolerance) $t_{m}^{dead}$. The complete task feature vector is $x_{m} = [L_{m}^{(c)}; L_{m}^{(nc)}; p_{m}; β_{m}; t_{m}^{dead}]$.

3.1.2. UAV Node State

Let the UAV swarm be $U = \{u_{1}, u_{2}, \dots, u_{N}\}$. The state vector of UAV $u_{n}$ comprises the remaining energy $e_{n} \in [0, E_{n}^{\max}]$, the load of consumable resources $r_{n}^{(c)} \in \mathbb{R}^{d_{c}}$, the non-consumable capability vector $r_{n}^{(nc)} \in \mathbb{R}^{d_{nc}}$, the current position $p_{n} \in \mathbb{R}^{2}$, the maximum flight speed $v_{n}^{\max}$, and a representation of the recent task-history sequence $\mathrm{seq}_{n}$. Thus, the full UAV state is $x_{n} = [e_{n}; r_{n}^{(c)}; r_{n}^{(nc)}; p_{n}; v_{n}^{\max}; \mathrm{seq}_{n}]$.

3.1.3. Overlapping Coalition Modeling

For each task $T_{m}$, define its UAV coalition as $C_{m} \subseteq U$. The coalition structure for the task set $T$ is $C = \{C_{1}, \dots, C_{m}, \dots, C_{M}\}$. For any task $T_{m}$, Equation (1) specifies the set of coalition members that allocate resources to it. The resource-allocation vector of $T_{m}$ is given in Equation (2), where $A_{(T_{m})}^{(u_{n})}$ denotes the resource vector contributed by UAV $u_{n}$ to task $T_{m}$ as defined in Equation (3).

$Coa (A_{(T_{m})}) = \{u_{n} \in U \mid A_{(T_{m})}^{(u_{n})} \neq \emptyset\}$

(1)

$A_{(T_{m})} = \{A_{(T_{m})}^{(u_{1})}, \dots, A_{(T_{m})}^{(u_{n})}, \dots, A_{(T_{m})}^{(u_{N})}\}$

(2)

$A_{(T_{m})}^{(u_{n})} = \{α_{u_{n}, T_{m}}^{(1)}, \dots, α_{u_{n}, T_{m}}^{(z_{c})}, \dots, β_{u_{n}, T_{m}}^{(z_{nc})}, \dots, β_{u_{n}, T_{m}}^{(Z)}\}$

(3)

Specifically, $α_{u_{n}, T_{m}}^{(z_{c})}$ denotes the quantity of consumable resource type $z_{c}$ that UAV $u_{n}$ allocates to task $T_{m}$, and $β_{u_{n}, T_{m}}^{(z_{nc})}$ denotes the quantity of non-consumable capability type $z_{nc}$ that UAV $u_{n}$ allocates to task $T_{m}$. For any UAV $u_{n} \in U$, the set of tasks it participates in is $M_{n} = \{m \mid u_{n} \in C_{m}\}$. If $|M_{n}| > 1$, then $u_{n}$ belongs to multiple task coalitions within the planning horizon, and the coalition structure is overlapping.

To capture complex overlapping relations and high-order interactions between UAVs and tasks, we formalize the coalition structure as a hypergraph $G = (V, E)$, where the node set $V$ corresponds to the UAV set $U$ and each node $v_{n} \in V$ represents a UAV with feature $x_{n}$; the hyperedge set $E$ corresponds to the task set $T$ and each hyperedge $e_{m} \in E$ connects multiple nodes that participate in task $T_{m}$, with initial hyperedge feature $x_{m}$. A node belonging to multiple hyperedges represents overlapping coalition membership, and a hyperedge connecting multiple nodes represents multi-agent cooperation within a coalition. This hypergraph provides a formal basis for employing a hypergraph attention network to capture high-order relations between UAVs and tasks.
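As a minimal sketch of the hypergraph view described above, a coalition structure can be stored as a node-by-hyperedge incidence matrix, from which the task sets $M_{n}$ and the overlap test $|M_{n}| > 1$ follow directly. The function names, the 0-based indexing, and the toy structure are illustrative assumptions, not part of the original formulation.

```python
from typing import Dict, List, Set


def incidence_matrix(coalitions: Dict[int, Set[int]],
                     num_uavs: int, num_tasks: int) -> List[List[int]]:
    """Return H with H[n][m] = 1 when UAV n is a member of coalition C_m."""
    H = [[0] * num_tasks for _ in range(num_uavs)]
    for m, members in coalitions.items():
        for n in members:
            H[n][m] = 1
    return H


def task_sets(H: List[List[int]]) -> List[Set[int]]:
    """M_n = {m | u_n in C_m}: the hyperedges incident to each node."""
    return [{m for m, h in enumerate(row) if h} for row in H]


def is_overlapping(H: List[List[int]]) -> bool:
    """The structure is overlapping when some UAV has |M_n| > 1."""
    return any(sum(row) > 1 for row in H)


# Hypothetical toy structure (0-indexed): UAV 1 serves both task 0 and task 1.
coalitions = {0: {0, 1}, 1: {1, 2}, 2: {3}}
H = incidence_matrix(coalitions, num_uavs=4, num_tasks=3)
M = task_sets(H)
overlap = is_overlapping(H)
```

In this toy case the row of UAV 1 contains two ones, so the structure counts as overlapping; a non-overlapping partition would have exactly one nonzero entry per assigned UAV.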

3.2. Multi-Objective Design

The core of overlapping coalition formation for UAV swarms is to evaluate the overall effectiveness when multiple UAVs execute multiple tasks, which determines whether a coalition structure is optimal. We design three objectives that quantify task fulfillment, coalition synchronization deviation, and operational cost, respectively.

Task execution sufficiency: We quantify the consistency between the resources allocated to task $T_{m}$ and its demands by first computing per-dimension satisfaction ratios, then aggregating them within the consumable and non-consumable classes, and finally combining the two class-level scores into the overall task measure.
For consumable resources, the per-dimension satisfaction ratio is defined as

$s_{c}^{(z_{c})} = \min (\frac{λ_{T_{m}}^{(z_{c})}}{l_{T_{m}}^{(z_{c})}}, 1), z_{c} \in T_{z_{c}}, l_{T_{m}}^{(z_{c})} > 0,$

(4)

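where $λ_{T_{m}}^{(z_{c})} = \sum_{u_{n} \in Coa (A_{(T_{m})})} α_{u_{n}, T_{m}}^{(z_{c})}$ is the total quantity of consumable type $z_{c}$ allocated to $T_{m}$ and $l_{T_{m}}^{(z_{c})}$ is the corresponding demand. Given weights $\{w_{z_{c}}\}_{z_{c} \in T_{z_{c}}}$ with $\sum_{z_{c} \in T_{z_{c}}} w_{z_{c}} = 1$, we adopt a shortage-sensitive generalized mean for within-class aggregation (for $ρ_{c} = - 1$ this is the weighted harmonic mean):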

$F_{c} (T_{m}; ρ_{c}) = {(\sum_{z_{c} \in T_{z_{c}}} w_{z_{c}} {[s_{c}^{(z_{c})}]}^{ρ_{c}})}^{1 / ρ_{c}}, ρ_{c} = - 1 .$

(5)

For non-consumable resources, the per-dimension satisfaction ratio is defined as

$s_{n}^{(z_{n c})} = \min (\frac{ϕ_{z_{n c}} (λ_{T_{m}}^{(z_{n c})})}{ϕ_{z_{n c}} (l_{T_{m}}^{(z_{n c})})}, 1), z_{n c} \in T_{z_{n c}}, l_{T_{m}}^{(z_{n c})} > 0,$

(6)

where $λ_{T_{m}}^{(z_{n c})}$ denotes the coalition-level equivalent capability on that dimension, and $ϕ_{z_{n c}}$ is a monotone mapping to form a dimensionless, larger-is-better scale. Given weights $\{v_{z_{n c}}\}_{z_{n c} \in T_{z_{n c}}}$ with $\sum_{z_{n c} \in T_{z_{n c}}} v_{z_{n c}} = 1$, the within-class aggregation is the limiting generalized mean, which for strictly positive weights reduces to the smallest satisfaction ratio:

$F_{n} (T_{m}) = {(\sum_{z_{n c} \in T_{z_{n c}}} v_{z_{n c}} {[s_{n}^{(z_{n c})}]}^{ρ_{n}})}^{1 / ρ_{n}}, ρ_{n} \to - \infty .$

(7)

The overall task measure is computed as the product of the two class-level scores,

$f_{1} = F_{c} (T_{m}; ρ_{c}) \times F_{n} (T_{m}), f_{1} \in [0, 1] .$

(8)
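The aggregation in Equations (4), (5), (7), and (8) can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the non-consumable scores are passed in already mapped through $ϕ_{z_{n c}}$ (whose form the text does not specify), all weights are assumed strictly positive, and the function names and sample numbers are hypothetical.

```python
from typing import Sequence


def satisfaction(allocated: float, demand: float) -> float:
    """Eq. (4): capped allocation-to-demand ratio (demand must be positive)."""
    if demand <= 0:
        raise ValueError("demand must be positive")
    return min(allocated / demand, 1.0)


def harmonic_aggregate(scores: Sequence[float],
                       weights: Sequence[float]) -> float:
    """Eq. (5) with rho_c = -1: weighted harmonic mean of the scores."""
    if any(s <= 0 for s in scores):
        return 0.0  # limiting value when any dimension is completely unmet
    return 1.0 / sum(w / s for s, w in zip(scores, weights))


def min_aggregate(scores: Sequence[float]) -> float:
    """Eq. (7) as rho_n -> -inf: the worst dimension dominates."""
    return min(scores)


def task_sufficiency(alloc_c: Sequence[float], demand_c: Sequence[float],
                     weights_c: Sequence[float],
                     scores_nc: Sequence[float]) -> float:
    """Eq. (8): product of the two class-level scores."""
    s_c = [satisfaction(a, d) for a, d in zip(alloc_c, demand_c)]
    return harmonic_aggregate(s_c, weights_c) * min_aggregate(scores_nc)


# Hypothetical task: half of the first consumable is missing, and one
# non-consumable capability is met at 90 percent.
f1 = task_sufficiency(alloc_c=[4.0, 10.0], demand_c=[8.0, 10.0],
                      weights_c=[0.5, 0.5], scores_nc=[1.0, 0.9])
```

Here the consumable ratios are 0.5 and 1.0, whose equal-weight harmonic mean is 2/3; multiplying by the minimum non-consumable score 0.9 gives $f_{1} = 0.6$, which shows how a single shortage pulls the score down more than an arithmetic mean would.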
Coalition synchronization deviation: To represent arrival latency, waiting energy, and deadline constraints under a unified dimension, the precedence relations among tasks are modeled by a directed acyclic graph $P$ ; when only task priorities are available, they become a total order within $P$ . For member $u_{n}$ of coalition $C_{m}$ , the sequence of tasks that must be completed before executing $T_{m}$ is defined as:

${Seq}_{n} = (Assigned (u_{n}) \cap Pred (T_{m})) \cup \{T_{m}\}$

(9)

Based on this sequence, two aggregate quantities are introduced for latency and energy calculation: $L_{n}$ denotes the total flight distance before reaching $T_{m}$ , and $S_{n}$ denotes the total processing time that must be completed before reaching $T_{m}$ , defined as:

$L_{n} = \sum_{j = 1}^{J} d_{n, j}, S_{n} = \sum_{j = 1}^{J} s_{n, j}$

(10)

where $d_{n, j}$ is the flight distance of the j-th leg and $s_{n, j}$ is the processing time that must be completed before that leg; if no processing is required, it is set to zero. Accordingly, the arrival time of $u_{n}$ can be written as a function of speed:

$t_{n}^{arr} (v) = \frac{L_{n}}{v} + S_{n}$

(11)

Under the speed bounds $v_{n}^{\min} \leq v \leq v_{n}^{\max}$, the member that arrives last when every member cruises at its maximum admissible speed sets the earliest feasible synchronization time:

$t_{m}^{syn *} = \max_{u_{n} \in C_{m}} t_{n}^{arr} (v_{n}^{\max}) = \max_{u_{n} \in C_{m}} (\frac{L_{n}}{v_{n}^{\max}} + S_{n})$

(12)

Given $t_{m}^{syn *}$ , each remaining member selects the unique target speed that achieves on-time arrival without violating the bounds:

$v_{n}^{*} = clip (\frac{L_{n}}{t_{m}^{syn *} - S_{n}}, v_{n}^{\min}, v_{n}^{\max}), \quad clip (x, a, b) = \min \{\max \{x, a\}, b\}$

(13)

If the required speed falls below $v_{n}^{\min}$ , the member cruises at $v_{n}^{\min}$ and waits upon arrival; the waiting time is:

$t_{n}^{hov} = \max \{0, t_{m}^{syn *} - t_{n}^{arr} (v_{n}^{\min})\} = \max \{0, t_{m}^{syn *} - \frac{L_{n}}{v_{n}^{\min}} - S_{n}\}$

(14)

Consequently, the synchronized start time of the coalition satisfies:

$t_{m}^{syn} = t_{m}^{syn *}$

(15)

To reflect the energy cost incurred during waiting within a unified metric, let $p_{n}^{fly} (v)$ denote the flight power and $p_{n}^{hov}$ denote the hovering power, and impose the ordering constraint:

$p_{n}^{hov} \geq p_{n}^{fly} (v_{n}^{econ})$

(16)

where $v_{n}^{econ}$ is an empirical economical cruise speed. The energy expenditure of $u_{n}$ before $T_{m}$ is then:

$E_{n} = p_{n}^{fly} (v_{n}^{*}) \cdot \frac{L_{n}}{v_{n}^{*}} + p_{n}^{hov} \cdot t_{n}^{hov}$

(17)

Further, given the deadline $D_{m}$ of task $T_{m}$ , define the lateness as:

$l_{m} = \max \{0, t_{m}^{syn} - D_{m}\}$

(18)

Finally, the synchronization latency, the deadline deviation, and the energy differential during waiting are combined via time-equivalent coefficients into a single objective:

$f_{2} (T_{m}) = J_{syn} (T_{m}) = t_{m}^{syn} + λ_{late} l_{m} + μ_{hov} \sum_{u_{n} \in C_{m}} (p_{n}^{hov} - p_{n}^{fly} (v_{n}^{*})) t_{n}^{hov}$

(19)

where $λ_{late} \geq 0$ and $μ_{hov} \geq 0$ are calibration weights that convert the energy penalty into a time-equivalent cost, allowing it to be summed with $t_{m}^{syn}$ and $l_{m}$ under a unified dimension.
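The chain from Equation (12) to Equation (19) can be traced with a short sketch. It assumes each member is described by its pre-task distance $L_{n}$, processing time $S_{n}$, speed bounds with $v_{n}^{\min} > 0$, and hovering power; the dictionary keys, the constant flight-power function, and all numbers are hypothetical placeholders.

```python
from typing import Callable, Dict, List


def clip(x: float, lo: float, hi: float) -> float:
    """clip(x, a, b) = min(max(x, a), b), as in Eq. (13)."""
    return min(max(x, lo), hi)


def sync_objective(members: List[Dict[str, float]], deadline: float,
                   lam_late: float, mu_hov: float,
                   fly_power: Callable[[float], float]) -> Dict[str, object]:
    # Eq. (12): the last arrival at maximum speed fixes the start time.
    t_syn = max(m["L"] / m["v_max"] + m["S"] for m in members)
    speeds, hovers, penalty = [], [], 0.0
    for m in members:
        slack = t_syn - m["S"]  # flight time available to this member
        v_req = m["L"] / slack if slack > 0 else m["v_max"]
        v = clip(v_req, m["v_min"], m["v_max"])                 # Eq. (13)
        t_hov = max(0.0, t_syn - m["L"] / m["v_min"] - m["S"])  # Eq. (14)
        penalty += (m["p_hov"] - fly_power(v)) * t_hov
        speeds.append(v)
        hovers.append(t_hov)
    lateness = max(0.0, t_syn - deadline)                       # Eq. (18)
    f2 = t_syn + lam_late * lateness + mu_hov * penalty         # Eq. (19)
    return {"t_syn": t_syn, "speeds": speeds, "hovers": hovers,
            "lateness": lateness, "f2": f2}


# Hypothetical three-member coalition; the flight power is held constant
# here only to keep the arithmetic easy to follow.
members = [
    {"L": 1000.0, "S": 0.0, "v_min": 5.0, "v_max": 20.0, "p_hov": 250.0},
    {"L": 600.0, "S": 40.0, "v_min": 5.0, "v_max": 20.0, "p_hov": 250.0},
    {"L": 100.0, "S": 0.0, "v_min": 5.0, "v_max": 20.0, "p_hov": 250.0},
]
result = sync_objective(members, deadline=60.0, lam_late=2.0,
                        mu_hov=0.001, fly_power=lambda v: 200.0)
```

In this example the second member sets $t_{m}^{syn *} = 70$ s, the first slows to about 14.3 m/s, and the third cannot fly slower than 5 m/s, so it arrives after 20 s and hovers for 50 s. With a 60 s deadline the lateness is 10 s, and the objective evaluates to 70 + 2 × 10 + 0.001 × 50 × 50 = 92.5.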
Operational cost: To assess the aggregate energy burden incurred by a coalition while completing multiple tasks, Equation (20) defines the operational cost for task $T_{m}$ :

$f_{3} = E_{T_{m}} = \sum_{u_{n} \in Coa (A_{(T_{m})})} e_{T_{m}}^{(u_{n})} = \sum_{u_{n} \in Coa (A_{(T_{m})})} \frac{E_{u_{n}} \lVert A_{(T_{m})}^{(u_{n})} \rVert}{\sum_{T_{m^{'}} \in T} \lVert A_{(T_{m^{'}})}^{(u_{n})} \rVert},$

(20)

where $E_{u_{n}}$ is the total energy consumed by UAV $u_{n}$ to complete all assigned tasks. Since $u_{n}$ contributes heterogeneous resource types and quantities to multiple tasks, we use $\lVert A_{(T_{m})}^{(u_{n})} \rVert$ as a workload proxy and proportionally split the total energy to estimate per-task cost. The energy of UAV $u_{n}$ is the sum of hovering and flight energy, given by

$E_{u_{n}} = \sum_{i = 1}^{ζ} P (0) \, t_{{task}_{u_{n}}^{(i)}}^{(h)} + \sum_{i = 1}^{ζ} P (V_{{task}_{u_{n}}^{(i - 1)}, {task}_{u_{n}}^{(i)}}^{(fly)}) \, t_{{task}_{u_{n}}^{(i - 1)}, {task}_{u_{n}}^{(i)}}^{(fly)} .$

(21)

In Equation (21), $ζ$ is the length of the task schedule of $u_{n}$ , and $P (V)$ is the propulsion power of a rotorcraft at constant speed V, computed as

$P (V) = P_{S} (1 + \frac{3 V^{2}}{U_{tip}^{2}}) + P_{R} {(\sqrt{1 + \frac{V^{4}}{4 v_{0}^{4}}} - \frac{V^{2}}{2 v_{0}^{2}})}^{1 / 2} + \frac{1}{2} f_{0} ρ s_{0} η V^{3} .$

(22)

We instantiate the energy model with a DJI Matrice 600 Pro hexacopter equipped with 2170R 21 × 7 inch propellers; the parameter values used in Equation (22) are listed in Table 1.

Table 1. Parameterization of the propulsion power model in Equation (22) for a DJI Matrice 600 Pro hexacopter used to compute flight energy consumption.
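To make Equations (20) to (22) concrete, the sketch below evaluates the propulsion power, accumulates hovering and flight energy over a schedule, and splits a UAV's total energy across tasks in proportion to the workload proxy. The parameter values in `PARAMS` are placeholders chosen for illustration and are not the Table 1 values; the function names and the example schedule are likewise assumptions.

```python
import math
from typing import Dict, List, Tuple

# Placeholder parameters for Eq. (22); NOT the Table 1 values.
PARAMS = {"P_S": 80.0, "P_R": 90.0, "U_tip": 120.0, "v0": 4.0,
          "f0": 0.6, "rho": 1.225, "s0": 0.05, "eta": 0.5}


def propulsion_power(V: float, p: Dict[str, float] = PARAMS) -> float:
    """Eq. (22): propulsion power at constant speed V (three summed terms)."""
    term1 = p["P_S"] * (1.0 + 3.0 * V ** 2 / p["U_tip"] ** 2)
    x = V ** 2 / (2.0 * p["v0"] ** 2)
    term2 = p["P_R"] * math.sqrt(math.sqrt(1.0 + x ** 2) - x)
    term3 = 0.5 * p["f0"] * p["rho"] * p["s0"] * p["eta"] * V ** 3
    return term1 + term2 + term3


def uav_energy(hover_times: List[float],
               legs: List[Tuple[float, float]]) -> float:
    """Eq. (21): hovering energy plus flight energy over (speed, time) legs."""
    hover = sum(propulsion_power(0.0) * t for t in hover_times)
    flight = sum(propulsion_power(v) * t for v, t in legs)
    return hover + flight


def split_energy(total_energy: float,
                 workloads: Dict[str, float]) -> Dict[str, float]:
    """Eq. (20): share total energy in proportion to the norm of A per task."""
    total = sum(workloads.values())
    if total == 0:
        return {task: 0.0 for task in workloads}
    return {task: total_energy * w / total for task, w in workloads.items()}


# Hypothetical schedule: hover 10 s, then one 30 s leg at 10 m/s.
hover_power = propulsion_power(0.0)
energy = uav_energy(hover_times=[10.0], legs=[(10.0, 30.0)])
shares = split_energy(300.0, {"T1": 2.0, "T2": 1.0})
```

At $V = 0$ the second term reduces to $P_{R}$ and the third vanishes, so the hovering power equals $P_{S} + P_{R}$, which gives a quick sanity check on any parameter set. The proportional split always sums back to the UAV's total energy.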

To concisely characterize the aggregate benefit of task $T_{m}$ on the interval $[0, 1]$, starting from the previously defined task execution sufficiency $f_{1} (T_{m}) \in [0, 1]$, we map the synchronization deviation $J_{syn} (T_{m})$ and the energy cost $E_{T_{m}}$ to sufficiency scores and then couple them into a utility, as follows:

$f_{sync} (T_{m}) = \exp (- α_{syn} \frac{J_{syn} (T_{m})}{T_{ref} (T_{m})}), \quad f_{eng} (T_{m}) = \exp (- α_{eng} \frac{E_{T_{m}}}{E_{ref} (T_{m})}),$

(23)

where by default we set $α_{syn} = α_{eng} = \ln 2$, take $T_{ref} (T_{m})$ as the deadline scale of $T_{m}$, and choose $E_{ref} (T_{m})$ as the coalition energy budget or a platform-calibrated upper bound. With non-negative weights $w = (w_{u}, w_{s}, w_{e})$ satisfying $w_{u} + w_{s} + w_{e} = 1$, the task utility is defined by

$U_{task} (T_{m}; w) = {(f_{1} (T_{m}))}^{w_{u}} {(f_{sync} (T_{m}))}^{w_{s}} {(f_{eng} (T_{m}))}^{w_{e}} .$

(24)

Consequently, $U_{task} (T_{m}) \in [0, 1]$, increases with $f_{1} (T_{m})$, and decreases with $J_{syn} (T_{m})$ and $E_{T_{m}}$.
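A small sketch of Equations (23) and (24) illustrates how the scores behave. With the default $α = \ln 2$, a score equals 0.5 exactly when the cost equals its reference value. The weight vector and the sample inputs below are hypothetical, since the text only requires non-negative weights that sum to one.

```python
import math
from typing import Tuple


def exp_score(value: float, reference: float,
              alpha: float = math.log(2.0)) -> float:
    """Eq. (23): map a non-negative cost to a score in (0, 1]."""
    return math.exp(-alpha * value / reference)


def task_utility(f1: float, j_syn: float, t_ref: float,
                 e_task: float, e_ref: float,
                 w: Tuple[float, float, float]) -> float:
    """Eq. (24): weighted geometric coupling of the three scores."""
    w_u, w_s, w_e = w
    if min(w) < 0 or abs(w_u + w_s + w_e - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    f_sync = exp_score(j_syn, t_ref)
    f_eng = exp_score(e_task, e_ref)
    return (f1 ** w_u) * (f_sync ** w_s) * (f_eng ** w_e)


# Hypothetical case: fully satisfied task whose synchronization cost and
# energy cost both equal their reference values.
half = exp_score(60.0, 60.0)
u = task_utility(f1=1.0, j_syn=60.0, t_ref=60.0,
                 e_task=500.0, e_ref=500.0, w=(0.5, 0.25, 0.25))
```

In this case both cost scores are 0.5, so the utility is $0.5^{0.25} \times 0.5^{0.25} = 0.5^{0.5} \approx 0.707$. Because the coupling is multiplicative, a task with $f_{1} = 0$ and $w_{u} > 0$ receives zero utility regardless of how cheap or punctual it is.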

3.3. Problem Statement

We consider coalition formation with a heterogeneous UAV–multi-task structure in which each task comprises multiple heterogeneous sub-tasks with distinct resource demands, as illustrated in Figure 2. The goal is to optimize overlapping coalition formation under limited resources, dynamic tasks, and local communication. The decision variables are the task–UAV assignment structure $C$ and the resource-allocation vectors $A_{(T_{m})}^{(u_{n})}$, yielding an exponentially large combinatorial search space. The problem has three inherently conflicting objectives $f_{1}$, $f_{2}$, and $f_{3}$. For example, higher task fulfillment typically increases energy consumption, while improving synchronization may conflict with preferred task orders. Hence, the problem is a multi-objective combinatorial optimization problem with a set of non-dominated solutions rather than a single optimum. The optimization seeks an overlapping coalition structure with resource allocations $B C^{(*)} = \{A_{T_{1}}^{(*)}, \dots, A_{T_{m}}^{(*)}, \dots, A_{T_{M}}^{(*)}\}$ that maximizes the overall utility of the task set, as in Equation (25), subject to the constraints in Equation (26).

$(SC^{*}) : \left\{ \begin{matrix} \max_{C, A} \sum_{T_{m} \in T} s (A_{T_{m}}) \\ \min_{C, A} \sum_{T_{m} \in T} t_{T_{m}}^{(syn)} \\ \min_{C, A} \sum_{T_{m} \in T} E_{T_{m}} \end{matrix} \right.$

(25)

$\text{subject to} \left\{ \begin{matrix} \sum_{u_{n} \in C_{m}} r_{n}^{(c)} \geq L_{m}^{(c)}, \forall T_{m} \in T, \\ \sum_{u_{n} \in C_{m}} r_{n}^{(n c)} \geq L_{m}^{(n c)}, \forall T_{m} \in T, \\ t_{T_{m}}^{(syn)} \leq t_{T_{m}}^{dead}, \forall T_{m} \in T, \\ \sum_{T_{m} \in M_{n}} e_{T_{m}}^{(u_{n})} \leq E_{n}^{\max}, \forall u_{n} \in U . \end{matrix} \right.$

(26)

Figure 2. Example of resource allocation under overlapping coalitions, where heterogeneous UAV resources are assigned to multiple tasks and UAVs transfer between coalitions to reuse residual capacity.

In Equation (26), the resource constraints ensure that the sums of consumable and non-consumable provisions from a coalition meet each task’s demand; the temporal constraint ensures that the synchronization deviation remains within the tolerated waiting time; and the energy constraint ensures that a UAV’s total energy consumption does not exceed its initial energy. Because the solution space grows combinatorially with the number of tasks and UAVs, obtaining a globally optimal coalition structure is NP-hard. We therefore design a staged solving strategy with reduced computational complexity.
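A minimal feasibility check for the four constraint families in Equation (26) might look like the sketch below. The dictionary layout, the scalar treatment of each resource class, and all example values are assumptions made for illustration.

```python
def check_constraints(coalitions, uav_res, demand, t_syn, deadline, energy_use, e_max):
    """Return the list of violated constraints of Eq. (26); empty means feasible."""
    violations = []
    for task, members in coalitions.items():
        # Consumable ("c") and non-consumable ("nc") supply must meet demand.
        for kind in ("c", "nc"):
            supplied = sum(uav_res[u][kind] for u in members)
            if supplied < demand[task][kind]:
                violations.append((task, "resource_" + kind))
        # Synchronization deviation must stay within the tolerated waiting time.
        if t_syn[task] > deadline[task]:
            violations.append((task, "deadline"))
    # Each UAV's energy summed over all of its coalitions must fit its budget.
    for uav, cap in e_max.items():
        used = sum(e for (_, u), e in energy_use.items() if u == uav)
        if used > cap:
            violations.append((uav, "energy"))
    return violations

# Hypothetical example: u2 overlaps tasks T1 and T2.
coalitions = {"T1": ["u1", "u2"], "T2": ["u2"]}
uav_res = {"u1": {"c": 2, "nc": 1}, "u2": {"c": 3, "nc": 2}}
demand = {"T1": {"c": 4, "nc": 2}, "T2": {"c": 3, "nc": 3}}
t_syn = {"T1": 5, "T2": 8}
deadline = {"T1": 10, "T2": 6}
energy_use = {("T1", "u1"): 30, ("T1", "u2"): 40, ("T2", "u2"): 70}
e_max = {"u1": 100, "u2": 100}
result = check_constraints(coalitions, uav_res, demand, t_syn, deadline, energy_use, e_max)
```

In this example the overlapping UAV u2 satisfies T1 but pushes T2 past its non-consumable demand, its deadline, and u2's own energy budget, which is the kind of inter-coalition contention the constraints are meant to expose.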

4. Overlapping Coalition Formation Algorithm

Overlapping coalition formation involves coupled relations among multiple agents and multiple tasks. A solver must perceive high-order structure, avoid local optima, and control global exploration cost under feasibility constraints. This section proposes a coordinated algorithmic framework, SGRL-TS. First, we extract the structural semantics of task–UAV–coalition relations using a bidirectional HAN and use the structure embeddings to modulate policy parameters, thereby improving generalization. Next, we develop a structure-aware and efficient multi-task cooperative hierarchical value-decomposition algorithm called SHIELD. SHIELD captures member complementarity or conflict via nested nonlinear aggregation within a coalition and models inter-task cooperation or competition via second-order interaction terms across coalitions, thereby yielding the global value $Q_{tot}$. Finally, we conduct a global search with MCTS guided by $Q_{tot}$ and feed high-quality structures back into training through demonstration replay, Q-value distillation, and structure-neighborhood exploration.

4.1. High-Order Structural Modeling and State Representation

In overlapping coalition formation for heterogeneous UAVs across multiple tasks, the local observation of a single agent is insufficient to capture the high-order interactions between tasks and UAVs. Decisions are influenced by the task-demand context and by the role that each agent plays within multi-task collaborative structures. To improve policy generalization and coalition organization efficiency, we adopt a bidirectional hypergraph attention network that combines structural awareness with context-induced modulation. The core objective is to disentangle an agent’s intrinsic capability from the external task context at the representation level.

To accommodate structural evolution during planning, we construct a time-varying heterogeneous hypergraph $G_{t} = (V, E_{t})$ at each decision step t. The node set $V$ denotes UAVs. The hyperedge set $E_{t}$ denotes the tasks within the planning scope at step t. A task $T_{m}$ appears as a hyperedge $e_{m, t}$ with member set $C_{m, t} \subseteq V$. A node that belongs to multiple hyperedges models overlapping coalition membership. This structure specifies connectivity, propagation paths, and normalization domains of structural information during policy learning, as illustrated in Figure 3.

Figure 3. High-order structural modeling with the bidirectional hypergraph attention network, which encodes UAV nodes and task hyperedges and propagates features along membership and communication links. Here, $e_{m}^{(1)}$ denotes the task structural embedding produced by task-to-node aggregation, and $h_{n}^{*}$ denotes the structure-aware node embedding produced by node-to-task aggregation.

To restrict attention normalization to valid candidate sets, we introduce three masks for membership, task activation, and UAV availability:

$M_{n m}^{t} = 1 [u_{n} \in C_{m, t}], \quad μ_{m}^{t} = 1 [T_{m} \text{ is in the current planning scope}], \quad η_{n}^{t} = 1 [u_{n} \text{ is schedulable}]$

(27)

Given the initial node and hyperedge embeddings $h_{n, t}^{(0)}$ and $e_{m, t}^{(0)}$, we define a differentiable score

s_{n m}^{t} = LeakyReLU (a^{⊤} [W_{v} h_{n, t}^{(0)} ‖ W_{e} e_{m, t}^{(0)}])

(28)

where $W_{v}$ and $W_{e}$ are learnable mapping matrices, $a$ is a learnable vector, and ‖ denotes concatenation. Two complementary attentions are then normalized over the mask-restricted domains.

Task-to-node aggregation. This aggregation builds the task representation from candidate members and expresses task selectivity. The attention coefficients $α_{n m}^{t}$ are normalized over the candidate-member domain:

$α_{n m}^{t} = \frac{\exp (s_{n m}^{t}) μ_{m}^{t} M_{n m}^{t}}{\sum_{n^{'} : η_{n^{'}}^{t} M_{n^{'} m}^{t} = 1} \exp (s_{n^{'} m}^{t})}$

(29)

The task structural embedding is then computed as

$e_{m, t}^{(1)} = ReLU (\sum_{n \in C_{m, t}} α_{n m}^{t} W_{v} h_{n, t}^{(0)})$

(30)

To make task selectivity explicit, we pose an entropy-regularized matching objective over the simplex $Δ (C_{m, t})$ :

$\max_{α_{m} \in Δ (C_{m, t})} \sum_{n \in C_{m, t}} s_{n m}^{t} α_{n m}^{t} - τ_{α} \sum_{n \in C_{m, t}} α_{n m}^{t} \log α_{n m}^{t}$

(31)

whose closed-form solution is the masked softmax

$α_{n m}^{t} = \frac{\exp (s_{n m}^{t} / τ_{α}) μ_{m}^{t} M_{n m}^{t}}{\sum_{n^{'} : η_{n^{'}}^{t} M_{n^{'} m}^{t} = 1} \exp (s_{n^{'} m}^{t} / τ_{α})}$

(32)

Under a linearization, Equation (30) becomes the convex combination

$e_{m, t}^{(1)} \approx \sum_{n \in C_{m, t}} α_{n m}^{t} W_{v} h_{n, t}^{(0)}$

(33)

showing that the task embedding aggregates member features with weights proportional to the marginal contribution scores.
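
The masked softmax of Equations (29) and (32) can be sketched in a few lines. Here `mask` stands for the combined indicator $μ_{m}^{t} M_{n m}^{t} η_{n}^{t}$ of one task over all UAVs; the function name and the max-subtraction for numerical stability are illustrative choices, not part of the original formulation.

```python
import math

def masked_softmax(scores, mask, temperature=1.0):
    """Softmax over entries with mask == 1; masked-out entries receive weight 0."""
    valid = [s / temperature for s, keep in zip(scores, mask) if keep]
    if not valid:                      # inactive task or no schedulable member
        return [0.0] * len(scores)
    peak = max(valid)                  # subtract the max before exponentiating
    exps = [math.exp(s / temperature - peak) if keep else 0.0
            for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Three UAVs, the second one is not a candidate member of the task.
weights = masked_softmax([1.0, 2.0, 3.0], [1, 0, 1])
```

Lower temperatures concentrate the weights on the highest-scoring member, whereas higher temperatures approach a uniform distribution over the valid set.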
Node-to-task aggregation. This aggregation fuses information over a node’s candidate tasks and expresses contextual suitability. The attention coefficients $β_{m n}^{t}$ are normalized over the candidate-task domain:

$β_{m n}^{t} = \frac{\exp (s_{n m}^{t}) η_{n}^{t} M_{n m}^{t}}{\sum_{m^{'} : μ_{m^{'}}^{t} M_{n m^{'}}^{t} = 1} \exp (s_{n m^{'}}^{t})}$

(34)

The structure-aware node representation is

$h_{n, t}^{*} = σ (\sum_{m : M_{n m}^{t} = 1} β_{m n}^{t} W_{e}^{'} e_{m, t}^{(1)})$

(35)

To formalize contextual suitability, we optimize an entropy-regularized utility over the simplex $Δ ({m : M_{n m}^{t} = 1})$ using a context utility $u_{n \to m}^{t}$ :

$\max_{β_{n} \in Δ ({m : M_{n m}^{t} = 1})} \sum_{m : M_{n m}^{t} = 1} u_{n \to m}^{t} β_{m n}^{t} - τ_{β} \sum_{m : M_{n m}^{t} = 1} β_{m n}^{t} \log β_{m n}^{t}$

(36)

whose solution is

$β_{m n}^{t} = \frac{\exp (u_{n \to m}^{t} / τ_{β}) η_{n}^{t} M_{n m}^{t}}{\sum_{m^{'} : μ_{m^{'}}^{t} M_{n m^{'}}^{t} = 1} \exp (u_{n \to m^{'}}^{t} / τ_{β})}$

(37)

With $τ_{β} = 1$ and a parameterization $u_{n \to m}^{t} = s_{n m}^{t}$ , Equation (37) reduces to Equation (34). Under the same linearization, Equation (35) becomes

$h_{n, t}^{*} \approx \sum_{m : M_{n m}^{t} = 1} β_{m n}^{t} W_{e}^{'} e_{m, t}^{(1)}$

(38)

so that the node embedding is a convex combination of task embeddings weighted by contextual suitability.

Operator view of higher-order relations. Substituting Equation (33) into Equation (38) yields, after one bidirectional pass,

$h_{n, t}^{*} \approx \sum_{m} \sum_{n^{'}} \underbrace{β_{m n}^{t} α_{n^{'} m}^{t}}_{\text{coupled weight}} W_{e}^{'} W_{v} h_{n^{'}, t}^{(0)}$

(39)

which explicitly encodes the second-order node–task–node relation via the multiplicative coupling $β_{m n}^{t} α_{n^{'} m}^{t}$. Let $H_{t} \in \{0, 1\}^{| V | \times | E_{t} |}$ be the incidence matrix. One bidirectional pass admits the operator approximation

h^{*} \approx (B_{t} H_{t} A_{t}^{⊤} H_{t}^{⊤}) W h^{(0)}

(40)

where $A_{t} = diag \{α_{m}\}_{m}$ and $B_{t} = diag \{β_{n}\}_{n}$ embed masked, within-domain normalizations into the feasible sets, and $W$ absorbs linear mappings. Hence, bidirectional attention realizes a learnable second-order polynomial operator in $H_{t}$; stacking L layers induces polynomials of order up to $2 L$, which approximate a broad class of higher-order hypergraph convolution kernels and subset functions, thereby capturing UAV–task–coalition higher-order relations with shallow propagation.
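The second-order coupling in Equation (39) can be checked on a toy example. The sketch below assumes scalar node features and identity linear maps, so that only the attention weights matter; the attention values are hypothetical and chosen so that each task's $α$ column and each node's $β$ row sum to one.

```python
def bidirectional_pass(h0, alpha, beta):
    """Linearized Eqs. (33) and (38) with scalar features and identity maps.

    alpha[n][m] is the task-to-node weight, beta[n][m] the node-to-task weight.
    """
    n_nodes, n_tasks = len(h0), len(alpha[0])
    e1 = [sum(alpha[n][m] * h0[n] for n in range(n_nodes)) for m in range(n_tasks)]
    return [sum(beta[n][m] * e1[m] for m in range(n_tasks)) for n in range(n_nodes)]

def coupled_weights(alpha, beta):
    """Two-hop weight of Eq. (39): C[n][k] = sum_m beta[n][m] * alpha[k][m]."""
    n_nodes, n_tasks = len(alpha), len(alpha[0])
    return [[sum(beta[n][m] * alpha[k][m] for m in range(n_tasks))
             for k in range(n_nodes)] for n in range(n_nodes)]

# Three UAVs, two tasks; UAV 1 belongs to both coalitions (overlap).
alpha = [[0.6, 0.0], [0.4, 0.3], [0.0, 0.7]]
beta = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
h0 = [1.0, 2.0, 4.0]
h_star = bidirectional_pass(h0, alpha, beta)
C = coupled_weights(alpha, beta)
h_via_coupling = [sum(C[n][k] * h0[k] for k in range(3)) for n in range(3)]
```

Both routes give the same node embeddings, and the overlapping UAV is the only one whose row of `C` mixes features from UAVs of both tasks.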

To improve discriminability and prevent representation collapse, we introduce a structural diversity regularizer:

L_{div} = \sum_{m < m^{'}} {(\frac{e_{m}^{(1)} \cdot e_{m^{'}}^{(1)}}{∥ e_{m}^{(1)} ∥ ∥ e_{m^{'}}^{(1)} ∥})}^{2}

(41)
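A direct, unoptimized reading of the regularizer in Equation (41) is sketched below; the small `eps` added to the denominator is an assumption to guard against zero-norm embeddings.

```python
import math

def diversity_loss(task_embeddings, eps=1e-12):
    """Sum of squared cosine similarities over all task pairs m < m' (Eq. (41))."""
    loss = 0.0
    for i in range(len(task_embeddings)):
        for j in range(i + 1, len(task_embeddings)):
            a, b = task_embeddings[i], task_embeddings[j]
            dot = sum(x * y for x, y in zip(a, b))
            norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            loss += (dot / (norms + eps)) ** 2
    return loss
```

Orthogonal task embeddings incur no penalty, while parallel ones contribute the maximum of 1 per pair, which is what discourages representation collapse.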

Considering the continual evolution of the structure over time, we employ residual in-step updates to maintain adaptability of the features:

{\tilde{h}}_{n, t} = \sum_{m} α_{n m}^{t} W_{e} e_{m, t}, h_{n, t + 1} = Norm (h_{n, t} + σ ({\tilde{h}}_{n, t}))

(42)

{\tilde{e}}_{m, t} = \sum_{n} β_{m n}^{t} W_{v} h_{n, t}, e_{m, t + 1} = Norm (e_{m, t} + σ ({\tilde{e}}_{m, t}))

(43)

where $Norm (\cdot)$ denotes a normalization operator. Tasks not in the planning scope and UAVs that are not schedulable are automatically excluded from normalization and aggregation through the masks in (27), which stabilizes computation and keeps the cost controlled. For brevity, time subscripts are omitted in the sequel when no ambiguity arises.

The pseudocode for HAN forward computation with diversity regularization is presented in Algorithm 1.

Algorithm 1 Bidirectional Hypergraph Attention with Diversity Regularization

Input: Hypergraph $G = (V, E)$; memberships $\{C_{m}\}$ and $\{M_{n}\}$; initial features $\{h_{n}^{(0)} = x_{n}\}$ and $\{e_{m}^{(0)} = x_{m}\}$; trainable parameters (consistent with Section 4.1)
Output: Task embeddings $\{e_{m}^{(1)}\}$; node embeddings $\{h_{n}^{*}\}$; diversity regularizer $L_{div}$
Task-driven aggregation
for $m = 1$ to M do
    Compute $\{α_{n m}\}_{n \in C_{m}}$ by Equation (29)
    Obtain $e_{m}^{(1)}$ by Equation (30)
end for
Task-to-node context feedback
for $n = 1$ to N do
    Compute $\{β_{m n}\}_{m \in M_{n}}$ by Equation (34)
    Update $h_{n}^{*}$ by Equation (35)
end for
Structural diversity regularization
Compute $L_{div}$ by Equation (41)
Apply membership masking and renormalize within valid sets
return $\{e_{m}^{(1)}\}$, $\{h_{n}^{*}\}$, $L_{div}$

4.2. Joint Policy Learning and Local Optimization

To address structural role modeling, policy coordination, and global value optimization for heterogeneous UAV swarms with multi-task overlapping coalitions, we propose a structure-aware hierarchical value decomposition algorithm, SHIELD. The algorithm takes as inputs each UAV’s local observation and the structure-aware node and task embeddings from Section 4.1, namely the node embedding $h_{n, t}^{*}$ and the task embedding $e_{m, t}^{(1)}$, and performs encoding and aggregation under the unified masks $M_{n m}^{t}$ (membership), $μ_{m}^{t}$ (task activation), and $η_{n}^{t}$ (UAV availability). This design ensures that task selectivity and contextual suitability propagate from the representation layer to the value-decomposition layer. The global value $Q_{tot}$ remains well-defined even with only a partial coalition structure. An overview is shown in Figure 4.

Figure 4. SHIELD architecture: A hypergraph encoder, role attribution, and nested coalition and global mixers jointly implement structure-aware hierarchical value decomposition for multi-task coordination.

To disentangle the node embedding, we inject structural semantics into a role representation, which makes each UAV’s role and functional positioning explicit within specific task coalitions. Given $h_{n, t}^{*}$, the role gate and the role vector are computed as

z_{n, t} = σ (M h_{n, t}^{*}), r_{n, t} = \tanh (U h_{n, t}^{*} + V (h_{n, t}^{*} ⊙ z_{n, t})),

(44)

where $U$, $V$, and $M$ are learnable matrices; $σ (\cdot)$ is the sigmoid; $\tanh (\cdot)$ is the hyperbolic tangent; and ⊙ denotes the Hadamard product. To transmit task selectivity $α_{n m}^{t}$ and contextual suitability $β_{m n}^{t}$ into downstream aggregation, we convert both into a non-negative member-to-task gate effective only on the feasible domain,

κ_{n \to m, t} = softplus (w_{α} α_{n m}^{t} + w_{β} β_{m n}^{t}) \cdot M_{n m}^{t} \cdot η_{n}^{t},

(45)

and renormalize it within each coalition to remove scale effects:

{\bar{κ}}_{n \to m, t} = \frac{κ_{n \to m, t}}{\sum_{n^{'} : M_{n^{'} m}^{t} = 1} κ_{n^{'} \to m, t} + ε},

(46)

where $w_{α}$ and $w_{β}$ are learnable scalars and $ε$ is a small constant.
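The gate of Equations (45) and (46) for a single task can be sketched as follows. The scalar weights `w_alpha` and `w_beta` are learnable in the paper; the fixed values used here, and the function names, are assumptions for illustration.

```python
import math

def softplus(x):
    """Numerically stable softplus log(1 + exp(x))."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def member_gates(alpha, beta, member_mask, available,
                 w_alpha=1.0, w_beta=1.0, eps=1e-8):
    """Normalized member-to-task gates for one task (Eqs. (45) and (46))."""
    raw = [softplus(w_alpha * a + w_beta * b) * m * av
           for a, b, m, av in zip(alpha, beta, member_mask, available)]
    total = sum(raw) + eps             # non-members contribute 0 to the sum
    return [k / total for k in raw]

# UAV 0 is a schedulable member, UAV 1 is a member but unavailable,
# UAV 2 is available but not a member of this coalition.
gates = member_gates([0.7, 0.3, 0.0], [0.5, 1.0, 0.0], [1, 1, 0], [1, 0, 1])
```

Because softplus is strictly positive, every feasible member keeps a non-zero gate, so the renormalization never divides a coalition's weight among zero contributors unless the coalition is empty.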

Agent-Conditional Value Function. An individual policy must adapt to structural roles and local observations. We define an agent-conditional Q function. Here, the role vector modulates both the temporal encoder and the action embedding:

$Q_{n, t} (τ_{n, t}, a_{n, t}; r_{n, t}) = MLP (GRU (τ_{n, t}; γ (r_{n, t})) \oplus φ (a_{n, t}, r_{n, t})),$

(47)

where $τ_{n, t}$ is the local history, $a_{n, t}$ is the current action, $γ (r_{n, t})$ generates GRU parameters conditioned on the role, $φ (a_{n, t}, r_{n, t})$ encodes actions jointly with roles, and ⊕ denotes concatenation.

Intra-Coalition Value Decomposer. To capture complementarities and conflicts among coalition members, we adopt a nested nonlinear composition. The role-conditioned contribution of a member is

$ϕ (Q_{n, t}, e_{m, t}^{(1)}, r_{n, t}) = ReLU (W_{2} ReLU (W_{1} [Q_{n, t} ‖ e_{m, t}^{(1)} ‖ r_{n, t}] + b_{1}) + b_{2}),$

(48)

with learnable parameters $W_{1}$ , $W_{2}$ , $b_{1}$ , and $b_{2}$ , and concatenation $[\cdot ‖ \cdot]$ . The gated, role-modulated fusion within coalition m is

$Ψ_{m, t} = \tanh (\sum_{n : M_{n m}^{t} = 1} {\bar{κ}}_{n \to m, t} [r_{n, t} ⊙ ϕ (Q_{n, t}, e_{m, t}^{(1)}, r_{n, t})]),$

(49)

and the coalition value reads

$Q_{m, t} = f_{dec} (e_{m, t}^{(1)}, Ψ_{m, t}) \cdot μ_{m}^{t},$

(50)

where $f_{dec} (\cdot)$ is a small MLP with normalization and bounded activation, and $μ_{m}^{t}$ masks inactive tasks.

Cross-Task Interaction Aggregator. Tasks exhibit cooperation and competition that cannot be captured by a simple linear mixture. We therefore define a structure-aware interaction aggregator with quadratic and cross terms, restricted to active tasks. The global value is

$Q_{tot} (s_{t}, a_{t}) = f_{agg} (s_{t}, Γ_{t}),$

(51)

where $s_{t}$ is a global state summary and $f_{agg} (\cdot)$ is a multilayer mapping. The interaction module is

$Γ_{t} = σ (\sum_{m} μ_{m}^{t} (W_{m} (s_{t}, e_{m, t}^{(1)}) Q_{m, t} + U_{m} (s_{t}, e_{m, t}^{(1)}) Q_{m, t}^{2} + \sum_{m^{'} \neq m} μ_{m^{'}}^{t} V_{m, m^{'}} (s_{t}) Q_{m, t} Q_{m^{'}, t})),$

(52)

with task-wise base weights $W_{m}$, quadratic weights $U_{m}$, pairwise interaction weights $V_{m, m^{'}}$, and the sigmoid $σ (\cdot)$.

Stable Encoding under Partial Structures. When only a subset of tasks is active, we preserve tensor shapes and statistical stability by using a softened placeholder for inactive tasks:

$e_{m, t}^{(0)} = (1 - μ_{m}^{t}) e_{vac} + μ_{m}^{t} {\hat{e}}_{m, t}^{(0)},$

(53)

where $e_{vac}$ is a learnable placeholder and ${\hat{e}}_{m, t}^{(0)}$ is the standard encoder output. Because attention normalizations and gated aggregations are masked, inactive tasks neither appear in denominators nor contribute to value sums.

Training Objectives and Optimization. We normalize three performance indicators (task fulfillment, coalition synchronization deviation, and operational cost) as

${\hat{f}}_{i} (t) = \frac{f_{i} (t) - f_{i}^{\min}}{f_{i}^{\max} - f_{i}^{\min}},$

(54)

and define the reward with a balance-promoting term:

$R (t) = (ω_{1} {\hat{f}}_{1} {(t)}^{α_{1}} + ω_{2} {\hat{f}}_{2} {(t)}^{α_{2}} + ω_{3} {\hat{f}}_{3} {(t)}^{α_{3}}) + λ (1 - std ({\hat{f}}_{1} (t), {\hat{f}}_{2} (t), {\hat{f}}_{3} (t))),$

(55)

where $ω_{i} \geq 0$ with $\sum_{i} ω_{i} = 1$ , exponents $α_{i}$ control sensitivity, and $λ$ weights the balance term. The temporal-difference loss over a mini-batch $B$ is

$L_{TD} = \frac{1}{| B |} \sum_{(s, τ, a, s^{'}) \in B} {(Q_{tot} (s, a) - [R + γ \max_{a^{'}} Q_{tot} (s^{'}, a^{'})])}^{2},$

(56)

and the attention–gate alignment regularizer is

$L_{cons} = \sum_{m} KL ({softmax}_{n} (α_{n m}^{t} 1 [M_{n m}^{t} = 1]) ∥ {softmax}_{n} ({\bar{κ}}_{n \to m, t})),$

(57)

so that the total loss becomes

$L = L_{TD} + η_{div} L_{div} + η_{cons} L_{cons},$

(58)

where $η_{div}, η_{cons} \geq 0$ and $L_{div}$ is defined in (41).
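
The normalization and reward of Equations (54) and (55) can be sketched as below. Two points are assumptions not stated in the text: the standard deviation is taken as the population standard deviation, and the normalized indicators are assumed to be oriented so that larger values are better. The default weights and exponents are hypothetical.

```python
import statistics

def normalize(value, lo, hi):
    """Min-max normalization of Eq. (54)."""
    return (value - lo) / (hi - lo)

def balanced_reward(f_hat, omega=(0.5, 0.3, 0.2), exponents=(1.0, 1.0, 1.0), lam=0.1):
    """Weighted power sum plus a balance bonus, following Eq. (55)."""
    weighted = sum(w * f ** a for w, f, a in zip(omega, f_hat, exponents))
    balance = 1.0 - statistics.pstdev(f_hat)   # population std (assumed)
    return weighted + lam * balance
```

The balance term rewards solutions whose three normalized indicators are close to one another, so a policy cannot maximize the reward by excelling on one objective while neglecting the others.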

We now present the single-step training procedure of SHIELD, which integrates structure-attribution encoding, hierarchical value decomposition, and the multi-objective reward into one parameter update, as shown in Algorithm 2.

Algorithm 2 SHIELD single-step training with structural regularization and distillation

Input: mini-batch $B = \{(s, τ, a, s^{'})\}$; HAN outputs $\{h_{n}^{*}\}$, $\{e_{m}^{(1)}\}$; discount $γ$; learning rate $η$; regularization weights $η_{div}$, $η_{cons}$; distillation weight schedule $λ_{dist} (t)$; soft-update factor $τ$
Output: updated SHIELD parameters and target networks
for each sample $(s, τ, a, s^{'}) \in B$ do
    Compute role vectors $r_{n}$ via Equation (44)
    Compute agent-conditional values $Q_{n} (τ_{n}, a_{n}; r_{n})$ via Equation (47)
    for each coalition (task) m do
        Compute coalition value $Q_{m}$ using Equations (48)–(50)
    end for
    Fuse to global value $Q_{tot} (s, a)$ via Equations (51) and (52)
    Compute $f_{1}, f_{2}, f_{3}$ via Equations (8)–(20); normalize with Equation (54); form reward R via Equation (55)
    TD target: $y \leftarrow R + γ \max_{a^{'}} Q_{tot} (s^{'}, a^{'}; target)$
    Per-sample TD loss: $L_{TD}^{(i)} \leftarrow {(Q_{tot} (s, a) - y)}^{2}$
end for
Average TD loss: $L_{TD} \leftarrow \frac{1}{| B |} \sum_{i} L_{TD}^{(i)}$
Compute attention–gating consistency loss $L_{cons}$ via Equation (57) on $B$
if HAN diversity regularizer available then
    Obtain $L_{div}$ via Equation (41)
else
    $L_{div} \leftarrow 0$
end if
if MCTS distillation targets available then
    Compute $L_{dist}$ via Equation (70)
else
    $L_{dist} \leftarrow 0$
end if
Total loss: $L \leftarrow L_{TD} + η_{cons} L_{cons} + λ_{dist} (t) L_{dist} + η_{div} L_{div}$
Update SHIELD parameters by gradient descent with learning rate $η$
Soft update targets: $target \leftarrow τ \cdot current + (1 - τ) \cdot target$
return updated SHIELD parameters and targets

4.3. Global Optimization and Policy Feedback

To overcome the combinatorial explosion of overlapping coalition structures and the tendency of local search to become trapped in suboptimal regions, we augment the structure-aware hierarchical value decomposition of Section 4.2 with MCTS for global structure optimization and feed the high-quality structures it discovers back into policy learning. This forms a closed loop of structure prior, feasible-set pruning, global evaluation, and policy feedback. The key idea is to use the masked encoding of a time-varying hypergraph as a bridge, so that partially assigned structures receive stable and comparable value estimates from the bidirectional hypergraph attention and hierarchical value decomposition networks. Structure priors then guide the tree policy, incremental constraint checks maintain real-time feasibility, and Pareto-nondominated solutions are converted into demonstrations and distillation signals that jointly accelerate policy convergence and improve global solution quality. The workflow is shown in Figure 5.

Figure 5. Pipeline of the MCTS-based global structure optimizer, where SHIELD values guide feasible coalition expansions and Pareto-optimal structures are fed back as demonstrations and distillation targets.

4.3.1. Structure-Prior and Feasible-Set–Constrained Global Optimizer

In MCTS, each node s represents a partially realized resource allocation $C_{1 : m}$, that is, the coalitions $C_{m}$ and their resource assignments $A_{T_{m}}$ for the first m tasks. The root is ∅, indicating that no task has been assigned. Expanding to $child (s, a)$ selects a coalition $C_{m + 1}$ and resource plan $A_{T_{m + 1}}$ for task m + 1. The full depth is M, and the leaves encode a complete overlapping coalition structure $C = \{C_{1}, \dots, C_{M}\}$. To uniformly encode partial structures, we reuse the masking mechanisms of Sections 4.1 and 4.2: inactive tasks use the learnable placeholder embedding in (53), while attention and gating normalize only over mask-restricted domains per (27) and (46); thus, any $C_{1 : m}$ can be stably fed into SHIELD to evaluate $Q_{tot}$.

During node selection, we adopt PUCT with a structure prior,

PUCT (s, a) = \bar{Q} (s, a) + c_{puct} P_{str} (s, a) \frac{\sqrt{N (s)}}{1 + N (s, a)},

(59)

where $\bar{Q} (s, a)$ is the running mean of backed-up action values, $N (s)$ and $N (s, a)$ are visit counts, and $c_{puct}$ controls exploration. The prior $P_{str} (s, a)$ aggregates the gating strengths of the candidate coalition for task $m + 1$ and is softmax-normalized. Let $g (C_{m + 1}) = \sum_{n \in C_{m + 1}} {\bar{κ}}_{n \to (m + 1), t}$ with ${\bar{κ}}_{n \to (m + 1), t}$ defined by (46); then

P_{str} (s, a) = \frac{\exp (g (C_{m + 1}) / τ_{p})}{\sum_{C_{m + 1}^{'}} \exp (g (C_{m + 1}^{'}) / τ_{p})},

(60)

where $τ_{p}$ is a temperature. To prevent branch explosion, we use progressive widening to limit the branching factor,

$| A (s) | \leq c_{pw} N (s)^{ζ}, \quad 0 < ζ < 1,$

(61)

with growth tuned by $c_{pw}$ and $ζ$; if the cap is not reached, high-prior candidates are expanded first, otherwise children are chosen by (59).
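The selection rule of Equations (59) to (61) can be sketched as follows. The default constants are hypothetical, and the `max(1, ...)` floor in the widening cap is an added assumption so that an unvisited node can still expand one child.

```python
import math

def structure_prior(gate_sums, temperature=1.0):
    """Softmax over aggregated gate strengths g(C) of candidate coalitions (Eq. (60))."""
    scaled = [g / temperature for g in gate_sums]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def puct_score(q_mean, prior, n_parent, n_child, c_puct=1.5):
    """PUCT value of Eq. (59) for one candidate action."""
    return q_mean + c_puct * prior * math.sqrt(n_parent) / (1 + n_child)

def widening_cap(n_parent, c_pw=2.0, zeta=0.5):
    """Maximum number of children allowed by progressive widening (Eq. (61))."""
    return max(1, math.floor(c_pw * n_parent ** zeta))

def select_child(children, n_parent, c_puct=1.5):
    """Pick the index of the child with the highest PUCT score.

    Each child is a (q_mean, prior, visit_count) triple.
    """
    scores = [puct_score(q, p, n_parent, n, c_puct) for q, p, n in children]
    return scores.index(max(scores))
```

With few visits the prior term dominates, so coalitions with strong gating are tried first; as visit counts grow, the backed-up mean value takes over.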

One-step expansion modifies only a small subset of resources and links related to the new task, so we perform incremental checks rather than full recomputation. For each UAV $u_{n}$, we maintain residual resources and accumulated costs, and update them incrementally when expanding to $C_{m + 1}$. Energy feasibility uses a “maneuver + workload” decomposition with $Δ E_{n} = Δ E_{n}^{fly} + Δ E_{n}^{task}$, imposing

E_{n}^{used} + Δ E_{n} \leq (1 - δ_{E}) E_{n}^{\max},

(62)

where $δ_{E} \in [0.05, 0.15]$ is a safety margin. Bandwidth feasibility uses an optimistic expectation of available throughput: if task $T_{m}$ requires $B_{m}$ and $E [B_{m}^{avail}] = \sum_{(i, j) \in L_{m}} q_{i j, t} R_{i j, t}$, then

E [B_{m}^{avail}] \geq (1 + δ_{B}) B_{m},

(63)

where $q_{i j, t}$ is the link success probability, $R_{i j, t}$ is the achievable rate, and $δ_{B}$ is a redundancy factor. End-to-end latency is approximated by computation plus communication delay $L_{m} = L_{m}^{comp} + L_{m}^{comm}$ with $L_{m}^{comm} \approx D_{m} / E [B_{m}^{avail}]$ for data size $D_{m}$, yielding

L_{m} \leq (1 - δ_{L}) L_{m}^{\max} .

(64)
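The three early-rejection tests in Equations (62) to (64) are simple enough to sketch directly. Function names, argument layouts, and default margins are illustrative assumptions; links are represented as (success probability, achievable rate) pairs.

```python
def energy_ok(e_used, delta_fly, delta_task, e_max, margin=0.1):
    """Eq. (62): used energy plus the increment must respect the safety margin."""
    return e_used + delta_fly + delta_task <= (1.0 - margin) * e_max

def expected_bandwidth(links):
    """Expected available throughput: sum of q_ij * R_ij over the task's links."""
    return sum(q * r for q, r in links)

def bandwidth_ok(links, demand, redundancy=0.2):
    """Eq. (63): expected throughput must cover demand plus redundancy."""
    return expected_bandwidth(links) >= (1.0 + redundancy) * demand

def latency_ok(l_comp, data_size, links, l_max, margin=0.1):
    """Eq. (64): computation plus communication delay within the latency cap."""
    bandwidth = expected_bandwidth(links)
    if bandwidth <= 0.0:
        return False               # no usable link, communication delay unbounded
    return l_comp + data_size / bandwidth <= (1.0 - margin) * l_max

links = [(0.9, 10.0), (0.5, 4.0)]  # hypothetical (q, R) pairs
```

Since only the UAVs and links touched by the new coalition change, each test can be re-run on those entries alone rather than on the whole structure.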

Beyond early rejection by (62)–(64), we define a layerwise optimistic upper bound for unassigned tasks at node s,

UB (s) = Q_{tot}^{SHIELD} (s) + \sum_{m^{'} > m} μ_{m^{'}}^{+} {\hat{Q}}_{m^{'}}^{ub},

(65)

where $Q_{tot}^{SHIELD} (s)$ is the current estimate for the partial structure, ${\hat{Q}}_{m^{'}}^{ub}$ is obtained by a conflict-ignorant greedy assignment, and $μ_{m^{'}}^{+}$ flags tasks still within scope. If $UB (s)$ is below the incumbent best, the branch is cut; among candidates at the same depth m, those dominated in both residual resources and $Q_{tot}^{SHIELD}$ are removed, realizing a triple pruning by feasibility, upper bound, and dominance.
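The upper-bound and dominance tests can be sketched as below. Representing residual resources by a single scalar per candidate is a simplifying assumption; in general the comparison would be component-wise over a resource vector.

```python
def optimistic_upper_bound(q_partial, task_bounds, in_scope):
    """Eq. (65): partial value plus optimistic bounds of unassigned in-scope tasks."""
    return q_partial + sum(ub for ub, flag in zip(task_bounds, in_scope) if flag)

def should_cut(q_partial, task_bounds, in_scope, incumbent_best):
    """Cut the branch when even the optimistic bound cannot beat the incumbent."""
    return optimistic_upper_bound(q_partial, task_bounds, in_scope) < incumbent_best

def prune_dominated(candidates):
    """Keep candidates not dominated in both residual resources and value.

    Each candidate is a (residual, q_value) pair at the same depth.
    """
    kept = []
    for i, (res_i, q_i) in enumerate(candidates):
        dominated = any(
            res_j >= res_i and q_j >= q_i and (res_j > res_i or q_j > q_i)
            for j, (res_j, q_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            kept.append((res_i, q_i))
    return kept
```

A candidate with less residual capacity survives only if it compensates with a higher value estimate, which keeps the frontier small without discarding potentially optimal branches.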

Upon reaching a leaf or a cutoff depth, we greedily complete remaining tasks with a lightweight policy and use SHIELD’s joint value as a rollout proxy; visit counts are then updated and $\bar{Q} (s, a)$ is backed up by a moving average. This proxy is scale-consistent with (65), maintaining evaluation coherence under limited budgets.

4.3.2. Policy Feedback: Demonstration Replay and Structure-Consistent Q-Value Distillation

MCTS solutions are not used as one-shot offline references; instead, they are injected into the learning loop under a controlled cadence so that global optimization continuously feeds local policy updates. Let an MCTS run be triggered whenever the interaction steps reach $K_{mcts} (t)$, with a search budget of $B_{mcts} (t)$ simulations; after search, we perform $U_{mcts} (t)$ parameter updates where each mini-batch mixes demonstration and environment samples with ratio $ρ_{demo} (t)$. To rely more on structural evidence early and anneal toward autonomous improvement later, we use the schedule

K_{mcts} (t) = K_{0} (1 + t / T_{K}), B_{mcts} (t) = B_{0} (1 + t / T_{B}), ρ_{demo} (t) = ρ_{0} e^{- t / T_{ρ}}, λ_{dist} (t) = λ_{0} {(1 + t / T_{λ})}^{- 1},

(66)

where $(K_{0}, B_{0}, ρ_{0}, λ_{0})$ are initial values and $(T_{K}, T_{B}, T_{ρ}, T_{λ})$ are time scales. Under this schedule, the trigger interval and the search budget grow gradually as the policy matures, whereas the demonstration ratio and distillation weight decay to avoid long-term reliance on external supervision.

The update count per trigger scales with budget as

U_{mcts} (t) = U_{0} + κ B_{mcts} (t), κ \in [0.1, 0.3],

(67)

where $U_{0}$ is a base count and $κ$ controls coupling between search and learning. With mini-batch size B, we sample from demonstration and environment buffers according to the scheduled ratio,

| B_{demo} | = ρ_{demo} (t) B, | B_{env} | = (1 - ρ_{demo} (t)) B .

(68)
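The schedules of Equations (66) to (68) can be sketched as plain functions. Rounding the demonstration share to an integer sample count is an assumption, since Equation (68) does not specify how fractional counts are handled; all numeric arguments in the usage line are hypothetical.

```python
import math

def schedules(t, k0, b0, rho0, lam0, t_k, t_b, t_rho, t_lam):
    """Trigger interval, budget, demo ratio, and distillation weight at step t (Eq. (66))."""
    return {
        "k_mcts": k0 * (1.0 + t / t_k),          # grows linearly
        "b_mcts": b0 * (1.0 + t / t_b),          # grows linearly
        "rho_demo": rho0 * math.exp(-t / t_rho), # exponential decay
        "lambda_dist": lam0 / (1.0 + t / t_lam), # hyperbolic decay
    }

def update_count(b_mcts, u0, kappa):
    """Number of parameter updates per trigger (Eq. (67))."""
    return u0 + kappa * b_mcts

def batch_split(batch_size, rho_demo):
    """Demonstration and environment sample counts (Eq. (68)), rounded (assumed)."""
    n_demo = round(rho_demo * batch_size)
    return n_demo, batch_size - n_demo

early = schedules(0, 100, 50, 0.5, 1.0, 1000, 1000, 1000, 1000)
late = schedules(1000, 100, 50, 0.5, 1.0, 1000, 1000, 1000, 1000)
```

Comparing `early` and `late` shows the intended behavior: searches become rarer but deeper, while the share of demonstrations and the distillation weight shrink.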

Demonstrations come from the union of current and past nondominated solutions, each mapped to a scalar return via the hypervolume of multi-objective scores and decoded into joint actions and transition tuples for storage. To improve sample efficiency, we use prioritized sampling that balances TD error and structural prior: if sample i has TD error $δ_{i}$ and a prior $π_{i}^{mcts} \in [0, 1]$ derived from structural gating and hypervolume rank, then

p_{i} = α_{per} | δ_{i} | + (1 - α_{per}) π_{i}^{mcts}, α_{per} \in [0, 1],

(69)

and samples are drawn with probability proportional to the normalized $p_{i}$ across both buffers.
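A minimal sketch of the priority rule in Equation (69) and the resulting sampling step is given below. The uniform fallback for an all-zero priority vector and the use of a seeded `random.Random` instance are assumptions added for robustness and reproducibility.

```python
import random

def sampling_probabilities(td_errors, priors, alpha_per=0.6):
    """Normalized priorities p_i = alpha*|delta_i| + (1 - alpha)*pi_i (Eq. (69))."""
    priorities = [alpha_per * abs(d) + (1.0 - alpha_per) * p
                  for d, p in zip(td_errors, priors)]
    total = sum(priorities)
    if total == 0.0:                       # assumed fallback: uniform sampling
        return [1.0 / len(priorities)] * len(priorities)
    return [p / total for p in priorities]

def sample_indices(probabilities, k, rng):
    """Draw k buffer indices with replacement, proportionally to the priorities."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=k)

probs = sampling_probabilities([1.0, -1.0], [1.0, 0.0], alpha_per=0.5)
picked = sample_indices(probs, 4, random.Random(0))
```

Setting `alpha_per` to 1 recovers purely TD-error-driven prioritization, while 0 relies only on the structural prior returned by search.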

Beyond the TD loss and the attention–gating consistency regularizer, we introduce a structure-consistent distillation term to align marginal contributions in the value decomposition with structural evidence from search. Let the coalition value for task $T_{m}$ be $Q_{m, t}$ and its leave-one-out counterfactual be $Q_{m, t}^{∖ n}$; define the estimated marginal contribution ${\hat{Δ}}_{n, m} = Q_{m, t} - Q_{m, t}^{∖ n}$, and let $Δ_{n, m}^{mcts}$ be the normalized contribution returned by search on the same scale. The distillation loss is

L_{dist} = {(Q_{tot} - y^{mcts})}^{2} + λ_{Δ} \sum_{m} \sum_{n \in C_{m}} {({\hat{Δ}}_{n, m} - Δ_{n, m}^{mcts})}^{2},

(70)

where $y^{mcts}$ is an unbiased backup target and $λ_{Δ}$ is a weight. The per-step objective becomes

L = L_{TD} + η_{cons} L_{cons} + λ_{dist} (t) L_{dist},

(71)

where $L_{TD}$ is the joint-value TD loss, $L_{cons}$ aligns attention distributions with downstream gating to preserve the structural semantics of task selectivity and contextual suitability, and $λ_{dist} (t)$ follows (66) to emphasize structural supervision early and gradually release policy self-organization later. The pseudocode of this MCTS-guided policy feedback loop is given in Algorithm 3.
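The distillation term of Equation (70) can be sketched for a single sample as follows. The nested-dictionary layout (task to UAV to value) and the default `lam_delta` are assumptions made for illustration.

```python
def distillation_loss(q_tot, y_mcts, q_coalition, q_leave_one_out, delta_mcts,
                      lam_delta=0.5):
    """Value-matching term plus marginal-contribution alignment (Eq. (70)).

    q_coalition[m] is Q_m; q_leave_one_out[m][n] is Q_m with member n removed;
    delta_mcts[m][n] is the normalized contribution returned by search.
    """
    loss = (q_tot - y_mcts) ** 2
    for task, q_m in q_coalition.items():
        for uav, q_without in q_leave_one_out[task].items():
            delta_hat = q_m - q_without        # estimated marginal contribution
            loss += lam_delta * (delta_hat - delta_mcts[task][uav]) ** 2
    return loss

example = distillation_loss(
    q_tot=2.0, y_mcts=1.5,
    q_coalition={"T1": 1.0},
    q_leave_one_out={"T1": {"u1": 0.5, "u2": 1.0}},
    delta_mcts={"T1": {"u1": 0.5, "u2": 0.5}},
)
```

In the example, u1's estimated contribution already matches the search evidence, so only the value gap and u2's mismatch contribute to the loss.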

Algorithm 3 MCTS-guided policy feedback with scheduled demonstration and distillation

Input: current interaction step t; environment buffer $D_{env}$; demonstration buffer $D_{demo}$; HAN–SHIELD networks; hyperparameters $K_{0}, B_{0}, ρ_{0}, λ_{0}, U_{0}, κ, T_{K}, T_{B}, T_{ρ}, T_{λ}, α_{per}$; mini-batch size B
Output: updated replay buffers and SHIELD parameters
Compute $K_{mcts} (t)$, $B_{mcts} (t)$, $ρ_{demo} (t)$, $λ_{dist} (t)$ via Equation (66)
Compute $U_{mcts} (t)$ via Equation (67)
if t is a multiple of $K_{mcts} (t)$ then
    Run budgeted MCTS with $B_{mcts} (t)$ simulations using SHIELD values and feasibility pruning; obtain nondominated solutions $S_{MCTS}$
    for each solution $(C^{(i)}, A^{(i)}) \in S_{MCTS}$ do
        Decode into trajectories $(s_{u}^{(i)}, a_{u}^{(i)}, R_{u}^{(i)}, s_{u}^{' (i)})$
        For each step u, compute multi-objective scores and hypervolume-based return $y_{u}^{mcts}$, and structural contributions $\{Δ_{n, m, u}^{mcts}\}$
        Store $(s_{u}^{(i)}, a_{u}^{(i)}, R_{u}^{(i)}, s_{u}^{' (i)}, y_{u}^{mcts}, \{Δ_{n, m, u}^{mcts}\})$ in $D_{demo}$
    end for
    for $u = 1$ to $U_{mcts} (t)$ do
        Set $| B_{demo} | = ρ_{demo} (t) B$ and $| B_{env} | = (1 - ρ_{demo} (t)) B$ by Equation (68)
        For each sample i in $D_{env} \cup D_{demo}$, compute TD error $δ_{i}$ and structural prior $π_{i}^{mcts}$; set priority $p_{i}$ via Equation (69)
        Sample $B_{demo}$ from $D_{demo}$ and $B_{env}$ from $D_{env}$ proportionally to $p_{i}$; form mini-batch $B = B_{demo} \cup B_{env}$
        On $B$, compute $L_{TD}$, $L_{cons}$, and $L_{div}$ as in Section 4.2
        For samples with MCTS labels, compute $L_{dist}$ via Equation (70); set $L_{dist} = 0$ otherwise
        Form total loss $L$ by Equation (71) and update SHIELD parameters and targets with learning rate $η$
    end for
end if
return $D_{env}$, $D_{demo}$, and updated networks

4.4. Complexity and Real-Time Feasibility

To render the overall framework deployable, we formalize the end-to-end latency, derive module-wise asymptotic costs, and state the resulting real-time criterion and scalability conditions.

Let $N = | V |$ be the number of UAVs, $M_{act} \leq M$ the number of active tasks in the current planning scope, $S = \sum_{m} | C_{m} |$ the total member count across active task coalitions, d the embedding dimension, L the number of HAN layers, $C_{\max}$ a per-task coalition-size cap, and k the sparsity level for cross-task interactions.

4.4.1. End-to-End Latency and Real-Time Criterion

We decompose a single decision period as

$T_{e2e} = T_{HAN} + T_{SHIELD} + T_{MCTS} + T_{I/O}, \quad T_{MCTS} \approx B (T_{eval} + T_{check}),$

(72)

where B is the search budget, $T_{eval}$ is the time to evaluate SHIELD on a partial structure, and $T_{check}$ is the incremental feasibility checking time (energy, bandwidth, and latency are updated only for affected UAVs and links). Given a control period $T_{c}$, real-time feasibility requires

$T_{e2e} \leq T_{c} \Rightarrow B \leq \frac{T_{c} - (T_{HAN} + T_{SHIELD} + T_{I/O})}{T_{eval} + T_{check}} .$

(73)

This bound links real-time feasibility to tunable parameters $(B, k, C_{\max})$ for pre-deployment selection and online adaptive pruning.
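The bound in Equation (73) translates directly into a budget calculator. The function name and the timing values in the usage line are hypothetical; any consistent time unit can be used.

```python
import math

def max_search_budget(t_control, t_han, t_shield, t_io, t_eval, t_check):
    """Largest integer budget B satisfying the real-time criterion of Eq. (73)."""
    slack = t_control - (t_han + t_shield + t_io)
    if slack <= 0:
        return 0                   # fixed overhead already exceeds the control period
    return math.floor(slack / (t_eval + t_check))

# Hypothetical timings in milliseconds.
budget = max_search_budget(100.0, 15.0, 20.0, 5.0, 1.5, 0.5)
```

With a 100 ms control period, 40 ms of fixed overhead, and 2 ms per simulation, at most 30 simulations fit into one decision period.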

4.4.2. Asymptotic Costs by Module

HAN. Bidirectional hypergraph attention uses masked normalizations restricted to candidate sets. With one L-layer pass, the cost is

$T_{HAN} = O (L (N d^{2} + M_{act} d^{2} + S d)),$

(74)

where $N d^{2}$ and $M_{act} d^{2}$ arise from affine mappings for nodes and hyperedges, and $S d$ from member–task scoring plus masked softmax. Masking confines normalization to valid candidates and avoids unnecessary global normalization.

SHIELD. Role attribution and agent-conditional Q incur $O (N d^{2})$; the nested intra-coalition composition costs $O (S d^{2})$; cross-task interactions, when sparsified to at most k neighbors per task, reduce the naïve $O (M_{act}^{2} d^{2})$ to

$T_{SHIELD} = O (N d^{2} + S d^{2} + k M_{act} d^{2}), \quad S = O (M_{act} C_{\max}) \text{ if } C_{\max} \text{ is bounded}.$

(75)
Budgeted MCTS. With progressive widening $| A (s) | \leq c_{pw} N (s)^{\zeta}$ for $0 < \zeta < 1$ and layer-wise upper-bound pruning, the per-decision search time is

$T_{MCTS} = O (B (T_{SHIELD}^{partial} + T_{check})),$

(76)

where $T_{SHIELD}^{partial}$ has the same order as (75) but is evaluated on a partial structure with $M_{part} \leq M_{act}$ ; masking stabilizes tensor shape and scale during partial evaluations.
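The progressive-widening rule used in the budgeted MCTS can be sketched as follows. This is a minimal illustration under stated assumptions: $N(s)$ is read as the visit count of search node $s$ (the usual convention for progressive widening, not the swarm size $N$), the floor and the minimum of one child are implementation choices, and the default values of `c_pw` and `zeta` are hypothetical.

```python
import math


def widening_limit(visits, c_pw=2.0, zeta=0.5):
    """Maximum number of child actions allowed at a node.

    Implements |A(s)| <= c_pw * N(s)**zeta with 0 < zeta < 1, where
    `visits` plays the role of N(s). At least one child is always
    allowed so that an unvisited node can still be expanded.
    """
    if not 0.0 < zeta < 1.0:
        raise ValueError("zeta must lie strictly between 0 and 1")
    if visits < 0:
        raise ValueError("visits must be non-negative")
    return max(1, math.floor(c_pw * visits ** zeta))


def may_expand(num_children, visits, c_pw=2.0, zeta=0.5):
    """True when the node may add a new child; otherwise the search
    should select among the existing children."""
    return num_children < widening_limit(visits, c_pw, zeta)
```

Because the limit grows sublinearly in the visit count, the branching factor stays small early in the search and widens only at nodes that the search keeps returning to.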

4.4.3. Conditional Near-Linear Scalability

When $d$, $L$, $C_{\max}$, and $k$ are engineering constants and masking ensures $M_{act} \ll M$, we obtain

$T_{HAN} + T_{SHIELD} = O (N + M_{act}) .$

(77)

Combining (73) with (77) yields end-to-end latency that scales near-linearly with $(N, M_{act})$, while $B$, $k$, and $C_{\max}$ serve as explicit knobs to trade search effort for throughput under the hard constraint $T_{c}$.
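The near-linear claim can be checked numerically from the dominant terms of (74) and (75). The sketch below counts those terms with unit constants, which is an assumption made purely for illustration; the default values of `d`, `num_layers`, `c_max`, and `k` are hypothetical, and $S$ is replaced by its worst case $M_{act} C_{\max}$.

```python
def han_ops(n, m_act, s, d, num_layers):
    """Dominant-term count for Eq. (74): L * (N d^2 + M_act d^2 + S d)."""
    return num_layers * (n * d**2 + m_act * d**2 + s * d)


def shield_ops(n, m_act, s, d, k):
    """Dominant-term count for Eq. (75): N d^2 + S d^2 + k M_act d^2."""
    return n * d**2 + s * d**2 + k * m_act * d**2


def dense_cross_task_ops(m_act, d):
    """Naive all-pairs cross-task interaction cost, M_act^2 d^2."""
    return m_act**2 * d**2


def encoder_mixer_ops(n, m_act, d=64, num_layers=2, c_max=4, k=3):
    """HAN plus SHIELD cost with S bounded by M_act * C_max."""
    s = m_act * c_max
    return han_ops(n, m_act, s, d, num_layers) + shield_ops(n, m_act, s, d, k)


if __name__ == "__main__":
    base = encoder_mixer_ops(20, 10)
    doubled = encoder_mixer_ops(40, 20)
    # With d, L, C_max, and k held fixed, doubling (N, M_act) doubles the count.
    print(doubled / base)
```

Under these assumptions every term is linear in either $N$ or $M_{act}$, so doubling both doubles the count, whereas the unsparsified cross-task term would grow quadratically in $M_{act}$.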

5. Experimental Results and Analysis

To verify the effectiveness and adaptability of the proposed overlapping coalition formation method for heterogeneous UAV swarms, we design multiple simulation scenarios of different scales that cover a range of task densities, resource tightness, and UAV configurations. All experiments are conducted on a Windows 11 system, equipped with an Intel Core i9-13900K processor (3.00 GHz base frequency), two NVIDIA GeForce RTX 4090 GPUs with 24 GB of VRAM each, and 128.0 GB of RAM.

5.1. Simulation Setup

To evaluate SGRL-TS, we build a simulator with heterogeneous tasks and multiple UAV platform types. We configure five representative tasks and specify their resource requirements in Table 2. To meet these demands, we design four UAV platform types, with their capabilities summarized in Table 3. The platforms emphasize support, delivery, firefighting, and communication relay. To ensure realism and reproducibility, we specify the environment and communication parameters, including area size, communication radius, energy model, and weights for the multi-objective reward, as listed in Table 4.

Table 2. Resource requirements of representative tasks.

Table 3. Capabilities of heterogeneous UAV platform types.

Table 4. Simulation environment parameters.

To highlight the advantages of SGRL-TS, we compare it against the following methods:

PGG-TS-OCF [17] establishes preference relations between UAV supply and task demand, guides tabu search via preference gravity, and achieves overlapping allocation and scheduling with a stable overlapping coalition.
LocalSearch-CF [15] greedily seeds UAV–task coalitions and iteratively swaps or reassigns UAVs between neighboring coalitions to minimize mission completion time.
HYGMA [20] dynamically clusters nearby UAVs based on state and task affinity, treating each cluster as a coalition and learning joint actions through hypergraph attention.
SMART [23] enables each UAV to repeatedly switch its serving task coalition according to a locally evaluated payoff, yielding a distributed best response coalition formation.
RCFG-DRL [26] models UAV–task assignment as a repeated coalition game, where UAVs learn join or leave decisions via deep RL until coalitions reach equilibrium.

5.2. Performance Evaluation

Figure 6 reports convergence and final utilities under four resource conditions. SGRL-TS consistently outperforms the strongest baseline, PGG-TS-OCF, by 3.19%, 4.49%, 6.25%, and 9.68% in the abundant, balanced, constrained, and scarce settings, respectively. It enters the efficient ascent earlier and shows a smoother plateau. The gains arise from heterogeneous hypergraph attention, which captures high-order couplings among UAVs, tasks, and coalitions; a structure-conditioned hierarchical value decomposition that yields globally comparable, monotone scores and suppresses merge-split oscillations; and budgeted MCTS under feasibility masks, which focuses expansions on high-value structures and reduces wasted search.

Figure 6. Convergence of SGRL-TS and baselines across four resource regimes: (a) 18 UAVs–5 tasks (resource-abundant); (b) 20 UAVs–10 tasks (resource-balanced); (c) 15 UAVs–15 tasks (resource-constrained); (d) 10 UAVs–20 tasks (resource-scarce).

Baseline behavior clarifies the gaps. PGG-TS-OCF employs a parallel population search that identifies feasible overlaps early on; however, it stalls at suboptimal mixes under budget constraints and lacks a stable cross-level yardstick. HYGMA strengthens interaction modelling, but long value propagation under multiple constraints slows ascent. RCFG-DRL introduces adversarial robustness; however, nonstationarity induces mid-horizon oscillations, diverting budget from structural improvements. LocalSearch-CF and SMART use anytime stepwise moves bounded by small neighborhoods, which leads to local optima and capped utility. Non-overlapping SGRL-TS forbids resource reuse, which creates capacity bottlenecks and synchronization penalties and heightens sensitivity to task density and temporal perturbations.

We next examine resource utilization efficiency, task adaptability, and robustness under overload and scarcity by varying the number of UAVs from 4 to 20 under different task loads, as shown in Figure 7. Across the four task scales, the curve of SGRL-TS remains at the top and reaches a higher peak near moderate swarm sizes; the standard deviation bars show markedly lower variability compared to all baselines. Relative to the strongest competitor, PGG-TS-OCF, SGRL-TS achieves higher peak utilities by approximately 2.27%, 3.00%, 6.36%, and 9.76% at task numbers 5, 10, 15, and 20, respectively. The advantage increases as resource constraints tighten, indicating stronger resource scheduling and parallel coordination under crowded and scarce conditions.

Figure 7. Average task utility versus the number of UAVs under different task loads: (a) $T = 5$; (b) $T = 10$; (c) $T = 15$; (d) $T = 20$.

This performance stems from a careful consideration of the benefit–cost balance. SGRL-TS estimates timing and energy constraints online and maps them onto a unified utility scale, which steers the search toward ranges where adding UAVs yields net gains while suppressing ineffective parallelism and communication congestion as scale grows, thus avoiding performance regression at large swarm sizes. Temperature-controlled sampling and reuse of candidate structures broaden exploration in the early phase and, as convergence proceeds, guide the search toward low-conflict configurations. Combined with penalties and pruning for repeated assignment and resource contention, these mechanisms reduce structural oscillation and tail-phase jitter. The result is a better compromise between task completion and coordination cost, leading to higher and more stable final utility.
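Temperature-controlled sampling over candidate structures can be illustrated with a standard softmax (Boltzmann) rule. This is a sketch of one common form, not necessarily the exact scheme used by SGRL-TS; the function names and the example scores are hypothetical.

```python
import math
import random


def softmax_with_temperature(scores, temperature):
    """Convert candidate scores into sampling probabilities.

    A high temperature flattens the distribution (more exploration);
    a low temperature concentrates mass on the best-scoring candidate.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_candidate(scores, temperature, rng):
    """Draw the index of one candidate structure."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(range(len(scores)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    candidate_scores = [1.0, 2.0, 3.0]  # hypothetical value estimates
    print(softmax_with_temperature(candidate_scores, 10.0))
    print(softmax_with_temperature(candidate_scores, 0.1))
    print(sample_candidate(candidate_scores, 1.0, rng))
```

Annealing the temperature from a high to a low value is one way to obtain the broad-then-focused behavior described above.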

We further test adaptability under sparse and dense tasks, reuse efficiency, and robustness to task pressure by fixing $N \in \{8, 12, 16\}$ and increasing the number of tasks from 3 to 12. Results are shown in Figure 8. SGRL-TS stays on top across the three UAV scales and reaches a higher peak near the midrange of task counts, while the tail declines more gently and the variability remains smaller. Compared with the strongest baseline PGG-TS-OCF, the average utility over the full range improves by about 3.86% at 16 UAVs, 2.97% at 12 UAVs, and 3.53% at 8 UAVs.

Figure 8. Average task utility versus the number of tasks under different swarm sizes: (a) $N = 16$ UAVs; (b) $N = 12$ UAVs; (c) $N = 8$ UAVs.

This advantage and stability arise because bidirectional hypergraph attention normalizes task selectivity and contextual suitability within mask-constrained candidate sets, enabling precise member screening and task assignment as the number of tasks increases, which suppresses ineffective overlaps and resource contention. The SHIELD nested nonlinear aggregation with cross-task interaction terms provides a monotonic and comparable global value for cooperation and competition across tasks, making diminishing marginal returns detectable as the task load grows. This concentrates resources on actions with positive net gain, producing a higher midrange peak and slower performance decay.

To assess task completion under scaling, we vary the number of tasks from 4 to 12 with $N = 15, 10, 5$. Results are shown in Figure 9. Across the three experimental settings, the SGRL-TS curve remains closely aligned with the upper bound provided by the Task-completion OCF baseline, which optimizes only task fulfillment. As the number of tasks increases from 4 to 12, SGRL-TS improves the average task execution sufficiency over the entire range by approximately 2.42%, 2.63%, and 10.94% relative to PGG-TS-OCF. Moreover, when other methods exhibit pronounced degradation at higher task counts, SGRL-TS shows a significantly slower decline and can maintain a larger fraction of tasks close to complete execution, even under severely constrained resources. This advantage primarily stems from the balance term in the reward, which discourages extreme solutions that sacrifice a subset of tasks, thereby driving the policy to maintain medium to high completion levels across more tasks as the task load increases.

Figure 9. Average task completion versus the number of tasks for three UAV swarm sizes: (a) UAV $= 15$; (b) UAV $= 10$; (c) UAV $= 5$.

Figure 10a–c report the coalition-level temporal coordination performance of all methods. The evaluation metric is the coalition synchronization sufficiency, defined as the normalized score $f_{sync} \in [0, 1]$ obtained from the coalition arrival-time deviation cost $J_{syn}$ according to (23); larger values indicate more synchronized coalition arrivals under the given reference time scale. As the number of tasks increases, the synchronization sufficiency of all methods decreases overall, indicating that higher task congestion makes it harder for coalitions to achieve good temporal coordination; moreover, when the number of UAVs is reduced from 15 to 5, the overall degradation in synchronization performance becomes more pronounced. The advantage of SGRL-TS in synchronization sufficiency is most pronounced in configurations with more tasks and tighter resources, suggesting that structure-guided overlapping coalition formation combined with joint value decomposition can effectively suppress the dispersion in coalition arrival times. In contrast, non-overlapping SGRL-TS and Task-completion OCF, which only focus on task completion rate, exhibit significantly lower synchronization sufficiency under high-load scenarios, indicating that ignoring overlapping structures or lacking explicit synchronization modeling leads to markedly degraded temporal coordination among coalitions.

Figure 10. Average coalition synchronization sufficiency versus the number of tasks under three UAV swarm sizes: (a) UAV $= 15$; (b) UAV $= 10$; (c) UAV $= 5$. The metric $f_{sync} \in [0, 1]$ is obtained by normalizing the coalition arrival-time deviation cost according to (23), with larger values indicating more synchronized coalition arrivals.

Previous experiments on task completion and utility have shown that SGRL-TS achieves returns close to the Task-completion OCF upper bound and outperforms PGG-TS-OCF. This section further examines its cost side from the perspective of energy utilization. As shown in Figure 11, under four task scales $T = 5, 10, 15, 20$, the energy-efficiency curves of SGRL-TS lie consistently above those of all baselines. Compared with PGG-TS-OCF, the average energy efficiency over the entire UAV range improves by approximately 8.09%, 8.02%, 3.05%, and 12.85%, respectively. Moreover, relative to the Task-completion OCF scheme, which optimizes only task completion, SGRL-TS achieves comparable completion levels while attaining between two and five times higher energy efficiency, thereby providing a considerably more economical way of sustaining overlapping coalition structures from the energy consumption viewpoint. Overall, this advantage primarily stems from incorporating energy safety margins and residual-resource-driven feasible-set pruning into the MCTS search guided by SHIELD evaluations, which reduces the expansion of high-cost, overlapping structures at the search level, and thus markedly improves global energy utilization without sacrificing task completion.

Figure 11. Energy efficiency versus the number of UAVs under different task loads: (a) $T = 5$; (b) $T = 10$; (c) $T = 15$; (d) $T = 20$.

To examine the sensitivity of average task utility and method ranking to multi-objective weight settings, task-priority scenarios, and time/energy normalization scales, we conducted comparative experiments in a scenario with 10 tasks and 20 UAVs, as shown in Figure 12. In Figure 12a, under fixed network parameters and training configurations, we change only the weight vector $\omega^{(k)} = (w_{comp}, w_{syn}, w_{eng})$ at evaluation time to assess the weight sensitivity of the multi-objective design. In Figure 12b, under the default setting $\omega^{(1)} = (0.4, 0.3, 0.3)$, we keep the training process unchanged and modify only the task-priority directed acyclic graph to construct three scenarios (Balanced priorities, Rescue priority, and Communication priority), which form the horizontal axis. In Figure 12c, we scale the normalization ranges of the synchronization reference time $T_{ref}$ and the energy reference value $E_{ref}$ and investigate the impact of five combinations on the final task utility, where the horizontal axis corresponds to these five $(T, E)$ configurations.

Figure 12. Sensitivity of average task utility to (a) multi-objective weight vectors $\omega^{(k)} = (w_{comp}, w_{syn}, w_{eng})$, (b) task-priority scenarios, and (c) normalization scales $(T, E)$ of synchronization and energy costs in the 10-task, 20-UAV resource-neutral setting. Each curve reports the mean utility with standard-deviation error bars for SGRL-TS, HYGMA, PGG-TS-OCF, and RCFG-DRL.

Across all configurations of the three sensitivity tests, SGRL-TS consistently attains the highest average task utility, with a performance gap of approximately 2–3 percentage points relative to the best-performing baseline PGG-TS-OCF, and exhibits substantially smaller variance, indicating overall robustness to perturbations in weights, priority settings, and normalization scales. Specifically, in Figure 12a, the completion-emphasized weight vector $\omega^{(2)}$ increases the utilities of all four methods, whereas the synchronization- and energy-emphasized weight vectors $\omega^{(3)}$ and $\omega^{(4)}$ reduce the overall utilities, with a particularly pronounced impact on RCFG-DRL, while the curve of SGRL-TS exhibits only mild fluctuations. Figure 12b shows that different task-priority topologies induce only slight changes in the utilities of all methods, and SGRL-TS can still better coordinate resources and synchronization constraints in the rescue-priority scenario, maintaining a stable performance lead. In Figure 12c, scaling $T_{ref}$ or $E_{ref}$ mainly changes the absolute level of utility, and more stringent normalization (such as $T = 0.5$ or $E = 0.5$) has a more pronounced negative impact on RCFG-DRL, whereas both the performance and variance of SGRL-TS vary only moderately.

We analyze the per-decision computational overhead of SGRL-TS under different swarm scales and search budgets, as shown in Table 5. We reuse the SGRL-TS policy trained in the previous experiments and perform online evaluation on six task configurations in inference mode without enabling backpropagation. For each configuration, we record the wall-clock time of the HAN encoding $T_{HAN}$, the SHIELD mixing $T_{SHIELD}$, the MCTS planning $T_{MCTS}$, and the end-to-end decision latency $T_{e2e}$ over $10^{4}$ consecutive decision steps, and we collect the average number of node expansions of MCTS under feasible-region pruning (Avg. exp.).

Table 5. Per-decision runtime profiling of SGRL-TS under different swarm scales and search budgets (N: number of UAVs, M: number of tasks). All results are averaged over $10^{4}$ decision steps on the hardware platform described in Section 5.

As shown in Table 5, in S₄ the end-to-end decision latency $T_{e2e}$ is about 5.54 ms, whereas the lower bound under the no-search configuration (S₂, $B = 0$) is only 2.10 ms, indicating that even with structured MCTS enabled the overall overhead remains significantly below the typical UAV control period on the order of tens of milliseconds and thus satisfies real-time application requirements. As the task scale increases, $T_{HAN}$ and $T_{SHIELD}$ slowly increase from about 0.82/0.64 ms in S₁ to about 1.05/0.82 ms in S₄, exhibiting an approximately linear growth trend that is consistent with the $O (N + M_{act})$ complexity result given in Section 4.4, which indicates that high-order structural modeling and structure-conditioned value decomposition themselves do not become the main bottlenecks. For the same swarm scale (S₂), when the search budget is increased from $B = 0$ to 32 and 64, the average number of expanded nodes grows from 0 to about 21.37 and 41.96, and the corresponding $T_{MCTS}$ increases from 0 to 1.63 ms and 3.01 ms. At the same time, Avg. exp. consistently remains clearly below the budget B, which confirms that feasible-region pruning effectively suppresses the size of the search tree and makes the MCTS computational cost approximately linearly controllable with respect to the budget. Taken together, these results show that SGRL-TS achieves both low per-step latency and good scalability within the swarm sizes and search budgets considered in this work.
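As a quick arithmetic check on the S₂ figures quoted above, the following sketch computes the search time per expanded node and the fraction of the budget actually expanded. Only the numbers already stated in the text are used; the helper names are illustrative.

```python
# (budget B, Avg. exp., T_MCTS in ms) for the S2 configuration quoted in the text
reported_s2 = [(32, 21.37, 1.63), (64, 41.96, 3.01)]


def per_expansion_ms(avg_exp, t_mcts_ms):
    """Average MCTS time spent per expanded node, in milliseconds."""
    return t_mcts_ms / avg_exp


def expansion_ratio(budget, avg_exp):
    """Fraction of the nominal budget that survives feasibility pruning."""
    return avg_exp / budget


if __name__ == "__main__":
    for budget, avg_exp, t_mcts in reported_s2:
        print(
            f"B={budget}: {per_expansion_ms(avg_exp, t_mcts):.4f} ms per expansion, "
            f"{expansion_ratio(budget, avg_exp):.3f} of budget expanded"
        )
```

Both rows give roughly 0.07 to 0.08 ms per expansion and about two thirds of the budget expanded, which is consistent with the approximately linear dependence of $T_{MCTS}$ on the budget described above.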

In Figure 13, we evaluate all algorithms in a scenario with 10 tasks and 15 heterogeneous UAVs. Multiple independent runs are conducted under different search budgets and random seeds to sample a set of feasible overlapping coalition-structure solutions. For each solution, we first compute the task execution sufficiency $f_{1} (T_{m}) \in [0, 1]$ and the synchronization deviation cost $J_{syn} (T_{m})$ according to (23), and then take the average over all tasks to obtain the overall task execution sufficiency and synchronization deviation; we also record the total energy cost $E_{T}$ incurred to complete all tasks. The normalized task shortfall is then defined as $\tilde{J}_{1} = 1 - \frac{1}{M} \sum_{m = 1}^{M} f_{1} (T_{m})$. Furthermore, synchronization deviation and energy cost are min–max normalized over the union of all methods and sampled solutions to obtain the normalized synchronization deviation $\tilde{J}_{syn}$ and normalized energy cost $\tilde{E}_{T}$. Consequently, all three quantities are scaled to the interval $[0, 1]$, with smaller values being better, which facilitates multi-objective Pareto analysis in a unified cost space.
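The normalization and Pareto analysis described above can be sketched as follows. The function names and the sample values are hypothetical; the constant-input convention in `min_max_normalize` and the strict-dominance definition are standard choices assumed here rather than details stated in the paper.

```python
def task_shortfall(f1_values):
    """Normalized task shortfall: 1 minus the mean execution sufficiency."""
    return 1.0 - sum(f1_values) / len(f1_values)


def min_max_normalize(values):
    """Scale values to [0, 1] over the pooled set of sampled solutions."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumed convention for constant input
    return [(v - lo) / (hi - lo) for v in values]


def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and strictly
    better in at least one objective (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the non-dominated cost vectors, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical (shortfall, sync deviation, energy) triples, already in [0, 1].
    samples = [(0.1, 0.9, 0.5), (0.2, 0.95, 0.6), (0.9, 0.1, 0.5)]
    print(pareto_front(samples))
```

In this toy example the second triple is dominated by the first, so only the first and third remain on the front.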

Figure 13. Multi-objective trade-off among normalized task shortfall, synchronization deviation, and energy cost: (a) normalized task shortfall $\tilde{J}_{1}$ versus normalized synchronization deviation $\tilde{J}_{syn}$; (b) normalized task shortfall $\tilde{J}_{1}$ versus normalized energy cost $\tilde{E}_{T}$; (c) normalized synchronization deviation $\tilde{J}_{syn}$ versus normalized energy cost $\tilde{E}_{T}$. All three objectives are min–max normalized over the collected solutions so that lower values are better.

From Figure 13a, the scatter of SGRL-TS is more concentrated in the lower-left region of the $(\tilde{J}_{1}, \tilde{J}_{syn})$ plane, yielding fewer dominated solutions with either a small task shortfall but significant synchronization deviation, or good synchronization at the price of a significantly increased task shortfall, compared with the baseline methods. In Figure 13b, SGRL-TS maintains a smaller $\tilde{J}_{1}$ under lower energy cost $\tilde{E}_{T}$, whereas other methods typically require higher energy to achieve a similar level of task completion or suffer a larger task shortfall at comparable energy, indicating a more favorable energy–efficiency trade-off. Figure 13c further shows that, in the $(\tilde{J}_{syn}, \tilde{E}_{T})$ plane, the SGRL-TS samples overall lie closer to the lower-left Pareto boundary and form a more compact “knee” region around low energy and low synchronization deviation. In contrast, the samples of the baseline methods more frequently fall outside this frontier. Taken together, these results demonstrate that SGRL-TS achieves superior Pareto performance in the joint objective space of task completion, coalition-time synchronization, and energy consumption.

5.3. Ablation Studies

Table 6 presents a systematic ablation of the encoder, value decomposition, and global search modules under resource-neutral and resource-tight configurations. Here, Viol. denotes the constraint violation rate, AUC is the normalized area under the curve of average task utility Util versus training iterations, and Iter@95% is defined as the training iteration at which the Util curve first reaches 95% of its steady-state mean, where the steady-state mean is computed from a moving average over the final training window. Comparing SGRL-TS with typical graph-based DRL, under the resource-tight regime Full-SGRL-TS improves Compl, Util, and Eff over GAT-QMIX by about 23.9%, 22.2%, and 26.5%, respectively, while reducing Viol by 45.9%, increasing AUC by 48.3%, and shortening Iter@95% by 36.9%. These gains do not come merely from replacing the GAT encoder and the QMIX mixer with different modules, but from jointly exploiting three structural mechanisms: HAN explicitly encodes the high-order UAV–task–coalition hypergraph so that structural semantics enter value estimation and search rather than being limited to one-hop graph attention; SHIELD injects structure-conditioned terms within and across coalitions, enabling finer modeling of cooperative gains and resource competition and yielding a more monotone and comparable global value $Q_{tot}$; and the structured MCTS uses these signals for feasibility-set pruning and structural heuristics, so that coalition configurations are optimized globally around structural priors instead of relying on local GNN outputs.
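The two convergence metrics defined above, Iter@95% and the normalized AUC, can be computed from a logged utility curve roughly as follows. This is a sketch under assumptions: the steady-state mean is taken as the plain mean over the last `window` points, the AUC uses the trapezoidal rule normalized by a constant reference level `u_max`, iterations are counted from zero, and the curve values are hypothetical.

```python
def steady_state_mean(util, window):
    """Mean utility over the final `window` logged iterations."""
    tail = util[-window:]
    return sum(tail) / len(tail)


def iter_at_fraction(util, window, fraction=0.95):
    """First (zero-based) iteration whose utility reaches `fraction`
    of the steady-state mean, or None if it is never reached."""
    target = fraction * steady_state_mean(util, window)
    for i, u in enumerate(util):
        if u >= target:
            return i
    return None


def normalized_auc(util, u_max):
    """Trapezoidal area under the utility curve divided by the area of
    a curve held constant at u_max (assumed normalization)."""
    if len(util) < 2:
        raise ValueError("need at least two points")
    area = sum((a + b) / 2.0 for a, b in zip(util, util[1:]))
    return area / (u_max * (len(util) - 1))


if __name__ == "__main__":
    curve = [0.0, 0.5, 0.8, 1.0, 1.0, 1.0]  # hypothetical utility log
    print(iter_at_fraction(curve, window=2))
    print(normalized_auc(curve, u_max=1.0))
```

A curve that rises earlier yields both a smaller Iter@95% and a larger normalized AUC, which is why the two metrics tend to move together in Table 6.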

Table 6. Ablation and replacement study of SGRL-TS under resource-neutral and resource-tight regimes.

Module-level ablations further support this view. With SHIELD-full and structured MCTS fixed, we replace the encoder with a standard GAT (A1) or HyperGCN (A2). Relative to A2, the full HAN still improves Util and Eff in the resource-neutral regime by about 2.2% and 4.6%, further reduces Viol by 17.9%, increases AUC by 6.9%, and shortens Iter@95% by 11.4%; under the resource-tight regime it maintains roughly 6.0% and 6.7% gains in Util and Eff and a 20.7% reduction in violation rate, confirming that heterogeneous node types and high-order hyperedges provide additional structural information beyond conventional graph encoders. For the value decomposition, when VDN or QMIX mixing is used on top of HAN, SHIELD-lite already improves Util and Eff over QMIX-mix by about 2.2% and 4.7%, reduces Viol by 11.7%, increases AUC by 6.0%, and shortens Iter@95% by 5.7%; enabling full SHIELD further raises Util and Eff relative to QMIX-mix by 3.4% and 6.9%, decreases Viol by 22.0%, increases AUC by 8.4%, and reduces Iter@95% by 15.3%, with similar relative improvements in the resource-tight regime, indicating that the structure-conditioned nested mixer improves decomposability and credit assignment beyond monotonic mixing networks. In the global search module, removing search causes Util and Eff to drop by about 4.5% and 6.1%, Viol to increase by 34.4%, AUC to decrease by 11.3%, and Iter@95% to be extended by 24.6% compared with Full-SGRL-TS; greedy search or plain MCTS partially closes this gap but still underperforms structured MCTS in AUC and Iter@95%, showing that high-order structure-guided feasibility pruning and budget allocation are likewise crucial to achieving high AUC and fast, low-violation convergence.

5.4. Future Research Directions and Deployability

Evaluation is currently confined to simulation. Subsequent work will progress along a simulation–hardware-in-the-loop–field-trial path. First, we will build a reproducible benchmark that includes a library of parameterized scenarios, communication traces for synthetic generation and replay, and standardized evaluation scripts, accompanied by anonymized logs to strengthen comparability and reproducibility. Second, we will integrate the flight controller and onboard inference compute within a hardware-in-the-loop platform to quantify control-cycle stability, end-to-end latency, per-task energy consumption, and throughput, and we will replay post-disaster communication traces to assess robustness under burst losses, delay jitter, and bandwidth limitations. Third, we will conduct small-scale outdoor experiments with heterogeneous multi-UAV formations and calibrate the simulation models using field measurements of task completion rate, link quality, energy budget, and coalition-reconfiguration delay. To narrow the sim-to-real gap, we will systematically employ dynamics and payload domain randomization, bursty-channel perturbations based on the Gilbert–Elliott model, delay-aware action buffering with safety shielding, and online-estimate-driven feasibility-set tightening and parameter adaptation.
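For readers unfamiliar with the Gilbert–Elliott model mentioned above, the following sketch generates a bursty packet-loss trace from a two-state Markov channel. It is a generic illustration of the model, not the authors' planned test harness; the function name and all parameter values are hypothetical.

```python
import random


def gilbert_elliott_losses(n_packets, p_gb, p_bg, loss_good, loss_bad, rng):
    """Simulate packet losses on a two-state (good/bad) Markov channel.

    p_gb: probability of moving from the good state to the bad state.
    p_bg: probability of moving from the bad state back to the good state.
    loss_good, loss_bad: per-packet loss probability in each state.
    Returns a list of booleans, True meaning the packet was lost.
    """
    in_bad_state = False  # the channel starts in the good state
    losses = []
    for _ in range(n_packets):
        loss_prob = loss_bad if in_bad_state else loss_good
        losses.append(rng.random() < loss_prob)
        # State transition applied after each packet.
        if in_bad_state:
            if rng.random() < p_bg:
                in_bad_state = False
        elif rng.random() < p_gb:
            in_bad_state = True
    return losses


if __name__ == "__main__":
    rng = random.Random(42)
    trace = gilbert_elliott_losses(1000, p_gb=0.05, p_bg=0.3,
                                   loss_good=0.01, loss_bad=0.6, rng=rng)
    print(sum(trace) / len(trace))  # empirical loss rate of this trace
```

In the long run the channel spends a fraction `p_gb / (p_gb + p_bg)` of the time in the bad state, so small `p_bg` values produce long loss bursts of the kind a post-disaster link may exhibit.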

6. Conclusions

Starting from an overlapping-coalition paradigm that better reflects real rescue requirements, this work proposes a unified framework grounded in high-order structural modeling, centered on hierarchical value decomposition, and driven by global search with policy feedback. We jointly embed tasks and UAVs into a hypergraph, allowing for the capture of resource reuse, member complementarity, and inter-coalition competition within a common semantic space. Based on this representation, the hierarchical value decomposition stably aggregates information across the agent and coalition levels, providing reliable value signals for subsequent global structure search. The global search efficiently identifies high-potential structures within the feasible region. The resulting candidates are then fed back to the policy via demonstration replay and distillation, which forms a closed loop between learning and search. This loop consistently yields solutions with higher quality, lower energy consumption, and more stable coordination under complex feasibility constraints. In extensive post-disaster rescue simulations of varying scales, the proposed method improves task utility by up to 11.4% over the strongest baseline and surpasses the non-overlapping variant in energy efficiency by more than 228%. These advantages are more pronounced under resource-limited settings, indicating that overlapping coalitions combined with structure-aware policy learning can substantially enhance resource coordination under tight constraints while keeping coalition synchronization deviation and operational cost within a more reasonable range.

Author Contributions

Conceptualization, W.L. and Y.F.; methodology, W.L.; software, W.L.; validation, W.L. and Y.F.; writing—original draft preparation, W.L.; writing—review and editing, W.L., Y.F., F.Z., K.I.K., J.W. and P.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Natural Science Foundation of China under Grant 62471493 (for conceptualization and investigation), partially supported by the Natural Science Foundation of Shandong Province, China under Grants ZR2023LZH017 and ZR2024MF066 (for formal analysis and validation), and partially supported by the Russian Science Foundation (RSF) Project under Grant 22-71-10095-P (for validation and results verification).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Chen, Y.; Chen, R.; Huang, Y.; Xiong, Z.; Li, J. DRL-Based Improved UAV Swarm Control for Simultaneous Coverage and Tracking with Prior Experience Utilization. Drones 2024, 8, 784.
2. Zeng, Y.; Wu, L.; Li, J.; Zhuang, X.; Wu, C. Resilient Task Allocation for UAV Swarms: A Bilevel PSO-ILP Optimization Approach. Drones 2025, 9, 623.
3. Chen, J.; Wu, Q.; Xu, Y.; Qi, N.; Guan, X.; Zhang, Y.; Xue, Z. Joint Task Assignment and Spectrum Allocation in Heterogeneous UAV Communication Networks: A Coalition Formation Game-Theoretic Approach. IEEE Trans. Wirel. Commun. 2021, 20, 440–452.
4. Bai, X.; Fielbaum, A.; Kronmüller, M.; Knoedler, L.; Alonso-Mora, J. Group-Based Distributed Auction Algorithms for Multi-Robot Task Assignment. IEEE Trans. Autom. Sci. Eng. 2023, 20, 1292–1303.
5. Vig, L.; Adams, J. Multi-robot coalition formation. IEEE Trans. Robot. 2006, 22, 637–649.
6. Liu, F.; Dong, X.; Yu, J.; Hua, Y.; Li, Q.; Ren, Z. Distributed Nash Equilibrium Seeking of N-Coalition Noncooperative Games With Application to UAV Swarms. IEEE Trans. Netw. Sci. Eng. 2022, 9, 2392–2405.
7. Nunes, E.; Gini, M. Multi-Robot Auctions for Allocation of Tasks with Temporal Constraints. Proc. AAAI Conf. Artif. Intell. 2015, 29, 2110–2116.
8. Liao, X.; Wang, Y.; Han, Y.; Li, Y.; Lin, C.; Zhu, X. Heterogeneous Multi-Agent Deep Reinforcement Learning for Cluster-Based Spectrum Sharing in UAV Swarms. Drones 2025, 9, 377.
9. Su, Y.; Zhou, H.; Deng, Y.; Dohler, M. Energy-Efficient Cellular-Connected UAV Swarm Control Optimization. IEEE Trans. Wirel. Commun. 2024, 23, 4127–4140.
10. Shi, S.; Hu, C.; Wang, D.; Zhu, Y.; Han, Z. Federated HD Map Updating Through Overlapping Coalition Formation Game. IEEE Trans. Mob. Comput. 2024, 23, 1641–1654.
11. Zhang, Y.; Gao, X.; Ye, N.; Niyato, D.; Han, Z.; Yang, K. Joint UAV Deployment, Power Allocation, and Coalition Formation for Physical Layer Security in Heterogeneous Networks. IEEE Trans. Veh. Technol. 2025, 74, 10994–11009.
12. Peng, Q.; Wu, H.; Li, N.; Wang, F. A Dynamic Task Allocation Method for Unmanned Aerial Vehicle Swarm Based on Wolf Pack Labor Division Model. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 4075–4089.
13. Alqefari, S.; Menai, M.E.B. Multi-UAV Task Assignment in Dynamic Environments: Current Trends and Future Directions. Drones 2025, 9, 75.
14. Chen, R.; Li, J.; Peng, T. Decentralized UAV Swarm Scheduling with Constrained Task Exploration Balance. Drones 2023, 7, 267.
15. Guo, M.; Xin, B.; Wang, Y.; Chen, J. A Local-Search-Based Heuristic for Coalition Formation in Urgent Missions. IEEE Trans. Syst. Man, Cybern. Syst. 2024, 54, 6924–6935.
16. Yan, Y.; Bi, W.; Ma, G.; Zhang, A. Collaborative Task Allocation for Large-Scale Heterogeneous AAV Swarm: A Hierarchical Coalition Formation Game Method. IEEE Internet Things J. 2025, 12, 27237–27254.
17. Qi, N.; Huang, Z.; Zhou, F.; Shi, Q.; Wu, Q.; Xiao, M. A Task-Driven Sequential Overlapping Coalition Formation Game for Resource Allocation in Heterogeneous UAV Networks. IEEE Trans. Mob. Comput. 2023, 22, 4439–4455.
18. Zhang, J.; Zhang, B.; Han, Z. Coalition Formation Game Based Information-Energy Collaboration in Vehicle Edge Computing Networks. IEEE Trans. Veh. Technol. 2023, 72, 7717–7727.
19. Qi, N.; Huang, Z.; Sun, W.; Jin, S.; Su, X. Coalitional Formation-Based Group-Buying for UAV-Enabled Data Collection: An Auction Game Approach. IEEE Trans. Mob. Comput. 2023, 22, 7420–7437.
20. Liu, C.; Li, D. HYGMA: Hypergraph Coordination Networks with Dynamic Grouping for Multi-Agent Reinforcement Learning. arXiv 2025, arXiv:2505.07207.
21. Zhang, Z.; Piao, S. Vein-Based Coalitions for Multi-Agent Pattern Formation Tasks. IEEE Robot. Autom. Lett. 2022, 7, 12379–12386.
22. Liao, M.; Xiong, K.; Zhang, L.; Yu, X.; Cui, G. Frequency Agile Strategy Design of Radar Network for Wideband Sweep Jamming via Coalition Formation Game. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 7906–7919.
23. Taguelmimt, R.; Aknine, S.; Boukredera, D.; Changder, N.; Sandholm, T. Faster optimal coalition structure generation via offline coalition selection and graph-based search. arXiv 2024, arXiv:2407.16092.
24. Chen, L.; Wei, C.; Duan, H. Multiple Unmanned Aerial Vehicles Coalition Formation and Control for Collaborative Defense Mission. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 6095–6109.
25. Xiong, K.; Cui, G.; Liao, M.; Zhang, L.; Kong, L. Multitarget Pursuit Coalition Game of Radar-Enabled UAVs. IEEE Trans. Aerosp. Electron. Syst. 2025, 61, 13605–13621.
26. Zhou, H.; Chen, R.; Yi, C.; Zhang, J.; Kang, J.; Cai, J.; Guizani, M. A Repeated Coalition Formation Game for Physical Layer Security Aware Wireless Communications With Third-Party Intelligent Reflecting Surfaces. IEEE Trans. Wirel. Commun. 2025, 24, 7612–7626.
27. Zhang, N.; Ma, Q.; Mao, W.; Chen, X. Coalitional FL: Coalition Formation and Selection in Federated Learning with Heterogeneous Data. IEEE Trans. Mob. Comput. 2024, 23, 10494–10508.

Figure 1. Illustrative coalition formation modes in a heterogeneous UAV swarm. (a) Non-overlapping coalition formation, where each coalition executes one task. (b) Sequential multi-task formation, where a coalition serves multiple tasks in sequence. (c) Overlapping coalition formation, where UAVs can participate in multiple coalitions and jointly execute several tasks over time, enabling cross-task capability flow and resource reuse.

Figure 2. Example of resource allocation under overlapping coalitions, where heterogeneous UAV resources are assigned to multiple tasks and UAVs transfer between coalitions to reuse residual capacity.

Figure 3. High-order structural modeling with the bidirectional hypergraph attention network, which encodes UAV nodes and task hyperedges and propagates features along membership and communication links. Here, $e_{m}^{(1)}$ denotes the task structural embedding produced by task-to-node aggregation, and $h_{n}^{*}$ denotes the structure-aware node embedding produced by node-to-task aggregation.

Figure 4. SHIELD architecture: A hypergraph encoder, role attribution, and nested coalition and global mixers jointly implement structure-aware hierarchical value decomposition for multi-task coordination.

Figure 5. Pipeline of the MCTS-based global structure optimizer, where SHIELD values guide feasible coalition expansions and Pareto optimal structures are fed back as demonstrations and distillation targets.

Figure 6. Convergence of SGRL-TS and baselines across four resource regimes: (a) 18 UAVs–5 tasks (resource-abundant); (b) 20 UAVs–10 tasks (resource-balanced); (c) 15 UAVs–15 tasks (resource-constrained); (d) 10 UAVs–20 tasks (resource-scarce).

Figure 7. Average task utility versus the number of UAVs under different task loads: (a) $T = 5$; (b) $T = 10$; (c) $T = 15$; (d) $T = 20$.

Figure 8. Average task utility versus the number of tasks under different swarm sizes: (a) $N = 16$ UAVs; (b) $N = 12$ UAVs; (c) $N = 8$ UAVs.

Figure 9. Average task completion versus the number of tasks for three UAV swarm sizes: (a) UAV $= 15$; (b) UAV $= 10$; (c) UAV $= 5$.

Figure 10. Average coalition synchronization sufficiency versus the number of tasks under three UAV swarm sizes: (a) UAV $= 15$; (b) UAV $= 10$; (c) UAV $= 5$. The metric $f_{sync} \in [0, 1]$ is obtained by normalizing the coalition arrival-time deviation cost according to (23), with larger values indicating more synchronized coalition arrivals.

Figure 11. Energy efficiency versus the number of UAVs under different task loads: (a) $T = 5$; (b) $T = 10$; (c) $T = 15$; (d) $T = 20$.

Figure 12. Sensitivity of average task utility to (a) multi-objective weight vectors $ω^{(k)} = (w_{comp}, w_{syn}, w_{eng})$, (b) task-priority scenarios, and (c) normalization scales $(T, E)$ of synchronization and energy costs in the 10-task, 20-UAV resource-neutral setting. Each curve reports the mean utility with standard-deviation error bars for SGRL-TS, HYGMA, PGG-TS-OCF, and RDFG-DRL.

Figure 13. Multi-objective trade-off among normalized task shortfall, synchronization deviation, and energy cost: (a) normalized task shortfall $\tilde{J}_{1}$ versus normalized synchronization deviation $\tilde{J}_{syn}$; (b) normalized task shortfall $\tilde{J}_{1}$ versus normalized energy cost $\tilde{E}_{T}$; (c) normalized synchronization deviation $\tilde{J}_{syn}$ versus normalized energy cost $\tilde{E}_{T}$. All three objectives are min–max normalized over the collected solutions so that lower values are better.
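The min–max normalization mentioned in the Figure 13 caption, together with the notion of Pareto-optimal structures from Figure 5, can be illustrated with a small sketch. The three sample solutions are hypothetical placeholder values, not results from the paper, and the helper names `min_max_normalize` and `pareto_front` are assumptions introduced only for illustration.

```python
def min_max_normalize(rows):
    """Scale every objective column to [0, 1] over the collected solutions."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi))
        for row in rows
    ]


def pareto_front(rows):
    """Return indices of solutions that no other solution dominates.

    All objectives are treated as costs (lower is better), matching the caption.
    """
    def dominates(a, b):
        no_worse = all(x <= y for x, y in zip(a, b))
        strictly_better = any(x < y for x, y in zip(a, b))
        return no_worse and strictly_better

    return [
        i for i, a in enumerate(rows)
        if not any(dominates(b, a) for j, b in enumerate(rows) if j != i)
    ]


# Hypothetical (task shortfall, synchronization deviation, energy cost) triples.
solutions = [(0.10, 12.0, 900.0), (0.20, 8.0, 950.0), (0.30, 15.0, 1000.0)]
normalized = min_max_normalize(solutions)
print(normalized)
print(pareto_front(normalized))
```

In this toy set the third solution is worse than the first on every objective, so only the first two remain on the front; they trade task shortfall against synchronization deviation.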

Table 1. Parameterization of the propulsion power model in Equation (22) for a DJI Matrice 600 Pro hexacopter used to compute flight energy consumption.

Symbol	Meaning	Unit	Value
$P_{S}$	profile power at hover	W	$3.7 \times 10^{2}$
$U_{tip}$	blade-tip speed	m s⁻¹	$1.10 \times 10^{2}$
$P_{R}$	induced power at hover	W	$8.6 \times 10^{2}$
$v_{0}$	induced velocity at hover	m s⁻¹	$5.33$
$f_{0}$	equivalent frontal area	m²	$1.0 \times 10^{- 1}$
$ρ$	air density	kg m⁻³	$1.225$
$s_{0}$	rotor disk area	m²	$1.34$
$η$	rotor solidity	–	$0.10$
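
The parameters in Table 1 can be turned into a flight-energy estimate once a power model is fixed. Equation (22) is not reproduced in this part of the article, so the sketch below assumes the widely used rotary-wing form $P(V) = P_{S}(1 + 3V^{2}/U_{tip}^{2}) + P_{R}(\sqrt{1 + V^{4}/(4v_{0}^{4})} - V^{2}/(2v_{0}^{2}))^{1/2} + \frac{1}{2} f_{0} ρ η s_{0} V^{3}$. That form, the way $f_{0}$, $η$, and $s_{0}$ enter the parasite term, and the function names `propulsion_power` and `flight_energy_wh` are assumptions made for illustration.

```python
import math

# Values copied from Table 1 (DJI Matrice 600 Pro parameterization).
P_S = 3.7e2     # profile power at hover, W
U_TIP = 1.10e2  # blade-tip speed, m/s
P_R = 8.6e2     # induced power at hover, W
V_0 = 5.33      # induced velocity at hover, m/s
F_0 = 1.0e-1    # equivalent frontal area, m^2
RHO = 1.225     # air density, kg/m^3
S_0 = 1.34      # rotor disk area, m^2
ETA = 0.10      # rotor solidity


def propulsion_power(v):
    """Propulsion power in W at forward speed v (m/s).

    ASSUMPTION: standard rotary-wing model (profile + induced + parasite);
    the article's Equation (22) may differ in detail.
    """
    profile = P_S * (1.0 + 3.0 * v**2 / U_TIP**2)
    induced = P_R * math.sqrt(
        math.sqrt(1.0 + v**4 / (4.0 * V_0**4)) - v**2 / (2.0 * V_0**2)
    )
    parasite = 0.5 * F_0 * RHO * ETA * S_0 * v**3
    return profile + induced + parasite


def flight_energy_wh(v, distance_m):
    """Energy in Wh to cover distance_m at constant speed v > 0."""
    flight_time_s = distance_m / v
    return propulsion_power(v) * flight_time_s / 3600.0


# At hover the model reduces to P_S + P_R.
print(propulsion_power(0.0))
print(flight_energy_wh(10.0, 1000.0))
```

Under this assumed form, hover power is simply the sum of the profile and induced terms from Table 1, which is a quick sanity check on the parameterization.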

Table 2. Resource requirements of representative tasks.

Task Type	Energy (Wh)	Supplies (kg)	Bandwidth (MHz)	Sensing (Mbit)	Agent (L)
Search and rescue	100–150	—	1–2	100–150	—
Supply delivery	80–120	40–80	0.5–1	20–30	—
Communication support	90–130	—	2–3	—	—
Energy replenishment	150–200	—	1–2	50–80	—
Firefighting	100–150	—	0.5–1	80–120	60–100

Table 3. Capabilities of heterogeneous UAV platform types.

UAV Type	Energy (Wh)	Supplies (kg)	Bandwidth (MHz)	Sensing (Mbit)	Agent (L)
A (general support)	400–500	10–20	1–2	50–100	—
B (delivery)	250–350	60–100	0.5–1	10–30	—
C (firefighting)	280–380	—	0.5–1	80–120	80–120
D (relay and recon)	260–340	—	2–3	100–150	—
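
To make the residual-capacity reuse of Figure 2 concrete, the following sketch allocates two tasks from Table 2 using two UAVs from Table 3. The numeric values are single points picked from inside the tabulated ranges. The greedy rule, the treatment of bandwidth and sensing as consumable budgets, and the names `UAV`, `Task`, and `allocate` are all assumptions for illustration, not the paper's allocation method.

```python
from dataclasses import dataclass, field


@dataclass
class UAV:
    name: str
    residual: dict  # remaining capacity per resource type


@dataclass
class Task:
    name: str
    demand: dict  # required amount per resource type
    members: list = field(default_factory=list)  # coalition membership


def allocate(task, uavs):
    """Greedily cover the task demand from UAV residual capacity.

    ASSUMPTION: every resource is a consumable budget. Returns the demand
    that is still unmet (an empty dict means the task is fully covered).
    """
    unmet = dict(task.demand)
    for uav in uavs:
        contributed = False
        for resource, need in list(unmet.items()):
            give = min(need, uav.residual.get(resource, 0.0))
            if give > 0:
                uav.residual[resource] -= give
                unmet[resource] = need - give
                contributed = True
        if contributed:
            task.members.append(uav.name)
    return {r: v for r, v in unmet.items() if v > 1e-9}


# Type A and type D UAVs, values chosen inside the Table 3 ranges.
a1 = UAV("A1", {"energy_wh": 450, "supplies_kg": 15,
                "bandwidth_mhz": 1.5, "sensing_mbit": 80})
d1 = UAV("D1", {"energy_wh": 300, "bandwidth_mhz": 2.5, "sensing_mbit": 120})

# Two tasks, values chosen inside the Table 2 ranges.
search = Task("search_and_rescue",
              {"energy_wh": 120, "bandwidth_mhz": 1.5, "sensing_mbit": 120})
comm = Task("communication_support", {"energy_wh": 100, "bandwidth_mhz": 2.5})

for task in (search, comm):
    unmet = allocate(task, [a1, d1])
    print(task.name, task.members, unmet)
```

In this toy run both UAVs appear in both coalitions, which corresponds to the overlapping mode in Figure 1c: the energy left over on A1 after the first task is reused for the second rather than sitting idle.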

Table 4. Simulation environment parameters.

Description	Value Range
Number of tasks M	3~20
Number of UAVs N	3~20
Area size (m²)	$1000 \times 1000$
Comm radius $r_{c}$ (m)	200
Max speed $v_{n}^{\max}$ (m/s)	5~14
Time step $Δ t$ (s)	1
Horizon T (steps)	1000
Air density $ρ$ (kg/m³)	1.125
Induced hover speed $v_{0}$ (m/s)	7.3
Blade tip speed $U_{tip}$ (m/s)	200
Fuselage drag ratio $f_{0}$	0.3
Task deadline $t_{m}^{dead}$ (steps)	150~250
Per-UAV comm demand $μ_{n}^{(comm)}$ (MHz)	0.5~2
Comm data volume $I_{m}^{(comm)}$ (Mbit)	20~150
Tx power $p^{(comm)}$ (W)	1
Noise PSD $N_{0}$ (dBm/Hz)	−169
Path-loss exponent $α$	2
Weight of fulfillment $ω_{1}$	0.1~0.6
Weight of sync deviation $ω_{2}$	0.04~0.5
Weight of energy cost $ω_{3}$	0.08~0.5
Boltzmann coefficient for structure perturbation, $Γ_{\max}$	2~10

Table 5. Per-decision runtime profiling of SGRL-TS under different swarm scales and search budgets (N: number of UAVs, M: number of tasks). All results are averaged over $10^{4}$ decision steps on the hardware platform described in Section 5.

Scenario	N–M	B	Avg. Exp.	$T_{HAN}$ (ms)	$T_{SHIELD}$ (ms)	$T_{MCTS}$ (ms)	$T_{e2e}$ (ms)
S₂–No-search	20–10	0	0.00	0.94	0.71	0.00	2.10
S₂–Light-search	20–10	32	21.37	0.94	0.71	1.63	3.74
S₁–Full-search	18–5	64	37.82	0.82	0.64	2.47	4.36
S₂–Full-search	20–10	64	41.96	0.96	0.72	3.01	5.16
S₃–Full-search	15–15	64	42.58	1.03	0.79	3.11	5.41
S₄–Full-search	10–20	64	43.11	1.05	0.82	3.18	5.54
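
A few derived quantities help when reading Table 5. The sketch below computes, for each row, the part of $T_{e2e}$ that the three module timings do not account for, the MCTS time per expansion, and the share of a decision step that the end-to-end latency occupies. These derived numbers are plain arithmetic on the tabulated values and are not reported in the source. Using the 1 s time step from Table 4 as the latency budget is an assumption, and the name `breakdown` is hypothetical.

```python
# Rows copied from Table 5:
# (scenario, avg. expansions, T_HAN, T_SHIELD, T_MCTS, T_e2e), times in ms.
ROWS = [
    ("S2-No-search", 0.00, 0.94, 0.71, 0.00, 2.10),
    ("S2-Light-search", 21.37, 0.94, 0.71, 1.63, 3.74),
    ("S1-Full-search", 37.82, 0.82, 0.64, 2.47, 4.36),
    ("S2-Full-search", 41.96, 0.96, 0.72, 3.01, 5.16),
    ("S3-Full-search", 42.58, 1.03, 0.79, 3.11, 5.41),
    ("S4-Full-search", 43.11, 1.05, 0.82, 3.18, 5.54),
]

STEP_MS = 1000.0  # ASSUMPTION: budget equals the 1 s time step in Table 4


def breakdown(row):
    """Derive overhead, per-expansion cost, and step share for one row."""
    name, n_exp, t_han, t_shield, t_mcts, t_e2e = row
    overhead = t_e2e - (t_han + t_shield + t_mcts)
    per_expansion = t_mcts / n_exp if n_exp > 0 else None
    return {
        "scenario": name,
        "overhead_ms": round(overhead, 2),
        "mcts_ms_per_expansion": per_expansion,
        "fraction_of_step": t_e2e / STEP_MS,
    }


for row in ROWS:
    print(breakdown(row))
```

The unattributed overhead stays between 0.43 ms and 0.49 ms across all six rows, so the growth in end-to-end latency comes almost entirely from the search term.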

Table 6. Ablation and replacement study of SGRL-TS under resource-neutral and resource-tight regimes.

ID	Variant	Compl. ↑	Util. ↑	Eff. ↑	Viol. [%] ↓	AUC ↑	Iter@95% ↓
Resource-neutral regime
Full SGRL-TS baseline (HAN encoder + SHIELD-full + structured MCTS)
F0	Full-SGRL-TS	$0.89 \pm 0.01$	$0.92 \pm 0.01$	$2.94 \pm 0.03$	$3.2 \pm 0.4$	$67.0 \pm 1.7$	$403 \pm 19$
(1) Graph-based DRL baselines (GAT encoder, no MCTS)
G1	GAT-VDN	$0.72 \pm 0.04$	$0.76 \pm 0.02$	$2.22 \pm 0.05$	$7.6 \pm 1.1$	$43.0 \pm 2.0$	$794 \pm 35$
G2	GAT-QMIX	$0.76 \pm 0.02$	$0.80 \pm 0.03$	$2.38 \pm 0.03$	$6.8 \pm 0.9$	$47.5 \pm 1.8$	$720 \pm 32$
(2) HAN encoder ablation (SHIELD-full + structured MCTS fixed)
A0	NoStruct-enc	$0.77 \pm 0.04$	$0.80 \pm 0.02$	$2.39 \pm 0.04$	$6.7 \pm 0.9$	$50.8 \pm 2.8$	$675 \pm 30$
A1	GAT-enc	$0.82 \pm 0.02$	$0.86 \pm 0.03$	$2.63 \pm 0.03$	$5.0 \pm 0.5$	$57.9 \pm 1.5$	$538 \pm 26$
A2	HyperGCN-enc	$0.86 \pm 0.03$	$0.90 \pm 0.01$	$2.81 \pm 0.03$	$3.9 \pm 0.6$	$62.7 \pm 1.7$	$455 \pm 24$
(3) SHIELD value-decomposition ablation (HAN encoder + structured MCTS fixed)
B0	VDN-mix	$0.84 \pm 0.03$	$0.86 \pm 0.02$	$2.64 \pm 0.04$	$4.9 \pm 0.8$	$58.4 \pm 1.8$	$515 \pm 28$
B1	QMIX-mix	$0.85 \pm 0.02$	$0.89 \pm 0.02$	$2.75 \pm 0.03$	$4.1 \pm 0.7$	$61.8 \pm 1.6$	$476 \pm 25$
B2	SHIELD-lite	$0.88 \pm 0.02$	$0.91 \pm 0.01$	$2.88 \pm 0.02$	$3.6 \pm 0.6$	$65.1 \pm 1.5$	$443 \pm 22$
(4) Global search ablation (HAN encoder + SHIELD-full fixed)
C0	No-search	$0.85 \pm 0.05$	$0.88 \pm 0.05$	$2.77 \pm 0.05$	$4.3 \pm 1.0$	$59.4 \pm 2.7$	$502 \pm 27$
C1	Greedy-search	$0.87 \pm 0.03$	$0.90 \pm 0.02$	$2.84 \pm 0.03$	$3.6 \pm 0.7$	$62.3 \pm 1.9$	$451 \pm 24$
C2	Plain-MCTS	$0.88 \pm 0.02$	$0.91 \pm 0.01$	$2.89 \pm 0.03$	$3.5 \pm 0.8$	$64.7 \pm 1.5$	$433 \pm 22$
Resource-tight regime
Full SGRL-TS baseline (HAN encoder + SHIELD-full + structured MCTS)
F0	Full-SGRL-TS	$0.83 \pm 0.04$	$0.88 \pm 0.03$	$2.72 \pm 0.06$	$4.6 \pm 0.4$	$60.8 \pm 2.0$	$514 \pm 27$
(1) Graph-based DRL baselines (GAT encoder, no MCTS)
G1	GAT-VDN	$0.61 \pm 0.07$	$0.67 \pm 0.08$	$1.99 \pm 0.12$	$10.5 \pm 0.9$	$36.0 \pm 2.6$	$901 \pm 45$
G2	GAT-QMIX	$0.67 \pm 0.09$	$0.72 \pm 0.07$	$2.15 \pm 0.10$	$8.5 \pm 0.7$	$41.0 \pm 2.8$	$815 \pm 43$
(2) HAN encoder ablation (SHIELD-full + structured MCTS fixed)
A0	NoStruct-enc	$0.65 \pm 0.07$	$0.71 \pm 0.13$	$2.12 \pm 0.15$	$9.3 \pm 0.9$	$43.1 \pm 3.3$	$825 \pm 47$
A1	GAT-enc	$0.71 \pm 0.08$	$0.78 \pm 0.11$	$2.35 \pm 0.12$	$7.3 \pm 0.6$	$50.5 \pm 2.9$	$665 \pm 44$
A2	HyperGCN-enc	$0.78 \pm 0.06$	$0.83 \pm 0.09$	$2.55 \pm 0.08$	$5.8 \pm 0.7$	$56.1 \pm 2.6$	$590 \pm 40$
(3) SHIELD value-decomposition ablation (HAN encoder + structured MCTS fixed)
B0	VDN-mix	$0.75 \pm 0.08$	$0.81 \pm 0.06$	$2.31 \pm 0.09$	$5.6 \pm 0.8$	$55.5 \pm 2.3$	$588 \pm 44$
B1	QMIX-mix	$0.77 \pm 0.07$	$0.86 \pm 0.08$	$2.66 \pm 0.07$	$5.1 \pm 0.6$	$58.3 \pm 2.7$	$567 \pm 42$
B2	SHIELD-lite	$0.81 \pm 0.05$	$0.87 \pm 0.06$	$2.70 \pm 0.09$	$4.7 \pm 0.5$	$60.2 \pm 2.4$	$545 \pm 31$
(4) Global search ablation (HAN encoder + SHIELD-full fixed)
C0	No-search	$0.77 \pm 0.08$	$0.84 \pm 0.09$	$2.56 \pm 0.09$	$5.9 \pm 0.7$	$53.7 \pm 2.9$	$626 \pm 40$
C1	Greedy-search	$0.80 \pm 0.06$	$0.86 \pm 0.05$	$2.63 \pm 0.08$	$5.3 \pm 0.5$	$56.9 \pm 2.6$	$575 \pm 32$
C2	Plain-MCTS	$0.82 \pm 0.03$	$0.87 \pm 0.03$	$2.69 \pm 0.05$	$4.9 \pm 0.4$	$59.1 \pm 2.2$	$548 \pm 29$

Variant definitions. GAT-VDN/GAT-QMIX: 1-hop GAT encoder with VDN/QMIX mixing (graph-based DRL without MCTS). NoStruct-enc: per-agent MLP without structural encoder. GAT-enc/HyperGCN-enc: graph/hypergraph encoders with SHIELD-full and structured MCTS. VDN-mix/QMIX-mix/SHIELD-lite: VDN, QMIX, and intra-coalition SHIELD mixers with HAN encoder and structured MCTS. No-search/Greedy-search/Plain-MCTS: no tree search, greedy assignment, and vanilla MCTS with HAN encoder and SHIELD-full.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
