Highlights
What are the main findings?
- We introduce a hypergraph attention-based structured representation that captures high-order relations among tasks, UAVs, and coalitions, enabling overlapping coalition formation in heterogeneous swarms.
- We develop a structure-conditioned hierarchical value decomposition coupled with limited-budget Monte Carlo Tree Search with demonstration replay and distillation, achieving global optimization and policy feedback under operational constraints.
What are the implications of the main findings?
- The combination of overlapping coalitions and structure-aware policy learning promotes cross-task resource reuse, alleviates bottlenecks under tight constraints, and strengthens global coordination for post-disaster multi-tasking.
- Feasible-region pruning and delay-aware search help maintain convergence efficiency and stability under latency, communication, and energy limits, pointing to a practical path toward real-world deployment.
Abstract
Unmanned aerial vehicle (UAV) swarms, equipped for distributed sensing and rapid response, can form coalitions to undertake complex missions such as post-disaster relief, communication support, and payload delivery. However, typical coalition formation methods assign each UAV to a single task, limiting cross-task resource sharing. To address this, we investigate overlapping coalition formation (OCF) for UAV swarms, where a single UAV is permitted to participate in multiple coalitions, enabling resource reuse and reducing idleness. We formulate OCF as a multi-objective combinatorial optimization problem that jointly balances task fulfillment ratio, coalition synchronization deviation, and operational cost, while explicitly accounting for inter-coalition resource contention and execution precedence. Specifically, we first construct a hypergraph representation of UAVs and tasks and employ a hypergraph attention network to capture their high-order interactions. Next, we propose a structure-aware hierarchical value decomposition method for policy learning, which progressively aggregates individual- and coalition-level information, models member complementarity and inter-coalition cooperative–competitive relations, and generates a global value estimate that is sensitive to changes in coalition structure. Furthermore, we integrate Monte Carlo Tree Search, utilizing the learned value as a heuristic to efficiently explore the feasible region, and close the loop with candidate-structure demonstration replay and policy distillation, enabling search to refine the learned policy. In multi-scale rescue simulations, the proposed approach improves task utility by up to 11.4% over the best-performing baseline and increases energy efficiency by more than 228% compared to a non-overlapping coalition variant.
1. Introduction
With continuing advances in avionics and wireless networking, unmanned aerial vehicle (UAV) swarms can undertake cross-scale tasks including reconnaissance, search and rescue, communication assurance, and emergency power supply [1], even when terrain is constrained and infrastructure is damaged [2]. Nevertheless, swarm-level collaboration essentially depends on the organizational mechanisms of task allocation and resource configuration [3]. Under constraints on endurance, payload, and sensing capacity, single-UAV assignment paradigms struggle to satisfy the joint resource and temporal requirements of complex tasks [4]. To enhance system-level feasibility and overall benefit, it becomes necessary to form coalitions that aggregate capabilities across platforms [5,6].
Resource scheduling, execution precedence, and cost constraints at the task level jointly constitute the key difficulties of coalition coordination [7]. With the rise of modular payloads and multi-source sensor fusion [8], a single platform can be configured with diverse functional modules and can simultaneously carry multiple resource types in varying quantities to execute different kinds of tasks. For example, in military settings, a UAV may consecutively perform target tracking and communication relay tasks [9], whereas in civilian settings, a rescue UAV may consecutively conduct target-detection and payload-delivery tasks. However, prevalent non-overlapping coalition formation methods typically restrict each UAV to a single coalition and a single task within the same time window. Under high task load, this restriction prolongs waiting chains, and under low load, it yields idle capabilities and fragmented utilization, which degrades task completion and energy efficiency. Therefore, we study overlapping coalition formation (OCF) [10] for heterogeneous UAV swarms in multi-task disaster-relief scenarios, as illustrated in Figure 1. Subject to feasibility constraints on energy, communication, and deadlines, OCF permits a UAV to belong to multiple task coalitions within the planning horizon, enabling cross-task capability flow and resource reuse to alleviate threshold bottlenecks and improve overall resource utilization efficiency.
Figure 1.
Illustrative coalition formation modes in a heterogeneous UAV swarm. (a) Non-overlapping coalition formation where each coalition executes one task. (b) Sequential multi-task formation where a coalition serves multiple tasks in sequence. (c) Overlapping coalition formation where UAVs can participate in multiple coalitions and jointly execute several tasks over time, enabling cross-task capability flow and resource reuse.
In post-disaster contexts, OCF can be cast as a multi-objective combinatorial optimization problem with explicit resource competition [11]. Due to communication degradation and temporary restrictions, observation ranges shrink markedly, and information becomes fragmented [12]; as a result, each UAV relies primarily on local views and cannot readily grasp global dependencies across tasks and resources. Meanwhile, the number and urgency of tasks fluctuate over short intervals, and feasible routes and required preconditions change frequently [13]. Without continual updates to coalitions and resource allocations, previously feasible plans become invalid, which induces execution interruptions and response delays. Moreover, overlapping organization causes a single UAV to serve multiple tasks across time windows: consumable resources exhibit congestion effects, non-consumable capabilities are subject to occupancy and exclusivity, coalitions compete for scarce and critical capabilities, tasks display both cooperative and mutually exclusive relations, and task value becomes markedly non-additive. These factors induce credit-assignment bias and global scheduling imbalance, which makes it difficult to align locally optimal choices with system-level optimality.
Furthermore, the decision space is a joint combination of coalition structure, resource allocation, and temporal scheduling, and is high-dimensional, with a branching explosion [14]. Multiple objectives trade off against each other: higher task fulfillment often demands greater energy expenditure and tighter synchronization, whereas reducing operational cost and synchronization deviation may sacrifice fulfillment. Static weighting and local greediness are therefore often ineffective. Under partial observability and highly dynamic conditions, the central challenges are maintaining feasibility and responding rapidly to time-varying constraints, as well as achieving a globally coordinated, multi-objective balance amid competition.
To address these challenges, we propose a Structure-Guided Reinforcement Learning with Tree Search (SGRL-TS) framework. Using a hypergraph as the carrier and bidirectional attention to extract high-order relations among UAVs, task demands, and coalition capabilities, the framework consolidates local observations into structural semantics that enhance identifiability under incomplete information. It then employs structure-conditioned hierarchical value modeling to characterize intra-coalition complementarity and conflict while capturing inter-task interactions, thereby improving the robustness of value estimation under delayed and partial information. Building on this foundation, we adopt budgeted Monte Carlo Tree Search (MCTS) with feasible-region pruning based on energy, bandwidth, and deadlines to generate a small set of high-quality candidate coalition structures and to feed them back to policy learning via demonstration replay and value distillation, achieving efficient and scalable OCF under constrained communication and time-varying feasibility. Our contributions are as follows:
- We formulate overlapping coalition formation on a time-varying heterogeneous hypergraph and design a masked bidirectional Hypergraph Attention Network. This network incorporates task-to-node and node-to-task attention, an operator view of higher-order propagation, and structural diversity regularization, yielding structure-aware node and task embeddings that encode task selectivity, contextual suitability, and structural roles under evolving coalition structures.
- We develop Structure-aware Hierarchical Value-decomposition for Efficient Multi-task Coordination (SHIELD), which utilizes a structure-attribution encoder and agent-conditional Q functions, along with nested nonlinear intra-coalition composition and sparsified cross-task interaction aggregation. This approach constructs a global value that remains monotone and comparable across partial coalition structures, supporting stable credit assignment under the multi-objective reward.
- We integrate SHIELD with a budgeted MCTS that uses a structure prior and feasible-set pruning. Masked SHIELD evaluations, progressive widening, upper-bound and dominance pruning, and hypervolume-based Pareto selection produce a compact set of high-quality overlapping coalition structures. These structures are injected into learning via scheduled demonstration replay and structure-consistent Q-value distillation. Experiments on heterogeneous UAV–multi-task disaster-relief scenarios demonstrate consistent gains over baselines in task utility and energy efficiency, conditional near-linear scalability, and favorable ablation results.
2. Related Work
2.1. Overlapping Coalitions: Advantages of Hypergraph Modeling over Other OCF Algorithms
In multi-agent cooperation and resource allocation, the traditional non-overlapping Coalition Formation Game (CFG) has been widely adopted. Its core assumption is that each agent can join only one coalition to complete a specific task. Guo et al. [15] designed a fast CFG solver for urgent-task scheduling based on neighborhood search and coalition reorganization, which starts from an initial partition and performs local search to find member-replacement plans that reduce task delay. Yan et al. [16] proposed a hierarchical nested CFG framework for large-scale heterogeneous UAV swarms. The method first forms task groups via priority-based clustering at the upper level and then solves coalition partitioning within each group to optimize resource matching at the lower level. Although these methods attain good local optima in specific scenarios, they rely on the non-overlapping assumption, which prevents cross-task resource reuse and yields limited adaptability of coalition partitions in heterogeneous multi-task settings.
To overcome these limitations, researchers introduced the Overlapping Coalition Formation Game (OCFG), which enables agents to participate in multiple coalitions, allowing for flexible resource reuse in heterogeneous multi-task environments. Qi et al. [17] formulated task allocation as a sequential overlapping coalition formation game in which coalitions are iteratively adjusted via a bilateral mutual-benefit transfer rule that reallocates per-UAV resources across tasks until a stable overlapping partition is reached. Zhang et al. [18] developed an OCFG for vehicular edge computing that jointly optimizes information transmission rate and energy-sharing efficiency and balances multi-coalition participation via transfer payments and a virtual-currency mechanism. Qi et al. [19] combined OCFG with an auction mechanism for UAV data-collection task assignment, treating UAVs as buyers that can participate across coalitions and iteratively maximizing revenue through bipartite matching auctions and coalition games. Overall, OCFG outperforms traditional CFG in terms of resource efficiency and task adaptivity. However, most existing OCFG studies rely on local marginal contributions or linearly weighted utilities and lack explicit modeling of high-order interactions among tasks, agents, and coalitions. As a result, policies generalize poorly in dynamic multi-task environments and coalition structures tend to be trapped in local optima.
2.2. Structure Modeling: Advantages over Non-Structured Modeling
As the complexity of multi-agent cooperation grows, structure-aware modeling has been introduced to enhance interaction perception and policy generalization. By explicitly representing inter-agent relations in graphs or geometric structures, several works have reported progress in task cooperation and system performance. HYGMA [20] constructs a latent hypergraph where dynamically grouped agents are represented as hyperedges. Coalition structures emerge implicitly as the policy attends to and aggregates over these hyperedges, an approach that avoids reliance on predefined or pairwise coalitions. Zhang et al. [21] proposed a coalition formation algorithm inspired by leaf-vein geometry, which predefines vein-like branching to assign functional roles and then uses heuristic pattern matching to adjust node layouts. This structure enhances global cooperative stability while remaining rule-based and static. Liao et al. [22] presented a two-stage cooperative anti-jamming scheme for radar networks under wideband sweep jamming. During coalition formation, the entire frequency-hopping range is partitioned into mutually exclusive sub-bands, each treated as a coalition unit, and coalitions are updated via a coalition formation game. This realizes structured modeling and optimization of radar cooperation at the sub-band level.
These structured approaches strengthen cooperation awareness and deliver improved stability and generalization during resource sharing and local task switching. Nevertheless, most of them are confined to first-order adjacency and cannot capture higher-order dependencies among tasks, agents, and coalitions. Consequently, UAVs cannot accurately identify their roles across multiple coalitions in multi-task settings. In addition, structures are often treated as fixed input features rather than directly participating in policy optimization.
2.3. Global Optimization and Policy Feedback: Integrating MCTS with Local Reinforcement Learning
To move beyond local heuristics and short-sighted payoff optimization, several studies have pursued coalition generation and stabilization driven by global objectives or long-term equilibria. Taguelmimt et al. [23] treat coalition structure generation as a centralized global optimization problem and propose SMART, which first uses offline coalition selection to precompute and prune promising coalitions, and applies complementarity-based dynamic programming with gradual, integer-partition–graph search to exhaustively derive the globally optimal coalition structure under a given utility. Chen et al. [24] formulated a global-optimization CFG for multi-UAV cooperative defense by maximizing defense coverage. The approach defines coalition utility in terms of global coverage and threat-suppression ratio, and incorporates dynamic coalition reconstruction and global utility re-estimation to ensure monotonic improvement toward the global objective. It improves coverage but relies heavily on a centralized coordinator, which reduces efficiency when handling asynchronous updates and dynamic tasks. Xiong et al. [25] proposed a global coalition formation method for radar-UAV multi-target pursuit based on multi-objective cooperative games, using evolutionary games and global Nash equilibria to determine stable structures and provide long-term optimality. Although these methods improve system-wide performance through global utilities or equilibria, most rely on deterministic optimization or equilibrium derivation and lack active exploration in the combinatorial space of large-scale overlapping coalitions.
To remedy these issues, some works integrate game-theoretic models with deep reinforcement learning so as to retain long-term optimality while improving the adaptability of coalition structures. Zhou et al. [26] model coalition dynamics as a three-party repeated coalition formation game and realize coalition formation via a switch-operation–based coalition selection process, where deep reinforcement learning optimizes each party’s continuous non-coalitional strategies while the game-theoretic analysis guarantees convergence to a stable coalition partition in each time slot. Zhang et al. [27] designed a distributed DRL and CFG framework for coalition selection in federated learning with heterogeneous data. Periodic training utilizes global accuracy and energy efficiency as rewards, enabling nodes to adaptively adjust the probabilities of joining different coalitions across rounds, thereby improving global convergence and energy efficiency. However, the process predominantly relies on experience replay and local policy convergence, which limits the adaptation speed when coalition structures undergo significant changes.
Overall, it is difficult for prior work to simultaneously capture high-order structural relations and explore globally optimal coalition structures. In particular, for overlapping coalition formation with heterogeneous UAV swarms, there is a lack of mechanisms that exploit structural semantics to actively guide policy optimization and to coordinate local learning with global search. To address this gap, we propose a structure-aware and globally guided OCF method that enhances high-order interaction modeling among tasks, agents, and coalitions through a hypergraph-based representation, and achieves long-term optimality by coordinating global search with local policy learning.
3. System Model and Problem Statement
3.1. System Architecture of Overlapping Coalitions and Hypergraph-Based Representation
3.1.1. Task Resource Types
Let the task set be . Each task requires a combination of resources. We categorize resources into two types: consumable and non-consumable. Consumable resources are physical items that are depleted during execution, such as rescue kits, medicines, or spraying agents. Their demand is , where each dimension specifies the required quantity of a specific consumable. Non-consumable resources are functional capabilities that support a task without being depleted, such as communication, relay, or information acquisition. Their demand vector is . The overall demand for task m is with . Each task further has key attributes including spatial location , priority coefficient , and maximum waiting time (deadline tolerance) . The complete task feature vector is .
3.1.2. UAV Node State
Let the UAV swarm be . The state vector of UAV comprises the remaining energy , the load of consumable resources , the non-consumable capability vector , the current position , the maximum flight speed , and a representation of the recent task-history sequence . Thus, the full UAV state is .
3.1.3. Overlapping Coalition Modeling
For each task , define its UAV coalition as . The coalition structure for the task set is . For any task , Equation (1) specifies the set of coalition members that allocate resources to it. The resource-allocation vector of is given in Equation (2), where denotes the resource vector contributed by UAV to task as defined in Equation (3).
Specifically, denotes the quantity of consumable resource type that UAV allocates to task , and denotes the quantity of non-consumable capability type that UAV allocates to task . For any UAV , the set of tasks it participates in is . If , then belongs to multiple task coalitions within the planning horizon, and the coalition structure is overlapping.
To capture complex overlapping relations and high-order interactions between UAVs and tasks, we formalize the coalition structure as a hypergraph , where the node set corresponds to the UAV set and each node represents a UAV with feature ; the hyperedge set corresponds to the task set and each hyperedge connects multiple nodes that participate in task , with initial hyperedge feature . A node belonging to multiple hyperedges represents overlapping coalition membership, and a hyperedge connecting multiple nodes represents multi-agent cooperation within a coalition. This hypergraph provides a formal basis for employing a hypergraph attention network to capture high-order relations between UAVs and tasks.
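The hypergraph view described above can be illustrated with a small incidence-matrix sketch. This is a minimal illustration, not the authors' implementation: the function names (`build_incidence`, `tasks_of`, `is_overlapping`) and the toy identifiers `u1`, `t1` are hypothetical, and node and hyperedge features are omitted.

```python
from typing import Dict, List, Set


def build_incidence(coalitions: Dict[str, Set[str]], uavs: List[str]) -> List[List[int]]:
    """Rows index UAV nodes, columns index task hyperedges; 1 marks membership."""
    tasks = list(coalitions)  # dicts keep insertion order, so columns follow task order
    return [[1 if u in coalitions[t] else 0 for t in tasks] for u in uavs]


def tasks_of(incidence: List[List[int]], uavs: List[str], tasks: List[str], uav: str) -> List[str]:
    """Set of tasks (hyperedges) that a given UAV participates in."""
    row = incidence[uavs.index(uav)]
    return [t for t, bit in zip(tasks, row) if bit]


def is_overlapping(incidence: List[List[int]]) -> bool:
    """The structure is overlapping if any UAV belongs to more than one hyperedge."""
    return any(sum(row) > 1 for row in incidence)


# Toy example (hypothetical): u2 serves both tasks.
uavs = ["u1", "u2", "u3"]
coalitions = {"t1": {"u1", "u2"}, "t2": {"u2", "u3"}}
H = build_incidence(coalitions, uavs)
u2_tasks = tasks_of(H, uavs, list(coalitions), "u2")
overlapping = is_overlapping(H)
```

A row sum greater than one corresponds to a UAV whose task set has more than one element, which is the overlapping condition stated in Section 3.1.3.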
3.2. Multi-Objective Design
The core of overlapping coalition formation for UAV swarms is to evaluate the overall effectiveness when multiple UAVs execute multiple tasks, which determines whether a coalition structure is optimal. We design three objectives that quantify task fulfillment, coalition synchronization deviation, and operational cost, respectively.
- Task execution sufficiency: To quantify the consistency between the resources allocated to task and its demands, we first compute per-dimension satisfaction ratios, then aggregate them within the consumable and non-consumable classes, and finally combine the two class-level scores into the overall task measure. For consumable resources, the per-dimension satisfaction ratio is defined as where . Given weights with , we adopt a shortage-sensitive generalized mean for within-class aggregation. For non-consumable resources, the per-dimension satisfaction ratio is defined as where denotes the coalition-level equivalent capability on that dimension, and is a monotone mapping to form a dimensionless, larger-is-better scale. Given weights with , the within-class aggregation is The overall task measure is computed as the product of the two class-level scores,
- Coalition synchronization deviation: To represent arrival latency, waiting energy, and deadline constraints under a unified dimension, the precedence relations among tasks are modeled by a directed acyclic graph ; when only task priorities are available, they become a total order within . For member of coalition , the sequence of tasks that must be completed before executing is defined as: Based on this sequence, two aggregate quantities are introduced for latency and energy calculation: denotes the total flight distance before reaching , and denotes the total processing time that must be completed before reaching , defined as: where is the flight distance of the j-th leg and is the processing time that must be completed before that leg; if no processing is required, it is set to zero. Accordingly, the arrival time of can be written as a function of speed: Under the speed bounds , taking the slowest path cruising at the maximum admissible speed as the baseline yields the earliest feasible synchronization time: Given , each remaining member selects the unique target speed that achieves on-time arrival without violating the bounds: If the required speed falls below , the member cruises at and waits upon arrival; the waiting time is: Consequently, the synchronized start time of the coalition satisfies: To reflect the energy cost incurred during waiting within a unified metric, let denote the flight power and denote the hovering power, and impose the ordering constraint: where is an empirical economical cruise speed. The energy expenditure of before is then: Further, given the deadline of task , define the lateness as: Finally, the synchronization latency, the deadline deviation, and the energy differential during waiting are combined via time-equivalent coefficients into a single objective: where and are calibration weights that convert the energy penalty into a time-equivalent cost, allowing it to be summed with and under a unified dimension.
- Operational cost: To assess the aggregate energy burden incurred by a coalition while completing multiple tasks, Equation (20) defines the operational cost for task , where is the total energy consumed by UAV to complete all assigned tasks. Since contributes heterogeneous resource types and quantities to multiple tasks, we use as a workload proxy and proportionally split the total energy to estimate per-task cost. The energy of UAV is the sum of hovering and flight energy, given by Equation (21), where is the length of the task schedule of , and is the propulsion power of a rotorcraft at constant speed V, computed by Equation (22). We instantiate the energy model with a DJI Matrice 600 Pro hexacopter equipped with 2170R 21 × 7 inch propellers; the parameter values used in Equation (22) are listed in Table 1.
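The synchronization rule in the second objective above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that a member's arrival time is its total flight distance divided by a constant cruise speed plus its total prior processing time, and that every member has a positive minimum speed. The `Member` class and the `synchronize` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Member:
    distance: float    # total flight distance before reaching the task (m)
    processing: float  # total processing time before reaching the task (s)
    v_min: float       # lower speed bound (m/s), assumed > 0
    v_max: float       # upper speed bound (m/s)


def synchronize(members: List[Member]) -> Tuple[float, List[Tuple[float, float]]]:
    """Return the earliest feasible start time and a (speed, wait) pair per member."""
    # The slowest member flying at its maximum speed fixes the earliest common start.
    t_sync = max(m.distance / m.v_max + m.processing for m in members)
    plan = []
    for m in members:
        flight_window = t_sync - m.processing
        required = m.distance / flight_window if flight_window > 0 else m.v_max
        # Clip to the admissible range; below v_min the member arrives early and waits.
        speed = min(max(required, m.v_min), m.v_max)
        arrival = m.distance / speed + m.processing
        plan.append((speed, max(0.0, t_sync - arrival)))
    return t_sync, plan


# Toy coalition (hypothetical values): the third member is close and must wait.
team = [Member(1000.0, 0.0, 5.0, 20.0),
        Member(400.0, 10.0, 5.0, 20.0),
        Member(100.0, 0.0, 5.0, 20.0)]
t_sync, plan = synchronize(team)
```

In this toy case the first member sets the start time, the second slows down to arrive exactly on time, and the third is clipped to its minimum speed and accumulates waiting time, which is the case that incurs the waiting-energy penalty.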
Table 1. Parameterization of the propulsion power model in Equation (22) for a DJI Matrice 600 Pro hexacopter used to compute flight energy consumption.
To concisely characterize the aggregate benefit of task on the interval , starting from the previously defined task execution sufficiency , we map the synchronization deviation and the energy cost to sufficiency scores and then couple them into a utility, as follows:
where by default we set , take as the deadline scale of , and choose as the coalition energy budget or a platform-calibrated upper bound. With non-negative weights satisfying , the task utility is defined by
Consequently, , increases with , and decreases with and .
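To make the task execution sufficiency score concrete, the sketch below shows one possible shortage-sensitive aggregation. The exact form used in the paper is not reproduced here. The code assumes a satisfaction ratio capped at one and a weighted power mean with a negative exponent, which pulls the score toward the worst-served dimension. The function names, the exponent `p`, and the floor `eps` are illustrative assumptions.

```python
import math
from typing import List


def satisfaction_ratios(allocated: List[float], demand: List[float]) -> List[float]:
    """Per-dimension ratio of allocated to demanded quantity, capped at 1 (assumed form)."""
    return [min(1.0, a / d) for a, d in zip(allocated, demand)]


def power_mean(ratios: List[float], weights: List[float], p: float = -2.0,
               eps: float = 1e-6) -> float:
    """Weighted power mean; p < 0 penalizes shortages more than an arithmetic mean."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    clipped = [max(r, eps) for r in ratios]  # avoid division by zero for empty dimensions
    if p == 0:
        return math.exp(sum(w * math.log(r) for w, r in zip(weights, clipped)))
    return sum(w * r ** p for w, r in zip(weights, clipped)) ** (1.0 / p)


def task_sufficiency(cons_score: float, noncons_score: float) -> float:
    """Overall measure as the product of the two class-level scores, as in the text."""
    return cons_score * noncons_score


# Toy example (hypothetical): one consumable dimension is only half satisfied.
ratios = satisfaction_ratios([4.0, 1.0], [4.0, 2.0])
score = power_mean(ratios, [0.5, 0.5])
overall = task_sufficiency(score, 1.0)
```

With equal weights, the arithmetic mean of these ratios would be 0.75, whereas the power mean with `p = -2` is lower, reflecting the sensitivity to the under-supplied dimension.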
3.3. Problem Statement
We consider coalition formation with a heterogeneous UAV–multi-task structure in which each task comprises multiple heterogeneous sub-tasks with distinct resource demands, as illustrated in Figure 2. The goal is to optimize overlapping coalition formation under limited resources, dynamic tasks, and local communication. The decision variables are the task–UAV assignment structure and the resource-allocation vectors , yielding an exponentially large combinatorial search space. The problem has three inherently conflicting objectives , , and . For example, higher task fulfillment typically increases energy consumption, while improving synchronization may conflict with preferred task orders. Hence, the problem is a multi-objective combinatorial optimization with a set of non-dominated solutions. The optimization seeks an overlapping coalition structure with resource allocations that maximizes the overall utility of the task set, as in Equation (25), subject to the constraints in Equation (26).
Figure 2.
Example of resource allocation under overlapping coalitions, where heterogeneous UAV resources are assigned to multiple tasks and UAVs transfer between coalitions to reuse residual capacity.
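Because the three objectives conflict, candidate coalition structures are compared by Pareto dominance rather than by a single score. The sketch below filters a list of candidates down to its non-dominated set. It is an illustration only: each candidate is reduced to a hypothetical tuple (fulfillment, synchronization deviation, cost), with the first component maximized and the other two minimized.

```python
from typing import List, Tuple

Objectives = Tuple[float, float, float]  # (fulfillment, sync deviation, cost)


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Objectives]) -> List[Objectives]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Toy candidates (hypothetical values).
candidates = [(0.9, 12.0, 300.0), (0.8, 8.0, 250.0), (0.7, 9.0, 260.0), (0.9, 12.0, 320.0)]
front = pareto_front(candidates)
```

Here the first two candidates trade fulfillment against deviation and cost, so neither dominates the other, while the remaining two are each dominated and dropped.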
In Equation (26), the resource constraints ensure that the sums of consumable and non-consumable provisions from a coalition meet each task’s demand; the temporal constraint ensures that the synchronization deviation remains within the tolerated waiting time; and the energy constraint ensures that a UAV’s total energy consumption does not exceed its initial energy. Because the solution space grows combinatorially with the number of tasks and UAVs, obtaining a globally optimal coalition structure is NP-hard. We therefore design a staged solving strategy with reduced computational complexity.
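The three constraint families described above can be checked with a simple feasibility test. This is a minimal sketch based only on the verbal description of the constraints: the data layout (nested dictionaries keyed by task and UAV) and the function name `coalition_is_feasible` are assumptions made for illustration.

```python
from typing import Dict, List


def coalition_is_feasible(
    allocation: Dict[str, Dict[str, List[float]]],  # task -> uav -> amount per resource dimension
    demand: Dict[str, List[float]],                 # task -> demand per resource dimension
    sync_deviation: Dict[str, float],               # task -> synchronization deviation
    max_wait: Dict[str, float],                     # task -> tolerated waiting time
    energy_used: Dict[str, float],                  # uav -> total energy consumed
    energy_init: Dict[str, float],                  # uav -> initial energy
) -> bool:
    for task, need in demand.items():
        contribs = list(allocation.get(task, {}).values())
        # Resource constraint: summed coalition provision meets demand on each dimension.
        for k, required in enumerate(need):
            if sum(c[k] for c in contribs) < required:
                return False
        # Temporal constraint: deviation stays within the tolerated waiting time.
        if sync_deviation[task] > max_wait[task]:
            return False
    # Energy constraint: no UAV spends more than its initial energy.
    return all(energy_used[u] <= energy_init[u] for u in energy_used)


# Toy instance (hypothetical values): u2 contributes to both tasks.
demand = {"t1": [4.0, 1.0], "t2": [2.0, 1.0]}
allocation = {"t1": {"u1": [3.0, 0.0], "u2": [1.0, 1.0]},
              "t2": {"u2": [0.0, 1.0], "u3": [2.0, 0.0]}}
sync_deviation = {"t1": 20.0, "t2": 5.0}
max_wait = {"t1": 60.0, "t2": 30.0}
energy_used = {"u1": 40.0, "u2": 70.0, "u3": 30.0}
energy_init = {"u1": 100.0, "u2": 100.0, "u3": 100.0}
feasible = coalition_is_feasible(allocation, demand, sync_deviation, max_wait,
                                 energy_used, energy_init)
```

A check of this kind is cheap to evaluate per candidate, which is what makes feasible-region pruning practical inside a search procedure even though the overall problem remains NP-hard.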
4. Overlapping Coalition Formation Algorithm
Overlapping coalition formation involves coupled relations among multiple agents and multiple tasks. It must perceive high-order structure, avoid local optima, and control global exploration cost under feasibility constraints. This section proposes a coordinated algorithmic framework, SGRL-TS. First, we extract the structural semantics of task–UAV–coalition relations using a bidirectional HAN and utilize structure embeddings to modulate policy parameters, thereby improving generalization. Next, we develop a structure-aware and efficient multi-task cooperative hierarchical value-decomposition algorithm called SHIELD. SHIELD captures member complementarity or conflict via nested nonlinear aggregation within a coalition and models inter-task cooperation or competition via second-order interaction terms across coalitions, thereby yielding the global value . Finally, we conduct a global search with MCTS guided by and feed high-quality structures back into training through demonstration replay, Q-value distillation, and structure-neighborhood exploration.
4.1. High-Order Structural Modeling and State Representation
In overlapping coalition formation for heterogeneous UAVs across multiple tasks, the local observation of a single agent is insufficient to capture the high-order interactions between tasks and UAVs. Decisions are influenced by the task-demand context and by the role that each agent plays within multi-task collaborative structures. To improve policy generalization and coalition organization efficiency, we adopt a bidirectional hypergraph attention network that combines structural awareness with context-induced modulation. The core objective is to disentangle an agent’s intrinsic capability from the external task context at the representation level.
To accommodate structural evolution during planning, we construct a time-varying heterogeneous hypergraph at each decision step t. The node set denotes UAVs. The hyperedge set denotes the tasks within the planning scope at step t. A task appears as a hyperedge with member set . A node that belongs to multiple hyperedges models overlapping coalition membership. This structure specifies connectivity, propagation paths, and normalization domains of structural information during policy learning, as illustrated in Figure 3.
Figure 3.
High-order structural modeling with the bidirectional hypergraph attention network, which encodes UAV nodes and task hyperedges and propagates features along membership and communication links. Here, denotes the task structural embedding produced by task-to-node aggregation, and denotes the structure-aware node embedding produced by node-to-task aggregation.
To restrict attention normalization to valid candidate sets, we introduce three masks for membership, task activation, and UAV availability:
Given the initial node and hyperedge embeddings and , we define a differentiable score
where and are learnable mapping matrices, is a learnable vector, and ‖ denotes concatenation. Two complementary attentions are then normalized over the mask-restricted domains.
- Task-to-node aggregation. This aggregation builds the task representation from candidate members and expresses task selectivity. The attention coefficients are normalized over the candidate-member domain:The task structural embedding is then computed asTo make task selectivity explicit, we pose an entropy-regularized matching objective over the simplex :whose closed-form solution is the masked softmaxUnder a linearization, Equation (30) becomes the convex combinationshowing that the task embedding aggregates member features with weights proportional to the marginal contribution scores.
- Node-to-task aggregation. This aggregation fuses information over a node’s candidate tasks and expresses contextual suitability. The attention coefficients are normalized over the candidate-task domain:The structure-aware node representation isTo formalize contextual suitability, we optimize an entropy-regularized utility over the simplex using a context utility :whose solution is
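The two masked normalizations above can be sketched as follows. This is a minimal illustration, not the authors' network. The score matrix is taken as given, with rows indexing UAV nodes and columns indexing task hyperedges, and a single combined 0/1 mask stands in for the membership, task-activation, and availability masks. All function names are hypothetical.

```python
import math
from typing import List


def masked_softmax(scores: List[float], mask: List[int]) -> List[float]:
    """Softmax restricted to entries with mask == 1; masked entries get weight 0."""
    valid = [s for s, keep in zip(scores, mask) if keep]
    if not valid:
        return [0.0] * len(scores)
    peak = max(valid)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if keep else 0.0 for s, keep in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def task_to_node(scores: List[List[float]], mask: List[List[int]]) -> List[List[float]]:
    """Normalize over candidate members of each task (column-wise)."""
    n, m = len(scores), len(scores[0])
    cols = [masked_softmax([scores[i][j] for i in range(n)],
                           [mask[i][j] for i in range(n)]) for j in range(m)]
    return [[cols[j][i] for j in range(m)] for i in range(n)]


def node_to_task(scores: List[List[float]], mask: List[List[int]]) -> List[List[float]]:
    """Normalize over candidate tasks of each node (row-wise)."""
    return [masked_softmax(row, keep) for row, keep in zip(scores, mask)]


def aggregate(weights: List[float], features: List[List[float]]) -> List[float]:
    """Convex combination of feature vectors under the attention weights."""
    dim = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(dim)]


# Toy example (hypothetical): 3 UAVs, 2 tasks; UAV index 1 is a candidate for both.
scores = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0]]
mask = [[1, 0], [1, 1], [0, 1]]
alpha = task_to_node(scores, mask)  # alpha[i][j]: weight of node i within task j
beta = node_to_task(scores, mask)   # beta[i][j]: weight of task j for node i
task0 = aggregate([alpha[i][0] for i in range(3)], [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]])
```

The large score of the third UAV on the first task has no effect because that entry is masked out, which is the intended role of restricting normalization to valid candidate sets.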
Operator view of higher-order relations. Substituting Equation (33) into Equation (38) yields, after one bidirectional pass,
which explicitly encodes the second-order node–task–node relation via the multiplicative coupling . Let be the incidence matrix. One bidirectional pass admits the operator approximation
where and embed masked, within-domain normalizations into the feasible sets, and absorbs linear mappings. Hence, bidirectional attention realizes a learnable second-order polynomial operator in ; stacking L layers induces up to order polynomials, which approximate a broad class of higher-order hypergraph convolution kernels and subset functions, thereby capturing UAV–task–coalition higher-order relations with shallow propagation.
To improve discriminability and prevent representation collapse, we introduce a structural diversity regularizer:
Considering the continual evolution of the structure over time, we employ residual in-step updates to maintain adaptability of the features:
where denotes a normalization operator. Tasks not in the planning scope and UAVs that are not schedulable are automatically excluded from normalization and aggregation through the masks in (27), which stabilizes computation and keeps the cost controlled. For brevity, time subscripts are omitted in the sequel when no ambiguity arises.
The pseudocode for HAN forward computation with diversity regularization is presented in Algorithm 1.
| Algorithm 1 Bidirectional Hypergraph Attention with Diversity Regularization |
|
4.2. Joint Policy Learning and Local Optimization
To address structural role modeling, policy coordination, and global value optimization for heterogeneous UAV swarms with multi-task overlapping coalitions, we propose a structure-aware hierarchical value decomposition algorithm, SHIELD. The algorithm takes as inputs each UAV’s local observation and the structure-aware node and task embeddings from Section 4.1, namely the node embedding and the task embedding , and performs encoding and aggregation under the unified masks (membership), (task activation), and (UAV availability). This design ensures that task selectivity and contextual suitability propagate from the representation layer to the value-decomposition layer. The global value remains well-defined even with only a partial coalition structure. An overview is shown in Figure 4.
Figure 4.
SHIELD architecture: A hypergraph encoder, role attribution, and nested coalition and global mixers jointly implement structure-aware hierarchical value decomposition for multi-task coordination.
To disentangle the node embedding, we inject structural semantics into a role representation, which makes each UAV’s role and functional positioning explicit within specific task coalitions. Given , the role gate and the role vector are computed as
where , , and are learnable matrices; is the sigmoid; is the hyperbolic tangent; and ⊙ denotes the Hadamard product. To transmit task selectivity and contextual suitability into downstream aggregation, we convert both into a non-negative member-to-task gate effective only on the feasible domain,
and renormalize it within each coalition to remove scale effects:
where and are learnable scalars and is a small constant.
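As a rough illustration of the role gating and the within-coalition renormalization, the sketch below uses assumed functional forms: a sigmoid gate multiplied elementwise by a tanh candidate, and a raw member-to-task gate formed as a masked product of powers of the two attention coefficients. The names `role_vector`, `member_gates`, `W_gate`, `W_cand`, and the exponents `a` and `b` are hypothetical; the exact expressions of the paper may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def role_vector(h, W_gate, W_cand):
    """Assumed form: r = sigmoid(W_gate h) * tanh(W_cand h), elementwise (Hadamard)."""
    gate = [sigmoid(z) for z in matvec(W_gate, h)]
    cand = [math.tanh(z) for z in matvec(W_cand, h)]
    return [g * c for g, c in zip(gate, cand)]

def member_gates(alpha, beta, mask, a=1.0, b=1.0, eps=1e-8):
    """Assumed form: raw gate = mask * alpha^a * beta^b, renormalized within one coalition.

    alpha: task-to-node attention for each member of the coalition
    beta:  node-to-task attention of each member toward this task
    mask:  1 if the member-task pair is feasible, else 0
    """
    raw = [m * (al ** a) * (be ** b) for al, be, m in zip(alpha, beta, mask)]
    total = sum(raw) + eps  # eps guards against an empty feasible set
    return [r / total for r in raw]
```

Dividing by the per-coalition sum removes the dependence on coalition size, so gates from a two-member and a ten-member coalition remain comparable.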
- Agent-Conditional Value Function. An individual policy must adapt to structural roles and local observations. We define an agent-conditional Q function. Here, the role vector modulates both the temporal encoder and the action embedding: where is the local history, is the current action, generates GRU parameters conditioned on the role, encodes actions jointly with roles, and ⊕ denotes concatenation.
- Intra-Coalition Value Decomposer. To capture complementarities and conflicts among coalition members, we adopt a nested nonlinear composition. The role-conditioned contribution of a member is with learnable parameters , , , and , and concatenation . The gated, role-modulated fusion within coalition m is and the coalition value reads where is a small MLP with normalization and bounded activation, and masks inactive tasks.
- Cross-Task Interaction Aggregator. Tasks exhibit cooperation and competition that cannot be captured by a simple linear mixture. We therefore define a structure-aware interaction aggregator with quadratic and cross terms, restricted to active tasks. The global value is where is a global state summary and is a multilayer mapping. The interaction module is with task-wise base weights , quadratic weights , pairwise interaction weights , and the sigmoid .
- Stable Encoding under Partial Structures. When only a subset of tasks is active, we preserve tensor shapes and statistical stability by using a softened placeholder for inactive tasks: where is a learnable placeholder and is the standard encoder output. Because attention normalizations and gated aggregations are masked, inactive tasks neither appear in denominators nor contribute to value sums.
- Training Objectives and Optimization. We normalize three performance indicators (task fulfillment, coalition synchronization bias, and action cost) as and define the reward with a balance-promoting term: where with , exponents control sensitivity, and weights the balance term. The temporal-difference loss over a mini-batch is and the attention–gate alignment regularizer is so that the total loss becomes where and is defined in (41).
We now present the single-step training procedure of SHIELD, which integrates structure-attribution encoding, hierarchical value decomposition, and the multi-objective reward into one parameter update, as shown in Algorithm 2.
| Algorithm 2 SHIELD single-step training with structural regularization and distillation |
|
4.3. Global Optimization and Policy Feedback
To overcome the combinatorial explosion of overlapping coalition structures and the tendency of local search to get trapped in suboptimal regions, we augment the structure-aware hierarchical value decomposition of the previous section with MCTS for global structure optimization and feed the high-quality structures it discovers back into policy learning. This forms a closed loop of structure prior, feasible-set pruning, global evaluation, and policy feedback. The key idea is to use the masked encoding of a time-varying hypergraph as a bridge, so that partially assigned structures receive stable and comparable value estimates from the bidirectional hypergraph attention and hierarchical value decomposition networks. Structure priors then guide the tree policy, incremental constraint checks maintain real-time feasibility, and Pareto-nondominated solutions are converted into demonstrations and distillation signals that jointly accelerate policy convergence and improve global quality. The workflow is shown in Figure 5.
Figure 5.
Pipeline of the MCTS-based global structure optimizer, where SHIELD values guide feasible coalition expansions and Pareto optimal structures are fed back as demonstrations and distillation targets.
4.3.1. Structure-Prior and Feasible-Set–Constrained Global Optimizer
In MCTS, each node s represents a partially realized resource allocation ; that is, the coalitions and their resource assignments for the first m tasks. The root is ∅, indicating no task has been assigned; expanding to selects a coalition and resource plan for task m + 1, and the full depth is M, whose leaves encode a complete overlapping coalition . To uniformly encode partial structures, we reuse the masked mechanism in Section 4.1: inactive tasks use the learnable placeholder embedding in (53), while attention and gating normalize only over mask-restricted domains per (27) and (46); thus, any can be stably fed into SHIELD to evaluate .
During node selection, we adopt PUCT with a structure prior,
where is the running mean of backed-up action values, and are visit counts, and controls exploration. The prior aggregates the gating strengths of the candidate coalition for task and is softmax-normalized. Let with defined by (46); then
where is a temperature. To prevent branch explosion, we use progressive widening to limit the branching factor,
with growth tuned by and ; if the cap is not reached, high-prior candidates are expanded first, otherwise children are chosen by (59).
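The selection and widening rules can be sketched as follows. The PUCT score uses the standard form Q + c·P·sqrt(N_parent)/(1 + N_child); the widening cap ceil(k·N^alpha) and the helper names (`structure_prior`, `widening_cap`, `choose`) are assumptions made for illustration, and the default constants are not taken from the paper.

```python
import math

def structure_prior(gate_sums, temperature=1.0):
    """Softmax over the summed gating strengths of each candidate coalition."""
    peak = max(gate_sums)
    exps = [math.exp((g - peak) / temperature) for g in gate_sums]
    total = sum(exps)
    return [e / total for e in exps]

def puct_scores(q_values, visit_counts, priors, c_puct=1.5):
    """Q(s,a) + c * P(s,a) * sqrt(N(s)) / (1 + N(s,a)) for every expanded child."""
    parent_visits = sum(visit_counts)
    return [q + c_puct * p * math.sqrt(parent_visits) / (1 + n)
            for q, n, p in zip(q_values, visit_counts, priors)]

def widening_cap(parent_visits, k=1.0, alpha=0.5):
    """Assumed progressive-widening cap: at most ceil(k * N^alpha) children, at least 1."""
    return max(1, math.ceil(k * parent_visits ** alpha))

def choose(children, candidates, k=1.0, alpha=0.5, c_puct=1.5):
    """children: (q, visits, prior) per expanded action; candidates: priors of unexpanded ones.

    Assumes at least one of the two lists is non-empty.
    """
    parent_visits = sum(n for _, n, _ in children)
    if candidates and len(children) < widening_cap(parent_visits, k, alpha):
        # Below the cap: expand the unexpanded candidate with the highest prior.
        return ("expand", max(range(len(candidates)), key=lambda i: candidates[i]))
    scores = puct_scores([q for q, _, _ in children],
                         [n for _, n, _ in children],
                         [p for _, _, p in children], c_puct)
    return ("select", max(range(len(scores)), key=lambda i: scores[i]))
```

A lower temperature in `structure_prior` concentrates the search on the coalitions with the strongest gates, while a higher one spreads visits more evenly.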
One-step expansion modifies only a small subset of resources and links related to the new task, so we perform incremental checks rather than full recomputation. For each UAV , we maintain residual resources and accumulated costs, and update them incrementally when expanding to . Energy feasibility uses a “maneuver + workload” decomposition with , imposing
where is a safety margin. Bandwidth feasibility uses an optimistic expectation of available throughput: if task requires and , then
where is the link success probability and the achievable rate, and is a redundancy factor. End-to-end latency is approximated by computation plus communication delay with for data size , yielding
Beyond early rejection by (62)–(64), we define a layerwise optimistic upper bound for unassigned tasks at node s,
where is the current estimate for the partial structure, is obtained by a conflict-ignorant greedy assignment, and flags tasks still within scope. If is below the incumbent best, the branch is cut; at the same depth and m, candidates dominated in both residual resources and are removed, realizing a triple pruning of feasibility, upper bound, and dominance.
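The incremental feasibility checks and the upper-bound cut can be sketched as below. The inequalities are assumed readings of the energy, bandwidth, and latency constraints described in the text (maneuver plus workload energy within the residual minus a margin; expected throughput covering the redundancy-inflated demand; computation plus transmission delay within a deadline). The class `UavState` and all field and function names are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class UavState:
    residual_energy: float  # energy still available on board
    link_success: float     # probability that the link delivers
    link_rate: float        # achievable rate in bit/s
    compute_delay: float    # on-board processing time in seconds

def energy_ok(uav, maneuver, workload, margin):
    # Maneuver plus workload energy must leave the safety margin untouched.
    return maneuver + workload <= uav.residual_energy - margin

def bandwidth_ok(uav, required_rate, redundancy):
    # Expected throughput must cover the demand inflated by the redundancy factor.
    return uav.link_success * uav.link_rate >= redundancy * required_rate

def latency_ok(uav, data_bits, deadline):
    # Computation delay plus transmission delay must meet the deadline.
    return uav.compute_delay + data_bits / uav.link_rate <= deadline

def try_assign(uav, maneuver, workload, margin,
               required_rate, redundancy, data_bits, deadline) -> Optional[UavState]:
    """Return the updated state if all three checks pass, otherwise None (early rejection)."""
    if not (energy_ok(uav, maneuver, workload, margin)
            and bandwidth_ok(uav, required_rate, redundancy)
            and latency_ok(uav, data_bits, deadline)):
        return None
    # Only the affected UAV is updated; nothing else is recomputed.
    return replace(uav, residual_energy=uav.residual_energy - maneuver - workload)

def bound_cut(optimistic_upper_bound, incumbent_best):
    """Cut the branch when even the optimistic bound cannot beat the incumbent."""
    return optimistic_upper_bound < incumbent_best
```

Returning a new immutable state per expansion keeps sibling branches independent, which matters when the same UAV is shared by overlapping coalitions.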
Upon reaching a leaf or a cutoff depth, we greedily complete remaining tasks with a lightweight policy and use SHIELD’s joint value as a rollout proxy; visit counts are then updated and is backed up by a moving average. This proxy is scale-consistent with (65), maintaining evaluation coherence under limited budgets.
4.3.2. Policy Feedback: Demonstration Replay and Structure-Consistent Q-Value Distillation
MCTS solutions are not used as one-shot offline references; instead, they are injected into the learning loop under a controlled cadence so that global optimization continuously feeds local policy updates. Let an MCTS run be triggered whenever the interaction steps reach , with a search budget of simulations; after search, we perform parameter updates where each mini-batch mixes demonstration and environment samples with ratio . To rely more on structural evidence early and anneal toward autonomous improvement later, we use the schedule
where are initial values and are time scales. This guarantees that triggers and budgets gradually increase as the policy matures, whereas the demonstration ratio and distillation weight decay to avoid long-term reliance on external supervision.
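One way to realize such a schedule is sketched below, assuming linear growth for the trigger interval and search budget and exponential decay for the demonstration ratio and distillation weight. Both functional forms, the function names, and every numeric constant are illustrative assumptions rather than the paper's settings.

```python
import math

def growing(initial, step, time_scale):
    """Assumed linear growth: initial * (1 + step / time_scale)."""
    return initial * (1.0 + step / time_scale)

def decaying(initial, step, time_scale):
    """Assumed exponential decay: initial * exp(-step / time_scale)."""
    return initial * math.exp(-step / time_scale)

def schedule(step):
    """Return the four scheduled quantities at a given interaction step."""
    return {
        "trigger_interval": round(growing(200, step, 5_000)),   # steps between MCTS runs
        "search_budget": round(growing(16, step, 10_000)),      # simulations per run
        "demo_ratio": decaying(0.5, step, 20_000),               # share of demonstration samples
        "distill_weight": decaying(1.0, step, 20_000),           # weight of the distillation loss
    }
```

With these placeholder constants, `schedule(0)` starts from half of each mini-batch being demonstrations, and the share falls to about 0.18 after 20,000 steps.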
The update count per trigger scales with budget as
where is a base count and controls coupling between search and learning. With mini-batch size B, we sample from demonstration and environment buffers with a fixed mix,
Demonstrations come from the union of current and past nondominated solutions, each mapped to a scalar return via the hypervolume of multi-objective scores and decoded into joint actions and transition tuples for storage. To improve sample efficiency, we use prioritized sampling that balances TD error and structural prior: if sample i has TD error and a prior derived from structural gating and hypervolume rank, then
and sampling is drawn proportionally to the normalized across both buffers.
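The prioritized sampling step can be sketched as follows. The priority is assumed here to be a convex combination of the absolute TD error and the structural prior plus a small floor; the mixing coefficient `eta`, the floor `eps`, and the function names are hypothetical, and the paper's exact priority expression may differ.

```python
import random

def priorities(td_errors, structural_priors, eta=0.3, eps=1e-3):
    """Assumed priority: (1 - eta) * |td_error| + eta * prior + eps."""
    return [(1 - eta) * abs(d) + eta * u + eps
            for d, u in zip(td_errors, structural_priors)]

def sampling_probs(prios):
    """Normalize priorities over the union of both buffers."""
    total = sum(prios)
    return [p / total for p in prios]

def sample_indices(prios, batch_size, rng):
    """Draw indices with replacement, proportionally to priority."""
    return rng.choices(range(len(prios)), weights=prios, k=batch_size)
```

Environment samples without search evidence can simply carry a structural prior of zero, so that only their TD error drives how often they are replayed.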
Beyond the TD loss and the attention–gating consistency regularizer, we introduce a structure-consistent distillation term to align marginal contributions in the value decomposition with structural evidence from search. Let the coalition value for task be and its leave-one-out counterfactual be ; define the estimated marginal contribution , and let be the normalized contribution returned by search on the same scale. The distillation loss is
where is an unbiased backup target and is a weight. The per-step objective becomes
where is the joint-value TD loss, aligns attention distributions with downstream gating to preserve the structural semantics of task selectivity and contextual suitability, and follows (66) to emphasize structural supervision early and gradually release policy self-organization later. The pseudocode of this MCTS-guided policy feedback loop is given in Algorithm 3.
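The marginal-contribution alignment can be illustrated with a small sketch. The leave-one-out margin follows the definition in the text; the squared-error form of the distillation loss, the way the backup target enters it, and all function and parameter names are assumptions made for illustration.

```python
def marginal_contributions(coalition_value, leave_one_out_values):
    """Margin of member i: coalition value minus the value without member i."""
    return [coalition_value - v for v in leave_one_out_values]

def distillation_loss(coalition_value, leave_one_out_values,
                      search_contribs, backup_target, weight=0.5):
    """Assumed form: mean squared gap between estimated margins and search
    contributions, plus a weighted squared gap to the backup target."""
    margins = marginal_contributions(coalition_value, leave_one_out_values)
    contrib_term = sum((m - c) ** 2 for m, c in zip(margins, search_contribs)) / len(margins)
    value_term = (coalition_value - backup_target) ** 2
    return contrib_term + weight * value_term

def total_loss(td_loss, align_loss, distill_loss, align_weight, distill_weight):
    """Per-step objective: TD loss plus weighted alignment and distillation terms."""
    return td_loss + align_weight * align_loss + distill_weight * distill_loss
```

When the estimated margins already match the search contributions and the coalition value equals the backup target, the distillation term vanishes and the update reduces to TD learning with the alignment regularizer.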
| Algorithm 3 MCTS-guided policy feedback with scheduled demonstration and distillation |
|
4.4. Complexity and Real-Time Feasibility
To render the overall framework deployable, we formalize the end-to-end latency, derive module-wise asymptotic costs, and state the resulting real-time criterion and scalability conditions.
Let be the number of UAVs, the number of active tasks in the current planning scope, the total member count across active task coalitions, d the embedding dimension, L the number of HAN layers, a per-task coalition-size cap, and k the sparsity level for cross-task interactions.
4.4.1. End-to-End Latency and Real-Time Criterion
We decompose a single decision period as
where B is the search budget, is the time to evaluate SHIELD on a partial structure, and is the incremental feasibility checking time (energy, bandwidth, latency updated only for affected UAVs and links). Given a control period , real-time feasibility requires
This bound links real-time feasibility to tunable parameters for pre-deployment selection and online adaptive pruning.
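The criterion can be turned into a simple budget selector. The sketch assumes the decomposition reads as encoder time plus mixer time plus budget times the per-simulation cost (evaluation plus incremental check); the function names and the timing values in the usage note are hypothetical, not measurements from the paper.

```python
def decision_latency(t_han, t_shield, budget, t_eval, t_check):
    """Assumed decomposition: fixed encoding cost plus a per-simulation search cost."""
    return t_han + t_shield + budget * (t_eval + t_check)

def max_budget(t_han, t_shield, t_eval, t_check, control_period):
    """Largest search budget that still meets the control period, or -1 if none does."""
    slack = control_period - t_han - t_shield
    if slack < 0:
        return -1
    return int(slack // (t_eval + t_check))
```

For example, with all times in microseconds, `max_budget(900, 700, 40, 10, 20000)` returns 368, and the latency at that budget is exactly 20000.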
4.4.2. Asymptotic Costs by Module
- HAN. Bidirectional hypergraph attention uses masked normalizations restricted to candidate sets. With one L-layer pass, the cost is where and arise from affine mappings for nodes and hyperedges, and from member–task scoring plus masked softmax. Masking confines normalization to valid candidates and avoids unnecessary global normalization.
- SHIELD. Role attribution and agent-conditional Q incur ; the nested intra-coalition composition costs ; cross-task interactions, when sparsified to at most k neighbors per task, reduce the naïve to
- Budgeted MCTS. With progressive widening for and layer-wise upper-bound pruning, the per-decision search time is where has the same order as (75) but is evaluated on a partial structure with ; masking stabilizes tensor shape and scale during partial evaluations.
4.4.3. Conditional Near-Linear Scalability
When are engineering constants and masking ensures , we obtain
5. Experimental Results and Analysis
To verify the effectiveness and adaptability of the proposed overlapping coalition formation method for heterogeneous UAV swarms, we design multiple simulation scenarios of different scales that cover a range of task densities, resource tightness, and UAV configurations. All experiments are conducted on a Windows 11 system, equipped with an Intel Core i9-13900K processor (3.00 GHz base frequency), two NVIDIA GeForce RTX 4090 GPUs with 24 GB of VRAM each, and 128.0 GB of RAM.
5.1. Simulation Setup
To evaluate SGRL-TS, we build a simulator with heterogeneous tasks and multiple UAV platform types. We configure five representative tasks and specify their resource requirements in Table 2. To meet these demands, we design four UAV platform types, with their capabilities summarized in Table 3. The platforms emphasize support, delivery, firefighting, and communication relay. To ensure realism and reproducibility, we specify the environment and communication parameters, including area size, communication radius, energy model, and weights for the multi-objective reward, as listed in Table 4.
Table 2.
Resource requirements of representative tasks.
Table 3.
Capabilities of heterogeneous UAV platform types.
Table 4.
Simulation environment parameters.
To highlight the advantages of SGRL-TS, we compare it against the following methods:
- PGG-TS-OCF [17] establishes preference relations between UAV supply and task demand, guides tabu search via preference gravity, and achieves overlapping allocation and scheduling with a stable overlapping coalition.
- LocalSearch-CF [15] greedily seeds UAV–task coalitions and iteratively swaps or reassigns UAVs between neighboring coalitions to minimize mission completion time.
- HYGMA [20] dynamically clusters nearby UAVs based on state and task affinity, treating each cluster as a coalition and learning joint actions through hypergraph attention.
- SMART [23] enables each UAV to repeatedly switch its serving task coalition according to a locally evaluated payoff, yielding a distributed best response coalition formation.
- RCFG-DRL [26] models UAV–task assignment as a repeated coalition game, where UAVs learn join or leave decisions via deep RL until coalitions reach equilibrium.
5.2. Performance Evaluation
Figure 6 reports convergence and final utilities under four resource conditions. SGRL-TS consistently outperforms the strongest baseline, PGG-TS-OCF, by 3.19%, 4.49%, 6.25%, and 9.68% in the abundant, balanced, constrained, and scarce settings, respectively. It enters the efficient ascent earlier and shows a smoother plateau. The gains arise from heterogeneous hypergraph attention, which captures high-order couplings among UAVs, tasks, and coalitions; a structure-conditioned hierarchical value decomposition that yields globally comparable, monotone scores and suppresses merge-split oscillations; and budgeted MCTS under feasibility masks, which focuses expansions on high-value structures and reduces wasted search.
Figure 6.
Convergence of SGRL-TS and baselines across four resource regimes: (a) 18 UAVs–5 tasks (resource-abundant); (b) 20 UAVs–10 tasks (resource-balanced); (c) 15 UAVs–15 tasks (resource-constrained); (d) 10 UAVs–20 tasks (resource-scarce).
Baseline behavior clarifies the gaps. PGG-TS-OCF employs a parallel population search that identifies feasible overlaps early on; however, it stalls at suboptimal mixes under budget constraints and lacks a stable cross-level yardstick. HYGMA strengthens interaction modelling, but long value propagation under multiple constraints slows ascent. RCFG-DRL introduces adversarial robustness; however, nonstationarity induces mid-horizon oscillations, diverting budget from structural improvements. LocalSearch-CF and SMART use anytime stepwise moves bounded by small neighborhoods, which leads to local optima and caps utility. Non-overlapping SGRL-TS forbids resource reuse, which creates capacity bottlenecks and synchronization penalties and heightens sensitivity to task density and temporal perturbations.
We next examine resource utilization efficiency, task adaptability, and robustness under overload and scarcity by varying the number of UAVs from 4 to 20 under different task loads, as shown in Figure 7. Across the four task scales, the curve of SGRL-TS remains at the top and reaches a higher peak near moderate swarm sizes; the standard deviation bars show markedly lower variability compared to all baselines. Relative to the strongest competitor, PGG-TS-OCF, SGRL-TS achieves higher peak utilities by approximately 2.27%, 3.00%, 6.36%, and 9.76% at task numbers 5, 10, 15, and 20, respectively. The advantage increases as resource constraints tighten, indicating stronger resource scheduling and parallel coordination under crowded and scarce conditions.
Figure 7.
Average task utility versus the number of UAVs under different task loads: (a) ; (b) ; (c) ; (d) .
This performance stems from an explicit treatment of the benefit–cost balance. SGRL-TS estimates timing and energy constraints online and maps them onto a unified utility scale, which steers the search toward ranges where adding UAVs yields net gains while suppressing ineffective parallelism and communication congestion as scale grows, thus avoiding performance regression at large swarm sizes. Temperature-controlled sampling and reuse of candidate structures broaden exploration early in training, and as training converges the search settles on low-conflict configurations. Combined with penalties and pruning for repeated assignment and resource contention, these mechanisms reduce structural oscillation and tail-phase jitter. The result is a better compromise between task completion and coordination cost, leading to higher and more stable final utility.
We further test adaptability under sparse and dense tasks, reuse efficiency, and robustness to task pressure by fixing and increasing the number of tasks from 3 to 12. Results are shown in Figure 8. SGRL-TS stays on top across the three UAV scales and reaches a higher peak near the midrange of task counts, while the tail declines more gently and the variability remains smaller. Compared with the strongest baseline PGG-TS-OCF, the average utility over the full range improves by about 3.86% at 16 UAVs, 2.97% at 12 UAVs, and 3.53% at 8 UAVs.
Figure 8.
Average task utility versus the number of tasks under different swarm sizes: (a) UAVs; (b) UAVs; (c) UAVs.
This advantage and stability arise because bidirectional hypergraph attention normalizes task selectivity and contextual suitability within mask-constrained candidate sets, enabling precise member screening and task assignment as the number of tasks increases, which suppresses ineffective overlaps and resource contention. The SHIELD nested nonlinear aggregation with cross-task interaction terms provides a monotonic and comparable global value for cooperation and competition across tasks, making diminishing marginal returns detectable as the task load grows. This concentrates resources on actions with positive net gain, producing a higher midrange peak and slower performance decay.
To assess task completion under scaling, we vary the tasks from 4 to 12 with . Results are in Figure 9. Across the three experimental settings, the SGRL-TS curve remains closely aligned with the upper bound provided by the Task-completion OCF baseline, which optimizes only task fulfillment. Averaged over the range of 4 to 12 tasks, SGRL-TS improves the task execution sufficiency by approximately 2.42%, 2.63%, and 10.94% relative to PGG-TS-OCF in the three settings. Moreover, where other methods degrade markedly at higher task counts, SGRL-TS declines significantly more slowly and keeps a larger fraction of tasks close to complete execution, even under severely constrained resources. This advantage primarily stems from the balance term in the reward, which discourages extreme solutions that sacrifice a subset of tasks, thereby driving the policy to maintain medium to high completion levels across more tasks as the task load increases.
Figure 9.
Average task completion versus the number of tasks for three UAV swarm sizes: (a) UAV ; (b) UAV ; (c) UAV .
Figure 10a–c report the coalition-level temporal coordination performance of all methods. The evaluation metric is the coalition synchronization sufficiency, defined as the normalized score obtained from the coalition arrival-time deviation cost according to (23); larger values indicate more synchronized coalition arrivals under the given reference time scale. As the number of tasks increases, the synchronization sufficiency of all methods decreases overall, indicating that higher task congestion makes it harder for coalitions to achieve good temporal coordination; moreover, when the number of UAVs is reduced from 15 to 5, the overall degradation in synchronization performance becomes more pronounced. The advantage of SGRL-TS in synchronization sufficiency is most pronounced in configurations with more tasks and tighter resources, suggesting that structure-guided overlapping coalition formation combined with joint value decomposition can effectively suppress the dispersion in coalition arrival times. In contrast, non-overlapping SGRL-TS and Task-completion OCF, which only focus on task completion rate, exhibit significantly lower synchronization sufficiency under high-load scenarios, indicating that ignoring overlapping structures or lacking explicit synchronization modeling leads to markedly degraded temporal coordination among coalitions.
Figure 10.
Average coalition synchronization sufficiency versus the number of tasks under three UAV swarm sizes: (a) UAV ; (b) UAV ; (c) UAV . The metric is obtained by normalizing the coalition arrival-time deviation cost according to (23), with larger values indicating more synchronized coalition arrivals.
Previous experiments on task completion and utility have shown that SGRL-TS achieves returns close to the Task-completion OCF upper bound and outperforms PGG-TS-OCF. This section further examines its cost side from the perspective of energy utilization. As shown in Figure 11, under four task scales , the energy-efficiency curves of SGRL-TS lie consistently above those of all baselines. Compared with PGG-TS-OCF, the average energy efficiency over the entire UAV range improves by approximately 8.09%, 8.02%, 3.05%, and 12.85%, respectively. Moreover, relative to the Task-completion OCF scheme, which optimizes only task completion, SGRL-TS achieves comparable completion levels while attaining between two and five times higher energy efficiency, thereby providing a considerably more economical way of sustaining overlapping coalition structures from the energy consumption viewpoint. Overall, this advantage primarily stems from incorporating energy safety margins and residual-resource-driven feasible-set pruning into the MCTS search guided by SHIELD evaluations, which reduces the expansion of high-cost, overlapping structures at the search level, and thus markedly improves global energy utilization without sacrificing task completion.
Figure 11.
Energy efficiency versus the number of UAVs under different task loads: (a) ; (b) ; (c) ; (d) .
To examine the sensitivity of average task utility and method ranking to multi-objective weight settings, task-priority scenarios, and time/energy normalization scales, we conducted comparative experiments in a scenario with 10 tasks and 20 UAVs, as shown in Figure 12. In Figure 12a, under fixed network parameters and training configurations, we change only the weight vector at evaluation time to assess the weight sensitivity of the multi-objective design. In Figure 12b, under the default setting , we keep the training process unchanged and modify only the task-priority directed acyclic graph to construct three scenarios (Balanced priorities, Rescue priority, and Communication priority), which form the horizontal axis. In Figure 12c, we scale the normalization ranges of the synchronization reference time and energy reference value and examine the impact of five combinations on the final task utility; the horizontal axis corresponds to these five configurations.
Figure 12.
Sensitivity of average task utility to (a) multi-objective weight vectors , (b) task-priority scenarios, and (c) normalization scales of synchronization and energy costs in the 10-task, 20-UAV resource-neutral setting. Each curve reports the mean utility with standard-deviation error bars for SGRL-TS, HYGMA, PGG-TS-OCF, and RCFG-DRL.
Across all configurations of the three sensitivity tests, SGRL-TS consistently attains the highest average task utility, with a gap of approximately 2–3 percentage points over the best-performing baseline, PGG-TS-OCF, and exhibits substantially smaller variance, indicating overall robustness to perturbations in weights, priority settings, and normalization scales. Specifically, in Figure 12a, the completion-emphasized weight vector increases the utilities of all four methods, whereas the synchronization- and energy-emphasized weight vectors and reduce the overall utilities, with a particularly pronounced impact on RCFG-DRL, while the curve of SGRL-TS exhibits only mild fluctuations. Figure 12b shows that different task-priority topologies induce only slight changes in the utilities of all methods, and SGRL-TS can still better coordinate resources and synchronization constraints in the rescue-priority scenario, maintaining a stable performance lead. In Figure 12c, scaling or mainly changes the absolute level of utility, and more stringent normalization (such as or ) has a more pronounced negative impact on RCFG-DRL, whereas both the performance and variance of SGRL-TS vary only moderately.
We analyze the per-decision computational overhead of SGRL-TS under different swarm scales and search budgets, as shown in Table 5. We reuse the SGRL-TS policy trained in the previous experiments and perform online evaluation on six task configurations in inference mode without enabling backpropagation. For each configuration, we record the wall-clock time of the HAN encoding , the SHIELD mixing , the MCTS planning , and the end-to-end decision latency over consecutive decision steps, and we collect the average number of node expansions of MCTS under feasible-region pruning (Avg. exp.).
Table 5.
Per-decision runtime profiling of SGRL-TS under different swarm scales and search budgets (N: number of UAVs, M: number of tasks). All results are averaged over decision steps on the hardware platform described in Section 5.
As shown in Table 5, in S4 the end-to-end decision latency is about 5.54 ms, whereas the lower bound under the no-search configuration (S2, ) is only 2.10 ms, indicating that even with structured MCTS enabled the overall overhead remains significantly below the typical UAV control period on the order of tens of milliseconds and thus satisfies real-time application requirements. As the task scale increases, and slowly increase from about 0.82/0.64 ms in S1 to about 1.05/0.82 ms in S4, exhibiting an approximately linear growth trend that is consistent with the complexity result given in Section 4.4, which indicates that high-order structural modeling and structure-conditioned value decomposition themselves do not become the main bottlenecks. For the same swarm scale (S2), when the search budget is increased from to 32 and 64, the average number of expanded nodes grows from 0 to about 21.37 and 41.96, and the corresponding increases from 0 to 1.63 ms and 3.01 ms. At the same time, Avg. exp. consistently remains clearly below the budget B, which confirms that feasible-region pruning effectively suppresses the size of the search tree and makes the MCTS computational cost approximately linearly controllable with respect to the budget. Taken together, these results show that SGRL-TS achieves both low per-step latency and good scalability within the swarm sizes and search budgets considered in this work.
In Figure 13, we evaluate all algorithms in a scenario with 10 tasks and 15 heterogeneous UAVs. Multiple independent runs are conducted under different search budgets and random seeds to sample a set of feasible overlapping coalition-structure solutions. For each solution, we first compute the task execution sufficiency and the synchronization deviation cost according to (23), and then take the average over all tasks to obtain the overall task execution sufficiency and synchronization deviation; we also record the total energy cost incurred to complete all tasks. The normalized task shortfall is then defined as . Furthermore, synchronization deviation and energy cost are min–max normalized over the union of all methods and sampled solutions to obtain the normalized synchronization deviation and normalized energy cost . Consequently, all three quantities are scaled to the interval with smaller values being better, which facilitates multi-objective Pareto analysis in a unified cost space.
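The mapping into a unified cost space and the subsequent Pareto filtering can be sketched as follows. The shortfall is assumed here to be one minus the average task execution sufficiency, which is a natural reading of the text but not confirmed by it; the function names are illustrative.

```python
def min_max(values):
    """Min-max normalize to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def to_cost_space(sufficiency, sync_dev, energy):
    """Build (shortfall, normalized sync deviation, normalized energy) per solution."""
    shortfall = [1.0 - s for s in sufficiency]  # assumed definition of task shortfall
    return list(zip(shortfall, min_max(sync_dev), min_max(energy)))

def dominates(a, b):
    """a dominates b if it is no worse in every cost and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the solutions that no other solution dominates (all costs minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Note that the min-max step must pool the solutions of all methods before normalizing, as the text specifies; normalizing each method separately would make the resulting fronts incomparable.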
Figure 13.
Multi-objective trade-off among normalized task shortfall, synchronization deviation, and energy cost: (a) normalized task shortfall versus normalized synchronization deviation ; (b) normalized task shortfall versus normalized energy cost ; (c) normalized synchronization deviation versus normalized energy cost . All three objectives are min–max normalized over the collected solutions so that lower values are better.
From Figure 13a, the scatter of SGRL-TS is more concentrated in the lower-left region of the plane, yielding fewer dominated solutions with either a small task shortfall but significant synchronization deviation, or good synchronization at the price of a significantly increased task shortfall, compared with the baseline methods. In Figure 13b, SGRL-TS maintains a smaller under lower energy cost , whereas other methods typically require higher energy to achieve a similar level of task completion or suffer a larger task shortfall at comparable energy, indicating a more favorable energy–efficiency trade-off. Figure 13c further shows that, in the plane, the SGRL-TS samples overall lie closer to the lower-left Pareto boundary and form a more compact “knee” region around low energy and low synchronization deviation. In contrast, the samples of the baseline methods more frequently fall outside this frontier. Taken together, these results demonstrate that SGRL-TS achieves superior Pareto performance in the joint objective space of task completion, coalition-time synchronization, and energy consumption.
5.3. Ablation Studies
Table 6 presents a systematic ablation of the encoder, value decomposition, and global search modules under resource-neutral and resource-tight configurations. Here, Viol. denotes the constraint violation rate, AUC is the normalized area under the curve of average task utility (Util) versus training iterations, and Iter@95% is the training iteration at which the Util curve first reaches 95% of its steady-state mean, where the steady-state mean is computed from a moving average over the final training window. Relative to GAT-QMIX, a typical graph-based DRL baseline, Full-SGRL-TS under the resource-tight regime improves Compl, Util, and Eff by about 23.9%, 22.2%, and 26.5%, respectively, while reducing Viol by 45.9%, increasing AUC by 48.3%, and shortening Iter@95% by 36.9%. These gains do not come merely from replacing the GAT encoder and the QMIX mixer, but from jointly exploiting three structural mechanisms. First, HAN explicitly encodes the high-order UAV–task–coalition hypergraph, so that structural semantics enter value estimation and search rather than being limited to one-hop graph attention. Second, SHIELD injects structure-conditioned terms within and across coalitions, enabling finer modeling of cooperative gains and resource competition and yielding a more monotone and comparable global value. Third, the structured MCTS uses these signals for feasibility-set pruning and structural heuristics, so that coalition configurations are optimized globally around structural priors instead of relying on local GNN outputs.
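The two convergence metrics can be made concrete with a short sketch. This is one plausible reading of the definitions above, not the evaluation code behind Table 6: it assumes unit spacing between logged iterations, a utility bounded by a hypothetical `util_max`, and a steady-state mean taken as the plain average of the last `window` points; the curve itself is illustrative.

```python
from typing import Sequence


def normalized_auc(util: Sequence[float], util_max: float = 1.0) -> float:
    """Trapezoidal area under the utility curve, divided by the area of a
    constant curve at util_max over the same span (assumes unit spacing)."""
    if len(util) < 2:
        raise ValueError("need at least two logged points")
    area = sum((a + b) / 2.0 for a, b in zip(util[:-1], util[1:]))
    return area / (util_max * (len(util) - 1))


def iter_at_95(util: Sequence[float], window: int = 3, frac: float = 0.95) -> int:
    """First logged iteration whose utility reaches `frac` of the steady-state
    mean, taken here as the average of the final `window` points."""
    tail = util[-window:]
    steady_state = sum(tail) / len(tail)
    threshold = frac * steady_state
    for i, u in enumerate(util):
        if u >= threshold:
            return i
    return len(util) - 1  # fallback if the threshold is never reached


# Hypothetical training curve of average task utility per logged iteration.
curve = [0.0, 0.4, 0.7, 0.9, 1.0, 1.0, 1.0]
auc = normalized_auc(curve)
hit = iter_at_95(curve)
```

A larger AUC rewards curves that rise early as well as high, while a smaller Iter@95% isolates convergence speed, which is why the two are reported together.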
Table 6.
Ablation and replacement study of SGRL-TS under resource-neutral and resource-tight regimes.
Module-level ablations further support this view. With SHIELD-full and structured MCTS fixed, we replace the encoder with a standard GAT (A1) or HyperGCN (A2). Relative to A2, the full HAN still improves Util and Eff in the resource-neutral regime by about 2.2% and 4.6%, reduces Viol by 17.9%, increases AUC by 6.9%, and shortens Iter@95% by 11.4%; under the resource-tight regime it maintains gains of roughly 6.0% and 6.7% in Util and Eff and a 20.7% reduction in violation rate. This confirms that heterogeneous node types and high-order hyperedges provide structural information beyond what conventional graph encoders capture. For the value decomposition, compared with VDN or QMIX mixing on top of HAN, SHIELD-lite already improves Util and Eff over QMIX-mix by about 2.2% and 4.7%, reduces Viol by 11.7%, increases AUC by 6.0%, and shortens Iter@95% by 5.7%. Enabling full SHIELD raises Util and Eff relative to QMIX-mix by 3.4% and 6.9%, decreases Viol by 22.0%, increases AUC by 8.4%, and reduces Iter@95% by 15.3%, with similar relative improvements in the resource-tight regime. This indicates that the structure-conditioned nested mixer improves decomposability and credit assignment beyond monotonic mixing networks. In the global search module, removing search causes Util and Eff to drop by about 4.5% and 6.1%, Viol to increase by 34.4%, AUC to decrease by 11.3%, and Iter@95% to lengthen by 24.6% compared with Full-SGRL-TS. Greedy search and plain MCTS partially close this gap but still underperform structured MCTS in AUC and Iter@95%, showing that high-order structure-guided feasibility pruning and budget allocation are likewise crucial to achieving high AUC and fast, low-violation convergence.
5.4. Future Research Directions and Deployability
Evaluation is currently confined to simulation. Subsequent work will progress along a simulation–hardware-in-the-loop–field-trial path. First, we will build a reproducible benchmark that includes a library of parameterized scenarios, communication traces for synthetic generation and replay, and standardized evaluation scripts, accompanied by anonymized logs to strengthen comparability and reproducibility. Second, we will integrate the flight controller and onboard inference compute within a hardware-in-the-loop platform to quantify control-cycle stability, end-to-end latency, per-task energy consumption, and throughput, and we will replay post-disaster communication traces to assess robustness under burst losses, delay jitter, and bandwidth limitations. Third, we will conduct small-scale outdoor experiments with heterogeneous multi-UAV formations and calibrate the simulation models using field measurements of task completion rate, link quality, energy budget, and coalition-reconfiguration delay. To narrow the sim-to-real gap, we will systematically employ dynamics and payload domain randomization, bursty-channel perturbations based on the Gilbert–Elliott model, delay-aware action buffering with safety shielding, and online-estimate-driven feasibility-set tightening and parameter adaptation.
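As an illustration of the bursty-channel perturbation mentioned above, the following is a minimal sketch of a two-state Gilbert–Elliott loss process. All parameter values (`p_gb`, `p_bg`, `loss_good`, `loss_bad`) are hypothetical placeholders rather than calibrated post-disaster link statistics, and the function names are ours.

```python
import random
from typing import List


def gilbert_elliott_losses(
    n: int,
    p_gb: float = 0.05,       # P(good -> bad) per packet (assumed value)
    p_bg: float = 0.30,       # P(bad -> good) per packet (assumed value)
    loss_good: float = 0.01,  # loss probability in the good state (assumed)
    loss_bad: float = 0.60,   # loss probability in the bad state (assumed)
    seed: int = 0,
) -> List[bool]:
    """Simulate n packets; True marks a lost packet. The channel starts in
    the good state and may switch state before each packet is sent."""
    rng = random.Random(seed)
    bad = False
    losses = []
    for _ in range(n):
        if bad:
            if rng.random() < p_bg:
                bad = False
        elif rng.random() < p_gb:
            bad = True
        loss_prob = loss_bad if bad else loss_good
        losses.append(rng.random() < loss_prob)
    return losses


def stationary_loss_rate(
    p_gb: float, p_bg: float, loss_good: float, loss_bad: float
) -> float:
    """Long-run loss rate from the stationary bad-state probability
    pi_bad = p_gb / (p_gb + p_bg)."""
    pi_bad = p_gb / (p_gb + p_bg)
    return (1.0 - pi_bad) * loss_good + pi_bad * loss_bad


trace = gilbert_elliott_losses(100_000)
empirical_rate = sum(trace) / len(trace)
expected_rate = stationary_loss_rate(0.05, 0.30, 0.01, 0.60)
```

Because losses cluster while the chain stays in the bad state, a trace of this kind stresses coalition reconfiguration more than independent losses at the same average rate would, which is the motivation for using it as a perturbation.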
6. Conclusions
Starting from an overlapping-coalition paradigm that better reflects real rescue requirements, this work proposes a unified framework grounded in high-order structural modeling, centered on hierarchical value decomposition, and driven by global search with policy feedback. We jointly embed tasks and UAVs into a hypergraph, allowing for the capture of resource reuse, member complementarity, and inter-coalition competition within a common semantic space. Based on this representation, the hierarchical value decomposition stably aggregates information across the agent and coalition levels, providing reliable value signals for subsequent global structure search. The global search efficiently identifies high-potential structures within the feasible region. The resulting candidates are then fed back to the policy via demonstration replay and distillation, which forms a closed loop between learning and search. This loop consistently yields solutions with higher quality, lower energy consumption, and more stable coordination under complex feasibility constraints. In extensive post-disaster rescue simulations of varying scales, the proposed method improves task utility by up to 11.4% over the strongest baseline and surpasses the non-overlapping variant in energy efficiency by more than 228%. These advantages are more pronounced under resource-limited settings, indicating that overlapping coalitions combined with structure-aware policy learning can substantially enhance resource coordination under tight constraints while keeping coalition synchronization deviation and operational cost within a more reasonable range.
Author Contributions
Conceptualization, W.L. and Y.F.; methodology, W.L.; software, W.L.; validation, W.L. and Y.F.; writing—original draft preparation, W.L.; writing—review and editing, W.L., Y.F., F.Z., K.I.K., J.W. and P.Z. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by the National Natural Science Foundation of China under Grant 62471493 (for conceptualization and investigation), in part by the Natural Science Foundation of Shandong Province, China, under Grants ZR2023LZH017 and ZR2024MF066 (for formal analysis and validation), and in part by the Russian Science Foundation (RSF) under Grant 22-71-10095-P (for validation and results verification).
Data Availability Statement
The dataset and the raw data supporting the conclusions of this article will be made available by the authors on request.
Conflicts of Interest
The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).