Hierarchical Reinforcement Learning for Large-Scale Heterogeneous UAV Mission Planning via MCTS and Transformer

Zang, Yuan; Gao, Dengwei; Yin, Zeyang; Wei, Caisheng

doi:10.3390/drones10060414


¹ School of Automation, Central South University, Changsha 410083, China

² National Key Laboratory of Land and Air Based Information Perception and Control, Xi’an 710065, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(6), 414; https://doi.org/10.3390/drones10060414

Submission received: 1 March 2026 / Revised: 22 May 2026 / Accepted: 23 May 2026 / Published: 27 May 2026

(This article belongs to the Special Issue Intelligent Cooperative Technologies of UAV Swarm Systems)


Highlights

What are the main findings?

A Hierarchical Deep Reinforcement Learning framework partitions the LSH-TOP into distinct sub-problems. Within this architecture, MCTS determines global workload distribution, and a Transformer-based model generates individual agent trajectories.
The C-ACT protocol addresses convergence limitations in constrained environments. This protocol stabilizes the training curves and ensures steady convergence by managing the transition from relaxed to strict constraints.

What are the implications of the main findings?

The framework establishes a structural basis for coordinating heterogeneous multi-UAV fleets during emergency SAR missions. Compared to conventional heuristics, this approach yields up to 53.27% higher objective values and improves over the strongest baseline (MCTS + OR) by 0.63–8.51%, while balancing workload distribution across problem scales and delivering solutions within 24.30 s at 300 targets.
This hierarchical structure achieves sub-quadratic computational scaling, compared to the O(N^K) growth of joint formulations. Consequently, this approach supports real-time decision-making for post-disaster response operations.

Abstract

Post-disaster Search and Rescue (SAR) missions demand rapid coordination of heterogeneous Unmanned Aerial Vehicle (UAV) fleets under stringent payload and flight range limitations. Traditional heuristic solvers struggle to solve the Large-Scale Heterogeneous Team Orienteering Problem (LSH-TOP) within operational time limits due to the coupled complexity of task allocation and route planning. A Hierarchical Deep Reinforcement Learning framework decomposes this high-dimensional combinatorial problem into tractable sub-problems. An upper-level policy, guided by Monte Carlo Tree Search (MCTS), partitions the global target set to balance fleet workload distribution, whereas a lower-level Transformer-based model constructs near-optimal trajectories for individual agents. A Curriculum-Integrated Alternating Cooperative Training (C-ACT) protocol resolves the convergence difficulties associated with sparse feasible solutions in constrained environments. This protocol incorporates a dynamic constraint annealing strategy and a virtual agent buffer to progressively shape the solution space from relaxed to strictly constrained formulations. Experiments conducted on real-world geographic data demonstrate that the proposed approach consistently outperforms all baselines across scales of 80 to 300 targets, improving over the strongest competitor by 0.63–8.51% and over conventional heuristics by up to 53.27% in objective value. Results indicate a task completion rate of 27.5% at the 300-target scale (versus 25.1% for the strongest baseline, MCTS + OR) and balanced workload distribution, validating the framework's adaptability to complex emergency response scenarios.

Keywords: heterogeneous UAV mission planning; hierarchical deep reinforcement learning; Monte Carlo tree search; transformer architecture; curriculum learning; team orienteering problem

1. Introduction

The frequency and intensity of natural disasters threaten global public safety. In post-disaster scenarios, timely delivery of emergency supplies to trapped victims dictates casualty rates [1]. UAVs provide an alternative framework for emergency logistics and SAR missions due to their mobility, flexibility, and independence from ground infrastructure [2,3,4]. Collaborative multi-UAV coordination improves operational coverage for large-scale rescue operations [5].

Real-world emergency missions routinely involve heterogeneous UAV fleets featuring varying payload capacities [6], flight ranges [7], and rescue priorities [8], alongside hundreds of distributed targets. This paradigm constitutes the LSH-TOP, an NP-hard problem characterized by an exponentially growing solution space [9]. The coupled complexity of node selection and route construction impedes the generation of feasible plans under strict heterogeneous constraints [10].

Traditional approaches for multi-UAV task allocation rely on heuristic algorithms, including Genetic Algorithm, Particle Swarm Optimization, and market-based Consensus-Based Bundle Algorithm [11,12]. Hybrid strategies integrating heuristics with rolling dispatch [13] or simulated annealing [14] attempt to improve adaptability. While capable of generating feasible solutions, these methods suffer from extended computational times and scalability bottlenecks as the problem scale expands [15]. Furthermore, reliance on manually tuned hyperparameters and iterative search processes prevents real-time decision-making in disaster response scenarios.

Neural combinatorial optimization reframes routing as a Markov Decision Process (MDP), enabling second-level inference speeds. Transformer architectures achieve near-optimal solutions on team orienteering benchmarks when combined with Policy Optimization with Multiple Optima (POMO) [16,17]. However, these models struggle to satisfy complex constraints for heterogeneous fleets [18,19]. Standard architectures target homogeneous agents and rely on hard-masking mechanisms to enforce constraints [20,21].

Rigid masking in heterogeneous scenarios eliminates valid actions, triggering training failures [22]. Static penalties induce local optima by forcing agents to reject tasks for cost reduction. A direct transition from unconstrained to constrained execution impedes model convergence. These limitations collectively require a training paradigm that balances feasibility enforcement with exploration.

This paper proposes a Hierarchical Deep Reinforcement Learning framework decoupling the LSH-TOP for SAR missions [23]. MCTS guides the upper-level task allocation [24], and a Transformer-based model utilizing POMO solves the lower-level routing problems. The framework incorporates a curriculum learning strategy that progressively expands the feasible solution space from relaxed to strictly constrained formulations. The main contributions of this paper are summarized as follows:

A hierarchical bilevel optimization framework coordinates emergency delivery of heterogeneous UAV fleets. Decomposing LSH-TOP into an allocation layer and a routing layer decouples global reward maximization from local resource constraints. This formulation maximizes total collected rewards within strict flight range and capacity limits. Unlike existing hierarchical methods pairing MCTS with hand-crafted lower-level heuristics, both levels deploy learned neural policies, closing the second structural gap and preserving solution quality as the problem scale grows.
A Transformer-based routing policy implementing a virtual agent mechanism resolves coupled capacity and range constraints. Unlike hard-masking approaches terminating the decision process prematurely, a virtual agent functions as an absorptive buffer for infeasible tasks. This design prevents early episode termination caused by empty action spaces, securing the completeness of the MDP training process. This mechanism extends neural routing policies beyond homogeneous agents: existing models enforce constraints via hard masking and cannot handle coupled capacity-range limits of heterogeneous fleets without triggering training instability.
An MCTS-driven allocation strategy complements a curriculum learning-based training scheme for the task assignment policy. Constraints progressively tighten via a dynamic limit factor during training, allowing the optimizer to process relaxed scenarios before executing strict constraints. During inference, MCTS augments search capability, validating task assignments maximizing global rewards in heterogeneous scenarios. The C-ACT protocol closes the first structural gap by combining dynamic cosine annealing with the virtual agent buffer—a design absent from prior curriculum frameworks—enabling convergence where direct training under strict constraints fails.

The three contributions close both structural gaps identified in the literature review: existing frameworks lack a curriculum mechanism for heterogeneous coupled constraints, and asymmetric hierarchical solvers degrade routing quality. The framework improves over the strongest neural baseline by 0.63–8.51% and over conventional heuristics by up to 53.27% across all problem scales, with inference time scaling sub-quadratically to 300-target scenarios. These results validate the neural–neural bilevel structure with C-ACT as a scalable approach for time-critical emergency SAR operations.

2. Literature Review

Heuristic and meta-heuristic algorithms serve as traditional tools for solving UAV task allocation and path planning problems, computing feasible trajectories for combinatorial optimization challenges. Evolutionary strategies, including Genetic Algorithms and Particle Swarm Optimization, optimize routing schedules under multi-objective constraints, coordinating vehicle-mounted swarms and multi-depot logistics systems [25]. Market-based mechanisms, including the Consensus-Based Bundle Algorithm, improve scalability in distributed environments, exhibiting accelerated convergence in heterogeneous networks [26]. Combinatorial auction algorithms facilitate flexible resource combinations for multi-task scheduling. These iterative methods encounter a strict trade-off between solution quality and computational time. As the problem scale expands to hundreds of tasks, the search space grows exponentially, causing extended runtimes and stagnation in local optima. Reliance on hand-crafted rules and static parameter tuning limits framework adaptability in uncertain disaster response environments [27,28].

Neural combinatorial optimization reframes the algorithmic paradigm from iterative search to learned inference, supporting real-time decision-making. The Transformer architecture captures global dependencies via self-attention mechanisms, enabling end-to-end modeling of routing problems [29]. POMO exploits problem symmetries, enhancing training stability and solution generalization to achieve near-optimal performance on team orienteering problem benchmarks. Addressing agent heterogeneity remains an unresolved gap. Existing deep reinforcement learning models process homogeneous fleets and struggle to manage distinct physical constraints (e.g., varying payload and range) simultaneously. These approaches employ hard-masking schemes to enforce valid actions [30]. Rigid masking in highly constrained scenarios generates empty action spaces, triggering premature episode termination and training instability.

Curriculum learning strategies introduce progressive difficulty scaling to neural combinatorial optimization, generating feasible policies under strict constraints. Gradually tightening constraint thresholds during training enables agents to learn basic routing patterns before encountering complex restrictions [31,32]. Relaxing penalty coefficients initially prevents agents from settling into local optima characterized by task avoidance [33,34]. Existing curriculum learning frameworks target single-vehicle problems or homogeneous swarms [35,36]. Dynamic constraint annealing that handles the coupled range and load constraints of heterogeneous fleets remains underexplored [37,38]. Models lacking specialized transition mechanisms fail to converge when moving from unconstrained exploration to rigid emergency rescue constraints.

Hierarchical frameworks process large-scale optimization tasks by decomposing them into upper-level allocation and lower-level execution, ensuring feasibility [39,40]. MCTS integration with learning-based methods extends global exploration capability, combining look-ahead search with neural intuition [41]. Current hierarchical methods exhibit structural imbalance. Complex upper-level planning contrasts with lower-level heuristics deployed to save computation. This asymmetry degrades overall solution quality. Integrating MCTS and Transformer-based networks for heterogeneous UAV fleets remains unaddressed in existing frameworks.

The existing literature reveals two structural gaps: First, no curriculum training mechanism addresses heterogeneous agents under coupled capacity-range constraints; existing frameworks target single-vehicle or homogeneous fleets and do not extend to multi-UAV heterogeneous settings. Second, hierarchical methods pair MCTS with hand-crafted lower-level heuristics, introducing asymmetry that degrades routing quality as problem scale grows. The present framework closes both gaps. It replaces lower-level heuristics with a learned Transformer routing policy, forming a neural–neural bilevel structure that preserves solution quality at both levels. The C-ACT protocol jointly manages coupled capacity-range constraints through dynamic annealing and a virtual agent buffer, a combination not present in prior curriculum learning frameworks for combinatorial optimization.

3. Problem Formulation

This section defines the mathematical framework for the multi-UAV task scheduling and route planning problem in disaster rescue scenarios. The formulation models the mission environment using graph theory. Based on this graph representation, a Mixed-Integer Linear Programming model specifies the optimization objectives and operational constraints. A subsequent transformation of this model into a sequential MDP enables the deep reinforcement learning approach as shown in Figure 1.

3.1. Problem Definition

A complete directed graph $G = (V, E)$ represents the disaster-stricken region. The vertex set $V = \{0\} \cup N$ contains a central depot (node 0) and a target set $N = \{1, 2, \dots, N\}$. A tuple $(x_{i}, p_{i}, q_{i})$ defines each target $i \in N$, specifying the geospatial coordinates, a prize value of rescue priority, and the resource demand. The edge set $E$ defines spatial connectivity. Each arc $(i, j) \in E$ carries a Euclidean travel cost $c_{ij}$.

A heterogeneous fleet of $K$ UAVs, denoted by $K = \{1, 2, \dots, K\}$, departs from the depot. Fleet heterogeneity imposes distinct constraints on each UAV $k \in K$: a maximum payload capacity $Q_{k}$ and a maximum flight range $L_{k}$. Resource limits restrict the fleet from achieving full coverage of the target set $N$. Therefore, the optimization objective maximizes the collected prize of visited targets through collision-free, depot-returning routes. The mission concurrently necessitates equitable workload distribution across the UAV fleet to maintain operational sustainability. Table 1 enumerates all notations for the task scheduling problem.

3.2. Task Formulations

The optimization framework applies two sets of binary decision variables to model the mathematical scheduling logic. The assignment variable $y_{ik} \in \{0, 1\}$ equals 1 if task $i$ is allocated to UAV $k$, and 0 otherwise. The routing variable $x_{ijk} \in \{0, 1\}$ equals 1 if UAV $k$ traverses the arc $(i, j)$. The global objective function $J$ maximizes the collective mission utility and penalizes excessive travel costs alongside workload imbalance:

$$\text{Maximize } J = \underbrace{\sum_{k \in K} \sum_{i \in N} p_{i} y_{ik}}_{\text{Total Prize}} - \alpha \underbrace{\sum_{k \in K} \sum_{(i, j) \in E} c_{ij} x_{ijk}}_{\text{Distance Penalty}} - \underbrace{\beta \cdot \Psi(u)}_{\text{Balance Penalty}} \tag{1}$$

where $p_{i}$ indicates the prize of task $i$, and $c_{ij}$ defines the Euclidean distance cost between node $i$ and node $j$. The terms $\alpha$ and $\beta$ denote non-negative weighting coefficients. The function $\Psi(u)$ quantifies workload equity. This formulation computes the variance of capacity usage rates across the heterogeneous fleet:

$$U_{k} = \frac{\sum_{i \in N} q_{i} y_{ik}}{Q_{k}}, \qquad \Psi(u) = \frac{1}{K} \sum_{k = 1}^{K} (U_{k} - \bar{U})^{2} \tag{2}$$

where $q_{i}$ denotes the demand of task $i$, $Q_{k}$ represents the payload capacity of UAV $k$, and $\bar{U}$ equals the mean usage rate of the fleet.
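As a small numerical illustration of Equation (2), the following sketch computes the usage rates $U_{k}$ and the balance penalty $\Psi(u)$ for a toy fleet. The function names and the three-UAV instance are hypothetical and chosen only for illustration; they do not come from the original experiments.

```python
def usage_rates(assigned_demand, capacities):
    """U_k: total demand assigned to UAV k divided by its capacity Q_k."""
    return [sum(d) / q for d, q in zip(assigned_demand, capacities)]


def balance_penalty(rates):
    """Psi(u): variance of the usage rates with the 1/K normalisation of Eq. (2)."""
    mean = sum(rates) / len(rates)
    return sum((u - mean) ** 2 for u in rates) / len(rates)


# Hypothetical fleet: three UAVs with different payload capacities.
capacities = [10.0, 20.0, 40.0]
# Demands q_i of the tasks assigned to each UAV (illustrative values).
assigned_demand = [[4.0, 4.0], [6.0, 6.0], [8.0, 8.0]]

rates = usage_rates(assigned_demand, capacities)
psi = balance_penalty(rates)
```

Because the rates are normalized by each UAV's own capacity, a large UAV carrying more absolute payload than a small one can still count as equally loaded.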

A set of operational constraints restricts the feasible solution space. Equation (3) enforces assignment completeness. This constraint allocates every target $i$ to exactly one agent within the extended set $K^{+}$. The set $K^{+}$ encompasses physical UAVs for completed tasks and a virtual agent for unassigned tasks:

$$\sum_{k \in K^{+}} y_{ik} = 1, \quad \forall i \in N \tag{3}$$

Equation (4) maintains the structural validity of the flight routes by enforcing topology and flow conservation. This formulation extracts a closed-loop trajectory for each UAV from the depot and preserves path continuity at all intermediate nodes.

$$\sum_{j \in V} x_{0jk} = \sum_{j \in V} x_{j0k} = 1, \quad \forall k \in K; \qquad \sum_{m \in V} x_{imk} - \sum_{m \in V} x_{mik} = 0, \quad \forall i \in N, \ \forall k \in K \tag{4}$$

Equation (5) restricts each target node to at most one physical UAV visit in the routing layer, ensuring that the allocation variable $y_{ik} = 1$ translates into exactly one arc traversal:

$$\sum_{k \in K} \sum_{j \in V} x_{ijk} \leq 1, \quad \forall i \in N \tag{5}$$

Equations (6) and (7) address hardware heterogeneity. Equation (6) restricts the accumulated demand on UAV $k$ to its payload capacity $Q_{k}$. Equation (7) bounds the total travel distance within the corresponding endurance range $L_{k}$:

$$\sum_{i \in N} q_{i} \Big( \sum_{j \in V} x_{ijk} \Big) \leq Q_{k}, \quad \forall k \in K \tag{6}$$

$$\sum_{i \in V} \sum_{j \in V} c_{ij} x_{ijk} \leq L_{k}, \quad \forall k \in K \tag{7}$$
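The payload and range limits of Equations (6) and (7) can be checked for a single route with a few lines of code. The sketch below assumes a route is given as a node sequence that starts and ends at the depot (node 0); the helper names `route_length` and `is_feasible` and the toy coordinates are illustrative assumptions, not part of the original formulation.

```python
import math


def route_length(route, coords):
    """Total Euclidean length of a route given as a sequence of node indices."""
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(route, route[1:]))


def is_feasible(route, coords, demand, capacity, max_range):
    """Check the payload limit (Eq. 6) and the range limit (Eq. 7) for one UAV."""
    load = sum(demand[i] for i in route if i != 0)  # the depot has no demand
    return load <= capacity and route_length(route, coords) <= max_range


# Toy instance: depot at the origin and two targets (illustrative values).
coords = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (3.0, 4.0)}
demand = {1: 2.0, 2: 5.0}
route = [0, 1, 2, 0]  # legs of length 3, 4 and 5; total load 7

ok = is_feasible(route, coords, demand, capacity=8.0, max_range=12.0)
```

The same route becomes infeasible for a UAV with a smaller payload or a shorter range, which is how fleet heterogeneity enters the constraint set.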

Finally, Equation (8) maintains assignment-routing consistency. A target assignment to a physical UAV requires the corresponding routing variable to indicate a visit. This constraint applies exclusively to physical UAVs ($k \in K$). The virtual agent ($k = K + 1$) executes no physical routing operations but operates solely to absorb infeasible task demands:

$$y_{ik} = \sum_{j \in V} x_{ijk}, \quad \forall i \in N, \ \forall k \in K \tag{8}$$

3.3. MDP Formulations

The tuple $P = (S, A, T, R, \gamma)$ specifies this framework. The agent functions as a centralized scheduler. It constructs the solution by appending targets sequentially until it exhausts available resources or violates constraints.

State Space ($S$): The state $s_{t} \in S$ integrates the global decision context at step $t$. The definition $s_{t} = (G, h_{t}, M_{t})$ unifies three data modalities. The static graph $G$ encodes the geospatial coordinates, target prizes, and demand requirements. The dynamic vector $h_{t}$ monitors the active UAV status. It stores the current spatial coordinates, remaining payload, and residual flight endurance. The binary mask $M_{t}$ evaluates the task feasibility of unvisited nodes. It clips the action space to enforce valid state transitions.

Action Space ($A$): The action $a_{t} \in A$ selects the subsequent target destination based on the state $s_{t}$. The dynamic mask $M_{t}$ constrains the action space to ensure route feasibility:

$$a_{t} \in \{ j \in N \mid M_{t}[j] = 1 \} \cup \{0\} \tag{9}$$

The selection of a node $j \in N$ allocates the corresponding task $j$ to the active UAV. The specialized zero token (action 0) activates the termination logic. It directs a physical UAV to the depot for tour completion, or it assigns unserviceable targets to the virtual agent.

State Transition ($T$): The execution of action $a_{t}$ triggers the system transition to the subsequent state $s_{t+1}$. The selection of a task node $j$ shifts the active UAV coordinates to node $j$ and depletes the associated resources in $h_{t}$. Specifically, the task demand $q_{j}$ deducts from the residual payload capacity. The travel cost $c_{\text{current}, j}$ reduces the remaining flight range. The transition enforces sequence uniqueness by zeroing the index $j$ in the updated mask $M_{t+1}$. Conversely, the selection of the zero token (action 0) finalizes the current trajectory and initializes the position context for the subsequent agent.
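The transition rule described above can be sketched for a single active UAV as follows. This is a simplified illustration: the `UAVState` container and the `step` function are hypothetical names, the mask only tracks visited nodes (the full mask would also re-evaluate payload and range feasibility), and the hand-over to the next agent after action 0 is omitted.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class UAVState:
    position: int        # current node index (0 is the depot)
    payload_left: float  # residual payload capacity
    range_left: float    # residual flight range


def step(state, mask, action, coords, demand):
    """Return (next_state, next_mask, done) after applying `action`."""
    travel = math.dist(coords[state.position], coords[action])
    if action == 0:
        # Zero token: fly back to the depot and close the current tour.
        closed = UAVState(0, state.payload_left, state.range_left - travel)
        return closed, dict(mask), True
    if mask.get(action) != 1:
        raise ValueError(f"node {action} is masked out")
    next_mask = dict(mask)
    next_mask[action] = 0  # each target is visited at most once
    moved = UAVState(action, state.payload_left - demand[action],
                     state.range_left - travel)
    return moved, next_mask, False


# Toy episode: visit target 1, then return to the depot (illustrative values).
coords = {0: (0.0, 0.0), 1: (3.0, 4.0)}
demand = {1: 2.0}
s0 = UAVState(position=0, payload_left=5.0, range_left=20.0)
s1, m1, done1 = step(s0, {1: 1}, 1, coords, demand)
s2, m2, done2 = step(s1, m1, 0, coords, demand)
```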

Reward Function ($R$): A designated reward signal guides the policy optimization. This signal parallels the global objective structured in Equation (1). The environment formulates a sparse reward calculation at the terminal step $T$ of the episode:

$$R(s_{T}) = \sum_{k \in K} \sum_{i \in N} p_{i} y_{ik} - \alpha \cdot \sum_{k \in K} \sum_{(i, j) \in E} c_{ij} x_{ijk} - \beta \cdot \Psi(u) \tag{10}$$

This reward formulation balances the concurrent optimization goals: maximizing total prizes, minimizing trajectory distances, and establishing workload uniformity across the heterogeneous fleet. The episodic decision-making operates over a finite horizon. Thus, the formulation establishes the discount factor $\gamma = 1$ to equally evaluate the terminal mission effectiveness.
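A minimal sketch of the terminal reward in Equation (10), evaluated from a set of completed routes, is given below. The function name `terminal_reward`, the two-UAV instance, and the coefficient values for $\alpha$ and $\beta$ are assumptions made for illustration only; the source does not state the coefficients in this section.

```python
import math


def terminal_reward(routes, coords, prize, demand, capacities, alpha, beta):
    """Sparse episode reward: total prize - alpha * distance - beta * Psi(u)."""
    total_prize = sum(prize[i] for r in routes for i in r if i != 0)
    total_dist = sum(math.dist(coords[a], coords[b])
                     for r in routes for a, b in zip(r, r[1:]))
    # Capacity usage rate U_k of each UAV and its variance Psi(u), as in Eq. (2).
    rates = [sum(demand[i] for i in r if i != 0) / q
             for r, q in zip(routes, capacities)]
    mean = sum(rates) / len(rates)
    psi = sum((u - mean) ** 2 for u in rates) / len(rates)
    return total_prize - alpha * total_dist - beta * psi


# Toy instance: one target per UAV, both UAVs loaded to 50% (illustrative values).
coords = {0: (0.0, 0.0), 1: (3.0, 0.0), 2: (0.0, 4.0)}
prize = {1: 10.0, 2: 6.0}
demand = {1: 2.0, 2: 4.0}
capacities = [4.0, 8.0]
routes = [[0, 1, 0], [0, 2, 0]]

reward = terminal_reward(routes, coords, prize, demand, capacities,
                         alpha=0.1, beta=1.0)
```

In this instance the prize term is 16, the distance term is 14, and the balance penalty vanishes because both usage rates equal 0.5, so the reward is 16 − 0.1 · 14 = 14.6.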

4. Methodology

The solution of the LSH-TOP requires the decoupling of the combined assignment and routing decision space. A Hierarchical Deep Reinforcement Learning framework decomposes this coupled problem into a bi-level cooperative Markov process. The upper-level allocation network distributes the topological targets across the physical UAVs and the virtual agent. This partition logic systematically manages the heterogeneous payload and endurance constraints. Following the target allocation, a lower-level Transformer policy processes each generated sub-problem. This level executes a POMO routing algorithm to determine specific trajectory sequences. The C-ACT protocol optimizes both parametric policies. This protocol maximizes the episodic mission reward by synchronizing the cooperative output of the full heterogeneous fleet. Figure 2 illustrates the structural hierarchy of the proposed framework.

4.1. Inputs Preprocessing and Feature Embedding

A preprocessing pipeline converts heterogeneous node attributes into uniform embeddings for the encoder. The problem instance defines a graph $G = (V, E)$ with node geospatial coordinates, priority prizes, and structural demands. A projection layer maps these attributes into a shared latent space for subsequent attention computation.

Feature Normalization and Fusion: The raw dataset records target locations in longitude and latitude. A transverse Mercator projection converts these spherical coordinates into a local two-dimensional Cartesian grid. Min-Max scaling then normalizes the planar values into $[0, 1]^{2}$, yielding normalized location vectors $x_{i} \in \mathbb{R}^{2}$ for all nodes $i \in V$.
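The Min-Max step can be illustrated with a short sketch that operates on coordinates already projected onto the plane. Two points are assumptions here: the projection itself is not reproduced, and each axis is scaled independently, which the text does not specify (a shared scale for both axes would preserve distance ratios instead). The function name `min_max_scale` is hypothetical.

```python
def min_max_scale(points):
    """Scale planar (x, y) points into [0, 1]^2, one axis at a time."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def scale(value, low, high):
        # A degenerate axis (all values equal) maps to 0.0 to avoid dividing by zero.
        return (value - low) / (high - low) if high > low else 0.0

    return [(scale(x, min(xs), max(xs)), scale(y, min(ys), max(ys)))
            for x, y in points]


# Illustrative planar coordinates in metres (not taken from the dataset).
planar = [(100.0, 200.0), (300.0, 600.0), (200.0, 400.0)]
normalized = min_max_scale(planar)
```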

The task characteristics extend the location data. A feature vector $r_{i}$ for each target node $i \in N$ concatenates the normalized coordinates $x_{i}$, the scalar prize $p_{i}$, and the demand $q_{i}$:

$$r_{i} = \mathrm{Concat}(x_{i}, p_{i}, q_{i}) \in \mathbb{R}^{4} \tag{11}$$

This formulation encodes the relationship between spatial topology and task priority. Figure 3 diagrams the specific procedural conversion pipeline.

Linear Projection to Latent Space: Coordinates, prizes, and demands occupy independent scales. A linear layer projects the composite vector $r_{i}$ into a unified latent space. Equation (12) defines the transformation into the node embedding $h_{i}^{(0)} \in \mathbb{R}^{d_{\text{model}}}$:

$$h_{i}^{(0)} = W_{n} r_{i} + b_{n} \tag{12}$$

where the matrix $W_{n} \in \mathbb{R}^{d_{\text{model}} \times 4}$ and the vector $b_{n} \in \mathbb{R}^{d_{\text{model}}}$ are trainable parameters.

The depot (node 0) lacks both prize and demand. A separate linear layer computes the depot embedding $h_{0}^{(0)}$ from the planar coordinates $x_{0}$:

$$h_{0}^{(0)} = W_{d} x_{0} + b_{d} \tag{13}$$

Prepending the depot embedding to the task embeddings yields the input tensor $H^{(0)} = [h_{0}^{(0)}, h_{1}^{(0)}, \dots, h_{N}^{(0)}]$ for the encoder.

4.2. Heterogeneous Attention Policy for Task Allocation

The upper-level allocation network coordinates target service demands against limited hardware capacities. This architecture integrates a graph encoder for feature extraction and a fleet-state-aware decoder for sequential assignment.

4.2.1. Graph Encoding with Heterogeneous Features

Multi-head self-attention propagates information across all node pairs to produce context-aware embeddings. The encoder updates the initialization matrix $H^{(0)}$ into representations $H^{(N_{L})}$ through $N_{L}$ sequential attention layers.

Each encoder layer $l \in \{1, \dots, N_{L}\}$ projects the intermediate tensor $H^{(l-1)} \in \mathbb{R}^{(N+1) \times d_{\text{model}}}$ into separate subspaces. The $m$-th attention head ($m \in \{1, \dots, M\}$) computes query ($Q_{m}$), key ($K_{m}$), and value ($V_{m}$) matrices:

$$Q_{m} = H^{(l-1)} W_{m}^{Q}, \quad K_{m} = H^{(l-1)} W_{m}^{K}, \quad V_{m} = H^{(l-1)} W_{m}^{V} \tag{14}$$

where the weight matrices $W_{m}^{Q}, W_{m}^{K} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{attn}}}$ and $W_{m}^{V} \in \mathbb{R}^{d_{\text{model}} \times d_{v}}$ define per-head subspaces.

A scaled dot-product interaction defines the message passing computation. The matrix multiplication between queries and keys computes pairwise compatibility scores. A division by $\sqrt{d_{\text{attn}}}$ prevents softmax saturation when $d_{\text{attn}}$ is large, preserving gradient flow through the attention weights. Equation (15) concatenates the activated outputs and passes the combined sequence through the terminal weight matrix $W^{O}$ to produce the fused representation:

$$\mathrm{MHA}(H^{(l-1)}) = \mathrm{Concat}\left\{ \left( \mathrm{softmax}\left( \frac{Q_{m} K_{m}^{T}}{\sqrt{d_{\text{attn}}}} \right) V_{m} \right)_{m=1}^{M} \right\} W^{O} \tag{15}$$

where the matrix $W^{O} \in \mathbb{R}^{M d_{v} \times d_{\text{model}}}$ merges head outputs into the model dimension. The function $\mathrm{Concat}(\cdot)$ concatenates the independent matrices across the parallel feature dimensions.
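The per-head computation inside Equation (15) can be illustrated with a dependency-free sketch of a single scaled dot-product attention head. The inputs are assumed to be already projected queries, keys, and values given as lists of rows; the concatenation over heads and the output projection $W^{O}$ are omitted, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(queries, keys, values):
    """softmax(Q K^T / sqrt(d_attn)) V for one head, with matrices as lists of rows."""
    d_attn = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_attn) for k in keys]
        weights = softmax(scores)
        # Each output row is a convex combination of the value rows.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs


# A zero query scores every key equally, so the output is the mean of the values.
out = attention_head([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0, 4.0], [4.0, 8.0]])
```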

A point-wise feed-forward network adds non-linear transformation to the representation pipeline. This sub-layer processes individual node vectors identically. It sequences a dual linear projection separated by a ReLU mapping:

$$\mathrm{FFN}(x) = \mathrm{ReLU}(x W_{1} + b_{1}) W_{2} + b_{2} \tag{16}$$

where the weight matrices $W_{1} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}$ and $W_{2} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}$ expand features to $d_{\text{ff}}$ then project back to $d_{\text{model}}$.

Sequential residual additions and layer normalization operations prevent representation collapse within the deep graph architecture. The structured transition cycle inside a unified encoder layer $l$ proceeds as follows:

$$\bar{H}^{(l)} = \mathrm{LN}\left( H^{(l-1)} + \mathrm{MHA}(H^{(l-1)}) \right) \tag{17}$$

$$H^{(l)} = \mathrm{LN}\left( \bar{H}^{(l)} + \mathrm{FFN}(\bar{H}^{(l)}) \right) \tag{18}$$

where the iterated execution across $N_{L}$ identical layers yields the final encoder output $H^{(N_{L})}$. This tensor encodes inter-node dependencies and initializes the subsequent decision generation.

4.2.2. Fleet-State-Aware Decoding Mechanism

Serving as a sequential decision process, the decoding phase iteratively assigns each target node $i \in N$ to a specific UAV $k \in K^{+}$. Unlike standard routing decoders, which select topological nodes, this module operates explicitly in the Task-to-UAV assignment space. At decoding step $t$, the policy constructs a context embedding centered on target node $i_{t}$. While the graph encoder captures static topological attributes, sequential assignment continuously depletes vehicle resources. Therefore, effective fleet reasoning necessitates a dynamic state representation $e_{k}^{(t)}$ for each physical agent $k \in K$. This vector fuses the UAV’s current spatial context, inherent hardware capabilities, and real-time resource residuals at step $t$:

$$e_{k}^{(t)} = W_{u} \cdot \mathrm{Concat}\left( h_{\text{prev}, k}^{(N_{L})}, Q_{k}, L_{k}, \Delta Q_{k}^{(t)}, \Delta L_{k}^{(t)} \right) + b_{u} \tag{19}$$

where the term $h_{\text{prev}, k}^{(N_{L})}$ records the encoded graph dependency for the instantaneous destination of vehicle $k$. The parameters $\Delta Q_{k}^{(t)}$ and $\Delta L_{k}^{(t)}$ quantify the residual payload capacity and the residual flight range at step $t$. The weight matrix $W_{u}$ and the bias vector $b_{u}$ fuse the concatenated features into a single state vector. The virtual entity ($k = K + 1$) is exempt from physical routing constraints. It utilizes a learned embedding token $e_{K+1}$ representing task rejection.

The decoder computes assignment probabilities via scaled dot-product attention between the target query and UAV state keys. The target representation $h_{i_{t}}^{(N_{L})}$ serves as the Query. The dynamic state ensemble $E^{(t)} = \{ e_{k}^{(t)} \}_{k \in K^{+}}$ provides the Keys. Equation (20) computes the raw attention score $u_{tk}$:

$$u_{tk} = C \cdot \tanh\left( \frac{ \left( h_{i_{t}}^{(N_{L})} W_{Q}^{\text{up}} \right) \left( e_{k}^{(t)} W_{K}^{\text{up}} \right)^{T} }{ \sqrt{d_{\text{attn}}} } \right), \quad \forall k \in K^{+} \tag{20}$$

where the scalar variable $u_{tk}$ quantifies the compatibility between target $i_{t}$ and UAV $k$. The multiplier constant $C$ bounds logit magnitudes to stabilize the softmax output.
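The bounding effect of the constant $C$ in Equation (20) can be seen in a small sketch. The query and key vectors are assumed to be already projected by $W_{Q}^{\text{up}}$ and $W_{K}^{\text{up}}$, and the value $C = 10$ is an assumption used here only for illustration; the text above does not state the value chosen by the authors.

```python
import math


def clipped_score(query, key, clip=10.0):
    """C * tanh(q . k / sqrt(d_attn)); `clip` plays the role of the constant C."""
    d_attn = len(query)
    dot = sum(a * b for a, b in zip(query, key))
    return clip * math.tanh(dot / math.sqrt(d_attn))


# However large the raw dot product is, the score stays inside [-C, C].
large = clipped_score([100.0, 100.0], [100.0, 100.0])
small = clipped_score([0.1, 0.0], [1.0, 0.0])
```

Keeping the logits in a fixed interval prevents a single UAV from receiving a probability that is numerically indistinguishable from 1, which would otherwise suppress exploration during training.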

4.2.3. Heterogeneous Masking with Virtual Agent

A hard-constraint masking mechanism enforces the validity of generated solutions throughout decoding. This verification module restricts the action space based on fleet resource states. It concurrently redirects infeasible tasks to the virtual agent.

The binary mask $m_{tk}$ indicates the feasibility of assigning target $i_{t}$ to UAV $k$. Physical UAVs ($k \in K$) require strict compliance with cumulative limits. Equation (21) determines the mask value $m_{tk}$:

$$m_{tk} = I\left( \Delta Q_{k}^{(t)} \geq q_{i_{t}} \right) \cdot I\left( \Delta L_{k}^{(t)} \geq \tilde{c}(i_{t} \mid P_{k}^{(t)}) \right) \tag{21}$$

where the indicator $I(\cdot)$ enforces threshold bounds. The cost estimate $\tilde{c}(i_{t} \mid P_{k}^{(t)}) = c_{\text{cur}_{k}, i_{t}} + c_{i_{t}, 0}$ computes the marginal flight cost of appending target $i_{t}$ to the partial sequence $P_{k}^{(t)}$ and returning to the depot.
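A minimal sketch of the mask in Equation (21) for the physical UAVs is shown below. It assumes Euclidean travel costs computed from planar coordinates, as in the problem definition; the function name `assignment_mask` and the three-UAV instance are hypothetical.

```python
import math


def assignment_mask(target, positions, payload_left, range_left, coords, demand):
    """m_tk for each physical UAV k: 1 if the target fits both payload and range."""
    mask = []
    for pos, dq, dl in zip(positions, payload_left, range_left):
        # Marginal cost: fly from the UAV's current node to the target, then to the depot.
        cost = (math.dist(coords[pos], coords[target])
                + math.dist(coords[target], coords[0]))
        mask.append(int(dq >= demand[target] and dl >= cost))
    return mask


# Toy instance: all three UAVs wait at the depot; target 1 is 5 units away.
coords = {0: (0.0, 0.0), 1: (3.0, 4.0)}
demand = {1: 2.0}
mask = assignment_mask(target=1,
                       positions=[0, 0, 0],
                       payload_left=[1.0, 5.0, 5.0],
                       range_left=[20.0, 9.0, 10.0],
                       coords=coords, demand=demand)
```

In this instance the first UAV lacks payload, the second lacks range for the 10-unit round trip, and only the third can accept the target.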

Flight endurance limits systematically compromise complete topological coverage for physical UAVs. To resolve strict infeasibility, the system keeps the virtual agent unmasked unconditionally:

$$m_{t, K+1} = 1 \tag{22}$$

This constant assignment guarantees a non-empty decision space. Integration of the feasibility mask into the raw attention scores reassigns invalid scores to $-\infty$. Equation (23) defines the masked attention score $\tilde{u}_{tk}$:

$$\tilde{u}_{tk} = \begin{cases} u_{tk}, & \text{if } m_{tk} = 1 \\ -\infty, & \text{otherwise} \end{cases} \tag{23}$$

Applying softmax to these masked scores yields the assignment probability $p(y_{i_{t}} = k \mid s_{t})$:

$$p(y_{i_{t}} = k \mid s_{t}) = \frac{ e^{\tilde{u}_{tk}} }{ \sum_{k' \in K^{+}} e^{\tilde{u}_{tk'}} } \tag{24}$$

where the denominator normalizes the distribution over feasible candidates defined by $m_{t}$.
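Equations (22) to (24) can be combined into a short sketch that shows why the virtual agent keeps the distribution well defined. The function name `assignment_probs` and the score values are illustrative assumptions; the last entry of `scores` is taken to belong to the virtual agent.

```python
import math


def assignment_probs(scores, physical_mask):
    """Masked softmax over K physical UAVs plus one virtual agent (last entry)."""
    full_mask = list(physical_mask) + [1]  # Eq. (22): the virtual agent is always allowed
    masked = [s if m == 1 else -math.inf for s, m in zip(scores, full_mask)]  # Eq. (23)
    top = max(masked)  # finite, because the virtual agent is never masked
    exps = [math.exp(v - top) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]  # Eq. (24)


# Two physical UAVs (the second is infeasible) and the virtual agent.
probs = assignment_probs([1.0, 1.0, 1.0], [1, 0])
# Every physical UAV is infeasible: the whole mass falls on the virtual agent.
fallback = assignment_probs([2.0, 3.0, 0.0], [0, 0])
```

Without the always-feasible last entry, the second call would have an empty action set and the softmax denominator would be zero.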

Sampling from this distribution determines the realized assignment variable $a_{t}$. This selection sets $y_{i_{t}, a_{t}} = 1$. Consequently, the framework modifies the state tensors $\Delta Q_{a_{t}}^{(t+1)}$ and $\Delta L_{a_{t}}^{(t+1)}$ prior to proceeding to the subsequent decision step.

4.3. Glimpse-Enhanced Route Planning via POMO

The upper hierarchy task allocation reduces the problem into $K$ independent orienteering problems. Each assigned subset $N_{k}$ is then routed independently by UAV $k$. The lower-level routing policy $\pi_{\phi}$ constructs a route over $N_{k}$, tracking visited nodes at each step. At decoding step $t$, the decoder constructs the context vector $h_{c}^{(t)}$. Equation (25) combines the global graph signature $h_{\text{global}}$, the embedding of the previously visited node $h_{j_{t-1}}^{(N_{L})}$, and the remaining dynamic tolerances:

$$h_{c}^{(t)} = \left[ h_{\text{global}}; \ h_{j_{t-1}}^{(N_{L})}; \ \Delta Q_{k}^{(t)}; \ \Delta L_{k}^{(t)} \right] W_{c} \tag{25}$$

where the global embedding

h_{g l o b a l} = \frac{1}{N + 1} \sum_{i = 0}^{N} h_{i}^{(N_{L})}

averages the final-layer node embeddings into a single graph representation.

W_{c}

projects the concatenated features into the model dimension.

An integrated Multi-Head Glimpse architecture refines the context vector

h_{c}^{(t)}

into an actionable query. Unlike standard attention that uses a fixed query, this localized computation attends to the current agent state and candidate node embeddings. The query is

h_{c}^{(t)}

, while both keys and values are

H^{(N_{L})}

. Equation (26) computes the glimpse vector

g_{t}

:

g_{t} = MHA (Q_{g} = h_{c}^{(t)}, K_{g} = H^{(N_{L})}, V_{g} = H^{(N_{L})})

(26)

This operation captures dependencies between the current agent state and unvisited node embeddings. Incorporating resource residuals biases the glimpse toward feasible candidates.

The inference architecture uses

g_{t}

as a query to compute a score for each candidate node j via its embedding

h_{j}^{(N_{L})}

. This assessment extracts the attention score

v_{t j}

for target j. A local feasibility mask

m_{l o c}

enforces physical feasibility on route construction:

v_{t j} = \begin{cases} C \cdot \tanh \left( \frac{(g_{t} W_{Q}^{l o}) {(h_{j}^{(N_{L})} W_{K}^{l o})}^{T}}{\sqrt{d_{a t t n}}} \right), & \text{if } m_{l o c} (j, s_{t}) = 1 \\ - \infty, & \text{otherwise} \end{cases}

(27)

The Boolean expression

m_{l o c} (j, s_{t}) = 1

holds if and only if both physical constraints are satisfied: the remaining capacity

Δ Q_{k}^{(t)}

covers the demand

q_{j}

, and sufficient range

Δ L_{k}^{(t)}

is maintained for a safe return to the depot. A standardized softmax operation normalizes

v_{t j}

to parameterize the probabilistic mapping

π_{ϕ}

.
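A simplified sketch of the clipped scoring in Equation (27) follows. It makes two assumptions that are not taken from the paper: the learned projections are replaced by the identity, and the clipping constant is set to 10 purely for illustration. The caller is also assumed to guarantee at least one feasible candidate.

```python
import math

def clipped_scores(glimpse, node_embs, feasible, clip=10.0):
    """C * tanh(g . h_j / sqrt(d)) for feasible nodes, -inf otherwise."""
    d = len(glimpse)
    scores = []
    for emb, ok in zip(node_embs, feasible):
        if not ok:
            scores.append(-math.inf)  # masked out by the local feasibility mask
            continue
        dot = sum(g * h for g, h in zip(glimpse, emb))
        scores.append(clip * math.tanh(dot / math.sqrt(d)))
    return scores

def softmax(scores):
    """Normalise scores; requires at least one finite entry."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical glimpse vector and three candidate embeddings (d = 4).
glimpse = [1.0, 0.0, -1.0, 2.0]
node_embs = [
    [0.5, 0.1, 0.0, 1.0],
    [100.0, 0.0, 0.0, 100.0],  # very large logit, bounded by the tanh clip
    [1.0, 1.0, 1.0, 1.0],
]
feasible = [True, True, False]

scores = clipped_scores(glimpse, node_embs, feasible)
probs = softmax(scores)
```

The tanh clip bounds every finite score to the interval [-C, C], which keeps one extreme logit from saturating the softmax.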

Sequential generation logic amplifies initialization bias towards local optima. To mitigate sensitivity to initial node selection, the framework applies POMO, exploiting the geometric invariance of Euclidean routing. The transformation operator set

F = {τ_{1}, \dots, τ_{N_{a u g}}}

contains distinct planar reflections and rotations. Equation (28) transforms the base coordinate set

X_{k}

of the assigned subset

N_{k}

into augmented instances:

Σ_{k} = {τ_{m} (X_{k}) ∣ τ_{m} \in F}_{m = 1}^{N_{a u g}}

(28)

Decoding in parallel over these instances yields a candidate set

Π_{k}

. Equation (29) filters for the optimal route

σ_{k}^{*}

with the highest total prize:

Π_{k} = {π_{ϕ} (X_{k}^{'}) ∣ X_{k}^{'} \in Σ_{k}}, σ_{k}^{*} = \underset{σ \in Π_{k}}{argmax} (\sum_{i \in σ} p_{i})

(29)

where the resulting reward

R (σ_{k}^{*})

serves as the reward signal for optimizing the hierarchical policy via gradient updates.
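The augment-then-select logic of Equations (28) and (29) can be sketched as below. The eight symmetries of the unit square are an assumption about the operator set, and `toy_policy` is a hypothetical stand-in for the learned router: it sweeps targets by x-coordinate, so, like a neural policy, it is not invariant to the transformations.

```python
import math

def eight_fold_views(points):
    """Eight distance-preserving copies of points in the unit square."""
    transforms = [
        lambda x, y: (x, y),
        lambda x, y: (y, x),
        lambda x, y: (x, 1 - y),
        lambda x, y: (y, 1 - x),
        lambda x, y: (1 - x, y),
        lambda x, y: (1 - y, x),
        lambda x, y: (1 - x, 1 - y),
        lambda x, y: (1 - y, 1 - x),
    ]
    return [[f(x, y) for x, y in points] for f in transforms]

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def tour_length(points, route):
    """Length of depot -> route -> depot, with the depot stored at index 0."""
    stops = [0] + list(route) + [0]
    return sum(dist(points[a], points[b]) for a, b in zip(stops, stops[1:]))

def toy_policy(points, budget):
    """Hypothetical router: visit targets in x-order while a return trip stays affordable."""
    depot, route, used, cur = points[0], [], 0.0, 0
    for j in sorted(range(1, len(points)), key=lambda i: points[i][0]):
        step = dist(points[cur], points[j])
        if used + step + dist(points[j], depot) <= budget:
            route.append(j)
            used += step
            cur = j
    return route

def best_of_views(points, prizes, budget):
    """Eq. (29): decode every view and keep the route with the highest total prize."""
    candidates = [toy_policy(view, budget) for view in eight_fold_views(points)]
    return max(candidates, key=lambda r: sum(prizes[j] for j in r))

# Hypothetical instance: index 0 is the depot.
points = [(0.5, 0.5), (0.1, 0.9), (0.9, 0.8), (0.2, 0.2), (0.8, 0.1), (0.6, 0.6)]
prizes = [0.0, 1.0, 1.2, 0.7, 1.4, 0.5]
best = best_of_views(points, prizes, budget=2.0)
identity = toy_policy(points, 2.0)
```

Because every transformation preserves pairwise distances, a route that is feasible in an augmented view remains feasible in the original coordinates.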

4.4. MCTS-Guided Allocation Refinement

Single-step inference using the policy network often yields suboptimal allocations due to the greedy nature of sequential decoding. Addressing this limitation, the framework integrates MCTS to refine the upper-level task allocation. This look-ahead search mechanism balances exploration of the assignment space with exploitation of the learned policy priors. The search process operates on a decision tree where each node

s_{t}

represents a partial allocation configuration, and edges correspond to candidate assignments.

Traversal of the decision tree initiates the search iteration, starting from the root and descending to a leaf node. In this phase, the algorithm selects the optimal child node by maximizing the upper confidence bound. This selection criterion balances the accumulated value estimation and the exploration potential:

a_{t}^{*} = \underset{a \in A (s_{t})}{argmax} (\hat{Q} (s_{t}, a) + c_{p u c t} \cdot P (a | s_{t}) \frac{\sqrt{n (s_{t})}}{1 + n (s_{t}, a)})

(30)

where

\hat{Q} (s_{t}, a)

denotes the mean action value,

P (a | s_{t})

represents the prior probability derived from the upper-level policy

π_{θ}

, and

n (s_{t}, a)

tracks the action visit count regulating the exploration rate. The total visit count

n (s_{t}) = \sum_{a^{'}} n (s_{t}, a^{'})

aggregates all actions from state

s_{t}

.
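As a minimal sketch of the selection rule in Equation (30), the snippet below scores each candidate action from dictionaries of action values, priors, and visit counts. The dictionary layout and the example numbers are hypothetical.

```python
import math

def puct_select(q, prior, visits, c_puct=1.0):
    """Pick the action maximising Q(s, a) + c_puct * P(a|s) * sqrt(n(s)) / (1 + n(s, a))."""
    total = sum(visits.values())  # n(s): visits summed over all actions

    def score(action):
        explore = c_puct * prior[action] * math.sqrt(total) / (1 + visits[action])
        return q[action] + explore

    return max(prior, key=score)

# Hypothetical statistics at one tree node.
q = {"uav1": 0.6, "uav2": 0.5, "virtual": 0.0}
prior = {"uav1": 0.5, "uav2": 0.4, "virtual": 0.1}
visits = {"uav1": 10, "uav2": 2, "virtual": 0}

chosen = puct_select(q, prior, visits)
```

Here the less-visited second UAV wins despite its lower mean value, because its exploration bonus is larger.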

Upon reaching a leaf node

s_{L}

, the expansion phase activates. The upper-level policy

π_{θ} (s_{L})

evaluates this node, initializing the prior probabilities

P (a | s_{L})

for all valid assignment actions. The masking logic described in Equation (21) ensures that only feasible branches receive non-zero probabilities, pruning invalid actions from the search tree.

Subsequent evaluation of the expanded node involves estimating the potential reward of the resulting allocation through simulation. Instead of random rollouts, greedy execution of the lower-level routing policy

π_{ϕ}

on the generated subsets

N_{k}

yields a value estimate. Accumulation of the prizes collected by all UAVs constitutes the simulation reward

R (s_{L})

:

R (s_{L}) = \sum_{k = 1}^{K} Val (Greedy (π_{ϕ}, N_{k} | s_{L}))

(31)

where

Val (\cdot)

calculates the accumulated prize of the route constructed by the lower-level policy.

Propagation of this simulation result

R (s_{L})

updates the statistics along the traversed path from

s_{L}

back to the root. The update rules for the visit count and action value follow:

n (s_{t}, a) \leftarrow n (s_{t}, a) + 1, \hat{Q} (s_{t}, a) \leftarrow \hat{Q} (s_{t}, a) + \frac{R (s_{L}) - \hat{Q} (s_{t}, a)}{n (s_{t}, a)}

(32)

where

n (s_{t}, a)

on the right-hand side of the second equation reflects the post-increment value, ensuring

\hat{Q} (s_{t}, a)

converges to the expected return of the assignment action.
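The backup rule in Equation (32) is an incremental mean. A minimal sketch, assuming the statistics are stored in dictionaries keyed by hypothetical (state, action) pairs:

```python
def backup(path, reward, n, q):
    """Eq. (32): increment each visit count, then move Q toward the simulation reward."""
    for edge in path:
        n[edge] = n.get(edge, 0) + 1           # post-increment count
        old = q.get(edge, 0.0)
        q[edge] = old + (reward - old) / n[edge]

# Hypothetical path of (state, action) edges from the root to a leaf.
n, q = {}, {}
path = [("s0", "uav1"), ("s1", "uav2")]
for reward in (10.0, 20.0, 30.0):
    backup(path, reward, n, q)
```

After the three simulated rewards, each edge on the path holds a visit count of 3 and an action value equal to their mean, 20.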

Completion of the simulation budget triggers the final decision based on the accumulated visit counts. Specifically, the action distribution

π_{m c t s} (a | s_{t})

is derived proportional to the visit counts:

π_{m c t s} (a | s_{t}) = \frac{n {(s_{t}, a)}^{1 / τ}}{\sum_{a^{'}} n {(s_{t}, a^{'})}^{1 / τ}}

(33)

where

τ

serves as the temperature parameter controlling the determinism of the final selection.
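A short sketch of Equation (33), assuming the root visit counts are available as a list. The counts below are hypothetical and sum to 80, matching the per-step simulation budget reported in Section 5.1.

```python
def visit_policy(visits, tau=1.0):
    """Eq. (33): probabilities proportional to n(s, a) ** (1 / tau); tau must be positive."""
    weights = [count ** (1.0 / tau) for count in visits]
    total = sum(weights)
    return [w / total for w in weights]

visits = [60, 15, 5]
p_unit = visit_policy(visits, tau=1.0)   # proportional to the raw counts
p_sharp = visit_policy(visits, tau=0.5)  # lower temperature concentrates the mass
```

Lowering the temperature moves the distribution toward the most-visited action, which approaches deterministic selection.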

4.5. Training Strategy

Optimization of the hierarchical policy involves handling the sparsity of feasible solutions within a highly constrained heterogeneous action space. The C-ACT strategy is proposed, incorporating dynamic constraint annealing with a virtual agent mechanism to stabilize the training process.

The training process commences with the lower-level policy

π_{ϕ}

, which functions as a neural operator evaluating task clusters provided by the upper level. Using the POMO algorithm with instance augmentation,

π_{ϕ}

is pre-trained to establish a baseline routing capability. The mean reward of

N_{a u g}

augmented trajectories is used as a baseline to reduce variance. The gradient is calculated as

\nabla_{ϕ} J (ϕ) \approx \frac{1}{N_{a u g}} \sum_{m = 1}^{N_{a u g}} \nabla_{ϕ} log π_{ϕ} (σ_{m} | N_{k}) \cdot (R (σ_{m}) - {\bar{R}}_{a u g})

(34)

where

{\bar{R}}_{a u g}

denotes the average reward of the augmented batch. This pre-training phase ensures that the lower-level policy provides consistent reward feedback

Val (\cdot)

to the upper-level policy.
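The shared-baseline estimator in Equation (34) can be sketched numerically as follows. The rewards and log-probabilities are hypothetical scalars; in practice an automatic-differentiation framework would differentiate the surrogate loss with respect to the policy parameters.

```python
def pomo_surrogate_loss(rewards, log_probs):
    """Return the surrogate loss -mean((R_m - R_bar) * log pi(sigma_m)) and the advantages."""
    baseline = sum(rewards) / len(rewards)  # mean reward over the augmented views
    advantages = [r - baseline for r in rewards]
    loss = -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)
    return loss, advantages

# Hypothetical rewards and trajectory log-probabilities for four augmented views.
rewards = [10.0, 12.0, 9.0, 13.0]
log_probs = [-2.0, -1.5, -2.5, -1.0]
loss, advantages = pomo_surrogate_loss(rewards, log_probs)
```

The advantages sum to zero by construction, so the baseline shifts the reward scale without biasing the gradient direction.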

The C-ACT sequential training structure resolves the reward-coupling problem between the two policy levels. When

π_{ϕ}

is undertrained,

Val (\cdot)

carries high variance, injecting noise into the gradient estimates of

π_{θ}

. Phase I pre-trains

π_{ϕ}

before upper-level optimization begins, ensuring that reward feedback reflects task-quality differences by the time

π_{θ}

receives its first gradient updates. In Phase II, the curriculum-aware baseline

b (s_{g l o b a l}, λ)

normalizes rewards relative to

λ_{e}

, isolating the policy gradient from reward magnitude shifts caused by constraint relaxation. Phase III alternates gradient updates between

π_{θ}

and

π_{ϕ}

, serializing the two optimization steps. This structure prevents simultaneous parameter interference, removing the need for an explicit delayed update mechanism.

Optimization of the upper-level policy

π_{θ}

faces challenges due to strict capacity and range constraints, which often cause random allocations to be infeasible. To enable convergence, the training protocol integrates a dynamic constraint annealing strategy. A curriculum factor

λ_{e}

scales the constraint thresholds:

\sum_{i \in N} q_{i} y_{i k} \leq λ_{e} Q_{k}, \sum_{i \in V} \sum_{j \in V} c_{i j} x_{i j k} \leq λ_{e} L_{k}

(35)

This scales both the capacity and range limits, transitioning from a relaxed state (

λ_{s t a r t} > 1

) to the physical limit (

λ_{e n d} = 1

) according to a cosine schedule. This mechanism gradually contracts the feasible region defined by the mask, guiding the policy from a simplified solution space toward the target constraints. To further guarantee the completeness of the MDP, a virtual agent is incorporated into the action space. Serving as a task rejection mechanism, this agent absorbs tasks that cannot be assigned to real drones, thereby preventing episode termination due to empty action spaces.
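The cosine schedule can be sketched as below, using the curriculum update rule stated in Algorithm 1 and the values reported in Section 5.1 (a start factor of 1.4, an end factor of 1.0, and 50 Phase II epochs). The function name is hypothetical.

```python
import math

def curriculum_factor(epoch, total_epochs, lam_start=1.4, lam_end=1.0):
    """Cosine decay of the constraint factor for epoch = 1 .. total_epochs (total_epochs > 1)."""
    phase = (epoch - 1) * math.pi / (total_epochs - 1)
    return lam_end + 0.5 * (lam_start - lam_end) * (1 + math.cos(phase))

# Constraint factors over the 50 Phase II epochs.
schedule = [curriculum_factor(e, 50) for e in range(1, 51)]
```

The factor starts at 1.4, passes through 1.2 halfway through the schedule, and reaches the physical limit of 1.0 at the final epoch.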

The upper-level gradient is derived using REINFORCE with a curriculum-aware baseline:

\nabla_{θ} J (θ) \approx \frac{1}{B} \sum_{i = 1}^{B} \nabla_{θ} log π_{θ} (A_{i} | s_{g l o b a l}, λ) \cdot (R_{r e a l} (A_{i}) - b (s_{g l o b a l}, λ))

(36)

where

A_{i}

represents the task allocation.

R_{r e a l}

denotes the total reward derived from real drones, and

b (s_{g l o b a l}, λ)

is the baseline reward under the current constraint factor

λ

. Algorithm 1 formalizes this complete C-ACT procedure.

Algorithm 1 Alternating cooperative training with curriculum learning

Input: Training instance set $D$; batch size B; learning rate $η$; augmentation views $N_{a u g}$; warm-up epochs $E_{w}$; alignment epochs $E_{a}$; joint stabilization epochs $E_{j}$; curriculum factors $λ_{s t a r t}, λ_{e n d}$; virtual agent $v_{K + 1}$.
Output: Optimized parameters $θ^{*}$ (Upper) and $ϕ^{*}$ (Lower).

1: Initialization: Randomly initialize $θ, ϕ$; initialize baseline $θ_{b l} \leftarrow θ$.
2: Phase I: Lower-Level Warm-up based on POMO
3: for $e = 1 \to E_{w}$ do
4:     Sample random batch $B \sim D$
5:     Step 1: Instance Augmentation
6:     $B_{a u g} \leftarrow Augment (B, N_{a u g})$
7:     Step 2: Reward Evaluation
8:     $R_{a u g} \leftarrow Rollout (π_{ϕ}, B_{a u g})$
9:     $b_{s h a r e d} \leftarrow Mean (R_{a u g})$ ▹ Shared Baseline
10:     Step 3: Optimization
11:     $L_{ϕ} \leftarrow - Mean ((R_{a u g} - b_{s h a r e d}) log π_{ϕ})$
12:     $ϕ \leftarrow Adam (\nabla L_{ϕ}, η)$
13: end for
14: Phase II: Upper-Level Alignment with Curriculum Learning
15: for $e = 1 \to E_{a}$ do
16:     Curriculum Update: $λ_{e} \leftarrow λ_{e n d} + \frac{1}{2} (λ_{s t a r t} - λ_{e n d}) (1 + cos (\frac{(e - 1) π}{E_{a} - 1}))$
17:     Sample global batch $B \sim D$ subject to constraint factor $λ_{e}$
18:     Step 1: Exploration & Assessment
19:     $A^{*} \leftarrow Sample (π_{θ}, B \cup \{v_{K + 1}\})$
20:     Extract real assignments $A \leftarrow A^{*} \setminus \{v_{K + 1}\}$ ▹ Absorb infeasible tasks
21:     $R \leftarrow Solve (π_{ϕ}, A)$
22:     Step 2: Baseline Computation
23:     $A_{b l}^{*} \leftarrow Greedy (π_{θ_{b l}}, B \cup \{v_{K + 1}\})$
24:     Extract real baseline assignments $A_{b l} \leftarrow A_{b l}^{*} \setminus \{v_{K + 1}\}$
25:     $b \leftarrow Solve (π_{ϕ}, A_{b l})$
26:     Step 3: Optimization
27:     $L_{θ} \leftarrow - Mean ((R - b) log π_{θ} (A^{*} | B))$
28:     $θ \leftarrow Adam (\nabla L_{θ}, η)$
29:     Update $θ_{b l} \leftarrow θ$ if validation improves
30: end for
31: Phase III: Joint Stabilization
32: for $e = 1 \to E_{j}$ do
33:     Execute Phase II step to update $π_{θ}$ with fixed $λ_{e} = 1.0$
34:     Data Augmentation for Lower Level:
35:     Construct sub-problems $B^{'}$ from current allocations $A$
36:     Execute Phase I step on $B^{'}$ to update $π_{ϕ}$
37: end for
38: return $θ^{*} \leftarrow θ, ϕ^{*} \leftarrow ϕ$

5. Experiments

This section evaluates the Hierarchical Deep Reinforcement Learning framework for large-scale heterogeneous UAV mission planning. The setup details the experimental environment, dataset generation from real-world geographic data, and hyperparameter configurations. Evaluation compares the proposed method against eight baseline algorithms across problem scales (

N = 80

to 300). This comparison assesses solution quality and scalability. The analysis examines small-scale convergence and large-scale generalization. Trajectory visualizations corroborate the numerical findings. Finally, the evaluation quantifies statistical stability and computational efficiency. These metrics verify the framework’s applicability to time-sensitive disaster relief operations.

5.1. Experiment Settings

Experiments test the framework on instances of varying scale. The target count N takes values from the set

{80, 100, 200, 300}

. Real-world geographic data from the Yuelu District of Changsha City, China, supplies the geospatial coordinates for target nodes. Figure 4 visualizes this spatial distribution, with raw locations given in World Geodetic System 1984 (WGS 84) coordinates. The procedure samples N coordinates from this pool and projects them into a unit square

[0, 1] \times [0, 1]

.

Uniform distributions

U (1.2, 2.4)

and

U (1.5, 2.5)

define the maximum payload capacity

Q_{k}

and flight range

L_{k}

for each UAV

k \in {1, \dots, 6}

. These parameters model the fleet heterogeneity. Target demands

q_{i}

follow a uniform distribution

U (0.05, 0.15)

. Target prizes

p_{i}

are sampled from a uniform distribution

U (0.5, 1.5)

, reflecting heterogeneous rescue priorities across task nodes. The distance penalty coefficient and workload balance penalty coefficient in Equation (1) are set to

α = 0.01

and

β = 0.1

, respectively. The reported objective value (Obj.) in all result tables corresponds to the terminal episode reward

R (s_{T})

as defined in Equation (10). A virtual agent with infinite capacity and range absorbs unassigned tasks. Tasks routed to this agent yield zero reward, implicitly penalizing infeasible assignments.
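The sampling of fleet and target attributes described above can be sketched as follows. The coordinates here are uniform placeholders, since the paper draws them from real geographic data, and the function and key names are hypothetical. The seed value 2026 is the one stated in the evaluation protocol.

```python
import random

def sample_instance(n_targets, n_uavs=6, seed=2026):
    """Draw one synthetic instance with the attribute ranges given in Section 5.1."""
    rng = random.Random(seed)
    return {
        # Placeholder coordinates in the unit square (the paper uses projected map data).
        "coords": [(rng.random(), rng.random()) for _ in range(n_targets)],
        "capacity": [rng.uniform(1.2, 2.4) for _ in range(n_uavs)],      # Q_k
        "flight_range": [rng.uniform(1.5, 2.5) for _ in range(n_uavs)],  # L_k
        "demand": [rng.uniform(0.05, 0.15) for _ in range(n_targets)],   # q_i
        "prize": [rng.uniform(0.5, 1.5) for _ in range(n_targets)],      # p_i
    }

inst = sample_instance(80)
```

Using a dedicated `random.Random` instance keeps the draw reproducible without touching the global random state.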

The hierarchical framework pairs an upper-level allocation policy with a lower-level routing policy, both parameterized by attention-based neural networks. Each encoder incorporates 3 self-attention layers with a unified hidden and embedding dimension of 128, processed through an 8-head multi-head attention mechanism. The upper-level MCTS executes 80 simulations per step with

c_{p u c t} = 1.0

. The lower-level routing model applies the POMO mechanism with a sample size of 8. Validation and testing phases apply greedy decoding for deterministic decision-making.

Both models are trained with the Adam optimizer at an initial learning rate of

1 \times 10^{- 4}

. The upper-level model trains for 100 epochs total (50 Phase-II + 50 Phase-III; batch size 1600; epoch size 800,000 instances) with a per-epoch learning rate decay of 0.995; gradient norms are clipped at 3.0. The lower-level model trains for

E_{w} = 100

warm-up epochs (batch size 400; epoch size 200,000 instances) at a constant learning rate, with gradient norms clipped at 1.0. Phase III joint stabilization runs for

E_{j} = 50

additional epochs. The curriculum factor

λ_{e}

decays from

λ_{s t a r t} = 1.4

to

λ_{e n d} = 1.0

over

E_{a} = 50

Phase II epochs via a cosine schedule.

Implementation relies on Python 3.12 and the PyTorch 2.8.0 deep learning framework. All computational tasks were performed on a single workstation. The hardware includes an Intel Xeon Platinum 8358P CPU (2.60 GHz), 90 GB of RAM, and a single NVIDIA GeForce RTX 3090 GPU (24 GB). The Ubuntu 22.04 LTS operating system hosts the environment. CUDA 12.8 acceleration processes the tensor computations.

The following protocol governs all evaluations to support reproducibility. Performance metrics are computed over 100 independently sampled test instances per scale. Geographic coordinate sampling and fleet parameter generation use a fixed random seed (seed = 2026); network weights are initialized with seed = 0. Three independent training runs are conducted; the mean performance across the three runs is reported. The coefficient of variation of the reward across the three runs remains below 2% at convergence, confirming training stability. Neural models apply greedy decoding during inference. Stochastic baselines (Adaptive Large Neighborhood Search, ALNS) are executed with five independent runs per instance; the best result per instance is recorded. Error bars in Section 5.6 represent the standard deviation across the 100 test instances. Wilcoxon signed-rank tests in Section 5.6 are computed over the paired per-instance objective values from these same 100 instances.

5.2. Benchmark Evaluation

A cross-scale evaluation protocol tests the generalization capability and scalability of the hierarchical framework. Specifically, models trained on instances with

N = 80

targets execute inference on both

N = 80

and

N = 100

test sets. Models trained on the

N = 200

dataset evaluate scenarios with

N = 200

and

N = 300

targets. This process requires no additional fine-tuning. This experimental design verifies the adaptability of the learned policies to varying problem scales. It demonstrates the model’s ability to capture the underlying problem structure rather than overfit to specific dimensions.

The baseline selection follows a factorial design that isolates the contribution of each hierarchical component. Three upper-level allocation strategies (Random, K-Means, and MCTS) are crossed with three lower-level solvers (ALNS, OR-Tools, and Transformer), yielding nine method combinations: the proposed MCTS + Trans and eight baselines. End-to-end learning-based planners for multi-UAV routing were considered as additional baselines but are not included. Published models of this type are trained on homogeneous fleets or on problem scales below

N = 80

; direct comparison without retraining on the LSH-TOP formulation would conflate architectural differences with distribution mismatch, obscuring interpretation. The MCTS + OR pairing serves as a strong reference for the lower-level routing component by substituting the Transformer with an exact solver under an identical upper-level strategy.

Random assignment serves as a lower-bound reference reflecting unstructured allocation. K-Means clusters tasks by spatial proximity without considering vehicle-specific constraints. Evaluations consider four problem scales with target counts

N \in {80, 100, 200, 300}

. Table 2 and Table 3 report the average objective value (Obj.), the optimality gap (Gap), and the inference time.

Experimental results show the MCTS + Trans method achieves the highest objective value across all four scales. The optimality gap is computed relative to the best objective value among all compared methods, not the theoretical optimum; LSH-TOP is NP-hard and exact solutions are intractable at these scales. A 0.00% gap indicates MCTS + Trans achieves the highest objective among all evaluated methods on every test instance, not that it reaches the provably optimal solution. The absolute values and the p-values in Section 5.6 provide the primary basis for performance assessment. Section 5.3 and Section 5.4 provide detailed analyses of these results by scale.
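The relative gap defined above can be reproduced from the objective values quoted in Sections 5.3 and 5.4. The sketch below assumes the gap is the shortfall from the best objective, expressed as a percentage of that best objective; the helper name is hypothetical.

```python
def relative_gap(obj, best):
    """Percentage shortfall of obj relative to the best objective among compared methods."""
    return (best - obj) / best * 100.0

# Objective values quoted in the text.
gap_or_80 = relative_gap(46.14, 50.43)       # MCTS + OR at N = 80
gap_or_300 = relative_gap(77.04, 82.54)      # MCTS + OR at N = 300
gap_kmeans_300 = relative_gap(72.27, 82.54)  # KMeans + Trans at N = 300
gap_random_300 = relative_gap(70.32, 82.54)  # Random + Trans at N = 300
```

Rounded to two decimals, these evaluate to 8.51, 6.66, 12.44, and 14.80, matching the reported gaps.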

5.3. Comparative Analysis on Small and Medium Scales

This subsection investigates algorithmic performance on small- and medium-scale instances (

N = 80

and

N = 100

). The analysis focuses on convergence quality and constraint satisfaction. Table 2 shows the MCTS + Trans framework achieves objective values of 50.43 and 59.30 for

N = 80

and

N = 100

, respectively. This establishes a performance benchmark with a 0.00% optimality gap.

Fixing the lower-level solver to the Transformer model isolates the impact of the upper-level allocation strategy. The Random + Trans baseline lacks active assignment logic. It exhibits an optimality gap exceeding 33% across both scales. The KMeans + Trans strategy clusters tasks based on Euclidean proximity. This reduces the gap to 9.18% at

N = 80

. Spatial clustering fails to account for heterogeneous payload and flight range parameters. This failure causes load imbalances and restricts the total recoverable value. The MCTS strategy models these constraints during the tree search simulation. MCTS evaluates the potential reward of future states via look-ahead simulation. This approach optimizes task distribution based on the feasibility of the resulting sub-problems. It facilitates the construction of higher-value routes compared to geometric location strategies.

Fixing the upper-level strategy to MCTS isolates the contribution of the lower-level routing solver. The Transformer-based solver matches or exceeds the exact OR-Tools solver at these scales. At

N = 80

, the MCTS + OR method achieves an objective of 46.14 (8.51% gap). The MCTS + Trans method reaches 50.43 (0.00% gap). OR-Tools converges to optimality given unlimited time. However, a fixed computational budget per sub-problem bounds its performance here. The Transformer model acts as a learned heuristic. It generalizes to the constrained TOP variants and infers solutions rapidly (4.13 s vs. 4.31 s for OR-Tools). This learned policy captures the structural properties of the routing problem efficiently.

Qualitative analysis of the solution topology, as illustrated in Figure 5, corroborates the numerical results. Figure 5a,b depict the planned trajectories for

N = 80

and

N = 100

instances. The generated routes exhibit spatial partitioning with minimal path crossing, indicating that MCTS groups proximal tasks while adhering to the individual flight range limits of each UAV. Furthermore, all UAVs return to the depot after completing their service loops, confirming that the coupled hierarchy satisfies the hard constraints of the mission.

To verify the numerical difference shown in Table 2, Figure 6 presents a side-by-side comparison of solution topologies for an instance with

N = 100

. Methods using random allocation (bottom row) result in disordered trajectories with overlapping paths, indicating inefficient task distribution. K-Means (middle row) improves spatial grouping but generates unbalanced routes, as it ignores vehicle-specific payload and range parameters. Route length disparity across UAVs in Figure 6 confirms this load imbalance. The MCTS + Trans method (top-left) produces a structured topology characterized by sub-region partitioning and compact routes. This visualization confirms that the proposed hierarchy optimizes the numerical objective and yields physically rational logistics plans.

5.4. Scalability Verification on Large Scales

5.4.1. Target-Scale Generalization

This section evaluates framework robustness on large-scale instances (

N = 200

and

N = 300

). The MCTS + Trans method remains the best of the compared methods, with a 0.00% gap. It achieves objective values of 74.41 and 82.54 for

N = 200

and

N = 300

, respectively. These results appear in Table 3.

The limitations of heuristic allocation strategies compound as the problem scale expands. Fixing the lower-level solver to the Transformer model highlights this degradation. The Random + Trans baseline yields an objective value of 70.32 at

N = 300

. This value corresponds to a 14.80% optimality gap. The KMeans + Trans strategy records a gap of 12.44% (Obj. 72.27). The K-Means gap widens relative to the small-scale benchmark (9.18% at 80 targets), whereas the Random gap narrows from above 33%. Geometric heuristics struggle to balance workloads across heterogeneous UAVs in dense environments. The MCTS strategy mitigates this workload imbalance. Simulating task assignments identifies partitions aligned with specific payload and range constraints. This anticipation prevents premature saturation of vehicle capacities.

The Transformer-based routing solver demonstrates a scalability advantage over conventional iterative baselines. OR-Tools performs competitively on smaller scales. Its performance decays on large-scale sub-problems due to the computational complexity of the orienteering problem. At

N = 300

, the MCTS + OR method achieves an objective of 77.04. This trails the MCTS + Trans method (82.54) by a 6.66% gap. The Transformer model processes increased node density without the corresponding performance decay. The learned policy generalizes to larger instances. It captures routing patterns inaccessible to constructive heuristics.

Analysis of Figure 7 and Figure 8 corroborates these numerical findings. Figure 7 illustrates the proposed framework generating compact, non-overlapping trajectories at high target densities. These routes maximize the coverage of distributed high-value targets. Figure 8 contrasts this with baseline methods. Random allocation methods (bottom row) produce entangled routes. This structural chaos reflects a failure to spatially decompose the massive task set. K-Means-based approaches (middle row) generate load imbalances. These imbalances stem from fleet heterogeneity. The proposed method (top-left) establishes a streamlined topology. This visual evidence verifies its capability to resolve large-scale collaborative mission planning problems.

5.4.2. Fleet-Size Sensitivity

The preceding experiments fix the fleet size at

K = 6

. To evaluate sensitivity to fleet composition, all methods are tested under

K = 9

UAVs using zero-shot inference without retraining. Table 4 reports the results.

Under

K = 9

, KMeans + Trans achieves the highest objective at both scales (52.53 at

N = 80

; 112.85 at

N = 300

), surpassing MCTS + Trans by 24.4% and 35.3%, respectively. This reversal relative to the

K = 6

results reveals a boundary condition of the learned allocation policy. The policy prior that guides the MCTS search is trained on a 6-agent decision space; when transferred to 9 agents without retraining, the search guided by these priors cannot adequately represent the expanded combinatorial branching, limiting its ability to exploit additional fleet capacity. K-Means, as a geometry-based method independent of training configuration, distributes tasks across all available UAVs by spatial proximity, naturally scaling with fleet size.

This result identifies fleet-size generalization as a limitation of the current framework. Retraining or fine-tuning the upper-level policy on the target fleet configuration is expected to restore MCTS dominance, as the search mechanism itself is not fleet-size-dependent. Developing fleet-size-agnostic allocation policies constitutes a direction for future work.

5.5. Ablation Study

To validate the structural choices and verify the isolated contribution of each mechanism within the C-ACT protocol, a comprehensive ablation study was conducted at the

N = 300

scale. This scenario, characterized by the highest combinatorial complexity and stringent resource limits, serves as a stress test for the proposed framework. The complete MCTS + Trans framework is compared against four degraded variants:

w/o POMO: Replaces the POMO mechanism with single-trajectory greedy decoding. This isolates the baseline capability of the Transformer routing heuristic.
w/o MCTS (Neural Greedy): Removes the look-ahead search from the upper-level allocator. Task assignment relies solely on the neural policy’s direct greedy output.
w/o Curriculum ( $λ = 1$ ): Disables dynamic constraint annealing. The network trains under strict payload and range limits from the initial epoch.
w/o Virtual Agent: Removes the virtual agent from the assignment space. Infeasible tasks revert to rigid hard-masking, risking action space collapse.

The performance metrics for these variants are summarized in Table 5.

Table 5 supports each methodological choice through the performance degradation of its ablation. Most notably, the w/o MCTS variant achieves an objective of 74.50, yielding a gap of 9.74% relative to the full model. This gap is 3.08 percentage points larger than that of the MCTS + OR reference (6.66%), confirming that while the neural allocator learns meaningful representations, MCTS look-ahead simulations are essential to escape myopic assignments at large scales.

The most severe performance collapse stems from removing the Virtual Agent (29.49% gap). Without an absorptive buffer for infeasible targets, the action space frequently collapses during early training, causing MDP termination. To quantify its necessity, Table 6 tracks the virtual agent’s utilization frequency across scales.

The proportion of instances requiring virtual agent intervention rises from 2.1% at

N = 80

to 14.3% at

N = 300

, confirming the virtual agent functions as an essential buffer in dense environments. Furthermore, MCTS allocation consistently triggers the virtual agent less frequently than the K-Means heuristic, indicating it generates more physically feasible sub-problems.

Finally, omitting dynamic constraint annealing (w/o Curriculum) reduces the objective to 75.50, an 8.53% gap. Training logs show that without constraint annealing, training was unstable: reward variance during early epochs was substantially higher, and independent runs frequently settled into penalty-avoidance behaviors. The curriculum schedule resolves this by expanding the feasible region early in training, then gradually contracting it toward physical limits. The w/o POMO variant confirms that the base MCTS + Trans architecture outperforms OR-based routing under single-trajectory decoding. POMO augmentation reduces the gap by a further 3.32 percentage points through multi-start geometric diversification.

Figure 9 presents the training reward curves for both the upper-level allocation model and the lower-level routing model at two problem scales (

N = 80

and

N = 200

), comparing the full C-ACT protocol (solid) against the w/o Curriculum ablation with fixed

λ = 1

(dashed gray).

The convergence behavior of C-ACT is further evidenced by training dynamics (Figure 9). At the upper level, C-ACT reduces reward variance by roughly one to two orders of magnitude relative to the fixed-constraint baseline: the variance ratio (C-ACT/w/o Curriculum) is 0.014 at

N = 80

and 0.152 at

N = 200

, indicating that curriculum relaxation is essential for stable upper-level optimization across problem scales. At the lower level, the variance reduction is moderate (

ratio = 0.719

at

N = 200

) since the routing policy operates on pre-allocated subproblems with smaller action spaces. These dynamics are consistent with the 8.53% performance gap at convergence (Table 5) and confirm that dynamic constraint annealing is necessary for convergence in this constrained heterogeneous setting.

An end-to-end baseline generating joint allocation-routing sequences without bilevel decomposition would provide the most direct validation of the hierarchical design. However, no published end-to-end model supports the LSH-TOP constraints (heterogeneous capacities, heterogeneous ranges, and prize collection) at scales beyond

N = 80

. Training such a model from scratch requires

O (N^{K})

action-space exploration per step (

N = 300

,

K = 6

yields

\approx 7.3 \times 10^{14}

candidate sequences), rendering convergence infeasible within practical GPU budgets. As a proxy, the w/o MCTS (Neural Greedy) variant approximates a flat learned policy: it uses the same trained Transformer allocator but removes tree-search look-ahead, collapsing the hierarchical structure to a single forward pass per assignment. Its 9.74% gap relative to the full model quantifies the minimum cost of abandoning structured search. The MCTS + OR reference further isolates the decomposition benefit: replacing the learned router with OR-Tools still stays within a 6.66% gap of the full model, confirming that the hierarchical structure itself, independent of the lower-level solver, provides the dominant performance contribution.

5.6. Statistical Stability and Computational Efficiency

Evaluating stochastic stability and resource utilization efficiency is essential for operational deployment. This subsection assesses statistical variance, task completion rates, and computational latency across scales.

Figure 10 visualizes reward distributions and task completion rates for $N \in \{80, 100, 200, 300\}$. Error bars quantify standard deviations across test instances. The progression from Figure 10a to Figure 10d shows the MCTS + Trans method securing tighter error bounds than the Random and K-Means baselines. At $N = 300$ (Figure 10d), the standard deviation for MCTS + Trans remains smaller than that of Random + Trans. The hierarchical framework thus mitigates the randomness inherent in combinatorial optimization.

The TCR quantifies the proportion of targets successfully served by the fleet. Figure 10 reports TCR distributions across all scales and methods. At $N = 80$, MCTS + Trans achieves a 63.0% completion rate, exceeding the 54.0% recorded by MCTS + ALNS (read from Figure 10a). At $N = 300$, the task-to-resource ratio peaks, and MCTS + Trans maintains a leading completion rate of 27.5% (Table 5). The MCTS strategy optimizes route compactness alongside high-value targeting, serving more nodes within strict flight range constraints.

The Wilcoxon signed-rank test was applied at $N = 300$ to verify that the performance gains of MCTS + Trans over the two strongest baselines are statistically significant. The test targets two controlled comparisons: MCTS + Trans versus MCTS + OR isolates the contribution of the lower-level routing solver, and MCTS + Trans versus KMeans + Trans isolates the contribution of the upper-level allocation strategy. Table 7 reports both p-values. Both fall below the 0.05 threshold, so the null hypothesis of equal performance is rejected in both comparisons.
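As an illustration of the paired test described above, the following Python sketch implements a two-sided Wilcoxon signed-rank test with the normal approximation, using only the standard library. The function name, the choice of a two-sided test, and the omission of tie and continuity corrections are assumptions made for this sketch; the per-instance objectives are synthetic placeholders, not the paper's data. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test using the normal approximation.

    Returns (w_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks. No tie or continuity
    correction is applied to the variance, so this is only a sketch.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / std
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail
    return w_plus, p_value


# Synthetic paired objectives for 100 instances (NOT the paper's data):
# method_a beats method_b on every instance by between 0.7 and 1.3.
method_a = [80.0 + 0.05 * i for i in range(100)]
method_b = [a - 1.0 - 0.3 * math.sin(i) for i, a in enumerate(method_a)]

w_plus, p_value = wilcoxon_signed_rank(method_a, method_b)

if __name__ == "__main__":
    print(f"W+ = {w_plus:.1f}, p = {p_value:.3g}")
```

Because the test is paired, both methods must be evaluated on the same 100 instances in the same order; the statistic depends only on the signs and ranks of the per-instance differences.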

Table 8 reports 95% confidence intervals for the proposed method and its two strongest baselines at $N = 300$, computed across the 100 test instances via the normal approximation. Non-overlapping intervals confirm that performance differences are not attributable to sampling variance.
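A minimal sketch of the interval computation and the overlap check follows. It assumes the usual normal-approximation form, mean $\pm 1.96 \, s / \sqrt{n}$ with $s$ the sample standard deviation; the function names are hypothetical, and the intervals in `TABLE_8` are copied from Table 8 rather than recomputed from raw data.

```python
import math
import statistics


def normal_ci(samples, z=1.96):
    """Mean and normal-approximation confidence interval (default 95%)."""
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, (mean - half_width, mean + half_width)


def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


# Intervals copied from Table 8 (N = 300).
TABLE_8 = {
    "MCTS + Trans": (81.65, 83.41),
    "MCTS + OR": (76.06, 78.02),
    "KMeans + Trans": (71.09, 73.45),
}

# Pairwise overlap check against the proposed method.
overlaps = {
    name: intervals_overlap(TABLE_8["MCTS + Trans"], ci)
    for name, ci in TABLE_8.items()
    if name != "MCTS + Trans"
}

if __name__ == "__main__":
    print(overlaps)
    # Toy sample, only to show the call signature of normal_ci.
    print(normal_ci([1.0, 2.0, 3.0, 4.0, 5.0]))
```

Non-overlap of two 95% intervals is a conservative criterion: it implies a significant difference, whereas overlapping intervals would not by themselves rule one out.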

Computational efficiency dictates real-time applicability. Figure 11 plots framework inference times against baselines. The MCTS + Trans computational cost exhibits sub-quadratic growth, scaling from 4.13 s at $N = 80$ to 24.30 s at $N = 300$. The MCTS + OR method incurs accelerating overhead as the sub-problem complexity grows and demands 25.61 s at $N = 300$. Heuristic baselines such as Random + ALNS operate with lower latency (8.79 s at $N = 300$) but suffer from optimality gaps exceeding 50%. The proposed framework manages this trade-off by investing computation in the upper-level tree search, which isolates high-quality decompositions, while parallelized neural network inference recovers speed at the lower level. The resulting solution time of 24.30 s at $N = 300$ satisfies the temporal constraints of operational decision support.

6. Discussion

Experimental outcomes demonstrate that the hierarchical decomposition maintains solution quality and controls computational complexity across all tested scales. As the problem scale grows, the performance gap between MCTS-guided allocation and the simpler allocation baselines evolves in distinct ways. At $N = 300$, KMeans + Trans exhibits a gap of 12.44%, widening from 9.18% at $N = 80$; the absolute objective difference between MCTS + Trans and KMeans + Trans grows from 4.63 to 10.27. Random + Trans, by contrast, narrows its percentage gap from 33.93% to 14.80% as scale increases, and its absolute objective deficit relative to MCTS + Trans also contracts (17.11 to 12.22). In larger, denser environments, any assignment captures some reward, reducing the relative penalty of unstructured allocation. In both cases, the MCTS strategy maintains its advantage by evaluating downstream feasibility before committing to each assignment, a capability that purely geometric methods cannot replicate. Heuristic operators fail to coordinate global workload distribution and local trajectory compactness simultaneously. The upper-level agent treats the lower-level routing policy as a neural heuristic that provides value estimates for candidate partitions, enabling globally coherent allocations that geometry-based methods cannot achieve.

The convergence behavior of C-ACT admits a theoretical interpretation. Under standard step-size conditions, the REINFORCE gradient estimator converges to a locally optimal policy as long as the gradient variance remains bounded. The cosine annealing schedule keeps the feasible region $F(\lambda_{e})$ non-empty throughout Phase II. Because $\lambda_{e} \geq 1$ relaxes both capacity and range limits, any allocation feasible under $\lambda_{e} = 1$ remains feasible under $\lambda_{e} > 1$, maintaining a non-zero episode completion probability at every epoch. The virtual agent reinforces this property by maintaining a non-empty action space at every decision step (Equation (22)), satisfying the MDP completeness requirement for policy gradient convergence. Training stability is further supported by gradient clipping at norms 3.0 and 1.0 for the upper and lower levels, respectively, and by the alternating update structure of Phase III, which serializes the $\pi_{\theta}$ and $\pi_{\phi}$ updates and prevents simultaneous parameter interference.
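The nesting argument can be illustrated with a short Python sketch. It assumes a standard cosine form that decays the relaxation factor from a hypothetical `lam_max` to 1 over `num_epochs`, and it assumes the factor scales both limits multiplicatively; the exact schedule and constants are those defined in the training section, and the values 1.5 and 100 below are placeholders.

```python
import math


def relaxation_factor(epoch: int, num_epochs: int, lam_max: float = 1.5) -> float:
    """Cosine-annealed relaxation factor, from lam_max at epoch 0 to 1.0.

    Assumed form for illustration; lam_max = 1.5 is a placeholder value.
    """
    progress = min(max(epoch / num_epochs, 0.0), 1.0)
    return 1.0 + (lam_max - 1.0) * 0.5 * (1.0 + math.cos(math.pi * progress))


def is_feasible(demands, route_length, capacity, max_range, lam):
    """Check one UAV's assignment against limits relaxed by lam >= 1."""
    return sum(demands) <= lam * capacity and route_length <= lam * max_range


# Relaxation factor over 100 epochs (placeholder horizon).
schedule = [relaxation_factor(e, 100) for e in range(101)]

# An assignment feasible under the strict limits (lam = 1) ...
strict_ok = is_feasible([2.0, 3.0], 40.0, capacity=5.0, max_range=40.0, lam=1.0)
# ... stays feasible under every relaxed limit in the schedule.
nested = all(
    is_feasible([2.0, 3.0], 40.0, capacity=5.0, max_range=40.0, lam=lam)
    for lam in schedule
)
# An overloaded assignment is admitted early in training and rejected at lam = 1.
early = is_feasible([3.0, 4.0], 40.0, capacity=5.0, max_range=40.0, lam=schedule[0])
late = is_feasible([3.0, 4.0], 40.0, capacity=5.0, max_range=40.0, lam=schedule[-1])

if __name__ == "__main__":
    print(schedule[0], schedule[50], schedule[-1])
    print(strict_ok, nested, early, late)
```

Since the factor never drops below 1, the strictly feasible set is contained in every relaxed feasible set, which is the property the argument above relies on.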

This structural efficiency is further quantified from a computational complexity standpoint. The hierarchical decomposition reduces inference complexity relative to joint allocation-routing. A brute-force policy over joint sequences requires $O(N^{K})$ evaluations; the proposed structure reduces this to an MCTS search of depth $N$ with branching factor $K$, where each simulation invokes the lower-level policy on sub-problems of average size $N/K$, plus $K$ independent final routing evaluations:

$$T_{\mathrm{hier}} = O\left(N \cdot K \cdot S_{\mathrm{mcts}} \cdot \frac{N}{K}\right) + O\left(K \cdot \left(\frac{N}{K}\right)^{2}\right) = O\left(N^{2} \cdot S_{\mathrm{mcts}} + \frac{N^{2}}{K}\right) \tag{37}$$

where $S_{\mathrm{mcts}}$ denotes the MCTS simulation budget per step. For fixed $K$ and $S_{\mathrm{mcts}}$, both terms scale as $O(N^{2})$, yielding a theoretical upper bound of $O(N^{2})$, a substantial reduction from the $O(N^{K})$ complexity of joint end-to-end approaches. Empirically, the inference time increases from 4.13 s to 24.30 s across $N \in \{80, 100, 200, 300\}$ (a 5.88× increase for a 3.75× increase in $N$). A log–log regression on these four data points yields an exponent of approximately 1.2, indicating sub-quadratic practical scaling tighter than the $O(N^{2})$ theoretical bound due to GPU parallelism in the lower-level routing step and early termination in MCTS simulations.
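The empirical exponent can be checked with an ordinary least-squares fit of $\log T$ against $\log N$. The Python sketch below uses the MCTS + Trans inference times reported in Tables 2 and 3; the function name `loglog_slope` is an illustrative choice.

```python
import math

# Mean inference times of MCTS + Trans from Tables 2 and 3 (seconds).
SCALES = [80, 100, 200, 300]
TIMES = [4.13, 6.53, 10.77, 24.30]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


exponent = loglog_slope(SCALES, TIMES)      # empirical scaling exponent
time_ratio = TIMES[-1] / TIMES[0]           # growth in inference time
scale_ratio = SCALES[-1] / SCALES[0]        # growth in problem size

if __name__ == "__main__":
    print(f"exponent = {exponent:.2f}")
    print(f"time x{time_ratio:.2f} for scale x{scale_ratio:.2f}")
```

With only four points the fitted exponent is a rough indicator rather than a tight estimate, but it lies well below the quadratic bound of Equation (37).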

The learned policies generalize to moderately larger problem scales without retraining, preserving solution feasibility under varying constraint tightness. This cross-scale transferability, enabled by the decomposition structure that reduces a 300-target instance to six independent sub-problems of manageable size, directly supports time-critical disaster response operations where retraining between deployments is impractical. However, fleet-size transfer (Section 5.4.2) reveals a boundary condition: the MCTS search tree, optimized for a fixed agent count, does not generalize to unseen fleet configurations under zero-shot inference. This limitation is structural rather than algorithmic: the search mechanism itself is fleet-size-agnostic, but the learned prior $P(a \mid s_{t})$ encodes a fixed branching factor.
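To make the fixed-branching-factor limitation concrete, the sketch below shows a PUCT-style child selection of the kind commonly used when MCTS is guided by a learned prior. That the upper-level search uses this exact rule is an assumption; the constant `c_puct`, the function name, and all numeric values are hypothetical. The point is only that the prior carries one entry per agent, so a prior produced for $K = 6$ (plus the virtual agent) has no entries for additional UAVs.

```python
import math


def puct_select(prior, visit_counts, q_values, c_puct=1.5):
    """Return the index of the child maximizing Q + U (PUCT rule).

    prior, visit_counts and q_values must have one entry per action,
    i.e. per UAV (including the virtual agent) in this setting.
    """
    if not (len(prior) == len(visit_counts) == len(q_values)):
        raise ValueError("prior length must match the number of actions")
    total_visits = sum(visit_counts)
    scores = [
        q + c_puct * p * math.sqrt(total_visits) / (1 + n)
        for p, n, q in zip(prior, visit_counts, q_values)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical prior for 6 UAVs plus the virtual agent (7 actions).
prior_k6 = [0.30, 0.20, 0.15, 0.15, 0.10, 0.05, 0.05]
visits = [10, 1, 0, 0, 0, 0, 0]
q_vals = [0.5, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0]

# The heavily visited first child loses to the less explored second one.
best = puct_select(prior_k6, visits, q_vals)

# A 9-UAV fleet (10 actions) cannot reuse the 7-entry prior as-is.
try:
    puct_select(prior_k6, [0] * 10, [0.0] * 10)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

if __name__ == "__main__":
    print(best, mismatch_detected)
```

The selection rule itself accepts any number of actions, which mirrors the distinction drawn above between the fleet-size-agnostic search mechanism and the fixed-size learned prior.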

7. Conclusions and Future Work

This work presents a hierarchical reinforcement learning framework for the LSH-TOP in emergency UAV operations. Decoupling the architecture into MCTS-guided allocation and Transformer-based routing balances global resources and local trajectories. A curriculum-integrated training protocol combines dynamic constraint annealing with a virtual agent mechanism. This protocol enables convergence in sparse-reward environments bounded by physical vehicle limits. Evaluations document consistent objective improvements over all baselines (0.63–8.51% over the strongest competitor and up to 53.27% over conventional heuristics) alongside leading task completion rates. The framework maintains these gains at scales up to 300 targets with sub-quadratic inference time growth. This scalability makes the framework a candidate for time-critical SAR logistics, subject to the simulation-only limitations noted below.

The evaluation is confined to simulated environments; no hardware-in-the-loop testing or physical UAV deployment is performed. The geographic data from Yuelu District (Figure 4) provides realistic spatial distribution, but the simulation excludes wind disturbance, GPS drift, inter-UAV communication latency, and onboard sensor noise. The formulation assumes static target sets with deterministic travel costs, excluding the dynamic and uncertain conditions present in operational disaster response. Physical testbed validation and hardware-in-the-loop testing are deferred to future work.

Future development will target dynamic uncertainties in operational domains. Moving targets and stochastic communication gaps will require online adaptation mechanisms. Expanded state representations encapsulating temporal data will support real-time replanning. Multi-objective models combining energy consumption tracking with timeliness maximization will refine the system’s operational viability. Fleet-size-agnostic allocation policies will eliminate the retraining requirement when fleet composition changes between missions.

Author Contributions

Conceptualization, Y.Z. and Z.Y.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z., Z.Y., C.W. and D.G.; formal analysis, Y.Z.; investigation, Y.Z. and D.G.; resources, Z.Y., C.W. and D.G.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Z.Y., C.W. and D.G.; visualization, Y.Z.; supervision, Z.Y. and C.W.; project administration, Z.Y.; funding acquisition, Z.Y. and C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Grant No. 2022YFC2204800), the National Key Laboratory of Land and Air Based Information Perception and Control, China (Grant No. A224001), the Natural Science Foundation of Hunan Province (Grant No. 2024JJ6482), and the Graduate Student Independent Exploration and Innovation Program of Central South University (Grant No. 2025ZZTS0683).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the privacy of the geographical information used in the simulation scenarios.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.


Figure 1. Schematic of the hierarchical framework for LSH-TOP. The upper-level MCTS partitions targets into feasible subsets; the lower-level Transformer generates trajectories respecting payload and range constraints.

Figure 2. Overview of the C-ACT protocol. Upper- and lower-level policies undergo alternating optimization with dynamic constraint annealing and a virtual agent buffer for convergence in constrained environments.

Figure 3. Node embedding pipeline: coordinate scaling, prize/demand integration, and linear projection for Graph Attention Network initialization.

Figure 4. Target distribution derived from real-world geographic data of Yuelu District, Changsha, China.

Figure 5. MCTS + Trans trajectories at (a) $N = 80$ and (b) $N = 100$. Routes exhibit clear spatial partitioning under heterogeneous UAV constraints.

Figure 6. Solution topologies of nine methods at $N = 100$. Rows: MCTS, K-Means, and Random allocation. Columns: Transformer, OR-Tools, and ALNS routing. Different colors represent individual UAV trajectories.

Figure 7. MCTS + Trans trajectories at (a) $N = 200$ and (b) $N = 300$. Routes remain distinct and compact despite high target density.

Figure 8. Solution topologies of nine methods at $N = 300$. Rows: MCTS, K-Means, and Random allocation. Columns: Transformer, OR-Tools, and ALNS routing. Different colors represent individual UAV trajectories.

Figure 9. Training reward curves: C-ACT (solid) vs. fixed $\lambda = 1$ (dashed). Upper row: allocation model; lower row: routing model. Left: $N = 80$; right: $N = 200$.

Figure 10. Objective values (left axis) and task completion rates (right axis) across scales. Error bars: standard deviation over 100 instances.

Figure 11. Average inference time across four problem scales. MCTS + Trans exhibits sub-quadratic growth, substantially lower than $O(N^{K})$ joint approaches.

Table 1. Summary of key notations.

Notation	Description
Sets and Indices
$G$	The complete directed graph, $G = (V, E)$
$V$	Set of all nodes, $V = \{0\} \cup \mathcal{N}$
$\mathcal{N}$	Set of potential task/target nodes, $\mathcal{N} = \{1, 2, \dots, N\}$
$0$	Index of the depot (base station)
$\mathcal{K}$	Set of physical heterogeneous UAVs, $\mathcal{K} = \{1, 2, \dots, K\}$
$\mathcal{K}^{+}$	Extended UAV set including the virtual agent, $\mathcal{K}^{+} = \mathcal{K} \cup \{K + 1\}$
$i, j$	Indices for nodes ($i, j \in V$)
$k$	Index for UAVs ($k \in \mathcal{K}^{+}$)
Parameters
$p_{i}$	The prize (importance score) associated with task $i$
$q_{i}$	The resource demand required by task $i$
$c_{ij}$	Euclidean distance cost of traversing from node $i$ to node $j$
$Q_{k}$	Maximum payload capacity of UAV $k$ (heterogeneous)
$L_{k}$	Maximum flight range of UAV $k$ (heterogeneous)
$\alpha, \beta$	Weighting coefficients for distance and balance penalties
Decision Variables & Functions
$x_{ijk}$	Binary variable: 1 if UAV $k$ travels from node $i$ to node $j$, 0 otherwise
$y_{ik}$	Binary variable: 1 if task $i$ is assigned to UAV $k$, 0 otherwise
$U_{k}$	Capacity usage rate of UAV $k$, $U_{k} = (\sum_{i} q_{i} y_{ik}) / Q_{k}$
$\Psi(u)$	Variance of the usage rates across the fleet (balance metric)
$J$	The global objective function to be maximized

Table 2. Experiment results on the 6-UAV scenario (scales 80 and 100). Obj. denotes the mean terminal episode reward $R(s_{T})$ across 100 test instances. Gap is computed relative to the best objective among all evaluated methods at each scale. Bold indicates the shortest inference time within each allocation strategy group (Time) and the best objective/gap within each group (Obj./Gap).

Method	$N = 80$			$N = 100$
Method	Obj.	Gap (%)	Time (s)	Obj.	Gap (%)	Time (s)
Random + ALNS	24.81	50.80	1.17	28.08	52.65	2.71
Random + OR	30.69	39.14	1.66	35.71	39.78	3.76
Random + Trans	33.32	33.93	1.32	38.95	34.32	3.08
KMeans + ALNS	41.39	17.93	1.54	46.02	22.39	3.28
KMeans + OR	43.40	13.94	1.97	48.99	17.39	4.20
KMeans + Trans	45.80	9.18	1.69	54.79	7.61	3.83
MCTS + ALNS	43.21	14.32	4.05	48.44	18.31	6.22
MCTS + OR	46.14	8.51	4.31	54.69	7.77	6.69
MCTS + Trans	50.43	0.00	4.13	59.30	0.00	6.53

Table 3. Experiment results on the 6-UAV scenario (scales 200 and 300). Obj. and Gap are defined as in Table 2. Bold indicates the shortest inference time within each allocation strategy group (Time) and the best objective/gap within each group (Obj./Gap).

Method	$N = 200$			$N = 300$
Method	Obj.	Gap (%)	Time (s)	Obj.	Gap (%)	Time (s)
Random + ALNS	34.77	53.27	4.60	40.92	50.42	8.79
Random + OR	50.76	31.78	6.33	60.33	26.91	12.11
Random + Trans	57.55	22.66	5.11	70.32	14.80	10.52
KMeans + ALNS	61.95	16.75	5.47	65.40	20.77	11.82
KMeans + OR	63.24	15.01	7.26	68.43	17.09	15.55
KMeans + Trans	65.96	11.36	6.09	72.27	12.44	14.19
MCTS + ALNS	65.34	12.19	10.06	71.89	12.90	22.43
MCTS + OR	73.94	0.63	11.24	77.04	6.66	25.61
MCTS + Trans	74.41	0.00	10.77	82.54	0.00	24.30

Table 4. Zero-shot fleet-size transfer results ($K = 9$, models trained on $K = 6$). Obj. denotes the mean total collected prize across 100 test instances. Visited/Total indicates the average number of targets served. Bold indicates the best result at each problem scale.

Method	$N = 80$		$N = 300$
Method	Obj.	Visited	Obj.	Visited
KMeans + Trans	52.53 ± 5.77	69.9/80	112.85 ± 2.75	144.7/300
KMeans + OR	51.73 ± 5.82	69.2/80	106.09 ± 3.77	140.2/300
KMeans + ALNS	50.39 ± 5.67	66.7/80	101.27 ± 4.02	135.5/300
MCTS + Trans	42.23 ± 4.07	55.3/80	83.40 ± 2.96	108.4/300
MCTS + OR	41.03 ± 3.56	53.6/80	81.85 ± 3.21	103.5/300
MCTS + ALNS	40.87 ± 3.75	53.0/80	77.52 ± 4.34	99.8/300
Random + Trans	35.69 ± 4.36	46.7/80	73.92 ± 2.87	91.3/300
Random + OR	35.56 ± 3.75	46.2/80	71.47 ± 4.10	91.5/300
Random + ALNS	31.93 ± 5.47	42.6/80	66.29 ± 5.70	89.8/300

Table 5. Ablation study at $N = 300$. Gap is relative to the full model (82.54). TCR: task completion rate. MCTS + OR is an external reference, not an ablation variant. Bold indicates the full model (best performance). Italic indicates an external reference.

Model Variant	Obj.	Gap (%)	TCR (%)
MCTS + Trans (Full Model)	82.54	0.00	27.5
w/o POMO	79.80	3.32	26.2
w/o Curriculum ( $λ = 1$ )	75.50	8.53	23.8
w/o MCTS (Neural Greedy)	74.50	9.74	23.1
w/o Virtual Agent	58.20	29.49	18.5
MCTS + OR (external ref.)	77.04	6.66	25.1

Table 6. Virtual agent utilization rate (%) across scales and allocation methods. Bold indicates the proposed method.

Allocation Method	$N = 80$	$N = 100$	$N = 200$	$N = 300$
MCTS (Proposed)	2.1%	4.5%	9.8%	14.3%
K-Means Heuristic	5.4%	8.7%	16.5%	22.1%

Table 7. Wilcoxon signed-rank test results at $N = 300$. $p < 0.05$ rejects the null hypothesis of equal performance.

Comparison	p-Value
MCTS + Trans vs. MCTS + OR	$3.42 \times 10^{- 5}$
MCTS + Trans vs. KMeans + Trans	$1.28 \times 10^{- 8}$

Table 8. Mean objective values and 95% confidence intervals at $N = 300$.

Method	Mean Obj.	95% CI
MCTS + Trans	82.54	[81.65, 83.41]
MCTS + OR	77.04	[76.06, 78.02]
KMeans + Trans	72.27	[71.09, 73.45]

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
