1. Introduction
With the rapid development of unmanned aerial vehicle (UAV) technology, heterogeneous UAV swarms, which offer synergistic capabilities in payload and operational range, are increasingly applied in fields such as modern warfare, disaster rescue, and logistics transportation [1,2,3,4,5]. However, multi-UAV collaborative task allocation is fundamentally an NP-hard combinatorial optimization problem. Existing research primarily falls into three main categories: traditional optimization algorithms, deep reinforcement learning (DRL), and emerging large language models (LLMs).
Traditional optimization algorithms mainly include heuristics and meta-heuristics. Heuristic algorithms focus on designing efficient rules: Samiei et al. [6] proposed a clustering-based Hungarian algorithm, effectively reducing the computational complexity of large-scale multi-task allocation; Li et al. [7] introduced an overlapping coalition formation game framework and designed mechanisms for resource adjustment and invalid coalition exit, solving the collaboration problem under discrete resource constraints; and Liu et al. [8] constructed a dynamic coalition model oriented towards sequential task chains to address the challenges of sequential coordination in complex environments. Meta-heuristic algorithms aim to enhance global search capabilities: Zhang et al. [9] and Ye et al. [10] utilized an improved co-evolutionary particle swarm optimization and an adaptive genetic algorithm, respectively, to address the highly coupled nature of task allocation in heterogeneous swarms; Xiong et al. [11] constructed a gridded PSO-GA hybrid strategy (GrEAPSO) to balance exploitation and exploration. Furthermore, targeting high-dimensional multi-task scenarios, Dong et al. [12] proposed a joint optimization framework for task allocation and trajectory planning, achieving Pareto optimality under multi-dimensional constraints. Although these methods perform excellently in the numerical space, their computational overhead increases exponentially with scale, and they cannot interpret tactical intents formulated in natural language, resulting in inflexible execution.
DRL algorithms have shown tremendous potential in various UAV task planning scenarios due to their end-to-end decision-making capabilities [13,14,15,16]. To address the exponential growth of the state space in large-scale swarms, Zuo et al. [17] utilized an attention mechanism to rapidly capture swarm features; for complex SEAD (Suppression of Enemy Air Defenses) tasks, Yue et al. [18] constructed a hierarchical multi-agent framework to decouple target allocation from low-level coordination; Yin et al. [19] designed a deep transfer reinforcement learning algorithm based on QMIX, significantly accelerating model convergence. However, DRL methods suffer from severe cold-start issues and limited generalization capabilities, often failing catastrophically when encountering unforeseen commands or heterogeneous tasks.
The explosive growth of LLMs [20,21,22,23] has made their integration into UAV task planning a prominent research frontier. Recent surveys by Tian et al. [24] and Jin et al. [25] highlight this paradigm shift. They reveal a key insight: while LLMs excel in semantic reasoning, a fundamental gap remains in coupling this reasoning with structured optimization and physical control. Early explorations mainly focused on intent translation and logical decomposition. For instance, Cui et al. [26] mapped abstract language into executable Python code using the TPML framework. Similarly, Li et al. [27] utilized Chain of Thought (CoT) to decompose complex swarm tasks. However, pure LLMs often suffer from severe physical hallucinations when directly generating solutions. To mitigate this, Yu et al. [28,29] introduced external feasibility checkers and dual-verification mechanisms to ensure planning safety.
Tool-augmented AI architectures naturally help align semantic reasoning with physical constraints. As Xu et al. [30] highlighted, equipping LLMs with external tools significantly reduces computational errors and hallucinations. Advancing this concept in the UAV domain, Koubaa and Gabr [31] proposed the Agentic UAV framework. This fuses LLM reasoning with tool-calling interfaces, noticeably improving dynamic task allocation and detection confidence. Furthermore, Zhang et al. [32] introduced AFlow to optimize complex, multi-step tool invocations. They demonstrated that automating tool-use workflows over computational graphs greatly enhances both reasoning efficiency and task performance.
Despite their potential in zero-shot generalization and high-level mission decomposition, directly applying LLMs to physical domains remains difficult. Specifically, existing pure LLM and early tool-augmented methods still struggle with three fundamental issues in tactical environments:
Computational and Physical Disconnect: LLMs lack native numerical and spatial reasoning skills. As a result, they struggle with strict resource and kinematic constraints, frequently generating physically impossible flight paths or payload plans.
Gradient Conflicts and Data Scarcity: Forcing a single LLM to simultaneously learn abstract tactics and rigid tool syntax often leads to optimization conflicts. Combined with the severe lack of high-quality tactical data, this causes major cold-start issues.
Lack of Deep Tactical Verification: Current LLM feedback loops usually rely on basic syntax or collision checks. They lack the deep, simulation-based feedback needed to correctly assign credit and optimize long-horizon tactics.
To address these limitations, we introduce a generative UAV task allocation paradigm augmented by a heterogeneous toolchain. We tackle the above issues through three structural innovations:
Heterogeneous Toolchain Integration: We ground semantic generation in physical reality by integrating five types of tools. Specifically, the Excellent Case Retrieval and Knowledge Graph Query tools supply tactical priors and accurate equipment parameters to prevent hallucinations; the Geospatial Calculator and Risk-Cost Calculator provide exact spatial matrices and operational assessments; and the Simulation Evaluation tool offers closed-loop environmental feedback. Together, these tools translate abstract reasoning into executable, constraint-compliant UAV assignments.
Decoupled Dual-Model Training: To prevent optimization conflicts, we design an execution-manifold-anchored orthogonal evolution training method. By separating the planning model (focused on intent) from the execution model (focused on syntax), we eliminate parameter interference. Furthermore, simulating self-play within a stable execution environment generates high-quality training data, directly overcoming the lack of domain data.
Recursive Causal Probe (RCP): We develop a dynamic decision attribution algorithm to solve the credit assignment problem in long-horizon games. Through simulation-based backward tracing, RCP isolates the exact decision responsible for a mission failure. This translates sparse failure signals into dense, actionable feedback, ensuring the generated plans are both logically sound and practically viable.
Compared with existing hybrid planning frameworks, the main contributions of this paper are summarized as follows:
1. A generative task allocation paradigm augmented by a heterogeneous toolchain. Unlike existing hybrid frameworks that rely on post-generation feasibility checks, this paradigm integrates a heterogeneous toolchain directly into the semantic reasoning loop. This effectively narrows the semantic–physical divide and translates abstract intents into executable plans, mitigating physical hallucinations.
2. An execution-manifold-anchored orthogonal evolution training method. To alleviate optimization conflicts between abstract reasoning and tool syntax in monolithic models, this method adopts a decoupled dual-model architecture. It minimizes parameter interference and leverages self-play to synthesize high-fidelity data, mitigating domain data scarcity.
3. A dynamic decision attribution algorithm based on the RCP. To address the complex credit assignment problem, RCP utilizes simulation-based backward tracing. It aims to translate sparse failure signals into actionable corrections, supporting the continuous improvement of the swarm’s tactical decision-making.
The remainder of this paper is organized as follows. Section 2 presents the mathematical model for heterogeneous UAV task allocation. Section 3 details the generative allocation paradigm augmented by the heterogeneous toolchain, the orthogonal evolution training method, and the RCP algorithm. Section 4 presents the simulation setups and results, verifying the effectiveness of the proposed methods. Finally, Section 5 concludes the paper.
2. Task Description and Modeling
Traditional numerical optimization paradigms based on Mixed-Integer Linear Programming (MILP) [33,34] struggle to process ambiguous command intents and unstructured tactical logic on the battlefield. To this end, this paper proposes a generative allocation paradigm augmented by a heterogeneous toolchain, which reconstructs the task allocation for heterogeneous UAV swarms into a problem of hierarchical intent-driven sequential decision-making and black-box constraint solving.
2.1. Task Description
This paper constructs a typical 3D Red-Blue adversarial task allocation scenario, as illustrated in Figure 1. The Red Force constitutes our heterogeneous UAV strike swarm, whereas the Blue Force comprises the enemy’s heterogeneous ground defenses and high-value targets.
Red Force entities: This consists of five categories of heterogeneous UAVs: reconnaissance, attack, electronic warfare (EW), anti-radiation, and suicide UAVs. Each unit is defined by a multi-dimensional attribute vector, encompassing payload capabilities (e.g., hard-kill missiles), kinematic parameters (e.g., operational range and velocity), and inherent attributes (e.g., radar cross-section, RCS).
Blue Force entities: This encompasses five types of ground facilities: radar stations, air defense missile silos, command centers, armored vehicles, and laser weapons. Each target possesses a specific defensive threat level and inherent attributes, and is assigned a distinct strike priority based on the dynamic battlefield situation.
Planning objectives: Under the premise of satisfying payload limits and heterogeneous capability constraints, the system must adaptively allocate limited resources to the targets. The optimization process adheres to an intent-driven hierarchical logic:
- Maximization of tactical effectiveness: Strictly align with the commander’s intent, prioritizing the complete destruction or suppression of high-priority targets.
- Minimization of operational costs: Subject to mission completion, minimize fuel consumption, ammunition expenditure, and the expected combat attrition of the Red Force as far as possible.
2.2. Task Allocation Modeling
We define the battlefield situation as a multimodal triplet comprising the Red Force heterogeneous UAV swarm (types, positions, and quantities), the Red Force available missile inventory (types and quantities), and the Blue Force heterogeneous target set (types, positions, and quantities).
In response to the unstructured natural language instruction input by the commander, this paper discards external rule engines and instead models intent parsing as a latent variable disentanglement task for the tactical planning model. Through an initial CoT reasoning pass, the model collapses the instruction into two core control variables, an implicit preference state and an explicit tactical scenario. Both are conditioned on the initial reasoning sequence generated by the model and on the parameters of the tactical planning model.
These two variables define the optimization surface and topological boundary of the subsequent planning, respectively:
Implicit Preference State: Takes one of a small set of preference modes (balanced, or biased toward survival, economy, or offense; see Section 2.3.2). As a prior preference field, this state guides the generated schemes to shift towards specific evaluation regions by adjusting the weight space of multi-objective optimization.
Explicit Tactical Scenario: Defines the macroscopic tactical background of the current mission; Table 1 lists the specific types of tactical scenarios. Serving as a soft constraint, it activates tactical meta-rules to guide target screening and resource allocation.
After establishing the battlefield situation and the logical constraints, this paper models the task allocation scheme as a Markov Decision Process (MDP) generated by the interaction between the semantic reasoning space and the physical execution manifold.
We define the joint interaction tuple at step t as consisting of four elements:
Chain of Thought: The tactical thinking process generated by the planning model, responsible for logical deduction.
Instruction: The specific tactical instruction issued by the planning model to the execution model.
Action: The specific tool invocation command.
Observation: The execution result returned by the tool.
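The four-element interaction step described above can be illustrated with a small sketch. This is only an illustration under stated assumptions: the names `Step`, `rollout`, `toy_planner`, `toy_executor`, and the `geo_distance` tool are hypothetical and do not come from the paper, and the planner and executor are replaced by hand-written stubs rather than language models.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

@dataclass(frozen=True)
class Step:
    """One interaction step: thought, instruction, tool action, observation."""
    thought: str
    instruction: str
    action: Dict[str, Any]
    observation: str

def rollout(
    planner: Callable[[List[Step]], Tuple[str, str]],
    executor: Callable[[str], Dict[str, Any]],
    tools: Dict[str, Callable[..., str]],
    max_steps: int,
) -> List[Step]:
    """Alternate planner -> executor -> tool until the planner signals DONE."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, instruction = planner(history)   # semantic reasoning
        if instruction == "DONE":
            break
        action = executor(instruction)            # instruction -> legal tool call
        observation = tools[action["tool"]](**action["args"])
        history.append(Step(thought, instruction, action, observation))
    return history

# --- hypothetical stubs standing in for the two models and one tool ---
def geo_distance(a: Tuple[float, float], b: Tuple[float, float]) -> str:
    return f"{math.dist(a, b):.1f} km"

def toy_planner(history: List[Step]) -> Tuple[str, str]:
    if not history:
        return ("Range must be checked first.", "compute distance uav_1 -> radar_1")
    return ("Distance is known.", "DONE")

def toy_executor(instruction: str) -> Dict[str, Any]:
    # A real executor would parse the instruction; this stub is hard-coded.
    return {"tool": "geo_distance", "args": {"a": (0.0, 0.0), "b": (3.0, 4.0)}}

TOOLS = {"geo_distance": geo_distance}
```

Calling `rollout(toy_planner, toy_executor, TOOLS, max_steps=4)` yields a one-step history whose observation is the tool's distance string. The point of the sketch is the separation of roles: the planner never emits tool syntax, and the executor never reasons about tactics.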
To find the optimal policy, we perform an orthogonal decomposition of the joint probability distribution of the trajectory induced by the total historical context, factorizing it into the planner and executor policies described below (Equation (2)).
This architecture achieves the complete physical decoupling of cognition and execution:
Tactical Planner Policy: Driven by the planning-model parameters, it operates in the semantic space, focusing on processing complex tactical logical deductions and issuing intermediate instructions.
Manifold Executor Policy: Driven by the execution-model parameters, it is responsible for deterministically anchoring the unstructured instructions into legal tool actions.
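One plausible way to write the planner/executor factorization is sketched below. All notation here is assumed for illustration and is not taken from the paper: the trajectory is written as \tau, the total context as H, the per-step thought, instruction, action, and observation as c_t, i_t, a_t, and o_t, the history before step t as h_{t-1}, and the planner and executor parameters as \theta and \phi. Whether the executor also conditions on the history is not something this sketch can settle.

```latex
% Assumed notation (hypothetical): \tau trajectory, H total context,
% c_t thought, i_t instruction, a_t tool action, o_t observation,
% h_{t-1} history before step t, \theta planner and \phi executor parameters.
P(\tau \mid H) \;=\; \prod_{t=1}^{T}
  \underbrace{\pi_{\theta}\!\left(c_t, i_t \mid h_{t-1}\right)}_{\text{tactical planner}}\,
  \underbrace{\pi_{\phi}\!\left(a_t \mid i_t\right)}_{\text{manifold executor}}\,
  \underbrace{P\!\left(o_t \mid a_t\right)}_{\text{tool response}}
```

Under this reading, the two policy terms share no parameters, which is the property the decoupling argument relies on.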
Theoretically, while the factorization in Equation (2) shares the goal of external tool integration seen in recent agentic frameworks (e.g., MUTP-LLM [29] and Agentic UAVs [31]), our approach differentiates itself through its architectural mapping. Existing tool-augmented systems often rely on a monolithic LLM to simultaneously execute abstract semantic reasoning and generate rigid API syntax. In contrast, our orthogonal decomposition introduces two key theoretical advantages. First, it promotes optimization stability via gradient decoupling. Isolating the planner and executor parameters effectively mitigates the heterogeneous gradient interference and helps avoid the catastrophic forgetting of tactical logic that frequently challenges single-model tool learning [30]. Second, it enhances sample efficiency through dimensionality reduction. By acting as a deterministic bottleneck, the executor shields the planner from complex low-level schemas (unlike direct code-generation approaches such as TPML [26]), substantially compressing the effective action space. Ultimately, these decoupled policies leverage environmental observations to achieve robust closed-loop self-iteration, allowing the planner to focus primarily on high-level tactical alignment while alleviating the cognitive overload associated with end-to-end monolithic generation.
We define the final task allocation problem as a black-box constraint-solving problem based on simulation feedback. The complete task allocation scheme generated by the dual-model architecture, after wargaming in the high-fidelity simulation environment, yields the final evaluation result R, which is calculated by the total effectiveness function J. The solving objective is to find a valid scheme whose evaluation result satisfies the tactical success threshold. If the result falls short of this threshold, the simulation feedback containing specific penalty items is fed back as correction prompts to the planning model, triggering closed-loop iterative correction until the scheme meets the standard.
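The closed-loop correction described above can be sketched as a simple generate, simulate, and retry loop. This is a minimal sketch under assumptions: `closed_loop_plan`, `toy_generate`, `toy_simulate`, and the `max_rounds` cap are hypothetical names introduced here, the simulator is a stub rather than a wargaming environment, and the rule of returning the best plan seen when the threshold is never met is an illustrative choice, not something the paper specifies.

```python
from typing import Callable, List, Tuple

def closed_loop_plan(
    generate: Callable[[List[str]], str],
    simulate: Callable[[str], Tuple[float, List[str]]],
    threshold: float,
    max_rounds: int = 5,
) -> Tuple[str, float, int]:
    """Regenerate a plan until its simulated score meets the threshold.

    Returns (plan, score, rounds_used). If no plan reaches the threshold
    within max_rounds, the best-scoring plan seen so far is returned.
    """
    feedback: List[str] = []
    best_plan, best_score = "", float("-inf")
    for round_idx in range(1, max_rounds + 1):
        plan = generate(feedback)              # planner + executor output
        score, penalties = simulate(plan)      # black-box evaluation
        if score > best_score:
            best_plan, best_score = plan, score
        if score >= threshold:
            return plan, score, round_idx
        feedback = feedback + penalties        # penalties become correction prompts
    return best_plan, best_score, max_rounds

# --- hypothetical stubs: each round of feedback improves the plan ---
def toy_generate(feedback: List[str]) -> str:
    return f"plan-v{len(feedback)}"

def toy_simulate(plan: str) -> Tuple[float, List[str]]:
    version = int(plan.rsplit("v", 1)[1])
    score = 0.25 * (version + 1)
    penalties = [] if score >= 0.75 else [f"penalty reported for {plan}"]
    return score, penalties
```

With these stubs and a threshold of 0.75, the loop accepts the third plan. In a real system the penalty strings would carry the specific violated items from the simulation, so that each regeneration is conditioned on what went wrong.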
2.3. Effectiveness Evaluation Model
To transform unstructured tactical intents into quantifiable mathematical optimization objectives, this paper constructs a dynamic effectiveness evaluation model based on the S-E-O (Survival–Economic–Offensive) three-dimensional framework.
2.3.1. S-E-O Evaluation Metrics
This paper defines the simulation evaluation metrics from three orthogonal dimensions: offensive, survival, and economic.
(1) O: Offensive Score
This dimension employs a coupled time-effectiveness model defined over the set of predetermined targets, the set of actually destroyed targets, the actual elapsed time, and the maximum time limit. It combines the task completion rate with a time efficiency factor. A tunable parameter adjusts the time sensitivity, balancing the tactical value of strike coverage against that of rapid resolution.
(2) S: Survival Score
This dimension measures the battlefield survival rate of non-consumable units. To isolate expected tactical consumption, the set of suicide UAVs is explicitly excluded. The score is computed from the sets of Red Force units at the initial and final stages. This design guides the model to focus on the battlefield survival performance of high-value reconnaissance or strike platforms.
(3) E: Economic Score
This dimension measures the combat return on investment. To prevent reward explosion caused by extreme values, a Tanh mapping is utilized to compress values with a wide dynamic range into a normalized interval. The score is computed from the value of the damaged Blue Force targets, the battle loss and ammunition cost of the Red Force, and a saturation threshold for the cost-effectiveness ratio.
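The survival and economic scores can be illustrated with a short sketch. The exact formulas are not reproduced here, so the forms below are assumptions: survival is taken as the fraction of non-suicide units still present at the end, and the economic score as the Tanh of the value-to-cost ratio divided by the saturation threshold. The function names, argument names, and the handling of empty or zero-cost cases are all hypothetical.

```python
import math
from typing import Set

def survival_score(
    initial_units: Set[str], final_units: Set[str], suicide_units: Set[str]
) -> float:
    """Fraction of non-suicide units that survive (assumed form)."""
    at_risk = initial_units - suicide_units
    if not at_risk:
        return 1.0  # assumption: no non-expendable unit was exposed
    survivors = (final_units & initial_units) - suicide_units
    return len(survivors) / len(at_risk)

def economic_score(
    target_value: float, battle_loss: float, ammo_cost: float, saturation: float
) -> float:
    """Tanh-compressed cost-effectiveness ratio (assumed form)."""
    cost = battle_loss + ammo_cost
    if cost <= 0.0:
        # assumption: a free strike is scored as fully efficient if it hit anything
        return 1.0 if target_value > 0.0 else 0.0
    return math.tanh((target_value / cost) / saturation)

# Example: one attack UAV is lost; the suicide UAV does not count either way.
s = survival_score(
    {"recon_1", "attack_1", "attack_2", "suicide_1"},
    {"recon_1", "attack_1"},
    {"suicide_1"},
)
e = economic_score(target_value=30.0, battle_loss=10.0, ammo_cost=5.0, saturation=2.0)
```

Because Tanh is bounded, a very large value-to-cost ratio cannot dominate the other two dimensions once the scores are combined, which is the stated purpose of the compression.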
2.3.2. Intent-Driven Effectiveness Function
To transform ambiguous intents into precise mathematical constraints, the total effectiveness function J is constructed as a weighted linear combination of S, E, and O, where the weight vector is dynamically determined by the parsed preference mode, achieving intent alignment by reshaping the Pareto front of multi-objective optimization:
Balanced Preferences: In the balanced mode, the weight vector is uniform, i.e., the three dimensions are weighted equally. The model seeks a robust balance among survival, cost, and offense, making it suitable for conventional missions with ambiguous situations.
Biased Preferences: In a biased mode, the weight vector is sparse, and the weight of one specific dimension is significantly amplified:
- Survival-oriented: Extremely sensitive to battle losses, leaning towards generating low-risk or stand-off strike schemes.
- Economic-oriented: Prioritizes optimizing the cost-effectiveness ratio, leaning towards strike schemes that minimize resource consumption.
- Offensive-oriented: Possesses a high tolerance for costs, fully optimizing the offensive metric O, leaning towards adopting rapid saturation attacks.
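The effect of the preference mode on plan ranking can be shown with a small sketch of the weighted combination. The uniform weights follow from the balanced case; the biased weight values (0.8 / 0.1 / 0.1) are purely hypothetical placeholders, as are the names `MODE_WEIGHTS` and `total_effectiveness`.

```python
from typing import Dict, Tuple

# (w_S, w_E, w_O) per preference mode. Only the uniform row is implied by the
# text; the biased values are hypothetical and chosen for illustration.
MODE_WEIGHTS: Dict[str, Tuple[float, float, float]] = {
    "balanced": (1 / 3, 1 / 3, 1 / 3),
    "survival": (0.8, 0.1, 0.1),
    "economic": (0.1, 0.8, 0.1),
    "offensive": (0.1, 0.1, 0.8),
}

def total_effectiveness(s: float, e: float, o: float, mode: str) -> float:
    """Weighted linear combination J of the S, E, and O scores."""
    w_s, w_e, w_o = MODE_WEIGHTS[mode]
    return w_s * s + w_e * e + w_o * o

# Two hypothetical plans: a cautious one and an aggressive one.
cautious = {"s": 0.9, "e": 0.5, "o": 0.3}
aggressive = {"s": 0.4, "e": 0.5, "o": 0.9}

def preferred(mode: str) -> str:
    j_c = total_effectiveness(mode=mode, **cautious)
    j_a = total_effectiveness(mode=mode, **aggressive)
    return "cautious" if j_c > j_a else "aggressive"
```

The same pair of plans is ranked differently depending on the mode: a survival-oriented weighting favors the cautious plan, while an offensive-oriented weighting favors the aggressive one. This is the sense in which the parsed intent reshapes the optimization target without changing the underlying metrics.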
2.4. Definition of Physical Hallucination
While the aforementioned S-E-O evaluation model establishes the quantitative criteria for assessing the effectiveness of a valid allocation plan, pure semantic LLMs frequently generate schemes that fundamentally violate underlying physical and logical boundaries. To formally contextualize the core challenges our methodology aims to resolve, we introduce the concept of physical hallucination and decompose it into three distinct failure modes:
Capability Constraint Violation: This category occurs when the assigned tasks exceed the inherent physical capabilities of the platforms or violate global objective constraints. Typical instances include assigning attack payloads to reconnaissance-only UAVs (violating the zero-load rule), mixing incompatible munition types on a single drone, or drafting plans where the aggregate munition consumption exceeds the global inventory limits.
Range Exceedance Error: This error manifests as flawed spatial reasoning or inaccurate numerical estimations. A primary example is targeting enemy units that lie beyond the UAV’s maximum operational range or the kinematic reach of its deployed munitions.
State and Sequence Conflict: This represents complex logical inconsistencies within multi-wave sequencing, asset lifecycle management, and tactical coordination. It encompasses state-tracking failures (e.g., “asset resurrection,” where a single-use suicide drone is erroneously deployed across multiple rounds) and tactical coordination violations (e.g., exceeding the strict limit of one electronic warfare unit per wave, or separating an escort from its designated strike aircraft into different rounds). Furthermore, it includes severe tactical causality inversions, such as deploying vulnerable bombers prior to executing the SEAD.
By formally isolating these three structural failure modes, we establish a targeted foundation for the generative paradigm proposed in Section 3, where each type of physical hallucination is explicitly mitigated by a dedicated module within the heterogeneous toolchain.
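To make the three failure modes concrete, the sketch below checks a toy plan for one or two representative violations of each kind. It is not the paper's toolchain: the `UAV` and `Assignment` records, the field names, and `find_hallucinations` are hypothetical, range is simplified to straight-line one-way distance, and only a subset of the rules listed above is covered.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

@dataclass(frozen=True)
class UAV:
    uav_id: str
    kind: str                      # "recon", "attack", "ew", "suicide", ...
    position: Tuple[float, float]  # km, planar coordinates (simplification)
    max_range_km: float

@dataclass(frozen=True)
class Assignment:
    wave: int
    uav_id: str
    target_pos: Tuple[float, float]
    munition: str = ""             # empty string means no munition carried
    quantity: int = 0

def find_hallucinations(
    uavs: Dict[str, UAV], inventory: Dict[str, int], plan: List[Assignment]
) -> List[str]:
    """Return human-readable violations, prefixed by failure mode."""
    issues: List[str] = []
    used: Dict[str, int] = {}
    suicide_waves: Dict[str, Set[int]] = {}
    ew_by_wave: Dict[int, Set[str]] = {}
    for a in plan:
        uav = uavs[a.uav_id]
        # Capability: reconnaissance UAVs carry no munitions.
        if uav.kind == "recon" and a.quantity > 0:
            issues.append(f"capability: recon UAV {uav.uav_id} assigned munitions")
        if a.quantity > 0:
            used[a.munition] = used.get(a.munition, 0) + a.quantity
        # Range: target must lie within the platform's reach.
        dist = math.dist(uav.position, a.target_pos)
        if dist > uav.max_range_km:
            issues.append(f"range: {uav.uav_id} target at {dist:.0f} km exceeds reach")
        # Bookkeeping for state and sequence checks.
        if uav.kind == "suicide":
            suicide_waves.setdefault(uav.uav_id, set()).add(a.wave)
        if uav.kind == "ew":
            ew_by_wave.setdefault(a.wave, set()).add(uav.uav_id)
    # Capability: aggregate consumption must fit the global inventory.
    for munition, qty in used.items():
        if qty > inventory.get(munition, 0):
            issues.append(f"capability: {munition} demand {qty} exceeds inventory")
    # State: a single-use suicide UAV cannot fly in more than one wave.
    for uav_id, waves in suicide_waves.items():
        if len(waves) > 1:
            issues.append(f"state: suicide UAV {uav_id} reused in waves {sorted(waves)}")
    # State: at most one EW unit per wave.
    for wave, ids in ew_by_wave.items():
        if len(ids) > 1:
            issues.append(f"state: wave {wave} uses {len(ids)} EW units")
    return issues
```

A checker of this kind returns an empty list for a compliant plan and one message per violation otherwise, so its output can be passed back to a planner as correction feedback. Ordering rules such as SEAD-before-bombers would need an explicit dependency model and are left out of this sketch.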
5. Conclusions
To resolve the conflict between ambiguous intents and strict physical constraints in UAV swarm task allocation, we propose a toolchain-augmented generative planning paradigm. This approach transitions task allocation from numerical fitting toward tool-verified semantic reasoning. Supported by a decoupled dual-model architecture and an execution-manifold-anchored training method, our framework utilizes the RCP algorithm to facilitate closed-loop tactical refinement via counterfactual deduction.
AFSIM-based experiments suggest that this holistic framework mitigates LLM physical hallucinations and gradient conflicts, facilitating a more reliable mapping from unstructured intents to physically compliant schemes. While validating the framework’s macro-level feasibility, we recognize its current reliance on idealized simulations. Consequently, our future work will focus on three key areas: (1) conducting fine-grained ablation studies to evaluate individual tool contributions; (2) bridging the sim-to-real gap by incorporating physical factors such as communication delays, sensing uncertainties, and real-time execution constraints; and (3) exploring the paradigm’s generalizability across broader mission profiles. Ultimately, this research offers a promising pathway for intelligent UAV swarm decision-making.