A Generative Task Allocation Method for Heterogeneous UAV Swarms Empowered by Heterogeneous Toolchains

Ai, Lei; Ma, Bin; Zhang, Jianxing; Ai, Yao; Hao, Ziqi; Li, Jianan; Yu, Zhuting; Cheng, Jiayu

doi:10.3390/drones10040289


by Lei Ai ¹,², Bin Ma ¹, Jianxing Zhang ¹,*, Yao Ai ³, Ziqi Hao ¹, Jianan Li ¹,², Zhuting Yu ¹ and Jiayu Cheng ⁴

¹ Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
² University of Chinese Academy of Sciences, Beijing 101408, China
³ Zhuoyue Honors College, Hangzhou Dianzi University, Hangzhou 310018, China
⁴ National Laboratory on Adaptive Optics, Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
* Author to whom correspondence should be addressed.

Drones 2026, 10(4), 289; https://doi.org/10.3390/drones10040289

Submission received: 11 March 2026 / Revised: 7 April 2026 / Accepted: 11 April 2026 / Published: 16 April 2026


Highlights

What are the main findings?

We proposed a generative task allocation paradigm augmented by a heterogeneous toolchain; it employs a dual-model architecture to transform UAV swarm task allocation from numerical fitting into semantic reasoning-based scheme generation, effectively suppressing the physical hallucinations and computational deficiencies of large language models.
An execution-manifold-anchored orthogonal evolution training method and a Recursive Causal Probe algorithm are developed to resolve heterogeneous gradient conflicts and the credit assignment problem in long-horizon game scenarios through a temporal decoupling mechanism and counterfactual deduction.

What are the implications of the main findings?

The proposed method achieves an 82.34% mission success rate with only 2.69 average interactive turns, significantly outperforming traditional methods in complex adversarial scenarios.
The paradigm provides a promising technical pathway for mapping unstructured command intents to physically consistent tactical execution, offering a foundation for exploring broader multi-agent collaborative planning domains.

Abstract

Task allocation for heterogeneous unmanned aerial vehicle (UAV) swarms requires complex spatiotemporal coordination. While traditional algorithms struggle to interpret abstract semantic intents, general large language models (LLMs) often suffer from physical hallucinations and superficial tactical reasoning. To address these limitations, we propose a generative task allocation paradigm augmented by a heterogeneous toolchain, shifting the approach from rigid numerical optimization toward tool-grounded semantic planning. To implement this and overcome domain data scarcity, we design a decoupled dual-model architecture. This architecture is optimized through an execution-manifold-anchored orthogonal evolution training method. By utilizing simulated self-play within a stable execution environment, this approach prevents gradient conflicts and autonomously generates abundant training data. Furthermore, to resolve the credit assignment problem in long-horizon scenarios, we develop a Recursive Causal Probe (RCP) algorithm. By tracing failures backward through the simulation, RCP synthesizes counterfactual preference data, effectively translating tactical mistakes into precise corrections for the planning model. Extensive simulations show that our method achieves an 82.34% mission success rate in complex scenarios while requiring significantly fewer interactive corrections than general LLMs, supporting its physical feasibility and practical robustness.

Keywords:

heterogeneous UAV swarms; task allocation; large language models; recursive causal probe; heterogeneous toolchain

1. Introduction

With the rapid development of unmanned aerial vehicle (UAV) technology, heterogeneous UAV swarms, which offer synergistic capabilities in payload and operational range, are increasingly applied in fields such as modern warfare, disaster rescue, and logistics transportation [1,2,3,4,5]. However, multi-UAV collaborative task allocation is fundamentally an NP-hard combinatorial optimization problem. Existing research primarily falls into three main categories: traditional optimization algorithms, deep reinforcement learning (DRL), and emerging large language models (LLMs).

Traditional optimization algorithms mainly include heuristics and meta-heuristics. Heuristic algorithms focus on designing efficient rules: Samiei et al. [6] proposed a clustering-based Hungarian algorithm, effectively reducing the computational complexity of large-scale multi-task allocation; Li et al. [7] introduced an overlapping coalition formation game framework and designed mechanisms for resource adjustment and invalid coalition exit, solving the collaboration problem under discrete resource constraints; and Liu et al. [8] constructed a dynamic coalition model oriented towards sequential task chains to address the challenges of sequential coordination in complex environments. Meta-heuristic algorithms aim to enhance global search capabilities: Zhang et al. [9] and Ye et al. [10] utilized an improved co-evolutionary particle swarm optimization and an adaptive genetic algorithm, respectively, to address the highly coupled nature of task allocation in heterogeneous swarms; Xiong et al. [11] constructed a gridded PSO-GA hybrid strategy (GrEAPSO) to balance exploitation and exploration. Furthermore, targeting high-dimensional multi-task scenarios, Dong et al. [12] proposed a joint optimization framework for task allocation and trajectory planning, achieving Pareto optimality under multi-dimensional constraints. Although these methods perform excellently in the numerical space, their computational overhead increases exponentially with scale, and they cannot interpret tactical intents formulated in natural language, resulting in inflexible execution.

DRL algorithms have shown tremendous potential in various UAV task planning scenarios due to their end-to-end decision-making capabilities [13,14,15,16]. To address the exponential growth of the state space in large-scale swarms, Zuo et al. [17] utilized an attention mechanism to rapidly capture swarm features; for complex SEAD (Suppression of Enemy Air Defenses) tasks, Yue et al. [18] constructed a hierarchical multi-agent framework to decouple target allocation from low-level coordination; Yin et al. [19] designed a deep transfer reinforcement learning algorithm based on QMIX, significantly accelerating model convergence. However, DRL methods suffer from severe cold-start issues and limited generalization capabilities, often failing catastrophically when encountering unforeseen commands or heterogeneous tasks.

The explosive growth of LLMs [20,21,22,23] has made their integration into UAV task planning a prominent research frontier. Recent surveys by Tian et al. [24] and Jin et al. [25] highlight this paradigm shift. They reveal a key insight: while LLMs excel in semantic reasoning, a fundamental gap remains in coupling this reasoning with structured optimization and physical control. Early explorations mainly focused on intent translation and logical decomposition. For instance, Cui et al. [26] mapped abstract language into executable Python code using the TPML framework. Similarly, Li et al. [27] utilized Chain of Thought (CoT) to decompose complex swarm tasks. However, pure LLMs often suffer from severe physical hallucinations when directly generating solutions. To mitigate this, Yu et al. [28,29] introduced external feasibility checkers and dual-verification mechanisms to ensure planning safety.

Tool-augmented AI architectures naturally help align semantic reasoning with physical constraints. As Xu et al. [30] highlighted, equipping LLMs with external tools significantly reduces computational errors and hallucinations. Advancing this concept in the UAV domain, Koubaa and Gabr [31] proposed the Agentic UAV framework. This fuses LLM reasoning with tool-calling interfaces, noticeably improving dynamic task allocation and detection confidence. Furthermore, Zhang et al. [32] introduced AFlow to optimize complex, multi-step tool invocations. They demonstrated that automating tool-use workflows over computational graphs greatly enhances both reasoning efficiency and task performance.

Despite their potential in zero-shot generalization and high-level mission decomposition, directly applying LLMs to physical domains remains difficult. Specifically, existing pure LLM and early tool-augmented methods still struggle with three fundamental issues in tactical environments:

Computational and Physical Disconnect: LLMs lack native numerical and spatial reasoning skills. As a result, they struggle with strict resource and kinematic constraints, frequently generating physically impossible flight paths or payload plans.
Gradient Conflicts and Data Scarcity: Forcing a single LLM to simultaneously learn abstract tactics and rigid tool syntax often leads to optimization conflicts. Combined with the severe lack of high-quality tactical data, this causes major cold-start issues.
Lack of Deep Tactical Verification: Current LLM feedback loops usually rely on basic syntax or collision checks. They lack the deep, simulation-based feedback needed to correctly assign credit and optimize long-horizon tactics.

To address these limitations, we introduce a generative UAV task allocation paradigm augmented by a heterogeneous toolchain. We tackle the above issues through three structural innovations:

Heterogeneous Toolchain Integration: We ground semantic generation in physical reality by integrating five types of tools. Specifically, the Excellent Case Retrieval and Knowledge Graph Query tools supply tactical priors and accurate equipment parameters to prevent hallucinations; the Geospatial Calculator and Risk-Cost Calculator provide exact spatial matrices and operational assessments; and the Simulation Evaluation tool offers closed-loop environmental feedback. Together, these tools translate abstract reasoning into executable, constraint-compliant UAV assignments.
Decoupled Dual-Model Training: To prevent optimization conflicts, we design an execution-manifold-anchored orthogonal evolution training method. By separating the planning model (focused on intent) from the execution model (focused on syntax), we eliminate parameter interference. Furthermore, simulating self-play within a stable execution environment generates high-quality training data, directly overcoming the lack of domain data.
Recursive Causal Probe: We develop a dynamic decision attribution algorithm to solve the credit assignment problem in long-horizon games. Through simulation-based backward tracing, RCP isolates the exact decision responsible for a mission failure. This translates sparse failure signals into dense, actionable feedback, ensuring the generated plans are both logically sound and practically viable.

Compared with existing hybrid planning frameworks, the main contributions of this paper are summarized as follows:

1. A generative task allocation paradigm augmented by a heterogeneous toolchain. Unlike existing hybrid frameworks that rely on post-generation feasibility checks, this paradigm integrates a heterogeneous toolchain directly into the semantic reasoning loop. This effectively narrows the semantic–physical divide and translates abstract intents into executable plans, mitigating physical hallucinations.
2. An execution-manifold-anchored orthogonal evolution training method. To alleviate optimization conflicts between abstract reasoning and tool syntax in monolithic models, this method adopts a decoupled dual-model architecture. It minimizes parameter interference and leverages self-play to synthesize high-fidelity data, mitigating domain data scarcity.
3. A dynamic decision attribution algorithm based on the RCP. To address the complex credit assignment problem, RCP utilizes simulation-based backward tracing. It aims to translate sparse failure signals into actionable corrections, supporting the continuous improvement of the swarm’s tactical decision-making.

The remainder of this paper is organized as follows: Section 2 presents the mathematical model for heterogeneous UAV task allocation. Section 3 details the generative allocation paradigm augmented by the heterogeneous toolchain, the orthogonal evolution training method, and the RCP algorithm. Section 4 presents the simulation setups and results, verifying the effectiveness of the proposed methods. Finally, Section 5 concludes the paper.

2. Task Description and Modeling

Traditional numerical optimization paradigms based on Mixed-Integer Linear Programming (MILP) [33,34] struggle to process ambiguous command intents and unstructured tactical logic on the battlefield. To this end, this paper proposes a generative allocation paradigm augmented by a heterogeneous toolchain, which reconstructs the task allocation for heterogeneous UAV swarms into a problem of hierarchical intent-driven sequential decision-making and black-box constraint solving.

2.1. Task Description

This paper constructs a typical 3D Red-Blue adversarial task allocation scenario, as illustrated in Figure 1. The Red Force constitutes our heterogeneous UAV strike swarm, whereas the Blue Force comprises the enemy’s heterogeneous ground defenses and high-value targets.

Red Force entities: This consists of five categories of heterogeneous UAVs: reconnaissance, attack, electronic warfare (EW), anti-radiation, and suicide UAVs. Each unit is defined by a multi-dimensional attribute vector, encompassing payload capabilities (e.g., hard-kill missiles), kinematic parameters (e.g., operational range and velocity), and inherent attributes (e.g., radar cross-section, RCS).
Blue Force entities: This encompasses five types of ground facilities: radar stations, air defense missile silos, command centers, armored vehicles, and laser weapons. Each target possesses a specific defensive threat level and inherent attributes, and is assigned a distinct strike priority based on the dynamic battlefield situation.
Planning objectives: Under the premise of satisfying payload limits and heterogeneous capability constraints, the system must adaptively allocate limited resources to the targets. The optimization process adheres to an intent-driven hierarchical logic:
– Maximization of tactical effectiveness: Strictly align with the commander’s intent, prioritizing the guarantee of complete destruction or suppression of high-priority targets.
– Minimization of operational costs: Based on mission completion, minimize fuel consumption, ammunition expenditure, and the expected combat attrition of the Red Force as much as possible.

2.2. Task Allocation Modeling

We define the battlefield situation $S$ as a multimodal triplet $S = (U_{red}, M_{red}, T_{blue})$, encompassing the Red Force heterogeneous UAV swarm (including types, positions, and quantities), the Red Force available missile inventory (including types and quantities), and the Blue Force heterogeneous target set (including types, positions, and quantities).

In response to the unstructured natural language instruction $I$ input by the commander, this paper discards external rule engines and models it as a latent variable disentanglement task for the tactical planning model. Through the reasoning of the initial CoT, the model collapses the unstructured natural language instruction $I$ into two core control variables:

$$(P_{mode}, T_{scenario}) \sim P(c_{init} \mid I, S; ϕ) \tag{1}$$

where $c_{init}$ denotes the initial reasoning sequence generated by the model and $ϕ$ represents the parameters of the tactical planning model.

These two variables define the optimization surface and topological boundary of the subsequent planning, respectively:

Implicit Preference State $P_{mode}$: The value ranges over $\{\text{Survival}, \text{Economic}, \text{Offensive}, \text{Balanced}\}$. As a prior preference field, this state guides the generated schemes to shift towards specific evaluation regions by adjusting the weight space of multi-objective optimization.
Explicit Tactical Scenario $T_{scenario}$: Defines the macroscopic tactical background of the current mission; Table 1 illustrates the specific types of tactical scenarios. Serving as a soft constraint, it activates tactical meta-rules to guide target screening and resource allocation.

After establishing the battlefield situation $S$ and logical constraints $(P_{mode}, T_{scenario})$, this paper models the task allocation scheme as a Markov Decision Process (MDP) generated by the interaction between the semantic reasoning space and the physical execution manifold.

We define the joint interaction tuple at step $t$ as $(c_{t}, i_{t}, a_{t}, o_{t})$:

$c_{t}$ (Chain of Thought): The tactical thinking process generated by the planning model, responsible for logical deduction.
$i_{t}$ (Instruction): The specific tactical instruction issued by the planning model to the execution model.
$a_{t}$ (Action): The specific tool invocation command.
$o_{t}$ (Observation): The execution result returned by the tool.

To find the optimal policy $π^{*}$, we perform orthogonal decomposition on the joint probability distribution of the trajectory $P(τ \mid S, I)$ induced by the total historical context $H_{<t}$:

$$P(τ \mid S, I) = \prod_{t=1}^{T} \underbrace{P(c_{t}, i_{t} \mid H_{<t}, S, I; ϕ)}_{\text{Tactical Planner } π_{ϕ}} \cdot \underbrace{P(a_{t} \mid i_{t}; θ)}_{\text{Manifold Executor } π_{θ}} \tag{2}$$

This architecture achieves the complete physical decoupling of cognition and execution:

Tactical Planner Policy $π_{ϕ}$: Driven by parameters $ϕ$, it operates in the semantic space, focusing on processing complex tactical logical deductions and issuing intermediate instructions.
Manifold Executor Policy $π_{θ}$: Driven by parameters $θ$, it is responsible for deterministically anchoring the unstructured instructions into legal tool actions.

Theoretically, while the factorization in Equation (2) shares the goal of external tool integration seen in recent agentic frameworks (e.g., MUTP-LLM [29] and Agentic UAVs [31]), our approach differentiates itself through its architectural mapping. Existing tool-augmented systems often rely on a monolithic LLM to simultaneously execute abstract semantic reasoning and generate rigid API syntax. In contrast, our orthogonal decomposition introduces two key theoretical advantages. First, it promotes optimization stability via gradient decoupling. Isolating parameters $ϕ$ and $θ$ effectively mitigates the heterogeneous gradient interference and helps avoid the catastrophic forgetting of tactical logic that frequently challenges single-model tool learning [30]. Second, it enhances sample efficiency through dimensionality reduction. By acting as a deterministic bottleneck, the executor $π_{θ}$ shields the planner from complex low-level schemas (unlike direct code-generation approaches such as TPML [26]), substantially compressing the effective action space. Ultimately, these decoupled policies leverage environmental observations $o_{t}$ to achieve robust closed-loop self-iteration, allowing $π_{ϕ}$ to focus primarily on high-level tactical alignment while alleviating the cognitive overload associated with end-to-end monolithic generation.
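The gradient-decoupling argument can be made explicit by taking the logarithm of Equation (2). The lines below are a short derivation sketch for a fixed, already-sampled trajectory $τ$; they are not a result stated in the paper, and they say nothing about how the two models interact through the data they generate.

$$\log P(τ \mid S, I) = \sum_{t=1}^{T} \log P(c_{t}, i_{t} \mid H_{<t}, S, I; ϕ) + \sum_{t=1}^{T} \log P(a_{t} \mid i_{t}; θ)$$

$$\nabla_{ϕ} \log P(τ \mid S, I) = \sum_{t=1}^{T} \nabla_{ϕ} \log P(c_{t}, i_{t} \mid H_{<t}, S, I; ϕ), \qquad \nabla_{θ} \log P(τ \mid S, I) = \sum_{t=1}^{T} \nabla_{θ} \log P(a_{t} \mid i_{t}; θ)$$

Under this factorization, each likelihood gradient involves only one parameter set, which is the sense in which updates to the planner and the executor do not interfere at the parameter level.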

We define the final task allocation problem as a black-box constraint-solving problem based on simulation feedback. The complete task allocation scheme generated by the dual-model architecture, denoted as $T$, after wargaming in the high-fidelity simulation environment $E$, yields the final evaluation result $R$, which is calculated by the total effectiveness function $J$:

$$R = J(E(T) \mid P_{mode}) \tag{3}$$

The solving objective of this method aims to find a valid scheme satisfying the tactical success threshold $δ$:

$$\text{find } T \sim (π_{ϕ}, π_{θ}) \quad \text{s.t.} \quad R \geq δ \tag{4}$$

If $R < δ$, the simulation feedback containing specific penalty items will be fed back as correction prompts to the planning model $π_{ϕ}$, triggering closed-loop iterative correction until the scheme meets the standard.
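The closed-loop correction implied by Equations (3) and (4) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `generate_scheme`, `simulate`, and the `max_turns` cap are hypothetical stand-ins for the dual-model rollout, the simulation environment with its effectiveness function, and an iteration limit that the text does not specify.

```python
from typing import Callable, List, Optional, Tuple


def closed_loop_allocate(
    generate_scheme: Callable[[List[str]], str],
    simulate: Callable[[str], Tuple[float, str]],
    delta: float,
    max_turns: int = 5,  # hypothetical cap; the paper does not state one
) -> Tuple[Optional[str], int]:
    """Regenerate schemes until the score R reaches delta or turns run out."""
    feedback: List[str] = []
    for turn in range(1, max_turns + 1):
        scheme = generate_scheme(feedback)   # planner + executor rollout
        score, report = simulate(scheme)     # R = J(E(T) | P_mode)
        if score >= delta:
            return scheme, turn              # valid scheme found
        feedback.append(report)              # correction prompt for next turn
    return None, max_turns


# Toy stand-ins: each round of feedback improves the score by 15 points.
def toy_generate(feedback: List[str]) -> str:
    return f"plan-v{len(feedback)}"


def toy_simulate(scheme: str) -> Tuple[float, str]:
    version = int(scheme.split("v")[1])
    score = 70.0 + 15.0 * version
    return score, f"{scheme} scored {score}"


result = closed_loop_allocate(toy_generate, toy_simulate, delta=90.0)
```

With these toy functions the loop stops on the third turn, once the simulated score first reaches the threshold.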

2.3. Effectiveness Evaluation Model

To transform unstructured tactical intents into quantifiable mathematical optimization objectives, this paper constructs a dynamic effectiveness evaluation model based on the S-E-O (Survival–Economic–Offensive) three-dimensional framework.

2.3.1. S-E-O Evaluation Metrics

This paper defines the simulation evaluation metrics from three orthogonal dimensions: offensive, survival, and economic.

(1) O: Offensive Score
This dimension employs a time-effectiveness coupled model. Let $T_{target}$ be the set of predetermined targets, $T_{hit}$ be the set of actually destroyed targets, and $τ$ and $τ_{max}$ be the actual elapsed time and the maximum time limit, respectively. It is defined as:

$$O = η_{task} \cdot [γ_{base} + (1 - γ_{base}) \cdot η_{time}] \tag{5}$$

where $η_{task} = |T_{hit}| / |T_{target}|$ is the task completion rate; $η_{time} = 1 - τ / τ_{max}$ is the time efficiency factor. This formula adjusts the time sensitivity via $γ_{base}$, balancing the tactical value of strike coverage and rapid resolution.
(2) S: Survival Score
This dimension measures the battlefield survival rate of non-consumable units. To isolate expected tactical consumption, the set of suicide UAVs $U_{suicide}$ is explicitly excluded. Let $U_{total}$ and $U_{alive}$ be the sets of Red Force units at the initial and final stages, respectively.

$$S = \begin{cases} 1.0, & \text{if } |U_{total} \setminus U_{suicide}| = 0 \\ \dfrac{|U_{alive} \setminus U_{suicide}|}{|U_{total} \setminus U_{suicide}|}, & \text{otherwise} \end{cases} \tag{6}$$

This design guides the model to focus on the battlefield survival performance of high-value reconnaissance or strike platforms.
(3) E: Economic Score
This dimension measures the combat return on investment. To prevent reward explosion caused by extreme values, a Tanh mapping is utilized to compress values with a wide dynamic range into a normalized interval $[0, 1]$. It is defined as:

$$E = \tanh\left(\frac{V_{damage}}{C_{loss} + C_{ammo}} \cdot \frac{1}{ρ_{sat}}\right) \tag{7}$$

where $V_{damage}$ is the value of the damaged Blue Force targets, and $C_{loss}$ and $C_{ammo}$ are the battle loss and ammunition cost of the Red Force, respectively. $ρ_{sat}$ is the saturation threshold for the cost-effectiveness ratio.
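As a worked illustration of Equations (5) to (7), the three scores can be computed as below. This is a sketch under stated assumptions: the unit identifiers and numeric inputs are invented for the example, and the handling of edge cases the text leaves undefined (an empty target set, zero total cost) is deliberately omitted.

```python
import math
from typing import Set


def offensive_score(targets: Set[str], hit: Set[str],
                    tau: float, tau_max: float, gamma_base: float) -> float:
    """Equation (5); assumes a non-empty target set."""
    eta_task = len(hit) / len(targets)   # task completion rate
    eta_time = 1.0 - tau / tau_max       # time efficiency factor
    return eta_task * (gamma_base + (1.0 - gamma_base) * eta_time)


def survival_score(total: Set[str], alive: Set[str], suicide: Set[str]) -> float:
    """Equation (6); suicide UAVs are excluded from both sets."""
    reusable = total - suicide
    if not reusable:
        return 1.0
    return len(alive - suicide) / len(reusable)


def economic_score(v_damage: float, c_loss: float,
                   c_ammo: float, rho_sat: float) -> float:
    """Equation (7); assumes c_loss + c_ammo > 0."""
    return math.tanh(v_damage / (c_loss + c_ammo) / rho_sat)


# Illustrative inputs (all values are made up for this example).
o = offensive_score({"r1", "m1", "c1", "a1"}, {"r1", "m1", "c1"},
                    tau=300.0, tau_max=600.0, gamma_base=0.5)
s = survival_score({"u1", "u2", "u3", "k1"}, {"u1", "u2"}, {"k1"})
e = economic_score(v_damage=100.0, c_loss=30.0, c_ammo=20.0, rho_sat=2.0)
```

Here three of four targets are destroyed in half the allotted time, so $O = 0.75 \cdot (0.5 + 0.5 \cdot 0.5) = 0.5625$; two of the three non-suicide UAVs survive, so $S = 2/3$; and the cost-effectiveness ratio of 2 divided by $ρ_{sat} = 2$ gives $E = \tanh(1)$.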

2.3.2. Intent-Driven Effectiveness Function

To transform ambiguous intents into precise mathematical constraints, the total effectiveness function $J$ is constructed as a weighted linear combination of $S$, $E$, and $O$:

$$J = (λ_{s} \cdot S + λ_{e} \cdot E + λ_{o} \cdot O) \times 100 \tag{8}$$

where the weight vector $λ = (λ_{s}, λ_{e}, λ_{o})^{\top}$ is dynamically determined by the parsed preference mode $P_{mode}$, achieving intent alignment by reshaping the Pareto front of multi-objective optimization:

Balanced Preferences: When $P_{mode} = \text{Balanced}$, $λ$ follows a uniform distribution, i.e., $λ_{s} = λ_{e} = λ_{o}$. The model will seek a robust balance among survival, cost, and offense, making it suitable for conventional missions with ambiguous situations.
Biased Preferences: When $P_{mode} \neq \text{Balanced}$, $λ$ exhibits a sparse distribution, and the weight of a specific dimension is significantly amplified:
– Survival-oriented ($P_{mode} = \text{Survival}$): Extremely sensitive to battle losses, leaning towards generating low-risk or stand-off strike schemes.
– Economic-oriented ($P_{mode} = \text{Economic}$): Prioritizes optimizing the cost-effectiveness ratio, leaning towards strike schemes that minimize resource consumption.
– Offensive-oriented ($P_{mode} = \text{Offensive}$): Possesses a high tolerance for costs, fully optimizing the offensive metric $O$, leaning towards adopting rapid saturation attacks.
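The effect of the preference mode on Equation (8) can be illustrated with a small lookup table. The numeric weights below are assumptions chosen only for the example: the text states that the Balanced weights are equal and that biased modes amplify one dimension, but it does not give values, and normalizing the weights to sum to 1 is likewise an assumption.

```python
from typing import Dict, Tuple

# Hypothetical weights (lambda_s, lambda_e, lambda_o), assumed to sum to 1.
# Only the equality of the Balanced weights comes from the text.
WEIGHTS: Dict[str, Tuple[float, float, float]] = {
    "Balanced": (1 / 3, 1 / 3, 1 / 3),
    "Survival": (0.6, 0.2, 0.2),
    "Economic": (0.2, 0.6, 0.2),
    "Offensive": (0.2, 0.2, 0.6),
}


def effectiveness(s: float, e: float, o: float, mode: str) -> float:
    """Equation (8): weighted sum of S, E, O scaled to a 0-100 range."""
    lam_s, lam_e, lam_o = WEIGHTS[mode]
    return (lam_s * s + lam_e * e + lam_o * o) * 100.0


# The same simulated outcome scored under two different intents.
j_offensive = effectiveness(s=0.5, e=0.8, o=1.0, mode="Offensive")
j_survival = effectiveness(s=0.5, e=0.8, o=1.0, mode="Survival")
```

With these illustrative weights, a fast but costly strike (high $O$, low $S$) scores about 86 under the Offensive mode and about 66 under the Survival mode, showing how one outcome can pass or fail a fixed threshold depending on the parsed intent.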

2.4. Definition of Physical Hallucination

While the aforementioned S-E-O evaluation model establishes the quantitative criteria for assessing the effectiveness of a valid allocation plan, pure semantic LLMs frequently generate schemes that fundamentally violate underlying physical and logical boundaries. To formally contextualize the core challenges our methodology aims to resolve, we introduce the concept of physical hallucination and decompose it into three distinct failure modes:

Capability Constraint Violation: This category occurs when the assigned tasks exceed the inherent physical capabilities of the platforms or violate global objective constraints. Typical instances include assigning attack payloads to reconnaissance-only UAVs (violating the zero-load rule), mixing incompatible munition types on a single drone, or drafting plans where the aggregate munition consumption exceeds the global inventory limits.
Range Exceedance Error: This error manifests as flawed spatial reasoning or inaccurate numerical estimations. A primary example is targeting enemy units that lie beyond the UAV’s maximum operational range or the kinematic reach of its deployed munitions.
State and Sequence Conflict: This represents complex logical inconsistencies within multi-wave sequencing, asset lifecycle management, and tactical coordination. It encompasses state-tracking failures (e.g., “asset resurrection,” where a single-use suicide drone is erroneously deployed across multiple rounds) and tactical coordination violations (e.g., exceeding the strict limit of one electronic warfare unit per wave, or separating an escort from its designated strike aircraft into different rounds). Furthermore, it includes severe tactical causality inversions, such as deploying vulnerable bombers prior to executing the SEAD.

By formally isolating these three structural failure modes, we establish a targeted foundation for the generative paradigm proposed in Section 3, where each type of physical hallucination is explicitly mitigated by a dedicated module within the heterogeneous toolchain.
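To make the three failure modes concrete, the sketch below checks a draft plan for one representative instance of each. It is an illustration only: the `Assignment` fields, the type labels, and the specific rules are hypothetical simplifications, not the validation logic of the paper's toolchain.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Assignment:
    wave: int           # strike round in which the UAV is deployed
    uav_id: str
    uav_type: str       # e.g. "recon", "attack", "suicide" (hypothetical labels)
    munitions: int      # munitions carried on this sortie
    distance_km: float  # distance to the assigned target
    range_km: float     # maximum operational range of the UAV


def find_hallucinations(plan: List[Assignment], inventory: int) -> List[str]:
    issues: List[str] = []
    # 1. Capability constraint violations.
    for a in plan:
        if a.uav_type == "recon" and a.munitions > 0:
            issues.append(f"capability: {a.uav_id} is recon-only but carries munitions")
    if sum(a.munitions for a in plan) > inventory:
        issues.append("capability: total munitions exceed inventory")
    # 2. Range exceedance errors.
    for a in plan:
        if a.distance_km > a.range_km:
            issues.append(f"range: {a.uav_id} target lies beyond its range")
    # 3. State and sequence conflicts ("asset resurrection").
    waves_by_uav: Dict[str, Set[int]] = {}
    for a in plan:
        if a.uav_type == "suicide":
            waves_by_uav.setdefault(a.uav_id, set()).add(a.wave)
    for uav_id, waves in sorted(waves_by_uav.items()):
        if len(waves) > 1:
            issues.append(f"state: suicide UAV {uav_id} reused in waves {sorted(waves)}")
    return issues


draft = [
    Assignment(1, "r1", "recon", 1, 40.0, 100.0),
    Assignment(1, "a1", "attack", 2, 150.0, 120.0),
    Assignment(1, "k1", "suicide", 1, 30.0, 60.0),
    Assignment(2, "k1", "suicide", 1, 35.0, 60.0),
]
issues = find_hallucinations(draft, inventory=4)
```

The draft plan above triggers all three categories: an armed reconnaissance UAV and an inventory overrun, an out-of-range target, and a suicide UAV deployed in two waves.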

3. Methodology

3.1. Generative Paradigm Augmented by Heterogeneous Toolchain

To overcome the cognitive overload of monolithic models under high-dimensional spatiotemporal constraints and transition task allocation from numerical fitting to tactical reasoning, this paper proposes a generative task allocation architecture augmented by a heterogeneous toolchain. To explicitly elucidate the dynamic information flow and the closed-loop interactions between the planning and execution spaces, Figure 2 breaks down the execution pipeline into a seven-step numbered sequence. This paradigm orthogonally decomposes the Markov process of task generation into two independently optimized and deeply collaborative modules:

Tactical Planner ($π_{ϕ}$): Operates in the high-dimensional semantic space. It utilizes the CoT to parse ambiguous intents and conduct tactical deduction, outputting interpretable intermediate semantic instructions rather than direct control signals.
Tool Executor ($π_{θ}$): Anchored in the low-dimensional physical manifold. Through constrained decoding, it strictly maps the natural language instructions into compliant tool invocation codes. Table 2 presents the five categories of heterogeneous tools defined in this study and their input–output paradigms.

Specifically, the dynamic information flow within this architecture is executed through the following seven-step closed-loop sequence:

Step 1 (Input Processing): The current situation $S$ and instruction $I$ are fed into the Tactical Planner, initiating preference and scenario analysis followed by CoT reasoning.
Step 2 (Intent Decomposition): The Planner generates a natural language (NL) intermediate instruction and transmits it to the Tool Executor.
Step 3 (Constrained Invocation): The Executor translates the NL instruction into a structured JSON format via constrained decoding, triggering valid tool calls to Group 1 (Auxiliary Tools).
Step 4 (Observation Feedback): The execution results from Group 1 are returned to the Planner as environmental observations, forming a short-cycle cognitive loop to update the CoT context.
Step 5 (Simulation Trigger): For critical decision nodes, the Executor invokes Group 2 tools ($T_{sim}$) to simulate the proposed actions.
Step 6 (Simulation Feedback): The simulation engine evaluates the action sequence and returns the score $J$ as long-cycle feedback to the Planner for tactical refinement.
Step 7 (Plan Generation): Once the evaluation score meets the performance threshold ($J \geq δ$), the system halts the loop and outputs the Final Plan.
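Step 3 depends on every emitted tool call being structurally valid. The sketch below shows a post-hoc validity check on a JSON tool call that returns a binary score; it is not constrained decoding itself, which restricts tokens during generation. The tool names follow the paper's notation, while the `tool`/`args` envelope and the argument names are hypothetical, since Table 2 is not reproduced here.

```python
import json
from typing import Dict, Set

# Tool names follow the paper; required argument names are hypothetical.
TOOL_SCHEMAS: Dict[str, Set[str]] = {
    "T_ret": {"query"},
    "T_kg": {"entity"},
    "T_calc": {"uav_ids", "target_ids"},
    "T_cost": {"uav_ids", "target_ids"},
    "T_sim": {"plan"},
}


def rule_reward(raw_call: str) -> int:
    """Return 1 if raw_call is a well-formed call to a known tool, else 0."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return 0                      # not valid JSON
    if not isinstance(call, dict):
        return 0                      # must be a JSON object
    tool = call.get("tool")
    args = call.get("args")
    if not isinstance(tool, str) or tool not in TOOL_SCHEMAS:
        return 0                      # unknown tool name
    if not isinstance(args, dict):
        return 0                      # arguments must be an object
    # Argument names must match the schema exactly (no missing, no extra).
    return int(set(args) == TOOL_SCHEMAS[tool])


good = rule_reward('{"tool": "T_calc", "args": {"uav_ids": ["a1"], "target_ids": ["r1"]}}')
bad_tool = rule_reward('{"tool": "T_fly", "args": {}}')
bad_json = rule_reward("not json")
```

A check of this shape is one plausible way to produce a rule-based binary signal for executor outputs; the actual rules used by the authors are not described in this section.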

Beyond defining functional boundaries (Table 2), the invocation of these tools is governed by a prompt-driven workflow with conditional logic. This allows the Tactical Planner to organize the toolchain sequentially across three phases:

Initialization Phase ($T_{ret}$ and $T_{kg}$): Upon receiving an intent, the system first invokes $T_{ret}$ to retrieve historical cases as a reasoning baseline, followed by $T_{kg}$ to extract static physical constraints (e.g., payload limits) for the entities involved.
Context-Aware Computation Phase ($T_{calc}$ and $T_{cost}$): Tool selection adapts dynamically to the tactical context. $T_{calc}$ is typically triggered before spatial assignments to provide distance matrices. Meanwhile, $T_{cost}$ is invoked conditionally based on commander preferences; for example, an “economic” intent triggers risk-cost filtering prior to platform selection.
Feedback-Driven Iteration Phase ($T_{sim}$): To validate the draft plan, the system invokes $T_{sim}$. If the efficacy score $J$ falls below the threshold ($δ = 90$), the model leverages the returned battle report to adjust tactical sequencing or payloads, repeating this closed-loop refinement until the performance criteria are satisfied.

Through this workflow, heuristic semantic reasoning is effectively grounded in quantitative computation. Furthermore, decoupling tactical reasoning from tool execution helps mitigate multi-task gradient interference, providing a structural basis for the subsequent orthogonal evolution training.

3.2. Orthogonal Evolution Training Method

Based on the architecture in Section 3.1, this paper proposes an execution-manifold-anchored orthogonal evolution training method oriented towards heterogeneous dual-models. Addressing the challenges of data scarcity and gradient conflicts typical in end-to-end monolithic training, this strategy utilizes a temporal decoupling mechanism. Instead of optimizing the two models simultaneously, we separate their training phases to reflect their inherent causal dependency. Specifically, the advanced value alignment of the Tactical Planner (via the RCP algorithm) naturally requires high-fidelity counterfactual feedback from the simulation environment. To guarantee the quality of this feedback, we first solidify the “execution manifold” by training and freezing the Tool Executor as a deterministic, rule-abiding translation interface. With this reliable execution anchor established to interact with external tools, the training focus can then orthogonally shift to refining the Tactical Planner. The entire process employs Low-Rank Adaptation (LoRA) [35] for parameter-efficient fine-tuning, efficiently injecting domain tactical knowledge while retaining the general semantic capabilities of the foundational models.

To provide a clear algorithmic overview of this strategy and facilitate reproducibility, Algorithm 1 summarizes the step-by-step workflow. The detailed implementations of the three main phases are sequentially elaborated in the following subsections.

Algorithm 1 Orthogonal evolution of heterogeneous dual-models.

Input: Base Planner $π_{ϕ_{0}}$ and Executor $π_{θ_{0}}$, Seed dataset $D_{seed}$, Env $E$
Output: Optimized Planner $π_{ϕ^{*}}$ and Grounded Executor $π_{θ^{*}}$

Phase 1: Trajectory Synthesis and Decoupling
1: $D_{total} \leftarrow$ Synthesize trajectories based on $D_{seed}$ and State Transition Graph
2: $D_{plan} \leftarrow Φ_{plan}(D_{total})$ ▹ Mask underlying tool execution details
3: $D_{exec} \leftarrow Φ_{exec}(D_{total})$ ▹ Extract instruction-to-action pairs

Phase 2: Execution Manifold Anchoring (via LoRA)
4: $π_{θ_{sft}} \leftarrow \arg\min_{Δθ} L_{SFT}(D_{exec})$ ▹ Initialize base mapping
5: for each instruction $i_{t} \in D_{exec}$ do
6:     Sample candidate actions $\{a_{t}^{k}\}_{k=1}^{K} \sim π_{θ_{sft}}(\cdot \mid i_{t})$
7:     Compute rule-based binary rewards $r_{rule}(a_{t}^{k}) \in \{0, 1\}$
8:     $π_{θ^{*}} \leftarrow \arg\min_{Δθ} L_{GSPO}$ ▹ Reward-driven policy optimization
9: end for
10: Freeze Executor parameters $π_{θ^{*}}$

Phase 3: Planning Value Alignment (via LoRA)
11: $π_{ϕ_{sft}} \leftarrow \arg\min_{Δϕ} L_{SFT}(D_{plan})$ ▹ Align semantic reasoning
12: $T_{fail} \leftarrow$ Collect failed rollouts via self-play between $π_{ϕ_{sft}}$ and frozen $π_{θ^{*}}$ in $E$
13: $D_{dpo} \leftarrow \text{RCP}(T_{fail}, π_{ϕ_{sft}}, π_{θ^{*}})$ ▹ Extract causal preference slices via Algorithm 2
14: $π_{ϕ^{*}} \leftarrow \arg\min_{Δϕ} L_{DPO}(D_{dpo})$ ▹ DPO preference alignment
15: return $π_{ϕ^{*}}, π_{θ^{*}}$

3.2.1. Heterogeneous Data Construction

To address the scarcity of domain expert data, we propose a few-shot data augmentation approach. By leveraging a small set of expert demonstrations and structured generation constraints, we expand sparse initial samples into a comprehensive training set. Figure 3 illustrates this process.

(1) Expert Seed Construction

To establish a reliable baseline, domain experts constructed a seed set $D_{seed}$ of 50 high-quality samples. Formatted in structured JSON, these seeds contain the initial situation, commander instructions, and standardized multi-round Tool-CoT trajectories. The seed selection aims for broad coverage, ensuring a balanced distribution across preference-scenario tuples $(P_{mode}, T_{scenario})$ and diverse trigger conditions for all available tools.

(2) Exemplar-Guided Topology-Constrained Synthesis

We utilize in-context learning (ICL) with a general LLM (Gemini-3.0-Pro via API, with Temperature=0.7 and Top-P=0.9 to balance generation diversity and tactical coherence) for large-scale data expansion. The prompt strategy incorporates a system persona, structured state observations, and a retrieved few-shot exemplar from $D_{seed}$ to provide tactical context.

Tactical Prior Injection: Samples one highly similar expert seed from $D_{s e e d}$ as the In-Context Exemplar, implicitly transferring tactical logic through in-context learning.
Dynamic Topological Constraint: Explicitly defines the State Transition Graph $G_{STG}$ within the prompt, forcing the model to execute a recursive loop of $Instruction \to (Thought \to Tool\_Call \to Observation)^{k} \to Finish$. During generation, we integrated the LangGraph framework as the orchestration middleware. LangGraph programmatically instantiates the $G_{STG}$ as a cyclic graph, reliably routing the model’s Tool_Call to local Python tool nodes and sequentially injecting the real-time execution feedback back as the Observation. This mechanism drives the model to conduct multi-round closed-loop corrections, effectively circumventing the open-loop hallucinations of general models, thereby generating a complete trajectory set $D_{total}$.
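The topological constraint above can be illustrated with a plain-Python transition table that checks whether a generated trajectory follows the loop $Instruction \to (Thought \to Tool\_Call \to Observation)^{k} \to Finish$. This is a hedged sketch rather than the LangGraph graph used by the authors: the table `STG`, the function `follows_stg`, and the assumption that at least one tool round occurs ($k \geq 1$) are all illustrative.

```python
from typing import List

# Allowed transitions for the loop
# Instruction -> (Thought -> Tool_Call -> Observation)^k -> Finish, assuming k >= 1.
STG = {
    "Instruction": {"Thought"},
    "Thought": {"Tool_Call"},
    "Tool_Call": {"Observation"},
    "Observation": {"Thought", "Finish"},  # loop back for another round, or stop
    "Finish": set(),
}


def follows_stg(steps: List[str]) -> bool:
    """Return True if the step labels form a complete, legal walk through STG."""
    if not steps or steps[0] != "Instruction" or steps[-1] != "Finish":
        return False
    # Every consecutive pair must be an allowed edge; unknown labels have no edges.
    return all(nxt in STG.get(cur, set()) for cur, nxt in zip(steps, steps[1:]))
```

A check of this kind could serve as a post-generation filter, rejecting synthesized trajectories that skip an Observation or end without Finish.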

(3) Heterogeneous Data Decoupling

To support the dual-model architecture, the generated dataset $D_{total}$ is split into two specialized subsets:

Planning Data ( $D_{p l a n}$ ): Created via a masking mapping $Φ_{p l a n} : τ \to (S, I, H_{< t}, c_{t}, i_{t})$ , which hides the low-level tool execution details, allowing the planner $π_{ϕ}$ to focus on tactical reasoning and intent decomposition.
Execution Data ( $D_{e x e c}$ ): Created via an extraction mapping $Φ_{e x e c} : τ \to (i_{t}, a_{t})$ , which isolates instruction-to-action pairs, enabling the executor $π_{θ}$ to focus strictly on precise tool invocation.
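The two mappings can be sketched on a toy trajectory. The dictionary layout below (keys `S`, `I`, `steps`, and per-step `c`, `i`, `a`, `o` for thought, instruction, action, and observation) is a hypothetical representation chosen for illustration, and the choice to keep observations in the history $H_{<t}$ while dropping raw tool calls is an assumption, since the text does not spell out the contents of $H_{<t}$.

```python
from typing import Dict, List, Tuple


def phi_plan(traj: Dict) -> List[Tuple]:
    """Planner samples (S, I, H_<t, c_t, i_t); raw tool calls a_t are masked out."""
    samples, history = [], []
    for step in traj["steps"]:
        samples.append((traj["S"], traj["I"], tuple(history), step["c"], step["i"]))
        # Assumed history content: thought, instruction, observation (no a_t).
        history.append((step["c"], step["i"], step["o"]))
    return samples


def phi_exec(traj: Dict) -> List[Tuple[str, str]]:
    """Executor samples: instruction-to-action pairs (i_t, a_t)."""
    return [(step["i"], step["a"]) for step in traj["steps"]]
```

Each trajectory with T steps yields T planner samples and T executor pairs, which is consistent with the execution set being several times larger than the number of trajectories.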

3.2.2. Execution Manifold Anchoring

This stage adopts a temporal decoupling strategy, temporarily freezing the planning model $π_{ϕ}$ to focus on anchoring the execution model $π_{θ}$ as a highly reliable universal tool foundation. Using the LoRA strategy, the backbone parameters $θ_{0}$ are frozen, and only the low-rank matrices $Δθ$ injected into the attention layers are optimized.

Based on this architecture, we train the model using the execution dataset $D_{exec}$. To compel the model to rapidly acquire the tool format norms $K_{schema}$, we employ Supervised Fine-Tuning (SFT) [36] to minimize the negative log-likelihood loss from instruction $i_{t}$ to action $a_{t}$:

$L_{SFT}(Δθ) = -\mathbb{E}_{(i_{t}, K_{schema}, a_{t}) \sim D_{exec}} [\log π_{θ_{0} + Δθ}(a_{t} \mid i_{t}, K_{schema})]$ (9)

However, models subjected only to SFT still exhibit output hallucinations caused by long-tail distribution shifts when handling multi-parameter dependencies or numerical boundaries.

To suppress these hallucinations, we introduce the Grouped Stochastic Policy Optimization (GSPO) algorithm [37]. This method discards the high-variance value networks and utilizes formalized rules as parameter-free hard constraints. Specifically, for a given instruction $i_{t}$, a group of K candidate actions $\{a_{t}^{1}, \dots, a_{t}^{K}\}$ is sampled in parallel, and each candidate is scored with a binarized reward function $r_{rule}(a_{t}^{k}) \in \{0, 1\}$. The policy is updated by maximizing the relative generation probability of compliant actions within the group:

$L_{GSPO}(Δθ) = -\mathbb{E}_{i_{t} \sim D_{exec}} [\log \frac{\sum_{k=1}^{K} π_{θ_{0} + Δθ}(a_{t}^{k} \mid i_{t}) \cdot r_{rule}(a_{t}^{k})}{\sum_{k=1}^{K} π_{θ_{0} + Δθ}(a_{t}^{k} \mid i_{t})}]$ (10)

This process effectively concentrates the probability mass of the policy network from the illegal space toward the manifold region satisfying $r_{rule} = 1$, solidifying the model into a highly reliable execution foundation $π_{θ^{*}}$, which provides a physical consistency guarantee for subsequent tactical planning training.
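For a single instruction, the term inside the expectation of Eq. (10) can be evaluated directly from the K sampled probabilities and their binary rewards. The sketch below does exactly that and nothing more: it is a numeric illustration of the formula as printed, not a training loop, and the function name `gspo_group_loss` and the smoothing constant `eps` (added so that a group with no compliant action does not produce $\log 0$) are assumptions.

```python
import math
from typing import Sequence


def gspo_group_loss(probs: Sequence[float], rewards: Sequence[int],
                    eps: float = 1e-12) -> float:
    """Per-instruction term of Eq. (10): -log(sum_k p_k * r_k / sum_k p_k)."""
    if not probs or len(probs) != len(rewards):
        raise ValueError("probs and rewards must be non-empty and equally long")
    compliant = sum(p * r for p, r in zip(probs, rewards))  # mass on r_rule = 1
    total = sum(probs)                                      # mass on the whole group
    return -math.log((compliant + eps) / (total + eps))
```

The loss is zero when all sampled probability mass lies on rule-compliant actions and grows as mass shifts to violating ones, which matches the description of concentrating probability on the region where $r_{rule} = 1$.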

3.2.3. Planning Model Value Alignment

After anchoring the highly reliable execution foundation $π_{θ^{*}}$, this stage focuses on enhancing the tactical decision-making capabilities of the planning model $π_{ϕ}$. The entire process also employs the LoRA strategy, freezing the backbone parameters $ϕ_{0}$ and optimizing only the incremental parameters $Δϕ$.

First, SFT is implemented on $π_{ϕ}$ using the planning model dataset $D_{plan}$. An action masking mechanism is employed to filter out the underlying tool invocation details, forcing the model to concentrate on long-horizon intent decomposition and tactical logical deduction. The optimization objective is to minimize the negative log-likelihood loss of tactical thoughts and intermediate instructions:

$L_{SFT}(Δϕ_{sft}) = -\mathbb{E}_{(S, I, H_{<t}, c_{t}, i_{t}) \sim D_{plan}} [\log π_{ϕ_{0} + Δϕ_{sft}}(c_{t}, i_{t} \mid S, I, H_{<t})]$ (11)

Although this stage endows the model with basic tactical common sense, it is still unable to handle the dynamic feedback and complex causal chains in long-horizon games.

To this end, we connect the trained planning model $π_{ϕ_{0} + Δϕ_{sft}}$ and the frozen execution model $π_{θ^{*}}$ from the previous section into the simulation environment to conduct alternating self-play. As shown in Figure 4, this process constructs a low-noise tactical data generation engine, ensuring that all effectiveness discrepancies can be attributed to the quality of tactical logic.

Addressing the credit assignment bias of the traditional Direct Preference Optimization (DPO) algorithm [38] in long-horizon tasks, we introduce the RCP algorithm detailed in Section 3.3. Through counterfactual deduction, this algorithm bypasses the coarse-grained global return J, precisely extracts key tactical slices with causal significance, and constructs a high-confidence preference set $D_{dpo} = \{(H, y_{w}, y_{l})\}$. Subsequently, the DPO algorithm is utilized to optimize the policy network:

$L_{DPO}(Δϕ) = -\mathbb{E}_{(H, y_{w}, y_{l}) \sim D_{dpo}} \log σ (β \log \frac{π_{ϕ_{0} + Δϕ}(y_{w} \mid H)}{π_{ref}(y_{w} \mid H)} - β \log \frac{π_{ϕ_{0} + Δϕ}(y_{l} \mid H)}{π_{ref}(y_{l} \mid H)})$ (12)

where $π_{ref} = π_{ϕ_{0} + Δϕ_{sft}}$ is the planning model from the SFT stage. This objective steers the gradient update $\nabla_{Δϕ}$ toward the solution space that conforms to physical constraints and maximizes tactical causality.
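The per-sample term of Eq. (12) can be computed from four sequence log-probabilities. The following sketch assumes these log-probabilities are already available as floats; the function name `dpo_pair_loss` and the default `beta=0.1` are illustrative assumptions, as the value of β is not stated in this section.

```python
import math


def dpo_pair_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """Per-sample term of Eq. (12): -log sigmoid(beta * (log-ratio_w - log-ratio_l))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch on the sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is $\log 2$; it decreases as the policy raises the likelihood of $y_{w}$ relative to $y_{l}$ compared with the reference.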

3.3. Recursive Causal Probe Algorithm

Addressing the challenge of ambiguous credit assignment in long-horizon games, this section proposes a dynamic decision attribution algorithm based on the RCP. Existing attribution paradigms generally face notable limitations in this context: traditional methods [39,40] (such as reinforcement learning utilizing Generalized Advantage Estimation) can be susceptible to signal dilution over extended sequences [41], while LLM-based counterfactual reasoning (e.g., Reflexion [42]) relies primarily on semantic self-reflection, which may lack strict physical grounding. To bridge this gap, RCP adopts a dynamic backward-tracing paradigm with physical counterfactual intervention. By evaluating alternative decisions directly within the simulation, it seeks to isolate the minimal causal unit responsible for task failure, thereby reducing reliance on sparse rewards or purely semantic inferences.

The overall workflow is illustrated in Figure 5 and formally detailed in Algorithm 2. Specifically, given a failed trajectory $τ_{fail} = \{(c_{1}, i_{1}), \dots, (c_{T}, i_{T})\}$ where the terminal effectiveness falls below the success threshold ($J < δ$), the algorithm proceeds through the following structured phases:

Step 1 (Backward Tracing): The algorithm traces back frame-by-frame starting from the terminal step T. At any given time t, the historical context $H_{< t}$ is frozen.
Step 2 (Counterfactual Generation): A counterfactual intervention is applied to the planning policy $π_{ϕ}$. Using Monte Carlo sampling, the model generates M alternative candidate branches $B = \{y_{1}^{'}, \dots, y_{M}^{'}\}$ that deviate from the original failure-inducing decision $y_{l} = (c_{t}, i_{t})$.
Step 3 (Multi-Turn Rollout): Utilizing the frozen execution model $π_{θ^{*}}$ , each of the M candidate branches ( $y^{'} \in B$ ) undergoes a closed-loop forward rollout. Starting by executing the alternative decision $y^{'}$ , the planner iteratively reasons and invokes necessary tools up to a predefined maximum number of simulation turns $N_{m a x}$ :

$τ_{b r a n c h}^{'} = Rollout (π_{ϕ}, π_{θ^{*}}, H_{< t}, y^{'}, N_{m a x})$

(13)
Step 4 (Causal Extraction and Early Stopping): If a forward rollout successfully achieves the mission objective ( $J (τ_{branch}^{'}) \geq δ$ ) within the maximum allowed turns, time t is identified as the Tactical Turning Point. The corrective decision is extracted as the positive sample $y_{w}$, and the preference triple $(H_{<t}, y_{w}, y_{l})$ is added to $D_{dpo}$. To isolate the minimal causal unit and reduce computational overhead, the backward search terminates immediately once the first valid turning point is found.

Algorithm 2 Recursive Causal Probe (RCP).

Input: Failure set $T_{fail}$, Planner $π_{ϕ}$, Executor $π_{θ^{*}}$, Score function $J$, Threshold $δ$, Sample size $M$, Max simulation iteration turns $N_{max}$
Output: Causal Preference Dataset $D_{dpo}$

1: $D_{dpo} \leftarrow \emptyset$
2: for each trajectory $τ = \{(c_{1}, i_{1}), \dots, (c_{T}, i_{T})\} \in T_{fail}$ do
3:   for $t \leftarrow T$ down to 1 do
4:     $H_{<t} \leftarrow$ Historical context up to step $t$
5:     $y_{l} \leftarrow (c_{t}, i_{t})$
6:     Generate candidate branches via Monte Carlo sampling: $B \leftarrow Sample_{MC}(π_{ϕ}(\cdot \mid H_{<t}), M)$
7:     Evaluate multi-turn rollout up to $N_{max}$:
8:     $V \leftarrow \{y^{'} \in B \setminus \{y_{l}\} \mid J(Rollout(π_{ϕ}, π_{θ^{*}}, H_{<t}, y^{'}, N_{max})) \geq δ\}$
9:     if $V \neq \emptyset$ then
10:       Select a successful corrective decision $y_{w} \in V$
11:       $D_{dpo} \leftarrow D_{dpo} \cup \{(H_{<t}, y_{w}, y_{l})\}$
12:       break
13:     end if
14:   end for
15: end for
16: return $D_{dpo}$

Theoretically, exhaustive backward tracing scales as $O(M \cdot T^{2})$ for a trajectory of length T. However, the early stopping mechanism (Algorithm 2, Line 12) effectively mitigates this worst-case quadratic bound. By terminating the search upon discovering the first valid turning point, the practical search depth is substantially reduced. Consequently, the actual wall-clock efficiency depends on the sampling size M, which influences the probability of early termination. A detailed empirical analysis of this computational overhead and the ablation of M under hardware constraints is presented in Section 4.4.2.
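The backward search with early stopping in Algorithm 2 can be sketched as follows. This is an illustrative outline under stated assumptions: `sample` is a hypothetical stand-in for Monte Carlo sampling from the planner given $H_{<t}$, and `rollout_score` is a hypothetical wrapper that runs the closed-loop rollout (planner, frozen executor, up to $N_{max}$ turns) and returns J. The defaults `delta=90.0` and `m=8` reuse values reported elsewhere in the article.

```python
from typing import Callable, List, Sequence, Tuple

Decision = Tuple[str, str]  # (thought c_t, instruction i_t)
Preference = Tuple[List[Decision], Decision, Decision]  # (H_<t, y_w, y_l)


def recursive_causal_probe(
    failures: Sequence[List[Decision]],
    sample: Callable[[List[Decision], int], List[Decision]],      # hypothetical
    rollout_score: Callable[[List[Decision], Decision], float],   # hypothetical
    delta: float = 90.0,
    m: int = 8,
) -> List[Preference]:
    """Trace each failed trajectory backward and keep its first turning point."""
    dataset: List[Preference] = []
    for traj in failures:
        for t in range(len(traj) - 1, -1, -1):  # t = T down to 1 (0-based index here)
            history, y_l = traj[:t], traj[t]
            # Candidate branches that deviate from the failure-inducing decision.
            branches = [y for y in sample(history, m) if y != y_l]
            winners = [y for y in branches if rollout_score(history, y) >= delta]
            if winners:
                dataset.append((history, winners[0], y_l))
                break  # early stop: first valid turning point found
    return dataset
```

Because the inner loop breaks at the first successful branch, a trajectory whose failure is caused near its end triggers only a few rollouts, which is the effect the early stopping discussion above relies on.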

4. Experiments

To verify the effectiveness of the proposed generative allocation paradigm in complex battlefield environments, this paper conducts extensive comparative and ablation experiments.

4.1. Dataset Construction

To assess tactical planning, tool execution, and RCP performance, we established training and evaluation datasets across diverse scenarios (detailed in Table 3).

(1) Training Dataset Construction:

Based on the data synthesis method in Section 3.2.1, we generated 5000 high-quality Tool-CoT trajectories and accordingly constructed three training sets:

Planning Model Dataset ( $D_{p l a n}$ ): Directly adopts the 5000 complete trajectory samples mentioned above to fine-tune the long-horizon reasoning capability of the planning model.
Execution Model Dataset ( $D_{exec}$ ): Comprises a total of 31,338 instruction–action pairs extracted from the 5000 complete trajectories. As detailed in Section 3.2.1, this large volume (averaging ∼6.3 pairs per trajectory) reflects the dense multi-tool invocation nature of complex tactical reasoning. However, to prevent overfitting to frequently invoked but structurally simple tools during training, the full set undergoes difficulty-aware downsampling. Specifically, we retain all complex samples ( $T_{sim}$ and $T_{cost}$ ) while proportionally downsampling the remaining simpler tool types, thereby ensuring a balanced representation prior to execution model training.
Preference Alignment Dataset ( $D_{d p o}$ ): Utilizing the self-play engine and the RCP algorithm, 3000 pairs of preference samples were automatically filtered for alignment training during the DPO stage.
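The difficulty-aware downsampling of $D_{exec}$ might look like the sketch below. The article states only that all $T_{sim}$ and $T_{cost}$ samples are retained and the simpler tool types are downsampled proportionally; the single shared `keep_ratio`, the `tool` key on each pair, and the fixed random seed are assumptions made for illustration.

```python
import random
from typing import Dict, List

COMPLEX_TOOLS = {"T_sim", "T_cost"}  # always retained, per the text


def downsample(pairs: List[Dict], keep_ratio: float, seed: int = 0) -> List[Dict]:
    """Keep every complex-tool pair; sample a fixed fraction of each simpler tool."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    rng = random.Random(seed)  # seeded for reproducible subsets
    by_tool: Dict[str, List[Dict]] = {}
    for pair in pairs:
        by_tool.setdefault(pair["tool"], []).append(pair)
    kept: List[Dict] = []
    for tool, group in by_tool.items():
        if tool in COMPLEX_TOOLS:
            kept.extend(group)
        else:
            k = max(1, round(len(group) * keep_ratio))  # never drop a tool entirely
            kept.extend(rng.sample(group, k))
    return kept
```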

(2) Test Dataset Construction:

To evaluate tactical planning, tool execution, and RCP sampling performance, three evaluation sets were constructed:

Tactical Reasoning Test Set ( $D_{t e s t}$ ): Contains 300 complex tactical base situations. To ensure statistical confidence and mitigate the impact of environmental stochasticity, each base situation is evaluated across three independent randomized runs. These samples evenly cross-cover five typical tactical scenarios and four command preferences to provide a robust assessment of overall planning effectiveness.
Atomic Tool Test Set ( $D_{t e s t - t o o l}$ ): Contains 1000 independent invocation instructions, symmetrically covering all five categories of heterogeneous tools to individually verify the robustness of the execution foundation.
Causal Probe Dataset ( $D_{r c p}$ ): Comprises 100 failed trajectories extracted to analyze the computational overhead and sampling efficiency of the RCP algorithm. This dataset serves as the basis for the ablation study of the sampling parameter M in Section 4.4.2.

4.2. Experimental Setups

4.2.1. Experimental Platform Configuration

The experimental evaluations are conducted within the Advanced Framework for Simulation, Integration, and Modeling (AFSIM, version 2.9.0), a high-fidelity simulation engine. The underlying computational framework operates on an Ubuntu 22.04 LTS system powered by an Intel Xeon Gold 5418Y CPU, 256 GB of memory, and an NVIDIA H100 (80 GB) GPU. This hardware is paired with a deep learning software stack comprising Python 3.10.18, PyTorch 2.10.0, and CUDA 12.8. Deployed upon this infrastructure, the proposed heterogeneous dual-model architecture employs GPT-OSS-20B as the planning model to facilitate long-horizon tactical reasoning, paired with Qwen3-4B-Instruct as the execution model.

4.2.2. Evaluation Metrics

To ensure the objectivity and discriminability of the scoring, the parameters of the S-E-O evaluation framework are instantiated as follows:

(1) Offensive Score (O): Sets the base effectiveness weight at $γ_{base} = 0.9$ to ensure task completion rate dominance, while time efficiency provides at most 10% marginal gain:

$O = η_{task} \cdot [0.9 + 0.1 \cdot \max(0, 1 - τ / τ_{max})]$ (14)
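Eq. (14) translates directly into a short function. In this sketch the readings of $η_{task}$ as the task completion rate, τ as elapsed mission time, and $τ_{max}$ as the time limit are inferred from the surrounding sentence and should be treated as assumptions; the function name is illustrative.

```python
def offensive_score(eta_task: float, tau: float, tau_max: float,
                    gamma_base: float = 0.9) -> float:
    """Eq. (14): O = eta_task * [gamma_base + (1 - gamma_base) * max(0, 1 - tau/tau_max)]."""
    if tau_max <= 0:
        raise ValueError("tau_max must be positive")
    time_bonus = max(0.0, 1.0 - tau / tau_max)  # clipped to zero once tau exceeds tau_max
    return eta_task * (gamma_base + (1.0 - gamma_base) * time_bonus)
```

With full task completion, an instantaneous mission scores 1.0 and any mission at or beyond $τ_{max}$ scores 0.9, so speed can change the score by at most 10%.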

(2) Survival Score (S): This calculation utilizes the set difference $|U_{total} \setminus U_{suicide}|$ to explicitly exclude the predefined suicide UAV set, calculating only the survival rate of reconnaissance and strike UAVs.

(3) Economic Score (E): Sets the cost-effectiveness ratio saturation threshold at $ρ_{sat} = 150.0$, guiding the model to prioritize asymmetric, high-efficiency strikes under resource-constrained conditions.

For the linear total effectiveness function J, the preference weight vector $λ$ is configured as follows:

(1) Balanced Preference ($P_{mode} = Balanced$): Uniform weight distribution ($λ_{s} = λ_{e} = λ_{o} = 1/3$), aiming to eliminate the short-board (weakest-link) effect and guide the model to generate robust, multi-dimensional balanced schemes.

(2) Biased Preference ($P_{mode} \neq Balanced$): Weights exhibit a low-entropy sparse distribution. Let the primary dimension weight be $λ_{pri} = 0.7$ and each secondary dimension weight be $λ_{sec} = 0.15$. Taking the economic-oriented preference as an example, the weight vector in the order $(λ_{s}, λ_{e}, λ_{o})$ is $[0.15, 0.7, 0.15]^{T}$, forcing the model to prioritize searching the solution space for high cost-effectiveness.
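The weight configurations above can be written as a lookup table feeding the linear effectiveness function. This is a sketch: the mode keys (`"balanced"`, `"survival"`, `"economic"`, `"offensive"`) are hypothetical labels, the weight order $(λ_{s}, λ_{e}, λ_{o})$ is inferred from the economic-oriented example, and the three scores are assumed to share a common scale.

```python
from typing import Dict, Tuple

# Weights in the order (lambda_s, lambda_e, lambda_o); mode names are illustrative.
WEIGHTS: Dict[str, Tuple[float, float, float]] = {
    "balanced": (1 / 3, 1 / 3, 1 / 3),
    "survival": (0.7, 0.15, 0.15),
    "economic": (0.15, 0.7, 0.15),
    "offensive": (0.15, 0.15, 0.7),
}


def total_effectiveness(s: float, e: float, o: float, mode: str = "balanced") -> float:
    """Linear total effectiveness J = lambda_s * S + lambda_e * E + lambda_o * O."""
    lam_s, lam_e, lam_o = WEIGHTS[mode]
    return lam_s * s + lam_e * e + lam_o * o
```

For example, a plan with scores (S, E, O) = (60, 100, 60) evaluates to about 88 under the economic-oriented weights but only about 73.3 under the balanced weights, showing how a biased preference rewards specialization.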

4.2.3. Baseline Algorithms and Models

To evaluate the proposed paradigm, we selected foundational algorithms from distinct methodological categories as baselines. This approach helps explore the characteristics of different problem-solving pathways (e.g., numerical optimization vs. semantic generation) without relying on highly domain-specific variants:

Ant Colony Optimization (ACO) [43]: A representative heuristic algorithm, serving as a basic reference for discrete combinatorial search.
Deep Q-Network (DQN) [44] and Proximal Policy Optimization (PPO) [41]: Selected as foundational value-based and policy-gradient DRL methods. Given our centralized task formulation, they help observe common challenges—such as dimensionality scaling and sparse-reward degradation—often encountered by conventional end-to-end numerical paradigms lacking semantic priors.
Gemini-3.0-Flash and GLM-5 [45]: Selected as representative monolithic large language models. To ensure a fair structural comparison, these models are equipped with the same heterogeneous toolchain via Few-shot Prompting. This configuration serves as a direct control group to help illustrate the performance differences between conventional single-model tool invocation and our proposed decoupled dual-model paradigm.

4.2.4. Training Hyperparameters

To facilitate reproducibility and ensure a fair comparison, we carefully tuned the hyperparameters across all evaluated methods. The primary tuning criterion was to maximize the overall average evaluation score within acceptable training or inference timeframes, thereby balancing algorithmic performance with computational efficiency. For methodological transparency, all specific configurations—including the parameters for the heuristic and RL baselines, the inference settings for the general LLMs, and both the fine-tuning (LoRA, GSPO) and inference setups for our proposed dual-model architecture—are detailed in Appendix A.

4.3. Comparative Experiments

To evaluate the proposed heterogeneous dual-model paradigm on the test set $D_{test}$, we compared it against traditional heuristic/reinforcement learning algorithms (ACO, DQN, PPO) and general large language models (GLM-5, Gemini-3.0-Flash). Since traditional baselines cannot directly process natural language, their inputs were manually translated into numerical constraint matrices; additionally, rule-based action masking was applied during their sampling phases to help minimize basic rule violations. The evaluation comprehensively covers macro-level task effectiveness, including mission success rate (MSR), Average Task Score (Avg. Score), and average interactive turns (Avg. Turns), as well as micro-level logical robustness. This robustness is measured by three dedicated metric rates corresponding to the physical hallucinations formulated in Section 2.4: Capability Constraint Violation Rate (CCVR), Range Exceedance Rate (RER), and State and Sequence Conflict Rate (SSCR). The overall quantitative results are summarized in Table 4.

Experimental results indicate that traditional baselines and general models both possess significant limitations when handling complex tactical planning:

Traditional Baselines (ACO/DQN/PPO): Although the heuristic ACO maintained a reasonable average score of 82.16, its MSR was 62.34% with an average of 5.42 interactive turns, indicating that heuristic search often struggles to converge efficiently within limited steps. Conversely, the DRL methods (DQN and PPO) experienced notable policy degradation due to sparse rewards and vast action spaces; despite rapid single-turn decision-making, their MSRs dropped to 43.00% and 27.00%, respectively, limiting their consistent mission completion.
General LLMs (GLM-5/Gemini-3.0-Flash): Benefiting from inherent semantic reasoning capabilities, both models achieved competitive average scores (85.51 and 84.12). However, as monolithic architectures, they appear to face challenges in internalizing strict physical boundaries. This is reflected in their relatively high State and Sequence Conflict Rates (SSCRs of 53.12% and 40.00%), leading to sub-optimal MSRs (∼55%) despite multiple trial-and-error corrections via tool feedback.
Proposed Method (Ours): By decoupling tactical reasoning from tool execution, our architecture achieved an MSR of 82.34% and the highest average score (93.17). Importantly, this advantage appears to extend beyond a simple distillation effect from the stronger Gemini-3.0-Pro model used for data synthesis. If the performance gains were predominantly driven by inherited semantic priors, the model would likely still exhibit the high physical conflict rates common among monolithic architectures. Instead, our method substantially mitigated physical hallucinations (e.g., SSCR reduced to 15.10%, CCVR to 6.10%). This suggests that the structural decoupling, combined with targeted physical alignment (GSPO and RCP), plays a primary role in anchoring semantic logic to physical constraints.

To further evaluate the tactical logic, we selected three typical heterogeneous situations from the test set. Figure 6 displays the 3D simulation deployment snapshots, and Figure 7 projects them into 2D task allocation topologies, clearly revealing the logical discrepancies among the methods regarding wave planning and target allocation (allocation statistics are in Table 5).

Experimental results in Table 5 and Figure 7 indicate that existing baselines struggle to balance mission efficacy with resource constraints. Although ACO achieves high scores (e.g., 98.95 in Scenario 3), it incurs notable delays, requiring five to eight iterations to converge. Conversely, DRL methods (PPO and DQN) enable rapid inference but often execute resource-intensive saturation strikes; despite the “Balanced” preference in Scenario 3, they dispatched 47 and 46 UAVs, respectively, showing a limited capacity to internalize resource-saving intents. Furthermore, general LLMs (GLM-5 and Gemini-3.0-Flash) struggle with tactical resource optimization, often relying on repetitive trial-and-error (e.g., four turns in Scenario 1) or overly conservative force allocation (e.g., 47 UAVs in Scenario 2) to ensure success. In contrast, the proposed architecture navigates these trade-offs more effectively. While maintaining highly competitive scores (e.g., 98.11 in Scenario 2), it compresses the decision cycle to one or two turns and allocates forces more precisely, using only 19 UAVs under identical conditions. This indicates notable improvements in balancing execution speed, cost-effectiveness, and tactical rationality.

4.4. Ablation Studies

To comprehensively deconstruct the sources of effectiveness in the proposed heterogeneous toolchain-augmented paradigm and verify the necessity of the execution manifold anchoring and orthogonal evolution strategies, we designed multi-dimensional ablation experiments.

4.4.1. Empirical Analysis of Multi-Task Gradient Conflicts

Before systematically evaluating the architectural variants, we first empirically investigate a primary optimization challenge in monolithic training—multi-task gradient conflicts—which motivated our decoupled design. To observe these internal dynamics, Figure 8 traces the joint training process of the monolithic end-to-end mode (M2) over 600 steps.

As illustrated in the bottom subplot, after an initial rapid descent, the gradient cosine similarity between the CoT and tool execution (Tool) tasks frequently fluctuates around zero and occasionally plunges below it (the pink conflict region). This recurrent negative cosine similarity ($\cos < 0$) serves as a strong indicator of destructive interference, suggesting that the optimization directions of the two tasks frequently diverge. Furthermore, an analysis of the gradient norms reveals a pronounced task imbalance in the later stages (steps 200–600): the gradient norm of CoT (red line) largely dominates that of the Tool task (blue line). Since the model tends to fit the low-level tool syntax more readily than the high-dimensional tactical reasoning, the larger CoT updates can negatively interfere with the learned execution representations. Consequently, as shown in the top subplot, this internal gradient competition corresponds to noticeable training instability: the overall training loss plateaus with high-frequency oscillations around 0.2 after step 200, hindering further smooth convergence. These microscopic optimization dynamics provide strong empirical support for the idea that physically decoupling cognition from execution can effectively alleviate the multi-task capacity bottleneck.
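The conflict indicator discussed above is the cosine similarity between the two task gradients. A minimal sketch, assuming the gradients of the CoT loss and the Tool loss have already been flattened into equal-length vectors (how the authors extract them is not specified here):

```python
import math
from typing import Sequence


def grad_cosine(g_cot: Sequence[float], g_tool: Sequence[float]) -> float:
    """Cosine similarity of two flattened gradients; negative values signal conflict."""
    if len(g_cot) != len(g_tool):
        raise ValueError("gradient vectors must have the same length")
    dot = sum(a * b for a, b in zip(g_cot, g_tool))
    norm = math.sqrt(sum(a * a for a in g_cot)) * math.sqrt(sum(b * b for b in g_tool))
    return dot / norm if norm > 0 else 0.0  # define similarity as 0 for a zero gradient
```

A value near 1 means the two tasks pull the shared parameters in the same direction, a value near 0 means they are unrelated, and a negative value means an update that helps one task locally hurts the other.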

4.4.2. RCP Computational Overhead

To evaluate the practical feasibility of the RCP algorithm, we examined the trade-off between exploration breadth and computational overhead. Using the dataset $D_{rcp}$ containing 100 failed trajectories, we conducted three independent trials for each configuration to ensure statistical reliability. The mean total wall-clock time required for data generation across varying M configurations (with $N_{max} = 5$) was recorded, as illustrated in Figure 9. Detailed hardware specifications and model inference settings are provided in Section 4.2.1 and Appendix A, respectively.

The total runtime exhibits a U-shaped curve relative to M. As M increases from 1 to 8, the mean elapsed time decreases from 42.0 min to 25.8 min. This indicates that a larger M improves the probability of identifying a valid corrective branch early in the backward traversal, thereby reducing the overall search depth via the early-stopping mechanism. However, as M scales further to 16 or 32, the parallel rollout overhead begins to exceed the concurrency limits of the simulation environment and GPU inference capabilities. This results in significant queuing delays, increasing the mean total time to 37.5 min. Consequently, $M = 8$ is selected as the default configuration to optimally balance causal probing efficiency with available system resources.

4.4.3. Training Framework and Architecture Evolution

To deconstruct the effectiveness gains of the heterogeneous dual-model architecture and the orthogonal evolution strategy, we constructed eight model variants (M1–M8), ranging from the baseline to the complete method, and evaluated them on the test set. Evaluation metrics cover mission success rate (MSR), average score, and average interactive turns; detailed results are shown in Table 6.

Architectural Decoupling (M1–M5): The comparison between monolithic (M1–M3) and dual-model (M4–M5) configurations illustrates the potential challenges of multi-objective optimization within a single parameter space. Interestingly, M3 yielded an MSR of only 15.34%, performing even worse than M2 (24.00%). This degradation suggests that forcing a single network to simultaneously handle flexible tactical reasoning and rigid structural syntax likely exacerbates task-specific interference. Conversely, utilizing the identical decoupled data, M5 restored the MSR to 53.67%. This indicates that a key bottleneck in monolithic paradigms may lie in their inability to isolate interference between heterogeneous decision processes.
Execution Manifold Anchoring (M5–M6): The introduction of GSPO in M6 contributed to a more stable deterministic execution. While M5 remains susceptible to execution drift during complex reasoning chains, M6 helps mitigate the upward propagation of underlying uncertainties by anchoring a low-noise execution manifold. This mechanism elevated the MSR from 53.67% to 63.00% and reduced the variance in average scores, highlighting the value of establishing a consistent interactive foundation between the planner and the environment.
Causal Preference Alignment (M7–M8): The transition from M7 to M8 underscores the role of credit assignment in long-horizon tasks. While M7 (standard DPO) improved the MSR to 71.67%, the enhancement in interactive efficiency was relatively modest (3.05 turns). By introducing the RCP algorithm to extract counterfactual causal slices, M8 achieved an MSR of 82.34% and further compressed the interactive cycle to 2.69 turns. This suggests that the RCP algorithm aids in mitigating sparse-reward bias, thereby supporting more robust tactical reasoning through targeted causal attribution.

4.4.4. Execution Manifold Stability

To quantify the effectiveness of the LoRA-GSPO strategy in constructing a deterministic execution manifold, we compared the zero-shot baseline and our fine-tuned variants on the tool benchmark set $D_{test-tool}$. To account for inference randomness and ensure statistical robustness, all evaluations were conducted across three independent trials. The evaluation focuses on the mean Execution Accuracy (EA) and Schema Error Rate (SER) as core metrics; the results are presented in Table 7.

Experimental results indicate varying levels of proficiency among the evaluated models when handling the heterogeneous toolchain. While most models easily manage basic retrieval tasks ($T_{ret}$ and $T_{kg}$), noticeable disparities emerge in tasks requiring complex logic and strict physical constraints. Specifically, the Base model struggles with simulation evaluation ($T_{sim}$ accuracy falling below 50%) and exhibits the highest schema error rate (SER, 2.30%), whereas advanced general models like Gemini-3.0-Flash and GLM-5 handle these challenges effectively with near-zero SERs. As for the dedicated execution models, $π_{θ}$ (trained solely via SFT) noticeably reduces schema errors compared to the Base model (SER of 0.40%), yet its adaptation to high-dimensional simulation constraints remains a bottleneck, yielding only 70.80% on $T_{sim}$. The incorporation of GSPO largely resolves this limitation: the optimized $π_{θ}^{*}$ not only raises $T_{sim}$ accuracy to 91.45%, close to the best general model (Gemini-3.0-Flash, 94.02%), but also achieves near-perfect performance on the computational tools ($T_{calc}$ and $T_{cost}$). Ultimately, with a 0.00% SER alongside the highest average accuracy of 97.83%, these findings suggest that the proposed $π_{θ}^{*}$ configuration establishes a reliable and robust execution foundation for the generative paradigm.
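
The two metrics can be computed per trial and then aggregated across the three trials roughly as follows. This is a sketch under assumptions: the record fields `schema_ok` and `correct`, the rule that a schema-invalid call also counts as an execution failure, and the use of the sample standard deviation are illustrative choices, not definitions taken from the paper.

```python
from statistics import mean, stdev


def trial_metrics(records):
    """Return (EA, SER) in percent for one trial.

    records: list of dicts with boolean 'schema_ok' and 'correct' flags
    (hypothetical field names).
    """
    n = len(records)
    # A call counts toward EA only if it is well-formed and its result is correct.
    ea = 100.0 * sum(1 for r in records if r["schema_ok"] and r["correct"]) / n
    # SER is the share of calls that fail schema validation.
    ser = 100.0 * sum(1 for r in records if not r["schema_ok"]) / n
    return ea, ser


def summarize(trials):
    """Aggregate per-trial (EA, SER) pairs into (mean, std) for each metric."""
    eas, sers = zip(*(trial_metrics(t) for t in trials))
    return (mean(eas), stdev(eas)), (mean(sers), stdev(sers))
```

Calling `summarize` on three lists of call records yields the mean and spread for EA and SER in the same "mean ± std" form used in Table 7.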

5. Conclusions

To resolve the conflict between ambiguous intents and strict physical constraints in UAV swarm task allocation, we propose a toolchain-augmented generative planning paradigm. This approach transitions task allocation from numerical fitting toward tool-verified semantic reasoning. Supported by a decoupled dual-model architecture and an execution-manifold-anchored training method, our framework utilizes the RCP algorithm to facilitate closed-loop tactical refinement via counterfactual deduction.

AFSIM-based experiments suggest that this holistic framework mitigates LLM physical hallucinations and gradient conflicts, facilitating a more reliable mapping from unstructured intents to physically compliant schemes. While validating the framework’s macro-level feasibility, we recognize its current reliance on idealized simulations. Consequently, our future work will focus on three key areas: (1) conducting fine-grained ablation studies to evaluate individual tool contributions; (2) bridging the sim-to-real gap by incorporating physical factors such as communication delays, sensing uncertainties, and real-time execution constraints; and (3) exploring the paradigm’s generalizability across broader mission profiles. Ultimately, this research offers a promising pathway for intelligent UAV swarm decision-making.

Author Contributions

Conceptualization, L.A. and J.Z.; software, L.A., Z.H. and J.L.; validation, L.A., Y.A., J.L., Z.Y. and J.C.; formal analysis, L.A.; investigation, L.A., Z.H. and J.L.; resources, B.M.; data curation, L.A. and Y.A.; writing—original draft preparation, L.A.; writing—review and editing, J.Z., B.M., L.A., Y.A., Z.H., J.L., Z.Y. and J.C.; supervision, J.Z. and B.M.; project administration, B.M.; funding acquisition, B.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research Program of China (grant number 2024153240) and the Shanghai Municipal Commission of Science and Technology Research Project (grant number 19511133100).

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request, subject to institutional data sharing guidelines.

Acknowledgments

The authors sincerely thank all the reviewers and editors for their valuable comments on this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Hyperparameter Configurations

This appendix provides the detailed hyperparameter settings used in our experiments to facilitate the reproducibility of both the baseline methods and our proposed paradigm.

Table A1, Table A2 and Table A3 list the parameter configurations for the Ant Colony Optimization (ACO), Deep Q-Network (DQN), and Proximal Policy Optimization (PPO) baselines, respectively. Table A4 and Table A5 detail the inference configurations for the evaluated large language models (Gemini-3.0-Flash and GLM-5).

Table A1. Hyperparameter configuration for ACO algorithm.

Parameter	Value
Population Size ($N_{ants}$)	50
Max Iterations (K)	10
Pheromone Weight ($α$)	1.0
Heuristic Weight ($β$)	2.0
Evaporation Rate ($ρ$)	0.2
Pheromone Constant (Q)	100.0
Elite Ratio ($E_{ratio}$)	0.1
Pheromone Limits ($τ_{min}, τ_{max}$)	0.1, 10.0
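
As a rough illustration of how the parameters in Table A1 typically enter an ant colony optimizer, the sketch below shows the usual transition rule and a pheromone update with evaporation and clamping. It assumes a generic ACO with bounded pheromone values; the function names, the `Q / cost` deposit rule, and the flat list representation are hypothetical, and neither the elite-ratio deposit nor the authors' actual encoding of UAV-target assignments is reproduced.

```python
# Values taken from Table A1.
ALPHA, BETA, RHO, Q = 1.0, 2.0, 0.2, 100.0
TAU_MIN, TAU_MAX = 0.1, 10.0


def transition_probs(tau, eta):
    """Probability of each candidate, proportional to tau^alpha * eta^beta."""
    weights = [(t ** ALPHA) * (h ** BETA) for t, h in zip(tau, eta)]
    total = sum(weights)
    return [w / total for w in weights]


def update_pheromone(tau, chosen, cost):
    """Evaporate every trail, deposit Q / cost on the chosen one, then clamp."""
    new_tau = [(1.0 - RHO) * t for t in tau]
    new_tau[chosen] += Q / cost  # assumed deposit rule
    return [min(TAU_MAX, max(TAU_MIN, t)) for t in new_tau]


# Two candidates with equal pheromone; the second has twice the heuristic value.
print(transition_probs([1.0, 1.0], [1.0, 2.0]))
print(update_pheromone([1.0, 0.1], chosen=0, cost=50.0))
```

With $β$ = 2.0, doubling the heuristic value quadruples a candidate's weight, which is why the heuristic dominates early iterations before pheromone differences build up.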

Table A2. Hyperparameter configuration for DQN algorithm.

Parameter	Value
Optimization and DQN
Learning Rate ($η$)	$1 \times 10^{-4}$
Discount Factor ($γ$)	1.0
Exploration Initial ($ϵ_{start}$)	1.0
Exploration Final ($ϵ_{end}$)	0.05
Exploration Decay	10,000 steps
Target Update Frequency	1000 steps
Replay Buffer Size	100,000
Batch Size	64
Network Architecture
State Dimension	128
Hidden Layer Dimension	256

Note: Italic text denotes parameter category headers.
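
The exploration settings in Table A2 can be read as an annealing schedule for an epsilon-greedy policy. The sketch below assumes a linear decay over the 10,000 steps, which the table does not state explicitly; the function name `epsilon_at` is hypothetical.

```python
# Values taken from Table A2.
EPS_START, EPS_END, DECAY_STEPS = 1.0, 0.05, 10_000


def epsilon_at(step):
    """Linearly anneal epsilon from EPS_START to EPS_END, then hold it."""
    frac = min(max(step, 0), DECAY_STEPS) / DECAY_STEPS
    return EPS_START + frac * (EPS_END - EPS_START)


for step in (0, 5_000, 10_000, 20_000):
    print(step, round(epsilon_at(step), 3))
```

An exponential schedule would be an equally plausible reading of "Exploration Decay"; only the end points (1.0 and 0.05) are fixed by the table.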

Table A3. Hyperparameter configuration for PPO algorithm.

Parameter	Value
Optimization and PPO
Learning Rate ($η$)	$3 \times 10^{-4}$
Discount Factor ($γ$)	1.0
Clipping Ratio ($ϵ$)	0.2
Value Coefficient ($c_{1}$)	0.5
Entropy Coefficient ($c_{2}$)	0.01
Max Gradient Norm	0.5
Update Epochs (K)	10
Batch Size	64
Network Architecture
State Dimension	128
Hidden Layer Dimension	256
Dropout Rate	0.1
Max Decoding Rounds	5

Note: Italic text denotes parameter category headers.
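
To show where the clipping ratio and the two coefficients in Table A3 act, the following sketch evaluates the standard PPO loss for a single sample. It follows the clipped-surrogate form of the original PPO paper; the function name `ppo_loss` and the scalar, per-sample formulation are simplifications for illustration, not the baseline's actual implementation.

```python
# Values taken from Table A3.
CLIP_EPS, C1, C2 = 0.2, 0.5, 0.01


def ppo_loss(ratio, advantage, value_pred, value_target, entropy):
    """Per-sample PPO loss: -clipped surrogate + c1 * value loss - c2 * entropy."""
    clipped_ratio = min(max(ratio, 1.0 - CLIP_EPS), 1.0 + CLIP_EPS)
    # Pessimistic bound: take the smaller of the unclipped and clipped objectives.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    value_loss = (value_pred - value_target) ** 2
    return -surrogate + C1 * value_loss - C2 * entropy


# A probability ratio of 1.5 is clipped to 1.2 when the advantage is positive.
print(ppo_loss(ratio=1.5, advantage=1.0, value_pred=0.0, value_target=0.0, entropy=0.0))
```

With a negative advantage the unclipped term is the smaller one, so the same ratio of 1.5 is penalized in full rather than clipped.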

Table A4. Inference configuration for Gemini-3.0-Flash.

Parameter	Setting
Model Version	Gemini-3.0-Flash
Temperature	0.7
Top-p	0.95
Top-k	40
Max Output Tokens	8192
Thinking Level	High

Table A5. Inference configuration for GLM-5.

Parameter	Setting
Model Version	GLM-5
Temperature	0.8
Top-p	0.95
Top-k	40
Max Output Tokens	8192
Penalty Score	1.1

Furthermore, to support methodological transparency, Table A6 outlines the specific fine-tuning (LoRA) and alignment (GSPO) parameters for our proposed decoupled dual-model architecture, while Table A7 details their corresponding inference settings utilizing the vLLM framework.

Table A6. Training and alignment configurations for the proposed dual-model.

Parameter	Tactical Planner (GPT-OSS-20B)	Manifold Executor (Qwen3-4B-Instruct)
Supervised Fine-Tuning (LoRA)
LoRA Rank (r)	16	16
LoRA Alpha ($α$)	32	32
Learning Rate	$1 \times 10^{-4}$	$5 \times 10^{-5}$
Epochs	2	3
Batch Size	2	32
Gradient Accumulation Steps	4	1
Alignment Training (GSPO)
Total Steps	-	1000
Learning Rate	-	$5 \times 10^{-6}$
KL Penalty ($β_{KL}$)	-	0.01
Temperature	-	1.0
Top-p/Top-k	-	0.9/50
Num Generations	-	16
Batch Size	-	8
Gradient Accumulation Steps	-	4

Note: Italic text denotes parameter category headers.
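
Two derived quantities help when comparing the columns of Table A6: the LoRA scaling factor $α/r$ and the effective batch size per optimizer step. The arithmetic below is a sketch that assumes the listed batch sizes are per optimizer micro-step on a single device; the number of devices is not stated in the table, so `num_devices=1` is a placeholder.

```python
def lora_scale(alpha, rank):
    """Scaling applied to the low-rank update in LoRA (alpha / r)."""
    return alpha / rank


def effective_batch(batch_size, grad_accum_steps, num_devices=1):
    """Samples contributing to one optimizer step (num_devices is assumed)."""
    return batch_size * grad_accum_steps * num_devices


print(lora_scale(32, 16))        # both models use r = 16, alpha = 32
print(effective_batch(2, 4))     # Tactical Planner, SFT
print(effective_batch(32, 1))    # Manifold Executor, SFT
print(effective_batch(8, 4))     # Manifold Executor, GSPO
```

Under these assumptions the planner is updated with far fewer samples per step than the executor, which is consistent with its much longer average sample length in Table 3.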

Table A7. vLLM inference configurations for the proposed dual-model.

Parameter	Tactical Planner (GPT-OSS-20B)	Manifold Executor (Qwen3-4B-Instruct)
Inference Framework	vLLM	vLLM
Temperature	0.7	0.3
Top-p	0.9	0.9
Max Output Tokens	4096	4096
Batch Size	64	64

References

Nguyen, H.T.; Ghazali, I. Trends of unmanned aerial vehicles in logistics delivery. Bincang Sains Teknol. 2024, 3, 52–59.
Abbas, S.; Abu Talib, M.; Ahmed, I.; Belal, O. Integration of UAVs and FANETs in disaster management: A review on applications, challenges and future directions. Trans. Emerg. Telecommun. Technol. 2024, 35, e70023.
Wang, J.; Zhou, K.; Xing, W.; Li, H.; Yang, Z. Applications, evolutions, and challenges of drones in maritime transport. J. Mar. Sci. Eng. 2023, 11, 2056.
Wang, Y.; Li, J.; Yang, X.; Peng, Q. UAV–Ground vehicle collaborative delivery in emergency response: A review of key technologies and future trends. Appl. Sci. 2025, 15, 9803.
Minculete, G. Military Logistics Drones: The Innovative Solution for Transportation Challenges on the Battlefield. Land Forces Acad. Rev. 2025, 30, 342–354.
Samiei, A.; Ismail, S.; Sun, L. Cluster-based Hungarian approach to task allocation for unmanned aerial vehicles. In 2019 IEEE National Aerospace and Electronics Conference (NAECON); IEEE: New York, NY, USA, 2019; pp. 148–154.
Li, Y.; Zhang, Z.; He, Z.; Sun, Q. A heuristic task allocation method based on overlapping coalition formation game for heterogeneous UAVs. IEEE Internet Things J. 2024, 11, 28945–28959.
Liu, H.; Shao, Z.; Zhou, Q.; Tu, J.; Zhu, S. Task Allocation Algorithm for Heterogeneous UAV Swarm with Temporal Task Chains. Drones 2025, 9, 574.
Zhang, J.; Cui, Y.; Ren, J. Dynamic mission planning algorithm for UAV formation in battlefield environment. IEEE Trans. Aerosp. Electron. Syst. 2022, 59, 3750–3765.
Ye, F.; Chen, J.; Tian, Y.; Jiang, T. Cooperative task assignment of a heterogeneous multi-UAV system using an adaptive genetic algorithm. Electronics 2020, 9, 687.
Xiong, Y.; Zhang, L. Multi-UAV Task Allocation Based on Grid-Based Particle Swarm and Genetic Hybrid Algorithm. Mathematics 2025, 13, 3591.
Dong, X.; Shi, C.; Wen, W.; Zhou, J. Multi-Mission Oriented Joint Optimization of Task Assignment and Flight Path Planning for Heterogeneous UAV Cluster. Remote Sens. 2023, 15, 5315.
Tang, X.; Li, X.; Yu, R.; Wu, Y.; Ye, J.; Tang, F.; Chen, Q. Digital-twin-assisted task assignment in multi-UAV systems: A deep reinforcement learning approach. IEEE Internet Things J. 2023, 10, 15362–15375.
Mao, X.; Wu, G.; Fan, M.; Cao, Z.; Pedrycz, W. DL-DRL: A double-level deep reinforcement learning approach for large-scale task scheduling of multi-UAV. IEEE Trans. Autom. Sci. Eng. 2024, 22, 1028–1044.
Liu, D.; Dou, L.; Zhang, R.; Zhang, X.; Zong, Q. Multi-agent reinforcement learning-based coordinated dynamic task allocation for heterogenous UAVs. IEEE Trans. Veh. Technol. 2022, 72, 4372–4383.
Zhao, X.; Yang, R.; Zhang, Y.; Yan, M.; Yue, L. Deep reinforcement learning for intelligent dual-UAV reconnaissance mission planning. Electronics 2022, 11, 2031.
Zuo, L.; Gao, S.; Li, Y.; Li, L.; Li, M.; Lu, X. A fast and robust algorithm with reinforcement learning for large UAV cluster mission planning. Remote Sens. 2022, 14, 1304.
Yue, L.; Yang, R.; Zuo, J.; Zhang, Y.; Li, Q.; Zhang, Y. Unmanned aerial vehicle swarm cooperative decision-making for SEAD mission: A hierarchical multiagent reinforcement learning approach. IEEE Access 2022, 10, 92177–92191.
Yin, Y.; Guo, Y.; Su, Q.; Wang, Z. Task allocation of multiple unmanned aerial vehicles based on deep transfer reinforcement learning. Drones 2022, 6, 215.
Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. DeepSeek-V3 technical report. arXiv 2024, arXiv:2412.19437.
Comanici, G.; Bieber, E.; Schaekermann, M.; Pasupat, I.; Sachdeva, N.; Dhillon, I.; Blistein, M.; Ram, O.; Zhang, D.; Rosen, E.; et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv 2025, arXiv:2507.06261.
Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 technical report. arXiv 2025, arXiv:2505.09388.
Team, K.; Bai, Y.; Bao, Y.; Chen, G.; Chen, J.; Chen, N.; Chen, R.; Chen, Y.; Chen, Y.; Chen, Y.; et al. Kimi k2: Open agentic intelligence. arXiv 2025, arXiv:2507.20534.
Tian, Y.; Lin, F.; Li, Y.; Zhang, T.; Zhang, Q.; Fu, X.; Huang, J.; Dai, X.; Wang, Y.; Tian, C.; et al. UAVs meet LLMs: Overviews and perspectives towards agentic low-altitude mobility. Inf. Fusion 2025, 122, 103158.
Jin, W.; Du, H.; Zhao, B.; Tian, X.; Shi, B.; Yang, G. A comprehensive survey on multi-agent cooperative decision-making: Scenarios, approaches, challenges and perspectives. arXiv 2025, arXiv:2503.13415.
Cui, J.; Liu, G.; Wang, H.; Yu, Y.; Yang, J. TPML: Task planning for multi-UAV system with large language models. In 2024 IEEE 18th International Conference on Control & Automation (ICCA); IEEE: New York, NY, USA, 2024; pp. 886–891.
Li, T.; Wang, Q.; Wang, J.; Xu, Y. SWARM-LLM: An unmanned swarm task planning system based on large language models. Comput. Sci. 2025, 52, 72–79.
Yu, H.; Wang, C.; Wu, L.; Liu, Y.; Niu, Y. LLMSTP: Empowering Swarm Task Planning with Large Language Models. In Proceedings of the International Conference on Autonomous Unmanned Systems; Springer: Singapore, 2024; pp. 500–510.
Yu, H.; Wang, C.; Niu, Y.; Wu, L. MUTP-LLM: Empowering Multi-UAV Task Planning with Large Language Models. Guid. Navig. Control 2025, 5, 477.
Xu, W.; Huang, C.; Gao, S.; Shang, S. LLM-based agents for tool learning: A survey. Data Sci. Eng. 2025, 10, 533–563.
Koubaa, A.; Gabr, K. Agentic UAVs: LLM-driven autonomy with integrated tool-calling and cognitive reasoning. arXiv 2025, arXiv:2509.13352.
Zhang, J.; Xiang, J.; Yu, Z.; Teng, F.; Chen, X.; Chen, J.; Zhuge, M.; Cheng, X.; Hong, S.; Wang, J.; et al. AFlow: Automating agentic workflow generation. In Proceedings of the Thirteenth International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025.
Sujit, P.; Beard, R. Distributed sequential auctions for multiple UAV task allocation. In 2007 American Control Conference; IEEE: New York, NY, USA, 2007; pp. 3955–3960.
Venugopalan, T.; Subramanian, K.; Sundaram, S. Multi-UAV task allocation: A team-based approach. In 2015 IEEE Symposium Series on Computational Intelligence; IEEE: New York, NY, USA, 2015; pp. 45–50.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. In Proceedings of the Tenth International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.
Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652.
Zheng, C.; Liu, S.; Li, M.; Chen, X.H.; Yu, B.; Gao, C.; Dang, K.; Liu, Y.; Men, R.; Yang, A.; et al. Group sequence policy optimization. arXiv 2025, arXiv:2507.18071.
Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C.D.; Ermon, S.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. Adv. Neural Inf. Process. Syst. 2023, 36, 53728–53741.
Zeng, Y.; Liu, G.; Ma, W.; Yang, N.; Zhang, H.; Wang, J. Token-level direct preference optimization. arXiv 2024, arXiv:2404.11999.
Chen, G.; Liao, M.; Li, C.; Fan, K. Step-level value preference optimization for mathematical reasoning. arXiv 2024, arXiv:2406.10858.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
Shinn, N.; Cassano, F.; Berman, E.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language agents with verbal reinforcement learning. arXiv 2023, arXiv:2303.11366.
Dorigo, M.; Birattari, M.; Stutzle, T. Ant colony optimization. IEEE Comput. Intell. Mag. 2006, 1, 28–39.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
Zeng, A.; Lv, X.; Hou, Z.; Du, Z.; Zheng, Q.; Chen, B.; Yin, D.; Ge, C.; Xie, C.; Wang, C.; et al. GLM-5: From Vibe Coding to Agentic Engineering. arXiv 2026, arXiv:2602.15763.

Figure 1. Schematic diagram of the collaborative task allocation scenario for heterogeneous UAV swarms.

Figure 2. Architecture of the generative UAV task allocation augmented by a heterogeneous toolchain. The colored arrows and numbered sequence (1–7) explicitly delineate the dynamic information flow and closed-loop interactions between the Tactical Planner and Tool Executor.

Figure 3. Few-shot-induced generative data augmentation method.

Figure 4. Flowchart of the alternating self-play for heterogeneous dual-models. (A) The planning model generates a tactical instruction based on the current context; (B) The frozen execution model translates the instruction into a structured tool action; (C) The simulation environment executes the action and returns observation feedback to update the planning context.
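
The three stages (A) to (C) in Figure 4 amount to a simple interaction loop, sketched below with stand-in callables. The names `rollout`, `planner`, `executor`, and `env_step`, the list-based context, and the stub behaviors are all hypothetical; the five-turn cap mirrors the maximum used for LLM-based methods in the experiments.

```python
def rollout(planner, executor, env_step, context, max_turns=5):
    """Run the planner -> executor -> environment loop for up to max_turns."""
    trajectory = []
    for _ in range(max_turns):
        instruction = planner(context)        # (A) tactical instruction
        action = executor(instruction)        # (B) structured tool action
        observation, done = env_step(action)  # (C) simulate and observe
        trajectory.append((instruction, action, observation))
        context = context + [observation]     # feedback updates the planning context
        if done:
            break
    return trajectory


# Stub components standing in for the two models and the simulator.
def planner(ctx):
    return f"plan after {len(ctx)} observations"


def executor(instruction):
    return {"tool": "T_sim", "args": {"note": instruction}}


def env_step(action):
    return "score=90", True


print(rollout(planner, executor, env_step, context=[]))
```

Keeping the executor as a plain callable reflects the "frozen" role it plays in the figure: only the planner's behavior changes between rollouts.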

Figure 5. Flowchart of the RCP algorithm.

Figure 6. Simulation deployment snapshots of selected situations: (a) Scenario 1—SEAD; (b) Scenario 2—Decapitation Strike; (c) Scenario 3—Strategic Paralysis.

Figure 7. Comparison of task allocation topologies among different algorithms and models: (a) ACO for Scenario 1; (b) ACO for Scenario 2; (c) ACO for Scenario 3; (d) PPO for Scenario 1; (e) PPO for Scenario 2; (f) PPO for Scenario 3; (g) DQN for Scenario 1; (h) DQN for Scenario 2; (i) DQN for Scenario 3; (j) GLM for Scenario 1; (k) GLM for Scenario 2; (l) GLM for Scenario 3; (m) Gemini for Scenario 1; (n) Gemini for Scenario 2; (o) Gemini for Scenario 3; (p) Our method for Scenario 1; (q) Our method for Scenario 2; (r) Our method for Scenario 3. For some schemes, since the first wave exclusively comprises reconnaissance UAVs without assigned strike targets, there are no connecting lines drawn in the figure.

Figure 8. Training dynamics of the monolithic joint training mode (M2). The top subplot shows the overall training loss, while the bottom subplot illustrates the gradient norms and cosine similarity between the CoT and tool execution tasks, highlighting the recurrent destructive interference.
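
The cosine similarity plotted in Figure 8 can be computed from the two task gradients as sketched below. This is a generic illustration on plain Python lists; how the authors flattened or sampled the gradients of the CoT and tool-execution losses is not specified here, and the example vectors are made up.

```python
import math


def cosine_similarity(g1, g2):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)


# Hypothetical gradients for the two tasks on a shared parameter vector.
grad_cot = [0.5, -1.0, 0.2]
grad_tool = [-0.4, 0.9, 0.1]
print(cosine_similarity(grad_cot, grad_tool))
```

A value below zero means a step that lowers one task's loss tends to raise the other's, which is the destructive interference the caption refers to.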

Figure 9. Total inference time on $D_{rcp}$ across different sampling parameters M. The U-shaped curve reflects the trade-off between early stopping efficiency and hardware-constrained parallel execution overhead.

Table 1. Description of mission scenarios.

Scenario	Explanation
Total Strike	Aims to maximally destroy all enemy combat capabilities and effective strength.
SEAD	Prioritizes the destruction of radars and air defense missile sites to open safe corridors for subsequent echelons.
Strategic Paralysis	Focuses on communication base stations, energy facilities, and airports to paralyze enemy operational capabilities.
Armored Vehicle Annihilation	Targets enemy tank and armored vehicle clusters with intensive fire coverage.
Decapitation Strike	Precisely locates and destroys enemy command posts or high-value leadership targets with minimal cost.

Table 2. Definitions of the five categories of heterogeneous tools and their input–output paradigms.

Tool Name	Symbol	Definition	Input	Output
Excellent Case Retrieval	$T_{ret}$	Retrieves successful historical tactical cases based on situational similarity.	`scene`, `preference`, `currentSituation`	One-shot historical tactical case.
Knowledge Graph Query	$T_{kg}$	Queries static attributes (e.g., range and damage radius) to mitigate model hallucination.	`queryEntity`	Equipment attributes and static parameters.
Geospatial Calculator	$T_{calc}$	Calculates Euclidean distances between multiple coordinates for spatial analysis.	`listA`, `listB`, `currentSituation`	Euclidean distance matrix.
Risk-Cost Calculator	$T_{cost}$	Evaluates operational risks and economic costs to support decision-making.	`uavList`, `targetList`, `currentSituation`	Combat loss costs.
Simulation Evaluation	$T_{sim}$	Parses plans into instruction sequences, executes simulation, and returns feedback.	`assignmentJson`, `currentSituation`	Efficacy score and detailed battle report.

Note: currentSituation contains dynamic states such as red_aircraft_stock and blue_platforms.
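
The input fields listed in Table 2 lend themselves to a simple schema check before a tool call is dispatched, as sketched below. The tool keys and field names are copied from the table; the call format (`{"tool": ..., "args": ...}`), the function name `schema_errors`, and the decision to reject unexpected fields are assumptions for illustration and may differ from the authors' validator.

```python
# Required input fields per tool, copied from the Input column of Table 2.
TOOL_SCHEMAS = {
    "T_ret": {"scene", "preference", "currentSituation"},
    "T_kg": {"queryEntity"},
    "T_calc": {"listA", "listB", "currentSituation"},
    "T_cost": {"uavList", "targetList", "currentSituation"},
    "T_sim": {"assignmentJson", "currentSituation"},
}


def schema_errors(call):
    """Return a list of schema problems for a tool call; empty means valid."""
    tool = call.get("tool")
    args = call.get("args", {})
    if tool not in TOOL_SCHEMAS:
        return [f"unknown tool: {tool}"]
    required = TOOL_SCHEMAS[tool]
    provided = set(args)
    errors = [f"missing field: {name}" for name in sorted(required - provided)]
    errors += [f"unexpected field: {name}" for name in sorted(provided - required)]
    return errors


print(schema_errors({"tool": "T_kg", "args": {"queryEntity": "radar"}}))
print(schema_errors({"tool": "T_sim", "args": {"assignmentJson": "{}"}}))
```

A check of this kind is one plausible way to count schema errors per call; it validates only field names, not value types or physical plausibility.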

Table 3. Datasets used for training and testing. The training datasets were automatically synthesized via Gemini-3.0-Pro based on expert seeds, while the test sets were manually curated by domain experts to ensure rigorous evaluation.

Dataset	Total Samples	Avg. Length (tokens)	Category
$D_{plan}$	5000	16,583.10	Training set
$D_{exec}$	31,338	1656.63	Training set
$D_{dpo}$	3000	31,743.59	Training set
$D_{rcp}$	100	32,007.11	Test set
$D_{test}$	300	1265.12	Test set
$D_{test\text{-}tool}$	1000	1581.56	Test set

Note: The average length refers to the average number of tokens in each dataset under the corresponding tokenizer; the DPO dataset is calculated using the maximum length of the positive and negative sample pairs.
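
The length convention for the DPO dataset described in the note can be expressed in a few lines. In this sketch `tokenize` is a placeholder for whichever tokenizer corresponds to the model, and whitespace splitting is used only so the example runs without external dependencies.

```python
def avg_pair_length(pairs, tokenize):
    """Mean over pairs of max(tokens in chosen, tokens in rejected)."""
    lengths = [max(len(tokenize(chosen)), len(tokenize(rejected)))
               for chosen, rejected in pairs]
    return sum(lengths) / len(lengths)


# Toy (chosen, rejected) pairs; str.split stands in for a real tokenizer.
pairs = [("a b c", "a b"), ("a", "a b c d e")]
print(avg_pair_length(pairs, str.split))
```

Taking the maximum rather than the sum reports the longer branch of each pair, which is what bounds the sequence length needed during preference training.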

Table 4. Overall performance and physical hallucination comparison of different algorithms on the test set. Performance metrics are reported as Mean ± Standard Deviation across three independent runs. To control time consumption, the maximum iterations for ACO are set to ten turns, and for LLMs to five turns. The shaded row highlights the performance of our proposed model.

Model	Performance Metrics			Hallucination Rates (%)
Model	MSR (%)	Avg. Score	Avg. Turns	CCVR	RER	SSCR
ACO	62.34 ± 2.45	82.16 ± 1.92	5.42 ± 0.38	0.00 ± 0.00 ^†	0.00 ± 0.00 ^†	0.00 ± 0.00 ^†
DQN	43.00 ± 4.18	72.38 ± 3.75	1.00 ± 0.00	0.00 ± 0.00 ^†	0.00 ± 0.00 ^†	0.00 ± 0.00 ^†
PPO	27.00 ± 5.62	61.19 ± 4.21	1.00 ± 0.00	0.00 ± 0.00 ^†	0.00 ± 0.00 ^†	0.00 ± 0.00 ^†
GLM-5 ¹	55.67 ± 2.76	85.51 ± 1.84	3.52 ± 0.22	12.50 ± 1.15	10.40 ± 1.82	53.12 ± 3.44
Gemini-3.0-Flash ¹	55.00 ± 3.14	84.12 ± 2.05	3.64 ± 0.18	11.00 ± 1.68	12.30 ± 0.95	40.00 ± 2.87
Ours	82.34 ± 1.42	93.17 ± 0.94	2.69 ± 0.12	6.10 ± 0.73	7.87 ± 0.58	15.10 ± 1.05

¹ The GLM-5 and Gemini-3.0-Flash models used the five tools designed in this paper for closed-loop reasoning.

^† Traditional algorithms (ACO, DQN, and PPO) score exactly 0.00 ± 0.00% on all hallucination rates because of hard-coded action masking and rule pruning, not generalized semantic understanding.
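
Hard-coded action masking of the kind mentioned in the footnote is commonly implemented by excluding infeasible actions before the best one is selected. The sketch below is a generic version; the feasibility flags are hypothetical inputs, and the specific rules used by the ACO, DQN, and PPO baselines are not given in this section.

```python
def masked_argmax(scores, feasible):
    """Pick the best-scoring action among those flagged feasible."""
    if not any(feasible):
        raise ValueError("no feasible action")
    # Infeasible actions receive -inf so they can never be selected.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, feasible)]
    return max(range(len(masked)), key=masked.__getitem__)


# The top-scoring action (index 0) is infeasible, so index 2 is chosen instead.
print(masked_argmax([0.9, 0.5, 0.7], [False, True, True]))
```

Because infeasible actions can never be emitted, constraint violations are ruled out by construction, which is why the zero rates for these baselines do not indicate any understanding of the constraints.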

Table 5. Comparison of task assignment metrics across different algorithms and scenarios.

ID	Mission Intent	Pref.	Score (Iterations/UAVs)
ID	Mission Intent	Pref.	ACO	PPO	DQN	GLM	Gemini	Ours
1	SEAD	Survival	93.66 (8/16)	91.36 (1/40)	97.47 (1/51)	97.76 (2/23)	92.28 (4/21)	93.07 (2/11)
2	Decapitation Strike	Offensive	95.50 (5/17)	96.00 (1/36)	96.26 (1/47)	98.91 (2/24)	96.79 (1/47)	98.11 (1/19)
3	Strategic Paralysis	Balanced	98.95 (7/13)	99.07 (1/47)	99.61 (1/46)	99.60 (4/9)	99.52 (2/11)	99.53 (1/8)

Note: The metrics are formatted as “Score (Convergence Iterations/Active UAVs)”.

Table 6. Comparison of task allocation results under different ablation configurations. Maximum iterations for all models are set to five turns. Performance metrics are reported as Mean ± Standard Deviation across three independent runs. The shaded row highlights the performance of our proposed model.

ID	Architecture	Configuration		Performance Metrics
ID	Architecture	Planner ($π_{ϕ}$)	Executor ($π_{θ}$)	MSR (%)	Avg. Score	Avg. Turns
M1	Single (Base)	Few-shot	N/A	0.00 ± 0.00	4.09 ± 1.28	5.00 ± 0.00
M2	Single	SFT ($D_{total}$)	N/A	24.00 ± 3.82	44.26 ± 5.14	3.91 ± 0.42
M3	Single	SFT ($D_{plan} \cup D_{exec}$)	N/A	15.34 ± 4.75	24.07 ± 6.33	4.58 ± 0.21
M4	Dual (Base)	Few-shot	Few-shot	3.00 ± 1.24	55.18 ± 4.88	4.87 ± 0.11
M5	Dual	SFT	SFT	53.67 ± 2.91	83.67 ± 1.76	3.53 ± 0.28
M6	Dual	SFT	GSPO	63.00 ± 2.05	86.27 ± 2.31	3.33 ± 0.15
M7	Dual	DPO (Standard)	GSPO	71.67 ± 1.88	89.91 ± 1.05	3.05 ± 0.09
M8	Dual	DPO (RCP)	GSPO	82.34 ± 1.42	93.17 ± 0.94	2.69 ± 0.12

Table 7. Comparison of invocation capabilities on the heterogeneous toolchain among different models. Bold numbers indicate the best performance.

Model	EA (%)						SER (%)
Model	$T_{ret}$	$T_{kg}$	$T_{calc}$	$T_{cost}$	$T_{sim}$	Avg.	SER (%)
Base	$96.58 \pm 0.85$	$99.15 \pm 0.43$	$66.67 \pm 1.07$	$73.02 \pm 3.17$	$48.01 \pm 1.08$	$77.30 \pm 0.40$	$2.30 \pm 0.20$
Gemini-3.0-Flash	$100.0 \pm 0.00$	$100.0 \pm 0.00$	$95.60 \pm 0.49$	$93.65 \pm 1.59$	$94.02 \pm 0.43$	$97.17 \pm 0.25$	$0.00 \pm 0.00$
GLM-5	$99.72 \pm 0.25$	$100.0 \pm 0.00$	$98.01 \pm 0.65$	$84.66 \pm 2.42$	$83.48 \pm 0.65$	$94.63 \pm 0.51$	$0.27 \pm 0.15$
$π_{θ}$ (SFT-only)	$99.43 \pm 0.49$	$99.57 \pm 0.43$	$94.04 \pm 0.85$	$95.24 \pm 1.59$	$70.80 \pm 1.08$	$91.33 \pm 0.57$	$0.40 \pm 0.10$
**$π_{θ}^{*}$ (SFT + GSPO)**	$100.0 \pm 0.00$	$100.0 \pm 0.00$	$99.29 \pm 0.49$	$100.0 \pm 0.00$	$91.45 \pm 0.85$	$97.83 \pm 0.20$	$0.00 \pm 0.00$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ai, L.; Ma, B.; Zhang, J.; Ai, Y.; Hao, Z.; Li, J.; Yu, Z.; Cheng, J. A Generative Task Allocation Method for Heterogeneous UAV Swarms Empowered by Heterogeneous Toolchains. Drones 2026, 10, 289. https://doi.org/10.3390/drones10040289
