Frequency Point Game Environment for UAVs via Expert Knowledge and Large Language Model
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript proposes a simulation environment based on frequency-point game theory for modeling interference and anti-interference interactions in UAV communication systems. The framework combines multi-agent reinforcement learning with an expert knowledge base for frequency selection and a large language model to generate adaptive adversarial UAV trajectories. However, the current experimental evaluation does not provide sufficient evidence that the LLM-based planner offers measurable advantages over conventional baseline methods. In addition, critical details regarding model configuration, inference determinism, computational overhead, and reproducibility are missing.
The manuscript shows a high similarity level with a publicly available preprint hosted on arXiv (ID: 2508.02757), which appears to correspond to the same work.
I consider that the manuscript does not sufficiently clarify what is fundamentally new compared to existing anti-jamming MARL environments and UAV adversarial simulation platforms.
The manuscript lacks a rigorous formalization of the underlying game model.
The manuscript does not convincingly demonstrate that LLM-based planning outperforms established baseline approaches such as heuristic planners, classical path optimization methods, or RL-based navigation.
The experimental section presents reward curves and frequency overlap trends, but lacks proper statistical validation.
I suggest providing a more detailed experimental protocol or supplementary material to improve reproducibility and transparency.
Introduction: Please cite concrete limitations of specific existing frameworks rather than using general statements.
The listed contributions overlap conceptually and partially repeat earlier statements.
Section 2.1 Lines 109-117. The authors should strengthen the link between cited works and the specific signal confrontation problem addressed in this study.
Section 2.3 Lines 150-179. The review of LLM-based path planning is descriptive but lacks critical comparison. I suggest clarifying why LLM-based reasoning is preferable to conventional UAV path planners under the considered constraints.
Lines 326-329. The structure of the expert knowledge base is insufficiently described. Please specify how rules are encoded, whether knowledge is static or adaptive, and how conflicts between learned policy and expert rules are resolved.
Lines 375-394. Hardware details are provided, but software configuration is incomplete.
Lines 491-503. The conclusion repeats experimental observations without discussing limitations.
Author Response
Comment 1: The manuscript shows a high similarity level with a publicly available preprint hosted on arXiv (ID: 2508.02757), which appears to correspond to the same work.
Response 1: Thank you for pointing this out. We agree with this comment. We would like to clarify that the preprint mentioned (arXiv:2508.02757) is indeed our own work, which the authors uploaded to the arXiv repository to share preliminary research findings with the community prior to formal submission. To ensure full transparency and avoid any confusion regarding originality, we have added a clarifying statement in the revised manuscript. This change can be found in the Author Contributions section.
Comments 2:
I consider that the manuscript does not sufficiently clarify what is fundamentally new compared to existing anti-jamming MARL environments and UAV adversarial simulation platforms.
Response 2:
Thank you for this constructive comment. We agree that the original manuscript did not sufficiently highlight the fundamental novelty of the proposed environment. In the revised manuscript, we have added (i) a dedicated novelty paragraph that explicitly contrasts UAV-FPG with representative anti-jamming MARL/game formulations and widely used UAV simulators, and (ii) a compact comparison table (Table 1) to provide a structured, feature-level differentiation. These revisions can be found in Section 1 (Introduction), and all newly added text is marked in red. In particular, we clarify that UAV-FPG uniquely couples an explicit frequency-point confrontation loop with SINR/capacity-based evaluation, 3D geometry-dependent mobility/path-loss coupling, and two plug-in intelligence modules (expert KB for ally frequency selection and episode-level LLM planning for opponent trajectory generation), which are not jointly instantiated in prior anti-jamming MARL environments nor in existing UAV simulation platforms.
Comments 3: The manuscript lacks a rigorous formalization of the underlying game model.
Response 3: We appreciate this important suggestion and agree with the reviewer. To address it, we have added a rigorous formalization of UAV-FPG as a two-player Markov game in Section “Environment Model”. Specifically, we define the game tuple
$G = \langle I, S, \{A_i\}_{i \in I}, T, \{r_i\}_{i \in I}, \gamma \rangle$,
specify the two agents (ally/opponent), the state, joint action spaces, transition dynamics, and discounted objectives. We also clarify how the implemented state abstraction summarizes geometry and spectrum/interference conditions, and how $T$ is induced by UAV kinematics and the wireless link model (path loss/SINR/capacity). Finally, we note that the expert knowledge base and the episode-level LLM planner are incorporated as fixed plug-in policy modules that do not change the Markov-game interface, but provide constrained frequency selection and opponent motion primitives. All added text is marked in red.
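For readers who want the discounted objectives written out, one standard way to state them for a two-player Markov game is sketched below. The symbols $J_i$, $\pi_1$, $\pi_2$, and the superscripted per-agent actions are notation assumed here for illustration; the revised manuscript may use different symbols.

```latex
J_i(\pi_1, \pi_2) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_i\left(s_t, a_t^{1}, a_t^{2}\right) \right],
\qquad s_{t+1} \sim T\left(\cdot \mid s_t, a_t^{1}, a_t^{2}\right), \quad i \in I
```

Each agent $i$ seeks a policy that maximizes its own $J_i$ given the other agent's policy.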
Comment 4:
The manuscript does not convincingly demonstrate that LLM-based planning outperforms established baseline approaches such as heuristic planners, classical path optimization methods, or RL-based navigation.
Response 4:
Thank you for this constructive suggestion. We fully agree that a comprehensive comparison with established baselines is essential to validate the effectiveness of the LLM-based planning.
Revisions Made: In the revised manuscript, we have conducted additional experiments comparing our LLM-based planner against three distinct categories of baseline approaches as requested:
- Heuristic Planner: A Greedy Intercept (Pure Pursuit) strategy.
- Classical Optimization: A Model Predictive Control (MPC) based planner.
- RL-based Navigation: Multi-Agent SAC (MASAC) and MAPPO baselines.
The comparative results are now presented in Table 5. The data demonstrates that the LLM-based opponent achieves a significantly higher average reward (2257.48) compared to both the Greedy Intercept (1899.98) and MPC (1849.97) baselines, as well as the RL-based methods. This confirms that the LLM-based planner effectively simulates a "stronger" and more adaptive adversary, thereby providing a more challenging and valuable testbed for evaluating ally anti-jamming strategies. We have also added a detailed discussion of these baselines in Section 5.2.
Comment 5:
The experimental section presents reward curves and frequency overlap trends, but lacks proper statistical validation.
Response 5:
Thank you for this valuable comment. We agree that statistical reporting is necessary to support the observed reward/overlap trends. In the revised manuscript, we repeated the main experiments with Ns = 5 independent random seeds (different network initializations and environment stochasticity) and now report mean ± standard deviation across seeds. Specifically:
(1) In Section 5.1 (Environment Setting), we added a clear statement that all learning curves and scalar results are averaged over Ns = 5 runs, and that the shaded bands in reward plots correspond to ±2σ across seeds, while frequency-overlap curves are shown with standard-deviation error bars.
(2) We further added Table 6 to provide seed-averaged scalar metrics for statistical comparison, including Final-10% opponent reward, step-normalized AUC of the opponent-reward curve, and late-window overlap (%), all reported as mean ± std.
These additions provide quantitative variability estimates and improve the statistical support for our conclusions.
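The seed-level statistics described above could be computed along the following lines. This is a minimal sketch rather than the authors' code: the function names are hypothetical, the sample (n-1) standard deviation is assumed, and the definitions of "Final-10%" (mean over the last 10% of steps) and "step-normalized AUC" (trapezoidal area divided by the number of steps spanned) are assumptions about what those metrics mean.

```python
import statistics


def seed_summary(values):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def band(curves, k=2.0):
    """Per-step mean and mean +/- k*std across equal-length seed curves."""
    lower, center, upper = [], [], []
    for step_values in zip(*curves):
        m, s = seed_summary(step_values)
        lower.append(m - k * s)
        center.append(m)
        upper.append(m + k * s)
    return lower, center, upper


def final_fraction_mean(curve, fraction=0.10):
    """Mean of the last `fraction` of a curve (assumed 'Final-10%' metric)."""
    n = max(1, int(round(len(curve) * fraction)))
    return statistics.fmean(curve[-n:])


def step_normalized_auc(curve):
    """Trapezoidal area under the curve divided by the number of steps spanned."""
    if len(curve) < 2:
        return float(curve[0])
    area = sum((a + b) / 2.0 for a, b in zip(curve, curve[1:]))
    return area / (len(curve) - 1)
```

With Ns = 5 seeds, `band` would be called on the five reward curves with `k=2.0` to obtain the shaded region, and `seed_summary` on the five per-seed scalars to obtain each mean ± std table entry.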
Comment 6:
I suggest providing a more detailed experimental protocol or supplementary material to improve reproducibility and transparency.
Response 6:
We appreciate this suggestion and have strengthened the experimental protocol description to improve reproducibility and transparency. In the revised manuscript:
(1) Section 5.1 (Environment Setting) now explicitly states the evaluation protocol across multiple random seeds and how curves/statistics are computed (mean ± std, shaded ±2σ bands, and error bars for overlap curves).
(2) We augmented Table 4 to include additional key settings needed for replication, including the number of random seeds (Ns = 5), the curve-smoothing window (w = 200 training steps), and the LLM call frequency (once per episode boundary). The main environment parameters (bandwidth, power/noise, thresholds, etc.) and the learning hyperparameters are already provided in the paper.
(3) For the LLM-based opponent planner, the implementation details and inference protocol are described in Section 4.3 (API-based episode-level planning with fixed decoding settings and feasibility constraints).
Overall, these revisions provide a clearer experimental protocol and improve the reproducibility of our results.
Comments 7:
Introduction: Please cite concrete limitations of specific existing frameworks rather than using general statements.
Response 7:
Thank you for pointing this out. We agree with this comment. We revised the Introduction to replace general claims with concrete, cited limitations of representative frameworks (CMAA, GPDS, FANETs) and UAV simulators, clarifying that prior works either abstract away 3D mobility/geometry-coupled propagation or lack an executable SINR/capacity-driven spectrum confrontation loop; we also added a compact comparison table to summarize these differences.
These changes can be found in the revised manuscript in Section 1 (Introduction), Paragraph 2.
Comments 8: The listed contributions overlap conceptually and partially repeat earlier statements.
Response 8: Thank you for pointing this out. We agree with this comment. We revised the Introduction to reduce redundancy between the earlier description and the contribution list, and we rewrote the contributions to be mutually distinct and more specific. Specifically, we condensed the previous “three key stages” paragraph into a concise summary of UAV-FPG, and we rephrased the contributions to separately emphasize (i) the executable geometry-coupled spectrum confrontation loop, (ii) the expert-knowledge-guided frequency selection module, and (iii) the LLM-based opponent planner and its evaluation. The corresponding revisions are highlighted in red in the revised manuscript and can be found in Section 1 (Introduction), the third paragraph, and the “The primary contributions…” part.
Comments 9: The authors should strengthen the link between cited works and the specific signal confrontation problem addressed in this study.
Response 9: Thank you for pointing this out. We agree with this comment. We have revised Section 2.1 to explicitly connect the cited multi-agent game-theoretic and MARL paradigms (e.g., opponent learning/self-play and CTDE) to the spectrum signal confrontation setting considered in this paper. In particular, we clarified that anti-jamming can be formulated as an adversarial Markov game with step-wise actions for frequency hopping/spreading (ally) and jamming/mobility (opponent), and that UAV-FPG instantiates this connection through an explicit frequency-point decision loop with SINR/capacity-driven rewards. The corresponding revisions are highlighted in red in the revised manuscript in Section 2.1 (Multi-Agent Game Theory).
Comment 10:
Section 2.3 Lines 150–179. The review of LLM-based path planning is descriptive but lacks critical comparison. I suggest clarifying why LLM-based reasoning is preferable to conventional UAV path planners under the considered constraints.
Response 10:
Thank you for this helpful suggestion. We agree that a clearer critical comparison can strengthen the motivation for using LLMs in our setting. Accordingly, we revised Section 2.3 to explicitly clarify why the LLM-based approach is adopted under the UAV-FPG constraints. In particular, we added a concise comparison explaining that the LLM is not used for low-level UAV control, but rather as an episode-level, high-level opponent planner inside an offline simulator; we also explain that conventional planners (e.g., A*/RRT*/MPC) typically require an explicit geometric goal and/or an accurate dynamics/cost formulation and may yield repetitive behaviors under similar initial conditions, whereas the LLM can condition on contextual summaries (past positions and rewards) to generate diverse, reward-aware adversarial trajectories for stress-testing anti-jamming policies within UAV-FPG.
These revisions have been highlighted in red in the revised manuscript; please see Section 2.3 (Path Planning with Large Language Models) for the added critical comparison and rationale.
Comment 11:
The structure of the expert knowledge base is insufficiently described. Please specify how rules are encoded, whether knowledge is static or adaptive, and how conflicts between learned policy and expert rules are resolved.
Response 11: Thank you for this helpful comment. We have revised Section 4.2 to describe the expert knowledge base (KB) more explicitly, including its encoding, whether it is static or adaptive, and its interaction with the learned policy (all changes are marked in red in the manuscript). Specifically:
(1) How rules are encoded: the KB is implemented as an engineer-designed mapping table whose entries are keyed by the detected jamming type and the estimated interference center frequency (or band). We then train a lightweight predictor offline on this table to output frequency-selection weights (a categorical distribution over the 15 candidate frequency points), which provides executable guidance during gameplay.
(2) Static vs. adaptive: the KB is static in our implementation. It does not update online during RL training or evaluation; the predictor is trained offline and then fixed.
(3) Conflict handling: in UAV-FPG, the KB module is used as a guidance/selection component for determining the recommended safe frequency candidates under detected interference, whereas the RL policy focuses on learning the anti-jamming control decisions (e.g., whether to trigger spreading or hopping under the current state). Since these components operate on different parts of the decision process, there is no rule–policy conflict to resolve in the current design.
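To make the encoding in item (1) concrete, the sketch below shows one way a static table could be turned into a categorical distribution over the 15 candidate frequency points. It is illustrative only: the table rows, jamming-type labels, and the `sharpness` parameter are hypothetical, and a direct lookup followed by a softmax is swapped in here for the trained predictor the response describes.

```python
import math

NUM_FREQ_POINTS = 15  # the response states 15 candidate frequency points

# Hypothetical KB rows: (jamming type, jammed point index) -> safe candidate indices.
KNOWLEDGE_BASE = {
    ("narrowband", 3): [8, 9, 10],
    ("sweep", 7): [0, 1, 13, 14],
}


def kb_weights(jam_type, jammed_point, sharpness=4.0):
    """Softmax distribution over all points that favors the KB's safe candidates.

    Unknown (jam_type, jammed_point) pairs fall back to a uniform distribution.
    """
    safe = set(KNOWLEDGE_BASE.get((jam_type, jammed_point), []))
    scores = [sharpness if i in safe else 0.0 for i in range(NUM_FREQ_POINTS)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract peak for stability
    total = sum(exps)
    return [e / total for e in exps]
```

Because the table is fixed, the same detected jamming type and frequency always yield the same weights, which matches the static behavior stated in item (2).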
Comment 12:
Hardware details are provided, but software configuration is incomplete.
Response 12:
Thank you for pointing this out. Our experiments were implemented in Python (v3.9 or above) using the PyTorch deep learning framework. The MARL algorithms (e.g., MADDPG and the compared baselines) were trained and evaluated with standard GPU acceleration (CUDA-enabled environment) on the reported RTX 4090 hardware. We did not include additional software environment details in the manuscript to avoid overloading the experimental section with implementation-specific items; instead, we will release the full dependency/version list in the supplementary material or code repository upon acceptance.
Comment 13:
The conclusion repeats experimental observations without discussing limitations.
Response 13:
Thank you for the suggestion. We agree that the original conclusion mainly summarized experimental observations without explicitly discussing limitations. In the revised manuscript, we therefore added a dedicated section (“Limitation and Future Work”) to clearly state the key limitations and the corresponding future directions. Specifically, we clarify that (i) our current evaluation is conducted in the proposed UAV-FPG simulator and has not yet been validated via real UAV/hardware deployment or field electromagnetic testing, (ii) the simulator adopts simplified propagation/interference assumptions compared with real environments, and (iii) the LLM is used as an episode-level external planner, which may introduce practical considerations such as controllability and latency/cost. We also outline future work toward higher-fidelity simulation and real-world validation in that section.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript explores a reinforcement learning–based environment for modeling frequency-domain interference and corresponding mitigation strategies in UAV communication systems. From an engineering perspective, the idea is well motivated and, in parts, quite clever. Bringing together game-theoretic formulations, an explicit expert knowledge component, and large language models for adversarial path planning is timely and broadly aligned with the scope of Drones. The experimental setup reflects careful design choices, even if some aspects would benefit from clearer exposition.
That said, several points would benefit from further clarification in order to better align the experimental evidence with the claims made in the manuscript.
Major Comments
- Framing of the LLM-Based “Strong Adversary”
The reported results suggest that LLM-driven path planning can lead to more adaptive adversarial behavior. However, the current evidence supports this conclusion mainly within the boundaries of the simulated environment. Some of the broader claims could therefore be framed more cautiously, with a clearer separation between observed performance trends and more general conclusions.
- Structure and Role of the Expert Knowledge Base
The expert knowledge base is central to the proposed framework, yet its internal structure remains somewhat abstract. It is not always clear how expert rules are represented in practice, how they are accessed during learning, or whether they evolve over time. Making these aspects more explicit would improve both reproducibility and methodological transparency.
- Baseline Comparisons and Component Isolation
While the experimental section is extensive, it would be helpful to more clearly distinguish between different baseline configurations. In particular, comparisons between RL-only setups, RL combined with expert knowledge, and RL augmented with LLM support would help clarify the individual contribution of each component.
Minor Comments
- Figures and Result Interpretation
Several figures contain a large amount of information, which can make the main message harder to extract at first glance. Shorter captions and more explicit guidance on the intended takeaway would improve readability.
- Language and Overall Flow
The manuscript is generally understandable, but parts of the introduction and discussion remain overly dense. Some moderate language editing aimed at reducing verbosity would help the technical contributions stand out more clearly.
Overall, the work presents a promising simulation framework with clear potential. With more careful framing of claims and improved methodological clarity, the contribution could be significantly strengthened.
Comments on the Quality of English Language
The document conveys simple information, but the English needs improvement for clarity and directness: unnecessary repetition should be eliminated and complex technical explanations should be made clearer.
Author Response
Thank you very much for taking the time to review our manuscript and for providing thoughtful and constructive comments. We appreciate the reviewer’s careful reading and valuable suggestions, which have helped us improve both the clarity and the presentation of the work. In response, we have (i) revised the framing of the LLM-based “strong adversary” to more clearly distinguish simulation-based observations from broader conclusions, (ii) clarified the structure and usage of the expert knowledge base to enhance reproducibility, (iii) provided a clearer explanation of baseline configurations and component-wise contributions via the ablation experiments, and (iv) improved figure captions and overall readability through minor language edits. All corresponding revisions have been highlighted in red in the revised manuscript. Please find our detailed point-by-point responses below.
Comments 1:
Framing of the LLM-Based “Strong Adversary”
The reported results suggest that LLM-driven path planning can lead to more adaptive adversarial behavior. However, the current evidence supports this conclusion mainly within the boundaries of the simulated environment. Some of the broader claims could therefore be framed more cautiously, with a clearer separation between observed performance trends and more general conclusions.
Response 1:
Thank you for pointing this out. We agree with this comment. Therefore, we revised the manuscript to more cautiously frame the LLM-based “strong adversary” claims as simulation-based observations rather than broad real-world conclusions. Specifically, we replaced overly strong wording (e.g., “demonstrates”) with more appropriate phrasing (e.g., “suggests/indicates”), and explicitly clarified that the LLM opponent planning is evaluated within UAV-FPG. These revisions are marked in red in the revised manuscript and can be found in the following locations:
Highlights (Implication of main finding): revised to emphasize simulation scope;
Introduction (Contribution (3)): revised to avoid real-world generalization language;
Related Work (LLM path planning paragraph): revised to present findings as simulation-based;
Section 4.3 (Strong opponent effect): added an explicit scope statement;
Conclusion: revised to emphasize simulation scope and future validation.
Comments 2:
Structure and Role of the Expert Knowledge Base
The expert knowledge base is central to the proposed framework, yet its internal structure remains somewhat abstract. It is not always clear how expert rules are represented in practice, how they are accessed during learning, or whether they evolve over time. Making these aspects more explicit would improve both reproducibility and methodological transparency.
Response 2:
Thank you for pointing this out. We agree with this comment. Therefore, we clarified the internal representation and usage of the expert knowledge base to improve reproducibility and methodological transparency. Specifically, our expert knowledge base is an engineer-designed and fixed dataset built from center-frequency engineering experience. It stores mappings from the detected jamming type and interference frequency (or band) to the corresponding avoidance strategy and a set of interference-free safe candidate frequencies. In addition, we clarified that the knowledge base does not evolve over time: it remains constant throughout training and evaluation. Before gameplay starts, a linear layer is trained on this fixed dataset to learn the guidance mapping, and the learned parameters are then used during the game as an additional weighted guidance term to support the ally UAV’s frequency selection. These clarifications have been added in Section 4.2 (end of the subsection) and are marked in red in the revised manuscript.
Comments 3:
Baseline Comparisons and Component Isolation
While the experimental section is extensive, it would be helpful to more clearly distinguish between different baseline configurations. In particular, comparisons between RL-only setups, RL combined with expert knowledge, and RL augmented with LLM support would help clarify the individual contribution of each component.
Response 3:
Thank you for pointing this out. We agree that clearly separating baseline configurations is important for isolating the contribution of each component. Therefore, we would like to clarify the experimental design and how the component-wise contributions are evaluated in our manuscript. First, the proposed UAV-FPG is fundamentally a reinforcement learning–based game environment, and all reported settings are implemented under an RL framework (i.e., there is no non-RL baseline in this work). Second, the component isolation is conducted through our ablation experiments:
The effect of removing LLM-based opponent path planning (while keeping the RL-based opponent) is evaluated in Fig. 7(a) (“Without Using LLM to Plan Paths”).
The effect of removing the expert knowledge base on the ally side (while keeping the RL framework) is evaluated in Fig. 7(b) (“Without Using Expert Knowledge Base”).
Together, these two ablations explicitly compare (i) RL with expert knowledge vs. RL without expert knowledge, and (ii) RL with LLM-supported opponent planning vs. RL without LLM planning, thus clarifying the individual contribution of the expert knowledge base and the LLM component within our RL-based environment. We appreciate the reviewer’s suggestion and will ensure the above baseline distinctions are clearly reflected in our response.
Comments 4:
Figures and Result Interpretation
Several figures contain a large amount of information, which can make the main message harder to extract at first glance. Shorter captions and more explicit guidance on the intended takeaway would improve readability.
Response 4:
Thank you for pointing this out. We agree that clearer figure captions can help readers quickly grasp the main message. Therefore, we revised several figure captions to be more concise and to include more explicit guidance on the intended takeaway (e.g., what readers should focus on and what conclusion each figure supports). In particular, we updated the captions of Figures 3–6 to highlight the key interpretation (reward trends under different trajectories, stage-wise frequency overlap changes, and the relationship between overlap and jamming success). All revised text has been marked in red in the manuscript.
Comments 5:
Language and Overall Flow
The manuscript is generally understandable, but parts of the introduction and discussion remain overly dense. Some moderate language editing aimed at reducing verbosity would help the technical contributions stand out more clearly.
Response 5:
Thank you for pointing this out. We agree that improving readability is important. Therefore, we performed moderate language editing to reduce verbosity and improve the overall flow, while keeping the technical content unchanged. Specifically, we lightly refined wording in the Introduction and Conclusion/Discussion-related summary statements to improve clarity and reduce redundancy. All revised text has been marked in red in the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
The paper offers a new UAV environment for anti-jamming games that combines reinforcement learning, expertise, and LLM-based path planning. The concept is interesting, but the research currently lacks scientific rigor, correctness, and experimental verification. The paper has major issues that should be addressed:
1- "Model" - Communication Model Is Physically Incorrect. In equation (2), the interference power is subtracted from the signal power in dBm. However, this is not how SINR is calculated in wireless communication.
2- "Opponent" - LLM-Based Opponent Is Not Scientifically Described. The integration with the LLM model cannot be replicated. The paper does not describe: What type of LLM model was used, Whether it was fine-tuned, What temperature/decoding settings were used, What query frequency was used, What type of latency/inference constraints were used, and How text output translates to actions for the UAV. This makes the “strong opponent” claim unsupported.
3- Missing Baseline Comparisons, the evaluation lacks the necessary baseline comparisons. Only the path planning using the LLM is compared against the fixed trajectories.
4- Reward Function Is Arbitrary, the ally reward function is arbitrary as it combines the SNR, time cost, and hopping cost.
5- Expert Knowledge Base Is Vaguely Implemented, the expert knowledge base is vaguely implemented as it is only described as conceptual. How is the knowledge represented, for example, by rules, tables, models, etc.? Is the query to the KB dynamic or hardcoded? How does the KB interact with the RL decisions, etc.? This appears more like background theory than an actual implementation.
6- Evaluation Metrics Are Weak. The paper mostly presents the reward curves, which are internal RL metrics. Missing communication metrics: Throughput, Packet success rate, Bit error rate (BER), and Outage probability. The communication performance improvement is not shown.
7- Algorithm Complexity Claim Is Unclear, the complexity $O(N \cdot P)$ ignores: LLM inference cost, Environment simulation cost. This claim is incomplete.
8- Define all symbols when first introduced
9- Clarify time step duration
10- Explain why 15 frequency points were selected
11- Justify bandwidth values (5 MHz vs 2400 MHz spread spectrum)
Author Response
Comments 1: “Model” – Communication Model Is Physically Incorrect. In equation (2), the interference power is subtracted from the signal power in dBm. However, this is not how SINR is calculated in wireless communication.
Response 1: Thank you for pointing this out. We agree with the reviewer that subtracting interference power from the signal power in the dBm domain is physically incorrect for SINR calculation. This issue was caused by a notation/writing mistake in the original manuscript. Accordingly, we have revised the communication model in Section 4.1 by rewriting Eq. (2) using the standard SINR formulation (signal power divided by the sum of interference and noise powers in linear scale, then converted to dB). The related description and symbol definitions were also updated for consistency. All revisions are highlighted in red in the revised manuscript (Section 4.1, around Eq. (2)–(3)).
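The corrected calculation described in this response (convert to linear scale, divide, convert back to dB) can be illustrated with a short sketch. The function names are hypothetical and this is not the manuscript's Eq. (2) verbatim; the capacity helper uses the standard Shannon formula, which is assumed here to be what the SINR/capacity-based evaluation refers to.

```python
import math


def dbm_to_mw(p_dbm):
    """Convert a power level from dBm to milliwatts (linear scale)."""
    return 10.0 ** (p_dbm / 10.0)


def sinr_db(signal_dbm, interference_dbm, noise_dbm):
    """SINR in dB: signal / (sum of interferers + noise), computed in linear scale.

    `interference_dbm` is a list of interferer powers in dBm (may be empty).
    """
    signal_mw = dbm_to_mw(signal_dbm)
    denom_mw = sum(dbm_to_mw(p) for p in interference_dbm) + dbm_to_mw(noise_dbm)
    return 10.0 * math.log10(signal_mw / denom_mw)


def capacity_bps(bandwidth_hz, sinr_in_db):
    """Shannon capacity B * log2(1 + SINR), with SINR converted from dB."""
    sinr_linear = 10.0 ** (sinr_in_db / 10.0)
    return bandwidth_hz * math.log2(1.0 + sinr_linear)
```

For example, a -60 dBm signal against two -70 dBm interferers gives a linear ratio of about 5, i.e. roughly 7 dB, whereas subtracting dBm values directly would not account for how the interferer powers add.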
Comments 2:
“Opponent” - LLM-Based Opponent Is Not Scientifically Described. The integration with the LLM model cannot be replicated. The paper does not describe: What type of LLM model was used, Whether it was fine-tuned, What temperature/decoding settings were used, What query frequency was used, What type of latency/inference constraints were used, and How text output translates to actions for the UAV. This makes the “strong opponent” claim unsupported.
Response 2:
Thank you for this detailed and valuable comment. We agree that the initial manuscript did not provide sufficient implementation details to make the LLM-based opponent fully reproducible. In the revised manuscript, we have expanded the description of the LLM integration and added explicit implementation details covering the model choice, inference configuration, query schedule, latency assumptions, and the deterministic text-to-action mapping.
Specifically, our opponent planner uses the iFLYTEK Spark Max-32K model via its public API, and no fine-tuning is performed. During simulation, the LLM is queried once per round (episode): we input the previous round’s opponent trajectory together with the per-step rewards (and the current position), and request the LLM to output a sequence of 3D direction vectors for the next round. The decoding parameters are kept fixed across runs (temperature, top-p, and maximum output length; now explicitly stated in the manuscript). Since our evaluation is conducted in an offline simulator, we do not impose real-time onboard latency constraints; we clarify this assumption and treat LLM inference as an episode-level planning step.
To ensure an unambiguous and replicable translation from text to UAV actions, we added a deterministic mapping: the LLM outputs direction vectors $d_t = [d_x, d_y, d_z]$ in a fixed format. We enforce a feasibility constraint $|d_x| + |d_y| + |d_z| \le 1$. If the constraint is violated, we re-query the LLM up to $R$ times; if it still fails (or parsing fails), we project the vector onto the feasible set as a fallback. The opponent position is updated by $p_{t+1} = p_t + v \cdot \Delta t$, where $v$ is the fixed UAV speed and $\Delta t$ is the simulator time step. With these additions, the “strong opponent” claim is now grounded in a clearly defined, reproducible simulator-based planner rather than an underspecified qualitative description.
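The text-to-action mapping described above might look roughly like the following sketch. Several details are assumptions made for illustration: the bracketed `[dx, dy, dz]` text format, the hypothetical `query_llm` callable standing in for the API call, radial rescaling as the fallback "projection", and scaling each position step by the direction vector.

```python
import re

_NUMBER = r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?"
_VECTOR = re.compile(
    r"\[\s*(%s)\s*,\s*(%s)\s*,\s*(%s)\s*\]" % (_NUMBER, _NUMBER, _NUMBER)
)


def parse_directions(text):
    """Extract every [dx, dy, dz] triple from the raw LLM text."""
    return [tuple(float(g) for g in m.groups()) for m in _VECTOR.finditer(text)]


def is_feasible(d, tol=1e-9):
    """Check the constraint |dx| + |dy| + |dz| <= 1."""
    return sum(abs(c) for c in d) <= 1.0 + tol


def project_to_feasible(d):
    """Fallback: rescale so the L1 norm is at most 1 (radial scaling, an assumption)."""
    norm = sum(abs(c) for c in d)
    if norm <= 1.0:
        return d
    return tuple(c / norm for c in d)


def plan_with_retries(query_llm, max_retries):
    """Query once, re-query up to `max_retries` times, then project as a fallback.

    `query_llm` is a hypothetical zero-argument callable returning raw text.
    Returns an empty list if the last reply could not be parsed at all.
    """
    directions = []
    for _ in range(max_retries + 1):
        directions = parse_directions(query_llm())
        if directions and all(is_feasible(d) for d in directions):
            return directions
    return [project_to_feasible(d) for d in directions]


def step(position, direction, speed, dt):
    """Advance the position by speed * dt along the direction (assumed form)."""
    return tuple(p + speed * dt * c for p, c in zip(position, direction))
```

Keeping parsing, feasibility checking, and the fallback as separate pure functions makes the mapping deterministic for a given LLM reply, so the only remaining source of variability is the reply itself.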
Comments 3:
Missing Baseline Comparisons. The evaluation lacks the necessary baseline comparisons. Only the path planning using the LLM is compared against the fixed trajectories.
Response 3:
Thank you for this important comment. We agree that in the previous version the LLM-based opponent was mainly compared against fixed geometric trajectories, which was insufficient to position our approach against established baseline planners and navigation methods.
In the revised manuscript, we expanded the baseline comparisons to include representative methods from the categories suggested by the reviewer. Specifically, in addition to the fixed-trajectory baseline (triangular), we added: (i) RL-based navigation baselines (MASAC and MAPPO) to represent established reinforcement learning navigation methods in continuous-control and multi-agent settings; (ii) a heuristic planner baseline (Greedy Intercept / Pure Pursuit) that deterministically moves toward the ally UAV; and (iii) a classical path optimization baseline using MPC (short-horizon optimization) that replans actions at the episode boundary by optimizing a short-horizon objective.
We summarize the results using average episode rewards (Ally Average Reward and Opponent Average Reward), which provide a stable scalar metric for cross-method comparison. Higher opponent average reward indicates stronger adversarial pressure in UAV-FPG, while higher ally average reward reflects better anti-jamming performance. The new comparison table (Table 5) shows that the LLM-based planner yields higher opponent average reward than fixed trajectories and RL-based baselines, supporting our claim that LLM-based planning can act as a stronger and more adaptive adversary within UAV-FPG.
Comments 4:
Reward Function Is Arbitrary. The ally reward function is arbitrary as it combines the SNR, time cost, and hopping cost.
Response 4:
Thank you for this comment. We agree that the previous manuscript did not explain the design motivation of the reward functions clearly enough. In UAV-FPG, both agents face inherently multi-objective goals, and we use a scalarized reward to encode these objectives in a single training signal.
For the ally UAV, the primary objective is to maintain a reliable communication link under jamming. Therefore, the reward includes the instantaneous link-quality term (SNR/SINR in dB), which directly promotes higher communication reliability. However, in realistic anti-jamming communication, the ally cannot apply countermeasures without cost. Remaining in spread-spectrum mode for a long duration incurs resource/time overhead and can degrade operational efficiency; this is captured by the spreading-time cost, which discourages “always-spread” behavior. Similarly, frequent frequency hopping is associated with reconfiguration and synchronization overhead (e.g., switching latency, coordination cost with the base station), and can lead to degenerate policies that hop excessively; this is modeled by the hopping cost. The threshold represents the minimum SNR requirement for reliable communication in our simulator, so the corresponding term encourages the ally to stay above the reliability threshold rather than only maximizing SNR numerically.
For the opponent UAV, the reward is designed to reflect two practical adversarial goals: (i) approaching/maintaining proximity to the ally to increase effective jamming impact (modeled by the distance-based term), and (ii) maximizing the degradation of the ally’s link quality (modeled by the SNR reduction term). This design encourages the opponent to choose trajectories and jamming actions that meaningfully reduce the ally’s communication quality rather than exhibiting trivial movement patterns.
Overall, the reward terms are not intended as an ad-hoc combination, but as a standard scalarization of competing objectives in anti-jamming communication games: maintaining link reliability while accounting for operational costs on the ally side, and inducing effective interference through proximity and link-quality degradation on the opponent side. In the revised manuscript, we have added clearer explanations of these design principles and defined the symbols at their first appearance to improve clarity and reproducibility.
Comments 5:
Expert Knowledge Base Is Vaguely Implemented. The expert knowledge base is vaguely implemented as it is only described as conceptual. How is the knowledge represented (rules/tables/models)? Is the query dynamic or hardcoded? How does the KB interact with the RL decisions?
Response 5:
Thank you for raising this point. We agree that the previous version did not describe the KB implementation in a sufficiently reproducible manner. In our implementation, the expert knowledge base is a fixed, engineer-designed table that stores tuples of the form (jamming type τ, estimated interference frequency/band f_opponent) → (recommended avoidance strategy, safe frequency set F_safe). Before gameplay, we train a lightweight MLP g_ψ on this table to output a categorical distribution (frequency-selection weights) over the 15 candidate frequency points.
During gameplay, the query to the KB is dynamic and is triggered whenever interference is detected. The RL policy is responsible for deciding whether to hop/spread, and the KB provides the frequency candidate selection conditioned on (τ, f_opponent). Concretely, when hopping is triggered, the ally sets f_ally = argmax g_ψ(τ, f_opponent) (restricted to the safe set); otherwise it keeps the current frequency. The KB module remains fixed during RL training and serves as a guidance/candidate selector, while the RL policy learns when to apply these countermeasures.
Comments 6: Evaluation Metrics Are Weak. The paper mostly presents the reward curves, which are internal RL metrics. Missing communication metrics: Throughput, Packet success rate, Bit error rate (BER), and Outage probability. The communication performance improvement is not shown.
Response 6: Thank you for the comment. We agree that, from a traditional communications perspective, metrics such as throughput, packet success rate, BER, and outage probability are important. However, the primary goal of this paper is to propose UAV-FPG as an executable spectrum-confrontation game environment for developing and stress-testing anti-jamming decision-making policies, rather than a full PHY/MAC-level communication-system evaluation. Accordingly, our evaluation focuses on task-aligned metrics that directly reflect the confrontation objective in UAV-FPG, i.e., link-quality-driven rewards and confrontation outcomes under jamming/anti-jamming.
In the revised manuscript, we strengthened the evaluation by adding seed-averaged scalar metrics beyond raw reward curves, including Final-10% average reward and AUC (area under the reward curve) to capture both final performance and learning efficiency/stability across training, together with standard deviation over multiple random seeds (Table 6). We also explicitly report frequency-overlap/jamming success trends as outcome-level indicators of the spectrum confrontation process. These metrics are consistent with our Markov-game formulation, where instantaneous SINR/capacity is already used internally to compute rewards.
At the same time, we acknowledge that mapping our abstraction to packet-level or BER/outage metrics would require additional PHY/MAC assumptions (e.g., specific modulation/coding, packetization, receiver implementation, fading models), which are outside the current scope of UAV-FPG. To avoid over-claiming, we have clarified this limitation and added it as a future direction: incorporating higher-fidelity channel models and packet-level evaluation (e.g., throughput/BER/outage) in extended versions of the simulator and in real-world validation (Section “Limitation and Future Work”).
Comments 7:
Algorithm Complexity Claim Is Unclear. The complexity O(N·P) ignores LLM inference cost and environment simulation cost.
Response 7:
Thank you for pointing this out. We agree that the previous complexity statement was incomplete. The term O(N·P) is intended to describe the dominant cost of the MADDPG neural network updates (forward/backward passes), where N is the number of training steps and P is the network parameter size. In addition, the overall runtime includes (i) an environment-step cost O(N·C_env) for computing path loss/SINR/rewards and state transitions, and (ii) an episode-level external LLM planning cost O(E·C_LLM), since we call the LLM API once per episode. As the LLM is accessed via an external API, we do not report internal FLOP complexity; instead, we treat it as a black-box service cost (latency/token-dependent) per call. We have revised the manuscript to clarify these components and avoid overstating O(N·P) as the total complexity.
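Putting the three components named in this response together, the overall runtime can be written in one line. This is only a restatement of the terms above; the symbol T_total is introduced here for illustration and does not come from the manuscript.

```latex
T_{\text{total}} = O(N \cdot P) + O(N \cdot C_{\text{env}}) + O(E \cdot C_{\text{LLM}})
```

The first two terms scale with the number of training steps N, while the LLM term scales with the number of episodes E, which is why the API cost is treated as an episode-level overhead.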
Comment 8:
Define all symbols when first introduced.
Response 8:
Thank you for this helpful suggestion. We agree that the original manuscript did not consistently define all symbols at their first appearance. In the revised version, we have carefully audited the manuscript and added explicit definitions for all variables and parameters when they are first introduced (e.g., in Eqs. (1)–(6) and the associated “where” clauses, including $H$, $t$, $M$, $\alpha$, $E$, $\Delta\mathrm{SNR}$, and the proximity threshold in Eq. (6)). These edits improve clarity and reproducibility.
Comments 9:
Clarify time step duration.
Response 9:
Thank you for this comment. In UAV-FPG, the “time step” follows the standard reinforcement-learning convention: one time step corresponds to one simulator/environment transition (i.e., a single state–action–reward–next-state update). For reproducibility, we clarify that the simulator uses a fixed discrete step size Δt, and all dynamics (e.g., position updates and sensing intervals) are defined consistently on this discrete-time grid. In particular, the opponent sensing interval is specified in seconds and can be interpreted as occurring every n steps (with n·Δt seconds). We will ensure this clarification is explicitly stated to avoid confusion.
Comment 10:
Explain why 15 frequency points were selected
Response 10:
We thank the reviewer for this helpful comment. We agree that the rationale for selecting 15 frequency points should be stated explicitly.
In this work, we discretize the 150–250 MHz band into a finite set of candidate channels to make the Markov-game formulation and MARL training tractable. Using 15 frequency points is a default simulation setting that provides sufficient spectral diversity to represent different jamming patterns (e.g., narrowband vs. wideband/comb effects) while keeping the discrete decision space manageable for stable learning. Importantly, the frequency resolution is not a limitation of UAV-FPG: the number of frequency points Nf is implemented as a configurable parameter and can be increased (or re-channelized) to match other experimental assumptions.
This explanation has been added to the revised manuscript and highlighted in red using \rev{} in Section 4.1 (Frequency Point Game in Wireless Communications), right after the sentence describing the 15-point selection.
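As an illustration of the configurable discretization described in this response, the sketch below builds a candidate frequency set from the band edges and the number of points. The even spacing with both endpoints included, and the function and parameter names, are assumptions made for the example; the response does not state how the 15 points are placed within 150–250 MHz.

```python
def frequency_points(f_min_mhz=150.0, f_max_mhz=250.0, n_f=15):
    """Candidate frequency points in MHz, evenly spaced (assumed layout)."""
    if n_f < 2:
        raise ValueError("need at least two frequency points")
    step = (f_max_mhz - f_min_mhz) / (n_f - 1)
    return [f_min_mhz + k * step for k in range(n_f)]


# Default setting from the response: 15 points over 150-250 MHz.
default_grid = frequency_points()

# A finer re-channelization only requires changing n_f.
finer_grid = frequency_points(n_f=21)
```

Under this assumed layout the default spacing is 100/14 MHz (about 7.14 MHz), and n_f = 21 gives a 5 MHz spacing.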
Comment 11:
Justify bandwidth values (5 MHz vs 2400 MHz spread spectrum)
Response 11:
We thank the reviewer for this suggestion. We agree that the bandwidth settings require justification and clarification of their role in the simulator.
In our simplified link/noise model, the bandwidth values (5 MHz for non-spread and 2400 MHz for spread spectrum) are chosen as default parameters to create two clearly separated operating regimes. Specifically, when bandwidth increases, the in-band noise power increases through the term N0 + 10·log10(B), which reduces the resulting SINR/SNR and thus captures the intended “processing gain / concealment” effect of spreading at the system level in our environment. We also emphasize that these bandwidth values are not fixed design choices: they are configurable simulation parameters and can be replaced by alternative bandwidths consistent with other system specifications or experimental settings.
This justification has been added to the revised manuscript and highlighted in red in the caption of Table 4.
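To make the size of the two regimes concrete, the sketch below evaluates the noise term for both bandwidths. It assumes that the logarithm is base 10, that B is expressed in Hz, and that N0 is a noise density in dBm/Hz; the value -174 dBm/Hz is the usual room-temperature thermal figure and is used here only as a placeholder, not as the manuscript's setting.

```python
import math


def noise_power_db(n0_db_per_hz, bandwidth_hz):
    """In-band noise power N0 + 10*log10(B), in the same dB units as N0."""
    return n0_db_per_hz + 10.0 * math.log10(bandwidth_hz)


def noise_increase_db(b_narrow_hz, b_wide_hz):
    """Extra in-band noise (dB) when widening the bandwidth; independent of N0."""
    return 10.0 * math.log10(b_wide_hz / b_narrow_hz)


N0_PLACEHOLDER = -174.0  # dBm/Hz, assumed value for illustration only

narrow = noise_power_db(N0_PLACEHOLDER, 5e6)      # non-spread, 5 MHz
wide = noise_power_db(N0_PLACEHOLDER, 2400e6)     # spread, 2400 MHz
gap = noise_increase_db(5e6, 2400e6)              # about 26.8 dB
```

Because the gap depends only on the bandwidth ratio (2400/5 = 480), the two default settings are separated by roughly 26.8 dB of in-band noise regardless of the N0 value chosen.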
Reviewer 4 Report
Comments and Suggestions for Authors
This paper introduces UAV-FPG, a reinforcement learning-based environment simulating signal interference and anti-jamming strategies in UAV communications. It combines an expert knowledge base for ally UAV frequency selection with large language model (LLM) planning for an adaptive opponent. Experiments show that integrating LLMs and expert knowledge improves communication robustness and simulates realistic adversarial behavior.
Strength
- The paper addresses a major gap by modeling spectrum confrontation, which prior frameworks ignored (lines 113-121). This justifies the creation of UAV-FPG as a new environment.
- Combining reinforcement learning with both an expert system and LLM for adversarial modeling is novel.
- The system models physical-layer constraints like SNR and diverse jamming techniques. Equations and attack types (e.g., single-tone, comb-spectrum) are defined clearly, increasing fidelity.
- Ablation results prove that its removal weakens anti-jamming effectiveness.
Weakness
- Results only compare versions of the proposed method. There’s no benchmarking against prior models or anti-jamming techniques. This weakens claims of improvement over the state of the art.
- Does not adequately reference foundational literature on multi-agent reinforcement learning (MARL). For example, the MADDPG framework (Lowe et al., 2017).
- The paper doesn’t name the LLM used or clarify how path inference is executed. It’s hard to judge reproducibility or evaluate the model’s reasoning capacity. More transparency is needed.
- While technically sound, the contribution mainly integrates existing tools (RL, expert systems, LLMs). No new algorithms are introduced. The novelty lies in system design.
- Metrics like final performance, success rates, or statistical confidence are missing. Network architectures and hyperparameters are under-described.
- LLMs may not be deployable in real UAV systems due to compute limitations. The paper doesn’t address latency, onboard constraints, or possible deployment paths. A discussion of practicality is needed.
- Expressions like “simulation and deduction environment” are vague.
- Thresholds like “30 units” in reward functions or piecewise time penalties are not explained.
- “Abalation” (Figure 6 caption) should be “Ablation”.
- “Storge” in Figure 1 should be corrected to “Storage”.
Author Response
Comment 1: Results only compare versions of the proposed method. There’s no benchmarking against prior models or anti-jamming techniques. This weakens claims of improvement over the state of the art.
Response 1: Thank you for this valuable comment. We agree that the original submission mainly compared variants of our proposed system and lacked benchmarking against representative prior anti-jamming techniques. To address this concern, we have added additional ally-side baselines implemented within UAV-FPG, including No defense (fixed frequency, no spread), pseudo-random frequency hopping (Random FHSS), blacklist-based adaptive hopping (Adaptive FH), a multi-armed bandit strategy (Bandit-UCB), and a learning-based DQN-style channel-selection baseline.
For a fair and stress-testing-oriented comparison, we fix the opponent to the LLM planner (a strong adversary) and evaluate all ally strategies using the same metric of average episode rewards. The new benchmarking results are summarized in Table 7, and the corresponding description has been added at the end of Section 5.2 (Opponent Gameplay Performance in UAV-FPG Environment), immediately before Section 5.3. All added or modified text is marked in red. These results demonstrate that our full system achieves the highest ally reward under the same environment settings, outperforming representative heuristic hopping baselines and remaining competitive against learning-based alternatives.
Comments 2: Does not adequately reference foundational literature on multi-agent reinforcement learning (MARL). For example, the MADDPG framework (Lowe et al., 2017).
Response 2: Thank you for this helpful suggestion. We agree that the manuscript should more explicitly reference foundational MARL literature. Accordingly, we have added an explicit citation to the MADDPG work by Lowe et al. (2017) and clarified our use of MADDPG in the experimental setup. The revisions are marked in red in the revised manuscript. Specifically, we updated Section 5.1 “Environment Setting” (end of the paragraph) to include: “... the hyperparameter values of the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) model \cite{CTDE}.”
Comments 3: The paper doesn’t name the LLM used or clarify how path inference is executed. It’s hard to judge reproducibility or evaluate the model’s reasoning capacity. More transparency is needed.
Response 3: Thank you for pointing this out. We agree that naming the LLM and clarifying the path-inference execution are important for reproducibility. In the revised manuscript, we have explicitly stated that the opponent planner calls the iFLYTEK Spark Max-32K model via its API, and we described how inference is executed at the episode boundary: the LLM is queried once per episode using the previous-round trajectories and rewards to generate the next-round motion directions. We also added the text-to-action mapping and feasibility enforcement (constraint checking with re-query and fallback projection when needed) to make the inference-to-simulation pipeline fully explicit. These revisions are marked in red and can be found in Section 4.3 “Strong Opponent Effect: Path Inference and Planning with LLMs” (Implementation details; LLM inference configuration; Text-to-action mapping and fallback), and we also reiterate the LLM name in Section 5.1 “Environment Setting.”
Comment 4: While technically sound, the contribution mainly integrates existing tools (RL, expert systems, LLMs). No new algorithms are introduced. The novelty lies in system design.
Response 4:
Thank you for this thoughtful comment. We agree that our contribution does not introduce a new reinforcement-learning algorithm per se. Instead, the main novelty of this work lies in the system-level instantiation and executable environment design: UAV-FPG explicitly couples (i) 3D UAV mobility and geometry-dependent propagation, (ii) a step-by-step frequency-point jamming/anti-jamming decision loop with SINR/capacity-driven rewards, and (iii) two plug-in intelligence modules (a fixed expert knowledge base for guided frequency selection and an episode-level LLM-based opponent planner for generating diverse, reward-aware adversarial trajectories).
To make this scope unambiguous, we revised the first part of the manuscript (Introduction and Contributions) to clearly state that the novelty is the joint integration of these components into one reproducible Markov-game environment, rather than any single component alone, and we added a compact comparison with representative anti-jamming learning games and UAV simulators (Table 1). We believe this environment contribution is valuable for systematically developing and stress-testing anti-jamming policies under realistic, geometry-coupled dynamics.
Comments 5: Metrics like final performance, success rates, or statistical confidence are missing. Network architectures and hyperparameters are under-described.
Response 5: Thank you for pointing this out. We agree that relying only on learning curves can make comparison and reproducibility harder. Therefore, in the revised manuscript we have added scalar summary metrics and statistical confidence across random seeds, and expanded the description of hyperparameters. Specifically, we now state that all learning curves and scalar results are averaged over Ns=5 independent runs with different random seeds, reporting mean ± standard deviation, and we use a shaded band corresponding to ±2σ across seeds in reward plots (Section 5.1). In addition, we provide seed-averaged scalar metrics such as Final-10% performance and AUC (Table 6), and report frequency-overlap / jamming-success statistics with error bars (Figure 5). Regarding implementation details, the environment parameters are summarized in Table 4, and the MADDPG hyperparameters are summarized in Table 3. These additions improve clarity and reproducibility without changing the experimental setup.
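The scalar summaries mentioned in this response (Final-10% performance, AUC, and mean ± standard deviation over seeds) can be computed along the following lines. This is a sketch under stated assumptions: the trapezoidal rule with unit episode spacing for AUC, the sample standard deviation, and all function names are choices made for the example rather than details confirmed by the manuscript.

```python
import statistics


def final_fraction_mean(rewards, fraction=0.1):
    """Mean reward over the last `fraction` of episodes (Final-10% by default)."""
    k = max(1, round(len(rewards) * fraction))
    return statistics.fmean(rewards[-k:])


def auc_trapezoid(rewards):
    """Area under the reward curve, trapezoidal rule with unit episode spacing."""
    return sum((a + b) / 2.0 for a, b in zip(rewards, rewards[1:]))


def summarize_over_seeds(values):
    """Mean and sample standard deviation across independent seeds."""
    return statistics.fmean(values), statistics.stdev(values)


def two_sigma_band(mean, std):
    """Lower and upper edges of a mean +/- 2*sigma shaded band."""
    return mean - 2.0 * std, mean + 2.0 * std
```

With five seeds, each metric is first computed per run and then passed to `summarize_over_seeds`, so the reported spread reflects seed-to-seed variation rather than within-run noise.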
Comments 6: LLMs may not be deployable in real UAV systems due to compute limitations. The paper doesn’t address latency, onboard constraints, or possible deployment paths. A discussion of practicality is needed.
Response 6: Thank you for this valuable comment. We agree that deploying large LLMs onboard real UAV platforms can be challenging due to compute, latency, power, and controllability constraints. We would like to clarify that in our work the LLM is not used for low-level UAV control or real-time onboard decision making. Instead, it serves as an episode-level, high-level opponent planner within the UAV-FPG offline simulator, and feasibility is enforced by the simulator constraints (Section 4.3).
To address the practicality concern more explicitly, we have expanded the “Limitation and Future Work” section to discuss (i) the lack of real-world UAV/hardware and field EM testing, (ii) simplified propagation/interference assumptions, and (iii) the reliance on an external LLM planner with potential latency/cost/controllability issues. We also added a concrete deployment path: the LLM module can be replaced by an edge/offboard planner or a lightweight distilled policy for deployment-oriented settings (Section “Limitation and Future Work”). We believe this addition clarifies the intended scope of the LLM component and outlines practical directions for future deployment.
Comments 7: Expressions like “simulation and deduction environment” are vague.
Response 7: Thank you for pointing this out. We agree that the phrase “simulation and deduction environment” is ambiguous. In the revised manuscript, we replaced this wording with a more precise description, i.e., “an executable UAV spectrum-confrontation simulation environment (UAV-FPG)”, and added a brief definition clarifying that UAV-FPG couples 3D UAV kinematics, geometry-dependent propagation, and an explicit frequency-point decision loop with SINR/capacity-based rewards. In addition, we clarify that the LLM is used only as an episode-level opponent planner inside the simulator, rather than implying any unspecified “deduction” process. These revisions can be found in Section 5.1.
Comments 8:
Thresholds like “30 units” in reward functions or piecewise time penalties are not explained.
Response 8:
Thank you for the comment. We agree that the constants in the reward design should be clearly explained. In the revised manuscript, we added an explicit rationale for the proximity threshold in Eq. (6). Specifically, the 30 m threshold is a scenario-driven engineering setting based on our application background, intended to approximate an effective close-in engagement/jamming range under the default transmit-power and path-loss assumptions in UAV-FPG. It also serves as reward shaping to avoid granting proximity reward at long distances and to encourage realistic close-in pursuit-and-jam behaviors. In addition, we explicitly added this threshold (with units) to Table 4 and clarified that it is a configurable environment parameter for reproducibility.
Comment 9: “Abalation” (Figure 6 caption) should be “Ablation”.
Response 9: Thank you for pointing this out. We have corrected the typo in the revised manuscript by changing “Abalation” to “Ablation” in the caption of Figure 6. We also performed an additional proofreading pass (including spell-check) to identify and fix similar spelling/typographical issues throughout the manuscript.
Comment 10: “Storge” in Figure 1 should be corrected to “Storage”.
Response 10: Thank you for the comment. We have corrected “Storge” to “Storage” in Figure 1 in the revised manuscript. In addition, we reviewed all figure labels/captions and the main text to ensure consistency and to minimize the possibility of remaining spelling-related errors.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I acknowledge the authors' effort in responding to the review comments. I have no further comments.
Author Response
We thank the reviewer for the positive evaluation of our revised manuscript.
We appreciate the reviewer’s acknowledgment of our responses, and no further changes were required.
Reviewer 2 Report
Comments and Suggestions for Authors
The revised manuscript shows a clear effort to address the previous comments, and the overall presentation has improved. The proposed framework remains interesting and relevant to the scope of Drones. However, several core issues raised in the first review have only been partially addressed and still require further clarification.
Major Comments
- Framing of the LLM-Based Adversary: The discussion of the LLM-driven “strong adversary” is more careful than in the previous version, but some claims remain stronger than what is directly supported by the simulation results. A clearer distinction between observed behavior within the proposed environment and broader general conclusions would improve scientific rigor.
- Role and Formalization of the Expert Knowledge Base: While additional explanations have been added, the structure and operational role of the expert knowledge base remain somewhat abstract. A more concrete description of how expert rules are represented and used during training would improve reproducibility.
- Baseline Comparisons: The experimental section would benefit from clearer isolation of the contribution of each component (RL, expert knowledge, and LLM support). At present, the relative impact of these elements remains difficult to disentangle.
Minor Comments
- Figures and Result Interpretation: Several figures are still information-dense. Clearer captions and more explicit guidance on the main takeaway of each figure would improve readability.
- Language and Flow: Although improved, parts of the manuscript remain verbose. Further tightening of the text would help the technical contributions stand out more clearly.
Overall, the paper has improved compared to the previous version and shows clear potential. Addressing the points above would further strengthen the alignment between methodology, results, and conclusions.
Comments on the Quality of English Language
The document contains simple information which needs English improvement to achieve better clarity and directness through eliminating unnecessary repetition and enhancing complex technical explanations.
Author Response
Comments 1: Framing of the LLM-Based Adversary: The discussion of the LLM-driven “strong adversary” is more careful than in the previous version, but some claims remain stronger than what is directly supported by the simulation results. A clearer distinction between observed behavior within the proposed environment and broader general conclusions would improve scientific rigor.
Response 1: Thank you for this important comment. We agree that the notion of an LLM-driven “strong adversary” should be framed strictly as an observed behavior within the proposed UAV-FPG simulator, rather than a broader claim about real-world jamming capability or onboard UAV planning performance. Accordingly, we revised the manuscript to (i) explicitly define what “stronger” means under our evaluation protocol, and (ii) add clear disclaimers to prevent over-generalization beyond UAV-FPG.
Specifically, we made the following changes (all revisions are marked in red in the manuscript):
- Abstract: We now describe the LLM as an episode-level opponent trajectory generator within UAV-FPG and clarify that the reported improvements are relative to fixed-path baselines under our protocol, while explicitly stating that these findings are confined to the proposed simulation environment and are not intended as general real-world claims.
Location: Abstract.
- Introduction: We added an explicit clarification paragraph defining “LLM-driven strong adversary” in an in-simulator sense, i.e., an opponent planner that tends to produce more diverse, feedback-conditioned trajectories and often achieves higher opponent returns under the same UAV-FPG rules and rewards, without implying real-world superiority.
Location: Section 1 (Introduction), immediately after the UAV-FPG overview paragraph.
- Related Work / LLM Path Planning Discussion: We further clarified that the LLM is not used for low-level control, but serves as a high-level episode planner inside an offline simulator. We emphasize that our goal is to create a more challenging simulator-side adversary for stress-testing ally anti-jamming policies, rather than claiming superiority for real-time onboard planning.
Location: Section 2.3, “Path Planning with Large Language Models.”
- Methods (Section 4.3): We strengthened the framing by explicitly stating that we report “stronger” only in terms of the opponent reward/pressure induced in UAV-FPG under our reward definition, and we do not claim universal strength outside the simulator.
We believe these revisions make the scope and interpretation of our results more scientifically rigorous by clearly separating environment-specific observations from any broader conclusions.
Comments 2:
Role and Formalization of the Expert Knowledge Base: While additional explanations have been added, the structure and operational role of the expert knowledge base remain somewhat abstract. A more concrete description of how expert rules are represented and used during training would improve reproducibility.
Response 2:
Thank you for the clarification. We agree that reproducibility benefits from a more explicit and operational description of how the expert knowledge base (KB) is represented and how it interacts with RL training. In our implementation, the KB is a fixed, engineer-designed table/dataset defined over the 15 discretized candidate frequency points (150–250 MHz). Each entry encodes a mapping of the form
(τ, f_opponent) → (F_safe, strategy),
where τ denotes the detected jamming type and f_opponent is the estimated interference center frequency/band. The output provides a recommended countermeasure and a safe-frequency set F_safe. Importantly, the KB does not evolve over time. We train a lightweight predictor g_ψ (MLP/linear layer) offline on this fixed table to produce frequency-selection weights over the 15 candidate points. During RL training/gameplay, the KB query is invoked when interference is detected: the RL policy learns when to trigger hopping/spreading, while the KB module specifies which safe frequency to choose. Concretely, when hopping is triggered (a_hopping > 0.5), we select f_ally = argmax g_ψ(τ, f_opponent) restricted to F_safe; otherwise the ally keeps its current frequency. To make this division of labor fully explicit, we added one sentence stating that both the KB table and g_ψ remain fixed throughout RL training and evaluation, and that RL only learns the hop/spread decision while g_ψ determines the safe-frequency choice. This addition appears at the end of Section 4.2 in the revised manuscript and is marked in red.
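The division of labor between the RL policy and the KB module can be sketched as a masked argmax. The example below is illustrative only: the table entries, the jamming-type labels, the index-based representation of frequencies, and the behavior when the safe set is empty are hypothetical, and the weights are taken as a given list rather than produced by a trained g_ψ.

```python
# Hypothetical KB table: (jamming type, opponent frequency index) ->
# (recommended strategy, safe set of frequency indices). Entries are made up.
KB_TABLE = {
    ("single_tone", 3): ("hop", {0, 1, 7, 8, 12}),
    ("comb_spectrum", 5): ("spread", {2, 9, 14}),
}


def select_frequency(weights, safe_indices, current_index, a_hopping, threshold=0.5):
    """Pick the ally frequency index for this step.

    weights: frequency-selection weights over all candidate points
             (stands in for the output of g_psi).
    safe_indices: indices in the safe set F_safe.
    a_hopping: hop component of the RL action; hopping triggers above threshold.
    """
    if a_hopping <= threshold or not safe_indices:
        # No hop requested, or no safe candidate (assumed behavior): stay put.
        return current_index
    # Argmax of the weights restricted to the safe set; ties go to the lowest index.
    return max(sorted(safe_indices), key=lambda i: weights[i])


# Usage with made-up weights over 15 candidate points.
weights = [0.01] * 15
weights[8] = 0.40
weights[5] = 0.50  # highest weight overall, but not in the safe set below
_, safe = KB_TABLE[("single_tone", 3)]
hop_choice = select_frequency(weights, safe, current_index=3, a_hopping=0.9)
stay_choice = select_frequency(weights, safe, current_index=3, a_hopping=0.2)
```

In this toy case the unrestricted argmax would be index 5, but the restriction to the safe set yields index 8, and with a_hopping below the threshold the current index 3 is kept.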
Comments 3:
Baseline Comparisons: The experimental section would benefit from clearer isolation of the contribution of each component (RL, expert knowledge, and LLM support). At present, the relative impact of these elements remains difficult to disentangle.
Response 3:
Thank you for this helpful comment. We agree that the relative impact of RL, the expert knowledge base (KB), and the LLM planner should be stated more explicitly, even though the corresponding ablations are already included in Section 5.4. To address this, we have added a brief clarification at the beginning of Section 5.4 emphasizing that all comparisons follow a controlled one-factor-at-a-time protocol, i.e., only one component is changed while the remaining modules and environment settings are kept identical. In our framework, the RL component (MADDPG) learns the decision policy under the UAV-FPG Markov game, the KB is a fixed guidance module that provides safe frequency recommendations once hopping is triggered, and the LLM is used only as an episode-level opponent motion planner inside the simulator. Accordingly, the “w/o KB” setting disables only the KB-based frequency selector while keeping the same RL training pipeline and opponent setting unchanged (Fig. 7(b)), and the “w/o LLM” setting replaces only the LLM planner with non-LLM opponent planners under the same RL and KB configuration (Fig. 7(a) and Table 5). In addition, to further contextualize the contribution of learning-based components, we also report ally-side comparisons against representative non-RL baselines under a fixed opponent setting (Table 7). These controlled comparisons jointly provide a clearer separation of the contributions from each component without changing the overall experimental design.
Comments 4:
Figures and Result Interpretation: Several figures are still information-dense. Clearer captions and more explicit guidance on the main takeaway of each figure would improve readability.
Response 4:
Thank you for this helpful suggestion. We agree that some figures are information-dense and benefit from clearer guidance. In the revised manuscript, we have rewritten and streamlined the caption of Figure 2 to make the figure easier to read and to explicitly describe the key components and the interaction loop (actor–critic structure, state/action/reward transition, and replay-buffer update). In addition, for the most information-dense result figures, we have strengthened the captions by adding explicit “takeaway” statements and brief guidance (e.g., what the curves/bands represent and what the reader should conclude). These revisions are marked in red in the manuscript.
Location of changes: Figure 2 caption, and updated/strengthened captions in the experimental result figures in Section 5.
Comments 5:
Language and Flow: Although improved, parts of the manuscript remain verbose. Further tightening of the text would help the technical contributions stand out more clearly.
Response 5:
Thank you for the suggestion. We agree that tightening the prose improves readability and helps the technical contributions stand out. In the revised manuscript, we have edited and shortened several verbose passages, focusing on removing redundant explanations and “summary-style/self-promotional” sentences that do not add technical content. In particular, we (i) rewrote and condensed the first paragraph of Section 5.2 to describe the two opponent path-planning settings more directly, and (ii) tightened the text in Section 5.1 (Environment Setting) by removing unnecessary concluding/self-evaluative statements while keeping all experimental details unchanged. We also applied light, consistent trimming throughout the manuscript for conciseness. All revisions are marked in red.
Location of changes: Section 5.2 (first paragraph) and Section 5.1, with additional minor wording reductions across the manuscript.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have addressed all the required comments.
Author Response
We sincerely thank the reviewer for confirming that all required comments have been adequately addressed.
We appreciate the reviewer’s time and valuable feedback.
Reviewer 4 Report
Comments and Suggestions for Authors
The authors have addressed most weaknesses. The article is suitable for publication.
Author Response
We are grateful to the reviewer for the positive assessment of our work and for considering the manuscript suitable for publication.
We have carefully revised the manuscript to address the previously identified weaknesses.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript has improved considerably over the successive rounds of revision. With more effective author responses and greater methodological consistency, the paper is significantly better structured. The authors' response addressed the earlier concerns well, specifically the formalisation of the expert knowledge base, the LLM-driven adversarial behaviour, and the isolation of component-level contributions in the experimental analysis.
The framing of the LLM-based adversary is well bounded within the simulation context. The newly added ablation explanations strengthen the paper. The experiments remain simulation-only and therefore have limited external validity; these scope limitations have already been recognized and reasonably discussed.
Overall, the manuscript conforms with the journal scope and is technically sound. This is a well-crafted and original contribution. I do not have any further concerns. The manuscript is fit for publication in its current state.
Comments on the Quality of English Language
The English has been improved for clarity and directness, but a little more polishing would be beneficial.
