Deep Reinforcement Learning for UAV Target Search and Continuous Tracking in Complex Environments with Gaussian Process Regression and Prior Policy Embedding
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
A general observation: like so many other manuscripts I have reviewed, this one is not written with attention to giving the reader a quick and consistent understanding of the work.
For a better appreciation of the work, the authors' response to some observations would be useful.
- It is not clear whether Bellman's equation (14) is used in the paper; if so, at which step of Algorithm 1, the KbDDPG algorithm?
- In (34), is yi the Cartesian coordinate, as in (1)? If so, why does the other coordinate, xi, not appear?
- Explain step (16) of the algorithm (p. 14) in as much detail as possible.
- What does “tau” represent in step (17) of the algorithm? (A sketch of its usual meaning in DDPG is given after this list.)
- Why does the KAN-DDPG algorithm take practically zero time? Is it because it is superior to the others (see Fig. 11, Fig. 12)? Please comment on this aspect: where does the superiority come from? Can the authors envisage improving this algorithm?
- Is there a difference between the KbDDPG and KAN-DDPG algorithms? If the superior algorithm is KAN-DDPG, why is Algorithm 1 on p. 14 specified for KbDDPG?
- How is tracking duration measured (see Fig. 12b)?
- In Fig. 13, the green trace, i.e., the search/tracking trajectory of the algorithm proposed by the authors, shows implausible changes of direction that would be physically impossible for a UAV to perform.
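For reference, in standard DDPG implementations "tau" usually denotes the soft-update coefficient that slowly blends the online network weights into the target networks. A minimal sketch under that assumption (PyTorch; not necessarily the authors' exact step (17)) is:

```python
# Soft target-network update commonly denoted by tau in DDPG.
# Assumption: standard DDPG formulation; the manuscript's step (17) may differ.
import torch

def soft_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.005):
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    with torch.no_grad():
        for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_online)
```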
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Summary of the Paper
The paper proposes a deep reinforcement learning-based approach for UAV target search and continuous tracking in 2D space with obstacles. To improve search efficiency and re-acquisition after target loss, the method integrates Gaussian Process Regression (GPR) for trajectory prediction and Kolmogorov-Arnold Networks-based Deep Deterministic Policy Gradient (KbDDPG) to embed prior control policies into the reinforcement learning framework. The UAV first performs an entropy-based search, avoiding redundant areas by maximizing spatial information entropy. Once the target is found, the UAV transitions to tracking mode, using GPR to predict the target's movement when it is lost. The proposed algorithm combines DDPG with KANs, allowing the agent to leverage prior knowledge for faster convergence.
Simulation results demonstrate that the proposed approach improves target search efficiency, target re-acquisition rate, and training convergence speed compared to traditional DRL methods. The GPR model enhances target re-localization, increasing success rates from 19.7% to over 90% in occlusion scenarios. The spatial information entropy strategy optimizes UAV search behavior, outperforming Random Walk and Coverage Path Planning methods in terms of target search efficiency. Additionally, KbDDPG achieves faster convergence and lower collision rates than standard DDPG algorithms, indicating that embedding prior knowledge enhances reinforcement learning efficiency.
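As background for this summary, a minimal sketch of GPR-based trajectory prediction is shown below. It assumes scikit-learn, an RBF kernel with a noise term, and synthetic 2D target positions; the authors' actual kernel, preprocessing, and data are not reproduced here.

```python
# Sketch of GPR-based prediction of a lost target's future positions.
# Assumptions (not from the manuscript): scikit-learn, RBF + white-noise kernel,
# a synthetic position history, and one GPR per coordinate over time.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Observed target positions (x, y) at known time steps before the target was lost.
t_hist = np.arange(10).reshape(-1, 1)
x_hist = 0.5 * t_hist.ravel()              # synthetic x trajectory
y_hist = np.sin(0.3 * t_hist.ravel())      # synthetic y trajectory

kernel = RBF(length_scale=3.0) + WhiteKernel(noise_level=1e-2)
gpr_x = GaussianProcessRegressor(kernel=kernel).fit(t_hist, x_hist)
gpr_y = GaussianProcessRegressor(kernel=kernel).fit(t_hist, y_hist)

# The predicted mean and uncertainty over the next few steps can define a
# search region for re-acquisition after occlusion.
t_future = np.arange(10, 15).reshape(-1, 1)
x_mean, x_std = gpr_x.predict(t_future, return_std=True)
y_mean, y_std = gpr_y.predict(t_future, return_std=True)
print(np.column_stack([x_mean, y_mean]))
```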
Specific Comments
- L128: The assumption that the UAV operates in a 2D space may be oversimplified; altitude information (z axis) is critical for aerial vehicles. If it is impossible to extend the dynamic model to 3D space, please provide additional justification for 2D modeling.
- L144-145: Although the complex transformation is omitted, could you clarify whether sensor measurements are considered perfect? Consider explicitly adding sensor measurement noise where necessary.
- L166: Instead of assuming the next state depends only on the current state, the target motion model could be further improved by incorporating higher-order dependencies.
- L259: Based on the discussion of the decay factor in Section 4.3, it appears that 𝛼 is treated as a hyper-parameter; is that correct? Could you clarify whether the decay factor must be a fixed value, or if it could be adaptively tuned based on environmental conditions or UAV exploration progress?
- Section 3.4: The determination of the reward function weights (k1, k2, k3, k4) is not clearly explained in Chapter 3. Could you explicitly clarify how these values are chosen in this section? Are they derived from previous equations, or are they hyper-parameters?
- Section 4.1: Could you provide details on the simulation environment, including the software and hardware setup used for training and evaluation? Additionally, could you specify the network training time to support scalability and reproducibility?
- L470: Typo: "Analysi" -> "Analysis".
- Chapter 4: The simulation uses a 125 x 125 grid for the 2D space; will the algorithm scale to larger environments with more complex obstacle setups?
- References: Please use consistent reference formatting – e.g. L660: journal/conference names are not italicized.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The paper presents a DRL framework for UAV target search and continuous tracking in complex environments, integrating GPR for trajectory prediction and a KbDDPG algorithm for policy learning. The work is well-structured, with a clear motivation and comprehensive experiments supporting the claims. However, some minor revisions are necessary to enhance its clarity and completeness.
The methodology section lacks certain details, particularly regarding the KbDDPG algorithm. While its implementation is described, its advantages over conventional DRL techniques should be more explicitly justified with stronger theoretical support and analysis. Additionally, the introduction asserts that the UAV system is designed to operate in complex environments, but it does not provide specific details on the complexity levels of the tested scenarios.
Furthermore, some inconsistencies in terminology should be addressed (e.g., ‘prior policy embedding’ and ‘prior knowledge incorporation’) to ensure uniformity throughout the paper. It is also recommended that the authors replace the word ‘chapter’ with ‘section’.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
I would like to thank the authors for submitting their manuscript entitled “Deep Reinforcement Learning for UAV Target Search and Continuous Tracking in Complex Environments with Gaussian Process Regression and Prior Policy Embedding”, which presents a method to improve the continuous search and tracking of targets by unmanned aerial vehicles (UAVs) in complex environments. To address the potential loss of the target, Gaussian Process Regression is employed to predict its trajectory, combined with an algorithm (KbDDPG) that incorporates prior knowledge into its policy. Simulations show that this approach outperforms traditional methods in target search and tracking, especially in recovery after loss of contact.
The authors’ idea is clearly explained, and experimental results are promising. However, there are several aspects that would benefit from further clarification and refinement. Addressing the following comments will strengthen the manuscript and enhance its overall quality.
Comments and suggestions to authors
1. Expanding the literature review on mobile target search and tracking, as well as on the use of prior knowledge in RL for UAVs, would strengthen the rationale for the proposal. Few papers related to DRL for UAVs are cited; including more direct references to works addressing target loss and recovery would be beneficial.
2. Figure legends should be improved to more accurately describe the content of the figures rather than being mere titles for the figures.
3. The description of the training process (Algorithm 1) is high-level. More detail is needed on the specific implementation (PyTorch or another deep learning framework?), including the loss functions used, the optimizers, and their parameters (learning rate, betas, ...). A generic sketch of the kind of detail requested is given after this list.
4. No information is provided about the specific software or libraries used for the simulation of the UAV environment, the implementation of GPR, or the neural networks (KANs and MLPs), nor about the hardware (GPU, CPU, RAM, ...) used for training and experiments.
5. There is no mention of the availability of source code, datasets used for prediction with GPR (beyond the general description of historical data collection) or replication scripts. This lack is a significant limitation to replicability. Have you contemplated sharing them?
6. Regarding the datasets, you mention preprocessing the historical UAV trajectory using centralization and weighted averaging with a dynamic window. Although the steps are described, I think more justification could be provided for these preprocessing techniques. Why were these specific methods chosen, and how do they affect state representation and learning performance?
7. The evaluations are based on simulations in a virtual environment. The validity of the findings in the real world might differ due to additional complexity and uncertainties not modeled in the simulation (noise, complex flight dynamics, still more unpredictable target behavior). Have you had a chance to explore this? Can you comment more on this aspect in the limitations of the study?
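To illustrate item 3 above, a generic sketch of the kind of implementation detail requested (loss functions, optimizers, and their parameters) is given below. It assumes PyTorch, plain MLP networks, and typical DDPG settings; the authors' KAN-based actor and prior-policy embedding are not reproduced.

```python
# Generic DDPG update step: critic TD loss, actor policy loss, Adam optimizers.
# All network sizes, learning rates, and betas here are illustrative assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim = 8, 2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4, betas=(0.9, 0.999))
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3, betas=(0.9, 0.999))

def update(s, a, r, s_next, done, gamma=0.99):
    # Critic loss: mean-squared TD error against the target networks.
    with torch.no_grad():
        q_next = target_critic(torch.cat([s_next, target_actor(s_next)], dim=1))
        q_target = r + gamma * (1 - done) * q_next
    critic_loss = F.mse_loss(critic(torch.cat([s, a], dim=1)), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor loss: negative expected Q-value of the actor's own actions.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```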
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Most of my concerns have been addressed; I am glad to see that measurement noise has been included in the latest manuscript.
Reviewer 4 Report
Comments and Suggestions for Authors
Dear authors, please accept my congratulations on your work.
The article meets the necessary standards for publication.