Article

Deep Reinforcement Learning for UAV Target Search and Continuous Tracking in Complex Environments with Gaussian Process Regression and Prior Policy Embedding

College of Electronic Information Engineering, Inner Mongolia University, Hohhot 010021, China
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(7), 1330; https://doi.org/10.3390/electronics14071330
Submission received: 19 February 2025 / Revised: 21 March 2025 / Accepted: 25 March 2025 / Published: 27 March 2025
(This article belongs to the Special Issue Control and Navigation of Robotics and Unmanned Aerial Vehicles)

Abstract

In recent years, unmanned aerial vehicles (UAVs) have shown substantial application value in continuous target tracking tasks in complex environments. Due to the target’s movement behavior and the complexities of the surrounding environment, the UAV is prone to losing track of the target. To tackle this issue, this paper presents a reinforcement learning (RL) approach that combines UAV target search and tracking. During the target search phase, spatial information entropy is employed to guide the UAV in avoiding redundant searches, thus enhancing information acquisition efficiency. In the event of target loss, Gaussian process regression (GPR) is employed to predict the target trajectory, thereby reducing the time needed for target re-localization. In addition, to address sample efficiency limitations in conventional RL, a Kolmogorov–Arnold networks-based deep deterministic policy gradient (KbDDPG) algorithm with prior policy embedding is proposed for controller training. Simulation results demonstrate that the proposed method outperforms traditional methods in target search and tracking tasks within complex environments. It improves the UAV’s ability to re-locate the target after loss. The proposed KbDDPG efficiently leverages prior policy, leading to accelerated convergence and enhanced performance.

1. Introduction

In recent years, the rapid advancement of unmanned aerial vehicle (UAV) technology has significantly enhanced its application in search and tracking tasks. Owing to their high mobility, broad application spectrum, and cost-effectiveness, UAVs have been extensively used in disaster relief, environmental monitoring, and military reconnaissance [1,2,3,4,5]. Several conventional algorithms have been extensively integrated into UAV operations. For instance, the coverage path planning method divides the working area into sub-regions and explores them with simple back-and-forth motions to ensure full scene coverage and complete target search [6]. The method of artificial potential field creates a virtual attraction field at the target’s location and a repulsion field at the obstacle’s location, allowing the UAV to track the target under combined forces [7]. The genetic algorithm designs an appropriate encoding method and termination criteria, constructs a fitness function, and effectively performs genetic operations, such as selection, crossover, and mutation, to determine the UAV’s flight path [8]. However, in dynamic and complex environments, efficiently searching for and tracking moving targets remains a challenging problem, particularly when the UAV’s field of view (FOV) is limited, the target moves quickly, and obstacles cause interference. First, the UAV must be capable of avoiding environmental obstacles during search and target tracking to prevent collisions. Second, it should autonomously plan the optimal search path to maximize coverage. In addition, the UAV must be robust and flexible enough to adapt to various environments. Lastly, due to factors such as sensor failures and target occlusion by the environment, target-detection algorithms may not provide fully reliable results in a short time. This necessitates the UAV having predictive capabilities for target movement.
The emergence of reinforcement learning (RL) methods has attracted increasing attention. RL is modeled as a Markov decision process (MDP), where the RL agent observes the environment’s state, makes decisions accordingly, interacts with the environment, and receives reward feedback to inform better actions [9]. To enhance the ability of RL to address complex tasks, researchers have integrated it with deep learning, leading to the development of deep reinforcement learning (DRL) [10]. DRL utilizes neural networks to approximate value or policy functions, enabling optimization of network parameters during agent-environment interactions, thus maximizing cumulative rewards. DRL has demonstrated significant success across several domains, including competitive gaming, smart manufacturing, and transportation systems [11,12,13].
The advent of deep reinforcement learning has offered new solutions for addressing UAV target search and tracking challenges in dynamic and complex environments [14]. For example, Yuanyuan Sheng et al. proposed a DRL-based UAV autonomous navigation algorithm that establishes a dynamic reward function and a novel state-space representation method to solve the UAV’s autonomous path planning problem in high-density and highly dynamic environments [15]. Yongfeng Yin et al. tackled the problem of weak obstacle avoidance capability in UAV target tracking by proposing an innovative reinforcement learning method with attention guidance, enabling the UAV’s decision-making to shift between navigation and obstacle avoidance tasks based on changes in the environment [16]. Haobin Shi et al. proposed an end-to-end DRL navigation strategy that leverages curiosity to encourage the agent to explore previously unvisited environmental states, addressing the problem of map-less navigation in complex environments with sparse rewards [17]. However, these methods generally assume that the target’s location is known or that the target is static, which limits the system’s adaptability and generality for many real-world tasks. Therefore, Mei Liu et al. proposed a two-stage method for mobile target search and tracking in unknown environments, dividing the task into search and tracking phases and training the controller for each phase using a deep deterministic policy gradient algorithm with three critic networks [18]. Despite its advantages, this method lacks sufficient research on task switching, particularly in handling target loss. Once a target is lost, quickly re-localizing it becomes difficult, which makes long-term target tracking in complex environments challenging. To address the issue of target loss, Tian Wang et al. proposed a quantum probability model, which records the probability of the target in a discrete grid and updates the probability based on UAV observations and target movement to search for the lost target [19]. Yanyu Cui et al. developed an action decision-occlusion handling network based on deep reinforcement learning, using temporal and spatial contexts, object appearance models, and motion vectors to provide information on the occluded target, thus achieving target tracking under occlusion [20]. Although the above methods have made significant progress in target relocation after loss, most of these approaches are designed for relatively simple scenarios or require detailed target information, which limits their applicability in complex real-world environments. Moreover, reinforcement learning models typically require continuous interaction with the environment to achieve convergence. Unfortunately, these models are often inefficient in terms of sample utilization [21]. In particular, in UAV applications, training directly in the real world requires significant time and may damage the equipment [22]. One common approach is learning from demonstration (LfD), where experience is obtained from experts to reduce the need for training samples in real-world settings [23,24,25]. Nevertheless, LfD requires obtaining expert data, and the learning performance is largely dependent on the quality of the expert data. Additionally, in reinforcement learning applications, we often have prior information about the problem, but this knowledge is seldom leveraged by reinforcement learning algorithms. Therefore, Chao Wang et al. 
proposed an algorithm called non-expert-assisted deep reinforcement learning, which uses both prior and learned policy to jointly construct the behavior policy, successfully guiding UAVs to complete navigation tasks under sparse rewards [26]. Dawei Wang et al. proposed a two-stage multi-UAV collision avoidance method based on reinforcement learning. In the first stage, supervised training is used to encourage agents to follow prior policy, and in the second stage, policy gradients are employed to further refine the strategy [27]. The aforementioned methods effectively utilized prior policy in reinforcement learning, but they required certain rule-based guidance or staged training, leading to redundant training processes, which further inspired us to design a more novel training approach to utilizing prior policy.
To address the above issue, we propose a deep reinforcement learning method that accomplishes the two major tasks of target searching and target tracking, which improves the training efficiency of reinforcement learning through prior policy embedding. Specifically, we extend the concept of spatial information entropy and use it to compute rewards that guide the UAV’s target search [28]. Upon locating the target, the UAV transitions from search mode to tracking mode and commences tracking the target. When the UAV loses the target, it utilizes the information collected during tracking and applies Gaussian process regression (GPR) to forecast the target’s future trajectory, offering a probable target location [29]. This approach offers potential target locations, thereby substantially diminishing the time needed for the UAV to reacquire the target. As soon as the target re-enters the FOV, the UAV resumes the tracking task. This strategy, which integrates prediction and search, effectively enhances re-localization efficiency after target loss, addressing the limitations of conventional single-task methods in such scenarios. During the algorithm training phase, inspired by Kolmogorov–Arnold Networks (KANs) [30], we designed a novel method of utilizing prior policy and propose the KANs-based deep deterministic policy gradient (KbDDPG) algorithm. The introduction of KANs enables prior policy to be embedded into the policy network, guiding training with the prior policy, which enhances the reinforcement learning training speed and further improves the UAV’s target search and tracking performance. The main contributions of this study are:
  • We propose an integrated decision-making framework for UAV search and tracking based on deep reinforcement learning, which significantly improves the efficiency of target search. It solves the re-localization problem after target loss, thereby achieving continuous target tracking.
  • We design a deep deterministic policy gradient algorithm, KbDDPG, based on Kolmogorov–Arnold networks, which uses prior policy embedding to significantly accelerate the algorithm’s convergence rate.
  • The proposed method is validated through extensive simulations in complex environments, demonstrating its effectiveness and superiority. Compared to existing DRL algorithms, our approach outperforms them in target search and tracking tasks.
The structure of the paper is as follows. Section 2 introduces the main model for UAV target search and tracking. Section 3 elaborates on the proposed method, covering trajectory prediction based on GPR, the extended spatial information entropy model, and the DRL algorithm based on KANs. Section 4 conducts experiments and analyzes the results. Finally, Section 5 concludes the paper and explores future research directions.

2. System Model

In this paper, we consider a target search and tracking system that consists of a UAV equipped with a gimbaled camera. We assume that the UAV can use the gimbaled camera to capture the position information of obstacles and moving targets within its FOV. The UAV must search for targets within the area without collision and continuously track them upon detection. The following sections introduce the main model for the UAV target search and tracking task. To apply RL methods, we have made some simplifications and assumptions in constructing the dynamics model. These assumptions are clear and reasonable under suitable conditions.

2.1. UAV Dynamics Model

This study focuses on a UAV with adequate maneuverability and sensing capabilities. For UAVs, altitude information is crucial during actual flight, but in many target search and tracking tasks, UAVs are often set to fly at a fixed altitude. When verifying and optimizing the performance of the algorithm in this paper, using two-dimensional modeling helps focus on the UAV’s search, tracking, and relocalization problems without having to simultaneously solve the complex altitude control issue. Therefore, we modeled the UAV as a simple mass point and assumed it flies at a fixed altitude. The UAV operates in a two-dimensional grid model, where each cell represents a possible position. We use grid partitioning to simplify the spatial representation, facilitating the design of subsequent algorithms. A 2D space of size M × M is constructed, with the set of all possible spatial positions defined as:
$$\mathcal{M} = \{ (x, y) \mid 0 \le x, y < M \}$$
At time $t$ ($t \in \{1, \dots, T\}$), let $p_u^t = (x_u^t, y_u^t)$ denote the UAV's position and $u^t = (a_u^t, \omega_u^t)$ denote the control inputs, where $a_u^t$ represents the acceleration and $\omega_u^t$ represents the rate of change in the UAV's heading angle. The discrete-time dynamic equation of the UAV is as follows:
$$\begin{aligned} x_u^{t+1} &= x_u^t + v_u^{t+1} \cos(\theta_u^{t+1}) \, \Delta t \\ y_u^{t+1} &= y_u^t + v_u^{t+1} \sin(\theta_u^{t+1}) \, \Delta t \\ v_u^{t+1} &= v_u^t + a_u^t \, \Delta t \\ \theta_u^{t+1} &= \theta_u^t + \omega_u^t \, \Delta t \end{aligned}$$
where $\theta_u^t$ denotes the heading angle and $v_u^t$ represents the velocity of the UAV in the direction of the heading angle.
In this context, $\omega_u^t$ is used to control the change in the UAV's heading angle. However, the actual control inputs for a real UAV are typically accelerations. To keep the actions reasonable, during simulations we limit the variation range of the control input $\omega_u^t$ so that the implied acceleration remains within the maneuverability capabilities of a real UAV.
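As a concrete illustration of the dynamics above, the following minimal Python sketch advances the point-mass UAV state by one discrete time step while clipping the heading-rate input as described. The function name `uav_step` and the bounds `v_max` and `omega_max` are illustrative choices of ours, not values from the paper.

```python
import numpy as np

def uav_step(x, y, v, theta, a, omega, dt=1.0, v_max=5.0, omega_max=np.pi / 6):
    """One discrete-time update of the point-mass UAV model.

    (a, omega) are the control inputs; omega is clipped so that the
    resulting turn stays within the maneuverability of a real UAV.
    """
    omega = np.clip(omega, -omega_max, omega_max)   # limit heading-rate input
    v_next = np.clip(v + a * dt, 0.0, v_max)        # v^{t+1} = v^t + a^t * dt
    theta_next = theta + omega * dt                 # theta^{t+1} = theta^t + omega^t * dt
    x_next = x + v_next * np.cos(theta_next) * dt   # advance position
    y_next = y + v_next * np.sin(theta_next) * dt
    return x_next, y_next, v_next, theta_next
```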
The UAV observes targets and obstacles using its onboard sensors. The sensor's FOV covers a circular area, where $R$ represents the maximum sensing range. The UAV's FOV at time $t$ is expressed as a set:
$$F_t = \left\{ (x_i, y_i) \in \mathcal{M} \;\middle|\; \sqrt{(x_u^t - x_i)^2 + (y_u^t - y_i)^2} < R \right\}$$
To reduce the computational complexity of the model, the complex transformation from imagery to positional data is omitted. It is assumed that the UAV can directly obtain the relative positions of the target and obstacles. Using $\rho_g^t$ and $\phi_g^t$ to represent the distance and relative angle, the target's relative position within the UAV's FOV is given by $L_g^t = (\rho_g^t, \phi_g^t)$, and the closest obstacle's relative position is given by $L_o^t = (\rho_o^t, \phi_o^t)$. In real-world applications, sensor measurements are inevitably influenced by noise and uncertainty; to address this, we incorporate sensor measurement noise into the model. The observed target position is expressed as $\tilde{L}_g^t = L_g^t + n_g$, $n_g \sim \mathcal{N}(0, \sigma_g^2 I)$, where $\sigma_g$ represents the standard deviation of uncertainty in the target's relative position measurement, and $I$ is the identity matrix. The observed obstacle position is denoted as $\tilde{L}_o^t = L_o^t + n_o$, $n_o \sim \mathcal{N}(0, \sigma_o^2 I)$, with $\sigma_o$ indicating the standard deviation of uncertainty in the obstacle measurement. This approach to modeling target and obstacle positions not only simplifies computational complexity but also facilitates easier transfer to real-world environments, ensuring high adaptability for both simulations and real UAV missions. During each simulation episode, the UAV and target start at random positions. The episode ends when the UAV either collides with an obstacle or reaches the maximum number of steps $T$.

2.2. Environment Model

The environment model is shown in Figure 1. We assume that during UAV search and tracking missions, various obstacles are present. This paper introduces irregular polygons as approximations of obstacles. The positions of obstacles that the UAV cannot occupy are defined by the set Z :
$$\mathcal{Z} = \{ (x_i^z, y_i^z) \in \mathcal{M} \mid i = 1, \dots, Z \}$$
During the target search and tracking process, the UAV must avoid obstacles to ensure operational safety. Additionally, there are line-of-sight occluded zones in the environment through which both the target and UAV can pass, but the UAV’s sensors cannot detect targets entering these areas. The line-of-sight occlusion zones are defined by the set B :
$$\mathcal{B} = \{ (x_i^b, y_i^b) \in \mathcal{M} \mid i = 1, \dots, B \}$$
The position of the target at time step $t$ is represented as $p_g^t = (x_g^t, y_g^t)$, and the movement of the target is controlled by the vector $u_g^t = (a_g^t, \omega_g^t)$. The control input $u_g^t$ has $L$ possible values, and the set of possible values is denoted by $\mathcal{L} = \{ u_1, \dots, u_L \}$. Thus, $u_g^t$ is expressed as a Markov chain with a finite set of states and state transition probabilities $p_{ij} = p(u_g^t = u_j \mid u_g^{t-1} = u_i)$, $i, j \in \{1, \dots, L\}$. The state transition probability is given by the following formula:
$$P_{ij} = \begin{cases} p_l, & \text{if } i = j \\ \dfrac{1 - p_l}{L - 1}, & \text{if } i \ne j \end{cases}$$
A higher $p_l$ value is set to simulate the target's movement pattern, meaning the target's control inputs tend to remain unchanged. Let $v_g^t$ denote the target's velocity. The target's motion equation is expressed as:
$$\begin{aligned} x_g^{t+1} &= x_g^t + v_g^{t+1} \cos(\theta_g^{t+1}) \, \Delta t \\ y_g^{t+1} &= y_g^t + v_g^{t+1} \sin(\theta_g^{t+1}) \, \Delta t \\ v_g^{t+1} &= v_g^t + a_g^t \, \Delta t \\ \theta_g^{t+1} &= \theta_g^t + (\omega_g^t + \omega_{avoid}^t) \, \Delta t \end{aligned}$$
We apply the velocity obstacle method to assist the target in avoiding obstacles [31]. Here, $\omega_{avoid}^t$ denotes the avoidance angular velocity, which modifies the target's velocity to steer away from velocity obstacles and prevent collisions. Failed avoidance trajectories are discarded during this process.
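For clarity, the sketch below shows how the target's next control input can be drawn from the Markov chain defined above; the function name is our own, the uniform choice among the remaining $L-1$ controls follows the transition matrix $P_{ij}$, and the velocity-obstacle correction $\omega_{avoid}^t$ is assumed to be applied elsewhere.

```python
import numpy as np

def sample_target_control(prev_idx, p_l, num_controls):
    """Sample the index of the target's next control input u_g^t.

    With probability p_l the previous control is kept (i = j); otherwise
    one of the remaining L - 1 controls is chosen uniformly, matching P_ij.
    """
    if np.random.rand() < p_l:
        return prev_idx
    others = [i for i in range(num_controls) if i != prev_idx]
    return int(np.random.choice(others))
```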

3. Method

This section begins by modeling the target search and tracking problem using the MDP. We extend the concept of spatial information entropy to enlarge the UAV’s search range. To mitigate target loss during tracking, a trajectory prediction method based on GPR is proposed. Lastly, we introduce the KbDDPG algorithm.

3.1. Problem Statement

In UAV target search and tracking tasks, the problem is often modeled as an MDP. The MDP model comprises a five-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
The state space $\mathcal{S}$ represents the set of all possible states. In this task, the state input includes the processed historical trajectory of the UAV, as well as the observed positions of obstacles and targets. Next, we introduce the process of state extraction. The historical trajectory processing function is defined as $f_s : \mathcal{M}^T \to \mathbb{R}^{2 \times I}$:
$$h_s^t = f_s(p_u^{t-T}, p_u^{t-T+1}, \dots, p_u^t)$$
This function performs the following tasks:
  • Trajectory centralization: It centers all trajectory positions relative to the UAV’s current position. Centralizing the historical trajectory data removes the influence of global position information, preventing scale variations caused by differing absolute positions and focusing the state on the UAV’s relative movement. In search tasks, the UAV needs to efficiently cover unknown areas. The centralized trajectory data can more intuitively reflect the changes in the UAV’s position relative to its own historical path, which facilitates the learning of search strategies.
  • Weighted averaging with dynamic window size: It applies weighted averaging to the centralized trajectory using a dynamically adjusted window size. Weighted averaging can smooth the historical trajectory data, effectively reducing noise interference in state representation while also reducing the amount of data the model needs to process, significantly lowering the complexity of the state space. The inclusion of dynamic window techniques allows the window size to be adjusted based on the utilization of the historical trajectory, enabling it to capture the UAV’s fine motion details in the short term while also reflecting long-term movement trends. This flexibility allows the state representation to retain sufficient detail without introducing redundancy or outdated information from excessive historical data, thus improving sample efficiency and accelerating policy learning convergence.
For a given trajectory sequence $p_u^{t-T}, p_u^{t-T+1}, \dots, p_u^t$, where each $p_j \in \mathcal{M}$, the centering operation is defined as:
$$p_{center}(j) = p_j - p_u^t, \quad j = t-T, t-T+1, \dots, t$$
Assume the window starts with an initial size of $W_0$. Define a window function $W(i) = W_0 + d \lfloor i/u \rfloor$, which adjusts the window size at each computation step $i$ and increases it by a fixed increment $d$ after every $u$ steps. The weighted averaging process is as follows:
$$h_s^i = \frac{1}{W(i)} \sum_{j=\mathrm{start}(i)}^{\mathrm{end}(i)} p_{center}(j)$$
where $\mathrm{start}(i)$ and $\mathrm{end}(i)$ determine the range of the window; then $h_s^t = [h_s^1, \dots, h_s^I]$. Let $s$ denote the UAV's state; then $s_t = [h_s^t, \tilde{L}_o^t, \tilde{L}_g^t]$.
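The state extraction described above can be sketched as follows. This is an illustrative implementation (the function name `extract_state` and the default window parameters `w0`, `d`, `u` are placeholders) that uses a plain mean within each window, since the exact weighting scheme is not specified here.

```python
import numpy as np

def extract_state(trajectory, obs_obstacle, obs_target, w0=5, d=2, u=3):
    """Build the state s_t = [h_s^t, L_o^t, L_g^t] from the UAV trajectory.

    trajectory: array of shape (T, 2), last row = current UAV position.
    The averaging window grows by d after every u windows, so older
    history is compressed more strongly than recent motion.
    """
    traj = np.asarray(trajectory, dtype=float)
    centered = traj - traj[-1]              # centre on the current position

    h_s, start, i = [], 0, 0
    while start < len(centered):
        window = w0 + d * (i // u)          # dynamic window size W(i)
        end = min(start + window, len(centered))
        h_s.append(centered[start:end].mean(axis=0))  # average within window
        start, i = end, i + 1

    return np.concatenate([np.concatenate(h_s),
                           np.asarray(obs_obstacle, dtype=float),
                           np.asarray(obs_target, dtype=float)])
```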
The action space A is the set of all possible actions, representing the agent’s choices in each state. In this task, the actions correspond to the control inputs for the UAV, denoted as a = u .
The state transition probability $P(s' \mid s, a)$ describes the probability of transitioning to the next state $s'$ after performing action $a$ in state $s$.
The reward function R ( s , a ) represents the expected reward the agent receives for performing action a in state s. Designing the reward function is crucial and can greatly influence the success of the training task. The design for the search and tracking reward function will be discussed later.
The discount factor $\gamma \in [0, 1]$ denotes the agent's consideration of future rewards. Generally, future rewards are discounted by $\gamma$ to balance immediate rewards and long-term gains.
The agent’s objective is to find the optimal policy π * that maximizes the expected return. The return is typically defined as:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$$
where $r_{t+k}$ denotes the reward at time $t+k$.
To find the optimal policy $\pi^*$ in the MDP, RL introduces two key functions: the state-value function and the action-value function. The state-value function $V^\pi(s)$ represents the expected return when the agent follows policy $\pi$ from state $s$:
$$V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, S_t = s \right]$$
The action-value function $Q^\pi(s, a)$ represents the expected return for taking action $a$ in state $s$ and then following policy $\pi$:
$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, S_t = s, A_t = a \right]$$
In an MDP, the value function satisfies the Bellman equation:
$$Q^\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a),\, a' \sim \pi(s')} \left[ r + \gamma Q^\pi(s', a') \right]$$
where $a' \sim \pi(s')$ denotes selecting action $a'$ according to policy $\pi$ in state $s'$, $s' \sim P(\cdot \mid s, a)$ represents the transition probability from state $s$ to the next state $s'$ after taking action $a$, and $r$ is the reward obtained by executing action $a$ in state $s$.
This formula indicates that the value of the current state-action pair equals the immediate reward received plus the discounted cumulative value of future rewards, where the computation of future rewards can again be expressed in the same form. The Bellman equation recursively expresses the value function, aiding in determining the optimal policy. The optimal policy π * can be represented as:
$$\pi^* = \arg\max_{a_t} \left[ R(s_t, a_t) + \gamma \int_{s_{t+1}} P(s_{t+1} \mid s_t, a_t) V^*(s_{t+1}) \, \mathrm{d}s_{t+1} \right]$$
In the process of policy optimization, the actor–critic framework can be applied to jointly learn the policy (actor) and the value function (critic). The essence of policy optimization is to enable the two networks to jointly improve the agent’s decision making. The goal of the policy network is to choose the optimal action based on the current state and update the policy by maximizing the Q-value from the value network, where the Q-value represents the expected return estimated by the value network. The value network evaluates each action by estimating the Q-value for the current state-action pair, and the network approximates the true Q-value by minimizing the error between the predicted and target values, where the target value is constructed using the Bellman equation. The two networks work in tandem: the value network provides feedback on the action values, while the policy network adjusts the policy accordingly, thus gradually enhancing the decision-making performance.

3.2. Spatial Information Entropy

To further expand the search range of the UAV during search, we expect the UAV to maintain a certain distance from previously detected positions within a certain time period. Specifically, when the UAV’s detection areas are close to each other spatially, the collected information is likely to be highly correlated. Consequently, the gathered information may be redundant, which significantly impacts the search efficiency. A direct solution to this problem is to collect spatially uncorrelated information. In other words, when the UAV’s detection coverage overlaps less, the collected information tends to be unrelated. To measure and optimize the UAV’s information acquisition efficiency during target search and tracking tasks, we extend the concept of spatial information entropy.
First, we define $\Phi_t(x, y)$ as the information correlation density at the coordinates $(x, y)$ at time step $t$, which satisfies:
$$\sum_{x, y} \Phi_t(x, y) = 1$$
The higher the information correlation density within the detection area when the UAV reaches a new position, the more redundant information it gathers. The UAV will influence the information correlation density on the map within a certain radius of its current position $(x_u^t, y_u^t)$. To represent this, we define the indicator function $J_r(x, y, x_u^t, y_u^t)$, which denotes the influence of the UAV at position $(x_u^t, y_u^t)$ on position $(x, y)$:
$$J_r(x, y, x_u^t, y_u^t) = \begin{cases} 1, & \text{if } \sqrt{(x - x_u^t)^2 + (y - y_u^t)^2} \le R \\ 0, & \text{otherwise} \end{cases}$$
At the initial time $t_0$, we initialize the position $(x, y)$ as follows:
$$\Phi_{t_0}(x, y) = \frac{1}{A}$$
where A is the number of grid points in the 2D space.
This initialization ensures that, at time t 0 , the sum of the information correlation density values equals 1. In each subsequent time step, the UAV’s movement continuously impacts the information correlation density on the map. For position ( x , y ) , we apply the following update formula:
$$\Phi_t(x, y) = (1 - \alpha) \cdot \Phi_{t-1}(x, y) + \frac{\alpha}{A_r} \cdot J_r(x, y, x_u^t, y_u^t)$$
where $\alpha \in [0, 1]$ is the decay factor, which controls the rate at which the information correlation density decays over time, and $A_r$ denotes the number of coordinate points within the UAV's FOV.
Based on the update formula, the relationship for updating the sum of information correlation densities on the map is:
$$\sum_{x, y} \Phi_t(x, y) = (1 - \alpha) \cdot \sum_{x, y} \Phi_{t-1}(x, y) + \alpha \cdot \frac{\sum_{x, y} J_r(x, y, x_u^t, y_u^t)}{A_r}$$
When $\sum_{x, y} \Phi_{t-1}(x, y) = 1$, the sum of the information correlation densities on the map remains constant, satisfying the conditions for defining information correlation density. This guarantees that at each time step, the total information correlation density on the map neither increases nor decreases without bound but maintains a dynamic balance. This helps the model effectively measure changes in spatial information entropy during long-term searches, while ensuring that the total information correlation density on the map remains within a stable range. Similar to information entropy, given the information correlation density, we can define the spatial information entropy over the entire spatial area. The spatial information entropy at time $t$, $H_{SI}^t$, is defined as:
$$H_{SI}^t = -\sum_{x, y} \Phi_t(x, y) \log \Phi_t(x, y)$$
If the UAV’s detection areas overlap significantly (i.e., the collected information is highly correlated), the entropy value decreases, indicating reduced spatial diversity and increased information redundancy. If the UAV’s detection areas overlap less and the detected regions are highly independent (i.e., with low information correlation), the spatial information entropy will be relatively higher.
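The density update and entropy computation can be sketched on a grid as below; the helper names (`update_density`, `spatial_entropy`) and the small epsilon added inside the logarithm are our own choices for numerical safety.

```python
import numpy as np

def update_density(phi, uav_xy, radius, alpha):
    """One update of the information correlation density grid Phi_t.

    phi: 2D array over the M x M grid summing to 1; cells inside the FOV
    disk around uav_xy receive alpha / A_r, all cells decay by (1 - alpha).
    """
    m = phi.shape[0]
    xs, ys = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    in_fov = (xs - uav_xy[0]) ** 2 + (ys - uav_xy[1]) ** 2 <= radius ** 2
    a_r = in_fov.sum()                       # number of cells in the FOV
    return (1.0 - alpha) * phi + alpha * in_fov / a_r

def spatial_entropy(phi, eps=1e-12):
    """Spatial information entropy H_SI of the density grid."""
    return float(-(phi * np.log(phi + eps)).sum())
```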

3.3. GPR for Trajectory Prediction

In each training episode, the UAV acquires the target's motion trajectory information via the camera. The target's observed acceleration data is expressed as a two-dimensional acceleration vector $a^t = (a_x^t, a_y^t)$. The UAV stores the target's acceleration information during the training process. When the target is lost, the UAV cannot obtain real-time position information. At this point, we construct a Gaussian process regression model based on historical episode data. Using the model, we predict the target's future acceleration from its current episode's trajectory information and subsequently forecast its movement trajectory. To use the data from these historical episodes for Gaussian process regression training, we need to construct appropriate input–output data pairs. We set a sliding window length of $w + m$, select sufficiently long acceleration sequences from each episode, and construct fixed-length input–output pairs. Specifically, input–output pairs are extracted from each episode's trajectory data. Define the input data $X_{k,t} = (a^t, a^{t+1}, \dots, a^{t+w-1})$, a vector containing acceleration information for $w$ consecutive time steps, and the output data $Y_{k,t} = (a^{t+w}, a^{t+w+1}, \dots, a^{t+w+m-1})$, a vector containing acceleration information for the following $m$ time steps. Let $k = 1, 2, \dots, K$ denote the episode index, and $n_k$ represent the number of data points in episode $k$. For the input–output dataset, a sliding window is applied to extract input–output pairs at each time step $t$:
$$\mathcal{X} = \{ X_{k,t} \mid t = 1, 2, \dots, n_k - w - m + 1,\ k = 1, 2, \dots, K \}, \quad \mathcal{Y} = \{ Y_{k,t} \mid t = 1, 2, \dots, n_k - w - m + 1,\ k = 1, 2, \dots, K \}$$
GPR assumes a joint Gaussian distribution between the input and output. Given the training dataset X and Y, the model posits:
$$\begin{bmatrix} Y \\ Y^* \end{bmatrix} \sim \mathcal{N} \left( 0, \begin{bmatrix} K(X, X) & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{bmatrix} \right)$$
where $Y^*$ is the predicted value and $X^*$ is the input at the prediction time.
In GPR, a kernel function is used to represent the covariance between data points. $K(X, X)$ is the covariance matrix of the training data, $K(X^*, X)$ is the covariance matrix between the prediction inputs and the training data, and $K(X^*, X^*)$ is the covariance matrix of the prediction inputs. Define the kernel function:
$$K(X_i, X_j) = \exp\left( -\frac{\lVert X_i - X_j \rVert^2}{2\sigma^2} \right)$$
where σ is the hyperparameter of the kernel function, often referred to as the length scale. It controls the width of the kernel, influencing the similarity between neighboring data points.
The Gaussian kernel function measures the similarity between two data points: when the points are close, the kernel value approaches 1, and when they are distant, the kernel value approaches 0. Given the training dataset X and Y, as well as the input X * for prediction, we can compute the predictive conditional distribution via GPR. Specifically, the predicted mean and covariance are:
$$\mu^* = K(X^*, X) K(X, X)^{-1} Y, \quad \Sigma^* = K(X^*, X^*) - K(X^*, X) K(X, X)^{-1} K(X, X^*)$$
Using the predicted mean acceleration $\mu^*$, we can compute the target's future velocity and position. The predicted covariance $\Sigma^*$ is used to assess the prediction accuracy, with higher covariance indicating greater uncertainty in the prediction. We set a variance threshold $\sigma_{thresh}$; if the predicted variance $\Sigma_{ii}^*$ exceeds this threshold, the prediction is stopped. As the reinforcement learning training progresses and more target acceleration data are gathered, the prediction accuracy increases accordingly.
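Since GPR is implemented with Scikit-learn in this work, a minimal sketch of the sliding-window dataset construction and prediction could look as follows; the function names, the RBF length scale, the noise term `alpha`, and the variance threshold are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def build_pairs(accels, w=20, m=10):
    """Slide a window of length w + m over one episode's acceleration
    sequence (shape (n_k, 2)) to build flattened input/output pairs."""
    X, Y = [], []
    for t in range(len(accels) - w - m + 1):
        X.append(accels[t:t + w].ravel())
        Y.append(accels[t + w:t + w + m].ravel())
    return np.array(X), np.array(Y)

def fit_and_predict(episode_accels, recent_window, sigma_thresh=0.5):
    """Fit GPR on stored episodes and predict the next m accelerations.

    The prediction is flagged invalid once its standard deviation exceeds
    the threshold, mirroring the variance-based stopping rule above.
    """
    pairs = [build_pairs(a) for a in episode_accels if len(a) >= 30]
    X = np.vstack([p[0] for p in pairs])
    Y = np.vstack([p[1] for p in pairs])
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
    gpr.fit(X, Y)
    mean, std = gpr.predict(recent_window.ravel()[None, :], return_std=True)
    valid = std.max() < sigma_thresh          # discard uncertain predictions
    return mean.reshape(-1, 2), valid
```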
The incorporation of the target trajectory prediction method allows the UAV to quickly re-search and re-locate the target after temporarily losing its position information. The system framework with the trajectory prediction method is shown in Figure 2, with the system operating in two modes: search mode and tracking mode. When the UAV detects the target, the system transitions from search mode to tracking mode. In tracking mode, if the target is lost, the agent treats the predicted target position as the true position and continues decision making. If the model's predicted variance exceeds the set threshold, or if the number of predicted time steps reaches the set maximum prediction length $m_{\max}$ and the target is still not found, the target is considered fully lost, and the system switches back to search mode.

3.4. Reward Function Design

RL uses rewards to estimate the expected return and derive the optimal policy. The design of the reward function is closely tied to the quality of the training outcome. A suitable reward function can accelerate training convergence and better guide the agent to perform favorable actions while avoiding undesirable ones. In the UAV target search and tracking problem, the reward design focuses on three key aspects: minimizing environmental uncertainty, finding the target as quickly as possible, and maintaining tracking once the target is found. The entire search and tracking process should avoid collisions with threats. Based on these objectives, the reward function is defined as:
  • Search Reward
    The objective of this reward function is to leverage spatial information entropy to encourage the UAV to explore uncovered areas and guide it in target search. Denote $k_1$ as the search reward coefficient; then the search reward is expressed as:
    $$r_{1,t} = k_1 (H_{SI}^t - H_{SI}^{t-1})$$
  • Target Tracking Reward
    This reward function provides continuous rewards when the target is within the UAV's FOV, guiding the UAV to keep tracking the target. Let $k_2$ represent the target tracking reward coefficient; then the target tracking reward function is defined as:
    $$r_{2,t} = \begin{cases} k_2, & \text{if } p_g^t \in F_t \\ 0, & \text{otherwise} \end{cases}$$
  • Target Loss Penalty
    The purpose of this penalty function is to prevent the UAV from losing track of the target. Denote $k_3$ as the target loss penalty coefficient; then the target loss penalty is defined as:
    $$r_{3,t} = \begin{cases} -k_3, & \text{if } p_g^t \notin F_t \text{ and } p_g^{t-1} \in F_{t-1} \\ 0, & \text{otherwise} \end{cases}$$
  • Collision Penalty
    This penalty function guides the UAV to avoid threats. Denote $k_4$ as the collision penalty coefficient; then the collision penalty is defined as:
    $$r_{4,t} = \begin{cases} -k_4, & \text{if } p_u^t \in \mathcal{Z} \\ 0, & \text{otherwise} \end{cases}$$
The four rewards mentioned above work in unison, balancing the relationship between search, tracking, target loss, and collision risk via the reward function weights, offering a unified decision-making framework for the UAV. The weights of the reward function are set as hyperparameters. Initially, we made preliminary estimations of the importance of each reward component based on the experience of domain experts to set the initial weights. Subsequently, these hyperparameters were finely tuned through simulation experiments to balance the weights of different reward components, enabling the reward function to guide UAV action more effectively in complex environments. In summary, the complete reward function can be defined as:
$$r_t = r_{1,t} + r_{2,t} + r_{3,t} + r_{4,t}$$
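A compact sketch of how the four terms can be combined in code is given below; the coefficient values are placeholders, since the paper tunes them as hyperparameters, and the penalty terms carry explicit negative signs as reconstructed above.

```python
def compute_reward(h_si_t, h_si_prev, target_in_fov, target_in_fov_prev,
                   uav_collided, k1=10.0, k2=1.0, k3=5.0, k4=20.0):
    """Total reward r_t = r_1 + r_2 + r_3 + r_4 for one time step."""
    r1 = k1 * (h_si_t - h_si_prev)                  # search reward (entropy gain)
    r2 = k2 if target_in_fov else 0.0               # tracking reward
    r3 = -k3 if (target_in_fov_prev and not target_in_fov) else 0.0  # loss penalty
    r4 = -k4 if uav_collided else 0.0               # collision penalty
    return r1 + r2 + r3 + r4
```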

3.5. Training Algorithm

RL tends to perform poorly in sample efficiency, especially when the UAV is tasked with search and tracking in complex dynamic environments. The RL agent typically needs extensive exploration and trial-and-error to gradually improve its decision-making abilities. However, conducting such extensive exploration in a real environment can lead to serious negative consequences. On one hand, the UAV may frequently collide during task execution, resulting in equipment damage or even property loss. On the other hand, this inefficient training method severely limits its applicability in high-risk, complex scenarios. Therefore, to enhance training efficiency and minimize potential risks in practical applications, this paper introduces the KbDDPG algorithm, which aims to accelerate RL convergence by using prior policy embedding in the policy network of the KANs structure.
KANs are inspired by the Kolmogorov–Arnold representation theorem, which states that any multivariate continuous function can be decomposed into a finite combination of univariate functions and addition operations. More specifically, for a smooth function $f : [0, 1]^n \to \mathbb{R}$:
$$f(x) = f(x_1, \dots, x_n) = \sum_{q=1}^{2n+1} \Phi_q \left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)$$
where $\phi_{q,p} : [0, 1] \to \mathbb{R}$ and $\Phi_q : \mathbb{R} \to \mathbb{R}$.
KANs parameterize each one-dimensional function in Expression (31) using B-spline curves. A KANs layer with an $n_{in}$-dimensional input and an $n_{out}$-dimensional output can be defined as a matrix of univariate functions:
$$\Phi = \{ \phi_{q,p} \}, \quad p = 1, 2, \dots, n_{in}, \quad q = 1, 2, \dots, n_{out}$$
where each function $\phi_{q,p}$ has trainable parameters.
Expression (31) can be represented as a simple combination of two KANs layers. However, KANs are not restricted to this structure. Similar to multi-layer perceptrons (MLPs), KANs can also expand the width and depth of their network. Given an input vector x, the output of KANs is:
$$\mathrm{KAN}(x) = (\Phi_{L-1} \circ \Phi_{L-2} \circ \cdots \circ \Phi_1 \circ \Phi_0)(x)$$
All operations in the process are differentiable, allowing us to train KANs via backpropagation.
Traditional MLPs use fixed activation functions at the nodes, while KANs employ learnable activation functions at the edges. Each activation function in KANs is a univariate function parameterized by a spline curve, further improving the ability to approximate complex functions. These characteristics of KANs make their structure more intuitive and interpretable. The activation functions of KANs can be visually represented by spline curves, allowing researchers to interact with the model during simplification and optimization to modify or constrain certain function forms.
The structure of the KbDDPG algorithm is shown in Figure 3. $f_s$ is used to process the UAV trajectory data. First, the data is centralized to standardize the input, improving the robustness of the state representation and enhancing generalization ability. Then, a weighted average is applied to compress the data length and increase information efficiency. Finally, the processed data is combined with the target and obstacle position information, serving as the agent's state input. Additionally, in the traditional deep deterministic policy gradient (DDPG) algorithm, the policy network determines the action to take given a state, while the value network estimates $Q^\pi(s, a)$ for the current state–action pair. Both the policy network and the value network typically use an MLP structure [32]. KbDDPG modifies the policy network by replacing the original MLP structure with a KANs structure that embeds the prior policy, while the value network retains its MLP architecture.
We assume that researchers can design a simple, low-performance prior policy $\pi_0(s)$ in advance, based on the agent's state and environmental information. The prior policy helps the agent explore the environment and achieve its objectives. Due to the structural properties of KANs, the activation functions can be set to known symbolic forms, enabling the model to represent traditional mathematical expressions. In RL, we can embed an existing prior policy $\pi_0(s)$, which is represented symbolically, into a trainable KANs, resulting in a KANs that incorporates the initial policy [33]. This process involves three main steps, as shown in Figure 4:
  • Compile symbolic formulas into KANs: First, parse the symbolic formulas into a tree structure, where nodes represent expressions and edges represent operations/functions. Then, modify the tree to align it with the KANs structure. This modification involves moving all leaf nodes to the input layer via virtual edges and adding virtual sub-nodes/nodes to match the KANs architecture. These virtual edges/nodes/sub-nodes perform only transformations. Finally, the variables are combined at the first layer, effectively converting the tree into a graph.
  • Extend the network: The compiled KANs network structure is compact, without redundant edges, which could limit its expressive power and hinder further fine-tuning. To improve expressiveness, the network width and depth can be increased according to the task complexity.
  • Training: The agent, after policy initialization, already possesses a certain ability to complete tasks, which can significantly accelerate the convergence speed of subsequent training and reduce the task failure rate during the training process. Finally, building on the prior policy, the strategy is further optimized using DRL algorithms.
The policy network $\pi_{KAN}(s_t; \theta_\pi)$, which embeds the prior policy $\pi_0(s)$, is used to generate the action based on the current state $s_t$:
$$a_t = \pi_{KAN}(s_t; \theta_\pi)$$
By executing the action $a_t$, interacting with the environment, and receiving the next state $s_{t+1}$, reward $r_t$, and done flag, the current experience $(s_t, a_t, r_t, s_{t+1})$ is obtained and stored in the experience replay buffer $\mathcal{D}$. The value network and policy network are updated by randomly sampling experience data $B = \{ (s_i, a_i, r_i, s_{i+1}) \}_{i=1}^{N}$ from the experience replay buffer. According to Equation (14), the target Q-value is expressed as:
$$y_i = r_i + \gamma Q'(s_{i+1}, \pi'_{KAN}(s_{i+1}; \theta_{\pi'}); \theta_{Q'})$$
where $y_i$ is the target value of the $i$-th experience sample, and $Q'$ and $\pi'_{KAN}$ represent the target networks, which are specifically used to compute the target value.
During training, to make the target network updates smoother, it uses soft updating to gradually approach the value network and policy network, with the soft update coefficient τ controlling the target network parameter update speed. The objective of the value network is to approximate the true Q-values by minimizing the mean squared error (MSE) loss of the Q-value function. Thus, the loss function for the value network is defined as the MSE between the predicted Q-values and the target Q-values:
$$L_Q(\theta_Q) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - Q(s_i, a_i; \theta_Q) \right)^2$$
The gradient of the value network is the derivative of the loss function with respect to the parameter $\theta_Q$:
$$\nabla_{\theta_Q} L_Q(\theta_Q) = \frac{2}{N} \sum_{i=1}^{N} \left( Q(s_i, a_i; \theta_Q) - y_i \right) \nabla_{\theta_Q} Q(s_i, a_i; \theta_Q)$$
The objective of the policy network is to find the optimal policy by maximizing the Q-value output by the value network. The goal is to maximize the expected return under the policy. The loss function of the policy network can be defined as the negative Q-value:
$$L_\pi(\theta_\pi) = -\frac{1}{N} \sum_{i=1}^{N} Q(s_i, \pi_{KAN}(s_i; \theta_\pi); \theta_Q)$$
Since the policy network embeds the prior policy $\pi_0$, the policy network can be represented as:
$$\pi(s; \theta_\pi) = \pi_0(s) + \Delta\pi(s; \theta_\pi)$$
where $\Delta\pi(s; \theta_\pi)$ represents the difference between the current network and the prior policy.
At the beginning of training, the network difference is zero. This way, the network has a reasonable behavioral foundation from the start. Compared to traditional reinforcement learning algorithms, this effectively reduces the inefficiency caused by starting exploration from a completely random policy. Additionally, in terms of network training, the derivative of the policy network with respect to the parameters can be expressed as:
$$\nabla_{\theta_\pi} L_\pi(\theta_\pi) = -\frac{1}{N} \sum_{i=1}^{N} \nabla_a Q(s_i, a; \theta_Q) \big|_{a = \pi_0(s_i) + \Delta\pi(s_i; \theta_\pi)} \nabla_{\theta_\pi} \Delta\pi(s_i; \theta_\pi)$$
It can be observed that the gradient in the above equation only involves the Δ π ( s ; θ π ) part, meaning that the network only needs to learn the correction term of the ideal policy relative to the prior policy. At the same time, the prior policy π 0 provides a starting point close to the optimal solution, reducing the exploration range and making the gradient direction more stable, which significantly reduces the complexity of the learning task and accelerates convergence. Parameters are updated using gradient descent in the direction opposite to the gradient:
$$\theta_Q \leftarrow \theta_Q - l_c \nabla_{\theta_Q} L_Q(\theta_Q), \quad \theta_\pi \leftarrow \theta_\pi - l_a \nabla_{\theta_\pi} L_\pi(\theta_\pi)$$
where $l_c$ and $l_a$ are the learning rates for the value network and policy network, respectively.
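The decomposition $\pi(s; \theta_\pi) = \pi_0(s) + \Delta\pi(s; \theta_\pi)$ can be illustrated with the following PyTorch sketch; for brevity the learnable correction is a small MLP rather than the extended KANs used in the paper, and the class and argument names are our own.

```python
import torch
import torch.nn as nn

class PriorEmbeddedActor(nn.Module):
    """Policy of the form pi(s) = pi_0(s) + delta_pi(s; theta_pi).

    prior_policy: a callable implementing pi_0 on batched states; it is
    kept fixed here, so gradients flow only through the correction term.
    The correction is initialized to zero, so early behaviour follows
    the prior policy.
    """
    def __init__(self, prior_policy, state_dim, action_dim, hidden=64):
        super().__init__()
        self.prior_policy = prior_policy
        self.delta = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        nn.init.zeros_(self.delta[-1].weight)   # Delta pi = 0 at the start
        nn.init.zeros_(self.delta[-1].bias)

    def forward(self, state):
        with torch.no_grad():                    # prior policy is not trained
            prior_action = self.prior_policy(state)
        return prior_action + self.delta(state)
```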
As learning progresses, the policy network is continuously optimized, and the agent gradually moves away from the prior policy’s influence, independently achieving the task objectives. In summary, the proposed algorithm can be summarized as Algorithm 1.
Algorithm 1 KbDDPG Algorithm
1: Randomly initialize critic network $Q(s, a \mid \theta_Q)$ with weights $\theta_Q$
2: Compile the initial policy $\pi_0(s)$ into $\pi_{KAN}(s_t; \theta_\pi)$
3: Extend the KANs $\pi_{KAN}(s_t; \theta_\pi)$
4: Initialize target networks $Q'$ and $\pi'_{KAN}$ with weights $\theta_{Q'} \leftarrow \theta_Q$, $\theta_{\pi'} \leftarrow \theta_\pi$
5: Initialize replay buffer $\mathcal{R}$
6: for each episode = 1, M do
7:    Initialize a random process $\mathcal{N}$ for action exploration
8:    Receive initial observation state $s_1$
9:    for each step t = 1, T do
10:      Select the action $a_t = \pi_{KAN}(s_t \mid \theta_\pi) + \mathcal{N}_t$, where $\mathcal{N}_t$ is the exploration noise
11:      Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
12:      Store transition $(s_t, a_t, r_t, s_{t+1})$ in $\mathcal{R}$
13:      Sample a random minibatch of transitions $(s_i, a_i, r_i, s_{i+1})$ from $\mathcal{R}$
14:      Set $y_i = r_i + \gamma Q'(s_{i+1}, \pi'_{KAN}(s_{i+1}; \theta_{\pi'}); \theta_{Q'})$
15:      Update the critic by minimizing the loss: $\nabla_{\theta_Q} L_Q(\theta_Q) = \frac{2}{N} \sum_{i=1}^{N} (Q(s_i, a_i; \theta_Q) - y_i) \nabla_{\theta_Q} Q(s_i, a_i; \theta_Q)$, $\theta_Q \leftarrow \theta_Q - l_c \nabla_{\theta_Q} L_Q(\theta_Q)$
16:      Update the actor using the sampled policy gradient: $\nabla_{\theta_\pi} J = \frac{1}{N} \sum_i \nabla_a Q(s, a \mid \theta_Q)\big|_{s = s_i, a = \pi(s_i)} \nabla_{\theta_\pi} \pi(s \mid \theta_\pi)\big|_{s_i}$, $\theta_\pi \leftarrow \theta_\pi - l_a \nabla_{\theta_\pi} L_\pi(\theta_\pi)$
17:      Update the target networks: $\theta_{Q'} \leftarrow \tau \theta_Q + (1 - \tau) \theta_{Q'}$, $\theta_{\pi'} \leftarrow \tau \theta_\pi + (1 - \tau) \theta_{\pi'}$
18:    end for
19: end for
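As a reference for how lines 13-17 of Algorithm 1 map onto code, the following PyTorch sketch performs one critic/actor update with soft target updates; terminal-state handling and exploration noise are omitted, and the function signature is an assumption rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def kbddpg_update(batch, actor, critic, actor_tgt, critic_tgt,
                  actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One critic/actor update step on a sampled minibatch (s, a, r, s')."""
    s, a, r, s_next = batch

    # Target value y_i = r_i + gamma * Q'(s', pi'(s'))
    with torch.no_grad():
        y = r + gamma * critic_tgt(s_next, actor_tgt(s_next))

    # Critic: minimize the MSE between Q(s, a) and the target value
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: maximize Q(s, pi(s)), i.e. minimize its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks
    for tgt, src in ((critic_tgt, critic), (actor_tgt, actor)):
        for p_tgt, p in zip(tgt.parameters(), src.parameters()):
            p_tgt.data.mul_(1.0 - tau).add_(tau * p.data)
```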

4. Experiments

This section evaluates the performance of the model and algorithm through several experiments. First, we describe the simulation environment and experimental parameters. Next, we compare the success rate of relocating the target before and after using target trajectory prediction and visualize the prediction results. We then analyze the impact of spatial information entropy on target search and compare the search performance with other baseline algorithms. Lastly, we compare the proposed KbDDPG algorithm with other baseline methods and visualize the UAV task execution to assess its convergence performance.

4.1. Simulation Environment Design

To verify the adaptability of the UAV system in complex environments, we built a highly complex virtual environment through simulation. These environments feature a large number of irregularly distributed static obstacles, and also simulate tracking targets that move dynamically with random starting positions. In addition, multiple occlusion areas were set in the test environment to simulate potential sensor blind spots and noise interference in real-world scenarios. The UAVs can only acquire environmental and target information within a limited range. This limitation further intensifies the uncertainty and dynamics of the environment, while also requiring the system to fully consider the coverage and redundancy of information acquisition when planning search paths. The randomness and dynamic changes in various factors in the environment make the entire scene much more complex than traditional static or single-obstacle environments, fully testing the UAV’s robustness and flexibility in tasks such as searching, continuous tracking, and rapid re-localization after target loss.
The virtual environment uses polygonal areas to represent obstacles and line-of-sight occlusion zones. If the UAV collides with an obstacle, the task is considered failed, and the UAV cannot detect targets entering the line-of-sight occlusion zones. The environment is stochastic, with the UAV's starting position and target position being randomly re-initialized at the end of each navigation task. For the KbDDPG algorithm, we design a simple initial control policy $\omega_u^t = \phi_g^t - k_p \phi_o^t$, where the coefficient $k_p$ is used to balance the weights between UAV tracking and obstacle avoidance. In the simulation, the target trajectory information from the most recent 50 episodes is stored for trajectory prediction. The UAV's historical trajectory is recorded for up to 100 steps and processed by a state extractor to generate a state input of length 40. After embedding the prior policy, 2 additional KANs nodes are added to the policy network. The value network has 512 hidden layer nodes. The simulation experiments in this paper were conducted in a Windows 11 environment using Python 3.11 and Pytorch 2.2.1. Gaussian process regression was implemented based on Scikit-learn 1.4.2, and the KAN network was constructed using Pykan 0.2.6. The simulation environment was built following the OpenAI Gym API standards. The experiments were run on a computer equipped with an i7-11700F CPU, an RTX 3060 GPU (12 GB VRAM), and 16 GB of RAM. The reinforcement learning training used the Adam optimizer, and 3300 s were needed to complete 100 training episodes. Other parameters of the virtual simulation environment are shown in Table 1. Unless otherwise specified, the parameters in the table are used for the simulation.
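For reference, the prior heading-rate controller described above can be sketched in a few lines; the clipping bound and gain value are illustrative, and the sign of the obstacle term follows the reconstruction $\omega_u^t = \phi_g^t - k_p \phi_o^t$ given above.

```python
import numpy as np

def prior_heading_policy(phi_g, phi_o, k_p=0.5, omega_max=np.pi / 6):
    """Prior policy omega_u = phi_g - k_p * phi_o, clipped to a feasible range.

    phi_g / phi_o are the relative angles to the target and to the nearest
    obstacle; k_p balances target tracking against obstacle avoidance.
    """
    return float(np.clip(phi_g - k_p * phi_o, -omega_max, omega_max))
```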

4.2. Performance Analysis of the Prediction Model

We tested the trajectory prediction performance of the proposed GPR model, which inputs data from 20 time steps and outputs 10 predicted values. Figure 5 displays the prediction results for both straight-line and turning motion when the target enters the line-of-sight occlusion zone. We observed that GPR provides accurate fitting after the target is lost. Additionally, the probability distribution generated by GPR quantifies the uncertainty in the estimates. We can see that as the prediction time steps increase, the uncertainty of the prediction gradually grows, which helps in evaluating the quality of the prediction. Table 2 presents the prediction performance when the target enters the line-of-sight occlusion zone in UAV tracking mode. For each prediction, if the target remains within the UAV’s FOV after exiting the line-of-sight occlusion zone, the prediction is deemed successful. It is clear that, compared to no prediction (sample size of 0), our prediction model significantly improves the UAV’s success rate in locating the lost target. As the number of model samples increases, its performance improves, but the increase is marginal, suggesting that the model performs well with small datasets and does not require a large amount of training data.

4.3. Search Performance Analysis

In this section, we test the effect of different values of the parameter $\alpha$ on search performance. Furthermore, to validate the performance of our method, we compare the proposed RL-based search strategy with the complete coverage path planning and random walk methods. The coverage path planning method divides the workspace into sub-regions and explores them using simple back-and-forth motion, achieving complete area coverage. The random walk method is based on the concept of random motion. The UAV maintains straight flight when no obstacles are detected. Once an obstacle is detected ahead, the UAV randomly samples an angle change value to determine a new flight direction. After adjusting its direction, the UAV resumes straight flight and repeats the process [34].
Figure 6 visualizes the information correlation density. The heatmap illustrates how the spatial information correlation density changes over time steps for different α values, given the same UAV trajectory. On the horizontal axis, as the time steps increase, the information correlation density uniformly distributed across the environment decreases, while the correlation density in the explored regions increases, indicating that the UAV perceives these areas as having lower information search efficiency. On the vertical axis, when α is small, the information relevance density changes slowly. Information updates are not timely, and the differences in information density between regions are small. Although the historical positions of the UAV influence information relevance density for a longer time, excessively old historical information may interfere with the search for moving targets. As α increases, the spatial information density changes more dramatically with the UAV’s movement. However, the rapid rate of change also causes the information relevance density at historical positions to decline rapidly, leading to a loss of historical exploration information.
We define environmental coverage as the ratio of the number of coordinates covered by the UAV's FOV during an episode to the total number of coordinates that can be explored in the environment. The parameter $\alpha$ controls the decay rate of the spatial information density, thus influencing search performance. To determine the optimal $\alpha$ value, we tested the impact of RL strategies trained with different $\alpha$ values on the UAV's environmental exploration rate, as shown in Figure 7. When the parameter value is small, the coverage performance is relatively poor, with high variance. As the parameter value increases, the environmental coverage gradually rises, reaching its peak near $\alpha = 0.015$, indicating more stable and effective search performance within this range. As $\alpha$ continues to increase, coverage begins to decline, indicating that both very small and very large $\alpha$ values lead to decreased coverage. This is also consistent with our earlier discussion on information correlation density.
Figure 8 presents the comparison results of the search algorithm at different target speeds for α = 0.015 . From the figure, we can observe that as the target speed increases, the success rate of the coverage path planning method gradually decreases from 1.0, reaching its worst performance at medium target speeds. This suggests that coverage path planning, which relies on a fixed path for complete coverage, works well for stationary and low-speed targets, leveraging the advantage of spatial coverage. However, at medium speeds, the probability of missing the target increases, and due to its inflexibility, it becomes challenging to relocate the target after missing it in the first scan of each coverage cycle. Random walk shows a low success rate at low target speeds. As the target speed increases, the success rate gradually improves, but its overall performance is inferior to that of the other two algorithms. Our method shows stable and efficient search performance at all speeds, especially at medium target speeds, where it significantly outperforms the other two methods. Additionally, at higher target speeds, the success rates of the three algorithms are similar, and all algorithms perform well in completing the search task. This suggests that under high-speed conditions, the performance differences between the algorithms are minimal, with target speed being the dominant factor rather than the specific algorithm design.
Figure 9 displays the motion trajectories of 1000 targets. We can observe that, influenced by the control method and obstacles, the probability of target appearance differs at various locations. We believe that RL-based methods offer greater robustness, as they can learn to capture the target’s hidden behavior patterns and exhibit strong adaptability to targets with different behaviors. Figure 10 presents the relationship between time steps and search success rate for the three search algorithms within a single episode. From the figure, we can see that our method shows a higher search success rate in most cases.

4.4. Convergence Performance Analysis

Next, we compare DDPG, DDPG from demonstrations (DDPGfD), KAN-DDPG, and KbDDPG, training each algorithm for 100 episodes in the same environment to collect our experimental data. DDPGfD is an algorithm that combines expert demonstration data with DRL. It is an improvement on the DDPG algorithm that accelerates the learning process by incorporating expert demonstration data. In this experiment, we use an agent trained with DDPG to interact with the environment and collect demonstration data. KAN-DDPG replaces the policy network of the DDPG algorithm with KANs, maintaining the same network structure as our algorithm but without prior policy embedding. Figure 11 shows the reward progression for each episode during training. It can be observed that KbDDPG converges faster than both DDPG and KAN-DDPG. Although DDPGfD converges very quickly, its convergence starts later than our algorithm's. This indicates that the KbDDPG algorithm demonstrates strong performance from the early stages of training. The results demonstrate that the KbDDPG algorithm leverages prior policy to efficiently learn critical information from the environment in the initial stages, leading to faster reward improvement.
Figure 12 presents the collision rate and target tracking duration during training. The collision rate is defined as the ratio of episodes in which the UAV collides to the total number of episodes, and we use it to evaluate the UAV’s obstacle-avoidance ability. Target tracking duration is defined as the number of time steps the target remains within the UAV’s field of view during each episode, and we use it to evaluate the UAV’s target search and tracking capability. As shown in Figure 12a, the other algorithms start with a collision rate of 1 and require many trial-and-error collisions to converge, whereas KbDDPG maintains a collision rate of about 0.4 from the start of training. As training progresses, its collision rate decreases further and stabilizes near 0, which is crucial for reducing equipment wear in real-world training. As shown in Figure 12b, our algorithm maintains a target tracking duration of about 50 steps at the beginning of training and stabilizes at about 300 steps as training progresses. The results in Figure 12 show that KbDDPG possesses obstacle-avoidance and target search and tracking abilities from the start of training, meaning the prior policy has been successfully embedded into the policy network. Meanwhile, the KAN-DDPG algorithm consistently ranks lowest. We attribute this to the fact that, while the KAN structure enhances the network’s expressive power by learning activation functions, it also complicates the optimization problem: without a prior policy to serve as a “good starting point”, the optimization space becomes larger and harder to explore, making it difficult for the network to quickly converge to an effective policy.
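Both training metrics in Figure 12 can be computed directly from episode logs. The following sketch assumes a simple, hypothetical log format (a `collided` flag per episode and a `target_in_fov` flag per step); it is included only to make the metric definitions above unambiguous.

```python
from typing import Dict, List

def collision_rate(episodes: List[Dict]) -> float:
    """Fraction of episodes in which the UAV collided with an obstacle."""
    return sum(ep["collided"] for ep in episodes) / len(episodes)

def tracking_duration(episode: Dict) -> int:
    """Number of time steps in one episode during which the target
    stayed inside the UAV's field of view."""
    return sum(step["target_in_fov"] for step in episode["steps"])
```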
To further assess performance after training convergence, we conducted task performance tests. Figure 13 shows the trained reinforcement learning agent performing target search and tracking in different environments, with 300 time steps tested per environment. During the search phase, the UAV conducts a broad search across the area, exploring potential target locations. In the tracking phase, it stably tracks the target, successfully avoids obstacles, and mitigates interference from occluded regions. The UAV is also unaffected by changes in obstacle distribution and consistently completes the task under various obstacle layouts. Figure 14 displays the agent’s target search and tracking in environments of different sizes, extending the task to larger and more complex obstacle settings. The 200-grid and 300-grid environments were tested for 500 and 700 time steps, respectively, while the other environments were tested for 300 time steps. Regardless of scene size, the trained agent exhibits excellent search and tracking performance: in smaller environments it swiftly covers the entire area, quickly locating and locking onto the target, while in larger, more complex environments with a broader target activity range it effectively expands its search to locate the target. These results indicate that the trained agent performs well in target search and tracking tasks and demonstrates strong adaptability and robustness across environments.

5. Discussion

In this paper, we focus on the UAV target search and tracking problem. We use a GPR model to predict the target’s trajectory, addressing the target-loss issue, and we extend the concept of spatial information entropy to target search. To address the low sample efficiency of DRL, we propose a novel DRL algorithm based on KANs. Extensive simulation results demonstrate that, compared with other DRL algorithms, our method exhibits superior convergence performance. Moreover, thanks to the use of spatial information entropy, our method outperforms existing path planning algorithms in target search.
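As a concrete illustration of the GPR-based trajectory prediction summarized here, the sketch below fits independent Gaussian process regressors to the target’s recent x and y positions over time and extrapolates a fixed horizon ahead. The 20-step horizon mirrors m_max in Table 1, but the scikit-learn API, the RBF kernel, and the noise level are our own assumptions rather than the paper’s exact configuration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def predict_trajectory(track: np.ndarray, horizon: int = 20):
    """track: (w, 2) array of recently observed target positions.
    Returns (horizon, 2) predicted positions and per-axis standard deviations."""
    w = len(track)
    t = np.arange(w).reshape(-1, 1)                     # time indices of the observation window
    t_future = np.arange(w, w + horizon).reshape(-1, 1)
    preds, stds = [], []
    for axis in range(2):                               # fit x(t) and y(t) separately
        kernel = ConstantKernel(1.0) * RBF(length_scale=5.0)
        gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, normalize_y=True)
        gpr.fit(t, track[:, axis])
        mu, sigma = gpr.predict(t_future, return_std=True)
        preds.append(mu)
        stds.append(sigma)
    return np.stack(preds, axis=1), np.stack(stds, axis=1)
```

The returned standard deviations quantify how quickly the prediction becomes uncertain, which is what makes a GPR-style predictor useful for deciding how long to trust an extrapolated trajectory after the target is lost.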
However, in the simulation we made the simplifying assumption that the target’s next state depends only on its current state. In reality, the target’s movement contains higher-order dependencies, its behavior is more unpredictable, and it may even take evasive actions against the searching UAV; more realistic modeling of target movement is therefore a direction for future research. Additionally, our study of the information correlation density decay coefficient α is limited to finding its optimal value in a single environment, whereas the decay coefficient could be adjusted adaptively according to the environment. Our future work will therefore consider an adaptive adjustment mechanism to enhance the flexibility and robustness of the algorithm in different dynamic environments. Meanwhile, although our virtual environment design considers factors such as obstacles, occlusion zones, and target movement, to simplify the study and improve computational efficiency we adopted simplified models of UAV dynamics, sensor noise, and target behavior, and restricted the environment to 2D space. In real environments, UAV flight dynamics are more complex, and various random noises add interference to the UAV’s decision making, so a performance gap often exists between simulation and real-world applications. We have not yet conducted comprehensive tests on real UAV platforms; in future work, we plan to deploy the DRL algorithm on real UAVs and further improve the method to handle more non-ideal factors in real environments. Finally, this paper only verifies the impact of relatively simple prior policies on the algorithm. In future work, we will investigate KAN-based reinforcement learning in greater depth, exploring its potential in complex tasks and further improving performance in high-dimensional continuous action spaces.

Author Contributions

Investigation, data curation, and writing—original draft, Z.F.; conceptualization, X.N., S.H., and Q.S.; methodology, Z.F. and J.S.; supervision, X.N.; software and validation, Z.F.; writing—review and editing, X.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62263025.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Xiong, T.; Liu, F.; Liu, H.; Ge, J.; Li, H.; Ding, K.; Li, Q. Multi-Drone Optimal Mission Assignment and 3D Path Planning for Disaster Rescue. Drones 2023, 7, 394. [Google Scholar] [CrossRef]
  2. Wallar, A.; Plaku, E.; Sofge, D.A. Reactive Motion Planning for Unmanned Aerial Surveillance of Risk-Sensitive Areas. IEEE Trans. Autom. Sci. Eng. 2015, 12, 969–980. [Google Scholar] [CrossRef]
  3. Stodola, P.; Drozd, J.; Mazal, J.; Hodický, J.; Procházka, D. Cooperative Unmanned Aerial System Reconnaissance in a Complex Urban Environment and Uneven Terrain. Sensors 2019, 19, 3754. [Google Scholar] [CrossRef]
  4. Stodola, P.; Kozůbek, J.; Drozd, J. Using Unmanned Aerial Systems in Military Operations for Autonomous Reconnaissance. In Modelling and Simulation for Autonomous Systems; Mazal, J., Ed.; Springer International Publishing: Cham, Switzerland, 2019; Volume 11472, pp. 514–529. [Google Scholar]
  5. Yuan, S.; Li, Y.; Bao, F.; Xu, H.; Yang, Y.; Yan, Q.; Zhong, S.; Yin, H.; Xu, J.; Huang, Z.; et al. Marine Environmental Monitoring with Unmanned Vehicle Platforms: Present Applications and Future Prospects. Sci. Total Environ. 2023, 858, 159741. [Google Scholar] [CrossRef]
  6. Muñoz, J.; López, B.; Quevedo, F.; Monje, C.A.; Garrido, S.; Moreno, L.E. Multi UAV Coverage Path Planning in Urban Environments. Sensors 2021, 21, 7365. [Google Scholar] [CrossRef]
  7. Jayaweera, H.M.; Hanoun, S. A Dynamic Artificial Potential Field (D-APF) UAV Path Planning Technique for Following Ground Moving Targets. IEEE Access 2020, 8, 192760–192776. [Google Scholar] [CrossRef]
  8. Liu, J. An Improved Genetic Algorithm for Rapid UAV Path Planning. J. Phys. Conf. Ser. 2022, 2216, 012035. [Google Scholar] [CrossRef]
  9. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement Learning: A Survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
  10. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. arXiv 2013, arXiv:1312.5602. [Google Scholar]
  11. Souchleris, K.; Sidiropoulos, G.K.; Papakostas, G.A. Reinforcement Learning in Game Industry—Review, Prospects and Challenges. Appl. Sci. 2023, 13, 2443. [Google Scholar] [CrossRef]
  12. Li, C.; Zheng, P.; Yin, Y.; Wang, B.; Wang, L. Deep Reinforcement Learning in Smart Manufacturing: A Review and Prospects. CIRP J. Manuf. Sci. Technol. 2023, 40, 75–101. [Google Scholar] [CrossRef]
  13. Aradi, S. Survey of Deep Reinforcement Learning for Motion Planning of Autonomous Vehicles. IEEE Trans. Intell. Transp. Syst. 2022, 23, 740–759. [Google Scholar] [CrossRef]
  14. Azar, A.T.; Koubaa, A.; Ali Mohamed, N.; Ibrahim, H.A.; Ibrahim, Z.F.; Kazim, M.; Ammar, A.; Benjdira, B.; Khamis, A.M.; Hameed, I.A.; et al. Drone Deep Reinforcement Learning: A Review. Electronics 2021, 10, 999. [Google Scholar] [CrossRef]
  15. Sheng, Y.; Liu, H.; Li, J.; Han, Q. UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments. Drones 2024, 8, 516. [Google Scholar] [CrossRef]
  16. Yin, Y.; Wang, Z.; Zheng, L.; Su, Q.; Guo, Y. Autonomous UAV Navigation with Adaptive Control Based on Deep Reinforcement Learning. Electronics 2024, 13, 2432. [Google Scholar] [CrossRef]
  17. Shi, H.; Shi, L.; Xu, M.; Hwang, K.S. End-to-End Navigation Strategy with Deep Reinforcement Learning for Mobile Robots. IEEE Trans. Ind. Inform. 2020, 16, 2393–2402. [Google Scholar] [CrossRef]
  18. Liu, M.; Wei, J.; Liu, K. A Two-Stage Target Search and Tracking Method for UAV Based on Deep Reinforcement Learning. Drones 2024, 8, 544. [Google Scholar] [CrossRef]
  19. Wang, T.; Qin, R.; Chen, Y.; Snoussi, H.; Choi, C. A Reinforcement Learning Approach for UAV Target Searching and Tracking. Multimed. Tools Appl. 2019, 78, 4347–4364. [Google Scholar] [CrossRef]
  20. Cui, Y.; Hou, B.; Wu, Q.; Ren, B.; Wang, S.; Jiao, L. Remote Sensing Object Tracking With Deep Reinforcement Learning Under Occlusion. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5605213. [Google Scholar] [CrossRef]
  21. Irpan, A. Deep Reinforcement Learning Doesn’t Work Yet. 2018. Available online: https://www.alexirpan.com/2018/02/14/rl-hard.html (accessed on 3 January 2025).
  22. James, S.; Davison, A.J. Q-Attention: Enabling Efficient Learning for Vision-Based Robotic Manipulation. IEEE Robot. Autom. Lett. 2022, 7, 1612–1619. [Google Scholar] [CrossRef]
  23. Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Horgan, D.; Quan, J.; Sendonaris, A.; Osband, I.; et al. Deep Q-learning From Demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–8 February 2018; Volume 32. [Google Scholar] [CrossRef]
  24. Vecerik, M.; Hester, T.; Scholz, J.; Wang, F.; Pietquin, O.; Piot, B.; Heess, N.; Rothörl, T.; Lampe, T.; Riedmiller, M. Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards. arXiv 2017, arXiv:1707.08817. [Google Scholar]
  25. Ravichandar, H.; Polydoros, A.S.; Chernova, S.; Billard, A. Recent Advances in Robot Learning from Demonstration. Annu. Rev. Control. Robot. Auton. Syst. 2020, 3, 297–330. [Google Scholar] [CrossRef]
  26. Wang, C.; Wang, J.; Wang, J.; Zhang, X. Deep-Reinforcement-Learning-Based Autonomous UAV Navigation with Sparse Rewards. IEEE Internet Things J. 2020, 7, 6180–6190. [Google Scholar] [CrossRef]
  27. Wang, D.; Fan, T.; Han, T.; Pan, J. A Two-Stage Reinforcement Learning Approach for Multi-UAV Collision Avoidance under Imperfect Sensing. IEEE Robot. Autom. Lett. 2020, 5, 3098–3105. [Google Scholar] [CrossRef]
  28. Xia, Z.; Du, J.; Wang, J.; Jiang, C.; Ren, Y.; Li, G.; Han, Z. Multi-Agent Reinforcement Learning Aided Intelligent UAV Swarm for Target Tracking. IEEE Trans. Veh. Technol. 2022, 71, 931–945. [Google Scholar] [CrossRef]
  29. Liu, Y.; Fan, Y. A Review of Vehicle Trajectory Prediction Methods in Driverless Scenarios. In Proceedings of the 2024 9th International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan, 21–23 November 2024; Volume 9, pp. 293–299. [Google Scholar]
  30. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2404.19756. [Google Scholar] [CrossRef]
  31. Fiorini, P.; Shiller, Z. Motion Planning in Dynamic Environments Using Velocity Obstacles. Int. J. Robot. Res. 1998, 17, 760–772. [Google Scholar] [CrossRef]
  32. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous Control with Deep Reinforcement Learning. arXiv 2019, arXiv:1509.02971. [Google Scholar]
  33. Liu, Z.; Ma, P.; Wang, Y.; Matusik, W.; Tegmark, M. KAN 2.0: Kolmogorov-Arnold Networks Meet Science. arXiv 2024, arXiv:2408.10205. [Google Scholar] [CrossRef]
  34. Hasan, K.M.; Abdullah-Al-Nahid; Reza, K.J. Path Planning Algorithm Development for Autonomous Vacuum Cleaner Robots. In Proceedings of the 2014 International Conference on Informatics, Electronics & Vision (ICIEV), Dhaka, Bangladesh, 7–9 May 2014; pp. 1–6. [Google Scholar]
Figure 1. Task scenario diagram.
Figure 2. System search and tracking framework.
Figure 3. Algorithm architecture.
Figure 4. Construction process of Kolmogorov–Arnold networks (KANs).
Figure 5. Prediction results of the Gaussian process regression (GPR) model. (a) Prediction results when the target is moving straight. (b) Prediction results when the target is turning.
Figure 6. Variation of information correlation densities with different values of α. (a) H_SI = 13.93. (b) H_SI = 13.77. (c) H_SI = 13.67. (d) H_SI = 13.92. (e) H_SI = 13.16. (f) H_SI = 12.88. (g) H_SI = 13.89. (h) H_SI = 11.63. (i) H_SI = 11.59.
Figure 7. The effect of parameter α on coverage. The dotted lines indicate the Median Trend Line, while the circles represent Outliers.
Figure 8. Performance comparison with other algorithms.
Figure 9. Target motion trajectories. The colored lines represent the target’s trajectories across different episodes, while the black-filled regions represent obstacles.
Figure 10. Comparison of search success rates among three algorithms in a single episode. (a) Target speed = 0.5. (b) Target speed = 0.75.
Figure 11. Comparison of episode rewards among deep deterministic policy gradient (DDPG), DDPG from demonstrations (DDPGfD), KAN-DDPG, and KANs-based deep deterministic policy gradient (KbDDPG).
Figure 12. Performance analysis of DDPG, DDPGfD, KAN-DDPG, and KbDDPG. (a) Collision rate during the training process. (b) Tracking duration during the training process.
Figure 13. Performance of the UAV under different distributions of obstacles. The blue lines represent the target’s trajectory. (a) Two obstacles. (b) Three obstacles. (c) Four obstacles. (d) Five obstacles.
Figure 14. Performance of the UAV in environments of different scales. The blue lines represent the target’s trajectory. (a) 100-grid size. (b) 150-grid size. (c) 200-grid size. (d) 300-grid size.
Table 1. Simulation parameters.

Parameter | Description | Value
M | Simulation environment size | 125
ω_n | UAV angular velocity range | [−1, +1]
v | UAV velocity | 3
R | UAV maximum perception range | 12
γ | Discount factor | 0.99
α | Decay coefficient | 0.015
T_max | Maximum number of steps | 500
w_max | Prediction model input length | 30
m_max | Prediction length | 20
d | Increment value for the window | 4
u | Number of window uses | 3
k_1 | Search reward coefficient | 200
k_2 | Tracking reward coefficient | 5
k_3 | Target loss penalty coefficient | 25
k_4 | Collision penalty coefficient | 200
l_a | Policy network learning rate | 5 × 10⁻⁴
l_c | Value network learning rate | 2 × 10⁻³
β | Adam optimizer hyperparameters | (0.9, 0.999)
τ | Soft update rate | 5 × 10⁻³
k_p | Avoidance coefficient | 10
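For reference, the simulation parameters in Table 1 map naturally onto a single configuration object. The dataclass below simply mirrors the table; the field names are our own shorthand and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class SimConfig:
    env_size: int = 125                 # M: simulation environment size
    omega_range: tuple = (-1.0, 1.0)    # ω_n: UAV angular velocity range
    uav_speed: float = 3.0              # v: UAV velocity
    perception_range: float = 12.0      # R: UAV maximum perception range
    gamma: float = 0.99                 # discount factor
    alpha: float = 0.015                # decay coefficient
    max_steps: int = 500                # T_max: maximum number of steps
    pred_window: int = 30               # w_max: prediction model input length
    pred_horizon: int = 20              # m_max: prediction length
    window_increment: int = 4           # d: increment value for the window
    window_uses: int = 3                # u: number of window uses
    k_search: float = 200.0             # k_1: search reward coefficient
    k_track: float = 5.0                # k_2: tracking reward coefficient
    k_loss: float = 25.0                # k_3: target loss penalty coefficient
    k_collision: float = 200.0          # k_4: collision penalty coefficient
    actor_lr: float = 5e-4              # l_a: policy network learning rate
    critic_lr: float = 2e-3             # l_c: value network learning rate
    adam_betas: tuple = (0.9, 0.999)    # β: Adam optimizer hyperparameters
    tau: float = 5e-3                   # τ: soft update rate
    k_avoid: float = 10.0               # k_p: avoidance coefficient
```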
Table 2. Model prediction performance.

Number of Samples | Success Rate | Mean Error
0 | 0.197 | -
50 | 0.901 | 3.765
100 | 0.929 | 3.485
500 | 0.935 | 3.480
1000 | 0.935 | 3.479
1500 | 0.937 | 3.470
