Research on Proximal Policy Optimization Algorithm in Path Planning for UAV-Based Vehicle Tracking

Qiao, Dongna; Zhang, Hongxin

doi:10.3390/drones10050319


by Dongna Qiao and Hongxin Zhang *

School of Mechanical and Electrical Engineering, Qingdao University, Qingdao 266071, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(5), 319; https://doi.org/10.3390/drones10050319

Submission received: 19 March 2026 / Revised: 9 April 2026 / Accepted: 21 April 2026 / Published: 23 April 2026


Highlights

What are the main findings?

The proposed PPO-based method significantly improves the convergence speed and stability of UAV trajectory planning in vehicle tracking scenarios.
Compared with traditional path planning approaches, the optimized policy effectively reduces tracking error and enhances trajectory smoothness.

What are the implications of the main findings?

The results demonstrate that reinforcement learning-based trajectory planning can provide reliable and adaptive tracking performance in dynamic traffic environments.
The proposed method offers a practical solution for intelligent transportation applications such as traffic monitoring, autonomous escorting, and aerial–ground cooperative systems.

Abstract

Unmanned Aerial Vehicle (UAV) tracking of moving ground targets has important applications in intelligent transportation, logistics distribution, and environmental monitoring, and it places high demands on efficient and stable path planning for vehicle tracking. This study investigates a UAV path tracking approach based on a deep reinforcement learning algorithm, Proximal Policy Optimization (PPO). Starting from the kinematic characteristics of UAVs and ground vehicles, a 3D path planning model is constructed that considers spatial coordinates, velocity, and attitude constraints. The objective function combines tracking error minimization, energy optimization, and safety distance constraints. With a suitably designed state space, action space, and reward function, the PPO algorithm learns adaptively in complex environments. Compared with the traditional Artificial Potential Field (APF) method and the Q-learning and TD3 algorithms, PPO better balances exploration and exploitation and demonstrates stronger learning stability and global optimization capability in dynamic multi-obstacle scenarios. Simulation results show that PPO-based UAV path planning outperforms the comparison algorithms in tracking accuracy, convergence speed, and robustness. In specific scenarios, Q-learning achieves a trajectory error of approximately 1 m, TD3 and APF exhibit errors around 0.3 m with noticeable oscillations, and PPO achieves an error of about 0.2 m. With PPO, the UAV follows the vehicle trajectory smoothly, with a more continuous path and a rapidly converging, stable error curve, indicating the promising application potential of PPO in intelligent UAV control.
The PPO-based UAV-tracking path planning method effectively enhances the UAV’s intelligent decision-making and path optimization capabilities, providing new technical approaches and a research foundation for intelligent UAV traffic and cooperative control systems.

Keywords: unmanned aerial vehicle (UAV); tracking control; path planning; PPO algorithm

1. Introduction

1.1. Research Background and Significance

Unmanned Aerial Vehicles (UAVs), owing to their high maneuverability, broad field of view, and rapid deployment capabilities, have been widely applied in intelligent transportation, public safety, logistics, and other fields. As a critical component of intelligent transportation systems, UAV-based vehicle tracking plays a significant role in tasks such as road monitoring, traffic management, accident detection, and fugitive vehicle pursuit. However, achieving efficient and stable path planning for UAV-based vehicle tracking still faces multiple challenges, including complex traffic environments, dynamic obstacle avoidance requirements, and computational efficiency in path optimization.

Currently, mainstream path planning methods primarily include traditional search-based algorithms (e.g., A*, Dijkstra), sampling-based optimization algorithms (e.g., Rapidly exploring Random Trees (RRT), Probabilistic Roadmaps (PRM)), and intelligent optimization algorithms (e.g., Genetic Algorithm (GA), Particle Swarm Optimization (PSO)). While these methods have achieved certain success in various applications, they still exhibit limitations in UAV-based vehicle tracking, such as insufficient search capability, susceptibility to local optima, and high computational overhead. Therefore, developing an efficient and robust path planning method for UAV-based vehicle tracking—enhancing both optimization performance and computational efficiency—holds significant theoretical and engineering value.

1.2. Research Status

Path planning technology is one of the key technologies determining whether an unmanned aerial vehicle can achieve smooth and successful flight. Ref. [1] proposed an intelligent parking system guided by a UAV. In this system, a quadrotor UAV detects real-time parking conditions and calculates the optimal parking route, guiding vehicles to enter or exit the parking lot along the planned path. However, during the guidance process, the UAV must continuously follow the intelligent vehicle, and its navigation relies heavily on onboard sensors, which introduces certain limitations in practical applications.

Li et al. [2] proposed an engineering-semantic-based path planning technique for assembly simulation, enabling rapid path planning and improving the efficiency of aircraft assembly simulation. Hou et al. [3] combined an improved Q-Learning algorithm with the artificial potential field method to solve the problem of ineffective path planning between two path nodes. Ma et al. [4] presented a 3D path planning method that integrates the Genetic Algorithm (GA) with the A* algorithm. This method includes global path planning and real-time path re-planning, effectively reducing the search space and improving search efficiency.

Cheng et al. [5] proposed a decentralized multi-UAV path planning method suitable for obstacle-rich environments. This method can handle minimum-time rendezvous path planning problems, optimize flight energy consumption, and maintain computational efficiency even as the UAV swarm size increases. Ref. [6] applied the backstepping method and behavior-based control to generate a virtual robot for cooperative motion control of unmanned vehicles. Ref. [7] designed a state observer based on Lyapunov theory under a leader–follower framework to achieve cooperative motion tracking control of multiple UAVs.

Due to the complexity and variability of mission environments, improving UAV obstacle avoidance, collision avoidance, and tracking accuracy has become a major research focus [8,9]. Ref. [10] proposed an obstacle avoidance path planning method based on an improved Particle Swarm Optimization (PSO) algorithm. Ref. [11] introduced a PSO-based parameter optimization approach for A* and artificial potential field methods. Ref. [12] presented a UAV path planning method based on an improved Genetic Algorithm. All three methods were reported to achieve promising results.

For fixed-wing UAVs, Ref. [13] proposed a dual-feedback model predictive control (MPC) approach based on state augmentation for controller design to address tracking problems. Li Keyu et al. [14] applied the Rapidly exploring Random Tree (RRT) algorithm to pre-planned path searching, improving the efficiency of UAV obstacle avoidance path planning and reducing the time required to generate feasible paths. Zhao Juan [15] proposed a trajectory planning method using heuristic-point-guided D* algorithm expansion, solving the issue of being unable to approach the target point from a specific direction, shortening the UAV trajectory length, and reducing planning time.

He Jingan et al. [16] improved the convergence speed of the Artificial Bee Colony (ABC) algorithm by combining it with the K-means clustering algorithm, resulting in better fitness values. Chen Xia et al. [17] proposed a UAV trajectory planning method based on an improved adaptive ant colony optimization (IAACO) algorithm, addressing the slow convergence and tendency to fall into local optima in traditional ant colony algorithms. Their method achieved faster convergence, smoother paths, and shorter trajectory lengths.

Research on cooperative path planning between unmanned aerial vehicles and ground vehicles has also gradually matured. Murray and Chu [18] were among the first to study the joint path planning problem of vehicles and UAVs. They proposed the Flying Sidekick Traveling Salesman Problem (FSTSP) and established a Mixed-Integer Programming (MIP) model aimed at minimizing total delivery time. Agatz et al. [19] formulated a traveling salesman problem with drones, also aiming to minimize total delivery time, and solved it using heuristic algorithms based on local search and dynamic programming.

Danny et al. [20] proposed a novel backup path planning method by repeatedly utilizing distributed pheromones in ant colony optimization. The proposed strategy derives feasible backup paths solely based on available pheromone concentrations, leading to improved final path quality and overall performance. Ahmed Hafez et al. [21,22] proposed a cooperative motion scheme for UAV–UGV systems based on Model Predictive Control (MPC), which provides better constraint handling for agent motion compared with other methods. Yang et al. [23] proposed a multi-strategy enhanced Dream Optimization Algorithm (MSDOA), aiming to address challenges such as insufficient search capability, slow convergence, and susceptibility to local optima when intelligent optimization algorithms are applied to three-dimensional UAV path planning. Chang et al. [24] introduced an improved Nutcracker Optimization Algorithm to solve multi-objective constrained optimization problems. Bei et al. [25] developed a Hybrid Multi-Strategy Artificial Rabbit Optimization (HARO) method for efficient and stable UAV path planning in complex environments. Yuan et al. [26] proposed an improved Dynamic Window Approach (DWA), enhancing obstacle avoidance performance by adaptively refining and augmenting the evaluation functions. Gu et al. [27] applied Proximal Policy Optimization to refine the initial path generated by an improved fluid disturbance algorithm, achieving optimal decision-making. Wang et al. [28] proposed a multi-UAV end-to-end motion planning method based on chain training and a Proximal Policy Optimization algorithm incorporating heuristic information, demonstrating its effectiveness and superiority in multi-UAV formation navigation through forest environments. However, traditional path planning methods struggle to simultaneously achieve real-time performance, adaptability, path smoothness, and robustness in vehicle-following UAV tasks. 
This limitation has been a key reason for the recent adoption of deep reinforcement learning approaches, which can automatically learn strategies for tracking vehicles in dynamic and complex environments while simultaneously optimizing multiple objectives and ensuring robustness.

1.3. Research Content and Contributions

To address the key challenges in UAV path planning for vehicle-following tracking tasks, this paper employs the Proximal Policy Optimization (PPO) algorithm for path planning, aiming to improve path quality and planning efficiency. The main contributions of this study are as follows:

First, a motion model for UAV-following tracking is established. The dynamic constraints and environmental constraints are analyzed to provide theoretical support for path planning, enhance the search capability, and improve the ability to obtain globally optimal solutions.

Second, the effectiveness of the proposed algorithm is validated through simulation experiments. The performance is evaluated in terms of path length, smoothness, obstacle avoidance capability, and computational efficiency. Comparative analyses with existing mainstream algorithms are also conducted to demonstrate the advantages of the proposed method. In trajectory tracking tasks, PPO can optimize control policies in a continuous action space, enabling the UAV to gradually approach the optimal trajectory while avoiding policy oscillations and performance degradation, thereby achieving better training stability. This technology can be applied to UAV-assisted vehicle monitoring and dispatching, as well as rapid UAV-based tracking of vehicles for rescue or surveillance purposes.

2. Modeling of Path Planning for UAV-Tracking

2.1. Analysis of Commonly Used Path Planning Algorithms

2.1.1. Artificial Potential Field (APF)

Artificial Potential Field (APF) is a classical path planning method, first proposed by Khatib [29] in 1985, and has been widely applied in robot navigation and obstacle avoidance. The fundamental idea is to treat the target point as an “attractive source” and obstacles as “repulsive sources,” thereby constructing a virtual potential field. The UAV moves along the direction of decreasing potential in this field, achieving obstacle avoidance while reaching the target.

In UAV-following tracking scenarios, the APF method can guide the UAV to avoid obstacles during flight, maintain an appropriate distance from ground vehicles, and achieve efficient, real-time path tracking. Its advantages include a simple model, high computational efficiency suitable for real-time path planning, ease of integration with other methods (e.g., fuzzy control or optimization algorithms), and the ability to handle multiple targets and obstacles.

However, APF also has limitations: it is prone to getting trapped in local minima; the repulsive field near the target may counteract the attractive force, causing path oscillations or failure to reach the target; and it exhibits poor adaptability in dynamic environments, making it difficult to effectively handle moving obstacles.
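The attractive and repulsive contributions described above can be illustrated with a minimal sketch for a point UAV in the plane. It uses the common textbook form of the potential field, not the implementation evaluated in this study; the function name `apf_force`, the gains `k_att` and `k_rep`, and the influence radius `rho0` are assumptions introduced for illustration.

```python
import math

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=1.0, rho0=2.0):
    """Return the 2D force (negative potential gradient) acting at `pos`.

    Attractive potential: 0.5 * k_att * |pos - goal|^2
    Repulsive potential:  0.5 * k_rep * (1/rho - 1/rho0)^2 for rho < rho0
    The gains and the influence radius rho0 are illustrative assumptions.
    """
    # The attractive force pulls the UAV straight toward the goal.
    fx = -k_att * (pos[0] - goal[0])
    fy = -k_att * (pos[1] - goal[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        rho = math.hypot(dx, dy)  # distance to this obstacle
        if 0.0 < rho < rho0:
            # The repulsive force pushes away from the obstacle and
            # vanishes at the edge of the influence region (rho == rho0).
            mag = k_rep * (1.0 / rho - 1.0 / rho0) / rho**2
            fx += mag * dx / rho
            fy += mag * dy / rho
    return fx, fy
```

When the two contributions cancel at a point other than the goal, the returned force is zero and a gradient-following UAV stalls there, which is the local-minimum problem noted above.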

2.1.2. TD3 Algorithm

Twin Delayed Deep Deterministic Policy Gradient (TD3) [30] is an actor–critic-based reinforcement learning algorithm designed to address the overestimation of Q-values and training instability present in Deep Deterministic Policy Gradient (DDPG). The algorithm is particularly suitable for continuous action space problems, such as UAV path planning and target tracking tasks.

TD3 enhances performance and stability by introducing three key mechanisms:

Clipped Double Q-learning: Two independent critic networks are constructed, and the smaller of the two Q-values is used as the target, effectively mitigating overestimation bias.
Delayed Policy Update: The actor network is updated less frequently than the critic networks, improving training stability.
Target Policy Smoothing: Noise is added to the target action to enhance policy robustness and reduce sensitivity to local errors.

The advantages of TD3 include effectively suppressing Q-value overestimation, improving training stability and convergence speed, handling high-dimensional continuous action spaces, and generating smoother control outputs. However, TD3 also has several limitations: it contains multiple critical hyperparameters (e.g., learning rate, policy update delay, noise scale, soft update coefficient) that strongly affect algorithm performance, often requiring repeated tuning across different tasks or environments, which increases application difficulty. Additionally, as an off-policy algorithm based on experience replay, it relies heavily on large amounts of interaction data, leading to higher training costs in complex real-world environments.
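The three TD3 mechanisms can be made concrete with a small scalar sketch. It is not a full TD3 implementation and is not taken from this study: critics and actor are replaced by plain numbers, and the function names and default values (`sigma=0.2`, `noise_clip=0.5`, `policy_delay=2`) are assumptions chosen to mirror commonly used settings.

```python
import random

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Clipped double Q-learning: bootstrap from the smaller critic estimate."""
    return reward + gamma * (1.0 - float(done)) * min(q1_next, q2_next)

def smoothed_target_action(mu_next, sigma=0.2, noise_clip=0.5,
                           a_low=-1.0, a_high=1.0, rng=random):
    """Target policy smoothing: add clipped Gaussian noise to the target action."""
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, sigma)))
    # Keep the perturbed action inside the (assumed) action bounds.
    return max(a_low, min(a_high, mu_next + noise))

def should_update_actor(step, policy_delay=2):
    """Delayed policy update: refresh the actor every `policy_delay` critic steps."""
    return step % policy_delay == 0
```

For example, `td3_target(1.0, False, 5.0, 3.0, gamma=0.5)` bootstraps from the smaller estimate 3.0 rather than 5.0, which is how the overestimation bias is reduced.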

2.1.3. Q-Learning

Q-Learning [31] is a value-based, model-free reinforcement learning algorithm used to solve Markov Decision Processes (MDPs) and derive optimal policies. It does not rely on a prior model of the environment; instead, it learns a state–action value function (Q-value function) through interaction with the environment, which guides the agent to select the optimal action in each state.

Due to its simplicity, ease of implementation, and suitability for discrete action spaces, Q-Learning has been widely applied in mobile robot path planning, autonomous driving decision-making, and UAV navigation. Its advantages include not requiring a known environment model, making it suitable for unknown or partially known environments; theoretical maturity and ease of implementation; and the ability to effectively find optimal policies across various path planning scenarios.

However, Q-Learning has several limitations: it requires discrete state and action spaces, making it difficult to directly handle high-dimensional or continuous states; learning efficiency is low and convergence can be slow; in complex dynamic environments, balancing exploration and exploitation is challenging, which may lead to unstable policies; and it has high memory consumption, making scalability difficult in large state spaces.
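A minimal tabular sketch shows the update rule and why the state and action spaces must be discrete. The action set `ACTIONS`, the learning rate `alpha`, the discount `gamma`, and the exploration rate `epsilon` are illustrative assumptions, not values reported in this study.

```python
import random
from collections import defaultdict

# Hypothetical discrete action set for a grid-world style planner.
ACTIONS = ["north", "south", "east", "west"]

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_b Q(s', b)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

def epsilon_greedy(Q, s, actions, epsilon=0.1, rng=random):
    """Explore with probability epsilon, otherwise pick the greedy action."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda b: Q[(s, b)])

# The table holds one entry per (state, action) pair, which is why memory
# grows quickly when the state space is large or finely discretized.
Q = defaultdict(float)
```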

2.1.4. Proximal Policy Optimization

Proximal Policy Optimization (PPO) is one of the most popular and effective policy optimization (policy gradient) algorithms in the field of reinforcement learning, proposed by OpenAI [32] in 2017. PPO has demonstrated strong convergence, stability, and sample efficiency in tasks such as continuous control, robot navigation, and UAV trajectory planning, and has been widely applied to agent control problems in complex dynamic environments.

Its advantages include high training stability, preventing drastic policy changes; the ability to handle high-dimensional, continuous state and action spaces; support for asynchronous updates and large-scale parallel training; higher learning efficiency compared with traditional policy gradient methods; and ease of integration with neural networks, enabling adaptation to complex scenarios.

However, PPO also has limitations: careful design of the reward function is required, otherwise training may fail to converge; it requires a large number of interaction samples, imposing high computational demands; its adaptability to dynamically changing environments is relatively slow (often requiring retraining or fine-tuning); and hyperparameters (such as clipping coefficient and learning rate) have a significant impact on performance.

Considering these factors, this study aims to explore the application of PPO in UAV-following trajectory planning.

2.2. Structural Composition of the UAV-Tracking System

An unmanned aerial vehicle (UAV) is an aircraft capable of autonomous or remotely controlled flight through wireless communication equipment and a flight control system. With the rapid development of sensors, embedded systems, navigation and positioning technologies, and communication technologies, UAVs have been widely applied in various fields such as agricultural inspection, disaster monitoring, environmental surveillance, logistics transportation, and intelligent transportation systems. In the field of intelligent transportation, UAVs provide new technical approaches for tasks such as traffic monitoring and vehicle tracking due to their aerial perspective and high maneuverability.

A complete UAV system, as shown in Figure 1, mainly consists of a flight control system, navigation and positioning module, communication module, mission payload, and power system. The flight control system (Flight Controller) is the core component of the UAV, responsible for attitude control, flight stability maintenance, and command execution. Common flight control systems include Pixhawk and DJI A3. The navigation and positioning module obtains real-time position and attitude information through technologies such as the Global Positioning System (GPS) and Inertial Navigation System (INS), enabling high-precision localization and path tracking. The communication module is used for data exchange between the UAV and the ground station or controller, supporting real-time transmission of images, control commands, and status information. The mission payload consists of sensors such as cameras, LiDAR, and thermal imagers, which are carried according to application requirements to perform tasks like target detection and environmental perception. The power system provides energy for the UAV and includes components such as lithium batteries, electronic speed controllers (ESCs), propellers, and motors.

In vehicle-tracking tasks, a UAV serves multiple roles. First, as a dynamic path follower, it must continuously adjust its flight trajectory in real time according to changes in the ground vehicle’s position to achieve stable tracking. Second, as an environmental information collector, the UAV uses onboard cameras or other sensors to capture real-time information about the ground vehicle and its surrounding environment, thereby providing perceptual input for path planning. Third, as an intelligent decision-making executor, the UAV integrates path planning algorithms to autonomously generate flight routes, avoid obstacles, and achieve safe, smooth, and energy-efficient flight. To fulfill these functions, the UAV must not only possess fundamental flight control capabilities but also incorporate intelligent path planning algorithms, enabling adaptive flight in dynamic and complex environments.

The UAV-tracking task involves following a ground vehicle from the air using a specific strategy, maintaining an appropriate tracking distance and observation angle while planning a smooth, feasible, and collision-free three-dimensional flight path.

Within the overall system, the target ground vehicle moves along a predefined or dynamically generated two-dimensional path. The agent, namely the UAV, operates in three-dimensional space and continuously adjusts its position through a path planning strategy. The objective of the task is to minimize the distance error between the UAV and the vehicle, avoid obstacles, and satisfy the UAV’s dynamic flight constraints.

2.3. Modeling of the Tracking Scenario and Assumptions

MATLAB (R2023b) is used to generate the road network grid, with arterial roads specified. The ground vehicle is assumed to move on a two-dimensional plane (X–Y) along a known or predicted trajectory $P_{v}(t) = (x_{v}(t), y_{v}(t))$. The UAV operates in three-dimensional space, with its trajectory denoted as $P_{u}(t) = (x_{u}(t), y_{u}(t), z_{u}(t))$. Together, they form a dynamic tracking system, as illustrated in Figure 2. Under the constraints of flight safety and trajectory smoothness, the UAV is required to continuously adjust its position to maintain an ideal tracking distance $d_{ref}$ from the vehicle, thereby satisfying observation or surveillance requirements.

The modeling assumptions are as follows: the ground vehicle moves at a slowly varying speed with a continuous trajectory; the UAV is capable of obtaining the current or predicted position of the vehicle; there are no no-fly zones within the UAV’s operational airspace; and the UAV dynamics are simplified as a position-controllable model.

2.4. Modeling of Three-Dimensional Path Constraints

Constraint modeling is performed for the three-dimensional trajectory of a UAV tracking a ground vehicle, ensuring that the UAV maintains dynamic feasibility while preserving target visibility, communication reliability, and flight safety, as well as accounting for energy limitations and flight regulation constraints. These constraints include not only geometric safety constraints but also dynamic and perception-related constraints. They can be further categorized into hard constraints, which must be strictly satisfied, and soft constraints, which may be relaxed but are penalized within the cost function.

2.4.1. Basic Safety Distance Constraint

To avoid collisions with the vehicle and ground obstacles, a three-dimensional distance constraint is typically imposed:

$$\| \Delta p(t) \|_{2} \geq d_{\min} \tag{1}$$

Here, $d_{\min}$ represents the minimum safety distance, and $\Delta p(t)$ is taken to be the position of the UAV relative to the object to be avoided (for the vehicle, $\Delta p(t) = p_{u}(t) - p_{v}(t)$). If the dimensions of both the UAV and the vehicle (length, width, and height of the vehicle) are taken into account, a more accurate representation can be achieved using oriented cuboids or ellipsoids. For the ellipsoidal representation, let the safety ellipsoid around the vehicle be defined by a positive definite matrix $Q \succ 0$; then

$$(\Delta p)^{\top} Q^{-1} (\Delta p) \geq 1 \tag{2}$$
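The two safety tests in Equations (1) and (2) can be checked numerically as sketched below. The sketch assumes a diagonal matrix $Q = \mathrm{diag}(a^{2}, b^{2}, c^{2})$, that is, an axis-aligned ellipsoid with semi-axes $a$, $b$, $c$; the function names and this diagonal simplification are assumptions made for illustration.

```python
import math

def outside_safety_sphere(p_u, p_v, d_min):
    """Equation (1): the UAV is at least d_min away from the vehicle."""
    return math.dist(p_u, p_v) >= d_min

def outside_safety_ellipsoid(p_u, p_v, semi_axes):
    """Equation (2) for the assumed diagonal Q = diag(a^2, b^2, c^2).

    With a diagonal Q, dp^T Q^{-1} dp reduces to sum((dp_i / axis_i)^2).
    """
    value = sum(((u - v) / s) ** 2 for u, v, s in zip(p_u, p_v, semi_axes))
    return value >= 1.0
```

An ellipsoid elongated along the vehicle's length lets the UAV fly closer beside or above the vehicle than in front of it, which a single spherical radius cannot express.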

2.4.2. Horizontal and Vertical Tracking Error Constraints

To maintain measurement accuracy, the horizontal and vertical errors are typically constrained as follows:

$$e_{h}(t) \leq e_{h,\max}, \quad | z_{u}(t) - z_{des}(t) | \leq e_{z,\max} \tag{3}$$

Here, $e_{h}$ denotes the UAV’s tracking error in the horizontal plane, and $e_{h,\max}$ represents the allowable maximum horizontal error. $z_{u}(t)$ is the UAV’s actual flight altitude at time $t$, and $z_{des}(t)$ is the desired altitude, which may be fixed or adaptively adjusted according to the vehicle’s speed or terrain to ensure a proper viewing angle and maintain a safe distance. $e_{z,\max}$ denotes the allowable maximum vertical error.
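A brief sketch of how the bounds in Equation (3) might be evaluated follows. It assumes that the horizontal error $e_{h}$ is the planar Euclidean distance between the UAV and the vehicle, which is one possible reading of the definition; the function names are hypothetical.

```python
import math

def tracking_errors(p_u, p_v, z_des):
    """Return (e_h, e_z) for UAV position p_u=(x, y, z) and vehicle p_v=(x, y).

    Assumption: e_h is the planar distance between UAV and vehicle.
    """
    e_h = math.hypot(p_u[0] - p_v[0], p_u[1] - p_v[1])
    e_z = abs(p_u[2] - z_des)
    return e_h, e_z

def within_tracking_bounds(p_u, p_v, z_des, e_h_max, e_z_max):
    """Check both inequalities of Equation (3)."""
    e_h, e_z = tracking_errors(p_u, p_v, z_des)
    return e_h <= e_h_max and e_z <= e_z_max
```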

2.4.3. Dynamic and Kinematic Feasibility Constraints

The motion of a quadrotor UAV is affected by its tilt angle and thrust limitations.

The speed and acceleration constraints are given by:

$$\| v_{u}(t) \|_{2} \leq V_{\max}, \quad \| a_{u}(t) \|_{2} \leq A_{\max} \tag{4}$$

The tilt angle constraint is expressed as:

$$\phi(t) \leq \phi_{\max} \;\Rightarrow\; \| a_{lat}(t) \| \leq g \tan(\phi_{\max}) \tag{5}$$

where $V_{\max}$ is the maximum allowable flight speed, $A_{\max}$ is the maximum allowable acceleration, $\phi_{\max}$ is the maximum allowable tilt angle, $a_{lat}(t)$ is the lateral (horizontal) acceleration component, and $g$ is the gravitational acceleration.
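As a worked illustration of Equations (4) and (5), the sketch below converts a tilt limit into a lateral acceleration bound and tests a candidate velocity and acceleration against all three limits. The value of `G`, the function names, and the convention that the first two vector components are horizontal are assumptions.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)

def max_lateral_accel(phi_max_deg, g=G):
    """Equation (5): lateral acceleration allowed by the tilt limit."""
    return g * math.tan(math.radians(phi_max_deg))

def is_dynamically_feasible(v, a, v_max, a_max, phi_max_deg):
    """Check Equations (4) and (5) for 3D velocity v and acceleration a."""
    speed = math.sqrt(sum(c * c for c in v))
    accel = math.sqrt(sum(c * c for c in a))
    a_lat = math.hypot(a[0], a[1])  # horizontal components only
    return (speed <= v_max
            and accel <= a_max
            and a_lat <= max_lateral_accel(phi_max_deg))
```

For a tilt limit of 20 degrees the bound is about $9.81 \tan(20^{\circ}) \approx 3.57$ m/s², so the tilt limit, rather than $A_{\max}$, may be the binding constraint during sharp turns.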

2.4.4. Simplified Dynamic Model

To reduce the complexity of reinforcement learning training for UAVs and ensure the feasibility of the path planning task, this study adopts a simplified dynamic model. While satisfying motion constraints and obstacle avoidance requirements, the model appropriately simplifies the actual dynamic characteristics of the UAV, enabling faster training convergence and more efficient experimental validation.

Model assumptions:

The UAV is treated as a point mass, neglecting inertial coupling, air resistance, and nonlinear propulsion characteristics.
The UAV can instantaneously adjust its forward velocity $v_{forward}$, vertical velocity $v_{vertical}$, and yaw rate $\dot{\psi}$.
The state vector consists of altitude $z$, position $(x, y)$, and attitude angles $(\phi, \theta, \psi)$.
Flight constraints such as maximum acceleration, tilt angle limits, and delays caused by rotational inertia are not considered.
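Under these assumptions, one simulation step of the point-mass model could look like the sketch below. The forward-Euler discretization, the time step `dt`, and the omission of roll and pitch from the state are assumptions made to keep the example short; the paper does not specify the integration scheme.

```python
import math
from dataclasses import dataclass

@dataclass
class UavState:
    x: float
    y: float
    z: float
    psi: float  # yaw angle in radians; roll and pitch are omitted here

def step(state, v_forward, v_vertical, yaw_rate, dt=0.1):
    """Advance the point-mass model by one forward-Euler step.

    The commanded velocities take effect instantly, matching the
    assumption that the UAV adjusts them without delay.
    """
    psi = state.psi + yaw_rate * dt
    psi = math.atan2(math.sin(psi), math.cos(psi))  # wrap to (-pi, pi]
    return UavState(
        x=state.x + v_forward * math.cos(state.psi) * dt,
        y=state.y + v_forward * math.sin(state.psi) * dt,
        z=state.z + v_vertical * dt,
        psi=psi,
    )
```

In a reinforcement learning environment, the triple `(v_forward, v_vertical, yaw_rate)` would play the role of the continuous action and `step` the role of the transition function.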

2.4.5. Airspace Regulations and Altitude Constraints

The UAV must comply with regulatory altitude limits:

$$z_{\min} \leq z_{u}(t) \leq z_{\max} \tag{6}$$

No-fly zone constraints: if the set of no-fly zones is denoted as $F$, then

$$p_{u}(t) \notin F \tag{7}$$

where $z_{u}(t)$ is the UAV’s flight altitude at time $t$, $z_{\min}$ and $z_{\max}$ are the minimum and maximum permissible altitudes, and $p_{u}(t)$ is the UAV’s position vector in 3D space.
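A simple way to test Equations (6) and (7) is sketched below. It assumes that each no-fly zone in $F$ is an axis-aligned box given by two opposite corners; this representation and the function names are assumptions, and real zones may be arbitrary polygons or cylinders.

```python
def altitude_ok(z_u, z_min, z_max):
    """Equation (6): altitude within the permitted band."""
    return z_min <= z_u <= z_max

def in_box(p, box):
    """True if point p lies inside the box ((x0, y0, z0), (x1, y1, z1))."""
    lo, hi = box
    return all(l <= c <= h for c, l, h in zip(p, lo, hi))

def position_allowed(p_u, z_min, z_max, no_fly_boxes):
    """Equations (6) and (7): legal altitude and outside every no-fly box."""
    if not altitude_ok(p_u[2], z_min, z_max):
        return False
    return not any(in_box(p_u, box) for box in no_fly_boxes)
```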

2.4.6. Relative Motion and Prediction Constraints

Ground vehicles typically travel along a two-dimensional road following predefined speed and curvature. To improve tracking performance, short-term prediction of the vehicle’s trajectory, $p_{v}(t + \tau)$, should be performed. The corresponding constraint can then be expressed as a time-varying constraint on the predicted target trajectory:

$$\| p_{u}(t) - p_{v}(t + \tau) \|_{2} \geq d_{\min}(\tau), \quad \text{for } \tau \in [0, \tau_{lookahead}] \tag{8}$$

Here, $d_{\min}(\tau)$ may increase with the uncertainty of the prediction.
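Equation (8) can be approximated by sampling the lookahead window, as in the sketch below. The constant-velocity vehicle prediction, the linear growth of $d_{\min}(\tau)$, the ground plane at $z = 0$, and all parameter values are assumptions for illustration; the paper does not state which predictor is used.

```python
import math

def predict_vehicle(p_v, v_v, tau):
    """Constant-velocity prediction of the vehicle on the ground plane z = 0."""
    return (p_v[0] + v_v[0] * tau, p_v[1] + v_v[1] * tau, 0.0)

def d_min_of_tau(tau, d0=5.0, growth=0.5):
    """Safety distance that grows with prediction horizon (assumed linear)."""
    return d0 + growth * tau

def lookahead_safe(p_u, p_v, v_v, tau_lookahead=2.0, n_samples=5):
    """Check Equation (8) at n_samples + 1 evenly spaced values of tau."""
    for k in range(n_samples + 1):
        tau = tau_lookahead * k / n_samples
        gap = math.dist(p_u, predict_vehicle(p_v, v_v, tau))
        if gap < d_min_of_tau(tau):
            return False
    return True
```

Sampling only checks the constraint at discrete instants, so a finer grid or a closed-form minimum distance would be needed for a strict guarantee.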

2.4.7. Optimization and Constraint Relaxation

In engineering practice, certain constraints that can be temporarily violated are often modeled as soft constraints by introducing slack variables $s_{i} \geq 0$:

$$g_{i}(x, u) \leq s_{i}, \quad J \leftarrow J + w_{i} s_{i}^{2} \tag{9}$$

Here, $J$ is the objective function, $w_{i}$ is the weighting factor, and $s_{i}^{2}$ is a quadratic penalty term. A typical approach is to enforce critical constraints, such as collision avoidance or field-of-view requirements, as hard constraints, while less critical constraints, such as comfort or energy consumption, are treated as soft constraints with associated penalties.
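The slack formulation in Equation (9) has a simple closed form when each slack is chosen as small as possible: $s_{i} = \max(0, g_{i}(x, u))$. The sketch below uses that choice; the function names are hypothetical.

```python
def soft_penalty(g_values, weights):
    """Quadratic penalty with the smallest feasible slack s_i = max(0, g_i).

    A satisfied constraint (g_i <= 0) contributes nothing; a violated
    one is penalized in proportion to the squared violation.
    """
    return sum(w * max(0.0, g) ** 2 for g, w in zip(g_values, weights))

def penalized_cost(base_cost, g_values, weights):
    """Equation (9): add the weighted slack penalties to the objective J."""
    return base_cost + soft_penalty(g_values, weights)
```

For example, with $g = (-1, 2)$ and $w = (3, 0.5)$ only the second constraint is violated, and the penalty is $0.5 \cdot 2^{2} = 2$.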

3. Design of a PPO-Based Path Planning Method

3.1. Theoretical Foundations and Applicability Analysis of the Proximal Policy Optimization (PPO) Method

3.1.1. Algorithm Principle

Proximal Policy Optimization (PPO) is a class of reinforcement learning algorithms based on the policy gradient method, designed to improve sample efficiency and facilitate implementation while maintaining training stability. The key idea of PPO is to limit the magnitude of policy updates during each iteration (i.e., “proximal”), preventing the policy from being pushed too far in a single update, which could lead to performance collapse. PPO can be viewed as a practical adaptation of the earlier Trust Region Policy Optimization (TRPO) approach: it simplifies the constrained optimization problem while retaining the core principle of “do not move too far,” making it easier to implement and achieving strong performance in practice, as illustrated in Figure 3.

Let the current policy be $\pi_{\theta}$ and the old policy be $\pi_{\theta_{old}}$.

Probability ratio for each action:

$$r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{old}}(a_{t} \mid s_{t})} \tag{10}$$

Basic unconstrained policy gradient objective (using the advantage function $\hat{A}_{t}$):

$$L^{PG}(\theta) = \mathbb{E}_{t} \left[ r_{t}(\theta) \hat{A}_{t} \right] \tag{11}$$

Here, $a_{t}$ and $s_{t}$ denote the action and state at time step $t$, $\hat{A}_{t}$ is the estimated advantage function, and $r_{t}(\theta)$ measures how much the new policy differs from the old policy for the sampled action.

Clipped PPO Objective (PPO-Clip):

$$L^{CLIP}(\theta) = \mathbb{E}_{t} \left[ \min \left( r_{t}(\theta) \hat{A}_{t}, \; \mathrm{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{t} \right) \right] \tag{12}$$

Here, $\epsilon$ (typically 0.1–0.3) controls the allowable range of policy updates. The clipping operation removes the incentive to push $r_{t}(\theta)$ above $1 + \epsilon$ when the advantage is positive or below $1 - \epsilon$ when the advantage is negative, thereby avoiding “overly aggressive” updates that could destabilize training.
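The behavior of Equation (12) is easiest to see on individual samples. The sketch below evaluates the clipped term for scalar inputs and averages it over a batch. It uses plain Python rather than an autodiff framework, so it shows only the objective value and not the gradient step; the function names and the default `eps=0.2` are assumptions.

```python
import math

def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Batch average of Equation (12); the ratio is exp(logp_new - logp_old)."""
    terms = [
        clipped_term(math.exp(ln - lo), adv, eps)
        for ln, lo, adv in zip(logp_new, logp_old, advantages)
    ]
    return sum(terms) / len(terms)
```

With `eps=0.2`, a ratio of 1.5 and an advantage of 2.0 yield 2.4 instead of 3.0, so increasing the ratio further brings no gain. A ratio of 0.5 with the same positive advantage yields the unclipped value 1.0, because the minimum keeps the more pessimistic of the two terms.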

Entropy regularization and value function regression:

$$L(\theta) = - L^{CLIP}(\theta) + c_{v} L^{VF}(\theta) - c_{e} S[\pi_{\theta}] \tag{13}$$

where $L^{VF}$ is the mean squared error (MSE) loss for the value function, $S[\pi_{\theta}]$ is the policy entropy encouraging exploration, and $c_{v}$, $c_{e}$ are weighting coefficients.
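The sign conventions of Equation (13) can be checked with a short numerical sketch: the clipped objective and the entropy are maximized, so both enter the minimized loss with a negative sign, while the value error is added. The function name and the default coefficients `c_v=0.5` and `c_e=0.01` are assumptions, not values reported in this study.

```python
def ppo_total_loss(l_clip, values, returns, entropy, c_v=0.5, c_e=0.01):
    """Equation (13): loss = -L_CLIP + c_v * L_VF - c_e * S.

    L_VF is the mean squared error between predicted values and returns.
    """
    l_vf = sum((v - g) ** 2 for v, g in zip(values, returns)) / len(values)
    return -l_clip + c_v * l_vf - c_e * entropy
```

Raising `c_e` rewards higher-entropy (more exploratory) policies, while raising `c_v` puts more weight on fitting the value function.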

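Advantage Estimation (Generalized Advantage Estimation, GAE):

$$\hat{A}_{t} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \delta_{t+l}, \quad \delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t}) \tag{14}$$

In Equation (14), $r_{t}$ denotes the reward received at time step $t$ (not the probability ratio $r_{t}(\theta)$ of Equation (10)), $\gamma$ is the discount factor, and $V$ is the estimated value function. The GAE parameter $\lambda$ can be adjusted between 0 (low variance, high bias) and 1 (low bias, high variance) to balance bias and variance in the advantage estimates.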
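In practice the infinite sum in Equation (14) is truncated at the end of a collected rollout and evaluated with a backward recursion, $\hat{A}_{t} = \delta_{t} + \gamma \lambda \hat{A}_{t+1}$. The sketch below implements this recursion under two assumptions: the rollout contains no episode termination, and `last_value` is the value estimate of the state after the final step. The function name and default parameters are hypothetical.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Truncated GAE over one rollout, computed backward in time.

    rewards[t] and values[t] belong to step t; last_value bootstraps
    the value of the state that follows the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages
```

Setting `lam=0` reduces each advantage to the one-step temporal-difference residual $\delta_{t}$, while `lam=1` sums all discounted residuals up to the end of the rollout.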

3.1.2. Applicability Analysis

Reinforcement Learning (RL), as an autonomous learning approach based on interaction with the environment, has demonstrated significant advantages in UAV path planning. Among policy gradient algorithms, Proximal Policy Optimization (PPO) is currently one of the most widely applied methods. Its strong training stability and convergence performance make it highly suitable and practically valuable for UAV path planning in complex and dynamic environments.

From the algorithmic perspective, PPO is an on-policy deep reinforcement learning method. By introducing a clipped objective function through the probability ratio, it limits the magnitude of each policy update. This mechanism preserves exploration while preventing instability caused by overly large policy updates. The “trust region” principle effectively enhances the stability of the training process, allowing the algorithm to maintain reliable convergence even in high-dimensional, nonlinear, and continuous control scenarios.

For UAVs, tasks such as attitude control, velocity adjustment, and heading planning all involve continuous action spaces. Therefore, PPO naturally offers advantages in policy representation and optimization for these control problems.

From the perspective of task-feature compatibility, UAV path planning tasks typically involve continuous state spaces, nonlinear dynamic constraints, external disturbances (e.g., wind fields), and dynamic obstacles. Traditional search- or optimization-based algorithms, such as A*, RRT, and genetic algorithms, often struggle to adapt in real time to such high-dimensional and dynamic environments.

PPO, however, can continuously interact with a simulated environment to iteratively update its policy, enabling autonomous obstacle avoidance, target tracking, and multi-objective optimization, including energy efficiency. In continuous control tasks, PPO outputs actions via a Gaussian policy distribution, which facilitates smooth attitude and velocity control for UAVs, ensuring trajectory continuity and flight safety.

From the training and deployment perspective, PPO has a relatively simple structure with few hyperparameters, which makes it easier to implement and tune in engineering practice. Its sampling and update process can be easily parallelized, and it can leverage simulation platforms such as AirSim, Gazebo, or MATLAB to rapidly generate training samples, significantly reducing the cost of data collection. Moreover, during training, the reward function can be flexibly designed to integrate multiple performance metrics, including path length, obstacle clearance, energy consumption, and flight smoothness, allowing for controllable policy learning and balanced multi-objective optimization.

3.2. Algorithm Structure and Steps

(1) Interact with the environment using the current policy π_{θ} to collect a set of samples (s_{t}, a_{t}, r_{t}, s_{t + 1}) over several timesteps (typically multiple trajectories or a fixed-length time window).
(2) Compute the discounted returns, value estimates, and advantage estimates (commonly using Generalized Advantage Estimation, GAE).
(3) Treat these samples as “old policy data” and perform several mini-batch gradient ascent (or descent, depending on sign convention) steps to optimize the clipped objective L^{CLIP} while simultaneously updating the value function network.
(4) Repeat steps (1)–(3) until the training process converges.
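The reuse of one batch of collected samples over several mini-batch updates, as described in the third step, can be sketched as follows. This is an illustrative helper only: the function name, the batch size, and the number of epochs are assumptions, and the actual gradient step is left as a placeholder comment.

```python
import random


def minibatch_indices(n_samples, batch_size, n_epochs, seed=0):
    """Yield shuffled mini-batch index lists, passing over the same samples n_epochs times."""
    rng = random.Random(seed)
    for _ in range(n_epochs):
        order = list(range(n_samples))
        rng.shuffle(order)  # a new random order for every pass over the data
        for start in range(0, n_samples, batch_size):
            yield order[start:start + batch_size]


# Ten collected samples, mini-batches of four, three passes over the data.
for batch in minibatch_indices(n_samples=10, batch_size=4, n_epochs=3):
    # Placeholder: evaluate the loss on the samples in `batch` and apply one gradient step.
    print(batch)
```

Because the samples were generated by the policy before the update, the probability ratio in the clipped objective keeps each of these repeated steps close to the data-collecting policy.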

3.3. Algorithm Complexity and Convergence Analysis

In UAV path planning problems, the algorithm’s computational complexity and convergence behavior directly affect planning efficiency and the stability of task execution. Proximal Policy Optimization (PPO), as a deep reinforcement learning method based on policy gradients, introduces a clipped objective function that limits the size of each policy update, which enhances the stability of the training process and improves convergence speed.

3.3.1. Algorithm Complexity Analysis

The computational complexity of the PPO algorithm primarily arises from three components: forward propagation through the policy and value networks, gradient computation and parameter updates, and the sampling of interaction data from the environment. Let θ_{π} and θ_{v} denote the parameters of the policy and value networks, so that ∣ θ_{π} ∣ and ∣ θ_{v} ∣ are the corresponding parameter counts, and let N denote the number of sampled steps per iteration. Then the overall time complexity T per iteration can be approximated as:

T = O (N \cdot (∣ θ_{π} ∣ + ∣ θ_{v} ∣))

(15)

Compared with traditional value-based reinforcement learning methods, such as DQN, PPO does not require explicit Q-value estimation for each action. Instead, it generates actions directly from the probability policy, significantly reducing search complexity in high-dimensional continuous action spaces (e.g., 3D flight control for UAVs). Moreover, PPO employs a mini-batch multiple-update mechanism, allowing each update to make full use of sampled data and thereby improving sample efficiency.

In UAV path planning scenarios, the state space typically includes multidimensional features such as position, velocity, attitude angles, and environmental obstacle information, while the action space involves continuous variables such as thrust, yaw, and pitch. PPO can operate efficiently in such complex state–action spaces, with time complexity growing linearly with network size and sample steps, demonstrating good scalability.
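To give a feel for the quantities in Equation (15), the sketch below counts the parameters of two fully connected networks and multiplies by the number of sampled steps. The input dimension 10 and the action dimension 3 follow the state and action vectors defined in Section 4.1.2; the two hidden layers of 64 units and N = 2048 are assumed values for illustration, since the source does not report the layer widths or the rollout length.

```python
def mlp_param_count(layer_sizes):
    """Number of weights and biases in a fully connected network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def per_iteration_cost(n_steps, n_policy_params, n_value_params):
    """Operation-count proxy N * (|theta_pi| + |theta_v|) from Equation (15)."""
    return n_steps * (n_policy_params + n_value_params)


n_pi = mlp_param_count([10, 64, 64, 3])  # policy network (hidden widths assumed)
n_v = mlp_param_count([10, 64, 64, 1])   # value network (hidden widths assumed)
print(n_pi, n_v, per_iteration_cost(2048, n_pi, n_v))
```

Doubling either the rollout length or the total parameter count doubles this proxy, which is the linear growth referred to above.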

3.3.2. Convergence Analysis

The convergence of the PPO algorithm primarily depends on the design of its objective function. Unlike traditional policy gradient methods, PPO introduces a “trust region” constraint (i.e., the clipping term) to limit the magnitude of updates between the new and old policies, thereby preventing instability caused by excessively large policy updates. The objective function is formulated as:

L^{CLIP} (θ) = E_{t} [\min (r_{t} (θ) {\hat{A}}_{t}, clip (r_{t} (θ), 1 - ϵ, 1 + ϵ) {\hat{A}}_{t})]

(16)

This design constrains the update step size. Although clipping does not provide the formal monotonic-improvement guarantee of trust-region methods, it promotes steady policy improvement in practice and enhances the stability of convergence, as illustrated in Figure 4.

In UAV path planning tasks, the environment is highly nonlinear and dynamic, and traditional reinforcement learning algorithms often experience policy oscillations or become trapped in local optima during training. The PPO algorithm, through its clipping mechanism and advantage-function-based policy optimization, effectively mitigates these issues. The simulation results in Figure 4 indicate that an average return of 200 is reached at 1200 training episodes, with a smooth and stable convergence process.

4. Simulation and Results Analysis

4.1. Simulation Experiment Environment and Parameter Settings

4.1.1. Simulation Platform Construction

To validate the effectiveness of the UAV path planning method based on Proximal Policy Optimization (PPO), a simulation platform was established in the MATLAB R2023b Reinforcement Learning Toolbox environment. By integrating 3D spatial modeling with a reinforcement learning training framework, the platform enables UAV path planning and obstacle avoidance simulation in complex urban environments.

The simulation platform consists of the following modules:

Environment Module: The simulation scenario is constructed using MATLAB 3D graphics and dynamic modeling, including ground roads, building clusters, obstacles, and target points.
UAV Dynamic Module: The UAV is modeled using simplified six-degree-of-freedom dynamics. The control inputs are the desired velocity and heading angle.
Reinforcement Learning Training Module: PPO is employed to train the policy. Both the policy network and the value network adopt deep neural network architectures.
Path Evaluation Module: After training, the performance of the UAV is analyzed using metrics such as average reward, success rate, and trajectory smoothness.

4.1.2. State Space and Action Space Design

In the path planning task, the UAV must make continuous control decisions based on its own state and environmental information. The state vector is defined as:

s_{t} = [x_{u}, y_{u}, z_{u}, v_{x}, v_{y}, v_{z}, x_{g} - x_{u}, y_{g} - y_{u}, z_{g} - z_{u}, d_{obs}]

(17)

where (x_{u}, y_{u}, z_{u}) denotes the UAV’s position, (v_{x}, v_{y}, v_{z}) represents the velocity components, (x_{g}, y_{g}, z_{g}) are the coordinates of the target point, and d_{obs} is the distance from the UAV to the nearest obstacle. Since the vehicle moves in a two-dimensional plane, z_{g} is set to 0.
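The state vector of Equation (17) can be assembled as in the following sketch. It assumes, for illustration only, that obstacles are represented by points and that at least one obstacle is present; the function name `build_state` and the example coordinates are hypothetical.

```python
import math


def build_state(uav_pos, uav_vel, target_pos, obstacles):
    """Assemble the 10-dimensional state of Equation (17).

    obstacles: non-empty list of obstacle reference points (point obstacles are an assumption).
    """
    d_obs = min(math.dist(uav_pos, obs) for obs in obstacles)   # nearest-obstacle distance
    relative = [g - u for g, u in zip(target_pos, uav_pos)]     # target position relative to UAV
    return [*uav_pos, *uav_vel, *relative, d_obs]


# UAV at 10 m altitude, ground target with z_g = 0, one obstacle at the origin.
state = build_state((0.0, 0.0, 10.0), (1.0, 0.0, 0.0), (3.0, 4.0, 0.0), [(0.0, 0.0, 0.0)])
print(state)
```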

The action vector is defined as:

a_{t} = [v_{d}, \dot{ψ}, \dot{z}]

(18)

where v_{d}, \dot{ψ}, and \dot{z} denote the UAV’s desired forward velocity, yaw rate, and vertical speed, respectively. All action components are continuous control variables.
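One way to see how the action of Equation (18) drives the vehicle is to integrate it with a simple kinematic model. The sketch below is an assumption made for illustration and is not the dynamic model used in the simulations: it treats the forward velocity as acting along the current heading, restricts it to the range from 0 to the maximum speed, and uses the 0.05 s step size and 15 m/s speed limit listed in Table 1.

```python
import math


def step_kinematics(pos, psi, action, dt=0.05, v_max=15.0):
    """Advance position and heading by one time step (simple kinematic model, assumed)."""
    v_d, psi_dot, z_dot = action
    v_d = max(0.0, min(v_d, v_max))   # enforce the speed limit (no reverse flight assumed)
    psi_next = psi + psi_dot * dt     # integrate the yaw rate
    x = pos[0] + v_d * math.cos(psi_next) * dt
    y = pos[1] + v_d * math.sin(psi_next) * dt
    z = pos[2] + z_dot * dt           # integrate the vertical speed
    return (x, y, z), psi_next


# Fly straight along the x-axis at 10 m/s while climbing at 1 m/s.
print(step_kinematics((0.0, 0.0, 10.0), 0.0, (10.0, 0.0, 1.0)))
```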

The reward function is the core of PPO policy optimization. In this work, a reward structure is designed by comprehensively considering path length, safety, smoothness, and energy consumption:

R = w_{1} R_{goal} + w_{2} R_{safe} + w_{3} R_{smooth} + w_{4} R_{energy}

(19)

where R_{goal} = - ∥ p_{u} - p_{g} ∥ is the negative distance between the UAV and the target point; R_{safe} = - \max (0, d_{safe} - d_{obs}) penalizes proximity to obstacles; R_{smooth} = - ∥ Δ a_{t} ∥ encourages smooth changes in actions; and R_{energy} = - ∥ a_{t} ∥^{2} penalizes energy consumption.

Additionally, a positive reward + R_{success} is given when the UAV successfully reaches the target, and an immediate penalty - R_{collision} is applied if a collision occurs.
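The reward of Equation (19), together with the terminal bonus and penalty, can be written compactly as in the sketch below. All numerical defaults (the weights, the safety distance, and the magnitudes of the success reward and collision penalty) are placeholder assumptions, since the source does not list them; the function name is likewise illustrative.

```python
import math


def step_reward(p_u, p_g, d_obs, a_t, a_prev,
                w=(1.0, 1.0, 0.1, 0.01), d_safe=2.0,
                reached=False, collided=False,
                r_success=100.0, r_collision=100.0):
    """Weighted reward of Equation (19) plus terminal terms (all defaults are assumed)."""
    r_goal = -math.dist(p_u, p_g)                  # negative distance to the target
    r_safe = -max(0.0, d_safe - d_obs)             # penalty inside the safety distance
    r_smooth = -math.sqrt(sum((a - b) ** 2 for a, b in zip(a_t, a_prev)))  # -||delta a_t||
    r_energy = -sum(a ** 2 for a in a_t)           # -||a_t||^2
    total = w[0] * r_goal + w[1] * r_safe + w[2] * r_smooth + w[3] * r_energy
    if reached:
        total += r_success
    if collided:
        total -= r_collision
    return total


# Unit weights make the contribution of each term easy to read off.
print(step_reward((0, 0, 0), (3, 4, 0), 1.0, (1, 0, 0), (0, 0, 0), w=(1, 1, 1, 1)))
```

Setting the third weight to zero reproduces the configuration without the smoothness term that is examined in the ablation study of Section 4.2.6.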

4.1.3. Construction of the Actor–Critic Network

Both the policy network and the value network adopt a multi-layer feedforward deep neural network architecture to perform nonlinear approximation of the policy function and the state value function. The network consists of an input layer, multiple hidden layers, and an output layer. The hidden layers employ the ReLU activation function to enhance the model’s capability in representing complex dynamic environments.

In UAV path planning and target tracking tasks, the policy network typically outputs three-dimensional velocity or attitude control commands, such as velocity v, yaw rate ω, or three-axis acceleration, while the value network evaluates the impact of the current flight state on future task completion.

With this architecture, PPO can achieve stable convergence in continuous, high-dimensional, and nonlinear environments, and generate smooth and safe flight trajectories, making it well-suited for autonomous decision-making and control in complex dynamic scenarios. The Actor–Critic network structure is shown in Figure 5.
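A forward pass through such an Actor–Critic pair can be sketched in a few lines of plain Python. This is an illustrative sketch only: the two hidden layers of 64 units, the uniform weight initialization, and the fixed log standard deviation of the Gaussian policy are all assumptions, and the simulations themselves rely on the MATLAB Reinforcement Learning Toolbox rather than on this code.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def mlp_forward(x, layers):
    """ReLU on the hidden layers, linear output layer."""
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in linear(x, layer)]
    return linear(x, layers[-1])


def sample_action(mean, log_std, rng):
    """Draw one action from a diagonal Gaussian policy."""
    return [rng.gauss(m, math.exp(s)) for m, s in zip(mean, log_std)]


rng = random.Random(0)
actor_sizes = [10, 64, 64, 3]    # state dimension 10, action dimension 3
critic_sizes = [10, 64, 64, 1]   # scalar state value
actor = [init_layer(a, b, rng) for a, b in zip(actor_sizes, actor_sizes[1:])]
critic = [init_layer(a, b, rng) for a, b in zip(critic_sizes, critic_sizes[1:])]

state = [0.0] * 10
mean = mlp_forward(state, actor)                  # mean of the Gaussian policy
value = mlp_forward(state, critic)[0]             # state-value estimate
action = sample_action(mean, [-0.5] * 3, rng)     # stochastic continuous action
print(mean, value, action)
```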

4.2. Tracking Performance Evaluation in Typical Scenarios

To validate the performance of the PPO-based UAV path planning method in target tracking tasks, this section evaluates the UAV’s tracking accuracy, response speed, and stability through simulation experiments. The target vehicle moves along a predefined trajectory on a two-dimensional plane, while the UAV performs real-time tracking in three-dimensional space using a control policy trained by PPO.

4.2.1. Simulation Environment and Parameters

The simulation is implemented on the MATLAB platform, with the environmental and algorithm parameters listed in Tables 1–3.

4.2.2. Evaluation Metrics

(1): Tracking Error

e_{t} = \sqrt{(x_{uav} - x_{target})^{2} + (y_{uav} - y_{target})^{2} + (z_{uav} - z_{target})^{2}}

(20)

Tracking error is employed to quantify the spatial distance between the UAV and the target.
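Equation (20) is the Euclidean distance between the UAV and the target at each time step. A minimal sketch for evaluating it along two recorded trajectories is given below; the function name and the sample coordinates are illustrative and do not come from the experiments.

```python
import math


def tracking_errors(uav_traj, target_traj):
    """Per-step tracking error e_t of Equation (20) for two equally sampled trajectories."""
    return [math.dist(u, g) for u, g in zip(uav_traj, target_traj)]


# Two illustrative time steps with (x, y, z) coordinates in metres.
uav = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0)]
target = [(3.0, 4.0, 0.0), (0.0, 0.0, 0.0)]
print(tracking_errors(uav, target))
```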

(2): Convergence Stability

Convergence stability is determined by analyzing the variation trend in the average reward throughout the training process.

(3): Response Time

Response time is employed to quantify the tracking delay of the UAV during target turning and acceleration maneuvers.

4.2.3. Robustness Analysis

In path planning, UAVs often encounter complex environmental factors such as dynamic obstacles, wind disturbances, and sensor noise. To evaluate the robustness of the PPO algorithm, experiments were conducted under the same disturbance intensity and compared with the TD3, APF, and Q-learning algorithms. The results indicate that when a disturbance was introduced at 10 s, the PPO-controlled UAV exhibited a tracking error of approximately 0.2 m, which stabilized after 5 s. In contrast, the TD3, APF, and Q-learning algorithms experienced larger errors under the same disturbance, as shown in Figure 6.

The underlying reasons are as follows: the policy update constraint mechanism (Clip Range) prevents policy oscillations; the sampling-based stochastic gradient optimization of PPO provides strong generalization capability; and the penalty terms in the reward function effectively suppress high-risk action selections. A comprehensive analysis indicates that the PPO algorithm achieves both high-precision tracking performance and excellent robustness in UAV path planning tasks. In complex nonlinear scenarios, it can balance exploration and exploitation, ensuring smooth and safe flight trajectories, making it suitable for multi-UAV cooperative tracking, dynamic obstacle avoidance, and urban environment flight applications.

4.2.4. Adaptability to Continuous Action Spaces

In the process of UAV tracking of ground targets, the PPO algorithm enables the agent to gradually learn the dynamic patterns of vehicle trajectories through continuous interactive learning. Its core lies in leveraging the value function to estimate future rewards and generating continuous actions via the policy network to adjust the UAV’s flight path in real time. The comparison of tracking error curves is shown in Figure 7.

The simulation results in Figure 7 demonstrate that, within the MATLAB-based 3D simulation environment, the PPO algorithm enables the UAV to achieve high-precision tracking of the target trajectory after 10 s, with the average position error maintained within 0.1 m and the maximum deviation not exceeding 0.5 m. Compared with traditional methods such as APF and Q-learning, PPO exhibits faster convergence and smoother trajectory responses in nonlinear environments.

The high accuracy of the PPO algorithm can be attributed to the following factors: the coordinated optimization of the policy and value networks, which allows for dynamic capture of target motion patterns; the clipping term introduced in the loss function, which constrains policy updates and prevents destabilizing policy changes; and the multi-step return mechanism (GAE), which enhances temporal consistency in trajectory sampling.

In contrast, Q-learning relies on a discrete action space, resulting in limited action selection and an inability to achieve fine-grained control over continuous trajectories.

The convergence performance comparison of the four algorithms is presented in Table 4.

To reduce the impact of randomness on the experimental results, each experiment in this study is repeated 20 times independently. In each trial, the environmental parameters and the weights of the reinforcement learning policy network are randomly initialized. The final results are evaluated using statistical values obtained from multiple trials.

To comprehensively assess the performance of the algorithms, the following statistical metrics are adopted: mean error, standard deviation, and maximum error. The statistical results of the four algorithms are shown in Table 5.

From the standard deviation results, it can be seen that the PPO algorithm exhibits smaller fluctuations across various scenarios, indicating that its stability is superior to that of other algorithms.
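The three statistics can be computed from a set of error values as in the sketch below. Two points are assumptions: the source does not state whether the sample or the population standard deviation is reported (the sample form is used here), and the input values are made-up numbers for illustration, not results from this study.

```python
import statistics


def summarize_errors(errors):
    """Mean, sample standard deviation, and maximum of a list of tracking errors (m)."""
    return {
        "mean": statistics.mean(errors),
        "std": statistics.stdev(errors),  # sample standard deviation (assumed convention)
        "max": max(errors),
    }


# Made-up error values in metres, for illustration only.
example_errors = [1.0, 2.0, 3.0]
print(summarize_errors(example_errors))
```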

4.2.5. Stability of Policy Updates

To improve the training stability of policy gradient methods in continuous action spaces, this paper employs the PPO algorithm to optimize the UAV control policy. Unlike traditional policy gradient methods that perform large-step direct updates to the policy, PPO constrains the deviation between the new and old policies, thereby effectively avoiding policy collapse and performance oscillations. The comparison of reward curves is shown in Figure 8.

The simulation results in Figure 8 show that the training curve of PPO becomes stable after approximately 1000–1200 episodes, with small fluctuations in reward values. The TD3 algorithm stabilizes after around 1500 episodes but still exhibits noticeable fluctuations. In contrast, Q-learning converges after about 1800 episodes, with large oscillations, indicating its limited effectiveness in dynamic and continuous control tasks.

This can be explained as follows: PPO employs a clipped surrogate objective, which constrains the update step size during policy optimization and prevents instability caused by overly large updates, together with Generalized Advantage Estimation (GAE), which reduces the variance of the advantage estimates. TD3 alleviates the overestimation problem through the use of twin Q-networks and delayed updates, but fluctuations still exist. Q-learning, affected by the ε-greedy exploration strategy and limited by discrete action updates, is prone to oscillations during training.

4.2.6. Ablation Study on the Reward Function

In this ablation, the smoothness term is removed from the reward function and only the target tracking reward is retained, in order to observe the variation in trajectory oscillations. The trajectory comparison is shown in Figure 9.

The simulation results in Figure 9 indicate that after removing the smoothness term from the reward function, the UAV is still able to track the target vehicle; however, its flight trajectory exhibits noticeable oscillations, the control inputs fluctuate sharply, and trajectory continuity significantly decreases. In contrast, when the smoothness reward is included, the UAV’s path becomes smoother and the stability of the control sequence is markedly improved, demonstrating the critical role of the smoothness term in enhancing policy executability and flight stability.

4.2.7. Tracking Error Variations Across Different Scenarios

In UAV trajectory tracking tasks, tracking error is a core metric for evaluating algorithm performance. In this study, comparative experiments were conducted using the PPO algorithm, TD3 algorithm, APF, and Q-learning algorithm to analyze the variation in tracking error under different scenarios, including straight-line, curved, lane-changing, and obstacle environments. The results are shown in Figure 10.

Figure 10 illustrates the time-varying tracking error curves of four control algorithms under four typical driving scenarios: straight-line tracking, turning, lane changing, and obstacle avoidance. From the tracking error versus time curves, it can be observed that in the straight-line driving scenario, all algorithms are able to gradually converge to zero error, although their dynamic characteristics differ significantly. The traditional APF algorithm exhibits a large initial overshoot and oscillation, with a relatively slow convergence rate. The Q-learning algorithm shows a slightly reduced oscillation amplitude compared to APF but still presents noticeable fluctuations. In contrast, both the TD3 and PPO algorithms effectively reduce overshoot, resulting in smoother error curves.

In the turning scenario, the vehicle begins to enter the curve at approximately 4 s, causing a sudden change in tracking error. The APF algorithm produces the largest peak error and the longest-lasting oscillations. Q-learning shows relatively large errors in the early stage of turning, indicating insufficient adaptability. The TD3 algorithm gradually regains stability after the turn, although some oscillations remain. The PPO algorithm achieves the smallest error peak and the shortest recovery time.

During the lane-changing phase, the lateral displacement of the target vehicle changes rapidly, requiring the system to perform significant trajectory adjustments. Both the APF and Q-learning algorithms exhibit large error peaks and relatively long recovery times. The TD3 algorithm reduces the error magnitude to some extent but still shows a noticeable lag. In contrast, the PPO algorithm achieves the lowest error peak, with the smoothest variation in the error curve.

In the obstacle avoidance scenario, the introduction of obstacles requires the UAV to replan its path within a short time. The error curves exhibit clear non-stationary characteristics. The APF algorithm shows severe oscillations during obstacle avoidance, indicating poor robustness to sudden environmental changes. The Q-learning algorithm demonstrates some random fluctuations, reflecting insufficient stability. The TD3 algorithm gradually recovers after obstacle avoidance, but the error variations remain relatively large. The PPO algorithm shows the smoothest error variation and the fastest convergence speed.

5. Conclusions

This study investigates the problem of UAV path tracking and planning for ground vehicles in three-dimensional (3D) space using the deep reinforcement learning-based Proximal Policy Optimization (PPO) algorithm. By constructing a UAV-tracking dynamic model and a 3D spatial path constraint model, autonomous tracking and trajectory planning of the UAV in dynamic environments are achieved. The main research findings are summarized as follows:

3D Path Planning Model and Objective Design: Based on the kinematic characteristics of UAVs and ground vehicles, a 3D path planning model was developed that considers spatial coordinates, velocity, and attitude constraints. A well-designed objective function—including tracking error minimization, energy optimization, and safety distance constraints—ensures both the realism and operability of the model.
PPO-Based Adaptive Learning: By designing an appropriate state space, action space, and reward function, the PPO algorithm can achieve adaptive learning in complex environments. MATLAB simulation results demonstrate that PPO-based UAV path planning outperforms comparative algorithms such as Q-learning in terms of tracking accuracy, convergence speed, and robustness. In specific scenarios, the trajectory error of Q-learning is approximately 1 m, whereas PPO achieves an error of about 0.2 m, with faster and more stable error convergence within roughly 10 s. In comparison, the APF algorithm converges in about 15 s, and TD3 converges in 10 s but exhibits oscillations. Incorporating a smoothness reward further improves the UAV path smoothness, allowing the UAV to follow the vehicle trajectory stably, indicating the promising application potential of PPO in intelligent UAV control.
Enhanced Decision-Making and Path Optimization: The PPO-based UAV-tracking path planning method effectively enhances the UAV’s intelligent decision-making and path optimization capabilities, providing a new technical approach and research foundation for intelligent UAV traffic and cooperative control systems.

Despite achieving promising results in UAV-tracking path planning, several challenges remain to be addressed:

Limitations of PPO: While PPO exhibits strong stability and robustness in continuous control problems, certain limitations remain in UAV-tracking tasks. First, as an on-policy algorithm, its sample efficiency is relatively low, resulting in moderate training efficiency. Second, PPO relies primarily on local policy optimization, limiting its long-term path planning capability; in complex obstacle environments, the planned path may not be optimal. Moreover, control precision in continuous action spaces is limited, the algorithm is sensitive to reward function design, and generalization across multiple scenarios remains insufficient. These issues partially constrain the application of PPO in high-precision, real-time UAV tracking tasks.
Future Research Directions: Although PPO-based UAV-tracking path planning demonstrates strong performance in theory and simulation, there remains substantial scope for research in real-world validation, multi-UAV cooperation, and the integration of perception and decision-making. Future work could leverage hardware-in-the-loop simulation platforms or real UAV flight experiments to conduct engineering validation of the proposed PPO algorithm, further assessing its feasibility and reliability in real-world environments.

Author Contributions

Writing—original draft, D.Q.; Writing—review and editing, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.


Figure 1. UAV System Components.

Figure 2. Architecture of the Dynamic Tracking System.

Figure 3. Principle of the PPO Algorithm.

Figure 4. Convergence Curve of the PPO Algorithm.

Figure 5. Actor–Critic Network Structure.

Figure 6. Robustness Variation Curve.

Figure 7. Comparison of Tracking Error Curves.

Figure 8. Comparison of Reward Curves.

Figure 9. Trajectory Comparison Plot.

Figure 10. Tracking error variation under different scenarios.

Table 1. Environmental Parameters.

Parameter	Value	Description
Simulation Platform	MATLAB R2023b	Reinforcement Learning Toolbox
Simulation Step Size	0.05 s	Time discretization interval
Path Type	Straight + Turns	Simulation of road scenarios
UAV Maximum Speed	15 m/s	Flight safety constraints
Control Algorithm	PPO Algorithm	Based on Actor–Critic architecture
Reward Function	Tracking Error + Smooth Control Term	Ensures accuracy and stability

Table 2. Algorithm Parameters.

Algorithm	Learning Rate	Discount Factor	Network Structure
PPO	3 × 10⁻⁴	0.99	Policy Network + Value Network
Q-learning	0.1	0.99	Q-table
APF	-	-	Attractive/Repulsive Function
TD3	1 × 10⁻³	0.99	Policy Network + Value Network

Table 3. Algorithm Parameters.

Algorithm	Activation Function	Optimizer	Other Key Parameters
PPO	ReLU	Adam	Clipping coefficient ε = 0.2, GAE λ = 0.95
Q-learning	-	-	ε-Greedy Policy, Exploration Rate ε = 0.1
APF	-	-	Goal Attraction Coefficient, Obstacle Repulsion Coefficient
TD3	ReLU	Adam	Critic gradient clipping within [−1, 1]

Table 4. Convergence Performance of Algorithms.

Algorithm	Convergence Speed	Oscillation	Robustness
PPO	Fast	Very slight	Strong
Q-learning	Slow	Significant	Moderate
APF	Moderate	Noticeable	Weak
TD3	Fast	Significant	Moderate

Table 5. Statistical Results.

Algorithm	Mean Error (m)	Std (m)	Max Error (m)
PPO	0.45	0.80	5.00
TD3	0.55	0.90	5.00
Q-learning	1.00	1.25	5.60
APF	0.95	1.20	6.00


Share and Cite

MDPI and ACS Style

Qiao, D.; Zhang, H. Research on Proximal Policy Optimization Algorithm in Path Planning for UAV-Based Vehicle Tracking. Drones 2026, 10, 319. https://doi.org/10.3390/drones10050319

