UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments

Sheng, Yuanyuan; Liu, Huanyu; Li, Junbao; Han, Qi

doi:10.3390/drones8090516

Open AccessArticle

UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments

Faculty of Computing, Harbin Institute of Technology, Harbin 150080, China

^*

Author to whom correspondence should be addressed.

Drones 2024, 8(9), 516; https://doi.org/10.3390/drones8090516

Submission received: 17 August 2024 / Revised: 17 September 2024 / Accepted: 21 September 2024 / Published: 23 September 2024

Download

Browse Figures

Versions Notes

Abstract

Autonomous navigation of Unmanned Aerial Vehicles (UAVs) based on deep reinforcement learning (DRL) has made great progress. However, most studies assume relatively simple task scenarios and do not consider the impact of complex task scenarios on UAV flight performance. This paper proposes a DRL-based autonomous navigation algorithm for UAVs, which enables autonomous path planning for UAVs in high-density and highly dynamic environments. This algorithm proposes a state space representation method that contains position information and angle information by analyzing the impact of UAV position changes and angle changes on navigation performance in complex environments. In addition, a dynamic reward function is constructed based on a non-sparse reward function to balance the agent’s conservative behavior and exploratory behavior during the model training process. The results of multiple comparative experiments show that the proposed algorithm not only has the best autonomous navigation performance but also has the optimal flight efficiency in complex environments.

Keywords:

autonomous navigation; obstacle avoidance; deep reinforcement learning; unmanned aerial vehicles; dynamic rewards

1. Introduction

Thanks to the rapid development of control algorithms and computing platforms, research on unmanned aerial vehicles has made a series of progress [1,2,3]. Due to their high efficiency, portability, and economy, Unmanned Aerial Vehicles (UAVs) are widely used in many fields, such as remote rescue [4,5], logistics delivery [6,7], environmental monitoring [8,9], and smart agriculture [10,11]. In most tasks performed by UAVs, their core role can be defined as the mission of flying between two points, and during this process the UAV can achieve automatic navigation and obstacle avoidance [12]. However, in real-life scenarios, high-density and highly dynamic working environments such as woods and pedestrians pose great challenges to the autonomous planning and navigation of UAVs, which requires the UAV to be able to effectively identify obstacles and respond in a timely manner [13,14]. Therefore, how to improve the autonomous flight performance of UAVs in complex environments has become a core issue for scholars and industry.

UAVs usually need to choose a reasonable path to the target point by evaluating perceived environmental information. In recent years, the development of deep reinforcement learning (DRL) has provided new solutions and ideas for autonomous UAV navigation, which learns and optimizes strategies through continuous trial and error [15,16,17,18]. Many classic DRL algorithms have been applied in the field of autonomous navigation of unmanned systems and deployed on real machines [19,20,21,22]. In actual research, Guo et al. [23] divided the autonomous navigation of a UAV into two subtasks, navigation and obstacle avoidance, and proposed a targeted distributed DRL framework. This method can achieve the navigation task of a UAV in a highly dynamic environment without prior knowledge. Liu et al. [24] proposed a hierarchical network architecture to improve the algorithm performance by introducing a temporal attention mechanism and verified it in real and simulated scenarios, but the test environment of the algorithm is relatively simple. In order to improve the interpretability of the model, Shao et al. [25] proposed a UAV collision avoidance scheme based on DRL, which improved the training speed of the model by introducing the idea of curriculum learning to achieve autonomous flight in a restricted closed environment. In response to the distribution differences between the simulated training environment and actual application environment of the DRL algorithm, reference [26] built a random environment generation platform to train the DRL model, and improved the stability of the model deployed on the real machine through dynamic training. Reference [27] uses light detection and ranging (LiDAR) measurement data as input and proposes an end-to-end autonomous navigation algorithm, which uses a fast stochastic gradient algorithm to solve the constraint model and realize autonomous navigation in complex 3D unknown environments. To improve the perception and tracking accuracy of UAVs, reference [28] combines the improved differential amplification method with the DRL algorithm. Compared with the traditional motion control method, the proposed DRL algorithm has stronger flexibility and better performance.

Although the effectiveness of DRL-based UAV autonomous navigation solutions has been proven by many studies, most algorithms are only tested in simulation environments with few obstacles and simple scenes, making it difficult to evaluate their application effects in complex environments with high-density and highly dynamic obstacles. In addition, there are several issues in current research that need further study: (1) Sparse rewards: The agent will only be rewarded when it reaches the goal point during model training. This typically makes it more difficult for the agent to effectively learn the correct behavior strategy during the exploration process, which causes the trained model to converge slowly or not at all [29,30]; (2) Exploration and utilization: The exploration of an agent means trying new behaviors to discover possible high-reward states, while utilization means selecting the current optimal behavior based on known information, which corresponds to the conservative and radical behavior of the UAV [31,32]; (3) State space: The choice of state space is crucial to the DRL model. Most agents only make decisions based on the information at the current moment, ignoring the dynamic changing trends of the agent (such as the translation and rotation of the UAV) [12,33].

Researchers have conducted targeted research on the above problems. Wang et al. [34] generated actions for the agent to interact with the environment by combining current strategies and prior strategies, and gradually improved the agent performance under sparse rewards by setting dynamic learning goals. However, higher complexity and constraints limit the further promotion of this algorithm. Wang et al. [33] built a two-stream Q network to process the temporal information and spatial information extracted from the UAV’s current moment state and the previous moment state, respectively. This algorithm provides a new idea for the selection of state space, but when using observation changes, it only considers the position changes caused by the translation of the UAV and ignores the angle changes caused by its rotation. Based on this, Zhang et al. [35] developed a two-stream Actor-Critic architecture by combining a two-stream network and TD3, and constructed a non-sparse reward function that can balance the exploration behavior and radical behavior of UAVs. This algorithm also ignores the angle information of the UAV, and essentially its reward function remains fixed during the entire training process.

Based on the above analysis, this paper proposes an improved UAV autonomous navigation algorithm based on SAC in view of the current status of UAV autonomous navigation based on DRL in complex environments. By analyzing the continuous maneuvering mechanism of UAVs in complex environments, a more reasonable state space is constructed; the comprehensive performance of the intelligent agent is improved by designing a dynamic reward function. Based on the proposed method, this paper realizes autonomous path planning of UAVs in high-density and highly dynamic environments. The main contributions are as follows:

(1): The impact of UAV position changes and angle changes on navigation performance in complex scenarios was analyzed, and angle change information was introduced as input to the DRL model to expand ideas for research on LiDAR-based navigation algorithms;
(2): A dynamic reward function is constructed based on a non-sparse reward function to balance the conservative behavior and exploratory behavior of the agent during the model training process, and improve the flight efficiency while improving the success rate of UAV navigation;
(3): Flight simulation scenarios with high-density and highly dynamic obstacles were constructed, respectively, to verify the effectiveness and reliability of the proposed navigation algorithm.

The structure of subsequent chapters of this article is as follows: Section 2 introduces the theoretical background related to DRL. Section 3 formulates the DRL-based UAV navigation problem. Section 4 describes the algorithm proposed in this paper. Section 5 presents the experimental details and discusses the results. The last section summarizes this article and gives future work plans.

2. Background

2.1. Markov Decision Process (MDP)

MDP is a model decision framework based on Markov properties, consisting of five elements

〈 S, A, P, R, γ 〉

, where

S

represents the set of various states of the agent;

A

represents the set of actions that the agent can choose;

P (s^{'} | s, a)

describes the transition probability of the state;

R (s, a)

is used to evaluate the reward obtained by the agent for taking a certain action;

γ \in [0, 1]

is used to measure the importance of future rewards.

In MDP, policy refers to the mapping from states and actions to action selection probabilities, which can be expressed as

π (a | s) = P (A_{t} = a | S_{t} = s)

. It is essentially the probability distribution of selecting action a when given state

s

. The policy

π (a | s)

can be evaluated by the state value function

V^{π} (s)

and the state-action value function

Q^{π} (s, a)

, where

V^{π} (s)

is used to determine the future expected return that the agent will obtain starting from

s

under the policy

π

, which is expressed as:

V^{π} (s) = E^{π} [G_{t} | S_{t} = s]

(1)

where

G_{t}

represents the sum of all discount rewards from time

t

onwards, which is expressed as:

G_{t} = R_{t} + γ R_{t + 1} + γ^{2} R_{t + 2} + \dots = \sum_{k = 0}^{\infty} γ^{k} R_{t + k}

(2)

Therefore, Equation (1) can be expressed as:

V^{π} (s) = E^{π} (\sum_{k = 0}^{\infty} γ^{k} R_{t + k} | S_{t} = s)

(3)

The state-action value function

Q^{π} (s, a)

is used to indicate the future expected reward obtained after executing action

a

in state

s

when following policy π, which is expressed as:

Q^{π} (s, a) = E^{π} [G_{t} | S_{t} = s, A_{t} = a]

(4)

Similarly, Equation (4) can be further expressed as:

Q^{π} (s, a) = E^{π} (\sum_{k = 0}^{\infty} γ^{k} R_{t + k} | S_{t} = s, A_{t} = a)

(5)

The optimal value in the above equation represents the best performance of the agent in MDP problems.

2.2. SAC

The SAC algorithm combines the actor-critic method and the concept of entropy in RL. The core idea is to maximize the expected return while maintaining the randomness of the strategy, as follows:

π^{*} = \arg \max_{π} \sum_{t} E_{(s_{t}, a_{t}) \sim ρ_{π}} [r (s_{t}, a_{t}) + α H (π (\cdot | s_{t}))]

(6)

where

α

is the temperature parameter, which determines the relative importance of the entropy term relative to the reward, thus controlling the randomness of the optimal strategy.

In the maximum entropy RL framework, soft Q-value calculation iterates the mapping from state

S

to action

A

to reward

R

, and repeatedly corrects the Bellman operator. The Bellman equation and state value function can be further described as:

Q (s_{t}, a_{t}) ≜ r (s_{t}, a_{t}) + γ E_{s_{t + 1} \sim ρ} [V (s_{t + 1})]

(7)

V (s_{t}) = E_{a_{t} \sim π} [Q (s_{t}, a_{t}) - α \log π (a_{t} | s_{t})]

(8)

Then, the strategy can be improved according to the following formula:

π_{n e w} = \arg \min_{π^{'}} D_{K L} (π^{'} (\cdot | s_{t}), \frac{\exp (\frac{1}{α} Q^{π_{o l d}} (s_{t}, \cdot))}{Z^{π_{o l d}} (s_{t}, \cdot)})

(9)

where

D_{K L}

represents Kullback–Leibler (KL) divergence. The above formula can be used to improve the strategy in discrete space. In continuous spaces, it is necessary to approximate such iterations through parameterized functions and strategies. The detailed theory on this part can be found in [36].

3. Problem Formulation

3.1. Scenario Description

The tasks of UAVs are complex and diverse, especially in high-density and highly dynamic environments, which require excellent navigational performance. In order to facilitate modeling analysis, this article assumes that the UAV flies at a fixed altitude. As shown in Figure 1, the flight mission of the UAV can be represented as point-to-point flight, and the UAV needs to automatically identify and avoid obstacles during this process. Furthermore, we hope that the UAV will be able to finish the flight mission in the shortest possible path.

This paper generates UAV flight decisions based on the DRL algorithm to achieve autonomous flight and obstacle avoidance. The UAV senses environmental information based on LiDAR during flight. The current status information of the UAV is composed of environmental information, flight information, and target information. It is used as the input of the DRL model to obtain the next flight control command of the UAV, and the UAV will obtain new status information and rewards after executing the command. By constantly interacting with the environment, the UAV can continuously optimize its behavior and improve its ability to navigate and avoid obstacles autonomously.

3.2. Environment Perception Model

The airborne LiDAR model used in this article is shown in Figure 2a. The UAV measures the distance of surrounding obstacles through laser beams [37]. The field of view of LiDAR is

4 π / 3

and the angular resolution is

π / 18

. It can measure 25 sets of data in different directions at the same time. The measurement data at different times can be expressed as:

ο_{t} = [d_{1}^{t}, \dots d_{i}^{t}, \dots, d_{25}^{t}]

(10)

where

0.14 \leq d_{i} \leq 3.5

represents the measurement range of LiDAR;

θ = [θ_{1}, \dots θ_{i}, \dots, θ_{25}]

represents the angle corresponding to each laser beam. As mentioned in Section 3.1, in order to realize autonomous flight in complex environments, in addition to observation information, information about the UAV itself and target information are also needed. The UAV information available in this article includes linear velocity

v_{l i n}

, yaw angular velocity

v_{y a w}

, linear acceleration

a_{l i n}

, and yaw angular acceleration

a_{y a w}

. In addition, to ensure that the UAV moves toward the target point, we use relative coordinates to represent the positional relationship between the UAV and the target point. As shown in Figure 2b,

d_{u t}

represents the relative distance between the UAV and the target point,

α

represents the yaw angle, and

β

represents the angle between the UAV’s first perspective and the line connecting the UAV and the target point. Therefore, the state of the UAV can be described as:

s_{u} = [v_{l i n}, v_{y a w}, a_{l i n}, a_{y a w}, d_{u t}, β]

(11)

As mentioned before, in order to obtain richer environmental information, references [33,35] constructed observation information containing static and dynamic changes in the environment based on LiDAR measurement data. The static information

ο_{t}

represents the observation data at the current moment, and the dynamic information

Δ ο_{t}

can be expressed as:

\begin{matrix} Δ ο_{t} & = ο_{t} - ο_{t - 1} \\ = [d_{1}^{t} - d_{1}^{t - 1}, \dots, d_{i}^{t} - d_{i}^{t - 1}, \dots, d_{25}^{t} - d_{25}^{t - 1}] \end{matrix}

(12)

Dynamic information can provide the movement trend of the UAV relative to obstacles and help optimize the flight decision of the UAV.

3.3. DRL-Based Navigation Model

The UAV autonomous navigation model based on DRL mainly involves the UAV’s state space, action space, reward function, and network structure and other elements.

State space s is the basis for the DRL model to make UAV flight decisions, and it is crucial for UAV navigation. As mentioned in Section 3.2, the state space mainly includes environmental information and the UAV’s own information, which can be expressed as:

\begin{matrix} s_{t} & = [ο_{t}, s_{u}^{t}] \\ = [d_{1}^{t}, \dots d_{i}^{t}, \dots, d_{25}^{t}, v_{l i n}^{t}, v_{y a w}^{t}, a_{l i n}^{t}, a_{y a w}^{t}, d_{u t}^{t}, β^{t}] \end{matrix}

(13)

According to Equation (13), the dimension of the UAV’s state space s is 31. In addition, in order to accelerate the convergence of the method, the state space s will be normalized before inputting to the model.

The action space a is made up of its linear velocity

v_{l i n}

and yaw angular velocity

v_{y a w}

, since it flies at a predetermined height. In addition, to ensure the stability of the UAV, the actual flight speed of the UAV will be limited as follows:

v_{l i n} \in [0, 0.5], v_{y a w} \in [- 1, 1]

(14)

After obtaining the action command, the UAV updates its attitude and position by adjusting the motor speed to execute the command.

The main function of the reward function is to convey the goal to the agent, that is, to maximize the expected probability value of the cumulative sum of reward signals received by the agent. The agent will receive a corresponding reward after performing an action, so the design of the reward function directly determines the performance evaluation of the agent on a specific action, which in turn affects the performance and convergence speed of the DRL method. The reward function described in this article is as follows [26]:

\begin{matrix} R & = r_{d i s} + r_{a r r} + r_{c r a} \\ + r_{l a s e r} + r_{s t e p} + r_{l i n} + r_{y a w} \end{matrix}

(15)

where

r_{d i s}

represents the distance reward,

r_{a r r}

represents the reward for reaching the target,

r_{c r a}

represents the reward for hitting the obstacle,

r_{l a s e r}

represents the free space reward,

r_{s t e p}

represents the step reward,

r_{l i n}

and

r_{y a w}

represent the linear acceleration and yaw angular acceleration rewards, respectively, and their specific expressions are as follows:

r_{d i s} = ε_{1} (d_{u t}^{t} - d_{u t}^{t - 1})

(16)

r_{a r r} = {\begin{matrix} 100, if d_{u t} \leq 0.4 \\ 0, else \end{matrix}

(17)

r_{c r a} = ε_{2} {\begin{matrix} - 10, 0.5 \leq \min (ο) < 0.7 \\ - 25, 0.25 \leq \min (ο) < 0.5 \\ - 100, \min (ο) < 0.25 \end{matrix}

(18)

r_{l a s e r} = - {\sum_{i} (1 - d_{i} / \max (ο))}^{4}

(19)

r_{s t e p} = - ε_{3} * N_{s t e p}

(20)

r_{l i n} = - ε_{4} * | a_{l i n} |

(21)

r_{y a w} = - ε_{5} * | a_{y a w} |

(22)

where

N_{s t e p}

represents the number of steps performed by the drone, and

ε_{1}

to

ε_{5}

usually represent fixed constants.

4. Proposed Approach

4.1. State Space Representation Method

As mentioned in Section 3, the state space is crucial to the performance of DRL models. In autonomous UAV navigation tasks based on LiDAR, most studies directly use the LiDAR-measured data shown in Equation (10) as observation information. Reference [33] proves that the flight performance of the UAV is further improved after adding dynamic information as observation information, but the impact of changes in the angle of the UAV is not considered.

As shown in Figure 3a,b, LiDAR can obtain a set of measurement data at time t, and the UAV can estimate the distance and distribution of surrounding obstacles based on

ο_{t}

. Compared with time t-1, the position of the UAV relative to the obstacle has changed, and the change value

Δ ο_{t}

of the measurement data at adjacent moments can provide the movement trend of the UAV relative to the obstacle. As shown in Figure 3c,d, the attitude changes of the UAV during actual flight include not only changes in position but also changes in flight angle.

Based on the above analysis, this paper proposes a state space representation method that considers angle change information. First, add the angle information corresponding to the laser beam when obtaining the LiDAR measurement data

ο_{t}

, as shown below:

ο_{t} = [d_{1}^{t}, θ_{1}, \dots, d_{i}^{t}, θ_{i}, \dots, d_{25}^{t}, θ_{25}, α^{t}]

(23)

Furthermore, according to Equation (12), the dynamic information

Δ ο_{t}

at adjacent moments after adding angle information can be expressed as:

\begin{matrix} Δ ο_{t} & = ο_{t} - ο_{t - 1} \\ = [d_{1}^{t} - d_{1}^{t - 1}, 0, \dots, d_{i}^{t} - d_{i}^{t - 1}, 0, \dots, d_{25}^{t} - d_{25}^{t - 1}, 0, α^{t} - α^{t - 1}] \end{matrix}

(24)

Since the corresponding angle

θ_{i}

of the laser beam is fixed, Equation (24) can be further simplified as follows:

\begin{matrix} Δ ο_{t} & = [d_{1}^{t} - d_{1}^{t - 1}, \dots, d_{i}^{t} - d_{i}^{t - 1}, \dots, d_{25}^{t} - d_{25}^{t - 1}, α^{t} - α^{t - 1}] \\ = [Δ d_{1}^{t}, \dots, Δ d_{i}^{t}, \dots, Δ d_{25}^{t}, Δ α^{t}] \end{matrix}

(25)

where

Δ α^{t}

represents the change value of yaw angle. As shown in Figure 4, when the obstacles in the flight environment are stationary or the density is relatively sparse, the impact of changes in flight angle is not obvious. However, when the UAV flies in a highly dynamic or high-density complex environment, the angle change of the UAV can be expressed as:

Δ α^{t} = v_{y a w}^{t - 1} \cdot Δ t + 0.5 a_{y a w}^{t - 1} \cdot Δ t^{2}

(26)

The above formula shows that when the time interval is

Δ t

, the yaw angle of the UAV changes, and the impact of this change on the UAV’s flight decision-making in a complex environment cannot be ignored.

Based on the above analysis, the DRL model state space determined in this article is as follows:

s_{t} = [ο_{t}, Δ ο_{t}, s_{u}^{t}]

(27)

According to the Equations (14) and (27), the dimension of the state space in this paper is 83, the action space dimension is 2, and the network architecture adopted by the algorithm is shown in Figure 5.

4.2. Dynamic Reward Function

During the training process of the agent, how to balance the conservative and radical behavior of the agent is the focus that needs to be paid attention to. For UAVs, conservative means that the UAV is as safe as possible to avoid suffering larger penalties, while radical means that the UAV is more inclined to receive larger rewards and perform risky behaviors. If the agent obtains a larger reward after executing action a, it means that the action is more conducive to achieving the goal. Therefore, when the strategy is updated, the agent is more likely to subsequently choose to perform this action.

The non-sparse reward function designed in the current study usually remains unchanged during the training process of the agent. In fact, we hope that in the early stages of model training, the agent can be more radical to explore more possibilities. With the training of the model, the agent gradually tends to be conservative after it has the basic ability to complete tasks, thereby encouraging the agent to improve its own performance on the basis of ensuring its own safety.

Based on the above analysis, this paper proposes a dynamically changing reward function. By adjusting the distance reward

r_{d i s}

, step reward

r_{s t e p}

, and collision reward

r_{c r a}

, the agent is given gradually changing guidance at different stages of training to improve its overall performance. They can be expressed as:

\begin{array}{l} r_{d i s} = ε_{1} (d_{u t}^{t} - d_{u t}^{t - 1}) \\ s . t . ε_{1} = {\begin{matrix} r_{1} + \frac{E_{s t e p}}{2}, E_{s t e p} \leq 0.5 E_{\max} \\ r_{2}, E_{s t e p} > 0.5 E_{\max} \end{matrix} \end{array}

(28)

\begin{array}{l} r_{c r a} = ε_{2} {\begin{matrix} - 10, 0.5 \leq \min (ο) < 0.7 \\ - 25, 0.25 \leq \min (ο) < 0.5 \\ - 100, \min (ο) < 0.25 \end{matrix} \\ s . t . ε_{2} = {\begin{matrix} 0.5 + \frac{E_{s t e p}}{E_{\max}}, E_{s t e p} \leq 0.5 E_{\max} \\ 1, E_{s t e p} > 0.5 E_{\max} \end{matrix} \end{array}

(29)

\begin{array}{l} r_{s t e p} = - ε_{3} * N_{s t e p} \\ s . t . ε_{3} = {\begin{matrix} 1 - \frac{E_{s t e p}}{E_{\max}}, E_{s t e p} \leq 0.5 E_{\max} \\ 0.5, E_{s t e p} > 0.5 E_{\max} \end{matrix} \end{array}

(30)

where

E_{s t e p}

represents the current number of training rounds;

E_{\max}

represents the set maximum number of rounds;

r_{1}

and

r_{2}

are constants set according to the task.

By changing the parameters

ε_{1}

,

ε_{2}

, and

ε_{3}

, the reward function is dynamically adjusted in different training stages. In the initial stage of training, the agent is encouraged to explore the environment by giving smaller distance rewards and collision penalties. As training proceeds, the distance reward and collision penalty are increased while the step penalty is decreased to encourage the agent to stay away from obstacles.

5. Experiment and Discussion

5.1. Experimental Settings

In order to verify the algorithm proposed in this article, simulation scenarios of high-density and highly dynamic obstacles were built based on gazebo. The training scene for algorithms is shown in Figure 6, where there are 75 cylindrical obstacles with a radius of 0.15 and 75 cuboid obstacles with a side length of 0.2 in a 20 × 20 rectangular field. In each episode training, the coordinates of all obstacles are randomly generated and the target point is randomly selected among [0, 7], [0, −7], [7, 0], and [−7, 0]. The flight mission of the UAV is defined as flying from the origin to the destination within a specified number of steps, and the criteria for the end of each episode include: (1) the UAV successfully reaches the destination; (2) the UAV collides with an obstacle; (3) the UAV neither reaches the target nor collides within the maximum number of steps.

In order to verify the advantages of the method proposed in this article, DDPG, TD3, and SAC are used for comparison. All three algorithms use fixed reward functions and traditional state space expressions. In addition, in order to objectively evaluate the performance of the algorithm, we choose the following five quantitative indicators to measure the completion of the task.

(1): Success rate $R_{s}$ : the proportion of the number of times the UAV completes the navigation task to the total number of trials;
(2): Collision rate $R_{c}$ : the proportion of the number of times the UAV encountered obstacles to the total number of trials;
(3): Loss rate $R_{l}$ : the proportion of the number of times the UAV neither reaches the target point nor collides within the specified number of steps to the total number of trials;
(4): Average flight distance $L_{d i s}$ : the average distance flown by the UAV when it successfully completes its mission;
(5): Average number of flight steps $N_{s t e p}$ : the average number of steps required for the UAV to successfully complete the mission.

Among the above indicators, the larger

R_{s}

is, the better the autonomous navigation performance of the UAV is. The smaller

L_{d i s}

and are, the higher the flight efficiency of the UAV is.

5.2. Training Results

The proposed method and the three comparison algorithms are deployed on the hardware platform of Intel Core i5-11400F CPU @2.6 GHz × 12 and NVIDIA GTX 1650. For objective comparison, the training settings of the four algorithms are consistent, with the replay buffer size of 10,000, the network discount rate of 0.99, the batch size of 512, and the same network structure. In addition, the parameters of the reward function in Section 4.2 are set to

r_{1} = 200

,

r_{2} = 300

, and

E_{\max} = 400

, respectively. The results of the cumulative rewards of different algorithms changing with the number of training episodes are shown in Figure 7. All data in the figure have been smoothed.

According to Figure 7, the proposed algorithm has faster convergence speed and higher reward than the other three algorithms, among which our algorithm starts to converge around the 150th episode, SAC and DDPG start to converge around the 200th episode, while TD3 starts to converge around the 300th episode and has the lowest reward. Therefore, the training results show that the proposed algorithm has better convergence.

5.3. Experiment I: High-Density Scene Verification

In order to verify the effect of the proposed algorithm, the DRL model trained in Section 5.2 was tested in simulation scenarios with different numbers of obstacles. The test scene is shown in Figure 8, where

N_{o b s}

represents the number of obstacles, and ρ represents the obstacle density, which is used to measure the density of obstacles present per unit area.

The UAV based on the four algorithms performed 100 navigation missions in five scenarios with different obstacle densities. Similar to the training process, the location of the obstacles was re-randomly generated and the target point was re-selected after each flight mission.

The navigation success rate

R_{s}

, collision rate

R_{c}

, and loss rate

R_{l}

of the four algorithms under different obstacle densities are shown in Table 1. As can be seen from the table, the proposed algorithm achieved the highest navigation success rate and the lowest collision rate in all five scenarios, and the navigation performance of the SAC algorithm is better than DDPG and TD3. Due to the high density of obstacles in the experimental scene, only DDPG experienced flight loss. Comparing the experimental results of scene 1 and scene 5, it can be seen that with the substantial increase in obstacle density, the success rate of our algorithm is only reduced by 18%, while the success rates of the other three algorithms are reduced by 28%, 48%, and 29%, which further demonstrates the stability of our algorithm.

The flight trajectories of the UAV in all tests are shown in Figure 9. As can be seen from the figure, the algorithm proposed in this article has the most successful flight trajectories in all scenarios. In addition, as the density of obstacles increases, the flight path of the UAV becomes more complex. Based on the flight trajectory data, the average flight distance and average number of flight steps for all experiments were calculated, as shown in Figure 10.

According to Figure 10a, the proposed algorithm has the shortest average flight distance in scenarios 2 and 3, and has an average flight distance only larger than SAC in scenarios 1 and 5. According to Figure 10b, the proposed algorithm requires the minimum average number of flight steps in all five scenarios. Therefore, based on the above analysis, it can be seen that compared with the other three classic algorithms, the algorithm proposed in this article not only has the highest navigation success rate and stability but also has the optimal flight efficiency in high-density flight environments.

5.4. Experiment II: Highly Dynamic Scene Verification

To further verify the performance of the trained algorithm, a dynamic test scenario is constructed as shown in Figure 11. The starting and target points are represented in the figure as [0, −7] and [0, 7], respectively. The UAV’s flight mission is defined as autonomous flight from the starting point to the target point. During each test, 30 obstacles are randomly generated in the test area. The movement direction of the obstacles is marked by the arrow in the figure. After reaching the edge of the field, the obstacles will continue to move in the opposite direction until the end of this test. The moving speeds of the obstacles are set at 0.01 m/s, 0.05 m/s, 0.1 m/s, 0.15 m/s, and 0.2 m/s, respectively. Each set of experiments is performed 100 times, and the position of the obstacle will be regenerated after each test.

The experimental results of all algorithms in five dynamic scenes are presented in Table 2. It can be seen from the table that the proposed algorithm achieved the highest navigation success rate and the lowest collision rate in the experiments under different moving speeds of obstacles, and no flight loss occurred, which shows the effectiveness of the proposed algorithm in highly dynamic scenarios. For comparison, the other three algorithms all experienced flight loss, among which SAC and DDPG had higher loss rates. It should be noted that as the speed of obstacle movement increases, the difficulty for the UAV to avoid obstacles increases sharply, resulting in a significant reduction in the success rate of all algorithms.

The flight trajectory of the UAV in the highly dynamic test is shown in Figure 12. It can be seen from the flight trajectory in the figure that the algorithm proposed in this article can fly from the starting point to the target point more directly, while several other algorithms have more messy flight trajectories, which shows that the proposed algorithm can not only adapt to high-density flight environments, but also apply to highly dynamic scenarios that have not appeared in the training phase. Based on the flight trajectory of the UAV, the average flight distance

L_{d i s}

, and the average number of flight steps

N_{s t e p}

of the UAV in the dynamic scenario are calculated, as shown in Figure 13.

As can be seen from Figure 13, the proposed algorithm has the shortest average flight distance and the least average number of flight steps in all five dynamic flight scenarios. Corresponding to the flight trajectory in Figure 12, although SAC has a higher navigation success rate than TD3 and DDPG, its required flight distance is much higher than other algorithms. The TD3 algorithm has a lower average flight distance but the largest number of flight steps, which shows that the algorithm is more “cautious” in dynamic scenes. Figure 13 shows that the proposed algorithm not only has a high navigation success rate but also has the highest navigation efficiency when facing a highly dynamic flight environment, enabling the UAV to achieve autonomous navigation quickly and accurately.

5.5. Experiment III: Verify the Validity of the Reward

Experiments on UAVs in high-density and highly dynamic scenarios verified the effectiveness and reliability of the proposed algorithm. Based on the analysis in Section 4.2, we constructed “radical” and “conservative” reward function models by setting different step rewards, collision rewards, and distance rewards in order to further compare and illustrate the effect of the dynamic reward function designed in this article. The specific description is as follows:

Model 1: Based on Formulas (28)–(30), set

ε_{1} = (r_{1} + r_{2}) / 2

,

ε_{2} = 1

,

ε_{3} = 0.5

, respectively. By giving the agent a smaller step penalty and a larger collision penalty, the agent tends to choose a conservative behavior away from obstacles.

Model 2: Based on Formulas (28)–(30), set

ε_{1} = (r_{1} + r_{2}) / 2

,

ε_{2} = 0.5

,

ε_{3} = 1

, respectively. By giving the agent a larger step penalty and a smaller collision penalty, the agent is inclined to choose radical behavior to explore the environment.

Models with different reward functions were trained in the simulation scenario shown in Figure 6, and the reward convergence curve obtained is shown in Figure 14.

It can be seen from the figure that compared with the other two models, the model proposed in this article has the fastest convergence speed and optimal reward. In addition, since model 1 has a more “conservative” reward function, it converges faster than model 2 in the early stage of training, and model 2 explores a better strategy after a period of training, so it converges faster than model 1.

In order to verify the performance of the two trained models, the high-density simulation environment shown in Figure 8c–e is used for testing. The test results are shown in Table 3. As can be seen from the table, comparing the two algorithms with fixed reward functions, the proposed algorithm has the highest navigation success rate in all scenarios, and the navigation success rate of model 1 is slightly higher than that of model 2.

The flight trajectory of the UAV in the test is drawn as shown in Figure 15, and the average flight distance and average number of flight steps calculated based on the flight trajectory are shown in Figure 16. Combining all the experimental results in the figure, it can be seen that the proposed algorithm has the optimal flight efficiency.

Model 1 and model 2 were verified in highly dynamic scenarios with obstacle movement speeds of 0.1 m/s, 0.15 m/s, and 0.2 m/s, respectively. The obtained test results and UAV flight trajectories are shown in Table 4 and Figure 17 and Figure 18.

As can be seen from Table 4, the proposed algorithm still has the optimal navigation success rate in a highly dynamic flight environment. In addition, the navigation success rate of model 1 in dynamic scenes is higher than that of model 2, which demonstrates the advantages of models with conservative reward functions compared to models with radical reward functions. Comparing the flight trajectories shown in Figure 15 and Figure 17, it can be seen that the flight trajectory of model 2 in the dynamic scene has significant bends, which indicates that the model has poor generalization, resulting in model 2 having the highest average flight distance and average flight steps. As can be seen from Figure 18, the dynamic reward function constructed in this article can effectively improve the navigation performance of the UAV and further optimize its flight efficiency.

6. Conclusions

This paper proposes a DRL-based autonomous navigation and obstacle avoidance algorithm for UAV to achieve autonomous path planning in high-density and highly dynamic environments. First, by analyzing the impact of UAV position changes and angle changes on flight performance in complex environments, a state space representation method containing angle change information is proposed. Then, a dynamic reward function is constructed to balance the exploratory and conservative behaviors of the agent. Finally, a highly dynamic and high-density simulation environment is built to verify the proposed algorithm. The results show that the proposed algorithm can have the highest navigation success rate, and can improve the efficiency of autonomous flight while improving the navigation performance of the UAV.

Although the effectiveness of the proposed algorithm has been verified in a simulation environment, there are still some issues that deserve further study, for example, comparison with some more advanced algorithms and verification of the performance of the proposed method in a real environment. Therefore, we will test the flight effect of the proposed algorithm in a real environment on a physical UAV platform in the future to promote the practical application of the algorithm.

Author Contributions

Conceptualization, Y.S. and Q.H.; methodology, Y.S.; validation, Y.S., H.L.; formal analysis, H.L. and J.L.; resources, J.L.; writing—original draft preparation, Y.S.; writing—review and editing, H.L.; project administration, Y.S.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant No. 62271166.

Data Availability Statement

The data are contained within this article. The other data will be made available upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bu, Y.; Yan, Y.; Yang, Y. Advancement Challenges in UAV Swarm Formation Control: A Comprehensive Review. Drones 2024, 8, 320. [Google Scholar] [CrossRef]
Javed, S.; Hassan, A.; Ahmad, R.; Ahmed, W.; Ahmed, R.; Saadat, A.; Guizani, M. State-of-the-art and future research challenges in uav swarms. IEEE Internet Things J. 2024, 11, 19023–19045. [Google Scholar] [CrossRef]
Pieczyński, D.; Ptak, B.; Kraft, M.; Piechocki, M.; Aszkowski, P. A fast, lightweight deep learning vision pipeline for autonomous UAV landing support with added robustness. Eng. Appl. Artif. Intell. 2024, 131, 107864. [Google Scholar] [CrossRef]
Faiz, T.I.; Vogiatzis, C.; Noor-E-Alam, M. Computational approaches for solving two-echelon vehicle and UAV routing problems for post-disaster humanitarian operations. Expert Syst. Appl. 2024, 237, 121473. [Google Scholar] [CrossRef]
Xiong, T.; Liu, F.; Liu, H.; Ge, J.; Li, H.; Ding, K.; Li, Q. Multi-drone optimal mission assignment and 3D path planning for disaster rescue. Drones 2023, 7, 394. [Google Scholar] [CrossRef]
Arishi, A.; Krishnan, K.; Arishi, M. Machine learning approach for truck-drones based last-mile delivery in the era of industry 4.0. Eng. Appl. Artif. Intell. 2022, 116, 105439. [Google Scholar] [CrossRef]
Hong, F.; Wu, G.; Luo, Q.; Liu, H.; Fang, X.; Pedrycz, W. Logistics in the sky: A two-phase optimization approach for the drone package pickup and delivery system. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9175–9190. [Google Scholar] [CrossRef]
Sharma, R.; Arya, R. UAV based long range environment monitoring system with Industry 5.0 perspectives for smart city infrastructure. Comput. Ind. Eng. 2022, 168, 108066. [Google Scholar] [CrossRef]
Afshar-Mohajer, N.; Wu, C.Y. Use of a drone-based sensor as a field-ready technique for short-term concentration mapping of air pollutants: A modeling study. Atmos. Environ. 2023, 294, 119476. [Google Scholar] [CrossRef]
Jacygrad, E.; Kelly, M.; Hogan, S.; Preece, J.E.; Golino, D.; Michelmore, R. Comparison between field measured and UAV-derived pistachio tree crown characteristics throughout a growing season. Drones 2022, 6, 343. [Google Scholar] [CrossRef]
Abbas, A.; Zhang, Z.; Zheng, H.; Alami, M.M.; Alrefaei, A.F.; Abbas, Q.; Naqvi, S.A.H.; Rao, M.J.; Mosa, W.F.A.; Abbas, Q.; et al. Drones in plant disease assessment, efficient monitoring, and detection: A way forward to smart agriculture. Agronomy 2023, 13, 1524. [Google Scholar] [CrossRef]
Sheng, Y.; Liu, H.; Li, J.; Han, Q. A Framework for Improving UAV Decision of Autonomous Navigation from Training to Application Migration under Perceptual Uncertainty. Meas. Sci. Technol. 2024, 35, 056308. [Google Scholar] [CrossRef]
Ye, X.; Song, F.; Zhang, Z.; Zeng, Q. A review of small UAV navigation system based on multi-source sensor fusion. IEEE Sens. J. 2023, 23, 18926–18948. [Google Scholar] [CrossRef]
Yang, T.; Yang, F.; Li, D. A New Autonomous Method of Drone Path Planning Based on Multiple Strategies for Avoiding Obstacles with High Speed and High Density. Drones 2024, 8, 205. [Google Scholar] [CrossRef]
Soliman, A.; Al-Ali, A.; Mohamed, A.; Gedawy, H.; Izham, D.; Bahri, M.; Erbad, A.; Guizani, M. AI-based UAV navigation framework with digital twin technology for mobile target visitation. Eng. Appl. Artif. Intell. 2023, 123, 106318. [Google Scholar] [CrossRef]
AlMahamid, F.; Grolinger, K. VizNav: A Modular Off-Policy Deep Reinforcement Learning Framework for Vision-Based Autonomous UAV Navigation in 3D Dynamic Environments. Drones 2024, 8, 173. [Google Scholar] [CrossRef]
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971. [Google Scholar]
Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596. [Google Scholar]
Zhou, Y.; Shu, J.; Hao, H.; Song, H.; Lai, X. UAV 3D online track planning based on improved SAC algorithm. J. Braz. Soc. Mech. Sci. Eng. 2024, 46, 12. [Google Scholar] [CrossRef]
Zhang, L.; Peng, J.; Yi, W.; Lin, H.; Lei, L.; Song, X. A state-decomposition DDPG algorithm for UAV autonomous navigation in 3D complex environments. IEEE Internet Things J. 2023, 11, 10778–10790. [Google Scholar] [CrossRef]
Luo, X.; Wang, Q.; Gong, H.; Tang, C. UAV path planning based on the average TD3 algorithm with prioritized experience replay. IEEE Access 2024, 12, 38017–38029. [Google Scholar] [CrossRef]
Huang, X.; Wang, W.; Ji, Z.; Cheng, B. Representation Enhancement-Based Proximal Policy Optimization for UAV Path Planning and Obstacle Avoidance. Int. J. Aerosp. Eng. 2023, 2023, 6654130. [Google Scholar] [CrossRef]
Guo, T.; Jiang, N.; Li, B.; Zhu, X.; Wang, Y.; Du, W. UAV navigation in high dynamic environments: A deep reinforcement learning approach. Chin. J. Aeronaut. 2021, 34, 479–489. [Google Scholar] [CrossRef]
Liu, Z.; Cao, Y.; Chen, J.; Li, J. A hierarchical reinforcement learning algorithm based on attention mechanism for uav autonomous navigation. IEEE Trans. Intell. Transp. Syst. 2022, 24, 13309–13320. [Google Scholar] [CrossRef]
Shao, X.; Xia, Y.; Mei, Z.; Zhang, W. Model-guided reinforcement learning enclosing for uavs with collision-free and reinforced tracking capability. Aerosp. Sci. Technol. 2023, 142, 108609. [Google Scholar] [CrossRef]
Yang, Y.; Hou, Z.; Chen, H.; Lu, P. DRL-based Path Planner and its Application in Real Quadrotor with LIDAR. J. Intell. Robot. Syst. 2023, 107, 38. [Google Scholar] [CrossRef]
Xue, Y.; Chen, W. A uav navigation approach based on deep reinforcement learning in large cluttered 3d environments. IEEE Trans. Veh. Technol. 2022, 72, 3001–3014. [Google Scholar] [CrossRef]
Wan, K.; Li, B.; Gao, X.; Hu, Z.; Yang, Z. A learning-based flexible autonomous motion control method for UAV in dynamic unknown environments. J. Syst. Eng. Electron. 2021, 32, 1490–1508. [Google Scholar]
Lv, H.; Chen, Y.; Li, S.; Zhu, B.; Li, M. Improve exploration in deep reinforcement learning for UAV path planning using state and action entropy. Meas. Sci. Technol. 2024, 35, 056206. [Google Scholar] [CrossRef]
Ma, B.; Liu, Z.; Dang, Q.; Zhao, W.; Wang, J.; Cheng, Y.; Yuan, Z. Deep reinforcement learning of UAV tracking control under wind disturbances environments. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
Wang, Z.; Gao, W.; Li, G.; Wang, Z.; Gong, M. Path Planning for Unmanned Aerial Vehicle via Off-Policy Reinforcement Learning with Enhanced Exploration. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2625–2639. [Google Scholar] [CrossRef]
Zhang, Y.; Zhao, W.; Wang, J.; Yuan, Y. Recent progress, challenges and future prospects of applied deep reinforcement learning: A practical perspective in path planning. Neurocomputing 2024, 608, 128423. [Google Scholar] [CrossRef]
Wang, Y.; He, H.; Sun, C. Learning to navigate through complex dynamic environment with modular deep reinforcement learning. IEEE Trans. Games 2018, 10, 400–412. [Google Scholar] [CrossRef]
Wang, C.; Wang, J.; Wang, J.; Zhang, X. Deep-reinforcement-learning-based autonomous UAV navigation with sparse rewards. IEEE Internet Things J. 2020, 7, 6180–6190. [Google Scholar] [CrossRef]
Zhang, S.; Li, Y.; Dong, Q. Autonomous navigation of UAV in multi-obstacle environments based on a Deep Reinforcement Learning approach. Appl. Soft Comput. 2022, 115, 108194. [Google Scholar] [CrossRef]
Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
Diels, L.; Vlaminck, M.; De Wit, B.; Philips, W.; Luong, H. On the optimal mounting angle for a spinning LiDAR on a UAV. IEEE Sens. J. 2022, 22, 21240–21247. [Google Scholar] [CrossRef]

Figure 1. UAV navigation mission diagram.

Figure 2. (a) Schematic diagram of airborne LiDAR. (b) Relative positional relationship between UAV and target point.

Figure 3. Schematic diagram of UAV movement and changes in measurement data (rectangles and circles in the diagram represent obstacles): (a) movement diagram (translation); (b) changes in laser-measured data (translation); (c) movement diagram (translation and rotation); (d) changes in laser-measured data (translation and rotation).

Figure 4. Different obstacle scenarios: (a) low-density obstacles; (b) high-density obstacles; (c) dynamic obstacles.

Figure 5. The network structure of our algorithm.

Figure 6. Schematic diagram of training scenario.

Figure 7. Reward convergence curves obtained by training four algorithms.

Figure 8. High-density test scenarios: (a) scene 1: N_obs = 90; ρ = 0.225; (b) scene 2: N_obs = 120; ρ = 0.3; (c) scene 3: N_obs = 150; ρ = 0.375; (d) scene 4: N_obs = 180; ρ = 0.45; (e) scene 5: N_obs = 210; ρ = 0.525.

Figure 9. Flight trajectory of UAV in high-density test: (a) DDPG in scene 1; (b) TD3 in scene 1; (c) SAC in scene 1; (d) ours in scene 1; (e) DDPG in scene 2; (f) TD3 in scene 2; (g) SAC in scene 2; (h) ours in scene 2; (i) DDPG in scene 3; (j) TD3 in scene 3; (k) SAC in scene 3; (l) ours in scene 3; (m) DDPG in scene 4; (n) TD3 in scene 4; (o) SAC in scene 4; (p) ours in scene 4; (q) DDPG in scene 5; (r) TD3 in scene 5; (s) SAC in scene 5; (t) ours in scene 5.

Figure 10.

L_{d i s}

and

N_{s t e p}

in different scenarios: (a) average flight distance

L_{d i s}

in each scenario; (b) average number of flight steps

N_{s t e p}

in each scenario.

Figure 10.

L_{d i s}

and

N_{s t e p}

in different scenarios: (a) average flight distance

L_{d i s}

in each scenario; (b) average number of flight steps

N_{s t e p}

in each scenario.

Figure 11. Highly dynamic test scenario diagram.

Figure 12. Flight trajectory of UAV in highly dynamic test: (a) DDPG in scene 1; (b) TD3 in scene 1; (c) SAC in scene 1; (d) ours in scene 1; (e) DDPG in scene 2; (f) TD3 in scene 2; (g) SAC in scene 2; (h) ours in scene 2; (i) DDPG in scene 3; (j) TD3 in scene 3; (k) SAC in scene 3; (l) ours in scene 3; (m) DDPG in scene 4; (n) TD3 in scene 4; (o) SAC in scene 4; (p) ours in scene 4; (q) DDPG in scene 5; (r) TD3 in scene 5; (s) SAC in scene 5; (t) ours in scene 5.

Figure 13.

L_{d i s}

and

N_{s t e p}

in different dynamic scenarios: (a) average flight distance

L_{d i s}

in each dynamic scenario; (b) average number of flight steps

N_{s t e p}

in each dynamic scenario.

Figure 13.

L_{d i s}

and

N_{s t e p}

in different dynamic scenarios: (a) average flight distance

L_{d i s}

in each dynamic scenario; (b) average number of flight steps

N_{s t e p}

in each dynamic scenario.

Figure 14. Reward convergence curves of three models.

Figure 15. UAV flight trajectories in high-density tests based on three models: (a) model 1 in scene 3; (b) model 2 in scene 3; (c) ours in scene 3; (d) model 1 in scene 4; (e) model 2 in scene 4; (f) ours in scene 4; (g) model 1 in scene 5; (h) model 2 in scene 5; (i) ours in scene 5.

Figure 16.

L_{d i s}

and

N_{s t e p}

of UAV based on three models in different scenarios: (a) average flight distance

L_{d i s}

in each scenario; (b) average number of flight steps

N_{s t e p}

in each scenario.

Figure 16.

L_{d i s}

and

N_{s t e p}

of UAV based on three models in different scenarios: (a) average flight distance

L_{d i s}

in each scenario; (b) average number of flight steps

N_{s t e p}

in each scenario.

Figure 17. UAV flight trajectories in highly dynamic tests based on three models: (a) model 1 in scene 3; (b) model 2 in scene 3; (c) ours in scene 3; (d) model 1 in scene 4; (e) model 2 in scene 4; (f) ours in scene 4; (g) model 1 in scene 5; (h) model 2 in scene 5; (i) ours in scene 5.

Figure 18.

L_{d i s}

and

N_{s t e p}

of UAV based on three models in different dynamic scenarios: (a) average flight distance

L_{d i s}

in each dynamic scenario; (b) average number of flight steps

N_{s t e p}

in each dynamic scenario.

Figure 18.

L_{d i s}

and

N_{s t e p}

of UAV based on three models in different dynamic scenarios: (a) average flight distance

L_{d i s}

in each dynamic scenario; (b) average number of flight steps

N_{s t e p}

in each dynamic scenario.

Table 1. Experimental results under different density scenarios.

Algorithm	Scene 1			Scene 2			Scene 3			Scene 4			Scene 5
Algorithm	R_s	R_c	R_l	R_s	R_c	R_l	R_s	R_c	R_l	R_s	R_c	R_l	R_s	R_c	R_l
DDPG	87	12	1	81	18	1	65	34	1	64	36	0	59	41	0
TD3	73	27	0	59	41	0	49	51	0	32	68	0	25	75	0
SAC	94	6	0	93	7	0	87	13	0	73	27	0	65	35	0
Ours	97	3	0	96	4	0	93	7	0	86	14	0	79	21	0

Table 2. Experimental results under different dynamic scenarios.

Algorithm	Scene 1			Scene 2			Scene 3			Scene 4			Scene 5
Algorithm	R_s	R_c	R_l	R_s	R_c	R_l	R_s	R_c	R_l	R_s	R_c	R_l	R_s	R_c	R_l
DDPG	70	16	14	63	16	21	52	28	20	48	40	12	34	60	6
TD3	61	38	1	61	38	1	53	46	1	40	60	0	24	72	4
SAC	80	4	16	79	7	14	68	20	12	51	44	5	29	63	8
Ours	100	0	0	94	6	0	81	19	0	63	37	0	42	58	0

Table 3. Experimental results of models with different reward functions in high-density scenarios.

Algorithm	Scene 3			Scene 4			Scene 5
Algorithm	R_s	R_c	R_l	R_s	R_c	R_l	R_s	R_c	R_l
Model 1	88	12	0	81	19	0	77	23	0
Model 2	88	12	0	77	23	0	75	25	0
Ours	93	7	0	86	14	0	79	21	0

Table 4. Experimental results of models with different reward functions in highly dynamic scenarios.

Algorithm	Scene 3			Scene 4			Scene 5
Algorithm	R_s	R_c	R_l	R_s	R_c	R_l	R_s	R_c	R_l
Model 1	79	21	0	52	48	0	37	63	0
Model 2	62	37	1	36	64	0	26	74	0
Ours	81	19	0	63	37	0	42	58	0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sheng, Y.; Liu, H.; Li, J.; Han, Q. UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments. Drones 2024, 8, 516. https://doi.org/10.3390/drones8090516

AMA Style

Sheng Y, Liu H, Li J, Han Q. UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments. Drones. 2024; 8(9):516. https://doi.org/10.3390/drones8090516

Chicago/Turabian Style

Sheng, Yuanyuan, Huanyu Liu, Junbao Li, and Qi Han. 2024. "UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments" Drones 8, no. 9: 516. https://doi.org/10.3390/drones8090516

APA Style

Sheng, Y., Liu, H., Li, J., & Han, Q. (2024). UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments. Drones, 8(9), 516. https://doi.org/10.3390/drones8090516

Article Menu

UAV Autonomous Navigation Based on Deep Reinforcement Learning in Highly Dynamic and High-Density Environments

Abstract

1. Introduction

2. Background

2.1. Markov Decision Process (MDP)

2.2. SAC

3. Problem Formulation

3.1. Scenario Description

3.2. Environment Perception Model

3.3. DRL-Based Navigation Model

4. Proposed Approach

4.1. State Space Representation Method

4.2. Dynamic Reward Function

5. Experiment and Discussion

5.1. Experimental Settings

5.2. Training Results

5.3. Experiment I: High-Density Scene Verification

5.4. Experiment II: Highly Dynamic Scene Verification

5.5. Experiment III: Verify the Validity of the Reward

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI