Three-Dimensional Path Planning of UAVs in a Complex Dynamic Environment Based on Environment Exploration Twin Delayed Deep Deterministic Policy Gradient

: Unmanned Aerial Vehicle (UAV) path planning research refers to the UAV automatically planning an optimal path to the destination under the corresponding environment, while avoiding collision with obstacles in this process. In order to solve the problem of 3D path planning of UAV in a dynamic environment, a heuristic dynamic reward function is designed to guide the UAV. We propose the Environment Exploration Twin Delayed Deep Deterministic Policy Gradient (EE-TD3) algorithm, which combines the symmetrical 3D environment exploration coding mechanism on the basis of TD3 algorithm. The EE-TD3 algorithm model can effectively avoid collisions, improve the training efﬁciency, and achieve faster convergence speed. Finally, the performance of the EE-TD3 algorithm and other deep reinforcement learning algorithms was tested in the simulation environment. The results show that the EE-TD3 algorithm is better than other algorithms in solving the 3D path planning problem of UAV.


Introduction
With the development of Unmanned Aerial Vehicle (UAV) technology, UAVs are becoming more and more widely used in military and civilian fields.Because of their small size and strong mobility, military UAVs are often used in low-altitude or ultra-low-altitude raid operations on the battlefield.In the military application of UAVs, path planning is one of its important contents.UAV path planning can help UAVs complete tasks in complex environments, such as area search [1], terrain reconnaissance [2], formation flight [3], etc. [4][5][6].At the same time, many factors need to be considered, such as terrain, obstacles, weather, flight speed, etc., so it is relatively complex and challenging.
So far, there has been a lot of research on UAV path planning methods around the world, which are usually based on rules or optimization algorithms, such as A* algorithm [7], Dijkasta algorithm [8], and the artificial potential field method [9].The main ideas of algorithms such as the A* algorithm and the Dijkasta algorithm are usually based on Heuristic Search and graph theory.Some index is used as a Heuristic Function, such as distance, cost, valuation, etc.To evaluate every possible path and find the best one.In other words, all possible paths should be displayed in the space, otherwise the optimal solution cannot be found by such methods.Such algorithms can find the optimal path to a certain extent, but in practical applications, path planning problems often involve complex environments, dynamic obstacles, and other situations, which requires the path planning algorithm to quickly adapt to changing environments and the ability to learn and adapt strategies.The artificial potential field method represents a class of algorithms, and the representative methods include simulated annealing method [10], charge method [11], flow function method [12], etc.One of these methods is the flow function method, which is actually an extension of the artificial potential field method.It plans its path based on the potential field between the drone and the external environment (obstacles and targets).while the existing artificial potential field method is mainly applied to the 2D environment, and the shape of the obstacles also has many limitations.Most of the previous studies on path planning have solved the problem of path planning in a 2D environment [13].Compared with a complex 3D environment, a simple 2D path planning method cannot effectively identify obstacles and targets.So, it is necessary to solve the path planning problem of UAVs in 3D environments.In previous studies, many scholars were also concerned about the study of path planning in 3D environments [14][15][16][17].However, most of these studies are conducted in a single static environment, without considering the dynamic complex environmental constraints.
Deep reinforcement learning (DRL) is an advanced technology that has achieved remarkable achievements in robot control and agent behavior in recent years.It can independently learn effective behavior strategies.It can therefore update the strategy in a complex 3D environment by interacting with the environment constantly, and gradually adapt to the changes of the environment [18].The deep neural network can process highdimensional input data, so it can enable the agent to obtain rich information from the surrounding environment, and facilitate the generation of corresponding action strategies.In addition, the strong expression ability of the deep neural network can be effectively located in the characteristics of nonlinear and dynamic environment, which provides good help for the path planning research in a 3D environment [19,20].At the same time, by introducing noise in the training, it can better adapt to the changes and uncertainties of the environment.Therefore, DRL is widely used in the solution of the UAV path planning problem and implements the purpose of effectively completing the path planning in complex 3D environments.
In order to solve the problem of the UAV path in a complex 3D environment more effectively, we propose a model based on the Environment Exploration Twin Delayed Deep Deterministic Policy Gradient (EE-TD3) algorithm and combine it with the unmanned aerial raid task model.The simulation results show that the algorithm model can plan the optimal flight path for UAVs in complex 3D battlefield environments, and it has relatively fast convergence compared to previous research.The main work of this article is as follows: 1.
This paper analyzes the task requirement and the battlefield environment, and designs the dynamic reward function of the system, solves the sparse reward problem in the traditional path planning, and accelerates the convergence speed of the algorithm; 2.
An environmental information exploration coding mechanism is proposed to explore and identify the surrounding environment of UAVs and enter it into the algorithm model, and then output the optimal control information to solve the path planning problem; 3.
The 3D environment model of the low-altitude raid task of the UAV is constructed and the process of the combination of the environmental model and the algorithm model is introduced in detail, and the simulation experiment shows that the algorithm model has better performance.
To summarize, this paper proposes a path planning algorithm based on deep reinforcement learning in a 3D environment and has significant implications for research in relevant directions.Firstly, the dynamic reward function can provide reference for other scholars.Secondly, the environment exploration method can be generalized to other deep reinforcement learning algorithms as a general method.Moreover, there are few studies relevant to complex dynamic 3D environments, and this paper also provides reference cases for such scenarios.
The main content of the rest of this article is as follows: Section 2 introduces the relevant work.Section 3 illustrates the establishment of the 3D battlefield environment model.Section 4 explains the principle of the EE-TD3 algorithm model.Section 5 shows the simulation results and analyzes the results.Section 6 presents the conclusion of this paper.

Related Work
So far, many researchers have used traditional intelligent algorithms to study UAV autonomous path planning, such as ant colony optimization algorithm, FPA algorithm, GA algorithm and so on.In recent years, with the continuous development of deep reinforcement learning (DRL) technology, more and more researchers have applied it to solve the autonomous path planning of UAV.
In terms of traditional algorithms, Chen J. et al. [21] proposed a formula based on mixed integer linear programming to find the optimal flight path for UAVs by fully searching the path space.At the same time, an original algorithm based on clustering is designed to divide different regions into clusters, allowing for the acquisition of the optimal point-topoint path for the UAV.C. Lamini et al. [22] proposed an improved crossover operator to solve the path planning problem in a static environment by using the genetic algorithm (GA).The crossover operator can also avoid local convergence, provide a feasible path with better fitness value than its parent operator, and make the algorithm converge faster.Ee Soong Low et al. [23] used the flower pollination algorithm (FPA) to improve the initialization of the Q value in Q-learning algorithm.By comparing the experimental evaluation under a different obstacle layout environment, it is shown that the Q-learning algorithm converges faster when FPA is used to improve the initial Q value.
In terms of DRL algorithm, Yang Yang et al. [24] used the Deep Q network (DQN) algorithm in this paper, which combined the Q-learning algorithm, experiential playback mechanism, and deep neural network in time to generate target Q values to solve the path planning problem of multiple robots.In Ref. [25], G. Lin et al. introduced a collision-free path planning method based on deep reinforcement learning, which is fast and robust.Firstly, the recurrent neural network is used to record and process the state information observed by the robot, and then the depth deterministic strategy gradient algorithm (DDPG) is used to predict the collision-free path of the current state.In the training experiment phase, the simulation environment is developed, and its parameters are randomized so that the cyclic DDPG algorithm can be generalized to real world scenarios.P. Chen et al. [26] proposed a path planning method based on the soft actor-critic (SAC) algorithm to solve the dynamic obstacle avoidance problem of Mechanical arm.In order to avoid moving obstacles in the environment while conducting real-time path planning, a comprehensive reward function of dynamic obstacle avoidance and target method is designed.To solve the problem of low sample utilization caused by random sampling, priority experience playback (PER) was used to change sample weights and improve the sampling efficiency.Finally, simulation experiments are carried out, and the experimental results show that this method can effectively avoid moving obstacles in the environment, and the success rate of path planning task is relatively high.
Recently, many researchers have tried to use the TD3 algorithm to solve the UAV path planning problem.Lei He et al. [27] proposed a path planner based on deep learning in this paper, which is used for the autonomous flight of quadrotor UAV in an unknown environment.The path navigation problem is modeled as a Markov decision process (MDP), and the path planner is trained in a simulated environment using a deep reinforcement learning (DRL) method using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm.Finally, a real-world flight test is carried out to show that the path planner trained in the simulation has good performance and can be directly applied to the real environment.S. Zhang et al. [28] proposed an improved Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm.In order to make the UAV respond to the change of environment, the observed environment information is added to the Actor-Critic network input, and the double-flow Actor-Critic network structure is proposed to extract the environment information features.The performance of the algorithm is evaluated by simulation, and the experimental results show that the proposed method can effectively complete the autonomous path planning task in the multi-obstacle environment, which reflects the effectiveness of the proposed method.Although the above research shows the application ability of the TD3 algorithm in solving path planning problems, its performance in a complex dynamic battlefield environment still needs to be studied further.Since TD3 algorithm has the advantages of strong ability to deal with large state space information and fast convergence speed, this paper proposes a path planning algorithm model for UAV based on environment exploration mechanism and TD3 algorithm in three-dimensional environment.

Three-Dimensional Battlefield Environment Model
A 3D space with a length and width of 50 km and a height of 1 km is established in a 3D cartesian coordinate system.In this 3D space, the UAV is tasked with taking low-altitude surprise attacks on radar positions 50 km away.In order to avoid enemy radar reconnaissance, the UAV must fly at an altitude of less than 1 km.During flight, the UAV needs to autonomously avoid static obstacles, such as mountains and ground buildings.Due to the low altitude of the UAV, it is also affected by the dynamic obstacles of random movement such as low flying birds and civil aircraft.Therefore, the UAV should also respond to dynamic obstacles in a timely manner during flight.
As shown in Figure 1, the 3D battlefield environment includes UAV U, low-altitude bird group B, mountain M, and radar R.An UAV can fly autonomously in the battlefield environment, and the position information of the UAV can be expressed as follows: In Formula (1), x ∈ (0, 50 km), y ∈ (0, 50 km), z ∈ (0, 1 km).
environment information features.The performance of the algorithm is evaluated by simulation, and the experimental results show that the proposed method can effectively complete the autonomous path planning task in the multi-obstacle environment, which reflects the effectiveness of the proposed method.Although the above research shows the application ability of the TD3 algorithm in solving path planning problems, its performance in a complex dynamic battlefield environment still needs to be studied further.Since TD3 algorithm has the advantages of strong ability to deal with large state space information and fast convergence speed, this paper proposes a path planning algorithm model for UAV based on environment exploration mechanism and TD3 algorithm in three-dimensional environment.

Three-Dimensional Battlefield Environment Model
A 3D space with a length and width of 50 km and a height of 1 km is established in a 3D cartesian coordinate system.In this 3D space, the UAV is tasked with taking low-altitude surprise attacks on radar positions 50 km away.In order to avoid enemy radar reconnaissance, the UAV must fly at an altitude of less than 1 km.During flight, the UAV needs to autonomously avoid static obstacles, such as mountains and ground buildings.Due to the low altitude of the UAV, it is also affected by the dynamic obstacles of random movement such as low flying birds and civil aircraft.Therefore, the UAV should also respond to dynamic obstacles in a timely manner during flight.
As shown in Figure 1, the 3D battlefield environment includes UAV , low-altitude bird group , mountain , and radar .An UAV can fly autonomously in the battlefield environment, and the position information of the UAV can be expressed as follows: In Formula (1), ∈ 0, 50 km , ∈ 0, 50 km , ∈ 0, 1 km .The velocity of the UAV is divided according to the direction of the 3D rectangular coordinate system, ∈ (0, 100 m/s), ∈ (0, 100 m/s), ∈ (−3, 3 m/s).The real-time location of the UAV can be expressed as: The initial position of the UAV can be expressed as: The velocity v of the UAV is divided according to the direction of the 3D rectangular coordinate system, v x ∈ (0, 100 m/s), v y ∈ (0, 100 m/s), v z ∈ (−3, 3 m/s).The real-time location U i of the UAV can be expressed as: The initial position of the UAV U 0 can be expressed as: The coordinates of the vertices of mountain M can be expressed as: Symmetry 2023, 15, 1371 5 of 18 The height coordinate z m of a high mountain M can be represented by the horizontal coordinates x m and y m , as follows [29]: where, x m 0 and y m 0 are the horizontal coordinates of the lowest center of the mountain M, a is the height coefficient of the mountain, and the changes in b and c can change the size of the mountain.
The UAV may encounter dynamic obstacles such as birds and civil low-altitude aircraft during low-altitude flight.It is assumed that these dynamic obstacles are randomly generated in an area with a height below 300 m and the initial center position is expressed as B (x 0 ,y 0 ,z 0 ) .The dynamic real-time position B (x,y,z) can be expressed as: In Formula ( 6), the initial position x 0 ∈ (0, 50 km), y 0 ∈ (0, 50 km), z 0 ∈ (0, 0.3 km), and dynamic obstacles to movement speed v Bx ∈ (−10, 10 m/s), v By ∈ (−10, 10 m/s), v Bz ∈ (−2, 2 m/s).
In order to ensure flight safety, the UAV should maintain a safe distance of more than 50 m between bird group B, and d sa f e can be expressed as: The position of radar R can be expressed as: R = [50 km, 50 km, 0 km] ( The maximum detection radius of the radar is 40 km, but due to the influence of earth curvature, detection angle, ground obstacles, ground clutter, and other factors, the ultra-low altitude target below 1 km is difficult to be detected by radar.The probability of the UAV being detected by radar will change with the change of the distance d between the two and the UAV altitude h.When the UAV flies too low, the radar will regard it as entering the low altitude blind zone and cannot detect it.Assuming that the radar blind zone is an airspace with a height of less than 300 m, the radar detection probability model can be expressed as follows. The radar detection probability diagram can be obtained from Formula (9), as shown in Figure 2.
Figure 2 shows that the d-axis represents the Euclidean distance between the UAV and the radar, the h-axis represents the UAV flying altitude, and the p-axis represents the probability of the UAV being detected by the radar.Further analysis shows that the UAV flying altitude below 300 m can use the blind zone to avoid radar detection.In the range of flight altitude h ∈ (0.3 km, 1 km), the probability of being detected by a radar increases as the UAV flies higher and closer to the distance.The UAV carries air-to-ground missiles with a maximum firing range of 8 km, and the mission is considered complete when the UAV safely reaches the position 8 km away from the radar.The radar positions are equipped with anti-aircraft missile weapon systems (including guidance radars and anti-aircraft missiles), which have a maximum range of 25 km.If the UAV is within the radar detection area and the maximum range of the anti-aircraft missile, the anti-aircraft missile will lock on to the UAV and attack the launch.Therefore, in order to prevent being targeted by anti-aircraft missiles, the UAV should avoid entering the radar detection area.

UAV Path Planning Algorithm Model
In this section, we introduce the development of the deep reinforcement learning algorithm, and improve the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, so that it can better solve the path planning problem in UAV low-altitude raid mission.There are two improvements.The first point is that the environment exploration Twin Delayed Deep Deterministic Policy Gradient (EE-TD3) algorithm model is proposed, which mainly divides and compresses the regional state information near the UAV in the 3D environment, detects the state of the regions in different directions and encodes it with binary numbers, and uses the encoded environmental state value as the input of the algorithm model.This real-time status information can improve the UAV's perception of the environmental information.The second point is that a heuristic dynamic reward function is designed to solve the sparse reward problem in traditional path planning methods, so that the UAV can get the corresponding reward according to the current real-time state, and the convergence speed of the algorithm model in the training process is accelerated.

Deep Reinforcement Learning Algorithm Model
Deep Reinforcement learning (DRL) is an end-to-end learning method that combines deep learning and reinforcement learning, which can simultaneously have the big data processing ability of deep learning (DL) and the decision-making ability of reinforcement learning (RL) [30].DRL has achieved great results in the autonomous control of agents, which can effectively solve the problems of UAVs in traditional path planning.Among the existing DRL algorithm models, some algorithms are suitable for solving discrete action space problems, such as the DQN algorithm, and some algorithms are suitable for solving continuous action space problems, such as the DDPG algorithm and TD3 algorithm.DDPG is called Deep Deterministic Policy Gradient, where Deep refers to the deep network structure, Deterministic refers to deterministic selection, where the purpose is to help the policy gradient avoid random selection and output specific action values.Policy The UAV carries air-to-ground missiles with a maximum firing range of 8 km, and the mission is considered complete when the UAV safely reaches the position 8 km away from the radar.The radar positions are equipped with anti-aircraft missile weapon systems (including guidance radars and anti-aircraft missiles), which have a maximum range of 25 km.If the UAV is within the radar detection area and the maximum range of the anti-aircraft missile, the anti-aircraft missile will lock on to the UAV and attack the launch.Therefore, in order to prevent being targeted by anti-aircraft missiles, the UAV should avoid entering the radar detection area.

UAV Path Planning Algorithm Model
In this section, we introduce the development of the deep reinforcement learning algorithm, and improve the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, so that it can better solve the path planning problem in UAV low-altitude raid mission.There are two improvements.The first point is that the environment exploration Twin Delayed Deep Deterministic Policy Gradient (EE-TD3) algorithm model is proposed, which mainly divides and compresses the regional state information near the UAV in the 3D environment, detects the state of the regions in different directions and encodes it with binary numbers, and uses the encoded environmental state value as the input of the algorithm model.This real-time status information can improve the UAV's perception of the environmental information.The second point is that a heuristic dynamic reward function is designed to solve the sparse reward problem in traditional path planning methods, so that the UAV can get the corresponding reward according to the current real-time state, and the convergence speed of the algorithm model in the training process is accelerated.

Deep Reinforcement Learning Algorithm Model
Deep Reinforcement learning (DRL) is an end-to-end learning method that combines deep learning and reinforcement learning, which can simultaneously have the big data processing ability of deep learning (DL) and the decision-making ability of reinforcement learning (RL) [30].DRL has achieved great results in the autonomous control of agents, which can effectively solve the problems of UAVs in traditional path planning.Among the existing DRL algorithm models, some algorithms are suitable for solving discrete action space problems, such as the DQN algorithm, and some algorithms are suitable for solving continuous action space problems, such as the DDPG algorithm and TD3 algorithm.DDPG is called Deep Deterministic Policy Gradient, where Deep refers to the deep network structure, Deterministic refers to deterministic selection, where the purpose is to help the policy gradient avoid random selection and output specific action values.Policy Gradient refers to the policy gradient algorithm, which can randomly select actions in a continuous action space based on the learned policy [31].
As shown in Figure 3, the DDPG algorithm adopts the end-to-end learning mode, takes the initial state information of the agent as the input, and the action strategy value µ(s t ) which is calculated by the algorithm as the output result.At the same time, random noise is added to the action policy to obtain the final output action.At the beginning of the task, the UAV outputs an action based on the current state s t , which is evaluated by the designed reward function to evaluate the value of the output action, so as to obtain the feedback reward r t from the environment.In this way, it can be concluded that the optimal action in the current state gives a positive reward, and vice versa gives a negative reward.In this process, the current state information, the action, the reward, and the state information of the next time (s t , a t , r t , s t+1 ) are stored in the experience pool.When the experience pool is full, sample data is randomly selected from it for training, and the action strategy is constantly adjusted during the training process.In order to enhance the stability and accuracy of the algorithm, the gradient descent method is used to update the iterative network parameters.The DDPG algorithm is used to subdivide the network structure into an online network and target network under the framework of the AC algorithm.The online network includes the Actor online network and Critic online network, which can complete real-time output actions, evaluate actions, and update the network parameters.The target network includes the Actor target network and Critic target network, which are used to update the Actor online network and Critic online network.The target network and the online network have the same neural network architecture and initialization parameters.However, during the training process, the parameters of the Actor target network and the Critic target network are updated in a soft changing way, which enhances the stability of the training process compared to directly copying the parameters of the online network.The update process can be expressed by the following formula [32]: Symmetry 2023, 15, x FOR PEER REVIEW 7 of 19 Gradient refers to the policy gradient algorithm, which can randomly select actions in a continuous action space based on the learned policy [31].As shown in Figure 3, the DDPG algorithm adopts the end-to-end learning mode, takes the initial state information of the agent as the input, and the action strategy value ( ) which is calculated by the algorithm as the output result.At the same time, random noise is added to the action policy to obtain the final output action.At the beginning of the task, the UAV outputs an action based on the current state , which is evaluated by the designed reward function to evaluate the value of the output action, so as to obtain the feedback reward r from the environment.In this way, it can be concluded that the optimal action in the current state gives a positive reward, and vice versa gives a negative reward.In this process, the current state information, the action, the reward, and the state information of the next time ( , , , ) are stored in the experience pool.When the experience pool is full, sample data is randomly selected from it for training, and the action strategy is constantly adjusted during the training process.In order to enhance the stability and accuracy of the algorithm, the gradient descent method is used to update the iterative network parameters.The DDPG algorithm is used to subdivide the network structure into an online network and target network under the framework of the AC algorithm.The online network includes the Actor online network and Critic online network, which can complete real-time output actions, evaluate actions, and update the network parameters.The target network includes the Actor target network and Critic target network, which are used to update the Actor online network and Critic online network.The target network and the online network have the same neural network architecture and initialization parameters.However, during the training process, the parameters of the Actor target network and the Critic target network are updated in a soft changing way, which enhances the stability of the training process compared to directly copying the parameters of the online network.The update process can be expressed by the following formula [32]:  In Equation (10), θ µ and θ Q represent the parameters of the Actor-Critic online network, θ µ and θ Q represent the parameters of the Actor-Critic target network, and τ ∈ (0, 1) is the update coefficient.TD3 algorithm is an improved algorithm based on the DDPG algorithm.TD3 algorithm introduces three key skills: The first is to add smoothing noise to the action value output by the Actor target network, which can make the estimation more accurate, but no noise is added to the final action value output by the Actor network.The TD3 algorithm minimizes the error in the Bellman equation by calculating the mean square, and simultaneously learns two Q-functions Q ϕ 1 and Q ϕ 2 .In addition, the action used to form the Q-Learning objective is based on the target policy µ θ targ , but a smoothing noise ∼ N(0, σ) is also added to each dimension of the action, and the specific formula can be expressed as: In Formula ( 11), ∼ σ is the standard deviation when calculating the normal distribution, and the larger the value is, the more noise is added.
The second is the clipped double Q-learning method.As shown in Figure 4, two independent Critic networks are used to estimate the Q value, and the smaller Q value is selected for updating when calculating the target Q value, which can effectively solve the problem of overestimation of the Q value in the DDPG algorithm.This way, the target policy becomes smoother and can avoid the problem of incorrect peaks that may be generated in the DDPG algorithm.The target policy can quickly use the peak and then produce brittleness or incorrect behavior.The specific formula for updating the Q value can be expressed as follows: Symmetry 2023, 15, x FOR PEER REVIEW 8 of 19 In Equation (10), and represent the parameters of the Actor-Critic online network, and represent the parameters of the Actor-Critic target network, and ∈ (0, 1) is the update coefficient.
TD3 algorithm is an improved algorithm based on the DDPG algorithm.TD3 algorithm introduces three key skills: The first is to add smoothing noise to the action value output by the Actor target network, which can make the estimation more accurate, but no noise is added to the final action value output by the Actor network.The TD3 algorithm minimizes the error in the Bellman equation by calculating the mean square, and simultaneously learns two Q-functions and .In addition, the action used to form the Q-Learning objective is based on the target policy , but a smoothing noise ~ (0, ) is also added to each dimension of the action, and the specific formula can be expressed as: In Formula ( 11), is the standard deviation when calculating the normal distribution, and the larger the value is, the more noise is added.
The second is the clipped double Q-learning method.As shown in Figure 4, two independent Critic networks are used to estimate the Q value, and the smaller Q value is selected for updating when calculating the target Q value, which can effectively solve the problem of overestimation of the Q value in the DDPG algorithm.This way, the target policy becomes smoother and can avoid the problem of incorrect peaks that may be generated in the DDPG algorithm.The target policy can quickly use the peak and then produce brittleness or incorrect behavior.The specific formula for updating the Q value can be expressed as follows:  Then, both Q-value networks learn by regressing to this target value.The specific formula can be expressed as: Thirdly, the Actor network in the TD3 algorithm is updated with a delay strategy, and the Critic network is updated normally.Therefore, the update frequency of the Critic network is faster than that of the Actor network, which can greatly reduce the error caused by the same frequency update and help to stabilize the training process.The pseudocode for the TD3 algorithm can be expressed as Algorithm 1: and observe reward r and new state s 6. Store transition tuple (s, a, r, s ) in Replay memory 7. Sample mini-batch of N transitions (s, a, r, s ) from Replay memory 8. Compute target action Update ∅ by the deterministic policy gradient: 13.

Environment Exploration Twin Delayed Deep Deterministic Policy Gradient Algorithm Model
In this section, the environment exploration coding mechanism and the heuristic dynamic reward function are proposed, which can be combined with the TD3 algorithm model to obtain the Environment Exploration Twin Delayed Deep Deterministic Policy Gradient (EE-TD3) algorithm model.This model can effectively solve the problem of complex environmental information in previous research and the sparse reward problem in traditional path planning research.In the next section, we will use this algorithm model to try to solve the UAV autonomous path planning problem through simulation experiments.

Coding Mechanism for Environmental Exploration
Most of the previous path planning studies take the current position information (x, y, z) of the UAV as the input of the algorithm model, take action according to the output action value, and then perform interactive learning by obtaining rewards from the environment.When the training model reaches convergence, the UAV can take the corresponding correct action at any position until the task is completed.However, when there are dynamic obstacles in the environment, because the trajectory of the obstacles is random, it is impossible to judge the direction of the obstacles only by using the current position information of the UAV as input, and it is difficult to effectively realize the purpose of avoiding obstacles.In order to make the UAV reach the destination efficiently in a complex and dynamic environment, the algorithm model must receive the environmental state information of the area near the UAV, and immediately identify the position of the obstacles and make decisions to avoid them.Therefore, this paper designs a symmetrical environment detection mechanism to encode the 3D environment information in the area near the UAV, so that the UAV can have better obstacle avoidance ability.
In modern warfare, the UAV is equipped with various sensors to detect the surrounding environment.It is assumed that the UAV is equipped with sensors that can detect the state of the region near itself, and the maximum action distance of the sensor is 100 m.With the information fed back by these sensors, the UAV can detect whether there are obstacles in the nearby area.Then, the binary number (1 or 0) is used to represent the existence or absence of obstacles, and a set of current environmental state information array can be obtained by state detection coding.This array is used as the input of the algorithm model, so that the algorithm model can make the right decision to control the UAV action according to the environmental state information in the nearby area.
As shown in Figure 5, the spherical area near the UAV in the 3D environment is divided.Since the spherical area has symmetry and the current 3D environment is a continuous space, infinite times of division can theoretically be carried out, but it will cause an increase in the amount of training calculation, so our scheme adopts limited times of division.It can be seen from Figure 5 that the UAV attachment area is divided according to the same Angle, where the Angle in the horizontal direction and the vertical direction is β, and the number of segmented areas is n.In the process of flight, a set of n-dimensional array S U AV is obtained by dividing and encoding the regional environment information detected by the UAV each time.
position information of the UAV as input, and it is difficult to effectively realize the purpose of avoiding obstacles.In order to make the UAV reach the destination efficiently in a complex and dynamic environment, the algorithm model must receive the environmental state information of the area near the UAV, and immediately identify the position of the obstacles and make decisions to avoid them.Therefore, this paper designs a symmetrical environment detection mechanism to encode the 3D environment information in the area near the UAV, so that the UAV can have better obstacle avoidance ability.
In modern warfare, the UAV is equipped with various sensors to detect the surrounding environment.It is assumed that the UAV is equipped with sensors that can detect the state of the region near itself, and the maximum action distance of the sensor is 100 m.With the information fed back by these sensors, the UAV can detect whether there are obstacles in the nearby area.Then, the binary number (1 or 0) is used to represent the existence or absence of obstacles, and a set of current environmental state information array can be obtained by state detection coding.This array is used as the input of the algorithm model, so that the algorithm model can make the right decision to control the UAV action according to the environmental state information in the nearby area.
As shown in Figure 5, the spherical area near the UAV in the 3D environment is divided.Since the spherical area has symmetry and the current 3D environment is a continuous space, infinite times of division can theoretically be carried out, but it will cause an increase in the amount of training calculation, so our scheme adopts limited times of division.It can be seen from Figure 5 that the UAV attachment area is divided according to the same Angle, where the Angle in the horizontal direction and the vertical direction is , and the number of segmented areas is .In the process of flight, a set of n-dimensional array is obtained by dividing and encoding the regional environment information detected by the UAV each time.Then, is used as the input of the algorithm model, and the UAV takes the next action according to the action output of the algorithm model.The environmental information exploration coding mechanism can be regarded as the compression coding of the state information of the area near the UAV, which simplifies the complex environmental information with interference and facilitates the algorithm model processing.In order to further verify the applicability of this method, we will select two cases = 45° and = 30° for experiments in the next section.Then, S U AV is used as the input of the algorithm model, and the UAV takes the next action according to the action output of the algorithm model.The environmental information exploration coding mechanism can be regarded as the compression coding of the state information of the area near the UAV, which simplifies the complex environmental information with interference and facilitates the algorithm model processing.In order to further verify the applicability of this method, we will select two cases β = 45 • and β = 30 • for experiments in the next section.

Heuristic Dynamic Reward Function
The reward function also known as the immediate reward is the key to the DRL technique to solve the problem.In the previous traditional reinforcement learning algorithms, the reward rules of navigation are relatively simple, generally only considering the reward of reaching the goal position when completing the task, and the punishment after hitting the obstacle.However, due to the relatively large 3D environment state space in this path planning task, if the UAV is only rewarded when completing the task and collision, and there is no reward in other states, the effective reward cannot be obtained in time, and the algorithm model will be difficult to converge in the training process.This leads to the problem of sparse reward [33].The training process can be accelerated by introducing a continuous reward function to guide the exploration.After many attempts, we designed a heuristic dynamic reward function.The specific formula can be expressed as follows.
In Equation (15), η is the reward coefficient and η ∈ (0, 1), d t is the distance between the UAV and the radar position in the current state, and d t+1 is the distance between the UAV and the target in the next state.The analysis of Equation (12) shows that when the UAV performs the current action, if the next state is closer to the target, the positive reward will be obtained, and the closer the distance to the target, the greater the positive reward value will be obtained.If the next state is further away from the goal, it receives a negative reward, and the further away from the goal, the greater the negative reward value.When the distance between the UAV and the target is less than 8 km, it is considered to have completed the mission and a positive reward of 300 is obtained.When the distance between the UAV and the dynamic obstacle is less than a safe distance or the UAV collides with the static obstacle, a negative reward of −300 is obtained and the environment is reset.

Simulation Experiment and Result Analysis
In this experiment, a battlefield simulation environment was constructed to verify the performance of the EE-TD3 algorithm, and the environment was set up with radar detection areas, mountains as static obstacles, and random low-altitude dynamic obstacles.The DDPG algorithm, TD3 algorithm and EE-TD3 algorithm will be verified in this battlefield simulation environment.All the experiments are conducted on a computer with Intel(R) Core(TM) i7-10700 CPU and NVIDIA GeForce RTX3060Ti GPU.Python3.9 is also used as the interpreter of the project program.Additionally, pytorch-1.12.1 is used as a deep learning framework to build neural networks.

Experimental Parameter Settings
The setting of hyperparameters can affect the convergence effect of the algorithm model, which plays a key role in the effectiveness of the final experimental results.The main hyperparameters in the deep reinforcement learning model are: deep neural network structure parameters, learning rate α, discount factor γ, experience pool storage capacity R, sampling number N, and target network soft update coefficient τ.
The three algorithms in this experiment are added noise, but as the training goes on, the model gradually tends to converge.If the noise value is too large, it may produce local oscillation and make the model difficult to converge.Therefore, the noise attenuation factor k ∈ (0, 1) is set in this experiment.During the training process, whenever the UAV reaches the end point to complete a task, the noise value is multiplied by the noise attenuation factor k to reduce the noise and to accelerate the convergence speed of the model.Since the DDPG algorithm, TD3 algorithm, and EE-TD3 algorithm all adopt the Actor-Critic structure, the learning rates α a and α c of the Actor network and Critic network are important hyperparameters.When α a is set to 0.0001 for multiple attempts, the model training effect is stable.Since the Critic network is responsible for evaluating the behavior of the model, and there are four independent neural networks in the Critic module in the TD3 algorithm and EE-TD3 algorithm, α c has a great influence on the training effect of the algorithm model.We set nine different α c for training.The convergence of the algorithm model in different environments is compared by analyzing the results.In addition, we also conducted performance test experiments on all algorithm models and analyzed the performance effects of each algorithm model to obtain the optimal algorithm model.
The specific hyperparameter settings for this experiment are shown in Table 1.

Experimental Results of the TD3 Algorithm Model
Figure 7 shows the training results of the TD3 algorithm in the battlefield simulation environment with nine Critic network learning rates .The analysis results show that the TD3 algorithm model tends to converge at 1550 rounds when is 0.0006, and the TD3 algorithm model tends to converge at 1320 rounds when is 0.0008.There is a certain degree of oscillation after these two results tend to converge, but the overall situation is convergent.Through further analysis it can be seen that the TD3 algorithm model has faster convergence speed and smaller oscillation amplitude than the DDPG algorithm model, so the TD3 algorithm model is more suitable for solving the UAV path planning problem.

Experimental Results of the TD3 Algorithm Model
Figure 7 shows the training results of the TD3 algorithm in the battlefield simulation environment with nine Critic network learning rates α c .The analysis results show that the TD3 algorithm model tends to converge at 1550 rounds when α c is 0.0006, and the TD3 algorithm model tends to converge at 1320 rounds when α c is 0.0008.There is a certain degree of oscillation after these two results tend to converge, but the overall situation is convergent.Through further analysis it can be seen that the TD3 algorithm model has faster convergence speed and smaller oscillation amplitude than the DDPG algorithm model, so the TD3 algorithm model is more suitable for solving the UAV path planning problem.In order to verify the applicability of the EE-TD3 algorithm, we selected two environmental region segmentation angles for experiments, and the specific experimental results are as follows:

Experimental Results of the EE-TD3 Algorithm Model
In order to verify the applicability of the EE-TD3 algorithm, we selected two environmental region segmentation angles for experiments, and the specific experimental results are as follows: (1) β = 45 • (2) = 30° Figure 9 shows the training effect achieved by the EE-TD3 algorithm (β = 30°) in a simulated battlefield environment with nine different Critic module network learning rates set.The analysis results show that the EE-TD3 algorithm model begins to con- (2) β = 30 •

Experimental Results of the Algorithm Performance Test
To further verify the actual performance of the EE-TD3 algorithm proposed in this paper, we conducted comparative experiments on all the algorithm models that have converged.In the experiment, we conducted 1000 rounds of independent experiments for the 3 algorithm models, and obtained the task success rate and average path length of each algorithm model.

Experimental Results of the Algorithm Performance Test
To further verify the actual performance of the EE-TD3 algorithm proposed in this paper, we conducted comparative experiments on all the algorithm models that have converged.In the experiment, we conducted 1000 rounds of independent experiments for the 3 algorithm models, and obtained the task success rate and average path length of each algorithm model.
Figures 10 and 11 represent the task success rate and average path length of each algorithm model in the performance testing experiment, respectively.We can see from this that the performance of the DDPG algorithm is relatively poor, and the performance of the TD3 algorithm and EE-TD3 algorithm is relatively good.Further analysis shows that the performance of EE-TD3 algorithm is better as the environment region is segmented smaller.Therefore, the EE-TD3 algorithm model is the optimal choice in this experiment.

Conclusions
This paper proposes the Environment Exploration Twin Delayed Deep Deterministic Policy Gradient (EE-TD3) algorithm for UAV autonomous path planning in complex 3D environments.First, the mission process of the UAV in 3D battlefield simulation environment is modeled, and the modeling of low-space 3D dynamic battlefield simulation environment is also completed.Then, a dynamic reward function with enlightening guidance is designed to solve the problem of sparse rewards in the tradition, which can allow the algorithm model to converge faster.On the basis of these early work, the DDPG algorithm was simulated by the td3 algorithm and the EE-TD3 algorithm, the noise attenuation factor was added to accelerate the convergence of the model, and nine different learning rates

Conclusions
This paper proposes the Environment Exploration Twin Delayed Deep Deterministic Policy Gradient (EE-TD3) algorithm for UAV autonomous path planning in complex 3D environments.First, the mission process of the UAV in 3D battlefield simulation environment is modeled, and the modeling of low-space 3D dynamic battlefield simulation environment is also completed.Then, a dynamic reward function with enlightening guidance is designed to solve the problem of sparse rewards in the tradition, which can allow the algorithm model to converge faster.On the basis of these early work, the DDPG algorithm was simulated by the td3 algorithm and the EE-TD3 algorithm, the noise attenuation factor was added to accelerate the convergence of the model, and nine different learning rates

Conclusions
This paper proposes the Environment Exploration Twin Delayed Deep Deterministic Policy Gradient (EE-TD3) algorithm for UAV autonomous path planning in complex 3D environments.First, the mission process of the UAV in 3D battlefield simulation environment is modeled, and the modeling of low-space 3D dynamic battlefield simulation environment is also completed.Then, a dynamic reward function with enlightening

Figure 3 .
Figure 3. Model diagram of the DDPG algorithm.Figure 3. Model diagram of the DDPG algorithm.

Figure 3 .
Figure 3. Model diagram of the DDPG algorithm.Figure 3. Model diagram of the DDPG algorithm.

Figure 5 .
Figure 5. Illustration of the exploration of the environmental information.

Figure 5 .
Figure 5. Illustration of the exploration of the environmental information.

Figure 6 19 Figure 6 .
Figure6shows the training effect achieved by the DDPG algorithm in the simulated battlefield environment after setting the learning rate α c of the module network for nine different critics.The analysis results show that although some cases can converge, the DDPG model converges relatively well when α c is 0.0005, and the model tends to converge after 2050 rounds.In general, the DDPG algorithm model has unstable convergence and obvious oscillation under different learning rates, and the effect is not ideal for solving the UAV path planning problem.Symmetry 2023, 15, x FOR PEER REVIEW 13 of 19

Figure 6 .
Figure 6.Training results of the DDPG algorithm model.

Figure 6 .
Figure 6.Training results of the DDPG algorithm model.

Figure 7 .
Figure 7. Training results of the TD3 algorithm model.

Figure 8 19 Figure 8 .
Figure 8 shows the training effect achieved by the EE-TD3 algorithm in a simulated battlefield environment (β = 45 • ) with nine Critic network learning rates α c .The analysis results show that the EE-TD3 algorithm model starts to converge at 1150 rounds when α c is 0.0003, and the EE-TD3 algorithm model starts to converge at 1070 rounds when α c is 0.0005.When α c is 0.0007, the EE-TD3 algorithm model starts to converge at 1385 rounds.When α c is 0.0009, the EE-TD3 algorithm model starts to converge in 1760 rounds.These four results tend to convergence; there is a small oscillation, but the overall situation is still convergent.At the same time, when β = 45 • , the optimal result of the EE-TD3 algorithm model has faster convergence speed and smaller oscillation amplitude than that of the TD3 algorithm model.Additionally, from the convergence success rate of the nine experiments, the EE-TD3 algorithm model is also higher than the TD3 algorithm model.Symmetry 2023, 15, x FOR PEER REVIEW 15 of 19

Figure 9
Figure 9 shows the training effect achieved by the EE-TD3 algorithm (β = 30 • ) in a simulated battlefield environment with nine different Critic module network learning rates α c set.The analysis results show that the EE-TD3 algorithm model begins to converge at 890 rounds when α c is 0.0002, the EE-TD3 algorithm model begins to converge at 2270 rounds when α c is 0.0003, and the EE-TD3 algorithm model begins to converge at 930 rounds when α c is 0.0004.When α c is 0.0005, the EE-TD3 algorithm model begins to converge at 2323 rounds.When α c is 0.0009, the EE-TD3 algorithm model begins to converge at 2285 rounds.Five results of the EE-TD3 algorithm model converge when β = 30 • .At the same time, it can be seen that the optimal result of the EE-TD3 algorithm model when β = 30 • is faster than the optimal result of the EE-TD3 algorithm model when β = 45 • .In addition, from the convergence success rate of these nine experiments, the training effect of the EE-TD3 algorithm model when β = 30 • is also better.In general, for the UAV autonomous path planning problem, the EE-TD3 algorithm model is generally better than the TD3 algorithm and DDPG algorithm.

Symmetry 2023 , 19 Figure 10 .
Figure 10.The success rate of the algorithm model.

Figure 11 .
Figure 11.Average path length of the algorithm model.

Figure 10 . 19 Figure 10 .
Figure 10.The success rate of the algorithm model.

Figure 11 .
Figure 11.Average path length of the algorithm model.

Figure 11 .
Figure 11.Average path length of the algorithm model.