A UAV Maneuver Decision-Making Algorithm for Autonomous Airdrop Based on Deep Reinforcement Learning

Operating an unmanned aerial vehicle (UAV) safely and efficiently in an interactive environment is challenging. A large amount of research has been devoted to improving the intelligence of a UAV while it performs a mission, and finding an optimal maneuver decision-making policy has become one of the key issues in enabling UAV autonomy. In this paper, we propose a maneuver decision-making algorithm based on deep reinforcement learning, which generates efficient maneuvers for a UAV agent to execute the airdrop mission autonomously in an interactive environment. In particular, the training set of the learning algorithm is constructed by Prioritized Experience Replay, which accelerates the convergence of decision network training. Extensive experimental results show that a desirable and effective maneuver decision-making policy can be found.


Introduction
With the development of control and electronic techniques in recent years, the performance of unmanned aerial vehicles (UAVs) has improved rapidly in all aspects. UAVs have been applied to assist or replace people in completing difficult missions because of their high mobility, high flight altitude, and low cost [1]. It is therefore necessary to improve the autonomy of UAVs so that special airdrop tasks, such as the delivery of relief supplies [2] and fire extinguishing, can be performed without risking human lives. Consequently, improving the autonomous flight capability of UAVs has become a research focus in various countries [3].
At present, airdrop tasks are typically implemented together with path planning [4], a method for searching for an optimal path from a start point to an end point while avoiding obstacles in the environment. Conventional path planning algorithms for UAVs include the Visibility Graph [5,6]; random sampling search algorithms, such as the Rapidly-exploring Random Tree [7] and the Probabilistic Roadmap [8]; heuristic algorithms, such as A-Star [9], Sparse A-Star (SAS) [10], and D* [11]; and genetic algorithms [12]. The UAV then flies to the target point by following the planned route, for which various trajectory tracking algorithms have been proposed [13,14]. However, schemes of this kind have some disadvantages. For example, an optimal route relies on a priori knowledge of the environment, but terrain and obstacle data are usually difficult to obtain, which limits our capability for environment modelling. Moreover, when the environment becomes dynamic and involves moving obstacles, these schemes are not flexible enough to alter their control strategies immediately; the paths have to be replanned to adapt to the changes in the environment. Therefore, it is desirable to design an end-to-end algorithm that can manipulate a UAV to fly autonomously in a dynamic environment without path planning and trajectory tracking.
A promising direction is inspired by the deep reinforcement learning systems developed by Google DeepMind, such as AlphaGo and the Deep Q Network (DQN) [15], an end-to-end decision-making algorithm that learned to play Atari games. The performance of this algorithm reached human level after extensive training, which shows the potential of reinforcement learning combined with deep learning for solving practical problems. Meanwhile, to solve the dimension explosion caused by a continuous action space, the Deep Deterministic Policy Gradient (DDPG) was proposed in Reference [16]. Experience replay is used in these Deep Reinforcement Learning (DRL)-based algorithms and allows agents to remember and learn from historical data. DDPG overcomes the dimension explosion caused by continuous action and state spaces. However, it forms its training set by sampling from the data memory with Uniform Experience Replay (UER), which does not fully exploit the diversity of historical data. Moreover, UER usually leads to slow convergence, and sometimes divergence, when training neural networks. Therefore, Prioritized Experience Replay (PER) was proposed to improve the efficiency of learning from experiences [17]. In this paper, we construct a new priority for each experience based on Double Q-Learning [18], which, compared with Reference [19], mitigates the convergence fluctuation caused by overestimation.
In the present work, we aim to tackle the challenges mentioned above and focus on UAV maneuver decision-making for the airdrop task. The main contributions of this paper are summarized as follows:
• The UAV Maneuver Decision-Making Model for airdrop tasks is built based on Markov Decision Processes (MDPs). In particular, we design the flight state space, the flight action space, and the reward functions. In designing and constructing these model components, we draw on air-to-ground drop theory.
• We propose the Maneuver Decision-Making Algorithm for autonomous airdrop based on DDPG with the PER sampling method (PER-DDPG). It trains a UAV in an interactive environment to generate efficient maneuvers under the target point constraint derived from the model we designed. Specifically, we design the decision-making function as a deep neural network and construct the training set sampling method based on PER.
• Simulation results show that the proposed algorithm improves the autonomy of a UAV during the airdrop task and that PER improves the efficiency of learning from experiences. Moreover, we find that the winning rate of the PER-based algorithm exceeds that of the UER-based algorithm by 2-4%.
This paper is organized as follows: Section 2 describes the background knowledge of all the methods used to design the UAV maneuver decision-making model and algorithm. Section 3 presents the details of the experiments we designed and compares the learning rate and winning rate under the UER and PER sampling methods. Section 4 concludes our work and discusses future research.

Methodology
UAVs have been used to help people finish dangerous and repetitive missions, such as crop protection, wildlife surveillance, traffic monitoring, electric power inspection, and search and rescue operations. A need for more advanced yet simpler UAV autonomous flight solutions has emerged. As mentioned above, the traditional solution to real-time obstacle avoidance for manipulators and UAVs is that an algorithm plans an optimal path and the UAV then follows the path by a trajectory tracking method. In this paper, we redefine the process of UAV autonomous flight and construct the UAV Maneuver Decision-Making Model for the Airdrop Task based on MDPs [20]. In addition, we propose a novel UAV Maneuver Decision-Making Algorithm based on Deep Reinforcement Learning [21].
As shown in Figure 1, we first construct the UAV Maneuver Decision-Making Model for the Airdrop Task based on MDPs. Within this model, we design the flight action space, the flight state space, and the flight assessment function, which capture the characteristics of UAV autonomous flight during an airdrop. Moreover, we design and implement the UAV Maneuver Decision-Making Algorithm, including the UAV maneuver decision-making network, the prioritized experience replay used to sample training data from historical experiences, and the network optimizer applied to train the networks.

Environment
During the airdrop task, UAV maneuver decision-making can be regarded as a sequential decision process. Moreover, when selecting the optimal action for the UAV, the controller usually considers only the current information from the environment. Thus, we can treat this decision process as Markovian and use MDPs to build the UAV maneuver decision-making model for the airdrop task. An MDP is described by the tuple

$$\{T, S, A(s), P(\cdot \mid s, a), R(s, a)\} \tag{1}$$

where T represents the decision time, S represents the system state space, A(s) represents the system action space, and the transition probability P(·|s, a) represents the probability distribution of the system state at the next moment when action a ∈ A(s) is taken in state s ∈ S. The reward function R(s, a) represents the benefit that the decision-maker obtains when action a ∈ A(s) is taken in state s ∈ S. Based on MDPs, we can give a complete mathematical description of the UAV airdrop task. As shown in Figure 2, an MDP proceeds as follows: starting from the initial state s_0, the decision-maker chooses an action a_0 and executes it, the system moves to the next state s_1 according to the transition probability P(·|s_0, a_0), and so on. In this process, the decision-maker earns the reward sequence (r_0, r_1, · · · ). The decision-maker is stimulated by these external rewards and maximizes them by constantly updating its policy. The action adopted by the decision-maker is a = µ(s), where µ(s) is the policy, and the utility function (the expected reward obtained from state s ∈ S by adopting the policy µ) is v(s, µ). When the current policy is the optimal policy µ*, Equation (2) is satisfied.

$$v(s, \mu^{*}) = \sup_{\mu} v(s, \mu), \quad s \in S \tag{2}$$

Based on the characteristics of UAV maneuver decision-making for airdrop, we use the infinite-stage discount model as the utility function, as shown in Equation (3).

$$v(s, \mu) = \sum_{t=0}^{\infty} \gamma^{t}\, E\left[ R(s_t, a_t) \mid s_0 = s,\ a_t = \mu(s_t) \right] \tag{3}$$
In the equation above, γ ∈ [0, 1] is the discount factor for future rewards and E[·] denotes mathematical expectation; thus, Equation (3) states that the objective of the discount model is the sum, over every decision moment t, of the expected reward multiplied by the discount factor. The optimal policy under the discount model can then be obtained iteratively.
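As an illustration of the discount model in Equation (3), the following minimal Python sketch computes the discounted sum of one sampled reward sequence; the expectation would be approximated by averaging this quantity over many sequences. The function name `discounted_return` and the sample rewards are illustrative assumptions and do not come from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one finite reward sequence."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Iterate backwards so each step is a single multiply-add (Horner's rule).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: a sparse reward of 1.0 received only at the final step.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.9))
```

With a sparse terminal reward, the return shrinks geometrically with the number of steps needed to reach termination, which is why shorter successful episodes are preferred under this model.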
In the following, we first define the problems involved in the airdrop task. Then, we design the state space S, the action space A(s), the transition probability model P(·|s, a), and the reward function R(s, a).

Problem Definition for the Airdrop Task
Before running a Reinforcement Learning (RL)-based algorithm, we should construct a simulation model of the problems to be solved. Thus, we define two problems that usually occur during the airdrop task. Generally, when the UAV prepares to perform the airdrop task, it first turns its nose towards the target area and then flies to the target position by following the UAV maneuver decision-making policy, as shown in Figure 3. In Figure 3a, N and E represent the North and East directions, and V_f and ψ_UAV are the velocity and azimuth of the UAV, respectively. The dashed line between the UAV and the target area indicates the expected azimuth of the UAV, denoted by ψ_LOS. In Figure 3b, the drop position is marked by a solid point, and D_LOS and δ_ψLOS represent the line-of-sight (LOS) distance and the relative azimuth of the LOS between the UAV and the target position. Therefore, two problems are involved in the airdrop task:
• Turn round problem: the UAV flies from a random starting point and turns to the demanded direction of the target area. During this process, a pilot usually manipulates the UAV and steers its azimuth towards the target direction.
• Guidance problem: the UAV starts from a random position and flies to a drop position given by the commander. If the pilot finishes this work manually, it takes considerable effort because the pilot has to plan an effective path and manipulate the UAV to follow it.
To simulate the process described above, a dynamical model of the UAV should be constructed. We adopt a dynamical model of the airdrop task based on the 3-DoF kinematic model of the UAV [22]. When the position and attitude of the UAV are known at time t, we can obtain the state of the UAV at t+1 by solving this model. Therefore, the transition probability of the UAV maneuver decision-making model is P(·|s, a) = 1, that is, the model is deterministic.
Based on the 3-DoF kinematic model of the UAV, the flight state is defined as (x, z, v, ψ_c), where (x, z) are the horizontal coordinates of the UAV in the geographical coordinate system, v is the flight speed, and ψ_c is the flight path azimuth angle of the UAV. The steering overload of the UAV is defined as N_s ∈ [−N_y^max, N_y^max], where N_y^max represents the maximum normal acceleration of the UAV in the body coordinate system. During the simulation, the algorithm outputs the current optimal maneuver control N_s, and the next state (x′, z′, v′, ψ_c′) of the UAV is calculated from the current state (x, z, v, ψ_c) according to the flight simulation model of the UAV.
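To make the deterministic state transition concrete, the sketch below advances the flight state (x, z, v, ψ_c) by one step under a steering overload N_s. The equations of motion used here (constant speed, turn rate proportional to N_s, explicit Euler integration) are a commonly used planar form and are an assumption; the exact model of Reference [22] is not reproduced in this section. The names `step_3dof`, `G`, and all numerical values are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def step_3dof(state, n_s, dt, n_max):
    """Advance (x, z, v, psi_c) by one Euler step under steering overload n_s.

    Assumed planar model (not taken from the paper):
        dx/dt = v * cos(psi_c), dz/dt = v * sin(psi_c), dpsi_c/dt = G * n_s / v
    with the speed v held constant.
    """
    x, z, v, psi_c = state
    # Clip the command to the admissible interval [-n_max, n_max].
    n_s = max(-n_max, min(n_max, n_s))
    x_next = x + v * math.cos(psi_c) * dt
    z_next = z + v * math.sin(psi_c) * dt
    # Keep the azimuth in [0, 2*pi).
    psi_next = (psi_c + G * n_s / v * dt) % (2.0 * math.pi)
    return (x_next, z_next, v, psi_next)


# Hypothetical values: 200 m/s, azimuth 0 rad, 0.5 s decision step.
state = (0.0, 0.0, 200.0, 0.0)
state = step_3dof(state, n_s=2.0, dt=0.5, n_max=3.0)
print(state)
```

Because the function has no random component, the same state and action always produce the same next state, which corresponds to P(·|s, a) = 1.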

State Representation and Action Design
Based on the problem definitions of the airdrop task, we design the state spaces of the two problems separately. The action space is the same for both problems because both are built on the 3-DoF kinematic model of the UAV.

(1) State Space of Turn Round Problem
Considering that the turn round problem is related to the azimuth of the UAV and the relative orientation between the UAV and the target area, we design the state space of the turn round problem from δ_ψLOS, the relative azimuth between the LOS and the nose direction of the UAV, and N_s, the steering overload of the UAV. δ_ψLOS is calculated by

$$\delta_{\psi_{LOS}} = \psi_{LOS} - \psi_{UAV}$$

where ψ_LOS is the azimuth of the LOS relative to North, and ψ_UAV is the azimuth of the UAV relative to North. Both angles lie in [0, 2π] and satisfy the right-hand rule.
(2) State Space of Guidance Problem
For the guidance problem of the airdrop task, we define its state space using d_LOS, which indicates the distance between the UAV and the drop position. Moreover, if we define X_UAV as the position of the UAV and X_TGT as the drop position, the distance is calculated by

$$d_{LOS} = \lVert X_{TGT} - X_{UAV} \rVert_{2}$$

where ‖·‖_2 represents the 2-norm of a vector.

(3) Action Space of Both Problems
Based on the flight simulation model of the UAV we constructed, the action space is established as

$$A = \{ N_s \mid N_s \in [-N_y^{max}, N_y^{max}] \}$$
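The geometric quantities that enter the two state spaces can be computed as in the following sketch. The coordinate convention (positions as (north, east) pairs, azimuths measured from North) and the wrapping of the relative azimuth into [−π, π) are assumptions made for illustration; the function name `los_geometry` is hypothetical.

```python
import math


def los_geometry(uav_xy, target_xy, psi_uav):
    """Return (d_los, psi_los, delta_psi_los) for a UAV and a target point.

    Assumed convention (illustrative): positions are (north, east) pairs and
    azimuths are measured from North, in radians, within [0, 2*pi).
    """
    d_north = target_xy[0] - uav_xy[0]
    d_east = target_xy[1] - uav_xy[1]
    # 2-norm of the relative position vector.
    d_los = math.hypot(d_north, d_east)
    # Azimuth of the line of sight relative to North.
    psi_los = math.atan2(d_east, d_north) % (2.0 * math.pi)
    # Relative azimuth, wrapped to [-pi, pi) so that small errors stay small.
    delta = (psi_los - psi_uav + math.pi) % (2.0 * math.pi) - math.pi
    return d_los, psi_los, delta


d, psi, delta = los_geometry((0.0, 0.0), (3000.0, 4000.0), psi_uav=0.0)
print(d, psi, delta)
```

Wrapping the difference avoids a jump of 2π when the two azimuths lie on opposite sides of North, which would otherwise appear as a large heading error to the decision network.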

Reward Function Based on Potential-Based Reward Shaping
In MDPs, the reward function determines the direction of policy iteration and directly reflects the agent's intention. The termination condition of the turn round problem is defined as

$$\left| \delta^{t}_{\psi_{LOS}} \right| \le \delta^{min}_{\psi}$$

where δ^t_ψLOS is δ_ψLOS at decision moment t, and δ^min_ψ is the minimum error of δ_ψLOS. This termination condition means that the turn round problem is considered solved when the UAV heads towards the target area with an error below a certain tolerance. Analogously, the termination condition of the guidance problem is

$$D^{t}_{LOS} \le d^{min}_{LOS}$$

where D^t_LOS is D_LOS at decision moment t, and d^min_LOS is the minimum error of the distance between the UAV and the required drop position.
Therefore, the reward function R(s, a) of the problems is defined as

$$R(s, a) = \begin{cases} 1.0, & \text{if the termination condition is satisfied} \\ 0.0, & \text{if the termination condition is not satisfied} \end{cases} \tag{11}$$
Equation (11) indicates that R(s, a) returns 1.0 if the UAV's state satisfies the termination condition of the problem, and 0.0 otherwise.
The proposed algorithm can search for the optimal policy with this kind of episodic reward, but there is a serious drawback that can slow policy convergence: the rewards returned by the environment are too sparse, so samples with nonzero reward, from which useful experience can be learned, are rare. Thus, the potential-based reward shaping (PBRS) method [23,24] was proposed to solve the problem brought by "sparse" rewards. PBRS provides a guidance signal that speeds up policy convergence by adding a reward shaping function F(s, a, s′) to the original reward function. Generally, F(s, a, s′) should satisfy

$$F(s, a, s') = \gamma \Phi(s') - \Phi(s) \tag{12}$$

where γ ∈ [0, 1] is a discount factor, and s ∈ S, a ∈ A(s), and s′ indicate the current state, the current action, and the next state, respectively. Moreover, Φ(s) is a potential energy function. If the agent's action brings it closer to the termination condition, it receives a positive shaping reward; otherwise, it receives a negative one. In summary, the traditional MDP can be rewritten as

$$\{T, S, A(s), P(\cdot \mid s, a), R(s, a, s')\} \tag{13}$$

and the new reward function is defined as

$$R(s, a, s') = R(s, a) + F(s, a, s') \tag{14}$$
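The combination of the sparse episodic reward and a potential-based shaping term can be sketched as follows. The potential function `potential` used here (a normalized distance to the target) is a hypothetical stand-in chosen for illustration; it is not the shaping function of Equation (15) or Equation (16), and all names and numbers are assumptions.

```python
def sparse_reward(terminated):
    """Episodic reward of Equation (11): 1.0 on termination, otherwise 0.0."""
    return 1.0 if terminated else 0.0


def shaping(phi_s, phi_next, gamma):
    """Potential-based shaping term F(s, a, s') = gamma * Phi(s') - Phi(s)."""
    return gamma * phi_next - phi_s


def shaped_reward(terminated, phi_s, phi_next, gamma):
    """New reward R(s, a, s') = R(s, a) + F(s, a, s')."""
    return sparse_reward(terminated) + shaping(phi_s, phi_next, gamma)


def potential(distance, max_distance):
    """Hypothetical potential: larger when the UAV is closer to the target."""
    return 1.0 - distance / max_distance


# Approaching the target raises the potential, so the shaping term is positive;
# moving away lowers it, so the shaping term is negative.
f_closer = shaping(potential(60e3, 100e3), potential(59e3, 100e3), gamma=1.0)
f_away = shaping(potential(60e3, 100e3), potential(61e3, 100e3), gamma=1.0)
print(f_closer > 0.0, f_away < 0.0)
```

Because the shaping term depends only on the difference of potentials between consecutive states, it gives a dense signal at every step while the sparse reward still marks the actual termination.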
In the following, we present the shaping function for each individual problem.
(1) The Shaping Function for Turn Round Problem
Based on the definition of the turn round problem, the shaping function for the turn round problem is given in Equation (15).

(2) The Shaping Function for Guidance Problem
The shaping function for the guidance problem is given in Equation (16).
In Equation (16), δ_dLOS is the distance by which the UAV approaches the drop position in one simulation step, and ω_δdLOS ∈ [0, 1] is the coefficient of the distance factor. Moreover, T is the simulation step, and v_max is the maximum speed of the UAV. δ_dLOS is calculated from d^t_LOS, the distance between the UAV and the drop position at the t-th decision step.

Although DDPG avoids the dimensional explosion brought by continuous state and action spaces, it does not consider the diversity of data and does not fully utilize historical experience. This results in slow convergence of DDPG's policy and poor stability of the converged result. Meanwhile, because each decision interval of UAV maneuver decision-making is short while the whole task lasts a long time, the reward changes only slightly from step to step. Thus, the value density of historical experience is low. For this reason, we use PER to generate training data [17]; it improves the utilization of the potential value of historical experiences, thereby increasing convergence speed and enhancing the stability of training results.

Figure 4 shows the block diagram of the PER-DDPG structure. At each decision-making step, the actor network outputs an action, perturbed by exploration noise, according to the state, and the current state, action, reward, and next state are packaged and stored in the experience memory D. When an experience is stored, it is bound to the probability used for PER sampling. Then, the training data are sampled from D by PER, and for every sampled transition the TD-error [25] between the current Q(s, a) and the target value is calculated; it is used to update the priority of the transition and is accumulated, weighted by importance sampling (IS) weights, to update the network parameters. Finally, the parameters of the main networks Q(s, a; θ^Q) and µ(s; θ^µ) are updated, and the parameters of the target networks Q′(s, a; θ^Q′) and µ′(s; θ^µ′) are updated smoothly for the sake of training stability.
At each moment, the algorithm gives the action by

$$a_t = \mu(s_t; \theta^{\mu}) \tag{18}$$

where s_t ∈ S is the current state, and a_t is the output of the actor function µ(s).
During the training process, the critic function Q(s, a) evaluates the current action given by the actor function, and the evaluation is used as the basis for updating µ(s).

The UAV Maneuver Decision-Making Network
As mentioned above, DDPG is a deep reinforcement learning algorithm based on the Actor-Critic framework. During the training process, the actor network outputs an action a ∈ A(s) according to the state s ∈ S generated by the environment. Meanwhile, the TD-error is used to optimize the critic network and update its parameters. Similarly, the actor network's parameters are optimized to maximize Q(s, a). Therefore, we must design the structures of the actor and critic networks, respectively, on the basis of deep learning (DL).
(1) Actor Network
The actor network µ(s; θ^µ) is mainly used to output the action in real-time decision-making according to the state. The input vector of the network is the current state s ∈ S, and the output vector of the network is the current action a ∈ A(s) calculated by µ(s; θ^µ). Considering the definition of the state space, the dimension of the network input is dim(S), and the dimension of the output is dim(A). Figure 5 shows the general structure of the actor network µ(s; θ^µ).

(2) Critic Network
The critic network Q(s, a; θ^Q) is used to evaluate the advantage of the current action a ∈ A(s) output by µ(s; θ^µ). The network input is [s, a], and the network output is Q(s, a). According to the state space and action space defined above, the dimension of the network input is dim(S) + dim(A), and the dimension of the network output is 1. Figure 6 shows the general structure of the critic network Q(s, a; θ^Q).

In addition, before the state and action are fed into a network, the values of the input vector should be normalized to eliminate the influence of the data's physical units. Moreover, the structures of the target networks µ′(s; θ^µ′) and Q′(s, a; θ^Q′) are the same as those of µ(s; θ^µ) and Q(s, a; θ^Q); only the method of updating their parameters differs.
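The input and output dimensions of the two networks can be illustrated with a small fully connected forward pass. This is only a sketch under assumptions: the hidden width, the ReLU hidden activations, the tanh output of the actor, and the names `init_mlp` and `forward` are all hypothetical, and the layer sizes of the paper's network tables are not reproduced here.

```python
import math
import random


def init_mlp(sizes, rng):
    """Create one (weights, biases) pair per layer for a fully connected net."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x, out_activation=None):
    """Forward pass with ReLU hidden layers and an optional output activation."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return [out_activation(v) for v in x] if out_activation else x


rng = random.Random(0)
DIM_S, DIM_A, HIDDEN = 2, 1, 16  # hypothetical sizes

# Actor: dim(S) inputs, dim(A) outputs. Critic: dim(S) + dim(A) inputs, 1 output.
actor = init_mlp([DIM_S, HIDDEN, HIDDEN, DIM_A], rng)
critic = init_mlp([DIM_S + DIM_A, HIDDEN, HIDDEN, 1], rng)

state = [0.25, -0.5]  # assumed to be normalized already
action = forward(actor, state, out_activation=math.tanh)  # bounded output
q_value = forward(critic, state + action)  # critic sees the pair [s, a]
print(len(action), len(q_value))
```

A bounded output activation such as tanh is one common way to keep the normalized action inside a fixed interval, which can then be rescaled to the physical range of N_s.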

The Training Procedure of UAV Maneuver Decision-Making Algorithm
Based on MDPs, the key issue in searching for the optimal UAV maneuver decision-making policy is to solve the optimization problem defined in Equation (19), where v(s, µ) is defined in Equation (3). In this paper, we use Double Q-Learning [18] to update the action-value function, as given in Equation (20). In that equation, s ∈ S is the current state, a ∈ A(s) is the current action, r = R(s, a, s′) is the current reward, s′ ∈ S is the next state, and σ ∈ [0, 1] is the learning rate of the algorithm. Equation (21) gives the loss function L(θ^Q) of the critic network Q(s, a; θ^Q).
The symbol δ_j is the TD-error, based on Double Q-Learning, of the j-th sample drawn from memory D. The TD-error describes the difference between Q(s, a; θ^Q) and the optimal goal, and it is defined as

$$\delta_j = y_j - Q(s_j, a_j; \theta^{Q}) \tag{22}$$

where y_j is the optimal goal of Q(s_j, a_j; θ^Q), (s, a, r, s′)_j is the j-th training sample, and s_j and a_j are the current state and action in (s, a, r, s′)_j, respectively. The symbol y_j is calculated by Equation (23), where r_j and s_{j+1} are the current reward and next state in (s, a, r, s′)_j. Thus, the gradient of the loss function L(θ^Q) is obtained from Equations (21) and (22), as shown in Equation (24).
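A TD-error computation for one stored transition can be sketched as below. The target value is written in the form commonly used with DDPG, where the target actor proposes the next action and the target critic evaluates it; that this matches the paper's target equation exactly is an assumption. The networks are passed as plain callables, and the constant stand-ins `q`, `q_target`, and `mu_target` are hypothetical.

```python
def td_error(transition, gamma, q, q_target, mu_target):
    """Return (delta, y) for one transition (s, a, r, s_next).

    y     = r + gamma * Q'(s_next, mu'(s_next))   # target value (assumed form)
    delta = y - Q(s, a)                           # TD-error
    Terminal transitions are not treated specially in this sketch.
    """
    s, a, r, s_next = transition
    y = r + gamma * q_target(s_next, mu_target(s_next))
    return y - q(s, a), y


# Hypothetical stand-ins for the main critic, target critic, and target actor.
q = lambda s, a: 0.5
q_target = lambda s, a: 0.4
mu_target = lambda s: 0.0

delta, y = td_error((0.1, 0.2, 1.0, 0.3), 0.9, q, q_target, mu_target)
print(delta, y)
```

Using separate target networks to build y keeps the regression target from moving with every update of the main critic, which is the stabilizing role described for the target networks.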
At the same time, we define the loss function L(θ^µ) of the actor network µ(s; θ^µ), given in Equation (25), in order to update the parameters of µ(s; θ^µ).
Thereby, we can obtain the gradient of L(θ^µ) according to the deterministic policy gradient theorem [26], as shown in Equation (26).
During network training, we use the PER method to sample training data from D in order to fully utilize the diversity of experiences. Usually, the training data are sampled by selecting a batch from D uniformly, which means that the probability P(i) of selecting each sample in D is equal. In contrast, P(i) under PER is not uniform; it is defined as

$$P(i) = \frac{p_i^{\alpha}}{\sum_{k} p_k^{\alpha}} \tag{27}$$

In the equation above, p_i is the priority of the i-th sample in D, and α is a hyperparameter. When α = 0, sampling reduces to pure UER. p_i is defined based on the TD-error, as shown in Equation (28).

$$p_i = |\delta_i| + \epsilon \tag{28}$$

In the equation above, δ_i is the TD-error of the i-th sample in D. Moreover, a small constant ε ≤ 0.0001 is introduced to prevent p_i from being 0.
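The priority and sampling probability described above can be sketched in a few lines. This is a minimal list-based illustration; practical implementations often use a sum-tree for efficiency, which is not shown. The function names and the sample TD-errors are hypothetical.

```python
import random

EPSILON = 1e-4  # small constant that keeps every priority positive


def priorities(td_errors, epsilon=EPSILON):
    """p_i = |delta_i| + epsilon."""
    return [abs(d) + epsilon for d in td_errors]


def sampling_probabilities(prios, alpha):
    """P(i) = p_i**alpha / sum_k p_k**alpha."""
    scaled = [p ** alpha for p in prios]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_indices(probs, k, rng):
    """Draw k indices (with replacement) according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=k)


td_errors = [0.0, 0.5, -2.0, 0.1]  # hypothetical TD-errors of four samples
probs = sampling_probabilities(priorities(td_errors), alpha=0.6)
batch = sample_indices(probs, k=2, rng=random.Random(0))

# With alpha = 0 every sample is equally likely, which is uniform replay.
uniform = sampling_probabilities(priorities(td_errors), alpha=0.0)
print(probs, batch, uniform)
```

The sample with zero TD-error still has a small positive probability because of the constant ε, so no experience is permanently excluded from training.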
Although PER improves the utilization of experiences, the training data sampled by PER follow a different distribution from those sampled by UER, and this bias also reduces the diversity of training samples. Therefore, importance sampling (IS) weights are introduced to correct the distribution error caused by PER. The IS weight ω_j is defined as

$$\omega_j = \left( \frac{1}{N} \cdot \frac{1}{P(j)} \right)^{\beta} \tag{29}$$

In the equation above, N is the size of D. When β = 1, the distribution error of the training set is fully compensated. Once δ_j is calculated, the actual update target is ω_j · δ_j, which replaces δ_j in Equation (24). Therefore, the final gradient ∆ of Q(s_j, a_j; θ^Q) is accumulated according to Equation (30). To ensure stable convergence of the network, ω_j is normalized by 1/max_i ω_i. Thereby, the actual IS weight ω_j is defined as

$$\omega_j = \frac{\left( N \cdot P(j) \right)^{-\beta}}{\max_i \omega_i} \tag{31}$$

At the same time, in the early stage of training, the distribution error caused by PER is small. Thus, we define an initial β_0 ∈ (0, 1) and gradually increase β to 1 as training proceeds.
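The normalized IS weights and the gradual increase of β can be sketched as follows. Two details are assumptions made for illustration: the maximum used for normalization is taken over the sampled batch, and β is increased linearly with the training step. The names `is_weights` and `anneal_beta` are hypothetical.

```python
def is_weights(sample_probs, memory_size, beta):
    """Normalized importance-sampling weights for the sampled transitions.

    Raw weight: (1 / (N * P(j)))**beta; each weight is then divided by the
    largest weight in the batch so that all weights lie in (0, 1].
    """
    raw = [(1.0 / (memory_size * p)) ** beta for p in sample_probs]
    largest = max(raw)
    return [w / largest for w in raw]


def anneal_beta(beta_0, step, total_steps):
    """Linearly increase beta from beta_0 to 1 (the linear schedule is assumed)."""
    fraction = min(1.0, step / total_steps)
    return beta_0 + (1.0 - beta_0) * fraction


beta = anneal_beta(beta_0=0.4, step=50, total_steps=100)
weights = is_weights([0.1, 0.4, 0.5], memory_size=3, beta=beta)
print(beta, weights)
```

Samples that PER draws more often receive smaller weights, so their larger sampling frequency is offset when the weighted TD-errors are accumulated into the gradient.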
In addition, for the sake of stable training, the parameters of the target networks µ′(s; θ^µ′) and Q′(s, a; θ^Q′) are updated by a "soft" update, that is, a smooth update, as shown in Equation (32).

$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\, \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\, \theta^{\mu'} \tag{32}$$

In the equation above, τ ∈ (0, 1) is the hyperparameter of the "soft" update. Moreover, random noise is added to improve the exploration ability of the deterministic policy used in the algorithm, as shown in Equation (33).

$$a_t = \mu(s_t; \theta^{\mu}) + \mathcal{N}(t) \tag{33}$$

In this equation, N(t) is a time-varying noise. Because UAV maneuver decision-making satisfies the Markov condition and the state changes as an inertial process, an autocorrelated noise model, the Ornstein-Uhlenbeck (OU) process [27], is used for action exploration. The iterative formula of N(t) is shown in Equation (34).
In Equation (34), x_t and x_{t+∆t} are the current and next values of the noise, respectively. µ and κ indicate the mean value and regression rate of the noise, respectively. Moreover, ∆t is the step of the noise, and dW_t represents the Wiener process.
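The "soft" target update and the OU exploration noise can be sketched together as follows. The OU process is discretized with the Euler-Maruyama scheme, and the volatility parameter `sigma` is an assumption added for illustration because it is not named in the text above; clipping the noisy action to the action range is likewise an assumed detail. All names and numerical values are hypothetical.

```python
import math
import random


def soft_update(target_params, main_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta' for every parameter."""
    return [tau * m + (1.0 - tau) * t for t, m in zip(target_params, main_params)]


class OUNoise:
    """Discretized Ornstein-Uhlenbeck process (Euler-Maruyama scheme).

    x_next = x + kappa * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
    The volatility sigma is an assumed extra parameter.
    """

    def __init__(self, mu, kappa, sigma, dt, rng):
        self.mu, self.kappa, self.sigma, self.dt, self.rng = mu, kappa, sigma, dt, rng
        self.x = mu

    def sample(self):
        # Increment of the Wiener process over one noise step.
        wiener_increment = math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += self.kappa * (self.mu - self.x) * self.dt + self.sigma * wiener_increment
        return self.x


def explore(action, noise, low, high):
    """Add exploration noise to the actor output and clip to the action range."""
    return max(low, min(high, action + noise.sample()))


target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01)
noise = OUNoise(mu=0.0, kappa=0.15, sigma=0.2, dt=0.5, rng=random.Random(0))
noisy_action = explore(0.3, noise, low=-1.0, high=1.0)
print(target, noisy_action)
```

Because each noise value depends on the previous one, consecutive exploratory actions are correlated, which suits a system whose state changes with inertia better than independent noise at every step.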
Finally, the training procedure of UAV maneuver decision-making algorithm is given in Algorithm 1.

Input:
    The hyperparameters of network training: the minibatch size k, the networks' learning rate η;
    The hyperparameters of policy updating: the policy's learning rate σ, the learning period K, the memory capacity N, the "soft" updating rate τ;
    The hyperparameters of sampling: the availability exponent of PER α, the IS exponent β;
    The control parameters of simulation: the maximum period M, the maximum step per period T.

4: Output a_0 according to Equation (18).
5: for t = 1 to T do
6:     Observe the current state s_t and reward r_t of the environment and calculate the current action a_t according to Equation (18).
7:     Save the current transition (s_t, a_t, r_t, s_{t+1}) into the experience memory D.
8:     if t mod K ≡ 0 then
9:         Reset the IS-weighted gradient ∆ = 0 of Q(s_j, a_j; θ^Q).
10:        for j = 1 to k do
11:            Sample training data j ∼ P(j) according to Equation (27).
12:            Calculate the IS weight ω_j according to Equation (31).
13:            Calculate the TD-error δ_j of the training data according to Equation (22) and update its priority according to Equation (28).
14:            Accumulate ∆ according to Equation (30).
15:        end for
16:        Update the parameters of Q(s, a; θ^Q) according to ∆ with learning rate η.
17:        Update the parameters of µ(s; θ^µ) according to Equation (26).
18:        Update the parameters of the target networks Q′(s, a; θ^Q′) and µ′(s; θ^µ′) according to Equation (32).
19:    end if
20: end for
21: end for

Results and Analysis
Based on the content above, we design experiments to verify the effectiveness of the proposed algorithm and to compare PER with UER in terms of the efficiency of policy optimization. In the following, we explain the settings of the simulation environment, the training results, and the results of the Monte-Carlo (MC) test experiments, together with their analysis.

The Settings of Simulation Environment
In the experiments we designed, the drop area and the UAV are restricted to a 100 km × 100 km airspace at a height of 5000 m. For each simulation experiment, the UAV's initial state is randomly generated, and the UAV may start from an arbitrary position in the flight airspace. To make the simulation closer to the real environment, we set T = 0.5 s because a UAV's control input is usually updated by a human pilot every 0.5 s ∼ 1 s.
Moreover, because each dimension of the state space has different physical units, the state and action should be normalized before they are input into Q(s, a; θ^Q) and µ(s; θ^µ). The details of the data are explained in Table 1. Thereby, we can normalize the parameters according to their physical meanings.

Parameter | Range | Meaning
δ_ψLOS | | The relative azimuth between the LOS and the nose of the UAV.
N_s | | The steering overload of the UAV.
d_LOS | | The distance between the UAV and the drop position.
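Normalizing each quantity by its physical range can be sketched as a linear map onto [−1, 1]. The target interval, the function names, and the ranges used below are illustrative assumptions; they are not the ranges listed in Table 1.

```python
import math


def normalize(value, low, high):
    """Map value linearly from [low, high] to [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


def denormalize(value, low, high):
    """Inverse map from [-1, 1] back to [low, high]."""
    return low + (value + 1.0) * (high - low) / 2.0


# Hypothetical ranges for a relative azimuth (rad) and a distance (m).
AZIMUTH_RANGE = (-math.pi, math.pi)
DISTANCE_RANGE = (0.0, 100e3)

state = [normalize(0.5, *AZIMUTH_RANGE), normalize(25e3, *DISTANCE_RANGE)]
print(state)
```

The inverse map is useful on the output side, where a normalized network action has to be converted back to a physical steering overload before it is applied to the flight model.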

The Parameters Setting of Algorithm
According to the training procedure of the algorithm, some parameters should be assigned before training starts. Table 2 lists the parameter assignments of the algorithm. Moreover, we design the structures of the networks µ(s; θ^µ) and Q(s, a; θ^Q) shown in Tables 3 and 4, respectively, according to the state space and action space of the turn round problem. In this paper, all networks are fully connected neural networks, that is, all layers are dense layers.

The Analysis of Simulation Results
Based on the settings above, we trained the networks successfully. The loss curves of the critic networks of UER-DDPG and PER-DDPG over time are shown in Figure 7. The loss of PER-DDPG converges faster than that of UER-DDPG. Moreover, the loss of PER-DDPG remains stable after converging to its minimum. In contrast, the loss of UER-DDPG fluctuates greatly around the 1000th episode while converging, and its convergence takes more time. Figure 8 shows the winning rates of the algorithms based on the different experience replay methods over the simulation episodes during training. Both winning rates exceed 80% and remain stable. This shows that DDPG with PER achieves a final winning rate comparable to that of UER-DDPG. However, the training process of PER-DDPG is much more stable than that of UER-DDPG because the winning rate of UER-DDPG fluctuates violently at the beginning of training. Meanwhile, the episode reward curves in Figure 9 further demonstrate that PER-DDPG is much steadier than UER-DDPG.
After training, we run a group of Monte-Carlo experiments for the trained results of UER-DDPG and PER-DDPG, with 1000 MC experiments for each result. As shown in Table 5, the trained PER-DDPG performs better than the trained UER-DDPG: the winning rate of PER-DDPG is about 3% higher than that of UER-DDPG. Meanwhile, we visualize some typical test results from the MC experiments. Figures 10-13 show the flight trajectory of the UAV and some parameters, including azimuth, reward, and action, over the simulation steps. In Figures 10 and 12, the red solid line represents the flight trajectory of the UAV, and the red dashed line and the blue dash-dot line indicate the termination azimuth of the UAV and the required azimuth of the LOS, respectively.
In Figures 11 and 13, the first row of each figure is the azimuth of the UAV over the simulation steps, the second row is the action of the UAV over the simulation steps, and the third row is the reward received by the UAV over the simulation steps.
The proposed algorithm based on PER-DDPG solves the turn round problem involved in the airdrop task, and its performance exceeds that of UER-DDPG. In summary, not only is the training process of the PER-DDPG-based algorithm more stable than that of UER-DDPG, but its trained result is also more effective.

The Parameters Setting of Algorithm
Based on the content above, the parameter assignments of the algorithm for the guidance problem are shown in Table 6. Moreover, according to the state space and action space of the guidance problem, the structures of the networks Q(s, a; θ^Q) and µ(s; θ^µ) are shown in Tables 7 and 8, respectively.

The Analysis of Simulation Results
Similarly, we analyze the training loss, the winning rate, and the episode rewards generated during the training of the algorithms. In Figure 14, PER-DDPG converges faster than UER-DDPG owing to its higher utilization of experiences, and PER-DDPG is much more stable than UER-DDPG because its loss fluctuates less. Figure 15 shows the winning rates of the algorithms based on the different experience replay methods over the simulation episodes during training. The winning rate curves of UER-DDPG and PER-DDPG become stable after some simulation episodes and maintain a high value, and the winning rate of PER-DDPG is clearly higher than that of UER-DDPG once the initial fluctuation has passed. Meanwhile, the comparison of episode rewards in Figure 16 demonstrates that PER-DDPG is much steadier than UER-DDPG because the episode rewards of PER-DDPG fluctuate clearly less.
After training, we run a set of Monte-Carlo experiments for the trained results of UER-DDPG and PER-DDPG, with 1000 experiments for each result. As shown in Table 9, the winning rate of PER-DDPG is approximately 3.5% higher than that of UER-DDPG, so the trained PER-DDPG performs better than the trained UER-DDPG. Moreover, we visualize some typical test results from the MC experiments to make our analysis more convincing. Figures 17-20 show the flight trajectory of the UAV and some parameters, including reward and action, over the simulation steps. In Figures 17 and 19, the red solid line represents the flight trajectory of the UAV, and the blue dashed circle represents the maximum range of the drop area. The red solid point and the green solid point indicate the start position and the drop position, respectively. In Figures 18 and 20, the top figure is the action of the UAV over the simulation steps, and the bottom figure is the reward received by the UAV over the simulation steps.
According to the results and analysis above, the proposed algorithm based on PER-DDPG solves the guidance problem involved in the airdrop task, and its performance exceeds that of UER-DDPG. As with the turn round problem, not only is the training process of the PER-DDPG-based algorithm more stable than that of UER-DDPG, but its trained result is also more effective when solving the guidance problem.

Conclusions
For the airdrop task, we identified and described two key issues: the turn round problem and the guidance problem. Based on the definitions of these problems, we designed the UAV maneuver decision-making model for the airdrop task based on MDPs and constructed the state space, the action space, and the reward function based on PBRS. Then, we proposed the UAV maneuver decision-making algorithm for autonomous airdrop based on Deep Reinforcement Learning. In particular, we used Prioritized Experience Replay to improve the utilization of experiences during training. The results showed that, after training, the proposed algorithm is able to solve the turn round problem and the guidance problem. Moreover, PER-DDPG converges faster and more stably than UER-DDPG, and the performance of its trained result is also better. In the future, we will investigate UAV autonomous flight when the state is only partially observed. We will also extend the proposed algorithm to control a real UAV in order to improve UAV autonomy while performing special missions in the real world.