Robust Motion Control for UAV in Dynamic Uncertain Environments Using Deep Reinforcement Learning

Abstract: In this paper, a novel deep reinforcement learning (DRL) method, robust deep deterministic policy gradient (Robust-DDPG), is proposed for developing a controller that allows robust flying of an unmanned aerial vehicle (UAV) in dynamic uncertain environments. This technique is applicable in many fields, such as penetration and remote surveillance. The learning-based controller is constructed with an actor-critic framework and can perform dual-channel continuous control (roll and speed) of the UAV. To overcome the fragility and volatility of the original DDPG, three critical learning tricks are introduced in Robust-DDPG: (1) a delayed-learning trick, providing stable learning while facing dynamic environments; (2) an adversarial-attack trick, improving the policy's adaptability to uncertain environments; and (3) a mixed-exploration trick, enabling faster convergence of the model. The training experiments show great improvement in convergence speed, convergence effect, and stability. The exploiting experiments demonstrate high efficiency in providing the UAV a shorter and smoother path, while the generalization experiments verify its better adaptability to complicated, dynamic, and uncertain environments, compared with the Deep Q Network (DQN) and DDPG algorithms.


Introduction
Safe and reliable motion control for unmanned aerial vehicles (UAVs) is an open and challenging problem in the realm of autonomous robotics. Successfully flying from arbitrary departures to destinations while avoiding ubiquitous threats, without any human intervention, is essential for a UAV in many practical applications [1,2], such as search and rescue [3], remote sensing [4,5], goods delivery [6], and Destruction or Suppression of Enemy Air Defenses (DEAD/SEAD) [7]. To maintain autonomous mobility, the UAV has to handle challenges in Observation, Orientation, Decision and Action (OODA) simultaneously, and these become particularly difficult in dynamic uncertain environments. Massive uncertain surroundings and unpredictable moving threats make any pre-planned motion strategy unusable. Developing novel techniques that can provide the UAV with robust motion strategies in these complex environments is therefore a crucial requirement for the near future.
Traditional approaches, such as A* [8], RRT [9], artificial potential fields [10], and simultaneous localization and mapping (SLAM) [11], employ two steps to handle these motion control problems in unknown environments [12]: (i) perceive and estimate the environment state; and (ii) model and optimize the control command. These approaches are often susceptible to unforeseen disturbances: any incomplete perception, biased estimate, or inaccurate model will lead to poor performance [13].
Their model-based scheme makes it difficult to apply such approaches to dynamic uncertain environments because the state transition models of the environments are usually unknown in those cases. Moreover, these traditional approaches use an open-loop mechanism [14] that makes decisions without any reasoning about the future, and the decision process has to be executed repeatedly to compensate for the changes.
To overcome the limitations mentioned above, researchers have resorted to learning or adaptive approaches. Reinforcement learning (RL) [15], for instance, lets agents learn the right actions to take with little or no prior knowledge of the system model, and its predictive learning scheme adapts easily to stochastically changing conditions. For these reasons, RL has become a promising tool for improving autonomous flight in many different UAV applications. Junell [16] modelled quadrotor guidance as a high-level reinforcement learning problem and successfully carried out an autonomous flying test in an unknown environment. Luo [17] proposed Deep-Sarsa, a novel path planning and obstacle avoidance approach, to navigate multiple UAVs autonomously in a dynamic environment. Imanberdiyev [18] used a model-based reinforcement learning algorithm, TEXPLORE, to solve the UAV autonomous navigation problem and demonstrated that it outperforms a Q-learning-based method. Since traditional RL methods can only deal with discrete states, all of these studies have to simplify the environment into a discrete grid, which differs from the practical situation faced by UAVs. To better represent the high-dimensional continuous state space, deep neural networks were introduced into conventional RL, producing deep reinforcement learning (DRL) methods. A series of DRL algorithms, such as Deep Q Network (DQN) [19], Double DQN [20], Dueling DQN [21], and Prioritized DQN [22], were proposed one after another, and some of them have been utilized in the field of UAV control and have achieved outstanding performance [23-26]. Kersandt [24] established learning-based high-level controllers to navigate a UAV across a complicated environment with different DRL algorithms. Polvara [25] learned a DQN-based intelligent controller and successfully controlled the UAV to land on moving platforms. Conde [26] designed time-varying controllers with DQN to drive multiple UAVs into any formation as quickly as possible. However, these value-based DRL methods have drawbacks, including that they can only address cases with discrete actions, which is the reason these applications only realized discrete direction control of the UAV.
To achieve continuous control, policy gradient methods [27] are introduced into DRL, which derive parameterized stochastic or deterministic policies with continuous actions by performing gradient descent in the parameter space. Silver [28] proposed a deterministic policy gradient (DPG) algorithm and demonstrated that it can significantly outperform its stochastic counterparts in high-dimensional action spaces. Lillicrap [29] combined DQN and DPG within the actor-critic framework and produced the deep deterministic policy gradient (DDPG) algorithm, which can map continuous observations directly to continuous actions. While DDPG may perform well sometimes, it is frequently brittle when completing tasks, and challenges arise when DDPG is applied to UAV adaptive motion control problems. Firstly, a UAV is sensitive to rapidly changing speed, and adding a speed control channel into DDPG makes the training process unstable. For this reason, most studies only take the heading control channel into account [25,30,31] in their UAV navigation tasks. This simplification limits their practical application scope. Secondly, given that the actor and critic are closely related in DDPG, an over-estimation by the critic leads to policy oscillations in the actor, and these oscillations result in frequent crashes of the UAV when facing dynamic uncertain surroundings. Lastly, DDPG itself is sensitive to hyper-parameters and exploration schemes, and any irrational setting can lead to unstable learning, which is the reason that the trust region policy optimization (TRPO) [32] and proximal policy optimization (PPO) [33] algorithms were proposed. With respect to the motion control problem, the dynamic uncertain environment expands the state and action space, which increases the difficulty of exploration.
To address these challenges, we conducted some exploratory research and propose an improved DRL algorithm named Robust-DDPG. This new algorithm is used to provide the UAV with robust motion control in dynamic uncertain environments. Specifically, we make the following contributions in this paper:
(1) We develop an actor-critic-based motion control framework, which can perform dual-channel control of roll and speed by predicting the desired steering angle and by reasoning about the possible collision probability. The controller can provide safe flights for the UAV autonomously in dynamic uncertain environments.
(2) We propose an efficient policy-based DRL algorithm, Robust-DDPG, in which three critical tricks are introduced to provide a robust controller for the UAV. The first is a delayed-learning trick, in which the critic and actor networks are batch updated after each episode finishes, rather than being updated in each iteration. The second is an adversarial-attack trick, in which an adversarial scheme is introduced to sample noisy states and actions in the learning process. This trick increases the robustness of the trained networks. The last is a mixed-exploration trick, in which a rough sampling based on ε-greedy and a fine sampling based on Gaussian noise are performed in different periods of learning. By combining these three tricks, Robust-DDPG is able to overcome the shortcomings of DDPG and provide the UAV with a controller that adapts better to complicated, dynamic, and uncertain environments.
(3) We constructed a UAV mission platform to simulate dynamic stochastic environments for training and evaluating the effectiveness and robustness of our proposed methods. Through a series of experiments, we show that our trained UAV can adapt to various dynamic uncertain environments, with neither a map of the environment nor retraining or fine-tuning.
The remainder of this paper is organized as follows. Section 2 introduces the UAV motion control problem and formulates it as an MDP. Section 3 elaborates the core approach, Robust-DDPG, for problem solving, where three improved tricks, delayed learning, adversarial attack, and mixed exploration are integrated into an actor-critic framework. The performance, effectiveness, and adaptability of the proposed algorithm are demonstrated through a series of experiments in Section 4. Section 5 conducts a further discussion about the experimental results. Section 6 concludes this paper and envisages some future work.

Kinematics of UAV
The six-degrees-of-freedom (DoF) aircraft model is the most accurate one for UAV flight control. However, we assume that the UAV carries an onboard autopilot that provides the low-level flight controls in a fast inner loop and maintains roll, pitch, and yaw stability, as well as velocity tracking and altitude holding functions [34,35]. For the sake of brevity and without loss of generality, we adopt a kinematics model with four DoF as a substitute for the six-DoF one and apply it in the design of a high-level controller for the UAV. In our kinematics, we assume that the UAV flies at a constant altitude and performs inertial coordinated turns, in which the bank angle is set so that the centrifugal force acting on the aircraft is equal and opposite to the horizontal component of the lift acting in the radial direction [36]. These assumptions are reasonable in many realistic cases and allow us to focus on the motion control algorithms.
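Let $p_u := (x_u, y_u)$ and $\dot{p}_u := (\dot{x}_u, \dot{y}_u)$ denote the planar position and velocity in Cartesian inertial coordinates, respectively. By taking some additional disturbances into account, the continuous-time kinematics of our UAV reads [37]
$$\begin{cases} \dot{x}_u = v_u \cos\psi_u + \eta_{\dot{x}} \\ \dot{y}_u = v_u \sin\psi_u + \eta_{\dot{y}} \\ \dot{\psi}_u = -\dfrac{g}{v_u}\tan\phi_u + \eta_{\dot{\psi}} \\ \dot{\phi}_u = f(\phi_u, u_\phi) \end{cases} \quad (1)$$
where $g$ denotes the acceleration due to gravity, $\psi_u$ and $\phi_u$ denote the heading angle and roll angle, and $v_u$ denotes the linear velocity of the UAV. $\eta_{\dot{x}}$, $\eta_{\dot{y}}$, and $\eta_{\dot{\psi}}$ are the disturbance terms on the velocity and heading rate, which are drawn from normal distributions $N(0, \sigma^2)$.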
By introducing these stochastic factors into the states of the model, we can partly make up for the loss of the un-modelled dynamics. The function $f(\phi_u, u_\phi)$ defines the roll dynamics, which depend on the specific problem. The negative sign in the third row reflects the different sign conventions of the roll and heading directions: the heading angle is defined as positive in the anticlockwise direction, while a positive roll turns the UAV clockwise.
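As a minimal sketch of how the first three rows of Equation (1) might be integrated in simulation, the Python function below applies one forward-Euler step, assuming the standard coordinated-turn form (velocity components $v_u\cos\psi_u$ and $v_u\sin\psi_u$, heading rate $-(g/v_u)\tan\phi_u$). The function name `kinematics_step`, the step size `dt`, and the choice to treat the roll angle as already given rather than modelling $f(\phi_u, u_\phi)$ are illustrative assumptions, not part of the original model.

```python
import math
import random

G = 9.81  # gravitational acceleration in m/s^2


def kinematics_step(x, y, psi, v, phi, dt=0.1, sigma=0.0, rng=None):
    """One forward-Euler step of the planar coordinated-turn kinematics.

    Assumes v > 0. `dt` and `sigma` are illustrative values only.
    """
    rng = rng or random.Random(0)
    # Disturbances drawn from N(0, sigma^2); all zero when sigma == 0.
    eta_x, eta_y, eta_psi = (rng.gauss(0.0, sigma) for _ in range(3))
    x_next = x + (v * math.cos(psi) + eta_x) * dt
    y_next = y + (v * math.sin(psi) + eta_y) * dt
    # Negative sign: a positive roll decreases the heading angle.
    psi_next = psi + (-(G / v) * math.tan(phi) + eta_psi) * dt
    return x_next, y_next, psi_next
```

With zero roll and zero disturbance the UAV simply advances along its heading, while a positive roll yields a negative heading change, which matches the sign convention discussed for the third row.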

Dual-Channel Control for UAV
Mobility control of a fixed-wing UAV in real environments can be complex, especially when the environment is unknown and changing rapidly. In our scenario, a fixed-wing UAV has to fly across a dynamic unknown area until a specified target is finally reached. This is an arduous situation for the UAV because of the ubiquitous mobile threats. Unlike most navigation studies, which only consider heading control [30,31] for the UAV, we utilize dual-channel control to provide the UAV with better flight robustness, because the additional speed control channel offers the UAV more avoidance options when it faces dynamic environments.
Let $a_t = [a_{v,t}, a_{\phi,t}]^T$ be the control vector for the UAV at time $t$, in which $a_{v,t}$ and $a_{\phi,t}$ are the control commands for the speed and the roll, respectively. $a_{v,t}$ represents the probability of collision, which is used to modulate the forward speed of the UAV [1], while $a_{\phi,t}$ is the steering rate that turns the current roll to the desired one. Specifically, low-pass filters are used to provide soft updates of the speed and roll angle, where $\lambda_v$ and $\lambda_\phi$ are tuning factors selected empirically to trade off smoothness against reactiveness of the flight. From the speed controller, we conclude that the UAV will gradually accelerate to the maximal speed $v_{u,max}$ if the collision probability $a_{v,t}$ is 0, and will slow down to 0 when $a_{v,t}$ approaches 1. Similarly, the roll controller maps the predicted steering rate $a_{\phi,t}$ into a desired roll angle $\phi_{u,t}$. Once the speed and roll are updated, the UAV moves to a new position by integrating Equation (1).
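The low-pass behaviour described above can be illustrated with a short sketch. It assumes a first-order filter of the form "new = (1 − λ)·old + λ·target", with the speed target $(1 - a_{v,t})\,v_{u,max}$ and the roll target $a_{\phi,t}\,\phi_{u,max}$; this form is an assumption chosen to reproduce the limiting behaviour stated in the text, not necessarily the authors' exact filter. The function names and the value 0.3 for both tuning factors are hypothetical.

```python
import math


def update_speed(v_prev, a_v, v_max=45.0, lam_v=0.3):
    """Blend the previous speed with the target (1 - a_v) * v_max.

    a_v in [0, 1] is the predicted collision probability.
    """
    return (1.0 - lam_v) * v_prev + lam_v * (1.0 - a_v) * v_max


def update_roll(phi_prev, a_phi, phi_max=math.pi / 2, lam_phi=0.3):
    """Blend the previous roll with the target a_phi * phi_max.

    a_phi in [-1, 1] is the predicted steering signal.
    """
    return (1.0 - lam_phi) * phi_prev + lam_phi * a_phi * phi_max
```

Repeatedly applying `update_speed` with `a_v = 0` drives the speed towards `v_max`, and with `a_v = 1` towards zero; a smaller tuning factor gives a smoother but slower response.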

UAV Motion Control as an MDP
To provide the UAV with a robust controller that adapts to dynamic uncertain environments, we focus on the reinforcement learning technique. RL uses a Markov decision process (MDP) [31,38] to model the controller, in which the problem is abstracted into an Agent and an Environment. The agent can learn optimal sequential control policies from the historical trajectories accumulated by trial-and-error interactions with the environment. At each time step, the agent perceives the current system state $s \in S$ and selects a favorable action $a \in A$ depending on $s$ and its knowledge of past experiences. After applying this action $a$ to the UAV, a new system state $s' \in S$ and a reward signal $r$ arrive at the agent, and then the cycle repeats. Figure 1 illustrates the RL-based motion control structure of the UAV. The state $s$, action $a$, and reward $r$ constitute the core elements of the controller.

State and Action Specification
The state represents a collection of all the information that the UAV can obtain. In this paper, onboard GPS and gyroscope devices provide the agent with its own state $\xi_u$, which comprises the planar position $(x_u, y_u)$, the planar speed, the heading $\psi_u$, and the roll angle $\phi_u$ of the UAV. A LiDAR [39] with $N_r$ rays is mounted on the UAV to keep sensing changes in the surroundings. At each sampling moment, the agent receives the environment state $\xi_e = [d_1, d_2, \ldots, d_{N_r}]^T$, where $d_i$ denotes the relative distance between the UAV and the threats detected by the $i$-th ray (depicted in Figure 2). Besides, the target state $\xi_T = [x_T, y_T]^T$ is supposed to be transmitted to the agent periodically by an indicator-like device, where $(x_T, y_T)$ represents the position of the target. The system state $s$ is then obtained by combining $\xi_u$, $\xi_e$, and $\xi_T$.
The fixed-wing UAV maneuvers by selecting appropriate speed and roll control commands and holding them for one second or until the next commands are selected. In our scenario, the two control commands are represented by the action $a = [a_v, a_\phi]^T$, where $a_v \in [0, 1]$ denotes the estimated probability of collision used to control the forward speed of the UAV, while $a_\phi \in [-1, 1]$ is a steering signal selected to turn the UAV to the desired roll angle.

Reward Shaping
The reward $r(s, a)$ acts as a signal evaluating how good it is to take an action $a$ in a state $s$ [31]. The rewards are the only feedback signals available for the agent's learning. Accordingly, a well-shaped reward function should contain as much useful human experience as possible. In this paper, we abandon the commonly used sparse reward and shape a non-sparse reward scheme [40] that incorporates our domain knowledge about the motion control problem to precisely describe the tiny impact of the selected policy. Four basic experiences are considered to construct the non-sparse reward: (A) the UAV is urged to fly to the target, so any action that brings the UAV closer to the target should be rewarded and any action that drives it away should be punished, and the faster the approaching or leaving, the greater the reward or penalty; (B) the UAV is required to complete the mission as soon as possible, so a greater approaching speed deserves a greater reward; (C) the UAV should fly towards the target, so any deviation from the target direction should be punished; (D) the UAV should be proactive in avoiding collisions with the threats, so if it is quickly approaching a threat, a great penalty should be assigned to remind the UAV to slow down or turn immediately. We formulate these four experiences as the reward terms $r_A$, $r_B$, $r_C$, and $r_D$, in which $D_{ut}^{pre}$ and $D_{ut}^{cur}$ denote the previous and current relative distances between the UAV and the target; $\Delta\psi$ denotes the angle by which the UAV flight direction deviates from the target; $v_u$ is the current speed of the UAV; $D_s$ is the detection distance of the sensor; and $D_f$ is the distance of the detected threat in front of the UAV, which is set to $D_s$ if there is no threat ahead. All the variables can be found in Figure 3. By construction, $r_A$ is a reward when $D_{ut}^{pre} > D_{ut}^{cur}$ and a penalty otherwise, $r_B$ is always a reward, and $r_C$ and $r_D$ are penalties.
To summarize, the reward function can be finally formulated as the weighted sum $r(s, a) = \mu_1 r_A + \mu_2 r_B + \mu_3 r_C + \mu_4 r_D$, where $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$ indicate the contribution rates of the four items and $\sum_{i=1}^{4} \mu_i = 1$.
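To make the structure of the shaped reward concrete, the sketch below combines four terms as a weighted sum. The specific term formulas used here (signed progress for $r_A$, normalized speed for $r_B$, normalized heading deviation for $r_C$, normalized proximity for $r_D$) and the equal weights are placeholders assumed for illustration; they only respect the signs described in the text and should not be read as the authors' equations.

```python
import math


def shaped_reward(d_pre, d_cur, v_u, delta_psi, d_f,
                  v_max=45.0, d_s=40.0, mu=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four reward terms (all term forms are assumptions)."""
    if abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("contribution rates must sum to 1")
    r_a = d_pre - d_cur                 # > 0 when approaching the target
    r_b = v_u / v_max                   # always a reward
    r_c = -abs(delta_psi) / math.pi     # penalty for heading deviation
    r_d = -(d_s - d_f) / d_s            # penalty when a threat is close ahead
    return sum(m * r for m, r in zip(mu, (r_a, r_b, r_c, r_d)))
```

Under these assumptions, a step that closes the distance to the target with no threat in view scores higher than the mirrored step that moves away, which is the qualitative behaviour the four experiences ask for.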

Non-Myopic Objective
In a reinforcement learning setup, the state space $S$, action space $A$, reward function $r(s, a)$, and transition dynamics $p(s'|s, a)$ constitute a standard MDP. Unlike traditional myopic methods, which make decisions relying only on the immediate reward $r$, RL learns an optimal control policy $\pi$ by maximizing a non-myopic objective, namely the expected cumulative reward over a prediction horizon $H$. The policy $\pi$ is defined to map system states to a probability distribution over the actions, $\pi : S \to P(A)$. For an arbitrary pair $(s_t, a_t)$, the Q-function $Q^\pi(s_t, a_t)$ (Equation (10)) is defined to describe the expected long-term cumulative return when performing an action $a_t$ in state $s_t$ and following $\pi$ thereafter. Subsequently, the optimal action is determined by maximizing the Q-function over the actions (Equation (11)). Although the non-myopic scheme described above can provide a robust control policy by fully considering its impact on the future, it is troubled by the curse of dimensionality when calculating the Q-value, because the agent faces a continuous state space and a continuous action space in our scenario and, more seriously, the transition dynamics $p(s'|s, a)$ are unknown to the agent. Heuristic or evolutionary algorithms become intractable for solving Equation (11). To address this, we design a deep neural network, $\mu(s|\theta^\mu)$, to approximate the function $\arg\max_a Q^\pi(s_t, a_t)$, so that the network directly maps the high-dimensional continuous state $s_t$ to the optimal action $a_t$. As illustrated in Figure 4, this approximation simplifies the intractable planning problem (Equation (11)) into a deep neural network training problem, i.e., finding the optimal parameters $\theta^\mu$ of $\mu(s|\theta^\mu)$. Section 3 describes the specific learning framework and learning techniques.
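For reference, the quantities discussed above can be written in the standard finite-horizon form below. The discount factor $\gamma$ and the exact indexing follow common RL notation [15] and are assumptions here; they are not necessarily identical to the authors' Equations (10) and (11).

```latex
J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=1}^{H} \gamma^{t-1}\, r(s_t, a_t)\right],
\qquad
Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{k=t}^{H} \gamma^{k-t}\, r(s_k, a_k) \,\middle|\, s_t, a_t\right],
\qquad
a_t^{*} = \arg\max_{a} Q^{\pi}(s_t, a).
```

Written this way, it is clear why the maximization is hard in our setting: evaluating $Q^{\pi}$ requires an expectation over unknown transitions, and the $\arg\max$ ranges over a continuous action set.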

Robust-DDPG for UAV Motion Control
This section introduces an actor-critic framework for training the agent described in Section 2, and proposes a novel DRL algorithm that addresses the challenges of the traditional DDPG and provides the UAV with robust end-to-end motion control strategies.

Actor-Critic Framework
As shown in Section 2, the UAV motion control problem is modeled with a continuous state space and a continuous action space, which renders Q-learning [15] and Deep Q-learning (DQN) [19] inapplicable because of their poor efficiency in representing a continuous policy. Instead, an actor-critic approach is considered in this research. As depicted in Figure 5, the actor-critic approach extends DQN with a deterministic parameterized actor function $\mu(s|\theta^\mu)$, which defines the policy by deterministically mapping a state to a specific action, i.e., $a = \mu(s|\theta^\mu)$. By replacing the policy gradient (PG) [15] $a \sim \pi(s|\theta^\pi)$ with the deterministic policy gradient (DPG) [28] $a = \mu(s|\theta^\mu)$, the agent learns an optimal policy without any action sampling or action integration in each iteration. It is a practical technique for handling problems with continuous high-dimensional action spaces. Specifically, a deep neural network with parameters $\theta^\mu$ is used as an approximator for the actor (red box in the actor module in Figure 5), inspired by DQN, which exploits a parameterized deep neural network $Q(s, a|\theta^Q)$ (red box in the critic module in Figure 5) to approximate the Q-function in Equation (10). To disrupt the correlation between samples and maintain stable learning, an experience replay strategy (blue modules in Figure 5) and two fixed target networks, $\mu'(s|\theta^{\mu'})$ and $Q'(s, a|\theta^{Q'})$ (white boxes in the actor and critic modules in Figure 5), are created to provide target signals for critic updating. The representative algorithm with this actor-critic framework is deep deterministic policy gradient (DDPG) [29]. With this framework, the agent learns by interacting with the environment repeatedly until a robust policy network $\mu(s|\theta^\mu)$ is obtained. At each time step $t$, the agent receives the observation $s_t$ and selects an action by $a_t = \mu(s_t|\theta^\mu) + N_t$, where $N_t$ is the exploration noise. Then, the action $a_t$ is applied to control the flight of the UAV, a reward $r_t$ is returned, and a new state $s_{t+1}$ is observed.
The transition $(s_t, a_t, r_t, s_{t+1})$ is then stored in the experience pool. Afterwards, a mini-batch of $N$ transitions $\{(s_i, a_i, r_i, s_{i+1})\}_{i=1}^{N}$ is sampled from the pool to calculate two gradients, the critic gradient in Equation (12) and the actor gradient in Equation (13), where $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$ is the target signal of the critic. The two gradients are then used to update the Q network parameters $\theta^Q$ and the policy network parameters $\theta^\mu$, respectively. Finally, the two target network parameters $\theta^{Q'}$ and $\theta^{\mu'}$ are updated by a soft or hard update strategy.
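The target signal and the target-network update described above can be sketched in a few lines. This is a simplified illustration that works on plain Python lists rather than network tensors: the function names are hypothetical, the mean-squared critic loss is the usual DDPG choice from [29], terminal-state handling is omitted, and the defaults $\gamma = 0.9$ and $\tau = 0.01$ are taken from the experimental settings reported later in the paper.

```python
def td_targets(rewards, next_q_values, gamma=0.9):
    """y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) for each sampled transition."""
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]


def critic_loss(targets, q_values):
    """Mean squared error between target signals and current critic outputs."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / n


def soft_update(target_params, online_params, tau=0.01):
    """theta' <- (1 - tau) * theta' + tau * theta, applied element-wise."""
    return [(1.0 - tau) * tp + tau * op
            for tp, op in zip(target_params, online_params)]
```

A hard update corresponds to `tau = 1.0`, which copies the online parameters into the target network in one step; small values of `tau` make the target signal drift slowly and thereby stabilize critic learning.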

Robust Learning Techniques
Within the actor-critic framework, DDPG can handle problems with both a high-dimensional continuous state space and a high-dimensional continuous action space. However, DDPG often behaves precariously, since it is brittle with respect to hyper-parameters and environments, and any small perturbation may break it. It is risky to directly apply such an unstable model as the motion controller of the UAV. In this work, we propose an improved approach, Robust-DDPG, to address the issues of DDPG by introducing a delayed-learning trick, an adversarial-attack trick, and a mixed-exploration trick.

Delayed Learning Trick
The original DDPG adopts a direct updating scheme in the learning process, in which both the critic and actor parameters are updated at each time step. Theoretically, direct updating produces more training steps and could accelerate convergence, but it often leads to unstable agents, and policy jittering occurs occasionally when the agent is exploited in practical applications. Fujimoto et al. [41] attributed these failures to accumulating errors of the estimated Q-function and proposed the Twin Delayed DDPG (TD3) approach to address them. In our view, the underlying reason is that the traditional training method of DDPG changes the strategic direction of the policy too frequently, which in turn confuses the agent in policy learning. Let us take UAV motion control as an example to illustrate this intuition. For a given state, there are two policies for the UAV to choose from: a sound policy with low speed and small turns that keeps the collision penalties as low as possible, and a radical policy with high speed and big turns that lets the UAV finish the task as soon as possible. Both policies may yield high rewards. In other words, with a changed strategic direction, the learned critic may provide similar evaluations for two completely different policies, and if the change happens frequently, the actor will get lost in the learning.
To address this drawback, a delayed learning trick is designed here. Different from the policy delay in TD3, our trick delays the actor and critic network learning operations to the end of each episode. This ensures that the actor and the critic obey the same principle during an ongoing episode; after each episode finishes, intensive learning begins. Since it avoids repeated changes of the strategic direction, this trick can stabilize the training to a certain extent. In addition, a fixed interval is set for the soft updates of the target networks. A quick procedure of the delayed learning trick for DDPG is depicted in Algorithm 1.

Algorithm 1: Delayed learning trick for DDPG
...
4: select action $a_t$ according to policy and exploration
5: execute action $a_t$ and observe new state $s_{t+1}$ and reward $r_t$
6: store transition $(s_t, a_t, r_t, s_{t+1})$ in experience pool
7: end while
8: for l = 1, t do //Delayed Learning
9: sample a random mini-batch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})_{i=1,N}$
10: update critic network weights using gradient in Equation (12)
11: update actor network weights using gradient in Equation (13)
...
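The delayed-learning procedure described above can be sketched as a two-phase episode loop. The callables `env_step`, `select_action`, and `learn` are hypothetical interfaces introduced only for this illustration, and the periodic target-network update is left out for brevity; the point of the sketch is that no gradient step happens while the episode is running.

```python
import random


def run_episode_with_delayed_learning(env_step, select_action, learn, s0, pool,
                                      batch_size=4, max_steps=1000, rng=None):
    """Collect one full episode first, then perform as many learning steps
    as environment steps were taken (mirrors 'for l = 1, t do')."""
    rng = rng or random.Random(0)
    s, t, done = s0, 0, False
    # Phase 1: interaction only; the policy stays fixed within the episode.
    while not done and t < max_steps:
        a = select_action(s)
        s_next, r, done = env_step(s, a)
        pool.append((s, a, r, s_next))
        s, t = s_next, t + 1
    # Phase 2: delayed, intensive learning on random mini-batches.
    for _ in range(t):
        batch = rng.sample(pool, min(batch_size, len(pool)))
        learn(batch)
    return t
```

Compared with a per-step update, the total number of gradient steps is unchanged; only their timing moves to the end of the episode.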

Adversarial Attack Trick
Most DRL algorithms are trained in ideal virtual environments, but noise is ubiquitous in realistic applications. It has been shown that an ideally trained agent can easily be fooled into wrong policies by perturbing its input with adversarial noise [42]. For safety-critical domains, e.g., UAVs and robotics, robustness assumes much greater importance, since a tiny adversarial noise may lead to undesirable and hazardous results.
To train a robust agent that adapts successfully to real-world variations, an adversarial attack trick is introduced into the learning process of DDPG. An adversarial attack is defined as any possible perturbation that could cause a trained agent to fail [43]. More specifically, an attack mechanism is designed to generate random noise and add it to the current observation of the state, with the hope that these noise samples will fool the agent into taking bad actions. Algorithm 2 outlines the adversarial attack trick for DDPG, where the current state $s$, the trained critic target network $Q'(s, a|\theta^{Q'})$, and the trained actor $\mu(s|\theta^\mu)$ constitute the inputs and the corrupted state $s_{noise}$ forms the output. The main idea behind the attack is to search repeatedly in the neighborhood of the current state $s$ until the state $s_{noise}$ that makes the agent select the worst action is found. An adversarial depth $N_a$ can be assigned to bound the search. All the noise samples in the attack are drawn from a Gaussian distribution with standard deviation $\sigma_s$.
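One possible reading of the attack described above is a random search: perturb the current state a fixed number of times, let the actor act on each perturbed state, and keep the perturbation whose induced action the critic rates lowest in the true state. The sketch below follows that reading; the function name `state_attack`, the scoring rule, and the use of plain callables for the critic and actor are assumptions made for illustration, while the defaults $N_a = 5$ and $\sigma_s = 0.5$ come from the experimental settings reported later.

```python
import random


def state_attack(critic, actor, s, n_a=5, sigma_s=0.5, rng=None):
    """Return the noisy state (among n_a Gaussian candidates around s)
    whose induced action has the lowest critic value in the true state."""
    rng = rng or random.Random(0)
    s_noise, worst_q = list(s), critic(s, actor(s))
    for _ in range(n_a):
        candidate = [x + rng.gauss(0.0, sigma_s) for x in s]
        # Score the action chosen under the corrupted observation.
        q = critic(s, actor(candidate))
        if q < worst_q:
            s_noise, worst_q = candidate, q
    return s_noise
```

If none of the candidates lowers the critic value, the function returns an unmodified copy of `s`, so the attack never makes the observation "better" than the clean one.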

Mixed Exploration Trick
A major challenge of learning in continuous action spaces is exploration. DDPG constructs an exploration action $a_t$ by adding noise $N_t$ sampled from an Ornstein-Uhlenbeck (OU) process to the online policy $\mu(s_t|\theta^\mu)$, that is, $a_t = \mathrm{clip}(\mu(s_t|\theta^\mu) + N_t, a_{low}, a_{high})$, where $\mathrm{clip}(\cdot)$ is a clipping function that limits $a_t$ to $[a_{low}, a_{high}]$. This scheme produces temporally correlated exploration and is theoretically efficient for continuous physical control problems [29]. However, its actual effect in practical applications varies with the selected parameters $(u, \vartheta, \sigma)$. For example, at the start of training, we usually need diverse action samples to ensure efficient learning, which means that a bigger $\sigma$ is required so that noise with enough deviation is generated. In reality, however, a bigger $\sigma$ causes a large proportion of actions to stay at the margin $a_{low}$ or $a_{high}$ after clipping. It is these bad samples that drag down the learning efficiency and exploiting stability of DDPG. To cope with this drawback, we adopt a mixed exploration strategy in this paper. Specifically, a dynamic ε-greedy scheme is used to provide a rough exploration at the start of training, and an OU process is adopted to conduct a fine exploration in the following training stages. The direct sampling of ε-greedy from a uniform distribution maintains sufficient sample diversity and facilitates a fast convergence. The main idea of the mixed exploration trick is described in Algorithm 3.
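A minimal sketch of the mixed exploration idea is given below for a single action channel. It assumes that, with probability ε, the action is drawn uniformly from the admissible range (rough exploration) and that otherwise OU noise is added to the policy output and clipped (fine exploration); since ε decays during training, the uniform branch dominates early and the OU branch dominates later. The class `OUNoise`, the function `explore_action`, and the unit-time-step OU discretization are illustrative assumptions, not the authors' Algorithm 3.

```python
import random


class OUNoise:
    """Discrete Ornstein-Uhlenbeck process with unit time step."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.4, rng=None):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.rng = rng or random.Random(0)
        self.x = mu

    def sample(self):
        # Mean-reverting drift plus Gaussian diffusion.
        self.x += (self.theta * (self.mu - self.x)
                   + self.sigma * self.rng.gauss(0.0, 1.0))
        return self.x


def explore_action(policy_action, epsilon, ou, rng, a_low=-1.0, a_high=1.0):
    """Uniform sample with probability epsilon, else clipped OU-perturbed action."""
    if rng.random() < epsilon:
        return rng.uniform(a_low, a_high)
    noisy = policy_action + ou.sample()
    return min(max(noisy, a_low), a_high)
```

The uniform branch never piles samples up at the bounds, which is the failure mode of large-σ OU noise described above.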

Overall Robust-DDPG Algorithm
Ultimately, we present the robust deep deterministic policy gradient algorithm (Robust-DDPG) by introducing the delayed learning trick, the adversarial attack trick and the mixed exploration trick described in Sections 3.2.1-3.2.3 into DDPG. Robust-DDPG is summarized in Algorithm 4.

Algorithm 4: Robust-DDPG
1: randomly initialize critic network $Q(s, a|\theta^Q)$ and actor $\mu(s|\theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
2: initialize target networks $Q'(s, a|\theta^{Q'})$ and $\mu'(s|\theta^{\mu'})$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
3: initialize hyper-parameters: attack point $A$, experience pool $D$, batch size $N$, target update interval $K$
4: for e = 1, E do
5: reset environment and receive initial observation state $s_1$
6: while not collide and not target and t < T do
7: if e ≥ A do
8: start an adversarial attack by $s_t \leftarrow$ StateAttack($Q'(s, a|\theta^{Q'})$, $\mu(s|\theta^\mu)$, $s_t$)
9: select action through mixed exploration $a_t \leftarrow$ ActionExplore($\mu(s|\theta^\mu)$, $s_t$, $\epsilon_t$)
10: else do
11: select action through mixed exploration $a_t \leftarrow$ ActionExplore($\mu(s|\theta^\mu)$, $s_t$, $\epsilon_t$)
12: end if
13: execute action $a_t$ and observe new state $s_{t+1}$ and reward $r_t$
14: store transition $(s_t, a_t, r_t, s_{t+1})$ in $D$
15: update critic and actor weights by ...
...
24: end if
25: end for
26: end for

After thousands of training steps with Algorithm 4, the final policy network $\mu(s|\theta^\mu)$ is obtained, and it can then be used continuously for autonomous motion control of the UAV by $a_t = \mu(s_t|\theta^\mu)$. Owing to the three newly introduced tricks, the Robust-DDPG-based controller is expected to adapt better to complicated, dynamic, and uncertain environments than the original DDPG algorithm.

Results
This section presents experiments for evaluating the performance, the effectiveness and the adaptability of the proposed robust motion controller of the UAV through training experiments, exploiting experiments, and generalization experiments.

Experimental Platform and Settings
For the training and testing of the DRL-based motion controller, we construct a general simulation platform, depicted in Figure 6. The platform simulates a world with a total size of 400 × 300 m² (the rectangular area), and a series of threats (the 24 white cylinders with different heights and sizes) are randomly scattered in the world. A certain proportion of the threats are set to move with stochastic velocities that obey a uniform distribution U(1,6). A fixed-wing UAV (the blue entity) is required to fly across the unknown world until a specified target (the green circle) is finally reached. The UAV is supposed to fly at an altitude of 100 m and be equipped with a sensor capable of detecting an area of 40 m ahead ($D_s = 40$) and ±45 degrees from left to right (the blue sector in front of the UAV). Whenever an object is detected, the corresponding blue beams are set to red so that the user can intuitively see the interaction between the UAV and the environment. The mobility of the UAV is limited by a maximum velocity $v_{u,max} = 45$ m/s and a maximum roll $\phi_{u,max} = \pi/2$. The motion uncertainty of the UAV is considered by defining disturbances σ .
As described in Section 3.1, two neural networks constitute the core of the controller: the critic is a 40 × 100 × 100 × 1 fully connected neural network and the actor has a 38 × 100 × 100 × 2 structure. The observed states are normalized as a 38-dimensional input to the actor, and the two-dimensional output actions are used to control the UAV's motion. The Adam optimizer is employed to learn the network parameters with the same learning rate $\alpha_Q = \alpha_\mu = 10^{-4}$ for the actor and the critic. Other hyper-parameters are set as follows: discount factor $\gamma = 0.9$, batch size $N = 256$, experience pool size $D = 10{,}000$, target update interval $K = 200$, and attack point $A = 4000$. In addition, the soft update tuning factor is $\tau = 0.01$, the adversarial depth is $N_a = 5$, and the standard deviation is $\sigma_s = 0.5$. A descending ε-greedy scheme $\epsilon_t \leftarrow \max(\epsilon_t - \alpha_\epsilon \nabla\epsilon, \epsilon_{min})$ mixed with an $OU(u, \vartheta, \sigma)$ process is used to explore the action space, where $\epsilon_0 = 0.5$, $\alpha_\epsilon = 0.0001$, and $\nabla\epsilon = 0.4$ provide a proper descent of $\epsilon$ and $\epsilon_{min} = 0.0001$ sets its lower bound. $u = 0$, $\vartheta = 0.15$, and $\sigma = 0.4$ are selected to generate temporally correlated explorations. Besides, the maximum episode length $T$ is set to 1000.
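To get a feel for the ε schedule implied by these settings, the sketch below replays the update rule and computes how many updates it takes to reach the floor. Each update lowers ε by $\alpha_\epsilon \nabla\epsilon = 4 \times 10^{-5}$, so starting from 0.5 the floor of 0.0001 is reached after roughly 12,500 updates. Whether the update is applied per time step or per episode is not stated here, so the unit of "update" is left open; the function names are hypothetical.

```python
import math


def epsilon_schedule(eps0=0.5, alpha_eps=1e-4, grad_eps=0.4, eps_min=1e-4):
    """Yield epsilon after each update, clipped from below at eps_min."""
    eps = eps0
    while True:
        eps = max(eps - alpha_eps * grad_eps, eps_min)
        yield eps


def updates_until_floor(eps0=0.5, alpha_eps=1e-4, grad_eps=0.4, eps_min=1e-4):
    """Number of updates after which epsilon first equals eps_min."""
    return math.ceil((eps0 - eps_min) / (alpha_eps * grad_eps))
```

After the floor is reached, almost every action comes from the OU-perturbed policy, which corresponds to the fine exploration stage.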

Experiment I: Performance of Robust-DDPG
Before it can be exploited, the DRL-based controller has to be trained. To demonstrate the performance of the proposed Robust-DDPG, comparative experiments are necessary. Specifically, we take two other state-of-the-art DRL algorithms, DQN and DDPG, as baselines and re-implement them with almost the same hyper-parameter settings as Robust-DDPG. For DQN, we assume the UAV flies at a constant velocity of 20 m/s and simplify the control commands to a_{φ,t} ∈ {−1, 0, 1} only. All three agents are trained in the same dynamic uncertain environment, depicted in Figure 5. Each experiment runs for 7000 episodes, with 5000 episodes for learning and 2000 episodes for exploiting. In each episode, the UAV, target, and threats are randomly re-deployed throughout the world.
To measure the performance, we define four quantitative evaluation indicators: hit rate, crash rate, lost rate, and average reward. For each agent, the hit rate, crash rate, and lost rate are the percentages of the latest 500 episodes in which the UAV successfully hits the target, crashes into a threat, or remains trapped until the episode ends, respectively. The average reward is defined as the mean value of the rewards per episode. Because the average reward fluctuates severely, we further average it every 500 episodes. The final learning results are illustrated in Figure 7, including the convergence curves of the hit rate (Figure 7a) and of the average reward (Figure 7b). Two observations can be made. (1) In the learning stage, Robust-DDPG achieves faster convergence, a higher hit rate, and a larger average reward than the other two algorithms. This is because the delayed learning and mixed exploration techniques give Robust-DDPG more efficient data utilization and more stable policy updates. (2) In the exploiting stage (right of the dotted line in Figure 7a), the added motion disturbances cause significant decreases in the hit rates of DQN and DDPG, but not in that of Robust-DDPG. This good adaptability to uncertain environments comes from the adversarial attack technique used in Robust-DDPG. For a more specific verification, we further count the hit rates, crash rates, and lost rates of the different agents in the different stages. All the results are arranged in Table 1. Robust-DDPG achieves the best performance among the three algorithms.
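The windowed indicators defined above can be computed with a simple rolling counter. The following is a minimal sketch, assuming that every episode ends in exactly one of three outcomes; the class name `OutcomeTracker` and the outcome labels are hypothetical and not taken from the authors' implementation.

```python
from collections import Counter, deque

OUTCOMES = ("hit", "crash", "lost")


class OutcomeTracker:
    """Keeps the outcomes of the most recent episodes and reports their rates."""

    def __init__(self, window=500):
        # deque(maxlen=...) silently discards the oldest episode once full.
        self.outcomes = deque(maxlen=window)

    def record(self, outcome):
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome!r}")
        self.outcomes.append(outcome)

    def rates(self):
        """Return hit, crash and lost rates (in percent) over the current window."""
        n = len(self.outcomes)
        if n == 0:
            return {name: 0.0 for name in OUTCOMES}
        counts = Counter(self.outcomes)
        return {name: 100.0 * counts[name] / n for name in OUTCOMES}


tracker = OutcomeTracker(window=500)
for outcome in ["hit", "hit", "crash", "lost", "hit"]:
    tracker.record(outcome)
print(tracker.rates())
```

Because each episode has exactly one outcome, the three rates always sum to 100%, which is a convenient sanity check when reading Table 1.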
Comparing across algorithms, Robust-DDPG increases the hit rate by (10.2%, 3.9%) in the learning stage and by (20.8%, 12.1%) in the exploiting stage relative to (DQN, DDPG); it decreases the crash rate by (8.3%, 3.0%) and (14.7%, 10.5%), and the lost rate by (1.9%, 0.9%) and (6.1%, 1.6%), in the two stages, respectively. Robust-DDPG is therefore more efficient than DQN and DDPG. Comparing across stages, moving from the learning stage to the exploiting stage decreases the hit rate of Robust-DDPG by only 1.6%, whereas the hit rates of DQN and DDPG decrease by 12.2% and 9.8%; its crash rate increases by only 0.6%, against 7.0% and 8.1% for DQN and DDPG; and its lost rate increases by only 1.0%, against 5.2% and 1.7%. Robust-DDPG is therefore also more robust than DQN and DDPG.
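The hit-rate figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below combines the learning-stage hit rates reported in the Discussion (92.6%, 82.4%, and 88.7%) with the stage-to-stage decreases given above. The exploiting-stage values it derives are an inference from those numbers, not entries copied from Table 1.

```python
# Learning-stage hit rates (percent), as reported in the Discussion.
learning_hit = {"Robust-DDPG": 92.6, "DQN": 82.4, "DDPG": 88.7}
# Decrease of the hit rate from the learning to the exploiting stage (percentage points).
hit_drop = {"Robust-DDPG": 1.6, "DQN": 12.2, "DDPG": 9.8}

# Derived (not quoted) exploiting-stage hit rates.
exploiting_hit = {
    name: round(learning_hit[name] - hit_drop[name], 1) for name in learning_hit
}


def gain_over_baselines(stage):
    """Hit-rate advantage of Robust-DDPG over (DQN, DDPG) in one stage."""
    return tuple(round(stage["Robust-DDPG"] - stage[b], 1) for b in ("DQN", "DDPG"))


print(exploiting_hit)
print(gain_over_baselines(learning_hit))    # compare with (10.2, 3.9) in the text
print(gain_over_baselines(exploiting_hit))  # compare with (20.8, 12.1) in the text
```

The derived advantages match the pairs quoted in the text, which indicates that the reported differences are absolute percentage points rather than relative changes.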

Experiment II: Effectiveness of Robust-DDPG
Once learning is done, the controller is constructed. In this section, we conduct UAV exploiting experiments to evaluate the effectiveness of the Robust-DDPG-based controller. Specifically, we let the three controllers constructed by DQN, DDPG, and Robust-DDPG drive a UAV from the same initial location (−10, 180), across the same dynamic environment, to the target point (−120, −120). The exploiting environment here is more complicated than the learning environment shown in Figure 6. We take three screenshots for each controller, at T = 5 s, T = 10 s, and the terminal time; all of them are shown in Figure 8. As Figure 8 shows, the three controllers fly completely different trajectories for the same task. DQN flies at a constant speed (20 m/s) and selects a relatively safe path to bypass the moving threats. It reaches the target at 22.2 s, after flying 463.2 m in total. DDPG chooses to avoid areas with dense threats and flies around them to maintain safety, which leads to a longer path of 651.1 m. However, by dynamically adjusting its flying speed, it takes only 15.1 s to complete the journey. To show the changes in speed clearly, we use arrows of different sizes to indicate different speeds; the differences between the trajectories become visible when zooming in on the pictures in Figure 8. Among the three controllers, Robust-DDPG flies the most efficient path. Through fine adjustments of speed and roll, it safely avoids threats and flies quickly to the target, taking only 11.4 s and flying 443.5 m to finish the task. We conclude that Robust-DDPG provides the most efficient controller, since it enables the UAV to complete the mission with the minimum time and path costs. The exploiting results are summarized in Table 2.

Experiment III: Adaptability of Robust-DDPG
To further validate that Robust-DDPG can be generalized to more complex environments, we conduct a series of additional exploiting experiments in this section. The experiments provide a comprehensive evaluation of Robust-DDPG's adaptability to complicated, dynamic, and uncertain environments.
(a) Adaptability to complicated environments: We characterize environmental complexity by the density of threats (Dens). We build a series of environments by increasing the density of threats (Figure 9 illustrates three examples with threat densities of 0.1, 0.3, and 0.5) and use the three learned agents to fly the UAV in these environments. Each experiment is repeated for 1000 episodes, and each episode randomly re-deploys the UAV and the target. The hit rate over the 1000 episodes is counted at the end of each experiment. Figure 10 depicts the trends of the hit rates of the three agents under different threat densities. Increasing the threat density lowers the hit rates of all three agents, but Robust-DDPG shows a slower and smaller decline than DQN and DDPG. In fact, Robust-DDPG maintains a hit rate of about 61.0% even in a highly complex environment (Dens = 0.5), while DQN and DDPG drop to 22.9% and 30.8%, respectively. In other words, Robust-DDPG shows the greatest adaptability to complicated environments.
(b) Adaptability to dynamic environments: We use the proportion of moving threats (Pro) among all threats to characterize the dynamics of the environment. Pro = 0 means that all the threats are stationary, while Pro = 1 means that all of them are moving. The speeds of the moving threats are randomly sampled from a uniform distribution U(5, 10). In the experiments, we gradually increase Pro from 0 to 1 and examine the hit rate trends of the three agents. As illustrated in Figure 11, Robust-DDPG provides better adaptability to dynamic environments. As the proportion of moving threats increases from 0 to 1, Robust-DDPG still reaches a hit rate of nearly 66.0% after a slow decline, while DQN and DDPG drop rapidly from 88.7% and 90.1% to 28.4% and 41.7%, respectively. Owing to the three tricks proposed in this paper, Robust-DDPG provides a more stable policy for UAV control.
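The way the proportion of moving threats controls the environment dynamics can be sketched as follows. This is only an illustration: the helper `assign_threat_speeds` is hypothetical, and it assumes that exactly round(Pro × n) threats are selected to move, whereas the platform might instead let each threat move independently with probability Pro. The speed range follows the U(5, 10) distribution stated above.

```python
import random


def assign_threat_speeds(n_threats, pro, speed_range=(5.0, 10.0), rng=None):
    """Return one speed per threat; stationary threats get a speed of 0.0.

    pro is the proportion of moving threats (Pro in the text).
    Assumption: exactly round(pro * n_threats) threats are chosen to move.
    """
    if not 0.0 <= pro <= 1.0:
        raise ValueError("pro must lie in [0, 1]")
    rng = rng or random.Random()
    n_moving = round(pro * n_threats)
    moving = set(rng.sample(range(n_threats), n_moving))
    low, high = speed_range
    return [rng.uniform(low, high) if i in moving else 0.0 for i in range(n_threats)]


# Sweep Pro from 0 to 1 for the 24 threats of the simulation platform.
rng = random.Random(0)
for pro in (0.0, 0.25, 0.5, 0.75, 1.0):
    speeds = assign_threat_speeds(24, pro, rng=rng)
    print(pro, sum(1 for s in speeds if s > 0.0))
```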
(c) Adaptability to uncertain environments: We increase the uncertainty of the environments by adding noise to the motion of the UAV and further explore Robust-DDPG's adaptability to these environments. Noise intensity is represented by the disturbances σ_ẋ and σ_ẏ in Equation (1). We therefore conduct a series of comparative experiments in which σ_ẋ and σ_ẏ are gradually increased, and evaluate the impact of noise of different intensities on the hit rate. The trends of the hit rates of the three agents under different noise intensities are illustrated in Figure 12. Robust-DDPG shows strong adaptability to uncertain environments: as σ_ẋ and σ_ẏ increase from 0 to 30, its hit rate decreases only from 93.7% to 74.1%. This is mainly due to the adversarial attacks applied during the training process. In contrast, DQN and DDPG are less robust to uncertain environments, and their hit rates decrease more sharply as the noise intensity increases.

Discussion
For a comprehensive evaluation of the proposed algorithm, Robust-DDPG, experiments are conducted to verify its training performance, exploiting effectiveness, and environmental adaptability by comparing it with the DQN and DDPG algorithms. Most of the hyper-parameters are tuned through extensive repeated trials, and some are selected based on domain experience. With the same parameters, the three models are trained and tested in a series of repeated experiments.
For the training performance evaluation, Robust-DDPG converges to a hit rate of approximately 92.6% in about 2500 episodes, while DQN needs about 5000 episodes to converge to a hit rate of 82.4% and DDPG needs about 4000 episodes to converge to a hit rate of 88.7%. In other words, Robust-DDPG converges to a higher final hit rate in less training time. The improvements in convergence speed and convergence performance come from the joint work of the mixed exploration trick (Section 3.2.3) and the delayed learning trick (Section 3.2.1). The former ensures sample diversity in the early learning stage and gives Robust-DDPG more efficient data utilization. The latter keeps the learning in a stable strategic direction and increases the chance of learning a better policy.
For the exploiting effectiveness evaluation, the UAV is driven by each of the three controllers from the same departure point, across the same dynamic environment, to the same destination. The Robust-DDPG-based controller flies the most efficient path, with the shortest flight time (11.4 s) and the shortest path length (443.5 m). Comparing the tracks of the three UAVs (Figure 8), we can see that the control commands provided by the Robust-DDPG-based controller achieve a better fit between the speed and roll channels. It is neither as radical as DDPG, which flies with high speed and large roll, nor as cautious as DQN, which flies with constant low speed and small roll; instead, it fine-tunes both the speed and the roll dynamically according to the sensed environment. This effectiveness derives from the delayed learning trick. As described in Section 3.2.1, the delayed learning trick ensures that the actor and the critic follow the same principle within an ongoing learning episode and avoids frequent swings of the strategic direction between radical and cautious policies. It is a significant trick for learning a reliable policy.
For the environmental adaptability evaluation, there are three sets of experiments. First, we studied the trends of hit rates under different threat densities; the results clearly show that Robust-DDPG adapts better to complicated environments than DQN and DDPG, as Figure 10 shows the smallest descent slope of the hit rate for Robust-DDPG. Similarly, we explored the trends of hit rates under different proportions of moving threats and under different noise intensities; the results in Figures 11 and 12 lead to the same conclusion, namely that Robust-DDPG adapts better to dynamic and uncertain environments than DQN and DDPG. Better adaptability means that Robust-DDPG can be generalized to more complex environments. All three tricks proposed in this paper contribute to this feature, especially the adversarial attack trick: by introducing adversarial attacks into the training process, the agent learns additional skills that can be used to handle newly emerging circumstances.
To eliminate the impact of accidental factors, we further carry out statistical significance tests on the original data used in Figures 10-12. As the data hardly satisfy normality and variance homogeneity, the Friedman test is used. For simplicity, let A, B, and C denote Robust-DDPG, DDPG, and DQN, respectively. To highlight the advantages of Robust-DDPG, we conduct Friedman(A,B), Friedman(A,C), and Friedman(A,B,C) to assess whether the differences between A and B, between A and C, and among A, B, and C are really significant. The null hypothesis is that the difference is not significant. From the test results shown in Table 3, we can see that p < 0.05 holds in all tests, which means that the hypothesis has to be rejected. In other words, the differences between Robust-DDPG and DDPG, between Robust-DDPG and DQN, and among Robust-DDPG, DDPG, and DQN are all significant. Robust-DDPG does have better adaptability than DDPG and DQN.
From all the experimental results, we conclude that Robust-DDPG shows clear advantages in learning performance, exploiting effectiveness, and environmental adaptability. It is a powerful tool for providing the UAV with the capability of autonomous flight in complex environments. However, it is worth mentioning that, despite our efforts on robustness and adaptation, challenges remain when trying to transfer this technique to a real UAV application. In real control, uncertainty is everywhere and exists all the time: the positioning error, the sensing error, the actuator error, or even the crosswind. No matter how well the controller is trained in a virtual environment, the reality gap does exist. It is still difficult to determine in what range of uncertainty the controller can operate safely before adaptation becomes necessary. This requires some trial and error with a real UAV.
We can characterize as much uncertainty as possible by maintaining insight into the flight control system, understanding its techniques, and modeling them in the virtual environment. By constantly reducing the reality gap, we will eventually apply this technique on a real UAV platform.
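The Friedman test used for Table 3 can be sketched in a few lines. The version below is a minimal illustration rather than the authors' procedure: it computes the classical Friedman statistic without tie correction, supports only two or three paired groups (the cases used in the text), and relies on the chi-square approximation, which is rough for small samples. The hit-rate series `A`, `B`, and `C` are made-up placeholder values, not the data behind Figures 10-12.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman(*groups):
    """Friedman statistic and p-value for 2 or 3 paired groups (no tie correction)."""
    k, n = len(groups), len(groups[0])
    if k not in (2, 3) or any(len(g) != n for g in groups):
        raise ValueError("expected 2 or 3 groups of equal length")
    rank_sums = [0.0] * k
    for block in zip(*groups):  # one block = one test condition
        for j, r in enumerate(average_ranks(block)):
            rank_sums[j] += r
    q = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    q = max(q, 0.0)  # guard against tiny negative values from rounding
    # Chi-square survival function with k - 1 degrees of freedom (closed forms).
    p = math.erfc(math.sqrt(q / 2.0)) if k == 2 else math.exp(-q / 2.0)
    return q, p


# Placeholder hit rates (percent) over six hypothetical test conditions.
A = [93.0, 88.0, 80.0, 75.0, 70.0, 62.0]  # stands in for Robust-DDPG
B = [90.0, 80.0, 66.0, 55.0, 45.0, 35.0]  # stands in for DDPG
C = [88.0, 75.0, 60.0, 48.0, 35.0, 25.0]  # stands in for DQN

print(friedman(A, B))
print(friedman(A, B, C))
```

Because the test works on within-condition ranks only, it does not require normality or equal variances, which is the reason given above for choosing it.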

Conclusions
This paper presents a learning-based controller that provides the UAV with robust motion control in dynamic uncertain environments. The controller is built on an actor-critic DRL framework and performs dual-channel continuous control of roll and speed. To train a stable controller, an efficient DRL algorithm, Robust-DDPG, is proposed by introducing three critical learning tricks into the original DDPG. A delayed-learning trick is designed to learn a stable policy that adapts to dynamic environments, an adversarial attack trick is adopted to provide sufficient adaptability to uncertain environments, and a mixed exploration trick is introduced to achieve faster convergence of the learning. A series of experiments is conducted to evaluate the performance, the effectiveness, and the adaptability of Robust-DDPG from different perspectives. The results show that Robust-DDPG outperforms the state-of-the-art algorithms DQN and DDPG in both convergence speed and convergence value. From the exploiting experiments, we conclude that Robust-DDPG provides a better flying path for the UAV and better adaptability to complicated, dynamic, and uncertain environments.
For further research, we plan to extend the UAV motion control problem to a 3D space, which will add a pitch control channel for the UAV. In addition, we intend to construct an intelligent controller for the UAV in an adversarial environment, where the passive threat will be replaced by some aggressive enemies and the UAV will be attacked by these enemies during the whole mission.
Author Contributions: Conceptualization, investigation, methodology, software, visualization and writing-original draft preparation, K.W.; data curation, validation and formal analysis, Z.H. and G.W.; writing-review and editing, supervision, project administration and funding acquisition, X.G. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.