Obstacle Avoidance Drone by Deep Reinforcement Learning and Its Racing with Human Pilot

Abstract: Drones with obstacle avoidance capabilities have recently attracted much attention from researchers. They typically adopt either supervised learning or reinforcement learning (RL) for training their networks. The drawback of supervised learning is that labeling of a massive dataset is laborious and time-consuming, whereas RL aims to overcome this problem by letting an agent learn from data gathered in its environment. The present study utilizes diverse RL algorithms within two categories: (1) discrete action space and (2) continuous action space. The former has an advantage in optimization for vision datasets, but its restricted actions can lead to unnatural behavior. For the latter, we propose a U-Net based segmentation model combined with an actor-critic network. Performance is compared among these RL algorithms in three different environments, the woodland, the block world, and the arena world, as well as in racing against human pilots. Results suggest that our best continuous algorithm easily outperformed the discrete ones and performed comparably to an expert pilot.


Introduction
The drone is an unmanned aerial vehicle that has been around for a long time, yet it has only recently become a major research field. The recent success of the drone seems to come from the stable control of its rotors and the bird's-eye view provided by a front-mounted camera. With the growing number of possible applications for the drone, e.g., disaster management [1] and agriculture [2,3], the demand for expert drone pilots is growing as well. To become an expert-level drone pilot, one has to spend extended time training. Dexterity in maneuvering a fast-flying drone has become an essential asset for professional drone racers, as a survey indicates that drone racing is one of the fastest growing e-sports [4]. Indeed, autonomous drone navigation has been one of the major research fields for the drone. One of the critical requirements for the successful application of drones is the ability to navigate through buildings and trees; to achieve this, obstacle avoidance capability in particular is essential. Controlling a drone with such dexterity involves solving many challenges in perception as well as in action using lightweight yet high-performing sensors. Among many approaches, simultaneous localization and mapping (SLAM) has been a major representative, utilizing stereo cameras, Lidar, and other sensors. Although SLAM can address a range of localization challenges [5], it also shows some limitations, for example, when the target area has visually changed. Recently, an alternative utilizing deep learning has arisen, as its ability to deal with vision data is promising.
These approaches are mainly divided into two machine learning paradigms: (1) supervised/unsupervised learning and (2) reinforcement learning (RL). In supervised learning, one has to gather a large set of data for the specific environment where the drone will operate before the training.

Figure 1. Screen shots of the woodland (a), the block world (b), and the arena world (c), respectively. Two insets in each shot indicate the depth map (left) and RGB input (right), respectively. As the space between blocks in the block world is narrower than that between trees in the woodland, the former is more challenging than the latter. In the arena, since the obstacles are arranged along a circular track, the drone requires delicate roll, pitch, and yaw control.
Given that restricting the action space to a limited number of discrete actions can lead to unnatural control of the drone, our second approach is to adopt an actor-critic network for controlling the drone in continuous space. To deal with RGB (Red Green Blue) input, a U-Net based segmentation model is used. This process can be understood as finding a way to go within the scene, because the segmentation model is combined with the actor-critic network working in the continuous action space. The promising candidates are policy gradient algorithms that employ actor-critic networks, such as Trust Region Policy Optimization (TRPO) [10], Proximal Policy Optimization (PPO) [11], and Actor-Critic using the Kronecker-Factored Trust Region (ACKTR) [43]. Our proposed system combines a U-Net based segmentation model with a policy gradient algorithm in a reward-driven manner. Our contributions are:

1. It is shown that drone agents can be trained using several deep RL algorithms in discrete and continuous action spaces using three different environments: the woodland, the block world, and the arena world.
2. Human pilots can race a drone against an algorithm utilizing the hardware-in-the-loop (HITL) feature, and the performances of humans and algorithms are evaluated. Results suggest that our best algorithm shows performance similar to that of an expert pilot in all three environments.
3. It is found that DD-DQN outperformed the other deep RL algorithms for discrete action space, and the performance of the algorithm was best when both RGB and depth maps were given to the agent, rather than only one of the sensor signals.
4. So far, U-Net based segmentation models have been trained under the supervised learning paradigm, where labels are provided by laborious manual labeling. In the proposed system, a label map is instead generated via the critic network using input provided by a simulated environment.

Related Work
Recently, deep neural network-based methods have been proposed to enhance the ability to control the drone using computer vision, e.g., for collision avoidance and navigation [12]. Among these, supervised learning and reinforcement learning have been successfully adopted for drone control, whereas unsupervised learning has been studied for action recognition and for assisting the labeling of datasets.

Drone Navigation with Supervised Learning
Supervised learning can be promising given a large amount of labeled data. Almost all studies applying supervised learning are based on convolutional neural networks (CNNs), so that the model can effectively extract features from vision inputs across various environments. It has been shown that a drone can be trained to navigate indoors, where the surrounding area is comparatively small [13]. By focusing more on negative data, such as crashes and collisions of a drone, it has been shown that a network can effectively avoid collision [14]. A recent study shows that a drone can navigate through the corridors of a building using two well-trained CNNs: one for the depth map and the other for the RGB input [15]. It has also been shown that a well-trained CNN can have such a small footprint that it can be embedded on a nano quadrotor to control its flying trajectory [16]. Another possible way for drone navigation is to combine a CNN with a long short-term memory (LSTM) for training within a Gazebo [17] environment consisting of a block, a wall, and an overhang in a room [18]. More recently, by utilizing a simulation environment during training, a CNN-based model for real-time navigation on a drone-racing track has been proposed using a micro UAV [19]. This has shown not only that a deep neural network can control a drone in real time with vision input, but also that a network trained in a virtual environment can effectively maneuver a drone in the real world. Instead of training a network in simulation first, many studies make use of data from the real world directly. By using vision data collected from forests, a network can drive a drone for forestry purposes [2]. Similarly, using a large amount of data collected in an urban area, it has been shown that a drone can navigate robustly by avoiding obstacles appearing in a city [20].
These improvements in controlling the drone using supervised learning lead to broad applications for society, along with technologies such as the Internet of Things (IoT) [21,22]. Even though the studies mentioned above have successfully applied supervised learning to meet the needs of diverse applications, manual labeling remains a burden for researchers. Moreover, as the data are highly oriented towards a specific environment, it is necessary to go through the entire process again whenever a model must be applied to a different environment. Research based on RL, on the other hand, tries to overcome this issue.

Unsupervised Learning-Based Methods for Drone Applications
Unsupervised learning-based methods have been utilized for drone applications, mainly for labeling. Using algorithms such as simple linear iterative clustering (SLIC), it has been shown that unsupervised methods can help label the data for training a supervised model in an agricultural application, reducing human effort [23]. Also, by interpreting depth estimation as a reconstruction problem, a study has shown that unsupervised learning models can successfully estimate a depth map from a monocular RGB image taken by a drone [24]. Based on the reconstruction error of an unsupervised generative model, it has been shown that a robot can decide whether or not to proceed [25]. Furthermore, a combination of unsupervised methods has been shown to learn features for anomaly detection in pictures taken by a drone [26].

Drone Navigation with Reinforcement Learning
In RL, an agent is trained to navigate through obstacles by trial and error. This can be advantageous because, once the training environment is ready, the agent learns by itself. For example, it has been shown that a drone and a radio control (RC) car can be trained to predict uncertainty or collisions by utilizing model-based RL [27]. Here, the drone and RC car learned by making trials at low speed so that the hardware was not damaged. This indicates that a model can be trained directly in the real environment under specific restrictions. However, as RL usually requires a great deal of trial and error, imposing restrictions on the environment inevitably slows down training. Such characteristics of RL approaches have led to the use of virtual environments, where the agent can safely and quickly make trials and errors without concern for the actual hardware [28,29]. Using sensory data from an accelerometer, it has been shown that an RL model can plan swing-free trajectories while carrying a suspended load [30]. Similarly, by applying RL, a substantial number of practical applications have been proposed: controlling the attitude [31], navigating from an arbitrary departure place to a destination [32], and enhancing the efficiency of a network of cellular-connected drones in the 5G era [33]. However, RL has not been widely used with visual input, mainly because it usually suffers from high-dimensional data, especially the similar succeeding frames produced by continuous actions. One alternative is to restrict the continuous action space to a discrete one so that a model can be trained with an algorithm such as the Deep Q Network [6], which is known to learn well from images. For example, it has been shown that a drone can fly to reach a goal by making discrete actions [34]. However, restricting the action space can lead to unnatural behaviors.

Robot Navigation using Segmentation Map
Recently, several studies have adopted segmentation models to extract simple and handy features for robot navigation. For example, a wheeled robot navigating outdoor streets has been trained using a CNN-based segmentation model [35]. Here, the navigation direction is controlled by fuzzy logic that receives a segmented image derived from the RGB input. Another work has shown that a semantic segmentation model trained in a virtual environment can minimize the gap between the real and virtual environments [36]. In this study, the segmentation model plays an essential role in visually guiding a robot where to go, and an RL agent trained in simulation can be transferred to the real environment to control a car. It has also been shown that a segmentation model can learn from simulated images jointly with real images by incorporating a recurrent neural network (RNN), so that a robot in the real environment can use the model directly without suffering from the gap between reality and simulation [37]. Besides, a study shows that a depth map generated from an RGB image by a CNN-based model can be used for training a network controlling a drone [38] by utilizing supervised learning. With this model, it was shown that a network trained in a simulation environment could navigate in the real world during testing.
However, none of these studies has applied such a model to a drone using reinforcement learning, which does not require manual labeling of the data. In this study, for continuous control, we adopt a segmentation model to simplify the information exhibited in the raw image from the first-person view (FPV) of the drone, so that an RL model can learn to navigate in the continuous action space, which has been considered very challenging due to the curse of dimensionality. Moreover, designing the reward function is replaced by taking advantage of the mutual interaction between the segmentation network and the actor-critic networks.

Deep Reinforcement Learning and U-Net Segmentation Model
In a problem of sequential decision making, an agent interacts with an environment over discrete time steps [39]. For example, in the Atari domain, which is widely used to measure the performance of various reinforcement learning algorithms, an agent observes frames from a video at a time step t:

s_t = (x_{t−p+1}, ..., x_t) ∈ S, a_t ∈ A = {1, ..., |A|}. (1)

Equation (1) shows the ingredients of a state consisting of several frames from a video at time step t: here x is a frame, t is a time step, and |A| is the number of actions. p stands for the number of previous frames observed for making action a_t; s and a are the state that the agent observes and the action that the agent takes, respectively. The agent then chooses which action to take from the set A and receives a reward signal r. The goal of the agent is, of course, to maximize the cumulative reward R for an episode:

R = Σ_{τ=t}^{T} γ^{τ−t} r_τ, (2)

where r_τ is the reward at time step τ and γ ∈ [0, 1] is the discount factor that makes the agent deal with a trade-off between immediate and future rewards [40]. In terms of action space, learning to maximize the reward in reinforcement learning mainly diverges into two schemes: (1) RL in discrete action space; (2) RL in continuous action space.
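As a concrete illustration of Equation (2), the discounted return can be computed recursively from the end of an episode. The following minimal sketch is ours, not code from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted return R = sum over tau of gamma^(tau - t) * r_tau,
    accumulated backwards from the final step of the episode."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# Three steps of reward 1.0 with gamma = 0.5:
# R = 1.0 + 0.5 * (1.0 + 0.5 * 1.0) = 1.75
R = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

The backward accumulation avoids recomputing powers of γ at every step.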

Learning in Discrete Action Space
In discrete action space, an agent chooses actions according to a policy π, which usually takes a greedy form. Here, π is tied to a state-value function V, since π can be seen as greedily choosing the best action given a state. One of the popular forms for estimating the value is the Q function. The connection between Q and V can be expressed as

V(s) = max_a Q(s, a),

where a is a chosen action given a state s. The optimal Q function is then defined as

Q*(s, a) = max_π E[R_t | s_t = s, a_t = a, π],

and the optimal Q function satisfies the Bellman equation:

Q*(s, a) = E_{s'}[r + γ max_{a'} Q*(s', a') | s, a].

For high-dimensional data, i.e., images, the Deep Q Network (DQN) can effectively learn to solve a given task [6]. The loss function for optimizing the DQN is

L(θ) = E[(Y_t − Q(s_t, a_t; θ))^2],

where θ denotes the parameters of the network at the current optimization step. However, it has been shown that using the online parameters θ to compute the target can cause unstable learning in the vanilla DQN. To solve this issue, a separate target network is introduced: it is copied from the online network every pre-defined number of steps and is used to estimate the target value in the optimization step. The target value Y at time step t then becomes

Y_t = r_{t+1} + γ max_a Q(s_{t+1}, a; θ⁻),

where θ⁻ stands for the fixed parameters of the target network. To remove the correlation between the experiences accumulated in the buffer, which causes significant deterioration of the stability of the network, DQN utilizes a technique called experience replay.
e_t = (s_t, a_t, r_t, s_{t+1}), (11)
D_t = {e_1, ..., e_t}. (12)

Equations (11) and (12) show how experiences are accumulated in the buffer, where s_t is a state, a_t is an action, and r_t is a reward. With experience replay, the network uses randomly sampled mini-batches of experiences from D instead of using the data in the order they were accumulated. An improvement of DQN, called Double DQN [8], has been proposed. To avoid over-optimistic estimation of the Q value, Double DQN uses different parameters to select and to evaluate an action, whereas DQN uses the same parameters for both:

Y_t = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ_t); θ⁻_t).

Instead of using different parameters to select and evaluate the action, the Dueling Deep Q-network (Dueling DQN) [41] separates the network into a value network and an advantage network. The value network is used to estimate the quality of the state, while the advantage network is used to estimate the quality of each action:

Q(s_t, a_t; θ, α, β) = V(s_t; θ, β) + (A(s_t, a_t; θ, α) − (1/|A|) Σ_{a'} A(s_t, a'; θ, α)). (14)

Among the several ways to estimate Q(s_t, a_t; θ, α, β) in Dueling DQN, we adopt Equation (14), as it is known to stabilize the optimization process.
Given the two ways to improve the performance of DQN described in the previous sections, we suggest in this paper that an agent can learn better by combining Double DQN with Dueling DQN, which we call Double Dueling DQN (DD-DQN). In our experiment, we use Equation (14) to calculate the Q value, as it is known to stabilize the optimization. By combining Equation (14) with Double DQN, we define the target value of DD-DQN as

Y_t = r_{t+1} + γ Q(s_{t+1}, argmax_a Q(s_{t+1}, a; θ, α, β); θ⁻, α⁻, β⁻), (15)

where θ stands for the parameters of the convolutional layers, while α and β are the parameters of the advantage network and the value network, respectively. Following Equation (15), the loss function for optimization is defined as

L = E[(Y_t − Q(s_t, a_t; θ, α, β))^2].

The flow diagram is shown in Figure 2, where the input consists of RGB and depth maps. Both the RGB and depth maps are fed into two CNNs as input, and each convolutional neural network (CNN) has three convolutional layers. The two outputs from the CNNs are concatenated and fed into fully connected layers for the dueling operation, which produces the state value and action advantages. Here, the state value concerns the given state itself for the two input images, whereas the action advantages consider the advantage of each action. The final value for each action is then produced by merging the state value and action advantages. The network is updated every predefined step, following the double Q optimization policy.
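The dueling combination of Equation (14) and the double-Q target of Equation (15) can be sketched with NumPy as follows. The function names and example values are our own illustration, not the paper's implementation:

```python
import numpy as np

def dueling_q(state_value, advantages):
    # Equation (14): Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a'))
    return state_value + (advantages - advantages.mean())

def dd_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double-DQN target: the online network selects the best next action,
    while the (periodically copied) target network evaluates it."""
    if done:
        return reward
    a_star = int(np.argmax(next_q_online))      # selection (online parameters)
    return reward + gamma * next_q_target[a_star]  # evaluation (target parameters)

# Advantages [1, 2, 3] are centered to [-1, 0, 1], so with V(s) = 0.5
# the Q values become [-0.5, 0.5, 1.5]:
q_values = dueling_q(0.5, np.array([1.0, 2.0, 3.0]))
```

Subtracting the mean advantage makes V and A identifiable, which is the stabilizing property mentioned for Equation (14).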

Learning in Continuous Action Space
For learning in continuous action space, several model-free policy gradient algorithms are available, such as Trust Region Policy Optimization (TRPO) [10], Deep Deterministic Policy Gradients (DDPG) [42], Proximal Policy Optimization (PPO) [11], and Actor-Critic using Kronecker-Factored Trust Region (ACKTR) [43]. They have been shown to be promising candidates in the continuous-control MuJoCo [44] domain from the OpenAI Gym [45]. They work to obtain the maximum expected reward by estimating the gradient of the expected return with respect to the policy parameters. In general, the estimation of the policy gradient using action a and state s becomes

g = E[Σ_t Ψ_t ∇_θ log π_θ(a_t | s_t)],

where π is the policy (actor) network and Ψ is the critic; Ψ can be the state value, the Q value, or the advantage, depending on the algorithm. TRPO and PPO use constraints and advantage estimation to perform this update, reformulating the optimization problem as

maximize_θ E_t[(π_θ(a_t | s_t) / π_θold(a_t | s_t)) A_t],

where A_t is the generalized advantage function [46]. TRPO uses conjugate gradient descent as the optimization method with a KL constraint:

E_t[KL(π_θold(· | s_t), π_θ(· | s_t))] ≤ δ.

PPO reformulates the constraint as a penalty and clips the objective to make sure that the optimization is carried out within a predefined range. DDPG and ACKTR adopt the actor-critic method, which estimates Q(s, a) and optimizes a policy that maximizes the Q-function based on Monte-Carlo rollouts. DDPG does this using deterministic policies, while ACKTR utilizes Kronecker-factored trust regions to ensure stability with stochastic policies. As there is an infinite number of actions and states to estimate in the continuous action space, off-policy value-based approaches such as DQN are computationally expensive. An on-policy approach, however, can move in the direction suggested by g using gradient ascent optimization for the best-returning policy π.
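As an illustration of PPO's clipped surrogate mentioned above, the following sketch (our own, with hypothetical batch inputs) shows how the probability ratio is clipped so that the update stays within a predefined range:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO surrogate objective, averaged over a batch.
    ratio = pi_theta(a|s) / pi_theta_old(a|s) per sample."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum removes any incentive to push the
    # ratio outside [1 - eps, 1 + eps].
    return np.minimum(unclipped, clipped).mean()

# A ratio of 2.0 with a positive advantage is capped at 1 + eps = 1.2:
obj = ppo_clip_objective(np.array([2.0]), np.array([1.0]))
```

In practice this objective is maximized by gradient ascent on θ; the sketch only evaluates it for given ratios and advantages.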

U-Net Segmentation Model
The U-Net architecture [47] stems from the so-called fully convolutional network. The main idea is to supplement a usual contracting network with successive layers in which upsampling operators replace pooling operations; these layers thus increase the resolution of the output. A successive convolutional layer can then learn to assemble a precise output based on this information.
A critical modification in U-Net is the large number of feature channels in the upsampling part, which allows the network to propagate context information to higher-resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path and yields a u-shaped architecture. The network uses only the valid part of each convolution, without any fully connected layers. To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. Such a tiling strategy is essential to apply the network to large images, as otherwise the resolution would be limited by the GPU memory.

Methods for Discrete Action Space
AirSim is a simulation platform from Microsoft for autonomous vehicles such as drones and autonomous cars. For graphics, it uses the Unreal Engine, which provides rich repertoires of shaders and other drawing tools with which third-party suppliers or research users can create realistic landscapes. One of the early releases, called the "modular package", mainly consists of urban landscape and woodland areas. In particular, the present study utilizes the woodland area, as it provides an ideal landscape for testing and training an autonomous drone, as shown at the top of Figure 1. In this case, the drone is supposed to avoid the trees and find a 3D path to the goal. It was found at an early stage that the small branches and leaves of the trees were not recognized as obstacles, unlike the main branch, because the default option was set to the simple collision case, presumably to save processing time for detailed graphics within the Unreal Engine [48]. Therefore, it was required to bind the small branches and leaves to the main branch of each tree as one object and to work in the complex collision mode.
Though the woodland provides a stimulating environment for an agent, it does not require much movement in the up or down direction, simply because most of the trees are planted on plain ground. Therefore, we designed more challenging landscapes, called the block world and the arena, consisting of diverse solid objects such as a cube, a sphere, and a cone, as well as two wall-style objects, a half-wall and an arch bridge, as shown in Figure 1. Given that most drones using SLAM for autonomous navigation typically adopt either stereovision or Lidar, we utilize RGB and depth sensors for our experiment. The sensed images are given to two CNNs as input to the deep RL network. The size of the memory buffer is set to 1000, which means 1000 pairs of depth and RGB images are stored and sampled for training. The DQN, Double DQN, Dueling DQN, and DD-DQN algorithms are used for training and evaluation, respectively, in two different worlds: the woodland and the block world. For both cases, an agent learns to act with 5 actions: forward, left, right, up, and down. After choosing an action, we use the Euler angle values pitch_action and roll_action and the value throttle_action to control the drone. Therefore, the commands pitch_drone, roll_drone, and throttle_drone sent to set the drone's attitude are

roll_drone = roll_action × max_angle,
pitch_drone = pitch_action × max_angle,
throttle_drone = throttle_drone + throttle_action.
pitch_action is set to +1 for forward and 0 for the other 4 actions. roll_action is set to +1 for right and −1 for left. throttle_action is set to 0.2 for up and −0.2 for down, respectively. The above roll_drone and pitch_drone are converted to quaternions and then sent to the drone to match the pose, whereas the throttle command is sent to the drone directly. ROS was used for transmission from the RL agent to the AirSim environment. Although we chose max_angle and throttle_action empirically for our environments, one can set those values manually depending on the environment. For the woodland, the reward r is set to 0.1 only when the agent chooses to go forward and −10 when the agent collides with any obstacle recognized as an object; for example, every tree, leaf, and rock is recognized as an object. For the block world case, r is set to 0.1 when the agent chooses to go forward, −10 on a collision, and 0.08 when it takes any of the actions left, right, up, and down, respectively. Because the block world is a more complicated environment than the woodland, the agent has to choose various actions to find a path and to prevent a deadlock; a reward is therefore given to each action, i.e., to promote exploration. The RGB image (and depth map) is rescaled from 256 × 144 to 64 × 64 to prevent overfitting, which also saves training time. When both RGB and depth images are used for training, two separate convolutional networks receive each image. The outputs of the 2 CNNs are concatenated and used as input for the fully connected layers responsible for the dueling architecture. Finally, the Q value generated by the dueling operation is used in estimating the double Q, as shown in Figure 2. When only one type of input is used, either RGB or the depth map, a single CNN and fully connected layers with 64 neurons are used.
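The action-to-attitude mapping described above can be sketched as follows. max_angle is left as a parameter since the paper chose it empirically; the value used in the example calls below is an arbitrary illustration:

```python
# (pitch_action, roll_action, throttle_action) for each of the 5 discrete actions
ACTION_TABLE = {
    "forward": (+1, 0, 0.0),
    "left":    (0, -1, 0.0),
    "right":   (0, +1, 0.0),
    "up":      (0, 0, +0.2),
    "down":    (0, 0, -0.2),
}

def attitude_command(action, throttle_drone, max_angle):
    """Map a discrete action to (pitch_drone, roll_drone, throttle_drone).
    Note that the throttle accumulates across steps."""
    pitch_a, roll_a, throttle_a = ACTION_TABLE[action]
    return (pitch_a * max_angle, roll_a * max_angle, throttle_drone + throttle_a)
```

Pitch and roll would then be converted to quaternions before being sent to the drone, as described above.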

Method for Continuous Action Space
Given that the U-Net-based segmentation model belongs to the supervised learning paradigm, where labels are provided by laborious manual labeling, the proposed system combines a U-Net-based segmentation model with a policy gradient algorithm in a reward-driven way: a label map is generated in real time during training from an optical flow calculated over sequential images, via the critic network. In other words, training in RL replaces the labeling task. By utilizing a realistic simulation environment for training the actor-critic networks and the segmentation model, it is shown that the model successfully learns to control the drone and navigate through obstacles. In this section, we describe how our two learning processes mutually cooperate: (1) a model for the visual representation of the scene in the form of a segmentation map; (2) actor-critic RL for controlling the drone in the continuous action space. First, we describe the representation model, followed by the RL.

Critic-Dependent Segmentation Model
Our actor-critic networks aim to assist the segmentation network by generating a label map. There are two steps for generating the training data in terms of predicted reward. The first is to recognize which direction the agent chooses to go within an image; an optical flow algorithm [49] is adopted to calculate these vectors. The second is to generate a label map for the segmentation model based on the reward predicted from the optical flow vectors.
To measure the direction in which the RL agent chooses to go, we calculate optical flow vectors using two sequential images from the environment, S_t and S_{t−1}, where S_t stands for the raw RGB FPV at time step t. The x movement V_x and y movement V_y are then obtained from the least-squares solution over a window:

[V_x, V_y]^T = (A^T A)^{−1} A^T b, A = [I_x(q_i), I_y(q_i)], b = −[I_t(q_i)],

where q_i denotes a pixel inside the window, and I_x, I_y, and I_t stand for the partial derivatives of the image intensity with respect to the position (x, y) and time, computed from S_t and S_{t−1}. The size of the window for calculating optical flow is set to 10 in this study.
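The least-squares step for a single window can be sketched with NumPy. This is our illustration of the standard Lucas-Kanade solve, not the paper's code, and it assumes the image derivatives have already been computed:

```python
import numpy as np

def lucas_kanade_window(Ix, Iy, It):
    """Solve for (Vx, Vy) in one window by least squares:
    stack the per-pixel equations Ix*Vx + Iy*Vy = -It and solve A V = b."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    V, *_ = np.linalg.lstsq(A, b, rcond=None)
    return V  # (Vx, Vy)

# A purely horizontal gradient with a uniform temporal change yields
# a purely horizontal flow estimate:
V = lucas_kanade_window(np.ones((3, 3)), np.zeros((3, 3)), -2 * np.ones((3, 3)))
```

`np.linalg.lstsq` returns the minimum-norm solution even when the window is rank-deficient (e.g., no vertical gradient), which keeps the sketch robust for the toy example.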
To obtain vectors indicating where the drone moves, instead of the image motion itself, we rotate V using a rotation matrix:

V' = R(θ) V, R(θ) = [cos θ, −sin θ; sin θ, cos θ],

where θ is set to 180° to obtain the aforementioned drone movement. Using the V calculated above, the second process is to make the label map. Using parallel displacement of vectors, we first move V to the center of the image, so that the calculation of the direction and speed for generating the label map is more straightforward. After this displacement, as shown in Figure 3, we separate an array into zones according to the directions that V can have. Then, the direction of each vector and its scaled speed are superposed on the zones. We use a coefficient η, set to 20 in this study, to scale the speed value and determine how many zones are superposed along each vector's direction. For every vector, an array filled with zeros is initialized, and the zones under the vector are filled with the value one. Each of these arrays is flattened into one dimension and fed into the critic network for estimating the direction. Let W_{i,j} denote the set of values exhibited in the zone at the i-th row and j-th column, and let V_n denote the n-th optical flow vector; then each sample C_n used as input for the critic network to make a label map is given as

C_n = (W̄_{1,1}, ..., W̄_{h,w}), (24)

where W̄_{i,j} means the average value of W_{i,j}, and h and w are the numbers of zones for height and width, respectively. Then each C_n is fed into the critic network Ψ to receive its predicted reward. The label map is produced by filling a zero array with the output values of Ψ corresponding to the zones indicated by V_n for every sample C_n. Note that filling the windows with the values from Ψ means each window in the label map holds a predicted reward, since the critic network learns to predict the reward (Section 3.2). Figure 3 shows a label map generated by this sequential process. The input dimension for the actor and critic networks is 25, with h and w both set to 5, as shown in Table 1.
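The zone operation above amounts to splitting a 2-D map into an h × w grid and averaging each cell. A minimal sketch, with our own helper name and h = w = 5 as in the paper:

```python
import numpy as np

def zone_averages(feature_map, h=5, w=5):
    """Split a 2-D map into h x w zones and return the mean of each zone
    (the W-bar values fed to the actor and critic networks)."""
    H, W = feature_map.shape
    zh, zw = H // h, W // w
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * zh:(i + 1) * zh, j * zw:(j + 1) * zw].mean()
    return out

# A 10x10 map split 5x5 gives twenty-five 2x2 zone means; flattened,
# this matches the 25-dimensional network input mentioned in Table 1:
state = zone_averages(np.ones((10, 10))).ravel()
```

The same averaging serves both the critic samples C_n and the state construction in Equation (25).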
Given S_{t−1} as input and the corresponding label map described above as the target, our segmentation model learns simultaneously with the actor-critic networks. As the RL model performs better, the segmentation model improves as well. During testing, the segmentation model receives the RGB input and directly produces a predicted path in terms of reward, without the further processing involving the optical flow and the feedback from Ψ, as shown in Figure 4.

Table 1. Detailed specification of our actor and critic networks. During training, all 3 networks, the U-Net, the actor, and the critic, are required, whereas only the U-Net and the actor network are utilized during testing. Note that all 3 networks, including the U-Net in Figure 5, use an identical optimizer and weight initialization method. Note that * indicates the type of network.

Although using a policy gradient can be optimal in a certain environment, a problem in using vision input in the continuous action space arises when all states are high-dimensional with a similar appearance. In other words, π and Ψ cannot effectively learn from succeeding similar frames that require different actions. To overcome this issue with high-dimensional data, we adopt a segmentation model that compresses the information while preserving the useful features, as mentioned earlier. Let f_seg and θ_seg denote the segmentation model and its parameters; then the state s_t for training our actor network π and critic network Ψ is given as

s_t = {W̄_{1,1}, ..., W̄_{h,w}}, (25)

where W is the set of values in the zones separating the output of f_seg, as in Equation (24), and W̄ means the averaged value of W. Our actor and critic networks accept an input of dimension h × w for training and testing. The reward is calculated using the optical flow vector V and the segmentation model's output. Given S_t and S_{t+1}, V can be calculated and indicates zones with a direction and speed.
Then, the corresponding zones in the segmentation map f_seg(S_t|θ_seg) are selected and averaged to become the reward for a given action a_t. Apart from the U-Net generated reward, a reward of −1 is given whenever the drone collides, so that the critic can send negative learning signals to both π and f_seg. Thus, the reward at time step t using the segmented output f_seg(S_t|θ_seg) is computed as follows:

r_t = (1/m) Σ_{i=1}^{m} (W̄_i | V_t), and r_t = −1 on collision, (26)
where W̄_i|V_t stands for the averaged value of the zone W_i indicated by the t-th optical flow vector V_t, and m is the number of zones on which V_t is superposed. Note that the reward ranges from −1 to 1 because of the Sigmoid function at the last layer of the U-Net, as shown in Figure 5. The action a_t produced by the actor network π using s_t in Equation (25) is as follows:

a_t = π(s_t | θ_π), (27)

where θ_π stands for the parameters of π. With s_t, a_t, r_t, and s_{t+1} defined above, our actor and critic networks perform a typical RL optimization step utilizing an experience replay buffer. During testing, only the segmentation model f_seg and the actor network π are required, as indicated by Equation (27) and the procedure drawn as blue lines in Figure 4. In the end, our actor-critic networks learn to follow the segmentation model, while the segmentation model learns to generate a segmentation map as the actor-critic networks improve.
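To make Equations (25)-(27) concrete, the following is a minimal sketch in Python with NumPy. The grid size, zone indexing, and function names are our own illustrative assumptions, not the paper's code: it averages the segmentation output over an h × w grid of zones to form the state, then averages the zones indicated by the optical-flow vectors to form the reward, with a −1 penalty on collision.

```python
import numpy as np

def zone_state(seg_map, h=4, w=4):
    """Eq. (25) sketch: state s_t as the h x w grid of zone-averaged
    segmentation values (grid size is an illustrative assumption)."""
    H, W = seg_map.shape
    # Crop so the map divides evenly into h x w zones, then average each zone.
    zones = seg_map[:H - H % h, :W - W % w].reshape(h, H // h, w, W // w)
    return zones.mean(axis=(1, 3))

def reward(seg_map, flow_zones, collided, h=4, w=4):
    """Eq. (26) sketch: mean of the zone averages indicated by the optical-flow
    vectors (m = len(flow_zones)); -1 whenever the drone collides."""
    if collided:
        return -1.0
    s = zone_state(seg_map, h, w)
    return float(np.mean([s[i, j] for (i, j) in flow_zones]))
```

Equation (27) is then simply the actor's forward pass on the resulting h × w state.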

Control Commands for Drone with the Actor Network
Our control commands are conveyed as linear velocities along the X, Y, and Z axes. In other words, the actor network produces three continuous velocity values for a given state, and this command is used to control the drone during both training and testing (evaluation), as described in Section 5.
Let ω_0, ω_1, ω_2 denote the output action a from π given a state s. The control commands sent to the drone to control its linear velocities are given as

ω_{x_drone} = φ_x · ω_0,  ω_{y_drone} = φ_y · ω_1,  ω_{z_drone} = φ_z · ω_2,

where ω_{x_drone}, ω_{y_drone}, and ω_{z_drone} stand for the linear velocities along the x, y, and z axes sent to the drone, respectively. φ_x, φ_y, φ_z are pre-defined parameters that one can set to prevent the drone's attitude from changing too rapidly or too slowly. In this study, φ_x and φ_y were set to 1, whereas φ_z was set to 0.5 for stable control of ω_{z_drone}.
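A minimal sketch of this scaling step (the function name is hypothetical), using the paper's values φ_x = φ_y = 1 and φ_z = 0.5:

```python
def control_command(action, phi=(1.0, 1.0, 0.5)):
    """Scale the actor output (w0, w1, w2) by the pre-defined parameters
    (phi_x, phi_y, phi_z) to get the linear velocity commands for the drone."""
    w0, w1, w2 = action
    px, py, pz = phi
    return (px * w0, py * w1, pz * w2)  # (wx_drone, wy_drone, wz_drone)
```

With the defaults above, the vertical velocity is halved relative to the actor's raw output, which is what keeps ω_{z_drone} stable.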

Experiment
Experiments were designed to observe the training processes of the algorithms for discrete and continuous action spaces, followed by their evaluation in 3 virtual environments built with Airsim, as shown in Figure 1. For discrete action space, we trained DQN, Double DQN, Dueling DQN, and Double Dueling DQN (DD-DQN). For continuous action space, TRPO, PPO, and ACKTR were trained with the U-Net for assistant segmentation, as described in Section 4. The networks were evaluated by comparing the algorithms for both action spaces, as well as against human pilots of varying maneuvering skill, i.e., novice, intermediate, and expert.

Learning Environments
A workstation equipped with an Intel i7 3.4 GHz CPU and an Nvidia Titan X was used for both training and testing with the Microsoft Airsim [7] simulation environment. Python 3.6, Tensorflow 1.10.0 [50], and OpenCV 3.4.1 [51] were used for experiments on Ubuntu 16.04. During training, the simple flight mode within Airsim was used. Hardware in the Loop (HITL) was used to measure the performance of human pilots, where a Graupner mz-12 Radio Control (RC) transmitter and a Pixhawk PX4 were connected to Airsim, as shown in Figure 6. With PX4 connected to Airsim, a pilot can use the RC directly to control a drone within Airsim; the signal from the RC is computed and sent to Airsim in the same way as in other commercially available drones whose flight controller is PX4. Note that the pilot subject sees the first-person view of the environment while performance is measured during the navigation task. In our experiments, the 3 human pilots calibrated their RCs using QGroundControl. The program used a total of 280 MiB of memory.

Network Training for Discrete and Continuous Action Spaces
The experiment was designed to see if the networks were able to maximize the reward through trial and error. If so, an increasing reward indicates that the drone has acquired obstacle avoidance capability.

Training for Discrete Action Space
Training an agent in discrete action space was performed using DQN, Double DQN, Dueling DQN, and Double Dueling DQN. With these algorithms, the experiments examined the effect of two conditions. The first was whether the algorithms were able to maximize the reward by learning. The second was the impact of input data type, i.e., RGB or depth, on the learning.
Throughout the experiments in the woodland, the block world, and the arena, Double Dueling DQN (DD-DQN) showed the best performance among the four algorithms in learning to maximize the reward, as shown in Figure 7. This is because DD-DQN combines Dueling DQN and Double DQN, which contribute in different ways to improving vanilla DQN. Second best was Dueling DQN, followed by Double DQN and vanilla DQN. In every case, using both RGB and depth maps for training performed best, as shown in Figure 7. Using a depth map alone was also better than using RGB alone, indicating that the depth map contains more informative features than RGB for learning.

Training for Continuous Action Space
This experiment examined whether the actor-critic networks described in Section 4 for continuous action space can maximize the reward. As the reward accumulates while the agent avoids obstacles, such a reward indicates that the drone has obstacle avoidance capability. For this purpose, the starting points for training were set at the same locations as in the discrete action space training in the 3 environments, respectively. After this setup, the training of the actor-critic networks and the U-Net followed the typical RL process. Since the U-Net already plays the role of semantic segmentation, similar in a way to converting RGB to a depth map, only RGB images were used for learning in continuous action space. With the same U-Net specification, 3 actor-critic algorithms, TRPO, PPO, and ACKTR, were adopted to compare their performance for obstacle avoidance. Figure 8 shows how our U-Net segmentation model's output evolves as training proceeds. The decrease of the U-Net loss, together with the increase of the reward, indicates that the agent's performance improves; as a result, it can find a successful path even in a newly configured environment. Note that when the drone saw a scene for the first time, the segmentation map produced by the U-Net was blurry. However, through the trial and error made by the actor-critic networks, the U-Net loss decreased as the networks learned the surrounding area. After the loss became stable over many scenes (around training steps 55 and 78 in Figure 8, left), the agent obtained a high reward score, showing that it had become familiar with these obstacles through trial and error. The segmentation model learned where the agent had to go from the given optical flow and the feedback from the critic network, whose role was to evaluate how good an action was in a given state.
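The optical-flow feedback described above can be sketched as follows: each flow vector's endpoint selects the zone whose averaged segmentation value contributes to the reward. The zone indexing, array shapes, and function name here are our own illustrative assumptions; the paper computes V via optical flow but does not give code.

```python
import numpy as np

def zones_hit_by_flow(points, vectors, img_shape, h=4, w=4):
    """Map each optical-flow vector's endpoint to the (row, col) index of the
    zone it lands in, on an h x w zone grid over an image of img_shape.
    points: (N, 2) pixel coordinates (row, col); vectors: (N, 2) flow (drow, dcol)."""
    H, W = img_shape
    # Follow each vector to its endpoint, clipped to the image bounds.
    ends = np.clip(points + vectors, [0, 0], [H - 1, W - 1])
    rows = (ends[:, 0] * h // H).astype(int)
    cols = (ends[:, 1] * w // W).astype(int)
    # Deduplicate: several vectors may land in the same zone.
    return list({(int(r), int(c)) for r, c in zip(rows, cols)})
```

The resulting zone list plays the role of the m superposed zones in Equation (26).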
The rewards obtained by the actor network also increased as the U-Net loss decreased, indicating that the segmentation model and the actor-critic networks cooperated to avoid obstacles and find a successful path. Figure 9 shows the flight trajectories the actor network produced given segmentation outputs.

Figure 7. Episode rewards for the four deep RL algorithms with RGB (left), depth (middle), and RGB + depth (right) inputs in 3 environments: the woodland (top), the block world (middle), and the arena world (bottom). The reward was largest when RGB and depth inputs were combined with DD-DQN, as shown in (c) for the woodland and the block world. However, since these algorithms could not complete their races in the arena world, the differences between them there were not significant. Implementations of these 4 algorithms were based on OpenAI [45].

The second experiment determined which actor-critic algorithm performed better on the given task. It was conducted by replacing the actor-critic algorithm one at a time, while the U-Net and the reward scheme were kept identical. As shown in Figure 10, 3 recent, high-performing actor-critic algorithms, TRPO [10], PPO [11], and ACKTR [43], were evaluated; ACKTR outperformed the other two on the given task. Implementations of these algorithms were based on OpenAI [45] with their default hyper-parameters for a fair comparison.

Evaluation of Networks
For the evaluation of algorithms for both discrete and continuous action spaces, the trained models were used for testing in each of the 3 environments.

Comparison between Algorithms
To better understand how these algorithms work, the trained network models were run in the three environments, i.e., the woodland, the block world, and the arena world, respectively. DD-DQN outperformed the others in discrete action space, whereas ACKTR excelled in continuous action space. DD-DQN produced discrete trajectories, as shown in Figure 11a,c, mainly because it had only 5 possible actions: forward, left, right, up, and down. In contrast, ACKTR produced smooth, continuous trajectories, as shown in Figures 9 and 11b,d, because it could output actions in continuous space as three linear velocities. Such differences in trajectories were more evident in the block world, where each agent had to avoid densely located obstacles on its way to the goal.

Figure 11. Trajectories (a,b) were collected from the woodland and (c,d) from the block world, respectively. Here, training models at step 90 k (green), 95 k (yellow), and 100 k (red) were deployed for each algorithm. Note that DD-DQN made discrete trajectories, whereas ACKTR generated continuous trajectories. In each panel, only 3 trials among many are drawn for clarity.
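The trajectory difference follows directly from the two action spaces. A schematic contrast is sketched below; the five discrete actions are taken from the text, while the velocity magnitudes are illustrative assumptions of our own.

```python
# DD-DQN picks one of five fixed motion primitives per step, producing
# axis-aligned trajectory segments; ACKTR instead emits (wx, wy, wz) directly,
# so its trajectory can bend smoothly. Magnitudes here are illustrative.
DISCRETE_ACTIONS = {
    "forward": (1.0, 0.0, 0.0),
    "left":    (0.0, -1.0, 0.0),
    "right":   (0.0, 1.0, 0.0),
    "up":      (0.0, 0.0, 1.0),
    "down":    (0.0, 0.0, -1.0),
}

def discrete_velocity(name):
    """Look up the fixed velocity primitive for a discrete action."""
    return DISCRETE_ACTIONS[name]
```

Because every discrete step snaps to one of these five primitives, the resulting paths are piecewise axis-aligned, which matches the discrete trajectories in Figure 11a,c.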
Our arena world consists of two kinds of obstacles: large wall-type obstacles, namely two sand-colored double arch bridges and two hoop-embedded black walls, and small obstacles such as cones, half spheres, and cuboids of different colors, as shown in Figures 8 and 12. Since all obstacles were densely placed along a circular track, the algorithms with discrete action space failed to reach the goal, mainly because they lacked delicate yaw control. Restricting the action space was nevertheless unavoidable, because adding yaw would have enlarged the action set, which is known to lead to unstable learning. Note that DD-DQN could not reach the goal, as shown in Figure 12a, whereas ACKTR reached the goal successfully by producing different linear velocities.

Experiment on Robustness of Actor Network
As previous studies suggested that a segmentation model can offer robustness against certain environmental changes, experiments were designed to see how the actor network and the corresponding U-Net segmentation model would perform when the environment changed. Since our actor network learned to follow the maximum-reward zone suggested by the U-Net, the actor network should perform well as long as the segmentation succeeds on the given obstacles.
Our segmentation network combined with the actor-critic network was trained on the arena track shown in Figure 12a. Three more tracks were reconfigured from arena (a), quantified by how much each track was changed as a ratio σ. Arena track (b) consists of shuffled obstacles (σ = 65%), keeping the same 4 large obstacles, whereas arena track (c) was made by shuffling the large and small obstacles as well as adding new obstacles with different colors (σ = 75%). Arena track (d) was made from (c) with more obstacles (σ = 81%). Results suggest that our network cooperating with the U-Net could reach the goal in (b) and (c), with changes of about 65% and 75%, respectively, from (a). However, when the environment differed too much from the training environment, as with the 81% change in (d), it failed to reach the goal, probably because of the severely reconfigured environment. Note that σ has been calculated as follows:
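The formula for σ is not reproduced in this excerpt. One plausible reading, offered purely as a hypothetical illustration and not as the paper's definition, is the fraction of obstacles that were moved, recolored, or newly added relative to the reconfigured track's total obstacle count:

```python
def change_ratio(n_changed, n_added, n_total_after):
    """Hypothetical sketch of the change ratio sigma: altered plus added
    obstacles over the total obstacle count of the reconfigured track.
    This is NOT the paper's formula, which is not shown in this excerpt."""
    return (n_changed + n_added) / n_total_after
```

Under this reading, shuffling 13 of 20 obstacles without adding any would give σ = 0.65, matching the order of magnitude of the reported ratios.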

Racing between Human Pilot and Algorithm
This experiment compared the performance of the RL algorithms against drone pilots. For this purpose, pilots were recruited according to their drone control skill levels, i.e., novice, intermediate, and expert. The novice had about two weeks of drone piloting experience, the intermediate pilot about six months, and the expert pilot more than two years. Test environments were the woodland, the block world, and arena (a) in Figure 12. To measure how fast the pilots or algorithms reached the goal, 100 lines equally dividing the track up to the goal were installed, and 1 point was given each time the drone passed one of them. After the setup described in Section 5.1, all pilots practiced ten times before their data were recorded for comparison against the algorithms.
In the woodland, the fastest was the expert pilot (47 s), followed by ACKTR (51), the intermediate pilot (78), DD-DQN (84), and the novice pilot (102), as shown in Figure 13. With more densely located obstacles in the block world, the gap between the expert (37) and the ACKTR agent (39) narrowed. Interestingly, the gap between the intermediate pilot (56) and DD-DQN (70) widened, indicating that the DD-DQN agent had difficulty avoiding obstacles with a discrete action space. In the arena, ACKTR reached the goal fastest (51), followed by the expert (61), while all others could not complete the task, suggesting that this track was more difficult than the other two.

Figure 13. Performance comparison in terms of time taken from the start (score = 0) to the goal (score = 100) between human drone pilots and algorithms. For the woodland (a) and the block world (b), both the expert and ACKTR were the winners with similar speeds, whereas ACKTR excelled in the arena world (c), where DD-DQN and the intermediate and novice drone pilots could not complete their races because of collisions, indicated as red dots, made during navigation. Each algorithm (as well as each drone pilot) made 3 trials in each environment. The average score for each case is plotted as a colored line, with shading indicating its variation. Note that the deployed models were trained for 90 k (a), 95 k (b), and 100 k (c) steps, respectively. Our demo video can be found at: https://youtu.be/en6Xwht8ZSA.

Discussion
The drone can fly outdoors as well as indoors depending on its usage. Training a drone with a machine learning algorithm requires a well-prepared dataset, and collecting a high-quality outdoor dataset in particular is not easy. Gazebo has been a favored drone simulation tool with which one can collect data and fly a drone after training, although most cases so far have been indoor navigation. Airsim provides a new opportunity for collecting realistic outdoor datasets. In this study, we used one of the packages from Airsim, i.e., the woodland, and designed two environments, i.e., the block world and the arena world, for training deep RL algorithms. We also plan to release these for public use.
The present study utilizes both RGB and depth map images as input for the 4 RL algorithms with discrete action space. The RGB image is the first-person view of the environment from the drone, and the depth map shares the same view except that it is extracted from stereo images, so these inputs contain only partial information about the environment; by contrast, the input image in playing an Atari game with DQN typically contains the whole state at a given moment. This could make the present pathfinding task harder. To learn how each sensor contributes to the performance, an experiment was carried out using 3 input cases separately: RGB, depth, and depth + RGB. Results suggest that the depth + RGB case outperformed the other two. Comparing the 4 deep RL algorithms with these input types, we found that all of them succeeded in finding the goal in the woodland, whereas only Dueling DQN and DD-DQN reached the goal in the block world. This confirms that pathfinding in the block world was relatively harder than in the woodland for the deep RL algorithms. However, these algorithms failed to reach the goal on the arena track, confirming their limitation in yaw control.
It is well known that acquiring the skill to maneuver a drone with an RC, especially through a group of 3D obstacles, requires a substantial period of training for a pilot. In this study, deep RL algorithms were used to train a drone agent that is supposed to find a path through obstacles and eventually arrive at the goal. By using the HITL mode of Airsim, it was possible to measure the performance of human pilots and compare it with the performance of the algorithms.

Conclusions
In this study, we compared RL algorithms with discrete and continuous action spaces on the task of obstacle avoidance by a drone. A new method for continuous action space, using a segmentation network trained jointly with an actor-critic network within the RL paradigm, was proposed. Its significant advantage is, of course, that manual labeling is unnecessary, saving labor and time; the performance of the U-net-based segmentation model also improved considerably during training. A key question for RL in autonomous navigation is how to minimize the gap between the training and deployment environments. Through a series of experiments, we demonstrated that our trained model made successful flights not only in the trained environment but also in some reconfigured environments. To the best of our knowledge, the present study is the first in which human pilots raced a drone against algorithms and their performances were compared. We plan to build a real arena-shaped environment, like those used in drone racing championships, for testing obstacle avoidance drones.