Goal-Oriented Obstacle Avoidance with Deep Reinforcement Learning in Continuous Action Space

Abstract: In this paper, we propose a goal-oriented obstacle avoidance navigation system based on deep reinforcement learning that uses depth information of the scene, as well as the goal position in polar coordinates, as state inputs. The control signals for robot motion are output in a continuous action space. We devise a deep deterministic policy gradient network with depth-wise separable convolution layers to process the large amount of sequential depth image information. The goal-oriented obstacle avoidance navigation is performed without prior knowledge of the environment or a map. We show that through the proposed deep reinforcement learning network, a goal-oriented collision avoidance model can be trained end-to-end without manual tuning or supervision by a human operator. We train our model in simulation, and the resulting network is directly transferred to other environments. Experiments show the capability of the trained network to navigate safely around obstacles and arrive at the designated goal positions in simulation, as well as in the real world. The proposed method exhibits higher reliability than the compared approaches when navigating around obstacles with complex shapes. The experiments also show that the approach is capable of avoiding not only static, but also dynamic obstacles.


Introduction
With the recent advances in computing and sensor technologies, the prevalence of robotic devices in our everyday lives is increasing. Currently, robots are freely interacting with physical environments and navigating in them. However, realizing this in varying real-world situations can still be a daunting task. In unpredictable surroundings, fundamental capabilities such as mobile robot navigation require much attention. For mobile robots to navigate successfully in real-world environments, they need not only the ability to navigate to a goal, but also to avoid collisions in a safe manner. Generally, a path planner can be employed to find a clear path in the free space of the environment. However, in varying or unknown spaces, it is hard to generate such a path. Unless the work environment is fully described, it is hard to guarantee that the generated path will not lead to a collision. Keeping track of and considering all the changes in the planning stage of navigation is infeasible. Therefore, robot navigation must include a way to detect and avoid obstacles with only local information in space and time.
Multiple types of sensors can be used to recognize occupancy in the environment, such as laser scanners, sonars, depth cameras, or bumper sensors. By using sensory data, a robot can navigate towards a clear path. Such behavior can be programmed into the navigation system. For example, when an object is at a certain predetermined distance from the robot, an action can be programmed to avoid it. However, it can be very difficult and time-consuming to predict all possible situations that the robot may encounter. Recent advances in deep reinforcement learning have shown the ability to solve increasingly complex tasks. These methods have been successfully adopted in a wide variety of fields and have been applied to robot navigation. This has made it possible to exclude the necessity of human supervision and environment-dependent manual engineering of robots. For this method to be viable, the robot needs to gather a large variety of experiences, which might be costly if performed in the real world. Owing to the generalization ability of deep learning models, it is possible to learn in a simulated environment and apply the trained network to the real world. For this reason, the network needs to have as much information about the surroundings as possible, while still outputting navigation commands in a timely manner. This creates an issue for implementing navigation systems with a large input space, such as images, which require significant computing power to run the network. To reduce the computing burden of standard convolution for image processing, depth-wise separable convolution was established in [1]. It has achieved comparable results with significantly reduced multiply-accumulate operations and parameters [2,3]. This allows significant amounts of image data to be used in real-time robotic applications with limited computing resources.
Yet, to obtain goal-oriented navigation, the processed image data still need to be combined with the goal position information.
In this paper, we set out to overcome these challenges and propose a navigation system that learns in a simulated environment how to avoid obstacles in combination with goal-oriented navigation. To achieve this purpose, we use depth images and the robot's estimated position with respect to the goal as inputs to the network. We aim to improve upon the existing approaches, gain the capability of processing input signals of higher complexity, and solve the issues associated with it. The main contributions of this paper can be itemized as follows:
• Creation of a convolutional deep deterministic policy gradient network for tackling a large amount of input data.
• Development of a deep deterministic policy gradient network with mixed inputs for goal-oriented collision avoidance.
• Transfer of a network, learned in a simulation, to the real environment for map-less vector navigation with depth image inputs.
The remainder of this paper is organized as follows. In Section 2, the related works are reviewed. The proposed network is described in Section 3. Training of the network is explained in Section 4, followed by experimental results given in Section 5. Finally, a summary and discussion are presented in Section 6.

Related Works
In an ideal situation, the robot would be fully aware of its surroundings and move in a completely known environment. In such cases, a multitude of possible algorithms exist that can plan and execute the robot motion to the goal. If all of the obstacles are known, a path planner can be employed to set a path for a robot to follow. Algorithms such as A*, RRT*, and their extensions [4][5][6][7] are quick and easy ways of obtaining a path. To obtain better or constrained paths, heuristic methods are combined with optimization algorithms [8][9][10][11]. Here, optimization algorithms, such as the genetic algorithm, simulated annealing, and others, are introduced during the planning and post-processing phases to obtain the optimal path. However, if they encounter an obstacle, the planning needs to be performed again. Reactive re-planning can be performed with the D* algorithm and its extensions [12,13], but still, the map of the environment should be known beforehand, or the algorithms work with the assumption that the previously known free space holds no obstacles. Obtaining a full map of the environment can often be unfeasible or even impossible, especially if the robot is deployed in a new, unknown environment. Therefore, fully planned approaches cannot always be implemented, and an obstacle avoidance strategy for unknown environments needs to be developed.
In an unknown or uncertain environment, obstacle avoidance is typically solved by using range sensors and applying predetermined logic [14]. The sensor input is used to detect the obstacles. Here, visual SLAM methods can be used to understand the environment and navigate the robot [15][16][17]. Segmentation can recognize traversable planes from a visual image or point cloud [18][19][20]. Laser, sonar, and other range sensors can detect obstacles and their shapes to determine in which direction to move [14,[21][22][23]. In particular, sensor readings from multiple robots can be shared through a centralized communications network [24]. When the surroundings have been mapped, the path planners can be updated based on set parameters for motion towards the goal. However, here, collision avoidance is usually implemented by decoupling the object recognition and motion planning. First, an obstacle is detected, and then the motion is selected based on a policy. The policy needs to be described beforehand and based on the obstacle. Not only might it be difficult to program an intelligent avoidance strategy, but issues can also arise when encountering previously unseen obstacles.
Currently, deep learning has shown great progress in dealing with large-scale input data and directly producing finely-tuned policies in various applications, such as playing video games [25][26][27], stock market analysis [28][29][30], as well as various implementations in robotics [31]. Deep learning networks have been used in robotic applications to learn navigation, as well as collision avoidance, from the input signals [32][33][34][35][36]. Several applications have been developed that directly obtain motion policies from raw images [37][38][39]. Here, the motion policies are a direct result of the input image and do not require fine-tuning by an outside source. In [39], a dueling architecture-based Deep Double Q Network (D3QN) was used to obtain depth predictions from a monocular camera and perform collision avoidance. The network aims to avoid obstacles, but navigates without a specified goal position. Additionally, the D3QN algorithm outputs are in a discrete action space, which can result in rough navigation behaviors. In [40], the authors proposed a deep reinforcement learning model that separates the navigation towards a goal and the collision avoidance. The proposed two-stream Q network model learns the goal navigation and collision avoidance tasks separately, and the given actions are combined using an action scheduling method. The proposed approach uses a 360° laser range finder as the input and outputs actions in a discrete action space. Predetermined heuristics are used to decide the switching between navigation and obstacle avoidance actions. For deep reinforcement learning to be capable of producing outputs in a continuous action space, the authors of [41] proposed a Deep Deterministic Policy Gradient (DDPG) network. This network was extended for map-less navigation purposes in [33] as an Asynchronous Deep Deterministic Policy Gradient (ADDPG) network.
Here, measurements from a laser scanner in 10 directions and the estimated robot location are used as the environmental information to navigate the robot towards a goal. A 180° laser sensor must be installed on the robot, and the state input considers only the current state of the robot. Laser sensor inputs are easily implementable in deep learning networks, especially if the number of inputs is reduced to minimize the number of learning parameters. However, they might not reflect a sophisticated scene in the environment, since they generally produce information at a set height and have difficulties navigating around objects of certain shapes, as we show in Section 5.
Using deep reinforcement learning leads the network to tune its parameters based on the value function. However, without supervision, the network has to go through trial and error to explore and learn the best policy for the multitude of states it can encounter. This means that the training is difficult to perform in a real environment and tends to be done mostly in simulation. Several works [33,39,42,43] showed that deep learning networks are trainable in virtual environments and afterward transferable to the real world in robot applications. In this paper, we aim to create a goal-oriented collision avoidance navigation system that makes decisions on policies based directly on the sequential depth images and estimated goal location.

Deep Learning Network for Collision Avoidance in a Continuous Action Space
To tackle the problem of learning when and how to avoid obstacles from a depth image and goal location without a pre-existing map, we propose a Convolutional Deep Deterministic Policy Gradient (CDDPG) network. The CDDPG is a Deep Deterministic Policy Gradient (DDPG)-based [41,44] network with depth-wise separable convolutional layers that are designed to deal with the large-scale information of the depth images efficiently. DDPG networks are actor-critic networks that output an action in a continuous space.

Convolutional Deep Deterministic Policy Gradient
For effective goal-oriented obstacle avoidance from depth images in a continuous action space, we devised a DDPG-based network by implementing depth-wise separable convolutional layers. A DDPG-based network is an actor-critic type of network with similar structures of layer layouts for both parts, but with separate parameters. As the input of the actor network, we combined a stack of depth images with distance information to the goal, and the output was a set of real values representing angular velocity ω and linear velocity v, obtained from a continuous action space. The critic network takes the action produced by the actor, as well as the state, as an input and estimates the Q-value of the current state. The structure of the actor network is visualized in Figure 1. Depth-wise separable convolution is performed on a stack of depth images. Max pooling is performed on the output. Positional information is concatenated with the depth information, followed by two fully connected layers and an output layer. ∗ and ⊕ denote convolution and concatenation, respectively.
Due to the limited field of view of depth cameras, the navigation system should recall possible obstacles on the periphery of the robot. Therefore, it should perceive not only its current state, but also previous states. Consequently, we used a sequence of depth images as the state representation. The stack of inputs from s_{t−n} to s_t serves as a short-term memory to recall a possible obstacle on the periphery even when the current scene s_t does not show one. Since the state input is a pixel value array from a sequence of depth images in the range t−n, …, t, the amount of input data is significantly larger than that of similar approaches using laser readings [33]. To mitigate this issue, we employed depth-wise separable convolution layers for image data processing. To obtain a multi-channel image as an input for the CDDPG network from single-channel depth information, the depth images from t−n to t are stacked in sequential order, resulting in an (n+1)-channel image. First, each channel of the multi-channel input image is convolved with a separate filter, and the resulting feature maps are stacked back into a multi-channel image. Then, a 1 × 1 convolution is performed over all the channels. The Rectified Linear Unit (ReLU) is used as the activation function. Max pooling is performed on the convolutional layer output, and the state is flattened.
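The computational savings of this factorization can be illustrated with a simple parameter and multiply-accumulate count. The sketch below assumes a 3 × 3 kernel, the paper's 80 × 64 input resolution and 10-image stack, and an illustrative 32 output channels; the paper does not state its actual layer widths, so the absolute numbers are only indicative of the relative savings.

```python
# Parameter and multiply-accumulate (MAC) counts for a standard convolution
# versus a depth-wise separable one (padding and stride ignored for clarity).
# The 32 output channels are an illustrative assumption, not the paper's value.

def standard_conv_cost(k, c_in, c_out, h, w):
    params = k * k * c_in * c_out       # one full k x k x c_in kernel per output channel
    macs = params * h * w               # each kernel is applied at every output pixel
    return params, macs

def separable_conv_cost(k, c_in, c_out, h, w):
    dw_params = k * k * c_in            # depth-wise step: one k x k filter per input channel
    pw_params = c_in * c_out            # point-wise step: 1 x 1 convolution mixing channels
    params = dw_params + pw_params
    macs = params * h * w
    return params, macs

# 10 sequential 80 x 64 depth frames stacked into a 10-channel input
std_p, std_m = standard_conv_cost(3, 10, 32, 80, 64)
sep_p, sep_m = separable_conv_cost(3, 10, 32, 80, 64)
print(f"standard:  {std_p} params, {std_m} MACs")
print(f"separable: {sep_p} params, {sep_m} MACs")
print(f"parameter reduction: {std_p / sep_p:.1f}x")
```

Even at this small scale, the separable layer needs roughly 7× fewer parameters and MACs, which is what makes processing a whole stack of depth images tractable on modest hardware.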
Based on the current location estimation, the distance and angle between the robot and the goal position are calculated and expressed in polar coordinates. This information is required for goal orientation, but consists of direct numerical data that are different from the depth images. Consequently, the mixed-input approach is utilized. We added the polar coordinates as a separate input that bypassed the depth-wise separable convolution part of the network. This input is directly concatenated with the flattened layer. Further, two fully connected layers follow. The latter layer is connected to an output that produces the two values of ω and v. In the case of a critic network, the action from the actor network is added in the second fully connected layer, which is connected to a single output producing the Q-value of the state. The full network is shown in Figure 2.
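The polar goal input can be derived from the odometry estimate as a distance and a heading-relative angle. A minimal sketch, with function and variable names chosen for illustration (they are not taken from the paper):

```python
import math

def goal_in_polar(robot_x, robot_y, robot_yaw, goal_x, goal_y):
    """Distance and heading-relative angle to the goal in polar coordinates.

    The robot pose (x, y, yaw) is assumed to come from odometry.
    """
    dx = goal_x - robot_x
    dy = goal_y - robot_y
    distance = math.hypot(dx, dy)
    # angle to the goal relative to the robot's heading, wrapped to (-pi, pi]
    angle = math.atan2(dy, dx) - robot_yaw
    angle = math.atan2(math.sin(angle), math.cos(angle))
    return distance, angle
```

These two scalars bypass the convolutional part of the network and are concatenated directly with the flattened depth features.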

Reward
Training the network is performed by employing a reinforcement learning framework. Three conditions define the reward function in our CDDPG network: If the robot's current distance to the goal position P_t is less than a threshold P_δ, the goal is considered reached, and a positive reward r_g is applied. If a collision with an obstacle is detected, a negative reward r_c is applied. Otherwise, the reward r(s_t, a_t) is calculated based on the robot's linear and angular velocity. This design rewards the robot for linear motion and discourages unnecessary turning.
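The three conditions can be sketched as follows. The terminal values r_g = 80 and r_c = −100 are the ones the paper reports for training; the shaping term v − |ω| is only a plausible stand-in, since the paper states the idea (reward linear motion, penalize turning) but not the exact expression or coefficients.

```python
def reward(dist_to_goal, collided, v, omega,
           goal_threshold=0.3, r_goal=80.0, r_collision=-100.0):
    """Three-condition reward: goal reached, collision, or velocity shaping.

    The shaping term below is an assumed form, not the paper's exact formula.
    """
    if dist_to_goal < goal_threshold:   # goal reached: large positive reward
        return r_goal
    if collided:                        # collision: large negative reward
        return r_collision
    return v - abs(omega)               # encourage forward motion, discourage turning
```

With this shaping, driving straight at full speed yields a small positive reward each step, while spinning in place yields a negative one.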

Training
To learn the collision avoidance with deep learning methods, training needs to be performed. Training in the real environment can be very expensive and dangerous not only to the robot, but also people and the surroundings. Therefore, a popular method to avoid this problem is to perform the training in a simulation. We employed the Robot Operating System's (ROS) Gazebo simulator as the environment for gathering the data and performing movements of the robot.
The training was carried out on a computer equipped with an NVIDIA GTX 1080 graphics card, 32 GB of RAM, and an Intel Core i7-6800K CPU. We used a Mobile Robots Pioneer P3DX robot model and controlled its angular and linear velocities through ROS commands. Random noise was applied to the network input to diversify the training data and help the network generalize. The input depth image at time step t was combined with the images from time steps t−1 to t−n. For training, we used single-channel grayscale depth images with an image size of 80 × 64 pixels. Ten sequential images were used as a stack of input images in the network. The goal position was given in polar coordinates with respect to the robot. Here, the position of the robot was estimated from its odometry. The CDDPG network output two action values in continuous space. Once these values were obtained, random noise was applied to them. Collisions were detected at a 0.2 m distance from obstacles. Upon a collision, the episode was terminated. Otherwise, the episode was terminated when the maximum number of 500 steps was reached. The rewards for collision and reaching a goal were empirically selected as r_c = −100 and r_g = 80, respectively.
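The exploration noise applied to the action outputs, together with the velocity bounds used in training, can be sketched as below. The paper only states that random noise was applied; the Gaussian form and its scale are assumptions made for illustration.

```python
import random

V_MIN, V_MAX = 0.0, 0.5   # m/s; the robot is not allowed to reverse
W_MAX = 1.0               # rad/s; angular velocity bound

def add_exploration_noise(v, omega, sigma=0.1):
    """Perturb the actor's output for exploration, then clip to action bounds.

    Gaussian noise with scale sigma is an assumption; the paper does not
    specify the noise distribution or magnitude.
    """
    v = v + random.gauss(0.0, sigma)
    omega = omega + random.gauss(0.0, sigma)
    v = min(max(v, V_MIN), V_MAX)           # clip linear velocity to [0, 0.5]
    omega = min(max(omega, -W_MAX), W_MAX)  # clip angular velocity to [-1, 1]
    return v, omega
```

Clipping after adding noise guarantees that every executed command stays within the velocity limits the network was trained under.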
As the simulation environment for the training, we used a modified Gazebo world provided by [39], presented in Figure 3. Goal positions, as well as the robot's starting position and pose, were randomized at each episode. Since the depth images were received from a single forward-facing camera located on the robot, there was no reliable information about what was located behind the robot. Consequently, the robot was not expected to move backward, and the minimum linear velocity was set as v_min = 0 m/s. During training, the maximum linear velocity was set as v_max = 0.5 m/s, and the maximum angular velocity was set to ω_max = ±1 rad/s. Each performed step in the simulation was recorded in a replay buffer with a maximum size of 80,000. Each record consisted of {s_{t−1}, a_{t−1}, r(s_t, a_t), c_t, s_t}, where c_t represents the collision at time step t. Once the buffer was full, old records were discarded and replaced with new ones. Network parameters were updated by learning from mini-batches of size 10 after each performed step. The loss was calculated as the mean squared error between the estimated Q-value and the actual reward. The Adam optimizer was used for loss minimization [45]. The list of network parameters used in training is presented in Table 1. We trained the network until the maximal Q-value converged, which took approximately 3000 episodes (about 55 h).
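The experience buffer described above (capacity 80,000, oldest records discarded first, mini-batches of size 10) can be sketched as a simple FIFO structure; the class and method names are illustrative, not from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """FIFO experience buffer: once full, the oldest records are discarded,
    matching the paper's description (capacity 80,000, mini-batches of 10)."""

    def __init__(self, capacity=80_000):
        self.records = deque(maxlen=capacity)  # deque drops the oldest item when full

    def add(self, state, action, reward, collided, next_state):
        # one record per simulation step: {s_{t-1}, a_{t-1}, r(s_t, a_t), c_t, s_t}
        self.records.append((state, action, reward, collided, next_state))

    def sample(self, batch_size=10):
        # uniform random mini-batch for one network update
        return random.sample(self.records, batch_size)
```

Sampling uniformly from a large sliding window of past steps decorrelates consecutive experiences, which is what makes the per-step mini-batch updates stable.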

Experiments
To validate the results of the training, experiments with the obtained network were conducted in two settings: a simulated environment in Gazebo and a real-life environment with inputs from a physical depth camera and a robot. In both setups, the same trained network was used to conduct the experiments. Quantitative and qualitative experiments were carried out where our method was compared with similar reactive obstacle avoidance approaches. For comparison purposes, we performed the same experiment with a map-less navigation method based on deep reinforcement learning proposed in [33], referred to as the ADDPG method. It employed an asynchronous DDPG network to learn navigation in a map-less environment based on single time step data of sparse laser readings and odometry. Training for this method was performed in the same environment as the CDDPG depth image-based method, depicted in Figure 3. An additional comparison was made with an Erle-rover laser obstacle avoidance package method for ROS simulations [46]. The Erle-rover method calculates the mean vector of laser readings and navigates in its direction, thus moving towards the empty space of the environment.

Experiments in the Simulated Environment
The main advantage of using input data from depth images is a more thorough scene representation. To show the benefit of a more complex representation of the environment through depth images over laser readings, avoidance around obstacles of varying complexity was explored. Naturally, a laser scan will only detect obstacles that lie on its scanning plane, and a collision will occur with obstacles that cannot be perceived. However, the geometrical composition of the obstacle can also be a problem for laser-based obstacle avoidance, especially if sparse readings are used. Therefore, we performed comparison experiments of navigation around simulated objects of different compositions. This is portrayed in Figure 4. The robot had to navigate around the obstacle to a location behind it. In Figure 4d, the robot's task was to navigate to a location between the two obstacles. From Figure 4, we can observe that sparse laser readings had difficulty navigating around obstacles that did not have a solid shape. The sparse readings would register the distance to an obstacle at one time step, but after motion, would not register the obstacle at the next step. Here, a part of the obstacle would be located between the laser readings, as is the case in Figure 4f, where the laser read the gap between the legs of a human and perceived it as a safe direction to move. A special case is depicted in Figure 4e, where a coffee table surface was located on a pole. The laser-based approaches could not perceive the narrow pole as an obstacle. On the contrary, the depth image collision avoidance could recognize the full shape of the obstacle and navigate around it. In the case of Figure 4h, the lack of local memory in ADDPG forced the robot to turn away from and again towards an obstacle in consecutive moves, thus freezing the robot and forcing it into a deadlock.
However, such a situation could occur for our method as well if the robot finds itself in a situation where no improvement of the current location can be made over the span of its memory. This is portrayed in Figure 4i. Here, after navigating into the pocket, the robot would constantly turn for 10 steps in one direction and then 10 steps in the other in order to find a direction of safe movement. Since it was unable to find a safe direction, it was perpetually stuck in this motion. Due to the field of view angle restrictions of a single camera, our method may lack peripheral vision, which may lead to a collision in certain circumstances, as is depicted in Figure 4j. The robot navigated around an obstacle until it was not in view and began to turn. Since it was necessary to perform a large turn, the stack did not hold depth images of the obstacle and lost information about its location in relation to the robot.

To compare the performance of the map-less navigation between multiple way-points in a simplified setting, we employed a simulated Gazebo environment. We designed a new Gazebo simulator environment for validation and performed the experiments using different obstacle models and layouts. The robot performed navigation over a course of 10 way-points for five laps in the environment while avoiding obstacles along the way. If the distance between the goal position and the robot's estimated location was less than 30 centimeters, the goal was considered reached. As comparison metrics, we obtained the traveled distance in meters and navigation time in seconds. The performance of each method is displayed in Figure 5. The distance and time metrics over five laps are displayed in Table 2. From the experiment, we could observe that using a more detailed sensory representation of the environment with a larger amount of data was comparable in terms of distance and time to that of ADDPG with sparse laser readings.
The small increase of distance measured in our approach was due to more careful navigation around the obstacles, with an increased distance around them. The lack of peripheral vision forced the robot to keep an increased distance from the obstacles. Yet, the travel time was similar since the ADDPG method had difficulties navigating close to empty bookshelves, as is presented in Figure 4g. The shape of the bookshelves confused the navigation and forced the robot to slow down. By comparison, the rule-based approach of Erle-rover always led the robot along the safest path, disregarding a more optimal path towards the goal.

Experiments in a Real Environment
To prove the feasibility of the theoretical results obtained by the simulations, we performed experiments in a real environment and directly compared them to the results obtained by the ADDPG method. We used a Pioneer P3DX mobile robot platform on which we mounted an Intel Realsense D435 depth camera. For the ADDPG implementation, we used a Velodyne Puck LiDAR scanner. A laptop with an NVIDIA GTX 980M graphics card, 32 GB of RAM, and an Intel Core i7-5950HQ CPU was used to run the networks. The robot setup is displayed in Figure 6a. As in the virtual simulation, the goal was considered to be reached at a distance of less than 30 centimeters. Just like during the training, the linear velocity was constrained to not exceed 0.5 m/s and the angular velocity to not exceed ±1 rad/s. An example of the proposed network's output velocities in a real environment setting is presented in Figure 6b. If no obstacles were located in front of the robot, the network output the maximal linear and no angular velocity. If objects were in the way of the robot, the angular velocity was increased. When objects were close to the robot, the linear velocity was reduced and the angular velocity increased. In the presented case, the robot reduced its speed and performed a right turn. Experiments in a static environment were performed in an approximately 10 × 10 m area. The robot navigated in the indoor environment, arriving at the goal positions based on the sequence from 0 to 6 labeled in Figure 7. After reaching Point 6, the robot returned to the starting location. Experiments were performed in two settings. First, the CDDPG and ADDPG methods were used in an empty hallway, where the obstacles consisted of smooth walls, as visualized in Figure 7a,c, respectively. Afterward, interior objects such as chairs, fans, and boxes were added as obstacles for collision avoidance. Navigation in this environment is depicted in Figure 7b,d.
When using the CDDPG method, even though the environment was significantly different from the one during the training and the depth camera produced slightly different images from the ones in the simulation, the navigation system was able to recognize obstacles and navigate around them in both settings. The input from the depth camera could be noisy and differed significantly between sequential images. However, the effects of the noisy images were mitigated by adding random noise during the training stage. By comparison, the ADDPG method worked well in a hallway setting and could successfully navigate between the way-points. However, when additional complex obstacles were added in the environment, the ADDPG method failed to recognize or navigate around them. As depicted in Figure 7d, the robot was able to avoid such obstacles as boxes, but was unable to navigate around office chairs and a fan, where human intervention was required. Additionally, at a point where the robot was surrounded by multiple office chairs, it stalled, unable to find a clear way out of the situation. Human assistance was required to navigate the robot towards a clear path. The difference in performance was due to the laser reading passing through the mesh-like structure of the fan and registering only the support pole part of the chairs. On the contrary, our proposed approach was able to react to these obstacles. The navigation towards the goal did not differ significantly if objects were recognized. However, our proposed approach exhibited higher reliability in a varied environment. To show the capability of the proposed method working not only in static, but also dynamic environments, additional experiments were performed, as presented in Figure 8. These experiments were performed in an approximately 5 × 5 m area, and four goal points were located at each of its corners.
The experiment was performed by allowing the robot to navigate between the set goal points while introducing new obstacles, changing the location of previously existing obstacles, and adding dynamic human obstacles. The initial layout with the location of the robot and its starting pose is presented in Figure 8a. In Figure 8b, the robot performed a circuit without obstacles between the goal points. New obstacles are introduced in Figure 8c. From these experiments, we could observe that upon introducing new obstacles or changing the location of previously seen ones, the robot could still perform avoidance and goal-oriented navigation. Changing the environment did not affect its performance. Moreover, our approach exhibited the capability of performing goal-oriented obstacle avoidance in the vicinity of moving obstacles. As was the case with a moving human, the robot was successful in recognizing it and finding a way to navigate around it. The robot reacted to dynamic human motion and produced motion based on it. Here, dynamic obstacle avoidance was possible without explicit motion data of the object. This showed that the CDDPG approach was capable of navigating in a dynamic environment. Additional experiments in static and dynamic environments can be viewed in the Supplementary Video at https://youtu.be/nNWoabjKxIA.

Summary and Discussion
In this paper, a deep reinforcement learning-based algorithm was proposed for goal-oriented obstacle avoidance in a continuous action space. Using depth images, a robot could safely navigate towards its goal while avoiding obstacles. This was done by expanding the DDPG network to include depth-wise separable convolutional layers for processing the depth image data. The network could be trained in its entirety in a virtual simulator and directly transferred for use in a real environment. Experiments were performed in both simulated and real environments to show the feasibility of the proposed network and compare it to existing approaches. The depth image-based collision avoidance provided a more complete representation of obstacles than laser-based methods. This allowed for safer navigation in environments filled with various structures. However, the choice of the number n of sequential images in the stack representing the state of the environment could be very important to how successfully the robot navigated. Since the network was not learning or mapping the features of the surroundings, the robot did not know the location of the obstacles, but rather reacted to them when they came into view. Setting n to a value that was too low could result in the robot constantly turning towards an obstacle once it was no longer in view. The lack of peripheral view could be seen as a negative aspect of the proposed method, as it could lead to a situation where an obstacle comes into view as the robot is turning, as shown in the experiments. This also means that the obstacle avoidance could fail if an object suddenly appeared in front of the robot at a close distance from the side. If an obstacle had previously not been observed, the robot would assume that the direction it was turning towards was clear. Placing an obstacle on the side of the robot in such a case may lead to a collision.
For this purpose, additional cameras on the sides of the robot or a laser sensor could be used in combination with the existing depth camera. A laser sensor should be located near the camera and scan the environment at a wider angle than the depth camera, allowing data on the periphery to be collected. Since separate feature extraction would be required, the mixed-input approach could again be utilized to create sensor fusion within the network. The sensor fusion information would then be combined with the goal data to create goal-oriented obstacle avoidance with sensor fusion. Such sensor fusion will be the focus of our ongoing research. Obtaining the depth images from a monocular camera instead of a depth camera, by applying a depth prediction network, is also considered. Here, a pre-trained depth estimation network for a monocular RGB camera can be utilized to obtain depth images without the need for a depth camera. The output of the pre-trained network can be used as the input for obstacle avoidance, similarly to the approach presented in [39]. The effects of real depth camera noise can be mitigated by adding random noise to the simulated camera. Therefore, the network can be implemented in real-life scenarios as-is after training only in simulation. However, further training of the network in a real environment could improve the smoothness of the robot motion in real-life settings. This will be explored in the future.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: