1. Introduction
UAVs are reshaping the aviation industry and have emerged as potential successors to conventional aircraft. Operating without a pilot on board, they are proving extremely convenient for reaching the most remote regions. Owing to their advantageous characteristics, such as flexibility, light weight, high mobility, and effective concealment, drones were initially associated mostly with the military [
1]. However, technological advancements in the past decade have led to an exponential increase in drone use, and drones are no longer limited to military purposes. Data collection, delivery, and sharing can be achieved efficiently and relatively inexpensively with UAVs, which can be customized with cutting-edge detection and surveying systems [
2]. Today, drones are available in a variety of shapes, sizes, ranges, specifications, and equipment. They serve a variety of purposes, such as transporting commodities, taking photos and videos, monitoring climate change, and conducting search operations after natural disasters [
The development of UAVs has a significant impact on the market, the economy, and a variety of industries.
Although drones have been available for a long time, research on their capacity for autonomous obstacle avoidance has only recently gained attention from the scientific community. As the name implies, autonomous systems are those that can perform specific tasks without direct human intervention. The use of autonomous systems is on the rise as a way to reduce human error and improve operational efficiency [
4]. Most drone operations still involve humans heavily. Drones with autonomous capabilities, however, can be extremely useful in emergency situations since they provide immediate situational awareness and direct response efforts without requiring a pilot on-site. Security and inspection tasks can also be addressed using autonomous drone systems. Researchers initially focused on self-navigating drones that, given predefined GPS coordinates for the departure and destination, could determine the optimal route and arrive at the destination without human assistance [
5]. Although GPS-based navigation is often claimed to avoid collisions, a drone can still collide with a tree, a building, or another drone during its flight.
Researchers have taken different approaches to collision avoidance. Stereo or monocular cameras are used to obtain depth measurements for obstacle avoidance and mapping tasks [
6]. Instead of relying solely on visual devices, researchers have also employed other types of sensors. For instance, laser range finders can provide range measurements for obstacle identification and 3D mapping, and ultrasonic sensors can be directly incorporated into obstacle detection operations [
6]. However, the transmitting device required by such sensors makes them too heavy for small or medium-sized UAVs. Visual sensors, in contrast, are not constrained by geographic conditions or location and are less susceptible to radiation or signal interference [
7]. In addition to being compact, lightweight, and energy-efficient, the cameras also offer rich information about the environment [
8]. Therefore, employing only visual sensors for obstacle detection is the best option for small or medium-sized UAVs. Different strategies have been implemented in vision-based navigation systems to tackle obstacle detection and avoidance; the most popular are based on geometric relations, fuzzy logic, potential fields, and neural networks [
9].
In recent years, the field of deep reinforcement learning (DRL) has experienced exponential growth in both research and applications. Reinforcement learning (RL) can address a variety of complicated decision-making tasks that were previously beyond the capacity of machines, allowing them to solve real-world problems in a human-like manner. Enabling autonomous drones to detect and avoid obstacles with a high degree of accuracy is considered a challenging task. Applying reinforcement learning to collision avoidance can provide a model that dynamically adapts to the environment. In RL-based techniques, the agent learns in a virtual world analogous to a real-world setting, and the trained model is then applied to a real drone for testing [
10]. When training RL models for effective collision avoidance, it is imperative to minimize the gap between the training environment and the real world, and a variety of obstacle avoidance scenarios should be taken into account during training. Choosing the most suitable reinforcement learning model is also essential to obtaining the best result.
The purpose of this study is to explore the use of reinforcement learning models to navigate drones in the presence of static and dynamic obstacles. While previous research in this area has primarily focused on static obstacles, this study provides a comprehensive comparison of three RL algorithms—DQN, PPO, and SAC—for avoiding both types of obstacles. Each algorithm has distinguishing characteristics: DQN is a Q-learning-based method that operates over discrete action spaces, SAC is an off-policy actor-critic method that operates over continuous action spaces, and PPO is an on-policy, policy-gradient method that also employs continuous action spaces. This study was conducted in a simulated setting provided by AirSim, and different training and testing environments were created using Unreal Engine 4. The novelty of this work lies in the comprehensiveness of its analysis, which offers valuable insights into the strengths and weaknesses of different RL techniques and can serve as a starting point for future research in this field.
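For readers who wish to reproduce a comparable setup, the sketch below illustrates how the three algorithms could be instantiated against an AirSim-backed Gym environment. It is a minimal illustration, not the authors' exact implementation: the use of the stable-baselines3 library, the environment name AirSimDroneEnv-v0, and the hyperparameter values shown are assumptions made for the example.

```python
# Minimal sketch (assumed setup): training DQN, PPO, and SAC on an
# AirSim-backed Gym environment with stable-baselines3.
# "AirSimDroneEnv-v0" is a hypothetical wrapper that exposes camera
# observations and discrete or continuous velocity commands.
import gym
from stable_baselines3 import DQN, PPO, SAC

def make_env(continuous: bool) -> gym.Env:
    # Hypothetical environment id; a real project would register its own
    # wrapper around airsim.MultirotorClient here.
    return gym.make("AirSimDroneEnv-v0", continuous_actions=continuous)

# DQN requires a discrete action space; PPO and SAC use continuous ones.
models = {
    "dqn": DQN("CnnPolicy", make_env(continuous=False),
               buffer_size=50_000, learning_starts=1_000, verbose=1),
    "ppo": PPO("CnnPolicy", make_env(continuous=True),
               n_steps=512, verbose=1),
    "sac": SAC("CnnPolicy", make_env(continuous=True),
               buffer_size=50_000, verbose=1),
}

for name, model in models.items():
    model.learn(total_timesteps=100_000)   # 100 k steps, as in this study
    model.save(f"{name}_drone_avoidance")
```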
The paper has been divided into the following sections:
Section 2 discusses related work conducted in the past concerning obstacle avoidance in drones.
Section 3 focuses on the methodology used in this research.
Section 4 discusses the findings of this research with various case studies.
Section 5 provides concluding statements with future directions.
2. Related Work
The autonomous navigation of UAVs has caught the attention of many researchers. Obstacle detection and avoidance are major elements of any autonomous navigation system. A number of approaches have been developed to address this challenging problem, especially in vision-based systems. This body of work can be categorized into computer vision, supervised learning, and reinforcement learning approaches.
2.1. Autonomous Drone Based on Computer Vision and Geometric Relations
The preliminary research focused on simple computer vision algorithms for detecting obstacles and navigating past them using mathematical and geometric relationships. Martins et al. [
11] proposed a real-time obstacle detection algorithm that utilized simple mathematical computations to find contours and identify obstacles and free areas. The computer vision solution was simple and computationally inexpensive. However, it did not solve the navigation problem, as it diverted the drone without implementing a flight plan. Alternatively, Sarmalkar et al. [
12] used GPS technology to program the drone to follow a predetermined route of GPS waypoints. Using an OctoMap, they determined the obstacle’s location with a modified version of 3DVFH (3-Dimensional Vector Field Histogram). The algorithm creates a 2D primary polar histogram based on that location to find a new route. However, these algorithms are computationally expensive.
A typical obstacle navigation technique skirts the obstruction and follows its boundaries. However, focusing too much on the obstacle’s shape makes an obstacle-avoidance algorithm harder to develop, more complicated, and less generally useful. Guo et al. [
13] developed a novel adaptive navigation algorithm that generates circular arc trajectories to avoid obstacles using geometric relationships between the aircraft and the obstacles. This technique works for obstacles of any shape and size, so a drone can avoid any stationary or moving obstacle provided it knows the obstacle’s position and velocity. The work of Aldao et al. [
9] sought to develop a strategy for avoiding stationary and moving objects not considered during trajectory planning in a construction environment. The obstacle’s position is estimated using polynomial regression based on sensor measurements. The optimization algorithm recalculates an updated trajectory profile with the least time and positional deviation. The only downside is that a room model needs to be included in the avoidance algorithm.
Research has suggested that as the drone approaches an obstacle, the obstacle’s apparent size in the image increases, i.e., the convex hull around its key points grows, signaling the obstacle’s presence. Al-Kaff et al. [
8] used SIFT detectors to identify key points and a brute force algorithm to match them. Although overall detection effectiveness and processing time were good, downsides occurred mainly due to the nature of the sensor devices used, which might lead to information loss. Aswini and Uma [
14] improved this method by adding median filtering, sharpening, and a Harris corner detector to locate corners before extracting the SIFT features. The authors also underlined SIFT’s complexity and time-consuming nature. Frontal collision detection and avoidance with monocular imagery are tricky due to the absence of motion parallax and near-zero optical flow in low-resolution cameras. Mori and Scherer [
15] compared relative obstacle sizes between images, even in the absence of optical flow, using SURF feature matching and template matching. This method is faster than SIFT-based approaches and can work with any matched SURF feature. However, it requires obstacles to have sufficient texture to produce SURF key points.
2.2. Autonomous Drone Based on Supervised Learning
Supervised learning is a method of training a model to find underlying patterns and relationships between input data and output labels, which can then be used to predict labels on new and unknown data. Mannar et al. [
16] proposed a control algorithm for drones aimed at navigating forest environments without crashing. The algorithm employed an improved feature vector based on a weighted combination of texture features to calculate distances to obstacles along longitudinal strips of the image frame. The weights were precomputed using supervised learning on frames produced in a simulated forest environment. The approach significantly improved obstacle-distance estimation accuracy and computational efficiency, but it relied heavily on texture features.
In recent years, object detection algorithms have exhibited exceptional performance in computer vision for both detection and segmentation. For obstacle detection, insights can be gained from Zhai et al. [
17], who presented a cloud-enabled, autopiloted drone capable of field surveillance and anomaly detection by integrating customizable DNN and computer vision algorithms. The proposed system uses ResNet and SVM models for a 12-category classification task and YOLO v3 for object detection in drone images. However, the cost of construction is high, and the drone moves only along a preplanned GPS track. Similarly, Fang and Cai [
18] presented a YOLO v3 model for real-time target tracking and a ResNet-18 binary classifier for obstacle detection. The system also included a PID controller to adjust the rotation and motion of the autonomous drone. The models were trained using the COCO dataset.
2.3. Autonomous Drone Based on Reinforcement Learning
A reinforcement learning system enables an agent to interact with an environment, navigate through it, and optimize its behavior based on the rewards it receives. Deep reinforcement learning has brought numerous advancements to the increasingly popular field of drone control. The model discussed by Yang et al. [
19] demonstrates how deep reinforcement learning can work in the context of UAV obstacle avoidance. The study tested two algorithms, Nature-DQN and Dueling-DQN, in the V-REP simulation environment to determine their robustness, and revealed that Nature-DQN learned better strategies but converged more slowly than Dueling-DQN. That study used Deep Q-Networks (DQNs) to explore an unknown 2D environment. However, the increased number of states in 3D environments poses new challenges to RL algorithms. Furthermore, value exploration with uniform action sampling may result in redundant states, especially in environments with inherently sparse rewards. Roghair et al. [
20] attempted to resolve this challenge by designing two approaches: a guidance-based method, which uses a domain network and a Gaussian mixture distribution to compare prior states with future states, and a convergence-based method, which relies on errors to iterate over unexplored actions. These approaches allow reinforcement learning algorithms to scale up to large state spaces and perform better.
Rubí et al. [
21] proposed a deep reinforcement learning approach with two separate agents, path following (PF) and obstacle avoidance (OA), both implemented using the Deep Deterministic Policy Gradient (DDPG) algorithm. The approach uses a cascade structure in which the state of the PF agent is determined by the action computed by the OA agent: the OA agent modifies the PF agent’s path distance error state whenever it detects an obstacle on the vehicle’s route, and the PF agent in turn modifies the reference path to prevent collisions. This structure has several advantages, including being easier to understand, taking less time to learn, and allowing safe training on a real platform. Shin et al. [
10] examined different reinforcement learning algorithms in two categories—discrete action space and continuous action space—and proposed a U-net-based segmentation model for the algorithms with an actor-critic network. The critic network builds the label map from simulated-environment input, while the U-net-based segmentation model is trained under a supervised learning paradigm. The models were trained in three environments of varying difficulty, and the results showed that DD-DQN outperformed the other deep RL algorithms in discrete action spaces, while ACKTR outperformed the competition in continuous action spaces. Furthermore, the trained models performed well in several reconfigured environments. Xue and Gonsalves [
7] showed that a soft actor-critic method can be used to train a drone to avoid obstacles in a continuous action space using only image data, enabling it to perform the task on its own. The authors trained their algorithm using delay learning to achieve stable training results. Combining a variational autoencoder (VAE) with SAC converged faster and achieved higher rewards than using images directly, and the model also performed better in a reconfigured environment without retraining.
2.4. Research Gap
Drone missions may vary based on weather, lighting, and terrain, and existing methods may not cope with such changes. It is crucial to develop new approaches to drone navigation that are robust and adaptable to different environments. The review indicated that, among other methods, researchers are investigating obstacle detection based on geometric relations, supervised learning, and neural networks. As models based on geometric relations do not require training on large datasets, they can be more advantageous than other models. However, such models rely heavily on the precise positioning and orientation of drones in relation to their surroundings. As a consequence, drones may have difficulty estimating their position in environments with complex geometry or lighting conditions that change over time. In these cases, the drone’s position estimate may become inaccurate, resulting in navigation errors. Hence, the majority of research has focused on supervised learning and reinforcement learning.
Although both reinforcement learning and supervised learning employ input-output mapping, reinforcement learning uses rewards and punishments as signals for good and bad behavior, whereas supervised learning provides feedback in the form of the correct set of actions. However, supervised learning requires labeled data for all possible scenarios, which makes it hard to capture the complex and dynamic drone flight environment in a single dataset. The problem can be addressed by increasing the dataset size, but doing so is time-consuming and expensive. Furthermore, a supervised model is limited to the task it was trained on and cannot adapt to changes in the environment.
Significant research has been conducted, with most of it focused on navigating around static obstacles. Using reinforcement learning to avoid moving obstacles is possible but can be more challenging, as the drone has to predict the future movements of the obstacle and adjust its trajectory accordingly. It is essential that reinforcement learning is expanded to prevent collisions with dynamic moving obstacles. One way is to train the drone in different scenarios with dynamic obstacles in a simulation environment. By doing so, the drone will learn to deal with a wide range of obstacles and environments more robustly.
4. Experimental Results
4.1. Training Evaluation
This work attempts to train the drone to detect and avoid obstacles autonomously. A reinforcement learning approach was employed to achieve this goal, and three different algorithms were compared to find the most effective one. Several environments of varying difficulty, shown in Figure 3, were used to train the algorithms. The goal is to reach the exit point without hitting any obstacles. Positive and negative rewards were assigned to the drone based on the actions it took. Each algorithm was trained for 100 k steps. At the start of every episode, the environment was chosen randomly and the drone was placed at a different starting point to increase randomness.
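The exact reward structure is not reproduced here, so the following sketch only illustrates one plausible shaping consistent with the description above: a positive reward for progress toward the exit and a penalty for collisions. The specific constants and the function signature are assumptions made for illustration.

```python
# Illustrative reward shaping (assumed values), consistent with rewarding
# progress toward the exit point and penalizing collisions.
def compute_reward(prev_dist: float, curr_dist: float,
                   collided: bool, reached_goal: bool) -> float:
    if collided:
        return -100.0          # terminal penalty for hitting an obstacle
    if reached_goal:
        return 100.0           # terminal bonus for reaching the exit point
    # Dense shaping term: positive when the drone moved closer to the exit,
    # negative when it moved away, plus a small per-step cost to discourage idling.
    return 10.0 * (prev_dist - curr_dist) - 0.1
```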
During the training process, the algorithms were evaluated every 500 steps to analyze their performance. The evaluation results of the three algorithms—DQN, PPO, and SAC—are provided in
Figure 12,
Figure 13 and
Figure 14, respectively. The mean rewards of all three algorithms increase as the number of training steps increases, which suggests that the training is progressing in the right direction. SAC achieves the highest reward, owing to its actor-critic architecture and sample efficiency; it employs entropy regularization to maximize a trade-off between the expected return and the entropy (randomness) of the policy. Perhaps surprisingly, DQN outperformed PPO in this case, because DQN offers better sample efficiency via its replay buffer. PPO policies, in contrast, are prone to overshooting the reward peak or stalling prematurely when policy updates are too large; vanishing and exploding gradients, which are among the most challenging aspects of neural network training, compound this, and the algorithm may not be able to recover from a poor update.
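To make the entropy regularization mentioned above concrete, SAC optimizes the standard maximum-entropy objective (the general formulation, not a result specific to this paper):

J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi}} \Big[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big], \qquad \mathcal{H}\big(\pi(\cdot \mid s_t)\big) = -\mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\big[\log \pi(a \mid s_t)\big],

where \alpha is the temperature coefficient that weights the entropy bonus against the reward, controlling how much randomness is retained in the policy.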
4.2. Case Studies
To better understand and verify the effectiveness of the three algorithms, several tests were designed to study their behavior. Two different testing environments were created in Unreal Engine 4, in which the trained models were tested. The case studies discuss whether the different algorithms were able to reach the desired goal and how long it took them to do so.
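As an illustration of how such a test episode might be run, the sketch below loads a trained policy and rolls it out until the episode ends. It assumes the stable-baselines3 API, the classic 4-tuple Gym step interface, and the hypothetical AirSimDroneEnv wrapper from the earlier sketch; it is not the authors' exact evaluation code.

```python
# Minimal evaluation rollout (assumed setup): load a trained SAC policy and
# fly one episode in a held-out test environment, logging the elapsed time.
import time
import gym
from stable_baselines3 import SAC

env = gym.make("AirSimDroneEnv-v0", continuous_actions=True)  # hypothetical test world
model = SAC.load("sac_drone_avoidance")

obs = env.reset()
done, start = False, time.time()
while not done:
    action, _ = model.predict(obs, deterministic=True)   # greedy policy at test time
    obs, reward, done, info = env.step(action)            # classic Gym 4-tuple API
print(f"Episode finished in {time.time() - start:.1f} s")
```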
4.2.1. Case-1: Simple Pillar World
In this case study, the environment is set in a closed room with pillars positioned at random locations throughout the room. A snapshot of the environment is given in
Figure 4i. A diagram illustrating the position of the pillars can be found in
Figure 15b. The simulation results are given in
Figure 15, which visualizes the time and path taken by the different algorithms in the simple pillar world. Based on the results, we conclude that DQN and SAC successfully made it to the exit without any collisions. DQN even managed to reach the exit before SAC as shown in
Figure 15a. In
Figure 15b, it is visible that PPO took a slight detour and possibly hit the wall in the process, ending the episode. SAC also came close to the wall but was able to turn back just in time and avoid hitting it. All the models initially took a path inclined to the right; this is because most of the pillars are on the left side, so the models chose the route with the fewest obstacles.
4.2.2. Case-2: Complex World with Moving Obstacles
This case study envisions an environment consisting of an open-roof room that is largely surrounded by trees. A snapshot of the environment is given in
Figure 4ii. As part of this case study, two mobile obstacles were introduced: a walking archer and a rolling ball.
Figure 16b shows an illustration of how different obstacles are positioned relative to one another in the environment. The figure also depicts the trajectory of the mobile obstacles. The simulation results are given in
Figure 16, which visualizes the time and path taken by different algorithms. It is interesting to note that all algorithms managed to reach the exit even though it seems to be a more complex environment than Case-1. As seen from the time graph in
Figure 16a, SAC and PPO reached the exit at about the same time. In addition,
Figure 16b reveals that PPO took an unnecessary zigzag path toward the end and still managed to reach the exit along with SAC.
4.3. Discussion
A comparative study of three different algorithms—DQN, PPO, and SAC—was conducted for obstacle detection and avoidance in UAVs, including an analysis of their strengths, weaknesses, and trade-offs. The evaluation results showed that for all three algorithms the mean reward increased with the number of training steps, indicating progress in the right direction. SAC performed the best among the three algorithms because of its actor-critic architecture and sample efficiency. PPO performed the worst, suggesting that on-policy algorithms are not well suited to effective collision avoidance in large 3D environments with dynamic actors. The off-policy algorithms, DQN and SAC, returned promising results; DQN outperformed PPO due to the better sample efficiency afforded by its replay buffer. It should be noted that the obstacles in the testing environments were kept at a reasonable distance from one another; in narrow passages and tight turns, DQN might be less advantageous due to its restricted discrete action space.
A further evaluation of the three algorithms was conducted in different environments to verify their effectiveness. In Case-1, both DQN and SAC successfully made it to the exit without any collisions, while PPO took a slight detour, possibly hitting the wall in the process. In Case-2, all algorithms managed to reach the exit even though the environment was more complex than in Case-1. Overall, the experimental results indicate that reinforcement learning with the SAC and DQN algorithms can efficiently train a drone to detect and avoid obstacles autonomously in varying environments. Moreover, PPO models tend to follow zigzag paths, possibly because they tend to take sharp angles in their actions, which led them to reach the destination more slowly in Case-2. While it may be possible to incorporate constraints that limit sharp turns, this approach has limitations, as it may compromise the adaptability and flexibility of the RL model and introduce biases that are not representative of the real-world environment. Such constraints may also result in suboptimal policies that do not adapt to new situations. Instead, training the model in diverse environments can enable it to learn to navigate to different coordinates using a range of methods. The drone can benefit from this training method by acquiring a more generalized policy and improving its ability to adjust to environmental changes.
5. Conclusions and Future Work
This work aims to develop a drone that can operate autonomously. The use of automated drone systems increases productivity by removing the need for drone operators and enabling reliable real-time data to be accessed more conveniently. The advantages and opportunities of autonomous drones are numerous. However, one of the most challenging aspects of autonomous drones is the detection and avoidance of collisions. This paper compares three different reinforcement learning approaches that could aid in effectively avoiding stationary and moving obstacles. The study was conducted in a simulated setting provided by AirSim, which offers realistic environments for testing autonomous systems without risking damage to expensive hardware or endangering human lives. Different training and testing environments were created using Unreal Engine 4. The comparison of the three algorithms revealed that SAC performed the best, followed by DQN, while PPO was found to be unsuitable for large 3D environments with dynamic actors. Overall, off-policy algorithms are more efficient at avoiding collisions than on-policy algorithms. The proposed approach can be applied to different types of UAVs and environments by training the deep reinforcement learning algorithm in diverse environments.
In future research, the effectiveness of the collision avoidance algorithms could be improved by training them in additional environments that mimic the real world. The research could also be extended with a path-following model incorporated into the current model. Finally, it would be worthwhile to assess hardware power consumption and other related factors, as well as the model’s efficiency in real-life situations, by deploying it on a real drone and analyzing how it avoids obstacles.