1. Introduction
In recent years, small drones have become more popular than ever owing to their flexibility, low power consumption, and reasonable prices. In addition, drones are expected to play a variety of roles that take advantage of this convenience, including infrastructure inspection, package delivery, and mobile surveillance. Unlike manned vehicles such as cars and airliners, unmanned drones do not need to be controlled by a person, and autonomous flight is becoming practical. For autonomous drone flight, collision avoidance is indispensable and regarded as one of the crucial issues. Conventional solutions have typically employed distance sensors. For instance, Light Detection and Ranging (LiDAR) sensors, which can detect objects at long distances, have been employed [1,2]. Depth cameras and stereo cameras are also used to perceive distance [3,4,5,6]. However, such high-performance sensors are usually too heavy, costly, and power-consuming to mount on a small drone. In contrast, low-performance depth sensors can hardly provide accurate long-distance perception and may even increase the risk of collisions with objects.
Many studies on autonomous drone flight employ monocular cameras to detect and recognize objects around the drone [7,8]. Depth estimation from a single monocular camera is also actively researched [9,10,11,12]. However, monocular cameras are of limited use at night because of poor visibility. Infrared cameras can be employed to improve visibility at night, but, like monocular images, infrared images do not contain depth information, i.e., the distances between the drone and surrounding objects. Depth images have therefore become more important than ever for measuring how far objects are from the drone. A lightweight depth camera that is small enough to be mounted on a drone can measure distances of only up to about 10 m, and, unfortunately, high-performance depth cameras are often expensive and too heavy to be mounted on a drone. Therefore, depth estimation from monocular camera images has been extensively investigated.
The contributions of this paper are threefold as follows:
This is the first paper to generate a depth image from a monocular image with optical flow for collision avoidance in drone flight.
We verify that our proposed method can estimate high-quality depth images in real-time, and we demonstrate that a drone can successfully fly while avoiding objects in a flight simulator.
In addition, our method is superior to a previous depth estimation method in terms of accuracy and collision avoidance.
The rest of this paper is organized as follows. Section 2 introduces related work on autonomous drone flight and depth estimation. Section 3 gives an overview of AirSim. Section 4 presents the proposed method for estimating depth with optical flow. Section 5 shows the experimental results, and Section 6 concludes this paper.
2. Related Work
There has been a great deal of work on autonomous drone flight over the past several decades. Much of the work has focused on safe flight, which in particular requires preventing collisions with objects. These studies include obstacle avoidance based on ultrasonic sensors, radar, and image processing [13]. Ultrasonic-based methods work in real-time, but their maximum range is short [14,15]. Radar-based methods perform well in obstacle detection; however, radar is not a good choice for small Unmanned Aerial Vehicles (UAVs) due to its weight [16]. Vision-based methods include obstacle avoidance based on LiDAR images, Time of Flight (ToF) images, binocular images, or monocular images. In [1,2,17], the authors used LiDAR for drone collision avoidance. However, installing many powerful or high-performance sensors increases the weight of a drone, which in turn increases energy consumption. Given that flight time is limited by battery capacity, it is difficult for a drone to fly long distances while carrying a large number of sensors.
In order to tackle this issue, the approaches in [3,4,6] proposed collision avoidance techniques using a light, small depth camera and a stereo camera. These methods enable a drone to avoid obstacles on-the-fly by determining an optimal flight direction from depth images. The work in [4], which is inspired by [3,6], proposed a collision avoidance algorithm that divides an image from a depth camera into five sections and selects the section containing the most distant objects. However, depth cameras with a reasonable price and relatively low weight, such as the Microsoft Kinect, are of limited use on a drone since they can measure only within 10 m [18]. In this context, depth estimation from a monocular camera, which can see farther than such a depth camera, has become attractive.
The studies in [9,10] presented depth estimation methods using a Support Vector Machine (SVM). These systems divide an image from a drone into patches, represent each patch using a set of manually crafted features, and estimate the depth of each patch using a pre-trained SVM classifier. However, the accuracy of these systems is not high because they rely on such hand-crafted features and training data. The methods in [19,20,21,22] are based on Convolutional Neural Networks (CNNs). The CNN-based methods are more accurate than the SVM methods in [9,10], but their accuracy is still not sufficient to realize safe flight without collisions. In [11,23], the authors proposed segmenting images as a preprocessing step before depth estimation with a monocular camera. This improves the accuracy of depth estimation; on the other hand, the computational workload increases considerably, and the methods [11,23] are not suitable for real-time processing because of the segmentation preprocessing. In addition, the methods [11,19,20,21,22,23] use public datasets [24,25] for depth estimation. As the latest work, AdaBins, which is based on a transformer, has been proposed [26]. However, the data used for training and testing in [24,25] are unsuitable for drone views, since the data in [24] are oriented to ground vehicles and the data in [25] are oriented to indoor environments. In [12], the authors collected data from drone views in a flight simulator and presented a method that generates depth images from a monocular image using Pix2Pix [27].
Drones are much smaller than general vehicles, and their processing capability is comparatively lower than that of cars since a large computer cannot be mounted on a drone. Most recent image processing technologies demand substantial computational resources due to the development of deep neural networks (DNNs), and such technologies are not well suited to small devices. However, many embedded systems are oriented toward the Internet of Things (IoT), and offloading computation to a ground computer via data transmission makes it possible to distribute the computational workload. Although this trend has spurred the development of depth estimation technologies using a monocular camera, there is little work that focuses on depth estimation for a small drone. Drones are required to fly without colliding with objects, but the weight of a camera that a drone can carry is severely limited, so high-performance but heavy cameras cannot be used.
In this paper, we propose a new depth estimation method for autonomous drone flight. Our proposed method can estimate long distances using a monocular image combined with optical flow. The estimation model is based on a conditional generative adversarial network (CGAN) [28], and the training dataset is collected from AirSim [29], a virtual drone flight environment.
3. AirSim
This section describes AirSim [29], which we employ in this work. AirSim is a flight simulator built on Unreal Engine 4, which provides virtual environments, and it faithfully reproduces real-world visual information and physics.
In addition, AirSim can acquire information from Unreal Engine 4, such as object meshes, locations, temperatures, and images. The obtainable images include RGB, segmentation, infrared, and depth images, and the depth images in AirSim measure exactly up to 200 m. Therefore, the AirSim environment makes it possible to obtain and label paired RGB and accurate depth images at the same time to create a training dataset.
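As an illustration, the following is a minimal sketch of how paired monocular and depth images can be captured through the AirSim Python API; the camera name, image type identifiers, and decoding details are assumptions that may differ between AirSim versions and from the actual collection script used in this work.

```python
import airsim
import numpy as np

# Connect to a running AirSim simulation (assumes the simulator is already started).
client = airsim.MultirotorClient()
client.confirmConnection()

# Request a monocular RGB image and a float depth image from the front camera ("0").
responses = client.simGetImages([
    airsim.ImageRequest("0", airsim.ImageType.Scene, False, False),
    airsim.ImageRequest("0", airsim.ImageType.DepthPlanar, True, False),
])
rgb_response, depth_response = responses

# Decode the RGB frame (the channel count may vary by AirSim version).
rgb = np.frombuffer(rgb_response.image_data_uint8, dtype=np.uint8)
rgb = rgb.reshape(rgb_response.height, rgb_response.width, -1)

# Decode the depth frame (float values in metres) and clip it to the 200 m range used here.
depth = np.array(depth_response.image_data_float, dtype=np.float32)
depth = depth.reshape(depth_response.height, depth_response.width)
depth = np.clip(depth, 0.0, 200.0)
```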
Figure 1a shows a depth image limited to 10 m, corresponding to a real depth camera that can be installed on a small drone, captured in the AirSim environment. Figure 1b shows a depth image measuring up to 200 m obtained from a depth camera in AirSim. Figure 1c shows a monocular image taken by a monocular camera in AirSim. As shown in Figure 1, distant objects are visible in the monocular image and in the depth image up to 200 m, but not in the depth image limited to 10 m. Therefore, the longer the distance that can be measured, the larger the benefit of using depth images for autonomous flight. However, to obtain depth images up to 200 m with a real drone, it would be necessary to install a long-range depth camera, which is unrealistic. Therefore, we propose a method for estimating the depth image obtained in AirSim from a monocular image.
5. Experiments
In this section, we evaluate our method in terms of accuracy, latency, and collision avoidance performance. We use an Intel Core i7-9700K (with 32 GB of main memory) and an NVIDIA GeForce RTX 2070 SUPER, as listed in Table 1. The datasets used for training, validation, and testing have been collected from four maps provided in the AirSim environment: Blocks, City, Coastline, and Neighborhood. Overviews of the maps are shown in Figure 9.
We train our model under the following conditions: the number of epochs is set to 100, the batch size is set to 1, and the weight lambda of the L1 loss is set to 100. In the experiments, we have prepared 16,000 pairs of monocular and depth images for each of the maps. Of these, 8000 pairs are used to train our Pix2Pix-based model, and the remaining pairs are used to test it. In the labelling process, the depth and monocular images are taken beforehand through multiple flights along a variety of routes in AirSim.
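For clarity, the following sketch shows how the generator objective with an L1 weight of 100 can be formed in a PyTorch-style implementation; the generator and discriminator modules are placeholders, and the actual Pix2Pix architecture follows [27].

```python
import torch
import torch.nn as nn

# Placeholder loss functions; the discriminator and generator modules are assumed
# to follow the Pix2Pix architecture (U-Net generator, PatchGAN discriminator) in [27].
adversarial_loss = nn.BCEWithLogitsLoss()
l1_loss = nn.L1Loss()
LAMBDA_L1 = 100.0  # the L1 weight used in our training setup

def generator_loss(discriminator, real_input, fake_depth, real_depth):
    """Conditional GAN loss + lambda * L1, as in the Pix2Pix objective."""
    # The discriminator judges the (conditioning input, generated depth) pair.
    pred_fake = discriminator(torch.cat([real_input, fake_depth], dim=1))
    gan_term = adversarial_loss(pred_fake, torch.ones_like(pred_fake))
    # The L1 term pulls the generated depth map towards the ground truth.
    l1_term = l1_loss(fake_depth, real_depth)
    return gan_term + LAMBDA_L1 * l1_term
```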
Figure 10 shows examples of the inputs and outputs of the model trained with these parameters. Figure 10a shows RGB images taken by a monocular camera during flights in the four maps of AirSim. At the same time, we obtain the optical flow maps shown in Figure 10b. From these images, we derive the RGB images with an embedded optical flow map shown in Figure 10c. Compared to the ground truth images in Figure 10d, our proposed method generates the depth images shown in Figure 10e.
5.1. Preliminary Evaluation with Different Pixel Intervals of Optical Flow Maps
In this experiment, we compare the accuracy and error of six models to investigate the effect of the optical flow maps. One of the six models uses only optical flow maps as input for depth estimation. The others embed the optical flow map into the monocular image at different intervals of one, three, five, seven, and nine pixels. We intuitively expect that densely embedded optical flow pixels provide more information and achieve higher accuracy than sparsely embedded ones.
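The following sketch illustrates one way to compute a dense optical flow map with OpenCV and embed it into the monocular image at a given pixel interval; the flow encoding (here, the flow magnitude written into every n-th pixel) is an assumption and may differ from the exact embedding used in our implementation.

```python
import cv2
import numpy as np

def embed_optical_flow(prev_rgb, curr_rgb, interval=5):
    """Embed a dense optical flow map into the current RGB frame at a given pixel interval."""
    prev_gray = cv2.cvtColor(prev_rgb, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_rgb, cv2.COLOR_BGR2GRAY)

    # Dense optical flow (H x W x 2: horizontal and vertical displacement per pixel).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Encode the flow magnitude as an 8-bit intensity map.
    magnitude = np.linalg.norm(flow, axis=2)
    magnitude = cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    embedded = curr_rgb.copy()
    # Overwrite every `interval`-th pixel with the flow magnitude,
    # so the network sees both appearance and motion cues in a single image.
    embedded[::interval, ::interval, :] = magnitude[::interval, ::interval, None]
    return embedded
```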
In order to quantify the estimation error of the models, we use the root mean squared error (RMSE) and absolute relative error (Rel.) metrics. RMSE is obtained by the following equation:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2},$$

where $y_i$ is the ground truth value, $\hat{y}_i$ is the estimated value, and $N$ is the number of data points. Rel. is obtained by the following equation:

$$\mathrm{Rel.} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|y_i - \hat{y}_i\right|}{y_i}.$$

Specifically, the accuracy metrics are defined as the fraction of pixels whose ratio to the ground truth is within a threshold:

$$\delta_t = \frac{1}{N}\left|\left\{\, i : \max\!\left(\frac{y_i}{\hat{y}_i}, \frac{\hat{y}_i}{y_i}\right) < 1.25^{t} \right\}\right|, \quad t \in \{1, 2, 3\}.$$
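These metrics can be computed, for example, as follows (a sketch assuming NumPy arrays of valid depth values):

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Compute RMSE, Rel., and the delta accuracies for a predicted depth map.
    `pred` and `gt` contain depths in metres; invalid pixels are assumed to
    have been masked out beforehand."""
    pred = pred.astype(np.float64).ravel()
    gt = gt.astype(np.float64).ravel()

    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rel = np.mean(np.abs(gt - pred) / (gt + eps))

    ratio = np.maximum(gt / (pred + eps), pred / (gt + eps))
    deltas = [np.mean(ratio < 1.25 ** t) for t in (1, 2, 3)]
    return rmse, rel, deltas
```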
Table 2 shows the error and accuracy of each model.
The results show that the model with the five pixel interval marks the lowest error and the highest accuracy, whereas the model trained with only optical flow maps shows the highest error and the lowest accuracy: in terms of RMSE, it is 1.6736 points worse than the model with the five pixel interval. The model with the five pixel interval also achieves the highest value for each delta accuracy metric. The results imply that overly large pixel intervals might cause over-fitting and lose information from the original RGB images, resulting in high RMSE and Rel., especially above the seven pixel interval. In contrast to the error metrics, larger pixel intervals improve the accuracy metrics.
5.2. Accuracy Comparison between the Proposed Method and Related Work
We evaluate our model in terms of error and accuracy in comparison with the model presented in [12]. The compared model is trained without optical flow; in other words, it uses only RGB images to generate estimated depth maps. For our proposed model, we utilize the optical flow embedded into the RGB images and select the five pixel interval model, which shows the lowest RMSE and Rel. in Table 2.
Table 3 shows the results of the error and accuracy comparison. Compared to the model without optical flow, embedding the optical flow achieves a lower relative error and higher accuracy with a comparable RMSE.
As shown in Table 3, the method of Shimada, T. et al. achieves an RMSE of 5.942, a Rel. of 0.1338, and delta accuracies ($\delta_1$, $\delta_2$, $\delta_3$) of 0.8871, 0.9562, and 0.9772, respectively. The proposed method achieves an RMSE of 6.005, a Rel. of 0.1230, and delta accuracies of 0.8923, 0.9608, and 0.9796. In terms of RMSE, the method of Shimada, T. et al. is better than the proposed method; on the other hand, the proposed method is superior in all the other evaluation metrics.
To confirm whether our method is effective in a real environment, we test our model on the KITTI dataset [24], as shown in Table 4. The KITTI dataset contains RGB images and depth images taken in the real world. We compare our AirSim-trained model with models from related works that are trained on real images. Although the results of our method are slightly worse than those of the related studies trained on real data, they are still reasonable.
5.3. Run Time Evaluation
We also evaluate the run time on a server and on an embedded device, namely an NVIDIA RTX 2070 SUPER, an Intel Core i7-9700K, and a Jetson Xavier NX.
Table 5 shows the run time per image. The slowest run time is observed on the Jetson, at 0.193 s; in other words, approximately five frames per second can be processed on the Jetson. On the other hand, the NVIDIA RTX 2070 SUPER takes 0.031 s per image. Whether these run times are sufficient for collision avoidance depends on how far the model can estimate distances in the generated depth images.
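For reference, a per-image latency can be measured along the following lines in PyTorch; this is a generic sketch with warm-up and GPU synchronization, not the exact measurement script used for Table 5.

```python
import time
import torch

def measure_latency(model, sample, device, iters=100):
    """Rough per-image inference latency measurement."""
    model = model.to(device).eval()
    sample = sample.to(device)
    with torch.no_grad():
        for _ in range(10):              # warm-up iterations
            model(sample)
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(sample)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters  # seconds per image
```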
We have concluded that the processing time is sufficient to avoid collisions in real time. The NVIDIA Jetson Xavier NX is a small single-board computer that can be mounted on a UAV and weighs about 180 g. On the other hand, the accurate depth sensor used for the KITTI dataset [24], the Velodyne HDL-64E, weighs 12,700 g [37], and other depth sensors that can measure up to 200 m also weigh close to 1000 g. Thus, the Jetson is light enough compared to long-range depth sensors such as the one used for the KITTI dataset [24]. In addition, unlike attaching such a sensor, replacing the base board with a Jetson Xavier NX does not increase the weight very much.
5.4. Collision Rate Evaluation in AirSim Environment
So far, we have evaluated the accuracy and run time of the proposed method. In this section, we simulate drone flights in AirSim to demonstrate that a drone using the proposed method can fly while avoiding collisions with objects. To realize safe autonomous flight, the drone needs to plan its path by itself, that is, to select a direction in which it can avoid colliding with objects. In the experiments, we use a state-of-the-art path planning method for flight control developed in [6]. The work in [6] introduced a method that divides a depth image into 289 overlapped sections (17 rows and 17 columns), as shown in Figure 11.
By dividing the depth image into overlapped sections, the drone selects the best section for avoiding obstacles and passing through safely, namely the section with the maximum total pixel value. The flight is simulated 400 times across the four maps, and the flight scenarios are randomly generated in terms of route, direction, and distance.
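A minimal sketch of this section selection is shown below; the overlap scheme (2 x 2 grid cells per section on an 18 x 18 grid, giving 17 x 17 = 289 overlapped sections) is our assumption of how the sections in [6] can be formed.

```python
import numpy as np

def select_best_section(depth, rows=17, cols=17):
    """Choose the flight direction section with the largest total depth value,
    following the idea of the overlapped-section method in [6]."""
    h, w = depth.shape
    cell_h, cell_w = h // (rows + 1), w // (cols + 1)

    best_score, best_rc = -np.inf, (0, 0)
    for r in range(rows):
        for c in range(cols):
            # Each section spans two grid cells in each direction, so
            # neighbouring sections overlap by one cell.
            window = depth[r * cell_h:(r + 2) * cell_h,
                           c * cell_w:(c + 2) * cell_w]
            score = window.sum()  # larger total depth = more free space ahead
            if score > best_score:
                best_score, best_rc = score, (r, c)
    return best_rc  # (row, column) index of the selected section
```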
We compare the collision rates, i.e., the proportion of flights with collisions out of the total number of flights. Hereby, we define the collision rate for a map by the following formula:

$$\text{Collision rate} = \frac{\text{Number of flights with a collision}}{\text{Total number of flights}}.$$

Note that a flight is counted as having a collision if the drone collides with an obstacle even once during the flight.
In the experiments, we use the following four methods. The first uses depth images measured up to 10 m, which assumes a real depth camera that is reasonably priced and light enough to be mounted on a drone. The second uses depth images measured up to 200 m; this assumes an ideal depth camera that can measure up to 200 m but is too heavy to be mounted on a drone in the real world, and it serves as the ground truth for comparison. The third is the method presented by Shimada, T. et al. [12], which inputs a monocular image to Pix2Pix to generate a depth image. The fourth is our proposed method, which feeds an image combined with an optical flow map into Pix2Pix to generate the estimated depth map.
Table 6 shows the collision rate in each map of AirSim. The results show that our proposed method achieves a lower collision rate than the method presented in [12]. The 10 m depth maps yield the highest collision rate, which clearly indicates that short-range depth images are of little use for collision avoidance. The method in [12] also shows a higher collision rate than the proposed method. These results are attributed to the low error and high accuracy of the depth maps generated by our method.