End-to-End Nano-Drone Obstacle Avoidance for Indoor Exploration

Abstract: Autonomous navigation of drones using computer vision has achieved promising performance. Nano-sized drones based on edge computing platforms are lightweight, flexible, and cheap; thus, they are suitable for exploring narrow spaces. However, due to their extremely limited computing power and storage, vision algorithms designed for high-performance GPU platforms cannot be used on nano-drones. To address this issue, this paper presents a lightweight CNN depth estimation network deployed on nano-drones for obstacle avoidance. Inspired by knowledge distillation (KD), a Channel-Aware Distillation Transformer (CADiT) is proposed to help the small network learn from a larger network. The proposed method is validated on the KITTI dataset and tested on a Crazyflie nano-drone with the ultra-low-power GAP8 microprocessor. This paper also implements a communication pipeline so that the collected images can be streamed to a laptop through the on-board Wi-Fi module in real time, enabling an offline reconstruction of the environment.


Introduction
Drones play an important role in exploration tasks. In particular, nano-sized drones are suitable for exploring narrow and cluttered environments after disasters because of their small size and relative affordability [1]. Flying autonomous drones in scenarios where GNSS is not available is a challenging research topic. Some research integrates multiple sensors, such as high-quality stereo cameras [2,3] and LiDAR [4-7], for drone navigation. These methods integrate with on-board SLAM algorithms, and a map of the environment can be constructed [8,9], which is useful for planning the rescue mission after a disaster. However, this type of method requires drones with large payloads [10], and running SLAM algorithms on board nano-drones is not possible because of their weak processors. As computer vision technology evolves, learning-based methods using monocular vision have emerged. Given an input image, a convolutional neural network (CNN) can be trained to directly output control commands [11-13]. Since this type of method cannot control the drone's next waypoints and relies on black-box decision-making, it is more reasonable to navigate and avoid obstacles based on intermediate depth maps predicted by CNNs [14-16]. The advantages of using depth maps for drone navigation are two-fold: (i) A depth map intuitively represents the distance from each object in the scene to the viewpoint and is ideal for navigating a drone and selecting the next waypoint if needed. (ii) Some recent CNN-based depth estimation methods leverage a self-supervised training strategy and do not require labeled data for training. In view of these advantages, this paper also focuses on using the depth map estimated by a CNN for the obstacle avoidance of nano-drones.
The target platforms of the aforementioned methods were not nano-drones, and they used high-performance graphics processing units (GPUs), which are not available on nano-drones. The nano-drone platform used in this paper is Bitcraze's Crazyflie, which uses GAP8 as its processor and has only 22.65 GOPS of processing power and 512 KB of RAM. This is not enough memory to store two 324 × 324 RGB images together with other files such as model weights. To our knowledge, this paper is the first to implement CNN-based depth estimation networks on such a nano-drone with extremely limited computational capacity. The contributions of this paper can be summarized as follows.

• In order to reduce drone memory usage, this paper explores running depth estimation networks on single-channel grayscale images. The experimental results on the KITTI dataset show that self-supervised depth estimation using grayscale images is feasible. A single grayscale image saves two-thirds of the storage space compared to a single RGB image.

• The remaining space is still insufficient for the storage and inference of small models such as Lite-Mono [17]. Therefore, a lightweight depth estimation framework, DDND, with only 310 K parameters is proposed. To compensate for the limited learning capacity of the small network, knowledge distillation is introduced, and a Channel-Aware Distillation Transformer (CADiT) is proposed to make the student model explore important information in different feature channels of the teacher, thus enhancing the knowledge distillation. The effectiveness of the method is validated on the KITTI dataset.

• The proposed model is deployed on the nano-drone platform Crazyflie, and it runs at 1.24 FPS on a GAP8 processor to avoid obstacles in real environments. The code will be released on the project website https://github.com/noahzn/DDND (accessed on 17 January 2024).

• Since a map of the environment is useful for rescue missions, and considering that it is not possible to run reconstruction algorithms on board nano-drones, this paper implements data communication from the drone to a laptop and presents a pipeline for offline reconstruction of the environment.
The rest of the paper is organized as follows. Section 2 reviews related research work. Section 3 describes the proposed method in detail. Section 4 elaborates on the experiments and discussions. Section 5 introduces the proposed pipeline for reconstructing the environment offline on a laptop. Section 6 concludes the paper.

Obstacle Avoidance of Nano-Drones
Nano-drones can be equipped with small laser rangers [1,18-20] or sonars [21] to avoid obstacles at short distances. Some research focused on optical flow estimation using a dedicated optical flow sensor [22] or cameras [23,24]. With the development of edge computing devices, obstacle avoidance using CNN-based methods is beginning to make its mark on edge computing platforms like PULP [25] and Crazyflie. Image-based methods can provide rich cues of a scene, but some methods directly regressed control commands from a single image and did not utilize the geometric information of scenes. For example, Kouris et al. proposed a regression network to output steering angles to control the drone [11]. Similarly, DroNet used a CNN to output both the steering angle and the probability of collision to control the forward velocity. It was trained using car driving images, and all the images were annotated as "collision" or "no collision" [12]. Gandhi et al. created a UAV crash dataset and taught their drone to "go left", "go straight", or "go right" using a shallow network [26]. Zhilenkov et al. deployed a similar system for autonomous drone navigation in forests [13]. In comparison, methods using depth maps are favorable [16] because control commands or path planning can be built upon this intermediate step, and the depth maps can be used for other tasks such as scene reconstruction and exploration. Yang et al. proposed a probabilistic CNN to predict monocular depth and guide the drone to avoid obstacles [14]. Chakravarty et al. trained a supervised depth estimation network and controlled the drone based on the behavior arbitration algorithm [15]. However, reliable depth estimation is computationally expensive, and the mentioned methods ran on larger platforms or off board. It is not feasible to directly deploy such models on nano-drone platforms. Additionally, the available memory is also a problem. For example, the drone platform used in this work has 512 KB of L2 (second-level RAM) memory, and this is where the chip executes code and stores captured images. The above methods all processed three-channel RGB images, and it would take more than half the memory (324 × 324 × 3 bytes ≈ 315 kB) just to store an RGB image, let alone other code and model files. Instead, using a single-channel grayscale image saves two-thirds of the memory used by an RGB image. This allows us to use a larger CNN model for better performance.
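The memory argument above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that "512 KB" means 512 × 1024 bytes and that each channel is stored as one 8-bit value; the names `L2_BYTES` and `image_bytes` are illustrative and not taken from the paper's code.

```python
# Assumption: "512 KB" is read as 512 * 1024 bytes, and pixels use 8 bits per channel.
L2_BYTES = 512 * 1024


def image_bytes(width, height, channels):
    """Bytes needed to hold one uncompressed 8-bit image."""
    return width * height * channels


rgb_bytes = image_bytes(324, 324, 3)   # three-channel RGB frame
gray_bytes = image_bytes(324, 324, 1)  # single-channel grayscale frame

print(f"RGB:  {rgb_bytes} bytes ({rgb_bytes / L2_BYTES:.1%} of L2)")
print(f"Gray: {gray_bytes} bytes ({gray_bytes / L2_BYTES:.1%} of L2)")
print(f"Saved by grayscale: {1 - gray_bytes / rgb_bytes:.1%}")
```

Under these assumptions, one RGB frame already occupies more than half of L2 and two RGB frames do not fit at all, whereas a grayscale frame needs one-third of the space.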

Efficient Monocular Self-Supervised Depth Estimation
In recent years, single-image depth estimation (SIDE) has attracted considerable attention from researchers with the development of deep learning. SIDE models using supervised training regress pixel-wise depth values from labeled data. As it requires additional effort to annotate the data, self-supervised depth estimation (SSDE) stands out and predicts depth from label-free monocular videos [27]. Subsequent work improved prediction accuracy by introducing multi-task learning [28,29], adding uncertainty constraints [30,31], or using more powerful deep learning architectures [32-34]. Some recent methods have pursued a balance of model accuracy, speed, and size, which is also the focus of this paper. FastDepth [35] adopted MobileNet [36] as the encoder and achieved fast inference speed on embedded systems. However, this model was designed for supervised training. R-MSFM [37] designed a feature modulation module to learn multi-scale depth features and reduced the model size by controlling the encoder's layers. Lite-Mono [17] is a hybrid CNN and Transformer architecture, which is capable of extracting both enhanced local features and global contexts and achieves state-of-the-art performance. It has a good trade-off between accuracy and model size (3.1 M parameters). Nonetheless, such a small model still exceeds the storage space of the GAP8 processor. To obtain a much smaller model, the Lite-Mono network is streamlined in this paper, yielding a model with 310 K parameters. The drawback is that the learning ability of such a small model is limited. To improve the learning ability of the model, this paper introduces knowledge distillation.

Knowledge Distillation for Depth Estimation
In a typical knowledge distillation (KD) framework, a larger teacher model transfers its knowledge to a smaller student model. The common ways to perform KD are through soft labels [38,39] or intermediate feature matching [40-43]. KD has been applied to depth estimation tasks to boost lightweight models. Wang et al. [44] used ResNet-101 [45] as the teacher and MobileNet as the student and set up distillation between the decoders of the two networks. Hu et al. [46] improved knowledge distillation with auxiliary data. Pilzer et al. [47] also explored KD for depth estimation, but their method required stereo image pairs for training. However, the student models used in the above-mentioned research were still too heavy to be deployed on a GAP8. Some recent papers have pointed out that KD may have difficulty optimizing the student model and may achieve unsatisfactory results if there is a large learning capacity gap between teacher and student [48,49]. Inspired by some KD methods for classification and semantic segmentation tasks [42,43,50], this paper proposes the CADiT module to encourage the student to pay attention to geometric cues from the teacher's feature channels, thus improving the KD process.

Method
The proposed distilled depth for nano-drones (DDND) is shown in Figure 1 and explained in detail in this section. First, the architecture of the network and the depth estimation training scheme are presented. Then, the KD scheme, including the proposed CADiT, is described. The last subsection introduces the control method using the generated depth map.

Network Structures
To make the model deployable on the GAP8 chip, the encoder of Lite-Mono [17] is streamlined to reduce the number of trainable parameters. As shown in the upper part of Figure 2, the student model (DepthNet) is an encoder-decoder network with four stages in the encoder to extract features. Its decoder concatenates features from the encoder and outputs inverse depth maps at three scales. The channel numbers of the encoder in Lite-Mono are [48, 48, 80, 128], while the student model used in this paper has channel numbers [C1, C2, C3, C4] = [32, 32, 64, 80]. The network takes one-channel grayscale images as input. The same dilation rates as in Lite-Mono are used in the DDND network, and the total number of parameters is reduced from 3.1 M to 310 K. The PoseNet is the same pre-trained ResNet-18 used in [27], and it is not needed after training.

Photometric Loss
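With the adjacent image I_{t+s}, the estimated relative pose P, the predicted depth map D_t, and the camera's intrinsics K known, the target image can be reconstructed as a function F(I_{t+s}, P, D_t, K), and the photometric loss between the target image and the reconstructed image Î_t can be defined as:
$$L_p(\hat{I}_t, I_t) = L_p\big(F(I_{t+s}, P, D_t, K),\, I_t\big), \tag{1}$$
which can be calculated as a weighted sum of the SSIM (structural similarity index) [51] term and the L1 loss between the two images:
$$L_p(\hat{I}_t, I_t) = \alpha\, \frac{1 - \mathrm{SSIM}(\hat{I}_t, I_t)}{2} + (1 - \alpha)\, \lVert \hat{I}_t - I_t \rVert_1, \tag{2}$$
with α being an empirical value of 0.85 [27]. The minimum photometric loss and the auto-masking techniques [27] are used to overcome the occlusion problem and to filter out objects that move at the same speed as the camera. Following [27], the final photometric loss is defined as:
$$L_p = \Big\langle \min_{s} L_p(I_{t+s}, I_t) > \min_{s} L_p(\hat{I}_t, I_t) \Big\rangle \cdot \min_{s} L_p(\hat{I}_t, I_t), \tag{3}$$
where the minimum is taken over the adjacent frames s and the ⟨·⟩ operation outputs a binary mask.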
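The per-pixel logic of the minimum photometric loss and auto-masking can be sketched as follows. This is only an illustration of the technique from [27] under stated assumptions: the per-pixel errors are assumed to be precomputed and flattened into lists, the loss is averaged over the pixels that pass the mask (the averaging convention is an assumption), and the function name `masked_min_photometric` is hypothetical.

```python
def masked_min_photometric(warped_errors, identity_errors):
    """Combine per-pixel photometric errors with min-reprojection and auto-masking.

    warped_errors:   one flat list of per-pixel errors per adjacent frame,
                     measured between the reconstructed image and the target.
    identity_errors: same layout, measured between the unwarped adjacent
                     frame and the target.
    Returns the mean error over pixels kept by the binary mask.
    """
    num_pixels = len(warped_errors[0])
    total, kept = 0.0, 0
    for i in range(num_pixels):
        min_warped = min(err[i] for err in warped_errors)
        min_identity = min(err[i] for err in identity_errors)
        # Binary mask: keep the pixel only if warping explains it better than
        # simply copying the adjacent frame (static camera or co-moving object).
        if min_identity > min_warped:
            total += min_warped
            kept += 1
    return total / kept if kept else 0.0


# Two adjacent frames, three pixels each (made-up error values).
warped = [[0.2, 0.5, 0.1], [0.3, 0.4, 0.6]]
identity = [[0.9, 0.3, 0.05], [0.8, 0.35, 0.2]]
loss = masked_min_photometric(warped, identity)
```

In this toy input only the first pixel passes the mask, because for the other two pixels the unwarped frame already matches the target better than the reconstruction does.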

Smoothness Loss
The edge-aware smoothness loss [27] is also used to improve the smoothness of the edges of the generated depth map:
$$L_{smooth} = \lvert \partial_x d^*_t \rvert\, e^{-\lvert \partial_x I_t \rvert} + \lvert \partial_y d^*_t \rvert\, e^{-\lvert \partial_y I_t \rvert}, \tag{4}$$
where d*_t = d_t / d̄_t is the mean-normalized inverse depth. Therefore, the combined loss function for the self-supervised training is:
$$L_{SS} = \frac{1}{3} \sum_{j=1}^{3} \left( L_p^{\,j} + \lambda L_{smooth}^{\,j} \right), \tag{5}$$
where j indexes the three resolutions of the inverse depth. λ is set to 10^-3 as in [27].

Knowledge Distillation Scheme

Matching Intermediate Features Using the Channel-Aware Distillation Transformer (CADiT)
Assume that the teacher network T and the student network S have intermediate feature maps denoted by F_T ∈ ℝ^{H×W×C} and F_S ∈ ℝ^{H×W×C′}, respectively. Each feature map has a height of H and a width of W, with C channels for the teacher and C′ channels for the student. As shown in Figure 3a, the student's feature channels are increased using 1 × 1 convolutions to match the channel number of the teacher. Then, the features can be reshaped as F_T ∈ ℝ^{N×C} and F_S ∈ ℝ^{N×C}, respectively, where N = H × W. The L2 loss can be used to minimize the discrepancy between the teacher's and student's intermediate features (L_IF) [40]. However, such a direct feature-matching method may increase the difficulty of optimization if the student has poor learning ability. The proposed CADiT (Figure 3b) allows each student channel to learn geometric cues from all the teacher's channels to improve feature matching. Specifically, a C × C channel correlation map (CCM) is built as the inner product (·) between the transposed aligned student features and the teacher features, so that the CCM measures correlations between student and teacher channels. The student's features are then reconfigured using the CCM, and the CADiT loss is computed from the reconfigured features.
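The channel correlation idea can be illustrated with a small, dependency-free sketch. Several details here are assumptions rather than the paper's definitions: the CCM is taken as the plain inner product of the transposed student features with the teacher features without any normalization, the reconfigured student features are taken as the student features multiplied by the CCM, and both losses are taken as mean squared errors. All function names are hypothetical.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def mse(a, b):
    """Mean squared difference between two equally sized matrices."""
    count = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / count


def direct_feature_loss(f_s, f_t):
    """Conventional feature matching (Figure 3a): compare features directly."""
    return mse(f_s, f_t)


def cadit_loss(f_s, f_t):
    """Channel-aware matching sketch (Figure 3b), under the assumptions above.

    f_s, f_t: N x C matrices (N = H * W positions, C channels), with the
    student already aligned to the teacher's channel number.
    """
    ccm = matmul(transpose(f_s), f_t)   # C x C channel correlation map
    f_s_reconf = matmul(f_s, ccm)       # every student channel mixes all teacher correlations
    return ccm, mse(f_s_reconf, f_t)


# Toy features with N = 3 positions and C = 2 channels.
student = [[1, 0], [0, 1], [1, 1]]
teacher = [[1, 2], [3, 4], [5, 6]]
ccm, loss = cadit_loss(student, teacher)
```

The point of the sketch is structural: entry (i, j) of the CCM couples student channel i with teacher channel j, so the reconfigured student features depend on all teacher channels rather than on a one-to-one channel pairing.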

Matching Outputs
As with the traditional KD methods [38,46,48], the L1 loss is also used to minimize the difference between the multi-scale depth maps generated by the teacher and the student; this output loss is denoted L_Out. The final loss function used to train the network combines the self-supervised loss of Equation (5) with the distillation losses, where α (distinct from the SSIM weight in the photometric loss) is a weighting factor, set to 0.1 in this paper.

Drone Controlling
With the depth map generated by the network, the nano-drone avoids obstacles based on the behavior arbitration (BA) scheme [52]. Although this scheme was originally used in conjunction with sonar, it can also be used with depth maps. The behavior avoid is defined as:
$$\dot{\phi} = f_{avoid}(\phi) = \lambda_{avoid} \sum_i (\phi - \psi_i)\, \exp(-c_{avoid}\, d_i)\, \exp\!\left(-\frac{(\phi - \psi_i)^2}{2\sigma_i^2}\right), \tag{12}$$
where λ_avoid is a weighting factor, ϕ is the heading of the drone, and d_i and ψ_i are the depth value and direction of the i-th value in the obstacle map, respectively. σ_i is the horizontal field of view of the camera, which is the angular range over which the drone can turn. The gain c_avoid controls the sensitivity to obstacles. A lower gain allows the drone to react to obstacles further away. Increasing λ_avoid changes the angular velocity of the drone more quickly. If there is a single obstacle, the behavior will generate a repeller along the obstacle direction ψ_i. As distant obstacles have less importance than nearby ones, the repeller's magnitude should decrease as the distance from the obstacle increases.
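A minimal sketch of the avoid behavior is given below. It assumes the commonly used behavior arbitration form, in which each obstacle contributes (ϕ − ψ_i) · exp(−c_avoid · d_i) · exp(−(ϕ − ψ_i)² / (2σ_i²)) and the sum is scaled by λ_avoid. The gain of 0.1 matches the value reported later in the paper; the default values for `lam` and `sigma` and the function name are hypothetical.

```python
import math


def avoid_turn_rate(heading, depths, directions, lam=1.0, c=0.1, sigma=math.radians(60)):
    """Angular velocity command from a 1D obstacle map.

    heading:    current heading of the drone in radians.
    depths:     depth value d_i for each obstacle-map entry.
    directions: direction psi_i (radians) of each obstacle-map entry.
    lam, sigma: hypothetical values chosen for illustration only.
    """
    total = 0.0
    for d, psi in zip(depths, directions):
        diff = heading - psi
        distance_term = math.exp(-c * d)                       # nearer obstacles weigh more
        angular_term = math.exp(-diff ** 2 / (2 * sigma ** 2))  # obstacles far off-axis weigh less
        total += diff * distance_term * angular_term
    return lam * total


# One obstacle slightly to the positive side of the heading pushes the turn rate negative.
rate = avoid_turn_rate(0.0, depths=[1.0], directions=[0.2])
```

Each obstacle acts as a repeller: the sign of (ϕ − ψ_i) steers the heading away from the obstacle direction, and two identical obstacles placed symmetrically about the heading cancel out.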
The drone started to drift slightly when it was about 1.5 m above the ground because its Kalman filter did not work stably at that height. For this reason, the flight altitude of the drone is fixed at about 0.7 m in this paper, and an obstacle map is constructed by average pooling the center rows of the depth map with a window size of 10 × 10 (Figure 4), resulting in a 1D obstacle map. Although the method only uses the pixels in the horizontal center to construct the obstacle map, the complete images are needed to train the depth estimation network. It is not feasible to use only the center pixels during training, as there is no guarantee that the same pixels will be seen in the previous or next frame, which violates the assumption behind training with the photometric loss. The considerations behind using such a simple control strategy are as follows: (1) The calculation is cheap, and no additional trainable parameters are required to generate the obstacle map. (2) It allows the selection of the next waypoint based on the depth map for path planning in future work. It is possible to use more complex control schemes, but this is beyond the scope of this paper.
In the implementation, the deep network and the control code do not run immediately when the drone is switched on. The drone needs about 12 s to initialize itself, connect to the laptop, and send several test messages. A take-off command is then given 15 s after the start. Five seconds later, the network starts, and then the control code runs. The drone lands either when a predefined flight time runs out or when it crashes.
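The construction of the 1D obstacle map from the center rows of the depth map can be sketched as follows. The sketch assumes non-overlapping 10 × 10 windows (stride equal to the window size) taken from a single band of rows centered vertically; the paper states the window size but not the stride, and the function name is hypothetical.

```python
def obstacle_map(depth, window=10):
    """Average-pool one horizontal band of the depth map into a 1D obstacle map.

    depth: 2D list (rows of depth values), e.g. 128 rows by 160 columns.
    Returns one averaged depth value per non-overlapping window.
    """
    height, width = len(depth), len(depth[0])
    top = (height - window) // 2           # first row of the centered band
    band = depth[top:top + window]
    pooled = []
    for x0 in range(0, width - window + 1, window):
        values = [v for row in band for v in row[x0:x0 + window]]
        pooled.append(sum(values) / len(values))
    return pooled


# Toy 128 x 160 depth map whose value grows from left to right.
toy_depth = [[float(x // 10) for x in range(160)] for _ in range(128)]
omap = obstacle_map(toy_depth)
```

With a 128 × 160 depth map this yields 16 entries, each of which can serve as one (d_i, ψ_i) pair for the avoid behavior once a direction is assigned to its column range.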

Drone Platform
To implement obstacle avoidance, this paper has considered several open-source drone platforms, including Crazepony, ArduBee, DJI Tello, and Crazyflie. A comparison of these platforms is listed in Table 1. Crazepony2 is a customizable platform, especially known as a First-Person View (FPV) drone. ArduBee is equipped with an optical flow sensor and an infrared sensor for object avoidance. Crazepony2 and ArduBee lack AI chips for on-board image processing. Although DJI Tello has a good vision processing unit and is designed for education, its closed-source system is a hindrance to custom development. Bitcraze Crazyflie (Figure 5) is the drone platform used in this paper. It is equipped with a Flow Deck v2 at the bottom to measure the distance to the ground and an ultra-low-power GAP8 processor (AI Deck) integrating a monochrome camera (HM01B0-MNA), which is designed for on-board AI computing. The active open-source community has also brought attention to Crazyflie. This drone platform measures 92 × 92 × 29 mm (W × H × D) and weighs 34 g. It is equipped with a 250 mAh LiPo battery, and the maximum flight time is about 5 min. It has 22.65 GOPS of processing power and 512 KB of RAM.

Gray Campus Indoor
This in-house indoor drone dataset was collected by a Crazyflie in different buildings on our campus. It consists of 17 sequences with a total of 9140 grayscale images at an original resolution of 244 × 324 pixels. Images are resized to 128 × 160 pixels in this paper to meet the running speed requirement.

Network Training
The proposed network is implemented in PyTorch. Models are trained for 30 epochs on KITTI and 100 epochs on Gray Campus Indoor with an NVIDIA TITAN Xp. The teacher model Lite-Mono is pretrained on ImageNet [55] and then trained on KITTI. During training, the teacher's weights are fixed, and only the student's weights are updated. The AdamW [56] optimizer is used, and the weight decay is set to 10^-4. The initial learning rate is set to 10^-4 with a cosine learning rate schedule.
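For reference, a cosine learning rate schedule of the kind mentioned above can be written in a few lines. This sketch assumes per-epoch updates, no warm-up, and annealing down to zero, none of which are specified in the paper; the function name and the `min_lr` parameter are hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under plain cosine annealing."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Schedule for the 30 KITTI epochs, starting from the stated initial rate of 1e-4.
kitti_schedule = [cosine_lr(e, 30) for e in range(31)]
```

The rate starts at the initial value, passes through half of it at the midpoint of training, and decays smoothly toward `min_lr` at the end.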

Model Quantization and Deployment
The trained PyTorch model is further converted to the ONNX (Open Neural Network Exchange) format and quantized using an 8-bit scheme, reducing the size of the weights from 747.6 KB to 201.3 KB. The control algorithm is implemented in C. The Fabric Controller (FC) frequency of the GAP8 is set to 250 MHz.
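The effect of 8-bit quantization on a weight tensor can be illustrated with a generic sketch. The paper does not state which quantization scheme the deployment toolchain applies, so the symmetric per-tensor scheme below is an assumption chosen for simplicity, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map 8-bit integers back to approximate float weights."""
    return [q * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 1.0]
q_weights, q_scale = quantize_int8(weights)
restored = dequantize(q_weights, q_scale)
```

Each restored weight differs from the original by at most half a quantization step, which is the kind of rounding error that makes the on-board depth maps less smooth than those of the full-precision model. The reported size reduction from 747.6 KB to 201.3 KB corresponds to a factor of roughly 3.7.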

Results on Grayscale KITTI
Table 2 shows the accuracy of models trained on the grayscale KITTI dataset. The seven commonly used accuracy indicators [57] are AbsRel, SqRel, RMSE, RMSE log, δ < 1.25, δ < 1.25^2, and δ < 1.25^3. Comparing the results of Lite-Mono with those of Lite-Mono (RGB) shows that self-supervised training based on the photometric loss also works on grayscale images, albeit with lower accuracy. This confirms the feasibility of the proposed method using grayscale images for self-supervised depth estimation. DDND w/o KD in the table denotes the model that does not use knowledge distillation, i.e., it contains only the streamlined DepthNet and PoseNet, as shown in Figure 2. The results show that the accuracy is greatly improved when the proposed KD method is used in training. Figure 6 shows some images generated by the networks, and it can be observed that DDND learns knowledge from Lite-Mono and is able to perceive larger objects. In addition, DDND can produce sharper depth maps at the edges of objects compared to the blurred depth maps produced without KD.
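The seven indicators follow the standard definitions used in monocular depth evaluation, and a compact reference implementation may help when reading Table 2. The sketch below operates on flat lists of valid, positive depth values; preprocessing steps that evaluation protocols usually apply (depth capping, cropping, and median scaling for self-supervised models) are deliberately omitted, and the function name is hypothetical.

```python
import math


def depth_metrics(pred, gt):
    """Standard depth-estimation metrics for paired positive depth values."""
    n = len(gt)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in zip(pred, gt)) / n
    )
    # Threshold accuracy: share of pixels whose ratio max(p/g, g/p) is below 1.25^k.
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    d1 = sum(r < 1.25 for r in ratios) / n
    d2 = sum(r < 1.25 ** 2 for r in ratios) / n
    d3 = sum(r < 1.25 ** 3 for r in ratios) / n
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "d1": d1, "d2": d2, "d3": d3}


# Two pixels: one predicted exactly, one predicted at twice the true depth.
metrics = depth_metrics(pred=[2.0, 4.0], gt=[2.0, 2.0])
```

Lower is better for the four error metrics, while higher is better for the three threshold accuracies.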

Qualitative Results on Gray Campus Indoor
Figure 7 shows some results on the in-house dataset. The dataset is challenging for the SSDE framework, as it has many lighting sources and low-texture regions, such as walls and floors. In addition, the scenes are more diverse. DDND benefits from the KD scheme and captures more detail in scenes.

Ablation Study on KD Losses
Ablation studies with different loss settings on KITTI are performed to validate the effectiveness of the proposed KD training and CADiT. In Table 3, the first setting is DDND without KD, which is the baseline in this ablation study. When introducing KD on the generated depth maps (No. 2), the results are better than the baseline. However, experiment No. 3 shows that not all metrics are better when KD is also used in the encoder at the same time. Simply using the L2 loss to distill feature representations cannot give good results due to the large differences between student and teacher in learning ability. From experiments No. 3 and No. 4, it can be found that the distillation methods using the channel loss yield better results. This loss allows the KD process to focus on the feature channels and enables more effective knowledge transfer. Experiments No. 6-8 show the effectiveness of the proposed CADiT module. Even if the proposed CADiT module is only used in the encoder, without the help of the L1 loss in the decoder, good results can still be achieved (No. 7). The best result is obtained when the CADiT module is used in the encoder and the L1 distillation is used in the decoder.

Test in Real Environments
The proposed approach is tested in real indoor environments. Figure 8 shows some images taken by the nano-drone and the depth maps generated by the deployed quantized model. The green bars on the grayscale images denote steering commands for avoiding obstacles calculated by Equation (12). Due to model quantization, the on-board network is not able to generate smooth depth maps, but these maps still succeed in showing the structure and volume of the scenes. Considering the inference speed of GAP8, the c_avoid defined in Equation (12) is set to 0.1 to make sure that the drone is able to react to obstacles at a safer distance.

Inference Speed Analysis
The inference speed of the proposed network is evaluated both on the NVIDIA TITAN Xp GPU (graphics processing unit) and the GAP8 processor. As shown in Table 4, there is little difference in the inference speed on the TITAN Xp between the two resolutions, but on GAP8, inference is about six times faster at the resolution of 128 × 160. It can also be observed that the computing power of edge computing devices such as GAP8 is extremely limited. The inference speed of 1.24 FPS is acceptable because the nano-drone flies at a low speed during tests.
The proposed method fails to estimate the depth of glass, and it also fails when the drone is too close to a wall, as shown in Figure 9. This is also a limitation of SSDE methods in general, and this problem can be overcome by integrating additional sensors, such as ultrasonic sonar, to detect the distance to glass and walls.

The Scene Reconstruction Pipeline
Since a map of the environment is useful for post-disaster rescue and it is not possible to run SLAM algorithms on the nano-drone, this paper also implements the streaming of collected images to a laptop using the nano-drone's Wi-Fi module and presents an offline pipeline for reconstructing the environment. Figure 10 displays the entire pipeline, divided into on-board processing and offline processing stages. During the on-board processing, the proposed depth estimation network runs on the nano-drone and generates relative depth maps and angular velocity commands to make the drone avoid obstacles.
Meanwhile, the captured grayscale images are transmitted from the drone to a laptop through the NINA Wi-Fi module of the GAP8 chip. On the laptop, a SLAM algorithm can be used to estimate the poses of the sequential images. This paper uses ORB-SLAM2 [57] to extract keyframe trajectories from the images. To create a 3D reconstruction, it is necessary to know accurate depth values for each pixel, but depth estimation models trained with a self-supervised scheme are only able to predict relative depth values. This paper therefore adopts ZoeDepth [56] for metric depth estimation. ZoeDepth has 345 M parameters, and it has shown excellent generalization capacity, as it was initially pretrained on 12 datasets using relative depth and subsequently fine-tuned on two datasets using metric depth. Then, the grayscale images and their corresponding depth maps are used to generate point clouds colored by grayscale intensity. In this way, the scene reconstruction pipeline allows for building a map of an indoor environment using a nano-drone equipped with a monochrome camera.
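The step that turns a grayscale image and its metric depth map into a point cloud can be sketched with the standard pinhole back-projection. This is an illustration only: the intrinsics `fx`, `fy`, `cx`, and `cy` are hypothetical placeholders for the calibrated camera parameters, the points are expressed in the camera frame of a single keyframe, and the transformation by the keyframe pose estimated by the SLAM algorithm is left out.

```python
def backproject(depth, gray, fx, fy, cx, cy):
    """Back-project a metric depth map into 3D points tagged with intensity.

    depth: 2D list of metric depth values (same size as gray).
    gray:  2D list of grayscale intensities.
    Returns a list of (x, y, z, intensity) tuples in the camera frame.
    """
    points = []
    for v, (depth_row, gray_row) in enumerate(zip(depth, gray)):
        for u, (z, intensity) in enumerate(zip(depth_row, gray_row)):
            if z <= 0:
                continue  # skip pixels without a valid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z, intensity))
    return points


# Tiny 2 x 2 example with made-up intrinsics.
cloud = backproject(
    depth=[[2.0, 0.0], [2.0, 4.0]],
    gray=[[10, 20], [30, 40]],
    fx=2.0, fy=2.0, cx=0.5, cy=0.5,
)
```

Merging the per-keyframe clouds into one map would additionally require applying each keyframe's rotation and translation to its points.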

Conclusions
This paper proposes a lightweight depth estimation framework, DDND, for obstacle avoidance on the nano-drone Crazyflie. Considering the limited storage and computing capacity of such a small drone platform, it is only possible to deploy a tiny network on it for monocular depth estimation. To enhance the learning ability of this tiny network, this paper integrates knowledge distillation and proposes the CADiT module for better knowledge transfer from a teacher model. The quantitative and qualitative results on the KITTI dataset validate the effectiveness of the proposed KD module. The model is then quantized so that it can run inference on a Crazyflie for tests in real environments. A limitation of such vision-based methods is that they are unable to avoid transparent objects such as glass. This paper also presents an application pipeline for the reconstruction of the environment using offline metric depth estimation and keyframe pose estimation. With the 3D reconstruction available, future work will focus on selecting waypoints in the reconstruction for path planning. This function requires the implementation of bilateral data communication between the laptop and the drone. In addition, the low inference speed of the algorithm only allows the drone to fly at a low speed, so future work will also focus on improving the efficiency of the algorithm. A more powerful GAP9 chip is also being considered.

Figure 1. Overview of the proposed DDND. In addition to the self-supervised (SS) loss used in the SSDE training scheme, L2 loss and L1 loss are used to distill the teacher's knowledge into the student's encoder and decoder, respectively. The proposed CADiT is introduced in Section 3.3.

Figure 2
Figure 2 also shows the self-supervised depth estimation (SSDE) training scheme, which aims at minimizing the photometric loss L p between a target image I t and the

3.3. Knowledge Distillation Scheme
3.3.1. Matching Intermediate Features Using the Channel-Aware Distillation Transformer (CADiT)
Assume that the teacher network T and the student network S have intermediate feature maps denoted by F_T ∈ ℝ^{H×W×C} and F_S ∈ ℝ^{H×W×C′}

Figure 3. Intermediate feature-matching schemes. (a) The conventional feature-matching scheme. (b) The proposed CADiT, which makes the student learn the channel correlations from the teacher.
(2) It allows the selection of the next waypoint based on the depth map for path planning in future work. It is possible to use more complex control schemes, but this is beyond the scope of this paper.

Figure 4. An obstacle map is generated by applying an average pooling operation on the horizontal center of the depth map. The depth values of all the pixels in each pooling window are averaged.

Figure 5. The Crazyflie is used as the drone platform in this paper.

4.2. Datasets
4.2.1. KITTI
KITTI is a multimodal dataset [53], which consists of 61 stereo road scenes. In this paper, the self-supervised model is trained on the Eigen split [54]. There are 39,810 monocular triplets used for training, 4421 for evaluation, and 697 for testing. During training, all the RGB images in the KITTI dataset are resized to 192 × 640 and converted to one-channel grayscale images.
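The conversion from RGB to one-channel grayscale can be sketched as a weighted sum of the color channels. The paper does not state which conversion was used; the ITU-R BT.601 luma weights below are a common default and are an assumption here, as are the function names.

```python
def rgb_to_gray(r, g, b):
    """Convert one 8-bit RGB pixel to an 8-bit gray value (BT.601 weights assumed)."""
    return min(255, round(0.299 * r + 0.587 * g + 0.114 * b))


def to_gray_image(rgb_image):
    """Convert a 2D list of (r, g, b) tuples into a 2D list of gray values."""
    return [[rgb_to_gray(*pixel) for pixel in row] for row in rgb_image]


gray_row = to_gray_image([[(255, 255, 255), (0, 0, 0), (255, 0, 0)]])
```

After the conversion each pixel occupies one value instead of three, which is the source of the two-thirds memory saving discussed in the paper.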

Figure 7. Qualitative results on Gray Campus Indoor.

Figure 8. Real environment tests. Grayscale input images and their corresponding depth maps generated by the quantized CNN model are shown. The green bar in each grayscale image denotes the change in angular velocity to avoid obstacles.

Figure 9. The method fails to avoid obstacles in areas of glass and when close to walls.

Figure 10. The application pipeline for the offline 3D reconstruction.

Author Contributions: Conceptualization, N.Z. and F.N.; methodology, N.Z.; software, N.Z.; validation, N.Z.; formal analysis, N.Z.; investigation, N.Z.; writing-original draft preparation, N.Z.; writing-review and editing, N.Z., F.N., G.V. and N.K.; visualization, N.Z. All authors have read and agreed to the published version of the manuscript.

Funding: This project has funding from the European Union's Horizon 2020 Research and Innovation Programme and the Korean Government under Grant Agreement No. 833435. The content reflects only the authors' view, and the Research Executive Agency (REA) and the European Commission are not responsible for any use that may be made of the information it contains.

Table 1. A comparison of the different open-source drone platforms.

Table 4. Inference speed evaluation under two image resolutions.