A Review on IoT Deep Learning UAV Systems for Autonomous Obstacle Detection and Collision Avoidance

Abstract: Advances in Unmanned Aerial Vehicles (UAVs), also known as drones, offer unprecedented opportunities to boost a wide array of large-scale Internet of Things (IoT) applications. Nevertheless, UAV platforms still face important limitations, mainly related to autonomy and weight, that impact their remote sensing capabilities when capturing and processing the data required for developing autonomous and robust real-time obstacle detection and avoidance systems. In this regard, Deep Learning (DL) techniques have arisen as a promising alternative for improving real-time obstacle detection and collision avoidance for highly autonomous UAVs. This article reviews the most recent developments on DL Unmanned Aerial Systems (UASs) and provides a detailed explanation of the main DL techniques. Moreover, the latest DL-UAV communication architectures are studied and their most common hardware is analyzed. Furthermore, this article enumerates the most relevant open challenges for current DL-UAV solutions, thus allowing future researchers to define a roadmap for devising a new generation of affordable autonomous DL-UAV IoT solutions.


Introduction
The Internet of Things (IoT) is expected to connect more than 75 billion devices to the Internet in 2025 [1]. Such devices are used in a wide variety of fields, like agriculture [2], industry [3][4][5], or environmental monitoring [6]. However, IoT still faces some challenges, especially regarding security [7,8], which slow down its widespread adoption.
Unmanned Aerial Vehicles (UAVs) have an enormous potential for enabling novel IoT applications thanks to their low maintenance cost, high mobility and high maneuverability [9]. Due to such characteristics, UAVs have proven useful in a number of fields and applications, like remote sensing, real-time monitoring, disaster management, border and crowd surveillance, military applications, delivery of goods, or precision agriculture [10,11]. The use of UAVs has also been suggested for providing services in a number of industrial applications, like critical infrastructure inspections [12][13][14][15][16], sensor monitoring [17,18], automatic cargo transport [19,20], or logistics optimization of UAV swarms [21]. UAVs that operate in industrial environments require certain characteristics that may differ substantially from those of other applications [22].
For instance, UAV maneuvering in such environments needs to consider the presence of workers, mobile vehicles, robots, and heavy-duty tools in order to avoid collisions.
Most of the previously mentioned IoT applications require operating without pilot intervention, so there is a growing interest in the development of techniques that enable autonomous UAV flights. Among the different computational techniques that can be used to detect obstacles and to avoid collisions, Deep Learning (DL) techniques have arisen as a promising alternative that, when applied to UAVs, gives rise to the concept of Deep Learning-Unmanned Aerial Vehicles (DL-UAVs). The main advantage of applying DL to UAVs is the ability to recognize complex patterns from the raw input data captured by the sensors incorporated in the UAV, learning proper hierarchical representations of the underlying information at different levels.
This article analyzes the latest developments on DL-UAV systems and includes the following main contributions:
• The most relevant DL techniques for autonomous collision avoidance are reviewed, as well as their application to UAV systems.
• The latest DL datasets that can be used for testing collision avoidance techniques on DL-UAV systems are described.
• The hardware and communications architecture of the most recent DL-UAVs are thoroughly detailed in order to allow future researchers to design their own systems.
• The main open challenges and current technical limitations are enumerated.
It must be noted that in this article the term UAV system is used, as it has been commonly used in the literature for years to name unmanned aerial systems. However, most reputable international organizations (e.g., EUROCONTROL, EASA, FAA, and DoD) have adopted Unmanned Aerial System (UAS) as the correct official term for a system that consists of a ground control station, communication transceivers and an aircraft.
The rest of this article is structured as follows. Section 2 introduces the related work. Section 3 analyzes the use of DL techniques and datasets for autonomous collision avoidance, emphasizing their application to UAVs. Section 4 reviews the hardware of the most relevant DL-UAV developments and their communications architecture, as well as their main subsystems. Finally, Section 5 enumerates currently open challenges, and Section 6 is devoted to the conclusions.

Related Work
Autonomy in a UAV can be defined as the capability of being self-guided without human intervention [23]. According to the UAV autonomy levels indicated in [23], the most basic level requires the UAV to be guided with a remote control, whereas the maximum level involves performing a complex mission with human-level decision-making without external intervention. In this review, approaches with an intermediate autonomy level are analyzed, in which the UAV should be able to perform path planning and detect existing obstacles in order to avoid collisions during navigation.
Depending on the degree of autonomous navigation and on the UAV application, certain problems may arise. For example, when following a fixed path, if the UAV deviates from it to avoid a collision, it would be desirable to know its position throughout the navigation in order to return to such a path. In an outdoor environment, this issue can be solved in a relatively easy way just by making use of a Global Positioning System (GPS), but in indoor scenarios visual odometry techniques should be added [24]. In addition, if the existing obstacles move throughout the scenario, object detection algorithms should be incorporated into the system in order to estimate their next position, thus anticipating potential collisions.
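As a rough illustration of how the next position of a moving obstacle could be estimated to anticipate a collision, the following minimal sketch assumes a constant-velocity motion model between two consecutive detections. The function names, the prediction horizon, and the safety radius are hypothetical and are not taken from any of the reviewed systems.

```python
import math

def predict_position(prev, curr, dt, horizon):
    """Extrapolate an obstacle position assuming constant velocity.

    prev, curr: (x, y) positions in meters, observed dt seconds apart.
    horizon: how far ahead (in seconds) the position is predicted.
    """
    vx = (curr[0] - prev[0]) / dt
    vy = (curr[1] - prev[1]) / dt
    return (curr[0] + vx * horizon, curr[1] + vy * horizon)

def collision_expected(uav_future, obstacle_future, safety_radius=1.0):
    """Flag a potential collision when the predicted positions are too close.

    safety_radius is an assumed margin in meters.
    """
    return math.dist(uav_future, obstacle_future) < safety_radius

# Example: an obstacle moving 1 m/s along x, predicted 2 s ahead.
obstacle = predict_position((0.0, 0.0), (1.0, 0.0), dt=1.0, horizon=2.0)
print(obstacle, collision_expected((3.0, 0.5), obstacle))
```

A real system would typically replace the constant-velocity assumption with a filter or a learned motion model, but the structure (detect, extrapolate, compare against a safety margin) would be similar.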
Classical collision avoidance approaches make use of techniques like Simultaneous Localization and Mapping (SLAM) [25,26] and Structure from Motion (SfM) [27] to generate or update a map that represents the visual geometry of the environment, which allows for inferring obstacles and traversable spaces. These techniques use data captured by sensors like RGB-D cameras, Light Detection and Ranging (LIDAR) sensors, or Sound Navigation and Ranging (SONAR) sensors, either individually [28] or by fusing several of them [29]. However, it must be considered that the mentioned sensors can be expensive and still present some limitations (e.g., a LIDAR can suffer from mirror reflections, whereas a GPS does not work properly indoors). Alternatively, computer vision methods can be used, which involve tasks such as optical flow or depth estimation from the data captured by monocular and stereo cameras [30,31]. Such computer vision techniques require using software algorithms and a single camera, which is a considerably cheaper option than the previously mentioned sensors. Nonetheless, further research is needed regarding this latter alternative, as it requires more computational resources.
In the literature, there are some surveys and reviews of methods for obstacle detection and collision avoidance in different application domains. For instance, the authors of [32] provide a systematic review of sensors and vision-based techniques for vehicle detection and tracking for collision avoidance systems in the specific context of on-road driving. Another interesting review can be found in [33], where the authors focus on space robotics and identify the major challenges in collision-free trajectory planning for manipulators mounted on large orbital structures, small satellites, and space robots that navigate in proximity to large orbital structures. Other authors have reviewed the main techniques for improving the autonomy of unmanned surface vehicles with regard to navigation, guidance, control and motion planning with respect to the international regulations for avoiding collisions at sea [34].
In the specific context of UAV navigation, the authors of [31] reviewed vision-based techniques for positioning and mapping, obstacle avoidance, and path planning. Regarding the use of DL techniques for robotic solutions, in the work by the authors of [35], a survey is presented on the use of DL techniques for robotic perception, robotic control and exploration, and robotic navigation and autonomous driving.
Regarding the application of UAVs with attached sensors (i.e., UASs) to remote sensing applications, note that the latest advances have allowed for monitoring environmental processes and changes produced at spatial and temporal scales that would be difficult or impossible to detect by means of conventional remote sensing platforms [36]. Moreover, the characteristics of small UASs and their impact in the context of remote sensing models have been studied in the literature [37], identifying novel remote sensing capabilities as well as the challenges that this type of platform entails. In particular, the affordability and potential for ubiquitous operation of small UASs enable advances in the type and quality of information that can be collected, and therefore in the applications that can be addressed by this type of remote sensing system [37]. The main capabilities that small UASs can provide to remote sensing depend on their operation without direct human supervision, involving autonomous deployment, data collection, landing and data transfer [37]. In this regard, it is worth mentioning the work detailed in [36], which provides a review of the recent progress made in remote sensing for small UASs in different fields that involve photogrammetry, multispectral and hyperspectral imaging, and synthetic aperture radars and LIDARs. Additionally, in [38], the authors present a review of remote sensing tasks that involve data acquired by UAVs. Such a review focuses on solutions that specifically address issues derived from the specific nature of the collected data (e.g., ultra-high resolution, geometric, and spectral data, or the fusion of data from multisensor acquisition). In this regard, it is worth mentioning that remote sensing systems that use data acquired by UAVs may involve four types of resolution: spatial, spectral, radiometric, and temporal [37].
In the case of DL-UAV systems, in this article, only spatial resolution was considered.
There are only a few recent articles on the use of DL methods for UAVs. For instance, in the work by the authors of [39], the current literature is systematically reviewed according to a taxonomy that includes different abstraction levels related to perception, guidance, navigation, and control. Other authors focused only on solving specific problems (e.g., counting cars in UAV images [40]) or on improving specific algorithms [41,42]. Nevertheless, most of the current literature cannot be considered IoT-enabled, and, to the best of our knowledge, there are no articles that review the whole DL-UAV architecture while focusing on the problem of autonomous obstacle detection and collision avoidance.

Deep Learning in the Context of Autonomous Collision Avoidance
During the last decade, DL has proven to be an excellent technique in the area of artificial intelligence, solving different problems and even surpassing humans in some cases [43,44]. In addition, DL yields good results in diverse areas like image recognition [45], medical imaging [46], or speech recognition [47].
Classic Machine Learning (ML) approaches require designing feature extraction methods in order to generate a descriptor (i.e., a feature vector that emphasizes the pattern to detect and that is usually more compact than the raw data). The descriptor feeds a learning algorithm that performs a specific task, such as a classification or a regression of the input data. In contrast to such an approach, DL methods are able to perform both tasks (the representation of the data and the classification) at the same time just by feeding the network with raw data [43]. Moreover, DL techniques excel when learning data representations from raw sensor data acquired by robotic solutions in real environments [39]. In the specific context of autonomous collision avoidance, DL techniques have shown their effectiveness when solving a wide variety of robotic tasks in areas like perception, planning, positioning, and control [35], which involve learning complex behaviors.
Among the DL techniques for collision avoidance, there are end-to-end approaches that directly map the raw sensor data captured by the robotic system into a set of possible actions [48][49][50]. Such approaches are based on learning the behavior of an expert pilot in a given scenario in a real-world environment. Other DL techniques require several stages that involve intermediate representations (e.g., depth maps) to estimate the distance to potential obstacles, the UAV pose, or its odometry in order to recalculate the path to reach a goal position. These DL approaches usually include a situational awareness module that generates a set of feature maps related to the state of the robotic system and its surroundings; the computed feature maps then feed a second module for the decision-making process. Therefore, the combination of the two modules makes up a complex network that takes raw sensor data as input and generates the motion control commands for the robotic system [48,51,52,53,54,55].
Regarding the learning paradigm, Reinforcement Learning (RL) algorithms [56] have been widely used in robotic systems. The essential idea in deep RL methods is that an agent extracts feedback from the interaction with real or simulated environments. Thus, given a specific state and based on previous experience, the agent can infer which action maximizes a predefined goal. Several approaches use RL methods in order to learn effective collision avoidance policies that require experience on successful trajectories as well as on undesirable events like collisions [51,53,54,57,58,59]. The use of simulated environments allows for collecting a large amount of data in an easy way. Specifically, AirSim [60] enables navigating freely with a car or a drone through different virtual scenarios, providing the raw information of different sensors (e.g., RGB cameras and depth cameras), which are synchronized with the six Degrees of Freedom (DoF) of the vehicle and the segmentation maps. Other approaches use supervised learning algorithms that are based on learning through examples. Such examples consist of a representation of the environment provided by the raw sensor data and the actions taken by an expert in the same conditions. The learning process returns the policy that best imitates the actions of the experts according to the given examples [49,50,52,55]. A limiting factor of the previous supervised approaches is the need for large amounts of data to train models with generalization capabilities in different real-world environments. This necessity involves human experts during the collection and annotation of the data. In order to minimize human effort, self-supervised algorithms automate the collection and annotation processes to generate large-scale datasets [48,49].
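The RL idea described above (an agent that learns from both successful trajectories and collisions which action maximizes a goal) can be sketched with tabular Q-learning on a toy problem. This is only an illustrative simplification: the reviewed systems rely on deep networks rather than a table, and the one-dimensional corridor, the rewards, and all hyperparameters below are assumptions made for the example.

```python
import random

N_STATES = 5            # cells 0..4 of a hypothetical 1-D corridor
OBSTACLE, GOAL = 0, 4   # assumed layout: obstacle at one end, goal at the other
ACTIONS = (-1, +1)      # action 0 moves left, action 1 moves right

def step(state, action):
    """Return (next_state, reward, done) for a deterministic toy environment."""
    nxt = state + ACTIONS[action]
    if nxt == OBSTACLE:
        return nxt, -1.0, True   # collision: undesirable event
    if nxt == GOAL:
        return nxt, 1.0, True    # successful trajectory
    return nxt, 0.0, False

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a Q-table with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 2                        # always start in the middle cell
        for _ in range(50):              # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(2)                          # explore
            else:
                action = 0 if q[state][0] >= q[state][1] else 1    # exploit
            nxt, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q

q_table = train()
# Greedy action per non-terminal cell (1 means "move right", away from the obstacle).
print([1 if q_table[s][1] > q_table[s][0] else 0 for s in (1, 2, 3)])
```

In a simulated environment such as the ones mentioned above, the table would be replaced by a network that maps sensor data to action values, but the update rule follows the same principle.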
In addition to the mentioned learning techniques, transfer learning [61] allows for starting from a model previously trained with related data. Such a model is further trained for a specific application domain, thus reducing the total amount of data needed for training. For example, the policies learned from data collected by cars and bicycles can be applied to a UAV for autonomous flights [62]. Such navigation policies are generic, despite having been trained from the viewpoint of urban vehicles. Transfer learning from models trained on virtual worlds, but applied to real data, has also been studied in the literature [59,63,64,65].
There are different DL architectures. On the one hand, a Convolutional Neural Network (CNN) is a type of hierarchical architecture for feature extraction that allows for generating a set of high-dimensional feature maps from the raw sensor data acquired by the robotic system. Thus, CNN-based models [45] are widely used in the context of collision avoidance, especially in the stages related to environment perception or to depth estimation [52]. On the other hand, in the context of continuous autonomous flights, the use of temporal information becomes relevant. Considering that the information extracted from the sensors at each observation is related to a partial representation of the environment surrounding the robotic system, the possibility of storing and tracking all the relevant information obtained in the past allows for getting a more complete representation of the environment. For such a purpose, the use of Recurrent Neural Network (RNN) architectures [66] provides the ability to learn temporal dependencies from the information of an arbitrarily long sequence of observations. Therefore, there are RNN-based approaches that allow for keeping track of past observations such as the depth maps obtained from RGB images [53], poses, or recently avoided obstacles. In addition, the incorporation of Temporal Attention [67] allows for evaluating the information given by each observation of a sequence by weighting each of the recent observations based on its importance in decision-making. Such a technique increases training speed and provides a better generalization over the training dataset. A type of RNN is Long Short-Term Memory (LSTM) [68], which is capable of learning long-term dependencies thanks to its ability to remove old information or add new information at any point, thus allowing the use of all the relevant past information.
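The weighting of recent observations performed by temporal attention can be sketched as a softmax over per-observation relevance scores followed by a weighted sum of the observation features. In the approaches cited above the scores are learned by the network; here they are simply passed in as an assumption, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax that turns scores into weights summing to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(observations, scores):
    """Combine a sequence of feature vectors into one context vector.

    observations: list of equally sized feature vectors, one per time step.
    scores: one relevance score per time step (assumed given, normally learned).
    """
    weights = softmax(scores)
    dim = len(observations[0])
    context = [
        sum(w * obs[i] for w, obs in zip(weights, observations))
        for i in range(dim)
    ]
    return context, weights

# Two past observations with equal relevance are simply averaged.
context, weights = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(context, weights)
```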
Table 1 summarizes the main characteristics of the approaches that enable UAVs for autonomous flight through the use of DL techniques. All the methods mentioned in the table feed DL networks with RGB images whose resolution ranges from 360p to 720p.

On the Application of DL to UAVs
Depending on their final goal, three types of methods can be distinguished. Some methods learn to navigate from raw images, indicating the next movement of the UAV, thus being able to navigate while avoiding obstacles [49,59,62,69]. Other authors proposed specific solutions for certain applications that involve following a trail or a gate [55,70]. However, such methods do not include a collision avoidance module. There are also other methods that generate an obstacle map, which represents the probability of collision on the scene or the distance to nearby obstacles [53].
It is important to note that none of the compared systems make use of depth maps extracted directly from stereo or depth cameras. However, some try to estimate depth maps from just one single RGB camera [53]. The combination of depth maps provided by a hardware sensor with RGB images that feed a DL architecture is an interesting approach that has not been described so far in the literature.
Regarding the application scenario, the most relevant solutions of Table 1 are focused on unstructured outdoor environments such as forests, which enable applications like search and rescue, wilderness monitoring, target tracking, exploration, or environmental mapping. For instance, in [55], the authors describe a Micro Aerial Vehicle (MAV) for autonomous low-altitude flight that follows an artificial trail. This approach is based on a Deep Neural Network (DNN) architecture for trail detection that uses transfer learning to estimate the view orientation and the lateral offset of the MAV with respect to the center of the trail. Another proposal for forest environments is presented in [71], which uses a pretrained AlexNet [45] for tree detection and for predicting the direction for collision avoidance. Other outdoor UAV solutions are trained for urban spaces, such as CNN-based models that detect all the potential obstacles [72] or that control a UAV through the streets of a city environment [62].

| Reference | Architecture | Goal and Details | Scenario |
|---|---|---|---|
| [55] | CNN | Two deep networks: one for detection of the trail center and another for obstacle detection. | Outdoors |
| [70] | CNN | The network detects the gate center. Then, an external guidance algorithm is applied. | Indoors |
| [69] | CNN | The network returns the next flight command. The commands are learned as a classification. The dataset contains several sample flights performed by a pilot. | Indoors |
| [49] | CNN | The network returns the next flight command. The commands are learned as a classification. Self-supervised data for training. | Indoors |
| [50] | CNN | The network returns the next flight command. The commands are learned as a classification. The dataset contains several sample flights annotated manually. | Indoors |
| [62] | CNN | The network returns the next steering angle and the probability of collision. The steering angle is learned by a regression. | Outdoors |
| [59] | CNN + RL | The CNN returns a map that represents the action space for the RL method. The method is trained on a virtual environment. | Indoors |
| [53] | cGAN + LSTM + RL | The first cGAN estimates a depth map from the RGB image. Then, these maps feed an RL with LSTM to return the flight command. | Indoors |
| [71] | CNN | The network returns the distance to tree obstacles. The distance is learned as three classes. | Outdoors |
| [72] | CNN | Object detection. | Outdoors |
| [73] | CNN | The network computes feature extraction for learning safe trajectories. | Outdoors |
| [48] | Two stream CNN | The network makes use of two consecutive RGB images and returns the distance to obstacles in three directions. | Indoors |
Approaches for indoor scenarios cover applications of surveillance, transportation, goods delivery, inspection tasks or diverse tasks in manufacturing environments. For instance, Kouris et al. [48] proposed an approach that maps each input RGB image to a flight command. Such a solution is composed of a two-stream CNN for UAV perception that takes as inputs two consecutive RGB frames in order to extract spatio-temporal features for predicting the distance to the potential obstacles. The CNN is trained with a custom dataset of real-flight indoor UAV trajectories with distance labels indicating the closest obstacle. The collection and annotation of the dataset are performed in a self-supervised manner by means of a UAV equipped with three pairs of ultrasonic and infrared distance sensors that enable automatic data annotation. Then, a local motion planning policy translates the distance predictions into a single control command that modulates the yaw and forward linear velocity to guide the UAV towards the direction for which the longest navigable space is predicted within the current Field of View (FoV). Another custom dataset was created to train the end-to-end approach proposed by Gandhi et al. [49]. Such a work, in contrast to other datasets related to collision avoidance, uses a UAV crash dataset in order to collect the different ways in which a UAV can crash. This negative flying data is used in conjunction with the equivalent positive data sampled from the same trajectories to learn a robust policy for UAV navigation. This CNN-based approach uses, in particular, an AlexNet architecture [45] that is initialized with pretrained weights. Self-supervised learning is used to predict the probability of moving in a certain direction. Then, the decision to turn the UAV to the left or to the right while moving forward is taken according to the confidence of the CNN predictions.
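To make the idea of a local motion planning policy more concrete, the following sketch turns three predicted distances (left, center, right) into a yaw rate and a forward velocity, steering towards the direction with the longest predicted free space. It is not the policy used in [48]: the thresholds, the speed limits, the sign convention (positive yaw turns left), and the function name are all assumptions made for illustration.

```python
def motion_command(d_left, d_center, d_right,
                   max_speed=1.0, max_yaw=0.5,
                   stop_dist=0.5, full_speed_dist=3.0):
    """Map predicted obstacle distances (meters) to (yaw_rate, forward_speed).

    The UAV turns towards the direction with the longest predicted free space
    and slows down linearly as the obstacle straight ahead gets closer.
    """
    # "center" is listed first so that ties favor flying straight.
    distances = {"center": d_center, "left": d_left, "right": d_right}
    best = max(distances, key=distances.get)
    yaw = {"center": 0.0, "left": max_yaw, "right": -max_yaw}[best]

    if d_center <= stop_dist:
        speed = 0.0  # too close to an obstacle: rotate in place
    else:
        ratio = (d_center - stop_dist) / (full_speed_dist - stop_dist)
        speed = max_speed * min(1.0, ratio)
    return yaw, speed

print(motion_command(1.0, 4.0, 1.0))   # free corridor ahead: fly straight
print(motion_command(3.0, 0.4, 1.0))   # blocked ahead: stop and turn left
```

Keeping the perception output (distances) separate from the command generation makes it possible to tune the flight behavior without retraining the network.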
Another custom dataset was created to train the system proposed in [50], which focuses on enabling autonomous navigation in indoor corridors by means of the DenseNet-161 architecture [74] with pretrained weights. The dataset is composed of images taken at several positions in corridors of different lengths, captured by means of a front-facing camera attached to a quadcopter, and the corresponding ground-truth values in terms of in-flight commands. The model is trained over the custom dataset using supervised learning to predict the probability of different classes that decide whether to move forward, shift left/right, or stop. Sadeghi et al. [59] explore the possibility of performing collision-free indoor flights in the real world by only training with data generated in a simulated environment. Such an approach combines CNNs to process the raw input data with RL without requiring any human demonstration data or real images during training. Singla et al. [53] proposed a two-way method for UAV navigation that avoids stationary and mobile obstacles. The proposed method involves intermediate depth estimations and an RNN-based method with RL that incorporates Temporal Attention [67] to integrate the relevant data gathered over time.
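When the network output is a set of class probabilities, as in the corridor navigation approach described above, the last step is to turn those probabilities into a flight command. A minimal sketch follows; the command names, their order, and the fallback to stopping when the prediction is not confident enough are assumptions for illustration and are not described in [50].

```python
# Hypothetical command set and ordering for the classifier outputs.
COMMANDS = ("forward", "shift_left", "shift_right", "stop")

def select_command(probabilities, min_confidence=0.5):
    """Pick the most probable command, or stop if the network is unsure."""
    if len(probabilities) != len(COMMANDS):
        raise ValueError("expected one probability per command")
    best = max(range(len(COMMANDS)), key=lambda i: probabilities[i])
    if probabilities[best] < min_confidence:
        return "stop"  # assumed conservative fallback
    return COMMANDS[best]

print(select_command([0.7, 0.1, 0.1, 0.1]))      # confident prediction
print(select_command([0.3, 0.3, 0.25, 0.15]))    # ambiguous prediction
```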
Among the most popular e-sports, drone racing has recently gained a lot of popularity, giving rise to challenges that promote the development of this field, such as the autonomous drone racing competitions held during the IROS conference. These competitions motivated the development of UAV systems that make use of DL techniques. For example, in [70], the authors consider the nature of the competitions, which involve moving as fast as possible through a series of gates. Thus, the researchers devised a real-time gate detection network that is complemented with a guidance algorithm.

Datasets
The previously mentioned DL techniques provide effective solutions that extract hierarchical abstractions from large amounts of data through the training stages. There are several publicly available large-scale datasets that include labeled data acquired from real or simulated environments in challenging conditions:
• KITTI online benchmark [75] is a widely-used outdoor dataset that contains stereo gray and color video, 3D LIDAR, inertial, and GPS navigation data for depth estimation, depth completion or odometry estimation.
In addition to the data specifically gathered to generate a training dataset, techniques of data augmentation are frequently used to obtain more representative data that cover different conditions in order to train models with high generalization capabilities. Methods for data augmentation usually include left-right flips, random horizontal flips, random crops, or variations in scale, contrast, brightness, saturation, sharpness, rotation, and jitter with random permutations of these transformations [55,79].
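A few of the augmentation operations listed above can be sketched on a small grayscale image stored as a nested list. This is only a didactic, dependency-free example: real training pipelines would normally rely on the augmentation utilities of an array or DL library, and the function names, the flip probability, the crop size, and the brightness range are assumptions.

```python
import random

def flip_left_right(image):
    """Mirror every row of the image."""
    return [row[::-1] for row in image]

def random_crop(image, height, width, rng):
    """Cut a height x width window at a random position."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]

def adjust_brightness(image, factor):
    """Scale pixel values and clamp them to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]

def augment(image, rng):
    """Apply the transformations in a random order (a random permutation)."""
    ops = [
        lambda im: flip_left_right(im) if rng.random() < 0.5 else im,
        lambda im: random_crop(im, len(im) - 1, len(im[0]) - 1, rng),
        lambda im: adjust_brightness(im, rng.uniform(0.8, 1.2)),
    ]
    rng.shuffle(ops)
    for op in ops:
        image = op(image)
    return image

sample = [[0, 50, 100], [150, 200, 250], [10, 20, 30]]
print(augment(sample, random.Random(0)))
```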

DL-UAV Hardware and Communications Architecture
The underlying DL-UAV hardware and its communications architecture are essential for the success of the system, as they provide the support for implementing DL techniques and advanced features. Specifically, the use of a DL approach for autonomous obstacle detection and collision avoidance imposes UAV hardware and communications design restrictions (e.g., a high computational cost) that should be carefully addressed. This section reviews the latest DL-UAV communication architectures, detailing the main components of the different subsystems and analyzing the most common hardware used for developing such subsystems. Figure 1 shows the typical cloud-based communications architecture for DL-UAV systems. Such an architecture is divided into three main layers:

Typical Deep Learning UAV System Architecture
• The layer at the bottom is the UAV layer, which includes the aerial vehicles and their subsystems:
- Storage subsystem: it stores the data collected by the different subsystems. For instance, it is often used for storing the video stream recorded by an on-board camera.
- Identification subsystem: it is able to remotely identify other UAVs, objects, and obstacles. The most common identification subsystems are based on image/video processing, but wireless communications systems can also be used.
- Deep Learning subsystem: it implements deep learning techniques to process the collected data and then determine the appropriate response from the UAV in terms of maneuvering.
• The data from every UAV are sent either to a remote control ground station or to a UAV pilot who makes use of a manual controller. The former case is related to autonomous guiding and collision-avoidance systems, and often involves the use of high-performance computers able to collect, process, and respond in real time to the information gathered from the UAV. Thus, such high-performance computers run different data collection and processing services, algorithms that make use of deep learning techniques, and a subsystem that is responsible for sending control commands to the UAVs.
• At the top of the architecture is the remote service layer, which is essentially a cloud composed of multiple remote servers that store data and carry out the most computationally intensive data processing tasks that do not require real-time responses (e.g., certain data analysis). It is also usual to provide some kind of front-end through the cloud so that remote users can manage the DL-UAV system in a user-friendly way.

Advanced UAV Architectures
In recent years, UAV IoT architectures have evolved towards more sophisticated architectures that essentially attempt to avoid the reliance on a remote cloud. This is due to the fact that a cloud may not scale properly, and thus it may constitute a bottleneck when a significant number of UAVs or IoT devices exchange communications with it [80]. In addition, cyberattacks may compromise the availability of the cloud servers, therefore preventing the UAV system from working properly.
For these reasons, UAV architectures have evolved toward IoT edge computing architectures, which unburden the cloud from part of its tasks through the use of devices on the edge of the network (i.e., at the boundary between the last network routing devices and the sensing/actuation devices that act on the physical world). There are basically two variants of edge computing that are commonly applied to UAV applications, depicted in Figure 2:
• Fog Computing: It is based on the deployment of medium-performance gateways (known as fog gateways) that are placed locally (close to a UAV ground station). It is usual to make use of Single-Board Computers (SBCs), like Raspberry Pi [81], BeagleBone [82], or Orange Pi PC [83], to implement such fog gateways. Despite their modest hardware, fog gateways are able to provide fast responses to the UAV, thus avoiding the communications with the cloud. It is also worth noting that fog gateways can collaborate among themselves and collect data from sensors deployed in the field, which can help to decide on the maneuver commands to be sent to the UAVs in order to detect obstacles and prevent collisions.
• Cloudlets: They are based on high-performance computers (i.e., powerful CPUs and GPUs) that carry out computationally intensive tasks that require real-time or quasi real-time responses [84]. Like fog gateways, cloudlets are deployed close to the ground station (they may even be part of the ground station hardware), so that the response lag is clearly lower than the one that would be provided by a remote Internet cloud.
Figure 2 also includes in the Remote Service Layer the use of Distributed Ledger Technologies (DLTs) like blockchain, which have been previously applied to multiple IoT applications [85][86][87], including UAV systems [88]. In the field of UAVs, the main advantage of blockchain and similar DLTs is that they are able to provide information redundancy, data security, and trustworthiness to the potential applications [89]. In addition, a blockchain can execute smart contracts, which translate legal terms into code that can be run autonomously [90], and therefore can automate certain tasks depending on the detection of certain events.
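The division of work among fog gateways, cloudlets, and the remote cloud described above can be summarized as a simple dispatch rule. The sketch below is only one possible reading of that description: the task attributes, the tier names, and the rule itself are assumptions, and a real deployment would also weigh factors such as link quality or current load.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Hypothetical description of a processing task produced by a UAV."""
    name: str
    needs_real_time: bool
    compute_intensive: bool

def choose_tier(task):
    """Decide where a task is processed, following the roles described above."""
    if task.needs_real_time and task.compute_intensive:
        return "cloudlet"      # powerful local hardware, low response lag
    if task.needs_real_time:
        return "fog_gateway"   # modest local hardware, fast responses
    return "cloud"             # remote servers, no real-time requirement

tasks = [
    Task("dl_obstacle_detection", True, True),
    Task("sensor_data_relay", True, False),
    Task("offline_flight_analysis", False, True),
]
for task in tasks:
    print(task.name, "->", choose_tier(task))
```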

Hardware for Collision Avoidance and Obstacle Detection Deep Learning UAV Systems
In the literature, diverse researchers have addressed the problem of what hardware may be used by a UAV to prevent collisions and detect obstacles when harnessing the outputs of deep learning systems. Table 2 summarizes the features of the most relevant previous DL-UAV systems that deal with such a topic, including the hardware used for building their communications architecture and the provided DL-based collision-avoidance and obstacle-detection functionality.
As seen in Table 2, most UAVs are commercial platforms. In fact, the majority are manufactured by Parrot due to their easy-to-use developer APIs [91], which allow the UAVs to be monitored and controlled remotely. The list also includes a couple of self-assembled drones, which require additional effort from the researchers during the assembly and validation of the UAV, but which can be customized with specific sensors, powerful controllers and extended batteries that increase the type and number of tests that can be carried out.
The following subsections analyze the characteristics of the subsystems of the UAVs listed in Table 2 and detail some of the most popular alternatives for implementing each subsystem.

Platform Frame, Weight and Flight Time
The UAV platforms listed in Table 2 can be considered light, with an average weight of roughly 400 g. Nonetheless, extremely light platforms like Crazyflie have also been used for developing Micro-Aerial Vehicle (MAV) applications. In contrast, relatively heavy platforms like the DJI Matrice 100, which weighs ~2.4 kg, are able to move heavier payloads (more than 1 kg). In fact, most of the commercial drones on the list have not been designed to transport additional payloads, but some of them, like Parrot's AR.Drone 2.0, are able to carry up to 500 g thanks to their powerful propellers.
Flight time is related to the weight and maximum payload of the UAV: the higher the weight, the larger the carried batteries, and thus the longer the flight time. Moreover, it is worth pointing out that some UAVs have optional batteries and modules that are able to double (e.g., in the case of the DJI Matrice 100) or triple (e.g., with the AR.Drone 2.0 Power Edition) the original battery life.
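The relationship between battery size and flight time can be approximated with a first-order estimate: usable battery energy divided by average power draw. The following sketch assumes a constant average power and a fixed usable fraction of the nominal capacity; the function name, its parameters, and the default of 0.8 are illustrative assumptions.

```python
def flight_time_min(capacity_mah: float, voltage_v: float, avg_power_w: float,
                    usable_fraction: float = 0.8) -> float:
    """Estimate flight time (minutes) as usable battery energy over average power."""
    if avg_power_w <= 0:
        raise ValueError("avg_power_w must be positive")
    # Nominal energy in watt-hours, reduced to the fraction assumed to be usable.
    energy_wh = capacity_mah / 1000.0 * voltage_v * usable_fraction
    return energy_wh / avg_power_w * 60.0
```

With these assumptions, doubling the capacity doubles the estimate only if the average power stays the same. In practice, a larger battery adds weight and therefore raises the power needed to stay airborne, so the real gain is smaller.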

Propeller Subsystem
UAVs are usually conceived for flying indoors and/or outdoors, which significantly influences their flight capabilities. In any case, the selected hardware must provide a good trade-off among cost, payload capacity and reliability. All the UAVs listed in Table 2 are multirotor UAVs (specifically, quadrotors), which offer good reliability and may minimize vibration during operation (this is useful for collecting stabilized images and videos from a DL-UAV), but they are also more expensive and heavier than alternatives like single-rotor UAVs. In contrast, single-rotor UAVs benefit from slower spinning speeds and lower power consumption, which is key for applications where an extended flight time or heavy payload transport is necessary.

Control Subsystem
The controller hardware is more powerful than traditional IoT node hardware. Most of it combines high performance with low power consumption, as is typical of the hardware used in mobile devices. Most controllers run on ARM devices (e.g., ARM Cortex-M4, ARM Cortex-A8, ARM9, or the ARM Cortex-A57 complex in the case of the NVIDIA Jetson TX2), which currently provide the best trade-off for mobile computing devices in terms of power consumption and performance by including 32/64-bit architectures, fast multicore processors, dedicated GPUs and support for managing several gigabytes of RAM. In fact, some of the listed UAVs are so powerful that they actually run certain versions of the Linux operating system.
It is important to note that some UAVs distinguish between the hardware used for controlling the flight and the one that interacts with the other subsystems. For instance, it is usual in self-assembled drones to make use of a PixHawk controller [97] as flight controller, which embeds a programmable processor, an Inertial Measurement Unit (IMU), a GPS, and provides connectors for radio telemetry and control interfaces.
Both the main control subsystem and the flight controller can make use of different electronic devices. The most common are microcontrollers, Application-Specific Integrated Circuits (ASICs), and Systems-on-Chip (SoCs), but it is also possible to embed Field-Programmable Gate Arrays (FPGAs) or Central Processing Units (CPUs) optimized for mobile computing devices (i.e., that provide a good trade-off between performance and power consumption).
High-performance microcontrollers are probably the most commonly used devices, owing to their low power consumption, their ability to be reprogrammed easily, and the fact that they have enough processing power for carrying out the required control tasks. SoCs usually integrate a medium-to-high performance microcontroller and multiple peripherals (e.g., wireless transceivers), which makes them more appropriate for lightweight systems, but causes them to have greater energy consumption than traditional microcontrollers.
FPGAs are able to provide excellent performance for executing certain demanding deterministic tasks and can be reconfigured easily with a different design. Unfortunately, FPGA development is usually clearly slower (and consequently more expensive), and FPGAs often consume more energy due to the requirement to power the used logic continuously.
ASICs provide even higher performance and significantly lower power consumption than FPGAs and other embedded devices thanks to being designed explicitly for very specific applications. Nonetheless, the development cost of an ASIC is very high (in the order of several millions of U.S. dollars), so their application in DL-UAV systems is limited to already programmed Commercial Off-The-Shelf (COTS) products.

Sensing Subsystem
Sensors are essential for UAV maneuvering. The most common are Inertial Measurement Units (IMUs), which include accelerometers, gyroscopes, and magnetometers. Photo and video cameras are also necessary for providing feedback to drone pilots and inputs to neural networks.
Performing control tasks while simultaneously processing the large amount of data that comes from sensors and cameras is difficult. For instance, there are video cameras that can stream up to 60 frames per second (fps), but such an amount of video data may not be processed in real time by the on-board UAV hardware, so two alternatives are often implemented: UAVs either sample frames to reduce the input frame rate or make use of external remote hardware that carries out the video processing tasks. In practice, only a few UAVs perform such processing on-board in real time: most UAVs delegate the task to external systems, which are usually powerful servers that integrate high-performance graphics cards (the characteristics of some of these servers are indicated in the last two columns of Table 2). In addition, such powerful hardware is often used during the training phase of the deep learning networks in order to save time.
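The first alternative, sampling frames to reduce the input frame rate, can be sketched as follows. The `subsample` generator is a hypothetical helper that works on any iterable of frames and keeps a fraction of them equal to the ratio between the target and source frame rates; it is one simple way to do it, not the method of any specific reviewed system.

```python
from typing import Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def subsample(frames: Iterable[Frame], source_fps: float,
              target_fps: float) -> Iterator[Frame]:
    """Yield a subset of frames so that the output rate approximates target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        yield from frames  # nothing to drop
        return
    keep_ratio = target_fps / source_fps  # fraction of frames to keep
    credit = 0.0
    for frame in frames:
        credit += keep_ratio
        if credit >= 1.0:  # enough credit accumulated to emit one frame
            credit -= 1.0
            yield frame
```

For example, `subsample(camera_frames, 60, 15)` keeps one frame out of every four, where `camera_frames` stands for whatever frame source the UAV software exposes.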
In summary, it can be stated that the most commonly used sensors in DL-UAV systems are as follows.

• IMUs that embed 3-axis accelerometers, gyroscopes and magnetometers.

The reliability and safety of the previously mentioned sensors and actuators should be ensured to guarantee the security of the DL-UAV system. There are just a few examples in the literature on this specific research area. For instance, the authors of [98] designed a fault detection and diagnosis system for a quadrotor under total failures (e.g., loss of a propeller or a motor) or with partial faults (e.g., degradation or displacement of a component). The obtained numerical results illustrate the effectiveness of the proposed method under partial and total rotor faults. Another example is presented in [99], where the authors propose a hybrid feature model and CNN-based fault diagnosis system validated through flight experiments.

Positioning Subsystem
Besides the sensing subsystem, the positioning subsystem is probably the most important for autonomous UAV navigation. Outdoors, most UAV systems make use of GPS/Global'naya Navigatsionnaya Sputnikovaya Sistema (GLONASS) in combination with an altimeter based on ultrasound measurements. The most sophisticated UAVs make use of LIDARs and special cameras that measure both odometry and speed. Ultra-Wide Band (UWB) transceivers are only used in one of the UAV systems listed in Table 2 [92], although they have been previously analyzed in the literature for performing accurate (centimeter-precision) distance measurements [100].
Among the different available indoor location techniques, those based on Received Signal Strength Indicator (RSSI) or Received Signal Strength (RSS) have proved their accuracy when positioning in limited areas [101,102], but they depend heavily on the characteristics of the scenario (e.g., presence of metallic objects) and on the UAV hardware used (e.g., antennas) [103]. There are also positioning techniques based on computing the Angle of Arrival (AoA) of the received signals [104], or their time of arrival (through Time of Arrival (ToA) and Time Difference of Arrival (TDoA) techniques) [105].
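The dependence of RSSI-based techniques on the scenario can be made concrete with the log-distance path loss model, a common textbook model that is used here as an assumption (the cited works may rely on other models). In this sketch, the reference RSSI at 1 m and the path loss exponent are hypothetical calibration values that would have to be measured for each environment and antenna.

```python
def rssi_to_distance_m(rssi_dbm: float, rssi_at_1m_dbm: float = -40.0,
                       path_loss_exponent: float = 2.0) -> float:
    """Estimate distance (meters) from RSSI with the log-distance path loss model.

    The model is RSSI(d) = RSSI(1 m) - 10 * n * log10(d), solved here for d.
    """
    if path_loss_exponent <= 0:
        raise ValueError("path_loss_exponent must be positive")
    return 10.0 ** ((rssi_at_1m_dbm - rssi_dbm) / (10.0 * path_loss_exponent))
```

The same RSSI reading maps to very different distances depending on the exponent, which is why a change in the environment (e.g., the presence of metallic objects) can degrade the accuracy of a previously calibrated system.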

Communications Subsystem
The data sent to and from the different subsystems is exchanged with remote computers by using different wireless communications technologies. The most common is Wi-Fi (IEEE 802.11 standard), which operates in unlicensed bands and provides a good trade-off among hardware cost, outdoor range and data rate. Bluetooth Low-Energy (BLE) is also often used, but it is usually aimed at providing short-range communications.
There are also many other communications technologies that can be used by DL-UAVs, like 3G/4G/5G [106,107], ZigBee (IEEE 802.15.4) [103], Long-Range Wide Area Network (LoRaWAN) [108], Ultra-Wide Band (UWB) [109], IEEE 802.11ah [110], or Wi-Fi Direct [111]. Table 3 compares the main characteristics of the latest communications technologies for DL-UAVs [4], indicating their frequency band, coverage, data rate, power consumption, and potential applications.
The DL-UAV communication network is one of the main factors that affect energy consumption. As a result, there is ongoing research on different network topologies such as rings, stars and, especially, ad hoc networks (i.e., Flying Ad-hoc Networks (FANETs)). There is also research on path planning and on relay selection to optimize network efficiency. It is important to note that in some IoT applications the delay requirements of the DL-UAV data traffic can be very strict and real-time data transfer models may be required (e.g., in emergency situations). Some alternatives for UAV-based data collection frameworks are presented in [123]. For instance, a model that increases the efficiency of collaborative UAV data transmission while optimizing energy consumption is described in [124]. Another example of a data gathering framework for UAV-assisted Wireless Sensor Networks (WSNs) is described in [125]. In that paper, to increase the efficiency of data collection and thus maximize the system throughput, the authors propose a priority-based data access scheme that considers UAV mobility to suppress redundant data transmissions between the sensor nodes and the UAV. To this end, the authors classify the nodes within the UAV coverage area into different frames according to their locations and assign them different transmission priorities. The authors also introduce a novel routing algorithm to extend the lifetime of the WSN.

Power Subsystem
A UAV power subsystem is responsible for powering the different electronic components. In the case of the DL-UAVs listed in Table 2, the main energy source is either a Li-Po or a Li-ion battery, which is appropriate for high-power electronics (usually of up to several tens of watts) and for powering the propeller subsystem.
However, it must be noted that the higher the battery power, the heavier the battery. As a consequence, current UAV battery technologies like Li-ion or Li-Polymer are a bottleneck when designing systems that require long battery life and reduced weight. Therefore, nowadays, it is necessary either to reduce on-board electronics power consumption by introducing novel technologies (e.g., Polymer Electrolyte Membrane (PEM) fuel cells [126]) or to make use of additional power sources. In the latter case, it is possible to embed energy harvesting devices, which can collect ambient energy and store it in batteries [127] or in supercapacitors, which are environmentally friendly and provide an energy density similar to lead-acid batteries, fast charging/discharging speed and a long life cycle [128]. For UAVs, it seems that motion can be harvested as mechanical energy, which can then be collected through thermoelectric systems [129]. Energy can also be collected through photovoltaic panels [130] or by rectifying electromagnetic waves [131].
Regarding UAV recharging, most systems make use of USB connectors, DC power jacks, or special dock stations where the batteries are connected. There are only a few recent examples in the literature of wireless UAV recharging, as wireless rechargers are currently inefficient and require the wireless receiver to be next to the transmitter [132,133].
The impact of the battery on UAV flight time depends mainly on the aircraft, its payload and the wind conditions, but flight time is commonly in the range of minutes. As a result, there is a clear need for developing systems that can endure longer flights. There are some recent studies on the topic, but none of them is explicitly focused on DL-UAV systems. For example, the authors of [134] propose an automated battery-management platform for small-scale UAVs (with a typical endurance of less than 10 min) that consists of an autonomous battery change/recharge station. The autonomous battery recharger uses a swapping mechanism with a linear sweeping motion to exchange batteries. The obtained results indicate that the system lasts at least 5 h. The researchers also evaluated the proposed recharge stations through flight validation with Markov Decision Process based planning and a learning algorithm in a 3-h mission with more than 100 swaps. However, the article points out that there are still open issues with the developed recharge station in relation to accurate landing and to non-balanced charging. Another alternative is proposed in [135], where the authors jointly manage the battery levels of UAVs with recharger stations (i.e., solar panels and batteries installed in a set of ground sites in a cellular network). A more sophisticated approach is presented in [136], where a UAV is able to transfer energy and information to several ground IoT devices at the same time, each device being equipped with a power splitter.

Storage Subsystem
As can be observed in Table 2, most UAVs include several gigabytes of flash memory for storing the collected video and data. Such data are first lightly processed by the main control subsystem and then stored in non-volatile memories such as EEPROMs or SD cards.
Nonetheless, the majority of the DL-UAV systems in Table 2 stream the collected data to a ground station. The data could also be uploaded to a cloud server, a cloudlet or a fog computing gateway for further processing and storage, and can ultimately be presented to the user through a graphical interface (e.g., through a web application).
It is worth pointing out that, although local storage can be inexpensive, it is prone to technical failures (e.g., SD card failures) and to external factors (e.g., UAV impacts), which can damage the storage subsystem and cause the loss of the stored information. This fact is essential for fostering remote storage subsystems in cloud servers, fog gateways or cloudlets, which provide redundancy. However, the information stored in such systems may be affected by cyberattacks (e.g., Denial-of-Service (DoS) attacks) that affect the availability of the data, as well as their trustworthiness and integrity. Due to such problems, DLTs like blockchain have been proposed as an alternative for storing certain data (or a proof of such data) securely and between entities that may not trust each other [137][138][139][140][141].
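The idea of storing a proof of the data rather than the data itself can be sketched with a cryptographic digest. In this minimal example, which assumes SHA-256 as the hash function, only the digest would be written to the ledger while the bulky payload stays in conventional storage; the function names are hypothetical and the ledger interaction itself is deliberately left out.

```python
import hashlib
import hmac


def data_proof(payload: bytes) -> str:
    """Return a SHA-256 digest that can be anchored in a ledger as a proof."""
    return hashlib.sha256(payload).hexdigest()


def verify(payload: bytes, anchored_proof: str) -> bool:
    """Check that the stored payload still matches the anchored proof."""
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(data_proof(payload), anchored_proof)
```

Any later modification of the stored data, whether accidental or malicious, changes the digest, so a verifier that trusts the ledger does not need to trust the storage provider.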
There is recent specific research devoted to UAV deployments with limited storage. For example, the authors of [142] studied joint cooperative transmissions influenced, not only by location planning, but also by strategic content placement. In addition, in [143] the researchers analyzed a swarm of miniature UAVs that make use of distributed in-network storage, which allows them to behave as sensor nodes specialized in detecting certain features. However, none of the above studies takes the storage issues into account when dealing with DL-UAV systems.

Identification Subsystem
The majority of the UAVs stream images collected with a video camera to the deep learning subsystem, which processes them in order to determine whether there are obstacles in the path of the drone and then performs the appropriate movement to avoid them.
However, there are other identification technologies that can be used by DL-UAVs, specifically in conjunction with location, inventory, and traceability applications. For instance, Radio-Frequency IDentification (RFID) is one of the most commonly used technologies for such applications [7]. It uses radio frequency transponders (i.e., tags) that transmit unique identifiers and at least one RFID reader that communicates with them. The most relevant advantage of RFID over other traditional identification technologies (e.g., barcodes and QR codes) is that no Line-of-Sight (LoS) is required between the tag and reader.
In the literature, there are several solutions that make use of RFID. For example, an autonomous UAV that makes use of RFID and self-positioning/mapping techniques based on a 3D LIDAR device is described in [144]. Table 4 shows a comparison of the most relevant UAVs that make use of identification technologies [88].
In Table 2, most of the analyzed systems make use of convolutional neural networks. This type of neural network and other deep learning techniques, which are commonly used for avoiding collisions and detecting obstacles, were previously discussed in detail in Section 3.

DL Challenges
Given the nature of DL methods, which extract complex hierarchical abstractions from the data used throughout the training stages, their effectiveness and generalization capability are bounded by the quantity and quality of the data used. Therefore, most of the shortcomings of DL models are due to the data used for the learning processes:

• A limiting factor of supervised approaches is the large amount of data required to generate robust models with generalization capability in different real-world environments. In order to minimize the annotation effort, self-supervised algorithms automate the collection and annotation processes to generate large-scale datasets, although their results are bounded by the strategy used for generating the labeled data.

• The diversity of the datasets in terms of representative scenarios and conditions, the variety of sensors, and the balance between the different classes are also conditioning factors for the learning processes.

• The trial-and-error nature of RL raises safety concerns, since it leads to crashes when operating in real-world environments [48]. An alternative is the use of synthetic data or virtual environments for generating the training datasets.

• The gap between real and virtual environments limits the applicability of the simulation policies in the physical world. The development of more realistic virtual datasets is still an open issue.

• The accuracy of the spatial and temporal alignment between different sensors in a multimodal robotic system impacts data quality.
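The temporal alignment issue mentioned in the last item can be illustrated with a nearest-timestamp matching sketch. It assumes that each sensor provides sorted timestamps in seconds on a common clock and that a match is only valid within a tolerance; the function names and the 20 ms default are hypothetical choices for illustration.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def nearest_index(sorted_ts: Sequence[float], t: float) -> int:
    """Index of the timestamp closest to t (ties go to the earlier sample)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[i - 1], sorted_ts[i]
    return i - 1 if (t - before) <= (after - t) else i


def align(frame_ts: Sequence[float], imu_ts: Sequence[float],
          max_gap_s: float = 0.02) -> List[Optional[int]]:
    """For each frame, return the index of the matching IMU sample or None."""
    if not imu_ts:
        return [None] * len(frame_ts)
    matches: List[Optional[int]] = []
    for t in frame_ts:
        j = nearest_index(imu_ts, t)
        # Reject matches that are too far apart in time to be meaningful.
        matches.append(j if abs(imu_ts[j] - t) <= max_gap_s else None)
    return matches
```

Frames without a sufficiently close IMU sample are marked as unmatched instead of being paired with stale data, which is one simple way to keep misaligned samples out of a training dataset.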

Other Challenges
As can be observed after reviewing the different aspects of the DL-UAV hardware and the proposed communications architecture, several additional shortcomings can be highlighted:

• Most current DL-UAV architectures rely on a remote cloud. This solution does not fulfill the requirements of IoT applications in terms of cost, coverage, availability, latency, power consumption, and scalability. Furthermore, a cloud may be compromised by cyberattacks. To confront this challenge, the fog/edge computing and blockchain architecture analyzed in Section 4.2 can help to comply with the strict requirements of IoT applications.

• Flight time is directly related to the drone weight and its maximum payload. Moreover, a trade-off among cost, payload capacity, and reliability should be achieved when choosing between single-rotor and multirotor UAVs (as of writing, quadrotors are the preferred solution for DL-UAVs).

• Considering the computational complexity of most DL techniques, the controller hardware must be much more powerful than traditional IoT nodes and UAV controllers, so such hardware needs to be enhanced to provide a good trade-off between performance and power consumption. In fact, some of the analyzed DL-UAVs have already been enhanced, and some of them are even able to run certain versions of Linux and embed dedicated GPUs. The improvements to be performed can affect both the hardware that controls the flight and the hardware that interacts with the other UAV subsystems.

• Although some flight controllers embed specific hardware, most of the analyzed DL-UAVs share similar sensors, mainly differing in the photo and video camera features. Future DL-UAV developers should consider complementing visual identification techniques with other identification technologies (e.g., UWB and RFID) in order to improve their accuracy.

• Regarding the ways of powering a UAV, most DL-UAVs are limited by the use of heavy Li-Po or Li-ion batteries, but future developers should analyze the use of energy harvesting mechanisms to extend UAV flight autonomy.

• Robustness against cyberattacks is a key challenge. In order to ensure secure operation, challenges such as interference management, mobility management or cooperative data transmission have to be taken into account. For instance, the authors of [154] provide a summary of the main wireless and security challenges, and introduce different AI-based solutions for addressing them. DL can be leveraged to classify various security events and alerts. For example, ML and DL methods for network analysis of intrusion detection are reviewed in [155]. Nonetheless, there are few examples in the literature that deal with cybersecure DL-UAV systems. One such work is [156], where an RNN-based abnormal behavior detection scheme for UAVs is presented. Another relevant work is [157], which details a DL framework for reconstructing the missing data in remote sensing analysis. Other authors focused on studying drone detection and identification methods using DL techniques [158]. In that paper, the proposed algorithms exploit the unique acoustic fingerprints of UAVs to detect and identify them. In addition, the paper compares the performance of different neural networks (e.g., CNN, RNN, and Convolutional Recurrent Neural Network (CRNN)) tested with different audio samples. In the case of DL-UAV systems, the integrity of the classification is of paramount importance (e.g., to avoid adversarial samples [159], i.e., inputs that deliberately result in an incorrect output classification). Adversarial samples are created at test time and do not require any alteration of the training process. The defense against the so-called adversarial machine learning [160] focuses on hardening the training phase of the DL algorithm. Finally, it is worth pointing out that cyber-physical attacks (e.g., spoofing attacks [161] and signal jamming) and authentication issues (e.g., forging of the identities of the transmitting UAVs or sending disrupted data using their identities) remain open research topics.
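The notion of an adversarial sample created at test time can be illustrated with the fast gradient sign method, a well-known technique that is used here only as an example (it is not claimed to be the method studied in [159]). The sketch below applies it to a hypothetical two-feature logistic-regression classifier; the weights, the input, and the perturbation size are assumptions chosen so that the effect is easy to follow.

```python
import math
from typing import List, Sequence


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Probability of the positive class for a logistic-regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def fgsm(weights: Sequence[float], bias: float, x: Sequence[float],
         label: int, epsilon: float) -> List[float]:
    """Shift x by epsilon along the sign of the loss gradient (model untouched)."""
    p = predict(weights, bias, x)
    # For logistic regression with cross-entropy loss, d(loss)/dx_i = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


# Hypothetical model and input, for illustration only.
WEIGHTS, BIAS = [2.0, -1.0], 0.0
clean = [0.3, 0.2]
adversarial = fgsm(WEIGHTS, BIAS, clean, label=1, epsilon=0.3)
```

In this toy case, the clean input is classified as positive while the perturbed input falls on the other side of the decision boundary, even though the model parameters were never modified.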

Conclusions
This article reviewed the latest advances on the development of autonomous IoT UAV systems controlled by DL techniques. After analyzing the problems that arise when UAVs need to detect and avoid obstacles, the paper presented a thorough survey on the state-of-the-art of DL techniques for autonomous obstacle detection and collision avoidance. In addition, the most relevant datasets and DL-UAV applications were detailed together with the required hardware. In relation to such hardware, the most common and the newest DL-UAV communications architectures were described, with special emphasis on the different subsystems of the architecture and the few academic developments that have proposed DL-UAV systems. Finally, the most relevant open challenges for current DL-UAVs were enumerated, thus defining a clear roadmap for the future DL-UAV developers that will create the next generation of affordable autonomous UAV IoT solutions.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript.