A Comparison of Deep Neural Networks for Monocular Depth Map Estimation in Natural Environments Flying at Low Altitude

Currently, the use of Unmanned Aerial Vehicles (UAVs) in natural and complex environments has been increasing, because they are appropriate and affordable solutions to support different tasks such as rescue, forestry, and agriculture by collecting and analyzing high-resolution monocular images. Autonomous navigation at low altitudes is an important area of research, as it would allow monitoring parts of the crop that are occluded by their foliage or by other plants. This task is difficult due to the large number of obstacles that might be encountered in the drone’s path. The generation of high-quality depth maps is an alternative for providing real-time obstacle detection and collision avoidance for autonomous UAVs. In this paper, we present a comparative analysis of four supervised learning deep neural networks and a combination of two for monocular depth map estimation considering images captured at low altitudes in simulated natural environments. Our results show that the Boosting Monocular network is the best performing in terms of depth map accuracy because of its capability to process the same image at different scales to avoid loss of fine details.


Introduction
In recent years, the importance of Unmanned Aerial Vehicles (UAVs) has been increasing in sectors such as search and rescue [1], precision agriculture [2], and forestry [3], as they can capture and process different types of data in real time to monitor the environment in which they are moving.
Deep Learning (DL) has proven to be an excellent alternative to achieve autonomous navigation of UAVs, solving a variety of tasks in the areas of sensing, planning, mapping, and control.
One of the main problems that has not yet been solved is giving UAVs the capacity to navigate autonomously at low altitudes in confined and cluttered spaces using cameras as the main sensor.
UAVs with only monocular cameras must be able to detect obstacles to prevent collisions; therefore, it is necessary to create depth maps from RGB images in order to determine the free space to navigate safely. This problem is further complicated when the drone is navigating in a natural environment, due to very thin obstacles such as tree branches or soft obstacles such as large leaves or bushes. Such obstacles can destabilize or damage the UAVs.
Vision-based approaches have shown promise for autonomous navigation [4]. First, visual sensors can provide a wealth of information about the environment; second, cameras are very suitable for the perception of dynamic environments; third, some cameras are cheaper than other types of sensors [5].
The concept of depth estimation refers to the process of recovering the 3D information of the scene using 2D information captured by cameras [6]. A variety of commercial 3D sensors are available to obtain depth information; for example, binocular cameras or LiDAR sensors can obtain accurate depth information, but their memory requirements and processing power are challenges in many onboard applications. In addition, the price of these sensors is very high.
A potential solution to the problems of 3D sensors could be the use of monocular cameras to create depth maps. Single-view depth estimation has recently shown considerable advances in accuracy and speed, and the number of approaches published in the literature keeps growing.
In the specific case of autonomous navigation for drones in outdoor forested environments, Loquercio et al. [6] demonstrated that a neural network for autonomous flight can be trained in simulated environments and perform well in similar real scenarios. However, the creation of such simulated environments can be very time-consuming and there is currently no large set of images with a variety of plants in different settings, such as what may exist in agriculture or forestry.
For this reason, in this work, we evaluated the performance of four models for monocular depth estimation pre-trained on a mixture of diverse datasets with the aim of inferring the depth of RGB images of natural and unstructured scenarios. These models were chosen because they all follow a supervised learning framework. We analyzed whether existing databases and network models for depth estimation can generalize to complex natural environments, which are particularly challenging due to the presence of very thin obstacles.
The main contributions of this paper are:
• A qualitative and quantitative analysis of the performance of four state-of-the-art neural networks for monocular depth estimation on synthetic images of complex forested environments.
The rest of this paper is structured as follows. Section 2 introduces the related work. Section 3 explains the network architectures. Section 4 analyzes the comparison of the monocular depth estimation networks. Section 5 enumerates currently open challenges, and in Section 6, we present the conclusions.

Ranftl et al. [13] demonstrated that a network trained with different datasets can generate a better estimation of depth; therefore, in their experiments, they started with the experimental configuration of [42] and used a multi-scale architecture based on ResNet for the prediction of depth from a single image. During the experimentation, the influence of the encoder on the architecture was evaluated, so they exchanged different encoders such as ResNet-101, ResNeXt-101, and DenseNet-161. They showed that network performance increased when using higher-capacity encoders. Thus, in subsequent tests, they used ResNeXt-101-WSL, which is a ResNeXt-101 version pre-trained with a massive corpus of weakly supervised data (WSL). After the evaluation of the encoder, their new model MiDaS was trained with six different datasets (DIW, ETH3D, Sintel, KITTI, NYU, and TUM). The model performed best on the ETH3D dataset according to absolute relative error (Abs Rel).
The MiDaS network has been continuously upgraded up to the MiDaS v3.0 DPT version, which uses the Dense Prediction Transformer (DPT) model from the work of [43]. This model relies entirely on attention, computed from keys, queries, and values, to relate each unit in the sequence to every other unit.
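To make the attention mechanism mentioned above more concrete, the following is a minimal sketch of single-head scaled dot-product attention in plain Python. It is not the DPT implementation: the function names, the toy 2-dimensional tokens, and the use of the same vectors as queries, keys, and values are assumptions made purely for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for a single head.

    queries, keys: lists of d-dimensional vectors, one per token.
    values: list of vectors, one per token (same length as keys).
    Returns (outputs, weights); weights[i][j] is how strongly
    token i attends to token j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(wj * v[c] for wj, v in zip(w, values))
               for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Three toy tokens (hypothetical image-patch embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1, so every output token is a convex combination of all value vectors; this is what lets a transformer relate every patch of the image to every other patch in a single layer.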
The architecture proposed in [14] consists of four main components: (a) a depth encoder based on ResNet18 without its last two blocks, which produces multi-scale features at 1/2, 1/4, and 1/8 of the input resolution; (b) a shared-parameter depth decoder that iteratively updates a zero-initialized inverse depth, preventing spatial inaccuracy at the coarse level from propagating to the fine part; (c) a parameter-learned upsampling module that adaptively upsamples the estimated inverse depth, preserving its motion boundaries; (d) a Multi-Scale Feature Modulation (MSFM) component that modulates the content in multi-scale feature maps, maintaining semantically richer and spatially more accurate representations for each iterative update. This architecture reduces the number of parameters to 3.8M, making it more suitable for memory-limited scenarios and giving it the ability to process 640 × 192 videos at 44 frames per second on an RTX 2060 GPU.
In [15], an algorithm called double estimation was proposed, where two depth estimations from the same image at different resolutions are merged, obtaining a consistent structure with high-frequency details. Their experiments showed that at low resolutions, depth estimates exhibit a consistent structure of the scene, but high-frequency details are lost. On the other hand, at high resolutions, fine details are well preserved but the scene structure starts to show inconsistencies. To combine the features of both images, they used Pix2Pix4DepthModel, which has a Pix2Pix architecture [17] with U-net layers [18] as a generator and a "PatchGAN" convolutional classifier (which only penalizes structure at the scale of image patches) as a discriminator. Both the generator and the discriminator use modules of the form Convolution-BatchNorm-ReLU. The entire network process can be described in three steps. First, the network generates a base estimation using the double estimation for the whole image. Then, patch selection starts by tiling the image at the base resolution with a tile size equal to the receptive field size and a 1/3 overlap; for each patch, a depth estimate is generated using the double estimation algorithm again. Finally, the generated patch estimates are merged onto the base estimate one by one to generate a more detailed depth map.
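The tiling step described above can be sketched as follows. This is only an illustration of tiling with a 1/3 overlap, not the patch selection code of [15]: the function names, the 640 × 480 image size, and the tile size of 192 pixels are hypothetical (in [15] the tile size is tied to the receptive field of the depth network), and the handling of the image border is an assumption.

```python
def tile_origins(length, tile, overlap=1 / 3):
    """Start offsets of tiles of size `tile` that cover [0, length).

    Consecutive tiles overlap by the fraction `overlap` of the tile size.
    A final tile flush with the border is added if needed (an assumption
    about border handling, made here for illustration).
    """
    if tile >= length:
        return [0]
    stride = max(1, round(tile * (1 - overlap)))
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts


def tile_image(width, height, tile):
    """Return (left, top, right, bottom) boxes of square tiles with 1/3 overlap."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile)
            for x in tile_origins(width, tile)]


# Hypothetical image size and tile size, chosen only for the example.
boxes = tile_image(640, 480, 192)
```

With a tile of 192 pixels, the stride is 128 pixels, so neighboring tiles share 64 pixels. Each box would then be passed through the double estimation step and merged back onto the base estimate.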
In [20], a new architecture that is mainly made up of an encoder, a decoder, and skip connections with feature fusion modules was suggested. The encoder aims to take advantage of rich global information to model long-range dependencies and capture multi-scale context features with a hierarchical transformer [44], where the transformer allows the network to expand the size of the receptive field. The input image is embedded as a sequence of patches with a 3 × 3 convolution operation; then, these patches are used in the transformer block, which is made up of several sets of self-attention and an MLP-Conv-MLP layer with a residual skip connection. In the lightweight decoder, the channel dimension of the features is reduced to N_c with a 1 × 1 convolution; then, consecutive bilinear upsampling is used to expand the features to size H × W × N_c. Finally, the output goes through two convolution layers and a sigmoid function to predict the depth map of size H × W × 1, which is multiplied by the maximum depth value to scale it to meters. To further exploit local structures in fine detail, a skip connection (which provides smaller receptive fields that help focus on short-range information) was added with a proposed fusion module.
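The last decoder step described above, a sigmoid followed by multiplication with the maximum depth, can be illustrated with a small sketch. The function names, the toy 2 × 2 input, and the maximum depth of 80 m are assumptions for the example and are not taken from [20].

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def to_metric_depth(logits, max_depth):
    """Map raw H x W network outputs to depths in (0, max_depth) meters."""
    return [[sigmoid(v) * max_depth for v in row] for row in logits]


# Hypothetical raw outputs of the last convolution layer (H = W = 2).
raw = [[0.0, 2.0], [-2.0, 10.0]]
depth = to_metric_depth(raw, max_depth=80.0)  # 80 m is an assumed value
```

Because the sigmoid is bounded between 0 and 1, the prediction can never exceed the chosen maximum depth, which is why this value has to match the range of the dataset the model is applied to.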

Comparison of the Monocular Depth Estimation Networks
The experimentation was performed using the TartanAir public dataset [40], which consists of different trajectories of simulated natural environments captured by a drone. We decided to use the trajectories that are composed of elements of our interest, such as various types of plants (trees, bushes, and grass) and farm tools (fences, lights, poles, and walls).

TartanAir Dataset
The three selected trajectories were: Gascola, Season Forest, and Season Forest Winter. Each of the trajectories is composed of the depth, RGB, and segmentation images obtained by a stereo camera. Table 2 shows the details of each dataset by trajectory. The main visual characteristics of the three selected environments are described below:
Gascola: A wooded environment with several rocky areas, mossy regions, and areas with different species of pines. Images were captured in daylight in the morning (see row 1 of Table 3).
Season Forest: A wooded environment in the autumn season. Row 2 of Table 3 shows different species of trees with shades of autumn in the trunks and leaves, with effects of falling leaves. The lighting was at noon, so there are shadows under the trees.
Season Forest Winter: A forest area in the winter season; it contains different species of trees showing only the trunks and branches in a greyish shade and covered with snow. Row 3 of Table 3 shows that the capture was made at dusk, so there are many shadows in the area around the trees.

Table 3. Examples of RGB images from the TartanAir dataset: Gascola (a), Season Forest (b), and Season Forest Winter (c) trajectories.

Metrics for Monocular Depth Estimation
In order to evaluate the performance of the networks, we used the following state-of-the-art metrics: Absolute Relative Difference (AbsRel), Root-Mean-Square Error (RMSE), RMSE (log), Square Relative Error (SqRel), and Delta Thresholds (δ_i), that is, the percentage of pixels with relative error under a threshold controlled by the constant i [11]. These metrics are defined as follows, respectively:

$$\mathrm{AbsRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{|d_i - d_i^*|}{d_i} \tag{1}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_i - d_i^*\right)^2} \tag{2}$$

$$\mathrm{RMSE\,(log)} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log d_i - \log d_i^*\right)^2} \tag{3}$$

$$\mathrm{SqRel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left(d_i - d_i^*\right)^2}{d_i} \tag{4}$$

$$\delta_i:\ \text{percentage of pixels such that}\ \max\left(\frac{d}{d^*}, \frac{d^*}{d}\right) < 1.25^{\,i} \tag{5}$$

where d_i and d*_i are the ground truth and predicted depth at pixel i, respectively, and N is the total number of pixels. In (5), the accuracy is calculated under a predefined threshold [30,31]: a pixel is counted as a positive or negative sample based on how close its ground truth depth d is to the depth d* in the predicted image, that is, on whether their ratio is close to 1.

Even though these statistics are good indicators of the general quality of the predicted depth map, they can be misleading. In addition, it is highly relevant that depth discontinuities are precisely located. Therefore, in [16], Xian et al. proposed the ORD metric (Ordinal Relation Error in the depth space) for zero-shot cross-dataset evaluation. This ordinal error is a general metric for evaluating the ordinal accuracy of a depth map and can be used directly with different sources of ground truth. The ordinal error can be defined as:

$$\mathrm{ORD} = \frac{\sum_i \omega_i\, \mathbb{1}\left(\ell_i \neq \ell^*_{i,\tau}(p)\right)}{\sum_i \omega_i} \tag{6}$$

where ω_i is set to 1, and the ordinal relationships ℓ_i and ℓ*_{i,τ}(p) are computed using Equation (7).
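$$\ell = \begin{cases} +1, & p_0^* / p_1^* \geq 1 + \tau \\ -1, & p_0^* / p_1^* \leq \frac{1}{1 + \tau} \\ 0, & \text{otherwise} \end{cases} \tag{7}$$

where τ is a tolerance threshold, and p*_0 and p*_1 denote the ground truth pseudo-depth values of the pair of points. When the pair of points is close in the depth space, i.e., ℓ = 0, the loss encourages the predicted p_0 and p_1 to be the same; otherwise, the difference between p_0 and p_1 must be large to minimize the loss.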
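A minimal sketch of how such an ordinal error could be computed for a set of point pairs is given below. It assumes the ratio-based labelling rule commonly used with this metric (+1 if the ratio of the two depths is at least 1 + τ, -1 if it is at most 1/(1 + τ), 0 otherwise) and uniform weights ω_i = 1. The function names, the value τ = 0.03, and the toy depth values are hypothetical.

```python
def ordinal_label(p0, p1, tau):
    """Ordinal relation of two positive depth values under tolerance tau."""
    ratio = p0 / p1
    if ratio >= 1.0 + tau:
        return 1    # first point is farther
    if ratio <= 1.0 / (1.0 + tau):
        return -1   # first point is closer
    return 0        # the two points are considered to be at the same depth


def ordinal_error(pred, gt, pairs, tau=0.03):
    """Fraction of point pairs whose predicted ordinal relation
    disagrees with the ground truth relation (all weights set to 1)."""
    disagreements = sum(
        1 for a, b in pairs
        if ordinal_label(pred[a], pred[b], tau) != ordinal_label(gt[a], gt[b], tau)
    )
    return disagreements / len(pairs)


# Toy flattened depth maps and pixel-index pairs (hypothetical values).
gt = [1.0, 2.0, 2.01, 4.0]
pred = [1.1, 1.9, 2.5, 3.0]
pairs = [(0, 1), (1, 2), (2, 3), (0, 3)]
ord_err = ordinal_error(pred, gt, pairs)
```

In this toy example only the pair (1, 2) is ordered differently by the prediction (the ground truth treats the two points as equal, the prediction does not), so one of the four pairs disagrees.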
Moreover, Miangoleh et al. [15] proposed a variation of the ordinal relation error, which they called Depth Discontinuity Disagreement Ratio (D³R), to measure the quality of high-frequency depth estimates. Instead of using random points for ordinal comparison as in [16], they used the centers of the superpixels calculated from the ground truth depth maps, as well as the neighboring centroids, to compare the depth discontinuities. Therefore, this metric focuses on boundary accuracy to capture performance around high-frequency details.
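The pixel-wise metrics named in this section (AbsRel, RMSE, RMSE (log), SqRel, and the delta thresholds) can be sketched in a few lines. The definitions used here are the ones common in the monocular depth estimation literature, with a delta threshold of 1.25^i assumed; the function name, the dictionary layout, and the toy depth values are hypothetical, and all depths are assumed to be strictly positive.

```python
import math


def depth_metrics(gt, pred, powers=(1, 2, 3), base=1.25):
    """Standard depth metrics for flattened ground truth and predicted depths.

    gt, pred: equal-length sequences of strictly positive depths.
    Returns a dict with AbsRel, SqRel, RMSE, RMSE (log), and the fraction
    of pixels whose ratio max(gt/pred, pred/gt) is below base**i.
    """
    n = len(gt)
    abs_rel = sum(abs(g - p) / g for g, p in zip(gt, pred)) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in zip(gt, pred)) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in zip(gt, pred)) / n)
    rmse_log = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for g, p in zip(gt, pred)) / n
    )
    ratios = [max(g / p, p / g) for g, p in zip(gt, pred)]
    deltas = {i: sum(r < base ** i for r in ratios) / n for i in powers}
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, "delta": deltas}


# Toy example with four pixels (hypothetical depths in meters).
gt = [2.0, 4.0, 10.0, 8.0]
pred = [2.0, 5.0, 5.0, 8.0]
metrics = depth_metrics(gt, pred)
```

Lower values are better for the four error metrics, while higher values are better for the delta accuracies. Note that the single large error at the third pixel dominates SqRel and RMSE but affects the delta accuracies only as one failed pixel, which is one reason the metrics are usually reported together.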

Qualitative Analysis of Networks
We used some of the RGB images from the three trajectories to estimate their respective depth maps using the pretrained networks: R-MSFM, MiDaS, GLPDepth, and Boosting Monocular Depth (MiDaS). To perform the qualitative analysis, the inferred depth maps were compared with the ground truth of each dataset. Table 4 illustrates the depth maps that were obtained from the networks using nine test images representative of the different natural obstacles in the environments. The R-MSFM network can define the obstacles that are close and the free space that the scene has in the depth maps. However, in areas of the image where there are several branches with many leaves at different depths, the network considers them as a single object, generating very large areas marked as obstacles.
We used the MiDaS network in its hybrid version with the DPT model because it greatly improves the detection of fine details compared with the R-MSFM network. Tree trunks are well defined, but branches with many leaves and distant objects are not always estimated well.
[...] navigating between trees. Given the limited color range of the depth maps produced by this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of the Gascola and Season Forest Winter trajectories.
Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning.

Table 4. Depth maps of forested environments generated from state-of-the-art monocular depth networks. Columns: RGB, Ground Truth, R-MSFM, MiDaS, GLPDepth, Boosting Monocular Depth.
This is revealing as it could allow for trajectory planning. navigating between trees. According to the limited color range of the depth maps with this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. navigating between trees. According to the limited color range of the depth maps with this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. navigating between trees. According to the limited color range of the depth maps with this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. navigating between trees. 
According to the limited color range of the depth maps with this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. Table 4. Depth maps of forested environments generated from state-of-the-art monocular depth networks.

RGB Ground Truth R-MSFM MiDaS GLPDepth Boosting Monocular Depth
this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. Table 4. Depth maps of forested environments generated from state-of-the-art monocular depth networks.

RGB Ground Truth R-MSFM MiDaS GLPDepth Boosting Monocular Depth
this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. Table 4. Depth maps of forested environments generated from state-of-the-art monocular depth networks.

RGB Ground Truth R-MSFM MiDaS GLPDepth Boosting Monocular Depth
this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. Table 4. Depth maps of forested environments generated from state-of-the-art monocular depth networks.

RGB Ground Truth R-MSFM MiDaS GLPDepth Boosting Monocular Depth
this network, we can infer that GLPDepth has some issues determining the depth of the different objects in some of the images of Gascola and Season Forest Winter trajectories. Finally, with the Boosting Monocular Depth network using the first version of MiDaS, depth maps are obtained with all the obstacles detected; in some scenarios where the branches are in the foreground, they are well defined, reflecting their thickness. This is revealing as it could allow for trajectory planning. Table 4. Depth maps of forested environments generated from state-of-the-art monocular depth networks.

RGB Ground Truth R-MSFM MiDaS GLPDepth Boosting Monocular Depth
In the case of depth maps obtained with the GLPDepth network, the branches and leaves appear thicker than they really are, which could provide a safety margin when navigating between trees. Judging by the limited color range of its depth maps, we infer that GLPDepth has difficulty determining the depth of the different objects in some of the images of the Gascola and Season Forest Winter trajectories.
Finally, with the Boosting Monocular Depth network using the first version of MiDaS, all the obstacles are detected in the depth maps; in scenarios where branches are in the foreground, they are well defined and their thickness is faithfully reflected. This is promising, as it could support trajectory planning.
In summary, the above experimental results show that GLPDepth and Boosting Monocular Depth (MiDaS) are the best options for detecting obstacles formed by trunks and branches with dense or sparse foliage, regardless of how close or far they are.
Recently, it has been shown that replacing the MiDaS network with LeReS in the Boosting architecture improves the performance of depth map generation [45]. Hence, we also compared GLPDepth with Boosting Monocular Depth using LeReS. Table 5 displays the depth maps obtained by the two networks. As in the previous test, GLPDepth demonstrates its ability to detect fine details, but these are not well delimited, so branches and leaves appear thicker in the depth map than they really are. Meanwhile, Boosting Monocular Depth (LeReS) shows a great improvement in detecting fine details compared to GLPDepth. The thickness of branches, trunks, and leaves corresponds to what is seen in the RGB images. In addition, this combination differentiates better between the various depths of the objects.
From the above, we observe that the dataset used to train each neural network significantly influences the results. The GLPDepth network was trained on the NYU Depth V2 dataset, which contains images of indoor scenes, and is therefore severely affected by the light and shadows of outdoor environments. Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using Middlebury2014 (23 high-resolution image pairs of interior scenes) and iBims-1 (high-quality RGB-D images of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using various RGB-D image datasets, providing depth maps with better-defined fine details that are not affected by shadows or lighting.

Quantitative Analysis of Networks
To measure the performance and inference times of the neural networks, we ran all tests on a computer with a 2.70 GHz Intel Core i5-11400H processor, 16 GB of RAM, an NVIDIA GeForce RTX3050 graphics card with 8 GB of memory, and the Windows 10 operating system.
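As an illustration of how per-frame inference time can be measured, the following is a minimal sketch and not the measurement code used in this work. The function name `time_inference`, the `predict` callable, and the `warmup` parameter are hypothetical; `predict` stands in for the forward pass of any of the evaluated networks.

```python
import statistics
import time


def time_inference(predict, frames, warmup=2):
    """Return (mean, standard deviation, per-frame list) of inference times in seconds.

    `predict` is any callable mapping one frame to a depth map (a hypothetical
    stand-in for a network's forward pass); `frames` is a sequence of inputs.
    """
    # Warm-up calls are not timed, so one-time setup costs do not skew the result.
    for frame in frames[:warmup]:
        predict(frame)

    durations = []
    for frame in frames:
        start = time.perf_counter()
        predict(frame)
        durations.append(time.perf_counter() - start)

    return statistics.fmean(durations), statistics.pstdev(durations), durations
```

When the network runs on a GPU, execution is usually asynchronous, so the framework's synchronization call would have to be issued inside `predict` before it returns for the wall-clock times to be meaningful.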
The results obtained for the three trajectories are shown in Table 6. The GLPDepth network gives the best results in the accuracy metrics, and the Boosting Monocular Depth network shows the best results in the other metrics. Error metrics are better the closer they are to 0, and accuracy metrics are better the closer they are to 1. In the Season Forest and Season Forest Winter trajectories, the Boosting Monocular Depth network achieves the best results in both the accuracy metrics and the error metrics.
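To make the distinction between error metrics (better near 0) and accuracy metrics (better near 1) concrete, the sketch below computes three measures that are widely used in monocular depth evaluation: absolute relative error, root mean squared error, and threshold accuracy. It is an assumption that these match the metrics reported in Table 6; the function name `depth_metrics` and the default thresholds of 1.25, 1.25², and 1.25³ are illustrative.

```python
import math


def depth_metrics(pred, gt, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Compute common error and threshold-accuracy metrics for depth maps.

    `pred` and `gt` are flat sequences of depths for the same pixels.
    Pixels with a non-positive value in either map are ignored.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)

    # Error metrics: 0 means a perfect prediction.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)

    # Accuracy metrics: fraction of pixels whose ratio max(p/g, g/p)
    # falls below each threshold; 1 means every pixel is within tolerance.
    ratios = [max(p / g, g / p) for p, g in pairs]
    accuracy = [sum(r < t for r in ratios) / n for t in thresholds]

    return {"abs_rel": abs_rel, "rmse": rmse, "delta": accuracy}
```

Networks that predict relative rather than metric depth generally need their output aligned to the ground truth scale before such metrics are computed; that step is omitted here.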
The observations made in the qualitative analysis are confirmed: the GLPDepth network is good at detecting obstacles, as shown by its accuracy threshold metrics, which are comparable to those obtained with Boosting Monocular Depth (LeReS). However, GLPDepth does not provide a good estimation of depth, as indicated by the considerably higher values of the ORD and D3R metrics in the three trajectories analyzed.
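The following sketch shows why a network can score well on threshold accuracy and still order depths poorly. It assumes that ORD denotes the ordinal error as commonly defined over sampled pixel pairs; the unweighted uniform sampling, the tolerance `tau=0.03`, and the function names are simplifications chosen for illustration, not the exact evaluation protocol of this work.

```python
import random


def ordinal_label(a, b, tau=0.03):
    """Return +1 if a is clearly larger than b, -1 if clearly smaller, else 0.

    Both values must be positive; `tau` is an assumed tolerance.
    """
    ratio = a / b
    if ratio >= 1 + tau:
        return 1
    if ratio <= 1 / (1 + tau):
        return -1
    return 0


def ordinal_error(pred, gt, num_pairs=1000, tau=0.03, seed=0):
    """Fraction of random pixel pairs whose predicted ordering disagrees with gt."""
    if len(pred) != len(gt) or len(gt) < 2:
        raise ValueError("pred and gt must have the same length of at least 2")
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(num_pairs):
        # Draw two distinct pixel indices and compare their depth ordering.
        i, j = rng.sample(range(len(gt)), 2)
        if ordinal_label(pred[i], pred[j], tau) != ordinal_label(gt[i], gt[j], tau):
            disagreements += 1
    return disagreements / num_pairs
```

A value of 0 means that every sampled pair is ordered as in the ground truth, while larger values indicate that the relative depth of objects is confused even if individual pixel values stay within the accuracy thresholds.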

RGB Ground Truth GLPDepth Boosting Monocular Depth (LeReS)
Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using the Middlebury2014 (23 pair of high-resolution image pairs of interior scenes) and Ibims-1 (high-quality RGB-D image of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using variou RGBD image datasets, providing depth maps with better definitions of fine details with out being affected by shadows or lighting.

RGB Ground Truth GLPDepth Boosting Monocular Depth (LeReS)
Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using the Middlebury2014 (23 pairs of high-resolution image pairs of interior scenes) and Ibims-1 (high-quality RGB-D images of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using various RGBD image datasets, providing depth maps with better definitions of fine details without being affected by shadows or lighting.

RGB Ground Truth GLPDepth Boosting Monocular Depth (LeReS)
Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using the Middlebury2014 (23 pairs of high-resolution image pairs of interior scenes) and Ibims-1 (high-quality RGB-D images of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using various RGBD image datasets, providing depth maps with better definitions of fine details without being affected by shadows or lighting.

RGB Ground Truth GLPDepth Boosting Monocular Depth (LeReS)
Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using the Middlebury2014 (23 pairs of high-resolution image pairs of interior scenes) and Ibims-1 (high-quality RGB-D images of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using various RGBD image datasets, providing depth maps with better definitions of fine details without being affected by shadows or lighting. fore, severely affected by light and shadows originating from outdoor environments. Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using the Middlebury2014 (23 pairs of high-resolution image pairs of interior scenes) and Ibims-1 (high-quality RGB-D images of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using various RGBD image datasets, providing depth maps with better definitions of fine details without being affected by shadows or lighting.

RGB Ground Truth GLPDepth Boosting Monocular Depth (LeReS)
Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using the Middlebury2014 (23 pairs of high-resolution image pairs of interior scenes) and Ibims-1 (high-quality RGB-D images of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using various RGBD image datasets, providing depth maps with better definitions of fine details without being affected by shadows or lighting.

RGB Ground Truth GLPDepth Boosting Monocular Depth (LeReS)
Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using the Middlebury2014 (23 pairs of high-resolution image pairs of interior scenes) and Ibims-1 (high-quality RGB-D images of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using various RGBD image datasets, providing depth maps with better definitions of fine details without being affected by shadows or lighting.

RGB Ground Truth GLPDepth Boosting Monocular Depth (LeReS)
Boosting Monocular Depth (MiDaS) was trained to transfer the fine-grained details from the high-resolution input to the low-resolution input using the Middlebury2014 (23 pairs of high-resolution image pairs of interior scenes) and Ibims-1 (high-quality RGB-D images of indoor scenes), whereas Boosting Monocular Depth (LeReS) was trained using various RGBD image datasets, providing depth maps with better definitions of fine details without being affected by shadows or lighting.

Quantitative Analysis of Networks
To measure the performance and inference times of the neural networks, all tests were run on a computer with a 2.70 GHz Intel Core i5-11400H processor, 16 GB of RAM, an NVIDIA GeForce RTX 3050 graphics card with 8 GB of memory, and the Windows 10 operating system.
The results obtained using the three trajectories are shown in Table 6. The GLPDepth network gives the best results in the accuracy metrics, and the Boosting Monocular Depth network shows the best results in the other metrics. Error metrics are better the closer they are to 0, and accuracy metrics are better the closer they are to 1. In Season Forest [...] The datasets with which the networks were trained may affect their performance in this forested environment under study. The GLPDepth network was trained with indoor images, whereas images from a variety of environments were used to train the Boosting network.
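The distinction between error metrics (better near 0) and accuracy metrics (better near 1) can be illustrated with a minimal sketch. It assumes three metrics that are commonly reported for depth estimation: absolute relative error, RMSE, and threshold accuracy with a ratio below 1.25. The exact metric set of Table 6 is not reproduced here, and the function name `depth_metrics` and its arguments are hypothetical.

```python
import math


def depth_metrics(pred, gt, threshold=1.25):
    """Compute illustrative depth metrics over valid pixels.

    pred, gt: flat sequences of predicted and ground-truth depths
    in the same units. Pixels with non-positive depth are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)

    # Error metrics: 0 means a perfect prediction.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)

    # Accuracy metric: fraction of pixels whose ratio to the ground
    # truth stays below the threshold; 1 means all pixels pass.
    delta = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / n

    return {"abs_rel": abs_rel, "rmse": rmse, "delta": delta}


if __name__ == "__main__":
    print(depth_metrics([2.0, 2.0], [1.0, 2.0]))
```

A perfect prediction yields 0 for both error metrics and 1 for the accuracy metric, which matches the reading direction described for Table 6.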
The inference times obtained by GLPDepth and Boosting Monocular Depth (LeReS) on the three trajectories are shown in Table 7. GLPDepth is much faster than Boosting Monocular Depth (LeReS) in all cases. This large difference arises because Boosting has to extract patches from the image, estimate a depth map for each of them, and finally merge them all into a base depth estimate generated from the whole image.
To obtain short inference times while preserving estimation accuracy, we combined the Boosting Monocular Depth network with GLPDepth, yielding the model referred to as Boosting Monocular Depth (GLPDepth). We tested this new model only with the Gascola trajectory; the resulting depth maps are shown in Table 8. The new model highlights fine details in the depth maps better than GLPDepth. However, Boosting Monocular Depth (LeReS) still generates depth maps with better-defined fine details than the other networks. Boosting Monocular Depth (GLPDepth) shares GLPDepth's difficulty in correctly inferring object depths, so its depth maps also show a very narrow range of colors.
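A simplified sketch can show why a patch-based approach costs more time: it needs one forward pass for the whole image plus one per patch. This is not the Boosting Monocular Depth implementation. The regular patch grid, the mean-offset alignment, and the averaging of overlapping patches are stand-in assumptions for the method's own patch selection and merging, and the names `patch_refined_depth` and `estimate` are hypothetical.

```python
def mean2d(grid):
    """Mean of all values in a 2D list."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)


def patch_refined_depth(image, estimate, patch, stride):
    """Merge per-patch depth estimates into a whole-image estimate.

    image: 2D list of pixel values.
    estimate: callable mapping a 2D list to a depth map of equal shape.
    Returns the merged depth map and the number of estimator calls.
    """
    h, w = len(image), len(image[0])
    base = estimate(image)  # one pass over the whole image
    calls = 1
    total = [[0.0] * w for _ in range(h)]
    count = [[0] * w for _ in range(h)]

    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            pred = estimate(tile)  # one extra pass per patch
            calls += 1
            # Assumption: align the patch to the base estimate by
            # matching mean depth over the same region.
            region = [row[left:left + patch] for row in base[top:top + patch]]
            offset = mean2d(region) - mean2d(pred)
            for i in range(patch):
                for j in range(patch):
                    total[top + i][left + j] += pred[i][j] + offset
                    count[top + i][left + j] += 1

    # Average overlapping patches; uncovered pixels keep the base value.
    merged = [
        [total[i][j] / count[i][j] if count[i][j] else float(base[i][j])
         for j in range(w)]
        for i in range(h)
    ]
    return merged, calls
```

With a 4 x 4 image, a patch size of 2, and a stride of 1, this sketch already makes ten estimator calls instead of one, which illustrates how the number of patches drives inference time.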
The performance metrics obtained with Boosting Monocular Depth (GLPDepth) are shown in Table 9. This network achieves better values in the accuracy metrics, whereas the Boosting Monocular Depth (LeReS) network achieves the best results in the other metrics.
The inference times obtained using the Gascola trajectory are shown in Table 10. Boosting Monocular Depth (GLPDepth) shows a decrease in inference times compared to Boosting Monocular Depth (LeReS). Nevertheless, these times are still high compared to those generated by GLPDepth.
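Per-frame inference times such as those in Tables 7 and 10 can be measured with a simple timing harness. The following is a minimal sketch, not the measurement script used in this work; `mean_inference_time` and its parameters are hypothetical names, and `infer` stands for any callable that maps one frame to a depth map. When the model runs on a GPU, the callable should also wait for the device to finish before returning, since GPU work is typically launched asynchronously.

```python
import time


def mean_inference_time(infer, frames, warmup=1):
    """Return the mean wall-clock time per frame, in seconds.

    infer: callable applied to each frame.
    frames: non-empty sequence of input frames.
    warmup: number of initial untimed calls, so that one-off setup
            costs do not distort the average.
    """
    if not frames:
        raise ValueError("frames must not be empty")

    for frame in frames[:warmup]:
        infer(frame)  # untimed warm-up calls

    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start

    return elapsed / len(frames)


if __name__ == "__main__":
    # Dummy estimator standing in for a depth network.
    seconds = mean_inference_time(lambda frame: sum(frame), [[1, 2, 3]] * 10)
    print(f"mean time per frame: {seconds:.6f} s")
```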

Challenges in Monocular Depth Estimation to Navigate in Complex Natural Environments
Many of the depth estimation networks used for inferring depth maps in complex natural environments have been trained using general-purpose datasets, which prevents them from performing well. Thus, it is of great importance to have more freely available datasets of natural environments with images captured at low altitude.
For UAV navigation in the free space of natural environments, it is necessary to consider the following elements that make navigation difficult: narrow paths, tree trunks, different types of branches and leaves, electricity poles, fences, people, moving branches, etc.
Depth estimation from a monocular image collected by a flying drone is a complicated problem, as the images can suffer from blur, lighting variations, abrupt changes in scale and perspective, and shadows cast on the objects that make up the environment.
Therefore, it is necessary to create models that can generate depth maps with high accuracy in spite of the above-mentioned problems and, at the same time, have very short inference times, so that they can be embedded in the drone and operate with its limited hardware resources. Uncertainty estimation [21,46] is an important aspect to consider when creating depth maps for autonomous navigation to ensure safe flight for the UAV.

Conclusions
Recently, there have been breakthroughs in the development of deep networks for monocular depth estimation, with improvements in both the accuracy and the inference times of the models. However, there are still many problems to be solved in the area of low-altitude aerial navigation.
We reviewed four state-of-the-art neural networks for monocular depth estimation, with the main idea of using depth maps for obstacle detection in a UAV system operating in complex natural environments. We found that the GLPDepth and Boosting Monocular Depth networks achieved good performance in detecting fine details in natural environments, such as thin branches and small leaves. Nevertheless, Boosting Monocular Depth had high inference times, and GLPDepth did not correctly infer the thickness of objects.
We proposed the combination of the GLPDepth network with the Boosting Monocular Depth network for the creation of depth maps. With this new model, inference times were reduced but the accuracy did not improve much compared to Boosting Monocular Depth (LeReS).
We identified some challenges related to the creation of depth maps for autonomous drone navigation in heavily vegetated environments. We also noted the lack of natural environment datasets for training deep neural networks.

Acknowledgments:
We thankfully acknowledge the use of the TecNM/Centro Nacional de Investigación y Desarrollo Tecnológico (CENIDET) facility in carrying out this work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: