Sensors
  • Article
  • Open Access

13 August 2021

SFA-MDEN: Semantic-Feature-Aided Monocular Depth Estimation Network Using Dual Branches

1 Digital Photoelectric Information Processing Technology Laboratory, School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing 100191, China
2 Engineering Research Center for Intelligent Robotics, Ji Hua Laboratory, Foshan 528200, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Sensing and Imaging

Abstract

Monocular depth estimation based on unsupervised learning has attracted great attention due to the rising demand for lightweight monocular vision sensors. Inspired by multi-task learning, semantic information has been used to improve monocular depth estimation models. However, multi-task learning is still limited by multi-type annotations; as far as we know, there are scarcely any large public datasets that provide all the necessary information. Therefore, we propose a novel network architecture, the Semantic-Feature-Aided Monocular Depth Estimation Network (SFA-MDEN), which extracts multi-resolution depth features and semantic features, merges them, and feeds them into the decoder, with the goal of predicting depth with the support of semantics. Instead of using loss functions to relate semantics and depth, the fusion of semantic and depth feature maps is employed to predict the monocular depth. Therefore, two accessible datasets with similar topics, one for depth estimation and one for semantic segmentation, are sufficient to train SFA-MDEN. We explored the performance of the proposed SFA-MDEN with experiments on different datasets, including KITTI, Make3D, and our own dataset BHDE-v1. The experimental results demonstrate that SFA-MDEN achieves competitive accuracy and generalization capacity compared to state-of-the-art methods.

1. Introduction

Depth estimation plays a fundamental role in numerous application scenarios, such as image reconstruction [1], object detection [2], semantic segmentation [3], pose estimation [4], and medical image processing [5]. Various sensors are available to obtain depth information based on either active or passive measurement methods. Time-of-Flight (TOF) cameras [6] are active sensors that compute depth from the time an infrared ray takes to travel from the emitter to the scene and back to the camera. Structured light cameras can predict depth accurately even in weak light conditions, but they require complex calibration and are expensive. Stereo vision systems [7], using the passive measurement method, achieve high-quality depth estimation from richly textured images via binocular or even multi-view camera configurations. However, the system structure parameters, such as focal length and baseline, greatly affect the calibration precision of the cameras, which limits the flexibility of stereo vision systems. Monocular depth estimation refers to recovering scene depth from a single two-dimensional image captured by the camera [8], which benefits the development of lightweight sensors. Nevertheless, a single two-dimensional image can correspond to multiple three-dimensional scenes, which makes monocular depth estimation an inherently ill-posed problem. In recent years, deep learning, which constrains the estimation by extracting additional features, has become the major approach to monocular depth estimation.
In this paper, a novel network architecture with two branches is proposed to couple semantic segmentation features into a monocular depth estimation network, aiming to improve the robustness and precision of depth estimation models. Generally, for unsupervised models, a left-view image is fed into the network to generate the disparity map, which is also known as the inverse depth and can warp the left view into a predicted right-view image. The self-supervised loss signal is then computed from the difference between the predicted and the ground truth right-view images. The estimated depth map can be recovered as depth = baseline × focal_length / disparity. In addition, during training, unsupervised methods learn to simulate a binocular system with the intrinsic parameters of the training dataset. Thus, unsupervised networks perform unsatisfactorily on other datasets.
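To make the geometry concrete, the following minimal sketch (not the authors' code) converts a predicted disparity map into metric depth via depth = baseline × focal_length / disparity; the baseline and focal length values are illustrative assumptions in the spirit of a KITTI-like stereo rig, and the disparity is assumed to be expressed in pixels.

```python
# Minimal sketch: converting a predicted disparity map to metric depth.
# The baseline/focal values below are illustrative assumptions, not taken from the paper.
import numpy as np

def disparity_to_depth(disparity, baseline_m=0.54, focal_px=721.0, eps=1e-6):
    """disparity: HxW array of predicted disparities in pixels."""
    return baseline_m * focal_px / np.maximum(disparity, eps)  # depth = b * f / disparity

disp = np.random.uniform(1.0, 100.0, size=(256, 512)).astype(np.float32)
depth = disparity_to_depth(disp)  # HxW metric depth map in meters
```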
Semantics and depth are different comprehension patterns of an image scene, but they are closely related. Inspired by the work of Stathopoulou et al. [3], who constructed geometric constraints based on semantics to optimize reconstruction from multi-view images, we believe that multi-task learning that effectively merges semantic segmentation and depth estimation can improve the robustness and generalization capacity of networks. Traditional multi-task learning for predicting depth and semantic segments simultaneously employs a multi-term loss, which requires not only semantic annotations but also depth self-labels, i.e., stereo images or image sequences from the training datasets. Unfortunately, only small datasets currently satisfy these label requirements; no large public dataset does. In view of this, we aim to predict depth robustly and accurately even without corresponding semantic segmentation annotations. By combining the semantic information in feature maps with the depth estimation network, we no longer require joint depth and semantic annotations from the training set. In brief, separate datasets for semantics and depth are adopted to train our network.
Obviously, how to effectively deploy semantic information to improve monocular depth estimation performance is an open issue. To address this problem, we propose a novel network architecture that breaks the severe restriction of dataset annotations; the main contributions of the current study are as follows:
  • We propose a Semantic-Feature-Aided Monocular Depth Estimation Network (SFA-MDEN), which fuses multi-resolution semantic feature maps into networks to achieve better precision and robustness for the monocular depth estimation task.
  • A training strategy termed Two-Stages for Two Branches (TSTB) is designed to train the SFA-MDEN. Instead of requiring paired semantic labels and stereo images as input, one dataset for semantic segmentation and another dataset for monocular depth estimation with similar themes satisfy the requirement. Therefore, larger training datasets are available, allowing better use of the semantic information to train our network.
  • Our SFA-MDEN achieves competitive estimation not only on public datasets KITTI [9] and Make3D [10], but also on our self-built dataset BHDE-v1 (Beihang University Depth Estimation dataset—version 1), which is a sparse depth dataset created to validate the generalization of our network.
The remainder of the paper is organized as follows: Section 2 reviews existing monocular depth estimation models based on deep learning, especially the methods involving semantics; the structure and the TSTB training strategy of SFA-MDEN are presented in Section 3; experimental results and analysis are presented in Section 4; and, finally, Section 5 presents the concluding remarks.

3. Methodology

In this section, we introduce the framework and training strategy of SFA-MDEN, which predicts the depth from a single image with semantic guidance.

3.1. Framework and Training Strategy

As mentioned previously, monocular depth estimation with semantic optimization demands a semantic corpus and stereo pair sets within one large public dataset, which is not available. Fortunately, there are training datasets available for depth prediction and for semantic segmentation separately. SFA-MDEN is designed to implicitly couple semantic features into depth estimation without requiring both types of annotations on the same data (stereo image pairs can also be regarded as depth annotations in a sense).
The backbone of SFA-MDEN consists of an unsupervised monocular depth estimation network, DepthNet, and a supervised semantic segmentation network, SemanticsNet, as shown in Figure 1a,b. Both networks adopt the encoder–decoder architecture with long skip connections as the backbone. Depth differs from semantics, but there exists an implicit and tight interplay between them. When adjacent pixels are segmented into different semantic areas, occlusion, lighting changes, etc., have often occurred, which usually results in a significant fluctuation in depth. When adjacent pixels are segmented into the same semantic area, they belong to a single object, whose depth usually varies continuously. Feature maps from the SemanticsNet therefore reflect the depth fluctuation from another view. Hence, we merge the corresponding-scale feature maps of the dual branches and decode them to construct the final depth map, as shown in Figure 1c.
Figure 1. Overview of SFA-MDEN. The symbol RGB~P(depth) means that the input RGB image is required to have depth annotations for training. Similarly, RGB~P(semantics) means the requirement of semantic labels. (a) DepthNet: an unsupervised network that predicts the depth map as the main branch of SFA-MDEN; (b) SemanticsNet: a network for semantic segmentation that generates multi-resolution semantic feature maps for SFA-MDEN; (c) the overall framework of SFA-MDEN.
Aiming to break the limitation that the input image must carry multiple types of annotations, a Two-Stages for Two Branches (TSTB) training strategy is designed, which trains the branches independently and then incorporates the two branches into a single network. In our SFA-MDEN, DepthNet and SemanticsNet are the two branches. In Stage 1, DepthNet and SemanticsNet are trained separately using different datasets. For DepthNet, we use the KITTI dataset for training to obtain the initial weights for Stage 2. For SemanticsNet, the Cityscapes [51] corpus is used to enable the Semantics Encoder to extract multi-scale semantic features. In Stage 2, semantic feature maps and depth feature maps at the corresponding scale are concatenated along the channel dimension. The details of the networks are introduced in Section 3.2. The parameters of the Semantics Encoder are frozen so that it extracts semantic features at different scales without further updates. The Semantics Decoder and DepthNet take the parameters obtained in Stage 1 as their initial weights and are trained on the KITTI dataset to achieve monocular depth estimation combined with semantic features.
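The following sketch illustrates the Stage 2 setup described above; it assumes a hypothetical `sfa_mden` module with a `semantics_encoder` attribute and is an illustration of the TSTB idea rather than the released training code.

```python
# Sketch of the TSTB Stage-2 setup: freeze the Semantics Encoder (trained in Stage 1)
# and fine-tune the remaining parameters of SFA-MDEN on KITTI.
import torch

def build_stage2_optimizer(sfa_mden, lr=1e-4):
    # Freeze the Semantics Encoder so it only supplies multi-scale semantic features.
    for p in sfa_mden.semantics_encoder.parameters():
        p.requires_grad = False
    # Optimize everything else: Depth Encoder/Decoder and Semantics Decoder.
    trainable = [p for p in sfa_mden.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=lr, betas=(0.9, 0.999), eps=1e-8)
```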
Based on the TSTB training strategy, our SFA-MDEN has achieved monocular depth estimation based on the stereo sets of KITTI and semantic corpus of Cityscapes.

3.2. Network Architectures

Theoretically, the architecture of SFA-MDEN can be applied to any encoder–decoder structure. In this work, Light-Weight RefineNet [52] is adopted as the backbone, and ResNet50 without the fully connected layer is used for both encoders. The main architecture of SFA-MDEN is shown in Figure 2. The input image is fed into the Depth Encoder and the Semantics Encoder separately. The two encoders, which are composed of ResBlocks from ResNet50, extract feature maps at four resolutions, namely 1/4, 1/8, 1/16, and 1/32 of the input image resolution. The detailed structures of some modules are presented in the form of a legend, and the important modules are illustrated in Figure 3.
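As an illustration of the encoder described above, the sketch below builds a ResNet50 feature extractor (fully connected layer discarded) that returns feature maps at 1/4, 1/8, 1/16, and 1/32 of the input resolution; the torchvision layer names and the absence of pretrained weights are assumptions made for brevity.

```python
# Sketch of a ResNet50 encoder returning four multi-resolution feature maps.
import torch.nn as nn
from torchvision.models import resnet50

class ResNet50Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet50(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)  # down to 1/4
        self.layer1, self.layer2 = net.layer1, net.layer2                     # 1/4, 1/8
        self.layer3, self.layer4 = net.layer3, net.layer4                     # 1/16, 1/32

    def forward(self, x):
        x = self.stem(x)
        f4 = self.layer1(x)     # 1/4 resolution, 256 channels
        f8 = self.layer2(f4)    # 1/8 resolution, 512 channels
        f16 = self.layer3(f8)   # 1/16 resolution, 1024 channels
        f32 = self.layer4(f16)  # 1/32 resolution, 2048 channels
        return f4, f8, f16, f32
```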
Figure 2. Our proposed architecture, named SFA-MDEN, exploits depth features and semantic features at different levels and fuses them to obtain a high-resolution monocular depth estimation. The square-corner rectangles denote the component modules of DepthNet, while the rounded rectangles denote the components of SemanticsNet. Structures of some components are illustrated in the form of a legend and some are shown in Figure 3. See text and Figure 3 for more details.
Figure 3. (a) DSF unit, (b) MRFF unit, and (c) CRP unit. In each illustration, the left block is the representation of the module in Figure 2 and its detailed structures are on the right. The yellow blocks denote the feature fusion modules. The red edge denotes the fusion of different types of features with the magenta block of the adaptive convolution and the blue edge denotes the fusion of features with different resolutions. The CRP unit is represented as the orange block.
Depth–Semantics-Fusion (DSF) unit. The DSF unit, marked by the dash-dotted line in Figure 2, fuses different types of feature maps at the same resolution; its detailed structure is shown in Figure 3a. The semantic feature maps generated by the Semantics Encoder are fine-tuned through an adaptive convolution operation for the monocular depth estimation task. The adapted semantic features still differ from depth features in geometric expression, so concatenation along the channel dimension is adopted for the feature map fusion. A BatchNorm layer is introduced to adjust the distribution of the concatenated feature maps and avoid vanishing or exploding gradients. After ReLU activation, the feature group passes through a convolution, which fuses the two types of features into new ones.
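A minimal sketch of a DSF-style block following this description is given below; the kernel sizes and channel counts are assumptions, since the paper specifies the operation order (adaptive convolution, concatenation, BatchNorm, ReLU, convolution) but not every hyper-parameter.

```python
# Sketch of a DSF-style fusion block: adapt semantic features, concatenate with depth
# features along channels, then BatchNorm -> ReLU -> convolution.
import torch
import torch.nn as nn

class DSFUnit(nn.Module):
    def __init__(self, depth_ch, sem_ch, out_ch):
        super().__init__()
        self.adapt = nn.Conv2d(sem_ch, depth_ch, kernel_size=3, padding=1)  # adaptive conv
        self.bn = nn.BatchNorm2d(depth_ch * 2)
        self.fuse = nn.Conv2d(depth_ch * 2, out_ch, kernel_size=3, padding=1)

    def forward(self, depth_feat, sem_feat):
        sem_feat = self.adapt(sem_feat)               # fine-tune semantics for the depth task
        x = torch.cat([depth_feat, sem_feat], dim=1)  # fuse by channel concatenation
        x = torch.relu(self.bn(x))                    # normalize distribution, then activate
        return self.fuse(x)                           # convolution produces the fused features
```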
Multi-Resolution Feature Fusion (MRFF) unit. The MRFF unit is employed to fuse the same type of features at different resolutions, as marked by the dotted line in Figure 2; its detailed structure is illustrated in Figure 3b. The lower-resolution feature maps are bilinearly interpolated to match the higher resolution. Since feature maps of the same type are similar in geometric structure in the corresponding channels, the combination is carried out by element-wise addition. After the fused feature maps are activated, each channel carries richer information.
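A possible MRFF-style block is sketched below; the bilinear upsampling and element-wise addition follow the description above, while the extra 3 × 3 convolution before activation is an assumption.

```python
# Sketch of an MRFF-style block: upsample the lower-resolution features bilinearly,
# add them element-wise to the higher-resolution features, then activate.
import torch.nn as nn
import torch.nn.functional as F

class MRFFUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # assumed

    def forward(self, high_res_feat, low_res_feat):
        low_up = F.interpolate(low_res_feat, size=high_res_feat.shape[2:],
                               mode="bilinear", align_corners=False)
        return F.relu(self.conv(high_res_feat + low_up))  # element-wise fusion, then activation
```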
Chained Residual Pooling (CRP) unit. The CRP unit [46] is built as a chain of pooling sub-blocks, each containing a max-pooling layer and a convolution operation, as denoted in Figure 3c. Each pooling sub-block takes the output of the previous block as its input. Therefore, the CRP unit pools features with multiple effective window sizes and fuses them via element-wise addition. Thus, it can capture information from a large region of the fused features and facilitate decoding.
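The sketch below shows a CRP-style unit in the spirit of Light-Weight RefineNet; the 5 × 5 pooling window, 1 × 1 convolutions, and number of sub-blocks are assumptions.

```python
# Sketch of a Chained Residual Pooling unit: each sub-block pools and convolves the
# previous sub-block's output, and all results are accumulated element-wise.
import torch.nn as nn

class CRPUnit(nn.Module):
    def __init__(self, channels, n_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(
                nn.MaxPool2d(kernel_size=5, stride=1, padding=2),  # pooling keeps resolution
                nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            )
            for _ in range(n_blocks)
        ])

    def forward(self, x):
        out = x
        path = x
        for block in self.blocks:
            path = block(path)  # chained: input is the previous sub-block's output
            out = out + path    # element-wise accumulation of multi-window context
        return out
```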

3.3. Loss Function

Our SFA-MDEN is trained with bidirectional loss terms at four scales: full, half, quarter, and eighth resolution. The bidirectional disparities comprise the disparity from right to left view, denoted with the superscript $l$, and the disparity from left to right view, denoted with the superscript $r$. The bidirectional loss at a single scale $s$ consists of three weighted contributions, as defined in Equation (1). The loss terms are computed from the bidirectional disparities, and $k$ denotes the corresponding weights, with $k_{re} = 1$, $k_{ds} = 0.1$, and $k_{lr} = 1$. The loss function of DepthNet alone is the same.
$$L_s = k_{re}\left(L_{re}^{l} + L_{re}^{r}\right) + k_{ds}\left(L_{ds}^{l} + L_{ds}^{r}\right) + k_{lr}\left(L_{lr}^{l} + L_{lr}^{r}\right), \quad s = 1, 2, 3, 4 \qquad (1)$$
The first term, defined in Equation (2), denotes the image reconstruction error, computed as the difference between the original left view $I^{l}$ and the warped image $\tilde{I}^{l}$. $N$ in Equations (2)–(4) denotes the number of pixels with valid depths.
$$L_{re}^{l} = \frac{1}{N}\sum_{i,j}\left[\frac{\tau}{2}\left(1 - \mathrm{SSIM}\left(I_{i,j}^{l}, \tilde{I}_{i,j}^{l}\right)\right) + (1 - \tau)\left\| I_{i,j}^{l} - \tilde{I}_{i,j}^{l} \right\|\right] \qquad (2)$$
The disparity smoothness term, given in Equation (3), is deployed to encourage the gradient variation of the estimated disparity to follow that of the input image. $\delta_x$ and $\delta_y$ represent the image gradients along the width and the height, and $d$ denotes the estimated disparity in Equations (3) and (4). The gradient discontinuity loss is decayed by an exponential weight where a corresponding discontinuity occurs in the input.
$$L_{ds}^{l} = \frac{1}{N}\sum_{i,j}\left(\left|\delta_x d_{i,j}^{l}\right| e^{-\left\|\delta_x I_{i,j}^{l}\right\|} + \left|\delta_y d_{i,j}^{l}\right| e^{-\left\|\delta_y I_{i,j}^{l}\right\|}\right) \qquad (3)$$
The last term encourages the consistency of the predicted bidirectional disparities, defined as Equation (4). A warped left-view disparity, constructed by projecting the right-view disparity, is encouraged to be coherent with the estimated left-view disparity.
$$L_{lr}^{l} = \frac{1}{N}\sum_{i,j}\left| d_{i,j}^{l} - d_{i,j + d_{i,j}^{l}}^{r} \right| \qquad (4)$$
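The left-view loss terms of Equations (2)–(4) can be written compactly as in the sketch below (an illustrative re-implementation, not the released code). Here `ssim_map` is an assumed helper returning a per-pixel SSIM map, `disp_r_warped_to_l` is the right-view disparity sampled at the left-view pixel locations shifted by the left disparity, and the value of $\tau$ is not stated above, so the common choice of 0.85 from [29] is used.

```python
# Sketch of the left-view loss terms (Eqs. (2)-(4)) for NCHW tensors.
import torch

def gradient_x(img):  # finite difference along width
    return img[:, :, :, :-1] - img[:, :, :, 1:]

def gradient_y(img):  # finite difference along height
    return img[:, :, :-1, :] - img[:, :, 1:, :]

def reconstruction_loss(img, warped, ssim_map, tau=0.85):
    ssim_term = (1.0 - ssim_map(img, warped)).mean()
    l1_term = (img - warped).abs().mean()
    return tau / 2.0 * ssim_term + (1.0 - tau) * l1_term           # Eq. (2)

def smoothness_loss(disp, img):
    # Edge-aware weights: decay the disparity gradient where the image has a discontinuity.
    weight_x = torch.exp(-gradient_x(img).abs().mean(1, keepdim=True))
    weight_y = torch.exp(-gradient_y(img).abs().mean(1, keepdim=True))
    return (gradient_x(disp).abs() * weight_x).mean() + \
           (gradient_y(disp).abs() * weight_y).mean()              # Eq. (3)

def lr_consistency_loss(disp_l, disp_r_warped_to_l):
    return (disp_l - disp_r_warped_to_l).abs().mean()              # Eq. (4)
```

The per-scale loss of Equation (1) then sums the left- and right-view versions of these terms with $k_{re} = 1$, $k_{ds} = 0.1$, and $k_{lr} = 1$.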
As for the training of SemanticsNet, we use the cross-entropy loss, as defined in Equation (5). $M$ denotes the number of categories, $y_c$ is the one-hot vector indicating the ground truth, and $p_c$ is the probability that the sample falls into category $c$.
$$L_{semantics} = -\sum_{c=1}^{M} y_c \log\left(p_c\right) \qquad (5)$$
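Equation (5) is the standard categorical cross-entropy, which, for example, PyTorch computes directly from raw logits and integer class labels; the `ignore_index` for unlabeled Cityscapes pixels in the sketch below is an assumption about preprocessing rather than a detail given above.

```python
# Minimal sketch: per-pixel cross-entropy over M = 19 Cityscapes classes.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=255)   # 255 assumed to mark unlabeled pixels
logits = torch.randn(2, 19, 128, 256)               # N x M x H x W class scores
labels = torch.randint(0, 19, (2, 128, 256))        # ground-truth class index per pixel
loss = criterion(logits, labels)
```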

4. Experiments and Analysis

In this section, we compare our SFA-MDEN with existing supervised methods, self-supervised methods, and methods with semantic assistance. We consider two datasets for training, KITTI and Cityscapes. To evaluate our models, we test on the KITTI Eigen split and analyze the performance of the model. In particular, to evaluate the estimation quality of the relative location between pixels, two metrics, named Scale Invariant Error (SIE) and Absolute Relative Error of Relative Depth (ARERD), are computed. Meanwhile, we test the model directly on the Make3D dataset without any fine-tuning to validate the generalization performance of SFA-MDEN. Finally, a self-made dataset, BHDE-v1, is applied to further evaluate the generalization ability and robustness of SFA-MDEN.

4.1. Implementation Details and Metrics

According to the TSTB training strategy, the training process is divided into two stages. In Stage 1, we train DepthNet and SemanticsNet on different datasets. For DepthNet, we split the data as Eigen et al. [14] to compare with existing works. There are 61 scenes in total, divided into a test split of 29 scenes containing 697 image pairs, and a training set of 22,600 image pairs plus a validation set of the remaining 888 image pairs, both drawn from the remaining 32 scenes. We use the Adam optimizer with hyper-parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\varepsilon = 10^{-8}$, a learning rate $\alpha = 10^{-4}$, and a batch size of 4. For data augmentation, we flip the input images with a probability of 50% and add color augmentation. The Cityscapes corpus is used to train the supervised SemanticsNet. The corpus consists of 5000 high-resolution images with fine semantic annotations in 19 categories from 50 cities. The training set includes 2975 images, with another 500 for validation and the remaining 1525 for testing. We select mini-batch SGD with a momentum of 0.9 and an initial learning rate of $10^{-4}$ as the optimizer. Random crop, random mirror, and random scaling are applied to augment the training set.

Stage 2 is the training of the overall SFA-MDEN, for which only the training set of KITTI is used. The hyper-parameters are almost the same as those of DepthNet in Stage 1. The learning rate is set to $10^{-4}$ for the first 30 epochs and is halved every 10 epochs thereafter. The proposed SFA-MDEN is trained on a computer equipped with a 2.20 GHz × 8 Intel(R) Core(TM) i7-8750H CPU and a single GeForce RTX 2080 Mobile GPU card with 8 GB of memory.
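As an illustration, the Stage 2 optimizer and learning-rate schedule described above could be set up as follows; this is a sketch with placeholder model and scheduler choices, not the authors' training script.

```python
# Sketch of the Stage-2 optimization settings: Adam with beta1=0.9, beta2=0.999,
# eps=1e-8, lr=1e-4 for the first 30 epochs, then halved every 10 epochs.
import torch

def make_optimizer_and_scheduler(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999), eps=1e-8)
    # Keep lr constant for epochs 0-29, then halve it every 10 epochs thereafter.
    def lr_lambda(epoch):
        return 1.0 if epoch < 30 else 0.5 ** ((epoch - 30) // 10 + 1)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```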
Similar to previous works, we adopt four error metrics and three accuracy metrics to evaluate the models firstly, as follows:
  • Absolute relative error: $\mathrm{Abs\ Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|\hat{D}_i - D_i\right|}{D_i}$
  • Square relative error: $\mathrm{Sq\ Rel} = \frac{1}{N}\sum_{i=1}^{N}\frac{\left|\hat{D}_i - D_i\right|^2}{D_i}$
  • Linear root-mean-squared error: $\mathrm{RMSE(linear)} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left|\hat{D}_i - D_i\right|^2}$
  • Logarithm root-mean-squared error: $\mathrm{RMSE(log)} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left|\log\hat{D}_i - \log D_i\right|^2}$
  • Threshold accuracy (% correct): $\delta = \max\left(\frac{D_i}{\hat{D}_i}, \frac{\hat{D}_i}{D_i}\right) < T$, where $T$ is the threshold, assigned as $1.25$, $1.25^2$, or $1.25^3$
where $\hat{D}_i$ denotes the predicted depth, $D_i$ denotes the ground truth depth, the subscripts $i$ and $j$ are pixel indices, and $N$ represents the total number of valid pixels.
The aforementioned four error metrics assess the accuracy of the predicted depth only at individual pixels but, in some application scenarios, the accuracy of the relative depth between pixels receives more attention, for instance, in a landing-site detection system for an unmanned aerial vehicle (UAV). Therefore, we also employ SIE, as shown in Equation (6), which was designed as a loss function by Eigen et al. [14], to evaluate the performance of our network. SIE represents the mean squared error of the relative depth between each pair of pixels in the log domain; $n$ represents the number of point pairs.
$$\mathrm{SIE} = \frac{1}{n}\sum_{i,j}\left(\left(\log\hat{D}_i - \log\hat{D}_j\right) - \left(\log D_i - \log D_j\right)\right)^2 \qquad (6)$$
Inspired by SIE, we design another metric, ARERD, as shown in Equation (7), which refers directly to the relative error of the point-pair distance.

$$\mathrm{ARERD} = \frac{1}{n}\sum_{i,j}\frac{\left|\left|\hat{D}_i - \hat{D}_j\right| - \left|D_i - D_j\right|\right|}{\left|D_i - D_j\right|} \qquad (7)$$
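For reference, the sketch below re-implements the metrics above in NumPy; evaluating SIE and ARERD over a random subsample of point pairs is an implementation choice for tractability, not a detail specified above.

```python
# Sketch of the evaluation metrics: pred and gt are 1-D arrays of valid depths.
import numpy as np

def standard_metrics(pred, gt):
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    acc = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]   # threshold accuracies
    return abs_rel, sq_rel, rmse, rmse_log, acc

def pairwise_metrics(pred, gt, n_pairs=10000, eps=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(gt), n_pairs)
    j = rng.integers(0, len(gt), n_pairs)
    keep = np.abs(gt[i] - gt[j]) > eps                      # avoid zero ground-truth gaps
    i, j = i[keep], j[keep]
    sie = np.mean(((np.log(pred[i]) - np.log(pred[j]))
                   - (np.log(gt[i]) - np.log(gt[j]))) ** 2)          # Eq. (6)
    arerd = np.mean(np.abs(np.abs(pred[i] - pred[j]) - np.abs(gt[i] - gt[j]))
                    / np.abs(gt[i] - gt[j]))                         # Eq. (7)
    return sie, arerd
```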

4.2. Eigen Split of KITTI

We compare the performance of the proposed SFA-MDEN with existing monocular depth estimation algorithms. For accuracy evaluation, sparse ground truth depth values are generated by reprojecting the 3D points of the Velodyne laser scanner into the left input color camera. To mitigate the errors caused by the rotation of the Velodyne, the motion of the vehicle and surrounding objects, and occlusions, the cropping scheme proposed in [14] is adopted, which retains approximately 58% of the image height and 93% of the image width.
Table 1 shows the comparison with the supervised methods [12,14], the unsupervised methods [28,29,36,37,45,50], and the estimation methods related to semantics [50] at the 80 m and 50 m caps. First, our algorithm achieves a clear improvement over most of the listed supervised and unsupervised methods in all metrics under both depth caps, with the exception of Yue et al. [50]. The reason is that Yue et al. [50] establish a semantic weight map to optimize the estimated depth map, which also takes advantage of semantic information. Even so, there is only a narrow margin between our SFA-MDEN and Yue et al. [50] in the metrics Sq Rel, RMSE log, and $\delta < 1.25^3$. Furthermore, the Abs Rel of SFA-MDEN is 0.023 smaller, the RMSE is 0.135 m smaller, and the threshold accuracy $\delta < 1.25$ is 2.9% higher than those of Yue et al. [50]. Second, SFA-MDEN achieves considerable advances over the original DepthNet by merging the feature maps from SemanticsNet. All error metrics are reduced markedly: Abs Rel declines by 16.67%, Sq Rel by 6.25%, and RMSE and RMSE log by 7.54% and 9.24%, respectively. The three threshold accuracy metrics increase by about 1%. These results demonstrate the effectiveness of semantic features for monocular depth estimation and strongly support the validity of SFA-MDEN.
Table 1. Monocular depth estimation evaluation using the split of Eigen et al. [14] on the KITTI dataset [9]. For datasets, K = KITTI and CS = Cityscapes; CS + K means that the model is pretrained on Cityscapes and fine-tuned on KITTI. For methods with semantics, the dataset used for semantic segmentation training is given in brackets. We highlight the best results with an underline in each depth cap. The results of Garg et al. [28] are only capped at 50 m.
In addition to the quantitative comparison, some qualitative samples are exhibited in Figure 4, where the results of our SFA-MDEN are compared to those of Godard et al. [29] and DepthNet. In the samples, the depth variation within the same object, such as building surfaces with windows, is more continuous in the outputs of SFA-MDEN. Furthermore, the depth variation at object boundaries is more evident for our method: as shown in Figure 4, the outlines of vehicles and road signs are sharper and more plausible.
Figure 4. Qualitative samples on the Eigen split of KITTI. (a) The input RGB images, (b) ground truth depth maps, (c) outputs of Godard et al. [29], (d) outputs of DepthNet, and (e) outputs of the proposed SFA-MDEN. Our method achieves plausible estimation quality on surfaces and boundaries.
To further illustrate the competitiveness of SFA-MDEN, we also compare it with other models with semantic guidance, as shown in Table 2. It should be noted that the training and test sets of these methods are not the same. First, Nekrasov et al. [49] use a teacher network to annotate semantic labels on KITTI to pretrain the target model. They achieve outstanding results on the 46 test images of KITTI-6 in RMSE and RMSE log, even compared with the methods in Table 1, which demonstrates the benefit of semantic guidance. Nevertheless, the expert system can greatly affect the accuracy of the model and the whole process is tedious, whereas our SFA-MDEN does not depend on other models. Second, Ramirez et al. [47] pretrained their network on Cityscapes with about 3000 finely semantically annotated stereo images, fine-tuned it on 160 stereo images with semantic labels from KITTI 2015, and tested it on another 40 images. As can be observed, the model performs poorly due to the small dataset. Yue et al. [50] has already been compared with our model quantitatively in Table 1. According to Table 2, SFA-MDEN generally performs the best in the metrics Abs Rel and $\delta < 1.25^2$, and ranks second in Sq Rel and RMSE. Although SFA-MDEN comes last in RMSE log and $\delta < 1.25^3$, the margin between these methods is narrow. In conclusion, our method achieves competitive performance compared to state-of-the-art methods with semantic auxiliary.
Table 2. Comparison between monocular depth estimation methods with semantic auxiliary and SFA-MDEN. For datasets, K = KITTI, CS = Cityscapes, K-6 = KITTI-6, and K (200) denotes the 200 images of KITTI used by Ramirez et al. [47]. We highlight the best results with an underline. Nekrasov et al. [49] only reported the evaluation metrics RMSE and RMSE log.
We can evaluate the models through diverse metrics and, in practice, select the network according to the application. As previously mentioned, in certain application scenarios, such as landing-site detection for UAVs, the accuracy of the relative depth between point pairs is of more concern. Therefore, the metrics SIE and ARERD defined in Equations (6) and (7) are used to evaluate the models, as shown in Table 3. The model of Godard et al. [29] and DepthNet are selected for comparison with SFA-MDEN due to their comparable performance on most metrics in Table 1. According to Table 3, our SFA-MDEN achieves superior performance compared to the other two. Compared to DepthNet without the semantic information, SFA-MDEN achieves a decrease of 18.64% in SIE and of 14.95% in ARERD. We conclude that merging the semantic information into depth estimation improves the accuracy of the positional relationship between pixels.
Table 3. Evaluation on the metrics of SIE and ARERD. We highlight the best results with an underline.
A UAV landing system not only demands accurate relative depth, but also requires different accuracy at different working heights, according to the user guide of the DJI UAV [53]. According to the user guide, there are four working depth intervals: $(0, 20]$, $(10, 30]$, $(30, 60]$, and $(60, 80]$ m. Therefore, we evaluate the precision of pixels in these depth intervals and visualize the results in Figure 5. As shown in Figure 5, firstly, most trends in the metrics indicate that the prediction accuracy decreases as the depth increases. Secondly, Abs Rel shows a downward trend in the last two intervals. This can be explained by the fact that Abs Rel is defined as the absolute error divided by the ground truth depth, and the absolute error of the estimation does not grow as fast as the ground truth depth. Thirdly, SIE also shows a downward trend in the last two intervals, because large depths change more slowly in the logarithmic domain, so the relative distances between point pairs are smaller and SIE is reduced.
Figure 5. Quantitative comparison of the depth estimation precision of SFA-MDEN in different depth ranges. (a) Abs Rel, (b) Sq Rel, (c) RMSE, (d) RMSE log, (e) ARERD, (f) accuracy metrics ($\delta < 1.25$, $\delta < 1.25^2$, $\delta < 1.25^3$), and (g) SIE.

4.3. Make3D

Experiments on KITTI have indicated the estimation performance of SFA-MDEN. To evaluate its generalization, we compare SFA-MDEN to other methods on the Make3D test set [10], which contains 134 images and the corresponding depth maps. Although our SFA-MDEN is trained on KITTI and Cityscapes and is tested directly on a center crop with a 2 × 1 ratio, as in [29], on the Make3D dataset, we still achieve comparable results, as shown in Table 4. First, we outperform almost all unsupervised methods, which illustrates the superiority of our SFA-MDEN. Second, compared to supervised methods that are trained with the ground truth depth of Make3D, the results of our SFA-MDEN without fine-tuning on Make3D are still comparable. The experimental results on Make3D show the strong generalization capacity of our SFA-MDEN even when it is applied directly to a different domain without fine-tuning. Qualitative results can be seen in Figure 6.
Table 4. Monocular depth estimation evaluation on the Make3D dataset. We highlight the best results with an underline, and bold numbers rank second in each metric.
Figure 6. Qualitative samples on the Make3D datasets. (a) The input RGB images, (b) ground truth depth maps, (c) outputs of Godard et al. [29], and (d) outputs of the proposed SFA-MDEN.

4.4. Self-Made Datasets: BHDE-v1

In addition to the above public datasets, we captured a small dataset, BHDE-v1, consisting of 140 images at a resolution of 1920 × 1080, in order to validate the generalization capacity of our SFA-MDEN. An image example is illustrated in Figure 7. The annotated subject of the dataset is the flat ground within the red rectangle window, excluding the protrusions. We measured the depth from the camera to the flat ground, 21.678 m, with a GLM80 laser rangefinder. The measurement error of the GLM80 is ±1.5 mm within its 80 m measurement range, which is far smaller than the error of the model estimates; thus, the measured depth can be treated as the ground truth. The areas in the green rectangle windows belong to the surrounding buildings with continuous depth, which we have not annotated.
Figure 7. An image sample of the BHDE-v1 dataset. The area inside the red rectangle window is the sparsely depth-annotated flat ground, while the areas in the green windows are not annotated.
SFA-MDEN and the model of Godard et al. [29], which performs well on the Eigen split of KITTI, are tested without fine-tuning on the BHDE-v1 dataset. Table 5 presents the quantitative results and Figure 8 illustrates qualitative samples. As can be seen, the precision metrics of our SFA-MDEN are superior to those of Godard et al. [29] at a depth range of about 20 m. Moreover, SFA-MDEN exhibits stronger robustness to shadows and illumination, as shown in Figure 8. For the floor gaps in the input image, the output of Godard et al. [29] shows significant depth changes, while our SFA-MDEN behaves more robustly. In conclusion, SFA-MDEN is verified to achieve better generalization and strong robustness on different datasets.
Table 5. Monocular depth estimation evaluation on the BHDE-v1 of SFA-MDEN and the model in Godard et al. [29].
Figure 8. Qualitative samples on the BHDE-v1 dataset. (a) The input RGB images, (b) outputs of Godard et al. [29], and (c) outputs of the proposed SFA-MDEN.

5. Conclusions and Discussion

Monocular depth estimation based on deep learning achieves higher flexibility than binocular systems, which promotes the application of lightweight vision sensors. In this paper, inspired by multi-task training, a novel network architecture abbreviated as SFA-MDEN, which merges semantics-related features into a monocular depth estimation network, is proposed to predict depth maps with preferable precision, robustness, and generalization capacity. A Two-Stages for Two Branches training strategy is designed to train SFA-MDEN on two large public datasets: KITTI for depth estimation and Cityscapes for semantic segmentation. In Stage 1, the two branches are trained independently on their respective datasets, with the unsupervised DepthNet trained on the stereo sets of KITTI and the supervised SemanticsNet trained on the Cityscapes corpus. In Stage 2, SFA-MDEN is constructed by fusing the semantic and depth features and is trained on the stereo sets of KITTI, with the pretrained parameters of Stage 1 as the initial weights. At inference, the semantic features extracted by the SemanticsNet branch are fused with the depth features extracted by the DepthNet branch. Experimental comparisons on the Eigen split of KITTI demonstrate the superiority of the proposed SFA-MDEN, even compared to existing methods with semantic auxiliary. Additionally, experiments on the Make3D dataset, on which SFA-MDEN is tested without fine-tuning, exhibit its generalization, and experiments on our self-made dataset BHDE-v1 validate its robustness against variations in shadow and illumination.
In future work, the relation between semantics and depth could be exploited further, and loss functions based on traditional image reconstruction constraints deserve more attention. Our future work will focus on improving unsupervised monocular depth estimation with semantic guidance in a lightweight network structure.

Author Contributions

Conceptualization, R.W., J.Z. and J.Z.W.; methodology, R.W. and J.Z.; software, J.Z.; validation, R.W.; investigation, J.Z.W.; resources, R.W.; writing—original draft preparation, J.Z.; writing—review and editing, R.W., J.Z. and J.Z.W.; visualization, J.Z.; supervision, R.W. and J.Z.W.; funding acquisition R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant from National Natural Science Foundation of China (61673039). This study was partially supported by Ji Hua Laboratory through the Self-Programming Intelligent Robot Project (No. X190101TB190).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The BHDE-v1 are available online at http://doip.buaa.edu.cn/info/1092/1113.htm (accessed on 5 August 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Alidoost, F.; Arefi, H.; Tombari, F. 2D image-to-3D model: Knowledge-based 3D building reconstruction (3DBR) using single aerial images and convolutional neural networks (CNNs). Remote Sens. 2019, 11, 2219. [Google Scholar] [CrossRef] [Green Version]
  2. Wang, R.; Xu, J.; Han, T.X. Object instance detection with pruned alexnet and extended data. Signal Process Image Commun. 2019, 70, 145–156. [Google Scholar] [CrossRef]
  3. Stathopoulou, E.K.; Battisti, R.; Cernea, D.; Remondino, F.; Georgopoulos, A. Semantically derived geometric constraints for MVS reconstruction of textureless areas. Remote Sens. 2021, 13, 1053. [Google Scholar] [CrossRef]
  4. Jin, L.; Wang, X.; He, M.; Wang, J. DRNet: A depth-based regression network for 6D object pose estimation. Sensors 2021, 21, 1692. [Google Scholar] [CrossRef]
  5. Hwang, S.-J.; Park, S.-J.; Kim, G.-M.; Baek, J.-H. Unsupervised monocular depth estimation for colonoscope system using feedback network. Sensors 2021, 21, 2691. [Google Scholar] [CrossRef] [PubMed]
  6. Microsoft. Kinect for Windows. Available online: https://developer.microsoft.com/zh-cn/windows/kinect/ (accessed on 23 April 2021).
  7. Dhond, U.R.; Aggarwal, J.K. Structure from stereo—A review. IEEE Trans. Syst. Man Cybern. 1989, 19, 1489–1510. [Google Scholar] [CrossRef] [Green Version]
  8. Khan, F.; Salahuddin, S.; Javidnia, H. Deep learning-based monocular depth estimation methods—A state-of-the-art review. Sensors 2020, 20, 2272. [Google Scholar] [CrossRef] [Green Version]
  9. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
  10. Saxena, A.; Sun, M.; Ng, A.Y. Make3D: Learning 3D scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 824–840. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  11. Saxena, A.; Chung, S.H.; Ng, A.Y. 3-D depth reconstruction from a single still image. Int. J. Comput. Vis. 2008, 76, 53–69. [Google Scholar] [CrossRef] [Green Version]
  12. Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2024–2039. [Google Scholar] [CrossRef] [Green Version]
  13. Karsch, K.; Liu, C.; Kang, S.B. Depth transfer: Depth extraction from video using non-parametric sampling. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2144–2158. [Google Scholar] [CrossRef] [Green Version]
  14. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
  15. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658. [Google Scholar] [CrossRef] [Green Version]
  16. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar] [CrossRef] [Green Version]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef] [Green Version]
  18. Wofk, D.; Ma, F.; Yang, T.-J.; Karaman, S.; Sze, V. Fast monocular depth estimation on embedded systems. In Proceedings of the 2019 International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 6101–6108. [Google Scholar] [CrossRef] [Green Version]
  19. Tu, X.; Xu, C.; Liu, S.; Xie, G.; Huang, J.; Li, R.; Yuan, J. Learning depth for scene reconstruction using an encoder-decoder model. IEEE Access 2020, 8, 89300–89317. [Google Scholar] [CrossRef]
  20. Lee, J.H.; Han, M.K.; Ko, D.W.; Suh, I.H. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv 2020, arXiv:1907.10326. [Google Scholar]
  21. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2002–2011. [Google Scholar] [CrossRef]
  22. Su, W.; Zhang, H. Soft regression of monocular depth using scale-semantic exchange network. IEEE Access 2020, 8, 114930–114939. [Google Scholar] [CrossRef]
  23. Kim, S.; Nam, J.; Ko, B. Fast depth estimation in a single image using lightweight efficient neural network. Sensors 2019, 19, 4434. [Google Scholar] [CrossRef] [Green Version]
  24. Kuznietsov, Y.; Stückler, J.; Leibe, B. Semi-supervised deep learning for monocular depth map prediction. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2215–2223. [Google Scholar] [CrossRef] [Green Version]
  25. Atapour-Abarghouei, A.; Breckon, T.P. Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2800–2810. [Google Scholar] [CrossRef] [Green Version]
  26. Zhao, S.; Fu, H.; Gong, M.; Tao, D. Geometry-aware symmetric domain adaptation for monocular depth estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9780–9790. [Google Scholar] [CrossRef] [Green Version]
  27. Ji, R.; Li, K.; Wang, Y.; Sun, X.; Guo, F.; Guo, X.; Wu, Y.; Huang, F.; Luo, J. Semi-supervised adversarial monocular depth estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2410–2422. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  28. Garg, R.; Bg, V.K.; Carneiro, G.; Reid, I. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In Proceedings of the 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 740–756. [Google Scholar] [CrossRef] [Green Version]
  29. Godard, C.; Mac Aodha, O.; Brostow, G.J. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 270–279. [Google Scholar] [CrossRef] [Green Version]
  30. Goldman, M.; Hassner, T.; Avidan, S. Learn stereo, infer mono: Siamese networks for self-supervised, monocular, depth estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 2886–2895. [Google Scholar] [CrossRef] [Green Version]
  31. Poggi, M.; Aleotti, F.; Tosi, F.; Mattoccia, S. Towards real-time unsupervised monocular depth estimation on CPU. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 5848–5854. [Google Scholar] [CrossRef] [Green Version]
  32. Ye, X.; Zhang, M.; Xu, R.; Zhong, W.; Fan, X.; Liu, Z.; Zhang, J. Unsupervised Monocular depth estimation based on dual attention mechanism and depth-aware loss. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 169–174. [Google Scholar] [CrossRef]
  33. Zhang, C.; Liu, J.; Han, C. Unsupervised learning of depth estimation based on attention model from monocular images. In Proceedings of the 2020 International Conference on Virtual Reality and Visualization (ICVRV), Recife, Brazil, 13–14 November 2020; pp. 191–195. [Google Scholar] [CrossRef]
  34. Ling, C.; Zhang, X.; Chen, H. Unsupervised monocular depth estimation using attention and multi-warp reconstruction. IEEE Trans. Multimed. 2021. [Google Scholar] [CrossRef]
  35. Wang, R.; Huang, R.; Yang, J. Facilitating PTZ camera auto-calibration to be noise resilient with two images. IEEE Access 2019, 7, 155612–155624. [Google Scholar] [CrossRef]
  36. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised learning of depth and ego-motion from video. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6619. [Google Scholar] [CrossRef] [Green Version]
  37. Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5667–5675. [Google Scholar] [CrossRef] [Green Version]
  38. Luo, X.; Huang, J.B.; Szeliski, R.; Matzen, K.; Kopf, J. Consistent video depth estimation. ACM Trans. Graph. 2020, 39, 85–95. [Google Scholar] [CrossRef]
  39. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G. Digging into self-supervised monocular depth estimation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 3827–3837. [Google Scholar] [CrossRef] [Green Version]
  40. Ramamonjisoa, M.; Du, Y.; Lepetit, V. Predicting sharp and accurate occlusion boundaries in monocular depth estimation using displacement fields. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14636–14645. [Google Scholar] [CrossRef]
  41. Casser, V.; Pirk, S.; Mahjourian, R.; Angelova, A. Depth prediction without the sensors: Leveraging structure for unsupervised learning from monocular videos. arXiv 2018, arXiv:1811.06152. [Google Scholar] [CrossRef] [Green Version]
  42. Bian, J.W.; Zhan, H.; Wang, N.; Li, Z.; Zhang, L.; Shen, C.; Cheng, M.; Reid, I. Unsupervised scale-consistent depth learning from video. Int. J. Comput. Vis. 2021, 129, 2548–2564. [Google Scholar] [CrossRef]
  43. Liu, J.; Li, Q.; Cao, R.; Tang, W.; Qiu, G. MiniNet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation. ISPRS J. Photogramm. Remote Sens. 2020, 166, 255–267. [Google Scholar] [CrossRef]
  44. Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning depth from monocular videos using direct methods. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2022–2030. [Google Scholar] [CrossRef] [Green Version]
  45. Yin, Z.; Shi, J. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1983–1992. [Google Scholar] [CrossRef] [Green Version]
  46. Cipolla, R.; Gal, Y.; Kendall, A. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491. [Google Scholar] [CrossRef] [Green Version]
  47. Ramirez, P.Z.; Poggi, M.; Tosi, F.; Mattoccia, S.; Di Stefano, L. Geometry meets semantics for semi-supervised monocular depth estimation. In Proceedings of the 2018 Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; pp. 298–313. [Google Scholar] [CrossRef] [Green Version]
  48. Mousavian, A.; Pirsiavash, H.; Košecká, J. Joint semantic segmentation and depth estimation with deep convolutional networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 611–619. [Google Scholar] [CrossRef] [Green Version]
  49. Nekrasov, V.; Dharmasiri, T.; Spek, A.; Drummond, T.; Shen, C.; Reid, I. Real-time joint semantic segmentation and depth estimation using asymmetric annotations. In Proceedings of the 2019 International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 7101–7107. [Google Scholar] [CrossRef] [Green Version]
  50. Yue, M.; Fu, G.; Wu, M.; Wang, H. Semi-supervised monocular depth estimation based on semantic supervision. J. Intell. Robot. Syst. 2020, 5, 455–463. [Google Scholar] [CrossRef]
  51. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar] [CrossRef] [Green Version]
  52. Nekrasov, V.; Shen, C.; Reid, I. Light-weight RefineNet for real-time semantic segmentation. arXiv 2018, arXiv:1810.03272. [Google Scholar]
  53. PHANTOM 4 PRO/PRO+. Available online: https://dl.djicdn.com/downloads/phantom_4_pro/20200108/Phantom_4_Pro_Pro_Plus_Series_User_Manual_CHS.pdf (accessed on 22 April 2021).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
