Remote Sensing
  • Article
  • Open Access

1 May 2021

Unsupervised Learning of Depth from Monocular Videos Using 3D-2D Corresponding Constraints

School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Author to whom correspondence should be addressed.
This article belongs to the Special Issue A Review of Computer Vision for Remote Sensing Imagery

Abstract

Depth estimation can provide tremendous help for object detection, localization, path planning, etc. However, existing methods based on deep learning have high computational requirements and often cannot be applied directly to autonomous moving platforms (AMP). Fifth-generation (5G) mobile and wireless communication systems have attracted the attention of researchers because they provide the network foundation for cloud computing and edge computing, which makes it possible to use deep learning methods on AMP. This paper proposes a depth prediction method for AMP based on unsupervised learning, which can learn from video sequences and simultaneously estimate the depth structure of the scene and the ego-motion. Compared with existing unsupervised learning methods, our method makes the spatial correspondence among pixels consistent with the image content by smoothing the 3D correspondence vector field based on the 2D image, which effectively improves the depth prediction ability of the neural network. Our experiments on the KITTI driving dataset demonstrate that our method outperforms previous learning-based methods, and the results on the Apolloscape and Cityscapes datasets show that the proposed method generalizes well.

1. Introduction

5G technologies present a new paradigm for providing connectivity to vehicles in support of high-data-rate services, complementing existing AMP communication standards [1]. The 5G network has low latency, high throughput, and high reliability, which greatly enhances the richness and timeliness of the information transmitted over the vehicular network and also improves the technical value of its sensors [2]. It provides the network foundation for cloud computing and edge computing, which makes it possible to use deep-learning-based models on autonomous moving platforms (AMP) [3].
Autonomous vehicles usually carry sensors such as LIDAR and cameras. The monocular camera has the advantages of low price, rich information content, and small size, which can effectively overcome many shortcomings of other sensors. Therefore, using a monocular camera to obtain depth information is of significant research interest and has gradually become one of the research hotspots in the field of computer vision [4,5].
Traditional depth estimation methods estimate scene depth by fusing information from different views. Structure from motion (SfM) and simultaneous localization and mapping (SLAM) are considered effective approaches for estimating depth structure [6]. Typical SLAM algorithms estimate the ego-motion and the depth of the scene in parallel. However, this type of method depends heavily on point matching, so mismatches and insufficient features still have a significant impact on the results [7].
In recent years, methods based on deep neural networks have triggered another wave of progress in computer vision. A large body of work indicates that deep neural networks play a significant role in many areas of computer vision, including target recognition [8] and target tracking [9,10], and traditional problems such as image segmentation have greatly improved in efficiency and accuracy. In the field of depth estimation, traditional methods based on multi-view geometry and methods based on machine learning have each formed their own theories and method systems. Therefore, researchers have begun to combine traditional computer vision methods with deep learning, and estimating the depth of a scene from a single image with deep learning has become an important research direction [11]. Compared with traditional methods based on multi-view geometry, deep-learning-based methods use a large number of diverse training samples to learn prior knowledge of scene structure and thereby estimate the depth of the scene [12,13].
In this paper, we propose a depth prediction method based on unsupervised learning. The method trains the neural network by analyzing the geometric constraints of the three-dimensional (3D) scene across a sequence of images and constraining the correspondence of pixels among images according to the image gradient. It simultaneously predicts the depth structure of the scene and the ego-motion. Our method introduces a loss that penalizes inconsistencies between the 3D pixel correspondence field and the 2D image. Different from existing methods based on 3D-3D alignment of pixel correspondences, such as [14,15], we introduce a 3D-2D approach that also proves effective. We first project the pixels in the image into 3D space and then calculate the 3D correspondence vector field of the pixels according to the ego-motion and the depth map predicted by the neural network. The 3D correspondence vector field is smoothed based on the 2D image by minimizing a smoothness loss on the vector field weighted by the pixel gradient. This smoothing makes the gradient of the 3D correspondences consistent with the gradient of the 2D image, which effectively improves the details in the prediction results. Example predictions are shown in Figure 1.
Figure 1. Example predictions by our method on KITTI dataset [16]. Compared against [5], our approach recovers more details in the scene.
The main contributions of this paper are:
  • We propose a depth prediction method for AMP based on unsupervised learning, which can learn from video sequences and simultaneously estimate the depth structure of the scene and the ego-motion.
  • Our method makes the spatial correspondence between pixels consistent with the image content by smoothing the 3D correspondence vector field based on the 2D image. This effectively improves the depth prediction ability of the neural network.
  • The model is trained and evaluated on the KITTI dataset provided by [16]. The evaluation results indicate that our unsupervised method outperforms existing methods of the same type and compares favorably with other recent self-supervised and supervised methods.

3. The Proposed Approach

This paper proposes an unsupervised depth estimation method based on 3D-2D consistency, which is used to train a neural network to estimate the depth of a scene. First, the images are divided into the source image used to estimate the depth and the target image used to build the loss. The depth of the source image and the motion of the camera are estimated by the neural network. Then the projected image is constructed by projective transformation and the reconstruction loss is calculated. At the same time, the 3D scene flow is constructed from the depth and motion estimates, and the 3D-2D consistency loss is calculated. Finally, the neural network is trained by minimizing the loss function.

3.1. Differentiable Reprojection Error

In our method, we train the neural network by minimizing the re-projection loss, i.e., the photometric loss between the target image and the source image warped into the target view according to the predicted depth and ego-motion.
Inspired by [5], for the rigid part of the image, we use depth and camera pose to calculate the loss. For two frames $I_{t-1}$ and $I_t$ in an image sequence, $I_{t-1}$ is the source view and $I_t$ is the target view. Our method reconstructs the target view $I_t$ by sampling pixels from the source view $I_{t-1}$ based on the predicted depth map $D_t$ and the relative pose $T_{t \rightarrow t-1}$. Let $p_t$ denote the homogeneous coordinates of a pixel in the target view; its projected coordinates $p_{t-1}$ onto the source view are obtained as:
$$p_{t-1} \sim K\, T_{t \rightarrow t-1}\, D_t(p_t)\, K^{-1} p_t$$
in which K is the camera intrinsic matrix obtained by camera calibration.
Since $p_{t-1}$ takes continuous values, we use differentiable bilinear sampling to obtain the pixel value at these coordinates; that is, the value is interpolated from the four pixels (top-left, top-right, bottom-left, bottom-right) adjacent to $p_{t-1}$ [5]. Finally, the re-projection loss can be expressed as follows:
$$L_{reproject} = \sum_{p} \left| I_t(p) - \hat{I}_t(p) \right|$$
in which $\hat{I}_t$ denotes the warped target image and $p$ denotes the pixel index.
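To make the warping step concrete, the following NumPy sketch back-projects each target pixel with the predicted depth, transforms it with the relative pose, re-projects it through the intrinsics, and bilinearly samples the source image. The function names, the homogeneous-coordinate handling, and the out-of-bounds clamping are our own simplifications under the definitions above, not the authors' released implementation.

```python
import numpy as np

def project_to_source(depth_t, K, T_t_to_s):
    """Project each target pixel p_t into the source view (Eq. for p_{t-1}).

    depth_t:  (H, W) predicted depth of the target view
    K:        (3, 3) camera intrinsic matrix
    T_t_to_s: (4, 4) relative pose from target to source (homogeneous)
    Returns the (H, W, 2) sub-pixel coordinates of each target pixel in the source image.
    """
    H, W = depth_t.shape
    # Homogeneous pixel grid p_t = (u, v, 1)^T
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)

    # Back-project to 3D camera coordinates: D_t(p_t) * K^{-1} * p_t
    cam_points = np.linalg.inv(K) @ pix * depth_t.reshape(1, -1)

    # Transform into the source frame and re-project with K
    cam_points_h = np.vstack([cam_points, np.ones((1, cam_points.shape[1]))])
    src_points = K @ (T_t_to_s @ cam_points_h)[:3]
    src_pix = src_points[:2] / (src_points[2:3] + 1e-8)
    return src_pix.T.reshape(H, W, 2)

def bilinear_sample(image, coords):
    """Bilinear sampling of `image` (H, W, C) at sub-pixel `coords` (H, W, 2)."""
    H, W = image.shape[:2]
    x, y = coords[..., 0], coords[..., 1]
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    x1, y1 = x0 + 1, y0 + 1
    # Clamp indices to the image bounds
    x0c, x1c = np.clip(x0, 0, W - 1), np.clip(x1, 0, W - 1)
    y0c, y1c = np.clip(y0, 0, H - 1), np.clip(y1, 0, H - 1)
    # Interpolation weights of the four neighbouring pixels
    wa = (x1 - x) * (y1 - y)
    wb = (x1 - x) * (y - y0)
    wc = (x - x0) * (y1 - y)
    wd = (x - x0) * (y - y0)
    return (wa[..., None] * image[y0c, x0c] + wb[..., None] * image[y1c, x0c]
            + wc[..., None] * image[y0c, x1c] + wd[..., None] * image[y1c, x1c])
```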

3.2. Image Reconstruction Loss

Based on the image re-projection loss, we use three consecutive frames of the video sequence to calculate the image reconstruction loss. Because the gradient of the image may be unevenly distributed, a Gaussian smoothing is applied first. We implement the Gaussian smoothing as a convolution, whose kernel is calculated as follows:
$$G(u, v) = \frac{1}{2\pi\sigma^{2}}\, e^{-(u^{2}+v^{2})/(2\sigma^{2})}$$
in which $u$ and $v$ denote the coordinates within the convolution kernel and $\sigma$ denotes the smoothing parameter of the Gaussian.
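As a small illustration, the kernel above can be generated as in the NumPy sketch below; the kernel size and the value of $\sigma$ are placeholders, since the section does not report the values used.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Sample G(u, v) on a size x size grid centred at the origin."""
    half = size // 2
    u, v = np.meshgrid(np.arange(-half, half + 1), np.arange(-half, half + 1))
    g = np.exp(-(u ** 2 + v ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return g / g.sum()  # normalize so that smoothing preserves image brightness
```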
In addition, the depth map of the rigid part is normalized by dividing by its mean, which is denoted by the operator $\eta(\cdot)$:
$$\eta(d_i) = \frac{N\, d_i}{\sum_{j=1}^{N} d_j}$$
The structural similarity (SSIM) proposed in [40] is a common metric used to evaluate the quality of image predictions. It measures the similarity between two images and is calculated as follows:
$$SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x + \sigma_y + c_2)}$$
in which $\mu$ and $\sigma$ denote the mean and the variance, respectively, and $c_1$ and $c_2$ are constants used to maintain numerical stability. We calculate $\mu$ and $\sigma$ by pooling, following [14]. Since SSIM is to be maximized and its upper bound is 1, the SSIM loss is designed as:
$$L_{SSIM} = \sum_{p} \left[ 1 - SSIM(I_t^{p}, \hat{I}_t^{p}) \right] M_{t-1}^{p} + \sum_{p} \left[ 1 - SSIM(I_t^{p}, \hat{I}_t^{p}) \right] M_{t+1}^{p}$$
in which $M_{t-1}^{p}$ and $M_{t+1}^{p}$ denote the masks predicted by the neural network and represent the valid parts of the projected images $I_{t-1}$ and $I_{t+1}$, respectively. The image reconstruction loss $L_{recon}$ can then be expressed as:
$$L_{recon} = S\!\left( \zeta\!\left( I_s(p_t) \right),\; I_t\!\left( [K\,|\,0]\; T_{t \rightarrow s} \begin{bmatrix} D_t(p_t)\, K^{-1} p_t \\ 1 \end{bmatrix} \right) \right)$$
in which $\zeta$ denotes the Gaussian smoothing operation and $S$ is the combination of the absolute error and the SSIM error.
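A compact TensorFlow sketch of how the SSIM term and the absolute photometric error could be combined is given below. The 3 × 3 pooling window, the constants c1 and c2, the weighting factor alpha, and the omission of the validity masks in the SSIM term are our assumptions, not values taken from the paper.

```python
import tensorflow as tf

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Pooling-based SSIM between two image batches x, y of shape (B, H, W, C)."""
    mu_x = tf.nn.avg_pool2d(x, 3, 1, 'VALID')
    mu_y = tf.nn.avg_pool2d(y, 3, 1, 'VALID')
    sigma_x = tf.nn.avg_pool2d(x ** 2, 3, 1, 'VALID') - mu_x ** 2
    sigma_y = tf.nn.avg_pool2d(y ** 2, 3, 1, 'VALID') - mu_y ** 2
    sigma_xy = tf.nn.avg_pool2d(x * y, 3, 1, 'VALID') - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return tf.clip_by_value(num / den, 0.0, 1.0)

def reconstruction_loss(target, warped, mask, alpha=0.85):
    """Combine the masked absolute error with the SSIM error between the target
    image and the warped source image."""
    l1 = tf.reduce_mean(tf.abs(target - warped) * mask)
    ssim_err = tf.reduce_mean(1.0 - ssim(target, warped))
    return alpha * ssim_err + (1.0 - alpha) * l1
```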

3.3. Corresponding Consistency Loss

As can be seen from the previous section, the image reconstruction loss depends only on the differences of individual pixels, so the regional relationship among pixels is not considered. In addition, in weakly textured regions the colors of pixels within a certain range are almost identical, so corresponding pixels do not produce an effective error and the loss function cannot be calculated effectively. In traditional dense SLAM methods, there are two main ways to address this problem. One is to increase the length of the input sequence so that pixels are matched over a larger range. A typical example is [18], where the system inputs multiple images and builds cost volumes so that pixels can be matched among several different views to find the best correspondence. The other is to remove weakly textured parts of the image by checking the image gradient: before matching, these methods compute the pixel gradient, discard areas with low gradient, and match only the remaining high-gradient areas. A typical example is [19]. Neither approach can be applied to neural-network-based methods. For methods similar to [18], increasing the length of the input sequence increases the input size of the network, resulting in unstable ego-motion estimation. For methods similar to [19], removing low-gradient areas prevents the network from gathering the necessary information and thus hinders convergence. Therefore, a new loss function must be designed according to the correspondence between pixels after image re-projection.
The correspondence between pixels can be expressed in the form of a vector field; for a 2D image this takes the form of optical flow, that is, the movement of a point from the first frame to the second frame along the x-axis and y-axis directions. Figure 2 shows an optical flow image and its vector field representation. In general, we assume that the 2D optical flow has a gradient similar to that of the image, and smoothing the correspondence vector field enhances this consistency.
Figure 2. The optical flow field shown in image and its vector field representation [41].
However, the above assumption is valid only when the pixels undergo purely planar motion. Due to the spatial structure of the scene, the closer an object is to the camera, the larger its position change in the image when the camera moves. Therefore, it is necessary to calculate a 3D correspondence vector field of the pixels in the image and smooth this 3D field according to the 2D image.
Our method first calculates the 3D correspondence field of the target image and then smooths it according to the image pixel gradient. As shown in Figure 3, for a pixel $p_{ij} = [i, j, 1]^{T}$ in the image, its spatial coordinate $q$ can be obtained from the depth $D$ as follows:
$$q_{ij} = D(p)\, K^{-1} [i, j, 1]^{T}$$
in which $K$ denotes the camera intrinsic matrix. Let $T_{t \rightarrow t-1}$ be the ego-motion from the target view to the source view; the spatial motion of the point can then be expressed as:
$$s_{ij} = \left( \mathbb{1} - T_{t \rightarrow t-1} \right) D(p)\, K^{-1} [i, j, 1]^{T}$$
Figure 3. Calculation of consistency loss. The depth map of the target view and the ego-motion are estimated by CNN network, respectively. The 3D corresponding field of target view is calculated based on the depth information and the ego-motion. After that we smooth the 3D field according to pixel gradient of target view.
According to the above formula, the 3D correspondence field $S$ of all pixels in the image can be calculated. Smoothing $S$ according to the image pixel gradient makes the spatial motion of the points consistent with the image gradient. The spatial correspondence consistency loss $L_{consist}$ is designed as:
$$L_{consist} = \sum_{ij} \left| \partial_x S_{ij} \right| e^{-\left| \partial_x I_{ij} \right|} + \sum_{ij} \left| \partial_y S_{ij} \right| e^{-\left| \partial_y I_{ij} \right|}$$
in which $I$ denotes the source image, $S$ denotes the 3D correspondence vector field, $i$ and $j$ are the pixel indexes, and $x$ and $y$ are the image coordinate axes. This term ensures that the spatial motion of the pixels is consistent with the image gradient.
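The consistency term can be implemented with finite differences, for example as in the TensorFlow sketch below. The tensor layout (batch, height, width, channels) and the use of the mean over colour channels for the image gradient are our assumptions.

```python
import tensorflow as tf

def consistency_loss(scene_flow, image):
    """Edge-aware smoothness of the 3D correspondence field S, weighted by the
    image gradients (the L_consist term above).

    scene_flow: (B, H, W, 3) per-pixel 3D motion s_ij
    image:      (B, H, W, 3) target view used to weight the smoothness
    """
    # Finite-difference gradients along x (width) and y (height)
    flow_dx = tf.abs(scene_flow[:, :, 1:, :] - scene_flow[:, :, :-1, :])
    flow_dy = tf.abs(scene_flow[:, 1:, :, :] - scene_flow[:, :-1, :, :])
    img_dx = tf.reduce_mean(tf.abs(image[:, :, 1:, :] - image[:, :, :-1, :]),
                            axis=-1, keepdims=True)
    img_dy = tf.reduce_mean(tf.abs(image[:, 1:, :, :] - image[:, :-1, :, :]),
                            axis=-1, keepdims=True)
    # Penalize flow gradients less where the image itself has strong edges
    return (tf.reduce_mean(flow_dx * tf.exp(-img_dx)) +
            tf.reduce_mean(flow_dy * tf.exp(-img_dy)))
```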

3.4. Learning Setup

All loss functions are applied at four different scales s, from the original image resolution down to 1/8 of the image width and height. The total loss function can be expressed as:
$$L = \sum_{s} \left( \alpha L_{rec}^{s} + \beta L_{consist}^{s} + \gamma L_{sm}^{s} + \omega L_{SSIM}^{s} \right)$$
in which $\alpha$, $\beta$, $\gamma$, $\omega$ are weight parameters and $L_{sm}^{s}$ is the depth smoothness loss of [5], which is also employed to regularize the depth estimates. We use the TensorFlow framework [42] to build our neural network and train it with the Adam optimizer [43]. Training typically converges after 150 K iterations. During training, we scale the images to a resolution of 128 × 416; due to the fully convolutional structure, both the depth estimation network and the ego-motion estimation network can accept images of any size at test time.
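The aggregation over scales can be written as a simple weighted sum, as in the sketch below. The numeric weight values are placeholders, since the section does not report the values of α, β, γ, ω; the optimizer setup follows the text (Adam, learning rate 0.001 as stated in Section 4.3).

```python
import tensorflow as tf

# Placeholder weights; the paper does not list the values of alpha, beta, gamma, omega here.
ALPHA, BETA, GAMMA, OMEGA = 1.0, 0.2, 0.5, 0.15

def total_loss(per_scale_terms):
    """Sum the weighted loss terms over the four scales s.

    per_scale_terms: list of dicts with keys 'rec', 'consist', 'smooth', 'ssim',
                     one dict per scale (full resolution down to 1/8).
    """
    loss = 0.0
    for terms in per_scale_terms:
        loss += (ALPHA * terms['rec'] + BETA * terms['consist'] +
                 GAMMA * terms['smooth'] + OMEGA * terms['ssim'])
    return loss

# Optimizer described in the text: Adam with a learning rate of 0.001
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
```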

4. Experiments and Discussion

In this section, we first introduce the network structure of our model and its training details. Then we introduce the datasets and compare the performance of our method with that of other similar methods.

4.1. Network Structure

There were two sub-networks in our framework: the depth estimation network and the ego-motion estimation network. The depth estimation network predicted a one-channel depth map at four different scales from a single three-channel image, while the ego-motion estimation network predicted the six-DOF ego-motion from three consecutive images.
The depth estimation network was based on the network structure of [44], which was also adopted by [5]. As shown in Figure 4, this network structure could be divided into an encoder part and a decoder part, with a skip structure added and outputs produced at four scales. The overall network structure is shown in Table 1. Since the fully convolutional neural network did not restrict the size of the input image, the input size and output size columns of the table indicate the ratio of the feature-map edge length to that of the original image. Except for the prediction output layers, each convolutional layer was activated by the ReLU activation function. Since the raw inverse-depth output was not conducive to network calculation, we multiplied the output by 10 to keep it within an appropriate range and added 0.01 to avoid errors caused by very small values.
Figure 4. Depth estimation network structure. The network is based on an encoder-decoder architecture with 28 layers and a skip structure. It provides outputs at four scales for calculating losses at different resolutions.
Table 1. The depth estimation network structure.
The encoder included 14 layers, all of which were convolutional layers. Every two layers with the same output size formed a group. The input of the first group (conv1a and conv1b) was the original image, the convolution kernel size was 7 × 7, and the output was a feature map at 1/2 of the original resolution. Each subsequent group was similar to the first, its output feature map being 1/2 the size of its input. The dimension of the feature maps increased with the depth of the network and finally reached 512.
The decoder was more complex, including both convolutional and deconvolutional layers. Like the encoder, the decoder was divided into groups of two layers with the same output size. The first group consisted of the upconv7 and icnv7 layers. upconv7 was a deconvolution layer; unlike a convolution layer, a deconvolution layer enlarges the feature map, producing an output twice the size of its input. icnv7 was a convolutional layer whose input contained the skip structure, i.e., upconv7 was concatenated with conv6b from the encoder, thus preserving the detailed features of the shallow layers. The second group (upconv6 and icnv6) and the third group (upconv5 and icnv5) were similar in structure to the first group; after each group the feature-map size was doubled. The fourth group differed from the first three by adding the output layer pred4, which output a one-channel estimation result. The overall structure of the fifth group was similar to that of the fourth group, but its skip structure was more complicated: the convolutional layer icnv3 concatenated upconv3 and conv2b, and at the same time the output of pred4 was enlarged to pred4up and concatenated as well, ensuring that both the deep global information and the shallow detail information were preserved. The sixth group (upconv2, icnv2 and pred2) had the same structure as the fifth group. In the seventh group (upconv1, icnv1 and pred1), because there was no feature layer with the same size as the original image, the skip structure of the convolution layer icnv1 concatenated upconv1 with the enlarged result pred2up of the previous group's estimation.
The depth estimation network included neither a fully connected layer nor a pooling layer. The disadvantage of a fully connected layer is that the fixed feature-vector length limits the input size, and flattening the feature map loses the spatial characteristics of the pixels. In traditional convolutional neural networks, the pooling layer is used for downsampling, which causes information loss. Using convolutional layers with a stride of 2 to achieve downsampling circumvents both problems.
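The sketch below illustrates this encoder-decoder design in tf.keras: stride-2 convolutions instead of pooling, deconvolutions that double the resolution, skip concatenations, and the output scaling described above. It is a heavily compressed sketch with only four encoder groups and a single-scale output; the channel widths and kernel sizes are illustrative assumptions, not the exact configuration of Table 1.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_group(x, filters, kernel=3):
    """One encoder group: two convolutions with the same output size;
    downsampling is done by the stride-2 convolution instead of pooling."""
    x = layers.Conv2D(filters, kernel, strides=2, padding='same', activation='relu')(x)
    return layers.Conv2D(filters, kernel, padding='same', activation='relu')(x)

def build_depth_net(input_shape=(128, 416, 3)):
    """Compressed encoder-decoder sketch of the depth network (not the full 28-layer model)."""
    inp = layers.Input(shape=input_shape)
    # Encoder: channel width grows with depth, first group uses a 7x7 kernel
    e1 = conv_group(inp, 32, kernel=7)
    e2 = conv_group(e1, 64, kernel=5)
    e3 = conv_group(e2, 128)
    e4 = conv_group(e3, 256)
    # Decoder: deconvolution doubles the resolution, skip connections
    # reintroduce the shallow detail features
    d3 = layers.Conv2DTranspose(256, 3, strides=2, padding='same', activation='relu')(e4)
    d3 = layers.Conv2D(256, 3, padding='same', activation='relu')(layers.Concatenate()([d3, e3]))
    d2 = layers.Conv2DTranspose(128, 3, strides=2, padding='same', activation='relu')(d3)
    d2 = layers.Conv2D(128, 3, padding='same', activation='relu')(layers.Concatenate()([d2, e2]))
    d1 = layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(d2)
    d1 = layers.Conv2D(64, 3, padding='same', activation='relu')(layers.Concatenate()([d1, e1]))
    d0 = layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(d1)
    d0 = layers.Conv2D(32, 3, padding='same', activation='relu')(d0)
    # Output layer: scale by 10 and add 0.01 to keep the prediction in a
    # numerically safe range, as described in the text
    disp = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(d0)
    disp = layers.Lambda(lambda t: t * 10.0 + 0.01)(disp)
    return tf.keras.Model(inp, disp)
```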
The ego-motion estimation network had the same structure as that used in [5], as shown in Figure 5. The network took several consecutive images of the sequence as input and output the six-DOF camera motion as well as masks at four scales corresponding to the depth maps, which were used to exclude the non-rigid parts of the scene. In the ego-motion vector, the translation part was divided by the average depth so that the camera motion matched the scale of the depth map.
Figure 5. Ego-motion estimation network: the network structure is designed based on the encoder-decoder architecture. The encoder contains 5 layers, all of which are convolutional layers, for extracting image features. The specific structure is shown in Table 2.
The camera-motion estimation part of the network performed a dense-to-sparse estimation, so no deconvolution layers were adopted and it contained only three convolution layers; its structure is shown in Table 3, where N is the number of target images in the image sequence. The mask estimation part of the network performed a dense-to-dense estimation and contained five deconvolution layers and four output layers; its structure is shown in Table 4, where N is again the number of target images in the image sequence.
Table 3. Camera moving part structure of ego-motion estimation network.
Table 4. Mask estimation of partial structure of ego-motion estimation network.
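A minimal sketch of the camera-motion branch is given below, assuming three stacked input frames and two source frames. The mask branch and the division of the translation by the average depth are omitted, and the channel widths and the global-average read-out are our assumptions rather than the exact configuration of Tables 2 and 3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_pose_net(num_source=2, input_shape=(128, 416, 3 * 3)):
    """Sketch of the camera-motion branch: the target frame and its neighbours are
    concatenated along the channel axis and reduced to 6-DOF per source frame."""
    inp = layers.Input(shape=input_shape)   # 3 consecutive RGB frames stacked channel-wise
    x = inp
    # Five-layer convolutional encoder (stride 2), shared with the mask branch
    for filters in (16, 32, 64, 128, 256):
        x = layers.Conv2D(filters, 3, strides=2, padding='same', activation='relu')(x)
    # Dense-to-sparse camera-motion head: three convolutions, then spatial averaging
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(256, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(6 * num_source, 1)(x)            # 6-DOF per source frame
    pose = layers.GlobalAveragePooling2D()(x)
    pose = layers.Reshape((num_source, 6))(pose)       # (tx, ty, tz, rx, ry, rz)
    return tf.keras.Model(inp, pose)
```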

4.2. Datasets Description

In the experiments, three widely used depth estimation datasets were used to test the proposed approach: KITTI, Apolloscape [45] and Cityscapes [46].
The KITTI dataset was the largest dataset in the world for evaluating computer vision algorithms in autonomous driving scenarios. It comprised 61 scenes, including rural areas and urban highways, collected with optical lenses, cameras, LIDAR, and other hardware. There were at most 30 pedestrians and 15 cars per image, with different degrees of occlusion. The regular RGB image resolution in KITTI was 375 × 1242 and the ground-truth depth resolution was 228 × 912. The original KITTI dataset did not provide true depth maps but contained sparse 3D laser measurements captured with a Velodyne laser scanner. To evaluate on KITTI, we therefore projected the laser measurements into the image plane to generate the ground-truth depth corresponding to each original image.
The Cityscapes dataset, jointly provided by three German companies including Daimler, contains stereo vision data for more than 50 cities. It covers a rich set of scenes distinct from those in KITTI; compared with the KITTI dataset, its images are of better quality, with more diverse shooting scenes and higher resolution.
The Apolloscape dataset, provided by Baidu, covers perception, simulation scenes, and road network data. It includes hundreds of frames of high-resolution image data with per-pixel semantic segmentation and the corresponding per-pixel semantic annotation, dense point clouds, 3D images, and 3D panoramic images, and it covers more complex environments, weather, and traffic conditions.

4.3. Experiment Settings

In this paper, the KITTI 2012 dataset [16] was used to train the neural network. During training, the image resolution was set to 416 × 128. Since the network is fully convolutional, images of any size can be used in the actual test. We built the neural network with TensorFlow [42] and used the Adam [43] optimizer. The learning rate was set to 0.001, and the training process usually converged after about 150 K iterations.
For the evaluation of depth results, this paper used the same metrics and test-set split as [12]. This split included 700 images from the KITTI test dataset (visually similar images were excluded). During the assessment, the effective distances were set to 50 m and 80 m, respectively, and each method was evaluated using the error metrics of [12].
The evaluation criteria included the Absolute Relative error (Abs Rel), Square Relative error (Sq Rel), Root Mean Squared Error (RMSE), and Root Mean Squared logarithmic Error (RMSE log). For the absolute relative error and square relative error, this paper adopted the calculation method in [20]. RMSE can be calculated by the following formula:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$
where $n$ is the total number of pixels, and $y_i$ and $\hat{y}_i$ are the actual and estimated depths, respectively. RMSE log can be calculated according to the following formula:
$$RMSE_{log} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(y_i + 1) - \log(\hat{y}_i + 1) \right)^2}$$
The parameters were the same as Formula (12).
For camera pose estimation, the two sequences Seq. 09 and Seq. 10 of the KITTI dataset were used in this paper. The evaluation metric was the Absolute Trajectory Error (ATE), which can be calculated according to the following formula:
$$ATE = \left\| Q^{-1} S P \right\|$$
where $Q \in SE(3)$ is the actual pose of the camera, $P \in SE(3)$ is the estimated camera pose, and $S \in Sim(3)$ is the similarity transformation from the estimated pose to the actual pose.
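For reference, a NumPy sketch of these metrics under the definitions above is given below. The validity masking, the +1 offset inside the logarithm (as written in the formula), and the reduction of the ATE residual to its translational component are our interpretation, not the authors' evaluation code.

```python
import numpy as np

def depth_errors(gt, pred, max_depth=80.0, min_depth=1e-3):
    """Abs Rel, Sq Rel, RMSE and RMSE log over valid ground-truth pixels."""
    mask = (gt > min_depth) & (gt < max_depth)   # effective distance cap (50 m or 80 m)
    gt, pred = gt[mask], pred[mask]
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt + 1.0) - np.log(pred + 1.0)) ** 2))
    return abs_rel, sq_rel, rmse, rmse_log

def absolute_trajectory_error(Q, P, S=np.eye(4)):
    """ATE = ||Q^{-1} S P|| for a pair of 4x4 homogeneous poses; here the norm is
    taken over the translational component of the residual transform."""
    residual = np.linalg.inv(Q) @ S @ P
    return np.linalg.norm(residual[:3, 3])
```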

4.4. Comparisons with Other Methods

To demonstrate the advantages of our method, we compared it with several classical methods, including the supervised methods [13,20] and the unsupervised methods [4,5]. The ground-truth depth was obtained from LIDAR data by projecting the point cloud onto the image plane.

4.4.1. Evaluation of Depth Estimation

Table 5 compares the results of our work with existing work on estimating the depth of the scene. In Table 5, “Ours” and “Ours consis” indicate the results without and with the consistency constraint, respectively. The experimental results show that our method was significantly better than the supervised learning methods, indicating that it avoids the degradation those methods suffer from poor-quality supervision data. Compared with the baseline work [5], our results reduced the average error from 0.208 to 0.169, which reflects the effectiveness of our method. Our results still show a gap with the 0.148 achieved by the state-of-the-art method of Godard [15]; since their method uses image pairs with a known camera baseline as supervision, we believe the gap is due to our method not further constraining the ego-motion. In addition, the consistency loss further reduced the error from 0.176 to 0.169, which reflects the effect of our loss term. Figure 6 provides a qualitative visual comparison. The experimental results reflect the ability of our method to understand 3D scenes, i.e., the method successfully captures the 3D consistency of different scenes.
Table 5. Depth evaluation results for the KITTI test set, K indicates training on KITTI, and C indicates training on Cityscapes [46]. Ours indicates that the consistency loss is not used, and Ours consis indicates the result using the consistency loss term.
Figure 6. Qualitative results on KITTI [16] test set. Our method captures details in thin structures and preserves consistently high-quality predictions both in close and distant regions.

4.4.2. Evaluation of Ego-Motion

During training, the result of motion estimation greatly affects the accuracy of depth estimation. To evaluate the accuracy of our method in camera motion estimation, we conducted an experiment on the KITTI odometry split. This dataset contains 11 video sequences with corresponding ground-truth poses obtained from IMU/GPS. We used sequences 00–08 to train the model and sequences 09–10 to evaluate it. Additionally, we compared our method with ORB-SLAM [6], a typical visual odometry system. ORB-SLAM is an indirect SLAM method that calculates camera motion and scene depth through feature point matching; it has a bundle-adjustment back-end based on graph optimization, which further constrains the ego-motion using non-adjacent images. We therefore compared our approach with two variants: (1) “ORB-SLAM (short)”, which takes only five frames as input and has no graph optimization; and (2) “ORB-SLAM (full)”, which runs the entire pipeline on all frames. As shown in Table 6, our method outperformed other unsupervised learning methods, approaching the ORB-SLAM result with global optimization.
Table 6. Absolute Track Error (ATE) tested on the KITTI odometry dataset [16]. Ours indicates that the consistency loss is not used, and Ours consis indicates the result using the consistency loss term.

4.4.3. Depth Results on Apollo and Cityscapes

To demonstrate the versatility of our method, we directly applied the model trained on KITTI to the Apolloscape stereo test set [45] and the Cityscapes test set. Our model still produced accurate predictions even though the scene structures were more complex. As shown in Figure 7 and Figure 8, our method recovers more details.
Figure 7. Example predictions by our method on the Apollo dataset [45]. The model is trained only on the KITTI dataset but also performs well in other scenes. Compared with [5], our method recovers more details.
Figure 8. Example predictions by our method on Cityscapes dataset [46]. Our method can predict high quality depth information from a single image, even in areas where the laser scanning system cannot measure very well.

5. Conclusions

We improve existing unsupervised depth estimation methods by enhancing the consistency between the 3D correspondence vector field and the 2D image. This effectively improves the prediction results and exceeds existing methods of the same type. The experiments on the KITTI dataset demonstrate that our method outperforms previous unsupervised learning methods as well as supervised learning methods, and the results on the Apolloscape and Cityscapes datasets demonstrate the strong generality of the proposed approach.
Compared with the latest methods that involve flow prediction, our method predicts only the camera pose change and the scene depth structure and does not involve the prediction of image flow, so its performance can only approach theirs. Recent studies have demonstrated the ability of deep neural networks in both depth estimation and flow estimation, which suggests great potential for deep learning methods to address the depth and spatial motion of moving objects in the estimated scene.

Author Contributions

Conceptualization, F.J.; software, Y.Z.; methodology, F.J. and Y.Z.; supervision, F.J. and S.W.; validation, C.W.; writing—original draft, Y.Z.; writing—review and editing, Y.Y. and S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Key Research and Development Program of China (2020YFC0832500), Key R&D Project of Guangdong Province (2020B010164002) and Beijing Major Science and Technology Projects (Z171100005117002).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. Apolloscapes: https://github.com/ApolloScapeAuto/dataset-api (20 December 2020). KITTI: http://www.cvlibs.net/datasets/KITTI/index.php (20 December 2020). Cityscapes: https://www.cityscapes-dataset.com (20 December 2020).

Acknowledgments

The authors would also like to thank the peer researchers who made their source code available to the whole community.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AMP	Autonomous Moving Platforms
5G	Fifth Generation
SfM	Structure from Motion
SLAM	Simultaneous Localization and Mapping
CNN	Convolutional Neural Network
CRF	Conditional Random Field
DCNN	Deep Convolutional Neural Network
RNN	Recurrent Neural Network
Abs Rel	Absolute Relative error
Sq Rel	Square Relative error
RMSE	Root Mean Squared Error
RMSE log	Root Mean Squared logarithmic Error
ATE	Absolute Trajectory Error
6-DOF	Six Degrees of Freedom
2D	Two-dimensional
3D	Three-dimensional

References

  1. Wymeersch, H.; Seco-Granados, G.; Destino, G.; Dardari, D.; Tufvesson, F. 5G mmWave positioning for vehicular networks. IEEE Wirel. Commun. 2017, 24, 80–86. [Google Scholar] [CrossRef]
  2. Lu, Z.; Huang, Y.C.; Bangjun, C. A Study for Application in Vehicle Networking and Driverless Driving. In Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, Beijing, China, 6–8 December 2019; pp. 264–267. [Google Scholar]
  3. Zhao, Y.; Jin, F.; Wang, M.; Wang, S. Knowledge Graphs Meet Geometry for Semi-supervised Monocular Depth Estimation. In Proceedings of the International Conference on Knowledge Science, Engineering and Management, Hangzhou, China, 28–30 August 2020; pp. 40–52. [Google Scholar]
  4. Garg, R.; Kumar, B.G.V.; Carneiro, G.; Reid, I.D. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 740–756. [Google Scholar]
  5. Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6619. [Google Scholar]
  6. Murartal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot. 2015, 31, 1147–1163. [Google Scholar] [CrossRef]
  7. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625. [Google Scholar] [CrossRef] [PubMed]
  8. Wang, J.; Liu, Z.; Xie, R.; Ran, L. Radar HRRP Target Recognition Based on Dynamic Learning with Limited Training Data. Remote Sens. 2021, 13, 750. [Google Scholar] [CrossRef]
  9. Kazimierski, W.; Zaniewicz, G. Determination of Process Noise for Underwater Target Tracking with Forward Looking Sonar. Remote Sens. 2021, 13, 1014. [Google Scholar] [CrossRef]
  10. Li, B.; Gan, Z.; Chen, D.; Sergey Aleksandrovich, D. UAV Maneuvering Target Tracking in Uncertain Environments Based on Deep Reinforcement Learning and Meta-Learning. Remote Sens. 2020, 12, 3789. [Google Scholar] [CrossRef]
  11. Guo, J.; Bai, C.; Guo, S. A Review of Monocular Depth Estimation Based on Deep Learning. Unmanned Syst. Technol. 2019, 3. Available online: https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CJFD&dbname=CJFDLAST2019&filename=UMST201902003&v=LxXxs2LYM%25mmd2FrpCJsoTtiaExYvBg0cRUvrHeXluBqPeql%25mmd2FO67HDuhfchKopV1yVha7 (accessed on 10 March 2021).
  12. Eigen, D.; Fergus, R. Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-scale Convolutional Architecture. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 2650–2658. [Google Scholar]
  13. Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 2024–2039. [Google Scholar] [CrossRef] [PubMed]
  14. Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5667–5675. [Google Scholar]
  15. Godard, C.; Aodha, O.M.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Proceedings of the Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6602–6611. [Google Scholar]
  16. Geiger, A. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012. [Google Scholar]
  17. Taketomi, T.; Uchiyama, H.; Ikeda, S. Visual SLAM algorithms: A survey from 2010 to 2016. IPSJ Trans. Comput. Vis. Appl. 2017, 9, 16. [Google Scholar] [CrossRef]
  18. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 6–13 November 2011. [Google Scholar]
  19. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-Scale Direct Monocular SLAM. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014. [Google Scholar]
  20. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
  21. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  23. Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. In Proceedings of the International Conference on 3D Vision, Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
  24. Wang, P.; Shen, X.; Lin, Z.; Cohen, S.; Price, B.; Yuille, A.L. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2800–2809. [Google Scholar]
  25. Jafari, O.H.; Groth, O.; Kirillov, A.; Yang, M.Y.; Rother, C. Analyzing modular CNN architectures for joint depth prediction and semantic segmentation. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 4620–4627. [Google Scholar]
  26. Fu, H.; Gong, M.; Wang, C.; Batmanghelich, K.; Tao, D. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2002–2011. [Google Scholar]
  27. Mancini, M.; Costante, G.; Valigi, P.; Ciarfuglia, T.A. Fast robust monocular depth estimation for obstacle detection with fully convolutional networks. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 4296–4303. [Google Scholar]
  28. Liu, F.; Shen, C.; Lin, G. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5162–5170. [Google Scholar]
  29. Li, J.; Klein, R.; Yao, A. A two-streamed network for estimating fine-scaled depth maps from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3372–3380. [Google Scholar]
  30. Oliveira, G.L.; Radwan, N.; Burgard, W.; Brox, T. Topometric localization with deep learning. In Robotics Research; Springer: Berlin/Heidelberg, Germany, 2020; pp. 505–520. [Google Scholar]
  31. Clark, R.; Wang, S.; Wen, H.; Markham, A.; Trigoni, N. VINet: Visual-Inertial Odometry as a Sequence-to-Sequence Learning Problem. In Proceedings of the National Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; pp. 3995–4001. [Google Scholar]
  32. Repala, V.K.; Dubey, S.R. Dual cnn models for unsupervised monocular depth estimation. In Proceedings of the International Conference on Pattern Recognition and Machine Intelligence, Tezpur, India, 17–20 December 2019; pp. 209–217. [Google Scholar]
  33. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3828–3838. [Google Scholar]
  34. Rezende, D.J.; Eslami, S.; Mohamed, S.; Battaglia, P.; Jaderberg, M.; Heess, N. Unsupervised learning of 3d structure from images. arXiv 2016, arXiv:1607.00662. [Google Scholar]
  35. Tatarchenko, M.; Dosovitskiy, A.; Brox, T. Single-view to Multi-view: Reconstructing Unseen Views with a Convolutional Network. arXiv 2015, arXiv:1511.06702. [Google Scholar]
  36. Vijayanarasimhan, S.; Ricco, S.; Schmid, C.; Sukthankar, R.; Fragkiadaki, K. Sfm-net: Learning of structure and motion from video. arXiv 2017, arXiv:1704.07804. [Google Scholar]
  37. Yin, Z.; Shi, J. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1983–1992. [Google Scholar]
  38. Garg, R.; Wadhwa, N.; Ansari, S.; Barron, J.T. Learning single camera depth estimation using dual-pixels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7628–7637. [Google Scholar]
  39. Wang, C.; Buenaposada, J.M.; Zhu, R.; Lucey, S. Learning Depth from Monocular Videos Using Direct Methods. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2022–2030. [Google Scholar]
  40. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  41. Patait, A. An Introduction to the NVIDIA Optical Flow SDK. Available online: https://developer.nvidia.com/blog/an-introduction-to-the-nvidia-optical-flow-sdk/ (accessed on 13 February 2019).
  42. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv 2015, arXiv:1603.04467. [Google Scholar]
  43. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  44. Mayer, N.; Ilg, E.; Hausser, P.; Fischer, P.; Cremers, D.; Dosovitskiy, A.; Brox, T. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4040–4048. [Google Scholar]
  45. Wang, P.; Huang, X.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The apolloscape open dataset for autonomous driving and its application. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2702–2719. [Google Scholar] [CrossRef] [PubMed]
  46. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
