Self-Supervised Depth Completion Based on Multi-Modal Spatio-Temporal Consistency
Abstract
1. Introduction
2. Methods
2.1. Depth-Temporal Consistency Constraint
2.1.1. Spatial Translation
2.1.2. Depth-Temporal Consistency Module
2.2. Photometric-Temporal Consistency Constraint
2.2.1. Pixel Warp
2.2.2. Photometric Reproject Auto-Mask
2.3. Automatic Feature Points Refinement
3. Experimental Evaluation
3.1. Datasets and Setup
Algorithm 1: Dataset preprocessing.
Algorithm 2: Flow chart of the self-supervised sequence depth constraint training algorithm.
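As a rough illustration of the kind of preprocessing covered by Algorithm 1, the NumPy sketch below projects a single LiDAR scan into a sparse depth map given the camera intrinsics and the LiDAR-to-camera extrinsic. It is a minimal sketch under assumed conventions, not the authors' exact procedure, and all function and argument names (lidar_to_sparse_depth, T_cam_lidar, K) are hypothetical.

```python
import numpy as np

def lidar_to_sparse_depth(points_lidar, T_cam_lidar, K, height, width):
    """Project a LiDAR scan into a sparse depth map (metres).

    points_lidar : (N, 3) LiDAR points in the sensor frame.
    T_cam_lidar  : (4, 4) LiDAR-to-camera extrinsic matrix.
    K            : (3, 3) camera intrinsic matrix.
    """
    # Transform the points into the camera frame.
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 0]

    # Perspective projection onto the image plane.
    uvz = (K @ pts_cam.T).T
    u = np.round(uvz[:, 0] / uvz[:, 2]).astype(int)
    v = np.round(uvz[:, 1] / uvz[:, 2]).astype(int)
    z = pts_cam[:, 2]

    # Discard projections that fall outside the image.
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[valid], v[valid], z[valid]

    # When several points hit the same pixel, keep the nearest one:
    # far points are written first and then overwritten by near points.
    depth = np.zeros((height, width), dtype=np.float32)
    order = np.argsort(-z)
    depth[v[order], u[order]] = z[order]
    return depth
```

Keeping the nearest point when several LiDAR returns project to the same pixel is one simple way to limit the foreground/background mixing at occlusions that is discussed in Section 2.1.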
3.2. Results of the Ablation Experiments
- (1) Depth-temporal consistency constraint (DC): Many works feed the input depth map back into training so that the network does not lose too much of the measured depth. However, because of the displacement between the RGB camera and the LiDAR, background and foreground depths are mixed in occluded regions, so retaining the input depth also retains these erroneous background depths. Our method improves on this practice by stitching the sequence depth and using it in place of the current input depth.
- (2) Photometric-temporal consistency constraint (PC): The sequence depth constraint is too sparse to provide a global constraint, so we introduce the sequence photometric constraint. As shown in Table 1, the sequence photometric constraint is also indispensable, which again demonstrates the advantage of the sequence multi-modal constraint. Within the DC constraint, the temporal depth data cannot constrain depth completion directly because it contains displaced depth points; these points degrade depth completion, as the subsequent experiments also show. We therefore introduce the photometric reprojection auto-mask to remove these erroneous points, and the experiments confirm that the auto-mask is useful.
- (3) Multi-modal spatio-temporal consistency constraint (MSC): The MSC constraint combines DC and PC and comprises three loss functions. We tested the effect of their weights on two network structures, S2D [6] and KBNet [31]. As can be observed in Table 1, the network performs best with the weights set to 0.01, 0.1, and 1.0; after testing several weight ratios, we adopted these values for all following experiments (a sketch of such a weighted combination is given after this list).
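The precise loss terms are defined in Section 2; the PyTorch sketch below is only a hedged illustration of how three weighted terms could be combined with the values 0.01, 0.1, and 1.0 from Table 1. The inputs warped_sparse_depth, warped_adj_rgb, and auto_mask stand in for the outputs of the spatial translation, pixel warp, and auto-mask steps; these names, the pairing of weights with terms, and the use of an edge-aware smoothness regularizer as the third term are all assumptions made for illustration.

```python
import torch

def edge_aware_smoothness(depth, image):
    """Edge-aware smoothness regularizer commonly used in self-supervised
    depth pipelines; its presence as the third term is an assumption."""
    d_dx = torch.abs(depth[:, :, :, :-1] - depth[:, :, :, 1:])
    d_dy = torch.abs(depth[:, :, :-1, :] - depth[:, :, 1:, :])
    i_dx = torch.mean(torch.abs(image[:, :, :, :-1] - image[:, :, :, 1:]), dim=1, keepdim=True)
    i_dy = torch.mean(torch.abs(image[:, :, :-1, :] - image[:, :, 1:, :]), dim=1, keepdim=True)
    return (d_dx * torch.exp(-i_dx)).mean() + (d_dy * torch.exp(-i_dy)).mean()

def depth_consistency_loss(pred_depth, warped_sparse_depth, mask):
    """L1 error against the adjacent sparse depth warped into the current
    frame, evaluated only where a point exists and the auto-mask keeps it."""
    valid = (warped_sparse_depth > 0) & mask
    if valid.sum() == 0:
        return pred_depth.new_zeros(())
    return torch.abs(pred_depth - warped_sparse_depth)[valid].mean()

def photometric_consistency_loss(curr_rgb, warped_adj_rgb, mask):
    """L1 photometric error between the current image and the adjacent image
    reprojected into the current view with the predicted depth."""
    return (torch.abs(curr_rgb - warped_adj_rgb) * mask.float()).mean()

def msc_loss(pred_depth, warped_sparse_depth, curr_rgb, warped_adj_rgb,
             auto_mask, weights=(0.01, 0.1, 1.0)):
    """Weighted multi-modal spatio-temporal consistency loss (sketch only).

    Only the weight values (0.01, 0.1, 1.0) come from Table 1; which weight
    belongs to which term is an assumption made for illustration.
    """
    l_smooth = edge_aware_smoothness(pred_depth, curr_rgb)
    l_depth = depth_consistency_loss(pred_depth, warped_sparse_depth, auto_mask)
    l_photo = photometric_consistency_loss(curr_rgb, warped_adj_rgb, auto_mask)
    return weights[0] * l_smooth + weights[1] * l_depth + weights[2] * l_photo
```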
- (1) Experiments show that performance in dark, low-texture regions and on distant objects is significantly improved (➀ ➁ ➂ ➃).
- (2) The sequence depth replaces the input depth as a constraint, so the network does not blindly preserve the input depth together with its erroneous points. In addition, our photometric similarity estimation module filters out occluded points in the adjacent depth (a minimal sketch of such a filter follows this list). Compared with the scheme without the MSC constraint, our method reduces the influence of displaced points in the sparse depth input on depth completion (➄ ➅).
- (3) Our method adapts better to dynamic targets (➃ ➆).
- (4) Our method strengthens the network's fusion of multi-modal information, so large numbers of residual sparse points no longer remain in the output image (➀ ➁ ➄ ➅ ➆).
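The photometric reprojection auto-mask referred to above (Section 2.2.2) is sketched below in one plausible per-pixel form; this is an assumption for illustration rather than the authors' exact rule. A warped sparse depth point is kept only where reprojecting the adjacent image with that depth explains the current image better than applying no warp at all, so occluded and dynamic points tend to be rejected.

```python
import torch

def photometric_automask(curr_rgb, adj_rgb_warped, adj_rgb, warped_sparse_depth):
    """Filter warped sparse depth points by photometric agreement (sketch).

    curr_rgb            : (B, 3, H, W) current frame.
    adj_rgb_warped      : (B, 3, H, W) adjacent frame reprojected into the
                          current view using the warped depth and relative pose.
    adj_rgb             : (B, 3, H, W) adjacent frame, unwarped.
    warped_sparse_depth : (B, 1, H, W) adjacent sparse depth warped into the
                          current frame (0 where no point lands).
    Returns a boolean mask that is True where a depth point exists and the
    reprojection error is lower than the "do nothing" error, i.e. the point is
    photometrically consistent and likely not occluded or dynamic.
    """
    err_reproj = torch.abs(curr_rgb - adj_rgb_warped).mean(dim=1, keepdim=True)
    err_static = torch.abs(curr_rgb - adj_rgb).mean(dim=1, keepdim=True)
    has_point = warped_sparse_depth > 0
    return has_point & (err_reproj < err_static)
```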
3.3. Results of the Comparative Experiments
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of the 2011 10th IEEE International Symposium on Mixed and Augmented Reality, Basel, Switzerland, 26–29 October 2011; pp. 127–136.
2. Zhang, J.; Singh, S. LOAM: Lidar odometry and mapping in real-time. In Proceedings of the Robotics: Science and Systems, Berkeley, CA, USA, 12–16 July 2014; Volume 2, pp. 1–9.
3. Li, Y.; Le Bihan, C.; Pourtau, T.; Ristorcelli, T.; Ibanez-Guzman, J. Coarse-to-fine segmentation on lidar point clouds in spherical coordinate and beyond. IEEE Trans. Veh. Technol. 2020, 69, 14588–14601.
4. Zhou, H.; Zou, D.; Pei, L.; Ying, R.; Liu, P.; Yu, W. StructSLAM: Visual SLAM with building structure lines. IEEE Trans. Veh. Technol. 2015, 64, 1364–1375.
5. Song, Z.; Lu, J.; Yao, Y.; Zhang, J. Self-Supervised Depth Completion From Direct Visual-LiDAR Odometry in Autonomous Driving. IEEE Trans. Intell. Transp. Syst. 2021, 23, 11654–11665.
6. Ma, F.; Cavalheiro, G.V.; Karaman, S. Self-supervised sparse-to-dense: Self-supervised depth completion from lidar and monocular camera. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 3288–3295.
7. Cui, Y.; Chen, R.; Chu, W.; Chen, L.; Tian, D.; Li, Y.; Cao, D. Deep learning for image and point cloud fusion in autonomous driving: A review. IEEE Trans. Intell. Transp. Syst. 2021, 23, 722–739.
8. Uhrig, J.; Schneider, N.; Schneider, L.; Franke, U.; Brox, T.; Geiger, A. Sparsity invariant CNNs. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 11–20.
9. Jaritz, M.; De Charette, R.; Wirbel, E.; Perrotton, X.; Nashashibi, F. Sparse and dense data with CNNs: Depth completion and semantic segmentation. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 52–60.
10. Eldesokey, A.; Felsberg, M.; Khan, F.S. Propagating confidences through CNNs for sparse data regression. arXiv 2018, arXiv:1805.11913.
11. Eldesokey, A.; Felsberg, M.; Khan, F.S. Confidence propagation through CNNs for guided sparse depth regression. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2423–2436.
12. Yan, L.; Liu, K.; Belyaev, E. Revisiting sparsity invariant convolution: A network for image guided depth completion. IEEE Access 2020, 8, 126323–126332.
13. Huang, Z.; Fan, J.; Cheng, S.; Yi, S.; Wang, X.; Li, H. HMS-Net: Hierarchical multi-scale sparsity-invariant network for sparse depth completion. IEEE Trans. Image Process. 2019, 29, 3429–3441.
14. Ma, F.; Karaman, S. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 4796–4803.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
16. Wei, M.; Zhu, M.; Zhang, Y.; Sun, J.; Wang, J. An Efficient Information-Reinforced Lidar Deep Completion Network without RGB Guided. Remote Sens. 2022, 14, 4689.
17. Hu, M.; Wang, S.; Li, B.; Ning, S.; Fan, L.; Gong, X. PENet: Towards precise and efficient image guided depth completion. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13656–13662.
18. Li, A.; Yuan, Z.; Ling, Y.; Chi, W.; Zhang, C. A multi-scale guided cascade hourglass network for depth completion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 32–40.
19. Liu, L.; Song, X.; Lyu, X.; Diao, J.; Wang, M.; Liu, Y.; Zhang, L. FCFR-Net: Feature fusion based coarse-to-fine residual learning for depth completion. arXiv 2020, arXiv:2012.08270.
20. Zhang, Y.; Funkhouser, T. Deep depth completion of a single RGB-D image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 175–185.
21. Qiu, J.; Cui, Z.; Zhang, Y.; Zhang, X.; Liu, S.; Zeng, B.; Pollefeys, M. DeepLiDAR: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3313–3322.
22. Xu, Y.; Zhu, X.; Shi, J.; Zhang, G.; Bao, H.; Li, H. Depth completion from sparse lidar data with depth-normal constraints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2811–2820.
23. Nazir, D.; Liwicki, M.; Stricker, D.; Afzal, M.Z. SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion. arXiv 2022, arXiv:2204.13635.
24. Yue, J.; Wen, W.; Han, J.; Hsu, L.T. 3D Point Clouds Data Super Resolution-Aided LiDAR Odometry for Vehicular Positioning in Urban Canyons. IEEE Trans. Veh. Technol. 2021, 70, 4098–4112.
25. Cheng, X.; Wang, P.; Yang, R. Learning depth with convolutional spatial propagation network. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2361–2379.
26. Cheng, X.; Wang, P.; Guan, C.; Yang, R. CSPN++: Learning context and resource aware convolutional spatial propagation networks for depth completion. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 10615–10622.
27. Yang, Y.; Wong, A.; Soatto, S. Dense depth posterior (DDP) from single image and sparse range. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3353–3362.
28. Shivakumar, S.S.; Nguyen, T.; Miller, I.D.; Chen, S.W.; Kumar, V.; Taylor, C.J. DFuseNet: Deep fusion of RGB and sparse depth information for image guided dense depth completion. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Long Beach, CA, USA, 15–20 June 2019; pp. 13–20.
29. Feng, Z.; Jing, L.; Yin, P.; Tian, Y.; Li, B. Advancing self-supervised monocular depth learning with sparse LiDAR. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022; pp. 685–694.
30. Choi, J.; Jung, D.; Lee, Y.; Kim, D.; Manocha, D.; Lee, D. SelfDeco: Self-supervised monocular depth completion in challenging indoor environments. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 467–474.
31. Wong, A.; Soatto, S. Unsupervised depth completion with calibrated backprojection layers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 12747–12756.
32. Wong, A.; Fei, X.; Tsuei, S.; Soatto, S. Unsupervised depth completion from visual inertial odometry. IEEE Robot. Autom. Lett. 2020, 5, 1899–1906.
33. Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3828–3838.
34. Ku, J.; Harakeh, A.; Waslander, S.L. In defense of classical image processing: Fast depth completion on the CPU. In Proceedings of the 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, 8–10 May 2018; pp. 16–22.
| Method | Weight 1 | Weight 2 | Weight 3 | Encoder Layers | RMSE [mm] |
|---|---|---|---|---|---|
| √ | 0.010 | 0.1 | | 18 | 4885.72 |
| | | 0.1 | 1.0 | 18 | 2299.49 |
| x | 0.010 | 0.1 | 1.0 | 18 | 1379.70 |
| √ | 0.010 | 0.1 | 1.0 | 18 | 1316.93 |
| x | 0.010 | 0.1 | 1.0 | 34 | 1241.58 |
| √ | 0.010 | 0.1 | 1.0 | 34 | 1212.69 |
| x | 0.010 | 0.1 | 1.0 | | 1527.37 |
| √ | 0.010 | 0.1 | 1.0 | | 1289.67 |
| √ | 0.001 | 0.1 | 1.0 | 18 | 1503.27 |
| √ | 0.010 | 0.1 | 0.1 | 18 | 1577.12 |
| Method | Input | Weight 1 | Weight 2 | Weight 3 | Encoder Layers | RMSE [mm] |
|---|---|---|---|---|---|---|
| 1 | | 0.01 | 0.1 | 1 | 34 | 1212.69 |
| 2 | | 0.01 | 0.1 | 1 | 34 | 1264.01 |
| 3 | | 0.01 | 0.1 | 1 | 34 | 1255.51 |
| Network | | Pose | RMSE [mm] |
|---|---|---|---|
| s2dNet | x | PnP(Ma) | 1476.76 |
| s2dNet | x | PnP(Ma) | 1379.70 |
| s2dNet | √ | PnP(Ma) | 1322.37 |
| s2dNet | √ | PnP(AFPR) | 1316.93 |
| kbNet | x | PnP(Ma) | 1495.51 |
| kbNet | x | PnP(Ma) | 1361.9 |
| kbNet | √ | PnP(Ma) | 1311.12 |
| kbNet | √ | PnP(AFPR) | 1289.67 |
| Method | Pose | RMSE [mm] |
|---|---|---|
| KITTI 2012 Depth Completion Validation Dataset | | |
| S2D [6] | PnP | 1342.33 |
| DepthComp [5] | PnP | 1330.88 |
| DepthComp | PoseNet | 1282.81 |
| SelfDeco [30] | PoseNet | 1212.89 |
| KBNet (with PnP) | PnP | 1289.67 |
| Ours | PnP | 1212.69 |
| KITTI 2012 Depth Completion Test Dataset | | |
| S2D | PnP | 1299.85 |
| IP-Basic [34] | PnP | 1288.46 |
| KBNet (with PnP) | PnP | 1223.59 |
| DFuseNet [28] | Stereo | 1206.66 |
| DDP [27] | PoseNet | 1263.19 |
| DepthComp | PoseNet | 1216.26 |
| VOICED (VGG8) [32] | PoseNet | 1164.58 |
| VOICED (VGG11) | PoseNet | 1169.97 |
| Ours | PnP | 1156.78 |
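For reference, the RMSE values in the tables above follow the usual KITTI depth completion convention: root-mean-square error in millimetres computed only over pixels that have ground-truth depth. A minimal NumPy sketch of that metric (not the official benchmark code) is:

```python
import numpy as np

def rmse_mm(pred_depth_m, gt_depth_m):
    """Root-mean-square error in millimetres over valid ground-truth pixels,
    as commonly reported on the KITTI depth completion benchmark."""
    valid = gt_depth_m > 0                      # 0 marks pixels without ground truth
    diff_mm = (pred_depth_m[valid] - gt_depth_m[valid]) * 1000.0
    return float(np.sqrt(np.mean(diff_mm ** 2)))
```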