# Occlusion-Aware Unsupervised Learning of Monocular Depth, Optical Flow and Camera Pose with Geometric Constraints

^{1}

^{2}

^{*}

## Abstract

**:**

## 1. Introduction

- Jointly learn an unsupervised deep neural network from monocular videos that predicts depth, optical flow, and camera motion simultaneously at training time coupled by combined loss term.
- Modeling occlusion all through entire process explicitly through bidirectional flow from consecutive frames to make the model occlusion-aware and non-occluded region better constrained.
- Mutually supervise each component of the network in use of both 2D and 3D geometric constraints combined with occlusion module. The 2D image reconstruction loss takes optical flow into consideration. The 3D constraint contains two part: 3D point alignment loss and a novel 3D flow loss.

## 2. Related Work

#### 2.1. Deep Learning vs. Geometry for Scene Understanding

#### 2.2. Deep Learning with Geometry for Scene Understanding

#### 2.2.1. Supervised Videos Learning

#### 2.2.2. Unsupervised Videos Learning

## 3. Method

#### 3.1. System Overview

#### 3.2. Bidirectional Flow Loss as Occlusion Mask

#### 3.3. 2D Image Reconstruction Loss as Supervision

#### 3.4. Mutually Supervised 3D Geometric Consistency Loss

#### 3.4.1. 3D Point Alignment Loss

#### 3.4.2. 3D Flow Loss

#### 3.5. Smoothness Constraint

## 4. Experiments

#### 4.1. Implementation Specification

#### 4.1.1. Network Architecture

#### 4.1.2. Training Details

#### 4.2. Experimental Evaluation

#### 4.2.1. Depth Evaluation

#### 4.2.2. Optical Flow Evaluation

#### 4.2.3. Pose Evaluation

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

- Yang, L.; Cheng, H.; Hao, J.; Ji, Y.; Kuang, Y. A Survey on Media Interaction in Social Robotics; Springer: Cham, Switzerland, 2015; pp. 181–190. [Google Scholar]
- Chen, C.; Seff, A.; Kornhauser, A.; Xiao, J. DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; Volume 00, pp. 2722–2730. [Google Scholar]
- Carmigniani, J.; Furht, B.; Anisetti, M.; Ceravolo, P.; Damiani, E.; Ivkovic, M. Augmented reality technologies, systems and applications. Multimed. Tools Appl.
**2011**, 51, 341–377. [Google Scholar] [CrossRef] - Torresani, L.; Hertzmann, A.; Bregler, C. Nonrigid Structure-from-Motion: Estimating Shape and Motion with Hierarchical Priors. IEEE Trans. Pattern Anal. Mach. Intell.
**2008**, 30, 878–892. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Wu, C. Towards Linear-Time Incremental Structure from Motion. In Proceedings of the International Conference on 3dtv-Conference, Seattle, WA, USA, 29 June–1 July 2013; pp. 127–134. [Google Scholar]
- Agudo, A.; Morenonoguer, F.; Calvo, B.; Montiel, J.M. Sequential Non-Rigid Structure from Motion Using Physical Priors. IEEE Trans. Pattern Anal. Mach. Intell.
**2016**, 38, 979–994. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Davison, A.J.; Reid, I.D.; Molton, N.D.; Stasse, O. MonoSLAM: Real-Time Single Camera SLAM. IEEE Trans. Pattern Anal. Mach. Intell.
**2007**, 29, 1052–1067. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Mur-Artal, R.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Trans. Robot.
**2015**, 31, 1147–1163. [Google Scholar] [CrossRef] [Green Version] - Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot.
**2017**, 33, 1255–1262. [Google Scholar] [CrossRef] [Green Version] - Eigen, D.; Puhrsch, C.; Fergus, R. Depth Map Prediction from a Single Image using a Multi-Scale Deep Network. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2366–2374. [Google Scholar]
- Godard, C.; Aodha, O.M.; Brostow, G.J. Unsupervised Monocular Depth Estimation with Left-Right Consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 6602–6611. [Google Scholar]
- Kendall, A.; Martirosyan, H.; Dasgupta, S.; Henry, P. End-to-End Learning of Geometry and Context for Deep Stereo Regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 66–75. [Google Scholar]
- Brahmbhatt, S.; Gu, J.; Kim, K.; Hays, J.; Kautz, J. MapNet: Geometry-Aware Learning of Maps for Camera Localization. arXiv
**2017**, arXiv:1712.03342. [Google Scholar] - Dosovitskiy, A.; Fischery, P.; Ilg, E.; Hausser, P.; Hazirbas, C.; Golkov, V.; Smagt, P.V.D.; Cremers, D.; Brox, T. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2758–2766. [Google Scholar]
- Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6612–6619. [Google Scholar]
- Ummenhofer, B.; Zhou, H.; Uhrig, J.; Mayer, N.; Ilg, E.; Dosovitskiy, A.; Brox, T. DeMoN: Depth and Motion Network for Learning Monocular Stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5622–5631. [Google Scholar]
- Vijayanarasimhan, S.; Ricco, S.; Schmid, C.; Sukthankar, R.; Fragkiadaki, K. SfM-Net: Learning of Structure and Motion from Video. arXiv
**2017**, arXiv:1704.07804. [Google Scholar] - Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. arXiv
**2018**, arXiv:1802.05522. [Google Scholar] - Yin, Z.; Shi, J. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. arXiv
**2018**, arXiv:1803.02276. [Google Scholar] - Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; Brox, T. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1647–1655. [Google Scholar]
- Garg, R.; Vijay, K.B.G.; Carneiro, G.; Reid, I. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 740–756. [Google Scholar]
- Li, R.; Wang, S.; Long, Z.; Gu, D. UnDeepVO: Monocular Visual Odometry through Unsupervised Deep Learning. arXiv
**2017**, arXiv:1709.06841. [Google Scholar] - Meister, S.; Hur, J.; Roth, S. UnFlow: Unsupervised Learning of Optical Flow with a Bidirectional Census Loss. arXiv
**2017**, arXiv:1711.07837. [Google Scholar] - Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial Transformer Networks. In Proceedings of the Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 2017–2025. [Google Scholar]
- Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070. [Google Scholar]
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv
**2015**, arXiv:1603.04467. [Google Scholar] - Liu, F.; Shen, C.; Lin, G.; Reid, I. Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields. IEEE Trans. Pattern Anal. Mach. Intell.
**2016**, 38, 2024–2039. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process.
**2004**, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] - Saxena, A.; Sun, M.; Ng, A.Y. Make3D: Learning 3D Scene Structure from a Single Still Image. IEEE Trans. Pattern Anal. Mach. Intell.
**2008**, 31, 824–840. [Google Scholar] [CrossRef] [PubMed] - Karsch, K.; Liu, C.; Kang, S.B. Depth Transfer: Depth Extraction from Video Using Non-Parametric Sampling. IEEE Trans. Pattern Anal. Mach. Intell.
**2012**, 36, 2144. [Google Scholar] [CrossRef] [PubMed] - Liu, M.; Salzmann, M.; He, X. Discrete-Continuous Depth Estimation from a Single Image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 716–723. [Google Scholar]
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper Depth Prediction with Fully Convolutional Residual Networks. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 239–248. [Google Scholar]
- Revaud, J.; Weinzaepfel, P.; Harchaoui, Z.; Schmid, C. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7–12 June 2015; pp. 1164–1172. [Google Scholar]
- Ren, Z.; Yan, J.; Ni, B.; Liu, B.; Yang, X.; Zha, H. Unsupervised Deep Learning for Optical Flow Estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017. [Google Scholar]

**Figure 1.**Overview of our system. It consists of three task-specific sub-networks targeting at estimating monocular depth, optical flow, and camera motion. Rich geometric constraints are employed extracting from the natural structure of the scene. Here M denotes input image and P denotes point cloud.

**Figure 2.**The 3D Flow is computed for both directions. The loss takes utilization of estimated depth, optical flow and camera pose.

**Figure 3.**Comparison of monocular depth estimation between Eigen et al. [10] (supervised by depth), Garg et al. [21] (supervised by stereo), Zhou et al. [15] (unsupervised) and ours (unsupervised). Our method captures more details in the whole scene and particularly in thin structures for both close and distant regions.

**Figure 4.**Comparison of monocular depth estimation between failed cases of Zhou et al. [15] and same cases of ours. Our method preserves considerable predictions in vast open scenes and objects close to the front of the camera.

**Figure 6.**Comparison of optical flow estimation between Yin et al. [19] and ours on KITTI Flow 2015 dataset. The ground truth is interpolated for visualization purpose. Our method performs better in both occluded and overall regions.

Method | Supervise | Abs Rel | Sq Rel | RMSE | RMSE Log | $\mathit{\delta}\mathbf{<}\mathbf{1.25}$ | $\mathit{\delta}\mathbf{<}{\mathbf{1.25}}^{\mathbf{2}}$ | $\mathit{\delta}\mathbf{<}{\mathbf{1.25}}^{\mathbf{3}}$ |
---|---|---|---|---|---|---|---|---|

Eigen [10] Coarse | Depth | 0.214 | 1.605 | 6.563 | 0.292 | 0.673 | 0.884 | 0.957 |

Eigen [10] Fine | Depth | 0.203 | 1.548 | 6.307 | 0.282 | 0.702 | 0.890 | 0.958 |

Liu [27] | Depth | 0.202 | 1.614 | 6.523 | 0.275 | 0.678 | 0.895 | 0.965 |

Godard [11] | Stereo | 0.148 | 1.344 | 5.927 | 0.247 | 0.803 | 0.922 | 0.964 |

Zhou [15] | No | 0.208 | 1.768 | 6.856 | 0.283 | 0.678 | 0.885 | 0.957 |

Mahjourian [18] | No | 0.163 | 1.240 | 6.220 | 0.250 | 0.762 | 0.916 | 0.968 |

Yin [19] | No | 0.155 | 1.296 | 5.857 | 0.233 | 0.793 | 0.931 | 0.973 |

Ours (VGG) | No | 0.157 | 1.229 | 5.960 | 0.238 | 0.799 | 0.932 | 0.973 |

Ours (ResNet) | No | 0.149 | 1.223 | 5.732 | 0.225 | 0.829 | 0.944 | 0.978 |

**Table 2.**Monocular depth results on KITTI by the split of Eigen et al. [10] capped 50 m.

Method | Supervise | Abs Rel | Sq Rel | RMSE | RMSE Log | $\mathit{\delta}\mathbf{<}\mathbf{1.25}$ | $\mathit{\delta}\mathbf{<}{\mathbf{1.25}}^{\mathbf{2}}$ | $\mathit{\delta}\mathbf{<}{\mathbf{1.25}}^{\mathbf{3}}$ |
---|---|---|---|---|---|---|---|---|

Garg [21] | Stereo | 0.169 | 1.080 | 5.104 | 0.273 | 0.740 | 0.904 | 0.962 |

Mahjourian [18] | No | 0.155 | 0.927 | 4.549 | 0.231 | 0.781 | 0.931 | 0.975 |

Yin [19] | No | 0.147 | 0.936 | 4.348 | 0.218 | 0.810 | 0.941 | 0.977 |

Ours (Resnet) | No | 0.142 | 0.909 | 4.306 | 0.203 | 0.832 | 0.949 | 0.978 |

**Table 3.**Monocular depth results on Cityscapes and KITTI by the split of Eigen et al. [10] capped 80 m.

Method | Supervise | Abs Rel | Sq Rel | RMSE | RMSE Log | $\mathit{\delta}\mathbf{<}\mathbf{1.25}$ | $\mathit{\delta}\mathbf{<}{\mathbf{1.25}}^{\mathbf{2}}$ | $\mathit{\delta}\mathbf{<}{\mathbf{1.25}}^{\mathbf{3}}$ |
---|---|---|---|---|---|---|---|---|

Godard [11] | Stereo | 0.124 | 1.076 | 5.311 | 0.219 | 0.847 | 0.942 | 0.973 |

Zhou [15] | No | 0.198 | 1.836 | 6.565 | 0.275 | 0.718 | 0.901 | 0.960 |

Mahjourian [18] | No | 0.159 | 1.231 | 5.912 | 0.243 | 0.784 | 0.923 | 0.970 |

Yin [19] | No | 0.153 | 1.328 | 5.737 | 0.232 | 0.802 | 0.934 | 0.972 |

Ours (Resnet) | No | 0.146 | 1.253 | 5.614 | 0.224 | 0.838 | 0.941 | 0.973 |

**Table 4.**Ablation results where individual components are left out on KITTI dataset when capped 80 m.

Method | Abs Rel | Sq Rel | RMSE | RMSE Log | $\mathit{\delta}\mathbf{<}\mathbf{1.25}$ | $\mathit{\delta}\mathbf{<}{\mathbf{1.25}}^{\mathbf{2}}$ | $\mathit{\delta}\mathbf{<}{\mathbf{1.25}}^{\mathbf{3}}$ |
---|---|---|---|---|---|---|---|

No Occ-Mask | 0.161 | 1.367 | 6.017 | 0.236 | 0.805 | 0.934 | 0.972 |

No 3D ICP Loss | 0.160 | 1.594 | 5.775 | 0.226 | 0.826 | 0.945 | 0.976 |

No 3D Flow Loss | 0.157 | 1.353 | 5.971 | 0.232 | 0.811 | 0.938 | 0.975 |

All losses | 0.149 | 1.223 | 5.732 | 0.225 | 0.829 | 0.944 | 0.978 |

Method | Supervision | Abs Rel | Sq Rel | RMSE | RMSE Log |
---|---|---|---|---|---|

Train set mean | Depth | 0.876 | 13.98 | 12.27 | 0.307 |

Karsch [30] | Depth | 0.428 | 5.079 | 8.389 | 0.149 |

Liu [31] | Depth | 0.475 | 6.562 | 10.05 | 0.165 |

Laina [32] | Depth | 0.204 | 1.840 | 5.683 | 0.084 |

Godard [11] | Pose | 0.544 | 10.94 | 11.76 | 0.193 |

Zhou [15] | No | 0.383 | 5.321 | 10.47 | 0.478 |

Ours | No | 0.406 | 4.071 | 9.624 | 0.385 |

**Table 6.**The reported Average end-point error (AEE) on KITTI flow 2015 over all pixels (All) and over non-occluded pixels only (Noc). C denotes dataset FlyingChairs, S denotes Sintel and T is FlyingThings3D.

Method | Dataset | Noc | All |
---|---|---|---|

EpicFlow [33] | - | 4.45 | 9.57 |

FlowNetS [14] | C+S | 8.12 | 14.19 |

FlowNet2 [20] | C+T | 4.93 | 10.06 |

DSTFlow [34] | K | 6.96 | 16.79 |

UnFlow-C [23] | K | 4.29 | 8.80 |

GeoNet (FlowNetS) [19] | K | 6.77 | 12.21 |

GeoNet [19] | K | 8.05 | 10.81 |

Ours | K | 4.41 | 9.24 |

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Teng, Q.; Chen, Y.; Huang, C.
Occlusion-Aware Unsupervised Learning of Monocular Depth, Optical Flow and Camera Pose with Geometric Constraints. *Future Internet* **2018**, *10*, 92.
https://doi.org/10.3390/fi10100092

**AMA Style**

Teng Q, Chen Y, Huang C.
Occlusion-Aware Unsupervised Learning of Monocular Depth, Optical Flow and Camera Pose with Geometric Constraints. *Future Internet*. 2018; 10(10):92.
https://doi.org/10.3390/fi10100092

**Chicago/Turabian Style**

Teng, Qianru, Yimin Chen, and Chen Huang.
2018. "Occlusion-Aware Unsupervised Learning of Monocular Depth, Optical Flow and Camera Pose with Geometric Constraints" *Future Internet* 10, no. 10: 92.
https://doi.org/10.3390/fi10100092