Occlusion-Aware Unsupervised Learning of Monocular Depth, Optical Flow and Camera Pose with Geometric Constraints

We present an occlusion-aware unsupervised neural network that jointly learns three low-level vision tasks from monocular videos: depth, optical flow, and camera motion. The system consists of three prediction sub-networks that are coupled during training by combined loss terms, while each task can be computed independently on test samples. Geometric constraints derived from scene geometry, which have traditionally been used in bundle adjustment or pose-graph optimization, serve as self-supervisory signals in our end-to-end learning approach. Unlike prior works, our image reconstruction loss also takes optical flow into account. Moreover, we impose novel 3D flow consistency constraints over the predictions of all three tasks. By explicitly modeling occlusion and exploiting both 2D and 3D geometric relationships, abundant geometric constraints are formed over the estimated outputs, enabling the system to capture both low-level representations and high-level cues and to infer thinner scene structures. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: (1) monocular depth estimation outperforms state-of-the-art unsupervised methods and is comparable to stereo-supervised ones; (2) optical flow prediction ranks at the top among prior works and even beats supervised and traditional methods, especially in non-occluded regions; (3) pose estimation outperforms established SLAM systems under comparable input settings by a reasonable margin.


Introduction
The ability of an agent to perceive its visual environment is more in demand than ever owing to rapid growth in industries such as robotics [1], autonomous driving [2] and augmented reality [3], which involve many crucial tasks, including reasoning about one's ego-motion and the scene structure. In this paper, we propose a jointly learned network, trained in a fully unsupervised manner, that predicts depth, camera pose and optical flow from monocular video sequences without any labeled data or ground truth.
Scene understanding has been studied for many years. Structure from motion (SfM) [4][5][6] is a long-standing task in computer vision that aims at reconstructing camera motion and scene structure, but it is often hard to integrate reasonable priors for small camera translations or to handle outliers from low-level sparse correspondences. Visual simultaneous localization and mapping (VSLAM) [7][8][9] uses handcrafted sparse or dense abstractions to represent geometry, but it is also limited to particular kinds of scenes. These methods also suffer from ambiguity in absolute scale in the monocular setting and from noisy data when depth sensors are used. In this paper, we employ deep neural networks to better represent high-level cues instead of being restricted to a certain scenario.
Given the abundance of data and the learning capacity of deep networks, recent works have formulated these tasks with deep models and achieved considerable results compared to traditional methods. Most previous works [10][11][12][13][14] target one specific task only and neglect the inherent relationships between related tasks. Refs. [15,16] combine depth prediction with camera motion through an image reconstruction loss based on photometric quality. Ref. [17] incorporates optical flow implicitly while predicting depth and camera motion at the same time. Ref. [18] introduces a 3D geometric constraint during backpropagation. However, none of them explicitly handles occlusions. Ref. [19] addresses the three tasks through a stage-wise structure that enforces geometric consistency, with non-rigid motion filtered in the second stage, but it cannot fully exploit geometric relations because of its cascaded network structure. We therefore focus on avoiding these shortcomings to achieve more accurate results.
Inspired by works that impose geometric constraints on the learning procedure [18,19], we also exploit scene structure constraints in a deep learning system. The main idea of this paper is to estimate optical flow simultaneously with depth and camera motion, and to jointly train the network with both 2D and 3D geometric transformations, which creates additional constraints that better exploit the natural geometric structure of the scene. Since depth, camera motion and optical flow are all available, we are able to make the following contributions:
1. We jointly learn an unsupervised deep neural network from monocular videos that predicts depth, optical flow, and camera motion simultaneously, coupled at training time by a combined loss term.
2. We model occlusion explicitly throughout the entire process, using bidirectional flow between consecutive frames, so that the model is occlusion-aware and non-occluded regions are better constrained.
3. We mutually supervise each component of the network using both 2D and 3D geometric constraints combined with the occlusion module. The 2D image reconstruction loss takes optical flow into consideration. The 3D constraint contains two parts: a 3D point alignment loss and a novel 3D flow loss.

Deep Learning vs. Geometry for Scene Understanding
The study of estimating scene structure and camera motion from image sequences, or of simultaneous localization and mapping, was considered a purely geometric problem in computer vision for decades. It is usually accomplished by pipelines containing several successive processing steps. However, handcrafted methods often rely heavily on accurate image correspondences, and structure estimation becomes ill-posed if arbitrary deformations are allowed, whereas deep learning is more capable of capturing relatively random details and learning from them. Given these qualities, more recent methods focus on alleviating such reliance by introducing deep learning into geometric problems. Moreover, rather than applying deep learning models naively, imposing geometry in deep learning allows us to learn a geometric problem without massive amounts of labeled data, by extracting supervision from the natural structure of the scene. This is an exciting development that has proven effective in many studies.

Deep Learning with Geometry for Scene Understanding
Deep learning from videos has made significant progress since it first appeared and has a promising future. Many works have explored tasks including depth prediction, optical flow, and pose estimation. These approaches fall mainly into two categories: supervised methods and unsupervised ones.

Supervised Videos Learning
The work of Eigen et al. [10] demonstrated the capability of deep models for single-view depth estimation with a coarse-to-fine strategy. Kendall et al. [12] proposed a stereo regression architecture for sub-pixel disparity from a rectified pair of stereo images by leveraging knowledge of the problem's geometry. Brahmbhatt et al. [13] formulated loss terms from geometric constraints between sensory inputs, of the kind usually used in traditional methods, to improve camera localization accuracy. A similar spirit has been shared in learning optical flow. Refs. [14,20] proposed FlowNet to compute dense flow predictions with fully convolutional neural networks in an end-to-end fashion, supervised by synthetic datasets. Ummenhofer et al. [16] engineered a setting that alternates optical flow estimation with camera motion and depth estimation, which requires various forms of supervision, including an abundant amount of scanned depth data. To eliminate the heavy reliance on large labeled datasets, our method adopts an unsupervised setting.

Unsupervised Videos Learning
Garg et al. [21] leveraged well-understood ideas in visual geometry by proposing a coarse-to-fine stereopsis-based auto-encoder that predicts single-view depth using a projection objective. Godard et al. [11] exploited epipolar geometry constraints, enforcing consistency between the disparities generated during network training by introducing a left-right consistency loss. While such stereo formulations rely more heavily on scene priors, a monocular setting is preferred by many recent methods.
Ref. [22] achieved a monocular VO system along with dense depth maps with recovered absolute scale in an unsupervised manner by utilizing both temporal and spatial geometric constraints. Zhou et al. [15] formulated a view synthesis pipeline that learns monocular depth and ego-motion in a coupled way, building upon rigid projective geometry with an explainability mask to compensate for dynamic factors. Concurrently, Vijayanarasimhan et al. [17] mimicked the traditional problem of structure from motion through explicit modeling of scene geometry and several object masks. Ref. [18] considered the inferred 3D geometry with an objective that directly penalizes inconsistencies in the estimated depth during the image reconstruction process. However, they all overlook the fact that including outliers such as occlusions in training could corrupt the process. Meister et al. [23] introduced an occlusion-aware bidirectional census loss for learning optical flow. Yin and Shi [19] proposed a divide-and-conquer strategy that solves depth, optical flow, and camera motion estimation in a combined way, constrained by a reconstruction loss built from geometric relationships between the tasks, and demonstrated that learning a non-rigid flow residual is helpful to their 3D geometric scene understanding. However, in their rigid flow stage, learning from the whole image, including occluded areas, may cause side effects. The optical flow mainly serves to refine the rigid flow produced in the first stage, so their method struggles to capture the deeper geometric relations between these tasks.
Although unsupervised methods have gained attention recently because they require no ground truth of any type, and have improved considerably, there is still a need for better-suited network structures and loss functions. Differently from Yin and Shi [19], by bringing optical flow into our system and training all tasks synchronously, we are able to form combined losses over all predictions. These losses contain abundant geometric information to guide backpropagation, which proves to be effective in our experiments. Furthermore, occlusion masks produced from the optical flow estimates are employed for all three tasks throughout the entire training, which ensures the filtering of non-rigid cases.

Method
Our approach combines depth estimation, optical flow, and camera motion as a whole, so that the three tasks can benefit from each other through geometric observation. This section starts by giving an overview of our system and then explains each component individually.

System Overview
The schematic network architecture is shown in Figure 1. It handles the three tasks with three separate sub-networks, which are trained jointly and supervise one another, making the scene perception task easier to solve. Additionally, rich occlusion-aware geometric constraints are applied within and between these tasks. We present a way to exploit the nature of occlusion to a greater extent, instead of simply filtering it out, by combining it with various geometric constraints. During training, we formulate the problem as follows. Given consecutive frames $M_{t-1}$, $M_t$ with known camera intrinsics, $D_t$ is the estimated per-pixel depth map of each frame, $T_{t-1}$ denotes the relative camera motion from $t-1$ to $t$, $F^f$ denotes the forward optical flow and $F^b$ the backward optical flow. The forward and backward optical flows are used to generate the occlusion mask, which reduces the influence of occlusion when applying rigid geometric constraints. As depth and camera motion are available, we generate the point cloud $P_t$ of frame $M_t$, which is a basic element throughout the 3D constraints of the network. The frames are preprocessed into four pyramid scales from $l_1$ to $l_4$. Details of the method are discussed below.

Bidirectional Flow Loss as Occlusion Mask
To detect occlusion, we adopt the method of [23], which relies on the forward-backward consistency assumption, to mask occluded areas. The forward flow $F^f$ and backward flow $F^b$ are computed by performing a second pass with the input images exchanged, using shared weights and a symmetric loss. Pixels are marked as occluded whenever there is a significant mismatch between these two flows. The occlusion flag is set to 1 when the forward-backward consistency constraint is violated, and 0 otherwise. In our experiments we set $\alpha_1 = 0.01$ and $\alpha_2 = 0.5$. The occluded pixel mask of frame $M_t$ is then denoted $O_t$. Forward-backward consistency between adjacent flows is enforced on the non-occluded region during training, along with a constant penalty $\lambda_p = 3.5$ on all occluded pixels to guard against the case in which the whole scene is marked as occluded. In this loss, $X$ denotes the pixels of image $M$ and $\rho$ denotes the robust generalized Charbonnier penalty.
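The forward-backward check described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes the check has the form used in the cited work [23] (squared mismatch compared against a bound built from $\alpha_1$ and $\alpha_2$), uses nearest-neighbour lookup instead of a bilinear sampler, and treats pixels whose flow leaves the image as occluded. The function name `occlusion_mask` is hypothetical.

```python
import numpy as np

def occlusion_mask(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Return an (H, W) mask: 1.0 where a pixel is flagged occluded, else 0.0.

    flow_fw, flow_bw: (H, W, 2) arrays holding (u, v) displacements in pixels.
    """
    h, w, _ = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the forward flow to the matching location in the other frame.
    # Nearest-neighbour lookup is a simplification of bilinear sampling.
    xt = np.rint(xs + flow_fw[..., 0]).astype(int)
    yt = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    bw_at_target = flow_bw[np.clip(yt, 0, h - 1), np.clip(xt, 0, w - 1)]
    # For a consistent pair, forward and backward flow cancel each other.
    mismatch = np.sum((flow_fw + bw_at_target) ** 2, axis=-1)
    bound = alpha1 * (np.sum(flow_fw ** 2, axis=-1)
                      + np.sum(bw_at_target ** 2, axis=-1)) + alpha2
    # Assumption: pixels that leave the image are also treated as occluded.
    occluded = (mismatch >= bound) | ~inside
    return occluded.astype(np.float32)
```

With a uniform one-pixel shift to the right and the exactly opposite backward flow, only the last image column (which leaves the frame) is flagged.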

2D Image Reconstruction Loss as Supervision
Following most unsupervised approaches, we use image synthesis as the fundamental supervision for each task in our network, applying differentiable inverse warping [24] with a bilinear sampler between nearby frames. Given frame $M_t$ and depth $D_t$, the image is projected into a structured three-dimensional point cloud by back-projecting each pixel with its depth, where $(i, j)$ are the coordinates of the image pixel, $(x, y, z)$ are the point coordinates, and $K$ is the homogeneous camera intrinsic matrix. In this way we compute the per-frame point cloud $P_t$, $t \in (1, \ldots)$. We transform the nearby point cloud $P_{t-1}$ using the camera intrinsics and the known rigid scene transformation $T$, following $P_t = K T P_{t-1}$, then project it back onto the image plane to get the warped frame $M^p_t$. $M^p_t$ is warped in both the forward and backward directions. The self-supervision approach for optical flow is similar, except that the warped image $M^f_t$ is obtained from the predicted flow, again in both directions. Let $F^f_t$ be the forward flow and let $X$ denote an image pixel of frame $M_{t-1}$. The backward direction is defined in the same way with $F^f_t$ and $F^b_t$ exchanged. The reconstruction consistency loss is then produced by comparing image $M_t$ to its warped counterparts. Note that the warping schemes implicitly assume that there is no occlusion in the scene, so incorrect deformations could be learned from occluded regions. Therefore we mask occluded pixels to alleviate the negative effect of outliers and penalize the photometric difference only at non-occluded pixels. The occlusion mask is modeled as $O_t$ in Section 3.2 and enters the photometric consistency loss. In this way we also take into account the discrepancy between the warped images $M^p_t$ and $M^f_t$ in non-occluded regions, since they should ideally be identical if the predictions of the network are perfect.
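The geometry behind this loss can be illustrated with a short NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: it uses the standard pinhole model (back-projection as depth times $K^{-1}[i, j, 1]^T$), a 4x4 rigid transform, and a plain mean absolute difference as the photometric penalty; the function names are hypothetical, and the bilinear image sampling step is omitted.

```python
import numpy as np

def backproject(depth, K):
    """Lift an (H, W) depth map to an (H, W, 3) point cloud in the camera frame."""
    h, w = depth.shape
    j, i = np.mgrid[0:h, 0:w]                      # j: row (y), i: column (x)
    pix = np.stack([i, j, np.ones_like(i)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                # K^-1 [i, j, 1]^T per pixel
    return rays * depth[..., None]

def project(points, K, T=None):
    """Apply a 4x4 rigid transform T, then project with K to pixel coordinates."""
    if T is None:
        T = np.eye(4)
    cam = points @ T[:3, :3].T + T[:3, 3]
    uvw = cam @ K.T
    return uvw[..., :2] / uvw[..., 2:3]

def masked_photometric_loss(target, warped, occ):
    """Mean absolute difference over non-occluded pixels (occ == 1 is occluded).

    The L1 penalty is an assumption made for this sketch.
    """
    valid = 1.0 - occ
    n = valid.sum() * target.shape[-1]
    return np.sum(np.abs(target - warped) * valid[..., None]) / max(n, 1.0)
```

The pixel coordinates returned by `project` are the locations at which the source frame would be sampled to build the warped image. With the identity transform they coincide with the original pixel grid, which is a convenient sanity check.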

3D Point Alignment Loss
An additional 3D geometric constraint is applied to the non-occluded area in a differentiable way, as recommended by [18], to reinforce backpropagation for the depth map and camera pose estimates. The core of this constraint is a rigid point matching algorithm, Iterative Closest Point (ICP), which computes a transformation between two 3D point clouds by minimizing the point-to-point error, in our case between the non-occluded points $P_t (1 - O_t)$ and their warped counterparts. Specifically, we use both outputs of ICP to guide the regression: the best-fit transformation $T$ and the residual distance $r$ that remains after the best transformation has been applied. $r$ is used as an approximation to the negative gradient of the loss with respect to depth $D_t$, and $T$ is used to adjust the camera pose. Both terms together form the 3D point alignment constraint.
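The two ICP outputs used here, a best-fit rigid transform and a residual, can be illustrated with the closed-form alignment step that ICP applies once correspondences are fixed. The sketch below is a simplification under stated assumptions: it takes correspondences as given (point k of one cloud matches point k of the other) rather than searching for nearest neighbours iteratively, and the function names are hypothetical.

```python
import numpy as np

def best_fit_rigid(src, dst):
    """Least-squares R, t such that dst ~= src @ R.T + t for paired (N, 3) points."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection so that R is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

def alignment_terms(src, dst, occ):
    """Fit on non-occluded points only; return R, t and per-point residuals."""
    keep = np.asarray(occ).reshape(-1) == 0
    R, t = best_fit_rigid(src[keep], dst[keep])
    residual = dst[keep] - (src[keep] @ R.T + t)
    return R, t, residual
```

When the pose and depth predictions are perfect, the fitted transform is the identity and the residual vanishes, so both quantities measure how far the current predictions are from geometric consistency.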

3D Flow Loss
Figure 2 depicts how the 3D flow loss is formed in our method. For a better understanding of scene geometry, we add a further geometric constraint that maintains coherence between the 3D flow generated from the estimated 2D flow and the 3D flow generated from the point cloud, so that all the information obtained from the three sub-tasks can be utilized in a unified constraint. As $F^f_t(u_f, v_f)$ is the predicted dense 2D flow on the X-Y image plane from frame $t$ to $t+1$, and $D_t(x, y)$ is the predicted depth map, the first 3D flow is obtained by a concatenation operation, denoted $C$, over these quantities. The second 3D flow is computed from the point clouds using both the depth and pose predictions. The complete mutually supervised geometric consistency loss enforces coherence between the two.
Figure 2. The 3D flow is computed for both directions. The loss makes use of the estimated depth, optical flow and camera pose.

Smoothness Constraint
Depth borders have a tendency to be locally blurred. To smooth out prediction discontinuities while preserving sharp details, edge-aware weighted smoothness penalties are adopted in our system. The occluded area is guided solely by the smoothness loss, since it violates both photometric and geometric consistency. Depth smoothness is weighted by image gradients, while a second-order derivative is used for the optical flow smoothness term. Here $\alpha$ controls the weight of edges and is set to 3 in our experiments, and $d$ indexes over the partial derivatives in the x and y directions. The final loss is a weighted sum of all the terms above, in which $\lambda$ denotes the loss weights and $l$ indexes over the four pyramid image scales.
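As an illustration of the edge-aware weighting, the sketch below implements a commonly used first-order form in which depth gradients are down-weighted by an exponential of the image gradients. This exact form is an assumption made for the example, since the text specifies only the role of $\alpha$ and the derivative directions; the function name is hypothetical.

```python
import numpy as np

def edge_aware_depth_smoothness(depth, image, alpha=3.0):
    """depth: (H, W); image: (H, W, C) with values in [0, 1]."""
    # First-order depth gradients along x and y.
    d_dx = np.abs(depth[:, 1:] - depth[:, :-1])
    d_dy = np.abs(depth[1:, :] - depth[:-1, :])
    # Image gradients, averaged over colour channels.
    i_dx = np.mean(np.abs(image[:, 1:] - image[:, :-1]), axis=-1)
    i_dy = np.mean(np.abs(image[1:, :] - image[:-1, :]), axis=-1)
    # Depth changes are penalised less where the image itself has an edge.
    return (np.mean(d_dx * np.exp(-alpha * i_dx))
            + np.mean(d_dy * np.exp(-alpha * i_dy)))
```

A depth discontinuity that coincides with an image edge is penalised by a factor of exp(-alpha) relative to the same discontinuity in a textureless region, which is what allows sharp object boundaries to survive the smoothing.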

Experiments
In this section we first describe our implementation, including the network architecture and training details. Then we show quantitative and qualitative performance on each of the three tasks compared with prior approaches.

Network Architecture
Jointly training three sub-networks can be computationally expensive. With this in mind, we choose rather generic networks to conduct and evaluate our method. For depth estimation, we adopt the network from [11], which uses skip connections between the encoder and decoder at corresponding resolutions. The optical flow network is based on a modified version of FlowNetS proposed by [14], and the optical flow prediction is generated using a multi-stage refinement process. Both our depth and optical flow networks take two consecutive images as input. For camera pose regression, we implement the structure in [15], which regresses the 6-DoF camera pose. The input to the pose estimation network is one target view concatenated with two source views. A multi-scale image pyramid is applied in image preprocessing so that all tasks can benefit from information at different scales.

Training Details
Our unsupervised network is trained end-to-end on monocular video streams from the KITTI dataset [25] using the TensorFlow framework [26] and the Adam optimizer with β1 = 0.9 and β2 = 0.999. The KITTI dataset contains sequences captured from a moving vehicle. The initial learning rate is set to 0.0002 and is decreased gradually as the training loss converges. The mini-batch size is 4. To reduce the computational burden, we train our model at a reduced resolution of 416 × 128 pixels for 400,000 iterations. Our experiments are performed on a single NVIDIA Quadro M5000 GPU. We find experimentally that the weighting [λ_fb; λ_re; λ_icp; λ_mu; λ_sm] = [0.3; 1; 0.5; 0.5; 0.8] in our final loss function results in stable training.
For each task, we evaluate our method on the popular KITTI dataset using test splits that are separate from the training data, for the sake of fair comparison. The splits are the same as in Zhou et al. [15]. During joint training, the forward pass is computed independently for each of the three tasks, and gradients are backpropagated through the combined loss. Data augmentation, a widely used strategy for improving the generalization of neural networks, is applied to avoid over-fitting.
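The weighted combination of loss terms over pyramid scales can be written as a short sketch. The weight values are those reported above; the mapping of subscripts to terms (fb: forward-backward flow consistency, re: image reconstruction, icp: 3D point alignment, mu: mutually supervised 3D flow, sm: smoothness) is our reading of the notation, and the names `LOSS_WEIGHTS` and `total_loss` are hypothetical.

```python
# Loss weights as reported in the training details.
LOSS_WEIGHTS = {"fb": 0.3, "re": 1.0, "icp": 0.5, "mu": 0.5, "sm": 0.8}

def total_loss(per_scale_losses, weights=LOSS_WEIGHTS):
    """Sum weighted loss terms over all pyramid scales.

    per_scale_losses: one dict per scale, mapping a term name to its value.
    """
    return sum(weights[name] * value
               for scale in per_scale_losses
               for name, value in scale.items())
```

For example, if every term equals 1.0 at each of the four scales, the total is 4 × (0.3 + 1 + 0.5 + 0.5 + 0.8) = 12.4.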

Depth Evaluation
To evaluate the performance of our network in monocular depth estimation, we use the split provided by [10] and compare with prior related methods. Following [15], we exclude frames that are visually similar to the test scenes. The predictions are resized to the input image resolution by bilinear interpolation. We use ground truth obtained by a laser scanner. Maximum-depth thresholds of both 50 m and 80 m are used for evaluation. The network takes two consecutive frames during training because of the 3D alignment loss. We compare with both supervised methods and state-of-the-art unsupervised methods. We also compare with the results of Garg et al. [21], which takes stereo pairs to predict depth. Some qualitative results are shown in Figure 3. Tables 1-3 show the quantitative results of these methods on all standard error measures. To verify the advantage of the network architecture and the effectiveness of our loss terms, we trained "Ours(VGG)" on KITTI, which shares the same structure as [15]. The results validate our conclusion. Furthermore, "Ours(ResNet)" outperforms both supervised works [10,27] and prior unsupervised baselines [15,18,19] by a reasonable margin, which we attribute to our extensive geometric constraints and mutual supervision mechanism. We also experimented with pre-training on Cityscapes first. As in [19], our result is slightly inferior to [11] when trained on Cityscapes and KITTI, owing to the characteristics of the training data. We believe that the stereo setting of such datasets may provide more details, which we would like to explore in the future. Note that robust similarity measures such as SSIM [28] are not adopted in our training process; adopting them should lead to further improvement.
We also conduct ablation experiments on the components of our network, reported in Table 4. To study the importance of each component, we trained and evaluated a series of models, each with one component missing. The study shows that our approach benefits from all of these components, and that the 3D losses make a large difference. In addition, a visual comparison with failure cases of Zhou et al. [15] in Figure 4 further illustrates our method's capability of reasoning about scene geometry. However, failure cases remain when large dynamic objects are close to the camera or when intense lighting changes occur, which is caused by gradient locality. A solution to the gradient locality of warping-based losses is still needed.
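For reference, the standard depth error measures (Abs Rel, Sq Rel, RMSE, RMSE log and the three δ accuracy thresholds) can be computed as in the sketch below. It follows the definitions commonly used with the split of [10]; the depth cap is passed as a parameter, and any scale alignment between prediction and ground truth (for example median scaling) is assumed to have been applied beforehand. The function name is hypothetical.

```python
import numpy as np

def depth_metrics(gt, pred, max_depth=80.0, min_depth=1e-3):
    """Standard monocular depth error measures over valid ground-truth pixels."""
    valid = (gt > min_depth) & (gt < max_depth)
    gt = gt[valid]
    pred = np.clip(pred[valid], min_depth, max_depth)
    ratio = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "delta_1": np.mean(ratio < 1.25),
        "delta_2": np.mean(ratio < 1.25 ** 2),
        "delta_3": np.mean(ratio < 1.25 ** 3),
    }
```

Lower is better for the first four measures, while the δ entries report the fraction of pixels whose prediction is within the given ratio of the ground truth, so higher is better.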
Table 1. Monocular depth results on KITTI by the split of Eigen et al. [10], capped at 80 m. Results of other methods are taken from [15,18,19].
Columns: Method, Supervision, Abs Rel, Sq Rel, RMSE, RMSE log, δ < 1.25, δ < 1.25^2, δ < 1.25^3.

To demonstrate the generalization ability of our model, we test the model, trained only on KITTI and Cityscapes, on the Make3D dataset [29]. The images of the Make3D dataset have a different aspect ratio, so we evaluate on a central crop of these images. Errors are computed only where the depth is less than 70 m in the central image crop, similar to Godard et al. [11] and Zhou et al. [15]. As shown in Table 5 and Figure 5, we compare with several methods, including supervised ones that use Make3D ground-truth depth. Although a performance gap remains between our method and the supervised ones, our predictions beat Zhou et al. [15] in three metrics and preserve the global scene layout and thin structures well.
During the training process, the three tasks are learned jointly and their accuracy is inter-dependent. To evaluate the accuracy of the estimated camera pose, we test our model on the KITTI visual odometry split, which contains 11 sequences with ground truth. To compare with prior works, we train our joint network with the same setting as [15]: sequences 00-08 are used for training and 09-10 for testing. We also compare our method with a full version and a short version of the traditional SLAM approach ORB-SLAM [8]. The full version takes the entire sequence as input, while the short version takes 5 frames. We also run an additional experiment in which all the loss weights are set naively. As shown in Table 7, our method outperforms all of the competing baselines. Our occlusion-aware, geometrically constrained model is able to capture high-level and more reliable details.
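Because monocular pose estimates are defined only up to scale, a trajectory error on short snippets is usually computed after aligning the predicted positions to the ground truth. The sketch below shows one common way to do this and is an assumption rather than the exact protocol used for Table 7: both trajectories are anchored at their first frame, a single least-squares scale factor is fitted, and the root-mean-square position error is returned. The function name is hypothetical.

```python
import numpy as np

def snippet_ate(gt_xyz, pred_xyz):
    """Trajectory error for one snippet of (N, 3) camera positions."""
    # Express both trajectories relative to their first frame.
    gt = gt_xyz - gt_xyz[0]
    pred = pred_xyz - pred_xyz[0]
    # Least-squares scale that best maps the prediction onto the ground truth.
    denom = np.sum(pred ** 2)
    scale = np.sum(gt * pred) / denom if denom > 0 else 1.0
    err = gt - scale * pred
    return np.sqrt(np.sum(err ** 2) / len(gt))
```

A prediction that differs from the ground truth only by a global scale and offset yields zero error under this measure, which matches the ambiguity inherent in monocular input.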

Conclusions
In this paper, we proposed a joint learning framework that addresses three basic vision tasks of the long-standing scene understanding problem in an unsupervised manner. We explicitly take occlusion and the 3D structure of the scene into consideration. By using extensive geometric cues to guide learning, we obtained results demonstrating that geometry can significantly benefit deep learning in geometric reasoning.
However, our method is still limited to scenes without many dynamic motions. Moreover, ways to reduce the computational burden and system complexity remain to be found. In the future, we would like to explore the modeling of various motion masks for more dynamic environments. We are also exploring an enlarged search space for the warping-based loss to address gradient locality. In addition, introducing semantic segmentation into our system could offer further advantages. Finally, a multi-task architecture that shares representation and computation when learning both low-level and high-level vision tasks could be promising.

Figure 1. Overview of our system. It consists of three task-specific sub-networks that estimate monocular depth, optical flow, and camera motion. Rich geometric constraints extracted from the natural structure of the scene are employed. Here M denotes the input image and P denotes the point cloud.

Figure 3. Comparison of monocular depth estimation between Eigen et al. [10] (supervised by depth), Garg et al. [21] (supervised by stereo), Zhou et al. [15] (unsupervised) and ours (unsupervised). Our method captures more details in the whole scene and particularly in thin structures for both close and distant regions.

Figure 6. Comparison of optical flow estimation between Yin et al. [19] and ours on the KITTI Flow 2015 dataset. The ground truth is interpolated for visualization purposes. Our method performs better in both occluded and overall regions.

Table 3. Monocular depth results on Cityscapes and KITTI by the split of Eigen et al. [10], capped at 80 m.

Table 4. Ablation results on the KITTI dataset, capped at 80 m, where individual components are left out.

Table 6. Average end-point error (AEE) on KITTI Flow 2015 over all pixels (All) and over non-occluded pixels only (Noc). C denotes the FlyingChairs dataset, S denotes Sintel and T denotes FlyingThings3D.

Table 7. Absolute Trajectory Error (ATE) on the KITTI odometry dataset. The results of other baselines are taken from [15,18,19].