Unsupervised Learning of Monocular Depth and Ego-Motion with Optical Flow Features and Multiple Constraints

This paper proposes a novel unsupervised learning framework for depth recovery and camera ego-motion estimation from monocular video. The framework exploits the optical flow (OF) property to jointly train the depth and the ego-motion models. Unlike the existing unsupervised methods, our method extracts the features from the optical flow rather than from the raw RGB images, thereby enhancing unsupervised learning. In addition, we exploit the forward-backward consistency check of the optical flow to generate a mask of the invalid region in the image, and accordingly, eliminate the outlier regions such as occlusion regions and moving objects for the learning. Furthermore, in addition to using view synthesis as a supervised signal, we impose additional loss functions, including optical flow consistency loss and depth consistency loss, as additional supervision signals on the valid image region to further enhance the training of the models. Substantial experiments on multiple benchmark datasets demonstrate that our method outperforms other unsupervised methods.


Introduction
Depth recovery and camera ego-motion estimation from monocular video are fundamental topics in computer vision with numerous applications in industry, including robotics, driverless vehicles, and navigation systems. Traditional solutions to these tasks rely on binocular stereo techniques or structure-from-motion methods, which reconstruct 2D images into the 3D world by analyzing the geometric difference between left-right or/and consecutive images [1]. Camera ego-motion, also known as visual odometry (VO), is the process of calculating an agent's pose solely based on images captured by a single or multiple cameras mounted to it. The basic VO framework follows a standard pipeline, which typically includes feature detection, feature tracking, outlier rejection, motion estimation and optimization [2][3][4]. Although these methods are accurate and robust under favorable conditions, they are sensitive to camera parameters and are more unstable in extreme environments, such as textureless areas and lighting changes.
Recently, convolutional neural networks (CNNs) have become increasingly popular in computer vision tasks, providing researchers with a new solution to depth recovery and ego-motion estimation. Learning-based methods can be classified into two groups including supervised and unsupervised methods in terms of whether they rely on ground truth for training. Supervised methods learn the functions to map the depth and egomotion to the image by minimizing the differences between the estimated values and the related ground truth [5][6][7][8][9][10][11][12][13][14][15]. However, supervised methods need a massive quantity of ground truth data to train the model, which is both costly and difficult to get in reality. Moreover, the dependency on ground truth data leads to unsatisfactory performance in new environments or some unlearned scenarios.
Instead of using expensive ground truth data, unsupervised methods train CNN models directly using unlabeled data, thereby saving human effort on data-labeling, allowing the use of a larger amount of data for training, and achieving better generalization. A common principle of the existing unsupervised methods [16][17][18][19][20][21][22][23] is to train the CNN models by using a synthesized view as a supervisory signal. We call it the view synthesis technique, where one view (source) is synthesized into another (target) based on the estimated camera ego-motion and the predicted depth of the target view. The unsupervised framework is subsequently trained by decreasing the photometric difference between the synthesized and original target views. Three issues exist in regard to the existing unsupervised methods: (1) The synthesized view is subject to error because these works do not apply optical flow (OF) to estimate camera ego-motion but directly use RGB images, which contain complex and redundant information. In fact, the OF field implies geometric motion between consecutive images and is a key factor for accurate view synthesis and ego-motion estimation. Our previous work [14] and the authors of [13,15] have demonstrated that OF is highly effective for learning VO. (2) These methods conduct the learning on the whole area of the synthesized image containing substantial outliers such as occluded regions and moving objects, which can inhibit the training of the network and cause a significant error in the results. (3) These methods design loss function by considering the photometric consistency between the synthesized and original target views, which provides a weak constraint for unsupervised learning and thus may generate inaccurate training results.
To overcome these disadvantages, this paper proposes a new unsupervised framework for improving unsupervised learning and the performance of depth recovery and camera ego-motion estimation. The framework learns the depth and VO models from the accurate OF field rather than from the raw RGB images. Moreover, we exploit the forward-backward consistency check of the optical flow to generate a mask of the invalid region in the image, and accordingly, eliminate outlier regions such as occlusion regions and moving objects for the learning. Furthermore, we adopt several constraints for defining the multiple loss functions to further enhance the unsupervised learning. In brief, the following are our main contributions:

1.
We propose a novel unsupervised learning framework for estimating the depth and camera ego-motion. By virtue of the optical flow property, the framework extracts the features from the optical flow rather than from the raw RGB images, thereby enhancing unsupervised learning; 2.
We eliminate the outlier regions such as occlusion regions and moving objects for the learning by generating a mask of the invalid region in the scene according to the forward-backward consistency of the optical flow, thereby preventing the training from being inhibited and improving the performance; 3.
We propose optical flow consistency loss and depth consistency loss as additional supervision signals to further enhance the training of the models; 4.
We conduct extensive experiments on multiple benchmark datasets, and the results demonstrate that our method outperforms the existing unsupervised algorithms.

Related Work
We primarily discuss related works that use machine learning methods for separate or joint learning of depth and ego-motion. As mentioned above, learning-based methods can be divided into two groups, including supervised and unsupervised methods in terms of whether they rely on ground truth for training.

Supervised Learning of Monocular Depth and Ego-Motion
Existing supervised learning-based works normally treat depth recovery and egomotion as two separate tasks and conduct learning for each goal by minimizing the differences between the estimated values and the related ground truth.
The authors of [5] were the first to predict depth from a single image via supervised learning. They accomplished this work by combining two deep network stacks: one that generates a rough global prediction based on the full image and another that rectifies this prediction locally. Different from [5], which employs an extra network to improve the results, Liu et al. [6] presented an approach based on the hierarchical conditional random fields (CRFs) to enhance the depth map. Meanwhile, they proposed a super-pixel pooling method to accelerate convolutional networks. Recently, several works [7][8][9] have used adversarial learning to estimate depth and have proven to be beneficial.
DeepVO [10] is a typical supervised learning method to estimate camera motion. CNN was utilized to learn effective feature representation, while an RNN was employed to describe sequential dynamics and connections. DeepVO completed an end-to-end pose estimation and obtained competitive accuracy and generalization ability. Based on this typical model, several studies expanded on this strategy to increase model performance.
In [11], the authors considered the curriculum learning (CL) technique (training a model by gradually increasing the complexity of the training data) to increase the generalization capacity of supervised VO. In [12], knowledge distillation (transferring the knowledge of a huge teacher model to a small student model) was used in the supervised VO framework to drastically decrease the amount of network parameters, making it more suitable for real-time operation on portable devices. Since the OF field implies geometric motion, learning optical flow for VO is a common technique for learning-based VO methods such as in [13][14][15]. In [13], the authors proposed to use an auto-encoder network to find a nonlinear representation of the OF field for ego-motion estimation. Our previous work [14] further proved that learning the latent space of the OF field is effective for ego-motion estimation. We conducted sequential learning by using the RCNN network to regress the OF latent space into the 6-DOF camera ego-motion. In [15], Zhao et al. not only used OF field to estimate the camera ego-motion, but also investigated the capacity of deep neural networks for state estimation to filter the 6-DOF trajectory given a sequence of measurements.
Since the supervised approaches are guided by the ground truth, they can effectively train the functions to map the depth and ego-motion to the image and have produced outstanding results. However, these supervised algorithms are constrained by labeled datasets, which are difficult and expensive to obtain and may be short of generalization.

Unsupervised Learning of Monocular Depth and Ego-Motion
Existing unsupervised learning-based works normally conduct joint learning on depth and ego-motion simultaneously by using true constraints as supervisory signals for training. Since the depth recovery and camera pose estimation are closely connected in terms of their internal geometric relationship, the main supervisory signal can be obtained by jointly training these tasks in the lack of ground truth and stereo frames. The existing unsupervised methods [16][17][18][19][20][21][22][23] adopt the synthesized view as a supervisory signal to train the models.
A typical unsupervised method was proposed by Zhou et al. [16] that utilizes view synthesis as a supervised signal to jointly learn depth and camera pose from image sequences. Specifically, this framework is made up of two networks: a depth network for estimating depth and a pose network for calculating camera ego-motion. Based on the predicted depth and the estimated camera ego-motion, one view (source) can be synthesized into another (target). The CNN models are then trained by minimizing the photometric differences between the synthesized and original target views. Motivated by this basic model, several studies [17][18][19][20][21][22][23] have been done to improve on it and achieved good results. Aiming at the scale problem, Zhan et al. [17] recovered the absolute scale using stereo image pairs. Meanwhile, they introduced a feature reconstruction loss to increase depth and pose estimation accuracy. In [18], Mahjourian et al. introduced an ICP loss to ensure the consistency of the calculated 3D point clouds between the consecutive frames. They trained the network using a combination of 3D and 2D losses and achieved good results. In [19], Yang et al. introduced edge estimation, which improves the performance of the model by jointly estimating the edge and the 3D scene. In [20], Jiang et al. introduced an outlier masking strategy that treats occluded or dynamic pixels as statistical outliers, hence avoiding the negative impacts of occlusion and dynamics on learning in realistic environments.
Recently, several studies have taken advantage of the inherent geometric connection between depth, ego-motion, and optical flow to jointly train the models of these subtasks. In [21], the authors introduced GeoNet, a collaborative learning framework for estimating depth, ego-motion, and optical flow. They used an additional network to learn the residual optical flow for the scene's dynamic objects. As a result of segregating rigid and nonrigid scenes, the accuracy of all three estimations was enhanced. Instead of estimating residual optical flow, Zhang et al. [22] added an extra network to predict the optical flow. They also introduced multi-view consistency losses to constrain the framework for better performance. Ranjan et al. [23] enhanced the multi-task framework by incorporating a motion segmentation task based on the results of other tasks (depth recovery, pose estimation and optical flow estimation). The increase in the number of tasks makes the training more complicated, so they introduced competitive collaboration, a framework for coordinating the training of various specialized neural networks to address complicated tasks. In these multi-task-based methods, OF estimation is added as a subtask network and trained alongside the depth and pose networks. The errors caused by the depth and pose calculation will inevitably be transmitted to the optical flow, and optical flow with poor accuracy will in turn affect the learning of the depth and pose.

Methods
The framework of the proposed method is shown in Figure 1. It is composed of three CNN networks: DepthNet, PoseNet and FlowNet (Section 3.1). An off-the-shelf optical flow estimation network was used as the FlowNet to generate accurate OF fields. Our goal was to jointly train the DepthNet and the PoseNet by using unlabeled monocular image sequences so that the two networks can estimate single-view depth and camera motion separately during testing. Given the consecutive images (I t , I t+1 ), we first estimated the depth of frame I t and I t+1 , and the forward-backward OF fields between frame I t and I t+1 . Then we used the forward OF field to estimate the camera pose between the consecutive images (I t , I t+1 ). With the predicted depth map and the estimated 6-DOF camera ego-motion, frame I t+1 can be synthesized into frame I t . The synthesized frame I t and the original frame I t should be consistent in terms of photometry (Section 3.2). With the estimated forward-backward OF fields, we could generate a mask of the invalid regions in the image, and then eliminate the outliers such as occlusions and moving objects according to the forward-backward consistency of the optical flow (Section 3.3). In addition, the predicted depth map and the estimated camera pose can be used to calculate the OF field, which should be consistent with the generated OF field from FlowNet in the rigid area of the image (Section 3.4). Based on the generated forward OF field, the depth map (D t+1 ) can be synthesized into depth map (D t ); the synthesized depth map D t should be consistent with the target depth map (D t ) (Section 3.4). The objective function is defined by considering four constraints including photometric consistency, smoothness, optical flow consistency and depth consistency, and is formulated as where l indicates different image scales, L l pho , L l smo , L l f lo and L l dep indicate photometric consistency loss, smoothness loss, optical flow consistency loss and depth consistency loss, respectively, and λ s , λ f and λ d represent their corresponding weights.

The Networks
The framework in Figure 1 contains three subnetworks: DepthNet, PoseNet and FlowNet. The FlowNet is an off-the-shelf optical flow estimation network to generate accurate OF fields. We used a fixed-weight MaskFlownet [24] as the FlowNet. We adopted the DispResNet [23], an encoder-decoder network with skip connections and multi-scale side predictions, as the DepthNet. For the PoseNet, we used the network proposed in [16]. The difference is that we used the optical flow of adjacent images as input to estimate the 6-DOF camera pose instead of directly using adjacent images. where l indicates different image scales, , , and indicate photometric consistency loss, smoothness loss, optical flow consistency loss and depth consistency loss, respectively, and , and represent their corresponding weights.

The Networks
The framework in Figure 1 contains three subnetworks: DepthNet, PoseNet and FlowNet. The FlowNet is an off-the-shelf optical flow estimation network to generate accurate OF fields. We used a fixed-weight MaskFlownet [24] as the FlowNet. We adopted the DispResNet [23], an encoder-decoder network with skip connections and multi-scale side predictions, as the DepthNet. For the PoseNet, we used the network proposed in [16]. The difference is that we used the optical flow of adjacent images as input to estimate the 6-DOF camera pose instead of directly using adjacent images.

Photometric Consistency Loss and Smoothness Loss
We used the view synthesis as the main supervision signal to jointly learn depth and camera motion from unlabeled video sequences. Given the consecutive images ( , ), the estimated depth map ( ) at time t and the estimated relative camera pose ( → ), we can establish the dense pixel correspondence between the consecutive images ( , ). When denotes the coordinate of a pixel in frame , the corresponding point of in frame can be computed via:

Photometric Consistency Loss and Smoothness Loss
We used the view synthesis as the main supervision signal to jointly learn depth and camera motion from unlabeled video sequences. Given the consecutive images (I t , I t+1 ), the estimated depth map (D t ) at time t and the estimated relative camera pose (T t→t+1 ), we can establish the dense pixel correspondence between the consecutive images (I t , I t+1 ). When p t denotes the coordinate of a pixel in frame I t , the corresponding point of p t in frame I t+1 can be computed via: where K indicates the camera's intrinsic matrix. According to this geometric correspondence, we can synthesize a new image I t with the inverse warping from frame (I t+1 ). When the scene is static, there is no occlusion between the consecutive frames and the surface is Lambertian, the synthesized image I t should be consistent with the target image (I t ). The photometric discrepancy between the synthesized image and the original image can be used as an unsupervised loss function for training CNNs. Specifically, the photometric consistency loss function can be formulated as: where ρ(x) = x 2 + 2 γ is the robust generalized Charbonnier penalty function with γ = 0.45 and = 10 −3 [25]. In previous works [16][17][18], the loss function mostly used the combination of an L1 norm and a structural similarity (SSIM) to measure the photometric discrepancy, which is not suitable for realistic situations where illumination changes. This loss function [25] is used to compensate for additive and multiplicative illumination changes, thus providing us with a more reliable constancy assumption for realistic imagery.
Since the photometric consistency loss is not informative in the low-texture or homogeneous region of the scene, we adopted the edge-aware smoothness loss used in [21] to keep sharp details, which is formulated as: where |·| indicates element-wise absolute value, ∇ represents the vector differential operator, and T denotes the transpose of image gradient weighting.

Outlier Region Elimination
Under the premise of photometric consistency, the synthesized image should be photometrically compatible with the target image. However, this assumption does not hold for the outlier regions such as occlusion regions and moving objects. Therefore, we need to eliminate the outlier region in the scene and only impose the photometric consistency loss on the valid region.
The forward flow at a non-occluded pixel should equal the inverse of the backward flow at the same pixel in the second frame. Based on this forward-backward consistency assumption, we used the accurate forward-backward OF fields generated by MaskFlownet to eliminate the outlier region in the scene. Specifically, when the condition is not satisfied, we flag pixels as potentially outliers. The constraint is formulated as: where F f (p t ) denotes the forward flow of the pixel at p t , F b (p t ) denotes the backward flow of the pixel at p t . α 1 and α 2 were set to 0.01 and 0.5 in our experiment, respectively [26]. An example is shown in Figure 2, the generated mask effectively marks the invalid regions such as occlusion regions (yellow), moving objects (red) and boundaries (blue). In the boundary region, the backward OF cannot be calculated due to the camera's moving, which results in inconsistent forward and backward OF. Then we impose the photometric consistency loss on the valid region, which is formulated as: where V denotes the valid region.

Optical Flow Consistency Loss
The optical flow can be calculated from the scene depth (D t ) and the relative camera pose (T t→t+1 ) using 3D scene geometry. The calculated optical flow F cal can be represented by Therefore, we can estimate the scene depth (D t ) and the relative camera pose (T t→t+1 ) to obtain the calculated optical flow through DepthNet and PoseNet, respectively. For nonoccluded regions, the computed optical flow should be consistent with the generated forward optical flow (produced by MaskFlownet). Therefore, minimizing the difference between the two optical flow fields can be used as another loss function for jointly training DepthNet and PoseNet. Using the generated mask, our optical flow consistency loss is formulated as:

Optical Flow Consistency Loss
The optical flow can be calculated from the scene depth ( ) and the relative camera pose ( → ) using 3D scene geometry. The calculated optical flow ( ) can be represented by Therefore, we can estimate the scene depth ( ) and the relative camera pose ( → ) to obtain the calculated optical flow through DepthNet and PoseNet, respectively. For non-occluded regions, the computed optical flow should be consistent with the generated forward optical flow (produced by MaskFlownet). Therefore, minimizing the difference between the two optical flow fields can be used as another loss function for jointly training DepthNet and PoseNet. Using the generated mask, our optical flow consistency loss is formulated as:

Depth Consistency Loss
According to the generated forward optical flow, the pixel mapping relationship of the consecutive images can be determined. Then, the relationship between the depth maps of the consecutive images can also be established. Therefore, we can synthesize a new depth map ( ) with the inverse warping from depth map ( ) by using the generated forward optical flow. The synthesized depth map ( ) and the target depth map ( ) should be consistent in the valid region. Consequently, we propose a depth consistency loss to train the DepthNet by penalizing the inconsistency between the synthesized depth map ( ) and the target depth map ( ). The depth consistency loss is formulated as:

Depth Consistency Loss
According to the generated forward optical flow, the pixel mapping relationship of the consecutive images can be determined. Then, the relationship between the depth maps of the consecutive images can also be established. Therefore, we can synthesize a new depth map D t with the inverse warping from depth map (D t+1 ) by using the generated forward optical flow. The synthesized depth map D t and the target depth map (D t ) should be consistent in the valid region. Consequently, we propose a depth consistency loss to train the DepthNet by penalizing the inconsistency between the synthesized depth map D t and the target depth map (D t ). The depth consistency loss is formulated as:

Experiment and Results
We implemented our approach in the PyTorch platform and conducted all the experiments on a single NVIDIA GeForce GTX 1080Ti GPU with 11 GB memory. During training, the initial learning rate was set to 0.0002, the mini-batch was set to 4, and the loss weights were λ s = 0.5, λ f = 0.2 and λ d = 0.2. We used the Adam optimizer with β1 = 0.9 and β2 = 0.99. The images from the datasets were resized to 128 × 416 as the input of the network. Like other works [21][22][23], we applied several types of data augmentation methods to improve performance and prevent potential overfitting, including image color augmentation, rotational data augmentation and left-right pose estimation augmentation.

Datasets and Metrics
Like the work in [16], we used the KITTI dataset [27] as our main training dataset, which is the largest and most commonly used dataset for autonomous driving applications such as VO, depth and optical flow, etc. The KITTI dataset provides 56 scenes of car driving and can be classified into "city", "residential" and "road". In addition, we used the Cityscapes dataset [28] to pre-train our model, which contains more than 50 cities' stereo data without depth annotation. In order to evaluate the generalization ability of the network on a different dataset, we also used the Make3D dataset [29] in the testing phase. It only contains monocular images as well as corresponding depth maps and does not have monocular sequences and stereo image pairs.
Similar to other works such as [16][17][18][19][20][21][22][23], we used the absolute trajectory error (ATE) as the metric for pose estimation and we used the synthetic policy [30] as the metric for depth estimation. The ATE is defined as: where P i , Q i indicate the estimated pose value and its related ground truth. S denotes the similarity transformation matrix, and trans denotes fetching the translation part. The metrics used for depth estimation include the absolute relative error (Abs. Rel), the square relative error (Sq. Rel), the root mean squared error (RMSE), the log root mean squared error (RMSE log), and the prediction accuracy (δ). The definitions of these evaluation criteria are as follows: where D i and D * i represent the estimated depth value and its related ground truth. For T, 1.25, 1.25 2 , and 1.25 3 were used. The lower the value of the error metrics (Abs. Rel, Sq. Rel, RMSE, RMSE log) and the higher the value of the accuracy metric (δ), the better the performance.

Ablation Study
We conducted an ablation study on different versions of the framework to investigate the effect of different components in our network. The baseline version was the model without outlier elimination, optical flow consistency loss and depth consistency loss. Since the DepthNet and the PoseNet are jointly trained, their accuracy is interdependent during training. Therefore, we only need to evaluate the effect on one of the two models. We conducted the experiments on KITTI odometry data, with 00-08 sequences used for training and 09-10 sequences utilized for testing. The results of the ego-motion evaluation are shown in Table 1. As indicated in the table, adding the outlier elimination, optical flow consistency loss and depth consistency loss can greatly enhance the model's performance.

Evaluation of Ego-Motion Estimation
Although depth and pose are jointly trained, they are tested separately and their accuracy is interdependent. We evaluated the performance of PoseNet on the official KITTI visual odometry split, which includes 11 sequences with ground truth. We utilized 00-08 sequences for training and 09-10 sequences for testing to compare with other approaches. A row of each of the five images was truncated as a sequence of images as the input of the network during training. We compared our model to the other unsupervised methods [16,18,21,23] and the classical SLAM framework (ORB-SLAM). ORB-SLAM (full) allows for closed loop and re-localization, while ORB-SLAM (short) has no closed loop and re-localization. For the problem of scale ambiguity in monocular VO, we aligned the per-frame scale to the ground truth. The quantitative evaluation results for pose estimation on the KITTI dataset are shown in Table 2. The trajectories of sequence 09 and 10 produced by different methods are plotted in Figure 3. It is obvious that our method outperforms all of the others. ORB-SLAM (full) [2] 0.014 ± 0.008 0.012 ± 0.011 ORB-SLAM (short) 0.064 ± 0.141 0.064 ± 0.130 Zhou et al. [16] 0.016 ± 0.009 0.013 ± 0.009 Mahjourian et al. [18] 0.013 ± 0.010 0.012 ± 0.011 Yin et al. [21] 0.012 ± 0.007 0.012 ± 0.009 Ranjan et al. [23] 0.012 ± 0.007 0.012 ± 0.008 Ours 0.010 ± 0.005 0.009 ± 0.006

Evaluation of Depth Estimation
We used the KITTI dataset to evaluate the depth estimation. For better comparison with other works, we followed the dataset segmentation suggested by Eigen et al. [5] and Zhou et al. [16] for training and testing because the segmentation is commonly accepted by the research community for depth benchmark purposes. A total of 44,540 raw KITTI images were used for the training and validation, of which 40,109 images were used for training and 4431 for validation. The other 697 images were selected for testing. A row of each of the three images was truncated as a sequence of images as the input of the network during training. The ground truth was achieved by projecting the Velodyne laser scanned points onto the image plane for error and accuracy metrics evaluation. For the problem of scale ambiguity, we calculated a scale factor to match the estimated depth with the ground truth in the following form: s = median D gt /median(D pred ). Table 3 shows the comparison between different studies. In the table, the second column defines the supervision signals used in the network. "Depth" means ground truth of depth and is used for supervised learning in the method, "Stereo" denotes that in the training, stereo sequences with known stereo camera pose are employed, and "Mono" means monocular sequences are used in the training. The third column defines the dataset used for training. K means trained only on KITTI dataset and CS + K denotes fine-tuning on the KITTI dataset following pre-training on the Cityscapes dataset. Our algorithm outperformed both the supervised and unsupervised methods, as demonstrated in Table 3.

Evaluation of Depth Estimation
We used the KITTI dataset to evaluate the depth estimation. For better comparison with other works, we followed the dataset segmentation suggested by Eigen et al. [5] and Zhou et al. [16] for training and testing because the segmentation is commonly accepted by the research community for depth benchmark purposes. A total of 44,540 raw KITTI images were used for the training and validation, of which 40,109 images were used for Furthermore, we used the Cityscapes dataset [28] to pre-train our model and used the KITTI dataset for fine-tuning. The results (in the bottom part of Table 3) show some improvement in depth prediction, indicating that expanding the training data can enhance the models' performance. The depth maps in Figure 4 were estimated by our model and in [16,21]. When compared to the other methods, our method produces sharper and more accurate depth maps.

Generalization to Other Datasets
We tested the models on a new dataset, the Make3D without utilizing it for training to assess their generalization ability. In our experiment, we used the KITTI and Cityscapes datasets for training and used the Make3D for testing. The quantitative and qualitative evaluation results are shown in Table 4 and Figure 5. The results show that our method can perform well, even in unknown datasets and is superior to other methods.

Generalization to Other Datasets
We tested the models on a new dataset, the Make3D without utilizing it for training to assess their generalization ability. In our experiment, we used the KITTI and Cityscapes datasets for training and used the Make3D for testing. The quantitative and qualitative evaluation results are shown in Table 4 and Figure 5. The results show that our method can perform well, even in unknown datasets and is superior to other methods.

Generalization to Other Datasets
We tested the models on a new dataset, the Make3D without utilizing it for training to assess their generalization ability. In our experiment, we used the KITTI and Cityscapes datasets for training and used the Make3D for testing. The quantitative and qualitative evaluation results are shown in Table 4 and Figure 5. The results show that our method can perform well, even in unknown datasets and is superior to other methods.

Conclusions
We present a novel unsupervised learning pipeline for estimating depth and camera ego-motion in this study. We introduced a trained optical flow estimation network and made full use of it, including learning optical flow to estimate the camera motion, generating a mask that eliminates outlier regions, and adding additional geometric constraints. The results of the ablation experiments demonstrate their importance in improving the performance of the framework. Experiments on the KITTI dataset indicate that our algorithm outperforms other unsupervised methods. In the future, we want to enhance our framework to include a visual SLAM system to decrease drift.