The Constraints between Edge Depth and Uncertainty for Monocular Depth Estimation

The self-supervised monocular depth estimation paradigm has become an important branch of depth-estimation tasks in computer vision. However, the depth estimation problems arising from object edge depth pulling or occlusion remain unsolved. The grayscale discontinuity of object edges leads to relatively high depth uncertainty for pixels in these regions. We improve the prediction of geometric edges by taking uncertainty into account in the depth-estimation task. To this end, we explore how uncertainty affects this task and propose a new self-supervised monocular depth estimation technique based on multi-scale uncertainty. In addition, we introduce a teacher–student architecture into our models and investigate the impact of different teacher networks on the depth and uncertainty results. We evaluate the performance of our paradigm in detail on the standard KITTI dataset. The experimental results show that, compared with the benchmark Monodepth2, the accuracy of our method increased from 87.7% to 88.2%, the AbsRel error decreased from 0.115 to 0.110, the SqRel error decreased from 0.903 to 0.822, and the RMSE decreased from 4.863 to 4.686. Our approach has a positive impact on the problems of texture replication and inaccurate object boundaries, producing sharper and smoother depth images.


Introduction
Monocular depth estimation refers to learning a dense, pixel-level depth map from a video stream. It is a fundamental challenge in the field of computer vision with potential applications in robotics, autonomous driving, 3D reconstruction, and medical imaging [1][2][3][4]. How to predict a high-quality dense depth map remains an open problem. Because the edges of objects in an image are prone to noise, bleeding, feature shifts, and changes in surface curvature, the depth around objects is easily distorted. Even with high-precision camera equipment, these factors are inevitably introduced while acquiring image data.
The exploration of edge depth has been around since long before deep learning became prevalent. By edge depth, we mean the depth of the edges of objects in the depth map. In the field of free point-of-view television technology, to improve the depth characteristics of object edges, Liu et al. [5] classified the pixels of the depth map and proposed a modified smoothness function to improve the accuracy of object edge depth values. However, monocular depth estimation methods have paid little attention to occlusion and image detail distortion. Related research is reflected in edge-aware depth estimation techniques.
To solve the problem of image edge detail distortion, Chou et al. [6] investigated the effect of depth-map quality on synthetic focusing performance and proposed a synthetic focusing paradigm that integrates RGB images and depth information. In the task of reconstructing depth from raw video data using neural networks, Yang et al. [7] proposed using surface normal vectors to constrain the estimated depth, constructing a depth-normal consistency that perceives the edges of objects within the image.
Additional work includes estimating the depth map from the light field by sparse depth edges and gradients [8] and extracting RGB image edges and depth edges for alignment to improve the estimation accuracy [9].
Currently, most work on improving edge depth requires introducing an additional network, e.g., semantic segmentation [10][11][12], edge map detection networks [13][14][15], or optical flow [16]. We found that research on uncertainty, which has only recently entered the limelight, can also improve the quality of edge depth, without learning other complex networks. Uncertainty falls into two categories, epistemic and aleatoric [17].
The former can be used to recognize examples that differ from those inside the training set, such as new scenes or new targets, for which the model will predict wrong depths with high probability; such wrong depth results need to be detected. The latter can correctly learn the uncertainty (confidence) of the depth at object edges, which is exactly what we require. Monocular depth estimation mainly optimizes the photometric error of pixels, with little attention to the depth of geometric edges. This strategy of encouraging depth continuity can lead to misalignment of edge depth in certain situations, e.g., under low texture, low luminance, or occlusion. To this end, we investigated the depth uncertainty of the monocular depth-estimation task.
In summary, to obtain an edge-aware depth estimation model, we introduce a depth uncertainty strategy and use a transfer-learning framework that allows our network to learn better quality models on monocular video sequences. Our main contributions are summarized as follows:

1. We study the impact of multi-scale uncertainty on self-supervised monocular depth estimation and find that it yields more edge-depth uncertainty.
2. We analyze the effect of different teacher-student combination strategies on the uncertainty of self-supervised monocular depth estimation.
3. We propose a paradigm for self-supervised monocular depth estimation based on a teacher-student framework combined with multi-scale uncertainty. Our method effectively improves the overall performance of depth estimation.
We provide detailed experimental results on the KITTI [18] dataset. Our qualitative results show that our method obtains smoother results on the edges of people, cars, or road signs compared to previous work, which blends some of the edges with the background. The depth uncertainty map correctly represents the uncertainty of geometric edges and can restrict the learning of depth to edge pixels with large uncertainty.

Related Work
In this section, we review the relevant paradigms for self-supervised monocular depth estimation and how uncertainty estimation techniques are used on depth-estimation tasks.

Self-Supervised Methods
Before the prevalence of self-supervised methods, supervised methods were mainly used. Learning-based methods use the relationship between color images and their depth to fit prediction models, for example by combining nonparametric scene sampling [19] or local prediction [20]. Later, end-to-end supervised models emerged [21,22]. Learning-based approaches also represent the state of the art in optical flow and stereo estimation.
However, ground truth data are difficult to obtain in diverse real-world environments, and predicting depth without labeled data is a widely pursued goal. To overcome this problem, self-supervised methods based on image reconstruction have become a popular research topic; they fall into two classes, trained on monocular video sequences or on stereo pairs.
The paradigm of stereo pairs is to model the geometric properties between stereo image pairs as depth information, which is obtained by projection transformations between images [23]. Such an approach allows training the network based on the loss of photometric error between the actual image and the projected image. Later, model architectures for pose prediction networks and depth prediction networks between frames of video sequences were developed [24].
Garg et al. [25] used L2 loss as the photometric error loss while producing ambiguity in the prediction results. Godard et al. [23] adopted a combination of L1 loss as well as SSIM loss to train the model named Monodepth, which was combined with post-processing operations to obtain more accurate depth accuracy. Then, they proposed the Monodepth2 [26] model, which solved the depth blurring problem in the object occlusion region to an extent by minimizing the reprojection loss for each pixel. Some specific network structures [27][28][29][30] or improved loss functions [31] have also appeared to optimize models.
There are also hybrid methods that use both stereo pair data and video sequence frames [32,33], as well as other refinement strategies [34][35][36]. Some recent approaches use relatively bulky architectures to improve depth quality [37], at higher memory and time cost.

Uncertainty Estimation
Uncertainty in decision making is crucial in real-world applications of computer vision, as it prevents overconfident, wrong decisions. Before the prevalence of neural networks, uncertainty was studied for stereo matching and optical flow. In stereo matching, confidence can be learned by inferring disparities from network feature maps and estimating disparity maps [38]. Confidence inference for optical flow mainly comes in two types, posterior and model-inherent inference: the first analyzes the uncertainty score of the optical flow field [39], and the second uses an energy-minimization module [40].
Recently, depth-estimation tasks have used uncertainty to improve model performance by adding confidence to the model output [41]. Depth estimation for 3D reconstruction can also introduce uncertainty in depth to improve the accuracy and robustness of learning [42]. To address depth estimation in regions without illumination, [43] extends the Gated2Depth framework by adding uncertainty to help filter the depth of these regions.
Our work focuses on solving the problem of inaccurate geometric edge depth estimation in complex scenes. Similar in objective to ours, [42] proposed a new photometric loss function for uncertainty-based monocular depth estimation to solve the edge-pixel pulling caused by object movement. However, they rely mainly on the proposed loss to constrain the pulling of moving-object depth and do not fully exploit the role of uncertainty in this problem; instead, uncertainty is used only to evaluate the reliability of the output results.
In contrast, we consider the distribution of uncertainty at the geometric edges of images, which can be used to constrain the learning of object edge depth. Moreover, we find that uncertainty contributes greatly not only to the pulling problem of moving objects but also to the case of occlusion. By constraining the depth with uncertainty, it is possible to distinguish the depth of the object in front from the depth of the obscured object behind.
Previous methods have predicted two overlapping objects as having the same depth. Figure 1 shows depth estimation with edge awareness. The pulling of moving character edges is not effectively addressed in Monodepth2, while our method solves this problem by constraining the depth values of edge pixels with larger uncertainty.

Method
In this section, we first introduce the basic concepts of self-supervised depth estimation, then introduce techniques for uncertainty in depth-estimation tasks, and finally introduce the teacher-student frameworks and multi-scale uncertainty.
The overview diagram of our method is shown in Figure 2. To decouple the depth and pose networks when modeling uncertainty, we introduce the teacher-student framework. The teacher uses self-supervised monocular depth estimation, which incorporates a depth and pose network using a video sequence as input, where t' represents adjacent frames. To investigate the effect of model uncertainty on depth estimation, we design different strategies, where BaseT (baseline as teacher), DropT (dropout as teacher), and BagT (bagging as teacher) denote different teacher networks. The parameter N denotes the number of depth and pose networks used by the strategies. The student includes only the depth network and incorporates the uncertainty distribution of depth.

Self-Supervised Depth Estimation
Self-supervised monocular depth estimation aims to estimate pixel-level depth values for a frame sequence without ground truth, using the geometric constraints between multiple frames as the supervision signal. Specifically, during training, the image frame at discrete time t' is warped to the frame at time t:

p_t' ∼ K T_t→t' D_t(p_t) K⁻¹ p_t,

where K is the known intrinsic matrix of the camera, T_t→t' denotes the spatial transformation between images I_t and I_t', p_t is a pixel in the image, and D_t(p_t) is the depth of that pixel at time t. We can therefore obtain the depth D_t and the spatial transformation T_t→t' through a depth network and a pose network. A popular choice for the loss function L_ss is a weighted sum of the Structural Similarity Index Measure (SSIM) [44] and the L1 loss:

L_ss = α (1 − SSIM(I_t, Î_t)) / 2 + (1 − α) |I_t − Î_t|,

where Î_t is the reconstructed (warped) image and α is commonly set to 0.85 [26]. In addition, adjacent locations are encouraged to have similar depth values by an edge-aware smoothness loss on the mean-normalized inverse depth ρ̂_t = ρ_t / ρ̄_t:

L_sm = |∂_h ρ̂_t| e^(−|∂_h I_t|) + |∂_w ρ̂_t| e^(−|∂_w I_t|),

where ∂_h and ∂_w, respectively, denote the one-dimensional difference quotients along the image height and width directions. This encourages adjacent pixels to have contiguous depths, causing the network to ignore the depth of edge pixels during training.
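The two losses above can be sketched in a few lines of pure Python. This is a minimal illustration over scalar pixel values and a 1-D image row (the helper names and 1-D layout are ours, not the paper's actual implementation; real code would operate on image tensors):

```python
import math

ALPHA = 0.85  # weight between the SSIM and L1 terms, as in Monodepth2

def photometric_loss(ssim_val, l1_val, alpha=ALPHA):
    """Weighted photometric error for one pixel:
    alpha * (1 - SSIM)/2 + (1 - alpha) * L1."""
    return alpha * (1.0 - ssim_val) / 2.0 + (1.0 - alpha) * l1_val

def smoothness_loss(inv_depth, image):
    """Edge-aware smoothness over a 1-D row of inverse depths,
    mean-normalized, with the penalty down-weighted where the
    image gradient is large (i.e., at likely edges)."""
    mean_d = sum(inv_depth) / len(inv_depth)
    rho = [d / mean_d for d in inv_depth]          # mean-normalized
    total = 0.0
    for i in range(len(rho) - 1):
        d_grad = abs(rho[i + 1] - rho[i])          # depth gradient
        i_grad = abs(image[i + 1] - image[i])      # image gradient
        total += d_grad * math.exp(-i_grad)
    return total / (len(rho) - 1)
```

A perfect reconstruction (SSIM = 1, L1 = 0) gives zero photometric loss, and a constant inverse-depth row gives zero smoothness loss regardless of image content.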

Uncertainty Estimation
The uncertainty in the depth-estimation task comes from two sources: the uncertainty of the model's own ability to learn from the data, and the uncertainty inherent in the data itself. The former is due to limitations in the amount of data the model has learned, resulting in large uncertainty in the predicted depth for unlearned scenarios; it can be reduced by expanding the dataset. We still take model uncertainty into account because training sets are usually collected in similar scenes, e.g., street scenes or indoor scenes.
Although the sample size is large, the sample variety is not complete, and the model may still encounter unlearned objects. The latter comes from the noise introduced by acquisition devices during data collection and from the uncertainty caused by complex scenes, such as low brightness, unclear textures, and edges blurred by relative motion or occlusion of objects. It cannot be reduced by expanding the dataset, but it serves as a constraining signal for the network that helps us solve the edge-depth problems associated with depth pulling or occlusion.

Uncertainty in Depth Models
The uncertainty of the model (also known as epistemic uncertainty) can be estimated by measuring the variance between multiple network instances. One typical method is Monte Carlo Dropout [45], in which multiple network instances are sampled from the weight distribution of a single model: connections between network layers are randomly dropped with some probability, dropout is kept on at test time, and a different instance of the model is obtained each time it is sampled. The mean µ(d) and variance σ²(d) can then be calculated from N forward inferences:

µ(d) = (1/N) Σᵢ dᵢ,    σ²(d) = (1/N) Σᵢ (dᵢ − µ(d))²,

where dᵢ is the depth predicted by the i-th sampled instance. The variance computed from the N samples is defined as the model uncertainty u_mod.
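A minimal sketch of the Monte Carlo Dropout estimate, with a stand-in stochastic "network" instead of a real depth model (the function names and the Gaussian noise stand-in are illustrative assumptions, not the paper's code):

```python
import random
import statistics

def mc_dropout_uncertainty(forward, x, n=8, seed=0):
    """Run N stochastic forward passes (dropout left on at test time)
    and return the mean mu(d) and variance sigma^2(d); the variance is
    the epistemic uncertainty u_mod."""
    random.seed(seed)
    samples = [forward(x) for _ in range(n)]
    mu = statistics.fmean(samples)
    var = statistics.pvariance(samples, mu)  # population variance over N
    return mu, var

# Stand-in for a depth network with dropout: each call perturbs the output.
def noisy_depth(x):
    return x + random.gauss(0.0, 0.1)
```

With a deterministic network the variance collapses to zero; with the noisy stand-in, the recovered variance approaches the injected noise variance as N grows.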
A similar sampling approach, bagging (also called bootstrap aggregation) [46], can also be used to compute the uncertainty of the depth model. By training different model instances on random subsets of the training set, we can compute the mean µ(d) and variance σ²(d) of the different depth outputs. This approach requires training N independent sub-networks and running a forward pass through each of them to compute the variance.
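The bagging setup can be sketched as follows, using the paper's configuration of N = 8 sub-networks each trained on a random 25% subset (the helper names are ours; training itself is omitted):

```python
import random
import statistics

def make_bagging_subsets(train_set, n_models=8, frac=0.25, seed=0):
    """Draw one random subset of the training set per sub-network,
    matching the paper's setup (N = 8, 25% of the data each)."""
    rng = random.Random(seed)
    k = max(1, int(len(train_set) * frac))
    return [rng.sample(train_set, k) for _ in range(n_models)]

def bagging_uncertainty(predictions):
    """Mean and variance across the N sub-networks' depth outputs
    for one pixel; the variance is the model uncertainty."""
    mu = statistics.fmean(predictions)
    return mu, statistics.pvariance(predictions, mu)
```

If all sub-networks agree, the variance (and hence the epistemic uncertainty) is zero; disagreement between sub-networks raises it.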

Uncertainty from Depth Distribution
Unlike the model uncertainty presented above, the uncertainty introduced by the data is irreducible (also called aleatoric). Monocular depth estimation predicts the depth of the same object from multiple viewpoints, based on the assumption of grayscale invariance. Clearly, real-world objects have different grayscale values at their boundaries due to light intensity, surface curvature, or complex structure, which contradicts this assumption. To encode these uncertainties, we learn a prediction model whose predicted values are a function of the depth network weights and inputs.
A popular strategy is to train a network to infer the parameters of the depth distribution p(d*|I) by minimizing the negative log-likelihood, where the network weights are denoted by w. When the loss on d* is an L1 loss, the predictive distribution can be modeled as a Laplacian [47]:

−log p(d*|I) ∝ |µ(d) − d*| / σ(d) + log σ(d).

In the self-supervised monocular task, the ground truth d* is unavailable, so the depth data uncertainty u_dat can instead be modeled through photometric matching [48], which amounts to minimizing:

L_dat = pe(I_t, Î_t) / σ(d) + log σ(d),

where pe denotes the photometric error. The variance is trained in logarithmic form to avoid zero variance, and the extra logarithmic term prevents pixels from predicting infinite uncertainty.
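The per-pixel loss with the log-variance substitution can be written in a few lines. This sketch (function name ours) also shows why the log term matters: without it, the network could drive the loss to zero by predicting infinite uncertainty everywhere:

```python
import math

def aleatoric_loss(residual, u):
    """Per-pixel negative log-likelihood under a Laplacian model,
    parameterized with u = log(sigma) so the variance never hits zero:
    L = |residual| * exp(-u) + u.
    Setting dL/du = 0 gives the optimum u* = log|residual|, i.e. the
    learned uncertainty tracks the size of the unexplainable error."""
    return abs(residual) * math.exp(-u) + u
```

For a pixel with a large residual, raising u lowers the loss (the network "admits" uncertainty); for a pixel with zero residual, u = 0 already gives zero loss.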

Multi-Scale Depth Uncertainty
The monocular depth estimation network uses multi-scale feature maps to prevent training from falling into a local minimum, an idea that comes from the work of Lin et al. [49]. The depth uncertainty is modeled on the depth weights, and we believe that the depth at each scale should be constrained by uncertainty estimated at the corresponding scale. Specifically, the decoder is given extra intermediate outputs; each contains one 3 × 3 convolution, so that depth and uncertainty can also be estimated at lower scales. Existing models mostly use multiple scales for image reconstruction and depth estimation; our total loss is a combination of the losses at every single scale.
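The multi-scale combination can be sketched as follows: an uncertainty-weighted loss is computed per scale, pairing each scale's residuals with the uncertainty map estimated at that same scale, and the total is their average (a simplification of the paper's setup; equal scale weights and the flat per-scale lists are our assumptions):

```python
import math

def scale_loss(residuals, log_sigmas):
    """Uncertainty-weighted photometric loss at one scale:
    mean of |r| * exp(-u) + u over the scale's pixels."""
    assert len(residuals) == len(log_sigmas)
    return sum(abs(r) * math.exp(-u) + u
               for r, u in zip(residuals, log_sigmas)) / len(residuals)

def multiscale_loss(scales):
    """Total training loss: average of the per-scale losses, so each
    scale's depth is constrained by its own uncertainty map."""
    return sum(scale_loss(r, u) for r, u in scales) / len(scales)
```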
In low-to-medium resolution images, large low-texture regions are not easy to learn. Since monocular depth estimation is predicted based on grayscale values, similar grayscale values between objects or backgrounds cause the network to tend to predict continuous depth. Figure 3 shows a visual example of depth uncertainty estimation. The color of pedestrian clothes in the top left image is similar to the background, and the color of vehicles in the bottom left image is similar to the low textured trees in the background.
These regions are predicted with strong depth uncertainty, which helps keep the network from incorrectly predicting continuous depth across discontinuous boundaries. The multi-scale uncertainty in the figure predicts more uncertainty details (compare the uncertainty intensity of the pedestrians and the trees in the background), which is helpful for structurally complex objects.

Teacher-Student Frameworks
To decouple depth and pose when modeling depth uncertainty, we first train a self-supervised monocular depth estimation network and then use a depth network to mimic it. This teacher-student framework is a type of transfer learning, which can learn smaller models in the same field. Generally, the teacher is a complex deep neural network, while the student is a lightweight, simple model. Poggi et al. [50] improved depth estimation performance by incorporating this architecture into a network.
We used three teacher-student combinations to investigate the effects on modeling depth uncertainty. The teacher models were, respectively, a self-supervised monocular depth estimation network, a network with dropout layers, and a bagging-strategy model. The student mimics the depth distribution of the teacher, which is equivalent to supervised learning with the supervision signal coming from the teacher's output; depth uncertainty can then be modeled on the student. Specifically, we train a teacher instance to obtain an output d_T. Assuming an L1 loss, the depth uncertainty can be modeled as:

L_TS = |µ(d_S) − d_T| / σ(d_S) + log σ(d_S).

To avoid zero values in the denominator, let u_S = log σ(d_S); the loss function then becomes:

L_TS = |µ(d_S) − d_T| e^(−u_S) + u_S,

where µ(d_S) and u_S are the mean and log-uncertainty of the student output. When the predicted depth cannot mimic the teacher depth well, the L1 term grows, and the network increases the uncertainty of those pixels in order to minimize the loss.
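The teacher-student loss described above, averaged over pixels, can be sketched directly (function name and the flat per-pixel lists are ours):

```python
import math

def teacher_student_loss(mu_s, u_s, d_t):
    """L_TS with u_S = log(sigma(d_S)):
    mean over pixels of |mu(d_S) - d_T| * exp(-u_S) + u_S.
    mu_s: student depth means, u_s: student log-uncertainties,
    d_t: teacher depths (the supervision signal)."""
    n = len(mu_s)
    return sum(abs(m - t) * math.exp(-u) + u
               for m, t, u in zip(mu_s, d_t, u_s)) / n
```

When the student matches the teacher exactly and u_S = 0, the loss is zero; when the residual is large, raising u_S lowers the loss, which is exactly the mechanism by which hard-to-mimic pixels receive higher uncertainty.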

Experiments
In this section, we validate the effectiveness of using uncertainty for self-supervised depth estimation and investigate different teacher networks.

Training Details and Metrics
First, we describe the relevant training details and the metrics used to evaluate the models.

Details on the Learning Procedure
In our experiment, the training process uses monocular sequences, and the open-source model Monodepth2 [26] is chosen as the baseline. Most of our protocol follows the settings of [26]: the input and output image size is 192 × 640, training runs for 20 epochs with Adam, and the batch size is 12 (due to memory limitations, we use gradient accumulation). Moreover, the encoder is initialized with ImageNet [51] pre-training. Dropout is turned on during both training and testing and is used only in the decoder. For our methods, we set the hyperparameter N to 8 and randomly draw 25% of the training set for each bagging network.

Depth Metrics
To compare the performance of depth networks, we report the following seven standard criteria: Absolute Relative Error (AbsRel), Squared Relative Difference (SqRel), Root Mean Squared Error (RMSE), Root Mean Squared Logarithmic Error (RMSE log), and three accuracy metrics (threshold δ < 1.25^k, k ∈ {1, 2, 3}). We refer the reader to [52] for a detailed description of these metrics.
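As a reference, a subset of these metrics can be computed as follows from flat lists of predicted and ground-truth depths (function name ours; this is the standard formulation from [52], not the paper's evaluation code):

```python
import math

def depth_metrics(pred, gt):
    """AbsRel, RMSE, and the three threshold accuracies
    (fraction of pixels with max(pred/gt, gt/pred) < 1.25^k)."""
    n = len(pred)
    abs_rel = sum(abs(p - g) / g for p, g in zip(pred, gt)) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / n)
    ratios = [max(p / g, g / p) for p, g in zip(pred, gt)]
    acc = [sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)]
    return abs_rel, rmse, acc
```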

Uncertainty Metrics
In addition to the above metrics for the depth estimation model, we measure the Area Under the Sparsification Error (AUSE) and the Area Under the Random Gain (AURG) to evaluate the quality of the uncertainty predictions. Both derive from so-called sparsification plots [47], which show the agreement between the estimated uncertainty and the true error. The curve obtained by sorting pixels in descending order of true error is called the oracle sparsification, and the area between the estimated sparsification and the oracle is the AUSE.
We used the method described in [50] to measure the area under the sparsification error curve as the first indicator. Specifically, pixels are sorted in descending order of estimated uncertainty; we repeatedly remove a 2% subset of the pixels and plot the error of the remaining pixels. If uncertainty is correctly encoded, this curve decreases, so lower values of the indicator are better. In addition, AURG is obtained by subtracting the estimated sparsification from a random-uncertainty baseline; here, higher is better. Following [50], we use three error metrics, namely AbsRel, RMSE, and δ < 1.25.
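The sparsification procedure can be sketched in pure Python over flat lists of per-pixel errors and uncertainties (function names and the trapezoid-free area approximation are our simplifications of the scheme in [47,50]):

```python
def sparsification_curve(errors, ranking, step=0.02):
    """Sort pixels by `ranking` (descending), repeatedly drop the top
    `step` fraction, and record the mean error of the remaining pixels."""
    order = sorted(range(len(errors)), key=lambda i: -ranking[i])
    n = len(errors)
    curve, drop = [], 0
    while n - drop > 0:
        kept = order[drop:]
        curve.append(sum(errors[i] for i in kept) / len(kept))
        drop += max(1, int(step * n))
    return curve

def ause(errors, uncertainties, step=0.02):
    """Mean gap between the uncertainty-ranked curve and the oracle
    curve (ranking by the true error itself). Lower is better; zero
    means the uncertainty ranks pixels exactly like the true error."""
    est = sparsification_curve(errors, uncertainties, step)
    oracle = sparsification_curve(errors, errors, step)
    return sum(e - o for e, o in zip(est, oracle)) / len(est)
```

A perfectly calibrated uncertainty (identical ranking to the error) gives AUSE = 0, while an anti-correlated one gives a large positive value.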

Dataset
We used the KITTI dataset [18], outdoor-scene imagery captured by a vehicle equipped with a depth sensor while driving on city streets; it contains 61 scenes. We used Eigen et al.'s [53] data split and followed Zhou et al.'s [24] preprocessing to remove static frames. This yields 39,810 monocular triplets for training and 4424 for validation. We set the principal point of the camera to the image center and use the same intrinsics for all images, taking the average of all focal lengths as the final focal length. During evaluation, we cap the maximum depth at 80 m, following standard practice [23], and use the per-image median ground-truth scaling introduced by [24] to report results.
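The per-image median scaling mentioned above compensates for the scale ambiguity of monocular depth; a minimal sketch (function name ours, with the 80 m evaluation cap folded in):

```python
import statistics

def median_scale(pred, gt, max_depth=80.0):
    """Per-image median ground-truth scaling [24]: monocular depth is
    only defined up to scale, so rescale each prediction by
    median(gt) / median(pred), then clamp to the evaluation cap."""
    s = statistics.median(gt) / statistics.median(pred)
    return [min(p * s, max_depth) for p in pred]
```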

Monocular Depth Estimation with Uncertainty
Here, we compare against previously published monocular estimation methods and demonstrate the effectiveness of our approach through depth estimation evaluation. We then conduct ablation experiments on multiple model architectures with depth uncertainty on different KITTI benchmarks.

KITTI Eigen Split
We compared the best combined results of our model with those of some state-of-the-art depth estimation models, all of which are monocular methods. The quantitative results are shown in Table 1, all on KITTI using the Eigen split, with the best result for each evaluation metric shown in bold. As can be seen, our model achieves the best result on every metric, significantly improving on the baseline Monodepth2: the error metrics decreased by 4.35% (AbsRel), 8.97% (SqRel), 3.64% (RMSE), and 3.11% (RMSE log), and the accuracy increased by 0.57% (δ < 1.25). The qualitative results are shown in Figure 4. To further elaborate these findings, we selected representative results to explain the effect of uncertainty on depth estimation. Figure 5 contains two typical scenes, occlusion and a low-texture region. Occluding objects (with a front-to-back positional relationship) are typically predicted to have the same depth values; in the first image, Monodepth2 predicts the rear vehicle and the front vehicle as having similar depths.
Our method correctly detects the depth of the vehicle in front. The second image is a case of predicting the depth of a low-texture region with a dark background and low color saturation of the road sign. Monodepth2 produces a depth-pulling phenomenon at the edges of the road sign: the predicted depth region is larger, in pixels, than the sign in the input image. In contrast, our method produces better results. Figure 5. Depth estimation with edge optimization. In the case of object occlusion (the white and black cars in the top image), Monodepth2 predicts both as having similar depths; the silvery green region in the figure is the outline of the front car. Our method correctly predicts the depth of the front car. For the object in the low-texture region (the road sign in the bottom image), the depth area predicted by Monodepth2 is not only larger in pixels than in the original image but also slightly blurred at the depth edge (the depth difference from the background is not obvious). Our method produces better results in terms of both pixel size and edge smoothness.

Ablation Study
To study depth uncertainty and how our strategy contributes to monocular depth estimation training, we implemented two kinds of depth uncertainty (aleatoric and epistemic) with different teacher-student strategies for ablation. Table 2 reports the final results, where BaseT, BagT, and DropT denote using the baseline, bagging, and dropout networks as the teacher models. In addition, A, E, and S indicate the use of aleatoric, epistemic, and single-scale uncertainty, respectively (multi-scale uncertainty is used by default). T+S indicates that a method uses the teacher-student framework, and Alea and Epis are shorthand for aleatoric and epistemic.
The bagging method, which generates model uncertainty, obtained the best RMSE, while BagT+AE, which combines bagging as the teacher model with both uncertainty methods, obtained the best SqRel. This indicates that the model learns richer information when trained with the combination of uncertainties, reducing the error between the predicted and true results.
Our baseline, Monodepth2, already produces high accuracy by itself, and the networks using the baseline as the teacher model all improve in accuracy over it, showing that the teacher-student strategy allows the final network to learn a better distribution of weights. For uncertainty evaluation, the best combined result is BaseT+A, while the single-scale variant (BaseT+A+S) obtains suboptimal results. The scale-dependent experiments are performed only on the strategy with the optimal results (BaseT+A).
We report visualizations of depth uncertainty under the different strategies in Figure 6. The approaches using the teacher-student strategy clearly yield sharper uncertainty results, and using the baseline as the teacher model is superior. These observations are consistent with the quantitative results in Tables 2 and 3. Tables 4 and 5 show the quantitative results of our different strategies on the new KITTI benchmark.
Consistent with the previous analysis, the baseline-as-teacher method with multi-scale uncertainty, BaseT+A, obtains the best combined evaluation metrics. Among the baseline-as-teacher approaches, several uncertainty evaluation metrics are identical; since these methods share the same teacher-student architecture and differ only in the uncertainty method, this shows that the main strategy affecting the overall performance of the model is the teacher-student framework. In addition, adding epistemic uncertainty reduces accuracy.

Conclusions
In this paper, we proposed an improved model for self-supervised monocular depth estimation. Studying the influence of uncertainty at different scales on depth, we found that multi-scale uncertainty captures more depth-uncertainty details than single-scale, which helps solve the depth-boundary blur caused by occlusion and low texture. In addition, we studied the effects of different teacher-student paradigms (using different networks as teachers) on monocular depth performance.
This architecture not only decouples the original depth and pose networks but also uses a simple model to learn the weight distribution of the teacher network. The student model showed better overall performance, improving the depth accuracy of the model. Our results show that the network combining multi-scale depth uncertainty with the optimal teacher-student architecture achieved the best results, and the qualitative results show that our approach generates high-quality depth maps with clarity and detail.

Conflicts of Interest:
The authors declare no conflict of interest.