An Adaptive Refinement Scheme for Depth Estimation Networks

Deep learning has proved to be a breakthrough in depth generation. However, the generalization ability of deep networks is still limited, and they cannot maintain satisfactory performance on some inputs. To address a similar problem in the segmentation field, a feature backpropagating refinement scheme (f-BRS) was proposed to refine predictions at inference time. f-BRS adapts an intermediate activation function to each input by using user clicks as sparse labels. Given the similarity between user clicks and sparse depth maps, this paper aims to extend the application of f-BRS to depth prediction. Our experiments show that f-BRS, fused with a depth estimation baseline, gets trapped in local optima and fails to improve the network predictions. To resolve this, we propose a double-stage adaptive refinement scheme (DARS). In the first stage, a Delaunay-based correction module significantly improves the depth generated by a baseline network. In the second stage, a particle swarm optimizer (PSO) refines the estimate by fine-tuning the f-BRS parameters, i.e., scales and biases. DARS is evaluated on an outdoor benchmark, KITTI, and an indoor benchmark, NYUv2, while for both, the network is pre-trained on KITTI. The proposed scheme proved effective on both datasets.


Introduction
Dense depth maps play a crucial role in a variety of applications, such as simultaneous localization and mapping (SLAM) [1], visual odometry [2], and object detection [3]. With the advent of deep learning (DL) and its ever-growing success in most fields, DL methods have also been utilized for generating dense depth (DD) maps and have demonstrated a prominent improvement in this field.
Depending on the primary input, DL-based depth generation methods can be categorized into depth completion and estimation methods. Depth completion methods try to fill the gaps present in input sparse depth (SD) maps [4,5], whereas depth estimation ones attempt to estimate depth for each pixel of an input image [6][7][8][9][10]. Although the results provided by depth completion methods [4,11] are usually more accurate than those from depth estimation ones, they need to be supplied by a remarkably large number of DD maps as targets in the training stage. This is while collecting such data in a real-world application is an expensive and time-consuming task [8,12].
In parallel, the input to depth estimation methods includes no depth maps; however, supervised ones take either SD [13] or DD maps [14][15][16] as the target during training. Between these two supervised depth estimation approaches, using SD maps usually leads to less accurate results but is more viable than using DD ones, because SD-based methods only need SD maps, which can be provided by a LiDAR sensor without any post-processing or labeling effort. Considering the above issues, supervised depth estimation methods that use sparse depth maps are preferred, especially for most real-world cases in which access to large-enough accurate DD maps is difficult or even impossible.
Similar to all DL methods, DL-based depth estimation methods, whether supervised or unsupervised, suffer from the generalization problem. In other words, DL models trained on one dataset cannot maintain their performance on unseen inputs. Overall, our contributions can be summarized as:

• A novel double-stage adaptive refinement scheme for monocular depth estimation networks. The proposed scheme needs neither offline data gathering nor offline training, because it uses available pre-trained weights.

• Introduction of functional adaptation schemes to the field of depth generation, for the first time. Using the proposed adaptive scheme, pre-trained networks can be straightforwardly used for unseen datasets by adjusting the shape of the activation functions of an intermediate layer.

• A model-agnostic scheme which can be plugged into any baseline. In this paper, we selected Monodepth2 [23] as one of the most widely used baselines for depth estimation.

Related Work
Here, we first provide an overview of unsupervised and supervised depth estimation methods. In the last part, a brief review of functionally adaptive networks is provided.

Unsupervised Depth Estimation Methods
These methods use color consistency losses between stereo images [24], temporally adjacent frames [25], or a combination of both [23] to train a monocular depth estimation model. Many attempts have been made to strengthen the self-supervision with new loss terms such as left-right consistency [26], temporal depth consistency [27], or cross-task consistency [28][29][30]. Of these improvements, Monodepth2 has attracted substantial attention because of the set of techniques it combines [23]. To the best of our knowledge, methods in this category have been presented for either outdoor environments, such as the above ones, or indoor environments, as in [31]. Not being applicable to both indoor and outdoor datasets can be regarded as a drawback of these methods. Another problem is that they suffer from low accuracy.

Supervised Depth Estimation Methods
The inputs to these methods are only images, and they use either DD or SD maps as targets; accordingly, this group can be categorized into DD-based and SD-based methods. DD-based methods, such as Adabins [14] and BTS [15], learn from the error between predicted depth maps and DD maps. Their main disadvantage is that they need DD maps for training.
Unlike DD-based methods, SD-based ones use SD maps only. Training data are not an issue for these methods because current robots and mapping systems can capture both images and SD maps simultaneously. The distance between predictions and SD maps is used as the loss function [32][33][34]. These methods are also known as semi-supervised [35].

Functionally Adaptive Neural Networks
Neural networks are called adaptive when they can adapt themselves to unseen environments, i.e., new inputs [36,37]. There are different techniques for designing adaptive networks, among which weight modification and functional adaptation can be mentioned. The former optimizes the network weights for new inputs, while the latter modifies the slope and shape of the activation functions, usually through a relatively small number of additional parameters [36]. Functional adaptation can be categorized under activation response optimization methods [38][39][40][41], in which the aim is to update activation responses while the network weights are kept fixed. The reason for keeping the weights fixed is to preserve the semantics learned by the network during training. Meanwhile, one or several activation responses are modified to optimize the performance on inevitably unseen objects and scenes, so that the network maintains its proficient performance in constantly changing environments [19].
The adaptation process can happen either in the training stage [20] or in the inference stage for tasks, such as interactive segmentation or SLAM, where some ground truth (even though sparse) is available on the fly [37]. In addition, a network can adapt to either a sequence of images or a single image. In single-image adaptation, the prediction is optimized for a specific image, or even an object, and the adaptation is discarded for the next image [37]. Thus, single-image adaptation can be beneficial, especially when scenes are prone to varying significantly.
Inspired by biological neurons, some investigations have been conducted on adaptive activation functions such as PReLU, showing that adaptive behaviour in such activation functions can improve the accuracy and generalization of neural networks [20]. In [19], parameters are introduced to adapt the activation functions to user clicks during inference of the interactive segmentation task. An adaptive instance normalization layer is proposed in [21], which enables style transfer networks to adapt to arbitrary new styles at a negligible computational cost.

Theoretical Background
In this section, the theoretical background needed for understanding the proposed scheme is provided. First, Delaunay-based interpolation is explained, which is used in the correction stage to densify sparse correction maps. Then, the particle swarm optimization (PSO) algorithm is described, which serves as the optimizer in the optimization stage of the proposed scheme.

Delaunay-Based Interpolation
The first step of the interpolation is triangulation. Considering that there are many different triangulations for a given point set, we need a triangulation that avoids poorly shaped triangles. The Delaunay triangulation has proved to be a robust and widely used approach. It connects the points whose cells are neighbours in the Voronoi diagram, i.e., it is the dual of the Voronoi diagram [42].
To find the value of a new point by interpolation, the triangle in which it lies must first be identified. Suppose P(x, y) is a new point inside a triangle with vertices P_1(x_1, y_1), P_2(x_2, y_2), and P_3(x_3, y_3), with values z_1, z_2, and z_3, respectively. To linearly interpolate the value z of P, a plane (Equation (1)) is fitted to the vertices P_1, P_2, and P_3:

z = ax + by + c. (1)

By inserting the known points (x_1, y_1, z_1), (x_2, y_2, z_2), and (x_3, y_3, z_3) into Equation (1) and solving a linear system of equations, the unknown coefficients (a, b, c) of the plane are estimated. Finally, with (a, b, c) known, Equation (1) yields the interpolated value z for any arbitrary point P(x, y) within the triangle (Figure 2).

Figure 2. Delaunay-based interpolation on a set of points. First, a Delaunay triangulation is carried out on the points. Then, a plane is fitted to each triangle, and finally, the value for points within each triangle is obtained from the fitted plane.
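For reference, this piecewise-planar interpolation is what SciPy provides directly; a minimal sketch (the function name and toy data are ours, not from the paper):

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.interpolate import LinearNDInterpolator

def delaunay_interpolate(points, values, query):
    """Linearly interpolate `values` at `query` locations.

    A plane is fitted over each Delaunay triangle (Equation (1)), so every
    query point inside the convex hull receives the value of the plane of
    the triangle containing it. Points outside the hull return NaN.
    """
    tri = Delaunay(points)                      # triangulate the sparse points
    interp = LinearNDInterpolator(tri, values)  # piecewise-planar interpolation
    return interp(query)

# Toy example: four corners of the unit square carrying z = x + y.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
vals = np.array([0.0, 1.0, 1.0, 2.0])
z = delaunay_interpolate(pts, vals, np.array([[0.5, 0.5]]))
```

Because z = x + y is itself linear, the interpolated value at (0.5, 0.5) recovers the exact value 1.0.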

PSO
PSO is a population-based stochastic optimization technique inspired by the social behavior of birds within a flock or fish schooling [22]. PSO has two main components which need to be specifically defined for each application. One component is the introduction of particles, and the other is an objective function for particle evaluation.
Each particle is a potential solution to the problem; this means it must contain all the parameters of the problem in question.
The velocity and position of each particle are updated using Equations (2) and (3), respectively [22]. The optimum values of the unknown parameters are iteratively updated using the position equation, which itself depends on the velocity:

V_i(t + 1) = w V_i(t) + c_1 r_1 (pbest_i(t) − X_i(t)) + c_2 r_2 (gbest(t) − X_i(t)), (2)

X_i(t + 1) = X_i(t) + V_i(t + 1). (3)

In Equation (2), V_i(t) is the velocity of particle i at time t, X_i(t) is its position, and pbest_i(t) and gbest(t) are the personal and global best positions found by particle i and by all the particles up to iteration t, respectively. The parameter w is an inertia weight scaling the previous time step velocity. Parameters c_1 and c_2 are two acceleration coefficients that scale the influence of pbest_i(t) and gbest(t), respectively. In addition, r_1 and r_2 are random variables drawn uniformly from [0, 1]. The next position of each particle, X_i(t + 1), follows from Equation (3).
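The update rules in Equations (2) and (3) can be sketched in a few lines of NumPy. This is an illustrative toy implementation (the function name, default hyperparameters, and test objective are ours), not the authors' code:

```python
import numpy as np

def pso(loss, dim, n_particles=10, n_iters=30, w=0.7, c1=0.5, c2=0.3, seed=0):
    """Minimal PSO following Equations (2) and (3)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, (n_particles, dim))   # initial positions
    V = np.zeros_like(X)                             # initial velocities
    pbest = X.copy()                                 # personal best positions
    pbest_val = np.array([loss(x) for x in X])
    gbest = pbest[pbest_val.argmin()].copy()         # global best position
    for _ in range(n_iters):
        r1 = rng.uniform(size=(n_particles, dim))
        r2 = rng.uniform(size=(n_particles, dim))
        V = w * V + c1 * r1 * (pbest - X) + c2 * r2 * (gbest - X)  # Equation (2)
        X = X + V                                                  # Equation (3)
        vals = np.array([loss(x) for x in X])
        improved = vals < pbest_val
        pbest[improved] = X[improved]
        pbest_val[improved] = vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, float(pbest_val.min())

# Minimize the 2D sphere function; the optimum is at the origin.
best_x, best_val = pso(lambda x: float(np.sum(x ** 2)), dim=2)
```

Because the personal bests never worsen, the returned loss is monotonically non-increasing over iterations.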

Proposed Method
Supervised depth estimation methods suffer from the generalization problem; in other words, they usually need to be retrained to achieve proficient performance on an unseen dataset. To alleviate this, a double-stage adaptive refinement scheme (DARS) is proposed to equip pre-trained depth estimation networks with inference-time optimization, improving performance on both seen and unseen datasets. The proposed scheme (Figure 3) consists of several components: a deep baseline model, a correction module which applies the first stage of refinement, and an activation optimization module as the second stage. The baseline model can be any supervised or unsupervised pre-trained depth estimation network. The depth predicted by the baseline is given to the correction module, which provides the optimization module with a sufficiently accurate depth map. In the second stage, scale and bias parameters are applied to a set of intermediate feature maps in the baseline, and they are optimized by PSO to improve the accuracy of the final depth. The tasks and details of each module, and the overall scheme, are described below. In the following subsections, the superscripts s and d indicate that a depth map is sparse or dense, respectively.

Baseline
Given an input monocular RGB image I ∈ R^{w×h×3}, we rely on a depth estimation network F : I → D^d_0 to provide an initial depth map D^d_0 ∈ R^{w×h}. The proposed scheme can utilize any monocular depth estimation network; in this study, Monodepth2 [23] has been selected as the baseline, as one of the most widely used depth estimation networks. The baseline is pre-trained and its weights are kept fixed.

Correction
The depth map D^d_0 predicted by the baseline lacks sufficient accuracy, especially for an unseen input; thus, D^d_0 is not a proper initial value for the optimization stage. As a solution, in the first stage of the proposed refinement scheme, a sliced Delaunay correction (SDC) C : R^{w×h} → R^{w×h} is used to correct D^d_0 using the available sparse depth map D^s. In SDC, first a correction value δd^s ∈ ΔD^s is calculated for every available depth pixel d^s ∈ D^s: δd^s = d^s − d^s_0, where d^s_0 ∈ D^d_0 are the pixels in D^d_0 corresponding to those in D^s. Then, the sparse correction map ΔD^s is divided into three overlapping slices (see Figure 4). Neighbouring pixels are intuitively assumed to share a similar error pattern, so the slices can serve as a simplistic segmentation based on the error pattern. In each slice, a Delaunay-based interpolation (see Section 3.1) J : R^2 → R is used to estimate a dense correction map ΔD^d = J(ΔD^s) from the sparse one ΔD^s. For pixels in overlapping areas (see Figure 4), the average of the values coming from the two adjacent slices is taken as the final depth correction. As a result of this stage, a corrected depth D̂^d = D^d_0 + ΔD^d is generated, yet with marginal errors remaining. Regarding the number of slices, three was found to be optimal on both datasets in our experiments: fewer slices did not yield homogeneous areas, and hence no remarkable correction performance, while more slices brought negligible accuracy gains relative to the computational overhead.
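The correction stage can be sketched as follows. This is our illustrative reconstruction, not the authors' code: we assume horizontal slices stacked along the image height, treat pixels outside a slice's convex hull as uncorrected, and ignore degenerate (near-collinear) point sets:

```python
import numpy as np
from scipy.interpolate import LinearNDInterpolator

def sliced_delaunay_correction(d0, d_sparse, n_slices=3, overlap=0.5):
    """Sketch of SDC: densify the sparse correction map slice by slice.

    d0       : (h, w) initial dense depth D^d_0 from the baseline
    d_sparse : (h, w) sparse depth D^s, 0 where no measurement exists
    """
    h, w = d0.shape
    corr_sum = np.zeros_like(d0)
    corr_cnt = np.zeros_like(d0)
    step = h // n_slices
    half = int(step * overlap)
    for i in range(n_slices):
        top = max(0, i * step - half)
        bot = min(h, (i + 1) * step + half)
        ys, xs = np.nonzero(d_sparse[top:bot])
        if len(ys) < 3:                       # not enough points for a triangle
            continue
        # Sparse correction delta d^s = d^s - d^s_0 inside this slice.
        delta = d_sparse[top:bot][ys, xs] - d0[top:bot][ys, xs]
        interp = LinearNDInterpolator(np.c_[ys, xs], delta)  # NaN outside hull
        gy, gx = np.mgrid[0:bot - top, 0:w]
        dense = interp(np.c_[gy.ravel(), gx.ravel()]).reshape(bot - top, w)
        valid = ~np.isnan(dense)
        corr_sum[top:bot][valid] += dense[valid]   # overlaps averaged below
        corr_cnt[top:bot][valid] += 1.0
    return d0 + corr_sum / np.maximum(corr_cnt, 1.0)

# Toy check: the baseline is off by a constant -1 m everywhere; sparse
# ground truth on a coarse grid should recover the true depth in-hull.
d0 = np.full((20, 20), 4.0)
d_sparse = np.zeros((20, 20))
d_sparse[0:20:4, 0:20:4] = 5.0
corrected = sliced_delaunay_correction(d0, d_sparse)
```

Averaging the per-slice corrections only where a slice actually covers a pixel reproduces the paper's averaging rule for the overlapping areas.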

Activation Optimization
Given the initial value from the first (correction) stage, the core of the network adaptation is conducted in the second stage. The technique chosen for network adaptation is to modify an intermediate set of activation outputs [36]. This is usually carried out by freezing the weights and optimizing some auxiliary parameters; this way, not only are the valuable learned semantics preserved, but the network can also adapt itself to new inputs. Inspired by works such as f-BRS [19] in the interactive segmentation field, we apply channel-wise scale and bias parameters to intermediate features of the baseline network. The scales are initialized to ones and the biases to zeros; they are then optimized based on a cost function. To better describe the algorithm of the optimization module, the overall scheme, i.e., from the baseline to the optimization module, is explained first, followed by details about the optimizer.

Overall Scheme
Given an input RGB image I ∈ R^{w×h×3}, denote the intermediate feature set as G(I) ∈ R^{m×n×c}, where G : R^{w×h×3} → R^{m×n×c} is the network body and m, n, and c are, respectively, the width, height, and number of channels. The auxiliary parameters, scales S ∈ R^c and biases B ∈ R^c, are applied to G(I), and the depth D^d_0 = H(S ⊙ G(I) ⊕ B) is predicted, where H : R^{m×n×c} → R^{w×h} is the network head, and ⊙ and ⊕ represent channel-wise multiplication and addition. Afterwards, the correction module C : R^{w×h} → R^{w×h} carries out the first refinement stage on D^d_0 and returns D̂^d. The auxiliary parameters X ∈ R^{2c}, i.e., the channel-wise scales and biases, are learnable. Therefore, the following optimization problem can be formulated:

ΔX* = argmin_{ΔX} L(X + ΔX),

where ΔX denotes the corrections applied to the parameters and L is the cost function given to the optimizer.
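Since PSO treats the auxiliary parameters as a flat particle vector X ∈ R^{2c}, the channel-wise scale-and-bias operation needs no gradient machinery at all. A hypothetical NumPy sketch (the function name and layout convention, scales first then biases, are ours):

```python
import numpy as np

def apply_scale_bias(feats, x):
    """Apply the auxiliary parameters X = [S, B] to intermediate features.

    feats : (c, m, n) feature maps G(I), channels first
    x     : (2c,) flat particle vector; first c entries are the scales S,
            last c entries are the biases B
    """
    c = feats.shape[0]
    s = x[:c].reshape(c, 1, 1)   # channel-wise scales S
    b = x[c:].reshape(c, 1, 1)   # channel-wise biases B
    return feats * s + b         # S (multiply) G(I) (add) B, per channel

c = 4
feats = np.random.default_rng(0).normal(size=(c, 3, 3))
x0 = np.concatenate([np.ones(c), np.zeros(c)])  # S = 1, B = 0: identity
out = apply_scale_bias(feats, x0)
```

At the paper's initialization (scales of one, biases of zero) the operation is the identity, so the unrefined baseline prediction is recovered exactly.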

Optimizer
The above optimization problem can be handed to any type of optimizer. The default optimizer of f-BRS is limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) [43,44]. Due to its local, gradient-based nature, this optimizer gets trapped in local optima. To overcome this problem, L-BFGS is replaced with PSO [22]. PSO iteratively updates the scale and bias parameters in each particle based on a distance loss between the refined prediction and the sparse map D^s, averaged over T, the total number of pixels with depth values in D^s. Figure 5 shows the algorithm flow of the PSO and its parameters in DARS.

Experiments
In this section, we first briefly describe the datasets used in the experiments. Then, the metrics are introduced, after which an ablation study is discussed to show the effectiveness of each module. Finally, the results of the proposed scheme are compared with those of the state of the art.

Datasets
Two datasets are used in the experiments, KITTI [45] and NYUv2 [46]. KITTI is a well-known outdoor dataset, on which the baseline is trained, while NYUv2 is an indoor benchmark dataset and the adaptation performance of the scheme is highlighted through testing on it.

KITTI
The KITTI dataset [45] consists of stereo RGB images and corresponding SD and DD maps of 61 outdoor scenes acquired by a 3D mobile laser scanner. The RGB images have a resolution of 1241 × 376 pixels, while the corresponding SD maps are of very low density, with a large amount of missing data. The dataset is divided into 23,488 training and 697 test images, according to [47]. For testing, 652 images associated with DD maps are selected from the test split. Sample KITTI data are shown in Figure 6.

NYUv2
The NYUv2 dataset [46] contains 120,000 RGB and depth pairs of 640 × 480 pixels in size, acquired as video sequences using a Microsoft Kinect from 464 indoor scenes. The official train/test split contains 249 and 215 scenes, respectively. Given that NYUv2 does not contain SD maps, SD maps with 80% sparsity were randomly synthesized from the DD maps for the experiments of the proposed method. Sample NYUv2 data, including the synthetic SD maps, are illustrated in Figure 6.

Assessment Criteria
The assessment criteria proposed by [47] include error and accuracy metrics. The error metrics are the root mean square error (RMSE), logarithmic RMSE (RMSE_log), absolute relative error (Abs Rel), and squared relative error (Sq Rel), whereas the accuracy metrics are the fractions of pixels within the thresholds thr = 1.25^t, where t = 1, 2, 3. These criteria are formulated as follows:

RMSE = √((1/T) Σ_i (d_i − d_i^gt)²),
RMSE_log = √((1/T) Σ_i (log d_i − log d_i^gt)²),
Abs Rel = (1/T) Σ_i |d_i − d_i^gt| / d_i^gt,
Sq Rel = (1/T) Σ_i (d_i − d_i^gt)² / d_i^gt,
Accuracy = fraction of pixels for which max(d_i/d_i^gt, d_i^gt/d_i) = δ < thr,

where d_i and d_i^gt are the predicted and target (ground truth) depths, respectively, at the pixel indexed by i, and T is the total number of pixels in all the evaluated images.
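These standard metrics are straightforward to compute; a sketch over valid pixels (assuming predictions and ground truth have already been masked to valid depths and flattened; the function name is ours):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics of [47] over T valid pixels."""
    pred, gt = pred.ravel(), gt.ravel()
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    sq_rel = np.mean((pred - gt) ** 2 / gt)
    ratio = np.maximum(pred / gt, gt / pred)        # delta per pixel
    acc = {t: float(np.mean(ratio < 1.25 ** t)) for t in (1, 2, 3)}
    return rmse, rmse_log, abs_rel, sq_rel, acc

# Sanity check: a uniform 10% over-estimation of a 10 m scene.
gt = np.full(100, 10.0)
pred = np.full(100, 11.0)
rmse, rmse_log, abs_rel, sq_rel, acc = depth_metrics(pred, gt)
```

For this toy case the metrics are exactly RMSE = 1 m, Abs Rel = Sq Rel = 0.1, and all delta thresholds pass since 1.1 < 1.25.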

Network Architecture
As the proposed scheme is model agnostic by design, the network architecture is not the focus of this study. Thus, we used the standard monocular version of the Monodepth2 [23] model with an input size of 640 × 192 × 3.

Implementation Details
We used monocular Monodepth2 pre-trained on KITTI as our baseline. The input images were resampled to 640 × 192 and then fed to the network. The weights were fixed and the network was run in inference mode. In SDC, the number of slices was three and the overlap between slices was set to 50%. Moreover, the PSO parameters, i.e., c_1, c_2, the number of particles, and the number of iterations, were set to 0.5, 0.3, 10, and 30, respectively, in all the experiments. All the implementations were conducted in PyTorch [48].

Ablation Studies
This ablation study aims to demonstrate the effectiveness of the different stages and modules in the proposed scheme. To do this, starting from the baseline, we enabled the correction and optimization modules in several steps (see Table 1). First, the result of Monodepth2 [23] with median scaling is included for comparison, while the version without any kind of post-processing (Monodepth2*) is reported as our baseline. This means that the baseline results are without median scaling by target DD maps; as a result, they suffer from scale ambiguity and low accuracy. In addition, DC is introduced to show the efficacy of slicing in the proposed SDC correction module. The difference between SDC and DC is that, in the latter, Delaunay interpolation and correction are carried out on the entire depth map instead of separately on each slice. For brevity, these two methods were only surveyed on KITTI.

From Table 1, the worst results on KITTI in terms of all metrics were recorded by the baseline (Monodepth2*), which was expected because of scale ambiguity. Using DC as the correction module improved the results by 13% in terms of RMSE, while SDC showed a significantly higher improvement over the baseline, by 91%. This not only proves the contribution of the correction module but also indicates the effectiveness of the slicing process in SDC. Furthermore, SDC, without any use of target DD maps, yielded over 57% improvement with respect to Monodepth2, which means that SDC not only addresses the scale ambiguity problem but also corrects the given depth map significantly. Moreover, this observation supports the assumption that adjacent pixels in depth maps share a similar error pattern: first, because adjacent pixels usually belong to the same objects, and second, because the LiDAR error correlates with distance from the sensor, so pixels at approximately equal distances to the sensor are likely to have similar error magnitudes.
From another perspective, the proposed slicing proved to be a simplistic segmentation based on the error pattern and was able to remarkably contribute to the correction stage.
According to Table 1, the results obtained when using L-BFGS as the optimizer are equal to those without optimization on both the KITTI and NYUv2 datasets. This means that L-BFGS could not improve the results because, unlike PSO, it lacks the capability for global search; in other words, it appears to have been trapped in a local optimum, i.e., the depth provided by SDC. Therefore, due to the identical performance and for the sake of conciseness, just one row is dedicated to both SDC and L-BFGS in Figures 7 and 8. Meanwhile, PSO improved the results significantly in terms of all metrics on both KITTI and NYUv2. For instance, PSO showed nearly 50% enhancement in Abs Rel and 14% in RMSE on KITTI, and 6% and 86%, respectively, in terms of Abs Rel and RMSE on NYUv2.
Comparing the improvement of PSO over L-BFGS on KITTI with that on NYUv2, the improvement was more remarkable on NYUv2. Thus, considering that the baseline was trained on KITTI, one can conclude that the optimization module, with PSO as its optimizer, plays a significant role in the adaptation process. This observation also demonstrates the capability and efficacy of the activation optimization used in the proposed scheme.
To conclude, both of the proposed stages in DARS, i.e., SDC and activation optimization using PSO, proved to be effective and led to considerable improvements. Moreover, DARS proved its network adaptation capability, given its performance on NYUv2.
As is clear from the error patterns in Figure 7, related to KITTI, and Figure 8, pertaining to NYUv2, the introduction of PSO has led to considerable improvements. The improvements can specifically be observed at more distant pixels, which usually have higher error magnitudes.

Comparison with SOTA
Proficient generalization is necessary for DL-based depth estimation methods, especially in applications with constantly changing environments, such as SLAM and autonomous vehicles. To deal with this problem, an inference-time refinement scheme is proposed to help pre-trained networks adapt to new inputs. To show the generalization performance of the proposed scheme, it has been compared with a range of unsupervised and supervised methods. On the other hand, to evaluate its adaptation performance, DARS with weights pre-trained on KITTI is applied to an unseen benchmark dataset, namely NYUv2. As is clear from Table 2, DARS outperformed competing methods in terms of almost all assessment criteria except for δ < 1.25² and δ < 1.25³. From the perspective of these two criteria, the performance of our method was not as good as that of the second-place rival. However, DARS led to better performance in terms of δ < 1.25, which is the primary criterion for accuracy assessment. Although DARS utilizes a self-supervised baseline, Monodepth2, it outperformed its supervised rivals by a 39% margin in terms of RMSE on KITTI. This confirms the superiority of the proposed DARS even over supervised approaches, and in dealing with harder scenes in a seen dataset. A visual comparison between DARS and the second best method in terms of RMSE on KITTI is presented in Figure 9. Figure 9. Visual results related to the comparative study on the KITTI dataset. The results of Adabins [14], the second best method, are shown. The numbers to the right of the error maps are in meters.
Regarding the second dataset, NYUv2, DARS outperformed the competing methods in terms of all criteria, according to Table 3. In terms of Abs Rel and RMSE, DARS achieved improvements of 83% and 70%, respectively, with respect to the best competing method. Furthermore, this table indicates how the proposed method successfully adapted to an unseen dataset. Note that, unlike DARS, the other methods in Table 3 were trained on NYUv2. Hence, one can deduce that DARS not only adapted a network to an unseen dataset but also outperformed methods trained on that exact dataset. Furthermore, it suggests DARS as a possible alternative to supervised approaches, which suffer from complicated generalization problems in practice. This adaptation capability is extremely advantageous in applications with constantly changing environments, such as SLAM, where the scenes are of unlimited variety and sparse LiDAR maps are available on the fly. A visual comparison between DARS and the second best method in terms of RMSE on NYUv2 is presented in Figure 10. Table 3. Comparative study on NYUv2. The first part contains unsupervised methods, while the second part is dedicated to supervised ones.

Figure 10. Visual results related to the comparative study on the NYUv2 dataset. The results of Adabins [14], the second best method, are shown. The numbers to the right of the error maps are in meters.

Conclusions
This paper deals with one of the main problems of available deep learning-based depth estimation networks: their limited generalization capability. This problem particularly restricts the practical usage of such models in applications with constantly changing environments, such as SLAM. To alleviate this problem, a new double-stage adaptive refinement scheme for depth estimation networks, namely DARS, based on the combination of f-BRS and PSO, is proposed in this paper. Here, DARS is injected into Monodepth2 as the baseline and adapts the pre-trained network to each input during inference. Experimental results on the KITTI and NYUv2 datasets demonstrated the efficacy of the proposed scheme not only on KITTI but also on NYUv2, even though the baseline model was pre-trained only on KITTI. Although our approach is model agnostic by design, this paper did not explore the effects of using different baselines. In future work, we will therefore replace our unsupervised baseline with other networks, ranging from unsupervised to supervised, in order to investigate the effectiveness of the proposed scheme on different baselines.