NSVDNet: Normalized Spatial-Variant Diffusion Network for Robust Image-Guided Depth Completion

Abstract: Depth images captured by low-cost three-dimensional (3D) cameras have low spatial density, so depth completion is required to improve 3D imaging quality. Image-guided depth completion aims to predict dense depth images from extremely sparse depth measurements captured by depth sensors, with the guidance of aligned Red–Green–Blue (RGB) images. Recent approaches have achieved remarkable improvements, but their performance degrades severely when the input sparse depth is corrupted. To enhance robustness to input corruption, we propose a novel depth completion scheme based on a normalized spatial-variant diffusion network incorporating measurement uncertainty, with the following contributions. First, we design a normalized spatial-variant diffusion (NSVD) scheme that iteratively applies spatially varying filters to the sparse depth, conditioned on its certainty measure, to exclude depth corruption from the diffusion. In addition, we integrate the NSVD module into the network design to enable end-to-end training of filter kernels and depth reliability, which further improves structural detail preservation via the guidance of RGB semantic features. Furthermore, we apply the NSVD module hierarchically at multiple scales, which ensures global smoothness while preserving visually salient details. The experimental results validate the advantages of the proposed network over existing approaches, with enhanced performance and noise robustness for depth completion in real-use scenarios.


Introduction
Depth sensing and estimation are of vital importance in a wide range of applications, e.g., robotics [1], autonomous driving [2], and augmented reality [3]. However, depth sensors, such as Light Detection and Ranging (LiDAR) and Time-of-Flight (ToF) sensors, typically provide relatively low output density [4,5], as demonstrated in Figure 1b. This hinders the use of depth sensors in downstream applications that require dense depth maps.
To improve three-dimensional (3D) imaging quality, direct interpolation from the sparse depth measurements alone can efficiently provide a dense depth map [6], but it blurs edges and structural details, as shown in Figure 1c. On the other hand, Red-Green-Blue (RGB) cameras capture the high-resolution shape and structure information of the scene, and a cost-effective way to obtain dense depth is to estimate it directly from a single image with monocular depth estimation algorithms [7,8]. However, the inference accuracy is relatively low and the generalization ability is limited, which restricts their use in scenarios requiring high accuracy and robustness in depth estimation [9,10]. For example, in Figure 1d, although monocular depth estimation preserves the relative distance, it struggles to provide accurate absolute measurements [7]. This is consistent with the quantitative evaluation using the Root Mean Squared Error (RMSE) metric [11], where the result in Figure 1d shows a relatively larger RMSE, indicating low accuracy. Therefore, existing approaches use RGB images as guidance to recover dense depth maps from sparse sensor depth measurements; this is called image-guided depth completion [4,11]. For example, with the RGB image input in Figure 1a as guidance, the structural details are better preserved and accuracy improves, as shown in Figure 1e.

Figure 1. Example in NYUv2 dataset [12]. (a) RGB image input, (b) sparse depth input, depth estimation with (c) PNCNN [6] using single depth, (d) MiDaS [7] using single RGB, (e) NLSPN [11], and (f) proposed NSVDNet using both RGB and depth. As highlighted in the black rectangles, (f) NSVDNet generates more accurate structural details than (e) NLSPN due to the uncertainty-aware diffusion scheme. The results are evaluated using the RMSE metric, where (f) NSVDNet achieves the smallest RMSE, indicating improved accuracy.
In recent years, deep neural networks have achieved great success in various applications [1,4,13] and have been successfully applied to image-guided depth completion, achieving remarkable improvements in depth estimation accuracy. Various network architectures have been proposed to integrate RGB and depth features for depth completion [14,15]. However, most depth completion approaches ignore the fact that the sensor depth is inherently noisy [16,17], and their performance degrades when the depth measurements are inaccurate.
To enhance network robustness to input noise and guarantee reliability in real-world usage, increasing attention has been paid to incorporating prior knowledge about depth images into the network design [18,19]. In this way, the solution space of the network is restricted to avoid over-fitting to the training dataset, which in turn enhances the generalization ability to unseen real test data. However, network reliability has not received enough attention in image-guided depth completion, and the corruption in the input sparse sensor depth is not fully considered, which degrades the resulting depth prediction.
To tackle the above problems, we propose the normalized spatial-variant diffusion network (NSVDNet), based on uncertainty-aware diffusion, to enhance robustness to input corruption. Specifically, the sparse depth is diffused with end-to-end learned spatially adaptive kernels that incorporate (1) the input depth uncertainty, to avoid diffusing corrupted depth measurements, and (2) the semantic features learned from RGB images, to further enhance structural detail reconstruction. With this uncertainty-aware diffusion, the network essentially implements anisotropic diffusion, which makes it interpretable and limits the solution space to avoid over-fitting to the training dataset. This explains why the proposed NSVDNet is robust to noise in corrupted input data that is not included in the training dataset.
To implement the uncertainty-aware diffusion on depth maps, we design the normalized spatial-variant diffusion (NSVD) module, which uses the learned depth certainty as a normalization factor to mitigate depth noise and the spatial-variant affinity extracted from RGB to guide structural enhancement. Furthermore, the NSVD module is applied hierarchically at multiple scales to ensure global smoothness while preserving visually salient details.
In sum, previous approaches for depth completion can be classified into three categories: (1) modifying convolution layers to adapt to sparse input [20]; (2) utilizing RGB-D fusion to recover dense depth with RGB guidance [14,15]; and (3) constructing affinity matrices to refine structural details [11]. However, when the input depth measurements are corrupted [16,17], the extracted features do not unveil the underlying structure of the depth, which degrades the resulting depth estimation for these schemes. In contrast, the proposed NSVDNet provides the following advantages over existing schemes: (1) NSVDNet utilizes uncertainty-aware diffusion to enhance the network robustness to input corruption based on the input depth uncertainty; (2) NSVD modules are applied hierarchically to the depth features to further enhance the RGB-D fusion efficiency; (3) NSVDNet essentially implements anisotropic diffusion, which limits the solution space to avoid over-fitting and enhance generalization ability. Further comparison with existing schemes is provided in Section 2. The example in Figure 1f shows that the proposed NSVDNet outperforms the competing scheme NLSPN [11]: the global smoothness and local detail are better preserved without introducing extra textures, e.g., on the bicycle and the table corner highlighted in the black rectangles. Also, the result in Figure 1f shows a smaller RMSE value than that in Figure 1e, validating the enhanced accuracy of NSVDNet. In summary, the main contributions of our work include:

• We design the uncertainty-aware diffusion network to enhance the robustness to depth measurement corruption, where the input depth uncertainty is integrated into the diffusion to prevent input noise from propagating to neighboring pixels;
• We implement the diffusion with the normalized spatial-variant diffusion (NSVD) module, which diffuses the input depth with spatial-variant kernels constructed from the semantic structural features extracted from the RGB image;
• We design the hierarchical deployment of NSVD modules to ensure both global smoothness and local detail preservation;
• We conduct extensive experiments to demonstrate that the proposed NSVDNet is more robust to input depth corruption than competing schemes. Additionally, the ablation study validates the design of the network architecture.
The paper is organized as follows. Related works are discussed in Section 2. Sections 3 and 4 provide a detailed discussion of the proposed method and the network architecture. An ablation study and a comparison with competing methods are presented in Section 5, and the work is concluded in Section 6.

Related Works
In this section, we first review the existing schemes for the depth completion task and then focus on the affinity-based methods that are most related to the proposed approach.

Depth Completion
Depth completion recovers dense depth from sparse depth input. With the development of deep neural networks, deep learning-based approaches provide state-of-the-art performance and outperform model-based methods [21][22][23] by a wide margin. Early methods relied only on sparse depth measurements. For example, SparseConvNet [20] proposed the sparse convolution layer and used a binary mask to distinguish between valid and missing values so that convolution operated only among valid data. Because the sparse convolution is not well suited to classical encoder-decoder networks, the sparsity-invariant multi-scale encoder-decoder network (HMS-Net) [24] was proposed to effectively utilize multi-scale features from different layers for depth completion.
Methods using only sparse depth input suffer from blurry edges and missing structural details, so recent methods have used RGB images as guidance for accurate detail preservation in depth prediction. Various network architectures have been proposed to fuse the multi-modal RGB-D features. For example, sparse-to-dense [14] accomplished depth completion by concatenating the RGB image and the sparse depth map before feeding them to an encoder-decoder network built on a ResNet-50 network. ACMNet [15] used co-attention-guided graph propagation to propagate the multi-modal information from RGB and depth, which were then fused by the symmetric gated fusion module to obtain the final dense depth output. Recent methods introduced more sophisticated RGB-D fusion networks to enhance the depth estimation accuracy. For example, MFF-Net [25] extracted and fused features of different modalities in both the encoding and decoding processes. CompletionFormer [26] coupled the convolutional attention layer with the Vision Transformer to take advantage of both the local connectivity of convolutions and the global context of the Transformer in a single model. BEV@DC [27] projected the geometric features onto a unified Bird's-Eye-View (BEV) space and combined them with RGB features to perform BEV completion.
Intermediate 3D geometric cues are also used to facilitate depth completion. For example, DeepLidar [28] used a two-branch encoder-decoder network to estimate the dense depth and surface normal simultaneously, where the surface normal was used as an intermediate representation and merged with the predicted dense depth to predict the final dense depth. FuseNet [29] used 3D continuous convolutions to extract 3D geometric cues in the 3D point domain, which were back-projected to the two-dimensional (2D) plane and fused with 2D features to obtain the depth prediction.
While learning-based methods can achieve remarkable enhancements in depth completion, the results usually suffer from blurry edges. This motivates recent methods that use affinity-based spatial propagation networks to reconstruct more accurate structural details.

Affinity-Based Depth Completion
In the spatial propagation network (SPN) [30], the affinity matrix is learned from an encoder-decoder network in a data-driven manner, and the current pixel is updated by the weighted sum of its neighboring pixels. However, SPN only used two neighbors in a row or column for spatial propagation, which was not comprehensive enough to capture all the local information simultaneously. As a variant of SPN, CSPN [31] overcame this limitation by using eight local neighbors for spatial propagation. CSPN++ [32] improved over CSPN by learning adaptive convolutional kernel sizes and the iteration number for the propagation; thus, the context and computational resources needed at each pixel could be dynamically assigned upon request. NLSPN [11] further improved CSPN by adopting a non-local neighborhood for spatial propagation, which avoided mixed-depth problems. PENet [33] proposed a two-branch backbone for depth estimation, and the output was refined by a dilated and accelerated CSPN++ [32].
Nevertheless, the SPN-based approaches use fixed affinity learned from the RGB-D input by the neural network. When the RGB-D input is corrupted, the learned affinity matrices do not unveil the underlying correlation of the depth, which can result in erroneous textures in the depth map. While PNCNN [6] used the estimated input confidence map in convolutions to suppress input corruption, the convolutions were applied to different pixels invariantly and resulted in blurry artifacts.
Different from existing SPN-based schemes, we propose the normalized spatial-variant diffusion (NSVD), which utilizes the input depth uncertainty to refine the affinity learning. This enhances the network robustness to input corruption and avoids erroneous texture generation, as shown in Figure 1f. Moreover, NSVD is applied to the depth feature hierarchically to allow for efficient RGB-D fusion. Furthermore, NSVD overcomes the limitation of pixel-invariant convolution in PNCNN by fusing the RGB-dominant features in the depth feature diffusion to avoid blurry artifacts.

Normalized Spatial-Variant Diffusion
In this section, we formulate the depth completion problem as a weighted least-squares (WLS) optimization problem. By considering the corruption in the input depth, we generalize the WLS problem to an uncertainty-aware formulation in order to attenuate the contribution of less confident pixels in the depth completion. Then, we solve the optimization problem with the proposed normalized spatial-variant diffusion scheme, which applies the spatially adaptive filters iteratively to further boost the depth reconstruction.

Problem Formulation and Solution Interpretation
Assume the sparse input depth image $y \in \mathbb{R}^N$ is sampled from the dense depth $x \in \mathbb{R}^N$, where $N$ is the number of pixels. To recover the $i$-th pixel $x_i$, we consider the input $y$ equipped with the diagonal sampling matrix $S \in \mathbb{R}^{N \times N}$, with $S(j, j) \in \{0, 1\}$ indicating the sampling locations of the depth measurements. Following [34], we formulate the depth completion problem as the weighted least-squares problem:

$$\hat{x}_i = \arg\min_{x_i} \; (y - x_i \mathbf{1})^\top W_i S W_i \, (y - x_i \mathbf{1}), \qquad (1)$$

where $\mathbf{1}$ is an all-one vector. The weight matrix $W_i$ is used to scale the difference between the estimated $x_i$ and the neighboring pixels in $y$, assigning more influence to data points with higher weights and less influence to those with lower weights. Specifically, $W_i$ is the diagonal weighting matrix, where the $j$-th element $W_i(j, j)$ indicates the similarity between the center pixel $x_i$ and its neighboring pixel $y_j$. Considering the input corruption, we further generalize $S$ to indicate the certainty of $y$, where $S(j, j)$ becomes a scalar in the range $[0, 1]$. $W_i$ and $S$ are learned end-to-end in the deep neural network, with computation details introduced in Section 4.
The closed-form solution to the weighted least-squares problem in (1) is given as:

$$\hat{x}_i = \frac{\mathbf{1}^\top W_i S W_i \, y}{\mathbf{1}^\top W_i S W_i \, \mathbf{1}}. \qquad (2)$$

Therefore, solution (2) is given by a weighted sum of all the neighboring pixels in $y$, where the weight of each pixel depends on its certainty $S$ and its similarity $W_i$ with $x_i$.
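The weighted-sum interpretation can be checked numerically. The following is a minimal sketch, assuming each residual $(y_j - x_i)^2$ is weighted by $s_j w_{ij}^2$ (certainty times squared similarity); the function name `wls_estimate` and the toy values are hypothetical and not part of the proposed network.

```python
import numpy as np

def wls_estimate(y, s, w_i):
    """Closed-form minimizer of sum_j s_j * w_ij**2 * (y_j - x_i)**2 over the scalar x_i.

    Assumption: the per-pixel weight is certainty times squared similarity.
    """
    a_i = s * w_i**2                 # non-negative filter coefficients
    denom = a_i.sum()
    if denom == 0:
        raise ValueError("no pixel has positive certainty and similarity")
    return float(a_i @ y / denom)    # normalized weighted sum of the measurements

# Toy neighborhood: the third value is an outlier and the fourth is unsampled,
# so both carry zero certainty and do not influence the estimate.
y = np.array([2.0, 2.2, 10.0, 0.0])
s = np.array([1.0, 1.0, 0.0, 0.0])
w_i = np.array([1.0, 0.5, 1.0, 1.0])
x_i = wls_estimate(y, s, w_i)
```

In this toy case only the first two measurements contribute, with the more similar pixel receiving four times the weight of the less similar one.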

Normalized Spatial-Variant Diffusion
While [6] implemented the solution (2) by applying a normalized convolution to $y$, the matrices $S$ and $W_i$ are extracted from the noisy $y$, which can be suboptimal in practice. To remedy the suboptimal choice of $S$ and $W_i$, we implement (2) with a diffusion scheme, which applies the resulting filters iteratively. Details are as follows.
Let $a_i = s \odot w_i \odot w_i$ denote the positive filter coefficients, where $s = \mathrm{diag}(S)$, $w_i = \mathrm{diag}(W_i)$, and $\odot$ is the Hadamard product. With this simplified notation, (2) is rewritten as

$$\hat{x}_i = \frac{a_i^\top y}{a_i^\top \mathbf{1}}. \qquad (3)$$

By arranging the filter coefficients into matrix form, i.e., $A = [a_1, \ldots, a_N]^\top$, the solution for all pixels can be written as

$$\hat{x} = D^{-1} A \, y, \qquad (4)$$

where $D = \mathrm{diag}(A\mathbf{1})$ is the normalization. As derived in [35], when applying the filter in (4) multiple times, i.e.,

$$x^t = D^{-1} A \, x^{t-1},$$

with the initial state $x^0 = y$, the iterations are essentially a discrete version of anisotropic diffusion [36]. More importantly, it is shown that optimizing $A$ through diffusion can bring the filter spectrum closer to that of an ideal Wiener filter, which minimizes the reconstruction error. Therefore, in our approach, we implement (2) with the following diffusion scheme. Denote $W = [w_1^2, \ldots, w_N^2]^\top$ as the spatial-variant filter, where the square is taken element-wise; then, for the $t$-th iteration, the output $x^t$ is computed as

$$x^t = \frac{W (s^{t-1} \odot x^{t-1})}{W s^{t-1}}, \qquad (5)$$

where the division is element-wise and $s^{t-1}$ denotes the certainty of $x^{t-1}$, which also gets updated with the spatial-variant filter as follows:

$$s^t = \frac{W s^{t-1}}{W \mathbf{1}}. \qquad (6)$$

With (5) and (6), we define the normalized spatial-variant diffusion, referred to as NSVD for short, with input feature $y$ and the corresponding $s$, which is filtered iteratively via the spatial-variant kernels $W$ until the results converge.
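To make the iteration concrete, the following is a minimal NumPy sketch of a normalized diffusion of this kind on a one-dimensional signal. It assumes that each step filters the certainty-weighted signal with a non-negative matrix $W$ and divides by the filtered certainty, and that the certainty itself is propagated as a $W$-weighted average; the dense matrix `W`, the function name `nsvd`, and the small `eps` guard are illustrative assumptions rather than the network implementation, which operates on learned local kernels.

```python
import numpy as np

def nsvd(y, s, W, num_iters=10, eps=1e-8):
    """Iterative normalized diffusion (illustrative sketch, dense-matrix form).

    y : (N,) input signal, s : (N,) certainty in [0, 1],
    W : (N, N) non-negative filter, row i holding the weights for pixel i.
    """
    x = y.astype(float).copy()
    c = s.astype(float).copy()
    row_sum = W @ np.ones_like(c)          # total filter weight per pixel
    for _ in range(num_iters):
        num = W @ (c * x)                  # filter the certainty-weighted signal
        den = W @ c                        # filter the certainty
        x = num / (den + eps)              # normalized signal update
        c = den / (row_sum + eps)          # certainty update (assumed form)
    return x, c

# Five pixels on a line; each pixel sees itself and its direct neighbors.
W = np.eye(5) + np.eye(5, k=1) + np.eye(5, k=-1)
y = np.array([1.0, 1.0, 9.0, 1.0, 0.0])   # index 2 is an outlier, index 4 is missing
s = np.array([1.0, 1.0, 0.0, 1.0, 0.0])   # zero certainty for both
x_out, c_out = nsvd(y, s, W)
```

Because the outlier and the missing pixel enter with zero certainty, neither contributes to its neighbors, and both are filled in from the reliable measurements around them.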
Different from PNCNN [6], where $W$ is chosen as a spatial-invariant filter leading to blurry object boundaries, NSVD adopts spatial-variant kernels adaptive to the structural features in the signal. Meanwhile, different from NLSPN [11], where the confidence indicates the reliability of the initial depth prediction and does not consider the input corruption, the certainty $s$ in NSVD indicates the reliability of the depth measurement, which is used to prevent noisy pixels from propagating to neighboring pixels. In the case of corrupted depth input, e.g., containing noise and outliers, the certainty reweights the corresponding depth features and enhances robustness to depth corruption. In Section 4, we discuss how $s$ and $W$ are learned end-to-end in the deep neural network.

Network Architecture
In this section, we propose the normalized spatial-variant diffusion network (NSVDNet) for image-guided depth completion based on the NSVD module proposed in Section 3. As illustrated in Figure 2, the network is composed of the depth-dominant branch, which estimates the initial dense depth from the sparse sensor depth, and the RGB-dominant branch, which generates the semantic structural features. The two branches are fused in the hierarchical NSVD modules, where the initial dense depth is diffused with spatial-variant diffusion kernels constructed from RGB features. Details are provided as follows.

Depth-Dominant Branch
A hierarchical multi-scale architecture based on the U-Net [37] is adopted for the depth-dominant branch, which is illustrated at the top of Figure 2. First, the input confidence estimation network is adopted from [6], which uses the sparse depth input to produce an estimate of the input confidence, indicating the reliability of the depth measurements, i.e., $S$ in (2). The sparse depth and the estimated confidence are then fed into the encoder of the depth branch, which adopts the NConv layer from [38] for initial dense depth estimation. At the decoder, we use the proposed NSVD modules to refine depth at each scale, and the features from the encoder are fused with the decoder features via the uncertainty-aware feature fusion as follows.

Uncertainty-Aware Feature Fusion
To preserve details in the input features, skip connections are used to fuse the features at the corresponding scale. Because direct concatenation would increase the number of feature channels, and thus the computational complexity in the NSVConv layer, we instead fuse the features from the encoder and decoder via the uncertainty-aware feature fusion. Specifically, at each scale $l$, the decoder feature $x^{dec}_l$ with corresponding certainty $s^{dec}_l$ and the encoder feature $x^{enc}_l$ with corresponding certainty $s^{enc}_l$ at the same scale are fused based on the certainty to generate the fused feature $x^{fuse}_l$ as in (7), and the output confidence is computed as in (8). With (7) and (8), we define the uncertainty-aware feature fusion module, which integrates the encoder-decoder features as well as the corresponding certainty. In our work, the features are fused at four different scales, i.e., $l \in \{1, 1/2, 1/4, 1/8\}$. The fused depth features and certainty measures are then fed into the NSVD modules for further refinement.
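As an illustration of fusing features based on certainty without increasing the channel count, the sketch below uses a certainty-weighted average of the encoder and decoder features. This specific form is an assumption made for illustration and may differ from the fusion rule used in the network; the rule for the output confidence is deliberately left out, and the name `fuse_by_certainty` is hypothetical.

```python
import numpy as np

def fuse_by_certainty(x_dec, s_dec, x_enc, s_enc, eps=1e-8):
    """Certainty-weighted average of decoder and encoder features (assumed form).

    All arrays share the same shape; the output keeps that shape, so the
    channel count does not grow as it would with concatenation.
    """
    total = s_dec + s_enc
    return (s_dec * x_dec + s_enc * x_enc) / (total + eps)

x_dec = np.array([2.0, 4.0])
s_dec = np.array([1.0, 0.0])   # second decoder value is fully uncertain
x_enc = np.array([6.0, 8.0])
s_enc = np.array([1.0, 1.0])
fused = fuse_by_certainty(x_dec, s_dec, x_enc, s_enc)
```

Where both inputs are equally certain the result is their mean, and where one input has zero certainty the other is passed through unchanged.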

RGB-Dominant Branch
In the NSVD modules, while the inputs are generated from the fused depth features, the spatial-variant kernels are generated from the RGB-dominant branch at the bottom of Figure 2. The network adopts the encoder-decoder structure built upon residual networks [39], with ResNet34 as the encoder backbone, to extract features from both RGB and sparse depth input. Specifically, the encoder and the decoder are composed of Conv-BN-ReLU layers, where each layer consists of a convolution, a Batch-Normalization, and a ReLU layer. The decoder features are fed into the guidance block to generate kernels at the corresponding scales in the depth-dominant branch, where the guidance block is implemented using two Conv-BN-ReLU layers. The spatial-variant kernels generated from the guidance block are used as the filter weights for diffusing the depth features and the corresponding certainty, i.e., $W$ used in (5) and (6).

Hierarchical Normalized Spatial-Variant Diffusion
For efficient depth diffusion, the NSVD modules are applied hierarchically at the different scales in the decoder so that the spatial-variant diffusion operates both on the global region, for overall scene depth accuracy, and on the local region, for detail refinement. We adopt four NSVD modules for hierarchical calculation. For the modules at the smaller scales, i.e., scales of 1/8 and 1/4, the spatial-variant diffusion in NSVD covers a nonlocal region, which promotes global smoothness and overall scene depth accuracy. For the modules at the larger scales, i.e., the 1/2 scale and the original scale, NSVD operates on a localized neighborhood, which refines the structural details. Meanwhile, the noise variance estimation network takes the output certainty from the last NSVD module as input to provide the final output depth certainty.

Loss Function
For the accurate prediction of the dense depth map, we train our network with the reconstruction loss function $L_{recon}$ in (9), supervised by the ground-truth depth, where $x^{gt}$ is the ground-truth depth, $x^{pred}$ is the predicted dense depth, and $s^{pred}$ is the output certainty measure. $x_v$, $V$, and $|V|$ denote the depth value at pixel index $v$, the set of valid pixels of $x^{gt}$, and the number of valid pixels, respectively. The first term in (9) is the data term weighted by the certainty measure, where high weights are assigned to more reliable measures. The second term is the regularization term for the certainty estimation, which avoids the trivial solution where $s^{pred}$ goes to all zeros. Note that we do not directly supervise the certainty because there is no ground truth for it; instead, it is trained indirectly through $L_{recon}$.

Experimental Results
In this section, we evaluate the depth completion performance of the proposed NSVDNet and compare it with existing algorithms, including sparse-to-dense [14]; NCONV with RGB guidance using EncDec-Net [38]; CSPN [31]; and NLSPN [11]. We first describe the implementation details in Section 5.1, where the network architecture details are provided in Tables 1 and 2. Then, quantitative and qualitative comparisons to previous algorithms on indoor and outdoor datasets are presented and organized as follows.

• In Section 5.2, we adopt the NYUv2 [12] and KITTI [20] datasets for evaluation in indoor and outdoor scenarios. The quantitative evaluation results using the two datasets are shown in Table 3 and Table 4, respectively, while the qualitative results further demonstrate the visual comparison using the NYUv2 dataset.
• In Section 5.3, we focus on the evaluation of robustness to input corruption in sparse depth, where we simulate corrupted sparse depth using NYUv2 and show the strong robustness of NSVDNet.
• In Section 5.3, we further test the generalization ability of NSVDNet to a new dataset by testing on the TetrasRGBD dataset [40] with the model trained on the NYUv2 noisy dataset. The visual results using simulated noise and the visual comparison with existing schemes using real sensor data are demonstrated, which validates that NSVDNet has a strong generalization ability to real usage scenarios.
• In Section 5.4, we present ablation studies to verify the effectiveness of each module in the NSVDNet.

Implementation Details
Training Details: We use the Adam optimizer with the initial learning rate set to $10^{-3}$, decayed at epochs 10, 20, 30, and 40 with a decay rate of 0.1. The model is trained from scratch, without a pretrained model, for 50 epochs. We implement the network with PyTorch 1.10.1 [41] on 2 NVIDIA GeForce RTX 3090 GPUs.
Network Architecture: Details of the NSVDNet architecture are shown in Tables 1 and 2, illustrating the depth-dominant branch and the RGB-dominant branch, respectively. Here, we use the input size of 256 × 256 as the example, and the input and output dimensions are shown in the tables, where H, W, and D denote the height, width, and channel number of the input/output tensors, respectively.
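The step schedule described above can be written out explicitly. The sketch below is a plain-Python illustration, assuming the decay takes effect from the start of each listed epoch (epochs counted from 0), which mirrors the usual multi-step convention; the function name `learning_rate` is hypothetical.

```python
import bisect

def learning_rate(epoch, base_lr=1e-3, milestones=(10, 20, 30, 40), gamma=0.1):
    """Learning rate at a given epoch under a multi-step decay schedule.

    Assumption: the rate is multiplied by `gamma` once for every milestone
    that has been reached, i.e., for every milestone <= epoch.
    """
    num_decays = bisect.bisect_right(milestones, epoch)
    return base_lr * gamma ** num_decays

# Rates over the 50 training epochs; four decays take the rate from 1e-3 to 1e-7.
schedule = [learning_rate(e) for e in range(50)]
```

Under this convention, the last ten epochs run at a rate four orders of magnitude below the initial one.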
As shown in Table 1, the depth-dominant branch takes the sparse depth as input and generates the input certainty using UNet [6]; then, the sparse depth and input certainty are fed into the encoder, which is composed of NConv layers [6] and MaxPool layers for downsampling. The decoder is composed of the proposed NSVD modules, uncertainty-aware fusion layers, and nearest-neighbor interpolation for upsampling, denoted as NSVD, Fusion, and NN interpolation, respectively. The number of iterations in the NSVD modules is set to 10, which ensures convergence. Along with the dense depth output, the decoder also outputs the confidence feature, which is fed into the output confidence estimator [6] for final confidence generation.
As shown in Table 2, we adopt ResNet34 [39] as our encoder-decoder baseline network for the RGB-dominant branch. To generate the spatial-variant kernels, the features from the decoder are fed into the guidance layers before being fed into the NSVD modules, where each guidance layer is implemented using two Conv-BN-ReLU layers. The number of output features from the guidance layers for spatial-variant kernels is set to 9 for a fair comparison with other affinity-based algorithms using 3 × 3 local neighbors.
Evaluation Metrics: For quantitative evaluation, we adopt the commonly used metrics [11]:

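• The Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{|V|} \sum_{v \in V} \left( x^{gt}_v - x^{pred}_v \right)^2}$;
• The Mean Absolute Error (MAE): $\frac{1}{|V|} \sum_{v \in V} \left| x^{gt}_v - x^{pred}_v \right|$;
• The Root Mean Squared Error of the inverse depth (iRMSE): $\sqrt{\frac{1}{|V|} \sum_{v \in V} \left( \frac{1}{x^{gt}_v} - \frac{1}{x^{pred}_v} \right)^2}$;
• The Mean Absolute Error of the inverse depth (iMAE): $\frac{1}{|V|} \sum_{v \in V} \left| \frac{1}{x^{gt}_v} - \frac{1}{x^{pred}_v} \right|$.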
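The four metrics can be computed together over the valid ground-truth pixels. The following NumPy sketch assumes that invalid ground-truth pixels are marked with zero depth and that predictions are strictly positive at valid pixels; the function name `depth_metrics` is hypothetical.

```python
import numpy as np

def depth_metrics(x_pred, x_gt):
    """RMSE, MAE, iRMSE, and iMAE over pixels with valid ground truth.

    Assumptions: x_gt == 0 marks missing ground truth, and x_pred > 0
    wherever the ground truth is valid (needed for the inverse-depth terms).
    """
    valid = x_gt > 0
    pred, gt = x_pred[valid], x_gt[valid]
    err = gt - pred
    inv_err = 1.0 / gt - 1.0 / pred
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
        "iRMSE": float(np.sqrt(np.mean(inv_err ** 2))),
        "iMAE": float(np.mean(np.abs(inv_err))),
    }

x_gt = np.array([[2.0, 4.0], [0.0, 1.0]])     # one pixel without ground truth
x_pred = np.array([[2.0, 5.0], [3.0, 1.0]])
m = depth_metrics(x_pred, x_gt)
```

The pixel without ground truth is ignored, so only the single 1 m error at the 4 m pixel contributes to the metrics.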
Datasets: To demonstrate the performance in both indoor and outdoor scenarios, we adopt the NYUv2 [12] and KITTI [20] datasets for evaluation. Furthermore, the TetrasRGBD dataset [40] is used to evaluate the generalization ability of the proposed network to an unseen test dataset with simulated sensor noise.

NYU Depth v2:
The NYUv2 dataset contains video sequences from a variety of indoor scenes recorded by both the RGB and depth cameras of the Microsoft Kinect. Following [11], we use a subset of 45K images from the official training split as training data, and the 654 official labeled images are used for evaluation. Every image is resized to 320 × 240 and then center-cropped to 304 × 228.
Similar to previous works [11], we randomly sample 500 points from the ground-truth depth as the sparse depth, which is combined with the RGB image as the input of our network. Table 3 shows the quantitative results of our method on the NYUv2 dataset, where the proposed NSVDNet outperforms existing schemes, including sparse-to-dense [14]; NCONV with RGB guidance using EncDec-Net [38]; CSPN [31]; and NLSPN [11].
In addition, we provide the number of network parameters of the competing methods in Table 3. The proposed method achieves the best performance with a reasonable number of network parameters. We also provide the average running time (s) in Table 3, which is tested on one GeForce RTX 3090 GPU. As shown in Table 3, the most competitive method, NLSPN, has a higher runtime, while NSVDNet achieves higher accuracy at moderate complexity with a 43% runtime reduction. Therefore, NSVDNet outperforms competing methods with high efficiency, which makes it suitable for real-time applications.
For visual comparison, Figure 3 shows our depth completion results on the NYUv2 dataset alongside the depth-only PNCNN and the state-of-the-art NLSPN. With only the depth input, the result of PNCNN has blurry edges and the object shapes are not preserved. Meanwhile, with RGB guidance, NLSPN and the proposed NSVDNet provide much sharper and more complete depth details. However, without certainty guidance, NLSPN introduces extra textures, e.g., the back of the chair contains a bumpy surface in the first row of Figure 3. This is because the initial depth and affinity matrix are learned implicitly and lack interpretability. On the other hand, with the inherent diffusion model in the network design, the proposed NSVDNet gives sharp results without extra textures from the RGB features. This is consistent with the metric values in Table 3.
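The sparse input generation can be sketched as follows. This is a minimal NumPy illustration, assuming that zero marks missing depth in both the dense and the sparse map and that points are drawn uniformly without replacement from valid pixels; the function name `sample_sparse_depth` and the synthetic dense map are hypothetical.

```python
import numpy as np

def sample_sparse_depth(dense, num_samples=500, rng=None):
    """Keep `num_samples` randomly chosen valid pixels of a dense depth map.

    Assumption: zero denotes missing depth; all other pixels are set to zero.
    """
    rng = np.random.default_rng() if rng is None else rng
    valid_idx = np.flatnonzero(dense > 0)
    n = min(num_samples, valid_idx.size)
    chosen = rng.choice(valid_idx, size=n, replace=False)
    sparse = np.zeros_like(dense)
    sparse.flat[chosen] = dense.flat[chosen]
    return sparse

rng = np.random.default_rng(0)
dense = rng.uniform(0.5, 10.0, size=(228, 304))   # synthetic stand-in for a 304 x 228 depth map
sparse = sample_sparse_depth(dense, num_samples=500, rng=rng)
```

With 500 samples on a 304 × 228 image, fewer than 1% of the pixels carry a depth value, which illustrates how sparse the network input is.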

KITTI Depth Completion Dataset:
The KITTI dataset is an outdoor dataset for autonomous driving, which contains 85,000 color images with corresponding dense annotated depth maps and sparse raw LiDAR scans for training, 6000 for validation, and 1000 for testing. One thousand color images and corresponding sparse depth maps with unpublished depth maps are selected as the benchmark for algorithm evaluation. For training, we remove the first 100 rows of the color and depth images (which have no corresponding ground-truth depth) and then randomly crop the color and depth images to 1216 × 240.
Table 4 shows the quantitative results of our method on the KITTI DC dataset, where the proposed approach outperforms existing schemes, including sparse-to-dense [14], NCONV with RGB guidance [38], CSPN [31], PENet [33], and NLSPN [11]. Note that the performance enhancement is less pronounced because the input depth noise level is small, so the noise-robustness advantage of NSVDNet is not fully exhibited. Nevertheless, NSVDNet achieves competitive results due to the hierarchical spatially adaptive filtering.
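The training crop can be sketched in NumPy as below. The sketch assumes that the first rows are the top image rows, that the same window is applied to the color and depth images so they stay aligned, and that the input frames are 352 × 1216; the function name `kitti_train_crop` and the input size are illustrative assumptions.

```python
import numpy as np

def kitti_train_crop(rgb, depth, top=100, out_h=240, out_w=1216, rng=None):
    """Drop the top rows, then take one random window shared by RGB and depth."""
    rng = np.random.default_rng() if rng is None else rng
    rgb, depth = rgb[top:], depth[top:]          # rows without ground truth
    h, w = depth.shape[:2]
    y0 = int(rng.integers(0, h - out_h + 1))     # upper bound is exclusive
    x0 = int(rng.integers(0, w - out_w + 1))
    return (rgb[y0:y0 + out_h, x0:x0 + out_w],
            depth[y0:y0 + out_h, x0:x0 + out_w])

rng = np.random.default_rng(0)
depth = rng.uniform(1.0, 80.0, size=(352, 1216))   # assumed frame size
rgb = np.stack([depth, depth, depth], axis=-1)      # stand-in image, used to check alignment
rgb_c, depth_c = kitti_train_crop(rgb, depth, rng=rng)
```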

Data with Corruption
NYU Depth v2 with Simulated Corruption: In real-world scenarios, the captured depth is highly likely to be corrupted by noise. To demonstrate the robustness of our algorithm to noisy depth input, we simulate corrupted sparse depth using the NYUv2 depth dataset. Specifically, we randomly replace 50% of the valid pixels in the sparse depth with outliers: 25% of the valid pixels are set to 10 m, which is the maximum depth value, and 25% are set to 0 m, which is the minimum depth value. The network is retrained on the NYUv2 dataset with simulated noise. The visual results are provided in Figure 4, comparing the cases with and without outliers added to the sparse input.
As shown in Figure 4, the output depth of NSVDNet is not noticeably affected by the input outliers, demonstrating good overall smoothness and detail preservation. This is due to the input certainty estimation, which distinguishes the outliers from the accurate pixels and prevents the outliers from propagating to neighboring pixels. For better demonstration, we provide a zoomed-in version of a selected region of each scene. All depth-related images are colored with the 'jetr' color map. All of the certainty maps are colored with the 'hot' color map. The black pixels indicate a low certainty value, and the yellow pixels indicate a high certainty value. All of the outliers are assigned almost-zero values in the estimated input certainty, which explains why NSVDNet is able to suppress the noise. Moreover, the output certainty map provides the confidence measurement for the depth estimation, where the outliers clearly decrease the confidence for known pixels, providing an accurate output reliability map.
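The outlier simulation can be sketched as follows, assuming zero marks missing pixels in the sparse map and that half of the selected outliers are set to the maximum depth and half to the minimum; the function name `corrupt_sparse_depth` and the synthetic sparse map are hypothetical. The mask of originally sampled locations is returned because outliers set to 0 m are otherwise indistinguishable from missing pixels.

```python
import numpy as np

def corrupt_sparse_depth(sparse, outlier_frac=0.5, d_min=0.0, d_max=10.0, rng=None):
    """Replace a fraction of the valid sparse-depth pixels with extreme outliers."""
    rng = np.random.default_rng() if rng is None else rng
    sampled = sparse > 0                               # original sampling locations
    valid_idx = np.flatnonzero(sampled)
    n_out = int(round(outlier_frac * valid_idx.size))
    chosen = rng.choice(valid_idx, size=n_out, replace=False)
    half = n_out // 2
    corrupted = sparse.copy()
    corrupted.flat[chosen[:half]] = d_max              # far outliers
    corrupted.flat[chosen[half:]] = d_min              # near outliers
    return corrupted, sampled

rng = np.random.default_rng(0)
sparse = np.zeros((20, 50))
sparse.flat[:500] = rng.uniform(0.5, 9.5, size=500)    # 500 valid measurements
corrupted, sampled = corrupt_sparse_depth(sparse, rng=rng)
```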

Generalization Evaluation on TetrasRGBD:
To test the generalization ability of NSVDNet on a new dataset, we take the model trained on the noisy NYUv2 dataset and test it on the TetrasRGBD dataset [40]. TetrasRGBD contains 2.3k pairs of testing data from mixed sources. All the data are collected in indoor scenarios, and the synthetic dataset is generated with ground-truth 3D geometry.
To demonstrate the robustness to noisy depth measurement, we adopt the TetrasRGBD dataset augmented with outliers in the sparse depth input. As shown in Figure 5, 20% of the pixels in the sparse depth are corrupted by outliers. The proposed NSVDNet generalizes well to the unseen test dataset, showing strong robustness to the outliers and high depth estimation accuracy. This is due to the use of interpretable uncertainty-aware diffusion, which limits the solution space of the network and avoids overfitting to the training data. To visualize how NSVD copes with the noisy input, Figure 5 shows the estimated certainty map for the input, indicating depth measurement reliability. When used in the NSVD module, the certainty map prevents outlier information from propagating to the neighboring pixels. Further, the output certainty provides the reliability of the network output depth: the known depth pixels show higher values, while the regions with fine details, which typically have lower accuracy, show lower values.
To further demonstrate the generalization to real-world scenarios, we use the real data captured by mobile devices provided in the TetrasRGBD dataset [40] for testing, where the input depth suffers from large sensor noise. We again use the model pre-trained on the noisy NYUv2 dataset, and the results are shown in Figure 6. Compared with competitive methods, including PNCNN [38] and NLSPN [11], the proposed NSVDNet shows more accurate global depth estimation with sharp structural details. In sum, the evaluation with real sensor data illustrates the strong generalization ability of NSVDNet, indicating its potential for practical use.

Ablation Study
In the ablation study, we regard the PNCNN network [6] as a simplified variant of the proposed NSVDNet, which replaces the NSVD module with the NConv module without spatial-varying filters and hierarchical implementation. With PNCNN as the baseline model, we then show a comparison among different variants in Table 5 that validates the importance of the designed modules. First, we compare the multi-scale deployment and single-scale deployment of NSVD, which are referred to as MS-NSVD and OS-NSVD, respectively. By comparing PNCNN+FF+MS-NSVD and PNCNN+FF+OS-NSVD in Table 5, we can see that MS-NSVD improves substantially over OS-NSVD due to its enhanced global smoothness.
Next, we compare the uncertainty-aware feature-fusion with the feature-concatenation used in the depth decoder, which are referred to as FF and FC, respectively. By comparing PNCNN+FC+OS-NSVD and PNCNN+FF+OS-NSVD in Table 5, we can see that a simple concatenation that ignores depth feature certainty can degrade the final prediction due to the inclusion of low-confidence features in the decoder.
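The contrast between FC and FF can be illustrated with a toy sketch. The fusion rule below, a certainty-weighted average of two feature vectors, is an assumption chosen for illustration and is not the operator used in NSVDNet. The function names `concat_features` and `certainty_weighted_fusion` and the scalar certainties are likewise hypothetical.

```python
def concat_features(feat_a, feat_b):
    """Plain concatenation: both features reach the decoder unchanged,
    regardless of how reliable each one is."""
    return list(feat_a) + list(feat_b)


def certainty_weighted_fusion(feat_a, cert_a, feat_b, cert_b, eps=1e-8):
    """Hypothetical fusion rule: average the two features element-wise,
    weighting each by its scalar certainty in [0, 1]."""
    total = cert_a + cert_b + eps
    return [(cert_a * a + cert_b * b) / total for a, b in zip(feat_a, feat_b)]
```

Under this rule, a feature with certainty close to zero contributes almost nothing to the fused result, whereas concatenation passes it to the decoder at full magnitude.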

Conclusions
In this work, we propose a hierarchical normalized spatial-variant diffusion network for image-guided depth completion. The network is designed to incorporate the anisotropic diffusion model, where the diffusion is deployed via the proposed normalized spatial-variant diffusion (NSVD) module. NSVD diffuses the input depth feature and the corresponding confidence with the semantic structural guidance extracted from the RGB image. Moreover, the hierarchical deployment of NSVD modules is adopted to ensure both global smoothness and local details. Extensive experimental results demonstrate that the proposed NSVDNet outperforms the existing methods, providing more accurate depth completion and sharper visually salient features. Ablation studies validate the effectiveness of the proposed hierarchical NSVDNet at enhancing the robustness to noisy pixels in the sensor depth input.
Despite the improvements introduced by the proposed network, which focuses on interpretable depth diffusion design with noise robustness, several limitations remain that require further investigation. Instead of utilizing localized spatial filtering in the depth diffusion, future work could develop more powerful spatial filtering techniques that exploit adaptive neighborhoods with non-local filtering kernels. In this way, a longer-range context would be involved in the diffusion, with more accurate global smoothness and faster convergence. More importantly, non-local filtering would benefit cases that involve more severe noise corruption and non-uniform sparsity.
Another crucial aspect that needs to be considered in future work is the time-domain consistency in depth video completion, which is required in various applications such as 3D scene reconstruction and SLAM. This aspect requires depth completion to enforce coherence between consecutive frames. Furthermore, the redundancy in temporal sequences can be utilized to reduce the computational complexity by reusing the features from neighboring frames. Toward this end, we plan to explore spatial-temporal depth video completion algorithms to enhance efficiency and temporal consistency, which should be of great importance to real-time 3D vision applications.

Figure 2. An overview of NSVDNet architecture to predict a dense depth from a disturbed sparse depth with RGB guidance. NSVDNet is composed of the depth-dominant branch, which estimates the initial dense depth from the sparse sensor depth, and the RGB-dominant branch, which generates the semantic structural features. The two branches are fused in the hierarchical NSVD modules, where the initial dense depth is diffused with spatial-variant diffusion kernels constructed from RGB features.

Figure 3. Depth completion with different algorithms, tested on NYUv2 dataset. As highlighted in the red rectangles, the proposed NSVDNet achieves more accurate depth completion results with detail preservation and noise robustness.

Figure 4. Comparison of depth completion with original sparse depth and noisy sparse depth with 50% outliers, tested on NYUv2 dataset. The comparison between results with original and noisy inputs demonstrates the robustness to input corruption for the proposed method. The selected patches are enlarged in the colored rectangles.

Figure 5. Generalization ability evaluation tests on TetrasRGBD dataset with outliers. The certainty maps explain the robustness of NSVDNet to input corruptions.

Figure 6. Generalization ability evaluation tests on TetrasRGBD dataset with real sensor data, where the proposed NSVDNet generates more accurate depth estimation than competitive methods, including PNCNN [38] and NLSPN [11].

Table 1. Network architecture for the depth-dominant branch with input dimension 256 × 256. For each module, the operators and the number of operators are specified. The input and output feature dimensions are specified, where H, W, and D refer to the height, width, and channel number of the tensors, respectively. The positions for input sparse depth and output dense depth/confidence are specified.

Table 2. Network architecture for the RGB-dominant branch with input dimension 256 × 256. For each module, the layer type is specified. The input and output feature dimensions are specified, where H, W, and D refer to the height, width, and channel number of the tensors, respectively. The positions for sparse depth and RGB input are specified, and the features generated by the guidance modules are fed into the NSVD modules in the depth-dominant branch.

Table 5. Ablation study using TetrasRGBD dataset. With PNCNN as the baseline model, we compare multi-scale NSVD (MS-NSVD) used in NSVDNet with simplified variant single-scale NSVD (OS-NSVD), where MS-NSVD outperforms OS-NSVD due to enhanced global smoothness. Additionally, we compare feature-fusion (FF) used in NSVDNet with its variant feature-concatenation (FC), where FF outperforms FC due to the utilization of input uncertainty.