Depth Map Super-Resolution Reconstruction Based on Multi-Channel Progressive Attention Fusion Network

Abstract: Depth maps captured by traditional consumer-grade depth cameras are often noisy and low-resolution. In particular, when low-resolution depth maps are upsampled with large upsampling factors, the resulting depth maps tend to suffer from blurred edges. To address these issues, we propose a multi-channel progressive attention fusion network that utilizes a pyramid structure to progressively recover high-resolution (HR) depth maps. The inputs of the network are the low-resolution (LR) depth image and its corresponding color image. The color image is used as prior information in this network to fill in the missing high-frequency information of the depth image. Then, an attention-based multi-branch feature fusion module is employed to mitigate the texture replication issue caused by incorrect guidance from the color image and inconsistencies between the color image and the depth map. This module restores the HR depth map by effectively integrating the information from both inputs. Extensive experimental results demonstrate that our proposed method outperforms existing methods.


Introduction
With the development of 3D imaging technology, depth maps have been widely used in various fields of life. Depth maps can directly reflect the shape of objects, making them useful for estimating object depth and contours for 3D reconstruction [1], scene segmentation [2], and human-computer interaction [3]. However, due to the limitations of the imaging environment and sensor hardware conditions, it is extremely difficult to obtain high-quality depth maps. Depth maps captured by low-cost consumer-grade depth cameras usually have problems such as low resolution and loss of depth details. These limitations have hindered the development of many applications that rely on accurate and high-resolution depth information. Therefore, image super-resolution technology that restores low-resolution depth maps to high-quality, high-resolution depth maps is of great research significance.
Previous researchers have proposed various depth super-resolution (DSR) algorithms to address this issue, such as interpolation-based methods [4], filtering-based methods [5,6], and optimization-based methods [7,8]. Some traditional methods have achieved remarkable results in depth image super-resolution. For example, optimization-based methods typically treat DSR as a global optimization problem, which constrains the reconstruction of the depth image using data terms and regularization terms. This approach can restore depth maps of high visual quality, but it depends heavily on a highly complex optimization model, which often requires a long running time. With the continuous development of deep convolutional neural networks (DCNNs) in computer vision tasks, such as the enormous successes achieved in image recognition, image classification, and object detection, DCNNs have been widely used in color image super-resolution. Dong et al. [9] propose SRCNN, an end-to-end network model that applies deep learning to image super-resolution for the first time, demonstrating that deep learning can surpass traditional methods and achieve higher performance in super-resolution reconstruction. In recent years, color image super-resolution algorithms based on convolutional neural networks [10][11][12] have achieved great success, and many researchers have begun to apply this approach to depth image super-resolution reconstruction [13][14][15]. However, unlike color images, which contain abundant texture information, depth maps only display the depth information of objects. This presents a greater challenge for depth map super-resolution, as the task involves recovering finer details and enhancing the resolution of the depth map.
This paper proposes a novel multi-channel progressive attention fusion network based on color image guidance to address the aforementioned challenges. It utilizes the rich texture information of color images to fill in the missing information in LR depth images. The network is guided by the color image corresponding to the depth image and uses a pyramid structure combined with residual learning to progressively restore high-quality depth maps. The network mainly consists of two parts: the color image guidance branch and the depth image fusion upsampling branch. For the depth image branch, we propose an attention-based multi-branch feature fusion module, which can adaptively select the texture information consistent with the LR depth image in the HR color image for effective fusion, while ignoring the abundant foreground information present in the HR color image. As a result, it effectively avoids the generation of depth artifacts and preserves the quality of the depth map. Afterwards, the fused feature map is optimized through a multi-scale feature optimization residual module to capture and enhance the features of different levels, which strengthens the multi-scale expression ability of the network. Extensive results show that our method outperforms the current state-of-the-art methods.
Our contributions are as follows: (1) A novel multi-channel progressive attention fusion network is proposed, incorporating local residual connections and global residual connections to predict the high-frequency residual information of the image, which is more conducive to the recovery of depth map details. (2) The multi-scale feature optimization residual module is utilized to enhance the network's multi-scale feature representation capability through multiple intertwined paths. (3) We propose an attention-based multi-branch feature fusion module that can adaptively fuse features from both the depth and color images.

Related Works
Generally, according to the different input data, depth map super-resolution methods can be divided into single depth map super-resolution technology and color image-guided depth map super-resolution technology.

Single DSR Technology
Single depth map super-resolution technology refers to recovering a high-quality depth map from a single low-resolution depth map as input. The difficulty of this method lies in how to restore details and structure information from LR depth images that have lost part of the depth information.
For traditional methods, techniques such as interpolation, filtering, and optimization are usually used to improve the resolution of the depth image. Xie et al. [16] use a Markov random field to optimize high-resolution edge maps constructed from edges of low-resolution depth images. Guided by the HR edge map, an improved joint bilateral filter is used to upsample the HR depth map. This not only avoids the generation of depth artifacts due to texture prediction, but also preserves sharp edges. Zhao et al. [17] learn multiple residual dictionaries from a single external image and utilize a shape-adaptive weighted median filter along the edges of the depth map to remove artifacts. Zhou et al. [18] apply a pair of sparse dictionaries to obtain the edge information of high-quality depth images. Then, guided by the edge information, they use an improved joint bilateral filter to interpolate the depth image.
These traditional approaches help to improve the resolution and quality of depth images, but they often have limitations in capturing details and recovering accurate depth information compared to more advanced deep learning-based methods. Depth super-resolution based on deep learning utilizes deep convolutional neural networks, usually using an end-to-end progressive network to gradually restore high-quality depth maps. For example, Song et al. [13] regard the depth map super-resolution task as a series of super-resolution subtasks, and reconstruct high-quality depth maps by synthesizing these subtasks. Huang et al. [14] cascade multiple dense residual blocks to design a pyramid structure based on deep dense residual networks. This structure utilizes dense skip connections to extract image features while employing residual learning to reconstruct high-quality depth maps. Song et al. [19] incorporate depth supervision into an iterative residual learning framework, using the total generalized variation (TGV) term and consistency loss to further refine the obtained HR depth map. The depth controllable slicing network proposed by Ye et al. [20] divides the depth map into a series of depth slices for learning, weighted by a distance-aware method, and adaptively fuses slices of different depths to generate high-quality depth maps.

Color Image-Guided DSR Technology
Color image-guided depth map super-resolution technology uses the color image corresponding to the depth image as guidance to improve the quality of the generated depth map. By leveraging the texture, color, and structural features present in the color image, these methods aim to recover fine details and improve the overall quality of the depth map. The mismatch between the color image and the depth map poses a challenge for these techniques: the central problem is how to fully fuse the color image guidance information with the depth image while avoiding the generation of depth artifacts.
Kopf et al. [5] propose joint bilateral filtering for the first time, using the high-resolution color image as prior information during the upsampling process. By combining the spatial information of the low-resolution target image with the range information of the high-resolution guiding image, a better high-resolution depth image can be restored. Yang et al. [6] propose a novel joint ternary filter that uses the updated edge map of the upsampled depth map as guidance to change the filtering process. Diebel [21] is the first to apply Markov random fields (MRFs) to generate HR depth maps, proposing an MRF formulation that consists of a data term based on the low-resolution depth map and a smoothness term regularized by the corresponding high-resolution color image. Wang et al. [22] learn edge information from LR depth maps and HR color maps, and then recover the high-resolution depth image using the predicted edge image and the HR color image.
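To make the idea of combining a spatial weight on the LR grid with a range weight on the HR guide concrete, the following is a minimal, unoptimized sketch of joint bilateral upsampling for a single-channel guide. It follows the commonly cited formulation rather than any particular implementation; the function name, the Gaussian kernels, and the parameters `sigma_s`, `sigma_r`, and `radius` are illustrative assumptions.

```python
import math


def joint_bilateral_upsample(depth_lr, guide_hr, factor,
                             sigma_s=1.0, sigma_r=10.0, radius=2):
    """Upsample depth_lr (2D list) to the size of guide_hr (2D list).

    Each HR pixel is a weighted mean of nearby LR depth samples. The weight
    is a spatial Gaussian (distance measured on the LR grid) times a range
    Gaussian (intensity difference measured on the HR guide).
    """
    h_lr, w_lr = len(depth_lr), len(depth_lr[0])
    h_hr, w_hr = len(guide_hr), len(guide_hr[0])
    out = [[0.0] * w_hr for _ in range(h_hr)]
    for y in range(h_hr):
        for x in range(w_hr):
            # Position of the HR pixel expressed in LR coordinates.
            yl, xl = y / factor, x / factor
            cy = min(h_lr - 1, int(round(yl)))
            cx = min(w_lr - 1, int(round(xl)))
            num = 0.0
            den = 0.0
            for qy in range(cy - radius, cy + radius + 1):
                for qx in range(cx - radius, cx + radius + 1):
                    if not (0 <= qy < h_lr and 0 <= qx < w_lr):
                        continue
                    # HR guide pixel that corresponds to the LR sample q.
                    gy = min(h_hr - 1, qy * factor)
                    gx = min(w_hr - 1, qx * factor)
                    d2 = (qy - yl) ** 2 + (qx - xl) ** 2
                    spatial = math.exp(-d2 / (2.0 * sigma_s ** 2))
                    diff = guide_hr[y][x] - guide_hr[gy][gx]
                    rng = math.exp(-(diff * diff) / (2.0 * sigma_r ** 2))
                    num += spatial * rng * depth_lr[qy][qx]
                    den += spatial * rng
            out[y][x] = num / den
    return out


# A depth step that coincides with an intensity step in the guide.
depth_lr = [[0.0, 100.0], [0.0, 100.0]]
guide_hr = [[0.0, 0.0, 255.0, 255.0] for _ in range(4)]
depth_hr = joint_bilateral_upsample(depth_lr, guide_hr, factor=2)
```

In this toy case the range weight prevents depth values from leaking across the intensity edge of the guide, which is the behavior that makes the guide useful. The same mechanism copies texture into the depth map when the guide has edges that the depth map lacks.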
Deep learning-based depth super-resolution guided by color images has achieved remarkable results. Hui et al. [23] propose a multi-scale guided convolutional network that uses intensity images as guidance, which can delicately restore high-quality depth images. Zuo et al. [24] adopt global residual learning and local residual learning to iteratively upsample LR depth maps under the guidance of HR intensity images. Xian et al. [15] construct a multi-scale progressive deep fusion network using the depth code branch and its related color code branch, which gradually merges the multi-scale feature maps extracted from the two branches to generate high-resolution depth images. Zhong et al. [25] propose a new attention-based multimodal fusion strategy, which makes full use of the low-level spatial information and high-level structural information, effectively exploring the complementarity of multi-level multimodal features. Chen et al. [26] combine 3D depth images with 2D color images and use the combined image as prior information to guide depth image reconstruction. Guo et al. [27] replace the original convolution kernel with a spatially variant kernel so that the HR color image can more effectively guide the LR depth map to restore a high-quality depth map. Sun et al. [28] propose a single depth image super-resolution framework based on cross-task scene structure knowledge transfer for the first time. They introduce a knowledge distillation method in which the depth estimation task and the super-resolution task are jointly trained. Although this multi-task learning approach can improve the performance of depth image super-resolution, it often comes with higher network complexity, which may require more powerful hardware to train the network effectively.

Overview
We propose a multi-scale progressive attention fusion network based on color image guidance. The entire network utilizes local and global residual connections, which not only enhances the gradient propagation of the network, but also effectively combines the shallow features and deep features of the depth map to realize feature reuse. To meet the demand for large upscaling factors, the network adopts a pyramid structure to upsample the depth image, and the upsampling factor of each level is 2. Figure 1 shows an example structure with an upsampling factor of 8. The inputs of the network are the LR depth image $I^{LR} \in \mathbb{R}^{H \times W \times 1}$ and the HR color image $G^{HR} \in \mathbb{R}^{\alpha H \times \alpha W \times C}$, where $H$ and $W$ represent the image height and width, respectively, $C$ is the number of channels, and $\alpha$ is the upsampling factor (e.g., $\alpha$ = 2, 4, 8, 16). The HR color image $G^{HR}$ is passed through multiple convolutional layers to capture different levels of features in the color image. Simultaneously, a max pooling layer is used to ensure the resolution consistency between the color image and the depth image. As the upsampling factor increases and the network deepens, the depth image can recover more high-frequency details with the help of the HR color map. Next, the features extracted from the two branches are passed into the attention-based multi-branch feature fusion module, which employs the attention mechanism for adaptive fusion. Finally, through the multi-scale feature optimization module, the network's multi-scale representation capability is enhanced, gradually recovering high-resolution depth images.
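The pyramid arithmetic described above can be sketched as follows. This is an illustrative helper, not part of the described implementation: the function name and the returned fields are hypothetical, and it assumes that at each level the color features are pooled down to the resolution the depth map has after that level's 2× upsampling.

```python
import math


def pyramid_levels(h, w, alpha):
    """Plan a pyramid that reaches upsampling factor alpha in x2 steps.

    h, w  : height and width of the LR depth image
    alpha : total upsampling factor (assumed to be a power of two)
    """
    levels = int(math.log2(alpha))
    if 2 ** levels != alpha:
        raise ValueError("alpha must be a power of two")
    plan = []
    for k in range(1, levels + 1):
        scale = 2 ** k
        plan.append({
            "level": k,
            # Depth resolution after the k-th x2 upsampling step.
            "depth_size": (h * scale, w * scale),
            # Factor by which the alpha-times-larger color image would be
            # pooled to match that resolution (assumption, see lead-in).
            "color_pool_factor": alpha // scale,
        })
    return plan


plan = pyramid_levels(32, 32, 8)
```

For a 32 × 32 LR input and an upsampling factor of 8, this gives three levels at 64 × 64, 128 × 128, and 256 × 256, with the color image pooled by factors of 4, 2, and 1, respectively.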

Attention-Based Multi-Branch Feature Fusion Module
Generally, the fusion operation involves simply concatenating features or computing their weighted sum. However, in this task, simple feature fusion cannot achieve effective guidance for depth map reconstruction. Instead, it may lead to depth artifacts due to the rich texture edges present in the color image. Therefore, we design a feature fusion module based on the attention mechanism, which can learn the feature consistency between the color image and the depth image, explore useful information in the color image, and fully exploit the guiding role of the color image in depth map super-resolution.
As shown in Figure 2, the inputs of this module are the feature maps extracted from the two branches. Because the color image contains unnecessary texture information and the depth image contains noise, the extracted features must first be enhanced to reduce redundant and irrelevant information, so that the network can obtain richer and more representative features. The features from both branches are processed through convolution operations. Then, the PReLU and Sigmoid activation functions are applied to the features separately. By element-wise multiplication, the network adaptively extracts useful information from both inputs while filtering out irrelevant details. This process enhances the features of the images, facilitating effective guidance for depth map reconstruction. In the corresponding expressions, $H_{sc}$ and $H_{sd}$ represent the enhanced color feature map and depth feature map, respectively, and $*$ represents element-wise multiplication.
While the useful edge information in the color image can guide the depth map reconstruction, the useless information may lead to texture duplication or edge discontinuity. Therefore, an attention mechanism is needed to minimize the negative effects of texture replication and select the features that are consistent with the depth image and significant for it. The convolution operation is inspired by the Inception [29] network and is divided into two branches, performing 3 × 3 convolution and 5 × 5 convolution, respectively, to learn feature information at different scales. In the corresponding expressions, maxpool and avgpool represent the max pooling and average pooling operations, respectively; $W_{f1}$ and $W_{f2}$ represent two different convolution operations; $b_{f1}$ and $b_{f2}$ represent the corresponding bias terms; $\theta_1$, $\theta_2$, and $\theta_3$ represent the weights corresponding to each feature; and $H_f$ is the final output of the feature fusion module.
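The feature enhancement step described earlier in this subsection can be written compactly. The form below is a reconstruction from the symbol definitions given in the text (two convolutions per branch, a PReLU activation $\sigma$, a Sigmoid activation $\varepsilon$, and an element-wise product $*$), with $h_c$ and $h_d$ denoting the color and depth branch outputs. It should be read as one plausible form rather than the exact published equation; in particular, the pairing of each activation with each convolution is an assumption.

```latex
\begin{aligned}
H_{sc} &= \sigma\left(W_{c1} h_c + b_{c1}\right) * \varepsilon\left(W_{c2} h_c + b_{c2}\right), \\
H_{sd} &= \sigma\left(W_{d1} h_d + b_{d1}\right) * \varepsilon\left(W_{d2} h_d + b_{d2}\right).
\end{aligned}
```

Under this reading, the Sigmoid path acts as a gate with values in (0, 1) that scales the PReLU-activated features, so positions the gate considers irrelevant are attenuated rather than removed outright.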

Multi-Scale Feature Optimization Module
In order to fully leverage the multi-scale information in the color image to guide the scene depth map, we propose a multi-scale feature optimization (MSFO) module, shown in Figure 3, which extracts features of the depth image at different scales through the interweaving and fusion of different paths. In this network, we consider the 3 × 3 and 5 × 5 convolutions in different paths as a convolution group, which extracts features at different scales from the image and fuses the extracted features. By allowing the information in different paths to be shared, the network can adaptively detect feature information and enhance its feature extraction ability. Furthermore, to reduce the number of parameters and capture more diverse high-frequency features, we split the 5 × 5 convolution into two 3 × 3 convolutions. Finally, to alleviate the training difficulty of the network, we incorporate residual learning [30] into MSFO.
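The parameter saving from replacing one 5 × 5 convolution with two stacked 3 × 3 convolutions can be checked with a short calculation. The sketch below assumes stride-1 convolutions with bias and uses 16 channels per path, the value given in the experimental settings; the helper names are illustrative.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Number of learnable parameters in a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another."""
    rf = 1
    for k in kernels:
        rf += k - 1
    return rf


channels = 16  # channels per path, as stated in the parameter settings
single_5x5 = conv_params(5, channels, channels)
two_3x3 = 2 * conv_params(3, channels, channels)
saving = 1.0 - two_3x3 / single_5x5
```

Both variants cover a 5 × 5 receptive field, but the stacked version needs 4640 parameters instead of 6416 for 16 channels, a reduction of roughly 28%, and it inserts an additional nonlinearity between the two layers.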

Loss Function
A well-designed loss function is important for network performance. Our network is optimized with the L1 loss function, which has good convergence properties and high robustness. For a reconstructed depth map $I^{SR}$ produced by the network and the original high-resolution image $I^{HR}$, the L1 loss is computed over the training samples, where $\theta$ represents the network parameters to be optimized and $\|I_i^{HR} - I_i^{SR}\|_1$ denotes the loss between the $i$-th original high-resolution image and the image reconstructed by the network.
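For reference, a standard way to write an L1 reconstruction loss with the symbols used above is given below. The number of training samples $N$ and the $1/N$ normalization are assumptions introduced here for illustration; the text only specifies the per-sample term and the parameters $\theta$.

```latex
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| I_i^{HR} - I_i^{SR} \right\|_1
```

Compared with a squared-error loss, the L1 norm penalizes large residuals less heavily, which is the usual reason given for its robustness and for producing less blurred edges.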

Experimental Results
In this section, we present the implementation details of the network and describe the datasets used during the network training process. We compare our proposed method with existing state-of-the-art methods and provide quantitative and qualitative results on benchmark datasets, including both noisy and noise-free scenarios.

Experimental Setting
Parameter settings: Apart from setting the kernel size to 1 × 1 in the fusion operation, the kernel size of all convolutional layers in the network is set to 3 × 3. The zero-padding strategy is used to maintain the same output feature size as the input. The number of convolution output channels in the feature extraction and fusion process is set to 64. In the MSFO module, the number of interleaved paths is set to 4, so the number of channels traversed by each path is 16. Our network is implemented with PyTorch and trained on an NVIDIA RTX 2080 Ti GPU. The network is trained with the ADAM optimizer [31], where $\beta_1$ = 0.9 and $\beta_2$ = 0.99. The initial learning rate is set to $1 \times 10^{-4}$ and is halved every 30 training epochs. The batch size is 16, and the network is trained for a total of 60 epochs. The proposed method trains different networks according to different upscaling factors (e.g., 2×, 4×, 8×, 16×).
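The step schedule described in the parameter settings can be expressed as a small function. This is a sketch that assumes epochs are counted from 0; the function and argument names are illustrative.

```python
def learning_rate(epoch, base_lr=1e-4, step=30, gamma=0.5):
    """Learning rate at a given epoch when it is halved every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# Learning rate for each of the 60 training epochs.
schedule = [learning_rate(e) for e in range(60)]
```

With these settings the rate is 1e-4 for epochs 0 to 29 and 5e-5 for epochs 30 to 59. In PyTorch the same behavior is available through `torch.optim.lr_scheduler.StepLR` with `step_size=30` and `gamma=0.5`.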
Training set: We follow the work in [23], selecting 110 images from the Middlebury Stereo Datasets [32] and MPI Sintel Datasets [33] to build our original dataset, in which 95 depth images are used as the original training samples and 15 depth images are used as validation samples. To make full use of these depth images, the images are randomly rotated by 90°, 180°, and 270°, and horizontally flipped. During the training phase, we crop the HR depth images into small squares of size 256 × 256 and remove the sub-images without depth information. As a result, we obtain 10,650 images for training. Finally, LR depth images of size 128 × 128, 64 × 64, 32 × 32, and 16 × 16 are generated by downsampling the HR images using bicubic interpolation with the given scaling factors of 2, 4, 8, and 16.
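The patch preparation described above can be sketched in plain Python. The cropping stride and the convention that missing depth is stored as 0 are assumptions made for illustration, since neither is specified in the text; the helper names are hypothetical.

```python
def crop_patches(image, size, stride):
    """Crop size x size patches and drop those without any depth values.

    image is a 2D list; a value of 0 is assumed to mean "no depth".
    """
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patch = [row[left:left + size] for row in image[top:top + size]]
            if any(v > 0 for row in patch for v in row):
                patches.append(patch)
    return patches


def lr_patch_size(hr_size, factor):
    """Side length of the LR patch obtained from an HR patch."""
    if hr_size % factor != 0:
        raise ValueError("HR size must be divisible by the scaling factor")
    return hr_size // factor


# LR side lengths for 256 x 256 HR patches at each scaling factor.
lr_sizes = {f: lr_patch_size(256, f) for f in (2, 4, 8, 16)}

# Toy 4 x 4 depth map whose top-left 2 x 2 block has no depth.
toy = [[0, 0, 7, 7],
       [0, 0, 7, 7],
       [5, 5, 9, 9],
       [5, 5, 9, 9]]
kept = crop_patches(toy, size=2, stride=2)
```

The computed LR sizes match the 128, 64, 32, and 16 pixel side lengths listed in the text, and the toy example keeps three of the four candidate patches.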

Evaluation

Evaluate under Noise-Free Conditions
In this section, our network is trained on the noise-free training set. To validate the effectiveness of our algorithm, we compare our model with 13 state-of-the-art DSR methods: Bicubic, JID [34], SRCNN [9], LapSRN [35], AR [36], VDSR [37], MSG-Net [23], MFR-SR [24], DRN [15], RDN-GDE [38], PAG-Net [39], PMBANet [40], and PDR-Net [41]. All comparison algorithms are evaluated using the same experimental configuration and training data. The experimental results are obtained based on the code and relevant data provided by the respective authors. The test images from the Middlebury datasets [32] are divided into three groups: A, B, and C. The root mean square error (RMSE) is used as the quantitative metric to evaluate the performance of the super-resolution network on the upsampled results. A smaller RMSE value indicates a better reconstruction quality of the depth maps. Due to the limited resolution of the ground truth depth maps, the analysis of RMSE results in Group A is conducted only for upscaling factors of 2, 4, and 8.
Tables 1-3 list the RMSE values obtained by all compared methods for the various upsampling factors. The best results are marked in bold font, while the second-best results are marked with an underline. Our network exhibits superior performance across almost all scaling factors. The tables show that, in the DSR task, methods based on deep learning achieve better reconstruction than traditional methods. Among the deep learning-based methods, networks such as MSG-Net [23], MFR-SR [24], and DRN [15] achieve better RMSE values than SRCNN [9] and LapSRN [35]. This is because the SRCNN [9] and LapSRN [35] networks rely only on LR depth maps for SR reconstruction. These networks are relatively simple and lack the incorporation of HR color images as prior information, making it difficult for the network to capture details, resulting in the loss of some details in the reconstructed depth map. In contrast, methods such as MSG-Net [23], MFR-SR [24], and DRN [15] incorporate the rich texture and structural information from HR color images, which enables them to better recover high-frequency details and improve the quality of the reconstructed depth maps. This highlights the importance of color image information in guiding DSR tasks.
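The RMSE metric used in the tables can be computed as in the following minimal sketch, which operates on depth maps stored as 2D lists. It averages over all pixels; any masking of invalid pixels or value scaling used in the actual evaluation is not specified in the text and is therefore not modeled here.

```python
import math


def rmse(pred, target):
    """Root mean square error between two depth maps of equal size."""
    if len(pred) != len(target) or any(
        len(p) != len(t) for p, t in zip(pred, target)
    ):
        raise ValueError("depth maps must have the same shape")
    total = 0.0
    count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return math.sqrt(total / count)


ground_truth = [[0.0, 0.0], [0.0, 0.0]]
reconstruction = [[3.0, 4.0], [0.0, 0.0]]
error = rmse(reconstruction, ground_truth)
```

Because the errors are squared before averaging, a few large deviations at depth discontinuities dominate the score, which is why edge quality has a strong effect on the reported values.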
Specifically, Figure 4 shows the final depth map reconstruction results of different methods for 8× upsampling on the Middlebury datasets. Visually, most methods demonstrate similar recovery results for large background areas. However, there are significant differences in details, such as the pens in Art or the antlers in Reindeer. Among them, the Bicubic method yields the poorest result, with significant image distortion and pronounced jagged edges. The SRCNN [9] method is superior to the traditional super-resolution reconstruction method. However, since this method takes the bicubic interpolated images as inputs, it not only increases the computational overhead but also introduces noise into the images, which leads to edge blurring. The LapSRN [35] method directly extracts features from the LR images, avoiding the influence of this noise. However, in some cases, it still encounters challenges in preserving fine details and edges, which may lead to slight blurring. Both MSG-Net [23] and MFR-SR [24] employ color image guidance for depth map reconstruction, thereby enhancing reconstruction quality and producing sharp depth edges. However, they may introduce incorrect depth information in certain regions, leading to depth artifacts. In comparison, our method excels in recovering clear details and effectively addresses the issue of edge artifacts. This is primarily attributed to the proposed multi-branch feature fusion module, which can fully explore the consistency between the HR color image and the LR depth image. By making reasonable predictions for less obvious details, our method achieves superior results in depth map reconstruction.

Evaluate under Noisy Conditions
To evaluate the effectiveness of the algorithm under noisy conditions, we add Gaussian noise with a mean of 0 and a standard deviation of 5 to the LR depth images and retrain the network model. We select Bicubic, TGV [42], SRCNN [9], MSG-Net [23], MFR-SR [24], and RDN-GDE [38] as comparison algorithms for the noisy experiments. Table 4 lists the RMSE results for upsampling factors of 2, 4, and 8 on the Middlebury datasets, where the best results are marked in bold font and the second-best results are marked with underlines. As Table 4 shows, the TGV method [42] exploits global information and can achieve better performance while addressing the noise. It can also be observed that deep learning-based methods outperform traditional methods in suppressing noise. Among them, color-guided reconstruction methods such as MFR-SR [24] and RDN-GDE [38] obtain smaller RMSE values. This indicates that the guidance of color images is of great help in suppressing image noise. Our method utilizes color-guided information that not only effectively suppresses noise but also restores clear and sharp edges, making the restored image closer to the ground truth.
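The noise model used for this experiment can be sketched as follows. The zero mean and the standard deviation of 5 come from the text; the clipping range of 0 to 255 and the use of a seeded generator for reproducibility are assumptions added for illustration.

```python
import random


def add_gaussian_noise(depth, sigma=5.0, seed=None, lo=0.0, hi=255.0):
    """Return a copy of a 2D depth map with zero-mean Gaussian noise added.

    Values are clipped to [lo, hi]; the clipping range is an assumption.
    """
    rng = random.Random(seed)
    return [
        [min(hi, max(lo, v + rng.gauss(0.0, sigma))) for v in row]
        for row in depth
    ]


clean = [[100.0] * 4 for _ in range(3)]
noisy = add_gaussian_noise(clean, sigma=5.0, seed=0)
```

Passing the same seed reproduces the same noisy input, which is convenient when several methods need to be compared on identical degraded images.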

Ablation Study
To further evaluate the impact of each module on the network performance, we conduct an ablation analysis on the MSFO and AMBFF modules in the network. In this study, we replace the AMBFF module with a 1 × 1 convolution fusion operation and the MSFO module with a basic residual block to form a comparative baseline network. Table 5 presents the results of the ablation experiments for an upsampling factor of 8. From the table, it can be observed that the baseline network achieves relatively higher RMSE values, indicating poorer performance. When the MSFO module is added to the network, the multi-scale expression ability of the network is enhanced. By extracting multi-scale features, it can better adapt to the changes of details at different scales and enhance the network performance. With the addition of the AMBFF module, the network is able to leverage the attention mechanism to suppress unnecessary texture information from the color image. This helps address the issue of texture replication caused by incorrect guidance from the color image. The module facilitates better fusion of features between the color and depth images, enhancing the overall performance of the network. The ablation experiments thus demonstrate the effectiveness of the MSFO and AMBFF modules in depth map super-resolution.

Conclusions
In this paper, we propose a multi-channel progressive attention fusion network. The network fully exploits the multi-scale features of the HR guidance image and the LR depth image through the pyramid structure, and adaptively fuses these features to further improve the network performance. In order to alleviate the depth artifact problem caused by the texture inconsistency between the color image and the depth image, the network uses the multi-branch feature fusion module to effectively enhance the useful information and filter the useless information so that the edge of the restored HR depth image is smoother. In addition, a multi-scale feature optimization module is designed to optimize the features, which improves the multi-scale expression ability of the network and further integrates multi-scale features. Extensive experiments have demonstrated that our method outperforms most existing methods in both noise-free and noisy conditions. When using depth cameras to capture depth images, various issues such as surface reflections or exceeding the range can lead to missing regions in the depth image. Therefore, in future work, we plan to further explore methods for the super-resolution of depth images containing holes to improve the overall quality and integrity of HR depth images.

Figure 1. Multi-scale progressive attention fusion network with an upsampling factor of 8. The network consists of two main branches: the color-guided branch and the depth fusion and upsampling branch. The AMBFF module combines the color image features and the depth image features in an adaptive manner, taking into account their respective importance and relevance to the depth map reconstruction task. The MSFO module is used to capture the multi-scale information of feature maps and enhance the multi-scale representation ability of the network.
In the feature enhancement equations of the AMBFF module, $h_c$ and $h_d$ represent the outputs of the branches; $W_{c1}$, $W_{c2}$, $W_{d1}$, and $W_{d2}$ represent the two convolution operations performed on the color and depth image features; and $b_{c1}$, $b_{c2}$, $b_{d1}$, and $b_{d2}$ represent the corresponding bias terms. $\sigma$ represents the PReLU activation function, and $\varepsilon$ represents the Sigmoid activation function.

Figure 2. Attention-based multi-branch feature fusion module. This module consists of two components: feature enhancement and attention fusion. The feature enhancement component helps the network better understand image features and improves its expressive power. The attention fusion component alleviates the negative effects caused by texture replication, allowing the network to focus on important features.

Figure 3. Multi-scale feature optimization module. The multi-scale features of the feature map are extracted through the interweaving and fusion of different paths.

Table 1. Quantitative analysis of noise-free DSR results (RMSE) on Dataset A.

Table 2. Quantitative analysis of noise-free DSR results (RMSE) on Dataset B.

Table 3. Quantitative analysis of noise-free DSR results (RMSE) on Dataset C.

Table 4. Quantitative analysis of noisy DSR results (RMSE) on Dataset A.

Table 5. RMSE values of the ablation study.