Article

Depth Map Super-Resolution Reconstruction Based on Multi-Channel Progressive Attention Fusion Network

Jiachen Wang and Qingjiu Huang
1 School of Information and Electronic Engineering, Zhejiang Gongshang University, Hangzhou 310018, China
2 Control System Laboratory, Graduate School of Engineering, Kogakuin University, Tokyo 163-8677, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(14), 8270; https://doi.org/10.3390/app13148270
Submission received: 13 June 2023 / Revised: 8 July 2023 / Accepted: 13 July 2023 / Published: 17 July 2023

Abstract

Depth maps captured by traditional consumer-grade depth cameras are often noisy and low-resolution. In particular, when low-resolution depth maps are upsampled with large upsampling factors, the resulting depth maps tend to have blurred edges. To address these issues, we propose a multi-channel progressive attention fusion network that utilizes a pyramid structure to progressively recover high-resolution depth maps. The inputs of the network are the low-resolution depth image and its corresponding color image. The color image is used as prior information in this network to fill in the missing high-frequency information of the depth image. Then, an attention-based multi-branch feature fusion module is employed to mitigate the texture replication issue caused by incorrect guidance from the color image and inconsistencies between the color image and the depth map. This module restores the HR depth map by effectively integrating the information from both inputs. Extensive experimental results demonstrate that our proposed method outperforms existing methods.

1. Introduction

With the development of 3D imaging technology, depth maps have been widely used in many fields. Depth maps directly reflect the shape of objects, making them useful for estimating object depth and contours in 3D reconstruction [1], scene segmentation [2], and human–computer interaction [3]. However, due to the limitations of the imaging environment and sensor hardware, it is extremely difficult to obtain high-quality depth maps. Depth maps captured by low-cost consumer-grade depth cameras usually suffer from problems such as low resolution and loss of depth details. These limitations have hindered the development of many applications that rely on accurate and high-resolution depth information. Therefore, image super-resolution technology that restores low-resolution depth maps to high-quality, high-resolution depth maps is of great research significance.
Previous researchers have proposed various DSR (depth super-resolution) algorithms to address this issue, such as interpolation-based methods [4], filtering-based methods [5,6], and optimization-based methods [7,8]. Some traditional methods have achieved remarkable results in depth image super-resolution. For example, optimization-based methods typically treat the DSR problem as a global optimization problem, which constrains the reconstruction of the depth image using data terms and regularization terms. This approach can restore depth maps of high visual quality, but it depends heavily on a complex optimization model, which often requires considerable running time. With the continuous development of deep convolutional neural networks (DCNNs) in computer vision tasks such as image recognition, image classification, and object detection, DCNNs have also been widely used in color image super-resolution. Dong et al. [9] propose SRCNN, an end-to-end network model that applies deep learning to image super-resolution for the first time, demonstrating that deep learning can surpass traditional methods and achieve higher performance in super-resolution reconstruction. In recent years, color image super-resolution algorithms based on convolutional neural networks [10,11,12] have achieved great success, and many researchers have begun to apply this approach to depth image super-resolution reconstruction [13,14,15]. However, unlike color images that contain abundant texture information, depth maps only convey the depth of objects. This makes depth map super-resolution more challenging, as the task involves recovering finer details and enhancing the resolution of the depth map.
This paper proposes a novel multi-channel progressive attention fusion network based on color image guidance to address the aforementioned challenges. It utilizes the rich texture information of color images to fill in the missing information in LR depth images. The network is guided by the color image corresponding to the depth image and uses a pyramid structure combined with residual learning to progressively restore high-quality depth maps. The network mainly consists of two parts: the color image guidance branch and the depth image fusion upsampling branch. For the depth image branch, we propose an attention-based multi-branch feature fusion module, which can adaptively select the texture information consistent with the LR depth image in the HR color image for effective fusion, while ignoring the abundant foreground information present in the HR color image. As a result, it effectively avoids the generation of depth artifacts and preserves the quality of the depth map. Afterwards, the fused feature map is optimized through a multi-scale feature optimization residual module to capture and enhance the features of different levels, which strengthens the multi-scale expression ability of the network. Extensive results show that our method outperforms the current state-of-the-art methods.
Our contributions are as follows:
(1)
A novel multi-channel progressive attention fusion network is proposed, incorporating local residual connections and global residual connections to predict the high-frequency residual information of the image, which is more conducive to the recovery of depth map details.
(2)
The multi-scale feature optimization residual module is utilized to enhance the network’s multi-scale feature representation capability through multiple intertwined paths.
(3)
We propose an attention-based multi-branch feature fusion module that can adaptively fuse features from both the depth and color images.

2. Related Works

Generally, depending on the input data, depth map super-resolution methods can be divided into single depth map super-resolution technology and color image-guided depth map super-resolution technology.

2.1. Single DSR Technology

Single depth map super-resolution technology refers to recovering a high-quality depth map from a single low-resolution depth map as input. The difficulty of this method lies in how to restore details and structure information from LR depth images that lose part of the depth information.
For traditional methods, techniques such as interpolation, filtering, and optimization are usually used to improve the resolution of the depth image. Xie et al. [16] use Markov Random Field to optimize high-resolution edge maps constructed from edges of low-resolution depth images. Guided by the HR edge map, an improved joint bilateral filter is used to upsample the HR depth map. This not only avoids the generation of depth artifacts due to texture prediction, but also preserves sharp edges. Zhao et al. [17] learn multiple residual dictionaries from a single external image and utilize a shape-adaptive weighted median filter along the edges of the depth map to remove artifacts. Zhou et al. [18] apply a pair of sparse dictionaries to obtain the edge information of high-quality depth images. Then, guided by the edge information, they use an improved joint bilateral filter to interpolate the depth image.
These traditional approaches help to improve the resolution and quality of depth images, but they often have limitations in capturing details and recovering accurate depth information compared to more advanced deep learning-based methods. Depth super-resolution based on deep learning utilizes deep convolutional neural networks, usually using an end-to-end progressive network to gradually restore high-quality depth maps. For example, Song et al. [13] regard the depth map super-resolution task as a series of super-resolution subtasks, and reconstruct high-quality depth maps by synthesizing these subtasks. Huang et al. [14] cascade multiple dense residual blocks to design a pyramid structure based on deep dense residual networks. This structure utilizes dense skip connections to extract image features while employing residual learning to reconstruct high-quality depth maps. Song et al. [19] incorporate depth supervision into an iterative residual learning framework, using the total generalized variation (TGV) term and consistency loss to further refine the obtained HR depth map. The depth controllable slicing network proposed by Ye et al. [20] divides the depth map into a series of depth slices for learning, weighted by a distance-aware method, and adaptively fuses slices of different depths to generate high-quality depth maps.

2.2. Color Image-Guided DSR Technology

Color image-guided depth map super-resolution technology often uses the color image corresponding to the depth image as guidance to improve the quality of the generated depth map. By leveraging the texture, color, and structural features present in the color image, these methods aim to recover fine details and improve the overall quality of the depth map. The color space mismatch between the color image and the depth map poses a challenge in color-guided depth map super-resolution techniques. How to fully fuse the color image guidance information with the depth image and avoid the generation of depth artifacts is an important problem faced by this method.
Kopf et al. [5] propose joint bilateral upsampling for the first time, using the high-resolution color image as prior information during the upsampling process. By combining the spatial information of the low-resolution target image with the range information of the high-resolution guiding image, a better high-resolution depth image can be restored. Yang et al. [6] propose a novel joint trilateral filter that uses the updated edge map of the upsampled depth map as guidance to change the filtering process. Diebel and Thrun [21] are the first to apply Markov Random Fields (MRF) to HR depth map generation, proposing an MRF formulation that consists of a data term based on the low-resolution depth map and a smoothness term regularized by the corresponding high-resolution color image. Wang et al. [22] learn edge information from LR depth maps and HR color maps, and then recover the high-resolution depth image using the predicted edge map and the HR color image.
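To make the guided-filtering idea above concrete, the following minimal NumPy sketch illustrates joint bilateral upsampling in the spirit of Kopf et al. [5]: each HR depth value is a weighted average of LR depth samples, with a spatial Gaussian defined on the LR grid and a range Gaussian defined on the HR guidance image. The function name, kernel radius, and Gaussian widths are illustrative choices rather than values taken from [5].

import numpy as np

def joint_bilateral_upsample(depth_lr, color_hr, scale, radius=2,
                             sigma_spatial=1.0, sigma_range=0.1):
    # depth_lr: (h, w) LR depth map; color_hr: (H, W) grayscale guidance, H = scale * h
    h, w = depth_lr.shape
    H, W = color_hr.shape
    depth_hr = np.zeros((H, W), dtype=np.float64)
    for y in range(H):
        for x in range(W):
            yl, xl = y / scale, x / scale            # position on the LR grid
            num, den = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    py = min(max(int(round(yl)) + dy, 0), h - 1)
                    px = min(max(int(round(xl)) + dx, 0), w - 1)
                    # spatial weight from the distance on the LR grid
                    ws = np.exp(-((py - yl) ** 2 + (px - xl) ** 2) / (2 * sigma_spatial ** 2))
                    # range weight from the similarity of the HR guidance image
                    gy, gx = min(py * scale, H - 1), min(px * scale, W - 1)
                    wr = np.exp(-((color_hr[y, x] - color_hr[gy, gx]) ** 2) / (2 * sigma_range ** 2))
                    num += ws * wr * depth_lr[py, px]
                    den += ws * wr
            depth_hr[y, x] = num / max(den, 1e-8)
    return depth_hr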
Deep learning-based depth super-resolution guided by color images has achieved remarkable results. Hui et al. [23] propose a multi-scale guided convolutional network that uses intensity images as guidance, which can delicately restore high-quality depth images. Zuo et al. [24] adopt global residual learning and local residual learning to iteratively upsample LR depth maps under the guidance of HR intensity images. Xian et al. [15] construct a multi-scale progressive deep fusion network using a depth encoding branch and its related color encoding branch, which gradually merges the multi-scale feature maps extracted from the two branches to generate high-resolution depth images. Zhong et al. [25] propose a new attention-based multimodal fusion strategy, which makes full use of low-level spatial information and high-level structural information, effectively exploring the complementarity of multi-level multimodal features. Chen et al. [26] combine 3D depth images with 2D color images and use the combined image as prior information to guide depth image reconstruction. Guo et al. [27] replace the original convolution kernel with a spatially varying kernel so that the HR color image can more effectively guide the LR depth map to restore a high-quality depth map. Sun et al. [28] propose, for the first time, a single depth image super-resolution framework based on cross-task scene structure knowledge transfer. They introduce a knowledge distillation method in which the depth estimation task and the super-resolution task are jointly trained. Although this multi-task learning approach can improve the performance of depth image super-resolution, it often comes with higher network complexity, which may require more powerful hardware to train the network effectively.

3. Proposed Method

3.1. Overview

We propose a multi-scale progressive attention fusion network based on color image guidance. The entire network utilizes local and global residual connections, which not only enhances the gradient propagation of the network, but also effectively combines the shallow features and deep features of the depth map to realize feature reuse. To meet the demand for large upscaling factors, the network adopts a pyramid structure to upsample the depth image, and the upsampling factor of each level is 2. Figure 1 shows an example structure with an upsampling factor of 8. The inputs of the network are the LR depth image $I_{LR} \in \mathbb{R}^{H \times W \times 1}$ and the HR color image $G_{HR} \in \mathbb{R}^{\alpha H \times \alpha W \times C}$, where H and W represent the image height and width, respectively, C is the number of channels, and $\alpha$ is the upsampling factor (e.g., $\alpha = 2, 4, 8, 16$). The HR color image $G_{HR}$ is passed through multiple convolutional layers to capture different levels of features in the color image. Simultaneously, a max pooling layer is used to ensure the resolution consistency between the color image and the depth image. As the upsampling factor increases and the network deepens, the depth image can recover more high-frequency details with the help of the HR color map. Next, the features extracted from the two branches are passed into the attention-based multi-branch feature fusion module, which employs the attention mechanism for adaptive fusion. Finally, through the multi-scale feature optimization module, the network's multi-scale representation capability is enhanced, gradually recovering high-resolution depth images.
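The overall data flow can be summarized by the following PyTorch sketch. It is a simplified reading of Figure 1 rather than the authors' released code: the layer widths, the transposed-convolution upsampler, and the use of adaptive max pooling to match resolutions are our assumptions, and the AMBFF and MSFO classes refer to the illustrative sketches given after Sections 3.2 and 3.3.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveDSR(nn.Module):
    # Simplified pyramid: each level doubles the depth-feature resolution and fuses it
    # with color-guidance features pooled to the matching size.
    def __init__(self, levels=3, feats=64):                      # levels = 3 -> 8x upsampling
        super().__init__()
        self.levels = levels
        self.depth_head = nn.Conv2d(1, feats, 3, padding=1)
        self.color_head = nn.Conv2d(3, feats, 3, padding=1)
        self.color_convs = nn.ModuleList([nn.Conv2d(feats, feats, 3, padding=1) for _ in range(levels)])
        self.up = nn.ModuleList([nn.ConvTranspose2d(feats, feats, 4, stride=2, padding=1) for _ in range(levels)])
        self.fuse = nn.ModuleList([AMBFF(feats) for _ in range(levels)])     # Section 3.2 sketch
        self.refine = nn.ModuleList([MSFO(feats) for _ in range(levels)])    # Section 3.3 sketch
        self.tail = nn.Conv2d(feats, 1, 3, padding=1)

    def forward(self, d_lr, g_hr):
        d = self.depth_head(d_lr)
        c = self.color_head(g_hr)
        color_feats = []
        for conv in self.color_convs:                            # color guidance branch
            c = F.relu(conv(c))
            color_feats.append(c)
        for i in range(self.levels):
            d = self.up[i](d)                                    # x2 upsampling per pyramid level
            g = F.adaptive_max_pool2d(color_feats[i], d.shape[-2:])  # resolution consistency
            d = self.refine[i](self.fuse[i](d, g))               # AMBFF fusion, then MSFO refinement
        base = F.interpolate(d_lr, scale_factor=2 ** self.levels, mode='bicubic', align_corners=False)
        return self.tail(d) + base                               # global residual connection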

3.2. Attention-Based Multi-Branch Feature Fusion Module

Generally, the fusion operation involves simply concatenating or weighted summing of features. However, in this task, simple feature fusion cannot achieve effective guidance for depth map reconstruction. Instead, it may lead to depth artifacts due to the rich texture edges present in the color image. Therefore, we designed a feature fusion module based on the attention mechanism, which can learn the feature consistency between the color image and the depth image, explore useful information in the color image, and fully exert the guidance role of the color image in depth map super-resolution.
As shown in Figure 2, the inputs of this module are the feature maps extracted from the two branches. To address the issue of unnecessary texture information in the color image and noise in the depth image, it is necessary to enhance the extracted features to reduce redundant and irrelevant information in the features so that the network can obtain richer and more representative features. The features from both branches are processed through convolution operations. Then, the PReLU and Sigmoid activation functions are applied to the features separately. By element-wise multiplication, the network adaptively extracts useful information from both inputs while filtering out irrelevant details. This process enhances the features of the images, facilitating effective guidance for depth map reconstruction. The specific operations are as follows:
$$H_{c1} = \sigma(W_{c1} h_c + b_{c1})$$
$$H_{d1} = \sigma(W_{d1} h_d + b_{d1})$$
$$H_{c2} = \varepsilon(W_{c2} h_c + b_{c2})$$
$$H_{d2} = \varepsilon(W_{d2} h_d + b_{d2})$$
In the equations above, $h_c$ and $h_d$ represent the outputs of the two branches; $W_{c1}$, $W_{c2}$, $W_{d1}$, and $W_{d2}$ represent the convolution operations performed on the color and depth image features; $b_{c1}$, $b_{c2}$, $b_{d1}$, and $b_{d2}$ represent the corresponding bias terms; $\sigma$ represents the PReLU activation function; and $\varepsilon$ represents the Sigmoid activation function.
$$H_{sc} = H_{c1} * H_{c2}$$
$$H_{sd} = H_{d1} * H_{d2}$$
$H_{sc}$ and $H_{sd}$ represent the enhanced color feature map and depth feature map, respectively, and $*$ represents element-wise multiplication.
Since useful edge information in the color image can guide the depth map reconstruction, while useless information may lead to texture duplication or edge discontinuity, an attention mechanism is needed to minimize the negative effects of texture replication and to select the features that are consistent with and significant for the depth image. Inspired by the Inception network [29], the convolution operation is divided into two branches, performing $3 \times 3$ and $5 \times 5$ convolutions, respectively, to learn feature information at different scales. The specific operations are as follows:
$$H_{b1} = \mathrm{maxpool}(H_{sc})$$
$$H_{b2} = \mathrm{avgpool}(H_{sc})$$
$$H_{b3} = \sigma(W_{f1} H_{sc} + b_{f1}) + \sigma(W_{f2} H_{sc} + b_{f2})$$
$$H_f = H_{sd} * \varepsilon(\theta_1 H_{b1} + \theta_2 H_{b2} + \theta_3 H_{b3})$$
Here, $\mathrm{maxpool}$ and $\mathrm{avgpool}$ represent the max pooling and average pooling operations, respectively; $W_{f1}$ and $W_{f2}$ represent the two different convolution operations ($3 \times 3$ and $5 \times 5$), and $b_{f1}$ and $b_{f2}$ represent the corresponding bias terms; $\theta_1$, $\theta_2$, and $\theta_3$ represent the weights assigned to the three branches; and $H_f$ is the final output of the feature fusion module.
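A possible PyTorch realization of the equations above is sketched below. The stride-1 pooling (so that every branch keeps the input resolution) and the treatment of $\theta_1$–$\theta_3$ as learnable scalars are our assumptions; the kernel sizes follow the $3 \times 3$ / $5 \times 5$ split described above.

import torch
import torch.nn as nn

class AMBFF(nn.Module):
    # Attention-based multi-branch feature fusion (illustrative sketch of Section 3.2).
    def __init__(self, feats=64):
        super().__init__()
        # feature enhancement: a PReLU branch and a Sigmoid gate for each modality
        self.conv_c1 = nn.Conv2d(feats, feats, 3, padding=1)
        self.conv_c2 = nn.Conv2d(feats, feats, 3, padding=1)
        self.conv_d1 = nn.Conv2d(feats, feats, 3, padding=1)
        self.conv_d2 = nn.Conv2d(feats, feats, 3, padding=1)
        self.prelu = nn.PReLU()
        # attention branches over the enhanced color features
        self.maxpool = nn.MaxPool2d(3, stride=1, padding=1)
        self.avgpool = nn.AvgPool2d(3, stride=1, padding=1)
        self.conv_f1 = nn.Conv2d(feats, feats, 3, padding=1)
        self.conv_f2 = nn.Conv2d(feats, feats, 5, padding=2)
        self.theta = nn.Parameter(torch.ones(3) / 3.0)   # weights of the three attention branches

    def forward(self, h_d, h_c):
        # feature enhancement: enhance and gate both feature maps
        h_sc = self.prelu(self.conv_c1(h_c)) * torch.sigmoid(self.conv_c2(h_c))
        h_sd = self.prelu(self.conv_d1(h_d)) * torch.sigmoid(self.conv_d2(h_d))
        # attention fusion: build an attention map from the enhanced color features
        b1 = self.maxpool(h_sc)
        b2 = self.avgpool(h_sc)
        b3 = self.prelu(self.conv_f1(h_sc)) + self.prelu(self.conv_f2(h_sc))
        attn = torch.sigmoid(self.theta[0] * b1 + self.theta[1] * b2 + self.theta[2] * b3)
        return h_sd * attn                               # H_f: depth features modulated by the attention map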

3.3. Multi-Scale Feature Optimization Module

In order to fully leverage the multi-scale information in the color image to guide the scene depth map, as shown in Figure 3, we propose an MSFO module, which extracts the features of different scales of the depth image through the interweaving and fusion of different paths.
In this network, we consider the  3 × 3  and  5 × 5  convolutions in different paths as a convolution group, which extracts features at different scales from the image and fuses the extracted features. By allowing the information in different paths to be shared, the network can adaptively detect feature information and enhance its feature extraction ability. Furthermore, to reduce the number of parameters and capture more different high-frequency features, we split the  5 × 5  convolution into two  3 × 3  convolutions. Finally, to alleviate the training difficulty of the network, we incorporate residual learning [30] into MSFO.
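Because the exact wiring of the interleaved paths is not fully specified in the text, the following PyTorch sketch shows one plausible realization: the 64 channels are split into 4 paths of 16 channels, each path applies a $3 \times 3$ convolution and an effective $5 \times 5$ convolution built from two stacked $3 \times 3$ convolutions, neighbouring paths exchange information, and a local residual connection is added.

import torch
import torch.nn as nn

class MSFO(nn.Module):
    # Multi-scale feature optimization block (illustrative sketch of Section 3.3).
    def __init__(self, feats=64, paths=4):
        super().__init__()
        assert feats % paths == 0
        self.paths = paths
        pc = feats // paths                       # channels per path (16 in the paper's setting)
        self.conv3 = nn.ModuleList([nn.Conv2d(pc, pc, 3, padding=1) for _ in range(paths)])
        self.conv5a = nn.ModuleList([nn.Conv2d(pc, pc, 3, padding=1) for _ in range(paths)])
        self.conv5b = nn.ModuleList([nn.Conv2d(pc, pc, 3, padding=1) for _ in range(paths)])
        self.fuse = nn.Conv2d(2 * feats, feats, 1)
        self.act = nn.PReLU()

    def forward(self, x):
        chunks = torch.chunk(x, self.paths, dim=1)
        outs, prev = [], 0
        for i in range(self.paths):
            inp = chunks[i] if i == 0 else chunks[i] + prev      # share information across paths
            f3 = self.act(self.conv3[i](inp))                    # 3x3 branch
            f5 = self.act(self.conv5b[i](self.act(self.conv5a[i](inp))))  # two 3x3 convs = 5x5 receptive field
            prev = f3
            outs.extend([f3, f5])
        out = self.fuse(torch.cat(outs, dim=1))                  # merge all paths
        return out + x                                           # local residual connection [30]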

3.4. Loss Function

In order to help the network better improve its performance, it is very important to design a good loss function. Our network is optimized with the L1 loss, which has good convergence properties and high robustness. For a depth map $I^{SR}$ reconstructed by the network and the corresponding original high-resolution depth map $I^{HR}$, the loss function is defined as follows:
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| I_i^{HR} - I_i^{SR} \right\|_1$$
where $\theta$ represents the network parameters to be optimized, and $\left\| I_i^{HR} - I_i^{SR} \right\|_1$ denotes the loss between the i-th original high-resolution image and the image reconstructed by the network.
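In PyTorch this loss corresponds directly to nn.L1Loss; the tensors below are placeholders with the batch and patch sizes used during training.

import torch
import torch.nn as nn

l1_loss = nn.L1Loss()                      # mean of |I_i^HR - I_i^SR| over pixels and the batch
pred_hr = torch.rand(16, 1, 256, 256)      # placeholder for the reconstructed depth maps
gt_hr = torch.rand(16, 1, 256, 256)        # placeholder for the ground-truth HR depth maps
loss = l1_loss(pred_hr, gt_hr)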

4. Experimental Results

In this section, we present the implementation details of the network and describe the datasets used during the network training process. We compare our proposed method with existing state-of-the-art methods and provide quantitative and qualitative results on benchmark datasets, including both noisy and noise-free scenarios.

4.1. Experimental Setting

Parameter settings: Apart from setting the kernel size to 1 × 1 in the fusion operation, the kernel size of all convolutional layers in the network is set to 3 × 3. The zero-padding strategy is used to keep the output feature size the same as the input. The number of convolution output channels in the feature extraction and fusion process is set to 64. In the MSFO module, the number of interleaved paths is set to 4, so the number of channels processed by each path is 16. Our network is implemented in PyTorch and trained on an NVIDIA RTX 2080 Ti GPU. The network is trained with the Adam optimizer [31], where $\beta_1 = 0.9$ and $\beta_2 = 0.99$. The initial learning rate is set to $1 \times 10^{-4}$ and is halved every 30 training epochs. The batch size is 16, and the network is trained for a total of 60 epochs. The proposed method trains a separate network for each upscaling factor (2×, 4×, 8×, 16×).
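The optimizer and schedule above translate into the PyTorch training skeleton below; model and train_loader are placeholders for the network of Section 3 and for a loader over the (LR depth, HR color, HR depth) triplets described next.

import torch.nn.functional as F
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# model = ProgressiveDSR(levels=3)              # placeholder: the network of Section 3
# train_loader yields (lr_depth, hr_color, hr_depth) batches of size 16
optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.99))
scheduler = StepLR(optimizer, step_size=30, gamma=0.5)   # halve the learning rate every 30 epochs
for epoch in range(60):
    for lr_depth, hr_color, hr_depth in train_loader:
        optimizer.zero_grad()
        loss = F.l1_loss(model(lr_depth, hr_color), hr_depth)
        loss.backward()
        optimizer.step()
    scheduler.step()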
Training set: We follow the work in [23], selecting 110 images from the Middlebury Stereo Datasets [32] and MPI Sintel Datasets [33] to build our original dataset, in which 95 depth images are used as the original training samples and 15 depth images are used as validation samples. To make full use of these depth images, the images are randomly rotated by 90°, 180°, and 270°, and horizontally flipped. During the training phase, we crop the HR depth images into small squares of size 256 × 256 and remove the sub-images without depth information. As a result, we obtain 10,650 images for training. Finally, LR depth images of size 128 × 128, 64 × 64, 32 × 32, and 16 × 16 are generated by downsampling the HR images using bicubic interpolation with the given scaling factors of 2, 4, 8, and 16.
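The augmentation and downsampling pipeline can be sketched as follows with Pillow and NumPy; the rejection rule for crops without depth information (an all-zero patch) is our assumption, and the function name is illustrative.

import random
import numpy as np
from PIL import Image

def make_training_pair(hr_depth, scale, patch=256):
    # hr_depth: single-channel PIL image larger than the patch size
    # random rotation by 0/90/180/270 degrees and random horizontal flip
    hr_depth = hr_depth.rotate(random.choice([0, 90, 180, 270]), expand=True)
    if random.random() < 0.5:
        hr_depth = hr_depth.transpose(Image.FLIP_LEFT_RIGHT)
    # random 256x256 crop; discard sub-images without depth information
    w, h = hr_depth.size
    x, y = random.randint(0, w - patch), random.randint(0, h - patch)
    hr_patch = hr_depth.crop((x, y, x + patch, y + patch))
    if np.asarray(hr_patch).max() == 0:
        return None
    # bicubic downsampling with the given scaling factor (2, 4, 8, or 16)
    lr_patch = hr_patch.resize((patch // scale, patch // scale), Image.BICUBIC)
    return lr_patch, hr_patch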

4.2. Evaluation

4.2.1. Evaluate under Noise-Free Conditions

In this section, our network is trained on the noise-free training set. In order to validate the effectiveness of our algorithm, this paper compares our model with 13 state-of-the-art DSR methods, including Bicubic, JID [34], SRCNN [9], LapSRN [35], AR [36], VDSR [37], MSG-Net [23], MFR-SR [24], DRN [15], RDN-GDE [38], PAG-Net [39], PMBANet [40], and PDR-Net [41]. All comparison algorithms are evaluated using the same experimental configuration and training data. The experimental results are obtained based on the code and relevant data provided by the author. In this paper, the test images from the Middlebury datasets [32] are divided into three groups: A, B, and C. The root mean square error (RMSE) is used as the quantitative metric to evaluate the performance of the super-resolution network on the upsampled results. A smaller RMSE value indicates a better reconstruction quality of the depth maps. Due to the limitation of the resolution of the ground truth depth maps, the analysis of RMSE results in Group A is conducted only for upscaling factors of 2, 4, and 8. As shown in Table 1, Table 2 and Table 3, the RMSE values obtained from all state-of-the-art methods for various upsampling factors are listed, respectively. The best results are marked in bold font, while the second-best results are marked with an underline. Our network exhibits superior performance across almost all scaling factors.
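For reference, the RMSE used in Tables 1–4 is simply the square root of the mean squared depth error per image; a minimal NumPy helper is shown below (the optional validity mask for pixels without ground-truth depth is our addition).

import numpy as np

def rmse(pred, gt, valid=None):
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if valid is None:
        valid = np.ones_like(gt, dtype=bool)   # evaluate all pixels by default
    diff = pred[valid] - gt[valid]
    return float(np.sqrt(np.mean(diff ** 2)))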
It can be seen from the table that in the DSR task, the method based on deep learning has a better reconstruction effect than the traditional method. Among the deep learning-based methods, networks such as MSG-Net [23], MFR-SR [24], and DRN [15] achieve better RMSE values compared to SRCNN [9] and LapSRN [35]. This is because the SRCNN [9] and LapSRN [35] networks only rely on LR depth maps for SR reconstruction. These networks are relatively simple and lack the incorporation of HR color images as prior information, making it difficult for the network to capture details, resulting in the loss of some details in the reconstructed depth map. In contrast, methods such as MSG-Net [23], MFR-SR [24], and DRN [15] incorporate the rich texture and structural information from HR color images, which enable them to better recover high-frequency details and improve the quality of the reconstructed depth maps. This highlights the importance of color image information in guiding DSR tasks.
Specifically, Figure 4 shows the final depth map reconstruction results of different methods for 8× upsampling on the Middlebury datasets. Visually, most methods demonstrate similar recovery results for large background areas. However, there are significant differences in details, such as the pens in Art or the antlers in Reindeer. Among them, the Bicubic method yields the poorest result, with significant image distortion and pronounced jagged edges. The SRCNN [9] method is superior to traditional super-resolution reconstruction methods; however, since it takes bicubic-interpolated images as inputs, it not only increases the computational overhead but also introduces noise into the images, leading to edge blurring. The LapSRN [35] method directly extracts features from the low-resolution (LR) images, avoiding the influence of noise; however, in some cases, it still encounters challenges in preserving fine details and edges, which may lead to slight blurring. Both the MSG-Net [23] and MFR-SR [24] methods employ color image guidance for depth map reconstruction, thereby enhancing reconstruction quality and producing sharp depth edges; however, they may introduce incorrect depth information in certain regions, leading to the issue of depth artifacts. In comparison, our method excels in recovering clear details and effectively addresses the issue of edge artifacts. This is primarily attributed to the proposed multi-branch feature fusion module, which can fully explore the consistency between the HR color image and the LR depth image. By making reasonable predictions for less obvious details, our method achieves superior results in depth map reconstruction.

4.2.2. Evaluate under Noisy Conditions

To evaluate the effectiveness of the algorithm under noisy conditions, we introduce Gaussian noise with a mean of 0 and a standard deviation of 5 to the LR depth images for the noisy experiment and retrain the network model. We select Bicubic, TGV [42], SRCNN [9], DGN [23], MFR-SR [24], and RDN-GDE [38] as comparison algorithms for noisy experiments. We list the RMSE results with upsampling factors of 2, 4, and 8 in the Middlebury datasets in Table 4, where the optimal results are marked in bold font and the suboptimal results are marked with underlines. From Table 4, the TGV method [42] exploits the global information and can achieve better performance while addressing the noise. It can be observed that deep learning-based methods outperform traditional methods in suppressing noise. Among them, color-guided reconstruction methods such as MFR-SR [24] and RDN-GDE [38] can obtain smaller RMSE values. This indicates that the guidance of color images is of great help to suppress image noise. Our method utilizes color-guided information that can not only effectively suppress noise but also restore clear and sharp edges, making the restored image closer to the ground truth.
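The noisy LR inputs can be generated as in the sketch below; clipping the result to an 8-bit depth range is our assumption.

import numpy as np

def add_depth_noise(depth_lr, std=5.0, seed=None):
    # zero-mean Gaussian noise with standard deviation 5, as in the noisy experiment
    rng = np.random.default_rng(seed)
    noisy = depth_lr.astype(np.float64) + rng.normal(0.0, std, size=depth_lr.shape)
    return np.clip(noisy, 0, 255)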

4.3. Ablation Study

To further evaluate the impact of each module on the network performance, we conduct an ablation analysis on the MSFO and AMBFF modules in the network. In this study, we replace the AMBFF module with a 1 × 1 convolution fusion operation and the MSFO module with a basic residual block to form a comparative baseline network. Table 5 presents the results of the ablation experiments for an upsampling factor of 8. From the table, it can be observed that the baseline network achieves relatively higher RMSE values, indicating poorer performance. When the MSFO module is added to the network, the multi-scale expression ability of the network is enhanced. By extracting multi-scale features, it can better adapt to the changes of details at different scales and enhance the network performance. With the addition of the AMBFF module, the network is able to leverage the attention mechanism to suppress unnecessary texture information from the color image. This helps address the issue of texture replication caused by incorrect guidance from the color image. The module facilitates better fusion of features between the color and depth images, enhancing the overall performance of the network. The effectiveness of MSFO and AMBFF modules in depth map super-resolution is fully demonstrated by ablation experiments.
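For reference, the two replacement blocks that form the baseline can be written as the following illustrative PyTorch modules (the channel width follows Section 4.1; the exact baseline blocks used in the experiments may differ).

import torch
import torch.nn as nn

class ConvFusion(nn.Module):                  # stands in for AMBFF in the baseline
    def __init__(self, feats=64):
        super().__init__()
        self.conv = nn.Conv2d(2 * feats, feats, 1)   # plain 1x1 convolution over concatenated features
    def forward(self, h_d, h_c):
        return self.conv(torch.cat([h_d, h_c], dim=1))

class BasicResBlock(nn.Module):               # stands in for MSFO in the baseline
    def __init__(self, feats=64):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(feats, feats, 3, padding=1), nn.PReLU(),
                                  nn.Conv2d(feats, feats, 3, padding=1))
    def forward(self, x):
        return self.body(x) + x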

5. Conclusions

In this paper, we propose a multi-channel progressive attention fusion network. The network fully exploits the multi-scale features of the HR guidance image and the LR depth image through the pyramid structure, and adaptively fuses these features to further improve the network performance. In order to alleviate the depth artifact problem caused by the texture inconsistency between the color image and the depth image, the network uses the multi-branch feature fusion module to effectively enhance the useful information and filter the useless information so that the edge of the restored HR depth image is smoother. In addition, a multi-scale feature optimization module is designed to optimize the features, which improves the multi-scale expression ability of the network and further integrates multi-scale features. Extensive experiments have demonstrated that our method outperforms most existing methods in both noise-free and noisy conditions. When using depth cameras to capture depth images, various issues such as surface reflections or exceeding the range can lead to missing regions in the depth image. Therefore, in future work, we plan to further explore methods for the super-resolution of depth images containing holes to improve the overall quality and integrity of HR depth images.

Author Contributions

Conceptualization, J.W.; methodology, J.W.; software, J.W.; validation, J.W.; formal analysis, J.W.; investigation, J.W.; resources, J.W.; data curation, J.W.; writing—original draft preparation, J.W.; writing—review and editing, J.W. and Q.H.; visualization, J.W. and Q.H.; supervision, Q.H.; project administration, Q.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

The study did not involve humans.

Data Availability Statement

Data sharing is not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.; Kohli, P.; Shotton, J.; Hodges, S.; Freeman, D.; Davison, A.; et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, 16–19 October 2011; pp. 559–568.
2. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154.
3. Sinha, G.; Shahi, R.; Shankar, M. Human computer interaction. In Proceedings of the 3rd International Conference on Emerging Trends in Engineering and Technology, Goa, India, 19–21 November 2010; pp. 1–4.
4. Keys, R. Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech Signal Process. 1981, 29, 1153–1160.
5. Kopf, J.; Cohen, M.F.; Lischinski, D.; Uyttendaele, M. Joint bilateral upsampling. ACM Trans. Graph. 2007, 26, 96-es.
6. Yang, S.; Cao, N.; Guo, B.; Li, G. Depth map super-resolution based on edge-guided joint trilateral upsampling. Vis. Comput. 2022, 38, 883–895.
7. Mac Aodha, O.; Campbell, N.D.; Nair, A.; Brostow, G.J. Patch based synthesis for single depth image super-resolution. In Proceedings of the Computer Vision—ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part III 12. Springer: Berlin/Heidelberg, Germany, 2012; pp. 71–84.
8. Li, Y.; Min, D.; Do, M.N.; Lu, J. Fast guided global interpolation for depth and motion. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 717–733.
9. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
10. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 517–532.
11. Qin, J.; Huang, Y.; Wen, W. Multi-scale feature fusion residual network for single image super-resolution. Neurocomputing 2020, 379, 334–342.
12. Mei, Y.; Fan, Y.; Zhou, Y.; Huang, L.; Huang, T.S.; Shi, H. Image super-resolution with cross-scale non-local attention and exhaustive self-exemplars mining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5690–5699.
13. Song, X.; Dai, Y.; Qin, X. Deeply supervised depth map super-resolution as novel view synthesis. IEEE Trans. Circuits Syst. Video Technol. 2018, 29, 2323–2336.
14. Huang, L.; Zhang, J.; Zuo, Y.; Wu, Q. Pyramid-structured depth map super-resolution based on deep dense-residual network. IEEE Signal Process. Lett. 2019, 26, 1723–1727.
15. Xian, C.; Qian, K.; Zhang, Z.; Wang, C.C. Multi-scale progressive fusion learning for depth map super-resolution. arXiv 2020, arXiv:2011.11865.
16. Xie, J.; Feris, R.S.; Sun, M.T. Edge-guided single depth image super resolution. IEEE Trans. Image Process. 2015, 25, 428–438.
17. Zhao, L.; Bai, H.; Liang, J.; Wang, A.; Zhao, Y. Single depth image super-resolution with multiple residual dictionary learning and refinement. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 739–744.
18. Zhou, D.; Wang, R.; Lu, J.; Zhang, Q. Depth image super resolution based on edge-guided method. Appl. Sci. 2018, 8, 298.
19. Song, X.; Dai, Y.; Zhou, D.; Liu, L.; Li, W.; Li, H.; Yang, R. Channel attention based iterative residual learning for depth map super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5631–5640.
20. Ye, X.; Sun, B.; Wang, Z.; Yang, J.; Xu, R.; Li, H.; Li, B. Depth super-resolution via deep controllable slicing network. In Proceedings of the 28th ACM International Conference on Multimedia, Virtual Event, 12–16 October 2020; pp. 1809–1818.
21. Diebel, J.; Thrun, S. An application of Markov random fields to range sensing. Adv. Neural Inf. Process. Syst. 2005, 18, 291–298.
22. Wang, Z.; Ye, X.; Sun, B.; Yang, J.; Xu, R.; Li, H. Depth upsampling based on deep edge-aware learning. Pattern Recognit. 2020, 103, 107274.
23. Hui, T.W.; Loy, C.C.; Tang, X. Depth map super-resolution by deep multi-scale guidance. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 353–369.
24. Zuo, Y.; Wu, Q.; Fang, Y.; An, P.; Huang, L.; Chen, Z. Multi-scale frequency reconstruction for guided depth map super-resolution via deep residual network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 297–306.
25. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Chen, Z.; Ji, X. High-resolution depth maps imaging via attention-based hierarchical multi-modal fusion. IEEE Trans. Image Process. 2021, 31, 648–663.
26. Chen, C.; Lin, Z.; She, H.; Huang, Y.; Liu, H.; Wang, Q.; Xie, S. Color image-guided very low-resolution depth image reconstruction. Signal Image Video Process. 2023, 17, 2111–2120.
27. Guo, J.; Xiong, R.; Ou, Y.; Wang, L.; Liu, C. Depth image super-resolution via two-branch network. In Proceedings of the Cognitive Systems and Information Processing: 6th International Conference, ICCSIP 2021, Suzhou, China, 20–21 November 2021; Revised Selected Papers 6. Springer: Berlin/Heidelberg, Germany, 2022; pp. 200–212.
28. Sun, B.; Ye, X.; Li, B.; Li, H.; Wang, Z.; Xu, R. Learning scene structure guidance via cross-task knowledge transfer for single depth super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7792–7801.
29. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
31. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
32. Scharstein, D.; Szeliski, R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002, 47, 7–42.
33. Butler, D.J.; Wulff, J.; Stanley, G.B.; Black, M.J. A naturalistic open source movie for optical flow evaluation. In Proceedings of the Computer Vision—ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part VI 12. Springer: Berlin/Heidelberg, Germany, 2012; pp. 611–625.
34. Kiechle, M.; Hawe, S.; Kleinsteuber, M. A joint intensity and depth co-sparse analysis model for depth map super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1545–1552.
35. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep Laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 624–632.
36. Yang, J.; Ye, X.; Li, K.; Hou, C.; Wang, Y. Color-guided depth recovery from RGB-D data using an adaptive autoregressive model. IEEE Trans. Image Process. 2014, 23, 3443–3458.
37. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1646–1654.
38. Zuo, Y.; Fang, Y.; Yang, Y.; Shang, X.; Wang, B. Residual dense network for intensity-guided depth map enhancement. Inf. Sci. 2019, 495, 52–64.
39. Bansal, A.; Jonna, S.; Sahay, R.R. PAG-Net: Progressive attention guided depth super-resolution network. arXiv 2019, arXiv:1911.09878.
40. Ye, X.; Sun, B.; Wang, Z.; Yang, J.; Xu, R.; Li, H.; Li, B. PMBANet: Progressive multi-branch aggregation network for scene depth super-resolution. IEEE Trans. Image Process. 2020, 29, 7427–7442.
41. Liu, P.; Zhang, Z.; Meng, Z.; Gao, N.; Wang, C. PDR-Net: Progressive depth reconstruction network for color guided depth map super-resolution. Neurocomputing 2022, 479, 75–88.
42. Ferstl, D.; Reinbacher, C.; Ranftl, R.; Rüther, M.; Bischof, H. Image guided depth upsampling using anisotropic total generalized variation. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 993–1000.
Figure 1. Multi-scale progressive attention fusion network with an upsampling factor of 8. The network consists of two main branches: the color-guided branch and the depth fusion and upsampling branch. The AMBFF module combines the color image features and the depth image features in an adaptive manner, taking into account their respective importance and relevance to the depth map reconstruction task. The MSFO module is used to capture the multi-scale information of feature maps and enhance the multi-scale representation ability of the network.
Figure 2. Attention-based multi-branch feature fusion module. This module consists of two components: feature enhancement and attention fusion. The feature enhancement component helps the network better understand image features and improves its expressive power. The attention fusion component alleviates the negative effects caused by depth replication, allowing the network to focus on important features.
Figure 3. Multi-scale feature optimization module. The multi-scale features of the feature map are extracted through the interweaving and fusion of different paths.
Figure 4. The visual comparison of the ×8 upsampled depth images without noise on the Middlebury datasets: (a) Ground Truth; (b) Bicubic; (c) LapSRN [35]; (d) SRCNN [9]; (e) MSG-Net [23]; (f) MFR-SR [24]; (g) proposed method.
Table 1. Quantitative Analysis of Noise-Free DSR Results (RMSE) on Dataset A.
Method        | 2×: Cones / Teddy / Tsukuba / Venus | 4×: Cones / Teddy / Tsukuba / Venus | 8×: Cones / Teddy / Tsukuba / Venus
Bicubic       | 2.51 / 1.93 / 5.82 / 1.31 | 3.52 / 2.86 / 8.67 / 1.91 | 5.36 / 4.04 / 12.31 / 2.76
JID [34]      | 1.58 / 1.27 / 3.83 / 0.75 | 3.23 / 1.98 / 6.13 / 1.02 | 4.92 / 2.94 / 9.57 / 1.37
SRCNN [9]     | 1.76 / 1.35 / 3.82 / 0.84 | 2.98 / 1.94 / 5.91 / 1.11 | 4.94 / 2.92 / 9.15 / 1.45
LapSRN [35]   | 0.98 / 0.97 / 1.72 / 0.72 | 2.85 / 1.68 / 5.34 / 0.77 | 4.67 / 2.96 / 8.94 / 1.34
MSG-Net [23]  | 1.13 / 0.99 / 2.20 / 0.59 | 2.95 / 1.78 / 5.21 / 0.78 | 5.23 / 3.18 / 10.25 / 1.18
MFR-SR [24]   | 1.27 / 1.08 / 2.65 / 0.54 | 2.43 / 1.75 / 5.63 / 0.85 | 4.73 / 2.49 / 8.03 / 1.25
RDN-GDE [38]  | 0.88 / 0.85 / 1.41 / 0.56 | 2.38 / 1.58 / 3.73 / 0.73 | 4.66 / 2.88 / 7.79 / 1.09
DSDMSR [13]   | 1.04 / 0.83 / 1.80 / 0.53 | 2.19 / 1.54 / 4.15 / 0.78 | 4.70 / 3.06 / 7.43 / 1.28
Ours          | 0.69 / 0.74 / 1.13 / 0.39 | 1.72 / 1.26 / 2.94 / 0.53 | 4.28 / 2.12 / 7.11 / 0.76
Table 2. Quantitative Analysis of Noise-Free DSR Results (RMSE) on Dataset B.
Method        | Art: 2× / 4× / 8× / 16× | Books: 2× / 4× / 8× / 16× | Moebius: 2× / 4× / 8× / 16×
Bicubic       | 2.57 / 3.85 / 5.53 / 8.37 | 1.01 / 1.58 / 2.27 / 3.36 | 0.91 / 1.35 / 2.06 / 2.97
JID [34]      | 1.16 / 1.84 / 2.77 / 10.86 | 0.59 / 0.93 / 1.14 / 9.15 | 0.61 / 0.87 / 1.37 / 10.3
SRCNN [9]     | 0.98 / 2.09 / 4.75 / 7.81 | 0.39 / 0.94 / 2.15 / 3.24 | 0.45 / 0.97 / 2.01 / 2.82
LapSRN [35]   | 0.88 / 1.79 / 2.73 / 6.31 | 0.78 / 0.94 / 1.29 / 2.35 | 0.77 / 0.95 / 1.33 / 2.37
AR [36]       | 3.07 / 3.99 / 4.68 / 6.87 | 1.38 / 1.94 / 2.05 / 2.84 | 0.98 / 1.23 / 1.73 / 2.56
VDSR [37]     | 1.32 / 2.07 / 3.24 / 6.66 | 0.48 / 0.83 / 1.72 / 2.14 | 0.54 / 0.91 / 1.57 / 2.15
MSG-Net [23]  | 0.66 / 1.47 / 3.01 / 4.57 | 0.37 / 0.67 / 1.03 / 1.80 | 0.60 / 0.76 / 1.29 / 1.63
MFR-SR [24]   | 0.71 / 1.54 / 2.71 / 4.35 | 0.42 / 0.63 / 1.05 / 1.78 | 0.42 / 0.72 / 1.10 / 1.73
DRN [15]      | 0.66 / 1.59 / 2.57 / 4.83 | 0.54 / 0.83 / 1.19 / 1.70 | 0.52 / 0.86 / 1.21 / 1.87
RDN-GDE [38]  | 0.56 / 1.47 / 2.60 / 4.16 | 0.36 / 0.62 / 1.00 / 1.68 | 0.38 / 0.69 / 1.06 / 1.65
PAG-Net [39]  | 0.33 / 1.15 / 2.08 / 3.68 | 0.26 / 0.46 / 0.81 / 1.38 | - / - / - / -
PMBANet [40]  | 0.61 / 1.42 / 1.92 / 2.44 | 0.41 / 0.95 / 1.10 / 1.58 | 0.39 / 0.91 / 1.23 / 1.58
PDR-Net [41]  | - / 1.63 / 1.92 / 2.37 | - / 1.10 / 1.35 / 1.66 | - / 1.03 / 1.17 / 1.54
Ours          | 0.27 / 1.07 / 1.87 / 2.53 | 0.21 / 0.39 / 0.76 / 1.24 | 0.22 / 0.52 / 0.87 / 1.57
Table 3. Quantitative Analysis of Noise-Free DSR Results (RMSE) on Dataset C.
Method        | Reindeer: 2× / 4× / 8× / 16× | Laundry: 2× / 4× / 8× / 16× | Dolls: 2× / 4× / 8× / 16×
Bicubic       | 1.92 / 2.80 / 3.99 / 5.86 | 1.60 / 2.40 / 3.45 / 5.07 | 0.89 / 1.29 / 1.95 / 2.62
JID [34]      | 0.91 / 1.47 / 2.19 / 4.15 | 0.72 / 1.19 / 1.77 / 3.47 | 0.73 / 0.96 / 1.26 / 2.06
SRCNN [9]     | 0.81 / 1.87 / 3.87 / 5.64 | 0.67 / 1.74 / 3.45 / 5.04 | 0.61 / 0.95 / 1.52 / 2.54
LapSRN [35]   | 0.80 / 1.31 / 1.92 / 4.56 | 0.78 / 1.12 / 1.67 / 3.79 | 0.77 / 0.98 / 1.42 / 2.28
AR [36]       | 2.99 / 3.09 / 4.33 / 4.99 | 2.39 / 2.43 / 3.01 / 4.47 | 1.01 / 1.23 / 1.65 / 2.23
VDSR [37]     | 1.01 / 1.50 / 2.28 / 4.17 | 0.72 / 1.19 / 1.59 / 3.20 | 0.63 / 0.89 / 1.31 / 2.09
MSG-Net [23]  | 0.66 / 1.47 / 2.46 / 4.57 | 0.79 / 1.12 / 1.51 / 4.26 | 0.68 / 0.92 / 1.47 / 1.86
MFR-SR [24]   | 0.65 / 1.23 / 2.06 / 3.74 | 0.61 / 1.11 / 1.75 / 3.01 | 0.60 / 0.89 / 1.22 / 1.74
DRN [15]      | 0.59 / 1.11 / 1.80 / 3.11 | 0.52 / 0.92 / 1.52 / 2.97 | 0.58 / 0.91 / 1.31 / 1.87
RDN-GDE [38]  | 0.51 / 1.17 / 2.05 / 3.58 | 0.48 / 0.96 / 1.63 / 2.86 | 0.56 / 0.88 / 1.21 / 1.71
PAG-Net [39]  | 0.31 / 0.85 / 1.46 / 2.52 | 0.30 / 0.71 / 1.27 / 1.88 | - / - / - / -
PMBANet [40]  | 0.41 / 1.17 / 1.74 / 2.17 | 0.38 / 0.99 / 1.54 / 1.78 | 0.36 / 0.96 / 1.24 / 1.42
PDR-Net [41]  | - / 1.44 / 1.67 / 1.99 | - / 1.29 / 1.52 / 1.90 | - / 1.03 / 1.22 / 1.45
Ours          | 0.27 / 0.79 / 1.38 / 2.17 | 0.25 / 0.73 / 1.21 / 1.74 | 0.37 / 0.84 / 1.15 / 1.53
Table 4. Quantitative Analysis of Noisy DSR Results (RMSE) on Dataset A.
Method        | 2×: Cones / Teddy / Tsukuba / Venus | 4×: Cones / Teddy / Tsukuba / Venus | 8×: Cones / Teddy / Tsukuba / Venus
Bicubic       | 4.81 / 4.51 / 7.07 / 4.25 | 5.60 / 5.02 / 9.53 / 4.52 | 6.88 / 5.70 / 13.06 / 4.91
TGV [42]      | 3.36 / 2.67 / 7.20 / 1.58 | 4.31 / 3.20 / 10.10 / 1.91 | 7.74 / 4.93 / 16.08 / 2.65
SRCNN [9]     | 2.32 / 2.10 / 4.49 / 1.28 | 4.10 / 3.32 / 7.88 / 2.69 | 6.35 / 5.32 / 12.13 / 3.94
DGN [23]      | 2.29 / 2.14 / 4.34 / 1.55 | 3.57 / 3.20 / 7.90 / 2.77 | 5.45 / 4.62 / 11.09 / 3.95
MFR-SR [24]   | 2.09 / 1.98 / 3.59 / 1.25 | 3.31 / 2.72 / 6.63 / 1.57 | 5.01 / 3.79 / 9.95 / 1.99
RDN-GDE [38]  | 2.02 / 1.97 / 3.70 / 1.11 | 3.13 / 2.82 / 6.56 / 1.59 | 4.84 / 3.93 / 9.78 / 2.46
Ours          | 1.85 / 1.69 / 3.48 / 0.97 | 2.88 / 2.49 / 6.37 / 1.34 | 4.67 / 3.52 / 9.51 / 1.57
Table 5. RMSE value of ablation study.
MSFO | AMBFF | RMSE
No   | No    | 2.57
No   | Yes   | 2.24
Yes  | No    | 2.32
Yes  | Yes   | 2.15
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
