Article

IGAF: Incremental Guided Attention Fusion for Depth Super-Resolution

1 School of Physics and Astronomy, University of Glasgow, Glasgow G12 8QQ, UK
2 School of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK
* Author to whom correspondence should be addressed.
Sensors 2025, 25(1), 24; https://doi.org/10.3390/s25010024
Submission received: 21 October 2024 / Revised: 27 November 2024 / Accepted: 10 December 2024 / Published: 24 December 2024
(This article belongs to the Special Issue Convolutional Neural Network Technology for 3D Imaging and Sensing)

Abstract

Accurate depth estimation is crucial for many fields, including robotics, navigation, and medical imaging. However, conventional depth sensors often produce low-resolution (LR) depth maps, making detailed scene perception challenging. To address this, enhancing LR depth maps to high-resolution (HR) ones has become essential, guided by HR-structured inputs like RGB or grayscale images. We propose a novel sensor fusion methodology for guided depth super-resolution (GDSR), a technique that combines LR depth maps with HR images to estimate detailed HR depth maps. Our key contribution is the Incremental Guided Attention Fusion (IGAF) module, which effectively learns to fuse features from RGB images and LR depth maps, producing accurate HR depth maps. Using IGAF, we build a robust super-resolution model and evaluate it on multiple benchmark datasets. Our model achieves state-of-the-art results compared to all baseline models on the NYU v2 dataset for ×4, ×8, and ×16 upsampling. It also outperforms all baselines in a zero-shot setting on the Middlebury, Lu, and RGB-D-D datasets. Code, environments, and models are available on GitHub.

1. Introduction

Accurate and useful visual perception is conventionally achieved using RGB and depth sensors. Depth sensors, owing to their small form factor, low cost, and low power consumption, are very popular in many fields of research such as robotics [1,2,3], medical imaging [4,5], augmented reality, and consumer electronics. However, they typically have lower spatial resolution than conventional imaging modalities such as RGB, leading to information loss that can be overcome with accurate super-resolution techniques. To achieve this high resolution, existing techniques leverage correlations between the sharp high-frequency texture edges of RGB images and the low-resolution edge discontinuities of depth images. Typical super-resolution solutions often prove inadequate for depth super-resolution (DSR) because of their limited ability to incorporate the unique characteristics and complexities inherent in depth data [6]. To this end, the task of DSR is to provide solutions aimed at optimizing the super-resolution of lower-resolution depth maps.
The depth super-resolution literature can be broadly categorized into three approaches: filtering, optimization, and learning-driven strategies. Filtering-driven DSR [7,8,9,10,11,12] relies on constructing filters from neighboring pixels in the LR depth map and the HR guidance image. A downside of this approach is the creation of artifacts and errors when the scenes to be super-resolved are complex. Optimization-driven DSR [13,14,15,16] converts super-resolution into an optimization problem in which a cost function between the LR depth map and the HR depth map is minimized. This approach relies on selecting an appropriate cost function and is highly sensitive to that choice. Finally, learning-driven approaches [17,18,19,20] have in recent years made use of deep learning techniques, which have quickly become the de facto solution in the field of DSR.
In GDSR, most approaches rely on fusion techniques after the feature extraction stage. The unique per-modality features are fused together to create an HR depth map. Both strong feature extractors and strong fusion modules are important for this task, as inadequate feature extractors fail to provide the unique, crucial features needed for sharp reconstructions. The RGB image is responsible for providing structure so that the end result does not suffer from depth bleeding. This task is not trivial, as over-transfer may occur and non-relevant structures from the RGB image can be transferred to the depth map. An example is an image printed on the cover of a book, where the network should be able to identify textures that are irrelevant to depth reconstruction. Equally important to the feature extractors are the fusion modules that combine the two branches and refine the available information to estimate clear depths.
Existing works face two major limitations: (1) weak feature extractors that fail to capture the distinct and complementary characteristics of RGB and depth modalities, and (2) naive fusion strategies, such as simple concatenation or addition, which result in modality-specific artifacts, including over-transfer of irrelevant RGB features and insufficient depth-specific enhancement. These issues often lead to blurring, depth bleeding, and misaligned structures in the final depth maps. Our approach (see Figure 1 for an overview), IGAF, systematically addresses these challenges (the code is available at https://github.com/Thanos-DB/IncrementalGuidedAttentionFusion, accessed on 9 December 2024). First, it introduces the filtered wide-focus (FWF) block to extract per-modality features with greater sensitivity to spatial textures and channel importance. Second, it employs an Incremental Guided Attention Fusion (IGAF) module that refines features iteratively. Unlike previous methods, which fuse features in a single stage, IGAF leverages cross-modal attention to ensure that only relevant information is transferred while noise is suppressed. By iteratively fusing and refining features, our method minimizes artifacts, resulting in sharper, more accurate depth maps. To formalize, our contributions are as follows:
  • We propose the Incremental Guided Attention Fusion (IGAF) model, which surpasses existing works for the task of DSR on all tested benchmark datasets and at all tested benchmark resolutions.
  • We propose the IGAF module, a flexible and adaptive attention fusion strategy that effectively fuses multi-modal features by creating weights from both modalities and then applying a two-step cross fusion.
  • We propose the filtered wide-focus (FWF) block, a strong feature extractor composed of two modules, the feature extractor (FE) and wide-focus (WF). The FE uses channel attention to highlight relevant feature channels in feature volumes, while the WF, using a different dilation rate in the convolution layers of each branch, creates multi-receptive-field spatial information that integrates global-resolution features through dynamic receptive fields to better highlight textures and edges. The combination of the two forms a general-purpose feature extractor specifically tailored towards DSR.

2. Literature Review

Depth Super-Resolution Architectures. DSR techniques are broadly categorized into those that use RGB or grayscale images as guidance and those that do not. Non-guided DSR techniques [17,21,22] try to solve the task using only an LR depth map. This simplifies the data acquisition pipeline (syncing different modalities is not required, leading to smaller datasets) and the model itself, as no sensor fusion is needed for an additional stream. This simplicity in the data processing, however, comes at the cost of smoothed edges, especially on the contours of objects, as well as blurring and distortion effects in the super-resolved depth maps.
GDSR techniques propose a fix to over-smoothed edges by using structural and textural information from RGB or grayscale images. Additional techniques are needed to prevent the over-transfer of information from the guidance stream and to only retain the features that are relevant. Ref. [23] propose a fast model utilizing the high-frequency information of the guidance RGB stream using octave convolutions, but fuse the information from the two branches by a simple concatenation. Ref. [24], on the other hand, propose to fuse information between the two modalities through a symmetric uncertainty incorporated into their system. Ref. [25] use a joint implicit function representation to learn the interpolation weights and values for the HR depth simultaneously. Ref. [26] employ knowledge distillation such that the guidance stream is only needed during training while simplifying the model during the test phase. Ref. [27] utilize bridges to fuse information during multi-task learning. The two tasks in their system are depth super-resolution and monocular depth estimation. Additional novel techniques include [26,28,29].
Attention Feature Fusion in Depth Super-Resolution. Feature fusion techniques are crucial for multi-modal data processing. They range from a simple addition or concatenation of multiple features to complex feature processing modules. Ref. [30] place a hierarchical attention fusion module in the generator of a generative adversarial network for this task. Ref. [31] employ an attention fusion strategy to adaptively utilize information from both modalities by first enhancing the features and then using an attention mechanism to fuse the two branches. Ref. [32] also propose a two-step approach in which a weighted attention fusion, followed by a high-frequency reconstruction, generates the resulting high-resolution depth image. Ref. [20] use channel attention combined with reconstruction in their proposed module, whilst [33] have a fusion module consisting of feature enhancement and feature re-calibration steps.
Existing works fail to effectively leverage both modalities to create fusion weights that accurately propagate relevant features. This limitation often results in the over-transfer of RGB features or insufficient depth-specific enhancements, leading to artifacts such as depth bleeding and texture misalignment. Our approach (see Figure 2) overcomes these drawbacks through a flexible and more powerful attention-based mechanism. By creating weights from one modality to iteratively guide the fusion with the other, we ensure that only the most relevant features are propagated. Unlike previous methods that rely on simple concatenation or addition for fusion, we introduce an Incremental Guided Attention Fusion (IGAF) module that performs cross-modal attention in iterative steps. This process eliminates the over-transfer of RGB features while emphasizing critical depth-specific information. Specifically, we carry this out by first creating a naive fusion of the RGB and depth modalities (element-wise multiplication), and then creating structural guidance for the depth modality by learning a set of attention weights from the naive fusion for the RGB image (the first spatial attention fusion (SAF) block; see Figure 3). We then use this intermediate fusion as structural guidance for the depth image to create a better fusion output than existing methods (the second SAF block).

3. Methodology

3.1. Problem Statement

Consider a dataset $\{G, L, H\}$, where $G$ represents the RGB images, $L$ the corresponding LR depth maps, and $H$ the corresponding HR depth maps. Each RGB image is $g_i \in \mathbb{R}^{3 \times sH \times sW}$, each LR depth map is $l_i \in \mathbb{R}^{1 \times H \times W}$, and each HR depth map is $h_i \in \mathbb{R}^{1 \times sH \times sW}$, where $H$ and $W$ are the spatial dimensions of the LR depth maps, the 1 in $l_i$ (and $h_i$) and the 3 in $g_i$ are the numbers of input channels, and $s$ is the scale factor between the HR and LR depth maps. The model estimates an HR depth map $\hat{H}$, with $\hat{h}_i \in \mathbb{R}^{1 \times sH \times sW}$, by first upsampling $L$ to $L_U$, where $l_{U_i} \in \mathbb{R}^{1 \times sH \times sW}$, using bicubic interpolation so that the dimensions of $G$ and $L_U$ match. A formal representation is:
$$\hat{H} = L_U + F(G, L_U; \theta), \qquad (1)$$
where $F(\cdot)$ is the learned function that maps $L_U$ and $G$ to the predicted HR depth map $\hat{H}$, and $\theta$ represents the learned parameters. The addition operation in Equation (1) is the global residual connection shown in Figure 2.
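To make Equation (1) concrete, here is a minimal PyTorch sketch of the input preparation and global residual prediction; `predict_hr_depth`, `model`, and the tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch.nn.functional as F

def predict_hr_depth(model, g, l, scale):
    # g: (B, 3, sH, sW) RGB guidance; l: (B, 1, H, W) LR depth map; model stands in for F(·; θ).
    # Bicubic upsampling of the LR depth map so its dimensions match the RGB guidance G.
    l_up = F.interpolate(l, scale_factor=scale, mode="bicubic", align_corners=False)
    # Global residual connection: the network predicts only the correction added to L_U.
    return l_up + model(g, l_up)
```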

3.2. Model Architecture

We follow the conventional architecture of a dual-stream model, as depicted in Figure 2. Our model has two inputs, one for the RGB guidance and the other for the upsampled LR depth map. First, each modality is processed by a convolutional layer followed by a LeakyReLU activation. This is followed by three IGAF modules, which extract and fuse the multi-modal features from the two input modalities. After the fusion modules, the depth is refined through our refinement block, and a global skip connection adds the upsampled LR depth map to the final feature representation to produce the final prediction. The predicted depth map $\hat{H}$ is calculated as
$$\hat{H} = L_U + \mathrm{Depth\_Refinement}\big(\mathrm{IGAF}(\mathrm{IGAF}(\mathrm{IGAF}(\alpha(\mathrm{Conv}(G)), \alpha(\mathrm{Conv}(L_U)))))\big), \qquad (2)$$
where $\alpha$ is the LeakyReLU activation. The depth refinement block consists of three feature extractor modules (see Section 3.2.1) and a convolution–LeakyReLU–convolution stack of layers.
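Read as code, the composition in Equation (2) could look roughly as follows; this is an illustrative sketch that assumes the `IGAFModule` and `FE` classes sketched in Section 3.2.1 below and a hypothetical channel width of 64, not the authors' released implementation.

```python
import torch.nn as nn

class IGAFNet(nn.Module):
    """Illustrative top-level model following Eq. (2); layer widths and kernel sizes are assumptions."""
    def __init__(self, channels=64):
        super().__init__()
        self.rgb_in = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.LeakyReLU())
        self.depth_in = nn.Sequential(nn.Conv2d(1, channels, 3, padding=1), nn.LeakyReLU())
        self.igaf = nn.ModuleList([IGAFModule(channels) for _ in range(3)])
        self.refine = nn.Sequential(
            FE(channels), FE(channels), FE(channels),              # three feature extractor modules
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, g, l_up):
        rgb, depth = self.rgb_in(g), self.depth_in(l_up)
        for block in self.igaf:
            rgb, depth = block(rgb, depth)   # each IGAF module returns both streams
        return l_up + self.refine(depth)     # global residual skip connection, Eq. (2)
```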

3.2.1. The IGAF Module

Each IGAF module processes two inputs and provides two outputs (Figure 3). For the last IGAF module, we only propagate the depth stream forward into the depth refinement block and ignore the second output.
At first, each stream passes through a two-piece feature extraction block, the FWF, consisting of a general feature extractor (FE) and a WF block. The FE processes the input using a convolution–LeakyReLU–convolution stack of layers. Next, a channel attention (CA) module focuses only on the relevant channels while reducing the influence of the less important or noisy ones. Finally, we employ an element-wise addition between the input of the module and the output of the CA module, followed by another convolutional layer and a skip connection that is global within the FE module, which propagates the global structure of the depth forward through the model. For simplicity in the explanations and equations, we treat N in the figure below as 1, although N = 10 was used during training. The choice of the CA module was determined empirically over multiple training runs, alternating between channel attention, spatial attention, and a combination of both. For the CA module, we have:
$$F_{CA} = K \times \sigma\big(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(\mathrm{Avg.Pool}(K))))\big), \qquad (3)$$
where $F_{CA}$ represents the feature maps output by the CA module, $K$ is the input feature maps, and $\sigma$ is the sigmoid activation. The FE module is represented as:
$$F_{FE} = M + \mathrm{Conv}\big(M + \mathrm{CA}(\mathrm{Conv}(\alpha(\mathrm{Conv}(M))))\big), \qquad (4)$$
where $F_{FE}$ is the feature maps output by the FE module, $M$ is the input feature maps, and $\alpha$ is the LeakyReLU activation. See Figure 4.
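A minimal PyTorch sketch of Equations (3) and (4) is given below; the reduction ratio, kernel sizes, and class names are assumptions made for illustration rather than the authors' code.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as in Eq. (3); the reduction ratio is an assumed hyperparameter."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, k):
        # F_CA = K × σ(Conv(ReLU(Conv(Avg.Pool(K)))))
        return k * self.mlp(self.pool(k))

class FE(nn.Module):
    """Feature extractor of Eq. (4): conv–LeakyReLU–conv, channel attention, and two skip connections."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.ca = ChannelAttention(channels)
        self.tail = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, m):
        # F_FE = M + Conv(M + CA(Conv(α(Conv(M)))))
        return m + self.tail(m + self.ca(self.body(m)))
```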
WF is an efficient feature extractor first introduced by [34] for medical image segmentation and has shown great promise in extracting multi-scale features from feature representations. It contains three branches, each with a different dilation rate for the convolution kernels, followed by an activation layer and a dropout layer to prevent overfitting. After the element-wise addition of the branches, another convolutional layer aggregates the features extracted by the gradually increased receptive fields of the dilated convolution layers; this is again followed by an activation layer and a dropout layer. A WF block is used after every FE module to aggregate the multi-resolution hierarchical features extracted in each layer. After the FE module, the RGB stream uses a skip connection to propagate the extracted features to the next IGAF module, further propagating the global scene structure forward within the model. We observed in our experiments that not placing a skip connection after WF or at later stages is an effective strategy for learning better scene structure, as forwarding shallower features without such a skip connection helps propagate high-frequency structure through the model; this can be verified through our ablations in Section 6.
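Under the same caveats, the WF block might be sketched as follows; the specific dilation rates (1, 2, 3), the GELU activation, and the dropout rate are assumed values chosen to illustrate the multi-receptive-field design.

```python
import torch.nn as nn

class WideFocus(nn.Module):
    """Wide-focus sketch: three parallel dilated convolutions, summed and aggregated by a final conv."""
    def __init__(self, channels, p_drop=0.1):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.GELU(),
                nn.Dropout2d(p_drop),
            )
            for d in (1, 2, 3)   # linearly increasing dilation rates, one per branch
        ])
        self.aggregate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Dropout2d(p_drop),
        )

    def forward(self, x):
        # Element-wise addition of the multi-receptive-field branches, then aggregation.
        return self.aggregate(sum(branch(x) for branch in self.branches))
```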
The first fusion in the IGAF module (see Figure 3) is an element-wise multiplication of the two modalities. Its result (a) creates intermediate feature weights $w_b$ and (b) is used in an element-wise weighted addition between the two intermediate features of the IGAF block. Similarly, the extracted features from the RGB stream (a) create intermediate feature weights $w_a$ and (b) are the second component of the weighted addition. The addition can be seen as adding features from a joint representation of the RGB and LR images, weighted by their common features. This helps the model focus on both high-level semantic structures in the image, through the depth features, and high-frequency features from the RGB images, with the weights learned via backpropagation. For each component, weights are extracted and applied in a crosswise fashion, i.e., weights from one component are applied to the other component, resulting in a spatial attention fusion (SAF) block. This allows the model to learn features across both modalities and to limit their influence on the output, resulting in a smoother output depth map. The weights are learnable and created via two-layer MLPs. A formalized expression of the SAF block is:
$$y_A = (x_A A_{A_1}^{T} + b_{A_1}) A_{A_2}^{T} + b_{A_2}, \qquad (5)$$
$$y_B = (x_B A_{B_1}^{T} + b_{B_1}) A_{B_2}^{T} + b_{B_2}, \qquad (6)$$
$$y = x_A \, \sigma(y_B) + x_B \, \sigma(y_A), \qquad (7)$$
where $x_A$ and $x_B$ are the two inputs of the block, $A$ represents the weights, $b$ the bias terms, and $\sigma$ the sigmoid activation. Equations (5) and (6) show the two MLP layers used for weight creation, which are applied cross-wise with the inputs as shown in Equation (7).
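Equations (5)–(7) could be implemented along the following lines; applying the two-layer MLPs over the channel dimension at each spatial location is our assumption, and the class name is hypothetical.

```python
import torch
import torch.nn as nn

class SAF(nn.Module):
    """Spatial attention fusion of Eqs. (5)-(7): each input gets its own two-layer MLP,
    and the sigmoid-gated weights are applied cross-wise to the other input."""
    def __init__(self, channels):
        super().__init__()
        self.mlp_a = nn.Sequential(nn.Linear(channels, channels), nn.Linear(channels, channels))
        self.mlp_b = nn.Sequential(nn.Linear(channels, channels), nn.Linear(channels, channels))

    def forward(self, x_a, x_b):
        # Feature maps (B, C, H, W) are moved to channels-last so the MLPs act per spatial location.
        a = x_a.permute(0, 2, 3, 1)
        b = x_b.permute(0, 2, 3, 1)
        y_a = self.mlp_a(a)                                     # Eq. (5)
        y_b = self.mlp_b(b)                                     # Eq. (6)
        y = a * torch.sigmoid(y_b) + b * torch.sigmoid(y_a)     # Eq. (7): cross-wise gating
        return y.permute(0, 3, 1, 2)
```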
The output of the first SAF block passes through a convolutional layer for joint feature processing and is then used as input to the second SAF block. This convolutional layer extracts shared features from the two fused modalities. The first SAF block fuses the extracted features of the RGB stream with the naive feature fusion obtained by the element-wise multiplication. The second SAF block then fuses the output of the first SAF block, after the convolutional layer, with the output of the FE module of the depth stream. This fusion is incremental in nature, as we iteratively combine RGB and depth features in multiple steps to create a cross-modal fusion of attributes, allowing structure and depth to be processed simultaneously.
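Putting the pieces together, one IGAF module could be sketched as below, reusing the FE, WideFocus, and SAF sketches above; the exact wiring (e.g., which RGB feature is forwarded to the next module) is our reading of Figure 3 and may differ from the released code.

```python
import torch.nn as nn

class IGAFModule(nn.Module):
    """Sketch of one IGAF module (Figure 3); relies on the FE, WideFocus, and SAF sketches above."""
    def __init__(self, channels):
        super().__init__()
        self.fe_rgb, self.wf_rgb = FE(channels), WideFocus(channels)
        self.fe_depth, self.wf_depth = FE(channels), WideFocus(channels)
        self.saf1 = SAF(channels)      # fuses the naive fusion with the RGB features
        self.saf2 = SAF(channels)      # fuses the structural guidance with the depth features
        self.joint_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, rgb, depth):
        fe_rgb = self.fe_rgb(rgb)                    # FE features of the RGB stream (skipped forward)
        fe_depth = self.fe_depth(depth)
        f_rgb = self.wf_rgb(fe_rgb)
        f_depth = self.wf_depth(fe_depth)
        naive = f_rgb * f_depth                      # naive fusion: element-wise multiplication
        guidance = self.joint_conv(self.saf1(naive, f_rgb))   # first SAF + joint convolution
        fused = self.saf2(guidance, fe_depth)        # second, incremental SAF fusion with the depth stream
        return fe_rgb, fused
```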

4. Experiments

We test our model on four benchmark datasets widely used to compare proposed models for the DSR task: the NYU v2 [35], Middlebury [36,37], Lu [38], and RGB-D-D [23] datasets.
We train IGAF only on the NYU v2 dataset and do not fine-tune the model further on the others; our results on the remaining datasets are zero-shot predictions that demonstrate the generalization ability of our model. The NYU v2 dataset contains 1449 pairs of RGB and depth images. The first 1000 images are used to train the model and the remaining 449 are used for evaluation. For the Middlebury dataset, we use the provided 30 RGB and depth image pairs, and for the Lu dataset we use the 6 pairs, following previous works [23,24,25,38] so that our results are consistent with other proposed works. For RGB-D-D, we use 405 RGB and depth image pairs, following [23].
Implementation Details: We run all our experiments on one RTX 3090 GPU using PyTorch 2.0.1. The initial learning rate is set to 0.00025 and is halved by the MultiStepLR scheduler at each milestone. The milestones are set every 25 epochs, with the last one at epoch 150 out of a total of 200 epochs. A batch size of 1 is used to train the model. We use the Adam optimizer and report all results using the Root Mean Square Error (RMSE). We train our model with the $L_1$ loss:
$$\mathcal{L}_1 = \frac{1}{N} \sum_{i=1}^{N} \left| h_i - \hat{h}_i \right|, \qquad (8)$$
where $N$ is the number of pixels, $h_i$ is the ground-truth depth map, and $\hat{h}_i$ is the predicted depth map.
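For concreteness, the loss, metric, and optimizer setup described above might look as follows; the exact milestone list is inferred from the text, and `IGAFNet` refers to the earlier top-level sketch rather than the released implementation.

```python
import torch

def l1_loss(pred, target):
    # Eq. (8): mean absolute error over the N pixels of the depth map.
    return torch.mean(torch.abs(target - pred))

def rmse(pred, target):
    # Root Mean Square Error, the metric reported in all results tables.
    return torch.sqrt(torch.mean((target - pred) ** 2))

# Adam with an initial learning rate of 0.00025, halved at every milestone
# (assumed here to be every 25 epochs up to epoch 150, out of 200 total epochs).
model = IGAFNet()
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[25, 50, 75, 100, 125, 150], gamma=0.5)
```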
During training, we use randomly cropped 256 × 256 patches of the HR image. The LR depth maps are simulated by bicubic downsampling, which is consistent with other approaches using the same datasets. Additionally, we evaluate our model on the “real-world manner” RGB-D-D dataset, where both the HR and LR depth maps are provided; the LR depth maps are 192 × 144 and the HR depth maps are 512 × 384.
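A hypothetical data-preparation helper matching this description is sketched below; the function name and tensor layout are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_training_pair(hr_depth, rgb, scale, patch=256):
    # hr_depth: (1, H, W) tensor, rgb: (3, H, W) tensor; returns an aligned training triplet.
    _, h, w = hr_depth.shape
    top = torch.randint(0, h - patch + 1, (1,)).item()
    left = torch.randint(0, w - patch + 1, (1,)).item()
    hr_crop = hr_depth[:, top:top + patch, left:left + patch]    # random 256 × 256 HR crop
    rgb_crop = rgb[:, top:top + patch, left:left + patch]
    # Simulate the LR depth map by bicubic downsampling, consistent with prior work.
    lr = F.interpolate(hr_crop.unsqueeze(0), scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False).squeeze(0)
    return rgb_crop, lr, hr_crop
```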

5. Results

Our model achieves state-of-the-art (SOTA) results on all benchmark test datasets compared to the baselines, demonstrating its ability to super-resolve various depth resolutions as well as its generalization capabilities across multiple datasets. Table 1, Table 2, Table 3, Table 4 and Table 5 show a quantitative comparison between our model and previous works. The evaluation is carried out using the RMSE metric, with the best performance achieved by IGAF in every case. Figure 5 shows a qualitative comparison, on the NYU v2 dataset, between our model and SUFT [24]. The visualizations show how our proposed attention fusion alleviates problems such as bleeding and blurring that occur in previous SOTA models. This happens because IGAF iteratively refines features by leveraging structural guidance from RGB and selectively emphasizing depth-specific details, which minimizes the over-transfer of irrelevant RGB features, a key cause of blurring and bleeding in prior methods. The SAF blocks incrementally learn attention weights, ensuring sharper edges and reducing distortions.

6. Ablation Study

We run ablations on NYU v2 for the ×4 DSR scenario. We study the effects of addition and concatenation as fusion strategies by replacing IGAF in our model with these two naive approaches. Table 6 shows that our empirically designed module outperforms them both, as expected.
We also study the effects of different settings of the IGAF module. The tested settings are (1) skip connections after the WF modules to propagate deeper features, and not between the FE and WF modules, (2) an additional IGAF module (four in total in the ablation model), (3) the IGAF module without MLP layers, i.e., the element-wise additions are not weighted, (4) MLP layers consisting of only one dense layer instead of two, and lastly (5) removing the WF module. Table 7 shows the importance of each component empirically.
We note that relocating the skip connection is not a good choice, as the shallower high-frequency features it would otherwise propagate carry more spatial information. The additional module also does not improve performance, as it increases the number of parameters and the larger model tends to overfit the training data. Keeping the weights of the addition improves performance, as the two parts are combined dynamically once the model has learned which features of each modality are important. Reducing the MLP to one layer weakens the approximation of the weights, which, combined with the previous ablation, further supports the use of a two-layer MLP. Lastly, without WF, we lose the dynamically increasing receptive fields that this module provides and, with them, the ability to capture multi-resolution features effectively.

7. Conclusions

Given the importance of depth perception across its various applications, the ability to estimate accurate, higher-resolution depth information is crucial. We proposed an incremental guided attention fusion model for depth super-resolution that uses structural guidance from the RGB modality to provide intermediate structure to the processed features in every layer of the model. This leads to HR output depth maps that are more accurate than those of existing methods, as well as free of blurring effects and distortions. Our model's main component, the IGAF module, performs a cross-modal attention fusion that fuses the RGB and depth modalities while simultaneously focusing on the important information in the intermediate fused features. We achieve state-of-the-art performance on four benchmark datasets against all evaluated baselines on our metrics, where the LR depth maps were downsampled from the HR ground truths. Specifically, we demonstrate the ability of our model to generate high-quality depth super-resolutions by training only on the NYU v2 dataset, and its ability to generalize in a zero-shot setting to the RGB-D-D, Lu and Middlebury datasets, which shows the robustness of our method. Additionally, on a fifth test set, where the LR and HR depth maps were collected using different sensors, mimicking a real-world scenario, we also demonstrate better results than all existing methods.

Author Contributions

Conceptualization, A.T., C.K., K.J.M., H.D., D.F. and R.M.-S.; methodology, A.T., K.J.M., H.D. and C.K.; software, A.T. and R.M.-S.; validation, A.T., K.J.M. and C.K.; formal analysis, C.K., K.J.M. and A.T.; investigation, A.T.; resources, C.K., K.J.M., H.D., R.M.-S. and D.F.; data processing, A.T.; writing—original draft preparation, A.T., C.K. and K.J.M.; writing—review and editing, A.T., C.K. and K.J.M.; visualization, A.T.; supervision, D.F. and R.M.-S.; project administration, D.F. and R.M.-S.; funding acquisition, D.F. and R.M.-S. All authors have read and agreed to the published version of the manuscript.

Funding

D.F. acknowledges funding from the Royal Academy of Engineering Chairs in Emerging Technologies programme and the UK Engineering and Physical Sciences Research Council (grant no. EP/T00097X/1). R.M.-S. and C.K. received funding from EPSRC projects QuantIC EP/T00097X/1 and QUEST EP/T021020/1, and from the DIFAI ERC Advanced Grant proposal 101097708, funded by the UK Horizon guarantee scheme as EPSRC project EP/Y029178/1. This work was in part supported by a research gift from Google.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Ethics Committee of the University of Glasgow (application number 300220059 16 December 2022).

Data Availability Statement

The data underlying the results presented in this paper are available via their open-sourced links cited in the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, A.S.; Bachrach, A.; Henry, P.; Krainin, M.; Maturana, D.; Fox, D.; Roy, N. Visual odometry and mapping for autonomous flight using an RGB-D camera. In Proceedings of the Robotics Research: The 15th International Symposium ISRR, Flagstaff, AZ, USA, 28 August–1 September 2011; Springer: Cham, Switzerland, 2017; pp. 235–252. [Google Scholar]
  2. Stowers, J.; Hayes, M.; Bainbridge-Smith, A. Altitude control of a quadrotor helicopter using depth map from Microsoft Kinect sensor. In Proceedings of the 2011 IEEE International Conference on Mechatronics, Istanbul, Turkey, 13–15 April 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 358–362. [Google Scholar]
  3. Melchiorre, M.; Scimmi, L.S.; Pastorelli, S.P.; Mauro, S. Collison avoidance using point cloud data fusion from multiple depth sensors: A practical approach. In Proceedings of the 2019 23rd International Conference on Mechatronics Technology (ICMT), Salerno, Italy, 23–26 October 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  4. Gao, X.; Uchiyama, Y.; Zhou, X.; Hara, T.; Asano, T.; Fujita, H. A fast and fully automatic method for cerebrovascular segmentation on time-of-flight (TOF) MRA image. J. Digit. Imaging 2011, 24, 609–625. [Google Scholar] [CrossRef] [PubMed]
  5. Penne, J.; Höller, K.; Stürmer, M.; Schrauder, T.; Schneider, A.; Engelbrecht, R.; Feußner, H.; Schmauss, B.; Hornegger, J. Time-of-flight 3-D endoscopy. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 7–11 October 2009; Springer: Cham, Switzerland, 2009; pp. 467–474. [Google Scholar]
  6. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Ji, X. Guided depth map super-resolution: A survey. ACM Comput. Surv. 2023, 55, 1–36. [Google Scholar] [CrossRef]
  7. Yang, Q.; Yang, R.; Davis, J.; Nistér, D. Spatial-depth super resolution for range images. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8. [Google Scholar]
  8. Riemens, A.; Gangwal, O.; Barenbrug, B.; Berretty, R.P. Multistep joint bilateral depth upsampling. In Proceedings of the Visual Communications and Image Processing, San Jose, CA, USA, 20–22 January 2009; SPIE: Bellingham, WA, USA, 2009; Volume 7257, pp. 192–203. [Google Scholar]
  9. Liu, M.Y.; Tuzel, O.; Taguchi, Y. Joint geodesic upsampling of depth images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 169–176. [Google Scholar]
  10. Lo, K.H.; Wang, Y.C.F.; Hua, K.L. Edge-preserving depth map upsampling by joint trilateral filter. IEEE Trans. Cybern. 2017, 48, 371–384. [Google Scholar] [CrossRef] [PubMed]
  11. Sun, Z.; Han, B.; Li, J.; Zhang, J.; Gao, X. Weighted guided image filtering with steering kernel. IEEE Trans. Image Process. 2019, 29, 500–508. [Google Scholar] [CrossRef] [PubMed]
  12. Qiao, Y.; Jiao, L.; Li, W.; Richardt, C.; Cosker, D. Fast, High-Quality Hierarchical Depth-Map Super-Resolution. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 4444–4453. [Google Scholar]
  13. Diebel, J.; Thrun, S. An application of markov random fields to range sensing. Adv. Neural Inf. Process. Syst. 2005, 18, 291–298. [Google Scholar]
  14. Ferstl, D.; Reinbacher, C.; Ranftl, R.; Rüther, M.; Bischof, H. Image guided depth upsampling using anisotropic total generalized variation. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 993–1000. [Google Scholar]
  15. Newcombe, R.A.; Fox, D.; Seitz, S.M. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 343–352. [Google Scholar]
  16. Park, J.; Kim, H.; Tai, Y.W.; Brown, M.S.; Kweon, I.S. High-quality depth map upsampling and completion for RGB-D cameras. IEEE Trans. Image Process. 2014, 23, 5559–5572. [Google Scholar] [CrossRef] [PubMed]
  17. Riegler, G.; Rüther, M.; Bischof, H. Atgv-net: Accurate depth super-resolution. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14; Springer: Cham, Switzerland, 2016; pp. 268–284. [Google Scholar]
  18. Jiang, Z.; Yue, H.; Lai, Y.K.; Yang, J.; Hou, Y.; Hou, C. Deep edge map guided depth super resolution. Signal Process. Image Commun. 2021, 90, 116040. [Google Scholar] [CrossRef]
  19. Song, X.; Dai, Y.; Qin, X. Deep depth super-resolution: Learning depth super-resolution using deep convolutional neural network. In Proceedings of the Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part IV 13. Springer: Cham, Switzerland, 2017; pp. 360–376. [Google Scholar]
  20. Song, X.; Dai, Y.; Zhou, D.; Liu, L.; Li, W.; Li, H.; Yang, R. Channel attention based iterative residual learning for depth map super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5631–5640. [Google Scholar]
  21. Ye, X.; Sun, B.; Wang, Z.; Yang, J.; Xu, R.; Li, H.; Li, B. Depth super-resolution via deep controllable slicing network. In Proceedings of the 28th ACM International Conference on Multimedia, Virtual Event, 12–16 October 2020; pp. 1809–1818. [Google Scholar]
  22. Huang, L.; Zhang, J.; Zuo, Y.; Wu, Q. Pyramid-structured depth map super-resolution based on deep dense-residual network. IEEE Signal Process. Lett. 2019, 26, 1723–1727. [Google Scholar] [CrossRef]
  23. He, L.; Zhu, H.; Li, F.; Bai, H.; Cong, R.; Zhang, C.; Lin, C.; Liu, M.; Zhao, Y. Towards fast and accurate real-world depth super-resolution: Benchmark dataset and baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 9229–9238. [Google Scholar]
  24. Shi, W.; Ye, M.; Du, B. Symmetric Uncertainty-Aware Feature Transmission for Depth Super-Resolution. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3867–3876. [Google Scholar]
  25. Tang, J.; Chen, X.; Zeng, G. Joint implicit image function for guided depth super-resolution. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4390–4399. [Google Scholar]
  26. Sun, B.; Ye, X.; Li, B.; Li, H.; Wang, Z.; Xu, R. Learning scene structure guidance via cross-task knowledge transfer for single depth super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 7792–7801. [Google Scholar]
  27. Tang, Q.; Cong, R.; Sheng, R.; He, L.; Zhang, D.; Zhao, Y.; Kwong, S. Bridgenet: A joint learning network of depth map super-resolution and monocular depth estimation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 2148–2157. [Google Scholar]
  28. Zhao, Z.; Zhang, J.; Xu, S.; Lin, Z.; Pfister, H. Discrete cosine transform network for guided depth map super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5697–5707. [Google Scholar]
  29. Zhao, Z.; Zhang, J.; Gu, X.; Tan, C.; Xu, S.; Zhang, Y.; Timofte, R.; Van Gool, L. Spherical space feature decomposition for guided depth map super-resolution. arXiv 2023, arXiv:2303.08942. [Google Scholar]
  30. Xu, D.; Fan, X.; Gao, W. Multiscale Attention Fusion for Depth Map Super-Resolution Generative Adversarial Networks. Entropy 2023, 25, 836. [Google Scholar] [CrossRef]
  31. Wang, J.; Huang, Q. Depth Map Super-Resolution Reconstruction Based on Multi-Channel Progressive Attention Fusion Network. Appl. Sci. 2023, 13, 8270. [Google Scholar] [CrossRef]
  32. Song, X.; Zhou, D.; Li, W.; Dai, Y.; Liu, L.; Li, H.; Yang, R.; Zhang, L. WAFP-Net: Weighted Attention Fusion Based Progressive Residual Learning for Depth Map Super-Resolution. IEEE Trans. Multimed. 2021, 24, 4113–4127. [Google Scholar] [CrossRef]
  33. Zhong, Z.; Liu, X.; Jiang, J.; Zhao, D.; Chen, Z.; Ji, X. High-resolution depth maps imaging via attention-based hierarchical multi-modal fusion. IEEE Trans. Image Process. 2021, 31, 648–663. [Google Scholar] [CrossRef] [PubMed]
  34. Tragakis, A.; Kaul, C.; Murray-Smith, R.; Husmeier, D. The fully convolutional transformer for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 3660–3669. [Google Scholar]
  35. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from rgbd images. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part V 12. Springer: Cham, Switzerland, 2012; pp. 746–760. [Google Scholar]
  36. Hirschmuller, H.; Scharstein, D. Evaluation of cost functions for stereo matching. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8. [Google Scholar]
  37. Scharstein, D.; Pal, C. Learning conditional random fields for stereo. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; IEEE: Piscataway, NJ, USA, 2007; pp. 1–8. [Google Scholar]
  38. Lu, S.; Ren, X.; Liu, F. Depth enhancement via low-rank matrix completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3390–3397. [Google Scholar]
  39. Gu, S.; Zuo, W.; Guo, S.; Chen, Y.; Chen, C.; Zhang, L. Learning dynamic guidance for depth image enhancement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3769–3778. [Google Scholar]
  40. Pan, J.; Dong, J.; Ren, J.S.; Lin, L.; Tang, J.; Yang, M.H. Spatially variant linear representation models for joint filtering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1702–1711. [Google Scholar]
  41. Kim, B.; Ponce, J.; Ham, B. Deformable kernel networks for guided depth map upsampling. arXiv 2019, arXiv:1903.11286. [Google Scholar]
  42. Li, Y.; Huang, J.B.; Ahuja, N.; Yang, M.H. Joint image filtering with deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 1909–1923. [Google Scholar] [CrossRef] [PubMed]
  43. Su, H.; Jampani, V.; Sun, D.; Gallo, O.; Learned-Miller, E.; Kautz, J. Pixel-adaptive convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11166–11175. [Google Scholar]
  44. Hui, T.W.; Loy, C.C.; Tang, X. Depth map super-resolution by deep multi-scale guidance. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part III 14. Springer: Cham, Switzerland, 2016; pp. 353–369. [Google Scholar]
  45. Li, Y.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep joint image filtering. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part IV 14. Springer: Cham, Switzerland, 2016; pp. 154–169. [Google Scholar]
  46. Deng, X.; Dragotti, P.L. Deep convolutional neural network for multi-modal image restoration and fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3333–3348. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overview of the proposed multi-modal architecture for guided depth super-resolution estimation.
Figure 2. The proposed multi-modal architecture utilizes information from both an LR depth map and an HR RGB image. Firstly, each modality passes through a convolutional layer followed by a LeakyReLU activation. The model utilizes the IGAF modules to combine information from the two modalities by fusing the relevant information on each stream and ignoring information that is unrelated to the depth maps. Finally, after the third IGAF module, the depth maps are refined and added using a global skip connection from the original upsampled LR depth maps. The RGB modality is used to provide guidance to estimate an HR depth map given an LR one.
Figure 3. The IGAF module. The module is responsible for both feature extraction and modality fusion. Each modality passes through a feature extraction stage (FWF) before the initial naive fusion by an element-wise multiplication. An SAF block follows, which fuses the result of the multiplication with the extracted features of the RGB stream creating an initial structural guidance. The second SAF block incrementally fuses this extracted structural guidance with the depth stream. The output of each SAF block is generated by learning attention weights and subsequently performing a cross-multiplication operation between the two input sequences, resulting in fused and salient processed information.
Figure 4. Overview of the FWF module. The two modules are separated and not combined into one larger module because the propagation of shallower features through the skip connections as seen in Figure 3 boosts the performance of the model. The FE module is a series of convolutional layers, a channel attention process, and two skip connections. The WF module uses linearly increasing dilation rates in convolutional layers to extract multi-resolution features.
Figure 5. Qualitative comparison between our model and SUFT [24]. The visualizations shown are for the ×8 case. Our model creates more complete depth maps as seen in (c) for rows 1 and 2. In (c), row 3 shows that our model creates sharper edges with minimal bleeding. Also, in (c), row 4 the proposed model creates less smoothing with less bleeding. (Colormap chosen for better visualization. Better seen in full-screen, with zoom-in options).
Table 1. Results on the NYU v2 dataset (RMSE).

Scale \ Method | Bicubic | DG [39] | SVLRM [40] | DKN [41] | FDSR [23] | SUFT [24] | CTKT [26] | JIIF [25] | IGAF
×4             | 8.16    | 1.56    | 1.74       | 1.62     | 1.61      | 1.14      | 1.49      | 1.37      | 1.12
×8             | 14.22   | 2.99    | 5.59       | 3.26     | 3.18      | 2.57      | 2.73      | 2.76      | 2.48
×16            | 22.32   | 5.24    | 7.23       | 6.51     | 5.86      | 5.08      | 5.11      | 5.27      | 5.00
Table 2. Results on the RGB-D-D dataset (RMSE).

Scale \ Method | Bicubic | DJFR [42] | PAC [43] | DKN [41] | FDKN [41] | FDSR [23] | JIIF [25] | SUFT [24] | IGAF
×4             | 2.00    | 3.35      | 1.25     | 1.30     | 1.18      | 1.16      | 1.17      | 1.20      | 1.08
×8             | 3.23    | 5.57      | 1.98     | 1.96     | 1.91      | 1.82      | 1.79      | 1.77      | 1.69
×16            | 5.16    | 7.99      | 3.49     | 3.42     | 3.41      | 3.06      | 2.87      | 2.81      | 2.69
Table 3. Results on the Lu dataset (RMSE).

Scale \ Method | Bicubic | DMSG [44] | DG [39] | DJF [45] | DJFR [42] | PAC [43] | JIIF [25] | DKN [41] | IGAF
×4             | 2.42    | 2.30      | 2.06    | 1.65     | 1.15      | 1.20     | 0.85      | 0.96     | 0.82
×8             | 4.54    | 4.17      | 4.19    | 3.96     | 3.57      | 2.33     | 1.73      | 2.16     | 1.68
×16            | 7.38    | 7.22      | 6.90    | 6.75     | 6.77      | 5.19     | 4.16      | 5.11     | 4.14
Table 4. Results on the “real-world manner” RGB-D-D dataset (RMSE).

Setting \ Method    | Bicubic | DJF [45] | DJFR [42] | FDKN [41] | DKN [41] | FDSR [23] | JIIF [25] | SUFT [24] | IGAF
“real-world manner” | 9.15    | 7.90     | 8.01      | 7.50      | 7.38     | 7.50      | 8.41      | 7.17      | 7.01
Table 5. Results on the Middlebury dataset (RMSE).

Scale \ Method | Bicubic | PAC [43] | DKN [41] | FDKN [41] | CUNet [46] | JIIF [25] | SUFT [24] | FDSR [23] | IGAF
×4             | 2.28    | 1.32     | 1.23     | 1.08      | 1.10       | 1.09      | 1.20      | 1.13      | 1.01
×8             | 3.98    | 2.62     | 2.12     | 2.17      | 2.17       | 1.82      | 1.76      | 2.08      | 1.73
×16            | 6.37    | 4.58     | 4.24     | 4.50      | 4.33       | 3.31      | 3.29      | 4.39      | 3.24
Table 6. Demonstrating the importance of the IGAF module (RMSE on NYU v2, ×4).

Scale \ Fusion Method | Addition | Concatenation | IGAF
×4                    | 1.23     | 1.22          | 1.12
Table 7. Ablation results on the NYU v2 dataset (RMSE, ×4).

Scale \ Test | Relocated Skip Connection | Extra IGAF Module | Without Weights | One-Layer MLP | Without WF | Full Model
×4           | 1.14                      | 1.14              | 1.17            | 1.15          | 1.14       | 1.12
