Sensors · Open Access Article · 11 March 2020

Color-Guided Depth Map Super-Resolution Using a Dual-Branch Multi-Scale Residual Network with Channel Interaction

1 National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Image and Video Processing and Recognition Based on Artificial Intelligence

Abstract

We designed an end-to-end dual-branch residual network that feeds a low-resolution (LR) depth map and the corresponding high-resolution (HR) color image separately into its two branches and outputs an HR depth map through multi-scale, channel-wise feature extraction, interaction, and upsampling. Each branch contains several residual levels at different scales, and each level comprises multiple residual groups composed of several residual blocks. A short-skip connection in every residual block and a long-skip connection in each residual group and level allow low-frequency information to be bypassed while the main network focuses on learning high-frequency information. The high-frequency information learned by each residual block in the color image branch is fed into the corresponding residual block in the depth map branch. This channel-wise feature supplement and fusion helps the depth map branch alleviate blur in details such as edges, but it can also introduce depth artifacts into the feature maps. To suppress these artifacts, the channel interaction fuses the feature maps using weights derived from the channel attention mechanism. The parallel multi-scale network architecture with channel interaction for feature guidance is the main contribution of our work, and experiments show that the proposed method achieves better accuracy than other methods.

1. Introduction

With the development of 3D technologies, such as 3D reconstruction, robot interaction, and virtual reality, the acquisition of precise depth information as the basis of 3D technology has become very important. At present, depth maps can be obtained conveniently using low-cost depth cameras. However, depth maps obtained under such hardware constraints are usually of low resolution. To use low-cost depth maps in 3D tasks, we need to perform super-resolution (SR) processing on low-resolution (LR) depth maps to obtain high-resolution (HR) depth maps.
The main difficulty of depth map SR tasks is that the spatial downsampling of HR images to LR images results in the loss and distortion of details, and this phenomenon becomes more serious as the downscaling factor increases. When HR images are recovered from LR images using simple upsampling, edge blur and other detail-distortion problems appear. To cope with these problems, methods that use HR intensity images to guide the upsampling of LR images have been proposed. These methods rely on the correspondence between HR intensity images and LR depth maps of the same scene. If the resolutions of the intensity image and the target HR depth map are the same, the edges of the intensity image and the target HR depth map can be regarded as basically corresponding; therefore, discontinuities in the intensity image help to locate discontinuities in the target HR depth map during upsampling of the LR depth map. Although the introduction of intensity image guidance during upsampling alleviates the blur of details like edges, extra textures may be introduced into the generated HR depth map owing to structural inconsistencies between the depth map and the intensity image.
We propose an end-to-end, multi-scale depth map SR network, which consists of two branches, namely the RGB image branch (Y-branch) and the depth map branch (D-branch). Each branch is mainly composed of residual levels at multiple scales, and each residual level has two functional structures: feature extraction and upsampling. Feature extraction is achieved by connecting several residual groups, each of which contains several residual blocks. As the key to the residual structure, the short-skip connections inside residual blocks and the long-skip connections in residual groups and levels enable the main path of each branch to learn the high-frequency information of the RGB image or depth map at different scales. The feature extraction parts of the residual levels correspond one-to-one, which means that the channel-wise, high-frequency features learned by each residual block of the Y-branch can be input into the corresponding residual block of the D-branch. On this foundation, we utilize a channel attention mechanism to rescale the channel-wise feature maps and fuse the features from the two branches, thereby implementing guidance from the RGB image to the depth map. Under this kind of guidance, features of the HR depth map are supplemented, while the weights used in the channel-wise feature rescaling limit the addition of artifacts from the RGB image. Unlike many existing methods, we input the LR depth map and HR RGB image directly into the network instead of inputting a bicubic interpolation of the LR depth map. Experiments indicate that our proposed method achieves strong performance when recovering an HR depth map from an LR depth map at different upscaling factors.
The main contributions of our work are:
  • We designed a multi-scale residual network with two branches to realize an end-to-end LR depth map super-resolution under the guidance from an HR color image.
  • We applied a channel attention mechanism [1] to learn the features of the depth map and RGB image and fuse them via weights; furthermore, we tried to avoid copying artifacts to the depth map while ensuring that the guidance from the RGB image remained effective.
  • We discuss the detailed steps toward realizing image-wise upsampling and end-to-end training of this dual-branch, multi-scale residual network.

3. Proposed Dual-Branch Multi-Scale Residual Network with Channel Interaction

In this study, we supposed that an LR depth map $D_l$ is obtained by downsampling its corresponding target HR depth map $D_h$ and that an HR RGB image $Y_h$ of the same scene is available. $Y_h$ and $D_l$ of the same scene are the inputs of our network, and the goal is to reconstruct and output $D_h$ end to end at an upscaling factor $s$.
In the following, we take $s = 8$ as an example to show our network structure (see Figure 1).
Figure 1. The architecture of our network for 8× upsampling. HR: High-resolution, LR: Low-resolution.

3.1. RGB Image Network Branch

The main role of the RGB image network branch is to provide guidance for the feature map extraction of the depth map network branch. In general, the structure of the Y-branch can be divided into three functional parts. The first part downscales the input RGB image by a factor of 2 through a convolution layer and a max-pooling layer, repeated $m = \log_2(s)$ times until the resolution of the feature maps equals that of the input depth map (see Figure 1). Since the sample network has an upscaling factor of 8, this downsampling operation is executed three times in total. The feature maps obtained in the first part can be expressed as follows:
$$F_{DW(1)}^Y = W_{DW(1)}^Y * Y_h + b_{DW(1)}^Y$$
$$F_{DW(i)}^Y = W_{DW(i)}^Y * F_{DW(i-1)}^Y + b_{DW(i)}^Y$$
$$F_{DW(2i)}^Y = \mathrm{MaxPool}\left(F_{DW(2i-1)}^Y\right)$$
where $i = \{3, 5, \ldots, 2m-1\}$ in the second equation and $i = \{1, 2, \ldots, m\}$ in the third. The operator $*$ represents convolution, $W_{DW}^Y$ is a kernel of size $3 \times 3$, and $b_{DW}^Y$ is a bias vector. The superscript $Y$ indicates that features or blobs belong to the Y-branch, and the subscript $DW$ stands for the whole downscaling part.
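As an illustration, a minimal PyTorch sketch of this downscaling part could look as follows. The module name, the feature width of 64, and the convolution padding are illustrative assumptions, not values specified above; only the conv/max-pool structure and the retention of the pre-pooling feature maps follow the description.

```python
import math
import torch.nn as nn

class YBranchDownscaler(nn.Module):
    """Sketch of the Y-branch downscaling part: m = log2(s) stages, each a 3x3
    convolution followed by 2x2 max pooling. The pre-pooling feature maps
    (the odd-indexed F_DW) are kept for later concatenation after upsampling."""
    def __init__(self, in_channels=3, num_features=64, scale=8):
        super().__init__()
        m = int(math.log2(scale))                 # m = log2(s); 3 when s = 8
        convs = []
        channels = in_channels
        for _ in range(m):
            convs.append(nn.Conv2d(channels, num_features, kernel_size=3, padding=1))
            channels = num_features
        self.convs = nn.ModuleList(convs)
        self.pool = nn.MaxPool2d(kernel_size=2)

    def forward(self, y_h):
        skips = []
        x = y_h
        for conv in self.convs:
            x = conv(x)        # F_DW(2i-1): kept for the resolution enlarging levels
            skips.append(x)
            x = self.pool(x)   # F_DW(2i): halves the spatial resolution
        return x, skips
```

For an 8x network and a 192x192 RGB patch, this sketch produces a 24x24 feature map (matching the depth patch size used in training) plus skip features at 192, 96, and 48 pixels.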
The second part is the parallel network structure that matches the D-branch, which comprises a nested structure of residual blocks, groups, and levels. As the most basic constituent unit of the network, each residual block of the Y-branch matches the residual block at the same location in the D-branch. Despite this one-to-one relationship, the residual block for feature extraction in the Y-branch consists of only two convolution layers and one PReLU (Parametric Rectified Linear Unit) layer, which is simpler than its counterpart in the D-branch. After the second convolution in the block, the generated feature maps are passed to the matched residual block in the D-branch, where they are concatenated with the depth feature maps to provide guidance from the RGB image. In addition, the input feature maps of each residual block are added to the feature maps obtained after feature extraction, which is called a short-skip connection inside the block. Building on the residual block, a residual group is composed of several connected residual blocks and one convolution layer. Similar to a short-skip connection, a long-skip connection is implemented by adding the input and output of each residual group. In the same way, several residual groups and one convolution layer are connected to constitute a residual level, and a long-skip connection is also realized in each level by the same addition of input and output. Figure 2 shows the structure of a residual block and a residual group in the Y-branch. The feature maps generated by each residual level $l$ can be expressed as follows:
$$F_{DF(1)}^Y = H_{DF(1)}^Y\left(F_{DW(2m)}^Y\right)$$
$$F_{DF(l)}^Y = H_{DF(l)}^Y\left(F_{UP(l-1)}^Y\right)$$
where $l = \{2, 3, \ldots, m+1\}$. $H_{DF}^Y(\cdot)$ denotes the deep feature extraction and $F_{UP}^Y$ represents the feature maps from the third part of the Y-branch. In each residual level $l$, the feature maps generated by each group $g$ can be expressed as follows:
$$F_{l,1}^Y = H_{l,1}^Y\left(F_{l,0}^Y\right)$$
$$F_{l,g}^Y = H_{l,g}^Y\left(F_{l,g-1}^Y\right)$$
$$F_{DF(l)}^Y = F_{l,0}^Y + W_l^Y * F_{l,G}^Y$$
where $g = \{2, 3, \ldots, G\}$ and $G$ is the number of residual groups in a level. $F_{l,0}^Y$ is the input of the residual level. $H_{l,g}^Y(\cdot)$ denotes the function of the $g$th residual group. $F_{l,g-1}^Y$ and $F_{l,g}^Y$ are the input and output of the $g$th residual group, respectively. $W_l^Y$ is the weight set of the tail convolution layer. In each residual group $g$, the feature maps generated by each residual block $b$ can be expressed as follows:
$$F_{g,1}^Y = H_{g,1}^Y\left(F_{g-1}^Y\right)$$
$$F_{g,b}^Y = H_{g,b}^Y\left(F_{g,b-1}^Y\right)$$
$$F_g^Y = F_{g-1}^Y + W_g^Y * F_{g,B}^Y$$
where $b = \{2, 3, \ldots, B\}$ and $B$ is the number of residual blocks in a group. $F_{g-1}^Y$ and $F_g^Y$ are the input and output of the $g$th group, respectively. $H_{g,b}^Y(\cdot)$ denotes the function of the $b$th residual block. $F_{g,b-1}^Y$ and $F_{g,b}^Y$ are the input and output of the $b$th residual block, respectively. $W_g^Y$ is the weight set of the tail convolution layer. In each residual block $b$, the basic operations can be expressed as follows:
$$h\left(F_b^Y\right) = W_{b,2}^Y * \sigma\left(W_{b,1}^Y * F_{b-1}^Y + b_{b,1}^Y\right) + b_{b,2}^Y$$
$$F_b^Y = F_{b-1}^Y + h\left(F_b^Y\right)$$
where $h(\cdot)$ denotes the high-frequency feature maps of the input and $\sigma(\cdot)$ denotes the PReLU activation function. $F_{b-1}^Y$ and $F_b^Y$ are the input and output of the $b$th residual block, respectively. $W_{b,1}^Y$ and $W_{b,2}^Y$ are kernels of size $3 \times 3$, and $b_{b,1}^Y$ and $b_{b,2}^Y$ are the bias vectors.
Figure 2. The structure of residual block, residual group and upsampler.
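A minimal PyTorch sketch of the Y-branch residual block and group described above is given below. The class names, the feature width of 64, and the number of blocks per group are illustrative assumptions; only the conv–PReLU–conv structure, the short- and long-skip connections, and the export of the residual as guidance follow the text.

```python
import torch.nn as nn

class ResidualBlockY(nn.Module):
    """Y-branch residual block: conv -> PReLU -> conv plus a short-skip
    connection; the residual h is also returned as guidance for the
    matching D-branch block."""
    def __init__(self, num_features=64):
        super().__init__()
        self.conv1 = nn.Conv2d(num_features, num_features, 3, padding=1)
        self.act = nn.PReLU()
        self.conv2 = nn.Conv2d(num_features, num_features, 3, padding=1)

    def forward(self, x):
        h = self.conv2(self.act(self.conv1(x)))  # high-frequency residual h(F_b^Y)
        return x + h, h                          # short skip; h is sent to the D-branch

class ResidualGroupY(nn.Module):
    """Residual group: B residual blocks, a tail convolution, and a long-skip
    connection adding the group input to the tail output."""
    def __init__(self, num_features=64, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList([ResidualBlockY(num_features) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(num_features, num_features, 3, padding=1)

    def forward(self, x):
        out, guidance = x, []
        for block in self.blocks:
            out, h = block(out)
            guidance.append(h)   # channel-wise guidance passed to the D-branch
        return x + self.tail(out), guidance
```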
The third part of the Y-branch is the resolution enlarging level. This part consists of an upsampler and a convolution layer, both connected after the residual level; the upsampler is composed of a convolution layer and a pixel-shuffling layer. Corresponding to the initial downscaling steps, feature maps are upscaled by a factor of 2 after each residual level and resolution enlarging level. Furthermore, the upsampled feature maps are concatenated with the feature maps of the same resolution from the first part, and a convolution is then applied (see Figure 2). In this way, the upsampled feature maps are supplemented by the original high-resolution feature maps from the first part so that more structural features at different scales are retained for the processing that follows, which provides sufficient guidance to the D-branch. The feature maps generated by the third part can be expressed as follows:
$$F_{UP(l)}^Y = W_{l,2}^Y * \left[\mathrm{PixelShuffle}\left(W_{l,1}^Y * F_{DF(l)}^Y + b_{l,1}^Y\right),\ F_{DW(2m-2l+1)}^Y\right] + b_{l,2}^Y$$
where $l = \{1, 2, \ldots, m\}$ and $[\cdot,\cdot]$ denotes channel-wise concatenation. $W_{l,1}^Y$ and $W_{l,2}^Y$ are kernels of size $3 \times 3$, and $b_{l,1}^Y$ and $b_{l,2}^Y$ are the bias vectors.
Referring to Shi et al. [36], the pixel-shuffling layer rearranges the elements of an $H \times W \times Cr^2$ blob $B$ into a blob of shape $rH \times rW \times C$, where $r$ is the upscaling factor and $H \times W$ is the size of the $C$ feature maps. Mathematically, the pixel-shuffling operation can be described as follows:
$$\mathrm{PixelShuffle}(B)_{x,y,c} = B_{\lfloor x/r \rfloor,\ \lfloor y/r \rfloor,\ C \cdot r \cdot \mathrm{mod}(y,r) + C \cdot \mathrm{mod}(x,r) + c}$$
where $x$ and $y$ are the output pixel coordinates of the $c$th feature map in HR space. The feature maps from the LR space are built into HR feature maps through the pixel-shuffling layer.
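A minimal PyTorch sketch of one Y-branch resolution enlarging level is given below, using the built-in PixelShuffle layer. The module name and feature width are illustrative assumptions; the conv + PixelShuffle(2) upsampling followed by concatenation with the same-resolution skip features and a fusing convolution follows the equation above.

```python
import torch
import torch.nn as nn

class YBranchUpsampler(nn.Module):
    """Y-branch resolution enlarging level: conv + PixelShuffle(2) doubles the
    resolution, then the skip features F_DW of the same resolution are
    concatenated and fused by a convolution."""
    def __init__(self, num_features=64):
        super().__init__()
        # The conv produces 4x the channels so that PixelShuffle(2) restores num_features.
        self.pre = nn.Conv2d(num_features, num_features * 4, 3, padding=1)
        self.shuffle = nn.PixelShuffle(2)
        self.fuse = nn.Conv2d(num_features * 2, num_features, 3, padding=1)

    def forward(self, f_df, f_dw_skip):
        up = self.shuffle(self.pre(f_df))                     # 2x spatial upsampling
        return self.fuse(torch.cat([up, f_dw_skip], dim=1))   # concatenation + convolution
```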

3.2. Depth Map Network Branch

The task of the depth map network branch is to complete the super-resolution of an LR depth map under guidance from the parallel Y-branch. Because the input depth map is already of low resolution, the D-branch has no downscaling part and is mainly composed of two parts: the residual levels and the resolution enlarging levels. Apart from this difference in architecture, the nested structure of residual blocks and groups and the short- and long-skip connections in the D-branch are the same as in the Y-branch. However, the residual block in the D-branch, which contains convolution layers, PReLU layers, and an average-pooling layer, is more complicated than that in the Y-branch. Its whole feature extraction procedure is as follows. The input feature maps are first processed by convolution, PReLU, and convolution, and the feature maps from the Y-branch are then concatenated. After subsequent average pooling, convolution, PReLU, convolution, and a sigmoid function, weights are generated and multiplied with the previously concatenated feature maps to produce new feature maps that not only integrate the structure information from the RGB image, but also prevent unreasonable textures from appearing. In addition to these internal structures, the short-skip connection still adds the input and the output of each residual block. Figure 2 shows the structure of the residual block and residual group in the D-branch. The feature maps generated by each residual level $l$ can be expressed as follows:
$$F_{DF(1)}^D = H_{DF(1)}^D\left(W_0^D * D_l + b_0^D\right)$$
$$F_{DF(l)}^D = H_{DF(l)}^D\left(F_{UP(l-1)}^D\right)$$
where $l = \{2, 3, \ldots, m+1\}$. The superscript $D$ indicates that features or blobs belong to the D-branch. $W_0^D$ and $b_0^D$ are a $3 \times 3$ kernel and a bias vector of the head convolution layer for initial feature extraction, respectively. $H_{DF}^D(\cdot)$ denotes the deep feature extraction and $F_{UP}^D$ represents the feature maps from the second part of the D-branch. In each residual level $l$, the feature maps generated by each group $g$ can be expressed as follows:
$$F_{l,1}^D = H_{l,1}^D\left(F_{l,0}^D\right)$$
$$F_{l,g}^D = H_{l,g}^D\left(F_{l,g-1}^D\right)$$
$$F_{DF(l)}^D = F_{l,0}^D + W_l^D * F_{l,G}^D$$
where $g = \{2, 3, \ldots, G\}$ and $G$ is the number of residual groups in a level. $F_{l,0}^D$ is the input of the residual level. $H_{l,g}^D(\cdot)$ denotes the function of the $g$th residual group. $F_{l,g-1}^D$ and $F_{l,g}^D$ are the input and output of the $g$th residual group, respectively. $W_l^D$ is the weight set of the tail convolution layer. In each residual group $g$, the feature maps generated by each residual block $b$ can be expressed as follows:
$$F_{g,1}^D = H_{g,1}^D\left(F_{g-1}^D\right)$$
$$F_{g,b}^D = H_{g,b}^D\left(F_{g,b-1}^D\right)$$
$$F_g^D = F_{g-1}^D + W_g^D * F_{g,B}^D$$
where $b = \{2, 3, \ldots, B\}$ and $B$ is the number of residual blocks in a group. $F_{g-1}^D$ and $F_g^D$ are the input and output of the $g$th group, respectively. $H_{g,b}^D(\cdot)$ denotes the function of the $b$th residual block. $F_{g,b-1}^D$ and $F_{g,b}^D$ are the input and output of the $b$th residual block, respectively. $W_g^D$ is the weight set of the tail convolution layer. In each residual block $b$, the basic operations can be expressed as follows:
$$h\left(F_b^D\right) = W_{b,2}^D * \sigma\left(W_{b,1}^D * F_{b-1}^D + b_{b,1}^D\right) + b_{b,2}^D$$
$$F_b^D = F_{b-1}^D + R_b^D\left(h\left(F_b^D\right), h\left(F_b^Y\right)\right) \odot \left[h\left(F_b^D\right), h\left(F_b^Y\right)\right]$$
where $h(\cdot)$ denotes the high-frequency feature maps of the input and $\sigma(\cdot)$ denotes the PReLU activation function. $F_{b-1}^D$ and $F_b^D$ are the input and output of the $b$th residual block, respectively. $W_{b,1}^D$ and $W_{b,2}^D$ are kernels of size $3 \times 3$, and $b_{b,1}^D$ and $b_{b,2}^D$ are the bias vectors. $R_b^D(\cdot)$ denotes the channel interaction function that produces the channel-wise weights (Section 3.3), $[\cdot,\cdot]$ denotes channel-wise concatenation, and $\odot$ denotes channel-wise multiplication.
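A minimal PyTorch sketch of such a D-branch residual block is given below. The class name, feature width, and reduction ratio are illustrative; in particular, how the rescaled $2C$ channels are reduced back to $C$ channels is not specified in this excerpt, so a 3x3 convolution is used here as a placeholder assumption.

```python
import torch
import torch.nn as nn

class ResidualBlockD(nn.Module):
    """D-branch residual block with channel interaction: the depth residual
    h(F_b^D) is concatenated with the guidance h(F_b^Y), channel weights are
    produced by average pooling -> conv -> PReLU -> conv -> sigmoid, the
    concatenation is rescaled by these weights, and the result is added back
    to the block input through the short-skip connection."""
    def __init__(self, num_features=64, reduction=16):
        super().__init__()
        self.conv1 = nn.Conv2d(num_features, num_features, 3, padding=1)
        self.act = nn.PReLU()
        self.conv2 = nn.Conv2d(num_features, num_features, 3, padding=1)
        fused = num_features * 2                     # depth channels + RGB guidance channels
        self.gate = nn.Sequential(                   # channel interaction R_b^D
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused, fused // reduction, 1),
            nn.PReLU(),
            nn.Conv2d(fused // reduction, fused, 1),
            nn.Sigmoid())
        # How the 2C rescaled channels return to C is not stated above;
        # a 3x3 convolution is assumed here as a placeholder.
        self.fuse = nn.Conv2d(fused, num_features, 3, padding=1)

    def forward(self, x, guidance):
        h_d = self.conv2(self.act(self.conv1(x)))    # h(F_b^D)
        cat = torch.cat([h_d, guidance], dim=1)      # [h(F_b^D), h(F_b^Y)]
        weighted = self.gate(cat) * cat              # channel-wise rescaling
        return x + self.fuse(weighted)               # short-skip connection
```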
Apart from the difference in the residual block, the D-branch directly employs an upsampler and a convolution layer as its resolution enlarging level; because it has no downscaling part, it does not concatenate feature maps from the branch itself. Each resolution enlarging level is connected after a residual level, which is one of the steps used to gradually achieve super-resolution. Finally, a convolution layer is connected after the last residual level to convert the feature maps into a depth map, generating the target HR depth map as the output of the whole dual-branch network (see Figure 2). The feature maps generated by the second part can be expressed as follows:
$$F_{UP(l)}^D = W_{l,2}^D * \mathrm{PixelShuffle}\left(W_{l,1}^D * F_{DF(l)}^D + b_{l,1}^D\right) + b_{l,2}^D$$
where $l = \{1, 2, \ldots, m\}$. $W_{l,1}^D$ and $W_{l,2}^D$ are kernels of size $3 \times 3$, and $b_{l,1}^D$ and $b_{l,2}^D$ are the bias vectors.
At the end of our network, a convolution layer reconstructs the feature maps into the output HR depth map $\tilde{D}_h$ as follows:
$$\tilde{D}_h = W_{REC}^D * F_{DF(m+1)}^D + b_{REC}^D$$
where $W_{REC}^D$ is a kernel of size $3 \times 3$ and $b_{REC}^D$ is the bias vector.
Our network is optimized with an L1 loss function. Given a training set $\{Y_h^i, D_l^i, D_h^i\}_{i=1}^N$, which contains $N$ HR RGB images and LR depth maps as inputs along with their HR depth map counterparts, our network is trained by minimizing the L1 loss function
$$L(\Theta) = \frac{1}{N}\sum_{i=1}^{N}\left\|\tilde{D}_h^i - D_h^i\right\|_1$$
where $\Theta$ denotes the parameter set of our network. This L1 loss function is optimized using stochastic gradient descent.
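As a minimal sketch of this objective in PyTorch, the loss could be computed as below; `model` stands for the dual-branch network mapping $(Y_h, D_l)$ to a reconstructed HR depth map, and the function name is ours.

```python
import torch
import torch.nn as nn

def l1_training_loss(model: nn.Module,
                     y_h: torch.Tensor,   # HR RGB image batch
                     d_l: torch.Tensor,   # LR depth map batch
                     d_h: torch.Tensor) -> torch.Tensor:
    """L1 objective: mean absolute error between the reconstructed and
    ground-truth HR depth maps, i.e. (1/N) * sum ||D_hat - D_h||_1."""
    d_pred = model(y_h, d_l)
    return nn.functional.l1_loss(d_pred, d_h)
```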

3.3. Channel Interaction

Channel attention is a channel-wise feature interaction and rescaling mechanism proposed by Zhang et al. [1], whose goal is to make the network pay more attention to the features that carry more information. The mechanism is motivated by two observations. One is that LR space contains abundant low-frequency components and valuable high-frequency components; the low-frequency components are mostly flat regions, whereas the high-frequency components are mostly regions full of details, such as edges and textures. The other is that each filter of a convolution layer has a local receptive field, so convolution cannot exploit contextual information outside the local region. In response to these two observations, the channel attention mechanism uses global average pooling to obtain channel-wise global spatial information and employs a gating mechanism to capture the dependencies between channels. This gating mechanism not only learns nonlinear interactions but also avoids mutual exclusion between channel-wise features. The coefficient factors learned by the gating mechanism are the weights used to rescale the channels. The channel attention mechanism originally operates among the channel-wise features learned from a single input image; we further extend it to the guidance from the RGB image to the depth map, which makes the features learned by the two network branches interact with each other.
There are two types of channel interaction in our network. The first is the concatenation of the feature maps before downscaling and after upsampling in the Y-branch, followed by a convolution that produces new channel-wise feature maps. This is a relatively common channel-wise interaction in which the feature maps of all channels affect each other equally. We adopt this equal interaction because, owing to the initial downscaling part, the details lost at the previous residual level need to be supplemented for the feature extraction and learning of the next, larger-scale residual level; the supplemented feature maps also strengthen the guidance provided to the D-branch. The second type of channel interaction operates through per-channel weights. After the feature maps of each residual block in the D-branch are concatenated with the feature maps from the Y-branch, a weight for each channel is computed through a series of functions and determines how strongly that channel influences the newly generated feature maps. The guidance from the Y-branch to the D-branch is realized in this way: the channels from the Y-branch can affect all channels in the residual block, but each channel exerts an unequal influence according to its weight, so structural features that correspond between the RGB image and the depth map are emphasized, while inconsistent features without such correspondence are suppressed. Small weights limit the appearance of artifacts introduced by the feature maps from the Y-branch.
Let $R_b^D(\cdot)$ denote the entire channel interaction operation and suppose that $X = [x_1^Y, \ldots, x_c^Y, \ldots, x_C^Y, x_1^D, \ldots, x_c^D, \ldots, x_C^D]$ is an input containing $C$ feature maps of size $H \times W$ from each of the Y- and D-branches. The channel-wise statistic $z \in \mathbb{R}^{2C}$ can be obtained by shrinking $X$, and the $c$th element of $z$ is:
$$z_c = \mathrm{AveragePool}(x_c) = \frac{1}{H \times W}\sum_{h=1}^{H}\sum_{w=1}^{W} x_c(h, w)$$
where $x_c(h, w)$ is the value at position $(h, w)$ of the $c$th feature map $x_c$ from either the Y-branch or the D-branch. We then obtain the weight coefficients using the function:
$$s = f\left(W_U^D\, \sigma\left(W_D^D\, z\right)\right)$$
where $f(\cdot)$ and $\sigma(\cdot)$ denote the sigmoid and PReLU functions, respectively. $W_D^D$ is the weight set of a convolution layer that reduces the number of channels by a reduction ratio $r$; in our experiments, $r$ was set to 16. $W_U^D$ is the weight set of a convolution layer that expands the channels by the same ratio $r$. Then, we can rescale $x_c$ by:
$$\hat{x}_c = s_c \cdot x_c$$
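A minimal PyTorch transcription of these equations is sketched below; the class name is ours, and the use of 1x1 convolutions for $W_D^D$ and $W_U^D$ is an implementation assumption.

```python
import torch.nn as nn

class ChannelInteraction(nn.Module):
    """Gating over the 2C concatenated channels: global average pooling gives
    the statistic z, two 1x1 convolutions with reduction ratio r and a sigmoid
    give the weights s, which rescale each channel."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                        # z_c: average over H x W
        self.down = nn.Conv2d(channels, channels // reduction, 1)  # W_D^D
        self.act = nn.PReLU()                                      # sigma (PReLU)
        self.up = nn.Conv2d(channels // reduction, channels, 1)    # W_U^D
        self.sigmoid = nn.Sigmoid()                                # f (sigmoid)

    def forward(self, x):                     # x = [x^Y, x^D] with 2C channels
        z = self.pool(x)
        s = self.sigmoid(self.up(self.act(self.down(z))))
        return s * x                          # x_hat_c = s_c * x_c
```

In the D-branch, this module would be applied to the concatenation of $h(F_b^D)$ and $h(F_b^Y)$, i.e., an input with $2C$ channels.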

4. Evaluation

4.1. Network Training

The dataset for the experiments in this paper was the same as in Hui et al. [23]; it consisted of 58 RGBD images from the MPI (Max-Planck Institute) Sintel depth dataset and 34 RGBD images from the Middlebury dataset. Of these, 82 RGBD images made up the training set for our network training, and the other 10 images composed the test set for validation. Our experiments included SR reconstruction of an LR depth map at upscaling factors of 2, 3, 4, 8, and 16. Considering that a factor of 2 was the initial base, we first trained a network with an upscaling factor of 2 whose Y-branch was pre-trained using 1000 images from the NYUv2 (New York University Version 2) dataset [37]; then, the entire network was trained using these 1000 RGB images and depth maps, and finally, the aforementioned training dataset of 82 RGBD images was used for fine-tuning. Based on the trained network with an upscaling factor of 2, the networks with upscaling factors of 3, 4, 8, and 16 were further fine-tuned using the same 82 RGBD images.
In terms of training details, we generated the LR depth maps of the training dataset at different upscaling factors by downscaling the corresponding HR depth maps through bicubic interpolation. During training, we did not input large images or depth maps into our network directly, but split each one into small overlapping patches and performed common data augmentation before a patch entered the network. The size of these patches was set according to the upscaling factor: for upscaling factors $\{2, 3, 4, 8, 16\}$, the corresponding input depth map patch sizes were $\{48^2, 48^2, 48^2, 24^2, 12^2\}$ and the input RGB image patch sizes were $\{96^2, 144^2, 192^2, 192^2, 192^2\}$. The other training settings included the choice of loss function, optimizer, and learning rate: we chose L1 as the loss function and used the ADAM optimizer with $\beta_1 = 0.8$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, and an initial learning rate of $10^{-4}$. The learning rate was halved after every 200 epochs. We trained all the network models using PyTorch on a GTX 1080 GPU.
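A minimal sketch of these optimization settings in PyTorch is given below. The single convolution stands in for the dual-branch network so the snippet runs; the training loop itself is omitted.

```python
import torch
import torch.nn as nn

# Stand-in for the dual-branch network (a single conv, just so the sketch runs).
model = nn.Conv2d(1, 64, kernel_size=3, padding=1)

# ADAM with beta1 = 0.8, beta2 = 0.999, eps = 1e-8 and initial lr = 1e-4,
# with the learning rate halved every 200 epochs, as stated above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.8, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

# After each training epoch over the patch dataset:
#     scheduler.step()
```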

4.2. Evaluation on the Middlebury Dataset

To compare our method with the experimental results of other studies, we used the root mean squared error (RMSE) as the evaluation criterion. Following Hui et al. [23], we evaluated our algorithm on Middlebury RGBD datasets whose holes were filled. The data were divided into three sets, namely A, B, and C. The data in the tables came from References [2,3,6,10,12,13,14,16,17,18,23,24,25]. At each upscaling factor, the best RMSE result of all the listed algorithms is shown in bold and the sub-optimal result is underlined. For dataset C, the comparison was only performed up to an upscaling factor of 8 because the resolution of the input depth map was too low to reconstruct the HR depth map at a factor of 16. In addition, the experimental results at the upscaling factor of 3 were not included in the three tables because the other algorithms cannot reconstruct depth maps at a factor that is not a power of 2.
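For reference, the RMSE criterion used in the tables can be computed as in the following sketch (the function name is ours; depth values are assumed to be in the same range as the ground truth).

```python
import torch

def rmse(pred: torch.Tensor, target: torch.Tensor) -> float:
    """Root mean squared error between a reconstructed HR depth map and the
    ground-truth HR depth map."""
    return torch.sqrt(torch.mean((pred - target) ** 2)).item()
```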
Table 1, Table 2 and Table 3 record the evaluations on sets A, B, and C separately, and our algorithm showed an excellent performance compared with the others. When the upscaling factor was small, the gap between the algorithms was not large, but the advantage of our method became obvious as the upscaling factor increased. This shows that it is feasible to use an HR RGB image to guide LR depth map super-resolution in a multi-scale way when the LR depth map has poor quality and lacks high-frequency information, a condition that challenges all image SR methods. Since References [23,24] adopt a multi-scale mechanism and References [24,25] are built on a residual structure, we focused on comparing our experimental results with theirs. According to Table 1, the average RMSEs of our network on dataset A at the upscaling factors of {2, 4, 8, 16} were {0.37, 0.78, 1.27, 1.89}, which outperformed Hui et al. [23] with gains of {0.09 (+19.6%), 0.15 (+16.1%), 0.23 (+15.3%), 0.71 (+27.3%)}, outperformed Zuo et al. [24] with gains of {0.15 (+28.8%), 0.22 (+22.0%), 0.35 (+21.6%), 0.73 (+27.9%)}, and outperformed Zuo et al. [25] with gains of {0.06 (+14.0%), 0.15 (+16.1%), 0.28 (+18.1%), 0.61 (+24.4%)}. On dataset B, our network outperformed Hui et al. [23] with gains of {0.07 (+18.4%), 0.13 (+15.9%), 0.32 (+22.2%), 0.75 (+31.5%)}, outperformed Zuo et al. [24] with gains of {0.31 (+50%), 0.39 (+36.1%), 0.56 (+33.3%), 1.2 (+42.4%)}, and outperformed Zuo et al. [25] with gains of {0.21 (+40.4%), 0.31 (+31%), 0.51 (+31.3%), 1.09 (+40.1%)}. On dataset C, our network outperformed Hui et al. [23] with gains of {0.35 (+38.9%), 0.53 (+24.3%), 0.96 (+23.3%)} at the upscaling factors of {2, 4, 8}. Overall, our network substantially reduced the mean RMSE on these three datasets compared with other methods. Although our network only achieved sub-optimal results in several cases, such as for Venus in dataset C, it is reasonable to infer that special optimization may be required for some isolated samples.
Table 1. Quantitative comparison (in RMSE) on dataset A.
Table 2. Quantitative comparison (in RMSE) on dataset B.
Table 3. Quantitative comparison (in RMSE) on dataset C.
Figure 3 shows the results of our network on dataset A with an upscaling factor of 8. To further verify the effectiveness of the network structure we designed, we selected several detail-rich regions in each HR depth map to observe the differences between our SR results and the ground truths. We examined the effect of our network in two respects. The first was whether the regions containing edges were blurred after super-resolution. In Figure 3, we marked these regions with blue boxes in (a–c) and give the contrast between the ground truths and our SR results in (d). It is evident that the edges in our SR results were as sharp as those in the ground truths; in general, deeper networks such as ours can learn more complex and finer features, including edges. The second was whether artifacts existed in the reconstructed HR depth maps. We marked with red boxes the regions that contain textures in the HR RGB image but are flat in the corresponding HR depth map. The contrasts between the reconstructed results and ground truths given in (e) demonstrate that such texture-induced artifacts did not appear after super-resolution. From these results, we conclude that our proposed method can perform finer depth map SR reconstruction while suppressing the introduction of artifacts.
Figure 3. Upsampled depth maps for dataset A with an upscaling factor of 8. (a) HR RGB images for input, (b) ground-truth HR depth maps, (c) upsampled results from our network, (d) regions inside blue boxes from (b) (left) and (c), and (e) regions inside red boxes from (b) (left) and (c).

4.3. Evaluation of Generalization

To test the generalization of our proposed network, we selected three RGBD images from different databases to form a new dataset, Mixture, consisting of the image Lucy from the SimGeo dataset [26], the image Plant from the ICL-NUIM (Imperial College London-National University of Ireland Maynooth) dataset [38], and the image Vintage from the Middlebury dataset. The model used for this evaluation was the same as the model tested on datasets A, B, and C, without fine-tuning, and the evaluation criterion was still the RMSE. We mainly tested our method at the upscaling factors of 4 and 8, in comparison with the methods from References [23,26,39,40,41]. Our method produced the best performance on the image from the Middlebury dataset, performing nearly 20% better than the sub-optimal result (see Table 4). On the ICL-NUIM dataset, our method performed similarly to the other methods. However, the results on the image Lucy indicate that our network was not suitable for that dataset, which means the generalization ability of our network needs to be improved in the future. Figure 4 shows the results of our network on the dataset Mixture with an upscaling factor of 4; details in the blue boxes are enlarged in columns (d) and (e).
Table 4. Quantitative comparison (in RMSE) on dataset Mixture.
Figure 4. Upsampled depth maps for dataset Mixture with an upscaling factor of 4. (a) HR RGB images for input, (b) ground-truth HR depth maps, (c) upsampled results from our network, (d) regions inside blue boxes from (b), and (e) regions inside blue boxes from (c).
In Table 5, we provide the time taken by our network and other methods [6,7,23] to upscale depth maps from different low resolutions to full resolution. The computation time of Hui et al. [23] was measured by upsampling the image Art from dataset A, and we performed the same experiment on a GTX 1080 GPU using Python. Bicubic, SRCNN, and VDSR were implemented in MATLAB, and Guo et al. [42] provide information about their average running time.
Table 5. Computation time (seconds).

5. Conclusions

We proposed a dual-branch residual network that realizes LR depth map super-resolution with channel interaction and multi-scale residual levels under the guidance of an HR RGB image. In the network design, the residual levels of the RGB image branch and the depth map branch are parallel, which supports not only the corresponding feature extraction but also the guidance from the RGB image branch to the depth map branch. Furthermore, the weight-based channel interaction avoids introducing artifacts into the upscaled depth map, and upscaling the LR depth map in a multi-scale manner helps alleviate the blur in the HR depth map that is caused by upsampling to a high resolution in one step. The experiments showed that our method performs excellently compared with other methods, especially when the upscaling factor is large. In the future, we hope to explore other methods for channel-wise feature fusion and to go further in the residual network design. In addition, the RGB image branch, which plays an auxiliary role in our network, has more layers than the depth map branch, which leaves room for improvement by compressing the layers of the whole network.

Author Contributions

Conceptualization, R.C. and W.G.; methodology, R.C.; software, R.C.; validation, R.C.; formal analysis, R.C.; investigation, R.C.; resources, R.C.; data curation, R.C.; writing—original draft preparation, R.C.; writing—review and editing, R.C. and W.G.; visualization, W.G.; supervision, W.G.; project administration, W.G.; funding acquisition, W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (grant number 2016YFB0502002), and the National Natural Science Foundation of China (NSFC) (grant numbers 61872361, 61991423, and 61421004).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  2. Narayanan, B.N.; Hardie, R.C.; Balster, E. Multiframe Adaptive Wiener Filter Super-Resolution with JPEG2000-Compressed Images. EURASIP J. Adv. Signal Process. 2014, 55, 1–18. [Google Scholar] [CrossRef]
  3. Lu, J.; Forsyth, D. Sparse Depth Super Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2245–2253. [Google Scholar]
  4. Kwon, H.; Tai, Y.W.; Lin, S. Data-Driven Depth Map Refinement via Multi-Scale Sparse Representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 159–167. [Google Scholar]
  5. Xie, J.; Feris, R.S.; Sun, M. Edge-Guided Single Depth Image Super Resolution. IEEE Trans. Image Process. 2016, 25, 428–438. [Google Scholar] [CrossRef] [PubMed]
  6. Dong, C.; Loy, C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  7. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1646–1654. [Google Scholar]
  8. Lai, W.; Huang, J.; Ahuja, N.; Yang, M. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 624–632. [Google Scholar]
  9. Odena, A.; Dumoulin, V.; Olah, C. Deconvolution and Checkerboard Artifacts. Distill 2016, 1, e3. [Google Scholar] [CrossRef]
  10. He, K.; Sun, J.; Tang, X. Guided Image Filtering. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 6, 1397–1409. [Google Scholar] [CrossRef] [PubMed]
  11. Barron, J.T.; Poole, B. The Fast Bilateral Solver. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 617–632. [Google Scholar]
  12. Diebel, J.; Thrun, S. An Application of Markov Random Fields to Range Sensing. In Proceedings of the 19th Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 5–8 December 2005. [Google Scholar]
  13. Park, J.; Kim, H.; Tai, Y.W.; Brown, M.; Kweon, I. High Quality Depth Map Upsampling for 3D-TOF Cameras. In Proceedings of the International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1623–1630. [Google Scholar]
  14. Ferstl, D.; Reinbacher, C.; Ranftl, R.; Rüther, M.; Bischof, H. Image Guided Depth Upsampling Using Anisotropic Total Generalized Variation. In Proceedings of the International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 993–1000. [Google Scholar]
  15. Zuo, Y.; Wu, Q.; Zhang, J.; An, P. Explicit Edge Inconsistency Evaluation Model for Color-Guided Depth Map Enhancement. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 439–453. [Google Scholar] [CrossRef]
  16. Zuo, Y.; Wu, Q.; Zhang, J.; An, P. Minimum Spanning Forest with Embedded Edge Inconsistency Measurement Model for Guided Depth Map Enhancement. IEEE Trans. Image Process. 2018, 27, 4145–4149. [Google Scholar] [CrossRef] [PubMed]
  17. Yang, J.; Ye, X.; Li, K.; Hou, C.; Wang, Y. Color-Guided Depth Recovery from RGB-D Data Using an Adaptive Autoregressive Model. IEEE Trans. Image Process. 2014, 23, 3962–3969. [Google Scholar] [CrossRef]
  18. Kiechle, M.; Hawe, S.; Kleinsteuber, M. A Joint Intensity and Depth Co-Sparse Analysis Model for Depth Map Super-Resolution. In Proceedings of the International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1545–1552. [Google Scholar]
  19. Riegler, G.; Rüther, M.; Bischof, H. Atgv-Net: Accurate Depth Super-Resolution. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 268–284. [Google Scholar]
  20. Zhou, W.; Li, X.; Reynolds, D. Guided Deep Network for Depth Map Super-Resolution: How much can color help? In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, LA, USA, 5–9 March 2017; pp. 1457–1461. [Google Scholar]
  21. Yang, J.; Lan, H.; Song, X.; Li, K. Depth Super-Resolution via Fully Edge-Augmented Guidance. In Proceedings of the IEEE Visual Communications and Image Processing, St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar]
  22. Ye, X.; Duan, X.; Li, H. Depth Super-Resolution with Deep Edge-Inference Network and Edge-Guided Depth Filling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Seoul, South Korea, 22–27 April 2018; pp. 1398–1402. [Google Scholar]
  23. Hui, T.-W.; Loy, C.C.; Tang, X. Depth Map Super-Resolution by Deep Multi-Scale Guidance. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 353–369. [Google Scholar]
  24. Zuo, Y.; Wu, Q.; Fang, Y.; An, P.; Huang, L.; Chen, Z. Multi-Scale Frequency Reconstruction for Guided Depth Map Super-Resolution via Deep Residual Network. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 297–306. [Google Scholar] [CrossRef]
  25. Zuo, Y.; Fang, Y.; Yang, Y.; Shang, X.; Wang, B. Residual Dense Network for Intensity-Guided Depth Map Enhancement. Inf. Sci. 2019, 495, 52–64. [Google Scholar] [CrossRef]
  26. Voynov, O.; Artemov, A.; Egiazarian, V.; Notchenko, A.; Bobrovskikh, G.; Burnaev, E. Perceptual Deep Depth Super-Resolution. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 5653–5663. [Google Scholar]
  27. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar]
  28. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2472–2481. [Google Scholar]
  29. Liu, D.; Wen, B.; Fan, Y.; Loy, C.C.; Huang, T.S. Non-Local Recurrent Network for Image Restoration. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 2–7 December 2018; pp. 1673–1682. [Google Scholar]
  30. Qiu, Y.; Wang, R.; Tao, D.; Cheng, J. Embedded Block Residual Network: A Recursive Restoration Model for Single-Image Super-Resolution. In Proceedings of the International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 4180–4189. [Google Scholar]
  31. Hu, Y.; Li, J.; Huang, Y.; Gao, X. Channel-wise and Spatial Feature Modulation Network for Single Image Super-Resolution. IEEE Trans. Circuits Syst. Video Technol. 2019. [Google Scholar] [CrossRef]
  32. Jing, P.; Guan, W.; Bai, X.; Guo, H.; Su, Y. Single Image Super-Resolution via Low-Rank Tensor Representation and Hierarchical Dictionary Learning. Multimed. Tools Appl. 2020. [Google Scholar] [CrossRef]
  33. Haris, M.; Shakhnarovich, G.; Ukita, N. Recurrent Back-Projection Network for Video Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–21 June 2019; pp. 3897–3906. [Google Scholar]
  34. Molini, A.B.; Valsesia, D.; Fracastoro, G.; Magli, E. DeepSUM: Deep Neural Network for Super-Resolution of Unregistered Multitemporal Images. IEEE Trans. Geosci. Remote Sens. 2020. [Google Scholar] [CrossRef]
  35. Molini, A.B.; Valsesia, D.; Fracastoro, G.; Magli, E. DeepSUM++: Non-local Deep Neural Network for Super-Resolution of Unregistered Multitemporal Images. arXiv 2020, arXiv:2001.06342. [Google Scholar] [CrossRef]
  36. Shi, W.; Caballero, J.; Huszar, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1874–1883. [Google Scholar]
  37. Silberman, N.; Kohli, P.; Hoiem, D.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. In Proceedings of the European Conference on Computer Vision, Florence, Italy, 7–13 October 2012. [Google Scholar]
  38. Handa, A.; Whelan, T.; McDonald, J.; Davison, A.J. A Benchmark for RGB-D Visual Odometry, 3D reconstruction and SLAM. In Proceedings of the IEEE Conference on Robotics and Automation, Hong Kong, China, 31 May–5 June 2014; pp. 1524–1531. [Google Scholar]
  39. Riegler, G.; Ferstl, D.; Ruther, M.; Bischof, H. A Deep Primal-Dual Network for Guided Depth Super-Resolution. In Proceedings of the British Machine Vision Conference, York, UK, 19–22 September 2016. [Google Scholar]
  40. Haefner, B.; Queau, Y.; Mollenhoff, T.; Cremers, D. Fight Ill-Posedness with Ill-Posedness: Single-shot Variational Depth Super-Resolution from Shading. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 164–174. [Google Scholar]
  41. Gu, S.; Zuo, W.; Guo, S.; Chen, Y.; Chen, C.; Zhang, L. Learning Dynamic Guidance for Depth Image Enhancement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 712–721. [Google Scholar]
  42. Guo, C.; Li, C.; Guo, J.; Cong, R.; Fu, H.; Han, P. Hierarchical Features Driven Residual Learning for Depth Map Super-Resolution. IEEE Trans. Image Process. 2019, 28, 2545–2557. [Google Scholar] [CrossRef] [PubMed]
