Depth Map Upsampling via Multi-Modal Generative Adversarial Network

Autonomous robots for smart homes and smart cities mostly require depth perception in order to interact with their environments. However, depth maps are usually captured at a lower resolution than RGB color images due to inherent sensor limitations. Naively increasing their resolution often leads to loss of sharpness and incorrect estimates, especially in regions with depth discontinuities or depth boundaries. In this paper, we propose a novel Generative Adversarial Network (GAN)-based framework for depth map super-resolution that preserves smooth areas as well as the sharp edges at the boundaries of the depth map. Our model is trained on two different modalities, namely color images and depth maps; at test time, however, it requires only the depth map to produce a higher-resolution version. We evaluated our model both quantitatively and qualitatively, and our experiments show that it performs better than existing state-of-the-art models.


Introduction
A standard digital camera captures a three-dimensional scene by projecting it onto a two-dimensional image plane. This process inevitably loses much information, particularly the depths, or distances between objects and the camera: small objects close to the camera appear much larger, and large objects far away appear much smaller, than they actually are. This is why depth information is crucial in many tasks that rely on visual perception, such as robot grasping, obstacle avoidance, and navigation, which are necessary for smart homes and smart cities.
With the availability of affordable depth cameras such as Microsoft Kinect sensors, many systems are now being supplemented with depth range information to solve various computer vision tasks. However, the depth maps produced by these sensors have significantly lower resolution than the color images due to the intrinsic physical constraints of the sensors [1,2]. We desire accurate and high-resolution depth maps in order to perform many robotic tasks effectively, especially those involving higher risks, such as autonomous robots and self-driving cars, where small errors could lead to large costs.
A simple way of increasing the resolution of the depth map is to use the high-resolution color image. First, the depth map and its corresponding color image need to be aligned and registered, since there is a small offset arising from the side-by-side placement of the range and image sensors.

Related Work
Common approaches for depth map super-resolution can be grouped into two categories. One uses a single depth map as the input, and the other uses RGB images together with the depth maps as input. Single depth map up-sampling is desirable for applications where privacy is necessary or for applications where it is sufficient to use only depth maps. Aodha et al. [7] collected a set of training images containing large numbers of low-resolution patches and high-resolution patches. They then performed patch matching in order to synthesize the super-resolved depth map. Hornacek et al. [8] further extended this patch-based matching strategy by exploiting patch-wise self-similarity structures across depth resolutions. Li et al. [9], on the other hand, proposed to consider semantic object composition as an auxiliary regularization for assembling the high-resolution depth outputs. For these types of approaches, the collected training data have a high influence on the model's performance.
Methods that use only single depth map inputs usually perform worse than methods that use both RGB images and depth maps as inputs. This is to be expected because of the additional high-resolution information obtained from the RGB images. Algorithms with RGB-D inputs can be further divided into filtering-based methods and learning-based ones. Filtering-based methods perform a weighted average of the neighboring pixels. The most common is the Bilateral Filter (BF) [10], an edge-preserving filter for image upsampling. Kopf et al. [11] proposed a Joint Bilateral Upsampling (JBU) framework that operates on both the high-resolution color image and the low-resolution depth map, leveraging the color image to preserve the edges of the depth map. Chan et al. [12] and Li et al. [13] extended JBU with a noise-aware characteristic that determines the effect of the color information on the upsampled depth map. Yang et al. [14] proposed a Joint Bilateral Filtering (JBF) method, which is similar to JBU. Kim et al. [15] developed a JBF framework that integrates numerical analysis of the local distributions in color images and depth maps. Liu et al. [16] also proposed an improved JBF method using geodesic distance for better-preserved depth edges. The above methods show promising results, but color or lighting variations may produce incorrect discontinuity regions when integrating color information into the upsampled depth map. Jung [17] proposed a filtering method that applies matching patterns between local color and depth patches to the filtering kernels. However, ambiguous pattern assignments caused prediction errors in their results.
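To make the filtering-based family concrete, the following is a minimal pure-Python sketch of joint bilateral upsampling in the spirit of [11]. It is a simplified toy version with illustrative names, a 3 × 3 low-resolution neighborhood, and a grayscale guide image; it is not any of the cited authors' implementations.

```python
import math

def joint_bilateral_upsample(depth_lr, guide_hr, factor, sigma_s=1.0, sigma_r=0.1):
    """Upsample a low-res depth map using a high-res grayscale guide image.

    depth_lr: 2D list (H x W); guide_hr: 2D list (factor*H x factor*W);
    values in [0, 1]. Returns a (factor*H x factor*W) depth map.
    """
    H, W = len(depth_lr), len(depth_lr[0])
    out = [[0.0] * (W * factor) for _ in range(H * factor)]
    for y in range(H * factor):
        for x in range(W * factor):
            # Corresponding (fractional) position in the low-res depth map.
            yl, xl = y / factor, x / factor
            num, den = 0.0, 0.0
            # 3x3 neighborhood in low-res coordinates.
            for j in range(max(0, int(yl) - 1), min(H, int(yl) + 2)):
                for i in range(max(0, int(xl) - 1), min(W, int(xl) + 2)):
                    # Spatial weight (distance in low-res coordinates).
                    ws = math.exp(-((j - yl) ** 2 + (i - xl) ** 2) / (2 * sigma_s ** 2))
                    # Range weight from the high-res guide image: compare the
                    # output pixel's guide value with the guide value at the
                    # neighbor's corresponding high-res position.
                    gy = min(H * factor - 1, j * factor)
                    gx = min(W * factor - 1, i * factor)
                    dr = guide_hr[y][x] - guide_hr[gy][gx]
                    wr = math.exp(-(dr * dr) / (2 * sigma_r ** 2))
                    w = ws * wr
                    num += w * depth_lr[j][i]
                    den += w
            out[y][x] = num / den
    return out
```

The range weight is what makes the filter edge-preserving: neighbors with a very different guide intensity contribute almost nothing, so depth values do not bleed across color edges.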
Choi et al. [18] proposed another filtering method that suppresses texture-transfer and depth-bleeding artifacts. They first applied region classification to split the color image into different regions and processed each region differently, using the Weighted Mode Filter (WMF) [19] as their main filter kernel. A problem with this approach is that depth edges in some regions are ambiguous due to color similarities. Lo et al. proposed a series of methods to improve this filtering approach [4,20,21] and reported promising results.
Learning-based methods, on the other hand, rely on extracting information from a training dataset that generalizes to new examples. Diebel and Thrun [22] formulated a multi-labeling optimization problem based on Markov Random Fields (MRF), which defines a consistency term that encourages consistency between depth values across resolutions and a smoothness term that encourages neighboring pixels with similar colors to have similar depth values. Revised MRF methods focusing on depth discontinuities were proposed in [3,23-25] to improve depth map super-resolution. Park et al. [26] improved the smoothness term by incorporating semi-local neighborhood information extracted from Non-Local Means (NLM) regularization and an edge weighting scheme that enhances color details. Other learning-based methods, such as [27], phrased the depth map up-sampling task as a convex optimization problem with higher-order regularization guided by anisotropic diffusion tensors extracted from high-resolution intensity images.
In this paper, we present a novel multi-modal GAN framework for the depth map up-sampling task. Our method combines the best of both worlds: we train on both the RGB color images and the depth maps, but require only a single depth map at test time. We show in our experiments that this approach outperforms several state-of-the-art baselines. Figure 1 shows an overview of our GAN-based depth map super-resolution framework. Our generator takes inputs from two modalities: low-resolution depth maps and low-resolution scene images. Since the scene images and depth maps are structurally very similar to each other, as shown in Figure 2, we can leverage the information from the scene images to preserve the discontinuity regions of the depth maps. To ensure that the generator outputs valid depth maps, we train it adversarially using a discriminator that tries to distinguish real depth maps sampled from the dataset from fake depth maps synthesized by the generator.

Proposed Framework
We first formally define the problem of depth map super-resolutions in Section 3.1. Next, we discuss the details of our model starting with generative adversarial networks in Section 3.2 followed by our loss functions (Section 3.3) and network architecture (Section 3.4). Lastly, we discuss our multi-modal mini-batch scheme in Section 3.5.

Problem Formulation
A standard camera and a range sensor capture a high-resolution color image x_i^HR and a low-resolution depth map x_d^LR. Suppose that we have a dataset with the corresponding ground truth high-resolution depth maps x_d^HR; our goal is to learn a function G : x_d^LR → x_d^HR that generates a high-resolution version of the low-resolution depth map. This is more commonly referred to as super-resolution. The function G is modeled as a convolutional neural network, which we refer to as the generator.

Generative Adversarial Network
The framework of Generative Adversarial Networks (GANs) introduces a discriminator D, which is a separate classifier, to guide the learning process of the generator G. This transforms the learning problem into a two-player minimax game, where the optimal solution is a Nash equilibrium.
The role of the discriminator is to learn to tell real images apart from fake ones. The generator G, on the other hand, is our super-resolution model: it generates high-resolution depth maps from its low-resolution input and tries to make them look as realistic as possible in order to trick the discriminator D into classifying the generated depth map as real. This is represented as a min-max optimization in the form shown in Equation (1), where p(x) represents the data distribution and x^HR and x^LR are the high-resolution and low-resolution inputs, respectively. The first term expresses the objective that the discriminator classify the high-resolution inputs x^HR from the dataset as real, while the second term expresses the objective that the super-resolved outputs G(x^LR) of the generator be classified as fake.
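In this notation, the minimax objective of Equation (1) takes the standard GAN form (a reconstruction from the description above; the original may differ in minor notation):

```latex
\min_G \max_D \;
\mathbb{E}_{x^{HR} \sim p(x^{HR})}\!\left[\log D\!\left(x^{HR}\right)\right]
+ \mathbb{E}_{x^{LR} \sim p(x^{LR})}\!\left[\log\!\left(1 - D\!\left(G\!\left(x^{LR}\right)\right)\right)\right]
```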
In the formulation of these networks, the generator G has access to the gradients of the discriminator D and therefore has some form of instruction as to how to improve itself. This enables the generator to learn how to produce realistic-looking depth maps that are indistinguishable from the ground truth depth maps.
At the beginning of the training process, the fake images generated by G are extremely poor and are rejected by D with high confidence. Therefore, when performing the optimization, it is better for G to maximize log(D(G(x^LR))) instead of minimizing log(1 − D(G(x^LR))). Both objectives result in the same fixed point, but log(D(G(x^LR))) provides stronger gradients in the early stages of learning.

Loss Function
Our loss function is a combination of two components. The first is a content loss, or reconstruction error, which measures how different the generated depth maps are from the ground truth depth maps. It is implemented as the mean absolute error, or L1 distance, between the generated and ground truth depth maps, as shown in Equation (2). As shown in previous works [5,6], this term alone produces blurry images because the solution tends towards the mean; however, it captures the low-frequency components well.
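Written in standard form, the content loss of Equation (2) and the combined objective of Equation (3) can be reconstructed as follows (the original equations may differ in normalization), where Ω is the set of pixels and λ weights the adversarial term:

```latex
\mathcal{L}_{\text{content}}
= \frac{1}{|\Omega|} \sum_{p \in \Omega}
  \left| G\!\left(x^{LR}\right)_p - x^{HR}_p \right|,
\qquad
\mathcal{L} = \mathcal{L}_{\text{content}} + \lambda\, \mathcal{L}_{\text{adv}}
```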
The second component is the adversarial loss defined in Section 3.2. This term encourages the outputs of G to reside on the manifold of the ground truth depth maps. It is also able to model the high-frequency components, as shown by [5,6], which makes it suitable for our problem. The final objective function is shown in Equation (3), where λ is a hyper-parameter that controls the relative importance of the two components.

Network Architecture
Figure 3 details the network architecture of our generator. Inspired by Ledig et al. [28] and Lim et al. [29], we use a single convolutional layer followed by a series of sixteen residual blocks. A skip-connection then combines the output of the residual blocks with the output of the initial convolutional layer using an element-wise sum. The idea behind this design is that the low-resolution depth map already contains many of the pixel values for the output, and the skip-connection eases the passing of information from the lower layers, which are closer to the original input, to the upper layers, which are closer to the output. We then use two sub-pixel convolutional layers, as proposed by [30], which up-sample the depth maps to four times their input resolution. We use the ReLU activation function after every convolutional layer, without any batch normalization layers. We remove batch normalization because it has been shown in [29] to be detrimental to super-resolution tasks, as it reduces range flexibility by normalizing the features.
As shown in Figure 4, the discriminator D is composed of a series of six convolutional layers, where every other layer has a stride of two that downsamples the image representations by half. This is followed by two fully-connected layers and a sigmoid activation function that outputs the probability of the input being real. All layers use LeakyReLU activations, and batch normalization layers are inserted after every convolutional layer except the first.
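The sub-pixel convolutional layers of the generator end in a depth-to-space (pixel-shuffle) rearrangement [30]. The following pure-Python sketch shows that rearrangement for a single feature map with r² channels; two such ×2 stages yield the ×4 up-sampling described above. Names are illustrative, not the authors' code.

```python
def pixel_shuffle(x, r):
    """Depth-to-space rearrangement used by sub-pixel convolution.

    x: 3D nested list of shape (H, W, r*r) -- r*r feature channels per pixel.
    Returns a 2D nested list of shape (r*H, r*W).
    """
    H, W = len(x), len(x[0])
    out = [[0.0] * (W * r) for _ in range(H * r)]
    for y in range(H):
        for w in range(W):
            for c in range(r * r):
                # Channel index c maps to an offset inside the r x r output cell.
                dy, dx = c // r, c % r
                out[y * r + dy][w * r + dx] = x[y][w][c]
    return out
```

The rearrangement is learning-free; all the modeling capacity sits in the convolution that produces the r² channels, which is why sub-pixel layers are cheaper than transposed convolutions at the same output resolution.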

Multi-Modal Mini-Batch
We would like to incorporate the scene image during the learning process; however, adding an explicit branch with convolutional layers to encode and extract features from RGB scene image would require us to use the RGB image during test time. This is undesirable since our goal is to enable our super-resolution model to handle depth maps independent from the scene image at test time. Inspired by multi-task learning and domain adaptation frameworks, we mix scene images with the depth images and perform super-resolution on both modalities during training. More specifically, we down-sample the scene image to have the same resolution as the depth map and convert it to gray scale. Now that they both have the same dimensions, we can combine them in a mini-batch and perform the training. This way, the network will be able to share what it learned from one domain (super-resolving scene images) and apply it to the other (super-resolving depth maps).
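A minimal sketch of the mini-batch mixing described above, assuming the scene images have already been down-sampled to the depth-map resolution. The function names, the standard luma weights for grayscale conversion, and the shuffling scheme are illustrative assumptions, not the authors' implementation.

```python
import random

def rgb_to_gray(img):
    """img: 2D list of (r, g, b) tuples in [0, 1]; standard luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in img]

def make_multimodal_batches(depth_maps, scene_images_gray, batch_size, seed=0):
    """Mix single-channel depth maps and grayscale scene images into shared
    mini-batches so the generator trains on both modalities."""
    pool = ([(d, "depth") for d in depth_maps]
            + [(s, "scene") for s in scene_images_gray])
    rng = random.Random(seed)
    rng.shuffle(pool)  # interleave the two modalities
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]
```

Because both modalities share one channel and one resolution, the same generator weights see both domains within a batch, which is what lets the depth branch benefit from scene-image structure without an RGB input at test time.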

Dataset
A standard dataset used widely in depth-related tasks is the NYU Depth dataset Version 2 [31]. The images and their corresponding depth maps were captured at 464 different locations covering a variety of indoor scenes, such as bathrooms, bedrooms, offices, and kitchens. We followed the same train-test setup as previous studies [3,4,21]. We downsampled the ground truth depth maps from 640 × 480 to 160 × 120 (by a factor of four) to obtain the low-resolution depth maps used as input by our model.
We also experimented on the Middlebury stereo dataset [32]. The ground truth depth maps were estimated using state-of-the-art stereo depth estimation methods that are able to handle high-resolution images. Similarly, we downsampled the depth maps by a factor of four to get our training inputs.
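The ×4 downsampling used to create the low-resolution training inputs can be sketched as simple average pooling. This is an assumption for illustration; the exact downsampling kernel is not specified in the text above.

```python
def downsample(depth, factor):
    """Average-pool a 2D depth map by an integer factor,
    e.g. 640 x 480 -> 160 x 120 with factor 4."""
    H, W = len(depth), len(depth[0])
    out = []
    for y in range(0, H, factor):
        row = []
        for x in range(0, W, factor):
            # Average each factor x factor block into one low-res pixel.
            block = [depth[y + j][x + i]
                     for j in range(factor) for i in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out
```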

Implementation Details
We trained our model on a single NVIDIA GTX 1080 GPU with a mini-batch size of 16. During training, we randomly cropped 96 × 96 patches from the low-resolution depth maps, and our model up-sampled them by a factor of four, producing 384 × 384 patches. Note that since our model is fully convolutional, it can handle arbitrary input sizes at test time. We used the Adam optimizer [33] with β1 = 0.9 and β2 = 0.999 to train our model, with an initial learning rate of 1e-4. We set λ = 1e-3 in our loss function. We implemented our models using the TensorLayer and TensorFlow frameworks.

Performance Evaluations
Following previous works [3,4,21,32], we quantitatively evaluated our model using the percentage of Bad matching Pixels (BP%), defined as the percentage of pixels whose difference from the ground truth value exceeds a pre-defined threshold. This is expressed mathematically in Equation (4), where Ω refers to the set of all pixels in the depth map. To compute BP%, the output depth maps have to be scaled to the same range. We used the same threshold (δ_d = 1) as previous works [3,4,21,32] to maintain a fair comparison. In addition to whole-image statistics ("all"), we also computed local statistics at non-occluded regions ("nonocc.") and at depth discontinuity regions ("disc."), as defined in [32,34].
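A minimal sketch of the BP% metric of Equation (4), assuming the predicted and ground truth depth maps are 2D arrays already scaled to a common range; the function name is illustrative.

```python
def bad_pixel_rate(pred, gt, delta=1.0):
    """Percentage of pixels whose absolute depth error exceeds delta."""
    pixels = [(p, g)
              for prow, grow in zip(pred, gt)
              for p, g in zip(prow, grow)]
    bad = sum(1 for p, g in pixels if abs(p - g) > delta)
    return 100.0 * bad / len(pixels)
```

For the "nonocc." and "disc." statistics, the same computation would simply be restricted to the corresponding pixel masks instead of the full set Ω.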
We evaluated our model's super-resolved depth maps at an upsampling factor of four. We compared against twelve state-of-the-art baselines, from (1) Diebel and Thrun [22] through (12) Lo et al. [21]. We used the outputs from the respective authors' publicly available code for the comparisons. However, there is no information on the parameter settings used in [12,22]; hence, we performed a simple grid search and report the best-performing results. Table 1 lists the performance of the various methods in terms of BP%. It can be observed that our method consistently performed best in terms of the percentage of bad matching pixels.
We also evaluated our model using the Mean Squared Error (MSE) with respect to the ground truth depth maps. Our method achieved state-of-the-art results, as shown in Table 2. MSE weights large differences more heavily than small ones, and we observe that these large differences usually occurred at the depth discontinuity regions ("disc."), since the local MSE values were significantly higher than the global MSE ("all"). Visually inspecting the super-resolved depth maps allows a further evaluation of the various models. Figures 5-14 show comparisons of super-resolved depth maps on several example patches at regions with depth discontinuities. Most methods failed to properly capture the geometry at the edges; this is most noticeable when zooming into the patches, where straight lines are no longer straight for the other methods, while our method successfully preserved these structures.

Conclusions
We proposed a multi-modal model based on generative adversarial networks for depth map super-resolution. Our method preserves detailed discontinuity regions, such as sharp edges, and respects the geometric structure of objects. Moreover, our model requires only a single depth map as input at test time, even though it is trained with both RGB images and depth maps. The results of our quantitative and qualitative experiments show that our method outperforms state-of-the-art depth map super-resolution methods, confirming its effectiveness.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: