Multiscale Attention Fusion for Depth Map Super-Resolution Generative Adversarial Networks

Color images have long served as important supplementary information to guide the super-resolution of depth maps. However, how to quantitatively measure the guiding effect of color images on depth maps has long been a neglected issue. To address this problem, inspired by the recent excellent results achieved in color image super-resolution by generative adversarial networks, we propose a depth map super-resolution framework with generative adversarial networks using multiscale attention fusion. Fusing the color features and depth features at the same scale with the hierarchical attention fusion module effectively measures the guiding effect of the color image on the depth map. The fusion of joint color-depth features at different scales balances the impact of different scale features on the super-resolution of the depth map. A generator loss function composed of content loss, adversarial loss, and edge loss helps restore clearer edges of the depth map. Experimental results on different types of benchmark depth map datasets show that the proposed multiscale attention fusion based depth map super-resolution framework achieves significant subjective and objective improvements over the latest algorithms, verifying the validity and generalization ability of the model.


Introduction
With the increasing emphasis on security, trustworthy artificial intelligence is on the rise. In trustworthy AI, various 3D applications play a crucial role in scene construction, in understanding the relationships between entities and the scene, and in reasoning about invisible factors outside the scene. In research on stereo image technology, the quality of depth maps is critical because depth values reflect the spatial positions of objects in the scene. However, the resolution of depth maps remains low due to the limited capture capability of depth sensors. Therefore, depth map super-resolution (SR) has become an urgent problem to be solved.
Due to the limited information contained in a single depth map, the corresponding high-resolution (HR) color image is generally used to guide the super-resolution of depth maps. Conventional methods use filters or Markov Random Fields (MRF) to implement depth map super-resolution with the guidance of the color image. Leveraging the HR color image and the given low-resolution (LR) depth map, Kopf et al. [1] proposed a joint bilateral filter (JBU) which combines a range filter and a spatial filter to produce very good full-resolution results. Diebel and Thrun [2] first formulated depth map SR as a multi-labeling optimization problem based on the MRF model, using the consistency between the color image and the depth map as the balance factor of the smoothness term.
In recent years, owing to the rapid development of convolutional neural networks, color-guided depth map super-resolution methods based on convolutional neural networks have achieved remarkable results. Hui et al. [3] proposed a multiscale guided convolutional network (MSG-Net) for depth map super-resolution which complements low-resolution depth features with HR intensity features using a multiscale fusion strategy. Ye et al. [4] constructed a convolutional neural network architecture to learn a binary map of depth edge locations from a low-resolution depth map and the corresponding color image, and then proposed a fast edge-guided depth filling strategy to interpolate the missing depth.
However, most color-guided depth map super-resolution methods use color images directly; how to quantitatively measure the guiding effect of color images on depth map super-resolution has received little attention from researchers. In this paper, we propose a depth map super-resolution framework that uses hierarchical attention fusion modules to measure the guidance of color features on depth features. Inspired by the recent emergence of excellent color image super-resolution generative adversarial networks such as SRGAN [5] and ESRGAN [6], our framework uses a relativistic standard generative adversarial network as the backbone. In particular, a generator loss function that includes content loss, adversarial loss, and edge loss helps the proposed generative adversarial network produce clearer edges of the depth map.
Our main contributions are as follows: (1) We propose a depth map super-resolution framework with multiscale attention fusion based generative adversarial networks to quantitatively measure the effectiveness of color images as a guide to depth map super-resolution. (2) The hierarchical color-depth attention fusion module measures the guidance of the color image on the depth map super-resolution and generates fused features at various scales. (3) The multiscale fused feature balance module evaluates the correlation between scales and fused features, and integrates fused color-depth features of various scales proportionally. (4) A loss function consisting of content loss, adversarial loss, and edge loss helps our method produce clearer edges of the depth map.
We organize the remainder of this paper as follows. After a brief review of related literature in Section 2, we present the framework and introduce the details of our method in Section 3. In Section 4, we conduct an ablation study and comparison experiments on benchmark depth map datasets, and discuss the performance of our method compared to other methods. In Section 5, we discuss limitations and future directions. Finally, we conclude the paper in Section 6.

Related Works
In this section, we review color-guided depth map super-resolution methods and generative adversarial network based methods for color image super-resolution.

Conventional Color-Guided Depth Map Super-Resolution
Conventional color-guided depth map super-resolution methods can be divided into three categories: filter-based methods, MRF-based methods, and sparse-representation-based methods.
Filter-based methods [1,7-13] aim to construct upsampling filters that enhance the depth map resolution with the guidance of the registered color image. Leveraging the HR color image and the given low-resolution depth map, Kopf et al. [1] proposed a joint bilateral filter (JBU) which combines a range filter and a spatial filter to produce very good full-resolution results. In [8], Kim et al. proposed a modified JBU called JABDU that computes each depth value as the average of neighboring pixels weighted by color and depth intensity filters, which are formulated as an adaptive smoothing parameter and a control parameter, respectively. Inspired by the geodesic distance, Liu et al. [9] upsampled the depth map by computing geodesic paths from each pixel to the pixels whose depths are known from the low-resolution input. A weighted mode filter (WMF) is proposed in [10] that seeks the global mode of a histogram whose weights reflect the color similarity between reference and neighboring pixels in the color image. Furthermore, Fu et al. [11] incorporated a noise-aware filter (NAF) into a WMF. To reduce artifacts such as texture copy and edge discontinuities, Lo et al. [12] constructed a joint trilateral filtering (JTF) algorithm for depth image SR that considers spatial distance, color difference, and local depth gradient simultaneously to better preserve contour information. Filter-based depth map SR methods can remove external and internal noise of the depth map while preserving its contour features. However, under color-image guidance, these methods can produce texture copy artifacts in smooth regions of the depth map that correspond to richly textured regions of the color image.
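The JBU weighting described above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not Kopf et al.'s original code: the guide is assumed to be a single-channel intensity image, and `radius`, `sigma_s`, and `sigma_r` are placeholder parameter values.

```python
import numpy as np

def joint_bilateral_upsample(depth_lr, guide_hr, scale, radius=2,
                             sigma_s=1.0, sigma_r=0.1):
    """Minimal joint bilateral upsampling (JBU) sketch.

    Each HR depth value is a weighted average of LR depth neighbours:
    a spatial Gaussian on the LR-grid distance times a range Gaussian
    on intensity differences in the HR guide image.
    """
    h_lr, w_lr = depth_lr.shape
    h_hr, w_hr = guide_hr.shape
    out = np.zeros((h_hr, w_hr))
    for y in range(h_hr):
        for x in range(w_hr):
            yl, xl = y / scale, x / scale          # position on the LR grid
            wsum, vsum = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    qy = int(round(yl)) + dy
                    qx = int(round(xl)) + dx
                    if not (0 <= qy < h_lr and 0 <= qx < w_lr):
                        continue
                    # spatial weight on the LR grid
                    ws = np.exp(-((qy - yl) ** 2 + (qx - xl) ** 2)
                                / (2 * sigma_s ** 2))
                    # range weight from the HR guide image
                    gy = min(int(qy * scale), h_hr - 1)
                    gx = min(int(qx * scale), w_hr - 1)
                    wr = np.exp(-(guide_hr[y, x] - guide_hr[gy, gx]) ** 2
                                / (2 * sigma_r ** 2))
                    w = ws * wr
                    wsum += w
                    vsum += w * depth_lr[qy, qx]
            out[y, x] = vsum / wsum if wsum > 0 else depth_lr[int(yl), int(xl)]
    return out
```

The range weight `wr` is what lets the HR color edges steer the depth upsampling, and it is also the mechanism behind the texture copy artifacts discussed above.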
Optimization-based single depth map SR methods can generally be divided into two classes: Markov Random Field (MRF) based algorithms [2,14-19] and sparse representation based algorithms. Diebel and Thrun [2] first formulated depth map SR as a multi-labeling optimization problem based on the MRF model. The method in [15] extends the MRF model with a novel data term that allows adaptive pixel-wise determination of an appropriate depth reference value. In [14], Zuo et al. proposed a method to quantitatively measure the inconsistency between the depth edge map and the color edge map and explicitly embedded the measurement into the smoothness term of the MRF model. Utilizing the edges of the low-resolution depth image through an MRF optimization in a patch-synthesis-based manner, Xie et al. [17] constructed a high-resolution edge map to guide the upscaling of the depth map. By solving an MRF labeling optimization problem, Lo et al. presented a learning-based depth map super-resolution framework in [12] that preserves the edges of range data while suppressing texture copy artifacts due to color discontinuities. Compared with filter-based methods, optimization-based methods are more robust to noise. However, when edges in a depth map correspond to smooth regions of the color image, blurred edge artifacts can be generated in the SR process due to the inconsistency between the edges of the depth map and the color image at the same location.
Many sparse representation-based depth map SR methods [20-25] have been proposed in the last few years. They usually cut HR color images and LR depth maps into patches and bind them in pairs to train a dictionary; the depth map SR solution can then be represented as a linear combination of elements of the learned dictionary. Ferstl et al. [21] presented a variational sparse representation approach using a dictionary of edge priors learned from an external database of high- and low-resolution examples. In [22], Xie et al. reconstructed the corresponding HR depth map through a robust coupled dictionary learning method with locality coordinate constraints; simultaneously, an adaptively regularized shock filter is introduced to sharpen the contours and reduce jagged noise. Zhang et al. proposed a dual sparsity model based single depth map SR method combining the analysis model and the synthesis model in [24]. As this category of methods utilizes large numbers of depth map patches in the training stage, their performance heavily relies on the selection of external datasets. In addition, some sparse representation based depth map SR methods suffer from blurred edge artifacts on depth edges or on overlapping regions of adjacent patches, similar to the optimization-based depth map SR methods. Single depth map SR methods can achieve promising performance in preserving depth contours while alleviating depth map noise. However, they can produce texture copy artifacts and blurred edge artifacts derived from depth discontinuities that are not consistent with color discontinuities at the corresponding positions.

Neural-Networks-Based Depth Map Super-Resolution
Depth map super-resolution methods based on neural networks have achieved promising success [3,4,26,27]. The authors of [3] proposed a multiscale guided convolutional network (MSG-Net) for depth map super-resolution which complements low-resolution depth features with HR intensity features using a multiscale fusion strategy. Ye et al. [4] constructed a convolutional neural network architecture to learn a binary map of depth edge locations from a low-resolution depth map and the corresponding color image. They then proposed a fast edge-guided depth filling strategy to interpolate the missing depth constrained by the acquired edges, preventing prediction across depth boundaries. Wang et al. [26] proposed a novel depth upsampling framework based on deep edge-aware learning which first learns edge information of depth boundaries from the known LR depth map and its corresponding high-resolution (HR) color image as reconstruction cues. Then, two depth restoration modules, i.e., a fast depth filling strategy and a cascaded restoration network, recover an HR depth map by leveraging the predicted edge map and the HR color image. In [28], Zuo et al. proposed a novel DCNN to progressively reconstruct the high-resolution depth map guided by the intensity image; specifically, multiscale intensity features are extracted to guide the refinement of the depth features as their resolutions are gradually enhanced. In [27], Zuo et al. used a novel depth-guided affine transformation to filter out unrelated intensity features, which are in turn used to refine the depth features. Since the quality of the initial depth features is low, the depth-guided intensity feature filtering and the intensity-guided depth feature refinement are performed iteratively, progressively promoting both tasks.
Images at different scales contain different feature information [3]. However, the guidance of color image features at different scales on depth map super-resolution is not equal, so it is not appropriate to simply concatenate or link them. To the best of our knowledge, quantitative evaluation of the correlation between feature scales and depth map super-resolution has not been discussed. In this article, we use a multiscale fused feature balance module to measure the correlations between different scale features and depth map super-resolution, and further fuse the color-depth features at different scales proportionally.

Generative Adversarial Network Based Color Image Super-Resolution
Super-resolution methods for color images based on generative adversarial networks [5,6,29-31] generate realistic high-resolution color images through successive iterations of mutual adversaries between generators and discriminators.
Denton et al. [29] introduced a generative parametric model capable of producing high-quality samples of natural images. It uses a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion. At each level of the pyramid, a separate generative convnet is trained using the generative adversarial network (GAN) approach of Goodfellow et al. Samples drawn from their model are of significantly higher quality than those of alternative approaches. The key idea of [30] is to grow both the generator and discriminator progressively: starting from a low resolution, new layers that model increasingly fine details are added as training progresses. This both speeds up training and greatly stabilizes it, producing images of unprecedented quality. Ledig et al. [5] presented SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To their knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, they proposed a perceptual loss function consisting of an adversarial loss and a content loss. The adversarial loss pushes the solution toward the natural image manifold using a discriminator network trained to differentiate between super-resolved images and original photo-realistic images. SRGAN is a seminal work capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied by unpleasant artifacts. To further enhance the visual quality, Wang et al. [6] thoroughly studied three key components of SRGAN (network architecture, adversarial loss, and perceptual loss) and improved each of them to derive an enhanced SRGAN (ESRGAN).
Excellent generative adversarial network methods for color image super-resolution have emerged. However, because they are designed to hallucinate rich textures, they produce artifacts that are undesirable in depth maps, making them unsuitable for direct use in depth map super-resolution. Therefore, considering the sharp edges and smooth interiors of depth maps, we propose a multiscale attention fusion based super-resolution generative adversarial network for depth maps. In particular, a generator loss function that includes content loss, adversarial loss, and edge loss facilitates the generation of sharper edges.

Multiscale Attention Fusion for Depth Map Super-Resolution Generative Adversarial Networks
In this section, we describe the proposed multiscale attention fusion framework for depth map super-resolution generative adversarial networks.

Framework
The framework of our proposed method is shown in Figure 1. Our goal is to generate a precise high-resolution depth estimation D HR of the ground truth D G . The generator consists of four parts: a multiscale color and depth feature extraction module, a hierarchical feature attention fusion module, a multiscale fused feature balance module, and a super-resolution module. The multiscale color and depth feature extraction module extracts features at different scales using a low-resolution depth map and the corresponding color image as inputs. It consists of two convolutional layers and n residual dense blocks (RDBs), where n is the number of feature extraction scales. The settings of the RDBs are consistent with those in [32]. The depth feature and color feature passing through the ith RDB are denoted F i D and F i I , respectively. Previous methods directly concatenate depth features and color features. However, the guidance of color features on depth features should not be a simple link; how to quantitatively measure this guidance is a key issue. In this article, we propose using an attention module to measure the guiding effect of color features on depth features. F i D and F i I form a color-depth fused feature F i f at the ith scale through the attention module. In this way, we obtain color-depth fused features F 1 f , F 2 f , . . ., F n f at n scales. Images at different scales contain different geometric structures, and the contribution of fused features at different scales to depth map super-resolution is not equal. We input F 1 f -F n f into the multiscale fused feature balance module to evaluate the correlations between the scales and fused features, and obtain a final fused feature F f . We adopt the same UPNet as [32] for the super-resolution module of the generator. The high-resolution depth map D HR is generated from F f through UPNet.
Figure 1. Framework of the multiscale attention fusion for depth map super-resolution generative adversarial networks. D LR and I LR are the low-resolution depth map and the corresponding downsampled color image. D HR is the high-resolution depth map generated by the generator of our proposed GAN. D G is the ground truth depth map.

Hierarchical Color and Depth Attention Fusion Module
The details of the proposed hierarchical color and depth attention fusion module are shown in Figure 2. Before inputting them into the module, we first concatenate the color feature F i I and the depth feature F i D at the ith scale to form the merged feature F i C . Then, F i C is fed into global average pooling and global sum pooling, respectively. Each pooling branch is followed by two convolutional layers and one ReLU. By processing the convolutional features through the sigmoid function, two coefficient matrices are obtained. By summing the two coefficient matrices and splitting the result, we obtain the weight coefficient vector C i of F i I and F i D as in Equation (1):
C i = f att (F i C ), (1)
where f att denotes the color and depth attention fusion module. Multiplying F i I and F i D element-wise by the corresponding halves of the coefficient vector C i , we obtain the color-depth fused feature F i f at the ith scale as in Equation (2):
F i f = C i ⊙ F i C . (2)
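To make the data flow concrete, the module can be sketched in NumPy, with channel-mixing matrices standing in for the 1x1 convolutional layers. This is an illustrative sketch under our own simplifying assumptions, not the trained network: `w_a1`, `w_a2`, `w_s1`, and `w_s2` are hypothetical weight matrices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_fusion(f_color, f_depth, w_a1, w_a2, w_s1, w_s2):
    """Hierarchical color-depth attention fusion at one scale (sketch).

    f_color, f_depth: (C, H, W) feature maps at the same scale.
    w_*: (2C, 2C) channel-mixing weights standing in for the two
    convolutional layers in each pooling branch.
    """
    f_cat = np.concatenate([f_color, f_depth], axis=0)   # merged feature (2C, H, W)
    # two pooling branches squeeze spatial dims into channel descriptors
    d_avg = f_cat.mean(axis=(1, 2))                      # global average pooling
    d_sum = f_cat.sum(axis=(1, 2))                       # global sum pooling
    # conv -> ReLU -> conv, then sigmoid, in each branch
    a = sigmoid(w_a2 @ np.maximum(w_a1 @ d_avg, 0.0))
    s = sigmoid(w_s2 @ np.maximum(w_s1 @ d_sum, 0.0))
    # sum the two coefficient vectors, then split into color/depth halves
    c = a + s                                            # weight vector C_i, (2C,)
    c_color, c_depth = np.split(c, 2)
    # channel-wise reweighting of both feature streams gives the fused feature
    fused = np.concatenate([f_color * c_color[:, None, None],
                            f_depth * c_depth[:, None, None]], axis=0)
    return fused
```

The learned coefficient vector is what quantifies how strongly the color stream is allowed to guide the depth stream at this scale.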

Multiscale Fused Feature Balance Module
After obtaining the color-depth fused features F i f from the n attention modules, we concatenate them and denote the result as F C f . Then F C f is fed to the multiscale fused feature balance module, which produces the balance factor vector W f :
W f = f bal (F C f ),
where W f is a vector of balance factors and f bal is the multiscale fused feature balance module. The multiscale fused feature balance module evaluates the correlations between the scales and the fused features, as shown in Figure 3. It consists of two branches, which start with a global average pooling and a global sum pooling, respectively, each followed by two convolutional layers, a ReLU layer, and a sigmoid function. F C f generates two weight coefficient matrices through these two branches. The two weight coefficient matrices are summed and split to obtain W f . The balanced multiscale color-depth feature F f is generated by multiplying the concatenated sequence F C f with the corresponding balance factor vector as in Equation (5):
F f = W f ⊙ F C f . (5)
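The balance module has the same squeeze-and-reweight shape as the attention module, but it operates on the concatenation of all n scales at once. The following NumPy sketch is illustrative only; `w1`, `w2`, `v1`, and `v2` are hypothetical channel-mixing matrices standing in for the convolutional layers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def balance_fusion(fused_per_scale, w1, w2, v1, v2):
    """Multiscale fused-feature balance module (sketch).

    fused_per_scale: list of n (C, H, W) color-depth fused features,
    one per scale. Returns the balanced concatenated feature F_f.
    """
    f_c = np.concatenate(fused_per_scale, axis=0)   # F_Cf, (n*C, H, W)
    # channel descriptors from the two pooling branches
    d_avg = f_c.mean(axis=(1, 2))                   # global average pooling
    d_sum = f_c.sum(axis=(1, 2))                    # global sum pooling
    a = sigmoid(w2 @ np.maximum(w1 @ d_avg, 0.0))
    s = sigmoid(v2 @ np.maximum(v1 @ d_sum, 0.0))
    w_f = a + s                                     # balance factor vector W_f
    # proportional fusion: each channel (hence each scale) is reweighted
    return f_c * w_f[:, None, None]
```

Because channels belonging to different scales get different factors, the module effectively learns how much each scale should contribute to the final super-resolved depth map.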

Relativistic Standard Generative Adversarial Networks
In the standard GAN, the discriminator outputs the probability that the input image is real in order to determine whether the input is real or fake. The type of GAN is defined by its discriminator. In general, the discriminator loss of a standard GAN with cross-entropy loss can be expressed as
L D = -E[log(D(x r ))] - E[log(1 - D(x f ))],
where x r and x f indicate the real depth map and the fake one, respectively. The adversarial loss of the generator is expressed as
L G = -E[log(D(x f ))],
where D(x) is obtained by applying the sigmoid activation function to the non-transformed discriminator output C(x), as in Equation (9):
D(x) = sigmoid(C(x)). (9)
In the discriminator, D G and D HR are input as x r and x f , respectively. Because the gradient of the real-data term with respect to the generator is zero, only the fake samples are involved in updating the generator during training.
In this paper, we adopt the relativistic standard GAN (RGAN) [33] structure to achieve full participation of the generator. The discriminator of RGAN estimates the probability that given real data are more realistic than randomly sampled fake data, as in Equation (10):
D(x r , x f ) = sigmoid(C(x r ) - C(x f )). (10)
Correspondingly, the loss of the discriminator is expressed as
L D = -E[log(sigmoid(C(x r ) - C(x f )))],
and the adversarial loss of the generator is expressed as Equation (14):
L G = -E[log(sigmoid(C(x f ) - C(x r )))]. (14)
In this way, both real and fake samples contribute gradients to the generator.
The discriminator extracts features using RDBs and then performs classification using the sigmoid function to determine whether the input depth map is fake or real. Compared to the standard GAN, the relativistic GAN can generate high-resolution depth maps from relatively small numbers of samples. Furthermore, the training time needed to achieve optimal performance is significantly reduced.
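The difference between the two adversarial formulations can be sketched in NumPy. This is illustrative only; `c_real` and `c_fake` stand for the non-transformed discriminator outputs C(x r ) and C(x f ).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgan_losses(c_real, c_fake):
    """Standard GAN losses on raw critic outputs C(x) (sketch)."""
    d_loss = -np.log(sigmoid(c_real)) - np.log(1.0 - sigmoid(c_fake))
    # gradient of the generator loss touches only the fake samples
    g_loss = -np.log(sigmoid(c_fake))
    return d_loss.mean(), g_loss.mean()

def rgan_losses(c_real, c_fake):
    """Relativistic standard GAN losses (sketch): the discriminator scores
    how much more realistic the real sample is than the fake one."""
    d_loss = -np.log(sigmoid(c_real - c_fake))
    # both real and fake outputs appear, so the full generator participates
    g_loss = -np.log(sigmoid(c_fake - c_real))
    return d_loss.mean(), g_loss.mean()
```

Note that in `rgan_losses` the real critic output `c_real` appears inside the generator loss as well, which is exactly the "full participation" property exploited here.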
Depth maps are piecewise-smooth images characterized by sharp edges and smooth interiors. Conventional GAN-based color image super-resolution methods that use only the mean squared error (MSE) as content loss are therefore not suitable for depth map super-resolution. In order to improve the edge sharpness of the generated high-resolution depth maps, we propose a loss function consisting of content loss, adversarial loss, and edge loss:
L = L content + µ L adv + γ L edge ,
where µ and γ are the scale factors that balance the adversarial loss and the edge loss.
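A combined loss of this shape can be sketched as follows, with the edge term computed from Sobel gradient magnitudes. The choice of Sobel as the edge operator and the values of `mu` and `gamma` are our illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def sobel_edges(img):
    """Gradient magnitude of a 2-D image via Sobel filtering (sketch)."""
    kx = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = img[y - 1:y + 2, x - 1:x + 2]
            gx[y, x] = (patch * kx).sum()
            gy[y, x] = (patch * ky).sum()
    return np.hypot(gx, gy)

def generator_loss(d_sr, d_gt, adv_loss, mu=1e-3, gamma=1e-1):
    """Total generator loss = content + mu*adversarial + gamma*edge (sketch).

    d_sr: generated HR depth map; d_gt: ground-truth depth map;
    adv_loss: precomputed adversarial term. mu/gamma are placeholders.
    """
    content = np.mean((d_sr - d_gt) ** 2)                       # MSE content loss
    edge = np.mean(np.abs(sobel_edges(d_sr) - sobel_edges(d_gt)))  # L1 on edges
    return content + mu * adv_loss + gamma * edge
```

The edge term penalizes the generator whenever the gradient structure of the output drifts from that of the ground truth, which is what pushes the network toward sharper depth discontinuities.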

Parameter Setting
We train our network with 80 color-depth pairs. In the training dataset, 52 color-depth pairs are from the Middlebury dataset and the others are from the MPI Sintel depth dataset. The color images are downsampled to the corresponding scale factor by interval interpolation. The patch size is 128 × 128 and the batch size is 16. To enrich data diversity, we flip the patches horizontally and vertically, and rotate them by 90°. The kernel size of all convolutional layers is 3 and the number of feature-map channels is 64. We use the ReLU function as the activation after all convolutional layers. Adam is used as the optimizer with β 1 = 0.9 and β 2 = 0.999. Our proposed method is implemented on two Nvidia RTX 2080ti GPUs. We train our network for 1000 epochs; the initial learning rate is 10^-4 and is halved every 200 epochs.
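The flip-and-rotate augmentation can be sketched in NumPy. This is an illustrative sketch; the exact set of variants used in training (e.g. whether flips are also applied after rotation) is an assumption.

```python
import numpy as np

def augment(patch):
    """Flip/rotate augmentation of a training patch (sketch).

    Returns the patch at 0 and 90 degree rotations, each also flipped
    horizontally and vertically, for 6 variants in total.
    """
    variants = []
    for k in range(2):                 # rotations by 0 and 90 degrees
        rot = np.rot90(patch, k)
        variants += [rot, np.fliplr(rot), np.flipud(rot)]
    return variants
```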

Datasets Training Datasets
In the training phase, we use two datasets: the Middlebury datasets and the MPI Sintel depth dataset. The Middlebury datasets [34] are stereo datasets widely used in applications related to stereo matching, 3D reconstruction, and stereo quality evaluation. The MPI Sintel depth dataset [35] is a synthetic stereo dataset which provides naturalistic video sequences. The depth values in the MPI Sintel depth dataset are returned from Blender with an additional Z-buffer pass, similar to the optical flow.

Testing Datasets
Among the Middlebury Stereo Datasets [34], we use six color-depth pairs as the testing samples. They are Art, Books, Moebius, Reindeer, Laundry, and Dolls. To better demonstrate the effectiveness of our method, we also conduct experiments on the Multiview depth (MVD) test sequences [36] and ToFMark dataset [37].
The multi-view depth (MVD) test sequences consist of multi-view video sequences and corresponding per-pixel depth information to support flexible synthesis of virtual views during rendering. They are widely used in studies on 3D applications such as free-viewpoint video, binocular stereoscopic video, and glasses-free stereoscopic 3D video, making MVD one of the most promising forms of 3D video data representation today.
The ToFMark dataset contains three real-scene depth maps captured by ToF sensors. The low-resolution depth maps were acquired with a PMD Nano ToF camera at a resolution of 120 × 160, and the high-resolution color images were acquired with a CMOS camera at a resolution of 810 × 610.

Evaluation Metrics
Many objective evaluation criteria have been proposed for reconstructed and enhanced images [38,39]. In this paper, we adopt three metrics to evaluate the performance of our proposed method in depth map super-resolution: RMSE, MAD, and PSNR. RMSE stands for root mean squared error, as in Equation (15):
RMSE = sqrt( (1/N) Σ p (D HR (p) - D G (p))^2 ), (15)
where N is the number of pixels. MAD represents the mean absolute difference, described by Equation (16):
MAD = (1/N) Σ p |D HR (p) - D G (p)|. (16)
The peak signal-to-noise ratio (PSNR) is also a commonly used objective criterion for evaluating image quality:
PSNR = 10 log 10 (MAX^2 / MSE),
where MAX is the peak value of the depth map and MSE is the mean squared error in Equation (18), which is the square of the RMSE:
MSE = (1/N) Σ p (D HR (p) - D G (p))^2 . (18)
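The three metrics can be written directly in NumPy. A straightforward sketch; `peak` is the maximum representable depth value and defaults to 255 here as an assumption about 8-bit depth maps.

```python
import numpy as np

def rmse(d_sr, d_gt):
    """Root mean squared error between generated and ground-truth depth."""
    return np.sqrt(np.mean((d_sr - d_gt) ** 2))

def mad(d_sr, d_gt):
    """Mean absolute difference between generated and ground-truth depth."""
    return np.mean(np.abs(d_sr - d_gt))

def psnr(d_sr, d_gt, peak=255.0):
    """Peak signal-to-noise ratio in dB; MSE is the square of the RMSE."""
    mse = np.mean((d_sr - d_gt) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

Lower RMSE and MAD are better, while higher PSNR is better, which matches how the comparison tables below are read.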

Comparison of Different Numbers of RDBs
In this subsection, we explore the effect of the number of scales in the multiscale fusion attention module on the performance of depth map super-resolution. We evaluated four numbers of RDBs on the Middlebury datasets: 10, 16, 20, and 22. The GAN type is RGAN and the generator loss is set to content loss + edge loss. The experimental results are shown in Table 1. We can see that as the number of RDBs increases, the RMSE of the generated depth map decreases. However, after the number of scales exceeds 16, the depth map super-resolution results do not improve significantly. Considering the increased storage and computation costs, we regard 16 as the most reasonable number of RDBs.

Comparison of GAN Types
In this subsection, we compare the depth map SR results with different kinds of GANs. Table 2 shows the experimental results of our proposed method with a standard GAN and with RGAN. We choose 16 as the number of scales in the multiscale fusion attention module and MSE + edge loss as the generator loss. It can be seen that the RMSE of our method with RGAN is better than that with the standard GAN, showing that RGAN allows our method to generate high-resolution depth maps closer to the real ones.

Comparison of Generator Losses
In this subsection, we compare the experimental results with generator losses of MSE loss and MSE loss + edge loss (in addition to the adversarial loss) to verify the necessity of the edge loss. General color image super-resolution generative adversarial networks reconstruct images based on the MSE loss, which yields objective results closer to the ground truth. However, images generated in this way are not perceptually closest to the ground truth. Therefore, we propose a generator loss function that includes an edge loss to account for the sharp edges and smooth interiors characteristic of depth maps. As shown in Table 3, the RMSE of the depth maps generated by RGAN with edge loss is very close to that of the depth maps generated by a network containing only the MSE loss. However, Figure 4 shows the comparison of two sets of super-resolution results on Art. The network containing the edge loss generates high-resolution depth maps with clearer edges than the GAN containing only the MSE loss, verifying the effectiveness of the edge loss in generating perceptually high-quality depth maps.

Experimental Results on Middlebury Datasets
Our baseline state-of-the-art methods are joint bilateral upsampling (JBU) [1], noise-aware filter (NAF) [40], anisotropic diffusion [41], Markov random field (MRF) [2], guided image filtering (GIF) [42], SRF from [43], edge weighted NLM regularization (Edge) [44], joint geodesic filtering (JGF) [9], and total generalized variation (TGV) from [37]; four deep learning methods: SRCNN from [45], deep joint image filter (DJIF) from [46], deep edge-aware network (DSR) from [26], and cross-guided network for depth map enhancement (CGN) from [27]; two GAN-based color image super-resolution methods: the super-resolution generative adversarial network (SRGAN) from [5] and enhanced SRGAN (ESRGAN) from [6]; and the dictionary learning method JESR from [20]. We set the number of RDBs to 16 and the GAN type to RGAN in our method. The depth map upscaling factors are set to 2, 4, 8, and 16. In Tables 4 and 5, we can see that both DSR and CGN obtain top-ranked experimental results. Compared with the two color image super-resolution GAN methods, our proposed method achieves the lowest RMSE and MAD. This is because SRGAN and ESRGAN are designed for color images, with structures that produce more texture, which does not suit the internally smooth nature of depth maps. Figure 5 shows the visual comparison of the state-of-the-art baselines with our method. It can be seen that our method produces clearer and sharper edges, and avoids blurred-edge and texture-transfer artifacts.
Figure 5. Visual comparison on Art: (e) TGV [37], (f) ESRGAN [6], (g) DJIF [46], (h) DSR [26], (i) CGN [27], and (j) ours.

Experimental Results on Real Datasets
Since depth maps are acquired by depth sensors in real scenes, we not only compare experimental results on the Middlebury datasets, but also conduct experiments and comparisons on real scene depth map datasets. In this article, we selected the ToFMark dataset captured by the ToF sensor and the multi-view depth (MVD) test sequences [36] as the test sets. Our comparison methods are bicubic, joint geodesic filtering (JGF) [9], total generalized variation (TGV) from [37], SRGAN from [5], enhanced SRGAN (ESRGAN) from [6], deep joint image filter (DJIF) from [46], deep edge-aware network (DSR) from [26], and cross-guided network for depth map enhancement (CGN) from [27]. The depth map upscaling factors are set to 2, 4, 8, and 16.
Tables 6 and 7 present the quantitative depth upsampling results on the ToFMark dataset and the MVD dataset, respectively. Our proposed method shows the best objective performance among the state-of-the-art methods.

Discussion
In this section, we briefly discuss our proposed method and directions for future work. In edge regions of the depth map that correspond to smooth regions of the guiding color image, the generated high-resolution depth map occasionally exhibits edge blurring. In the future, we will focus on introducing color image edges aligned with the edges of the depth map into the framework to achieve more accurate depth map super-resolution and sharper edges.

Conclusions
In this paper, we propose a multiscale attention fusion based depth map super-resolution generative adversarial network for 3D reconstruction in trustworthy AI. Specifically, a hierarchical color-depth attention fusion module measures the guidance of the color image on the depth map super-resolution and generates fused features at various scales. The multiscale fused feature balance module evaluates the correlation between scales and fused features, and integrates the fused color-depth features at different scales proportionally. By constructing a loss function consisting of content loss, adversarial loss, and edge loss, our proposed generative adversarial network produces high-resolution depth maps with sharper edges. The robustness and generalization of the model are demonstrated by extensive experiments showing satisfactory subjective and objective results on several types of depth map datasets.
Author Contributions: Conceptualization, methodology, software, validation, writing-original draft preparation, D.X.; writing-review and editing, supervision, funding acquisition, X.F.; project administration, W.G. All authors have read and agreed to the published version of the manuscript.