Latent 3D Volume for Joint Depth Estimation and Semantic Segmentation from a Single Image

This paper proposes a novel 3D representation, namely, a latent 3D volume, for joint depth estimation and semantic segmentation. Most previous studies encoded an input scene (typically given as a 2D image) into a set of feature vectors arranged over a 2D plane. However, considering the real world is three-dimensional, this 2D arrangement reduces one dimension and may limit the capacity of feature representation. In contrast, we examine the idea of arranging the feature vectors in 3D space rather than in a 2D plane. We refer to this 3D volumetric arrangement as a latent 3D volume. We will show that the latent 3D volume is beneficial to the tasks of depth estimation and semantic segmentation because these tasks require an understanding of the 3D structure of the scene. Our network first constructs an initial 3D volume using image features and then generates latent 3D volume by passing the initial 3D volume through several 3D convolutional layers. We apply depth regression and semantic segmentation by projecting the latent 3D volume onto a 2D plane. The evaluation results show that our method outperforms previous approaches on the NYU Depth v2 dataset.


Introduction
Semantic 3D scene reconstruction, in which a scene is geometrically and semantically analyzed, is one of the challenging and crucial problems in the field of computer vision. Understanding a surrounding environment is valuable for many applications such as remote sensing, robotics, augmented reality, and human-computer interaction. Early studies tackled this problem using a combination of 3D reconstruction and image-based recognition techniques. For example, several methods using Simultaneous Localization and Mapping (SLAM) or Structure from Motion (SfM) with 2D semantic segmentation have been proposed [1][2][3][4].
Estimating depth and semantic labels from 2D images is a crucial step for precise semantic 3D scene reconstruction. Recently, Convolutional Neural Networks (CNNs) have achieved tremendous results in tasks such as depth estimation [5][6][7][8] and semantic segmentation [9][10][11][12]. An important advantage of CNNs is the ability to learn contextual information in an image. Standard CNNs apply a set of 2D convolutions to an input image to acquire such information in a 2D image plane. In addition, the modern approaches have focused on the elaborate design of multi-scale feature extraction [5,13], pooling [10,11,14], and upsampling [6] to acquire contextual information more effectively. The majority of the existing methods have dealt with depth estimation and semantic segmentation separately (i.e., single-task learning).
In contrast, relatively few studies have addressed joint depth estimation and semantic segmentation [15][16][17][18][19][20][21]. This joint task is generally formulated as a multi-task learning problem, in which a single CNN is used to estimate both depth and semantic structure of a scene, instead of using a separate dedicated CNN for each task. The typical CNN design choice for the joint depth estimation and semantic segmentation is so-called hard parameter sharing [15][16][17][18][19][20][21]. In hard parameter sharing, a network has hidden layers shared between all tasks, while having several task-specific output layers (Figure 1a). These shared layers produce shared feature maps, which capture the feature representation useful for all of the tasks. Hence, the design of this representation is one of the key aspects of the joint problem. Previous approaches to the joint task [15][16][17][18][19][20][21] employ a set of 2D convolutions to construct the shared representation, which is given as a set of feature vectors arranged over a 2D plane (Figure 1b). However, considering the real world is three-dimensional, this 2D arrangement reduces one dimension and may limit the capacity of feature representation. Besides, a set of 2D convolutions extracts 2D spatial relationships and may be unsuitable for obtaining 3D spatial contextual information. In this paper, we examine a novel feature representation for joint depth estimation and semantic segmentation. Considering the real world, the objects and structures are originally arranged in 3D space, and the feature maps should therefore also be represented in such a space. Our key idea is arranging feature vectors on a 3D grid (Figure 1c), where each feature vector has both geometric and semantic information. Because geometric structures and semantics are mutually related, we believe that this representation is suitable for the understanding of a 3D scene. To distinguish the proposed 3D feature representation from feature vectors arranged over a 2D plane, we refer to the proposed representation as a 3D feature volume. We construct the initial 3D feature volume using feature maps extracted by 2D CNNs. The initial volume is then regularized by 3D CNNs to acquire 3D feature volume. We refer to the acquired 3D feature volume as a latent 3D volume. The latent 3D volume is shared and used to infer depth and semantic segmentation. By training the network in an end-to-end manner, 2D CNNs are trained to output 2D feature maps to implicitly have the 3D structure in the channel dimension. As the latent 3D volume is used for depth estimation and semantic segmentation, it contains both geometric and semantic features.
There are three questions we examine in this paper: 1. How to build a latent 3D volume; 2. How to use the latent 3D volume for depth estimation and semantic segmentation; and 3. How to learn the latent 3D volume efficiently.
For the first question, we employ hard parameter sharing design and explore a network architecture to construct a latent 3D volume from images to have geometric and semantic information. We propose a projection module that projects the latent 3D volume onto a 2D plane for the second question. The proposed projection module is a two-branch structure that decomposes the latent 3D volume into geometric and semantic features. We evaluate the latent 3D volume on the NYU Depth v2 dataset [22], which has only 1449 images annotated in terms of both depth and semantic segmentation. For the third question, we also propose a training strategy using the SUNCG dataset [23], which is an indoor synthetic dataset, to deal with the small amount of annotated data. We divide the training procedure into two stages: one is a pre-training using the synthetic dataset and the other is a fine-tuning using a real-world dataset. We experimentally demonstrated that the proposed method achieves significant performance improvements over previous methods.
To the best of our knowledge, this is the first attempt to introduce a 3D representation for the joint task of depth estimation and semantic segmentation from a single image. We also propose a novel projection module which effectively passes a latent 3D volume to task-specific layers. To summarize, we present: • A novel 3D representation that aggregates geometric and semantic information in a latent space; • A two-branch projection module that projects a latent 3D volume onto a 2D plane and decomposes it into geometric and semantic features; • A training strategy to effectively learn a latent 3D volume by using a large-scale synthetic dataset adapted to real domains and a small-scale real-world dataset.
The rest of this paper is organized as follows. Section 2 introduces previous studies related to this field. Section 3 describes the architecture to learn our proposed latent 3D volume. Detailed procedures to train the proposed latent 3D volume are described in Section 4. In Section 5, the experimental results on the standard NYU Depth v2 indoor scene dataset [22] are described. Further analysis of the proposed method is discussed in Section 6. Finally, Section 7 provides some concluding remarks and discusses future work.

Related Work
In this section, we review depth estimation, semantic segmentation, and their multi-task learning.

Depth Estimation
Early studies used hand-crafted features and probabilistic graphical models for single image depth estimation [24][25][26][27][28]. CNN-based methods have become successful in recent years.
Eigen et al. [5] proposed a robust depth estimation method using a multi-scale CNN. Owing to the learning capability of a multi-scale CNN and the availability of large-scale datasets, their latter work showed a promising performance [13]. Laina et al. [6] employed ResNet [29] as an encoder and introduced an up-projection block to effectively learn feature map upsampling. Fu et al. [8] presented a deep ordinal regression network and introduced a spacing-increasing discretization strategy that defines the continuous depth as discrete values and achieves state-of-the-art performance.
Several studies have combined a CNN with regularization based on a conditional random field (CRF) [7,30]. Liu et al. [7] tackled the problem with a deep conditional neural field, which jointly learn a CNN and a continuous CRF. Roy and Todorovic [30] showed that random forests can also be used as a regularizer.

Semantic Segmentation
Recent semantic segmentation approaches have taken advantage of CNNs. The input modalities of semantic segmentation are categorized into RGB images [9,10,12,14,[31][32][33] and RGB-D images [22,[34][35][36]. Although RGB-D semantic segmentation approaches have achieved better performance than RGB semantic segmentation approaches, they assume that a precise depth map is available as an input. Thus, we focus on RGB semantic segmentation.
Recent studies [10][11][12] based on fully convolutional networks (FCNs) [9] have demonstrated significant improvements on a variety of benchmarks [22,[37][38][39]. In other studies [9,31,32], the authors attempted to train deconvolution networks to overcome low-resolution estimation and leverage context information. To enhance such information, spatial pyramid pooling at several scales was presented [10], and later atrous spatial pyramid pooling applying atrous convolution with a different dilation rate was proposed [14]. Through a different approach, Lin et al. [12] presented RefineNet, which iteratively combines multi-level features to produce high-resolution estimates. Jiao et al. [33] distilled geometry-aware embeddings for semantic segmentation and achieved significant improvements over previous approaches on indoor benchmarks. Whereas our approach is similar to their approach [33] in terms of geometry-aware features, we use a 3D feature representation instead of a 2D feature representation.
There have been relatively few studies [15][16][17][18][19][20]41] on estimating the depth and semantic segmentation jointly. Wang et al. [41] tackled the problem using two separate CNNs, which obtain the regional and global potentials of a scene and feed them into a hierarchical CRF. However, a hierarchical CRF is computationally expensive. Mousavian et al. [15] estimated depth and semantic segmentation using a multi-scale shared encoder. They also showed that a fully connected CRF improves performance. Zhang et al. [21] presented a task-recursive learning framework that refines both depth and semantic segmentation recursively. Jiao et al. [20] proposed synergy network architecture that propagates semantic information to depth prediction. They also proposed an attention-driven loss to deal with long-tail data distribution. Lin et al. [17] presented a hybrid CNN that combines a common feature extraction network and a global depth estimation network for understanding the global layout of the scene. Nekrasov et al. [16] used knowledge distillation and achieved a real-time performance. Zhou et al. [18] proposed a knowledge interactiveness learning module to enclose both depth estimation and semantic segmentation information. The authors later proposed a pattern-structure diffusion framework that effectively mines and propagates the relationships within/across tasks [19].
As mentioned above, CNN-based approaches use shared feature maps to infer depth and semantic segmentation. However, these 2D representations are limited to obtain the 2D contextual information in an image due to 2D CNNs. In contrast to existing approaches [15][16][17], we propose a 3D feature representation, namely, a latent 3D volume. This representation enhances the 3D contextual information by introducing 3D CNNs. In this study, we explore the design of hard parameter sharing, whereas recent approaches [18][19][20][21] have focused on designing output layers using mutual information between tasks.

Latent 3D Volume
This section describes the details of the latent 3D volume for joint learning of depth estimation and semantic segmentation.

Overview
Our network architecture is based on the hard parameter sharing design that combines a shared backbone and several task-specific output layers for each task. The overall architecture is illustrated in Figure 2. The proposed architecture consists of five modules: an encoder, a decoder, a volume constructor, a volume encoder, and a volume decoder. The first step is to extract deep image features using the encoder and decoder. In the proposed network, any image feature extractor can be used. In this study, we use a standard encoder-decoder network consisting of 2D convolutions [6]. The multi-scale features from the encoder are upsampled and concatenated with the output of the decoder. The volume constructor takes these concatenated features as inputs and outputs a 3D volume. To extract 3D contextual information, we pass the 3D volume through a volume encoder with several 3D convolutional layers and obtain a latent 3D volume. Finally, the latent 3D volume is fed into the volume decoder, which consists of a projection module and 2D convolutions, and both depth and semantic segmentation are inferred. The proposed network is trained to estimate depth and semantic segmentation from a single image in an end-to-end manner. As the shared backbone outputs the latent 3D volume as shared features, the latent 3D volume is learned to have both geometric and semantic information.

Image Features
Taking a single RGB image as an input, the encoder first extracts the features on four different scales. The final feature from the encoder is then fed into the decoder which uses the up-projection blocks consisting of unpooling and several 2D convolutions [6]. Laina et al. [6] reported that an up-projection block achieved a better performance than other upsampling methods such as deconvolution and up-convolution. The decoder first applies a 2D convolution to the encoded feature map to reduce the channel by half and gradually upsamples the obtained feature map to half the size of the input image while reducing the number of feature channels by half. The decoded feature is finally upsampled to the same size as the input image using bilinear interpolation.
In addition, the multi-scale features from the encoder are upsampled to half the size of the input image and converted to the 16-channel feature representations using the up-projection block [6]. These features are upsampled to the same size as the input image using bilinear interpolation and concatenated with the decoded features.

Latent Volume Generation
Taking image features as inputs, the purpose of the volume constructor is to generate a latent 3D volume of size W × H × D × C, where W and H are the width and height of the input image, D is the number of predefined depth samples, and C is the number of predefined channels. We first feed the concatenated features into a volume constructor consisting of three 2D convolutional layers with 5 × 5 filters. Each layer of the volume constructor yields feature maps with the same channels as the concatenated feature map except for the final layer whose output channel is of size D × C. We then generate the initial volume V 0 by reshaping the feature maps into a volume of size The volume encoder is trained to output the latent 3D volume from the initial volume V 0 . There are five 3D convolutional layers with residual blocks, which have 3 × 3 × 3 filters. Each layer yields 32-channel feature maps except for the final layer whose output dimension is C. We use C = 32 during the experiments.

Depth Regression and Classification
The volume decoder infers depth and semantic segmentation. We first project the latent 3D volume V onto a 2D space using a projection module. Through preliminary experiments, we found that a straightforward projection module combining a 3D convolution and a squeezing of the channel dimensions is ineffective. Therefore, we explored the appropriate design of the projection module for joint depth estimation and semantic segmentation. Inspired by recent stereo matching, we use the soft argmax operation [45] as a projection module. In addition, we consider relationships between geometric structures and semantics. Thus, we design the projection module to have two branches. To preserve geometric cues, one branch produces the depth maps by conducting a 1 × 1 × 1 3D convolution and soft argmax operations. The other branch projects the latent 3D volume onto a 2D space and outputs a 2D feature map of size W × H × C. The purpose of this branch is to maintain semantic cues.
The first branch produces a depth map from the latent 3D volume V. We first apply a 1 × 1 × 1 3D convolution to reduce the channel dimension and obtain a 3D volume V of size W × H × D × 1. According to Im et al. [46], the discrete depth in the inverse-depth domain is more effective than the depth domain. We uniformly sample the discrete depth in the inverse-depth domain as follows: where d min is the minimum scene depth and l ∈ {1, . . . , D} is the index of the depth dimension of the volume. The soft argmax operation [45] regresses a continuous depth value in the following manner: where σ is the softmax operation and V l is the l-th slice of the volume V along with the depth dimension. This process yields the depth map with the same resolution as the input image. The second branch generates a 2D feature map. The operation is almost the same as in the first branch, but the input is different. We slice the volume V along the channel dimension and then apply the soft argmax operation. All outputs of each slice are then concatenated. The resulting feature size is Finally, we concatenate the outputs of each branch and use them to infer depth and semantic segmentation. We use the same architecture, which is composed of three 2D convolutional layers with 5 × 5 filters. The difference between depth estimation and semantic segmentation is the output dimension of the final layer, i.e., the depth estimation has a dimension of 1 and the semantic segmentation is the number of classes.

Loss Functions
Here, we define the loss function to train our network.

Depth Estimation
Following the previous single depth estimation studies [5,6], we employ the sum of differences between the inferred depth D and its ground truthD: where N represents the total number of valid pixels in the image, i represents the pixel, and e(i) = |D(i) −D(i)| 1 . Some studies use the l2 norm or burHu loss instead of the l1 norm. Ma et al. [47] reported that the berHu and l1 losses significantly outperform the l2 loss and that the l1 loss is slightly better than the berHu loss, and thus we employ the l1 norm in this study. The depth loss is sensitive to displacement in the depth direction. To capture the structure of a scene, not only the depth of the target pixel but also the depth displacement of neighboring pixels is important. In addition to the depth loss, we employ a depth gradient loss [48] which penalizes the depth around the edges such as depth discontinuities: where ∇ x and ∇ y are the spatial derivatives with respect to horizontal and vertical directions. Although the gradient loss l grad penalizes areas with large depth changes, it often misses small structural errors. In contrast, the surface normal is useful for capturing local depth changes, such as surface irregularities and high-frequency waviness of the surface. To ensure fine-grained depth estimates, we also use normal loss calculated from a depth map. Let n i andn i be the surface normals of the estimated depth and its ground truth, respectively. The surface normal is computed using 1] T . Following Hu et al. [48], the normal loss based on cosine similarity is defined as follows: where · is the inner product of vectors. The overall depth loss is as follows: The parameters λ g and λ n are both set to 1.0.

Semantic Segmentation
For semantic segmentation, we use a pixelwise softmax classifier to predict labels for each pixel. Let S andŜ be the inferred segmentation and ground truth, respectively. We define the segmentation loss as follows: where H indicates the cross-entropy loss.

Overall Losses
Finally, our total loss is defined as follows: The weighting coefficient λ is set to 1.0.

NYU Depth v2 Dataset
We evaluate our network architecture on the standard NYU Depth v2 indoor scene depth estimation and semantic segmentation dataset [22], which has 1449 images with 40 semantic classes [49]. We use the official split for 464 scenes: 795 images from 249 scenes for training and 654 images from 215 scenes for testing.

SUNCG Dataset
The SUNCG dataset [23] is an indoor synthetic dataset which contains 45,622 different indoor scenes. Similar to the prior work [23], we randomly choose 30 K scenes as a training set and render 120 K color images, depth maps, and semantic labels using the camera intrinsics provided by the authors. Note that the SUNCG dataset has none of the paper, towel, or bag models included in the NYU dataset. In addition, there are few instances in some categories such as boxes.

Training Protocol
In joint learning of depth estimation and semantic segmentation, one of the main issues is that fewer samples for the joint task are available than for single tasks. To train our network, we use the SUNCG dataset [23] and the NYU Depth v2 dataset [22]. Our training procedure consists of two stages.
In the first stage, we pre-train the network using the SUNCG dataset [23]. Although these data are much easier to generate than manual annotations, there is a domain gap between real and synthetic data. Thus, we employ CycleGAN [50] for domain adaptation. We train the image translation between synthetic images and real images and convert all synthetic images into real images.
In the second stage, we fine-tune the entire network using the NYU Depth v2 dataset [22]. Previous studies [5,6,13,47,48] have shown that data augmentation improves performance. Similar to previous studies [5,13,48], we apply the following augmentations at random: We follow the same procedure as that employed in previous studies [6,48]. We downsample images from their original size (640 × 480) to 320 × 240 using bilinear interpolation. The images are then cropped to their central parts, and we obtain images of size 304 × 228. The outputs of the network are the same size as the input images.
We use the same settings for each stage except for the number of iterations. We employ ResNet-50 [29] as an encoder. The minimum scene depth and the depth of volume D are set to 0.5 m and 16, respectively. We train the network using the SGD solver with mini-batches of size 8. The learning rate and momentum are set to 0.001 and 0.9, respectively. We pre-train the network for 150 K iterations and fine-tune the network for 30 K iterations. Our training usually takes 5 days on 4 NVIDIA Tesla V100 GPUs.

Metrics
For depth estimation, we report evaluation metrics commonly used in previous studies [5,6,8,13,47,48]. Let y i andŷ i be the predicted and ground truth depth value of a pixel i, and N be the total number of valid pixels. The metrics are defined as follows: • Mean absolute relative difference (REL): • Root mean square error (RMSE): • Mean log 10 error (log10): • Thresholded accuracy (δ < thr): where g(y i ,ŷ i ) = 1 (δ < thr) The thresholded accuracy means the ratio of the maximum relative error δ below the threshold thr. In our experiments, we use thr = 1.25 to align the setting with previous studies [6,8,13,[15][16][17]42].
For semantic segmentation, we use mean intersection over union (mIoU) as a metric. For each class, the intersection over union is calculated as follows: where TP, FP, and FN indicate true positive, false positive, and false negative, respectively. The IoU means the ratio of the area of overlap between the predicted segmentation and the ground truth (TP) and the area of the union of the predicted segmentation and the ground truth (TP + FP + FN). The mIoU is the average of all classes of IoU values. We conducted our experiments using original images (640 × 480) on NVIDIA TITAN RTX with 24 GB of GPU memory. Table 1 shows a quantitative comparison of the proposed method against previous methods including single and joint tasks. The proposed method achieves the best performance for most metrics when compared to joint task approaches [15][16][17]21]. Although Kendall et al. [42] demonstrated the best performance in terms of depth estimation error metrics, they did not use the same weights for depth estimation and semantic segmentation, and the segmentation performance was relatively poor compared to other approaches [15,16]. By contrast, the proposed method shows well-balanced results for both depth estimation and semantic segmentation, and the results of each task are comparable to those of single task methods. The main difference between our method and previous methods [15][16][17] is the shared feature representation. Specifically, the proposed method uses a 3D feature representation, whereas previous methods used 2D feature representations. The results show that the proposed 3D representation improves the performance of depth and semantic segmentation. Note that the proposed method is not applied in real-time while the method by Nekrasov et al. [16] is applied in real-time, which is 38.7 times faster than our approach ( Table 2). Table 1. Quantitative comparison against previous approaches on the NYU Depth v2 dataset. † denotes that two tasks use the same architecture but not the same weights. ‡ employs a single model for the joint task. The bold and underlined items represent the best and second place, respectively. Laina et al. [6] 63.57 21.9 ± 1.9 Fu et al. [8] N/A Lin et al. [12] 118.10 52.6 ± 0.1 Jiao et al. [33] 103 We also noticed that most depth estimation approaches [6,8,13] used the raw distribution of the NYU dataset, which has more than 120 K images with corresponding depth maps. Whereas Laina et al. [6] sampled the raw dataset, with 12 K unique images used for training, multi-task approaches [15,42] have used only the 795 images annotated with both depth and segmentation. Nekrasov et al. [16] sampled the raw dataset and used 25K real-world images for pre-training, fine-tuning the CNN using 795 images. Although we used 120 K images for pre-training, these images were all synthetic images. As we used only the 795 images for fine-tuning, the proposed volume can be learned from a large number of synthetic data and less than 1K real-world images instead of a large number of real-world images. Figure 3 shows a qualitative comparison between our method and the method by Nekrasov et al. [16], which also employs a joint learning approach. For depth estimation, our method preserves the entire structure of the scene and correctly estimates the depth around the edges, whereas the method by Nekrasov et al. [16] often fails. For semantic segmentation, our method reduces incorrect category estimates. The latent 3D volume has more contextual information than a 2D feature map for understanding object relationships. The right two columns in Figure 3 are only the failure cases. The previous method [16] recognizes the chair in the right foreground, but our method incorrectly recognizes it as a sofa. Moreover, both methods fail to properly capture the shape of the piano. When looking at the depth results, there seems to be a lack of detailed geometry around the music stands and keys. A comparison of the results with single task methods [8,12] is shown in Figure 4. Although our approach produces smooth results for both depth and segmentation, it occasionally fails to estimate small objects or regards adjacent structures as a single region. For a more detailed analysis, a comparison with previous segmentation methods including RGB and RGB-D is shown in Table 3. Our approach was second among numerous results. As mentioned in Section 4.2, the SUNCG dataset has none of the paper, towel, or bag models, and few instances of categories such as boxes. Because our method fails with such objects, further accuracy improvement can be expected by adding these objects to the synthetic data.

Ground truth
Nekrasov et al. [16] Ours Ground truth Nekrasov et al. [16] Ours Figure 3. Qualitative comparison with recent joint learning approach on the NYU Depth v2 dataset. From the top to the bottom, input images, ground truth depth maps, depth results of Nekrasov et al. [16] and our proposed approach, ground truth semantic labels, and segmentation results of Nekrasov et al. [16] and our approach, respectively.

Input
Ground truth Kendall et al. [42] Fu et al.   Table 3. Comparison of IoU with previous approaches on each category of the NYU Depth v2 dataset. * takes an RGB image as an input whereas the others take an RGB-D image. The bold and underlined items represent the best and second place, respectively.

Ablation Study
We conducted ablation studies to analyze the effects of the proposed method. The results are summarized in Table 4.

Number of Depth Samples
To study the influence of the number of depth samples D in the latent 3D volume, we trained models using D ∈ {12, 16, 24, 32, 48, 64}. As shown in Figure 5a,b, the performance improves as the number of depth samples increases up to D = 32. However, the proposed method derives a poor performance with D ∈ {48, 64}. One possible reason for this is that the contextual information of 2D features extracted from the image is insufficient for constructing a 3D volume. Figure 5c shows that a greater number of depth samples requires a longer running time because the number of parameters also increases. The number of depth samples should be set depending on the architecture of the encoder and the permissible running time.

Network Architecture
As described in Section 3.4, we analyze the effectiveness of our projection module used for inferring the depth and semantic segmentation. Table 4a shows the performance of the straightforward projection module described in Section 3.4. In addition, we validated a projection module with two different branches for depth estimation and semantic segmentation (Table 4b). This module feeds the output of each branch to a separate output layer without concatenating the output of each branch. We observed that our two-branch projection module (Table 4j) provides a better performance in the proposed network, specifically for semantic segmentation tasks.
We also investigated the design of the network architecture for the volume generation. As described in Section 3.1, the proposed network consists of several modules. Firstly, we verified the importance of the decoder. Instead of a decoder, we use a bilinear interpolation to upsample the output of the encoder. Table 4c shows that the decoder reduces the depth estimation errors and improves the mean IoU by 1.9%. Secondly, we analyze the effects of the up-projection module. Table 4d shows that the up-projection module used to extract the multi-scale features is important for constructing a latent 3D volume. Finally, we compared the model using a stacked hourglass [51] for the volume encoder. As shown in Table 4e, this slightly enhances the depth estimation results, although the segmentation results were not improved. We believe that the residual blocks provide more well-balanced results than a stacked hourglass architecture.
In addition, we analyze the difference between the depth domain and inverse-depth domain for building a 3D volume. As reported by Im et al. [46], Table 4f shows that the inverse-depth domain improves the performance of depth estimation.

Pre-Training
To evaluate the training strategy, we compare the results with and without pre-training. As shown in Table 4g, the results show that pre-training is most effective for our network, and domain adaptation based on CycleGAN [50] further improves segmentation results (Table 4h). By contrast, the results indicate that a considerable amount of annotation data are required for greater effectiveness of a latent 3D volume representation. Compared to the knowledge distillation approach [16], a synthetic dataset provides accurate annotations. In addition, the knowledge distillation approach requires pre-training for generating pseudo semantic labels and large-scale real images. Although we train CycleGAN for domain adaptation, only small-scale real images are used for training. Thus, the proposed training strategy does not require the collection of large-scale real images.

Single Tasks
To investigate the effects of multi-task learning, we compare the proposed method with single tasks. As shown in Table 4i, the results of depth estimation in the single-task setting are better than those of a multi-task setting. By contrast, the proposed method achieves the highest mIoU, which is 2.9% higher than that of a single-task setting. A similar relationship is observed for the method by Jiao et al. [33] because the proposed latent 3D volume likely acquires a spatial context and contributes to the semantic segmentation. Although the proposed method is accurate for multiple tasks, for a single task, it is more practical to use state-of-the-art approaches [6,12] that are faster than our approach with fewer parameters.

Discussion
As described in Section 1, there are three questions throughout this paper. In this section, we discuss how the proposed latent 3D volume is beneficial to the tasks of depth estimation and semantic segmentation.

•
Hard parameter sharing. For the joint task of depth estimation and semantic segmentation, the typical design choice is hard parameter sharing. Although dimensionality of feature representations is an important aspect, it has been overlooked in previous studies [15][16][17][18][19][20]41]. In this study, considering the real world is three-dimensional, we proposed a latent 3D volume representation.

•
Comparison against state-of-the-art methods. We evaluated our method on the NYU Depth v2 dataset [22]. As shown in Section 5.2, the proposed method achieves the best performance in both depth estimation and semantic segmentation compared to previous joint task methods [15][16][17].
The main difference between our method and previous methods is the representation of shared features in terms of dimensionality. While 2D representation has one feature vector for each pixel, the proposed representation assumes multiple depth planes and can maintain feature vectors for each depth. Having multiple feature vectors can enhance the 3D spatial context, which leads to dealing with ambiguity in depth and semantics. In other words, the proposed representation complements each other in geometric and semantic contexts. The experimental results show that the proposed latent 3D volume is superior to 2D feature representation for the joint task of depth estimation and semantic segmentation. One concern about the proposed method is running time and memory consumption. Although our new attempt is the use of 3D CNNs, it requires more running time and memory than 2D CNNs. Especially, the method by Nekrasov et al. [16] is much better than our method in terms of time and memory efficiency. A more compact representation of 3D features is required to reduce running time and memory consumption for practical use. • Network architecture. Our network builds the initial volume using the 2D feature maps from 2D CNNs and then feeds it into 3D CNNs to obtain the latent 3D volume. While the initial volume is obtained by reshaping the 2D feature maps, our network can produce a latent 3D volume with 3D spatial contextual information. By training the network in an end-to-end manner, the 2D CNNs are trained to output the feature maps to have the 3D structure in the channel dimension. We examine how the network structure affects the learning of latent 3D volume. As shown in Section 5.3, multi-scale features from 2D CNNs are most effective to build a latent 3D volume.
In contrast, the comparison of 3D CNNs shows that there is only a slight difference in performance between the residual and hourglass modules. In other words, multi-scale feature extraction in 3D CNNs has little effect. • Projection module. The latent 3D volume is used for inferring depth and semantic segmentation. We design the projection module to have two branches to decompose the latent 3D volume into geometric and semantic features. The outputs of the branches are fed into task-specific layers. We show that concatenating the outputs of the two branches improves performance. In addition, the combination of latent 3D volume and projection module achieves promising results.

•
Training strategy. We train the network in two stages using a large-scale synthetic dataset and a small-scale real-world dataset. Our experimental results show that pre-training using synthetic images is effective for learning the latent 3D volume. In addition, we use the domain adaptation to fill the gap between synthetic and real images. One of the advantages of synthetic data is accurate annotation. As real data sometimes suffers from inter-sensor distortions, calibration may be required. In contrast, synthetic data and its annotations are always aligned. We show that the performance of depth and semantic segmentation can be improved by using domain-adapted synthetic data.

Conclusions
We presented a latent 3D volume for joint depth estimation and semantic segmentation. The initial 3D volume is constructed using an encoder-decoder network and multi-scale features from the encoder. We pass the volume through several 3D convolutional layers and aggregate the geometric and semantic features in a 3D latent volume. The latent 3D volume is then used for inferring depth and semantic segmentation. We evaluated the proposed method on the NYU Depth v2 dataset and showed that our proposed latent 3D volume improves the performance of joint learning when applied to depth estimation and semantic segmentation under most metrics, and achieves comparable results to those of state-of-the-art single task methods.
We see several directions for future studies. Firstly, more compact representations of 3D features should be explored. As the proposed 3D representation has the same height and width of the input image, it requires large memory and long running time. To reduce the cost, recent implicit functions such as [52] that encode spatial information into a low-dimensional code could be applied.
Secondly, the preparation of the training data could also be further investigated. Following previous work [23], we generated 120 K images for pre-training. Because it may be possible to pre-train with fewer images, we hope to validate the minimum number of synthetic images necessary for pre-training. In addition, we applied a domain adaptation based on CycleGAN [50] using a synthetic dataset. Instead of this approach, more recent domain adaptation [53] or knowledge distillation [54] may improve the performance.
Thirdly, the training strategy for learning latent 3D volumes could be developed. Because manual annotation is laborious, it may be desirable to train the network using different datasets for depth and semantic segmentation. Although we use real-world images with both depth and semantic segmentation annotations, depth annotations can be generated in an unsupervised manner using multiple images with overlap, and it is relatively easy to collect such real-world images. We plan to explore training strategies using different datasets.