Simultaneous Semantic Segmentation and Depth Completion with Constraint of Boundary

As the core tasks of scene understanding, semantic segmentation and depth completion play a vital role in many applications such as robot navigation, AR/VR and autonomous driving. They parse scenes from the perspectives of semantics and geometry, respectively. While great progress has been made in both tasks through deep learning technologies, few works have explored building a joint model that deeply exploits the inner relationship between the two tasks. In this paper, semantic segmentation and depth completion are jointly considered under a multi-task learning framework. By sharing a common encoder and introducing boundary features as inner constraints in the decoder, the two tasks can properly share the required information with each other. An extra boundary detection sub-task is responsible for providing the boundary features and constructing cross-task joint loss functions for network training. The entire network is implemented end-to-end and evaluated with both RGB and sparse depth input. Experiments conducted on synthesized and real scene datasets show that the proposed multi-task CNN model can effectively improve the performance of every single task.


Introduction
Scene understanding [1][2][3] is an essential task in many intelligent applications such as robot navigation, AR/VR and autonomous driving. As the core content of scene understanding, semantic segmentation [4][5][6][7][8] and depth estimation [9][10][11][12] parse the scene in terms of semantics and geometry, respectively. In recent years, much research focusing on either depth estimation or semantic segmentation has been carried out, and with the help of deep learning technologies, great success has been achieved.
Semantic segmentation refers to the classification and labeling of each pixel in an image, thereby dividing the image into several semantically meaningful regions and converting the original color image into a pixel-level class-labeled image. Following the success of deep neural networks in image classification [13], the invention of the fully convolutional network (FCN) [5] made pixel-level semantic labeling possible. Based on FCN, recent studies develop deconvolution-based architectures to improve segmentation accuracy [6,7]. Deeplab [8] proposes atrous convolution to tackle the problem of low-resolution features caused by the traditional cascading pooling structure. Its later work [14] introduces bottleneck and skip-connection structures with the Resnet [15] framework. ERFNet [16] further proposes a non-bottleneck structure to reduce computational complexity. Extra data sources can also be introduced to improve performance through feature fusion. Dense or sparse depth images, acquired with range sensors such as Kinect or Lidar, are used to strengthen the semantic segmentation performance.

The main contributions of our work are threefold. First, a triple-task network with a single-encoder-multi-decoder architecture is designed for simultaneously predicting the semantics and dense depth of an image. It takes RGB and sparse depth data as input, and is capable of utilizing the complementary information hidden in each of the heterogeneous data sources. Second, a boundary constraint is embedded into the two major tasks via multi-scale feature sharing and cross-task joint loss functions. The boundary plays the role of a bridge connecting the semantic and depth prediction tasks and strengthens the relationship between them. The last contribution is the end-to-end implementation of the entire network and the evaluation of the method on different datasets. The experiments on both synthesized (Virtual KITTI [36]) and real datasets (Cityscapes [37]) demonstrate that our method can effectively improve the performance of both depth completion and semantic segmentation.

Proposed Methods
In this section, the proposed network architecture is introduced first, and the loss functions for training the network are described later.

Network Architecture
The overall network is based on the FCN structure with a single shared encoder and multiple branch decoders. As shown in Figure 1, the architecture of the proposed SSDNet is mainly composed of a feature-sharing encoder and three branch decoders corresponding to the boundary-detection, semantic-segmentation, and depth-completion tasks, respectively.

Feature-Sharing Encoder
The main structure of the proposed feature-sharing encoder is based on VGG's [13] multi-scale cascading convolutions, as shown in Figure 2a. It consists of 5 convolution blocks denoted as Conv_block. The scaled output features from each Conv_block, marked as green nodes S_i, are transferred to all three subsequent decoder branches in the form of skip-connections, as shown in Figure 1.

Decoder for Boundary Detection
The boundary is valuable information hinting at the discontinuities in the semantic labels and the depth image. Therefore, a boundary detection decoder is designed to provide the required boundary similarity to the other two major tasks. As shown in Figure 2b, together with the output of the encoder, skip-connection features at different scales are also fed into the boundary-detection branch. The boundary computed from the semantic ground truth image is utilized as the supervision signal to train the boundary detection branch. The boundary feature maps B_i from the i-th scale, marked as pink nodes in Figure 2b, are then fed into the following semantic segmentation and depth completion branches.

Decoder for Semantic Segmentation and Depth Completion
Besides the feature maps from the encoder, the semantic and depth decoder branches also share the multi-scale skipped features and the boundary feature maps, as shown in Figure 3a. The core modules of both branches are the multiple UpBlocks, where the boundary feature B_i is absorbed to construct a special boundary-aware convolution layer, BaConv [20], as illustrated in Figure 3b. With the guidance of the boundary, the boundary-aware convolution can focus more on regions with similar semantic features and gather the contributions more adaptively to produce the output. The operation of the boundary-aware convolution is shown in Equation (1), where the output y(·) at position p is produced by the convolution of three parts, i.e., the kernel weight w(·), the input feature map x(·) and the boundary-similarity feature B(·). p_n denotes each position in the local window K around the target position p. The size of K is defined by the convolution kernel size, and the parameters in w(·) are determined through the training process.
In Figure 3, nc represents the number of semantic classes, BaConv refers to the boundary-aware convolution [20], and the bilinear interpolation upsampling layer is used in (b) as Upsampling, ×2.

Compared with the standard convolution operation, BaConv introduces the boundary-similarity term B(·) and adaptively sets the contribution of each position p_n. With BaConv, pixels with higher similarity to object boundaries receive a higher B(p_n) value and have less impact on the convolution result.
Following the five UpBlocks, a 1 × 1 convolution layer with the number of channels equal to the class number nc is employed to produce the final semantic results. For the depth completion branch, a 1 × 1 convolution layer with an output channel number of 1 is introduced after each UpBlock to produce a normalized depth prediction at each scale. All five scaled outputs are used to compute the depth task loss during training, while only the depth output at the final scale is used at the testing stage.
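As an illustration, the boundary-aware convolution described above can be sketched in pure Python for a single-channel 2D feature map. The exact BaConv formulation in [20] may differ; here the boundary term is assumed to enter as (1 − B(p_n)), matching the description that pixels with high boundary similarity contribute less:

```python
# Sketch of a boundary-aware convolution (BaConv) on a single-channel map.
# Assumption: the boundary-similarity term enters as (1 - B(p_n)), so
# neighbours that look like boundaries are down-weighted.

def baconv(x, B, w):
    """x: HxW feature map, B: HxW boundary-similarity map in [0, 1],
    w: kxk kernel weights. Returns the HxW output (zero padding)."""
    H, W = len(x), len(x[0])
    k = len(w)
    r = k // 2
    y = [[0.0] * W for _ in range(H)]
    for u in range(H):
        for v in range(W):
            acc = 0.0
            for du in range(-r, r + 1):
                for dv in range(-r, r + 1):
                    uu, vv = u + du, v + dv
                    if 0 <= uu < H and 0 <= vv < W:
                        # Down-weight neighbours with high boundary similarity.
                        acc += w[du + r][dv + r] * (1.0 - B[uu][vv]) * x[uu][vv]
            y[u][v] = acc
    return y
```

With B set to zero everywhere, the operation reduces to a standard convolution, so BaConv can be seen as a boundary-gated variant of the usual layer.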

Loss Function
Given the network model and training samples (I_p, I_d, G_d, G_s, G_b), where I_p and I_d represent the RGB image and the sparse depth data, and G_d, G_s and G_b represent the ground truth for dense depth, semantic labels and boundary, respectively, loss functions for each single task and for the joint tasks have to be designed. The boundary ground truth G_b can be calculated from the discontinuities of the semantic ground truth. The outputs of the triple task are denoted as P_s, P_d and P_b, representing the semantic, depth and boundary predictions, respectively.
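Since G_b is derived from the discontinuities of G_s, a minimal sketch can illustrate the idea. The exact discontinuity operator is not specified in the text; a simple label-change test against the right and bottom neighbours is assumed here:

```python
# Sketch: derive boundary ground truth G_b from semantic ground truth G_s.
# Assumption: a pixel is a boundary whenever its label differs from its
# right or bottom neighbour (forward differences).

def boundary_from_semantics(gs):
    H, W = len(gs), len(gs[0])
    gb = [[0] * W for _ in range(H)]
    for u in range(H):
        for v in range(W):
            if v + 1 < W and gs[u][v] != gs[u][v + 1]:
                gb[u][v] = 1  # horizontal label change
            if u + 1 < H and gs[u][v] != gs[u + 1][v]:
                gb[u][v] = 1  # vertical label change
    return gb
```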

Loss Function for Depth Completion
For the depth completion branch, the model can be optimized by L1 and L2 losses between the prediction and the ground truth. The SSIM loss [38] is also introduced to compare the brightness l(P_d, G_d), contrast c(P_d, G_d) and structure s(P_d, G_d) between the completed depth image P_d and the ground truth G_d. The SSIM loss is defined as:

Loss_SSIM = 1 − (1/W) Σ_{w=1}^{W} l(a, b) · c(a, b) · s(a, b),    (2)

In Equation (2), W represents the number of 11 × 11 sliding windows used to calculate the local image quality, and a and b denote the contents of the current window in P_d and G_d; the SSIM loss for the overall image, i.e., Loss_SSIM, is obtained by averaging over all sliding windows. The three metric functions are given by:

l(a, b) = (2 μ_a μ_b + C_1) / (μ_a² + μ_b² + C_1),
c(a, b) = (2 σ_a σ_b + C_2) / (σ_a² + σ_b² + C_2),    (3)
s(a, b) = (σ_ab + C_3) / (σ_a σ_b + C_3),

In Equation (3), a and b are the inputs of the metric functions, E(·) represents the expectation function used to calculate the mean value of a feature map, and μ_a, σ_a², σ_ab denote the mean, variance and covariance, respectively. C_1, C_2 and C_3 are constants to avoid instability when the denominators are close to zero, and they are set empirically. In summary, the final loss function for depth completion is composed of three terms:

Loss_depth = α_d1 · Loss_L1 + α_d2 · Loss_L2 + α_d3 · Loss_SSIM,    (4)

where α_d1, α_d2 and α_d3 are the weights for each loss, respectively. Empirically, they are set as α_d1 = α_d2 = 0.4 and α_d3 = 0.2 in our implementation.
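A simplified sketch of this combined loss is given below. For brevity, the SSIM statistics are computed once over the whole (flattened) depth map instead of over 11 × 11 sliding windows, the structure term is folded into the standard two-factor SSIM form, and the constants C1 and C2 are placeholder values, not the paper's settings:

```python
# Sketch of the depth-completion loss: weighted sum of L1, L2 and an SSIM
# term, with the paper's weights alpha_d1 = alpha_d2 = 0.4, alpha_d3 = 0.2.
# Global (single-window) SSIM is an assumption made for brevity.

C1, C2 = 1e-4, 9e-4  # assumed small stabilising constants

def depth_loss(pred, gt, a1=0.4, a2=0.4, a3=0.2):
    n = len(pred)
    l1 = sum(abs(p - g) for p, g in zip(pred, gt)) / n
    l2 = sum((p - g) ** 2 for p, g in zip(pred, gt)) / n
    mu_p, mu_g = sum(pred) / n, sum(gt) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_g = sum((g - mu_g) ** 2 for g in gt) / n
    cov = sum((p - mu_p) * (g - mu_g) for p, g in zip(pred, gt)) / n
    # Two-factor SSIM (luminance x contrast-structure).
    ssim = ((2 * mu_p * mu_g + C1) * (2 * cov + C2)) / \
           ((mu_p ** 2 + mu_g ** 2 + C1) * (var_p + var_g + C2))
    return a1 * l1 + a2 * l2 + a3 * (1.0 - ssim)
```

A perfect prediction yields SSIM = 1 and a total loss of zero, so all three terms pull the prediction towards the ground truth.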

Loss Function for Semantic Segmentation and Boundary Detection
For the semantic segmentation related tasks, a class-weighted cross-entropy is applied in the loss function. Let gt_jk and pred_jk denote the probability of the j-th pixel being labeled as the k-th class in the ground truth and the prediction, respectively. The class-weighted cross-entropy (WCE) is given as:

Loss_WCE = −(1/N) Σ_{j=1}^{N} Σ_{k=1}^{nc} (1/β_k) · gt_jk · log(pred_jk),    (5)

where β_k represents the proportion of pixels with the semantic category k in the whole sample dataset, and N and nc represent the total numbers of pixels and semantic categories, respectively. The class-weighted cross-entropy is used to evaluate the output of the semantic segmentation branch decoder. Similarly, boundary detection can be defined as a binary semantic segmentation problem, and a two-class weighted cross-entropy is employed in Equation (6):

Loss_bd = −(1/N) Σ_{j=1}^{N} [ (1/β_1) · gt_j · log B(pred_j) + (1/β_0) · (1 − gt_j) · log(1 − B(pred_j)) ],    (6)

where B(pred_j) denotes the boundary similarity of the j-th pixel in the boundary prediction map, gt_j denotes whether the j-th pixel is labeled as a boundary, and β_1 and β_0 are the proportions of boundary and non-boundary pixels, respectively.
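A sketch of the class-weighted cross-entropy follows, assuming inverse-proportion weights 1/β_k (the exact weighting scheme is not given here). gt is one-hot per pixel and pred holds per-pixel class probabilities:

```python
import math

# Sketch of the class-weighted cross-entropy (WCE). Assumption: each
# class k is weighted by the inverse of its pixel proportion beta_k,
# so rare classes count more.

def weighted_cross_entropy(gt, pred, beta):
    """gt: N x nc one-hot labels, pred: N x nc probabilities,
    beta: per-class pixel proportions. Returns mean WCE loss."""
    N = len(gt)
    loss = 0.0
    for j in range(N):
        for k, (g, p) in enumerate(zip(gt[j], pred[j])):
            if g:  # one-hot: only the true class contributes
                loss -= (1.0 / beta[k]) * math.log(max(p, 1e-12))
    return loss / N
```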

Loss Function for Joint Tasks
Aiming to enhance the correlation among the different tasks and further improve the overall generalization performance, joint cross-task loss functions are proposed. The boundary constraint is embedded into the two major tasks via multi-scale feature sharing and the cross-task joint loss functions. The boundary plays the role of a bridge connecting the semantic and depth prediction tasks and strengthens the relationship between them. In brief, the positions of the semantic boundary, the depth boundary and the boundary predicted by the detection sub-task should all be compatible.
Specifically, defining the local boundary function f_BS(·) on the semantic prediction image P_s, the semantic boundary confidences in the horizontal direction f_BS−u(·) and the vertical direction f_BS−v(·) can be computed as shown in Equation (7):

f_BS−u(P_s:u,v) = 1 if P_s:u,v ≠ P_s:u+1,v, else 0;  f_BS−v(P_s:u,v) = 1 if P_s:u,v ≠ P_s:u,v+1, else 0,    (7)

where the subscripts u and v represent the pixel position in the horizontal and vertical directions, respectively. The semantic-boundary joint loss function is then defined in Equation (8); the boundary calculated from the semantic prediction should have structural similarity with the boundary-similarity prediction result:

Loss_sb = (1/N) Σ_{u,v} f_BS(P_s:u,v) · (1 − P_b:u,v),    (8)

The semantic-boundary joint loss function is minimized when the pixels at the semantic boundary (f_BS(·) = 1) have a high boundary similarity (P_b:u,v reaches its maximum value of 1).
For the depth prediction, the predicted P_d is a continuous quantity, and local gradients on P_d can be built as in Equation (9):

Grad(P_d:u,v) = sqrt( (∂P_d/∂u)² + (∂P_d/∂v)² ),    (9)

The larger the gradient is, the more the pixel tends toward a boundary position. Numerically, the partial derivatives can be calculated by finite differences, as in Equation (10):

∂P_d/∂u ≈ P_d:u+1,v − P_d:u,v,  ∂P_d/∂v ≈ P_d:u,v+1 − P_d:u,v,    (10)

Then the semantic-depth joint loss Loss_sd can be formulated as in Equation (11). It reaches its minimum when the pixels at the semantic boundary (f_BS(P_s:u,v) = 1) have a high depth gradient Grad(P_d:u,v).
Finally, the full loss function for the entire multi-task model is defined as the weighted sum of the three single-task losses (depth completion, semantic segmentation and boundary detection) and the two cross-task joint losses Loss_sb and Loss_sd.
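The joint losses above can be sketched as follows. The forward-difference boundary function and the exp(−Grad) penalty for the semantic-depth term are assumptions, chosen so that each loss is minimized exactly under the conditions stated in the text:

```python
import math

# Sketch of the boundary-related joint losses. Assumptions: f_BS uses
# forward differences, and the semantic-depth penalty is exp(-Grad),
# which vanishes where the depth gradient is large.

def f_bs(ps, u, v):
    """1 if the semantic label at (u, v) differs from its right or
    bottom neighbour, else 0."""
    H, W = len(ps), len(ps[0])
    if v + 1 < W and ps[u][v] != ps[u][v + 1]:
        return 1
    if u + 1 < H and ps[u][v] != ps[u + 1][v]:
        return 1
    return 0

def joint_losses(ps, pb, pd):
    """ps: semantic labels, pb: boundary similarity in [0, 1],
    pd: depth prediction. Returns (Loss_sb, Loss_sd)."""
    H, W = len(ps), len(ps[0])
    loss_sb = loss_sd = 0.0
    for u in range(H):
        for v in range(W):
            if f_bs(ps, u, v):
                # Semantic-boundary loss: push P_b towards 1 on boundaries.
                loss_sb += 1.0 - pb[u][v]
                # Semantic-depth loss: reward a large depth gradient there.
                du = pd[u + 1][v] - pd[u][v] if u + 1 < H else 0.0
                dv = pd[u][v + 1] - pd[u][v] if v + 1 < W else 0.0
                grad = math.sqrt(du * du + dv * dv)
                loss_sd += math.exp(-grad)
    return loss_sb / (H * W), loss_sd / (H * W)
```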

Experimental Results
In this section, the experimental setup and the evaluation datasets are described first. Then the quantitative and qualitative results are presented with some comparisons to state-of-the-art methods.

Experimental Setup and Dataset Introduction
To evaluate the performance of the proposed network, metrics are introduced for the semantic segmentation and depth completion tasks, respectively. As shown in Table 1, n_lk is the number of pixels labeled as class l and predicted as class k, nc indicates the number of classes, t_l = Σ_k n_lk is the number of pixels with ground truth class l, and N = Σ_l t_l represents the total number of pixels. For the depth completion metrics, d_j and d̂_j represent the ground truth and the depth prediction of the j-th pixel.

Metrics for Semantic Segmentation
Pixel accuracy: Acc = Σ_l n_ll / N
Mean pixel accuracy: mAcc = (1/nc) Σ_l (n_ll / t_l)
Mean Intersection-over-Union of different categories: mIoU = (1/nc) Σ_l n_ll / (t_l + Σ_k n_kl − n_ll)
Frequency-weighted IoU: fwIoU = (1/N) Σ_l t_l · n_ll / (t_l + Σ_k n_kl − n_ll)

Metrics For Depth Completion
Root Mean Squared Error: RMSE = sqrt( (1/N) Σ_j (d_j − d̂_j)² )
Mean Absolute Error: MAE = (1/N) Σ_j |d_j − d̂_j|

Among the semantic segmentation evaluation metrics, Acc stands for the proportion of correctly predicted pixels in the overall image, and mAcc denotes the mean accuracy among the different classes. mIoU is the average of the IoU (the ratio between the correctly predicted area and the union of the ground truth and predicted areas) of the different semantic labels over all images, and fwIoU represents the class-frequency-weighted IoU metric.
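The segmentation metrics in Table 1 can be computed directly from a confusion matrix, as in this sketch:

```python
# Sketch: segmentation metrics from a confusion matrix n[l][k], where
# n[l][k] is the number of pixels with ground-truth class l predicted
# as class k, following the definitions in Table 1.

def seg_metrics(n):
    nc = len(n)
    t = [sum(n[l]) for l in range(nc)]  # pixels of GT class l
    N = sum(t)
    acc = sum(n[l][l] for l in range(nc)) / N
    macc = sum(n[l][l] / t[l] for l in range(nc) if t[l]) / nc
    ious = []
    for l in range(nc):
        # union = GT area + predicted area - intersection
        union = t[l] + sum(n[k][l] for k in range(nc)) - n[l][l]
        ious.append(n[l][l] / union if union else 0.0)
    miou = sum(ious) / nc
    fwiou = sum(t[l] * ious[l] for l in range(nc)) / N
    return acc, macc, miou, fwiou
```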
The network model is implemented using the PyTorch framework and trained on an NVIDIA GeForce GTX 1080 Ti with 11 GB of graphics processing unit (GPU) memory. The network parameters are randomly initialized with the Xavier method, and the initial biases are set to zero. The loss function is optimized using the SGD optimizer, with the initial learning rate set to 1 × 10−4 and the batch size set to 4. The experiments are conducted primarily on the Virtual KITTI and Cityscapes datasets.

• Virtual KITTI [36] is a synthetic outdoor dataset. Each sequence contains 10 different rendering variants: one is an outdoor environment cloned as closely as possible from the original KITTI benchmark, and the others are geometrically transformed or weather-adjusted versions of the cloned one. Each RGB image in the dataset has a corresponding depth image and semantic segmentation ground truth. The ground truth depth maps are randomly down-sampled to only 5% of the original density to produce the sparse depth input. In total, 11,112 images are randomly selected for training, 2320 for validation and 3576 for testing.
• CityScapes [37] is a real outdoor dataset, which contains high-quality semantic annotations of 5000 images collected in street scenes from 50 different cities. A total of 19 semantic labels are used for evaluation, belonging to 7 super categories: ground, construction, object, nature, sky, human, and vehicle. The depth (disparity) ground truth is provided by the SGM method [37]. In the experiment, the original disparity images are randomly down-sampled to 5% density and used as the sparse depth input. The training, validation, and testing sets contain 2975, 500 and 1525 images, respectively.
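The 5% random down-sampling used to produce the sparse depth input can be sketched as follows; the paper's exact sampling procedure is assumed here to be an independent per-pixel draw:

```python
import random

# Sketch: produce the sparse depth input by randomly keeping ~5% of the
# valid (non-zero) depth pixels; all other pixels are set to 0 (missing).
# Assumption: independent per-pixel sampling.

def sparsify(depth, keep=0.05, seed=0):
    rng = random.Random(seed)
    H, W = len(depth), len(depth[0])
    sparse = [[0.0] * W for _ in range(H)]
    for u in range(H):
        for v in range(W):
            if depth[u][v] > 0 and rng.random() < keep:
                sparse[u][v] = depth[u][v]
    return sparse
```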

Experiment Analysis: Virtual KITTI
Virtual KITTI [36] is a synthetic outdoor dataset and is mostly used here for ablation experiments among models with different settings. All models are trained from scratch and do not rely on any pre-trained model.

Experiments on Semantic Segmentation
The evaluation is first carried out on the Virtual KITTI dataset. The specific settings of the compared models are listed as follows:

1. BaCNN: The baseline model proposed in [20]. BaCNN is based on the backbone of FCN8S [5] and is modified by replacing the first layer of each Conv_Block with the boundary-aware convolution. With the help of a front-built boundary-detection sub-network, the boundary-similarity map is introduced.
2. SSDNet_Sem: The proposed multi-task learning framework with the depth completion branch removed. It can also be understood as two modifications to the BaCNN model: (a) BaCNN employs the boundary-detection sub-network as one cascading task and produces a boundary-similarity map for the following semantic segmentation sub-network, while SSDNet_Sem treats the boundary task as a parallel branch and shares the encoder features; moreover, besides independent loss functions for the boundary and semantic sub-tasks, the SSDNet_Sem model can also be optimized by the joint semantic-boundary loss function; (b) while BaCNN performs early fusion by introducing the boundary similarity in the encoder stage, SSDNet_Sem performs later fusion in the decoder stage.

3. SSDNet (full model): The complete network model with multiple tasks, optimized by the full joint loss function.
4. SSDNet_ind: The complete network model without using the joint loss functions.
The quantitative comparison results are shown in Table 2. In the following tables, our method's performance is highlighted in bold and the best performance is underlined. As expected, all three models perform better than the single-task semantic model FCN8S [5]. With the addition of the boundary and depth tasks, the performance gradually improves. BaCNN [20] introduces the boundary detection task and replaces the standard convolution with the boundary-aware convolution. SSDNet_Sem achieves a further improvement over the baseline BaCNN model, owing to the change in the boundary fusion phase (from early fusion to later fusion) and the change of the task hierarchy from the original cascaded style into a parallel multi-task architecture. By adding the depth completion task, the full SSDNet model achieves further significant improvements over SSDNet_Sem and performs the best on all four metrics. Without any help from the cross-task joint loss functions, SSDNet_ind performs slightly worse than our full model, which supports the effectiveness of the cross-task joint loss functions. The full model also has lower complexity and higher real-time performance than FCN8S and BaCNN (FLOPs are all computed at an image resolution of 125 × 414).
The qualitative results are shown in Figure 4, where the predictions of depth and semantics look good with the help of the boundary prediction. Compared to the baseline BaCNN model, the full multi-task model is able to produce much sharper segmentations of very close objects (as marked in the red box). This also shows that the proposed multi-task joint network can promote the performance of every single task.


Experiments on Depth Completion
Several ablation experiments are conducted with the following model configurations:

1. SSDNet (full model): The proposed semantic segmentation and depth completion multi-task network.
2. SSDNet_Dep: The full model with the semantic branch removed, still using both sparse depth and the RGB image as input.
3. SSDNet_Dep_d: Using sparse depth as the only data source in the SSDNet_Dep model.
4. SSDNet_Dep_rgb: Using the RGB image as the only data source and performing depth prediction with the SSDNet_Dep model.
5. SSDNet_ind: The complete model without using the joint loss functions.
In the depth completion ablation experiment, the RMSE and MAE are analyzed for depth outputs in the ranges of 20 m, 50 m and 100 m, respectively. Sparse depth points and image pixels become sparser as the range increases, leading to decreasing prediction accuracy with distance. The experiments on these three ranges represent the system performance in the near, middle and far ranges ahead of the vehicle. The quantitative comparison results are shown in Table 3, where n/a represents unpublished data. The full SSDNet performs the best among all models. Compared with the full model, SSDNet_Dep obtains slightly worse depth prediction accuracy, which shows the importance of the semantic task to depth completion. If only one type of data can be used, SSDNet_Dep with sparse depth achieves better results than with the RGB image; however, both are worse than SSDNet_Dep with the full heterogeneous input, which inversely demonstrates the advantages of data fusion. SSDNet_ind is slightly worse than the full model, which shows that the cross-task joint loss functions help improve depth accuracy. Compared with traditional methods such as MRF [39] and TGV [40] and state-of-the-art CNN methods such as Sparse-to-dense [25] and SparseConvNet [12], our full model also performs the best.

Experimental Analysis on CityScapes
To further evaluate the proposed SSDNet model in a real environment, this section conducts experiments on the CityScapes dataset and compares with state-of-the-art methods. Unlike the simulated Virtual KITTI dataset, the CityScapes dataset contains more noise in the original RGB and disparity (depth) images. The proposed SSDNet model is trained from scratch and does not rely on any ImageNet pre-training. The quantitative results for the semantic segmentation task are shown in Table 4, where IoU_cat and IoU_cla represent the mIoU corresponding to the 7 categories and the 19 classes, respectively, and fwt represents the running time per frame in seconds. With the help of the multi-task learning framework, the proposed SSDNet model performs better than the baseline BaCNN [20] and most of its counterparts. Compared with state-of-the-art real-time CNN methods such as the encoder-decoder based ENet model [41] and its loss-edited version [42], ESPNet [43] and the two-branch-fusion Fast-SCNN [44], our full model performs better in IoU without costing more time. With fewer layers in the network and without any pre-training on ImageNet, our method still performs slightly worse than ERFNet [16]. However, our full SSDNet model with three tasks can run at 100 fps, which is 2 times faster than ERFNet. Detailed per-category comparisons with the baseline are shown in Table 5. The statistics show that our SSDNet outperforms the baseline BaCNN model in almost all categories, which verifies the effectiveness of the proposed multi-task learning framework. Compared to the dense depth maps in Virtual KITTI, the depth (disparity) ground truth in the Cityscapes dataset is much noisier and only semi-dense, with many holes in it. Despite this incomplete and noisy supervisory signal, our proposed method is still able to produce full-density depth results. The quantitative depth results are shown in Table 6.
Compared with the multi-task learning [35] and unsup-stereo-depthGAN [45] methods, our SSDNet multi-task model achieves the best performance thanks to the effective sharing of semantic and boundary features.
Some qualitative results of the proposed method on the Cityscapes dataset are shown in Figure 5. From top to bottom, each row displays the RGB image, depth (disparity) ground truth, depth completion output, semantic ground truth, semantic prediction and boundary detection result, respectively. Despite the noisy depth ground truth, the proposed model can still benefit from triple-task learning and produce satisfying results.


Conclusions
In this paper, a multi-task network for simultaneous semantic segmentation and depth completion is proposed. With its single-encoder-multi-decoder structure, the model is capable of learning enhanced features suitable for all tasks. A boundary constraint is embedded into the two major tasks via multi-scale feature sharing and cross-task joint loss functions. The boundary features play the role of a bridge connecting the semantic and depth prediction tasks and strengthen the relationship between them, and the boundary-associated cross-task joint loss functions are beneficial for each task. The entire network is implemented end-to-end and evaluated on both synthesized and real datasets. The ablation and comparison results show that our multi-task SSDNet model can effectively improve the performance of both the semantic segmentation and depth completion tasks at a real-time frame rate.
Future work will focus on further improving the task performance by designing a more robust feature fusion mechanism and better network structure. We also plan to test our algorithm in more complex environments.