Simultaneous Semantic Segmentation and Depth Completion with Constraint of Boundary
Abstract
1. Introduction
2. Proposed Methods
2.1. Network Architecture
2.1.1. Feature-Sharing Encoder
2.1.2. Decoder for Boundary Detection
2.1.3. Decoder for Semantic Segmentation and Depth Completion
2.2. Loss Function
2.2.1. Loss Function for Depth Completion
2.2.2. Loss Function for Semantic Segmentation and Boundary Detection
2.2.3. Loss Function for Joint Tasks
3. Experimental Results
3.1. Experimental Setup and Dataset Introduction
- Virtual KITTI [36] is a synthetic outdoor dataset. Each sequence contains 10 rendering variants: one clones an outdoor environment as closely as possible from the original KITTI benchmark, and the others apply geometric transformations or different weather conditions to the cloned scene. Each RGB image has a corresponding depth map and semantic segmentation ground truth. The ground-truth depth maps are randomly down-sampled to only 5% of the original density to produce the sparse depth input (a sampling sketch follows after this list). A total of 11,112 images are randomly selected for training, 2320 for validation, and 3576 for testing.
- CityScapes [37] is a real outdoor dataset containing high-quality semantic annotations for 5000 images collected in street scenes from 50 different cities. A total of 19 semantic labels are used for evaluation; they belong to 7 super-categories: ground, construction, object, nature, sky, human, and vehicle. The depth (disparity) ground truth is computed with the SGM stereo method [37]. In the experiment, the original disparity images are randomly down-sampled to 5% density and used as the sparse depth input. The training, validation, and testing sets contain 2975, 500, and 1525 images, respectively.
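On both datasets the sparse input is produced by randomly down-sampling the dense ground truth to 5% density. The paper does not give the sampling code, so the following is a minimal sketch of one way to implement it, assuming invalid pixels are marked with zero:

```python
import numpy as np

def sparsify_depth(depth, keep_ratio=0.05, seed=None):
    """Randomly keep `keep_ratio` of the valid depth pixels, zeroing the rest.

    `depth` is an (H, W) array of depth (or disparity) values; invalid
    pixels are assumed to be 0 and are never sampled.
    """
    rng = np.random.default_rng(seed)
    sparse = np.zeros_like(depth)
    idx = np.flatnonzero(depth > 0)            # indices of valid measurements
    keep = rng.choice(idx, size=int(len(idx) * keep_ratio), replace=False)
    sparse.flat[keep] = depth.flat[keep]       # copy only the sampled pixels
    return sparse
```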
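For CityScapes specifically, the precomputed SGM disparity PNGs must first be decoded. The decoding rule below follows the cityscapesScripts documentation; the baseline and focal length used in the depth conversion are illustrative placeholders only, since the per-sequence camera parameters shipped with the dataset should be used in practice:

```python
import numpy as np

def decode_disparity(png_u16):
    """Decode a CityScapes 16-bit disparity PNG: p > 0 encodes d = (p - 1) / 256."""
    d = png_u16.astype(np.float32)
    valid = d > 0
    d[valid] = (d[valid] - 1.0) / 256.0
    d[~valid] = 0.0                            # keep invalid pixels at zero
    return d

def disparity_to_depth(disp, baseline_m=0.22, focal_px=2262.0):
    """Convert disparity to metric depth via depth = baseline * focal / disparity.

    baseline_m and focal_px are illustrative values, not the exact
    calibration; read them from the camera files for real experiments.
    """
    depth = np.zeros_like(disp)
    valid = disp > 0
    depth[valid] = baseline_m * focal_px / disp[valid]
    return depth
```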
3.2. Experimental Analysis: Virtual KITTI
3.2.1. Experiments on Semantic Segmentation
- SSDNet_Sem: Removes the depth completion branch from the proposed multi-task learning framework. It can also be understood as applying two modifications to the BaCNN model:
- (a) BaCNN employs the boundary-detection sub-network as a cascaded task and feeds its boundary-similarity map into the subsequent semantic segmentation sub-network, whereas SSDNet_Sem treats the boundary task as a parallel branch that shares the encoder features. Moreover, in addition to independent loss functions for the boundary and semantic sub-tasks, SSDNet_Sem can also be optimized with a joint semantic-boundary loss function;
- (b) Whereas BaCNN performs early fusion by introducing the boundary-similarity map at the encoder stage, SSDNet_Sem performs late fusion at the decoder stage (see the sketch after this list).
- SSDNet (full model): The complete multi-task network, optimized by the full joint loss function.
- SSDNet_ind: The complete network model without using joint loss functions.
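To make the shared-encoder, late-fusion layout of these variants concrete, here is a minimal PyTorch sketch. It is an illustrative stand-in, not the authors' exact SSDNet: the layer counts, channel widths, and fusion operator are assumptions; only the overall topology (one encoder; a boundary decoder whose features are concatenated into the semantic and depth decoders at the decoder stage) follows the description above.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """Illustrative shared-encoder / multi-decoder layout (not the exact SSDNet)."""

    def __init__(self, in_ch=4, feat=64, n_classes=19):
        super().__init__()
        # One encoder consumes RGB + sparse depth (4 channels), downsampling 4x.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Three light decoders, each upsampling 4x back to input resolution.
        self.boundary_dec = nn.Sequential(
            nn.ConvTranspose2d(feat, feat, 4, stride=4), nn.ReLU(inplace=True))
        self.boundary_head = nn.Conv2d(feat, 1, 1)
        self.sem_dec = nn.Sequential(
            nn.ConvTranspose2d(feat, feat, 4, stride=4), nn.ReLU(inplace=True))
        self.sem_head = nn.Conv2d(feat * 2, n_classes, 1)
        self.dep_dec = nn.Sequential(
            nn.ConvTranspose2d(feat, feat, 4, stride=4), nn.ReLU(inplace=True))
        self.dep_head = nn.Conv2d(feat * 2, 1, 1)

    def forward(self, rgb, sparse_depth):
        x = self.encoder(torch.cat([rgb, sparse_depth], dim=1))
        fb = self.boundary_dec(x)                          # boundary features
        fs = self.sem_dec(x)
        fd = self.dep_dec(x)
        boundary = torch.sigmoid(self.boundary_head(fb))
        # Late fusion: boundary features are concatenated into both task heads.
        sem = self.sem_head(torch.cat([fs, fb], dim=1))
        depth = self.dep_head(torch.cat([fd, fb], dim=1))
        return sem, depth, boundary
```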
3.2.2. Experiments on Depth Completion
- SSDNet (full model): The proposed semantic segmentation and depth completion multi-task network.
- SSDNet_Dep: Removes the semantic branch from the full model, but still uses both the sparse depth and the RGB image as input.
- SSDNet_Dep_d: Uses sparse depth as the only data source on the SSDNet_Dep model.
- SSDNet_Dep_rgb: Uses the RGB image as the only data source and performs depth prediction on the SSDNet_Dep model.
- SSDNet_ind: The complete model without using joint loss functions.
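These variants are evaluated with MAE and RMSE restricted to three ground-truth depth ranges (0–20 m, 0–50 m, 0–100 m), as reported in the depth completion table below. A minimal sketch of that range-limited computation, assuming invalid ground-truth pixels are marked with zero:

```python
import numpy as np

def depth_errors(pred, gt, max_range_m):
    """MAE and RMSE in cm over valid ground-truth pixels within max_range_m.

    pred, gt: (H, W) depth maps in metres; gt == 0 marks missing ground truth.
    """
    mask = (gt > 0) & (gt <= max_range_m)
    diff_cm = (pred[mask] - gt[mask]) * 100.0  # report errors in centimetres
    mae = np.abs(diff_cm).mean()
    rmse = np.sqrt((diff_cm ** 2).mean())
    return mae, rmse

# e.g. evaluate the three ranges used in the results table:
# for r in (20, 50, 100):
#     print(r, depth_errors(pred, gt, r))
```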
3.3. Experimental Analysis on CityScapes
4. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
2. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
4. Ladický, L.U.; Russell, C.; Kohli, P.; Torr, P.H. Associative hierarchical CRFs for object class image segmentation. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 739–746.
5. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
6. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
7. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
8. Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv 2014, arXiv:1412.7062.
9. Torralba, A.; Oliva, A. Depth estimation from image structure. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1226–1238.
10. Liu, B.; Gould, S.; Koller, D. Single image depth estimation from predicted semantic labels. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 1253–1260.
11. Hua, J.; Gong, X. A normalized convolutional neural network for guided sparse depth upsampling. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 2283–2290.
12. Uhrig, J.; Schneider, N.; Schneider, L.; Franke, U.; Brox, T.; Geiger, A. Sparsity invariant CNNs. In Proceedings of the 2017 International Conference on 3D Vision (3DV), Qingdao, China, 10–12 October 2017; pp. 11–20.
13. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
14. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
16. Romera, E.; Alvarez, J.M.; Bergasa, L.M.; Arroyo, R. ERFNet: Efficient residual factorized ConvNet for real-time semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2017, 19, 263–272.
17. Couprie, C.; Farabet, C.; Najman, L.; LeCun, Y. Indoor semantic segmentation using depth information. arXiv 2013, arXiv:1301.3572.
18. Gupta, S.; Girshick, R.; Arbeláez, P.; Malik, J. Learning rich features from RGB-D images for object detection and segmentation. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 345–360.
19. Wang, W.; Neumann, U. Depth-aware CNN for RGB-D segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 135–150.
20. Zou, N.; Xiang, Z.; Chen, Y.; Chen, S.; Qiao, C. Boundary-aware CNN for semantic segmentation. IEEE Access 2019, 7, 114520–114528.
21. Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2 (NIPS’14), Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2014; pp. 2366–2374.
22. Eigen, D.; Fergus, R. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2650–2658.
23. Ku, J.; Harakeh, A.; Waslander, S.L. In defense of classical image processing: Fast depth completion on the CPU. In Proceedings of the 2018 15th Conference on Computer and Robot Vision (CRV), Toronto, ON, Canada, 9–11 May 2018; pp. 16–22.
24. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Asian Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 213–228.
25. Ma, F.; Karaman, S. Sparse-to-dense: Depth prediction from sparse depth samples and a single image. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 1–8.
26. Jaritz, M.; De Charette, R.; Wirbel, E.; Perrotton, X.; Nashashibi, F. Sparse and dense data with CNNs: Depth completion and semantic segmentation. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 52–60.
27. Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q.V. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
28. Zhang, Y.; Funkhouser, T. Deep depth completion of a single RGB-D image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 175–185.
29. Qiu, J.; Cui, Z.; Zhang, Y.; Zhang, X.; Liu, S.; Zeng, B.; Pollefeys, M. DeepLiDAR: Deep surface normal guided depth prediction for outdoor scene from sparse LiDAR data and single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3313–3322.
30. Murphy, K.P.; Torralba, A.; Freeman, W.T. Using the forest to see the trees: A graphical model relating features, objects, and scenes. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2004; pp. 1499–1506.
31. Teichmann, M.; Weber, M.; Zoellner, M.; Cipolla, R.; Urtasun, R. MultiNet: Real-time joint semantic reasoning for autonomous driving. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 1013–1020.
32. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229.
33. Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003.
34. Uhrig, J.; Cordts, M.; Franke, U.; Brox, T. Pixel-level encoding and depth layering for instance-level semantic labeling. In German Conference on Pattern Recognition; Springer: Cham, Switzerland, 2016; pp. 14–25.
35. Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7482–7491.
36. Gaidon, A.; Wang, Q.; Cabon, Y.; Vig, E. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4340–4349.
37. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
38. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
39. Harrison, A.; Newman, P. Image and sparse laser fusion for dense scene reconstruction. In Field and Service Robotics; Springer: Berlin/Heidelberg, Germany, 2010; pp. 219–228.
40. Ferstl, D.; Reinbacher, C.; Ranftl, R.; Rüther, M.; Bischof, H. Image guided depth upsampling using anisotropic total generalized variation. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 993–1000.
41. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147.
42. Berman, M.; Rannen Triki, A.; Blaschko, M.B. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4413–4421.
43. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 552–568.
44. Poudel, R.P.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast semantic segmentation network. arXiv 2019, arXiv:1902.04502.
45. Pilzer, A.; Xu, D.; Puscas, M.; Ricci, E.; Sebe, N. Unsupervised adversarial depth estimation using cycled generative networks. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 587–595.
Evaluation metrics (standard definitions, where $n_{ij}$ is the number of pixels of ground-truth class $i$ predicted as class $j$, $t_i = \sum_j n_{ij}$, $n_{cl}$ is the number of classes, and $\hat{d}_p$, $d_p$ are the predicted and ground-truth depths over the $N$ valid pixels):

| Task | Metric | Definition |
|---|---|---|
| Semantic segmentation | Pixel accuracy (Acc) | $\sum_i n_{ii} / \sum_i t_i$ |
| Semantic segmentation | Mean pixel accuracy (mAcc) | $\frac{1}{n_{cl}} \sum_i n_{ii} / t_i$ |
| Semantic segmentation | Mean Intersection-over-Union (mIoU) | $\frac{1}{n_{cl}} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$ |
| Semantic segmentation | Frequency-weighted IoU (fwIoU) | $\frac{1}{\sum_k t_k} \sum_i \frac{t_i \, n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}$ |
| Depth completion | Root mean squared error (RMSE) | $\sqrt{\frac{1}{N} \sum_p (\hat{d}_p - d_p)^2}$ |
| Depth completion | Mean absolute error (MAE) | $\frac{1}{N} \sum_p \lvert \hat{d}_p - d_p \rvert$ |
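For reference, a short sketch that computes all four segmentation metrics from a confusion matrix, following the standard definitions in the table above (`conf[i, j]` counts pixels of ground-truth class `i` predicted as class `j`):

```python
import numpy as np

def segmentation_metrics(conf):
    """Return (Acc, mAcc, mIoU, fwIoU) from an (n_cl, n_cl) confusion matrix."""
    tp = np.diag(conf).astype(np.float64)                # n_ii
    gt_per_class = conf.sum(axis=1).astype(np.float64)   # t_i
    pred_per_class = conf.sum(axis=0).astype(np.float64)
    union = gt_per_class + pred_per_class - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        per_class_acc = tp / gt_per_class                # NaN for absent classes
        iou = tp / union
    acc = tp.sum() / conf.sum()
    macc = np.nanmean(per_class_acc)                     # absent classes skipped
    miou = np.nanmean(iou)
    fwiou = np.nansum(gt_per_class * iou) / conf.sum()
    return acc, macc, miou, fwiou
```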
Semantic segmentation results on Virtual KITTI:

| Method | Acc (%) | mAcc (%) | mIoU (%) | fwIoU (%) | Params | FLOPs |
|---|---|---|---|---|---|---|
| FCN8S [5] | 73.549 | 52.243 | 28.627 | 61.948 | 1.3 × 10⁸ | 2.2 × 10¹⁰ |
| BaCNN [20] (baseline) | 75.439 | 54.187 | 35.190 | 66.027 | 1.5 × 10⁸ | 3.8 × 10¹⁰ |
| SSDNet_Sem | 75.759 | 58.091 | 41.786 | 64.018 | 2.5 × 10⁷ | 2.5 × 10¹⁰ |
| SSDNet_ind | 76.427 | 58.579 | 41.134 | 64.823 | 3.5 × 10⁷ | 3.3 × 10¹⁰ |
| SSDNet (our method) | 78.967 | 63.461 | 46.041 | 68.227 | 3.5 × 10⁷ | 3.3 × 10¹⁰ |
Depth completion results on Virtual KITTI (MAE and RMSE in cm, over three ground-truth depth ranges):

| Method | 0–20 m MAE | 0–20 m RMSE | 0–50 m MAE | 0–50 m RMSE | 0–100 m MAE | 0–100 m RMSE | Params | FLOPs |
|---|---|---|---|---|---|---|---|---|
| MRF [39] | 56.67 | 116.776 | 131.03 | 312.41 | 209.45 | 575.20 | n/a | n/a |
| TGV [40] | 41.85 | 114.57 | 113.38 | 323.97 | 205.78 | 621.48 | n/a | n/a |
| Sparse-to-dense [25] | 258.98 | 386.91 | 653.54 | 1066.55 | 1072.52 | 1892.04 | n/a | n/a |
| SparseConvNet [12] | 56.44 | 137.34 | 153.01 | 384.96 | 258.23 | 681.13 | n/a | n/a |
| SSDNet_Dep | 21.97 | 86.32 | 87.70 | 282.63 | 145.90 | 473.84 | 2.5 × 10⁷ | 2.5 × 10¹⁰ |
| SSDNet_Dep_d | 34.44 | 113.37 | 117.59 | 370.50 | 169.76 | 588.63 | 2.5 × 10⁷ | 2.5 × 10¹⁰ |
| SSDNet_Dep_rgb | 72.48 | 199.15 | 288.73 | 702.71 | 455.76 | 1125.47 | 2.5 × 10⁷ | 2.5 × 10¹⁰ |
| SSDNet_ind | 24.96 | 95.23 | 86.71 | 277.04 | 119.66 | 404.70 | 3.5 × 10⁷ | 3.3 × 10¹⁰ |
| SSDNet (our method) | 21.65 | 85.05 | 86.47 | 274.03 | 118.58 | 395.51 | 3.5 × 10⁷ | 3.3 × 10¹⁰ |
Results on CityScapes (IoU_cat: category-level IoU; IoU_cla: class-level IoU; fwt: forward time per frame):

| Method | IoU_cat (%) | IoU_cla (%) | fwt (s) |
|---|---|---|---|
| FCN8S [5] | 81.6 | 61.9 | 0.5 |
| BaCNN [20] (baseline) | 85.6 | 64.8 | 0.012 |
| SSDNet (our method) | 86.0 | 65.3 | 0.010 |
| ENet [41] | 80.4 | 58.3 | 0.013 |
| ENet_LSloss [42] | 83.6 | 63.1 | 0.013 |
| ESPNet [43] | 82.2 | 60.3 | 0.009 |
| Fast-SCNN [44] | 80.5 | 62.8 | 0.004 |
| ERFNet [16] | 86.5 | 68.0 | 0.02 |
| Deeplab LargeFOV [8] | 81.2 | 63.1 | 4.0 |
Per-category IoU (%) on CityScapes:

| Method | Flat | Nature | Object | Sky | Construction | Human | Vehicle | mIoU |
|---|---|---|---|---|---|---|---|---|
| BaCNN [20] (baseline) | 97.9 | 90.8 | 63.7 | 92.3 | 89.6 | 75.3 | 89.7 | 85.6 |
| SSDNet | 98.0 | 91.1 | 64.6 | 93.9 | 90.4 | 73.1 | 90.7 | 86.0 |