Multi-Scale Depthwise Separable Convolution for Semantic Segmentation in Street–Road Scenes
Abstract
1. Introduction
- A new module, Multi-Scale Depthwise Separable Convolution, is proposed in this paper. It extracts multi-scale information while remaining lightweight.
- The proposed structure makes a trade-off between accuracy and memory, significantly reducing the storage requirements of embedded AI computing devices, which benefits real-world deployment.
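One plausible reading of such a module, sketched below in PyTorch, runs parallel depthwise convolutions at several kernel sizes and fuses them with a single pointwise convolution. The kernel sizes, concatenation fusion, and BN/ReLU placement are illustrative assumptions, not the authors' exact design:

```python
import torch
import torch.nn as nn

class MultiScaleDSConv(nn.Module):
    """Hypothetical multi-scale depthwise separable block: each branch is a
    depthwise conv at a different kernel size; a 1x1 pointwise conv fuses
    the concatenated branch outputs into new features."""
    def __init__(self, in_ch, out_ch, scales=(3, 5, 7)):
        super().__init__()
        # groups=in_ch makes each conv depthwise (one filter per channel)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch, bias=False)
            for k in scales
        )
        self.pointwise = nn.Conv2d(in_ch * len(scales), out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.act(self.bn(self.pointwise(feats)))
```

Because every branch is depthwise, adding more scales grows the parameter count only linearly in the number of kernel sizes, which is what keeps the module lightweight.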
2. Related Works
2.1. Semantic Segmentation Task
2.2. Multi-Scale Feature Extraction
2.3. Computational Method in Convolution Operation
3. Proposed Algorithms
- To filter the input channels, depthwise convolution is used: a grouped convolution whose number of groups equals the number of input channels.
- To integrate the features produced by the depthwise convolution into new features, pointwise convolution (a standard convolution with a 1 × 1 kernel) is used.
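The parameter saving of this two-step factorization over a standard convolution follows directly from the shapes above; a small sketch of the arithmetic (the 3 × 3, 128-channel setting is an illustrative choice, not from the paper):

```python
# Parameter count of a standard k x k convolution versus its depthwise
# separable factorization (depthwise filtering + 1x1 pointwise fusion).
def standard_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing channels
    return depthwise + pointwise

k, c_in, c_out = 3, 128, 128
std = standard_conv_params(k, c_in, c_out)        # 147456
dsc = depthwise_separable_params(k, c_in, c_out)  # 17536
print(std, dsc, round(std / dsc, 1))              # roughly 8.4x fewer parameters
```

For a 3 × 3 kernel the saving approaches a factor of 9 as the channel counts grow, which is consistent with the large gap between the standard-convolution and depthwise-separable parameter counts reported in the tables below.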
4. Experimental Results
4.1. Parameter Setting
4.2. Performance Evaluation on the CamVid Dataset
4.3. Performance Evaluation on the KITTI Dataset
4.4. Performance Evaluation on the Cityscapes Dataset
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Layer | Operator | Channel |
---|---|---|
1–2 | Conv + BN + ReLU | 32 |
3 | Max pooling | 32 |
4–5 | Conv + BN + ReLU | 64 |
6 | Max pooling | 64 |
7–9 | Conv + BN + ReLU | 128 |
10 | Max pooling | 128 |
11–13 | Conv + BN + ReLU | 256 |
14 | Max pooling | 256 |
15–17 | Conv + BN + ReLU | 256 |
18 | Max pooling | 256 |
19 | up-Conv | 256 |
20 | up-Conv + skip connection | 128 |
21 | up-Conv + skip connection | 64 |
22 | up-Conv | 32 |
23 | up-Conv | 32 |
24 | 1 × 1 Conv + Softmax | class |
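The layer table above reads as a VGG-style encoder with an up-convolutional decoder and two skip connections. A minimal PyTorch sketch of that layout follows; the 3 × 3 kernels, transposed-convolution upsampling, and additive skips at the 128- and 64-channel stages are assumptions inferred from the channel counts, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n):
    """n repetitions of Conv + BN + ReLU, as in the encoder rows of the table."""
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3,
                             padding=1, bias=False),
                   nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class SegBackbone(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.enc1 = conv_block(3, 32, 2)     # layers 1-2
        self.enc2 = conv_block(32, 64, 2)    # layers 4-5
        self.enc3 = conv_block(64, 128, 3)   # layers 7-9
        self.enc4 = conv_block(128, 256, 3)  # layers 11-13
        self.enc5 = conv_block(256, 256, 3)  # layers 15-17
        self.pool = nn.MaxPool2d(2)
        self.up1 = nn.ConvTranspose2d(256, 256, 2, stride=2)  # layer 19
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)  # layer 20 (+ skip)
        self.up3 = nn.ConvTranspose2d(128, 64, 2, stride=2)   # layer 21 (+ skip)
        self.up4 = nn.ConvTranspose2d(64, 32, 2, stride=2)    # layer 22
        self.up5 = nn.ConvTranspose2d(32, 32, 2, stride=2)    # layer 23
        self.head = nn.Conv2d(32, num_classes, 1)             # layer 24

    def forward(self, x):
        e1 = self.pool(self.enc1(x))   # 1/2, 32 ch
        e2 = self.pool(self.enc2(e1))  # 1/4, 64 ch
        e3 = self.pool(self.enc3(e2))  # 1/8, 128 ch
        e4 = self.pool(self.enc4(e3))  # 1/16, 256 ch
        e5 = self.pool(self.enc5(e4))  # 1/32, 256 ch
        d = self.up1(e5)
        d = self.up2(d) + e3           # skip at 128 channels
        d = self.up3(d) + e2           # skip at 64 channels
        d = self.up5(self.up4(d))
        return self.head(d)            # softmax is applied in the loss
```

The skip placements match the only decoder rows whose channel counts coincide with an encoder stage (128 and 64), which is why they are drawn there in this sketch.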
Item | Standard Convolution | Depthwise Separable Convolution | Proposed Module
---|---|---|---
Building | 72.58 | 74.33 | 72.96 |
Tree | 65.41 | 68.38 | 67.27 |
Sky | 88.64 | 89.46 | 89.14 |
Car + TruckBus | 67.59 | 69.51 | 76.55
Traffic lights | 21.27 | 22.50 | 37.22 |
Road | 88.55 | 88.85 | 90.79 |
Pedestrian | 36.13 | 38.73 | 50.29 |
Fence | 28.39 | 37.21 | 41.01 |
ColumnPole | 11.13 | 12.98 | 28.49 |
Sidewalk | 69.76 | 69.68 | 77.23 |
Bicyclist | 28.89 | 30.18 | 40.26 |
MIoU | 52.58 | 54.71 | 61.02 |
Parameters | 40.36 M | 1.68 M | 2.68 M |
Item | FCN-8s | Segnet | Enet | PSPNet | BiseNet | Dilation8 | DeepLab | ICNet | FSSNet | MDS-FCN
---|---|---|---|---|---|---|---|---|---|---
Building | 77.8 | 88.8 | 74.7 | n/a | 83.0 | n/a | n/a | n/a | 84.6 | 72.96
Tree | 71.0 | 87.3 | 77.8 | n/a | 75.8 | n/a | n/a | n/a | 86.0 | 67.27
Sky | 88.7 | 92.4 | 95.1 | n/a | 92.0 | n/a | n/a | n/a | 94.3 | 89.14
Car + TruckBus | 76.1 | 82.1 | 82.4 | n/a | 83.7 | n/a | n/a | n/a | 84.6 | 76.55
Traffic lights | 32.7 | 20.5 | 51.0 | n/a | 46.5 | n/a | n/a | n/a | 57.9 | 37.22
Road | 91.2 | 97.2 | 95.1 | n/a | 94.6 | n/a | n/a | n/a | 95.2 | 90.79
Pedestrian | 41.7 | 57.1 | 67.2 | n/a | 58.8 | n/a | n/a | n/a | 80.9 | 50.29
Fence | 24.4 | 49.3 | 51.7 | n/a | 53.6 | n/a | n/a | n/a | 43.4 | 41.01
ColumnPole | 19.9 | 27.5 | 35.4 | n/a | 31.9 | n/a | n/a | n/a | 53.6 | 28.49
Sidewalk | 72.7 | 84.4 | 86.7 | n/a | 81.4 | n/a | n/a | n/a | 92.9 | 77.23
Bicyclist | 31.0 | 30.7 | 34.1 | n/a | 54.0 | n/a | n/a | n/a | 67.4 | 40.26
MIoU | 57.0 | 55.6 | 51.3 | 69.1 | 68.7 | 65.3 | 61.6 | 67.1 | 58.6 | 61.02 |
Parameters | 134.5 M | 29.45 M | 0.37 M | 65.7 M | 49.0 M | 140.8 M | 20.5 M | 26.6 M | 0.2 M | 2.68 M |
Item | Standard Convolution | Depthwise Separable Convolution | Proposed Module
---|---|---|---
MIoU | 43.02 | 41.26 | 51.71 |
Parameters | 40.36 M | 1.68 M | 2.68 M |
Item | FCN-32s | FCN-8s | Unet | Segnet | Enet | MDS-FCN
---|---|---|---|---|---|---
MIoU | 37.56 | 43.81 | 43.10 | 40.39 | 47.18 | 50.62 |
Parameters | 14.7 M | 134.55 M | 33.04 M | 29.45 M | 0.37 M | 2.68 M |
Model | MIoU | Parameters | FPS | GPU |
---|---|---|---|---|
FCN-8s | 65.3 | 134.5 M | 2 | Titan X |
Segnet | 57.0 | 29.45 M | 16.7 | Titan X |
ESPNet | 60.3 | 0.36 M | 112 | Titan X |
Enet | 58.3 | 0.37 M | 76.9 | Titan X |
ERFNet | 68.0 | 2.1 M | 41.7 | Titan X |
BiseNet | 68.4 | 12.5 M | 105.8 | Titan X |
ICNet | 69.5 | 26.6 M | 27.8 | Titan X |
Dilation10 | 67.1 | 140.5 M | 0.25 | Titan X |
DeepLab v3 | 81.3 | 59.4 M | n/a | Titan X |
PSPNet | 78.4 | 65.7 M | n/a | Titan X |
MDS-FCN | 68.5 | 2.68 M | 13.4 | GTX 1070ti
Share and Cite
Dai, Y.; Li, C.; Su, X.; Liu, H.; Li, J. Multi-Scale Depthwise Separable Convolution for Semantic Segmentation in Street–Road Scenes. Remote Sens. 2023, 15, 2649. https://doi.org/10.3390/rs15102649