Global–Local Deep Fusion: Semantic Integration with Enhanced Transformer in Dual-Branch Networks for Ultra-High Resolution Image Segmentation
Abstract
1. Introduction
- We introduce GL-Deep Fusion, which effectively preserves the correlation between global semantics and ultra-high resolution image details through its integrated feature representation.
- The global contextual information and the local cropped-patch details captured by the dual-branch structure are exchanged directly between the dual encoders of the GL-Deep Fusion module, avoiding redundant feature computation.
- Our proposed GLSNet markedly improves both GPU memory efficiency and segmentation accuracy for ultra-high resolution image segmentation. Compared with GLNet (the baseline), it reduces GPU memory usage by 24.1% on the DeepGlobe dataset [3], and it also achieves outstanding results on the Vaihingen dataset [19]. A minimal sketch of this dual-branch design follows the list.
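The contributions above describe a dual-branch design: a shallow network processes the downsampled whole image for global context, a deep network processes full-resolution local crops for detail, and the two feature streams are fused before prediction. The PyTorch sketch below illustrates that data flow only; the class name DualBranchSketch, the projection widths, the class count, and the plain additive fusion placeholder are illustrative assumptions, not the paper's implementation (the actual fusion module is GL-Deep Fusion, Section 3.2).

```python
# Minimal sketch (not the authors' released code) of the dual-branch idea:
# a shallow branch sees the whole image at low resolution, a deep branch sees
# full-resolution crops, and a fusion step merges the two feature streams.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, resnet50

class DualBranchSketch(nn.Module):
    def __init__(self, num_classes=7, fused_dim=256):
        super().__init__()
        # Global shallow branch: ResNet-18 trunk on the downsampled full image.
        g = resnet18(weights=None)
        self.global_branch = nn.Sequential(*list(g.children())[:-2])   # -> 512 ch
        # Local deep branch: ResNet-50 trunk on full-resolution crops.
        l = resnet50(weights=None)
        self.local_branch = nn.Sequential(*list(l.children())[:-2])    # -> 2048 ch
        self.global_proj = nn.Conv2d(512, fused_dim, 1)
        self.local_proj = nn.Conv2d(2048, fused_dim, 1)
        self.head = nn.Conv2d(fused_dim, num_classes, 1)

    def forward(self, global_img, local_crop):
        g_feat = self.global_proj(self.global_branch(global_img))
        l_feat = self.local_proj(self.local_branch(local_crop))
        # Resize the global context to the crop's feature grid and fuse.
        g_feat = F.interpolate(g_feat, size=l_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        fused = g_feat + l_feat          # placeholder for GL-Deep Fusion
        return self.head(fused)

# Usage: e.g., a 2448 x 2448 DeepGlobe tile downsampled to 512 x 512 for the
# global branch, plus one 512 x 512 full-resolution crop for the local branch.
model = DualBranchSketch()
out = model(torch.randn(1, 3, 512, 512), torch.randn(1, 3, 512, 512))
print(out.shape)  # torch.Size([1, 7, 16, 16]) before upsampling to crop size
```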
2. Related Work
2.1. Image Segmentation
2.2. Segmentation of Ultra-High Resolution Images: Efficiency and Quality
3. Proposed Method
3.1. Overview
3.2. GL-Deep Fusion
3.3. Global Shallow Branch and Local Deep Branch
4. Experiments and Results
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Experimental Results
4.4.1. Results and Analysis on the Vaihingen Dataset
4.4.2. Results and Analysis on the DeepGlobe Dataset
4.5. Ablation Experiments
4.5.1. The Effects of Shallow–Deep Branch and GL-Deep Fusion
4.5.2. The Effect of Transformer Attention
4.6. Visualization Results and Analysis
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Minaee, S.; Boykov, Y.; Porikli, F.; Plaza, A.; Kehtarnavaz, N.; Terzopoulos, D. Image segmentation using deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3523–3542. [Google Scholar] [CrossRef] [PubMed]
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
- Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 172–181. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
- Hou, J.; Guo, Z.; Wu, Y.; Diao, W.; Xu, T. BSNet: Dynamic hybrid gradient convolution based boundary-sensitive network for remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624022. [Google Scholar] [CrossRef]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
- Zhao, H.; Qi, X.; Shen, X.; Shi, J.; Jia, J. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 405–420. [Google Scholar]
- Lin, G.; Milan, A.; Shen, C.; Reid, I. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1925–1934. [Google Scholar]
- Zhang, H.; Dana, K.; Shi, J.; Zhang, Z.; Wang, X.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7151–7160. [Google Scholar]
- Ascher, S.; Pincus, E. The Filmmaker’s Handbook: A Comprehensive Guide for the Digital Age; Penguin: London, UK, 2007. [Google Scholar]
- Lilly, P. Samsung launches insanely wide 32:9 aspect ratio monitor with hdr and freesync 2. PC Gamer, 10 June 2017. [Google Scholar]
- Digital Cinema Initiatives. Digital Cinema System Specification, Version 1.3. 2018. Available online: http://dcimovies.com/specification/DCI%20DCSS%20Ver1-3%202018-0627.pdf (accessed on 11 July 2023).
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- ISPRS Vaihingen Dataset. Available online: https://paperswithcode.com/dataset/isprs-vaihingen (accessed on 15 September 2023).
- Chen, W.; Jiang, Z.; Wang, Z.; Cui, K.; Qian, X. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8924–8933. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
- AlMarzouqi, H.; Saoud, L.S. Semantic Labeling of High Resolution Images Using EfficientUNets and Transformers. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 4402913. [Google Scholar] [CrossRef]
- Zhu, F.; Zhu, Y.; Zhang, L.; Wu, C.; Fu, Y.; Li, M. A unified efficient pyramid transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 2667–2677. [Google Scholar]
- Gu, J.; Zhu, H.; Feng, C.; Liu, M.; Jiang, Z.; Chen, R.T.; Pan, D.Z. Towards memory-efficient neural networks via multi-level in situ generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 5229–5238. [Google Scholar]
- Jiang, C.; Qiu, Y.; Shi, W.; Ge, Z.; Wang, J.; Chen, S.; Cérin, C.; Ren, Z.; Xu, G.; Lin, J. Characterizing co-located workloads in alibaba cloud datacenters. IEEE Trans. Cloud Comput. 2020, 10, 2381–2397. [Google Scholar] [CrossRef]
- Venkat, A.; Rusira, T.; Barik, R.; Hall, M.; Truong, L. SWIRL: High-performance many-core CPU code generation for deep neural networks. Int. J. High Perform. Comput. Appl. 2019, 33, 1275–1289. [Google Scholar] [CrossRef]
- Dao, T.; Fu, D.; Ermon, S.; Rudra, A.; Ré, C. Flashattention: Fast and memory-efficient exact attention with io-awareness. Adv. Neural Inf. Process. Syst. 2022, 35, 16344–16359. [Google Scholar]
- Ivanov, A.; Dryden, N.; Ben-Nun, T.; Li, S.; Hoefler, T. Data movement is all you need: A case study on optimizing transformers. Proc. Mach. Learn. Syst. 2021, 3, 711–732. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Liu, D.; Wen, B.; Liu, X.; Wang, Z.; Huang, T.S. When image denoising meets high-level vision tasks: A deep learning approach. arXiv 2017, arXiv:1706.04284. [Google Scholar]
- Liu, D.; Wen, B.; Jiao, J.; Liu, X.; Wang, Z.; Huang, T.S. Connecting image denoising and high-level vision tasks via deep learning. IEEE Trans. Image Process. 2020, 29, 3695–3706. [Google Scholar] [CrossRef]
- Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv 2016, arXiv:1606.02147. [Google Scholar]
- Liu, W.; Rabinovich, A.; Berg, A.C. Parsenet: Looking wider to see better. arXiv 2015, arXiv:1506.04579. [Google Scholar]
- Poudel, R.P.; Bonde, U.; Liwicki, S.; Zach, C. Contextnet: Exploring context and detail for semantic segmentation in real-time. arXiv 2018, arXiv:1805.04554. [Google Scholar]
- Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
- Mazzini, D. Guided upsampling network for real-time semantic segmentation. arXiv 2018, arXiv:1807.07466. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
- Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in pytorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
- Mou, L.; Hua, Y.; Zhu, X.X. A relation-augmented fully convolutional network for semantic segmentation in aerial scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12416–12425. [Google Scholar]
- Marmanis, D.; Schindler, K.; Wegner, J.D.; Galliani, S.; Datcu, M.; Stilla, U. Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens. 2018, 135, 158–172. [Google Scholar] [CrossRef]
- Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
Layer Name | Output Size | 18-Layer | 50-Layer |
---|---|---|---|
conv1 | 112 × 112 | 7 × 7, 64, stride 2 | 7 × 7, 64, stride 2 |
conv2_x | 56 × 56 | 3 × 3 max pool, stride 2; [3 × 3, 64; 3 × 3, 64] × 2 | 3 × 3 max pool, stride 2; [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3 |
conv3_x | 28 × 28 | [3 × 3, 128; 3 × 3, 128] × 2 | [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4 |
conv4_x | 14 × 14 | [3 × 3, 256; 3 × 3, 256] × 2 | [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6 |
conv5_x | 7 × 7 | [3 × 3, 512; 3 × 3, 512] × 2 | [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3 |
| 1 × 1 | average pool, 1000-d fc, softmax | average pool, 1000-d fc, softmax |
FLOPs | | 1.8 × 10⁹ | 3.8 × 10⁹ |
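One way to read the table above in the context of the shallow–deep split is through parameter counts: the ResNet-18 trunk used by the global branch is roughly half the size of the ResNet-50 trunk used by the local branch, which is why the downsampled whole image can be handled cheaply. The snippet below is a quick, hedged check using torchvision models; trunk_params is an illustrative helper, and the counts are approximate.

```python
# Sketch: compare the parameter counts of the two backbone trunks used by the
# shallow global branch (ResNet-18) and the deep local branch (ResNet-50).
import torch
from torchvision.models import resnet18, resnet50

def trunk_params(model):
    # Count parameters of the convolutional trunk only (drop avgpool/fc head).
    trunk = torch.nn.Sequential(*list(model.children())[:-2])
    return sum(p.numel() for p in trunk.parameters())

print(f"ResNet-18 trunk: {trunk_params(resnet18(weights=None)) / 1e6:.1f} M params")
print(f"ResNet-50 trunk: {trunk_params(resnet50(weights=None)) / 1e6:.1f} M params")
# Roughly 11 M vs. 23 M parameters, so the downsampled global image can be
# processed by a much lighter network than the full-resolution local crops.
```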
Method | Imp. Surf. | Building | Low Veg. | Tree | Car | OA | Mean F1 | mIoU |
---|---|---|---|---|---|---|---|---|
FCN-8s [45] | 90.0 | 93.0 | 77.7 | 86.5 | 80.4 | 88.3 | 85.5 | 75.5 |
UNet [8] | 90.5 | 93.3 | 79.6 | 87.5 | 76.4 | 89.2 | 85.5 | 75.5 |
SegNet [11] | 90.2 | 93.7 | 78.5 | 85.8 | 83.9 | 88.5 | 86.4 | 76.8 |
EncNet [14] | 91.2 | 94.1 | 79.2 | 86.9 | 83.7 | 89.4 | 87.0 | 77.8 |
RefineNet [13] | 91.1 | 94.1 | 79.8 | 87.2 | 82.3 | 88.9 | 86.9 | 77.1 |
CCEM [46] | 91.5 | 93.8 | 79.4 | 87.3 | 83.5 | 89.6 | 87.1 | 78.0 |
DeepLabv3 [6] | 91.4 | 94.7 | 79.6 | 87.6 | 85.8 | 88.9 | 87.8 | 79.0 |
S-RA-FCN [45] | 90.5 | 93.8 | 79.6 | 87.5 | 82.6 | 89.2 | 86.8 | 77.3 |
PSPNet [10] | 90.6 | 94.3 | 79.0 | 87.0 | 70.7 | 89.1 | 84.3 | 74.1 |
TransUNet [44] | 92.2 | 93.9 | 83.7 | 87.4 | 89.3 | 89.1 | 80.4 | |
Mask2Former [23] | 91.4 | 94.2 | 82.0 | 86.4 | 86.0 | 88.3 | 88.0 | 78.1 |
BSNet [9] | 92.1 | 94.4 | 83.1 | 86.7 | 90.3 | 88.9 | 80.2 | |
GLSNet | 87.6 | | | | | | | |
Model | Local Inference mIoU [%] | Local Inference Memory [MB] | Global Inference mIoU [%] | Global Inference Memory [MB] |
---|---|---|---|---|
U-Net [8] | 37.3 | 949 | 38.4 | 5507 |
ICNet [12] | 35.5 | 1195 | 40.2 | 2557 |
PSPNet [10] | 53.3 | 1513 | 56.6 | 6289 |
SegNet [11] | 60.8 | 1139 | 61.2 | 10,339 |
DeepLabv3+ [7] | 63.1 | 1279 | 63.5 | 3199 |
FCN-8s [45] | 64.3 | 1963 | 70.1 | 5227 |
Mask2Former [23] | 66.7 | 3458 | 70.3 | 23,577 |
TransUNet [44] | 68.2 | 2436 | 70.2 | 6283 |

Model | Local and Global Inference mIoU [%] | Memory [MB] |
---|---|---|
GLNet [20] (baseline) | 71.6 | 1865 |
GLSNet | | |
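The mIoU values in these tables follow the usual definition: intersection-over-union computed per class and averaged over the classes. The short NumPy sketch below shows that computation on a toy label map; mean_iou is an illustrative helper, not code from the paper.

```python
# Sketch of the standard mIoU metric used in the comparison tables.
import numpy as np

def mean_iou(pred, target, num_classes):
    """Compute mIoU from integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example: a 4-pixel map with 2 classes.
pred   = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
print(mean_iou(pred, target, num_classes=2))  # (1/2 + 2/3) / 2 ≈ 0.583
```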
Network Architectures | Global Backbone | Local Backbone | Fusion |
---|---|---|---|
GLNet (baseline) | ResNet50 | ResNet50 | FPN |
Shallow–Deep | ResNet18 | ResNet50 | GL-Deep Fusion |
Shallow–Shallow | ResNet18 | ResNet18 | GL-Deep Fusion |
Deep–Deep | ResNet50 | ResNet50 | FPN + GL-Deep Fusion |
Network Architectures | mIoU [%] | Memory [MB] | FPS [f/s] |
---|---|---|---|
GLNet (baseline) | 71.6 | 1865 | 0.05 |
Shallow–Deep | 72.4 | 1414 | 1.14 |
Shallow–Shallow | 71.9 | | 1.34 |
Deep–Deep | | 2903 | 0.50 |
Network Architectures | mIoU [%] | Memory [MB] | FPS [f/s] |
---|---|---|---|
GLNet (baseline) [20] | 71.6 | 1865 | 0.05 |
Attention (DANet) [47] | 71.4 | 1510 | 0.02 |
Attention (transformer) [39] | | | 1.14 |
Method | Core Contribution |
---|---|
FCN (Series) [31,45] | Introduced fully convolutional networks for semantic segmentation. |
U-Net (Series) [8,25,32,33] | Utilizes an encoder–decoder architecture with skip connections to integrate features from various levels, particularly effective for medical image segmentation. |
DeepLab (Series) [4,5,6,7] | Utilizes atrous convolution and pyramid pooling modules to effectively expand the receptive field and capture multi-scale contextual information. |
GLNet [20] | A dual-branch network that leverages multi-level feature pyramid networks (FPNs) to exchange features between branches, improving feature utilization. |
MaskFormer (Series) [22,23] | Introduces Transformer decoders and proposes a mask classification model that unifies semantic, instance, and panoptic segmentation tasks by predicting a set of binary masks. |
Our Method | Utilizes a global shallow branch and a local deep branch in conjunction with GL-Deep Fusion based on branch-attention, achieving complete collaboration between the dual-branch structure, which is highly suitable for the field of ultra-high resolution image segmentation. |
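To make the "GL-Deep Fusion based on branch-attention" entry in the last row more concrete, the sketch below shows one generic way such a fusion could be wired with cross-attention: local (deep-branch) features act as queries over global (shallow-branch) features, so crop details are re-embedded with scene-level context. This is an assumption-based illustration using standard PyTorch modules; CrossBranchFusion and its dimensions are hypothetical and do not reproduce the exact module described in Section 3.2.

```python
# Generic cross-attention sketch of dual-branch fusion (assumed design, not
# the paper's exact GL-Deep Fusion module).
import torch
import torch.nn as nn

class CrossBranchFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, local_feat, global_feat):
        # local_feat:  (B, C, h, w) features of a full-resolution crop
        # global_feat: (B, C, H, W) features of the downsampled whole image
        b, c, h, w = local_feat.shape
        q = local_feat.flatten(2).transpose(1, 2)          # (B, h*w, C)
        kv = global_feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
        fused, _ = self.attn(q, kv, kv)                    # local queries global
        fused = self.norm1(q + fused)                      # residual + norm
        fused = self.norm2(fused + self.ffn(fused))        # feed-forward block
        return fused.transpose(1, 2).reshape(b, c, h, w)

fusion = CrossBranchFusion()
out = fusion(torch.randn(2, 256, 16, 16), torch.randn(2, 256, 16, 16))
print(out.shape)  # torch.Size([2, 256, 16, 16])
```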