Bidirectional Feature Fusion and Enhanced Alignment Based Multimodal Semantic Segmentation for Remote Sensing Images
Abstract
1. Introduction
- A BFF module based on self-attention and cross-attention is proposed to preserve the completeness of single-modal semantic information while exploiting the complementarity of image–text multimodal features (a minimal fusion sketch is given after this list).
- CPC learning is applied to the pixel–text score map obtained by feature alignment, reducing the differences among pixels of the same category and enlarging the gaps between pixels of different categories.
- A positive- and negative-sample selection strategy extends CPC learning across different images, making full use of the global semantic features of pixels (a hedged contrastive-loss sketch also follows this list).
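To make the BFF idea concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a bidirectional image–text fusion block built from standard self-attention and cross-attention; the module names, feature dimension, head count, and token shapes are illustrative assumptions.

```python
# Minimal sketch, assuming image features of shape (B, N, C) (flattened spatial
# tokens) and text features of shape (B, K, C) (one embedding per category prompt).
import torch
import torch.nn as nn


class BidirectionalFusionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Self-attention branches keep each modality's own context intact.
        self.img_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention branches exchange complementary information.
        self.img_from_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_from_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # Intra-modal self-attention (residual) preserves single-modal semantics.
        img = img_tokens + self.img_self_attn(img_tokens, img_tokens, img_tokens)[0]
        txt = txt_tokens + self.txt_self_attn(txt_tokens, txt_tokens, txt_tokens)[0]
        # Text-to-image fusion: image tokens query the text embeddings.
        img = self.norm_img(img + self.img_from_txt(img, txt, txt)[0])
        # Image-to-text fusion: text embeddings query the image tokens.
        txt = self.norm_txt(txt + self.txt_from_img(txt, img, img)[0])
        return img, txt


if __name__ == "__main__":
    fuse = BidirectionalFusionBlock(dim=512)
    img_feat = torch.randn(2, 32 * 32, 512)  # e.g., a 32x32 feature map, flattened
    txt_feat = torch.randn(2, 6, 512)        # e.g., 6 category prompts
    fused_img, fused_txt = fuse(img_feat, txt_feat)
    print(fused_img.shape, fused_txt.shape)
```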
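Likewise, the CPC learning and cross-image positive/negative sampling can be illustrated with the hedged sketch below: an InfoNCE-style pixel contrastive loss on the pixel–text score map, where positives and negatives for a pixel are drawn from all images in the batch. The sampling size, temperature, and ignore index are assumptions, not values from the paper.

```python
# Minimal sketch of a category-based pixel-level contrastive (CPC) loss on a
# pixel-text score map of shape (B, K, H, W); labels (B, H, W) are assumed to be
# resized to the score-map resolution. All names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F


def cpc_loss(score_map, labels, samples_per_class=256, temperature=0.1, ignore_index=255):
    b, k, h, w = score_map.shape
    # Flatten all pixels of the batch so positives/negatives can come from any image.
    scores = score_map.permute(0, 2, 3, 1).reshape(-1, k)    # (B*H*W, K)
    lbls = labels.reshape(-1)                                 # (B*H*W,)
    feats, feat_lbls = [], []
    for c in torch.unique(lbls):
        if c.item() == ignore_index:
            continue
        idx = (lbls == c).nonzero(as_tuple=True)[0]
        # Randomly subsample pixels of this category across the whole batch.
        sel = idx[torch.randperm(idx.numel(), device=idx.device)[:samples_per_class]]
        feats.append(F.normalize(scores[sel], dim=1))
        feat_lbls.append(lbls[sel])
    feats = torch.cat(feats)                                  # (M, K) sampled pixel vectors
    feat_lbls = torch.cat(feat_lbls)                          # (M,)
    sim = feats @ feats.t() / temperature                     # pairwise similarities
    sim.fill_diagonal_(-1e4)                                  # exclude self-similarity
    pos_mask = (feat_lbls[:, None] == feat_lbls[None, :]).float()
    pos_mask.fill_diagonal_(0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)
    # Pull same-category pixels together, push different categories apart.
    return -((pos_mask * log_prob).sum(1) / pos_count).mean()


if __name__ == "__main__":
    scores = torch.randn(2, 6, 64, 64, requires_grad=True)   # pixel-text score maps
    labels = torch.randint(0, 6, (2, 64, 64))                # downsampled ground truth
    loss = cpc_loss(scores, labels)
    loss.backward()
    print(float(loss))
```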
2. Related Work
2.1. Feature Encoding
2.2. Image–Text Information Fusion and Alignment
2.3. Feature Decoding
3. Proposed Method
3.1. Image–Text Bidirectional Feature Fusion
3.2. Image–Text Feature Alignment
3.3. Feature Decoding
4. Experiments and Discussion
4.1. Experimental Settings
4.1.1. Dataset
- The Potsdam dataset contains 38 RGB images with a spatial resolution of 5 cm; each image is 6000 × 6000 pixels. Tiles 2–10~2–12, 3–10~3–12, 4–10~4–12, 5–10~5–12, 6–7~6–12, and 7–7~7–12 (24 images with ground-truth labels in total) are used as the training set, and the remaining 14 images are used as the test set.
- The Vaihingen dataset contains 33 IRRG (near-infrared, red, green) images with a spatial resolution of 9 cm; the image size is not fixed and averages 2494 × 2064 pixels. Areas 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 30, 32, 34, and 37 (16 images with ground-truth labels in total) are used as the training set, and the remaining 17 images are used as the test set (a hedged patch-cropping sketch follows this list).
- The LoveDA dataset contains 5987 images already cropped into 1024 × 1024 pixel patches. Following the official dataset split, 2522 images are used for training and 1669 images for testing in the experiments.
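Because the Potsdam and Vaihingen tiles are far larger than typical network input sizes, they are presumably cropped into fixed-size patches before training and testing. The sketch below illustrates such sliding-window cropping; the 512 × 512 patch size and stride are chosen for illustration only and are not taken from the paper.

```python
# Minimal sketch of cropping a large remote sensing tile into fixed-size patches.
# Incomplete border regions are simply dropped in this version.
import numpy as np


def crop_to_patches(image: np.ndarray, patch: int = 512, stride: int = 512):
    """Slide a window over an (H, W, C) tile and yield patch arrays."""
    h, w = image.shape[:2]
    for top in range(0, max(h - patch, 0) + 1, stride):
        for left in range(0, max(w - patch, 0) + 1, stride):
            yield image[top:top + patch, left:left + patch]


if __name__ == "__main__":
    tile = np.zeros((6000, 6000, 3), dtype=np.uint8)   # one Potsdam-sized tile
    patches = list(crop_to_patches(tile))
    print(len(patches))   # 11 x 11 = 121 non-overlapping 512x512 patches
```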
4.1.2. Evaluation Metrics
4.1.3. Implementation Details
4.2. Comparative Experiments and Analysis
4.3. Ablation Experiments
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
VLP | vision-language pretraining
NLP | natural language processing
MFT | multimodality fusion technology
BEMSeg | bidirectional feature fusion and enhanced alignment-based multimodal semantic segmentation
BFF | bidirectional feature fusion
CPC | category-based pixel-level contrastive
MS | multi-scale
SOTA | state-of-the-art
References
- Marcos, D.; Volpi, M.; Kellenberger, B.; Tuia, D. Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens. 2018, 145, 96–107.
- Qin, P.; Cai, Y.; Liu, J.; Fan, P.; Sun, M. Multilayer feature extraction network for military ship detection from high-resolution optical remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 11058–11069.
- Tang, X.; Tu, Z.; Wang, Y.; Liu, M.; Li, D.; Fan, X. Automatic detection of coseismic landslides using a new transformer method. Remote Sens. 2022, 14, 2884.
- Wang, P.; Tang, Y.; Liao, Z.; Yan, Y.; Dai, L.; Liu, S.; Jiang, T. Road-side individual tree segmentation from urban MLS point clouds using metric learning. Remote Sens. 2023, 15, 1992.
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587.
- Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 267–283.
- Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12179–12188.
- Liu, M.; Fan, J.; Liu, Q. Biomedical image segmentation algorithm based on dense atrous convolution. Math. Biosci. Eng. 2024, 21, 4351–4369.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763.
- Ghiasi, G.; Gu, X.; Cui, Y.; Lin, T.Y. Open-vocabulary image segmentation. arXiv 2021, arXiv:2112.12143.
- Rao, Y.; Zhao, W.; Chen, G.; Tang, Y.; Zhu, Z.; Huang, G.; Zhou, J.; Lu, J. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18082–18091.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y. Segment anything. arXiv 2023, arXiv:2304.02643.
- Li, B.; Weinberger, K.Q.; Belongie, S.; Koltun, V.; Ranftl, R. Language-Driven Semantic Segmentation. In Proceedings of the ICLR 2022—10th International Conference on Learning Representations, Virtual, 25–29 April 2022.
- Xu, J.; De Mello, S.; Liu, S.; Byeon, W.; Breuel, T.; Kautz, J.; Wang, X. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18134–18144.
- Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348.
- Lüddecke, T.; Ecker, A. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7086–7096.
- Ramachandram, D.; Taylor, G.W. Deep multimodal learning: A survey on recent advances and trends. IEEE Signal Process. Mag. 2017, 34, 96–108.
- Wei, X.; Zhang, T.; Li, Y.; Zhang, Y.; Wu, F. Multi-modality cross attention network for image and sentence matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10941–10950.
- Kato, N.; Yamasaki, T.; Aizawa, K. Zero-shot semantic segmentation via variational mapping. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019.
- Li, X.; Wen, C.; Hu, Y.; Zhou, N. Rs-clip: Zero shot remote sensing scene classification via contrastive vision-language supervision. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103497.
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916.
- Chen, J.; Zhu, D.; Qian, G.; Ghanem, B.; Yan, Z.; Zhu, C.; Xiao, F.; Culatana, S.C.; Elhoseiny, M. Exploring open-vocabulary semantic segmentation from clip vision encoder distillation only. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 699–710.
- Zhou, Z.; Lei, Y.; Zhang, B.; Liu, L.; Liu, Y. Zegclip: Towards adapting clip for zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11175–11185.
- Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; Qiao, Y. Clip-adapter: Better vision-language models with feature adapters. Int. J. Comput. Vis. 2024, 132, 581–595.
- Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
- Guo, S.C.; Liu, S.K.; Wang, J.Y.; Zheng, W.M.; Jiang, C.Y. CLIP-Driven Prototype Network for Few-Shot Semantic Segmentation. Entropy 2023, 25, 1353.
- Kiela, D.; Grave, E.; Joulin, A.; Mikolov, T. Efficient large-scale multi-modal classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Xu, P.; Zhu, X.; Clifton, D.A. Multimodal learning with transformers: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12113–12132.
- Jiao, S.; Wei, Y.; Wang, Y.; Zhao, Y.; Shi, H. Learning mask-aware clip representations for zero-shot segmentation. Adv. Neural Inf. Process. Syst. 2023, 36, 35631–35653.
- Kirillov, A.; Girshick, R.; He, K.; Dollár, P. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6399–6408.
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Adv. Neural Inf. Process. Syst. 2019, 32, 13–23.
- Wang, W.; Zhou, T.; Yu, F.; Dai, J.; Konukoglu, E.; Van Gool, L. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 7303–7313.
- Potsdam. ISPRS Potsdam 2D Semantic Labeling Dataset, 2018. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx (accessed on 1 June 2023).
- Vaihingen. ISPRS Vaihingen 2D Semantic Labeling Dataset, 2018. Available online: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx (accessed on 1 June 2023).
- Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. arXiv 2021, arXiv:2110.08733.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
- Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images. Remote Sens. 2021, 13, 3065.
- Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–20.
- Liu, Y.; Zhang, Y.; Wang, Y.; Mei, S. Rethinking transformers for semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023.
- Chen, Y.; Dong, Q.; Wang, X.; Zhang, Q.; Kang, M.; Jiang, W.; Wang, M.; Xu, L.; Zhang, C. Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 4421–4435.
Methods | Backbone | SPP/ASPP | Transformer Encoder | Spatial/Channel Attention | Transformer Decoder | Multilayer Feature Fusion |
---|---|---|---|---|---|---|
PSPNet | ResNet50 | ✓ | ||||
DeepLabV3+ | ResNet50 | ✓ | ||||
Semantic FPN | ResNet50 | ✓ | ||||
BANet | ResT-Lite | ✓ | ✓ | |||
SwinB-CNN | Swin Transformer | ✓ | ✓ | ✓ | ✓ | |
GLOTS | Vit-B | ✓ | ✓ | ✓ | ||
HAFNet | ResNet50 | ✓ | ✓ | ✓ |
Methods | Backbone | Image Feature | Text Feature | Image-to-Text Fusion | Text-to-Image Fusion | Multimodal Alignment | Contrastive Learning |
---|---|---|---|---|---|---|---|
CLIP-FPN | ResNet50 | ✓ | ✓ | ✓ | |||
DenseCLIP | ResNet50 | ✓ | ✓ | ✓ | ✓ | ||
LSeg | ViT-L | ✓ | ✓ | ✓ | |||
BEMSeg | ResNet50 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Methods | Potsdam mIoU/% | Potsdam OA/% | Vaihingen mIoU/% | Vaihingen OA/% | LoveDA mIoU/% | LoveDA OA/% | GFLOPs | Params/M | Training Time/h | Inference Time/FPS
---|---|---|---|---|---|---|---|---|---|---
PSPNet | 76.63 | 90.25 | 70.32 | 90.05 | 47.43 | 66.78 | 201.59 | 53.32 | 4.2 | 47.6 |
DeepLabV3+ | 77.23 | 90.33 | 70.79 | 90.13 | 50.18 | 68.65 | 198.33 | 52.04 | 4.5 | 42.2 |
Semantic FPN | 75.72 | 89.53 | 73.67 | 89.94 | 50.01 | 68.68 | 45.40 | 28.50 | 4.5 | 42.1 |
BANet | 73.74 | 88.17 | 70.01 | 88.09 | - | - | 58.10 | 28.58 | - | - |
SwinB-CNN | 75.10 | 89.21 | 68.79 | 87.65 | 54.20 | 70.52 | 114.00 | 104.02 | - | 25.3 |
GLOTS | 76.02 | 89.96 | 70.13 | 88.24 | 55.63 | 70.85 | - | - | - | 19.8 |
HAFNet | 78.76 | 90.45 | 76.37 | 90.29 | - | - | 114.64 | 38.51 | - | - |
CLIP-FPN | 75.90 | 89.89 | 74.21 | 90.15 | 50.99 | 68.85 | 62.60 | 31.00 | 10.2 | 40.3 |
DenseCLIP | 77.31 | 90.30 | 74.69 | 90.28 | 52.17 | 69.50 | 69.30 | 50.17 | 11.7 | 36.6 |
LSeg | 78.24 | 90.56 | 75.80 | 90.87 | 55.76 | 70.92 | - | - | 44.6 | - |
BEMSeg | 79.33 | 91.25 | 76.28 | 90.96 | 54.71 | 70.66 | 70.30 | 54.12 | 15.5 | 34.0 |
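For reference, the mIoU, OA, and mF1 values reported in this and the following tables follow the standard confusion-matrix definitions; the sketch below shows the conventional computation and is not the authors' evaluation code.

```python
# Minimal sketch of the standard mIoU, mF1, and OA definitions from a confusion matrix.
import numpy as np


def confusion_matrix(pred, label, num_classes):
    """Accumulate a (num_classes x num_classes) confusion matrix from label/prediction maps."""
    mask = (label >= 0) & (label < num_classes)              # skip invalid/ignored pixels
    idx = num_classes * label[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def summarize(cm):
    tp = np.diag(cm).astype(float)
    iou = tp / (cm.sum(0) + cm.sum(1) - tp + 1e-10)          # per-class IoU
    f1 = 2 * tp / (cm.sum(0) + cm.sum(1) + 1e-10)            # per-class F1
    oa = tp.sum() / (cm.sum() + 1e-10)                       # overall accuracy
    return iou.mean(), f1.mean(), oa


if __name__ == "__main__":
    pred = np.random.randint(0, 6, (512, 512))
    label = np.random.randint(0, 6, (512, 512))
    miou, mf1, oa = summarize(confusion_matrix(pred, label, num_classes=6))
    print(f"mIoU={miou:.4f}, mF1={mf1:.4f}, OA={oa:.4f}")
```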
Methods | Impervious Surface | Building | Low Vegetation | Tree | Car | Clutter | mIoU/%
---|---|---|---|---|---|---|---|
PSPNet | 86.13 | 92.97 | 76.09 | 79.29 | 87.87 | 37.45 | 76.63 |
DeepLabV3+ | 86.01 | 92.66 | 76.46 | 79.50 | 87.30 | 41.48 | 77.23 |
Semantic FPN | 85.69 | 92.67 | 75.71 | 79.05 | 91.45 | 29.77 | 75.72 |
BANet | 83.35 | 89.14 | 73.55 | 74.56 | 87.99 | 33.88 | 73.74 |
SwinB-CNN | 84.52 | 92.4 | 75.07 | 76.88 | 82.66 | 39.06 | 75.10 |
GLOTS | 85.20 | 92.19 | 75.23 | 76.97 | 84.37 | 42.15 | 76.02 |
HAFNet | 85.94 | 92.54 | 76.89 | 79.32 | 91.58 | 42.12 | 78.06 |
CLIP-FPN | 86.26 | 92.42 | 75.25 | 79.05 | 91.81 | 30.58 | 75.90
DenseCLIP | 86.70 | 93.26 | 75.89 | 79.05 | 91.55 | 35.62 | 77.31 |
LSeg | 87.14 | 92.99 | 76.45 | 79.15 | 91.19 | 42.55 | 78.24 |
BEMSeg | 87.65 | 93.99 | 78.25 | 80.37 | 92.37 | 43.37 | 79.33 |
Methods | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU/% |
---|---|---|---|---|---|---|---|---|
PSPNet | 52.73 | 62.75 | 51.33 | 57.48 | 23.06 | 36.43 | 48.21 | 47.43 |
DeepLabV3+ | 53.98 | 63.87 | 54.05 | 62.07 | 29.24 | 38.59 | 49.45 | 50.18 |
Semantic FPN | 53.05 | 62.70 | 53.70 | 67.34 | 23.08 | 42.08 | 48.01 | 50.01 |
SwinB-CNN | 53.60 | 65.98 | 58.36 | 72.67 | 31.60 | 43.56 | 53.62 | 54.20 |
GLOTS | 55.27 | 67.06 | 59.61 | 73.08 | 29.18 | 45.06 | 60.09 | 55.63 |
CLIP-FPN | 53.28 | 64.81 | 53.84 | 65.29 | 27.26 | 43.88 | 48.53 | 50.99
DenseCLIP | 53.29 | 63.95 | 54.44 | 67.50 | 31.64 | 44.00 | 50.34 | 52.17 |
LSeg | 55.52 | 67.32 | 59.72 | 73.86 | 32.46 | 45.63 | 55.79 | 55.76 |
BEMSeg | 56.79 | 67.05 | 56.77 | 70.43 | 32.52 | 45.77 | 53.65 | 54.71 |
Methods | Impervious Surface IoU/% | Building IoU/% | Low Vegetation IoU/% | Tree IoU/% | Car IoU/% | Clutter IoU/% | mIoU/% | mF1/% | OA/% | Training Time/h | Inference Time/FPS
---|---|---|---|---|---|---|---|---|---|---|---
Baseline | 86.26 | 92.42 | 75.25 | 79.05 | 91.81 | 30.58 | 75.90 | 84.24 | 89.89 | 10.2 | 40.3 |
BEMSeg-B | 87.56 | 93.82 | 77.90 | 80.21 | 92.45 | 41.14 | 78.84 | 86.88 | 91.17 | 12.0 | 34.0 |
BEMSeg-C | 87.00 | 93.13 | 76.58 | 79.21 | 92.80 | 42.83 | 78.59 | 86.11 | 91.01 | 14.5 | 40.3 |
BEMSeg | 87.65 | 93.99 | 78.25 | 80.37 | 92.87 | 42.87 | 79.33 | 87.29 | 91.25 | 15.5 | 34.0 |
Methods | Impervious Surface IoU/% | Building IoU/% | Low Vegetation IoU/% | Tree IoU/% | Car IoU/% | Clutter IoU/% | mIoU/% | mF1/% | OA/%
---|---|---|---|---|---|---|---|---
Baseline | 86.00 | 91.63 | 70.91 | 79.71 | 77.98 | 39.01 | 74.21 | 83.93 | 90.15 |
BEMSeg-B | 86.68 | 92.00 | 73.24 | 81.32 | 79.02 | 41.84 | 75.68 | 85.08 | 90.87 |
BEMSeg-C | 86.36 | 91.78 | 72.13 | 80.95 | 79.56 | 42.86 | 75.60 | 85.15 | 90.79 |
BEMSeg | 86.78 | 92.01 | 73.42 | 81.47 | 79.73 | 44.29 | 76.28 | 85.56 | 90.96 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).