VMMT-Net: A Dual-Branch Parallel Network Combining Visual State Space Model and Mix Transformer for Land–Sea Segmentation of Remote Sensing Images
Abstract
1. Introduction
2. Related Work
2.1. Semantic Segmentation
2.1.1. CNN-Based Semantic Segmentation Methods
2.1.2. Transformer and Hybrid Structure-Based Semantic Segmentation Methods
2.1.3. Mamba and Hybrid Structure-Based Semantic Segmentation Methods
2.2. Land–Sea Segmentation
3. Method
3.1. Overall Network Architecture
3.2. Dual-Branch Encoder Based on Mamba and Transformer
3.3. Cross-Branch Fusion Module
3.4. Decoder Design
4. Experiments
4.1. Datasets
4.1.1. Benchmark Sea–Land Dataset
4.1.2. GF-HNCD
4.2. Evaluation Metrics
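The tables in this section report mF1 and MIoU. As a point of reference, the standard definitions of these metrics can be computed from a pixel-level confusion matrix as sketched below; this is an illustrative implementation of the usual formulas, not the authors' evaluation code.

```python
import numpy as np

def mf1_miou(pred, target, num_classes=2):
    """Mean F1 and mean IoU over classes from a pixel-level confusion matrix.

    pred and target are integer label arrays of equal shape (e.g. 0 = sea,
    1 = land for binary land-sea segmentation).
    """
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.ravel(), pred.ravel()):
        cm[t, p] += 1                      # rows: ground truth, cols: prediction
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp               # predicted as class c but wrong
    fn = cm.sum(axis=1) - tp               # class c pixels missed
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-12)
    iou = tp / np.maximum(tp + fp + fn, 1e-12)
    return f1.mean(), iou.mean()
```

For a perfect prediction both metrics equal 1.0; the values in Tables 1 and 2 are these means expressed as percentages.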
4.3. Comparative Experiments and Results Analysis
- (1) Perception of complex morphology and fine-scale structures
- (2) Modeling of blurred land–sea transition zones
4.4. Ablation Study
- VMMT-Net (Full): The complete model developed in this study, which includes MiT, VSS, CBFM, and the customized decoder.
- w/o VSS and CBFM: The VSS branch is removed from the encoder, with only the MiT branch retained as the backbone. The CBFM is also removed accordingly.
- w/o MiT and CBFM: The MiT branch is removed from the encoder, with only the VSS branch retained as the backbone. The CBFM is likewise removed in this configuration.
- w/o CBFM: The CBFM is removed from all stages and replaced with a simple concatenation followed by 1 × 1 convolution.
- w/o Decoder: The customized decoder is replaced with a UNet-style decoder to assess the contribution of the decoder design.
- VMMT-Net (Full): The complete model structure proposed in this paper, using the CBFM module for cross-branch feature fusion;
- r CCM: Replacing CBFM in the model with CCM;
- r BiFFM: Replacing CBFM in the model with BiFFM.
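The "w/o CBFM" ablation replaces the fusion module with a simple concatenation followed by a 1 × 1 convolution. That baseline is just a per-pixel linear mixing of the two branches' channels, as the minimal sketch below shows (function and argument names are illustrative, not taken from the authors' code):

```python
import numpy as np

def concat_1x1_fuse(feat_a, feat_b, weight, bias=None):
    """Baseline fusion for the 'w/o CBFM' ablation: channel concatenation
    followed by a 1x1 convolution.

    feat_a: (C1, H, W) features from one branch
    feat_b: (C2, H, W) features from the other branch
    weight: (C_out, C1 + C2) 1x1-conv kernel (a 1x1 conv is purely
            channel mixing, so it reduces to a matrix product per pixel)
    """
    x = np.concatenate([feat_a, feat_b], axis=0)     # (C1 + C2, H, W)
    out = np.einsum('oc,chw->ohw', weight, x)        # apply 1x1 conv
    if bias is not None:
        out += bias[:, None, None]
    return out
```

Because this baseline weights all spatial positions identically, it cannot emphasize ambiguous coastline regions the way an attention-based module can, which is consistent with the drop from 98.48 to 98.22 mF1 reported in Table 2.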
5. Discussion
- (1) The dual-branch encoder balances multiscale feature modeling and spatial structure modeling. Unlike conventional feature-extraction strategies, VMMT-Net combines MiT and VMamba in a parallel dual-branch encoder. This design exploits the strength of MiT in capturing multiscale semantics and fine detail, while VMamba strengthens the modeling of spatial structural continuity. The two branches complement each other, mitigating the weakness of traditional CNN models in global dependency modeling and the limited directional perception of Transformer structures. The ablation experiments (Table 2) show that removing either branch lowers both mF1 and MIoU, confirming that the two architectures are complementary.
- (2) The CBFM enhances feature interaction and semantic compensation between the branches. It employs a dual-branch structure with attention-based cross-fusion to perceive complex land–sea backgrounds and key regions simultaneously, which is significant for segmenting ambiguous land–sea areas and discriminating small-scale structures. Moreover, the results in Table 3 show that the CBFM is more lightweight than existing fusion modules such as the CCM and BiFFM, effectively conserving computational resources. This further indicates that more complex feature fusion is not necessarily better; the key lies in precise feature selection and an appropriate fusion strategy.
- (3) The multi-level customized decoder promotes detail restoration and spatial consistency. Each decoder block combines multiscale feature fusion, dynamic snake convolution, and channel attention, which improves the restoration of land–sea boundary details, preserves the integrity of land–sea structures, and avoids the overly smooth boundary transitions observed in traditional models. Table 2 and Figure 10 confirm that the customized decoder plays a crucial role in recovering boundary details and maintaining the integrity of the land–sea structure.
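The parallel dual-branch design discussed in point (1) can be summarized schematically: both branches process the input at every encoder stage, and a fusion step (the CBFM in the paper) exchanges information between them. The sketch below uses trivial placeholder functions for the MiT and VSS stages and a simple average for fusion; whether the fused features are fed back into both branches, and the per-stage downsampling of the real encoder, are omitted assumptions of this illustration.

```python
import numpy as np

# Placeholder stand-ins for one encoder stage of each branch; the real model
# uses MiT (Transformer) blocks and VSS (Mamba) blocks, and downsamples
# spatially between stages.
def mit_stage(x):
    return np.tanh(x)          # multiscale-semantics branch (placeholder)

def vss_stage(x):
    return np.maximum(x, 0.0)  # spatial-structure branch (placeholder)

def dual_branch_forward(x, num_stages=4, fuse=lambda a, b: 0.5 * (a + b)):
    """Schematic forward pass of a parallel dual-branch encoder.

    At each stage both branches transform their features and a fusion
    operator (CBFM in the paper; a plain average here) merges them. The
    fused features are collected for the decoder's skip connections.
    """
    fused_feats = []
    a = b = x
    for _ in range(num_stages):
        a, b = mit_stage(a), vss_stage(b)
        f = fuse(a, b)               # cross-branch fusion
        a = b = f                    # assumption: fused output feeds both branches
        fused_feats.append(f)
    return fused_feats
```

The point of the structure, as the ablations show, is that neither branch alone matches the fused model: each stage lets the Transformer's multiscale semantics and the state space model's directional continuity correct one another.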
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Yang, G.; Huang, K.; Zhu, L.; Sun, W.; Chen, C.; Meng, X.; Wang, L.; Ge, Y. Spatio-temporal changes in China’s mainland shorelines over 30 years using Landsat time series data (1990–2019). Earth Syst. Sci. Data Discuss. 2024, 2024, 1–26. [Google Scholar] [CrossRef]
- Zhang, L.; Li, G.; Liu, S.; Wang, N.; Yu, D.; Pan, Y.; Yang, X. Spatiotemporal variations and driving factors of coastline in the Bohai Sea. J. Ocean Univ. China 2022, 21, 1517–1528. [Google Scholar] [CrossRef]
- Hou, X.Y.; Wu, T.; Hou, W.; Chen, Q.; Wang, Y.; Yu, L. Characteristics of coastline changes in mainland China since the early 1940s. Sci. China Earth Sci. 2016, 59, 1791–1802. [Google Scholar] [CrossRef]
- Zhou, X.; Wang, J.; Zheng, F.; Wang, H.; Yang, H. An overview of coastline extraction from remote sensing data. Remote Sens. 2023, 15, 4865. [Google Scholar] [CrossRef]
- Shirmard, H.; Farahbakhsh, E.; Müller, R.D.; Chandra, R. A review of machine learning in processing remote sensing data for mineral exploration. Remote Sens. Environ. 2022, 268, 112750. [Google Scholar] [CrossRef]
- Han, W.; Zhang, X.; Wang, Y.; Wang, L.; Huang, X.; Li, J.; Wang, S.; Chen, W.; Li, X.; Feng, R.; et al. A survey of machine learning and deep learning in remote sensing of geological environment: Challenges, advances, and opportunities. ISPRS J. Photogramm. Remote Sens. 2023, 202, 87–113. [Google Scholar] [CrossRef]
- McCarthy, M.J.; Colna, K.E.; El-Mezayen, M.M.; Laureano-Rosario, A.E.; Méndez-Lázaro, P.; Otis, D.B.; Toro-Farmer, G.; Vega-Rodriguez, M.; Muller-Karger, F.E. Satellite remote sensing for coastal management: A review of successful applications. Environ. Manag. 2017, 60, 323–339. [Google Scholar] [CrossRef] [PubMed]
- Yang, Z.; Yu, X.; Dedman, S.; Rosso, M.; Zhu, J.; Yang, J.; Xia, Y.; Tian, Y.; Zhang, G.; Wang, J. UAV remote sensing applications in marine monitoring: Knowledge visualization and review. Sci. Total Environ. 2022, 838, 155939. [Google Scholar] [CrossRef] [PubMed]
- Xu, Q.; Zhao, B.; Dai, K.; Dong, X.; Li, W.; Zhu, X.; Yang, Y.; Xiao, X.; Wang, X.; Huang, J.; et al. Remote sensing for landslide investigations: A progress report from China. Eng. Geol. 2023, 321, 107156. [Google Scholar] [CrossRef]
- Kucharczyk, M.; Hugenholtz, C.H. Remote sensing of natural hazard-related disasters with small drones: Global trends, biases, and research opportunities. Remote Sens. Environ. 2021, 264, 112577. [Google Scholar] [CrossRef]
- Ji, X.; Tang, L.; Chen, L.; Hao, L.-Y.; Guo, H. Toward efficient and lightweight sea–land segmentation for remote sensing images. Eng. Appl. Artif. Intell. 2024, 135, 108782. [Google Scholar] [CrossRef]
- Lu, C.; Wen, Y.; Li, Y.; Mao, Q.; Zhai, Y. Sea-land segmentation method based on an improved MA-Net for Gaofen-2 images. Earth Sci. Inform. 2024, 17, 4115–4129. [Google Scholar] [CrossRef]
- Lu, S.; Wu, B.; Yan, N.; Wang, H. Water body mapping method with HJ-1A/B satellite imagery. Int. J. Appl. Earth Obs. Geoinf. 2011, 13, 428–434. [Google Scholar] [CrossRef]
- Yang, C.S.; Park, J.H.; Harun-Al Rashid, A. An improved method of land masking for synthetic aperture radar-based ship detection. J. Navig. 2018, 71, 788–804. [Google Scholar] [CrossRef]
- Ge, X.; Sun, X.; Liu, Z. Object-oriented coastline classification and extraction from remote sensing imagery. In Remote Sensing of the Environment: 18th National Symposium on Remote Sensing of China; SPIE: Bellingham, WA, USA, 2014; Volume 9158, pp. 131–137. [Google Scholar]
- Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive bilateral contextual network for efficient semantic segmentation of Fine-Resolution remotely sensed imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef] [PubMed]
- Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
- Ma, X.; Zhang, X.; Pun, M.-O.; Liu, M. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403215. [Google Scholar] [CrossRef]
- He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
- Long, J.; Li, M.; Wang, X. Integrating spatial details with long-range contexts for semantic segmentation of very high-resolution remote-sensing images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
- Ruan, J.; Li, J.; Xiang, S. VM-UNet: Vision Mamba UNet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar] [CrossRef]
- Zhuang, P.; Zhang, X.; Wang, H.; Zhang, T.; Liu, L.; Li, J. FAHM: Frequency-Aware Hierarchical Mamba for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6299–6313. [Google Scholar] [CrossRef]
- Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar]
- Hatamizadeh, A.; Kautz, J. MambaVision: A hybrid Mamba-Transformer vision backbone. arXiv 2024, arXiv:2407.08083. [Google Scholar]
- Cheng, D.; Meng, G.; Cheng, G.; Pan, C. SeNet: Structured edge network for sea–land segmentation. IEEE Geosci. Remote Sens. Lett. 2016, 14, 247–251. [Google Scholar] [CrossRef]
- Li, R.; Liu, W.; Yang, L.; Sun, S.; Hu, W.; Zhang, F.; Li, W. DeepUNet: A deep fully convolutional network for pixel-level sea-land segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2018, 11, 3954–3962. [Google Scholar] [CrossRef]
- Shamsolmoali, P.; Zareapoor, M.; Wang, R.; Zhou, H.; Yang, J. A novel deep structure U-Net for sea-land segmentation in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3219–3232. [Google Scholar] [CrossRef]
- Cui, B.; Jing, W.; Huang, L.; Li, Z.; Lu, Y. SANet: A sea–land segmentation network via adaptive multiscale feature learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 116–126. [Google Scholar] [CrossRef]
- Gao, H.; Yan, X.D.; Zhang, H. Multi-Scale Sea-Land Segmentation Method for Remote Sensing Images Based on Res2Net. Acta Opt. Sin. 2022, 42, 1828004. [Google Scholar]
- Ji, X.; Tang, L.; Lu, T.; Cai, C. Dbenet: Dual-branch ensemble network for sea–land segmentation of remote-sensing images. IEEE Trans. Instrum. Meas. 2023, 72, 1–11. [Google Scholar] [CrossRef]
- Gao, J.; Zhou, C.; Xu, G.; Sun, W. Multiscale sea-land segmentation networks for weak boundaries. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4205511. [Google Scholar] [CrossRef]
- Xiong, X.; Wang, X.; Zhang, J.; Huang, B.; Du, R. Tcunet: A lightweight dual-branch parallel network for sea–land segmentation in remote sensing images. Remote Sens. 2023, 15, 4413. [Google Scholar] [CrossRef]
- Tong, Q.; Wu, J.; Zhu, Z.; Zhang, M.; Xing, H. STIRUnet: SwinTransformer and inverted residual convolution embedding in unet for Sea–Land segmentation. J. Environ. Manag. 2024, 357, 120773. [Google Scholar] [CrossRef] [PubMed]
- Chen, Y.; Zhang, L.; Chen, B.; Zuo, J.; Hu, Y. MPG-Net: A Semantic Segmentation Model for Extracting Aquaculture Ponds in Coastal Areas from Sentinel-2 MSI and Planet SuperDove Images. Remote Sens. 2024, 16, 3760. [Google Scholar] [CrossRef]
- Ai, J.; Xue, W.; Zhu, Y.; Zhuang, S.; Xu, C.; Yan, H.; Chen, L.; Wang, Z. AIS-PVT: Long-time AIS Data assisted Pyramid Vision Transformer for Sea-land Segmentation in Dual-polarization SAR Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3449894. [Google Scholar] [CrossRef]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
- Wei, K.; Dai, J.; Hong, D.; Ye, Y. MGFNet: An MLP-dominated gated fusion network for semantic segmentation of high-resolution multi-modal remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2024, 135, 104241. [Google Scholar] [CrossRef]
- Stergiou, A.; Poppe, R.; Kalliatakis, G. Refining activation downsampling with SoftPool. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10357–10366. [Google Scholar]
- Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
- Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic snake convolution based on topological geometric constraints for tubular structure segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6070–6079. [Google Scholar]
- Liang, L.; Deng, S.; Gueguen, L.; Wei, M.; Wu, X.; Qin, J. Convolutional neural network with median layers for denoising salt-and-pepper contaminations. Neurocomputing 2021, 442, 26–35. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
- Yang, T.; Jiangde, S.; Hong, Z.; Zhang, Y.; Han, Y.; Zhou, R.; Wang, J.; Yang, S.; Tong, X.; Kuc, T.-Y. Sea-land segmentation using deep learning techniques for landsat-8 OLI imagery. Mar. Geod. 2020, 43, 105–133. [Google Scholar] [CrossRef]
| Model | Backbone | GF-HNCD mF1 (%) | GF-HNCD MIoU (%) | BSD mF1 (%) | BSD MIoU (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| U-Net | - | 94.79 | 90.10 | 97.02 | 94.22 | 28.98 | 203 |
| PSPNet | ResNet50 | 96.47 | 93.18 | 97.82 | 95.75 | 46.69 | 179 |
| SegFormer | MiT-B0 | 97.01 | 94.20 | 97.70 | 95.51 | 3.75 | 8.50 |
| TCUNet | PVT V2-ResNet | 97.94 | 95.96 | 98.08 | 96.23 | 1.72 | 3.24 |
| CLCFormer | SwinV2-EfficientNet-B3 | 97.17 | 94.50 | 97.88 | 95.86 | 38.01 | 31.06 |
| FTransUNet | ResNet50-FVit | 98.29 | 96.65 | 97.95 | 95.99 | 184.14 | 57.28 |
| VM-UNet | VSS | 97.44 | 95.02 | 97.95 | 96.00 | 22.03 | 16.45 |
| RS3Mamba | VSS-ResNet18 | 98.24 | 96.54 | 98.42 | 96.89 | 31.65 | 43.32 |
| VMMT-Net (Ours) | - | 98.48 | 97.02 | 98.53 | 97.11 | 28.24 | 25.21 |
| Method | mF1 (%) | MIoU (%) |
|---|---|---|
| VMMT-Net (Full) | 98.48 | 97.02 |
| w/o VSS and CBFM | 97.55 | 95.22 |
| w/o MiT and CBFM | 97.99 | 96.06 |
| w/o CBFM | 98.22 | 96.51 |
| w/o Decoder | 98.17 | 96.40 |
| Method | mF1 (%) | MIoU (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|---|
| VMMT-Net (Full) | 98.48 | 97.02 | 28.24 | 25.21 |
| r CCM | 98.24 | 96.55 | 36.39 | 31.75 |
| r BiFFM | 98.40 | 96.86 | 51.21 | 43.03 |
Share and Cite
Wu, J.; Liu, Z.; Zhu, Z.; Song, C.; Wu, X.; Xing, H. VMMT-Net: A Dual-Branch Parallel Network Combining Visual State Space Model and Mix Transformer for Land–Sea Segmentation of Remote Sensing Images. Remote Sens. 2025, 17, 2473. https://doi.org/10.3390/rs17142473