CVMFusion: ConvNeXtV2 and Visual Mamba Fusion for Remote Sensing Segmentation
Highlights
- This paper presents CVMFusion, an innovative dual-branch network that combines ConvNeXtV2 for precise local feature extraction with VMamba for long-range contextual modelling, thereby setting a new benchmark for sea–land segmentation in remote sensing data.
- The proposed Dynamic Multi-scale Attention (DyMSA) and Dynamic Weighted Cross-Attention (DyWCA) modules enable dynamic, adaptive feature fusion, which is empirically shown to enhance the segmentation accuracy of small targets and complex coastline boundaries.
- The exceptional performance of CVMFusion on public SAR datasets illustrates the effectiveness of the hybrid CNN-Mamba architecture in addressing the shortcomings of current approaches, especially in managing class imbalance and retaining essential edge information.
- This work provides a robust and accurate tool for coastal zone monitoring, with direct implications for improving applications in marine disaster early warning, navigation safety, and sustainable coastal resource management.
Abstract
1. Introduction
- (1)
- (2) We design the Dynamic Multi-Scale Attention (DyMSA) module, which replaces traditional static fusion (e.g., a fixed convolution after concatenation) with a dynamic channel-spatial dual-path approach. This addresses the limitations of fixed weighting of deep and shallow features and effectively preserves high-frequency details.
- (3) We introduce the Dynamic Weighted Cross-Attention (DyWCA) module, which leverages dynamic weighting and cross-attention to adaptively fuse local features with global semantics. This resolves the inflexible feature integration and loss of edge detail inherent in traditional static fusion methods.
- (4) Comprehensive experiments on the publicly available Sentinel-1 SAR dataset [18] and the GF-3 SAR sea–land segmentation dataset [19] show that CVMFusion outperforms existing advanced methods on key metrics such as MIoU, FgIoU, and F1-score, with clear advantages in detecting small targets and delineating complex boundaries, which supports its accuracy and robustness.
2. Related Work
2.1. Improvements to the U-Net Structure
2.1.1. CNN-Based U-Shape Network for Land and Sea Image Segmentation
2.1.2. Transformer-Based U-Shaped Network for Sea and Land Image Segmentation
2.1.3. VMamba U-Net Model for Image Segmentation
2.2. Improvements in Attention Mechanisms
2.3. Improvement of Multi-Branch Network Structure
3. Our Method
3.1. Network Architecture
3.2. VSS Block
3.3. Dynamic Multi-Scale Attention (DyMSA) Block
3.4. Dynamic Weighted Cross-Attention (DyWCA) Block
3.5. Loss Function
4. Experiments
4.1. Experimental Data
4.2. Experimental Configuration
- (1)
- (2) Experimental details: The batch size was set to 8, and the AdamW [40] optimiser was used with an initial learning rate of 1 × 10⁻⁴. The CosineAnnealingWarmRestarts scheduler [41] was used with a maximum number of iterations of 30 and a minimum learning rate of 1 × 10⁻⁵, and training ran for 30 epochs. ConvNeXtV2 and VMamba were selected as the backbone networks of the encoder; their fully connected layers were removed, and they include batch normalization layers. The backbone parameters were initialized with weights pre-trained on the ImageNet-1k dataset. For the decoder, we used the default parameter initialization provided by the PyTorch deep learning toolkit.
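The learning-rate schedule described above can be illustrated with a small, dependency-free sketch of cosine annealing with warm restarts [41]. It assumes that the "maximum number of iterations of 30" is the restart period (the `T_0` argument of PyTorch's `CosineAnnealingWarmRestarts`), that the scheduler is stepped once per epoch, and that the period is not lengthened after a restart; the function name `sgdr_lr` is hypothetical, and none of these assumptions is confirmed by the text.

```python
import math


def sgdr_lr(epoch, base_lr=1e-4, eta_min=1e-5, t_0=30, t_mult=1):
    """Learning rate at the start of `epoch` (0-indexed) under SGDR.

    Assumed mapping to the paper's settings: base_lr = 1e-4, eta_min = 1e-5,
    restart period t_0 = 30 epochs, t_mult = 1 (period never grows).
    """
    t_i, t_cur = t_0, epoch
    # Locate the position inside the current restart cycle.
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= t_mult
    # Cosine decay from base_lr down towards eta_min within one cycle.
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


# One value per training epoch for the 30-epoch run described above.
schedule = [sgdr_lr(epoch) for epoch in range(30)]
```

Under these assumptions the 30-epoch run spans exactly one cosine cycle, so no restart actually occurs during training; the rate decays smoothly from the initial value towards the minimum.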
4.3. Comparison with Other Algorithms
- (1) Sea–Land Segmentation V1.1 [18]: We compare CVMFusion against several methods on this dataset. As shown in Table 2, CVMFusion achieves an MAE of 0.0105, an F1-score of 99.03%, an MIoU of 98.05%, an FgIoU of 97.45%, and an OA of 99.70%, the best value among the compared methods on every metric. Compared with VM-UNet, a pioneering model that uses only Mamba, the MAE falls from 0.0125 to 0.0105 (a reduction of 0.0020, i.e., 0.20 percentage points), which suggests improved prediction of edges and fine details. The 0.62-percentage-point increase in MIoU indicates that our attention-focused U-Net architecture improves the capture of global context and the coherence of segmentation. Furthermore, the 0.43-percentage-point increase in FgIoU indicates improved integrity of target regions and boundary precision. Taken together, these results show that our hybrid CNN-Mamba architecture captures both global relationships and local details, which leads to better performance than previous methods.
The qualitative results support this conclusion. Figure 6 presents a qualitative comparison of typical scenarios on the Sea–Land Segmentation V1.1 dataset, covering small island detection, complex boundaries, jagged estuarine coastlines, and class-imbalanced scenes. Red circles mark the differences between the segmentation results of the various models and the ground truth labels. Traditional CNN methods such as U-Net and U-Net++ exhibit missed detections or fragmented segmentation because of their limited receptive fields, resulting in discontinuous predictions at island edges. VM-UNet improves global consistency, but it excessively smooths small island boundaries, sacrificing sharp geometric features in some local regions. Although its MIoU surpasses that of traditional CNN methods, high-frequency details are still lost.
Compared with the other models, CVMFusion maintains superior overall contours and edge sharpness across all typical scenarios, achieving precise segmentation even for complex sea–land boundaries and small targets. Finally, we perform a pixel-wise difference analysis between the masks and our predictions to further visualize segmentation quality. The analysis shows stable accuracy in challenging conditions: our method avoids the typical segmentation errors that are common in other approaches (e.g., incorrectly segmented regions or missed small targets). Only minor pixel deviations remain, and they are concentrated along fine edges, indicating room for improvement in fine-edge precision.
- (2) SARSealand V1.0: The results on this dataset are presented in Table 2. Compared with VM-UNet, CVMFusion improves the F1-score by 0.79 percentage points and the MIoU by 1.13 percentage points. For FgIoU, CVMFusion reaches 95.85%, which is 1.05 percentage points higher than VM-UNet (94.80%), reflecting improved integrity of target regions and boundary delineation in multi-resolution SAR scenes. These results indicate the advantage of the CNN-Mamba-based model over the other models for SAR sea–land segmentation.
Figure 7 shows that traditional approaches (U-Net, U-Net++, Swin-UNet, SegNet, and DeepLabV3+) exhibit significant void artifacts in inland regions; furthermore, U-Net, U-Net++, and DeepLabV3+ erroneously classify vessels as land in oceanic areas. These approaches share a major flaw: they cannot handle images with different resolutions, which leads to consistent errors. For example, sites far from the coast, where the radar backscatter is weak, are incorrectly classified as sea, while ships in the ocean, which reflect radar signals strongly, are mistakenly detected as land. Moreover, Swin-UNet and DeepLabV3+ show reduced accuracy near coastlines because they incorporate edge information insufficiently. CVMFusion performs better in both terrestrial and maritime regions, with markedly fewer misclassifications. Its overall accuracy is good; nonetheless, it has small limitations in edge refinement, resulting in a slight smoothing of fine coastline protrusions and indentations. This is probably attributable to the edge supervision mechanism, which promotes smoother boundary extraction.
Finally, we again compute the pixel-wise difference between the mask and our predicted map to further visualize the segmentation quality on the SARSealand V1.0 dataset. Our method aligns almost perfectly with the ground truth, with only minor false detections (green areas) at some edges and virtually no missed detections. It outperforms the comparative models in terms of error ratio, boundary precision, and region adaptability, demonstrating high pixel-level consistency between the predictions and the true mask.
- (3) To assess whether the performance differences between CVMFusion and the comparative methods are statistically reliable, we conducted Wilcoxon signed-rank tests on the MIoU metric between CVMFusion and three typical methods representing CNN, Transformer, and Mamba techniques. As shown in Table 3, on the Sea–Land Segmentation V1.1 dataset, the improvement of CVMFusion over U-Net++ is statistically significant (p = 0.0017), and the improvement over Swin-UNet is marginally significant (p = 0.067). Although the difference from VM-UNet does not reach the conventional significance level (p = 0.132), CVMFusion still shows a clear advantage in MIoU.
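For readers unfamiliar with the Wilcoxon signed-rank test used in item (3), the following is a minimal sketch of an exact two-sided version for small paired samples. The text does not state how the paired samples were formed or which variant of the test was used, so the pairing (for example, per-image MIoU scores of two models), the exact enumeration, and the function names are all assumptions made for illustration.

```python
from itertools import product


def signed_ranks(x, y):
    """Rank the non-zero paired differences x_i - y_i by absolute value.

    Tied absolute differences receive the average of the ranks they span.
    Returns (ranks, signs) in the order of the retained differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    signs = [1 if d > 0 else -1 for d in diffs]
    return ranks, signs


def wilcoxon_exact(x, y):
    """Return (W+, two-sided exact p-value) for paired samples x and y.

    Under the null hypothesis each difference is equally likely to be
    positive or negative, so all 2^n sign patterns are enumerated.
    Only practical for small n (roughly n <= 20).
    """
    ranks, signs = signed_ranks(x, y)
    total = sum(ranks)
    w_plus = sum(r for r, s in zip(ranks, signs) if s > 0)
    w_observed = min(w_plus, total - w_plus)
    extreme = 0
    for pattern in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, keep in zip(ranks, pattern) if keep)
        if min(w, total - w) <= w_observed + 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** len(ranks)


# Hypothetical scores for five images (not data from the paper).
model_a = [5, 7, 9, 12, 15]
model_b = [4, 5, 6, 8, 10]
w_plus, p_value = wilcoxon_exact(model_a, model_b)
```

With only five pairs, even a consistent improvement cannot reach p < 0.05 in the two-sided exact test, which illustrates why the number of paired samples matters when interpreting the p-values in Table 3.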
4.4. Ablation Experiment
- (1) Ablation experiments on the importance of encoders:
- (2) Effectiveness of the DyMSA and DyWCA modules: These modules are key components of CVMFusion. The DyMSA module uses dynamic weight allocation of channel-spatial attention to enhance interactions between dimensions, helping to generate more accurate segmentation masks. Meanwhile, the DyWCA module enhances fine-grained features and global information by allocating weights dynamically to cross-attention between features from the ConvNeXtV2 branch and the VMamba branch.
- (3) Ablation experiments on the effectiveness of multi-level supervision and edge loss.
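The idea behind item (2), dynamically weighting cross-attention between two feature branches, can be illustrated with a toy sketch. This is not the DyWCA module itself: the module's internals are not given in this section, so the bidirectional attention, the way the weights are derived (from mean activations rather than learned parameters), and all function names are assumptions chosen to keep the example small and dependency-free. Both inputs are assumed to hold the same number of tokens with the same feature dimension.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def dynamic_weighted_fusion(local_tokens, global_tokens):
    """Fuse two token sets with input-dependent branch weights.

    local_tokens stands in for CNN-branch features and global_tokens for
    Mamba-branch features (hypothetical stand-ins, not the paper's tensors).
    """
    # Local features query global context, and vice versa.
    local_ctx = cross_attention(local_tokens, global_tokens, global_tokens)
    global_ctx = cross_attention(global_tokens, local_tokens, local_tokens)

    def mean(rows):
        return sum(sum(r) for r in rows) / (len(rows) * len(rows[0]))

    # Branch weights depend on the inputs instead of being fixed constants.
    w_local, w_global = softmax([mean(local_ctx), mean(global_ctx)])
    fused = [
        [w_local * a + w_global * b for a, b in zip(row_l, row_g)]
        for row_l, row_g in zip(local_ctx, global_ctx)
    ]
    return fused, (w_local, w_global)


local = [[1.0, 0.0], [0.0, 1.0]]
global_ = [[0.5, 0.5], [2.0, 1.0]]
fused, (w_local, w_global) = dynamic_weighted_fusion(local, global_)
```

The contrast with static fusion is that `w_local` and `w_global` change with every input, whereas a fixed convolution after concatenation applies the same mixing to all inputs.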
4.5. Efficiency Evaluation
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Seale, C.; Redfern, T.; Chatfield, P.; Luo, C.; Dempsey, K. Coastline detection in satellite imagery: A deep learning approach on new benchmark data. Remote Sens. Environ. 2022, 278, 113044.
- Dong, S.; Pang, L.; Zhuang, Y.; Liu, W.; Yang, Z.; Long, T. Optical Remote Sensing Water-Land Segmentation Representation Based on Proposed SNS-CNN Network. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 3895–3898.
- Cao, W.; Zhou, Y.; Li, R.; Li, X. Mapping changes in coastlines and tidal flats in developing islands using the full time series of Landsat images. Remote Sens. Environ. 2020, 239, 111665.
- Shao, Z.; Wang, L.; Wang, Z.; Du, W.; Wu, W. Saliency-aware convolutional neural network for ship detection in surveillance video. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 781–794.
- An, C.; Niu, Z.; Li, Z.; Chen, Z.-P. Otsu threshold comparison and SAR water segmentation result analysis. J. Electron. Inf. Technol. 2010, 32, 2215–2219.
- Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66.
- Baig, M.H.A.; Zhang, L.; Wang, S.; Jiang, G.; Lu, S.; Tong, Q. Comparison of MNDWI and DFI for water mapping in flooding season. In Proceedings of the 2013 IEEE International Geoscience and Remote Sensing Symposium - IGARSS, Melbourne, VIC, Australia, 21–26 July 2013; pp. 2876–2879.
- Zhang, L.; Hu, Y. Improved ROEWA operator for sea-land segmentation in SAR image. Comput. Eng. Appl. 2017, 53, 26–30.
- Güvelioğlu, E.; Aci, Ç.I. A Faster R-CNN model for multi-class classification and detection of land, air, and sea vehicles. In Proceedings of the 2024 9th International Conference on Computer Science and Engineering (UBMK), Antalya, Turkey, 26–28 October 2024; pp. 851–856.
- Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
- Vaswani, A.; Shazeer, N.; Parmar, N. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Ai, J.; Wang, Y.; Zhang, H. AIS-PVT: Long-time AIS data assisted pyramid vision transformer for sea-land segmentation in dual-polarization SAR imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5220712.
- Tong, Q.; Wu, J.; Zhu, Z.; Zhang, M.; Xing, H. STIRUnet: SwinTransformer and inverted residual convolution embedding in UNet for sea–land segmentation. J. Environ. Manag. 2024, 357, 120773.
- Xie, Q.; Jiang, W.; Wang, Z.; Jiang, Q. ICAA-Mamba: Vision Mamba for image color aesthetics assessment. In Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
- Liu, Y.; Cheng, G.; Sun, Q.; Tian, C.; Wang, L. CWmamba: Leveraging CNN-Mamba fusion for enhanced change detection in remote sensing images. IEEE Geosci. Remote Sens. Lett. 2025, 22, 2501505.
- Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142.
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual state space model for efficient long-range dependency modeling. arXiv 2024, arXiv:2401.10166.
- Zhu, R.; Zhang, T.; Li, J.; Wei, F.; Yu, W. A network for merging SAR image sea-land segmentation and coastline detection tasks. IEEE Geosci. Remote Sens. Lett. 2024, 21, 4017305.
- Liang, F.; Zhang, R.; Chai, Y.; Chen, J.; Ru, G.; Yang, W. A Sea-Land Segmentation Method for SAR Images Using Context-Aware and Edge Attention Based CNNs. Geomat. Inf. Sci. Wuhan Univ. China 2023, 48, 1286–1295.
- Shamsolmoali, P.; Zareapoor, M.; Wang, R.; Zhou, H.; Yang, J. A novel deep structure U-Net for sea–land segmentation in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 3219–3232.
- Dutta, T.K.; Majhi, S.; Nayak, D.R.; Jha, D. SAM-Mamba: Mamba Guided SAM Architecture for Generalized Zero-Shot Polyp Segmentation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision, Tucson, AZ, USA, 26 February–6 March 2025; pp. 4655–4664.
- Zhang, Y.; Wang, S.; Chen, Y.; Wei, S.; Xu, M.; Liu, S. Algae-Mamba: A Spatially Variable Mamba for Algae Extraction from Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 14324–14337.
- Wu, Q.; Wang, N.; Du, B. Unsupervised Change Detection in multitemporal Satellite Images: A VMamba-Driven Cross-Scale Feature Decoding Network. In Proceedings of the 2025 Joint International Conference on Automation-Intelligence-Safety (ICAIS) & International Symposium on Autonomous Systems (ISAS), Xi’an, China, 23–25 May 2025.
- Batu, X.; Jiarui, H.; Jun, P. DGIONet: Dual-path global information optimization network for sea-land segmentation with remote sensing images. Bull. Surv. Mapp. 2025, 3, 52–58+86.
- Wang, X.; Zhang, J.; Zhang, X.; Liu, L.; Wang, J.; Tan, W. MFB: A 3D point cloud segmentation model based on Mamba. In Proceedings of the 2024 2nd International Conference on Computer, Vision and Intelligent Technology (ICCVIT), Huaibei, China, 24–27 November 2024.
- Ding, L.; Tang, H.; Bruzzone, L. LANet: Local attention embedding to improve the semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 426–435.
- Ji, X.; Tang, L.; Lu, T.; Cai, C. DBENet: Dual-Branch Ensemble Network for Sea–Land Segmentation of Remote-Sensing Images. IEEE Trans. Instrum. Meas. 2023, 72, 5503611.
- Qi, X.; Hu, Z.; Gao, Y.; Chen, Y.; Zhang, Y. Mamba-CNN: Remote Sensing Image Scene Classification Network with Cross-fusion of Long Sequence and Short Sequence Features. In Proceedings of the 2024 International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), Dalian, China, 16–18 August 2024.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Liu, Y.; Tian, Y.; Zhao, H. VM-UNet: Visual Mamba UNet for dense image segmentation. arXiv 2025, arXiv:2402.02491v2.
- Murphy, K.P. Machine Learning: A Probabilistic Perspective; MIT Press: Cambridge, MA, USA, 2012; pp. 245–248.
- Milletari, F.; Navab, N.; Ahmadi, S.-A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571.
- Li, L.; Ma, H.; Zhang, X.; Zhao, X.; Lv, M.; Jia, Z. Synthetic Aperture Radar Image Change Detection Based on Principal Component Analysis and Two-Level Clustering. Remote Sens. 2024, 16, 1861.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
- Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A Review on Deep Learning Techniques Applied to Semantic Segmentation. arXiv 2017, arXiv:1704.06857.
- Li, K.; Wang, B.; Xu, Y.; Hu, Y.; Hu, X.; Zhang, M. MCFTNet: Multimodal Cross-Layer Fusion Transformer Network for Hyperspectral and LiDAR Data Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 12803–12818.
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2025, arXiv:1711.05101.
- Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 24–26 April 2017.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015.
- Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-UNet: Swin Transformer-based U-Net for medical image segmentation. In Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), Kolkata, India, 28–31 March 2022.
- Song, Y.; Xue, B.; Meng, Y.; Qin, X.; Li, Y.; Liu, Q. A Fusion Method Incorporating Dual-Attention Mechanism and Transfer Learning Into UNet++ for Remote Sensing Image Coastline Extraction. IEEE Access 2025, 13, 11320–11331.
- Zhang, Y.; Liu, H.; Hu, Q. TransFuse: Fusing Transformers and CNNs for Medical Image Segmentation. arXiv 2021, arXiv:2102.08005.
- Peng, Y.; Chen, D.Z.; Sonka, M. U-Net v2: Rethinking the Skip Connections of U-Net for Medical Image Segmentation. arXiv 2023, arXiv:2311.17791.
- Zhang, M.; Yu, Y.; Jin, S.; Gu, L.; Ling, T.; Tao, X. VM-UNET-V2 Rethinking Vision Mamba UNet for Medical Image Segmentation. arXiv 2024, arXiv:2403.09157.

| Hardware configuration | | | |
|---|---|---|---|
| CPU | GPU | Memory | Disk |
| Intel(R) Xeon(R) Platinum 8260 CPU @ 2.30 GHz | NVIDIA GeForce RTX 3090 | 86 GB | Samsung SSD 870 |
| Software configuration | | | |
| Operating system | NVIDIA driver | CUDA | Acceleration library |
| Ubuntu 20.04 | 525.105.175 | 11.8 | cuDNN 8 |
| Development environment | | | |
| Deep learning framework | Programming language | Algorithm library | GCC compiler |
| PyTorch = 2.4.0 | Python = 3.10 | NumPy = 1.220 | GCC 7.5.0 |

| Dataset | Model Category | MAE | F1-Score (%) | MIoU (%) | FgIoU (%) | OA (%) |
|---|---|---|---|---|---|---|
| Sea–Land Segmentation V1.1 | UNet | 0.0185 | 97.84 | 96.09 | 94.40 | 98.21 |
| | SegNet | 0.0127 | 98.28 | 96.91 | 95.86 | 99.07 |
| | DeepLabV3+ | 0.0164 | 98.56 | 96.77 | 95.72 | 96.50 |
| | TransFuse | 0.0144 | 98.42 | 96.83 | 96.71 | 99.06 |
| | Swin-UNet | 0.0119 | 98.13 | 97.14 | 96.74 | 99.38 |
| | UNetV2 | 0.0135 | 98.52 | 97.06 | 97.01 | 98.96 |
| | VM-UNet | 0.0125 | 98.61 | 97.43 | 97.02 | 99.16 |
| | VM-UNetV2 | 0.0132 | 98.41 | 97.10 | 96.68 | 99.13 |
| | UNet++ | 0.0141 | 98.89 | 97.21 | 96.27 | 99.05 |
| | CVMFusion | 0.0105 | 99.03 | 98.05 | 97.45 | 99.70 |
| SARSealand V1.0 | UNet | 0.0360 | 96.19 | 94.95 | 94.76 | 96.40 |
| | SegNet | 0.0371 | 94.34 | 93.05 | 92.84 | 95.41 |
| | DeepLabV3+ | 0.0289 | 96.56 | 94.57 | 94.12 | 96.11 |
| | TransFuse | 0.0352 | 96.52 | 95.30 | 95.08 | 97.06 |
| | Swin-UNet | 0.0301 | 96.10 | 94.95 | 94.04 | 96.99 |
| | UNetV2 | 0.0342 | 96.78 | 95.03 | 94.72 | 96.54 |
| | VM-UNet | 0.0378 | 97.34 | 95.15 | 94.80 | 96.82 |
| | VM-UNetV2 | 0.0362 | 97.06 | 95.42 | 95.14 | 97.09 |
| | UNet++ | 0.0298 | 96.21 | 95.06 | 94.89 | 97.02 |
| | CVMFusion | 0.0246 | 98.13 | 96.28 | 95.85 | 97.75 |
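
As a reading aid for the metric columns in the comparison table above, the sketch below computes MAE, F1-score, MIoU, FgIoU, and OA for a binary segmentation. The paper's own metric formulas are not part of this section, so the standard definitions are assumed: MAE is taken over predicted probabilities, MIoU is the mean of the foreground and background IoU, and which class counts as foreground is left open. The function name and the sample values are hypothetical.

```python
def segmentation_metrics(prob, gt, threshold=0.5):
    """Binary segmentation metrics from flat lists.

    prob: predicted foreground probabilities in [0, 1]
    gt:   ground-truth labels (1 = foreground, 0 = background)
    """
    pred = [1 if p >= threshold else 0 for p in prob]
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    tn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 0)
    n = len(gt)

    def ratio(numerator, denominator):
        # Guard against empty classes; 0.0 is an arbitrary convention here.
        return numerator / denominator if denominator else 0.0

    fg_iou = ratio(tp, tp + fp + fn)  # IoU of the foreground class
    bg_iou = ratio(tn, tn + fp + fn)  # IoU of the background class
    return {
        "MAE": sum(abs(p - g) for p, g in zip(prob, gt)) / n,
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
        "MIoU": (fg_iou + bg_iou) / 2,
        "FgIoU": fg_iou,
        "OA": ratio(tp + tn, n),
    }


# Six hypothetical pixels (not data from the paper).
metrics = segmentation_metrics(
    prob=[0.9, 0.8, 0.2, 0.6, 0.1, 0.3],
    gt=[1, 1, 0, 0, 0, 1],
)
```

Because MAE is computed on probabilities here while OA uses thresholded labels, the two are not simply complements of each other, which is consistent with the table reporting both.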

| Methods | Sea–Land Segmentation V1.1 | SARSealand V1.0 |
|---|---|---|
| U-Net++ | p = 0.0017 * | p = 0.000007 * |
| Swin-UNet | p = 0.067 | p = 0.036 * |
| VM-UNet | p = 0.132 | p = 0.047 * |

| Dataset | Encoder Group | MAE | F1-Score (%) | MIoU (%) | FgIoU (%) | OA (%) |
|---|---|---|---|---|---|---|
| Sea–Land Segmentation V1.1 | ConvNeXtV2 | 0.01449 | 98.38 | 97.10 | 97.42 | 98.50 |
| | VMamba | 0.0121 | 98.69 | 97.43 | 97.34 | 99.23 |
| | ConvNeXtV2+VMamba | 0.0105 | 99.03 | 98.05 | 97.45 | 99.70 |
| SARSealand V1.0 | ConvNeXtV2 | 0.0305 | 97.41 | 95.47 | 95.22 | 97.05 |
| | VMamba | 0.0280 | 97.70 | 95.76 | 95.38 | 97.22 |
| | ConvNeXtV2+VMamba | 0.0259 | 98.13 | 96.28 | 95.85 | 97.75 |

| Dataset | Experiment Group | MAE | F1-Score (%) | MIoU (%) | FgIoU (%) | OA (%) |
|---|---|---|---|---|---|---|
| Sea–Land Segmentation V1.1 | Baseline | 0.0142 | 98.38 | 97.42 | 96.50 | 99.08 |
| | +DyMSA | 0.0119 | 98.94 | 97.63 | 97.30 | 99.21 |
| | +DyMSA+DyWCA | 0.0105 | 99.03 | 98.05 | 97.45 | 99.70 |
| SARSealand V1.0 | Baseline | 0.0306 | 97.26 | 95.42 | 94.68 | 96.40 |
| | +DyMSA | 0.0287 | 97.70 | 96.14 | 95.30 | 97.18 |
| | +DyMSA+DyWCA | 0.0259 | 98.13 | 96.28 | 95.85 | 97.75 |

| Dataset | Experiment Number | Supervision Strategy | Edge Supervision | α Strategy | F1-Score (%) | MIoU (%) |
|---|---|---|---|---|---|---|
| Sea–Land Segmentation V1.1 | (1) | Ground truth at the last layer only | ✗ | ✗ | 98.04 | 96.85 |
| | (2) | Ground truth at the last layer only + edges | ✓ | 0.1 | 98.45 | 97.14 |
| | (3) | Multi-level ground truth | ✗ | ✗ | 98.24 | 97.02 |
| | (4) | Multi-level ground truth + edges | ✓ | 0.1 | 98.79 | 97.48 |
| | (5) | Multi-level ground truth + edges | ✓ | Incremental (+0.1/10 epochs) | 99.03 | 98.05 |
| SARSealand V1.0 | (1) | Ground truth at the last layer only | ✗ | ✗ | 96.92 | 95.08 |
| | (2) | Ground truth at the last layer only + edges | ✓ | 0.1 | 97.20 | 95.41 |
| | (3) | Multi-level ground truth | ✗ | ✗ | 97.01 | 95.52 |
| | (4) | Multi-level ground truth + edges | ✓ | 0.1 | 97.80 | 96.05 |
| | (5) | Multi-level ground truth + edges | ✓ | Incremental (+0.1/10 epochs) | 98.13 | 96.28 |
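
The incremental α strategy in the last row of the table above can be sketched as a step schedule for the edge-loss weight. The table only states that α grows by 0.1 every 10 epochs; the starting value of 0.1 (chosen to match the fixed setting in the other rows), the 0-indexed epochs, the unweighted sum over supervision levels, and the function names are all assumptions, not details taken from the paper.

```python
def edge_weight(epoch, alpha_start=0.1, step=0.1, every=10):
    """Edge-loss weight α for a 0-indexed epoch.

    Assumed schedule: start at alpha_start and add `step` after every
    `every` completed epochs.
    """
    return alpha_start + step * (epoch // every)


def total_loss(level_losses, edge_loss, epoch):
    """Combine multi-level segmentation losses with a weighted edge loss.

    level_losses: one segmentation loss value per supervised decoder level
    edge_loss:    loss value of the edge supervision branch
    The plain sum over levels is an assumption made for illustration.
    """
    return sum(level_losses) + edge_weight(epoch) * edge_loss


# Edge weights over the 30 training epochs under these assumptions.
alphas = [edge_weight(epoch) for epoch in range(30)]
```

Under these assumptions the edge term contributes little early in training, when coarse regions are still being learned, and more in later epochs, when boundary refinement dominates.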

| Model | Params (M) | FLOPs (G) | Sea–Land Segmentation V1.1 MIoU (%) | SARSealand V1.0 MIoU (%) |
|---|---|---|---|---|
| UNet(ResNet50) | 32.52 | 43.83 | 96.09 | 94.95 |
| UNet(ResNet101) | 51.51 | 62.33 | 96.15 | 94.42 |
| UNet++(ResNet50) | 48.99 | 230.25 | 97.21 | 95.06 |
| UNet++(ResNet101) | 67.98 | 249.75 | 97.11 | 94.59 |
| DeepLabV3+(ResNet50) | 26.68 | 36.90 | 96.77 | 94.57 |
| DeepLabV3+(ResNet101) | 45.67 | 56.41 | 97.25 | 94.12 |
| VM-UNet | 27.42 | 16.45 | 97.43 | 95.15 |
| VM-UNet(2,2,9,2-2,9,2,2) | 38.28 | 30.33 | 97.27 | 94.66 |
| CVMFusion(-DyMSA-DyWCA) | 43.71 | 29.74 | 97.42 | 95.47 |
| CVMFusion(-DyWCA) | 45.90 | 44.04 | 97.63 | 95.76 |
| CVMFusion | 51.25 | 46.61 | 98.05 | 96.28 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Wang, Z.; Qin, L.; Xu, C.; Liu, D.; Guo, Z.; Hu, Y.; Yang, T. CVMFusion: ConvNeXtV2 and Visual Mamba Fusion for Remote Sensing Segmentation. Sensors 2026, 26, 640. https://doi.org/10.3390/s26020640