Multimodal Image Segmentation with Dynamic Adaptive Window and Cross-Scale Fusion for Heterogeneous Data Environments
Abstract
1. Introduction
- (1) Dynamic Window Module (DWM): this study presents a dynamic window module that adjusts the size and position of attention windows dynamically by incorporating a continuous relative position bias. This addresses a key limitation of the fixed-window strategy in Swin-Transformer, which fails to capture the multi-scale features of irregular objects, and significantly improves the recognition accuracy of complex terrain targets.
- (2) Cross-Scale Context Attention (SCA): this study designs a cross-scale context attention module that follows a coarse-to-fine strategy, extracting and integrating features from adjacent scales. It addresses the fundamental challenge of the semantic gap between local details and global context in conventional attention mechanisms, thereby improving the model's ability to understand spatial relationships in heterogeneous scenarios.
- (3) Hierarchical Adaptive Fusion Architecture (HAFA): a hierarchical adaptive fusion architecture is proposed, which places a Heterogeneous Modality Synchronizer (HMS) at the shallow network layers and an Efficient Channel Attention (ECA) mechanism at the deep network layers. This effectively fuses complementary multimodal information while maintaining accuracy and significantly reducing the number of model parameters.
2. Datasets and Data Preprocessing
2.1. Datasets
2.1.1. Vaihingen Dataset
2.1.2. Potsdam Dataset
2.2. Data Preprocessing
3. Methods
- (1) Dual-branch Feature Extraction: the workflow starts from the input image pair and employs a parallel dual-branch architecture for multimodal feature encoding. Left branch: the IRRB image is processed by a Swin-T encoder integrated with the DWM to extract intermediate features, which are further refined by the SCA module. Right branch: the DSM image is processed by a separate Swin-T encoder integrated with the DWM, and its intermediate features are likewise refined by the SCA module. A minimal architectural sketch is given after this list.
- (2) Cross-modal Feature Fusion: after the high-dimensional features are obtained from the two branches, the workflow proceeds to two core fusion stages: HMS and ECA. The HMS performs an initial fusion of the optical/infrared-branch and DSM-branch features, enabling complementary integration of the two heterogeneous data sources. The ECA module then applies channel- and/or spatial-wise attention weighting or feature reorganization to the features output by HMS, facilitating deeper interaction and integration of multimodal information and providing more discriminative fused features for the subsequent decoding stage.
- (3) Decoding and Output: after dual-branch feature extraction and cross-modal fusion, the fused features are fed into the Swin-T decoder for upsampling and refined classification, yielding the final segmentation output.
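To make the overall data flow concrete, the following is a minimal PyTorch-style skeleton of the dual-branch pipeline described above. It is a sketch under stated assumptions, not the authors' implementation: the encoder, SCA, HMS, ECA, and decoder objects are injected placeholders standing in for the modules detailed in the rest of Section 3, and each encoder is assumed to return the adjacent-scale features that its SCA module expects.

```python
import torch.nn as nn


class DualBranchSegmenter(nn.Module):
    """Illustrative skeleton: two Swin-T-style encoders with DWM, per-branch SCA
    refinement, HMS + ECA fusion, and a decoder. All sub-modules are placeholders."""

    def __init__(self, encoder_irrb, encoder_dsm, sca_irrb, sca_dsm, hms, eca, decoder):
        super().__init__()
        self.encoder_irrb = encoder_irrb   # Swin-T encoder with DWM (IRRB branch)
        self.encoder_dsm = encoder_dsm     # Swin-T encoder with DWM (DSM branch)
        self.sca_irrb = sca_irrb           # cross-scale context attention, IRRB branch
        self.sca_dsm = sca_dsm             # cross-scale context attention, DSM branch
        self.hms = hms                     # heterogeneous modality synchronizer (shallow fusion)
        self.eca = eca                     # efficient channel attention (deep reweighting)
        self.decoder = decoder             # Swin-T-style decoder (upsample + classify)

    def forward(self, irrb, dsm):
        # Each encoder is assumed to return the adjacent-scale feature maps SCA consumes.
        f_irrb = self.sca_irrb(*self.encoder_irrb(irrb))
        f_dsm = self.sca_dsm(*self.encoder_dsm(dsm))
        fused = self.eca(self.hms(f_irrb, f_dsm))
        return self.decoder(fused)
```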
3.1. Dynamic Window Module
- (1) Adaptive Window Partitioning (AWP): dynamically determines the optimal window size from the input feature-map dimensions to suit the characteristics of remote sensing images, ensuring that the window size evenly divides the feature-map dimensions and eliminating redundant padding (a minimal sketch follows this list).
- (2) Continuous Relative Position Bias Generator (CRPB): replaces the pre-computed position-bias table with a lightweight multi-layer perceptron (MLP), generating position encodings that are independent of window size through on-the-fly normalization of spatial coordinates.
- (3) Multi-scale Dynamic Window Attention (MDWA): integrates the outputs of the preceding two sub-modules to enable genuine multi-scale perception, effectively addressing the limitations of fixed windows when processing variable-sized feature maps.
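As a concrete illustration of the AWP rule in (1), the helper below starts from a preset 8×8 base window (as in Algorithm A1, Module 1) and shrinks each dimension until it evenly divides the feature-map height and width, so no padding is needed. The function name and the simple descending search are assumptions for illustration, not the authors' code.

```python
def adaptive_window_size(H: int, W: int, base: int = 8) -> tuple[int, int]:
    """Pick the largest window size <= `base` that divides each dimension exactly,
    so windows tile the feature map without padding (illustrative sketch)."""
    def largest_divisor_leq(dim: int, cap: int) -> int:
        for s in range(min(cap, dim), 0, -1):
            if dim % s == 0:
                return s
        return 1

    return largest_divisor_leq(H, base), largest_divisor_leq(W, base)


# Example: a 56x60 feature map gets 8x6 windows (no padding required).
print(adaptive_window_size(56, 60))  # -> (8, 6)
```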
3.1.1. Adaptive Window Partitioning Module
3.1.2. Continuous Relative Position Bias Generator
3.1.3. Multi-Scale Dynamic Window Attention Module
3.2. Cross-Scale Context Attention Module
3.2.1. Multilevel Feature Interaction Module
- 1. The fine-scale (higher-resolution) feature map is first refined through a convolution layer to adapt its channel dimensions for subsequent fusion;
- 2. The adjacent coarser-scale feature map is upsampled (to match the resolution of the fine-scale map) and processed via a convolution layer to ensure feature compatibility;
- 3. The concatenated features are then integrated to generate the output feature map (with the same spatial and channel dimensions as the fine-scale input), preserving both the local details of the fine-scale features and the global context of the coarser-scale features (a minimal sketch follows this list).
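A minimal PyTorch sketch of this three-step interaction, following the convolution/upsampling layout of Algorithm A2 in Appendix B.2; channel sizes and class/layer names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn


class MultilevelInteraction(nn.Module):
    """Fuse a fine-scale map (c_fine, H, W) with the adjacent coarser map
    (c_coarse, H/2, W/2); the output matches the fine-scale shape. Sketch only."""

    def __init__(self, c_fine: int, c_coarse: int):
        super().__init__()
        self.proj = nn.Conv2d(c_fine, c_fine, kernel_size=1)               # step 1: refine fine-scale channels
        self.up = nn.ConvTranspose2d(c_coarse, c_fine, kernel_size=2,      # step 2: upsample coarser map 2x
                                     stride=2)
        self.fuse = nn.Sequential(                                          # step 3: fuse concatenated features
            nn.Conv2d(2 * c_fine, c_fine, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_fine),
            nn.ReLU(inplace=True),
        )

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.proj(fine), torch.relu(self.up(coarse))], dim=1)
        return self.fuse(x)


# Example shapes: fine (1, 96, 64, 64) + coarse (1, 192, 32, 32) -> (1, 96, 64, 64)
out = MultilevelInteraction(96, 192)(torch.randn(1, 96, 64, 64), torch.randn(1, 192, 32, 32))
```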
3.2.2. Channel Semantic Recalibration Module
3.3. Hierarchical Adaptive Fusion Architecture
- (1) Heterogeneous Modality Synchronizer (HMS): used in the shallow network layers to fuse features from different modalities, it consists of two parts: feature enhancement in the spatial dimension and adaptive feature fusion in the channel dimension. The spatial correlation is computed from the multimodal features produced by the SCA module, which is the key to adaptive fusion of the multimodal features.
- (2) Efficient Channel Attention (ECA): in the deep network layers, where the feature maps have low spatial resolution but semantically rich channels, this paper adopts the ECA module [29] to enhance the expression of salient channels in the multimodal features. It strengthens the response of important channels without adding much computational burden, thereby reinforcing the semantic consistency of cross-modal features (a minimal ECA sketch follows this list). The hierarchical adaptive fusion architecture is illustrated in Figure 5.
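For reference, a minimal sketch of the ECA operation, following the published ECA-Net design [29] (global average pooling, a 1-D convolution across channels, and a sigmoid gate). The adaptive kernel-size rule is the one from that paper and may differ from the exact configuration used in this work.

```python
import math
import torch
import torch.nn as nn


class ECA(nn.Module):
    """Efficient Channel Attention [29]: GAP, a 1-D conv over the channel axis,
    and a sigmoid gate that reweights channels (illustrative sketch)."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))   # adaptive kernel size
        k = k if k % 2 else k + 1                         # force an odd kernel
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                            # GAP -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)          # cross-channel 1-D conv
        return x * torch.sigmoid(y)[:, :, None, None]     # channel reweighting
```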
3.3.1. Heterogeneous Modality Synchronizer
- (1) Spatial Feature Enhancement (FE): the multimodal data (IRRB and DSM) are first processed by global average pooling (GAP) to extract spatial information. The aligned and fused IRRB and DSM features are then passed to the adaptive fusion step, where the weight matrix is learned from the features of adjacent stages.
- (2) Adaptive Channel Fusion: to mitigate the impact of noisy or missing DSM signals on the IRRB data, a channel attention mechanism adaptively adjusts the weights of the different modal features, enabling more effective cross-modal feature fusion. The procedure is as follows (a minimal sketch is given after this list):
- 1. The spatially enhanced features are first concatenated with the IRRB and DSM branch features, respectively, followed by a convolutional layer to adjust channel dimensions and a ReLU activation for non-linearity;
- 2. The concatenated outputs are fed into a softmax layer to generate the two weighting coefficients, which adaptively balance the contributions of the two branches;
- 3. The final spatial correlation map is computed as the weighted fusion of the intermediate features using these coefficients.
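A minimal PyTorch sketch of the adaptive fusion steps 1–3 above, assuming the two weighting coefficients are obtained by applying a softmax across per-branch scores produced by 1×1 convolutions; the class name, layer sizes, and per-channel weighting granularity are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn as nn


class AdaptiveChannelFusion(nn.Module):
    """Sketch of HMS adaptive channel fusion: concatenate the GAP-enhanced features
    with each branch, score both branches, softmax the scores into weights, and fuse."""

    def __init__(self, channels: int):
        super().__init__()
        self.score_irrb = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.ReLU(inplace=True))
        self.score_dsm = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, enhanced, f_irrb, f_dsm):
        # Step 1: concatenate the spatially enhanced features with each branch, conv + ReLU.
        s_irrb = self.score_irrb(torch.cat([enhanced, f_irrb], dim=1))
        s_dsm = self.score_dsm(torch.cat([enhanced, f_dsm], dim=1))
        # Step 2: softmax across the two branches yields the weighting coefficients.
        weights = torch.softmax(torch.stack([s_irrb, s_dsm], dim=0), dim=0)
        # Step 3: weighted fusion of the two intermediate feature maps.
        return weights[0] * f_irrb + weights[1] * f_dsm
```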
3.3.2. Efficient Channel Attention Module
3.4. Loss Function
4. Results and Discussion
4.1. Evaluation Metrics
4.2. Comparative Experiments
4.3. Ablation Experiments
4.4. Multimodal Noise Robustness Experiments
- (1) DSM noise: Gaussian noise was added to the elevation values of the digital surface model (DSM) with standard deviations of 0.1 m, 0.5 m, and 1.0 m.
- (2) Multispectral noise: Gaussian noise was added to each spectral channel of the multispectral images with standard deviations of 2, 5, and 10. A minimal sketch of this noise-injection protocol follows the list.
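The following NumPy sketch reproduces the noise-injection protocol; the function names and the 8-bit clipping choice for the multispectral channels are assumptions, while the standard deviations are the ones listed above.

```python
import numpy as np


def add_dsm_noise(dsm: np.ndarray, sigma_m: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise (in metres) to DSM elevation values."""
    rng = np.random.default_rng(seed)
    return dsm + rng.normal(0.0, sigma_m, size=dsm.shape)


def add_spectral_noise(img: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add zero-mean Gaussian noise independently to each spectral channel of an
    8-bit multispectral image, clipping back to the valid range (assumed protocol)."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(img.dtype)


# Noise levels evaluated in Section 4.4.
DSM_SIGMAS = [0.1, 0.5, 1.0]      # metres
SPECTRAL_SIGMAS = [2, 5, 10]      # digital numbers per channel
```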
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix A.1. Detailed Definitions of CRPB and AWP Module
- (1) Structure: two linear layers, mapping the input to a hidden layer and then to the output, with a GELU non-linearity and two rounds of dropout regularization (a minimal PyTorch sketch follows this list).
- (2) Parameter initialization rules: if hidden_features is not specified, it defaults to in_features (i.e., the hidden layer has the same dimension as the input); if out_features is not specified, it likewise defaults to in_features. The activation function defaults to nn.GELU (Gaussian Error Linear Unit), and the dropout probability defaults to 0 (disabled); when regularization is applied in this paper, it is set to 0.3.
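A minimal PyTorch sketch matching the description above and the pseudo-code of Algorithm A1 (Module 2); the defaults mirror the stated initialization rules.

```python
import torch.nn as nn


class Mlp(nn.Module):
    """Two-layer MLP used by the CRPB: Linear -> GELU -> Dropout -> Linear -> Dropout."""

    def __init__(self, in_features, hidden_features=None, out_features=None,
                 act_layer=nn.GELU, drop=0.0):
        super().__init__()
        hidden_features = hidden_features or in_features   # default: same as input
        out_features = out_features or in_features         # default: same as input
        self.fc1 = nn.Linear(in_features, hidden_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)                        # 0 by default; 0.3 when regularizing

    def forward(self, x):
        x = self.drop(self.act(self.fc1(x)))
        x = self.drop(self.fc2(x))
        return x
```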
Appendix A.2. The Precise Mathematical Formulas for All Loss Terms
Appendix A.3. Training Configuration and Complete Data Augmentation Process
Appendix B
Appendix B.1. Pseudo-Code Flow in the DWM
Algorithm A1 Dynamic Window Module (DWM) Implementation
Input: a data tensor X, window size (h, w), optional attention mask.
Output: processed window tensor (H, W, C).

Module 1: Adaptive Window Partitioning (AWP)
Input: (B, H, W, C) tensor, (h, w) window size
Output: (B*num_windows, h, w, C) windows
1: (M_base, N_base) = (8, 8)                  // preset default window size
   - For inference: enforce divisibility
   - For training: find the largest divisible size
2: Reshape X to [B, H//h, h, W//w, w, C]
3: Permute and reshape to [B*(H//h)*(W//w), h, w, C]
4: Return the final window size
// Window reverse (restore the original layout):
5: Calculate num_windows = (H//h)*(W//w)
6: Reshape to [B, H//h, W//w, h, w, C]
7: Permute and reshape back to [B, H, W, C]

Module 2: MLP Block of the CRPB
Input: in_features, [hidden_features], [out_features], [act_layer], [drop]
Output: processed tensor
1: Initialize linear layers:
   fc1 = Linear(in_features → hidden_features)
   fc2 = Linear(hidden_features → out_features)
2: Forward pass:
   x ← fc1(x)
   x ← Activation(x)        # default: GELU
   x ← Dropout(x)
   x ← fc2(x)
   x ← Dropout(x)
   return x

Module 3: Multi-Scale Dynamic Window Attention (MDWA)
Input: dim, num_heads, [qkv_bias], [qk_scale], [attn_drop], [proj_drop]
Output: attention-weighted features
1: Initialize components:
   - qkv: Linear(dim → 3*dim)
   - cpb_mlp: MLP for relative position bias
   - projection layer and dropouts
2: Forward(x, window_size, mask):
   (a) Generate relative position coordinates
   (b) Normalize coordinates to [−1, 1]
   (c) Compute position bias via cpb_mlp
   (d) Calculate QKV and attention scores
   (e) Apply position bias + mask (if any)
   (f) Softmax → attention dropout
   (g) Project and return output of shape (H, W, C)
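To make steps (a)–(c) of Module 3 concrete, the sketch below generates relative coordinates for an h×w window, normalizes them to [−1, 1], and maps them through a small MLP (for instance the Mlp sketch in Appendix A.1) to a per-head bias table. The exact normalization and MLP width are assumptions; the returned shape matches what window attention expects.

```python
import torch
import torch.nn as nn


def relative_position_bias(cpb_mlp: nn.Module, h: int, w: int, num_heads: int) -> torch.Tensor:
    """Continuous relative position bias for an (h, w) window.

    cpb_mlp maps a 2-D relative offset, normalized to [-1, 1], to `num_heads`
    bias values. Returns a tensor of shape (num_heads, h*w, h*w).
    """
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2).float()      # (h*w, 2)
    rel = coords[:, None, :] - coords[None, :, :]                      # (h*w, h*w, 2)
    rel[..., 0] /= max(h - 1, 1)                                       # normalize rows to [-1, 1]
    rel[..., 1] /= max(w - 1, 1)                                       # normalize cols to [-1, 1]
    bias = cpb_mlp(rel)                                                # (h*w, h*w, num_heads)
    return bias.permute(2, 0, 1).contiguous()                          # (num_heads, h*w, h*w)


# Illustrative usage (hidden width is an assumption):
# cpb_mlp = Mlp(in_features=2, hidden_features=128, out_features=num_heads)
# attn = attn + relative_position_bias(cpb_mlp, h, w, num_heads).unsqueeze(0)
```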
Appendix B.2. Pseudo-Code Flow in the SCA Module
Algorithm A2 SCA Module
Input: mol1: feature map 1 (shape [B, feature1, H, W]); mol2: feature map 2 (shape [B, feature2, H//2, W//2]); size1: base feature dimension.
Output: out: fused feature map (shape [B, feature1, H, W]).
1: Initialize layers:
   conv1 ← Conv2d(feature1 → feature1, kernel = 1)                              # feature projection
   conv2 ← Conv2d(2*feature1 → feature1, kernel = 3, padding = 1)               # fusion
   conv3 ← Conv2d(feature2 → feature1, kernel = 1)                              # channel adjustment
   upSampling2x2 ← ConvTranspose2d(feature2 → feature1, kernel = 2, stride = 2) # upsampler
   avg_pool ← AdaptiveAvgPool2d(1)                                              # global context
   Bach ← BatchNorm2d(feature1)                                                 # normalization
   relu1 ← ReLU(inplace = True)                                                 # activation
2: Process mol1 (high-resolution path):
   c4_lat ← conv1(mol1)                       # linear projection
3: Process mol2 (low-resolution path):
   c5_lat ← upSampling2x2(mol2)               # 2x upsampling
   c5_lat ← relu1(c5_lat)                     # non-linear activation
4: Feature fusion:
   c_glb ← concat(c4_lat, c5_lat, dim = 1)    # channel-wise concatenation
   c_glb ← conv2(c_glb)                       # cross-scale fusion
   c_glb ← Bach(c_glb)                        # batch normalization
5: Attention weighting:
   c_glb_lat2 ← avg_pool(mol2)                # global average pooling
   c_glb_lat2 ← conv3(c_glb_lat2)             # channel adjustment
   c_glb_lat2 ← softmax(c_glb_lat2, dim = 1)  # attention weights
6: Apply attention:
   out ← c_glb * c_glb_lat2                   # element-wise multiplication
   return out
Appendix B.3. Canny Edge Detection
Algorithm A3 Canny Edge Detection
Input: label_path — path to RGB annotation image; CLASS_COLORS — dictionary mapping class names to RGB values; TARGET_PAIRS — list of class pairs for boundary analysis.
Output: boundary density percentage (0–100).

Function: calculate_boundary_density
1: Load and validate input:
   if not file_exists(label_path):
       print("File not found error")
       return 0.0
   label_img ← cv2.imread(label_path)        # BGR format
   if label_img is None:
       print("Image read error")
       return 0.0
2: Preprocess image:
   label_img ← convert BGR to RGB
   h, w ← image.height, image.width
   total_pixels ← h * w
3: Generate class mask:
   class_mask ← zeros(h, w)
   for each (class_name, rgb_color) in CLASS_COLORS:
       if class_name in ['building', 'low_vegetation', 'road']:
           binary_mask ← (label_img == rgb_color) across all channels
           class_id ← index_of(class_name) + 1
           class_mask[binary_mask] ← class_id
4: Create target mask:
   target_mask ← zeros(h, w)
   for each (class_name, _) in TARGET_PAIRS:
       if class_name in CLASS_COLORS:
           rgb_color ← CLASS_COLORS[class_name]
           binary_mask ← (label_img == rgb_color) across all channels
           target_mask[binary_mask] ← 1
5: Edge detection:
   gray_mask ← target_mask * 255 (convert to 8-bit)
   edges ← cv2.Canny(gray_mask, low_thresh = 25, high_thresh = 75)   # Canny edge detection
6: Calculate density:
   boundary_pixels ← count_nonzero(edges)
   boundary_density ← (boundary_pixels / total_pixels) * 100
   return boundary_density

Function: compute_dataset_boundary_density
Input: label_paths — list of annotation image paths
Output: mean and standard deviation of boundary densities across the dataset
1: Initialize an empty density list
2: for each path in label_paths:
       density ← calculate_boundary_density(path)
       densities.append(density)
       print(f"Image {basename(path)} density: {density:.2f}%")
3: mean_density ← mean(densities)
4: std_density ← std(densities)
5: return (mean_density, std_density)
References
- Li, K.; Liu, R.; Cao, X.; Bai, X.; Zhou, F.; Meng, D.; Wang, Z. SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 1234–1245.
- State Council. National Medium- and Long-Term Plan for Basic Surveying and Mapping (2021–2030); Standards Press of China: Beijing, China, 2021.
- Zhang, S.; Chen, Z.; Wang, D.; Wang, Z.J. Cross-Domain Few-Shot Contrastive Learning for Hyperspectral Images Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
- Hou, Z.; Chen, M.; Ma, S.; Qu, M.; Yang, X. Real-Time Urban Street View Semantic Segmentation Based on Cross-Level Aggregation Network. Opt. Precis. Eng. 2024, 32, 1212–1226.
- Pan, T.; Zuo, R.; Wang, Z. Geological Mapping via Convolutional Neural Network Based on Remote Sensing and Geochemical Survey Data in Vegetation Coverage Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3485–3494.
- Jiang, W.; Zhang, C.; Xu, B.; Luo, C.; Zhou, H.; Zhou, K. AED-Net: A Semantic Segmentation Model for Landslide Disaster Remote Sensing Images. J. Geo-Inf. Sci. 2023, 25, 2012–2025.
- Fan, K.; Fen, Y. A Traffic Scene Perception Algorithm Combining Semantic Segmentation and Depth Estimation. J. Zhejiang Univ. Eng. Sci. 2024, 58, 684–695.
- Li, X.; Yan, H.; Wang, Z.; Wang, B. Evaluation and Influencing Factors Analysis of Road Environment Safety Perception Combining Street View Images and Machine Learning. J. Geo-Inf. Sci. 2023, 25, 852–865.
- Xu, Y.; Cao, B.; Lu, H. Improved U-Net++ Semantic Segmentation Method for Remote Sensing Images. IEEE Access 2025, 13, 55877–55886.
- Fan, L.; Zhou, Y.; Liu, H.; Li, Y.; Cao, D. Combining Swin Transformer With UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11.
- Zhou, J.; Hao, M.; Zhang, D.; Zou, P.; Zhang, W. Fusion PSPnet Image Segmentation Based Method for Multi-Focus Image Fusion. IEEE Photonics J. 2019, 11, 1–12.
- Xue, Z.; Tan, X.; Yu, X.; Liu, B.; Yu, A.; Zhang, P. Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification. IEEE Trans. Image Process. 2022, 31, 3095–3110.
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20 June 2021; pp. 6877–6886.
- Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision–ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer: Cham, Switzerland, 2023; pp. 205–218.
- Li, S.; Li, C.; Kang, X. Development Status and Future Prospects of Multi-Source Remote Sensing Image Fusion. Natl. Remote Sens. Bull. 2021, 25, 148–166.
- Peng, C.; Li, Y.; Jiao, L.; Chen, Y.; Shang, R. Densely Based Multi-Scale and Multi-Modal Fully Convolutional Networks for High-Resolution Remote-Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2612–2626.
- Ma, X.; Zhang, X.; Pun, M.-O. A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3463–3474.
- Sun, Y.; Fu, Z.; Sun, C.; Hu, Y.; Zhang, S. Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18.
- Ma, M.; Ma, W.; Jiao, L.; Liu, X.; Li, L.; Feng, Z.; Liu, F.; Yang, S. A Multimodal Hyper-Fusion Transformer for Remote Sensing Image Classification. Inf. Fusion 2023, 96, 66–79.
- Chen, H.; Lan, C.; Song, J.; Broni-Bediako, C.; Xia, J.; Yokoya, N. ObjFormer: Learning Land-Cover Changes From Paired OSM Data and Optical High-Resolution Imagery via Object-Guided Transformer. IEEE Trans. Geosci. Remote Sens. 2023, 62, 4408522.
- Ma, X.; Zhang, X.; Pun, M.-O.; Liu, M. A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
- Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
- Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNET: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22 September 2019; pp. 1440–1444.
- He, S.; Yang, H.; Zhang, X.; Li, X. MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images. Mathematics 2023, 11, 722.
- Gao, L.; Liu, H.; Yang, M.; Chen, L.; Wan, Y.; Xiao, Z.; Qian, Y. STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10990–11003.
- Ma, X.; Xu, X.; Zhang, X.; Pun, M.-O. Adjacent-Scale Multimodal Fusion Networks for Semantic Segmentation of Remote Sensing Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 20116–20128.
- Ren, P.; Li, C.; Wang, G.; Xiao, Y.; Du, Q.; Liang, X.; Chang, X. Beyond Fixation: Dynamic Window Visual Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18 June 2022; pp. 11977–11987.
- Zhang, Z.; Shu, D.; Gu, G.; Hu, W.; Wang, R.; Chen, X.; Yang, B. RingFormer-Seg: A Scalable and Context-Preserving Vision Transformer Framework for Semantic Segmentation of Ultra-High-Resolution Remote Sensing Imagery. Remote Sens. 2025, 17, 3064.
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13 June 2020; pp. 11531–11539.
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv 2016, arXiv:1602.07261.
Dataset | Method | F1 (Roa.) | F1 (Bui.) | F1 (Low.) | F1 (Tre.) | F1 (Car) | OA | mF1 | mIoU
---|---|---|---|---|---|---|---|---|---
Vaihingen | Swin-Unet | 0.9021 | 0.9518 | 0.8463 | 0.6923 | 0.7591 | 0.8294 | 0.8303 | 0.7209
Vaihingen | PSPNet | 0.9008 | 0.9467 | 0.8831 | 0.7383 | 0.7113 | 0.8374 | 0.8361 | 0.7292
Vaihingen | ACNet | 0.8904 | 0.9314 | 0.8750 | 0.8135 | 0.7387 | 0.8419 | 0.8478 | 0.7590
Vaihingen | RDFNet | 0.8981 | 0.9517 | 0.8921 | 0.8170 | 0.7392 | 0.8570 | 0.8596 | 0.7610
Vaihingen | CMFNet | 0.9058 | 0.9545 | 0.9077 | 0.8196 | 0.7405 | 0.8640 | 0.8656 | 0.7708
Vaihingen | MFTransNet | 0.9229 | 0.9629 | 0.9101 | 0.8254 | 0.7513 | 0.8715 | 0.8745 | 0.7849
Vaihingen | DyFuseNet | 0.9201 | 0.9677 | 0.9102 | 0.8561 | 0.7900 | 0.8511 | 0.8869 | 0.8040
Potsdam | Swin-Unet | 0.8983 | 0.9508 | 0.8870 | 0.7951 | 0.7136 | 0.8500 | 0.8490 | 0.7466
Potsdam | PSPNet | 0.9008 | 0.9536 | 0.8690 | 0.7948 | 0.7391 | 0.8455 | 0.8515 | 0.7490
Potsdam | ACNet | 0.9098 | 0.9526 | 0.8893 | 0.7883 | 0.7292 | 0.8536 | 0.8538 | 0.7538
Potsdam | RDFNet | 0.9037 | 0.9582 | 0.9015 | 0.8117 | 0.7470 | 0.8620 | 0.8644 | 0.7688
Potsdam | CMFNet | 0.9143 | 0.9601 | 0.9026 | 0.8247 | 0.7747 | 0.8677 | 0.8753 | 0.7843
Potsdam | MFTransNet | 0.9159 | 0.9657 | 0.9082 | 0.8449 | 0.7594 | 0.8720 | 0.8788 | 0.7908
Potsdam | DyFuseNet | 0.9194 | 0.9697 | 0.9160 | 0.8703 | 0.7970 | 0.8980 | 0.8987 | 0.8085
Method | Multimodal | FLOPs (G) | Parameters (MB) | FPS | mIoU (%)
---|---|---|---|---|---
Swin-Unet | N | 16.54 | 34.68 | 38.64 | 74.66 |
PSPNet | N | 51.23 | 46.72 | 68.24 | 74.90 |
ACNet | Y | 12.96 | 62.37 | 18.64 | 75.38 |
RDFNet | Y | 60.44 | 42.08 | 20.72 | 76.88 |
CMFNet | Y | 80.67 | 112.44 | 9.82 | 78.43 |
MFTransNet | Y | 9.52 | 41.36 | 16.68 | 79.08 |
DyFuseNet | Y | 26.19 | 82.77 | 30.09 | 80.85 |
Dataset | Baseline | DWM | SCA | HAFA | OA | mF1
---|---|---|---|---|---|---
Vaihingen | √ | | | | 0.8481 | 0.8532
Vaihingen | √ | √ | | | 0.8537 | 0.8581
Vaihingen | √ | √ | √ | | 0.8702 | 0.8748
Vaihingen | √ | √ | √ | √ | 0.8764 | 0.8867
Potsdam | √ | | | | 0.8536 | 0.8538
Potsdam | √ | √ | | | 0.8677 | 0.8753
Potsdam | √ | √ | √ | | 0.8764 | 0.8867
Potsdam | √ | √ | √ | √ | 0.8780 | 0.8897
Dataset | Noise | Swin-T | DyFuseNet | OA Change
---|---|---|---|---
Vaihingen | Origin | 0.8322 | 0.8789 | 0
Vaihingen | OnlyDSM | 0.8215 | 0.8774 | 0.0015
Vaihingen | OnlyIRRB | 0.8143 | 0.8721 | 0.0068
Vaihingen | Both | 0.8112 | 0.8695 | 0.0094
Potsdam | Origin | 0.8436 | 0.8789 | 0
Potsdam | OnlyDSM | 0.8325 | 0.8782 | 0.0007
Potsdam | OnlyIRRB | 0.8276 | 0.8743 | 0.0046
Potsdam | Both | 0.8105 | 0.8711 | 0.0078