Article

Multimodal Image Segmentation with Dynamic Adaptive Window and Cross-Scale Fusion for Heterogeneous Data Environments

1
School of Information and Control Engineering, Xi’an University of Architecture and Technology, Xi’an 710055, China
2
Institute for Interdisciplinary and Innovative Research, Xi’an University of Architecture and Technology, Xi’an 710055, China
3
Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(19), 10813; https://doi.org/10.3390/app151910813
Submission received: 17 September 2025 / Revised: 2 October 2025 / Accepted: 7 October 2025 / Published: 8 October 2025
(This article belongs to the Special Issue Signal and Image Processing: From Theory to Applications: 2nd Edition)

Abstract

Multi-modal image segmentation is a key task in fields such as urban planning, infrastructure monitoring, and environmental analysis. However, it remains challenging due to complex scenes, varying object scales, and the integration of heterogeneous data sources (such as RGB, depth maps, and infrared). To address these challenges, we propose DyFuseNet, a novel multi-modal segmentation framework featuring dynamic adaptive windows and cross-scale feature fusion. The framework consists of three key components: (1) the Dynamic Window Module (DWM), which uses dynamic partitioning and continuous position bias to adaptively adjust window sizes, thereby improving the representation of irregular and fine-grained objects; (2) Scale Context Attention (SCA), a hierarchical mechanism that associates local details with global semantics in a coarse-to-fine manner, enhancing segmentation accuracy in low-texture or occluded regions; and (3) the Hierarchical Adaptive Fusion Architecture (HAFA), which aligns and fuses features from multiple modalities through shallow synchronization and deep channel attention, effectively balancing complementarity and redundancy. Evaluated on benchmark datasets (ISPRS Vaihingen and Potsdam), DyFuseNet achieved state-of-the-art performance, with mean Intersection over Union (mIoU) scores of 80.40% and 80.85%, surpassing MFTransNet by 1.91% and 1.77%, respectively. The model also demonstrated strong robustness in challenging scenes (such as building edges and shadowed objects), achieving an average F1 score of 85% while maintaining high efficiency (26.19 GFLOPs, 30.09 FPS), making it suitable for real-time deployment. This work presents a practical, versatile, and computationally efficient solution for multi-modal image analysis, with potential applications beyond remote sensing, including smart monitoring, industrial inspection, and multi-source data fusion tasks.

1. Introduction

Multimodal image semantic segmentation is a fundamental technique for extracting pixel-level semantic information from complex image data [1], particularly when dealing with heterogeneous modalities (e.g., optical, radar, and elevation data [2]). By assigning semantic labels to objects and generating high-resolution segmentation maps [3], this technology serves as a key enabler for a wide range of applications, including urban development [4], natural resource surveying [5], environmental analysis and infrastructure monitoring [6], intelligent transportation systems [7], and public safety management [8]. Its significance extends not only to strategic initiatives such as smart city construction and integrated monitoring systems but also to practical tasks involving scene understanding and object recognition.
With the rapid development of remote sensing and aerial imaging platforms, modern datasets increasingly feature high-resolution images containing dense small objects, complex scene layouts, and significant variations in object size and scale. In this context, overlapping structures (such as buildings and vehicles in urban environments) pose significant challenges for accurate segmentation. Effectively addressing these challenges requires not only modeling the spatial relationships between targets but also leveraging the complementary information provided by various data modalities. These demands highlight the need for advanced, generalizable segmentation methods capable of handling complex, multi-modal, and high-resolution visual inputs. Traditional single-modal semantic segmentation methods, which depend on manual feature extraction, exhibit limited automation and inadequate accuracy when dealing with complex scenes [9]. Although deep learning-based methods, including convolutional neural networks (CNNs) (e.g., U-Net [10], PSPNet [11]) and transformers (e.g., ViT [12], Swin-Transformer [13], Swin-Unet [14]), have made substantial progress in feature extraction and global context modeling, they are still restricted by the limited information intrinsic to unimodal data. This limitation restricts their ability to effectively address the segmentation challenges in complex scenarios (for example, overlapping objects of different scales and heterogeneous data sources) [15], highlighting the necessity of multimodal fusion strategies that leverage the complementarity of multi-source data.
Multimodal data (such as RGB spectral data, DSM elevation data, infrared images, etc.) can more comprehensively describe surface features through complementary heterogeneous information, thereby improving segmentation robustness. At the same time, multiscale contextual information (for example, the dense connection mechanism in DFCN [16] and the local–global feature mining strategy in CMFNet [17]) has been proven to be crucial for achieving high segmentation accuracy. However, the existing multimodal fusion-based semantic segmentation methods exhibit significant limitations: the relevance of different modalities and semantic contexts is essentially unequal [18], and naively connecting multimodal data may lead to incompatible feature spaces due to heterogeneous statistical properties, thus reducing segmentation performance [19,20]. Although RDFNet [21] and HRNet [22] attempted to improve fusion strategies, and ACNet [23] introduced attention mechanisms to enhance contextual information association, challenges such as modal-specific feature neglect and feature redundancy still exist. Furthermore, Transformer-based approaches (e.g., MFTransNet [24] and STransFuse [25]) enhance performance by modeling long-range dependencies. Nevertheless, their single-scale window mechanism is inadequate for capturing the multiscale characteristics of remote sensing objects. Additionally, mismatches between window sizes and feature maps necessitate padding adjustments, which disrupt network structural consistency and compromise computational efficiency.
To address the aforementioned limitations, inspired by ASMFNet [26] and the dynamic window vision transformer [27,28], this study proposes a novel multimodal image segmentation network, DyFuseNet, by leveraging Swin-Transformer as the backbone and exploring multimodal fusion through an adjacent-scale perspective. The key contributions are outlined as follows:
(1)
Dynamic Window Module (DWM): this study presents a dynamic window module that adjusts the size and position of windows dynamically by integrating continuous position bias. It addresses the key limitation of the fixed window strategy in Swin-Transformer, which fails to capture the multi-scale features of irregular objects, and significantly improves the recognition accuracy of complex terrain targets.
(2)
Cross-Scale Context Attention (SCA): this study designs a cross-scale context attention module that adopts a coarse-to-fine strategy to extract and integrate features from adjacent scales. It addresses the fundamental challenge of the semantic gap between local details and global context in conventional attention mechanisms, thereby improving the model’s ability to understand spatial relationships in heterogeneous scenarios.
(3)
Hierarchical Adaptive Fusion Architecture (HAFA): a hierarchical adaptive fusion architecture is proposed, which designs a heterogeneous modality synchronizer module (HMS) at the shallow network layer and employs an Efficient Channel Attention (ECA) mechanism at the deep network layer. This effectively fuses complementary multi-modal information while maintaining accuracy and significantly reducing the number of model parameters.
Experiments on the ISPRS Vaihingen and Potsdam datasets show that the method improves the segmentation accuracy of irregular features (such as building edges and road networks), enhances robustness in shadow and low-texture areas, maintains reliable classification in high-variance backgrounds, and achieves an optimal balance between accuracy and processing speed. These results collectively validate the effectiveness of the proposed DyFuseNet in complex remote sensing segmentation tasks.

2. Datasets and Data Preprocessing

This section introduces the characteristics of the Vaihingen and Potsdam datasets, discussing their spatial resolution, spectral composition, geographical coverage, and labeling systems. In addition, it details the processing of the raw datasets in this paper, including the sliding window cropping to address the mismatch between GPU memory limitations and the original high-resolution image ratio, ensuring compatibility with the input requirements of the proposed model.

2.1. Datasets

This section introduces the core features of the Vaihingen and Potsdam datasets, compares key attributes, and clarifies their complementary roles in data complexity and model validation, providing a basis for method adaptability analysis.

2.1.1. Vaihingen Dataset

The Vaihingen dataset comes from aerial images of the city of Vaihingen an der Enz in southwestern Germany, characterized by dense building clusters and a complex spatial configuration of scattered forest areas. This heterogeneity makes it particularly suitable for multi-class land cover classification problems in fragmented scenarios involving urban and vegetation elements.
The dataset contains 33 orthorectified aerial images with a spatial resolution of 9 cm per pixel, capable of capturing detailed features of small-scale objects such as individual buildings, roads, and patches of vegetation. Spectrally, the images are acquired in three bands: near-infrared (NIR), red (R), and green (G) (represented as an IR-R-G band combination), which are widely used to distinguish vegetation (through near-infrared reflectance) and built-up areas (through red and green bands).
Geographically, the area of the Vaihingen region is approximately 0.5 square kilometers, concentrated in residential areas, industrial facilities, and vegetation zones. Full labeled annotations are provided for all 33 images to ensure comprehensive ground truth for model training and evaluation. The dataset includes six semantic classes: roads, buildings, low vegetation, trees, cars, and background, with the first five categories defined as primary foreground objects of interest.

2.1.2. Potsdam Dataset

The Potsdam dataset comes from aerial survey images of the city center area of Potsdam, a historic city in Germany, known for its densely arranged large buildings and narrow streets, which presents unique challenges for segmentation tasks (such as distinguishing closely adjacent structures and fine urban features). This dataset consists of 38 orthorectified aerial images, with a higher spatial resolution of 5 cm per pixel, providing finer details compared to the Vaihingen dataset and assisting in the analysis of small-scale objects.
In the spectrum, the Potsdam images expand the Vaihingen band combination by including an additional blue (B) channel, thus forming a four-band configuration (NIR, R, G, and B). The addition of the blue band enhances contrast differentiation and improves the robustness of feature extraction in complex urban scenes. Furthermore, out of 38 images, 24 are fully labeled, serving as the primary source for model training and validation. The annotations follow the same six-class semantic system as the Vaihingen dataset, with consistent definitions for foreground and background categories. A key advantage of the Potsdam dataset lies in its high-resolution four-band imagery and dense urban structure, which provides a complementary contrast to the urban-forest mixed complexity and full label coverage of the Vaihingen dataset.

2.2. Data Preprocessing

In order to address the GPU memory limitations associated with handling high-resolution images and to align the large-scale input data with the input requirements of the proposed model, a sliding window clipping strategy is used for dataset preprocessing.
The original image is divided into 512 × 512 pixel sub-images with a stride of 512 pixels. This stride is chosen to balance computational efficiency and contextual continuity: a smaller stride reduces edge discontinuities within segmented objects, while a larger stride minimizes redundant computation. The sliding window method effectively suppresses boundary artifacts that may arise from sudden transitions at the edges of cropped sub-images, thereby improving the model’s generalization ability to different spatial distributions of ground objects. Additionally, to ensure the statistical independence and balanced representation of each semantic class, a stratified sampling strategy is used to randomly split the generated sub-images into training (70%), validation (20%), and testing (10%) sets. This method maintains the proportional distribution of foreground and background classes across all subsets, thus preventing bias in model evaluation and ensuring reliable assessment of the model’s multi-class segmentation performance.
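A minimal sketch of the sliding-window cropping step described above is given below (array shapes and helper names are illustrative; the authors' actual preprocessing script is not shown in this paper):

```python
# Sliding-window cropping into non-overlapping 512 x 512 tiles (stride = tile size).
import numpy as np

def crop_tiles(image: np.ndarray, tile: int = 512, stride: int = 512):
    """Split an H x W x C array into tile x tile patches using a sliding window."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - tile + 1, stride):
        for left in range(0, w - tile + 1, stride):
            patches.append(image[top:top + tile, left:left + tile])
    return patches

# Example: a 2048 x 2048 image yields 16 non-overlapping 512 x 512 sub-images.
dummy = np.zeros((2048, 2048, 3), dtype=np.uint8)
print(len(crop_tiles(dummy)))  # 16
```

The resulting sub-images would then be split 70/20/10 into training, validation, and test sets with stratified sampling, as described above.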
By integrating the complementary attributes of the Vaihingen and Potsdam datasets with targeted preprocessing (sliding-window clipping and stratified splitting), this study establishes a robust and scalable dataset framework that enables rigorous validation of the proposed method’s effectiveness in multi-scale, multimodal remote sensing image analysis.

3. Methods

In response to the challenges of heterogeneity in DSM (Digital Surface Model) and IRRB (Infrared Radiation Brightness Image) images and the problem of multi-scale object segmentation, a multi-modal segmentation network called DyFuseNet based on dynamic windows and cross-scale fusion is proposed. As shown in Figure 1, the workflow of this algorithm consists of the following key steps:
(1)
Dual-branch Feature Extraction: the workflow starts with the input images and employs a parallel dual-branch architecture for multimodal feature encoding. Left branch: the IRRB image is processed by a Swin-T encoder integrated with the DWM to extract the intermediate feature R i , which is then refined through the SCA module to obtain R h . Right branch: the DSM image is processed by a separate Swin-T encoder, also integrated with the DWM, to generate the intermediate feature D i , which is then refined with SCA to produce D h .
(2)
Cross-modal Feature Fusion: after the high-dimensional features R h and D h are obtained from the dual branches, the workflow proceeds to two core fusion stages: HMS and ECA. The HMS performs an initial fusion of R h (from the optical/infrared branch) and D h (from the DSM branch), enabling the complementary integration of features from these two heterogeneous data sources. Subsequently, the ECA module applies channel- and/or spatial-wise attention weighting and feature reorganization to the HMS output, facilitating deeper interaction and integration of multimodal information and thereby providing more discriminative fused features for the subsequent decoding stage.
(3)
Decoding and Output: after feature extraction by the dual branches and cross-modal fusion, the final features are fed into the Swin-T Decoder for upsampling and refined classification operations, yielding the final output.
DyFuseNet is an end-to-end network, and its overall framework is shown in Figure 2. Let the IRRB image and the DSM image be denoted by $R \in \mathbb{R}^{C_R \times H \times W}$ and $D \in \mathbb{R}^{C \times H \times W}$, respectively, where $H$ and $W$ represent the height and width of the image and $C$ represents the channel size. $R_0$ is obtained by dividing $R$ into patches with positional embeddings; after several consecutive stages, the output of stage $i$ takes the form $R_i \in \mathbb{R}^{2^i C \times \frac{H}{2^{i+2}} \times \frac{W}{2^{i+2}}}$, where $i$ is the stage index and $C = 96$. Similarly, the DSM encoder exports $D_i$ with the same size.

3.1. Dynamic Window Module

In the field of computer vision, the window attention mechanism has become a major design paradigm due to its inherent trade-off between computational efficiency and modeling capability. However, traditional fixed window strategies exhibit significant limitations when handling multi-scale visual analysis tasks, greatly restricting the generalization ability of the model.
To overcome this limitation, this study proposes a Dynamic Window Module. The specific key algorithms can be found in Appendix B.1. As illustrated in Figure 3, DWM integrates three key sub-modules: the Adaptive Window Partitioning (AWP) module, the Continuous Relative Position Bias Generator (CRPB) module, and the Multi-scale Dynamic Window Attention (MDWA) module.
(1)
Adaptive Window Partitioning (AWP): dynamically determine the optimal window size based on the input feature map dimensions to adapt to the characteristics of remote sensing images, ensuring that the window size evenly divides the feature map dimensions while eliminating redundant padding.
(2)
Continuous Relative Position Bias Generator (CRPB): use a lightweight multi-layer perceptron (MLP) to replace the pre-computed position bias table, generating position codes that are independent of window size through real-time normalization of spatial coordinates.
(3)
Multi-scale Dynamic Window Attention (MDWA): integrates outputs from the preceding two sub-modules to enable genuine multi-scale perception, effectively addressing the limitations of fixed windows when processing variable-sized feature maps.

3.1.1. Adaptive Window Partitioning Module

By dynamically adjusting the window size, the AWP module improves the calculation efficiency and reduces the semantic separation at the boundary, which is expressed by the mathematical formula:
$M_d = \max\{\, m \mid m \le M_{base},\ m \mid H \,\}$
$N_d = \max\{\, n \mid n \le N_{base},\ n \mid W \,\}$
The height and width of the input feature map are denoted by $H$ and $W$, respectively. $M_{base}$ and $N_{base}$ are predefined base window sizes with a default value of (8, 8), which has been experimentally verified to offer the best trade-off between computational efficiency and semantic integrity. Here, $m$ is a candidate window size, and the condition $m \mid H$ (i.e., $H$ is divisible by $m$) ensures that the windows tile the feature map evenly.
The AWP module takes the size of the input image and the preset basic window information as input, and dynamically determines the optimal window size that can evenly divide the dimensions of the feature map based on the aspect ratio of the feature map and computational efficiency. After calculating the optimal window size, it outputs the actual size of the window for subsequent processing, while segmenting the feature map into corresponding windows based on this data. This module can eliminate the padding requirements of feature maps with different input scales, improve the calculation efficiency, and ensure a consistent number of windows for different input feature maps. By dynamically adjusting the window size, the AWP module alleviates the semantic segmentation issues at the boundaries while maintaining computational efficiency. For the configuration parameters involved in this module and the complete derivation process, please refer to Appendix A.1.
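As an illustration, the AWP rule above can be sketched as a search for the largest window size that does not exceed the base size and evenly divides the corresponding feature-map dimension (a simplified sketch; the authors' full implementation is given in Appendix B.1):

```python
# Pick the largest window size m <= base with m | dim, so windows tile the map
# without padding. Falls back to 1 for pathological dimensions.
def adaptive_window_size(dim: int, base: int = 8) -> int:
    for m in range(min(base, dim), 0, -1):
        if dim % m == 0:   # m | dim: the window divides the feature-map dimension
            return m
    return 1

# Example: a 56 x 60 feature map with an (8, 8) base window.
M_d = adaptive_window_size(56)  # -> 8 (8 divides 56)
N_d = adaptive_window_size(60)  # -> 6 (largest divisor of 60 not exceeding 8)
print(M_d, N_d)
```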

3.1.2. Continuous Relative Position Bias Generator

In this paper, a continuous relative position bias generator module is developed by combining normalized coordinates with a lightweight MLP that replaces the pre-calculated position bias table. The specific calculation formulas are presented as follows:
$h_{ij} = h_i - h_j, \quad w_{ij} = w_i - w_j$
$\hat{h}_{ij} = \dfrac{h_{ij}}{M_d - 1}, \quad \hat{w}_{ij} = \dfrac{w_{ij}}{N_d - 1}$
$B_{ij} = W_2 \cdot \mathrm{ReLU}\!\left(W_1 \cdot [\hat{h}_{ij}, \hat{w}_{ij}]^{T} + b_1\right)$
Let $h_{ij}$ and $w_{ij}$ be the generated relative coordinates, $\hat{h}_{ij}$ and $\hat{w}_{ij}$ the normalized coordinates, and $B_{ij}$ the position bias matrix generated by the Multi-Layer Perceptron (MLP) transformation. In this context, the normalization operation maps the physical coordinates (represented by $h_{ij}$ and $w_{ij}$) into a canonical space, which enables adaptation to any window size. This adaptability eliminates the need for additional adjustments when dealing with different window sizes, thereby reducing the number of parameters required by the model.
The continuous coordinate representation achieved through normalization allows more detailed modeling of the positional relationships between elements, providing a more accurate description of the spatial structure within the feature map and thereby enhancing the model’s ability to extract image information. Furthermore, this module uses the MLP to learn a generic function describing positional relationships; this function generates positional biases in real time and dynamically adapts to different window sizes. The MLP is a standard multilayer perceptron; its specific details and configuration are explained in Appendix A.1. By learning such a function, the model generalizes better to inputs that require different window sizes.
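The following PyTorch sketch illustrates the CRPB idea under the assumptions stated above (a two-layer MLP over normalized relative coordinates); the hidden width and head count are illustrative, not the paper's exact configuration:

```python
# Hedged sketch: an MLP maps normalized relative coordinates to per-head biases.
import torch
import torch.nn as nn

class CRPB(nn.Module):
    def __init__(self, num_heads: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_heads))

    def forward(self, m_d: int, n_d: int) -> torch.Tensor:
        # relative coordinates (h_i - h_j, w_i - w_j) for every pair of window tokens
        coords = torch.stack(torch.meshgrid(
            torch.arange(m_d), torch.arange(n_d), indexing="ij"), dim=-1)
        coords = coords.reshape(-1, 2).float()                 # (M*N, 2)
        rel = coords[:, None, :] - coords[None, :, :]          # (M*N, M*N, 2)
        # normalize into a window-size-independent canonical space
        rel[..., 0] /= max(m_d - 1, 1)
        rel[..., 1] /= max(n_d - 1, 1)
        return self.mlp(rel).permute(2, 0, 1)                  # (heads, M*N, M*N)

bias = CRPB(num_heads=3)(6, 8)
print(bias.shape)  # torch.Size([3, 48, 48])
```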

3.1.3. Multi-Scale Dynamic Window Attention Module

MDWA module is the computing core of DWM, which integrates the outputs of AWP and CRPB modules. Mathematically, these operations can be expressed as:
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}} + B\right)V$
Here, $Q$, $K$, and $V$ denote the query, key, and value matrices obtained from linear transformations of the input window features, $d_k$ is the dimension of the key vectors, and $B$ is the bias matrix generated by CRPB. Specifically, this module combines the input window features with the learned position bias matrix $B$ and couples the dynamic window mechanism with continuous position encoding to achieve multi-scale adaptive attention.
Compared to traditional window attention mechanisms, the proposed multi-scale dynamic window attention achieves unified multi-scale processing, supports feature maps of different resolutions within a single network architecture, eliminates padding, and generates window-adaptive position offset codes in real-time. This submodule can reduce the computational complexity of the model while enhancing its ability to recognize objects in multi-scale images.
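For illustration, the attention computation above, with the CRPB bias added to the scaled dot-product logits, can be sketched as follows (tensor shapes are placeholders):

```python
# Window attention with a learned bias term B, as in the equation above.
import torch
import torch.nn.functional as F

def window_attention(q, k, v, bias):
    # q, k, v: (num_windows, heads, tokens, head_dim); bias: (heads, tokens, tokens)
    d_k = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d_k ** 0.5 + bias.unsqueeze(0)
    return F.softmax(logits, dim=-1) @ v

q = k = v = torch.randn(4, 3, 48, 32)                 # 4 windows, 3 heads, 48 tokens
out = window_attention(q, k, v, torch.zeros(3, 48, 48))
print(out.shape)  # torch.Size([4, 3, 48, 32])
```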

3.2. Cross-Scale Context Attention Module

To address the challenge of cross-scale semantic information interaction in remote sensing images, this paper proposes a Cross-Scale Context Attention module, as shown in Figure 4. Through the collaborative design of Multilevel Feature Interaction (MFI) and Channel Semantic Recalibration (CSR), the SCA module enables effective interaction of semantic information across scales, significantly enhances feature representation capability, and thereby improves the accuracy of semantic segmentation.
The core innovation of the SCA module lies in establishing a fusion mechanism between high-level semantic information and low-level detail features, effectively addressing the representation inconsistency caused by the scale differences in remote sensing images. Below, the architectural design and implementation principles of the MFI and CSR submodules are elaborated in detail. Taking the IRRB encoder as an example, the specific working mechanisms of each submodule and their contributions to the semantic segmentation task of remote sensing images are introduced. The key algorithm implementation of the SCA module can be found in Appendix B.2.

3.2.1. Multilevel Feature Interaction Module

MFI builds a foundation for cross-scale feature fusion to connect high-level and low-level features. As shown in Figure 4, in the IRRB encoder, this sub-module connects features at different levels through a bidirectional feature propagation mechanism. Mathematically, the above can be expressed as:
$R_{low} = \mathrm{Conv}_{1\times 1}(R_i)$
$R_{high} = \mathrm{Conv}_{3\times 3}\!\left(\mathrm{Upsample}(R_{i+1})\right)$
$R_{cat} = \mathrm{Concat}(R_{low}, R_{high})$
where $\mathrm{Conv}_{i\times i}$ denotes a convolution with an $i \times i$ kernel, $\mathrm{Upsample}(\cdot)$ represents bilinear upsampling, and $\mathrm{Concat}$ indicates channel-wise concatenation. The Multilevel Feature Interaction module operates as follows:
1.
The feature map R i is first refined through a 1 × 1 convolution layer to adapt its channel dimensions for subsequent fusion;
2.
The adjacent higher-level feature $R_{i+1}$ is upsampled (to match the resolution of $R_i$) and processed via a 3 × 3 convolution layer to ensure feature compatibility;
3.
The two branches are concatenated to generate the output feature map $R_{cat}$ (with the same spatial and channel dimensions as $R_i$), which preserves both the local details from $R_i$ and the global context from $R_{i+1}$.
By establishing cross-level contextual interaction pathways, the MFI module enables the model to simultaneously utilize local fine-grained features and global semantic information, significantly enhancing the model’s ability to extract image information in complex scenarios. The output R c a t provides a feature representation for subsequent CSR sub-modules that allows for spatial semantic collaborative optimization, forming the core innovation of the SCA mechanism.
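A minimal PyTorch sketch of the MFI pathway (1 × 1 convolution on $R_i$, upsampling plus 3 × 3 convolution on $R_{i+1}$, then channel-wise concatenation) is given below; the channel widths follow the Swin-T setting $C = 96$ for illustration only:

```python
# Sketch of the MFI fusion path described in Section 3.2.1 (channel sizes illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MFI(nn.Module):
    def __init__(self, c_low: int, c_high: int, c_out: int):
        super().__init__()
        self.proj_low = nn.Conv2d(c_low, c_out // 2, kernel_size=1)
        self.proj_high = nn.Conv2d(c_high, c_out // 2, kernel_size=3, padding=1)

    def forward(self, r_i, r_next):
        r_low = self.proj_low(r_i)                               # 1x1 conv on R_i
        r_high = self.proj_high(F.interpolate(                   # upsample + 3x3 conv on R_{i+1}
            r_next, size=r_i.shape[-2:], mode="bilinear", align_corners=False))
        return torch.cat([r_low, r_high], dim=1)                 # channel-wise concat -> R_cat

mfi = MFI(c_low=96, c_high=192, c_out=96)
r_cat = mfi(torch.randn(1, 96, 64, 64), torch.randn(1, 192, 32, 32))
print(r_cat.shape)  # torch.Size([1, 96, 64, 64])
```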

3.2.2. Channel Semantic Recalibration Module

The Channel Semantic Recalibration module adapts and optimizes feature channels through an intelligent weighting mechanism guided by high-level semantic features, significantly improving the accuracy of semantic segmentation. This module utilizes the semantic prior knowledge of high-level features to assess the importance of each channel (including channels in the feature map and R c a t outputted by the MFI module), thereby enhancing the representation of key semantic information.
More specifically, CSR module first compresses the spatial dimension through a global average pooling layer. This operation extracts the global features at the channel level and integrates the adjacent higher-level features, as illustrated in Figure 4, which depicts the spatial dimension compression and channel-level feature extraction process.
Subsequently, a series of convolution layers are employed to learn the dependencies between channels, thereby generating attention weights. The Softmax activation function is then applied to these attention weights to produce a normalized attention weight distribution, achieving channel recalibration.
Next, the attention weights are multiplied by the output R c a t of the MFI module. The final enhanced feature representation is derived using feature representations at different levels, thus achieving cross-scale contextual feature enhancement. This process significantly improves the model’s ability to extract features from remote sensing images. The above-mentioned operations can be formulated mathematically as follows:
$R_i^{h} = \mathrm{Conv}\!\left(\mathrm{Concat}(R_i, \mathrm{Up}(R_{i+1}))\right) \times \mathrm{Softmax}\!\left(\mathrm{Conv}(\mathrm{GAP}(R_{i+1}))\right)$
where $\mathrm{GAP}(\cdot)$, $\mathrm{Concat}(\cdot)$, $\mathrm{Up}(\cdot)$, and $\mathrm{Softmax}(\cdot)$ denote the global average pooling layer, the concatenation layer, the upsampling layer, and the Softmax function, respectively. $D_i^{h}$ is derived in the same way.
Through the channel recalibration mechanism, the CSR module significantly enhances the discriminative ability of feature representations, thereby addressing key challenges in remote sensing images such as high inter-class similarity and the difficulty in identifying small targets, and provides an important guarantee for accurate semantic segmentation.
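The recalibration in the equation above can be sketched as follows; this is a simplified version in which the channel weights come from a single 1 × 1 convolution over the globally pooled higher-level feature, whereas the exact convolution stack is described in Appendix B.2:

```python
# Sketch of CSR: channel weights from the higher-level feature gate the fused MFI output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CSR(nn.Module):
    def __init__(self, c_high: int, c_cat: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv2d(c_high, c_cat, kernel_size=1)

    def forward(self, r_cat, r_next):
        # Softmax over channels of Conv(GAP(R_{i+1})) produces the attention weights
        w = F.softmax(self.conv(self.gap(r_next)), dim=1)   # (B, c_cat, 1, 1)
        return r_cat * w                                    # channel recalibration -> R_i^h

csr = CSR(c_high=192, c_cat=96)
r_h = csr(torch.randn(1, 96, 64, 64), torch.randn(1, 192, 32, 32))
print(r_h.shape)  # torch.Size([1, 96, 64, 64])
```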

3.3. Hierarchical Adaptive Fusion Architecture

In heterogeneous data feature fusion, features at different levels contain different semantic information. Shallow features typically contain rich spatial details, while deep features carry higher-level semantic information. To fully leverage the complementary information in multimodal data, this paper proposes a Hierarchical Adaptive Fusion Architecture (HAFA), which consists of two sub-modules:
(1)
Heterogeneous Modality Synchronizer (HMS), which fuses features from different modalities and mainly consists of two parts: feature enhancement in the spatial direction and adaptive feature fusion in the channel direction. The spatial correlation is computed from the multimodal features $R_i^h$ and $D_i^h$ derived from the SCA module, which is the key to realizing adaptive fusion of multimodal features.
(2)
Efficient Channel Attention (ECA): in the deep network, due to the low resolution of feature map but rich semantic information of channels, this paper uses ECA module [29] to enhance the expression of significant channels in multimodal features, which can improve the response of important channels without increasing too much computational burden, thus strengthening the semantic consistency of cross-modal features. The hierarchical adaptive fusion architecture is illustrated in Figure 5.
Specifically, the output of the heterogeneous modality synchronizer, denoted $A_i^s$, represents spatial correlations (its detailed operation is described in the next section). Meanwhile, the extracted multimodal features $R_i^h$ and $D_i^h$ are fed into two ECA modules, and their outputs are summed to generate $A_i^c$ (i.e., channel correlations). Subsequently, $A_i^c$ and $A_i^s$ are processed through fully connected (FC) layers, and the resulting outputs are combined to produce the final cross-modal feature $M_i^F$. These operations are mathematically formulated as follows:
$M_i^{F} = \mathrm{FC}\!\left(\mathrm{AF}\!\left(R_i^{h}, D_i^{h}, \mathrm{Conv}(\mathrm{FE}(R_i^{h}, D_i^{h}))\right)\right) + \mathrm{FC}\!\left(\mathrm{ECA}(R_i^{h}, D_i^{h})\right), \quad i = 1, 2, 3$
where $\mathrm{FC}(\cdot)$ consists of a fully connected layer followed by ReLU and Sigmoid functions, $\mathrm{FE}(\cdot)$ and $\mathrm{AF}(\cdot)$ denote the feature enhancement and adaptive fusion operations, respectively, and $\mathrm{ECA}(\cdot)$ is the operation of the ECA module.

3.3.1. Heterogeneous Modality Synchronizer

Due to the incomplete spatial alignment between DSM data and IRRB data during the spatial correlation calculation process, this study proposes a Heterogeneous Modality Synchronizer (HMS) to fuse heterogeneous information. As shown in Figure 6, this module consists of two components: spatial feature enhancement (FE) and adaptive channel fusion.
(1)
Spatial Feature Enhancement (FE): firstly, process the multimodal data (including IRRB and DSM) through global average pooling (GAP) to extract spatial information. Then, input the aligned and fused features of IRRB and DSM into the adaptive fusion process, where the weight matrix is learned from the features of adjacent stages.
(2)
Adaptive Channel Fusion: to mitigate the impact of noisy or missing DSM signals on IRRB data, channel attention mechanisms are employed to adaptively adjust the weights of different modal features, enabling more effective cross-modal feature fusion.
The adaptive feature fusion process integrates spatially enhanced features E i h with modality-specific inputs R i h and D i h as follows:
1.
The spatially enhanced features E i h are first concatenated with R i h and D i h , respectively, followed by a 1 × 1 convolutional layer to adjust channel dimensions and a ReLU activation for non-linearity;
2.
The concatenated outputs are fed into a softmax layer to generate the weighting coefficients α and β , which adaptively balance the contributions of the two branches;
3.
The final spatial correlation map $A_i^s$ is computed by a weighted fusion of the intermediate features using $\alpha$ and $\beta$. Mathematically, the operations are formulated as:
$A_i^{s} = \alpha \times R_i^{h} + \beta \times D_i^{h}$
$\alpha = \dfrac{\mathrm{Conv}\!\left(\mathrm{Concat}(R_i^{h}, E_i^{h})\right)}{\mathrm{Conv}\!\left(\mathrm{Concat}(R_i^{h}, E_i^{h})\right) + \mathrm{Conv}\!\left(\mathrm{Concat}(D_i^{h}, E_i^{h})\right)}$
$\beta = \dfrac{\mathrm{Conv}\!\left(\mathrm{Concat}(D_i^{h}, E_i^{h})\right)}{\mathrm{Conv}\!\left(\mathrm{Concat}(R_i^{h}, E_i^{h})\right) + \mathrm{Conv}\!\left(\mathrm{Concat}(D_i^{h}, E_i^{h})\right)}$
The ReLU function is omitted from the formulas for conciseness. In this module, the learnable weighting coefficients $\alpha$ and $\beta$ provide strong adaptability to the fusion process.
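A minimal sketch of this adaptive weighting is shown below; the ReLU-gated 1 × 1 convolutions and the ratio normalization follow the formulas above, while the small epsilon added for numerical stability is an assumption not stated in the text:

```python
# Sketch of the HMS adaptive fusion: alpha and beta are ratio-normalized scores.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # one scoring branch per modality: Conv(Concat(., E)) followed by ReLU
        self.score_r = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=1), nn.ReLU(inplace=True))
        self.score_d = nn.Sequential(
            nn.Conv2d(2 * channels, 1, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, r_h, d_h, e_h):
        s_r = self.score_r(torch.cat([r_h, e_h], dim=1))
        s_d = self.score_d(torch.cat([d_h, e_h], dim=1))
        alpha = s_r / (s_r + s_d + 1e-6)     # ratio normalization as in the formulas
        beta = 1.0 - alpha
        return alpha * r_h + beta * d_h      # spatial correlation map A_i^s

fuse = AdaptiveFusion(96)
a_s = fuse(*[torch.randn(1, 96, 64, 64) for _ in range(3)])
print(a_s.shape)  # torch.Size([1, 96, 64, 64])
```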

3.3.2. Efficient Channel Attention Module

As shown in Figure 7, in response to the problem of information redundancy caused by direct concatenation of multimodal remote sensing image features, this paper adopts Efficient Channel Attention (ECA) to improve the performance of the model through cross-channel interaction.
The input data are first processed by global average pooling (GAP) to obtain the spatially aggregated features $F_{rgb} \in \mathbb{R}^{C \times 1 \times 1}$, where $C$ denotes the number of channels. Subsequently, a convolutional layer is applied to these features.
Specifically, in the ECA module, the size of the convolution kernel is derived through non-linear mapping based on the channel dimension. The output of the convolutional layer is passed through a Sigmoid function to generate a vector of channel attention weights F d s m . Finally, F d s m is element-wise multiplied with the input features to produce the final weighted features. Mathematically, the operations are formulated as:
$\alpha_c = \sigma\!\left(\mathrm{MLP}\!\left(\mathrm{GAP}(F_{rgb} \oplus F_{dsm})\right)\right)$
$\alpha_s = \sigma\!\left(\mathrm{Conv}([F_{rgb}; F_{dsm}])\right)$
$F_{fused} = \alpha_c \cdot \alpha_s \cdot \left(W_r F_{rgb} + W_d F_{dsm}\right)$
where $\oplus$ represents element-wise addition, and $W_r$ and $W_d$ are learnable weights.
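The three equations above can be sketched as follows; the MLP reduction ratio, the 7 × 7 spatial convolution, and the use of 1 × 1 convolutions for $W_r$ and $W_d$ are illustrative assumptions rather than the paper's exact configuration:

```python
# Sketch of the channel/spatial gating used to fuse F_rgb and F_dsm.
import torch
import torch.nn as nn

class ECAFusion(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3)
        self.w_r = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_d = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_rgb, f_dsm):
        b, c, _, _ = f_rgb.shape
        pooled = (f_rgb + f_dsm).mean(dim=(2, 3))               # GAP of the element-wise sum
        a_c = torch.sigmoid(self.mlp(pooled)).view(b, c, 1, 1)  # channel gate alpha_c
        a_s = torch.sigmoid(self.spatial(torch.cat([f_rgb, f_dsm], dim=1)))  # spatial gate alpha_s
        return a_c * a_s * (self.w_r(f_rgb) + self.w_d(f_dsm))  # F_fused

fused = ECAFusion(96)(torch.randn(1, 96, 32, 32), torch.randn(1, 96, 32, 32))
print(fused.shape)  # torch.Size([1, 96, 32, 32])
```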

3.4. Loss Function

Different objects in remote sensing imagery are adjacent to and overlapping each other, which makes the boundaries between neighboring objects blurry, leading to a higher risk of misclassification. In addition, there is a significant imbalance in the number of samples among different ground target categories, and most existing models tend to prioritize the correct classification of majority samples while neglecting minority samples, resulting in overfitting. To address these challenges, this paper adopts a loss function based on error point correction, which consists of two parts, defined as follows:
$Loss_{total} = (1 - \alpha)\, Loss_{EPC} + \alpha\, Loss_{LSR}$
In this context, $\alpha$ is a hyperparameter, $Loss_{EPC}$ is a loss term targeting error-prone and misclassified points, and $Loss_{LSR}$ is a label-smoothing regularization loss that mitigates overfitting of the network to the labels.
$Loss_{EPC}$ is a specialized loss function designed to focus on points prone to errors and misclassification. In remote sensing images, factors such as complex object boundaries and interference from objects with similar appearances make these regions or pixels more likely to be misclassified. $Loss_{EPC}$ assigns higher weights to such points during training, enabling the model to adjust its parameters accordingly and reduce the misclassification rate in these areas. $Loss_{LSR}$ performs label smoothing regularization to prevent network overfitting. When a model overfits the training data, it may memorize noise and idiosyncrasies in the labels instead of learning the underlying patterns. $Loss_{LSR}$ counteracts this by adding a regularization term to the loss, smoothing the influence of over-confident labels so that the model learns more generalizable features and improves its recognition performance on new data. Detailed mathematical expressions and derivations for $Loss_{EPC}$ and $Loss_{LSR}$ can be found in Appendix A.2.
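The combined loss can be sketched as below. Since the exact definitions of $Loss_{EPC}$ and $Loss_{LSR}$ are only given in Appendix A.2, the sketch approximates the error-point-correction term by up-weighting currently misclassified pixels and uses PyTorch's built-in label smoothing for the LSR term; the weighting factors are placeholders:

```python
# Hedged sketch of Loss_total = (1 - alpha) * Loss_EPC + alpha * Loss_LSR.
import torch
import torch.nn.functional as F

def total_loss(logits, target, alpha=0.3, epc_weight=2.0, smoothing=0.1):
    # logits: (B, C, H, W); target: (B, H, W) with class indices
    ce = F.cross_entropy(logits, target, reduction="none")     # per-pixel cross-entropy
    wrong = (logits.argmax(dim=1) != target).float()           # error-prone pixel mask
    loss_epc = ((1.0 + epc_weight * wrong) * ce).mean()        # up-weight misclassified pixels
    loss_lsr = F.cross_entropy(logits, target, label_smoothing=smoothing)
    return (1.0 - alpha) * loss_epc + alpha * loss_lsr

loss = total_loss(torch.randn(2, 6, 64, 64), torch.randint(0, 6, (2, 64, 64)))
print(loss.item())
```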

4. Results and Discussion

To validate the effectiveness of DyFuseNet, systematic experiments were conducted using the Vaihingen and Potsdam standard remote sensing datasets released by the International Society for Photogrammetry and Remote Sensing (ISPRS). This section first systematically introduces the evaluation metrics used in the experiments, followed by a presentation of the comparative experimental results and ablation study results. Subsequently, it focuses on discussing the advantages of the proposed DyFuseNet compared to existing models, and provides an in-depth analysis of the enhancement mechanisms of each module’s contribution to the model’s performance.
All experiments were conducted in an Ubuntu 20.04 environment, with hardware configured as a single NVIDIA RTX 3090 GPU (24 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA) paired with a 15-core Intel Xeon Platinum 8358 P 2.60 GHz processor (Intel Corporation, Santa Clara, CA, USA), 90 GB RAM, and 80 GB storage capacity, including a 30 GB system disk and a 50 GB data disk. The software environment is based on PyTorch 1.11.0, Python 3.8, and CUDA 11.3. The training parameters are set as follows: using the Stochastic Gradient Descent (SGD) optimizer, the decoder learning rate is 0.001, the encoder learning rate is 0.0005, and a fixed batch size of 10.

4.1. Evaluation Metrics

In this paper, the following metrics are used to evaluate segmentation performance: overall accuracy (OA), mean F1-score (mF1), and mean intersection over union (mIoU). The formulas are as follows:
$OA = \dfrac{TP + TN}{TP + TN + FP + FN}$
$Precision_c = \dfrac{TP_c}{TP_c + FP_c}, \quad Recall_c = \dfrac{TP_c}{TP_c + FN_c}$
$F1_c = \dfrac{2 \times Precision_c \times Recall_c}{Precision_c + Recall_c}$
$mF1 = \dfrac{1}{C}\sum_{c=1}^{C} F1_c$
$IoU_c = \dfrac{TP_c}{TP_c + FP_c + FN_c}$
$mIoU = \dfrac{1}{C}\sum_{c=1}^{C} IoU_c$
In this context, TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative, respectively, and the subscript c indicates the metrics calculated for the c-th class of land cover objects (e.g., building, vegetation, water body, etc.)
OA reflects the proportion of correctly classified pixels to the total number of pixels, with higher values indicating better overall segmentation accuracy; similarly, higher values of mF1 and mIoU demonstrate the model’s superior semantic segmentation performance in preserving detailed features and accurately segmenting object boundaries.
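For concreteness, these metrics can be computed from a per-class confusion matrix as in the following sketch (class count and inputs are illustrative):

```python
# OA, mF1, and mIoU from a confusion matrix, following the formulas above.
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int):
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)          # rows: ground truth, cols: prediction
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    oa = tp.sum() / cm.sum()                              # correctly classified / total pixels
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1e-9)
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    return oa, f1.mean(), iou.mean()

pred = np.random.randint(0, 6, (512, 512))
gt = np.random.randint(0, 6, (512, 512))
print(segmentation_metrics(pred, gt, num_classes=6))
```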
In addition, to further analyze the computational efficiency of DyFuseNet, this study adopts a three-dimensional evaluation framework commonly used in semantic segmentation: Parameters (Params) reflect the model’s memory consumption, FLOPs (Floating-Point Operations) represent computational complexity, and FPS (Frames Per Second) measures inference speed. Ideally, an efficient model should have low Params and FLOPs to reduce computational overhead while maintaining a high FPS for real-time performance. These three metrics are quantitatively defined as follows:
$Params = \sum_{l=1}^{L}\left(K_{l,width} \times K_{l,height} \times C_{l,in} \times C_{l,out} + C_{l,out}\right)$
$FLOPs = \sum_{l=1}^{L} 2 \times K_{l,width} \times K_{l,height} \times C_{l,in} \times C_{l,out} \times \dfrac{H}{s_l} \times \dfrac{W}{s_l}$
$FPS = \dfrac{N_{frames}}{\sum_{i=1}^{N_{frames}} t_{infer,i}}$
where $K_{l,width}$ and $K_{l,height}$ denote the convolution kernel size of layer $l$, $C_{l,in}$ and $C_{l,out}$ are the numbers of input and output channels of layer $l$, and $s_l$ is the down-sampling rate of layer $l$. Ideally, an effective model should have low values for the first two metrics and a high FPS.
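A minimal sketch of measuring FPS according to the formula above, using the warm-up and repeated-inference protocol described in the next paragraph (10 warm-up passes, then the average over 100 timed forward passes), is shown below; the stand-in model and input shape are placeholders:

```python
# FPS = number of inferences divided by total forward time (forward pass only).
import time
import torch

@torch.no_grad()
def measure_fps(model, inp, warmup=10, runs=100):
    model.eval()
    for _ in range(warmup):                  # warm-up passes, excluded from timing
        model(inp)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(inp)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return runs / (time.perf_counter() - start)

model = torch.nn.Conv2d(3, 6, kernel_size=3, padding=1)   # stand-in for the real network
print(f"{measure_fps(model, torch.randn(10, 3, 512, 512)):.2f} FPS")
```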
All efficiency metrics were measured with an input feature-map resolution of 512 × 512 and a batch size of 10, consistent with the training configuration to ensure fairness. The number of channels follows the model input layers: 3 channels for RGB images and a single channel for DSM images. In addition, to avoid the impact of GPU initial-state fluctuations, 10 warm-up forward inferences are performed before formal timing (only forward-propagation time is counted, excluding backpropagation and data loading).
The FLOPs and FPS metrics are measured as follows: FLOPs are obtained with PyTorch’s torch.profiler module by counting the multiply-accumulate operations of each layer during forward propagation and converting them; FPS is computed from the end-to-end time of 100 consecutive inferences (excluding data loading and warm-up iterations), averaged and reported to two decimal places.
For the Canny edge detection analysis, this paper uses boundary density (Bd) and the standard deviation of boundary density (stdBd) as evaluation metrics. Boundary density is the proportion of boundary pixels between specific categories to the total number of pixels in a multi-class segmentation task; it reflects the clarity and degree of separation of boundaries between different categories. The standard deviation of boundary density measures the dispersion of boundary densities across the images in the dataset and indicates the consistency or variability of boundary density within the dataset; a larger value indicates higher dispersion. The specific definitions are:
$Bd\,(\%) = \dfrac{\text{number of boundary pixels}}{\text{total number of pixels}} \times 100$
$stdBd\,(\%) = \sqrt{\dfrac{1}{N}\sum_{i=1}^{N}\left(Bd_i - \overline{Bd}\right)^{2}}$
Here, $N$ is the number of images in the dataset, $Bd_i$ is the boundary density of the $i$-th image, and $\overline{Bd}$ is the average boundary density over all images. When the boundaries in a dataset are relatively regular, that is, when the boundaries between different categories are clear and simple with smooth, orderly lines, the proportion of boundary pixels in the image is relatively low, i.e., the boundary density is low. This is because, with regular boundaries, the transition areas between categories are narrow, so only a small number of pixels lie on the boundaries.

4.2. Comparative Experiments

In this study, DyFuseNet uses a unified encoder to process IRRB and DSM data. Under the same experimental conditions, the performance of the proposed DyFuseNet is compared with the performance of six state-of-the-art segmentation models on the ISPRS Vaihingen and Potsdam datasets: PSPNet [11], Swin-Unet [14], ACNet [23], RDFNet [21], CMFNet [17], and MFTransNet [24]. Among these, Swin-Unet and PSPNet are unimodal methods that only utilize IRRB data, while the other models are multimodal methods. The inclusion of unimodal methods in the comparative experiments aims to validate the performance of classical methods, thereby providing a clearer benchmark for evaluating the proposed methods. For all comparison methods, at least 5 independent training and testing runs were conducted under the same dataset split, hyperparameters, and evaluation protocols, recording the mIoU results for each run. The detailed comparison results are presented in Table 1.
In addition, DyFuseNet achieved mIoUs of 80.40% ± 0.4% and 80.85% ± 0.3% on Vaihingen and Potsdam, respectively, demonstrating stable results. Moreover, a paired t-test was performed between the repeated experimental results of DyFuseNet and the SOTA baselines; the p-values on both datasets were less than 0.01. These repeated experiments and statistical tests indicate that the proposed method achieves significant and stable improvements over the other SOTA methods across the evaluated metrics.
Comparing the single-modal methods (Swin-Unet and PSPNet) with the multi-modal methods in the comparison table, it can be seen that the multi-modal data fusion methods outperform the single-modal methods across all evaluation metrics. This finding indicates that incorporating DSM elevation information can effectively enhance semantic segmentation performance, with the mIoU increasing by an average of 7.28%.
Further analysis of the class-level F1 scores for the two datasets indicates that the proposed DyFuseNet achieves the best segmentation results for the building, tree, and vehicle categories. Notably, the improvement in the tree category is particularly significant, exceeding the second-best method MFTransNet by 3.87% and 3.76% on the two datasets, respectively. Although DyFuseNet performs slightly below MFTransNet in the road and low vegetation categories, it still outperforms the other comparative methods in these categories. Furthermore, among all multimodal methods, DyFuseNet achieves the highest mIoU on both datasets, with particularly notable gains in the tree and vehicle categories.
In urban scenarios, including the datasets used in this paper, Vaihingen and Potsdam, low vegetation (such as grass and shrubs) is often adjacent to buildings, roads, and other land cover types, forming numerous “vegetation-building” and “vegetation-road” boundary areas. These boundaries are challenging for semantic segmentation and place high demands on a model’s ability to distinguish local features. Therefore, an improvement in the F1 score for low vegetation (Low.) not only indicates more accurate classification of this category itself but also implicitly reflects the model’s precise segmentation ability at the “edges where vegetation meets other land cover types.” If a model can correctly classify boundary pixels of low vegetation, it demonstrates that it has captured local features at the edges. Combined with the overall improvement in mF1, it shows that the model not only classifies low vegetation more accurately but also, by optimizing edge local features, enhances the boundary segmentation accuracy of neighboring land cover types. In Table 1, DyFuseNet’s Low. F1 scores reach 0.9102 and 0.9160 on the Vaihingen and Potsdam datasets, respectively, demonstrating its significant advantage in classifying low vegetation/boundary pixels compared to other methods. These results validate the effectiveness of DyFuseNet in addressing the challenges of complex boundary segmentation.
From the perspective of the comprehensive evaluation metrics, DyFuseNet ranks first in both OA and mF1 across the two datasets, achieving mIoU values of 80.40% and 80.85%, which are 1.91% and 1.77% higher than the second-best method MFTransNet, respectively. These results indicate that, although the IRRB and DSM modalities share the same encoder design, DyFuseNet delivers significantly better overall segmentation performance than the other multimodal methods, highlighting its balance and robustness in multi-class segmentation tasks. Taking the Potsdam dataset as an example, the comparative segmentation results in Figure 8 show that DyFuseNet achieves clear improvements in boundary segmentation accuracy and in the recognition of objects at varying scales.
The results in the figure further support the use of the Low. F1 score as an indicator of boundary segmentation accuracy: compared with the baselines, the segmentation maps generated by DyFuseNet show clearer and more precise boundaries, consistent with its quantitative advantage in Low. F1. This agreement between visual and quantitative evidence reinforces the claim that DyFuseNet has excellent edge-handling capabilities.
To further assess the performance of the proposed DyFuseNet, Table 2 presents a comprehensive comparison among different methods under identical experimental conditions.
This evaluation provides a comprehensive assessment of method performance by simultaneously analyzing computational efficiency metrics and segmentation accuracy. Computational efficiency is quantified using three key indicators: FLOPs (measured in billions of floating-point operations), parameter count (reported in megabytes), and inference speed (expressed in frames per second, FPS). Segmentation accuracy is primarily evaluated through mIoU (mean Intersection over Union, reported as a percentage). Additionally, a multimodal indicator (marked as Y/N) is incorporated to specify whether each compared method employs multimodal fusion (Y) or relies on single-modal data (N) for semantic segmentation.
Compared to the multi-modal fusion segmentation methods (CMFNet and MFTransNet), the proposed DyFuseNet achieves lower FLOPs, fewer parameters, and higher FPS, demonstrating the best balance between computational efficiency and segmentation performance. Specifically, the FPS of the proposed model is 1.8 times that of the second-best method. Although the parameter count increases slightly, the overall segmentation performance of DyFuseNet remains significantly superior to the other methods.
In order to systematically evaluate the competitive advantages of DyFuseNet in multimodal remote sensing segmentation methods, we generated a scatter plot based on the quantitative data from Table 2 to intuitively illustrate the trade-off between FLOPs and mIoU. As shown in Figure 9, the horizontal axis represents floating-point operations (FLOPs, in billions), and the vertical axis represents mean Intersection over Union (mIoU, in %), clearly displaying the balance between the computational resource consumption and segmentation accuracy of different methods.
In the figure, DyFuseNet is marked with a black pentagram and occupies the most favorable region, with roughly 30 GFLOPs and an mIoU close to 81%; the multimodal methods (MFTransNet, ACNet, CMFNet, and RDFNet) are marked with orange circles and are relatively scattered, with generally lower mIoU values; the unimodal methods (Swin-Unet and PSPNet) are marked with blue triangles, achieving mIoUs of 74.5% and 75%, respectively.
Based on the comparative analysis in the figure, three core advantages of the proposed method can be drawn. First, compared to other typical multimodal methods, DyFuseNet achieves a higher mIoU at lower FLOPs, demonstrating the joint optimization of computational efficiency and segmentation accuracy. Second, as the figure shows, the unimodal methods fail to balance the two metrics: Swin-Unet has low FLOPs but also a low mIoU, indicating insufficient segmentation accuracy, while PSPNet attains a higher mIoU at the cost of very high FLOPs, so neither simultaneously optimizes computational load and segmentation performance. In contrast, DyFuseNet achieves a balanced operating point of 26.19 GFLOPs and 80.65% mIoU, clearly surpassing the compared unimodal methods. Third, regarding Pareto optimality, the overall distribution in the FLOPs-mIoU scatter plot places DyFuseNet near the Pareto front, indicating that it maximizes segmentation accuracy under the given computation budget and reflecting the co-optimization of efficiency and accuracy.
In summary, DyFuseNet exhibits significant superiority in multimodal remote sensing segmentation: Its computational efficiency and segmentation performance cooperatively optimize resource consumption and accuracy, offering both practical utility and theoretical value as a representative method in the field.

4.3. Ablation Experiments

To better understand the influence of each improved module in DyFuseNet on the semantic segmentation performance of multimodal remote sensing images, this study conducts ablation experiments by progressively integrating the modular components. The experimental design focuses on three core objectives: (1) verifying the performance gains achieved by replacing the fixed window mechanism of the Swin Transformer with the DWM; (2) evaluating the effectiveness of the SCA module in enhancing cross-scale feature extraction; and (3) analyzing the superiority of HAFA in collaboratively integrating IRRB imagery and DSM information. All experiments are performed under identical environmental conditions.
The baseline model is the Swin-Transformer, designed for semantic segmentation of multimodal remote sensing images. The ablation process is as follows: initially, the fixed-size window mechanism of Swin-Transformer is replaced with a dynamic window module (DWM) to enhance feature localization. Subsequently, a cross-scale contextual attention (SCA) module is incorporated to capture long-range dependencies. Finally, a hierarchical adaptive fusion architecture (HAFA) is constructed, where the shallow network employs a heterogeneous modal synchronizer to align multimodal features, and the deep network introduces efficient channel attention to reduce feature redundancy. The ablation experimental results are presented in Table 3.
In addition, since the primary purpose of the ablation experiments is to independently verify each improved module’s contribution to accuracy, the focus is on changes in performance metrics such as mIoU and mF1. Computational efficiency metrics like FLOPs are therefore not the main concern of this ablation study, for the following reasons. First, the essence of an ablation experiment is controlled variation: by gradually removing or adding specific modules and observing the resulting fluctuations in performance metrics, the direct impact of each module on accuracy becomes clear; comparing FLOPs at the same time would introduce additional variables (such as module parameter counts and the non-linear accumulation of computational complexity) that interfere with a clean analysis of each module’s performance contribution. Second, computational efficiency metrics are better evaluated in the context of the complete model: metrics like FLOPs are more suitable for comparisons between complete models to assess overall design efficiency, whereas ablation experiments focus on the local impact of modules; the FLOPs of the complete model are analyzed uniformly in the comparative experiments. Finally, the impact of adding or removing a module on FLOPs and FPS is indirect and can be inferred: for example, the dynamic window module replaces global fusion with local window computation, theoretically reducing redundant calculations, and hierarchical fusion reuses multi-scale features without significantly increasing FLOPs. These inferences can be supported through theoretical analysis and need not be re-measured in the ablation experiments.
The results of the ablation experiments indicate that the gradual integration of these three improved modules—the dynamic window module, the cross-scale contextual attention module, and the hierarchical adaptive fusion architecture—has produced significant and quantifiable performance improvements, validating the scientific design and effectiveness of each component.
The DWM replaced the fixed-size window mechanism with dynamic window scaling, significantly enhancing the perception of structures at different scales, including large buildings and small cars. It achieved a 2.86% increase in mIoU in Vaihingen, rising from 77.34% to 80.20%, and a 3.05% increase in Potsdam, from 77.60% to 80.65%, confirming that this module improved the segmentation accuracy of multi-scale target boundary details in complex scenarios.
The SCA module further improved the model’s segmentation accuracy, with the mIoU metric increasing by 1.93% on the Vaihingen dataset and by 1.77% on the Potsdam dataset. Although the gain of the model after adding this module is slightly lower than that of the dynamic multi-scale window module, it optimizes contextual dependencies through cross-scale feature interaction, enhancing the segmentation of fine-grained categories. It is evident that the segmentation accuracy of low vegetation and roads in the Vaihingen dataset has significantly improved. Additionally, the mF1 score for trees and vehicles in the Potsdam dataset has seen a substantial increase, indicating that this module compensates for the model’s limitations in global context modeling.
The results for the HAFA module reveal a dependency of DyFuseNet’s gains on the dataset. As shown in the table, this module increases the mIoU on the Vaihingen dataset by 1.25%, from 78.95% to 80.20%, by reconciling modal differences and suppressing redundancy. On the Potsdam dataset, however, the mIoU improves by only 0.45%, from 79.20% to 79.65%. A possible reason is that the model is sensitive to the modal fusion requirements of specific datasets, which requires further validation.
A comparison of the original annotation files shows that the Potsdam dataset has more regular boundaries, generally featuring organized block-like building areas and grid-like roads, whereas Vaihingen contains complex, fragmented edges with abundant scattered vegetation and exhibits jagged boundary lines where irregular buildings and roads intersect. HAFA improves classification accuracy by integrating complementary information from the (often noisy) DSM elevation data and the IRRB imagery; in areas with clearly defined boundaries, however, the original DSM and IRRB information is already sufficiently reliable, leaving limited room for optimization. The effect of the HAFA module is therefore more pronounced on Vaihingen.
To quantify this observation, we used the Canny edge detection algorithm to compute the boundary pixel proportions for key class pairs such as 'building–vegetation' and 'road–vegetation' in the Vaihingen and Potsdam datasets (Figure 10). The calculation is implemented as a Python script: the original annotation map is loaded, edges are detected with OpenCV's Canny function (low threshold 0.1, high threshold 0.3), and the edge pixels associated with each land-cover category are counted.
As shown in Figure 10, the average boundary density on Potsdam is 0.6% ± 0.3% (standard deviation 0.3%), significantly lower than the 1.1% ± 0.2% (standard deviation 0.2%) measured on Vaihingen. This difference directly reflects the contrast between the regular land-cover boundaries of Potsdam and the more complex ones of Vaihingen, and explains why the HAFA module yields only an average mIoU gain of 0.45% on Potsdam but 1.2–1.5% on Vaihingen. It also confirms that the effectiveness of HAFA is closely tied to boundary complexity. Appendix B.3 provides the boundary density calculation script, and the boundary density formula is given with the evaluation metrics in Section 3.1.
Overall, the three modules—namely the dynamic window module that optimizes local feature perception, the cross-scale context attention module that enhances global context modeling, and the hierarchical adaptive fusion architecture that refines multimodal fusion—work together to drive the overall performance improvement of DyFuseNet. The design logic of each module is closely related to the typical challenges in remote sensing images, such as complex scenes, multi-scale targets, modal differences, and boundary ambiguity, confirming the innovation and effectiveness of the DyFuseNet model proposed in this paper.

4.4. Multimodal Noise Robustness Experiments

In this experiment, we simulated two types of noise scenario to evaluate the robustness of DyFuseNet under multimodal noise conditions (a minimal sketch of the noise injection follows the list), specifically:
(1)
DSM noise: Gaussian noise was added to the elevation values of the digital surface model (DSM) with standard deviations of 0.1 m, 0.5 m, and 1.0 m.
(2)
Multispectral noise: Gaussian noise was added to each spectral channel of the multispectral images with standard deviations of 2, 5, and 10.
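The noise injection itself is straightforward; the following minimal NumPy sketch illustrates it under the settings listed above (the function name, array layout, and 8-bit clipping are our illustrative assumptions, not the exact experiment code):

import numpy as np

def add_gaussian_noise(dsm, irrb, dsm_sigma=0.5, spec_sigma=5.0,
                       noise_dsm=True, noise_irrb=True, seed=0):
    """Return noisy copies of a DSM tile (H, W) and an IRRB tile (H, W, 3).

    dsm_sigma is in metres (0.1 / 0.5 / 1.0 in our settings); spec_sigma is in
    digital numbers per spectral channel (2 / 5 / 10 in our settings).
    """
    rng = np.random.default_rng(seed)
    dsm_out = dsm.astype(np.float32)
    irrb_out = irrb.astype(np.float32)
    if noise_dsm:
        dsm_out = dsm_out + rng.normal(0.0, dsm_sigma, size=dsm.shape)
    if noise_irrb:
        irrb_out = np.clip(irrb_out + rng.normal(0.0, spec_sigma, size=irrb.shape), 0, 255)
    return dsm_out, irrb_out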
We applied DyFuseNet and Swin-T to the test sets of the Vaihingen and Potsdam datasets and measured overall accuracy (OA) for four input configurations: the original, noise-free images (Origin), noise added only to the DSM (OnlyDSM), noise added only to the IRRB images (OnlyIRRB), and noise added to both DSM and IRRB images (Both). With all other experimental settings and parameters unchanged, we evaluated, longitudinally, how each noise condition affects the OA of DyFuseNet and, laterally, how the robustness of the proposed DyFuseNet compares with that of the baseline Swin-T. The comparison results are shown in Table 4, where OA-change denotes the change in OA of DyFuseNet relative to the original dataset without any added noise.
As can be seen from the table, the proposed DyFuseNet consistently outperforms the baseline in OA under all noise conditions. In particular, under the combined multimodal noise condition, the OA of Swin-T on Vaihingen drops by about 2% relative to the noise-free case, whereas the corresponding drop for DyFuseNet is markedly smaller. The longitudinal comparison likewise shows that DyFuseNet's OA does not decrease significantly under any of the noise conditions, demonstrating the model's stability and robustness.

5. Conclusions

In this study, we proposed DyFuseNet, a novel multimodal image segmentation framework that integrates dynamic adaptive windows and cross-scale fusion to address key challenges in segmenting complex images with significant scale variations and heterogeneous data sources.
The proposed network consists of three synergistically designed modules: (1) the Dynamic Window Module (DWM), which adaptively adjusts receptive fields to enhance the modeling of objects at varying scales and improve the extraction of fine-grained, edge-rich features; (2) the Cross-Scale Contextual Attention (SCA), which establishes hierarchical interactions between local and global semantics to refine boundary delineation in challenging regions such as shadows and low-texture areas; and (3) the Hierarchical Adaptive Fusion Architecture (HAFA), which aligns features across modalities in the shallow layers while suppressing redundancy in deeper layers through channel attention, achieving an effective balance between segmentation accuracy and computational efficiency.
Extensive experiments on the Vaihingen and Potsdam datasets demonstrate that DyFuseNet achieves state-of-the-art performance, with mIoUs reaching 80.40% and 80.85%, which are 2.83% and 2.99% higher than the baseline model Swin-T, respectively, and 1.71% and 1.57% better than the sub-optimal model MFTransNet. Moreover, the model shows high segmentation accuracy in complex scenes, especially for the edges of large buildings and the shadowed regions around small objects, with an average F1 score of up to 85%. Finally, DyFuseNet maintains low model complexity and high computational efficiency while improving segmentation accuracy, requiring only 26.19 GFLOPs and achieving 30.09 FPS, indicating that the model is suitable for real-time deployment in practical applications.
The method proposed in this article not only provides a robust and general solution for image segmentation under heterogeneous data scenarios, but also introduces innovative strategies of dynamic window modeling and cross-scale feature fusion, contributing to theoretical progress and practical utility. These contributions hold significant value for a wide range of applications, including urban planning, infrastructure monitoring, environmental analysis, and intelligent scene understanding, where accurately and efficiently segmenting complex multimodal visual data is crucial.

Author Contributions

Conceptualization, Q.H. and M.W.; methodology, Q.H., M.W., P.Z. and L.W.; software, Q.H. and M.W.; validation, Q.H., M.W. and P.Z.; formal analysis, Q.H. and L.W.; investigation, Q.H., M.W., P.Z. and L.W.; resources, M.W. and P.Z.; data curation, Q.H., M.W., P.Z., L.W. and Q.S.; writing—original draft preparation, Q.H., Q.S.; writing—review and editing, M.W., P.Z. and L.W.; supervision, M.W., P.Z.; funding acquisition, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the National Key Research and Development Plan (Grant No. 2023YFC3803903), in part by the Humanities and Social Science Research Project of the Ministry of Education of China (No. 22XJA780002), and in part by the Cultivation Special Project for Frontier Interdisciplinary Fields of Xi’an University of Architecture and Technology (No. X20230085).

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Detailed Definitions of CRPB and AWP Module

AWP dynamically adjusts the window size based on the size of the input feature map, allowing us to achieve more efficient use of computational resources while effectively avoiding the semantic fragmentation often seen with fixed-window methods. Assume the input feature map dimensions are:
$$X \in \mathbb{R}^{B \times H \times W \times C}$$
Here, $B$ is the batch size, $H$ and $W$ are the height and width of the input feature map, and $C$ is the number of channels; in our experiments $C$ depends on the modality of the input image (three channels for IRRB images, one channel for DSM images). The baseline window sizes $M_{base}$ and $N_{base}$ are preset to (8, 8); experiments show that this value achieves an optimal balance between computational efficiency and semantic completeness. The dynamic window sizes are computed as:
$$h_d = \begin{cases} \max\{\, m \mid m \le M_{base},\ m \mid H \,\}, & \text{if } H \bmod M_{base} = 0 \\ \max\{\, h \mid h \le \min(H, M_{base}),\ h \mid H \,\}, & \text{otherwise} \end{cases}$$
$$w_d = \begin{cases} \max\{\, n \mid n \le N_{base},\ n \mid W \,\}, & \text{if } W \bmod N_{base} = 0 \\ \max\{\, w \mid w \le \min(W, N_{base}),\ w \mid W \,\}, & \text{otherwise} \end{cases}$$
Static optimization mode refers to the scenario without gradients, where the preset window size is enforced. If the input size does not meet the divisibility condition, an exception is thrown. In training or inference scenarios, a dynamic window partition strategy is used, taking the smaller value between the input size and the baseline window as the initial candidate. The maximum window size that satisfies the divisibility condition is then determined through backward iteration.
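For reference, the window-size selection and partition described above can be sketched in PyTorch as follows (a simplified illustration of the AWP logic; the function names are ours and details of the full module are omitted):

import torch

def dynamic_window_size(length, base=8):
    """Largest window size <= min(length, base) that divides the spatial length."""
    if length % base == 0:
        return base
    for cand in range(min(length, base), 0, -1):  # backward iteration
        if length % cand == 0:
            return cand
    return 1

def window_partition(x, base=8):
    """Split a (B, H, W, C) feature map into (B*num_windows, h_d, w_d, C) windows."""
    B, H, W, C = x.shape
    h_d, w_d = dynamic_window_size(H, base), dynamic_window_size(W, base)
    x = x.reshape(B, H // h_d, h_d, W // w_d, w_d, C)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # reorder window and intra-window axes
    windows = x.reshape(B * (H // h_d) * (W // w_d), h_d, w_d, C)
    return windows, (h_d, w_d)

For example, a 224 × 224 feature map keeps the preset 8 × 8 windows, whereas a 100 × 100 map falls back to 5 × 5 windows, the largest size not exceeding 8 that divides 100 evenly.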
The specific tensor computation process of this module is expressed mathematically as follows:
$$X_{reshaped} = \mathrm{reshape}\!\left(X,\ \left[B,\ \tfrac{H}{h_d},\ h_d,\ \tfrac{W}{w_d},\ w_d,\ C\right]\right)$$
$$X_{transposed} = \mathrm{transpose}\!\left(X_{reshaped},\ [0, 1, 3, 2, 4, 5]\right)$$
$$X_{windows} = \mathrm{reshape}\!\left(X_{transposed},\ \left[B \cdot \tfrac{H}{h_d} \cdot \tfrac{W}{w_d},\ h_d,\ w_d,\ C\right]\right)$$
The window tensor has dimensions $\mathbb{R}^{N \times h_d \times w_d \times C}$, where the total number of windows is $N = B \cdot \frac{H}{h_d} \cdot \frac{W}{w_d}$; the actual window size is returned alongside the windows. This design balances computational efficiency and semantic integrity, making it particularly suitable for self-attention mechanisms that require dynamic adjustment of window sizes. In the CRPB module, a standard multilayer perceptron (MLP) is used to generate the continuous relative position biases. It consists of two linear layers (fc1 and fc2), which perform the input-to-hidden and hidden-to-output transformations, a non-linear activation function, and dropout layers for regularization to prevent overfitting. The detailed parameters of the MLP module—hidden-layer dimensions, activation function, dropout probability, initialization rules, and tensor shape transformations—are as follows (a minimal sketch is given after the list):
(1)
Structure: two layers of linear transformation, from input to hidden layer and then to output, with a GELU function as the non-linear activation and two rounds of Dropout regularization.
(2)
Parameter initialization rules: if hidden_features is not specified, it defaults to the same dimension as the input in_features (i.e., the hidden layer has the same dimension as the input). If out_features is not specified, it defaults to the same dimension as the input in_features. The activation function defaults to nn.GELU, which is the Gaussian Error Linear Unit. The dropout probability defaults to 0 (disabled); when performing regularization in this paper, it can be set to 0.3.
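A minimal PyTorch sketch of this MLP block, following the defaults listed above (the class name is ours):

import torch.nn as nn

class CPBMlp(nn.Module):
    """Two-layer MLP used by CRPB to generate continuous relative position biases."""
    def __init__(self, in_features, hidden_features=None, out_features=None,
                 act_layer=nn.GELU, drop=0.0):
        super().__init__()
        hidden_features = hidden_features or in_features   # default: same as input
        out_features = out_features or in_features          # default: same as input
        self.fc1 = nn.Linear(in_features, hidden_features)  # input -> hidden
        self.act = act_layer()                               # GELU non-linearity
        self.fc2 = nn.Linear(hidden_features, out_features)  # hidden -> output
        self.drop = nn.Dropout(drop)                          # 0 by default; 0.3 when regularizing

    def forward(self, x):
        x = self.drop(self.act(self.fc1(x)))
        x = self.drop(self.fc2(x))
        return x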

Appendix A.2. The Precise Mathematical Formulas for All Loss Terms

To address the problems of target boundary ambiguity and unbalanced category samples in remote sensing images, this paper proposes a total loss function that combines the Error Point Loss (EPC) and Label Smoothing Loss (LSR):
$$Loss_{total} = (1 - \alpha)\, Loss_{EPC} + \alpha\, Loss_{LSR}$$
Here, $\alpha$ is the balance coefficient that adjusts the relative weight of the two losses. In remote sensing semantic segmentation, misclassification caused by blurred boundaries (such as the interface between buildings and vegetation) and by small categories (such as low vegetation) is the main source of error; $Loss_{EPC}$ therefore plays the dominant role in improving classification accuracy in these critical regions, while $Loss_{LSR}$ mainly prevents the model from overfitting label noise in the training set and serves as an auxiliary regularizer. A grid search over $\alpha \in \{0.1, 0.3, 0.5, 0.7\}$ on the Vaihingen/Potsdam validation sets, evaluated by mIoU, showed the best performance at $\alpha = 0.3$, which is the value used in this paper.
$Loss_{EPC}$ aims to improve the model's classification of error-prone areas (such as boundary pixels, small targets, or targets similar to the background). Its core idea is to assign higher loss weights to these high-risk prediction points. Its mathematical expression is:
$$Loss_{EPC} = \frac{1}{N} \sum_{i=1}^{N} \left[ -\left(1 - p_{i,c_i}\right)^{\gamma} \log p_{i,c_i} + \lambda \cdot \mathbb{I}_{edge}(i) \cdot \left(p_{i,c_i} - 0.5\right)^{2} \right]$$
Here, $p_{i,c_i}$ is the probability predicted by the model that pixel $i$ belongs to its true class $c_i$; $\gamma$ is the focusing parameter, which reduces the loss weight of well-classified samples; $\mathbb{I}_{edge}(i)$ is an indicator function that equals 1 if pixel $i$ lies on an image edge (determined with an edge detection operator such as Sobel or Canny) and 0 otherwise; $\lambda$ is the regularization-strength hyperparameter for edge regions; and $N$ is the total number of pixels in the current batch. The final hyperparameter settings are $\gamma = 2$, to focus on hard examples and improve boundary classification sensitivity, and $\lambda = 0.1$, to moderately constrain the loss of edge pixels and avoid overfitting to local features.
$Loss_{LSR}$ applies a smoothing constraint to the hard (one-hot) labels to prevent the model from overfitting label noise in the training data, following the label smoothing technique proposed by Szegedy et al. [30]; its mathematical form is:
$$Loss_{LSR} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \cdot \log\!\left[(1 - \epsilon)\, \hat{p}_{i,c} + \frac{\epsilon}{C}\right]$$
Here, $y_{i,c}$ is the one-hot label of pixel $i$ for class $c$ (1 for the true class, 0 otherwise); $\hat{p}_{i,c}$ is the predicted probability that pixel $i$ belongs to class $c$; $\epsilon$ is the label-smoothing coefficient that controls the degree of smoothing; $C$ is the total number of classes; and $N$ is the total number of pixels in the current batch. To prevent the model from being overly confident in the training labels and to improve robustness to label noise at test time, $\epsilon$ is set to 0.1 in this experiment. $C$ follows the dataset used in this study, which contains six classes: buildings, low vegetation, trees, roads, cars, and water bodies.
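For reference, the combined loss can be sketched in PyTorch as follows (a simplified illustration consistent with the formulas above and using our default hyperparameters; the edge mask is assumed to be precomputed with a Sobel or Canny operator, and all names are illustrative rather than the exact training code):

import torch
import torch.nn.functional as F

def epc_loss(logits, target, edge_mask, gamma=2.0, lam=0.1):
    """Error Point Loss: focal-style term plus an edge-region confidence penalty.

    logits: (B, C, H, W); target: (B, H, W) int64 class indices; edge_mask: (B, H, W) in {0, 1}.
    """
    prob = F.softmax(logits, dim=1)
    p_true = prob.gather(1, target.unsqueeze(1)).squeeze(1).clamp_min(1e-7)  # p_{i,c_i}
    focal = -((1.0 - p_true) ** gamma) * torch.log(p_true)
    edge_term = lam * edge_mask.float() * (p_true - 0.5) ** 2
    return (focal + edge_term).mean()

def lsr_loss(logits, target, eps=0.1):
    """Label-smoothing loss: negative log of the smoothed true-class probability."""
    num_classes = logits.shape[1]
    prob = F.softmax(logits, dim=1)
    p_true = prob.gather(1, target.unsqueeze(1)).squeeze(1)
    smoothed = (1.0 - eps) * p_true + eps / num_classes
    return -torch.log(smoothed.clamp_min(1e-7)).mean()

def total_loss(logits, target, edge_mask, alpha=0.3):
    """Loss_total = (1 - alpha) * Loss_EPC + alpha * Loss_LSR."""
    return (1.0 - alpha) * epc_loss(logits, target, edge_mask) + alpha * lsr_loss(logits, target)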

Appendix A.3. Training Configuration and Complete Data Augmentation Process

This paper employs traditional image processing techniques for multimodal data augmentation, processing each IRRB image and its corresponding DSM simultaneously to keep spatial and spectral information consistent. The operations include geometric transformations, such as mirroring the DSM and multispectral images along the width (horizontal) or height (vertical) axis; brightness and contrast adjustments that simulate changes in illumination; and color perturbation applied to the IRRB images only. The process is shown in Figure A1, and a minimal sketch of the paired augmentation follows the figure.
Figure A1. The process of the data enhancement.
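A minimal sketch of the paired augmentation (assuming NumPy image arrays; the function name and the photometric jitter ranges are illustrative choices rather than the exact values used in training):

import random
import numpy as np

def augment_pair(irrb, dsm, label):
    """Apply identical geometric transforms to IRRB, DSM and label,
    and photometric jitter to the IRRB image only."""
    # Paired horizontal / vertical mirroring keeps spatial alignment across modalities.
    if random.random() < 0.5:
        irrb, dsm, label = irrb[:, ::-1], dsm[:, ::-1], label[:, ::-1]
    if random.random() < 0.5:
        irrb, dsm, label = irrb[::-1, :], dsm[::-1, :], label[::-1, :]
    # Brightness / contrast jitter simulates illumination changes (IRRB only).
    alpha = random.uniform(0.8, 1.2)   # contrast factor (illustrative range)
    beta = random.uniform(-10, 10)     # brightness offset (illustrative range)
    irrb = np.clip(alpha * irrb.astype(np.float32) + beta, 0, 255).astype(np.uint8)
    return (np.ascontiguousarray(irrb),
            np.ascontiguousarray(dsm),
            np.ascontiguousarray(label))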

Appendix B

Appendix B.1. Pseudo-Code Flow in the DWM

Algorithm A1 Dynamic Window Module (DWM) Implementation
Input: A data tensor X, window size dimensions (h, w), optional attention mask.
Output: Processed window tensor (H, W, C).
Module 1: Adaptive Window Partitioning (AWP)
      Input: (B, H, W, C) tensor, (h, w) window size
      Output: (B*num_windows, h, w, C) windows
      1: (M_base, N_base) = (8, 8) // Preset default window size
                  - For inference: enforce divisibility
                  - For training: find max divisible size
      2: Reshape X to:
                  [B, H//h, h, W//w, w, C]
      3: Permute and reshape to:
                  [B*(H//h)*(W//w), h, w, C]
      4: Return windows and the final window size (h, w)
      // Window reverse (inverse of the partition, applied after attention):
      5: Calculate num_windows = (H//h)*(W//w)
      6: Reshape windows to:
                  [B, H//h, W//w, h, w, C]
      7: Permute and reshape back to the original layout:
                  [B, H, W, C]
Module 2: MLP Block of the CRPB
      Input: in_features, [hidden_features], [out_features], [act_layer], [drop]
      Output: Processed tensor
      
      1: Initialize linear layers:
                  fc1 = Linear(in_features → hidden_features)
                  fc2 = Linear(hidden_features → out_features)
      2: Forward pass:
                  x ← fc1(x)
                  x ← Activation(x) # Default: GELU
                  x ← Dropout(x)
                  x ← fc2(x)
                  x ← Dropout(x)
                  return x
Module 3: Multimodal Dynamic Window Attention (MDWA)
      Input: dim, num_heads, [qkv_bias], [qk_scale], [attn_drop], [proj_drop]
      Output: Attention-weighted features
      
      1: Initialize components:
                  - qkv: Linear(dim → 3*dim)
                  - cpb_mlp: MLP for relative position bias
                  - Projection layer and dropouts
      
      2: Forward(x, window_size, mask):
                  (a) Generate relative position coordinates
                  (b) Normalize coords to [−1, 1]
                  (c) Compute position bias via cpb_mlp
                  (d) Calculate QKV and attention scores
                  (e) Apply position bias + mask (if any)
                  (f) Softmax → attention dropout
                  (g) Project and return output
return (H, W, C)

Appendix B.2. Pseudo-Code Flow in the SCA Module

The SCA module combines high-resolution and low-resolution features through upsampling and concatenation, and generates channel attention weights using global average pooling and softmax. In addition, the batch normalization operation in this module stabilizes the training dynamics, and the 1 × 1 convolution placed before the 3 × 3 convolution helps reduce dimensionality. A runnable sketch is provided after Algorithm A2.
Algorithm A2 SCA Module
Input: mol1: Feature map 1 (shape: [B, feature1, H, W]), mol2: Feature map 2 (shape: [B, feature2, H//2, W//2]), size1: Base feature dimension.
Output: out: Fused feature map (shape: [B, feature1, H, W]).
      1: Initialize layers:
                  conv1 ← Conv2d(feature1→feature1, kernel = 1) # Feature projection
                  conv2 ← Conv2d(2*feature1→feature1, kernel = 3, padding = 1) # Fusion
                  conv3 ← Conv2d(feature2→feature1, kernel = 1) # Channel adjustment
                  upSampling2x2 ← ConvTranspose2d(feature2→feature1, kernel = 2, stride = 2) # Upsampler
                  avg_pool ← AdaptiveAvgPool2d(1) # Global context
                  Bach ← BatchNorm2d(feature1) # Normalization
                  relu1 ← ReLU(inplace = True) # Activation
      2: Process mol1 (high-res path):
                  c4_lat ← conv1(mol1) # Linear projection
      3: Process mol2 (low-res path):
                  c5_lat ← upSampling2x2(mol2) # 2x upsampling
                  c5_lat ← relu1(c5_lat) # Non-linear activation
      4: Feature fusion:
                  c_glb ← concat(c4_lat, c5_lat, dim = 1) # Channel-wise concatenation
                  c_glb ← conv2(c_glb) # Cross-scale fusion
                  c_glb ← Bach(c_glb) # Batch normalization
      5: Attention weighting:
                  c_glb_lat2 ← avg_pool(mol2) # Global average pooling
                  c_glb_lat2 ← conv3(c_glb_lat2) # Channel adjustment
                  c_glb_lat2 ← softmax(c_glb_lat2, dim = 1) # Attention weights
      6: Apply attention:
                  out ← c_glb * c_glb_lat2 # Element-wise multiplication
                  return out
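For completeness, the same flow can be written as a compact PyTorch module (a simplified re-implementation of Algorithm A2 under the assumptions stated there, not the exact training code):

import torch
import torch.nn as nn

class SCABlock(nn.Module):
    """Fuse a high-resolution map with an upsampled low-resolution map, then reweight
    channels with attention derived from the low-resolution global context."""
    def __init__(self, feature1, feature2):
        super().__init__()
        self.conv1 = nn.Conv2d(feature1, feature1, kernel_size=1)                   # high-res projection
        self.up = nn.ConvTranspose2d(feature2, feature1, kernel_size=2, stride=2)   # 2x upsampling
        self.conv2 = nn.Conv2d(2 * feature1, feature1, kernel_size=3, padding=1)    # cross-scale fusion
        self.bn = nn.BatchNorm2d(feature1)
        self.relu = nn.ReLU(inplace=True)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)                                     # global context
        self.conv3 = nn.Conv2d(feature2, feature1, kernel_size=1)                   # channel adjustment

    def forward(self, mol1, mol2):
        c4_lat = self.conv1(mol1)                        # [B, feature1, H, W]
        c5_lat = self.relu(self.up(mol2))                # [B, feature1, H, W]
        c_glb = self.bn(self.conv2(torch.cat([c4_lat, c5_lat], dim=1)))
        attn = torch.softmax(self.conv3(self.avg_pool(mol2)), dim=1)  # [B, feature1, 1, 1]
        return c_glb * attn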

Appendix B.3. Canny Edge Detection

The boundary-density script uses Canny edge detection with dual thresholds (25/75) to identify strong edges between target classes, creates binary masks for the specified class pairs ('building–vegetation', 'road–vegetation'), and normalizes the edge pixel count by the total number of image pixels; a runnable version is sketched after Algorithm A3.
Algorithm A3 Canny Edge Detection
Input:
      label_path—Path to RGB annotation image
      CLASS_COLORS—Dictionary mapping class names to RGB values
      TARGET_PAIRS—List of class pairs for boundary analysis
Output:
      Boundary density percentage (0–100)
Function: calculate_boundary_density
1: Load and validate input:
      if not file_exists(label_path):
                  print(“File not found error”)
                  return 0.0
      label_img ← cv2.imread(label_path) # BGR format
      if label_img is None:
                  print(“Image read error”)
                  return 0.0
2: Preprocess image:
      label_img ← convert BGR to RGB
      h, w ← image.height, image.width
      total_pixels ← h * w
3: Generate class mask:
      class_mask ← zeros(h, w)
      for each (class_name, rgb_color) in CLASS_COLORS:
                  if class_name in [‘building’, ‘low_vegetation’, ‘road’]:
                        binary_mask ← (label_img == rgb_color) across all channels
                        class_id ← index_of(class_name) + 1
                        class_mask[binary_mask] ← class_id
4: Create target mask:
      target_mask ← zeros(h, w)
      for each (class_name, _) in TARGET_PAIRS:
                  if class_name in CLASS_COLORS:
                        rgb_color ← CLASS_COLORS[class_name]
                        binary_mask ← (label_img == rgb_color) across all channels
                        target_mask[binary_mask] ← 1
5: Edge detection:
      gray_mask ← target_mask * 255 (convert to 8-bit)
      edges ← cv2.Canny(gray_mask, low_thresh = 25, high_thresh = 75) # Canny edge detection
6: Calculate density:
      boundary_pixels ← count_nonzero(edges)
      boundary_density ← (boundary_pixels/total_pixels) * 100
      return boundary_density
Function: compute_dataset_boundary_density
Input:
      label_paths—List of annotation image paths
Output:
      Mean and std of boundary densities across dataset
      
1: Initialize empty density list
2: for each path in label_paths:
      density ← calculate_boundary_density(path)
      densities.append(density)
      print(f”Image {basename(path)} density: {density:.2f}%”)
3: mean_density ← mean(densities)
4: std_density ← std(densities)
5: return (mean_density, std_density)
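A condensed, runnable version of the same procedure is sketched below (the RGB class colors follow the annotation legend described for Figure 8 and the thresholds follow Algorithm A3; the color dictionary and the set of target classes are assumptions that should be adapted to the annotation files actually used):

import cv2
import numpy as np

# Class colors (RGB) for the classes involved in the analysed boundaries; illustrative values.
CLASS_COLORS = {
    "road": (255, 255, 255),
    "building": (0, 0, 255),
    "low_vegetation": (0, 255, 255),
}

def boundary_density(label_path, low_thresh=25, high_thresh=75):
    """Percentage of pixels lying on Canny edges of the target-class mask."""
    label_bgr = cv2.imread(label_path)
    if label_bgr is None:
        return 0.0
    label_rgb = cv2.cvtColor(label_bgr, cv2.COLOR_BGR2RGB)
    h, w = label_rgb.shape[:2]
    target_mask = np.zeros((h, w), dtype=np.uint8)
    for color in CLASS_COLORS.values():
        target_mask |= np.all(label_rgb == np.array(color, dtype=np.uint8), axis=-1).astype(np.uint8)
    edges = cv2.Canny(target_mask * 255, low_thresh, high_thresh)
    return 100.0 * np.count_nonzero(edges) / (h * w)

def dataset_boundary_density(label_paths):
    """Mean and standard deviation of boundary densities over a list of annotation maps."""
    densities = [boundary_density(p) for p in label_paths]
    return float(np.mean(densities)), float(np.std(densities))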

References

  1. Li, K.; Liu, R.; Cao, X.; Bai, X.; Zhou, F.; Meng, D.; Wang, Z. SegEarth-OV: Towards Training-Free Open-Vocabulary Segmentation for Remote Sensing Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 1234–1245. [Google Scholar]
  2. State Council. National Medium- and Long-Term Plan for Basic Surveying and Mapping (2021–2030); Standards Press of China: Beijing, China, 2021. [Google Scholar]
  3. Zhang, S.; Chen, Z.; Wang, D.; Wang, Z.J. Cross-Domain Few-Shot Contrastive Learning for Hyperspectral Images Classification. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  4. Hou, Z.; Chen, M.; Ma, S.; Qu, M.; Yang, X. Real-Time Urban Street View Semantic Segmentation Based on Cross-Level Aggregation Network. Opt. Precis. Eng. 2024, 32, 1212–1226. [Google Scholar] [CrossRef]
  5. Pan, T.; Zuo, R.; Wang, Z. Geological Mapping via Convolutional Neural Network Based on Remote Sensing and Geochemical Survey Data in Vegetation Coverage Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3485–3494. [Google Scholar] [CrossRef]
  6. Jiang, W.; Zhang, C.; Xu, B.; Luo, C.; Zhou, H.; Zhou, K. AED-Net: A Semantic Segmentation Model for Landslide Disaster Remote Sensing Images. J. Geo-Inf. Sci. 2023, 25, 2012–2025. [Google Scholar]
  7. Fan, K.; Fen, Y. A Traffic Scene Perception Algorithm Combining Semantic Segmentation and Depth Estimation. J. Zhejiang Univ. Eng. Sci. 2024, 58, 684–695. [Google Scholar]
  8. Li, X.; Yan, H.; Wang, Z.; Wang, B. Evaluation and Influencing Factors Analysis of Road Environment Safety Perception Combining Street View Images and Machine Learning. J. Geo-Inf. Sci. 2023, 25, 852–865. [Google Scholar]
  9. Xu, Y.; Cao, B.; Lu, H. Improved U-Net++ Semantic Segmentation Method for Remote Sensing Images. IEEE Access 2025, 13, 55877–55886. [Google Scholar] [CrossRef]
  10. Fan, L.; Zhou, Y.; Liu, H.; Li, Y.; Cao, D. Combining Swin Transformer With UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–11. [Google Scholar] [CrossRef]
  11. Zhou, J.; Hao, M.; Zhang, D.; Zou, P.; Zhang, W. Fusion PSPnet Image Segmentation Based Method for Multi-Focus Image Fusion. IEEE Photonics J. 2019, 11, 1–12. [Google Scholar] [CrossRef]
  12. Xue, Z.; Tan, X.; Yu, X.; Liu, B.; Yu, A.; Zhang, P. Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification. IEEE Trans. Image Process. 2022, 31, 3095–3110. [Google Scholar] [CrossRef]
  13. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20 June 2021; pp. 6877–6886. [Google Scholar]
  14. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Proceedings of the Computer Vision–ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Karlinsky, L., Michaeli, T., Nishino, K., Eds.; Springer: Cham, Switzerland, 2023; pp. 205–218. [Google Scholar]
  15. Li, S.; Li, C.; Kang, X. Development Status and Future Prospects of Multi-Source Remote Sensing Image Fusion. Natl. Remote Sens. Bull. 2021, 25, 148–166. [Google Scholar] [CrossRef]
  16. Peng, C.; Li, Y.; Jiao, L.; Chen, Y.; Shang, R. Densely Based Multi-Scale and Multi-Modal Fully Convolutional Networks for High-Resolution Remote-Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2612–2626. [Google Scholar] [CrossRef]
  17. Ma, X.; Zhang, X.; Pun, M.-O. A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3463–3474. [Google Scholar] [CrossRef]
  18. Sun, Y.; Fu, Z.; Sun, C.; Hu, Y.; Zhang, S. Deep Multimodal Fusion Network for Semantic Segmentation Using Remote Sensing Image and LiDAR Data. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–18. [Google Scholar] [CrossRef]
  19. Ma, M.; Ma, W.; Jiao, L.; Liu, X.; Li, L.; Feng, Z.; Liu, F.; Yang, S. A Multimodal Hyper-Fusion Transformer for Remote Sensing Image Classification. Inf. Fusion 2023, 96, 66–79. [Google Scholar] [CrossRef]
  20. Chen, H.; Lan, C.; Song, J.; Broni-Bediako, C.; Xia, J.; Yokoya, N. ObjFormer: Learning Land-Cover Changes From Paired OSM Data and Optical High-Resolution Imagery via Object-Guided Transformer. IEEE Trans. Geosci. Remote Sens. 2023, 62, 4408522. [Google Scholar] [CrossRef]
  21. Ma, X.; Zhang, X.; Pun, M.-O.; Liu, M. A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  22. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364. [Google Scholar] [CrossRef]
  23. Hu, X.; Yang, K.; Fei, L.; Wang, K. ACNET: Attention Based Network to Exploit Complementary Features for RGBD Semantic Segmentation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22 September 2019; pp. 1440–1444. [Google Scholar]
  24. He, S.; Yang, H.; Zhang, X.; Li, X. MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images. Mathematics 2023, 11, 722. [Google Scholar] [CrossRef]
  25. Gao, L.; Liu, H.; Yang, M.; Chen, L.; Wan, Y.; Xiao, Z.; Qian, Y. STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10990–11003. [Google Scholar] [CrossRef]
  26. Ma, X.; Xu, X.; Zhang, X.; Pun, M.-O. Adjacent-Scale Multimodal Fusion Networks for Semantic Segmentation of Remote Sensing Data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 20116–20128. [Google Scholar] [CrossRef]
  27. Ren, P.; Li, C.; Wang, G.; Xiao, Y.; Du, Q.; Liang, X.; Chang, X. Beyond Fixation: Dynamic Window Visual Transformer. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18 June 2022; pp. 11977–11987. [Google Scholar]
  28. Zhang, Z.; Shu, D.; Gu, G.; Hu, W.; Wang, R.; Chen, X.; Yang, B. RingFormer-Seg: A Scalable and Context-Preserving Vision Transformer Framework for Semantic Segmentation of Ultra-High-Resolution Remote Sensing Imagery. Remote Sens. 2025, 17, 3064. [Google Scholar] [CrossRef]
  29. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13 June 2020; pp. 11531–11539. [Google Scholar]
  30. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv 2016, arXiv:1602.07261. [Google Scholar] [CrossRef]
Figure 1. The proposed DyFuseNet framework: stepwise workflow integrating dynamic windows (DWM), cross-scale contextual attention (SCA), and hierarchical adaptive fusion for multimodal remote sensing image segmentation.
Figure 2. (a) The architectural design of the proposed DyFuseNet network; (b) the structural design of the Encoder within the proposed DyFuseNet network, where DWM represents the dynamic multi-scale window module that can be directly embedded into the encoder and decoder of the Swin-Transformer network; (c) presents the detailed internal organization of sequential Swin-Transformer blocks.
Figure 3. The network structure diagram of the Dynamic Window Module.
Figure 4. Cross-Scale Context Attention Module Network structure diagram.
Figure 5. The Hierarchical Adaptive Fusion Architecture.
Figure 6. Adaptive Channel fusion process.
Figure 7. The Efficient Channel Attention (ECA) module: a lightweight self-attention mechanism enhancing feature representation by modeling inter-channel dependencies via adaptive 1D convolution.
Figure 8. Visual comparison of semantic segmentation results on the Potsdam dataset, including original IR-RGB images, ground truth labels, and outputs from five methods (PSPNet, ACNet, MFTransNet, DyFuseNet). The legend indicates: white for road, blue for building, cyan for low vegetation, green for tree, yellow for vehicle, and red for background categories.
Figure 9. Trade-off between Computational Complexity and Segmentation Accuracy in Remote Sensing Image Segmentation.
Figure 10. Vaihingen & Potsdam Dataset Boundary Density Boxplot.
Table 1. Performance comparison of DyFuseNet with SOTA single-modal and multimodal fusion methods for semantic segmentation on Vaihingen and Potsdam datasets.
| Dataset | Method | Roa. (F1) | Bui. (F1) | Low. (F1) | Tre. (F1) | Car (F1) | OA | mF1 | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| Vaihingen | Swin-Unet | 0.9021 | 0.9518 | 0.8463 | 0.6923 | 0.7591 | 0.8294 | 0.8303 | 0.7209 |
| | PSPNet | 0.9008 | 0.9467 | 0.8831 | 0.7383 | 0.7113 | 0.8374 | 0.8361 | 0.7292 |
| | ACNet | 0.8904 | 0.9314 | 0.8750 | 0.8135 | 0.7387 | 0.8419 | 0.8478 | 0.7590 |
| | RDFNet | 0.8981 | 0.9517 | 0.8921 | 0.8170 | 0.7392 | 0.8570 | 0.8596 | 0.7610 |
| | CMFNet | 0.9058 | 0.9545 | 0.9077 | 0.8196 | 0.7405 | 0.8640 | 0.8656 | 0.7708 |
| | MFTransNet | 0.9229 | 0.9629 | 0.9101 | 0.8254 | 0.7513 | 0.8715 | 0.8745 | 0.7849 |
| | DyFuseNet | 0.9201 | 0.9677 | 0.9102 | 0.8561 | 0.7900 | 0.8511 | 0.8869 | 0.8040 |
| Potsdam | Swin-Unet | 0.8983 | 0.9508 | 0.8870 | 0.7951 | 0.7136 | 0.8500 | 0.8490 | 0.7466 |
| | PSPNet | 0.9008 | 0.9536 | 0.8690 | 0.7948 | 0.7391 | 0.8455 | 0.8515 | 0.7490 |
| | ACNet | 0.9098 | 0.9526 | 0.8893 | 0.7883 | 0.7292 | 0.8536 | 0.8538 | 0.7538 |
| | RDFNet | 0.9037 | 0.9582 | 0.9015 | 0.8117 | 0.7470 | 0.8620 | 0.8644 | 0.7688 |
| | CMFNet | 0.9143 | 0.9601 | 0.9026 | 0.8247 | 0.7747 | 0.8677 | 0.8753 | 0.7843 |
| | MFTransNet | 0.9159 | 0.9657 | 0.9082 | 0.8449 | 0.7594 | 0.8720 | 0.8788 | 0.7908 |
| | DyFuseNet | 0.9194 | 0.9697 | 0.9160 | 0.8703 | 0.7970 | 0.8980 | 0.8987 | 0.8085 |
Table 2. Efficiency and Accuracy Comparison of DyFuseNet and SOTA Instance Segmentation Methods.
| Method | Multimodal | FLOPs (G) | Parameters (MB) | FPS | mIoU (%) |
|---|---|---|---|---|---|
| Swin-Unet | N | 16.54 | 34.68 | 38.64 | 74.66 |
| PSPNet | N | 51.23 | 46.72 | 68.24 | 74.90 |
| ACNet | Y | 12.96 | 62.37 | 18.64 | 75.38 |
| RDFNet | Y | 60.44 | 42.08 | 20.72 | 76.88 |
| CMFNet | Y | 80.67 | 112.44 | 9.82 | 78.43 |
| MFTransNet | Y | 9.52 | 41.36 | 16.68 | 79.08 |
| DyFuseNet | Y | 26.19 | 82.77 | 30.09 | 80.85 |
Table 3. DyFuseNet model ablation experiment results.
| Dataset | Baseline | DWM | SCA | HAFA | OA | mF1 |
|---|---|---|---|---|---|---|
| Vaihingen | √ | | | | 0.8481 | 0.8532 |
| | √ | √ | | | 0.8537 | 0.8581 |
| | √ | √ | √ | | 0.8702 | 0.8748 |
| | √ | √ | √ | √ | 0.8764 | 0.8867 |
| Potsdam | √ | | | | 0.8536 | 0.8538 |
| | √ | √ | | | 0.8677 | 0.8753 |
| | √ | √ | √ | | 0.8764 | 0.8867 |
| | √ | √ | √ | √ | 0.8780 | 0.8897 |
√: the corresponding module is included in the model; blank: the module is not included.
Table 4. Multimodal Noise Robustness Experiments.
| Dataset | Noise | Swin-T | DyFuseNet | OA-Change |
|---|---|---|---|---|
| Vaihingen | Origin | 0.8322 | 0.8789 | 0 |
| | OnlyDSM | 0.8215 | 0.8774 | 0.0015 |
| | OnlyIRRB | 0.8143 | 0.8721 | 0.0068 |
| | Both | 0.8112 | 0.8695 | 0.0094 |
| Potsdam | Origin | 0.8436 | 0.8789 | 0 |
| | OnlyDSM | 0.8325 | 0.8782 | 0.0007 |
| | OnlyIRRB | 0.8276 | 0.8743 | 0.0046 |
| | Both | 0.8105 | 0.8711 | 0.0078 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
