Article

GLFFNet: Global–Local Feature Fusion Network for High-Resolution Remote Sensing Image Semantic Segmentation

1 School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
2 College of Electrical Engineering, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1019; https://doi.org/10.3390/rs17061019
Submission received: 29 December 2024 / Revised: 3 March 2025 / Accepted: 12 March 2025 / Published: 14 March 2025

Abstract

Although hybrid models based on convolutional neural network (CNN) and Transformer can extract features encompassing both global and local information, they still face two challenges in addressing the semantic segmentation task of high-resolution remote sensing (HR2S) images. First, they are limited by the loss of detailed information during encoding, resulting in inadequate utilization of features. Second, the ineffective fusion of local and global context information leads to unsatisfactory segmentation performance. To simultaneously address these two challenges, we propose a dual-branch network named global–local feature fusion network (GLFFNet) for HR2S image semantic segmentation. Specifically, we use the residual network (ResNet) as the main branch to extract local features. Recently, a Mamba architecture based on State Space Models has shown significant potential in image semantic segmentation tasks. Given that Mamba is capable of handling long-range relationships with linear computational complexity and relatively high speed, we introduce VMamba as an auxiliary branch encoder to provide global information for the main branch. Meanwhile, in order to utilize global information efficiently, we propose a multi-scale feature refinement (MSFR) module to reduce the loss of details during global feature extraction. Additionally, we develop a semantic bridging fusion (SBF) module to promote the full integration of global and local features, resulting in more comprehensive and refined feature representations. Comparative experiments on three public datasets demonstrate the segmentation accuracy and application potential of GLFFNet. Specifically, GLFFNet achieves mIoU scores of 84.01% on ISPRS Vaihingen, 87.54% on ISPRS Potsdam, and 54.73% on LoveDA, as well as mF1 scores of 91.11%, 93.23%, and 70.07% on these respective datasets.

1. Introduction

With the rapid development of earth observation technology, the acquisition of high-resolution remote sensing (HR2S) images has become increasingly accessible [1,2]. HR2S images can provide high-quality spatial information support for a wide range of applications, such as road extraction [3], land cover classification [4], and change detection [5]. However, the analysis and interpretation of this spatial information is a challenging task. To address these challenges, semantic segmentation has emerged as a critical task.
Semantic segmentation is a computer vision task that aims to assign a category label to each pixel in an image [6]. Semantic segmentation of HR2S images can analyze and interpret spatial information of ground objects in complex scenarios, making it one of the current research focuses [7]. Remote sensing image semantic segmentation methods can be divided into traditional segmentation methods and deep learning-based segmentation methods. Traditional segmentation methods offer advantages such as fast computation and low dependence on labeled data, but their application is limited due to their low accuracy and difficulty in handling complex scenarios [8]. In contrast, deep learning-based segmentation methods can automatically learn and extract complex features, making them well suited for handling complex scenes and diverse target categories [9,10]. Therefore, this paper focuses on the semantic segmentation methods based on deep learning.
In recent years, deep learning technology has made significant advancements in the processing of remote sensing images. Basic models based on convolutional neural networks (CNNs) possess powerful hierarchical feature extraction capabilities, making them widely used in semantic segmentation tasks of HR2S images [11,12]. To further improve segmentation performance, researchers enhance feature selectivity by adding attention modules to the basic models [13,14,15,16,17,18]. However, due to the locality of convolution, CNN-based models often face challenges in effectively learning global semantic information. The emergence of Transformer [19] provided a way to obtain global information. Subsequently, semantic segmentation models based on Transformer have achieved remarkable results [20,21,22,23,24]. However, the local information extraction capability of Transformer-based models is relatively weak, resulting in rough segmentation results.
The advent of hybrid models based on CNNs and Transformer provided a way to capture features with both local and global information in images [25,26,27]. Wu et al. [28] designed a CNN–Transformer fusion network (CTFNet), which used ResNet as the encoder to extract local features, and introduced a lightweight W/P transformer block as the decoder to capture global context information. Weng et al. [29] proposed a local and global feature coupling network based on convolution operators and self-attention modules. They additionally developed a feature coupling module to integrate local and global features. Wang et al. [30] used parallel CNN and Transformer encoders to extract local and global features, respectively. In addition, they designed a cross-stage fusion module to integrate local and global features. Although the above-mentioned methods achieved initially satisfactory segmentation performance, the high computational complexity of Transformer limits the development of such methods.
Recently, Gu et al. [31] introduced Mamba, which achieved global receptive fields with linear computational complexity and significantly improved computational speed. Subsequently, a series of hybrid models based on Mamba and CNN were proposed. For example, Ma et al. [32] introduced a dual-branch model, RS3Mamba. RS3Mamba applied visual state space (VSS) modules as an auxiliary branch to provide global auxiliary features to the main branch based on CNNs and utilized a collaborative completion module (CCM) to fuse cross-branch features. Liu et al. [33] developed a hybrid CNN–Mamba UNet (CM-UNet) that combines a CNN-based encoder and a Mamba-based decoder. Moreover, they employed an attention module to fuse local features at multiple scales. The aforementioned methods effectively extract features that incorporate both global and local information by combining Mamba with CNN, and reduce the computational complexity of the model. However, two shortcomings remain. First, they may be hindered by the loss of detailed information during hierarchical feature extraction, leading to insufficient exploitation and utilization of features. Second, global and local features extracted with different encoding strategies often exhibit semantic differences at the same locations on the feature maps, so directly fusing them through concatenation or addition may introduce noise and cause the loss of important information. Because these methods do not adequately account for these semantic differences during fusion, their segmentation performance remains unsatisfactory.
To address the aforementioned issues, we propose a global–local feature fusion network (GLFFNet) for HR2S image semantic segmentation. Specifically, GLFFNet adopts a dual-branch architecture, in which the auxiliary branch based on VMamba extracts global features to provide global information for the main branch based on CNN. In order to reduce the loss of detailed information in the process of global feature extraction, a multi-scale feature refinement (MSFR) module is proposed to specifically enhance and integrate adjacent global features. Particularly, the MSFR module applies linear channel attention to enhance the lower-level features and spatial attention to refine the higher-level features. Then, the lower-layer features containing rich detail information are used to supplement the detailed information for the upper-layer features through cross-layer fusion, thus producing a comprehensive global feature representation combining local details and global semantics. Additionally, we propose a semantic bridging fusion (SBF) module to facilitate efficient fusion of global and local features. The SBF module performs adaptive fusion after narrowing the semantic differences between global and local features, effectively avoiding noise and key information loss. The main contributions of this paper can be summarized as follows:
  • A dual-branch network named GLFFNet is proposed for the semantic segmentation of HR2S images, which reduces the loss of detailed information during global feature extraction by the MSFR module and enhances the fusion efficiency of global and local features through the SBF module.
  • An MSFR module is proposed to enhance and integrate adjacent global features. The MSFR module supplements the upper-layer features with details from the lower layer, reducing the loss of detailed information during global feature extraction.
  • An SBF module is introduced to facilitate the efficient fusion of global and local features. It performs adaptive fusion after narrowing the semantic differences between the two types of features, resulting in more refined and comprehensive feature representations.
The organization of the remaining sections is as follows. In Section 2, we briefly introduce the theoretical foundations related to Mamba and the applications of Mamba in the field of semantic segmentation tasks. In Section 3, we provide a comprehensive explanation of our proposed GLFFNet framework, including the design details of the MSFR and SBF modules. Section 4 validates the performance of the GLFFNet through experiments on three datasets. Finally, Section 5 provides a brief summary of our work.

2. Related Work

In this section, we primarily focus on the relevant applications of Mamba in image semantic segmentation tasks. To provide a comprehensive foundation for understanding the architecture of Mamba, we have included supplemental theoretical details of Mamba in Appendix A, which covers the basic principles of State Space Models (SSMs), their associated discretization and convolution computation methods, as well as the core concepts of Mamba.

Mamba in Semantic Segmentation Tasks

The core of Mamba is designed for processing 1-dimensional sequential data, whereas visual data are inherently non-sequential and contain spatial information. As a result, simple parallel selective scanning operations are not suitable for handling visual data. To adapt Mamba for visual tasks, Zhu et al. [34] proposed Vision Mamba. Vision Mamba first flattens a 2D image into 2D image patches, adds positional embeddings to preserve spatial information, and then uses bidirectional SSMs to process the image sequence. Due to the correlations between image patches, Vision Mamba adopts a sequential scanning approach that overlooks the issue of direction sensitivity. Another visual Mamba model VMamba [35] introduced an innovative 2D selective scan (SS2D) module. SS2D traverses along four scanning directions, which is beneficial for bridging the gap between the ordered nature of 1D selective scanning and the unordered structure of 2D visual data. This facilitates the collection of contextual information from various directions and perspectives. PlainMamba [36] used a zigzag scanning technique to maintain the spatial adjacency between tokens and prevent discontinuities. In order to explicitly integrate relative 2D positional information into the selective scanning process, PlainMamba also introduced a direction-aware update method. Huang et al. [37] highlighted that the effectiveness of SSMs in capturing image representations varies depending on the scanning direction. Therefore, they introduced a windowed selective scanning technique and a scanning direction search strategy, achieving significant improvements over existing models. The above methods have advanced the application of Mamba in visual tasks. Given that VMamba effectively captures the spatial structure and contextual information of visual data, we adopt VMamba [35] as an auxiliary encoder in our method.
The advantages of Mamba in extracting global information and performing efficient computations make it particularly well suited for image semantic segmentation tasks [38,39,40,41,42,43]. In the field of medical image segmentation, U-Mamba [44] represents an effective early attempt to apply Mamba to this domain. U-Mamba introduced a hybrid CNN-SSM module that combined the localized feature extraction strengths of convolutional layers with the long-range dependency modeling capabilities of SSMs. VM-UNet [45] serves as one of the most basic implementations of a pure SSM-based segmentation model. To unify the functionality of pre-trained models, Liu et al. [46] proposed Swin-UMamba for medical image segmentation. LightM-UNet [47] combines Mamba and UNet within a lightweight architecture, delivering exceptional segmentation performance. In the field of remote sensing image semantic segmentation, Samba [48] constructed Samba blocks based on Mamba blocks and employed them as encoders, effectively extracting multi-level semantic information. To address the limitation of Mamba, which is primarily designed for sequential data and lacks adaptability for 2D image data, RSMamba [49] introduced a dynamic multipath activation mechanism to enhance its capability for modeling non-sequential spatial information. RS3Mamba [32] is a novel dual-branch network designed for remote sensing image semantic segmentation. Hu et al. [50] introduced Pyramid Pooling Mamba (PPMamba), a network that combines CNN and Mamba for remote sensing semantic segmentation tasks. PPMamba performs selective scanning of feature maps from eight different directions, enabling it to capture a wide range of comprehensive feature information. Although the aforementioned works demonstrate the strong performance of Mamba in image semantic segmentation tasks, they often fail to adequately extract and utilize global and local information. Therefore, to further improve segmentation performance, a new method is needed that comprehensively and efficiently exploits both global and local feature information.

3. Proposed Method

GLFFNet adopts a dual-branch architecture, and the overall network structure is illustrated in Figure 1. GLFFNet consists of three parts: an auxiliary branch, a main branch, and a decoder. The auxiliary branch consists of a VMamba encoder and multi-scale feature refinement (MSFR) modules. The main branch adopts a ResNet encoder and semantic bridging fusion (SBF) modules. The decoder consists of weight fusion (WF) modules and a final convolution module. The WF modules perform weighted feature fusion, while the final convolution module outputs the prediction results.

3.1. Global Feature Extraction and Enhancement

As shown in Figure 1, the auxiliary branch extracts global features using a VMamba encoder with linear computational complexity. Specifically, the VMamba encoder first uses the stem module to split the input image into multiple small patches, and then goes through four stages to create hierarchical feature representations. Each stage consists of downsampling layers and VSS blocks [35], where downsampling is implemented via patch merging. When encoding, the input image $X \in \mathbb{R}^{H \times W \times 3}$ is first decomposed into non-overlapping sub-blocks by the stem module, and then processed through four stages to generate the hierarchical features $S_1 \in \mathbb{R}^{C \times \frac{H}{4} \times \frac{W}{4}}$, $S_2 \in \mathbb{R}^{2C \times \frac{H}{8} \times \frac{W}{8}}$, $S_3 \in \mathbb{R}^{4C \times \frac{H}{16} \times \frac{W}{16}}$, and $S_4 \in \mathbb{R}^{8C \times \frac{H}{32} \times \frac{W}{32}}$.
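As a quick illustration of this hierarchy, the snippet below prints the feature shapes produced by the four stages for a 512 × 512 input. The base width C = 96 corresponds to VMamba-Tiny and is an assumption here; the exact channel width used in GLFFNet is not restated in this section.

```python
# Illustrative only: hierarchical feature shapes (channels, height, width) of the
# four VMamba stages for a 512 x 512 input with an assumed base width C = 96.
H = W = 512
C = 96
stages = [(C * 2 ** i, H // (4 * 2 ** i), W // (4 * 2 ** i)) for i in range(4)]
print(stages)  # [(96, 128, 128), (192, 64, 64), (384, 32, 32), (768, 16, 16)]
```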
The four-layer features initially extracted by the VMamba encoder are often relatively rough. Compared with the lower-layer features, the semantic information of the upper-layer features is enhanced, but some details are lost. To reduce the loss of details during the hierarchical feature extraction, we design an MSFR module to specifically enhance and fuse the features of two adjacent layers, thereby achieving more refined global features. Specifically, to better aggregate feature information, we designed a downsampling module to adjust the feature sizes of adjacent layers. Then, based on the mechanism of linear attention [14], we developed a linear channel attention (LCA) module and a linear spatial attention (LSA) module to selectively enhance the features of the two adjacent layers. The structure of the MSFR module is shown in Figure 2. The MSFR module consists of an LCA module, LSA module, and downsampling module.
(1) Linear Spatial Attention: The LSA module is used to enhance the long-range spatial dependencies of the higher-level feature $S_{n+1}$, resulting in the feature $S'_{n+1}$, which can be defined as follows:

$$S'_{n+1} = \frac{\sum_{n=1}^{N} V(S_{n+1})_{c,n} + \frac{Q(S_{n+1})}{\left\| Q(S_{n+1}) \right\|_2} \left( \left( \frac{K(S_{n+1})}{\left\| K(S_{n+1}) \right\|_2} \right)^{T} V(S_{n+1}) \right)}{N + \frac{Q(S_{n+1})}{\left\| Q(S_{n+1}) \right\|_2} \sum_{n=1}^{N} \left( \frac{K(S_{n+1})}{\left\| K(S_{n+1}) \right\|_2} \right)^{T}_{c,n}} \tag{1}$$

where $Q(S_{n+1}) \in \mathbb{R}^{N \times D_k}$, $K(S_{n+1}) \in \mathbb{R}^{N \times D_k}$, and $V(S_{n+1}) \in \mathbb{R}^{N \times D_v}$ represent the query matrix, key matrix, and value matrix generated by the convolution operation, $N$ represents the number of pixels in the feature map, $c$ represents the channel dimension, and $n$ represents the flattened spatial dimension.
(2) Linear Channel Attention: Similarly, the LCA module is used to enhance the long-range channel dependencies of the lower-level feature $S_n$, resulting in the feature $S'_n$, which can be defined as follows:

$$S'_{n} = \frac{\sum_{n=1}^{N} R(S_{n})_{c,n} + \frac{R(S_{n})}{\left\| R(S_{n}) \right\|_2} \left( \left( \frac{R(S_{n})}{\left\| R(S_{n}) \right\|_2} \right)^{T} R(S_{n}) \right)}{N + \frac{R(S_{n})}{\left\| R(S_{n}) \right\|_2} \sum_{n=1}^{N} \left( \frac{R(S_{n})}{\left\| R(S_{n}) \right\|_2} \right)^{T}_{c,n}} \tag{2}$$

where $R(\cdot)$ represents the reshape operation. Notably, both the LCA and LSA modules have linear computational complexity. More details about the linear attention mechanism can be found in Reference [14].
(3) Downsampling: The purpose of the downsampling module is to adjust the shapes of the enhanced features $S'_n$ and $S'_{n+1}$ to be consistent, which can be defined as follows:

$$D(S'_n) = f_{\sigma}\left( f_{\delta}(S'_n) + f_{\mu}\left( f_{\theta}(S'_n) \right) \right) \tag{3}$$

where $D(\cdot)$ represents the downsampling operation, $f_{\sigma}$ represents the ReLU activation function, $f_{\delta}$ and $f_{\mu}$ represent $3 \times 3$ depthwise separable convolutions (DSConv) with a stride of 2, and $f_{\theta}$ represents a $3 \times 3$ DSConv with a stride of 1. Each convolutional layer contains a batch normalization (BN) operation. After processing by the downsampling module, the shape of the enhanced feature $S'_n$ is consistent with that of the enhanced feature $S'_{n+1}$. Notably, compared to traditional convolutions, depthwise separable convolutions significantly reduce computational cost and model parameters with minimal impact on model performance.
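To make these building blocks concrete, the sketch below gives a minimal PyTorch implementation of the LSA module (Equation (1)) and the downsampling module (Equation (3)). The 1 × 1 convolutions used to produce Q, K, and V, the layer ordering inside the depthwise separable convolutions, and the grouping of operations in Equation (3) are assumptions made for illustration; the LCA module of Equation (2) follows the same linear attention pattern with attention computed along the channel dimension and is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSpatialAttention(nn.Module):
    """Sketch of the LSA module (Equation (1)), following the linear attention of [14]."""
    def __init__(self, channels, dim_k=None):
        super().__init__()
        dim_k = dim_k or channels // 2
        self.to_q = nn.Conv2d(channels, dim_k, kernel_size=1)     # Q(.)
        self.to_k = nn.Conv2d(channels, dim_k, kernel_size=1)     # K(.)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)  # V(.)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        n = h * w                                          # number of pixels N
        q = F.normalize(self.to_q(x).flatten(2).transpose(1, 2), dim=-1)  # Q / ||Q||_2
        k = F.normalize(self.to_k(x).flatten(2).transpose(1, 2), dim=-1)  # K / ||K||_2
        v = self.to_v(x).flatten(2).transpose(1, 2)                       # (B, N, C)
        kv = torch.einsum('bnd,bnc->bdc', k, v)                           # K^T V
        num = v.sum(dim=1, keepdim=True) + torch.einsum('bnd,bdc->bnc', q, kv)
        den = n + torch.einsum('bnd,bd->bn', q, k.sum(dim=1)).unsqueeze(-1)
        return (num / den).transpose(1, 2).reshape(b, c, h, w)

class DSConv(nn.Sequential):
    """3x3 depthwise separable convolution followed by batch normalization."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__(
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

class Downsample(nn.Module):
    """Sketch of the downsampling module (Equation (3)): halves resolution, doubles channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.f_delta = DSConv(in_ch, out_ch, stride=2)   # stride-2 DSConv branch
        self.f_theta = DSConv(in_ch, in_ch, stride=1)    # stride-1 DSConv
        self.f_mu = DSConv(in_ch, out_ch, stride=2)      # stride-2 DSConv
        self.f_sigma = nn.ReLU(inplace=True)             # ReLU applied to the sum (assumed)

    def forward(self, x):
        return self.f_sigma(self.f_delta(x) + self.f_mu(self.f_theta(x)))
```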
Overall, the input features of the MSFR module are the adjacent-layer features $S_n \in \mathbb{R}^{c \times h \times w}$ and $S_{n+1} \in \mathbb{R}^{2c \times \frac{h}{2} \times \frac{w}{2}}$ extracted by the VMamba encoder, where $n \in \{1, 2, 3\}$. The upper-layer feature $S_{n+1}$ is enhanced using the LSA module, which improves spatial dependency as described in Equation (1), yielding the feature $S'_{n+1}$. Simultaneously, the feature $S_n$ is processed by the LCA module, enhancing channel dependency according to Equation (2), producing the output $S'_n$. Afterward, the downsampling module processes the feature $S'_n$, adjusting its shape to match $S'_{n+1}$ as in Equation (3). Finally, the two enhanced features are added together to obtain the final enhanced feature $F^m_{n+1}$, as shown below:

$$F^m_{n+1} = D(S'_n) + S'_{n+1} \tag{4}$$

Through the above processing, the features of two adjacent layers are enhanced and aggregated through the MSFR module to obtain the global auxiliary features $F^m_1$, $F^m_2$, $F^m_3$, and $F^m_4$. The MSFR module leverages the LCA module to enhance the channel attention of lower-layer features, and the LSA module to enhance the spatial attention of high-level feature maps. Then, through cross-layer fusion, the lower-level features are used to supplement detailed information to the higher-level features. As a result, the MSFR module is capable of dynamically selecting and focusing on global features, reducing the loss of detailed information, and improving the network’s efficiency in leveraging these features.
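The following sketch shows how the MSFR module could compose the pieces above to implement Equation (4). LinearChannelAttention is assumed to mirror the LinearSpatialAttention sketch with attention computed along the channel dimension (Equation (2)) and is not shown.

```python
class MSFR(nn.Module):
    """Sketch of the MSFR module: enhance two adjacent VMamba features and fuse them (Equation (4))."""
    def __init__(self, low_ch):                        # S_n has low_ch channels, S_{n+1} has 2*low_ch
        super().__init__()
        self.lca = LinearChannelAttention(low_ch)      # enhances S_n     -> S'_n      (Equation (2), assumed)
        self.lsa = LinearSpatialAttention(2 * low_ch)  # enhances S_{n+1} -> S'_{n+1}  (Equation (1))
        self.down = Downsample(low_ch, 2 * low_ch)     # shape alignment               (Equation (3))

    def forward(self, s_n, s_n1):
        return self.down(self.lca(s_n)) + self.lsa(s_n1)  # F^m_{n+1} = D(S'_n) + S'_{n+1}
```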

3.2. Global–Local Feature Fusion

In order to fully leverage the advantages of both global and local features, we designed an SBF module that combines the strengths of window-based multi-head self-attention [20] and multi-scale channel attention [51]. The SBF module adaptively fuses features after reducing the semantic differences between global and local features, allowing for flexible and efficient feature integration. The structure of the SBF module is shown in Figure 3.
Given stage $n$, the VMamba global auxiliary feature is $F^m_n \in \mathbb{R}^{C^m_n \times h \times w}$, and the local feature extracted by ResNet is $F^r_n \in \mathbb{R}^{C^r_n \times h \times w}$, where $C^m_n$ and $C^r_n$ represent the numbers of channels of the two types of features, respectively. First, $F^r_n$ is processed through window-based multi-head self-attention [20] to enhance long-range dependency capabilities, resulting in the feature $F_{local} \in \mathbb{R}^{C^r_n \times h \times w}$. It is worth noting that this operation has linear complexity. Meanwhile, $F^m_n$ is processed through two parallel convolution branches to capture local detail information and adjust the number of channels to $C^r_n$, resulting in the feature $F_{global} \in \mathbb{R}^{C^r_n \times h \times w}$. The above process can be expressed by the following formulas:

$$F_{local} = W(F^r_n)$$

$$F_{global} = \mathrm{Conv}_{1 \times 1}(F^m_n) + \mathrm{Conv}_{3 \times 3}(F^m_n)$$

where $W(\cdot)$ is the window-based multi-head self-attention, $\mathrm{Conv}_{1 \times 1}(\cdot)$ represents a $1 \times 1$ convolution with a BN operation, and $\mathrm{Conv}_{3 \times 3}(\cdot)$ represents a $3 \times 3$ convolution with a BN operation. By integrating local information into global features and infusing global information into local features, the semantic differences between these features are reduced.
To effectively fuse $F_{global}$ and $F_{local}$ at the same level, we first obtain a fused feature $F_{in}$ through the simple addition of $F_{global}$ and $F_{local}$. Then, $F_{in}$ is processed through two parallel branches to obtain the local and global channel contexts, which are combined via broadcast addition to produce the fusion weight. The fusion weight is then used to perform adaptively weighted fusion of $F_{global}$ and $F_{local}$, resulting in the final fused feature. The calculation is described as follows:

$$F_{in} = F_{global} + F_{local}$$

$$L(F_{in}) = \mathcal{B}\left( \mathrm{PWConv}_2\left( \delta\left( \mathcal{B}\left( \mathrm{PWConv}_1(F_{in}) \right) \right) \right) \right)$$

$$G(F_{in}) = \mathcal{B}\left( \mathrm{PWConv}_2\left( \delta\left( \mathcal{B}\left( \mathrm{PWConv}_1\left( g(F_{in}) \right) \right) \right) \right) \right)$$

$$M(F_{in}) = \sigma\left( L(F_{in}) \oplus G(F_{in}) \right)$$

$$F_{output} = M(F_{in}) \otimes F_{global} + \left( 1 - M(F_{in}) \right) \otimes F_{local}$$

where $L(\cdot)$, $G(\cdot)$, and $M(\cdot)$ represent the operations for generating the local channel context, global channel context, and fusion weight, respectively, and $g(\cdot)$ denotes the global context aggregation applied before the global branch. Specifically, $\mathcal{B}$ represents the batch normalization operation, $\sigma$ represents the sigmoid operation, $\delta$ represents the ReLU operation, and $\mathrm{PWConv}$ represents the point-wise convolution. The kernel size of $\mathrm{PWConv}_1$ is $\frac{C^r_n}{t} \times C^r_n \times 1 \times 1$ and the kernel size of $\mathrm{PWConv}_2$ is $C^r_n \times \frac{C^r_n}{t} \times 1 \times 1$, where $t$ represents the channel reduction ratio. The symbol $\oplus$ represents broadcast addition, while the symbol $\otimes$ represents element-wise multiplication. Notably, $M(F_{in})$ matches the dimensions of $F_{global}$ and $F_{local}$ and consists of real numbers between 0 and 1. $F_{output} \in \mathbb{R}^{C^r_n \times h \times w}$ represents the final fused feature, which effectively combines the features from both encoders at the same stage, incorporating both global and local information.
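A minimal PyTorch sketch of the SBF module is given below. The window-based multi-head self-attention $W(\cdot)$ is passed in as a module (e.g., a Swin-style window attention block [20]) and its internals are omitted; taking $g(\cdot)$ to be global average pooling (as in the multi-scale channel attention of [51]) and setting the channel reduction ratio $t = 4$ are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SBF(nn.Module):
    """Sketch of the semantic bridging fusion module for one stage."""
    def __init__(self, c_m, c_r, window_attn, t=4):
        super().__init__()
        self.window_attn = window_attn  # W(.): window-based MHSA applied to F^r_n (not shown)
        self.conv1x1 = nn.Sequential(nn.Conv2d(c_m, c_r, 1, bias=False), nn.BatchNorm2d(c_r))
        self.conv3x3 = nn.Sequential(nn.Conv2d(c_m, c_r, 3, padding=1, bias=False), nn.BatchNorm2d(c_r))

        def context_branch():           # B(PWConv_2(ReLU(B(PWConv_1(.)))))
            return nn.Sequential(
                nn.Conv2d(c_r, c_r // t, 1), nn.BatchNorm2d(c_r // t), nn.ReLU(inplace=True),
                nn.Conv2d(c_r // t, c_r, 1), nn.BatchNorm2d(c_r))
        self.local_ctx = context_branch()    # L(.)
        self.global_ctx = context_branch()   # G(.)

    def forward(self, f_m, f_r):
        f_local = self.window_attn(f_r)                            # F_local = W(F^r_n)
        f_global = self.conv1x1(f_m) + self.conv3x3(f_m)           # F_global
        f_in = f_global + f_local                                  # F_in
        l = self.local_ctx(f_in)                                   # local channel context
        g = self.global_ctx(f_in.mean(dim=(2, 3), keepdim=True))   # global context, g(.) = GAP (assumed)
        m = torch.sigmoid(l + g)                                   # fusion weight M(F_in), broadcast add
        return m * f_global + (1 - m) * f_local                    # F_output
```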
By reducing semantic differences and controlling the fusion ratio of global and local features, we effectively mitigate noise and key information loss during the feature fusion process. At each stage of the main branch, the SBF module effectively fuses the global and local features of the same layer. The resulting fused features are then fed into the decoder via skip connections to obtain the prediction results.

4. Experiments

We conducted comparative and ablation experiments on the ISPRS Vaihingen, ISPRS Potsdam, and LoveDA datasets to validate the effectiveness of the proposed GLFFNet model. In this section, we first introduce the characteristics of these three public datasets. Next, we outline the evaluation metrics and experimental setup used in our experiments. Following that, we present comparative experiments, evaluating GLFFNet against several state-of-the-art semantic segmentation models on all three datasets. Then, we carry out an ablation study to assess the contribution of each module within GLFFNet to the overall performance. Finally, we analyze the computational complexity of the model. It should be noted that the ISPRS datasets provide two types of ground truth: one with eroded boundaries and one without. In all our experiments, we used the version with eroded boundaries.

4.1. Dataset

Vaihingen: The ISPRS Vaihingen dataset consists of 33 remote sensing image tiles with an average size of 2000 × 2600 pixels and a ground sampling distance (GSD) of 9 cm. Each image tile is composed of three spectral bands: near-infrared, red, and green. The dataset includes five foreground categories (impervious surfaces, buildings, low vegetation, trees, cars) and one background category (clutter). The clutter/background category includes other objects that are usually not of interest in semantic object classification in urban scenes. Among the 33 image tiles in the original Vaihingen dataset, we exploited ID 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 30, 32, 34, and 37 for training, ID 38 for validation, and the remaining 16 image tiles for testing [13]. In our experiments, all image tiles were cropped into multiple 1024 × 1024 pixel patches to construct the training, validation, and test sets.
Potsdam: The ISPRS Potsdam dataset consists of 38 remote sensing image tiles with a GSD of 5 cm, each 6000 × 6000 pixels. It involves the same category information as the Vaihingen dataset. The images are composed of four spectral bands: near-infrared, red, green, and blue. Note that we use only the red, green, and blue channels in our experiments. Excluding one erroneous image tile (ID: 7_10), we chose ID 2_13, 2_14, 3_13, 3_14, 4_13, 4_14, 4_15, 5_13, 5_14, 5_15, 6_13, 6_14, and 6_15 for testing, ID 7_13 for validation, and the remaining 23 image tiles for training [13]. Similar to the process for the Vaihingen dataset, all image tiles were cropped into multiple 1024 × 1024 pixel patches to construct the training, validation, and test sets.
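As a simple illustration of the tiling step used for both ISPRS datasets, the sketch below crops a large tile into fixed-size patches. Non-overlapping crops (stride equal to the patch size) and the NumPy array layout are assumptions; the paper only specifies the 1024 × 1024 patch size.

```python
import numpy as np

def crop_to_patches(tile, patch=1024, stride=1024):
    """Crop a (H, W, C) tile into a list of (patch, patch, C) patches; border remainders are dropped."""
    h, w = tile.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(tile[y:y + patch, x:x + patch])
    return patches

# Example: a 6000 x 6000 Potsdam tile yields 5 x 5 = 25 non-overlapping patches.
dummy_tile = np.zeros((6000, 6000, 3), dtype=np.uint8)
print(len(crop_to_patches(dummy_tile)))  # 25
```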
LoveDA: The LoveDA dataset [52] consists of 5987 high-resolution remote sensing images collected from three different cities, covering both urban and rural areas. It is well suited for semantic segmentation tasks due to its distinct characteristics, including multi-scale objects, complex background samples, and inconsistent class distributions. Each image has a GSD of 0.3 m and a size of 1024 × 1024 pixels, with annotations covering seven categories: background, building, road, water, barren, forest, and agricultural. The original dataset is organized into three subsets, with 2522 images allocated to the training set, 1669 images designated for the validation set, and 1796 images reserved for the test set.

4.2. Experimental Setting

In the experiments, all models were implemented using the PyTorch (version 2.2.2) framework on a single NVIDIA GeForce RTX 2080 Ti GPU. The experiments were conducted on a computer equipped with a 12th Gen Intel Core i7-12700F CPU (2.10 GHz) and 32 GB of RAM. The AdamW optimizer was used with a base learning rate set to $1 \times 10^{-4}$, a weight decay set to $2 \times 10^{-4}$, and a batch size set to 4. The total number of epochs was set to 150. The learning rate was adjusted using a cosine annealing strategy. During training, images were randomly cropped to 512 × 512 pixels, and data augmentation techniques such as random scaling (with scales of [0.5, 0.75, 1, 1.25, 1.5]), random vertical flipping, random horizontal flipping, and random rotation were applied. The loss function was set as the weighted average of the cross-entropy loss and the dice loss.
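A minimal sketch of the combined objective described above is shown below. The equal 0.5/0.5 weighting and the soft-Dice formulation are assumptions; the paper only states that a weighted average of the cross-entropy loss and the Dice loss is used.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, num_classes, eps=1e-6):
    """Soft multi-class Dice loss. logits: (B, C, H, W); target: (B, H, W) integer labels."""
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def segmentation_loss(logits, target, num_classes, w_ce=0.5, w_dice=0.5):
    """Weighted average of cross-entropy and Dice loss (weights assumed)."""
    return w_ce * F.cross_entropy(logits, target) + w_dice * dice_loss(logits, target, num_classes)
```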

4.3. Evaluation Metrics

The performance of GLFFNet on the three datasets is evaluated using the mean intersection over union (mIoU) and the F1 score, which are calculated based on the aggregated confusion matrix. The calculations are as follows:

$$P = \frac{1}{N} \sum_{k=1}^{N} \frac{TP_k}{TP_k + FP_k}$$

$$R = \frac{1}{N} \sum_{k=1}^{N} \frac{TP_k}{TP_k + FN_k}$$

$$F1 = \frac{2 \times P \times R}{P + R}$$

$$\mathrm{mIoU} = \frac{1}{N} \sum_{k=1}^{N} \frac{TP_k}{TP_k + FP_k + FN_k}$$

where $TP_k$, $FP_k$, $TN_k$, and $FN_k$ indicate the true positives, false positives, true negatives, and false negatives for the specific object class indexed as $k$, and $N$ is the number of classes. $P$ and $R$ denote precision and recall, respectively.
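The sketch below computes these metrics from an aggregated confusion matrix; the row-as-ground-truth, column-as-prediction layout of the matrix is an assumption.

```python
import numpy as np

def metrics_from_confusion(conf):
    """conf: (N, N) aggregated confusion matrix; entry (i, j) counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp            # predicted as class k but belonging to another class
    fn = conf.sum(axis=1) - tp            # belonging to class k but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return {"mF1": f1.mean(), "mIoU": iou.mean(), "F1": f1, "IoU": iou}
```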

4.4. Comparison Experiment

To validate the effectiveness of GLFFNet for semantic segmentation of HR2S images, we assessed GLFFNet and benchmarked it against several leading supervised methods on the Vaihingen, Potsdam, and LoveDA datasets. The comparison methods can be divided into two categories. The first category includes networks that use a single model for both encoding and decoding, such as MANet [13], MAResU-Net [14], GCDNet [15], Swin-Unet [22], and VM-UNet [45]. Among them, MANet, MAResU-Net, and GCDNet are CNN-based methods, Swin-Unet is a Transformer-based method, and VM-UNet is a Mamba-based method. The second category includes networks that use a hybrid model for encoding and decoding, such as UNetFormer [26], TransUNet [27], and RS3Mamba [32]. Among them, UNetFormer and TransUNet are based on CNN and Transformer, while RS3Mamba is based on CNN and VMamba.

4.4.1. Results on Vaihingen

The comparative experimental results on the Vaihingen dataset are presented in Table 1, where mF1 denotes the mean F1 score across all classes and values in bold represent the top-performing metrics in the table. The experimental outcomes indicate that the proposed GLFFNet achieves the highest mF1 (91.11%) and mIoU (84.01%). It outperforms networks based on a single model by at least 1.02% in mF1 and 1.70% in mIoU, underscoring the significance of global and local information in semantic segmentation. Compared to the hybrid models, GLFFNet improves the mF1 and mIoU scores by at least 0.52% and 0.92%, respectively, demonstrating superior overall performance due to the efficient utilization and comprehensive fusion of global and local feature information. Based on the results for each category, TransUNet [27] excels in the low vegetation and tree categories. However, due to the insufficient integration of global and local information, its performance in other categories and overall segmentation is inferior to that of GLFFNet. It is noteworthy that GLFFNet achieves the highest F1 score of 88.75% and an IoU score of 79.78% for the car category, indicating that it performs well in segmenting small objects.
The visualization results in Figure 4 illustrate the performance of GLFFNet on the Vaihingen dataset. The first three columns show that GLFFNet effectively delineates building boundaries with clarity and preserves complete structural shapes, while also distinguishing subtle differences between low vegetation and trees. In contrast, other methods (e.g., MANet [13], UNetFormer [26], and RS3Mamba [32]) display varying degrees of blurriness at the boundaries between buildings and other categories, resulting in less distinct class separations. In the fourth column, GLFFNet demonstrates a notable improvement in differentiating between impervious surfaces, low vegetation, and trees. Specifically, narrow impervious surface regions can be distinctly separated from the surrounding environment and exhibit clear edge contours. Conversely, other methods (e.g., MANet [13], GCDNet [15], and VM-UNet [45]) often misclassify these narrow regions, frequently interpreting them as low vegetation or trees. Overall, GLFFNet excels in capturing subtle inter-class differences in complex scenes, producing more coherent and accurate segmentation results.

4.4.2. Results on Potsdam

Similarly, GLFFNet was evaluated on the Potsdam dataset alongside the current state-of-the-art segmentation models, with the comparative experimental results presented in Table 2. GLFFNet demonstrates strong overall performance, achieving an mF1 score of 93.23% and an mIoU score of 87.54%, surpassing most existing state-of-the-art methods. GCDNet [15] introduced a new dot-product attention mechanism to establish long-range dependencies across feature maps with varying receptive fields, effectively enhancing global feature representation. In comparison, GLFFNet leverages VMamba to directly extract global features and provide the main branch with enhanced global information. As a result, GLFFNet achieves a notable improvement over GCDNet [15], with mF1 increasing by 1.11% and mIoU by 1.92%. Furthermore, when compared to TransUNet [27], GLFFNet delivers additional improvements, with mF1 rising by 0.37% and mIoU by 0.65%. These performance gains can be attributed to GLFFNet’s effective integration of global and local feature information. Notably, GLFFNet exhibits strong performance across all categories when sufficient sample data are available, highlighting its robustness and adaptability.
The visualization results in Figure 5 demonstrate the performance of GLFFNet on the Potsdam dataset. The results in the first and second columns demonstrate that GLFFNet can better distinguish impervious surfaces from low vegetation and trees. The third column reveals that GLFFNet achieves higher boundary precision and class differentiation in segmenting background regions within complex scenes. In contrast, other methods (e.g., MANet [13], GCDNet [15], and TransUNet [27]) exhibit noticeable misclassification in low-vegetation areas, where parts of the low vegetation are incorrectly identified as background regions. In the fourth column, GLFFNet excels in identifying cars, preserving their complete shapes with minimal interference from surrounding regions. Furthermore, GLFFNet effectively differentiates between low vegetation and trees, maintaining clear boundaries and a high degree of consistency between categories.

4.4.3. Results on LoveDA

The comparative experimental results on the LoveDA dataset are presented in Table 3. In terms of overall performance, GLFFNet achieves an mF1 score of 70.07% and an mIoU score of 54.73%, surpassing the majority of existing models. Compared to the single-encoder model VM-UNet [45] based on Mamba, GLFFNet achieves significant improvements, with an increase of 3.36% in mF1 and 3.84% in mIoU, while consistently delivering strong segmentation performance across all categories. These enhancements are attributed to GLFFNet’s dual-branch architecture, which effectively incorporates both global and local information. Compared to RS3Mamba [32], which also adopts a dual-branch structure, GLFFNet demonstrates stronger performance in the segmentation of buildings, barren, forests, and agricultural areas. However, the performance of GLFFNet is relatively weaker in the segmentation of road and water. This is likely due to the elongated and irregular shapes of these features, which challenge the network’s ability to effectively extract line-based and edge-specific characteristics. Overall, GLFFNet delivers better segmentation performance than most models, showcasing its advantages in general segmentation tasks.
The visualization results in Figure 6 showcase the performance of GLFFNet on the LoveDA dataset. The first three columns depict segmentation results in urban scenes, while the last three columns represent results in rural scenes. The black boxes in the first, third, and fourth columns highlight GLFFNet’s outstanding performance in segmenting the building category. Specifically, GLFFNet accurately identifies buildings and captures the shapes and boundaries of buildings with high precision. The second column emphasizes GLFFNet’s excellent performance in the forest category. In comparison to other methods (e.g., GCDNet [15], Swin-UNet [22], and VM-UNet [45]), which often produce redundant or irregular edges in forest segmentation, GLFFNet delivers more coherent results, effectively mitigating fragmentation and discontinuity. Similarly, the black boxes in the fifth and sixth columns underline GLFFNet’s superior segmentation performance in the agricultural category. In scenarios with complex agricultural distributions, MANet [13] and Swin-UNet [22] often misclassify parts of agricultural regions as background or other categories. Conversely, GLFFNet can accurately detect scattered agricultural regions, demonstrating its robustness and adaptability in diverse rural environments.

4.5. Ablation Study

To validate the effectiveness of GLFFNet, we conducted an ablation study on the Vaihingen, Potsdam, and LoveDA datasets to evaluate the contribution of each module to the overall performance. GLFFNet consists of four main modules: the main encoder ResNet, the auxiliary encoder VMamba, the MSFR module, and the SBF module. In the ablation study, we analyzed and compared the impact of each module on the segmentation performance by gradually adding each module.
Table 4 and Table 5 present the results of the ablation study on the Vaihingen and Potsdam datasets, respectively. Table 6 shows the results on the LoveDA dataset. In these tables, the symbol ↑ indicates an improvement in evaluation metrics after adding a new module. It can be observed in Table 4, Table 5 and Table 6 that progressively adding various modules to the main branch ResNet results in improvements in both mF1 and mIoU metrics, demonstrating the effectiveness of each added module.
(1) Effectiveness of Global Information: After the VMamba is introduced to provide global information in the ResNet main branch, the segmentation performance for each category significantly improves. On the Vaihingen dataset, the mF1 score increases by 2.96% and the mIoU score by 4.49%. On the Potsdam dataset, the overall mF1 increases by 2.42% and the mIoU by 4.09%. On the LoveDA dataset, the overall mF1 increases by 5.87% and the mIoU by 6.37%. These results validate the importance of global information in segmentation tasks.
(2) Effectiveness of MSFR Module: After further introducing the MSFR module to refine and enhance global features at multiple scales, the segmentation performance of all categories improves to varying degrees. Notably, the performance improvement for the car category is the most significant in the Vaihingen and Potsdam datasets. This indicates that the inclusion of the MSFR module benefits segmentation tasks across various categories, with a particularly strong impact on the segmentation of small objects.
(3) Effectiveness of SBF Module: After further introducing the SBF module to the network, the overall performance on the Vaihingen dataset improves. Specifically, the segmentation accuracy for the impervious surfaces, buildings, trees, and cars categories shows effective improvement. On the Potsdam dataset, improvements are observed in the impervious surface, building, low vegetation, and tree categories, with an overall mF1 increase of 0.09% and an mIoU increase of 0.16%. On the LoveDA dataset, the inclusion of the SBF module leads to notable improvements across multiple categories, particularly in the building, barren, forest, and agricultural classes. The overall mF1 increases by 0.69%, reaching 70.07%, while the mIoU improves by 0.64%, reaching 54.73%. These results demonstrate the effectiveness of the SBF module.
Figure 7, Figure 8 and Figure 9 visually demonstrate the impact of each module on segmentation. As each module is progressively introduced, object contours become clearer and more complete, segmentation of small objects becomes more precise, and mutual interference between categories gradually decreases on the Vaihingen and Potsdam datasets. The ablation study results on the LoveDA dataset demonstrate that the introduction of each module significantly enhances the segmentation performance of the model. The VMamba module effectively mitigates boundary blurriness in categories such as buildings, roads, and forests, ensuring smoother transitions between different classes. The MSFR module strengthens multi-scale feature extraction, improving segmentation stability in complex scenes and reducing misclassification in small objects and boundary regions. The SBF module further refines the boundaries between forests and farmland, while also enhancing the shape and boundary details of barren land. Overall, these modules collectively improve the model’s class differentiation, boundary clarity, and detail preservation capabilities, resulting in more stable segmentation outcomes that closely align with the ground-truth labels.

4.6. Model Complexity Analysis

The model complexity of GLFFNet was evaluated using floating-point operation counts (FLOPs) and model parameters. FLOPs assess the computational complexity of the model, while parameters reflect its size. The training time of the model was represented by the average training time (ATT) per epoch, which was used to evaluate the model’s training efficiency. An optimal model would have lower FLOPs, fewer parameters, and shorter training time. It is worth noting that training time is influenced by various factors, such as model complexity, hardware resource utilization efficiency, optimization algorithms, and hyper-parameter settings.
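The snippet below shows one way such FLOPs and parameter counts can be obtained, using the third-party thop package; the profiling tool, the hypothetical GLFFNet constructor, and the class count are assumptions (the paper does not state how the numbers in Table 7 were produced). Note that thop reports multiply-accumulate operations, which are commonly quoted as FLOPs.

```python
import torch
from thop import profile  # pip install thop

model = GLFFNet(num_classes=6)        # hypothetical constructor, for illustration only
dummy = torch.randn(2, 3, 512, 512)   # two 512 x 512 RGB images, matching the setting above
macs, params = profile(model, inputs=(dummy,))
print(f"FLOPs (MACs): {macs / 1e9:.2f} G, Params: {params / 1e6:.2f} M")
```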
Table 7 shows the complexity analysis results for all the methods compared in this paper, where values in bold represent the top-performing metrics in the table. Notably, the FLOPs and parameter values for all methods were calculated based on two 512 × 512 images using a single NVIDIA GeForce RTX 2080 Ti GPU. The results in Table 7 indicate that, compared to methods based on a single model, GLFFNet employs a parallel dual-branch structure, which increases the number of parameters, but achieves lower FLOPs than MANet and GCDNet. Compared to the lightweight model UNetFormer, although the FLOPs, parameter count, and training time are higher, the segmentation performance is significantly improved. Compared to RS3Mamba, GLFFNet introduces additional attention modules and feature fusion modules, leading to higher FLOPs and parameter counts. However, compared to TransUNet, which extracts features based on a Transformer architecture, GLFFNet significantly reduces the FLOPs and the number of parameters while still maintaining excellent segmentation performance. Overall, GLFFNet strikes a better balance between computational complexity and segmentation performance, highlighting its notable advantages.

5. Conclusions

In this paper, we propose a global–local feature fusion network (GLFFNet) for the semantic segmentation of HR2S images, mitigating two limitations of existing hybrid models: the loss of detailed information during feature extraction and the insufficient fusion of global and local features. Specifically, GLFFNet adopts a dual-branch architecture, in which the auxiliary branch extracts global features through a VMamba encoder to provide global information for the CNN-based main branch. In addition, by introducing the MSFR module, we effectively reduce the loss of detailed information when extracting global features. This module performs cross-layer aggregation after enhancing the global features of two adjacent layers, resulting in a more comprehensive global feature representation that combines local details and global semantics. To avoid noise and the loss of key information when fusing features, an SBF module is proposed. The SBF module performs adaptive fusion after narrowing the semantic differences between global and local features, achieving efficient fusion of the two. Comparative experiments and ablation studies on three public datasets fully demonstrate the effectiveness of GLFFNet. Although GLFFNet significantly improves the accuracy of semantic segmentation, it is still limited in segmenting certain objects with elongated and irregular shapes. Therefore, we will focus on addressing these limitations in future work. Additionally, we will continue to explore the potential of hybrid models for semantic segmentation tasks in HR2S images.

Author Contributions

Conceptualization, S.Z. and L.Z.; methodology, S.Z.; validation, Q.X. and J.D.; investigation, S.Z.; resources, L.Z. and X.L.; data curation, S.Z.; writing—original draft preparation, S.Z.; writing—review and editing, L.Z., Q.X. and J.D.; visualization, S.Z. and J.D.; funding acquisition, L.Z. and X.L. All authors contributed to the conception and design of the study. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant No. 62171404, and in part by the National Key Research and Development Program of China under Grant No. 2024YFF1400900.

Data Availability Statement

In this paper, we used public datasets for our research. The ISPRS Vaihingen and Potsdam datasets can be accessed at https://www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx (accessed on 1 January 2024), while the LoveDA dataset is available at https://github.com/Junjue-Wang/LoveDA (accessed on 1 January 2024). The source code for this article can be obtained from the corresponding author.

Acknowledgments

All authors sincerely thank the reviewers and editors for their suggestions and opinions for improving this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. State Space Models

State Space Models (SSMs) are a classical mathematical approach used to represent the dynamic evolution of a system over time [53]. The core of SSMs lies in representing the behavior of a system through a set of hidden variables, referred to as states, enabling the effective capture of temporal data dependencies. To be specific, the classical State Space Model employs the state equation and observation equation to simulate the relationship between the input $x(t) \in \mathbb{R}^L$ and the output $y(t) \in \mathbb{R}^L$ at the current time $t$ through an $N$-dimensional hidden state $h(t) \in \mathbb{R}^N$. This process can be represented using linear ordinary differential equations as follows:

$$h'(t) = A h(t) + B x(t)$$

$$y(t) = C h(t)$$

where $h'(t)$ represents the derivative of the current state $h(t)$, $A \in \mathbb{R}^{N \times N}$ is the state transition matrix, $B \in \mathbb{R}^{N \times 1}$ is the projection matrix that controls how the input affects the state change, and $C \in \mathbb{R}^{1 \times N}$ represents the projection matrix that generates the output result based on the current state.
To accommodate the requirements of diverse machine learning settings, SSMs need to convert continuous parameters into discrete parameters. Discretization methods typically aim to divide continuous time into discrete intervals, ensuring that the integral areas are as equal as possible. Zero-Order Hold (ZOH) [54] is the most representative discretization method successfully applied in SSMs, which assumes that the function value is constant over the interval $\Delta = [t_{k-1}, t_k]$. After ZOH discretization, the discrete parameters $\bar{A}$ and $\bar{B}$ can be expressed as follows:

$$\bar{A} = \exp(\Delta A)$$

$$\bar{B} = (\Delta A)^{-1} \left( \exp(\Delta A) - I \right) \cdot \Delta B$$

After discretization, the SSM equations can be rewritten as

$$h_k = \bar{A} h_{k-1} + \bar{B} x_k$$

$$y_k = C h_k$$
where k represents the discrete time step. It is worth noting that the discrete SSM shares a similar structure with recurrent neural networks, enabling it to efficiently perform the inference process.
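To make the recurrent view concrete, the sketch below runs the discretized SSM step by step for a single-input, single-output channel; the NumPy formulation and scalar inputs are assumptions made for illustration.

```python
import numpy as np

def ssm_recurrence(A_bar, B_bar, C, x):
    """A_bar: (N, N), B_bar: (N, 1), C: (1, N), x: (L,) -> y: (L,). Assumes h_{-1} = 0."""
    h = np.zeros((A_bar.shape[0], 1))
    y = np.zeros(len(x))
    for k, x_k in enumerate(x):
        h = A_bar @ h + B_bar * x_k   # h_k = A_bar h_{k-1} + B_bar x_k
        y[k] = (C @ h).item()         # y_k = C h_k
    return y
```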
The discrete SSM can calculate the output of each time step independently, so convolution can be used for the computation. For simplicity, let the initial state be $h_{-1} = 0$; unrolling the discrete SSM equations explicitly yields the following:

$$y_0 = C \bar{A}^0 \bar{B} x_0$$

$$y_1 = C \bar{A}^1 \bar{B} x_0 + C \bar{A}^0 \bar{B} x_1$$

$$y_2 = C \bar{A}^2 \bar{B} x_0 + C \bar{A}^1 \bar{B} x_1 + C \bar{A}^0 \bar{B} x_2$$

$$\vdots$$

$$y_k = C \bar{A}^k \bar{B} x_0 + C \bar{A}^{k-1} \bar{B} x_1 + \cdots + C \bar{A}^1 \bar{B} x_{k-1} + C \bar{A}^0 \bar{B} x_k$$

By creating a set of convolutional kernels $\bar{K} = \left( C \bar{B}, C \bar{A} \bar{B}, \ldots, C \bar{A}^k \bar{B} \right)$, the above computation can be transformed into a convolutional form, as shown below:

$$y = x * \bar{K}$$

where $x = [x_0, x_1, \ldots] \in \mathbb{R}^L$ denotes the input sequence and $y = [y_0, y_1, \ldots] \in \mathbb{R}^L$ denotes the output sequence, with $L$ representing the sequence length. Convolutional computation enables SSMs to perform parallel computation during training. In traditional SSMs, the matrices $A$, $B$, $C$, and the interval $\Delta$ are independent of the model input $x$, which limits the model’s ability to capture contextual dependencies [31].
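The following sketch builds the kernel $\bar{K}$ explicitly and evaluates the convolutional form, which reproduces the recurrence above; materializing the full length-$L$ kernel is done here purely for illustration.

```python
import numpy as np

def ssm_convolution(A_bar, B_bar, C, x):
    """Compute y = x * K_bar with K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar)."""
    L = len(x)
    kernel = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item() for k in range(L)])
    return np.convolve(x, kernel)[:L]   # causal convolution: y_k = sum_{j<=k} K_bar[j] x_{k-j}

# The two views agree (up to floating-point error):
# np.allclose(ssm_convolution(A_bar, B_bar, C, x), ssm_recurrence(A_bar, B_bar, C, x))
```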

Appendix A.2. Mamba

To address the limited context-aware modeling capabilities of traditional SSMs, Gu et al. [31] designed a selective State Space Model (Mamba). Mamba leverages innovative techniques such as memory initialization based on High-order Polynomial Projection Operator (HiPPO) [55], selection mechanisms, and hardware-aware computing to effectively enhance the capabilities of SSMs in modeling long-range linear time sequences [31]. By initializing the hidden state matrix A using the HiPPO theory, Mamba can learn long-range-dependent memories. The selection mechanism improves the context awareness of SSMs, and hardware-aware algorithms contribute to increased training efficiency.
Traditional SSMs cannot generate personalized outputs based on the inputs. To enhance the content-aware modeling capabilities of SSMs, Mamba introduces a time-varying selection mechanism that parameterizes the weight matrices based on the model’s inputs. Specifically, parameters B, C, Δ are dependent on the input sequence x, as shown below:
$$B = s_B(x)$$

$$C = s_C(x)$$

$$\Delta = \tau_{\Delta}\left( \mathrm{Parameter} + s_{\Delta}(x) \right)$$

where $s_B(x) = \mathrm{Linear}_N(x)$ and $s_C(x) = \mathrm{Linear}_N(x)$ both map the input to an $N$-dimensional space, and $s_{\Delta}(x) = \mathrm{Broadcast}_D\left( \mathrm{Linear}_1(x) \right)$ first projects the input to a 1-dimensional space and subsequently broadcasts it to a dimension of $D$. $\tau_{\Delta}$ represents the $\mathrm{softplus}$ function. By performing the above operations, the selective SSMs gain content-aware capabilities.
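A minimal sketch of this input-dependent parameterization is shown below; the module and parameter names are illustrative, and the learned bias delta_param stands in for the "Parameter" term of the formula above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveParams(nn.Module):
    """Produce input-dependent B, C, and Delta for a selective SSM (illustrative sketch)."""
    def __init__(self, d_model, d_state):
        super().__init__()
        self.s_B = nn.Linear(d_model, d_state)      # s_B(x) = Linear_N(x)
        self.s_C = nn.Linear(d_model, d_state)      # s_C(x) = Linear_N(x)
        self.s_delta = nn.Linear(d_model, 1)        # Linear_1(x), broadcast to dimension D below
        self.delta_param = nn.Parameter(torch.zeros(d_model))   # the learned "Parameter" term

    def forward(self, x):                           # x: (batch, length, d_model)
        B = self.s_B(x)                             # (batch, length, d_state)
        C = self.s_C(x)                             # (batch, length, d_state)
        delta = F.softplus(self.delta_param + self.s_delta(x).expand(-1, -1, x.size(-1)))
        return B, C, delta                          # delta: (batch, length, d_model), positive
```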
As the convolution kernels in SSMs become dependent on the input, traditional RNN and convolution operations can no longer be applied for computation. To solve this problem, Mamba uses a hardware-aware algorithm to effectively calculate the selective SSM. Three classical techniques are utilized by the hardware-aware algorithm: parallel scan, kernel fusion, and recomputation [31].

References

  1. Wang, X.; Wang, H.; Jing, Y.; Yang, X.; Chu, J. A Bio-Inspired Visual Perception Transformer for Cross-Domain Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 1514. [Google Scholar] [CrossRef]
  2. Alganci, U.; Soydas, M.; Sertel, E. Comparative research on deep learning approaches for airplane detection from very high-resolution satellite images. Remote Sens. 2020, 12, 458. [Google Scholar] [CrossRef]
  3. Zhou, G.; Chen, W.; Gui, Q.; Li, X.; Wang, L. Split depth-wise separable graph-convolution network for road extraction in complex environments from high-resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5614115. [Google Scholar] [CrossRef]
  4. Li, Y.; Zhou, Y.; Zhang, Y.; Zhong, L.; Wang, J.; Chen, J. DKDFN: Domain knowledge-guided deep collaborative fusion network for multimodal unitemporal remote sensing land cover classification. ISPRS J. Photogramm. Remote Sens. 2022, 186, 170–189. [Google Scholar] [CrossRef]
  5. Peng, D.; Bruzzone, L.; Zhang, Y.; Guan, H.; Ding, H.; Huang, X. SemiCDNet: A semisupervised convolutional neural network for change detection in high resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5891–5906. [Google Scholar] [CrossRef]
  6. Garcia-Garcia, A.; Orts-Escolano, S.; Oprea, S.; Villena-Martinez, V.; Garcia-Rodriguez, J. A review on deep learning techniques applied to semantic segmentation. arXiv 2017, arXiv:1704.06857. [Google Scholar]
  7. Li, X.; Lei, L.; Kuang, G. Multilevel adaptive-scale context aggregating network for semantic segmentation in high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 6003805. [Google Scholar] [CrossRef]
  8. Tao, W.B.; Tian, J.W.; Liu, J. Image segmentation by three-level thresholding based on maximum fuzzy entropy and genetic algorithm. Pattern Recognit. Lett. 2003, 24, 3069–3078. [Google Scholar] [CrossRef]
  9. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  11. Wang, H.; Qiao, L.; Li, H.; Li, X.; Li, J.; Cao, T.; Zhang, C. Remote sensing image semantic segmentation method based on small target and edge feature enhancement. J. Appl. Remote Sens. 2023, 17, 044503. [Google Scholar] [CrossRef]
  12. Su, Z.; Li, W.; Ma, Z.; Gao, R. An improved U-Net method for the semantic segmentation of remote sensing images. Appl. Intell. 2022, 52, 3276–3288. [Google Scholar] [CrossRef]
  13. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5607713. [Google Scholar] [CrossRef]
  14. Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8009205. [Google Scholar] [CrossRef]
  15. Cui, J.; Liu, J.; Wang, J.; Ni, Y. Global Context Dependencies Aware Network for Efficient Semantic Segmentation of Fine-Resolution Remoted Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 2505205. [Google Scholar] [CrossRef]
  16. He, X.; Wang, Z.; Bai, L.; Fan, M.; Chen, Y.; Chen, L. Attention-Enhanced Urban Fugitive Dust Source Segmentation in High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 3772. [Google Scholar] [CrossRef]
  17. Zhao, D.; Wang, C.; Gao, Y.; Shi, Z.; Xie, F. Semantic segmentation of remote sensing image based on regional self-attention mechanism. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8010305. [Google Scholar] [CrossRef]
  18. Song, W.; Zhou, X.; Zhang, S.; Wu, Y.; Zhang, P. GLF-Net: A Semantic Segmentation Model Fusing Global and Local Features for High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 4649. [Google Scholar] [CrossRef]
  19. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  20. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  21. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6506105. [Google Scholar] [CrossRef]
  22. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 205–218. [Google Scholar]
  23. Sun, Y.; Wang, M.; Huang, X.; Xin, C.; Sun, Y. Fast Semantic Segmentation of Ultra-High-Resolution Remote Sensing Images via Score Map and Fast Transformer-Based Fusion. Remote Sens. 2024, 16, 3248. [Google Scholar] [CrossRef]
  24. Wang, H.; Li, X.; Huo, L.; Hu, C. Global and edge enhanced transformer for semantic segmentation of remote sensing. Appl. Intell. 2024, 54, 5658–5673. [Google Scholar] [CrossRef]
  25. Song, W.; Nie, F.; Wang, C.; Jiang, Y.; Wu, Y. Unsupervised Multi-Scale Hybrid Feature Extraction Network for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 3774. [Google Scholar] [CrossRef]
  26. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  27. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  28. Wu, H.; Huang, P.; Zhang, M.; Tang, W. CTFNet: CNN-Transformer Fusion Network for Remote-Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5000305. [Google Scholar] [CrossRef]
29. Weng, L.; Pang, K.; Xia, M.; Lin, H.; Qian, M.; Zhu, C. Sgformer: A Local and Global Features Coupling Network for Semantic Segmentation of Land Cover. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 6812–6824. [Google Scholar] [CrossRef]
30. Wang, Z.; Xia, M.; Weng, L.; Hu, K.; Lin, H. Dual Encoder–Decoder Network for Land Cover Segmentation of Remote Sensing Image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 2372–2385. [Google Scholar] [CrossRef]
  31. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  32. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405. [Google Scholar] [CrossRef]
  33. Liu, M.; Dan, J.; Lu, Z.; Yu, Y.; Li, Y.; Li, X. CM-UNet: Hybrid CNN-Mamba UNet for Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2405.10530. [Google Scholar]
  34. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  35. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  36. Yang, C.; Chen, Z.; Espinosa, M.; Ericsson, L.; Wang, Z.; Liu, J.; Crowley, E.J. Plainmamba: Improving non-hierarchical mamba in visual recognition. arXiv 2024, arXiv:2403.17695. [Google Scholar]
  37. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. Localmamba: Visual state space model with windowed selective scan. arXiv 2024, arXiv:2403.09338. [Google Scholar]
  38. Zhang, Q.; Geng, G.; Zhou, P.; Liu, Q.; Wang, Y.; Kang, L. Link Aggregation for Skip Connection–Mamba: Remote Sensing Image Segmentation Network Based on Link Aggregation Mamba. Remote Sens. 2024, 16, 3622. [Google Scholar] [CrossRef]
  39. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A novel mamba architecture with a semantic transformer for efficient real-time remote sensing semantic segmentation. Remote Sens. 2024, 16, 2620. [Google Scholar] [CrossRef]
  40. Zhu, E.; Chen, Z.; Wang, D.; Shi, H.; Liu, X.; Wang, L. UNetMamba: An Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6001205. [Google Scholar] [CrossRef]
  41. Tsai, T.Y.; Lin, L.; Hu, S.; Chang, M.C.; Zhu, H.; Wang, X. UU-Mamba: Uncertainty-aware U-Mamba for Cardiac Image Segmentation. In Proceedings of the 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), San Jose, CA, USA, 7–9 August 2024; pp. 267–273. [Google Scholar] [CrossRef]
  42. Xu, Z.; Tang, F.; Chen, Z.; Zhou, Z.; Wu, W.; Yang, Y.; Liang, Y.; Jiang, J.; Cai, X.; Su, J. Polyp-mamba: Polyp segmentation with visual mamba. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; pp. 510–521. [Google Scholar]
  43. Wang, J.; Chen, J.; Chen, D.; Wu, J. LKM-UNet: Large Kernel Vision Mamba UNet for Medical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; pp. 360–370. [Google Scholar]
  44. Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
  45. Ruan, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv 2024, arXiv:2402.02491. [Google Scholar]
  46. Liu, J.; Yang, H.; Zhou, H.Y.; Xi, Y.; Yu, L.; Li, C.; Liang, Y.; Shi, G.; Yu, Y.; Zhang, S.; et al. Swin-umamba: Mamba-based unet with imagenet-based pretraining. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Marrakesh, Morocco, 6–10 October 2024; pp. 615–625. [Google Scholar]
  47. Liao, W.; Zhu, Y.; Wang, X.; Pan, C.; Wang, Y.; Ma, L. Lightm-unet: Mamba assists in lightweight unet for medical image segmentation. arXiv 2024, arXiv:2403.05246. [Google Scholar]
  48. Zhu, Q.; Cai, Y.; Fang, Y.; Yang, Y.; Chen, C.; Fan, L.; Nguyen, A. Samba: Semantic segmentation of remotely sensed images with state space model. Heliyon 2024, 10, e38495. [Google Scholar] [CrossRef]
  49. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote Sensing Image Classification With State Space Model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605. [Google Scholar] [CrossRef]
  50. Hu, Y.; Ma, X.; Sui, J.; Pun, M.O. PPMamba: A Pyramid Pooling Local Auxiliary SSM-Based Model for Remote Sensing Image Semantic Segmentation. arXiv 2024, arXiv:2409.06309. [Google Scholar]
  51. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 3560–3569. [Google Scholar] [CrossRef]
  52. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733. [Google Scholar]
  53. Qu, H.; Ning, L.; An, R.; Fan, W.; Derr, T.; Liu, H.; Xu, X.; Li, Q. A survey of mamba. arXiv 2024, arXiv:2408.01129. [Google Scholar]
  54. Pechlivanidou, G.; Karampetakis, N. Zero-order hold discretization of general state space systems with input delay. IMA J. Math. Control. Inf. 2022, 39, 708–730. [Google Scholar] [CrossRef]
55. Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. Hippo: Recurrent memory with optimal polynomial projections. Adv. Neural Inf. Process. Syst. 2020, 33, 1474–1487. [Google Scholar]
Figure 1. The overall architecture of GLFFNet.
Figure 2. The detailed architecture of the MSFR module.
Figure 3. The detailed architecture of the SBF module.
Figure 4. Visualization results of different models on the Vaihingen dataset.
Figure 5. Visualization results of different models on the Potsdam dataset.
Figure 6. Visualization results of different models on the LoveDA dataset.
Figure 7. Visualization results of the ablation study on the Vaihingen dataset. (a) Images. (b) Ground truth. (c) ResNet. (d) ResNet + VMamba. (e) ResNet + VMamba + MSFR. (f) ResNet + VMamba + MSFR + SBF.
Figure 8. Visualization results of the ablation study on the Potsdam dataset. (a) Images. (b) Ground truth. (c) ResNet. (d) ResNet + VMamba. (e) ResNet + VMamba + MSFR. (f) ResNet + VMamba + MSFR + SBF.
Figure 9. Visualization results of the ablation study on the LoveDA dataset. (a) Images. (b) Ground truth. (c) ResNet. (d) ResNet + VMamba. (e) ResNet + VMamba + MSFR. (f) ResNet + VMamba + MSFR + SBF.
Table 1. Segmentation results of different models on the Vaihingen dataset. The accuracy of each category is presented in the format of F1/IoU (%).

Method | Imp. Surf. | Building | Low Veg. | Tree | Car | mF1 (%) | mIoU (%)
MANet [13] | 96.28/92.82 | 94.39/89.37 | 83.38/71.50 | 89.72/81.36 | 86.67/76.48 | 90.09 | 82.31
MAResU-Net [14] | 96.59/93.42 | 94.86/90.23 | 84.22/72.74 | 89.67/81.27 | 84.73/73.51 | 90.01 | 82.23
GCDNet [15] | 95.98/92.27 | 93.79/88.32 | 83.25/71.30 | 89.44/80.90 | 86.41/76.08 | 89.77 | 81.77
Swin-Unet [22] | 94.57/89.70 | 93.93/88.56 | 83.04/71.00 | 89.32/80.71 | 83.45/71.60 | 88.86 | 80.32
VM-UNet [45] | 96.58/93.39 | 94.74/90.01 | 83.66/71.92 | 89.63/81.21 | 84.97/73.88 | 89.92 | 82.08
UNetFormer [26] | 96.64/93.51 | 95.53/91.44 | 83.03/70.98 | 89.16/80.45 | 87.05/77.08 | 90.28 | 82.69
TransUNet [27] | 95.39/91.18 | 95.66/91.68 | 84.71/73.47 | 90.00/81.81 | 87.19/77.29 | 90.59 | 83.09
RS3Mamba [32] | 95.15/90.76 | 94.49/89.56 | 83.79/72.11 | 89.29/80.65 | 86.07/75.55 | 89.76 | 81.72
GLFFNet (Ours) | 96.84/93.87 | 95.71/91.78 | 84.35/72.95 | 89.89/81.65 | 88.75/79.78 | 91.11 | 84.01
The values in bold represent the top-performing metrics in the table.
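For clarity on how the per-class F1/IoU values and the mF1/mIoU summaries in Tables 1–3 are typically obtained, the minimal sketch below derives them from an aggregated confusion matrix. This is a standard computation rather than the authors' released code; the function name and the toy confusion matrix are illustrative assumptions.

```python
import numpy as np

def per_class_f1_iou(conf: np.ndarray):
    """Compute per-class F1 and IoU from a KxK confusion matrix.

    conf[i, j] = number of pixels with ground-truth class i predicted as class j,
    aggregated over the whole test set.
    """
    tp = np.diag(conf).astype(float)        # correctly classified pixels per class
    fp = conf.sum(axis=0) - tp              # pixels wrongly predicted as this class
    fn = conf.sum(axis=1) - tp              # pixels of this class that were missed
    eps = 1e-10                             # guard against division by zero

    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return f1, iou

# Toy 3-class example; real evaluations aggregate over all test tiles.
conf = np.array([[50, 2, 3],
                 [4, 40, 1],
                 [2, 2, 30]])
f1, iou = per_class_f1_iou(conf)
print("per-class F1 (%):", np.round(100 * f1, 2))
print("per-class IoU (%):", np.round(100 * iou, 2))
print(f"mF1 (%): {100 * f1.mean():.2f}, mIoU (%): {100 * iou.mean():.2f}")
```

The mF1 and mIoU columns in the tables are the unweighted means of the per-class scores computed this way.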
Table 2. Segmentation results of different models on the Potsdam dataset. The accuracy of each category is presented in the format of F1/IoU (%).

Method | Imp. Surf. | Building | Low Veg. | Tree | Car | mF1 (%) | mIoU (%)
MANet [13] | 92.97/86.87 | 95.59/91.56 | 86.09/75.58 | 87.61/77.95 | 95.65/91.67 | 91.58 | 84.73
MAResU-Net [14] | 93.19/87.26 | 96.20/92.69 | 86.47/76.16 | 87.88/78.38 | 95.68/91.73 | 91.88 | 85.24
GCDNet [15] | 93.56/87.90 | 95.96/92.23 | 86.83/76.73 | 88.51/79.40 | 95.76/91.86 | 92.12 | 85.62
Swin-Unet [22] | 93.00/86.93 | 95.52/91.42 | 86.11/75.61 | 87.63/77.98 | 94.30/89.21 | 91.31 | 84.23
VM-UNet [45] | 93.77/88.28 | 96.21/92.70 | 86.70/76.53 | 87.42/77.65 | 95.27/90.97 | 91.88 | 85.23
UNetFormer [26] | 93.77/88.28 | 96.48/93.20 | 86.61/76.38 | 88.09/78.72 | 95.81/91.96 | 92.15 | 85.71
TransUNet [27] | 94.29/89.20 | 96.78/93.77 | 88.01/78.59 | 89.08/80.31 | 96.16/92.60 | 92.86 | 86.89
RS3Mamba [32] | 93.83/88.38 | 96.83/93.86 | 87.61/77.96 | 88.39/79.20 | 95.56/91.50 | 92.45 | 86.18
GLFFNet (Ours) | 94.47/89.52 | 97.30/94.75 | 88.46/79.32 | 89.53/81.05 | 96.39/93.04 | 93.23 | 87.54
The values in bold represent the top-performing metrics in the table.
Table 3. Segmentation results of different models on the LoveDA dataset. The accuracy of each category is presented in the format of F1/IoU (%).

Method | Background | Building | Road | Water | Barren | Forest | Agricultural | mF1 (%) | mIoU (%)
MANet [13] | 69.75/53.55 | 76.19/61.54 | 68.02/51.54 | 79.31/65.72 | 42.95/27.35 | 58.26/41.10 | 70.69/54.67 | 66.45 | 50.78
MAResU-Net [14] | 69.83/53.65 | 76.65/62.15 | 70.29/54.19 | 77.76/63.62 | 49.72/33.09 | 59.85/42.71 | 71.01/55.05 | 67.87 | 52.06
GCDNet [15] | 67.88/51.38 | 70.95/54.97 | 70.49/54.43 | 77.44/63.19 | 41.23/25.97 | 55.23/38.15 | 66.28/49.57 | 64.21 | 48.24
Swin-Unet [22] | 68.92/52.58 | 72.70/57.11 | 67.32/50.73 | 80.46/67.31 | 47.52/31.16 | 60.13/42.99 | 65.81/49.04 | 66.12 | 50.13
VM-UNet [45] | 69.42/53.17 | 75.63/60.82 | 69.69/53.49 | 80.68/67.62 | 50.75/34.00 | 53.92/36.91 | 66.87/50.23 | 66.71 | 50.89
UNetFormer [26] | 66.75/50.09 | 76.17/61.51 | 71.43/55.56 | 80.77/67.75 | 46.69/30.46 | 59.74/42.59 | 67.77/51.25 | 67.05 | 51.32
TransUNet [27] | 70.16/54.04 | 77.79/63.65 | 72.45/56.80 | 80.57/67.46 | 51.17/34.38 | 58.50/41.34 | 72.24/56.55 | 68.98 | 53.46
RS3Mamba [32] | 70.42/54.35 | 78.86/65.10 | 71.94/56.18 | 82.54/70.27 | 50.30/33.60 | 57.03/39.89 | 75.16/60.20 | 69.46 | 54.23
GLFFNet (Ours) | 68.27/51.83 | 79.01/65.31 | 71.28/55.37 | 80.41/67.24 | 53.14/36.19 | 60.33/43.19 | 78.03/63.98 | 70.07 | 54.73
The values in bold represent the top-performing metrics in the table.
Table 4. Ablation study on the Vaihingen dataset. The accuracy of each category is presented in the format of IoU (%).

ResNet | VMamba | MSFR | SBF | Imp. Surf. | Building | Low Veg. | Tree | Car | mF1 (%) | mIoU (%)
✓ | | | | 92.01 | 87.30 | 70.33 | 79.55 | 62.47 | 87.43 | 78.33
✓ | ✓ | | | 93.52 | 90.51 | 72.34 | 80.46 | 77.27 | 90.39 (↑2.96) | 82.82 (↑4.49)
✓ | ✓ | ✓ | | 93.66 | 90.85 | 73.25 | 81.37 | 78.86 | 90.88 (↑0.49) | 83.60 (↑0.78)
✓ | ✓ | ✓ | ✓ | 93.87 | 91.78 | 72.95 | 81.65 | 79.78 | 91.11 (↑0.23) | 84.01 (↑0.41)
The symbol ↑ indicates an improvement in evaluation metrics after adding a new module. The values in bold represent the top-performing metrics in the table.
Table 5. Ablation study on the Potsdam dataset. The accuracy of each category is presented in the format of IoU (%).

ResNet | VMamba | MSFR | SBF | Imp. Surf. | Building | Low Veg. | Tree | Car | mF1 (%) | mIoU (%)
✓ | | | | 86.38 | 90.14 | 74.06 | 75.33 | 89.61 | 90.61 | 83.10
✓ | ✓ | | | 89.23 | 94.38 | 78.70 | 80.96 | 92.68 | 93.03 (↑2.42) | 87.19 (↑4.09)
✓ | ✓ | ✓ | | 89.27 | 94.48 | 79.08 | 80.96 | 93.11 | 93.14 (↑0.11) | 87.38 (↑0.19)
✓ | ✓ | ✓ | ✓ | 89.52 | 94.75 | 79.32 | 81.05 | 93.04 | 93.23 (↑0.09) | 87.54 (↑0.16)
The symbol ↑ indicates an improvement in evaluation metrics after adding a new module. The values in bold represent the top-performing metrics in the table.
Table 6. Ablation study on the LoveDA dataset. The accuracy of each category is presented in the format of IoU (%).

ResNet | VMamba | MSFR | SBF | Background | Building | Road | Water | Barren | Forest | Agricultural | mF1 (%) | mIoU (%)
✓ | | | | 50.57 | 56.62 | 52.38 | 62.33 | 21.78 | 40.86 | 46.02 | 63.12 | 47.22
✓ | ✓ | | | 53.80 | 53.80 | 57.14 | 69.69 | 36.15 | 36.97 | 57.88 | 68.99 (↑5.87) | 53.59 (↑6.37)
✓ | ✓ | ✓ | | 55.50 | 64.38 | 51.99 | 69.97 | 34.17 | 40.45 | 62.18 | 69.38 (↑0.39) | 54.09 (↑0.50)
✓ | ✓ | ✓ | ✓ | 51.83 | 65.31 | 55.37 | 67.24 | 36.19 | 43.19 | 63.98 | 70.07 (↑0.69) | 54.73 (↑0.64)
The symbol ↑ indicates an improvement in evaluation metrics after adding a new module. The values in bold represent the top-performing metrics in the table.
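The ablation rows in Tables 4–6 add one component at a time to the ResNet baseline. The exact module interfaces are not given in this back matter, so the sketch below is only a hypothetical illustration of how such variants are commonly wired up, with the auxiliary VMamba branch, MSFR, and SBF replaced by no-op fallbacks when disabled; all class names and attributes here are invented placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AblationSegmenter(nn.Module):
    """Hypothetical ablation wrapper: each disabled component falls back to a no-op."""

    def __init__(self, main_branch, aux_branch=None, msfr=None, sbf=None, head=None):
        super().__init__()
        self.main_branch = main_branch        # e.g., a ResNet encoder (always present)
        self.aux_branch = aux_branch          # e.g., a VMamba encoder, or None
        self.msfr = msfr or nn.Identity()     # refinement disabled -> pass-through
        self.sbf = sbf                        # fusion disabled -> element-wise sum
        self.head = head                      # decoder / segmentation head

    def forward(self, x):
        local_feat = self.main_branch(x)
        if self.aux_branch is None:           # ResNet-only baseline
            fused = local_feat
        else:
            global_feat = self.msfr(self.aux_branch(x))
            if self.sbf is None:              # fallback fusion: simple addition
                fused = local_feat + global_feat
            else:
                fused = self.sbf(local_feat, global_feat)
        return self.head(fused)

# Toy instantiation with stand-in modules, just to exercise the wiring.
main = nn.Conv2d(3, 16, 3, padding=1)
head = nn.Conv2d(16, 6, 1)                    # 6 classes as in the ISPRS benchmarks
baseline = AblationSegmenter(main_branch=main, head=head)
print(baseline(torch.randn(1, 3, 64, 64)).shape)   # torch.Size([1, 6, 64, 64])
```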
Table 7. Model complexity analysis. All results are from the Vaihingen dataset.

Method | FLOPs (G) | Parameters (M) | ATT (s) | mF1 (%) | mIoU (%)
MANet | 155.51 | 35.86 | 56.23 | 90.09 | 82.31
MAResU-Net | 70.21 | 26.28 | 39.68 | 90.01 | 82.23
GCDNet | 562.26 | 60.56 | 276.34 | 89.77 | 81.77
Swin-Unet | 62.05 | 27.15 | 61.68 | 88.86 | 80.32
VM-UNet | 32.96 | 22.04 | 132.19 | 89.92 | 82.08
UNetFormer | 23.48 | 11.68 | 46.08 | 90.28 | 82.69
TransUNet | 258.89 | 93.23 | 293.68 | 90.59 | 83.09
RS3Mamba | 126.60 | 49.66 | 300.77 | 89.76 | 81.72
GLFFNet (Ours) | 154.46 | 64.22 | 283.79 | 91.11 | 84.01
The values in bold represent the top-performing metrics in the table.
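For context on Table 7, the parameter count and an average-test-time (ATT) figure can be reproduced with plain PyTorch as sketched below. The model, the data loader, and the timing loop are placeholders and a simplified assumption about how ATT might be measured (total wall-clock inference time over the test tiles), not the authors' exact protocol; FLOPs are usually obtained with a separate profiler and are omitted here.

```python
import time
import torch
import torch.nn as nn

def count_parameters_m(model: nn.Module) -> float:
    """Trainable parameters in millions (the 'Parameters (M)' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def average_test_time_s(model: nn.Module, loader, device="cpu") -> float:
    """Wall-clock inference time over a test set, in seconds (one reading of 'ATT (s)')."""
    model.eval().to(device)
    start = time.perf_counter()
    for images in loader:
        _ = model(images.to(device))
    if device == "cuda" and torch.cuda.is_available():
        torch.cuda.synchronize()              # wait for all GPU kernels to finish
    return time.perf_counter() - start

# Toy usage with a stand-in model and random "tiles".
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 6, 1))
loader = [torch.randn(1, 3, 256, 256) for _ in range(4)]
print(f"Params: {count_parameters_m(model):.2f} M")
print(f"ATT: {average_test_time_s(model, loader):.2f} s")
```

Note that FLOPs depend on the input resolution, so the per-model values in Table 7 are only comparable because they were measured under the same test configuration.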