Article

SFANet: A Ground Object Spectral Feature Awareness Network for Multimodal Remote Sensing Image Semantic Segmentation

1
The Institute of RS and GIS, School of Earth and Space Sciences, Peking University, Beijing 100871, China
2
The School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430079, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(10), 1797; https://doi.org/10.3390/rs17101797
Submission received: 10 April 2025 / Revised: 19 May 2025 / Accepted: 20 May 2025 / Published: 21 May 2025

Abstract
The semantic segmentation of remote sensing images is vital for accurate surface monitoring and environmental assessment. Multimodal remote sensing images (RSIs) provide a more comprehensive dimension of information, enabling faster and more scientific decision-making. However, existing methods primarily focus on modality and spectral channels when utilizing spectral features, with limited consideration of their association with ground object types. This association, commonly referred to as the spectral characteristics of ground objects (SCGO), results in distinct spectral responses across different modalities and holds significant potential for improving the segmentation accuracy of multimodal RSIs. Meanwhile, the inclusion of redundant features in the fusion process can also interfere with model performance. To address these problems, a ground object spectral feature awareness network (SFANet), specifically designed for RSIs, is proposed to effectively leverage spectral features by incorporating the SCGO. SFANet includes two innovative modules: (1) the Spectral Aware Feature Fusion module, which integrates multimodal features in the encoder based on the SCGO, and (2) the Adaptive Spectral Enhancement module, which reduces the confusion caused by redundant information in the decoder. By adaptively enhancing spectral feature awareness, SFANet improves the mIoU by 5.66% and 4.76% over the baseline on two datasets and outperforms existing multimodal RSIs segmentation networks. This approach of incorporating spectral characteristics into network design offers new perspectives for RSI-specific segmentation networks.

1. Introduction

Semantic segmentation is one of the hottest research topics in the field of remote sensing, playing an important role in land cover classification [1], traffic monitoring and management [2], agricultural monitoring [3], and so on. Therefore, the automatic, precise, and efficient segmentation of remote sensing images is of great importance [4,5].
The segmentation of multimodal remote sensing images (RSIs) is a classic application in image segmentation. Multimodal remote sensing data are characterized by two principal factors—sensor specifications (e.g., imaging mechanisms and spatial resolutions) and acquisition conditions (e.g., acquisition time, viewing angles, and platforms)—which together describe a scene, whereas each modality alone captures only a limited subset of its properties [6]. In this paper, the term “multimodal remote sensing images” refers specifically to the image-based form of these multimodal remote sensing data. With the rapid development of deep learning, numerous advanced segmentation networks have been proposed for various types of images, including natural [7,8], medical [9,10], and remote sensing [11,12] images, significantly improving segmentation accuracy and efficiency [13,14]. The integration of multimodal data, such as RGB-T [15,16] and RGB-D [17,18], further enhances performance by leveraging complementary information, proving valuable in applications like urban planning [19], environmental monitoring [20], and disaster response [21]. However, RSIs segmentation poses unique challenges compared to natural scene images (NSIs), which are more widely studied. RSIs typically contain diverse information and complex targets, with larger spatial scales and exhibiting the “same object with different spectra, same spectrum from different objects” phenomenon, making the direct application of natural scene image-oriented networks suboptimal [22]. Thus, exploring RSI-specific characteristics and designing dedicated segmentation networks is of significant research value. Although advanced deep learning-based semantic segmentation methods for remote sensing images have been proposed [23,24,25], most focus on unimodal data. The rapid development of Earth observation technologies and multi-source remote sensing has transformed remote sensing data into a vital information resource [26], offering diverse observational perspectives [27], facilitating comprehensive situational analysis, and improving monitoring accuracy [28]. As multi-source remote sensing data from various sensors become the primary input for RSI semantic segmentation, their integration is crucial for advancing this field [29,30,31]. The mainstream approach of multimodal RSIs segmentation is to extract features from each modality separately and then perform feature fusion [32,33,34]. This approach has shown promise in enhancing segmentation accuracy and robustness by integrating diverse sources of information, making it a critical area of research in the field of remote sensing [35,36]. The effectiveness of these methods largely depends on the design of the feature fusion module.
The above studies demonstrate that multimodal remote sensing imagery is more advantageous for semantic segmentation than unimodal imagery. Moreover, applying networks designed for multimodal NSIs to the segmentation of multimodal RSIs still leaves room for improvement. Furthermore, in the field of multimodal RSIs segmentation, multimodal input data inherently contain rich spectral information. This spectral information produces sufficient differences in the appearance of different ground objects within the image, making it possible to distinguish between them. The spectral characteristics are closely related to the input data modalities, spectral feature channels, and ground object categories. Existing research on segmentation and classification using spectral features builds on this premise, often designing cross-modal information interaction methods that leverage the differences between input modalities, or spectral attention modules based on spectral channels, so that the network can incorporate spectral dimensional features [37,38,39]. However, few studies approach the problem from the perspective of ground object categories with the aim of enhancing the distinctive features of different ground object classes to further increase their separability in the feature space.
Over the past few decades, researchers have established the critical importance of spectral characteristics for classifying different ground object types in remote sensing imagery [40,41,42]. During training, deep networks learn to extract and utilize spectral features, thereby capturing the spectral relationships among multimodal data and better perceiving the spectral properties of ground objects. Consequently, it is crucial to design a multimodal remote sensing feature fusion module tailored to ground objects, based on their varying sensitivities to different modalities and spectral bands. Strengthening the features of the sensitive spectral bands in regions where a particular ground object is likely to occur can enhance its distinguishability from other ground objects. For instance, as shown in Figure 1, asphalt roads and buildings are often confused in classifications based on a thermal infrared image (TII), yet they display distinct characteristics in very high resolution (VHR) images. Incorporating features from bands with pronounced differences can therefore significantly improve classification accuracy. In short, building the spectral characteristics of ground objects into the design of a multimodal feature fusion module, which accounts for their varying sensitivities to different modalities and spectral bands, holds the potential to further enhance the performance of multimodal remote sensing image segmentation networks.
To address the aforementioned issues, this paper proposes a ground object spectral feature awareness network (SFANet) specifically designed for the segmentation of multimodal RSIs. SFANet aims to design an innovative module that guides the network to selectively learn spectral features sensitive to ground objects during multimodal feature fusion, based on the spectral characteristics of the studied ground objects. This approach seeks to achieve more effective multimodal feature integration within deep networks. Additionally, recognizing that redundant information in the fused features may interfere with the network’s ability to accurately identify ground objects, this study also explores how to leverage the spectral characteristics of these objects to direct the network towards learning the most relevant spectral features, thereby mitigating the impact of redundancy and enhancing segmentation accuracy. Consequently, SFANet incorporates an innovative Spectral Aware Feature Fusion (SAF) module, which enhances the fusion process of multi-source remote sensing features by focusing on the most relevant characteristics for classification. Unlike existing remote sensing fusion modules, the SAF module, specifically designed for ground objects and remote sensing data, differentiates itself by incorporating ground object categories and the relationships between ground objects and input features to achieve more effective feature fusion. This sets it apart from self-attention, spatial attention, and spectral attention, which primarily focus on individual feature dimensions and may not fully account for the specific interactions between ground object categories and their relevant spectral characteristics. Furthermore, a novel Adaptive Spectral Enhancement (ASE) module is introduced to reduce confusable features, thereby improving the model’s ability to accurately identify various types of ground objects. The existing methods for reducing redundant information in fused features mostly rely on self-attention mechanisms or multi-head attention, which guide the model to gradually assign higher weights to important features. In contrast, the ASE module differs in that it weights features based on the spectral characteristics of land cover, focusing on the most sensitive features for each land cover type. The main contributions of this paper are as follows:
(1)
A novel deep learning network architecture is designed to emphasize the spectral characteristics of ground objects, enhancing the network’s classification capability for these objects.
(2)
We formulate an entirely new module, the SAF module, which considers spectral feature differences, enabling the deep network to more effectively integrate multimodal remote sensing features and enhance inter-class differences.
(3)
An ASE module is designed to enhance effective information and suppress redundant information in the fused features based on spectral characteristics, thereby reducing the confusion caused by redundancy.

2. Related Work

2.1. Semantic Segmentation of NSIs

Deep learning-based semantic segmentation, a core technology in computer vision, aims to classify each pixel in an image into a corresponding object category [43,44,45]. Compared to traditional image processing techniques, deep learning methods, particularly those based on deep convolutional neural networks (CNNs), can more accurately understand the semantic information of image content, demonstrating superior performance in various applications.
Early models like Fully Convolutional Networks (FCNs) enabled end-to-end pixel-level classification by replacing fully connected layers with convolutional layers. Subsequently, U-Net, proposed by Ronneberger et al. [46], effectively addressed information loss during the segmentation process by introducing skip connections. With its outstanding edge-capturing ability and training efficiency, U-Net has been widely applied in biomedical image processing. Furthermore, atrous convolution, proposed in the DeepLab series [47,48], addresses the problem of spatial resolution loss in deep convolutional neural networks, enabling multiscale contextual information capture without reducing feature map resolution. Recently, Zheng et al. [49] proposed a high-order semantic decoupling network (HSDN) to enhance the robustness and accuracy of remote sensing image segmentation by leveraging high-order features and generating specialized masks for improved feature disentanglement and boundary handling. In addition to this, MMSMCNet, proposed by Zhou et al. [50], is an innovative RGB-T semantic segmentation network utilizing modal memory fusion and morphological multiscale assistance to enhance feature extraction, cross-modal information fusion, and morphological feature complementation, achieving superior performance on standard datasets. Additionally, Huo et al. [51] introduced a novel glass segmentation method leveraging the unique properties of paired RGB and thermal images, utilizing a neural network architecture with a multimodal fusion module based on attention, and integrating CNN and transformer to extract local and non-local features, respectively, validated on a newly collected dataset of 5551 RGB–thermal image pairs. However, the above studies focus on multimodal NSIs segmentation or unimodal RSIs. Due to the lack of semantic segmentation networks specifically designed for multimodal RSIs, existing segmentation networks struggle to achieve promising results with them. In contrast to previous research, we explore a novel semantic segmentation network architecture designed for multimodal RSIs segmentation, rather than being limited to unimodal RSIs or NSIs.

2.2. Semantic Segmentation of Multimodal RSIs

Multimodal remote sensing research, a crucial area in geospatial analysis, aims to integrate data from multiple sensing modalities to enhance the understanding of Earth’s surface features. Compared to traditional single-modality approaches, multimodal remote sensing techniques, particularly those leveraging advanced data fusion algorithms, can more accurately capture and interpret complex environmental information, demonstrating superior performance in various geospatial applications [20,52]. In recent years, researchers have been keen to utilize deep learning models to extract complex features from multimodal remote sensing data to accomplish their respective research tasks [6,53,54]. Amani et al. [55] proposed a method for classifying wetlands in five pilot areas using multi-source and multi-temporal optical remote sensing images. They employed an object-based approach to segment and classify the images, testing five different machine learning algorithms. Kussul et al. [56] introduced a multi-layer deep learning architecture for the pixel-level classification of land cover and crop types using multi-temporal, multi-source satellite images, achieving higher accuracy for some crops compared to other methods. Zhang et al. [57] proposed a method called the Structure Optimization Transmission Network (SOT-Net) for the more accurate joint classification of hyperspectral images and LiDAR data. This approach combines spectral and geometric information for structural control, integrating cross-attention mechanisms and dynamic structural optimization into a unified framework. This effectively balances multi-source information and generates consistent inter-source pattern control. However, the above methods are primarily designed to leverage the differences and complementarities between data sources for ground object classification frameworks. Few studies focus on designing data band combinations based on the specific ground objects to be classified. In contrast to these studies, we explore the spectral characteristics of the ground objects to be classified and develop two network modules that more effectively utilize multimodal remote sensing data.

2.3. Multimodal Remote Sensing Feature Fusion

Multimodal remote sensing feature fusion, a crucial technique in geospatial analysis, aims to integrate data from multiple sensor modalities to enhance the accuracy and depth of environmental understanding. Compared to traditional single-sensor approaches, multimodal fusion methods, particularly those leveraging advanced machine learning algorithms, can more effectively capture and interpret complex environmental features, demonstrating superior performance in various applications such as land cover classification, urban planning, and disaster management.
In the study of deep learning networks for multimodal remote sensing data, researchers are keen on using specially designed fusion modules to achieve multimodal data integration [58,59,60]. Originally, the fusion was achieved by multiplying or adding the feature matrices of each modality. Thereafter, numerous attention-based methods have been proposed to effectively fuse features from different modalities. Zhao et al. [52] proposed (MS)2-Net, a novel network for multimodal remote sensing data segmentation, employing multi-stage fusion and multi-source attention modules to effectively utilize complementary information and enhance feature discriminability, demonstrating significant performance improvements on publicly available datasets. Recently, a growing number of researchers prefer to first extract features separately and then employ specially designed fusion modules to achieve multimodal remote sensing feature fusion. Luo et al. [2] proposed DECCFNet, a dual encoder-based cross-modal complementary fusion network, leveraging high-resolution imagery and LiDAR data to improve urban road extraction by fusing deep features from multiple modalities and applying multi-direction strip convolution. Cai et al. [61] proposed DSTFNet, a dual-branch spatiotemporal fusion network that integrates very high-resolution (VHR) images and medium-resolution satellite image time series (MRSITS) to accurately delineate agricultural field parcels across diverse landscapes. However, these studies primarily design fusion modules based on the differences and complementarities between data modalities. Few studies focus on integrating multimodal features by considering the sensitivity of different land cover types to various modalities, which often results in the network inadequately learning the spectral features of the land cover. Distinct from these studies, this work proposes an innovative multimodal feature fusion module that accounts for both the complementarity of multimodal features and the spectral characteristics of land cover. This approach achieves a more comprehensive integration of multimodal features.

3. Method

3.1. Overview

The proposed SFANet network, as shown in Figure 2, is built on a U-shaped framework with two encoders and a decoder, and utilizes ResNet50 as the backbone in its encoder, ultimately comprising three branches. The E1∼E5 branches represent two independent feature extraction branches for heterogeneous remote sensing data, while F1∼F5 form the multi-source remote sensing feature fusion branch. These branches progressively fuse feature maps from the two extraction branches layer by layer through the SAF module. Except for the lowest-level F5 module, all fused features are used as low-level features for skip connections in the decoding process. The output from the F5 module serves as input for the first decoding module.
Each feature extraction branch includes five sub-modules containing convolutional, pooling, BN, and ReLU layers, collectively denoted as CBR. The CBR kernel size, stride, and padding are set to 7, 2, and 3 in the E1 module; 3, 1, and 1 in the E2 module; and 3, 2, and 1 in the remaining modules. In the backbone network, layers 1 through 4 correspond to the identically named modules in ResNet50. Each layer comprises several Bottleneck units: beginning with layer 2, the first Bottleneck in each layer uses a stride of 2 to perform spatial downsampling. Detailed implementation can be found in the original ResNet50 paper [62]. The decoder mirrors this structure with four sub-modules (D1∼D4). As shown in Figure 2, D4 receives the outputs of modules F5 and F4 as its inputs. The output of F5 is first spatially upsampled and then concatenated with the output of F4 along the channel dimension. The resulting feature map is passed through two consecutive CBR modules to produce D4’s output, which, together with the output of F3, is supplied as the input to D3. Additionally, an ASE module is introduced after D1 to refine low-confidence pixels in the prediction probability map and enhance the impact of ground object-sensitive bands. Each of the modules F1∼F5 comprises a single SAF module.
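To make this layout concrete, the following is a minimal PyTorch sketch of the CBR unit and one decoder block of the kind described above (upsample the deeper feature, concatenate the fused skip feature, then apply two CBR modules). The channel widths, function names, and the example shapes are illustrative assumptions, not the authors' reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


def cbr(in_ch, out_ch, kernel=3, stride=1, padding=1):
    # Convolution + BatchNorm + ReLU: the CBR unit used throughout SFANet.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride, padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class DecoderBlock(nn.Module):
    # Upsample the deeper feature, concatenate the fused skip feature from the
    # F branch, and apply two consecutive CBR modules (as described for D4-D1).
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(cbr(deep_ch + skip_ch, out_ch),
                                  cbr(out_ch, out_ch))

    def forward(self, deep, skip):
        deep = F.interpolate(deep, size=skip.shape[-2:], mode="bilinear",
                             align_corners=False)
        return self.conv(torch.cat([deep, skip], dim=1))


# Example: a D4-style block fusing an F5-level map (2048 channels, 8 x 8) with
# an F4-level skip feature (1024 channels, 16 x 16); widths follow ResNet50.
d4 = DecoderBlock(deep_ch=2048, skip_ch=1024, out_ch=512)
out = d4(torch.randn(1, 2048, 8, 8), torch.randn(1, 1024, 16, 16))
print(out.shape)  # torch.Size([1, 512, 16, 16])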

3.2. SAF Module

Due to the limitations of fusion modules that focus solely on heterogeneous feature differences in remote sensing object classification, the SAF module was proposed. This module leverages the spectral characteristics of remote sensing objects to better perceive and address these feature differences. As shown in Figure 3, the implementation of the SAF module is divided into three stages: the Convolutional Fusion and Ground Object Localization Stage, the Spectral Feature Awareness Stage, and the Weighted Fusion Stage.
(1) Convolutional Fusion and Ground Object Localization (CFL) Stage: Images from modalities I and II are resampled to a common image size and spatially registered, so that pixels at the same coordinates in each modality correspond to the same physical scene elements. Inspired by the Semantic Supervision module [63] and to localize the ground object in the feature map, the input Modal I feature and Modal II feature are concatenated along the channel dimension. Both input features have the same number of channels, denoted as C. After concatenation, the number of channels becomes 2C. A CBR combination is then used to obtain convolutional fusion features with C channels, denoted as $F_{conv}$. Subsequently, a classifier with a convolutional layer and a softmax function is applied to $F_{conv}$ to generate the prediction probability map, denoted as P. To ensure that P accurately reflects the predicted probabilities for each pixel corresponding to their ground object classes, P is incorporated into the loss function calculation. This integration guides the model to optimize prediction accuracy during the training process.
(2) Spectral Awareness (SA) Stage: To accurately determine the positions of various ground objects on the feature map and prediction probability map, we assume that the ground objects sensitive to Modal I and II features are indexed as I and S in the labels. By setting a filtering threshold $\delta$, the high-confidence regions for ground objects in the prediction probability map are identified, resulting in a localization map.

$$L_I = \begin{cases} 0, & \mathrm{softmax}(P) \neq I \\ P_I(i,j), & \mathrm{softmax}(P) = I \end{cases}$$

$$L_S = \begin{cases} 0, & \mathrm{softmax}(P) \neq S \\ P_S(i,j), & \mathrm{softmax}(P) = S \end{cases}$$

In this context, $P_I$/$P_S$ represents the tensor formed by the $I$th/$S$th channel of the prediction probability map P. Since P is a prediction probability map, $P_I$/$P_S$ indicates the probability that each pixel belongs to the $I$th/$S$th class of ground objects. $P_I(i,j)$/$P_S(i,j)$ represents the value of the element at the $i$th row and $j$th column of this tensor, which is the probability that the pixel at this position belongs to the $I$th/$S$th class. Further, the feature award weights are calculated based on the localization map as follows:

$$f_I = \max(L_I - \delta, 0) \div \delta$$

$$f_S = \max(L_S - \delta, 0) \div \delta$$

$f_I$ and $f_S$ denote the Modal I feature award weight and Modal II feature award weight.
(3) Weighted Fusion (WF) Stage: To ensure that the final output fused features $F_{fused}$ retain both convolutional fusion information and spectral awareness, the following calculation was performed:

$$F_{fused} = F_{conv} + F_I \odot f_I + F_S \odot f_S$$

where $\odot$ represents element-wise multiplication.
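For illustration, the following is a hedged PyTorch sketch of the three SAF stages described above: a CBR fusion of the concatenated modal features, an auxiliary classifier producing P, localization-based award weights following the equations, and the weighted fusion. The class indices, channel counts, and the name SAFModule are illustrative assumptions, not taken from the authors' code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SAFModule(nn.Module):
    def __init__(self, channels, num_classes, class_i, class_s, delta=0.4):
        super().__init__()
        self.class_i, self.class_s, self.delta = class_i, class_s, delta
        self.fuse = nn.Sequential(                       # CBR on the 2C-channel concat
            nn.Conv2d(2 * channels, channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(channels, num_classes, 1)  # auxiliary head -> P

    def forward(self, feat_i, feat_s):
        f_conv = self.fuse(torch.cat([feat_i, feat_s], dim=1))
        logits = self.classifier(f_conv)                 # supervised by the auxiliary loss
        prob = F.softmax(logits, dim=1)
        pred = prob.argmax(dim=1, keepdim=True)          # predicted class per pixel

        # Localization maps: keep the class probability only where that class wins.
        l_i = prob[:, self.class_i:self.class_i + 1] * (pred == self.class_i)
        l_s = prob[:, self.class_s:self.class_s + 1] * (pred == self.class_s)

        # Feature award weights: f = max(L - delta, 0) / delta.
        w_i = torch.clamp(l_i - self.delta, min=0) / self.delta
        w_s = torch.clamp(l_s - self.delta, min=0) / self.delta

        # Weighted fusion: F_fused = F_conv + F_I * f_I + F_S * f_S.
        fused = f_conv + feat_i * w_i + feat_s * w_s
        return fused, logits

Here the auxiliary output is returned as logits so that it can enter the cross-entropy loss described in Section 3.4.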

3.3. ASE Module

The structure of the ASE module is shown in Figure 4. In the prediction probability map, the same color tone represents one type of ground object, with darker shades indicating high-confidence pixels and lighter shades indicating low-confidence pixels. This module consists of two main components: the Spectral Blurring Focus Adaptive Mechanism (SBFA) and Effective Spectral Feature Enhancement stage (ESFE).
(1) SBFA: To identify the areas where the network is prone to confusion, the positions of these regions need to be determined. The prediction probability map P is obtained from the fused features $F_{D5}$ output by the decoder D1 in Figure 2. To ensure that P has actual semantic meaning, it is included in the loss function calculation. By introducing the same filtering threshold $\delta$ as in the SAF module, the spectral feature award weight is calculated as follows:

$$F_{SA} = \frac{1}{\delta} \times \max(\delta - \sigma(P), 0)$$

Here, $\sigma(P)$ represents the maximum value of the prediction probability map across its channel dimension, indicating the highest probability of each pixel belonging to a specific category. A lower value suggests higher uncertainty or ambiguity in the network’s output for that pixel. The refined features $Rf_I$ and $Rf_S$ are then calculated accordingly:

$$Rf_I = F_I^1 \odot F_{SA}$$

$$Rf_S = F_S^1 \odot F_{SA}$$

Here, $F_I^1$ and $F_S^1$ represent the output Modal I and II features of the encoder modules E1 and E1’ in Figure 2, respectively. Since their size is only half that of $F_{SA}$, an upsampling operation is performed before they are input into the module. The symbol $\odot$ represents element-wise multiplication.
(2) ESFE stage: After identifying the locations of the confused regions, the characteristics of the effective ground objects from both data modalities are enhanced to reduce confusion. The refined features $Rf_I$ and $Rf_S$ obtained from the previous step are input into the global convolutional network (GCN), which is a semantic segmentation architecture that employs large convolutional kernels to simultaneously enhance classification and localization while using a residual-based boundary refinement module to sharpen object edges [64], resulting in $Gf_I$ and $Gf_S$. The number of channels in the output tensor of the GCN is set to the number of ground object categories. The channels in $Gf_I$ and $Gf_S$ that correspond to ground object categories sensitive to Modal I and II features are retained, while the remaining channels are set to zero. The output tensor can also be viewed as a feature map of the same size as the prediction probability map. The final output prediction probability map $f_O$ is then calculated as follows:

$$f_O = \mathrm{softmax}(P + Gf_I + Gf_S)$$
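A minimal sketch of the ASE computation is given below, taking the decoder's probability map and the stage-1 encoder features as inputs. The GCN branch is abbreviated here as a single large-kernel convolution head, and the class indices, channel counts, and the name ASEModule are illustrative assumptions rather than the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ASEModule(nn.Module):
    def __init__(self, enc_channels, num_classes, class_i, class_s, delta=0.4):
        super().__init__()
        self.class_i, self.class_s, self.delta = class_i, class_s, delta
        # Stand-in for the GCN branch: large-kernel convolutions mapping encoder
        # features to per-class maps (the original GCN uses separable k x 1 and
        # 1 x k kernels plus a boundary-refinement block).
        self.gcn_i = nn.Conv2d(enc_channels, num_classes, 7, padding=3)
        self.gcn_s = nn.Conv2d(enc_channels, num_classes, 7, padding=3)

    def forward(self, prob_p, feat_i1, feat_s1):
        # SBFA: weight pixels whose top-class probability falls below delta.
        sigma = prob_p.max(dim=1, keepdim=True).values
        f_sa = torch.clamp(self.delta - sigma, min=0) / self.delta

        # Upsample the stage-1 encoder features of both modalities and refine them.
        size = prob_p.shape[-2:]
        g_i = self.gcn_i(F.interpolate(feat_i1, size=size, mode="bilinear",
                                       align_corners=False) * f_sa)
        g_s = self.gcn_s(F.interpolate(feat_s1, size=size, mode="bilinear",
                                       align_corners=False) * f_sa)

        # ESFE: keep only the channels of the classes sensitive to each modality.
        mask_i = torch.zeros_like(g_i)
        mask_i[:, self.class_i] = 1.0
        mask_s = torch.zeros_like(g_s)
        mask_s[:, self.class_s] = 1.0

        # f_O = softmax(P + Gf_I + Gf_S)
        return F.softmax(prob_p + g_i * mask_i + g_s * mask_s, dim=1)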

3.4. Loss Function

As mentioned in the previous two sections, the proposed SAF module and the ASE module both include output prediction probability maps that participate in the loss calculation and guide the model’s convergence. Therefore, the loss during network training is calculated as follows:

$$Loss = L_f + \alpha \times (L_1^a + L_2^a + L_3^a + L_4^a + L_5^a + L_{ASE})$$

The loss function used in the network is a multi-class cross-entropy loss due to the multi-class nature of the task. Here, $L_f$ represents the loss value calculated from the prediction probability map of the network’s final output, while $L_n^a$ denotes the cross-entropy loss calculated from the prediction probability map of the $n$th-level SAF module’s auxiliary output. $L_{ASE}$ is the loss calculated from the auxiliary output of the ASE module. The weighting factor $\alpha$ of the auxiliary loss is set to 0.5 by default. The rationale for the chosen value of $\alpha$ is detailed in Section 5.4.
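For reference, a compact sketch of this multi-task loss is shown below, assuming all terms are cross-entropy losses and that the auxiliary prediction maps have been upsampled to the label resolution; the function name and argument layout are illustrative.

import torch.nn.functional as F


def sfanet_loss(final_logits, saf_aux_logits, ase_logits, target, alpha=0.5):
    # final_logits: main segmentation output; saf_aux_logits: list of the five
    # SAF auxiliary maps; ase_logits: ASE auxiliary output; target: class indices.
    loss_f = F.cross_entropy(final_logits, target)
    loss_aux = sum(F.cross_entropy(p, target) for p in saf_aux_logits)
    loss_aux = loss_aux + F.cross_entropy(ase_logits, target)
    return loss_f + alpha * loss_aux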

4. Experiment

4.1. Dataset

To validate the effectiveness of the proposed method, two multimodal remote sensing datasets were selected for the experiments: the ISPRS Vaihingen dataset and a self-labeled dataset. The ISPRS Vaihingen dataset includes IRRG (infrared, red, green) composite false-color remote sensing imagery and DSM data. This dataset features IRRG images, which differ from NSI images, and the DSM data provide additional remote sensing spectral information. Additionally, the dataset contains ground object types that are particularly sensitive to these two modal data types, making it suitable for validating SFANet. The self-labeled dataset consists of remote sensing images collected by drones using multispectral and dual-thermal infrared cameras from a park in Wuhan. The images have undergone preprocessing, registration, resampling, and labeling to construct a semantic segmentation dataset. The resolution of the VHR imagery is 0.007 m, and the multispectral images contain five bands: RGB, near-infrared, and red-edge, with a resolution of 0.154 m. The thermal infrared data provide color thermal maps with a resolution of 0.2684 m. The land cover types included in this dataset are highly sensitive to thermal infrared and near-infrared/red-edge features, which aligns with the requirements for validating SFANet. Therefore, these two datasets were chosen for the experiments. A detailed description of each dataset follows:

4.1.1. ISPRS Vaihingen Dataset

The ISPRS Vaihingen dataset includes IRRG composite false-color remote sensing images and DSM data. This dataset contains labels for six types of ground objects: impervious surfaces (white), low vegetation (cyan), buildings (blue), cars (yellow), trees (green), and background (red). Due to the minimal representation of the background class in the dataset, it is excluded from the labels and not considered during training. The training and test sets were divided following the same method used by Liu et al. [65]: images with ID numbers 11, 15, 28, 30, and 34 were used as the test set, while images with ID numbers 1, 3, 5, 7, 13, 17, 21, 23, 26, 32, and 37 were used as the training set. These images were cropped into multiple 256 × 256 sample pairs using a sliding window with a stride of 256. In total, 902 paired samples were generated for training and 408 for testing. An illustration of the Vaihingen dataset samples is shown in Figure 5.
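As an illustration of the tiling step, the sketch below crops an image into non-overlapping 256 × 256 patches using a sliding window with a stride of 256; the array handling and function name are illustrative.

import numpy as np


def tile_image(image: np.ndarray, size: int = 256, stride: int = 256):
    # Yield (top, left, patch) tiles from an H x W x C array; with stride == size
    # the tiles are non-overlapping, matching the dataset preparation above.
    h, w = image.shape[:2]
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            yield top, left, image[top:top + size, left:left + size]


# Example: tiling a synthetic 1024 x 1024 IRRG image produces 16 patches.
patches = list(tile_image(np.zeros((1024, 1024, 3), dtype=np.uint8)))
print(len(patches))  # 16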

4.1.2. Self-Annotated Dataset

The dataset’s land cover labels encompass six categories: asphalt road, masonry road, buildings, water, vegetation, and background. The images and labels were cropped into 512 × 512 pixel samples, with non-overlapping samples divided into training and testing sets, containing 1914 and 272 sample pairs, respectively. Samples and their GT are illustrated in Figure 6.

4.2. Experimental Setup

4.2.1. Training Configuration and Comparative SOTA Methodology

The experiments were implemented using the PyTorch 2.0.0 open-source framework. The proposed SFANet model was trained for 80 iterations. The backbone selected for this study is ResNet50, pre-trained on ImageNet. An SGD optimizer with a momentum of 0.9 and a weight decay of $5 \times 10^{-4}$ is used. The initial learning rate is set to 0.005 and is decayed exponentially with a power of 0.9 at each iteration. The batch size is set to 4 for the self-annotated dataset and 16 for the ISPRS Vaihingen dataset. All model training was conducted on an Ubuntu 20.04 platform using NVIDIA A30 GPUs. The baseline is a dual-encoder, single-decoder U-shaped network with ResNet50 as the backbone, in which the two feature extraction branches are merged through a CBR module.
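The sketch below reproduces this training configuration in PyTorch under stated assumptions: the reported decay is interpreted as the common poly schedule (1 − iter/total_iters)^0.9 stepped per iteration, and the model object and iteration count are placeholders rather than the actual SFANet and schedule length.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 5, 3, padding=1)   # placeholder standing in for SFANet
total_iters = 10000                     # placeholder schedule length

# SGD with momentum 0.9, weight decay 5e-4, initial learning rate 0.005.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=5e-4)

# Power-0.9 decay applied at every iteration via scheduler.step().
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / total_iters) ** 0.9)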
To verify the effectiveness and advancement of the proposed method, the network was compared with four multimodal RSIs segmentation networks: MGFNet [66], FTransUNet [33], CMFNet [67], and PACSCNet [68]. MGFNet, FTransUNet, and PACSCNet are all recent U-Net-based frameworks, which help to compare the effectiveness of the two novel modules proposed in this paper. Additionally, five multimodal NSIs methods—SFAMFANet [69], DRNet [70], SGFNet [71], ESANet [72], and RDFNet [73]—as well as four existing unimodal networks—U-Net [46], SegNet [74], HRNet [75], and DeeplabV3+ [48]—were used to validate the advantages of the multimodal RSIs segmentation approach. Since the dataset used in this study is a multi-source remote sensing dataset, data were stacked along the channel dimension before being input into the unimodal networks for comparison experiments.

4.2.2. Intrinsic Parameter Configuration

The self-annotated dataset originally included VHR images, TIIs, and multispectral images. Because the multispectral image contains the same bands as the VHR image but has a lower spatial resolution, to reduce the complexity of the deep learning network and avoid redundant input information, the VHR image data and the non-RGB bands of multispectral data were combined along the channel dimension to form a new multispectral image, which is used as one input for the network. The restructured dataset includes two types of multi-source images: TIIs and multispectral images. For the intrinsic parameters of the modules, the filtering threshold δ of the ASE module is set to 0.4. The justification for the selected δ value is established from both theoretical and experimental viewpoints in Section 5.4. To determine the effective ground object categories for TIIs and multispectral images, a preliminary experiment was conducted. Following the methodology outlined in our previous work [76], models were trained to predict three combinations: TIIs, multispectral images, and TIIs + multispectral images. The results, as shown in Table 1, indicate that multispectral data aid in vegetation recognition, while the TII data hinder it. Therefore, vegetation (index 5 in the labels) was set as the effective class for the multispectral data. For the TII data, since they improved building segmentation accuracy the most, buildings (index 4) were selected as the effective class.
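As a small illustration of this band recombination, the snippet below stacks a VHR RGB array with the non-RGB multispectral bands along the channel dimension, assuming both have already been registered and resampled to the same grid; the arrays here are synthetic placeholders.

import numpy as np

vhr_rgb = np.zeros((512, 512, 3), dtype=np.float32)    # placeholder VHR RGB bands
ms_extra = np.zeros((512, 512, 2), dtype=np.float32)   # placeholder NIR + red-edge bands

# Channel-wise stacking yields the restructured five-band multispectral input.
multispectral_input = np.concatenate([vhr_rgb, ms_extra], axis=-1)
print(multispectral_input.shape)  # (512, 512, 5)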
For the ISPRS Vaihingen dataset, the filtering threshold δ is set to 0.4. The effective ground object for the DSM band is set to impervious surfaces (index 0 in the labels), and the effective ground object for the IRRG band is set to buildings (index 1 in the labels).

4.3. Experimental Results

4.3.1. Experimental Results on Self-Annotated Dataset

The classification accuracy of different methods is shown in Table 2. Compared to unimodal segmentation networks, multimodal networks generally achieve higher accuracy due to their enhanced ability to learn from multi-source remote sensing features. The proposed SFANet achieves the highest mIoU, exceeding the best existing multimodal method by 1.50% and the best unimodal method by 3.98%.
From the perspective of individual ground objects, our method demonstrated more precise and robust performance in classifying each type of ground object compared to the baseline. The proposed SAF and ASE modules enhance the performance of SFANet on buildings and vegetation. For example, the accuracy of vegetation is increased from 75.4% to 83.6%. Additionally, our method significantly improved building recognition compared to the baseline network, with an increase of 20.6%. Notably, the ground objects sensitive to the two modalities in this dataset, buildings and vegetation, are the two categories with the most significant improvement in recognition accuracy. This confirms that the proposed network effectively achieves its design goal. Moreover, SFANet achieves the highest IoU for both water and buildings, demonstrating its superior performance. The SAF and ASE modules introduced in this study play a critical role in enhancing network performance, particularly by improving the recognition of land cover types that are more sensitive to the spectral properties of the employed remote sensing data. This contributes significantly to the overall accuracy of the model. MGFNet employs a gate-based cross-attention fusion mechanism specifically tailored for two input modalities and yields relatively uniform improvements in segmentation accuracy across all land-cover classes. However, because it does not sufficiently exploit the sensitivity between ground objects and the two input modalities, MGFNet achieves a lower mIoU than the proposed method.
To visually demonstrate the effectiveness and superiority of our method, some results from the test set using the compared methods are shown in Figure 7. Observing the classification results of group (a), it can be seen that our network performs better than the baseline network in resolving misclassifications between vegetation and buildings, and it achieves a more complete recognition of the buildings on the left side of the images. In the results of group (d) and group (e), our network achieves higher accuracy and lower error rates in classifying narrow brick roads compared to existing multimodal networks. The results of group (f) further demonstrate that SFANet effectively suppresses the misclassification of buildings by the network.

4.3.2. Experimental Results on ISPRS Vaihingen Dataset

To test the applicability of the proposed SFANet network, comparative experiments were conducted on the ISPRS Vaihingen dataset. The classification accuracy of different methods is presented in Table 3. Similar to the results on the self-annotated dataset, multimodal networks demonstrated an overall performance improvement compared to unimodal networks. Specifically, the greatest improvements for multimodal networks were observed in the classification of impervious surfaces and cars. The worst performance of multimodal networks for these classes (78.3% and 57.3%) surpassed the best performance of unimodal networks (72.0% and 46.2%) by 6.3% and 11.1%, respectively. This indicates that multimodal networks, through their feature fusion modules, effectively enhance the fusion of multi-source features and improve the recognition capability for these two types of ground objects.
Among the multimodal networks presented, our network achieved the best overall performance on the test set for ground objects, except for tree recognition, where it was slightly inferior to MGFNet. Compared to the baseline network, the two proposed modules in our network improved the accuracy of building recognition by 3.8% and car recognition by 13.4%. This validates the effectiveness of the proposed modules in promoting the fusion of multi-source remote sensing features.
To visually demonstrate the differences in recognizing various ground objects with different methods, segmentation results from the test set for some of the aforementioned methods are shown in Figure 8. The results of group (a) and (e) demonstrate that, compared to MGFNet, SFANet can effectively reduce the confusion between buildings and impervious surfaces within the network. As illustrated in results (b), the baseline network has certain deficiencies in recognizing cars, whereas the proposed SFANet mitigates this issue to some extent. Results (c) and (d) indicate that the baseline network tends to confuse low vegetation and impervious surfaces, whereas the proposed method significantly reduces this confusion.
In summary, these experiments on the ISPRS Vaihingen dataset corroborate the effectiveness and superiority of SFANet in typical ground object classification tasks. The ability of SFANet to enhance multimodal feature fusion and improve classification accuracy across various ground objects underscores its robustness and applicability in different remote sensing scenarios.

5. Discussion

5.1. Ablation Study

To test the impact of the proposed fusion module and spectral feature enhancement module on the overall performance of the SFANet network, ablation experiments were conducted. The mIoU and Pixel Accuracy (PA) were calculated, as shown in Table 4.
Table 4 presents the performance of each module on the dataset. It can be observed that replacing a 1 × 1 convolution used for merging multi-source remote sensing features in the baseline network with the SAF module improved the mIoU by 4.18% and the PA by 3.46%. This indicates that the SAF module effectively facilitates the integration of multi-source remote sensing features, thereby enhancing the accuracy of the network in recognizing different ground objects.
Incorporating the ASE module with multi-task loss calculation into the baseline network resulted in a 2.5% improvement in mIoU and a 1.78% increase in PA. This demonstrates that the ASE module enhances the network’s ability to distinguish features of different ground objects. Using both the SAF and ASE modules together achieved the best results, with an mIoU improvement of 5.66% and a PA improvement of 3.99% compared to the baseline. The improvement in mIoU was more pronounced than that in PA, indicating that the addition of the SAF or ASE modules to the baseline network substantially improved the classification accuracy on the test set.
As shown in Table 4, the influence of the modules is primarily concentrated on buildings and vegetation. Both the SAF and ASE modules enhance the classification accuracy of buildings and vegetation, with the highest accuracy achieved when both modules are used together. This indicates that both modules effectively facilitate the comprehensive integration of multi-source remote sensing features while improving the network’s ability to distinguish between different ground objects.
For asphalt roads and brick roads, the classification accuracy fluctuated around the baseline when the SAF and ASE modules were introduced. This suggests that these modules may not significantly enhance the integration of multi-source remote sensing features for these two types of ground objects. For water, the classification accuracy showed a slight improvement after introducing the proposed modules, but the change was less pronounced than for buildings and vegetation. This is likely a ceiling effect, as the classification accuracy for water was already relatively high.

5.2. The Effectiveness of Proposed Modules

To demonstrate the effectiveness of the proposed modules in enhancing model accuracy and reducing model confusion, confusion matrices of the classification results for both the baseline and SFANet models on the test set are shown in Figure 9. Firstly, for the two sensitive ground objects—buildings and vegetation—the inclusion of the SAF and ASE modules results in improved recognition accuracy and a reduction in misclassifications. This indicates that both modules effectively enhance the network’s ability to accurately identify relevant land cover types while minimizing confusion within the network. Additionally, compared to the baseline, SFANet shows a significant improvement in the recognition accuracy of two effective ground objects—buildings and vegetation—with the true positive (TP) rate increasing from 6.58% and 25.76% to 7.34% and 28.16%, respectively, representing an improvement of approximately 10%. This demonstrates that the SAF module effectively enhances inter-class differentiation for these two ground objects, leading to more accurate classification by the network. Furthermore, owing to the ASE module’s ability to enhance regions prone to confusion within the network, the misclassification of vegetation as buildings has been significantly suppressed, as indicated by the red arrows in the figure. This demonstrates that both innovative modules proposed in this paper successfully fulfill their initial design objectives.
To provide a more intuitive illustration of the aforementioned points, selected results from the test set are presented in Figure 10. Comparing the results of group (a) and group (c) for SFANet and the baseline, it can be observed that with the addition of the two innovative modules, the network successfully identifies vegetation and buildings that were previously missed. This demonstrates the modules’ ability to enhance the network’s accuracy in recognizing relevant land cover types. Furthermore, a comparison between group (b) and group (d) reveals that, with the inclusion of these modules, the network correctly classifies areas that were previously misidentified as vegetation or buildings. This indicates that the modules help reduce confusion in the segmentation process.

5.3. Limitation

Although our method achieves the highest mIoU on both datasets and yields substantial accuracy gains for the sensitive ground object types, it nevertheless exhibits a limitation: a risk of accuracy degradation on non-sensitive ground object types. As shown in the ablation study’s accuracy Table 4, relative to the baseline, our approach incurs a 1.7% drop in asphalt road segmentation performance, since both modules introduce some confusion for this class. Moreover, the confusion matrix in Figure 9 reveals that, after integrating the two modules, the network correctly identifies more asphalt road pixels (i.e., an increase in true positives) but also misclassifies a larger number of background pixels as asphalt road, thereby reducing the IoU. At the same time, the likelihood of background pixels being erroneously predicted as masonry road or water also increases, indicating that our method amplifies confusion between background and non-sensitive ground object types, and thus may compromise accuracy on those classes.

5.4. Intrinsic Parameter Analysis for SFANet

SAF and ASE modules have two intrinsic parameters in the proposed network: the filtering threshold δ and the sensitive ground object types corresponding to the two remote sensing data types. This section investigates the reasonableness and effectiveness of these parameter settings through a series of experiments.

5.4.1. The Filtering Threshold δ

In this section, SFANet was trained to convergence with different filtering thresholds $\delta$ ranging from 0.3 to 0.7, using the same hyperparameter settings as in the experimental setup. The performance of these models on the test set is recorded in Figure 11, where the horizontal axis represents the filtering threshold and the vertical axis represents the mIoU of the converged models on the test set. The mIoU reaches its maximum of 82.30% at a threshold of 0.4. This is because the filtering threshold primarily divides the prediction probability map in the SAF and ASE modules into high-confidence and low-confidence regions. When the threshold is too low, the modules tend to classify more regions as high-confidence, which can somewhat suppress the effect of the ASE enhancement module. Conversely, when the threshold is too high, the high-confidence regions shrink, limiting the comprehensive integration of multi-source remote sensing features by the SAF fusion module.

5.4.2. The Sensitive Ground Object Types

In addition to the filtering threshold, the two proposed modules include another intrinsic parameter: the types of ground objects whose multi-source feature fusion can be promoted by the multispectral and thermal infrared data. The multispectral data facilitate the fusion of vegetation-related multi-source remote sensing features, and the thermal infrared data enhance the fusion of building-related features. Consequently, in the experiments in this paper, the effective ground object type was set to buildings for the thermal infrared data and to vegetation for the multispectral data. This section verifies the rationality of this setting and explores the significance of this parameter for the SAF and ASE modules, as well as its impact on the SFANet network, by fixing the effective ground object type for one data type and adjusting the other.
As shown in Table 5, when the effective class is set to buildings for the thermal infrared data and to vegetation for the multispectral data, the model achieves the highest mIoU of 82.30% on the test set. This indicates that the multispectral and thermal infrared data best facilitate the fusion of vegetation-related and building-related multi-source remote sensing features, respectively, allowing the model to better recognize various ground objects in the fused features.

5.4.3. Weighting Factor α for the Auxiliary Loss

To evaluate the appropriateness of the auxiliary-loss weighting factor α , we trained a series of models on our self-annotated dataset by varying α and plotted the results in Figure 12. It can be observed that the model achieves the highest accuracy at α = 0.5.

5.5. Model Complexity and Computational Efficiency Analysis

To further demonstrate the performance of our approach, the parameter counts and inference efficiency of our method and several key baselines on the test set are presented in Table 6. Compared to existing multimodal methods, our model has the smallest parameter footprint and a relatively short inference time on the same evaluation platform, making it a promising candidate for deployment on UAVs or other lightweight systems.

6. Conclusions

This paper proposes SFANet with two innovative modules, SAF and ASE, which integrate multimodal features while considering the spectral characteristics of ground objects to enhance the segmentation performance on multimodal RSIs. Comparative experiments on the ISPRS Vaihingen dataset and a self-annotated dataset demonstrate that our SFANet achieves higher mIoU accuracy, outperforming the best existing multimodal models by 0.80% and 1.50%, and surpassing unimodal models by 9.74% and 3.98%, respectively. Additionally, the SAF and ASE modules play a crucial role in the fusion of multimodal features. The SAF module locates high-confidence areas of specific land cover during the fusion process and uses effective remote sensing features of these ground objects to more comprehensively integrate multi-source features, thereby enhancing the network’s ability to recognize these ground objects. The ASE module identifies areas where the network is prone to confusion and strengthens the effective features within these areas, making it easier for the network to distinguish between different ground objects. SFANet demonstrates significant advancements over other multimodal RSIs segmentation networks, and our approach of designing network modules based on ground object spectral characteristics provides new perspectives for RSI-specific network design.
Although SFANet achieved promising results on both datasets, a limitation remains: the method mainly improves the segmentation of ground objects that are sensitive to the multimodal data, and its effectiveness may be limited when such sensitive ground objects are absent.

Author Contributions

Conceptualization, Y.L. and F.Z.; methodology, Y.L. and D.Z.; software, Y.L. and Z.X.; formal analysis, Y.Z.; investigation, K.S.; data curation, F.Z.; writing—original draft, Y.L.; writing—review and editing, D.Z., Y.Z., Z.X. and Z.W.; visualization, Z.W.; supervision, D.Z. and F.Z.; project administration, K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data derived from public domain resources. The ISPRS Vaihingen dataset presented in this study is available at https://www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/Default.aspx (accessed on 6 January 2025). These data were derived from the following resources available in the public domain: https://seafile.projekt.uni-hannover.de/f/6a06a837b1f349cfa749/ (accessed on 6 January 2025). The self-annotated dataset presented in this study is available at https://doi.org/10.57760/sciencedb.16649 (accessed on 10 March 2025). These data were derived from the following resources available in the public domain: https://download.scidb.cn/download?fileId=0667eb484603b7c92c12fad215d8f2ee&path=/V3/infrared.zip&fileName=infrared.zip (accessed on 10 March 2025), https://download.scidb.cn/download?fileId=a3731510aa68494d6abce740c53bff6a&path=/V3/spectral.zip&fileName=spectral.zip (accessed on 10 March 2025), https://download.scidb.cn/download?fileId=1cfa4f44e064840f8a1880940e9fe8ca&path=/V3/rgb.zip&fileName=rgb.zip (accessed on 10 March 2025).

Acknowledgments

We extend our gratitude to Wuhan Field Park for providing the data collection site used in the self-annotated dataset for this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Cui, H.; Zhang, G.; Chen, Y.; Li, X.; Hou, S.; Li, H.; Ma, X.; Guan, N.; Tang, X. Knowledge evolution learning: A cost-free weakly supervised semantic segmentation framework for high-resolution land cover classification. ISPRS J. Photogramm. Remote Sens. 2024, 207, 74–91.
2. Luo, H.; Wang, Z.; Du, B.; Dong, Y. A Deep Cross-Modal Fusion Network for Road Extraction with High-Resolution Imagery and LiDAR Data. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4503415.
3. Gui, B.; Bhardwaj, A.; Sam, L. Evaluating the efficacy of segment anything model for delineating agriculture and urban green spaces in multiresolution aerial and spaceborne remote sensing images. Remote Sens. 2024, 16, 414.
4. Deren, L.; Liangpei, Z.; Guisong, X. Automatic analysis and mining of remote sensing big data. Acta Geod. Cartogr. Sin. 2014, 43, 1211.
5. Sun, X.; Tian, Y.; Lu, W.; Wang, P.; Niu, R.; Yu, H.; Fu, K. From single- to multi-modal remote sensing imagery interpretation: A survey and taxonomy. Sci. China Inf. Sci. 2023, 66, 140301.
6. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926.
7. Zhong, Z.; Cui, J.; Yang, Y.; Wu, X.; Qi, X.; Zhang, X.; Jia, J. Understanding imbalanced semantic segmentation through neural collapse. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19550–19560.
8. Wu, L.; Fang, L.; He, X.; He, M.; Ma, J.; Zhong, Z. Querying labeled for unlabeled: Cross-image semantic consistency guided semi-supervised semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8827–8844.
9. He, Y.; Gao, Z.; Li, Y.; Wang, Z. A lightweight multi-modality medical image semantic segmentation network base on the novel UNeXt and Wave-MLP. Comput. Med. Imaging Graph. 2024, 111, 102311.
10. Bhattarai, B.; Subedi, R.; Gaire, R.R.; Vazquez, E.; Stoyanov, D. Histogram of oriented gradients meet deep learning: A novel multi-task deep network for 2D surgical image semantic segmentation. Med. Image Anal. 2023, 85, 102747.
11. Xiao, T.; Liu, Y.; Huang, Y.; Li, M.; Yang, G. Enhancing multiscale representations with transformer for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605116.
12. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and multiscale transformer fusion network for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2004612.
13. Ma, X.; Zhang, X.; Pun, M.O. RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405.
14. Li, X.; Xu, F.; Liu, F.; Lyu, X.; Tong, Y.; Xu, Z.; Zhou, J. A synergistical attention model for semantic segmentation of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5400916.
15. Zhao, S.; Liu, Y.; Jiao, Q.; Zhang, Q.; Han, J. Mitigating modality discrepancies for RGB-T semantic segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 9380–9394.
16. Lv, Y.; Liu, Z.; Li, G. Context-aware interaction network for RGB-T semantic segmentation. IEEE Trans. Multimed. 2024, 26, 6348–6360.
17. Yang, J.; Bai, L.; Sun, Y.; Tian, C.; Mao, M.; Wang, G. Pixel difference convolutional network for RGB-D semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1481–1492.
18. Zhao, Q.; Wan, Y.; Xu, J.; Fang, L. Cross-modal attention fusion network for RGB-D semantic segmentation. Neurocomputing 2023, 548, 126389.
19. Su, C.; Hu, X.; Meng, Q.; Zhang, L.; Shi, W.; Zhao, M. A multimodal fusion framework for urban scene understanding and functional identification using geospatial data. Int. J. Appl. Earth Obs. Geoinf. 2024, 127, 103696.
20. Arun, P.V.; Sadeh, R.; Avneri, A.; Tubul, Y.; Camino, C.; Buddhiraju, K.M.; Porwal, A.; Lati, R.N.; Zarco-Tejada, P.J.; Peleg, Z.; et al. Multimodal Earth observation data fusion: Graph-based approach in shared latent space. Inf. Fusion 2022, 78, 20–39.
21. Zhang, W.; Wang, X.; Wang, H.; Cheng, Y. Causal Meta-Reinforcement Learning for Multimodal Remote Sensing Data Classification. Remote Sens. 2024, 16, 1055.
22. Zhao, J.; Zhang, D.; Shi, B.; Zhou, Y.; Chen, J.; Yao, R.; Xue, Y. Multi-source collaborative enhanced for remote sensing images semantic segmentation. Neurocomputing 2022, 493, 76–90.
23. Liu, Q.; Wang, X. Bidirectional Feature Fusion and Enhanced Alignment Based Multimodal Semantic Segmentation for Remote Sensing Images. Remote Sens. 2024, 16, 2289.
24. Hou, J.; Guo, Z.; Wu, Y.; Diao, W.; Xu, T. BSNet: Dynamic hybrid gradient convolution based boundary-sensitive network for remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5624022.
25. Cai, Y.; Fan, L.; Fang, Y. SBSS: Stacking-based semantic segmentation framework for very high-resolution remote sensing image. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5600514.
26. Ma, M.; Ma, W.; Jiao, L.; Liu, X.; Li, L.; Feng, Z.; Yang, S. A multimodal hyper-fusion transformer for remote sensing image classification. Inf. Fusion 2023, 96, 66–79.
27. Roy, S.K.; Deria, A.; Hong, D.; Rasti, B.; Plaza, A.; Chanussot, J. Multimodal fusion transformer for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3286826.
28. Hong, D.; Zhang, B.; Li, H.; Li, Y.; Yao, J.; Li, C.; Werner, M.; Chanussot, J.; Zipf, A.; Zhu, X.X. Cross-city matters: A multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks. Remote Sens. Environ. 2023, 299, 113856.
29. He, X.; Chen, Y.; Huang, L.; Hong, D.; Du, Q. Foundation model-based multimodal remote sensing data classification. IEEE Trans. Geosci. Remote Sens. 2023, 62, 5502117.
30. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415.
31. He, Q.; Sun, X.; Diao, W.; Yan, Z.; Yao, F.; Fu, K. Multimodal remote sensing image segmentation with intuition-inspired hypergraph modeling. IEEE Trans. Image Process. 2023, 32, 1474–1487.
32. Zhang, Y.; Lan, C.; Zhang, H.; Ma, G.; Li, H. Multimodal remote sensing image matching via learning features and attention mechanism. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5603620.
33. Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403215.
  34. Dong, S.; Wang, L.; Du, B.; Meng, X. ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning. ISPRS J. Photogramm. Remote Sens. 2024, 208, 53–69. [Google Scholar] [CrossRef]
  35. Wang, Q.; Chen, W.; Huang, Z.; Tang, H.; Yang, L. MultiSenseSeg: A cost-effective unified multimodal semantic segmentation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703724. [Google Scholar] [CrossRef]
  36. Yao, J.; Zhang, B.; Li, C.; Hong, D.; Chanussot, J. Extended vision transformer (ExViT) for land use and land cover classification: A multimodal deep learning framework. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3284671. [Google Scholar] [CrossRef]
  37. Feng, Z.; Song, L.; Yang, S.; Zhang, X.; Jiao, L. Cross-modal contrastive learning for remote sensing image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5517713. [Google Scholar] [CrossRef]
  38. Du, X.; Zheng, X.; Lu, X.; Doudkin, A.A. Multisource remote sensing data classification with graph fusion network. IEEE Trans. Geosci. Remote Sens. 2021, 59, 10062–10072. [Google Scholar] [CrossRef]
  39. Cao, Z.; Diao, W.; Sun, X.; Lyu, X.; Yan, M.; Fu, K. C3Net: Cross-Modal Feature Recalibrated, Cross-Scale Semantic Aggregated and Compact Network for Semantic Segmentation of Multi-Modal High-Resolution Aerial Images. Remote Sens. 2021, 13, 528. [Google Scholar] [CrossRef]
  40. Stahl, A.T.; Andrus, R.; Hicke, J.A.; Hudak, A.T.; Bright, B.C.; Meddens, A.J. Automated attribution of forest disturbance types from remote sensing data: A synthesis. Remote Sens. Environ. 2023, 285, 113416. [Google Scholar] [CrossRef]
  41. Lv, Z.; Zhang, P.; Sun, W.; Benediktsson, J.A.; Li, J.; Wang, W. Novel adaptive region spectral–spatial features for land cover classification with high spatial resolution remotely sensed imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3275753. [Google Scholar] [CrossRef]
  42. Han, W.; Zhang, X.; Wang, Y.; Wang, L.; Huang, X.; Li, J.; Wang, S.; Chen, W.; Li, X.; Feng, R.; et al. A survey of machine learning and deep learning in remote sensing of geological environment: Challenges, advances, and opportunities. ISPRS J. Photogramm. Remote Sens. 2023, 202, 87–113. [Google Scholar] [CrossRef]
  43. Hao, S.; Zhou, Y.; Guo, Y. A brief survey on semantic segmentation with deep learning. Neurocomputing 2020, 406, 302–321. [Google Scholar] [CrossRef]
  44. Mo, Y.; Wu, Y.; Yang, X.; Liu, F.; Liao, Y. Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 2022, 493, 626–646. [Google Scholar] [CrossRef]
  45. Asgari Taghanaki, S.; Abhishek, K.; Cohen, J.P.; Cohen-Adad, J.; Hamarneh, G. Deep semantic segmentation of natural and medical images: A review. Artif. Intell. Rev. 2021, 54, 137–178. [Google Scholar] [CrossRef]
  46. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  47. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  48. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  49. Zheng, C.; Nie, J.; Wang, Z.; Song, N.; Wang, J.; Wei, Z. High-order semantic decoupling network for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5401415. [Google Scholar] [CrossRef]
  50. Zhou, W.; Zhang, H.; Yan, W.; Lin, W. MMSMCNet: Modal memory sharing and morphological complementary networks for RGB-T urban scene semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7096–7108. [Google Scholar] [CrossRef]
  51. Huo, D.; Wang, J.; Qian, Y.; Yang, Y.H. Glass segmentation with RGB-thermal image pairs. IEEE Trans. Image Process. 2023, 32, 1911–1926. [Google Scholar] [CrossRef]
  52. Zhao, J.; Zhou, Y.; Shi, B.; Yang, J.; Zhang, D.; Yao, R. Multi-stage fusion and multi-source attention network for multi-modal remote sensing image segmentation. ACM Trans. Intell. Syst. Technol. (TIST) 2021, 12, 1–20. [Google Scholar] [CrossRef]
  53. Wu, X.; Hong, D.; Chanussot, J. Convolutional neural networks for multimodal remote sensing data classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5517010. [Google Scholar] [CrossRef]
  54. Hong, D.; Hu, J.; Yao, J.; Chanussot, J.; Zhu, X.X. Multimodal remote sensing benchmark datasets for land cover classification with a shared and specific feature learning model. ISPRS J. Photogramm. Remote Sens. 2021, 178, 68–80. [Google Scholar] [CrossRef] [PubMed]
  55. Amani, M.; Salehi, B.; Mahdavi, S.; Granger, J.E.; Brisco, B.; Hanson, A. Wetland classification using multi-source and multi-temporal optical remote sensing data in Newfoundland and Labrador, Canada. Can. J. Remote Sens. 2017, 43, 360–373. [Google Scholar] [CrossRef]
  56. Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782. [Google Scholar] [CrossRef]
  57. Zhang, M.; Li, W.; Zhang, Y.; Tao, R.; Du, Q. Hyperspectral and LiDAR data classification based on structural optimization transmission. IEEE Trans. Cybern. 2022, 53, 3153–3164. [Google Scholar] [CrossRef]
  58. Zhao, X.; Zhang, M.; Tao, R.; Li, W.; Liao, W.; Tian, L.; Philips, W. Fractional Fourier image transformer for multimodal remote sensing data classification. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 2314–2326. [Google Scholar] [CrossRef] [PubMed]
  59. Sun, Y.; Fu, Z.; Sun, C.; Hu, Y.; Zhang, S. Deep multimodal fusion network for semantic segmentation using remote sensing image and LiDAR data. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5404418. [Google Scholar] [CrossRef]
  60. Li, Y.; Zhou, Y.; Zhang, Y.; Zhong, L.; Wang, J.; Chen, J. DKDFN: Domain knowledge-guided deep collaborative fusion network for multimodal unitemporal remote sensing land cover classification. ISPRS J. Photogramm. Remote Sens. 2022, 186, 170–189. [Google Scholar] [CrossRef]
  61. Cai, Z.; Hu, Q.; Zhang, X.; Yang, J.; Wei, H.; Wang, J.; Zeng, Y.; Yin, G.; Li, W.; You, L.; et al. Improving agricultural field parcel delineation with a dual branch spatiotemporal fusion network by integrating multimodal satellite data. ISPRS J. Photogramm. Remote Sens. 2023, 205, 34–49. [Google Scholar] [CrossRef]
  62. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  63. Zhang, Z.; Zhang, X.; Peng, C.; Xue, X.; Sun, J. Exfuse: Enhancing feature fusion for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 269–284. [Google Scholar]
  64. Peng, C.; Zhang, X.; Yu, G.; Luo, G.; Sun, J. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4353–4361. [Google Scholar]
  65. Liu, T.; Hu, Q.; Fan, W.; Feng, H.; Zheng, D. AMIANet: Asymmetric Multimodal Interactive Augmentation Network for Semantic Segmentation of Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5706915. [Google Scholar] [CrossRef]
  66. Wei, K.; Dai, J.; Hong, D.; Ye, Y. MGFNet: An MLP-dominated gated fusion network for semantic segmentation of high-resolution multi-modal remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2024, 135, 104241. [Google Scholar] [CrossRef]
  67. Ma, X.; Zhang, X.; Pun, M.O. A crossmodal multiscale fusion network for semantic segmentation of remote sensing data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 3463–3474. [Google Scholar] [CrossRef]
  68. Fan, X.; Zhou, W.; Qian, X.; Yan, W. Progressive adjacent-layer coordination symmetric cascade network for semantic segmentation of multimodal remote sensing images. Expert Syst. Appl. 2024, 238, 121999. [Google Scholar] [CrossRef]
  69. He, X.; Wang, M.; Liu, T.; Zhao, L.; Yue, Y. SFAF-MA: Spatial feature aggregation and fusion with modality adaptation for RGB-thermal semantic segmentation. IEEE Trans. Instrum. Meas. 2023, 72, 1–10. [Google Scholar] [CrossRef]
  70. Yang, E.; Zhou, W.; Qian, X.; Lei, J.; Yu, L. DRNet: Dual-stage refinement network with boundary inference for RGB-D semantic segmentation of indoor scenes. Eng. Appl. Artif. Intell. 2023, 125, 106729. [Google Scholar] [CrossRef]
  71. Wang, Y.; Li, G.; Liu, Z. Sgfnet: Semantic-guided fusion network for rgb-thermal semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 7737–7748. [Google Scholar] [CrossRef]
  72. Zhou, J.; Qian, S.; Yan, Z.; Zhao, J.; Wen, H. ESA-Net: A network with efficient spatial attention for smoky vehicle detection. In Proceedings of the 2021 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Virtual, 17–20 May 2021; pp. 1–6. [Google Scholar]
  73. Park, S.J.; Hong, K.S.; Lee, S. Rdfnet: Rgb-d multi-level residual feature fusion for indoor semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4980–4989. [Google Scholar]
  74. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  75. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  76. Lan, Y.; Hu, Q.; Wang, S.; Li, J.; Zhao, P.; Ai, M. Research on deep learning-based land cover extraction method using multi-source mixed samples. In MIPPR 2023: Remote Sensing Image Processing, Geographic Information Systems, and Other Applications; SPIE: Bellingham, WA, USA, 2024; Volume 13088, pp. 9–16. [Google Scholar]
Figure 1. The features of the ground objects indicated by the blue arrow exhibit significant differences in VHR images, while the differences are smaller in TIIs.
Figure 2. The pipeline of SFANet. Every feature fusion module contains a SAF module. E* denotes the five encoder blocks E1–E5; D* denotes the four decoder blocks D1–D4.
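For orientation, the PyTorch sketch below mirrors Figure 2 only at a skeletal level: two modality-specific encoder branches (E1–E5), a per-stage fusion step standing in for the SAF module, and four decoder stages (D1–D4) with skip connections. The class names (SAFFusion, TwoBranchSkeleton), channel widths, and the placeholder concatenation-based fusion are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAFFusion(nn.Module):
    """Hypothetical stand-in for the SAF-equipped fusion step:
    it simply concatenates the two modality features and projects back."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_a, f_b):
        return self.proj(torch.cat([f_a, f_b], dim=1))

class TwoBranchSkeleton(nn.Module):
    """Structural sketch of Figure 2: five encoder stages per modality (E1-E5),
    per-stage fusion, and four decoder stages (D1-D4) with skip connections."""
    def __init__(self, in_ch_a=3, in_ch_b=1, widths=(64, 128, 256, 512, 512), n_classes=5):
        super().__init__()
        def stage(cin, cout):  # one stride-2 conv block per encoder stage
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.enc_a, self.enc_b, self.fuse = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        ca, cb = in_ch_a, in_ch_b
        for w in widths:
            self.enc_a.append(stage(ca, w)); ca = w
            self.enc_b.append(stage(cb, w)); cb = w
            self.fuse.append(SAFFusion(w))
        # D1-D4: each decoder stage upsamples and merges the next shallower fused feature
        self.dec = nn.ModuleList(
            nn.Sequential(nn.Conv2d(widths[i] + widths[i - 1], widths[i - 1], 3, padding=1),
                          nn.BatchNorm2d(widths[i - 1]), nn.ReLU(inplace=True))
            for i in range(len(widths) - 1, 0, -1))
        self.head = nn.Conv2d(widths[0], n_classes, 1)

    def forward(self, x_a, x_b):
        skips = []
        for ea, eb, fu in zip(self.enc_a, self.enc_b, self.fuse):
            x_a, x_b = ea(x_a), eb(x_b)
            skips.append(fu(x_a, x_b))  # fused feature at each encoder stage
        y = skips[-1]
        for i, dec in enumerate(self.dec):
            skip = skips[-2 - i]
            y = F.interpolate(y, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            y = dec(torch.cat([y, skip], dim=1))
        logits = self.head(y)  # still at 1/2 of the input resolution
        return F.interpolate(logits, scale_factor=2, mode="bilinear", align_corners=False)
```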
Figure 3. The structure of the SAF module.
Figure 4. The structure of the ASE module.
Figure 5. The IRRG, DSM, and GT of a sample from the ISPRS Vaihingen dataset.
Figure 6. Data samples of the self-annotated dataset. Each sample set includes TIIs, VHR images, and multispectral images with the same resolution, all of which are registered, along with a pixel-level label.
Figure 7. Qualitative comparison between the proposed network and existing segmentation networks on the self-annotated dataset test set. Groups (a–f) present the imagery, the annotations, and six comparative results across the test set.
Figure 8. Qualitative comparison between the proposed network and existing segmentation networks on the ISPRS Vaihingen dataset test set. Groups (a–e) present the imagery, the annotations, and five comparative results across the test set.
Figure 9. Confusion matrices of the baseline and SFANet on the test set. Each percentage represents the proportion of pixels of a specific class relative to the total number of pixels. The red box highlights our module’s role in disambiguating building and vegetation.
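The percentages in Figure 9 are the entries of the pixel-level confusion matrix divided by the total number of pixels, so the whole matrix sums to 100%. A minimal NumPy sketch of this normalization, assuming integer label maps with values 0…C−1, is given below; it is a generic routine rather than the evaluation code used in the paper.

```python
import numpy as np

def percentage_confusion(y_true, y_pred, num_classes):
    """Pixel-level confusion matrix normalized by the total pixel count,
    as in Figure 9 (each cell is a share of all pixels, not of its row)."""
    y_true = y_true.ravel()
    y_pred = y_pred.ravel()
    idx = y_true * num_classes + y_pred                     # joint (true, pred) index per pixel
    counts = np.bincount(idx, minlength=num_classes ** 2)
    cm = counts.reshape(num_classes, num_classes).astype(float)
    return 100.0 * cm / cm.sum()                            # percentages of total pixels

# e.g. percentage_confusion(gt_map, pred_map, num_classes=5)
```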
Figure 10. Qualitative comparison between SFANet and the baseline.
Figure 11. Analysis of δ in SFANet.
Figure 12. The impact of the auxiliary loss weighting factor α on model accuracy.
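Figure 12 sweeps the auxiliary loss weighting factor α. The sketch below only illustrates the generic way such a factor enters a training objective, total loss = main loss + α·auxiliary loss, with cross-entropy assumed for both terms; it does not reproduce the paper's exact loss formulation.

```python
import torch.nn.functional as F

def total_loss(main_logits, aux_logits, target, alpha=0.4):
    """Hypothetical combined objective: main segmentation loss plus an
    auxiliary (e.g., deep-supervision) loss scaled by the weighting factor alpha."""
    main = F.cross_entropy(main_logits, target)
    aux = F.cross_entropy(aux_logits, target)
    return main + alpha * aux
```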
Table 1. Model performance under different data combinations.
TII | Multispectral | Asphalt Road | Masonry Road | Water | Building | Vegetation
 | | 40.0% | 46.9% | 85.8% | 29.6% | 50.4%
 | | 68.4% | 76.1% | 96.3% | 50.3% | 78.4%
 | | 72.4% | 81.0% | 98.0% | 55.5% | 76.7%
Table 2. Quantitative comparison of different segmentation networks on the self-annotated dataset test set. Bold highlighting denotes the highest accuracy achieved for each ground object category.
Method | Type | Asp. Road | Mas. Road | Water | Bui. | Veg. | mIoU
U-Net (2015) [46] | Unimodal | 78.4% | 74.4% | 97.3% | 52.2% | 80.7% | 76.60%
SegNet (2017) [74] | Unimodal | 70.2% | 77.7% | 97.8% | 56.4% | 74.8% | 75.38%
HRNet (2020) [75] | Unimodal | 73.1% | 81.2% | 96.9% | 56.7% | 79.3% | 77.44%
DeeplabV3+ (2018) [48] | Unimodal | 79.3% | 80.4% | 98.1% | 55.5% | 78.3% | 78.32%
Baseline | Multimodal | 85.0% | 81.6% | 97.6% | 43.6% | 75.4% | 76.64%
RDFNet (2017) [73] | Multimodal | 79.5% | 84.0% | 98.0% | 55.1% | 75.3% | 78.38%
ESANet (2021) [72] | Multimodal | 81.8% | 82.3% | 97.3% | 57.8% | 74.6% | 78.76%
SFAFMA (2023) [69] | Multimodal | 82.7% | 83.7% | 97.8% | 53.1% | 77.6% | 78.98%
CMFNet (2022) [67] | Multimodal | 82.1% | 79.2% | 97.4% | 57.8% | 76.6% | 78.62%
DRNet (2023) [70] | Multimodal | 85.0% | 80.3% | 98.1% | 50.7% | 84.0% | 79.62%
MGFNet (2024) [66] | Multimodal | 86.0% | 79.3% | 97.6% | 58.8% | 80.6% | 80.46%
SGFNet (2023) [71] | Multimodal | 86.7% | 82.5% | 97.5% | 51.6% | 84.1% | 80.48%
FTransUNet (2024) [33] | Multimodal | 86.7% | 81.0% | 97.4% | 59.9% | 79.0% | 80.80%
PACSCNet (2024) [68] | Multimodal | 81.6% | 80.0% | 98.1% | 61.8% | 81.9% | 80.68%
SFANet (Ours) | Multimodal | 83.3% | 82.3% | 98.1% | 64.2% | 83.6% | 82.30%
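The per-class scores and the mIoU reported in Tables 2–4 follow the standard intersection-over-union definition, IoU = TP / (TP + FP + FN), averaged over classes. A compact NumPy sketch (a generic metric routine, not the authors' evaluation script) is:

```python
import numpy as np

def iou_from_confusion(cm):
    """Per-class IoU and mIoU from a C x C confusion matrix
    (rows = ground truth, columns = prediction)."""
    cm = cm.astype(float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1e-12)  # avoid division by zero
    return iou, iou.mean()
```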
Table 3. Quantitative comparison of different segmentation networks on the ISPRS Vaihingen dataset test set. Bold highlighting denotes the highest accuracy achieved for each ground object category.
Method | Type | Imp. Surf. | Building | Low Veg. | Tree | Car | mIoU
U-Net (2015) [46] | Unimodal | 72.0% | 80.5% | 56.4% | 70.4% | 46.2% | 65.10%
SegNet (2017) [74] | Unimodal | 71.5% | 79.2% | 54.5% | 69.9% | 41.7% | 63.36%
HRNet (2020) [75] | Unimodal | 74.4% | 81.6% | 56.4% | 70.4% | 46.7% | 65.90%
Baseline | Multimodal | 77.3% | 85.9% | 60.8% | 73.3% | 57.1% | 70.88%
SFAFMA (2023) [69] | Multimodal | 78.3% | 88.0% | 61.2% | 73.7% | 57.3% | 71.70%
CMFNet (2022) [67] | Multimodal | 75.9% | 85.1% | 60.6% | 72.4% | 66.1% | 72.02%
ESANet (2021) [72] | Multimodal | 78.3% | 87.5% | 61.7% | 74.0% | 62.1% | 72.72%
DRNet (2023) [70] | Multimodal | 80.3% | 88.4% | 62.3% | 74.2% | 67.1% | 74.46%
MGFNet (2024) [66] | Multimodal | 78.6% | 87.9% | 62.8% | 75.0% | 68.8% | 74.62%
FTransUNet (2024) [33] | Multimodal | 79.2% | 87.2% | 62.4% | 74.3% | 69.9% | 74.60%
PACSCNet (2024) [68] | Multimodal | 79.9% | 88.4% | 62.3% | 74.6% | 69.0% | 74.84%
SFANet (Ours) | Multimodal | 80.4% | 89.7% | 63.4% | 74.2% | 70.5% | 75.64%
Table 4. The results of the ablation study on the self-annotated dataset.
Method | Asp. Road | Mas. Road | Water | Bui. | Veg. | mIoU | PA
Baseline | 85.0% | 81.6% | 97.6% | 43.6% | 75.4% | 76.64% | 83.04%
Baseline + SAF | 84.4% | 78.6% | 98.0% | 58.7% | 84.4% | 80.82% | 86.50%
Baseline + ASE | 84.6% | 80.2% | 94.7% | 55.3% | 80.9% | 79.14% | 84.82%
Baseline + SAF + ASE | 83.3% | 82.3% | 98.1% | 64.2% | 83.6% | 82.30% | 87.03%
Table 5. The impact of effective ground object settings in thermal infrared/multispectral imagery on model accuracy. Bold highlighting indicates the highest accuracy.
Ground Object Setting | Multispectral | Infrared
Asphalt road | 78.92% | 80.62%
Masonry road | 80.68% | 80.18%
Building | 80.72% | 82.30%
Water | 80.58% | 80.58%
Vegetation | 82.30% | 80.82%
Table 6. Comparison of model complexity and computational efficiency on the self-annotated dataset.
Method | Parameters (M) | Speed (FPS)
MGFNet | 109.37 | 11.02
SGFNet | 125.26 | 10.27
FTransUNet | 160.88 | 7.13
PACSCNet | 133.07 | 10.35
SFANet (Ours) | 108.36 | 10.96
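The quantities in Table 6 can be measured generically: the parameter count is the sum of trainable tensor sizes, and FPS is the average throughput of repeated timed forward passes. The PyTorch sketch below shows one such measurement routine under assumed input shapes; it is not the benchmark script used for the table.

```python
import time
import torch

def count_params_m(model):
    """Total trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, input_shapes, device="cuda", warmup=10, iters=100):
    """Average forward-pass throughput (frames per second) for a multi-input model."""
    model = model.to(device).eval()
    inputs = [torch.randn(1, *s, device=device) for s in input_shapes]
    for _ in range(warmup):
        model(*inputs)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(*inputs)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters / (time.time() - start)

# e.g. measure_fps(net, [(3, 512, 512), (1, 512, 512)])  # VHR + thermal inputs (assumed sizes)
```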