Article

Unsupervised Multi-Scale Hybrid Feature Extraction Network for Semantic Segmentation of High-Resolution Remote Sensing Images

1 Xi’an Key Laboratory of Network Convergence Communication, School of Communication and Information Engineering, Xi’an University of Science and Technology, Xi’an 710054, China
2 School of Electronics Engineering, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(20), 3774; https://doi.org/10.3390/rs16203774
Submission received: 8 September 2024 / Revised: 8 October 2024 / Accepted: 9 October 2024 / Published: 11 October 2024

Abstract

Generating pixel-level annotations for semantic segmentation tasks of high-resolution remote sensing images is both time-consuming and labor-intensive, which has led to increased interest in unsupervised methods. Therefore, in this paper, we propose an unsupervised multi-scale hybrid feature extraction network based on the CNN-Transformer architecture, referred to as MSHFE-Net. The MSHFE-Net consists of three main modules: a Multi-Scale Pixel-Guided CNN Encoder, a Multi-Scale Aggregation Transformer Encoder, and a Parallel Attention Fusion Module. The Multi-Scale Pixel-Guided CNN Encoder is designed for multi-scale, fine-grained feature extraction in unsupervised tasks, efficiently recovering local spatial information in images. Meanwhile, the Multi-Scale Aggregation Transformer Encoder introduces a multi-scale aggregation module, which further enhances the unsupervised acquisition of multi-scale contextual information, obtaining global features with stronger feature representation. The Parallel Attention Fusion Module employs an attention mechanism to fuse global and local features in both channel and spatial dimensions in parallel, enriching the semantic relations extracted during unsupervised training and improving the performance of unsupervised semantic segmentation. K-means clustering is then performed on the fused features to achieve high-precision unsupervised semantic segmentation. Experiments with MSHFE-Net on the Potsdam and Vaihingen datasets demonstrate its effectiveness in significantly improving the accuracy of unsupervised semantic segmentation.

1. Introduction

With advancements in space information technology, remote sensing images obtained from satellites and Synthetic Aperture Radar (SAR) [1,2] now provide broader coverage, richer content, and more detailed information. These high-resolution optical images and SAR data are highly valuable and crucial for automated monitoring across various remote sensing applications [3,4,5]. Semantic segmentation, a key technique for scene understanding in high-resolution remote sensing images, has been extensively applied in critical areas such as environmental monitoring [6], crop cover classification [7], type identification [8], and land use analysis [9]. This has become a significant research topic in recent years [10,11]. Traditional low-resolution remote sensing images suffer from low spatial resolution, resulting in limitations such as feature confusion and blurred details. The advent of high-resolution remote sensing images [12], obtained through platforms like satellites, aerial vehicles, and drones, addresses these limitations by providing more detailed surface information, including buildings, roads, and vegetation [13].
With the advancements in deep learning techniques, Convolutional Neural Networks (CNNs), known for their strong ability to learn image features, have been widely employed in supervised semantic segmentation tasks. In order to achieve more refined pixel-level classification, researchers proposed Fully Convolutional Networks (FCNs) with an encoder–decoder structure. PSPNet extracts features at different scales through parallel pooling, enhancing feature representation and thereby improving segmentation accuracy. To tackle the challenges posed by varying scales and textures, the End-to-End Integrated Fully Convolutional Network (EFCNet) utilizes the Adaptive Fusion Module (AFM) to learn feature weighting maps and adaptively fuse features [14]. The Separable Convolution Module (SCM) further reduces the number of model parameters by mapping each convolution kernel channel to a feature map, thereby reducing model complexity while enhancing feature fusion. However, the local receptive fields of convolutional operations prevent CNN-based methods from capturing long-range dependencies. In contrast, the Vision Transformer (ViT) shows great potential in modeling long-distance dependencies and obtains excellent results in semantic segmentation. The Class-Guided Swin Transformer (CG-Swin) [15] employs a Transformer-based encoder–decoder architecture by introducing the Swin Transformer backbone as an encoder and designing a class-guided Transformer block to construct the decoder. Recently, the Segment Anything Model (SAM) has emerged in segmentation tasks, learning complex object features by training on large-scale image datasets. This model exhibits strong generalization capabilities, making it applicable to various computer vision scenarios. Armin Moghimi et al. [16] compared the performance of SAM with other deep learning models across multiple datasets and demonstrated that, while SAM achieves the highest accuracy, it also incurs relatively high computational overhead. Lucas Prado Osco et al. [17] created an automated segmentation approach by integrating SAM with textual cues and the Grounding DINO method, showcasing robust generalization across a wide range of remote sensing datasets.
However, existing supervised methods require large-scale pixel-level annotations [18], which incur substantial annotation costs. This has led to the emergence of weakly supervised and unsupervised semantic segmentation approaches, which can be developed without expensive pixel-level annotations. Among these approaches, unsupervised semantic segmentation [19,20] remains one of the most challenging tasks, as it requires capturing pixel-level semantics from unlabeled data. In this context, a clustering-based learning approach [21] was proposed that assigns each pixel to a class by minimizing the Euclidean distance between the pixel and the center of its cluster, and that maintains semantic consistency by pulling together augmented views at the pixel level. More recently, because discovering pixel-level semantics from scratch is difficult, STEGO [22] decomposed the problem into learning representations and learning a segmentation head. STEGO captures the relationships between pixels through their feature similarity. Its training process minimizes an energy function and encourages pixels with similar features to belong to the same cluster, effectively forming coherent regions that represent different semantic categories in the image. It uses patch-level representations learned in earlier work on unsupervised learning to train classification heads with a distillation strategy [23]. Meanwhile, Unsupervised Domain Adaptation (UDA) [24] is also a promising unsupervised learning strategy that has been investigated and has achieved significant performance improvements. UDA-based methods aim to transfer segmentation knowledge from densely labeled source domains to unlabeled target domains in remote sensing imagery, mitigating inter- and intra-domain differences through adversarial training. RoadDA [25] uses a feature pyramid fusion module in a Generative Adversarial Network (GAN)-based approach to avoid information loss during feature extraction, speed up feature alignment between the target and source domains, and address the Domain Shift (DS) problem, ensuring that the trained model generalizes well to the target domain. MemoryAdaptNet [26] improves segmentation accuracy by extracting features through a CNN encoder and employing adversarial learning and an invariant feature memory module to mitigate domain shift between the source and target domains.
Despite the great progress achieved by these methods, they still face certain shortcomings and challenges, which can be summarized as follows. (1) Existing feature extraction methods rely on simple feature extraction backbones that are not specifically optimized for the segmentation task, which hinders the extraction of detailed feature information from high-resolution remote sensing images. (2) The feature extraction backbones used in existing methods primarily consist of CNN encoders and Transformer encoders. Common CNN encoder backbones, such as ResNet and VGG, excel at extracting fine-grained local features but fall short in capturing global contextual information. (3) The Transformer encoder captures global contextual information via the attention mechanism, enabling long-range information interaction. However, this emphasis on global context limits the Transformer encoder’s effectiveness in local feature extraction tasks.
To address the above problems, this paper proposes an unsupervised and efficient high-resolution remote sensing feature extraction network (MSHFE-Net) that combines the CNN [27] and Transformer [28] architectures. For unsupervised semantic segmentation tasks, the MSHFE-Net presented in this study employs a Multi-Scale Pixel-Guided CNN Encoder to extract local features and a Multi-Scale Aggregation Transformer Encoder to capture global features. The Parallel Attention Fusion Module integrates local and global features into more expressive representations to ultimately achieve unsupervised semantic segmentation of high-resolution remote sensing images by clustering the extracted features. The main contributions and innovations of this paper are as follows:
(1)
The Multi-Scale Pixel-Guided CNN Encoder employs a parallel architecture, using deformable convolutions to learn offsets and adaptively adjust the convolutional receptive fields. These adjustments enhance the model’s ability to extract nonlinear and deformed features in the absence of label guidance and to achieve more accurate extraction of multi-scale local features when handling complex spatial structures in remote sensing images. This approach enhances both the model’s flexibility and accuracy. Additionally, the pixel-guided fusion module assesses the confidence levels of the extracted multi-scale local features within the same class and guides the precise fusion of local pixel features. This significantly enhances the model’s ability to capture fine-grained features in unsupervised semantic segmentation tasks.
(2)
The Multi-Scale Aggregation Transformer Encoder efficiently aggregates deeper multi-scale global features for improved feature representation. The integration of the Parallel Aggregation Pyramid Pooling Module (PAPPM) after the Feed-Forward Neural Network (FFN) layer of the conventional Transformer encoder allows the extracted contextual information to be extended to multiple scales in parallel and further fused. This approach effectively extracts deeper multi-scale features during unsupervised semantic segmentation training while minimizing computational overhead, thereby making the aggregation of global contextual information both more efficient and comprehensive. Additionally, it enhances the model’s ability to capture global contextual information in complex remote sensing scenes.
(3)
The Parallel Attention Fusion Module combines a channel attention module and a spatial attention module in parallel, with each stage processing features for fusion in the channel and spatial dimensions. This module fuses the multi-scale global and local features with the features fused at the previous stage $(n-1)$ during stages $n = 2, 3, \ldots, N$. This approach gradually enhances the expressiveness of the fused features throughout the training process, ultimately providing more accurate features for unsupervised clustering.
The rest of the paper is organized as follows: Section 2 briefly reviews popular CNN models and Transformer models. Section 3 describes in detail the proposed MSHFE-Net framework, which mainly consists of the Multi-Scale Pixel-Guided CNN Encoder, Multi-Scale Aggregated Transformer Encoder, and Parallel Attention Fusion Module. Finally, Section 4 gives the semantic segmentation results and analysis of high-resolution remote sensing images on two public datasets to verify the effectiveness of MSHFE-Net, and Section 5 summarizes the article.

2. Related Work

Feature extraction is vital for semantic segmentation of high-resolution remote sensing images. By effectively extracting features, the model can accurately identify targets and distinguish subtle differences between targets and the background, improving target detection and classification accuracy. Additionally, it can handle multi-class classification tasks, integrating spatial information to enhance image understanding and improve model robustness.
Earlier neural networks improved segmentation performance through supervised learning. For example, Ling Dai et al. [29] proposed Road Augmented Deformable Attention Networks (RADANet) for extracting roads from high-resolution remote sensing images. They developed a Road Augmentation Module (RAM) and a Deformable Attention Module (DAM), which combine deformable convolution with spatial self-attention mechanisms to enhance the accuracy of road feature extraction. However, producing pixel-level labels entails a high labor cost. In order to reduce this cost, researchers started to investigate unsupervised methods. Xiao et al. [30] designed a novel encoder structure based on the Deformable Attention Module, utilizing a tandem connection between CNN and Transformer. The local features extracted by the CNN are fed into the Transformer to further capture global features, effectively combining both types of features and enhancing the model’s feature learning capability. Song et al. [31] used a Transformer to hierarchically extract multi-scale features, while employing a CNN decoder to perform contextual aggregation at multiple scales and micro-feature clustering to achieve semantic segmentation of remote sensing images. Qiu et al. [32] combined CNN and Transformer to design the split-depth transposition encoder. The model employed a staged architecture, where each stage expands the network’s receptive field using a split-depth transposition encoder to extract multi-scale features. The feature representation capability was progressively enhanced through stage-by-stage training. Cui et al. [33] utilized multi-layer local features and global correlations for feature fusion in the Transformer encoder and introduced a convolutional block attention module to adaptively capture multi-scale features, which greatly improved segmentation accuracy. Yang et al. [34] proposed a hierarchical context-integrated network based on multi-element features, which deeply integrates global, local, multi-scale, and edge information, significantly enhancing feature generalization during training and reducing category recognition errors.

3. Materials and Methods

In complex scenes, multi-scale contextual features are essential for capturing detailed image information. However, in unsupervised semantic segmentation, key information is often lost. The complexity of remote sensing images can cause confusion between small targets and boundaries when relying solely on global information. Enhancing feature mapping to capture fine-grained details improves the model’s accuracy in recognizing small targets and boundaries. To address this challenge, this paper proposes MSHFE-Net.
This section describes the model structure of MSHFE-Net. As shown in Figure 1, MSHFE-Net contains three main modules: (1) The Multi-Scale Pixel-Guided CNN Encoder is designed to accurately extract multi-scale local spatial information for unsupervised semantic segmentation tasks. (2) The Multi-Scale Aggregated Transformer Encoder captures multi-scale long-range dependencies and global context information during training without label guidance. (3) The Parallel Attention Fusion Module is designed to fuse local and global features, enhancing feature representation. MSHFE-Net starts with adaptive downsampling via deformable convolution and introduces a pixel-guided fusion module to fuse multi-scale local features. At the same time, the input remote sensing image is cut into fixed-size image blocks with positional embedding, and global features at multiple scales are captured by a multi-scale aggregated Transformer encoder layer. Finally, MSHFE-Net uses a parallel attention fusion module to fuse the multi-scale global and local features and performs unsupervised semantic segmentation of high-resolution remote sensing images via K-means clustering on the fused features.
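The final clustering step can be illustrated with a minimal K-means sketch in plain Python. It is only an illustration of the standard Lloyd iteration on toy 2-D feature vectors; the deterministic initialization from the first `k` points, the fixed iteration count, and the function name `kmeans` are assumptions made for this example and are not taken from the paper.

```python
def kmeans(points, k, iterations=10):
    """Cluster feature vectors with plain Lloyd iterations.

    points: list of equal-length feature vectors (one per pixel).
    The first k points serve as initial centers (an assumption made
    here so that the example is deterministic).
    """
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center
        # (squared Euclidean distance).
        for idx, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            labels[idx] = dists.index(min(dists))
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centers


# Toy "fused features": two well-separated groups of pixels.
features = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
labels, centers = kmeans(features, k=2)
```

In practice each pixel of the fused feature map would contribute one high-dimensional vector, and the resulting cluster index per pixel forms the segmentation map.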

3.1. Multi-Scale Pixel-Guided CNN Encoder

In unsupervised semantic segmentation tasks, small objects in complex environments often result in classification errors. Unlike global features, local features are more effective in identifying and classifying intricate fine-grained attributes in images. Therefore, it is essential to develop a task-specific local feature extraction method that addresses the classification errors associated with small objects and target boundaries.
Deformable Convolution [35] is a variant of the standard convolution used in Convolutional Neural Networks (CNNs) that can adaptively change its receptive field to capture deformed and nonlinear features in an image by introducing offsets $\{\Delta P_n \mid n = 1, \ldots, N\}$, as shown in Figure 2.
Deformable convolution realizes more accurate feature extraction for key parts of the image (e.g., edges or corners of objects), thus improving the model’s ability to deal with spatial changes in complex scenes. Define $P_0$ as a pixel position in the output feature map; then, the output feature $y$ is
$$y(P_0) = \sum_{P_n \in R} w(P_n) \cdot x(P_0 + P_n)$$
where $R$ is the regular sampling grid that defines the receptive field size and dilation of the convolution, and $P_n$ enumerates the positions in $R$.
The offsets $\{\Delta P_n \mid n = 1, \ldots, N\}$ are added to the positions of the regular grid $R$, where $N = |R|$. Because sampling then takes place at irregular, offset positions $P_0 + P_n + \Delta P_n$, bilinear interpolation is used to obtain the pixel values at the offset positions:
$$x(p) = \sum_{q} G(q, p) \cdot x(q)$$
where $p = P_0 + P_n + \Delta P_n$ is the offset pixel position and $G(\cdot, \cdot)$ is the bilinear interpolation kernel.
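The sampling rule above can be made concrete with a small plain-Python sketch that evaluates one output location of a 3 × 3 deformable convolution. It is a sketch under stated assumptions, not the paper's implementation: the offsets are passed in by hand (in a real network a separate convolution predicts them), positions outside the map contribute zero, and the names `bilinear_sample` and `deformable_conv_at` are hypothetical.

```python
import math


def bilinear_sample(x, py, px):
    """Sample the 2-D map x (list of rows) at a fractional position (py, px).

    Positions outside the map contribute zero (zero padding, an assumption).
    """
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < h and 0 <= qx < w:
                # Separable bilinear kernel G(q, p) = g(qy, py) * g(qx, px),
                # with g(a, b) = max(0, 1 - |a - b|).
                g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
                value += g * x[qy][qx]
    return value


def deformable_conv_at(x, weights, offsets, p0):
    """Output y(P0) of a deformable convolution at one location P0.

    weights: dict mapping a grid position Pn = (dy, dx) to w(Pn).
    offsets: dict mapping Pn to its offset (ddy, ddx); missing entries mean 0.
    """
    y = 0.0
    for (dy, dx), w_n in weights.items():
        ddy, ddx = offsets.get((dy, dx), (0.0, 0.0))
        y += w_n * bilinear_sample(x, p0[0] + dy + ddy, p0[1] + dx + ddx)
    return y


# Regular 3x3 grid R and a toy 4x4 feature map with value 4*row + col.
grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
feature = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
mean_filter = {p: 1.0 / 9.0 for p in grid}

# Without offsets this is an ordinary 3x3 mean filter centered at (1, 1).
no_offset = deformable_conv_at(feature, mean_filter, {}, (1, 1))
# Shifting every sampling point half a pixel to the right moves the window.
shifted = deformable_conv_at(
    feature, mean_filter, {p: (0.0, 0.5) for p in grid}, (1, 1)
)
```

With zero offsets the result coincides with a standard convolution, which is why deformable convolution can be seen as a strict generalization of it.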
In the Multi-Scale Pixel-Guided CNN Encoder proposed in this paper, the image is first downsampled by deformable convolution to sizes of $\frac{w}{8} \times \frac{h}{8}$, $\frac{w}{16} \times \frac{h}{16}$, and $\frac{w}{32} \times \frac{h}{32}$, where $w$ and $h$ denote the width and height of the input image. Each downsampling stage contains a deformable convolutional layer and a ReLU activation function. The structure of the Multi-Scale Pixel-Guided CNN Encoder is shown in Figure 3.
Additionally, to enhance information transfer between features at different scales, a Pixel-Guided Fusion Module is introduced, as illustrated in Figure 4. For the features of size $\frac{w}{8} \times \frac{h}{8}$, useful semantic information is selectively fused from the features at the other scales, which further promotes semantic consistency across scales. Defining the vectors of corresponding pixels in the feature map of size $\frac{w}{8} \times \frac{h}{8}$ and in the feature map of another scale as $v_t$ and $v_q$, respectively, the output of the Sigmoid function can be expressed as
$$\sigma = \mathrm{Sigmoid}(f_t(v_t) \cdot f_q(v_q))$$
where $\sigma$ indicates the probability that the two pixels belong to the same object. If $\sigma$ is high, more trust is placed on $v_t$; conversely, more trust is placed on $v_q$. Therefore, the output of the Pixel-Guided Fusion Module is denoted as
$$Out_{fusion} = \sigma v_t + (1 - \sigma) v_q$$
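As a rough illustration of this fusion rule for a single pair of pixel vectors, the sketch below computes the gate and the blended output in plain Python. The embedding functions $f_t$ and $f_q$ are not specified in this section, so they default to the identity here; that default, and the function name `pixel_guided_fusion`, are assumptions of this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pixel_guided_fusion(v_t, v_q, f_t=lambda v: v, f_q=lambda v: v):
    """Fuse two pixel vectors with a similarity-driven gate.

    f_t and f_q stand in for the embedding functions of the module;
    the identity default is an assumption made for illustration only.
    """
    e_t, e_q = f_t(v_t), f_q(v_q)
    # The gate is the sigmoid of the dot product of the two embeddings.
    sigma = sigmoid(sum(a * b for a, b in zip(e_t, e_q)))
    # High sigma keeps v_t, low sigma falls back to v_q.
    fused = [sigma * a + (1.0 - sigma) * b for a, b in zip(v_t, v_q)]
    return sigma, fused


# Identical vectors: large dot product, so the gate is close to 1.
sigma_same, fused_same = pixel_guided_fusion([1.0, 2.0], [1.0, 2.0])
# Orthogonal vectors: zero dot product, so both inputs are weighted equally.
sigma_orth, fused_orth = pixel_guided_fusion([1.0, 0.0], [0.0, 1.0])
```

Applied to every pixel position of two aligned feature maps, the same rule yields the fused map.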

3.2. Multi-Scale Aggregated Transformer Encoder

The Vision Transformer (ViT) backbone [36] performs well in segmentation tasks, capturing long-range dependencies and global contextual information over the entire image. However, extracting more global features with improved feature representation in unsupervised semantic segmentation tasks necessitates the use of a Transformer layer with enhanced feature extraction capabilities, which significantly increases computational cost during training. This can lead to gradient explosion and ultimately degrade the model’s performance. Efficient extraction of global features is therefore crucial for the Transformer encoder.
In this paper, the Multi-Scale Aggregated Transformer Encoder first cuts the input image $x \in \mathbb{R}^{H \times W \times C}$ into fixed-size image blocks $x_p \in \mathbb{R}^{N \times (P^2 \times C)}$ and encodes their positions in order to extract multi-scale features from them. Define $pos$ as the position index of an image block in the whole image. The position encoding matrix is calculated by sine and cosine functions, with entries $PE(pos, 2i)$ and $PE(pos, 2i+1)$:
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$
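The sinusoidal encoding can be computed directly from these two formulas. The following plain-Python sketch builds the encoding matrix row by row; the values `num_positions=196` (a 224 × 224 input split into 16 × 16 blocks) and `d=8` are assumptions chosen for the example, and an even `d` is assumed.

```python
import math


def positional_encoding(num_positions, d):
    """Return a num_positions x d matrix of sinusoidal position codes.

    Even columns hold the sine terms and odd columns the cosine terms;
    d is assumed to be even.
    """
    pe = [[0.0] * d for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(d // 2):
            angle = pos / (10000 ** (2 * i / d))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe


# Hypothetical setting: 196 image blocks, embedding dimension 8.
pe = positional_encoding(num_positions=196, d=8)
```

Each row of `pe` corresponds to one image block; low-index dimensions vary quickly with position and high-index dimensions vary slowly, which lets the attention layers distinguish block positions.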
where $i$ is the index of the encoding dimension and $d$ is the dimension of the embedding vector. The multi-scale global features are captured by applying attention and nonlinear transformations to the features through the Multi-head Self-Attention (MSA) layer and the Feed-Forward Neural Network (FFN). The $Q$, $K$, and $V$ of each head are obtained from the following linear transformations:
$$Q_h = PE \cdot W_h^Q$$
$$K_h = PE \cdot W_h^K$$
$$V_h = PE \cdot W_h^V$$
where $W_h^Q$, $W_h^K$, and $W_h^V$ are the weight matrices of $Q$, $K$, and $V$, respectively. The output of each head is calculated from $Q_h$, $K_h$, and $V_h$:
$$\mathrm{head}_h(Q_h, K_h, V_h) = \mathrm{softmax}\left(\frac{Q_h K_h^T}{\sqrt{d_k}}\right) V_h$$
where $d_k$ is the dimension of each head. Then, the output of the Multi-head Self-Attention layer is represented as
$$\mathrm{MultiHead} = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$$
where $W^O$ is an output transformation matrix. After a residual connection and layer normalization, the output of the multi-head self-attention layer becomes the input of the Feed-Forward Neural Network (FFN), $\mathrm{input}_{FFN}$:
$$\mathrm{input}_{FFN} = \mathrm{LayerNorm}(\mathrm{MultiHead} + PE)$$
The Feed-Forward Neural Network consists of two linear transformations with a ReLU activation between them. Define $W_1$ and $b_1$ as the weight matrix and bias vector of the first layer, and $W_2$ and $b_2$ as the weight matrix and bias vector of the second layer:
$$\mathrm{Intermediate} = \mathrm{ReLU}(\mathrm{input}_{FFN} \cdot W_1 + b_1)$$
$$\mathrm{output}_{FFN} = \mathrm{Intermediate} \cdot W_2 + b_2$$
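The attention and feed-forward steps above can be traced on a tiny example. The sketch below is a single-head, plain-Python version intended only to show the order of operations; it omits the learnable scale and shift of layer normalization, uses one head instead of several, and all helper names are hypothetical.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    out = []
    for row in a:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def layer_norm(a, eps=1e-5):
    # Normalizes each row to zero mean and unit variance (no learned scale/shift).
    out = []
    for row in a:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def encoder_block(x, w_q, w_k, w_v, w_o, w_1, b_1, w_2, b_2):
    """Attention, residual + layer norm, then the two-layer FFN."""
    multi_head = matmul(attention_head(x, w_q, w_k, w_v), w_o)  # single head only
    residual = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(multi_head, x)]
    ffn_in = layer_norm(residual)
    hidden = [[max(0.0, h + b) for h, b in zip(row, b_1)]
              for row in matmul(ffn_in, w_1)]  # ReLU(input_FFN * W1 + b1)
    return [[o + b for o, b in zip(row, b_2)] for row in matmul(hidden, w_2)]


# Two tokens of dimension 2 with identity weights, so values are easy to follow.
eye = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
head = attention_head(tokens, eye, eye, eye)
out = encoder_block(tokens, eye, eye, eye, eye, eye, [0.0, 0.0], eye, [0.0, 0.0])
```

Because the value matrix is the identity in this toy setting, `head` equals the attention weights themselves, so each of its rows sums to one and each token attends most strongly to itself.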
PSPNet introduces the Pyramid Pooling Module (PPM) [37], which concatenates multi-scale pooled maps before the convolutional layers to form local and global contextual representations. Deep Aggregation PPM (DAPPM) [38] further improves the context embedding capability of PPM and exhibits excellent performance. Nevertheless, the computation of DAPPM cannot be parallelized across its depth, and DAPPM contains too many channels per scale, which makes it computationally time-consuming. Therefore, in this paper, the extracted global features are processed at multiple scales through convolution and pooling operations along the channel dimension. The features obtained at multiple scales are then upsampled and concatenated so that they are aggregated quickly and in parallel, as illustrated in Figure 5b. To limit the computational cost and prevent gradient explosion during training, the number of channels per scale is reduced from 128 to 64. This new fast context-gathering module, termed Parallel Aggregation PPM (PAPPM), efficiently aggregates global features at multiple scales while minimizing computational effort.
In order to better construct the global scene prior, PAPPM is introduced after the FFN layer in the Transformer encoder to perform further multi-scale global feature extraction, as shown in Figure 5a. Define $\mathrm{Conv}(\mathrm{Avg}(\cdot))_m$ and $\mathrm{Up}_m(\cdot)$ as the pooling-and-convolution operation and the upsampling operation of the $m$-th parallel branch, respectively; then, the output of each parallel branch is denoted as
$$\mathrm{Branch}_m = \mathrm{Up}_m(\mathrm{Conv}(\mathrm{Avg}(\mathrm{output}_{FFN}))_m)$$
where $\mathrm{output}_{FFN}$ is the output of the FFN layer. The output of PAPPM is denoted as
$$\mathrm{PAPPM} = \mathrm{Concat}(\mathrm{Branch}_1, \ldots, \mathrm{Branch}_m)$$
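The branch structure can be sketched on a single-channel toy map in plain Python. Several details are assumptions made only for this illustration and are not stated in the text: the pooling sizes `(1, 2, 4)`, non-overlapping pooling windows, nearest-neighbor upsampling, and a scalar weight standing in for the convolution of each branch.

```python
def avg_pool(fmap, k):
    """Non-overlapping k x k average pooling (map size divisible by k)."""
    h, w = len(fmap), len(fmap[0])
    return [[sum(fmap[r + i][c + j] for i in range(k) for j in range(k)) / (k * k)
             for c in range(0, w, k)]
            for r in range(0, h, k)]


def upsample_nearest(fmap, k):
    """Nearest-neighbor upsampling by an integer factor k."""
    return [[fmap[r // k][c // k] for c in range(len(fmap[0]) * k)]
            for r in range(len(fmap) * k)]


def pappm_sketch(fmap, pool_sizes=(1, 2, 4), conv_weights=None):
    """Pool, 'convolve', and upsample each branch independently, then stack.

    Each branch depends only on the input map, so the branches could run
    in parallel. The scalar conv_weights are a stand-in for real convolutions.
    """
    conv_weights = conv_weights or [1.0] * len(pool_sizes)
    branches = []
    for k, w_m in zip(pool_sizes, conv_weights):
        pooled = avg_pool(fmap, k)
        convolved = [[w_m * v for v in row] for row in pooled]
        branches.append(upsample_nearest(convolved, k))
    return branches  # concatenation along the channel axis


fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
branches = pappm_sketch(fmap)
```

The first branch keeps full-resolution detail, while the coarsest branch collapses to the global mean of the map, which is the kind of scene-level context the module is meant to inject.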
The Multi-Scale Aggregation Transformer Encoder proposed in this paper expands the range of contextual information captured during unsupervised semantic segmentation training, enhancing the generalization ability of the extracted features and improving the expressiveness of the Transformer encoder layer. Compared to previous Transformer encoders, the multi-scale aggregation Transformer encoder can extract more expressive global features and achieve higher segmentation accuracy in unsupervised semantic segmentation tasks, all while utilizing the same Transformer layer.

3.3. Parallel Attention Fusion Module

In neural networks, convolution and pooling operations inevitably result in some degree of information loss across various frequency ranges. However, the attention mechanism [39] can selectively focus on the most relevant parts of the data, thereby enhancing the model’s feature representation. In this study, we introduce both channel attention and spatial attention during the fusion process to capitalize on the complementarity between high-frequency and low-frequency features. By integrating feature information across both channel and spatial dimensions, these mechanisms enhance the robustness of the model.
The Parallel Attention Fusion Module includes a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), as illustrated in Figure 6. From the second stage onward, global features are combined with the fused output features from the previous stage along the channel dimension using convolution. CAM and SAM are subsequently employed to identify and highlight key features within the channel and spatial dimensions, respectively. Through the Parallel Attention Fusion Module, the output from the Multi-Scale Pixel-Guided CNN Encoder is fused with that of the Multi-Scale Aggregated Transformer Encoder. This combined output, which integrates global and local features, is then passed to the next stage. By fully leveraging the complementary nature of these features, the module recovers local spatial information and enhances fine-scale details, thereby improving the model’s overall performance and robustness. Defining $k$ as the stage index, the fused output feature $F_k$ is denoted as
$$F_k = G(CAM(f) + SAM(f))$$
where $f = f_g + f_l + F_{k-1}$, $f_g$ is the global feature, $f_l$ is the local feature, $F_{k-1}$ is the fused output feature from the previous stage, $G(\cdot)$ denotes convolution and normalization, $CAM(\cdot)$ is the channel attention module, and $SAM(\cdot)$ is the spatial attention module.
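A rough sketch of one fusion stage is given below in plain Python. The internal design of CAM and SAM is not detailed in this section, so the sketch substitutes very simple stand-ins: a per-channel gate from global average pooling and a per-pixel gate from the channel mean, both passed through a sigmoid. These stand-ins, the identity used in place of $G(\cdot)$, and all function names are assumptions for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def add(*tensors):
    """Element-wise sum of tensors stored as [channel][row][col] nested lists."""
    return [[[sum(vals) for vals in zip(*rows)] for rows in zip(*chs)]
            for chs in zip(*tensors)]


def channel_attention(f):
    # Stand-in CAM: one gate per channel from its global average.
    out = []
    for ch in f:
        mean = sum(map(sum, ch)) / (len(ch) * len(ch[0]))
        gate = sigmoid(mean)
        out.append([[gate * v for v in row] for row in ch])
    return out


def spatial_attention(f):
    # Stand-in SAM: one gate per pixel from the mean over channels.
    h, w = len(f[0]), len(f[0][0])
    gates = [[sigmoid(sum(ch[r][c] for ch in f) / len(f)) for c in range(w)]
             for r in range(h)]
    return [[[gates[r][c] * ch[r][c] for c in range(w)] for r in range(h)]
            for ch in f]


def fuse_stage(f_g, f_l, f_prev):
    """F_k = G(CAM(f) + SAM(f)) with f = f_g + f_l + F_{k-1}; G is the identity here."""
    f = add(f_g, f_l, f_prev)
    return add(channel_attention(f), spatial_attention(f))


ones = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]
zeros = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
# First stage: no previous fused feature, so zeros are passed for F_{k-1}.
fused = fuse_stage(ones, ones, zeros)
```

The two attention paths read the same summed input and do not depend on each other, which is what allows them to run in parallel before their outputs are added.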

4. Experimental Results and Analysis

4.1. Datasets

In this study, we assess the performance of MSHFE-Net using airborne image datasets from ISPRS’s Urban Classification and 3D Building Reconstruction test programs. These datasets, accessible through the ISPRS semantic labeling benchmark, include digital surface models (DSMs) and cover two areas: Vaihingen, a village with a mix of free-standing and multi-storey buildings, and Potsdam, a high-resolution urban area characterized by expansive structures, narrow alleys, and densely packed neighborhoods. Each dataset was annotated to classify the land cover into six dominant categories, ensuring an accurate representation of the urban scenes.
The Potsdam dataset consists of 28 uniformly sized images, each with a spatial resolution of 5 cm. These images are stored in 8-bit TIFF format and include three spectral bands: near-infrared, red, and green. The DSM data, on the other hand, are provided in a single band. Notably, each image in this dataset covers a ground area of the same size. A sample from the Potsdam dataset is presented in Figure 7.
The Vaihingen dataset comprises 33 remote sensing images of varying dimensions, each precisely extracted from a larger orthophoto. A rigorous selection process was employed to ensure that there are no data gaps. The images are in 8-bit TIFF format and consist of three spectral bands: near-infrared, red, and green. An example of the Vaihingen dataset is illustrated in Figure 8.

4.2. Parameter Setting and Evaluation Index

In this study, we trained our model using the PyTorch framework on high-resolution remote sensing image datasets. The experiments were performed on a personal computer equipped with an 11th-generation Intel(R) Core(TM) i9-11900F CPU (2.50 GHz), an NVIDIA GeForce RTX 3090 GPU, and 32 GB of RAM. The training process began with an initial learning rate of 0.0001 over 15 epochs, with adjustments made every two epochs to facilitate gradual optimization. The cross-entropy loss function was employed to calculate loss, aiding in the convergence of the training process. To ensure compatibility with MSHFE-Net, the high-resolution remote sensing images in the Potsdam and Vaihingen datasets were carefully divided into eight thousand and seven thousand smaller image blocks of 224 × 224 pixels, respectively, for input.
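The division into 224 × 224 blocks can be sketched as a simple sliding-window computation. The stride and the treatment of image borders are not stated in the text, so the sketch assumes non-overlapping tiles and drops partial tiles at the right and bottom edges; the function name `tile_boxes` and the example image size are hypothetical.

```python
def tile_boxes(height, width, tile=224, stride=224):
    """Return (top, left, bottom, right) boxes of full tiles inside an image.

    Tiles that would extend past the image border are skipped
    (an assumption; padding or overlapping the last tile are alternatives).
    """
    boxes = []
    for top in range(0, height - tile + 1, stride):
        for left in range(0, width - tile + 1, stride):
            boxes.append((top, left, top + tile, left + tile))
    return boxes


# Hypothetical 700 x 1000 image: 3 rows x 4 columns of full tiles.
boxes = tile_boxes(height=700, width=1000)
```

Choosing a stride smaller than the tile size would produce overlapping blocks and therefore more training samples from the same images.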
Data augmentation by image flipping and rotation is used to extend the dataset and enhance its diversity. The performance of MSHFE-Net is evaluated using mean Intersection over Union (mIoU), overall accuracy (OA), and F1 score, together with FLOPs and inference time. IoU is the ratio of the intersection to the union of the predicted segmentation and the ground truth. mIoU is the average of the IoU values over all classes. It measures the overlap between the model predictions and the actual class regions and can effectively deal with the problem of spatial confusion between categories. OA measures the overall classification accuracy of the model across all categories and summarizes the model’s overall performance. The F1 score is the harmonic mean of precision and recall, offering a balanced assessment by considering both false positives and false negatives in classification. FLOPs (floating-point operations) measures the computational complexity of the model, indicating the computation required for inference. The higher the FLOPs, the more computationally complex the model is. Inference time represents the time the model takes to process input data during inference. From the confusion matrix, IoU, mIoU, OA, and F1 can be calculated:
$$IoU = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} (TP_k + FP_k + FN_k)}$$
$$mIoU = \frac{1}{K} \sum_{k=1}^{K} \frac{TP_k}{TP_k + FP_k + FN_k}$$
$$OA = \frac{\sum_{k=1}^{K} TP_k}{\sum_{k=1}^{K} (TP_k + FP_k + TN_k + FN_k)}$$
$$mF1 = \frac{1}{K} \sum_{k=1}^{K} \frac{2 \times precision_k \times recall_k}{precision_k + recall_k}$$
where $K$ is the number of classes; $TP_k$ and $TN_k$ are the numbers of positive and negative samples of class $k$ that were classified correctly; $FP_k$ and $FN_k$ are the numbers of samples that were incorrectly classified as positive and as negative, respectively; and $precision_k = \frac{TP_k}{TP_k + FP_k}$ and $recall_k = \frac{TP_k}{TP_k + FN_k}$ are the precision and recall of MSHFE-Net for class $k$.
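These metrics can be derived from a confusion matrix in a few lines. The sketch below is a plain-Python illustration; it assumes that rows index the true class and columns the predicted class, and it takes OA as the fraction of correctly classified pixels, which is an assumption about how the denominator is counted. The function name `segmentation_metrics` and the toy matrix are hypothetical.

```python
def segmentation_metrics(confusion):
    """Compute mIoU, OA, and mF1 from a square confusion matrix.

    confusion[i][j] = number of pixels of true class i predicted as class j
    (this row/column convention is an assumption of the example).
    """
    k = len(confusion)
    total = sum(map(sum, confusion))
    ious, f1s, tp_sum = [], [], 0
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # missed pixels of class c
        fp = sum(confusion[r][c] for r in range(k)) - tp  # wrongly labeled as c
        tp_sum += tp
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    return {
        "mIoU": sum(ious) / k,
        "OA": tp_sum / total,  # correctly classified pixels over all pixels
        "mF1": sum(f1s) / k,
    }


# Toy two-class example with 20 pixels.
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

For unsupervised results, the predicted cluster indices must first be matched to the ground-truth classes before the confusion matrix is built; that matching step is outside the scope of this sketch.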

4.3. Semantic Segmentation Results and Analysis

This section highlights the results obtained by MSHFE-Net. The semantic segmentation results for high-resolution remote sensing images are shown in Figure 9 and Figure 10. Figure 9 presents the results for the Potsdam dataset, while Figure 10 displays the results for the Vaihingen dataset. MSHFE-Net exhibits strong performance across both datasets, underscoring its effectiveness in unsupervised semantic segmentation tasks.
In order to validate the performance of MSHFE-Net, this paper conducts a quantitative comparison experiment, the results of which are shown in Figure 11 and Figure 12. MSHFE-Net is compared with five models: DeeplabV3+ [40], MSANet [41], FLDA-NET [42], JDAF [43], and SAM [44]. DeeplabV3+ uses dilated convolution to capture features at multiple scales, thereby enhancing the extraction of contextual information. MSANet combines low- and high-dimensional features from different levels of feature maps by obtaining semantically similar feature maps, employs Inception-Res-like blocks to capture multi-scale feature maps with varying receptive fields, and uses MSCF blocks with dense connections to mitigate the semantic gap caused by simple skip connections. FLDA-NET enhances cross-domain remote sensing classification by comparing the distribution of source and target domain data at multiple scales across the image, feature, and output levels. JDAF aligns the inter-domain marginal distributions from local to global through the MDA module, uses the CFA block to align the conditional distributions by adversarially training the class-invariant features shared between the two domains, and uses the DCA block to enhance the raw pixel representations and improve intra-class compactness and inter-class separability at the dataset level. Built on the classical Transformer structure, SAM employs an encoder–decoder architecture to project the image into a high-dimensional feature space to extract rich image information. Simultaneously, SAM encodes prompts and embeds them into a space compatible with the image features. Using the self-attention mechanism, the decoder then combines image and prompt features to generate the final segmentation mask. The MSHFE-Net proposed in this paper extracts multi-scale global and local features using a Multi-Scale Aggregated Transformer Encoder and a Multi-Scale Pixel-Guided CNN Encoder, respectively.
It then fuses these global and local features through a Parallel Attention Fusion Module, achieving high-precision semantic segmentation via K-means clustering.
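As a rough illustration of the final clustering step described above, the following sketch applies K-means to a fused per-pixel feature map to obtain a segmentation mask. It is a minimal NumPy example and not the authors' implementation: the function names, the farthest-point initialization, the fixed iteration count, and the toy feature map are all assumptions made for illustration.

```python
import numpy as np


def assign(x, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per row of x."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans_segment(features, k, n_iter=20):
    """Cluster an (H, W, C) feature map into k segments; returns an (H, W) mask."""
    h, w, c = features.shape
    x = features.reshape(-1, c).astype(np.float64)

    # Assumed initialization: start from the first pixel, then repeatedly add
    # the pixel that is farthest from all centroids chosen so far.
    chosen = [x[0]]
    for _ in range(1, k):
        d = np.min([((x - p) ** 2).sum(axis=1) for p in chosen], axis=0)
        chosen.append(x[d.argmax()])
    centroids = np.stack(chosen)

    # Lloyd iterations: assign pixels, then move each centroid to its mean.
    for _ in range(n_iter):
        labels = assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)

    return assign(x, centroids).reshape(h, w)


# Toy stand-in for a fused feature map: two regions with distinct features.
rng = np.random.default_rng(0)
toy = rng.normal(0.0, 0.1, size=(8, 8, 4))
toy[:, 4:, :] += 5.0
mask = kmeans_segment(toy, k=2)
print(mask)
```

In practice the number of clusters would be set to the number of land-cover classes, and the cluster indices carry no semantic meaning until they are matched to class labels.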
Table 1 and Table 2 compare the proposed method with the other methods on the Potsdam and Vaihingen datasets, respectively, where the best-performing metrics are labeled in bold. MSHFE-Net reaches an mIoU of 58.79% and 62.81% and an OA of 79.19% and 83.14% on the two datasets. Compared with the best competing result for each metric (JDAF for mIoU and SAM for OA), MSHFE-Net improves mIoU by 3.27 and 2.74 percentage points and OA by 3.57 and 3.66 percentage points, and it achieves the highest IoU in every category. The F1 scores of MSHFE-Net on the Potsdam and Vaihingen datasets are 77.5 and 80.4, respectively, indicating a good balance between precision and recall. JDAF and SAM excel at extracting boundary information and achieve high segmentation accuracy, but still exhibit instances of misclassification. In contrast, MSHFE-Net reduces misclassified segments and improves boundary recognition and accuracy for smaller objects.
Table 3 compares the computational complexity and inference time of the models. MSANet has the lowest FLOPs (44.7 G) and SAM the highest (185.8 G), while MSHFE-Net requires 159.6 G, which is relatively high. DeepLabV3+ has the shortest inference time (77 ms) and SAM the longest (174 ms). With an inference time of 167 ms, MSHFE-Net is slower than every compared model except SAM, so its improved accuracy comes at the cost of higher computational complexity. Overall, MSHFE-Net offers the best accuracy among the compared models at a computational cost below that of SAM.
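The IoU, mIoU, OA, and F1 values reported in Table 1 and Table 2 can be derived from a pixel-level confusion matrix. The sketch below uses the standard definitions of these metrics. It assumes that each predicted cluster has already been mapped to a ground-truth class index and that F1 is macro-averaged over classes, neither of which is specified in this section; the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(gt, pred, n_classes):
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    idx = gt.ravel() * n_classes + pred.ravel()
    counts = np.bincount(idx, minlength=n_classes ** 2)
    return counts.reshape(n_classes, n_classes)


def segmentation_scores(cm):
    """Return per-class IoU, mIoU, OA and macro F1, all in percent."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)              # correctly labeled pixels per class
    fp = cm.sum(axis=0) - tp      # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp      # belonging to the class, but missed
    # The guard avoids division by zero; a class absent from both maps scores 0.
    iou = tp / np.maximum(tp + fp + fn, 1.0)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1.0)
    oa = tp.sum() / cm.sum()
    return 100 * iou, 100 * iou.mean(), 100 * oa, 100 * f1.mean()


# Toy 2 x 4 label maps with two classes; one pixel is misclassified.
gt = np.array([[0, 0, 1, 1], [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
iou, miou, oa, f1 = segmentation_scores(confusion_matrix(gt, pred, 2))
print(iou, miou, oa, f1)
```

For the toy maps, the per-class IoU values are 3/4 and 4/5, so the mIoU is 77.5%, and 7 of the 8 pixels are correct, so the OA is 87.5%.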

4.4. Ablation Experiments

MSHFE-Net enhances the recognition and classification of small targets by leveraging global context features through the Multi-Scale Aggregation Transformer Encoder and extracting fine-grained local features via the Multi-Scale Pixel-Guided CNN Encoder. The Parallel Attention Fusion Module then combines these features. To verify the contribution of each module, we conduct ablation experiments with four configurations: the Baseline (ResNet50), the Multi-Scale Pixel-Guided CNN Encoder, the Multi-Scale Aggregation Transformer Encoder, and the simultaneous use of all three proposed modules (MSHFE-Net).
Table 4 and Table 5 present the results of the ablation experiments conducted on the Potsdam and Vaihingen datasets, respectively, and Figure 13 and Figure 14 visualize these experiments on the same datasets. The tables show that the modules proposed in this paper substantially improve performance over the baseline network (ResNet50). With all three modules, the mIoU reaches 58.79% on the Potsdam dataset and 62.81% on the Vaihingen dataset, compared with 17.64% and 21.47% for the baseline, while the OA reaches 79.19% and 83.14%, respectively. These results indicate that the combined use of all three modules significantly outperforms both the baseline and the individual modules, particularly in the accurate classification of boundaries and small targets.
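The gains over the baseline can be read off directly from the (mIoU, OA, F1) rows of Table 4 and Table 5. The short script below does this arithmetic. It assumes that the first row of each table is the Baseline and the last row is the full MSHFE-Net, as the text indicates; the names `ABLATION` and `gains_over_baseline` are introduced only for illustration.

```python
# (mIoU, OA, F1) rows copied from Table 4 and Table 5, in table order.
ABLATION = {
    "Potsdam": [
        (17.64, 33.94, 28.7),
        (20.09, 36.77, 34.1),
        (34.89, 54.95, 52.0),
        (58.79, 79.19, 77.5),
    ],
    "Vaihingen": [
        (21.47, 38.87, 30.6),
        (28.31, 47.36, 45.3),
        (37.64, 58.86, 56.9),
        (62.81, 83.14, 80.4),
    ],
}


def gains_over_baseline(rows):
    """Percentage-point gain of each later row over the first (baseline) row."""
    base = rows[0]
    return [
        tuple(round(value - ref, 2) for value, ref in zip(row, base))
        for row in rows[1:]
    ]


for dataset, rows in ABLATION.items():
    for gain in gains_over_baseline(rows):
        print(dataset, gain)
```

On both datasets the full model gains more than 41 percentage points in mIoU over the baseline, far more than either intermediate row.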

5. Conclusions

This paper has presented MSHFE-Net, a model for unsupervised semantic segmentation of high-resolution remote sensing images. The model is a hybrid CNN-Transformer architecture that combines the Multi-Scale Pixel-Guided CNN Encoder and the Multi-Scale Aggregation Transformer Encoder to obtain broader global contextual information while refining fine-grained local texture and boundary features. The Parallel Attention Fusion Module tightly integrates the global and local features to further improve feature representation. By accounting for the consistency of global and local semantics, the method strengthens the recognition of smaller targets in complex remote sensing data and improves the overall segmentation performance and robustness of unsupervised mask prediction for high-resolution remote sensing images. On the Potsdam dataset, the mIoU and OA reached 58.79% and 79.19%, which are 3.27 and 6.81 percentage points higher than those of JDAF, the competing model with the highest mIoU. On the Vaihingen dataset, the mIoU and OA reached 62.81% and 83.14%, which are 2.74 and 4.31 percentage points higher than those of JDAF. Experiments on both datasets confirmed the effectiveness of MSHFE-Net.
Future research will focus on optimizing the model architecture and computational process to reduce the number of parameters and the computational requirements, lowering the complexity of the model while achieving higher performance.

Author Contributions

Conceptualization, W.S.; Methodology, W.S.; Software, F.N.; Validation, W.S., F.N. and C.W.; Formal analysis, Y.J. and Y.W.; Writing—Original Draft Preparation, W.S. and F.N.; Writing—Review and Editing, W.S., F.N. and C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of China under Grant (61901358) and Grant (62172321), in part by the Outstanding Youth Science Fund of Xi’an University of Science and Technology under Grant (2020YQ3-09), in part by the Scientific Research Plan Projects of Shaanxi Education Department under Grant (20JK0757), in part by the China Postdoctoral Science Foundation under Grant (2020M673347), in part by the Natural Science Basic Research Plan in Shaanxi Province of China under Grant (2019JZ-14), and in part by the Civil Space Thirteen Five Years Pre-Research Project under Grant (D040114).

Data Availability Statement

We employed two publicly available 2D semantic labeling datasets—namely, Vaihingen and Potsdam—graciously provided by the International Society for Photogrammetry and Remote Sensing (ISPRS), https://www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx, accessed on 30 May 2024.

Acknowledgments

The authors would like to thank the reviewers and the editor for the constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, J.; Xiong, R.; Yu, H.; Xu, G.; Xing, M. Nonparametric Full-Aperture Autofocus Imaging for Microwave Photonic SAR. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5214815.
  2. Chen, J.; Li, M.; Yu, H.; Xing, M. Full-aperture processing of airborne microwave photonic SAR raw data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5218812.
  3. Khaleel, T.A.; Mustafa, F.A.; Khattab, M.F. Applications of Sensor Networks and Remote Sensing in Environmental Sustainability: A Review. In Proceedings of the 2022 International Conference on Engineering & MIS (ICEMIS), Istanbul, Turkey, 4–6 July 2022; pp. 1–3.
  4. Li, X.; Wen, C.; Hu, Y.; Yuan, Z.; Zhu, X.X. Vision-language models in remote sensing: Current progress and future trends. IEEE Geosci. Remote Sens. Mag. 2024, 12, 32–66.
  5. Qian, S.E. Overview of hyperspectral imaging remote sensing from satellites. In Advances in Hyperspectral Image Processing Techniques; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2022; pp. 41–66.
  6. Li, J.; Ou, Z. Remote Sensing Image Processing of Ecological Environment Monitoring Based on Multi-scale Retinex Algorithm. In Proceedings of the 2023 2nd International Conference on 3D Immersion, Interaction and Multi-Sensory Experiences (ICDIIME), Madrid, Spain, 27–29 June 2023; pp. 21–24.
  7. Kumar, C.M.; Nidamanuri, R.R.; Dadhwal, V.K. Subpixel level discrimination of vegetable crops in a complex landscape environment. In Proceedings of the 2023 International Conference on Machine Intelligence for GeoAnalytics and Remote Sensing (MIGARS), Hyderabad, India, 27–29 January 2023; Volume 1, pp. 1–4.
  8. Song, J.; Kim, D.j.; Hwang, J.H.; Kim, H.; Li, C.; Han, S.; Kim, J. Effective Vessel Recognition in High Resolution SAR Images Using Quantitative and Qualitative Training Data Enhancement From Target Velocity Phase Refocusing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3346171.
  9. Peeling, J.A.; Chen, C.; Judge, J.; Singh, A.; Achidago, S.; Eide, A.; Tarrio, K.; Olofsson, P. Applications of Remote Sensing for Land Use Planning Scenarios with Suitability Analysis. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6366–6378.
  10. Khalsa, S.J.S.; Percivall, G. Standardization in Geoscience Remote Sensing. In Proceedings of the IGARSS 2022–2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 4676–4678.
  11. Chauhan, K.; Tomar, H.; Kamal, K.; Goel, P. Feature Extraction from Image Sensing (Remote): Image Segmentation. In Proceedings of the 2023 5th International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), Greater Noida, India, 15–16 December 2023; pp. 227–232.
  12. Wang, Y.; Shao, Z.; Lu, T.; Wu, C.; Wang, J. Remote sensing image super-resolution via multiscale enhancement network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5000905.
  13. Qiu, W.; Gu, L.; Gao, F.; Jiang, T. Building extraction from very high-resolution remote sensing images using refine-UNet. IEEE Geosci. Remote Sens. Lett. 2023, 20, 6002905.
  14. Chen, L.; Dou, X.; Peng, J.; Li, W.; Sun, B.; Li, H. EFCNet: Ensemble full convolutional network for semantic segmentation of high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8011705.
  15. Meng, X.; Yang, Y.; Wang, L.; Wang, T.; Li, R.; Zhang, C. Class-guided swin transformer for semantic segmentation of remote sensing imagery. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6517505.
  16. Moghimi, A.; Welzel, M.; Celik, T.; Schlurmann, T. A Comparative Performance Analysis of Popular Deep Learning Models and Segment Anything Model (SAM) for River Water Segmentation in Close-Range Remote Sensing Imagery. IEEE Access 2024, 12, 52067–52085.
  17. Prado Osco, L.; Wu, Q.; Lopes de Lemos, E.; Nunes Gonçalves, W.; Marques Ramos, A.P.; Li, J.; Marcato Junior, J. The Segment Anything Model (SAM) for Remote Sensing Applications: From Zero to One Shot. arXiv 2023, arXiv:2306.16623.
  18. Shi, J.; Liu, W.; Shan, H.; Li, E.; Li, X.; Zhang, L. Remote sensing scene classification based on multibranch fusion attention network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 3001505.
  19. Huang, L.; Jiang, B.; Lv, S.; Liu, Y.; Fu, Y. Deep Learning-Based Semantic Segmentation of Remote Sensing Images: A Survey. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 8370–8396.
  20. Zou, J.; Li, Z.; Lu, F.; He, W.; Zhang, H. Multimodal unsupervised domain adaptation for remote sensing image segmentation. In Proceedings of the 2023 13th Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS), Athens, Greece, 31 October–2 November 2023; pp. 1–5.
  21. Jia, Y.; Wan, G.; Liu, J.; Zhao, C.; Wang, G.; Zhang, Y.; Liu, L.; Xie, B. A Multi-Scale Transformer Fusion Deep Clustering Network for Unsupervised Planetary Change Detection. IEEE Geosci. Remote Sens. Lett. 2023, 21, 8000205.
  22. Nadgauda, S.S.; Pennamada, Y.R.; Sumathi, D. StegaNet: A Deep Learning Model for Image Steganography Using Customized CNN and Autoencoders. In Proceedings of the 2023 OITS International Conference on Information Technology (OCIT), Raipur, India, 13–15 December 2023; pp. 196–201.
  23. Yu, Y.; Liang, M.; Yin, M.; Lu, K.; Du, J.; Xue, Z. Unsupervised Multimodal Graph Contrastive Semantic Anchor Space Dynamic Knowledge Distillation Network for Cross-Media Hash Retrieval. In Proceedings of the 2024 IEEE 40th International Conference on Data Engineering (ICDE), Utrecht, The Netherlands, 13–16 May 2024; pp. 4699–4708.
  24. Liu, H.; Yao, M.; Xiao, X.; Zheng, B.; Cui, H. Marsscapes and udaformer: A panorama dataset and a transformer-based unsupervised domain adaptation framework for martian terrain segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 62, 4600117.
  25. Zhang, L.; Lan, M.; Zhang, J.; Tao, D. Stagewise unsupervised domain adaptation with adversarial self-training for road segmentation of remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5609413.
  26. Zhu, J.; Guo, Y.; Sun, G.; Yang, L.; Deng, M.; Chen, J. Unsupervised domain adaptation semantic segmentation of high-resolution remote sensing imagery with invariant domain-level prototype memory. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5603518.
  27. Fallahreyhani, M.; Ghassemian, H.; Imani, M. Unsupervised Classification of Remotely Sensed High resolution Images using RP-CNN. In Proceedings of the 2024 13th Iranian/3rd International Machine Vision and Image Processing Conference (MVIP), Tehran, Iran, 6–7 March 2024; pp. 1–7.
  28. Wei, L.; Chen, G.; Zhou, Q.; Liu, C.; Cai, C. Cross-mapping net: Unsupervised change detection from heterogeneous remote sensing images using a transformer network. In Proceedings of the 2023 8th International Conference on Computer and Communication Systems (ICCCS), Guangzhou, China, 21–24 April 2023; pp. 1021–1026.
  29. Dai, L.; Zhang, G.; Zhang, R. RADANet: Road augmented deformable attention network for road extraction from complex high-resolution remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5602213.
  30. Xiao, T.; Liu, Y.; Huang, Y.; Li, M.; Yang, G. Enhancing multiscale representations with transformer for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605116.
  31. Song, J.; Li, Y.; Li, X.; Yang, S.; Xie, J.; Zhu, R. Unsupervised remote sensing image classification with differentiable feature clustering by coupled transformer. J. Appl. Remote Sens. 2024, 18, 026505.
  32. Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Shahbaz Khan, F. Edgenext: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 3–20.
  33. Cui, L.; Jing, X.; Wang, Y.; Huan, Y.; Xu, Y.; Zhang, Q. Improved swin transformer-based semantic segmentation of postearthquake dense buildings in urban areas using remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 16, 369–385.
  34. Yang, Y.; Yuan, G.; Li, J. Multielement Feature-Based Hierarchical Context Integration Network for Remote Sensing Image Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7971–7985.
  35. Xi, W.; Sun, L.; Sun, J. Upgrade your network in-place with deformable convolution. In Proceedings of the 2020 19th International Symposium on Distributed Computing and Applications for Business Engineering and Science (DCABES), Xuzhou, China, 16–19 October 2020; pp. 239–242.
  36. Xu, M.; Wang, W.; Wang, K.; Dong, S.; Sun, P.; Sun, J.; Luo, G. Vision Transformers (ViT) Pretraining on 3D ABUS Image and Dual-CapsViT: Enhancing ViT Decoding via Dual-Channel Dynamic Routing. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkiye, 5–8 December 2023; pp. 1596–1603.
  37. Li, Z.; Guo, Y. Semantic segmentation of landslide images in Nyingchi region based on PSPNet network. In Proceedings of the 2020 7th International Conference on Information Science and Control Engineering (ICISCE), Changsha, China, 18–20 December 2020; pp. 1269–1273.
  38. Namin, N.A.; Garaaghaji, E.; Rezaei, M.; Lighvan, M.Z. Light Weight Semantic Segmentation: A Modified DDRNET Approach Trained on Cityscapes and COCO-Stuff Datasets for Efficient Image Analysis. In Proceedings of the 2023 7th International Symposium on Innovative Approaches in Smart Technologies (ISAS), Istanbul, Turkiye, 23–25 November 2023; pp. 1–5.
  39. Chen, X.; Zou, Y.; Ke, H. TrafficYOLO: YOLO with Multi-Head Attention Mechanism for Traffic Detection Scenarios. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 22–24 March 2024; pp. 2276–2279.
  40. Heryadi, Y.; Irwansyah, E.; Miranda, E.; Soeparno, H.; Herlawati; Hashimoto, K. The effect of resnet model as feature extractor network to performance of DeepLabV3 model for semantic satellite image segmentation. In Proceedings of the 2020 IEEE Asia-Pacific Conference on Geoscience, Electronics and Remote Sensing Technology (AGERS), Jakarta, Indonesia, 7–8 December 2020; pp. 74–77.
  41. Guo, Z.; Zhao, L.; Yuan, J.; Yu, H. Msanet: Multiscale aggregation network integrating spatial and channel information for lung nodule detection. IEEE J. Biomed. Health Inform. 2021, 26, 2547–2558.
  42. Meng, Y.; Yuan, Z.; Yang, J.; Liu, P.; Yan, J.; Zhu, H.; Ma, Z.; Jiang, Z.; Zhang, Z.; Mi, X. Cross-domain Land Cover Classification of Remote Sensing Images based on Full-level Domain Adaptation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11434–11450.
  43. Huang, H.; Li, B.; Zhang, Y.; Chen, T.; Wang, B. Joint distribution adaptive-alignment for cross-domain segmentation of high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5401214.
  44. Li, T.; Pei, G.; Cai, X.; Liu, H.; Wang, Q.; Yao, Y. Universal Organizer of SAM for Unsupervised Semantic Segmentation. arXiv 2024, arXiv:2405.11742.
Figure 1. Overall structure of the MSHFE-Net. (a) Extraction of multi-scale local features. (b) Capturing multi-scale global features. (c) Fusion of multi-scale global and local features. (d) Segmentation Masks by K-means Clustering.
Figure 2. Schematic illustration of the variation of deformable convolutional sampling positions. Deformable convolution dynamically adjusts the sampling positions of each convolutional kernel by introducing positional offsets, enabling the network to adapt flexibly to object deformations.
Figure 3. Structure of Multi-Scale Pixel-Guided CNN Encoder. Deformable convolution adaptively extracts feature information, while a Pixel-Guided Fusion Module is employed to merge the extracted multi-scale features, enabling the extraction of more accurate fine-grained details.
Figure 4. Pixel-Guided Fusion Module structure. Multi-scale feature fusion is guided by calculating the probability that pixels at the same location belong to the same object, enabling high-precision extraction of local features.
Figure 5. (a) The Multi-Scale Aggregation Transformer Encoder layer for extracting multi-scale contextual information. (b) The Parallel Aggregation PPM (PAPPM), which enhances feature scale through pooling and convolution, thereby improving the generalization capability of the features.
Figure 6. Structure of Parallel Attention Fusion Module. This module effectively integrates local and global features, enhancing the overall robustness of the segmentation model.
Figure 7. Some sample images from the Potsdam dataset: (a) Potsdam dataset image. (b) GT.
Figure 8. Some sample images from the Vaihingen dataset: (a) Vaihingen dataset image. (b) GT.
Figure 9. Unsupervised semantic segmentation results for MSHFE-Net on the Potsdam dataset: (a) Potsdam dataset image. (b) GT. (c) Semantic segmentation results.
Figure 10. Unsupervised semantic segmentation results for MSHFE-Net on the Vaihingen dataset: (a) Vaihingen dataset image. (b) GT. (c) Semantic segmentation results.
Figure 11. Unsupervised semantic segmentation results from comparison experiments on the Potsdam dataset. (a) Potsdam dataset image. (b) GT. (c) DeepLabV3+. (d) MSANet. (e) FLDA-NET. (f) JDAF. (g) SAM. (h) OURS.
Figure 12. Unsupervised semantic segmentation results from comparison experiments on the Vaihingen dataset. (a) Vaihingen dataset image. (b) GT. (c) DeepLabV3+. (d) MSANet. (e) FLDA-NET. (f) JDAF. (g) SAM. (h) OURS.
Figure 13. Unsupervised semantic segmentation results for ablation experiments on the Potsdam dataset. (a) Image. (b) GT. (c) Baseline. (d) Multi-Scale Pixel-Guided CNN Encoder. (e) Multi-Scale Aggregation Transformer Encoder. (f) OURS.
Figure 14. Unsupervised semantic segmentation results for ablation experiments on the Vaihingen dataset. (a) Image. (b) GT. (c) Baseline. (d) Multi-Scale Pixel-Guided CNN Encoder. (e) Multi-Scale Aggregation Transformer Encoder. (f) OURS.
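Table 1. Comparative experiments on Potsdam dataset.

| Model | IoU: Building | IoU: Low-Veg | IoU: Surface | IoU: Tree | IoU: Car | mIoU | OA | F1 |
|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ | 52.79 | 22.57 | 47.91 | 50.70 | 22.64 | 40.86 | 61.74 | 53.9 |
| MSANet | 70.34 | 34.78 | 50.07 | 43.61 | 37.84 | 47.65 | 65.92 | 60.1 |
| FLDA-NET | 74.56 | 31.31 | 56.28 | 40.49 | 46.57 | 49.81 | 68.43 | 64.7 |
| JDAF | 77.19 | 47.39 | 68.76 | 58.38 | 42.76 | 55.52 | 72.38 | 70.4 |
| SAM | 75.49 | 48.06 | 64.80 | 51.84 | 46.88 | 49.69 | 75.62 | 72.5 |
| OURS | **77.61** | **48.90** | **69.63** | **59.21** | **47.04** | **58.79** | **79.19** | **77.5** |
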
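Table 2. Comparative experiments on Vaihingen dataset.

| Model | IoU: Building | IoU: Low-Veg | IoU: Surface | IoU: Tree | IoU: Car | mIoU | OA | F1 |
|---|---|---|---|---|---|---|---|---|
| DeepLabV3+ | 71.61 | 34.49 | 62.51 | 56.26 | 39.20 | 44.68 | 63.08 | 56.7 |
| MSANet | 78.13 | 38.89 | 64.17 | 58.37 | 44.86 | 50.75 | 68.85 | 65.1 |
| FLDA-NET | 82.07 | 38.46 | 61.06 | 61.91 | 51.18 | 58.94 | 75.13 | 71.6 |
| JDAF | 81.15 | 49.84 | 68.10 | 66.57 | 53.23 | 60.07 | 78.83 | 76.9 |
| SAM | 74.33 | 47.76 | 64.81 | 62.65 | 50.37 | 58.39 | 79.48 | 77.3 |
| OURS | **83.54** | **54.77** | **71.09** | **67.61** | **56.73** | **62.81** | **83.14** | **80.4** |
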
Table 3. Comparison of Model Computational Complexity and Inference Time.

| Model | FLOPs (G) | Inference Time (ms) |
|---|---|---|
| DeepLabV3+ | 56.3 | 77 |
| MSANet | 44.7 | 82 |
| FLDA-NET | 83.4 | 118 |
| JDAF | 126.1 | 149 |
| SAM | 185.8 | 174 |
| OURS | 159.6 | 167 |

Table 4. Ablation experiments on Potsdam dataset.

| Configuration | mIoU | OA | F1 |
|---|---|---|---|
| Baseline | 17.64 | 33.94 | 28.7 |
| Multi-Scale Pixel-Guided CNN Encoder | 20.09 | 36.77 | 34.1 |
| Multi-Scale Aggregation Transformer Encoder | 34.89 | 54.95 | 52.0 |
| All three modules, including the Parallel Attention Fusion Module (MSHFE-Net) | 58.79 | 79.19 | 77.5 |

Table 5. Ablation experiments on Vaihingen dataset.

| Configuration | mIoU | OA | F1 |
|---|---|---|---|
| Baseline | 21.47 | 38.87 | 30.6 |
| Multi-Scale Pixel-Guided CNN Encoder | 28.31 | 47.36 | 45.3 |
| Multi-Scale Aggregation Transformer Encoder | 37.64 | 58.86 | 56.9 |
| All three modules, including the Parallel Attention Fusion Module (MSHFE-Net) | 62.81 | 83.14 | 80.4 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Song, W.; Nie, F.; Wang, C.; Jiang, Y.; Wu, Y. Unsupervised Multi-Scale Hybrid Feature Extraction Network for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 3774. https://doi.org/10.3390/rs16203774
