Article

Optimization of Remote-Sensing Image-Segmentation Decoder Based on Multi-Dilation and Large-Kernel Convolution

Guohong Liu, Cong Liu, Xianyun Wu, Yunsong Li, Xiao Zhang and Junjie Xu
1 State Key Laboratory of Integrated Service Networks, Xidian University, Xi’an 710071, China
2 Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, China
3 Hangzhou Institute of Technology, Xidian University, Hangzhou 311231, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(15), 2851; https://doi.org/10.3390/rs16152851
Submission received: 14 May 2024 / Revised: 29 July 2024 / Accepted: 30 July 2024 / Published: 3 August 2024

Abstract

Land-cover segmentation, a fundamental task in remote sensing, has a broad spectrum of potential applications. To address the challenges of land-cover segmentation in remote-sensing imagery, we carried out the following work. Firstly, to tackle the issues of foreground–background imbalance and scale variation, a module based on multi-dilation rate convolution fusion was integrated into the decoder. This module extends the receptive field through multi-dilated convolution, enhancing the model’s capability to capture global features. Secondly, to address scene diversity and background interference, a hybrid attention module based on large-kernel convolution was employed to improve the performance of the decoder. This module, which combines spatial and channel attention mechanisms, enhances the extraction of contextual information through large-kernel convolution. A convolution kernel selection mechanism was also introduced to dynamically select convolution kernels with an appropriate receptive field, suppress irrelevant background information, and improve segmentation accuracy. Ablation studies on the Vaihingen and Potsdam datasets demonstrate that our decoder significantly outperforms the baseline in terms of mean intersection over union and mean F1 score, achieving increases of up to 1.73% and 1.17%, respectively. In quantitative comparisons, the accuracy of our improved decoder also surpasses other algorithms in the majority of categories. These results indicate that our improved decoder achieves a significant performance improvement over the original decoder in remote-sensing image-segmentation tasks, verifying its application potential in the field of land-cover segmentation.

1. Introduction

Due to the rapid development of imaging technology, the processing and analysis of remote-sensing images have become increasingly important. Consequently, the automatic extraction of fundamental information from remote-sensing images has become a key research direction in the field of remote-sensing image processing [1]. Remote-sensing land-cover segmentation is critical for analyzing remote-sensing images, playing a key role in processing and utilizing remote-sensing data. By employing image semantic-segmentation algorithms, it assigns a category to each pixel of a remote-sensing image, identifying various landforms and extracting essential information [2]. In military contexts, it provides crucial intelligence for tactical and strategic operations. In environmental monitoring [3], it aids in quickly and accurately detecting ecological changes, while in urban development, it supports city planning and the enhancement of infrastructure. In geoscience, it improves geospatial clarity, establishing an important basis for Earth studies. The wide-ranging utility of this technology underscores the need for precision in remote-sensing segmentation methods. Remote-sensing images contain abundant surface-feature information, yet accurately segmenting real-world regions remains a long-standing challenge [4,5,6]. Traditional segmentation methods [7,8,9], such as threshold-based segmentation, edge detection, and pixel clustering, have limited robustness and struggle to extract deep semantic information from images.
The swift advancement of deep-learning techniques, particularly convolutional neural networks (CNNs), has made them an essential tool in the field of computer vision due to their powerful capability for feature extraction. Several scholars have successfully applied CNNs to tasks related to remote-sensing image segmentation [10]. Zheng et al. [11] developed FarSeg, a foreground-aware relation network designed to address the large intraclass variance of background classes and the imbalance between foreground and background in remote-sensing images. In 2021, Liu et al. [12] introduced Fctl, a geospatial segmentation approach based on locality-aware contexts, which systematically crops and independently segments images, later merging these to form a cohesive output. Zheng et al. [13] proposed SETR to improve the model’s ability to understand complex contextual relationships in images by exploiting the contextual information-capture ability of transformers. Additionally, Ma et al. [14] introduced FactSeg, which utilizes a foreground-activation object representation to enhance the detection and differentiation of smaller objects.
Recent studies on object detection highlight the critical role of foreground saliency in remote-sensing image analysis. Inspired by these findings, various researchers have adapted transformer architectures for remote-sensing segmentation and detection. Segformer [15] pairs a hierarchically structured transformer encoder with a decoder consisting of several layers of lightweight multi-layer perceptrons. Xu et al. introduced the Efficient Transformer [16], which exhibits improved computational efficiency and uses explicit and implicit edge-enhancement techniques for precise segmentation. Wang et al. [17] presented a design using the Swin Transformer for context extraction from images and introduced a densely connected feature-aggregation setup for resolution restoration and fine-grained segmentation. Xu et al. proposed RSSFormer [18], a remote-sensing object segmentation framework with an adaptive transformer fusion module, an attention layer, and a foreground prominence-based loss function, designed to reduce background noise and enhance foreground differentiation. Woo et al. [19] proposed CBAM in 2018, with the main goal of improving a model’s perceptual ability by introducing channel attention and spatial attention into CNNs.
Land-cover segmentation poses distinct challenges compared to standard semantic segmentation, which include significant variance in object sizes even within the same class, like large forests versus isolated trees; complex background components in remote-sensing images, making it difficult to classify certain elements into defined segments; and prevalence of background over foreground, which could bias the model to favor background segmentation during training, potentially affecting its optimization path.
In this paper, the prominent contributions are as follows:
  • We propose the Multi-Dilation Rate Convolutional Fusion Module (MDCFM). This module integrates the outputs of convolutions with different dilation rates, mitigating the information loss of dilated convolution and improving the segmentation of targets with varying scales.
  • We introduce the Large-Kernel-Selection Hybrid Attention Module (LKSHAM), a hybrid attention module built on large-kernel convolutions. In the spatial attention submodule, we embed a convolutional kernel selection strategy to accommodate varying segmentation scales. In the channel attention submodule, we adopt a large-kernel convolution-based attention to enlarge the model’s receptive field, improving its foreground–background distinction and suppressing unrelated background noise. Combining this module with the MDCFM yields the multi-dilation and large-kernel convolution-based decoder.
The rest of this paper is organized as follows. Section 2 presents a diagram of the proposed decoder and introduces its workflow, followed by the principles and network structure of the multi-dilation rate convolutional fusion decoder and the large-kernel-selection hybrid attention module. Section 3 describes the datasets, implementation details, and experimental results. Section 4 provides a discussion. Section 5 draws conclusions.

2. Methods

2.1. Multi-Dilation and Large-Kernel Convolution-Based Decoder

The structure of our multi-dilation and large-kernel convolution-based decoder is shown in Figure 1.
We propose an improved decoder workflow aimed at enhancing the performance of image segmentation. In this process, we draw on the widely recognized encoder-decoder architecture in the field of image segmentation, with particular reference to the Segformer and HRNet models. While retaining the original encoder, we have optimized the decoder. Specifically, we have replaced the All-MLP decoder of Segformer and the upsampling part of HRNet with the multi-dilation rate convolutional fusion module and large-kernel-selection hybrid attention module.
The input image is first processed by the encoder to extract feature maps at various resolutions. These feature maps then enter the decoder. Here, the multi-scale feature maps are first sent to the large-kernel-selection hybrid attention module, which applies spatial and channel attention based on a large-kernel convolution selection mechanism. The processed feature maps are then sent to the multi-dilation rate convolutional fusion module. Within this module, the feature maps pass through three convolution–batch normalization–activation (CBA) blocks, followed by upsampling, feature-map fusion, and convolution operations, ultimately generating the decoded output. This decoder not only improves the processing efficiency of the feature maps but also enhances the accuracy of image segmentation.

2.2. Multi-Dilation Rate Convolutional Fusion Module

To counter the challenges in land-cover segmentation, we construct a decoder tailored for segmenting objects in remote-sensing imagery, named the Multi-Dilation Rate Convolutional Fusion Decoder (MDCFD). Drawing inspiration from the decoders of Segformer and SETR, alongside existing encoder–decoder frameworks, the MDCFD adopts a comparatively simple structure of multiple multi-layer perceptrons. To increase the saliency of foreground features in the decoder’s feature maps, we follow the RSSFormer approach and adopt dilated convolutions in the MDCFD, accentuating foreground details during the decoding phase. Dilated convolutions use enlarged kernels to cover a wider sampling area, bridging distant pixels and injecting additional contextual information.
To address the pronounced size variability among objects of the same class in remote-sensing imagery, we introduce the Multi-Dilation Rate Convolutional Fusion Module (MDCFM) within the MDCFD. The module uses dilated convolutions with small dilation rates to capture the fine details of smaller objects, while those with larger dilation rates capture a wider feature range. The MDCFM then merges the feature maps produced by convolutional branches with different receptive fields through element-wise addition. The structure of the MDCFM is depicted in Figure 2.
Input feature maps are processed sequentially by three CBA (Convolution–BatchNorm–Activation) blocks, each a serial composition of a convolutional layer, batch normalization, and an activation layer. We use the Gaussian Error Linear Unit (GELU) as the activation function within the CBA blocks, whose computation is given by
Y = GELU(BN(Conv(X)))    (1)
where X represents the input feature maps, Y is the output of the CBA sequence, and BN(·) denotes batch normalization. The output Y is then fed in parallel to a 1 × 1 convolution (dilation rate of 1) and two 3 × 3 dilated convolutions with dilation rates of 2 and 3, and the three outputs are integrated across different receptive fields via element-wise addition. The module’s computation is outlined as
Y_Fusion = Conv_{1×1}^{R=1}(Y) + Conv_{3×3}^{R=2}(Y) + Conv_{3×3}^{R=3}(Y)    (2)
where Y_Fusion represents the output feature map that results from the fusion of multiple dilated convolutions, and Conv_{3×3}^{R=2}(·) refers to the dilated convolution operation with a kernel size of 3 and a dilation rate of 2. This approach captures multi-scale context information through different dilation rates, effectively generating an enriched feature representation and enhancing the module’s adaptability to complex, varied spatial structures in remote-sensing images through the element-wise addition operation. The construction of MDCFD is shown in Figure 3.
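As a concrete illustration, the following PyTorch sketch mirrors Equations (1) and (2): three sequential CBA blocks followed by three parallel convolution branches with dilation rates of 1, 2, and 3 whose outputs are summed. The 3 × 3 kernel size inside the CBA blocks is an assumption, the exact wiring of Figure 2 may differ, and this is not the authors’ implementation.

```python
import torch
import torch.nn as nn


class CBA(nn.Module):
    """Convolution -> BatchNorm -> GELU block, Equation (1)."""

    def __init__(self, channels, dilation=1):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class MDCFM(nn.Module):
    """Three sequential CBA blocks followed by multi-dilation fusion, Equation (2)."""

    def __init__(self, channels):
        super().__init__()
        self.cbas = nn.Sequential(CBA(channels), CBA(channels), CBA(channels))
        self.conv_r1 = nn.Conv2d(channels, channels, kernel_size=1)  # dilation rate 1
        self.conv_r2 = nn.Conv2d(channels, channels, kernel_size=3, padding=2, dilation=2)
        self.conv_r3 = nn.Conv2d(channels, channels, kernel_size=3, padding=3, dilation=3)

    def forward(self, x):
        y = self.cbas(x)
        # Element-wise fusion of branches with different receptive fields.
        return self.conv_r1(y) + self.conv_r2(y) + self.conv_r3(y)
```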
After processing by the MDCFM, the n-channel feature maps are bilinearly upsampled within the upsampling module, which helps to ensure consistency in spatial features. The upsampling increases the resolution of each map to a quarter of the original image’s width and height rather than directly returning to the full resolution; this choice reduces computational and memory demands while maintaining the spatial detail needed for accurate semantic segmentation. Finally, the feature maps are merged through channel concatenation and passed through a convolutional layer that outputs C channels, generating the final output of the decoder.
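A rough sketch of this decoding stage is given below, reusing the MDCFM class from the previous snippet: each scale is processed by an MDCFM, bilinearly upsampled to a common quarter-resolution target size, concatenated along channels, and projected to C output channels. The constructor arguments and the choice of a 1 × 1 fusion convolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MDCFDHead(nn.Module):
    """Decoding stage: per-scale MDCFM, upsample to 1/4 resolution, concatenate, project."""

    def __init__(self, in_channels_list, num_classes):
        super().__init__()
        self.mdcfms = nn.ModuleList([MDCFM(c) for c in in_channels_list])
        self.fuse = nn.Conv2d(sum(in_channels_list), num_classes, kernel_size=1)

    def forward(self, feats, out_hw):
        # feats: list of multi-scale encoder feature maps; out_hw: (H/4, W/4) target size.
        ups = [F.interpolate(m(f), size=out_hw, mode="bilinear", align_corners=False)
               for m, f in zip(self.mdcfms, feats)]
        return self.fuse(torch.cat(ups, dim=1))
```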

2.3. Large-Kernel-Selection Hybrid Attention Module

Inspired by CBAM, we devised a large-kernel-selection hybrid attention module (LKSHAM) that adopts a similar dual-submodule framework, as detailed in Figure 4. Within this framework, the input feature map X is first passed through a channel attention module to generate an attention mask M_c (Equation (3)). This mask is then applied element-wise to X, yielding a channel-enhanced feature map X_c (Equation (4)). Subsequently, the enhanced feature map X_c is refined by the spatial attention module, which generates a spatial attention mask M_s that sharpens the spatial focus (Equation (5)). The final output, OUT, is obtained by combining the mask M_s with the feature map X_c, signifying an increased concentration of attention, as expressed by Equation (6).
M_c = ChannelAttention(X)    (3)
X_c = M_c · X    (4)
M_s = SpatialAttention(X_c)    (5)
OUT = M_s · X_c    (6)
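A minimal sketch of this hybrid flow is given below, assuming channel and spatial attention submodules that return the masks M_c and M_s (the LKCAM and LKSSAM detailed in the following subsections); the class and argument names are illustrative.

```python
import torch.nn as nn


class LKSHAM(nn.Module):
    """Hybrid attention: channel mask first (Equations (3)-(4)), then spatial mask (Equations (5)-(6))."""

    def __init__(self, channel_attn: nn.Module, spatial_attn: nn.Module):
        super().__init__()
        self.channel_attn = channel_attn     # returns mask M_c, Equation (3)
        self.spatial_attn = spatial_attn     # returns mask M_s, Equation (5)

    def forward(self, x):
        x_c = self.channel_attn(x) * x       # Equation (4)
        return self.spatial_attn(x_c) * x_c  # Equation (6)
```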

2.3.1. Large-Kernel-Selection Spatial Attention Module

Taking inspiration from SKNet and LSKNet [20], this manuscript introduces a novel spatial attention module, denoted as the Large-Kernel-Selection Spatial Attention Module (LKSSAM), which integrates large-kernel convolutions with a convolutional kernel selection mechanism.
Figure 5 illustrates the workings of the convolutional kernel selection mechanism. The output feature map X of the previous network layer, with batch size B and channel count N, is processed by three depthwise convolutions, F̃_1, F̃_2, and F̃_3, which together decompose a large-kernel convolution with a 17 × 17 receptive field. The kernel size of F̃_1 is 3 × 3 with a dilation rate of 1, F̃_2 uses a 5 × 5 kernel with a dilation rate of 2, and F̃_3 uses a 3 × 3 kernel with a dilation rate of 3. X traverses the paths F̃_1, F̃_1→F̃_2, and F̃_1→F̃_2→F̃_3 to obtain the respective outputs Õ_1, Õ_2, and Õ_3, as given in the following equation.
Õ_1 = F̃_1(X),  Õ_2 = F̃_2(Õ_1),  Õ_3 = F̃_3(Õ_2)    (7)
After these operations, Õ_1, Õ_2, and Õ_3 are concatenated along the channel dimension (dim = 1) to form Õ, as shown in Equation (8). Õ then undergoes global average pooling P_avg, resulting in U ∈ ℝ^{B×3N×1×1}. After a 1 × 1 convolution that keeps the channel count at 3N, followed by reshaping, U becomes the five-dimensional matrix Ū ∈ ℝ^{B×3×N×1×1}, as described in Equations (9) and (10). Applying Softmax along the second dimension (dim = 1) of Ū yields the kernel selection weight matrix W_k, per Equation (11).
Õ = [Õ_1; Õ_2; Õ_3]_{dim=1}    (8)
U = P_avg(Õ)    (9)
Ū = reshape(F_{1×1}(U))    (10)
W_k = Softmax_{dim=1}(Ū)    (11)
W_k ∈ ℝ^{B×3×N×1×1} contains B × N sets of convolutional kernel selection weights, where the three elements of each set correspond to the selection coefficients for the feature maps with different receptive fields produced by F̃_1, F̃_2, and F̃_3. After reshaping Õ to the shape B × 3 × N × H × W, element-wise multiplication with W_k yields W̄_k ∈ ℝ^{B×3×N×H×W}, as depicted in Equation (12). Subsequently, W̄_k is split along its second dimension (dim = 1) into three matrices, each with the shape B × N × H × W. Element-wise addition of these three matrices gives the feature map X_k ∈ ℝ^{B×N×H×W} after adaptive receptive-field selection. This process is detailed in Equation (13), where chunk denotes the matrix partitioning.
W̄_k = reshape(Õ) × W_k    (12)
X_k = sum(chunk_{dim=1}^{3}(W̄_k))    (13)
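The kernel selection mechanism of Equations (7)–(13) can be sketched in PyTorch as below. The kernel sizes and dilation rates follow the text; the depthwise (grouped) convolution settings and the 1 × 1 projection layer are otherwise illustrative assumptions rather than the authors’ code.

```python
import torch
import torch.nn as nn


class KernelSelection(nn.Module):
    """Adaptive receptive-field selection over three cascaded depthwise convolutions."""

    def __init__(self, channels):
        super().__init__()
        # Cascade jointly equivalent to a 17 x 17 receptive field.
        self.f1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1, groups=channels)
        self.f2 = nn.Conv2d(channels, channels, 5, padding=4, dilation=2, groups=channels)
        self.f3 = nn.Conv2d(channels, channels, 3, padding=3, dilation=3, groups=channels)
        self.pool = nn.AdaptiveAvgPool2d(1)                               # P_avg
        self.proj = nn.Conv2d(3 * channels, 3 * channels, kernel_size=1)  # F_1x1

    def forward(self, x):
        b, n, h, w = x.shape
        o1 = self.f1(x)                                    # Equation (7)
        o2 = self.f2(o1)
        o3 = self.f3(o2)
        o = torch.cat([o1, o2, o3], dim=1)                 # Equation (8): B x 3N x H x W
        u = self.proj(self.pool(o))                        # Equations (9)-(10)
        w_k = torch.softmax(u.view(b, 3, n, 1, 1), dim=1)  # Equation (11)
        weighted = o.view(b, 3, n, h, w) * w_k             # Equation (12)
        return weighted.sum(dim=1)                         # Equation (13): B x N x H x W
```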
Figure 6 shows the complete schematic of the LKSSAM. The feature map X_k produced by the kernel selection module is subjected to channel-wise average and max pooling operations, P_avg and P_max, yielding outputs I_avg ∈ ℝ^{B×1×H×W} and I_max ∈ ℝ^{B×1×H×W}, as shown in Equation (14). Concatenating I_avg and I_max along the channel axis results in the matrix Ĩ ∈ ℝ^{B×2×H×W}, as characterized by Equation (15). Ĩ is then processed by a 3 × 3 convolution with a single output channel, followed by a Sigmoid activation, producing a spatial attention mask M_s ∈ ℝ^{B×1×H×W} whose values lie in the interval (0, 1), as given in Equation (16). The element-wise product of M_s with the input X yields the LKSSAM-enhanced feature map X_s, as depicted in Equation (17).
I_avg = P_avg(X_k),  I_max = P_max(X_k)    (14)
Ĩ = [I_avg; I_max]    (15)
M_s = Sigmoid(F_{3×3}(Ĩ))    (16)
X_s = M_s · X    (17)
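Building on the kernel-selection sketch above, a hedged sketch of the LKSSAM spatial mask of Equations (14)–(17) is given below; it returns the mask M_s, and multiplying it with the module input gives X_s.

```python
import torch
import torch.nn as nn


class LKSSAM(nn.Module):
    """Spatial attention with kernel selection; returns the mask M_s of Equation (16)."""

    def __init__(self, channels):
        super().__init__()
        self.kernel_select = KernelSelection(channels)          # sketch from the previous snippet
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)   # F_3x3 with one output channel

    def forward(self, x):
        x_k = self.kernel_select(x)
        i_avg = x_k.mean(dim=1, keepdim=True)                   # Equation (14), P_avg
        i_max = x_k.max(dim=1, keepdim=True).values             # Equation (14), P_max
        i = torch.cat([i_avg, i_max], dim=1)                    # Equation (15)
        return torch.sigmoid(self.conv(i))                      # Equation (16); X_s = M_s * x, Equation (17)
```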

2.3.2. Large Kernel Channel Attention Module

We introduce a Large-Kernel Channel Attention Module (LKCAM); by integrating it with the LKSSAM, a hybrid attention module (LKSHAM) is formed that fully utilizes information from both the spatial and channel dimensions.
The architecture of the LKCAM is shown in Figure 7. The output feature map X ∈ ℝ^{B×C×H×W} of the previous layer of the neural network, with batch size B and C channels, first undergoes a 1 × 1 convolution F_{1×1} that reduces the channel count to C/R, where R is a channel reduction factor used to control the model’s parameter scale, as shown in Equation (18).
A = F_{1×1}(X)    (18)
Following F_{1×1}, the output A ∈ ℝ^{B×C/R×H×W} is processed by two subsequent depthwise convolutions, F̃_1 and F̃_2, as described in Equation (19). In this work, F̃_1 uses a dilation rate of 1, and F̃_2 uses a 7 × 7 kernel with a dilation rate of 4; the sequence of F̃_1 and F̃_2 is equivalent to a single large-kernel convolution with a receptive field of 29 × 29. The feature map Ā refined by this large-kernel convolution is then passed through a 1 × 1 convolution that outputs C channels, restoring the channel count of the final feature map Y ∈ ℝ^{B×C×H×W} to C, as shown in Equation (20).
Ā = F̃_1(F̃_2(A))    (19)
Y = F_{1×1}(Ā)    (20)
Y is processed by a global max pooling layer and a global average pooling layer, resulting in matrices C_1 ∈ ℝ^{B×C×1×1} and C_2 ∈ ℝ^{B×C×1×1}, as presented in Equation (21). The element-wise sum of C_1 and C_2 then enters a Sigmoid activation layer, whose output forms the channel attention mask M_c, as detailed in Equation (22). The values in M_c range from 0 to 1, indicating the weights assigned to the B × C channels by the channel attention module. Multiplying the output of the previous neural network layer element-wise with M_c yields the LKCAM-enhanced output X_c, as shown in Equation (23). This design allows the model to identify significant channels, emphasizing features that are crucial for the task while downplaying irrelevant channel information.
C_1 = P_avg(Y),  C_2 = P_max(Y)    (21)
M_c = Sigmoid(C_1 + C_2)    (22)
X_c = M_c · X    (23)
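A sketch of the LKCAM following Equations (18)–(23) is shown below. The kernel size of the first depthwise convolution and the value of the reduction factor R are not specified in the text; a 5 × 5 kernel (so that the cascaded pair reaches the stated 29 × 29 receptive field) and R = 4 are assumptions here.

```python
import torch
import torch.nn as nn


class LKCAM(nn.Module):
    """Channel attention with a decomposed large-kernel convolution; returns mask M_c."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)                  # Equation (18)
        self.dw1 = nn.Conv2d(mid, mid, 5, padding=2, dilation=1, groups=mid)   # assumed 5 x 5 kernel
        self.dw2 = nn.Conv2d(mid, mid, 7, padding=12, dilation=4, groups=mid)  # 7 x 7, dilation 4
        self.restore = nn.Conv2d(mid, channels, kernel_size=1)                 # Equation (20)

    def forward(self, x):
        a = self.reduce(x)
        a_bar = self.dw1(self.dw2(a))                      # Equation (19)
        y = self.restore(a_bar)
        c1 = y.mean(dim=(2, 3), keepdim=True)              # global average pooling, Equation (21)
        c2 = y.amax(dim=(2, 3), keepdim=True)              # global max pooling
        return torch.sigmoid(c1 + c2)                      # Equation (22); X_c = M_c * x, Equation (23)
```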
After combining the MDCFM and LKSHAM, and adding the necessary decoding output module, we obtain the complete decoder. The construction of the decoder is shown in Figure 8.

3. Implementation and Results

3.1. Datasets and Data Pre-Processing

To evaluate the effectiveness of the described decoder, empirical tests were conducted using the ISPRS Potsdam and ISPRS Vaihingen datasets to assess its impact on segmentation accuracy, and the results were analyzed in detail. The datasets are freely available at www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx (accessed on 2 February 2024).
(1) The Potsdam dataset, recognized for advancing object segmentation and scene analysis, includes various image formats such as RGB, IRRG, and DSM. Our study selectively employed the RGB format for evaluation. Categorically, it covers six types of land cover: impervious surface, buildings, low vegetation, trees, cars, and miscellaneous areas (commonly termed as ‘clutter’).
(2) The Vaihingen dataset is key for advancing remote-sensing object segmentation research and applications. It mainly includes 33 high-resolution ‘TOP’ aerial images with an average size of 2494 × 2064 pixels. The dataset additionally provides detailed elevation data through Digital Surface Models (DSM) and Normalized Digital Surface Models (NDSM), with a ground sampling distance of 9 cm. Its object categories align with those in the Potsdam dataset. We exclusively utilized the ‘TOP’ images, without the DSM and NDSM data.
(3) The high-resolution images of the Potsdam and Vaihingen datasets necessitate preprocessing through cropping to reduce the memory load; hence, the images were cropped into 512 × 512 patches (a simple tiling sketch is given below). We analyzed six land-cover categories from Potsdam and the five corresponding categories from Vaihingen, excluding clutter, to evaluate the segmentation capability of our model.
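For reference, a simple preprocessing sketch that tiles a large aerial image into non-overlapping 512 × 512 patches is shown below; the non-overlapping stride and the handling of border remainders (discarded here) are assumptions, not the exact preprocessing pipeline used in this work.

```python
import numpy as np


def crop_to_patches(image: np.ndarray, patch: int = 512, stride: int = 512):
    """Split an (H, W, C) image array into patch x patch crops (border remainders dropped)."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            patches.append(image[top:top + patch, left:left + patch])
    return patches
```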

3.2. Implementation Details

3.2.1. Hardware Environment

The experiments in this study were conducted on a platform equipped with an Intel Core i9-9900K CPU (base frequency of 3.6 GHz, 16 threads), an NVIDIA GeForce RTX 2080 Ti GPU, 64 GB of memory, and 4 TB of storage.

3.2.2. Software Environment

The computational experiments used Ubuntu 20.04 as the operating system and CUDA 11.3 as the parallel computing framework. To manage the virtual environments required for model training and inference, we used Anaconda. Detailed configurations are provided in Table 1.

3.2.3. Hyperparameter Settings

During training, the Adam optimizer was used, with the beta coefficients set to β1 = 0.9 and β2 = 0.999. The learning rate was set to 6 × 10−5, and the weight-decay regularization term was set to 0.01. The learning rate was adapted according to the Poly learning-rate policy with a power of 1. Training was stopped after 160,000 iterations.
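A minimal PyTorch sketch of these settings is given below (Adam with β1 = 0.9, β2 = 0.999, learning rate 6 × 10−5, weight decay 0.01, and a poly schedule with power 1 over 160,000 iterations); the placeholder model and the use of LambdaLR to express the poly policy are assumptions, not the authors’ training script.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 6, kernel_size=1)   # placeholder for the segmentation network
max_iters = 160_000

optimizer = torch.optim.Adam(model.parameters(), lr=6e-5,
                             betas=(0.9, 0.999), weight_decay=0.01)
# Poly policy with power 1: the learning rate decays linearly to zero over training.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1.0 - it / max_iters) ** 1.0)
# In the training loop, call optimizer.step() and then scheduler.step() once per iteration.
```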

3.2.4. Common Evaluation Metrics for Land-Cover Segmentation

The commonly used evaluation metrics for land-cover segmentation algorithms include the mean F1 score (mF1) and the mean Intersection over Union (mIoU).
(1) The F1 score is the harmonic mean of precision and recall, which comprehensively considers the performance of these two metrics. Precision refers to the proportion of instances that are truly positive among those that the model predicts as positive samples. It measures the accuracy of the model’s predictions for positive samples. The formula for calculating precision can be referred to as Equation (24).
precision = TP / (TP + FP)    (24)
Here, TP represents true positives: the number of instances correctly predicted as positive samples by the model; FP represents false positives: the number of instances incorrectly identified as positive samples by the model. Recall refers to the proportion of all actual positive samples that are predicted as positive samples by the model. It measures the model’s ability to identify positive samples, that is, how many true positive samples are found. The formula for calculating recall can be seen in Equation (25).
recall = TP / (TP + FN)    (25)
Here, FN represents false negatives: the number of instances incorrectly identified as negative samples by the model. For a single category, the formula for calculating the F1 score can be referred to as Equation (26).
F1 = 2 × precision × recall / (precision + recall)    (26)
The mF1 is the average of all category F1 scores, used to evaluate the model’s overall performance across all categories. Its calculation formula can be referred to as Equation (27).
mF1 = (1/C) Σ_{i=1}^{C} F1_i    (27)
Here, C represents the total number of target categories, and F1_i represents the F1 score of the i-th category.
(2) IoU refers to the ratio of the area of the intersection between the predicted region by the segmentation algorithm and the actual region to the area of their union. The calculation process can be seen in Equation (28).
IoU = |Area_predict ∩ Area_ground_truth| / |Area_predict ∪ Area_ground_truth|    (28)
Here, Area_predict represents the region predicted by the segmentation algorithm, and Area_ground_truth represents the actual region. mIoU is the average value of the Intersection over Union over all categories. The specific calculation formula is shown in Equation (29).
mIoU = (1/C) Σ_{i=1}^{C} TP_i / (TP_i + FP_i + FN_i)    (29)
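These metrics can be computed from a per-class confusion matrix, as in the sketch below; the helper names are illustrative, and the inputs are assumed to be flat integer label arrays.

```python
import numpy as np


def confusion_matrix(pred, gt, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = gt * num_classes + pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)


def miou_mf1(pred, gt, num_classes):
    cm = confusion_matrix(pred, gt, num_classes)
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)                               # Equation (28)
    precision = tp / np.maximum(tp + fp, 1)                              # Equation (24)
    recall = tp / np.maximum(tp + fn, 1)                                 # Equation (25)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)  # Equation (26)
    return iou.mean(), f1.mean()                                         # Equations (29) and (27)
```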

3.3. Ablation Study

To verify the accuracy, generality, and effectiveness of the MDCFD decoder and the LKSHAM, we combined them with top-performing models and compared the results against the original models. In our ablation studies, we first paired the MDCFD and LKSHAM with Segformer’s MiT encoder, choosing the parameter-rich MiT-B5 variant. Secondly, we combined them with HRNetV2 [21], using the robust HRNetV2-W48 model. The outcomes are presented in Table 2.
To assess the detection performance of the two modules suggested in this article, we compared the improved Segformer (named Seg&Ours) and HRNetV2 (named HRNetV2&Ours) models based on our proposal with conventional land-cover segmentation algorithms. The results are illustrated in Table 3.
The results confirm that the proposed decoder, combined with two different encoder structures, significantly enhances the performance of remote-sensing land-cover segmentation. The experimental results indicate that our decoder achieved notable improvements in the mIoU and mF1 metrics. Specifically, on the Potsdam dataset, our decoder increased the mIoU by 1.5 percentage points to 79.6% and the mF1 by 1.2 percentage points to 87.6% compared with the Segformer model. The segmentation accuracy of all categories improved over the baseline models, especially in the “car” category, where the accuracy increased by 1.8 percentage points when combined with the HRNetV2 encoder. In the experiments on the Vaihingen dataset, the mIoU improvement was up to 1.7 percentage points, reaching 82.1%, and the mF1 increased by up to 1.2 percentage points, reaching 90.1%. In particular, in the “car” category, the accuracy increased by 2.4 percentage points over the baseline when combined with the MiT (Segformer) encoder. These results demonstrate that the decoder we designed achieves a significant improvement in segmentation accuracy compared with traditional decoders.

3.4. Visual Results

3.4.1. Feature Map Analysis

In Figure 9a, the impact of the LKSHAM on feature maps is evident. The LKSHAM module enhances the extraction of features for small-scale targets (as indicated by the black box) by selecting an appropriate receptive field. This approach circumvents the issue observed in the baseline model, where such targets were not identifiable. Furthermore, the MDCFD integrates outputs from convolutions with varying dilation rates, synergizing with LKSHAM to enhance the segmentation capabilities across different scales of targets. In Figure 9b, the influence of LKSHAM on feature maps is further elucidated. By incorporating large-kernel convolutions, the LKSHAM module bolsters the extraction of contextual information. Concurrently, the MDCFD leverages dilated convolutions to expand the receptive field, thereby enhancing the differentiation between the foreground and background. This strategy effectively addresses the shortcomings present in the baseline model, as illustrated by the black box.

3.4.2. Segmentation Results

Figure 10 graphically shows the improved segmentation accuracy of the system due to the MDCFD, using a set of image comparisons. As illustrated in Figure 10c,d, the baseline algorithm performs poorly in segmenting the low vegetation marked by the red box in the images, while the algorithm improved with MDCFD can effectively recognize this low vegetation. Furthermore, in Figure 10g,h, the system enhanced by MDCFD demonstrates superior performance in identifying cluttered scenes (as shown in the yellow boxes) compared to the baseline algorithm. These results confirm that MDCFD has better segmentation effects than the baseline algorithm in specific scenarios.
Figure 11 and Figure 12 visually display how the two architectural configurations discussed in this paper enhance model segmentation precision.
The sections highlighted with red borders in Figure 11 and Figure 12 demonstrate the substantial improvement in land-cover classification accuracy of the system integrated with the decoder we designed. Compared with the baseline segmentation results in the third column, our results in the fourth column are closer to the ground truth, showing a distinct advantage. For instance, in the segmentation of Figure 11a,e,i,m, the baseline incorrectly classifies the background as other categories, while our model avoids this issue, indicating that the attention modules we designed effectively reduce the interference of background noise. In Figure 12a,e,i, the baseline incorrectly judges other categories as background, while our model avoids this problem. These cases cover land-cover features at both small and large scales, demonstrating the enhancement of the model’s segmentation capability across different scales and highlighting the system’s adaptability and flexibility in efficiently processing targets of various sizes.

4. Discussion

We have introduced a multi-dilated convolution fusion module and a large-kernel convolution hybrid attention module, dedicated to enhancing the performance of remote-sensing image segmentation. These designs have broadened the model’s receptive field, significantly enhancing the capability to capture global features and distinguish between the foreground and background. Notably, our improved method achieved remarkable improvements in key segmentation metrics such as mIoU and mF1.
These accomplishments are attributed to the redesign of our decoder. As described in Section 2, we integrated the multi-dilated convolution fusion module and the convolutional kernel selection module into the decoder, effectively merging feature maps processed through multiple convolutional pathways and balancing the recognition capabilities for both large-scale and fine-scale targets. The strategy of dynamically selecting an appropriate receptive field within the LKSHAM also played a crucial role in enhancing performance. At the same time, the expanded receptive field strengthened the model’s ability to perceive global features, thereby more effectively distinguishing between the foreground and background. The introduction of the hybrid attention mechanism and large-kernel convolution further contributed to the improvements. The effects of these improvements are demonstrated intuitively in Figure 11 and Figure 12.

5. Conclusions

In tackling the challenges of land-cover segmentation in remote-sensing imagery, we have integrated a multi-dilated rate convolution fusion module into our decoder to address the imbalance between foreground and background as well as scale variation. This enhancement broadens the receptive field, thereby improving the model’s ability to capture global features. Additionally, to manage scene diversity and background interference, we implemented a hybrid attention module that utilizes large-kernel convolution. This module leverages spatial and channel attention mechanisms to enhance the extraction of contextual information. Furthermore, a convolution kernel selection mechanism has been introduced to dynamically select the appropriate kernel, thereby suppressing irrelevant background information and enhancing segmentation accuracy. These findings demonstrate that our refined decoder significantly outperforms its predecessor in the context of remote-sensing image segmentation, affirming its potential for application in the domain of land-cover segmentation.

Author Contributions

Conceptualization, G.L. and X.W.; methodology, G.L. and C.L.; software, C.L. and X.Z.; validation, Y.L.; formal analysis, X.Z. and X.W.; writing—original draft preparation, J.X. and X.W.; visualization, J.X.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the China Postdoctoral Science Foundation (2013M540735); by the National Natural Science Foundation of China under Grants 61901388, 61301291, and 61701360; by the 111 Project under Grant B08038; by the Shaanxi Provincial Science and Technology Innovation Team; by the Fundamental Research Funds for the Central Universities; and by the Youth Innovation Team of Shaanxi Universities.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yuan, X.; Shi, J.; Gu, L. A Review of Deep Learning Methods for Semantic Segmentation of Remote Sensing Imagery. Expert Syst. Appl. 2021, 169, 114417. [Google Scholar] [CrossRef]
  2. Qing, C.; Yu, J.; Xiao, C.B.; Duan, J. Deep Convolutional Neural Network for Semantic Image Segmentation. J. Image Graph. 2020, 25, 1069–1090. [Google Scholar]
  3. Li, Z.; Chen, E. Development Course of Forestry Remote Sensing in China. Natl. Remote Sens. Bull. 2021, 25, 292–301. [Google Scholar]
  4. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A Deep Learning Framework for Semantic Segmentation of Remotely Sensed Data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar] [CrossRef]
  5. Huo, Y.; Gang, S.; Guan, C. Fcihmrt: Feature Cross-Layer Interaction Hybrid Method Based on Res2net and Transformer for Remote Sensing Scene Classification. Electronics 2023, 12, 4362. [Google Scholar] [CrossRef]
  6. Wu, X.; Wang, L.; Wu, C.; Guo, C.; Yan, H.; Qiao, Z. Semantic Segmentation of Remote Sensing Images Using Multiway Fusion Network. Signal Process. 2024, 215, 109272. [Google Scholar] [CrossRef]
  7. Pal, S.K.; Ghosh, A.; Shankar, B.U. Segmentation of Remotely Sensed Images with Fuzzy Thresholding, and Quantitative Evaluation. Int. J. Remote Sens. 2000, 21, 2269–2300. [Google Scholar] [CrossRef]
  8. Li, D.; Zhang, G.; Wu, Z.; Yi, L. An Edge Embedded Marker-Based Watershed Algorithm for High Spatial Resolution Remote Sensing Image Segmentation. IEEE Trans. Image Process. 2010, 19, 2781–2787. [Google Scholar] [PubMed]
  9. Saha, I.; Maulik, U.; Bandyopadhyay, S.; Plewczynski, D. SVMeFC: SVM Ensemble Fuzzy Clustering for Satellite Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2011, 9, 52–55. [Google Scholar] [CrossRef]
  10. Yi, Y.; Zhang, Z.; Zhang, W.; Zhang, C.; Li, W.; Zhao, T. Semantic Segmentation of Urban Buildings from VHR Remote Sensing Imagery Using a Deep Convolutional Neural Network. Remote Sens. 2019, 11, 1774. [Google Scholar] [CrossRef]
  11. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A. Foreground-Aware Relation Network for Geospatial Object Segmentation in High Spatial Resolution Remote Sensing Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4096–4105. [Google Scholar]
  12. Liu, W.; Li, Q.; Lin, X.; Yang, W.; He, S.; Yu, Y. Ultra-High Resolution Image Segmentation via Locality-Aware Context Fusion and Alternating Local Enhancement. arXiv 2021, arXiv:2109.02580. [Google Scholar] [CrossRef]
  13. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
  14. Ma, A.; Wang, J.; Zhong, Y.; Zheng, Z. FactSeg: Foreground Activation-Driven Small Object Semantic Segmentation in Large-Scale Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
  15. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  16. Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient Transformer for Remote Sensing Image Segmentation. Remote Sens. 2021, 13, 3585. [Google Scholar] [CrossRef]
  17. Wang, L.; Li, R.; Duan, C.; Zhang, C.; Meng, X.; Fang, S. A Novel Transformer Based Semantic Segmentation Scheme for Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  18. Xu, R.; Wang, C.; Zhang, J.; Xu, S.; Meng, W.; Zhang, X. Rssformer: Foreground Saliency Enhancement for Remote Sensing Land-Cover Segmentation. IEEE Trans. Image Process. 2023, 32, 1052–1064. [Google Scholar] [CrossRef]
  19. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  20. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.-M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16794–16805. [Google Scholar]
  21. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364. [Google Scholar] [CrossRef]
  22. Li, G.; Yun, I.; Kim, J.; Kim, J. Dabnet: Depth-Wise Asymmetric Bottleneck for Real-Time Semantic Segmentation. arXiv 2019, arXiv:1907.11357. [Google Scholar]
  23. Hu, P.; Perazzi, F.; Heilbron, F.C.; Wang, O.; Lin, Z.; Saenko, K.; Sclaroff, S. Real-Time Semantic Segmentation with Fast Attention. IEEE Robot. Autom. Lett. 2020, 6, 263–270. [Google Scholar] [CrossRef]
  24. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remotely Sensed Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
  25. Srinivas, A.; Lin, T.-Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck Transformers for Visual Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529. [Google Scholar]
  26. Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
  27. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 7262–7272. [Google Scholar]
  28. Wu, H.; Huang, P.; Zhang, M.; Tang, W.; Yu, X. CMTFNet: CNN and Multiscale Transformer Fusion Network for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3314641. [Google Scholar] [CrossRef]
  29. Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S.-M. Segnext: Rethinking Convolutional Attention Design for Semantic Segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  30. Xu, M.; Zhang, Z.; Wei, F.; Hu, H.; Bai, X. Side Adapter Network for Open-Vocabulary Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2945–2954. [Google Scholar]
  31. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  32. Tian, T.; Chu, Z.; Hu, Q.; Ma, L. Class-Wise Fully Convolutional Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2021, 13, 3211. [Google Scholar] [CrossRef]
  33. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual Attention Network for Scene Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
  34. Li, X.; Zhong, Z.; Wu, J.; Yang, Y.; Lin, Z.; Liu, H. Expectation-Maximization Attention Networks for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9167–9176. [Google Scholar]
  35. Li, X.; He, H.; Li, X.; Li, D.; Cheng, G.; Shi, J.; Weng, L.; Tong, Y.; Lin, Z. Pointflow: Flowing Semantics through Points for Aerial Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4217–4226. [Google Scholar]
Figure 1. The structure of multi-dilation and large-kernel convolution-based decoder.
Figure 2. The structure of Multi-Dilation Rate Convolutional Fusion Module.
Figure 3. The structure of Multi-Dilation Rate Convolutional Fusion Decoder.
Figure 4. The structure of Large-Kernel-Selection Hybrid Attention Module.
Figure 5. The structure of kernel selection.
Figure 6. The configuration of Large-Kernel-Selection Spatial Attention Module.
Figure 7. The structure of Large-Kernel Channel Attention Module.
Figure 8. Structural diagram of the decoder improved by MDCFD and LKSHAM.
Figure 9. The feature map analysis of our decoder: The feature maps and segmentation outcomes of two images in the dataset after processing with LKSHAM and MDCFD.
Figure 10. (a–h) Comparison of segmentation results before and after the introduction of MDCFD. (a,e) The original images in the test set. (b,f) Ground truth of the original images. (c,g) The segmentation results without MDCFD. (d,h) The segmentation results with MDCFD.
Figure 11. (a–p) Comparative visualization of model segmentation results before and after enhancement on incorrectly classified regions. (a,e,i,m) The original images in the test set. (b,f,j,n) Ground truth of the original images. (c,g,k,o) The segmentation results of the baseline. (d,h,l,p) The segmentation results of ours.
Figure 12. (a–l) Comparative visualization of model segmentation results before and after enhancement. (a,e,i) The original images in the test set. (b,f,j) Ground truth of the original images. (c,g,k) The segmentation results of the baseline. (d,h,l) The segmentation results of ours.
Table 1. The list of hardware and software environments relied on by the experiment.
Hardware/Software | Parameter/Version
CPU | Intel Core i9-9900K
GPU | NVIDIA GeForce RTX 2080 Ti
Memory | 64 GB
Storage | 4 TB
Operating System | Ubuntu 20.04
Python | 3.8.2
CUDA | 11.3
PyTorch | 1.12.1
mmcv | 2.0.0
mmsegmentation | 1.2.2
numpy | 1.24.4
opencv | 4.9.0
Table 2. The results of ablation experiments for combinations of MDCFD and LKSHAM on the Vaihingen dataset and Potsdam dataset.
Dataset | Model | mIoU (%) MiT | mIoU (%) HRNetV2 | mF1 (%) MiT | mF1 (%) HRNetV2
Vaihingen | Baseline | 80.41 | 79.11 | 88.93 | 88.17
Vaihingen | Baseline + MDCFD | 81.56 | 79.49 | 89.68 | 88.41
Vaihingen | Baseline + MDCFD + LKSHAM | 82.14 | 80.27 | 90.06 | 88.89
Potsdam | Baseline | 78.10 | 77.42 | 86.39 | 85.96
Potsdam | Baseline + MDCFD | 79.13 | 78.19 | 87.21 | 86.46
Potsdam | Baseline + MDCFD + LKSHAM | 79.63 | 78.80 | 87.56 | 87.07
Table 3. Quantitative comparison results on the Vaihingen dataset and Potsdam dataset.
Dataset | Model | Imp. Surf. | Building | Low-Veg. | Tree | Car | Clutter | mF1 | mIoU
Vaihingen | DABNet [22] | 87.8 | 88.8 | 74.3 | 84.9 | 60.2 | - | 79.2 | 70.2
Vaihingen | ERFNet | 88.5 | 90.2 | 76.4 | 85.8 | 53.6 | - | 78.9 | 69.1
Vaihingen | PSPNet | 89.0 | 93.2 | 81.5 | 87.7 | 43.9 | - | 79.0 | 68.6
Vaihingen | FANet [23] | 90.7 | 93.8 | 82.6 | 88.6 | 71.6 | - | 85.4 | 75.6
Vaihingen | ABCNet [24] | 92.7 | 95.2 | 84.5 | 89.7 | 85.3 | - | 89.5 | 81.3
Vaihingen | BoTNet [25] | 89.9 | 92.1 | 81.8 | 88.7 | 71.3 | - | 84.8 | 74.3
Vaihingen | BANet [26] | 92.2 | 95.2 | 83.8 | 89.9 | 86.8 | - | 89.6 | 81.4
Vaihingen | Segmenter [27] | 89.8 | 93.0 | 81.2 | 88.9 | 67.6 | - | 84.1 | 73.6
Vaihingen | Deeplabv3+ | 90.1 | 93.2 | 82.1 | 88.0 | 84.1 | - | 87.5 | 78.0
Vaihingen | CMTFNet [28] | 90.6 | 94.2 | 81.9 | 87.6 | 82.8 | - | 87.4 | 77.9
Vaihingen | SegneXT [29] | 81.1 | 86.2 | 67.5 | 78.2 | 34.2 | - | 70.4 | 59.8
Vaihingen | SAN [30] | 81.8 | 87.3 | 67.5 | 77.6 | 57.1 | - | 76.8 | 65.7
Vaihingen | FCN [31] | 89.7 | 93.2 | 80.8 | 88.9 | 71.6 | - | 84.8 | 73.5
Vaihingen | C-FCN [32] | 87.6 | 91.4 | 77.3 | 84.5 | 76.8 | - | 83.5 | 72.3
Vaihingen | Segformer | 92.0 | 95.5 | 83.3 | 89.2 | 84.6 | - | 88.9 | 80.4
Vaihingen | HRNetV2 | 91.0 | 94.4 | 82.8 | 88.8 | 83.8 | - | 88.2 | 79.1
Vaihingen | Seg&Ours | 92.8 | 95.7 | 85.1 | 89.6 | 87.0 | - | 90.1 | 82.1
Vaihingen | HRNetV2&Ours | 91.8 | 95.1 | 84.0 | 89.0 | 84.5 | - | 88.9 | 80.3
Potsdam | DeeplabV3+ | 92.6 | 96.4 | 86.3 | 87.8 | 95.4 | 55.1 | 85.6 | 77.1
Potsdam | DANet [33] | 88.5 | 92.7 | 78.8 | 85.7 | 73.7 | 43.2 | 77.1 | 65.3
Potsdam | CCNet | 88.3 | 92.5 | 78.8 | 85.7 | 73.9 | 36.3 | 75.9 | 64.3
Potsdam | EMANet [34] | 88.2 | 92.7 | 78.0 | 85.7 | 72.7 | 48.9 | 77.7 | 65.6
Potsdam | Segformer | 92.9 | 96.4 | 86.9 | 88.1 | 95.2 | 58.9 | 86.4 | 78.1
Potsdam | PFNet [35] | 91.5 | 95.9 | 85.4 | 86.3 | 91.1 | 58.6 | 84.8 | 58.6
Potsdam | SegneXT | 80.7 | 88.1 | 70.9 | 73.4 | 72.5 | - | 80.9 | 69.9
Potsdam | SAN | 84.8 | 91.4 | 74.23 | 74.7 | 90.5 | - | 84.2 | 75.1
Potsdam | FCN | 90.8 | 95.6 | 84.1 | 84.8 | 84.9 | - | 88.1 | 79.5
Potsdam | C-FCN | 88.0 | 92.4 | 81.2 | 83.2 | 88.7 | 40.5 | 86.7 | 76.9
Potsdam | HRNetV2 | 92.7 | 96.4 | 87.1 | 88.2 | 94.4 | 57.0 | 86.0 | 77.4
Potsdam | Seg&Ours | 93.3 | 96.8 | 87.9 | 89.3 | 96.2 | 61.9 | 87.6 | 79.6
Potsdam | HRNetV2&Ours | 93.7 | 96.9 | 87.6 | 88.8 | 96.2 | 59.2 | 87.1 | 78.8

