Remote Sensing
  • Article
  • Open Access

7 February 2025

PSNet: A Universal Algorithm for Multispectral Remote Sensing Image Segmentation

1 State Key Laboratory of Multispectral Information Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China
2 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.

Abstract

Semantic segmentation, a fundamental task in remote sensing, plays a crucial role in urban planning, land monitoring, and road vehicle detection. However, compared to conventional images, multispectral remote sensing images present significant challenges due to large-scale variations, multiple bands, and complex details. These challenges manifest in three major issues: low cross-scale object segmentation accuracy, confusion between band information, and difficulties in balancing local and global information. Recognizing that traditional remote sensing indices, such as the Normalized Difference Vegetation Index and the water body index, reveal unique semantic information in specific bands, this paper proposes a feature-decoupling-based pseudo-Siamese semantic segmentation architecture. To evaluate the effectiveness and robustness of the proposed algorithm, comparative experiments were conducted on the Suichang Spatial Remote Sensing Dataset and the Potsdam-S Aerial Remote Sensing Dataset. The results demonstrate that the proposed algorithm outperforms all comparison methods, achieving average accuracies of 80.719% and 77.856% on the Suichang and Potsdam datasets, respectively.

1. Introduction

Since 2010, China has been implementing the High-resolution Earth Observation System (HRES) project, which has launched a large number of high-resolution satellites and established an advanced high-resolution remote sensing system. As a result, a vast amount of valuable remote sensing imagery has been transmitted back to the ground, significantly improving both the quantity and quality of the remote sensing data available to China. This massive amount of remote sensing data plays a crucial role in various fields, such as urban monitoring and planning, building 3D reconstruction [1], and ground object recognition and classification [2,3,4]. In recent years, deep learning methods have been applied in remote sensing, especially for computer vision tasks such as semantic segmentation, which has become a common tool in remote sensing imagery analysis. The pixel-wise classification map obtained from semantic segmentation is of great significance: it can be used to classify urban land, agricultural land, water areas, and vegetation areas [5], which in turn can be used to calculate forest coverage. Additionally, the identification of buildings and roads [6] can provide reference information for urban planning and construction.
Semantic segmentation can achieve precise interpretation of remote sensing images. In recent years, significant breakthroughs have been made in deep neural network research, which has promoted the vigorous development of the semantic segmentation field. For example, classical convolutional neural network (CNN) architectures, such as FCN [7] and UNet [8], perform pixel-level classification using an encoder–decoder framework. However, these methods have limitations in handling complex multi-scale objects and struggle to effectively leverage specific spectral band information in multispectral data. To address these challenges, researchers have incorporated attention mechanisms and Transformer [9] architectures to enhance the integration of global and local information. For instance, SegFormer [10] utilizes Transformers as encoders, overcoming the limitations of traditional convolutional networks in modeling long-range dependencies and multi-scale features. Nevertheless, compared to ordinary images, multispectral remote sensing images exhibit greater scale variability, more spectral bands, and more complex details, which lead to the following shortcomings in existing semantic segmentation methods when applied to multispectral remote sensing images:
1.
Model training cannot adapt to scale variations and is difficult to converge, which is caused by the large-scale variations and imbalanced sample categories in remote sensing images.
2.
Segmentation accuracy is low for some typical land covers, especially vegetation and water bodies, because the special bands are confused with the visible-light bands in multispectral images, resulting in the loss of information from the special bands.
3.
The current algorithms have difficulty balancing local and global information, and cannot achieve good results in both detailed segmentation and global segmentation simultaneously. This is due to the inherent difficulty of multispectral remote sensing images, which are characterized by complex details and rich global information.
To address the problem of large-scale variations and imbalanced sample categories in remote sensing images, which make models difficult to train and unable to adapt to scale changes, this paper proposes a remote sensing image segmentation algorithm that incorporates group convolution and spatial–channel attention. The algorithm uses 32-group convolution to train the convolution modules of the network independently from multiple perspectives, allowing a large number of different convolution kernels, organized into groups, to extract semantic features at various scales. By embedding channel–spatial attention and channel attention at different positions of the encoder, the features are enhanced by weighting along the channel and spatial dimensions. In comparison to traditional approaches such as UNet [8], the proposed method emphasizes the distinctions between cross-band features and the detailed extraction of local features. Unlike PSPNet [11], which relies exclusively on global pooling for information extraction, this approach combines grouped convolutions with channel attention to capture multi-scale semantic information more effectively. A joint loss function combining Dice Loss and cross-entropy loss is designed to address class imbalance and prevent vanishing gradients during training.
To address the issue of low accuracy in segmenting land cover with unique reflectivity in special bands (such as water bodies and vegetation) due to confusion with visible light in multispectral images, which results in the loss of information from special bands, this paper proposes an extensible pseudo-Siamese semantic segmentation network based on the ideas of traditional remote sensing indices, such as the Normalized Difference Vegetation Index (NDVI) [12] and the Normalized Difference Water Index (NDWI) [13]. The framework separates visible light bands from special bands, such as infrared, and selects separate encoders and decoders for each to ensure that the features extracted from special channels such as infrared are not contaminated, thereby fully utilizing semantic information outside of visible light bands. In addition, this paper introduces a Transformer encoder that combines fusion convolution and multi-layer perceptron and designs a matching multi-scale decoder. Suitable feature extraction networks are selected for both RGB and infrared bands to improve the accuracy of semantic segmentation in this pseudo-Siamese network. Compared to approaches like SegFormer [10], the decoupling design of the pseudo-Siamese network better emphasizes the advantages of infrared band characteristics in categories such as vegetation and water bodies. Unlike methods such as FuseNet [14], which primarily address RGB-T images, the proposed method demonstrates more stable performance when processing high-resolution multispectral remote sensing images.
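As a concrete illustration of why the special bands carry distinct semantics, the two indices mentioned above are simple band ratios. The NumPy sketch below shows the standard NDVI and NDWI definitions; the 4-band (R, G, B, NIR) array layout and the small epsilon are illustrative assumptions, not part of the proposed method.

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized Difference Vegetation Index: high for healthy vegetation."""
    return (nir - red) / (nir + red + eps)

def ndwi(green: np.ndarray, nir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized Difference Water Index (McFeeters): high for open water."""
    return (green - nir) / (green + nir + eps)

# Example with a 4-band patch; the (R, G, B, NIR) band order is an assumption.
patch = np.random.rand(256, 256, 4).astype(np.float32)
veg_map = ndvi(patch[..., 3], patch[..., 0])
water_map = ndwi(patch[..., 1], patch[..., 3])
```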
To address the problem of difficulty in feature fusion in the multispectral decoder output and the difficulty of balancing local and global information in space, this paper proposes a local and global feature fusion module called LGFF. When the feature fusion module works, the first step is to extract the local and global information of the input features, and the second step is to fuse the input features by weighting them based on this information. Since this module uses pointwise convolution extensively to extract features, embedding this module into the network does not significantly increase the model size, but it can better integrate local and global information. Compared to the fusion method in PAN [15], which merely sums features from different layers, the LGFF module emphasizes a more dynamic balance between features. In contrast to the feature decoding approach of DeepLabV3+ [16], the LGFF module more effectively integrates multi-scale features.
This research developed a series of neural-network-based methods, summarized below. These results contribute significantly to the development of remote sensing image segmentation and have practical applications in fields such as urban planning, land use, and environmental monitoring:
1.
A remote sensing semantic segmentation algorithm based on group awareness and feature enhancement addresses challenges such as large-scale variations and class imbalance in samples, which hinder model training and adaptation. The use of group convolution allows the model to capture multi-scale semantic information, while the integration of a channel attention module enhances salient features and suppresses redundant information.
2.
A feature-decoupled pseudo-Siamese network architecture effectively mitigates the issue of information confusion between spectral bands in multispectral images. By separating feature extraction for visible and infrared bands and employing a dedicated decoder structure, the architecture improves segmentation accuracy.
3.
A local and global feature fusion (LGFF) module resolves the challenge of integrating decoder outputs in multispectral data and balancing local and global information within spatial domains. Centered around lightweight pointwise convolution, LGFF achieves adaptive feature balancing through dynamic weighting mechanisms.

3. Proposed Method

3.1. Overall Architecture

To prevent feature confusion between different bands, this section proposes a feature-decoupling pseudo-Siamese network architecture. Given that the semantic information in the infrared band significantly differs from that in the visible light band, as demonstrated by traditional remote sensing indices (such as NDVI and NDWI), the image is first split into visible and non-visible infrared bands as it enters the network, with the information flow accordingly separated. The two inputs are then passed through their respective feature extraction networks. To ensure the network’s scalability and compatibility with various decoders, the fusion of information occurs after the decoder. The semantic features derived from different bands are concatenated and aligned before being forwarded to the classification head, which produces pixel-wise classification results. The basic structure is shown in Figure 1.
Figure 1. The feature sources are decoupled by dividing them according to their different bands. The visible light band uses a Transformer encoder, while the infrared band uses a grouped perception and feature-enhanced convolutional encoder. The feature fusion module proposed in this paper is applied to fuse the multi-scale pyramid features of each band.
For multispectral images, this pseudo-Siamese architecture semantic segmentation model has strong scalability. First, this framework is not limited to dual pathways. It only uses dual pathways because the dataset used in this paper is a combination of RGB and infrared. For situations with more channels, more parallel pathways can be set up. Secondly, the choice of encoder or decoder is not limited to the two backbones used in this paper. Whether it is based on a Transformer or a convolutional neural network, this architecture is compatible.
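A minimal PyTorch sketch of this information flow is given below, assuming the two branches are arbitrary encoder–decoder networks that return same-sized feature maps. The module and parameter names are placeholders rather than the authors' implementation, and in the full model the LGFF module of Section 3.3 performs the fusion that appears here as a simple concatenation.

```python
import torch
import torch.nn as nn

class PseudoSiameseSeg(nn.Module):
    """Sketch of the band-decoupled pseudo-Siamese flow (Figure 1).

    `rgb_branch` and `ir_branch` stand for any encoder-decoder pair
    (e.g., a Transformer-based one for RGB and a CNN-based one for IR);
    both are assumed to return feature maps with `feat_ch` channels at
    a common reduced resolution. Names and shapes are illustrative.
    """
    def __init__(self, rgb_branch: nn.Module, ir_branch: nn.Module,
                 feat_ch: int, num_classes: int):
        super().__init__()
        self.rgb_branch = rgb_branch
        self.ir_branch = ir_branch
        # Classification head applied after the decoded features are
        # concatenated and aligned.
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, num_classes, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Decouple the information flow: first three channels as visible
        # light, remaining channel(s) as the non-visible (infrared) band.
        rgb, ir = x[:, :3], x[:, 3:]
        f_rgb = self.rgb_branch(rgb)   # decoded visible-light features
        f_ir = self.ir_branch(ir)      # decoded infrared features
        fused = torch.cat([f_rgb, f_ir], dim=1)
        logits = self.head(fused)
        # Restore full resolution for pixel-wise classification.
        return nn.functional.interpolate(
            logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
```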

3.2. Encoder-Feature Extraction

For the selection of the encoder, considering that the three RGB bands of visible light are the most common, can be directly observed by the human eye, and contain rich global information, this section uses the efficient perception-convolution Transformer of SegFormer as the encoder to extract their features [10]. For the single-channel input image of the infrared band, whose local information is relatively rich, a convolutional encoder based on grouped convolution and feature enhancement is proposed to extract features specifically from the infrared band. The grouped convolution is shown in Figure 2.
Figure 2. The module uses residual connections to focus on the errors in the network. The network uses 32 convolutional kernels, corresponding to dividing the feature map into 32 groups according to channels, extracting features from each group, and then concatenating them together.
The encoder cascades channel attention after each grouped convolution, helping the network recalibrate the attention of each channel after the channel-wise grouped convolution. Meanwhile, a channel–spatial attention module is used to enhance the deepest feature map, allowing the network to adapt to images of different resolutions based on the highest-level features.
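The sketch below illustrates one grouped-convolution block of this kind, assuming a ResNeXt-style 32-group 3 × 3 convolution with a residual connection and a squeeze-and-excitation-style channel attention; the exact layer widths and ordering of the paper's encoder are not specified, so the details are illustrative.

```python
import torch
import torch.nn as nn

class GroupedConvBlock(nn.Module):
    """Illustrative grouped-convolution residual block with cascaded
    channel attention, in the spirit of Figure 2. `channels` must be
    divisible by `groups` and by `reduction`."""
    def __init__(self, channels: int, groups: int = 32, reduction: int = 16):
        super().__init__()
        self.grouped = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1,
                      groups=groups, bias=False),  # 32 channel groups
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Channel attention cascaded after the grouped convolution:
        # squeeze (global average pool) then excite (per-channel weights).
        self.channel_attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.grouped(x)
        out = out * self.channel_attn(out)   # re-weight channels
        return x + out                       # residual connection
```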

3.3. Decoder-Feature Fusion and Resolution Restoring

For the decoder, inspired by the Panoptic FPN network proposed by He et al., this paper uses four groups of convolutional modules for the four-layer pyramid-shaped feature maps generated by the two encoders above. Each module contains a nearest-neighbor interpolation upsampling operation and a skip-connection module, whose outputs are concatenated and then added to the input as a residual connection. Except for the highest-level feature map, whose input is only the upsampled result of that layer, the output of each module is upsampled by a factor of 2 and then skip-connected with the same-level feature map of the pyramid. The two pathways each obtain their own set of features, and the outputs of the two decoders are fused using a feature fusion module based on global and local features. To selectively choose features, a module that can describe the features is needed. This feature fusion module is inspired by Dai’s work [44] on extracting local and global attention. Its structure is shown in Figure 3:
Figure 3. In the local feature attention extraction module, the feature map is processed using a pointwise convolution with a bottleneck structure, resulting in a tensor with the same dimensions as the input vector. In the global attention extraction module, the feature map undergoes global average pooling to compress the spatial dimension, followed by pointwise convolution with a bottleneck structure to produce a vector. Finally, the local and global information is fused through broadcast addition, and a sigmoid activation function is applied to obtain the final fused feature map.
This module relies mainly on pointwise convolution, which has an extremely low computational cost, to extract local and global features separately, and then adds them together to obtain the output. The extraction of local and global features and the computation of the overall feature are described by the following formulas:
$$\mathrm{Local}(X) = \mathrm{BN}\big(\mathrm{PWConv}_2\big(\mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{PWConv}_1(X)\big)\big)\big)\big)$$
$$\mathrm{Global}(X) = \mathrm{BN}\big(\mathrm{PWConv}_2\big(\mathrm{ReLU}\big(\mathrm{BN}\big(\mathrm{PWConv}_1\big(\mathrm{GAP}(X)\big)\big)\big)\big)\big)$$
$$\mathrm{LG}(X) = \mathrm{Sigmoid}\big(\mathrm{Local}(X) \oplus \mathrm{Global}(X)\big)$$
In the above formula, “Local” represents the local feature extraction function, “X” represents the input feature, “BN” represents batch normalization, “PWConv” represents 2D pointwise convolution, “ReLU” represents the activation layer with ReLU as the activation function, “Global” represents the global feature extraction function, and “GAP” represents global average pooling. The final “LG” represents the extracted local–global attention feature.
The feature obtained from the local–global attention extraction module is used as the weight to fuse the two input features. In order to prevent the mean value of the output feature from shifting, the weights applied to the two features are constrained to sum to 1. Therefore, the local–global feature fusion (LGFF) module, shown in Figure 4, can be represented by the following formula:
$$F_{\mathrm{out}} = \big(1 - \mathrm{LG}(F_1 \oplus F_2)\big) \otimes F_1 + \mathrm{LG}(F_1 \oplus F_2) \otimes F_2$$
Figure 4. The LGFF (local–global feature fusion) module uses the local–global feature extracted by the attention module as the weight for feature fusion. The two input features are added together as the baseline feature input to the local–global attention extraction module. The feature obtained from the local–global attention extraction module is used as the weight to help fuse the two input features.
In the above formula, “LG” represents the local–global attention extraction module proposed in the previous section, “ F 1 ” and “ F 2 ” represent the two input features to be fused, and “⊕” and “⊗”, respectively, denote element-wise addition and multiplication, which are used for the preliminary fusion of features and the weighted mapping between features and attention.
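A compact PyTorch sketch of the LGFF module, following the formulas above, is given below; the bottleneck ratio r of the pointwise convolutions is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class LGFF(nn.Module):
    """Local-global feature fusion, sketched from the formulas above."""
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        mid = max(channels // r, 1)
        # Local branch: pointwise-conv bottleneck, keeps the spatial size.
        self.local = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )
        # Global branch: global average pooling, then the same bottleneck
        # applied to the resulting 1 x 1 map.
        self.global_ = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        x = f1 + f2                                           # preliminary fusion
        lg = torch.sigmoid(self.local(x) + self.global_(x))   # broadcast addition
        return (1.0 - lg) * f1 + lg * f2                      # weights sum to 1
```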

3.4. Loss Function

For pixel-dense semantic segmentation tasks, pixel-level cross-entropy loss is a natural choice. To address the class imbalance problem in the datasets used in the experiments, Dice Loss is incorporated to mitigate the foreground–background imbalance. Based on the task and dataset characteristics mentioned above, a joint loss function is proposed in this paper, which is expressed by the following formula:
$$L_{\mathrm{bce}} = -\sum_{i} q_i \log(p_i)$$
$$L_{\mathrm{dice}} = 1 - \frac{2\sum_{i} p_i q_i + \epsilon}{\sum_{i} p_i + \sum_{i} q_i + \epsilon}$$
$$L_{\mathrm{mask}} = L_{\mathrm{bce}} + \mu L_{\mathrm{dice}}$$
The $L_{\mathrm{mask}}$ in the equation is obtained by adding the cross-entropy loss $L_{\mathrm{bce}}$ and the Dice Loss $L_{\mathrm{dice}}$ weighted by the coefficient $\mu$, where $\mu$ is set to 1.5 in all experiments to place more emphasis on handling the class imbalance issue. Both losses are calculated from $q_i$ and $p_i$, where $p_i$ represents the predicted probability that a pixel belongs to the $i$-th class during inference, and $q_i$ is the true label of the pixel, which can only take the values 0 and 1: a value of 0 indicates that the pixel does not belong to the $i$-th class, while 1 means that the pixel’s actual label is the $i$-th class. The Dice Loss uses a hyperparameter $\epsilon$, a very small value used to ensure that the denominator is not zero.
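The joint loss can be sketched as follows for a multi-class setting; the softmax/one-hot handling and the tensor shapes are illustrative assumptions consistent with the definitions above.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits: torch.Tensor, target: torch.Tensor,
               mu: float = 1.5, eps: float = 1e-6) -> torch.Tensor:
    """L_mask = L_bce + mu * L_dice for multi-class segmentation.

    `logits` has shape (N, C, H, W); `target` holds integer class indices
    of shape (N, H, W) with dtype long.
    """
    num_classes = logits.shape[1]
    # Pixel-wise cross-entropy over all classes.
    l_ce = F.cross_entropy(logits, target)
    # Soft Dice over predicted probabilities p_i and one-hot labels q_i.
    p = torch.softmax(logits, dim=1)
    q = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    intersection = (p * q).sum(dim=(0, 2, 3))
    denom = p.sum(dim=(0, 2, 3)) + q.sum(dim=(0, 2, 3))
    l_dice = 1.0 - ((2.0 * intersection + eps) / (denom + eps)).mean()
    return l_ce + mu * l_dice
```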

4. Experiments and Results

4.1. Dataset

This paper selects two representative datasets to verify the effectiveness of the proposed semantic segmentation network. One includes satellite remote sensing images with a resolution of 0.8 m, and the other includes aerial remote sensing images with a resolution of 5 cm.
1.
Suichang: The first dataset, “Suichang”, was acquired by a high-resolution satellite and consists of four-band images (RGB and infrared (IR)). The dataset is from the public dataset available on Baidu’s deep learning platform, PaddlePaddle, in the AI Studio community. It can be downloaded by searching for “Suichang” on the following website: https://aistudio.baidu.com/datasetoverview (accessed on 15 January 2024). The annotations are divided into 10 categories: cultivated land, forest land, grassland, roads, urban construction land, rural construction land, industrial land, construction sites, water, and bare land. There are over 32,000 images in total, each with a resolution of 256 × 256, and they were taken in Suichang, Zhejiang.
2.
The second dataset, Potsdam-s, comes from an open dataset used in the ISPRS 2D Semantic Labeling Contest. The dataset consists of remote sensing images captured by drones and includes four bands of RGB and infrared. The dataset is available for download at https://www.isprs.org/education/benchmarks/UrbanSemLab/Default.aspx (accessed on 20 January 2024). The images were taken in Potsdam, Germany, and are annotated based on different land cover types, including water, buildings, low vegetation, forests, cars, and background. There are a total of 38 high-resolution (6000 × 6000) remote sensing images, but two images were removed due to labeling errors, leaving a total of 36 images. The images were cropped into non-overlapping 256 × 256 patches and simple images (images with only one type of land cover) were removed, resulting in over 20,000 images.
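A simple sketch of the non-overlapping 256 × 256 tiling and the removal of simple (single-class) patches described above is shown below; the function names are illustrative, not the authors' preprocessing code.

```python
import numpy as np

def tile_image(image: np.ndarray, patch: int = 256):
    """Cut an (H, W, C) array into non-overlapping patch x patch tiles,
    discarding any incomplete border strip."""
    h, w = image.shape[:2]
    tiles = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            tiles.append(image[y:y + patch, x:x + patch])
    return tiles

def keep_patch(label_patch: np.ndarray) -> bool:
    """Drop 'simple' patches that contain only one land-cover class."""
    return np.unique(label_patch).size > 1

# Usage: tiles = tile_image(scene); kept = [t for t, m in zip(tiles, label_tiles) if keep_patch(m)]
```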

4.2. Training Details

The experiments were run on Ubuntu 20.04.4, with Python as the main programming language. The algorithm model was mainly built on PyTorch [45] and relies on Python packages such as numpy, PIL, pytorch_toolbelt, and segmentation_models_pytorch. The hardware experimental conditions and configuration can be seen in Table 1. The algorithm training configuration and hyperparameters are as follows:
Table 1. Hardware experimental conditions and configuration.
1.
Image size: cropped to 256 × 256 size by sliding window.
2.
Image augmentation: during training, each of the following transformations is applied to the image and its label with a probability of 35%: vertical flip, horizontal flip, 90-degree rotation (clockwise or counterclockwise), transpose, elastic distortion, grid shuffle (cutting the image into grid cells and randomly rearranging them), and optical distortion (a sketch of this pipeline, together with the optimizer settings, appears after this list).
3.
Model parameter initialization is performed using the parameter initialization model proposed by He [46].
4.
Related hyperparameter configuration: the learning rate is initially set to 0.0001, the AdamW optimizer is used for optimization, and the learning rate adaptive adjustment algorithm is CosineAnnealingWarmRestarts, which completes one cosine oscillation in the first 15 epochs, and the period of learning rate oscillation doubles thereafter.
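The augmentation pipeline and optimizer/scheduler configuration described above can be sketched as follows. The use of the albumentations library is an assumption (the paper lists the transforms but not the package), while the AdamW and CosineAnnealingWarmRestarts settings follow the stated hyperparameters; the placeholder model is for illustration only.

```python
import albumentations as A
import torch

# Augmentations listed above, each applied with 35% probability.
train_aug = A.Compose([
    A.VerticalFlip(p=0.35),
    A.HorizontalFlip(p=0.35),
    A.RandomRotate90(p=0.35),
    A.Transpose(p=0.35),
    A.ElasticTransform(p=0.35),
    A.RandomGridShuffle(p=0.35),   # cut into grid cells and rearrange
    A.OpticalDistortion(p=0.35),
])

# Optimizer and learning-rate schedule as described: AdamW at 1e-4 with
# cosine warm restarts, first cycle of 15 epochs, period doubling thereafter.
model = torch.nn.Conv2d(4, 10, kernel_size=1)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=15, T_mult=2)
```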

4.3. Evaluation Metrics

There are mature and widely accepted evaluation metrics for semantic segmentation. In order to accurately evaluate the effectiveness of the model, this article selects five metrics, PA, MPA, MIoU, FWIoU, and Kappa, as the standards for evaluation.
1.
PA: PA is the abbreviation of Pixel Accuracy, i.e., the proportion of correctly classified pixels. Its calculation method is shown in Formula (8), where $p_{ij}$ represents the number of pixels that actually belong to class $i$ and are identified as class $j$.
$$\mathrm{PA} = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}$$
2.
MPA: MPA is the abbreviation of Mean Pixel Accuracy, i.e., the mean of the per-class pixel accuracies. Its calculation formula is shown in Formula (9).
$$\mathrm{MPA} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij}}$$
3.
MIoU: MIoU is the abbreviation of Mean Intersection over Union. The calculation method is to obtain the intersection over union of each class and then take the average over the classes. The intersection over union can be expressed as the true positives (TP) divided by the sum of the true positives (TP), false negatives (FN), and false positives (FP). Therefore, the formula for IoU is shown in Formula (10), and the formula for MIoU is shown in Formula (11).
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
$$\mathrm{MIoU} = \frac{1}{k+1}\sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$
4.
FWIoU : FWIoU is the abbreviation of Frequency-Weighted Intersection over Union, which is the intersection over union weighted by frequency. The calculation method is to calculate the intersection over union of each category and then weight them based on frequency. The formula is shown in Formula (12).
$$\mathrm{FWIoU} = \frac{1}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}} \sum_{i=0}^{k} \frac{\left(\sum_{j=0}^{k} p_{ij}\right) p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$
5.
Kappa: The Kappa coefficient is a measure of accuracy for remote sensing image classification. After obtaining the confusion matrix of the classification, it can be calculated using Formula (13), where $p_o$ represents the observed accuracy, given by Formula (14), in which $\mathrm{tr}(P)$ is the trace of the confusion matrix and the denominator is the sum of all elements in the confusion matrix. The expected agreement $p_e$ is given by Formula (15), where $A_i$ and $B_i$ represent the sums of the elements in the $i$-th column and the $i$-th row of the confusion matrix, respectively. The Kappa coefficient lies between −1 and 1; a value of 0.6–0.8 indicates high consistency, while a Kappa coefficient greater than 0.8 indicates almost perfect agreement.
$$\kappa = \frac{p_o - p_e}{1 - p_e}$$
$$p_o = \frac{\mathrm{tr}(P)}{\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}}$$
$$p_e = \frac{\sum_{i} A_i B_i}{\left(\sum_{i=0}^{k}\sum_{j=0}^{k} p_{ij}\right)^{2}}$$
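All five metrics can be computed from a single confusion matrix; the following sketch implements the formulas above, with variable names chosen for illustration.

```python
import numpy as np

def metrics_from_confusion(cm: np.ndarray) -> dict:
    """Compute PA, MPA, MIoU, FWIoU, and Kappa from a (k+1) x (k+1)
    confusion matrix, where cm[i, j] counts pixels of true class i
    predicted as class j."""
    total = cm.sum()
    diag = np.diag(cm)
    rows = cm.sum(axis=1)          # true-class totals (row sums)
    cols = cm.sum(axis=0)          # predicted-class totals (column sums)
    eps = 1e-12

    pa = diag.sum() / (total + eps)
    mpa = np.mean(diag / (rows + eps))
    iou = diag / (rows + cols - diag + eps)
    miou = iou.mean()
    freq = rows / (total + eps)
    fwiou = (freq * iou).sum()
    p_o = diag.sum() / (total + eps)
    p_e = (rows * cols).sum() / (total ** 2 + eps)
    kappa = (p_o - p_e) / (1.0 - p_e + eps)
    return {"PA": pa, "MPA": mpa, "MIoU": miou, "FWIoU": fwiou, "Kappa": kappa}
```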

4.4. Results and Analysis

In the comparison of results, the quantitative indicators of eight classic semantic segmentation networks, including UNet, UNet++, PSPNet, LinkNet, PAN, DeepLabV3, DeepLabV3+, and SegFormer, are used as references. To save space, the qualitative visual comparison shows only the result images of the UNet++, LinkNet, DeepLabV3, DeepLabV3+, and SegFormer networks, which perform better on the corresponding datasets.
① Suichang
From Table 2, it can be seen that the PSNet with LGFF proposed in this paper exceeds all the models used for comparison in all technical indicators, achieving the milestone of an average IoU of 80% and a Kappa of 0.9. We have also visualized the comparison between our algorithm and other algorithms on the Suichang dataset in Figure 5. The results show that our algorithm has the clearest segmentation details.
Table 2. Comparison of indicators between our segmentation algorithm and other algorithms on the Suichang dataset.
Figure 5. The visual comparison between our algorithm and other algorithms on the Suichang dataset is shown in the figure. The main land covers in the image are cultivated land and forest, and there are two rivers in the middle of the image. The boxed area on the right-hand side of the image shows the complex interlaced relationship between the two land covers. It can be seen that the network proposed in this paper integrates both specific local features and high-level global features, and the segmentation results not only conform to the overall characteristics of the rural area but also have clean segmentation details. The segmentation of the boundary between forest and cultivated land is closest to the real situation.
② Potsdam-s
For the Potsdam-s dataset, as shown in Table 3, the network proposed in this article exceeds or equals the comparison models in all indicators. We have also visualized the comparison between our algorithm and other algorithms on the Potsdam dataset in Figure 6. The results show that our algorithm can accurately segment the contours of regions even under conditions of poor image quality.
Table 3. Comparison of indicators between our segmentation algorithm and other algorithms on the Potsdam-s dataset.
Figure 6. Due to light obstruction, this area has poor imaging quality and is almost completely black. However, only the LGFF-equipped network proposed in this paper can accurately segment the outline of this area based on local information and, using the global information of the adjacent road, assist in judging that it is an urban area. Based on local information, it accurately segments the boundary between “building” and the semi-circular “background”, and performs the best among all methods.

4.5. Ablation Study

In this paper, four sets of ablation experiments are conducted on the two datasets: (1) To verify the effectiveness of the proposed LGFF module, the first set of ablation experiments compares four fusion methods—LGFF, Concat, Add, and FeaNet. (2) To explore the best usage of the LGFF module, a second set of ablation experiments tests it at the low-level and high-level feature fusion locations of the FPN pyramid network and at the multi-channel feature fusion location of the pseudo-Siamese network, reporting the corresponding metrics for each usage. (3) To validate that separately processing specific spectral bands improves the accuracy of remote sensing image segmentation, a third set of ablation experiments compares a network without channel separation to the proposed PSNet; the decoder remains unchanged, and the former uses a CNN-based encoder without channel separation, referred to as PSNet_CNNonly for clarity. (4) For the same backbone network, the infrared band, which primarily responds to heat-emitting objects and lacks global information, uses a CNN-based encoder. In the fourth set of ablation experiments, the encoder for the infrared band is fixed, and a CNN-based encoder is compared with a Transformer-based encoder for the RGB band. For clarity, the variant whose RGB band uses a CNN-based pseudo-Siamese network is referred to as PSNet_CNN, and the variant whose RGB band uses the Transformer-based MCT pseudo-Siamese network as PSNet.
① Suichang
As shown in Table 4, the LGFF fusion method exhibits a clear advantage in the key metrics mIoU and FWIoU and achieves the highest value in 4 of the 5 indicators; the only exception is the mPA metric, where the Add fusion method slightly outperforms LGFF. Table 5 demonstrates that LGFF, when applied to both high- and low-level feature fusion in the feature pyramid and in the pseudo-Siamese network, delivers the best network accuracy. Table 6 highlights that PSNet outperforms PSNet_CNNonly across all metrics, strongly supporting the notion that processing specific spectral bands separately enhances remote sensing image segmentation accuracy. Finally, Table 7 shows that PSNet outperforms PSNet_CNN in all metrics, further validating that a Transformer-based encoder is better suited for extracting information from the RGB bands of remote sensing images.
Table 4. Accuracy indicators of four feature fusion methods on the Suichang dataset.
Table 5. Accuracy indicators of three usages of the LGFF feature fusion module on the Suichang dataset.
Table 6. Accuracy indicators of PSNet_CNNonly and PSNet on Suichang dataset.
Table 7. Accuracy indicators of PSNet_CNN and PSNet on Suichang dataset.
② Potsdam-s
For the Potsdam-s dataset, as shown in Table 8, the LGFF fusion method achieves the highest mIoU and demonstrates the most stability, consistently ranking among the top in all indicators across the four fusion methods. Table 9 shows that LGFF, when applied to both high- and low-level feature fusion in the feature pyramid and in the pseudo-Siamese network, delivers the best network accuracy. Table 10 reveals that for the Potsdam dataset, PSNet outperforms PSNet_CNNonly across all metrics, strongly supporting the notion that processing specific spectral bands separately improves remote sensing image segmentation accuracy. Table 11 demonstrates that PSNet outperforms PSNet_CNN in all metrics, further validating that a transformer-based encoder is better suited for extracting information from the RGB bands of remote sensing images. Both the Suichang and Potsdam-s datasets validate the effectiveness of our network design.
Table 8. Accuracy indicators of four feature fusion methods on the Potsdam-s dataset.
Table 9. Comparison of indicators of three usages of the LGFF module on the Potsdam-s dataset.
Table 10. Accuracy indicators of PSNet_CNNonly and PSNet on Potsdam-s dataset.
Table 11. Accuracy indicators of PSNet_CNN and PSNet on Potsdam-s dataset.

5. Conclusions

The research presented in this paper has been validated using the Suichang satellite remote sensing dataset and the Potsdam aerial remote sensing dataset. To address large-scale variations in remote sensing images, prevent feature confusion between different bands, and resolve the challenge of local and global feature fusion, we designed a pseudo-Siamese architecture. This architecture employs a CNN-based network and a Transformer-based network for feature extraction, and incorporates the LGFF module to integrate features from different bands while preserving both local and global information. These innovations have significantly enhanced the performance of semantic segmentation tasks. By combining these three improvements in the network, multispectral information is fully leveraged, resulting in a marked improvement in semantic segmentation performance compared to the baseline model.
For the semantic segmentation task of multispectral or hyperspectral images, the algorithm proposed in this paper still has several areas for improvement. Although the proposed network, with its three key innovations, shows significant improvements over the baseline model, its model size has doubled and its processing speed has decreased. These issues can be addressed in two ways. First, knowledge distillation can be employed to transfer the learned information from the large model to a smaller one, thereby achieving model compression. Second, network pruning and reducing numerical precision where permissible can help reduce storage requirements. The feature fusion in this paper occurs after upsampling, which requires the parallel network decoders to align with the corresponding encoders. While this approach benefits feature fusion in later stages, it also increases the overall network size. A potential solution would be to explore feature alignment strategies that allow features extracted by different encoders to be fused immediately after encoding, followed by decoding with a unified decoder.

Author Contributions

Conceptualization and Methodology, Y.Z.; Resources and Funding Acquisition, Z.C.; Investigation, T.Z.; Writing, C.T.; Data Curation and Visualization, W.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Civil Aerospace Technology Pre-research Project of China’s 14th Five-Year Plan, Guide Number: D040404, and the Key Laboratory of Target Cognition and Application Technology, Project Number: 2023-CXPT-LC-005.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Rottensteiner, F.; Sohn, G.; Gerke, M.; Wegner, J.D.; Breitkopf, U.; Jung, J. Results of the ISPRS benchmark on urban object detection and 3D building reconstruction. ISPRS J. Photogramm. Remote Sens. 2014, 93, 256–271. [Google Scholar] [CrossRef]
  2. Volpi, M.; Ferrari, V. Semantic segmentation of urban scenes by learning local class interactions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  3. Kampffmeyer, M.; Salberg, A.B.; Jenssen, R. Semantic Segmentation of Small Objects and Modeling of Uncertainty in Urban Remote Sensing Images Using Deep Convolutional Neural Networks. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 680–688. [Google Scholar] [CrossRef]
  4. Kemker, R.; Salvaggio, C.; Kanan, C. Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS J. Photogramm. Remote Sens. 2018, 145, 60–77. [Google Scholar] [CrossRef]
  5. Gupta, S.; Uma, D.; Hebbar, R. Analysis and Application of Multispectral Data for Water Segmentation Using Machine Learning. In Computer Vision and Machine Intelligence; Tistarelli, M., Dubey, S.R., Singh, S.K., Jiang, X., Eds.; Springer Nature: Singapore, 2023; pp. 709–718. [Google Scholar]
  6. Mei, J.; Li, R.J.; Gao, W.; Cheng, M.M. CoANet: Connectivity Attention Network for Road Extraction From Satellite Imagery. IEEE Trans. Image Process. 2021, 30, 8540–8552. [Google Scholar] [CrossRef] [PubMed]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar] [CrossRef]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  9. Vaswani, A. Attention is all you need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
  10. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar] [CrossRef]
  12. Kaufman, Y.; Tanre, D. Atmospherically resistant vegetation index (ARVI) for EOS-MODIS. IEEE Trans. Geosci. Remote Sens. 1992, 30, 261–270. [Google Scholar] [CrossRef]
  13. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432. [Google Scholar] [CrossRef]
  14. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In Proceedings of the Computer Vision—ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Revised Selected Papers, Part I 13. Springer: Cham, Switzerland, 2017; pp. 213–228. [Google Scholar]
  15. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  16. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  17. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held Conjunction MICCAI 2018, Granada, Spain, 20 September 2018, Proceedings 4; Springer International Publishing: Cham, Switzerland, 2018; Volume 11045, pp. 3–11. [Google Scholar]
  18. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.; Wu, J. UNet 3+: A Full-Scale Connected UNet for Medical Image Segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1055–1059. [Google Scholar]
  19. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  21. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  22. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110. [Google Scholar] [CrossRef] [PubMed]
  23. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  24. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  25. Liao, M.; Wan, F.; Yao, Y.; Han, Z.; Zou, J.; Wang, Y.; Feng, B.; Yuan, P.; Ye, Q. End-to-end weakly supervised object detection with sparse proposal evolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 210–226. [Google Scholar]
  26. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  27. Nguyen, C.; Asad, Z.; Deng, R.; Huo, Y. Evaluating transformer-based semantic segmentation networks for pathological image segmentation. In Proceedings of the Medical Imaging 2022: Image Processing, San Diego, CA, USA, 20 February–28 March 2022; Volume 12032, pp. 942–947. [Google Scholar]
  28. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 12299–12310. [Google Scholar]
  29. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  30. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  31. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  32. Wang, L.; Li, D.; Dong, S.; Meng, X.; Zhang, X.; Hong, D. PyramidMamba: Rethinking Pyramid Feature Fusion with Selective Space State Model for Semantic Segmentation of Remote Sensing Imagery. arXiv 2024, arXiv:2406.10828. [Google Scholar]
  33. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. Rs-mamba for large remote sensing image dense prediction. arXiv 2024, arXiv:2404.02668. [Google Scholar] [CrossRef]
  34. Chen, K.; Zou, Z.; Shi, Z. Building extraction from remote sensing images with sparse token transformers. Remote Sens. 2021, 13, 4441. [Google Scholar] [CrossRef]
  35. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to prompt for remote sensing instance segmentation based on visual foundation model. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 4701117. [Google Scholar] [CrossRef]
  36. Osco, L.P.; Wu, Q.; de Lemos, E.L.; Gonçalves, W.N.; Ramos, A.P.M.; Li, J.; Junior, J.M. The segment anything model (sam) for remote sensing applications: From zero to one shot. Int. J. Appl. Earth Obs. Geoinf. 2023, 124, 103540. [Google Scholar] [CrossRef]
  37. Ding, L.; Wang, Y.; Laganiere, R.; Huang, D.; Fu, S. Convolutional neural networks for multispectral pedestrian detection. Signal Process. Image Commun. 2020, 82, 115764. [Google Scholar] [CrossRef]
  38. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  39. Hwang, S.; Park, J.; Kim, N.; Choi, Y.; So Kweon, I. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045. [Google Scholar]
  40. Deng, F.; Feng, H.; Liang, M.; Wang, H.; Yang, Y.; Gao, Y.; Chen, J.; Hu, J.; Guo, X.; Lam, T.L. FEANet: Feature-enhanced attention network for RGB-thermal real-time semantic segmentation. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 4467–4473. [Google Scholar]
  41. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 325–341. [Google Scholar]
  42. Ha, Q.; Watanabe, K.; Karasawa, T.; Ushiku, Y.; Harada, T. MFNet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 5108–5115. [Google Scholar]
  43. Sun, Y.; Zuo, W.; Liu, M. RTFNet: RGB-thermal fusion network for semantic segmentation of urban scenes. IEEE Robot. Autom. Lett. 2019, 4, 2576–2583. [Google Scholar] [CrossRef]
  44. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  45. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8026–8037. [Google Scholar]
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
