1. Introduction
Remote sensing images (RSIs) are acquired through imaging sensors, typically mounted on satellites, aircraft, and unmanned aerial vehicles (UAVs), enabling non-contact observation of terrestrial objects [1]. A comprehensive semantic understanding of RSIs significantly influences various downstream applications, including water resource management [2,3], land cover classification [4,5,6], urban planning [7,8,9], and hazard assessment [10,11]. To achieve the precise labeling of individual pixels with specific classes, semantic segmentation [12], originally designed for natural image processing, has been successfully applied to RSIs with exceptional performance.
Traditional segmentation methods primarily relied on manually crafted features as guidance for pixel recognition. In the initial stages, classical techniques such as logistic regression [13] and distance measures [14] were favored for their stability and user-friendliness. Subsequently, more advanced models, including support vector machines (SVMs) [15], Markov random fields (MRFs) [16], random forests (RFs) [17], and conditional random fields (CRFs) [18], were developed to enhance the classification process. However, despite the introduction of robust classifiers, the reliance on hand-selected features inherently limited overall performance, particularly in terms of accuracy.
Deep convolutional neural networks (DCNNs) have gained prominence for their exceptional performance in a multitude of computer vision tasks [19,20,21]. DCNNs can automatically derive task-specific features, making them an optimal choice for handling complex scenarios. Consequently, the remote sensing community has become increasingly interested in applying DCNNs to RSIs. This interest has led to the development of several DCNN-based RSI interpretation methods, showcasing their adaptability in comprehending multi-source and multi-resolution RSIs [22,23]. While these methods have significantly improved representation learning and classifier training, they are fundamentally constrained by the fixed geometry of convolutional kernels, resulting in limited local receptive fields and short-range contextual awareness. Moreover, RSIs present unique challenges compared to natural imagery due to their broad coverage, diverse objects, and varying resolutions.
Upon comprehensive review, it is evident that leveraging contextual information offers a promising approach to enhancing the discriminative capacity of learned representations. Two distinct methodologies have been proposed to integrate extensive contextual knowledge, thereby enriching pixel-wise representations within segmentation networks. Initially, several studies incorporated dilated convolutional layers and pooling functions at different scales to capture multi-scale features. For instance, in the realm of RSI semantic segmentation, MLCRNet [24] introduced multi-level context aggregation and achieved superior performance on benchmark datasets such as ISPRS Potsdam [25] and Vaihingen [26]. Furthermore, Shang et al. devised a multi-scale feature fusion network using atrous convolutions [27], and Du et al. crafted a similar semantic segmentation network tailored for mapping urban functional zones [28].
A more sophisticated approach involves the incorporation of attention modules designed to capture long-range dependencies. Attention, a cognitive process that focuses selectively on specific information while disregarding other perceptible data, plays a pivotal role in human cognition and survival [29,30]. Leveraging the self-attention mechanism (SAM), a network can concentrate on information-rich regions, thereby enhancing the representation of crucial areas. Consequently, segmentation accuracy has risen substantially with the emergence of attention-based methods [31]. In the realm of RSIs, Li et al. introduced innovative strategies for segmenting large-scale satellite RSIs, including dual attention and deep fusion techniques [32]. Li et al. proposed a multi-attention network that extracts contextual dependencies while maintaining computational efficiency [33]. HCANet was developed to amalgamate cross-level contextual and attentive representations through the attention mechanism [34]. EDENet learns edge distributions through a distribution attention module, effectively injecting edge information in a learnable manner [35]. Lei et al. proposed LANet, which bridges the gap between high-level and low-level features by incorporating a patch attention module to focus locally [36]. In summary, the attention mechanism has demonstrated its superiority in the field of RSI, enabling models to recognize and accommodate diverse intra-class variances and subtle inter-class distinctions [37].
However, all the aforementioned methods were primarily designed to process RSIs and learn features within the spatial domain, without giving due consideration to their spectral properties. In image processing, the inner body and edges of an object correspond to low- and high-frequency components, respectively. This relationship is visually represented in Figure 1, where we illustrate an RSI together with its frequency image: Figure 1c showcases the low-frequency component, while Figure 1d presents the high-frequency component, supporting this observation. Furthermore, self-attention is fundamentally designed to enhance the internal consistency of objects through similarity measurement; however, it employs identical learnable parameters for all frequency components, hindering its ability to simultaneously enhance internal consistency and inter-object edge contouring. Therefore, the effective utilization of frequency-domain features, particularly learning the spectral context inherent in remote sensing images, becomes paramount.
In summary, we contend that while learning representations of remote sensing images with convolutional neural networks and self-attention mechanisms in the spatial domain enhances internal consistency, it inadequately incorporates spectral contexts and erodes edge details. This observation suggests the need to optimize learned representations in both the frequency and spatial domains, requiring skillful aggregation across these domains. To address these challenges, this paper introduces a novel approach. Firstly, we propose a joint spectral–spatial attention (JSSA) module that deploys spectral attention (SpeA) and spatial attention (SpaA) in parallel. Instead of mere feature-level aggregation, we devise a post-weighted summation of the two attention maps to create a unified attention map that concurrently incorporates spectral and spatial contexts. To facilitate this, we formulate a novel loss function that trains the network to learn discriminative representations in both the spectral and spatial domains. Finally, this integrated approach results in the spectrum-space collaborative network (SSCNet), which accurately performs pixel-level segmentation of ground objects in remote sensing images. The primary contributions of this work are summarized as follows:
- (1)
We propose SpeA for capturing the spectral context in the frequency domain. SpeA first maps the feature map into the frequency domain using a 2D fast Fourier transform (2D FFT) layer. Because the transformed features are complex-valued, we compute pairwise similarity by measuring the complex spectral Euclidean distance (CSED) of the real and imaginary parts. Subsequently, we create the SpeA map by weighted summation, enabling the prioritization of spectral features in attention modeling.
- (2)
To comprehensively model and utilize contexts that span spectral and spatial domains, we present the JSSA module. For spatial contexts, we incorporate position-wise self-attention as a parallel SpaA branch. Through an attention fusion (AttnFusion) module, we merge the attention maps obtained from SpeA and SpaA. This results in JSSA producing an attention map that considers both spectral and spatial contexts simultaneously.
- (3)
We formulate a hybrid loss function (HLF) that encompasses both spectral and spatial losses. For the high-frequency components, we calculate an edge loss. To promote the inner consistency of objects, which is mainly carried by low-frequency components, we introduce a Dice loss. Simultaneously, we employ a cross-entropy loss to supervise the spatial aspects. By combining these losses with appropriate weights, we establish a hybrid loss function that helps the network learn discriminative representations in both the frequency and spatial domains.
- (4)
Complementing the above-mentioned designs, we propose the SSCNet, a semantic segmentation network for remote sensing images. Thorough experimentation demonstrates its superior performance compared with other state-of-the-art methods. Furthermore, an ablation study corroborates the efficacy of the SpeA component.
This paper is structured as follows. Section 2 provides an overview of related research in semantic segmentation of RSIs and methods focused on frequency-domain-based learning. Section 3 introduces the comprehensive network architecture along with the individual sub-modules and their formulations. Section 4 compiles and compares the findings on two prominent RSI datasets to verify the model's performance, followed by in-depth discussions. Section 5 offers conclusions drawn from this study and outlines potential future research directions.
3. The Proposed Method
3.1. Overall Framework
As illustrated in Figure 2, the proposed SSCNet adopts an encoder-decoder architecture. SSCNet primarily introduces enhancements in two key areas. Firstly, we introduce the JSSA module, which comprehensively models and leverages contextual information spanning both the frequency and spatial domains. In the frequency domain, we generate a SpeA attention map that explicitly considers the spectral properties. Meanwhile, in the spatial domain, the position-wise self-attention mechanism captures context from the spatial-channel perspective. After fusion by the AttnFusion module, these two attention maps collectively provide JSSA with a well-rounded contextual foundation. JSSA thus extends the prevailing spatial-domain-based methodologies by integrating spectral context. Secondly, for the representation of high-frequency components, we incorporate edge distributions obtained from the ground truth to supervise the network. For the low-frequency components, we introduce a Dice loss to preserve the inner consistency of objects. Correspondingly, we formulate a hybrid loss function that embraces both spectral and spatial losses with appropriate weighting. This design encourages the network to learn informative spectral and spatial cues concurrently, thereby enhancing the discriminative capability of the acquired representation.
3.2. Joint Spectral–Spatial Attention
In this subsection, we provide a detailed exposition of the joint spectral–spatial attention (JSSA). The pipeline of JSSA is depicted in Figure 3. In essence, JSSA simultaneously employs SpeA and SpaA. Subsequently, an attention fusion layer performs post-fusion, employing a weighted summation of the attention maps generated by SpeA and SpaA. This architectural choice facilitates the generation of the JSSA attention map, which jointly evaluates pixel-wise correlations across both the frequency and spatial domains, effectively capturing and aggregating spectral and spatial contexts for feature refinement. This is followed by a matrix multiplication and an element-wise summation, yielding the JSSA-refined representations. The precise steps are elucidated below.
Consider the input feature of JSSA, denoted as $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the number of channels, height, and width, respectively. In the SpeA branch, $X$ undergoes an initial transformation into the frequency domain using a 2D fast Fourier transform (2D FFT) function. This transformation produces complex values, leading us to split the transformed feature maps into two distinct components: the real part and the imaginary part.
As previously discussed, SpeA generates an attention map by projecting $X$ into the frequency domain and assessing spectrum-related similarity. In Figure 4, we begin with the input feature $X \in \mathbb{R}^{C \times H \times W}$, which is initially transformed using a 2D FFT. The 2D FFT takes a spatial signal within $X$ and transforms it into a complex frequency signal $F \in \mathbb{C}^{C \times H \times W}$. Suppose $F$ is defined as $F(u, v)$, where $u$ and $v$ represent the spatial frequency indices in the horizontal and vertical directions, respectively. The formula for the 2D FFT is expressed as follows:

$$F(u, v) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} X(h, w)\, e^{-j 2\pi \left( \frac{uh}{H} + \frac{vw}{W} \right)}. \tag{1}$$

Here, $j$ represents the imaginary unit. It is important to note that $F$ includes both real and imaginary components. For clarity, we denote these as $F_{re}$ and $F_{im}$. Correspondingly, $F$, $F_{re}$, and $F_{im}$ all possess dimensions of $C \times H \times W$.
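To make the transform concrete, the following minimal PyTorch sketch (our illustration, not the authors' released code; tensor names and the batched B x C x H x W layout are assumptions) applies a 2D FFT to a feature map and splits the result into its real and imaginary parts:

```python
import torch

# Minimal sketch of the SpeA frequency transform.
x = torch.randn(1, 64, 32, 32)        # input feature map X, batched as B x C x H x W

f = torch.fft.fft2(x, dim=(-2, -1))   # complex frequency signal F(u, v)
f_re, f_im = f.real, f.imag           # real part F_re and imaginary part F_im

# F_re and F_im share the dimensions of X, as stated above.
assert f_re.shape == x.shape and f_im.shape == x.shape
```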
The real and imaginary parts of complex numbers represent different aspects of the underlying data: the real part typically encodes amplitude or magnitude information, while the imaginary part encodes phase information. Separating these parts allows us to analyze and compare the two aspects individually; therefore, we process them in parallel branches. In a generalized attention module, the similarity function is dynamically adjustable. Since we strive to involve the spectral context in the frequency domain, we utilize the complex spectral Euclidean distance followed by the Softmax function to quantify spectral similarity.
In the top branch of SpeA, $F_{re}$ is reshaped and transposed to obtain the query feature (Q in Figure 4) $Q_{re} \in \mathbb{R}^{(H \times W) \times C}$, while the key feature (K in Figure 4) is $K_{re} \in \mathbb{R}^{(H \times W) \times C}$. The attention map can be formed as

$$D_{re}(i, j) = \left\| Q_{re}^{i} - K_{re}^{j} \right\|_{2}, \tag{2}$$

where $Q_{re}^{i}$ denotes the vector of position $i$, and $K_{re}^{j}$ represents the vector of position $j$. With Softmax, we have the attention map of the real part as

$$A_{re} = \mathrm{Softmax}(-D_{re}), \tag{3}$$

where the attention map of the real part satisfies $A_{re} \in \mathbb{R}^{(H \times W) \times (H \times W)}$. Likewise, we have the attention map of the imaginary part as

$$A_{im} = \mathrm{Softmax}(-D_{im}), \tag{4}$$

where $A_{im} \in \mathbb{R}^{(H \times W) \times (H \times W)}$. As can be observed, $A_{im}$ is the same size as $A_{re}$. Finally, a weighted summation is applied to produce the SpeA attention map by

$$A_{spe} = \alpha A_{re} + (1 - \alpha) A_{im}, \tag{5}$$

where $\alpha$ is a coefficient set as 0.5.
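A compact sketch of how the SpeA map could be computed is given below. We use `torch.cdist` for the pairwise Euclidean distances and negate them before the Softmax so that smaller distances yield larger attention weights; this sign convention, like the function and tensor names, is our assumption, since the paper leaves the detail open:

```python
import torch
import torch.nn.functional as F

def spea_attention(x: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Sketch of the SpeA attention map (Equations (1)-(5)); x is B x C x H x W."""
    b, c, h, w = x.shape
    f = torch.fft.fft2(x, dim=(-2, -1))

    maps = []
    for part in (f.real, f.imag):
        q = part.reshape(b, c, h * w).transpose(1, 2)  # query Q: B x (HW) x C
        k = part.reshape(b, c, h * w).transpose(1, 2)  # per-position key vectors
        d = torch.cdist(q, k, p=2)                     # CSED: B x (HW) x (HW)
        maps.append(F.softmax(-d, dim=-1))             # smaller distance -> larger weight

    a_re, a_im = maps
    return alpha * a_re + (1.0 - alpha) * a_im         # weighted summation, Equation (5)
```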
Figure 5 illustrates the pipeline of SpaA, in which we apply position-wise self-attention to $X$. We reshape $X$ into $\mathbb{R}^{C \times (H \times W)}$ and perform matrix multiplication between its transposed query feature and the key feature. We obtain the SpaA attention map using the Softmax function, represented as

$$A_{spa} = \mathrm{Softmax}\left( Q_{spa} K_{spa} \right), \tag{6}$$

where $Q_{spa} \in \mathbb{R}^{(H \times W) \times C}$, $K_{spa} \in \mathbb{R}^{C \times (H \times W)}$, and $A_{spa} \in \mathbb{R}^{(H \times W) \times (H \times W)}$.
After separately attending to spectral and spatial correlations, AttnFusion (Figure 6) combines them through a straightforward weighted summation of $A_{spe}$ and $A_{spa}$,

$$A_{jssa} = \beta A_{spe} + (1 - \beta) A_{spa}, \tag{7}$$

where $\beta$ is a coefficient pre-defined as 0.5. Afterward, the reshaped feature $X \in \mathbb{R}^{C \times (H \times W)}$ is multiplied by $A_{jssa}$, followed by an element-wise summation,

$$Y = X + X A_{jssa}^{\top}. \tag{8}$$

In the end, we have the JSSA-refined feature map denoted as $Y$, which is reshaped back to $\mathbb{R}^{C \times H \times W}$. Hereafter, $Y$ is put forward to the decoder stage.
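Under the same assumptions, the SpaA branch, the AttnFusion step, and the final refinement can be sketched as follows (an illustrative reconstruction of Equations (6)-(8), reusing `spea_attention` from above):

```python
import torch
import torch.nn.functional as F

def jssa_refine(x: torch.Tensor, alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """Sketch of the full JSSA refinement; x is B x C x H x W."""
    b, c, h, w = x.shape
    v = x.reshape(b, c, h * w)                      # flattened feature: B x C x (HW)

    # SpaA: position-wise self-attention in the spatial domain (Equation (6)).
    a_spa = F.softmax(torch.bmm(v.transpose(1, 2), v), dim=-1)   # B x (HW) x (HW)

    # AttnFusion: weighted summation of spectral and spatial maps (Equation (7)).
    a_jssa = beta * spea_attention(x, alpha) + (1.0 - beta) * a_spa

    # Matrix multiplication with the attention map plus a residual summation (Equation (8)).
    y = v + torch.bmm(v, a_jssa.transpose(1, 2))
    return y.reshape(b, c, h, w)                    # JSSA-refined feature map Y
```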
3.3. Hybrid Loss Function
In this section, we introduce a hybrid loss function (HLF) for network tuning. As previously discussed, the high-frequency and low-frequency components contribute to the edges and internal consistency of the learned representations, respectively, while the spatial branch is naturally supervised by a cross-entropy loss. Considering these factors, we formulate a novel loss function:

$$\mathcal{L} = \mathcal{L}_{ce} + \gamma \mathcal{L}_{dice} + (1 - \gamma) \mathcal{L}_{edge}, \tag{9}$$

where $\mathcal{L}_{ce}$, $\mathcal{L}_{dice}$, and $\mathcal{L}_{edge}$ represent the cross-entropy, Dice, and edge losses, respectively, and $\gamma$ is a coefficient pre-defined as 0.5.
More concretely, the Dice loss measures the spatial overlap between the predicted segmentation and the ground truth, quantifying the consistency of the two masks. It is defined as follows:

$$\mathcal{L}_{dice} = 1 - \frac{1}{K} \sum_{k=1}^{K} \frac{2 \sum_{n=1}^{N} P_{k,n} G_{k,n}}{\sum_{n=1}^{N} P_{k,n} + \sum_{n=1}^{N} G_{k,n}}, \tag{10}$$

where $P$ denotes the binary segmentation mask generated by the neural network, in which 1 represents the object region and 0 the background; $G$ is the ground truth segmentation mask, which likewise consists of binary labels for the object (1) and background (0); $N$ represents the total number of pixels in the binary masks; and $K$ is the number of classes. In the context of $\mathcal{L}_{edge}$,

$$\mathcal{L}_{edge} = \frac{1}{|B_{P}|} \sum_{p \in B_{P}} \min_{g \in B_{G}} \left\| p - g \right\|_{2}, \tag{11}$$

where $p$ is the location of a boundary pixel in the predicted boundary set $B_{P}$, and $g$ is the nearest ground truth edge pixel in $B_{G}$. The edge loss is thus based on the average distance between boundary pixels in the predicted segmentation and their nearest counterparts in the ground truth boundary; in this study, we adopt the Euclidean distance.
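The sketch below assembles the HLF under the stated weighting. The boundary extraction (`boundary_mask`) and the use of `scipy.ndimage.distance_transform_edt` are our own simplifications of Equation (11), shown for a batch size of 1; the edge term operates on hard predictions here, so a differentiable surrogate would be needed in actual training:

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def boundary_mask(labels: torch.Tensor) -> torch.Tensor:
    """Mark pixels whose 4-neighborhood contains a different label (boundary pixels)."""
    pad = F.pad(labels.float().unsqueeze(1), (1, 1, 1, 1), mode="replicate")
    center = pad[:, :, 1:-1, 1:-1]
    diff = (pad[:, :, :-2, 1:-1] != center) | (pad[:, :, 2:, 1:-1] != center) | \
           (pad[:, :, 1:-1, :-2] != center) | (pad[:, :, 1:-1, 2:] != center)
    return diff.squeeze(1)

def hybrid_loss(logits: torch.Tensor, target: torch.Tensor, gamma: float = 0.5):
    """Sketch of Equation (9): L = L_ce + gamma * L_dice + (1 - gamma) * L_edge."""
    ce = F.cross_entropy(logits, target)

    # Dice loss over one-hot masks (Equation (10)).
    k = logits.shape[1]
    prob = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, k).permute(0, 3, 1, 2).float()
    inter = (prob * onehot).sum(dim=(0, 2, 3))
    denom = (prob + onehot).sum(dim=(0, 2, 3))
    dice = 1.0 - (2.0 * inter / denom.clamp(min=1e-6)).mean()

    # Edge loss (Equation (11)): mean Euclidean distance from each predicted
    # boundary pixel to its nearest ground truth boundary pixel.
    pred_edge = boundary_mask(prob.argmax(dim=1))[0]
    gt_edge = boundary_mask(target)[0]
    dist_to_gt = torch.from_numpy(distance_transform_edt(~gt_edge.numpy())).float()
    edge = dist_to_gt[pred_edge].mean() if pred_edge.any() else logits.new_zeros(())

    return ce + gamma * dice + (1.0 - gamma) * edge
```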
4. Experiments and Discussion
4.1. Datasets
4.1.1. ISPRS Potsdam Dataset
The ISPRS Potsdam dataset [23] exhibits a spatial resolution of 5 cm. It provides pixel-level ground truth annotations for land cover classification, wherein the category "clutter" designates the background class. Each image has a spatial dimension of 6000 × 6000 pixels, employing the red (R), green (G), and near-infrared (NIR) spectral bands. We partitioned this dataset into three distinct subsets: a training set, a validation set, and a test set, comprising 17, 2, and 19 images, respectively. For visual reference, specific examples are depicted in Figure 7.
4.1.2. LoveDA Dataset
The LoveDA dataset [56] introduced a novel challenge for semantic segmentation of large-scale satellite images with a spatial resolution of 0.3 m. Sourced from the Google Earth platform, LoveDA covers a vast expanse exceeding 536 km². The dataset encompasses both rural and urban regions within three cities: Nanjing, Changzhou, and Wuhan. Each image has a spatial dimension of 1024 × 1024 pixels. Our study utilized a total of 2522 images for training, 834 images for validation, and 835 images for testing. The dataset exhibits an imbalanced class distribution, and objects belonging to the same category vary in scale, size, and surface type, rendering LoveDA an even more formidable dataset for semantic segmentation. To provide a visual representation, specific examples are presented in Figure 8.
4.2. Implementation Details
The proposed SSCNet, alongside the compared semantic segmentation methods, was implemented using PyTorch on a Linux OS with an NVIDIA A40 GPU. Data augmentations, such as random flipping and cropping, were uniformly applied to all datasets and networks, as outlined in Table 1. The initial learning rate was set to 0.02, and the maximum number of training epochs was fixed at 500. We adopted stochastic gradient descent (SGD) as the optimizer, with a poly decay learning rate schedule and a momentum of 0.9. The model parameter file with the lowest validation loss was retained. The parameters $\beta$ and $\gamma$, introduced in Equations (7) and (9), respectively, are preset constants. We initialized both parameters to 0.5 to ensure an equal weighting between the components they control; this choice was substantiated by preliminary experiments, which indicated that an equal balance yields effective performance on our validation data. We maintained these values throughout the training process, as our empirical results validated this initial setting, and performed no further optimization of these parameters, in line with our aim to minimize model complexity and keep the parameter settings interpretable. The remaining hyperparameters are listed in Table 2.
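The optimization setup can be reproduced in a few lines of PyTorch; the poly exponent of 0.9 below is our assumption, as the schedule is stated but its power is not:

```python
import torch

model = torch.nn.Conv2d(3, 6, 3)  # placeholder; substitute SSCNet or any compared network

# SGD with the stated initial learning rate and momentum.
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)

# Poly decay over the 500 training epochs: lr = 0.02 * (1 - epoch / 500) ** power.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: (1 - epoch / 500) ** 0.9
)
```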
We selected 10 methods for comparative analysis, encompassing representative baselines and designs specifically tailored for remote sensing imagery. The former category includes U-Net [57], DeepLab V3+ [58], and CBAM [59], while the latter consists of ResUNet-a [43], RAANet [46], SCAttNet [60], HCANet [34], A2FPN [48], and LANet [36].
4.3. Evaluation Metrics
In this study, we employ standard evaluation metrics to assess the performance of the predicted results on the test set: the class-wise $F_1$-score, the average $F_1$-score across all classes (AF), the overall accuracy (OA), and the mean intersection over union (mIoU). The $F_1$-score serves as a balanced measure of precision and recall, providing insight into the trade-off between false positives and false negatives. OA quantifies the number of correctly classified pixels in relation to all pixels, and mIoU is a global metric that evaluates region overlap. Formally,

$$F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad \mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{12}$$

where precision and recall are calculated as follows:

$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}. \tag{13}$$

In the equations, $TP$, $TN$, $FP$, and $FN$ represent the counts of true positives, true negatives, false positives, and false negatives, respectively. The mIoU is computed as the mean of the class-specific IoU values, where $\mathrm{IoU} = TP / (TP + FP + FN)$.
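All four metrics reduce to confusion-matrix arithmetic, as the short sketch below illustrates (function names are ours):

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, k: int) -> np.ndarray:
    """k x k confusion matrix; rows index ground truth classes, columns predictions."""
    return np.bincount(k * gt.ravel() + pred.ravel(), minlength=k * k).reshape(k, k)

def metrics(cm: np.ndarray, eps: float = 1e-12):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)  # class-wise F1
    oa = tp.sum() / cm.sum()                                  # overall accuracy
    miou = (tp / (tp + fp + fn + eps)).mean()                 # mean IoU
    return f1, f1.mean(), oa, miou                            # F1 per class, AF, OA, mIoU
```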
4.4. Comparison with State-of-the-Art Methods
4.4.1. Results on ISPRS Potsdam Dataset
In the presented comparative analysis of semantic segmentation performance on the ISPRS Potsdam dataset, various state-of-the-art methods were evaluated alongside the newly proposed SSCNet. As listed in Table 3, these methods were examined across multiple land cover categories, namely impervious surfaces, buildings, low vegetation, trees, and cars. The results indicate that SSCNet achieved remarkable performance, outperforming most of the other methods in all classes. In particular, SSCNet demonstrated superior results in the building, tree, and car classes, reaching $F_1$-scores of 97.16%, 91.41%, and 93.26%, respectively. This signifies SSCNet's effectiveness in delineating fine details and complex object boundaries, which is critical in remote sensing applications.
Comparing SSCNet with the baseline methods, we observe a consistent trend where SSCNet surpasses the others. DeepLabV3+, CBAM, ResUNet-a, and HCANet also show competitive results, especially in the impervious surfaces and building classes. This highlights that the proposed SSCNet indeed addresses the challenge of capturing intricate edge details and suppressing noise in remote sensing imagery. The OA and mIoU results show a similar trend, emphasizing the effectiveness of SSCNet in providing both accurate and spatially coherent predictions. The outcomes of this analysis underscore the significance of the newly introduced SSCNet, particularly its potential in remote sensing applications where accurate segmentation of land cover is essential. This substantial improvement in performance is indicative of SSCNet’s potential to enhance object detection and land cover classification in satellite and aerial imagery.
As shown in Figure 9, a thorough visual inspection of the predicted labels on randomly sampled images from the ISPRS Potsdam dataset yields several key observations. SSCNet, our proposed semantic segmentation method, consistently and accurately delineates land cover types. Notably, it excels in capturing intricate details, such as the edges of buildings and trees, where the SpeA module plays a pivotal role. The inclusion of this module allows SSCNet to better understand the spectral context, making it especially proficient in distinguishing fine-grained structures.
Moreover, the visual inspections also highlight SSCNet's exceptional adaptability to diverse land cover scenarios. It consistently produces accurate labels across the various classes, from impervious surfaces and buildings to low vegetation. The model captures both large-scale features, such as roads, and fine details, like trees and cars. This versatility reflects SSCNet's proficiency in handling the diverse and complex landscapes present in the ISPRS Potsdam dataset. While challenges persist due to the dataset's imbalanced class distribution and variation in scale, SSCNet's robust performance, driven by the SpeA module, underscores its utility in real-world applications, particularly for urban and environmental monitoring.
4.4.2. Results on LoveDA Dataset
Table 4 offers a comprehensive assessment of various methods, including SSCNet, applied to semantic segmentation on the LoveDA dataset. SSCNet presents outstanding performance across multiple classes, demonstrating its competence in accurate land cover classification. It achieves the highest $F_1$-scores in most classes, with notable distinctions in the water, barren, and agriculture categories, reaching $F_1$-scores of 91.10%, 56.66%, and 85.35%, respectively. This signifies SSCNet's effectiveness in capturing intricate details and accurately discerning land cover types. Compared with the baseline models, SSCNet consistently outperforms them in terms of $F_1$-scores, underlining its superiority in semantic segmentation. Other methods, such as LANet and HCANet, exhibit competitive results, particularly in the building and road classes, but SSCNet's remarkable consistency across all classes showcases its robustness. SSCNet further outperforms the competition in OA and mIoU, confirming its ability to provide both precise and spatially coherent semantic segmentation. These results underline SSCNet's significance in remote sensing applications, particularly in challenging classes such as water and agriculture, with potential applications in accurate land cover analysis.
Examining SSCNet's performance, it is clear that the model excels in capturing fine details, as evidenced by the high $F_1$-scores across various classes. The water class is notably challenging, but SSCNet demonstrates remarkable accuracy with an $F_1$-score of 91.10%, indicating its proficiency in distinguishing small water bodies. The building and road classes also witness substantial performance improvements, with SSCNet achieving $F_1$-scores of 76.36% and 81.91%, respectively. These results indicate SSCNet's potential in applications requiring precise segmentation, such as urban planning and environmental monitoring. SSCNet surpasses existing models, including U-Net and DeepLab V3+, highlighting its state-of-the-art performance. Its competence in handling both fine-grained land cover details and large-scale geographic areas attests to its versatility in remote sensing tasks, and its substantial lead in AF, OA, and mIoU further emphasizes its significance, offering an advanced solution to semantic segmentation challenges in large-scale satellite imagery.
As shown in Figure 10, visual inspections of the predicted labels on random samples from the LoveDA dataset reveal valuable insights into the performance of SSCNet in the context of large-scale satellite image segmentation. SSCNet demonstrates its competence in effectively handling the diverse and intricate land cover types present in this dataset. Notably, the inclusion of the SpeA module contributes to the model's remarkable performance. It excels in accurately delineating various classes, including background, buildings, roads, water bodies, barren lands, and forests. SSCNet's superior performance in classifying these diverse land cover types underscores its versatility and robustness. Overall, these visual inspections demonstrate that SSCNet, with its SpeA module, stands out as a reliable choice for large-scale satellite image segmentation, particularly for applications such as land cover monitoring, urban planning, and environmental assessments in regions covered by the LoveDA dataset.
4.5. Ablation Study on SpeA
Table 5 compares SSCNet with a variant, SSCNet w/o SpeA (without the SpeA module), on both the ISPRS Potsdam and LoveDA datasets. This evaluation aims to elucidate the importance of the SpeA module in SSCNet's performance. On the ISPRS Potsdam dataset, SSCNet exhibits impressive performance with an AF of 91.03, an OA of 92.90, and an mIoU of 82.55, underlining its competence in semantic segmentation. However, when the SpeA module is removed (SSCNet w/o SpeA), there is a considerable decline in all metrics, resulting in an AF of 87.62, an OA of 87.92, and an mIoU of 79.55. This reduction demonstrates the detrimental impact of eliminating the SpeA module, highlighting its crucial role in enhancing SSCNet's performance. The decrease in mIoU and overall accuracy implies that SpeA is pivotal for capturing fine details and providing accurate semantic segmentation.
Transitioning to the LoveDA dataset, SSCNet again delivers commendable results, with an AF of 76.02, an OA of 72.01, and an mIoU of 65.91. These metrics indicate SSCNet's ability to perform well on this dataset, exhibiting its adaptability. However, when the SpeA module is omitted (SSCNet w/o SpeA), there is a more substantial drop in performance across all metrics: the model's AF decreases to 60.16, OA to 62.65, and mIoU to 54.62. Moreover, as shown in Figure 11 and Figure 12, we observe that SpeA significantly improves the convergence rate while attaining a lower training loss. This significant reduction reaffirms the importance of the SpeA module in SSCNet, as its removal leads to a noticeable decrease in segmentation accuracy and overall performance.
As shown in Figure 13 and Figure 14, the results predicted by SSCNet and SSCNet w/o SpeA are presented. With SpeA, SSCNet exhibits enhanced consistency in classifying various land covers, closely mirroring the ground truth, particularly around complex interfaces such as building edges and vegetative boundaries. The edge details are notably sharper, as SpeA aids in delineating clear and precise segmentations, in contrast to the SSCNet without SpeA, where the edges appear blurred and less defined. This comparative visualization underscores the efficacy of SpeA in augmenting the spatial resolution and fidelity of semantic segmentation in remote sensing imagery.
These findings demonstrate that the SpeA module in SSCNet significantly contributes to its ability to handle both ISPRS Potsdam and LoveDA datasets, enhancing its semantic segmentation capabilities, especially in more challenging datasets like LoveDA. Therefore, retaining the SpeA module in SSCNet is essential for robust performance across various remote sensing applications.
4.6. Effects of the Value of $\beta$
In this section, we delve into the effects of the coefficient $\beta$ on the model's performance across two distinct datasets: ISPRS Potsdam and LoveDA. The coefficient $\beta$ is pre-defined to modulate the importance of spectral attention within the model's architecture. Table 6 reports the results with different settings of $\beta$. An analysis of the performance metrics indicates a non-linear relationship between the value of $\beta$ and the model's effectiveness, with the model achieving optimal performance at an intermediate value.
Specifically, the optimal results for both datasets are observed at $\beta = 0.5$, where the AF/OA/mIoU scores reach their peak. For the ISPRS Potsdam dataset, the performance improves consistently as $\beta$ increases from 0 to 0.5, suggesting that the incorporation of spectral attention up to a certain threshold contributes positively to the model's accuracy and ability to generalize. However, beyond this point, there is a notable decline in performance, with $\beta = 0.75$ showing a decrease and $\beta = 1.0$ regressing to levels similar to the absence of spectral attention ($\beta = 0$). This trend is mirrored in the LoveDA dataset, albeit with more pronounced fluctuations, suggesting a higher sensitivity to changes in spectral attention. The pronounced peak at $\beta = 0.5$, followed by a decline, indicates that while spectral attention is crucial, its overemphasis is counterproductive.
The results elucidate the critical balance required in spectral attention to enhance model performance. At low values of $\beta$ (0 and 0.25), the model likely underutilizes spectral information, while at high values (0.75 and 1.0), the overemphasis may lead to overfitting or distraction from spatial features. The peak performance at $\beta = 0.5$ across all metrics for both datasets underscores the importance of a moderated spectral attention mechanism. This balance ensures that the model is neither starved of spectral information nor overwhelmed by it, facilitating robust feature extraction that is evidently beneficial across the different landscapes and urban settings represented by the ISPRS Potsdam and LoveDA datasets.
4.7. Discussion
The proposed SSCNet introduces an innovative approach to the semantic segmentation of remote sensing images by incorporating both spectral and spatial information within a unified framework. Theoretically, the architecture of SSCNet is designed to exploit the rich spectral information present in RSIs through its joint spectral–spatial attention mechanism, potentially outperforming methods that do not utilize such integration. While our comparisons have been limited to methods utilizing 2D FFT conversion, the conceptual strengths of SSCNet suggest that it could excel in comparisons against recent state-of-the-art methods as well. Specifically, SSCNet's feature representation in both the spatial and frequency domains may provide enhanced discriminative capabilities, particularly in complex segmentation scenarios.
Future work could extend these comparisons to include recent advancements in semantic segmentation that do not employ 2D FFT conversion, providing a more exhaustive benchmark for SSCNet’s performance. Moreover, investigations could be directed toward refining SSCNet’s spectral–spatial attention mechanisms to further leverage the complementarity of spectral and spatial features, thereby reinforcing its theoretical and practical superiority in the semantic segmentation of remote sensing images.
5. Conclusions
In conclusion, this study introduces SSCNet, a pioneering spectrum-space collaborative network aimed at enhancing semantic segmentation in RSIs. SSCNet adeptly capitalizes on the intrinsic spectral characteristics of RSIs by incorporating spectral and spatial context for discriminative representation learning. The novel joint spectral–spatial attention module, comprising SpeA and SpaA, dynamically captures the spectral and spatial dependencies simultaneously. The proposed CSED in SpeA is pivotal for modeling spectral contexts in the frequency domain, and the position-wise self-attention in SpaA complements this by addressing spatial aspects. The synergy achieved by merging these attention maps through AttnFusion results in SSCNet’s attention mechanism, which considers both spectral and spatial contexts. Additionally, the introduced hybrid loss function, which combines edge loss, Dice loss, and cross-entropy loss, ensures the comprehensive training of SSCNet, thus enabling it to learn discriminative features within both the spectral and spatial domains. Experimental results on the ISPRS Potsdam and LoveDA datasets demonstrate SSCNet’s superiority over state-of-the-art methods, reaffirming its efficacy in addressing the challenges of remote sensing image segmentation.
Looking forward, this work opens up several avenues for future research. First, SSCNet could be extended to address the task of pansharpening, which is critical for improving the spatial resolution of RSIs. Second, further investigations into adaptive fusion techniques for spectral and spatial features can be explored to enhance the network’s flexibility in handling diverse remote sensing scenarios. Additionally, the incorporation of more advanced spectral analysis tools and domain adaptation methods may improve the model’s performance under various conditions. Finally, research into the application of SSCNet in real-time semantic segmentation and its integration with autonomous systems, such as drones or satellites, could pave the way for transformative developments in the field of remote sensing and environmental monitoring.