In this section, we provide a detailed description of the Cross-scan Semantic Cluster Network (CSCN). The network combines the Semantic Filtering Contextual Cluster (SFCC), which is based on class-centered context modeling, with the Cross-scan Scene Coupling Attention (CSCA), which restructures attention using a multi-directional scanning strategy. This design targets prevalent challenges in remote sensing semantic segmentation, such as substantial background interference, significant intra-class variability, and blurred object boundaries, and thereby improves segmentation performance.
3.1. Architecture and Workflow
As shown in Figure 3, CSCN is an encoder–decoder semantic segmentation framework that includes three key components: a high-resolution backbone network, SFCC, and CSCA.
The input image is initially processed by the backbone network to extract multi-scale features. These features are subsequently downsampled using a bottleneck convolutional layer (Conv 3 × 3), producing a compact representation that facilitates efficient context extraction and attention modeling.
The SFCC module is designed to model local semantic context by viewing the feature map as a set of unordered points rather than a structured grid. It first divides the feature map into local windows and flattens each window into a point set. These points are clustered into semantic groups via top-k nearest-neighbor selection based on cosine similarity. To eliminate perceptual and semantic redundancy, a Semantic Filtering (SF) mechanism is applied. Each cluster is assigned a centroid, and similarity-weighted aggregation is performed to produce refined local semantic features.
Meanwhile, the proposed CSCA constructs global and local class centers from pre-classification masks and uses rotation position encoding (ROPE) together with a DCT-based global scene representation. To capture long-range dependencies, the model scans the input features in four directions: horizontal, vertical, horizontal-flip, and vertical-flip. This multi-directional strategy allows the model to perceive spatial continuity and semantic relationships from multiple perspectives. The final attention output is generated by averaging the responses of the four directions and projecting them back to the original spatial resolution.
Finally, the outputs of the SFCC and CSCA modules are concatenated with the original features and passed through a fusion block to generate the refined feature representation. This yields two-stage predictions: a coarse mask from the decoder and a fine-grained segmentation output from the full attention path. This hierarchical refinement design strengthens the framework's capacity to preserve edge structures and class boundaries.
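To make the workflow concrete, the following minimal PyTorch sketch mirrors the data flow described above. The channel sizes and the nn.Identity placeholders standing in for the SFCC and CSCA branches are our assumptions for illustration, not the authors' implementation.

```python
# A minimal PyTorch sketch of the CSCN workflow described above (not the authors'
# implementation). The channel sizes and the nn.Identity placeholders standing in
# for the SFCC and CSCA branches are assumptions made for illustration.
import torch
import torch.nn as nn

class CSCNSketch(nn.Module):
    def __init__(self, backbone_channels=512, hidden=256, num_classes=7):
        super().__init__()
        # Bottleneck 3x3 conv that compresses backbone features for context modeling.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(backbone_channels, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.sfcc = nn.Identity()   # stands in for the Semantic Filtering Contextual Cluster
        self.csca = nn.Identity()   # stands in for the Cross-scan Scene Coupling Attention
        # Fusion of the SFCC output, the CSCA output, and the bottleneck features.
        self.fusion = nn.Sequential(
            nn.Conv2d(hidden * 3, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.coarse_head = nn.Conv2d(hidden, num_classes, 1)   # coarse pre-classification mask
        self.fine_head = nn.Conv2d(hidden, num_classes, 1)     # fine-grained prediction

    def forward(self, backbone_feat):
        x = self.bottleneck(backbone_feat)
        coarse = self.coarse_head(x)                  # first-stage (coarse) prediction
        local_ctx = self.sfcc(x)                      # local class-aware context
        global_ctx = self.csca(x)                     # multi-directional global context
        fused = self.fusion(torch.cat([local_ctx, global_ctx, x], dim=1))
        fine = self.fine_head(fused)                  # second-stage (refined) prediction
        return coarse, fine

feat = torch.randn(1, 512, 32, 32)                    # e.g., backbone output for a 512x512 image
coarse, fine = CSCNSketch()(feat)
print(coarse.shape, fine.shape)
```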
3.2. Semantic Filtering Contextual Cluster
We designed the Semantic Filtering Contextual Cluster (SFCC) module to explicitly capture key local semantics and spatial relationships in remote sensing images while minimizing confusion between semantic categories. Unlike traditional methods that treat feature maps as structured grids, this module interprets them as unordered sets of points, enabling feature sharing and aggregation through point-based clustering within localized windows. The semantic filtering mechanism improves the model's generalization capacity and semantic discrimination in highly complex scenes.
To begin, the input feature map is divided into a number of non-overlapping local windows, and each window is flattened into a point set $P \in \mathbb{R}^{N \times D}$, where $N$ is the total number of data points in the window and $D$ is the feature dimension of the points. A learnable linear mapping [16] then projects the point features into a similarity space.
Within each window, cluster centers are evenly initialized, and the initial feature of each center is obtained by selecting its $k$ nearest neighbors and averaging them (a sketch of this step is given below). Due to the uneven semantic distribution in remote sensing images, largely attributed to perceptual and semantic redundancy, a Semantic Filtering (SF) mechanism is incorporated, inspired by recent image de-redundancy clustering techniques [46,47].
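The window partition and center initialization can be sketched as follows, assuming a square window, evenly spaced anchor points as initial centers, and cosine-similarity k-nearest neighbors; all sizes are illustrative rather than the paper's settings.

```python
# A minimal sketch of the window partition and center initialization described
# above, assuming a square window, evenly spaced anchor points as initial
# centers, and cosine-similarity k-nearest neighbors; all sizes are illustrative.
import torch

def init_window_centers(feat, window=7, num_centers=4, k=5):
    """feat: (B, C, H, W). Returns per-window point sets and initial center features."""
    B, C, H, W = feat.shape
    # Split into non-overlapping windows and flatten each window into a point set.
    pts = feat.unfold(2, window, window).unfold(3, window, window)          # (B, C, nH, nW, w, w)
    pts = pts.permute(0, 2, 3, 4, 5, 1).reshape(B, -1, window * window, C)  # (B, #win, n, C)
    # Evenly pick `num_centers` anchor points per window as the initial centers.
    idx = torch.linspace(0, window * window - 1, num_centers).long()
    anchors = pts[:, :, idx]                                                # (B, #win, c, C)
    # Each center's initial feature is the mean of its k most similar points.
    sim = torch.cosine_similarity(anchors.unsqueeze(3), pts.unsqueeze(2), dim=-1)
    knn = sim.topk(k, dim=-1).indices                                       # (B, #win, c, k)
    gathered = torch.gather(
        pts.unsqueeze(2).expand(-1, -1, num_centers, -1, -1), 3,
        knn.unsqueeze(-1).expand(-1, -1, -1, -1, C))
    return pts, gathered.mean(dim=3)                                        # centers: (B, #win, c, C)

feat = torch.randn(1, 64, 28, 28)
pts, centers = init_window_centers(feat)
print(pts.shape, centers.shape)   # torch.Size([1, 16, 49, 64]) torch.Size([1, 16, 4, 64])
```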
As shown in Figure 4a, perceptual redundancy refers to samples that differ only minimally at the pixel level yet are repeatedly learned by the model. In contrast, as shown in Figure 4b, semantic redundancy refers to images whose pixel appearance differs significantly but which express the same semantic information. The presence of both types of redundancy in remote sensing images leads to an imbalanced distribution of semantic concepts in the dataset, and such an imbalance may result in bias and wasted resources during training.
For the center points within a single cluster, we define the global image embedding matrix $E \in \mathbb{R}^{M \times D_e}$, where $M$ is the number of images and $D_e$ is the embedding dimension. We partition $E$ into chunks $\{E_1, E_2, \dots\}$, where $e_i$ denotes the feature embedding of the $i$-th image, $e_j^{(n)}$ is the $j$-th point in the $n$-th chunk $E_n$, and each chunk contains the same number of points. For each image $e_i$, we compute the Euclidean distance [48] to all points in its chunk.
The top-k nearest neighbors [46] are retained per chunk, and all chunks are concatenated to form the full nearest-neighbor matrix. An image pair is merged into the same cluster according to a distance threshold $\tau$: two images fall into the same cluster if their distance is below $\tau$. In our experiments, we empirically set $\tau$ to 0.7. This value was selected based on validation performance on the LoveDA dataset, where we observed that $\tau = 0.7$ provided the best trade-off between over-segmentation and under-clustering; a grid search over a range of threshold values was conducted, and 0.7 achieved the most stable and accurate clustering results across all datasets. Each cluster is then assigned a centroid, and only the image closest to the centroid is retained; all others are filtered out to reduce redundancy.
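The filtering step can be sketched as below. This is not the authors' code; the chunk size, the union-find helper, and the use of Euclidean distances on normalized embeddings with the 0.7 threshold are illustrative assumptions consistent with the description above.

```python
# A sketch of the semantic filtering step: chunked top-k nearest neighbours,
# threshold-based merging with union-find, and keeping only the member closest
# to each cluster centroid. Not the authors' code; the chunk size, the union-find
# helper, and Euclidean distances on normalized embeddings are assumptions.
import torch
import torch.nn.functional as F

def semantic_filter(embeddings, chunk_size=1024, k=5, tau=0.7):
    """Drop near-duplicate embeddings and return the indices of kept representatives."""
    n = embeddings.size(0)
    parent = list(range(n))                                   # union-find over images

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Chunked top-k nearest neighbours; merge a pair when its distance is below tau.
    for start in range(0, n, chunk_size):
        chunk = embeddings[start:start + chunk_size]
        dists = torch.cdist(chunk, embeddings)                # pairwise Euclidean distances
        knn_d, knn_i = dists.topk(k + 1, largest=False)       # +1 because self-distance is 0
        for row in range(chunk.size(0)):
            for d, j in zip(knn_d[row, 1:], knn_i[row, 1:]):
                if d.item() < tau:
                    union(start + row, j.item())

    # Within each cluster keep only the member closest to the cluster centroid.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    kept = []
    for members in clusters.values():
        centroid = embeddings[members].mean(dim=0)
        d_to_centroid = (embeddings[members] - centroid).norm(dim=1)
        kept.append(members[d_to_centroid.argmin().item()])
    return sorted(kept)

emb = F.normalize(torch.randn(200, 64), dim=1)
print(len(semantic_filter(emb)), "representatives kept out of 200")
```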
To relate the cross-cluster centers to all patches, we compute a cosine-similarity matrix $S$ between every point and every cluster center, $S_{ij} = \frac{c_i^{\top} p_j}{\lVert c_i \rVert \, \lVert p_j \rVert}$, where $c_i$ and $p_j$ represent the feature vectors of the cluster center and the patch, respectively. Semantically similar patches are then organized with the Union-Find method: two patches are merged into a single group if their distance is below the threshold $\tau$. To minimize redundancy and improve the semantic representation, only the core patch (i.e., the patch nearest to the centroid) is retained in each group for subsequent feature aggregation.
Clusters are then formed by allocating each point to its nearest center according to the similarity matrix $S$. The number of points per cluster may vary, and some clusters may contain no points at all; this situation reflects data redundancy. Points within the same cluster are dynamically drawn toward the center point based on their similarities.
If $m$ feature points belonging to a cluster are selected from a local window and their feature dimension is $d'$, these points form a matrix of size $m \times d'$. Using the similarity matrix $S$, the similarity between each of these points and the cluster center is extracted to form a similarity vector $s$. A linear transformation then projects the points into a new space, the value space, which is used for subsequent feature aggregation and optimization; in the value space, all points are fused to obtain a new cluster center $v_c$.
We calculate the similarity between each point and the new cluster center $v_c$ and weight the points accordingly, with the weighting carried out through the sigmoid activation function $\mathrm{sig}(\cdot)$. Here, $\alpha$ and $\beta$ are learnable parameters, and $v_i$ is the original feature of the $i$-th point. This weighting ensures that points closer to the cluster center receive greater weight during feature aggregation.
Next, we combine the similarity-weighted features of the new cluster center $v_c$ and the member points to construct the aggregated local contextual feature for this cluster, where $v_c$ denotes the new cluster center, $\mathrm{sig}(\cdot)$ is the sigmoid activation function, and $\alpha$ and $\beta$ are the learnable parameters controlling the feature weighting. The aggregated feature is then used to update each point after similarity weighting, which improves intra-class consistency. Finally, the clustering results are mapped from the clustering feature space back to the original feature space and reshaped into a 2D feature map.
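A minimal sketch of this similarity-weighted aggregation and point update is given below, assuming a Context-Cluster-style formulation with learnable $\alpha$ and $\beta$; the paper's exact normalization and dispatch rule may differ.

```python
# A minimal sketch of the similarity-weighted aggregation and point update,
# assuming a Context-Cluster-style formulation with learnable alpha and beta;
# the paper's exact normalization and dispatch rule may differ.
import torch
import torch.nn as nn

class ClusterAggregate(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))    # learnable scale on the similarity
        self.beta = nn.Parameter(torch.zeros(1))    # learnable shift on the similarity

    def forward(self, v, v_c, sim):
        """v: (m, d) point values of one cluster, v_c: (d,) center, sim: (m,) similarities."""
        w = torch.sigmoid(self.alpha * sim + self.beta)                 # per-point weights
        # Aggregate: center plus similarity-weighted members, normalized by the total weight.
        g = (v_c + (w.unsqueeze(1) * v).sum(dim=0)) / (1.0 + w.sum())
        # Dispatch: each point is updated with the aggregated feature, weighted by w.
        v_updated = v + w.unsqueeze(1) * g.unsqueeze(0)
        return g, v_updated

agg = ClusterAggregate()
v = torch.randn(9, 64)                              # nine points of one cluster in value space
v_c = v.mean(dim=0)                                 # fused cluster center (illustrative)
sim = torch.cosine_similarity(v, v_c.unsqueeze(0), dim=1)
g, v_new = agg(v, v_c, sim)
print(g.shape, v_new.shape)
```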
3.3. Cross-Scan Scene Coupling Attention
The Cross-scan Scene Coupling Attention (CSCA) restructures the attention operation and improves the attention modeling process. By enabling dynamic adaptation, CSCA facilitates the integration of global contextual information and promotes efficient exploitation of the spatial relationships among geographic objects. CSCA consists of two main parts: scene coupling (SC) and the cross-scan strategy.
3.3.1. Scene Coupling
In remote sensing images, the separation of similar pixel types in the feature space can cause vanilla attention models to construct poor class centers. To address this limitation and capture the complex spatial relationships among ground objects more effectively, we introduce the scene coupling (SC) mechanism. By combining global scene semantics with the spatial distribution of objects, the model is better able to comprehend intricate and dynamic spatial structures; this approach has proven successful for modeling geographic objects under complex spatial conditions [49].
First, the initial feature is extracted from the input image by the backbone, and a convolution layer is employed to produce the pre-classification feature, whose channel number equals the number of classes $K$. A pre-classification semantic mask and a local semantic representation are then constructed. Specifically, for each pixel position, the class with the maximum predicted probability is selected as the pre-classification mask for that pixel. A mapping function is used to project the $K$ class centers back to the original graph space, producing the global class centers. This function acts as a mapping and reconstruction operator that projects the pre-classification semantic features and class-center information into the semantic graph space, reassigning each pixel's class probability to its corresponding class center and generating a semantic representation.
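One way to realize this construction is sketched below: class centers are obtained by soft class-center pooling of the features with the pre-classification probabilities, and each pixel is then assigned its class's center. This is our illustrative reading; the authors' mapping operator may differ.

```python
# A sketch of building global class centers from pre-classification probabilities
# (soft class-center pooling) and mapping them back to the image grid. This is an
# illustrative assumption, not necessarily the paper's exact operator.
import torch
import torch.nn.functional as F

def global_class_centers(feat, logits):
    """feat: (B, C, H, W) features; logits: (B, K, H, W) pre-classification scores."""
    B, C, H, W = feat.shape
    prob = F.softmax(logits.flatten(2), dim=2)                   # (B, K, HW) spatial weights per class
    centers = torch.bmm(prob, feat.flatten(2).transpose(1, 2))   # (B, K, C) class centers
    mask = logits.argmax(dim=1)                                  # (B, H, W) pre-classification mask
    # Map centers back to the image grid: each pixel receives its class's center.
    mapped = centers.gather(1, mask.flatten(1).unsqueeze(-1).expand(-1, -1, C))
    return centers, mask, mapped.transpose(1, 2).reshape(B, C, H, W)

feat = torch.randn(2, 256, 32, 32)
logits = torch.randn(2, 6, 32, 32)
centers, mask, sem_map = global_class_centers(feat, logits)
print(centers.shape, mask.shape, sem_map.shape)
```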
To account for local consistency, the original image is further divided into patches of size $p \times p$, giving $\frac{H}{p} \times \frac{W}{p}$ local blocks in total. By splitting the pre-classification features and the semantic mask into their patch-level counterparts in the same way, local semantic masks and local class centers are constructed. At this point, the vanilla attention can be restructured around these global and local class centers, as sketched below.
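A possible form of this restructured attention is for the pixel queries to attend to the concatenated global and local class centers instead of to all pixels. The sketch below illustrates this reading with plain scaled dot-product attention; the paper's exact formulation may differ.

```python
# A possible form of the restructured attention: pixel queries attend to the
# concatenated global and local class centers instead of to all pixels. Plain
# scaled dot-product attention is used here for illustration; the paper's exact
# formulation (with scene coupling and ROPE) may differ.
import torch

def class_center_attention(q, global_centers, local_centers):
    """q: (B, N, C) pixel queries; centers: (B, K, C) global and (B, K_local, C) local."""
    kv = torch.cat([global_centers, local_centers], dim=1)        # (B, K + K_local, C)
    attn = torch.softmax(q @ kv.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ kv                                              # (B, N, C) refined features

q = torch.randn(2, 1024, 256)                                     # 32x32 pixels as queries
out = class_center_attention(q, torch.randn(2, 6, 256), torch.randn(2, 24, 256))
print(out.shape)
```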
Remote sensing images often exhibit specific spatial distribution patterns; for example, vehicles mostly appear on roads, and buildings are generally located near roads. These patterns underline the importance of spatial relationships among pixels. Following the idea of Rotation Position Encoding (ROPE) [50], we integrate these spatial correlations into the attention mechanism through the query-key inner product. As shown in Figure 5, this method overcomes the limitations of fixed position encoding and significantly enhances the model's generalization ability.
We define a two-dimensional ROPE encoding and construct the attention inner product on top of it. In the one-dimensional case, $x_m$ and $x_n$ represent the feature vectors at positions $m$ and $n$ in the image, $W_q$ and $W_k$ are the linear mapping weight matrices of the query and key, $e^{im\theta}$ is the rotation factor, and $\theta$ is the complex angle. We then extend the formulation to 2D images, where $u$ and $v$ are the horizontal and vertical indices of a patch and $\theta_u$ and $\theta_v$ are the rotation angles in the two directions; these rotations introduce spatial location information into the queries and keys. Because remote sensing image features are multi-channel, a multi-channel extension is further introduced on top of the two-dimensional form, in which $R_u$ and $R_v$ are the horizontal and vertical rotation matrices, $q$ and $k$ are the original query and key vectors, and $\mathrm{Re}(\cdot)$ denotes taking the real part of the complex product. The affinity matrix is then computed from the rotated queries and keys.
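The following sketch shows one standard way to apply a 2D rotary encoding to queries and keys and form the position-aware affinity matrix. Splitting the channels between the horizontal and vertical axes and the base frequency of 10000 are our assumptions, not necessarily the paper's exact scheme.

```python
# Sketch of a 2D rotary position encoding applied to queries/keys, consistent
# with the description above. Splitting channels between the horizontal and
# vertical axes and the 10000 base frequency are illustrative assumptions.
import torch

def rope_1d(x, pos, theta_base=10000.0):
    """Rotate channel pairs of x (..., D) by angles pos * theta_i."""
    D = x.shape[-1]
    freqs = theta_base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    ang = pos[..., None] * freqs                      # (..., D/2) rotation angles
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * ang.cos() - x2 * ang.sin(),
                        x1 * ang.sin() + x2 * ang.cos()], dim=-1).flatten(-2)

def rope_2d(x, u, v):
    """Apply horizontal rotation to one half of the channels and vertical to the other."""
    D = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :D], u), rope_1d(x[..., D:], v)], dim=-1)

H = W = 8
u, v = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
u, v = u.flatten().float(), v.flatten().float()       # per-patch grid indices
q, k = torch.randn(H * W, 64), torch.randn(H * W, 64)
affinity = rope_2d(q, u, v) @ rope_2d(k, u, v).T      # position-aware affinity matrix
print(affinity.shape)
```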
To obtain cross-spatial global semantic information, we introduce the two-dimensional discrete cosine transform [51] to model the global representation of the scene. The basic transformation function is the 2D DCT basis
$B_{h,w}^{u,v} = \cos\left(\frac{\pi (2h+1) u}{2H}\right)\cos\left(\frac{\pi (2w+1) v}{2W}\right),$
and the spectrum is computed as
$F(u,v) = \alpha(u)\,\alpha(v)\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} f(h,w)\, B_{h,w}^{u,v},$
where $\alpha(u)$ and $\alpha(v)$ are the normalization factors of the discrete cosine transform (DCT) along the height and width dimensions, respectively, and $f(h,w)$ denotes the pixel (or feature) value at spatial location $(h,w)$. The most discriminative frequency components are selected and concatenated along the channel dimension to obtain the global scene representation. Ultimately, the global scene representation and the affinity matrix are incorporated into the attention mechanism, with the number of feature channels serving as the scaling factor in the attention computation.
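A sketch of the DCT-based scene descriptor is given below: each channel is projected onto 2D DCT bases and a few coefficients are kept as the global scene representation. Keeping the lowest frequencies (rather than a learned selection of the most discriminative components) is a simplifying assumption.

```python
# Sketch of a DCT-based global scene representation: each channel of the feature
# map is projected onto 2D DCT bases and a few coefficients are concatenated into
# a scene vector. Keeping the lowest frequencies is our illustrative choice.
import math
import torch

def dct_basis(n):
    """Orthonormal 1D DCT-II basis matrix of size (n, n)."""
    k = torch.arange(n).float()
    basis = torch.cos(math.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0] *= 1.0 / math.sqrt(2)
    return basis * math.sqrt(2.0 / n)

def global_scene_repr(feat, num_freq=4):
    """feat: (B, C, H, W) -> (B, C * num_freq**2) scene descriptor."""
    B, C, H, W = feat.shape
    Dh, Dw = dct_basis(H), dct_basis(W)
    spectrum = torch.einsum("uh,bchw,vw->bcuv", Dh, feat, Dw)   # 2D DCT per channel
    low = spectrum[:, :, :num_freq, :num_freq]                  # keep low-frequency components
    return low.reshape(B, -1)

feat = torch.randn(2, 256, 32, 32)
scene = global_scene_repr(feat)
print(scene.shape)   # torch.Size([2, 4096])
```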
3.3.2. Cross-Scan Strategy
Conventional global scanning methods often overlook the importance of local features and contextual information in remote sensing images, and fixed scanning orders cannot adaptively capture spatial relationships. To address this issue, we propose a cross-scan approach. The modeling of long-range dependencies is improved by selectively interacting along four scanning directions, which is particularly effective for remote sensing scenes characterized by irregular and diverse land-cover distributions. In contrast to previous methods that only scan in the row or column direction (such as SSM [52]), we patch and flatten the input image and scan the flattened image horizontally (H), vertically (V), horizontally flipped (H_flip), and vertically flipped (V_flip) to gain more comprehensive global information.
Specifically, the input feature map is divided into local blocks and flattened into sequences along the four directions, which are then rearranged per direction to simulate the scanning process. In each direction, learnable projections map the directional features onto queries, keys, and values using the learnable projection matrices $W_Q$, $W_K$, and $W_V$, respectively. The attention map in each direction is computed from these directional queries and keys, and an averaging operation aggregates the final output, as sketched below.
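A minimal sketch of the cross-scan strategy follows: the feature map is flattened in the four scan orders, attention is applied per direction, each output is mapped back to the original layout, and the four results are averaged. Plain scaled dot-product attention stands in here for the scene-coupling attention of CSCA.

```python
# A minimal sketch of the cross-scan strategy: the feature map is flattened in the
# four scan orders (H, V, H_flip, V_flip), attention is applied per direction, each
# output is mapped back to the original layout, and the four results are averaged.
# Plain scaled dot-product attention stands in for the scene-coupling attention.
import torch

def scan(x, direction):
    """x: (B, C, H, W) -> (B, H*W, C) sequence in the given scan order."""
    if direction == "H":
        seq = x
    elif direction == "V":
        seq = x.transpose(2, 3)                       # column-major order
    elif direction == "H_flip":
        seq = x.flip(dims=[3])
    else:                                             # "V_flip"
        seq = x.transpose(2, 3).flip(dims=[3])
    return seq.flatten(2).transpose(1, 2)

def unscan(seq, direction, H, W):
    """Inverse of scan: (B, H*W, C) back to (B, C, H, W) in the original layout."""
    B, N, C = seq.shape
    x = seq.transpose(1, 2)
    if direction == "H":
        return x.reshape(B, C, H, W)
    if direction == "H_flip":
        return x.reshape(B, C, H, W).flip(dims=[3])
    if direction == "V":
        return x.reshape(B, C, W, H).transpose(2, 3)
    return x.reshape(B, C, W, H).flip(dims=[3]).transpose(2, 3)   # "V_flip"

def cross_scan_attention(x, Wq, Wk, Wv):
    B, C, H, W = x.shape
    outs = []
    for d in ["H", "V", "H_flip", "V_flip"]:
        s = scan(x, d)
        q, k, v = s @ Wq, s @ Wk, s @ Wv
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        outs.append(unscan(attn @ v, d, H, W))        # back to (B, C, H, W)
    return torch.stack(outs).mean(dim=0)              # average over the four directions

x = torch.randn(2, 64, 16, 16)
Wq, Wk, Wv = (torch.randn(64, 64) * 0.02 for _ in range(3))
print(cross_scan_attention(x, Wq, Wk, Wv).shape)      # torch.Size([2, 64, 16, 16])
```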
Our method thus advances the basic attention operation by integrating scene understanding with the design of local and global class attention. This integration reduces background noise and alleviates the adverse effects of complex and dynamic spatial structures on contextual modeling, while effectively exploiting the global features of geospatial objects from different perspectives.
After obtaining the attention-enhanced representations in all four directions, we perform an averaging operation to integrate the directional outputs and restore the original spatial structure. This produces a unified global feature representation that captures long-range dependencies and scene-level semantics from different orientations. The resulting representation not only complements the class-aware local context derived from the SFCC module but also serves as a crucial input to the subsequent feature fusion and semantic prediction.
We concatenate the local semantic features from SFCC, the global scene features from CSCA, and the initial features extracted by the backbone to integrate the multi-granularity contextual representations. A learnable projection layer is employed to fuse these features, allowing the network to combine fine-grained spatial semantics with long-range global dependencies. The fused representation is subsequently fed into the final segmentation head to generate precise and structure-aware predictions.
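A fusion equation consistent with this description can be written as follows, where the symbols $F_{\mathrm{SFCC}}$, $F_{\mathrm{CSCA}}$, $F_{\mathrm{init}}$, $\delta(\cdot)$, and $F_{\mathrm{fused}}$ are our notation for the features and fusion function just described:

$$F_{\mathrm{fused}} = \delta\big(\mathrm{Concat}\left[F_{\mathrm{SFCC}},\ F_{\mathrm{CSCA}},\ F_{\mathrm{init}}\right]\big),$$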
where $\delta(\cdot)$ denotes a learnable fusion function implemented as a lightweight convolution. The resulting fused feature $F_{\mathrm{fused}}$ serves as the input to the final segmentation head.