In this section, we provide a detailed description of the Cross-scan Semantic Cluster Network (CSCN). The network combines the Semantic Filtering Contextual Cluster (SFCC), which is based on class-centered context modeling, with the Cross-scan Scene Coupling Attention (CSCA), which restructures attention using a multi-directional scanning strategy. This design targets prevalent challenges in remote sensing semantic segmentation, such as substantial background interference, significant intra-class variability, and blurred object boundaries, and thereby improves segmentation performance.
3.1. Architecture and Workflow
As shown in Figure 3, CSCN is an encoder–decoder semantic segmentation framework that includes three key components: a high-resolution backbone network, SFCC, and CSCA.
The input image is initially processed by the backbone network to extract multi-scale features. These features are subsequently downsampled using a bottleneck convolutional layer (Conv 3 × 3), producing a compact representation that facilitates efficient context extraction and attention modeling.
The SFCC module is designed to model local semantic context by viewing the feature map as a set of unordered points rather than a structured grid. It first divides the feature map into local windows and flattens each window into a point set. These points are clustered into semantic groups via top-k nearest-neighbor selection based on cosine similarity. To eliminate perceptual and semantic redundancy, a Semantic Filtering (SF) mechanism is applied. Each cluster is assigned a centroid, and similarity-weighted aggregation is performed to produce refined local semantic features.
Meanwhile, the proposed CSCA constructs global and local class centers from pre-classification masks and uses rotation position encoding (ROPE) together with a DCT-based global scene representation. To capture long-range dependencies, the model scans the input features in four directions: horizontal, vertical, horizontal-flip, and vertical-flip. This multi-directional strategy allows the model to perceive spatial continuity and semantic relationships from multiple perspectives. The final attention output is generated by averaging the responses of the four directions and projecting them back to the original spatial resolution.
Finally, the outputs of the SFCC and CSCA modules are concatenated with the original features and passed through a fusion block to generate the refined feature representation. This yields two-stage predictions: a coarse mask from the decoder and a fine-grained segmentation output from the full attention path. This hierarchical refinement design strengthens the framework's capacity to preserve edge structures and class boundaries.
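To make the workflow concrete, the following minimal PyTorch sketch mirrors the data flow described above. The channel sizes and the nn.Identity placeholders standing in for the SFCC and CSCA branches are our assumptions for illustration, not the authors' implementation.

```python
# A minimal PyTorch sketch of the CSCN workflow described above (not the authors'
# implementation). The channel sizes and the nn.Identity placeholders standing in
# for the SFCC and CSCA branches are assumptions made for illustration.
import torch
import torch.nn as nn

class CSCNSketch(nn.Module):
    def __init__(self, backbone_channels=512, hidden=256, num_classes=7):
        super().__init__()
        # Bottleneck 3x3 conv that compresses backbone features for context modeling.
        self.bottleneck = nn.Sequential(
            nn.Conv2d(backbone_channels, hidden, 3, padding=1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.sfcc = nn.Identity()   # stands in for the Semantic Filtering Contextual Cluster
        self.csca = nn.Identity()   # stands in for the Cross-scan Scene Coupling Attention
        # Fusion of the SFCC output, the CSCA output, and the bottleneck features.
        self.fusion = nn.Sequential(
            nn.Conv2d(hidden * 3, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True))
        self.coarse_head = nn.Conv2d(hidden, num_classes, 1)   # coarse pre-classification mask
        self.fine_head = nn.Conv2d(hidden, num_classes, 1)     # fine-grained prediction

    def forward(self, backbone_feat):
        x = self.bottleneck(backbone_feat)
        coarse = self.coarse_head(x)                  # first-stage (coarse) prediction
        local_ctx = self.sfcc(x)                      # local class-aware context
        global_ctx = self.csca(x)                     # multi-directional global context
        fused = self.fusion(torch.cat([local_ctx, global_ctx, x], dim=1))
        fine = self.fine_head(fused)                  # second-stage (refined) prediction
        return coarse, fine

feat = torch.randn(1, 512, 32, 32)                    # e.g., backbone output for a 512x512 image
coarse, fine = CSCNSketch()(feat)
print(coarse.shape, fine.shape)
```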
3.2. Semantic Filtering Contextual Cluster
We designed the Semantic Filtering Contextual Cluster (SFCC) module to explicitly capture key local semantics and spatial relationships in remote sensing images while minimizing confusion between semantic categories. Unlike traditional methods that treat feature maps as structured grids, this module interprets them as unordered sets of points, enabling feature sharing and aggregation through point-based clustering within localized windows. The semantic filtering mechanism improves the model's generalization capacity and semantic discrimination in highly complex scenes.
To begin, the input feature map is divided into a number of non-overlapping local windows, and each window is flattened into a point set $P \in \mathbb{R}^{N \times D}$, where $N$ is the total number of data points in the window and $D$ is the feature dimension of the points. A learnable linear mapping [16] then projects the point features into a similarity space.
Within each window, cluster centers are evenly initialized, and the initial feature of each center is obtained by selecting its $k$ nearest neighbors and averaging them (a sketch of this step is given below). Due to the uneven semantic distribution in remote sensing images, largely attributed to perceptual and semantic redundancy, a Semantic Filtering (SF) mechanism is incorporated, inspired by recent image de-redundancy clustering techniques [46,47].
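The window partition and center initialization can be sketched as follows, assuming a square window, evenly spaced anchor points as initial centers, and cosine-similarity k-nearest neighbors; all sizes are illustrative rather than the paper's settings.

```python
# A minimal sketch of the window partition and center initialization described
# above, assuming a square window, evenly spaced anchor points as initial
# centers, and cosine-similarity k-nearest neighbors; all sizes are illustrative.
import torch

def init_window_centers(feat, window=7, num_centers=4, k=5):
    """feat: (B, C, H, W). Returns per-window point sets and initial center features."""
    B, C, H, W = feat.shape
    # Split into non-overlapping windows and flatten each window into a point set.
    pts = feat.unfold(2, window, window).unfold(3, window, window)          # (B, C, nH, nW, w, w)
    pts = pts.permute(0, 2, 3, 4, 5, 1).reshape(B, -1, window * window, C)  # (B, #win, n, C)
    # Evenly pick `num_centers` anchor points per window as the initial centers.
    idx = torch.linspace(0, window * window - 1, num_centers).long()
    anchors = pts[:, :, idx]                                                # (B, #win, c, C)
    # Each center's initial feature is the mean of its k most similar points.
    sim = torch.cosine_similarity(anchors.unsqueeze(3), pts.unsqueeze(2), dim=-1)
    knn = sim.topk(k, dim=-1).indices                                       # (B, #win, c, k)
    gathered = torch.gather(
        pts.unsqueeze(2).expand(-1, -1, num_centers, -1, -1), 3,
        knn.unsqueeze(-1).expand(-1, -1, -1, -1, C))
    return pts, gathered.mean(dim=3)                                        # centers: (B, #win, c, C)

feat = torch.randn(1, 64, 28, 28)
pts, centers = init_window_centers(feat)
print(pts.shape, centers.shape)   # torch.Size([1, 16, 49, 64]) torch.Size([1, 16, 4, 64])
```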
As shown in Figure 4a, perceptual redundancy refers to samples that differ only minimally at the pixel level yet are repeatedly learned by the model. In contrast, as shown in Figure 4b, semantic redundancy refers to images whose pixel appearance differs significantly but which express the same semantic information. The presence of both types of redundancy in remote sensing images leads to an imbalanced distribution of semantic concepts in the dataset, and such an imbalance may result in bias and wasted resources during training.
For the center points within a single cluster, we define the global image embedding matrix $E \in \mathbb{R}^{M \times D_e}$, where $M$ is the number of images and $D_e$ is the embedding dimension. We partition $E$ into chunks $\{E_1, E_2, \dots\}$, where $e_i$ denotes the feature embedding of the $i$-th image, $e_j^{(n)}$ is the $j$-th point in the $n$-th chunk $E_n$, and each chunk contains the same number of points. For each image $e_i$, we compute the Euclidean distance [48] to all points in its chunk.
The top-k nearest neighbors [46] are retained per chunk, and all chunks are concatenated to form the full nearest-neighbor matrix. An image pair is merged into the same cluster according to a distance threshold $\tau$: two images fall into the same cluster if their distance is below $\tau$. In our experiments, we empirically set $\tau$ to 0.7. This value was selected based on validation performance on the LoveDA dataset, where we observed that $\tau = 0.7$ provided the best trade-off between over-segmentation and under-clustering; a grid search over a range of threshold values was conducted, and 0.7 achieved the most stable and accurate clustering results across all datasets. Each cluster is then assigned a centroid, and only the image closest to the centroid is retained; all others are filtered out to reduce redundancy.
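The filtering step can be sketched as below. This is not the authors' code; the chunk size, the union-find helper, and the use of Euclidean distances on normalized embeddings with the 0.7 threshold are illustrative assumptions consistent with the description above.

```python
# A sketch of the semantic filtering step: chunked top-k nearest neighbours,
# threshold-based merging with union-find, and keeping only the member closest
# to each cluster centroid. Not the authors' code; the chunk size, the union-find
# helper, and Euclidean distances on normalized embeddings are assumptions.
import torch
import torch.nn.functional as F

def semantic_filter(embeddings, chunk_size=1024, k=5, tau=0.7):
    """Drop near-duplicate embeddings and return the indices of kept representatives."""
    n = embeddings.size(0)
    parent = list(range(n))                                   # union-find over images

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    # Chunked top-k nearest neighbours; merge a pair when its distance is below tau.
    for start in range(0, n, chunk_size):
        chunk = embeddings[start:start + chunk_size]
        dists = torch.cdist(chunk, embeddings)                # pairwise Euclidean distances
        knn_d, knn_i = dists.topk(k + 1, largest=False)       # +1 because self-distance is 0
        for row in range(chunk.size(0)):
            for d, j in zip(knn_d[row, 1:], knn_i[row, 1:]):
                if d.item() < tau:
                    union(start + row, j.item())

    # Within each cluster keep only the member closest to the cluster centroid.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    kept = []
    for members in clusters.values():
        centroid = embeddings[members].mean(dim=0)
        d_to_centroid = (embeddings[members] - centroid).norm(dim=1)
        kept.append(members[d_to_centroid.argmin().item()])
    return sorted(kept)

emb = F.normalize(torch.randn(200, 64), dim=1)
print(len(semantic_filter(emb)), "representatives kept out of 200")
```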
To relate the cross-cluster centers to all patches, we compute a cosine-similarity matrix $S$ between every point and every cluster center, $S_{ij} = \frac{c_i^{\top} p_j}{\lVert c_i \rVert \, \lVert p_j \rVert}$, where $c_i$ and $p_j$ represent the feature vectors of the cluster center and the patch, respectively. Semantically similar patches are then organized with the Union-Find method: two patches are merged into a single group if their distance is below the threshold $\tau$. To minimize redundancy and improve the semantic representation, only the core patch (i.e., the patch nearest to the centroid) is retained in each group for subsequent feature aggregation.
Clusters are then formed by allocating each point to its nearest center according to the similarity matrix $S$. The number of points per cluster may vary, and some clusters may contain no points at all; this situation reflects data redundancy. Points within the same cluster are dynamically drawn toward the center point based on their similarities.
If $m$ feature points belonging to a cluster are selected from a local window and their feature dimension is $d'$, these points form a matrix of size $m \times d'$. Using the similarity matrix $S$, the similarity between each of these points and the cluster center is extracted to form a similarity vector $s$. A linear transformation then projects the points into a new space, the value space, which is used for subsequent feature aggregation and optimization; in the value space, all points are fused to obtain a new cluster center $v_c$.
We calculate the similarity between each point and the new cluster center $v_c$ and weight the points accordingly, with the weighting carried out through the sigmoid activation function $\mathrm{sig}(\cdot)$. Here, $\alpha$ and $\beta$ are learnable parameters, and $v_i$ is the original feature of the $i$-th point. This weighting ensures that points closer to the cluster center receive greater weight during feature aggregation.
Next, we combine the similarity-weighted features of the new cluster center $v_c$ and the member points to construct the aggregated local contextual feature for this cluster, where $v_c$ denotes the new cluster center, $\mathrm{sig}(\cdot)$ is the sigmoid activation function, and $\alpha$ and $\beta$ are the learnable parameters controlling the feature weighting. The aggregated feature is then used to update each point after similarity weighting, which improves intra-class consistency. Finally, the clustering results are mapped from the clustering feature space back to the original feature space and reshaped into a 2D feature map.
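A minimal sketch of this similarity-weighted aggregation and point update is given below, assuming a Context-Cluster-style formulation with learnable $\alpha$ and $\beta$; the paper's exact normalization and dispatch rule may differ.

```python
# A minimal sketch of the similarity-weighted aggregation and point update,
# assuming a Context-Cluster-style formulation with learnable alpha and beta;
# the paper's exact normalization and dispatch rule may differ.
import torch
import torch.nn as nn

class ClusterAggregate(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))    # learnable scale on the similarity
        self.beta = nn.Parameter(torch.zeros(1))    # learnable shift on the similarity

    def forward(self, v, v_c, sim):
        """v: (m, d) point values of one cluster, v_c: (d,) center, sim: (m,) similarities."""
        w = torch.sigmoid(self.alpha * sim + self.beta)                 # per-point weights
        # Aggregate: center plus similarity-weighted members, normalized by the total weight.
        g = (v_c + (w.unsqueeze(1) * v).sum(dim=0)) / (1.0 + w.sum())
        # Dispatch: each point is updated with the aggregated feature, weighted by w.
        v_updated = v + w.unsqueeze(1) * g.unsqueeze(0)
        return g, v_updated

agg = ClusterAggregate()
v = torch.randn(9, 64)                              # nine points of one cluster in value space
v_c = v.mean(dim=0)                                 # fused cluster center (illustrative)
sim = torch.cosine_similarity(v, v_c.unsqueeze(0), dim=1)
g, v_new = agg(v, v_c, sim)
print(g.shape, v_new.shape)
```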
3.3. Cross-Scan Scene Coupling Attention
The Cross-scan Scene Coupling Attention (CSCA) restructures the attention operation and improves the attention modeling process. By enabling dynamic adaptation, CSCA facilitates the integration of global contextual information and promotes efficient exploitation of the spatial relationships among geographic objects. CSCA consists of two main parts: scene coupling (SC) and the cross-scan strategy.
3.3.1. Scene Coupling
In remote sensing images, the separation of similar pixel types in the feature space can cause vanilla attention models to construct poor class centers. To address this limitation and capture the complex spatial relationships among ground objects more effectively, we introduce the scene coupling (SC) mechanism. By combining global scene semantics with the spatial distribution of objects, the model is better able to comprehend intricate and dynamic spatial structures; this approach has proven successful for modeling geographic objects under complex spatial conditions [49].
First, the initial feature is extracted from the input image by the backbone, and a convolution layer is employed to produce the pre-classification feature, whose channel number equals the number of classes $K$. A pre-classification semantic mask and a local semantic representation are then constructed. Specifically, for each pixel position, the class with the maximum predicted probability is selected as the pre-classification mask for that pixel. A mapping function is used to project the $K$ class centers back to the original graph space, producing the global class centers. This function acts as a mapping and reconstruction operator that projects the pre-classification semantic features and class-center information into the semantic graph space, reassigning each pixel's class probability to its corresponding class center and generating a semantic representation.
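One way to realize this construction is sketched below: class centers are obtained by soft class-center pooling of the features with the pre-classification probabilities, and each pixel is then assigned its class's center. This is our illustrative reading; the authors' mapping operator may differ.

```python
# A sketch of building global class centers from pre-classification probabilities
# (soft class-center pooling) and mapping them back to the image grid. This is an
# illustrative assumption, not necessarily the paper's exact operator.
import torch
import torch.nn.functional as F

def global_class_centers(feat, logits):
    """feat: (B, C, H, W) features; logits: (B, K, H, W) pre-classification scores."""
    B, C, H, W = feat.shape
    prob = F.softmax(logits.flatten(2), dim=2)                   # (B, K, HW) spatial weights per class
    centers = torch.bmm(prob, feat.flatten(2).transpose(1, 2))   # (B, K, C) class centers
    mask = logits.argmax(dim=1)                                  # (B, H, W) pre-classification mask
    # Map centers back to the image grid: each pixel receives its class's center.
    mapped = centers.gather(1, mask.flatten(1).unsqueeze(-1).expand(-1, -1, C))
    return centers, mask, mapped.transpose(1, 2).reshape(B, C, H, W)

feat = torch.randn(2, 256, 32, 32)
logits = torch.randn(2, 6, 32, 32)
centers, mask, sem_map = global_class_centers(feat, logits)
print(centers.shape, mask.shape, sem_map.shape)
```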
To account for local consistency, the original image is further divided into patches of size $p \times p$, giving $\frac{H}{p} \times \frac{W}{p}$ local blocks in total. By splitting the pre-classification features and the semantic mask into their patch-level counterparts in the same way, local semantic masks and local class centers are constructed. At this point, the vanilla attention can be restructured around these global and local class centers, as sketched below.
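A possible form of this restructured attention is for the pixel queries to attend to the concatenated global and local class centers instead of to all pixels. The sketch below illustrates this reading with plain scaled dot-product attention; the paper's exact formulation may differ.

```python
# A possible form of the restructured attention: pixel queries attend to the
# concatenated global and local class centers instead of to all pixels. Plain
# scaled dot-product attention is used here for illustration; the paper's exact
# formulation (with scene coupling and ROPE) may differ.
import torch

def class_center_attention(q, global_centers, local_centers):
    """q: (B, N, C) pixel queries; centers: (B, K, C) global and (B, K_local, C) local."""
    kv = torch.cat([global_centers, local_centers], dim=1)        # (B, K + K_local, C)
    attn = torch.softmax(q @ kv.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ kv                                              # (B, N, C) refined features

q = torch.randn(2, 1024, 256)                                     # 32x32 pixels as queries
out = class_center_attention(q, torch.randn(2, 6, 256), torch.randn(2, 24, 256))
print(out.shape)
```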
Remote sensing images often exhibit specific spatial distribution patterns; for example, vehicles mostly appear on roads, and buildings are generally located near roads. These patterns underline the importance of spatial relationships among pixels. Following the idea of Rotation Position Encoding (ROPE) [50], we integrate these spatial correlations into the attention mechanism through the query-key inner product. As shown in Figure 5, this method overcomes the limitations of fixed position encoding and significantly enhances the model's generalization ability.
We define a two-dimensional ROPE encoding and construct the attention inner product on top of it. In the one-dimensional case, $x_m$ and $x_n$ represent the feature vectors at positions $m$ and $n$ in the image, $W_q$ and $W_k$ are the linear mapping weight matrices of the query and key, $e^{im\theta}$ is the rotation factor, and $\theta$ is the complex angle. We then extend the formulation to 2D images, where $u$ and $v$ are the horizontal and vertical indices of a patch and $\theta_u$ and $\theta_v$ are the rotation angles in the two directions; these rotations introduce spatial location information into the queries and keys. Because remote sensing image features are multi-channel, a multi-channel extension is further introduced on top of the two-dimensional form, in which $R_u$ and $R_v$ are the horizontal and vertical rotation matrices, $q$ and $k$ are the original query and key vectors, and $\mathrm{Re}(\cdot)$ denotes taking the real part of the complex product. The affinity matrix is then computed from the rotated queries and keys.
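The following sketch shows one standard way to apply a 2D rotary encoding to queries and keys and form the position-aware affinity matrix. Splitting the channels between the horizontal and vertical axes and the base frequency of 10000 are our assumptions, not necessarily the paper's exact scheme.

```python
# Sketch of a 2D rotary position encoding applied to queries/keys, consistent
# with the description above. Splitting channels between the horizontal and
# vertical axes and the 10000 base frequency are illustrative assumptions.
import torch

def rope_1d(x, pos, theta_base=10000.0):
    """Rotate channel pairs of x (..., D) by angles pos * theta_i."""
    D = x.shape[-1]
    freqs = theta_base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)
    ang = pos[..., None] * freqs                      # (..., D/2) rotation angles
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * ang.cos() - x2 * ang.sin(),
                        x1 * ang.sin() + x2 * ang.cos()], dim=-1).flatten(-2)

def rope_2d(x, u, v):
    """Apply horizontal rotation to one half of the channels and vertical to the other."""
    D = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :D], u), rope_1d(x[..., D:], v)], dim=-1)

H = W = 8
u, v = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
u, v = u.flatten().float(), v.flatten().float()       # per-patch grid indices
q, k = torch.randn(H * W, 64), torch.randn(H * W, 64)
affinity = rope_2d(q, u, v) @ rope_2d(k, u, v).T      # position-aware affinity matrix
print(affinity.shape)
```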
To obtain cross-spatial global semantic information, we introduce the two-dimensional discrete cosine transform [51] to model the global representation of the scene. The basic transformation function is the 2D DCT basis
$B_{h,w}^{u,v} = \cos\left(\frac{\pi (2h+1) u}{2H}\right)\cos\left(\frac{\pi (2w+1) v}{2W}\right),$
and the spectrum is computed as
$F(u,v) = \alpha(u)\,\alpha(v)\sum_{h=0}^{H-1}\sum_{w=0}^{W-1} f(h,w)\, B_{h,w}^{u,v},$
where $\alpha(u)$ and $\alpha(v)$ are the normalization factors of the discrete cosine transform (DCT) along the height and width dimensions, respectively, and $f(h,w)$ denotes the pixel (or feature) value at spatial location $(h,w)$. The most discriminative frequency components are selected and concatenated along the channel dimension to obtain the global scene representation. Ultimately, the global scene representation and the affinity matrix are incorporated into the attention mechanism, with the number of feature channels serving as the scaling factor in the attention computation.
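A sketch of the DCT-based scene descriptor is given below: each channel is projected onto 2D DCT bases and a few coefficients are kept as the global scene representation. Keeping the lowest frequencies (rather than a learned selection of the most discriminative components) is a simplifying assumption.

```python
# Sketch of a DCT-based global scene representation: each channel of the feature
# map is projected onto 2D DCT bases and a few coefficients are concatenated into
# a scene vector. Keeping the lowest frequencies is our illustrative choice.
import math
import torch

def dct_basis(n):
    """Orthonormal 1D DCT-II basis matrix of size (n, n)."""
    k = torch.arange(n).float()
    basis = torch.cos(math.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0] *= 1.0 / math.sqrt(2)
    return basis * math.sqrt(2.0 / n)

def global_scene_repr(feat, num_freq=4):
    """feat: (B, C, H, W) -> (B, C * num_freq**2) scene descriptor."""
    B, C, H, W = feat.shape
    Dh, Dw = dct_basis(H), dct_basis(W)
    spectrum = torch.einsum("uh,bchw,vw->bcuv", Dh, feat, Dw)   # 2D DCT per channel
    low = spectrum[:, :, :num_freq, :num_freq]                  # keep low-frequency components
    return low.reshape(B, -1)

feat = torch.randn(2, 256, 32, 32)
scene = global_scene_repr(feat)
print(scene.shape)   # torch.Size([2, 4096])
```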
3.3.2. Cross-Scan Strategy
Conventional global scanning methods often overlook the importance of local features and contextual information in remote sensing images, and fixed scanning orders cannot adaptively capture spatial relationships. To address this issue, we propose a cross-scan approach. The modeling of long-range dependencies is improved by selectively interacting along four scanning directions, which is particularly effective for remote sensing scenes characterized by irregular and diverse land-cover distributions. In contrast to previous methods that only scan in the row or column direction (such as SSM [52]), we patch and flatten the input image and scan the flattened image horizontally (H), vertically (V), horizontally flipped (H_flip), and vertically flipped (V_flip) to gain more comprehensive global information.
Specifically, the input feature map is divided into local blocks and flattened into sequences along the four directions, which are then rearranged per direction to simulate the scanning process. In each direction, learnable projections map the directional features onto queries, keys, and values using the learnable projection matrices $W_Q$, $W_K$, and $W_V$, respectively. The attention map in each direction is computed from these directional queries and keys, and an averaging operation aggregates the final output, as sketched below.
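A minimal sketch of the cross-scan strategy follows: the feature map is flattened in the four scan orders, attention is applied per direction, each output is mapped back to the original layout, and the four results are averaged. Plain scaled dot-product attention stands in here for the scene-coupling attention of CSCA.

```python
# A minimal sketch of the cross-scan strategy: the feature map is flattened in the
# four scan orders (H, V, H_flip, V_flip), attention is applied per direction, each
# output is mapped back to the original layout, and the four results are averaged.
# Plain scaled dot-product attention stands in for the scene-coupling attention.
import torch

def scan(x, direction):
    """x: (B, C, H, W) -> (B, H*W, C) sequence in the given scan order."""
    if direction == "H":
        seq = x
    elif direction == "V":
        seq = x.transpose(2, 3)                       # column-major order
    elif direction == "H_flip":
        seq = x.flip(dims=[3])
    else:                                             # "V_flip"
        seq = x.transpose(2, 3).flip(dims=[3])
    return seq.flatten(2).transpose(1, 2)

def unscan(seq, direction, H, W):
    """Inverse of scan: (B, H*W, C) back to (B, C, H, W) in the original layout."""
    B, N, C = seq.shape
    x = seq.transpose(1, 2)
    if direction == "H":
        return x.reshape(B, C, H, W)
    if direction == "H_flip":
        return x.reshape(B, C, H, W).flip(dims=[3])
    if direction == "V":
        return x.reshape(B, C, W, H).transpose(2, 3)
    return x.reshape(B, C, W, H).flip(dims=[3]).transpose(2, 3)   # "V_flip"

def cross_scan_attention(x, Wq, Wk, Wv):
    B, C, H, W = x.shape
    outs = []
    for d in ["H", "V", "H_flip", "V_flip"]:
        s = scan(x, d)
        q, k, v = s @ Wq, s @ Wk, s @ Wv
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        outs.append(unscan(attn @ v, d, H, W))        # back to (B, C, H, W)
    return torch.stack(outs).mean(dim=0)              # average over the four directions

x = torch.randn(2, 64, 16, 16)
Wq, Wk, Wv = (torch.randn(64, 64) * 0.02 for _ in range(3))
print(cross_scan_attention(x, Wq, Wk, Wv).shape)      # torch.Size([2, 64, 16, 16])
```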
Our method thus advances the basic attention operation by integrating scene understanding with the design of local and global class attention. This integration reduces background noise and alleviates the adverse effects of complex and dynamic spatial structures on contextual modeling, while effectively exploiting the global features of geospatial objects from different perspectives.
After obtaining the attention-enhanced representations in all four directions, we perform an averaging operation to integrate the directional outputs and restore the original spatial structure. This produces a unified global feature representation that captures long-range dependencies and scene-level semantics from different orientations. The resulting representation not only complements the class-aware local context derived from the SFCC module but also serves as a crucial input to the subsequent feature fusion and semantic prediction.
We concatenate the local semantic features from SFCC, the global scene features from CSCA, and the initial features extracted by the backbone to integrate the multi-granularity contextual representations. A learnable projection layer is employed to fuse these features, allowing the network to combine fine-grained spatial semantics with long-range global dependencies. The fused representation is subsequently fed into the final segmentation head to generate precise and structure-aware predictions.
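A fusion equation consistent with this description can be written as follows, where the symbols $F_{\mathrm{SFCC}}$, $F_{\mathrm{CSCA}}$, $F_{\mathrm{init}}$, $\delta(\cdot)$, and $F_{\mathrm{fused}}$ are our notation for the features and fusion function just described:

$$F_{\mathrm{fused}} = \delta\big(\mathrm{Concat}\left[F_{\mathrm{SFCC}},\ F_{\mathrm{CSCA}},\ F_{\mathrm{init}}\right]\big),$$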
where $\delta(\cdot)$ denotes a learnable fusion function implemented as a lightweight convolution. The resulting fused feature $F_{\mathrm{fused}}$ serves as the input to the final segmentation head.