Superpixel-Tokenized and Frequency-Modulated Hybrid CNN–Transformer for Remote Sensing Semantic Segmentation

Xie, Xinlin; Chang, Chenhao; Yang, Yunyun; Xie, Gang

doi:10.3390/rs18050754

Open Access Article

¹ School of Electronic and Information Engineering, Taiyuan University of Science and Technology, Taiyuan 030024, China

² Shanxi Key Laboratory of Advanced Control and Industrial Intelligence, Taiyuan University of Science and Technology, Taiyuan 030024, China

³ College of Electrical and Power Engineering, Taiyuan University of Technology, Taiyuan 030024, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(5), 754; https://doi.org/10.3390/rs18050754

Submission received: 1 February 2026 / Revised: 19 February 2026 / Accepted: 27 February 2026 / Published: 2 March 2026

Highlights

What are the main findings?

We propose a novel architecture named SFCT-Net for remote sensing semantic segmentation. SFCT-Net integrates superpixel tokens and high-frequency constraints to preserve structural integrity and boundary precision. The network comprises three core modules: the Superpixel-Tokenized Linear Position Attention module, the Frequency-Modulated Deformable Edge Refinement module, and the Spatial–Semantic Feature Coupling module.
We construct the Taiyuan Satellite Remote Sensing Dataset (TSRSD), which is a high-resolution and fine-annotated benchmark covering diverse and complex urban landscapes.

What are the implications of the main findings?

Our proposed SFCT-Net demonstrates that incorporating domain-specific geometric and physical priors into deep learning frameworks enables superior interpretation of complex scenes compared with purely data-driven methods.
Our proposed modular method and the self-constructed TSRSD dataset provide a solution for high-precision urban planning and environmental surveillance, particularly in high-density environments with complex land-cover distributions.

Abstract

Remote sensing semantic segmentation is fundamental for fine-grained urban scene understanding, which in turn provides pixel-level semantic insights for urban development and environmental surveillance. However, existing hybrid segmentation architectures fail to incorporate intrinsic geometric and physical priors, inevitably leading to structural fragmentation, boundary ambiguity, and spatial misalignment of heterogeneous features. Therefore, we propose a Superpixel-Tokenized and Frequency-Modulated Hybrid CNN–Transformer network (SFCT-Net) for remote sensing semantic segmentation. The proposed network integrates superpixel tokens and high-frequency constraints to preserve structural integrity and boundary precision. First, our Superpixel-Tokenized Linear Position Attention (STLPA) module replaces rigid window tokens with semantic superpixels to ensure object integrity with linear computational complexity. Second, we construct a Frequency-Modulated Deformable Edge Refinement (FMDER) module that leverages high-frequency spectral priors to modulate deformable sampling, achieving robust boundary recovery. Finally, we develop the Spatial–Semantic Feature Coupling (SSFC) module, which employs a dual-branch strategy to correct spatial drift and align deep semantic features with shallow details. Experiments conducted on our self-built Taiyuan Satellite Remote Sensing Dataset (TSRSD) along with the ISPRS Vaihingen and Potsdam benchmark datasets demonstrate that our proposed SFCT-Net delivers state-of-the-art performance and efficiency by fusing superpixel and frequency priors for robust structural and boundary recovery.

Keywords: remote sensing imagery; semantic segmentation; convolutional neural network; transformer; domain-specific priors

1. Introduction

Remote sensing semantic segmentation underpins urban scene understanding and geographic information observation, supporting diverse applications spanning urban planning [1,2], disaster monitoring [3,4], and precision agriculture [5,6]. Driven by practical applications and sensing technologies, high-resolution remote sensing imagery (HRRSI) has become the standard data source for segmentation tasks [7]. Although HRRSI provides rich geometric details and texture information, semantic segmentation in complex and rapidly developing urban areas still encounters problems such as large intra-class variations and high inter-class similarities [8]. Consequently, robust segmentation architectures are essential for resolving such fine-grained details, positioning challenging benchmarks as a critical frontier in remote sensing research.

The paradigm of remote sensing semantic segmentation has evolved significantly, initially spearheaded by Convolutional Neural Networks (CNNs). Ronneberger et al. [9] proposed the classic UNet and Chen et al. [10] developed DeepLabv3+, both of which utilize the translation invariance and inductive bias of convolutions to efficiently extract local features. However, the inherently limited receptive field of CNNs hinders the capture of the long-range semantic dependencies required for modeling large-scale geospatial objects such as continuous road networks. To address this limitation, Dosovitskiy et al. [11] introduced Vision Transformers (ViTs) to model global context. ViTs demonstrate superior capability in capturing long-range dependencies through self-attention mechanisms. Subsequently, Liu et al. [12] and Dong et al. [13] proposed Swin Transformer and CSWin Transformer, respectively, to capture long-range interactions more efficiently. However, these window-based attention mechanisms incur substantial computational costs, particularly in the context of ultra-high-resolution geospatial data. Furthermore, ViTs lack the structural inductive bias necessary for recovering fine-grained boundaries, which tends to result in slow convergence and blurred edges. Consequently, hybrid CNN–Transformer architectures have emerged as a promising solution. For instance, Chen et al. [14] proposed TransUNet and Wang et al. [15] designed UNetFormer as ways to synergize the local robustness of CNNs with the global connectivity of transformers, achieving a superior balance between detail preservation and semantic consistency. However, current hybrid methods neglect the unique geometric and physical prior information inherent in geospatial data, leading to structural fragmentation of irregular objects and boundary ambiguity in complex scenes.

To illustrate the potential of domain-specific priors in overcoming these limitations, Figure 1 demonstrates their advantages in handling complex features.

In the geometric domain, superpixels offer an object-consistent structural prior that can be embedded into segmentation networks to guide feature aggregation (Figure 1b). The boundaries generated by superpixels (Figure 1c) can closely fit irregular geospatial objects, introducing geometric prior information to reduce data redundancy [16]. Early superpixel prior methods mainly used superpixels as a preprocessing step or post-processing optimization tool for Object-Based Image Analysis (OBIA) [17]. Subsequently, deep learning frameworks have incorporated superpixels to provide object-level geometric priors, which can capture intrinsic structural dependencies and preserve fine-grained boundary details [18,19]. However, existing hybrid architectures rarely integrate superpixels as dynamic tokens in attention mechanisms, leaving unresolved the grid–object mismatch inherent to rigid square window segmentation.

In the physical domain, frequency-domain analysis provides an edge-aware physical prior that can be integrated into segmentation networks in order to guide boundary-sensitive feature extraction. As shown in Figure 1d, semantic boundaries are distinctly characterized by high-frequency components within the amplitude spectrum. Consequently, features extracted through high-pass filtering (Figure 1e) can clearly decouple semantic edges from complex background textures [20]. Accordingly, existing deep learning methods incorporate frequency information into the segmentation process to enhance performance. For instance, Lin et al. [21] proposed frequency-adaptive dilated convolution to dynamically adjust the receptive field based on frequency components. Similarly, Bai et al. [22] and Yang et al. [23] utilized spectral attention in MsanlfNet and SFFNet, respectively, to sharpen boundary details. However, these methods operate at the feature-response level and fail to explicitly influence the geometric sampling process of convolution. Therefore, feature-level modulation alone is insufficient to guide the spatial sampling locations of the network.

Beyond the constraint of missing priors, the efficacy of the segmentation pipeline is further limited by the spatial misalignment between heterogeneous feature streams [24]. Specifically, shallow features possess high spatial resolution, which is essential for delineating rigid geometric boundaries and fine-grained texture details [25]. However, shallow features inherently lack semantic discrimination and are susceptible to background noise such as shadows, rendering them prone to misclassification when used in isolation. In addition, deep features excel at capturing robust semantic context and maintaining category consistency [26]. Due to repeated downsampling operations, however, deep semantic representations often suffer from spatial drift, leading to a significant misalignment with the precise physical boundaries preserved in the shallow layers [27]. Effective segmentation necessitates the complementary fusion of distinct feature streams. However, existing methods employing simple skip connections (e.g., concatenation) implicitly assume pixel-level correspondences, which leads to blurry boundaries [9]. Moreover, most existing fusion modules primarily focus on simple channel re-weighting and fail to actively utilize deep semantic features to correct spatial shifts or dynamically filter effective geometric details.

Motivated by the above discussion, we propose a Superpixel-Tokenized and Frequency-Modulated Hybrid CNN–Transformer Network (SFCT-Net). By establishing a unified framework that incorporates superpixel-tokenized attention and frequency-modulated sampling, our proposed SFCT-Net embeds geospatial priors into the core feature extraction process to overcome the limitations of object fragmentation and blurred boundaries. First, we introduce a Superpixel-Tokenized Linear Position Attention (STLPA) module to preserve semantic integrity by replacing fixed window partitioning with object-consistent tokens. Second, a Frequency-Modulated Deformable Edge Refinement (FMDER) module is designed to enhance boundary perception by exploiting high-frequency spectral components. Third, we propose a Spatial–Semantic Feature Coupling (SSFC) module to correct the misalignment between deep semantic representations and shallow spatial details. Finally, we construct a high-resolution urban remote sensing benchmark named the Taiyuan Satellite Remote Sensing Dataset (TSRSD) to comprehensively evaluate the effectiveness and generalization of the proposed method. Extensive experiments demonstrate that our proposed SFCT-Net achieves superior performance on multiple benchmark datasets.

Our main contributions are summarized as follows:

STLPA Module: The proposed STLPA module reformulates attention modeling by introducing superpixel-tokenized object representations instead of fixed window partitioning. This design preserves the semantic integrity of irregular ground objects while enabling efficient linear-complexity global dependency modeling in high-resolution remote sensing imagery.
FMDER Module: The proposed FMDER module integrates frequency-domain physical priors into boundary refinement by explicitly leveraging high-frequency spectral information. This strategy enhances boundary-aware feature learning and significantly improves edge localization robustness in complex and low-contrast urban scenes.
SSFC Module: The proposed SSFC module establishes an explicit coupling mechanism between deep semantic features and shallow spatial details. This module effectively rectifies feature misalignment induced by encoder–decoder downsampling and enables accurate pixel-level fusion across heterogeneous feature streams.
TSRSD: The proposed TSRSD provides a high-resolution urban remote sensing benchmark with dense and fine-grained annotations. This dataset facilitates rigorous evaluation of semantic segmentation models in heterogeneous urban environments and highlights their generalization capability.

2. Related Work

2.1. Heterogeneous Feature Interaction in Hybrid Architectures

To combine the inductive bias of convolutions with the global modeling capability of transformers, hybrid architectures for HRRSI semantic segmentation have explored various feature interaction strategies. Based on the topological interaction forms of feature streams, existing methods are primarily categorized into parallel dual-stream interaction and serial cascade interaction.

The dual-stream interaction category explores parallel dual-branch designs, where the CNN and transformer branches run in parallel to maintain local details and global context throughout the network depth. For instance, Chen et al. [28] and Zhang et al. [29] employed a dual-stream architecture in Mobile-Former and CMTFNet, respectively, establishing a bidirectional fusion bridge to facilitate information exchange between the local (CNN) and global (ViT) branches. Similarly, Wang et al. [30] developed BANet to enhance segmentation accuracy by simultaneously modeling global context and local details through parallel dependency and texture paths. While these parallel architectures theoretically preserve more spatial information, a bottleneck exists in their simplistic fusion mechanisms. More specifically, relying on basic channel concatenation or element-wise addition induces spatial shifts caused by disparate receptive fields, which results in semantic ambiguity within complex transition regions.

The serial cascade interaction category arranges the two components sequentially, typically utilizing a CNN as the encoder to capture local textures and subsequently employing a transformer as the decoder to model global context. For example, Chen et al. [14] and Wang et al. [15] respectively proposed TransUNet and UNetFormer to combine the strengths of both architectures. The former employs a transformer encoder to strengthen global context awareness, whereas the latter embeds transformer modules within a U-Net framework to synergize local–global modeling with high efficiency. However, serial cascade designs face a severe resolution gap. To limit computational costs, feature serialization typically occurs at low-resolution stages (e.g., 1/16 or 1/32 scale). Consequently, the interaction between deep semantics and shallow details becomes inefficient due to extensive downsampling, leading to the irreversible loss of high-frequency spatial details.

In summary, while hybrid representations have achieved notable progress, critical limitations regarding heterogeneous feature alignment remain unresolved. Specifically, the interaction between the CNN and ViT streams typically lacks explicit rectification mechanisms, leading to spatial shifts and blurred boundary inference.

2.2. Prior-Guided Learning for Remote Sensing Semantic Segmentation

Deep learning possesses exceptional capabilities in data-driven feature abstraction. However, the intricate topological structures and spectral ambiguities inherent in remote sensing objects necessitate the integration of explicit domain knowledge. Driven by this need, the research focus has progressively shifted from purely data-driven mapping to interpretable frameworks grounded in prior information. Simultaneously, strategies for embedding these priors have evolved from rigid external postprocessing constraints to deep internal architectural modulation.

In the early stages of prior integration, researchers primarily relied on probabilistic graphical models to enforce spatial consistency. For instance, Krähenbühl et al. [31] and Zheng et al. [32] employed Conditional Random Fields (CRF) and Markov Random Fields (MRF), both of which employ probabilistic models as postprocessing layers to refine segmentation predictions. However, external modules remain disjointed from the feature learning process and often incur prohibitive computational costs. To address this disconnect, multi-task learning paradigms were subsequently introduced utilizing geometric priors (e.g., boundaries and contours) as auxiliary supervision signals. Li et al. [33] introduced edge-aware networks, constructing parallel edge detection branches to jointly minimize semantic and boundary losses. This multi-task strategy effectively enhances edge responsiveness by enforcing boundary constraints. However, the interaction between boundary priors and semantic features remains implicit, meaning that it is typically confined to the loss function level rather than being involved in direct feature interaction.

To achieve deeper fusion, recent advancements have explicitly embedded priors into the network architecture. The injection of prior information serves as strong regularization, constraining the solution space in order to maintain physical and geometric consistency [34,35]. In the geometric domain, Liao et al. [36] employed superpixel contour priors in BACA to enhance multi-feature correlation measures, significantly reducing under-segmentation errors. In the physical domain, Zhang et al. [37] utilized frequency-domain guidance in FGNet to bolster spatial–frequency representation. However, critical limitations persist, as most prior-guided mechanisms remain either static or confined to a single domain.

3. Method

3.1. Overall Framework

To synergize the local feature extraction capability of CNNs with the global context modeling potential of transformers, we propose SFCT-Net, a superpixel-tokenized and frequency-modulated hybrid network for remote sensing semantic segmentation. Our proposed SFCT-Net adopts an asymmetric serial encoder–decoder design to progressively reconstruct high-resolution semantic maps. First, the proposed STLPA mechanism utilizes superpixel clustering to enhance geometric object consistency and reconstructs spatial features via a linear positioning function. Second, the proposed FMDER leverages high-frequency spectral components to modulate deformable sampling, thereby achieving precise boundary recovery. Finally, we deploy SSFC at the output stage to bridge the spatial misalignment between deep semantics and shallow details. The overall framework of the proposed method is shown in Figure 2.

3.2. Superpixel-Tokenized Linear Position Attention Module

In HRRSI, geospatial objects exhibit arbitrary shapes and multi-scale characteristics. Fixed-size window partitioning reduces self-attention complexity. However, the rigid structure fails to align with complex object boundaries, compromising semantic integrity and discarding vital shape information. To overcome these limitations, we propose the Superpixel-Tokenized Linear Position Attention (STLPA) module. Our proposed STLPA module employs Simple Linear Iterative Clustering (SLIC) to generate irregular superpixels based on semantic similarity. Within STLPA, SLIC is applied to intermediate feature maps rather than raw RGB images, and the clustering is recomputed at each decoder stage. Consequently, superpixels serve as object tokens to provide shape-adaptive guidance, which aligns the attention mechanism with the physical boundaries of ground objects. The overall architecture of the proposed STLPA module is illustrated in Figure 3.

First, we construct compact object tokens through superpixel-guided aggregation to preserve semantic integrity. Let $F \in \mathbb{R}^{H \times W \times C}$ represent the input feature map. We adopt the SLIC clustering mechanism to partition the spatial domain into $N_S$ coherent regions $R_1, \dots, R_{N_S}$. To maintain a balance between semantic granularity and computational efficiency across all levels of the decoder, we implement an adaptive sampling strategy where the number of tokens $N_S$ is coupled with the feature resolution. Specifically, the sampling interval $S_l$ for the $l$-th decoder stage is determined as $S_l = \min(\sqrt{L_l \cdot d_{\min}}, 16)$, where $L_l$ denotes the side length of the current feature map. The choice of 16 ensures robust capture of minimal object scales at the shallow resolution. The minimum resolvable diameter $d_{\min}$ is set to 4 to ensure that the sampling frequency meets the Nyquist criterion for capturing small objects in deep resolution as well as to prevent large $S$ from degrading superpixel tokens into pooling operations. Furthermore, we ensure that $N_S = \frac{H_l \times W_l}{S_l^2}$ remains robust at the coarsest scales. Therefore, we obtain $N_S \in \{4, 8, 16, 64\}$ for a 512 × 512 input and $N_S \in \{2, 4, 8, 16\}$ for a 256 × 256 input at the 1/32 to 1/4 scales, respectively. Compared with a fixed stride, hierarchical persistence ensures that object tokenization remains effective in deep layers, which allows the model to capture multiple semantic objects even at high-level semantic layers. Subsequently, we generate the pixel-level query $Q \in \mathbb{R}^{HW \times C}$ to maintain fine-grained details, while aggregating keys and values via superpixel pooling to reduce redundancy. For the $k$-th region $R_k$, the object token is computed as

$$T_k = \frac{1}{|R_k|} \sum_{i \in R_k} \psi(F_i), \tag{1}$$

where $F_i$ is the feature vector of pixel $i$ and $\psi(\cdot)$ denotes a depthwise separable convolution applied to encode local geometric structures. Consequently, we obtain the superpixel-level key matrix $K_S \in \mathbb{R}^{N_S \times C}$ and value matrix $V_S \in \mathbb{R}^{N_S \times C}$.
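The adaptive sampling rule can be checked numerically. The following is a minimal sketch, assuming square feature maps (so that $L_l = H_l = W_l$) and rounding $N_S$ to the nearest integer; the function names are illustrative and do not come from the original implementation.

```python
import math

D_MIN = 4    # minimum resolvable diameter d_min, as stated in the text
S_MAX = 16   # upper bound on the sampling interval, as stated in the text


def sampling_interval(side_length, d_min=D_MIN, s_max=S_MAX):
    """S_l = min(sqrt(L_l * d_min), 16)."""
    return min(math.sqrt(side_length * d_min), s_max)


def num_tokens(side_length, d_min=D_MIN, s_max=S_MAX):
    """N_S = (H_l * W_l) / S_l^2 for a square map (assumption), rounded."""
    s = sampling_interval(side_length, d_min, s_max)
    return round(side_length * side_length / s ** 2)


def token_schedule(input_size, scales=(32, 16, 8, 4)):
    """Token counts at the 1/32, 1/16, 1/8 and 1/4 scales of the input."""
    return [num_tokens(input_size // k) for k in scales]


print(token_schedule(512))  # [4, 8, 16, 64]
print(token_schedule(256))  # [2, 4, 8, 16]
```

Below the cap of 16, the rule simplifies to $N_S = L_l / d_{\min}$, so the token count grows linearly with the side length. Once the cap is active (the 1/4 scale of a 512 × 512 input), it grows quadratically.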

Second, we introduce a direction-sensitive positioning function to reconstruct the geometric structure of the feature space and address the rank collapse problem in linear attention mechanisms. To address the issue of existing linear attention mechanisms being insensitive to relative positions, the proposed function $\phi(\cdot)$ nonlinearly reconstructs the directional distribution of query vectors by introducing a focusing factor $p$ ($p \geq 1$):

$$\phi(x) = \frac{\operatorname{sign}(x) \cdot |x|^{p}}{\|x\|^{1 - \frac{1}{p}}}, \tag{2}$$

where the input vector $x \in \mathbb{R}^{d}$ refers to the query vector obtained after the linear projection layer, $\|\cdot\|$ denotes the $L_2$ norm, and $\operatorname{sign}(\cdot)$ preserves the directional quadrant.

Third, for a $d$-dimensional query vector $x \in \mathbb{R}^{d}$, the proposed function $\phi(x)$ reconstructs the feature space by re-modulating the competitive relationship between all dimensions. For any two dimensions $i$ and $j$ ($i, j \in \{1, \dots, d\}$), the ratio of their mapped intensities is given by

$$\frac{|\phi(x)_i|}{|\phi(x)_j|} = \left(\frac{|x_i|}{|x_j|}\right)^{p}. \tag{3}$$

As $p$ increases, the mapped vector $\phi(x)$ is increasingly dominated by the most significant component of $x$, $x_{\max} = \max(|x_1|, \dots, |x_d|)$, forcing the token to align with the corresponding $k$-th orthogonal basis axis $e_k$. To visualize this process, we consider a 2D simplification where the directional angle is $\theta = \arctan\left(\frac{x_2}{x_1}\right)$. After applying the proposed positioning function $\phi(x)$ with element-wise power $|x|^{p}$, the new angle $\theta'$ becomes

$$\theta' = \arctan\left(\frac{\operatorname{sign}(x_2)\,|x_2|^{p}}{\operatorname{sign}(x_1)\,|x_1|^{p}}\right) = \arctan\left(\operatorname{sign}\left(\frac{x_2}{x_1}\right) \cdot \left|\frac{x_2}{x_1}\right|^{p}\right), \tag{4}$$

where the angular shift $\Delta\theta = |\theta' - \theta|$ reflects the degree of feature space reconstruction. If $p = 1$, the function degenerates into a standard linear projection. If $p > 1$, the term $\left|\frac{x_2}{x_1}\right|^{p}$ acts as a polarization factor. If $|x_1| > |x_2|$, the ratio $\left|\frac{x_2}{x_1}\right|^{p}$ approaches zero as $p$ increases, which forces $\theta'$ toward $0^{\circ}$ (alignment with the x-axis). In contrast, if $|x_2| > |x_1|$, the angle $\theta'$ is forced toward $90^{\circ}$ (alignment with the y-axis). As illustrated in Figure 4, we consider a set of sample vectors: $v_1 = [0.9, 0.5]$, $v_2 = [0.3, 0.8]$, $v_3 = [0.1, 0.7]$, and $v_4 = [0.6, 0.5]$. As $p$ increases from 1 to 4, these vectors exhibit a progressive orthogonalization trend toward the principal coordinate axes.
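The polarization effect can be reproduced on the four sample vectors. The sketch below evaluates Eq. (2) as written and reports the direction angle for $p = 1, \dots, 4$; the helper names `positioning` and `angle_deg` are illustrative.

```python
import numpy as np


def positioning(x, p):
    """Eq. (2): sign(x) * |x|**p / ||x||**(1 - 1/p), with |x|**p taken element-wise."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x)
    return np.sign(x) * np.abs(x) ** p / norm ** (1.0 - 1.0 / p)


def angle_deg(v):
    """Direction angle of a 2D vector in degrees."""
    return float(np.degrees(np.arctan2(v[1], v[0])))


samples = {"v1": [0.9, 0.5], "v2": [0.3, 0.8], "v3": [0.1, 0.7], "v4": [0.6, 0.5]}
angles = {
    name: [angle_deg(positioning(v, p)) for p in (1, 2, 3, 4)]
    for name, v in samples.items()
}
for name, series in angles.items():
    print(name, np.round(series, 1))
```

The vectors $v_1$ and $v_4$, whose first component dominates, rotate toward $0^{\circ}$, while $v_2$ and $v_3$ rotate toward $90^{\circ}$. The norm-dependent denominator rescales the vector uniformly and therefore does not affect the angle.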

Finally, we compute the interaction between pixel-level $Q$ and shape-adaptive object-level $K$ in order to broadcast semantic information back to the pixel space. Specifically, we compute the attention map $A = \phi(Q) \cdot K_S^{T} \in \mathbb{R}^{HW \times N_S}$, which explicitly matches the dimensions illustrated in Figure 3. The final output $F_{STLPA} = A \cdot V_S \in \mathbb{R}^{HW \times C}$ is then produced. The feature map ensures boundary consistency by aligning pixel features with object semantics. The reconstructed feature map is computed as follows:

$$F_{STLPA} = A \cdot V_S = \phi(Q) \cdot K_S^{T} \cdot V_S, \tag{5}$$

where $\cdot$ denotes matrix multiplication. By substituting the global pixel-to-pixel interaction with our superpixel-to-pixel interaction, our proposed STLPA module achieves a complexity of $O(C \cdot N_S \cdot H \cdot W)$, which is linear in the number of pixels $H \cdot W$ because $C, N_S \ll H \cdot W$, allowing for efficient processing of high-resolution remote sensing images.
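The shapes in Eqs. (1) and (5) can be traced with a small NumPy sketch. Several simplifications are assumed here and are not part of the described method: a regular block grid stands in for the SLIC partition, $\psi$ is replaced by the identity, no learned projections are applied (so $K_S = V_S$ equals the pooled tokens), and $p = 2$ is an arbitrary illustrative value.

```python
import numpy as np


def positioning(x, p):
    """Eq. (2), applied row-wise to a matrix of query vectors."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return np.sign(x) * np.abs(x) ** p / np.maximum(norm, 1e-12) ** (1.0 - 1.0 / p)


def superpixel_pool(feat, labels, n_regions):
    """Eq. (1) with psi = identity (assumption): mean feature per region."""
    channels = feat.shape[-1]
    flat = feat.reshape(-1, channels)
    idx = labels.reshape(-1)
    sums = np.zeros((n_regions, channels))
    np.add.at(sums, idx, flat)  # accumulate pixel features per region
    counts = np.bincount(idx, minlength=n_regions).reshape(-1, 1)
    return sums / np.maximum(counts, 1)


def stlpa_attention(feat, labels, n_regions, p=2.0):
    """Eq. (5): F_STLPA = phi(Q) @ K_S^T @ V_S, without learned projections."""
    height, width, channels = feat.shape
    q = feat.reshape(height * width, channels)   # pixel-level queries
    tokens = superpixel_pool(feat, labels, n_regions)
    k_s, v_s = tokens, tokens                    # assumption: shared tokens
    attn = positioning(q, p) @ k_s.T             # (HW, N_S)
    return (attn @ v_s).reshape(height, width, channels), attn


rng = np.random.default_rng(0)
H, W, C, N_S = 8, 8, 4, 4
feat = rng.standard_normal((H, W, C))
rows, cols = np.indices((H, W))
labels = (rows // 4) * 2 + (cols // 4)           # 4 square regions, not SLIC
out, attn = stlpa_attention(feat, labels, N_S)
print(out.shape, attn.shape)  # (8, 8, 4) (64, 4)
```

The attention map has $HW \times N_S$ entries rather than $HW \times HW$, which is where the saving over pixel-to-pixel attention comes from.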

3.3. Frequency-Modulated Deformable Edge Refinement Module

Shallow encoder features contain rich spatial details that are critical for boundary delineation. However, these features are often contaminated by complex background texture noise, making it difficult for standard spatial convolutions to distinguish true semantic edges. To address this problem, we construct a Frequency-Modulated Deformable Edge Refinement (FMDER) module. Instead of enhancing features with frequency cues, our proposed FMDER module reformulates deformable convolution as a frequency-modulated geometric sampling process. Specifically, FFT is employed to extract high-frequency spectral components that serve as explicit edge priors, which are utilized to modulate the offset prediction of deformable convolution. In other words, frequency information controls where the convolution samples rather than merely reweighting feature responses. Consequently, frequency-modulated offset learning steers the sampling locations toward physically meaningful boundaries and suppresses distractions from texture noise. The overall structure of FMDER is shown in Figure 5.

First, we construct a spectral edge branch to explicitly extract high-frequency boundary information via Fourier analysis. Given the input feature map $F_{skip}$ from the skip connection, we transform it into the frequency domain using 2D FFT. Since edges correspond to high-frequency signals while smooth backgrounds correspond to low-frequency signals, we apply a fixed radial high-pass filter mask $M_{high}$ to isolate the boundary components. Compared with a learnable mask, the fixed $M_{high}$ preserves an explicit boundary-related spectral prior and avoids unnecessary learnable parameters. The process is formulated as

$$M_{high}(u, v) = \begin{cases} 0, & \sqrt{u^{2} + v^{2}} \leq r, \\ 1, & \text{otherwise}, \end{cases} \tag{6}$$

$$F_{freq} = \mathcal{F}^{-1}\left(\mathcal{F}(F_{skip}) \odot M_{high}\right), \tag{7}$$

where the cutoff frequency is defined as $r = k \cdot \min(H, W)$ with $k = 0.15$, $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the FFT and inverse FFT, respectively, and $\odot$ represents element-wise multiplication. The FFT is performed independently for each channel to preserve channel-specific semantics. In addition, reflection padding is performed before the transform to eliminate wraparound convolution artifacts introduced by FFT. Subsequently, the features are cropped back to their original size following the inverse transform. The resulting $F_{freq}$ contains pure high-frequency structural information free from low-frequency background interference, serving as a robust prior for subsequent edge localization.
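Eqs. (6) and (7) can be sketched with NumPy's FFT routines. The following assumes that $(u, v)$ are integer frequency indices measured from the centre of the shifted spectrum, that the cutoff $r$ is computed from the padded size, and that the padding width of 8 pixels is an arbitrary illustrative choice; none of these details are specified in the text.

```python
import numpy as np


def radial_highpass_mask(height, width, k=0.15):
    """Eq. (6): 0 inside radius r = k * min(H, W), 1 elsewhere (centred spectrum)."""
    r = k * min(height, width)
    u = np.arange(height) - height // 2   # zero frequency sits at index H // 2
    v = np.arange(width) - width // 2
    dist = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)
    return (dist > r).astype(float)


def highpass_features(f_skip, k=0.15, pad=8):
    """Eq. (7) per channel, with reflection padding and cropping. f_skip: (C, H, W)."""
    _, height, width = f_skip.shape
    padded = np.pad(f_skip, ((0, 0), (pad, pad), (pad, pad)), mode="reflect")
    mask = radial_highpass_mask(padded.shape[1], padded.shape[2], k)
    spectrum = np.fft.fftshift(np.fft.fft2(padded, axes=(-2, -1)), axes=(-2, -1))
    filtered = np.fft.ifft2(
        np.fft.ifftshift(spectrum * mask, axes=(-2, -1)), axes=(-2, -1)
    ).real
    return filtered[:, pad:pad + height, pad:pad + width]  # crop to original size


constant = np.ones((2, 32, 32))
print(np.abs(highpass_features(constant)).max())  # close to zero
```

A constant map carries only the zero-frequency component, so the filter removes it entirely, whereas a pattern at the highest representable frequency (for example a checkerboard) passes through unchanged.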

Second, we propose a frequency-modulated deformable sampling strategy to explicitly guide the geometric alignment of edges. Unlike standard Deformable Convolution (DCN), where offsets are predicted solely from spatial features, we introduce spectral edge priors to directly regulate the sampling geometry. Specifically, the extracted spectral edge map $F_{freq}$ serves as a physical guidance signal that modulates the offset generation process, steering the deformable kernels toward boundary-consistent locations. Instead of enriching feature representations, frequency information explicitly influences where the convolution samples. Therefore, the spatial features $F_{skip}$ and spectral cues $F_{freq}$ are integrated to predict the sampling offsets $\Delta p$ and modulation scalars $\Delta m$ for deformable convolution:

$$\Delta p, \Delta m = \mathrm{Conv}_{offset}\left([F_{skip}, F_{freq}]\right), \tag{8}$$

$$F_{deform} = \mathrm{DCNv2}\left(F_{skip}; \Delta p, \Delta m\right), \tag{9}$$

where $\Delta p$ is responsible for dynamically adjusting the convolutional sampling positions at the geometric level, allowing the convolutional kernel to adaptively move along true semantic boundaries, while $\Delta m$ assigns different importance weights to different sampling points at the semantic level, further suppressing interference from texture noise and low-frequency background regions.
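The roles of $\Delta p$ and $\Delta m$ in Eq. (9) can be illustrated with a deliberately simple, single-channel 3 × 3 modulated deformable convolution in NumPy. This is only a sketch of the generic DCNv2 sampling rule, $y(p) = \sum_n w_n \cdot \Delta m_n \cdot x(p + p_n + \Delta p_n)$ with bilinear interpolation and zero padding. The offset predictor of Eq. (8) is omitted, so the offsets and scalars are supplied by hand, and all function names are illustrative.

```python
import numpy as np


def bilinear_sample(img, y, x):
    """Bilinearly sample a 2D array at a fractional location (zero outside)."""
    height, width = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < height and 0 <= xx < width:
                value += wy * wx * img[yy, xx]
    return value


def modulated_deform_conv3x3(img, weight, offsets, scalars):
    """img: (H, W); weight: (3, 3); offsets: (H, W, 9, 2) as (dy, dx); scalars: (H, W, 9)."""
    height, width = img.shape
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular 3x3 taps
    out = np.zeros((height, width))
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for n, (dy, dx) in enumerate(grid):
                oy, ox = offsets[i, j, n]        # learned shift of tap n
                sample = bilinear_sample(img, i + dy + oy, j + dx + ox)
                acc += weight[dy + 1, dx + 1] * scalars[i, j, n] * sample
            out[i, j] = acc
    return out


img = np.arange(25, dtype=float).reshape(5, 5)
centre_only = np.zeros((3, 3))
centre_only[1, 1] = 1.0
shift = np.zeros((5, 5, 9, 2))
shift[..., 1] = 0.5                              # sample half a pixel to the right
out = modulated_deform_conv3x3(img, centre_only, shift, np.ones((5, 5, 9)))
print(out[2, 1])  # 11.5
```

With zero offsets and unit scalars the operator reduces to an ordinary zero-padded 3 × 3 convolution. Non-zero offsets move the taps, and scalars below one attenuate individual taps.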

Finally, the geometrically aligned edge features are fused with semantic streams through a dual-domain gating mechanism. To further refine the features, we employ a gated fusion strategy. The aligned feature $F_{deform}$ passes through a channel-spatial attention layer to generate a refinement weight map. The final output $F_{FMDER}$ is obtained by residual connection, ensuring that the high-frequency details are seamlessly integrated into the decoding path without disrupting the semantic consistency:

$$F_{FMDER} = F_{skip} + \sigma\left(G(F_{deform})\right) \cdot F_{deform}, \tag{10}$$

where $\sigma$ is the sigmoid function and $G(\cdot)$ represents a lightweight spatial-channel aggregation network.

3.4. Spatial–Semantic Feature Coupling Module

The combination of features from different layers enables the model to recover fine-grained spatial details that are often lost during progressive downsampling. However, the accuracy of remote sensing image segmentation often suffers from spatial misalignment between these heterogeneous feature streams. Specifically, deep encoder features provide rich semantic context at the cost of spatial precision, whereas high-resolution decoder features possess precise geometric information but lack sufficient semantic awareness. Direct concatenation or element-wise addition implicitly assumes spatial alignment, which inevitably leads to semantic ghosting and blurred boundaries. To address this issue, we develop the Spatial–Semantic Feature Coupling (SSFC) module. The proposed SSFC module treats deep semantic features as guidance to rectify and select effective spatial details from the shallow layers, which ensures precise pixel-level coupling. The overall structure of SSFC is shown in Figure 6.

First, we construct a semantic stream branch to generate a spatial alignment map from deep features. The upsampled deep features $F_{deep}$ encapsulate rich categorical information but possess relatively coarse positional details. Therefore, we apply a $1 \times 1$ convolution followed by a softmax operation to generate a semantic probability map $M_{sem}$. Here, $M_{sem}$ serves as a soft mask, accentuating regions with high semantic confidence while suppressing background noise. The $1 \times 1$ convolution maps deep features to class-specific logits, establishing an explicit one-to-one correspondence between channels and semantic categories. Unlike standard spatial attention mechanisms, $M_{sem}$ is explicitly supervised by the semantic stream of the decoder, ensuring a consistent focus on foreground objects:

$M_{sem} = Softmax ({Conv}_{1 \times 1} (F_{deep}))$

(11)

where softmax is applied along the channel dimension at each spatial location to generate a pixel-wise class probability distribution. The deep features are then reweighted by this mask:

$F_{deep}^{aligned} = F_{deep} \odot M_{sem},$

(12)

where $\odot$ denotes element-wise multiplication. This step aligns the deep feature distribution with the most salient semantic regions. Compared with sigmoid gating, softmax enforces inter-class competition and produces exclusive probabilities that match the one-hot supervision of semantic segmentation.
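Eqs. (11) and (12) can be sketched in NumPy as a per-pixel linear map over channels followed by a channel-wise softmax and an element-wise product. The text does not state how a class-indexed probability map is matched to the channel count of $F_{deep}$; the sketch simply assumes the $1 \times 1$ convolution outputs as many channels as $F_{deep}$ has, so the product is shape-compatible. The weights are random placeholders and the function names are illustrative.

```python
import numpy as np

def softmax_channels(logits):
    """Softmax over axis 0 (channels) at every spatial location."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)

def semantic_gate(f_deep, weight, bias):
    """Eqs. (11)-(12) for one feature map of shape (C, H, W).

    weight: (K, C) matrix of the 1x1 convolution, bias: (K,).
    Assumption: K == C, so that F_deep * M_sem is well defined.
    """
    # A 1x1 convolution is a linear map over channels applied at each pixel.
    logits = np.einsum("kc,chw->khw", weight, f_deep) + bias[:, None, None]
    m_sem = softmax_channels(logits)   # Eq. (11)
    f_aligned = f_deep * m_sem         # Eq. (12)
    return m_sem, f_aligned

rng = np.random.default_rng(0)
C, H, W = 6, 4, 4
f_deep = rng.standard_normal((C, H, W))
weight = rng.standard_normal((C, C))   # hypothetical, randomly initialised
bias = np.zeros(C)
m_sem, f_aligned = semantic_gate(f_deep, weight, bias)
```

At every pixel the entries of `m_sem` sum to one, which is the inter-class competition that distinguishes softmax gating from independent sigmoid gates.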

Second, we design a detail selection branch to filter shallow features based on channel saliency. To prevent irrelevant background textures from interfering with the fusion, we employ Global Average Pooling (GAP) to aggregate global context information and generate a channel-level selection vector $W_{chn}$ via a Multi-Layer Perceptron (MLP). Here, $W_{chn}$ dynamically re-weights the shallow channels to enhance feature maps that contain boundary information relevant to the current semantic category:

$W_{chn} = \sigma (MLP (GAP (F_{shallow})))$

(13)

$F_{shallow}^{selected} = F_{shallow} \otimes W_{chn}$

(14)

where $\sigma$ is the sigmoid function and $\otimes$ denotes channel-wise multiplication.
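The channel selection of Eqs. (13) and (14) can be illustrated with a short NumPy sketch. The MLP layout is not specified in the text, so a two-layer bottleneck with a ReLU in between is assumed here; the reduced width `C_r` and the random weights are hypothetical.

```python
import numpy as np

def channel_select(f_shallow, w1, b1, w2, b2):
    """Eqs. (13)-(14) for one feature map of shape (C, H, W).

    Assumed MLP: Linear(C -> C_r), ReLU, Linear(C_r -> C).
    """
    gap = f_shallow.mean(axis=(1, 2))                   # GAP, shape (C,)
    hidden = np.maximum(w1 @ gap + b1, 0.0)             # ReLU
    w_chn = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))   # sigmoid, Eq. (13)
    f_selected = f_shallow * w_chn[:, None, None]       # Eq. (14)
    return w_chn, f_selected

rng = np.random.default_rng(0)
C, C_r = 8, 2                                           # C_r is hypothetical
f_shallow = rng.standard_normal((C, 5, 5))
w1, b1 = rng.standard_normal((C_r, C)), np.zeros(C_r)
w2, b2 = rng.standard_normal((C, C_r)), np.zeros(C)
w_chn, f_selected = channel_select(f_shallow, w1, b1, w2, b2)
```

Each channel is scaled by a single scalar in (0, 1), so the branch changes the relative importance of channels without altering the spatial layout within a channel.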

Finally, the spatially aligned semantic features and the channel-selected detail features are tightly coupled. To bridge the semantic gap, we concatenate the calibrated features from both branches and fuse them through a lightweight depthwise separable convolution (DWConv). The convolution operation allows the network to learn the coupling flow between geometry and semantics, generating the final output $F_{out}$ with both sharp edges and consistent semantics:

$F_{out} = {DWConv}_{3 \times 3} (Concat ([F_{deep}^{aligned}, F_{shallow}^{selected}])).$

(15)
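The fusion step in Eq. (15) can be sketched as a channel concatenation followed by a depthwise $3 \times 3$ convolution and a pointwise $1 \times 1$ convolution, which is the usual meaning of a depthwise separable convolution. The zero padding, the absence of normalization and activation layers, and the channel sizes are assumptions of this sketch, not details given in the text.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Per-channel 3x3 cross-correlation with zero padding.

    x: float array (C, H, W); kernels: (C, 3, 3). Output shape is (C, H, W).
    """
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            tap = kernels[:, dy, dx][:, None, None]
            out += tap * padded[:, dy:dy + h, dx:dx + w]
    return out

def pointwise_conv(x, weight):
    """1x1 convolution mixing channels; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def ssfc_fuse(f_deep_aligned, f_shallow_selected, dw_kernels, pw_weight):
    """Eq. (15): concatenate along channels, then depthwise + pointwise conv."""
    stacked = np.concatenate([f_deep_aligned, f_shallow_selected], axis=0)
    return pointwise_conv(depthwise_conv3x3(stacked, dw_kernels), pw_weight)

rng = np.random.default_rng(0)
f_a = rng.standard_normal((3, 5, 5))    # stands in for F_deep^aligned
f_s = rng.standard_normal((2, 5, 5))    # stands in for F_shallow^selected
dw = rng.standard_normal((5, 3, 3))     # hypothetical depthwise kernels
pw = rng.standard_normal((4, 5))        # hypothetical pointwise weights
f_out = ssfc_fuse(f_a, f_s, dw, pw)     # shape (4, 5, 5)
```

Splitting the convolution into a spatial (depthwise) and a channel-mixing (pointwise) part keeps the parameter count low, which is consistent with the module being described as lightweight.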

4. Experiment

To comprehensively evaluate the performance of SFCT-Net, we conducted extensive experiments on our self-constructed TSRSD and two widely used public benchmark datasets: ISPRS Vaihingen and Potsdam [38,39]. First, Section 4.1 introduces the characteristics and preprocessing of the experimental datasets. Second, Section 4.2 describes the experimental settings, including hardware configurations and training strategies. Third, Section 4.3 defines the quantitative evaluation metrics used to measure segmentation accuracy and computational efficiency. Fourth, Section 4.4 presents ablation studies that verify the rationale and effectiveness of the proposed modules. Finally, Section 4.5 presents quantitative and qualitative comparisons against SOTA methods to demonstrate the advantages of the proposed SFCT-Net in capturing prior information and resolving spatial misalignment.

4.1. Datasets

4.1.1. Taiyuan Satellite Remote Sensing Dataset

Existing public datasets often lack the complexity characteristic of high-density Asian urban landscapes. To bridge this gap and verify the robustness of SFCT-Net in complex scenes, we constructed the Taiyuan Satellite Remote Sensing Dataset (TSRSD), as shown in Figure 7.

Taiyuan City is situated at the northern tip of the Taiyuan Basin in central Shanxi Province. Surrounded by mountains to the east, west, and north while opening southward into a valley plain, the region exhibits a typical topographic gradient that decreases from north to south. This configuration forms a distinctive “dustpan-shaped” landform with a maximum elevation reaching 2659 m. The urban area of Taiyuan, covering a total of 1417 km², is characterized by rich geographical features. It encompasses complex geo-object structures, diverse surface materials, and a high density of buildings, presenting significant challenges such as high intra-class heterogeneity and intricate spatial structures.

The source imagery covers six distinct districts in Taiyuan, Shanxi Province (Xiaodian, Yingze, Xinghualing, Jiancaoping, Wanbailin, and Jinyuan), spanning an area of 56,251 × 52,654 pixels with a spatial resolution of 1 m. The total data volume reaches 12.7 GB. Unlike standardized ISPRS datasets, TSRSD presents challenges unique to developing urban regions, specifically high intra-class variance (e.g., “Building” encompasses both ancient architectural structures and modern skyscrapers), high inter-class similarity (e.g., the spectral characteristics of “Road” and concrete “Hardened Surface” are extremely similar), and complex shadows (high-rise buildings cast extensive shadow areas, occluding ground objects).

We performed rigorous preprocessing on the dataset, including orthorectification, geometric accuracy inspection, image fusion, band reorganization, color enhancement, and image mosaicking. The dataset was manually annotated pixel-by-pixel into nine categories: Farmland, Forest, Grass, Water, Building, Hardened Surface, Excavated Land, Road, and Background. In the experimental phase, we cropped the TSRSD data into non-overlapping patches of 512 × 512 pixels, resulting in 5285 training samples and 1322 testing samples.

4.1.2. ISPRS Vaihingen and Potsdam

The ISPRS 2D Semantic Labeling Challenge datasets provide high-resolution aerial imagery for urban scene understanding.

The Vaihingen dataset consists of 33 True Orthophoto (TOP) tiles with an average size of 2494 × 2064 pixels, containing Near-Infrared (NIR), Red (R), and Green (G) bands with a spatial resolution of 9 cm. This dataset is annotated with six land cover types: Impervious Surface, Building, Low Vegetation, Tree, Car, and Clutter.

The Potsdam dataset contains 38 TOP tiles of 6000 × 6000 pixels, consisting of Red (R), Green (G), and Blue (B) bands with a spatial resolution of 5 cm. It shares the same category definitions as the Vaihingen dataset but presents finer urban textures.

In the experimental phase, we cropped the Vaihingen and Potsdam data into 256 × 256 patches using a sliding window approach and split them into training and testing sets at a ratio of 8:2.
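The cropping and splitting procedure can be sketched in plain Python as follows. The sliding-window stride and the shuffling seed are not reported, so the values below are hypothetical, and dropping partial tiles at the image border is an assumption of this sketch; setting the stride equal to the patch size gives a non-overlapping crop.

```python
import random

def tile_origins(height, width, patch, stride):
    """Top-left (row, col) corners of all patches that fit inside the image.

    Partial tiles at the bottom/right border are dropped (an assumption).
    """
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]

def split_patches(items, train_ratio=0.8, seed=0):
    """Shuffle with a fixed seed and split into train/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# One 6000 x 6000 tile cropped into 256 x 256 patches.
# stride=256 (no overlap) is a hypothetical choice.
origins = tile_origins(6000, 6000, patch=256, stride=256)
train, test = split_patches(origins)   # 8:2 split
```

With these settings a single 6000 × 6000 tile yields 23 × 23 = 529 patches, of which 423 go to training and 106 to testing.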

To better understand the dataset characteristics and guide the model design and evaluation, we analyze the pixel-level class distributions across the experimental datasets, as visualized in the radar charts in Figure 8.

4.2. Experimental Details

The experiments used the PyTorch 1.8.1 deep learning framework, the 64-bit Ubuntu 20.04 operating system, CUDA version 10.1, and Python 3.7. We used a single 11 GB NVIDIA 2080Ti GPU for training and evaluation. Furthermore, we used transfer learning to transfer the weights of a ResNet-18 network pretrained on the ImageNet dataset to accelerate convergence and improve model performance. The networks were trained using the AdamW optimizer with an initial learning rate of $6 \times 10^{-4}$ and a weight decay of 0.01. The learning rate was adjusted using a cosine annealing schedule (CosineAnnealingLR) with a minimum learning rate of $1 \times 10^{-6}$. The batch size was set to 16 and the training process spanned 100 epochs.
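For reference, the closed form of a cosine annealing schedule without restarts is $\eta_{t} = \eta_{min} + \frac{1}{2} (\eta_{max} - \eta_{min}) (1 + \cos (\pi t / T_{max}))$. The sketch below evaluates it with the reported learning rates. The text does not state the annealing period, so $T_{max} = 100$ (one step per epoch over the full run) is an assumption.

```python
import math

def cosine_annealing_lr(epoch, base_lr=6e-4, min_lr=1e-6, t_max=100):
    """Closed-form cosine annealing; t_max=100 epochs is an assumed period."""
    cosine = 1.0 + math.cos(math.pi * epoch / t_max)
    return min_lr + 0.5 * (base_lr - min_lr) * cosine

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_annealing_lr(e) for e in (0, 50, 100)]
```

Under this assumption the rate starts at $6 \times 10^{-4}$, passes through the midpoint value $3.005 \times 10^{-4}$ at epoch 50, and reaches $1 \times 10^{-6}$ at epoch 100.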

4.3. Evaluation Metrics

We used the mean intersection over union (mIoU), overall accuracy (OA), and F1-score as accuracy metrics for segmentation. Furthermore, we used the number of parameters (Params) to quantify the model size.

Mean intersection over union (mIoU) represents the mean of the intersection over union (IoU) ratios between the predicted results for each class and the true labels. It is calculated as follows:

$mIoU = \frac{1}{k} \sum_{i = 1}^{k} \frac{{TP}_{i}}{{TP}_{i} + {FP}_{i} + {FN}_{i}}$

(16)

where ${TP}_{i}$, ${FP}_{i}$, and ${FN}_{i}$ represent the true positives, false positives, and false negatives for class i, respectively.

Overall accuracy (OA) represents the percentage of correctly segmented pixels among all pixels. It is calculated as follows:

$OA = \frac{TP + TN}{TP + TN + FP + FN}$

(17)

where TN represents true negatives.

F1-score is the harmonic mean of precision and recall, providing a balanced evaluation metric for handling class imbalance. It is calculated as follows:

$precision = \frac{TP}{TP + FP},$

(18)

$recall = \frac{TP}{TP + FN},$

(19)

$F1\text{-}score = 2 \cdot \frac{precision \cdot recall}{precision + recall}.$

(20)
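The three metrics can be computed from a single confusion matrix, as in the plain-Python sketch below. Two conventions are assumed because the text does not state them: OA is taken as the diagonal sum divided by the total pixel count (the multi-class form of Eq. (17)), and the reported F1-score is taken as the unweighted mean of per-class F1 values.

```python
def segmentation_metrics(conf):
    """conf[i][j] = number of pixels of true class i predicted as class j."""
    k = len(conf)
    total = sum(sum(row) for row in conf)
    ious, f1s = [], []
    for i in range(k):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                          # class i pixels missed
        fp = sum(conf[r][i] for r in range(k)) - tp     # wrongly labelled as i
        ious.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1s.append(2 * precision * recall / denom if denom else 0.0)
    oa = sum(conf[i][i] for i in range(k)) / total
    return {"mIoU": sum(ious) / k, "OA": oa, "F1": sum(f1s) / k}

# Toy two-class example with 20 pixels.
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

In the toy example the per-class IoU values are 8/11 and 9/12, and OA is 17/20 = 0.85. OA counts every pixel equally whereas mIoU averages over classes, so a small class with poor IoU lowers mIoU far more than it lowers OA.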

4.4. Ablation Experiments

To verify the effectiveness of the proposed prior guidance mechanisms, we conducted detailed ablation experiments on the Potsdam dataset. Since the core innovation of the proposed SFCT-Net lies in the integration of explicit domain knowledge, we focused the analysis on the proposed STLPA and FMDER modules to investigate how different modeling strategies for these priors affect performance. Additionally, the role of the proposed SSFC module as an efficient coupling head for feature alignment was validated within the overall architecture ablation.

4.4.1. Impact of Individual Contributions

To evaluate the contribution of each core component in SFCT-Net, we employed a ResNet-18 encoder paired with a naive decoder consisting of upsampling and convolution operations as the baseline, then progressively incorporated the proposed STLPA, FMDER, and SSFC modules. The experimental results are presented in Table 1.

As observed in Table 1, the complete SFCT-Net achieved the best performance with an mIoU of 86.60%, confirming that the three proposed modules act synergistically rather than as a simple functional stack. First, the standalone integration of the STLPA module yielded the most significant performance gain. This substantial improvement indicates that resolving the object fragmentation caused by rigid grid partitioning is the key to enhancing accuracy during the decoding phase of HRRSI. Specifically, the gain stems directly from the ability of the STLPA module to maintain the geometric integrity of irregular objects within the deep decoder layers. Second, the FMDER module provided a modest improvement by locking onto high-frequency physical boundaries in shallow layers. However, the efficacy of the FMDER module in complex scenes was constrained without the global geometric priors provided by the STLPA module, verifying the necessity of introducing geometric inductive bias into the decoder. Third, the addition of the SSFC module yielded stable performance gains with a negligible increase in parameters. Specifically, the spatial–semantic feature coupling mechanism facilitated the interaction between heterogeneous feature streams, enhancing the overall fusion efficiency. Finally, compared with the “Baseline + STLPA” configuration, the complete model achieved a substantial improvement in OA alongside a marginal gain in mIoU. Rather than signifying a tradeoff between category scales, this disparity primarily arises from the superior boundary localization and enhanced intra-class consistency facilitated by the proposed FMDER and SSFC modules. Consequently, the gap between the OA and mIoU improvements reflects the model’s efficacy in geometric alignment and boundary rectification rather than an underlying category imbalance.

The integration of the three proposed modules demonstrates that the combined use of geometric and physical priors provides a more robust representation than any individual module for semantic segmentation in remote sensing.

4.4.2. Impact of Superpixel Tokenization

To further investigate how STLPA optimizes feature modeling by breaking the constraints of rigid grid partitioning, we first compared the proposed STLPA module against grid-based attention mechanisms, then analyzed the impact of the p value in the positioning function on directional sensitivity.

Comparison with Grid-Based Attention.

We replaced the STLPA block in the decoder with current mainstream window-based transformer blocks, including a ViT block [11], Swin block [12], and CSwin block [13]. Notably, we also evaluated the proposed STLPA module using both the original $S = 16$ setting and the proposed adaptive $S_{l} = min (\sqrt{L_{l} \cdot d_{min}}, 16)$ strategy to further justify the necessity of scale-aware superpixel tokenization. The segmentation results are shown in Table 2.

As observed in Table 2, our method outperformed fixed window-based mechanisms with a lower parameter count, which is primarily due to the superpixel prior introduced by the STLPA module. The comparative methods impose rigid rectangular partitioning on images, so the semantic continuity of irregular geospatial objects is inevitably severed. In contrast, our proposed STLPA module utilizes superpixel clustering to aggregate pixels into object tokens with explicit semantic boundaries. This object-wise reconstruction method aligns more consistently with the physical attributes of geographical entities than window-based methods in the deep decoding stage.

Comparison of Different Values of p.

The proposed STLPA module introduces a focusing factor p to reconstruct the feature directions within the linear attention mechanism. The impact of different values of p on the segmentation results is presented in Table 3.

Table 3 indicates that the model’s performance peaked at $p = 3$. Performance was lowest at $p = 1$ because the positioning function degenerated into a linear projection lacking directional sensitivity. Increasing p to 3 enhanced the nonlinearity and orthogonality of feature vectors, which enabled the network to better distinguish spatially adjacent but semantically distinct objects. Conversely, an excessively high p (e.g., 5) led to over-orthogonalization, which disrupted intra-class correlations and degraded performance.

4.4.3. Impact of Frequency Modulation

The proposed FMDER module injects physical edge priors via frequency-domain analysis. We investigated the impact of placing the FMDER module at different network depths (skip connection levels) on performance. The results are presented in Table 4.

As observed in Table 4, deploying FMDER in the shallow layers (e.g., layers 1 and 2) yielded optimal performance; conversely, introducing the module into the deep layers (e.g., layers 3 and 4) led to accuracy degradation. Specifically, shallow features are characterized by high resolution and preserve the original image’s physical structure. At shallow stages, the high-frequency components extracted via FFT correspond precisely to genuine physical edges, effectively guiding the sampling of deformable convolutions. However, deep features undergo repeated downsampling, which shifts their spatial response centers toward abstract semantics rather than physical details. Consequently, the forced injection of pixel position-based high-frequency physical priors into deep layers resulted in semantic misalignment: the introduced edge information failed to spatially align with the current semantic features and instead became interfering noise. Notably, the performance gain mainly manifested as improved OA rather than large mIoU increments, because the proposed FMDER module enhanced boundary-aware sampling and reduced misclassification near object borders, which led to better pixel-level consistency across large regions.
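To illustrate why FFT-derived high-frequency components line up with physical edges at full resolution, the sketch below applies an ideal circular high-pass mask in the Fourier domain with NumPy. The mask shape and the cut-off `radius` are assumptions made for illustration; the filter actually used inside FMDER may differ.

```python
import numpy as np

def high_pass_fft(image, radius):
    """Zero all frequencies within `radius` of the spectrum centre.

    image: 2-D float array. Returns the real part of the filtered image.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))    # DC moved to (h//2, w//2)
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    spectrum[dist <= radius] = 0                      # remove low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

# Toy image: dark left half, bright right half (vertical edge at column 8).
step = np.zeros((16, 16))
step[:, 8:] = 1.0
edges = high_pass_fft(step, radius=2)
```

In the toy image the filtered response is stronger next to the step (column 8) than in the flat interior (column 4). After repeated downsampling such a step no longer maps to a single pixel position, which is one way to read the degradation observed when the prior is injected into deep layers.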

4.5. Comparative Experiment

To evaluate the performance of the proposed SFCT-Net, we conducted a comprehensive comparison against a series of existing SOTA methods. The experiments were performed on our self-constructed TSRSD, which is characterized by challenging complex urban scenes, as well as on the ISPRS Potsdam and Vaihingen benchmark datasets.

4.5.1. Comparative Methods

We selected representative methods covering five primary paradigms: CNN-based architectures serving as the cornerstone for local feature extraction, including UNet [9] and DeepLabv3+ [10]; transformer-based architectures capable of global context modeling, such as Swin Transformer [12] and CSwin Transformer [13]; hybrid architectures combining the strengths of both, represented by BANet [30] and TransUNet [14]; superpixel-guided methods utilizing geometric priors to constrain boundaries, such as SDNF [18] and ConvNeXt with Context-Weighted Deep Superpixels [19]; and frequency-domain and attention-guided methods leveraging physical or spectral features to enhance representation, including MsanlfNet [22] and SFFNet [23].

4.5.2. Comparison to SOTA Methods

We present a quantitative and qualitative comparative analysis of the segmentation results obtained by SFCT-Net and current mainstream SOTA methods on the TSRSD, ISPRS Vaihingen, and Potsdam datasets.

Quantitative Analysis.

The quantitative comparison results are reported in Table 5 and Table 6. To ensure stability and reliability, we conducted three independent runs with different random seeds and report the average results. Overall, the proposed SFCT-Net achieves SOTA performance across all three benchmarks while maintaining a favorable balance between segmentation accuracy and computational efficiency.

As shown in Table 5 and Table 6, the proposed SFCT-Net demonstrates superior performance in categories characterized by strong geometric features or irregular boundaries, such as “Building”, “Car”, and “Water”. Compared with transformer architectures (e.g., Swin [12] and CSwin [13]), the advantage of the proposed SFCT-Net stems from its preservation of geometric integrity. Specifically, the rigid window-based partitioning relied upon by Swin Transformer often severs the semantic continuity of irregular objects, leading to fragmented predictions at boundaries. Conversely, our proposed STLPA module utilizes superpixels as object tokens to adaptively aggregate pixels sharing the same semantics, which prevents small objects such as those in the “Car” category from being overwhelmed by background noise.

Moreover, the CNN-based methods (e.g., DeepLabv3+ [10]) rely primarily on spatial convolutions for feature extraction, which can confuse building edges with intricate rooftop textures in complex urban scenes. In contrast, the proposed FMDER module enables the network to locate genuine physical boundaries within texture-dense regions, achieving more precise edge segmentation for the “Building” and “Road” categories. Furthermore, most comparative methods (e.g., BANet [30]) rely on simple concatenation or addition to fuse features, which neglects the spatial drift caused by repeated downsampling. By contrast, our proposed SSFC module utilizes the deep stream to actively rectify shallow details, which improves the recognition rate of fine-grained objects through dynamic feature alignment.

In addition to its accuracy gains, the proposed SFCT-Net is parameter-efficient. Unlike Swin [12] and MsanlfNet [22], which rely on large parameter counts to reach their performance, SFCT-Net achieves a superior mIoU with only 13.70 M parameters.
Qualitative Analysis.

The comparative qualitative results are illustrated in Figure 9 and Figure 10. Overall, our proposed SFCT-Net segments both datasets well, addressing the challenges of intra-class variance in the self-constructed data and resolving fine-grained urban textures in the public benchmarks.

As visualized in Figure 9 and Figure 10, our proposed SFCT-Net exhibits strong geometric preservation and fine-grained recognition across varying scenes. On TSRSD (Figure 9), the proposed SFCT-Net maintains excellent continuity for winding roads and viaducts, effectively mitigating common disconnection issues. Even within interlaced natural scenes, narrow paths remain clearly discernible. Meanwhile, on the ISPRS benchmarks (Figure 10), our proposed SFCT-Net eliminates the sawtooth effect on large-scale objects in the “Building” class while preserving the morphological integrity of tiny objects in the “Car” class. In alignment with these visual improvements, the proposed modules produce cleaner object interiors and sharper boundaries. The overall accuracy also increases owing to the correction of numerous boundary pixels, albeit with a more moderate effect on the category-averaged mIoU. In complex transition zones involving the “Water” and “Tree” classes, SFCT-Net achieves high semantic consistency with minimal category confusion. Nevertheless, a few challenging cases still present minor limitations. On TSRSD (Figure 9), several thin linear structures in the first two rows show slight fragmentation after downsampling. On the ISPRS benchmarks (Figure 10), slight confusion appears near building shadows in the third row due to weakened high-frequency responses. Boundaries between spectrally similar classes such as “Tree” and “Low Vegetation” (fifth row) remain locally ambiguous under extremely low contrast. These effects are confined to small regions and reflect the inherent difficulty of weak-edge and ultra-fine object segmentation.

5. Discussion

The superior performance of the proposed SFCT-Net on the TSRSD and ISPRS benchmarks is primarily attributed to its synergistic fusion of explicit geometric and physical priors. Specifically, the proposed STLPA module preserves the integrity of irregular objects by replacing rigid grids with superpixel-guided tokens. Meanwhile, the proposed FMDER module produces clear outlines of dense buildings by using high-frequency prior information to delineate physical boundaries. Furthermore, the proposed SSFC module corrects spatial misalignments, which ensures robust segmentation across multiple scales. However, our proposed SFCT-Net still has some limitations. Specifically, the network’s segmentation accuracy is sensitive to superpixel quality, which may affect boundary fitting under low-light conditions. Additionally, the frequency-domain modulation method may incur additional computational overhead when processing extremely high-resolution inputs.

6. Conclusions

We propose SFCT-Net to bridge the gap between data-driven representations and physical scene structures in high-resolution remote sensing. First, the proposed STLPA module replaces fixed grid tokens with semantic superpixels, which effectively preserves the geometric integrity of irregular objects while ensuring linear computational complexity. Second, the proposed FMDER module introduces physical spectral priors, which enables the network to distinguish genuine semantic edges from complex texture noise through guided deformable sampling. Finally, the proposed SSFC module addresses feature misalignment by actively coupling shallow details with deep semantics as a guiding field. Extensive experiments on the TSRSD, ISPRS Vaihingen, and Potsdam datasets demonstrate that SFCT-Net maintains semantic continuity along boundaries and ensures the structural integrity of irregular geospatial objects, proving its robustness in complex remote sensing scenes.

Author Contributions

Conceptualization, X.X. and C.C.; methodology and software, C.C.; formal analysis and validation, C.C. and Y.Y.; data curation and visualization, X.X. and C.C.; writing—original draft preparation, X.X. and C.C.; writing—review and editing, X.X. and G.X.; supervision and funding acquisition, X.X. and G.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (62441313), the Key Research and Development Plan of Shanxi Province (202402150101008), the Fundamental Research Program of Shanxi Province (202303021221141), and the Foundation of Shanxi Key Laboratory of Advanced Control and Industrial Intelligence (ACII202511).

Data Availability Statement

The ISPRS Vaihingen and Potsdam datasets are available at: https://www.isprs.org/resources/datasets/benchmarks/ (accessed on 1 February 2026). The self-built Taiyuan Satellite Remote Sensing Dataset (TSRSD) presented in our study is available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Zuo, R.; Huang, X.; Li, J.; Pan, X. A cross-angle propagation network for built-up area extraction by fusing spatial-spectral-angular features from the ZY-3 multiview satellite imagery: Dataset and analysis of China’s 41 major cities. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5408320. [Google Scholar] [CrossRef]
Huang, X.; Wang, W.; Li, J.; Wang, L.; Xie, X. A stepwise refining image-level weakly supervised semantic segmentation method for detecting exposed surface for buildings (ESB) from very high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5400517. [Google Scholar] [CrossRef]
Li, Q.; Bai, X.; Hu, L.; Li, L.; Bao, Y.; Geng, X.; Yan, X.H. Semantic segmentation of typical oceanic and atmospheric phenomena in SAR images based on modified Segformer. Remote Sens. 2026, 18, 113. [Google Scholar] [CrossRef]
Fenglei, W.; Xin, G.; Zongze, Z.; Lida, X.; Chao, M. BoundNet: A boundary-enhanced semantic segmentation model for buildings. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10840290. [Google Scholar] [CrossRef]
Zhou, C.; Huang, J.; Xiao, Y.; Du, M.; Li, S. A novel approach: Coupling prior knowledge and deep learning methods for large-scale plastic greenhouse extraction using Sentinel-1/2 data. Int. J. Appl. Earth Obs. Geoinf. 2024, 132, 104073. [Google Scholar] [CrossRef]
Cai, Z.; Hu, Q.; Zhang, X.; Yang, J.; Wei, H.; Wang, J.; Zeng, Y.; Yin, G.; Li, W.; You, L.; et al. Improving agricultural field parcel delineation with a dual branch spatiotemporal fusion network by integrating multimodal satellite data. ISPRS J. Photogramm. Remote Sens. 2023, 205, 34–49. [Google Scholar] [CrossRef]
Liu, Y.; Zhang, Y.; Wang, Y.; Mei, S. Rethinking Transformers for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5617515. [Google Scholar] [CrossRef]
Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
Chen, H.; Qin, Y.; Liu, X.; Wang, H.; Zhao, J. An improved DeepLabv3+ lightweight network for remote-sensing image semantic segmentation. Complex Intell. Syst. 2024, 10, 2839–2849. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021; pp. 1–21. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSwin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar] [CrossRef]
Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
Wang, M.; Liu, X.; Gao, Y.; Ma, X.; Soomro, N.Q. Superpixel segmentation: A benchmark. Signal Process. Image Commun. 2017, 56, 28–39. [Google Scholar] [CrossRef]
Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16. [Google Scholar] [CrossRef]
Mi, L.; Chen, Z. Superpixel-enhanced deep neural forest for remote sensing image semantic segmentation. ISPRS J. Photogramm. Remote Sens. 2020, 159, 140–152. [Google Scholar] [CrossRef]
Ye, Z.; Lin, Y.; Gan, M.; Tan, X.; Dai, M.; Kong, D. ConvNeXt with Context-Weighted Deep Superpixels for High-Spatial-Resolution Aerial Image Semantic Segmentation. AI 2025, 6, 277. [Google Scholar] [CrossRef]
Zhang, J.; Shao, M.; Wan, Y.; Meng, L.; Cao, X.; Wang, S. Boundary-aware spatial and frequency dual-domain transformer for remote sensing urban images segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5600114. [Google Scholar] [CrossRef]
Chen, L.; Gu, L.; Zheng, D.; Fu, Y. Frequency-adaptive dilated convolution for semantic segmentation of urban scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1350–1360. [Google Scholar]
Bai, L.; Lin, X.; Ye, Z.; Xue, D.; Yao, C.; Hui, M. MsanlfNet: Semantic segmentation network with multiscale attention and nonlocal filters for high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6512405. [Google Scholar] [CrossRef]
Yang, Y.; Yuan, G.; Li, J. SFFNet: A wavelet-based spatial and frequency domain fusion network for remote sensing segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3000617. [Google Scholar] [CrossRef]
Wen, Y.; Gao, T.; Chen, T.; Li, Z.; Liu, M.; Liu, L. Cross-level Interaction and Intra-level Fusion Network for Remote Sensing Image Dehazing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5602115. [Google Scholar] [CrossRef]
Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5603018. [Google Scholar] [CrossRef]
He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer embedding UNet for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408715. [Google Scholar] [CrossRef]
Lei, S.; Xiao, X.; Zhang, T.; Li, H.-C.; Shi, Z.; Zhu, Q. Exploring fine-grained image-text alignment for referring remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5601514. [Google Scholar] [CrossRef]
Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10294–10303. [Google Scholar]
Song, P.; Li, J.; An, Z.; Fan, H.; Fan, L. CTMFNet: CNN and transformer multiscale fusion network of remote sensing urban scene imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5900314. [Google Scholar] [CrossRef]
Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
Krähenbühl, P.; Koltun, V. Efficient inference in fully connected CRFs with gaussian edge potentials. In Proceedings of the Advances in Information Processing Systems, Granada, Spain, 12–14 December 2011; pp. 109–117. [Google Scholar]
Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1529–1537.
Li, M.; Long, J.; Stein, A.; Wang, X. Using a semantic edge-aware multi-task neural network to delineate agricultural parcels from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 200, 24–40.
Ye, Z.; Lin, Y.; Dong, B.; Tan, X.; Dai, M.; Kong, D. An object-aware network embedding deep superpixel for semantic segmentation of remote sensing images. Remote Sens. 2024, 16, 3805.
Zhong, J.; Zeng, T.; Xu, Z.; Wu, C.; Qian, S.; Xu, N.; Chen, Z.; Lyu, X.; Li, X. A frequency attention-enhanced network for semantic segmentation of high-resolution remote sensing images. Remote Sens. 2025, 17, 402.
Liao, N.; Guo, B.; Li, C.; Liu, H.; Zhang, C. BACA: Superpixel segmentation with boundary awareness and content adaptation. Remote Sens. 2022, 14, 4572.
Zhang, H.; Xie, G.; Li, L.; Xie, X.; Ren, J. Frequency-domain guided swin transformer and global-local feature integration for remote sensing images semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5603115.
Rottensteiner, F.; Sohn, G.; Jung, J.; Gerke, M.; Baillard, C.; Benitez, S.; Breitkopf, U. The ISPRS benchmark on urban object classification and 3D building reconstruction. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, I-3, 293–298.
Gerke, M. Use of the Stair Vision Library Within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen); ResearchGate: Berlin, Germany, 2014.

Figure 1. Visualization of superpixel and frequency priors in remote sensing imagery: (a) original image; (b) superpixel segmentation map representing geometric clusters; (c) superpixel boundaries overlaid on the original image, showing alignment with semantic objects; (d) frequency domain amplitude spectrum; (e) high-frequency components extracted via high-pass filtering, highlighting physical semantic edges.
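
Panels (d) and (e) of Figure 1 can be approximated with a few lines of NumPy. The following is only a minimal sketch, assuming a single-channel image and an ideal (hard-cutoff) circular high-pass mask; the function name `high_pass`, the `cutoff` value, and the toy image are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def high_pass(image: np.ndarray, cutoff: float = 0.1) -> tuple[np.ndarray, np.ndarray]:
    """Return (log-amplitude spectrum, high-frequency component) of a 2-D image.

    `cutoff` is an assumed radius in cycles per pixel; frequencies at or
    below it are discarded (ideal high-pass filter).
    """
    # Centered 2-D spectrum and its log-amplitude, as in panel (d).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    amplitude = np.log1p(np.abs(spectrum))

    # Radial frequency grid aligned with the shifted spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(image.shape[0]))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(image.shape[1]))[None, :]
    mask = np.sqrt(fx**2 + fy**2) > cutoff

    # Back to the spatial domain: only edges and fine detail remain, as in panel (e).
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
    return amplitude, high


# Toy input: a bright square on a dark background.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
amplitude, edges = high_pass(img)
```

Because the mask removes the zero-frequency term, the filtered output has zero mean and its largest responses concentrate along the square's boundary. A smoother mask (for example Gaussian) would reduce the ringing that a hard cutoff introduces.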

Figure 2. The overall architecture of the proposed SFCT-Net.

Figure 3. Overall structure diagram of STLPA.

Figure 4. Positioning function for reconstructing the two-dimensional vector space.

Figure 5. Overall structure diagram of FMDER.

Figure 6. Overall structure diagram of SSFC.

Figure 7. Taiyuan Satellite Remote Sensing Dataset: (a) original image; (b) label image; (c,d) zoomed-in 512 × 512 patches cropped from the label map to illustrate the patch partition strategy and annotation quality.

Figure 8. Radar charts visualizing the pixel-level class distributions across datasets: (a) class proportions of the self-built TSRSD and (b) class proportions of the ISPRS benchmark datasets.

Figure 9. Segmentation results of different methods on the TSRSD dataset.

Figure 10. Segmentation results of different methods on the ISPRS dataset.

Table 1. Ablation experiments with the proposed modules on the Potsdam dataset. Bold values indicate the best performance.

Method	mIoU (%)	F1-Score (%)	OA (%)
Baseline	82.21	88.52	86.67
Baseline + STLPA	86.52	92.59	91.18
Baseline + FMDER	83.39	90.26	88.12
Baseline + SSFC	84.32	88.16	88.90
Baseline + STLPA + FMDER + SSFC	86.60	92.73	93.19
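
For readers who want to reproduce the three columns above, the sketch below computes mIoU, mean F1-score, and OA from a confusion matrix. It assumes the conventional definitions of these metrics; the function name `segmentation_metrics` and the toy two-class matrix are hypothetical, and the paper's own evaluation code may differ in detail (for example in how absent classes are handled).

```python
import numpy as np


def segmentation_metrics(conf: np.ndarray) -> dict[str, float]:
    """Compute mIoU, mean F1 and OA (all in %) from a confusion matrix.

    conf[i, j] is the number of pixels of true class i predicted as class j.
    Every class is assumed to occur at least once, so no denominator is zero.
    """
    conf = conf.astype(float)
    tp = np.diag(conf)            # correctly classified pixels per class
    fp = conf.sum(axis=0) - tp    # predicted as the class but belonging to another
    fn = conf.sum(axis=1) - tp    # belonging to the class but predicted as another

    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "mIoU": 100 * iou.mean(),
        "F1": 100 * f1.mean(),
        "OA": 100 * tp.sum() / conf.sum(),
    }


# Toy two-class example with 20 pixels in total.
toy = np.array([[8, 2],
                [1, 9]])
metrics = segmentation_metrics(toy)
```

In the toy case 17 of 20 pixels are correct, so OA is 85%, while mIoU averages the per-class values 8/11 and 9/12.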

Table 2. Comparative experiments with decoder blocks on the Potsdam dataset. Bold values indicate the best performance.

Block Type	Mechanism Strategy	mIoU (%)	Params (M)
ViT Block [11]	Global Patch	84.71	15.13
Swin Block [12]	Fixed Window Partition	85.33	13.82
CSwin Block [13]	Cross-Shaped Window	85.87	13.67
STLPA Block (Ours) Fixed S	Superpixel Tokenization	86.54	13.70
STLPA Block (Ours) Variable S	Superpixel Tokenization	86.60	13.70

Table 3. Comparative experiments of STLPA with different values of p. Bold values indicate the best performance.

p	mIoU (%)	F1-Score (%)	OA (%)
1	84.69	90.67	89.28
2	85.78	91.84	90.43
3	86.60	92.73	93.19
4	86.11	92.19	90.78
5	85.62	91.67	90.26

Table 4. Comparative experiments on FMDER at different skip connection layers. Bold values indicate the best performance.

Injection Layers	mIoU (%)	F1-Score (%)	OA (%)
1	86.31	92.52	91.04
1, 2	86.60	92.73	93.19
1, 2, 3	86.12	92.41	90.91
1, 2, 3, 4	85.93	92.00	90.59
2, 3, 4	85.88	92.15	90.66
3, 4	85.65	91.81	90.41
4	85.37	91.40	90.05

Table 5. Comparative experiments between SFCT-Net and advanced methods on the TSRSD dataset. Bold values indicate the best performance, while underlined values denote the second-best performance.

Class (IoU, %)	UNet [9]	Deeplabv3+ [10]	Swin [12]	CSwin [13]	BANet [30]	TransUNet [14]	SDNF [18]	ConvNeXt [19]	MsanlfNet [22]	SFFNet [23]	SFCT-Net (Ours)
Farmland	71.07	73.93	76.82	77.53	78.84	79.37	77.15	81.45	79.05	80.88	81.91
Forest	65.38	67.78	68.54	70.46	72.43	82.64	69.82	83.35	73.12	83.05	83.65
Grass	28.47	30.34	33.24	35.42	36.14	37.31	34.56	36.50	36.87	40.43	43.12
Water	29.33	40.14	39.75	43.64	45.67	46.39	41.21	44.80	45.95	57.16	63.31
Building	72.22	73.81	75.27	75.88	80.38	81.51	75.55	84.20	80.92	83.44	84.79
Har. Suf.	45.81	45.93	50.37	51.25	51.37	51.43	50.82	51.15	51.10	51.85	51.12
Exc. Lan.	42.56	46.69	52.24	54.27	55.43	57.18	53.33	55.90	56.06	58.50	59.31
Road	47.68	50.73	55.54	60.39	59.27	60.26	57.17	63.90	59.88	62.92	64.82
Background	38.57	40.18	43.44	42.63	47.14	50.37	43.09	49.10	48.25	53.15	55.31
mIoU (%)	49.01	52.17	55.02	56.83	58.52	60.72	55.86	61.15	59.02	63.49	65.26
OA (%)	62.57	65.92	68.72	70.86	72.19	74.26	69.58	74.55	72.74	75.87	78.81
F1 (%)	54.27	57.62	60.42	62.56	63.89	65.96	61.28	66.25	64.44	67.57	70.42
Params (M)	32.13	41.21	50.62	35.24	12.87	22.81	30.56	28.50	44.18	18.45	13.70
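
The mIoU row is the unweighted mean of the nine per-class IoU values above it. As a quick arithmetic check, the sketch below recomputes it for the SFCT-Net column; the numbers are copied from Table 5, and the dictionary name `sfct_net_iou` is introduced only for this illustration.

```python
# Per-class IoU (%) of SFCT-Net on TSRSD, copied from Table 5.
sfct_net_iou = {
    "Farmland": 81.91,
    "Forest": 83.65,
    "Grass": 43.12,
    "Water": 63.31,
    "Building": 84.79,
    "Har. Suf.": 51.12,
    "Exc. Lan.": 59.31,
    "Road": 64.82,
    "Background": 55.31,
}

# mIoU is the unweighted average over classes.
miou = sum(sfct_net_iou.values()) / len(sfct_net_iou)
print(f"{miou:.2f}")  # prints 65.26
```

Since every class carries equal weight, low-scoring classes such as Grass pull the mean down as much as Farmland or Building raise it, regardless of how many pixels each class covers.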

Table 6. Comparative experiments between SFCT-Net and advanced methods on the ISPRS dataset. Bold values indicate the best performance, while underlined values denote the second-best performance.

Dataset	Methods	Imp. Suf. IoU (%)	Building IoU (%)	Low. Veg. IoU (%)	Tree IoU (%)	Car IoU (%)	mIoU (%)	OA (%)	F1 (%)	Params (M)
Vaihingen	UNet [9]	77.33	84.04	63.01	74.32	52.61	70.25	85.55	85.17	32.13
	Deeplabv3+ [10]	79.58	85.97	70.12	72.54	77.63	77.17	86.85	86.43	41.21
	Swin [12]	83.87	89.14	69.91	79.05	74.13	79.56	87.15	86.56	50.62
	CSwin [13]	86.15	89.84	72.47	79.99	75.54	80.80	88.85	88.43	35.24
	BANet [30]	83.76	86.48	76.06	81.95	78.79	81.41	89.95	89.58	12.87
	TransUNet [14]	84.67	86.48	78.15	82.07	81.94	82.66	89.55	89.08	22.81
	SDNF [18]	81.25	85.30	71.88	75.60	76.45	78.10	87.45	86.95	30.56
	ConvNeXt [19]	85.95	89.65	77.60	81.15	84.10	83.69	90.50	90.10	28.50
	MsanlfNet [22]	84.05	88.20	75.33	80.15	79.88	81.52	89.65	89.20	44.18
	SFFNet [23]	85.80	90.15	77.50	81.80	85.40	83.20	90.35	89.95	18.45
	SFCT-Net (ours)	84.61	91.13	75.65	77.52	90.61	83.90	91.31	90.84	13.70
Potsdam	UNet [9]	74.00	81.53	63.75	65.67	71.61	71.31	82.15	81.73	32.13
	Deeplabv3+ [10]	82.26	89.74	71.97	76.90	77.19	79.61	88.50	88.09	41.21
	Swin [12]	85.80	91.23	71.84	80.31	75.72	80.98	89.80	89.31	50.62
	CSwin [13]	87.51	91.46	73.70	82.26	76.80	82.35	90.65	90.18	35.24
	BANet [30]	85.99	89.05	80.49	82.10	88.43	85.21	91.80	91.39	12.87
	TransUNet [14]	87.54	89.81	80.25	85.51	85.48	85.72	92.00	91.60	22.81
	SDNF [18]	84.12	88.25	72.45	78.33	80.15	80.66	89.40	88.92	30.56
	ConvNeXt [19]	87.10	92.25	79.20	83.50	89.10	86.23	92.60	92.20	28.50
	MsanlfNet [22]	86.35	90.12	78.54	81.20	86.75	84.59	91.30	90.88	44.18
	SFFNet [23]	87.40	92.15	80.12	84.20	90.05	86.10	92.50	92.05	18.45
	SFCT-Net (ours)	87.62	93.69	78.30	80.13	93.25	86.60	93.19	92.73	13.70


Share and Cite

MDPI and ACS Style

Xie, X.; Chang, C.; Yang, Y.; Xie, G. Superpixel-Tokenized and Frequency-Modulated Hybrid CNN–Transformer for Remote Sensing Semantic Segmentation. Remote Sens. 2026, 18, 754. https://doi.org/10.3390/rs18050754

