Superpixel-Tokenized and Frequency-Modulated Hybrid CNN–Transformer for Remote Sensing Semantic Segmentation
Highlights
- We propose a novel architecture named SFCT-Net for remote sensing semantic segmentation. SFCT-Net integrates superpixel tokens and high-frequency constraints to preserve structural integrity and boundary precision. The network comprises three core modules: the Superpixel-Tokenized Linear Position Attention module, the Frequency-Modulated Deformable Edge Refinement module, and the Spatial–Semantic Feature Coupling module.
- We construct the Taiyuan Satellite Remote Sensing Dataset (TSRSD), a high-resolution, finely annotated benchmark covering diverse and complex urban landscapes.
- Our proposed SFCT-Net demonstrates that incorporating domain-specific geometric and physical priors into deep learning frameworks enables superior interpretation of complex scenes compared with purely data-driven methods.
- Our proposed modular method and the self-constructed TSRSD provide a solution for high-precision urban planning and environmental surveillance, particularly in high-density environments with complex land-cover distributions.
Abstract
1. Introduction
- STLPA Module: The proposed STLPA module reformulates attention modeling by introducing superpixel-tokenized object representations instead of fixed window partitioning. This design preserves the semantic integrity of irregular ground objects while enabling efficient linear-complexity global dependency modeling in high-resolution remote sensing imagery.
- FMDER Module: The proposed FMDER module integrates frequency-domain physical priors into boundary refinement by explicitly leveraging high-frequency spectral information. This strategy enhances boundary-aware feature learning and significantly improves edge localization robustness in complex and low-contrast urban scenes.
- SSFC Module: The proposed SSFC module establishes an explicit coupling mechanism between deep semantic features and shallow spatial details. This module effectively rectifies feature misalignment induced by encoder–decoder downsampling and enables accurate pixel-level fusion across heterogeneous feature streams.
- TSRSD: The proposed TSRSD provides a high-resolution urban remote sensing benchmark with dense and fine-grained annotations. This dataset facilitates rigorous evaluation of semantic segmentation models in heterogeneous urban environments and highlights their generalization capability.
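As a rough illustration of the superpixel tokenization idea in the STLPA bullet above, the sketch below averages per-pixel feature vectors inside each superpixel to form one token per region, then broadcasts the tokens back to the pixels. It is not the authors' implementation: the names `pool_tokens` and `scatter_tokens` are hypothetical, the superpixel labels are assumed to come from an external algorithm (for example SLIC), and mean pooling is an assumed aggregation rule.

```python
from typing import List, Sequence


def pool_tokens(features: Sequence[Sequence[float]], labels: Sequence[int]) -> List[List[float]]:
    """Average the feature vectors of all pixels that share a superpixel label.

    features: one feature vector per pixel (image flattened to a pixel list).
    labels:   superpixel index per pixel, assumed to be 0..K-1 (hypothetical input).
    Returns one token (mean feature vector) per superpixel.
    """
    n_tokens = max(labels) + 1
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(n_tokens)]
    counts = [0] * n_tokens
    for feat, lab in zip(features, labels):
        counts[lab] += 1
        for c in range(dim):
            sums[lab][c] += feat[c]
    # Guard against unused label indices by dividing by at least 1.
    return [[s / max(cnt, 1) for s in row] for row, cnt in zip(sums, counts)]


def scatter_tokens(tokens: Sequence[Sequence[float]], labels: Sequence[int]) -> List[List[float]]:
    """Copy each superpixel token back to every pixel that belongs to it."""
    return [list(tokens[lab]) for lab in labels]


# Three pixels, two superpixels: pixels 0 and 1 form region 0, pixel 2 forms region 1.
tokens = pool_tokens([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], [0, 0, 1])
restored = scatter_tokens(tokens, [0, 0, 1])
```

Because the number of tokens equals the number of superpixels rather than the number of pixels or fixed windows, any attention applied to `tokens` operates on irregular, object-shaped regions, which is the property the bullet emphasizes.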
2. Related Work
2.1. Heterogeneous Feature Interaction in Hybrid Architectures
2.2. Prior-Guided Learning for Remote Sensing Semantic Segmentation
3. Method
3.1. Overall Framework
3.2. Superpixel-Tokenized Linear Position Attention Module
3.3. Frequency-Modulated Deformable Edge Refinement Module
3.4. Spatial–Semantic Feature Coupling Module
4. Experiment
4.1. Datasets
4.1.1. Taiyuan Satellite Remote Sensing Dataset
4.1.2. ISPRS Vaihingen and Potsdam
4.2. Experimental Details
4.3. Evaluation Metrics
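- Mean intersection over union (mIoU) represents the mean of the intersection over union (IoU) ratios between the predicted results for each class and the true labels. It is calculated as follows:
  $$\mathrm{mIoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i}$$
  where $TP_i$, $FP_i$, and $FN_i$ represent the true positives, false positives, and false negatives for class i, respectively, and $N$ is the number of classes.
- Overall accuracy (OA) represents the percentage of correctly segmented pixels among all pixels. It is calculated as follows:
  $$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$
  where TN represents true negatives.
- F1-score is the harmonic mean of precision and recall, providing a balanced evaluation metric for handling class imbalance. It is calculated as follows:
  $$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
  where $\mathrm{Precision} = TP/(TP + FP)$ and $\mathrm{Recall} = TP/(TP + FN)$.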
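The three metrics above can all be derived from one confusion matrix. The following is a minimal sketch, not the evaluation code used in the paper: the function name `segmentation_metrics` is hypothetical, F1 is macro-averaged over classes (an assumption, since the averaging scheme is not stated here), and OA is computed in its multi-class form as the diagonal sum divided by the pixel count.

```python
from typing import Dict, List


def segmentation_metrics(conf: List[List[int]]) -> Dict[str, float]:
    """Compute mIoU, OA, and macro F1 from a square confusion matrix.

    conf[t][p] counts pixels whose true class is t and predicted class is p.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    ious, f1s = [], []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                       # true class i, predicted as another class
        fp = sum(conf[t][i] for t in range(n)) - tp  # predicted class i, true class differs
        union = tp + fp + fn
        ious.append(tp / union if union else 0.0)
        # 2TP / (2TP + FP + FN) is algebraically the harmonic mean of precision and recall.
        f1s.append(2 * tp / (2 * tp + fp + fn) if union else 0.0)
    oa = sum(conf[i][i] for i in range(n)) / total if total else 0.0
    return {"mIoU": sum(ious) / n, "OA": oa, "F1": sum(f1s) / n}


# Toy two-class example: 100 pixels, 85 of them on the diagonal.
metrics = segmentation_metrics([[50, 10], [5, 35]])
```

In this toy example OA is 0.85 while mIoU is lower, because each class IoU also penalizes the pixels wrongly assigned to that class. This is why OA and mIoU can move by different amounts when boundary pixels are corrected.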
4.4. Ablation Experiments
4.4.1. Impact of Individual Contributions
4.4.2. Impact of Superpixel Tokenization
- Comparison with Grid-Based Attention. We replaced the STLPA block in the decoder with current mainstream grid-based transformer blocks, including a ViT block [11], Swin block [12], and CSwin block [13]. We also evaluated the proposed STLPA module using both the original setting and the proposed adaptive strategy (“Fixed S” and “Variable S” in Table 2) to further justify the necessity of scale-aware superpixel tokenization. The segmentation results are shown in Table 2. As observed in Table 2, our method outperformed the grid-based mechanisms at a comparable parameter count (13.70 M versus 13.67–15.13 M), which is primarily due to the superpixel prior introduced by the STLPA module. The comparative methods impose rigid rectangular partitioning on images, meaning that the semantic continuity of irregular geospatial objects is inevitably severed. In contrast, our proposed STLPA module utilizes superpixel clustering to aggregate pixels into object tokens with explicit semantic boundaries. This object-wise reconstruction aligns more consistently with the physical attributes of geographical entities than window-based methods in the deep decoding stage.
- Comparison of Different Values of p. The proposed STLPA module introduces a focusing factor p to reconstruct the feature directions within the linear attention mechanism. The impact of different values of p on the segmentation results is presented in Table 3. Table 3 indicates that the model’s performance peaked at p = 3 (86.60% mIoU). Performance was lowest at p = 1 (84.69% mIoU), because the positioning function degenerated into a linear projection lacking directional sensitivity. Increasing p to 3 enhanced the nonlinearity and orthogonality of feature vectors, which enabled the network to better distinguish spatially adjacent but semantically distinct objects. Conversely, an excessively high p (e.g., 5) led to over-orthogonalization, which disrupted intra-class correlations and degraded performance.
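To make the role of the focusing factor p concrete, the sketch below uses a norm-preserving power map of the kind found in focused linear attention: each feature vector is raised element-wise to the power p and rescaled to its original length. This form is an assumption for illustration and is not confirmed as the positioning function used in STLPA. The ReLU kernel, the small constant `eps`, and the names `focus`, `feature_map`, and `linear_attention` are likewise hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def focus(x: List[float], p: float) -> List[float]:
    """Raise a non-negative vector to the power p element-wise, then rescale to its original norm."""
    powered = [v ** p for v in x]
    norm_x = math.sqrt(sum(v * v for v in x))
    norm_p = math.sqrt(sum(v * v for v in powered))
    return [v * norm_x / norm_p for v in powered] if norm_p > 0 else [0.0] * len(x)


def feature_map(row: List[float], p: float, eps: float = 1e-6) -> List[float]:
    """ReLU kernel (kept strictly positive by eps) followed by the focusing map."""
    return focus([max(v, 0.0) + eps for v in row], p)


def linear_attention(q: Matrix, k: Matrix, v: Matrix, p: float = 3.0) -> Matrix:
    """Attention computed as phi(Q) (phi(K)^T V), so cost grows linearly with the token count."""
    fq = [feature_map(r, p) for r in q]
    fk = [feature_map(r, p) for r in k]
    d, dv = len(fk[0]), len(v[0])
    # Summaries shared by all queries: K^T V (d x dv) and the per-channel key sums (d).
    kv = [[sum(kr[i] * vr[j] for kr, vr in zip(fk, v)) for j in range(dv)] for i in range(d)]
    ksum = [sum(kr[i] for kr in fk) for i in range(d)]
    out = []
    for qr in fq:
        denom = sum(qr[i] * ksum[i] for i in range(d))  # normaliser, positive thanks to eps
        out.append([sum(qr[i] * kv[i][j] for i in range(d)) / denom for j in range(dv)])
    return out


# One query, two keys: the first key points in the query's direction, the second does not.
q = [[1.0, 0.5]]
k = [[1.0, 0.5], [0.5, 1.0]]
v = [[0.0], [1.0]]
out_p1 = linear_attention(q, k, v, p=1.0)[0][0]
out_p3 = linear_attention(q, k, v, p=3.0)[0][0]
```

With p = 1 the map leaves directions unchanged, which matches the observation that the function then reduces to a plain linear projection. With p = 3 the dominant channel of each vector is emphasized, so the query above puts less weight on the misaligned second key and `out_p3` is smaller than `out_p1`.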
4.4.3. Impact of Frequency Modulation
4.5. Comparative Experiment
4.5.1. Comparative Methods
4.5.2. Comparison to SOTA Methods
- Quantitative Analysis. The quantitative comparison results are reported in Table 5 and Table 6. To ensure stability and reliability, we conducted three independent runs with different random seeds and report the average results. Overall, the proposed SFCT-Net achieves SOTA performance across all three benchmarks, maintaining an optimal balance between segmentation accuracy and computational efficiency. As shown in Table 5 and Table 6, the proposed SFCT-Net demonstrates superior performance in categories characterized by strong geometric features or irregular boundaries, such as “Building”, “Car”, and “Water” (e.g., 90.61% IoU for “Car” on Vaihingen versus 85.40% for the next-best method). Compared with transformer architectures (e.g., Swin [12] and CSwin [13]), the advantage of the proposed SFCT-Net stems from its preservation of geometric integrity. Specifically, the rigid window-based partitioning relied upon by Swin Transformer often forcibly severs the semantic continuity of irregular objects, leading to fragmented predictions at boundaries. Conversely, our proposed STLPA module utilizes superpixels as object tokens to adaptively aggregate pixels sharing the same semantics, which keeps small objects such as those in the “Car” category from being overwhelmed by background noise. Moreover, the CNN-based methods (e.g., DeepLabv3+ [10]) rely primarily on spatial convolutions for feature extraction, which can confuse building edges with intricate rooftop textures in complex urban scenes. In contrast, the proposed FMDER module enables the network to locate genuine physical boundaries within texture-dense regions, achieving more precise edge segmentation for the “Building” and “Road” categories. Furthermore, most comparative methods (e.g., BANet [30]) rely on simple concatenation or addition to fuse features, which neglects the spatial drift caused by repeated downsampling. By contrast, our proposed SSFC module utilizes the deep stream to actively rectify shallow details, which improves the recognition rate of fine-grained objects through dynamic feature alignment.
- Qualitative Analysis. The comparative qualitative results are illustrated in Figure 9 and Figure 10. Overall, our proposed SFCT-Net demonstrates strong segmentation efficacy, effectively addressing the challenges of intra-class variance in the self-constructed TSRSD and resolving fine-grained urban textures in public benchmarks. As visualized in Figure 9 and Figure 10, our proposed SFCT-Net exhibits consistent geometric preservation and fine-grained recognition across varying scenes. On the TSRSD (Figure 9), the proposed SFCT-Net maintains excellent continuity for winding roads and viaducts, effectively mitigating common disconnection issues. Even within interlaced natural scenes, narrow paths remain clearly discernible. Meanwhile, on the ISPRS benchmarks (Figure 10), our proposed SFCT-Net eliminates the sawtooth effect on large-scale objects in the “Building” class while preserving the morphological integrity of tiny objects in the “Car” class. In alignment with these visual improvements, the proposed modules produce cleaner object interiors and sharper boundaries. Correcting numerous boundary pixels also raises the overall accuracy, albeit with a more moderate effect on the category-averaged mIoU. In complex transition zones involving the “Water” and “Tree” classes, SFCT-Net achieves high semantic consistency with minimal category confusion. Nevertheless, a few challenging cases still present minor limitations. On the TSRSD (Figure 9), several thin linear structures in the first two rows show slight fragmentation after downsampling. On the ISPRS benchmarks (Figure 10), slight confusion appears near building shadows in the third row due to weakened high-frequency responses. Boundaries between spectrally similar classes such as “Tree” and “Low Vegetation” (fifth row) remain locally ambiguous under extremely low contrast. These effects are confined to small regions and reflect the inherent difficulty of weak-edge and ultra-fine object segmentation.
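The analysis above attributes the shadow-related confusion to weakened high-frequency responses. The sketch below shows, under strong simplifications, why high-frequency content marks boundaries: it applies an ideal high-pass filter to a one-dimensional scanline using a naive discrete Fourier transform. It does not reproduce the FMDER module. The function name `high_pass`, the hard cutoff, and the 1D setting are all assumptions made for illustration.

```python
import cmath
from typing import List


def high_pass(signal: List[float], cutoff: int) -> List[float]:
    """Zero every DFT bin whose frequency index is below `cutoff`, then invert the transform."""
    n = len(signal)
    # Naive O(n^2) forward DFT, adequate for a short illustrative scanline.
    spectrum = [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    for f in range(n):
        # Bins f and n - f carry the same absolute frequency, so they are filtered together.
        if min(f, n - f) < cutoff:
            spectrum[f] = 0
    # Inverse DFT; the input is real and the filter is symmetric, so only the real part is kept.
    return [
        sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)).real / n
        for t in range(n)
    ]


# A scanline crossing one boundary: background (0) for 16 samples, then an object (1).
scanline = [0.0] * 16 + [1.0] * 16
response = high_pass(scanline, cutoff=4)
edge_strength = abs(response[16])      # first sample after the jump
interior_strength = abs(response[24])  # middle of the flat object region
```

In this example the filtered response is several times larger next to the jump than in the flat interior. If a shadow lowers the contrast across the boundary, the jump and its high-frequency response shrink proportionally, which is consistent with the failure case described above.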
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Zuo, R.; Huang, X.; Li, J.; Pan, X. A cross-angle propagation network for built-up area extraction by fusing spatial-spectral-angular features from the ZY-3 multiview satellite imagery: Dataset and analysis of China’s 41 major cities. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5408320.
- Huang, X.; Wang, W.; Li, J.; Wang, L.; Xie, X. A stepwise refining image-level weakly supervised semantic segmentation method for detecting exposed surface for buildings (ESB) from very high-resolution remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5400517.
- Li, Q.; Bai, X.; Hu, L.; Li, L.; Bao, Y.; Geng, X.; Yan, X.H. Semantic segmentation of typical oceanic and atmospheric phenomena in SAR images based on modified Segformer. Remote Sens. 2026, 18, 113.
- Fenglei, W.; Xin, G.; Zongze, Z.; Lida, X.; Chao, M. BoundNet: A boundary-enhanced semantic segmentation model for buildings. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 10840290.
- Zhou, C.; Huang, J.; Xiao, Y.; Du, M.; Li, S. A novel approach: Coupling prior knowledge and deep learning methods for large-scale plastic greenhouse extraction using Sentinel-1/2 data. Int. J. Appl. Earth Obs. Geoinf. 2024, 132, 104073.
- Cai, Z.; Hu, Q.; Zhang, X.; Yang, J.; Wei, H.; Wang, J.; Zeng, Y.; Yin, G.; Li, W.; You, L.; et al. Improving agricultural field parcel delineation with a dual branch spatiotemporal fusion network by integrating multimodal satellite data. ISPRS J. Photogramm. Remote Sens. 2023, 205, 34–49.
- Liu, Y.; Zhang, Y.; Wang, Y.; Mei, S. Rethinking Transformers for Semantic Segmentation of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5617515.
- Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
- Chen, H.; Qin, Y.; Liu, X.; Wang, H.; Zhao, J. An improved DeepLabv3+ lightweight network for remote-sensing image semantic segmentation. Complex Intell. Syst. 2024, 10, 2839–2849.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021; pp. 1–21.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002.
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. CSwin Transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134.
- Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
- Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214.
- Wang, M.; Liu, X.; Gao, Y.; Ma, X.; Soomro, N.Q. Superpixel segmentation: A benchmark. Signal Process. Image Commun. 2017, 56, 28–39.
- Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16.
- Mi, L.; Chen, Z. Superpixel-enhanced deep neural forest for remote sensing image semantic segmentation. ISPRS J. Photogramm. Remote Sens. 2020, 159, 140–152.
- Ye, Z.; Lin, Y.; Gan, M.; Tan, X.; Dai, M.; Kong, D. ConvNeXt with Context-Weighted Deep Superpixels for High-Spatial-Resolution Aerial Image Semantic Segmentation. AI 2025, 6, 277.
- Zhang, J.; Shao, M.; Wan, Y.; Meng, L.; Cao, X.; Wang, S. Boundary-aware spatial and frequency dual-domain transformer for remote sensing urban images segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5600114.
- Chen, L.; Gu, L.; Zheng, D.; Fu, Y. Frequency-adaptive dilated convolution for semantic segmentation of urban scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 1350–1360.
- Bai, L.; Lin, X.; Ye, Z.; Xue, D.; Yao, C.; Hui, M. MsanlfNet: Semantic segmentation network with multiscale attention and nonlocal filters for high-resolution remote sensing images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6512405.
- Yang, Y.; Yuan, G.; Li, J. SFFNet: A wavelet-based spatial and frequency domain fusion network for remote sensing segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 3000617.
- Wen, Y.; Gao, T.; Chen, T.; Li, Z.; Liu, M.; Liu, L. Cross-level Interaction and Intra-level Fusion Network for Remote Sensing Image Dehazing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5602115.
- Niu, R.; Sun, X.; Tian, Y.; Diao, W.; Chen, K.; Fu, K. Hybrid multiple attention network for semantic segmentation in aerial images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5603018.
- He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer embedding UNet for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408715.
- Lei, S.; Xiao, X.; Zhang, T.; Li, H.-C.; Shi, Z.; Zhu, Q. Exploring fine-grained image-text alignment for referring remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5601514.
- Chen, Y.; Dai, X.; Chen, D.; Liu, M.; Dong, X.; Yuan, L.; Liu, Z. Mobile-Former: Bridging MobileNet and Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10294–10303.
- Song, P.; Li, J.; An, Z.; Fan, H.; Fan, L. CTMFNet: CNN and transformer multiscale fusion network of remote sensing urban scene imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5900314.
- Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer meets convolution: A bilateral awareness network for semantic segmentation of very fine resolution urban scene images. Remote Sens. 2021, 13, 3065.
- Krähenbühl, P.; Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Proceedings of the Advances in Neural Information Processing Systems, Granada, Spain, 12–14 December 2011; pp. 109–117.
- Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1529–1537.
- Li, M.; Long, J.; Stein, A.; Wang, X. Using a semantic edge-aware multi-task neural network to delineate agricultural parcels from remote sensing images. ISPRS J. Photogramm. Remote Sens. 2023, 200, 24–40.
- Ye, Z.; Lin, Y.; Dong, B.; Tan, X.; Dai, M.; Kong, D. An object-aware network embedding deep superpixel for semantic segmentation of remote sensing images. Remote Sens. 2024, 16, 3805.
- Zhong, J.; Zeng, T.; Xu, Z.; Wu, C.; Qian, S.; Xu, N.; Chen, Z.; Lyu, X.; Li, X. A frequency attention-enhanced network for semantic segmentation of high-resolution remote sensing images. Remote Sens. 2025, 17, 402.
- Liao, N.; Guo, B.; Li, C.; Liu, H.; Zhang, C. BACA: Superpixel segmentation with boundary awareness and content adaptation. Remote Sens. 2022, 14, 4572.
- Zhang, H.; Xie, G.; Li, L.; Xie, X.; Ren, J. Frequency-domain guided swin transformer and global-local feature integration for remote sensing images semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5603115.
- Rottensteiner, F.; Sohn, G.; Jung, J.; Gerke, M.; Baillard, C.; Benitez, S.; Breitkopf, U. The ISPRS benchmark on urban object classification and 3D building reconstruction. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2012, I-3, 293–298.
- Gerke, M. Use of the Stair Vision Library Within the ISPRS 2D Semantic Labeling Benchmark (Vaihingen); ResearchGate: Berlin, Germany, 2014.










Table 1. Impact of individual contributions.

| Method | mIoU (%) | F1-Score (%) | OA (%) |
|---|---|---|---|
| Baseline | 82.21 | 88.52 | 86.67 |
| Baseline + STLPA | 86.52 | 92.59 | 91.18 |
| Baseline + FMDER | 83.39 | 90.26 | 88.12 |
| Baseline + SSFC | 84.32 | 88.16 | 88.90 |
| Baseline + STLPA + FMDER + SSFC | 86.60 | 92.73 | 93.19 |

Table 2. Comparison of the STLPA block with grid-based attention blocks.

| Block Type | Mechanism Strategy | mIoU (%) | Params (M) |
|---|---|---|---|
| ViT Block [11] | Global Patch | 84.71 | 15.13 |
| Swin Block [12] | Fixed Window Partition | 85.33 | 13.82 |
| CSwin Block [13] | Cross-Shaped Window | 85.87 | 13.67 |
| STLPA Block (Ours), Fixed S | Superpixel Tokenization | 86.54 | 13.70 |
| STLPA Block (Ours), Variable S | Superpixel Tokenization | 86.60 | 13.70 |

Table 3. Impact of different values of the focusing factor p.

| p | mIoU (%) | F1-Score (%) | OA (%) |
|---|---|---|---|
| 1 | 84.69 | 90.67 | 89.28 |
| 2 | 85.78 | 91.84 | 90.43 |
| 3 | 86.60 | 92.73 | 93.19 |
| 4 | 86.11 | 92.19 | 90.78 |
| 5 | 85.62 | 91.67 | 90.26 |

Table 4. Impact of frequency modulation at different injection layers.

| Injection Layers | mIoU (%) | F1-Score (%) | OA (%) |
|---|---|---|---|
| 1 | 86.31 | 92.52 | 91.04 |
| 1, 2 | 86.60 | 92.73 | 93.19 |
| 1, 2, 3 | 86.12 | 92.41 | 90.91 |
| 1, 2, 3, 4 | 85.93 | 92.00 | 90.59 |
| 2, 3, 4 | 85.88 | 92.15 | 90.66 |
| 3, 4 | 85.65 | 91.81 | 90.41 |
| 4 | 85.37 | 91.40 | 90.05 |


Table 5. Quantitative comparison with SOTA methods on the TSRSD.

| Class IoU (%) / Metric | UNet [9] | DeepLabv3+ [10] | Swin [12] | CSwin [13] | BANet [30] | TransUNet [14] | SDNF [18] | ConvNeXt [19] | MsanlfNet [22] | SFFNet [23] | SFCT-Net (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Farmland | 71.07 | 73.93 | 76.82 | 77.53 | 78.84 | 79.37 | 77.15 | 81.45 | 79.05 | 80.88 | 81.91 |
| Forest | 65.38 | 67.78 | 68.54 | 70.46 | 72.43 | 82.64 | 69.82 | 83.35 | 73.12 | 83.05 | 83.65 |
| Grass | 28.47 | 30.34 | 33.24 | 35.42 | 36.14 | 37.31 | 34.56 | 36.50 | 36.87 | 40.43 | 43.12 |
| Water | 29.33 | 40.14 | 39.75 | 43.64 | 45.67 | 46.39 | 41.21 | 44.80 | 45.95 | 57.16 | 63.31 |
| Building | 72.22 | 73.81 | 75.27 | 75.88 | 80.38 | 81.51 | 75.55 | 84.20 | 80.92 | 83.44 | 84.79 |
| Har. Suf. | 45.81 | 45.93 | 50.37 | 51.25 | 51.37 | 51.43 | 50.82 | 51.15 | 51.10 | 51.85 | 51.12 |
| Exc. Lan. | 42.56 | 46.69 | 52.24 | 54.27 | 55.43 | 57.18 | 53.33 | 55.90 | 56.06 | 58.50 | 59.31 |
| Road | 47.68 | 50.73 | 55.54 | 60.39 | 59.27 | 60.26 | 57.17 | 63.90 | 59.88 | 62.92 | 64.82 |
| Background | 38.57 | 40.18 | 43.44 | 42.63 | 47.14 | 50.37 | 43.09 | 49.10 | 48.25 | 53.15 | 55.31 |
| mIoU (%) | 49.01 | 52.17 | 55.02 | 56.83 | 58.52 | 60.72 | 55.86 | 61.15 | 59.02 | 63.49 | 65.26 |
| OA (%) | 62.57 | 65.92 | 68.72 | 70.86 | 72.19 | 74.26 | 69.58 | 74.55 | 72.74 | 75.87 | 78.81 |
| F1 (%) | 54.27 | 57.62 | 60.42 | 62.56 | 63.89 | 65.96 | 61.28 | 66.25 | 64.44 | 67.57 | 70.42 |
| Params (M) | 32.13 | 41.21 | 50.62 | 35.24 | 12.87 | 22.81 | 30.56 | 28.50 | 44.18 | 18.45 | 13.70 |

Table 6. Quantitative comparison with SOTA methods on ISPRS Vaihingen and Potsdam.

| Dataset | Method | Imp. Suf. IoU (%) | Building IoU (%) | Low. Veg. IoU (%) | Tree IoU (%) | Car IoU (%) | mIoU (%) | OA (%) | F1 (%) | Params (M) |
|---|---|---|---|---|---|---|---|---|---|---|
| Vaihingen | UNet [9] | 77.33 | 84.04 | 63.01 | 74.32 | 52.61 | 70.25 | 85.55 | 85.17 | 32.13 |
| Vaihingen | DeepLabv3+ [10] | 79.58 | 85.97 | 70.12 | 72.54 | 77.63 | 77.17 | 86.85 | 86.43 | 41.21 |
| Vaihingen | Swin [12] | 83.87 | 89.14 | 69.91 | 79.05 | 74.13 | 79.56 | 87.15 | 86.56 | 50.62 |
| Vaihingen | CSwin [13] | 86.15 | 89.84 | 72.47 | 79.99 | 75.54 | 80.80 | 88.85 | 88.43 | 35.24 |
| Vaihingen | BANet [30] | 83.76 | 86.48 | 76.06 | 81.95 | 78.79 | 81.41 | 89.95 | 89.58 | 12.87 |
| Vaihingen | TransUNet [14] | 84.67 | 86.48 | 78.15 | 82.07 | 81.94 | 82.66 | 89.55 | 89.08 | 22.81 |
| Vaihingen | SDNF [18] | 81.25 | 85.30 | 71.88 | 75.60 | 76.45 | 78.10 | 87.45 | 86.95 | 30.56 |
| Vaihingen | ConvNeXt [19] | 85.95 | 89.65 | 77.60 | 81.15 | 84.10 | 83.69 | 90.50 | 90.10 | 28.50 |
| Vaihingen | MsanlfNet [22] | 84.05 | 88.20 | 75.33 | 80.15 | 79.88 | 81.52 | 89.65 | 89.20 | 44.18 |
| Vaihingen | SFFNet [23] | 85.80 | 90.15 | 77.50 | 81.80 | 85.40 | 83.20 | 90.35 | 89.95 | 18.45 |
| Vaihingen | SFCT-Net (Ours) | 84.61 | 91.13 | 75.65 | 77.52 | 90.61 | 83.90 | 91.31 | 90.84 | 13.70 |
| Potsdam | UNet [9] | 74.00 | 81.53 | 63.75 | 65.67 | 71.61 | 71.31 | 82.15 | 81.73 | 32.13 |
| Potsdam | DeepLabv3+ [10] | 82.26 | 89.74 | 71.97 | 76.90 | 77.19 | 79.61 | 88.50 | 88.09 | 41.21 |
| Potsdam | Swin [12] | 85.80 | 91.23 | 71.84 | 80.31 | 75.72 | 80.98 | 89.80 | 89.31 | 50.62 |
| Potsdam | CSwin [13] | 87.51 | 91.46 | 73.70 | 82.26 | 76.80 | 82.35 | 90.65 | 90.18 | 35.24 |
| Potsdam | BANet [30] | 85.99 | 89.05 | 80.49 | 82.10 | 88.43 | 85.21 | 91.80 | 91.39 | 12.87 |
| Potsdam | TransUNet [14] | 87.54 | 89.81 | 80.25 | 85.51 | 85.48 | 85.72 | 92.00 | 91.60 | 22.81 |
| Potsdam | SDNF [18] | 84.12 | 88.25 | 72.45 | 78.33 | 80.15 | 80.66 | 89.40 | 88.92 | 30.56 |
| Potsdam | ConvNeXt [19] | 87.10 | 92.25 | 79.20 | 83.50 | 89.10 | 86.23 | 92.60 | 92.20 | 28.50 |
| Potsdam | MsanlfNet [22] | 86.35 | 90.12 | 78.54 | 81.20 | 86.75 | 84.59 | 91.30 | 90.88 | 44.18 |
| Potsdam | SFFNet [23] | 87.40 | 92.15 | 80.12 | 84.20 | 90.05 | 86.10 | 92.50 | 92.05 | 18.45 |
| Potsdam | SFCT-Net (Ours) | 87.62 | 93.69 | 78.30 | 80.13 | 93.25 | 86.60 | 93.19 | 92.73 | 13.70 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Xie, X.; Chang, C.; Yang, Y.; Xie, G. Superpixel-Tokenized and Frequency-Modulated Hybrid CNN–Transformer for Remote Sensing Semantic Segmentation. Remote Sens. 2026, 18, 754. https://doi.org/10.3390/rs18050754

