3.2. Comparison with SOTA Models
In this study, DFSMamba was evaluated against representative CNN-, Transformer-, and Mamba-based SISR methods, including HSENet [26], TransENet [52], OmniSR [53], SwinFIR [54], SwinIR [23], TTST [55], DAT [56], MambaIR [29], MambaIRv2 [46], HDI-PRNet [57], and MAT [58]. For numerical comparison, only results obtained under the same degradation setting and evaluation protocol are reported because directly mixing results from different training settings may lead to unfair conclusions.
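Since every comparison in this section is expressed in PSNR (dB), a minimal sketch of the metric may help in reading the tables. It assumes 8-bit images (peak value 255) and evaluates all channels and pixels as given; the exact protocol used here (for example, luminance-only evaluation or border cropping) is not stated in this section, so the function below illustrates the standard definition rather than the evaluation script behind the reported numbers.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = reference.astype(np.float64)
    est = estimate.astype(np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy example (hypothetical data): a flat image and a copy that is off by 5 gray levels.
hr = np.full((8, 8), 100, dtype=np.uint8)
sr = np.full((8, 8), 105, dtype=np.uint8)
print(f"PSNR = {psnr(hr, sr):.2f} dB")
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.5 dB corresponds to roughly an 11% reduction in MSE, which gives a sense of scale for the differences discussed below.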
CNN methods (HSENet, OmniSR) extract features based on local receptive fields, gradually expanding the perceptual range through stacked convolutional layers. Their advantages lie in high computational efficiency, small parameter counts, and fast inference speed, making them suitable for real-time or lightweight applications. However, constrained by a fixed kernel size, CNNs struggle to capture long-range dependencies, leading to blurry reconstructions of large-scale structures. They also lack adaptability to remote sensing images with complex textures and large structural spans, resulting in limited high-frequency detail recovery. As shown in Table 1, on the WHU-RS19×2 task, HSENet achieves a PSNR of only 28.94 dB and an SSIM of 0.7616, significantly lower than Transformer and Mamba methods. On the AID×3 task, HSENet achieves 29.85 dB/0.8150, falling far behind DFSMamba (31.48 dB/0.8415). As shown in Table 2, Table 3 and Table 4, CNNs perform acceptably on homogeneous land cover types (e.g., AID “Bare land” with a PSNR of 37.60 dB) but are severely inadequate on complex categories: AID “Port” reaches only 25.88 dB/0.7389, and “Stadium” achieves 28.28 dB (compared to DFSMamba’s 29.02 dB); on WHU-RS19 “Port”, PSNR is only 17.57 dB, with an SSIM of 0.5051, barely reconstructing valid structures. In summary, CNNs are suitable for fast processing scenarios where precision requirements are low and the imagery is dominated by large homogeneous land cover areas.
Transformer methods (SwinIR, DAT, TTST, MAT) leverage self-attention mechanisms to achieve global receptive fields, capturing pixel dependencies at arbitrary distances, and their reconstruction quality significantly exceeds that of CNNs. As a representative, MAT achieves 34.45 dB/0.9122 on the AID×2 task and 30.80 dB/0.8280 on the AID×3 task, outperforming both CNNs and MambaIR but still falling behind DFSMamba. Transformers demonstrate clear advantages on texture-rich categories: on AID, “Beach” reaches a PSNR of 36.41 dB, “Desert” 40.31 dB, and “Stadium” 28.60 dB, all higher than CNNs and early Mamba methods. However, Transformers suffer from two major issues. First, the computational complexity of self-attention grows quadratically (O(n²)) with the number of tokens n, leading to high memory usage and slow inference speed when processing high-resolution images. Second, the window partitioning strategy lacks semantic adaptability, causing unstable reconstructions on certain categories. For example, on the RSSCN7 “Farmland” category, MAT achieves an SSIM of only 0.4920, lower than DFSMamba’s 0.4968; on the WHU-RS19 “Port” category, MAT reaches 18.07 dB, which, although better than the CNN results, is still 0.10 dB below DFSMamba’s 18.17 dB. This indicates that regular windows struggle to adapt to the diverse morphologies of remote sensing land cover, particularly in accurately capturing the semantic boundaries of slender or irregular structures.
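To make the quadratic-cost argument concrete, the sketch below counts query-key pairs for global self-attention and for non-overlapping window attention on an H × W token grid. The function names and the 8 × 8 window are illustrative assumptions; the count ignores heads, channels, and projection layers, so it indicates scaling behavior only and says nothing about the measured cost of any compared model.

```python
def global_attention_pairs(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token: n^2 with n = H*W."""
    n = height * width
    return n * n


def window_attention_pairs(height: int, width: int, window: int) -> int:
    """Query-key pairs when attention is restricted to non-overlapping window x window tiles."""
    if height % window or width % window:
        raise ValueError("height and width must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


# Doubling the side length multiplies the global cost by 16 but the windowed cost by only 4.
for side in (64, 128):
    print(side, global_attention_pairs(side, side), window_attention_pairs(side, side, 8))
```

Window partitioning therefore restores linear growth in the number of tokens, but at the price of fixed tile boundaries, which is the semantic-adaptability issue raised above.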
Mamba methods (MambaIR, MambaIRv2), which are based on state-space models, achieve a global receptive field with linear complexity (O(n)) and strike a good balance between efficiency and quality. MambaIRv2 achieves 29.89 dB/0.7852 on the WHU-RS19×2 task and 30.41 dB/0.8130 on the AID×3 task, both lower than DFSMamba and MAT. At the category level, Mamba methods excel on homogeneous land cover (AID “Bare land”: 38.15 dB/0.9254; “Parking”: SSIM 0.9408). However, Mamba methods suffer from three key shortcomings. First, unidirectional scanning leads to information bias, limiting performance on categories that require bidirectional information fusion, such as AID “Stadium” (PSNR of only 28.28 dB vs. DFSMamba’s 29.02 dB) and “Viaduct” (SSIM of 0.8304 vs. DFSMamba’s 0.8367). Second, semantic adaptability is insufficient, resulting in limited detail recovery on texture-rich categories, such as RSSCN7 “Grasslands” (PSNR of 31.20 dB vs. DFSMamba’s 31.40 dB) and “Rivers” (SSIM of 0.8410 vs. DFSMamba’s 0.8452). Third, the underutilization of frequency-domain information restricts high-frequency detail recovery. Overall, Mamba methods perform well on homogeneous land cover but exhibit clear deficiencies on complex categories requiring bidirectional semantic perception and high-frequency detail recovery (Figure 4).
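The information bias of unidirectional scanning can be illustrated with a toy linear recurrence. The sketch below is a deliberately simplified stand-in, assuming a scalar state with a fixed decay; it is not the selective state-space layer of MambaIR, MambaIRv2, or DFSMamba, and the names `linear_scan` and `bidirectional_scan` are hypothetical. It only shows that a forward scan cannot see later tokens, whereas combining a forward and a backward scan can, while both remain O(n).

```python
import numpy as np


def linear_scan(x: np.ndarray, decay: float = 0.5) -> np.ndarray:
    """Causal recurrence h[t] = decay * h[t-1] + x[t], computed in O(n) time."""
    h = np.zeros(len(x), dtype=np.float64)
    state = 0.0
    for t, value in enumerate(x):
        state = decay * state + float(value)
        h[t] = state
    return h


def bidirectional_scan(x: np.ndarray, decay: float = 0.5) -> np.ndarray:
    """Sum of a forward and a backward scan; x is subtracted so each token counts itself once."""
    forward = linear_scan(x, decay)
    backward = linear_scan(x[::-1], decay)[::-1]
    return forward + backward - x


# An impulse at the last position is invisible to earlier positions in a forward-only scan.
signal = np.array([0.0, 0.0, 0.0, 1.0])
print("forward only :", linear_scan(signal))
print("bidirectional:", bidirectional_scan(signal))
```

In the forward-only output the first three positions stay at zero, while the bidirectional output carries a decayed trace of the impulse back to the first position, which is the kind of two-sided context that slender structures such as viaducts plausibly require.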
DFSMamba addresses the aforementioned limitations of existing methods by integrating three core innovations. SCSA enhances semantic perception through dynamic chunking and sparse connections while maintaining linear complexity. The ASSM adopts parallel bidirectional SSM branches to overcome unidirectional information bias and introduces an activation-guided fusion mechanism to adaptively enhance semantic regions. The DFTM establishes a global, lossless frequency-domain receptive field and explicitly enhances high-frequency details. In terms of performance, DFSMamba achieves the best or second-best results across all five datasets and three scaling factors. On the AID×3 task, it attains a PSNR of 31.48 dB, outperforming MAT (30.80 dB) by 0.68 dB and MambaIRv2 (30.41 dB) by 1.07 dB; its SSIM reaches 0.8415, exceeding MAT (0.8280) by 0.0135 and MambaIRv2 (0.8130) by 0.0285. As shown in Table 2, Table 3 and Table 4, in the fine-grained evaluation, DFSMamba achieves the best or second-best results across all categories on AID (30 categories), RSSCN7 (7 categories), and WHU-RS19 (19 categories).
The superiority of DFSMamba becomes evident when examining specific categories, as summarized in the following three aspects. First, DFSMamba performs particularly well on texture-rich categories (Figure 4). On the AID dataset, in the “Beach” category, DFSMamba achieves a PSNR of 37.10 dB, exceeding MAT by approximately 0.69 dB and MambaIRv2 by 0.30 dB; in the “Desert” category, it reaches 40.79 dB, exceeding MAT by about 0.48 dB; in the “Stadium” category, it achieves 29.02 dB, exceeding MambaIRv2 by about 0.74 dB; and in the “Parking” category, its SSIM reaches 0.9444, the highest among the compared methods. Second, DFSMamba also excels in categories with slender structures and irregular edges (Figure 5). On the AID dataset, in the “Port” and “Viaduct” categories, its PSNR/SSIM reach 26.27 dB/0.7436 and 29.56 dB/0.8367, respectively, outperforming the other compared methods. On the WHU-RS19 dataset, in the “Port” category, DFSMamba achieves a PSNR of 18.17 dB and an SSIM of 0.5131, showing consistent improvements over MAT (18.07 dB/0.5116) and MambaIRv2 (18.02 dB/0.5111), which supports the effectiveness of bidirectional semantic modeling for slender structure recovery. Third, DFSMamba exhibits strong generalization capability on categories with multi-season and multi-weather conditions. On the RSSCN7 dataset, DFSMamba achieves the best performance across all seven categories. For example, in the “Grasslands” category, its PSNR reaches 31.40 dB, exceeding MAT by about 0.15 dB; in the “Rivers” category, its SSIM reaches 0.8452, exceeding MambaIRv2 by about 0.0042; and in the “Factory” category, its PSNR reaches 25.72 dB, exceeding MAT by about 0.17 dB. These results suggest that DFSMamba is robust to factors such as illumination and seasonal variations and generalizes well across scenes.
As shown in Table 5, for the ×4 upscaling task on the AID dataset, the three methods, MambaIRv2, FMSR, and DFSMamba, exhibit a consistent increasing trend in parameters, FLOPs, latency, and peak GPU memory usage. However, the additional computational cost of DFSMamba is modest. Compared to MambaIRv2, DFSMamba increases parameters by only 0.84 M (~11.7%), FLOPs by 3.9 G (~9.1%), latency by 1.2 ms (~10.6%), and GPU memory by 119 MB (~9.2%). All increases remain around 10%. Given that real-world hardware typically has some headroom, this trade-off is reasonable in light of the reconstruction gains reported above.
According to the sensitivity analysis results on the AID dataset for the ×4 upscaling task (Table 6), both the DFTM and ASSMamba demonstrate strong robustness and stability under various hyperparameter settings. In the frequency-band ratio experiment, as the retained frequency band gradually increases from 25% to the full spectrum, reconstruction quality consistently improves, with the full spectrum achieving the highest PSNR (29.47 dB) and SSIM (0.7708). This indicates that excessive frequency suppression degrades reconstruction performance. Regarding the spectral scaling adapter, a scaling factor of 1.0 yields the best performance (29.47 dB/0.7708), while values that are too high (1.5) or too low (0.5) lead to slight drops, suggesting that moderate scaling provides the best balance between high-frequency enhancement and artifact suppression. For the projection dimension of ASSMamba, increasing from C/4 to C steadily improves reconstruction accuracy, although the difference between C/2 and C is modest, which suggests that larger feature capacity remains beneficial but with diminishing returns. In terms of activation temperature, the default setting of 1.0 achieves the best and most stable gating response (29.47 dB/0.7708), while deviating from this value results in minor performance degradation. Overall, these results indicate that the DFTM and ASSMamba are reasonably stable across typical hyperparameter ranges.
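The two DFTM hyperparameters examined above, the retained frequency-band ratio and the spectral scaling factor, can be pictured with a small NumPy sketch. This is an illustrative assumption about how such parameters could act on a spectrum, not the DFTM implementation: the function name `filter_spectrum`, the square (per-axis) band definition, and the choice to leave the DC term unscaled are all hypothetical.

```python
import numpy as np


def filter_spectrum(image: np.ndarray, band_ratio: float = 1.0, scale: float = 1.0) -> np.ndarray:
    """Keep the lowest `band_ratio` share of frequencies per axis and scale the non-DC spectrum."""
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Normalized frequency coordinates in [-0.5, 0.5) along each axis.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    # Per-axis "radius" in [0, 1], so band_ratio = 1.0 keeps the full square spectrum.
    radius = np.maximum(np.abs(fy), np.abs(fx)) / 0.5
    keep = radius <= band_ratio
    gain = np.where(radius > 0, scale, 1.0)  # DC term (mean brightness) is left unscaled
    filtered = spectrum * keep * gain
    return np.fft.ifft2(np.fft.ifftshift(filtered)).real


# Hypothetical usage on random data: full band with unit scale is the identity,
# while retaining 25% of the band removes high-frequency energy.
rng = np.random.default_rng(0)
patch = rng.random((16, 16))
full = filter_spectrum(patch, band_ratio=1.0, scale=1.0)
low = filter_spectrum(patch, band_ratio=0.25, scale=1.0)
print(np.allclose(full, patch), float(low.var()) < float(patch.var()))
```

Under these assumptions, shrinking `band_ratio` discards detail that cannot be recovered afterwards, which is consistent with the observation that the full spectrum performs best, while `scale` values away from 1.0 either exaggerate or damp all non-DC content.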
3.4. Ablation Study
3.4.1. Module Performance Test
To systematically evaluate the individual contributions and interactive gains of the DFTM, SCSA, and ASSMamba in the super-resolution task, we conducted a hierarchical ablation study on the AID and WHU-RS19 datasets, covering three configurations: single-module, pairwise combinations, and full integration of all three modules. MambaIRv2 was used as the baseline (Table 7).
First, under the single-module configuration (Methods A–C), none of the three modules surpasses the baseline in PSNR. Taking the AID dataset as an example, MambaIRv2 achieves a PSNR of 28.96 dB and an SSIM of 0.7501 (Table 7). When the DFTM is used alone (Method A), PSNR drops to 28.62 dB while SSIM increases to 0.7612. With SCSA alone (Method B) and ASSMamba alone (Method C), PSNR decreases to 28.55 dB and 28.48 dB, respectively, with SSIM values of 0.7598 and 0.7575. Notably, although all single-module configurations underperform the baseline in PSNR, they consistently achieve higher SSIM than MambaIRv2 (0.7501), with improvements ranging from 0.0074 to 0.0111. This indicates that while individual modules fail to improve the peak signal-to-noise ratio when used independently, they enhance structural similarity, demonstrating a positive effect on preserving visual structure.
Second, under the two-module combination configurations (Methods D–F), all combinations outperform the baseline MambaIRv2 in SSIM, while only Method D also exceeds it in PSNR. On the AID dataset, Method D (DFTM + SCSA) achieves 29.03 dB and 0.7665, yielding a PSNR improvement of 0.07 dB and an SSIM improvement of 0.0164 over the baseline. Method E (DFTM + ASSMamba) reaches 28.91 dB and 0.7643, with PSNR slightly lower than the baseline by only 0.05 dB but SSIM substantially higher by 0.0142. Method F (SCSA + ASSMamba) achieves 28.85 dB and 0.7627, with a PSNR 0.11 dB below the baseline while SSIM remains notably higher by 0.0126. On the WHU-RS19 dataset, Method D attains a PSNR of 28.02 dB, surpassing the baseline of 27.85 dB, with SSIM increasing from 0.7210 to 0.7221. Overall, two-module combinations maintain the SSIM advantage over the baseline and bring PSNR to within 0.11 dB of the baseline or above it, a clear recovery from the single-module configurations that suggests synergistic effects among the modules.
Third, when all three modules are enabled simultaneously (Method G), the model significantly outperforms MambaIRv2 across all metrics. On the AID dataset, PSNR reaches 29.47 dB, an improvement of 0.51 dB over the baseline, while SSIM reaches 0.7708, an improvement of 0.0207. On the WHU-RS19 dataset, PSNR reaches 28.05 dB, an improvement of 0.20 dB, and SSIM reaches 0.7250, an improvement of 0.0040. These results fully demonstrate that the complete integration of the DFTM, SCSA, and ASSMamba successfully surpasses the original baseline, achieving comprehensive performance advantages over MambaIRv2 through the joint effects of frequency-domain enhancement, semantic chunking, and bidirectional state-space modeling.
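The gains discussed in this subsection can be recomputed directly from the AID values quoted in the text. The short script below does only that arithmetic; the dictionary layout and variable names are illustrative, and the numbers are copied from the paragraphs above rather than from Table 7 itself.

```python
# AID x4 results quoted in the text: (PSNR in dB, SSIM).
baseline = (28.96, 0.7501)  # MambaIRv2
aid_results = {
    "A (DFTM)": (28.62, 0.7612),
    "B (SCSA)": (28.55, 0.7598),
    "C (ASSMamba)": (28.48, 0.7575),
    "D (DFTM + SCSA)": (29.03, 0.7665),
    "E (DFTM + ASSMamba)": (28.91, 0.7643),
    "F (SCSA + ASSMamba)": (28.85, 0.7627),
    "G (all three)": (29.47, 0.7708),
}

# Difference of each configuration relative to the baseline.
deltas = {
    name: (round(psnr_db - baseline[0], 2), round(ssim - baseline[1], 4))
    for name, (psnr_db, ssim) in aid_results.items()
}

for name, (d_psnr, d_ssim) in deltas.items():
    print(f"{name:<22} dPSNR = {d_psnr:+.2f} dB   dSSIM = {d_ssim:+.4f}")

# Configurations whose PSNR exceeds the baseline.
above_baseline = [name for name, (d_psnr, _) in deltas.items() if d_psnr > 0]
print("PSNR above baseline:", above_baseline)
```

On these figures, every configuration improves SSIM, whereas only Methods D and G exceed the baseline in PSNR.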
3.4.2. The Performance of SCSA
To further quantify the superiority of the SCSA module, we designed a comparative experiment on the AID dataset, replacing the original window-based multi-head self-attention (Window MHSA) in both MambaIRv2 and DFSMamba with SCSA, and evaluated the resulting performance changes under the ×4 super-resolution task (Table 8).
As shown in Table 8, in the MambaIRv2 baseline model, replacing Window MHSA with SCSA improves PSNR from 28.96 dB to 29.18 dB, a gain of 0.22 dB, and SSIM from 0.7501 to 0.7586, a gain of 0.0085. This demonstrates that even without the introduction of frequency-domain enhancement (DFTM) and bidirectional state-space modeling (ASSMamba), SCSA, by virtue of its dynamic semantic chunking and sparse connection mechanisms, can still more effectively capture semantic dependencies in images and mitigate the semantic truncation and feature loss issues caused by fixed window partitioning.
Furthermore, in the complete DFSMamba architecture (which already includes DFTM and ASSMamba), replacing the original Window MHSA with SCSA improves PSNR from 29.21 dB to 29.47 dB, a gain of 0.26 dB, and SSIM from 0.7602 to 0.7708, a gain of 0.0106. This gain is slightly larger than the replacement gain in MambaIRv2 (0.22 dB/0.0085), indicating a positive interaction between SCSA, the DFTM, and ASSMamba: frequency-domain enhancement provides a more stable global feature foundation, bidirectional state-space modeling expands the scope of semantic perception, and SCSA, on this basis, achieves finer semantic chunking and sparse attention computation. The joint integration of the three modules further amplifies the advantages of SCSA.
3.5. Multi-Scale Fourier Transform Super-Resolution
To validate the effectiveness of the proposed DFTM, we conducted comparative experiments on three public remote sensing datasets (AID, WHU-RS19, and RSSCN7). The experiments were performed under the ×4 super-resolution task, evaluating the reconstruction performance of DFSMamba and the baseline model MambaIRv2 at five different input scales (16 × 16, 32 × 32, 64 × 64, 128 × 128, and 256 × 256).
The experimental results show that both DFSMamba and MambaIRv2 achieve optimal reconstruction performance at the 64 × 64 input scale (Table 9). Taking the AID dataset as an example, DFSMamba achieves a PSNR of 29.77 dB and an SSIM of 0.7815 at this scale, while MambaIRv2 achieves 29.05 dB and 0.7680, respectively. This phenomenon can be explained from the perspective of the balance between “local detail integrity” and “global structural controllability”: the 64 × 64 scale preserves the core texture and edge features of ground objects, avoiding the loss of global context caused by excessively small scales (e.g., 16 × 16 or 32 × 32), while also preventing the feature redundancy and computational overhead introduced by overly large scales (e.g., 128 × 128 or 256 × 256), thus achieving a favorable trade-off between information fidelity and model learning efficiency. Notably, when the input scale further increases from 64 × 64 to 128 × 128 and 256 × 256, the performance of both models declines. Taking the AID dataset as an example, the PSNR of MambaIRv2 drops from 29.05 dB at 64 × 64 to 27.52 dB at 256 × 256, a decrease of 1.53 dB, while the PSNR of DFSMamba drops from 29.77 dB to 28.13 dB, a decrease of 1.64 dB. The primary reason for this phenomenon is that as the input scale increases, the amount of spatially redundant information the model needs to process increases significantly, while the limited learning capacity makes it difficult to fully extract all effective features, and the difficulty of modeling long-range dependencies also rises.
Although both models exhibit a downward trend, DFSMamba consistently outperforms MambaIRv2 across all scales and datasets, retaining a clear margin even at the largest input scale. For example, at the 256 × 256 input scale on the AID dataset, the PSNR of DFSMamba (28.13 dB) is 0.61 dB higher than that of MambaIRv2 (27.52 dB); on the WHU-RS19 dataset at the same scale, the PSNR of DFSMamba (27.13 dB) leads MambaIRv2 (26.65 dB) by 0.48 dB. This advantage is also reflected in the SSIM metric, indicating that DFSMamba maintains a consistent lead in structural fidelity. We attribute DFSMamba’s sustained performance advantage across different scales primarily to the introduction of the DFTM module. Specifically, as a pure spatial-domain model, MambaIRv2 relies primarily on sequential scanning in the pixel space for state-space modeling, making it difficult to simultaneously recover high-frequency details and maintain global structure as the input scale increases. In contrast, the DFTM module achieves spatial–frequency collaborative modeling through the following two approaches.
On the one hand, the DFTM maps the image from the spatial domain to the frequency domain using the discrete Fourier transform, where each frequency coefficient covers all image pixels, thereby establishing a global, lossless “receptive field.” This characteristic enables the model to explicitly enhance high-frequency features such as road edges and building contours without relying on gradually expanding receptive fields in the spatial domain. On the other hand, the DFTM complements Mamba’s SS2D (Selective Scan 2D) mechanism: SS2D is responsible for long-range dependency modeling in the spatial domain, while the DFTM provides global structural constraints from the frequency domain. Their synergistic effect enables the model to effectively suppress interference from redundant information and maintain the ability to perceive global structural integrity when processing large-scale inputs.
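The statement that each frequency coefficient covers all image pixels is a general property of the discrete Fourier transform and can be checked numerically. The sketch below perturbs a single pixel and counts how many DFT coefficients change, contrasted with how many outputs of a 3 × 3 box filter change. The image size, pixel location, and the helper `box3` are arbitrary illustrative choices; nothing here reproduces the DFTM itself.

```python
import numpy as np

size = 32
base = np.zeros((size, size))
perturbed = base.copy()
perturbed[5, 7] = 1.0  # change a single interior pixel

# The spectrum of a unit impulse has magnitude 1 at every frequency,
# so the single-pixel change shows up in all size * size coefficients.
spectrum_change = np.abs(np.fft.fft2(perturbed) - np.fft.fft2(base))
affected_coefficients = int(np.count_nonzero(spectrum_change > 1e-9))


def box3(img: np.ndarray) -> np.ndarray:
    """3x3 mean filter with zero padding, standing in for one small convolution layer."""
    padded = np.pad(img, 1)
    shifted = [padded[dy:dy + size, dx:dx + size] for dy in range(3) for dx in range(3)]
    return sum(shifted) / 9.0


# The same change reaches only the 3x3 neighborhood after one local filter.
local_change = np.abs(box3(perturbed) - box3(base))
affected_pixels = int(np.count_nonzero(local_change > 1e-9))

print("DFT coefficients affected:", affected_coefficients)
print("3x3 filter outputs affected:", affected_pixels)
```

A single transform thus couples every pixel to every coefficient, whereas a stack of local filters needs many layers to reach the same extent, which is the sense in which the frequency-domain receptive field is described as global.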
DFSMamba maintains a consistent advantage across the AID, WHU-RS19, and RSSCN7 datasets, indicating that the effectiveness of the DFTM module does not depend on specific data distributions or land cover types. Notably, on the RSSCN7 dataset (which contains images under different seasons and weather conditions), DFSMamba achieves 27.42 dB/0.6530 at the 64 × 64 scale, higher than MambaIRv2’s 26.85 dB/0.6415 by 0.57 dB/0.0115, further supporting the robustness of spatial–frequency collaborative modeling under complex imaging conditions.