Article

DFSMamba: A Spatial–Frequency Collaborative Modeling Framework for Remote Sensing Image Super-Resolution

1 National Observation and Research Station of Geohazards, China University of Geosciences, Wuhan 430074, China
2 Three Gorges Research Center for Geo-Hazard, Ministry of Education, China University of Geosciences, Wuhan 430074, China
3 School of Earth Sciences, China University of Geosciences, Wuhan 430074, China
4 The College of Life and Environmental Science, Wenzhou University, Wenzhou 325035, China
5 School of Electronic Engineering, Naval University of Engineering, Wuhan 430033, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(12), 1910; https://doi.org/10.3390/rs18121910 (registering DOI)
Submission received: 9 April 2026 / Revised: 29 May 2026 / Accepted: 6 June 2026 / Published: 9 June 2026

Highlights

What are the main findings?
  • A spatial–frequency synergistic super-resolution network, DFSMamba, is proposed for remote sensing images, combining DFTM and ASSMamba.
  • DFSMamba achieves SOTA performance with fewer parameters and lower computation, excelling in edge and texture detail reconstruction.
What are the implications of the main findings?
  • The method provides a new solution to insufficient global receptive fields and weak high-frequency recovery in RS image SR.
  • DFTM and ASSMamba can be used as plug-and-play modules for other remote sensing image processing tasks.

Abstract

Existing single-image super-resolution methods for remote sensing images suffer from insufficient global receptive fields, weak high-frequency texture recovery, and excessive computational complexity. To address these issues, this paper proposes DFSMamba, a novel spatial–frequency collaborative modeling framework. First, Semantic Continuous-Sparse Attention enhances semantic perception through dynamic chunking and sparse connections while maintaining linear complexity, effectively alleviating the semantic truncation problem caused by fixed window partitioning. Second, the Adaptive State-Space Module employs parallel forward and backward state-space model branches to achieve bidirectional long-range dependency modeling and introduces an activation-guided feature fusion mechanism to adaptively enhance semantically relevant regions. Third, the Discrete Fourier Transform Module maps images to the frequency domain, establishes a global lossless receptive field, and explicitly enhances high-frequency details, compensating for the insufficient utilization of frequency-domain information in pure spatial-domain methods. Experiments on five public datasets demonstrate that DFSMamba outperforms mainstream CNN, Transformer, and Mamba-based methods across ×2 to ×4 scales. On the AID×3 task, it achieves a PSNR of 31.48 dB, exceeding MambaIRv2 by 1.07 dB. Ablation studies verify the positive synergistic effect of the three modules, with the full configuration achieving a PSNR improvement of 0.85 dB over the single-module setup. Fine-grained category, multi-scale input, and loss function experiments further confirm its robustness and generalization capability, particularly in edge and texture detail reconstruction.

1. Introduction

Remote sensing imagery serves as a critical data source for environmental surveys, disaster monitoring, scene analysis, and target detection. However, due to limitations in sensor hardware design, imaging conditions, and cost, the spatial resolution of remote sensing images often fails to meet the demands of fine-grained applications [1,2,3,4]. Upgrading satellite sensor hardware is prohibitively expensive and technically challenging, whereas super-resolution reconstruction technology enhances the edge and texture details of ground objects using low-cost computational methods. In doing so, it drives remote sensing information services toward more economical, intelligent, and widely accessible development [5,6,7]. Super-resolution techniques are primarily divided into two categories: Multi-Image Super-Resolution (MISR) and Single-Image Super-Resolution (SISR) [8]. MISR reconstructs high-resolution images by fusing multiple low-resolution images of the same scene captured from different viewpoints or at different times [9], while SISR relies solely on a single input image, offering greater deployment feasibility [10]. However, it is inherently a more challenging ill-posed inverse problem, requiring the reconstruction of missing high-frequency details using only the limited prior information available from a single image.
SRCNN [11], proposed by Dong et al., applied a three-layer convolutional neural network to super-resolution tasks, demonstrating the effectiveness of end-to-end learning. Subsequently, VDSR [12] extended network depth to 20 layers by introducing residual learning and a very high learning rate, while RCAN further proposed a channel attention mechanism, achieving state-of-the-art performance on natural image super-resolution tasks at the time. In the field of remote sensing image super-resolution, networks such as LGCNet [13], MSF [14], MRNN [15], and WTCRR [16] are all built upon residual networks, effectively addressing vanishing and exploding gradient problems through residual learning mechanisms. Methods based on the EDSR [17] architecture, such as S2PS [18], DGANet-ISE [19], and DRSEN [20], have significantly enhanced detail recovery and visual quality. CNN-based methods perform robustly in terms of parameter efficiency and local detail recovery. However, the inherent limited receptive field of CNNs makes it difficult for them to effectively capture global contextual information, which is particularly problematic for remote sensing images that contain complex scenes and large-scale correlations among ground objects.
The introduction of generative adversarial networks provided a new paradigm for super-resolution. SRGAN and ESRGAN generate realistic textures through perceptual loss and adversarial loss, while MSAGAN and MA-GAN have demonstrated promising performance in the remote sensing domain [21,22]. GAN-based methods offer advantages in perceptual quality but may introduce artifacts and spectral distortions. Diffusion models such as DDPM surpass GANs in image generation quality; however, their slow inference speed makes it difficult to meet real-time processing requirements. The Transformer architecture has rapidly gained attention due to its strength in modeling long-range dependencies through the self-attention mechanism. SwinIR [23] achieves a global receptive field while maintaining computational efficiency via the shifted window attention mechanism. In the field of remote sensing super-resolution, cutting-edge architectures such as CTNet [24], MHAN [25], HSENet [26], and ESTNet [27] have emerged successively, with HAUNet [28] achieving outstanding performance on four public remote sensing datasets. However, the computational complexity of Transformer’s self-attention mechanism grows quadratically with image size, introducing significant performance bottlenecks and memory consumption issues when processing high-resolution remote sensing images.
In 2024, MambaIR [29] by Guo et al. demonstrated Mamba’s potential as an image restoration backbone. Vision Mamba [30] and VMamba [31] further optimized the unidirectional scanning mechanism, making them particularly suitable for remote sensing image processing. Methods such as PanMamba [32], RS-Mamba [33], and Hi-Mamba [34] have since explored Mamba’s application in the field of remote sensing super-resolution. However, existing Mamba-based methods primarily focus on enhancing spatial-domain features and generally overlook the critical role of frequency-domain information, which limits their ability to recover high-frequency details such as edges and textures. Since 2025, HIMOSA [35] has proposed an efficient super-resolution framework based on a hierarchical mixture of sparse attention, achieving fast inference while maintaining reconstruction performance; FMSR [36] has introduced a feature-grouped multi-scale network that reduces parameter count through grouped convolution and incorporates an edge enhancement module to improve visual quality.
Despite the significant progress made in remote sensing image super-resolution reconstruction, current research still faces the following key challenges. The first is the balance between global and local information. CNNs have limited receptive fields and struggle to capture global context [37], while Transformers can model long-range dependencies but suffer from excessively high computational complexity [38]. Identifying how to achieve effective global–local information fusion while maintaining linear computational complexity remains a core challenge. The second is insufficient recovery of high-frequency details [39]. Existing methods mainly focus on spatial-domain feature enhancement and underutilize frequency-domain information [40], resulting in limited recovery of high-frequency details such as edges and textures, which is particularly prominent in small-target reconstruction.
To address the above challenges, this paper proposes DFSMamba, a discrete Fourier transform super-resolution reconstruction neural network based on the Mamba architecture. We design a plug-and-play Discrete Fourier Transform Mechanism (DFTM) and Adaptive State-Space Mamba (ASSMamba) modules to synergistically integrate the spatial–frequency dual domains, efficiently fusing local and global information for collaborative modeling. A discrete Fourier dynamic activation weighting mechanism is introduced to guide low-frequency components in representing overall structure and contours while allowing high-frequency components to precisely capture edge and texture details, thereby achieving semantically adaptive global context capture with linear computational complexity. Experimental results on five public remote sensing datasets—WHU-RS19 [41], DOTA [42], RSSCN7 [43], AID [44], and NWPU VHR-10 [45]—demonstrate that DFSMamba outperforms existing state-of-the-art methods in super-resolution reconstruction quality. Our study provides a novel solution for remote sensing image super-resolution reconstruction.

2. Methodology

2.1. The Framework of the Proposed Model

In this study, we propose DFSMamba, a novel network that integrates spatial- and frequency-domain information to improve the quality of remote sensing image super-resolution (Figure 1). First, this study adopts MambaIRv2 [46] as the backbone architecture, leveraging the SS2D mechanism to efficiently capture global spatial correlations among ground objects, thereby alleviating issues related to discontinuous local textures while avoiding the inherent limitations of causal modeling in Mamba. The core advantage of the SS2D mechanism lies in its conversion of three-dimensional image data into two-dimensional sequences, which reduces computational complexity, leading to improved processing speed and more accurate detail preservation.
Second, this study introduces the DFTM to integrate frequency-domain analysis into the Mamba architecture, addressing persistent issues such as limited receptive fields and high color reconstruction errors. This module performs frequency-domain decoupling to explicitly enhance high-frequency components while preserving low-frequency stability. This process aids in restoring large-scale feature contours, thereby significantly advancing remote sensing image super-resolution. Furthermore, through integration with the Fast Fourier Transform, the module’s theoretical complexity is reduced from $O(n^2)$ to $O(n \log n)$, ensuring its efficiency in large-scale SISR applications. Unlike FMSR [36], which primarily introduces FFT-derived global frequency statistics to spatial features as an auxiliary complement, the DFTM explicitly separates amplitude and phase components, performs adaptive spectral–spatial modeling on both, and reconstructs an optimized representation via the Inverse Discrete Fourier Transform (IDFT). Therefore, the DFTM functions as a closed-loop spatial–frequency learning unit that is tightly coupled with the Mamba backbone, which is crucial for remote sensing SISR tasks that require simultaneous recovery of geometric structure and high-frequency textures.
Third, the original window-based multi-head self-attention in MambaIRv2 [46] is replaced by Semantic Continuous-Sparse Attention (SCSA) [47] to reduce the structural discontinuity caused by rigid window partitioning. In this paper, SCSA does not rely on external semantic segmentation labels or semantic priors. Instead, it is reformulated as a feature-similarity-driven and label-free semantic grouping mechanism that adaptively aggregates structurally correlated tokens according to feature statistics and contextual similarity. This design preserves the continuity of ground objects such as roads, buildings, rivers, and vegetation while avoiding the dependency on supervised semantic annotations.
Additionally, we design ASSMamba to achieve joint global and semantic modeling while maintaining linear complexity. By combining ASSMamba with the DFTM, this module further integrates the technical advantages of semantic global query capability and frequency-domain global filtering.

2.2. Discrete Fourier Transform Mechanism

Remote sensing images contain both global structures, such as coastlines, road networks, and building layouts, and high-frequency details, such as roof edges, object boundaries, and texture variations. Spatial-domain modeling alone may not fully recover such information, especially for large upscaling SISR. The DFTM provides a complementary spectral representation in which global and detailed components can be separately emphasized and then reconstructed back into the spatial domain. Figure 2 illustrates the frequency-domain characteristics used in the DFTM, including the amplitude spectrum, phase spectrum, and filtering results, which help explain how low-frequency structures and high-frequency details contribute differently to remote sensing image super-resolution.
Given an input feature tensor, the DFTM first applies normalization for numerical stability and then maps the feature into the frequency domain. The amplitude spectrum mainly reflects the strength of different frequency components, while the phase spectrum preserves structural and positional information that is essential for faithful spatial reconstruction. Modeling both components is necessary because enhancing amplitude alone may sharpen textures but can also introduce artifacts, whereas preserving phase information helps maintain object geometry and edge localization.
$$X_{Norm} = \mathrm{LayerNorm}(X_{in})$$
In the frequency-domain branch, $X_{Norm}$ is transformed into the frequency domain via the DFT and decomposed into an amplitude spectrum and a phase spectrum:
$$\hat{X}_{freq} = \mathrm{DFT}(X_{Norm}) = \hat{X}_{amp} + j\hat{X}_{phase}$$
Subsequently, both amplitude and phase spectra are fed into ASSMamba to model global dependencies among frequency responses. Although frequency-domain coefficients do not preserve local pixel adjacency in the same way as spatial features, each coefficient is globally coupled with the entire spatial feature map. Therefore, sequence modeling in the spectral domain captures correlations among global frequency patterns rather than local neighborhoods. The optimized amplitude and phase components are then recombined through IDFT, so the learned spectral modulation is explicitly projected back to the spatial domain. Specifically, ASSMamba does not treat adjacent frequency coefficients as adjacent pixels. Instead, it regards the flattened spectral representation as a sequence of globally coupled frequency responses. The forward and backward SSM branches capture complementary correlations among low-, middle-, and high-frequency components, while activation-guided fusion adaptively controls the strength of spectral enhancement. This mechanism provides an explainable path from spectral weighting to spatial reconstruction: the learned spectral responses modify amplitude and phase, IDFT maps them back to spatial features, and the residual spatial branch constrains the final output to remain structurally consistent.
$$\hat{X}_{amp}^{opt} = \mathrm{ASSMamba}(\hat{X}_{amp})$$
$$\hat{X}_{phase}^{opt} = \mathrm{ASSMamba}(\hat{X}_{phase})$$
The enhanced spectra are reconstructed to the spatial domain via IDFT:
$$X_{freq}^{spatial} = \mathrm{IDFT}(\hat{X}_{amp}^{opt} + j\hat{X}_{phase}^{opt})$$
In the spatial path, normalized features are fed into Spatial Mamba to capture long-range spatial context:
$$X_{spatial} = \mathrm{SpatialMamba}(X_{Norm})$$
Finally, $X_{freq}^{spatial}$ and $X_{spatial}$ are concatenated and fused via a 1 × 1 convolution:
$$X_{out} = \mathrm{Conv}_{1 \times 1}(\mathrm{Concat}(X_{freq}^{spatial}, X_{spatial}))$$
This design establishes a global receptive field through frequency-domain coefficients and enables explicit spectral decoupling. Low-frequency components mainly describe macroscopic contours and smooth regions, whereas high-frequency components correspond to edges, textures, and possible noise. To avoid texture distortion, the DFTM does not indiscriminately amplify all high-frequency responses. Instead, the activation-guided modeling in ASSMamba adaptively adjusts spectral responses according to their relevance to the reconstructed structure. In practice, this mechanism helps recover building boundaries, road lines, and object edges while reducing the risk of amplifying meaningless high-frequency noise.
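The frequency branch described above can be illustrated with a small one-dimensional sketch. It is not the authors' implementation: the learned ASSMamba modulation is replaced by a fixed, hypothetical gain (`high_gain`) applied above a hypothetical `cutoff`, the phase is passed through unchanged, and the spectrum is recombined in the standard polar form (amplitude times the complex exponential of the phase), which is an assumption about how the amplitude and phase terms in the equations are meant to be combined.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a 1D sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; only the real part is kept because the inputs are real."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
             for k in range(n)) / n).real
        for t in range(n)
    ]


def frequency_branch(signal, high_gain=1.5, cutoff=1):
    """Split into amplitude and phase, boost high frequencies, rebuild.

    high_gain and cutoff are hypothetical stand-ins for the learned
    modulation; they are not parameters taken from the paper.
    """
    n = len(signal)
    spectrum = dft(signal)
    amplitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    boosted = []
    for k, a in enumerate(amplitude):
        # Bins k and n - k carry the same physical frequency, so both get
        # the same gain and the reconstructed signal stays real.
        freq = min(k, n - k)
        boosted.append(a * high_gain if freq > cutoff else a)
    # Polar recombination: amplitude * exp(j * phase).
    rebuilt = [cmath.rect(a, p) for a, p in zip(boosted, phase)]
    return idft(rebuilt)


step_edge = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
identity = frequency_branch(step_edge, high_gain=1.0)
sharpened = frequency_branch(step_edge, high_gain=1.5)
```

With a gain of 1 the branch reproduces its input, and with a larger gain the edge is emphasized while the mean (the zero-frequency term) is unchanged. This mirrors the idea of enhancing high-frequency detail while keeping low-frequency structure stable.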

2.3. Semantic Continuous-Sparse Attention

The perceptual scope of the window-based multi-head self-attention in MambaIRv2 is constrained by predefined, rule-based windows, which hinders the capture of long-range dependencies and prevents the window partitioning from adapting to image content. In this study, we incorporate SCSA (Figure 3) into DFSMamba, reducing the computational complexity of traditional self-attention from quadratic to near-linear with respect to sequence length while preserving semantic integrity, which enables efficient processing of long sequences. SCSA was originally proposed for semantic style transfer, where semantic priors are introduced to maintain region consistency. In the proposed DFSMamba framework, however, no external semantic segmentation maps or semantic labels are used during either training or inference. Instead, SCSA is reformulated as a label-free semantic continuity mechanism that estimates implicit semantic relations directly from feature statistics, contextual similarity, and sparse token interactions. The term “semantic” in this work therefore refers to feature-level semantic consistency learned from the LR–HR reconstruction objective rather than to supervised semantic annotations. In contrast to its original style-transfer application, SCSA in DFSMamba functions as a semantic-aware feature enhancement module for high-fidelity reconstruction: it converts fixed, position-based local attention into content-adaptive aggregation, thereby improving the continuity of ground-object structures and the recovery of fine textures without requiring semantic labels or an additional segmentation network.
The Semantic Continuous Attention (SCA) mechanism is a core module of the SCSA framework. It is designed to address two issues in attention-based arbitrary style transfer methods: semantic region style discontinuity and style inconsistency within the same semantic region. Its main idea is to eliminate interference caused by image structure and focus solely on transferring the holistic style features of semantic regions, ensuring that adjacent parts within the same semantic region in the generated image exhibit coherent style.
In implementation, SCA uses feature embeddings extracted from the content image as queries (Q1), while structurally correlated feature embeddings from the reference feature space are used as keys (K1) and values (V1). Since these feature embeddings mainly encode structural and contextual consistency rather than explicit category labels, tokens with similar structural characteristics tend to share similar attention distributions. This mechanism enables continuous regions with correlated textures and structures to maintain coherent feature aggregation during reconstruction. This fundamentally prevents style discontinuities caused by structural differences. SCA further applies a G1 modulation operation to set attention weights between different semantic categories to negative infinity, retaining only associations within the same semantic category. As a result, attention is confined to the target semantic region. After normalization by the Softmax function, each query point evenly attends to all continuous key points within the same semantic region, thereby accurately capturing the holistic style features of that region from the style image, such as color tone and global texture.
The Semantic Sparse Attention (SSA) mechanism is another key module within the SCSA framework. SSA is primarily designed to address the issue of style texture blurring caused by traditional weighted fusion, with the goal of recovering local texture details from the style image. It is intended to focus on the most similar feature points within the structure of the same semantic region, enabling accurate texture transfer. In implementation, SSA employs Semantic Adaptive Instance Normalization (S-AdaIN) to process the encoded features of the content image, removing interference from the original color style and obtaining purer content structure features. These purified content structure features are then used as queries (Q2), while the encoded features of the style image serve as keys (K2) and values (V2). Since the encoded features contain image structural information, the initial attention map can capture feature correlations based on local structure. Through sparse modulation, SSA retains attention weights only between each query point and the most structurally relevant key points within feature-consistent regions while suppressing unrelated responses. This mechanism allows each query token to interact only with sparse yet highly correlated feature representations, thereby improving texture fidelity and preserving local structural details.
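The two modulation steps described for SCA and SSA can be sketched on a handful of tokens. This is a minimal illustration only. The integer group ids are a hypothetical stand-in for whatever feature-similarity grouping the model derives internally, the dot product is an assumed similarity measure, and `top_k` is an illustrative parameter rather than a value from the paper.

```python
import math


def softmax(scores):
    """Softmax that gives zero weight to entries equal to -inf.

    Assumes at least one score is finite.
    """
    peak = max(s for s in scores if s != -math.inf)
    exps = [math.exp(s - peak) if s != -math.inf else 0.0 for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, values):
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def continuous_attention(query_group, key_groups, values):
    """SCA-style step: attend evenly to every key in the query's group."""
    # Keys from other groups are masked with -inf before the softmax.
    scores = [0.0 if g == query_group else -math.inf for g in key_groups]
    return weighted_sum(softmax(scores), values)


def sparse_attention(query, query_group, keys, key_groups, values, top_k=1):
    """SSA-style step: keep only the top_k most similar keys in the group."""
    scores = [dot(query, k) if g == query_group else -math.inf
              for k, g in zip(keys, key_groups)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:top_k])
    sparse = [s if i in kept else -math.inf for i, s in enumerate(scores)]
    return weighted_sum(softmax(sparse), values)


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
key_groups = [0, 0, 1]
values = [[2.0, 0.0], [4.0, 0.0], [100.0, 0.0]]

region_feature = continuous_attention(0, key_groups, values)
texture_feature = sparse_attention([1.0, 0.0], 0, keys, key_groups, values)
```

In this toy case the continuous step averages the two values that share the query's group and ignores the third, while the sparse step selects only the single most similar key in that group.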

2.4. Adaptive Spectral–Spatial Mamba

ASSMamba unifies spectral awareness and spatial context modeling within a compact state-space framework. Although bidirectional scanning has been explored in Vision Mamba and VMamba, these approaches typically aggregate forward and backward outputs via fixed summation or concatenation. In contrast, ASSMamba leverages activation latent vectors to dynamically emphasize relevant semantic and structural features while suppressing redundant noise responses. As a result, it offers two key advantages: first, linear computational complexity with respect to the flattened sequence length, making it suitable for high-resolution remote sensing images; second, content-adaptive global modeling, which improves semantic selectivity compared to non-adaptive feature fusion. These properties make ASSMamba particularly well suited for complex remote sensing scenes characterized by mixed textures and scale variations.
After input normalization and linear projection, the context-aware feature map is flattened into a 1D sequence, $x \in \mathbb{R}^{N \times C}$ with $N = H \times W$, and passed through two parallel 1D convolutional layers to capture local sequential patterns, yielding forward and backward sequential features $x_f$ and $x_b$. A key innovation of ASSMamba is the adoption of parallel forward and backward state-space model (SSM) branches to enable bidirectional long-range dependency modeling:
$$h_f(t) = A_f h_f(t-1) + B_f x_f(t), \quad y_f(t) = C_f h_f(t)$$
$$h_b(t) = A_b h_b(t+1) + B_b x_b(t), \quad y_b(t) = C_b h_b(t)$$
where $A$, $B$, and $C$ denote the state transition, input, and output matrices, respectively, and $h(t)$ is the hidden state at time step $t$.
Unlike unidirectional SSM designs, the backward branch processes the sequence in reverse order, enabling the model to access complementary contextual information. The activation-guided feature fusion mechanism of ASSMamba then combines the forward and backward SSM outputs with a latent activation vector projected from the input feature. This vector behaves as a content-dependent gate: in structurally complex regions such as airports, dense residential blocks, and storage tanks, it tends to emphasize high-frequency and long-range responses; in smoother regions such as grassland, river, or bare land, it suppresses redundant oscillations and avoids unnecessary texture hallucination. This adaptive fusion is more flexible than fixed addition or concatenation of bidirectional SSM outputs.
$$Y = (y_f + y_b) \odot \sigma(Z)$$
where the element-wise multiplication performs content-adaptive modulation and the nonlinear activation function constrains the fusion weights. The modulation is applied to both spatial- and frequency-domain features. In the frequency branch, the learned activation distribution adjusts the amplitude-phase responses before IDFT, so the final spatial reconstruction receives enhanced but controlled spectral information. This mapping explains why frequency-domain modeling with ASSMamba does not necessarily break spatial coherence: spatial structure is restored through phase-preserving inverse transformation and residual spatial-path fusion.
Finally, a linear layer projects the fused features back to the original channel dimension, and a residual connection is applied to preserve low-level details and stabilize gradient flow:
$$X_{out} = \mathrm{Linear}_{O}(Y) + X_{in}$$
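The bidirectional recurrences, gated fusion, and residual connection above can be traced with a scalar toy model. The sketch makes several simplifying assumptions that are not taken from the paper: the state, input, and output parameters are fixed scalars (`a`, `b`, `c`) shared by both branches instead of learned, input-dependent matrices, the two branch outputs are summed before gating, the gate input `z` is supplied directly instead of being projected from the input feature, and the output projection is the identity.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def forward_scan(x, a, b, c):
    """h(t) = a*h(t-1) + b*x(t), y(t) = c*h(t), scanned left to right."""
    h, out = 0.0, []
    for value in x:
        h = a * h + b * value
        out.append(c * h)
    return out


def backward_scan(x, a, b, c):
    """h(t) = a*h(t+1) + b*x(t), y(t) = c*h(t), scanned right to left."""
    return forward_scan(x[::-1], a, b, c)[::-1]


def gated_bidirectional_block(x, z, a=0.5, b=1.0, c=1.0):
    """Fuse both scans, gate with sigmoid(z), then add the residual input."""
    y_f = forward_scan(x, a, b, c)
    y_b = backward_scan(x, a, b, c)
    # Assumed fusion: sum of the two branches, modulated element-wise.
    fused = [(f + bk) * sigmoid(g) for f, bk, g in zip(y_f, y_b, z)]
    # Identity output projection (an assumption) plus residual connection.
    return [f + xi for f, xi in zip(fused, x)]


impulse_at_end = [0.0, 0.0, 1.0]
seen_forward = forward_scan(impulse_at_end, 0.5, 1.0, 1.0)
seen_backward = backward_scan(impulse_at_end, 0.5, 1.0, 1.0)
output = gated_bidirectional_block([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

An impulse at the end of the sequence is invisible to earlier positions in the forward scan but reaches them through the backward scan, which is the complementary context the bidirectional design is meant to provide. Each scan visits every element once, so the cost grows linearly with sequence length.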
For hyperparameter selection, the projection dimension, SSM state dimension, and activation configuration are kept identical across datasets and upscaling factors. This ensures a fair comparison with the backbone, avoids dataset-specific tuning, and means that performance differences mainly come from the proposed modules rather than from an extensive parameter search. A compact sensitivity analysis on the projection dimension and activation scaling parameter is included in the experimental section to further verify that DFSMamba is not overly sensitive to any single hyperparameter choice.

3. Experiment and Results

3.1. Experimental Settings

3.1.1. Datasets

To comprehensively evaluate the performance of DFSMamba, we conducted experiments on five representative datasets, namely, WHU-RS19 [41], DOTA [42], RSSCN7 [43], AID [44], and NWPU VHR-10 [45]. In the test phase, the original high-resolution images served as validation data, while the corresponding low-resolution images were generated via bicubic downsampling at factors of 2, 3, and 4. Although bicubic downsampling inevitably removes part of the high-frequency information, this setting adheres to standard SISR experimental protocols and ensures fair comparison with existing methods.
The WHU-RS19 dataset contains 1005 remote sensing images, each with a size of 600 × 600 pixels and a maximum spatial resolution of 0.5 m/pixel. The dataset covers 19 distinct land cover categories, including airports, bridges, mountains, and rivers. The DOTA dataset is sourced from multiple sensors, such as Google Earth, the Jilin-1 satellite, and the Gaofen-2 satellite. Image sizes vary from 800 × 800 to 4000 × 4000 pixels. The RSSCN7 dataset comprises 2800 images spanning seven typical land cover categories: grass, farmland, industry, river/lake, forest, residential area, and parking lot. Each category includes 400 images, collected under different seasons and weather conditions. The AID dataset consists of 10,000 images, each 600 × 600 pixels in size, with a spatial resolution of 0.5 m/pixel. It covers 30 different land-use categories, such as airports, farmland, beaches, and deserts. The NWPU VHR-10 dataset includes 800 high-resolution remote sensing images, each measuring 600 × 600 pixels. It encompasses 10 diverse scene categories, including airports, beaches, and bridges.

3.1.2. Evaluation Indicators

The quantitative assessment of reconstructed image quality relied on two widely adopted metrics: peak signal-to-noise ratio (PSNR) and Structural Similarity Index (SSIM). PSNR quantifies the pixel-wise accuracy of the output relative to the high-resolution ground truth, whereas SSIM assesses perceptual similarity in terms of luminance, contrast, and structural information. For both metrics, higher scores reflect superior reconstruction performance. To ensure fairness, all competing methods were evaluated under identical conditions.
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MAX_i^2}{\mathrm{MSE}}\right)$$
where
$$\mathrm{MSE} = \frac{1}{N_W N_H} \sum_{i=0}^{N_W-1} \sum_{j=0}^{N_H-1} \left(Y_{i,j} - X_{i,j}\right)^2$$
SSIM measures the similarity between images to evaluate fidelity after transformations. Unlike PSNR, which relies exclusively on pixel-level errors, SSIM incorporates perceptual factors that better align with human visual perception.
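$$\mathrm{SSIM}(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)}$$
where $X$ and $Y$ are the reference and reconstructed images, with $\mu$ and $\sigma$ representing their respective means and standard deviations; $\sigma_{XY}$ is their covariance, $C_1$ and $C_2$ are small constants that stabilize the division, $N_W$ and $N_H$ are the image width and height, and $MAX_i$ is the maximum pixel value of the $i$-th spectral band.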
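As a worked illustration of the two metrics, the sketch below computes PSNR and a single-window SSIM on small grayscale images stored as nested lists. Two points are assumptions rather than details from the paper: SSIM is evaluated once over the whole image instead of being averaged over local windows, and the constants use the conventional defaults `k1 = 0.01` and `k2 = 0.03` scaled by the maximum pixel value.

```python
import math


def mse(reference, reconstructed):
    """Mean squared error between two equally sized 2D images."""
    height, width = len(reference), len(reference[0])
    total = sum((reconstructed[i][j] - reference[i][j]) ** 2
                for i in range(height) for j in range(width))
    return total / (height * width)


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(reference, reconstructed)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / error)


def global_ssim(reference, reconstructed, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM computed from whole-image statistics (single window)."""
    xs = [p for row in reference for p in row]
    ys = [p for row in reconstructed for p in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((p - mu_x) ** 2 for p in xs) / n
    var_y = sum((q - mu_y) ** 2 for q in ys) / n
    cov = sum((p - mu_x) * (q - mu_y) for p, q in zip(xs, ys)) / n
    c1, c2 = (k1 * max_value) ** 2, (k2 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [[0.0, 0.0], [0.0, 0.0]]
reconstructed = [[0.0, 0.0], [0.0, 10.0]]
example_psnr = psnr(reference, reconstructed)  # MSE is 100 / 4 = 25
```

For the example pair, one pixel differs by 10, so the MSE is 25 and the PSNR is 10 log10(255² / 25), roughly 34.15 dB.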

3.1.3. Parameter Settings

Based on prior work, we perform data augmentation by applying horizontal flipping and random rotations of 90°, 180°, and 270°. Additionally, during training, the original images are cropped into 64 × 64 patches for image SR. We initialize the weights of the ×3 and ×4 models using the pre-trained weights of the ×2 model, while halving the learning rate and total training iterations to reduce training time [48]. To ensure a fair comparison, we adjusted the training batch size for image SR to 32. We employed Adam [49] as the optimizer to train DFSMamba and other SOTA models, with β1 = 0.9 and β2 = 0.999, similar to previous training settings [50]. In this study, the total training iterations for all models were set to 50,000, with L1 loss selected as the default loss function [51]. All experiments were conducted on an Ubuntu 22.04 operating system environment, with models implemented using the PyTorch 2.7.0 framework. The experimental hardware consisted of a workstation equipped with two NVIDIA GeForce RTX 3090 GPUs.
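The augmentation described above (random 64 × 64 crops, horizontal flips, and rotations by multiples of 90°) can be sketched as follows. The flip probability of 0.5 and the helper names are assumptions. The sketch also handles a single image only, whereas an actual SR pipeline would need to crop and transform the low-resolution and high-resolution patches in alignment.

```python
import random


def rotate90(patch):
    """Rotate a 2D list by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*patch)][::-1]


def flip_horizontal(patch):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in patch]


def random_crop(image, size, rng):
    """Cut a size x size patch at a random position inside the image."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, size=64, rng=None):
    """Random crop, optional horizontal flip, then 0 to 3 quarter turns."""
    rng = rng or random.Random()
    patch = random_crop(image, size, rng)
    if rng.random() < 0.5:  # assumed flip probability
        patch = flip_horizontal(patch)
    for _ in range(rng.choice([0, 1, 2, 3])):
        patch = rotate90(patch)
    return patch


image = [[row * 100 + col for col in range(100)] for row in range(80)]
patch = augment(image, size=64, rng=random.Random(0))
```

Because the patches are square, every combination of flip and rotation keeps the 64 × 64 shape, so the augmented patches can be batched without further resizing.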

3.2. Comparison with SOTA Models

In this study, DFSMamba was evaluated against representative CNN-, Transformer-, and Mamba-based SISR methods, including HSENet [26], TransENet [52], OmniSR [53], SwinFIR [54], SwinIR [23], TTST [55], DAT [56], MambaIR [29], MambaIRv2 [46], HDI-PRNet [57], and MAT [58]. For numerical comparison, only results obtained under the same degradation setting and evaluation protocol are reported because directly mixing results from different training settings may lead to unfair conclusions.
CNN methods (HSENet, OmniSR) extract features based on local receptive fields, gradually expanding the perceptual range through stacked convolutional layers. Their advantages lie in high computational efficiency, small parameter counts, and fast inference speed, making them suitable for real-time or lightweight applications. However, constrained by a fixed kernel size, CNNs struggle to capture long-range dependencies, leading to blurry reconstructions of large-scale structures. They also lack adaptability to remote sensing images with complex textures and large structural spans, resulting in limited high-frequency detail recovery. As shown in Table 1, on the WHU-RS19×2 task, HSENet achieves a PSNR of only 28.94 dB and an SSIM of 0.7616, significantly lower than Transformer and Mamba methods. On the AID×3 task, HSENet achieves 29.85 dB/0.8150, falling far behind DFSMamba (31.48 dB/0.8415). As shown in Table 2, Table 3 and Table 4, CNNs perform acceptably on homogeneous land cover types (e.g., AID “Bare land” with a PSNR of 37.60 dB) but are severely inadequate on complex categories: AID “Port” reaches only 25.88 dB/0.7389, and “Stadium” achieves 28.28 dB (compared to DFSMamba’s 29.02 dB); on WHU-RS19 “Port”, PSNR is only 17.57 dB, with an SSIM of 0.5051, barely reconstructing valid structures. In summary, CNNs are suitable for fast processing scenarios where precision requirements are low and the imagery is dominated by large homogeneous land cover areas.
Transformer methods (SwinIR, DAT, TTST, MAT) leverage self-attention mechanisms to achieve global receptive fields, capturing pixel dependencies at arbitrary distances, and their reconstruction quality significantly exceeds that of CNNs. As a representative, MAT achieves 34.45 dB/0.9122 on the AID×2 task and 30.80 dB/0.8280 on the AID×3 task, outperforming both CNNs and MambaIR but still falling behind DFSMamba. Transformers demonstrate clear advantages on texture-rich categories: on AID, “Beach” reaches a PSNR of 36.41 dB, “Desert” 40.31 dB, and “Stadium” 28.60 dB, all higher than CNNs and early Mamba methods. However, Transformers suffer from two major issues. First, the computational complexity of self-attention grows quadratically with the number of tokens n, i.e., O(n²), leading to high memory usage and slow inference speed when processing high-resolution images. Second, the window partitioning strategy lacks semantic adaptability, causing unstable reconstructions on certain categories. For example, on the RSSCN7 “Farmland” category, MAT achieves an SSIM of only 0.4920, lower than DFSMamba’s 0.4968; on the WHU-RS19 “Port” category, MAT reaches 18.07 dB, which, although better than CNNs’ values, is still 0.10 dB below DFSMamba’s 18.17 dB. This indicates that regular windows struggle to adapt to the diverse morphologies of remote sensing land cover, particularly in accurately capturing the semantic boundaries of slender or irregular structures.
Mamba methods (MambaIR, MambaIRv2), built on state-space models, achieve a global receptive field with linear complexity (O(n)) and strike a good balance between efficiency and quality. MambaIRv2 achieves 29.89 dB/0.7852 on the WHU-RS19×2 task and 30.41 dB/0.8130 on the AID×3 task, both lower than DFSMamba and MAT. At the category level, Mamba methods excel on homogeneous land cover (AID “Bare land”: 38.15 dB/0.9254; “Parking”: SSIM 0.9408). However, Mamba methods suffer from three key shortcomings. First, unidirectional scanning leads to information bias, limiting performance on categories that require bidirectional information fusion, such as AID “Stadium” (PSNR of only 28.28 dB vs. DFSMamba’s 29.02 dB) and “Viaduct” (SSIM of 0.8304 vs. DFSMamba’s 0.8367). Second, semantic adaptability is insufficient, resulting in limited detail recovery on texture-rich categories, such as RSSCN7 “Grasslands” (PSNR of 31.20 dB vs. DFSMamba’s 31.40 dB) and “Rivers” (SSIM of 0.8410 vs. DFSMamba’s 0.8452). Third, the underutilization of frequency-domain information restricts high-frequency detail recovery. Overall, Mamba methods perform well on homogeneous land cover but exhibit clear deficiencies on complex categories requiring bidirectional semantic perception and high-frequency detail recovery (Figure 4).
DFSMamba addresses the aforementioned limitations of existing methods by integrating three core innovations. SCSA enhances semantic perception through dynamic chunking and sparse connections while maintaining linear complexity. The ASSM adopts parallel bidirectional SSM branches to overcome unidirectional information bias and introduces an activation-guided fusion mechanism to adaptively enhance semantic regions. The DFTM establishes a global, lossless frequency-domain receptive field and explicitly enhances high-frequency details. In terms of performance, DFSMamba achieves the best or second-best results across all five datasets and three scaling factors. On the AID×3 task, it attains a PSNR of 31.48 dB, outperforming MAT (30.80 dB) by 0.68 dB and MambaIRv2 (30.41 dB) by 1.07 dB; its SSIM reaches 0.8415, exceeding MAT (0.8280) by 0.0135 and MambaIRv2 (0.8130) by 0.0285. As shown in Table 2, Table 3 and Table 4, in the fine-grained evaluation, DFSMamba achieves the best or second-best results across all categories on AID (30 categories), RSSCN7 (7 categories), and WHU-RS19 (19 categories).
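The difference between unidirectional and bidirectional scanning can be illustrated with a toy example. The sketch below is not the ASSM itself: it uses a fixed linear recurrence instead of a selective state-space scan, and the function names and the `decay` parameter are assumptions made for illustration only. It shows that a forward-only scan carries no information about an impulse to the positions before it, whereas summing a forward and a backward scan makes every position respond.

```python
def linear_scan(x, decay=0.5):
    """Toy recurrence h[t] = decay * h[t-1] + x[t] (assumed form, not a selective scan)."""
    h, out = 0.0, []
    for value in x:
        h = decay * h + value
        out.append(h)
    return out


def bidirectional_scan(x, decay=0.5):
    """Sum of a forward scan and a backward scan over the same sequence."""
    forward = linear_scan(x, decay)
    backward = linear_scan(x[::-1], decay)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# An impulse at index 3 of a 7-element sequence.
signal = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
forward_only = linear_scan(signal)
both = bidirectional_scan(signal)

print("forward only :", forward_only)  # positions before the impulse stay at zero
print("bidirectional:", both)          # every position responds to the impulse
```

In this toy setting the impulse position is counted by both branches, which is one reason a learned fusion step, rather than a plain sum, is a natural design choice.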
The superiority of DFSMamba becomes evident when examining specific categories, as summarized in the following three aspects. First, DFSMamba performs particularly well on texture-rich categories (Figure 4). On the AID dataset, in the “Beach” category, DFSMamba achieves a PSNR of 37.10 dB, exceeding MAT by approximately 0.69 dB and MambaIRv2 by 0.30 dB; in the “Desert” category, it reaches 40.79 dB, exceeding MAT by about 0.48 dB; in the “Stadium” category, it achieves 29.02 dB, exceeding MambaIRv2 by about 0.74 dB; and in the “Parking” category, its SSIM reaches 0.9444, significantly surpassing all compared methods. Second, DFSMamba also excels in categories with slender structures and irregular edges (Figure 5). On the AID dataset, in the “Port” and “Viaduct” categories, its PSNR/SSIM reach 26.27 dB/0.7436 and 29.56 dB/0.8367, respectively, outperforming other compared methods. On the WHU-RS19 dataset, in the “Port” category, DFSMamba achieves a PSNR of 18.17 dB and an SSIM of 0.5131, showing consistent improvements over MAT (18.07 dB/0.5116) and MambaIRv2 (18.02 dB/0.5111), which fully demonstrates the effectiveness of bidirectional semantic modeling for slender structure recovery. Third, DFSMamba exhibits strong generalization capability on categories with multi-season and multi-weather conditions. On the RSSCN7 dataset, DFSMamba achieves the best performance across all seven categories. For example, in the “Grasslands” category, its PSNR reaches 31.40 dB, exceeding MAT by about 0.15 dB; in the “Rivers” category, its SSIM reaches 0.8452, exceeding MambaIRv2 by about 0.0042; and in the “Factory” category, its PSNR reaches 25.72 dB, exceeding MAT by about 0.17 dB. These results demonstrate that DFSMamba can effectively resist interference from factors such as illumination and seasonal variations, possessing exceptional cross-scene generalization capability.
As shown in Table 5, for the ×4 upscaling task on the AID dataset, the three methods, MambaIRv2, FMSR, and DFSMamba, exhibit a consistent increasing trend in parameters, FLOPs, latency, and peak GPU memory usage. However, it is worth noting that the increased computational cost of DFSMamba is relatively modest and acceptable. Compared to MambaIRv2, DFSMamba increases parameters by only 0.84 M (~11.7%), FLOPs by 3.9 G (~9.1%), latency by 1.2 ms (~10.6%), and GPU memory by 119 MB (~9.2%). All increases remain around 10%, without any dramatic growth. Given that real-world hardware typically has some redundancy, such a marginal trade-off is reasonable, especially if DFSMamba delivers better reconstruction quality.
According to the sensitivity analysis results on the AID dataset for the ×4 upscaling task (Table 6), both the DFTM and ASSMamba demonstrate strong robustness and stability under various hyperparameter settings. In the frequency-band ratio experiment, as the retained frequency band gradually increases from 25% to the full spectrum, reconstruction quality consistently improves, with the full spectrum achieving the highest PSNR (29.47 dB) and SSIM (0.7708). This indicates that excessive frequency suppression degrades reconstruction performance. Regarding the spectral scaling adapter, a scaling factor of 1.0 yields the best performance (29.47 dB/0.7708), while values that are too high (1.5) or too low (0.5) lead to slight drops, suggesting that moderate scaling provides the optimal balance between high-frequency enhancement and artifact suppression. For the projection dimension of ASSMamba, increasing from C/4 to C steadily improves reconstruction accuracy; the difference between C/2 and C is modest, implying that larger feature capacity remains beneficial but with diminishing returns. In terms of activation temperature, the default setting of 1.0 achieves the best and most stable gating response (29.47 dB/0.7708), while deviating from this value results in minor performance degradation. Overall, these results validate that the DFTM and ASSMamba are reasonably stable across typical hyperparameter ranges.

3.3. Local Attribution Map Analysis

To validate the global modeling ability of DFSMamba, we conducted Local Attribution Map (LAM) analysis [59]. As illustrated in Figure 6, DFSMamba activates substantially broader and more structurally coherent regions during high-frequency detail reconstruction compared to baseline methods (e.g., MambaIR, MambaIRv2 and MAT). This observation suggests that the proposed spatial–frequency synergistic mechanism effectively enhances long-range dependency utilization, a critical capability for high-quality image reconstruction. Notably, the attribution regions of DFSMamba exhibit stronger alignment with object boundaries, texture-rich areas, and long-range structural components, indicating that the model successfully exploits wider contextual information during the reconstruction process. These findings corroborate the claim that ASSMamba adaptively modulates informative spatial–frequency responses according to scene content, rather than indiscriminately amplifying local high-frequency components.

3.4. Ablation Study

3.4.1. Module Performance Test

To systematically evaluate the individual contributions and interactive gains of the DFTM, SCSA, and ASSMamba in the super-resolution task, we conducted a hierarchical ablation study on the AID and WHU-RS19 datasets, covering three configurations: single-module, pairwise combinations, and full integration of all three modules. MambaIRv2 was used as the baseline (Table 7).
First, under the single-module configuration (Methods A–C), none of the three modules surpass the baseline in PSNR. Taking the AID dataset as an example, MambaIRv2 achieves a PSNR of 28.96 dB and an SSIM of 0.7501 (Table 7). When the DFTM is used alone (Method A), PSNR drops to 28.62 dB while SSIM increases to 0.7612. With SCSA alone (Method B) and ASSMamba alone (Method C), PSNR decreases to 28.55 dB and 28.48 dB, respectively, with SSIM values of 0.7598 and 0.7575. Notably, although all single-module configurations underperform the baseline in PSNR, they consistently achieve significantly higher SSIM than MambaIRv2 (0.7501), with improvements ranging from 0.0074 to 0.0111. This indicates that while individual modules fail to improve the peak signal-to-noise ratio when used independently, they effectively enhance structural similarity, demonstrating a positive effect on preserving visual structure.
Second, under the two-module combination configurations (Methods D–F), all combinations outperform the baseline MambaIRv2 in SSIM, and their PSNR either exceeds the baseline or falls within 0.11 dB of it. On the AID dataset, Method D (DFTM + SCSA) achieves 29.03 dB and 0.7665, yielding a PSNR improvement of 0.07 dB and an SSIM improvement of 0.0164 over the baseline. Method E (DFTM + ASSMamba) reaches 28.91 dB and 0.7643, with PSNR lower than the baseline by only 0.05 dB but SSIM substantially higher by 0.0142. Method F (SCSA + ASSMamba) achieves 28.85 dB and 0.7627, with a PSNR 0.11 dB below the baseline while SSIM remains notably higher by 0.0126. On the WHU-RS19 dataset, Method D attains a PSNR of 28.02 dB, surpassing the baseline of 27.85 dB, with SSIM increasing from 0.7210 to 0.7221. Overall, two-module combinations maintain the SSIM advantage over the baseline and narrow or close the PSNR gap left by the single-module configurations, with Method D exceeding the baseline PSNR on both datasets, which points to the emergence of synergistic effects among modules.
Third, when all three modules are enabled simultaneously (Method G), the model significantly outperforms MambaIRv2 across all metrics. On the AID dataset, PSNR reaches 29.47 dB, an improvement of 0.51 dB over the baseline, while SSIM reaches 0.7708, an improvement of 0.0207. On the WHU-RS19 dataset, PSNR reaches 28.05 dB, an improvement of 0.20 dB, and SSIM reaches 0.7250, an improvement of 0.0040. These results fully demonstrate that the complete integration of the DFTM, SCSA, and ASSMamba successfully surpasses the original baseline, achieving comprehensive performance advantages over MambaIRv2 through the joint effects of frequency-domain enhancement, semantic chunking, and bidirectional state-space modeling.
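The gains quoted in this subsection can be re-derived from the AID values stated in the text for Table 7. The short script below only performs that arithmetic; the dictionary layout and the `gains` helper are illustrative names, not part of the original study.

```python
# (PSNR in dB, SSIM) on AID, as quoted in the text for Table 7.
baseline = (28.96, 0.7501)  # MambaIRv2
configs = {
    "A: DFTM": (28.62, 0.7612),
    "B: SCSA": (28.55, 0.7598),
    "C: ASSMamba": (28.48, 0.7575),
    "D: DFTM + SCSA": (29.03, 0.7665),
    "E: DFTM + ASSMamba": (28.91, 0.7643),
    "F: SCSA + ASSMamba": (28.85, 0.7627),
    "G: all three": (29.47, 0.7708),
}


def gains(result, reference=baseline):
    """Return (PSNR gain, SSIM gain) of a configuration over the reference."""
    return (round(result[0] - reference[0], 2),
            round(result[1] - reference[1], 4))


deltas = {name: gains(values) for name, values in configs.items()}
for name, (d_psnr, d_ssim) in deltas.items():
    print(f"{name:20s} dPSNR = {d_psnr:+.2f} dB   dSSIM = {d_ssim:+.4f}")
```

Laid out this way, the pattern described above is easy to see: every configuration improves SSIM, while only Methods D and G improve PSNR over the baseline.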

3.4.2. The Performance of SCSA

To further quantify the superiority of the SCSA module, we designed a comparative experiment on the AID dataset, replacing the original window-based multi-head self-attention (Window MHSA) in both MambaIRv2 and DFSMamba with SCSA, and evaluated the resulting performance changes under the ×4 super-resolution task (Table 8).
As shown in Table 8, in the MambaIRv2 baseline model, replacing Window MHSA with SCSA improves PSNR from 28.96 dB to 29.18 dB, a gain of 0.22 dB, and SSIM from 0.7501 to 0.7586, a gain of 0.0085. This demonstrates that even without the introduction of frequency-domain enhancement (DFTM) and bidirectional state-space modeling (ASSMamba), SCSA, by virtue of its dynamic semantic chunking and sparse connection mechanisms, can still more effectively capture semantic dependencies in images and mitigate the semantic truncation and feature loss issues caused by fixed window partitioning.
Furthermore, in the complete DFSMamba architecture (which already includes DFTM and ASSMamba), replacing the original Window MHSA with SCSA improves PSNR from 29.21 dB to 29.47 dB, a gain of 0.26 dB, and SSIM from 0.7602 to 0.7708, a gain of 0.0106. This gain is slightly larger than the replacement gain in MambaIRv2 (0.22 dB/0.0085), indicating a positive interaction between SCSA, the DFTM, and ASSMamba—frequency-domain enhancement provides a more stable global feature foundation, bidirectional state-space modeling expands the scope of semantic perception, and SCSA, on this basis, achieves finer semantic chunking and sparse attention computation. The joint integration of the three modules further amplifies the advantages of SCSA.

3.5. Multi-Scale Fourier Transform Super-Resolution

To validate the effectiveness of the proposed DFTM, we conducted comparative experiments on three public remote sensing datasets (AID, WHU-RS19, and RSSCN7). The experiments were performed under the ×4 super-resolution task, evaluating the reconstruction performance of DFSMamba and the baseline model MambaIRv2 at five different input scales (16 × 16, 32 × 32, 64 × 64, 128 × 128, and 256 × 256).
The experimental results show that both DFSMamba and MambaIRv2 achieve optimal reconstruction performance at the 64 × 64 input scale (Table 9). Taking the AID dataset as an example, DFSMamba achieves a PSNR of 29.77 dB and an SSIM of 0.7815 at this scale, while MambaIRv2 achieves 29.05 dB and 0.7680, respectively. This phenomenon can be explained from the perspective of the balance between “local detail integrity” and “global structural controllability”: the 64 × 64 scale preserves the core texture and edge features of ground objects, avoiding the loss of global context caused by excessively small scales (e.g., 16 × 16 or 32 × 32), while also preventing the feature redundancy and computational overhead introduced by overly large scales (e.g., 128 × 128 or 256 × 256), thus achieving an optimal trade-off between information fidelity and model learning efficiency. Notably, when the input scale further increases from 64 × 64 to 128 × 128 and 256 × 256, the performance of both models declines to a certain extent. Taking the AID dataset as an example, the PSNR of MambaIRv2 drops from 29.05 dB at 64 × 64 to 27.52 dB at 256 × 256, a decrease of 1.53 dB, while the PSNR of DFSMamba drops from 29.77 dB to 28.13 dB, a decrease of 1.64 dB. The primary reason for this phenomenon is that as the input scale increases, the amount of spatial redundant information the model needs to process increases significantly, while the limited learning capacity makes it difficult to fully extract all effective features, and the difficulty of modeling long-range dependencies also rises.
Although both models exhibit a downward trend, DFSMamba consistently outperforms MambaIRv2 across all scales and datasets, demonstrating stronger robustness. For example, at the 256 × 256 input scale on the AID dataset, the PSNR of DFSMamba (28.13 dB) is 0.61 dB higher than that of MambaIRv2 (27.52 dB); on the WHU-RS19 dataset at the same scale, the PSNR of DFSMamba (27.13 dB) leads MambaIRv2 (26.65 dB) by 0.48 dB. This advantage is also reflected in the SSIM metric, indicating that DFSMamba maintains a consistent lead in structural fidelity. The fundamental reason for DFSMamba’s sustained performance advantage across different scales lies in the introduction of the DFTM module. Specifically, as a pure spatial-domain model, MambaIRv2 relies primarily on sequential scanning in the pixel space for state-space modeling, making it difficult to simultaneously recover high-frequency details and maintain global structure as the input scale increases. In contrast, the DFTM module achieves spatial–frequency collaborative modeling through the following two approaches.
On the one hand, the DFTM maps the image from the spatial domain to the frequency domain using the discrete Fourier transform, where each frequency coefficient covers all image pixels, thereby establishing a global, lossless “receptive field.” This characteristic enables the model to explicitly enhance high-frequency features such as road edges and building contours without relying on gradually expanding receptive fields in the spatial domain. On the other hand, the DFTM complements Mamba’s SS2D (Selective Scan 2D) mechanism: SS2D is responsible for long-range dependency modeling in the spatial domain, while the DFTM provides global structural constraints from the frequency domain. Their synergistic effect enables the model to effectively suppress interference from redundant information and maintain the ability to perceive global structural integrity when processing large-scale inputs.
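The idea of enhancing high-frequency content directly in the Fourier domain can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the DFTM: the gain here is a fixed scalar applied outside a centered low-frequency square, whereas the module described in the paper is part of a trained network. The function name and the `band_ratio` and `scale` parameters are hypothetical.

```python
import numpy as np


def enhance_high_frequencies(image, band_ratio=0.25, scale=1.5):
    """Scale Fourier coefficients outside a centered low-frequency square.

    band_ratio: fraction of each axis treated as low frequency (assumed).
    scale: gain applied to the remaining high-frequency coefficients (assumed).
    """
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # move DC to the center
    h, w = image.shape
    half_h, half_w = int(h * band_ratio / 2), int(w * band_ratio / 2)
    cy, cx = h // 2, w // 2

    gain = np.full((h, w), scale, dtype=float)
    # Leave the low-frequency block (including DC) untouched.
    gain[cy - half_h:cy + half_h + 1, cx - half_w:cx + half_w + 1] = 1.0

    # The gain mask is symmetric about DC, so the result is real up to rounding.
    return np.fft.ifft2(np.fft.ifftshift(spectrum * gain)).real


rng = np.random.default_rng(0)
image = rng.random((32, 32))
unchanged = enhance_high_frequencies(image, scale=1.0)
sharpened = enhance_high_frequencies(image, scale=1.5)

print("max |unchanged - image|:", np.abs(unchanged - image).max())
print("variance before / after:", image.var(), sharpened.var())
```

Because each modified coefficient spans the whole image, a single multiplication in the frequency domain affects every pixel, which is the sense in which the receptive field is global. A fixed gain such as this one also amplifies noise, which is consistent with the sensitivity to the scaling factor reported in Table 6.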
DFSMamba maintains a consistent advantage trend across the AID, WHU-RS19, and RSSCN7 datasets, indicating that the effectiveness of the DFTM module does not depend on specific data distributions or land cover types. Notably, on the RSSCN7 dataset (which contains images under different seasons and weather conditions), DFSMamba achieves 27.42 dB/0.6530 at the 64 × 64 scale, significantly higher than MambaIRv2’s 26.85 dB/0.6415, further validating the robustness of spatial–frequency collaborative modeling under complex imaging conditions.

4. Discussion

4.1. Loss Functions

In the field of image super-resolution reconstruction, the choice of loss function plays a decisive role in model performance. L1 loss, Charbonnier loss [60], and perceptual loss [61] have all demonstrated excellent performance in balancing pixel accuracy and visual subjective quality, among which L1 loss is widely adopted due to its simplicity and efficiency. However, whether L1 loss is always the optimal choice for super-resolution reconstruction tasks remains inconclusive and warrants further in-depth investigation. In light of this, this study adopts the DFSMamba model as the baseline and systematically evaluates L2 loss, mean squared error (MSE) loss, perceptual loss, Focal loss, weighted total variation (WeightedTV) loss, and Charbonnier loss as alternatives to L1 loss on two remote sensing image datasets, WHU-RS19 and AID, aiming to explore a superior loss function configuration scheme (Table 10).
As shown in Table 10, the DFSMamba model was used to evaluate the impact of seven different loss functions on remote sensing image super-resolution reconstruction over the WHU-RS19 and AID datasets. The experimental results indicate that Charbonnier loss achieves the best overall performance among all compared objectives.
Specifically, on the WHU-RS19 dataset, Charbonnier loss achieves a PSNR of 31.24 dB and an SSIM of 0.8191, outperforming the commonly used L1 loss (30.45 dB/0.8122) by 0.79 dB in PSNR and 0.0069 in SSIM. On the AID dataset, Charbonnier loss achieves a PSNR of 34.72 dB and an SSIM of 0.9152, slightly surpassing L1 loss (34.54 dB/0.9138) with improvements of 0.18 dB and 0.0014, respectively. These results indicate that Charbonnier loss provides a more stable optimization behavior and achieves a better balance between pixel fidelity and structural reconstruction quality.
It is worth noting that L2 loss and MSE loss also achieve competitive performance on the WHU-RS19 dataset, reaching PSNR values of 31.15 dB and 31.18 dB, respectively, both higher than the L1 baseline. However, on the AID dataset, their PSNR values decrease to 34.17 dB and 34.04 dB, respectively, both lower than L1 loss. This phenomenon suggests that the effectiveness of different loss functions is closely related to dataset characteristics and image distribution. In addition, Perceptual loss, Focal loss, and WeightedTV loss consistently underperform compared with L1 loss on both datasets. In particular, WeightedTV loss achieves an SSIM of only 0.8533 on the AID dataset, which is substantially lower than all other compared objectives.
The above experimental results lead to three main conclusions. First, Charbonnier loss demonstrates the best overall reconstruction performance on both datasets, particularly on WHU-RS19, where it significantly improves both PSNR and SSIM. As a differentiable and smooth approximation of L1 loss, Charbonnier loss alleviates optimization instability caused by sharp gradient variations near zero, thereby improving reconstruction stability and visual quality. Second, the performance of different loss functions varies across datasets, indicating that the selection of the optimization objective should be task-specific and data-dependent rather than universally fixed. Third, although L2 loss and MSE loss achieve relatively high PSNR values in some scenarios, their sensitivity to outliers may limit their robustness for complex remote sensing image super-resolution tasks.
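The smoothing property attributed to Charbonnier loss can be made concrete with its standard form, the mean of sqrt(d² + ε²) over pixel differences d. The sketch below uses plain Python lists; the value `eps=1e-3` is an assumed smoothing constant, since the setting used in the experiments is not stated in this section.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt(diff^2 + eps^2); eps is an assumed smoothing constant."""
    return sum(math.sqrt((p - t) ** 2 + eps ** 2)
               for p, t in zip(pred, target)) / len(pred)


def charbonnier_grad(diff, eps=1e-3):
    """Derivative of sqrt(diff^2 + eps^2) with respect to diff."""
    return diff / math.sqrt(diff ** 2 + eps ** 2)


pred, target = [0.20, 0.50, 0.90], [0.25, 0.50, 0.40]
print("L1         :", l1_loss(pred, target))
print("Charbonnier:", charbonnier_loss(pred, target))

# Near zero the gradient passes smoothly through 0 instead of jumping
# between -1 and +1 as the L1 subgradient does.
for diff in (-1e-4, 0.0, 1e-4, 0.5):
    print(f"diff = {diff:+.4f}  grad = {charbonnier_grad(diff):+.4f}")
```

For large residuals the gradient approaches ±1, so the loss behaves like L1 away from zero and differs from it mainly in the small-residual regime.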
It should be noted that the scope of this loss-function experiment remains limited. Since the comparison is conducted only under the current experimental setting and ×2 upscaling factor, the superiority of Charbonnier loss should be further validated across additional datasets and scaling factors before drawing a universal conclusion. Therefore, L1 loss is still retained as the default optimization objective for fair comparison with prior work, while more advanced perceptual and structure-aware objectives will be explored in future research.

4.2. Contributions and Characteristics

From a signal processing perspective, the SISR task is fundamentally a frequency extrapolation problem—the missing high-frequency components must be inferred from low-frequency observations. Traditional spatial-domain methods (including CNNs, Transformers, and Mamba) all implicitly learn the mapping from low to high frequencies within this framework, lacking explicit constraints on frequency-domain structure. The DFTM module introduces a differentiable mapping between spatial and frequency domains via the discrete Fourier transform, enabling joint modeling of the amplitude and phase spectra. The amplitude spectrum governs the energy distribution across frequency components, determining texture richness, while the phase spectrum preserves edge locations and structural information, determining geometric integrity. The joint modeling of both enables the network to explicitly decouple global structure (low-frequency dominant) from local details (high-frequency dominant), allowing separate constraints to be applied during optimization. The theoretical advantage of this design lies in the fact that each frequency-domain coefficient covers all pixels of the entire image, thus fundamentally solving the limited receptive field problem inherent in spatial-domain models by establishing a theoretically global, lossless receptive field without relying on stacked layers or attention mechanisms to expand the perceptual range.
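Two statements in this paragraph can be checked numerically: that amplitude and phase together carry the full image, and that every frequency coefficient depends on every pixel. The NumPy sketch below is only a demonstration of these properties of the discrete Fourier transform; the array size, the perturbed pixel location, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((8, 8))

# Amplitude/phase decomposition and exact reconstruction.
spectrum = np.fft.fft2(image)
amplitude, phase = np.abs(spectrum), np.angle(spectrum)
reconstructed = np.fft.ifft2(amplitude * np.exp(1j * phase)).real

# Perturb a single pixel and measure how every coefficient responds.
delta = 0.1
perturbed = image.copy()
perturbed[2, 5] += delta
coefficient_change = np.abs(np.fft.fft2(perturbed) - spectrum)

print("reconstruction error:", np.abs(reconstructed - image).max())
print("min / max coefficient change:",
      coefficient_change.min(), coefficient_change.max())
```

Changing one pixel by `delta` shifts all 64 coefficients by the same magnitude, because the transform of a single-pixel impulse has constant modulus. A spatial convolution with a small kernel, by contrast, would only affect outputs within the kernel footprint.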
The core advantage of Mamba-based methods lies in recursively modeling sequences through linear state equations, achieving linear computational complexity with respect to sequence length. However, standard Mamba processes each position sequentially and with equal weight, lacking the ability for adaptive modulation based on semantic information. The ASSMamba module enhances the semantic interpretability of state-space modeling at two levels. At the state equation level, ASE injects information from “unobserved” pixels into the output matrix C via a learnable prompt pool, enabling each time step’s state update to perceive global context and thereby elevating locally recursive modeling to a mechanism with global query capability. At the data layout level, SGN leverages feature-level similarity and contextual consistency to rearrange structurally correlated pixels into adjacent positions before flattening. Without introducing external semantic annotations, this pre-rearrangement shortens the effective semantic distance among highly related features and thereby alleviates the long-range dependency degradation problem in Mamba, where information propagation becomes difficult when correlated pixels are spaced too far apart in the flattened sequence.
The three core modules of DFSMamba each target different information bottlenecks in the super-resolution task, with their synergy manifesting as cascaded information enhancement and redundancy suppression. Specifically, the MambaIRv2 backbone handles long-range dependency modeling in the spatial domain, with its output features encoding global spatial correlations. SCSA further refines this information through semantic chunking, filtering out noise responses unrelated to semantics and enhancing the representational strength of semantic regions. The DFTM provides orthogonal constraints from the frequency domain: texture patterns that are difficult to distinguish in the spatial domain may exhibit distinct spectral characteristics in the frequency domain, so the DFTM’s frequency-domain enhancement supplies complementary discriminative information for spatial modeling. ASSMamba further modulates information flow via an activation-guided gating mechanism, adaptively weighting features according to their importance and preventing the propagation of ineffective features. There is no functional overlap among the three modules: SCSA is responsible for semantic structuring, the DFTM for frequency-domain structuring, and ASSMamba for dynamic modulation. This orthogonality ensures super-linear gains when the three modules are jointly used—the overall performance improvement exceeds the sum of individual gains. This is validated by the ablation study in Table 7: the best single-module PSNR is 28.62 dB (Method A), while the full three-module configuration (Method G) achieves 29.47 dB, an improvement of 0.85 dB that is larger than the sum of incremental gains from any two-module combination relative to single-module baselines.
DFSMamba largely preserves the low-complexity characteristics of the Mamba architecture, which is a core advantage over Transformer-based methods. The DFTM module is implemented via the fast Fourier transform with theoretical complexity of O(n log n), where n = H × W is the total number of pixels, far lower than the O(n²) of global self-attention. The SS2D mechanism performs state-space modeling by converting 3D image data into 2D sequences with complexity O(n). The complexity of SCSA is governed by the number of semantic categories K, and since K (typically 10–30) is much smaller than n, its complexity is approximately O(K²) ≈ O(1) with respect to n. Therefore, the overall complexity of DFSMamba is O(n log n), dominated by the DFTM’s Fourier transform. The experimental results in Table 5 are consistent with this theoretical expectation: compared to MambaIRv2 (complexity O(n)), DFSMamba increases FLOPs from 42.8 G to 46.7 G, an increase of only 9.1%, far smaller than the quadratic growth observed when extending Transformer methods from window attention to global attention. This demonstrates that the DFTM’s frequency-domain enhancement achieves substantial quality improvements at a modest computational cost, realizing a favorable trade-off between efficiency and performance.
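The gap between these growth rates can be made tangible with a back-of-the-envelope calculation. The script below counts abstract operations only (n, n log₂ n, and n²), ignoring constant factors and channel dimensions, so the numbers are illustrative rather than measured costs; the function name is an assumption. It also reproduces the 9.1% FLOPs increase from the two values quoted from Table 5.

```python
import math


def operation_counts(height, width):
    """Abstract operation counts for n = height * width tokens (constants ignored)."""
    n = height * width
    return {"linear": n, "fft": n * math.log2(n), "global_attention": n * n}


for side in (64, 128, 256):
    counts = operation_counts(side, side)
    ratio = counts["global_attention"] / counts["fft"]
    print(f"{side:3d} x {side:<3d}  n = {counts['linear']:6d}  "
          f"n log2 n = {counts['fft']:10.0f}  n^2 / (n log2 n) = {ratio:8.1f}")

# Relative FLOPs increase of DFSMamba over MambaIRv2 (values quoted from Table 5).
flops_increase = round((46.7 - 42.8) / 42.8 * 100, 1)
print(f"FLOPs increase: {flops_increase}%")
```

The ratio between quadratic and n log n growth itself grows with image size, which is why the difference matters most for the large inputs typical of remote sensing imagery.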
Traditional methods (e.g., SwinIR, MambaIRv2) adopt fixed window partitioning strategies. Although such designs reduce computational complexity, they inevitably fragment semantically continuous regions that span across windows, leading to a structural understanding breakdown for large-scale ground objects such as elongated roads, rivers, and airport runways. In this study, SCSA overcomes this limitation through dynamic chunking and sparse connection mechanisms, enabling adaptive grouping of image semantic content. As a result, attention computation is no longer constrained by geometric positions but is instead performed according to semantic categories. SCSA forms an orthogonal complement to the DFTM and ASSMamba: the DFTM provides a global, lossless frequency-domain receptive field; ASSMamba handles dynamic modulation of long-range dependencies; and SCSA restructures the information flow path from a semantic organization perspective, ensuring that object boundaries, irregular shapes, and small targets are neither oversmoothed nor truncated during reconstruction. It is precisely this “semantic-guided sparse attention” mechanism that enables DFSMamba to significantly outperform mainstream methods on fine-grained categories (e.g., ports, viaducts, stadiums), demonstrating its capability to guarantee both structural integrity and textural fidelity in complex remote sensing scenes.
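The fragmentation caused by fixed windows can be shown on a toy scene. The sketch below is not SCSA: it assumes per-pixel labels are already given, whereas the module described here forms its chunks dynamically from image content. The helper names and the 8 × 8 layout are hypothetical. It counts how many fixed tiles a thin horizontal "road" is split across and contrasts this with grouping pixels by label.

```python
def windows_touched(mask, window):
    """Count fixed window x window tiles that contain at least one marked pixel."""
    tiles = set()
    for row, line in enumerate(mask):
        for col, marked in enumerate(line):
            if marked:
                tiles.add((row // window, col // window))
    return len(tiles)


def group_by_label(labels):
    """Group flattened pixel indices by label (labels are assumed to be given)."""
    groups = {}
    width = len(labels[0])
    for row, line in enumerate(labels):
        for col, label in enumerate(line):
            groups.setdefault(label, []).append(row * width + col)
    return groups


# 8 x 8 toy scene: a one-pixel-wide "road" (label 1) along row 3.
labels = [[1 if row == 3 else 0 for _ in range(8)] for row in range(8)]
road_mask = [[label == 1 for label in line] for line in labels]

fragments = windows_touched(road_mask, window=2)  # tiles the road is split across
groups = group_by_label(labels)                   # the road stays in one group

print("fixed 2 x 2 windows containing road pixels:", fragments)
print("road pixels in a single label group:", len(groups[1]))
```

With window-based attention, the road pixels in different tiles cannot attend to each other within a layer, while label-based grouping lets all eight road pixels interact directly.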
The multi-scale input experiment (Table 9) reveals a further theoretical advantage of DFSMamba over MambaIRv2: cross-scale generalization capability. When the input scale increases from 64 × 64 to 256 × 256, MambaIRv2’s PSNR drops from 29.05 dB to 27.52 dB, a decrease of 1.53 dB, while DFSMamba drops from 29.77 dB to 28.13 dB, a decrease of 1.64 dB, maintaining a consistent absolute advantage. From an information-theoretic perspective, as the input scale increases, the spatial redundancy that a spatial-domain model must process grows quadratically, while the finite memory capacity of state-space models struggles to effectively compress this redundancy. In contrast, the frequency-domain transform in the DFTM possesses an energy compaction property—most of the energy in natural images is concentrated in low-frequency components, while high-frequency components are sparse. Consequently, even as the input scale increases, the growth in effective information in the frequency-domain representation is far smaller than the growth in spatial-domain pixels. By selectively enhancing and suppressing frequency-domain coefficients, the DFTM maintains global structural awareness without significantly increasing computational burden. This theoretical property endows DFSMamba with stronger robustness to input-scale variations, providing theoretical guarantees for its practical application to remote sensing images of varying resolutions.

4.3. Limitations and Future Work

Despite the promising results, DFSMamba still has several limitations that need to be acknowledged. First, when low-resolution images contain severe noise, clouds, haze, or compression artifacts, the high-frequency enhancement mechanism of the DFTM may amplify undesirable responses and generate artifacts. This issue is particularly prominent in low-quality remote sensing image processing scenarios. Second, although DFSMamba demonstrates a reasonable trade-off between accuracy and efficiency, in resource-constrained scenarios (e.g., onboard satellite platforms or low-power edge devices), the latency is expected to increase because the DFTM and ASSMamba modules introduce additional spectral transformation and bidirectional sequence modeling operations. Consequently, the model still requires further optimization, such as lightweight ASSMamba design, frequency-band pruning, model compression, and hardware-specific acceleration. Third, the current experimental validation primarily relies on images degraded by bicubic downsampling. The generalization capability to real-world complex degradation scenarios—such as sensor noise, atmospheric interference, compression artifacts, and motion blur—remains unclear and requires further investigation. Fourth, the loss function comparison is conducted only under the ×2 scale, and the superiority of Charbonnier loss should be further verified across more scales and datasets before drawing a general conclusion. Fifth, while excellent performance is achieved on five public datasets, cross-sensor generalization (e.g., OLI2MSI, Satlas) has not been systematically evaluated, as most datasets are sourced from a single sensor type (e.g., Google Earth).
Future research will focus on five main directions. First, dynamic thresholding and adaptive frequency filtering mechanisms will be introduced into the DFTM to suppress noise amplification and improve robustness on low-quality inputs. This will involve designing content-aware frequency selection strategies that distinguish meaningful high-frequency details from noise-induced artifacts. Second, cross-sensor and real-world degradation experiments will be added to evaluate generalization beyond bicubic downsampling. This includes testing on datasets such as OLI2MSI (Landsat-8 OLI and Sentinel-2 MSI) and Satlas, as well as constructing real-world degradation benchmarks with sensor noise, atmospheric interference, and compression artifacts. Third, the lightweight channel-adaptive mechanism of ASSMamba will be optimized for real-time and onboard deployment. Specific targets include compressing the model to under 5M parameters and achieving real-time inference on edge devices (e.g., NVIDIA Jetson series or embedded GPUs) through pruning, quantization, and knowledge distillation. Fourth, loss functions for image super-resolution, and for remote sensing image super-resolution in particular, remain underexplored. Candidates such as adversarial losses based on generative adversarial networks and cycle-consistency losses based on feature consistency deserve further investigation, and we will test a broader range of loss functions to identify those better suited to remote sensing super-resolution. Fifth, we will explore the design of task-specific loss functions as a further route to improving reconstruction performance.
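Since the loss-function study (Table 10) and the planned follow-up work both center on the Charbonnier loss, a reference form may be useful. The sketch below uses the commonly cited definition, the mean of sqrt(d² + ε²) over the residuals d, alongside L1 and L2 for comparison. The value ε = 1e-3 is an assumed, frequently used default; the setting used for DFSMamba is not stated in this section.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt(diff^2 + eps^2): a smooth, differentiable approximation of L1."""
    return sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target)) / len(pred)


pred = [0.0, 1.0, 2.0]
target = [0.0, 1.5, 1.0]
print("L1:         ", l1_loss(pred, target))
print("L2:         ", l2_loss(pred, target))
print("Charbonnier:", charbonnier_loss(pred, target))
```

The Charbonnier value sits between the L1 value and the L1 value plus ε, so it behaves like L1 for large residuals while remaining smooth at zero. That smoothness is one plausible reason for its stability in training, though the comparison in Table 10 covers only the ×2 scale.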

5. Conclusions

This paper proposes DFSMamba, a spatial–frequency collaborative modeling framework for remote sensing image super-resolution. The synergy of Semantic Continuous-Sparse Attention, a Discrete Fourier Transform Module, and an Adaptive State-Space Module achieves unified global semantic modeling and frequency-domain information enhancement while maintaining linear complexity. Experiments on five public datasets demonstrate that DFSMamba significantly outperforms mainstream CNN, Transformer, and Mamba-based methods across ×2 to ×4 scaling factors. On the AID×3 task, it achieves a PSNR of 31.48 dB, exceeding the best comparison method by 0.68 dB. Ablation studies verify the positive synergistic effects of the three modules, with the full configuration achieving a PSNR improvement of 0.85 dB over the single-module setup. Fine-grained category, multi-scale input, and loss function experiments further confirm its robustness and generalization capability.

Author Contributions

Writing—original draft preparation, J.Y.; investigation, C.Z.; resources, X.Z.; writing—review and editing, Q.S. and H.L.; supervision, Q.S.; project administration, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D Program of Zhejiang (No. 2024C03236) and the Cooperation Projects between the Ministry of Natural Resources and Jiangxi Province (2024ZRBSHZ086).

Data Availability Statement

The code and data will be made publicly available at https://github.com/AlvinsaideYu/DFSMamba (accessed on 30 May 2026).

Acknowledgments

We extend our sincere gratitude to the editor and the anonymous reviewers for their invaluable feedback and suggestions, which have significantly contributed to improving this paper. We also wish to thank our colleagues who participated in data processing and manuscript revision.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rau, J.-Y.; Jhan, J.-P.; Hsu, Y.-C. Analysis of oblique aerial images for land cover and point cloud classification in an urban environment. IEEE Trans. Geosci. Remote Sens. 2014, 53, 1304–1319. [Google Scholar] [CrossRef]
  2. Ghaffarian, S.; Kerle, N.; Filatova, T. Remote sensing-based proxies for urban disaster risk management and resilience: A review. Remote Sens. 2018, 10, 1760. [Google Scholar] [CrossRef]
  3. Shi, Z.; Zou, Z. Can a machine generate humanlike language descriptions for a remote sensing image? IEEE Trans. Geosci. Remote Sens. 2017, 55, 3623–3634. [Google Scholar] [CrossRef]
  4. Wang, B.; Zhao, Y.; Li, X. Multiple instance graph learning for weakly supervised remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5613112. [Google Scholar] [CrossRef]
  5. Geng, T.; Liu, X.-Y.; Wang, X.; Sun, G. Deep shearlet residual learning network for single image super-resolution. IEEE Trans. Image Process. 2021, 30, 4129–4142. [Google Scholar] [CrossRef]
  6. Li, K.; Yang, S.; Dong, R.; Wang, X.; Huang, J. Survey of single image super-resolution reconstruction. IET Image Process. 2020, 14, 2273–2290. [Google Scholar] [CrossRef]
  7. Ha, V.K.; Ren, J.; Xu, X.; Zhao, S.; Xie, G.; Vargas, V.M.; Hussain, A. Deep learning based single image super-resolution: A survey. In Proceedings of the International Conference on Brain Inspired Cognitive Systems, Xi’an, China, 7–8 July 2018; Springer: Cham, Switzerland, 2018; pp. 106–119. [Google Scholar] [CrossRef]
  8. Miller, L.; Pelletier, C.; Webb, G.I. Deep learning for satellite image time-series analysis: A review. IEEE Geosci. Remote Sens. Mag. 2024, 12, 81–124. [Google Scholar]
  9. Zou, W.; Gao, H.; Ye, T.; Chen, L.; Yang, W.; Huang, S.; Chen, H.; Chen, S. VQCNIR: Clearer night image restoration with vector-quantized codebook. In Proceedings of the AAAI Conference on Artificial Intelligence; AAAI Press: Washington, DC, USA, 2024; Volume 38, pp. 7873–7881. [Google Scholar] [CrossRef]
  10. Jin, Y.; Bouganis, C.-S. Robust multi-image based blind face hallucination. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: New York, NY, USA, 2015; pp. 5252–5260. [Google Scholar] [CrossRef]
  11. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef]
  12. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
  13. Lei, S.; Shi, Z.; Zou, Z. Super-resolution for remote sensing images via local-global combined network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1243–1247. [Google Scholar] [CrossRef]
  14. Xu, W.; Zhang, C.; Wu, M. Multi-scale deep residual network for satellite image super-resolution reconstruction. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xi’an, China, 8–11 November 2019; Springer: Cham, Switzerland, 2019; pp. 332–340. [Google Scholar]
  15. Yan, L.; Chang, K. A new super resolution framework based on multi-task learning for remote sensing images. Sensors 2021, 21, 1743. [Google Scholar] [CrossRef]
  16. Ma, W.; Pan, Z.; Guo, J.; Lei, B. Achieving super-resolution remote sensing images via the wavelet transform combined with the recursive Res-Net. IEEE Trans. Geosci. Remote Sens. 2019, 57, 3512–3527. [Google Scholar] [CrossRef]
  17. Jenefa, A.; Kuriakose, B.M.; Naveen, E.; Lincy, A. EDSR: Empowering super-resolution algorithms with high-quality DIV2K images. Intell. Decis. Technol. 2023, 17, 1249–1263. [Google Scholar]
  18. Galar, M.; Sesma, R.; Ayala, C.; Albizua, L.; Aranda, C. Learning super-resolution for Sentinel-2 images with real ground truth data from a reference satellite. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, V-2-2020, 9–16. [Google Scholar] [CrossRef]
  19. Qin, M.; Mavromatis, S.; Hu, L.; Zhang, F.; Liu, R.; Sequeira, J.; Du, Z. Remote sensing single-image resolution improvement using a deep gradient-aware network with image-specific enhancement. Remote Sens. 2020, 12, 758. [Google Scholar] [CrossRef]
  20. Gu, J.; Sun, X.; Zhang, Y.; Fu, K.; Wang, L. Deep residual squeeze and excitation network for remote sensing image super-resolution. Remote Sens. 2019, 11, 1817. [Google Scholar] [CrossRef]
  21. Song, C.; He, Z.; Yu, Y.; Zhang, Z. Low resolution face recognition system based on ESRGAN. In Proceedings of the 2021 3rd International Conference on Applied Machine Learning (ICAML), Changsha, China, 23–25 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 76–79. [Google Scholar]
  22. Tian, S.; Peng, T.; Wang, Z.; Li, Z. Medical CT image super-resolution algorithm based on GAN. In Proceedings of the 2024 2nd International Conference on Computer, Vision and Intelligent Technology (ICCVIT), 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar]
  23. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  24. Wang, S.; Zhou, T.; Lu, Y.; Di, H. Contextual transformation network for lightweight remote-sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5615313. [Google Scholar] [CrossRef]
  25. Zhang, D.; Shao, J.; Li, X.; Shen, H.T. Remote sensing image super-resolution via mixed high-order attention network. IEEE Trans. Geosci. Remote Sens. 2020, 59, 5183–5196. [Google Scholar] [CrossRef]
  26. Lei, S.; Shi, Z. Hybrid-scale self-similarity exploitation for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5401410. [Google Scholar]
  27. Kang, X.; Duan, P.; Li, J.; Li, S. Efficient Swin Transformer for remote sensing image super-resolution. IEEE Trans. Image Process. 2024, 33, 6367–6379. [Google Scholar] [CrossRef]
  28. Wang, J.; Wang, B.; Wang, X.; Zhao, Y.; Long, T. Hybrid attention-based U-shaped network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5612515. [Google Scholar]
  29. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.-T. MambaIR: A simple baseline for image restoration with state-space model. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 222–241. [Google Scholar]
  30. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  31. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063. [Google Scholar]
  32. He, X.; Cao, K.; Zhang, J.; Yan, K.; Wang, Y.; Li, R.; Xie, C.; Hong, D.; Zhou, M. Pan-Mamba: Effective pan-sharpening with state space model. Inf. Fusion 2025, 115, 102779. [Google Scholar] [CrossRef]
  33. Chen, K.; Chen, B.; Liu, C.; Li, W.; Zou, Z.; Shi, Z. RSMamba: Remote sensing image classification with state space model. IEEE Geosci. Remote Sens. Lett. 2024, 21, 8002605. [Google Scholar]
  34. Qiao, J.; Liao, J.; Li, W.; Zhang, Y.; Guo, Y.; Wen, Y.; Qiu, Z.; Xie, J.; Hu, J.; Lin, S. Hi-Mamba: Hierarchical Mamba for efficient image super-resolution. arXiv 2024, arXiv:2410.10140. [Google Scholar] [CrossRef]
  35. Liu, Y.; Wan, Y.; Liu, X.; Wu, Q.; Xia, P.; Huang, X.; Zhang, Y. HIMOSA: Efficient remote sensing image super-resolution with hierarchical mixture of sparse attention. arXiv 2025, arXiv:2512.00275. [Google Scholar] [CrossRef]
  36. Xiao, Y.; Yuan, Q.; Jiang, K.; Chen, Y.; Zhang, Q.; Lin, C.-W. Frequency-assisted Mamba for remote sensing image super-resolution. IEEE Trans. Multimed. 2024, 27, 1783–1796. [Google Scholar] [CrossRef]
  37. Cui, H.; Li, J.; Du, D.; Zhang, Y.; Xia, Y. MRFMA: A hybrid paradigm integrating multi-receptive field network with mediator attention for 3D multi-organ segmentation. Expert Syst. Appl. 2025, 300, 130447. [Google Scholar] [CrossRef]
  38. Bai, X.; Pres, I.; Deng, Y.; Tan, C.; Shieber, S.; Viégas, F.; Wattenberg, M.; Lee, A. Why can’t Transformers learn multiplication? Reverse-engineering reveals long-range dependency pitfalls. arXiv 2025, arXiv:2510.00184. [Google Scholar]
  39. Li, Z.; Zhang, S.; Li, G.; Gu, W. Stable high-frequency components recovery via multichannel absorption compensation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5913109. [Google Scholar]
  40. Gu, Y.; Meng, Y.; Chen, S.; Ji, J.; Sun, X.; Ruan, W.; Ji, R. SFIR: Optimizing spatial and frequency domains for image restoration. Pattern Recognit. 2025, 171, 112188. [Google Scholar] [CrossRef]
  41. Dai, D.; Yang, W. Satellite image classification via two-layer sparse coding with biased image representation. IEEE Geosci. Remote Sens. Lett. 2010, 8, 173–176. [Google Scholar] [CrossRef]
  42. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
  43. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep learning based feature selection for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325. [Google Scholar] [CrossRef]
  44. Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  45. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  46. Guo, H.; Guo, Y.; Zha, Y.; Zhang, Y.; Li, W.; Dai, T.; Xia, S.-T.; Li, Y. MambaIRv2: Attentive state space restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025; IEEE: New York, NY, USA, 2025; pp. 28124–28133. [Google Scholar]
  47. Shang, C.; Wang, Z.; Wang, H.; Meng, X. SCSA: A plug-and-play semantic continuous-sparse attention for arbitrary semantic style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025; IEEE: New York, NY, USA, 2025; pp. 13051–13060. [Google Scholar]
  48. He, J.; Zhai, J.; Antunes, T.; Wang, H.; Luo, F.; Shi, S.; Li, Q. FasterMoE: Modeling and optimizing training of large-scale dynamic pre-trained models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, 2–6 April 2022; pp. 120–134. [Google Scholar] [CrossRef]
  49. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  50. Ibrahim, Y.; Momoh, M.O.; Shobowale, K.O.; Abubakar, Z.M.; Yahaya, B. EDANet: A novel architecture combining depthwise separable convolutions and hybrid attention for efficient tomato disease recognition. J. Comput. Theor. Appl. 2025, 3, 160–170. [Google Scholar] [CrossRef]
  51. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 286–301. [Google Scholar] [CrossRef]
  52. Lei, S.; Shi, Z.; Mo, W. Transformer-based multistage enhancement for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5615611. [Google Scholar]
  53. Wang, H.; Chen, X.; Ni, B.; Liu, Y.; Liu, J. Omni aggregation networks for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 22378–22387. [Google Scholar]
  54. Zhang, D.; Huang, F.; Liu, S.; Wang, X.; Jin, Z. SwinFIR: Revisiting the SwinIR with fast Fourier convolution and improved training for image super-resolution. arXiv 2022, arXiv:2208.11247. [Google Scholar]
  55. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Lin, C.-W.; Zhang, L. TTST: A Top-k token selective Transformer for remote sensing image super-resolution. IEEE Trans. Image Process. 2024, 33, 738–752. [Google Scholar] [CrossRef]
  56. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual aggregation Transformer for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 12312–12321. [Google Scholar] [CrossRef]
  57. Feng, Y.; Yang, Y.; Fan, X.; Zhang, Z.; Bu, L.; Zhang, J. A progressive image restoration network for high-order degradation imaging in remote sensing. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5633816. [Google Scholar] [CrossRef]
  58. Xie, C.; Zhang, X.; Li, L.; Fu, Y.; Gong, B.; Li, T.; Zhang, K. MAT: Multi-range attention Transformer for efficient image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 8945–8957. [Google Scholar] [CrossRef]
  59. Gu, J.; Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 9199–9208. [Google Scholar] [CrossRef]
  60. Lai, W.-S.; Huang, J.-B.; Ahuja, N.; Yang, M.-H. Fast and accurate image super-resolution with deep Laplacian pyramid networks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2599–2613. [Google Scholar] [CrossRef]
  61. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar] [CrossRef]
Figure 1. Framework of the proposed DFSMamba for remote sensing image super-resolution.
Figure 2. Illustration of the DFTM: (a) original image; (b) frequency spectrum; (c) phase spectrum; (d) high-pass filtering; (e) low-pass filtering; (f) band-pass filtering.
Figure 3. Architecture of the semantic attention mechanism. Here, “(*)” denotes the input feature tensor of f_o, rather than an additional learnable parameter.
Figure 4. Visual SISR results for specific scenes across multiple scales: (a) airport; (b) baseball field; (c) storage tanks; (d) medium-density residential.
Figure 5. Visual comparison of the proposed model and the SOTA methods on three test images from the WHU-RS19 dataset: (a) parking (×2); (b) river (×3); (c) port (×4).
Figure 6. Local Attribution Maps (LAMs) of MambaIR, MambaIRv2, MAT, and the proposed DFSMamba.
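Table 1. Performance comparison of different models. The best and second-best results are marked in bold and underlined, respectively.

| Method | Scale | WHU-RS19 PSNR (dB) | WHU-RS19 SSIM | DOTA PSNR (dB) | DOTA SSIM | RSSCN7 PSNR (dB) | RSSCN7 SSIM | AID PSNR (dB) | AID SSIM | NWPU VHR-10 PSNR (dB) | NWPU VHR-10 SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HSENet | ×2 | 28.94 | 0.7616 | 28.94 | 0.7616 | 27.71 | 0.6853 | 32.85 | 0.8650 | 29.20 | 0.7800 |
| TransENet | ×2 | 29.33 | 0.7756 | 30.92 | 0.8079 | 27.89 | 0.6956 | 33.10 | 0.8720 | 29.45 | 0.7880 |
| OmniSR | ×2 | 29.35 | 0.7768 | 30.98 | 0.8095 | 27.91 | 0.6980 | 33.15 | 0.8740 | 29.50 | 0.7900 |
| SwinIR | ×2 | 29.40 | 0.7761 | 29.40 | 0.7761 | 27.92 | 0.7008 | 33.20 | 0.8760 | 29.55 | 0.7920 |
| SwinFIR | ×2 | 29.41 | 0.7778 | 29.41 | 0.7778 | 27.93 | 0.7008 | 33.25 | 0.8780 | 29.60 | 0.7940 |
| DAT | ×2 | 29.45 | 0.7771 | 31.09 | 0.8118 | 27.97 | 0.7008 | 33.30 | 0.8800 | 29.65 | 0.7960 |
| HAUNet | ×2 | 29.52 | 0.7789 | 31.25 | 0.8131 | 28.01 | 0.7003 | 33.35 | 0.8820 | 29.70 | 0.7980 |
| TTST | ×2 | 29.63 | 0.7838 | 31.12 | 0.8152 | 28.05 | 0.7053 | 33.45 | 0.8850 | 29.75 | 0.8001 |
| MambaIR | ×2 | 29.75 | 0.7847 | 31.88 | 0.8312 | 28.10 | 0.7100 | 34.14 | 0.8992 | 29.79 | 0.8045 |
| MambaIRv2 | ×2 | 29.89 | 0.7852 | 32.06 | 0.8343 | 28.38 | 0.7257 | 34.33 | 0.9016 | 29.99 | 0.8065 |
| HDI-PRNet | ×2 | 30.03 | 0.7931 | 31.66 | 0.8230 | 28.32 | 0.7134 | 34.27 | 0.9080 | 29.95 | 0.8060 |
| MAT | ×2 | 30.25 | 0.8089 | 31.93 | 0.8321 | 28.41 | 0.7289 | 34.45 | 0.9122 | 30.03 | 0.8088 |
| DFSMamba | ×2 | 30.45 | 0.8122 | 32.08 | 0.8364 | 28.50 | 0.7297 | 34.54 | 0.9138 | 30.14 | 0.8101 |
| HSENet | ×3 | 27.98 | 0.7292 | 25.74 | 0.7491 | 26.98 | 0.6500 | 29.85 | 0.8150 | 28.20 | 0.7350 |
| TransENet | ×3 | 28.44 | 0.7482 | 25.05 | 0.7428 | 27.24 | 0.6642 | 30.10 | 0.8220 | 28.45 | 0.7420 |
| OmniSR | ×3 | 28.46 | 0.7490 | 25.08 | 0.7432 | 27.25 | 0.6650 | 30.15 | 0.8240 | 28.48 | 0.7430 |
| SwinIR | ×3 | 28.45 | 0.7485 | 25.10 | 0.7436 | 27.26 | 0.6654 | 30.20 | 0.8260 | 28.50 | 0.7440 |
| SwinFIR | ×3 | 28.53 | 0.7523 | 25.15 | 0.7455 | 27.29 | 0.6686 | 30.25 | 0.8280 | 28.55 | 0.7460 |
| DAT | ×3 | 28.51 | 0.7528 | 25.32 | 0.7487 | 27.32 | 0.6693 | 30.30 | 0.8300 | 28.60 | 0.7470 |
| HAUNet | ×3 | 28.51 | 0.7483 | 25.25 | 0.7455 | 27.31 | 0.6649 | 30.28 | 0.8270 | 28.58 | 0.7450 |
| TTST | ×3 | 28.55 | 0.7535 | 25.35 | 0.7495 | 27.33 | 0.6700 | 30.32 | 0.8310 | 28.62 | 0.7480 |
| MambaIR | ×3 | 28.67 | 0.7573 | 30.15 | 0.7786 | 27.38 | 0.6731 | 30.57 | 0.8193 | 28.20 | 0.7447 |
| MambaIRv2 | ×3 | 28.75 | 0.7590 | 30.25 | 0.7807 | 27.40 | 0.6750 | 30.41 | 0.8130 | 28.37 | 0.7425 |
| HDI-PRNet | ×3 | 28.85 | 0.7620 | 30.30 | 0.7830 | 27.45 | 0.6780 | 30.60 | 0.8220 | 28.50 | 0.7500 |
| MAT | ×3 | 28.95 | 0.7650 | 30.33 | 0.7845 | 27.50 | 0.6800 | 30.80 | 0.8280 | 28.55 | 0.7521 |
| DFSMamba | ×3 | 29.05 | 0.7680 | 30.36 | 0.7850 | 27.64 | 0.6915 | 31.48 | 0.8415 | 28.76 | 0.7498 |
| HSENet | ×4 | 27.49 | 0.7066 | 27.29 | 0.7331 | 26.64 | 0.6323 | 28.50 | 0.7850 | 26.50 | 0.6801 |
| TransENet | ×4 | 27.53 | 0.7102 | 27.40 | 0.7368 | 26.66 | 0.6354 | 28.60 | 0.7880 | 26.55 | 0.6820 |
| OmniSR | ×4 | 27.55 | 0.7115 | 27.45 | 0.7380 | 26.68 | 0.6365 | 28.65 | 0.7900 | 26.58 | 0.6830 |
| SwinIR | ×4 | 27.48 | 0.7071 | 27.38 | 0.7343 | 26.63 | 0.6327 | 28.65 | 0.7900 | 26.58 | 0.6830 |
| SwinFIR | ×4 | 27.65 | 0.7152 | 27.39 | 0.7379 | 26.72 | 0.6391 | 28.55 | 0.7860 | 26.52 | 0.6810 |
| DAT | ×4 | 27.61 | 0.7138 | 28.40 | 0.7387 | 26.70 | 0.6376 | 28.70 | 0.7920 | 26.60 | 0.6840 |
| HAUNet | ×4 | 27.59 | 0.7117 | 28.47 | 0.7386 | 26.71 | 0.6362 | 28.75 | 0.7940 | 26.65 | 0.6860 |
| TTST | ×4 | 27.70 | 0.7176 | 28.56 | 0.7318 | 26.76 | 0.6407 | 28.72 | 0.7930 | 26.62 | 0.6850 |
| MambaIR | ×4 | 27.80 | 0.7200 | 28.53 | 0.7315 | 26.64 | 0.6300 | 28.81 | 0.7551 | 26.99 | 0.6999 |
| MambaIRv2 | ×4 | 27.85 | 0.7210 | 28.67 | 0.7368 | 26.59 | 0.6231 | 28.96 | 0.7501 | 26.98 | 0.6999 |
| HDI-PRNet | ×4 | 27.87 | 0.7214 | 28.80 | 0.7382 | 26.91 | 0.6434 | 29.10 | 0.7600 | 27.05 | 0.7020 |
| MAT | ×4 | 27.95 | 0.7230 | 28.90 | 0.7390 | 26.78 | 0.6411 | 29.25 | 0.7650 | 27.15 | 0.7040 |
| DFSMamba | ×4 | 28.05 | 0.7250 | 28.98 | 0.7397 | 26.95 | 0.6450 | 29.47 | 0.7708 | 27.26 | 0.7052 |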
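The PSNR margins in Table 1 are easier to interpret with the standard definition in hand, PSNR = 10·log10(MAX² / MSE). The sketch below implements that definition and converts a PSNR gain into the corresponding reduction in mean squared error. The evaluation protocol behind the reported numbers (color channel, border cropping, peak value) is not stated in this section, so `max_value=255.0` and the helper name `mse_ratio_for_gain` are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Ratio MSE_new / MSE_old implied by a PSNR improvement of `gain_db` dB."""
    return 10.0 ** (-gain_db / 10.0)


reference = [52.0, 55.0, 61.0, 59.0]
estimate = [54.0, 55.0, 60.0, 57.0]
print("PSNR (dB):", psnr(reference, estimate))
print("MSE ratio for a 0.68 dB gain:", mse_ratio_for_gain(0.68))
```

Under this definition, the 0.68 dB margin on AID ×3 (31.48 dB for DFSMamba versus 30.80 dB for MAT) corresponds to roughly 14% lower mean squared error, which gives a more direct sense of the size of the improvement than the decibel figure alone.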
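Table 2. Category-level performance on the AID dataset for ×2 upscaling. The best and second-best results are marked in bold and underlined, respectively.

| Category | SwinIR PSNR (dB) | SwinIR SSIM | HAUNet PSNR (dB) | HAUNet SSIM | MambaIR PSNR (dB) | MambaIR SSIM | MambaIRv2 PSNR (dB) | MambaIRv2 SSIM | MAT PSNR (dB) | MAT SSIM | DFSMamba PSNR (dB) | DFSMamba SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Airport | 30.83 | 0.8779 | 30.90 | 0.8768 | 30.96 | 0.8764 | 30.84 | 0.8770 | 31.03 | 0.8785 | 31.14 | 0.8808 |
| Bare land | 37.60 | 0.9240 | 37.75 | 0.9205 | 37.66 | 0.9170 | 38.15 | 0.9254 | 38.20 | 0.9235 | 38.29 | 0.9262 |
| Baseball field | 33.84 | 0.9175 | 33.90 | 0.9145 | 33.94 | 0.9122 | 34.06 | 0.9176 | 34.30 | 0.9186 | 34.40 | 0.9197 |
| Beach | 36.25 | 0.9206 | 36.35 | 0.9195 | 36.47 | 0.9185 | 36.80 | 0.9206 | 36.41 | 0.9207 | 37.10 | 0.9233 |
| Bridge | 33.86 | 0.9276 | 33.95 | 0.9275 | 34.03 | 0.9274 | 34.00 | 0.9271 | 34.18 | 0.9289 | 34.32 | 0.9303 |
| Center | 32.51 | 0.9122 | 32.70 | 0.9120 | 32.88 | 0.9138 | 32.62 | 0.9100 | 32.83 | 0.9123 | 33.04 | 0.9156 |
| Church | 28.42 | 0.8754 | 28.55 | 0.8770 | 28.70 | 0.8787 | 28.55 | 0.8746 | 28.67 | 0.8768 | 28.88 | 0.8816 |
| Commercial | 31.60 | 0.8859 | 31.70 | 0.8860 | 31.81 | 0.8868 | 31.70 | 0.8872 | 31.84 | 0.8884 | 31.98 | 0.8906 |
| Dense residential | 28.12 | 0.8063 | 28.30 | 0.8075 | 28.43 | 0.8092 | 28.29 | 0.8061 | 28.36 | 0.8073 | 28.55 | 0.8107 |
| Desert | 38.95 | 0.9483 | 39.30 | 0.9445 | 39.82 | 0.9408 | 40.56 | 0.9479 | 40.31 | 0.9470 | 40.79 | 0.9479 |
| Farmland | 36.01 | 0.9284 | 35.90 | 0.9230 | 35.83 | 0.9180 | 36.37 | 0.9301 | 36.51 | 0.9301 | 36.55 | 0.9312 |
| Forest | 33.71 | 0.8996 | 33.65 | 0.8950 | 33.60 | 0.8913 | 33.97 | 0.8978 | 34.05 | 0.8990 | 34.16 | 0.8999 |
| Industrial | 30.81 | 0.8797 | 30.95 | 0.8805 | 31.10 | 0.8821 | 30.87 | 0.8785 | 31.06 | 0.8807 | 31.32 | 0.8852 |
| Meadow | 36.36 | 0.8752 | 36.15 | 0.8630 | 36.00 | 0.8526 | 37.39 | 0.8763 | 37.40 | 0.8760 | 37.54 | 0.8758 |
| Medium residential | 30.69 | 0.8880 | 30.85 | 0.8885 | 30.98 | 0.8897 | 30.90 | 0.8870 | 31.00 | 0.8891 | 31.19 | 0.8922 |
| Mountain | 32.60 | 0.9056 | 32.55 | 0.9015 | 32.49 | 0.8980 | 32.81 | 0.9042 | 32.90 | 0.9055 | 32.94 | 0.9055 |
| Parking | 31.00 | 0.9402 | 31.30 | 0.9410 | 31.62 | 0.9421 | 31.25 | 0.9408 | 31.36 | 0.9412 | 31.81 | 0.9444 |
| Park | 32.24 | 0.9278 | 32.30 | 0.9275 | 32.40 | 0.9274 | 32.35 | 0.9274 | 32.53 | 0.9285 | 32.63 | 0.9302 |
| Playground | 33.10 | 0.9234 | 33.35 | 0.9240 | 33.63 | 0.9245 | 33.61 | 0.9251 | 33.72 | 0.9265 | 34.02 | 0.9290 |
| Pond | 33.31 | 0.9220 | 33.40 | 0.9200 | 33.51 | 0.9183 | 33.57 | 0.9227 | 33.72 | 0.9231 | 33.83 | 0.9245 |
| Port | 25.88 | 0.7389 | 25.95 | 0.7375 | 26.03 | 0.7362 | 26.00 | 0.7414 | 26.08 | 0.7411 | 26.27 | 0.7436 |
| Railway station | 33.84 | 0.9264 | 33.90 | 0.9262 | 33.97 | 0.9261 | 33.89 | 0.9253 | 34.07 | 0.9274 | 34.18 | 0.9286 |
| Resort | 29.04 | 0.8553 | 29.15 | 0.8560 | 29.28 | 0.8571 | 29.14 | 0.8547 | 29.22 | 0.8558 | 29.46 | 0.8594 |
| River | 31.66 | 0.8445 | 31.65 | 0.8420 | 31.65 | 0.8398 | 31.89 | 0.8448 | 31.96 | 0.8454 | 32.03 | 0.8458 |
| School | 27.31 | 0.8044 | 27.45 | 0.8060 | 27.61 | 0.8085 | 27.35 | 0.8033 | 27.46 | 0.8048 | 27.69 | 0.8093 |
| Sparse residential | 29.37 | 0.8358 | 29.37 | 0.8340 | 29.38 | 0.8329 | 29.51 | 0.8340 | 29.56 | 0.8351 | 29.71 | 0.8376 |
| Square | 27.98 | 0.7940 | 28.10 | 0.7935 | 28.24 | 0.7947 | 28.16 | 0.7932 | 28.28 | 0.7944 | 28.46 | 0.7974 |
| Stadium | 28.28 | 0.8706 | 28.50 | 0.8725 | 28.82 | 0.8750 | 28.28 | 0.8682 | 28.60 | 0.8716 | 29.02 | 0.8774 |
| Storage tanks | 29.53 | 0.8825 | 29.70 | 0.8840 | 29.88 | 0.8865 | 29.62 | 0.8802 | 29.76 | 0.8833 | 29.96 | 0.8868 |
| Viaduct | 29.20 | 0.8330 | 29.30 | 0.8320 | 29.39 | 0.8344 | 29.25 | 0.8304 | 29.41 | 0.8334 | 29.56 | 0.8367 |
| Overall | 31.59 | 0.8811 | 31.89 | 0.8811 | 31.80 | 0.8796 | 31.82 | 0.8808 | 31.93 | 0.8820 | 32.13 | 0.8845 |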
Table 3. Category-level performance on the RSSCN7 dataset for ×2 upscaling. The best and second-best results are marked in bold and underlined, respectively.

| Category | SwinIR PSNR (dB) | SwinIR SSIM | HAUNet PSNR (dB) | HAUNet SSIM | MambaIR PSNR (dB) | MambaIR SSIM | MambaIRv2 PSNR (dB) | MambaIRv2 SSIM | MAT PSNR (dB) | MAT SSIM | DFSMamba PSNR (dB) | DFSMamba SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Grasslands | 30.78 | 0.6113 | 31.00 | 0.6132 | 31.09 | 0.6141 | 31.20 | 0.6160 | 31.25 | 0.6170 | 31.40 | 0.6195 |
| Forests | 25.08 | 0.5857 | 25.29 | 0.5876 | 25.38 | 0.5885 | 25.45 | 0.5890 | 25.50 | 0.5905 | 25.70 | 0.5939 |
| Farmland | 29.97 | 0.4886 | 30.19 | 0.4905 | 30.28 | 0.4914 | 30.40 | 0.4935 | 30.35 | 0.4920 | 30.59 | 0.4968 |
| Parking lots | 23.03 | 0.7362 | 23.24 | 0.7381 | 23.33 | 0.7390 | 23.45 | 0.7400 | 23.50 | 0.7415 | 23.65 | 0.7446 |
| Village | 24.94 | 0.7702 | 25.15 | 0.7721 | 25.24 | 0.7730 | 25.35 | 0.7740 | 25.40 | 0.7755 | 25.56 | 0.7785 |
| Factory | 25.10 | 0.7433 | 25.31 | 0.7452 | 25.40 | 0.7461 | 25.50 | 0.7470 | 25.55 | 0.7485 | 25.72 | 0.7517 |
| Rivers | 28.33 | 0.8368 | 28.54 | 0.8387 | 28.63 | 0.8396 | 28.75 | 0.8410 | 28.80 | 0.8425 | 28.95 | 0.8452 |
| Overall | 26.75 | 0.6817 | 26.96 | 0.6836 | 27.05 | 0.6845 | 27.15 | 0.6857 | 27.19 | 0.6867 | 27.37 | 0.6900 |
Table 4. Category-level performance on the WHU-RS19 dataset for ×2 upscaling. The best and second-best results are marked in bold and underlined, respectively.

| Category | SwinIR PSNR (dB) | SwinIR SSIM | HAUNet PSNR (dB) | HAUNet SSIM | MambaIR PSNR (dB) | MambaIR SSIM | MambaIRv2 PSNR (dB) | MambaIRv2 SSIM | MAT PSNR (dB) | MAT SSIM | DFSMamba PSNR (dB) | DFSMamba SSIM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Airport | 27.43 | 0.8041 | 27.58 | 0.8061 | 27.68 | 0.8076 | 27.93 | 0.8106 | 27.88 | 0.8101 | 28.03 | 0.8121 |
| Beach | 40.59 | 0.9320 | 40.74 | 0.9340 | 40.84 | 0.9355 | 41.04 | 0.9380 | 41.09 | 0.9385 | 41.19 | 0.9400 |
| Bridge | 34.47 | 0.8972 | 34.62 | 0.8992 | 34.72 | 0.9007 | 34.97 | 0.9037 | 34.92 | 0.9032 | 35.07 | 0.9052 |
| Commercial | 23.65 | 0.7723 | 23.80 | 0.7743 | 23.90 | 0.7758 | 24.15 | 0.7788 | 24.10 | 0.7783 | 24.25 | 0.7803 |
| Desert | 35.88 | 0.7564 | 36.03 | 0.7584 | 36.13 | 0.7599 | 36.38 | 0.7629 | 36.33 | 0.7624 | 36.48 | 0.7644 |
| Farmland | 33.58 | 0.6683 | 33.73 | 0.6703 | 33.83 | 0.6718 | 34.08 | 0.6748 | 34.03 | 0.6743 | 34.18 | 0.6763 |
| Football field | 28.28 | 0.8173 | 28.43 | 0.8193 | 28.53 | 0.8208 | 28.78 | 0.8238 | 28.73 | 0.8233 | 28.88 | 0.8253 |
| Forest | 26.77 | 0.7013 | 26.92 | 0.7033 | 27.02 | 0.7048 | 27.27 | 0.7078 | 27.22 | 0.7073 | 27.37 | 0.7093 |
| Industrial | 27.08 | 0.8145 | 27.23 | 0.8165 | 27.33 | 0.8180 | 27.58 | 0.8210 | 27.53 | 0.8205 | 27.68 | 0.8225 |
| Meadow | 33.91 | 0.7093 | 34.06 | 0.7113 | 34.16 | 0.7128 | 34.41 | 0.7158 | 34.36 | 0.7153 | 34.51 | 0.7173 |
| Mountain | 24.07 | 0.6487 | 24.22 | 0.6507 | 24.32 | 0.6522 | 24.57 | 0.6552 | 24.52 | 0.6547 | 24.67 | 0.6567 |
| Parking | 28.00 | 0.8669 | 28.15 | 0.8689 | 28.25 | 0.8704 | 28.50 | 0.8734 | 28.45 | 0.8729 | 28.60 | 0.8749 |
| Park | 25.96 | 0.7746 | 26.11 | 0.7766 | 26.21 | 0.7781 | 26.46 | 0.7811 | 26.41 | 0.7806 | 26.56 | 0.7826 |
| Pond | 30.91 | 0.8402 | 31.06 | 0.8422 | 31.16 | 0.8437 | 31.41 | 0.8467 | 31.36 | 0.8462 | 31.51 | 0.8482 |
| Port | 17.57 | 0.5051 | 17.72 | 0.5071 | 17.82 | 0.5086 | 18.02 | 0.5111 | 18.07 | 0.5116 | 18.17 | 0.5131 |
| Railway station | 24.77 | 0.6833 | 24.92 | 0.6853 | 25.02 | 0.6868 | 25.22 | 0.6893 | 25.27 | 0.6898 | 25.37 | 0.6913 |
| Residential | 25.19 | 0.8102 | 25.34 | 0.8122 | 25.44 | 0.8137 | 25.69 | 0.8167 | 25.64 | 0.8162 | 25.79 | 0.8182 |
| River | 27.65 | 0.7408 | 27.80 | 0.7428 | 27.90 | 0.7443 | 28.15 | 0.7473 | 28.10 | 0.7468 | 28.25 | 0.7488 |
| Viaduct | 25.36 | 0.7759 | 25.51 | 0.7779 | 25.61 | 0.7794 | 25.81 | 0.7819 | 25.86 | 0.7824 | 25.96 | 0.7839 |
| Overall | 28.48 | 0.7641 | 28.63 | 0.7661 | 28.73 | 0.7676 | 28.96 | 0.7700 | 28.94 | 0.7700 | 29.08 | 0.7721 |
Table 5. Computational efficiency comparison on the AID dataset for ×4 upscaling.

| Method | Params (M) | FLOPs (G) | Latency (ms/img) | Peak GPU Memory (MB) |
|---|---|---|---|---|
| MambaIRv2 | 7.21 | 42.8 | 11.3 | 1289 |
| FMSR | 7.84 | 45.2 | 12.1 | 1342 |
| DFSMamba | 8.05 | 46.7 | 12.5 | 1408 |
Table 6. Sensitivity analysis of the DFTM and ASSMamba on the AID dataset for ×4 upscaling.

| Variant | Setting | PSNR (dB) | SSIM | Observation |
|---|---|---|---|---|
| Frequency band ratio | 25% / 50% / 75% / full | 29.12 / 29.35 / 29.42 / 29.47 | 0.7625 / 0.7671 / 0.7694 / 0.7708 | Full spectrum gives the best reconstruction. |
| Spectral scaling adapter | 0.5 / 1.0 / 1.5 | 29.21 / 29.47 / 29.44 | 0.7632 / 0.7708 / 0.7691 | Moderate scaling is most stable. |
| ASSMamba projection dimension | C/4 / C/2 / C | 29.30 / 29.42 / 29.47 | 0.7660 / 0.7692 / 0.7708 | Larger capacity improves accuracy. |
| Activation temperature | 0.5 / 1.0 / 2.0 | 29.35 / 29.47 / 29.41 | 0.7675 / 0.7708 / 0.7686 | Default temperature gives stable gating. |
Table 7. Experimental assessment of component roles in ×4 SISR tasks. The best and second-best results are highlighted in bold and underlined, respectively.

| Method | DFTM | SCSA | ASSMamba | AID PSNR (dB) | AID SSIM | WHU-RS19 PSNR (dB) | WHU-RS19 SSIM |
|---|---|---|---|---|---|---|---|
| MambaIRv2 | | | | 28.96 | 0.7501 | 27.85 | 0.7210 |
| A | | | | 28.62 | 0.7612 | 27.73 | 0.7185 |
| B | | | | 28.55 | 0.7598 | 27.68 | 0.7162 |
| C | | | | 28.48 | 0.7575 | 27.61 | 0.7138 |
| D | | | | 29.03 | 0.7665 | 28.02 | 0.7221 |
| E | | | | 28.91 | 0.7643 | 27.95 | 0.7203 |
| F | | | | 28.85 | 0.7627 | 27.88 | 0.7189 |
| G | | | | 29.47 | 0.7708 | 28.05 | 0.7250 |
Table 8. Performance analysis of SCSA in the ×4 SISR task. The best and second-best results are highlighted in bold and underlined, respectively.

| Method | Window-Based MHSA | SCSA | AID PSNR (dB) | AID SSIM |
|---|---|---|---|---|
| MambaIRv2 | | | 28.96 | 0.7501 |
| | | | 29.18 | 0.7586 |
| DFSMamba | | | 29.21 | 0.7602 |
| | | | 29.47 | 0.7708 |
Table 9. Comparison of DFSMamba and MambaIRv2 on the AID, WHU-RS19, and RSSCN7 datasets at different input sizes (×4 SR).

| Input Size | Model | AID PSNR (dB) | AID SSIM | WHU-RS19 PSNR (dB) | WHU-RS19 SSIM | RSSCN7 PSNR (dB) | RSSCN7 SSIM |
|---|---|---|---|---|---|---|---|
| 16 × 16 | MambaIRv2 | 28.96 | 0.7501 | 27.85 | 0.7210 | 26.59 | 0.6231 |
| 16 × 16 | DFSMamba | 29.47 | 0.7708 | 28.05 | 0.7250 | 26.95 | 0.6450 |
| 32 × 32 | MambaIRv2 | 28.89 | 0.7620 | 27.61 | 0.7135 | 26.78 | 0.6390 |
| 32 × 32 | DFSMamba | 29.57 | 0.7793 | 28.29 | 0.7283 | 27.35 | 0.6510 |
| 64 × 64 | MambaIRv2 | 29.05 | 0.7680 | 27.82 | 0.7170 | 26.85 | 0.6415 |
| 64 × 64 | DFSMamba | 29.77 | 0.7815 | 28.39 | 0.7280 | 27.42 | 0.6530 |
| 128 × 128 | MambaIRv2 | 28.36 | 0.7450 | 27.38 | 0.7082 | 26.55 | 0.6335 |
| 128 × 128 | DFSMamba | 29.04 | 0.7630 | 28.00 | 0.7210 | 27.12 | 0.6460 |
| 256 × 256 | MambaIRv2 | 27.52 | 0.7280 | 26.65 | 0.6950 | 26.32 | 0.6280 |
| 256 × 256 | DFSMamba | 28.13 | 0.7390 | 27.13 | 0.7065 | 26.78 | 0.6375 |
Table 10. Comparison of loss functions for DFSMamba in the ×2 remote sensing SISR task. The best and second-best results are highlighted in bold and underlined, respectively.

| Method | Loss Function | WHU-RS19 PSNR (dB) | WHU-RS19 SSIM | AID PSNR (dB) | AID SSIM |
|---|---|---|---|---|---|
| DFSMamba | L1 Loss | 30.45 | 0.8122 | 34.54 | 0.9138 |
| DFSMamba | L2 Loss | 31.15 | 0.8138 | 34.17 | 0.9088 |
| DFSMamba | MSE Loss | 31.18 | 0.8143 | 34.04 | 0.9078 |
| DFSMamba | Perceptual Loss | 30.85 | 0.8095 | 33.92 | 0.9055 |
| DFSMamba | Focal Loss | 30.62 | 0.8070 | 33.78 | 0.9030 |
| DFSMamba | Charbonnier Loss | 31.24 | 0.8191 | 34.72 | 0.9152 |
| DFSMamba | WeightedTV Loss | 29.95 | 0.7985 | 33.27 | 0.8533 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yu, J.; Li, H.; Zheng, X.; Zhong, C.; Sun, Q. DFSMamba: A Spatial–Frequency Collaborative Modeling Framework for Remote Sensing Image Super-Resolution. Remote Sens. 2026, 18, 1910. https://doi.org/10.3390/rs18121910
