Article

FASwinNet: Frequency-Aware Swin Transformer for Remote Sensing Image Super-Resolution via Enhanced High-Similarity-Pass Attention and Octave Residual Blocks

Zhongyang Wang, Shilong Liu, Keyan Cao and Xinlei Wang
School of Computer Science and Engineering, Shenyang Jianzhu University, Shenyang 110168, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(23), 12420; https://doi.org/10.3390/app152312420
Submission received: 18 October 2025 / Revised: 20 November 2025 / Accepted: 21 November 2025 / Published: 23 November 2025
(This article belongs to the Special Issue Data Science and Medical Informatics)

Abstract

Remote Sensing Image Super-Resolution (RSISR) plays a vital role in enhancing the spatial details and interpretability of satellite imagery. However, existing methods often struggle to recover fine textures and high-frequency information effectively. In this paper, we propose a frequency-aware super-resolution network for remote sensing images, termed FASwinNet. The network introduces an Enhanced High-Similarity-Pass Attention (EHSPA) module, which improves high-frequency detail modeling through a similarity-aware mechanism guided by edge and positional information. Additionally, we design an Octave-based Residual Attention Block that explicitly separates and optimizes high- and low-frequency features, further enhancing texture reconstruction. Experimental results demonstrate that FASwinNet outperforms state-of-the-art methods in both visual quality and quantitative metrics, achieving the best PSNR and SSIM performance on the AID and UCMerced datasets.

1. Introduction

Remote Sensing Image Super-Resolution (RSISR) [1] is a long-standing low-level vision problem that aims to reconstruct high-resolution (HR) remote sensing images from their low-resolution (LR) counterparts. Remote sensing images are widely used in applications such as land cover classification, environmental monitoring, and urban planning. However, the spatial resolution of raw satellite imagery is often limited due to hardware, bandwidth, and other environmental constraints, preventing satellite sensors from capturing images at the desired high spatial resolution. As an alternative solution, image super-resolution (SR) has attracted significant attention in the field of remote sensing. SR techniques effectively leverage one or multiple LR images to generate corresponding HR outputs, offering advantages such as cost efficiency and the ability to recover historical data. Consequently, SR has become a prominent research focus in the domain of remote sensing.
Recent advances in deep learning have led to significant progress in remote sensing image super-resolution (RSISR) [1]. Deep learning techniques are increasingly applied to image super-resolution tasks [2,3,4,5,6,7,8,9]. However, due to the limited receptive field, convolutional neural networks (CNNs) tend to focus only on local regions of an image. Recently, Transformers [10], which have achieved great success in natural language processing, have gained considerable attention in the vision community. With the rapid development of high-level vision tasks [11,12,13], Transformer-based methods have also been introduced to low-level vision problems [14,15,16] and super-resolution (SR) tasks [17,18]. In particular, the SwinIR [18] architecture demonstrates impressive capability in modeling long-range dependencies.
As shown in Figure 1, recent RSISR methods exhibit a clear trade-off between model complexity and reconstruction performance, indicating that improvements in detail reconstruction often come at the cost of increased computational burden. Moreover, prior studies have reported that many CNN-based and Transformer-based SR methods tend to emphasize global contextual modeling while paying insufficient attention to high-frequency textures and structural edges. For instance, RCAN [6] and RDN [9] often generate over-smoothed textures in complex remote sensing scenes, while SwinIR [18] and HAT [19], though strong in global dependency modeling, still lack explicit mechanisms for frequency-aware fusion or edge-focused reconstruction. Several works in remote sensing (e.g., HSENet [20] and CTNet [21]) also confirm that insufficient high-frequency modeling leads to blurred building boundaries and degraded fine textures. This gap motivates the need for a frequency-aware design. To address these challenges, we propose a Frequency-Aware Swin Transformer Network (FASwinNet), specifically designed for remote sensing image super-resolution (RSISR). The proposed network aims to more effectively exploit frequency-domain representations, edge-sensitive information, and contextual modeling capabilities, thereby enhancing high-frequency detail reconstruction while maintaining moderate computational cost.
Built upon the SwinIR framework [18], we introduce the Enhanced High-Similarity-Pass Attention (EHSPA) module, which incorporates Sobel edge information and spatial positional encoding to guide the model’s attention toward structurally critical regions. EHSPA strengthens the model’s focus on informative high-frequency regions. This design facilitates more accurate reconstruction of structural edges and fine textures in remote sensing images.
At the same time, we design the Octave-based Residual Attention Block (ORAB) based on the improved Octave Convolution concept [22], which explicitly decomposes feature representations into high-frequency and low-frequency channels for separate modeling and enhancement. In the high-frequency branch, High-Frequency Residual Enhancement and Efficient Localization Attention (ELA) [23] are incorporated to boost texture representation, while the low-frequency branch retains the global structural and semantic information of the image. This decoupled modeling enables effective fusion of frequency-specific features, facilitating more accurate and detailed image reconstruction. The main contributions of this paper are as follows:
  • A frequency-aware super-resolution framework, FASwinNet, is proposed for remote sensing images, integrating frequency-domain modeling and attention mechanisms to enhance reconstruction quality.
  • An Enhanced High-Similarity-Pass Attention (EHSPA) module is designed to guide the network toward structurally significant high-frequency regions, thereby enhancing the restoration of fine details in reconstructed images.
  • An Octave-based Residual Attention Block (ORAB) is proposed to perform frequency-separated processing and fusion, effectively enhancing the network’s feature representation capability.

2. Related Work

2.1. Remote Sensing Image Super-Resolution

Remote sensing image super-resolution (RSISR) research has evolved along two major methodological lines: convolution-based models and Transformer-based architectures. Early CNN approaches, such as SRCNN and EDSR, progressively enhanced reconstruction accuracy by deepening network structures and exploiting hierarchical local feature extraction.
With the rapid success of Transformers in natural language processing, their strong capability for modeling long-range dependencies has stimulated widespread adoption in computer vision tasks. Transformer-based frameworks have demonstrated remarkable performance in high-level applications such as image classification, object detection, and semantic segmentation [11,12,24,25,26,27,28,29,30,31,32,33].
Among Transformer-based methods, the Vision Transformer (ViT), as the first fully attention-based model in vision, splits an image into patch tokens and models them with self-attention, enabling global context learning and achieving state-of-the-art results on several benchmark datasets [11,13]. Building on this paradigm, SwinIR [18] adapts the Swin Transformer to low-level restoration by incorporating a hierarchical representation and a shifted-window attention mechanism, effectively balancing computational efficiency with the ability to capture both local structures and non-local dependencies, thus establishing a strong baseline for super-resolution tasks.

2.2. Image Restoration Using Swin Transformer

Swin Transformer for Image Restoration (SwinIR) represents the first successful integration of the Swin Transformer framework into image restoration, establishing a milestone for Transformer-based approaches in low-level vision tasks [18]. Building upon the hierarchical architecture and shifted-window self-attention of the original Swin Transformer, SwinIR achieves a favorable balance between computational efficiency and effective modeling of both local patterns and long-range contextual relationships. Owing to this design, SwinIR has been widely adopted in various restoration scenarios (including super-resolution, image denoising, and compressed-sensing reconstruction), where it consistently delivers state-of-the-art results. The overall architecture of SwinIR is composed of several essential modules, including:
Shallow Feature Extraction: As illustrated in Equation (1), a single convolutional layer is employed to extract shallow features from the input low-resolution image. This step provides the initial feature representation required for the subsequent Transformer-based processing.
Deep Feature Extraction: As illustrated in Equation (2), this stage is composed of several Residual Swin Transformer Blocks (RSTBs). Each block integrates multiple Swin Transformer layers and employs residual connections across the block to facilitate feature propagation and stabilize gradient flow. By stacking these RSTBs, the network achieves enhanced representational power.
Image Reconstruction: As illustrated in Equation (4), the deep features are reconstructed into a high-resolution image through an upsampling module (e.g., PixelShuffle).
Window-based Self-Attention: This mechanism performs multi-head self-attention (W-MSA) within fixed-size local windows and introduces a window-shifting strategy (SW-MSA) to capture cross-window contextual information. This design significantly reduces computational costs, enabling SwinIR to efficiently process large-scale images.
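To make the windowing concept concrete, the short PyTorch sketch below partitions a feature map into non-overlapping 8 × 8 windows and applies the half-window cyclic shift used before SW-MSA; the function name and tensor layout are illustrative assumptions rather than code taken from SwinIR.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C) for windowed self-attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(1, 64, 64, 96)                          # (B, H, W, C) feature map
windows = window_partition(x, window_size=8)            # W-MSA attends inside each window
shifted = torch.roll(x, shifts=(-4, -4), dims=(1, 2))   # half-window shift for SW-MSA
shifted_windows = window_partition(shifted, window_size=8)
print(windows.shape)                                    # torch.Size([64, 64, 96])
```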
Although SwinIR has achieved state-of-the-art performance in various image restoration tasks, it still faces significant challenges when applied to high-resolution and structurally complex remote sensing images. Remote sensing imagery is characterized by rich high-frequency textures, distinct edge structures, and large-scale repetitive patterns, which impose stringent requirements on the model’s capacity for effective feature representation [34,35]. While SwinIR improves computational efficiency via its local window-based self-attention mechanism, it exhibits inherent limitations in frequency modeling and edge reconstruction.
Firstly, SwinIR’s ability to model high-frequency components remains constrained. Its attention mechanism primarily operates within localized windows and lacks explicit mechanisms to capture and enhance high-frequency details such as fine textures and edge signals, often leading to reconstructed images with oversmoothed textures and blurred edges. Secondly, the network architecture is predominantly designed in the spatial domain without explicit disentanglement or differential modeling of high- and low-frequency information, thereby limiting its capability to comprehensively learn frequency-specific features. Lastly, the model demonstrates insufficient sensitivity to edge regions and does not explicitly guide the attention towards structurally salient areas within remote sensing images (e.g., building outlines, road edges), which adversely affects the preservation of structural fidelity and overall reconstruction quality.
Therefore, despite the strong Transformer backbone provided by SwinIR, its full potential in remote sensing image super-resolution has yet to be fully realized. To further enhance the reconstruction of structural details and fine textures, it is imperative to integrate frequency-aware and edge-focused mechanisms into the SwinIR framework, thereby facilitating high-quality and high-fidelity image restoration.

3. Methods

To address the challenges of remote sensing image super-resolution (RSISR) [1], we aim to design an efficient Transformer-based framework with enhanced frequency modeling and detail reconstruction capabilities. To this end, we propose the Frequency-Aware Swin Transformer Network (FASwinNet), which integrates frequency-aware attention mechanisms and a high–low frequency feature decoupling strategy to improve reconstruction performance. In this section, we first present the overall architecture of the proposed FASwinNet. We then provide a detailed description of its two core components: the Frequency-Aware Attention Block (FAB), which facilitates the recovery of fine high-frequency details, and the Octave-based Residual Attention Block (ORAB), which enables effective frequency-domain feature modeling. Finally, we describe the loss functions and optimization strategies used during training.

3.1. Network Architecture

As illustrated in Figure 2, the overall architecture of the proposed network consists of three main stages: shallow feature extraction, deep feature extraction, and image reconstruction. This architectural paradigm has been widely adopted in prior works [6,18], and our design follows the overall structure of SwinIR [18].
Specifically, we denote the low-resolution input and the super-resolved output of the network as $I_{LR}$ and $I_{SR}$, respectively. Given a low-resolution input image $I_{LR} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the number of channels, height, and width of the image, respectively, we employ a 3 × 3 convolutional layer $H_{SF}(\cdot)$ to extract shallow features. The resulting feature map $F_0 \in \mathbb{R}^{C \times H \times W}$ is computed as follows:
$$F_0 = H_{SF}(I_{LR})$$
Convolutional layers are well-suited for early-stage visual processing and provide an efficient means of mapping input images from the spatial domain to a higher-dimensional feature space. Based on this observation, we feed the shallow feature map $F_0$ into a deep feature extraction module, denoted as $H_{DF}(\cdot)$. This allows the network to extract more complex and abstract representations. The resulting deep feature map $F_{DF} \in \mathbb{R}^{C \times H \times W}$ is computed as follows:
$$F_{DF} = H_{DF}(F_0)$$
The resulting deep feature map $F_{DF}$ primarily captures high-frequency information. It is obtained through a sequence of $K$ Residual Frequency-aware Attention Groups (RFAGs) followed by a 3 × 3 convolutional layer. Specifically, the intermediate features $F_1, F_2, \dots, F_K$ and the final deep feature $F_{DF}$ are extracted in a block-wise manner as follows:
$$F_i = H_{RFAG_i}(F_{i-1}), \quad i = 1, 2, \dots, K, \qquad F_{DF} = H_{conv}(F_K)$$
Here, $H_{RFAG_i}(\cdot)$ denotes the i-th Residual Frequency-aware Attention Group, and $H_{conv}(\cdot)$ represents the final convolutional layer. After deep feature extraction, a convolutional layer is used to reconstruct spatial features. To enhance information flow and facilitate the learning of residual components, we adopt a long-range skip connection. The final high-quality super-resolved image $I_{SR}$ is generated as:
$$I_{SR} = H_{RC}(F_0 + F_{DF})$$
where $H_{RC}(\cdot)$ denotes the reconstruction module, which consists of a 3 × 3 convolutional layer followed by a pixel-shuffle operation [35] for upsampling. This design enables efficient and accurate reconstruction from the combined shallow and deep features.
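The following PyTorch sketch mirrors the Eq. (1)–(4) pipeline end to end: a shallow 3 × 3 convolution, a stack of residual groups standing in for the RFAGs (their internals are detailed in Sections 3.2 and 3.3), a convolution after the body, a long skip connection, and a convolution plus pixel-shuffle reconstruction head. All module names and the placeholder RFAG body are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FASwinNetSketch(nn.Module):
    """Minimal sketch of the Eq. (1)-(4) pipeline: shallow conv -> K RFAGs ->
    conv -> long skip -> reconstruction (conv + pixel shuffle). The RFAG body
    below is only a placeholder residual block."""
    def __init__(self, in_ch=3, dim=90, num_groups=6, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, dim, 3, padding=1)            # H_SF
        self.groups = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1),          # stand-in for
                           nn.ReLU(inplace=True),                      # one RFAG
                           nn.Conv2d(dim, dim, 3, padding=1))
             for _ in range(num_groups)])
        self.conv_after_body = nn.Conv2d(dim, dim, 3, padding=1)       # H_conv
        self.reconstruct = nn.Sequential(                              # H_RC
            nn.Conv2d(dim, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr):
        f0 = self.shallow(lr)                        # shallow features, Eq. (1)
        x = f0
        for g in self.groups:                        # block-wise RFAG stack
            x = x + g(x)
        f_df = self.conv_after_body(x)               # deep features
        return self.reconstruct(f0 + f_df)           # long skip + upsampling, Eq. (4)

sr = FASwinNetSketch()(torch.randn(1, 3, 48, 48))    # -> (1, 3, 192, 192)
```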

3.2. Frequency-Aware Attention Block (FAB)

The Frequency-aware Attention Block (FAB) is a feature enhancement module designed for image reconstruction and enhancement tasks. It integrates the Swin Transformer with an Enhanced High-Similarity-Pass Attention mechanism to improve the model’s capability in capturing local details and frequency-aware structures. As illustrated in Figure 2, the architecture of FAB is structured as follows.
Residual Swin Transformer Block. The Residual Swin Transformer Block (RSTB) is the core component of SwinIR, designed to enable efficient image restoration by combining the local window-based self-attention mechanism of the Swin Transformer with residual connections. This design allows RSTB to effectively extract both local and global features from the input while preserving important structural information through residual learning. The residual connection also mitigates the problem of gradient vanishing, thus enhancing the stability of feature propagation. Compared with conventional Transformers, the Swin Transformer adopts a non-overlapping window partitioning strategy, which significantly reduces computational complexity. This makes it particularly suitable for processing high-resolution images while maintaining strong performance and efficiency.
Enhanced High-Similarity-Pass Attention. As illustrated in Figure 2, to effectively recover high-frequency details such as edges and textures in remote sensing images, we propose the Enhanced High-Similarity-Pass Attention (EHSPA) based on the original High-Similarity-Pass Attention (HSPA) module [36].
The original HSPA improves the modeling of structurally similar regions by constructing a global similarity map and selecting highly similar areas using a soft-thresholding operation for feature aggregation. While this approach enhances the representation of similar structures to some extent, it still suffers from several limitations. First, it lacks explicit modeling of high-frequency details, which are crucial for preserving fine-grained image content. Second, it offers limited spatial structural representation, making it less effective in capturing geometric layouts. Lastly, the matching-path representation capability is insufficient, leading to suboptimal performance in reconstructing edge- and texture-sensitive regions under the super-resolution task.
To address the aforementioned limitations, we introduce multi-modal auxiliary information to enhance the representational capacity of the EHSPA module. Specifically, considering that edge structures in remote sensing images often carry critical high-frequency information, we design an edge detection operator based on the Sobel filter to extract explicit edge features. These features serve as an additional signal to guide the construction of more accurate feature correspondences. For clarity, we denote the input feature map as $F_0 \in \mathbb{R}^{C \times H \times W}$, which is reshaped to $X \in \mathbb{R}^{C \times H \times W}$. We then apply the Sobel operator to extract an edge map $E \in \mathbb{R}^{1 \times H \times W}$, defined as:
$$E = \sqrt{(X * K_x)^2 + (X * K_y)^2 + \varepsilon}$$
where $*$ denotes the convolution operation, $K_x$ and $K_y$ are the horizontal and vertical Sobel kernels, respectively, and $\varepsilon$ is a small constant (e.g., $10^{-6}$) added for numerical stability.
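As a concrete reference, the sketch below computes the edge map with the standard 3 × 3 Sobel kernels and the ε-stabilized square root defined above; averaging the C input channels before filtering is an assumption, since the text does not state how the multi-channel feature is reduced to a single-channel edge map.

```python
import torch
import torch.nn.functional as F

def sobel_edge_map(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Gradient-magnitude edge map of a (B, C, H, W) feature map, returned as (B, 1, H, W)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=x.device)
    ky = kx.t()                                              # vertical Sobel kernel
    gray = x.mean(dim=1, keepdim=True)                       # assumed channel reduction
    gx = F.conv2d(gray, kx.view(1, 1, 3, 3), padding=1)      # horizontal gradient
    gy = F.conv2d(gray, ky.view(1, 1, 3, 3), padding=1)      # vertical gradient
    return torch.sqrt(gx ** 2 + gy ** 2 + eps)               # eps-stabilised magnitude
```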
Secondly, to enhance the spatial awareness of the attention mechanism, we introduce a 2D normalized positional encoding. By concatenating positional information with the original features, the model is encouraged to establish spatially consistent similarity measurements among pixels. The normalized 2D coordinate map $P \in \mathbb{R}^{2 \times H \times W}$ is defined as:
$$P_{i,j} = \left( \frac{2j}{W-1} - 1,\; \frac{2i}{H-1} - 1 \right)$$
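A minimal implementation of this coordinate map follows; it simply evaluates the two expressions above on a pixel grid, so both channels lie in [−1, 1].

```python
import torch

def normalized_coords(h: int, w: int) -> torch.Tensor:
    """Return the (2, H, W) map with x = 2j/(W-1) - 1 and y = 2i/(H-1) - 1."""
    ys = torch.linspace(-1.0, 1.0, h)             # 2i/(H-1) - 1 for i = 0..H-1
    xs = torch.linspace(-1.0, 1.0, w)             # 2j/(W-1) - 1 for j = 0..W-1
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([gx, gy], dim=0)
```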
We concatenate the original feature, edge map, and positional encoding to form an augmented feature representation, which is then used to extract attention-related embeddings. Specifically, two separate matching paths are used to extract attention features $F_1$ and $F_2$, while a reconstruction path extracts input feature $A$. These features are reshaped accordingly for attention computation:
$$X_{aug} = \mathrm{Concat}(X, E, P), \quad F_1 = f_1(X_{aug}), \quad F_2 = f_2(X_{aug}), \quad A = f_3(X),$$
$$Q = \mathrm{reshape}(F_1), \quad K = \mathrm{reshape}(F_2), \quad V = \mathrm{reshape}(A)$$
where $X_{aug} \in \mathbb{R}^{(C+3) \times H \times W}$, and $f_1$, $f_2$, $f_3$ are embedding functions implemented by 1 × 1 convolutional layers. The embeddings are $F_1, F_2 \in \mathbb{R}^{C' \times H \times W}$ and $A \in \mathbb{R}^{C \times H \times W}$, and the reshaped queries, keys, and values are $Q \in \mathbb{R}^{HW \times C'}$, $K \in \mathbb{R}^{C' \times HW}$, and $V \in \mathbb{R}^{HW \times C}$, respectively. Here, $C'$ denotes the reduced channel dimension after compression. To suppress noise from irrelevant regions, we apply a soft-thresholding operation $\mathrm{ST}(\cdot)$ that retains only the top-k most similar responses for each query:
$$S = \mathrm{ST}(Q \cdot K)$$
where S is the sparsified attention map, and all non-top-k responses are zeroed out and re-normalized. The final output feature is obtained by aggregating the attention scores with the reconstruction features and applying residual learning:
$$Y = \alpha \cdot \mathrm{reshape}(S \cdot V) + X$$
where α is a residual scaling factor that controls the strength of feature enhancement. This mechanism enables effective aggregation and preservation of features from highly similar regions, particularly in edge- and texture-rich areas.
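The sketch below assembles the pieces above: 1 × 1 convolutions embed the edge- and position-augmented features into two matching paths, the similarity map is sparsified by keeping only the top-k responses per query, the retained scores are renormalized, and the aggregated features are added back with a scaling factor α. The value of k, the softmax-based renormalization, and the channel-reduction ratio are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseSimilarityAttention(nn.Module):
    """Sketch of the EHSPA core: top-k sparsified similarity attention over the
    edge- and position-augmented features, followed by a scaled residual."""
    def __init__(self, channels: int, reduced: int, top_k: int = 64, alpha: float = 1.0):
        super().__init__()
        self.f1 = nn.Conv2d(channels + 3, reduced, 1)    # matching path 1 -> Q
        self.f2 = nn.Conv2d(channels + 3, reduced, 1)    # matching path 2 -> K
        self.f3 = nn.Conv2d(channels, channels, 1)       # reconstruction path -> V
        self.top_k, self.alpha = top_k, alpha

    def forward(self, x, edge, pos):
        b, c, h, w = x.shape
        x_aug = torch.cat([x, edge, pos.unsqueeze(0).expand(b, -1, -1, -1)], dim=1)
        q = self.f1(x_aug).flatten(2).transpose(1, 2)    # (B, HW, C')
        k = self.f2(x_aug).flatten(2)                    # (B, C', HW)
        v = self.f3(x).flatten(2).transpose(1, 2)        # (B, HW, C)
        sim = q @ k                                      # dense similarity, (B, HW, HW)
        thresh = sim.topk(self.top_k, dim=-1).values[..., -1:]
        sim = sim.masked_fill(sim < thresh, float("-inf"))
        attn = F.softmax(sim, dim=-1)                    # renormalise the kept responses
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return self.alpha * out + x                      # scaled residual connection

# Illustrative usage with stand-in edge and position maps (hypothetical sizes):
x = torch.randn(1, 48, 32, 32)
edge = torch.rand(1, 1, 32, 32)        # stand-in for the Sobel edge map above
pos = torch.rand(2, 32, 32) * 2 - 1    # stand-in for the normalized coordinate map
y = SparseSimilarityAttention(48, reduced=24)(x, edge, pos)   # (1, 48, 32, 32)
```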
Fusion and Residual connection. The output features from the RSTB and EHSPA modules are concatenated along the channel dimension. The fused features are then passed through a 1 × 1 convolutional layer to perform channel compression and feature integration. Finally, a residual connection is employed by adding the result to the input of the FAB, which helps preserve the original features and enhances both training stability and information flow. The Frequency-aware Attention Block (FAB) leverages a hybrid design combining RSTB and EHSPA, enabling joint modeling of global contextual information and local high-similarity patterns. Its multi-path fusion structure significantly enhances the model’s representational capacity, making it particularly effective for remote sensing image super-resolution tasks.
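A compact sketch of this fusion step, assuming the RSTB and EHSPA branches share the same channel width:

```python
import torch
import torch.nn as nn

class FABFusion(nn.Module):
    """Concatenate the RSTB and EHSPA outputs, compress back to the original
    channel width with a 1x1 convolution, and add the FAB input as a residual."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, fab_in, rstb_out, ehspa_out):
        fused = self.fuse(torch.cat([rstb_out, ehspa_out], dim=1))
        return fused + fab_in
```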

3.3. Octave-Based Residual Attention Block (ORAB)

To more effectively decouple and exploit high-frequency and low-frequency components in image features, we design an Octave-based Residual Attention Block (ORAB) that integrates attention mechanisms and residual enhancement. As illustrated in Figure 2, ORAB consists of three main components: FirstOctaveConv, multiple stacked OctaveConv blocks, and a final LastOctaveConv.
In the FirstOctaveConv stage, the input feature map is decomposed into high-frequency and low-frequency components. The low-frequency features are obtained via average pooling, enabling frequency-aware feature separation at the early stage.
Each OctaveConv block is a stackable unit designed to process high- and low-frequency features in parallel. In the high-frequency branch, we introduce High-Frequency Residual Enhancement (HFRE), implemented using two 3 × 3 convolutional layers followed by ReLU activation, which helps preserve and amplify fine-grained details. Additionally, an Efficient Localization Attention (ELA) mechanism [23] is integrated into both frequency branches. This attention design leverages 1D depth-wise convolutions within channels along with Group Normalization to model spatial relationships, thereby enhancing local feature sensitivity.
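The ELA-style attention used in both branches can be sketched as follows: directional average pooling along height and width, a 1D depth-wise convolution with Group Normalization on each pooled descriptor, and a sigmoid gate multiplied back onto the features. The kernel size, group count, and the shared 1D convolution are assumptions, not the exact configuration of ELA [23].

```python
import torch
import torch.nn as nn

class EfficientLocalizationAttentionSketch(nn.Module):
    """ELA-style gate: directional pooling, depth-wise 1D conv + GroupNorm, sigmoid."""
    def __init__(self, channels: int, kernel_size: int = 7, groups: int = 16):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, groups=channels)
        self.gn = nn.GroupNorm(groups, channels)   # channels must be divisible by groups

    def forward(self, x):
        b, c, h, w = x.shape
        pooled_h = x.mean(dim=3)                   # (B, C, H): average over width
        pooled_w = x.mean(dim=2)                   # (B, C, W): average over height
        attn_h = torch.sigmoid(self.gn(self.conv(pooled_h))).view(b, c, h, 1)
        attn_w = torch.sigmoid(self.gn(self.conv(pooled_w))).view(b, c, 1, w)
        return x * attn_h * attn_w                 # gate the features in both directions
```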
The LastOctaveConv module reunifies the high- and low-frequency branches into a single output stream. Specifically, the low-frequency features are upsampled to match the resolution of the high-frequency features and are then aggregated via element-wise addition. This design allows the module to retain global structural information from the low-frequency pathway while simultaneously improving high-frequency detail reconstruction. The entire ORAB can be formally expressed as:
$$Y = f_{Last}\Big( \big( \mathrm{OctaveConv}^{(T)} \circ \cdots \circ \mathrm{OctaveConv}^{(1)} \big) \big( f_{First}(X) \big) \Big)$$
where $X$ is the input feature map, $f_{First}$ and $f_{Last}$ denote the FirstOctaveConv and LastOctaveConv modules, respectively, and $\mathrm{OctaveConv}^{(t)}$ represents the t-th OctaveConv block in a stack of $T$ layers. By leveraging frequency separation, localized attention, and high-frequency residual enhancement, the proposed ORAB substantially improves the model’s ability to capture fine-grained image structures and contributes to high-fidelity reconstruction in remote sensing image super-resolution.
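Under the stated design, the frequency split and merge at the two ends of ORAB can be sketched as below; the 0.5 low-frequency channel ratio and the nearest-neighbor upsampling are assumptions, while the average-pooling split and element-wise merge follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstOctaveSplit(nn.Module):
    """Split the input into a full-resolution high-frequency branch and a
    half-resolution low-frequency branch obtained via average pooling."""
    def __init__(self, channels: int, alpha_low: float = 0.5):
        super().__init__()
        low_ch = int(channels * alpha_low)
        self.to_high = nn.Conv2d(channels, channels - low_ch, 3, padding=1)
        self.to_low = nn.Conv2d(channels, low_ch, 3, padding=1)

    def forward(self, x):
        x_high = self.to_high(x)                              # full resolution
        x_low = self.to_low(F.avg_pool2d(x, kernel_size=2))   # half resolution
        return x_high, x_low

class LastOctaveMerge(nn.Module):
    """Upsample the low-frequency branch and merge the two streams by addition."""
    def __init__(self, channels: int, low_ch: int):
        super().__init__()
        self.high_out = nn.Conv2d(channels - low_ch, channels, 3, padding=1)
        self.low_out = nn.Conv2d(low_ch, channels, 3, padding=1)

    def forward(self, x_high, x_low):
        up = F.interpolate(self.low_out(x_low), scale_factor=2, mode="nearest")
        return self.high_out(x_high) + up
```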

3.4. Learning Strategy

During model training, we adopt the widely used L1 loss as the objective function, and employ the Adam optimizer for parameter updates. Specifically, given a batch of $N$ paired low- and high-resolution images denoted as $\{(I_i^{LR}, I_i^{HR})\}_{i=1}^{N}$, the model is trained to minimize the following L1 loss:
$$\mathcal{L}_1(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| f_{FASwinNet}(I_i^{LR}) - I_i^{HR} \right\|_1$$
Here, $f_{FASwinNet}(\cdot)$ represents the proposed super-resolution network with trainable parameters $\theta$, and $f_{FASwinNet}(I_i^{LR})$ denotes the predicted high-resolution image corresponding to the input $I_i^{LR}$.
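A minimal training-step sketch of this objective is given below; the model and the paired LR/HR batches are assumed to be provided elsewhere, and the Adam hyperparameters follow the settings reported in Section 4.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               lr_img: torch.Tensor, hr_img: torch.Tensor) -> float:
    """Single optimisation step: L1 loss between the SR prediction and the HR target."""
    model.train()
    loss = F.l1_loss(model(lr_img), hr_img)   # mean absolute error over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimiser configuration as reported later (Adam, lr = 2e-4, betas = (0.9, 0.99)):
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
```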

4. Experiments

4.1. Datasets and Implementation

To validate the proposed method for remote sensing image super-resolution (RSISR), we use the AID dataset [37], which includes images from 30 categories. For balanced training, 30 images per category are randomly chosen (900 samples), and 2 images per category are used for testing (60 samples). This setup enables reliable evaluation of the model’s reconstruction quality and generalization in small-sample RSISR scenarios.
Low-resolution (LR) images are generated from high-resolution (HR) counterparts via ×4 bicubic downsampling. Evaluation is conducted using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) on the Y channel of YCbCr. In ablation studies, these metrics quantify the impact of individual components, while additional experiments with Mean Squared Error (MSE) and Learned Perceptual Image Patch Similarity (LPIPS) provide a thorough assessment at both pixel and perceptual levels.
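For reference, the sketch below follows the Y-channel evaluation protocol described above (BT.601 luma conversion followed by PSNR); whether any border cropping is applied before scoring is not stated, so none is assumed here.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """ITU-R BT.601 luma channel for an RGB image with values in [0, 255]."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0

def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    """PSNR computed on the Y channel of YCbCr, as used for the reported metrics."""
    mse = np.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return float(10.0 * np.log10(255.0 ** 2 / mse))
```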
The network is trained with the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.99$), using an initial learning rate of $2 \times 10^{-4}$ and a mini-batch size of 16. All experiments are carried out on an NVIDIA GeForce RTX 4090 GPU.

4.2. Ablation Study

In this section, to thoroughly evaluate the effectiveness of the proposed modules, we conduct systematic ablation studies on the AID dataset, focusing on the impact of the Enhanced High-Similarity-Pass Attention (EHSPA) and the Octave-based Residual Attention Block (ORAB) on model performance. For all experiments, the same hyperparameters are adopted; the number of RSTBs and STLs is set to 2, the window size to 8, the channel dimension to 48, and the number of attention heads to 2.
Training is performed for 30,000 iterations, and the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) are reported on the test set.
To quantitatively assess the impact of the proposed ORAB module, we develop several comparative model variants, and the evaluation results are summarized in Table 1.
To verify the effectiveness of the EHSPA module, we construct a series of baseline variants, and the corresponding results are presented in Table 2.
Finally, the overall ablation experiments consider four different combinations of EHSPA and ORAB. The results are reported in Table 3, where a ✓ denotes the inclusion of a module and a ✗ indicates its absence.

4.3. Comparisons with State-of-the-Art Methods

To comprehensively evaluate the performance of the proposed FASwinNet on remote sensing image super-resolution tasks, we conduct comparative experiments on the AID dataset against a range of mainstream and state-of-the-art super-resolution methods. These include classical approaches such as Bicubic and SRCNN [38], GAN-based methods like ESRGAN [2], as well as recent deep learning models with strong performance, including EDSR [8], RCAN [6], SwinIR [18], HAT [19], and several advanced architectures specifically designed for remote sensing images, such as CTNet [21], HSENet [20], HAUNet [39], TFSNet [40], and ACT-SR [41]. To achieve a better balance between performance and model complexity, we configure FASwinNet with 6 RSTBs, 6 STLs, a window size of 8, 90 channels, and 6 attention heads. Under this setting, the model is trained for 500,000 iterations.
As shown in Table 4, we compare the reconstruction performance of various methods at a ×4 upscaling factor using PSNR, SSIM, MSE, and LPIPS as evaluation metrics. The results demonstrate that our proposed method outperforms existing state-of-the-art approaches across all metrics on the AID dataset, achieving the best overall performance.
To further evaluate the generalization capability of the proposed model, we conduct cross-dataset experiments. The model is trained on the AID dataset, which serves as the source domain, and tested on the UCMerced dataset as the target domain. For evaluation, two images are randomly selected from each class of the UCMerced dataset, resulting in a representative subset for testing the model’s performance on unseen data. The reconstruction quality is quantitatively assessed using two widely adopted metrics, Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). The results of these cross-dataset tests are summarized in Table 5, which demonstrates the model’s ability to generalize beyond the training dataset and maintain high-quality reconstruction across different remote sensing image distributions.
As shown in Table 5, even when tested on images from an unseen dataset, the proposed model consistently outperforms the baseline and other comparative methods in terms of PSNR and SSIM. This indicates that the model effectively learns features with strong generalization capability, rather than being limited to the training samples from the AID dataset. Notably, the improvement in SSIM further suggests that the model can well preserve structural and textural information across different remote sensing scenes. Overall, these results validate the strong generalization ability of the proposed approach and highlight its potential advantages in practical scenarios where the test data distribution differs from that of the training data.
As shown in Figure 3, our method successfully reconstructs visually rich and structurally clear image content. In contrast, other methods exhibit varying degrees of blurring and detail loss. Our approach more accurately restores the textures of rooftop structures and road edges, with particularly sharp reconstruction observed in parking areas and along building boundaries. Notably, our method also preserves the circular line structures in the center of sports fields more effectively, exhibiting stronger continuity and clearer boundaries compared to other approaches. Visually, the reconstructed field textures appear more natural, with finer details better preserved. These visual results further demonstrate the superiority of our method in maintaining structural consistency and enhancing detail recovery in remote sensing image super-resolution.

4.4. Limitations

Although FASwinNet demonstrates strong reconstruction performance, several limitations remain. First, the model contains more parameters than lightweight RSISR networks such as CTNet or SRCNN, which reduces its suitability for real-time applications or deployment on resource-constrained edge devices. In addition, the incorporation of attention-based modules leads to increased memory consumption, particularly during training, potentially restricting its scalability on hardware with limited GPU resources. While the frequency-aware design improves texture and detail recovery, its performance gains become less pronounced when the input contains extremely weak or ambiguous high-frequency information. Furthermore, the current evaluation is primarily conducted under a ×4 upscaling setting, and the behavior of the model under larger magnification factors (e.g., ×8) remains to be fully explored. These limitations will be addressed in future work.

5. Conclusions

In this work, we proposed FASwinNet, a frequency-aware Swin Transformer framework for remote sensing image super-resolution. The network integrates two key components—Enhanced High-Similarity-Pass Attention (EHSPA) and Octave-based Residual Attention Block (ORAB)—to strengthen high-frequency representation and achieve more accurate texture reconstruction. Experiments on the AID dataset and cross-dataset evaluation on UCMerced demonstrate strong robustness and consistent superiority over state-of-the-art SR models. The major findings are summarized as follows:
  • FASwinNet significantly improves super-resolution on the AID dataset (×4 scale), effectively restoring both low- and high-frequency details.
  • EHSPA enhances structural sharpness, suppressing high-frequency degradation and alleviating over-smoothing common in Transformer-based SR models.
  • ORAB introduces explicit frequency separation, better modeling low-frequency semantics and high-frequency textures, improving clarity of boundaries, roof details, farmland patterns, and other key structures.
  • The model is lightweight, with only 4.26 M parameters, balancing efficiency and performance for practical deployment.
  • FASwinNet exhibits strong cross-dataset generalization; when trained on AID and tested on UCMerced without fine-tuning, it consistently outperforms existing methods, showing robustness to scene variations.
Despite its promising performance, FASwinNet offers several avenues for future work, including higher-magnification SR, multi-sensor integration, content-adaptive frequency decomposition, lightweight hardware-aware design, and cross-region or cross-temporal generalization to enhance robustness.
Overall, the results demonstrate that frequency-aware modeling is an effective strategy for remote sensing SR, and FASwinNet provides a solid foundation for both future research and practical high-resolution Earth observation applications.

Author Contributions

Conceptualization, Z.W. and K.C.; methodology, K.C. and X.W.; validation, X.W.; formal analysis, S.L.; investigation, S.L.; resources, K.C.; data curation, X.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W., S.L., K.C. and X.W.; visualization, Z.W.; supervision, S.L.; software, Z.W. and S.L.; project administration, K.C.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Foundation of Liaoning Educational Committee (Grant Nos. LJ212410153001, LJ212410153013) and the Liaoning Applied Basic Research Program (Grant No. 2025JH2/101300003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, D.; Li, Z.; Xia, Y.; Chen, Z. Remote sensing image super-resolution: Challenges and approaches. In Proceedings of the 2015 IEEE International Conference on Digital Signal Processing (DSP), Singapore, 21–24 July 2015; pp. 196–200. [Google Scholar]
  2. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops 2018, Munich, Germany, 8–14 September 2018. [Google Scholar]
  3. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  4. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. pp. 391–407. [Google Scholar]
  5. Kong, X.; Zhao, H.; Qiao, Y.; Dong, C. Classsr: A general framework to accelerate super-resolution networks by data characteristic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 19–25 June 2021; pp. 12016–12025. [Google Scholar]
  6. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  7. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 833–843. [Google Scholar]
  8. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  9. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  11. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  12. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  13. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Adv. Neural Inf. Process. Syst. 2021, 34, 12116–12128. [Google Scholar]
  14. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 19–25 June 2021; pp. 12299–12310. [Google Scholar]
  15. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 5728–5739. [Google Scholar]
  16. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 17683–17693. [Google Scholar]
  17. Li, W.; Lu, X.; Qian, S.; Lu, J.; Zhang, X.; Jia, J. On efficient transformer-based image pre-training for low-level vision. arXiv 2021, arXiv:2112.10175. [Google Scholar]
  18. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  19. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  20. Lei, S.; Shi, Z. Hybrid-scale self-similarity exploitation for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5401410. [Google Scholar] [CrossRef]
  21. Peng, G.; Xie, M.; Fang, L. Context-aware lightweight remote-sensing image super-resolution network. Front. Neurorobot. 2023, 17, 1220166. [Google Scholar] [CrossRef] [PubMed]
  22. Chen, Y.; Fan, H.; Xu, B.; Yan, Z.; Kalantidis, Y.; Rohrbach, M.; Yan, S.; Feng, J. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3435–3444. [Google Scholar]
  23. Xu, W.; Wan, Y. ELA: Efficient Local Attention for Deep Convolutional Neural Networks. arXiv 2024, arXiv:2403.01123. [Google Scholar] [CrossRef]
  24. Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 11–17 October 2021; pp. 579–588. [Google Scholar]
  25. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
  26. Wu, S.; Wu, T.; Tan, H.; Guo, G. Pale transformer: A general vision transformer backbone with pale-shaped attention. In Proceedings of the AAAI Conference on Artificial Intelligence 2022, Online, 22 February–1 March 2022; Volume 36, pp. 2731–2739. [Google Scholar]
  27. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134. [Google Scholar]
  28. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 11–17 October 2021; pp. 22–31. [Google Scholar]
  29. Huang, Z.; Ben, Y.; Luo, G.; Cheng, P.; Yu, G.; Fu, B. Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv 2021, arXiv:2106.03650. [Google Scholar] [CrossRef]
  30. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2021, Montreal, QC, Canada, 11–17 October 2021; pp. 568–578. [Google Scholar]
  31. Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef] [PubMed]
  32. Li, Y.; Zhang, K.; Cao, J.; Timofte, R.; Van Gool, L. Localvit: Bringing locality to vision transformers. arXiv 2021, arXiv:2104.05707. [Google Scholar]
  33. Patel, K.; Bur, A.M.; Li, F.; Wang, G. Aggregating global features into local vision transformer. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 1141–1147. [Google Scholar]
  34. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]
  35. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  36. Su, J.N.; Gan, M.; Chen, G.Y.; Guo, W.; Chen, C.P. High-similarity-pass attention for single image super-resolution. IEEE Trans. Image Process. 2024, 33, 610–624. [Google Scholar] [CrossRef] [PubMed]
  37. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  38. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part IV 13. pp. 184–199. [Google Scholar]
  39. Wang, J.; Wang, B.; Wang, X.; Zhao, Y.; Long, T. Hybrid attention-based U-shaped network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5612515. [Google Scholar] [CrossRef]
  40. Wang, J.; Lu, Y.; Wang, S.; Wang, B.; Wang, X.; Long, T. Two-stage spatial-frequency joint learning for large-factor remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606813. [Google Scholar] [CrossRef]
  41. Kang, Y.; Wang, X.; Zhang, X.; Wang, S.; Jin, G. ACT-SR: Aggregation Connection Transformer for Remote Sensing Image Super-Resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 18, 8953–8964. [Google Scholar] [CrossRef]
Figure 1. Quantitative comparison of parameter count and PSNR performance among different remote sensing image super-resolution methods on the AID test set.
Figure 2. The overall architecture of FASwinNet and the structure of FAB and ORAB.
Figure 3. Reconstruction results of different methods on test images from the AID dataset under ×4 upscaling. The red boxes indicate enlarged regions, showing that FASwinNet achieves clearer texture and edge restoration.
Table 1. Effectiveness analysis of ORAB on the ×4 AID test set.
Model               PSNR    SSIM
Baseline            26.58   0.7177
Baseline + Octave   26.63   0.7181
Baseline + ORAB     26.70   0.7209
Table 2. Effectiveness analysis of EHSPA on the ×4 AID test set.
Model               PSNR    SSIM
Baseline            26.58   0.7177
Baseline + HSPA     26.63   0.7186
Baseline + EHSPA    26.65   0.7196
Table 3. Ablation study of EHSPA and ORAB on the ×4 AID test set.
EHSPA   ORAB    PSNR    SSIM
✗       ✗       26.58   0.7177
✓       ✗       26.65   0.7193
✗       ✓       26.69   0.7215
✓       ✓       26.73   0.7234
Table 4. Quantitative comparison of different methods on the AID dataset. The best results are marked in red, and the second-best in blue.
Method    Parameters   Flops     Inference Time   PSNR    SSIM     MSE      LPIPS
Bicubic   –            –         –                25.34   0.6752   287.31   0.4893
SRCNN     0.02 M       0.57      2.74 ms          25.89   0.6929   249.14   0.3859
ESRGAN    16.69 M      645.25    3096.51 ms       25.48   0.6708   264.75   0.3359
EDSR      1.52 M       16.25     77.98 ms         26.71   0.7282   205.26   0.3145
RCAN      15.59 M      106.28    510.03 ms        26.85   0.7340   198.70   0.3056
SwinIR    2.33 M       3.48      16.70 ms         26.84   0.7346   199.31   0.3025
HAT       3.78 M       4.78      22.94 ms         26.89   0.7360   195.79   0.3038
CTNet     0.53 M       0.57      2.74 ms          26.39   0.7157   221.38   0.3424
HSENet    1.84 M       38.44     184.47 ms        26.59   0.7250   210.22   0.3170
HAUNet    8.99 M       37.79     181.35 ms        26.51   0.7212   215.64   0.3242
TFSNet    3.13 M       111.17    533.50 ms        26.83   0.7318   199.21   0.3168
ACT-SR    3.30 M       131.5     631.06 ms        26.91   0.7376   195.95   0.2984
Ours      4.26 M       13.92     66.80 ms         26.94   0.7380   194.55   0.3022
Table 5. Quantitative comparison of different methods on the UCMerced dataset. The best results are marked in red, and the second-best in blue.
Method    PSNR    SSIM     MSE      LPIPS
Bicubic   24.29   0.5956   312.15   0.5705
SRCNN     24.37   0.6004   306.74   0.5604
ESRGAN    24.32   0.5963   309.85   0.5452
EDSR      24.58   0.6067   291.94   0.5327
RCAN      24.59   0.6129   291.47   0.5301
SwinIR    24.58   0.6137   292.33   0.5221
HAT       24.63   0.6141   288.85   0.5212
CTNet     24.56   0.6055   293.35   0.5231
HSENet    24.58   0.6121   292.14   0.5323
HAUNet    24.58   0.6104   292.33   0.5384
TFSNet    24.61   0.6112   290.52   0.5312
ACT-SR    24.63   0.6133   288.25   0.5203
Ours      24.64   0.6156   287.99   0.5180