Article

RFANSR: Receptive Field Aggregation Network for Lightweight Remote Sensing Image Super-Resolution

1 School of Artificial Intelligence and Computer Science, North China University of Technology, Beijing 100144, China
2 School of Electrical and Control Engineering, North China University of Technology, Beijing 100144, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(24), 4028; https://doi.org/10.3390/rs17244028
Submission received: 18 November 2025 / Revised: 9 December 2025 / Accepted: 10 December 2025 / Published: 14 December 2025
(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?
  • RFANSR achieves superior performance with significantly reduced computational cost: 0.06 dB and 0.14 dB PSNR improvements on the RSCNN7 and DOTA datasets, respectively, while using only 383 K parameters (a 45.4% reduction compared to DLKN’s 702 K parameters).
  • The Progressive Receptive Field Aggregator (PRFA) effectively expands the receptive field through medium-sized kernels (7 × 7, 9 × 9, 11 × 11) while maintaining an asymptotically Gaussian distribution, avoiding the parameter redundancy of extremely large kernels.
What are the implications of the main findings?
  • The lightweight design (383 K parameters, 79.8 G FLOPs) makes high-quality remote sensing image super-resolution feasible on resource-constrained edge devices and in real-time processing scenarios.
  • The Statistical Guidance Module (SGM) provides a new paradigm for channel utilization in lightweight networks, replacing inefficient identity mapping with minimal parameter overhead (5.5% increase) while achieving ~1 dB performance improvement.

Abstract

Expanding the receptive field while maintaining efficiency is a key challenge in lightweight remote sensing super-resolution. Existing methods often suffer from parameter redundancy or insufficient channel utilization. To address these issues, we propose the Receptive Field Aggregation Network (RFANSR). First, we design a Progressive Receptive Field Aggregator (PRFA). It expands the receptive field by cascading medium-sized kernels, avoiding the heavy overhead of extremely large kernels. Second, we introduce a Statistical Guidance Module (SGM). This module replaces inefficient identity mappings with statistical channel recalibration to maximize feature utility. Additionally, we propose a Spatial-Gated Feed-Forward Network (SGFN) to reduce information loss via spatial attention. Extensive experiments demonstrate that RFANSR outperforms state-of-the-art lightweight models. Notably, RFANSR achieves PSNR improvements of 0.06 dB on RSCNN7 and 0.14 dB on DOTA datasets. Remarkably, it requires only 383 K parameters, representing a 45.4% reduction compared to DLKN.

Graphical Abstract

1. Introduction

Image super-resolution (SR) technology aims to recover high-resolution (HR) images from low-resolution (LR) images. It has important applications in fields such as computer vision and remote sensing image analysis [1,2,3]. In recent years, deep learning-based super-resolution methods have shown significant improvements in image restoration quality and have gradually replaced traditional approaches. However, these models often rely on complex network structures, which increases model parameters, computational cost, and inference latency. The issue is particularly pronounced on resource-constrained edge devices. Remote sensing images, as a special type of image, have high spatial resolution, large data volume, and complex geographic information. Compared to conventional images, remote sensing images cover larger areas and require handling spatial information at different scales, so remote sensing image super-resolution (RSISR) faces higher computational demands and storage pressure [4]. Moreover, remote sensing images have high-dimensional features, which makes it difficult for existing RSISR methods to efficiently extract both global and local information. Therefore, optimizing computational efficiency and inference speed while maintaining accuracy is a core challenge in RSISR, and solving it is crucial for improving practical application performance.
Dong et al. [5] first proposed SRCNN, which marked the introduction of deep learning into image super-resolution (SR) tasks. Since then, CNN-based SR approaches have rapidly advanced, significantly shaping the development of SR methods. Researchers have introduced various lightweight and efficient network architectures, such as RFDN [6] and BSRN [7]. These architectures enhance local feature extraction and reduce computational latency by stacking small (3 × 3) convolutional kernels. However, because they primarily rely on local receptive fields, it remains difficult to capture long-range dependencies and global contextual information. Consequently, important details may be lost and texture blurring often occurs in complex remote sensing scenes. To address this, researchers have attempted to expand the receptive field within the CNN framework to improve feature representation. The ShuffleMixer method [8] achieves receptive field expansion by combining large kernel decomposition with dilated convolution. While this enhances model performance to some extent, it inevitably leads to information loss. On the other hand, methods like LKMN [9] and PLKSR [10] use partial convolutions combined with large kernels to extract non-local features. Partial convolution reduces computational load by processing only a subset of the channels; the remaining channels are passed through identity mapping, which leads to under-utilization. Additionally, PLKSR employs a large 17 × 17 convolutional kernel, which substantially increases model parameters and causes heavy computational overhead. Therefore, achieving more effective non-local feature modeling has become a critical challenge in designing RSISR models, while maintaining the CNN’s advantages of low latency and efficient feature extraction.
To address the above issues, this paper proposes the Receptive Field Aggregation Network (RFANSR). Unlike existing methods, RFANSR employs a progressive receptive field aggregation strategy. It expands the effective receptive field by combining medium-sized convolutional kernels (7 × 7, 9 × 9, 11 × 11) while maintaining a progressive Gaussian distribution property. The Progressive Receptive Field Aggregator (PRFA) serves as the core module. It constructs a multi-layer progressive receptive field through a cascade of amplifier and discriminator branches in a dual-branch structure, enabling multi-scale feature fusion. The amplifier branch expands the receptive field and enhances significant pixel responses through large-kernel depth-wise convolution and element-wise multiplication. The discriminator branch introduces a cross-head feature supply mechanism to complement local discriminative features. Based on PRFA, the Dual-Stream Receptive Block (DSRB) divides the input features into two streams. One stream is fed recursively into each layer’s PRFA to form a pyramid-like channel expansion. The other modulates channel attention through the Statistical Guidance Module (SGM). Unlike existing methods that apply identity mapping to the remaining channels, the SGM uses a minimal number of parameters and simultaneously exploits the mean and standard deviation statistics of the features to promote richer global information interaction between channels. Furthermore, the Multi-Scale Distillation Block (MSDB) employs a distillation strategy to gradually refine features: in each layer, a subset of channels is extracted as distillation features, which are then concatenated and fused at the final stage. To address the information loss caused by identity mapping in a traditional MLP, the Spatial-Gated Feed-Forward Network (SGFN) introduces a partial spatial attention mechanism in the feed-forward layer. The features are divided into two parts: one part is weighted through spatial attention to highlight key regions, while the other part extracts channel information. This ensures that all channels participate in effective computation rather than passing through simple identity mapping. Through these designs, RFANSR effectively enhances the model’s global perception and feature discrimination ability while maintaining low computational complexity and inference latency, providing a solution for remote sensing image super-resolution tasks driven by both performance and efficiency. As shown in Figure 1, our method outperforms state-of-the-art methods by 0.14 dB while using 45.4% fewer parameters.
In summary, the primary motivation of this work is to reconcile the conflict between expanding receptive fields and maintaining computational efficiency. Existing large-kernel methods often incur excessive parameter overhead, whereas partial convolution strategies suffer from insufficient channel utilization. To overcome these limitations, we propose a compact yet expressive architecture. Specifically, our method integrates a progressive aggregation strategy for medium-sized kernels and incorporates statistical guidance to reactivate idle channels. This synergistic design effectively mitigates redundancy while preserving global contextual information. Consequently, RFANSR provides a feasible solution for high-quality remote sensing super-resolution in resource-constrained scenarios. Accordingly, the main contributions of this paper are summarized as follows:
  • We propose the Progressive Receptive Field Aggregator (PRFA). It achieves efficient non-local feature modeling by progressively combining medium-sized convolutional kernels. This approach avoids the parameter redundancy caused by extremely large kernels.
  • We design the Statistical Guidance Module (SGM) for lightweight global information interaction. It fully utilizes residual channels to improve channel utilization efficiency.
  • Experiments on multiple benchmark datasets show that RFANSR outperforms existing methods while maintaining low computational cost. This makes it suitable for resource-constrained scenarios.

2. Related Work

2.1. Image Super-Resolution

SR aims to reconstruct high-resolution images from low-resolution counterparts. CNN-based methods have long dominated this field. Dong et al. [5] proposed SRCNN, a three-layer end-to-end convolutional neural network that learns the mapping from low-resolution to high-resolution images. This shifted super-resolution from handcrafted priors to data-driven learning and yielded significant improvements in reconstruction quality. Kim et al. [11] proposed VDSR, a very deep convolutional network for single-image super-resolution that learns a global residual mapping; this alleviates gradient vanishing and permits substantially larger receptive fields. Lim et al. [12] introduced EDSR for single-image super-resolution. It removes batch normalization and scales the model’s depth and width using residual scaling, which mitigates normalization-induced range constraints and stabilizes training, delivering strong gains, especially for large upscaling factors. Dai et al. [13] proposed SAN, which incorporates second-order attention to improve feature representation. Niu et al. [14] introduced HAN, which uses global attention to better handle complex image scenes.
With the success of Transformer architectures in computer vision, Transformer-based models have been increasingly explored in SR. Liang et al. [15] proposed SwinIR. It utilizes the Swin Transformer with hierarchical window-based attention. This balances computational cost and reconstruction quality. Chen et al. [16] introduced HAT, which combines channel attention with window-based self-attention to improve image recovery, particularly in complex scenes. Long et al. [17] proposed the Progressive Focused Transformer (PFT). It introduces the Progressive Focused Attention (PFA) mechanism. This reduces computational costs and improves image reconstruction by focusing on more relevant features. Unlike traditional attention methods, PFA dynamically filters out non-relevant features. This is performed before similarity calculation. As a result, the model can employ larger window sizes while maintaining efficiency. To overcome the redundancy problem in self-attention, Chen et al. [18] proposed a Top-k Token Selective Attention that focuses on the most informative tokens to improve efficiency. These Transformer-based methods have outperformed traditional CNN-based approaches, particularly in capturing long-range dependencies and enhancing fine-grained details.

2.2. Lightweight Super-Resolution

To address the practical challenges of applying SR models under limited computational resources, numerous lightweight architectures have been developed. Hui et al. [19] proposed the Information Multi-distillation Network (IMDN), which introduces a multi-level distillation mechanism. Liu et al. [6] further optimized this approach with the Residual Feature Distillation Network (RFDN). Li et al. [7] developed the Blueprint Separable Residual Network (BSRN), replacing traditional convolutions with blueprint separable convolutions. Sun et al. [8] proposed the Shuffle Mixer Network, combining large kernels with channel splitting. Wang et al. [20] introduced FDSCSR with feature de-redundancy blocks. These methods primarily focus on efficient feature extraction through distillation or structural decomposition. However, they often lack effective mechanisms for expanding receptive fields while maintaining low computational cost.

2.3. Large Kernel Convolution Methods

Recent work has explored large convolutional kernels to expand receptive fields in lightweight super-resolution. This approach is particularly relevant for remote sensing images that require capturing long-range spatial dependencies across large geographic areas. Lee et al. [10] proposed PLKSR, which employs a 17 × 17 convolutional kernel combined with partial convolution. The method processes only a subset of channels through large kernel convolution and applies identity mapping to the remaining channels to reduce computational cost. Hu et al. [9] introduced LKMN, which also adopts a large kernel design with channel shuffling operations. In the remote sensing domain, Wang et al. [21] proposed FeNet with lightweight lattice blocks, and Liu et al. [22] introduced the Dynamic Large Kernel Network (DLKN), which incorporates dynamic large-kernel attention modules.
However, existing large kernel approaches suffer from inherent limitations that constrain their practical effectiveness. The adoption of extremely large kernels (e.g., 17 × 17) introduces substantial parameter overhead, with computational complexity scaling quadratically as O(k²) relative to kernel size k. Additionally, while partial convolution strategies effectively reduce computational burden, they simultaneously create suboptimal channel utilization patterns that limit feature representation capacity. In PLKSR, approximately 50% of channels undergo no computation through identity mapping, essentially wasting half of the model’s representational capacity. Moreover, single large kernels cannot guarantee optimal receptive field properties, often exhibiting irregular weight distributions. To address these limitations, we propose a dual-branch architecture with progressive medium-sized kernels. Our approach replaces identity mapping with statistical guidance, achieving both parameter efficiency and full channel utilization.

3. Proposed Method

As shown in Figure 2, the RFANSR architecture comprises three main components based on their functions. Shallow features are first extracted by a 3 × 3 convolutional layer. Deep features are then processed through a series of stacked Residual Feature Modulation Groups (RFMGs). Lastly, image reconstruction is achieved with a sub-pixel layer [23]. The subsequent discussion will focus on the design of the core RFMG module.

3.1. Dual-Stream Receptive Block (DSRB)

To minimize model complexity, we follow PLKSR [10] and FasterNet [24] and extract features from only a subset of channels; this operation is highly lightweight. Similarly to LKMN [9], we also introduce a channel shuffling operation. Additionally, we propose PRFA as a replacement for large convolutional kernels. This operation expands the receptive field and amplifies the influence of individual pixels on it, while maintaining the progressive Gaussian distribution within the effective receptive field. Given an input $I_{in} \in \mathbb{R}^{C \times H \times W}$, where C and H × W represent the number of channels and the spatial resolution, respectively, the output after channel shuffling and subsequent kernel splitting can be expressed as:
$I_{in}^{C/g},\; I_{in}^{C - C/g} = F_{split}^{g}\big(F_{shuffle}(I_{in})\big)$
where $F_{split}$ denotes the channel splitting operation, $F_{shuffle}$ represents the channel shuffling operation, and g represents the ratio of channel splitting. We perform feature extraction on the separated channel features $I_{in}^{C/g}$.
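As a concrete illustration, the following PyTorch sketch implements the shuffle-then-split operation above; the shuffle group count and the tensor names are illustrative assumptions rather than the released implementation.

```python
import torch

def shuffle_and_split(x: torch.Tensor, groups: int = 4, g: int = 4):
    """Channel shuffle followed by channel split (illustrative sketch).

    x: feature map of shape (B, C, H, W); returns the C/g channels processed further
    and the remaining C - C/g channels handled by the lightweight SGM branch.
    """
    b, c, h, w = x.shape
    # Channel shuffle: interleave channels across `groups` groups.
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)
    # Channel split with ratio g: the first C/g channels go to PRFA, the rest to SGM.
    c_split = c // g
    return x[:, :c_split], x[:, c_split:]
```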
The remaining channels exchange information through the SGM. The key difference is that we apply SGM to the other subset of channels. SGM integrates an enhanced Gaussian channel attention mechanism [25], which facilitates richer global information exchange between channels. As shown in Figure 2d, this module divides the input feature channels. One part passes through a 3 × 3 convolution to preserve local spatial features. The other part is sent into an enhanced Gaussian channel attention module to capture global context information. Unlike SENet [26], which only uses mean information, the enhanced attention mechanism calculates both the mean and variance of the channels. SGM acts as a replacement strategy. It provides meaningful computation for channels that would otherwise use identity mapping in partial convolution approaches. Assuming that feature maps approximate a normal distribution during training, the method more effectively utilizes Gaussian statistics. The (μ, σ) pair provides a complete second-order characterization of the distribution. This allows SGM to distinguish channels that share similar means but have different variances. For example, it can separate texture-rich channels with high variance from smooth regions with low variance. This helps build channel-level representations, enabling richer cross-channel information interaction. Global features are then extracted through PRFA. Finally, we concatenate the two parts along the channel dimension and use a 1 × 1 convolution for information fusion.
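The statistical recalibration idea can be sketched as follows; this minimal PyTorch illustration assumes a (mean, standard deviation) squeeze followed by a learned per-channel gate and omits the parallel 3 × 3 local branch, so layer names and sizes are assumptions rather than the exact SGM.

```python
import torch
import torch.nn as nn

class StatisticalGate(nn.Module):
    """Illustrative SGM-style gate: recalibrates channels from (mean, std) statistics."""

    def __init__(self, channels: int):
        super().__init__()
        # Maps the concatenated (mean, std) descriptor to per-channel weights in [0, 1].
        self.fc = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        mu = x.mean(dim=(2, 3))                           # per-channel mean, shape (B, C)
        sigma = x.std(dim=(2, 3))                         # per-channel std, shape (B, C)
        weights = self.fc(torch.cat([mu, sigma], dim=1))  # channel gate, shape (B, C)
        return x * weights.view(b, c, 1, 1)               # channel-wise recalibration
```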

3.2. Progressive Receptive Field Aggregator (PRFA)

To further enhance feature discriminability and global consistency, we propose the Progressive Receptive Field Aggregator (PRFA). As shown in Figure 3, this method maintains the asymptotically Gaussian distribution (AGD) of the effective receptive field (ERF). This module consists of two parallel branches: an amplifier (Amp) and a discriminator (Dis). It uses a progressive medium-kernel convolution structure across layers that gradually expands the receptive field. This design fundamentally differs from existing approaches in three ways. First, existing methods often rely on a single extremely large kernel, such as the 17 × 17 kernel used in PLKSR [10]; these designs suffer from O(k²) parameter redundancy. In contrast, PRFA achieves an equivalent receptive field by cascading medium-sized kernels. This reduces parameters by 27%. At the same time, PRFA preserves the AGD property, which is theoretically optimal for feature extraction. Second, simple cascaded convolutions allow only unidirectional information flow through sequential layers. In contrast, PRFA introduces horizontal cross-head connections through its discriminator branch. These connections enable bidirectional information flow, so each layer can anticipate features from later processing stages. Third, the dual-branch architecture ensures that amplification and discrimination occur simultaneously at each scale, rather than being treated as separate sequential operations.
Specifically, let the input features be $X \in \mathbb{R}^{B \times C \times H \times W}$, where B, C, H, and W respectively represent the batch size, number of channels, height, and width. The input is first split evenly into four sub-heads along the channel dimension:
$X = \{X_0, X_1, X_2, X_3\}, \quad X_i \in \mathbb{R}^{B \times C/4 \times H \times W}$
After each sub-head undergoes a 1 × 1 convolution for linear projection, it is directed to the corresponding amplifier and discriminator paths in its respective layer, forming a progressive multi-scale convolution structure. For the n-th layer (n = 1, 2, 3), the kernel size is set as $K_n = 2n + 5$, i.e., 7 × 7, 9 × 9, and 11 × 11. The amplifier branch aims to expand the receptive field and highlight significant pixel responses. Its computational form is:
$A_n^{amp} = a_{n,2} \odot DWConv_{K_n}\big(GELU(a_{n,1})\big)$
where $a_{n,1}, a_{n,2} \in \mathbb{R}^{B \times C_m \times H \times W}$ are the channel mappings obtained by projecting the input feature $X_n \in \mathbb{R}^{B \times C_m \times H \times W}$ through two independent 1 × 1 convolutions, and ⊙ represents the element-wise multiplication operation. The amplifier branch models long-range dependencies through large-kernel depth-wise convolution, and the element-wise multiplication enhances salient responses, achieving nonlinear amplification while maintaining the AGD property.
The discriminator branch supplements local details and discriminative features. It introduces a cross-head feature supply mechanism. This strengthens inter-layer associations. Specifically, the input h n of the discriminator in the n-th layer is obtained by mapping the higher-indexed feature head. This is performed through 1 × 1 and 3 × 3 depth convolutions. These convolutions assist in the discriminative modeling of the current layer’s features. When there is no higher-indexed head (e.g., in the third layer), an intra-layer adaptive compensation strategy is used. The computation is defined as:
$h_n = DWConv_{3 \times 3}\big(Conv_{1 \times 1}(X_{n+1})\big), \quad A_n^{dis} = DWConv_{K_n}(h_n) + h_n$
where $h_n \in \mathbb{R}^{B \times C_n \times H \times W}$ denotes the discriminator input obtained by mapping the (n + 1)-th feature head $X_{n+1}$ through 1 × 1 and 3 × 3 depth-wise convolutions for cross-head feature supply. When n = 3, where no higher-indexed head exists, $h_3 = DWConv_{3 \times 3}\big(Conv_{1 \times 1}(A_2)\big)$ is used as intra-layer compensation.
The amplifier and discriminator branches perform feature fusion within the layer, and their output is defined as:
$A_{n+1} = Conv_{1 \times 1}\big(Concat[A_n^{amp}, A_n^{dis}]\big)$
where Concat[·] denotes channel concatenation, and the fused channel count increases in a pyramid fashion: $C_{n+1} = C_n + \Delta C_n$. After three layers of stacking, the PRFA module forms an approximate four-layer asymptotically Gaussian receptive field structure within a single unit. This enables multi-scale feature aggregation and discriminative enhancement from local to global scales. This design effectively expands the model’s global receptive field while maintaining low computational cost. It enhances the model’s responsiveness to small-scale and edge features. This significantly improves network consistency and discriminative performance.
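To make the dual-branch computation concrete, the sketch below implements a single PRFA layer following the equations above; it keeps the channel count constant and omits the head splitting and pyramid-style channel growth, so the module names and widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PRFALayer(nn.Module):
    """Illustrative single PRFA layer with amplifier and discriminator branches."""

    def __init__(self, channels: int, kernel_size: int):  # kernel_size follows K_n = 2n + 5
        super().__init__()
        pad = kernel_size // 2
        # Amplifier branch: two 1x1 projections, a large depth-wise conv, and a gated multiply.
        self.proj1 = nn.Conv2d(channels, channels, 1)
        self.proj2 = nn.Conv2d(channels, channels, 1)
        self.dw_amp = nn.Conv2d(channels, channels, kernel_size, padding=pad, groups=channels)
        # Discriminator branch: cross-head supply (1x1 then 3x3 depth-wise), large depth-wise conv, skip.
        self.supply = nn.Sequential(
            nn.Conv2d(channels, channels, 1),
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
        )
        self.dw_dis = nn.Conv2d(channels, channels, kernel_size, padding=pad, groups=channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.act = nn.GELU()

    def forward(self, x_n: torch.Tensor, x_next: torch.Tensor) -> torch.Tensor:
        a_amp = self.proj2(x_n) * self.dw_amp(self.act(self.proj1(x_n)))  # A_n^amp
        h = self.supply(x_next)                                           # cross-head supply h_n
        a_dis = self.dw_dis(h) + h                                        # A_n^dis
        return self.fuse(torch.cat([a_amp, a_dis], dim=1))                # fused layer output
```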

3.3. Multi-Scale Distillation Block (MSDB)

Local features help reconstruct details, such as texture and edges. Non-local features capture long-range dependencies and global context. To fully leverage their complementarity, we design HFAB, as shown in Figure 2e. The 3 × 3 convolution extracts local features, while DSRB extracts global features. The outputs of the two branches interact via element-wise multiplication (equivalent to dynamically gating local textures with global responses). This is followed by a 1 × 1 convolution, which performs channel reorganization and cross-channel information mixing, resulting in fused features. Given the input feature $I_{in}$, the output of HFAB can be expressed as:
$I_{out} = Conv_{1 \times 1}\big(DWConv_{3 \times 3}(I_{in}) \odot F_{DSRB}(I_{in})\big)$
We then cascade multiple levels of HFAB along the backbone. After each level, we apply 1 × 1 convolutions for channel compression and importance re-calibration. This results in distilled feature representations. These distilled representations are concatenated along the channel dimension and then fused using another 1 × 1 convolution, forming the Hybrid Feature Distillation Block (HFDB). This effectively enhances representation ability while suppressing redundancy. In the HFDB process, the input feature $F_{in}$ enters both the backbone path and the auxiliary distillation path. The backbone path uses the HFAB module to extract core feature representations. The auxiliary distillation path utilizes lightweight 1 × 1 convolutions that progressively refine key feature information. The mathematical expression of this distillation process is as follows:
$F_{main}^{1}, F_{aux}^{1} = HFAB_{1}(F_{in}),\; Conv_{1 \times 1}^{1}(F_{in})$
$F_{main}^{2}, F_{aux}^{2} = HFAB_{2}(F_{main}^{1}),\; Conv_{1 \times 1}^{2}(F_{main}^{1})$
$F_{main}^{3}, F_{aux}^{3} = HFAB_{3}(F_{main}^{2}),\; Conv_{1 \times 1}^{3}(F_{aux}^{2})$
Here, $F_{main}^{i}$ and $F_{aux}^{i}$ represent the feature outputs of the backbone path and the auxiliary path at the i-th stage, respectively. After three layers of feature distillation, we aggregate feature information from different levels via channel concatenation. A 1 × 1 convolution is then applied for feature fusion. The final output of HFDB, $F_{out}$, can be expressed as:
$F_{out} = Conv_{1 \times 1}\big(Concat(F_{aux}^{1}, F_{aux}^{2}, F_{aux}^{3})\big)$
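The three-stage distillation above can be sketched as follows; the HFAB blocks are replaced by plain 3 × 3 convolution stand-ins and the distilled channel width is an assumption, so this is an illustration of the data flow rather than the actual module.

```python
import torch
import torch.nn as nn

class HFDBSketch(nn.Module):
    """Illustrative three-stage distillation: backbone path plus 1x1 auxiliary path."""

    def __init__(self, channels: int = 48, distill: int = 16):
        super().__init__()
        # Stand-ins for HFAB (the real block fuses local 3x3 and global DSRB features).
        self.hfab = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1) for _ in range(3))
        self.aux1 = nn.Conv2d(channels, distill, 1)
        self.aux2 = nn.Conv2d(channels, distill, 1)
        self.aux3 = nn.Conv2d(distill, distill, 1)  # refines the previously distilled features
        self.fuse = nn.Conv2d(3 * distill, channels, 1)

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        f_main1, f_aux1 = self.hfab[0](f_in), self.aux1(f_in)
        f_main2, f_aux2 = self.hfab[1](f_main1), self.aux2(f_main1)
        f_main3, f_aux3 = self.hfab[2](f_main2), self.aux3(f_aux2)  # f_main3 is not fused below
        # Only the distilled features are concatenated and fused, following the final equation.
        return self.fuse(torch.cat([f_aux1, f_aux2, f_aux3], dim=1))
```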

3.4. Spatial-Gated Feedforward Network (SGFN)

To enhance the network’s nonlinear expressive ability and introduce an effective spatial attention mechanism, we design the SGFN. As shown in Figure 2b, SGFN uses a dual-path parallel structure similar to an MLP, achieving adaptive feature fusion through a cross-gating mechanism. Specifically, we integrate a spatial attention module at the end of SGFN. Unlike traditional attention mechanisms, SGFN uses pointwise convolution to compress global channel information into a single-channel tensor, which is then passed through a hard-sigmoid activation to generate a spatial attention map for feature weighting. This enhances the network’s ability to focus on key spatial regions.
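A minimal sketch of this spatial gate is shown below, assuming a 1 × 1 channel-squeeze convolution followed by a hard-sigmoid; the surrounding dual-path MLP structure of SGFN is omitted.

```python
import torch
import torch.nn as nn

class SpatialGate(nn.Module):
    """Illustrative SGFN-style spatial attention: squeeze channels to one map, then gate."""

    def __init__(self, channels: int):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, 1, kernel_size=1)  # compress channel information
        self.act = nn.Hardsigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.act(self.squeeze(x))  # (B, 1, H, W) spatial attention map
        return x * attn                   # every channel is reweighted by the shared map
```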

3.5. Loss Function

To optimize our network and reconstruct high-quality super-resolved images with sharp details, we design a composite loss function. This function integrates the pixel-wise L1 loss with the frequency-domain FFT loss to balance spatial fidelity and spectral consistency. The total loss $L_{total}$ is formulated as:
$L_{total} = L_1 + \lambda L_{fft}$
where λ acts as a hyperparameter to balance the two loss components.
The pixel-wise L1 loss is widely used in super-resolution tasks for its convergence stability and ability to prevent distinct artifacts. It measures the mean absolute error between the ground truth $I_{HR}$ and the super-resolved output $\hat{I}_{HR}$:
$L_1 = \frac{1}{N}\sum_{i=1}^{N}\big| I_{HR}(i) - \hat{I}_{HR}(i) \big|$
The FFT loss calculates the L1 distance in the frequency domain:
$L_{fft} = \frac{1}{N}\sum_{i=1}^{N}\big| \mathcal{F}(I_{HR})(i) - \mathcal{F}(\hat{I}_{HR})(i) \big|$
where $\mathcal{F}(\cdot)$ denotes the Fast Fourier Transform operation.
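A compact PyTorch form of the composite loss is given below; the weight `lam` is an illustrative value, since no specific λ is stated here.

```python
import torch
import torch.nn.functional as F

def composite_loss(sr: torch.Tensor, hr: torch.Tensor, lam: float = 0.05) -> torch.Tensor:
    """L_total = L1 + lambda * L_fft (the lambda value used here is illustrative)."""
    l1 = F.l1_loss(sr, hr)
    # Frequency-domain term: L1 distance between the 2D Fourier spectra of the two images.
    l_fft = (torch.fft.fft2(sr, dim=(-2, -1)) - torch.fft.fft2(hr, dim=(-2, -1))).abs().mean()
    return l1 + lam * l_fft
```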

4. Experiment

4.1. Experimental Results

To validate the effectiveness of the proposed method, experiments are conducted on three public datasets: RSCNN7 [27], DOTA [28], and WHU-RS19. For the RSCNN7 dataset, the data are split into training, validation, and test sets in a 7:2:2 ratio. For the DOTA and WHU-RS19 datasets, we randomly selected 1000 images as an independent test set. To assess the robustness of our method, we train the model solely on the RSCNN7 dataset and then test the trained weights directly on the DOTA dataset without any fine-tuning. To objectively evaluate reconstruction quality, we employ the Peak Signal-to-Noise Ratio (PSNR). This pixel-based metric is derived from the ratio between the maximum pixel value and the Mean Squared Error (MSE). Given a ground truth image and its reconstruction with N pixels, denoted as $I_{HR}$ and $\hat{I}_{HR}$, respectively, the PSNR is defined as:
$\mathrm{PSNR} = 10 \times \log_{10}\!\left(\frac{L^{2}}{\frac{1}{N}\sum_{i=1}^{N}\big(I_{HR}(i) - \hat{I}_{HR}(i)\big)^{2}}\right)$
where L denotes the maximum pixel value.
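For reference, a direct PyTorch form of this metric, assuming images normalized so that the maximum pixel value L equals 1:

```python
import torch

def psnr(hr: torch.Tensor, sr: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """PSNR in dB between a ground-truth image and its reconstruction."""
    mse = torch.mean((hr - sr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```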
Despite its popularity, PSNR exhibits limitations in capturing structural details and aligns imperfectly with human visual perception due to its pixel-level nature. Therefore, we also utilize the Structural Similarity Index (SSIM), which emulates the Human Visual System (HVS) to assess structural fidelity more accurately. The definition of SSIM is given by:
$\mathrm{SSIM}(I_{HR}, \hat{I}_{HR}) = \big[C_l(I_{HR}, \hat{I}_{HR})\big]^{\alpha}\,\big[C_c(I_{HR}, \hat{I}_{HR})\big]^{\beta}\,\big[C_s(I_{HR}, \hat{I}_{HR})\big]^{\gamma}$
where α, β, and γ denote weighting parameters that modulate the relative contribution of each corresponding component.
To accelerate training, we crop the training HR images into 480 × 480 patches using a sliding window approach. Data augmentation is applied through random rotations, as well as horizontal and vertical flips. The cropped LR patches have a size of 16 × 16, and the batch size is set to 32. We use the Adan optimizer to perform 5000 K iterations of optimization for the combined L1 and FFT losses. The initial and minimum learning rates are set to $5 \times 10^{-3}$, and the learning rate is updated according to a cosine annealing schedule. All experiments are conducted on an NVIDIA RTX 4060 Ti GPU using the PyTorch (v2.0.1) framework.

4.2. Qualitative and Quantitative Experiments

We evaluate our method on the RSCNN7, DOTA, and WHU-RS19 datasets for ×2 and ×4 super-resolution tasks. Table 1 shows quantitative comparisons using PSNR, SSIM, parameter count (Params), and FLOPs as evaluation metrics. For ×2 super-resolution, our method achieves the best performance on both the RSCNN7 and DOTA datasets. On RSCNN7, we obtain 32.51 dB PSNR and 0.8723 SSIM, outperforming DLKN by 0.06 dB and 0.0024, respectively. On DOTA, our method reaches 36.59 dB PSNR and 0.9363 SSIM, surpassing DLKN by 0.14 dB and 0.0215. Notably, our method uses only 383 K parameters and 79.8 G FLOPs, significantly fewer than DLKN (702 K parameters, 159 G FLOPs). For ×4 super-resolution, the advantages are even more evident. Our method achieves 28.43 dB PSNR and 0.7275 SSIM on RSCNN7, and 29.82 dB PSNR and 0.8303 SSIM on DOTA, both representing the best results among all methods. The computational cost remains low at 399 K parameters and 21.1 G FLOPs, reducing complexity by 45% and 47% compared to DLKN [22] while maintaining superior performance.
The results demonstrate that our method effectively balances reconstruction quality and computational efficiency, making it practical for remote sensing applications. We also conducted extensive testing on five natural image test sets: Set5 [29], Set14 [30], Urban100 [31], BSD100 [30], and Manga109 [32]. The results, shown in Table 2, allow us to assess the performance of our approach. To ensure fairness, all comparison models were implemented using their publicly available training codes, and evaluation was carried out strictly according to standard benchmark settings.
The experimental results demonstrate that our method outperforms the others at all magnifications (×2, ×3, ×4), with particularly significant advantages at ×4. Compared to SPAN, our method achieves improvements of 0.49 dB and 0.55 dB on the Urban100 and Manga109 datasets, respectively. These results strongly validate the effectiveness of our approach in reconstructing complex scenes and high-frequency textures. At magnifications of ×2 and ×3, our method also demonstrates comprehensive advantages. For example, at ×2 magnification, our method achieves a PSNR improvement of 0.24 dB on the Set5 dataset compared to CFSR [33]; on the Urban100 dataset, the improvement is 0.15 dB. Compared to SPAN [34], our method improves PSNR by 0.21 dB on Set5 and by 0.14 dB on Urban100. At ×3 magnification, our method achieves a significant improvement of 0.45 dB on the Urban100 dataset and outperforms SPAN [34] by 0.35 dB on the Manga109 dataset. These comparisons further validate the applicability and robustness of our method across different scenarios. In addition to the significant improvement in reconstruction quality, our method also shows clear advantages in model efficiency. At ×4 magnification, our method has only 398 K parameters, compared with 298 K for CFSR and 481 K for SPAN [34]. This ensures high performance while significantly reducing computational complexity. This lightweight design makes our network more suitable for practical applications, especially in resource-constrained environments.
Additionally, Figure 4 shows the training curves. Our method exhibits better convergence in the early stages of training without requiring extensive iterations.
Table 1. Quantitative comparison of different methods on RSCNN7 and DOTA datasets for ×2 and ×4 super-resolution. The best results are highlighted in red.
Scale | Method | Params | FLOPs | RSCNN7 PSNR/SSIM | DOTA PSNR/SSIM | WHU-RS19 PSNR/SSIM
×2 | RLFN [35] | 527 K | 116 G | 32.39/0.8684 | 36.36/0.9361 | 35.48/0.9302
×2 | RFDN [6] | 417 K | 91.3 G | 32.37/0.8680 | 36.44/0.9259 | 35.67/0.9303
×2 | FENet [21] | 351 K | 77.1 G | 32.35/0.8674 | 36.33/0.9357 | 35.66/0.9301
×2 | SMFANet [36] | 186 K | 38.8 G | 32.37/0.8679 | 36.46/0.9364 | 35.69/0.9303
×2 | SPAN [34] | 426 K | 94.3 G | 32.38/0.8681 | 36.19/0.9360 | 35.60/0.9301
×2 | DLKN [22] | 702 K | 159 G | 32.45/0.8699 | 36.45/0.9148 | 37.87/0.9149
×2 | Ours | 383 K | 79.8 G | 32.51/0.8723 | 36.59/0.9363 | 35.80/0.9321
×4 | RLFN [35] | 544 K | 29.8 G | 28.39/0.7241 | 29.68/0.8276 | 30.03/0.7990
×4 | RFDN [6] | 433 K | 23.7 G | 28.40/0.7243 | 29.78/0.8291 | 30.06/0.7999
×4 | FENet [21] | 366 K | 20.1 G | 28.41/0.7236 | 29.85/0.8298 | 30.10/0.8001
×4 | SMFANet [36] | 197 K | 10.4 G | 28.38/0.7234 | 29.71/0.8265 | 30.04/0.7986
×4 | SPAN [34] | 426 K | 24.4 G | 28.38/0.7241 | 29.77/0.8285 | 30.01/0.7991
×4 | DLKN [22] | 721 K | 40.9 G | 28.39/0.7272 | 29.53/0.8124 | 29.72/0.7844
×4 | Ours | 399 K | 21.1 G | 28.43/0.7275 | 29.82/0.8303 | 30.13/0.8009
Table 2. Quantitative comparison of average PSNR (dB)/SSIM among lightweight models. The best results are highlighted in red.
Scale | Method | Params | Set5 PSNR/SSIM | Set14 PSNR/SSIM | BSDS100 PSNR/SSIM | Urban100 PSNR/SSIM | Manga109 PSNR/SSIM
×2 | IMDN [19] | 694 K | 37.89/0.9606 | 33.51/0.9169 | 32.12/0.8990 | 32.00/0.9267 | 38.61/0.9770
×2 | RFDN [6] | 417 K | 37.78/0.9606 | 33.35/0.9166 | 32.09/0.8991 | 31.79/0.9254 | 38.29/0.9764
×2 | RLFN [35] | 526 K | 37.88/0.9606 | 33.44/0.9168 | 32.13/0.8991 | 31.88/0.9259 | 38.39/0.9766
×2 | CFSR [33] | 298 K | 37.86/0.9605 | 33.44/0.9169 | 32.12/0.8992 | 31.77/0.9352 | 38.31/0.9764
×2 | SPAN [34] | 481 K | 37.94/0.9608 | 33.47/0.9165 | 32.14/0.8993 | 31.92/0.9265 | 38.30/0.9765
×2 | Ours | 383 K | 38.10/0.9613 | 33.79/0.9202 | 32.25/0.9008 | 32.52/0.9319 | 39.14/0.9780
×3 | IMDN [19] | 703 K | 34.36/0.9272 | 30.28/0.8412 | 29.05/0.8045 | 28.09/0.8504 | 33.48/0.9438
×3 | RFDN [6] | 424 K | 34.18/0.9260 | 30.23/0.8406 | 29.02/0.8037 | 27.90/0.8475 | 33.23/0.9422
×3 | RLFN [35] | 533 K | 34.24/0.9266 | 30.26/0.8412 | 29.04/0.8412 | 27.99/0.8489 | 33.28/0.9426
×3 | CFSR [33] | 294 K | 34.23/0.9262 | 30.25/0.8406 | 29.04/0.8044 | 27.90/0.8475 | 33.30/0.9428
×3 | SPAN [34] | 417 K | 34.28/0.9268 | 30.27/0.8417 | 29.06/0.8049 | 28.04/0.8499 | 33.39/0.9436
×3 | Ours | 389 K | 34.37/0.9271 | 30.38/0.8434 | 29.12/0.8064 | 28.27/0.8552 | 33.69/0.9457
×4 | IMDN [19] | 715 K | 32.09/0.8942 | 28.54/0.7810 | 27.52/0.7340 | 25.96/0.7819 | 30.33/0.9063
×4 | RFDN [6] | 433 K | 32.13/0.8943 | 28.50/0.7795 | 27.51/0.7339 | 25.92/0.7803 | 30.20/0.9051
×4 | RLFN [35] | 543 K | 31.97/0.8931 | 28.47/0.7795 | 27.51/0.7342 | 25.88/0.7803 | 30.12/0.9035
×4 | CFSR [33] | 303 K | 32.00/0.8930 | 28.49/0.7797 | 27.52/0.7343 | 25.84/0.7781 | 30.15/0.9045
×4 | SPAN [34] | 426 K | 32.08/0.8942 | 28.53/0.7810 | 27.55/0.7351 | 25.95/0.7812 | 30.34/0.9064
×4 | Ours | 399 K | 32.37/0.8974 | 28.75/0.7855 | 27.66/0.7400 | 26.44/0.7966 | 30.89/0.9132
Based on the subjective visual quality comparison, our method demonstrates superior super-resolution performance across multiple domains and scenarios. As shown in Figure 5 and Figure 6, on urban aerial images, our method effectively recovers the structural details of buildings and road markings. Sharp edges and rich texture information are well preserved. For remote sensing image super-resolution tasks, we focus on scenes containing dense high-frequency details, such as building roof textures and solar panel arrays. Our method achieves superior reconstruction of structural integrity and texture details. Meanwhile, artifacts are effectively suppressed. Figure 7 illustrates the performance of our method on standard natural image benchmarks. Our approach exhibits a strong capability for high-frequency detail recovery and maintains excellent texture fidelity.
This performance significantly surpasses that of comparative methods. This superior performance comprehensively demonstrates the model’s ability to generalize across different image types. Consequently, it proves to be highly robust in handling challenging scenarios, including those with complex textures and regular structures. The consistent results across these tests highlight the practical value and adaptability of our method.

4.3. Complexity and Inference Efficiency

To further evaluate the practical deployability of RFANSR, we compared its inference speed, memory consumption, and computational complexity (FLOPs) against representative lightweight methods. The evaluation was conducted on an NVIDIA RTX 4060 Ti GPU.
As shown in Table 3, RFANSR achieves the lowest computational cost with 21.1 G FLOPs. In terms of inference speed, our method (38.8 ms) is significantly faster than DLKN (44.4 ms). Although it is slightly slower than RLFN [35] (35.0 ms), RFANSR achieves a higher PSNR of 29.82 dB. This demonstrates that our method maintains a competitive balance between efficiency and reconstruction quality.

4.4. Ablation Studies

4.4.1. Effects of SGM

To evaluate the effect of the SGM, we constructed two models: one includes SGM, while the other does not (Table 4). Without the SGM, the model has 363 K parameters. After adding SGM, the parameter count increases to 383 K, an additional 20 K parameters, or approximately 5.5%. Despite this minimal increase in complexity, the model shows an average PSNR improvement of about 1 dB across five benchmark datasets. The maximum improvement reaches nearly 2 dB on the Manga109 dataset. In other words, SGM provides stable and significant performance gains across multiple datasets with minimal parameter overhead. This demonstrates the module’s effectiveness in enhancing detail reconstruction. It also improves structural restoration, making it a valuable addition to the model.
As illustrated in Figure 8, the SGM demonstrates remarkable channel selectivity. The statistical evidence is clear: 45.8% of channels are actively enhanced (weight > 0.7), while 20.8% of redundant channels are effectively suppressed (weight < 0.3). Only 33.3% of channels remain in a balanced state. This non-uniform distribution validates the module’s superiority over traditional identity mapping. It confirms that SGM can effectively discriminate between more and less useful feature channels, leading to more intelligent resource allocation.

4.4.2. Effects of PRFA

To investigate the impact of kernel size in the Progressive Receptive Field Aggregator (PRFA), we conducted comparison experiments. We kept the rest of the network architecture unchanged and tested the following three kernel size combinations: (5, 7, 9), (7, 9, 11), and (9, 11, 13). The results are shown in Table 5. When the kernel sizes are expanded from (5, 7, 9) to (7, 9, 11), the model parameter count increases from 376 K to 383 K, a change of about 1.9%. Despite the small increase in parameters, the overall performance improves significantly across the five datasets: the average PSNR increases by approximately 0.35 dB, with the largest improvement observed on the Manga109 dataset, where it reaches +0.94 dB. This demonstrates that a moderate expansion of the receptive field helps capture richer spatial context information. However, when the kernel sizes are further increased to (9, 11, 13), the parameter count rises to 426 K and performance decreases by approximately 0.7 dB on average. This suggests that overly large kernels introduce redundant features and disrupt the progressive Gaussian distribution property. Overall, the (7, 9, 11) combination strikes the best balance between performance and complexity. It is selected as the default setting for subsequent experiments.

4.4.3. Effects of SGFN

To validate the necessity of our proposed Spatial-Gated Feedforward Network (SGFN), we conduct ablation studies comparing the following four FFN designs: (1) MLP, a traditional multi-layer perceptron baseline; (2) SimpleGate, employing basic channel-wise gating; (3) ConFNN, using spatial convolutions without gating; and (4) SGFN (ours), combining spatial modeling with content-adaptive gating. As shown in Table 6, SGFN consistently outperforms all alternatives across the five benchmark datasets.

5. Discussion

The experimental results demonstrate that RFANSR achieves a superior trade-off between reconstruction quality and model complexity. It outperforms state-of-the-art methods like DLKN and SPAN [34]. This advantage primarily stems from our distinct design philosophy. Existing large-kernel methods often suffer from parameter redundancy, while partial convolution strategies lead to channel underutilization. In contrast, our Progressive Receptive Field Aggregator (PRFA) cascades medium-sized kernels. This effectively simulates large kernels with significantly reduced overhead. Furthermore, ablation studies validate the Statistical Guidance Module (SGM). SGM uses feature statistics to reactivate idle channels. This strategy proves more efficient than simple identity mapping, providing substantial performance gains with negligible parameter increase.
Beyond the improvements in visual quality, the proposed method holds significant potential for boosting downstream remote sensing tasks. High-fidelity reconstruction serves as a critical pre-processing step that alleviates the domain gap caused by low resolution. For instance, in object detection tasks, integrating super-resolution has proven highly effective in frameworks like SuperYOLO [37], where the recovered high-frequency details significantly enhance the feature representation of small objects. Similarly, for scene classification, the restored structural integrity can assist attention-based models in focusing on more informative regions. This indicates that RFANSR can function as an efficient front-end module to construct high-performance joint processing pipelines for intelligent interpretation.
These findings suggest that RFANSR is theoretically suitable for resource-constrained platforms. However, we must acknowledge practical limitations regarding deployment efficiency. Although our method achieves low theoretical complexity (FLOPs), the actual inference speed is relatively slow, and the peak GPU memory occupancy remains high during processing. This discrepancy is likely due to the architecture’s topology. Specifically, the PRFA module employs a complex multi-branch structure, and the SGM involves statistical calculations that require frequent memory access; these operations are often less hardware-friendly than standard convolutions. Moreover, since our model is trained on standard benchmarks, its performance may vary when facing complex real-world degradations such as sensor noise. Consequently, while the model is parameter-efficient, future work will focus on optimizing memory access patterns to reduce latency and memory footprint for real-time edge applications, and on exploring degradation modeling to enhance robustness.

6. Conclusions

This paper addresses parameter redundancy and insufficient channel utilization in lightweight remote sensing image super-resolution. The paper proposes the Receptive Field Aggregation Network (RFANSR) as a solution. RFANSR employs collaborative design of three modules: PRFA, SGM, and SGFN. The network effectively expands the receptive field while maintaining low computational complexity. RFANSR also improves feature utilization efficiency through its innovative design. Experimental results show that RFANSR achieves significant performance improvements on RSCNN7 and DOTA datasets. The method reduces parameters by 45.4% compared to existing approaches. RFANSR provides an effective solution for remote sensing image super-resolution in resource-constrained scenarios.

Author Contributions

Methodology, X.Y.; Software, X.Y. and W.S.; Investigation, K.N. and W.G.; Resources, K.N.; Writing—original draft, X.Y.; Writing—review & editing, X.F.; Supervision, W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy reasons.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, W.; Wei, W.; Zhang, L. GSDet: Object Detection in Aerial Images Based on Scale Reasoning. IEEE Trans. Image Process. 2021, 30, 4599–4609. [Google Scholar] [CrossRef] [PubMed]
  2. Yue, T.; Lu, X.; Cai, J.; Chen, Y.; Chu, S. YOLO-MST: Multiscale deep learning method for infrared small target detection based on super-resolution and YOLO. Opt. Laser Technol. 2025, 187, 112835. [Google Scholar] [CrossRef]
  3. Li, Y.; Chen, W.; Zhang, Y.; Tao, C.; Xiao, R.; Tan, Y. Accurate Cloud Detection in High-Resolution Remote Sensing Imagery by Weakly Supervised Deep Learning. Remote Sens. Environ. 2020, 250, 112045. [Google Scholar] [CrossRef]
  4. Xie, Z.; Wang, J.; Song, W.; He, Q.; Zhang, M.; Chang, B. HAMD-RSISR: Hybrid Attention and Multi-Dictionary for Remote Sensing Super-Resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 22556–22572. [Google Scholar] [CrossRef]
  5. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 41–55. [Google Scholar]
  7. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 21–24 June 2022; pp. 833–843. [Google Scholar]
  8. Sun, L.; Pan, J.; Tang, J. Shufflemixer: An efficient convnet for image super-resolution. Adv. Neural Inf. Process. Syst. 2022, 35, 17314–17326. [Google Scholar]
  9. Hu, Q.; Tang, Y.; Zhang, X. Large Kernel Modulation Network for Efficient Image Super-Resolution. arXiv 2025, arXiv:2508.11893. [Google Scholar] [CrossRef]
  10. Lee, D.; Yun, S.; Ro, Y. Partial large kernel CNNS for efficient super-resolution. arXiv 2024, arXiv:2404.11848. [Google Scholar] [CrossRef]
  11. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 1646–1654. [Google Scholar]
  12. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2017, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  13. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2019, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  14. Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S. Single image super-resolution via a holistic attention network. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  15. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  16. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22367–22377. [Google Scholar]
  17. Long, W.; Zhou, X.; Zhang, L.; Gu, S. Progressive Focused Transformer for Single Image Super-Resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  18. Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 357–366. [Google Scholar]
  19. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar]
  20. Wang, Z.; Gao, G.; Li, J.; Yan, H.; Zheng, H.; Lu, H. Lightweight feature de-redundancy and self-calibration network for efficient image super-resolution. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–15. [Google Scholar] [CrossRef]
  21. Wang, Y.; Shao, Z.; Lu, T.; Wu, C.; Wang, J. Remote sensing image super-resolution via multiscale enhancement network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 5000905. [Google Scholar] [CrossRef]
  22. Liu, Y.; Lan, C.; Feng, W. DLKN: Enhanced lightweight image super-resolution with dynamic large kernel network. Vis. Comput. 2025, 41, 3627–3644. [Google Scholar] [CrossRef]
  23. Li, A.; Zhang, L.; Liu, Y.; Zhu, C. Feature modulation transformer: Cross-refinement of global representation via high-frequency prior for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 12514–12524. [Google Scholar]
  24. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  25. Lee, H.J.; Kim, H.E.; Nam, H. Srm: A style-based recalibration module for convolutional neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1854–1862. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  27. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep Learning Based Feature Selection for Remote Sensing Scene Classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325. [Google Scholar] [CrossRef]
  28. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  29. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 517–532. [Google Scholar]
  30. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; IEEE Computer Society: Washington, DC, USA, 2001; pp. 416–423. [Google Scholar]
  31. Huang, J.-B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE Computer Society: Washington, DC, USA, 2015; pp. 5197–5206. [Google Scholar]
  32. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  33. Xie, X.; Zhou, P.; Li, H.; Lin, Z.; Yan, S. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 9508–9520. [Google Scholar] [CrossRef] [PubMed]
  34. Wan, C.; Yu, H.; Li, Z.; Chen, Y.; Zou, Y.; Liu, Y.; Yin, X.; Zuo, K. Swift parameter-free attention network for efficient super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024. [Google Scholar]
  35. Kong, F.; Li, M.; Liu, S.; Liu, D.; He, J.; Bai, Y.; Chen, F.; Fu, L. Residual local feature network for efficient super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 766–776. [Google Scholar]
  36. Zheng, M.; Sun, L.; Dong, J.; Pan, J. SMFANet: A lightweight self-modulation feature aggregation network for efficient image super-resolution. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 359–375. [Google Scholar]
  37. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415. [Google Scholar] [CrossRef]
Figure 1. Performance of different SISR models on DOTA dataset for ×2 super-resolution.
Figure 2. The overall architecture of the proposed network. Key modules including MSDB, SGFN, DSRB, SGM, and HFAB are illustrated in detail.
Figure 3. The structure of PRFA.
Figure 4. Convergence speed comparison on RSCNN7 dataset.
Figure 5. The ×4 SRR results between different methods on the DOTA dataset. The red box serves as the reference object.
Figure 6. The ×4 SRR results between different methods on the RSCNN7 dataset. The red box serves as the reference object.
Figure 7. The ×4 SRR results between different methods on natural datasets.
Figure 8. SGM channel selection distribution.
Table 3. Complexity and efficiency comparison on the DOTA dataset.
Method | Params | FLOPs | Inference Time (ms) | Peak Memory (MB) | PSNR | Manga109
RLFN | 383 K | 29.8 G | 25.0 | 102 | 29.68 | 29.54
DLKN | 366 K | 40.9 G | 44.4 | 186 | 29.53 | 27.88
Ours | 363 K | 21.1 G | 38.8 | 155 | 29.82 | 27.66
Table 4. SGM ablation experiment.
Setting | Params | Set5 | Set14 | BSDS100 | Urban100 | Manga109
With SGM | 383 K | 31.69 | 28.28 | 27.36 | 25.43 | 29.54
With SEBlock | 366 K | 30.66 | 27.63 | 27.10 | 24.61 | 27.88
Without SGM | 363 K | 30.55 | 27.59 | 26.97 | 24.58 | 27.66
Table 5. PRFA ablation experiment.
Kernel Sizes | Params | Set5 | Set14 | BSDS100 | Urban100 | Manga109
[5, 7, 9] | 376 K | 31.20 | 27.99 | 27.18 | 25.00 | 28.60
[7, 9, 11] | 383 K | 31.69 | 28.28 | 27.36 | 25.43 | 29.54
[9, 11, 13] | 426 K | 30.38 | 27.39 | 26.85 | 24.34 | 27.27
Table 6. SGFN ablation experiment.
FFN Design | Params | Set5 | Set14 | BSDS100 | Urban100 | Manga109
MLP | 395 K | 31.34 | 27.83 | 26.93 | 25.06 | 29.17
SimpleGate | 410 K | 31.48 | 28.08 | 27.09 | 25.18 | 29.28
ConFNN | 391 K | 31.57 | 28.13 | 27.22 | 25.30 | 29.43
SGFN | 383 K | 31.69 | 28.28 | 27.36 | 25.43 | 29.54
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yan, X.; Song, W.; Feng, X.; Guo, W.; Ning, K. RFANSR: Receptive Field Aggregation Network for Lightweight Remote Sensing Image Super-Resolution. Remote Sens. 2025, 17, 4028. https://doi.org/10.3390/rs17244028
