Article

Dual Attention Fusion Enhancement Network for Lightweight Remote-Sensing Image Super-Resolution

School of Software, Henan University, Kaifeng 475004, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1078; https://doi.org/10.3390/rs17061078
Submission received: 14 January 2025 / Revised: 3 March 2025 / Accepted: 17 March 2025 / Published: 19 March 2025

Abstract

In the field of remote sensing, super-resolution methods based on deep learning have made significant progress. However, redundant feature extraction leads to excessive parameters, and inefficient feature fusion restricts the precise reconstruction of features, making such models difficult to deploy in practical remote-sensing tasks. To address these issues, we propose a lightweight Dual Attention Fusion Enhancement Network (DAFEN) for remote-sensing image super-resolution. Firstly, we design a lightweight Channel-Spatial Lattice Block (CSLB), which consists of Group Residual Shuffle Blocks (GRSB) and a Channel-Spatial Attention Interaction Module (CSAIM). The GRSB improves the efficiency of redundant convolution operations, while the CSAIM enhances interactive learning. Secondly, to achieve superior feature fusion and enhancement, we design a Forward Fusion Enhancement Module (FFEM). Through the forward fusion strategy, more high-level feature details are retained for better adaptation to remote-sensing tasks. In addition, the fused features are further refined and rescaled by Self-Calibrated Group Convolution (SCGC) and Contrast-aware Channel Attention (CCA), respectively. Extensive experiments demonstrate that DAFEN achieves better or comparable performance compared with state-of-the-art lightweight super-resolution models while reducing complexity by approximately 10–48%.

Graphical Abstract

1. Introduction

Remote-sensing images (RSIs) significantly contribute to earth and environmental sciences by providing diverse observational data. They offer critical support for various tasks, such as agricultural and forestry monitoring [1,2], disaster early warning [3,4], military reconnaissance [5], and industrial manufacturing [6]. Therefore, providing high-resolution RSIs is essential for a more accurate and objective representation of the targets and contexts. However, remote-sensing data are affected by adverse factors such as long acquisition distances and wide viewing angles. For edge devices with limited computational resources, hardware constraints and unfavorable imaging conditions make it even more difficult to acquire high-quality RSIs. RSI Super-Resolution (RSISR) focuses on constructing a nonlinear mapping between pairs of low-resolution and high-resolution remote-sensing images. Compared with upgrading remote-sensing imaging equipment, developing efficient RSISR algorithms at the software level can not only effectively improve image resolution but also save significant costs. Therefore, using RSISR technologies to improve the resolution of remote-sensing images has become a key research direction.
Since the pioneering SR method SRCNN [7] was proposed, the field of SR has grown tremendously [8,9,10,11,12,13,14,15,16,17,18,19,20]. The success of natural image SR techniques has spurred the development of RSISR [21,22,23,24,25,26,27,28,29,30,31,32]. RSISR builds upon natural image SR research by incorporating specific improvements tailored to remote-sensing images, such as designing loss functions suited to the unique features of RSI [33], employing a two-stage design that utilizes spatial and spectral knowledge from adjacent bands [24], and creating degradation models that reflect remote-sensing scenarios [34]. As research progresses, although the performance of these RSISR models has continually improved, they face challenges in effectively managing the inherent complexity of the networks. We take well-known SR methods such as RCAN [14], SAN [35], and SwinIR [36] as examples. When applied to RSISR tasks, RCAN [14] has about 15M parameters, SAN [35] has about 26M parameters, and large variants of SwinIR [36] have upwards of 20M parameters. Recently, RSISR methods based on the diffusion model [37,38,39] and Mamba [40] have become new research hotspots, but their parameter counts are similarly large. The large number of parameters greatly limits their deployment on resource-constrained devices. Therefore, our work aims to design a lightweight and efficient RSISR network to recover remote-sensing images.
In the field of RSISR, several successful lightweight models have emerged. The attention-based multi-level feature fusion (AMFF) in AMFFN [41] differs from common feature fusion methods [15,16,42,43,44]. It adopts a grouped approach to progressively fuse features from different blocks, which fully leverages multi-level features. However, the AMFF adopts a 1 × 1 convolution fusion method, which significantly increases the number of parameters when the number of input channels is large. Additionally, it only uses the Contrast-aware Channel Attention (CCA) [15] module in the enhancement stage, which provides relatively weak performance. RFCNet [45] addresses feature enhancement by designing the Residual Feature Calibration Block (RFCB), which further refines and rescales input features through a clever dual-branch structure, significantly improving the model’s fitting capability. However, RFCB employs channel separation, causing each branch to enhance only half of the feature information, which limits the feature representation capability. Moreover, the use of multiple 3 × 3 convolution layers leads to a significant increase in model parameters. FeNet [46] constructs Lightweight Lattice Blocks (LLBs) based on channel separation and effectively integrates multi-level feature information through nested module design. In multi-level feature fusion, the Backward Fuse Module (BFM) [16] employed by FeNet can extract more contextual information at different levels more effectively than 1 × 1 convolution. However, LLB extracts features from only half of the channels, which results in poorer feature extraction. Additionally, its nested module structure is overly complex, resulting in significant redundancy. Moreover, the backward sequential concatenation used by BFM leads to higher-level features being compressed more than lower-level features. However, high-level features typically contain more semantic information and contextual relationships, which are crucial for reconstructing complex structures and global information. Given the complexity of RSIs [47], the excessive loss of high-level feature information is unreasonable.
To address the limitations of the existing methods, we propose the Dual Attention Fusion Enhancement Network (DAFEN). Meanwhile, to accommodate devices with extremely limited hardware capabilities, we further design an extremely lightweight version, DAFEN-S, by reducing the number of channels, whose parameter count is only 188 K. As shown in Figure 1, our DAFEN demonstrates clear competitive advantages over other lightweight models and achieves a better balance between network complexity and reconstruction accuracy. Specifically, the feature refinement and fusion section of DAFEN is composed of four stacked Dual Attention Fusion Enhancement Blocks (DAFEBs) and a Forward Fusion Enhancement Module (FFEM). The features extracted by the shallow-layer DAFEB are not only fed into the next-layer DAFEB for further extraction and refinement but also input into the FFEM for multi-level feature fusion and enhancement. Furthermore, each DAFEB consists of three Channel-Spatial Lattice Blocks (CSLBs) and a Forward Fusion Enhancement Module (FFEM), with a structure similar to the feature refinement and fusion section of DAFEN. Notably, we introduce a Context Enhancement Module (CEM) [48] at the end of the DAFEB to further enhance the feature information. By employing the extremely lightweight CEM, the receptive field is further enlarged to extract more contextual information for RSISR.
In addition, the core modules of the DAFEN are the Forward Fusion Enhancement Module (FFEM) and the Channel-Spatial Lattice Block (CSLB). Firstly, the FFEM consists of a multi-level feature fusion stage and a feature enhancement stage. On the one hand, we adopt a forward sequential concatenation approach, which reduces the loss of high-level feature information while supplementing low-level feature details, thereby fully utilizing multi-level features to extract richer contextual information. On the other hand, we design a dual-branch structure for feature enhancement. In the upper branch, we design a Self-Calibrated Group Convolution (SCGC), which self-calibrates based on the expression of local features in a deep receptive field, further refining the features. In the lower branch, we introduce CCA [15], which computes weight coefficients to obtain a better channel attention vector. Secondly, the CSLB adopts a dual-branch design based on channel separation, making feature extraction and information interaction extremely lightweight. Unlike other lattice blocks [16,46,49], we design the Group Residual Shuffle Block (GRSB) and the Channel-Spatial Attention Interaction Module (CSAIM). The GRSB extracts features through group convolutions and enhances information flow via channel shuffle, which enables GRSB to achieve feature representation similar to 3 × 3 convolution while using fewer parameters. Furthermore, the CSAIM alternately employs channel attention and spatial attention to enhance information exchange between the two branches of the CSLB, thereby expanding the receptive field and improving feature representation.
In summary, the main contributions are as follows:
  • We propose a lightweight model, DAFEN, with approximately 416K parameters and an ultra-lightweight model, DAFEN-S, with approximately 188K parameters. Compared with other methods, we achieve better performance with fewer model parameters.
  • We design a novel lattice structure, CSLB, which combines GRSB and CSAIM. This structure efficiently extracts features while maintaining the model’s lightweight nature. It also enhances information exchange between the two branches through channel and spatial attention, improving feature extraction for remote-sensing images.
  • We design an efficient feature fusion module, FFEM, which consists of a fusion stage and an enhancement stage. In the fusion stage, FFEM effectively integrates multi-level features through forward sequential concatenation. In the enhancement stage, a unique dual-branch design is adopted to perform feature self-calibration and rescaling, which not only refines the features but also enhances the capability of representing complex remote-sensing image characteristics.

2. Related Works

2.1. Lightweight Natural Image SR

In recent years, the continuous improvement in the performance of super-resolution (SR) networks has often come with increased parameters and computational overhead, posing challenges for practical deployment. Consequently, there is a growing demand for lightweight SR networks. FSRCNN [8] reduced the model parameters and computational load by reconstructing the SRCNN [7] architecture while maintaining performance. Ahn et al. [50] proposed a Cascading Residual Network (CARN), which uses multi-level representation and shortcut connections for more efficient information transfer. Hui et al. [51] developed an Information Distillation Network (IDN), which extracts significant information by merging different features. Building on this, IMDN [15] enhanced IDN by using information multi-distillation blocks and channel splitting operations. Liu et al. [44] designed a lightweight and precise Residual Feature Distillation Network (RFDN) using Feature Distillation Connections (FDC) and Shallow Residual Blocks (SRB). LatticeNet [16] achieved excellent performance while significantly reducing parameters by utilizing lattice filters based on a butterfly structure and a backward fusion strategy. ShuffleMixer [52] explored a method to reduce parameters and computational load by employing large convolution kernels, channel splitting, and shuffling techniques, achieving higher efficiency while maintaining performance. Due to the significant similarity between feature maps of multiple channels in the same CNN layer, Feature-Refined Networks (FRNs) [43] designed shadow modules to generate such similar feature maps, thereby reducing model complexity. Zhang et al. [53] proposed the Super Token Interaction Network (SPIN), which clusters locally similar pixels using superpixels and facilitates local information interaction through intra-superpixel attention. Omni-SR [19] introduced a full self-attention paradigm and a full-scale aggregation scheme to address issues related to limited effective receptive fields due to one-dimensional self-attention modeling and homogeneous aggregation schemes. Huang et al. [49] proposed a Two-branch Adaptive Residual Network (TARN), which effectively utilizes residual features through a two-branch adaptive residual block (TARB) based on a lattice structure. Wang et al. [54] proposed an Omni-Stage Feature Fusion Network (OSFFNet), which effectively integrates features from different levels and fully utilizes their complementarity. With full consideration of structural priors, Wang et al. [55] proposed an Interactive Feature Inference Network (IFIN), which progressively extracts more specialized features to enhance the reconstruction of high-frequency details in images. The advancements in lightweight natural image SR have provided valuable insights for lightweight RSISR research, offering beneficial guidance for our work.

2.2. Lightweight Remote-Sensing Image SR

Inspired by the success of lightweight natural image SR, lightweight RSISR has received increasing attention. Lei et al. [21] first proposed LGCNet, which aims to enhance super-resolution performance by combining local and global contrast features. FeNet [46] achieved a good balance between computational cost and reconstruction accuracy by using Lightweight Lattice Blocks (LLB) as nonlinear extraction modules and utilizing a nested structure. CTN [48] reduced the network’s parameters by using lightweight convolutions instead of traditional 3 × 3 convolutions and generated SR images by alternating feature extraction and enhancement. Wang et al. [41] proposed an Attention-Based Multi-level Feature Fusion Network (AMFFN), which ensures efficient SR reconstruction through information distillation and attention-based multi-level feature fusion. Wu et al. [28] proposed a Saliency-Aware Dynamic Routing Network (SalDRN), which employs networks of different depths to handle various regions of RSIs to tackle SR challenges of varying difficulty. Distance Attention Residual Network (DARN) [56] utilizes Distance Attention Blocks (DABs) to efficiently leverage shallow features, effectively mitigating the loss of detailed features during the extraction process of deep CNNs. Gao et al. [57] proposed a Stepwise Fusion Mechanism (SFM) to integrate features retained after progressive distillation, effectively addressing the issue of insufficient information flow caused by channel separation during feature distillation. Wang et al. [27] proposed a Hybrid Attention-Based U-Shaped Network (HAUNet) to effectively explore multi-scale features and enhance global feature representation through hybrid convolution-based attention. To address the issue of losing low-weight background feature information, Wu et al. [58] employed a large-kernel attention mechanism and a multi-scale mechanism to generate background feature weights, thereby increasing attention to neglected information. Ye et al. [32] proposed a high-frequency and low-frequency separation reconstruction strategy, allowing the network to improve the reconstruction details of high-frequency components while maintaining lower model parameters. Unlike methods that focus on developing lightweight network architectures or modules, our approach emphasizes designing more efficient and lightweight feature fusion enhancement structures and focuses on alleviating the burden on convolution layers. Our strategy facilitates comprehensive feature representation using fewer parameters and Multiply-Add Operations (Multi-Adds).

3. Methods

3.1. Network Architecture

In this paper, we propose an innovative network called DAFEN specifically for lightweight RSISR. The network architecture is illustrated in Figure 2, consisting of three integrated components: shallow feature extraction $H_{fe}$, feature refinement and fusion $H_{rf}$, and reconstruction $H_{re}^{s}$. Assume $I_{LR} \in \mathbb{R}^{3 \times H \times W}$ and $I_{SR} \in \mathbb{R}^{3 \times sH \times sW}$ are the input and output of DAFEN, where $H \times W$ represents the spatial dimensions and $s$ is the upscaling factor. The process of the three integrated components can be expressed as follows:
$$F_0 = H_{fe}(I_{LR})$$
$$F_d = H_{rf}(F_0)$$
$$I_{SR} = H_{re}^{s}(F_d) + H_{bicubic}^{s}(I_{LR})$$
where $F_0$ and $F_d$ denote the feature representations of $H_{fe}$ and $H_{rf}$. We utilize two 3 × 3 convolutions to implement $H_{fe}$, which are used to extract the initial representations of the $I_{LR}$ content. Similarly, $H_{re}^{s}$ is implemented through two 3 × 3 convolutions combined with a pixel shuffle layer. It is noteworthy that $H_{bicubic}^{s}$ represents the bicubic interpolation function with an upscaling factor of $s$. This function effectively conveys substantial information, compensating for the significant details of low-level features.
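To make this pipeline concrete, the following is a minimal PyTorch sketch of the top-level forward pass (PyTorch is the framework reported in Section 4.2). The refinement body is injected as a generic module, the channel width of 48 follows the configuration reported later, and the exact layer arrangement inside the reconstruction head is an assumption based on the description above rather than the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DAFENSkeleton(nn.Module):
    # Top-level pipeline: H_fe (two 3x3 convs), H_rf (injected body: four DAFEBs + FFEM),
    # H_re (3x3 convs + pixel shuffle), plus the bicubic skip connection.
    def __init__(self, body: nn.Module, channels: int = 48, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.head = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.body = body
        self.tail = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x_lr):
        f0 = self.head(x_lr)                                   # F_0 = H_fe(I_LR)
        fd = self.body(f0)                                     # F_d = H_rf(F_0)
        up = F.interpolate(x_lr, scale_factor=self.scale,
                           mode='bicubic', align_corners=False)
        return self.tail(fd) + up                              # I_SR = H_re(F_d) + bicubic(I_LR)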
To develop a lightweight network for RSISR, we meticulously design the feature refinement and fusion module $H_{rf}$. This section consists of four Dual Attention Fusion Enhancement Blocks (DAFEBs) and one Forward Fusion Enhancement Module (FFEM). After obtaining the coarse features $F_0$ through shallow feature extraction $H_{fe}$, the four DAFEBs sequentially extract intermediate features. Here, we have
$$F_i = H_{DAFEB}^{i}(F_{i-1}), \quad i = 1, 2, 3, 4$$
where $H_{DAFEB}^{i}(\cdot)$ denotes the $i$-th DAFEB function and $F_i$ represents the intermediate features extracted by the $i$-th DAFEB. In order to more efficiently integrate these intermediate features containing multi-level information, we feed $F_i$ ($i = 1, 2, 3, 4$) into the FFEM for fusion and enhancement operations, as follows:
$$T_d = H_{FFEM}(F_1, F_2, F_3, F_4)$$
where $T_d$ represents the features that have been integrated and enhanced by the FFEM, and $H_{FFEM}(\cdot)$ denotes the FFEM function.
We employ the $L_1$ loss function to train the aforementioned network. $\{I_{LR}^{i}, I_{HR}^{i}\}_{i=1}^{N}$ is used to denote the training set composed of $N$ pairs of low-resolution (LR) and high-resolution (HR) images. The loss is defined as follows:
$$L_1(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| I_{SR}^{i} - I_{HR}^{i} \right\|_1$$
where $\theta$ denotes the parameter set of our proposed DAFEN. $I_{SR}^{i}$ and $I_{HR}^{i}$ represent the $i$-th SR image reconstructed by DAFEN and the corresponding HR image, respectively.

3.2. Dual Attention Fusion Enhancement Block (DAFEB)

In order to better balance the reconstruction performance and model complexity, we design DAFEB with a similar architecture to the DAFEN. As shown in Figure 3, we progressively refine the complex features through three stacked CSLBs. The features refined by the shallow CSLBs are sent to both the deeper CSLBs and directly to the FFEM for fusion and enhancement. This can be represented as follows:
$$P_i = H_{CSLB}^{i}(P_{i-1}), \quad i = 1, 2, 3$$
$$P_{fuse} = H_{FFEM}(P_1, P_2, P_3)$$
where $H_{CSLB}^{i}(\cdot)$ denotes the CSLB function at the $i$-th layer, $H_{FFEM}(\cdot)$ represents the FFEM function, and $P_i$ denotes the features obtained after the $i$-th CSLB layer. Subsequently, we perform fusion and enhancement on the extracted multi-level features to obtain $P_{fuse}$. Finally, we introduce the CEM [48] to amplify spatial details within the fused features. It is noteworthy that the CEM adds an extremely small number of parameters, so the slight increase in model size is well worth the resulting performance gain.
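A structural sketch of the DAFEB just described is given below. The CSLB, FFEM, and CEM [48] are treated as injected modules, and any skip connections shown in Figure 3 but not stated in the text are omitted here; the sketch only captures how the three CSLB outputs feed the FFEM before the CEM.

import torch.nn as nn

class DAFEB(nn.Module):
    # Three stacked CSLBs whose intermediate outputs all feed the FFEM, followed by the CEM.
    def __init__(self, cslb_factory, ffem: nn.Module, cem: nn.Module):
        super().__init__()
        self.cslbs = nn.ModuleList([cslb_factory() for _ in range(3)])
        self.ffem = ffem
        self.cem = cem

    def forward(self, p0):
        feats, p = [], p0
        for cslb in self.cslbs:          # P_i = CSLB_i(P_{i-1}), i = 1, 2, 3
            p = cslb(p)
            feats.append(p)
        p_fuse = self.ffem(feats)        # P_fuse = FFEM(P_1, P_2, P_3)
        return self.cem(p_fuse)          # CEM amplifies spatial details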

3.3. Forward Fusion Enhancement Module (FFEM)

Due to the complexity of remote-sensing images [47], efficiently utilizing hierarchical information is crucial. We design FFEM to achieve more suitable multi-level feature fusion and enhancement. Specifically, FFEM includes a feature fusion phase and a feature enhancement phase. In the feature fusion phase, we design a forward fusion structure, as shown in Figure 4a. Once the multi-level features are fed into the FFEM, each level’s features first undergo a 1 × 1 convolution to halve the number of channels, followed by activation through the ReLU function (ReLU operations are omitted in Figure 2 and Figure 3), and finally are concatenated in a forward sequential manner. This can be expressed as follows:
$$T_i = \begin{cases} \mathrm{ReLU}(\mathrm{Conv}(F_i)), & i = 1 \\ \mathrm{Conv}\left(\mathrm{Concat}\left(T_{i-1}, \mathrm{ReLU}(\mathrm{Conv}(F_i))\right)\right), & i = 2, \ldots, n-1 \\ \mathrm{Concat}\left(T_{i-1}, \mathrm{ReLU}(\mathrm{Conv}(F_i))\right), & i = n \end{cases}$$
where $\mathrm{Concat}(\cdot)$ and $\mathrm{Conv}(\cdot)$ denote the concatenation operation along the channel dimension and the 1 × 1 convolution, respectively. In the task of super-resolution for structurally complex remote-sensing images, it is crucial to efficiently capture more abstract and semantically rich global information. Our forward fusion structure allocates more channels to deeper features, thereby reducing the inevitable loss of high-level feature information during dimensionality reduction. By leveraging high-level features more effectively, we can mitigate the difficulty of global information capture caused by the local receptive field of CNNs. Utilizing the rich semantic information and contextual relationships contained in high-level features allows us to better handle the reconstruction of complex structures and global information in RSIs. In addition, our strategy also integrates low-level features, enriching edge and texture details, and thus achieves a better balance between global information and detailed information.
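As a reference, the fusion stage can be sketched as follows in PyTorch. The 1 × 1 reductions halve each level's channels as described; the channel budget of the intermediate squeeze convolutions (back to C/2) is an assumption chosen so that the running concatenation never grows beyond C channels. With n_levels = 3 this matches the DAFEB-level fusion, and with n_levels = 4 the DAFEN-level fusion.

import torch
import torch.nn as nn

class ForwardFusion(nn.Module):
    # Forward fusion stage of FFEM: shallower features are squeezed more often,
    # so deeper features keep more channels.
    def __init__(self, channels: int, n_levels: int = 4):
        super().__init__()
        half = channels // 2
        # 1x1 convolutions that halve each level's channels before fusion.
        self.reduce = nn.ModuleList([nn.Conv2d(channels, half, 1) for _ in range(n_levels)])
        # 1x1 convolutions that squeeze the running concatenation back to C/2 (assumed width).
        self.squeeze = nn.ModuleList([nn.Conv2d(channels, half, 1) for _ in range(n_levels - 2)])
        self.act = nn.ReLU(inplace=True)

    def forward(self, feats):
        # feats = [F_1, ..., F_n], ordered from shallow to deep.
        t = self.act(self.reduce[0](feats[0]))                                   # i = 1
        for i in range(1, len(feats) - 1):                                       # i = 2, ..., n-1
            t = self.squeeze[i - 1](torch.cat([t, self.act(self.reduce[i](feats[i]))], dim=1))
        return torch.cat([t, self.act(self.reduce[-1](feats[-1]))], dim=1)       # i = n; C channels out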
In the feature enhancement phase, to balance both the performance and lightweight nature of the model, we design an efficient dual-branch feature enhancement structure. By feeding the fused features into the two modules, CCA and SCGC, we achieve rescaling and refinement processes. The structure of RSI is complex, and it is inaccurate to obtain the channel descriptors solely through Global Average Pooling (GAP). Since variance can reflect the richness of information in feature maps [45], we introduce CCA [15] to highlight the most informative features. This can be represented as follows:
$$x_{out} = H_{CCA}(x)$$
where $H_{CCA}(\cdot)$ denotes the CCA operation. Here, we briefly describe the computation of CCA. As shown in Figure 4b, given a set of feature maps, we compute the sum of their standard deviations and means. This summation is then passed through a sequence of nonlinear functions: Conv1 → ReLU → Conv1. Subsequently, the sigmoid function is utilized to generate a set of combination coefficients ranging from 0 to 1. Finally, we multiply the input feature maps by these combination coefficients to obtain the output feature maps.
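A minimal sketch of this CCA computation is shown below; the reduction ratio of 8 inside the Conv1 → ReLU → Conv1 bottleneck is an assumption, since the paper does not state it for CCA.

import torch
import torch.nn as nn

class CCA(nn.Module):
    # Contrast-aware Channel Attention: per-channel (std + mean) descriptor,
    # Conv1 -> ReLU -> Conv1 bottleneck, sigmoid gating.
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True)
        return x * self.body(std + mean)     # rescale the input by the combination coefficients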
For SCGC (as shown in Figure 4c), the features are first down-sampled by the AdaptiveMaxPool2D function with a kernel size of 7 and a stride of 3 in order to obtain a large receptive field. Then, after tuning by group convolution and channel shuffle operations, the features are upsampled back to the input size using bilinear interpolation. Finally, the spatial statistics are obtained by applying the sigmoid function after a residual connection with the input features. This process can be represented as follows:
$$\omega = \delta\left(H^{\uparrow}\left(H_{shuffle}\left(H_{GConv}\left(H^{\downarrow}(x_1)\right)\right)\right) + x_1\right)$$
where $\omega$ denotes the attention matrix for scaling and refining the input features $x_1$, $\delta$ denotes the sigmoid activation function, $H^{\uparrow}$ represents the up-sampling operation via bilinear interpolation, $H_{GConv}$ and $H_{shuffle}$ denote the group convolution operation and the channel shuffle operation, and $H^{\downarrow}$ denotes the down-sampling operation using the AdaptiveMaxPool2D function. After obtaining the attention matrix $\omega$, we multiply $\omega$ by the features refined through two GRSBs for self-calibration. Finally, we add group convolution and channel shuffle operations to further deepen the features, thereby enhancing performance. This process can be represented as follows:
$$x_{out} = H_{shuffle}\left(H_{GConv}\left(\omega \times H_{GRSB}\left(H_{GRSB}(x_1)\right)\right)\right)$$
where $x_{out}$ denotes the output feature and $H_{GRSB}(\cdot)$ denotes the GRSB operation.
Here, we provide a detailed description of the GRSB used in SCGC and CSLB, as shown in Figure 4d. We observe that 3 × 3 convolutions account for a significant proportion of network parameters in most lightweight super-resolution networks. This observation prompted us to consider reducing the weight of the super-resolution network by replacing 3 × 3 convolutions with lightweight convolutions while maintaining performance. The group convolution and channel shuffle strategy of ShuffleNet [59] effectively reduces model complexity while preserving high feature extraction capability, making it suitable for deployment on resource-constrained devices. Through our experimental comparison of numerous excellent convolution modules [60,61,62], we verify that the combination of group convolution and channel shuffle strategy offers optimal performance for this task. By drawing inspiration from ShuffleNet, we create GRSB as a lightweight alternative to conventional convolution. Group convolution divides the original 3 × 3 convolution operation into several groups, each containing a portion of the input and output channels, and performs convolution operations independently on each group. The results of all groups are then merged to obtain the final output feature map. Following the residual connection, we select LeakyReLU as the activation function and employ a channel shuffle operation at the end. Benefiting from the channel shuffle operation, GRSB mitigates the inter-group information isolation caused by grouping and facilitates information flow between different groups. This can be described as follows:
$$x_d = H_{shuffle}\left(H_{LReLU}\left(x + H_{GConv}(x)\right)\right)$$
where $x$ denotes the input features, $x_d$ denotes the output features, and $H_{LReLU}(\cdot)$ denotes the LeakyReLU operation.
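The following sketch implements the GRSB just described together with the SCGC of the preceding paragraphs, which builds on it. The group count of 4 follows the ablation in Section 4.8.5, the LeakyReLU slope is an assumption, and because PyTorch's adaptive max pooling takes an output size rather than a kernel and stride, the "kernel 7, stride 3" pooling is realized here with nn.MaxPool2d.

import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels so that information flows across convolution groups.
    b, c, h, w = x.size()
    return (x.view(b, groups, c // groups, h, w)
              .transpose(1, 2).contiguous()
              .view(b, c, h, w))

class GRSB(nn.Module):
    # Group Residual Shuffle Block: x_d = shuffle(LReLU(x + GConv(x))).
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.gconv = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.act = nn.LeakyReLU(0.2, inplace=True)   # negative slope is an assumption

    def forward(self, x):
        return channel_shuffle(self.act(x + self.gconv(x)), self.groups)

class SCGC(nn.Module):
    # Self-Calibrated Group Convolution: a pooled group-conv branch builds the attention
    # map omega, which rescales the GRSB-refined features.
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.pool = nn.MaxPool2d(kernel_size=7, stride=3)   # "kernel 7, stride 3" pooling
        self.gconv_down = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.refine = nn.Sequential(GRSB(channels, groups), GRSB(channels, groups))
        self.gconv_out = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)

    def forward(self, x):
        h, w = x.shape[-2:]
        # omega = sigmoid( up( shuffle( GConv( down(x) ) ) ) + x )
        y = channel_shuffle(self.gconv_down(self.pool(x)), self.groups)
        y = F.interpolate(y, size=(h, w), mode='bilinear', align_corners=False)
        omega = torch.sigmoid(y + x)
        # x_out = shuffle( GConv( omega * GRSB(GRSB(x)) ) )
        return channel_shuffle(self.gconv_out(omega * self.refine(x)), self.groups)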
After the feature enhancement through the upper and lower branches, the two sets of output features undergo a 1 × 1 convolution for fusion. The FFEM is applied at two levels. At the first level, we integrate features from the different CSLBs within each DAFEB, allowing for continuous refinement and expansion of these features. At the second level, we merge features from the different DAFEBs. Through FFEM, the interaction between feature refinement and fusion is conducted iteratively, enabling the extraction of more significant contextual information.

3.4. Channel-Spatial Lattice Block (CSLB)

Lattice structures [16,46,49] enable high-speed parallel processing, making them highly suitable for lightweight models that require fast execution speeds. To achieve more efficient feature extraction, we design the CSLB, which incorporates a dual-branch architecture inspired by lattice structures. As shown in Figure 5a, the input features $x \in \mathbb{R}^{C \times H \times W}$ are first split into two equal parts along the channel dimension: $P_{i-1}(x) \in \mathbb{R}^{C/2 \times H \times W}$ and $Q_{i-1}(x) \in \mathbb{R}^{C/2 \times H \times W}$. This can be expressed as follows:
$$P_{i-1}(x),\ Q_{i-1}(x) = \mathrm{Split}(x)$$
where $x$ represents the input to the CSLB, and $P_{i-1}(x)$ and $Q_{i-1}(x)$ denote the inputs of the upper and lower branches, respectively. This design allows each branch to process only half of the input signal, enabling faster parallel processing and reduced complexity. Specifically, our CSLB is divided into two operational stages.
In the first stage, $Q_{i-1}(x)$ is fed into three GRSBs for stepwise feature extraction and refinement. Considering the feature misalignment between the two branches, simple operations such as addition, multiplication, or concatenation are insufficient. Therefore, we design the CSAIM to calculate attention weights for the two branches from both the spatial and channel dimensions; it consists of two components: Spatial Attention Interaction (SAI) and Channel Attention Interaction (CAI). These weights are then used to perform a 1 × 1 convolution fusion with the other branch to obtain $P_{i-2}(x)$ and $Q_{i-2}(x)$. This process can be expressed as follows:
$$P_{i-2}(x) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(P_{i-1}(x),\ F_{CI}\left(f\left(Q_{i-1}(x)\right)\right)\right)\right)$$
$$Q_{i-2}(x) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(F_{SI}\left(P_{i-1}(x)\right),\ f\left(Q_{i-1}(x)\right)\right)\right)$$
where $f(\cdot)$ denotes the three stacked GRSB operations, and $F_{CI}(\cdot)$ and $F_{SI}(\cdot)$ denote the channel attention and spatial attention operations, respectively.
In the second stage, $P_{i-2}(x)$ is fed into three GRSBs for stepwise feature extraction and refinement. Similar to the first stage, we perform weighted cross-combination from the spatial and channel dimensions for the two branches. This process can be expressed as follows:
$$P_{i-3}(x) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(f\left(P_{i-2}(x)\right),\ F_{CI}\left(Q_{i-2}(x)\right)\right)\right)$$
$$Q_{i-3}(x) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(F_{SI}\left(f\left(P_{i-2}(x)\right)\right),\ Q_{i-2}(x)\right)\right)$$
where $P_{i-3}(x)$ and $Q_{i-3}(x)$ denote the upper and lower branch features obtained from the second stage, respectively. Subsequently, we fuse $P_{i-3}(x)$ and $Q_{i-3}(x)$ using a 1 × 1 convolution.
Here, we provide a detailed description of the calculation of weight coefficients. For the SAI, as shown in Figure 5b, given a set of feature maps, they are directly passed through a nonlinear function, Conv1 → ReLU → Conv1, where the reduction ratios for pointwise convolution are 8 and C/8, with C representing the number of channels in the feature maps. Finally, a sigmoid function is used to generate combination coefficients, which range from 0 to 1. For the CAI, as shown in Figure 5c, given a set of feature maps, we first obtain the average value of each feature map through GAP. The resulting feature vectors are then passed through a nonlinear function, Conv1 → ReLU → Conv1, where the reduction and expansion ratios for pointwise convolution are set to 8. Again, a sigmoid function is used to generate combination coefficients ranging from 0 to 1. The CSAIM is an efficient module designed to enhance the flow of feature signals, with minimal network complexity introduced by the learning of weight coefficients. By employing the CSAIM along the spatial and channel dimensions, the upper and lower branches effectively capture feature signals with varying levels of attention, resulting in more diverse and enriched information integration.
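Putting the pieces together, the following sketch implements SAI, CAI, and the two-stage CSLB forward pass; it reuses the GRSB class from the earlier sketch. Two details are assumptions made to keep tensor shapes consistent with the equations above: SAI and CAI return the re-weighted features rather than bare coefficients, and each 1 × 1 fusion convolution maps the concatenated C channels back to C/2 before the final fusion.

import torch
import torch.nn as nn

class SAI(nn.Module):
    # Spatial Attention Interaction: Conv1 -> ReLU -> Conv1 (C -> C/8 -> 1), then sigmoid.
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.body(x)          # re-weight with a one-channel spatial map

class CAI(nn.Module):
    # Channel Attention Interaction: GAP -> Conv1 -> ReLU -> Conv1 (reduce/expand by 8), then sigmoid.
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.body(x)          # re-weight with a per-channel vector

class CSLB(nn.Module):
    # Channel-Spatial Lattice Block: channel split, two interaction stages, final 1x1 fusion.
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        half = channels // 2
        self.f1 = nn.Sequential(*[GRSB(half, groups) for _ in range(3)])
        self.f2 = nn.Sequential(*[GRSB(half, groups) for _ in range(3)])
        self.cai1, self.sai1 = CAI(half), SAI(half)
        self.cai2, self.sai2 = CAI(half), SAI(half)
        self.fuse = nn.ModuleList([nn.Conv2d(channels, half, 1) for _ in range(4)])
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        p, q = torch.chunk(x, 2, dim=1)                          # P_{i-1}, Q_{i-1}
        fq = self.f1(q)                                          # f(Q_{i-1}): three GRSBs
        p2 = self.fuse[0](torch.cat([p, self.cai1(fq)], 1))      # P_{i-2}
        q2 = self.fuse[1](torch.cat([self.sai1(p), fq], 1))      # Q_{i-2}
        fp = self.f2(p2)                                         # f(P_{i-2}): three GRSBs
        p3 = self.fuse[2](torch.cat([fp, self.cai2(q2)], 1))     # P_{i-3}
        q3 = self.fuse[3](torch.cat([self.sai2(fp), q2], 1))     # Q_{i-3}
        return self.out(torch.cat([p3, q3], 1))                  # final 1x1 fusion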

4. Results

4.1. Datasets

Based on previous work [22,56,57,58,63], the widely used SR dataset DIV2K [64] is selected as our training dataset. The DIV2K dataset contains 800 high-quality RGB training images and 100 validation images. For testing, we test the reconstruction performance of the model using two remote-sensing datasets proposed by FeNet [46], RS-T1 and RS-T2. RS-T1 and RS-T2 are remote-sensing datasets used for land use studies collected from the UC Merced [65] dataset, and both contain 120 images covering 21 complex ground-truth remote-sensing scenes. To further test the robustness of the model comprehensively, we use four SR benchmark datasets: Set5 [66], Set14 [67], BSD100 [68], and Urban100 [69].

4.2. Implementation Details

To obtain low-resolution (LR) training images, we use bicubic interpolation with scaling factors of ×2, ×3, and ×4 to downscale the high-resolution (HR) images. To enhance the diversity of the training set, data augmentation techniques such as horizontal flipping and random 90° rotations are applied. During the training phase, DAFEN employs a batch size of 16, and HR image patches of size 192 × 192 are randomly cropped from the HR images. We posit that the selected batch size and patch size strike an optimal balance between training speed and gradient stability, as well as between the preservation of local details and the incorporation of global information. To ensure numerical stability, the pixel range of the input images is scaled to [0, 1]. Our network is optimized using the ADAM [70] optimizer, with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rate is set to $5 \times 10^{-4}$ and halved every 200 epochs over a total of 1000 epochs. This learning rate scheduling strategy aims to balance convergence speed and stability, ensuring that the model avoids local optima while maintaining steady performance improvements. All experiments are implemented using the PyTorch (version 1.10.0) framework and evaluated on an NVIDIA GeForce RTX 3090 GPU (manufactured by NVIDIA Corporation, Santa Clara, CA, USA). In this paper, our network consists of four three-layer DAFEBs. The selection of the number of feature channels follows [46], where DAFEN employs 48 feature channels, while its lightweight variant, DAFEN-S, utilizes 32 feature channels. These configurations offer a well-balanced trade-off between model performance and computational cost. Following [46,63,71,72], SR results are evaluated only on the Y channel of the transformed YCbCr space, using the average Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). Additionally, we assess the network's complexity using model parameters and Multi-Adds. Similar to [16,41,46], we assume that the size of the query image (HR image) is 1280 × 720 for calculating Multi-Adds. Table 1 presents the implementation details and hyperparameter settings of our methods and all the comparative lightweight methods discussed in this paper. Specifically, we elaborate on the common parameters employed in the implementation of our methods and the comparative lightweight methods, including the optimizer for the generator (Optim_g), beta parameters for the Adam optimizer (betas), learning rate (lr), gamma parameter for the learning rate scheduler (gamma), loss function type (Loss type), batch size, patch size, use of horizontal flipping (Use_hflip), and use of rotation (Use_rot). To validate our insights into the optimization process and demonstrate the rationality of our hyperparameter selection, we design a hyperparameter analysis experiment, as detailed in Table 2. Specifically, we conduct experiments with our DAFEN at a scaling factor of ×4, employing various loss functions, batch sizes, patch sizes, and data augmentation strategies as experimental variants. The results demonstrate that our DAFEN achieves the best performance while also maintaining a balance with model complexity.
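For reference, the optimizer and schedule described above translate into the PyTorch setup sketched below. The placeholder model and the loop skeleton are illustrative only, and the scheduler class is an assumption; any scheduler that halves the learning rate every 200 epochs is equivalent.

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

# A tiny placeholder stands in for DAFEN so the sketch stays self-contained.
model = nn.Conv2d(3, 3, 3, padding=1)

criterion = nn.L1Loss()                                    # L1 loss from Section 3.1
optimizer = optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
scheduler = StepLR(optimizer, step_size=200, gamma=0.5)    # halve the learning rate every 200 epochs

for epoch in range(1000):
    # ... iterate over batches of 16 randomly cropped 192 x 192 HR patches
    #     and their bicubic LR counterparts, minimizing criterion(model_output, hr) ...
    scheduler.step()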

4.3. Results on Remote-Sensing Datasets

To validate the effectiveness of the proposed method on remote-sensing datasets, DAFEN and DAFEN-S are compared with existing lightweight models, including LGCNet [21], IDN [51], LESRCNN [73], CTN [48], FeNet [46], FDENet [57], DARN-S [56], AMFFN [41], IFIN-S [55], and BMFENet [58]. These models have been published in high-quality journals or conferences. All models compared in the paper are tested directly on remote-sensing data using pre-trained models provided by the respective researchers. Additionally, the training set for all comparison models is DIV2K, ensuring fairness in the comparison results. Table 3 shows the quantitative comparison results of different methods under all scaling factors. It can be seen that DAFEN outperforms existing methods or achieves comparable performance. At a scaling factor of ×4, IFIN-S, with approximately 470 K parameters, provides better results on the two RSI datasets. In contrast, DAFEN achieves a similar performance level with around 431 K parameters and lower computational requirements. Additionally, DAFEN's performance is comparable to BMFENet, which has a larger number of parameters and computational load. Moreover, our DAFEN-S has a similar number of parameters to FeNet-baseline and LGCNet, but DAFEN-S performs much better. Furthermore, with only 188 K parameters, our DAFEN-S outperforms FeNet (351 K parameters) in most test experiments. This is mainly due to the clever design of DAFEN, which effectively enriches, fuses, and enhances features while reducing model redundancy.
We also present the visual results of several SR methods for ×4 SR on UC-Merced in Figure 6. It is clear that our method demonstrates better recovery effects on object contours and detailed textures compared with other methods. Specifically, in the scenes with strong global consistency like ‘agricultural21’, only DAFEN adequately recovers texture information. Similarly, in scenes with scale variations, such as ‘airplane67’, DAFEN achieves more precise edge details. Notably, for the parking lot in ‘mobilehomepark92’ and the tennis court boundaries in ‘tenniscourt93’, other methods exhibit varying degrees of distortion. For example, in ‘mobilehomepark92’, other methods reconstruct blurred parking lot lines, while our DAFEN reconstructs more accurate parking lot lines. Additionally, in ‘tenniscourt93’, the reconstruction results of methods like IDN exhibit blurred tennis court boundaries, while methods such as AMFFN and CTN exhibit boundary distortion. In contrast, our DAFEN achieves results that are closer to the ground truth. However, the reconstruction results of our DAFEN-S exhibit some incorrect information, such as blurred parking lot lines in ‘mobilehomepark92’. This is because DAFEN-S has fewer parameters compared with DAFEN, resulting in weaker anti-interference ability. Overall, the proposed DAFEN demonstrates superior visual performance compared with other methods.

4.4. Results on SR Benchmark Datasets

To further validate the generalization performance of our model, we compare it with excellent SR methods on natural image datasets, including LGCNet [21], IDN [51], MADNet [74], FeNet [46], FDENet [57], DARN-S [56], IFIN-S [55], BMFENet [58], and TARN [49]. Similar to [46], we utilize four benchmark datasets: Set5 [66], Set14 [67], BSD100 [68], and Urban100 [69], which cover urban buildings, ecological environments, flora, and fauna. The models used for testing are all pre-trained models provided by the respective researchers. The test results are shown in Table 4. Our findings indicate that in terms of reconstruction accuracy for ×2, ×3, and ×4 SR tasks, DAFEN performs better or comparably to other lightweight SR networks on most test datasets. Despite IFIN-S achieving optimal results in some cases, its parameter count is higher than DAFEN. Additionally, due to the higher computational complexity of the transformer architecture used by IFIN-S, its computational load is about 1.5 times that of DAFEN. Compared with the TARN method, which has a significantly higher parameter count than DAFEN, our method achieves comparable performance. Moreover, our DAFEN-S greatly outperforms the FeNet baseline and LGCNet, which have a similar number of parameters. DAFEN-S also performs similarly to the FeNet method, while its parameter count and Multi-Adds are nearly halved. The results on natural image test sets demonstrate that DAFEN has strong generalization capabilities, confirming its exceptional effectiveness.
To assess perceptual quality, we present three reconstruction results of some models on the BSD100 and Urban100 test sets in Figure 7. It is clear that DAFEN achieves the best visual experience in terms of overall image patch clarity and detail line textures. Specifically, in ‘BSD100: 182053’, only DAFEN successfully recovers the accurate edges of the bridge arch. Notably, both our DAFEN and DAFEN-S achieve correct results in ‘Urban100: img073’. In contrast, the reconstruction results of methods such as SRCNN exhibit distorted lines, while those of methods like FeNet exhibit upward-curving lines. Furthermore, for continuous and densely structured block-like scenes, such as ‘Urban100: img019’, the reconstruction results from other methods exhibit varying degrees of blurriness in the upper-left regions of the image patches, while our method achieves more accurate reconstruction results. The qualitative and quantitative analyses above suggest that our proposed DAFEN and DAFEN-S are competitive in both natural and remote-sensing image SR tasks.

4.5. Comparison Results with Non-Lightweight State-of-the-Art Methods

To further validate the computational efficiency of the proposed methods, we conduct a comprehensive complexity comparison between DAFEN/DAFEN-S and several state-of-the-art non-lightweight super-resolution methods, including RCAN [14], SwinIR [36], and HAT [18]. The complexity evaluation is primarily conducted from three dimensions: model parameters (Params), multiply-add operations (Multi-Adds), and inference times (Times), with specific results shown in Table 5. In terms of model parameters, DAFEN and DAFEN-S require only 0.422 M and 0.192 M parameters, respectively, which are significantly lower than those of RCAN (15.67 M), SwinIR (11.55 M), and HAT (20.53 M). This indicates that the DAFEN series models significantly reduce the storage requirements while maintaining high performance, making them more suitable for deployment on resource-constrained devices. In terms of computational complexity, the Multi-Adds of DAFEN and DAFEN-S are 0.038 T and 0.017 T, respectively, which are about 1–2 orders of magnitude lower than those of RCAN (1.492 T), SwinIR (2.883 T), and HAT (3.871 T). The low computational complexity not only reduces energy consumption but also significantly improves the inference speed of the models. Experimental results show that the inference times of DAFEN and DAFEN-S are 0.017 s and 0.012 s, respectively, while those of RCAN, SwinIR, and HAT are 0.12 s, 0.23 s, and 0.32 s, respectively. This demonstrates that the DAFEN series models have significant advantages in scenarios with high real-time requirements. In summary, DAFEN and DAFEN-S excel in model complexity, computational efficiency, and inference speed, not only significantly outperforming other lightweight methods but also demonstrating their unique performance-efficiency trade-off advantages when compared with non-lightweight state-of-the-art methods. These results further validate the potential of the DAFEN series models in practical applications, especially in resource-constrained environments.

4.6. Results of Real Remote-Sensing Images

To further demonstrate the stability of the model, the reconstruction results of two real remote-sensing satellite images are shown in Figure 8. We use four different methods to upscale real satellite remote-sensing images by a factor of four for visual perception comparison. It can be observed that our model provides better visual perception in terms of both overall texture and detailed textures compared with other methods. This validates that our approach performs well on images captured by real remote-sensing satellites.

4.7. Network Complexity and Inference Speed

In addition to evaluating complexity, faster inference speed has also become a crucial metric for assessing whether these lightweight RSISR methods are suitable for real-time applications on edge devices with limited computational resources. To this end, we conduct timing tests on several representative RSISR methods using the same device equipped with an NVIDIA GeForce RTX 3090 GPU (manufactured by NVIDIA Corporation, Santa Clara, CA, USA) on the RS-T1 test set, as shown in Table 6. It can be found that DAFEN-S achieves optimal results in terms of parameter count, computational complexity, and inference time, while DAFEN achieves optimal performance and second-best inference speed. Notably, the parameter count of DAFEN-S is about half that of FeNet, yet its PSNR and SSIM values improved by approximately 0.04 dB and 0.0027, respectively. Additionally, the parameter count of DARN is about 1.5 times that of DAFEN, yet DAFEN still leads in PSNR and SSIM values by about 0.05 dB and 0.0057. Furthermore, our method, based on a CNN architecture, offers inference speeds over eight times faster than IFIN-S, which is based on a transformer architecture. This faster inference speed makes DAFEN more suitable for real-time applications on edge devices with limited computational resources. Overall, comparisons with existing state-of-the-art methods demonstrate the efficiency and lightweight nature of the proposed method, as well as its high adaptability and flexibility for deployment on remote-sensing devices.
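For reproducibility, per-image GPU inference times of this kind are typically measured as sketched below; the function name, warm-up count, and single-image batching are illustrative rather than the authors' exact protocol.

import time
import torch

@torch.no_grad()
def average_inference_time(model, lr_images, device="cuda", warmup=5):
    # Average per-image forward time on the GPU; `lr_images` is a list of LR input tensors.
    model = model.to(device).eval()
    lr_images = [img.to(device) for img in lr_images]
    for img in lr_images[:warmup]:          # warm-up runs to stabilize CUDA kernels
        model(img)
    torch.cuda.synchronize()
    start = time.time()
    for img in lr_images:
        model(img)
    torch.cuda.synchronize()
    return (time.time() - start) / len(lr_images)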

4.8. Ablation Study

4.8.1. Effects of the Key Modules in DAFEN

To evaluate the contribution of key modules in DAFEN to the overall performance, we conduct ablation experiments on DAFEB, CSLB, and FFEM, with the specific results shown in Table 7. Firstly, we remove DAFEB and replace it with the Residual Group (RG) from RCAN [14]. To maintain a similar model structure, we set the number of Residual Channel Attention Blocks (RCAB) in RG to 3. By comparing the first and fourth rows in the table, it can be observed that after replacing DAFEB, the model’s parameter count and computational load increased to 1.6 times and 1.8 times that of DAFEN, respectively, but the performance remains close to that of DAFEN. Secondly, we remove CSLB and replace it with a residual block. The specific implementation details are as follows: the input feature information is first processed through Conv3 → ReLU → Conv3, then the processed features are connected with the input features via residual connection, and finally processed through ReLU again. From the comparison between the second and fourth rows, it can be seen that the variant with the replaced CSLB has approximately twice the parameter count and computational complexity of DAFEN but shows similar reconstruction accuracy. However, the SSIM value on the RS-T1 dataset and the PSNR value on the RS-T2 dataset are slightly lower than those of DAFEN. Additionally, we remove FFEM and replace it with the attention-based multi-level feature fusion (AMFF) structure from AMFFN [41]. From the comparison between the third and fourth rows, it can be seen that the performance of all metrics significantly decreased, especially on the Urban100 dataset. Through these experiments, we systematically demonstrate the lightweight design and efficiency of DAFEB, CSLB, and FFEM, as well as their exceptional contributions to the performance of DAFEN.

4.8.2. Effects of the CSLB

To assess the importance of CSLB, we conduct an effectiveness analysis of GRSB and CSAIM, as shown in Table 8. Firstly, we replace the GRSB with 3 × 3 convolution → ReLU and use it as the feature extraction module for CSLB. Comparing the first and third rows in the table, although the 3 × 3 convolution improved accuracy, the model’s parameter count increased by more than twofold. Considering the lightweight design, our use of the GRSB significantly reduced the model’s parameters and computational load, making the slight loss in accuracy worthwhile. Secondly, we replace our CSAIM with the cross-fusion method from [46], which utilizes channel attention to compute weighting coefficients for the weighted cross-combination of features from the upper and lower branches. As seen from the second and third rows in the table, our CSAIM has fewer parameters and consistently performs better on RS-T1, BSD100, and Urban100, achieving better PSNR values on RS-T2. This indicates that CSAIM is more beneficial for enriching feature extraction and promoting information exchange between the two branches of the lattice structure.
To further demonstrate the effectiveness of CSLB, we conduct tests on the SR results of CSLB and its two variants at a scaling factor of ×4, and Figure 9 presents the visualization results. It can be observed that the SR results of CSLB achieve superior quality in overall imaging. To facilitate a more detailed observation of the SR results, we enlarge the image patches within the red box in the HR image. Through detailed comparison, it is found that the SR results of CSLB exhibit better performance in details such as the edges of vehicles, the edges and textures of sidewalks, and the textures of shrubs, compared with the other two methods. This fully proves the rationality of the design and the superiority of the performance of CSLB.

4.8.3. Effects of the FFEM

The forward fusion structure, SCGC, and CCA combine to form a powerful FFEM. To demonstrate the effectiveness of these key modules, we compare FFEM with its variants, as shown in Table 9. For the forward fusion structure, we compare it with the 1 × 1 convolution fusion method and the BFM [16], as seen in the first, second, and fifth rows of the table. It is clear that our forward fusion structure has fewer parameters than the 1 × 1 convolution method and leads in PSNR and SSIM values across all four test sets. Furthermore, compared with BFM, our forward fusion structure improves the PSNR by 0.03 dB, 0.01 dB, and 0.01 dB on RS-T1, RS-T2, and BSD100, respectively, and achieves better SSIM values on RS-T1, RS-T2, BSD100, and Urban100. To further demonstrate the effectiveness of the forward fusion structure, we compare the visualization results of 1 × 1 convolution fusion, BFM, and our FFEM using average feature maps, as shown in Figure 10. It can be observed that our FFEM achieves better edge and texture details, both in the local DAFEB and the global DAFEN. This fully proves the effectiveness of the forward fusion structure and supports the rationale of allocating more channels to higher-level features in multi-level feature fusion to retain more high-level feature information.
To demonstrate the effectiveness of the dual-branch structure, we conduct two experiments using CCA and SCGC individually, as shown in the third and fourth rows of Table 9. It can be seen that the combined version of CCA and SCGC achieves better performance compared with using a single module. Although using CCA or SCGC alone achieves lower parameter counts and computational costs, this comes at the expense of significantly reduced accuracy. To gain deeper insights into the performance enhancement achieved by FFEM, we employ a Local Attribution Map (LAM) [75]. As shown in Figure 11, the LAM results clearly demonstrate that the combined version of CCA and SCGC significantly enhances attention to important information. Moreover, the results of the combined version of CCA and SCGC achieve a higher Diffusion Index (DI), indicating that more input pixels are utilized, which leads to increases in PSNR and SSIM values. Therefore, by employing the dual-branch structure, the coverage of utilized pixels is expanded, further validating the effectiveness of the proposed FFEM.

4.8.4. Influence of the Lightweight Convolution in DAFEN

To explore the most suitable lightweight convolution for our method, we introduce several common lightweight convolutions [60,61,62] into the model for comparison with our method, as shown in Table 10. It can be seen that although our method has a larger number of parameters and computational cost, the reconstruction accuracy far exceeds that of the other methods. Therefore, DAFEN using group convolution exhibits better performance.

4.8.5. Influence of the Number of Groups in Group Convolution

We explore the optimal group number settings for group convolution in the model, as shown in Table 11. When the number of groups is set to two, the model achieves the best performance, but the parameter count and computational cost increase significantly. With eight groups, the model is the most lightweight, but the accuracy is lower. To achieve a better balance between performance and lightweight design, we ultimately chose to set the number of groups to four.

4.8.6. Influence of the Number of DAFEBs and CSLBs

To further optimize network parameters and performance, we investigate the impact of different numbers of DAFEB and CSLB on the model, as shown in Table 12. Specifically, n b represents the number of DAFEB, while n l denotes the number of CSLB. Initially, when the number of CSLBs is 1, the network’s performance is the worst. As the number of CSLBs increases, such as 1, 3, and 5, the model’s performance continuously improves, indicating that our DAFEB structure has the potential to achieve top-level performance when used in larger networks. Additionally, we compare different numbers of DAFEB, such as 3, 4, 5, and 6. With the increase in the number of DAFEB, SR performance, parameter count, and computational cost also increase accordingly. Considering our goal is to study lightweight RSISR, to achieve a more reasonable balance between performance and lightweight design, we set the number of DAFEB to 4 and the number of CSLB to 3.

5. Discussion

In this section, we discuss the advantages of the research and then examine the limitations associated with the proposed method.
Firstly, our approach is able to better balance performance and complexity, primarily due to two aspects. On the one hand, DAFEN efficiently addresses the unique complexities of RSI [47] through effective local and global processing, thereby achieving strong performance in RSISR tasks. As illustrated in Figure 12, the visualization of the average feature maps reveals the comprehensive approach of DAFEN. For the local aspect, effective feature extraction and progressive refinement are achieved through three CSLBs, preserving the detail fidelity of all targets, regardless of size. For the global aspect, finer texture features are transferred to deeper DAFEB layers for further reconstruction. This allows different DAFEBs to handle features of varying complexity, contributing to richer detail contours. Meanwhile, the FFEM ensures that the generated images are coherent on a macro level and precise on a micro level by efficiently integrating and enhancing multi-level features both locally and globally. This leads to final generated features with more accurate edges and texture information, which is beneficial for the precise reconstruction of remote-sensing images. On the other hand, DAFEN leverages lightweight convolution and dimensionality reduction, enabling our network to deliver superior performance with lower complexity. Group convolution reduces redundant operations while combining channel shuffle to maintain efficient feature extraction. Additionally, CSLB achieves extremely lightweight computation for both branches through channel split. Furthermore, the forward fusion structure of FFEM is also a highly lightweight design. Even compared with the simple 1 × 1 convolution fusion method, it effectively reduces the parameter count by approximately 12 K (as shown in Table 9).
Secondly, we will discuss the limitations of our method and future work. As can be seen from Table 3, our method achieves superior performance at lower scaling factors, especially at ×2, but the accuracy improvement of DAFEN is relatively modest at higher scaling factors, such as ×4. We attribute this to the fact that CNN-based RSISR methods are constrained by the local processing principle of convolutional kernels, which hinders direct interaction between distant pixels in the image, leading to insufficient extraction of feature information at high scaling factors. However, at low scaling factors, the original image details are better preserved, which does not significantly affect the feature extraction performance. Another reason is that the trade-off between low-level and high-level features in our forward fusion structure is not perfect. Although we have demonstrated the rationality and effectiveness of the forward fusion structure in our experiments, such a trade-off may require more precise adjustments for different scaling factors. In the future, we will focus on addressing these limitations. On the one hand, we will further investigate methods to enhance the feature representation capability of the model, such as the hybrid use of different convolutions, exploration of more suitable lightweight convolutions, and finer tuning of the forward fusion structure to achieve optimal performance across various scaling factors. On the other hand, we plan to integrate new technologies, such as the Diffusion Model and Mamba, with CNN to compensate for the shortcomings of a single CNN architecture. Additionally, the scope of the paper is limited to RSISR with bicubic downsampling and does not cover other areas such as blind RSISR, continuous RSISR, or hyperspectral image SR. We will explore how to further integrate our model with other techniques to expand its applicability and encompass more diverse RSISR challenges.

6. Conclusions

In this paper, we propose a lightweight Remote-Sensing Image Super-Resolution (RSISR) network named the Dual Attention Fusion Enhancement Network (DAFEN), designed for accurate RSISR under tight runtime and memory budgets. The model comes in two versions, a 416 K lightweight DAFEN and a 188 K ultra-lightweight DAFEN-S, to accommodate different task requirements. Specifically, we design an extremely lightweight lattice structure, the Channel-Spatial Lattice Block (CSLB), as the feature extraction module; it is composed of the Group Residual Shuffle Block (GRSB) and the Channel-Spatial Attention Interaction Module (CSAIM). GRSB combines group convolution with channel shuffle as the nonlinear extraction module of CSLB, effectively reducing redundant convolution computation. CSAIM performs a weighted cross-combination of CSLB's two branches in both the spatial and channel dimensions, facilitating information flow between the branches. Furthermore, we develop the Forward Fusion Enhancement Module (FFEM), which uses a forward fusion structure to retain more high-level feature information and efficiently acquire richer contextual features, and which enhances the fused features through Self-Calibrated Group Convolution (SCGC) and Contrast-aware Channel Attention (CCA). FFEM incrementally fuses and enhances multi-level features with both local and global strategies, ultimately forming a comprehensive fused feature representation. Finally, experimental results on two remote-sensing and four benchmark datasets demonstrate that our network achieves a better balance between performance and model complexity.
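As a reading aid, the snippet below sketches the forward sequential concatenation that underlies FFEM. It is a simplified illustration under stated assumptions: each fusion step is reduced to a plain 1 × 1 convolution, whereas the actual FFEM refines and rescales the fused features with SCGC and CCA, and the class and argument names are ours rather than those of the released code.

```python
import torch
import torch.nn as nn


class ForwardFusionSketch(nn.Module):
    """Forward sequential concatenation over block outputs (simplified FFEM skeleton)."""

    def __init__(self, channels: int, num_blocks: int = 4):
        super().__init__()
        # One fusion step per additional block output; the SCGC/CCA refinement
        # is replaced here by a bare 1x1 convolution for brevity.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, kernel_size=1)
             for _ in range(num_blocks - 1)]
        )

    def forward(self, feats):
        # feats: list of DAFEB outputs, ordered from shallow to deep.
        fused = feats[0]
        for feat, layer in zip(feats[1:], self.fuse):
            # Carrying the running fusion forward retains more high-level
            # detail than fusing all levels in a single late step.
            fused = layer(torch.cat([fused, feat], dim=1))
        return fused


# Example usage with a hypothetical width of 48 channels:
# out = ForwardFusionSketch(48)([torch.randn(1, 48, 64, 64) for _ in range(4)])
```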

Author Contributions

Conceptualization, W.C.; methodology, W.C. and Y.L.; software, W.C.; validation, W.C.; formal analysis, W.C.; investigation, W.C.; resources, W.C. and S.Q.; data curation, W.C.; writing—original draft preparation, W.C.; writing—review and editing, W.C., S.Q., and L.L.; visualization, W.C.; supervision, S.Q. and L.L.; project administration, S.Q.; funding acquisition, S.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the National Natural Science Foundation of China (Grant No. 12201185) and the Henan Science and Technology Development Plan Project (Grant No. 242102210064).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote. Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  2. Lechner, A.M.; Foody, G.M.; Boyd, D.S. Applications in remote sensing to forest ecology and management. One Earth 2020, 2, 405–412. [Google Scholar] [CrossRef]
  3. Gupta, M.; Almomani, O.; Khasawneh, A.M.; Darabkh, K.A. Smart remote sensing network for early warning of disaster risks. In Nanotechnology-Based Smart Remote Sensing Networks for Disaster Prevention; Elsevier: Amsterdam, The Netherlands, 2022; pp. 303–324. [Google Scholar]
  4. Xu, P.; Tang, H.; Ge, J.; Feng, L. ESPC_NASUnet: An end-to-end super-resolution semantic segmentation network for mapping buildings from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 5421–5435. [Google Scholar] [CrossRef]
  5. Wang, Z.g.; Kang, Q.; Xun, Y.j.; Shen, Z.q.; Cui, C.b. Military reconnaissance application of high-resolution optical satellite remote sensing. In Proceedings of the International Symposium on Optoelectronic Technology and Application 2014: Optical Remote Sensing Technology and Applications, Beijing, China, 9–11 December 2014; SPIE: Bellingham, WA, USA, 2014; Volume 9299, pp. 301–305. [Google Scholar]
  6. Booysen, R.; Gloaguen, R.; Lorenz, S.; Zimmermann, R.; Andreani, L.; Nex, P.A. The potential of multi-sensor remote sensing mineral exploration: Examples from Southern Africa. In Proceedings of the IGARSS 2019—IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 6027–6030. [Google Scholar]
  7. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part IV 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 184–199. [Google Scholar]
  8. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407. [Google Scholar]
  9. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  10. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  11. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  12. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  13. Han, W.; Chang, S.; Liu, D.; Yu, M.; Witbrock, M.; Huang, T.S. Image super-resolution via dual-state recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1654–1663. [Google Scholar]
  14. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  15. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar]
  16. Luo, X.; Xie, Y.; Zhang, Y.; Qu, Y.; Li, C.; Fu, Y. Latticenet: Towards lightweight image super-resolution with lattice block. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 272–289. [Google Scholar]
  17. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual aggregation transformer for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12312–12321. [Google Scholar]
  18. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  19. Wang, H.; Chen, X.; Ni, B.; Liu, Y.; Liu, J. Omni aggregation networks for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22378–22387. [Google Scholar]
  20. Fang, J.; Chen, X.; Zhao, J.; Zeng, K. A scalable attention network for lightweight image super-resolution. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102185. [Google Scholar] [CrossRef]
  21. Lei, S.; Shi, Z.; Zou, Z. Super-resolution for remote sensing images via local–global combined network. IEEE Geosci. Remote. Sens. Lett. 2017, 14, 1243–1247. [Google Scholar] [CrossRef]
  22. Dong, X.; Wang, L.; Sun, X.; Jia, X.; Gao, L.; Zhang, B. Remote sensing image super-resolution using second-order multi-scale networks. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 3473–3485. [Google Scholar] [CrossRef]
  23. Zhang, D.; Shao, J.; Li, X.; Shen, H.T. Remote sensing image super-resolution via mixed high-order attention network. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 5183–5196. [Google Scholar] [CrossRef]
  24. Li, Q.; Yuan, Y.; Jia, X.; Wang, Q. Dual-stage approach toward hyperspectral image super-resolution. IEEE Trans. Image Process. 2022, 31, 7252–7263. [Google Scholar] [CrossRef] [PubMed]
  25. Lei, S.; Shi, Z.; Mo, W. Transformer-based multistage enhancement for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
  26. Tu, J.; Mei, G.; Ma, Z.; Piccialli, F. SWCGAN: Generative adversarial network combining swin transformer and CNN for remote sensing image super-resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 15, 5662–5673. [Google Scholar] [CrossRef]
  27. Wang, J.; Wang, B.; Wang, X.; Zhao, Y.; Long, T. Hybrid attention-based U-shaped network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  28. Wu, H.; Ni, N.; Zhang, L. Lightweight stepless super-resolution of remote sensing images via saliency-aware dynamic routing strategy. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–17. [Google Scholar] [CrossRef]
  29. Xie, Z.; Wang, J.; Song, W.; Du, Y.; Xu, H.; Yang, Q. CFFormer: Channel Fourier Transformer for Remote Sensing Super-Resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 18, 569–583. [Google Scholar] [CrossRef]
  30. Hao, J.; Li, W.; Lu, Y.; Jin, Y.; Zhao, Y.; Wang, S.; Wang, B. Scale-aware Backprojection Transformer for Single Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5649013. [Google Scholar] [CrossRef]
  31. Hao, S.; Zhuge, Y.; Xu, J.; Lu, H.; He, Y. Remote Sensing Image Super-Resolution Using Enriched Spatial-Channel Feature Aggregation Networks. In Proceedings of the 2024 6th International Conference on Data-driven Optimization of Complex Systems (DOCS), Hangzhou, China, 16–18 August 2024; pp. 578–585. [Google Scholar]
  32. Ye, W.; Lin, B.; Lao, J.; Liu, Y.; Lin, Z. MRA-IDN: A Lightweight Super-Resolution Framework of Remote Sensing Images based on Multi-Scale Residual Attention Fusion Mechanism. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 17, 7781–7800. [Google Scholar] [CrossRef]
  33. Qin, M.; Mavromatis, S.; Hu, L.; Zhang, F.; Liu, R.; Sequeira, J.; Du, Z. Remote sensing single-image resolution improvement using a deep gradient-aware network with image-specific enhancement. Remote Sens. 2020, 12, 758. [Google Scholar] [CrossRef]
  34. Dong, R.; Mou, L.; Zhang, L.; Fu, H.; Zhu, X.X. Real-world remote sensing image super-resolution via a practical degradation model and a kernel-aware network. ISPRS J. Photogramm. Remote. Sens. 2022, 191, 155–170. [Google Scholar] [CrossRef]
  35. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11065–11074. [Google Scholar]
  36. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  37. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Jin, X.; Zhang, L. EDiffSR: An efficient diffusion probabilistic model for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2023, 62, 5601514. [Google Scholar] [CrossRef]
  38. Sebaq, A.; ElHelw, M. Rsdiff: Remote sensing image generation from text using diffusion model. Neural Comput. Appl. 2024, 36, 23103–23111. [Google Scholar] [CrossRef]
  39. Dong, W.; Liu, S.; Xiao, S.; Qu, J.; Li, Y. ISPDiff: Interpretable Scale-Propelled Diffusion Model for Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5519614. [Google Scholar] [CrossRef]
  40. Xiao, Y.; Yuan, Q.; Jiang, K.; Chen, Y.; Zhang, Q.; Lin, C.W. Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution. arXiv 2024, arXiv:.04964. [Google Scholar] [CrossRef]
  41. Wang, H.; Cheng, S.; Li, Y.; Du, A. Lightweight remote-sensing image super-resolution via attention-based multilevel feature fusion network. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  42. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 517–532. [Google Scholar]
  43. Liu, F.; Yang, X.; De Baets, B. Lightweight image super-resolution with a feature-refined network. Signal Process. Image Commun. 2023, 111, 116898. [Google Scholar] [CrossRef]
  44. Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 41–55. [Google Scholar]
  45. Xue, Y.; Li, L.; Wang, Z.; Jiang, C.; Liu, M.; Wang, J.; Sun, K.; Ma, H. RFCNet: Remote Sensing Image Super-Resolution Using Residual Feature Calibration Network. Tsinghua Sci. Technol. 2022, 28, 475–485. [Google Scholar] [CrossRef]
  46. Wang, Z.; Li, L.; Xue, Y.; Jiang, C.; Wang, J.; Sun, K.; Ma, H. FeNet: Feature enhancement network for lightweight remote-sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  47. Wang, P.; Bayram, B.; Sertel, E. A comprehensive review on deep learning based remote sensing image super-resolution methods. Earth-Sci. Rev. 2022, 232, 104110. [Google Scholar] [CrossRef]
  48. Wang, S.; Zhou, T.; Lu, Y.; Di, H. Contextual transformation network for lightweight remote-sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  49. Huang, S.; Wang, J.; Yang, Y.; Wan, W. TARN: A lightweight two-branch adaptive residual network for image super-resolution. Int. J. Mach. Learn. Cybern. 2024, 15, 4119–4132. [Google Scholar] [CrossRef]
  50. Ahn, N.; Kang, B.; Sohn, K.A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 252–268. [Google Scholar]
  51. Hui, Z.; Wang, X.; Gao, X. Fast and accurate single image super-resolution via information distillation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 723–731. [Google Scholar]
  52. Sun, L.; Pan, J.; Tang, J. Shufflemixer: An efficient convnet for image super-resolution. Adv. Neural Inf. Process. Syst. 2022, 35, 17314–17326. [Google Scholar]
  53. Zhang, A.; Ren, W.; Liu, Y.; Cao, X. Lightweight image super-resolution with superpixel token interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12728–12737. [Google Scholar]
  54. Wang, Y.; Zhang, T. Osffnet: Omni-stage feature fusion network for lightweight image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5660–5668. [Google Scholar]
  55. Wang, L.; Li, X.; Tian, W.; Peng, J.; Chen, R. Lightweight interactive feature inference network for single-image super-resolution. Sci. Rep. 2024, 14, 11601. [Google Scholar] [CrossRef]
  56. Wang, Q.; Wang, S.; Chen, M.; Zhu, Y. DARN: Distance attention residual network for lightweight remote-sensing image superresolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 16, 714–724. [Google Scholar] [CrossRef]
  57. Gao, F.; Li, L.; Wang, J.; Sun, K.; Lv, M.; Jia, Z.; Ma, H. A lightweight feature distillation and enhancement network for super-resolution remote sensing images. Sensors 2023, 23, 3906. [Google Scholar] [CrossRef] [PubMed]
  58. Wu, T.; Zhao, R.; Lv, M.; Jia, Z.; Li, L.; Wang, Z.; Ma, H. Lightweight remote sensing image super-resolution via background-based multi-scale feature enhancement network. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 7509405. [Google Scholar] [CrossRef]
  59. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  60. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  61. Haase, D.; Amthor, M. Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved mobilenets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14600–14609. [Google Scholar]
  62. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  63. Dong, X.; Sun, X.; Jia, X.; Xi, Z.; Gao, L.; Zhang, B. Remote sensing image super-resolution using novel dense-sampling networks. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 1618–1633. [Google Scholar] [CrossRef]
  64. Timofte, R.; Agustsson, E.; Van Gool, L.; Yang, M.H.; Zhang, L. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 114–125. [Google Scholar]
  65. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
  66. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; pp. 135.1–135.10. [Google Scholar]
  67. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image super-resolution via sparse representation. IEEE Trans. Image Process. 2010, 19, 2861–2873. [Google Scholar] [CrossRef]
  68. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the 8th IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar]
  69. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  70. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  71. Pan, Z.; Ma, W.; Guo, J.; Lei, B. Super-resolution of single remote sensing image based on residual dense backprojection networks. IEEE Trans. Geosci. Remote. Sens. 2019, 57, 7918–7933. [Google Scholar] [CrossRef]
  72. Zhang, S.; Yuan, Q.; Li, J.; Sun, J.; Zhang, X. Scene-adaptive remote sensing image super-resolution using a multiscale attention network. IEEE Trans. Geosci. Remote. Sens. 2020, 58, 4764–4779. [Google Scholar] [CrossRef]
  73. Tian, C.; Zhuge, R.; Wu, Z.; Xu, Y.; Zuo, W.; Chen, C.; Lin, C.W. Lightweight image super-resolution with enhanced CNN. Knowl.-Based Syst. 2020, 205, 106235. [Google Scholar] [CrossRef]
  74. Lan, R.; Sun, L.; Liu, Z.; Lu, H.; Pang, C.; Luo, X. MADNet: A fast and lightweight network for single-image super resolution. IEEE Trans. Cybern. 2020, 51, 1443–1453. [Google Scholar] [CrossRef]
  75. Gu, J.; Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9199–9208. [Google Scholar]
Figure 1. Model parameters and accuracy tradeoff with other lightweight methods on BSD100 for ×2 SR. Our proposed DAFEN achieves superior performance, and our DAFEN-S also maintains competitive performance. The Multi-Adds (Multiply-Add Operations) are computed on a 1280 × 720 HR image.
Figure 2. Overview of the proposed Dual Attention Fusion Enhancement Network (DAFEN) architecture. The shallow feature extraction and reconstruction parts are utilized to extract coarse features and enlarge the features s times (e.g., ×2, ×3, ×4), respectively. The Feature Refinement and Fusion part with four DAFEBs carries the main feature expression ability. The FFEM can generate more contextual information via forward sequential concatenation.
Figure 3. Structure of Dual Attention Fusion Enhancement Block (DAFEB).
Figure 4. Illustrations of the proposed Forward Fusion Enhancement Module (FFEM). (a) The forward fusion structure by using forward sequential concatenation is illustrated by taking four blocks as an example. (b) The Contrast-aware Channel Attention (CCA). (c) The Self-Calibrated Group Convolution (SCGC). (d) The Group Residual Shuffle Block (GRSB).
Figure 5. (a) The Channel-Spatial Lattice Block (CSLB), where the Channel-Spatial Attention Interaction Module (CSAIM) includes the Spatial Attention Interaction (SAI) and the Channel Attention Interaction (CAI). ‘Split’ represents the channel separation operation. (b) The Spatial Attention Interaction (SAI). (c) The Channel Attention Interaction (CAI).
Figure 6. Visualization results of several SR methods and our proposed networks (DAFEN and DAFEN-S) on UC-Merced for ×4 SR.
Figure 7. Visualization results of several SR methods and our proposed networks (DAFEN and DAFEN-S) on BSD100 and Urban100 datasets for ×4 SR.
Figure 8. Visualization results of our proposed networks (DAFEN and DAFEN-S) and other SR methods on real remote-sensing images for ×4 SR. (a) Residential areas and farmland. (b) Terraces and roads.
Figure 9. Visualization results for the ablation experiments on the CSLB design.
Figure 10. Average feature maps from the ablation experiments on the FFEM design at different stages of the DAFEN.
Figure 11. LAM results for the ablation experiments on the FFEM design. LAM reflects the importance of each pixel in the input LR image during the reconstruction of the marked blocks. The red-marked points indicate the pixels that contribute to the reconstruction process. The Diffusion Index (DI) reflects the range of involved pixels, with a higher DI indicating a wider range of utilized pixels.
Figure 12. Network average feature maps visualization.
Table 1. Implementation details and hyperparameter settings of our methods and comparative lightweight methods.
Method | Optim_g | Betas | lr | Gamma | Loss Type | Batch Size | Patch Size | Use_hflip | Use_rot
LGCNet [21] | ADAM | [0.9,0.999] | 1 × 10⁻¹ | 0.1 | L2 | 128 | 41 × 41 | false | false
IDN [51] | ADAM | [0.9,0.999] | 1 × 10⁻⁴ | 0.1 | L1 | 64 | 26 × 26 | true | true
LESRCNN [73] | ADAM | [0.9,0.999] | 1 × 10⁻⁴ | 0.5 | L2 | 64 | 64 × 64 | true | true
MADNet [74] | ADAM | [0.9,0.999] | 1 × 10⁻³ | 0.5 | LF | 16 | 48 × 48 | true | true
CTN [48] | ADAM | [0.9,0.999] | 1 × 10⁻⁴ | 0.5 | L1 | 16 | 48 × 48 | true | true
FeNet-baseline [46] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
FeNet [46] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
FDENet [57] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
DARN-S [56] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 64 | 64 × 64 | true | true
AMFFN [41] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 48 × 48 | true | true
IFIN-S [55] | ADAM | [0.9,0.99] | 1 × 10⁻³ | 0.5 | L1 | 16 | 60 × 60 | true | true
BMFENet [58] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
TARN [49] | ADAM | [0.9,0.999] | 3 × 10⁻⁴ | 0.5 | L1 | 16 | 256 × 256 | true | true
DAFEN-S (ours) | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
DAFEN (ours) | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
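For readers reproducing the DAFEN row of Table 1, the sketch below assembles the listed optimization settings (Adam with betas [0.9, 0.999], an initial learning rate of 5 × 10⁻⁴, step decay with gamma 0.5, and an L1 loss). The decay interval passed to the scheduler is not reported in the table and is only a placeholder; the data pipeline (batches of 16 patches of 192 × 192 HR pixels with random flips and rotations) is summarized in a comment.

```python
import torch
import torch.nn as nn


def build_training_objects(model: nn.Module):
    # Settings taken from the DAFEN row of Table 1.
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
    # Gamma = 0.5 as listed; the step interval is NOT given in the table and is
    # an assumed placeholder here.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
    return criterion, optimizer, scheduler


# Data side (per Table 1): batch size 16, 192 x 192 HR patches,
# Use_hflip = true and Use_rot = true for augmentation.
```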
Table 2. Impact of hyperparameter selection on model performance and complexity at a scaling factor of ×4 on the RS-T1 dataset. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Variant | Params | Multi-Adds | RS-T1 (×4) PSNR/SSIM
Loss type: L2 | 431 K | 21.8 G | 29.73/0.7669
Batch size: 32 | 431 K | 21.8 G | 29.81/0.7694
Batch size: 8 | 431 K | 21.8 G | 29.80/0.7689
Patch size: 256 | 431 K | 38.8 G | 29.82/0.7695
Patch size: 128 | 431 K | 9.7 G | 29.80/0.7691
Use_hflip: false | 431 K | 21.8 G | 29.76/0.7681
Use_rot: false | 431 K | 21.8 G | 29.78/0.7685
DAFEN | 431 K | 21.8 G | 29.82/0.7701
Table 3. Quantitative evaluation results for SR on two RSI test datasets. PSNR and SSIM values are provided. ‘-’ denotes the results are not provided. The best and second-best results are highlighted in red and blue, respectively. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Method | Scale | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM
Bicubic | ×2 | - | - | 33.25/0.8934 | 30.64/0.8837
LGCNet [21] | ×2 | 193 K | 178.1 G | 35.65/0.9298 | 33.47/0.9281
IDN [51] | ×2 | 553 K | 124.6 G | 36.13/0.9339 | 34.07/0.9329
LESRCNN [73] | ×2 | 626 K | 281.5 G | 36.04/0.9328 | 34.00/0.9320
CTN [48] | ×2 | 402 K | 60.9 G | 36.30/0.9243 | 34.31/0.9346
FeNet-baseline [46] | ×2 | 158 K | 35.2 G | 36.10/0.9331 | 34.10/0.9326
FeNet [46] | ×2 | 351 K | 77.9 G | 36.23/0.9341 | 34.22/0.9337
FDENet [57] | ×2 | 480 K | 138.7 G | 36.26/0.9346 | 34.28/0.9338
DARN-S [56] | ×2 | 350 K | 78.9 G | 36.31/0.9347 | 34.35/0.9348
AMFFN [41] | ×2 | 298 K | 61.5 G | 36.39/0.9357 | 34.34/0.9346
IFIN-S [55] | ×2 | 451 K | 110.6 G | 36.38/0.9356 | 34.42/0.9352
BMFENet [58] | ×2 | 465 K | 115.0 G | 36.42/0.9362 | 34.43/0.9356
DAFEN-S (ours) | ×2 | 188 K | 37.5 G | 36.28/0.9352 | 34.24/0.9338
DAFEN (ours) | ×2 | 416 K | 83.6 G | 36.42/0.9365 | 34.39/0.9357
Bicubic | ×3 | - | - | 29.73/0.7818 | 27.23/0.7697
LGCNet [21] | ×3 | 193 K | 79.0 G | 31.30/0.8314 | 29.03/0.8312
IDN [51] | ×3 | 553 K | 56.3 G | 31.73/0.8430 | 29.59/0.8450
LESRCNN [73] | ×3 | 810 K | 238.9 G | 31.68/0.8398 | 29.65/0.8444
CTN [48] | ×3 | 402 K | 37.1 G | 31.91/0.8454 | 29.83/0.8489
FeNet-baseline [46] | ×3 | 163 K | 16.7 G | 31.73/0.8377 | 29.61/0.8446
FeNet [46] | ×3 | 357 K | 35.2 G | 31.89/0.8432 | 29.80/0.8481
FDENet [57] | ×3 | 488 K | 61.7 G | 31.98/0.8488 | 29.88/0.8489
DARN-S [56] | ×3 | 355 K | 35.0 G | 32.00/0.8483 | 29.98/0.8518
AMFFN [41] | ×3 | 305 K | 27.9 G | 31.94/0.8457 | 29.91/0.8504
IFIN-S [55] | ×3 | 459 K | 51.0 G | 32.04/0.8448 | 30.03/0.8535
BMFENet [58] | ×3 | 470 K | 51.7 G | 31.99/0.8465 | 29.97/0.8514
DAFEN-S (ours) | ×3 | 192 K | 17.1 G | 31.93/0.8459 | 29.81/0.8485
DAFEN (ours) | ×3 | 422 K | 37.7 G | 32.13/0.8527 | 29.98/0.8516
Bicubic | ×4 | - | - | 27.91/0.6968 | 25.40/0.6770
LGCNet [21] | ×4 | 193 K | 44.5 G | 29.13/0.7481 | 26.76/0.7426
IDN [51] | ×4 | 553 K | 32.3 G | 29.56/0.7623 | 27.31/0.7627
LESRCNN [73] | ×4 | 774 K | 241.6 G | 29.62/0.7625 | 27.41/0.7646
CTN [48] | ×4 | 413 K | 25.6 G | 29.71/0.7666 | 27.52/0.7704
FeNet-baseline [46] | ×4 | 169 K | 9.4 G | 29.57/0.7626 | 27.31/0.7619
FeNet [46] | ×4 | 366 K | 20.4 G | 29.70/0.7688 | 27.45/0.7672
FDENet [57] | ×4 | 501 K | 35.9 G | 29.72/0.7658 | 27.54/0.7697
DARN-S [56] | ×4 | 363 K | 19.7 G | 29.78/0.7682 | 27.59/0.7732
AMFFN [41] | ×4 | 314 K | 16.2 G | 29.76/0.7674 | 27.57/0.7701
IFIN-S [55] | ×4 | 470 K | 31.6 G | 29.84/0.7724 | 27.68/0.7763
BMFENet [58] | ×4 | 477 K | 29.4 G | 29.81/0.7700 | 27.62/0.7730
DAFEN-S (ours) | ×4 | 198 K | 10.0 G | 29.70/0.7673 | 27.46/0.7677
DAFEN (ours) | ×4 | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737
Table 4. Quantitative evaluation results for SR on four benchmark datasets. PSNR and SSIM values are provided. ‘-’ denotes the results are not provided. The best and second-best results are highlighted in red and blue, respectively. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Method | Scale | Params | Multi-Adds | Set5 PSNR/SSIM | Set14 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
Bicubic | ×2 | - | - | 33.66/0.9299 | 30.24/0.8688 | 29.56/0.8431 | 26.88/0.8403
LGCNet [21] | ×2 | 193 K | 178.1 G | 37.31/0.9580 | 32.94/0.9120 | 31.74/0.8939 | 30.53/0.9112
IDN [51] | ×2 | 553 K | 124.6 G | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196
MADNet [74] | ×2 | 878 K | 187.1 G | 37.85/0.9600 | 33.39/0.9161 | 32.05/0.8981 | 31.59/0.9234
FeNet-baseline [46] | ×2 | 158 K | 35.2 G | 37.77/0.9597 | 33.28/0.9151 | 31.98/0.8973 | 31.46/0.9215
FeNet [46] | ×2 | 351 K | 77.9 G | 37.90/0.9602 | 33.45/0.9162 | 32.09/0.8985 | 31.75/0.9245
FDENet [57] | ×2 | 480 K | 138.7 G | 37.89/0.9594 | 33.50/0.9170 | 32.15/0.8988 | 32.02/0.9270
DARN-S [56] | ×2 | 350 K | 78.9 G | 37.97/0.9609 | 33.54/0.9172 | 32.19/0.9005 | 32.14/0.9284
IFIN-S [55] | ×2 | 451 K | 110.6 G | 38.00/0.9606 | 33.66/0.9181 | 32.18/0.8996 | 32.14/0.9284
BMFENet [58] | ×2 | 465 K | 115.0 G | 38.04/0.9605 | 33.62/0.9180 | 32.22/0.9004 | 32.29/0.9300
TARN [49] | ×2 | 687 K | - | 38.09/0.9608 | 33.65/0.9183 | 32.22/0.9003 | 32.20/0.9289
DAFEN-S (ours) | ×2 | 188 K | 37.5 G | 37.94/0.9605 | 33.41/0.9159 | 32.12/0.8991 | 31.76/0.9248
DAFEN (ours) | ×2 | 416 K | 83.6 G | 38.04/0.9617 | 33.55/0.9175 | 32.22/0.9010 | 32.20/0.9291
Bicubic | ×3 | - | - | 30.39/0.8682 | 27.55/0.7742 | 27.21/0.7385 | 24.46/0.7349
LGCNet [21] | ×3 | 193 K | 79.0 G | 33.32/0.9172 | 29.67/0.8289 | 28.63/0.7923 | 26.77/0.8180
IDN [51] | ×3 | 553 K | 56.3 G | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359
MADNet [74] | ×3 | 930 K | 88.4 G | 34.14/0.9251 | 30.20/0.8395 | 28.98/0.8023 | 27.78/0.8439
FeNet-baseline [46] | ×3 | 163 K | 16.7 G | 33.99/0.9240 | 30.02/0.8359 | 28.90/0.8000 | 27.55/0.8391
FeNet [46] | ×3 | 357 K | 35.2 G | 34.21/0.9256 | 30.15/0.8383 | 28.98/0.8020 | 27.82/0.8447
FDENet [57] | ×3 | 488 K | 61.7 G | 34.28/0.9253 | 30.33/0.8415 | 29.05/0.8033 | 28.03/0.8494
DARN-S [56] | ×3 | 355 K | 35.0 G | 34.35/0.9274 | 30.34/0.8428 | 29.09/0.8065 | 28.17/0.8528
IFIN-S [55] | ×3 | 459 K | 51.0 G | 34.45/0.9278 | 30.47/0.8442 | 29.13/0.8064 | 28.32/0.8560
BMFENet [58] | ×3 | 470 K | 51.7 G | 34.34/0.9271 | 30.27/0.8407 | 29.08/0.8049 | 28.18/0.8534
TARN [49] | ×3 | 754 K | - | 34.42/0.9275 | 30.37/0.8430 | 29.12/0.8056 | 28.19/0.8529
DAFEN-S (ours) | ×3 | 192 K | 17.1 G | 34.25/0.9261 | 30.18/0.8389 | 29.02/0.8031 | 27.76/0.8451
DAFEN (ours) | ×3 | 422 K | 37.7 G | 34.43/0.9275 | 30.37/0.8434 | 29.12/0.8057 | 28.12/0.8517
Bicubic | ×4 | - | - | 28.42/0.8104 | 26.00/0.7027 | 25.96/0.6675 | 23.14/0.6577
LGCNet [21] | ×4 | 193 K | 44.5 G | 30.87/0.8746 | 27.82/0.7630 | 27.08/0.7186 | 24.82/0.7399
IDN [51] | ×4 | 553 K | 32.3 G | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632
MADNet [74] | ×4 | 1002 K | 54.1 G | 32.01/0.8925 | 28.45/0.7781 | 27.47/0.7327 | 25.77/0.7751
FeNet-baseline [46] | ×4 | 169 K | 9.4 G | 31.80/0.8886 | 28.31/0.7742 | 27.38/0.7289 | 25.53/0.7670
FeNet [46] | ×4 | 366 K | 20.4 G | 32.02/0.8919 | 28.38/0.7764 | 27.47/0.7319 | 25.75/0.7747
FDENet [57] | ×4 | 501 K | 35.9 G | 32.12/0.8929 | 28.52/0.7795 | 27.53/0.7339 | 25.97/0.7811
DARN-S [56] | ×4 | 363 K | 19.7 G | 32.16/0.8951 | 28.58/0.7817 | 27.57/0.7374 | 26.08/0.7859
IFIN-S [55] | ×4 | 470 K | 31.6 G | 32.27/0.8958 | 28.68/0.7834 | 27.62/0.7381 | 26.17/0.7890
BMFENet [58] | ×4 | 477 K | 29.4 G | 32.22/0.8951 | 28.61/0.7812 | 27.54/0.7335 | 26.04/0.7852
TARN [49] | ×4 | 835 K | - | 32.23/0.8955 | 28.65/0.7829 | 27.61/0.7368 | 26.15/0.7874
DAFEN-S (ours) | ×4 | 198 K | 10.0 G | 32.00/0.8919 | 28.39/0.7773 | 27.49/0.7326 | 25.72/0.7758
DAFEN (ours) | ×4 | 431 K | 21.8 G | 32.23/0.8948 | 28.59/0.7815 | 27.57/0.7376 | 26.01/0.7832
Table 5. Comparison results with non-lightweight state-of-the-art methods at a scaling factor of ×3. Due to the higher model complexity of non-lightweight methods, we present the data using larger magnitude units for ease of comparison. The best and second-best results are highlighted in red and blue, respectively. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Method | Params | Multi-Adds | Time
RCAN [14] | 15.67 M | 1.492 T | 0.12 s
SwinIR [36] | 11.55 M | 2.883 T | 0.23 s
HAT [18] | 20.53 M | 3.871 T | 0.32 s
DAFEN-S | 0.192 M | 0.017 T | 0.012 s
DAFEN | 0.422 M | 0.038 T | 0.017 s
Table 6. Quantitative comparison of how lightweight the models are on the RS-T1 dataset at a scaling factor of ×3. The best and second-best results are highlighted in red and blue, respectively. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Method | Params | Multi-Adds | Time | RS-T1 (×3) PSNR/SSIM
FeNet [46] | 357 K | 35.2 G | 19.46 ms | 31.89/0.8432
DARN [56] | 596 K | 58.4 G | 18.87 ms | 32.08/0.8470
IFIN-S [55] | 459 K | 51.0 G | 143.34 ms | 32.04/0.8448
DAFEN-S | 192 K | 17.1 G | 11.65 ms | 31.93/0.8459
DAFEN | 422 K | 37.7 G | 17.31 ms | 32.13/0.8527
Table 7. Ablation experiments on the design of the DAFEN on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
DAFEN w/o DAFEB | 683 K | 38.7 G | 29.82/0.7689 | 27.63/0.7744 | 27.58/0.7380 | 26.03/0.7840
DAFEN w/o CSLB | 744 K | 40.2 G | 29.83/0.7694 | 27.61/0.7742 | 27.58/0.7383 | 26.05/0.7851
DAFEN w/o FFEM | 315 K | 16.5 G | 29.75/0.7663 | 27.54/0.7706 | 27.52/0.7353 | 25.92/0.7794
DAFEN | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
Table 8. Ablation experiments on the design of the CSLB on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
CSLB w/o GRSB | 867 K | 46.9 G | 29.85/0.7708 | 27.68/0.7753 | 27.61/0.7386 | 26.13/0.7854
CSLB w/o CSAIM | 434 K | 21.7 G | 29.80/0.7696 | 27.61/0.7735 | 27.54/0.7371 | 25.99/0.7831
CSLB | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
Table 9. Ablation experiments on the design of the FFEM on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
W/ 1 × 1 Conv | 443 K | 22.4 G | 29.81/0.7695 | 27.59/0.7722 | 27.54/0.7367 | 26.00/0.7828
W/ BFM | 431 K | 21.8 G | 29.79/0.7689 | 27.61/0.7735 | 27.56/0.7373 | 26.01/0.7831
FFEM w/o SCGC | 303 K | 15.8 G | 29.73/0.7671 | 27.53/0.7705 | 27.49/0.7350 | 25.80/0.7765
FFEM w/o CCA | 406 K | 20.4 G | 29.79/0.7688 | 27.60/0.7732 | 27.54/0.7365 | 25.95/0.7814
FFEM | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
Table 10. Ablation experiments on the design of the lightweight convolution in DAFEN on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
W/ PSConv | 368 K | 18.7 G | 29.73/0.7665 | 27.54/0.7706 | 27.48/0.7346 | 25.89/0.7792
W/ BSConv | 343 K | 17.5 G | 29.62/0.7634 | 27.42/0.7663 | 27.44/0.7317 | 25.80/0.7768
W/ DSConv | 343 K | 17.5 G | 29.65/0.7642 | 27.40/0.7648 | 27.45/0.7324 | 25.81/0.7766
DAFEN | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
Table 11. Ablation experiments on the design of the number of groups in group convolution on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
G = 2 | 628 K | 31.8 G | 29.84/0.7705 | 27.66/0.7746 | 27.59/0.7385 | 26.08/0.7844
G = 4 | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
G = 8 | 333 K | 16.8 G | 29.77/0.7672 | 27.59/0.7727 | 27.54/0.7365 | 25.94/0.7808
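The parameter trend in Table 11 follows directly from the weight count of a grouped k × k convolution, C_in · C_out · k² / G. The short calculation below illustrates the scaling; the channel width of 64 is chosen purely for illustration and is not claimed to be the width used in DAFEN.

```python
def conv_params(c_in: int, c_out: int, k: int = 3, groups: int = 1) -> int:
    # Weight count of a k x k convolution, bias ignored.
    return c_in * c_out * k * k // groups


channels = 64  # illustrative width, not necessarily DAFEN's
for g in (1, 2, 4, 8):
    print(g, conv_params(channels, channels, groups=g))
# Prints 36864, 18432, 9216, 4608: the count halves with each doubling of G,
# mirroring the parameter drop from G = 2 to G = 8 in Table 11.
```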
Table 12. Ablation experiments on the design of the number of DAFEBs and CSLBs on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
n_b = 4, n_l = 1 | 289 K | 13.9 G | 29.76/0.7681 | 27.57/0.7716 | 27.53/0.7361 | 25.86/0.7787
n_b = 4, n_l = 3 | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
n_b = 4, n_l = 5 | 574 K | 29.7 G | 29.86/0.7694 | 27.71/0.7759 | 27.59/0.7383 | 26.11/0.7861
n_b = 3, n_l = 3 | 345 K | 17.5 G | 29.79/0.7676 | 27.60/0.7730 | 27.54/0.7362 | 25.92/0.7798
n_b = 5, n_l = 3 | 517 K | 26.0 G | 29.84/0.7700 | 27.65/0.7746 | 27.59/0.7380 | 26.05/0.7849
n_b = 6, n_l = 3 | 603 K | 30.3 G | 29.86/0.7704 | 27.70/0.7764 | 27.61/0.7386 | 26.12/0.7863
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
