Article

Dual Attention Fusion Enhancement Network for Lightweight Remote-Sensing Image Super-Resolution

School of Software, Henan University, Kaifeng 475004, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1078; https://doi.org/10.3390/rs17061078
Submission received: 14 January 2025 / Revised: 3 March 2025 / Accepted: 17 March 2025 / Published: 19 March 2025

Abstract

In the field of remote sensing, super-resolution methods based on deep learning have made significant progress. However, redundant feature extraction leads to excessive parameters, and inefficient feature fusion restricts the precise reconstruction of features, making such models difficult to deploy in practical remote-sensing tasks. To address these issues, we propose a lightweight Dual Attention Fusion Enhancement Network (DAFEN) for remote-sensing image super-resolution. Firstly, we design a lightweight Channel-Spatial Lattice Block (CSLB), which consists of Group Residual Shuffle Blocks (GRSB) and a Channel-Spatial Attention Interaction Module (CSAIM). The GRSB improves the efficiency of redundant convolution operations, while the CSAIM enhances interactive learning. Secondly, to achieve superior feature fusion and enhancement, we design a Forward Fusion Enhancement Module (FFEM). Through the forward fusion strategy, more high-level feature details are retained for better adaptation to remote-sensing tasks. In addition, the fused features are further refined and rescaled by Self-Calibrated Group Convolution (SCGC) and Contrast-aware Channel Attention (CCA), respectively. Extensive experiments demonstrate that DAFEN achieves better or comparable performance compared with state-of-the-art lightweight super-resolution models while reducing complexity by approximately 10–48%.

Graphical Abstract

1. Introduction

Remote-sensing images (RSIs) significantly contribute to earth and environmental sciences by providing diverse observational data. They offer critical support for various tasks, such as agricultural and forestry monitoring [1,2], disaster early warning [3,4], military reconnaissance [5], and industrial manufacturing [6]. Therefore, providing high-resolution RSIs is essential for a more accurate and objective representation of the targets and contexts. However, remote-sensing data are affected by adverse factors such as long acquisition distances and wide viewing angles. For edge devices with limited computational resources, hardware constraints and unfavorable imaging conditions make it even more difficult to acquire high-quality RSIs. RSI Super-Resolution (RSISR) focuses on constructing a nonlinear mapping between pairs of low-resolution and high-resolution remote-sensing images. Compared with upgrading remote-sensing imaging equipment, developing efficient RSISR algorithms at the software level can not only effectively improve image resolution but also save significant costs. Therefore, using RSISR technologies to improve the resolution of remote-sensing images has become a key research direction.
Since the pioneering SR method SRCNN [7] was proposed, the field of SR has grown tremendously [8,9,10,11,12,13,14,15,16,17,18,19,20]. The success of natural image SR techniques has spurred the development of RSISR [21,22,23,24,25,26,27,28,29,30,31,32]. RSISR builds upon natural image SR research by incorporating specific improvements tailored to remote-sensing images, such as designing loss functions suited to the unique features of RSI [33], employing a two-stage design that utilizes spatial and spectral knowledge from adjacent bands [24], and creating degradation models that reflect remote-sensing scenarios [34]. As research progresses, although the performance of these RSISR models has continually improved, they face challenges in effectively managing the inherent complexity of the networks. We take well-known SR methods such as RCAN [14], SAN [35], and SwinIR [36] as examples. When applied to RSISR tasks, RCAN [14] has about 15M parameters, SAN [35] has about 26M parameters, and large variants of SwinIR [36] have upwards of 20M parameters. Recently, RSISR methods based on the diffusion model [37,38,39] and Mamba [40] have become new research hotspots, but their parameter counts are similarly large. The large number of parameters greatly limits their deployment on resource-constrained devices. Therefore, our work aims to design a lightweight and efficient RSISR network to recover remote-sensing images.
In the field of RSISR, several successful lightweight models have emerged. The attention-based multi-level feature fusion (AMFF) in AMFFN [41] differs from common feature fusion methods [15,16,42,43,44]. It adopts a grouped approach to progressively fuse features from different blocks, which fully leverages multi-level features. However, the AMFF adopts a 1 × 1 convolution fusion method, which significantly increases the number of parameters when the number of input channels is large. Additionally, it only uses the Contrast-aware Channel Attention (CCA) [15] module in the enhancement stage, which provides relatively weak performance. RFCNet [45] addresses feature enhancement by designing the Residual Feature Calibration Block (RFCB), which further refines and rescales input features through a clever dual-branch structure, significantly improving the model’s fitting capability. However, RFCB employs channel separation, causing each branch to enhance only half of the feature information, which limits the feature representation capability. Moreover, the use of multiple 3 × 3 convolution layers leads to a significant increase in model parameters. FeNet [46] constructs Lightweight Lattice Blocks (LLBs) based on channel separation and effectively integrates multi-level feature information through nested module design. In multi-level feature fusion, the Backward Fuse Module (BFM) [16] employed by FeNet can extract more contextual information at different levels more effectively than 1 × 1 convolution. However, LLB extracts features from only half of the channels, which results in poorer feature extraction. Additionally, its nested module structure is overly complex, resulting in significant redundancy. Moreover, the backward sequential concatenation used by BFM leads to higher-level features being compressed more than lower-level features. However, high-level features typically contain more semantic information and contextual relationships, which are crucial for reconstructing complex structures and global information. Given the complexity of RSIs [47], the excessive loss of high-level feature information is unreasonable.
To address the limitations of the existing methods, we propose the Dual Attention Fusion Enhancement Network (DAFEN). Meanwhile, to accommodate devices with extremely limited hardware capabilities, we further design an extremely lightweight version, DAFEN-S, by reducing the number of channels, whose parameter count is only 188 K. As shown in Figure 1, our DAFEN demonstrates clear competitive advantages over other lightweight models and achieves a better balance between network complexity and reconstruction accuracy. Specifically, the feature refinement and fusion section of DAFEN is composed of four stacked Dual Attention Fusion Enhancement Blocks (DAFEBs) and a Forward Fusion Enhancement Module (FFEM). The features extracted by the shallow-layer DAFEB are not only fed into the next-layer DAFEB for further extraction and refinement but also input into the FFEM for multi-level feature fusion and enhancement. Furthermore, each DAFEB consists of three Channel-Spatial Lattice Blocks (CSLBs) and a Forward Fusion Enhancement Module (FFEM), with a structure similar to the feature refinement and fusion section of DAFEN. Notably, we introduce a Context Enhancement Module (CEM) [48] at the end of the DAFEB to further enhance the feature information. By employing the extremely lightweight CEM, the receptive field is further enlarged to extract more contextual information for RSISR.
In addition, the core modules of the DAFEN are the Forward Fusion Enhancement Module (FFEM) and the Channel-Spatial Lattice Block (CSLB). Firstly, the FFEM consists of a multi-level feature fusion stage and a feature enhancement stage. On the one hand, we adopt a forward sequential concatenation approach, which reduces the loss of high-level feature information while supplementing low-level feature details, thereby fully utilizing multi-level features to extract richer contextual information. On the other hand, we design a dual-branch structure for feature enhancement. In the upper branch, we design a Self-Calibrated Group Convolution (SCGC), which self-calibrates based on the expression of local features in a deep receptive field, further refining the features. In the lower branch, we introduce CCA [15], which computes weight coefficients to obtain a better channel attention vector. Secondly, the CSLB adopts a dual-branch design based on channel separation, making feature extraction and information interaction extremely lightweight. Unlike other lattice blocks [16,46,49], we design the Group Residual Shuffle Block (GRSB) and the Channel-Spatial Attention Interaction Module (CSAIM). The GRSB extracts features through group convolutions and enhances information flow via channel shuffle, which enables GRSB to achieve feature representation similar to 3 × 3 convolution while using fewer parameters. Furthermore, the CSAIM alternately employs channel attention and spatial attention to enhance information exchange between the two branches of the CSLB, thereby expanding the receptive field and improving feature representation.
In summary, the main contributions are as follows:
  • We propose a lightweight model, DAFEN, with approximately 416K parameters and an ultra-lightweight model, DAFEN-S, with approximately 188K parameters. Compared with other methods, we achieve better performance with fewer model parameters.
  • We design a novel lattice structure, CSLB, which combines GRSB and CSAIM. This structure efficiently extracts features while maintaining the model’s lightweight nature. It also enhances information exchange between the two branches through channel and spatial attention, improving feature extraction for remote-sensing images.
  • We design an efficient feature fusion module, FFEM, which consists of a fusion stage and an enhancement stage. In the fusion stage, FFEM effectively integrates multi-level features through forward sequential concatenation. In the enhancement stage, a unique dual-branch design is adopted to perform feature self-calibration and rescaling, which not only refines the features but also enhances the capability of representing complex remote-sensing image characteristics.

2. Related Works

2.1. Lightweight Natural Image SR

In recent years, the continuous improvement in the performance of super-resolution (SR) networks has often come with increased parameters and computational overhead, posing challenges for practical deployment. Consequently, there is a growing demand for lightweight SR networks. FSRCNN [8] reduced the model parameters and computational load by reconstructing the SRCNN [7] architecture while maintaining performance. Ahn et al. [50] proposed a Cascading Residual Network (CARN), which uses multi-level representation and shortcut connections for more efficient information transfer. Hui et al. [51] developed an Information Distillation Network (IDN), which extracts significant information by merging different features. Building on this, IMDN [15] enhanced IDN by using information multi-distillation blocks and channel splitting operations. Liu et al. [44] designed a lightweight and precise Residual Feature Distillation Network (RFDN) using Feature Distillation Connections (FDC) and Shallow Residual Blocks (SRB). LatticeNet [16] achieved excellent performance while significantly reducing parameters by utilizing lattice filters based on a butterfly structure and a backward fusion strategy. ShuffleMixer [52] explored a method to reduce parameters and computational load by employing large convolution kernels, channel splitting, and shuffling techniques, achieving higher efficiency while maintaining performance. Due to the significant similarity between feature maps of multiple channels in the same CNN layer, Feature-Refined Networks (FRNs) [43] designed shadow modules to generate such similar feature maps, thereby reducing model complexity. Zhang et al. [53] proposed the Super Token Interaction Network (SPIN), which clusters locally similar pixels using superpixels and facilitates local information interaction through intra-superpixel attention. Omni-SR [19] introduced a full self-attention paradigm and a full-scale aggregation scheme to address issues related to limited effective receptive fields due to one-dimensional self-attention modeling and homogeneous aggregation schemes. Huang et al. [49] proposed a Two-branch Adaptive Residual Network (TARN), which effectively utilizes residual features through a two-branch adaptive residual block (TARB) based on a lattice structure. Wang et al. [54] proposed an Omni-Stage Feature Fusion Network (OSFFNet), which effectively integrates features from different levels and fully utilizes their complementarity. With full consideration of structural priors, Wang et al. [55] proposed an Interactive Feature Inference Network (IFIN), which progressively extracts more specialized features to enhance the reconstruction of high-frequency details in images. The advancements in lightweight natural image SR have provided valuable insights for lightweight RSISR research, offering beneficial guidance for our work.

2.2. Lightweight Remote-Sensing Image SR

Inspired by the success of lightweight natural image SR, lightweight RSISR has received increasing attention. Lei et al. [21] first proposed LGCNet, which aims to enhance super-resolution performance by combining local and global contrast features. FeNet [46] achieved a good balance between computational cost and reconstruction accuracy by using Lightweight Lattice Blocks (LLB) as nonlinear extraction modules and utilizing a nested structure. CTN [48] reduced the network’s parameters by using lightweight convolutions instead of traditional 3 × 3 convolutions and generated SR images by alternating feature extraction and enhancement. Wang et al. [41] proposed an Attention-Based Multi-level Feature Fusion Network (AMFFN), which ensures efficient SR reconstruction through information distillation and attention-based multi-level feature fusion. Wu et al. [28] proposed a Saliency-Aware Dynamic Routing Network (SalDRN), which employs networks of different depths to handle various regions of RSIs to tackle SR challenges of varying difficulty. Distance Attention Residual Network (DARN) [56] utilizes Distance Attention Blocks (DABs) to efficiently leverage shallow features, effectively mitigating the loss of detailed features during the extraction process of deep CNNs. Gao et al. [57] proposed a Stepwise Fusion Mechanism (SFM) to integrate features retained after progressive distillation, effectively addressing the issue of insufficient information flow caused by channel separation during feature distillation. Wang et al. [27] proposed a Hybrid Attention-Based U-Shaped Network (HAUNet) to effectively explore multi-scale features and enhance global feature representation through hybrid convolution-based attention. To address the issue of losing low-weight background feature information, Wu et al. [58] employed a large-kernel attention mechanism and a multi-scale mechanism to generate background feature weights, thereby increasing attention to neglected information. Ye et al. [32] proposed a high-frequency and low-frequency separation reconstruction strategy, allowing the network to improve the reconstruction details of high-frequency components while maintaining lower model parameters. Unlike methods that focus on developing lightweight network architectures or modules, our approach emphasizes designing more efficient and lightweight feature fusion enhancement structures and focuses on alleviating the burden on convolution layers. Our strategy facilitates comprehensive feature representation using fewer parameters and Multiply-Add Operations (Multi-Adds).

3. Methods

3.1. Network Architecture

In this paper, we propose an innovative network called DAFEN specifically for lightweight RSISR. The network architecture is illustrated in Figure 2, consisting of three integrated components: shallow feature extraction $H_{fe}$, feature refinement and fusion $H_{rf}$, and reconstruction $H_{re}^{s}$. Assume $I_{LR} \in \mathbb{R}^{3 \times H \times W}$ and $I_{SR} \in \mathbb{R}^{3 \times sH \times sW}$ are the input and output of DAFEN, where $H \times W$ represents the spatial dimensions and $s$ is the upscaling factor. The process of the three integrated components can be expressed as follows:
$$F_0 = H_{fe}(I_{LR})$$
$$F_d = H_{rf}(F_0)$$
$$I_{SR} = H_{re}^{s}(F_d) + H_{bicubic}^{s}(I_{LR})$$
where $F_0$ and $F_d$ denote the feature representations of $H_{fe}$ and $H_{rf}$. We utilize two 3 × 3 convolutions to implement $H_{fe}$, which are used to extract the initial representations of the $I_{LR}$ content. Similarly, $H_{re}^{s}$ is implemented through two 3 × 3 convolutions combined with a pixel shuffle layer. It is noteworthy that $H_{bicubic}^{s}$ represents the bicubic interpolation function with an upscaling factor of $s$. This function effectively conveys substantial information, compensating for the significant details of low-level features.
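To make this pipeline concrete, the following is a minimal PyTorch sketch of the top-level forward pass (PyTorch is the framework reported in Section 4.2). The refinement body is injected as a generic module, the channel width of 48 follows the configuration reported later, and the exact layer arrangement inside the reconstruction head is an assumption based on the description above rather than the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DAFENSkeleton(nn.Module):
    # Top-level pipeline: H_fe (two 3x3 convs), H_rf (injected body: four DAFEBs + FFEM),
    # H_re (3x3 convs + pixel shuffle), plus the bicubic skip connection.
    def __init__(self, body: nn.Module, channels: int = 48, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.head = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.body = body
        self.tail = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x_lr):
        f0 = self.head(x_lr)                                   # F_0 = H_fe(I_LR)
        fd = self.body(f0)                                     # F_d = H_rf(F_0)
        up = F.interpolate(x_lr, scale_factor=self.scale,
                           mode='bicubic', align_corners=False)
        return self.tail(fd) + up                              # I_SR = H_re(F_d) + bicubic(I_LR)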
To develop a lightweight network for RSISR, we meticulously design the feature refinement and fusion module $H_{rf}$. This section consists of four Dual Attention Fusion Enhancement Blocks (DAFEBs) and one Forward Fusion Enhancement Module (FFEM). After obtaining the coarse features $F_0$ through shallow feature extraction $H_{fe}$, the four DAFEBs sequentially extract intermediate features. Here, we have
$$F_i = H_{DAFEB}^{i}(F_{i-1}), \quad i = 1, 2, 3, 4$$
where $H_{DAFEB}^{i}(\cdot)$ denotes the $i$-th DAFEB function and $F_i$ represents the intermediate features extracted by the $i$-th DAFEB. In order to more efficiently integrate these intermediate features containing multi-level information, we feed $F_i$ ($i = 1, 2, 3, 4$) into the FFEM for fusion and enhancement operations, as follows:
$$T_d = H_{FFEM}(F_1, F_2, F_3, F_4)$$
where $T_d$ represents the features that have been integrated and enhanced by the FFEM, and $H_{FFEM}(\cdot)$ denotes the FFEM function.
We employ the $L_1$ loss function to train the aforementioned network. $\{I_{LR}^{i}, I_{HR}^{i}\}_{i=1}^{N}$ is used to denote the training set composed of $N$ pairs of low-resolution (LR) and high-resolution (HR) images. The loss is defined as follows:
$$L_1(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| I_{SR}^{i} - I_{HR}^{i} \right\|_1$$
where $\theta$ denotes the parameter set of our proposed DAFEN. $I_{SR}^{i}$ and $I_{HR}^{i}$ represent the $i$-th SR image reconstructed by DAFEN and the corresponding HR image, respectively.

3.2. Dual Attention Fusion Enhancement Block (DAFEB)

In order to better balance the reconstruction performance and model complexity, we design DAFEB with a similar architecture to the DAFEN. As shown in Figure 3, we progressively refine the complex features through three stacked CSLBs. The features refined by the shallow CSLBs are sent to both the deeper CSLBs and directly to the FFEM for fusion and enhancement. This can be represented as follows:
$$P_i = H_{CSLB}^{i}(P_{i-1}), \quad i = 1, 2, 3$$
$$P_{fuse} = H_{FFEM}(P_1, P_2, P_3)$$
where $H_{CSLB}^{i}(\cdot)$ denotes the CSLB function at the $i$-th layer, $H_{FFEM}(\cdot)$ represents the FFEM function, and $P_i$ denotes the features obtained after the $i$-th CSLB layer. Subsequently, we perform fusion and enhancement on the extracted multi-level features to obtain $P_{fuse}$. Finally, we introduce the CEM [48] to amplify spatial details within the fused features. It is noteworthy that the CEM adds an extremely small number of parameters, so the slight increase in model size is well worth the resulting performance gain.
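A structural sketch of the DAFEB just described is given below. The CSLB, FFEM, and CEM [48] are treated as injected modules, and any skip connections shown in Figure 3 but not stated in the text are omitted here; the sketch only captures how the three CSLB outputs feed the FFEM before the CEM.

import torch.nn as nn

class DAFEB(nn.Module):
    # Three stacked CSLBs whose intermediate outputs all feed the FFEM, followed by the CEM.
    def __init__(self, cslb_factory, ffem: nn.Module, cem: nn.Module):
        super().__init__()
        self.cslbs = nn.ModuleList([cslb_factory() for _ in range(3)])
        self.ffem = ffem
        self.cem = cem

    def forward(self, p0):
        feats, p = [], p0
        for cslb in self.cslbs:          # P_i = CSLB_i(P_{i-1}), i = 1, 2, 3
            p = cslb(p)
            feats.append(p)
        p_fuse = self.ffem(feats)        # P_fuse = FFEM(P_1, P_2, P_3)
        return self.cem(p_fuse)          # CEM amplifies spatial details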

3.3. Forward Fusion Enhancement Module (FFEM)

Due to the complexity of remote-sensing images [47], efficiently utilizing hierarchical information is crucial. We design FFEM to achieve more suitable multi-level feature fusion and enhancement. Specifically, FFEM includes a feature fusion phase and a feature enhancement phase. In the feature fusion phase, we design a forward fusion structure, as shown in Figure 4a. Once the multi-level features are fed into the FFEM, each level’s features first undergo a 1 × 1 convolution to halve the number of channels, followed by activation through the ReLU function (ReLU operations are omitted in Figure 2 and Figure 3), and finally are concatenated in a forward sequential manner. This can be expressed as follows:
$$T_i = \begin{cases} \mathrm{ReLU}(\mathrm{Conv}(F_i)), & i = 1 \\ \mathrm{Conv}\left(\mathrm{Concat}\left(T_{i-1}, \mathrm{ReLU}(\mathrm{Conv}(F_i))\right)\right), & i = 2, \ldots, n-1 \\ \mathrm{Concat}\left(T_{i-1}, \mathrm{ReLU}(\mathrm{Conv}(F_i))\right), & i = n \end{cases}$$
where $\mathrm{Concat}(\cdot)$ and $\mathrm{Conv}(\cdot)$ denote the concatenation operation along the channel dimension and the 1 × 1 convolution, respectively. In the task of super-resolution for structurally complex remote-sensing images, it is crucial to efficiently capture more abstract and semantically rich global information. Our forward fusion structure allocates more channels to deeper features, thereby reducing the inevitable loss of high-level feature information during dimensionality reduction. By leveraging high-level features more effectively, we can mitigate the difficulty of global information capture caused by the local receptive field of CNNs. Utilizing the rich semantic information and contextual relationships contained in high-level features allows us to better handle the reconstruction of complex structures and global information in RSIs. In addition, our strategy also integrates low-level features, enriching edge and texture details, and thus achieves a better balance between global information and detailed information.
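As a reference, the fusion stage can be sketched as follows in PyTorch. The 1 × 1 reductions halve each level's channels as described; the channel budget of the intermediate squeeze convolutions (back to C/2) is an assumption chosen so that the running concatenation never grows beyond C channels. With n_levels = 3 this matches the DAFEB-level fusion, and with n_levels = 4 the DAFEN-level fusion.

import torch
import torch.nn as nn

class ForwardFusion(nn.Module):
    # Forward fusion stage of FFEM: shallower features are squeezed more often,
    # so deeper features keep more channels.
    def __init__(self, channels: int, n_levels: int = 4):
        super().__init__()
        half = channels // 2
        # 1x1 convolutions that halve each level's channels before fusion.
        self.reduce = nn.ModuleList([nn.Conv2d(channels, half, 1) for _ in range(n_levels)])
        # 1x1 convolutions that squeeze the running concatenation back to C/2 (assumed width).
        self.squeeze = nn.ModuleList([nn.Conv2d(channels, half, 1) for _ in range(n_levels - 2)])
        self.act = nn.ReLU(inplace=True)

    def forward(self, feats):
        # feats = [F_1, ..., F_n], ordered from shallow to deep.
        t = self.act(self.reduce[0](feats[0]))                                   # i = 1
        for i in range(1, len(feats) - 1):                                       # i = 2, ..., n-1
            t = self.squeeze[i - 1](torch.cat([t, self.act(self.reduce[i](feats[i]))], dim=1))
        return torch.cat([t, self.act(self.reduce[-1](feats[-1]))], dim=1)       # i = n; C channels out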
In the feature enhancement phase, to balance both the performance and lightweight nature of the model, we design an efficient dual-branch feature enhancement structure. By feeding the fused features into the two modules, CCA and SCGC, we achieve rescaling and refinement processes. The structure of RSI is complex, and it is inaccurate to obtain the channel descriptors solely through Global Average Pooling (GAP). Since variance can reflect the richness of information in feature maps [45], we introduce CCA [15] to highlight the most informative features. This can be represented as follows:
$$x_{out} = H_{CCA}(x)$$
where $H_{CCA}(\cdot)$ denotes the CCA operation. Here, we briefly describe the computation of CCA. As shown in Figure 4b, given a set of feature maps, we compute the sum of their standard deviations and means. This summation is then passed through a sequence of nonlinear functions: Conv1 → ReLU → Conv1. Subsequently, the sigmoid function is utilized to generate a set of combination coefficients ranging from 0 to 1. Finally, we multiply the input feature maps by these combination coefficients to obtain the output feature maps.
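A minimal sketch of this CCA computation is shown below; the reduction ratio of 8 inside the Conv1 → ReLU → Conv1 bottleneck is an assumption, since the paper does not state it for CCA.

import torch
import torch.nn as nn

class CCA(nn.Module):
    # Contrast-aware Channel Attention: per-channel (std + mean) descriptor,
    # Conv1 -> ReLU -> Conv1 bottleneck, sigmoid gating.
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True)
        return x * self.body(std + mean)     # rescale the input by the combination coefficients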
For SCGC (as shown in Figure 4c), the features are first down-sampled by the AdaptiveMaxPool2D function with a kernel size of 7 and a stride of 3 in order to obtain a large receptive field. Then, after tuning by group convolution and channel shuffle operations, the features are upsampled back to the input size using bilinear interpolation. Finally, the spatial statistics are obtained by applying the sigmoid function after a residual connection with the input features. This process can be represented as follows:
$$\omega = \delta\left(H^{\uparrow}\left(H_{shuffle}\left(H_{GConv}\left(H^{\downarrow}(x_1)\right)\right)\right) + x_1\right)$$
where $\omega$ denotes the attention matrix for scaling and refining the input features $x_1$, $\delta$ denotes the sigmoid activation function, $H^{\uparrow}$ represents the up-sampling operation via bilinear interpolation, $H_{GConv}$ and $H_{shuffle}$ denote the group convolution operation and the channel shuffle operation, and $H^{\downarrow}$ denotes the down-sampling operation using the AdaptiveMaxPool2D function. After obtaining the attention matrix $\omega$, we multiply $\omega$ by the features refined through two GRSBs for self-calibration. Finally, we add group convolution and channel shuffle operations to further deepen the features, thereby enhancing performance. This process can be represented as follows:
$$x_{out} = H_{shuffle}\left(H_{GConv}\left(\omega \times H_{GRSB}\left(H_{GRSB}(x_1)\right)\right)\right)$$
where $x_{out}$ denotes the output feature and $H_{GRSB}(\cdot)$ denotes the GRSB operation.
Here, we provide a detailed description of the GRSB used in SCGC and CSLB, as shown in Figure 4d. We observe that 3 × 3 convolutions account for a significant proportion of network parameters in most lightweight super-resolution networks. This observation prompted us to consider reducing the weight of the super-resolution network by replacing 3 × 3 convolutions with lightweight convolutions while maintaining performance. The group convolution and channel shuffle strategy of ShuffleNet [59] effectively reduces model complexity while preserving high feature extraction capability, making it suitable for deployment on resource-constrained devices. Through our experimental comparison of numerous excellent convolution modules [60,61,62], we verify that the combination of group convolution and channel shuffle strategy offers optimal performance for this task. By drawing inspiration from ShuffleNet, we create GRSB as a lightweight alternative to conventional convolution. Group convolution divides the original 3 × 3 convolution operation into several groups, each containing a portion of the input and output channels, and performs convolution operations independently on each group. The results of all groups are then merged to obtain the final output feature map. Following the residual connection, we select LeakyReLU as the activation function and employ a channel shuffle operation at the end. Benefiting from the channel shuffle operation, GRSB mitigates the inter-group information isolation caused by grouping and facilitates information flow between different groups. This can be described as follows:
$$x_d = H_{shuffle}\left(H_{LReLU}\left(x + H_{GConv}(x)\right)\right)$$
where $x$ denotes the input features, $x_d$ denotes the output features, and $H_{LReLU}(\cdot)$ denotes the LeakyReLU operation.
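The following sketch implements the GRSB just described together with the SCGC of the preceding paragraphs, which builds on it. The group count of 4 follows the ablation in Section 4.8.5, the LeakyReLU slope is an assumption, and because PyTorch's adaptive max pooling takes an output size rather than a kernel and stride, the "kernel 7, stride 3" pooling is realized here with nn.MaxPool2d.

import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave channels so that information flows across convolution groups.
    b, c, h, w = x.size()
    return (x.view(b, groups, c // groups, h, w)
              .transpose(1, 2).contiguous()
              .view(b, c, h, w))

class GRSB(nn.Module):
    # Group Residual Shuffle Block: x_d = shuffle(LReLU(x + GConv(x))).
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.gconv = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.act = nn.LeakyReLU(0.2, inplace=True)   # negative slope is an assumption

    def forward(self, x):
        return channel_shuffle(self.act(x + self.gconv(x)), self.groups)

class SCGC(nn.Module):
    # Self-Calibrated Group Convolution: a pooled group-conv branch builds the attention
    # map omega, which rescales the GRSB-refined features.
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.pool = nn.MaxPool2d(kernel_size=7, stride=3)   # "kernel 7, stride 3" pooling
        self.gconv_down = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)
        self.refine = nn.Sequential(GRSB(channels, groups), GRSB(channels, groups))
        self.gconv_out = nn.Conv2d(channels, channels, 3, padding=1, groups=groups)

    def forward(self, x):
        h, w = x.shape[-2:]
        # omega = sigmoid( up( shuffle( GConv( down(x) ) ) ) + x )
        y = channel_shuffle(self.gconv_down(self.pool(x)), self.groups)
        y = F.interpolate(y, size=(h, w), mode='bilinear', align_corners=False)
        omega = torch.sigmoid(y + x)
        # x_out = shuffle( GConv( omega * GRSB(GRSB(x)) ) )
        return channel_shuffle(self.gconv_out(omega * self.refine(x)), self.groups)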
After the feature enhancement through the upper and lower branches, the two sets of output features undergo a 1 × 1 convolution for fusion. The FFEM is applied at two levels. At the first level, we integrate features from the different CSLBs within each DAFEB, allowing for continuous refinement and expansion of these features. At the second level, we merge features from the different DAFEBs. Through FFEM, the interaction between feature refinement and fusion is conducted iteratively, enabling the extraction of more significant contextual information.

3.4. Channel-Spatial Lattice Block (CSLB)

Lattice structures [16,46,49] enable high-speed parallel processing, making them highly suitable for lightweight models that require fast execution speeds. To achieve more efficient feature extraction, we design the CSLB, which incorporates a dual-branch architecture inspired by lattice structures. As shown in Figure 5a, the input features $x \in \mathbb{R}^{C \times H \times W}$ are first split into two equal parts along the channel dimension: $P_{i-1}(x) \in \mathbb{R}^{C/2 \times H \times W}$ and $Q_{i-1}(x) \in \mathbb{R}^{C/2 \times H \times W}$. This can be expressed as follows:
$$P_{i-1}(x),\ Q_{i-1}(x) = \mathrm{Split}(x)$$
where $x$ represents the input to the CSLB, and $P_{i-1}(x)$ and $Q_{i-1}(x)$ denote the inputs of the upper and lower branches, respectively. This design allows each branch to process only half of the input signal, enabling faster parallel processing and reduced complexity. Specifically, our CSLB is divided into two operational stages.
In the first stage, $Q_{i-1}(x)$ is fed into three GRSBs for stepwise feature extraction and refinement. Considering the feature misalignment between the two branches, simple operations such as addition, multiplication, or concatenation are insufficient. Therefore, we design the CSAIM to calculate attention weights for the two branches from both the spatial and channel dimensions; it consists of two components: Spatial Attention Interaction (SAI) and Channel Attention Interaction (CAI). These weights are then used to perform a 1 × 1 convolution fusion with the other branch to obtain $P_{i-2}(x)$ and $Q_{i-2}(x)$. This process can be expressed as follows:
$$P_{i-2}(x) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(P_{i-1}(x),\ F_{CI}\left(f\left(Q_{i-1}(x)\right)\right)\right)\right)$$
$$Q_{i-2}(x) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(F_{SI}\left(P_{i-1}(x)\right),\ f\left(Q_{i-1}(x)\right)\right)\right)$$
where $f(\cdot)$ denotes the three stacked GRSB operations, and $F_{CI}(\cdot)$ and $F_{SI}(\cdot)$ denote the channel attention and spatial attention operations, respectively.
In the second stage, $P_{i-2}(x)$ is fed into three GRSBs for stepwise feature extraction and refinement. Similar to the first stage, we perform weighted cross-combination from the spatial and channel dimensions for the two branches. This process can be expressed as follows:
$$P_{i-3}(x) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(f\left(P_{i-2}(x)\right),\ F_{CI}\left(Q_{i-2}(x)\right)\right)\right)$$
$$Q_{i-3}(x) = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(F_{SI}\left(f\left(P_{i-2}(x)\right)\right),\ Q_{i-2}(x)\right)\right)$$
where $P_{i-3}(x)$ and $Q_{i-3}(x)$ denote the upper and lower branch features obtained from the second stage, respectively. Subsequently, we fuse $P_{i-3}(x)$ and $Q_{i-3}(x)$ using a 1 × 1 convolution.
Here, we provide a detailed description of the calculation of weight coefficients. For the SAI, as shown in Figure 5b, given a set of feature maps, they are directly passed through a nonlinear function, Conv1 → ReLU → Conv1, where the reduction ratios for pointwise convolution are 8 and C/8, with C representing the number of channels in the feature maps. Finally, a sigmoid function is used to generate combination coefficients, which range from 0 to 1. For the CAI, as shown in Figure 5c, given a set of feature maps, we first obtain the average value of each feature map through GAP. The resulting feature vectors are then passed through a nonlinear function, Conv1 → ReLU → Conv1, where the reduction and expansion ratios for pointwise convolution are set to 8. Again, a sigmoid function is used to generate combination coefficients ranging from 0 to 1. The CSAIM is an efficient module designed to enhance the flow of feature signals, with minimal network complexity introduced by the learning of weight coefficients. By employing the CSAIM along the spatial and channel dimensions, the upper and lower branches effectively capture feature signals with varying levels of attention, resulting in more diverse and enriched information integration.
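Putting the pieces together, the following sketch implements SAI, CAI, and the two-stage CSLB forward pass; it reuses the GRSB class from the earlier sketch. Two details are assumptions made to keep tensor shapes consistent with the equations above: SAI and CAI return the re-weighted features rather than bare coefficients, and each 1 × 1 fusion convolution maps the concatenated C channels back to C/2 before the final fusion.

import torch
import torch.nn as nn

class SAI(nn.Module):
    # Spatial Attention Interaction: Conv1 -> ReLU -> Conv1 (C -> C/8 -> 1), then sigmoid.
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.body(x)          # re-weight with a one-channel spatial map

class CAI(nn.Module):
    # Channel Attention Interaction: GAP -> Conv1 -> ReLU -> Conv1 (reduce/expand by 8), then sigmoid.
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.body(x)          # re-weight with a per-channel vector

class CSLB(nn.Module):
    # Channel-Spatial Lattice Block: channel split, two interaction stages, final 1x1 fusion.
    def __init__(self, channels: int, groups: int = 4):
        super().__init__()
        half = channels // 2
        self.f1 = nn.Sequential(*[GRSB(half, groups) for _ in range(3)])
        self.f2 = nn.Sequential(*[GRSB(half, groups) for _ in range(3)])
        self.cai1, self.sai1 = CAI(half), SAI(half)
        self.cai2, self.sai2 = CAI(half), SAI(half)
        self.fuse = nn.ModuleList([nn.Conv2d(channels, half, 1) for _ in range(4)])
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        p, q = torch.chunk(x, 2, dim=1)                          # P_{i-1}, Q_{i-1}
        fq = self.f1(q)                                          # f(Q_{i-1}): three GRSBs
        p2 = self.fuse[0](torch.cat([p, self.cai1(fq)], 1))      # P_{i-2}
        q2 = self.fuse[1](torch.cat([self.sai1(p), fq], 1))      # Q_{i-2}
        fp = self.f2(p2)                                         # f(P_{i-2}): three GRSBs
        p3 = self.fuse[2](torch.cat([fp, self.cai2(q2)], 1))     # P_{i-3}
        q3 = self.fuse[3](torch.cat([self.sai2(fp), q2], 1))     # Q_{i-3}
        return self.out(torch.cat([p3, q3], 1))                  # final 1x1 fusion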

4. Results

4.1. Datasets

Based on previous work [22,56,57,58,63], the widely used SR dataset DIV2K [64] is selected as our training dataset. The DIV2K dataset contains 800 high-quality RGB training images and 100 validation images. For testing, we test the reconstruction performance of the model using two remote-sensing datasets proposed by FeNet [46], RS-T1 and RS-T2. RS-T1 and RS-T2 are remote-sensing datasets used for land use studies collected from the UC Merced [65] dataset, and both contain 120 images covering 21 complex ground-truth remote-sensing scenes. To further test the robustness of the model comprehensively, we use four SR benchmark datasets: Set5 [66], Set14 [67], BSD100 [68], and Urban100 [69].

4.2. Implementation Details

To obtain low-resolution (LR) training images, we use bicubic interpolation with scaling factors of ×2, ×3, and ×4 to downscale the high-resolution (HR) images. To enhance the diversity of the training set, data augmentation techniques such as horizontal flipping and random 90° rotations are applied. During the training phase, DAFEN employs a batch size of 16, and HR image patches of size 192 × 192 are randomly cropped from the HR images. We posit that the selected batch size and patch size strike an optimal balance between training speed and gradient stability, as well as between the preservation of local details and the incorporation of global information. To ensure numerical stability, the pixel range of the input images is scaled to [0, 1]. Our network is optimized using the ADAM [70] optimizer, with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rate is set to $5 \times 10^{-4}$ and halved every 200 epochs over a total of 1000 epochs. This learning rate scheduling strategy aims to balance convergence speed and stability, ensuring that the model avoids local optima while maintaining steady performance improvements. All experiments are implemented using the PyTorch (version 1.10.0) framework and evaluated on an NVIDIA GeForce RTX 3090 GPU (manufactured by NVIDIA Corporation, Santa Clara, CA, USA). In this paper, our network consists of four three-layer DAFEBs. The selection of the number of feature channels follows [46], where DAFEN employs 48 feature channels, while its lightweight variant, DAFEN-S, utilizes 32 feature channels. These configurations offer a well-balanced trade-off between model performance and computational cost. Following [46,63,71,72], SR results are evaluated only on the Y channel of the transformed YCbCr space, using the average Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). Additionally, we assess the network's complexity using model parameters and Multi-Adds. Similar to [16,41,46], we assume that the size of the query image (HR image) is 1280 × 720 for calculating Multi-Adds. Table 1 presents the implementation details and hyperparameter settings of our methods and all the comparative lightweight methods discussed in this paper. Specifically, we elaborate on the common parameters employed in the implementation of our methods and the comparative lightweight methods, including the optimizer for the generator (Optim_g), beta parameters for the Adam optimizer (betas), learning rate (lr), gamma parameter for the learning rate scheduler (gamma), loss function type (Loss type), batch size, patch size, use of horizontal flipping (Use_hflip), and use of rotation (Use_rot). To validate our insights into the optimization process and demonstrate the rationality of our hyperparameter selection, we design a hyperparameter analysis experiment, as detailed in Table 2. Specifically, we conduct experiments with our DAFEN at a scaling factor of ×4, employing various loss functions, batch sizes, patch sizes, and data augmentation strategies as experimental variants. The results demonstrate that our DAFEN achieves the best performance while also maintaining a balance with model complexity.
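For reference, the optimizer and schedule described above translate into the PyTorch setup sketched below. The placeholder model and the loop skeleton are illustrative only, and the scheduler class is an assumption; any scheduler that halves the learning rate every 200 epochs is equivalent.

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

# A tiny placeholder stands in for DAFEN so the sketch stays self-contained.
model = nn.Conv2d(3, 3, 3, padding=1)

criterion = nn.L1Loss()                                    # L1 loss from Section 3.1
optimizer = optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
scheduler = StepLR(optimizer, step_size=200, gamma=0.5)    # halve the learning rate every 200 epochs

for epoch in range(1000):
    # ... iterate over batches of 16 randomly cropped 192 x 192 HR patches
    #     and their bicubic LR counterparts, minimizing criterion(model_output, hr) ...
    scheduler.step()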

4.3. Results on Remote-Sensing Datasets

To validate the effectiveness of the proposed method on remote-sensing datasets, DAFEN and DAFEN-S are compared with existing lightweight models, including LGCNet [21], IDN [51], LESRCNN [73], CTN [48], FeNet [46], FDENet [57], DARN-S [56], AMFFN [41], IFIN-S [55], and BMFENet [58]. These models have been published in high-quality journals or conferences. All models compared in the paper are tested directly on remote-sensing data using pre-trained models provided by the respective researchers. Additionally, the training set for all comparison models is DIV2K, ensuring fairness in the comparison results. Table 3 shows the quantitative comparison results of different methods under all scaling factors. It can be seen that DAFEN outperforms existing methods or achieves comparable performance. At a scaling factor of ×4, IFIN-S, with approximately 470 K parameters, provides better results on the two RSI datasets. In contrast, DAFEN achieves a similar performance level with around 431 K parameters and lower computational requirements. Additionally, DAFEN's performance is comparable to BMFENet, which has a larger number of parameters and computational load. Moreover, our DAFEN-S has a similar number of parameters to FeNet-baseline and LGCNet, but DAFEN-S performs much better. Furthermore, with only 188 K parameters, our DAFEN-S outperforms FeNet (351 K parameters) in most test experiments. This is mainly due to the clever design of DAFEN, which effectively enriches, fuses, and enhances features while reducing model redundancy.
We also present the visual results of several SR methods for ×4 SR on UC-Merced in Figure 6. It is clear that our method demonstrates better recovery effects on object contours and detailed textures compared with other methods. Specifically, in the scenes with strong global consistency like ‘agricultural21’, only DAFEN adequately recovers texture information. Similarly, in scenes with scale variations, such as ‘airplane67’, DAFEN achieves more precise edge details. Notably, for the parking lot in ‘mobilehomepark92’ and the tennis court boundaries in ‘tenniscourt93’, other methods exhibit varying degrees of distortion. For example, in ‘mobilehomepark92’, other methods reconstruct blurred parking lot lines, while our DAFEN reconstructs more accurate parking lot lines. Additionally, in ‘tenniscourt93’, the reconstruction results of methods like IDN exhibit blurred tennis court boundaries, while methods such as AMFFN and CTN exhibit boundary distortion. In contrast, our DAFEN achieves results that are closer to the ground truth. However, the reconstruction results of our DAFEN-S exhibit some incorrect information, such as blurred parking lot lines in ‘mobilehomepark92’. This is because DAFEN-S has fewer parameters compared with DAFEN, resulting in weaker anti-interference ability. Overall, the proposed DAFEN demonstrates superior visual performance compared with other methods.

4.4. Results on SR Benchmark Datasets

To further validate the generalization performance of our model, we compare it with excellent SR methods on natural image datasets, including LGCNet [21], IDN [51], MADNet [74], FeNet [46], FDENet [57], DARN-S [56], IFIN-S [55], BMFENet [58], and TARN [49]. Similar to [46], we utilize four benchmark datasets: Set5 [66], Set14 [67], BSD100 [68], and Urban100 [69], which cover urban buildings, ecological environments, flora, and fauna. The models used for testing are all pre-trained models provided by the respective researchers. The test results are shown in Table 4. Our findings indicate that in terms of reconstruction accuracy for ×2, ×3, and ×4 SR tasks, DAFEN performs better or comparably to other lightweight SR networks on most test datasets. Despite IFIN-S achieving optimal results in some cases, its parameter count is higher than DAFEN. Additionally, due to the higher computational complexity of the transformer architecture used by IFIN-S, its computational load is about 1.5 times that of DAFEN. Compared with the TARN method, which has a significantly higher parameter count than DAFEN, our method achieves comparable performance. Moreover, our DAFEN-S greatly outperforms the FeNet baseline and LGCNet, which have a similar number of parameters. DAFEN-S also performs similarly to the FeNet method, while its parameter count and Multi-Adds are nearly halved. The results on natural image test sets demonstrate that DAFEN has strong generalization capabilities, confirming its exceptional effectiveness.
To assess perceptual quality, we present three reconstruction results of some models on the BSD100 and Urban100 test sets in Figure 7. It is clear that DAFEN achieves the best visual experience in terms of overall image patch clarity and detail line textures. Specifically, in ‘BSD100: 182053’, only DAFEN successfully recovers the accurate edges of the bridge arch. Notably, both our DAFEN and DAFEN-S achieve correct results in ‘Urban100: img073’. In contrast, the reconstruction results of methods such as SRCNN exhibit distorted lines, while those of methods like FeNet exhibit upward-curving lines. Furthermore, for continuous and densely structured block-like scenes, such as ‘Urban100: img019’, the reconstruction results from other methods exhibit varying degrees of blurriness in the upper-left regions of the image patches, while our method achieves more accurate reconstruction results. The qualitative and quantitative analyses above suggest that our proposed DAFEN and DAFEN-S are competitive in both natural and remote-sensing image SR tasks.

4.5. Comparison Results with Non-Lightweight State-of-the-Art Methods

To further validate the computational efficiency of the proposed methods, we conduct a comprehensive complexity comparison between DAFEN/DAFEN-S and several state-of-the-art non-lightweight super-resolution methods, including RCAN [14], SwinIR [36], and HAT [18]. The complexity evaluation is primarily conducted from three dimensions: model parameters (Params), multiply-add operations (Multi-Adds), and inference times (Times), with specific results shown in Table 5. In terms of model parameters, DAFEN and DAFEN-S require only 0.422 M and 0.192 M parameters, respectively, which are significantly lower than those of RCAN (15.67 M), SwinIR (11.55 M), and HAT (20.53 M). This indicates that the DAFEN series models significantly reduce the storage requirements while maintaining high performance, making them more suitable for deployment on resource-constrained devices. In terms of computational complexity, the Multi-Adds of DAFEN and DAFEN-S are 0.038 T and 0.017 T, respectively, which are about 1–2 orders of magnitude lower than those of RCAN (1.492 T), SwinIR (2.883 T), and HAT (3.871 T). The low computational complexity not only reduces energy consumption but also significantly improves the inference speed of the models. Experimental results show that the inference times of DAFEN and DAFEN-S are 0.017 s and 0.012 s, respectively, while those of RCAN, SwinIR, and HAT are 0.12 s, 0.23 s, and 0.32 s, respectively. This demonstrates that the DAFEN series models have significant advantages in scenarios with high real-time requirements. In summary, DAFEN and DAFEN-S excel in model complexity, computational efficiency, and inference speed, not only significantly outperforming other lightweight methods but also demonstrating their unique performance-efficiency trade-off advantages when compared with non-lightweight state-of-the-art methods. These results further validate the potential of the DAFEN series models in practical applications, especially in resource-constrained environments.

4.6. Results of Real Remote-Sensing Images

To further demonstrate the stability of the model, the reconstruction results of two real remote-sensing satellite images are shown in Figure 8. We use four different methods to upscale real satellite remote-sensing images by a factor of four for visual perception comparison. It can be observed that our model provides better visual perception in terms of both overall texture and detailed textures compared with other methods. This validates that our approach performs well on images captured by real remote-sensing satellites.

4.7. Network Complexity and Inference Speed

In addition to evaluating complexity, faster inference speed has also become a crucial metric for assessing whether these lightweight RSISR methods are suitable for real-time applications on edge devices with limited computational resources. To this end, we conduct timing tests on several representative RSISR methods using the same device equipped with an NVIDIA GeForce RTX 3090 GPU (manufactured by NVIDIA Corporation, Santa Clara, CA, USA) on the RS-T1 test set, as shown in Table 6. It can be found that DAFEN-S achieves optimal results in terms of parameter count, computational complexity, and inference time, while DAFEN achieves optimal performance and second-best inference speed. Notably, the parameter count of DAFEN-S is about half that of FeNet, yet its PSNR and SSIM values improved by approximately 0.04 dB and 0.0027, respectively. Additionally, the parameter count of DARN is about 1.5 times that of DAFEN, yet DAFEN still leads in PSNR and SSIM values by about 0.05 dB and 0.0057. Furthermore, our method, based on a CNN architecture, offers inference speeds over eight times faster than IFIN-S, which is based on a transformer architecture. This faster inference speed makes DAFEN more suitable for real-time applications on edge devices with limited computational resources. Overall, comparisons with existing state-of-the-art methods demonstrate the efficiency and lightweight nature of the proposed method, as well as its high adaptability and flexibility for deployment on remote-sensing devices.
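For reproducibility, per-image GPU inference times of this kind are typically measured as sketched below; the function name, warm-up count, and single-image batching are illustrative rather than the authors' exact protocol.

import time
import torch

@torch.no_grad()
def average_inference_time(model, lr_images, device="cuda", warmup=5):
    # Average per-image forward time on the GPU; `lr_images` is a list of LR input tensors.
    model = model.to(device).eval()
    lr_images = [img.to(device) for img in lr_images]
    for img in lr_images[:warmup]:          # warm-up runs to stabilize CUDA kernels
        model(img)
    torch.cuda.synchronize()
    start = time.time()
    for img in lr_images:
        model(img)
    torch.cuda.synchronize()
    return (time.time() - start) / len(lr_images)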

4.8. Ablation Study

4.8.1. Effects of the Key Modules in DAFEN

To evaluate the contribution of key modules in DAFEN to the overall performance, we conduct ablation experiments on DAFEB, CSLB, and FFEM, with the specific results shown in Table 7. Firstly, we remove DAFEB and replace it with the Residual Group (RG) from RCAN [14]. To maintain a similar model structure, we set the number of Residual Channel Attention Blocks (RCAB) in RG to 3. By comparing the first and fourth rows in the table, it can be observed that after replacing DAFEB, the model’s parameter count and computational load increased to 1.6 times and 1.8 times that of DAFEN, respectively, but the performance remains close to that of DAFEN. Secondly, we remove CSLB and replace it with a residual block. The specific implementation details are as follows: the input feature information is first processed through Conv3 → ReLU → Conv3, then the processed features are connected with the input features via residual connection, and finally processed through ReLU again. From the comparison between the second and fourth rows, it can be seen that the variant with the replaced CSLB has approximately twice the parameter count and computational complexity of DAFEN but shows similar reconstruction accuracy. However, the SSIM value on the RS-T1 dataset and the PSNR value on the RS-T2 dataset are slightly lower than those of DAFEN. Additionally, we remove FFEM and replace it with the attention-based multi-level feature fusion (AMFF) structure from AMFFN [41]. From the comparison between the third and fourth rows, it can be seen that the performance of all metrics significantly decreased, especially on the Urban100 dataset. Through these experiments, we systematically demonstrate the lightweight design and efficiency of DAFEB, CSLB, and FFEM, as well as their exceptional contributions to the performance of DAFEN.

4.8.2. Effects of the CSLB

To assess the importance of CSLB, we conduct an effectiveness analysis of GRSB and CSAIM, as shown in Table 8. Firstly, we replace the GRSB with 3 × 3 convolution → ReLU and use it as the feature extraction module for CSLB. Comparing the first and third rows in the table, although the 3 × 3 convolution improved accuracy, the model’s parameter count increased by more than twofold. Considering the lightweight design, our use of the GRSB significantly reduced the model’s parameters and computational load, making the slight loss in accuracy worthwhile. Secondly, we replace our CSAIM with the cross-fusion method from [46], which utilizes channel attention to compute weighting coefficients for the weighted cross-combination of features from the upper and lower branches. As seen from the second and third rows in the table, our CSAIM has fewer parameters and consistently performs better on RS-T1, BSD100, and Urban100, achieving better PSNR values on RS-T2. This indicates that CSAIM is more beneficial for enriching feature extraction and promoting information exchange between the two branches of the lattice structure.
To further demonstrate the effectiveness of CSLB, we conduct tests on the SR results of CSLB and its two variants at a scaling factor of ×4, and Figure 9 presents the visualization results. It can be observed that the SR results of CSLB achieve superior quality in overall imaging. To facilitate a more detailed observation of the SR results, we enlarge the image patches within the red box in the HR image. Through detailed comparison, it is found that the SR results of CSLB exhibit better performance in details such as the edges of vehicles, the edges and textures of sidewalks, and the textures of shrubs, compared with the other two methods. This fully proves the rationality of the design and the superiority of the performance of CSLB.

4.8.3. Effects of the FFEM

The forward fusion structure, SCGC, and CCA combine to form a powerful FFEM. To demonstrate the effectiveness of these key modules, we compare FFEM with its variants, as shown in Table 9. For the forward fusion structure, we compare it with the 1 × 1 convolution fusion method and the BFM [16], as seen in the first, second, and fifth rows of the table. It is clear that our forward fusion structure has fewer parameters than the 1 × 1 convolution method and leads in PSNR and SSIM values across all four test sets. Furthermore, compared with BFM, our forward fusion structure improves the PSNR by 0.03 dB, 0.01 dB, and 0.01 dB on RS-T1, RS-T2, and BSD100, respectively, and achieves better SSIM values on RS-T1, RS-T2, BSD100, and Urban100. To further demonstrate the effectiveness of the forward fusion structure, we compare the visualization results of 1 × 1 convolution fusion, BFM, and our FFEM using average feature maps, as shown in Figure 10. It can be observed that our FFEM achieves better edge and texture details, both in the local DAFEB and the global DAFEN. This fully proves the effectiveness of the forward fusion structure and supports the rationale of allocating more channels to higher-level features in multi-level feature fusion to retain more high-level feature information.
To demonstrate the effectiveness of the dual-branch structure, we conduct two experiments using CCA and SCGC individually, as shown in the third and fourth rows of Table 9. It can be seen that the combined version of CCA and SCGC achieves better performance compared with using a single module. Although using CCA or SCGC alone achieves lower parameter counts and computational costs, this comes at the expense of significantly reduced accuracy. To gain deeper insights into the performance enhancement achieved by FFEM, we employ a Local Attribution Map (LAM) [75]. As shown in Figure 11, the LAM results clearly demonstrate that the combined version of CCA and SCGC significantly enhances attention to important information. Moreover, the results of the combined version of CCA and SCGC achieve a higher Diffusion Index (DI), indicating that more input pixels are utilized, which leads to increases in PSNR and SSIM values. Therefore, by employing the dual-branch structure, the coverage of utilized pixels is expanded, further validating the effectiveness of the proposed FFEM.

4.8.4. Influence of the Lightweight Convolution in DAFEN

To explore the most suitable lightweight convolution for our method, we introduce several common lightweight convolutions [60,61,62] into the model for comparison with our method, as shown in Table 10. It can be seen that although our method has a larger number of parameters and computational cost, the reconstruction accuracy far exceeds that of the other methods. Therefore, DAFEN using group convolution exhibits better performance.

4.8.5. Influence of the Number of Groups in Group Convolution

We explore the optimal group number settings for group convolution in the model, as shown in Table 11. When the number of groups is set to two, the model achieves the best performance, but the parameter count and computational cost increase significantly. With eight groups, the model is the most lightweight, but the accuracy is lower. To achieve a better balance between performance and lightweight design, we ultimately chose to set the number of groups to four.

4.8.6. Influence of the Number of DAFEBs and CSLBs

To further optimize network parameters and performance, we investigate the impact of different numbers of DAFEB and CSLB on the model, as shown in Table 12. Specifically, n b represents the number of DAFEB, while n l denotes the number of CSLB. Initially, when the number of CSLBs is 1, the network’s performance is the worst. As the number of CSLBs increases, such as 1, 3, and 5, the model’s performance continuously improves, indicating that our DAFEB structure has the potential to achieve top-level performance when used in larger networks. Additionally, we compare different numbers of DAFEB, such as 3, 4, 5, and 6. With the increase in the number of DAFEB, SR performance, parameter count, and computational cost also increase accordingly. Considering our goal is to study lightweight RSISR, to achieve a more reasonable balance between performance and lightweight design, we set the number of DAFEB to 4 and the number of CSLB to 3.

5. Discussion

In this section, we discuss the advantages of the research and then examine the limitations associated with the proposed method.
Firstly, our approach is able to better balance performance and complexity, primarily due to two aspects. On the one hand, DAFEN efficiently addresses the unique complexities of RSI [47] through effective local and global processing, thereby achieving strong performance in RSISR tasks. As illustrated in Figure 12, the visualization of the average feature maps reveals the comprehensive approach of DAFEN. For the local aspect, effective feature extraction and progressive refinement are achieved through three CSLBs, preserving the detail fidelity of all targets, regardless of size. For the global aspect, finer texture features are transferred to deeper DAFEB layers for further reconstruction. This allows different DAFEBs to handle features of varying complexity, contributing to richer detail contours. Meanwhile, the FFEM ensures that the generated images are coherent on a macro level and precise on a micro level by efficiently integrating and enhancing multi-level features both locally and globally. This leads to final generated features with more accurate edges and texture information, which is beneficial for the precise reconstruction of remote-sensing images. On the other hand, DAFEN leverages lightweight convolution and dimensionality reduction, enabling our network to deliver superior performance with lower complexity. Group convolution reduces redundant operations while combining channel shuffle to maintain efficient feature extraction. Additionally, CSLB achieves extremely lightweight computation for both branches through channel split. Furthermore, the forward fusion structure of FFEM is also a highly lightweight design. Even compared with the simple 1 × 1 convolution fusion method, it effectively reduces the parameter count by approximately 12 K (as shown in Table 9).
Secondly, we will discuss the limitations of our method and future work. As can be seen from Table 3, our method achieves superior performance at lower scaling factors, especially at ×2, but the accuracy improvement of DAFEN is relatively modest at higher scaling factors, such as ×4. We attribute this to the fact that CNN-based RSISR methods are constrained by the local processing principle of convolutional kernels, which hinders direct interaction between distant pixels in the image, leading to insufficient extraction of feature information at high scaling factors. However, at low scaling factors, the original image details are better preserved, which does not significantly affect the feature extraction performance. Another reason is that the trade-off between low-level and high-level features in our forward fusion structure is not perfect. Although we have demonstrated the rationality and effectiveness of the forward fusion structure in our experiments, such a trade-off may require more precise adjustments for different scaling factors. In the future, we will focus on addressing these limitations. On the one hand, we will further investigate methods to enhance the feature representation capability of the model, such as the hybrid use of different convolutions, exploration of more suitable lightweight convolutions, and finer tuning of the forward fusion structure to achieve optimal performance across various scaling factors. On the other hand, we plan to integrate new technologies, such as the Diffusion Model and Mamba, with CNN to compensate for the shortcomings of a single CNN architecture. Additionally, the scope of the paper is limited to RSISR with bicubic downsampling and does not cover other areas such as blind RSISR, continuous RSISR, or hyperspectral image SR. We will explore how to further integrate our model with other techniques to expand its applicability and encompass more diverse RSISR challenges.

6. Conclusions

In this paper, we propose a lightweight Remote-Sensing Image Super-Resolution (RSISR) network named the Dual Attention Fusion Enhancement Network (DAFEN), designed for accurate RSISR under tight runtime and memory budgets. The model comes in two versions, a 416 K lightweight DAFEN and a 188 K ultra-lightweight DAFEN-S, to accommodate different task requirements. Specifically, we design an extremely lightweight lattice structure, the Channel-Spatial Lattice Block (CSLB), as the feature extraction module; it is composed of the Group Residual Shuffle Block (GRSB) and the Channel-Spatial Attention Interaction Module (CSAIM). GRSB combines group convolution with channel shuffle as the nonlinear extraction module of CSLB, effectively reducing redundant convolution computation. CSAIM performs a weighted cross-combination of CSLB's two branches in both the spatial and channel dimensions, facilitating information flow between the branches. Furthermore, we develop the Forward Fusion Enhancement Module (FFEM), which uses a forward fusion structure to retain more high-level feature information and efficiently acquire richer contextual features, and which enhances the fused features through Self-Calibrated Group Convolution (SCGC) and Contrast-aware Channel Attention (CCA). FFEM incrementally fuses and enhances multi-level features with both local and global strategies, ultimately forming a comprehensive fused feature representation. Finally, experimental results on two remote-sensing and four benchmark datasets demonstrate that our network achieves a better balance between performance and model complexity.
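As a reading aid, the snippet below sketches the forward sequential concatenation that underlies FFEM. It is a simplified illustration under stated assumptions: each fusion step is reduced to a plain 1 × 1 convolution, whereas the actual FFEM refines and rescales the fused features with SCGC and CCA, and the class and argument names are ours rather than those of the released code.

```python
import torch
import torch.nn as nn


class ForwardFusionSketch(nn.Module):
    """Forward sequential concatenation over block outputs (simplified FFEM skeleton)."""

    def __init__(self, channels: int, num_blocks: int = 4):
        super().__init__()
        # One fusion step per additional block output; the SCGC/CCA refinement
        # is replaced here by a bare 1x1 convolution for brevity.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, kernel_size=1)
             for _ in range(num_blocks - 1)]
        )

    def forward(self, feats):
        # feats: list of DAFEB outputs, ordered from shallow to deep.
        fused = feats[0]
        for feat, layer in zip(feats[1:], self.fuse):
            # Carrying the running fusion forward retains more high-level
            # detail than fusing all levels in a single late step.
            fused = layer(torch.cat([fused, feat], dim=1))
        return fused


# Example usage with a hypothetical width of 48 channels:
# out = ForwardFusionSketch(48)([torch.randn(1, 48, 64, 64) for _ in range(4)])
```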

Author Contributions

Conceptualization, W.C.; methodology, W.C. and Y.L.; software, W.C.; validation, W.C.; formal analysis, W.C.; investigation, W.C.; resources, W.C. and S.Q.; data curation, W.C.; writing—original draft preparation, W.C.; writing—review and editing, W.C., S.Q., and L.L.; visualization, W.C.; supervision, S.Q. and L.L.; project administration, S.Q.; funding acquisition, S.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the National Natural Science Foundation of China (Grant No. 12201185) and the Henan Science and Technology Development Plan Project (Grant No. 242102210064).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote. Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  2. Lechner, A.M.; Foody, G.M.; Boyd, D.S. Applications in remote sensing to forest ecology and management. One Earth 2020, 2, 405–412. [Google Scholar] [CrossRef]
  3. Gupta, M.; Almomani, O.; Khasawneh, A.M.; Darabkh, K.A. Smart remote sensing network for early warning of disaster risks. In Nanotechnology-Based Smart Remote Sensing Networks for Disaster Prevention; Elsevier: Amsterdam, The Netherlands, 2022; pp. 303–324. [Google Scholar]
  4. Xu, P.; Tang, H.; Ge, J.; Feng, L. ESPC_NASUnet: An end-to-end super-resolution semantic segmentation network for mapping buildings from remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 5421–5435. [Google Scholar] [CrossRef]
  5. Wang, Z.g.; Kang, Q.; Xun, Y.j.; Shen, Z.q.; Cui, C.b. Military reconnaissance application of high-resolution optical satellite remote sensing. In Proceedings of the International Symposium on Optoelectronic Technology and Application 2014: Optical Remote Sensing Technology and Applications, Beijing, China, 9–11 December 2014; SPIE: Bellingham, WA, USA, 2014; Volume 9299, pp. 301–305. [Google Scholar]
  6. Booysen, R.; Gloaguen, R.; Lorenz, S.; Zimmermann, R.; Andreani, L.; Nex, P.A. The potential of multi-sensor remote sensing mineral exploration: Examples from Southern Africa. In Proceedings of the IGARSS 2019—IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 6027–6030. [Google Scholar]
  7. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part IV 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 184–199. [Google Scholar]
  8. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407. [Google Scholar]
  9. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  10. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  11. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  12. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  13. Han, W.; Chang, S.; Liu, D.; Yu, M.; Witbrock, M.; Huang, T.S. Image super-resolution via dual-state recurrent networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1654–1663. [Google Scholar]
  14. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  15. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar]
  16. Luo, X.; Xie, Y.; Zhang, Y.; Qu, Y.; Li, C.; Fu, Y. Latticenet: Towards lightweight image super-resolution with lattice block. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 272–289. [Google Scholar]
  17. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X.; Yu, F. Dual aggregation transformer for image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12312–12321. [Google Scholar]
  18. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  19. Wang, H.; Chen, X.; Ni, B.; Liu, Y.; Liu, J. Omni aggregation networks for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22378–22387. [Google Scholar]
  20. Fang, J.; Chen, X.; Zhao, J.; Zeng, K. A scalable attention network for lightweight image super-resolution. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102185. [Google Scholar] [CrossRef]
  21. Lei, S.; Shi, Z.; Zou, Z. Super-resolution for remote sensing images via local–global combined network. IEEE Geosci. Remote. Sens. Lett. 2017, 14, 1243–1247. [Google Scholar] [CrossRef]
  22. Dong, X.; Wang, L.; Sun, X.; Jia, X.; Gao, L.; Zhang, B. Remote sensing image super-resolution using second-order multi-scale networks. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 3473–3485. [Google Scholar] [CrossRef]
  23. Zhang, D.; Shao, J.; Li, X.; Shen, H.T. Remote sensing image super-resolution via mixed high-order attention network. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 5183–5196. [Google Scholar] [CrossRef]
  24. Li, Q.; Yuan, Y.; Jia, X.; Wang, Q. Dual-stage approach toward hyperspectral image super-resolution. IEEE Trans. Image Process. 2022, 31, 7252–7263. [Google Scholar] [CrossRef] [PubMed]
  25. Lei, S.; Shi, Z.; Mo, W. Transformer-based multistage enhancement for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
  26. Tu, J.; Mei, G.; Ma, Z.; Piccialli, F. SWCGAN: Generative adversarial network combining swin transformer and CNN for remote sensing image super-resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 15, 5662–5673. [Google Scholar] [CrossRef]
  27. Wang, J.; Wang, B.; Wang, X.; Zhao, Y.; Long, T. Hybrid attention-based U-shaped network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  28. Wu, H.; Ni, N.; Zhang, L. Lightweight stepless super-resolution of remote sensing images via saliency-aware dynamic routing strategy. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–17. [Google Scholar] [CrossRef]
  29. Xie, Z.; Wang, J.; Song, W.; Du, Y.; Xu, H.; Yang, Q. CFFormer: Channel Fourier Transformer for Remote Sensing Super-Resolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 18, 569–583. [Google Scholar] [CrossRef]
  30. Hao, J.; Li, W.; Lu, Y.; Jin, Y.; Zhao, Y.; Wang, S.; Wang, B. Scale-aware Backprojection Transformer for Single Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 5649013. [Google Scholar] [CrossRef]
  31. Hao, S.; Zhuge, Y.; Xu, J.; Lu, H.; He, Y. Remote Sensing Image Super-Resolution Using Enriched Spatial-Channel Feature Aggregation Networks. In Proceedings of the 2024 6th International Conference on Data-driven Optimization of Complex Systems (DOCS), Hangzhou, China, 16–18 August 2024; pp. 578–585. [Google Scholar]
  32. Ye, W.; Lin, B.; Lao, J.; Liu, Y.; Lin, Z. MRA-IDN: A Lightweight Super-Resolution Framework of Remote Sensing Images based on Multi-Scale Residual Attention Fusion Mechanism. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2024, 17, 7781–7800. [Google Scholar] [CrossRef]
  33. Qin, M.; Mavromatis, S.; Hu, L.; Zhang, F.; Liu, R.; Sequeira, J.; Du, Z. Remote sensing single-image resolution improvement using a deep gradient-aware network with image-specific enhancement. Remote Sens. 2020, 12, 758. [Google Scholar] [CrossRef]
  34. Dong, R.; Mou, L.; Zhang, L.; Fu, H.; Zhu, X.X. Real-world remote sensing image super-resolution via a practical degradation model and a kernel-aware network. ISPRS J. Photogramm. Remote. Sens. 2022, 191, 155–170. [Google Scholar] [CrossRef]
  35. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11065–11074. [Google Scholar]
  36. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  37. Xiao, Y.; Yuan, Q.; Jiang, K.; He, J.; Jin, X.; Zhang, L. EDiffSR: An efficient diffusion probabilistic model for remote sensing image super-resolution. IEEE Trans. Geosci. Remote. Sens. 2023, 62, 5601514. [Google Scholar] [CrossRef]
  38. Sebaq, A.; ElHelw, M. Rsdiff: Remote sensing image generation from text using diffusion model. Neural Comput. Appl. 2024, 36, 23103–23111. [Google Scholar] [CrossRef]
  39. Dong, W.; Liu, S.; Xiao, S.; Qu, J.; Li, Y. ISPDiff: Interpretable Scale-Propelled Diffusion Model for Hyperspectral Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5519614. [Google Scholar] [CrossRef]
  40. Xiao, Y.; Yuan, Q.; Jiang, K.; Chen, Y.; Zhang, Q.; Lin, C.W. Frequency-Assisted Mamba for Remote Sensing Image Super-Resolution. arXiv 2024, arXiv:.04964. [Google Scholar] [CrossRef]
  41. Wang, H.; Cheng, S.; Li, Y.; Du, A. Lightweight remote-sensing image super-resolution via attention-based multilevel feature fusion network. IEEE Trans. Geosci. Remote. Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  42. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale residual network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 517–532. [Google Scholar]
  43. Liu, F.; Yang, X.; De Baets, B. Lightweight image super-resolution with a feature-refined network. Signal Process. Image Commun. 2023, 111, 116898. [Google Scholar] [CrossRef]
  44. Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 41–55. [Google Scholar]
  45. Xue, Y.; Li, L.; Wang, Z.; Jiang, C.; Liu, M.; Wang, J.; Sun, K.; Ma, H. RFCNet: Remote Sensing Image Super-Resolution Using Residual Feature Calibration Network. Tsinghua Sci. Technol. 2022, 28, 475–485. [Google Scholar] [CrossRef]
  46. Wang, Z.; Li, L.; Xue, Y.; Jiang, C.; Wang, J.; Sun, K.; Ma, H. FeNet: Feature enhancement network for lightweight remote-sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  47. Wang, P.; Bayram, B.; Sertel, E. A comprehensive review on deep learning based remote sensing image super-resolution methods. Earth-Sci. Rev. 2022, 232, 104110. [Google Scholar] [CrossRef]
  48. Wang, S.; Zhou, T.; Lu, Y.; Di, H. Contextual transformation network for lightweight remote-sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  49. Huang, S.; Wang, J.; Yang, Y.; Wan, W. TARN: A lightweight two-branch adaptive residual network for image super-resolution. Int. J. Mach. Learn. Cybern. 2024, 15, 4119–4132. [Google Scholar] [CrossRef]
  50. Ahn, N.; Kang, B.; Sohn, K.A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 252–268. [Google Scholar]
  51. Hui, Z.; Wang, X.; Gao, X. Fast and accurate single image super-resolution via information distillation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 723–731. [Google Scholar]
  52. Sun, L.; Pan, J.; Tang, J. Shufflemixer: An efficient convnet for image super-resolution. Adv. Neural Inf. Process. Syst. 2022, 35, 17314–17326. [Google Scholar]
  53. Zhang, A.; Ren, W.; Liu, Y.; Cao, X. Lightweight image super-resolution with superpixel token interaction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12728–12737. [Google Scholar]
  54. Wang, Y.; Zhang, T. Osffnet: Omni-stage feature fusion network for lightweight image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 5660–5668. [Google Scholar]
  55. Wang, L.; Li, X.; Tian, W.; Peng, J.; Chen, R. Lightweight interactive feature inference network for single-image super-resolution. Sci. Rep. 2024, 14, 11601. [Google Scholar] [CrossRef]
  56. Wang, Q.; Wang, S.; Chen, M.; Zhu, Y. DARN: Distance attention residual network for lightweight remote-sensing image superresolution. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2022, 16, 714–724. [Google Scholar] [CrossRef]
  57. Gao, F.; Li, L.; Wang, J.; Sun, K.; Lv, M.; Jia, Z.; Ma, H. A lightweight feature distillation and enhancement network for super-resolution remote sensing images. Sensors 2023, 23, 3906. [Google Scholar] [CrossRef] [PubMed]
  58. Wu, T.; Zhao, R.; Lv, M.; Jia, Z.; Li, L.; Wang, Z.; Ma, H. Lightweight remote sensing image super-resolution via background-based multi-scale feature enhancement network. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 7509405. [Google Scholar] [CrossRef]
  59. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar]
  60. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  61. Haase, D.; Amthor, M. Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved mobilenets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14600–14609. [Google Scholar]
  62. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  63. Dong, X.; Sun, X.; Jia, X.; Xi, Z.; Gao, L.; Zhang, B. Remote sensing image super-resolution using novel dense-sampling networks. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 1618–1633. [Google Scholar] [CrossRef]
  64. Timofte, R.; Agustsson, E.; Van Gool, L.; Yang, M.H.; Zhang, L. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 114–125. [Google Scholar]
  65. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, CA, USA, 2–5 November 2010; pp. 270–279. [Google Scholar]
  66. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference, Surrey, UK, 3–7 September 2012; pp. 135.1–135.10. [Google Scholar]
  67. Yang, J.; Wright, J.; Huang, T.S.; Ma, Y. Image super-resolution via sparse representation. IEEE Trans. Image Process. 2010, 19, 2861–2873. [Google Scholar] [CrossRef]
  68. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the 8th IEEE International Conference on Computer Vision, ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar]
  69. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  70. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  71. Pan, Z.; Ma, W.; Guo, J.; Lei, B. Super-resolution of single remote sensing image based on residual dense backprojection networks. IEEE Trans. Geosci. Remote. Sens. 2019, 57, 7918–7933. [Google Scholar] [CrossRef]
  72. Zhang, S.; Yuan, Q.; Li, J.; Sun, J.; Zhang, X. Scene-adaptive remote sensing image super-resolution using a multiscale attention network. IEEE Trans. Geosci. Remote. Sens. 2020, 58, 4764–4779. [Google Scholar] [CrossRef]
  73. Tian, C.; Zhuge, R.; Wu, Z.; Xu, Y.; Zuo, W.; Chen, C.; Lin, C.W. Lightweight image super-resolution with enhanced CNN. Knowl.-Based Syst. 2020, 205, 106235. [Google Scholar] [CrossRef]
  74. Lan, R.; Sun, L.; Liu, Z.; Lu, H.; Pang, C.; Luo, X. MADNet: A fast and lightweight network for single-image super resolution. IEEE Trans. Cybern. 2020, 51, 1443–1453. [Google Scholar] [CrossRef]
  75. Gu, J.; Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9199–9208. [Google Scholar]
Figure 1. Model parameters and accuracy tradeoff with other lightweight methods on BSD100 for ×2 SR. Our proposed DAFEN achieves superior performance, and our DAFEN-S also maintains competitive performance. The Multi-Adds (Multiply-Add Operations) are computed on a 1280 × 720 HR image.
Figure 2. Overview of the proposed Dual Attention Fusion Enhancement Network (DAFEN) architecture. The shallow feature extraction and reconstruction parts are utilized to extract coarse features and enlarge the features s times (e.g., ×2, ×3, ×4), respectively. The Feature Refinement and Fusion part with four DAFEBs carries the main feature expression ability. The FFEM can generate more contextual information via forward sequential concatenation.
Figure 3. Structure of Dual Attention Fusion Enhancement Block (DAFEB).
Figure 4. Illustrations of the proposed Forward Fusion Enhancement Module (FFEM). (a) The forward fusion structure by using forward sequential concatenation is illustrated by taking four blocks as an example. (b) The Contrast-aware Channel Attention (CCA). (c) The Self-Calibrated Group Convolution (SCGC). (d) The Group Residual Shuffle Block (GRSB).
Figure 5. (a) The Channel-Spatial Lattice Block (CSLB), where the Channel-Spatial Attention Interaction Module (CSAIM) includes the Spatial Attention Interaction (SAI) and the Channel Attention Interaction (CAI). ‘Split’ represents the channel separation operation. (b) The Spatial Attention Interaction (SAI). (c) The Channel Attention Interaction (CAI).
Figure 6. Visualization results of several SR methods and our proposed networks (DAFEN and DAFEN-S) on UC-Merced for ×4 SR.
Figure 7. Visualization results of several SR methods and our proposed networks (DAFEN and DAFEN-S) on BSD100 and Urban100 datasets for ×4 SR.
Figure 8. Visualization results of our proposed networks (DAFEN and DAFEN-S) and other SR methods on real remote-sensing images for ×4 SR. (a) Residential areas and farmland. (b) Terraces and roads.
Figure 9. Visualization results for the ablation experiments on the CSLB design.
Figure 10. Average feature maps from the ablation experiments on the FFEM design at different stages of the DAFEN.
Figure 11. LAM results for the ablation experiments on the FFEM design. LAM reflects the importance of each pixel in the input LR image during the reconstruction of the marked blocks. The red-marked points indicate the pixels that contribute to the reconstruction process. The Diffusion Index (DI) reflects the range of involved pixels, with a higher DI indicating a wider range of utilized pixels.
Figure 12. Network average feature maps visualization.
Table 1. Implementation details and hyperparameter settings of our methods and comparative lightweight methods.
Method | Optim_g | Betas | lr | Gamma | Loss Type | Batch Size | Patch Size | Use_hflip | Use_rot
LGCNet [21] | ADAM | [0.9,0.999] | 1 × 10⁻¹ | 0.1 | L2 | 128 | 41 × 41 | false | false
IDN [51] | ADAM | [0.9,0.999] | 1 × 10⁻⁴ | 0.1 | L1 | 64 | 26 × 26 | true | true
LESRCNN [73] | ADAM | [0.9,0.999] | 1 × 10⁻⁴ | 0.5 | L2 | 64 | 64 × 64 | true | true
MADNet [74] | ADAM | [0.9,0.999] | 1 × 10⁻³ | 0.5 | LF | 16 | 48 × 48 | true | true
CTN [48] | ADAM | [0.9,0.999] | 1 × 10⁻⁴ | 0.5 | L1 | 16 | 48 × 48 | true | true
FeNet-baseline [46] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
FeNet [46] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
FDENet [57] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
DARN-S [56] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 64 | 64 × 64 | true | true
AMFFN [41] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 48 × 48 | true | true
IFIN-S [55] | ADAM | [0.9,0.99] | 1 × 10⁻³ | 0.5 | L1 | 16 | 60 × 60 | true | true
BMFENet [58] | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
TARN [49] | ADAM | [0.9,0.999] | 3 × 10⁻⁴ | 0.5 | L1 | 16 | 256 × 256 | true | true
DAFEN-S (ours) | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
DAFEN (ours) | ADAM | [0.9,0.999] | 5 × 10⁻⁴ | 0.5 | L1 | 16 | 192 × 192 | true | true
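For readers reproducing the DAFEN row of Table 1, the sketch below assembles the listed optimization settings (Adam with betas [0.9, 0.999], an initial learning rate of 5 × 10⁻⁴, step decay with gamma 0.5, and an L1 loss). The decay interval passed to the scheduler is not reported in the table and is only a placeholder; the data pipeline (batches of 16 patches of 192 × 192 HR pixels with random flips and rotations) is summarized in a comment.

```python
import torch
import torch.nn as nn


def build_training_objects(model: nn.Module):
    # Settings taken from the DAFEN row of Table 1.
    criterion = nn.L1Loss()
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999))
    # Gamma = 0.5 as listed; the step interval is NOT given in the table and is
    # an assumed placeholder here.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
    return criterion, optimizer, scheduler


# Data side (per Table 1): batch size 16, 192 x 192 HR patches,
# Use_hflip = true and Use_rot = true for augmentation.
```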
Table 2. Impact of hyperparameter selection on model performance and complexity at a scaling factor of ×4 on the RS-T1 dataset. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Variant | Params | Multi-Adds | RS-T1 (×4) PSNR/SSIM
Loss type: L2 | 431 K | 21.8 G | 29.73/0.7669
Batch size: 32 | 431 K | 21.8 G | 29.81/0.7694
Batch size: 8 | 431 K | 21.8 G | 29.80/0.7689
Patch size: 256 | 431 K | 38.8 G | 29.82/0.7695
Patch size: 128 | 431 K | 9.7 G | 29.80/0.7691
Use_hflip: false | 431 K | 21.8 G | 29.76/0.7681
Use_rot: false | 431 K | 21.8 G | 29.78/0.7685
DAFEN | 431 K | 21.8 G | 29.82/0.7701
Table 3. Quantitative evaluation results for SR on two RSI test datasets. PSNR and SSIM values are provided. ‘-’ denotes the results are not provided. The best and second-best results are highlighted in red and blue, respectively. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Method | Scale | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM
Bicubic | ×2 | - | - | 33.25/0.8934 | 30.64/0.8837
LGCNet [21] | ×2 | 193 K | 178.1 G | 35.65/0.9298 | 33.47/0.9281
IDN [51] | ×2 | 553 K | 124.6 G | 36.13/0.9339 | 34.07/0.9329
LESRCNN [73] | ×2 | 626 K | 281.5 G | 36.04/0.9328 | 34.00/0.9320
CTN [48] | ×2 | 402 K | 60.9 G | 36.30/0.9243 | 34.31/0.9346
FeNet-baseline [46] | ×2 | 158 K | 35.2 G | 36.10/0.9331 | 34.10/0.9326
FeNet [46] | ×2 | 351 K | 77.9 G | 36.23/0.9341 | 34.22/0.9337
FDENet [57] | ×2 | 480 K | 138.7 G | 36.26/0.9346 | 34.28/0.9338
DARN-S [56] | ×2 | 350 K | 78.9 G | 36.31/0.9347 | 34.35/0.9348
AMFFN [41] | ×2 | 298 K | 61.5 G | 36.39/0.9357 | 34.34/0.9346
IFIN-S [55] | ×2 | 451 K | 110.6 G | 36.38/0.9356 | 34.42/0.9352
BMFENet [58] | ×2 | 465 K | 115.0 G | 36.42/0.9362 | 34.43/0.9356
DAFEN-S (ours) | ×2 | 188 K | 37.5 G | 36.28/0.9352 | 34.24/0.9338
DAFEN (ours) | ×2 | 416 K | 83.6 G | 36.42/0.9365 | 34.39/0.9357
Bicubic | ×3 | - | - | 29.73/0.7818 | 27.23/0.7697
LGCNet [21] | ×3 | 193 K | 79.0 G | 31.30/0.8314 | 29.03/0.8312
IDN [51] | ×3 | 553 K | 56.3 G | 31.73/0.8430 | 29.59/0.8450
LESRCNN [73] | ×3 | 810 K | 238.9 G | 31.68/0.8398 | 29.65/0.8444
CTN [48] | ×3 | 402 K | 37.1 G | 31.91/0.8454 | 29.83/0.8489
FeNet-baseline [46] | ×3 | 163 K | 16.7 G | 31.73/0.8377 | 29.61/0.8446
FeNet [46] | ×3 | 357 K | 35.2 G | 31.89/0.8432 | 29.80/0.8481
FDENet [57] | ×3 | 488 K | 61.7 G | 31.98/0.8488 | 29.88/0.8489
DARN-S [56] | ×3 | 355 K | 35.0 G | 32.00/0.8483 | 29.98/0.8518
AMFFN [41] | ×3 | 305 K | 27.9 G | 31.94/0.8457 | 29.91/0.8504
IFIN-S [55] | ×3 | 459 K | 51.0 G | 32.04/0.8448 | 30.03/0.8535
BMFENet [58] | ×3 | 470 K | 51.7 G | 31.99/0.8465 | 29.97/0.8514
DAFEN-S (ours) | ×3 | 192 K | 17.1 G | 31.93/0.8459 | 29.81/0.8485
DAFEN (ours) | ×3 | 422 K | 37.7 G | 32.13/0.8527 | 29.98/0.8516
Bicubic | ×4 | - | - | 27.91/0.6968 | 25.40/0.6770
LGCNet [21] | ×4 | 193 K | 44.5 G | 29.13/0.7481 | 26.76/0.7426
IDN [51] | ×4 | 553 K | 32.3 G | 29.56/0.7623 | 27.31/0.7627
LESRCNN [73] | ×4 | 774 K | 241.6 G | 29.62/0.7625 | 27.41/0.7646
CTN [48] | ×4 | 413 K | 25.6 G | 29.71/0.7666 | 27.52/0.7704
FeNet-baseline [46] | ×4 | 169 K | 9.4 G | 29.57/0.7626 | 27.31/0.7619
FeNet [46] | ×4 | 366 K | 20.4 G | 29.70/0.7688 | 27.45/0.7672
FDENet [57] | ×4 | 501 K | 35.9 G | 29.72/0.7658 | 27.54/0.7697
DARN-S [56] | ×4 | 363 K | 19.7 G | 29.78/0.7682 | 27.59/0.7732
AMFFN [41] | ×4 | 314 K | 16.2 G | 29.76/0.7674 | 27.57/0.7701
IFIN-S [55] | ×4 | 470 K | 31.6 G | 29.84/0.7724 | 27.68/0.7763
BMFENet [58] | ×4 | 477 K | 29.4 G | 29.81/0.7700 | 27.62/0.7730
DAFEN-S (ours) | ×4 | 198 K | 10.0 G | 29.70/0.7673 | 27.46/0.7677
DAFEN (ours) | ×4 | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737
Table 4. Quantitative evaluation results for SR on four benchmark datasets. PSNR and SSIM values are provided. ‘-’ denotes the results are not provided. The best and second-best results are highlighted in red and blue, respectively. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Method | Scale | Params | Multi-Adds | Set5 PSNR/SSIM | Set14 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
Bicubic | ×2 | - | - | 33.66/0.9299 | 30.24/0.8688 | 29.56/0.8431 | 26.88/0.8403
LGCNet [21] | ×2 | 193 K | 178.1 G | 37.31/0.9580 | 32.94/0.9120 | 31.74/0.8939 | 30.53/0.9112
IDN [51] | ×2 | 553 K | 124.6 G | 37.83/0.9600 | 33.30/0.9148 | 32.08/0.8985 | 31.27/0.9196
MADNet [74] | ×2 | 878 K | 187.1 G | 37.85/0.9600 | 33.39/0.9161 | 32.05/0.8981 | 31.59/0.9234
FeNet-baseline [46] | ×2 | 158 K | 35.2 G | 37.77/0.9597 | 33.28/0.9151 | 31.98/0.8973 | 31.46/0.9215
FeNet [46] | ×2 | 351 K | 77.9 G | 37.90/0.9602 | 33.45/0.9162 | 32.09/0.8985 | 31.75/0.9245
FDENet [57] | ×2 | 480 K | 138.7 G | 37.89/0.9594 | 33.50/0.9170 | 32.15/0.8988 | 32.02/0.9270
DARN-S [56] | ×2 | 350 K | 78.9 G | 37.97/0.9609 | 33.54/0.9172 | 32.19/0.9005 | 32.14/0.9284
IFIN-S [55] | ×2 | 451 K | 110.6 G | 38.00/0.9606 | 33.66/0.9181 | 32.18/0.8996 | 32.14/0.9284
BMFENet [58] | ×2 | 465 K | 115.0 G | 38.04/0.9605 | 33.62/0.9180 | 32.22/0.9004 | 32.29/0.9300
TARN [49] | ×2 | 687 K | - | 38.09/0.9608 | 33.65/0.9183 | 32.22/0.9003 | 32.20/0.9289
DAFEN-S (ours) | ×2 | 188 K | 37.5 G | 37.94/0.9605 | 33.41/0.9159 | 32.12/0.8991 | 31.76/0.9248
DAFEN (ours) | ×2 | 416 K | 83.6 G | 38.04/0.9617 | 33.55/0.9175 | 32.22/0.9010 | 32.20/0.9291
Bicubic | ×3 | - | - | 30.39/0.8682 | 27.55/0.7742 | 27.21/0.7385 | 24.46/0.7349
LGCNet [21] | ×3 | 193 K | 79.0 G | 33.32/0.9172 | 29.67/0.8289 | 28.63/0.7923 | 26.77/0.8180
IDN [51] | ×3 | 553 K | 56.3 G | 34.11/0.9253 | 29.99/0.8354 | 28.95/0.8013 | 27.42/0.8359
MADNet [74] | ×3 | 930 K | 88.4 G | 34.14/0.9251 | 30.20/0.8395 | 28.98/0.8023 | 27.78/0.8439
FeNet-baseline [46] | ×3 | 163 K | 16.7 G | 33.99/0.9240 | 30.02/0.8359 | 28.90/0.8000 | 27.55/0.8391
FeNet [46] | ×3 | 357 K | 35.2 G | 34.21/0.9256 | 30.15/0.8383 | 28.98/0.8020 | 27.82/0.8447
FDENet [57] | ×3 | 488 K | 61.7 G | 34.28/0.9253 | 30.33/0.8415 | 29.05/0.8033 | 28.03/0.8494
DARN-S [56] | ×3 | 355 K | 35.0 G | 34.35/0.9274 | 30.34/0.8428 | 29.09/0.8065 | 28.17/0.8528
IFIN-S [55] | ×3 | 459 K | 51.0 G | 34.45/0.9278 | 30.47/0.8442 | 29.13/0.8064 | 28.32/0.8560
BMFENet [58] | ×3 | 470 K | 51.7 G | 34.34/0.9271 | 30.27/0.8407 | 29.08/0.8049 | 28.18/0.8534
TARN [49] | ×3 | 754 K | - | 34.42/0.9275 | 30.37/0.8430 | 29.12/0.8056 | 28.19/0.8529
DAFEN-S (ours) | ×3 | 192 K | 17.1 G | 34.25/0.9261 | 30.18/0.8389 | 29.02/0.8031 | 27.76/0.8451
DAFEN (ours) | ×3 | 422 K | 37.7 G | 34.43/0.9275 | 30.37/0.8434 | 29.12/0.8057 | 28.12/0.8517
Bicubic | ×4 | - | - | 28.42/0.8104 | 26.00/0.7027 | 25.96/0.6675 | 23.14/0.6577
LGCNet [21] | ×4 | 193 K | 44.5 G | 30.87/0.8746 | 27.82/0.7630 | 27.08/0.7186 | 24.82/0.7399
IDN [51] | ×4 | 553 K | 32.3 G | 31.82/0.8903 | 28.25/0.7730 | 27.41/0.7297 | 25.41/0.7632
MADNet [74] | ×4 | 1002 K | 54.1 G | 32.01/0.8925 | 28.45/0.7781 | 27.47/0.7327 | 25.77/0.7751
FeNet-baseline [46] | ×4 | 169 K | 9.4 G | 31.80/0.8886 | 28.31/0.7742 | 27.38/0.7289 | 25.53/0.7670
FeNet [46] | ×4 | 366 K | 20.4 G | 32.02/0.8919 | 28.38/0.7764 | 27.47/0.7319 | 25.75/0.7747
FDENet [57] | ×4 | 501 K | 35.9 G | 32.12/0.8929 | 28.52/0.7795 | 27.53/0.7339 | 25.97/0.7811
DARN-S [56] | ×4 | 363 K | 19.7 G | 32.16/0.8951 | 28.58/0.7817 | 27.57/0.7374 | 26.08/0.7859
IFIN-S [55] | ×4 | 470 K | 31.6 G | 32.27/0.8958 | 28.68/0.7834 | 27.62/0.7381 | 26.17/0.7890
BMFENet [58] | ×4 | 477 K | 29.4 G | 32.22/0.8951 | 28.61/0.7812 | 27.54/0.7335 | 26.04/0.7852
TARN [49] | ×4 | 835 K | - | 32.23/0.8955 | 28.65/0.7829 | 27.61/0.7368 | 26.15/0.7874
DAFEN-S (ours) | ×4 | 198 K | 10.0 G | 32.00/0.8919 | 28.39/0.7773 | 27.49/0.7326 | 25.72/0.7758
DAFEN (ours) | ×4 | 431 K | 21.8 G | 32.23/0.8948 | 28.59/0.7815 | 27.57/0.7376 | 26.01/0.7832
Table 5. Comparison results with non-lightweight state-of-the-art methods at a scaling factor of ×3. Due to the higher model complexity of non-lightweight methods, we present the data using larger magnitude units for ease of comparison. The best and second-best results are highlighted in red and blue, respectively. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Method | Params | Multi-Adds | Time
RCAN [14] | 15.67 M | 1.492 T | 0.12 s
SwinIR [36] | 11.55 M | 2.883 T | 0.23 s
HAT [18] | 20.53 M | 3.871 T | 0.32 s
DAFEN-S | 0.192 M | 0.017 T | 0.012 s
DAFEN | 0.422 M | 0.038 T | 0.017 s
Table 6. Quantitative comparison of how lightweight the models are on the RS-T1 dataset at a scaling factor of ×3. The best and second-best results are highlighted in red and blue, respectively. The Multi-Adds is calculated corresponding to a 1280 × 720 HR image.
Method | Params | Multi-Adds | Time | RS-T1 (×3) PSNR/SSIM
FeNet [46] | 357 K | 35.2 G | 19.46 ms | 31.89/0.8432
DARN [56] | 596 K | 58.4 G | 18.87 ms | 32.08/0.8470
IFIN-S [55] | 459 K | 51.0 G | 143.34 ms | 32.04/0.8448
DAFEN-S | 192 K | 17.1 G | 11.65 ms | 31.93/0.8459
DAFEN | 422 K | 37.7 G | 17.31 ms | 32.13/0.8527
Table 7. Ablation experiments on the design of the DAFEN on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
DAFEN w/o DAFEB | 683 K | 38.7 G | 29.82/0.7689 | 27.63/0.7744 | 27.58/0.7380 | 26.03/0.7840
DAFEN w/o CSLB | 744 K | 40.2 G | 29.83/0.7694 | 27.61/0.7742 | 27.58/0.7383 | 26.05/0.7851
DAFEN w/o FFEM | 315 K | 16.5 G | 29.75/0.7663 | 27.54/0.7706 | 27.52/0.7353 | 25.92/0.7794
DAFEN | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
Table 8. Ablation experiments on the design of the CSLB on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
CSLB w/o GRSB | 867 K | 46.9 G | 29.85/0.7708 | 27.68/0.7753 | 27.61/0.7386 | 26.13/0.7854
CSLB w/o CSAIM | 434 K | 21.7 G | 29.80/0.7696 | 27.61/0.7735 | 27.54/0.7371 | 25.99/0.7831
CSLB | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
Table 9. Ablation experiments on the design of the FFEM on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
W/ 1 × 1 Conv | 443 K | 22.4 G | 29.81/0.7695 | 27.59/0.7722 | 27.54/0.7367 | 26.00/0.7828
W/ BFM | 431 K | 21.8 G | 29.79/0.7689 | 27.61/0.7735 | 27.56/0.7373 | 26.01/0.7831
FFEM w/o SCGC | 303 K | 15.8 G | 29.73/0.7671 | 27.53/0.7705 | 27.49/0.7350 | 25.80/0.7765
FFEM w/o CCA | 406 K | 20.4 G | 29.79/0.7688 | 27.60/0.7732 | 27.54/0.7365 | 25.95/0.7814
FFEM | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
Table 10. Ablation experiments on the design of the lightweight convolution in DAFEN on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
W/ PSConv | 368 K | 18.7 G | 29.73/0.7665 | 27.54/0.7706 | 27.48/0.7346 | 25.89/0.7792
W/ BSConv | 343 K | 17.5 G | 29.62/0.7634 | 27.42/0.7663 | 27.44/0.7317 | 25.80/0.7768
W/ DSConv | 343 K | 17.5 G | 29.65/0.7642 | 27.40/0.7648 | 27.45/0.7324 | 25.81/0.7766
DAFEN | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
Table 11. Ablation experiments on the design of the number of groups in group convolution on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
G = 2 | 628 K | 31.8 G | 29.84/0.7705 | 27.66/0.7746 | 27.59/0.7385 | 26.08/0.7844
G = 4 | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
G = 8 | 333 K | 16.8 G | 29.77/0.7672 | 27.59/0.7727 | 27.54/0.7365 | 25.94/0.7808
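The parameter trend in Table 11 follows directly from the weight count of a grouped k × k convolution, C_in · C_out · k² / G. The short calculation below illustrates the scaling; the channel width of 64 is chosen purely for illustration and is not claimed to be the width used in DAFEN.

```python
def conv_params(c_in: int, c_out: int, k: int = 3, groups: int = 1) -> int:
    # Weight count of a k x k convolution, bias ignored.
    return c_in * c_out * k * k // groups


channels = 64  # illustrative width, not necessarily DAFEN's
for g in (1, 2, 4, 8):
    print(g, conv_params(channels, channels, groups=g))
# Prints 36864, 18432, 9216, 4608: the count halves with each doubling of G,
# mirroring the parameter drop from G = 2 to G = 8 in Table 11.
```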
Table 12. Ablation experiments on the design of the number of DAFEBs and CSLBs on RS-T1, RS-T2, BSD100, and Urban100 datasets for ×4 SR.
Variant | Params | Multi-Adds | RS-T1 PSNR/SSIM | RS-T2 PSNR/SSIM | BSD100 PSNR/SSIM | Urban100 PSNR/SSIM
n_b = 4, n_l = 1 | 289 K | 13.9 G | 29.76/0.7681 | 27.57/0.7716 | 27.53/0.7361 | 25.86/0.7787
n_b = 4, n_l = 3 | 431 K | 21.8 G | 29.82/0.7701 | 27.62/0.7737 | 27.57/0.7376 | 26.01/0.7832
n_b = 4, n_l = 5 | 574 K | 29.7 G | 29.86/0.7694 | 27.71/0.7759 | 27.59/0.7383 | 26.11/0.7861
n_b = 3, n_l = 3 | 345 K | 17.5 G | 29.79/0.7676 | 27.60/0.7730 | 27.54/0.7362 | 25.92/0.7798
n_b = 5, n_l = 3 | 517 K | 26.0 G | 29.84/0.7700 | 27.65/0.7746 | 27.59/0.7380 | 26.05/0.7849
n_b = 6, n_l = 3 | 603 K | 30.3 G | 29.86/0.7704 | 27.70/0.7764 | 27.61/0.7386 | 26.12/0.7863
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
