DFCFNet: A Local–Nonlocal Dual-Branch Feature Complementary Fusion Network for Remote Sensing Image Super-Resolution

Zhang, Miaomiao; Wang, Quan; Zhang, Wuxia; Chen, Xiangpeng; Pan, Jiaxin; Guo, Huinan

doi:10.3390/rs18101626


by Miaomiao Zhang ¹,², Quan Wang ³, Wuxia Zhang ⁴, Xiangpeng Chen ¹,², Jiaxin Pan ¹,² and Huinan Guo ¹,*

¹ Xi’an Institute of Optics and Precision Mechanics (XIOPM), Chinese Academy of Sciences, Xi’an 710119, China

² University of Chinese Academy of Sciences, Beijing 100049, China

³ Key Laboratory of Biomedical Spectroscopy of Xi’an, Key Laboratory of Spectral Imaging Technology, Xi’an Institute of Optics and Precision Mechanics (XIOPM), Chinese Academy of Sciences, Xi’an 710119, China

⁴ Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an 710121, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1626; https://doi.org/10.3390/rs18101626

Submission received: 1 April 2026 / Revised: 3 May 2026 / Accepted: 14 May 2026 / Published: 19 May 2026

(This article belongs to the Special Issue Super-Resolution and Reconstruction of Remote Sensing Images)


Highlights

What are the main findings?

We propose a local–nonlocal dual-branch feature complementary fusion network (DFCFNet), which combines a Partial Convolution Channel Mixer (PCCM) with a global variance-based strategy to jointly model local and global representations. An Efficient Feed-Forward Network (EFFN) is further introduced to refine features, leading to enhanced detail preservation and improved reconstruction quality in remote sensing images.
Extensive experimental results demonstrate that DFCFNet achieves superior performance on remote sensing datasets, effectively balancing reconstruction quality and inference efficiency. Furthermore, cross-domain evaluation on natural images confirms the model’s strong generalization capability.

What are the implications of the main findings?

DFCFNet adopts a lightweight design, enabling high-quality remote sensing image super-resolution on resource-constrained edge devices and demonstrating strong potential for real-time applications.
DFCFNet exhibits strong generalization capability, providing valuable guidance for future research in remote sensing image super-resolution as well as broader geospatial image processing applications.

Abstract

Remote sensing image super-resolution (RSISR) has gained significant attention in recent years due to its critical role in enhancing image analysis capabilities. While existing methods often focus on nonlocal feature extraction, they frequently overlook the importance of local information integration. Moreover, many methods improve reconstruction by introducing more complex structures, which poses a challenge for resource-limited devices. To address these issues, we present a local–nonlocal dual-branch feature complementary fusion network (DFCFNet) featuring two key components: a lightweight dual-branch feature aggregation (DBFA) module and an Efficient Feed-Forward Network (EFFN). The DBFA employs a dual-branch structure comprising a Focused Local Feature Branch (FLFB) with novel Partial Convolution Channel Mixers for localized pattern modeling and a Non-Focal Exploration Branch (NFEB) utilizing global variance analysis for comprehensive feature extraction. This dual-branch design enables simultaneous capture of local and global contextual information. The EFFN further refines the features output by the DBFA in order to make full use of the detailed information in the image. Extensive experimental results show that, among the compared methods, the proposed DFCFNet achieves the best reconstruction results on remote sensing datasets and is also the best in terms of computational efficiency and network complexity. The framework’s versatility is further confirmed through successful adaptation to natural image SR tasks, showing consistent performance improvements across five standard datasets.

Keywords:

remote sensing image super-resolution (RSISR); lightweight; nonlocal feature extraction; focused local feature; feature aggregation

1. Introduction

Remote sensing image super-resolution (RSISR) is an image processing technique in the field of computer vision. It is the process of reconstructing one or more low-resolution (LR) images with coarse details to obtain high-resolution (HR) images with better visual quality and details. It can be applied to many remote sensing tasks, such as target detection, semantic segmentation and scene recognition. Therefore, more and more scholars are devoted to improving the performance of RSISR [1,2].

Traditional approaches rely on predefined rules and prior assumptions. Despite their minimal hardware requirements and ease of implementation, they struggle to handle varied feature representations and the loss of detail in complicated scenes.

Deep learning methods, with their powerful nonlinear modeling capabilities and their ability to learn features from large-scale data, not only greatly improve the visual quality and accuracy of reconstructed images but also adapt to the intricate requirements of various settings. However, RSISR is usually regarded as an ill-posed problem: because the input information is insufficient, the SR process admits multiple solutions, which makes it challenging to construct an efficient SR model [3]. To resolve this challenge, researchers have developed a variety of CNN-based methods [4,5,6,7] that significantly improve the model’s representational capabilities by learning features layer by layer. However, the higher complexity, larger parameter counts, and higher computational cost of these models limit their application on resource-constrained platforms such as mobile devices. To reduce the parameter count, many scholars have designed lightweight models. For example, information distillation networks are used to refine and fine-tune features [8,9,10]. Liu et al. [11] proposed the RFDN architecture, which adds residual blocks to achieve finer feature extraction and exploitation through feature distillation while lowering the number of parameters and the model’s complexity. Although these models have compact architectures and achieve better visual results, it remains challenging to balance model size and performance while fully utilizing both local and nonlocal information [12]. To overcome this problem, scholars have further investigated collaborative local and nonlocal modeling of image features. Gao et al. [13] designed a dual-branching block (DBB) that extracts information from both local and larger regions using standard and dilated convolutions.
Despite recent advancements in RSISR, current approaches struggle to meet the requirements for lightweight deployment and are still computationally costly and complex [14,15].

To address this, we propose a novel lightweight dual-branch feature aggregation (DBFA) module. The DBFA module combines local and nonlocal feature processing through two parallel pathways. The nonlocal branch first downsamples the features to capture low-frequency components and refines them with an approximate depthwise separable convolution (ADSC). These components are then modulated by global variance statistics before being adaptively fused with the original features. Complementing this, the local branch incorporates a partial convolution channel mixer (PCCM) that enhances local modeling through spatial–channel interactions while reducing computation via optimized memory access patterns. This dual-path architecture enables efficient extraction of both detailed textures and global structures.

To further refine features, we introduce an EFFN that performs channel–spatial co-enhancement after DBFA processing. By integrating DBFA and EFFN into an end-to-end framework called DFCFNet, we achieve effective RSISR with minimal computational overhead. Extensive experiments on remote sensing benchmarks demonstrate DFCFNet’s superior balance between model efficiency and reconstruction quality, as illustrated in Figure 1. Additional evaluations on natural image datasets confirm its strong generalization capability.

The following is a summary of the primary contributions:

We created a lightweight and effective DBFA module whose dual-branch structure extracts both local and nonlocal information, yielding more comprehensive and in-depth features.
We used the global variance to enhance the nonlocal feature representation, and the partial convolution design of the partial convolution channel mixer (PCCM) to enhance local modeling while reducing computational redundancy.
We proposed a lightweight EFFN, which further enhances the ability to extract image details while improving the stability and generalization performance of the model, thereby achieving better image reconstruction while keeping computation efficient.
We conducted quantitative and qualitative evaluations across multiple remote sensing datasets to test the balance between model complexity and performance, supplemented by cross-domain testing on natural images to verify generalizability.

2. Related Work

2.1. Deep Learning Developments for RSISR

In the domain of natural images, deep learning-based SR has demonstrated outstanding performance [16,17,18]. Therefore, many researchers have designed SR networks tailored to remote sensing images (RSIs). Liu et al. [19] combined paired learning with a graph neural network structure to represent both the degradation connection from SR to LR and the reconstruction process from LR to HR, in order to recover finer texture information from a large number of remote sensing images. GAN-based RSISR has achieved more satisfactory results [20]. However, although the GAN architecture can make images more realistic, when it is applied to RSISR [21,22] the generator adds pseudo-textures in smooth regions, leading to spatial distortion. To address this issue, Ma et al. [23] designed SD-GAN, which applies different reconstruction criteria to different texture regions to improve the reconstruction quality of RSIs. Because the SD-GAN generator reconstructs every area of the image using exactly the same parameters, Wu et al. proposed SG-FBGAN, built around a PFBB with separate salient and non-salient paths [24], to process salient and non-salient regions more efficiently. Diffusion models, owing to their powerful generative capabilities, have demonstrated significant potential in SR tasks. Zhu et al. [25] proposed RSDiffSR based on a conditional diffusion framework, which leverages large-scale diffusion models to generate prior information, thereby effectively enhancing the visual quality of reconstructed images. Wang et al. [26] developed a semantic-guided diffusion model that utilizes pretrained generative models as priors to alleviate texture blurring in large-scale remote sensing image reconstruction. Furthermore, LDM [27] reduces computational cost by performing the diffusion process in a latent space rather than the original image space, with the aid of a pretrained autoencoder.
Building upon this, StableSR [28] addresses the limitations in reconstruction fidelity and resolution flexibility by introducing a time-aware encoder, a controllable feature fusion module, and a progressive aggregation sampling strategy, thereby further improving reconstruction performance.

2.2. Feature Extraction in SR

A model’s ability to extract image features is crucial to its performance in SR tasks. Since SRCNN [5] broke through the limitations of traditional methods by applying a CNN to SR, more and more CNN-based models have been proposed. Kim et al. [20] adopted a VGG-style CNN architecture to learn high-frequency information and speed up training. Hu et al. [29] proposed a CMSC that extracts image features from coarse to fine. Wang et al. [30] developed an MSRDN to lower the cost while maintaining high performance.

Because the self-attention (SA) mechanism can capture long-range relationships, the Transformer architecture was brought to the field of computer vision [31]. Models represented by ViT address vision tasks by using self-attention to capture correlations between sequence elements over a nonlocal range, thus effectively extracting nonlocal information and establishing long-range dependencies. Liu et al. [32] proposed Swin Transformer, which combines a shifted-window mechanism with a hierarchical vision Transformer. Chen et al. [33] proposed the Hybrid Attention Transformer (HAT), which strengthens cross-window information interaction and thereby mitigates blocking artifacts in intermediate features. ViT-based approaches have achieved impressive success in visual tasks. However, recent studies have shown that SA suffers from high memory consumption and high computational cost and that ViT tends to focus on low-frequency information, leading to overly smooth reconstruction results [34]. These findings build on the relationship between variance and nonlocal features explored in Ref. [35]. To enhance model performance and ease optimization, we incorporate variance into our network design, which offers a simple and effective way to explore nonlocal information.

To fully exploit both local and nonlocal characteristics, many scholars have developed a series of networks that combine local and nonlocal feature extraction and proposed different modules to optimize the feature extraction process. Meng et al. [36] suggested a two-branch extended network structure optimized for detail and contour reconstruction. To reduce the overall number of parameters while assuring accurate extraction of both local and nonlocal features, Hou et al. [37] used stacked local and nonlocal residual blocks for nonlocal and local feature extraction. To enhance the edge detail information of an image, Li et al. [38] designed an LGC-GDAN, which uses a dual-region discriminator and generator to enhance the edge detail information. However, these models usually require large memory and computing power, which not only reduces the efficiency of image processing but also leads to a waste of resources. Our approach prioritizes the full exploration of nonlocal information as well as the efficient extraction of local information to prevent the loss of richer low- and high-frequency information and employs a lighter-weight feature fusion structure to ensure the expressiveness and efficiency of the model while achieving a more comprehensive feature representation.

2.3. Lightweight Image SR

SR approaches have advanced significantly in the realm of image reconstruction. However, in pursuit of higher performance, current SR models are typically complex and computationally inefficient, which increases the demand on hardware resources and poses obstacles to practical deployment. Therefore, scholars have developed many lightweight SR models. To handle the issue of redundant model parameters, Hui et al. [39] introduced the IDN, which uses multiple information distillation blocks to gradually extract residual information. It was later extended to the IMDN [40], which proposes an adaptive cropping method to achieve arbitrary-scale upscaling. Ahn et al. [4] adopted a recursive network architecture to improve model performance while minimizing the parameter count and keeping the model lightweight. On the basis of the recursive architecture, Tai et al. [41] proposed a DRRN that reduces the number of model parameters by using the same network structure and parameters in each recursion, thus improving the model’s capacity for generalization. However, the repeated operations of the recursive module introduce redundant processing, which slows down inference. To reduce unnecessary computation and achieve finer feature extraction and exploitation through feature distillation, Liu et al. [11] proposed the RFDN architecture with added residual blocks. Kong et al. [42] modified the RFDN by removing the multi-branch structure and introducing a contrastive loss to speed up inference and improve accuracy. To lower complexity and enhance reconstruction quality, Gao et al. [43] created an FDEB that combines distillation with feature enhancement.

3. Method

In this section, we first provide an overview of the proposed model architecture in Section 3.1. Subsequently, the dual-branch DBFA module is described in detail in Section 3.2. In Section 3.3, we explain the effectiveness of the EFFN module. Finally, the structure of the FAMs is presented in Section 3.4.

3.1. Overall Architecture

Figure 2 depicts the main network structure of our proposed DFCFNet. The network takes an LR image as input and uses a $3 \times 3$ convolution layer for shallow feature extraction. The features are then passed through a sequence of feature aggregation modules (FAMs) for deep feature extraction, with each FAM consisting of a DBFA and an EFFN. Each FAM outputs enriched features, which are passed to the next FAM for further feature extraction. To keep the model lightweight, we utilize a $3 \times 3$ convolution to convert the features to the size required by the upsampling operation. We use a residual connection in the deep feature extraction module to retain the input image’s nonlocal structural information.
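The upsampling operation itself is not detailed in this section. A common choice in lightweight SR networks, assumed here purely for illustration, is sub-pixel (pixel shuffle) upsampling: for scale factor $r$, the final $3 \times 3$ convolution outputs $3 r^{2}$ channels for an RGB image, and these channels are rearranged into an image $r$ times larger. The helper `pixel_shuffle` below is a hypothetical plain-Python sketch of that rearrangement, not the authors' implementation.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] feature map into [C][H*r][W*r].

    Assumption: sub-pixel upsampling with the usual channel ordering, where
    input channel c*r*r + i*r + j supplies output pixel (h*r + i, w*r + j).
    """
    c_out = len(x) // (r * r)
    h, w = len(x[0]), len(x[0][0])
    return [
        [
            [x[c * r * r + (row % r) * r + (col % r)][row // r][col // r]
             for col in range(w * r)]
            for row in range(h * r)
        ]
        for c in range(c_out)
    ]


# Four 1 x 1 channels become one 2 x 2 channel at scale factor 2.
features = [[[0.0]], [[1.0]], [[2.0]], [[3.0]]]
upsampled = pixel_shuffle(features, 2)
```

Under this assumption, a scale factor of 4 would require the last convolution to produce $3 \times 16 = 48$ channels.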

3.2. DBFA Module

Recently, a large number of Transformer-based super-resolution methods have emerged, which leverage self-attention mechanisms to achieve superior visual performance. However, these methods typically incur high computational costs and tend to emphasize nonlocal modeling while neglecting the effective representation of local details. To address this issue, we propose a dual-branch DBFA module. Specifically, the module integrates a non-focal exploration branch (NFEB) and a focused local feature branch (FLFB) to jointly model nonlocal dependencies and local details, thereby improving both reconstruction quality and computational efficiency. The overall architecture of DBFA is illustrated in Figure 3a. In particular, the input features are first expanded along the channel dimension by a factor of two so that they can be evenly distributed to the two branches for subsequent processing. The formulation can be expressed as follows:

$$\{F_{g}, F_{l}\} = \mathrm{Split}(\mathrm{Conv}_{1 \times 1}(F_{in})) \tag{1}$$

where $F_{in}$ denotes the input of DBFA, while $F_{g}$ and $F_{l}$ represent the two outputs obtained after the channel splitting operation. $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a $1 \times 1$ convolution, and $\mathrm{Split}(\cdot)$ denotes the channel splitting operation.
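Equation (1) can be illustrated with a minimal plain-Python sketch. The feature map is stored as nested lists indexed `[channel][row][col]`, and the weight matrix is a hypothetical stand-in for learned parameters; both are assumptions made only for this example.

```python
def conv1x1(x, weight):
    """Pointwise convolution: weight[o][i] mixes input channel i into output channel o."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(weight[o][i] * x[i][r][c] for i in range(len(x))) for c in range(w)]
         for r in range(h)]
        for o in range(len(weight))
    ]


def split_channels(x):
    """Split along the channel axis into two equal halves (F_g, F_l)."""
    half = len(x) // 2
    return x[:half], x[half:]


# Toy input with C = 2 channels of size 2 x 2.
f_in = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
# Hypothetical weights of shape (2C, C) = (4, 2); a trained network would learn these.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
f_g, f_l = split_channels(conv1x1(f_in, weight))
```

The $1 \times 1$ convolution doubles the channel count from $C$ to $2C$, so each branch receives $C$ channels after the split.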

3.2.1. NFEB

As illustrated in Figure 3a, within the DBFA module, the NFEB branch is employed to model nonlocal information in images. In this branch, a global variance strategy is introduced. Specifically, this strategy measures the dispersion of channel features by computing the global statistical variance of feature maps, thereby capturing their overall distribution characteristics. By incorporating such global statistical information, the model can effectively capture long-range nonlocal dependencies while maintaining low computational cost. In the NFEB branch, the input features are first downsampled to extract low-frequency components, which are then fed into a $3 \times 3$ depthwise convolution to obtain initial nonlocal structural representations. Subsequently, these features are passed to the approximate depthwise separable convolution (ADSC) module to further enhance nonlocal feature representations. As shown in Figure 3c, the ADSC consists of a $1 \times 1$ convolution, an activation function, and a depthwise convolution. Different from conventional depthwise separable convolution, ADSC first employs a standard $1 \times 1$ convolution to promote inter-channel information interaction, followed by a nonlinear activation to enhance feature expressiveness, and finally applies a depthwise convolution to further refine the features. The above process can be formulated as follows:

$$F_{g\_a} = \mathrm{ADSC}(\mathrm{DWConv}_{3 \times 3}(M(F_{g}))) \tag{2}$$

where $F_{g}$ denotes the input of NFEB, $M(\cdot)$ denotes max pooling, $F_{g\_a}$ denotes the output of ADSC, $\mathrm{DWConv}_{3 \times 3}(\cdot)$ denotes a $3 \times 3$ depthwise convolutional layer, and $\mathrm{ADSC}(\cdot)$ denotes the approximate depthwise separable convolution.
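To give a rough sense of why a pointwise-plus-depthwise design such as ADSC is lightweight, the sketch below counts weights (biases ignored) for a standard convolution and for a $1 \times 1$ convolution followed by a depthwise convolution. The $3 \times 3$ depthwise kernel and the equal input and output channel counts are assumptions made for illustration; the text does not state them for ADSC.

```python
def standard_conv_params(c_in, c_out, kernel):
    """Weights of a dense kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def adsc_params(channels, kernel):
    """Weights of a 1x1 convolution (C -> C) plus a depthwise kernel x kernel convolution."""
    pointwise = channels * channels
    depthwise = kernel * kernel * channels
    return pointwise + depthwise


# 36 channels is the width reported for the standard DFCFNet variant.
dense = standard_conv_params(36, 36, 3)
light = adsc_params(36, 3)
```

With these assumed settings the dense layer needs 11,664 weights, whereas the pointwise-plus-depthwise pair needs 1,620.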

To fully exploit the original information of the image, we do not adopt a simple local residual structure. Instead, we incorporate a global variance strategy into the local residual modeling process to facilitate more comprehensive exploration of nonlocal information. The resulting features are then effectively fused with the nonlocal representations generated by ADSC, thereby further enhancing feature expressiveness. The overall process can be formulated as follows:

$$F_{\rho} = \mathrm{Conv}_{1 \times 1}(F_{g\_a} + \sigma^{2}(F_{g})) \tag{3}$$

where $F_{g}$ denotes the input of NFEB, $F_{g\_a}$ denotes the output of ADSC, $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a $1 \times 1$ convolutional layer, and $F_{\rho} \in \mathbb{R}^{H \times W \times C}$. The global variance can be formulated as:

$$\sigma^{2}(F_{g}) = \frac{1}{N} \sum_{i=0}^{N-1} (f_{i} - \mu)^{2} \tag{4}$$

where $\sigma^{2}(F_{g})$ represents the global variance of $F_{g}$, $N$ is the total number of pixels, $f_{i}$ represents the value of each pixel, and $\mu$ is the average of all pixels.
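Equation (4) is the population variance over all pixels of a feature map. A minimal sketch follows, assuming the statistic is computed separately for each channel (the text speaks of the dispersion of channel features, but the exact reduction axes are not stated); `global_variance` is a hypothetical helper name.

```python
def global_variance(channel):
    """Population variance of one H x W channel, as in Equation (4)."""
    pixels = [value for row in channel for value in row]
    n = len(pixels)
    mean = sum(pixels) / n
    return sum((value - mean) ** 2 for value in pixels) / n


textured = global_variance([[1.0, 2.0], [3.0, 4.0]])  # 1.25
flat = global_variance([[5.0, 5.0], [5.0, 5.0]])      # 0.0
```

A flat channel yields zero variance, while a channel with stronger intensity variation yields a larger value, which is what lets the statistic act as a compact global descriptor.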

In the NFEB branch, we aggregate the input and output features to further enhance the nonlocal information of the image. The process can be formulated as follows:

$$F_{g\_out} = F_{g} \odot U(A(F_{\rho})) \tag{5}$$

where $F_{g\_out}$ represents the output feature of this branch, $A(\cdot)$ denotes the activation function, $U(\cdot)$ denotes upsampling, and $\odot$ denotes element-wise multiplication.
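The modulation in Equation (5) can be sketched for a single channel as follows. The activation and the interpolation mode are not specified in the text, so a sigmoid and nearest-neighbor upsampling are used here as labeled assumptions.

```python
import math


def upsample_nearest(x, scale):
    """Nearest-neighbor upsampling of an H x W map by an integer factor."""
    return [[x[r // scale][c // scale] for c in range(len(x[0]) * scale)]
            for r in range(len(x) * scale)]


def modulate(f_g, f_rho, activation, scale):
    """F_g (element-wise *) U(A(F_rho)) for one channel, following Equation (5)."""
    gate = upsample_nearest([[activation(v) for v in row] for row in f_rho], scale)
    return [[f_g[r][c] * gate[r][c] for c in range(len(f_g[0]))]
            for r in range(len(f_g))]


def sigmoid(v):
    # Assumed activation; the paper only writes A(.).
    return 1.0 / (1.0 + math.exp(-v))


f_g = [[1.0, 2.0], [3.0, 4.0]]   # full-resolution branch input
f_rho = [[0.0]]                  # low-resolution nonlocal feature
f_g_out = modulate(f_g, f_rho, sigmoid, 2)
```

Because the gate is computed at low resolution and only then upsampled, the nonlocal statistics are applied to every full-resolution pixel at a small cost.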

3.2.2. FLFB

Local detail information is crucial for high-frequency reconstruction. Methods that rely solely on nonlocal modeling often fail to fully exploit fine-grained structures in images. To address this issue, we construct the FLFB branch to effectively extract local features from images. As shown in Figure 3a, within FLFB, inspired by CCM [7], we design a partial convolution channel mixer (PCCM) to enhance local information modeling. Specifically, we modify and improve the CCM by incorporating the advantages of partial convolution (PConv) to more effectively model local features. On the one hand, the CCM improves parameter utilization efficiency and enhances local feature representation. On the other hand, PConv reduces redundant computations during local feature extraction. By effectively combining these two components, PCCM not only reduces computational cost but also better preserves image details and edge information, thereby improving overall reconstruction quality and visual performance. The structure of PCCM is shown in Figure 3d, while the architecture of PConv is illustrated in Figure 4.
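The computational saving attributed to PConv can be made concrete with a small sketch. It assumes the usual form of partial convolution, in which a regular convolution is applied to only a subset of the channels while the remaining channels pass through unchanged. The partial ratio of 1/4 and the $64 \times 64$ feature size are illustrative assumptions, not values taken from the paper.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulate count of a dense kernel x kernel convolution."""
    return height * width * kernel * kernel * c_in * c_out


def pconv_macs(height, width, kernel, channels, ratio):
    """MACs when only a fraction `ratio` of the channels is convolved."""
    c_part = int(channels * ratio)
    return conv_macs(height, width, kernel, c_part, c_part)


def partial_apply(channels, conv, n_conv):
    """Apply `conv` to the first n_conv channels; pass the rest through untouched."""
    return conv(channels[:n_conv]) + channels[n_conv:]


full = conv_macs(64, 64, 3, 36, 36)
partial = pconv_macs(64, 64, 3, 36, 0.25)
mixed = partial_apply([1, 2, 3, 4], lambda xs: [10 * v for v in xs], 1)
```

With a ratio of 1/4, the convolved part costs one sixteenth of the dense convolution, since the cost scales with the square of the number of convolved channels.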

In the FLFB branch, the input features are first normalized to ensure more stable training. After passing through an activation function, the features are then fed into the PCCM module to extract fine-grained local details. The process can be formulated as follows:

$$F_{l\_p} = \mathrm{PCCM}(A(\mathrm{LN}(F_{l}))) \tag{6}$$

where $F_{l\_p}$ represents the output feature of PCCM, $F_{l}$ represents the input of FLFB, $\mathrm{LN}(\cdot)$ represents normalization, $A(\cdot)$ represents the activation function, and $\mathrm{PCCM}(\cdot)$ represents the partial convolution channel mixer.

To fully exploit image information, a residual connection is introduced after the output of PCCM to enhance information propagation, accelerate convergence, and strengthen local detail representation. Subsequently, a $1 \times 1$ convolution is applied to further refine the local features, which serves as the output of this branch, $F_{l\_out}$. The above process can be formulated as follows:

$$F_{l\_out} = \mathrm{Conv}_{1 \times 1}(F_{l} + F_{l\_p}) \tag{7}$$

where $F_{l\_p}$ represents the output feature of PCCM, $F_{l}$ represents the input of FLFB, and $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a $1 \times 1$ convolutional layer.

Finally, the outputs of NFEB and FLFB are combined to serve as the output $Fa'$ of DBFA:

$$Fa' = \mathrm{Conv}_{1 \times 1}(F_{g\_out} + F_{l\_out}) \tag{8}$$

3.3. Efficient Feed-Forward Network Model

When performing fully connected mapping, traditional feed-forward networks (FFNs) ignore the correlation between channels and the relative importance of features because they apply the same transformation to all channels, which can easily lead to feature redundancy and information loss. In addition, in high-dimensional feature spaces, FFNs use pointwise fully connected operations, which are computationally intensive and lack local perception; this limits the expressiveness of features and may lead to overfitting. To address these issues, we draw on the idea of MBConv [44] and use its depthwise separable convolution and channel attention mechanism to reduce computational overhead while enhancing the flexibility and expressiveness of feature extraction.

Therefore, we propose an EFFN module to fully mine and utilize valuable feature information, thereby improving the generalization ability and computational efficiency of the model, as shown in Figure 3b. Specifically, this module first uses a $1 \times 1$ convolution to expand the channels and then uses channel splitting to assign one part, with $C/2$ channels, to local information enhancement; this part passes through a $3 \times 3$ convolution that extracts local information and captures the feature relationships of adjacent pixels, while the other part retains the initial image information. Finally, a $1 \times 1$ convolution is used to integrate the channels and capture their feature relationships. Throughout this process, activation functions apply nonlinear transformations that help the model learn complex feature patterns more effectively, thereby improving the network’s expressiveness and generalization ability. This component can be expressed as:

$$Fa_{in} = A(\mathrm{Conv}_{1 \times 1}(Fa)) \tag{9}$$

where, as shown in Figure 3b, $Fa$ denotes the input of EFFN, $Fa_{in}$ represents the output feature after the $1 \times 1$ convolution and activation function, $\mathrm{Conv}_{1 \times 1}(\cdot)$ denotes a $1 \times 1$ convolutional layer, and $A(\cdot)$ represents the activation function.

$$Fa_{l\_out} = A(\mathrm{Conv}_{1 \times 1}(\mathrm{Conv}_{3 \times 3}(\mathrm{Split}(Fa_{in})))) \tag{10}$$

where $Fa_{l\_out}$ denotes the output after enhanced local information processing, $\mathrm{Conv}_{3 \times 3}(\cdot)$ denotes a $3 \times 3$ convolution, and $\mathrm{Split}(\cdot)$ denotes the channel splitting operation, wherein $Fa \in \mathbb{R}^{H \times W \times C}$ and $Fa_{in} \in \mathbb{R}^{H \times W \times C}$. To ensure that nonlocal information is not lost during local feature refinement, we concatenate $Fa_{l\_out}$ and $Fa_{2}$ and feed them into a $1 \times 1$ convolution for further feature blending, which also reduces the channels back to the original dimension. We also include a residual connection to preserve the detailed information of the output features and to facilitate the flow of information. This procedure can be stated as follows:

$$\{Fa_{1}, Fa_{2}\} = \mathrm{Split}(Fa_{in}) \tag{11}$$

$$F_{out} = Fa + \mathrm{Conv}_{1 \times 1}(C(Fa_{2}, Fa_{l\_out})) \tag{12}$$

where $Fa_{1}$ and $Fa_{2}$ denote the two outputs obtained after channel splitting of $Fa_{in}$, $F_{out}$ denotes the output of the EFFN module, and $C(\cdot)$ denotes channel concatenation.
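The data flow of Equations (9)–(12) can be traced with a small sketch in which every layer is passed in as a callable. It operates on a flat list of per-channel values, and it assumes that the $3 \times 3$ convolution in Equation (10) acts on the first split half $Fa_{1}$; this reading is consistent with Equation (12) but is not stated explicitly. All layer stand-ins below are hypothetical toys rather than learned operators.

```python
def effn_forward(fa, expand, local_conv, local_mix, fuse, act):
    """Trace the EFFN data flow on a list of per-channel values."""
    fa_in = [act(v) for v in expand(fa)]                       # Eq. (9)
    half = len(fa_in) // 2
    fa_1, fa_2 = fa_in[:half], fa_in[half:]                    # Eq. (11)
    fa_l_out = [act(v) for v in local_mix(local_conv(fa_1))]   # Eq. (10), on fa_1 (assumed)
    mixed = fuse(fa_2 + fa_l_out)                              # concatenation, then 1x1 conv
    return [a + b for a, b in zip(fa, mixed)]                  # Eq. (12), residual


def relu(v):
    return max(0.0, v)


fa = [1.0, -2.0]
f_out = effn_forward(
    fa,
    expand=lambda x: x + x,                        # toy channel expansion C -> 2C
    local_conv=lambda x: [2.0 * v for v in x],     # toy stand-in for the 3x3 conv
    local_mix=lambda x: x,                         # toy stand-in for the 1x1 conv
    fuse=lambda x: [x[0] + x[2], x[1] + x[3]],     # toy 1x1 conv reducing 2C -> C
    act=relu,
)
```

The residual term in the last step means the module only has to learn a correction to its input, which matches the stated aim of preserving detail and easing information flow.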

3.4. Feature Aggregation Module

In summary, our proposed DBFA and EFFN are combined into a feature aggregation module (FAM), which can be expressed as:

$$F_{f} = \mathrm{EFFN}(\mathrm{DBFA}(F_{in}) + F_{in}) \tag{13}$$

where $F_{in}$ denotes the initial input of the model, and $F_{f}$ denotes the output of the FAM module. In particular, the proposed DFCFNet is trained with the same loss function as SAFMN [7]. To intuitively analyze the network’s feature extraction capability at different stages, we visualize the feature maps at key layers, as shown in Figure 5. Starting from the LR input, the features are initially processed by a Conv layer before being progressively refined through DBFA, EFFN, and multiple FAMs, ultimately generating the SR output. The feature maps at different stages illustrate how the network captures both local and nonlocal information, while the color variations reflect the feature responses at different levels, further validating its effectiveness in extracting textures, edges, and global details.

4. Experimental Results and Analyses

4.1. Primary Task: RSISR

4.1.1. Datasets

We utilized three remote sensing datasets: UCMerced [45], AID [46] and RSSCN7 [47]. The UCMerced dataset includes images from 21 remote sensing scene categories, each comprising 100 images with a size of 256 × 256. To maintain consistency with earlier studies [48,49], we employed the same dataset split. During training, 10% of the data was held out to validate the model’s performance. The AID dataset contains 30 categories of remote sensing images covering various scenes, all with a resolution of 600 × 600. We randomly split the dataset into training and test sets. Five images were extracted from each category to validate the model, with the other images serving as test sets. The RSSCN7 dataset comprises 2800 images distributed among seven classes, each image with a resolution of 400 × 400 pixels. We also used the RSSCN7 dataset to test the generalization ability of the model.

4.1.2. Metrics

Consistent with many previous studies, we calculated each image’s PSNR and SSIM values to evaluate the quality of the reconstructed image; higher PSNR and SSIM values indicate better reconstruction quality. To compare the models in more detail, we also report the FLOPs, number of parameters, and inference time; smaller FLOPs and parameter counts indicate lower computational complexity and a smaller model. Additionally, we computed the SAM [50] and SCC [51] to evaluate the remote sensing images from a spectral perspective. The SAM computes the angle between two spectral vectors to determine how similar two spectra are; the smaller the angle, the better the spectral information is preserved during reconstruction and the better the visual quality of the reconstructed image. The SCC assesses the spatial correlation between image pixels and quantifies how closely the spatial structure of the reconstructed image resembles that of the original image. A higher SCC value means that the reconstructed image matches the original more closely in spatial structure, whereas a lower SCC value indicates that more details were lost or significant distortions were produced during reconstruction.
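For reference, PSNR and SAM can be computed as in the following minimal sketch, which uses the standard definitions $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^{2} / \mathrm{MSE})$ and $\mathrm{SAM} = \arccos\left(\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}\right)$. The peak value of 255 and the evaluation on raw pixel values are assumptions; the exact evaluation protocol (for example, color space or border cropping) is not described here.

```python
import math


def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two H x W images."""
    squared = [(a - b) ** 2
               for row_a, row_b in zip(reference, reconstructed)
               for a, b in zip(row_a, row_b)]
    mse = sum(squared) / len(squared)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def sam(u, v):
    """Spectral angle in radians between two spectral vectors of one pixel."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.acos(cosine)


worst_case = psnr([[0.0, 0.0]], [[255.0, 255.0]])  # MSE equals MAX^2, so 0 dB
orthogonal = sam([1.0, 0.0], [0.0, 1.0])           # pi / 2
parallel = sam([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # approximately 0
```

The SAM depends only on the direction of the spectral vectors, so a uniform brightness scaling, as in the last line, leaves the angle at approximately zero.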

4.1.3. Implementation Details

We employed scaling factors of ×2, ×3, and ×4 in training, in line with common practice in RSISR. Following previous work, we augmented the data by rotating and horizontally flipping the images. To minimize pixel-level discrepancies and keep the overall intensity distribution of the reconstructed images consistent with that of the original images, the proposed model employs a mean square error (MSE) loss. An FFT-based frequency loss simultaneously constrains the spectral information, which improves the texture and structure of the image and helps retain high-frequency details. The model was optimized with Adam, with an initial learning rate of 1 × 10⁻³ and 1000 k iterations. All experiments were conducted using the PyTorch 2.1.0 framework on an NVIDIA GeForce RTX 4090 GPU. Two DFCFNet versions with different levels of complexity were trained: the large version, DFCFNet-S, has 10 FAMs and 48 channels, whereas the ordinary version, DFCFNet, has 8 FAMs and 36 channels. Both are included in the comparison experiments.
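The combination of a pixel-domain MSE term and an FFT-based frequency term can be sketched as follows. This is a NumPy illustration only: the L1 distance between the complex spectra and the weight `lambda_freq` are assumptions, since this section does not state the exact form or weighting of the frequency loss.

```python
import numpy as np

def mse_loss(sr, hr):
    """Pixel-domain mean square error between SR output and HR target."""
    return float(np.mean((sr - hr) ** 2))

def fft_frequency_loss(sr, hr):
    """Mean absolute difference between the 2D spectra of the two images.

    The L1 form is an assumption made for this sketch.
    """
    sr_f = np.fft.fft2(sr, axes=(-2, -1))
    hr_f = np.fft.fft2(hr, axes=(-2, -1))
    return float(np.mean(np.abs(sr_f - hr_f)))

def total_loss(sr, hr, lambda_freq=0.05):
    """MSE plus a weighted frequency term; lambda_freq is a hypothetical weight."""
    return mse_loss(sr, hr) + lambda_freq * fft_frequency_loss(sr, hr)
```

The arrays are expected in channel-first layout (C x H x W), so that the last two axes are the spatial ones over which the FFT is taken.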

4.1.4. Quantitative Results

We compared DFCFNet with other leading RSISR methods on the UCMerced and AID datasets. Table 1 presents a comprehensive comparison in terms of the PSNR, SSIM, SCC, and SAM values. DFCFNet outperformed the CNN-based methods (i.e., SRCNN [5], DCM [52], LGCNet [53], HSENet [54], SRDD [55], FENet [56], and VDSR [20]) as well as the Transformer-based methods (e.g., TransENet [57] and OmniSR), and it did so consistently across all evaluation metrics. For a concrete comparison, we took the strong-performing OmniSR as a representative baseline. On the UCMerced dataset, DFCFNet achieved average improvements of 0.56 dB and 0.012 in its PSNR and SSIM values, respectively. In terms of spectral metrics, the SCC and SAM improved by 0.0305 and 0.004 on average. On the AID dataset, the proposed method achieved gains of 0.88 dB and 0.021 in its PSNR and SSIM values, while the SCC and SAM improved by 0.054 and 0.007, respectively. These results demonstrate that the proposed model achieves superior performance in both reconstruction accuracy and structural preservation, and the improvements in SCC and SAM further indicate enhanced spectral consistency.

Additionally, to examine our model’s performance more closely, we evaluated the images of each of the 30 scene categories separately. The comparative results are shown in Table 2. Compared with the second-best method, HSENet, DFCFNet improved the PSNR by 0.78 dB and 0.39 dB for the farmland and square categories, respectively, which demonstrates its effectiveness.

4.1.5. Qualitative Results

Figure 6 shows the visual results of different methods. Most methods produced obvious artifacts and blur. The image reconstructed by our DFCFNet is clearer than those reconstructed by the other methods in the figure and contains more realistic and complete contour information. This is mainly because the FLFB and NFEB in the DBFA work together to extract sufficient features from the image, enabling reconstruction with improved visual quality.

4.1.6. Inference Speed and Network Complexity

As shown in Table 3, although OmniSR has fewer parameters than previous approaches, our DFCFNet delivered a 0.39 dB higher PSNR than OmniSR while using almost four times fewer parameters. We attribute this to the proposed PCCM, in which we improve and combine PConv and a CCM. PConv not only reduces redundant computation but also extracts feature information more effectively, while the CCM further enhances feature representation and stabilizes training. The results indicate that the proposed PCCM is effective in reducing network complexity: the model is superior to the other networks not only in inference speed and network complexity but also in reconstruction quality. Furthermore, we carried out several ablation experiments on the PCCM, and Figure 7 provides additional evidence of the usefulness of this module.
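To make the computational saving of a partial convolution concrete, the following back-of-the-envelope sketch counts multiply-accumulate operations (MACs) for a dense 3 × 3 convolution and for one that convolves only a fraction of the channels while the remaining channels pass through unchanged. The partial ratio of 1/4 and the 64 × 64 feature-map size are assumptions chosen for illustration; only the 36-channel width comes from the text.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """MACs of a dense k x k convolution (bias ignored)."""
    return height * width * kernel * kernel * c_in * c_out

def pconv_macs(height, width, kernel, channels, partial_ratio=0.25):
    """MACs when only a fraction of the channels is convolved.

    partial_ratio is a hypothetical setting, not a value taken from the paper.
    """
    c_p = int(channels * partial_ratio)
    return conv_macs(height, width, kernel, c_p, c_p)

dense = conv_macs(64, 64, 3, 36, 36)
partial = pconv_macs(64, 64, 3, 36)
print(dense, partial, dense / partial)
```

Under these assumptions the cost scales with the square of the partial ratio, so convolving a quarter of the channels needs one sixteenth of the MACs of the dense layer.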

4.2. Extended Task: Natural Image SR

4.2.1. Datasets

We used the most popular natural image dataset, DIV2K [63], which contains 800 HR images covering a wide variety of scenes, as the training dataset, and chose popular benchmark datasets (Set14 [59], B100 [60], Urban100 [61], and Manga109 [62]) for testing. We generated the LR images by bicubic interpolation downsampling in MATLAB R2022b and compared the methods at ×2, ×3, and ×4.

4.2.2. Metrics

To be consistent with other studies, we assessed SR performance using the PSNR and SSIM. For natural images, evaluation is typically performed on the Y channel. Consequently, we converted the SR results to the YCbCr color space and evaluated on the Y channel only. Furthermore, for comparison, we assessed the network complexity of each method.
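The Y-channel evaluation can be sketched as follows. The conversion coefficients are those of the ITU-R BT.601 studio-range transform, as used for example by MATLAB's `rgb2ycbcr`; the choice of this particular transform is an assumption, since the text states only that results were converted to YCbCr. The function names are illustrative, and border cropping, which some evaluation protocols apply, is deliberately left out.

```python
import numpy as np

def rgb_to_y(rgb_uint8):
    """Luma (Y) of an H x W x 3 8-bit RGB image, BT.601 studio range [16, 235]."""
    rgb = np.asarray(rgb_uint8, dtype=np.float64) / 255.0
    return 16.0 + 65.481 * rgb[..., 0] + 128.553 * rgb[..., 1] + 24.966 * rgb[..., 2]

def psnr_y(hr_rgb, sr_rgb):
    """PSNR in dB computed on the Y channel only."""
    diff = rgb_to_y(hr_rgb) - rgb_to_y(sr_rgb)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(255.0 ** 2 / mse))
```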

4.2.3. Implementation Details

In line with previous methods, we employed rotation and horizontal flipping for data augmentation and used Adam for optimization. During training, we set the initial learning rate to 1 × 10⁻³, the batch size to 16, and the total number of iterations to 1000 k. All experiments were conducted in PyTorch on an NVIDIA RTX 3090Ti GPU.
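A paired augmentation of this kind can be sketched as below. The key point is that the same random transform must be applied to the LR input and its HR target so that they stay aligned. Restricting rotations to multiples of 90 degrees and using a flip probability of 0.5 are assumptions; the text specifies only "rotation and horizontal flipping".

```python
import numpy as np

def augment_pair(lr, hr, rng):
    """Apply the same random horizontal flip and 90-degree rotation to an LR/HR pair.

    lr and hr are H x W x C arrays; rng is a numpy.random.Generator.
    """
    if rng.random() < 0.5:
        lr, hr = lr[:, ::-1], hr[:, ::-1]  # flip along the width axis
    k = int(rng.integers(0, 4))            # number of 90-degree rotations, 0 to 3
    return np.rot90(lr, k), np.rot90(hr, k)
```

A typical call would be `augment_pair(lr_patch, hr_patch, np.random.default_rng())`, drawing a fresh transform for every training sample.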

4.2.4. Quantitative Results

We compared DFCFNet with other leading lightweight SISR methods (i.e., CARN [41], LMAN-S [48], IDN [39], IMDN [40], SMSR [64], FDIWN-M [20], RFDN [11], VLESR [65], GASSL-S [66], AMFFN [67], and FDSCSR-S [68]); Table 4 lists the comparative results. For the ×4 SR task, DFCFNet outperformed the second-best method, AMFFN, by an average of 0.11 dB in PSNR and 0.0016 in SSIM. This experiment confirms that our approach achieves a favorable balance between image reconstruction accuracy and the number of network parameters.

4.2.5. Qualitative Results

Figure 8 compares the visual results of several SR networks on natural image datasets at ×4. Our DFCFNet produces sharper contours and textures. This further supports the effectiveness of our approach, and DFCFNet shows better performance in both natural image SR and RSISR.

5. Ablation Study

We conducted extensive ablation experiments to observe more directly the influence and effectiveness of each module on the network. All ablation experiments were performed on the ×4 DFCFNet model, which was trained and evaluated on the AID and UCMerced datasets.

5.1. Effectiveness of DBFA

The proposed DBFA module contains two branches, the NFEB and the FLFB, which run in parallel to explore nonlocal information and extract local information, respectively, and which significantly improve accuracy. To show the contribution of DBFA more clearly, we removed both the NFEB and the FLFB and compared the result with DFCFNet. Table 5 shows that the PSNR and SSIM decreased by 0.3 dB and 0.0099 on the AID dataset and by 0.43 dB and 0.0147 on the UCMerced dataset. These experiments illustrate the importance of DBFA to DFCFNet.

Furthermore, to visualize the effectiveness of DBFA, we present the Local Attribution Map (LAM) and diffusion index (DI) in Figure 9. When only the NFEB was introduced, the distribution range of the red points expanded slightly, but the response intensity remained relatively weak. This indicates that the model still mainly focuses on local information modeling and that its feature representation capability has not been sufficiently enhanced. When only the FLFB was incorporated, the response intensity of the red points increased noticeably in local regions, but the spatial coverage remained limited. This suggests that while the model improves local feature representation, its ability to capture global information is still insufficient. In contrast, after DBFA was introduced, the red points not only exhibited stronger responses in local regions but also spread over a significantly wider spatial range. This demonstrates that DBFA effectively enhances feature interaction and information propagation, enabling the model to capture long-range dependencies while preserving local detail representation, thereby improving the completeness and consistency of the overall feature representation.

Because the DBFA module has a local feature-focused branch and a nonlocal feature exploration branch, we also ran complementary experiments on each branch individually to demonstrate their usefulness. Without the NFEB, the PSNR and SSIM decreased by 0.09 dB and 0.0028 on the AID dataset and by 0.13 dB and 0.0045 on the UCMerced dataset. To demonstrate the impact of the FLFB on DBFA, we replaced the PCCM in the FLFB with other modules, as detailed in Section 5.2. Without the FLFB, the PSNR fell on both the AID and UCMerced datasets. To illustrate the complementary nature of the FLFB and NFEB, we visualized the features using the power spectral density (PSD), as shown in Figure 10. Compared with F_{in}, the PSD of F_{g\_a} output by the FLFB is distributed in the peripheral area, indicating that high-frequency features are highlighted to strengthen the representation of image details, such as texture and edges. The PSD of F_{l\_p} output by the NFEB is concentrated in the central area, indicating that low-frequency features dominate and represent overall aspects, such as background information.
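The reading of a PSD plot as "central means low frequency, peripheral means high frequency" follows from shifting the zero-frequency bin to the center of the spectrum. A minimal NumPy sketch is given below; the helper `central_energy_ratio` and its radius are hypothetical, introduced here only to quantify how concentrated a spectrum is, and are not part of the analysis in Figure 10.

```python
import numpy as np

def power_spectral_density(feature_map):
    """Centered 2D power spectrum of an H x W feature map (DC bin at the center)."""
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))
    return np.abs(spectrum) ** 2

def central_energy_ratio(psd, radius):
    """Fraction of spectral energy within `radius` bins of the center."""
    h, w = psd.shape
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return float(psd[mask].sum() / psd.sum())
```

A smooth, background-like map yields a ratio close to 1, whereas a map dominated by fine texture or edges pushes energy toward the border of the spectrum and yields a ratio close to 0.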

5.2. Impact of PCCM

To further verify the effect of the FLFB on local information exploration, we removed the PCCM from the FLFB and compared the result with DFCFNet, as shown in Figure 7. The metric curves show that DFCFNet reaches a much higher PSNR and SSIM than the “Not-PCCM” variant and converges better. To further explore the effectiveness of the PCCM, we replaced it with PConv alone and with the CCM alone and tested both variants on UCMerced. In both cases, the two metrics ended up lower than those of DFCFNet. Furthermore, to evaluate the internal components of the PCCM, we conducted comparative experiments among “Pure PConv”, “Pure CCM”, and “CCM + PConv” on the AID and UCMerced datasets. The quantitative results are presented in Table 6. When only PConv was used, the parameter count and FLOPs were lower, but the PSNR and SSIM on both datasets were significantly lower than those of the combined configuration. When only the CCM was adopted, the parameter count and PSNR were comparable, but the SSIM decreased by 0.0006 and 0.0009 on the AID and UCMerced datasets, respectively.

After integrating PConv and the CCM, both the PSNR and SSIM were substantially improved on the two datasets. Meanwhile, the increase in parameter count and FLOPs is marginal compared to CCM alone. These results further demonstrate that incorporating PConv can enhance reconstruction performance while maintaining low computational cost, achieving a favorable trade-off between performance and complexity.

5.3. Effectiveness of the EFFN

To further refine the features, we introduced the EFFN. To confirm its impact on the model, we performed experiments in which the EFFN was removed and in which it was replaced. Table 5 shows that all the metrics decreased relative to DFCFNet when the EFFN was removed. As shown in Figure 7, when we replaced the EFFN with a regular FFN, all the metrics on the UCMerced dataset ended up lower. To further illustrate the roles of DBFA and the EFFN in image restoration, we conducted a comparative LAM analysis, as shown in Figure 9. In the LAM, the red points mark the pixels that contribute to the reconstruction of the region inside the rectangular box. Additionally, we computed the DI values, where a higher DI indicates broader pixel coverage. The results demonstrate that the proposed DFCFNet draws on information from a wider range of pixels, thereby enhancing the quality of image reconstruction.

5.4. Validity of Global Variance

We used the global variance operation in the NFEB to explore nonlocal information more fully; its effect can be observed directly in Figure 7. In addition, we conducted a validation study on the proposed global variance strategy by comparing it with self-attention (SA). With all other components unchanged, we replaced the global variance strategy with SA. The results are reported in Table 7. Although SA is capable of modeling long-range dependencies, the proposed global variance strategy achieved better performance, with average improvements of 0.13 dB in PSNR and 0.0046 in SSIM across the two public datasets.

Furthermore, Table 7 also compares the two strategies in terms of model complexity. SA has a similar number of parameters to the global variance strategy, but it incurs significantly higher computational costs in terms of FLOPs and average inference time. This further demonstrates that the proposed global variance strategy achieves a better trade-off between performance and computational efficiency.
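The gap in computational cost can be made plausible with a rough operation count. The sketch below compares single-head global self-attention over all H × W positions with a per-channel spatial variance statistic. It is an order-of-magnitude illustration under stated assumptions, not a reproduction of either configuration in Table 7: the SA variant used there may be windowed or otherwise reduced, and the global variance strategy involves more than the bare statistic counted here. Both function names are hypothetical.

```python
def self_attention_macs(height, width, channels):
    """Rough MACs of single-head global self-attention over H*W tokens."""
    n = height * width
    projections = 4 * n * channels * channels  # Q, K, V and output projections
    attention = 2 * n * n * channels           # Q @ K^T and attention @ V
    return projections + attention

def channel_variance_ops(height, width, channels):
    """Rough operation count of a per-channel spatial variance.

    One pass for the mean and one for the squared deviations.
    """
    n = height * width
    return 2 * n * channels

for side in (32, 64, 128):
    sa = self_attention_macs(side, side, 36)
    var = channel_variance_ops(side, side, 36)
    print(side, sa, var, round(sa / var))
```

The attention term grows quadratically with the number of pixels, while the variance statistic grows linearly, so the ratio between the two widens as the feature map gets larger.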

5.5. The Effects of FAM and Channel Number on the Network

Because the numbers of channels and FAMs can affect network performance to varying degrees, we experimented with different settings to see how they affected the model. We trained on the DIV2K dataset and evaluated on the Urban100 and Manga109 datasets. As shown in Table 8, although the configuration with 48 channels and 14 FAMs achieved better results, its parameter count was about three times that of the configuration with 36 channels and 8 FAMs, and its memory consumption was higher.
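The roughly threefold parameter increase is consistent with a simple scaling estimate. If the parameter count is assumed to be dominated by layers whose size grows with the square of the channel width and linearly with the number of FAMs, the ratio between the two configurations can be computed directly. This proportionality is an assumption made for the sketch; the exact counts are those reported in Table 8.

```python
def relative_params(channels, blocks, base_channels=36, base_blocks=8):
    """Parameter ratio assuming params are proportional to blocks * channels^2."""
    return (blocks * channels ** 2) / (base_blocks * base_channels ** 2)

print(round(relative_params(48, 14), 2))  # 3.11
```

Under this assumption, widening from 36 to 48 channels contributes a factor of about 1.78 and going from 8 to 14 FAMs a factor of 1.75.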

6. Discussion

The proposed method preserves richer detail information and achieves satisfactory results. To evaluate the generalization ability of the model, experiments were conducted on both remote sensing datasets and natural image datasets. The experimental results demonstrate that the proposed method not only achieves competitive performance but also exhibits strong generalization capability. However, this study still has several limitations. First, the model has been validated primarily under relatively limited spectral settings, and its generalization to more complex spectral data, such as multispectral or hyperspectral imagery, remains to be explored. Second, real-world remote sensing imaging often involves multiple degradation factors, including complex noise, blur, and compression artifacts, whereas the proposed method has been trained mainly under relatively idealized or fixed degradation assumptions; its robustness under more challenging degradation conditions therefore has room for improvement. In addition, in practical deployment, model performance may be constrained by hardware platforms and runtime environments. For instance, when the model is deployed on platforms such as HiSilicon chips and the input images contain severe compression noise, the super-resolution process may further amplify that noise, increasing the difficulty of reconstruction.

Future work will focus on the following directions. First, we aim to extend the model to handle multispectral or higher-dimensional spectral data to improve its generalization capability. Second, incorporating unsupervised or self-supervised learning strategies may enhance the model’s robustness and adaptability to complex degradations. Finally, we plan to explore more efficient lightweight designs to improve deployment efficiency in real-world applications.

7. Conclusions

We propose DFCFNet, a lightweight and effective model for image SR. The model contains a DBFA module and an EFFN module. The DBFA module consists of an FLFB and an NFEB, which process the input in parallel. In the FLFB, we develop a PCCM to capture local detail information, while the NFEB introduces global variance to explore nonlocal detail information. To keep the network lightweight, the outputs of the two branches are combined by a simple and efficient fusion at the end. In addition, we introduced the EFFN to make better use of the local and nonlocal features produced by the DBFA module across channel and spatial information. To verify the generalization ability of our model, we also trained and evaluated it on natural images. Extensive experimental results show that the proposed DFCFNet strikes a favorable balance among reconstruction performance, computational efficiency, and model size.

Author Contributions

Conceptualization, M.Z.; methodology, M.Z.; software, M.Z.; validation, M.Z.; formal analysis, M.Z.; investigation, X.C.; resources, M.Z., H.G. and W.Z.; data curation, J.P. and X.C.; writing—original draft preparation, M.Z.; writing—review and editing, W.Z., Q.W., M.Z., J.P. and X.C.; visualization, M.Z.; supervision, H.G. and W.Z.; project administration, H.G. and Q.W.; funding acquisition, H.G. and Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China General Program under Grant 62471389, and in part by Shaanxi Province Technological Innovation Guidance Special Project: Regional Science and Technology Innovation Center, Strategic Scientific and Technological Strength Category: 2024QY-SZX-26.

Data Availability Statement

The data for this article are presented in the article. The data and materials supporting the findings are available from the corresponding author upon reasonable request.

Acknowledgments

The authors sincerely thank all colleagues in the laboratory for their support and assistance during this study. The authors especially appreciate the editor for their meticulous work and professional guidance and extend their sincere gratitude to the anonymous reviewers for their constructive comments and valuable suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

HR	High-Resolution
LR	Low-Resolution
SR	Super-Resolution
CNN	Convolutional Neural Network
GAN	Generative Adversarial Network
PSNR	Peak Signal-to-Noise Ratio
SSIM	Structural Similarity Index
MSE	Mean Square Error
FLOPs	Floating Point Operations

References

1. Zhao, Q.; Lyu, S.; Chen, L.; Liu, B.; Xu, T.B.; Cheng, G.; Feng, W. Learn by oneself: Exploiting weight-sharing potential in knowledge distillation guided ensemble network. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 6661–6678.
2. López-Cifuentes, A.; Escudero-Viñolo, M.; Bescós, J.; San Miguel, J.C. Attention-based knowledge distillation in scene recognition: The impact of a DCT-driven loss. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4769–4783.
3. Zhang, L.; Lu, W.; Huang, Y.; Sun, X.; Zhang, H. Unpaired Remote Sensing Image Super-Resolution with Multi-Stage Aggregation Networks. Remote Sens. 2021, 13, 3167.
4. Ahn, N.; Kang, B.; Sohn, K.A. Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 252–268.
5. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
6. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407.
7. Sun, L.; Dong, J.; Tang, J.; Pan, J. Spatially-adaptive feature modulation for efficient image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 13190–13199.
8. Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single image super-resolution via a holistic attention network. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 191–207.
9. Zhang, X.; Zeng, H.; Guo, S.; Zhang, L. Efficient long-range attention network for image super-resolution. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 649–667.
10. Li, F.; Bai, H.; Zhao, Y. FilterNet: Adaptive information filtering network for accurate and fast image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1511–1523.
11. Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 41–55.
12. Zheng, P.; Jiang, J.; Zhang, Y.; Zeng, C.; Qin, C.; Li, Z. CGC-net: A context-guided constrained network for remote-sensing image super resolution. Remote Sens. 2023, 15, 3171.
13. Gao, X.; Zhang, L.; Mou, X. Single image super-resolution using dual-branch convolutional neural network. IEEE Access 2018, 7, 15767–15778.
14. Wang, Y.; Zhang, H.; Zeng, X.; Wang, B.; Li, W.; Ding, W. Binary Lightweight Neural Networks for Arbitrary Scale Super-Resolution of Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–16.
15. Wang, Y.; Shao, Z.; Lu, T.; Huang, X.; Wang, J.; Zhang, Z.; Zuo, X. Lightweight remote sensing super-resolution with multi-scale graph attention network. Pattern Recognit. 2025, 160, 111178.
16. Chen, R.; Zhang, Y. Learning dynamic generative attention for single image super-resolution. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 8368–8382.
17. Liu, Y.; Jia, Q.; Fan, X.; Wang, S.; Ma, S.; Gao, W. Cross-SRN: Structure-preserving super-resolution network with cross convolution. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4927–4939.
18. Zuo, Y.; Xie, J.; Wang, H.; Fang, Y.; Liu, D.; Wen, W. Gradient-guided single image super-resolution based on joint trilateral feature filtering. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 505–520.
19. Liu, Z.; Feng, R.; Wang, L.; Han, W.; Zeng, T. Dual learning-based graph neural network for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14.
20. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1646–1654.
21. Wang, C.; Zhang, X.; Yang, W.; Wang, G.; Li, X.; Wang, J.; Lu, B. MSWAGAN: Multi-spectral remote sensing image super resolution based on multi-scale window attention transformer. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15.
22. Yang, Y.; Zhao, H.; Huangfu, X.; Li, Z.; Wang, P. ViT-ISRGAN: A High-Quality Super-Resolution Reconstruction Method for Multi-Spectral Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 3973–3988.
23. Ma, J.; Zhang, L.; Zhang, J. SD-GAN: Saliency-discriminated GAN for remote sensing image superresolution. IEEE Geosci. Remote Sens. Lett. 2019, 17, 1973–1977.
24. Wu, H.; Zhang, L.; Ma, J. Remote sensing image super-resolution via saliency-guided feedback GANs. IEEE Trans. Geosci. Remote Sens. 2020, 60, 1–16.
25. Zhu, C.; Liu, Y.; Huang, S.; Wang, F. Taming a diffusion model to revitalize remote sensing image super-resolution. Remote Sens. 2025, 17, 1348.
26. Wang, C.; Sun, W. Semantic guided large scale factor remote sensing image super-resolution with generative diffusion prior. ISPRS J. Photogramm. Remote Sens. 2025, 220, 125–138.
27. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10684–10695.
28. Wang, J.; Yue, Z.; Zhou, S.; Chan, K.C.; Loy, C.C. Exploiting diffusion prior for real-world image super-resolution. Int. J. Comput. Vis. 2024, 132, 5929–5949.
29. Hu, Y.; Gao, X.; Li, J.; Huang, Y.; Wang, H. Single image super-resolution via cascaded multi-scale cross network. arXiv 2018, arXiv:1802.08808.
30. Wang, H. MSRDN: A Super-Resolution Network for Human Body. In Proceedings of the 2024 3rd International Conference on Innovations and Development of Information Technologies and Robotics (IDITR), Hong Kong, China, 23–25 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 178–182.
31. Ou, B.; Shao, G.; Yang, B.; Fei, S. FocalSR: Revisiting image super-resolution transformers with fourier-transform cross attention layers for remote sensing image enhancement. Geomatica 2025, 77, 100042.
32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
33. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 22367–22377.
34. Park, N.; Kim, S. How do vision transformers work? arXiv 2022, arXiv:2202.06709.
35. Vanyan, A.; Barseghyan, A.; Tamazyan, H.; Huroyan, V.; Khachatrian, H.; Danelljan, M. Analyzing local representations of self-supervised vision transformers. arXiv 2023, arXiv:2401.00463.
36. Shi, M.; Gao, Y.; Chen, L.; Liu, X. Dual-branch multiscale channel fusion unfolding network for optical remote sensing image super-resolution. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5.
37. Hou, J.; Si, Y.; Li, L. Image super-resolution reconstruction method based on global and local residual learning. In Proceedings of the 2019 IEEE 4th International Conference on Image, Vision and Computing (ICIVC), Xiamen, China, 5–7 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 341–348.
38. Li, H.; Deng, W.; Zhu, Q.; Guan, Q.; Luo, J. Local-global context-aware generative dual-region adversarial networks for remote sensing scene image super-resolution. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5402114.
39. Hui, Z.; Wang, X.; Gao, X. Fast and accurate single image super-resolution via information distillation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 723–731.
40. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; ACM: New York, NY, USA, 2019; pp. 2024–2032.
41. Tai, Y.; Yang, J.; Liu, X. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3147–3155.
42. Kong, F.; Li, M.; Liu, S.; Liu, D.; He, J.; Bai, Y.; Chen, F.; Fu, L. Residual local feature network for efficient super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 766–776.
43. Gao, F.; Li, L.; Wang, J.; Sun, K.; Lv, M.; Jia, Z.; Ma, H. A lightweight feature distillation and enhancement network for super-resolution remote sensing images. Sensors 2023, 23, 3906.
44. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946.
45. Yang, Y.; Newsam, S. Bag-of-Visual-Words and Spatial Extensions for Land-Use Classification; ACM: New York, NY, USA, 2010; pp. 270–279.
46. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Zhang, L. AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
47. Zou, Q.; Ni, L.; Zhang, T.; Wang, Q. Deep learning based feature selection for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 2015, 12, 2321–2325.
48. Wan, J.; Yin, H.; Liu, Z.; Chong, A.; Liu, Y. Lightweight Image Super-Resolution by Multi-Scale Aggregation. IEEE Trans. Broadcast. 2021, 67, 372–382.
49. Hajian, A.; Aramvith, S. AERU-Net: Adaptive Edge Recovery and Attention U-Shaped Network for Remote Sensing Image Super-Resolution. IEEE Access 2025, 13, 59177–59197.
50. Yuhas, R.H.; Goetz, A.F.; Boardman, J.W. Discrimination among semi-arid landscape endmembers using the spectral angle mapper (SAM) algorithm. In Proceedings of the JPL, Summaries of the Third Annual JPL Airborne Geoscience Workshop, 1 June 1992; AVIRIS Workshop; NASA: Washington, DC, USA, 1992; Volume 1.
51. Zhou, J.; Civco, D.L.; Silander, J.A. A wavelet transform method to merge Landsat TM and SPOT panchromatic data. Int. J. Remote Sens. 1998, 19, 743–757.
52. Haut, J.M.; Paoletti, M.E.; Fernandez-Beltran, R.; Plaza, J.; Plaza, A.; Li, J. Remote Sensing Single-Image Superresolution Based on a Deep Compendium Model. IEEE Geosci. Remote Sens. Lett. 2019, 16, 1432–1436.
53. Lei, S.; Shi, Z.; Zou, Z. Super-Resolution for Remote Sensing Images via Local–Global Combined Network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1243–1247.
54. Lei, S.; Shi, Z. Hybrid-Scale Self-Similarity Exploitation for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5401410.
55. Maeda, S. Image Super-Resolution with Deep Dictionary. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022.
56. Wang, Z.; Li, L.; Xue, Y.; Jiang, C.; Wang, J.; Sun, K.; Ma, H. FeNet: Feature Enhancement Network for Lightweight Remote-Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5622112.
57. Lei, S.; Shi, Z.; Mo, W. Transformer-Based Multistage Enhancement for Remote Sensing Image Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2022, 60, 11.
58. Wang, H.; Chen, X.; Ni, B.; Liu, Y.; Liu, J. Omni aggregation networks for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 22378–22387.
59. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the International Conference on Curves and Surfaces; Springer: Berlin/Heidelberg, Germany, 2010; pp. 711–730.
60. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 898–916.
61. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 5197–5206.
62. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838.
63. Timofte, R.; Agustsson, E.; Van Gool, L.; Yang, M.H.; Zhang, L. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 114–125.
64. Wang, L.; Dong, X.; Wang, Y.; Ying, X.; Lin, Z.; An, W.; Guo, Y. Exploring Sparsity in Image Super-Resolution for Efficient Inference. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 4915–4924.
65. Gao, D.; Zhou, D. A very lightweight and efficient image super-resolution network. Expert Syst. Appl. 2023, 213, 118898.
66. Wang, H.; Zhang, Y.; Qin, C.; Van Gool, L.; Fu, Y. Global aligned structured sparsity learning for efficient image super-resolution. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10974–10989.
67. Wang, H.; Cheng, S.; Li, Y.; Du, A. Lightweight remote-sensing image super-resolution via attention-based multilevel feature fusion network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2005715.
Wang, Z.; Gao, G.; Li, J.; Yan, H.; Zheng, H.; Lu, H. Lightweight feature de-redundancy and self-calibration network for efficient image super-resolution. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 110. [Google Scholar] [CrossRef]
Kim, J.; Lee, J.K.; Lee, K.M. Deeply-Recursive Convolutional Network for Image Super-Resolution; IEEE: Piscataway, NJ, USA, 2016. [Google Scholar]
Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution; IEEE: Piscataway, NJ, USA, 2017. [Google Scholar]
Haris, M.; Shakhnarovich, G.; Ukita, N. Deep back-projection networks for super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1664–1673. [Google Scholar]
Gu, J.; Dong, C. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9199–9208. [Google Scholar]

Figure 1. Overall performance of the proposed method compared with other state-of-the-art methods (×4 SR on the UCMerced dataset). Circle size represents the FLOPs of each model. The proposed DFCFNet achieves a better balance between reconstruction performance and computational efficiency.

Figure 2. The overall architecture diagram of our proposed DFCFNet.

Figure 3. Structure of the key modules. (a) DBFA. (b) EFFN. (c) ADSC. (d) PCCM.

Figure 4. PConv.

Figure 5. Visualization of the network feature maps. NFEB denotes the Non-Focal Exploration Branch, FLFB the Focused Local Feature Branch, DBFA the dual-branch feature aggregation, and FAM the feature aggregation module. Pseudo-color is used to highlight features in the feature maps.

Figure 6. Visual comparisons for ×4 SR on the RSSCN7 and UCMerced datasets.

Figure 7. Impact of different modules on the evaluation metrics. (a) Effect of different models on PSNR. (b) Effect of different models on SSIM. The module before ‘–’ replaces the module after it. Evaluated at ×4 on the UCMerced dataset.

Figure 8. Visual comparisons for ×4 SR on the BSD100 and Urban100 datasets.

Figure 9. A comparative analysis of LAMs and DIs [72]. DFCFNet makes full use of rich feature information to reconstruct more accurate and structured images.

Figure 10. The power spectral density (PSD). FLFB activates more high-frequency information, while NFEB activates more low-frequency information.
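The frequency comparison summarized in Figure 10 can be illustrated with a short NumPy sketch. It is not the authors' analysis code: the function names `power_spectral_density` and `high_frequency_ratio`, the `cutoff` of 0.25 cycles per pixel, and the synthetic sinusoidal "feature maps" are all assumptions made for illustration. The sketch computes a centred 2-D power spectrum and the share of power that lies above a chosen radial frequency.

```python
import numpy as np


def power_spectral_density(feature_map: np.ndarray) -> np.ndarray:
    """Centred 2-D power spectrum of a single-channel map (hypothetical helper)."""
    # Subtract the mean so the DC component does not dominate the spectrum.
    centred = feature_map - feature_map.mean()
    spectrum = np.fft.fftshift(np.fft.fft2(centred))
    return np.abs(spectrum) ** 2


def high_frequency_ratio(feature_map: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of spectral power at radial frequencies above `cutoff` (cycles/pixel)."""
    psd = power_spectral_density(feature_map)
    h, w = feature_map.shape
    # Frequency coordinates arranged to match the fftshift-ed spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    total = psd.sum()
    if total == 0:
        return 0.0  # constant map: no power left after mean removal
    return float(psd[radius > cutoff].sum() / total)


# Synthetic stand-ins for feature maps (assumed data, not taken from the paper).
columns = np.tile(np.arange(64), (64, 1))
smooth_map = np.sin(2 * np.pi * 2 * columns / 64)    # 2 cycles across the width
detail_map = np.sin(2 * np.pi * 24 * columns / 64)   # 24 cycles across the width

print("smooth map high-frequency share:", high_frequency_ratio(smooth_map))
print("detail map high-frequency share:", high_frequency_ratio(detail_map))
```

Under this reading, a branch that activates more high-frequency information would yield a larger share than one that activates mostly low-frequency content. The threshold is arbitrary and would need to be chosen to match whatever procedure produced the figure.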

Table 1. Average evaluation metrics on the UCMerced and AID datasets. The best and second-best results are highlighted in red and blue.

Scale	Metric	SRCNN [5]	VDSR [20]	DCM [52]	LGCNet [53]	HSENet [54]	TransENet [57]	SRDD [55]	FENet [56]	OmniSR [58]	DFCFNet-S	DFCFNet
UCMerced $\times 2$	PSNR	33.04	33.95	34.14	33.54	34.32	34.05	34.25	34.14	34.33	34.58	34.69
	SSIM	0.9181	0.9281	0.9306	0.9242	0.9320	0.9294	0.9319	0.9304	0.9324	0.9355	0.9362
	SCC	0.5823	0.6228	0.6319	0.6051	0.6392	0.6275	0.6374	0.6332	0.6393	0.6519	0.6567
	SAM	0.0575	0.0515	0.0505	0.0542	0.0492	0.0511	0.0498	0.0507	0.0494	0.0472	0.0465
UCMerced $\times 3$	PSNR	29.00	29.78	29.86	29.36	30.04	29.90	29.92	29.93	29.56	30.41	30.51
	SSIM	0.8142	0.8354	0.8393	0.8247	0.8433	0.8397	0.8411	0.8407	0.8445	0.8519	0.8542
	SCC	0.3444	0.3965	0.4025	0.3666	0.4131	0.3988	0.4085	0.4016	0.4013	0.4368	0.4439
	SAM	0.0901	0.0826	0.0820	0.0866	0.0805	0.0816	0.0815	0.0814	0.0827	0.0761	0.0753
UCMerced $\times 4$	PSNR	26.92	27.56	27.60	27.18	27.75	27.78	27.67	27.70	27.78	28.09	28.17
	SSIM	0.7286	0.7522	0.7556	0.7394	0.7611	0.7635	0.7609	0.7589	0.7657	0.7730	0.7769
	SCC	0.2156	0.2590	0.2610	0.2286	0.2692	0.2701	0.2718	0.2639	0.2780	0.3011	0.3095
	SAM	0.1128	0.1054	0.1051	0.1098	0.1034	0.1029	0.1043	0.1041	0.1030	0.0983	0.0970
AID $\times 2$	PSNR	34.74	35.20	35.35	35.00	35.50	35.40	35.33	35.31	34.85	35.51	35.64
	SSIM	0.9299	0.9349	0.9366	0.9327	0.9383	0.9372	0.9367	0.9361	0.9381	0.9383	0.9396
	SCC	0.6096	0.6221	0.6407	0.6173	0.6626	0.6538	0.6373	0.6371	0.6395	0.6486	0.6674
	SAM	0.0571	0.0539	0.0531	0.0554	0.0524	0.0530	0.0531	0.0535	0.0539	0.0520	0.0493
AID $\times 3$	PSNR	30.63	31.25	31.36	30.87	31.49	31.50	31.38	31.33	31.53	31.55	31.62
	SSIM	0.8380	0.8526	0.8557	0.8441	0.8588	0.8588	0.8564	0.8648	0.8594	0.8596	0.8617
	SCC	0.3538	0.3848	0.3971	0.3647	0.4053	0.4067	0.3984	0.3961	0.4082	0.4111	0.4153
	SAM	0.0891	0.0892	0.0820	0.0866	0.0806	0.0806	0.0817	0.0823	0.0804	0.0801	0.0793
AID $\times 4$	PSNR	28.51	29.01	29.20	28.67	29.32	29.44	29.21	29.15	27.85	29.36	29.43
	SSIM	0.7577	0.7746	0.7826	0.7646	0.7867	0.7912	0.7835	0.7803	0.7319	0.7875	0.7898
	SCC	0.2153	0.2428	0.2679	0.2198	0.2765	0.2884	0.2695	0.2603	0.1618	0.2831	0.2892
	SAM	0.1116	0.1055	0.1032	0.1094	0.1016	0.1002	0.1030	0.1039	0.1203	0.1011	0.1002
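For readers unfamiliar with the metrics in Table 1, the following is a minimal sketch of two of them, PSNR and SAM, using their standard textbook definitions. The function names, the `max_value` of 255, and the small `eps` guard are assumptions. The evaluation protocol behind the table (color space, border cropping, per-image averaging) is not specified in this section, so this sketch should not be expected to reproduce the reported numbers.

```python
import numpy as np


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((ref - est) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


def sam(reference, estimate, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel channel vectors.

    Both inputs are expected as H x W x C arrays; `eps` (an assumption)
    avoids division by zero for all-zero pixels.
    """
    ref = np.asarray(reference, dtype=np.float64)
    est = np.asarray(estimate, dtype=np.float64)
    dot = np.sum(ref * est, axis=-1)
    norms = np.linalg.norm(ref, axis=-1) * np.linalg.norm(est, axis=-1)
    cosine = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cosine)))


# Toy example with made-up pixel values (not data from the paper).
hr = np.full((4, 4, 3), 100.0)
sr = hr + 10.0
print("PSNR (dB):", psnr(hr, sr))
print("SAM (rad):", sam(hr, sr))
```

Higher PSNR is better, while lower SAM is better, which matches the direction of improvement for these two rows in Table 1.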

Table 2. Average PSNR (dB) of each class on the AID dataset at the ×4 scale. The best and second-best results are highlighted in red and blue.

Class	Bicubic	SRCNN [5]	LGCNet [53]	VDSR [20]	DCM [52]	HSENet [54]	DFCFNet-S (Ours)	DFCFNet (Ours)
airport	27.03	28.17	28.39	28.82	28.99	29.03	29.17	29.25
bareland	34.88	35.63	35.78	36.17	36.21	36.21	36.38	36.40
baseballfield	29.06	30.51	30.75	31.18	31.36	31.23	31.56	31.62
beach	31.07	31.92	32.08	32.29	32.45	32.76	32.66	32.68
bridge	28.98	30.41	30.67	31.19	31.39	31.30	31.64	31.68
center	25.26	26.59	26.92	27.48	27.72	27.84	27.98	28.06
church	22.15	23.41	23.68	24.12	24.29	24.39	24.49	24.54
commercial	25.83	27.05	27.24	27.62	27.78	27.99	27.96	28.01
denseresidential	23.05	24.13	24.33	24.70	24.87	25.13	25.08	25.14
desert	38.49	38.84	39.06	39.13	39.27	39.37	39.51	39.52
farmland	32.30	33.48	33.77	34.20	34.42	33.90	34.61	34.68
forest	27.39	28.15	28.20	28.36	28.47	28.31	28.57	28.59
industrial	24.75	26.00	26.24	26.72	26.92	26.99	27.14	27.22
meadow	32.06	32.57	32.65	32.77	32.88	32.74	32.97	32.99
mediumresidential	26.09	27.37	27.63	28.06	28.25	28.45	28.49	28.54
mountain	28.04	28.90	28.97	29.11	29.18	29.26	29.26	29.28
park	26.23	27.25	27.37	27.69	27.82	28.01	27.97	28.02
parking	22.33	24.01	24.40	25.21	25.74	26.17	26.25	26.41
playground	27.27	28.72	29.04	29.62	29.92	31.18	30.30	30.31
pond	28.94	29.85	30.00	30.26	30.29	30.40	30.50	30.55
port	24.69	25.82	26.02	26.43	26.62	26.92	26.85	26.94
railwaystation	26.31	27.55	27.76	28.19	28.38	28.47	28.56	28.62
resort	25.98	27.12	27.32	27.71	27.88	27.99	28.08	28.13
river	29.61	30.48	30.60	30.82	30.91	30.88	31.01	31.03
school	24.91	26.13	26.34	26.78	26.94	27.51	27.17	27.25
sparseresidential	25.41	26.16	26.27	26.46	26.53	26.43	26.64	26.67
square	26.75	28.13	28.39	28.91	29.13	29.05	29.38	29.44
stadium	24.81	26.10	26.37	26.88	27.10	27.28	27.32	27.41
storagetanks	24.18	25.27	25.48	25.86	26.00	26.07	26.18	26.23
viaduct	25.86	27.03	27.26	27.74	27.93	28.12	28.13	28.21
AVG	27.30	28.40	28.61	28.99	29.17	29.21	29.39	29.45

Table 3. Comparison of different methods in terms of parameters, FLOPs, inference time, and PSNR on the UCMerced dataset at the ×4 scale.

Method	Params (M)	FLOPs (G)	Time (ms)	PSNR
SRCNN [5]	0.07	4.53	0.409	26.92
VDSR [20]	0.67	44.03	5.138	27.56
LGCNet [53]	0.19	12.65	1.788	27.18
DCM [52]	2.17	13.00	3.391	27.60
HSENet [54]	5.43	19.20	41.105	27.75
TransENet [57]	37.46	21.44	26.873	27.78
FENet [56]	3.32	12.91	146.018	27.70
SRDD [55]	9.34	6.46	27.969	27.67
OmniSR [58]	3.26	12.94	102.388	27.78
DFCFNet-S (Ours)	0.36	20.00	12.25	28.09
DFCFNet (Ours)	0.79	43.48	20.79	28.17
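One way to read the trade-off in Table 3 is to ask which methods are not dominated, that is, for which no other method is at least as cheap and at least as accurate. The sketch below copies the rows of Table 3 and computes such a Pareto front. The `Entry` type and the `pareto_front` helper are hypothetical names introduced here, and this analysis is an illustration rather than part of the paper's evaluation.

```python
from typing import Callable, List, NamedTuple


class Entry(NamedTuple):
    name: str
    params_m: float   # parameters, millions
    flops_g: float    # FLOPs, G
    time_ms: float    # inference time, ms
    psnr_db: float    # PSNR on UCMerced at x4


# Values copied from Table 3.
TABLE_3 = [
    Entry("SRCNN", 0.07, 4.53, 0.409, 26.92),
    Entry("VDSR", 0.67, 44.03, 5.138, 27.56),
    Entry("LGCNet", 0.19, 12.65, 1.788, 27.18),
    Entry("DCM", 2.17, 13.00, 3.391, 27.60),
    Entry("HSENet", 5.43, 19.20, 41.105, 27.75),
    Entry("TransENet", 37.46, 21.44, 26.873, 27.78),
    Entry("FENet", 3.32, 12.91, 146.018, 27.70),
    Entry("SRDD", 9.34, 6.46, 27.969, 27.67),
    Entry("OmniSR", 3.26, 12.94, 102.388, 27.78),
    Entry("DFCFNet-S", 0.36, 20.00, 12.25, 28.09),
    Entry("DFCFNet", 0.79, 43.48, 20.79, 28.17),
]


def pareto_front(entries: List[Entry], cost: Callable[[Entry], float]) -> List[str]:
    """Names of entries not dominated in (lower cost, higher PSNR)."""
    front = []
    for a in entries:
        dominated = any(
            cost(b) <= cost(a)
            and b.psnr_db >= a.psnr_db
            and (cost(b) < cost(a) or b.psnr_db > a.psnr_db)
            for b in entries
        )
        if not dominated:
            front.append(a.name)
    return front


print("Pareto front by parameters:", pareto_front(TABLE_3, lambda e: e.params_m))
print("Pareto front by FLOPs:", pareto_front(TABLE_3, lambda e: e.flops_g))
```

The `cost` argument can be swapped for `lambda e: e.time_ms` to repeat the comparison on inference time.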

Table 4. Evaluation results of SR on five benchmark datasets. The best result is shown in red, and the second-best result is shown in blue.

Method	Scale	Params	Set5 PSNR	Set5 SSIM	Set14 PSNR	Set14 SSIM	BSD100 PSNR	BSD100 SSIM	Urban100 PSNR	Urban100 SSIM	Manga109 PSNR	Manga109 SSIM
DRCN [69]	$\times 2$	1774K	37.63	0.9588	33.04	0.9118	31.85	0.8942	30.75	0.9133	-	-
EDSR [70]	$\times 2$	1370K	37.99	0.9604	33.57	0.9175	32.16	0.8994	31.98	0.9272	38.54	0.9769
DBPN [71]	$\times 2$	-	38.09	0.9600	33.85	0.9190	32.27	0.9000	33.02	0.9310	39.32	0.9780
IDN [39]	$\times 2$	553K	37.83	0.9600	33.30	0.9148	32.08	0.8985	31.27	0.9196	38.01	0.9749
CARN [41]	$\times 2$	1592K	37.76	0.9590	33.52	0.9166	32.09	0.8978	31.92	0.9256	38.36	0.9765
IMDN [40]	$\times 2$	694K	38.00	0.9605	33.63	0.9177	32.19	0.8996	32.17	0.9283	38.88	0.9774
RFDN [11]	$\times 2$	534K	38.05	0.9606	33.68	0.9184	32.16	0.8994	32.12	0.9278	38.88	0.9773
LMAN-S [48]	$\times 2$	525K	37.94	0.9603	33.49	0.9167	32.08	0.8984	31.85	0.9251	38.43	0.9765
SMSR [64]	$\times 2$	985K	38.00	0.9601	33.64	0.9179	32.17	0.8990	32.19	0.9284	38.76	0.9771
FDIWN-M [20]	$\times 2$	433K	38.03	0.9606	33.60	0.9179	32.17	0.8995	32.19	0.9284	-	-
VLESR [65]	$\times 2$	311K	38.01	0.9605	33.58	0.9177	32.16	0.8993	32.14	0.9280	38.75	0.9770
GASSL-S [66]	$\times 2$	280K	37.91	0.9602	33.53	0.9172	32.14	0.8992	31.81	0.9253	38.57	0.9769
FDSCSR-S [68]	$\times 2$	466K	38.02	0.9606	33.51	0.9174	32.18	0.8996	32.24	0.9288	38.67	0.9771
AMFFN [67]	$\times 2$	298K	38.07	0.9607	33.59	0.9178	32.21	0.9001	32.37	0.9299	38.89	0.9774
DFCFNet-S (Ours)	$\times 2$	360K	38.09	0.9608	33.70	0.9186	32.22	0.9002	32.29	0.9292	39.01	0.9777
DFCFNet (Ours)	$\times 2$	792K	38.18	0.9611	33.88	0.9204	32.29	0.9011	32.69	0.9331	39.26	0.9780
DRCN [69]	$\times 3$	1774K	33.82	0.9226	29.76	0.8311	28.80	0.7963	27.15	0.8276	-	-
EDSR [70]	$\times 3$	1555K	34.37	0.9270	30.28	0.8417	29.09	0.8052	28.15	0.8527	33.45	0.9439
DBPN [71]	$\times 3$	-	32.47	0.8980	28.82	0.7860	27.72	0.7400	28.08	0.7950	31.50	0.9140
IDN [39]	$\times 3$	553K	34.11	0.9253	29.99	0.8354	28.95	0.8013	27.42	0.8359	32.71	0.9381
CARN [41]	$\times 3$	1592K	34.29	0.9255	30.29	0.8407	29.06	0.8034	28.06	0.8493	33.50	0.9440
IMDN [40]	$\times 3$	703K	34.36	0.9270	30.32	0.8417	29.09	0.8046	28.17	0.8519	33.61	0.9445
RFDN [11]	$\times 3$	541K	34.41	0.9273	30.34	0.8420	29.09	0.8050	28.21	0.8525	33.67	0.9449
LMAN-S [48]	$\times 3$	709K	34.31	0.9265	30.24	0.8397	29.02	0.8030	28.02	0.8487	33.42	0.9433
SMSR [64]	$\times 3$	993K	34.40	0.9270	30.33	0.8412	29.10	0.8050	28.25	0.8536	33.68	0.9445
FDIWN-M [20]	$\times 3$	446K	34.46	0.9274	30.35	0.8423	29.10	0.8051	28.16	0.8528	-	-
VLESR [65]	$\times 3$	319K	34.40	0.9272	30.34	0.8415	29.08	0.8043	28.16	0.8519	33.61	0.9445
GASSL-S [66]	$\times 3$	373K	34.24	0.9260	30.28	0.8407	29.06	0.8038	27.95	0.8474	33.42	0.9434
FDSCSR-S [68]	$\times 3$	471K	34.24	0.9274	30.37	0.8429	29.10	0.8052	28.20	0.8532	33.55	0.9443
AMFFN [67]	$\times 3$	305K	34.48	0.9275	30.34	0.8420	29.11	0.8051	28.29	0.8544	33.72	0.9451
DFCFNet-S (Ours)	$\times 3$	366K	34.52	0.9281	30.46	0.8444	29.15	0.8065	28.39	0.8554	33.97	0.9463
DFCFNet (Ours)	$\times 3$	798K	34.62	0.9291	30.51	0.8457	29.22	0.8083	28.65	0.8607	34.24	0.9480
DRCN [69]	$\times 4$	1774K	31.53	0.8854	28.02	0.7670	27.23	0.7233	25.14	0.7510	-	-
EDSR [70]	$\times 4$	1518K	32.09	0.8938	28.58	0.7813	27.57	0.7357	26.04	0.7849	30.35	0.9067
DBPN [71]	$\times 4$	-	27.21	0.7840	25.13	0.6480	24.88	0.6010	23.25	0.6220	25.50	0.7990
IDN [39]	$\times 4$	553K	31.82	0.8903	28.25	0.7730	27.41	0.7297	25.41	0.7632	29.41	0.8942
CARN [41]	$\times 4$	1592K	32.13	0.8937	28.60	0.7806	27.58	0.7349	26.07	0.7837	30.47	0.9084
IMDN [40]	$\times 4$	715K	32.21	0.8948	28.58	0.7811	27.56	0.7353	26.04	0.7838	30.45	0.9075
RFDN [11]	$\times 4$	541K	32.24	0.8952	28.61	0.7819	27.57	0.7360	26.11	0.7858	30.58	0.9089
LMAN-S [48]	$\times 4$	672K	32.12	0.8939	28.53	0.7798	27.51	0.7340	25.96	0.7813	30.30	0.9062
SMSR [64]	$\times 4$	1060K	32.12	0.8932	28.55	0.7808	27.55	0.7351	26.11	0.7868	30.54	0.9085
FDIWN-M [20]	$\times 4$	454K	32.17	0.8941	28.55	0.7806	27.58	0.7364	26.02	0.7844	-	-
VLESR [65]	$\times 4$	331K	32.17	0.8945	28.55	0.7802	27.55	0.7345	26.03	0.7830	30.48	0.9073
GASSL-S [66]	$\times 4$	428K	32.01	0.8931	28.56	0.7808	27.56	0.7351	25.98	0.7818	30.35	0.9070
FDSCSR-S [68]	$\times 4$	478K	32.25	0.8959	28.61	0.7821	27.58	0.7367	26.12	0.7866	30.51	0.9087
AMFFN [67]	$\times 4$	314K	32.29	0.8958	28.62	0.7821	27.59	0.7365	26.22	0.7889	30.50	0.9083
DFCFNet-S (Ours)	$\times 4$	371K	32.30	0.8963	28.68	0.7838	27.65	0.7381	26.30	0.7897	30.86	0.9114
DFCFNet (Ours)	$\times 4$	807K	32.47	0.8981	28.74	0.7857	27.71	0.7401	26.53	0.7966	31.08	0.9140

Table 5. Metrics of the different model variants evaluated on the UCMerced and AID test sets (PSNR/SSIM at the ×4 scale).

FLFB	NFEB	EFFN	Params (M)	FLOPs (G)	GPU Mem (M)	Avg. Time (ms)	AID	UCMerced
✓			0.29	14.80	65.29	3.95	29.23/0.7833	27.79/0.7632
	✓		0.06	1.59	40.60	2.85	28.64/0.7622	27.09/0.7377
		✓	0.08	4.83	68.83	2.21	29.06/0.7776	27.66/0.7583
✓	✓		0.28	16.10	89.32	6.37	29.32/0.7861	28.01/0.7701
	✓	✓	0.13	15.47	77.12	5.01	29.18/0.7815	27.82/0.7640
✓		✓	0.30	18.70	77.66	6.20	29.27/0.7847	27.96/0.7685
✓	✓	✓	0.36	20.00	89.60	12.25	29.36/0.7875	28.09/0.7730

Table 6. Influence of each component in PCCM (PSNR/SSIM at the ×4 scale).

PConv	CCM	Params (M)	FLOPs (G)	GPU Mem (M)	Avg. Time (ms)	AID	UCMerced
✓		0.16	8.01	88.62	7.51	28.64/0.7846	27.96/0.7683
	✓	0.34	18.61	89.51	12.53	29.34/0.7869	28.06/0.7721
✓	✓	0.36	20.00	89.60	12.25	29.36/0.7875	28.09/0.7730

Table 7. Comparison between the global variance (Var) strategy and self-attention (SA) (PSNR/SSIM at the ×4 scale).

Var	SA	Params (M)	FLOPs (G)	GPU Mem (M)	Avg. Time (ms)	Urban100	Manga109
	✓	0.38	1220.00	25,366.62	16,982.40	26.08/0.7812	30.82/0.9107
✓		0.36	20.00	89.60	12.25	26.30/0.7897	30.86/0.9114

Table 8. Influence of the number of channels and FAM on network performance (PSNR/SSIM at the ×4 scale; Dim represents the number of channels).

Dim	FAM	Params (M)	FLOPs (G)	GPU Mem (M)	Avg. Time (ms)	Urban100	Manga109
36	8	0.36	20.00	89.60	12.25	26.30/0.7897	30.86/0.9114
48	10	0.79	43.48	123.52	20.79	26.53/0.7966	31.08/0.9140
48	12	0.94	51.92	124.12	22.04	26.53/0.7973	31.14/0.9150
48	14	1.10	60.37	124.16	27.65	26.63/0.7996	31.21/0.9155

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, M.; Wang, Q.; Zhang, W.; Chen, X.; Pan, J.; Guo, H. DFCFNet: A Local–Nonlocal Dual-Branch Feature Complementary Fusion Network for Remote Sensing Image Super-Resolution. Remote Sens. 2026, 18, 1626. https://doi.org/10.3390/rs18101626
