An Efficient Hybrid CNN-Transformer Approach for Remote Sensing Super-Resolution

Abstract: Transformer models have great potential in the field of remote sensing super-resolution (SR) due to their excellent self-attention mechanisms. However, transformer models are prone to overfitting because of their large number of parameters, especially on the typically small remote sensing datasets. Additionally, the reliance of transformer-based SR models on convolution-based upsampling often leads to mismatched semantic information. To tackle these challenges, we propose an Efficient Super-Resolution Hybrid Network (EHNet), whose encoder is composed of our designed lightweight convolution module and whose decoder is composed of an improved Swin Transformer. The encoder, featuring our novel Lightweight Feature Extraction Block (LFEB), builds on depthwise convolution to obtain a more efficient alternative to depthwise separable convolution, and integrates a Cross Stage Partial structure for enhanced feature extraction. For the decoder, building on the Swin Transformer, we propose for the first time a sequence-based upsample block (SUB), which operates directly on the transformer's sequence of tokens and focuses on semantic information through an MLP layer, enhancing the feature expression ability of the model and improving reconstruction accuracy. Experiments show that EHNet achieves state-of-the-art PSNR of 28.02 dB and 29.44 dB on the UCMerced and AID datasets, respectively, and is also visually better than other existing methods. Its 2.64 M parameters effectively balance model efficiency and computational demands.


Introduction
The technique of Single Image Super-Resolution (SISR) employs software algorithms to compensate for lost details in a low-resolution (LR) image, restoring it to a high-resolution (HR) counterpart. This technology has seen extensive application across various fields, notably in video surveillance [1], medical diagnosis [2], and remote sensing [3,4]. In remote sensing, high-spatial-resolution images are very important in many scenarios, such as target detection [5], change detection [6], and object tracking [7].
Image sensors are the main factor limiting the spatial resolution of remotely sensed images, and increasing the pixel density of sensors significantly increases hardware cost. Remote sensing image super-resolution (RSISR) reconstruction is a more efficient way to obtain high-resolution remote sensing images than upgrading the imaging equipment. In this work, we also propose a novel sequence-to-sequence upsample method that focuses more on semantic information, diverging from previous convolution-based methods. SwinIR does not use the patch merging module of the original Swin Transformer, so it cannot downsample feature maps to different resolutions and extract features at different scales; its feature extraction ability is therefore weakened compared to the original Swin Transformer. UNet's inherent encoder-decoder structure gives it better feature extraction capabilities: we use the encoder part to first downsample and extract features, and then the decoder part to upsample and recover detailed information. We design a convolution-based Lightweight Feature Extraction Block (LFEB) as the fundamental module of the encoder, which gradually downsamples to extract semantic features. Convolutional structures, being more cost-effective than self-attention mechanisms, are better suited for extracting image features; to further reduce computational cost, we employ depthwise convolutions. For the decoder, we use the Swin Transformer as the backbone because it can establish long-range dependencies through self-attention, enhancing the restoration of image details, while its window attention mechanism significantly reduces the model's computational cost. On the other hand, almost all super-resolution models use convolution-based upsampling methods, such as the widely used sub-pixel convolution [31]. However, data flow through a transformer as a sequence of tokens. Our experiments demonstrate that employing convolution-based upsampling between two transformer layers may inadvertently introduce extraneous semantic information unrelated to the target, which can reduce the model's accuracy. We therefore propose a new upsampling module tailored to the attention mechanism over sequential data: the sequence-based upsample block (SUB).
The principal contributions of this paper are summarized as follows:
1. We propose the Efficient Super-Resolution Hybrid Network (EHNet), a lightweight RSISR network that efficiently fuses a CNN and the Swin Transformer within a UNet-like structure. This hybrid model exploits both the inductive bias of convolution and the long-range modeling capability of self-attention, while the multi-scale capability of UNet and its skip connections allow images to be reconstructed with richer details;
2. We design a lightweight and efficient convolutional block as the fundamental unit for image feature extraction. The dual-branch design of CSP enables the integration of features from different stages, aiding the model in understanding and utilizing these varied stage features. In addition, we found that the SELayer can realize cross-channel feature combination at much lower computational cost than pointwise convolution;
3. In the decoder, we innovatively propose SUB, an upsampling method based on a sequence of tokens. Compared with convolution-based upsampling methods, our SUB is better suited to transformer-based models and improves image detail recovery by focusing on semantic information.

Related Works
We divide existing SR methods into two categories according to application scenario: natural images and remote sensing images. Natural images contain objects, scenes, and people from everyday life; these images often have more detail and are more accessible. Remote sensing images typically come from satellites or aircraft and capture information about the Earth's surface, such as topography, land use, and vegetation cover. Additionally, remote sensing images are often difficult to obtain, and datasets are small. In recent years, many advanced SR models have been applied to natural images, and most remote sensing SR models are improved from advanced natural image models. Table 1 lists current SOTA methods for some SR tasks.
Table 1. Some SISR SOTA methods in recent years. The application scenarios of these methods, the number of parameters, and a brief description of each method are listed in the table.

Method | Application Scenario | Params | Description
SRCNN [10] | Natural image | 69 K | The first SISR method using deep learning
VDSR [11] | Natural image | 671 K | A 20-layer model with residual learning
RCAN [32] | Natural image | 15.2 M | A 200-layer model with channel attention
IPT [26] | Natural image | 115.5 M | An SISR method using a standard transformer
SwinIR [28] | Natural image | 3.87 M | An SISR method using the Swin Transformer without patch merging
HAT [33] | Natural image | 5.29 M | An SISR method activating more pixels, based on SwinIR
LGCNet [34] | Remote sensing | 193 K | The first RSISR method combining local and global features
DCM [35] | Remote sensing | 1.84 M | An RSISR model with a network-in-network structure
CTNet [36] | Remote sensing | 349 K | An RSISR method using lightweight convolution
HSENet [37] | Remote sensing | 5.29 M | A hybrid-scale self-similarity exploitation network
TransENet [38] | Remote sensing | 37.3 M | A transformer-based enhancement network

SISR Methods of Natural Images
SRCNN [10] was the pioneering method to use deep learning to establish a nonlinear mapping between LR and HR images, achieving state-of-the-art performance on several public datasets with only three convolutional layers. Later, many scholars proposed deeper CNN models to obtain better performance. Very Deep Super-Resolution (VDSR) [11] expanded the network depth to 20 layers through residual learning, achieving better results. Fast Super-Resolution Convolutional Neural Networks (FSRCNN) [39] abandoned the idea of interpolating the image to the target size in advance and instead greatly reduced the number of parameters and the amount of computation by adding a deconvolution layer at the end of the network. The Efficient Sub-pixel Convolutional Neural Network (ESPCN) [31] proposed an efficient sub-pixel convolution module for upsampling; sub-pixel convolution is used by a large number of super-resolution models due to its excellent performance. The Residual Channel Attention Network (RCAN) [32] considers the relationship between channels and constructs a deep network with up to 200 residual blocks. The excellent performance of RCAN, a model based on channel attention, led more researchers to focus on attention mechanisms. The Second-order Attention Network (SAN) [40] proposed a second-order attention mechanism, which establishes feature relationships by calculating second-order feature statistics, giving the model better feature representation capabilities. The Holistic Attention Network (HAN) [41] not only utilizes channel and spatial attention to learn the channel-wise and spatial interdependence of each layer's features but also introduces layer attention to explore correlations between layers.
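The core of sub-pixel convolution is a channel-to-space rearrangement (often called PixelShuffle). The following NumPy sketch (function name and shapes are our own illustration, single image without batch dimension) maps a (C·r², H, W) tensor to (C, H·r, W·r), matching the usual convention:

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r^2, H, W) tensor into (C, H*r, W*r), as in ESPCN."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, r, r)
    x = x.transpose(0, 3, 1, 4, 2)     # interleave: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(2 * 4 * 3 * 3, dtype=np.float32).reshape(8, 3, 3)
y = pixel_shuffle(x, 2)
print(y.shape)  # (2, 6, 6)
```

Each output pixel (h·r+i, w·r+j) of channel c comes from input channel c·r²+i·r+j at position (h, w), so upsampling is learned entirely in channel space by the preceding convolution.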
IPT [26] is an image restoration model based on the standard transformer, but its excellent performance requires a large amount of training data (1.1 M images) and a complex model (115.5 M parameters). SwinIR [28] proposed an image super-resolution method based on the Swin Transformer [27], composed mainly of W-MSA and SW-MSA. Unlike ViT-style models, window attention greatly reduces the computation and parameters of the model because attention is calculated only within each window. Nevertheless, such window-limited attention computation impinges on the transformer's intrinsic capability for modeling long-range dependencies; the shifted-window attention mechanism adeptly compensates for this shortcoming, endowing the Swin Transformer with the advantages of both CNNs and transformers. The Hybrid Network of CNN and Transformer (HNCT) [42] is a lightweight image super-resolution model that mixes CNN and transformer; it considers both local and non-local priors and extracts deep features beneficial to super-resolution reconstruction while keeping the model lightweight. The Hybrid Attention Transformer (HAT) [33] adds channel attention on top of SwinIR, making up for the insufficient use of information across transformer channels; HAT also introduces an overlapping cross-attention module to better aggregate cross-window information.

SISR Methods of Remote-Sensing Images
Spatial resolution is a crucial metric for assessing the performance of remote sensing satellites. Remote sensing images with higher spatial resolution contain more target information and enhance the accuracy of subsequent tasks such as classification, segmentation, and detection. Merely interpolating images only increases the resolution without adding effective information. Recently, learning-based super-resolution methods have become mainstream for enhancing the resolution of remote sensing images. Inspired by natural image SR networks, Lei et al. [34] first proposed an SR network that combines local and global features using deep learning, termed LGCNet. Haut et al. [35] introduced the Deep Compendium Model (DCM), which integrates residual blocks, skip connections, and a network-in-network structure. Pan et al. [43] presented the Residual Dense Backprojection Network (RDBPN) to address higher super-resolution magnifications, using a residual backprojection block structure to exploit residual learning both globally and locally. Dong et al. [44] proposed a Second-order Multi-scale Network (SMSR), which captures multi-scale information by reusing features learned at varying depths. Zhang et al. [45] extracted features at different scales using convolutions with varying kernel sizes and channel attention modules. Huan et al. [46] developed a new Pyramid-style Multi-Scale Residual Network (PMSRN) by merging hierarchical features to construct a Multi-Scale Dilated Residual Block (MSDRB). Leveraging the self-similarity of remote sensing images, Lei et al. [37] devised the Hybrid-scale Self-similarity Exploitation Network (HSENet), which uses a Single-scale Self-similarity Exploitation Module (SSEM) to learn feature correlations at the same scale and a Cross-scale Connection Structure (CCS) to capture recurrences across scales.
Lei et al. [38] proposed a Transformer-based Enhancement Network (TransENet), in which the transformer extracts features at different stages and the multi-stage design allows the fusion of high-dimensional and low-dimensional features. Tu et al. [47] combined the Swin Transformer with generative adversarial networks (GANs) to propose SWCGAN, whose generator is composed of both convolution and Swin Transformer layers and whose discriminator consists solely of the Swin Transformer. Shang et al. [48] designed a hybrid-scale hierarchical transformer network (HSTNet) to acquire long-range dependencies and effectively compute the correlations between high-dimensional and low-dimensional features. Wang et al. [36] created a lightweight convolution called the contextual transformation layer (CTL) to replace 3 × 3 convolutions, which efficiently extracts rich contextual features. Zhang et al. [29] proposed FeNet, which strikes a balance between performance and model parameters, with a lightweight lattice block (LLB) acting as a nonlinear extraction module to improve expressive ability.

Methodology
In this section, we first introduce the overall architecture of EHNet. We then describe our proposed Lightweight Feature Extraction Block (LFEB) and the new sequence-based upsample block (SUB) in detail.

Network Architecture
Figure 1 displays the overall architecture of our EHNet, which adopts an encoder-decoder pattern based on the UNet structure. The encoder uses efficient convolutional layers of our own design to capture the low-level features and spatial context of the image, while the decoder uses the Swin Transformer to reconstruct image details. Following the Swin Transformer, there is a specialized upsampling module designed for the sequence of tokens. This module can express the characteristics of the token sequence more richly, as it operates directly at the sequence level, avoiding the potential information compression and loss caused by convolutional layers; moreover, it can perform SR reconstruction based on semantic information during upsampling. To compensate for the possible loss of spatial information when reshaping feature maps into sequences, we incorporate skip connections between the encoder and decoder. This architecture not only facilitates the effective integration of local details with global information but also enhances super-resolution performance by exploiting the focused semantic information, leading to significant improvements in image clarity and richness and making our model particularly suitable for applications requiring high-quality image reconstruction.
Given an LR image I_LR, we first interpolate it to the target resolution and then use a 3 × 3 convolution to transform it into a feature map, extracting the initial features f_0. This process can be expressed as

f_0 = Conv(I_LR),    (1)

where Conv denotes a convolutional operation and f_0 represents the initial feature, which is the input of the following feature extraction part. We use three LFEGs to construct the encoder within the UNet structure. The primary function of these LFEGs is to extract low-level features at various scales from the image. Each LFEG is composed of multiple stacked LFEBs. The feature map is downsampled by 1/2 in each LFEG, so the resolution of the feature map after three LFEGs is 1/8 of the HR. The output of the encoder part can be written as

f_n = LFEG_n(f_{n-1}), n = 1, 2, 3,    (2)

where LFEG_n(·) and f_n represent the operation of the nth LFEG and its output.
After passing through the encoder composed of convolutional structures, we will use Swin Transformer Blocks (STB) and SUB to gradually upscale and restore image details.
STB is the basic module of the Swin Transformer, which divides the image into a series of windows; all attention is computed only within each window. This windowed attention mechanism greatly reduces the amount of computation. However, computing attention only within windows weakens the long-range modeling ability of the transformer, so the Swin Transformer also includes a window-shifting mechanism to transfer information between windows.
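The window partition underlying this mechanism is a pure reshape. A minimal NumPy sketch (our own illustration, single image without batch dimension) shows how an H × W × C map becomes independent ws × ws windows, so attention cost scales with the window area rather than the full image area:

```python
import numpy as np

def window_partition(x: np.ndarray, ws: int) -> np.ndarray:
    """Split an (H, W, C) feature map into non-overlapping (ws*ws, C) windows."""
    h, w, c = x.shape
    x = x.reshape(h // ws, ws, w // ws, ws, c)
    x = x.transpose(0, 2, 1, 3, 4)        # group windows first, pixels second
    return x.reshape(-1, ws * ws, c)      # (num_windows, tokens_per_window, C)

feat = np.arange(8 * 8 * 2, dtype=np.float32).reshape(8, 8, 2)
wins = window_partition(feat, 4)
print(wins.shape)  # (4, 16, 2): four 4x4 windows, each a 16-token sequence
```

Self-attention is then computed per window over 16 tokens instead of over all 64 positions, which is the source of the computational savings.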
In our EHNet, STB is mainly used to extract higher-dimensional semantic features for SUB, while our specially designed SUB uses these features to recover image details and upsample the feature maps by a factor of 2. The output of each upsampling is concatenated with the corresponding output of the encoder part before being used as the input of the next layer. This feature fusion operation compensates for the loss of spatial information due to downsampling.
This process can be expressed as

F_3 = SUB_3(STB_3(f_3)),
F_n = SUB_n(STB_n(Concat(F_{n+1}, f_n))), n = 2, 1,    (3)

where SUB_n and STB_n represent the operations of the nth sequence-based upsample block and Swin Transformer block, and F_n represents the output after the nth upsample. Finally, after concatenating the output of the decoder, F_1, with f_0 and passing the result through another convolutional layer, we obtain the final SR image:

I_SR = Conv(Concat(F_1, f_0)).
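The resolutions involved can be traced in a few lines. This is a shape-only sketch under our reading of the architecture (the input size 256 is an assumed example and channel bookkeeping is omitted):

```python
def ehnet_feature_sizes(h=256, w=256):
    """Trace spatial sizes through three LFEG downsamples (encoder) and
    three STB+SUB x2 upsamples (decoder) of the UNet-like EHNet."""
    enc = []
    fh, fw = h, w
    for _ in range(3):                    # each LFEG ends with a /2 pooling
        fh, fw = fh // 2, fw // 2
        enc.append((fh, fw))              # sizes of f1, f2, f3
    assert (fh, fw) == (h // 8, w // 8)   # encoder output is 1/8 of the HR size
    dec = []
    for _ in range(3):                    # each SUB upsamples by 2
        fh, fw = fh * 2, fw * 2
        dec.append((fh, fw))              # sizes of F3, F2, F1 (concat with skips)
    return enc, dec

print(ehnet_feature_sizes())
# ([(128, 128), (64, 64), (32, 32)], [(64, 64), (128, 128), (256, 256)])
```

Each decoder output size matches an encoder skip (F_3 with f_2, F_2 with f_1, F_1 with the full-resolution f_0), which is what makes the concatenations in Equation (3) well-defined.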

Lightweight Feature Extraction Block (LFEB)
In this section, we design an efficient feature extraction module that extracts rich features for the decoder at low computational cost. The LFEB is the base unit of the encoder; we stack multiple LFEBs and incorporate residual learning to form the residual-in-residual structure of the LFEG, which makes it possible to construct deeper networks without gradient explosion. Each LFEG ends with a pooling layer to downsample the feature map. Finally, three LFEGs form the encoder part. The encoder of our EHNet is shown in Figure 2.
LFEB's overall structural design concept is similar to the Residual Channel Attention Block (RCAB) [32], which mainly consists of standard convolution in tandem with Channel Attention (CA). Our LFEB is mainly composed of a CSP structure and lightweight convolution modules. The dual-branch design of CSP effectively integrates information from different stages at minimal computational cost. The lightweight convolution modules, consisting of depthwise convolution (dwconv) and a Squeeze-and-Excitation layer [49] (SELayer), extract features efficiently; in our LFEB, we use depthwise convolution in tandem with the SELayer as the basic combination. In many lightweight convolutional designs, dwconv is commonly paired with pointwise convolution (pwconv), which compensates for the lack of cross-channel information fusion in dwconv. However, our experiments demonstrate that this combination is not necessarily helpful for super-resolution tasks, and the SELayer can take over the function of cross-channel information fusion from pwconv at lower computational cost.
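The cost argument can be made concrete with a quick parameter count. This is an illustrative calculation; the channel width 64, kernel size 3, and SE reduction ratio 16 are assumed values, not the paper's configuration, and biases are ignored:

```python
def dwconv_params(c: int, k: int) -> int:
    return c * k * k                      # one k x k filter per channel

def pwconv_params(c: int) -> int:
    return c * c                          # 1x1 conv mixing all channels

def selayer_params(c: int, r: int = 16) -> int:
    return 2 * c * (c // r)               # two FC layers: C -> C/r -> C

c, k = 64, 3
separable = dwconv_params(c, k) + pwconv_params(c)   # dwconv + pwconv
ours = dwconv_params(c, k) + selayer_params(c)       # dwconv + SELayer
print(separable, ours)  # 4672 1088
```

Under these assumptions the SELayer replaces a 4096-parameter pointwise convolution with a 512-parameter bottlenecked channel-attention pair, and the gap widens as C grows.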
The SELayer adaptively recalibrates the feature responses between channels by explicitly modeling their interdependencies. Specifically, it learns to automatically obtain the importance of each channel and then enhances useful features and suppresses less useful ones according to this importance. The main operation of the SELayer is to globally average-pool the feature map into 1 × 1 × C features (Squeeze) and then predict the importance of each channel through fully connected layers, obtaining channel-level attention weights (Excitation) that are used to recalibrate the feature maps.
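The Squeeze-and-Excitation forward pass can be written in a few lines of NumPy (our own sketch; biases are omitted and the weights are random, purely for illustration):

```python
import numpy as np

def se_layer(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Squeeze-and-Excitation on a (C, H, W) map.

    w1: (C//r, C) and w2: (C, C//r) are the two FC layers (reduction ratio r).
    """
    s = x.mean(axis=(1, 2))               # Squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)           # Excitation: FC + ReLU
    a = 1.0 / (1.0 + np.exp(-(w2 @ z)))   # FC + sigmoid -> channel weights in (0, 1)
    return x * a[:, None, None]           # recalibrate each channel

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
w1 = rng.standard_normal((2, 8))          # reduction ratio r = 4 here
w2 = rng.standard_normal((8, 2))
y = se_layer(x, w1, w2)
print(y.shape)  # (8, 4, 4)
```

Because the sigmoid weights lie in (0, 1), each channel is scaled down in proportion to its predicted importance rather than remixed, which is why this is far cheaper than a pointwise convolution.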
Because of the success of CSPDarknet in YOLOv4 [50], we also add our own design of a Cross Stage Partial (CSP) connection to extend the channel space in the LFEB; the addition of CSP hardly increases computation and improves the performance of the model to a certain extent. The structure of LFEB is shown in Figure 3.
CSP allows the fusion of features from different network stages through its dual-branch design. This helps to integrate and propagate features from lower and higher levels more efficiently, improving the model's understanding and utilization of features from different levels. In super-resolution tasks, this fusion helps the network better understand image details and facilitates more accurate detail reconstruction. The CSP structure in our LFEB divides the input 2C-channel feature map f_in into two branches, each with C channels:

f_1, f_2 = Split(f_in),    (4)

where f_1 and f_2 denote the feature maps at the beginning of the two branches. In branch2, features are extracted through the subsequent convolutional layers, while in branch1, f_1 is directly concatenated with the features extracted in branch2. Finally, a 1 × 1 convolution is used for information fusion, producing the output feature f_out:

f_out = Conv_1×1(Concat(f_1, branch2(f_2))),    (5)

where branch2(·) represents the convolutions, batch normalization (BN), SELayer, and all other operations within branch2. The branch2 of our LFEB consists mainly of a tandem stack of dwconv and SELayer, both of which have low computational cost, with a BN layer added to speed up convergence.
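The split-transform-concat-fuse flow of Equations (4) and (5) can be sketched in NumPy. This is our own illustration: branch2 is stubbed with a ReLU instead of the real dwconv + BN + SELayer stack, and the fuse weights are random:

```python
import numpy as np

def conv1x1(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """1x1 convolution on a (C_in, H, W) map; w has shape (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def csp_block(x, branch2, w_fuse):
    """Cross Stage Partial: split channels, transform one half,
    concatenate with the untouched half, then fuse with a 1x1 conv."""
    c = x.shape[0] // 2
    f1, f2 = x[:c], x[c:]                          # Eq. (4): split 2C into C + C
    out = np.concatenate([f1, branch2(f2)], axis=0)
    return conv1x1(out, w_fuse)                    # Eq. (5): 1x1 fusion

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 4, 4))
w = rng.standard_normal((8, 8))
y = csp_block(x, branch2=lambda f: np.maximum(f, 0.0), w_fuse=w)
print(y.shape)  # (8, 4, 4)

# sanity check: with identity branch and identity fuse, CSP is a no-op
assert np.allclose(csp_block(x, lambda f: f, np.eye(8)), x)
```

Only half the channels pass through the costly branch, which is why the CSP connection adds expressive power at almost no extra computation.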

Sequence-Based Upsample Block
In super-resolution tasks, most models use convolution-based upsampling methods such as transposed convolution or sub-pixel convolution. The design inspiration for our SUB originally came from the patch expanding layer of Cao et al. [51], which achieves upsampling and feature-dimension change without using convolution or interpolation; compared with sub-pixel convolution and bilinear interpolation, this type of upsampling has achieved higher accuracy in segmentation tasks. Based on this sequence-based upsampling concept, we propose a new upsampling module, SUB, that is better suited to super-resolution tasks. Our SUB focuses more on the semantic information of the image to obtain better reconstruction results; to our knowledge, this is the first time such a sequence-based upsampling method has been proposed for super-resolution tasks.
The structure of our SUB is shown in Figure 4. The input sequence of tokens is first dimensionally transformed through an MLP layer, which introduces nonlinear transforms to enhance the model's feature learning and expression capabilities while doubling the channel dimension. The MLP is followed by a Swin Transformer layer to recover more image details. There are three Swin Transformer layers in the decoder, each corresponding to one of the three downsampling layers in the encoder. After one transformer layer, we rearrange the sequence of tokens into feature maps of shape B × 2C × H × W and then apply a PixelShuffle operation, which changes the resolution of the feature maps to 2× the input and the channel dimension to 1/4 of the input. Finally, we convert the sequence of tokens into feature-map form mainly to facilitate fusion with the features extracted by the convolutions in the encoder.
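Shape-wise, the SUB pipeline (MLP expansion, reshape to a feature map, PixelShuffle) can be sketched as follows. This is our own NumPy illustration: the intermediate Swin Transformer layer is omitted, the MLP is a single random linear layer without bias, and the batch dimension is dropped:

```python
import numpy as np

def sub_upsample(tokens: np.ndarray, h: int, w: int, w_mlp: np.ndarray) -> np.ndarray:
    """Sequence-based upsample: (H*W, C) tokens -> (C/2, 2H, 2W) feature map."""
    x = tokens @ w_mlp                    # MLP: (H*W, C) -> (H*W, 2C)
    x = x.T.reshape(-1, h, w)             # sequence -> feature map (2C, H, W)
    c_out = x.shape[0] // 4               # PixelShuffle x2: channels shrink 4x
    x = x.reshape(c_out, 2, 2, h, w).transpose(0, 3, 1, 4, 2)
    return x.reshape(c_out, h * 2, w * 2)

rng = np.random.default_rng(2)
tokens = rng.standard_normal((16, 8))     # 4x4 grid of tokens, C = 8
w_mlp = rng.standard_normal((8, 16))      # doubles the channel dimension
out = sub_upsample(tokens, 4, 4, w_mlp)
print(out.shape)  # (4, 8, 8)
```

The key design point is that the channel expansion happens at the token level through the MLP, so the extra channels consumed by PixelShuffle are produced by a semantic, per-token transform rather than by a spatial convolution.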
In summary, our SUB effectively upsamples the sequence of tokens in transformers and restores more precise and accurate details in super-resolution tasks. To demonstrate the effectiveness of our SUB module, we used Local Attribution Maps (LAM) [52] to analyze which pixels in the input LR image contribute most to the SR reconstruction. LAM is an attribution analysis method based on integrated gradients: by selecting a region of interest in the image, LAM identifies the pixels that contribute significantly to the SR reconstruction of that area.
We applied LAM to analyze both the convolution-based upsampling method and our SUB, with results shown in Figure 5. In the airplane scene, we selected the engine as the target region. Many pixels in the LAM results of the convolution-based upsampling do not match the semantic information of the airplane yet still influence the SR result, and this introduction of extraneous pixel information degrades the quality of the SR reconstruction. In contrast, the LAM results of our method are more focused on the region that matches the target semantics: most of the high-contribution pixels concentrate on the airplane engine, and this semantics-aware SR reconstruction is an important reason why our EHNet achieves higher performance. Similar results appear in the overpass scene, where we selected a car on the road as the target region; our method again produces attributions more focused on the car, which leads to better reconstruction results.

Experiments

Experiment Settings
To verify the effectiveness of our model, we trained on two widely used public remote sensing datasets, UCMerced [53] and AID [54].
UCMerced dataset: This dataset contains 21 types of remote sensing scenes, including airports, highways, ports, etc. Each scene category has 100 images, each measuring 256 × 256 pixels with a spatial resolution of 0.3 m/pixel. The dataset is divided into two equal parts: one half is used as the training set (1050 images in total) and the other half as the test set, with 20% of the training set held out as a validation set. AID dataset: Compared with UCMerced, AID is larger in both image count and image size, containing 10,000 images across 30 remote sensing scene classes. Each AID image measures 600 × 600 pixels with a spatial resolution of 0.5 m/pixel. From this dataset, 8000 images were randomly selected as the training set and the remaining 2000 images were used as the test set. In addition, we selected five images from each category, for a total of 150 images, as the validation set.
The images in both the UCMerced and AID datasets were used as HR images in the experiments, and their corresponding LR images were obtained by bicubic interpolation. We trained and evaluated the model on such paired HR-LR images.
We used peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) to evaluate the experimental results, and all evaluations of the super-resolution results were performed on the RGB channels. In general, SSIM better reflects image quality as perceived by the human eye but is computationally complex, whereas PSNR is computationally simple but does not necessarily fully reflect perceived quality; we therefore used the two metrics in combination to assess super-resolution quality more comprehensively. The PSNR and SSIM of a super-resolution image can be calculated by the following equations:

$$\mathrm{PSNR} = 10\log_{10}\!\left(\frac{\mathrm{MAX}^2}{\frac{1}{N}\sum_{i=1}^{N}\bigl(I_{SR}(i)-I_{HR}(i)\bigr)^2}\right)$$

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where MAX is the maximum pixel value, $\mu_x$, $\mu_y$ and $\sigma_x^2$, $\sigma_y^2$ are the means and variances of x and y, respectively, $\sigma_{xy}$ is the covariance between x and y, and $c_1$ and $c_2$ are constants. $I_{SR}$ is the super-resolution image and $I_{HR}$ is the high-resolution image. Floating point operations (FLOPs) and model parameters are used to measure the computational cost of the model; the input image size is 64 × 64 when calculating FLOPs.
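The two metrics can be computed with NumPy as below. This is a generic sketch, not the evaluation code used in the paper; in particular, the SSIM here is a simplified single-window (global) variant, whereas standard implementations average SSIM over local sliding windows.

```python
import numpy as np

def psnr(sr, hr, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0):
    """Simplified global SSIM with the standard c1/c2 constants
    (real implementations average over local windows)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# An image compared with itself gives PSNR = inf and SSIM = 1.0.
```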
Our loss function is the L1 loss, the most common choice in super-resolution tasks. Given a training set $\{I_{LR}^{i}, I_{HR}^{i}\}_{i=1}^{N}$, the loss function can be expressed as

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left\|F_{\theta}\bigl(I_{LR}^{i}\bigr) - I_{HR}^{i}\right\|_{1},$$

where $F_{\theta}$ denotes the network with parameters $\theta$. We conducted experiments on remote sensing images with scale factors of ×2 and ×4. During training, we randomly cropped each image to 192 × 192 and applied random flips and rotations to the training samples to increase sample diversity. We used the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$, and adopted a cosine annealing learning-rate decay strategy with an initial learning rate of 5 × 10⁻⁵ and a minimum learning rate of 1 × 10⁻⁷. We trained with a batch size of 16 for 2000 epochs. The entire training was performed on two NVIDIA 3080 Ti GPUs.
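A cosine annealing schedule of the kind described above can be sketched as follows; this is a generic formulation using the stated epoch count and learning-rate bounds, and may differ in detail from the exact scheduler the authors used.

```python
import math

def cosine_annealing_lr(epoch, total_epochs=2000, lr_max=5e-5, lr_min=1e-7):
    """Cosine decay from lr_max at epoch 0 down to lr_min at the final epoch."""
    t = epoch / (total_epochs - 1)  # training progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# The learning rate starts at 5e-5, reaches the midpoint value halfway
# through training, and ends at 1e-7.
```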

Ablation Studies
In this section, we performed a series of ablation experiments on the UCMerced dataset to explore the importance of each module in our model; all models were trained with the same settings. For simplicity, all experiments used a super-resolution factor of ×4.

Effects of LFEB
The LFEB is the most important component of the encoder, and we explored the effect of using this module with different settings. The number of LFEBs in each LFEG is set to 9 in our experiments. Compared to RCAB, a benchmark module commonly used in super-resolution models, our LFEB is 0.11 dB higher in PSNR. We compared the widely used dwconv + pwconv combination with our dwconv + SELayer scheme and found that our approach performs better. Moreover, pwconv incurs larger computation cost and memory usage, whereas SELayer is a lightweight feature-recalibration module that uses only fully connected layers. We also validated the effectiveness of the CSP dual-branch structure in the LFEB and found that PSNR improved by 0.06 dB after introducing the CSP; all results are shown in Table 2. In recent years, several strong attention modules have been applied to various super-resolution tasks, and we compared SELayer against them. The Convolutional Block Attention Module (CBAM) [55] performs attention in both the spatial and channel dimensions; by combining a channel attention module and a spatial attention module, the network achieves better feature selection and reinforcement in both dimensions, improving its representational ability. Efficient Channel Attention (ECA) [56] proposes a local cross-channel interaction strategy without dimensionality reduction, which can be implemented efficiently with one-dimensional convolution. We tested these popular convolutional attention methods with the other parts of the LFEB fixed, and SELayer obtained the best performance in both PSNR and SSIM. The experimental results are shown in Table 3.

Effects of SUB

We explored the experimental performance of different components of the SUB and identified the most effective settings; all experimental results are shown in Table 4. Judging from the results, using only the MLP layer for dimension transformation gives mediocre performance, and adding a layer of Swin Transformer increases PSNR by 0.1 dB. There are two ways to transform features from an expanded channel dimension to a larger spatial resolution: directly reshaping the feature map to the target resolution, or reshaping with the channel dimension unchanged and then using pixel shuffle to increase the spatial resolution. The experiments show that the latter scheme yields better reconstruction. We also compared the SUB with transposed convolution and subpixel convolution, the upsampling methods commonly used in other SOTA methods: our SUB exceeds them in PSNR by 0.23 dB and 0.12 dB, respectively, and in SSIM by 0.0051 and 0.0036, respectively. These results verify the validity of the SUB upsampling method and are shown in Table 5.
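The SELayer compared in the LFEB ablation above is a squeeze-and-excitation channel attention layer; it can be sketched in NumPy as below. This is a generic SE sketch with random weights and an assumed reduction ratio r, not the paper's exact configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_layer(x, w1, w2):
    """Squeeze-and-Excitation: global average pooling ('squeeze'),
    two fully connected layers with a bottleneck ('excitation'),
    then per-channel rescaling of the input feature map.
    x: (B, C, H, W); w1: (C, C//r); w2: (C//r, C)."""
    s = x.mean(axis=(2, 3))          # squeeze: (B, C)
    z = np.maximum(s @ w1, 0.0)      # FC + ReLU bottleneck: (B, C//r)
    a = sigmoid(z @ w2)              # FC + sigmoid gate in (0, 1): (B, C)
    return x * a[:, :, None, None]   # recalibrate each channel

rng = np.random.default_rng(0)
B, C, H, W, r = 1, 16, 8, 8, 4       # r is an assumed reduction ratio
x = rng.standard_normal((B, C, H, W))
w1 = rng.standard_normal((C, C // r)) * 0.1
w2 = rng.standard_normal((C // r, C)) * 0.1
y = se_layer(x, w1, w2)              # same shape as x, channels re-weighted
```

Because the layer consists only of pooling and two small fully connected maps, its cost is negligible compared with a pointwise convolution over the full feature map, which is the efficiency argument made in the text.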

Ablation Study of Our EHNet
We performed ablation experiments on the whole EHNet, mainly covering the number of Swin Transformer layers, the number of convolution layers, and the effect of the feature dimension on model accuracy and complexity. When the number of LFEBs, the number of Swin Transformer layers, and the number of feature channels are set to 9, 2, and 96, respectively, EHNet obtains higher PSNR and SSIM while keeping a low computational overhead. All the experimental results are shown in Table 6. To verify the effectiveness of the proposed EHNet, we conducted comparative experiments with several SOTA competitors, namely SRCNN [10], VDSR [11], LGCNet [34], DCM [35], CTNet [36], HSENet [37], TransENet [38], SwinIR [28], and HAT [33]. Among these, SRCNN [10], VDSR [11], HAT [33], and SwinIR [28] were originally proposed for natural-image SR.

Quantitative Evaluation
Quantitative Results on the UCMerced Dataset: Table 7 presents a comparison of the latency and accuracy of various methods on the UCMerced dataset. The results indicate that our EHNet achieves a superior balance between parameter count and accuracy. For both the ×2 and ×4 super-resolution factors, EHNet demonstrates the best performance in terms of PSNR. Compared to recent high-performing models such as SwinIR [28], TransENet [38], and HSENet [37], EHNet improves on both parameter count and performance. Specifically, under the ×4 super-resolution factor, EHNet's PSNR is higher than TransENet [38], SwinIR [28], and HAT [33] by 0.24 dB, 0.15 dB, and 0.16 dB, respectively, while having only 7%, 58%, and 50% of their parameter counts. In comparison with lightweight models such as SRCNN [10], VDSR [11], and CTNet [36], our EHNet also remains competitive in both accuracy and efficiency. Quantitative Results on the AID Dataset: In Table 8, our proposed EHNet demonstrates strong performance across all metrics on the AID test dataset. However, due to its limited capacity, the model's performance deteriorates when trained on the larger AID training dataset. Despite this limitation, EHNet still achieves the best or second-best PSNR on the AID test dataset and obtains the best results in SSIM, the metric more aligned with human visual perception. Overall, the proposed method maintains competitive performance. To further analyze the reasons behind these phenomena, we discuss the quantitative performance of different methods across categories. Table 9 lists the performance across the 30 categories of the AID dataset. The experiments demonstrate that our method performs well in scenes with rich textural details, such as airports, schools, parking lots, and sparse residential areas, achieving the best PSNR in most cases. In contrast, the scenes where PSNR results are less satisfactory tend to be those with more uniform, less detailed content, such as bare land, beaches, and deserts. These images lack sufficient feature information: our method primarily relies on enhancing high-frequency details to improve image resolution, and in scenes with simple content there may not be enough information for effective reconstruction. Moreover, PSNR may be better suited to assessing detail enhancement in richly textured scenes; in less textured environments, it may not fully reflect the true improvement in image quality.

Qualitative Evaluation
In addition to the quantitative comparisons discussed above, we also conducted a qualitative analysis of super-resolved image quality. Figure 6 presents the visual results for two scenes from the UCMerced dataset: airplane and freeway. For 'airplane78', our method successfully recovers the texture of the engine while maintaining sharp edges. For 'freeway97', our EHNet uniquely restores the car windows, a detail not achieved by the other methods. Moreover, the super-resolved image exhibits clearer lane lines, demonstrating EHNet's significant advantage in recovering image details.

Conclusions
In our work, we introduce a novel model named EHNet, an efficient single-image SR model for remote sensing. EHNet merges an encoder formed by LFEBs with an improved Swin Transformer decoder within a UNet architecture. The LFEB utilizes depthwise convolution to reduce computation cost, while the incorporation of the SELayer enhances inter-channel information fusion, addressing the insufficient channel-information integration of depthwise convolution. Additionally, we employ a CSP dual-branch structure to boost model performance without adding extra parameters. In the decoder, we utilize the Swin Transformer to restore image details and introduce a novel sequence-based upsampling method, the SUB, to capture more accurate long-range semantic information. EHNet achieves state-of-the-art results on multiple metrics on the AID and UCMerced datasets and surpasses existing methods in visual quality. Its 2.64 M parameters effectively balance model efficiency and computation cost, highlighting its potential for broader application in SR tasks.
The experimental results show that our EHNet performs better on smaller datasets, while its performance degrades on datasets such as AID, which is larger in both image size and dataset size. Investigating the super-resolution reconstruction results for different scenes, we find that EHNet tends to underperform in scenes with fewer details and smaller gradients. We speculate that the model's small parameter count cannot fully cope with all the scenes in large datasets, especially those with smaller gradients. In addition, our model does not perform as well at a super-resolution factor of ×3 as at ×2 and ×4, which may be because the UNet architecture of EHNet adopts 2× downsampling and is therefore less suited to LR reconstruction with a super-resolution factor of ×3.

Figure 5 .
Figure 5. LAM results of the two methods in two different scenes, where the red shaded area shows the degree of semantic focus. The red box marks the selected target region.

Figure 7
Figure 7 shows two examples from the AID dataset. For 'parking210', our proposed method successfully recovers clear marker lines, while the other methods are either very blurred or exhibit checkerboard artifacts. Furthermore, in the super-resolution result for 'stadium262', our model achieves sharper edges around the letters, further evidencing its superior performance in enhancing details.

Figure 7 .
Figure 7. Visualization results of different RSISR methods on the AID dataset for ×4 SR.


Table 2 .
PSNR and SSIM results with different components in the LFEB. Bold indicates the best result.

Table 3 .
PSNR and SSIM results with different attention modules in the LFEB.

Table 4 .
PSNR and SSIM results with different components in SUB.

Table 5 .
PSNR and SSIM results with different upsample methods.

Table 6 .
PSNR and SSIM results with different settings in EHNet.

Table 7 .
Comparative results for the UCMerced dataset.

Table 8 .
Comparative results for the AID dataset.

Table 9 .
Mean PSNR (dB) of each class for the scale factor of ×4 on the AID dataset.