Hybrid-Scale Hierarchical Transformer for Remote Sensing Image Super-Resolution

Abstract: Super-resolution (SR) technology plays a crucial role in improving the spatial resolution of remote sensing images so as to overcome the physical limitations of spaceborne imaging systems. Although deep convolutional neural networks have achieved promising results, most of them overlook the advantage of self-similarity information across different scales and of the high-dimensional features after the upsampling layers. To address this problem, we propose a hybrid-scale hierarchical transformer network (HSTNet) to achieve faithful remote sensing image SR. Specifically, we propose a hybrid-scale feature exploitation module to leverage the internal recursive information at single and cross scales within the images. To fully leverage the high-dimensional features and enhance discrimination, we design a cross-scale enhancement transformer to capture long-range dependencies and efficiently calculate the relevance between high-dimension and low-dimension features. The proposed HSTNet achieves the best results in PSNR and SSIM on the UCMerced and AID datasets. Comparative experiments demonstrate the effectiveness of the proposed methods and prove that the HSTNet outperforms the state-of-the-art competitors in both quantitative and qualitative evaluations.


Introduction
With the rapid progress of satellite platforms and optical remote sensing technology, remote sensing images (RSIs) have been broadly deployed in civilian and military fields, e.g., disaster prevention, meteorological forecasting, military mapping, and missile warning [1,2]. However, due to hardware limitations and environmental restrictions [3,4], RSIs often suffer from low resolution (LR) and contain some intrinsic noise. Upgrading physical imaging equipment to improve resolution is often plagued by high costs and long development cycles. Therefore, it is of utmost urgency to explore remote sensing image super-resolution (RSISR).
Single-image super-resolution (SR) is a highly ill-posed visual problem which aims to reconstruct high-resolution (HR) images from corresponding degraded LR images. To this end, many representative algorithms have been proposed, which can be roughly divided into three categories, i.e., interpolation-based methods [5,6], reconstruction-based methods [7,8], and learning-based methods [9,10]. The interpolation-based methods generally utilize different interpolation operations, including bilinear interpolation, bicubic interpolation, and nearest-neighbor interpolation, to estimate unknown pixel values [11]. These methods are relatively straightforward in practice, but the reconstructed images lack essential details. In contrast, reconstruction-based methods improve image quality by incorporating prior information of the image as constraints on the HR image. These methods can restore high-frequency details with the help of prior knowledge, but they require substantial computational costs, making it difficult for them to be readily applied to RSIs [12]. Learning-based approaches try to produce HR images by learning the mapping relationship established between external LR-HR image training pairs. Compared with the aforementioned two lines of methods, learning-based methods achieve better performance and have become the mainstream in this domain due to the powerful feature representation ability provided by convolutional neural networks (CNNs) [13]. However, learning-based methods generally adopt the post-upsampling framework [14], which solely exploits low-dimensional features while ignoring the discriminative high-dimensional feature information after the upsampling process.
In addition to utilizing the nonlinear mapping between LR-HR image training pairs, the self-similarity of the image is also employed to improve the performance of SR algorithms. Self-similarity refers to the property that similar patches appear repeatedly in a single image and is broadly adopted in image denoising [15,16], deblurring [17], and SR [18][19][20]. Self-similarity is also an intrinsic property of RSIs, i.e., internal recursive information. Figure 1 illustrates the self-similarities in RSIs. One can see that the down-scaled image is on the left, and the original one is on the right. Similar highway patches with green box labels appear repeatedly in the same-scale image, while the factory roofs with red box labels appear repeatedly across different scales, and these patches with similar edges and textures contain abundant internal recursive information. Previously, Pan et al. [21] employed dictionary learning to capture structural self-similarity features as additional information to improve the performance of the model. However, the sparse representation of SR has a limited ability to leverage the internal recursive information within the entire remote sensing image.
In this paper, we propose a hybrid-scale hierarchical transformer network (HSTNet) for RSISR. The HSTNet can enhance the representation of the high-dimensional features after the upsampling layers and fully utilize the self-similarity information in RSIs. Specifically, we propose a hybrid-scale feature exploitation (HSFE) module to leverage the internal similarity information both in single and cross scales within the images. The HSFE module contains two branches, i.e., a single-scale branch and a cross-scale branch. The former is employed to capture the recurrence within the same-scale image, and the latter is utilized to learn the feature correlation across different scales. Moreover, we design a cross-scale enhancement transformer (CSET) module to capture long-range dependencies and efficiently model the relevance between high-dimension and low-dimension features. In the CSET module, the encoders are used to encode low-dimension features from the HSFE module, and the decoder is used to fuse the multiple hierarchies of high-/low-dimensional features so as to enhance the representation ability of high-dimensional features. To sum up, the main contributions of this work are as follows:

1. We propose an HSFE module with two branches to leverage the internal recursive information from both single and cross scales within the images, enriching the feature representations for RSISR.

2. We design a CSET module to capture long-range dependencies and efficiently calculate the relevance between high-dimension and low-dimension features. It helps the network reconstruct SR images with rich edges and contours.

3. Jointly incorporating the HSFE and CSET modules, we form the HSTNet for RSISR. Extensive experiments on two challenging remote sensing datasets verify the superiority of the proposed model.

CNN-Based SR Models
Dong et al. [22] pioneered the adoption of an SR convolutional neural network (SRCNN) that utilizes three convolution layers to establish the nonlinear mapping relationship between LR-HR image training pairs. On the basis of the residual network introduced by He et al. [23], Kim et al. [24] designed a very deep SR convolutional neural network (VDSR) where residual learning is employed to accelerate model training and improve reconstruction quality. Lim et al. [25] built the enhanced deep super-resolution model to simplify the network and improve the computational efficiency by optimizing the initial residual block. Zhang et al. [26] designed a deep residual dense network in which the residual network with dense skip connections is used to transfer intermediate features. Benefiting from the channel attention (CA) module, Zhang et al. [27] presented a deep residual channel attention network to enhance the high-frequency channel feature representation. Dai et al. [28] designed a second-order CA mechanism to guide the model to improve its discriminative learning ability and exploit more conducive features. Li et al. [29] proposed an image super-resolution feedback network (SRFBN) in which a feedback mechanism is adopted to transfer high-level feature information. The SRFBN can leverage high-level features to refine the representation of low-level features.
Because of the impact of spatial resolution on the final performance of many RSI tasks, including instance segmentation, object detection, and scene classification, RSISR has also raised significant research interest. Lei et al. [30] proposed a local-global combined network (LGC-Net) which can enhance the multilevel representations, including local detail features and global information. Haut et al. [31] produced a deep compendium model (DCM), which leverages skip connections and residual units to exploit more informative features. To fuse different hierarchical contextual features efficiently, Wang et al. [32] designed a contextual transformation network (CTNet) based on a contextual transformation layer and a contextual feature aggregation module. Ni et al. [33] designed a hierarchical feature aggregation and self-learning network in which both self-learning and feedback mechanisms are employed to improve the quality of reconstructed images. Wang et al. [34] produced a multiscale fast Fourier transform (FFT)-based attention network (MSFFTAN), which employs a multi-input U-shape structure as the backbone for accurate RSISR. Liang et al. [35] presented a multiscale hybrid attention graph convolution neural network for RSISR in which a hybrid attention mechanism is adopted to obtain more abundant critical high-frequency information. Wang et al. [36] proposed a multiscale enhancement network which utilizes multiscale features of RSIs to recover more high-frequency details. However, the CNN-based methods above generally employ the post-upsampling framework that directly recovers HR images after the upsampling layer, ignoring the discriminative high-dimensional feature information after the upsampling process [14].

Transformer-Based SR Models
Due to the strong long-range dependence learning ability of transformers, transformer-based image SR methods have recently been studied by many researchers. Yang et al. [37] produced a texture transformer network for image super-resolution, in which a learnable texture extractor is utilized to exploit and transmit the relevant textures to LR images. Liang et al. [38] proposed SwinIR by transferring the ability of the Swin Transformer, which achieves competitive performance on three representative tasks, namely image denoising, JPEG compression artifact reduction, and image SR. Fang et al. [39] designed a lightweight hybrid network of a CNN and transformer that can extract beneficial features for image SR with the help of local and non-local priors. Lu et al. [40] presented a hybrid model with a CNN backbone and a transformer backbone, namely the efficient super-resolution transformer, which achieves impressive results with low computational cost. Yoo et al. [41] introduced an enriched CNN-transformer feature aggregation network in which the CNN branch and transformer branch can mutually enhance each other's representation during the feature extraction process. Due to the limited ability of multi-head self-attention to extract cross-scale information, cross-token attention is adopted in the transformer branch to utilize information from tokens of different scales.
Recently, transformers have also found their way into the domain of RSISR. Lei et al. [14] proposed a transformer-based enhancement network (TransENet) to capture features from different stages and adopted a multistage-enhanced structure that can integrate features from different dimensions. Ye et al. [42] proposed a transformer-based super-resolution method for RSIs, and they employed self-attention to establish dependency relationships within local and global features. Tu et al. [43] presented a GAN that draws on the strengths of the CNN and Swin Transformer, termed the SWCGAN. The SWCGAN fully considers the characteristics of RSISR, namely the large image size, the large amount of information, and the strong relevance between pixels. He et al. [44] designed a dense spectral transformer to extract the long-range dependencies for spectral super-resolution. Although the transformer can improve the long-range dependence learning ability of the model, these methods do not leverage the self-similarity within the entire remote sensing image [45].

Overall Framework
The framework of the proposed HSTNet is shown in Figure 2. It is built from a combination of three kinds of fundamental modules, i.e., a low-dimension feature extraction (LFE) module, a cross-scale enhancement transformer (CSET) module, and an upsample module. Specifically, the LFE module is utilized to extract high-frequency features across different scales, and the CSET module is employed to capture long-range dependencies to enhance the final feature representation. The upsample module is adopted to transform the feature representation from a low-dimensional space to a high-dimensional space.
Given an LR image I_LR, a convolutional layer with a 3 × 3 kernel is utilized to extract the initial feature F_0. The process of shallow feature extraction is formulated as

F_0 = f_sf(I_LR),

where f_sf(·) represents the convolutional operation and F_0 is the shallow feature. As shown in Figure 3, the LFE module consists of five basic extraction (BE) modules, and each BE module contains two 3 × 3 convolution layers and one hybrid-scale feature exploitation (HSFE) module. As the core component of the BE module, the HSFE module is proposed to model image self-similarity. The whole low-dimensional feature extraction process is formulated as

F_LFE^i = f_lfe^i(F_LFE^{i-1}), i = 1, 2, 3, with F_LFE^0 = F_0,

where f_lfe^i(·) and F_LFE^i represent the operation of the ith LFE module and its output, respectively. After the three cascaded LFE modules, a subpixel layer [46] is adopted to transform low-dimensional features into high-dimensional features, which is formulated as

F_up = Subpixel(F_LFE^3),

where F_up represents the high-dimension feature and Subpixel(·) denotes the function of the subpixel layer. The low-dimension features F_LFE^1, F_LFE^2, and F_LFE^3 and the high-dimension feature F_up are fed into three cascaded CSET modules for hierarchical feature enhancement. To reduce the redundancy of the enhanced features, a 1 × 1 convolution layer is employed to reduce the feature dimension. The complete process, including the enhancement and dimension reduction, is formulated as

F_CSET^i = f_cset^i(F_LFE^{4-i}, F_CSET^{i-1}), i = 1, 2, 3, with F_CSET^0 = F_up,

where f_cset^i(·) and F_CSET^i represent the operation of the ith CSET module and its output, respectively. Finally, one convolution layer is employed to obtain the SR image I_SR from the enhanced features. A conventional L_1 loss function was employed to train the proposed HSTNet model. Given a training set {I_LR^i, I_HR^i}_{i=1}^N, the loss function is formulated as

L(θ) = (1/N) Σ_{i=1}^N ‖ H_HSTNet(I_LR^i) - I_HR^i ‖_1,

where H_HSTNet(·) denotes the overall network and θ denotes its learnable parameters.
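For illustration, the subpixel (depth-to-space) rearrangement performed by the upsample module can be sketched in pure Python. This is a minimal sketch of the generic pixel-shuffle operation, not the paper's implementation; the tensor layout and scale factor r are assumptions:

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested-list tensor into (C, H*r, W*r).

    Mirrors the generic subpixel (depth-to-space) operation used for
    upsampling; a toy pure-Python sketch, not the paper's code.
    """
    cr2 = len(x)
    h, w = len(x[0]), len(x[0][0])
    assert cr2 % (r * r) == 0, "channels must be divisible by r^2"
    c = cr2 // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(c):
        for dy in range(r):
            for dx in range(r):
                # each of the r*r sub-channels fills one spatial offset
                src = x[ch * r * r + dy * r + dx]
                for i in range(h):
                    for j in range(w):
                        out[ch][i * r + dy][j * r + dx] = src[i][j]
    return out
```

For example, with r = 2, a (4, 1, 1) input becomes a single 2 × 2 map, which is how low-dimensional channels are traded for spatial resolution.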

Hybrid-Scale Feature Exploitation Module
To explore the internal recursive information at single and cross scales, we propose an HSFE module. Figure 4 exhibits the architecture of the HSFE module, which consists of a single-scale branch and a cross-scale branch. The single-scale branch aims to capture similar features within the same scale, and a non-local (NL) block [47] is utilized to calculate the relevance of these features. The cross-scale branch is applied to capture recursive features of the same image at different scales, and an adjusted non-local (ANL) block [45] is utilized to calculate the relevance of features between two different scales.

Single-scale branch: As depicted in Figure 4, we built the single-scale branch to extract single-scale features. Specifically, in the single-scale branch, several convolutional layers are applied to capture recursive features, and an NL block is employed to guide the network to concentrate on informative areas. As shown in Figure 4a, embedding functions are utilized to mine the similarity information as

θ(x_i) = W_θ x_i,  φ(x_j) = W_φ x_j,

where i is the index of the output position, j is the index that enumerates all positions, and x denotes the input of the NL block. W_θ and W_φ are the embedding weight matrices. The non-local function is symbolized as

y_i = (1 / C(x)) Σ_{∀j} f(x_i, x_j) g(x_j).

The relevance between x_i and all x_j can be calculated by the pairwise function f(·), e.g., the embedded Gaussian f(x_i, x_j) = exp(θ(x_i)^T φ(x_j)), with the normalization factor C(x) = Σ_{∀j} f(x_i, x_j). The feature representation of x_j can be obtained by the function g(·). Eventually, the output of the NL block is obtained by

z_i = W_z y_i + x_i,

where W_z is a weight matrix. The convolution layer following the NL block transforms the input into an attention map, which is then normalized with a sigmoid function. In addition, the main branch output features are multiplied by the attention map, where the activation values for each spatial and channel location are rescaled.
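The embedded-Gaussian non-local operation above can be sketched as follows. The weight matrices W_θ, W_φ, g, and W_z are collapsed to identities and features are scalars for brevity; these simplifications are illustrative assumptions, not the paper's implementation:

```python
import math

def non_local(xs):
    """Toy non-local block over a list of scalar features.

    f(x_i, x_j) = exp(x_i * x_j) (embedded Gaussian with identity
    embeddings), normalized by C(x) = sum_j f(x_i, x_j); the output
    adds the residual, z_i = y_i + x_i.
    """
    out = []
    for xi in xs:
        weights = [math.exp(xi * xj) for xj in xs]
        norm = sum(weights)  # C(x)
        yi = sum(w * xj for w, xj in zip(weights, xs)) / norm
        out.append(yi + xi)  # residual connection z_i = y_i + x_i
    return out
```

Each output position thus aggregates every input position, weighted by pairwise similarity, which is what lets the block relate recurring patches anywhere in the feature map.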
Cross-scale branch: As depicted in Figure 4, the cross-scale branch is utilized to perform cross-scale feature representation. Specifically, the input of the HSFE module is considered the basic scale feature, which is symbolized as F_in^b. To exploit the internal recursive information at different scales, the downsampled scale feature F_in^d is formulated as

F_in^d = f_down^s(F_in^b),

where f_down^s(·) denotes the operation of downsampling with scale factor s. Two contextual transformation layers (CTLs) [48] are employed to extract features at the two different scales F_in^b and F_in^d. To align the spatial dimensions of the features at different scales, the downsampled feature is first upsampled with the scale factor s. x^b and x^d represent the outputs of the basic scale and the downsampled scale through the two CTLs, and this process is formulated as

x^b = f_ctl(F_in^b),  x^d = f_up^s(f_ctl(F_in^d)),

where f_ctl(·) denotes the operation of the two CTLs and f_up^s(·) represents the operation of upsampling with scale factor s.
Similar to the single-scale branch, an ANL block [45] is introduced to exploit the feature correlation between RSIs at two different scales. As shown in Figure 4b, the ANL block is modified with respect to the NL block in that its two inputs differ. Thus, z_i in Equation (8) for the ANL block can be rewritten as

z_i = W_z y_i + x_i^b,

where y_i is computed from the pairwise relevance between the basic-scale features x^b and the aligned downsampled-scale features x^d. In the cross-scale branch, we employ the ANL block to fuse multiple scale features, thereby fully utilizing the self-similarity information. The HSFE module can be formulated as

F_out = f_sin(F_in) + f_cro(F_in),

where F_in is the input of the HSFE module and F_out is the output of the HSFE module. f_sin(·) and f_cro(·) are the operations of the single-scale branch and cross-scale branch, respectively.
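The two-branch flow of the HSFE module can be sketched schematically as below. Average-pool downsampling, nearest-neighbor upsampling, and identity stand-ins for the CTL and (A)NL blocks are illustrative assumptions, so only the scale-handling skeleton remains:

```python
def downsample2(x):
    """2x average-pool a 2D nested list (H and W assumed even)."""
    return [[(x[2 * i][2 * j] + x[2 * i][2 * j + 1]
              + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(len(x[0]) // 2)] for i in range(len(x) // 2)]

def upsample2(x):
    """2x nearest-neighbor upsample of a 2D nested list."""
    return [[x[i // 2][j // 2] for j in range(2 * len(x[0]))]
            for i in range(2 * len(x))]

def hsfe(x):
    """Schematic HSFE: single-scale branch plus cross-scale branch.

    The real module applies CTLs and (A)NL attention inside each
    branch; here both are identity stand-ins so only the scale flow
    (downsample, align by upsampling, fuse) remains visible.
    """
    single = x  # single-scale branch f_sin(x), identity stand-in
    cross = upsample2(downsample2(x))  # cross-scale branch f_cro(x)
    return [[a + b for a, b in zip(ra, rb)]
            for ra, rb in zip(single, cross)]
```

The key design point is that the downsampled features are upsampled back to the basic scale before fusion, so the two branches always add element-wise.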

Cross-Scale Enhancement Transformer Module
The cross-scale enhancement transformer module is designed to learn long-distance dependency relationships between high-dimension and low-dimension features and to enhance the final feature representation. The architecture of the CSET module is shown in Figure 5a. Specifically, we introduce the cross-scale token attention (CSTA) module [41] to exploit the internal recursive information within an input image across different scales. Moreover, we use three CSET modules to utilize different hierarchies of feature information. Figure 5a illustrates in detail the procedure of feature enhancement, using the CSET-3 module as an example.
Transformer encoder: The encoders are used to encode different hierarchies of features from the LFE modules. As shown in Figure 5a, the encoder is mainly composed of a multi-head self-attention (MHSA) block and a feed-forward network (FFN) block, which is similar to the original design in [49]. The FFN block contains two multilayer perceptron (MLP) layers with an expansion ratio r and a GELU activation function [50] in the middle. Moreover, we adopt layer normalization (LN) before the MHSA block and FFN block and employ a local residual structure to avoid gradient vanishing or explosion during backpropagation. The entire process of the encoder can be formulated as

F_EN^{i′} = f_mhsa(f_ln(F_LFE^i)) + F_LFE^i,
F_EN^i = f_ffn(f_ln(F_EN^{i′})) + F_EN^{i′},

where f_mhsa(·), f_ln(·), and f_ffn(·) denote the functions of the MHSA block, layer normalization, and FFN block, respectively. F_EN^{i′} is the intermediate output of the encoder. F_LFE^i and F_EN^i are the input and output of the encoder in the ith CSET module.
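The pre-LN encoder recurrence can be sketched in pure Python with a single attention head, identity Q/K/V projections, and an identity-style FFN stand-in (all simplifications are assumptions for brevity, not the paper's implementation):

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance."""
    m = sum(v) / len(v)
    var = sum((x - m) ** 2 for x in v) / len(v)
    return [(x - m) / math.sqrt(var + eps) for x in v]

def softmax(v):
    mx = max(v)
    e = [math.exp(x - mx) for x in v]
    s = sum(e)
    return [x / s for x in e]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention, identity projections."""
    out = []
    for q in tokens:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                          for k in tokens])
        out.append([sum(w * v[d] for w, v in zip(scores, tokens))
                    for d in range(len(q))])
    return out

def encoder_layer(tokens):
    """Pre-LN encoder step: x + MHSA(LN(x)), then x + FFN(LN(x)).

    The FFN is a stand-in (LN only); GELU and learned MLP weights omitted.
    """
    attn = self_attention([layer_norm(t) for t in tokens])
    x = [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, attn)]
    ffn = [layer_norm(t) for t in x]  # FFN stand-in
    return [[a + b for a, b in zip(t, u)] for t, u in zip(x, ffn)]
```

The two residual additions mirror the two formulas above: normalization is applied before each sub-block, and each sub-block's output is added back to its input.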
Transformer decoder: The decoders are utilized to fuse high-/low-dimensional features from multiple hierarchies to enhance the representation ability of high-dimensional features. As shown in Figure 5a, the decoder contains two MHSA blocks and a CSTA block [41]. With the CSTA block, the decoder can exploit the recursive information within an input image across different scales. The operation of the decoder can be formulated as

F_DE^i = f_mhsa(f_ln(F_up)) + F_up,
F_DE^{i′} = f_mhsa(f_ln(F_DE^i), f_ln(F_EN^i)) + F_DE^i,
F_CSET^i = f_csta(f_ln(F_DE^{i′})) + F_DE^{i′},

where f_csta(·) denotes the process of the CSTA block and F_up is the output of Encoder-4. Each CSET module has two inputs, and the composition of the inputs is determined by the location of the CSET module.
Here, t and s represent the stride and token size of the overlapping tokenization. To improve efficiency, T_s is replaced by T_a, and T_l is tokenized with a larger token size and overlapping. Numerous large-size tokens can be obtained by overlapping, enabling the transformer to actively learn patch recurrence across scales.
To effectively exploit self-similarity across different scales, we compute cross-scale attention scores between the tokens in T_s and T_l. Specifically, the queries q_s, keys k_s, and values v_s ∈ R^{n×d/2} are generated from T_s. Similarly, the queries q_l, keys k_l, and values v_l ∈ R^{n′×d/2} are generated from T_l. The reorganized triples (q_s, k_l, v_l) and (q_l, k_s, v_s) are obtained by swapping their key-value pairs with each other. Then, the attention operation is executed using the reorganized triples. It should be noted that the projection of the attention operations reduces the last dimension of the queries, keys, and values in T_l from d to d/2. Subsequently, we re-project the attention results of T_l to the dimension of n′ × d and then transform them to the dimension of n × d/2. Finally, we concatenate the attention results to obtain the output of the CSTA block.
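The key-value swap at the heart of the CSTA block can be sketched as follows. Identity q/k/v projections are assumed and the d → d/2 reduction and re-projection steps are omitted, so only the cross-scale exchange of (q_s, k_l, v_l) and (q_l, k_s, v_s) is shown:

```python
import math

def cross_attention(queries, keys, values):
    """Scaled dot-product attention of `queries` against `keys`/`values`."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in keys]
        mx = max(scores)
        e = [math.exp(s - mx) for s in scores]
        z = sum(e)
        w = [x / z for x in e]
        out.append([sum(wi * v[d] for wi, v in zip(w, values))
                    for d in range(len(values[0]))])
    return out

def csta(t_s, t_l):
    """Schematic CSTA: swap key-value pairs between the two token sets.

    Small-scale queries attend to large-scale keys/values and vice
    versa; the real block additionally reduces and re-projects the
    channel dimension, which is omitted in this sketch.
    """
    small_branch = cross_attention(t_s, t_l, t_l)  # (q_s, k_l, v_l)
    large_branch = cross_attention(t_l, t_s, t_s)  # (q_l, k_s, v_s)
    return small_branch, large_branch
```

Because each branch's queries read from the other branch's keys and values, recurring patches of different token sizes can attend to one another directly.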

Experiments

Experimental Dataset and Settings
We evaluate the proposed method on two widely adopted benchmarks [30,31,51], namely the UCMerced dataset [52] and the AID dataset [53], to demonstrate the effectiveness of the proposed HSTNet.
UCMerced dataset: This dataset consists of 2100 images belonging to 21 categories of varied remote sensing scenes. All images have a pixel size of 256 × 256 and a spatial resolution of 0.3 m/pixel. The dataset is divided equally into two distinct sets, one comprising 1050 images for training and the other 1050 images for testing.
AID dataset: This dataset encompasses 10,000 remote sensing images spanning 30 unique categories. In contrast to the UCMerced dataset, all images in this dataset have a pixel size of 600 × 600 and a spatial resolution of 0.5 m/pixel. A selection of 8000 images from this dataset was randomly chosen for training, while the remaining 2000 images were used for testing. In addition, a validation set consisting of five arbitrary images from each category was established.
To verify the generalization of the proposed method, we further applied the trained model to real-world images from the Gaofen-1 and Gaofen-2 satellites. We downsampled HR images through bicubic operations to obtain LR images. Two mainstream metrics, namely the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM), were calculated on the Y channel of the YCbCr space for objective evaluation. The PSNR is formulated as

PSNR = 10 · log_10( L² / ( (1/N) Σ_{i=1}^N (I_SR(i) - I_HR(i))² ) ),

where L represents the maximum pixel value, and N denotes the number of pixels in I_SR and I_HR.
The SSIM is formulated as

SSIM(x, y) = ( (2 u_x u_y + k_1)(2 σ_xy + k_2) ) / ( (u_x² + u_y² + k_1)(σ_x² + σ_y² + k_2) ),

where x and y represent the two images. σ_xy symbolizes the covariance between x and y; u and σ² represent the average value and variance; and k_1 and k_2 denote constant relaxation terms. Multi-adds and model parameters were utilized to evaluate the computational complexity [32,54]. In addition, the natural image quality evaluator (NIQE) was adopted to validate the reconstruction of real-world images from the Gaofen-1 and Gaofen-2 satellites [55].
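As a quick reference, the PSNR definition above can be computed as follows. This is a straightforward sketch over flattened pixel sequences; the Y-channel extraction step is omitted:

```python
import math

def psnr(sr, hr, max_val=255.0):
    """PSNR between two equal-length pixel sequences.

    PSNR = 10 * log10(L^2 / MSE), with L the maximum pixel value and
    MSE the mean squared error over all N pixels.
    """
    assert len(sr) == len(hr) and sr, "images must be non-empty and equal size"
    mse = sum((a - b) ** 2 for a, b in zip(sr, hr)) / len(sr)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Higher PSNR values indicate a reconstruction closer to the HR reference; identical images yield an infinite PSNR.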

Implementation Details
We conducted experiments on remote sensing image data with scale factors of ×2, ×3, and ×4. During training, we randomly cropped 48 × 48 patches from LR images and extracted ground-truth references from the corresponding HR images. We also employed horizontal flipping and random rotation (90°, 180°, and 270°) to augment the training samples. Table 1 presents the comprehensive hyperparameter settings of the cross-scale enhancement transformer (CSET) module. We adopted the Adam optimizer [56] to train the HSTNet with β_1 = 0.9, β_2 = 0.99, and ε = 10^-8. The initial learning rate was set to 10^-4, and the batch size was 16. The proposed model was trained for 800 epochs, and the learning rate was halved after 400 epochs. Both the training and testing stages were performed using the PyTorch framework with CUDA Toolkit 11.4, cuDNN 8.2.2, Python 3.7, and two NVIDIA 3090 Ti GPUs.
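The learning-rate schedule described above (initial rate 10^-4, halved after 400 epochs) corresponds to a simple step decay, which can be sketched as:

```python
def learning_rate(epoch, base_lr=1e-4, step=400, gamma=0.5):
    """Step-decay schedule: multiply the LR by `gamma` every `step` epochs.

    Parameter names are illustrative; with the paper's settings
    (base_lr=1e-4, step=400, gamma=0.5), epochs 0-399 train at 1e-4
    and epochs 400-799 at 5e-5.
    """
    return base_lr * gamma ** (epoch // step)
```

Step decay of this kind is a common choice for SR training because it lets the model settle into a finer optimum after the initial high-rate phase.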

Qualitative Evaluation
To further verify the advantages of the proposed method, the subjective results of SR images reconstructed by the aforementioned methods are shown in Figures 6 and 7. Figure 6 shows the reconstruction results of the above methods for the UCMerced dataset, taking the "airplane" and "runway" scenes as examples. Figure 7 shows the visual results of the "stadium" and "medium-residential" scenes in the AID dataset. In general, the SR results reconstructed by the proposed method possess sharper edges and clearer contours compared with other methods, which verifies the effectiveness of the HSTNet.

Results on Real Remote Sensing Data
Real images acquired by the GaoFen-1 (GF-1) and GaoFen-2 (GF-2) satellites were employed to assess the robustness of the HSTNet. The spatial resolutions of GF-1 and GF-2 are 8 and 3.2 m/pixel, respectively. Three visible bands are selected from the GF-1 and GF-2 satellite images to generate the LR inputs. The DCM [31], ACT [41], and proposed HSTNet models pre-trained on the UCMerced dataset are utilized for SR image reconstruction. Figures 8 and 9 demonstrate the reconstruction results of the aforementioned methods on real data in some common scenes, including river, factory, overpass, and paddy field scenes. One can see that the proposed HSTNet obtains favorable results. Compared with DCM [31] and ACT [41], the reconstructed images of the proposed HSTNet achieved the lowest NIQE scores in all four common scenes. Although the pixel size of these input images differs from that of the LR images in the training set (600 × 600 for the real-world images versus 256 × 256 for the training images), the HSTNet can still achieve good results in terms of visual perception quality. This verifies the robustness of the proposed HSTNet.

Ablation Studies
Ablation studies with the scale factor of ×4 were conducted on the UCMerced dataset to demonstrate the effectiveness of the proposed fundamental modules in the HSTNet model. The HSTNet achieves the highest PSNR and SSIM when utilizing three LFE and five HSFE modules. When employing three LFE and eight HSFE modules, the model has the largest number of parameters and the most computation, yet its performance is not optimal. Therefore, considering both the performance of the model and the computational complexity, we adopted three LFE and five HSFE modules in the proposed method. The results confirm the effectiveness of the LFE and HSFE modules in the proposed model, as well as the rationality of the number of LFE and HSFE modules.

Effects of the HSFE module:
We devised the HSFE module in the proposed LFE module to exploit the recursive information inherent in the image. We conducted further ablation studies by substituting the HSFE module with widely used feature extraction modules in SR algorithms, namely RCAB [27], CTB [48], CB [58], and SSEM [45], to validate the effectiveness of the HSFE module. Among them, SSEM [45] is also used to mine scale information. As presented in Table 6, the HSFE module outperforms the other feature extraction modules in terms of PSNR and SSIM, demonstrating its effectiveness in feature extraction. Meanwhile, it is also competitive in terms of parameter quantity and computational complexity.

Number of CSET modules:
The CSET module is designed to learn long-distance dependency relationships between features of different dimensions. To validate the effectiveness of the proposed CSET modules, we conducted ablation experiments using varying numbers of CSET modules. Table 7 shows that the configuration with three CSET modules performs the best in terms of PSNR and SSIM. With more CSET modules, the features of the low-dimension space are transmitted more fully to the high-dimension space, reducing the difficulty of optimization and facilitating the convergence of the deep model. The aforementioned results demonstrate the effectiveness of the CSET module in enhancing the representation of high-dimensional features.
Effects of the CSTA block: The CSTA block [41] is introduced to enable the CSET module to utilize the recurrent patch information of different scales in the input image. To verify the effectiveness of the CSTA block, we analyzed the composition of the transformer. Table 8 presents the comparative results of the two different transformers. It shows that the CSTA block is beneficial for improving the performance of the HSTNet.

Conclusions and Future Work
In this paper, we presented a hybrid-scale hierarchical transformer network (HSTNet) for remote sensing image super-resolution (RSISR). The HSTNet contains two crucial components, i.e., a hybrid-scale feature exploitation (HSFE) module and a cross-scale enhancement transformer (CSET) module. Specifically, the HSFE module with two branches was built to leverage the internal recurrence of information both in single and cross scales within the images. Meanwhile, the CSET module was built to capture long-range dependencies and effectively mine the correlation between high-dimension and low-dimension features. Experimental results on two challenging remote sensing datasets verified the effectiveness and superiority of the proposed HSTNet. In the future, more efforts are expected to simplify the network architecture and design a more effective low-dimensional feature extraction branch to further improve RSISR performance.

Figure 1 .
Figure 1. Illustration of self-similarities in RSIs with single-scale (green box) and cross-scale (red box) similarities.

Figure 2 .
Figure 2. Architecture of the proposed HSTNet for remote sensing image SR.
Figure 3 .
Figure 3. Architecture of the LFE module.

Figure 4 .
Figure 4. Architecture of the proposed HSFE module.
F_DE^i and F_DE^{i′} represent the intermediate outputs of the decoder. F_CSET^i represents the output of the ith CSET module. CSTA block: The CSTA block [41] is introduced to utilize the recurrent patch information of different scales in the input image. The feature information flow of the CSTA block is illustrated in Figure 5b. Specifically, the input token embeddings T ∈ R^{n×d} of the CSTA block are split into T_a ∈ R^{n×d/2} and T_b ∈ R^{n×d/2} along the channel axis. Then, T_s ∈ R^{n×d/2}, including n tokens from T_a, and T_l ∈ R^{n′×d}, including n′ tokens obtained by rearranging T_b, are generated. The number of tokens in T_l can be set to n′, determined by the token size and stride of the overlapping tokenization.

Figure 5 .
Figure 5. Architecture of the CSET module.

Table 1 .
Parameter setting of the CSET module in the HSTNet.

Table 2 .
Comparative results for the UCMerced dataset and AID dataset. The best and the second-best results are marked in red and blue, respectively.

Table 3 .
Average PSNR per category for the UCMerced dataset with the scale factor of ×3. The best and the second-best results are marked in red and blue, respectively.

Table 4 .
Average PSNR per category for the AID dataset with the scale factor of ×4. The best and the second-best results are marked in red and blue, respectively.
Table 5 presents a comparative analysis of varying quantities of LFE and HSFE modules. It indicates that when adopting two LFE and two HSFE modules, the model has the smallest number of parameters and the least computation, but the lowest PSNR and SSIM values.

Table 5 .
Ablation analysis of the number of LFE and HSFE modules (the best result is highlighted in bold).

Table 6 .
Ablation analysis of different feature extraction modules in LFE module (the best result is highlighted in bold).

Table 7 .
Ablation analysis of the number of CSET modules (the best result is highlighted in bold).

Table 8 .
Ablation analysis of the CSTA block.The best performances are highlighted in bold.