Electronics · Article · Open Access · 25 July 2025

Enhancing Image Super-Resolution Models with Shift Operations and Hybrid Attention Mechanisms

1 Department of Artificial Intelligence, Asia University, Taichung 407, Taiwan
2 Department of Information Engineering and Computer Science, Feng-Chia University, Taichung 407, Taiwan
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Recent Research and Applications of Computer Vision and Image Processing

Abstract

This study proposes an optimized approach to address the high computational demands and significant GPU memory consumption commonly associated with Transformer-based models. Building upon the HAT framework, a shallow feature extraction module is enhanced to improve local feature representation, thereby achieving a better balance between computational efficiency and model performance. Furthermore, inspired by self-supervised learning (SSL) techniques and incorporating shift operations, the proposed method effectively reduces both the number of parameters and the overall computational load. The resulting ISR-SHA model is trained and evaluated on the DF2K dataset, achieving approximately a 30% reduction in FLOPs and parameter count compared to the original HAT model, with only marginal declines in PSNR (0.02) and SSIM (0.0006). Experimental results confirm that ISR-SHA outperforms most existing super-resolution models in terms of performance while significantly enhancing computational efficiency without compromising output quality.

1. Introduction

Previous image restoration methods have predominantly relied on Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs). CNNs have attracted considerable attention due to their strong feature extraction capabilities, particularly in capturing spatial information within images. By applying multiple layers of convolution and pooling operations, CNNs extract hierarchical representations, making them effective across a wide range of image processing tasks. However, their performance tends to degrade in complex scenes requiring fine-grained detail. Under challenging conditions such as lighting variations, occlusions, and background noise, CNNs often struggle to capture global contextual information, resulting in a noticeable decline in restoration quality.

In contrast, GANs constitute a distinct class of generative models known for their ability to produce high-fidelity synthetic images. Their core mechanism involves a competitive training process between a generator and a discriminator, allowing GANs to learn the underlying data distribution and generate realistic outputs. Despite their impressive image generation performance, GANs suffer from training instability and are prone to mode collapse, limiting their effectiveness in complex image restoration scenarios.

With the recent success of Transformer models in natural language processing and related fields, researchers have increasingly investigated their applicability to image processing tasks, particularly image super-resolution. Transformers are characterized by their self-attention mechanisms, which excel at modeling long-range dependencies within input sequences. This capability offers a significant advantage in image reconstruction, enabling the model to better capture global contextual information. Compared to traditional CNNs, Transformer-based models demonstrate a superior ability to analyze and interpret complex image structures, thereby enhancing restoration performance in visually intricate scenarios.
The primary contribution of this study is the introduction of a Transformer-based super-resolution model, referred to as ISR-SHA. This model leverages shifting operations and hybrid attention techniques to improve image reconstruction efficiency. The central idea of this research is to fully exploit the global perceptual capabilities of Transformers to capture long-range dependencies in images, which is crucial to improving the quality of image reconstruction. Furthermore, to address the high computational demands associated with Transformer models, this study adopts the HAT model as its foundational architecture. It proposes a relatively simple solution to reduce FLOPs (floating-point operations) and the number of parameters without significantly altering the fundamental structure of the Transformer super-resolution model, thereby enhancing its feasibility in practical applications.
During the research process, considering the complexity of deep feature extraction modules, the researchers particularly focused on utilizing shifting operations to achieve a balance between performance and computational cost. Shifting operations serve as an effective technique to enhance the model’s feature capture capabilities without significantly increasing the computational burden. This research direction is not only significant for improving the performance of Transformers in super-resolution tasks but also provides new insights and methods for subsequent studies.
This study used the DF2K dataset for training and validation, with results indicating a significant reduction in both FLOPs and parameter counts for ISR-SHA compared to the HAT model. Despite the decrease in computational demands, the model maintains stable performance metrics. Experimental results demonstrate that ISR-SHA continues to surpass the majority of super-resolution models while maintaining its performance levels at a substantially reduced computational cost, which improves its practicality and facilitates the further advancement of image processing technologies.

3. Implementation

This section presents the proposed ISR-SHA image super-resolution model, which is used to verify the effectiveness of shift operations and hybrid attention mechanisms in reducing model parameters and improving training efficiency.

3.1. ISR-SHA

The model architecture, as shown in Figure 6, is divided into three modules: a shallow feature extraction module, a deep feature extraction module, and a high-quality image reconstruction module. This design is commonly adopted in current Transformer-based super-resolution models. Specifically, for a given low-resolution (LR) input image $I_{LR} \in \mathbb{R}^{H \times W \times C_{in}}$, a multi-scale feature fusion structure is first used to extract shallow features $F_0 \in \mathbb{R}^{H \times W \times C}$, where $C_{in}$ and $C$ represent the number of input and intermediate feature channels, respectively.
Figure 6. ShiftNet operation flow.
Then, a series of Integrated Residual Hybrid Attention Groups (IRHAG) and a 3 × 3 convolution layer $H_{Conv}(\cdot)$ are applied to perform deep feature extraction. A global residual connection is subsequently added to fuse the shallow features $F_0$ and the deep features $F_D \in \mathbb{R}^{H \times W \times C}$, and the high-resolution output is reconstructed through the reconstruction module.
As illustrated in Figure 6, each IRHAG contains multiple Integrated Hybrid Attention Blocks (IHAB), an Overlapping Cross Attention Block (OCAB), and a 3 × 3 convolution layer with residual connections. In the IHAB, a Shifting Channel Attention Block (SCAB) is combined with (Shifted) Window Multi-Head Self-Attention ((S)W-MSA), endowing the model with robust representation capabilities. For the reconstruction module, pixel-shuffle is employed to upsample and fuse features. Finally, an L1 loss is used to optimize the network parameters.
In this study, the HAT model is used as the base framework and is improved to reduce FLOPs and the number of parameters while maintaining performance. Unlike previous modifications, the main architectural changes are made to the shallow feature extraction module rather than the deep feature extraction module; for the deep feature module, only shift operations and minor architectural adjustments are applied to reduce FLOPs and parameter counts.

3.2. Shallow Feature Extraction Module

Given the objective of this study to minimize computational costs, we implement a shifting operation within the deep feature extraction module. Nevertheless, it is essential to recognize that the primary function of this module is to capture global information from the input images. Consequently, applying shifting operations may result in the loss of critical global information, which could adversely affect performance. To address this potential loss, we propose modifications to the design of the shallow feature extraction module. Specifically, we introduce a multi-scale feature fusion architecture aimed at maximizing the receptive field during the extraction of shallow features, thereby compensating for the information loss incurred by the shifting operation; this is illustrated in Figure 7.
Figure 7. Shallow feature extraction model.
For a given low-quality image $I_{LQ} \in \mathbb{R}^{H \times W \times C_{in}}$ (where $H$, $W$, and $C_{in}$ denote the height, width, and number of input channels of the image, respectively), we employ both a 3 × 3 convolution $H_{conv}$ and a dilated convolution $H_{dilated}$ to perform multi-scale feature fusion and extract the shallow features $F_0 \in \mathbb{R}^{H \times W \times C}$ as follows:
$F_0 = H_{conv}(I_{LQ}) + H_{dilated}(I_{LQ})$
where C represents the number of feature channels. The utilization of dilated convolution enables the attainment of an expanded receptive field while maintaining a comparable number of FLOPs and parameters to those of standard convolution. Both standard and dilated convolutions are particularly adept at early-stage visual processing, facilitating enhanced optimization stability and superior results. Moreover, they provide a straightforward mechanism for mapping the input image into a higher-dimensional feature space.
Dilated convolution differs fundamentally from standard convolution. In standard convolution, each parameter of the kernel interacts multiplicatively with adjacent pixels in the input image, and the products are summed to produce the output feature map. Dilated convolution instead inserts fixed gaps between the kernel elements; the spacing between these elements is referred to as the dilation rate. A notable property of dilated convolution is its capacity to enlarge the receptive field without increasing the number of parameters or the computational burden, thereby allowing each output pixel to correspond to a larger area of the input image. The receptive field of a standard 3 × 3 convolution covers 9 input positions, whereas a 3 × 3 dilated convolution with a dilation rate of 2 covers 25, while its parameter count remains only 36% of that of a standard 5 × 5 convolution.
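To make the fusion concrete, the following is a minimal PyTorch sketch of the multi-scale shallow feature extraction described above. The channel width of 180, the padding choices, and the module name ShallowFeatureExtraction are illustrative assumptions rather than values taken from the paper.

import torch
import torch.nn as nn

class ShallowFeatureExtraction(nn.Module):
    # Fuses a standard 3x3 convolution with a 3x3 dilated convolution
    # (dilation rate 2), i.e. F_0 = H_conv(I_LQ) + H_dilated(I_LQ).
    def __init__(self, in_channels=3, channels=180):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
        # dilation=2 with padding=2 keeps the spatial size unchanged while
        # enlarging the receptive field from 9 to 25 input positions.
        self.dilated = nn.Conv2d(in_channels, channels, kernel_size=3,
                                 padding=2, dilation=2)

    def forward(self, x):
        return self.conv(x) + self.dilated(x)

lq = torch.randn(1, 3, 32, 32)            # a low-quality input I_LQ
f0 = ShallowFeatureExtraction()(lq)       # shallow features F_0 of shape (1, 180, 32, 32)

Because both branches preserve the spatial resolution, their outputs can be summed element-wise, matching the formulation of $F_0$ above.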

3.3. Deep Feature Extraction Module

The shallow feature $F_0$ is subsequently passed into the deep feature extraction module $H_{DF}$ to obtain the deep features $F_D$, represented as follows:
$F_D = H_{DF}(F_0)$
The overall architecture is inspired by the Hybrid Attention Transformer (HAT) framework, wherein the deep feature extraction module comprises multiple Integrated Residual Hybrid Attention Groups (IRHAGs), formulated as follows:
$F_i = H_{IRHAG_i}(F_{i-1}), \quad i \in \{1, 2, \ldots, R\}, \qquad F_D = H_{conv}(F_R)$
Here, $H_{IRHAG_i}$ and $H_{conv}$ represent the $i$-th IRHAG and the standard convolution operation, respectively, with $F_0$ as the initial input. The final convolution layer serves to consolidate the information from the IRHAGs, facilitating subsequent processing steps.
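A minimal sketch of this chaining is given below; it assumes the IRHAGs are supplied as ready-made modules by a factory callable, treats each group as a feature-map-to-feature-map mapping (ignoring the token reshaping used inside Swin-style blocks), and the number of groups shown here is only illustrative.

import torch.nn as nn

class DeepFeatureExtraction(nn.Module):
    # F_i = H_IRHAG_i(F_{i-1}) for i = 1..R, followed by F_D = H_conv(F_R).
    def __init__(self, irhag_factory, channels=180, num_groups=6):
        super().__init__()
        self.groups = nn.ModuleList([irhag_factory() for _ in range(num_groups)])
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f0):
        x = f0
        for group in self.groups:     # F_i = H_IRHAG_i(F_{i-1})
            x = group(x)
        return self.conv(x)           # F_D, consolidated by the final 3x3 convolution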
The IRHAG is composed of multiple Integrated Hybrid Attention Blocks (IHABs) and a single Overlapping Cross Attention Block (OCAB). Within each IHAB, a Shifting Channel Attention Block (SCAB) is integrated after the first LayerNorm layer of the standard Swin Transformer block, operating in parallel with the (Shifted) Window Multi-Head Self-Attention ((S)W-MSA). After adding the residual connection, the output is further processed through another LayerNorm and a Multi-Layer Perceptron (MLP), yielding the output of the IHAB, denoted as $Y$; the process is as follows:
$X_{inter1} = \mathrm{LN}(X)$
$X_{inter2} = \text{SW-MSA}(X_{inter1}) + \alpha \, \mathrm{SCAB}(X_{inter1}) + X$
$Y = \mathrm{MLP}(\mathrm{LN}(X_{inter2})) + X_{inter2}$
In this context, $X_{inter1}$ and $X_{inter2}$ represent intermediate features, MLP denotes the Multi-Layer Perceptron, and $\alpha$ is a weighting factor applied to the SCAB branch. The SCAB module captures global information across channels, while the SW-MSA enables the modeling of long-range dependencies, collectively enhancing the model's performance.
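A minimal sketch of the IHAB computation, assuming the input is already a flattened token sequence of shape (B, N, C) and that the (S)W-MSA and SCAB submodules are passed in from outside, is given below. The MLP expansion ratio and the value of alpha are illustrative assumptions, not values reported in the paper.

import torch.nn as nn

class IHAB(nn.Module):
    # X1 = LN(X);  X2 = SW-MSA(X1) + alpha * SCAB(X1) + X;  Y = MLP(LN(X2)) + X2
    def __init__(self, dim, window_attention, scab, mlp_ratio=2.0, alpha=0.01):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = window_attention      # (shifted) window multi-head self-attention
        self.scab = scab                  # shifting channel attention branch
        self.alpha = alpha                # weight on the SCAB branch
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):                 # x: (B, N, C) tokens
        x1 = self.norm1(x)
        x2 = self.attn(x1) + self.alpha * self.scab(x1) + x
        return self.mlp(self.norm2(x2)) + x2

Since the SCAB contains convolutions that operate on 2D feature maps, the full model would reshape between the token sequence and the spatial layout around the SCAB call; that bookkeeping is omitted from this sketch.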
The Shifting Channel Attention Block comprises two pointwise convolutions, a learnable shifting operation, a GELU activation function, and a channel attention module. Initially, the number of input channels is reduced by a constant factor $\beta$ to decrease computational costs. For an input feature with $C$ channels, the first pointwise convolution reduces the output feature's channel count to $C/\beta$. The active shifting operation further diminishes the computational load, and a second pointwise convolution then restores the feature to $C$ channels, as illustrated in Figure 8. Finally, a standard channel attention module is employed to adaptively scale the channel features. Experimental results demonstrate that this approach can reduce FLOPs and parameter counts by approximately 30%.
Figure 8. Shift channel attention block.
Compared to traditional group shifts, the active shifting method offers several advantages. First, the displacement amount in active shifting is learnable, eliminating the need for manual allocation. Second, active shifting is independent of the kernel size, meaning that changes to the kernel size do not affect the shifting outcome. Additionally, it does not require pointwise convolutions to maintain permutation invariance. The active shifting operation displaces each channel $c$ of the input $I$ by a per-channel offset $(x_c, y_c)$, i.e., $O_{c,i,j} = I_{c,\, i + x_c,\, j + y_c}$.
By making the displacement parameters $x_c$ and $y_c$ differentiable, the original integer constraints are relaxed to real numbers, transforming the shifting operation into a bilinear interpolation, which allows $x_c$ and $y_c$ to be optimized adaptively. Here, $\Omega_c$ represents the set of the four integer grid points nearest to $(i + x_c, j + y_c)$, so $O_{c,i,j}$ can be written as follows:
$O_{c,i,j} = \sum_{(n,m) \in \Omega_c} I_{c,n,m} \cdot (1 - |i + x_c - n|)(1 - |j + y_c - m|)$
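The sketch below realises this operation with torch.nn.functional.grid_sample, which performs the same bilinear sampling as the equation above, and wraps it in a shifting channel attention block with the two pointwise convolutions, a GELU, and a squeeze-and-excitation style channel attention. The reduction factor beta, the channel-attention squeeze ratio, and the exact position of the GELU are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActiveShift(nn.Module):
    # Learnable real-valued per-channel 2D displacement; sampling at non-integer
    # positions realises the bilinear form of O_{c,i,j} given above.
    def __init__(self, channels):
        super().__init__()
        self.shift = nn.Parameter(torch.zeros(channels, 2))   # per-channel shift in pixels

    def forward(self, x):                                     # x: (B, C, H, W)
        b, c, h, w = x.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device),
                                indexing="ij")
        grid = torch.stack((xs, ys), dim=-1)                  # (H, W, 2) in normalised coords
        dx = self.shift[:, 0] * 2.0 / max(w - 1, 1)           # pixel shift -> [-1, 1] units
        dy = self.shift[:, 1] * 2.0 / max(h - 1, 1)
        offsets = torch.stack((dx, dy), dim=-1).view(c, 1, 1, 2)
        grid = (grid.unsqueeze(0) + offsets).repeat(b, 1, 1, 1)   # (B*C, H, W, 2)
        out = F.grid_sample(x.reshape(b * c, 1, h, w), grid,
                            mode="bilinear", align_corners=True)
        return out.reshape(b, c, h, w)

class SCAB(nn.Module):
    # pointwise conv (C -> C/beta) -> active shift -> GELU ->
    # pointwise conv (C/beta -> C) -> channel attention.
    def __init__(self, channels, beta=3, squeeze=16):
        super().__init__()
        mid = channels // beta
        self.down = nn.Conv2d(channels, mid, kernel_size=1)
        self.shift = ActiveShift(mid)
        self.act = nn.GELU()
        self.up = nn.Conv2d(mid, channels, kernel_size=1)
        self.ca = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                nn.Conv2d(channels, channels // squeeze, 1),
                                nn.ReLU(inplace=True),
                                nn.Conv2d(channels // squeeze, channels, 1),
                                nn.Sigmoid())

    def forward(self, x):
        y = self.up(self.act(self.shift(self.down(x))))
        return y * self.ca(y)                                  # adaptively rescale channels

In this sketch, initialising the displacements to zero makes the block start as a plain bottleneck with channel attention; the shifts are then learned jointly with the rest of the network.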

3.4. High Quality Image Reconstruction Module

The shallow features and deep features are aggregated to reconstruct a high-quality image $I_{RHQ}$, expressed mathematically as follows:
$I_{RHQ} = H_{REC}(F_0 + F_D)$
In this equation, $H_{REC}(\cdot)$ denotes the reconstruction module function. The primary role of the shallow features is to capture low-frequency information of the image, whereas the deep features focus on restoring high-frequency information. By integrating both low-frequency and high-frequency information, the model's performance is significantly enhanced.
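A minimal sketch of the reconstruction step is given below; the use of two ×2 pixel-shuffle stages for ×4 upscaling and the final 3 × 3 convolution follow common practice for pixel-shuffle upsamplers and are assumptions rather than the paper's exact layer arrangement.

import torch.nn as nn

class Reconstruction(nn.Module):
    # I_RHQ = H_REC(F_0 + F_D): global residual fusion followed by
    # pixel-shuffle upsampling to the target scale (x2 or x4).
    def __init__(self, channels=180, out_channels=3, scale=4):
        super().__init__()
        layers = []
        for _ in range(scale // 2):                     # x4 uses two x2 stages, x2 uses one
            layers += [nn.Conv2d(channels, 4 * channels, kernel_size=3, padding=1),
                       nn.PixelShuffle(2)]
        layers += [nn.Conv2d(channels, out_channels, kernel_size=3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, f0, fd):
        return self.body(f0 + fd)                       # high-quality output I_RHQ

# Training then minimises the L1 loss between the reconstruction and the ground truth,
# e.g. loss = nn.L1Loss()(sr, hr)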

4. Results

This section presents a comprehensive evaluation of the proposed ISR-SHA model's performance and computational efficiency. Model validation is conducted using PSNR and SSIM as key metrics to assess image quality, complemented by FLOPs and parameter count to measure computational demands. The DF2K training dataset, along with the Set5 and Set14 validation sets, provides a robust foundation for comparison across models, including SwinIR and HAT. Performance benchmarks highlight ISR-SHA's significant reduction in computational cost with minimal compromise in image quality, as confirmed by comparative analysis and ablation studies. Additionally, the ablation experiments emphasize the importance of multi-scale feature fusion in preserving model performance. Collectively, these results demonstrate ISR-SHA's effective balance between high performance and reduced computational complexity, affirming its superiority over traditional SR methods.

4.1. Model Validation

In the model validation process, this study uses Peak Signal-to-Noise Ratio (PSNR) [34,35,36] and Structural Similarity Index Measure (SSIM) [37,38] as the primary evaluation metrics. PSNR assesses the quality of generated images by calculating the mean squared error (MSE) between generated and real images, thus providing a quantitative score for image fidelity. SSIM, on the other hand, compares the luminance, contrast, and structural attributes of images, aligning more closely with human visual perception; the SSIM score is computed from statistical measures such as the mean, variance, and covariance. Additionally, given that the aim of this study is to maintain high PSNR and SSIM scores while reducing computational cost, we also evaluate each model by its number of floating-point operations (FLOPs) and parameter count. These measurements help determine whether the target goals have been achieved.
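For reference, PSNR can be computed directly from the MSE as in the sketch below. This is a generic NumPy implementation, not the evaluation script used in the paper, and it assumes 8-bit images with a peak value of 255; an existing SSIM implementation such as skimage.metrics.structural_similarity can be used for the second metric.

import numpy as np

def psnr(reference, reconstructed, max_val=255.0):
    # PSNR = 10 * log10(MAX^2 / MSE) between the ground truth and the SR output.
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)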

4.2. Datasets

In this study, we used the same training settings, including the learning rate and the number of iterations, for all compared models. The comparison includes the following methods:
  • SwinIR: a widely recognized super-resolution method comprising three modules: shallow feature extraction, deep feature extraction, and image reconstruction.
  • HAT: a state-of-the-art super-resolution model supporting both ×2 and ×4 upscaling; our study builds on improvements to the HAT architecture.
The training dataset used in this investigation is DF2K, which combines the DIV2K dataset (800 clean 2K images) with the Flickr2K dataset (2650 noise-laden real-world 2K images). Incorporating the DF2K dataset significantly improves the utility of the model. For validation, we use the commonly used Set5 and Set14 datasets, consisting of 5 and 14 small images, respectively.

4.3. Comparison of ISR-SHA with Other Methods

Table 1 presents the performance of the proposed ISR-SHA model compared to other methods on the Set5 and Set14 datasets. It is evident from the table that the ISR-SHA model effectively reduces computational costs while maintaining high model performance. The FLOPs (measured in GFLOPs, where 1 GFLOP = $10^9$ FLOPs) and the number of parameters (expressed in millions) were computed using an input image with three channels and dimensions of 32 × 32 pixels.
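As a rough guide to how such figures are obtained, the parameter count can be read directly from the model in PyTorch, while FLOPs require a profiling tool; the fvcore call shown in the comment is one possible choice and is not necessarily the tool used by the authors.

import torch

def count_parameters_millions(model):
    # total learnable parameters, reported in millions in Table 1
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# FLOPs for a 3-channel 32 x 32 input can be estimated with a profiler, for example:
# from fvcore.nn import FlopCountAnalysis
# gflops = FlopCountAnalysis(model, torch.randn(1, 3, 32, 32)).total() / 1e9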
Table 1. Performance comparison of different methods on Set5 and Set14 datasets.
As indicated in Table 1, the ISR-SHA model reduces FLOPs by approximately 6.3 billion operations (about a 30% reduction) compared to HAT. Furthermore, it decreases the number of parameters by approximately 6.3 million (also about a 30% reduction).
In terms of PSNR and SSIM metrics, ISR-SHA exhibits only a minor decrease in performance relative to HAT. For the Set5 dataset, the PSNR decreases by approximately 0.02, and the SSIM drops by 0.0008. Similarly, for the Set14 dataset, the PSNR experiences a decrease of around 0.02, while the SSIM decreases by 0.0006. In general, the ISR-SHA model maintains performance levels significantly superior to those of SwinIR and other methods, achieving a balance between performance retention and reduced computational costs compared to HAT.
Figure 9 and Figure 10 illustrate the visual results of the ISR-SHA model compared to other methods. As shown in the figures, ISR-SHA is capable of generating superior image details relative to alternative approaches, particularly when compared to traditional methods such as EDSR. EDSR, which employs residual blocks for super-resolution, has a significantly lower computational cost than Transformer-based super-resolution methods. However, the images reconstructed by EDSR appear markedly more blurred than those generated by ISR-SHA, and the differences in detail are readily apparent to the naked eye. Additional examples and visual comparisons can be found in Appendix A.
Figure 9. Set5 validation results.
Figure 10. Set14 validation results.

4.4. Other Methods and References

In this experiment, we used the same experimental setup, including the data and environment, to obtain the best individual model for each method for comparison with the proposed approach. The compared methods are as follows:
  • EDSR [7]: A deep CNN-based method that improves image quality by removing batch normalization layers and deepening the network structure.
  • SwinIR [20]: Built on Swin Transformer, it uses local and global attention to capture fine details across the entire image.
  • HAT [39]: Combines local and global attention to enrich image features, ideal for high-magnification tasks that need clear, high-quality results.

4.5. Ablation Experiment

Table 2 and Table 3 present the results of the ablation experiment for the ISR-SHA model. From the data, it is evident that when only the shifting operation is applied without the incorporation of multi-scale feature fusion, both PSNR and SSIM decrease by approximately 0.05 to 0.06 and 0.0007, respectively. This indicates that the inclusion of multi-scale feature fusion is essential for maintaining performance. Furthermore, since the modifications are made within the shallow feature extraction module, the increase in FLOPs and parameter counts is negligible. In summary, the ISR-SHA model effectively combines the shifting operation with multi-scale feature fusion, resulting in reduced computational costs while preserving its performance metrics.
Table 2. Comparison of shift operation and multi-scale feature fusion on Set5.
Table 3. Comparison of shift operation and multi-scale feature fusion on Set14.
Based on the experimental results, we conclude that ISR-SHA effectively meets the objectives of this study by maintaining high PSNR and SSIM values while achieving approximately 30% reductions in both FLOPs and parameter counts. Furthermore, the performance of ISR-SHA continues to surpass that of other traditional super-resolution methods.
In summary, this paper presents enhancements to the HAT architecture by modifying the shallow feature extraction module to increase feature diversity while simultaneously reducing the computational burden of the deep feature extraction module. This dual approach enables the model to achieve the desired performance levels with a significantly lower computational cost.

5. Conclusions and Future Work

This study proposes a super-resolution model based on HAT that effectively reduces computational costs while maintaining model performance. Using dilated convolutions, the model enhances the receptive field of shallow feature extraction, compensating for the loss of detail that occurs during shifting operations in the deep feature extraction module. Furthermore, the implementation of an active shifting operation in the shifting channel attention block, combined with pointwise convolutions, significantly reduces computational costs.
Experimental results demonstrate that the proposed method retains the expected performance level of HAT, outperforming most other super-resolution models while achieving approximately 30% reductions in both FLOPs and parameter counts. As illustrated, the images reconstructed by ISR-SHA exhibit superior detail recovery compared to other methods.
However, the ISR-SHA model still presents opportunities for further improvement. For example, utilizing more effective techniques for shallow feature extraction could enhance performance metrics. Furthermore, exploring methods to further reduce the model size would contribute to its practical applicability.

Author Contributions

Conceptualization, H.-M.T., W.-M.T. and J.-W.L.; methodology, H.-M.T., W.-M.T. and J.-W.L.; software, W.-M.T. and J.-W.L.; validation, H.-M.T., W.-M.T. and J.-W.L.; formal analysis, H.-M.T., W.-M.T. and J.-W.L.; writing—original draft preparation, H.-M.T., W.-M.T., G.-L.T. and J.-W.L.; writing—review and editing, G.-L.T., J.-W.L. and H.-T.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by Yu-Sheng Technology (YSTI), Taichung, Taiwan, under grant Super-Resolution Imaging for Shoe Manufacturing: An Industry-Academia Project.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data will be made available on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Additional Figures

Figure A1. Set5 baby crop result. The red boxes in the GT image indicate the regions that are enlarged below for comparison.
Figure A2. Set5 bird crop result. The red boxes in the GT image indicate the regions that are enlarged below for comparison.
Figure A3. Set5 butterfly crop result. The red boxes in the GT image indicate the regions that are enlarged below for comparison.
Figure A4. Set14 bolt crop result. The red boxes in the GT image indicate the regions that are enlarged below for comparison.
Figure A5. Set14 human crop result. The red boxes in the GT image indicate the regions that are enlarged below for comparison.
Figure A6. Set14 baboon crop result. The red boxes in the GT image indicate the regions that are enlarged below for comparison.

References

  1. He, K.; Sun, J.; Tang, X. Single Image Haze Removal Using Dark Channel Prior. IEEE Trans. Pattern. Anal. Mach. Intell. 2011, 33, 2341–2353. [Google Scholar] [CrossRef] [PubMed]
  2. Yang, F.; Yang, H.; Fu, J.; Lu, H.; Guo, B. Learning Texture Transformer Network for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5790–5799. [Google Scholar] [CrossRef]
  3. Gu, S.; Sang, N.; Ma, F. Fast image super resolution via local regression. In Proceedings of the IEEE/CVF International Conference on Pattern Recognition, Tsukuba, Japan, 11–15 November 2012; pp. 3128–3131. [Google Scholar]
  4. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. arXiv 2015, arXiv:1501.00092. [Google Scholar] [CrossRef] [PubMed]
  5. Kim, J.; Lee, J.K.; Lee, K.M. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar] [CrossRef]
  6. Dong, C.; Loy, C.C.; Tang, X. Accelerating the Super-Resolution Convolutional Neural Network. arXiv 2016, arXiv:1608.00367. [Google Scholar] [CrossRef]
  7. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar] [CrossRef]
  8. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a Deep Convolutional Network for Image Super-Resolution. In Computer Vision—European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 184–199. [Google Scholar] [CrossRef]
  9. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-Order Attention Network for Single Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11057–11066. [Google Scholar] [CrossRef]
  10. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.P.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 105–114. [Google Scholar] [CrossRef]
  11. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual Dense Network for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2472–2481. [Google Scholar] [CrossRef]
  12. Gu, J.; Dong, C. Interpreting Super-Resolution Networks with Local Attribution Maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9195–9204. [Google Scholar] [CrossRef]
  13. Liu, Y.; Liu, A.; Gu, J.; Zhang, Z.; Wu, W.; Qiao, Y.; Dong, C. Discovering Distinctive ‘Semantics’ in Super-Resolution Networks. arXiv 2022, arXiv:2108.00406. [Google Scholar] [CrossRef]
  14. Xie, L.; Wang, X.; Dong, C.; Qi, Z.; Shan, Y. Finding Discriminative Filters for Specific Degradations in Blind Super-Resolution. Adv. Neural Inf. Process. Syst. 2021, 34, 51–61. [Google Scholar]
  15. Kong, X.; Liu, X.; Gu, J.; Qiao, Y.; Dong, C. Reflash Dropout in Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5992–6002. [Google Scholar] [CrossRef]
  16. Liu, Y.; Zhao, H.; Gu, J.; Qiao, Y.; Dong, C. Evaluating the Generalization Ability of Super-Resolution Networks. arXiv 2023, arXiv:2205.07019. [Google Scholar] [CrossRef] [PubMed]
  17. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  18. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-Trained Image Processing Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12294–12305. [Google Scholar] [CrossRef]
  19. Cao, J.; Li, Y.; Zhang, K.; Gool, L.V. Video Super-Resolution Transformer. arXiv 2023, arXiv:2106.06847. [Google Scholar] [CrossRef]
  20. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  21. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  22. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling Up Capacity and Resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11999–12009. [Google Scholar] [CrossRef]
  23. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  25. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  26. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar] [CrossRef]
  27. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  28. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  29. He, Y.; Liu, X.; Zhong, H.; Ma, Y. AddressNet: Shift-Based Primitives for Efficient Convolutional Neural Networks. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 7–11 January 2019; pp. 1213–1222. [Google Scholar] [CrossRef]
  30. Wu, B.; Wan, A.; Yue, X.; Jin, P.; Zhao, S.; Golmant, N.; Gholaminejad, A.; Gonzalez, J.; Keutzer, K. Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9127–9135. [Google Scholar] [CrossRef]
  31. Chen, W.; Xie, D.; Zhang, Y.; Pu, S. All You Need Is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7234–7243. [Google Scholar] [CrossRef]
  32. Elhoushi, M.; Chen, Z.; Shafiq, F.; Tian, Y.H.; Li, J.Y. DeepShift: Towards Multiplication-Less Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Nashville, TN, USA, 19–25 June 2021; pp. 2359–2368. [Google Scholar] [CrossRef]
  33. Jeon, Y.; Kim, J. Constructing Fast Network through Deconstruction of Convolution. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, Canada, 2–8 December 2018. [Google Scholar]
  34. Salomon, D. Data Compression: The Complete Reference, 4th ed.; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
  35. Lv, M.; Song, S.; Jia, Z.; Li, L.; Ma, H. Multi-Focus Image Fusion Based on Dual-Channel Rybak Neural Network and Consistency Verification in NSCT Domain. Fractal Fract. 2025, 9, 432. [Google Scholar] [CrossRef]
  36. Li, L.; Song, S.; Lv, M.; Jia, Z.; Ma, H. Multi-Focus Image Fusion Based on Fractal Dimension and Parameter Adaptive Unit-Linking Dual-Channel PCNN in Curvelet Transform Domain. Fractal Fract. 2025, 9, 157. [Google Scholar] [CrossRef]
  37. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  38. Chen, C.; Sun, Y.; Hu, X.; Zhang, N.; Feng, H.; Li, Z.; Wang, Y. Multi-Attitude Hybrid Network for Remote Sensing Hyperspectral Images Super-Resolution. Remote Sens. 2025, 17, 1947. [Google Scholar] [CrossRef]
  39. Chen, X.; Wang, X.; Zhang, W.; Kong, X.; Qiao, Y.; Zhou, J.; Dong, C. HAT: Hybrid Attention Transformer for Image Restoration. arXiv 2023, arXiv:2309.05239. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
