Article

FE-FAIR: Feature-Enhanced Fused Attention for Image Super-Resolution

1 Shanghai Key Laboratory of Chips and Systems for Intelligent Connected Vehicle, School of Microelectronics, Shanghai University, Shanghai 200444, China
2 State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai 201203, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(6), 1075; https://doi.org/10.3390/electronics13061075
Submission received: 19 February 2024 / Revised: 9 March 2024 / Accepted: 12 March 2024 / Published: 14 March 2024

Abstract

Transformers have performed better than traditional convolutional neural networks (CNNs) for image super-resolution (SR) reconstruction in recent years. Currently, shifted window multi-head self-attention based on the swin transformer is a typical method. Specifically, multi-head self-attention is used to extract local features in each window, and then a shifted window strategy is used to enable information interaction between different windows. However, this information interaction method is not sufficiently efficient and lacks global feature information, which limits the model's performance to a certain extent. Furthermore, making better use of shallow features, which carry substantial energy and valuable low-frequency information, is critical for advancing the efficacy of super-resolution techniques. In order to solve the above issues, we propose the feature-enhanced fused attention (FE-FAIR) method for image super-resolution. Specifically, we design the multi-scale feature extraction module (MSFE) as a shallow feature extraction layer to extract rich low-frequency information at different scales. In addition, we propose the fused attention block (FAB), which introduces channel attention in the form of a residual connection based on shifted window self-attention, effectively achieving the fusion of global and local features. Simultaneously, we also discuss other methods to enhance the performance of the FE-FAIR method, such as optimizing the loss function, increasing the window size, and using pre-training strategies. Compared with state-of-the-art SR methods, our proposed method demonstrates better performance. For instance, FE-FAIR outperforms SwinIR by over 0.9 dB when evaluated on the Urban100 (×4) dataset.

1. Introduction

Image super-resolution reconstruction (SR) [1] refers to the reconstruction of low-resolution (LR) images into information-rich high-resolution (HR) images. This technique stands as a pivotal technology within computer vision, contributing significantly to various computational vision tasks like image denoising and target detection while simultaneously economizing on transmission and storage expenses.
Early image SR methods based on deep learning [2,3,4] mainly relied on simple convolutional neural network (CNN) structures for optimizing image reconstruction. In order to extract more image features, deeper network layers combined with more complex structures such as residual connections [5,6] and dense connections [7] were adopted. The attention mechanism enables the network to prioritize important information while disregarding irrelevant details. Various studies have demonstrated that using channel attention [8], layer attention [9], and high-order channel attention [10] can help the SR model recover more detailed features and improve the quality of the image. Nevertheless, limited by local convolution operations, the CNN method based on the attention mechanism exhibits a diminished ability to perceive long-range pixel relationships, thereby restricting enhancements in image quality.
Transformers [11] have attracted widespread attention in computer vision due to their excellent long-range dependency modeling capabilities. Many transformer models [12,13,14,15] have been proposed for low-level computational vision tasks. Subsequently, Liang et al. [16] combined the advantages of CNNs and transformers and proposed an image SR method based on the swin transformer, showing excellent performance across tasks such as image SR, image denoising, and image compression. This model, using a pre-training strategy and hybrid attention mechanism [17,18,19], effectively enhances the image reconstruction quality. The swin transformer, using shifted window technology for feature extraction, currently stands as a compelling structure for transformer-based image SR methods. However, although the swin transformer has excellent modeling capabilities for local features, its information interaction between different windows is inefficient and less effective at capturing global features. In addition, the classic super-resolution model consists of three parts: a shallow feature extraction layer, a deep feature extraction layer, and an upsampling layer. The low-frequency features obtained from the shallow feature extraction layer directly contribute to the network's upsampling process. Additionally, rich low-frequency features can provide more practical information for the subsequent deep feature extraction module. Hence, shallow features play a crucial role in SR tasks, and enhancing the effectiveness of the shallow extraction layer is crucial for improving image quality.
To address these challenges, we propose feature-enhanced fused attention for image super-resolution (FE-FAIR). FE-FAIR mainly includes a shallow feature extraction layer, a deep feature extraction layer, and an image reconstruction layer. The shallow feature extraction layer uses the more effective multi-scale feature extraction (MSFE) module, which employs convolutional layers of varying depths combined with atrous convolutional layers to extract shallow features from multiple scales effectively, surpassing the traditional single 3 × 3 convolutional layer. MSFE also introduces richer low-frequency information for the subsequent deep feature extraction layers. Inspired by the efficacy of channel attention in integrating global information and enhancing image reconstruction [8,20], we introduce the fused attention block (FAB). The FAB combines a multi-head self-attention mechanism with a channel attention mechanism, using residual connections to integrate global information into each self-attention layer. This enables the FAB to fuse information across local and global scales, proving highly effective. Furthermore, we explore various methods to improve the performance of SR methods. These include enlarging the window size of the swin transformer for enhanced feature extraction, utilizing the Smooth $L_1$ loss for smoother model training, employing effective data augmentation techniques such as rotation and RGB channel shuffling during training to enhance robustness, implementing a pre-training strategy on ImageNet [21], and fine-tuning the model on the DF2K [22] dataset to further optimize performance. The comparison results between our proposed FE-FAIR and state-of-the-art SR methods on the Manga109 and Urban100 benchmarks are shown in Figure 1. They demonstrate that FE-FAIR achieves state-of-the-art performance across all image super-resolution tasks and scales. In comparison to SwinIR, it exhibits a significant improvement of 0.84 dB to 0.96 dB on the Urban100 benchmark. In summary, our contributions can be summarized as follows:
  • We propose a better transformer-based super-resolution reconstruction method called FE-FAIR. It combines a shallow feature enhancement module with a fused attention mechanism to achieve better model performance.
  • We propose a more effective shallow feature extraction layer known as the multi-scale feature extraction (MSFE) module, aimed at enhancing the model’s capability to capture low-frequency information. By adjusting the depth and channel number of the convolutional layers of different branches and adding dilated convolutions, the receptive field is expanded and finer-grained shallow features are extracted.
  • We analyze the characteristics of window self-attention and propose the fused attention block (FAB). Based on shifted window multi-head self-attention, we add channel attention through a residual structure to achieve information fusion of global and local features.
  • We explore several additional strategies aimed at enhancing the model's performance. These include employing data augmentation techniques, adopting the smoother Smooth $L_1$ loss function, enlarging the window size of the swin transformer, and using pre-training strategies.
The subsequent sections of this paper are organized as follows. Section 2 reviews the research background of our methodology. Section 3 describes the overall architecture of the FE-FAIR method. Section 4 introduces the evolution process of FE-FAIR and the experimental results on benchmarks for different tasks. Section 5 summarizes the contributions of this work.

2. Related Work

In this section, we briefly describe part of the evolution of the image SR method. First, we give an overview of attention mechanism methods and then analyze transformer-based methods.

2.1. Deep Network Methods for Image SR

Since SRCNN [2] first introduced convolutional neural networks into image SR, a variety of deep network methods [3,5,6,7] have been proposed to improve the performance of the model, thereby improving the quality of the reconstructed images. For example, sub-pixel convolution techniques [4], deeper networks and residual blocks [5,6], and more complex dense blocks [7] are used to improve the expressive ability of the model. In order to improve the visual quality after image reconstruction, Refs. [23,24] used generative adversarial networks to generate more realistic images. Limited by the local receptive field of convolutions and their feature extraction mechanism, further enhancing the performance of deep CNNs in image SR tasks becomes increasingly challenging. To solve this problem, some studies integrate attention mechanisms into various layers of CNNs [9,25], allowing for a more detailed understanding and analysis of images at different levels. In addition, researchers have also explored techniques that introduce spatial and channel attention into these mechanisms [8,10], aiming to further improve model efficiency. These pioneering efforts provide valuable insights and motivate us to advocate deeper integration of attention mechanisms to effectively capture relevant information between different locations, thereby improving model performance.

2.2. Transformer-Based Methods for Image SR

In recent years, the great success of transformers in natural language processing (NLP) tasks has attracted attention in computer vision. Pure transformer-based methods perform excellently by handling long-distance dependencies well [13,15,26,27,28,29,30]. Some work has shown that combining convolutions and transformers can achieve more advanced results [31,32,33]. SwinIR [16] combines the advantages of convolutions and transformers and proposes a network for tasks such as image SR, which performs well in various image restoration tasks. EDT [17] explores the impact of pre-training mechanisms on transformer methods to further enhance the performance of SR networks. However, these works underestimate the importance of shallow features and fail to combine global and local features effectively. Therefore, our model focuses on more effective shallow feature extraction and feature fusion during the deep feature extraction process, effectively improving the model's ability to depict image details.

3. Methodology

Inspired by the above work, we propose the FE-FAIR method in this section; its specific network structure is shown in Figure 2.

3.1. Network Architecture

FE-FAIR mainly consists of three modules: a shallow feature extraction module, a deep feature extraction module, and an image reconstruction module. Specifically, the shallow feature extraction module is mainly composed of multi-scale feature extraction layers, using different numbers of convolutional layers combined with atrous convolutions to extract shallow features at different scales. The deep feature extraction module mainly comprises the shifted window self-attention and channel attention mechanisms and introduces a residual structure. The image reconstruction module mainly consists of convolutional and pixel-shuffle layers.
Specifically, for the input low-resolution image $I_{LR} \in \mathbb{R}^{C_0 \times H \times W}$, we initially utilize a multi-scale feature extraction (MSFE) module to extract the shallow features $F_0 \in \mathbb{R}^{C \times H \times W}$ at different scales as follows:
$F_0 = \phi_{MSFE}(I_{LR})$
where $H$ and $W$ represent the height and width of the input image, $C_0$ and $C$ represent the number of channels of the input image and of the shallow feature extraction output, respectively, and $\phi_{MSFE}$ denotes the MSFE module. The purpose of the MSFE module is to obtain rich low-frequency feature information from different perspectives and levels. Subsequently, the deep feature extraction module $\phi_{DF}$ is utilized to obtain the deep features $F_{DF} \in \mathbb{R}^{C \times H \times W}$:
$F_{DF} = \phi_{DF}(F_0)$
where $\phi_{DF}$ consists of $N$ residual fused attention blocks (RFABs) and a 3 × 3 convolutional layer $\phi_{Conv}$. This structure extracts deep features layer by layer, as follows:
$F_i = \phi_{RFAB_i}(F_{i-1}), \quad i = 1, 2, \ldots, N$
$F_{DF} = \phi_{Conv}(F_N)$
where $\phi_{RFAB_i}$ represents the $i$th RFAB. Subsequently, a 3 × 3 convolutional layer is used after the deep feature extraction layer to aggregate the features. Finally, the high-quality image reconstruction module $\phi_{HQ}$ is applied to reconstruct the high-quality image $I_{SR}$, as follows:
$I_{SR} = \phi_{HQ}(F_0 + F_{DF})$
In order to enhance the stability of the model while effectively retaining both the low-frequency and high-frequency information of the image, we use a residual connection to pass the low-frequency information to the image reconstruction block. In the image reconstruction block, we utilize a pixel-shuffle layer to upsample the features and reconstruct the image.
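To make the pipeline concrete, the following is a minimal PyTorch sketch of the forward pass described by the equations above. The class and argument names are illustrative, the MSFE and RFAB modules are assumed to be defined as sketched in the following subsections, and the default hyperparameters mirror the configuration reported later in Section 4.1.

```python
import torch
import torch.nn as nn

class FEFAIR(nn.Module):
    """Illustrative skeleton of the FE-FAIR pipeline (not the released code)."""
    def __init__(self, in_ch=3, dim=180, num_rfab=6, scale=4):
        super().__init__()
        self.msfe = MSFE(in_ch, dim)                       # shallow features F_0
        self.rfabs = nn.ModuleList(RFAB(dim) for _ in range(num_rfab))
        self.conv_after_body = nn.Conv2d(dim, dim, 3, 1, 1)
        self.reconstruct = nn.Sequential(                  # conv + pixel-shuffle upsampling
            nn.Conv2d(dim, in_ch * scale * scale, 3, 1, 1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x_lr):
        f0 = self.msfe(x_lr)                               # F_0 = MSFE(I_LR)
        f = f0
        for rfab in self.rfabs:                            # F_i = RFAB_i(F_{i-1})
            f = rfab(f)
        f_df = self.conv_after_body(f)                     # F_DF
        return self.reconstruct(f0 + f_df)                 # I_SR = HQ(F_0 + F_DF)
```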

3.2. Multi-Scale Feature Extraction (MSFE) Module

The shallow feature extraction block mainly maps the input features from a low dimension to a higher dimension, and its output usually contains low-frequency information. Convolutional layers perform well in the early processing of visual tasks, thereby facilitating better-optimized results [32]. Simultaneously, we concentrate on the correlation between the target pixel and surrounding pixels, noting that this correlation diminishes as the pixel distance increases. The atrous spatial pyramid pooling (ASPP) [34] uses multiple parallel atrous convolutional layers with different sampling rates to extract features from different scales. Inspired by ASPP, we design the MSFE module based on the concept of multi-scale convolution, with the primary process outlined as follows:
$F_{in} = H_{Conv1}(I_{LR})$
$F_i = H_{branch_i}(F_{in}), \quad i = 1, 2, 3, 4$
$F_{concat} = Concat(F_1, F_2, F_3, F_4)$
$F_{out} = H_{Conv2}(F_{concat} + F_{in})$
where $F_{in}$ denotes the mapped high-dimensional features, and $H_{branch_i}$ represents the dilated convolution module of the $i$th scale. $H_{Conv1}$ implements the mapping of the input features from low to high dimensions, and $H_{Conv2}$ aggregates the features from the different branches.
As illustrated in Figure 3, for the input feature $I_{LR} \in \mathbb{R}^{C_0 \times H \times W}$, a convolutional layer is employed to perform feature mapping from low to high dimensions. Subsequently, distinct branches are utilized to acquire features at various scales. Each branch ($branch_i$) comprises varying numbers of convolutional layers and dilated convolutional layers. The dilation value of the dilated convolutional layer corresponds to the number of convolutional layers, where a higher number of convolutional layers entails a larger dilation value [35]. This design effectively expands the receptive field of the convolutional layer while enabling each branch to focus on the interrelation between the central feature and surrounding features across different scales. Ultimately, the information from the different branches is concatenated, and a residual structure is employed to incorporate previous-level information, thereby enhancing model stability.
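A hedged sketch of the MSFE idea follows. The branch depths, channel reduction, activation choices, and the choice to concatenate (rather than sum) the residual path before the aggregation convolution are illustrative assumptions, not the exact released configuration.

```python
import torch
import torch.nn as nn

class MSFE(nn.Module):
    """Multi-scale shallow feature extraction: parallel branches with growing dilation."""
    def __init__(self, in_ch=3, dim=180, reduction=3):
        super().__init__()
        branch_dim = dim // reduction
        self.conv1 = nn.Conv2d(in_ch, dim, 3, 1, 1)        # low -> high dimensional mapping
        self.branches = nn.ModuleList()
        for i in range(1, 5):                              # branch_i: i convs + one dilated conv
            layers = []
            for j in range(i):
                layers += [nn.Conv2d(dim if j == 0 else branch_dim, branch_dim, 3, 1, 1),
                           nn.GELU()]
            layers.append(nn.Conv2d(branch_dim, branch_dim, 3, 1,
                                    padding=i, dilation=i))  # dilation grows with branch depth
            self.branches.append(nn.Sequential(*layers))
        self.conv2 = nn.Conv2d(dim + 4 * branch_dim, dim, 3, 1, 1)  # aggregate all branches

    def forward(self, x):
        f_in = self.conv1(x)
        feats = [branch(f_in) for branch in self.branches]
        f_cat = torch.cat(feats + [f_in], dim=1)           # keep the residual path explicitly
        return self.conv2(f_cat)
```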

3.3. Fused Attention Block (FAB)

The window-based multi-head self-attention mechanism can extract high-frequency information and local features within the feature map. By adding global features, the model can effectively integrate the information from the entire feature map. Previous studies have demonstrated that convolutional layers can enhance the performance of transformers [36]. Channel attention, as proposed by Hu et al. [20], focuses on the importance and correlation of different feature channels. It assigns a different weight to each channel, thereby enhancing the model's capability for global feature extraction. Consequently, the fusion of multi-head self-attention and channel attention serves to amalgamate features effectively. As illustrated in Figure 4, after passing through the LayerNorm layer, the channel attention block (CAB) and W-MSA operate as parallel structures to calculate the feature map across different dimensions, yielding a residual summation as output. To balance channel attention and W-MSA, we multiply the output features of the CAB and the original input features by the adaptive weights $\alpha$ and $\beta$, respectively. For an input feature $X$, the entire FAB processing process is as follows:
$X_{norm} = LN(X)$
$X_{MSA} = \text{W-MSA}(X_{norm}) + X$
$X_T = MLP(LN(X_{MSA})) + X_{MSA}$
$X_{out} = X_T + \alpha \cdot CAB(X_{norm}) + \beta \cdot X$
where $X_{norm}$ represents the output of the LayerNorm (LN) layer, $X_{MSA}$ represents the intermediate result of the multi-head self-attention calculation, $X_T$ represents the output feature of the W-MSA branch, and $X_{out}$ represents the output feature of the FAB. MLP stands for the multi-layer perceptron layer, and CAB stands for the channel attention block.
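The FAB equations translate into the following hedged PyTorch sketch. WindowMSA and CAB are placeholders with assumed signatures (a per-window attention sketch and a CAB sketch follow later in this section); window partitioning and reversal are folded into the hypothetical WindowMSA module for brevity.

```python
import torch
import torch.nn as nn

class FAB(nn.Module):
    """Fused attention block: W-MSA branch plus a channel-attention branch."""
    def __init__(self, dim=180, window_size=16, num_heads=6, mlp_ratio=2.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.wmsa = WindowMSA(dim, window_size, num_heads)  # assumed module
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        self.cab = CAB(dim)                                 # assumed module
        self.alpha = nn.Parameter(torch.tensor(0.01))       # adaptive weight for the CAB branch
        self.beta = nn.Parameter(torch.tensor(0.01))        # adaptive weight for the identity path

    def forward(self, x, hw):
        # x: (B, H*W, C) token sequence; hw = (H, W) spatial size for the CAB branch
        x_norm = self.norm1(x)
        x_msa = self.wmsa(x_norm, hw) + x                   # X_MSA = W-MSA(X_norm) + X
        x_t = self.mlp(self.norm2(x_msa)) + x_msa           # X_T
        return x_t + self.alpha * self.cab(x_norm, hw) + self.beta * x   # X_out
```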
The specific calculation process of the window self-attention mechanism is as follows: given an input feature of size $H \times W \times C$, the input is divided into non-overlapping windows of size $M \times M$, so the total number of windows is $\frac{HW}{M^2}$. The input features can thus be reshaped into $\frac{HW}{M^2} \times M^2 \times C$. Subsequently, self-attention is calculated within each window independently. For each window feature $X \in \mathbb{R}^{M^2 \times C}$, the query, key, and value matrices are calculated as
$Q = XM_Q, \quad K = XM_K, \quad V = XM_V$
where $M_Q$, $M_K$, and $M_V$ represent the projection matrices of the query, key, and value, respectively. Then, the self-attention of the window can be expressed as
$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^T}{\sqrt{d}} + B\right)V$
where $d$ represents the dimension of the query/key, and $B$ indicates the relative position encoding. In addition, in order to promote information exchange between adjacent windows, the shifted window method is also utilized, with the shift size set to half the window size.
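For reference, the per-window attention computation above can be sketched as follows; the input is assumed to be already partitioned into windows, and the relative position bias $B$ is simplified to one learnable table per head.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention over one window of M*M tokens, following Softmax(QK^T/sqrt(d) + B)V."""
    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5             # 1 / sqrt(d)
        self.qkv = nn.Linear(dim, dim * 3)                  # M_Q, M_K, M_V in one projection
        self.proj = nn.Linear(dim, dim)
        n = window_size * window_size
        self.bias = nn.Parameter(torch.zeros(num_heads, n, n))  # simplified relative position bias

    def forward(self, x):
        # x: (num_windows * B, M*M, C)
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, c // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each (b, heads, n, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.bias
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```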
The CAB module mainly consists of convolutional layers and a standard channel attention (CA) layer. The specific structure is shown in Figure 2. Due to the large number of channels, a high computational cost would be incurred if the standard channel attention layer were combined directly with the transformer. To address this, the channels are compressed to $\frac{C}{\gamma}$ by a convolutional layer while maintaining similar performance. The entire CAB calculation process is as follows:
$X_{out} = CA(Conv(X_{in}))$
where $X_{in}$, $X_{out}$, and $Conv$ represent the input features, the output features, and the convolution layers, respectively.
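A possible CAB sketch consistent with the description above is shown below; the squeeze-and-excitation reduction factor and the activation choices are assumptions.

```python
import torch
import torch.nn as nn

class CAB(nn.Module):
    """Channel attention block: channel-compressed convs followed by standard CA."""
    def __init__(self, dim=180, gamma=5, se_reduction=16):
        super().__init__()
        self.body = nn.Sequential(                          # compress channels to C / gamma
            nn.Conv2d(dim, dim // gamma, 3, 1, 1), nn.GELU(),
            nn.Conv2d(dim // gamma, dim, 3, 1, 1),
        )
        self.ca = nn.Sequential(                            # standard channel attention (CA)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // se_reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // se_reduction, dim, 1), nn.Sigmoid(),
        )

    def forward(self, x, hw=None):
        # Accepts (B, C, H, W) maps, or (B, H*W, C) tokens if hw = (H, W) is given.
        if x.dim() == 3 and hw is not None:
            h, w = hw
            x_map = x.transpose(1, 2).reshape(x.size(0), -1, h, w)
            out = self.body(x_map)
            out = out * self.ca(out)
            return out.flatten(2).transpose(1, 2)
        out = self.body(x)
        return out * self.ca(out)
```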
As shown in Figure 2, each residual fused attention block (RFAB) contains $M$ fused attention block (FAB) modules and a 3 × 3 convolutional layer. Precisely, the $i$th RFAB can be calculated as
$F_{i,0} = F_{i-1}, \quad i = 1, 2, \ldots, N$
$F_{i,j} = H_{FAB_{i,j}}(F_{i,j-1}), \quad j = 1, 2, \ldots, M$
$F_i = H_{Conv_i}(F_{i,M}) + F_{i,0}$
where $F_{i,0}$ and $F_i$ represent the input and output features of the $i$th RFAB, $H_{FAB_{i,j}}$ represents the $j$th FAB block in the $i$th RFAB, and $H_{Conv_i}$ represents the convolution layer of the $i$th RFAB. This design offers two notable advantages. Firstly, the incorporation of a convolutional layer facilitates a more stable aggregation of information. Secondly, the residual connection not only stabilizes the model's training but also enhances inter-layer relationships by incorporating information from different modules.
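Putting the pieces together, an RFAB can be sketched as below, reusing the FAB sketch given earlier in this section; the token/feature-map conversions are an implementation assumption.

```python
import torch.nn as nn

class RFAB(nn.Module):
    """Residual fused attention block: M FABs, a 3x3 conv, and a residual connection."""
    def __init__(self, dim=180, num_fab=6, window_size=16, num_heads=6):
        super().__init__()
        self.fabs = nn.ModuleList(
            FAB(dim, window_size, num_heads) for _ in range(num_fab))
        self.conv = nn.Conv2d(dim, dim, 3, 1, 1)

    def forward(self, x):
        # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)               # (B, H*W, C) for the attention blocks
        for fab in self.fabs:
            tokens = fab(tokens, (h, w))                    # F_{i,j} = FAB_{i,j}(F_{i,j-1})
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.conv(f) + x                             # F_i = Conv(F_{i,M}) + F_{i,0}
```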

3.4. Loss Function

For previous image SR methods, the $L_1$ pixel loss is generally used as the loss function to optimize the model. Under normal circumstances, the $L_1$ loss can obtain good model performance:
$\mathcal{L}_1 = \| I_{SR} - I_{HQ} \|_1$
However, the $L_1$ loss is non-differentiable at certain points, which can impede loss optimization. Secondly, in the later stage of model training, the difference between $I_{SR}$ and $I_{HQ}$ is small, but the derivative of the loss is still a constant. Consequently, the loss value fluctuates around a stable value when the learning rate remains unchanged, making it difficult to achieve higher accuracy. Therefore, we use the Smooth $L_1$ loss [37] as the loss function of our method. The Smooth $L_1$ loss is expressed as
$\mathcal{L} = \begin{cases} \dfrac{0.5\,(I_{SR} - I_{HQ})^2}{\theta}, & \text{if } |I_{SR} - I_{HQ}| < \theta \\ |I_{SR} - I_{HQ}| - 0.5\,\theta, & \text{otherwise} \end{cases}$
where $\theta$ is set to 1 by default in our method. As indicated by the formula above, the Smooth $L_1$ loss takes the form of the $L_1$ loss when $|I_{SR} - I_{HQ}| \geq \theta$ and the form of the $L_2$ loss when $|I_{SR} - I_{HQ}| < \theta$. This approach effectively addresses issues such as gradient explosion when the loss is large and accuracy concerns when the loss is small.
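With $\theta = 1$, this is the same function as PyTorch's built-in SmoothL1Loss with beta = 1.0; the following sketch checks the equivalence on random tensors.

```python
import torch
import torch.nn as nn

def smooth_l1(sr, hq, theta=1.0):
    """Piecewise loss from the formula above: quadratic below theta, linear otherwise."""
    diff = torch.abs(sr - hq)
    return torch.where(diff < theta,
                       0.5 * diff ** 2 / theta,
                       diff - 0.5 * theta).mean()

sr = torch.rand(1, 3, 64, 64)
hq = torch.rand(1, 3, 64, 64)
builtin = nn.SmoothL1Loss(beta=1.0)
assert torch.allclose(smooth_l1(sr, hq), builtin(sr, hq), atol=1e-6)
```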

4. Experiments

In this section, the FE-FAIR method proposed in this paper is compared with other state-of-the-art methods, such as EDSR [6], RCAN [8], SAN [10], IGNN [38], HAN [9], NLSN [25], SwinIR [16], EDT [17], CARN [39], IMDN [40], LAPAR-A [41], LatticeNet [42], BM3D [43], WNNM [44], DnCNN [45], IRCNN [46], FFDNet [47], NLRN [48], FOCNet [49], MWCNN [50], DRUNet [51], DSNet [52], RPCNN [53], BRDNet [54], and IPT [15].
Simultaneously, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [55] are utilized for evaluation. The calculation process of PSNR is demonstrated as follows:
$PSNR = 10 \times \log_{10}\dfrac{Max^2}{MSE} = 20 \times \log_{10}\dfrac{Max}{\sqrt{MSE}}$
where the default value of $Max$ is 255. The mean square error (MSE) is calculated as follows:
$MSE = \dfrac{1}{M \times N}\sum_{i=1}^{M}\sum_{j=1}^{N}\left(I_{HR}(i,j) - I_{SR}(i,j)\right)^2$
where $M$ and $N$ are the numbers of pixels along the height and width of the image, respectively. The value of PSNR depends on the MSE: the smaller the MSE, the greater the PSNR value, which means the smaller the difference between the reconstructed image and the actual image. The SSIM is also used to measure the similarity between the reconstructed and original images in terms of luminance, contrast, and structure. SSIM can be expressed as
$SSIM = \dfrac{(2\mu_{I_{HR}}\mu_{I_{SR}} + C_1)(2\sigma_{I_{HR}I_{SR}} + C_2)}{(\mu_{I_{HR}}^2 + \mu_{I_{SR}}^2 + C_1)(\sigma_{I_{HR}}^2 + \sigma_{I_{SR}}^2 + C_2)}$
where $\mu_{I_{HR}}$ and $\mu_{I_{SR}}$ are the means, $\sigma_{I_{HR}}$ and $\sigma_{I_{SR}}$ the standard deviations, and $\sigma_{I_{HR}I_{SR}}$ the covariance of $I_{HR}$ and $I_{SR}$, respectively. $C_1$ and $C_2$ are constants. The closer the value of $SSIM$ is to 1, the higher the similarity between the two images.
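For completeness, the two metrics can be computed as in the sketch below. PSNR follows the formulas above directly; the SSIM sketch uses global image statistics and the common constants $C_1 = (0.01 \cdot Max)^2$ and $C_2 = (0.03 \cdot Max)^2$, whereas standard SSIM implementations average these statistics over local Gaussian windows.

```python
import numpy as np

def psnr(i_hr: np.ndarray, i_sr: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((i_hr.astype(np.float64) - i_sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                                 # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(i_hr: np.ndarray, i_sr: np.ndarray, max_val: float = 255.0) -> float:
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    x, y = i_hr.astype(np.float64), i_sr.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))
```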
All experiments are conducted using PyTorch version 2.0 on 4 NVIDIA Tesla V100 GPUs with CUDA version 12.2.

4.1. Experimental Setup

In classical image SR, the DF2K dataset (comprising DIV2K [22] with 900 images and Flickr2K [22] with 2650 images), containing 3550 images in total, is utilized as the original training set. Bicubic downsampling with scaling factors of ×2, ×3, and ×4 is performed using MATLAB to generate the low-resolution images. The test set includes popular super-resolution benchmark datasets such as Set5 [56], Set14 [57], BSD100 [58], Urban100 [59], and Manga109 [60]. Regarding the architecture of FE-FAIR, the parameters are configured as follows: the number of RFABs, FABs, channels, attention heads, and the window size are set to 6, 6, 180, 6, and 16, respectively. In lightweight image SR tasks, these parameters are set to 4, 6, 60, 6, and 16, respectively. The channel compression parameter $\gamma$ in the MSFE module defaults to 3 for classical tasks and 6 for lightweight tasks. The channel compression parameter $\gamma$ in the CAB is set to 5, and $\alpha$ and $\beta$ in the FAB are treated as adaptive parameters.
For image denoising, the training set comprises DIV2K (900 images), Flickr2K (2650 images), WED [61] (4744 images), and BSD200 [58] (200 images). The test set includes Set12 [45], BSD68 [58], CBSD68 [58], Kodak24 [62], McMaster [63], and Urban100 datasets. The parameter configurations remain consistent with classic SR.
For classical SR, we set the batch size to 32 and the total training iterations to 500 k. The initial learning rate is $2 \times 10^{-4}$, and it is halved at iterations [250 k, 400 k, 450 k, 475 k, 500 k], respectively. For lightweight SR, the batch size is 64, and the total number of training iterations is also 500 k. For the image denoising task, the batch size is 8, and the total number of training iterations is 1500 k. The initial learning rate is $2 \times 10^{-4}$, halved at iterations [600 k, 1000 k, 1300 k, 1450 k], respectively. For the pre-training model, 1.2 million images from ImageNet [21] are used for 1000 k iterations. The initial learning rate is also $2 \times 10^{-4}$, halved at iterations [300 k, 500 k, 750 k, 900 k, 1000 k]. Subsequently, the DF2K dataset (DIV2K with 900 images + Flickr2K with 2650 images) is employed for fine-tuning with 250 k iterations. The learning rate is initialized to $2 \times 10^{-5}$ and halved at iterations [130 k, 200 k, 230 k, 245 k, 250 k].
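The classical-SR schedule above can be reproduced with a standard step scheduler, as in the hedged sketch below; the optimizer type is not stated in this section, so Adam is an assumption, and the model is a placeholder.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)                       # placeholder for the SR network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)   # initial LR 2e-4 (optimizer assumed)
scheduler = torch.optim.lr_scheduler.MultiStepLR(           # halve the LR at the stated milestones
    optimizer, milestones=[250_000, 400_000, 450_000, 475_000, 500_000], gamma=0.5)

# Inside the training loop (500 k iterations, batch size 32):
#   loss = smooth_l1(model(lr_batch), hr_batch)
#   loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```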
Simultaneously, to determine the optimal number of iterations for our proposed method, experiments are conducted on the DF2K dataset, and the results are shown in Figure 5. We can observe that after 500 k iterations, the method converges in both tasks. Specifically, compared with the classic task, the lightweight SR method enters the convergence state faster due to its smaller number of parameters and lower computational cost, and its indicators are more stable during the training process, while the classic task method achieves better final performance. Therefore, we set 500 k iterations as the total number of training iterations in this work.

4.2. Ablation Experiment

In this part of the work, the impact of the series of methods proposed in this paper on image SR is verified separately.

4.2.1. Effectiveness of MSFE

Rich shallow features can help the model retain sufficient low-frequency information [64] while providing more effective feature information for the deep feature extraction module. The MSFE module combines dilated convolutions with different numbers of convolutional layers to achieve more detailed feature capture. The effectiveness of the proposed MSFE module is demonstrated through the following experimental setup. Using the traditional single convolutional layer as the baseline, this part of the work tests the gain obtained by using three convolutional layers, ASPP, and the MSFE module as the shallow feature extraction layer. The results are quantified on Set14, Urban100, and Manga109. As shown in Table 1, when using MSFE as the shallow feature extraction layer, the network achieves a performance gain of 0.06 to 0.11 dB compared to the other methods. All results show that MSFE can effectively improve the performance of SR methods.

4.2.2. Effects of the FAB

The FAB module combines self-attention and channel attention mechanisms through residual connections, enabling the integration of global features into local features. We conducted experiments to demonstrate the effectiveness of the FAB module. Table 2 presents the quantitative performance on the Set14, Urban100, and Manga109 test sets for × 4 super-resolution. Compared to the baseline performance of the STL module in SwinIR, the FAB module brings a performance gain of 0.05 to 0.09 dB. Adaptive weights α and β are set to avoid conflicts between channel attention and window self-attention. We further investigate the impact of these variable weights on model performance. Experiments show that without adding parameters, the inclusion of α and β results in a performance gain of 0.03 dB based on the FAB module. This indicates that the adaptive parameters α and β reduce negative impacts between different attention mechanisms, facilitating their fusion and enabling improved model performance.

4.2.3. Effects of Smooth $L_1$ Loss

The Smooth $L_1$ loss offers smoother convergence and better performance compared to the traditional $L_1$ loss. To demonstrate the superiority of the Smooth $L_1$ loss as a loss function, experiments were conducted. Table 3 presents the quantitative results for the super-resolution task with a scaling factor of ×4 on the Urban100 dataset. The Smooth $L_1$ loss achieves a performance gain of 0.03 dB compared to the $L_1$ loss. These results indicate that the Smooth $L_1$ loss, when used as the loss function for image super-resolution tasks, enhances the model's performance.

4.2.4. Effects of Window Size

EDT explored the impact of window size on the performance of the window self-attention mechanism. It has been shown that increasing the window size is a direct way to improve the performance of the SR network. However, previous studies only explored windows up to 12 × 12 in size. Therefore, we also examined the impact of larger window sizes on network performance. Table 4 shows the quantitative test results on the Set14, Urban100, and Manga109 test sets at a scale factor of ×4. The results show that a window size of 16 effectively improves the model's performance; in particular, the PSNR on Urban100 improves by 0.24 dB. Therefore, in FE-FAIR, we set the window size to 16.

4.3. Comparison Result

4.3.1. Results on Classical Image Super-Resolution

Quantitative comparison. Table 5 shows the quantitative comparison results between FE-FAIR and other state-of-the-art methods: EDSR [6], RCAN [8], SAN [10], IGNN [38], HAN [9], NLSN [25], SwinIR [16], and EDT [17]. We can see that FE-FAIR exhibits the best performance across all test sets and magnifications. Specifically, FE-FAIR achieves a performance gain of 0.33–0.55 dB on Urban100 and 0.27–0.34 dB on Manga109. Thus, FE-FAIR demonstrates superior performance in image super-resolution. Additionally, we present quantitative comparison results between the pre-training strategy FE-FAIR and state-of-the-art models IPT† [15] and EDT† [17]. The pre-trained model exhibits a substantial performance improvement, notably surpassing the baseline (SwinIR) by 0.8 dB on Urban100, thereby affirming the effectiveness of the pre-training strategy.
Visual Comparison. We selected several pictures (img011, img048, img074, img092) from the benchmarks to show the super-resolution reconstruction results of the model. As can be seen from Figure 6, our results show greatly improved texture details and authenticity.

4.3.2. Results on Lightweight Image Super-Resolution

Quantitative comparison. Table 6 shows the quantitative performance comparison results between lightweight FE-FAIR and state-of-the-art lightweight methods: CARN [39], IMDN [40], LAPAR-A [41], LatticeNet [42], SwinIR [16], and EDT [17]. The total number of parameters in our method (evaluated on 1280 × 720 images) is also provided. As shown in Table 6, the quantification performance of FE-FAIR is significantly better than other methods, especially in SSIM. It proves that our method is effective in lightweight image super-resolution tasks.

4.3.3. Results on Image Denoising

We further explore the performance of our method in image denoising. We show the comparison results of FE-FAIR on grayscale and colour image denoising tasks with other state-of-the-art methods: BM3D [43], WNNM [44], DnCNN [45], IRCNN [46], FFDNet [47], NLRN [48], FOCNet [49], MWCNN [50], DRUNet [51], DSNet [52], RPCNN [53], BRDNet [54], IPT, and SwinIR [16]. Table 7 and Table 8 provide quantitative comparison results at noise levels of 15, 25, and 50. Specifically, our method outperforms the state-of-the-art method SwinIR by 0.2 dB on the Urban100 benchmark dataset.

5. Conclusions

In this paper, we re-explore the importance of shallow features and propose a multi-scale shallow feature extraction module, MSFE, to obtain richer and more effective low-frequency features. Concurrently, we integrate window self-attention and channel attention in the form of a residual connection, proposing the fused attention module, FAB, which effectively combines local and global feature information. In addition, we incorporate other techniques to improve the model's performance, such as data augmentation and increasing the window size. Combining these methods, we propose an image super-resolution reconstruction method, FE-FAIR. Comparative evaluations on benchmark datasets demonstrate FE-FAIR's superior performance compared to other state-of-the-art image super-resolution methods. Additionally, our method exhibits better performance in image denoising tasks.
In the future, we will continue exploring further interactions between shallow and deep features to achieve more fine-grained shallow feature capture. Additionally, investigating the impact of various attention fusion methods on image super-resolution remains a promising avenue of research. Due to the enormous potential of the transformer architecture, we aim to further explore its applications across various tasks, including the field of image SR.

Author Contributions

A.G. and K.S. completed the methodology, experimental data, and manuscript writing of this work. J.L. completed the revision and checking of the manuscript, and A.G. and J.L. provided financial support and supervised this work. All authors have read and approved the published version of the manuscript.

Funding

This research was supported in part by the National Natural Science Foundation of China under Grant 62204044, in part by the State Key Laboratory of Integrated Chips and Systems, and in part by Shanghai Science and Technology Innovation Action under Grants 22xtcx00700 and 22511101002.

Data Availability Statement

The data that support the findings of this study are available online. Data download address: https://github.com/sk0625-hhh/FE-FAIR (accessed on 8 February 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Park, S.C.; Park, M.K.; Kang, M.G. Super-resolution image reconstruction: A technical overview. IEEE Signal Process. Mag. 2003, 20, 21–36. [Google Scholar] [CrossRef]
  2. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part IV 13. Springer: Cham, Switzerland, 2014; pp. 184–199. [Google Scholar]
  3. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer: Cham, Switzerland, 2016; pp. 391–407. [Google Scholar]
  4. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  5. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  6. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  7. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  8. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  9. Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single image super-resolution via a holistic attention network. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII 16. Springer: Cham, Switzerland, 2020; pp. 191–207. [Google Scholar]
  10. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-order attention network for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 18–24 June 2019; pp. 11065–11074. [Google Scholar]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 15. [Google Scholar]
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  13. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  14. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  15. Chen, H.; Wang, Y.; Guo, T.; Xu, C.; Deng, Y.; Liu, Z.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12299–12310. [Google Scholar]
  16. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1833–1844. [Google Scholar]
  17. Li, W.; Lu, X.; Qian, S.; Lu, J.; Zhang, X.; Jia, J. On efficient transformer-based image pre-training for low-level vision. arXiv 2021, arXiv:2112.10175. [Google Scholar]
  18. Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 457–466. [Google Scholar]
  19. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  20. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  21. Deng, J. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  22. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar]
  23. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  24. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  25. Mei, Y.; Fan, Y.; Zhou, Y. Image super-resolution with non-local sparse attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3517–3526. [Google Scholar]
  26. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578. [Google Scholar]
  27. Wu, B.; Xu, C.; Dai, X.; Wan, A.; Zhang, P.; Yan, Z.; Tomizuka, M.; Gonzalez, J.; Keutzer, K.; Vajda, P. Visual transformers: Token-based image representation and processing for computer vision. arXiv 2020, arXiv:2006.03677. [Google Scholar]
  28. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the design of spatial attention in vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 9355–9366. [Google Scholar]
  29. Liu, L.; Ouyang, W.; Wang, X.; Fieguth, P.; Chen, J.; Liu, X.; Pietikäinen, M. Deep learning for generic object detection: A survey. Int. J. Comput. Vis. 2020, 128, 261–318. [Google Scholar] [CrossRef]
  30. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  31. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 22–31. [Google Scholar]
  32. Xiao, T.; Singh, M.; Mintun, E.; Darrell, T.; Dollár, P.; Girshick, R. Early convolutions help transformers see better. Adv. Neural Inf. Process. Syst. 2021, 34, 30392–30400. [Google Scholar]
  33. Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 579–588. [Google Scholar]
  34. Wang, Y.; Liang, B.; Ding, M.; Li, J. Dense semantic labeling with atrous spatial pyramid pooling and decoder for high-resolution remote sensing imagery. Remote Sens. 2018, 11, 20. [Google Scholar] [CrossRef]
  35. Li, Y.; Zhang, X.; Chen, D. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1091–1100. [Google Scholar]
  36. Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. Cmt: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12175–12185. [Google Scholar]
  37. Sutanto, A.R.; Kang, D.K. A novel diminish smooth L1 loss model with generative adversarial network. In Proceedings of the Intelligent Human Computer Interaction: 12th International Conference, IHCI 2020, Daegu, Republic of Korea, 24–26 November 2020; Proceedings, Part I 12. Springer: Cham, Switzerland, 2021; pp. 361–368. [Google Scholar]
  38. Zhou, S.; Zhang, J.; Zuo, W.; Loy, C.C. Cross-scale internal graph neural network for image super-resolution. Adv. Neural Inf. Process. Syst. 2020, 33, 3499–3509. [Google Scholar]
  39. Li, Y.; Agustsson, E.; Gu, S.; Timofte, R.; Van Gool, L. Carn: Convolutional anchored regression network for fast and accurate single image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  40. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar]
  41. Li, W.; Zhou, K.; Qi, L.; Jiang, N.; Lu, J.; Jia, J. Lapar: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond. Adv. Neural Inf. Process. Syst. 2020, 33, 20343–20355. [Google Scholar]
  42. Luo, X.; Xie, Y.; Zhang, Y.; Qu, Y.; Li, C.; Fu, Y. Latticenet: Towards lightweight image super-resolution with lattice block. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXII 16. Springer: Cham, Switzerland, 2020; pp. 272–289. [Google Scholar]
  43. Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 2007, 16, 2080–2095. [Google Scholar] [CrossRef] [PubMed]
  44. Gu, S.; Zhang, L.; Zuo, W.; Feng, X. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2862–2869. [Google Scholar]
  45. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef] [PubMed]
  46. Zhang, K.; Zuo, W.; Gu, S.; Zhang, L. Learning deep CNN denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3929–3938. [Google Scholar]
  47. Zhang, K.; Zuo, W.; Zhang, L. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Trans. Image Process. 2018, 27, 4608–4622. [Google Scholar] [CrossRef] [PubMed]
  48. Liu, D.; Wen, B.; Fan, Y.; Loy, C.C.; Huang, T.S. Non-local recurrent network for image restoration. arXiv 2018, arXiv:1806.02919v2. [Google Scholar]
  49. Jia, X.; Liu, S.; Feng, X.; Zhang, L. Focnet: A fractional optimal control network for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6054–6063. [Google Scholar]
  50. Liu, P.; Zhang, H.; Zhang, K.; Lin, L.; Zuo, W. Multi-level wavelet-CNN for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 773–782. [Google Scholar]
  51. Zhang, K.; Li, Y.; Zuo, W.; Zhang, L.; Van Gool, L.; Timofte, R. Plug-and-play image restoration with deep denoiser prior. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6360–6376. [Google Scholar] [CrossRef]
  52. Peng, Y.; Zhang, L.; Liu, S.; Wu, X.; Zhang, Y.; Wang, X. Dilated residual networks with symmetric skip connection for image denoising. Neurocomputing 2019, 345, 67–76. [Google Scholar] [CrossRef]
  53. Xia, Z.; Chakrabarti, A. Identifying recurring patterns with deep neural networks for natural image denoising. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass, CO, USA, 1–5 March 2020; pp. 2426–2434. [Google Scholar]
  54. Tian, C.; Xu, Y.; Zuo, W. Image denoising using deep CNN with batch renormalization. Neural Netw. 2020, 121, 461–473. [Google Scholar] [CrossRef]
  55. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  56. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the 23rd British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012. [Google Scholar]
  57. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the Curves and Surfaces: 7th International Conference, Avignon, France, 24–30 June 2010; Revised Selected Papers 7. Springer: Cham, Switzerland, 2012; pp. 711–730. [Google Scholar]
  58. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar]
  59. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  60. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  61. Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; Zhang, L. Waterloo exploration database: New challenges for image quality assessment models. IEEE Trans. Image Process. 2016, 26, 1004–1016. [Google Scholar] [CrossRef]
  62. Yu, S.; Park, B.; Jeong, J. Deep iterative down-up cnn for image denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  63. Zhang, L.; Wu, X.; Buades, A.; Li, X. Color demosaicking by local directional interpolation and nonlocal adaptive thresholding. J. Electron. Imaging 2011, 20, 023016. [Google Scholar]
  64. Lay, J.A.; Guan, L. Image retrieval based on energy histograms of the low frequency DCT coefficients. In Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), Phoenix, AZ, USA, 15–19 March 1999; Volume 6, pp. 3009–3012. [Google Scholar]
Figure 1. Comparative performance evaluation between FE-FAIR and the state-of-the-art methods NLSN, SwinIR, and EDT. Quantitative analysis using PSNR (Y-channel only) on Urban100 and Manga109 Datasets at scale factors ×2, ×3, and ×4. “†” indicates we use a pre-training strategy on ImageNet.
Figure 2. The overall structure of FE-FAIR. Specifically, it mainly includes a multi-scale shallow feature extraction (MSFE) module, fused attention block (FAB), and residual connection FAB (RFAB).
Figure 3. Multi-scale shallow feature extraction (MSFE) module.
Figure 4. Fused attention block (FAB). ⊕ represents an element-wise sum operation. α and β represent adaptive parameters used to adjust different branch weights.
Figure 5. PSNR (Y channel) and SSIM trends in training on classic tasks (FE-FAIR) and lightweight tasks (FE-FAIR-T).
Figure 6. Visual comparison with state-of-the-art methods (average PSNR/SSIM) for scale ×4. The compared parts are marked with red markers in the image.
Table 1. The effects of different shallow feature extraction modules. Bold text and numbers indicate the method we used and the best results among all methods.
Module | Scale | Set14 [57] | Urban100 [59] | Manga109 [60]
Conv | ×2 | 34.46/0.9250 | 33.81/0.9427 | 39.92/0.9797
3 × Conv | ×2 | 34.47/0.9252 | 33.82/0.9428 | 39.93/0.9796
ASPP [34] | ×2 | 34.48/0.9253 | 33.84/0.9431 | 39.94/0.9798
MSFE | ×2 | 34.52/0.9257 | 33.92/0.9435 | 39.98/0.9804
Conv | ×4 | 29.09/0.7950 | 27.45/0.8254 | 32.03/0.9260
3 × Conv | ×4 | 29.10/0.7949 | 27.47/0.8256 | 32.05/0.9259
ASPP [34] | ×4 | 29.12/0.7952 | 27.48/0.8259 | 32.07/0.9262
MSFE | ×4 | 29.15/0.7958 | 27.53/0.8271 | 32.11/0.9265
Table 2. The effects of the FAB module on performance. Bold text and numbers indicate the method we used and the best results among all methods.
Module | Set14 [57] | Urban100 [59] | Manga109 [60]
STL | 29.15/0.7958 | 27.53/0.8271 | 32.11/0.9265
FAB | 29.17/0.7959 | 27.62/0.8277 | 32.17/0.9270
FAB + α + β | 29.19/0.7960 | 27.65/0.8282 | 32.20/0.9273
Table 3. The effects of using different loss functions on performance. Bold text and numbers indicate the method we used and the best results among all methods.
Metric | $L_1$ Loss | Smooth $L_1$ Loss
PSNR/SSIM | 27.65/0.8282 | 27.68/0.8289
Table 4. The effects of window size. Bold text and numbers indicate the method we used and the best results among all methods.
Window Size | Set14 [57] | Urban100 [59] | Manga109 [60]
 | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM
(8, 8) | 29.20/0.7962 | 27.68/0.8289 | 32.23/0.9268
(12, 12) | 29.22/0.7966 | 27.87/0.8348 | 32.28/0.9279
(16, 16) | 29.21/0.7966 | 27.96/0.8377 | 32.32/0.9287
Table 5. Quantitative comparison with state-of-the-art methods (average PSNR/SSIM) for classical image SR on benchmark datasets. The best and second-best performances are bolded and underlined, respectively. “†” indicates we use a pre-training strategy on ImageNet.
MethodScaleTrainingSet5 [56]Set14 [57]BSD100 [58]Urban100 [59]Manga109 [60]
PSNR/SSIMPSNR/SSIMPSNR/SSIMPSNR/SSIMPSNR/SSIM
EDSR [6]×2DIV2K38.11/0.960233.92/0.919532.32/0.901332.93/0.935139.10/0.9773
RCAN [8]×2DIV2K38.27/0.961434.12/0.921632.41/0.902733.34/0.938439.44/0.9786
SAN [10]×2DIV2K38.31/0.962034.07/0.921332.42/0.902833.10/0.937039.32/0.9792
IGNN [38]×2DIV2K38.24/0.961334.07/0.921732.41/0.902533.23/0.938339.35/0.9786
HAN [9]×2DIV2K38.27/0.961434.16/0.921732.41/0.902733.35/0.938539.46/0.9785
NLSN [25]×2DIV2K38.34/0.961834.08/0.923132.43/0.902733.42/0.939439.59/0.9789
SwinIR [16]×2DF2K38.42/0.962334.46/0.925032.53/0.904133.81/0.942739.92/0.9797
EDT [17]×2DF2K38.45/0.962434.57/0.926332.52/0.904133.80/0.942539.93/0.9800
FE-FAIR×2DF2K38.58/0.962934.73/0.926632.58/0.904834.30/0.946040.17/0.9804
IPT † [15]×2ImageNet38.37/-34.43/-32.48/-33.76/--/-
EDT † [17]×2DF2K38.63/0.963234.80/0.927332.62/0.905234.27/0.945640.37/0.9811
FE-FAIR †×2DF2K38.66/0.963234.99/0.927532.66/0.905734.67/0.949040.54/0.9812
EDSR [6]×3DIV2K34.65/0.928030.52/0.846229.25/0.809328.80/0.865334.17/0.9476
RCAN [8]×3DIV2K34.74/0.929930.65/0.848229.32/0.811129.09/0.870234.44/0.9499
SAN [10]×3DIV2K34.75/0.930030.59/0.847629.33/0.811228.93/0.867134.30/0.9494
IGNN [38]×3DIV2K34.72/0.929830.66/0.848429.31/0.810529.03/0.869634.39/0.9496
HAN [9]×3DIV2K34.75/0.929930.67/0.848329.32/0.811029.10/0.870534.48/0.9500
NLSN [25]×3DIV2K34.85/0.930630.70/0.848529.34/0.811729.25/0.876034.57/0.9508
SwinIR [16]×3DF2K34.97/0.931830.93/0.853429.46/0.814529.75/0.882635.12/ 0.9537
EDT [17]×3DF2K34.97/0.931630.89/0.852729.44/0.814229.72/0.881435.13/0.9534
FE-FAIR×3DF2K35.02/0.932631.02/0.855129.50/0.816230.22/0.889835.42/0.9547
IPT † [15]×3ImageNet38.37/-34.43/-32.48/-33.76/--/-
EDT † [17]×3DF2K35.13/0.932831.09/0.855329.53/0.816530.07/0.886335.47/0.9550
FE-FAIR †×3DF2K35.14/0.933531.24/0.856929.53/0.817230.59/0.894435.62/0.9561
EDSR [6]×4DIV2K32.46/0.896828.80/0.787627.71/0.742026.64/0.803331.02/0.9148
RCAN [8]×4DIV2K32.63/0.900228.87/0.788927.77/0.743626.82/0.808731.22/0.9173
SAN [10]×4DIV2K32.64/0.900328.92/0.788827.78/0.743626.79/0.806831.18/0.9169
IGNN [38]×4DIV2K32.57/0.899828.85/0.789127.77/0.743426.84/0.809031.28/0.9182
HAN [9]×4DIV2K32.64/0.900228.90/0.789027.80/0.744226.85/0.809431.42/0.9177
NLSN [25]×4DIV2K32.59/0.900028.87/0.789127.78/0.744426.96/0.810931.27/0.9184
SwinIR [16]×4DF2K32.92/0.904429.09/0.795027.92/0.748927.45/ 0.825432.03/0.9260
EDT [17]×4DF2K32.82/0.903129.09/0.793927.91/0.748327.46 /0.824632.05/0.9254
FE-FAIR×4DF2K33.05/0.905329.21/0.796627.97/0.751427.96/0.837732.32/0.9287
IPT † [15]×4ImageNet38.37/-34.43/-32.48/-33.76/--/-
EDT † [17]×4DF2K33.06/0.905529.23/0.797127.99/0.751027.75/0.831732.39/0.9283
FE-FAIR †×4DF2K33.19/0.907529.35/0.799228.03/0.753128.41/0.845032.64/0.9301
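For reference, the PSNR/SSIM values in Table 5 follow the common evaluation protocol for these benchmarks: metrics are computed on the Y (luma) channel after cropping a border equal to the scale factor. Below is a minimal NumPy sketch of Y-channel PSNR under that convention; it illustrates the standard protocol rather than reproducing any particular evaluation script.

```python
import numpy as np


def rgb_to_y(img):
    """Convert an RGB uint8 image (H, W, 3) to the ITU-R BT.601 Y channel in [16, 235]."""
    img = img.astype(np.float64)
    return 16.0 + (65.481 * img[..., 0] + 128.553 * img[..., 1] + 24.966 * img[..., 2]) / 255.0


def psnr_y(sr, hr, scale):
    """PSNR on the Y channel with a `scale`-pixel border crop, as commonly reported for SR."""
    y_sr = rgb_to_y(sr)[scale:-scale, scale:-scale]
    y_hr = rgb_to_y(hr)[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```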
Table 6. Quantitative comparison (average PSNR/SSIM) with state-of-the-art methods for lightweight image SR on benchmark datasets. The best and second-best performances are bolded and underlined, respectively.

| Method | Scale | # Params | Set5 [56] (PSNR/SSIM) | Set14 [57] (PSNR/SSIM) | B100 [58] (PSNR/SSIM) | Urban100 [59] (PSNR/SSIM) | Manga109 [60] (PSNR/SSIM) |
|---|---|---|---|---|---|---|---|
| CARN [39] | ×2 | 1592 k | 37.76/0.9590 | 33.52/0.9166 | 32.09/0.8978 | 31.92/0.9256 | 38.36/0.9765 |
| IMDN [40] | ×2 | 548 k | 38.00/0.9605 | 33.63/0.9177 | 32.19/0.8996 | 32.17/0.9283 | 38.88/0.9774 |
| LAPAR-A [41] | ×2 | 548 k | 38.01/0.9605 | 33.62/0.9183 | 32.19/0.8999 | 32.10/0.9283 | 38.67/0.9772 |
| LatticeNet [42] | ×2 | 756 k | 38.15/0.9610 | 33.78/0.9193 | 32.25/0.9005 | 32.43/0.9302 | -/- |
| SwinIR [16] | ×2 | 878 k | 38.14/0.9611 | 33.86/0.9206 | 32.31/0.9012 | 32.76/0.9340 | 39.12/0.9783 |
| EDT [17] | ×2 | 917 k | 38.23/0.9615 | 33.99/0.9209 | 32.37/0.9021 | 32.98/0.9362 | 39.45/0.9789 |
| FE-FAIR | ×2 | 2291 k | 38.30/0.9621 | 33.10/0.9214 | 32.41/0.9027 | 33.37/0.9417 | 39.56/0.9808 |
| CARN [39] | ×3 | 1592 k | 34.29/0.9255 | 30.29/0.8407 | 29.06/0.8034 | 28.06/0.8493 | 33.50/0.9440 |
| IMDN [40] | ×3 | 703 k | 34.36/0.9270 | 30.32/0.8417 | 29.09/0.8046 | 28.17/0.8519 | 33.61/0.9445 |
| LAPAR-A [41] | ×3 | 544 k | 34.36/0.9267 | 30.34/0.8421 | 29.11/0.8054 | 28.15/0.8523 | 33.51/0.9441 |
| LatticeNet [42] | ×3 | 765 k | 34.53/0.9281 | 30.39/0.8424 | 29.15/0.8059 | 28.33/0.8538 | -/- |
| SwinIR [16] | ×3 | 886 k | 34.62/0.9289 | 30.54/0.8463 | 29.20/0.8082 | 28.66/0.8624 | 33.98/0.9478 |
| EDT [17] | ×3 | 919 k | 34.73/0.9299 | 30.66/0.8481 | 29.29/0.8103 | 28.89/0.8674 | 34.44/0.9498 |
| FE-FAIR | ×3 | 2299 k | 34.80/0.9311 | 30.75/0.8492 | 29.33/0.8105 | 29.25/0.8727 | 34.52/0.9518 |
| CARN [39] | ×4 | 1592 k | 32.13/0.8937 | 28.60/0.7806 | 27.58/0.7349 | 26.07/0.7837 | 30.47/0.9084 |
| IMDN [40] | ×4 | 715 k | 32.21/0.8948 | 28.58/0.7811 | 27.56/0.7353 | 26.04/0.7838 | 30.45/0.9075 |
| LAPAR-A [41] | ×4 | 659 k | 32.15/0.8944 | 28.61/0.7818 | 27.61/0.7366 | 26.14/0.7871 | 30.42/0.9074 |
| LatticeNet [42] | ×4 | 777 k | 32.30/0.8962 | 28.68/0.7830 | 27.62/0.7367 | 26.25/0.7873 | -/- |
| SwinIR [16] | ×4 | 897 k | 32.44/0.8976 | 28.77/0.7858 | 27.69/0.7406 | 26.47/0.7980 | 30.92/0.9151 |
| EDT [17] | ×4 | 922 k | 32.53/0.8991 | 28.88/0.7882 | 27.76/0.7433 | 26.71/0.8051 | 31.35/0.9180 |
| FE-FAIR | ×4 | 2310 k | 32.59/0.9002 | 28.97/0.7903 | 27.79/0.7447 | 27.04/0.8139 | 31.41/0.9199 |
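The “# Params” column in Table 6 counts trainable parameters. In PyTorch this is typically obtained as follows; `model` is a placeholder for any of the compared networks.

```python
import torch.nn as nn


def count_params(model: nn.Module) -> str:
    """Return the trainable parameter count in the 'k' units used in Table 6."""
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return f"{n / 1e3:.0f} k"


# Example with a placeholder module; substitute the actual lightweight SR network.
print(count_params(nn.Conv2d(3, 64, 3)))  # -> "2 k"
```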
Table 7. Quantitative comparison (average PSNR) with state-of-the-art models for grayscale image denoising on benchmark datasets. The best and second-best performances are bolded and underlined.

| Dataset | σ | BM3D [43] | WNNM [44] | DnCNN [45] | IRCNN [46] | FFDNet [47] | NLRN [48] | FOCNet [49] | MWCNN [50] | DRUNet [51] | SwinIR [16] | FE-FAIR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Set12 [45] | 15 | 32.37 | 32.70 | 32.86 | 32.76 | 32.75 | 33.16 | 33.07 | 33.15 | 33.25 | 33.36 | 33.41 |
|  | 25 | 29.97 | 30.28 | 30.44 | 30.37 | 30.43 | 30.80 | 30.73 | 30.79 | 30.94 | 31.01 | 31.07 |
|  | 50 | 26.72 | 27.05 | 27.18 | 27.12 | 27.32 | 27.64 | 27.68 | 27.74 | 27.90 | 27.91 | 27.96 |
| BSD68 [58] | 15 | 31.08 | 31.37 | 31.73 | 31.63 | 31.63 | 31.88 | 31.83 | 31.86 | 31.91 | 31.97 | 32.04 |
|  | 25 | 28.57 | 28.83 | 29.23 | 29.15 | 29.19 | 29.41 | 29.38 | 29.41 | 29.48 | 27.50 | 27.54 |
|  | 50 | 25.60 | 25.87 | 26.23 | 26.19 | 26.29 | 26.47 | 26.50 | 26.53 | 26.59 | 26.58 | 26.61 |
| Urban100 [59] | 15 | 32.35 | 32.97 | 32.64 | 32.46 | 32.40 | 33.45 | 33.15 | 33.17 | 33.44 | 33.70 | 33.81 |
|  | 25 | 29.70 | 30.39 | 29.95 | 29.80 | 29.90 | 30.94 | 30.64 | 30.66 | 31.11 | 31.30 | 33.45 |
|  | 50 | 25.95 | 26.83 | 26.26 | 26.22 | 26.50 | 27.49 | 27.40 | 27.42 | 27.96 | 27.98 | 28.12 |
Table 8. Quantitative comparison (average PSNR) with state-of-the-art methods for colour image denoising on benchmark datasets. The best and second-best performances are bolded and underlined.

| Dataset | σ | BM3D [43] | DnCNN [45] | IRCNN [46] | FFDNet [47] | DSNet [52] | RPCNN [53] | BRDNet [54] | IPT [15] | DRUNet [51] | SwinIR [16] | FE-FAIR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CBSD68 [58] | 15 | 33.52 | 33.90 | 33.86 | 33.87 | 33.91 | - | 34.10 | - | 34.30 | 34.42 | 34.46 |
|  | 25 | 30.71 | 31.24 | 31.16 | 31.21 | 31.28 | 31.24 | 31.43 | - | 31.69 | 31.78 | 31.81 |
|  | 50 | 27.38 | 27.95 | 27.86 | 27.96 | 28.05 | 28.06 | 28.16 | 28.39 | 28.51 | 28.56 | 28.58 |
| Kodak24 [62] | 15 | 34.28 | 34.60 | 34.69 | 34.63 | 34.63 | - | 34.88 | - | 35.31 | 35.34 | 35.39 |
|  | 25 | 32.15 | 32.14 | 32.18 | 32.13 | 32.16 | 32.34 | 32.41 | - | 32.89 | 32.89 | 32.97 |
|  | 50 | 28.46 | 28.95 | 28.93 | 28.98 | 29.05 | 29.25 | 29.22 | 29.64 | 29.86 | 29.79 | 29.91 |
| McMaster [63] | 15 | 34.06 | 33.45 | 34.58 | 34.66 | 34.67 | - | 35.08 | - | 35.40 | 35.61 | 35.67 |
|  | 25 | 31.66 | 31.52 | 32.18 | 32.35 | 32.40 | 32.33 | 32.75 | - | 33.14 | 33.20 | 33.31 |
|  | 50 | 28.51 | 28.62 | 28.91 | 29.18 | 29.28 | 29.33 | 29.52 | 29.98 | 30.08 | 30.22 | 30.27 |
| Urban100 [59] | 15 | 33.93 | 32.98 | 33.78 | 33.83 | - | - | 34.42 | - | 34.81 | 35.13 | 35.26 |
|  | 25 | 31.36 | 30.81 | 31.20 | 31.40 | - | 31.81 | 31.99 | - | 32.60 | 32.90 | 33.06 |
|  | 50 | 27.93 | 27.59 | 27.70 | 28.05 | - | 28.62 | 28.56 | 29.71 | 29.61 | 29.82 | 30.12 |
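In Tables 7 and 8, σ denotes the standard deviation of additive white Gaussian noise on the 0–255 intensity scale. The sketch below shows the common way such noisy test inputs are synthesized and how PSNR is then measured; it illustrates the standard protocol and is not necessarily the exact pipeline used here.

```python
import numpy as np


def add_awgn(img, sigma, seed=0):
    """Add white Gaussian noise with standard deviation `sigma` (0-255 scale) to a clean image."""
    rng = np.random.default_rng(seed)
    noisy = img.astype(np.float64) + rng.normal(0.0, sigma, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def psnr(a, b):
    """PSNR between two uint8 images, matching the metric reported in Tables 7 and 8."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```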