MLWAN: Multi-Scale Learning Wavelet Attention Module Network for Image Super Resolution

Image super resolution (SR) is an important image processing technique in computer vision for improving the resolution of images and videos. In recent years, deep convolutional neural networks (CNNs) have made significant progress in the field of image SR; however, existing CNN-based SR methods cannot fully exploit contextual information during feature extraction. In addition, in most cases, different scale factors of image SR are treated as separate tasks and handled by training different models, which does not meet practical application requirements. To solve these problems, we propose a multi-scale learning wavelet attention network (MLWAN) model for image SR. Specifically, the proposed model consists of three parts. In the first part, low-level features are extracted from the input image through two convolutional layers, followed by a new channel-spatial attention mechanism (CSAM) block. In the second part, a CNN is used to predict the highest-level low-frequency wavelet coefficients, and the third part uses recurrent neural networks (RNNs) with different scales to predict the wavelet coefficients of the remaining subbands. To make the network more lightweight, an efficient channel attention recurrent module (ECARM) is proposed to reduce the number of network parameters. Finally, the inverse discrete wavelet transform (IDWT) is used to reconstruct the HR image. Experimental results on public large-scale datasets demonstrate the superiority of the proposed model in terms of quantitative metrics and visual quality.


Introduction
Image super resolution (SR) reconstruction refers to the process of restoring a given low-resolution (LR) image to a corresponding high-resolution (HR) image by a specific algorithm. It is designed to overcome or compensate for image blur, low quality, and insignificant regions of interest caused by the limitations of the image acquisition system or acquisition environment. At present, many sophisticated visual applications (e.g., satellite and aerial imaging [1], medical imaging [2,3], and security and surveillance imaging [4,5]) can benefit from high-quality reconstructed HR images when SR techniques are used as a preprocessing step. How to extract valuable information from various kinds of LR images is the key to solving the SR problem. Since the texture structure of an image has infinitely many solutions in the process of SR reconstruction, it is a typical ill-posed problem.
To solve the SR problem, there are usually both hardware and software approaches. On the hardware side, SR can be achieved by reducing the size of individual sensors; however, that would raise prices exponentially. Software-based image SR reconstruction has essentially no hardware cost; therefore, in earlier research, the mainstream algorithms of image SR were mainly divided into three categories: interpolation-based methods, reconstruction-based methods, and learning-based methods. More specifically, interpolation-based SR methods, such as bicubic interpolation [6] and Lanczos resampling [7], are fast and simple but lack accuracy. Reconstruction-based SR methods [8][9][10][11] and, more recently, learning-based deep CNN methods improve the performance of SR; however, additional network parameters and long inference time have become a prevalent issue. To solve this issue, some researchers have proposed two network shrinkage techniques that maintain performance. One is to carefully design a condensed architecture [29,30], which is effective but requires a high level of expertise. The other is to use parameter sharing algorithms [31,32], such as recursive/recurrent learning, but this class of methods performs sequential inference over each recursive/recurrent block, which makes the inference time longer. In addition, many CNN-based SR methods ignore the feature connections in the middle layers, which causes the texture details of LR images to be over-smoothed in the SR output. Although channel attention [33,34] helps maintain some detailed information, channel-attention-based techniques still fail to preserve useful textures and restore natural features, because they treat feature maps at different levels identically, so certain detail regions are lost in the reconstructed image; therefore, SR research still faces challenges in recovering texture details.
Based on the above analysis, current methods do not strike a balance among model inference time, parameters, and performance. Motivated by these facts, we propose a multi-scale learning wavelet attention network (MLWAN) model to achieve fast, accurate, and lightweight SR. The overall architecture of the proposed MLWAN model is shown in Figure 1. As depicted in Figure 1, the proposed model can utilize more context-specific information and achieve multi-scale SR through a single-scale SR model. Compared with previous wavelet-related methods [35][36][37], our network uses the original LR image as the direct input at different levels and predicts the wavelet coefficients of the target image. To be specific, the whole network is principally divided into three portions. In the first portion, two convolutional layers and a new channel-spatial attention mechanism (CSAM) are used to extract low-level features from the input LR images. The second portion is a CNN branch that predicts the highest-level low-frequency wavelet coefficients. The third portion is an RNN branch that predicts the remaining subband coefficients. In addition, an efficient channel attention recurrent module (ECARM) is proposed as the building block of the RNN branch; it reduces the total number of network parameters through parameter sharing. Moreover, different from previous recurrent neural network (RNN)-based approaches [31,32,38], the number of recurrences in this work is determined by the scale factor, and each recurrence is responsible for predicting a certain level of subband coefficients. Based on the predicted wavelet coefficients of each branch, the HR image is reconstructed by the inverse discrete wavelet transform (IDWT). Note that a preliminary version of this paper was published in [39], which only learns a simple wavelet and channel attention network for image SR.
This work adds new insights to the preliminary version so that the proposed MLWAN model has better robustness and universality. In general, the main contributions of this work are as follows: (1) A multi-scale learning wavelet attention network (MLWAN) is proposed to complete multi-scale SR tasks in a fast and lightweight way. The network predicts the wavelet coefficients of the target image and inverse-transforms the predicted coefficients to obtain the final HR image. (2) A novel channel-spatial attention mechanism (CSAM) block is proposed to learn the channel and spatial correlations of each layer's features. Because it incorporates responses from all dimensions of the feature maps, we use it in our network to extract the low-level features of the LR image more completely. (3) As the basic unit of the recurrent block (RB), an efficient channel attention recurrent module (ECARM) is proposed to reduce the network parameters. Experimental results show that the proposed MLWAN achieves a good balance among model inference time, parameters, and performance, and outperforms most existing SR methods.

Related Works
Because of the high approximation capacity and hierarchical property of artificial neural networks (ANNs), most modern image SR models are based on DL technology. In this section, we analyze the current DL-based models in the SR field from three aspects: wavelet-related SR, RNN-based SR, and attention mechanisms.

Wavelet-Related SR
As is well known, the wavelet transform is an efficient image representation that decomposes image signals into high-frequency subbands representing texture details and low-frequency subbands containing global topological information. Bae et al. [40] first combined the wavelet transform with a deep-learning-based SR model, taking the subbands of the interpolated LR wavelet as input and predicting the residuals of the corresponding HR subbands; the wavelet transform and inverse wavelet transform are used to decompose the LR input and reconstruct the HR output, respectively. Later, with further study of deep learning, various SR algorithms combining deep learning and the wavelet transform have been proposed. Guo et al. [35] presented a deep wavelet super-resolution (DWSR) model using a deep CNN. Liu et al. [36] developed a multi-level wavelet-CNN (MWCNN) model by using the discrete wavelet transform and the inverse wavelet transform for down-sampling and up-sampling, respectively. Additionally, Xue et al. [37] constructed a wavelet-based residual attention network (WRAN) by stacking several multi-kernel convolutional layers equipped with attention blocks. Ji et al. [41] predicted the missing portion of the wavelet coefficients using multi-frame information in the wavelet domain. Anbarjafari et al. [42] proposed a novel SR method based on interpolation of the high-frequency subbands in the wavelet domain and the spatial-domain input image. Zhang et al. [43] proposed a lightweight and fast network (MSWSR) to perform multi-scale SR simultaneously by learning multi-level wavelet coefficients of the target image. Huang et al. [44] presented a wavelet-based CNN approach for face image SR.
In summary, most of these wavelet-related SR networks take the wavelet coefficients of the bicubic-upsampled LR image as input and output the wavelet coefficients of the desired image, which can be regarded as refining the wavelet coefficients of the upsampled LR image toward those of the matched HR image. Most importantly, although these methods support a high sampling scale, they do not take full advantage of the wavelet transform's multiresolution decomposition. To achieve multi-scale SR, in this study, the original LR image is directly used as the input, and a simple network is built to predict the wavelet coefficients of the target image.
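To make the subband decomposition concrete, the following is a minimal numpy sketch of a one-level 2D DWT with the db1 (Haar) filter, the wavelet later adopted in this paper; the function name and the orthonormal scaling are our own illustrative choices, not code from the proposed network:

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2D discrete wavelet transform with the db1 (Haar) filter.

    Splits an even-sized image into four half-resolution subbands:
    A (low-frequency approximation) plus H, V, D (detail subbands).
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    A = (a + b + c + d) / 2.0  # approximation
    H = (a + b - c - d) / 2.0  # horizontal detail
    V = (a - b + c - d) / 2.0  # vertical detail
    D = (a - b - c + d) / 2.0  # diagonal detail
    return A, H, V, D
```

With this orthonormal scaling the transform preserves energy, and a constant (texture-free) image puts all of its energy into the A subband.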

RNN-Based SR
The RNN is a recurrent fully connected neural network model inspired by the spiking behavior of biophysical neurons, which has achieved remarkable results in many natural language processing (NLP) tasks. On one hand, the feedback mechanism of an RNN allows previous states to influence the current output. On the other hand, an RNN is similar to an ordinary artificial neural network, with a tree-like hierarchical structure in which network nodes recursively process input data in the order of connection. Generally, it is used for machine learning problems involving structural relationships, with flexible topology and weight sharing. In addition, recursive and recurrent neural networks are often used to reduce model parameters because they adopt parameter sharing [31,32,38]. For instance, Kim et al. [45] proposed a deeply-recursive convolutional network (DRCN) to improve performance and reduce model parameters. Tai et al. [32] proposed a deep recursive residual network (DRRN) model by using a similar recursive mechanism and residual blocks [46]. Li et al. [31] proposed the SRFBN model, which includes four recurrent inference units, each with a complex structure, resulting in a long inference time. Obviously, by recursively reusing recursive units, these methods greatly reduce the total number of network parameters. However, because so much recursion is involved, the inference time can become long. To perform multi-scale SR in a lightweight and fast way, in this paper an RNN is used to balance the total number of network parameters against reconstruction performance, while we also attempt to shorten the inference time.

Attention Mechanism
The attention mechanism is derived from the human visual attention mechanism and can be regarded as an application of bionics. The brain's signal processing system for human eyesight is called the visual attention mechanism, and it has played a key role in various fields of computer vision (e.g., image recognition, capture, and restoration). Attention gives models the power to discriminate, which makes it widely popular. For instance, in speech recognition and machine translation applications, each word in a sentence is given a different weight, which increases the learning flexibility (softness) of the neural network. In addition, attention can act as an alignment relation that explains the relationship between output and input sentences and the knowledge acquired by the network, giving us a window into the black box of deep learning. In other words, attention helps the network concentrate on crucial information while dismissing unimportant information. At present, the research focus of image SR algorithms has shifted from traditional CNNs to attention-based DL. For instance, Wang et al. [47] presented a residual attention network based on a trunk-and-mask attention mechanism. Hu et al. [48] proposed the SE-Net model using a channel attention learning mechanism. Zhang et al. [33] developed a residual channel attention network (RCAN) using an extremely deep network with channel attention for SR. Zhu et al. [49] utilized a U-shaped formation and residual channel attention blocks to achieve excellent image SR performance. Woo et al. [50] used spatial attention (SA) and channel attention (CA) blocks to exploit the inter-spatial and inter-channel relationships of feature maps. Clearly, well-constructed attention modules can improve the performance of image SR.
Based on our previous work [39], in this study, we present a novel CSAM block to study the interdependencies between pixels and channels.

Method
In this section, we describe the proposed MLWAN and the adopted loss function in detail. As shown in Figure 2, the main idea of this study is to exploit a key property of the wavelet transform: one level of wavelet coefficients can generate a 2× image by the inverse wavelet transform. Taking the three-level wavelet as an example, the wavelet coefficients of the target HR image are predicted from the corresponding LR image. Then, the target HR images at 2×, 4×, and 8× scales are reconstructed by the multi-level inverse wavelet transform, respectively.
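The 2× property exploited here can be illustrated with the inverse step: given four subbands of size h × w, the db1 inverse DWT returns a 2h × 2w image. A minimal numpy sketch (illustrative, not the paper's implementation):

```python
import numpy as np

def haar_idwt2(A, H, V, D):
    """Inverse of the one-level db1 (Haar) 2D DWT: four h x w subbands
    are combined into a single 2h x 2w image, i.e. one 2x upscaling step."""
    h, w = A.shape
    x = np.empty((2 * h, 2 * w), dtype=float)
    x[0::2, 0::2] = (A + H + V + D) / 2.0  # top-left of each 2x2 block
    x[0::2, 1::2] = (A + H - V - D) / 2.0  # top-right
    x[1::2, 0::2] = (A - H + V - D) / 2.0  # bottom-left
    x[1::2, 1::2] = (A - H - V + D) / 2.0  # bottom-right
    return x
```

Applying this once doubles the resolution, so three applications (as in the three-level example above) yield an 8× image.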

Overview of the Proposed Model
As discussed above, DL-based SR models can effectively deal with single-scale SR (such as 2×, 4×, or 8×), while multi-scale SR is still difficult. In this paper, we propose a novel MLWAN model for multi-scale SR. The network framework is illustrated in Figure 1. It is divided into three main parts. In the first part, we use two convolutional layers to extract low-level features from the input image. Then, a CSAM block is concatenated to extract the underlying features representing the intra-channel and inter-channel information. Furthermore, we use one CNN part and one RNN part to predict the wavelet coefficients of the target image. In the CNN part, two successive convolutional layers process the extracted low-level features to predict the wavelet coefficients ω_A^3. In addition, in order to reduce the total number of model parameters, an efficient channel attention recurrent module (ECARM) is introduced in the RNN part, which shares all of its parameters across recurrences. The ECARM takes the low-level features as input and cooperates with two following convolutional layers to predict the three remaining third-level wavelet coefficients (i.e., ω_H^3, ω_V^3, and ω_D^3). Consequently, a 2D IDWT of the third-level wavelet coefficients is applied to compute O_2×. Subsequently, the ECARM with the same weights takes the concatenation of its previous output and the low-level features as input to generate discriminative features. Then, a 2× deconvolutional layer and a convolutional layer are utilized to predict the second-level wavelet coefficients (i.e., ω_H^2, ω_V^2, and ω_D^2) from the generated features; therefore, the second-level wavelet coefficients are inversely transformed into O_4× by the corresponding IDWT. Likewise, the same RB propagates one more time, followed by a 4× deconvolutional layer and a convolutional layer, to predict the first-level wavelet coefficients (i.e., ω_H^1, ω_V^1, and ω_D^1).
Ultimately, O_8× is reconstructed. Note that we adopt the db1 wavelet function as the wavelet filter, and all of the RBs share the same weights. As depicted in Figure 1, the proposed MLWAN can reconstruct 2×, 4×, and 8× SR images using a single network, and the magnification of the SR image grows exponentially with the number of recurrences of the RB. Thus, at the inference step, SR results at the desired scale can be obtained by using the scale factor to flexibly control the number of recurrences of the RB. In theory, since the RNN branch can involve an arbitrary number of recurrences, the proposed MLWAN network is scalable and can handle multi-scale SR tasks via a single network whenever the scale factor is a power of 2. Moreover, benefiting from the weight sharing strategy of the RBs, few additional parameters are involved when extending the RNN part to more SR scales.
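The scale-controlled recurrence described above can be sketched structurally as follows; the two predictor callables stand in for the CNN and RNN branches (which are real networks in the paper), so this is a shape-level illustration only:

```python
import numpy as np

def haar_idwt2(A, H, V, D):
    # combine four h x w subbands into one 2h x 2w image (db1 inverse DWT)
    out = np.empty((2 * A.shape[0], 2 * A.shape[1]))
    out[0::2, 0::2] = (A + H + V + D) / 2.0
    out[0::2, 1::2] = (A + H - V - D) / 2.0
    out[1::2, 0::2] = (A - H + V - D) / 2.0
    out[1::2, 1::2] = (A - H - V + D) / 2.0
    return out

def super_resolve(lr, scale, predict_lowpass, predict_details):
    """Shape-level skeleton of the MLWAN inference loop: log2(scale)
    recurrences of a weight-shared block, each predicting one level of
    detail subbands; each IDWT doubles the resolution."""
    assert scale in (2, 4, 8), "scale must be a power of 2 handled by the net"
    levels = int(np.log2(scale))
    out = predict_lowpass(lr)            # highest-level low-frequency band
    for level in range(levels):          # one RB recurrence per level
        H, V, D = predict_details(lr, level)
        out = haar_idwt2(out, H, V, D)
    return out
```

Running the same loop with `scale=2`, `4`, or `8` simply changes the number of recurrences, which is the multi-scale behavior of the single network.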

The Channel-Spatial Attention Mechanism (CSAM) Module
The attention mechanism has made important breakthroughs in image processing and NLP in recent years and has been proved to be beneficial to model performance. The essence of the attention mechanism is to locate the information of interest and suppress useless information; the results are usually presented as probability maps or probabilistic feature vectors. According to the application scenario, it is mainly divided into spatial attention models, channel attention models, and channel-spatial mixed attention models [33,34,[50][51][52]. As is well known, channel attention focuses on what kind of features are meaningful, while spatial attention focuses on where features are meaningful. In [50], the channel attention and spatial attention modules can be combined in parallel or sequentially, and the authors found that combining them sequentially with channel attention first leads to better results. Inspired by these findings, we propose a new channel-spatial attention mechanism (CSAM) that incorporates the response from every dimension of the feature maps. The overview of the CSAM is shown in Figure 3. The input feature F_N ∈ R^(H×W×C) is fed into a 3D convolution layer [53] to build an attention map that captures joint spatial and channel characteristics. Specifically, we apply a 3D convolution with kernel size 3 × 3 × 3, stride 1, and padding 1 to generate three sets of channel-spatial attention maps W_csa.
Furthermore, we apply an element-wise product between the input feature F_N and the attention map W_csa. Finally, the weighted features F_CS are defined by

F_CS = β · σ(W_csa) ⊙ F_N + F_N,   (1)

where β is a scale factor, ⊙ is the element-wise product, and σ(·) is the sigmoid function. As a result, F_CS is the weighted sum of every spatial-channel position feature together with the primary feature. In general, the proposed CSAM explicitly models spatial and channel-wise feature interdependencies to adaptively learn the intra-channel and inter-channel feature responses.
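As a minimal numpy illustration of this weighting, assuming the residual form F_CS = β · σ(W_csa) ⊙ F_N + F_N implied by the description (the attention map W_csa is taken as given, i.e., already produced upstream by the 3D convolution):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def csam_weight(F, W_csa, beta=1.0):
    """Channel-spatial attention weighting as described for CSAM:
    F_CS = beta * sigmoid(W_csa) * F + F  (residual form).

    F and W_csa are (C, H, W) feature maps of equal shape; W_csa is
    assumed to come from the 3x3x3 3D convolution over the features."""
    assert F.shape == W_csa.shape
    return beta * sigmoid(W_csa) * F + F
```

The residual term guarantees the original features always pass through, so the attention map can only amplify responses, never zero them out entirely.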

The Efficient Channel Attention Recurrent Module (ECARM)
Recently, channel attention mechanisms have shown great potential for improving the performance of deep CNNs; however, most existing approaches focus on developing more complex attention modules to achieve better performance, which inevitably increases model complexity [48]. To overcome this performance-complexity tradeoff, we propose an efficient channel attention recurrent module (ECARM), which contains only a few parameters yet brings a significant performance improvement. As depicted in Figure 4, the proposed ECARM is composed of a progressive refinement module (PRM), an efficient channel attention (ECA) module [54], and one 1 × 1 convolution layer. Note that the ECARM as a whole uses residual connections [46]. As shown in the gray box in Figure 4, the PRM uses a 3 × 3 convolution layer to extract the input features for several successive distillation stages. Then, at each stage, a channel split operation divides the features from the previous stage into two parts: one part is retained, which can be regarded as the refined features, while the remaining part is fed into the following computation unit. For the channel attention, a band matrix W_k can be utilized to learn the channel weights:

W_k =
| w^(1,1)  ⋯  w^(1,k)      0         ⋯      0      |
|   0      w^(2,2)  ⋯  w^(2,k+1)     ⋯      0      |
|   ⋮         ⋮         ⋱                   ⋮      |
|   0      ⋯     0     w^(C,C−k+1)   ⋯  w^(C,C)   |   (2)

where W_k involves k × C parameters. The weight ω_i of the channel descriptor y_i is calculated by considering only the interaction between y_i and its k neighbors:

ω_i = σ( Σ_{j=1}^{k} w_i^j y_i^j ),  y_i^j ∈ Ω_i^k,   (3)

where σ is the sigmoid function and Ω_i^k is the set of k adjacent channels of y_i. A more efficient approach is to make all channels share the same learning parameters:

ω_i = σ( Σ_{j=1}^{k} w^j y_i^j ),  y_i^j ∈ Ω_i^k.   (4)

Note that this strategy can be readily implemented with a fast 1D convolution of kernel size k:

ω = σ(C1D_k(y)),   (5)

where C1D denotes 1D convolution. The method in Equation (5) is called the ECA module, which involves only k parameters.
In short, the ECA module ensures both efficiency and effectiveness by appropriately capturing local cross-channel interaction.
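A plain-numpy sketch of the ECA gate in Equation (5) might look as follows; the zero-padding at the channel ends and the function signature are our own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eca(x, kernel):
    """Efficient channel attention (Equation (5)): a shared 1D convolution
    of kernel size k over the channel descriptor, followed by a sigmoid.

    x: (C, H, W) feature map; kernel: (k,) shared weights, i.e. only k
    learnable parameters regardless of the channel count C."""
    k = kernel.shape[0]
    C = x.shape[0]
    y = x.mean(axis=(1, 2))                      # global average pooling -> (C,)
    y_pad = np.pad(y, k // 2)                    # zero-pad channel ends
    conv = np.array([np.dot(kernel, y_pad[i:i + k]) for i in range(C)])
    w = sigmoid(conv)                            # per-channel gate in (0, 1)
    return x * w[:, None, None]                  # rescale each channel
```

Because the same k weights slide over all C channels, the parameter count is k (typically 3 or 5), versus k × C for the band matrix of Equation (2).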

Loss Functions
In this section, the pairwise difference between a reconstructed HR image and its corresponding ground truth (GT) is measured at two stages and in two different domains. The losses take into account both the pixel-wise distance between images in the spatial domain and the frequency and texture differences in the wavelet domain. The total loss is a linear combination of the two components:

Loss_total = α Loss_spatial + β Loss_wavelet   (6)

Spatial Loss Loss_spatial: After each IDWT, the mean absolute error (MAE) between the reconstruction results at different scale factors (O_2×, O_4×, and O_8×) and the corresponding GTs is calculated to minimize the pixel-level difference in the spatial domain, which can be formulated as

Loss_spatial = Σ_{s∈{2,4,8}} || O_s× − bic_(8/s)↓(G) ||_1,   (7)

where G indicates the corresponding GT of O_8×, and bic_s↓(·) is the function for bicubic downsampling with scale factor s.

Wavelet Loss Loss_wavelet: Unlike many other SR methods [27,31,33] that only impose losses between the output and the ground truth in the spatial domain, we adopt a wavelet loss in the wavelet domain to better constrain the distance between the predicted and target wavelet coefficients before the IDWT. Specifically, we use the MAE between the predicted wavelet coefficients ω and their corresponding GT coefficients to help generate more high-frequency and detailed textures:

Loss_wavelet = Σ_n || ω^n − dwt_n(G) ||_1,   (8)

where ω^n denotes the predicted n-th-level wavelet coefficients and dwt_n(·) is the mapping function of the n-level 2D discrete wavelet transform (DWT).
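A hedged sketch of the combined objective in Equation (6), with the per-scale outputs, downsampled GTs, and wavelet coefficients passed in as precomputed arrays (the bicubic downsampling and DWT are assumed to happen upstream):

```python
import numpy as np

def mae(a, b):
    """Mean absolute error (L1) between two arrays."""
    return np.mean(np.abs(a - b))

def total_loss(outputs, gts, coeffs, gt_coeffs, alpha=1.0, beta=1.0):
    """Combined objective of Equation (6): spatial L1 between each IDWT
    output (O_2x, O_4x, O_8x) and its downsampled GT, plus L1 between
    predicted and target wavelet coefficients. All arguments are lists of
    matching numpy arrays; alpha/beta weight the two loss terms."""
    loss_spatial = sum(mae(o, g) for o, g in zip(outputs, gts))
    loss_wavelet = sum(mae(w, gw) for w, gw in zip(coeffs, gt_coeffs))
    return alpha * loss_spatial + beta * loss_wavelet
```

Both terms are plain MAE sums, so the relative emphasis on spatial fidelity versus wavelet-domain texture is controlled entirely by alpha and beta.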

Experiments
In this section, we first introduce the datasets and evaluation metrics used in our experiments. Then, the implementation details of the proposed model are discussed. On this basis, the effectiveness of the proposed MLWAN is compared quantitatively and qualitatively. Finally, the results of ablation experiments are presented.

Metrics and Datasets
DIV2K [55] is a popular image SR dataset that includes 1000 high-quality natural RGB images. In our experiments, the first 900 images of DIV2K are selected as the training data. Meanwhile, five standard benchmark datasets, Set5 [56], Set14 [16], B100 [57], Urban100 [58], and Manga109 [59], are used for evaluation. SR results are evaluated by two quantitative measures: the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [60] are used to evaluate the quality of the reconstructed HR images. For a fair comparison, PSNR and SSIM are calculated only on the luminance (Y) channel, as in previous works. Note that Matlab functions with bicubic interpolation are used to downsample the high-quality images to generate the LR training images.
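For reference, Y-channel PSNR as used in this evaluation protocol can be sketched as follows; the BT.601 luminance conversion is the convention commonly used in SR papers, and border shaving is omitted for brevity:

```python
import numpy as np

def rgb_to_y(img):
    """Luminance (Y) channel in the ITU-R BT.601 convention commonly used
    for SR evaluation; img is (H, W, 3) with values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

SSIM follows the same Y-channel convention but needs the windowed statistics of [60], so a library implementation (e.g., scikit-image's) is normally used in practice.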

Implementation Details
The proposed MLWAN is trained on (2, 4, 8)× SR for 4000 epochs. In every iteration, we randomly crop 32 patches of size 40 × 40 as the inputs. For dynamic data augmentation, we use random flips (horizontal and vertical) and 90-degree rotations before feeding the data into the network. The network parameters are initialized according to [61] and optimized by Adam [62] with a learning rate of 4 × 10^−5. Both α and β in Equation (6) are the weights that balance the spatial and wavelet loss terms.
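The augmentation pipeline described above (random 40 × 40 crops, random horizontal/vertical flips, and 90-degree rotation) can be sketched as follows; the 0.5 probabilities and the fixed seed are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility (illustrative)

def augment_patch(img, patch=40):
    """Training-time augmentation as described: random patch x patch crop,
    random horizontal/vertical flips, and a random 90-degree rotation."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - patch + 1)
    left = rng.integers(0, w - patch + 1)
    out = img[top:top + patch, left:left + patch].copy()
    if rng.random() < 0.5:
        out = out[:, ::-1]          # horizontal flip
    if rng.random() < 0.5:
        out = out[::-1, :]          # vertical flip
    if rng.random() < 0.5:
        out = np.rot90(out)         # 90-degree rotation
    return out
```

Because the patch is square, the rotation preserves its shape, so batches of 32 such patches can be stacked directly.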
For network parameters, we consider convolutional, deconvolutional, and fully connected layers when calculating the parameters of the entire network. The parameters of the IDWT are not included because their quantity is negligible (0.008 K). The number of parameters of a convolutional layer can be defined by

Params_conv = C_in × C_out × k_h × k_ω + b,   (9)

where C_in and C_out are the numbers of input and output channels of the convolutional layer, respectively, and (k_h, k_ω) is the kernel size. b is the number of bias terms, which equals C_out when a bias is used and 0 otherwise; the bias controls how easily the neuron is activated. The parameters of a deconvolutional layer are calculated in the same way as those of a convolutional layer. The number of parameters of a fully connected layer can be calculated by

Params_fc = D_in × D_out + b,   (10)

where D_in and D_out are the numbers of input and output dimensions of the fully connected layer, respectively, and b equals D_out when a bias is used and 0 otherwise. In our experiments, all the numbers of network parameters are calculated based on Equations (9) and (10). For inference time, all network inference times are measured on the B100 dataset, which contains 100 images. The official implementations of the comparison methods are based on different deep learning libraries with different computational backends, which affects the fairness of the inference time comparison. To reduce the measurement differences caused by different software or hardware, we only adopt the PyTorch library on an NVIDIA Quadro RTX 6000 GPU when computing network inference time.
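The parameter-count formulas referenced as Equations (9) and (10) can be written as two small helper functions (the naming is ours; the formulas are the standard counts consistent with the text's description, with the bias term equal to the output size when a bias is used):

```python
def conv_params(c_in, c_out, k_h, k_w, bias=True):
    """Equation (9): parameter count of a convolutional (or, identically,
    a deconvolutional) layer with c_in input channels, c_out output
    channels, and a k_h x k_w kernel."""
    b = c_out if bias else 0
    return c_in * c_out * k_h * k_w + b

def fc_params(d_in, d_out, bias=True):
    """Equation (10): parameter count of a fully connected layer with
    d_in input and d_out output dimensions."""
    b = d_out if bias else 0
    return d_in * d_out + b
```

For example, a 3 × 3 convolution from 64 to 64 channels with bias contributes 36,928 parameters, matching the counts reported by common deep learning frameworks.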
In the comparison experiments, we first compare the proposed MLWAN model with Bicubic [6], SRCNN [18], FSRCNN [64], VDSR [22], DRCN [45], and MemNet [38]. The overall performance on the Set5 [56], Set14 [16], B100 [57], Urban100 [58], and Manga109 [59] datasets is tabulated in Table 1. As shown in Table 1, the proposed MLWAN model greatly improves PSNR and SSIM compared with Bicubic, SRCNN, and FSRCNN. In particular, compared with VDSR [22], the proposed model achieves higher PSNR and SSIM for 4× and 8× images; for example, on the 4× Set5 dataset, PSNR and SSIM are improved by 0.84 dB and 0.0127, respectively. Compared with DRCN [45], the proposed method not only achieves higher PSNR and SSIM but also requires fewer parameters and less inference time. Compared with MemNet [38], the inference time of the proposed network is much shorter. In conclusion, the proposed MLWAN model outperforms the other comparison models in terms of PSNR and SSIM. Furthermore, to verify the effectiveness of multi-scale SR via a single network, we compare the proposed MLWAN model with MSSR, LapSRN, and MSWSR, which were originally designed for multi-scale SR tasks, in Table 2. As shown in Table 2, although the inference time of LapSRN is very short, the proposed MLWAN model achieves better PSNR and SSIM values than LapSRN. In particular, on the Manga109 dataset for 8× images, the proposed MLWAN method improves PSNR by 1.12 dB and SSIM by 0.0483. Clearly, the proposed MLWAN model achieves better PSNR/SSIM results than the MSSR, LapSRN, and MSWSR methods. Accordingly, we can conclude that the proposed MLWAN model is powerful for performing multi-scale SR in a lightweight and fast way. In addition, we also compare the number of model parameters and the running time of the SR methods at different scales. The results are shown in Figure 5.
As can be seen from Figure 5, the PSNR and SSIM values of EDSR [27] and RCAN [33] are higher than those of the proposed model, but their parameter counts and inference times are much larger. Recurrent learning is used in SRFBN [31] to reduce the total number of model parameters; however, its parameter count and inference time are still higher than those of the proposed model. In short, although these methods achieve higher PSNR and SSIM values than the proposed method, they require far more network parameters and inference time. Through quantitative comparison with the above methods, we find that the proposed method realizes a better balance among reconstruction performance, model parameters, and inference time; therefore, the proposed method can be embedded into real-time image processing systems for image SR applications.

Visual Effect Analysis
In this section, we visually compare the proposed method with some representative SR methods. Specifically, we compare the proposed MLWAN with Bicubic [6], SRCNN [18], FSRCNN [64], and LapSRN [23], and perform SR reconstruction of 2×, 4×, and 8× images on standard datasets. The experimental results are shown in Figures 6-9. Note that the image in Figure 6 is selected from Set5 [56], the image in Figure 7 is selected from Set14 [16], and the images in Figures 8 and 9 are selected from B100 [57]. Obviously, the proposed method produces better and clearer textures and reconstruction results than the other four methods. In particular, at the 4× and 8× scales, the comparison of the results suggests that the proposed algorithm outperforms the other SR methods. In conclusion, based on subjective image quality assessment (IQA), the perceived image quality of the proposed model significantly outperforms the current representative SR algorithms.

The Effectiveness of CSAM
The features extracted by the CSAM block may have a great influence on the performance of the whole network. Based on this consideration, we design two groups of experiments: the first is MLWAN without the CSAM block, denoted MLWAN (without CSAM), and the second is MLWAN with the CSAM block. We test the two trained networks on five standard benchmark datasets for 2×, 4×, and 8× in terms of SSIM and PSNR. The experimental results are shown in Table 3. As can be seen in Table 3, the proposed model with the CSAM block achieves evidently higher PSNR and SSIM values than the model without it; therefore, the CSAM block plays a key role in enhancing the performance of the proposed model. In general, the proposed CSAM block can efficiently extract image structure and texture information for image SR.

The Effectiveness of ECA

Similarly, to verify the effectiveness of the ECA module, we compare the original MLWAN with a variant from which the ECA module is removed (MLWAN without ECA); the results are listed in Table 4. We test the two models on five standard benchmark datasets for 2×, 4×, and 8× in terms of SSIM and PSNR. The PSNR and SSIM values obtained by the proposed MLWAN without ECA are significantly lower than those of the original MLWAN; therefore, we can conclude that overall performance can be enhanced by embedding the ECA in the proposed MLWAN model.

Comparison with LWCAN

A preliminary version of this work, LWCAN, was published previously; see [39] for details. In this section, in order to demonstrate the universality and robustness of the proposed algorithm, we compare the proposed MLWAN model with LWCAN in terms of PSNR and SSIM. Experimental results are shown in Table 5. Since the previous model was designed for 4× SR, we can only compare the 4× scale in this study. As can be seen in Table 5, the proposed model is clearly better than the previous model [39] in terms of PSNR and SSIM. In conclusion, the proposed model has better robustness and versatility than the LWCAN model.

Conclusions and Future Work
In this paper, we propose a multi-scale learning wavelet attention network (MLWAN) to achieve accurate, fast, and lightweight image super resolution. The novelty of the proposed MLWAN model is that we use the CSAM block in feature extraction and the ECARM in the RNN branch to improve network performance. In particular, the proposed MLWAN model achieves a balance among network reconstruction performance, inference time, and parameter count. Experimental results show that the proposed model outperforms some representative SR methods. In future work, we plan to extend the proposed approach to video super resolution.

Conflicts of Interest:
The authors declare no conflict of interest.