Perceptual Metric Guided Deep Attention Network for Single Image Super-Resolution

Abstract: Deep learning has been widely applied to image super-resolution (SR) tasks and has achieved superior performance over traditional methods due to its excellent feature learning capabilities. However, most deep learning-based methods require training image sets to pre-train the SR network parameters. In this paper, we propose a new single image SR network that does not need any pre-training. The proposed network is optimized to achieve the SR reconstruction from only a low resolution observation rather than from training image sets, and it focuses on improving the visual quality of reconstructed images. Specifically, we design an attention-based encoder-decoder network for predicting the SR reconstruction, in which a residual spatial attention (RSA) unit is deployed in each layer of the decoder to capture key information. Moreover, we adopt a perceptual metric, consisting of an L1 metric and a multi-scale structural similarity (MSSSIM) metric, to learn the network parameters. Different from the conventional mean squared error (MSE) metric, the perceptual metric coincides well with the perceptual characteristics of the human visual system. Under the guidance of the perceptual metric, the RSA units are capable of predicting the visually sensitive areas at different scales, so the proposed network can pay more attention to these areas and preserve visually informative structures at multiple scales. Experimental results on the Set5 and Set14 image sets demonstrate that the combination of the perceptual metric and the RSA units significantly improves reconstruction quality. In terms of PSNR and structural similarity (SSIM) values, the proposed method achieves better reconstruction results than the related works, and it is even comparable to some pre-trained networks.


Introduction
Single image super-resolution (SISR) aims to generate a high resolution (HR) image from a single low resolution (LR) image, and it has been used for a variety of vision-related tasks, such as remote sensing and imaging [1], medical imaging [2], and image enhancement. A variety of SISR methods have been proposed, including prediction-based methods [3], edge-based methods [4], statistical methods [5], patch-based methods [6], sparse representation methods [7], etc. These methods rely primarily on pre-defined prior models to represent the underlying HR image, and they are recognized as model-driven reconstruction methods. With the rapid development of deep learning technology, deep networks, especially convolutional neural networks (CNNs), have been widely used for image generation [8] and super-resolution (SR) reconstruction [9], due to their superior performance over model-driven methods. Their main idea is to train deep networks to learn the inverse reconstruction mapping from LR images to HR images [10]. Although deep learning-based methods achieve good reconstruction quality, they are data driven, and a large set of pair-wise LR and HR images is required to pre-train the network parameters.

Related Work
In recent years, deep convolutional neural networks (CNNs) have been widely used for image generation and SISR, and they have effectively improved the quality of reconstructed images [14]. Dong et al. first exploited a convolutional neural network, named SRCNN, to perform SISR reconstruction [15]. In order to enrich the network capacity, some follow-up methods, such as VDSR [16] and IRCNN [17], continued to increase the network depth by stacking more convolutional layers. However, while deeper networks bring performance improvements, they require more image samples to train well. Kim et al. proposed a deeply-recursive convolutional network (DRCN) for SR reconstruction [18], in which a recursive layer (up to 16 recursions) is used to increase the network depth without introducing new network parameters. These methods need to first upscale the LR image into an interpolated image with the same resolution as the HR image, and then feed it into the SR network. However, some useful information may be lost during the interpolation operation, and convolution operations in the HR space increase the computational complexity.
Some other studies advocated learning the end-to-end mapping directly from the original LR image to the HR image. Reference [19] used multiple deconvolution layers to upscale the resolution of feature maps until it matches that of the HR image. In Reference [20], Shi et al. proposed an efficient sub-pixel convolution layer for upscaling image resolution. Subsequently, EDSR [21] and SRResNet [22] employed sub-pixel convolutions to increase the resolution of the network output, with residual blocks used to learn the reconstruction mapping. In order to capture multi-scale structures of images, Reference [23] proposed a Laplacian pyramid SR network (LapSRN), which progressively reconstructs the sub-band residuals of the HR image at multiple pyramid scales, and its reconstruction performance is better than that of SRCNN [15], VDSR [16], and DRCN [18]. Moreover, LapSRN can produce multi-scale SR images (e.g., 2×, 4×, and 8×) within a single feed-forward pass.
All the above-mentioned methods learn the SR mapping in a supervised way. Although they can produce promising reconstruction results, a large set of image pairs, consisting of LR images and the corresponding ground-truth HR images, is required to pre-train the network parameters, which limits the applicability of these methods in practical scenarios. In some practical problems, real HR images are not easily collected or are even unavailable [24]. At the same time, if the statistical characteristics of test images deviate significantly from those of the training images, the reconstruction quality will degrade [25]. The recent work DIP [11] is a parametric network for image representation that does not require pre-training on a large image set. The motivation behind DIP is that the convolutional network itself acts as a good prior on image structures, and the network parameters can be optimized to represent a single instance under a given observation model. DIP provides a new approach to single image SR. We build upon this model and propose an unsupervised single-image SR network. Different from Reference [11], which maximizes the PSNR metric, our network focuses on improving the perceptual quality of the reconstructed image. To this end, we design a perceptual metric guided deep attention network (PM-DAN). The details of our network are presented in the next section.

Perceptual Metric Guided Deep Attention Network
Figure 1 shows the proposed PM-DAN for single image SR. Similar to DIP [11], we take the output $f_\Theta(z)$ of a parametric generator $G$ to represent the unknown HR image $x_h \in \mathbb{R}^{C \times H \times W}$, where the random noise tensor $z \in \mathbb{R}^{C' \times H \times W}$ is the network input and $\Theta$ denotes the network parameters. $z$ has the same spatial resolution as the network output $x_h$. $C$ is the channel number of $x_h$, and it is set to 3 for color images. In supervised learning, the network parameters are usually learned from a training set under an objective function that minimizes the mean reconstruction error. Unlike previous work, we optimize the network parameters according to the image resolution degradation model so that the generator output $x_h = f_\Theta(z)$ matches the given LR image $x_l$. The objective function of network learning is formulated as

$$\hat{\Theta} = \arg\min_{\Theta} L_P\big(D f_\Theta(z),\, x_l\big), \qquad L_P = \alpha L_{MAE} + (1 - \alpha) L_{MSSSIM}, \tag{1}$$

where $D$ is the down-sampling operator for image resolution degradation, $L_P$ is the perceptual metric consisting of the mean absolute error metric $L_{MAE}$ and the multi-scale structural similarity metric $L_{MSSSIM}$ [13], and $\alpha \in (0, 1)$ is the regularization weight. The weights $\Theta$ are learned to minimize the perceptual metric for the specific LR image $x_l$, thereby boosting the visual quality of the reconstructed image.
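To make the optimization in Equation (1) concrete, the following PyTorch sketch fits the generator to a single LR observation. The function, its arguments, and the bicubic stand-in for the down-sampling operator $D$ are our illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical single-image fitting loop for Equation (1); `net` stands for
# the generator G and `loss_fn` for the perceptual metric L_P.
def fit_single_image(net, loss_fn, x_l, scale=4, iters=2000, lr=1e-3):
    _, _, h, w = x_l.shape
    # Noise input z: 64 channels, same spatial size as the HR output.
    z = torch.randn(1, 64, h * scale, w * scale, device=x_l.device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        optimizer.zero_grad()
        x_h = net(z)                                      # x_h = f_Theta(z)
        y = F.interpolate(x_h, scale_factor=1.0 / scale,  # y = D x_h
                          mode='bicubic', align_corners=False)
        loss = loss_fn(y, x_l)                            # L_P(D f_Theta(z), x_l)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return net(z)                                     # final SR estimate
```

Note that, unlike supervised training, nothing here iterates over a dataset: the only supervision signal is the single LR image $x_l$.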

Network Architecture
As shown in the bottom half of Figure 1, our generator network G has an attention-based encoder-decoder architecture consisting of three types of modules, namely a down-scale module, a skip connection module, and an up-scale module. The detailed configurations of G are listed in Table 1. The down-scale module in the encoder extracts multi-scale features, the skip connection module delivers feature maps from the encoder to the decoder via convolution and concatenation operations, and the up-scale module in the decoder conducts reconstruction at different scales. Each convolutional layer in these modules is coupled with batch normalization (BN) and a LeakyReLU (0.2) activation, and the kernel size of the convolutional layers is set to 3 × 3. Different from Reference [11], we enhance the up-scale module by inserting two residual spatial attention (RSA) units. Under the guidance of the perceptual metric, the predicted spatial attention maps are expected to highlight areas with rich visually sensitive structures, so that the up-scale module can pay more attention to informative features at different scales during reconstruction.
The inner structure of the RSA unit is shown in Figure 2. Motivated by References [26,27], RSA adopts a residual learning mechanism: the output of RSA is computed as the sum of the input and the input masked by the predicted attention map. The mathematical formulation of RSA is

$$X_c = W_2 * \delta(W_1 * F_{c-1}), \qquad F_c = F_{c-1} + f_c(X_c) \odot F_{c-1}, \tag{2}$$

where $*$ represents the convolution operation, $F_{c-1}$ is the input of RSA, $X_c$ is an intermediate result computed from $F_{c-1}$ through the operation flow of convolution $W_1$, ReLU activation $\delta$ [28], and convolution $W_2$, $f_c(\cdot)$ predicts the spatial attention map from $X_c$, $\odot$ is the point-wise multiplication, and $F_c$ is the final output of RSA. The spatial attention map $f_c(X_c)$ is computed as

$$f_c(X_c) = W_d * X_c, \tag{3}$$

where $W_d$ is a 3 × 3 dilated convolution [29] with a dilation rate of 3, and $f_c(X_c)$ is the resulting single-channel attention map. By enlarging the receptive field through the dilated convolution, a larger range of information can be used to predict the responses in the attention map. Owing to the residual link, cascading two RSA units does not attenuate the response values in the feature map.
In this way, the RSA units not only increase the depth of the network but also enable the network to focus on important features, thereby improving the quality of the reconstructed image.
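A minimal PyTorch sketch of an RSA unit following Equations (2) and (3) is given below. The channel count and padding choices are ours; the text specifies the 3 × 3 kernels and the dilation rate of 3, but does not state whether the attention map is additionally normalized (e.g., by a sigmoid), so the sketch applies it raw.

```python
import torch
import torch.nn as nn

class RSA(nn.Module):
    """Sketch of a residual spatial attention unit per Equations (2)-(3)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)   # W_1
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)   # W_2
        # W_d: 3x3 dilated convolution (dilation 3) -> single-channel map.
        self.conv_d = nn.Conv2d(channels, 1, 3, padding=3, dilation=3)
        self.relu = nn.ReLU(inplace=True)                          # delta

    def forward(self, f_prev):
        x_c = self.conv2(self.relu(self.conv1(f_prev)))  # intermediate X_c
        attn = self.conv_d(x_c)                          # f_c(X_c), Eq. (3)
        # Residual link: input plus input masked by the attention map, Eq. (2).
        return f_prev + attn * f_prev
```

Because the single-channel map is broadcast over all channels, the unit re-weights spatial locations rather than individual feature channels, which matches the "spatial attention" naming.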

Loss Function
According to the study in Reference [13], we take the perceptual metric defined in Equation (1) as the loss layer to drive the learning of our attention-based network, thereby preserving the visually sensitive structures in the HR image. The first loss term $L_{MAE}$ in $L_P$ is the $l_1$ norm, which sums the absolute errors over all pixels $p$:

$$L_{MAE} = \frac{1}{N} \sum_{p} \big| x_l(p) - y(p) \big|, \tag{4}$$

where $x_l(p)$ is the pixel value of $x_l$ at position $p$, $N$ is the total number of pixels in $x_l$, and $y = D x_h$ is the image down-sampled from $x_h$. The second loss term in $L_P$ exploits the MSSSIM metric [12] to measure the reconstruction error between $x_l$ and $y$. MSSSIM is a multi-scale generalization of the SSIM metric. Before introducing the mathematical formula of MSSSIM, we first give the definition of the SSIM metric:

$$SSIM(p) = l(p) \cdot cs(p) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} \cdot \frac{2\sigma_{xy} + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}. \tag{5}$$

By iteratively filtering and down-sampling the input image $M-1$ times, we obtain $M$ scales of the input image; accordingly, MSSSIM measures structural similarity by combining the measurements over the $M$ scales:

$$MSSSIM = l_M \cdot \prod_{j=1}^{M} cs_j. \tag{6}$$

The loss $L_{MSSSIM}$ is then set to 1 minus the MSSSIM metric:

$$L_{MSSSIM} = 1 - MSSSIM. \tag{7}$$

In Equations (5) and (6), $l_j$ is the luminance comparison term and $cs_j$ is the combined contrast and structure comparison term at scale $j = 1, \ldots, M$; $\mu_x$ and $\sigma_x$ represent the mean and standard deviation of the patch $P$ centered at a pixel $p$ of $x_l$, $\mu_y$ and $\sigma_y$ are the corresponding mean and standard deviation of $y$ at the pixel $p$, $\sigma_{xy}$ denotes the covariance of $x_l$ and $y$, and $C_1$, $C_2$ are small positive constants that avoid division by zero. The means and standard deviations associated with the patch $P$ are calculated by convolution with a Gaussian kernel $G_\sigma$ with standard deviation $\sigma$. The subscript $p$ is omitted in the MSSSIM metric for simplicity. $N$ is the total number of patches produced by sliding the patch window over the whole image $y$.
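For reference, here is a compact PyTorch sketch of the perceptual metric $L_P$. The 11 × 11 Gaussian window, the three-scale pyramid, average pooling as the filtering step, and uniform scale weights are simplifying assumptions relative to Equations (4)-(7), not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=11, sigma=1.5, channels=3, device='cpu'):
    # 2-D Gaussian window G_sigma used for the local means and variances.
    coords = torch.arange(size, dtype=torch.float32, device=device) - size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = (g / g.sum()).outer(g / g.sum())
    return g.expand(channels, 1, size, size).contiguous()

def ssim_terms(x, y, kernel, c1=0.01 ** 2, c2=0.03 ** 2):
    # Local statistics via depthwise convolution with the Gaussian window.
    ch = x.shape[1]
    mu_x = F.conv2d(x, kernel, groups=ch)
    mu_y = F.conv2d(y, kernel, groups=ch)
    var_x = F.conv2d(x * x, kernel, groups=ch) - mu_x ** 2
    var_y = F.conv2d(y * y, kernel, groups=ch) - mu_y ** 2
    cov = F.conv2d(x * y, kernel, groups=ch) - mu_x * mu_y
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)   # luminance
    cs = (2 * cov + c2) / (var_x + var_y + c2)                  # contrast-structure
    return l.mean(), cs.mean()

def perceptual_loss(y, x_l, alpha=0.16, scales=3):
    # L_P = alpha * L_MAE + (1 - alpha) * L_MSSSIM, cf. Equations (1), (4), (7).
    kernel = gaussian_kernel(channels=y.shape[1], device=y.device)
    a, b = y, x_l
    msssim = torch.tensor(1.0, device=y.device)
    for _ in range(scales - 1):            # cs terms at the finer scales
        _, cs = ssim_terms(a, b, kernel)
        msssim = msssim * cs
        a = F.avg_pool2d(a, 2)             # iterative filtering + down-sampling
        b = F.avg_pool2d(b, 2)
    l, cs = ssim_terms(a, b, kernel)       # luminance kept at the coarsest scale
    msssim = msssim * l * cs
    l_mae = (y - x_l).abs().mean()         # L_MAE, Equation (4)
    return alpha * l_mae + (1 - alpha) * (1.0 - msssim)
```

This `perceptual_loss` has the signature expected by the `fit_single_image` sketch given after Equation (1).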
In order to propagate the reconstruction error from the loss layer to the previous layers, we first define the derivative of the $L_P$ loss. Specifically, the derivative of $L_{MAE}$ for back-propagation can be calculated as

$$\frac{\partial L_{MAE}}{\partial x_h} = -\frac{1}{N} D^{T} \operatorname{sign}\big(x_l - D x_h\big), \tag{8}$$

where $D^T$ is the transpose of the down-sampling matrix $D$. The calculation of MSSSIM for each patch $P$ involves the neighborhood pixels of the pixel $p$. According to the chain rule, we need to calculate the derivative of $L_{MSSSIM}(P)$ at the pixel $p$ with respect to all the other pixels $q$ in the patch $P$:

$$\frac{\partial L_{MSSSIM}(P)}{\partial y(q)} = -\left( \frac{\partial l_M(p)}{\partial y(q)} + l_M(p) \sum_{j=1}^{M} \frac{1}{cs_j(p)} \frac{\partial cs_j(p)}{\partial y(q)} \right) \prod_{j=1}^{M} cs_j(p), \tag{9}$$

where $l(p)$ and $cs(p)$ correspond to the luminance comparison and the combined contrast and structure comparison at the pixel $p$, namely the first and second terms of Equation (5). Their derivation details can be found in the supplementary material of Reference [13]. The derivatives of the perceptual metric $L_P$ can hence be calculated simply as the weighted sum of the derivatives of $L_{MAE}$ and $L_{MSSSIM}$ according to Equations (8) and (9). The Adam algorithm is then used to minimize $L_P$, and the optimal network parameters are found for reconstruction. Different from the supervised learning over a given training set in Reference [13], our network is optimized for SR reconstruction from only a given LR observation.
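In an automatic-differentiation framework such as PyTorch, these derivatives need not be coded by hand. The toy check below, with our own illustrative 1-D operators, verifies that autograd reproduces the hand-derived $L_{MAE}$ gradient of Equation (8).

```python
import torch

# Toy 1-D check of Equation (8): dL_MAE/dx_h = -(1/N) * D^T sign(x_l - D x_h).
N, M = 4, 8
D = torch.zeros(N, M)
D[torch.arange(N), 2 * torch.arange(N)] = 1.0   # simple 2x decimation matrix

x_h = torch.randn(M, requires_grad=True)
x_l = torch.randn(N)

loss = (x_l - D @ x_h).abs().mean()             # L_MAE, Equation (4)
loss.backward()

with torch.no_grad():
    manual = -(1.0 / N) * D.t() @ torch.sign(x_l - D @ x_h)
print(torch.allclose(x_h.grad, manual))          # True
```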

Experimental Results and Analysis
We conduct experiments on the Set5 [30], Set14 [31], and two images from the Internet to validate the performance of the proposed PM-DAN. The heights and widths of the full-resolution images in these two datasets range from 228 to 768 pixels. First, we test the impact of the hyper-parameters (the weight α in the perceptual metric and the iteration number in network learning) on the reconstruction results. Then, ablation studies are conducted to verify whether the attention-based network and the perceptual metric are beneficial for SR reconstruction. Finally, PM-DAN is compared quantitatively and qualitatively with bicubic interpolation, DIP [11], SRCNN [15], and LapSRN [23]. PSNR and SSIM are used as the quantitative metrics of reconstruction quality. The source codes of DIP, SRCNN, and LapSRN were downloaded from the websites provided by their authors, and their parameters are kept at the default values in the source code. The proposed PM-DAN is implemented in the PyTorch [32] framework and runs on an NVIDIA RTX 2080 GPU. The channel number of the input random noise tensor z is set to 64. Adam [33] is used for network learning, and the learning rate is set to 0.001.

Parameters Analysis
The weight α in the perceptual metric. α balances the importance of the $L_{MAE}$ and $L_{MSSSIM}$ losses in the perceptual metric. We uniformly sample α with a step of 0.05 over the range 0 to 1. Figure 3 shows the curves of the mean PSNR and SSIM values versus α upon three images from the Set14 [31] in the case of 4× super-resolution. When α is approximately 0.16, the proposed method achieves the best PSNR and SSIM values. This implies that $L_P$ improves reconstruction quality more than either $L_{MAE}$ or $L_{MSSSIM}$ alone, and it verifies the rationality of combining $L_{MAE}$ and $L_{MSSSIM}$. Thus, α is set to 0.16 in the subsequent experiments.
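This sweep can be scripted as in the sketch below, which reuses the `fit_single_image` and `perceptual_loss` sketches given earlier; the PSNR helper and all names are our own illustrative assumptions.

```python
import torch

def psnr(a, b, max_val=1.0):
    # Peak signal-to-noise ratio for images scaled to [0, max_val].
    mse = torch.mean((a - b) ** 2)
    return 10 * torch.log10(max_val ** 2 / mse)

def sweep_alpha(net_factory, x_l, ground_truth, scale=4, iters=2000):
    """Grid search over alpha with a step of 0.05, scoring each run by PSNR.

    net_factory builds a fresh generator so every alpha starts from scratch.
    """
    results = {}
    for i in range(21):                    # alpha = 0.0, 0.05, ..., 1.0
        alpha = round(i * 0.05, 2)
        loss_fn = lambda y, t, a=alpha: perceptual_loss(y, t, alpha=a)
        sr = fit_single_image(net_factory(), loss_fn, x_l,
                              scale=scale, iters=iters)
        results[alpha] = psnr(sr, ground_truth).item()
    best = max(results, key=results.get)
    return best, results
```

Rebuilding the generator per run matters here: because PM-DAN is fitted to a single image, reusing weights across α values would let one setting leak into the next.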
The iteration number in network learning. Both PM-DAN and DIP use iterative optimization to generate HR images that match the LR observation as closely as possible, so the number of iterations affects the final result. Figure 4 presents the PSNR and SSIM curves of PM-DAN and DIP versus the iteration number upon the Zebra image from the Set14 in the case of 4× super-resolution. The maximum iteration number is set to 3000. Both the PSNR and SSIM curves of PM-DAN and DIP increase rapidly before 1500 iterations, then rise slowly up to 2000 iterations and saturate near 3000 iterations. Although the curves of PM-DAN and DIP follow a similar trend, PM-DAN achieves higher PSNR and SSIM values than DIP. Taking into account the trade-off between time complexity and reconstruction performance, we use 2000 iterations for both PM-DAN and DIP in the following experiments.

Ablation Studies
In this section, ablation studies are performed to verify the strengths of the RSA units and the perceptual metric in the proposed PM-DAN. In detail, we implement two simplified versions of PM-DAN, one without RSA units (PM-DAN w/o RSA) and one without the perceptual metric (PM-DAN w/o PL). We also compare PM-DAN and its two simplified versions with DIP, which can be regarded as a simplified PM-DAN without both the RSA units and the perceptual metric. Table 2 presents the 4× SR reconstruction results of these four algorithms upon the Set14, with the best PSNR and SSIM values highlighted in bold. The two simplified versions of PM-DAN both outperform DIP, which demonstrates the advantages of the RSA units and the perceptual metric for improving SR quality, although the perceptual metric alone yields only a marginal improvement. PM-DAN achieves the best reconstruction results in terms of both PSNR and SSIM. This reveals that the joint deployment of the RSA units and the perceptual metric reinforces their individual benefits and further improves reconstruction quality. Figure 5 shows the reconstructed HR images of Lenna and Man by PM-DAN and PM-DAN without the perceptual metric. When the MSE loss is used as the loss layer, the obtained SR image becomes blurry and many details are lost. Conversely, by utilizing the perceptual metric, the reconstructed SR images have sharp edges and contour structures. The zooming-in visualization of Lenna's hat and Man's face demonstrates the effectiveness of the perceptual metric for preserving image structures.
The SR images of Barbara and Comic by PM-DAN and PM-DAN without RSA units are shown in Figure 6, together with the multi-scale spatial attention maps predicted by the RSA units. The attention maps at different scales exhibit high response intensity in different areas, and the union of these high-response areas almost covers the entire image. As the scale of the attention map is progressively refined, the areas with high response intensity concentrate mainly on the flat regions and local structures of the image, which is consistent with the sensitivity characteristics of the HVS. Due to the contrast masking phenomenon of the HVS [34], reconstruction distortions in structural areas are more likely to be perceived than those in textured regions. With the aid of the RSA units, PM-DAN can accurately localize the visually sensitive areas at different scales, so the visually informative structures are preserved in the reconstructed image, especially in areas with high attention response. This also explains why the combination of the perceptual metric and the attention units produces better reconstruction results. Taking the Comic image as an example, the spatial attention map at the finest scale has high response strength in the area of the girl's chin, and accordingly the girl's chin is reconstructed with enhanced visual quality.

Performance Comparison
We compare the proposed PM-DAN with bicubic interpolation, DIP [11], SRCNN [15], and LapSRN [23]. DIP and PM-DAN do not require an image set to pre-train the model, while SRCNN uses a large training set consisting of 395,909 images from the ILSVRC2013 ImageNet detection training partition, and LapSRN employs 91 images from Reference [7] and 200 images from BSD200 [35] as the training data for learning the reconstruction mapping. The symbols T and NT denote the methods with and without pre-training, respectively. Tables 3-6 show the quantitative PSNR and SSIM values of these methods for 4× and 8× SR upon the Set5 and Set14, with the best values highlighted in bold. In all four experimental cases, the PSNR and SSIM values of the proposed PM-DAN are better than those of DIP. Moreover, PM-DAN achieves better PSNR and SSIM values than the pre-trained SRCNN for 4× SR upon the Set5 and Set14. PM-DAN also achieves results comparable to the pre-trained LapSRN, and even outperforms it in some cases, such as the average PSNR value for 8× SR upon the Set5 and the average SSIM value for 4× SR upon the Set14. Some 4× and 8× reconstructed images are shown in Figures 7 and 8, respectively, together with zoomed-in patches. With the aid of pre-training, the SR results of LapSRN show good visual quality. PM-DAN reconstructs the HR image with sharp structures and texture details, and has better visual quality than DIP; the brightness and color of the reconstructed image are also well preserved. In the case of 8× SR, PM-DAN can even recover clearer structures than LapSRN, such as the eyes in the Baby image and the spot textures in the Butterfly image.

We further evaluate our method by conducting SR experiments on two real images from the Internet, one remote sensing image and one landscape image. Figure 9 presents the 4× SR results of our method and DIP upon these two images; the resolutions of the corresponding 4× SR images are 864 × 576 and 1088 × 736, respectively. Compared to DIP, our method recovers sharper structures and more texture details. As shown in the zoomed-in patches of the remote sensing image, our method reconstructs more details in the house and wood areas. For the landscape image, the reconstruction of our method has good contrast and saturation, which makes it visually attractive.

Conclusions
In this paper, we proposed an unsupervised SR network named PM-DAN. An attention-based encoder-decoder network is designed to predict the SR reconstruction, in which residual spatial attention units are deployed in each decoding layer to concentrate on informative features for reconstruction. Meanwhile, the network is learned under the guidance of the perceptual metric, which has good potential for recovering visually sensitive structures. The experimental results demonstrate that PM-DAN effectively improves the visual quality of the SR image and outperforms DIP in terms of both PSNR and SSIM, even producing results comparable to the pre-trained LapSRN network. In future work, we plan to combine our model with appropriate domain-specific regularization to obtain better SR results.