Triple Attention Mixed Link Network for Single Image Super Resolution

Single image super resolution is an important low-level computer vision task. Recent approaches based on deep convolutional neural networks have achieved impressive performance. However, existing architectures are limited by less sophisticated structures and weaker representational power. In this work, to significantly enhance the feature representation, we propose the Triple Attention mixed link Network (TAN), which consists of 1) attention mechanisms over three different aspects (i.e., kernel, spatial and channel) and 2) a fusion of powerful residual and dense connections (i.e., mixed link). Specifically, the network learns multiple hierarchical representations with multiple kernels under different receptive fields. The output features are recalibrated by the effective kernel and channel attentions and fed into the next layer partly residually and partly densely, which filters the information and enables the network to learn more powerful representations. The features finally pass through the spatial attention in the reconstruction network, which generates a fusion of local and global information, letting the network restore more details and improving the quality of the reconstructed images. Thanks to the diverse feature recalibrations and the advanced information flow topology, our proposed model is strong enough to perform against the state-of-the-art methods on benchmark evaluations.


Introduction
Single image Super-Resolution (SISR) is an important low-level computer vision task with high practical value in many fields such as industrial inspection, medical imaging and security monitoring. SISR aims at recovering a high-resolution image from only one low-resolution image. For this ill-posed inverse problem, widely used interpolation methods cannot achieve visually pleasing results, and many learning-based methods (Yang et al. 2010; Timofte et al. 2014) have been proposed. In recent years, deep-learning based algorithms (Dong et al. 2016) have been developed which greatly improve super resolution quality, and image details are better preserved with these powerful deep networks.

The introduction of the attention mechanism further improves the performance of neural networks. SENet (Hu et al. 2017) and its derived super-resolution method (Cheng et al. 2018) focus on the attention between channels and have achieved good results in many tasks. Attention is not limited to channels: concurrent spatial and channel attention (Roy et al. 2018) has achieved better results in image semantic segmentation tasks. However, we found that in the super-resolution task, adding local attention as Roy et al. did does not improve image reconstruction performance, and may even decrease quality. It is therefore important to seek a spatial or other attention that works effectively for super-resolution tasks.

In order to recover more image details with super-resolution models, global residual learning is widely used, where the global residual represents the image details predicted by the neural network. Therefore, to help the network enhance the details, we propose a spatial attention on the global residual. A special attention module that works on multiple parallel convolution kernels is also introduced. With these attention mechanisms, the performance of the model is significantly improved.
In recent studies, residual networks with deeper structures or networks with dense connections have been used in image super-resolution (Tong et al. 2017; Zhang et al. 2018), both of which achieve good results. Wang et al. (2018) found the commonality of residual networks and dense networks, and proposed a Mixed Link structure that takes both into consideration, which improves performance and reduces the number of model parameters. We introduce the attention enhanced mixed link block (AE-MLB) in our study. Unlike ordinary Mixed Link networks, our proposed structure uses appropriate zero padding to keep the output size equal across layers, and we add attention to the channels between the connections of each layer. By introducing feature recalibration between channels, it enhances useful channel information and suppresses useless information. In addition, each logical layer contains two convolutional layers with different kernel sizes to gain different receptive fields, and a fuse kernel attention is introduced which mixes the outputs of these convolutional layers to further improve the performance of the network. We also introduce multi supervision when training the network: each AE-MLB outputs a high-resolution image and the loss between the output and the target is calculated, so that our model can stably output high-resolution images with high quality.

We summarize our contributions in the following points:
• We propose a novel triple attention mixed link network (TAN) model for single image super resolution. The proposed global spatial attention (SA) and fuse kernel attention (KA) significantly improve super resolution performance.
• We propose the attention enhanced mixed link block, which helps achieve better performance with fewer model parameters.
• Our model achieves state-of-the-art performance on several benchmark datasets.

Related Works
Single image super resolution has been a research hotspot in recent years. Deep-learning based methods show great improvement compared with conventional methods such as interpolation, anchored neighborhood regression (Timofte et al. 2014), self-exemplars (Huang et al. 2015) and methods based on sparse coding (Yang et al. 2010). SRCNN (Dong et al. 2016) first used convolutional neural networks to upsample images and achieved significant improvements, but its performance was limited by its shallow structure. To achieve higher performance, networks tend to be deeper and deeper: Kim et al. proposed the VDSR (Kim et al. 2016) model with a deeper structure, and recursive supervision and residual models were introduced to make deep models trainable. In recent years, some very deep models have been proposed, such as EDSR and MDSR (Lim et al. 2017), which achieve very pleasing performance on super-resolution tasks. In addition, super-resolution models integrated with dense connections have been proposed, such as SRDenseNet (Tong et al. 2017) and MemNet, which can effectively utilize different levels of features. Later, the Residual Dense Network (RDN) (Zhang et al. 2018) was proposed, which makes more use of hierarchical features; its structure also effectively controls parameter growth and makes large-scale models trainable. In terms of the reconstruction network, models have gradually adopted deconvolution and the efficient sub-pixel shuffle (Shi et al. 2017; Lai et al. 2017) to replace the traditional pre-interpolation process, which reduces computational complexity and further improves model performance.
The above methods show impressive super resolution performance; however, their structures are complex and very deep. To achieve higher performance, attention is another key factor besides the scale and complexity of the network structure. Moreover, the above models do not make sufficient use of hierarchical information. Feature extraction from an LR image often requires different receptive fields and effective recalibration and fusion of the features extracted from different subfields, which was often overlooked in previous super-resolution models. To address these problems, we propose a triple attention network with a mixed link structure for the single image super resolution task, and we introduce the proposed TAN in detail in the next section.

Proposed Method
Overall model structure

As shown in Fig.1, our proposed triple attention mixed link network contains three basic parts: a shallow feature extractor (SFENet), attention enhanced mixed link blocks (AE-MLBs) and a reconstruction network with multi supervision. The SFENet contains two convolution layers to extract shallow features. Low resolution images are fed directly into the network and divided into two branches: one is input to the upscale module after the first convolution layer in SFENet, and the other passes through the second convolution layer and into the AE-MLBs to predict the details. The reconstruction network combines the upscaled image with the predicted details to generate the high-resolution image.

Fig. 2 Comparison of different network structures
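The two-branch flow described above can be sketched in PyTorch. This is a minimal illustration, not the authors' exact implementation: the module names, the stand-in body for the AE-MLB chain, and the single-channel (Y) input are our assumptions.

```python
import torch
import torch.nn as nn

class TwoBranchSR(nn.Module):
    """Sketch of the overall flow: a shallow feature extractor feeds an
    upscale branch and a detail-prediction branch, whose outputs are summed."""
    def __init__(self, channels=64, scale=2):
        super().__init__()
        self.conv1 = nn.Conv2d(1, channels, 3, padding=1)          # SFENet layer 1
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)   # SFENet layer 2
        # stand-in for the AE-MLB chain that predicts the residual details
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        # both branches are upscaled with sub-pixel shuffle
        self.up_base = nn.Sequential(
            nn.Conv2d(channels, scale ** 2, 3, padding=1), nn.PixelShuffle(scale))
        self.up_detail = nn.Sequential(
            nn.Conv2d(channels, scale ** 2, 3, padding=1), nn.PixelShuffle(scale))

    def forward(self, lr):
        shallow = self.conv1(lr)
        base = self.up_base(shallow)                               # coarse upscaled image
        details = self.up_detail(self.body(self.conv2(shallow)))   # predicted details
        return base + details                                      # final reconstruction

sr = TwoBranchSR()
out = sr(torch.randn(1, 1, 24, 24))   # a 24x24 Y-channel patch upscaled x2
```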

Mixed Link Connections
As shown in Fig.1 (d), the operator M denotes the mixed link connection, which can be calculated as formulas 1-4. The operation can be divided into three parts. The first part slices the input channels into two equal halves, where S(·) denotes the slice operation in formula 1.
The final step is shown as formula 4, where C(·) denotes the concatenate operation; this enables the network to be a partial residual network and a partial dense network.
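The mixed link operator can be sketched as follows. Since formulas 1-4 are not reproduced here, the exact split of the convolution output between the residual (additive) and dense (concatenated) paths is our assumption; only the slice-then-add-and-concatenate pattern comes from the text.

```python
import torch
import torch.nn as nn

class MixedLink(nn.Module):
    """Illustrative mixed link connection: the input is sliced into two equal
    halves; half of the layer's features are added to one slice (residual/inner
    link) and the rest are concatenated (dense/outer link)."""
    def __init__(self, channels=64, growth=32):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels // 2 + growth, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x):
        c = x.size(1)
        x1, x2 = x[:, : c // 2], x[:, c // 2:]          # S(.): slice into two halves
        f = self.f(x)
        f_res, f_dense = f[:, : c // 2], f[:, c // 2:]  # split the new features
        # C(.): residual path (x1 + f_res) and dense path (x2, f_dense) concatenated
        return torch.cat([x1 + f_res, x2, f_dense], dim=1)
```

With `channels=64` and `growth=32`, each application grows the feature map by 32 channels, mirroring how dense connections accumulate features while the residual half keeps addition cheap.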

Channel attention
Channel attention helps the network model and select information among channels, which is also called feature recalibration. This has been proven effective for improving model performance in image recognition and restoration. As shown in Fig.3 (a), the channel attention module consists of one global average pooling layer, which squeezes the features spatially to capture global information among channels, followed by two 1x1 convolution layers named ConvD and ConvU that form a bottleneck. Finally, a Sigmoid activation layer σ maps the information into [0, 1], and the output is used to reweight the original features, generating a self-learned channel-wise attention. The process of channel attention can be calculated as the following formulas:

z_c = F_sq(x_c) = 1/(H × W) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j)    (5)

x̂_c = σ(ConvU(ConvD(z)))_c · x_c    (6)

where F_sq(·) represents the spatial squeeze with global average pooling, H and W denote the height and width of the feature map, and x_c means channel c of the input feature map x.
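The channel attention module described above translates almost directly into PyTorch; a minimal sketch (the reduction ratio of the bottleneck is our assumption):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: global average pooling (spatial squeeze), a
    ConvD/ConvU 1x1 bottleneck, and a sigmoid gate that reweights channels."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                       # (B, C, 1, 1)
        self.conv_d = nn.Conv2d(channels, channels // reduction, 1)  # ConvD
        self.conv_u = nn.Conv2d(channels // reduction, channels, 1)  # ConvU
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        s = self.sigmoid(self.conv_u(self.relu(self.conv_d(self.pool(x)))))
        return x * s   # reweight each channel by its learned attention weight
```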

Kernel Attention
Different sizes of convolution kernels provide different receptive fields and extract different features, and many networks utilize multiple kernel sizes to improve performance (Szegedy et al. 2017). Therefore, to improve super-resolution capability, we use 3x3 and 5x5 convolution kernels, with zero padding of 1 and 2 respectively, to ensure that the feature map size of each layer is equal. In addition, we use kernel attention to perform feature recalibration on the channels output from layers with different kernels. Kernel attention is in fact a special channel-wise attention; its structure is shown in Fig.3 (b), and the operation process can be derived from formulas 7 and 8.
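A sketch of the fuse kernel attention follows. The two branches and their paddings come from the text; since formulas 7 and 8 are not reproduced here, the specific gating (a channel-wise convex combination of the two branches) is our assumption.

```python
import torch
import torch.nn as nn

class KernelAttention(nn.Module):
    """Illustrative fuse kernel attention: 3x3 and 5x5 branches (paddings 1
    and 2 keep feature sizes equal) mixed by a learned channel-wise gate."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        f3, f5 = self.conv3(x), self.conv5(x)
        a = self.fc(f3 + f5)           # channel-wise mixing weights in [0, 1]
        return a * f3 + (1 - a) * f5   # convex combination of the two branches
```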

Global spatial attention
In addition to channel attention, the spatial attention model has proven effective in segmentation tasks (Roy et al. 2018). However, we found that it does not improve performance on the super-resolution task: we tried a variety of schemes incorporating spatial attention, and all of them caused a slight drop in model performance. We attribute this phenomenon to local spatial attention focusing only on local information, so it cannot play the role of filtering effective global information. To solve this problem, a fusion of global and local information is needed. The reconstruction part of our super-resolution model adopts the strategy of global residual learning. The number of global residual channels is increased to 2 times the original: half of the channels are reweighted by global information, while the other half retains the local information. The two halves are then summed and averaged to achieve global and local information fusion. This process can be seen in Fig.4. Finally, the sub-pixel shuffle is used for upsampling, and the output of each block is weighted and fused:

F_1, F_2 = S(Conv(F))    (9)

I_HR = Σ 1/2 (F_1 + F_2 · σ(G(F))) + U(I_LR)    (10)

where S(·) denotes the slice operation, G(·) generates the global attention weights, σ is the sigmoid activation, U(I_LR) is the upscaled low resolution input, and the sum runs over the block outputs.
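One way to read the residual-branch fusion described above is sketched below. The channel doubling, the half-and-half split, the averaging, and the sub-pixel upsampling come from the text; the design of the global gate itself is our assumption.

```python
import torch
import torch.nn as nn

class GlobalSpatialAttention(nn.Module):
    """Sketch of the global spatial attention on the residual branch: expand
    to 2x channels, gate one half with globally pooled information, average
    the halves, then upsample with sub-pixel shuffle."""
    def __init__(self, channels=64, scale=2):
        super().__init__()
        self.expand = nn.Conv2d(channels, 2 * channels, 3, padding=1)
        self.gate = nn.Sequential(                 # global information weighting
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.up = nn.Sequential(                   # sub-pixel shuffle upsampling
            nn.Conv2d(channels, scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, f):
        f1, f2 = self.expand(f).chunk(2, dim=1)    # local half / global half
        fused = 0.5 * (f1 + f2 * self.gate(f2))    # average local and gated global
        return self.up(fused)
```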

Multi Supervise and Loss Function
Multi supervision is used during the training process. For each AE-MLB, a high-resolution image is generated and its loss is calculated. Finally, the loss values of all blocks are added and the arithmetic mean is taken as the overall loss:

L = 1/N Σ_{n=1}^{N} ℓ(I_HR^n, I_GT)    (11)

where N is the number of AE-MLBs, I_HR^n is the high-resolution output of the n-th block, I_GT is the ground truth and ℓ is the per-block loss.

All RGB images were converted to YCbCr color space and we selected the Y channel for training and testing the super resolution model. For higher scale factors, we first train the model with a lower factor and then fine-tune it at the higher factor from the pretrained checkpoint. We used PyTorch 0.4.0 as the deep-learning framework and a server with NVIDIA Tesla P40 GPUs as the training setup. A data-parallel method was also utilized: we divided the training batch equally across 5 GPUs, which greatly accelerates model training. In this work, we perform self-ensemble (Lim et al. 2017) to gain higher performance: we flip and rotate each test image to augment 7 images from the original, input these images into the network, perform the inverse transform on the output high resolution images, and average all of them to get the final high-resolution output.

Fig.4 Visual comparison for details of reconstructed images
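The averaged multi-supervision loss can be written compactly; the choice of L1 as the per-block loss is our assumption, as the text only specifies the arithmetic mean over blocks (formula 11).

```python
import torch
import torch.nn.functional as F

def multi_supervise_loss(block_outputs, target):
    """Arithmetic mean of the per-block losses: each AE-MLB produces an HR
    prediction and the overall loss averages the per-block losses.
    (L1 per-block loss is an assumption for illustration.)"""
    losses = [F.l1_loss(out, target) for out in block_outputs]
    return sum(losses) / len(losses)

# Example: two block outputs against a zero target.
outs = [torch.zeros(1, 1, 4, 4), torch.ones(1, 1, 4, 4)]
loss = multi_supervise_loss(outs, torch.zeros(1, 1, 4, 4))  # mean of 0.0 and 1.0
```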

Compare with state-of-the-art methods
We used peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) (Wang et al. 2004) as image quality metrics, and compared the results generated by our proposed TAN with Bicubic, A+ (Timofte et al. 2014) and deep-learning based super resolution methods including SRCNN (Dong et al. 2016), LapSRN (Lai et al. 2017), MemNet, EDSR (Lim et al. 2017) and RDN (Zhang et al. 2018). Table 1 shows the quantitative results on Set5, Set14, BSD100 and Urban100 with scale factors x2 and x4; the best result is marked in red and the second best in blue. Our proposed method performs against the other methods on these datasets in the x2 task and also achieves favorable results in the x4 task. We also illustrate a visual quality comparison of the reconstructed details on the BSD100 and Urban100 datasets. As shown in Fig.4, the red rectangle marks where the sub-image was taken from, and the ground truth is located at the bottom right and marked with a red edge. The first two images are img_052 and img_092 from the BSD100 dataset and the third is img_005 from the Urban100 dataset. The textures of these images are reconstructed clearly with our proposed TAN, while images generated by the other methods are blurred, distorted, or contain errors in the details. This illustrates that our proposed method can generate high resolution images with more accurate details.
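For reference, PSNR as used in this comparison is computed from the mean squared error between the two images (typically on the Y channel, as in our evaluation setup):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```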

Model Parameter
We also compared the parameter count of our proposed model with the state of the art; the results of scale x2 super resolution on Set5 are shown in Fig.5, where the x axis shows the number of parameters in millions and the y axis shows the PSNR result. Our proposed method achieves clearly higher results than recursive models (MemNet), models with a small number of convolutional layers (LapSRN (Lai et al. 2017)) and models with an encoder-decoder framework (RED30 (Mao et al. 2016)). Also, our proposed TAN performs against the state of the art among the large-scale networks with only 7.2 million model parameters, which is relatively lightweight: 83.3% less than EDSR (Lim et al. 2017) (43M), 67.3% less than RDN (Zhang et al. 2018) (22M) and 28% less than DBPN.

Super-resolving real-world images
In this section, we conduct an experiment on a real-world image. The image was compressed with JPEG, with an unknown degradation model and no high resolution ground truth. As shown in Fig.6, we compare our result with bicubic interpolation, SRCNN (Dong et al. 2016) and LapSRN (Lai et al. 2017); our proposed TAN generates the clearest and sharpest details and edges among these methods.

Model structure
In this section we study the effect of network structure on the performance of super resolution models. We define a baseline model with 6 mixed link blocks and no attention modules. We then replaced the mixed link connections with pure concatenation and with skip connections to create a dense network and a residual network, respectively. We trained the models on the DIV2K dataset and compared the performance of these network structures; the results are shown in Table 2. The experimental results show that the Mixed Link structure achieves higher performance with fewer parameters.

Attention module
In this section we study the attention modules introduced in our network. We first train models with only one type of attention module and then with different combinations of these modules. There are six different combinations, and the results for scale factor x2 on three test datasets are shown in Table 4, where the first line shows the PSNR score and the second line shows the improvement. Baseline means a pure mixed link network with no attention module; CA denotes the channel attention, KA the fuse kernel attention and SA the global spatial attention. There is an obvious increase in performance when adding these attention modules to the baseline network across the three testing datasets: on Set14 there is at most a 0.14dB gain with one attention module, 0.42dB with two modules and 0.57dB with all three attention modules.

Study the number of AE-MLBs
The number of attention enhanced mixed link blocks directly affects the scale and depth of the network and determines the model's parameter count and super resolution performance. We studied the effect of the number of blocks; the PSNR for scale x2 on Set5, Set14 and BSD100 is shown in Table 4. The experimental results show that performance improves with more blocks. Our proposed TAN allows training deeper networks, which enables our model to capture more information from the images and predict more accurate details, reconstructing high resolution images with favorable quality.

Conclusion
In this work, we propose a novel single image super resolution method which utilizes mixed link connections and three different attentions, namely channel attention, fuse kernel attention and global spatial attention; we name the model the triple attention mixed link network (TAN). The mixed link structure helps the network gain stronger representation ability and proves more powerful than residual or dense networks. Moreover, the attention mechanisms give an impressive improvement in performance: the channel attention (CA) recalibrates the information among channels; the fuse kernel attention (KA) fuses the feature outputs of layers with different kernels, giving the model different receptive fields; and the global spatial attention (SA) mixes the information from the local and global parts of the channels, which greatly improves the reconstruction network. With these attention modules, the proposed attention enhanced mixed link block (AE-MLB) serves as the basic unit to build the whole network. Thanks to the sophisticated network structure and effective attention mechanisms, our model performs against the state of the art on several benchmark evaluations.