Attention Network with Information Distillation for Super-Resolution

Resolution is an intuitive measure of the visual quality of images, and it is limited by physical devices. Recently, image super-resolution (SR) models based on deep convolutional neural networks (CNNs) have made significant progress. However, most existing SR models incur computational costs that grow with network depth, hindering practical application. In addition, these models treat intermediate features equally and rarely explore the discriminative capacity hidden in their abundant features. To tackle these issues, we propose an attention network with information distillation (AIDN) for efficient and accurate image super-resolution, which adaptively modulates the feature responses by modeling the interactions between channel-wise and spatial features. Specifically, gated channel transformation (GCT) is introduced to gather global contextual information among different channels to modulate intermediate high-level features. Moreover, a recalibrated attention module (RAM) is proposed to rescale these feature responses and concentrate on the essential contents around spatial locations. Because the gated channel transformation and spatial information masks work jointly, the proposed AIDN gains a stronger ability to identify informative features, which improves computational efficiency as well as reconstruction accuracy. Comprehensive quantitative and qualitative evaluations demonstrate that our AIDN outperforms state-of-the-art models in terms of reconstruction performance and visual quality.


Introduction
The resolution of an image is restricted by the imaging sensor, which limits the visual quality that can be captured directly. Single-image super-resolution (SISR) is a typical low-level problem in computer vision, which aims to restore an accurate high-resolution (HR) image from a degraded low-resolution (LR) observation. It has been widely used in various important fields involving the development of multimedia technology [1], such as remote-sensing imaging, live video [2], and monitoring devices. However, image super-resolution is still a challenging topic because it is ill-posed: multiple HR images can correspond to the same LR image. To tackle this difficulty, many approaches based on deep convolutional neural networks (CNNs) have been proposed to establish LR-HR image mappings, and they have achieved excellent performance [3,4].
SRCNN [5] was the pioneering work in deep learning for image super-resolution reconstruction, directly modeling an end-to-end mapping through only a three-layer convolutional network, which achieved better results than traditional algorithms. Subsequently, deep CNN-based SR models have become the mainstream. Kim et al. presented a very deep convolutional network (VDSR) [6] and DRCN [7], pushing the model depth to 20 layers by adopting the residual structure [8], which led to a remarkable performance gain (e.g., VDSR obtained a PSNR of 37.53 vs. SRCNN's PSNR of 36.66 on Set5 ×2; PSNR is defined in Section 4.1). These methods take the interpolated LR image as the input to the network, which increases the computational burden and time overhead. FSRCNN [9], using transposed convolution, and ESPCN [10], adopting sub-pixel convolution, were proposed to accelerate inference and reduce the computational burden by moving the up-scaling step to the end of the network, so that features are extracted directly from the low-resolution (LR) input. Thanks to effective sub-pixel convolution, Lim et al. [11] explored a broad and deep EDSR network without the batch normalization module, dramatically improving the SR performance (e.g., EDSR PSNR = 38.11 vs. VDSR PSNR = 37.53 on Set5 ×2). Since then, researchers have attempted to design more complex networks to enhance network accuracy.
To obtain more abundant information, hierarchical features and multi-scale features can be used. Wang et al. [12] introduced an adaptive weighted multi-scale (AWMS) residual structure to realize a lightweight network. SRDenseNet [13], based on DenseNet [14], used the concatenated features of all layers to enhance feature propagation and maintain continuous feature transmission. Furthermore, Song et al. [15] leveraged NAS [16] to find an efficient structure based on a residual dense module for accurate super-resolution. However, most SR models do not distinguish these intermediate features and lack flexibility in processing different information types, thus preventing better performance. RCAN [17] developed a channel attention module to model the channel interdependencies, in order to obtain discriminative information, and achieved a PSNR of 38.27 on Set5 ×2; however, it has more than 16M parameters, which is not conducive to deployment on resource-limited devices. Later, Hui et al. [18] constructed an information multi-distillation structure with the splitting operation, greatly reducing the number of channels. Lan et al. [19] introduced channel attention into the residual multi-scale module to enhance the feature representation capability (MADNet), and achieved a PSNR of 37.85 on Set5 ×2 with 878 K parameters.
Motivated by the above, we propose an attention network with an information distillation structure (AIDN) for efficient SISR, using several stacked attention information distillation blocks (AIDBs). Inspired by IDN, we carefully develop the AIDB to progressively learn more intermediate feature representations, mainly employing multiple splitting operations combined with gated channel transformation (GCT). Specifically, the splitting strategy divides the previously extracted features into two parts, where one is retained while the other is further processed by GCT. The normalization method and attention mechanism are combined to gain precise contextual information. GCT can learn the importance of different channels adaptively and takes weighted feature maps as the input to the next layer. Meanwhile, GCT encourages cooperation at shallow layers and competition at deeper layers. Moreover, all distilled features are aggregated through the recalibrated attention module (RAM), which further refines these high-frequency features and revises the importance of features in the channel dimension. In general, the main contributions of our work can be summarized as follows:
• We propose an attention network with an information distillation structure (AIDN) for efficient and accurate image super-resolution, which extracts the valuable intermediate features step by step using the distillation structures;
• We introduce gated channel transformation (GCT) into SISR and use it in one distillation branch;
• We propose a recalibrated attention module (RAM) to re-highlight the contributions of features and strengthen the expressive ability of the network.
Comprehensive experimental results demonstrate that the proposed method strikes a good balance between performance and model size.

Deep CNN-Based Super-Resolution Methods
In recent years, methods based on deep convolutional neural networks (CNNs) have been successfully applied to various tasks, showing excellent performance.
Dong et al. [5] first explored the use of three convolutional layers for single-image super-resolution (SISR), and obtained better reconstruction results than by using the traditional methods. Subsequently, with the successful application of the residual network architecture [8] in computer vision tasks, more and more residual-learning-variant algorithms have been used to reconstruct SR images, including LapSRN [20], WMRN [21], CFSRCNN [22], and RFANet [23]. Dense connections have also been introduced for image super-resolution through the information flow of hierarchical features. RDN [24] combined the residual structure with dense connections to form a residual dense network with a continuous memory. Zhang et al. [25] developed GLADSR through the use of the global-local adjustment of dense connections to increase the network capacity.
Although these methods have achieved good performance, the number of parameters increases dramatically with the network depth, making them unsuitable for mobile platforms. DRCN [7] leveraged recursive learning to decrease the parameters of the network. CARN [26] developed a cascading architecture in the residual structure, forming a lightweight model suitable for practical applications. CBPN [27] struck a good balance between efficiency and performance by learning mixed residual features. Song et al. [28] devised AdderSR to resolve the defects of adder neural networks when they are applied to super-resolution. It provided a better visual effect with lower energy consumption without changing the original structures. More recently, some NAS-based SR models have been proposed to automatically search for optimal architectures. Chu et al. [29] presented an automatic search algorithm, FALSR, based on NAS, to achieve a fast and lightweight SR model. DRSDN [30] explored diverse plug-and-play network architectures for efficient single-image super-resolution.

Attention Mechanism
The attention mechanism is a data processing method in machine learning that is used to improve the performance of convolutional neural networks (CNNs) in computer vision tasks. It aims to enable a network to automatically focus on more informative areas by using masks (new weights). SENet [31] can be regarded as a pioneering channel attention model, which improved the representational capability of the network by modeling the relationship between channels. Wang et al. [32] presented a non-local block to calculate the response of a location to the information of all positions. CBAM [33] connected channel attention and spatial attention in series to obtain a 3D attention map, forming a lightweight, universal module. GCT [34] combined a normalization module with an attention mechanism, using lightweight variables to learn the interrelationships between channel-wise information. ECA-Net [35] developed a local cross-channel interaction scheme without dimension reduction, which proved to be an efficient and lightweight channel attention structure.
In addition, attention-based works have been proposed to further improve super-resolution performance. Zhang et al. [23] introduced enhanced spatial attention (ESA) into the residual-in-residual (RIR) structure to build a residual feature aggregation block, thus forming a lightweight and effective model. Dai et al. [36] designed a second-order attention network (SAN), which employed second-order feature statistics to learn more discriminative feature expressions. DRLN [37] developed a novel Laplacian attention with dense connections on the cascaded residual structure to study the inter- and intra-layer dependencies, achieving deep supervision. Hu et al. [38] explored channel-wise and spatial attention residual blocks (CSAR) to modulate hierarchical features in both global and local manners, achieving prominent performance. CSNLN [39] proposed a cross-scale non-local attention, which thoroughly explored all possible priors through non-local calculations of the feature-wise similarities between patches across scales.

Network Architecture
In this section, we introduce the entire framework of our proposed attention network with information distillation (AIDN), as shown in Figure 1. Our AIDN architecture comprises three parts: a low-level feature extraction module (LFE), stacked attention information distillation blocks (AIDBs), and an image reconstruction module. Here, $I_{LR}$ represents the original low-resolution (LR) input image, while $I_{SR}$ denotes its output super-resolution (SR) image. Specifically, a convolutional layer is first leveraged to extract the shallow features from the given LR input. This procedure can be expressed as

$$X_0 = F_{LFE}(I_{LR}),$$

where $F_{LFE}(\cdot)$ denotes a convolutional layer with a kernel size of 3 × 3, and $X_0$ is the extracted shallow features. Then, $X_0$ is sent to the next part, which consists of multiple attention information distillation blocks (AIDBs) in a chain and gradually refines multiple hierarchical features. This process can be denoted as

$$X_n = F^{n}_{AIDB}(X_{n-1}), \quad n = 1, \ldots, M,$$

where $F^{n}_{AIDB}$ indicates the $n$-th AIDB function, $X_{n-1}$ and $X_n$ denote the input and output feature maps of the $n$-th AIDB, respectively, and $M$ is the number of AIDBs. Then, the deep features generated by this sequence of AIDBs are concatenated together through global feature fusion. After fusing, the deep features are processed by two convolution layers before the reconstruction module, which can be formulated as

$$X_{aggregate} = F_{aggregate}(\mathrm{Concat}(X_1, X_2, \ldots, X_M)),$$

where Concat represents the concatenation operation, and $F_{aggregate}$ denotes a composite function of a convolution layer with a kernel size of 1 × 1 following a convolution layer with a kernel size of 3 × 3.
In addition, the deep-aggregated feature $X_{aggregate}$ is added to the shallow feature $X_0$ through global residual learning. Finally, the super-resolved output images are produced through the reconstruction function, as follows:

$$I_{SR} = F_{rec}(X_{aggregate} + X_0),$$

where $F_{rec}(\cdot)$ represents the reconstruction module function and $I_{SR}$ is the output super-resolution image of the network. The reconstruction module consists of a 3 × 3 convolutional layer and a pixel-shuffle layer. Different loss functions have been introduced to optimize SR networks. For fair comparison with the most advanced methods, our model is optimized using the $L_1$ loss function, as in previous works [18,21]. Consider a training set $\{I^{i}_{LR}, I^{i}_{HR}\}_{i=1}^{N}$, where $N$ denotes the number of LR-HR image patches. The loss function of our AIDN can be represented as

$$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| H_{AIDN}(I^{i}_{LR}) - I^{i}_{HR} \right\|_1,$$

where $\Theta$ indicates the learnable parameters of our AIDN model and $H_{AIDN}(\cdot)$ denotes the function of our model. Our goal is to minimize the $L_1$ loss between the reconstructed image $I_{SR}$ and the corresponding ground-truth high-resolution (HR) image $I_{HR}$.
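The two operations that close the pipeline, pixel shuffle and the L1 loss, can be illustrated with a small dependency-free sketch. This is not the authors' implementation: the function names `pixel_shuffle` and `l1_loss` are hypothetical, tensors are plain nested lists in channel-height-width order, the channel ordering follows the common sub-pixel convention, and the loss is averaged per element (an assumption about normalization).

```python
def pixel_shuffle(features, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Assumed convention: input channel c*r*r + dy*r + dx supplies output
    pixel (h*r + dy, w*r + dx) of output channel c.
    """
    channels = len(features)
    if channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    height, width = len(features[0]), len(features[0][0])
    out_channels = channels // (r * r)
    out = [[[0.0] * (width * r) for _ in range(height * r)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for dy in range(r):
            for dx in range(r):
                src = features[c * r * r + dy * r + dx]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + dy][w * r + dx] = src[h][w]
    return out


def l1_loss(pred, target):
    """Mean absolute error between two equally shaped [C][H][W] lists."""
    total, count = 0.0, 0
    for p_ch, t_ch in zip(pred, target):
        for p_row, t_row in zip(p_ch, t_ch):
            for p, t in zip(p_row, t_row):
                total += abs(p - t)
                count += 1
    return total / count
```

For a ×2 model, four 1 × 1 feature channels become one 2 × 2 output channel, which is why the convolution before the pixel-shuffle layer must output $r^2$ times the number of image channels.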

Attention Information Distillation Block
This section mainly introduces the key parts of the proposed AIDB. As shown in Figure 2, the proposed attention information distillation block (AIDB) mainly contains the feature refinement module (FRM) and the recalibrated attention module (RAM). Specifically, the FRM module gradually extracts the multi-layer features by employing information diffluence to obtain a discriminative learning ability. A few features are also aggregated according to their contributions. Moreover, the RAM module re-highlights the informativeness of the features and enhances the expression capability of the network.

Feature Refinement Module
The feature refinement module (FRM) exploits the distillation network and attention mechanism to separate and process features by connection or convolution. Specifically, a 3 × 3 convolution layer is first exploited to extract input features for multiple succeeding distillation steps in the FRM. For each step, the channel split operation is performed on the previous features, dividing them into two parts: one part is retained, while the other part is used as input to the gated channel transformation (GCT) module [34] and processed further. Assuming the input features are denoted by $X_{in}$, this procedure is formulated with the following notation: $F^{i}_{conv}$ indicates the $i$-th 3 × 3 convolution operation followed by the Leaky ReLU (LReLU) activation function, $F^{i}_{GCT}$ denotes the channel transformation operation (detailed in the following section), $Split_j$ represents the $j$-th channel split operation, $X^{retain}_{i}$ denotes the $i$-th retained features, and $X^{coarse}_{j}$ represents the $j$-th coarse features, which are further fed to the subsequent layers. Afterward, all the features retained in each step are concatenated along the channel dimension, which can be denoted as

$$X_{FRM} = \mathrm{Concat}(X^{retain}_{1}, X^{retain}_{2}, \ldots),$$

where Concat indicates the concatenation operation and $X_{FRM}$ denotes the output of the feature refinement module (FRM).
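The split-and-retain data flow described above can be traced with a small sketch that only tracks channel identities. Several details are assumptions rather than facts from the text: four distillation steps, GCT applied to the coarse part before the next convolution, every convolution except the last producing 64 channels, and the last one producing 16. All names (`split`, `make_conv_stub`, `identity_gct`, `frm_forward`) are hypothetical stand-ins, not the authors' code.

```python
def split(channels, keep):
    """Channel split: the first `keep` channels are retained, the rest go on."""
    return channels[:keep], channels[keep:]


def make_conv_stub(name, out_channels):
    """Stand-in for a 3x3 conv + LReLU: only tracks how many channels come out."""
    def conv(channels):
        return [f"{name}[{k}]" for k in range(out_channels)]
    return conv


def identity_gct(channels):
    """Stand-in for GCT, which reweights channels without changing their count."""
    return list(channels)


def frm_forward(x_in, convs, gcts, keep):
    """Assumed FRM flow: conv -> split -> (GCT -> conv -> split)* -> concat."""
    retained = []
    features = convs[0](x_in)
    for conv, gct in zip(convs[1:], gcts):
        kept, coarse = split(features, keep)
        retained.append(kept)
        features = conv(gct(coarse))
    retained.append(features)  # the last convolution's output is kept whole
    return [ch for part in retained for ch in part]


x_in = [f"in[{k}]" for k in range(64)]
convs = [make_conv_stub("conv1", 64), make_conv_stub("conv2", 64),
         make_conv_stub("conv3", 64), make_conv_stub("conv4", 16)]
gcts = [identity_gct] * 3
x_frm = frm_forward(x_in, convs, gcts, keep=16)
```

Under these assumptions, each step contributes 16 retained channels, so the concatenated output has the same width (64) as the input.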

Gated Channel Transformation
Gated channel transformation (GCT) [34] is a simple and effective attention architecture for modeling channel relationships, combining a normalization module with a gating mechanism. As shown in Figure 3, the overall structure of the GCT module consists of three parts: global context embedding, channel normalization, and a gating mechanism. First, we employ the $L_2$-norm to capture global contextual information from the input feature. Given the input feature $X = \{x_1, x_2, \ldots, x_C\}$, $X \in \mathbb{R}^{C \times H \times W}$, the embedding can be written mathematically as [34]

$$s_c = \alpha_c \|x_c\|_2 = \alpha_c \left\{ \left[ \sum_{i=1}^{H} \sum_{j=1}^{W} \left( x_c^{i,j} \right)^2 \right] + \varepsilon \right\}^{1/2},$$

where $S = \{s_1, s_2, \ldots, s_C\}$, $S \in \mathbb{R}^{C \times 1 \times 1}$, is the gathered global-context-embedding information along each channel dimension, $\varepsilon$ represents a very small constant that avoids the derivative problem at zero, and $\alpha_c$ denotes the trainable parameter, namely the embedding weight. Furthermore, $\alpha_c$ controls the weight of each channel. In particular, when $\alpha_c$ approaches 0, the channel does not participate in the subsequent normalization module. Accordingly, it enables the network to recognize when one channel is independent of the others. Then, we adopt the normalization operation to reduce the number of parameters and improve the computational efficiency. Furthermore, normalization approaches [40] have been shown to establish competitive relations between different neurons (or channels) in neural networks, which stabilizes the training process. Normalization amplifies the channels with larger responses and restrains the channels with less feedback. The channel normalization function can be expressed as

$$\hat{s}_c = \frac{\sqrt{C}\, s_c}{\|S\|_2} = \frac{\sqrt{C}\, s_c}{\left[ \left( \sum_{c=1}^{C} s_c^2 \right) + \varepsilon \right]^{1/2}},$$

where $C$ is the number of channels. Finally, the gating mechanism is introduced to control the activation of each channel. The gating function is defined as follows:

$$\hat{x}_c = x_c \left[ 1 + \tanh\left( \gamma_c \hat{s}_c + \beta_c \right) \right],$$

where $\gamma = [\gamma_1, \ldots, \gamma_C]$ denotes the gating weights, $\beta = [\beta_1, \ldots, \beta_C]$ represents the gating biases, and $x_c$ and $\hat{x}_c$ are the input and output features of the gating mechanism module, respectively.
The weights and biases determine the behavior of GCT in each channel. When the gating weight $\gamma_c$ is activated positively, GCT enhances this channel to compete with the others. When the gating weight is activated negatively, GCT pushes the channel to cooperate with the others. In a deep network, low-level features are primarily learned in the shallow layers; thus, cooperation between channels is required to extract features more widely. In the deeper layers, high-level features are mainly learned, and their differences are often large. Therefore, competition between channels is needed to obtain more valuable feature information. In addition, when the gating weight and bias are zero, the original features are allowed to pass to the next layer unchanged, which can be formulated as

$$\hat{x}_c = x_c \left[ 1 + \tanh(0) \right] = x_c.$$

This establishes an identity mapping and alleviates the degradation problem of deep networks. Hence, during GCT module initialization, $\alpha$ is initialized with 1, and $\gamma$ and $\beta$ are initialized with 0. This initialization improves the robustness of the training process, and the final GCT results are more accurate.
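The three GCT stages (embedding, channel normalization, gating) can be written out in a few lines. The following is a minimal sketch in plain Python on nested lists, following the formulation of [34] as summarized above; the function name `gct` and the default `eps=1e-5` are assumptions, and a practical implementation would use a tensor library.

```python
import math


def gct(x, alpha, gamma, beta, eps=1e-5):
    """Apply gated channel transformation to x, a [C][H][W] nested list.

    alpha, gamma, and beta are length-C lists (one entry per channel).
    """
    num_channels = len(x)
    # 1) Global context embedding: s_c = alpha_c * sqrt(sum of squares + eps)
    s = []
    for c in range(num_channels):
        sum_sq = sum(v * v for row in x[c] for v in row)
        s.append(alpha[c] * math.sqrt(sum_sq + eps))
    # 2) Channel normalization: s_hat_c = sqrt(C) * s_c / sqrt(sum_c s_c^2 + eps)
    denom = math.sqrt(sum(v * v for v in s) + eps)
    s_hat = [math.sqrt(num_channels) * v / denom for v in s]
    # 3) Gating: x_hat_c = x_c * (1 + tanh(gamma_c * s_hat_c + beta_c))
    out = []
    for c in range(num_channels):
        gate = 1.0 + math.tanh(gamma[c] * s_hat[c] + beta[c])
        out.append([[v * gate for v in row] for row in x[c]])
    return out
```

With `gamma` and `beta` set to zero, the gate equals 1 and the block reduces to the identity mapping, which matches the initialization described above. With a positive `gamma`, the channel carrying more energy receives the larger gate.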

Recalibrated Attention Module
To recalibrate informative features, the output features of the FRM are further fed into the recalibrated attention module (RAM), where the informative features are selectively emphasized and useless features are inhibited according to their importance. As shown in Figure 4, the overall structure of the RAM is a bottleneck architecture. Here, $X_{FRM}$ and $X_{RAM}$ are defined as the input and output of the RAM, respectively. Specifically, the concatenated features are first passed to a 1 × 1 convolution layer to decrease channel dimensions; then, they are divided into two branches. One branch preserves the original information with a 1 × 1 convolution to produce $X_1$, while the other processes the spatial information to search for the areas with the highest contribution. This second branch is equipped with two 3 × 3 convolutions, a max-pooling layer, and a bilinear interpolation operator to generate $X_2$. The max-pooling operation not only enlarges the receptive field but also captures high-frequency details. The bilinear interpolation layer maps the intermediate features back to the original feature space so that the input and output have identical sizes. Finally, $X_1$ and $X_2$ are concatenated and fed into a 1 × 1 convolution followed by a sigmoid function. This 1 × 1 convolution is adopted to restore the channel dimensions. Hence, the recalibrated attention can be expressed as

$$X_{RAM} = F_{RAM}(X_{FRM}),$$

where $F_{RAM}(\cdot)$ is the recalibrated attention module function. The final output of the attention information distillation block (AIDB) is then formulated in terms of $F_{conv}$, a 3 × 3 convolutional layer, with $X^{B}_{n-1}$ and $X^{B}_{n}$ denoting the input and output of the $n$-th AIDB, respectively. Furthermore, the GCT module considers the channel-wise statistics, while the recalibrated attention module (RAM) encodes multi-scale features, focusing on the context around the spatial locations.
Therefore, AIDB can modulate more informative features to obtain a more powerful feature representation capability, which is conducive to improving SR performance.

Experiments
In this section, we first describe our experimental conditions regarding the implementation details and training settings. Then, we study the validity of the proposed modules in our model. Finally, we systematically compare the proposed network with plenty of state-of-the-art models.

Datasets and Metrics
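In our experiments, following previous works [18,21], we employed the DIV2K dataset [41] to train our model. It includes 800 high-quality training images. In the testing phase, we adopted five public benchmark datasets (Set5 [42], Set14 [43], BSD100 [44], Urban100 [45], and Manga109 [46]) to comprehensively validate the effectiveness of our model. In addition, we leveraged the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [47] as quantitative evaluation metrics for the performance of the final reconstructed super-resolution images. We computed the PSNR and SSIM values on the luminance channel of the YCbCr color space. We also compared parameter amounts with other leading models. Given a ground-truth image $I_{HR}$ and a super-resolved image $I_{SR}$, we defined the PSNR as

$$PSNR(I_{HR}, I_{SR}) = 10 \log_{10} \left( \frac{MAX^2}{MSE(I_{HR}, I_{SR})} \right),$$

where $MAX$ is the maximum possible pixel value (255 for 8-bit images) and $MSE$ is the mean squared error between the two images. The SSIM is defined as

$$SSIM(I_{HR}, I_{SR}) = l(I_{HR}, I_{SR}) \cdot c(I_{HR}, I_{SR}) \cdot s(I_{HR}, I_{SR}),$$

with

$$l = \frac{2 \mu_{I_{HR}} \mu_{I_{SR}} + C_1}{\mu_{I_{HR}}^2 + \mu_{I_{SR}}^2 + C_1}, \quad c = \frac{2 \sigma_{I_{HR}} \sigma_{I_{SR}} + C_2}{\sigma_{I_{HR}}^2 + \sigma_{I_{SR}}^2 + C_2}, \quad s = \frac{\sigma_{I_{HR} I_{SR}} + C_3}{\sigma_{I_{HR}} \sigma_{I_{SR}} + C_3},$$

where $\mu_{I_{HR}}$, $\sigma_{I_{HR}}$, and $\sigma_{I_{HR} I_{SR}}$ are the mean, standard deviation, and covariance of an image, respectively, and $C_1$, $C_2$, and $C_3$ are positive constants.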
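As a concrete illustration of the PSNR metric on the luminance channel, the following sketch converts 8-bit RGB pixels to Y and computes the PSNR between two single-channel images stored as nested lists. The function names are hypothetical, and the ITU-R BT.601 luma coefficients are an assumption, since the text does not state which YCbCr variant was used. Border cropping, which many SR evaluations apply, is not included.

```python
import math


def rgb_to_y(r, g, b):
    """Luma of an 8-bit RGB pixel under ITU-R BT.601 (nominal range 16..235)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(reference, estimate, max_value=255.0):
    """PSNR in dB between two equally sized 2D lists of pixel values."""
    squared_error, count = 0.0, 0
    for ref_row, est_row in zip(reference, estimate):
        for ref, est in zip(ref_row, est_row):
            squared_error += (ref - est) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For example, two images that differ by exactly one gray level at every pixel have an MSE of 1 and therefore a PSNR of 20 log10(255), roughly 48.13 dB.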

Training Settings
We obtained the input low-resolution (LR) images from the corresponding HR images by bicubic down-sampling in the training stage. Each training mini-batch consisted of 16 LR patches of size 48 × 48 extracted from the LR images. Moreover, we randomly rotated the images in the training dataset by 90°, 180°, and 270°, and flipped them horizontally for data augmentation. We utilized the Adam optimizer [48] to optimize our model with settings of β1 = 0.9 and β2 = 0.999. We set the initial learning rate to 2 × 10^−4 and halved it every 200 epochs. We implemented the proposed model in the PyTorch framework and trained it on an NVIDIA GTX 1080Ti GPU. More setting details of our experiments are listed in Table 1.
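The learning-rate schedule and the augmentation set described above can be sketched as follows. This is an illustrative sketch with hypothetical function names: it enumerates all eight rotation/flip variants of a patch, whereas training would sample one of them at random, and it assumes the halving happens exactly at epochs 200, 400, and so on.

```python
def learning_rate(epoch, base_lr=2e-4, step=200, factor=0.5):
    """Step decay: start at base_lr and multiply by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)


def rotate90(patch):
    """Rotate a 2D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def flip_horizontal(patch):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in patch]


def augmentations(patch):
    """Return 8 variants: rotations by 0/90/180/270 degrees, each with and without a flip."""
    current = [list(row) for row in patch]
    variants = []
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants
```

Over the 1000 training epochs used in the ablation study, this schedule halves the rate four times, ending at 1.25 × 10^−5.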

Model Details
Our model includes six attention information distillation blocks (AIDBs), and we set the number of feature channels to 64. In each distillation step, we retained 16 channels and further processed the remaining channels. We set the activation functions in the feature refinement module (FRM) as LReLU, while we applied ReLU to the other parts [49]. Additionally, in the recalibrated attention module (RAM), we deployed the first 3 × 3 convolution layer with stride = 2 and the other 3 × 3 convolution layer with stride = 1, and used a max-pooling operation with a 7 × 7 kernel and stride = 3.
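To see how strongly the spatial branch of the RAM shrinks a feature map, the standard output-size formula can be applied to the strides and kernel sizes listed above. The layer order (strided convolution, max-pooling, second convolution, bilinear up-sampling) and the padding values are assumptions made for this sketch, since neither is specified in the text; the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def ram_mask_branch_sizes(size, conv_padding=0):
    """Trace the side length of a square feature map through the assumed branch."""
    # First 3x3 convolution with stride 2 (padding is an assumption).
    after_conv1 = conv_output_size(size, kernel=3, stride=2, padding=conv_padding)
    # Max-pooling with a 7x7 kernel and stride 3, no padding assumed.
    after_pool = conv_output_size(after_conv1, kernel=7, stride=3)
    # Second 3x3 convolution with stride 1; padding 1 keeps the size.
    after_conv2 = conv_output_size(after_pool, kernel=3, stride=1, padding=1)
    # Bilinear interpolation restores the input resolution.
    after_upsample = size
    return [after_conv1, after_pool, after_conv2, after_upsample]
```

Under these assumptions, a 48 × 48 training patch is reduced to 6 × 6 inside the branch before being interpolated back to 48 × 48, so each mask value summarizes a wide neighborhood of the input at low cost.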

Study of GCT and RAM
To study the contributions of the different modules in the proposed model, we conducted ablation experiments. All the models were trained from scratch for 1000 epochs under the same settings. Each time we removed one module, we directly tested the model performance without adding other operations. Table 2 shows the experimental results at a scale factor of 2 on multiple datasets. Without gated channel transformation (GCT) and the recalibrated attention module (RAM) in the attention information distillation block (AIDB), the PSNR values on all datasets are relatively lower. The performance of the second row, with the GCT module, is better than that of the first row with only 1 K more parameters. Similarly, RAM in the third row also improves the performance, especially on the Urban100 and Manga109 datasets. Therefore, both the GCT module and RAM can independently obtain better reconstruction accuracy. This can be attributed to the multi-layer features being treated discriminatively, with different weights allocated according to the characteristics of the features to screen out high-value information, improving the efficiency and accuracy of the network. Furthermore, the best reconstruction results are obtained when integrating both GCT and RAM into the AIDB with few additional parameters, as shown in the last row of Table 2. Thus, the proposed AIDB can capture spatial and global contextual information in each channel, benefiting image restoration. These quantitative results demonstrate the effectiveness of the network structure with the introduced GCT and RAM, and of their integration.

Table 2. Investigations of the GCT module and RAM unit on five benchmark datasets at a scaling factor of ×2. The two values are PSNR/SSIM. Params: kernel × kernel × input channels × output channels. The best and second-best performances are highlighted in red and blue.
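The parameter-counting rule in the Table 2 caption, together with the per-channel parameters of GCT, can be sketched as below. The function names are hypothetical, and bias terms are excluded by default to match the caption's formula; the GCT count of three values per channel follows from the embedding weight, gating weight, and gating bias defined earlier.

```python
def conv_params(kernel, in_channels, out_channels, bias=False):
    """Parameters of a 2D convolution: kernel*kernel*in*out (+ out if biased)."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def gct_params(channels):
    """GCT keeps one alpha, one gamma, and one beta per channel."""
    return 3 * channels
```

A single 3 × 3 convolution with 64 input and 64 output channels holds 36,864 weights, whereas a GCT module over 64 channels holds only 192 parameters, which illustrates why adding GCT changes the model size so little.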

Comparison with State-of-the-Art Methods
To demonstrate the effectiveness of our proposed architecture, we compared the proposed network with recently proposed competitive works, including SRCNN [5], VDSR [6], DRCN [7], LapSRN [20], IDN [18], CARN [26], MoreMNAS-A [50], FALSR-A [29], ESRN-V [15], WMRN [21], MADNet-L1 [19], MSICF [51], and CFSRCNN [22]. Almost all of these works are lightweight networks with fewer than 2.0M parameters. The quantitative results with scale factors of ×2, ×3, and ×4 on five benchmark datasets are provided in Table 3. It can be seen that our proposed model is superior to the other leading algorithms across different datasets and scaling factors. Specifically, compared with several automatic search SR architectures based on NAS (FALSR-A, MoreMNAS-A, and ESRN-V), our AIDN network achieves higher PSNR values with fewer parameters (FLOPs) on five datasets for ×2 up-scaling.

Table 3. Quantitative results of several state-of-the-art SR models at scaling factors of ×2, ×3 and ×4 (average PSNR/SSIM). The best performance is highlighted in red, while the second-best performance is highlighted in blue.

Although WMRN has slightly fewer parameters than the proposed network, its reconstruction results are far worse. For example, with a scale factor of 3 on Set5, our network obtains a significant performance gain of 0.24 dB. Moreover, our AIDN performs well compared to MADNet-L1, which has a similar number of parameters. MADNet also applied an attention mechanism with a residual multi-scale module. For a scale factor of 4 on four datasets, CFSRCNN achieves the second-best performance with nearly twice as many parameters as our method. From Table 3, it can be seen that the number of parameters of SRCNN and VDSR does not change across scaling factors, as the input image is interpolated before being sent into the network. Other models have varying numbers of parameters due to different up-sampling approaches. As our model has relatively few parameters, it can be considered a lightweight model.
Consequently, our method has better reconstruction performance than the most advanced methods, with fewer parameters.

In addition, we compared the visual quality with other methods at the ×4 scale, as shown in Figure 5. For "148026" from BSD100, most methods restored blurred edges, while CARN even produced wrong textures; in contrast, the image generated by our method was closer to the original image. For "img_042" from Urban100, other methods suffered from severe artifacts, and the lines they produced are curved. Only our method reconstructed the horizontal lines correctly. For "img_037" from Urban100, LapSRN could not recover the grids, and VDSR rebuilt several redundant white vertical lines at the upper right of the image. Our AIDN reconstructed accurate grids with better visual effects.

Figure 5. Visual comparison with SRCNN [5], VDSR [6], DRCN [7], LapSRN [20] and CARN-M [26] for ×4 SR images on the BSD100 and Urban100 datasets. The best results are highlighted in red.

Heatmaps of the Proposed AIDN
This section describes the heatmaps of the proposed AIDN at different stages on the Urban100 dataset (×2). In Figure 6, the top row shows heatmaps of the shallow features before passing into AIDBs, while the next two rows are heatmaps of the refined high-level features. The results show that our method assigns different weights in different states. The rows represent the states of different channels at the same time, and the columns represent the states of the same channel at different times. Yellow indicates a heavy weight, blue indicates a light weight, and green indicates an intermediate weight. It can be seen that our method modulates the features, which is conducive to image reconstruction.

Model Size Analysis
In addition, to further illustrate the superiority of the proposed network, we compared its number of parameters and performance with other leading works. The number of parameters is especially important when building a lightweight network, especially for resource-constrained mobile devices. The experimental results on Urban100 with a scale factor of ×2 are shown in Figure 7. Compared with other methods, our AIDN model obtained comparable or higher PSNR values with fewer parameters, while other methods either had a larger number of parameters or lower performance. These analyses indicate that the proposed AIDN strikes a better balance between parameters and performance.

Visualization on Historical Images
To further illustrate the robustness and effect of our model, we evaluated our attention network with information distillation (AIDN) on historical images. The degradation process of these low-resolution images is unknown, and no corresponding high-quality images are available. Figure 8 shows the visual results on scale factor ×4. For "img006", the characters produced by our model were clearer and more independent. For "img007", our AIDN could reconstruct finer details, and the refined images showed lower blurring. In short, the images generated by our method have better perceptual quality than those of other methods.

Conclusions
In this paper, we proposed an attention network with information distillation (AIDN) for image super-resolution. Specifically, global contextual information embedding among different channels is employed to modulate multiple features in a step-by-step manner, forming the distillation structure. Moreover, a recalibrated attention module (RAM) is adopted to re-highlight these features, concentrating on the vital contents around spatial locations. Benefiting from the gated channel transformation and spatial information unit masks working jointly, the proposed AIDN possesses a more powerful information identifying capability, effectively improving the computational efficiency while enhancing the reconstruction accuracy. Comprehensive quantitative and qualitative evaluations effectively demonstrate that our AIDN outperforms state-of-the-art models in terms of both reconstruction performance and visual quality. In future work, we will extend our AIDN to other complex tasks (e.g., images with noise, blurring, etc.).