An Efficient Super-Resolution Network Based on Aggregated Residual Transformations

In this paper, we propose an efficient multibranch residual network for single image super-resolution. Based on the idea of aggregated transformations, the split-transform-merge strategy is exploited to implement the multibranch architecture in an easy, extensible way. By this means, both the number of parameters and the time complexity are significantly reduced. In addition, to ensure high-performance super-resolution reconstruction, the residual block is modified and simplified with reference to the enhanced deep super-resolution network (EDSR) model. Moreover, the proposed method is flexible and extensible, which helps in building a specific network according to practical demands. Experimental results on the Diverse 2K (DIV2K) dataset and other standard datasets show that the proposed method achieves performance comparable to that of EDSR with the same number of convolution layers.


Introduction
In recent years, single image super-resolution (SISR) has attracted a lot of attention from researchers in the field of computer vision. SISR aims to reconstruct a high-resolution image I_HR from a single low-resolution image I_LR [1], and it has been widely used in many fields, such as remote sensing [2], medical imaging [3], and environmental monitoring [4][5][6][7]. To our knowledge, interpolation based on sampling theory was the earliest method used to solve the super-resolution problem. However, interpolation falls seriously short in predicting details and realistic textures. To address this problem, techniques that learn the mapping between I_LR and I_HR have been proposed, such as neighbor embedding [8][9][10][11] and sparse coding [12][13][14][15][16]. In the last few years, deep learning-based approaches to super-resolution have been emerging constantly [16][17][18][19][20]. Dong et al. first applied convolutional neural networks (CNNs) to super-resolution [18], with satisfactory results in practical use. Later, Kim et al. designed SRResNet (residual network for super-resolution) [20] based on the well-known residual network ResNet [19]. Benefiting from the skip connection and recursive structure, deeper networks are easy to realize for better performance. To simplify SRResNet, Lim et al. proposed the enhanced deep super-resolution network (EDSR) [1], which optimizes the architecture of the residual blocks by removing unnecessary modules. Although these ResNet-based models improve the quality of reconstruction thanks to their deeper layers, they all share the same problem: a sharp increase in the number of parameters. Especially in engineering practice, the cost of a large number of residual blocks and parameters has hampered the wider use of ResNet-based models. Therefore, the question of how to reduce the number of model parameters without loss of reconstruction quality has become one of the hottest research issues.
Various methods have been reported to reduce the number of parameters [21][22][23][24]. Network pruning, singular value decomposition (SVD), and the split-transform-merge strategy are three representative methods. In 1990, LeCun et al. first proposed the concept of network pruning, which decreases the model size by cutting off the redundant parameters of the neural network [21]. This method requires a lot of iterative training to ensure network performance. In 2014, Denton et al. proposed the SVD method to reduce the number of weights [22]. In the SVD method, a large weight matrix is represented as the product of smaller and simpler submatrices, which can significantly reduce the number of network parameters. However, as the matrix grows, the calculation of the singular values becomes complicated and difficult. In recent years, the split-transform-merge strategy has attracted more and more attention from researchers. Based on this strategy, the Inception models were developed with lower computational complexity and fewer parameters [23]. In the Inception models, the input is split into several low-dimensional embeddings (by 1×1 convolutions), then transformed by a set of specialized filters (3×3, 5×5, etc.) and finally merged by concatenation [24]. However, because the hyperparameters of each branch need to be set properly, it is hard to find a simple design method for the construction of an Inception network. In 2016, Xie et al. proposed the ResNeXt network [24] based on aggregated transformations, which can be regarded as an improvement of the split-transform-merge strategy. However, ResNeXt was originally designed for image classification; therefore, its structure must be changed and optimized before it is applied to super-resolution.
In this paper, an efficient multibranch residual network for the super-resolution task is proposed. The multibranch architecture is built on the basis of aggregated transformations. At the same time, we optimize the residual block with reference to EDSR. Based on the proposed network structure, two specific models are established and given as examples in this work. Experiments show that our models achieve good reconstruction quality with a significant reduction in network parameters.

Related Work
Inception: The Inception network is a typical multibranch architecture based on the split-transform-merge strategy. Each branch in the network is carefully designed to gain good performance in terms of speed and accuracy. However, the customized size and number of filters in each branch make the Inception network hard to implement.
SRResNet: SRResNet is a super-resolution reconstruction network inspired by the residual network [20]. Based on the original residual structure, the network removes the activation layer after the residual block and obtains image reconstruction results that look good to the human eye.
EDSR: EDSR is a state-of-the-art super-resolution network which further modifies the residual block structure of SRResNet [1]. Since BN (batch normalization) layers remove range flexibility from the network and consume a lot of memory, EDSR removes the two BN layers in the residual block. Benefiting from this structural modification, EDSR greatly improves image reconstruction and reduces the usage of graphics processing unit (GPU) memory.
ResNeXt: Based on the residual block architecture, ResNeXt exploits the split-transform-merge strategy in an easy, extensible way, namely, aggregated residual transformations [24]. This method involves stacking a series of homogeneous, multibranch residual blocks with only a few hyperparameters to set [24]. The branches of ResNeXt each perform their own set of convolutions and merge at the end of the block. Compared with ResNet, ResNeXt shows better performance and lower computational complexity in the task of image classification.
Grouped convolution: Grouped convolution was first proposed in the AlexNet paper [25] in 2012. The motivation given by the authors was to distribute the model over two GPUs to overcome the limited hardware resources of a single GPU. Grouped convolution divides the feature maps among multiple GPUs for convolution and subsequently aggregates the results obtained on those GPUs.
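To make the link between grouped convolution and the multibranch view concrete, the following minimal sketch models a 1×1 grouped convolution at a single pixel as a linear map on the channel vector. It is an illustration only, not the authors' implementation: the function names, the number of groups, and the channel width are all hypothetical. It checks numerically that transforming each channel slice in its own branch and concatenating the results gives the same output as one layer with a block-diagonal weight matrix, and it counts the weights of the grouped layer against a dense layer of the same size.

```python
import random


def matvec(weights, x):
    """Apply a linear map (given as a list of rows) to a channel vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def split_transform_concat(x, branch_weights):
    """Multibranch view: each branch only sees its own slice of channels."""
    width = len(x) // len(branch_weights)
    out = []
    for b, weights in enumerate(branch_weights):
        out.extend(matvec(weights, x[b * width:(b + 1) * width]))
    return out


def grouped_layer(x, branch_weights):
    """Grouped view: a single layer whose weight matrix is block-diagonal."""
    width = len(x) // len(branch_weights)
    full = []
    for b, weights in enumerate(branch_weights):
        for row in weights:
            padded = [0.0] * len(x)
            padded[b * width:(b + 1) * width] = row
            full.append(padded)
    return matvec(full, x)


random.seed(0)
groups, width = 4, 2  # hypothetical sizes: 4 branches of 2 channels each
x = [random.random() for _ in range(groups * width)]
branch_weights = [
    [[random.random() for _ in range(width)] for _ in range(width)]
    for _ in range(groups)
]

branch_out = split_transform_concat(x, branch_weights)
grouped_out = grouped_layer(x, branch_weights)
max_diff = max(abs(p - q) for p, q in zip(branch_out, grouped_out))
print("largest difference between the two views:", max_diff)

# Weight counts for the same input/output width.
dense_weights = (groups * width) ** 2      # every output sees every input
grouped_weights = groups * width * width   # only the diagonal blocks
print(dense_weights, grouped_weights)
```

With g groups, the grouped layer keeps only the diagonal blocks, so it uses 1/g of the weights of the dense layer; this is the source of the parameter savings discussed in the rest of the paper.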

Methods
EDSR has achieved good results in the super-resolution field, but it offers little improvement in parameter count over other algorithms. To reduce the number of parameters, the aggregated transformation method is applied to EDSR in this paper. The aggregated transformation method, by which the multibranch architecture of a network can be built in an easy way, was originally presented in ResNeXt. This method can reduce the number of parameters and the time complexity without significantly decreasing the accuracy of image classification.
The most direct way to turn EDSR into a multibranch architecture is to apply the aggregated transformation method to its residual block as it is. However, the original residual block of EDSR, which has two convolution layers, is inconsistent with the aggregated transformation method [24]. This direct transformation would result in a wide and dense model, which brings no benefit and only adds complexity. To solve this issue, we must redesign the model with a multibranch architecture. Three or more convolution layers are required in the residual block of the new model. To keep the structure of the residual block simple while enhancing its feature extraction capability, we adopted three convolution layers in this work. Compared with the original residual block shown in Figure 1a, our rebuilt residual block removes the unnecessary rectified linear unit (ReLU) and BN layers with reference to the EDSR structure. This removal helps improve the performance of image reconstruction.

As shown in Figure 1, the convolutional layer (Conv) performs feature extraction, and ReLU rectifies the layer output. The BN layer normalizes the features, and Addition denotes the element-wise addition that merges the skip connection with the output of the convolution path.
It is also known from the experiments by Lim et al. [1] that increasing the number of feature maps above a certain level makes the training process numerically unstable. The typical solution is to place a constant scaling layer (also called a MulConstant layer) after the last convolutional layer of each residual block. Owing to the use of aggregated transformations, the number of feature maps per convolution layer can be significantly reduced in comparison with the original EDSR model; therefore, the model proposed in this paper does not require the constant scaling layer. The results in the Experiment section show that adding a constant scaling layer can even worsen the performance. After removing the constant scaling layer, the architecture of our multibranch network is as shown in Figure 2. The ResBlock (residual block) is detailed in Figure 1c. Upsample (upsampling structure) magnifies the image by the desired factor.
As shown in Figure 3, we design two configurations of our multibranch architecture: EDSRSP-3×3 and EDSRSP-1×1. The suffix denotes the kernel size of the first and third convolution layers. The configuration of the residual block in EDSRSP-3×3 is the same as that in EDSR, i.e., a 3×3 convolution kernel, 256-d input, and 256-d output. It is seen from Table 1 that the number of parameters in EDSRSP-3×3 is reduced by 1/3 compared with EDSR. To further decrease the number of parameters, the configuration of EDSRSP-1×1 is adjusted as shown in Figure 3b. The adjustments consist of using a 1×1 convolution kernel in the first and third layers and a 512-d input and output in the second layer. EDSRSP-1×1 is similar to the bottleneck structure of ResNet, with only a small modification of the output dimension of the first layer. Due to the use of the 1×1 convolution kernel, the number of parameters in EDSRSP-1×1 is reduced to 1/4 of that in EDSR.
For the implementation of the aggregated transformation, our model has two equivalent structures, as shown in Figure 4. The two structures reach the same level of reconstruction performance, but the structure based on group convolution (Figure 4b) has distinct advantages in time complexity and memory usage. Therefore, we use group convolution to realize the aggregated transformation.
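The effect of the two configurations on a single residual block can be estimated with a simple weight count. The sketch below is a rough illustration under stated assumptions, not a reproduction of Table 1: biases are ignored, only the middle 3×3 layer is assumed to be grouped (as in ResNeXt), and the group count of 32 is a hypothetical value, since the number of groups is not given in this section. The helper names are likewise hypothetical.

```python
def conv_weights(c_in, c_out, kernel, groups=1):
    """Weights of a bias-free 2-D convolution with the given number of groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return kernel * kernel * (c_in // groups) * c_out


def edsr_block(width=256):
    """EDSR-style residual block: two dense 3x3 convolutions."""
    return 2 * conv_weights(width, width, 3)


def three_conv_block(width, inner, edge_kernel, groups):
    """Three-layer block: edge conv -> grouped 3x3 conv -> edge conv.

    Assumption: only the middle layer is grouped; `groups` is hypothetical.
    """
    return (conv_weights(width, inner, edge_kernel)
            + conv_weights(inner, inner, 3, groups)
            + conv_weights(inner, width, edge_kernel))


GROUPS = 32  # hypothetical cardinality, not taken from the paper

edsr = edsr_block()
sp_3x3 = three_conv_block(256, 256, edge_kernel=3, groups=GROUPS)
sp_1x1 = three_conv_block(256, 512, edge_kernel=1, groups=GROUPS)

print(edsr)    # 1179648
print(sp_3x3)  # 1198080
print(sp_1x1)  # 335872
```

These are per-block counts only. Whole-network totals such as those in Table 1 also depend on how many residual blocks each model stacks, which this sketch does not model.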

Datasets
For our experiments, the newly proposed Diverse 2K (DIV2K) dataset [26] is used because of its high-quality (2K) resolution for image reconstruction tasks. The DIV2K dataset consists of 800 training images, 100 validation images, and 100 test images. Since the ground truth of the test dataset has not been published, the performance comparison was made on the validation dataset. We also compared the performance on three standard benchmark datasets: Set5 [9], Set14 [12], and B100 [27].

PSNR and SSIM Criteria
Peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are the two most widely used indicators in the field of super-resolution reconstruction; both measure the similarity between the reconstructed image and the original high-resolution image [28,29]. The mathematical expression of PSNR is as follows:

$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^{n} - 1)^{2}}{\mathrm{MSE}},$$

where $n$ is the number of bits per pixel, and the mean square error (MSE) is defined as shown below:

$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( f(i,j) - \hat{f}(i,j) \right)^{2},$$

where $f(i,j)$ and $\hat{f}(i,j)$ represent the original and reconstructed images, respectively. Both of them are of size $M \times N$, and $(i,j)$ stands for the pixel coordinate. The larger the value of PSNR, the better the image reconstruction.
SSIM is another popular criterion for comparing the reconstructed image $x$ with the original high-definition image $y$. The formula of SSIM is as follows:

$$\mathrm{SSIM}(x,y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^{2} + \mu_y^{2} + c_1)(\sigma_x^{2} + \sigma_y^{2} + c_2)},$$

where $\mu_x$ and $\mu_y$ are the mean values of $x$ and $y$, $\sigma_x^{2}$ and $\sigma_y^{2}$ are the variances of $x$ and $y$, and $\sigma_{xy}$ is the covariance of $x$ and $y$. $c_1 = (k_1 L)^{2}$ and $c_2 = (k_2 L)^{2}$ are constants that keep the formula valid by preventing the denominator from being zero. $L$ represents the dynamic range of the pixel values, and $k_1 = 0.01$ and $k_2 = 0.03$ by default. The larger the value of SSIM, the more similar the two images.
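As a small worked illustration of the PSNR and SSIM definitions in this section, the sketch below computes both criteria for images stored as nested lists of pixel values. It is a simplified example rather than the evaluation code used in the paper: the function names are hypothetical, and the SSIM here is computed once from whole-image statistics, whereas practical implementations usually average SSIM over local windows.

```python
import math


def mse(ref, rec):
    """Mean square error between two equally sized images (lists of rows)."""
    m, n = len(ref), len(ref[0])
    total = sum((ref[i][j] - rec[i][j]) ** 2 for i in range(m) for j in range(n))
    return total / (m * n)


def psnr(ref, rec, bits=8):
    """PSNR in dB for `bits` bits per pixel; infinite for identical images."""
    err = mse(ref, rec)
    if err == 0:
        return math.inf
    peak = 2 ** bits - 1
    return 10 * math.log10(peak ** 2 / err)


def ssim_global(x, y, bits=8, k1=0.01, k2=0.03):
    """SSIM evaluated once over the whole image (no sliding window)."""
    xs = [v for row in x for v in row]
    ys = [v for row in y for v in row]
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    var_x = sum((v - mu_x) ** 2 for v in xs) / n
    var_y = sum((v - mu_y) ** 2 for v in ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys)) / n
    dyn = 2 ** bits - 1  # dynamic range L of the pixel values
    c1, c2 = (k1 * dyn) ** 2, (k2 * dyn) ** 2
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


original = [[100, 120], [140, 160]]
reconstructed = [[101, 119], [141, 159]]  # every pixel is off by one grey level

print(psnr(original, reconstructed))
print(ssim_global(original, reconstructed))
```

In this toy example every pixel differs by one grey level, so the MSE is 1 and the PSNR equals 20·log10(255), roughly 48.13 dB for 8-bit images.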

Training Details
For training, we use and adjust the training parameters given in Lim et al. [1]. Neither a pre-trained model nor the geometric self-ensemble strategy is used in this training. The chop size is set to 4.0 × 10^4, and the patch sizes for ×3/×4 are set to 96. We build on the code released with the EDSR paper and train the models on NVIDIA Titan Xp GPUs. The EDSR model used for comparison is retrained from the official baseline model with no modifications other than those mentioned above. Training EDSR takes seven days, compared with three days for our models.

Comparison between the Cases with and without MulConstant Layer
To analyze the effect of the MulConstant layer in our residual block, we performed experiments on the EDSRSP-1×1 ×4 model and the EDSRSP-3×3 ×2 model. The three experiments correspond to three different cases: (1) without the MulConstant layer; (2) MulConstant layer with the factor set to 0.1; (3) MulConstant layer with the factor set to 0.01. The experimental results in Figure 5 show that removing the MulConstant layer from our model results in better performance.
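The three cases can be pictured with a toy residual block. The sketch below is only a schematic, assuming that the MulConstant layer multiplies the output of the convolution path by a constant factor before the skip connection is added; the function names are hypothetical and `toy_body` is a stand-in for the real convolution layers.

```python
def residual_block(x, body, scale=None):
    """Return x + scale * body(x); scale=None means no MulConstant layer."""
    residual = body(x)
    if scale is not None:
        residual = [scale * r for r in residual]
    return [xi + ri for xi, ri in zip(x, residual)]


def toy_body(x):
    """Stand-in for the convolution path of the block (assumption)."""
    return [2.0 * v for v in x]


features = [1.0, -0.5, 3.0]
for factor in (None, 0.1, 0.01):  # the three cases compared in this section
    print(factor, residual_block(features, toy_body, factor))
```

A smaller factor shrinks the contribution of the convolution path relative to the skip connection, which stabilizes very wide blocks but also limits what each block can add to its input.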

Evaluation on DIV2K Dataset
For the performance evaluation, the retrained EDSR model is compared with our models in Figure 6. The evaluation method follows the one described in Lim et al. [1]. Using the PSNR and SSIM criteria, the evaluation is conducted on 10 images of the DIV2K validation set. Concretely, we use full RGB channels and ignore (6 + scale) pixels from the border. The small difference between EDSR and our models verifies the performance of the proposed method.
Electronics 2019, 8, x FOR PEER REVIEW 7 of 11
Table 2 gives the PSNR and SSIM scores of EDSR and our models on the DIV2K validation set; the results are consistent with those in Figure 6. In addition, visual comparisons of the super-resolution images are shown in Figure 7. It can be seen that our models reconstruct both details and textures with high quality. We also measured the running time on the pictures in Figure 7; the results are shown in Table 3. The data in the table show that the proposed models run faster than EDSR.
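The DIV2K evaluation protocol described in this section (full RGB channels, with (6 + scale) border pixels ignored) can be sketched as follows. This is an illustrative reading of the protocol, not the authors' evaluation script: it assumes the border is removed on all four sides, stores each pixel as an (R, G, B) tuple, and uses hypothetical function names.

```python
import math


def shave(img, scale, base=6):
    """Drop (base + scale) pixels from every side of an image stored as rows."""
    border = base + scale
    if len(img) <= 2 * border or len(img[0]) <= 2 * border:
        raise ValueError("image is too small for the requested border")
    return [row[border:-border] for row in img[border:-border]]


def psnr_rgb(ref, rec, scale, bits=8):
    """PSNR over all three colour channels after shaving the border."""
    ref_c, rec_c = shave(ref, scale), shave(rec, scale)
    squared, count = 0.0, 0
    for row_a, row_b in zip(ref_c, rec_c):
        for pix_a, pix_b in zip(row_a, row_b):
            for ch_a, ch_b in zip(pix_a, pix_b):
                squared += (ch_a - ch_b) ** 2
                count += 1
    if squared == 0:
        return math.inf
    return 10 * math.log10((2 ** bits - 1) ** 2 / (squared / count))


# Toy 20x20 image; with scale=2 the shaved region is the central 4x4 block.
reference = [[(128, 128, 128)] * 20 for _ in range(20)]
result = [list(row) for row in reference]
result[0][0] = (0, 0, 0)            # error in the border: ignored
print(psnr_rgb(reference, result, scale=2))
result[10][10] = (132, 128, 128)    # error inside the evaluated region
print(psnr_rgb(reference, result, scale=2))
```

The first call reports an infinite PSNR because the only error lies in the ignored border; the second call picks up the error inside the evaluated region.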

Evaluation on Other Datasets
Further experiments were carried out on the standard datasets B100, Set5, and Set14. For this comparison, we measured PSNR and SSIM on the Y channel, ignoring a border whose width in pixels equals the scale factor. The MATLAB code provided with the EDSR paper was used for this evaluation. As can be seen from Table 4, our models can, in principle, achieve the same level of performance as EDSR with fewer parameters. The experimental results show that, while preserving the reconstruction quality, the proposed models have clear advantages in time complexity and space complexity. This also means a reduced demand for hardware resources in practical applications, which makes our models easier to deploy in real conditions.
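For readers who want to reproduce the Y-channel measurement outside MATLAB, the conversion step can be sketched as below. The coefficients are those of the ITU-R BT.601 luma conversion with the 16 to 235 range, which is what MATLAB's `rgb2ycbcr` applies to 8-bit images; that the EDSR evaluation script uses exactly this conversion is an assumption here, and the function names are hypothetical.

```python
def rgb_to_y(r, g, b):
    """Luma of an 8-bit RGB pixel (BT.601, range 16..235)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def y_channel(img):
    """Convert an image of (R, G, B) tuples into a 2-D list of luma values."""
    return [[rgb_to_y(*pixel) for pixel in row] for row in img]


def shave(img, scale):
    """Ignore `scale` pixels on every side, as described for these datasets."""
    return [row[scale:-scale] for row in img[scale:-scale]]


image = [[(255, 255, 255)] * 8 for _ in range(8)]
luma = shave(y_channel(image), scale=2)
print(len(luma), len(luma[0]), luma[0][0])
```

The resulting single-channel image can then be passed to any PSNR or SSIM routine that operates on 2-D arrays.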

Conclusions
In this paper, we propose an efficient super-resolution network based on aggregated residual transformations. Based on the proposed network, two specific models were designed and built in this work. Each of the two models has its own advantages regarding the reconstruction performance and the number of parameters. Experiments on the DIV2K dataset and other standard datasets were carried out to evaluate the performance of our network. The experimental results show that our method is effective and easy to implement. Compared with EDSR, the number of parameters is significantly reduced while the performance stays at the same level.

Figure 1. Comparison of residual blocks in the original ResNet, enhanced deep super-resolution network (EDSR), and our model. (a) Original ResNet residual block; (b) EDSR residual block; (c) Our proposed residual block.

Figure 2. The architecture of the proposed multibranch network.

Figure 7. Super-resolution reconstruction results on the DIV2K dataset.

Table 1. Parameters of EDSR and our models.

Table 3. Running time (s) comparison between EDSR and proposed models.
