Multi-Branch Deep Residual Network for Single Image Super-Resolution

Abstract: Recently, algorithms based on deep neural networks and residual networks have been applied to super-resolution and have exhibited excellent performance. In this paper, a multi-branch deep residual network for single image super-resolution (MRSR) is proposed. In the network, we adopt a multi-branch framework and further optimize the structure of the residual network. By using residual blocks and filters reasonably, the model size is greatly expanded while stable training is still guaranteed. Besides, a perceptual evaluation function, which consists of three loss terms, is proposed. The experimental results show that this evaluation function contributes substantially to reconstruction quality and competitive performance. The proposed method uses three steps, feature extraction, mapping, and reconstruction, to complete super-resolution reconstruction, and shows superior performance over other state-of-the-art super-resolution methods on benchmark datasets.


Introduction
Single image super-resolution (SISR) is an important topic in digital image processing and computer vision. SISR aims to recover a high resolution (HR) image from its low resolution (LR) counterpart. Generally, many studies assume that the LR image is down-sampled from the HR image, e.g., by bicubic interpolation, with a scale factor. In the past decades, the problem of image super-resolution [1] (SR) has attracted extensive attention. SR techniques have been applied to satellite imaging [2], medical imaging [3,4], face recognition [5], and surveillance [6]. Inherently, SR is a highly ill-posed problem since much high-frequency information is lost in the LR image. Furthermore, the mapping from an LR image to an HR image is one-to-many and admits many solutions. Therefore, SR can be considered as an inference problem, which needs to reconstruct the missing high-frequency data from the low-frequency components.
Thus far, many methods based on deep convolutional neural networks [7][8][9][10] have been proposed for single image super-resolution and show excellent performance. These approaches apply the back-propagation algorithm [11] to train on large image datasets in order to learn the non-linear mappings between LR images and HR images. Compared with previous statistics-based [12][13][14][15] and patch-based [16][17][18][19][20][21][22] models, these techniques provide improved performance in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM). However, due to the characteristics of deep convolutional networks, these works still have some defects: minor changes in network structure or different training methods can cause huge differences in the reconstruction results.

Image Super-Resolution
Thus far, many methods have been proposed to solve the super-resolution problem. They can be categorized into four types: image statistical methods, prediction-based methods, edge-based methods, and patch-based methods. Early algorithms apply interpolation techniques and statistical methods [25,26] to SR, but can hardly recover the lost details and realistic textures. Then, methods based on prediction, which were the first techniques for SISR, were proposed to reconstruct higher resolution images. While those filtering algorithms use bicubic or linear filtering and thus oversimplify the SISR problem, methods based on edge preservation [27,28] have been proposed. These approaches have an advantage in speed, but tend to produce overly smooth textures in reconstruction. In addition, many works based on patches [14,26,29,30] have also been designed for SR. Exploiting patch redundancies across scales within the image, Glasner et al. [29] proposed an algorithm that has driven the research and development of SR. Compared with the other three categories, patch-based methods exhibit superior performance.
In order to achieve a better reconstruction, a superior mapping function from low resolution to high resolution is necessary. Among the existing techniques, works based on deep neural networks are considered to have a strong capability to achieve such a mapping in image super-resolution. Dong et al. [10] first adopted a convolutional neural network (CNN) architecture to solve the problem of SISR. They use a three-layer fully convolutional network to achieve state-of-the-art SR performance. This attempt shows that the CNN model can further improve reconstruction both in terms of speed and quality. Subsequently, various algorithms based on the CNN model, including the deeply-recursive convolutional network (DRCN) [8], have been proposed for SR. With the residual network introduced by He et al. [23], CNN methods can be used to train much deeper networks and obtain better performance. For example, EDSR, proposed by Lim et al. [24], uses enhanced deep residual networks for SISR and achieves improved performance. Moreover, Ledig et al. [9] first adopted the generative adversarial network [31] (GAN) for SISR. In their model, the powerful data generation capability of the GAN and an appropriate evaluation function based on the Nash equilibrium are well applied.
To train network models better, a perceptual loss function is needed. As the most common loss function, the mean squared error [32,33] (MSE) has been widely used to compute the pixel-wise error in general image restoration. Meanwhile, the MSE is usually used to compute the peak signal-to-noise ratio (PSNR), which is a major performance measure in reconstruction. Besides, the structural similarity index (SSIM) is another evaluation index for SR. A higher PSNR or SSIM corresponds to better recovery quality.
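As a concrete reference, the PSNR follows directly from the MSE. The sketch below (NumPy, assuming a peak value of 255 for 8-bit images) is illustrative only, not the evaluation code used in the paper:

```python
import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio: PSNR = 10 * log10(peak^2 / MSE)."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images: infinite PSNR by convention
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, a reconstruction that is wrong by the full dynamic range at every pixel scores exactly 0 dB, while a perfect reconstruction scores infinity.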

Residual Network in Super-Resolution
Residual network (ResNet) is an improved kind of neural network, which achieves identity mappings by applying shortcut connections in its structure. This design not only eliminates the degradation problem during SGD optimization, but also enables the data to flow across layers. It is this characteristic that allows the network to be extended deeper. Compared with a single deep convolutional neural network, ResNet can effectively avoid overfitting and achieve better performance. Thus far, ResNet has been widely used in deep networks to deal with computer vision problems. Meanwhile, these properties make ResNet well suited to solving the SR problem. The methods of SRResNet and EDSR, which apply the residual network in their structures, have achieved state-of-the-art performance in SISR.

Residual Blocks
Since residual networks were proposed to solve computer vision problems such as image classification and detection, they have shown superior performance, especially for tasks ranging from low-level to high-level vision. SRResNet employs the ResNet architecture directly to complete the SR reconstruction. EDSR improves the performance by adjusting the ResNet structure. In the proposed model, the ResNet architecture is further improved and achieves better performance in SR. Compared with the original residual network, we optimize the structure by substituting the BN layers with convolution layers. Further, unlike existing residual network structures, which obtain the block output from the input and the output of the last layer, the newly proposed ResNet structure combines the input with the output of every layer to get the block output, as shown in Figure 1. The experiments show that these modifications not only speed up the convergence of the network, but also substantially improve the performance in terms of image details and textures.
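The structural difference can be illustrated with a toy one-dimensional sketch in which each "layer" is simply a callable; the function names here are ours, and real blocks would of course use convolutions rather than scalar maps:

```python
def standard_residual_block(x, layers):
    """Classic ResNet block: output = input + output of the final layer."""
    h = x
    for layer in layers:
        h = layer(h)
    return x + h

def mrsr_style_residual_block(x, layers):
    """Modified block, as described in the text: the input is combined
    with the output of *every* layer, not just the last one."""
    outputs = []
    h = x
    for layer in layers:
        h = layer(h)
        outputs.append(h)
    return x + sum(outputs)
```

With two toy layers that double and then triple their input, an input of 1.0 yields 7.0 for the standard block (1 + 6) but 9.0 for the modified block (1 + 2 + 6), since every intermediate output contributes to the result.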

Model Architecture
In this section, the proposed model structure, outlined in Figure 2, will be introduced. Our model conceptually consists of three parts: Feature extraction, mapping, and reconstruction. The feature extraction operation extracts features from the input LR image and represents them as a set of feature maps that are ready for the subsequent mapping operation. In order to deliver more information to the next operation, in addition to using multiple filters and residual blocks to operate on the input, we also use a skip connection to send the input directly to the mapping network. Then, the non-linear mapping from LR to HR is performed by the mapping operation, which is the main component that solves the SR task. Obviously, the quality of SR mainly depends on the performance of the mapping network. In the proposed model, the mapping network is composed of five branches. Each branch applies residual blocks and filters with different parameters to achieve an effective mapping from LR to HR feature maps. Moreover, convolution layers are inserted between every two branches, which results in a different size for each branch network. This design makes it feasible to implement a multi-scale network and takes advantage of inter-scale correlation. Finally, the reconstruction networks undertake the task of rebuilding the super-resolution image. Since the outputs of the branches differ not only in size but also in number, every branch has its own independent reconstruction subnetwork. Every reconstruction network combines the output of the corresponding branch network with the original input to restore the SR image. Furthermore, we apply sub-pixel convolution layers to every reconstruction network to upscale the LR image. Compared with the deconvolution layer or other implementations that use various forms of upscaling before convolution, the sub-pixel convolution layer is faster to train.
According to the structure of MRSR, the final SR image is derived from each reconstruction network.
In the proposed architecture, the feature extraction module consists of ten residual blocks and filters with 3 × 3 kernels, which allows more detailed texture information and hidden states to be passed. As the most important component of the model, the mapping network combines the advantages of multi-branch networks and residual networks, with kernels set to 5 × 5. By adopting larger kernels, a larger receptive field can be covered in the mapping network. Our model has a receptive field approximately 50 times larger than that of DRCN.
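The rearrangement step of a sub-pixel convolution layer can be sketched in NumPy as a depth-to-space operation. This follows the usual pixel-shuffle convention (a (C·r², H, W) tensor becomes (C, H·r, W·r)); the preceding convolution of the actual layer is omitted here:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) array into (C, H*r, W*r), the final
    step of a sub-pixel convolution layer with upscaling factor r."""
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r^2"
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)
```

Each group of r² input channels is interleaved into an r × r block of output pixels, so upscaling happens at the very end of the network and all earlier convolutions run on the cheaper LR grid, which is why this layer trains faster than upscale-then-convolve designs.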

Figure 2. The architecture of our proposed SR network (MRSR), which consists of three parts: Feature extraction, mapping, and reconstruction. The feature extraction is composed of multiple filters and residual blocks. The non-linear mapping between LR and SR adopts the multi-branch network structure and each branch is made up of residual blocks. In the reconstruction, the final output is restored from every branch output and the LR input with different weights.


Training
As described in Section 3.2, the feature extraction network takes the low-resolution image I_LR as input. Assuming that this part of the network is a model F, we obtain the output F(I_LR), which is the input to the mapping network M. Determined by the multi-branch structure, the output of each branch network M_n, n = 1, 2, ..., N, can be described as:

M_n = g_n(F(I_LR)), n = 1, 2, ..., N,

where the operator g denotes the function that represents each branch network. Since every branch network has different components, each g_n, n = 1, 2, ..., N, represents a different function expression. The reconstruction network R takes M_n as input and completes the reconstruction of the super-resolution images. Under branch supervision, the predictions R_n, n = 1, 2, ..., N, from each reconstruction network are:

R_n = h_n(M_n), n = 1, 2, ..., N,

where, in the same way as for the mapping network, the function h completes the reconstruction task of each branch. This process can be considered, in a sense, as the inverse operation of the feature extraction network. Following our model, the final output I_SR is the weighted average of all intermediate predictions and the original input I_LR:

I_SR = w_0 U(I_LR) + Σ_{n=1}^{N} w_n R_n,

where U denotes upscaling of the LR input to the HR size. Obviously, w_n represents the weight of each intermediate prediction in the reconstruction. Learned during training, these weights directly determine the final reconstruction quality.
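The weighted fusion of the branch predictions with the upscaled LR input can be sketched as follows. The helper name and the normalization of the weights are our own illustrative choices; in the model itself the weights are learned parameters:

```python
import numpy as np

def fuse_predictions(branch_outputs, upscaled_lr, weights):
    """Combine the N intermediate predictions R_n and the upscaled LR
    input into I_SR as a weighted average.  `weights` holds one weight
    per branch output plus one for the upscaled LR input."""
    terms = list(branch_outputs) + [upscaled_lr]
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()  # normalize so the fused image stays in range
    return sum(wi * t for wi, t in zip(w, terms))
```

With equal weights, the fusion reduces to the plain average of all predictions and the upscaled input.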
In order to achieve superior performance, besides an ingenious network architecture, a loss function that is as accurate as possible is also necessary. Here, the training loss function, which finds optimal parameters for the proposed network, will be introduced. First, the difference between the super-resolution image and the high-resolution image is the most intuitive measure. Combined with the L2 loss, which is generally preferred since it minimizes the MSE, this part of the loss function is defined as:

l_MSE = (1 / (W H)) Σ_{x=1}^{W} Σ_{y=1}^{H} (I_HR(x, y) − I_SR(x, y))²,

where W and H are the image size, and I_SR and I_HR represent the reconstructed image and the reference image, respectively. According to the proposed model, every branch network needs to be supervised:

l_B = (1 / N) Σ_{n=1}^{N} (1 / (W H)) Σ_{x=1}^{W} Σ_{y=1}^{H} (I_HR(x, y) − R_n(x, y))².

Furthermore, based on the ideas of Johnson et al. [34] and Bruna et al. [35], a pre-trained 16-layer VGG network is adopted to compute the VGG loss l_VGG, which is closer to perceptual similarity. The l_VGG is based on the ReLU activation layers of the VGG network described by Simonyan and Zisserman [36]. Each activation layer of the VGG network yields different feature maps for the two inputs, the reconstructed image I_SR and the ground-truth image I_HR. Then, we define l_VGG as the Euclidean distance between these feature maps:

l_VGG = (1 / (X Y)) Σ_{x=1}^{X} Σ_{y=1}^{Y} (ψ_{i,j}(I_HR)(x, y) − ψ_{i,j}(I_SR)(x, y))²,

where ψ_{i,j} denotes the i-th feature map of the j-th activation layer of the VGG network, and X and Y are the dimensions of the feature maps. For the final loss function, we have:

l = l_MSE + α l_B + β l_VGG,

where α and β represent the weights of the partial losses, which are set between 0 and 1. Based on a series of experiments, we find that a high α makes the model stable and easy to converge.
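The three-part loss described above can be sketched as below. `feature_fn` stands in for a VGG activation layer (any array-to-array map serves for illustration), and the function name and default weights are our own assumptions, not values from the paper:

```python
import numpy as np

def mse(a, b):
    """Pixel-wise mean squared error between two arrays."""
    return np.mean((a - b) ** 2)

def total_loss(sr, hr, branch_preds, feature_fn, alpha=0.9, beta=0.1):
    """Three-part loss: l = l_MSE + alpha * l_B + beta * l_VGG, where
    l_B averages the MSE of every branch prediction (branch supervision)
    and l_VGG compares feature maps instead of raw pixels."""
    l_mse = mse(sr, hr)                                   # pixel loss
    l_branch = np.mean([mse(p, hr) for p in branch_preds])  # supervision
    l_vgg = mse(feature_fn(sr), feature_fn(hr))           # perceptual
    return l_mse + alpha * l_branch + beta * l_vgg
```

A perfect reconstruction with perfect branch predictions yields a loss of exactly zero, since all three terms vanish.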

Datasets
For fairness, we implement all experiments on standard datasets that have been widely used for other image restoration tasks. These datasets mainly include Set5 [37], Set14 [38], BSD100 [39], and Urban100 [40]. Meanwhile, our model is also tested on the newly proposed high-quality image dataset DIV2K [41], which contains 1000 images for training and testing and is the official dataset of the NTIRE 2017 Super-Resolution Challenge.

Training Details
For all experiments, the RGB HR image is down-sampled using a bicubic kernel with a scale factor r. This is a common method to obtain the LR image, and it has been applied in other state-of-the-art methods for the SR problem. The HR images used in all experiments are cropped to 96 × 96 with a batch size of 16. In training, our network is trained with an initial learning rate of 10^-4 for 10^6 update iterations on a GTX 1080 GPU. For optimization, we use ADAM (Adaptive Moment Estimation) with the settings β1 = 0.9, β2 = 0.99, and ε = 10^-8.
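A single ADAM update with these hyperparameters can be written out explicitly. This is a scalar sketch for illustration; in practice the training framework's optimizer handles this per parameter tensor:

```python
from math import sqrt

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.99, eps=1e-8):
    """One ADAM update with the hyperparameters used in the paper
    (learning rate 1e-4, beta1 = 0.9, beta2 = 0.99, eps = 1e-8)."""
    m = beta1 * m + (1 - beta1) * grad       # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction (step t >= 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (sqrt(v_hat) + eps)
    return theta, m, v
```

On the very first step the bias correction makes the effective step size equal to the learning rate (up to eps), regardless of the gradient's magnitude.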
In terms of network architecture, we use 60 residual blocks in total. The weights in all layers are initialized by the same method, and the biases are set to zero. To achieve a fair comparison with other SR reconstruction methods, we take the PSNR and SSIM as the performance metrics to evaluate the experimental results.

Comparisons with State-of-the-Art Methods
In this section, qualitative and quantitative comparisons with other state-of-the-art SR approaches are presented. We mainly compare the performance of MRSR with the results of Bicubic, A+ [21], SRCNN, DRCN, SRResNet, EDSR, and the most recent work WSD [42]. For benchmarking, the public code of these algorithms and the same technique to obtain the LR images from the HR images were used in the experiments. For a fair visual comparison, we also adopt the same method to deal with the luminance components.
Besides, compared with EDSR, the new ResNet structure uses fewer filters and parameters. The proposed model uses only 128 filters, which is 50% of EDSR. Not only that, the number of parameters has been reduced to about 12 M, which is 0.28 times that of EDSR, as shown in Table 1. Consequently, the GPU memory usage and the training difficulty are dramatically reduced. Moreover, the optimized ResNet architecture exhibits better performance with less computation.
The summary quantitative results on several datasets are presented in Table 2. As can be seen from the table, our model exhibits performance superior to the existing methods on all datasets and scale factors in terms of PSNR and SSIM. In addition, visual comparisons of the super-resolution images are shown in the figures. It can be seen intuitively from the figures that our reconstructed images show higher quality in both details and textures and exhibit more realistic outputs compared with previous works. In Figure 6, the proposed approach is also compared with DRCN quantitatively. Moreover, the comparison between our algorithm and the most recent work WSD, which uses a Wiener filter in the similarity domain to achieve single image super-resolution reconstruction, is shown in Figure 7. Obviously, the SR result is greatly improved.


Conclusions
In this paper, we present a multi-branch deep residual algorithm for the SR problem. By optimizing the residual network, our model achieves better performance with fewer parameters and filters on all datasets. Coupled with the use of multi-branch networks, the training and convergence problems are partly solved. Due to the proposed supervision function, the reconstructed images show better performance in edge details and textures compared with other existing reconstruction methods. Furthermore, we develop a multi-scale SR residual network that achieves a superior mapping between the LR and SR images by inserting reduced-dimensional convolution layers between every two adjacent branch networks. The experimental results prove that the proposed approach achieves state-of-the-art performance in terms of PSNR and SSIM. In the future, we will continue to improve our algorithm for superior performance.