Multi-Scale Inception Based Super-Resolution Using Deep Learning Approach

Abstract: Single image super-resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) image. To address the SISR problem, deep convolutional neural networks (CNNs) have recently achieved remarkable progress in terms of accuracy and efficiency. In this paper, an innovative technique, namely multi-scale inception-based super-resolution (SR) using a deep learning approach, or MSISRD, is proposed for fast and accurate SISR reconstruction. The proposed network employs a deconvolution layer to upsample the LR image to the desired HR image, in contrast to existing approaches that use interpolation techniques to upscale the LR image. Interpolation techniques are not designed for this purpose, which results in undesired noise in the model. Moreover, existing methods mainly focus on shallow networks or on stacking multiple layers to create a deeper architecture, a design that causes the vanishing gradient problem during training and increases the computational cost of the model. Our proposed method does not use any hand-designed pre-processing steps, such as bicubic interpolation. Furthermore, an asymmetric convolution block is employed to reduce the number of parameters, in addition to an inception block adopted from GoogLeNet to reconstruct the multi-scale information. Experimental results demonstrate that the proposed model outperforms twelve state-of-the-art methods in terms of average peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), with a reduced number of parameters, for scale factors of 2×, 4×, and 8×.


Introduction
Super-resolution (SR) is an image, video, and computer vision task that reconstructs a high-quality or high-resolution (HR) image with rich texture detail from one or more low-quality or low-resolution (LR) images [1,2], under constrained conditions and low-cost imaging systems. Despite its difficulty and limitations, SR can be applied in real-world applications, such as security and surveillance imaging systems [3], face recognition [4], and medical [5] and satellite imaging systems [6].
However, SR is a classical, challenging, ill-posed problem. To handle the ill-posed nature of SR reconstruction, researchers in image and video recognition have proposed different algorithms. Earlier methods include interpolation- and reconstruction-based techniques. Examples of interpolation-based techniques are cubic interpolation [7], nearest-neighbor interpolation [8], and edge-guided interpolation [9]. These methods usually perform well and are very easy to implement, but they still generate ringing and jagged artifacts. The Super-Resolution Convolutional Neural Network (SRCNN) [26] proposed by Dong et al. has three main drawbacks. First, the LR input must first be upscaled by bicubic interpolation to the desired size. Second, the reconstructed detail information is still unsatisfactory. Third, training convergence is too slow. Wang et al. [27] proposed Deep Networks for Image Super-Resolution with Sparse Prior, named the sparse coding based network (SCN). This approach is simple and achieves notable performance gains over SRCNN [26].
Dong et al. [15] improved SRCNN [26] with the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [15] by introducing a deconvolution layer as the last layer of the model, with a stride equal to the scale factor. FSRCNN [15] has a simple network architecture consisting of four convolution layers and one transposed convolution layer, and it uses the original LR image without bicubic interpolation. FSRCNN [15] achieves better performance and lower computational cost than SRCNN [26] but has limited network capacity.
Shi et al. [28] proposed the Efficient Sub-Pixel Convolutional Neural Network (ESPCN), which uses the same strategy introduced by FSRCNN [15] to reduce model complexity, with a sub-pixel convolution layer to upscale the information.
Kim et al. [16] proposed Very Deep Super-Resolution (VDSR) [16], which uses a global residual connection to reduce training complexity, leading to faster convergence of the model and strong performance. The main purpose of VDSR [16] is to predict the residual, rather than the actual, pixel values.
Building on the success of the UNet [29] architecture, the work in [30] proposed the Residual Encoder-Decoder Network (REDNet). REDNet [30] consists of two parts: an encoder network and a decoder network. Convolution layers are used on the encoder side, and deconvolution layers are used on the decoder side.
Kim et al. [17] applied the same convolution layers multiple times and proposed the Deep Recursive Convolutional Network (DRCN) [17]. The main advantage of this architecture is that the number of model parameters remains fixed even as the number of recursions grows.
Lai et al. [18] proposed the Laplacian pyramid super-resolution network (LapSRN) [18], which reconstructs multiple images progressively with different scale factors. Deconvolution was proposed in [31-33]; it can be viewed as pointwise multiplication of each input pixel by a kernel, which increases the output size when the stride is greater than one. LapSRN [18] uses three types of layers: convolution layers, leaky rectified linear unit (LReLU) layers, and transposed (deconvolution) layers. The training dataset is the same as for SRCNN [26].
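The enlarging behavior of a strided deconvolution described above can be sketched in one dimension. This is an illustrative toy example, not the paper's implementation: each input element scales the kernel, and the scaled copies are summed at stride-spaced offsets.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1-D transposed convolution: each input element multiplies the
    kernel, and the scaled copies are accumulated at stride-spaced offsets."""
    k = len(kernel)
    out = np.zeros((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride : i * stride + k] += v * kernel
    return out

x = np.array([1.0, 2.0, 3.0])
up = transposed_conv1d(x, np.array([1.0, 1.0]), stride=2)
# input length 3 -> output length (3 - 1) * 2 + 2 = 6
```

With stride 2 the three input samples are spread over six output positions, illustrating how a deconvolution layer with stride equal to the scale factor performs the upsampling itself.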
The residual neural network (ResNet) [34], proposed by He et al., solves the vanishing/exploding gradient problem in very deep neural networks during training. ResNet [34] comes in variants with different depths, such as 34, 50, 101, 152, and even 1202 layers. The most popular version, ResNet-50, contains 50 CNN layers and one fully-connected layer at the end of the network. In [19], the authors proposed the SRResNet [19] architecture with 16 residual blocks. Each block is made up of two convolution layers, each followed by a batch normalization (BN) layer [35] and a parametric rectified linear unit (PReLU) activation function. It uses neither pre-processing nor residual learning. Transposed convolution is used to upscale the LR image, and BN [35] is used to stabilize the training procedure.
Ren et al. [36] proposed Context-wise Network Fusion (CNF), in which each SRCNN [26] model is constructed with a different number of layers; finally, each SRCNN [26] model output is passed through a single convolution layer and fused with a sum-pooling layer.
The Deep CNN with Skip Connection and Network in Network, abbreviated as the DCSCN architecture [37], is a shallower model than VDSR [16] that introduces skip connections at different stages and directly uses the LR image as input. The DCSCN [37] model consists of different modules, such as a feature extraction network and a reconstruction network, which provide better SR performance.
In [39], the super-resolution network for multiple degradations (SRMD) was proposed, which concatenates the LR image with its degradation maps. The network architecture is the same as in [14,40,41]. First, a 3 × 3 convolution filter is cascaded, followed by a sequence of convolution, rectified linear unit (ReLU) [42], and BN [35] layers. The authors also introduced the SR network for the multiple-degradations noise-free degradation model (SRMDNF).
Mei et al. [41], inspired by image SR via SRResNet [19] and LapSRN [18], proposed a new concept, the Super-Resolution Squeeze-and-Excitation Network (SrSENet) [41], for SISR. Utilizing the SrSEBlock with deep residual networks provides better feature extraction by modeling the correlations between channels of the feature maps from the LR image.
In [43], Chu et al. introduced a multi-objective oriented algorithm, known as Multi-Objective Reinforced Evolution in Mobile Neural Architecture Search (MOREMNAS), drawing on the strengths of both evolutionary algorithm (EA) and reinforcement learning (RL) methods. The authors also introduced different versions of the model, MOREMNAS-A, -B, and -C, and the dominant version, MOREMNAS-D [43].
Many modern SR networks, such as FSRCNN [15], LapSRN [18], SrSENet [41], and DCSCN [37], achieve better results by using deconvolution as the upsampling module. However, the computational complexity of the forward and backward propagation of deconvolution [44] remains a major concern. These methods promise low computational complexity and better perceptual quality, but there is still plenty of room for improvement in SR performance.

Proposed Method
In this section, we describe the design procedure of our proposed MSISRD method in detail. Initially, the input LR image passes through three stacked CNN layers, each followed by ReLU [42], with a skip connection. This process produces a summed output containing detailed feature information, and the number of parameters is thus reduced. Afterward, the information is fed to the deconvolution layer for upsampling. The upsampled LR information is sent through two asymmetric residual blocks to reduce training complexity and reconstruct mid-level feature information. The inception block is used in multi-scale reconstruction stage-II to reconstruct the final HR image, as shown in Figure 1.

Feature Extraction
Inspired by VDSR [16], we propose three trainable convolution layers with a 3 × 3 kernel size and 64 filters, each followed by a ReLU [42] activation function, which directly extract feature information from the original LR image Y. Mathematically, the convolution layer can be represented as

F_l = W_l ∗ G_(l−1),

where l indexes the l-th convolution layer, W_l represents the filters of the l-th layer, G_(l−1) denotes the output feature map of the previous layer, F_l is the output feature map, and '∗' represents the convolution operation. The ReLU [42] activation response can be calculated as a general activation function,

Y = max(0, x),

where x is the input of the activation on the l-th layer and Y is the ReLU [42] activation output of the feature maps. The final output of the convolution layer can then be defined as

G_l = max(0, W_l ∗ G_(l−1) + b_l),

where G_l represents the final output feature map of the l-th layer, and b_l and W_l denote the bias and weight of the convolution filter of the l-th layer, respectively. Inspired by ResNet [34], the feature-map output of the first layer is added to the third layer through a skip connection with identity mapping.
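As an illustrative sketch (not the trained network), the chain G_l = max(0, W_l ∗ G_(l−1) + b_l) with a skip connection can be traced in NumPy on a single-channel toy image; the identity kernel is chosen only so the result is easy to verify.

```python
import numpy as np

def conv2d(img, kernel):
    """'Same' 2-D convolution (cross-correlation, as in CNNs) of a
    single-channel image with zero padding."""
    kh, kw = kernel.shape
    p = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(p[i:i + kh, j:j + kw] * kernel)
    return out

def conv_relu(img, kernel, bias):
    # G_l = max(0, W_l * G_(l-1) + b_l)
    return np.maximum(0.0, conv2d(img, kernel) + bias)

y = np.arange(16, dtype=float).reshape(4, 4)  # toy LR input
k = np.zeros((3, 3)); k[1, 1] = 1.0           # identity kernel (for checking)
g1 = conv_relu(y, k, 0.0)
g2 = conv_relu(g1, k, 0.0)
g3 = conv_relu(g2, k, 0.0) + g1               # skip connection from layer 1
```

With the identity kernel, the skip connection makes the third-layer output exactly twice the input, mirroring how the identity-mapping shortcut sums first-layer features into the third layer.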

Deconvolution
A basic approach to recovering SR images is to upscale the original LR image using interpolation techniques to obtain the HR image. Such an approach is very easy and fast to implement. However, interpolation techniques were not designed for upscaling the original LR image to recover the HR image, and these approaches can even damage important LR information. Furthermore, they spend more computational time in pre-processing without any obvious advantage. Shi et al. [28] proposed the idea of a sub-pixel convolution layer to recover the HR image directly, but this approach does not fully utilize the information relating the LR domain to the HR domain. LapSRN [18] introduced multiple transposed convolution layers applied progressively with different upscaling factors and obtained relatively faster and more accurate mapping from the LR to the HR image.
Following the common architecture of CNN-based SR, the deconvolution layer is used to upsample the previous feature results with a number of convolution kernels. The quality of the LR image is improved by increasing the kernel size of the deconvolution layer, but a larger kernel size also increases the computational complexity. In our proposed approach, we apply two 1 × 1 convolution operations, one before and one after the deconvolution layer. The first 1 × 1 kernel performs dimension reduction, changing the 64 feature maps into 4 feature maps for upsampling, and the last convolution kernel recovers the feature information back to 64 channels. The upsampling layer serves as the bridge between the two 1 × 1 convolution layers and uses a different kernel size for each scale factor: 14 × 14, 16 × 16, and 18 × 18 for enlargement factors of 2×, 4×, and 8×, respectively.
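Along each dimension, a transposed convolution produces an output of size (n − 1)·stride − 2·pad + kernel. The paddings below are our assumption (the paper does not state them); they are chosen so that the quoted kernel sizes yield exact 2×, 4×, and 8× enlargement when the stride equals the scale factor.

```python
def deconv_out(n, kernel, stride, pad):
    """Output length along one dimension of a transposed convolution:
    (n - 1) * stride - 2 * pad + kernel."""
    return (n - 1) * stride - 2 * pad + kernel

# Assumed paddings: with these, each kernel/stride pair from the text
# enlarges a 24-pixel input by exactly its scale factor.
sizes = {scale: deconv_out(24, kernel, stride=scale, pad=pad)
         for scale, kernel, pad in [(2, 14, 6), (4, 16, 6), (8, 18, 5)]}
```

For a 24-pixel input this gives outputs of 48, 96, and 192 pixels for the 2×, 4×, and 8× configurations, respectively.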

Multi-Scale Reconstruction Stage-I
As the depth of a network increases, the flow of information becomes weak at the final layers [33]. This leads to the vanishing/exploding gradient issue during training [45]. ResNet, proposed by He et al. [34], is intended to solve this problem, and the idea of skip connections is widely used in [19,20] to construct much deeper models for image SR. Residual network blocks [16,19,34,46] have been shown to improve training accuracy in SR work. In Figure 2, we show the residual network block of the original ResNet [34], of SRResNet [19], and our proposed ResNet block. The original ResNet block [34] consists of a direct path and a skip connection for propagating information through the residual block; the summed information finally passes through a ReLU [42] activation layer. In the SRResNet block [19], the ReLU [42] activation function after the addition is removed to provide a clean path from one block to the next. Our proposed block removes the two BN [35] layers to reduce Graphics Processing Unit (GPU) memory usage and minimize computational complexity. Compared to the original ResNet block [34] and the SRResNet block [19], which use standard convolution operations, our proposed block uses asymmetric convolution operations, which reduce the size of the model and increase the training efficiency of the model.
For multi-scale reconstruction stage-I, we applied eight trainable asymmetric convolution layers, interleaved with ReLU [42] nonlinearities. Asymmetric convolution (AConv) factorizes a standard two-dimensional convolution kernel into two one-dimensional convolution kernels; in other words, a 3 × 1 convolution followed by a 1 × 3 convolution is substituted for a 3 × 3 convolution [47,48]. This mechanism can be expressed as

I ∗ W ≈ (I ∗ W_x) ∗ W_y,

where I is a 2D image, W is a 2D kernel such as 3 × 3, W_x is a 1D kernel along the x-dimension, such as 1 × 3, and W_y is a 1D kernel along the y-dimension, such as 3 × 1. The relationship between standard and asymmetric convolution kernels in terms of the number of parameters is shown in Table 1. For example, for a single 3 × 3 layer with 10 filters and an image patch size of 28 × 28, the calculated number of parameters is 900. Similarly, after applying the asymmetric convolution operation to the 3 × 3 layer, splitting it into 3 × 1 and 1 × 3 with the same number of filters and image patch size, the calculated number of parameters is 600. These results clearly show that asymmetric convolution kernels have fewer parameters than standard convolution kernels. This approach is one of the most suitable options because it reduces the size of a deeper model, increases computational efficiency during training, and helps avoid overfitting. In our proposed architecture, we used four CNN layers of 3 × 1 and 1 × 3 asymmetric convolution operations, with each layer taking the previous features as input and generating 16 channels of new features. To facilitate the flow of training, we used a skip connection after every two convolution layers and added the input to the output passed to the next block.
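The Table 1 figures can be reproduced with a simple parameter count. The input-channel count of 10 is our assumption, needed to match the quoted 900 parameters (the patch size does not affect the count).

```python
def conv_params(kh, kw, c_in, c_out, bias=False):
    """Weight count of a convolution layer (bias terms excluded by default)."""
    return kh * kw * c_in * c_out + (c_out if bias else 0)

# Table 1 example: 3 x 3 kernel, 10 filters; 10 input channels assumed.
standard = conv_params(3, 3, 10, 10)                          # 900
asym = conv_params(3, 1, 10, 10) + conv_params(1, 3, 10, 10)  # 600
```

The factorized pair carries two-thirds of the standard kernel's parameters, matching the 900-versus-600 comparison in the text.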
In order to decrease the number of parameters, we used a 1 × 1 bottleneck CNN layer [50] after the final asymmetric residual block.

Multi-Scale Reconstruction Stage-II
At the final stage, we used a multi-scale block adopted from GoogLeNet [51] to select the appropriate kernel size. The kernel size plays a very important role in model design, as well as in the training procedure, because it is closely related to how much useful information is extracted. A smaller kernel is better for capturing local information, while a larger kernel is preferable for globally distributed information. The inception network [52] uses this idea and includes many convolutions with different kernel sizes. Furthermore, the second and third versions of the inception architecture use the idea of asymmetric convolution: an n × n kernel can be translated into a combination of 1 × n and n × 1 convolutions, which is more efficient than the standard convolution kernel. For example, a convolution with a 3 × 3 kernel is equivalent to a 1 × 3 convolution followed by a 3 × 1 convolution, which was found to reduce the computational cost by 33% relative to the standard convolution [52]. Figure 3 compares the traditional convolution operation with the asymmetric convolution operation. Figure 3a shows a plain architecture with many layers stacked in a single path, as used by SRCNN [26] and FSRCNN [15]. This type of architecture is very simple, but a deeper model increases the model size and consumes more memory.
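The quoted ~33% saving follows directly from the factorization arithmetic; as a back-of-the-envelope check, count the multiplications per output position.

```python
def factorization_saving(n):
    """Fraction of per-position multiplications saved when an n x n
    convolution is replaced by a 1 x n followed by an n x 1 convolution:
    1 - (2n) / (n * n)."""
    return 1 - (2 * n) / (n * n)
```

For n = 3 the saving is 1 − 6/9 ≈ 33%, as stated; the saving grows with the kernel size (for n = 7 it is already over 70%), which is why the larger towers benefit most.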
In Figure 3b, a conventional inception block is used to extract multi-scale feature information. This block extracts multi-scale features efficiently; however, it has a higher number of parameters and thus higher computational complexity. We propose the multi-scale asymmetric convolution block, shown in Figure 3c, to address this training complexity. Our proposed inception block reduces computation time while still extracting the multi-scale feature information needed to reconstruct the SR image. In Figure 4, we introduce a new module inspired by the naive-version inception module and the inception module with dimension reduction [51]. In the naive-version inception module, convolution is performed on the previous layer's output with three filter sizes: 1 × 1, 3 × 3, and 5 × 5. To achieve dimension reduction, a max-pooling operation is also employed. The outputs of these layers are concatenated and sent to the next inception module, as shown in Figure 4a. The major problem with the naive-version inception module is its large number of kernels: even a modest number of large kernels can be expensive on top of a convolutional layer, and the problem becomes more serious once the max-pooling output is fused with the convolutional outputs from one stage to the next. To make the module computationally efficient and reduce the number of input channels, the authors revised the naive version with dimension reductions, adding an extra 1 × 1 convolution layer before the 3 × 3 and 5 × 5 convolution layers, as well as after the max-pooling layer, as shown in Figure 4b. Following this successful model, we propose an asymmetric inception block to learn multi-scale information for reconstructing the HR image, as shown in Figure 4c.
In the suggested asymmetric inception block, the standard convolution layers are replaced with asymmetric convolution layers. For multi-scale reconstruction, we used five towers with four different sizes of asymmetric convolution filter. These filters are followed by ReLU, each producing 16 feature channels for the various asymmetric convolution filter sizes. In the first branch (tower 1), we split the two filters with 3 × 3 and 5 × 5 layers into four asymmetric convolution filters of order 3 × 1, 1 × 3, 5 × 1, and 1 × 5 to reduce the number of parameters. Similarly, in tower 2 and tower 3, we applied the same asymmetric convolution filter sizes. In tower 4 and tower 5, we divided the larger 7 × 7 and 9 × 9 filters into asymmetric convolution filters of size 7 × 1, 1 × 7, 9 × 1, and 1 × 9. Finally, we concatenated the outputs of all the towers, followed by a ReLU activation nonlinearity. To improve compactness, achieve computational efficiency, and obtain better performance, we used a 1 × 1 bottleneck CNN layer [50]. Remarkably, the 1 × 1 bottleneck CNN layer [50] not only reduced the dimensions of the previous layers for higher computational efficiency but also added more nonlinearity to enhance the representation of the reconstructed LR image. The 1 × 1 bottleneck CNN layer [50] has a lower computational cost than a 3 × 3 CNN layer. As a result, our proposed block is relatively lighter, more efficient, and computationally effective in comparison to other deep learning-based reconstruction blocks.
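A rough parameter count illustrates the saving from factorizing the four tower kernel sizes; the 16 input and 16 output channels per tower are an assumption suggested by the text, not a confirmed configuration.

```python
# Assumed channel width per tower (16, as suggested by the text).
c_in = c_out = 16
kernel_sizes = (3, 5, 7, 9)  # the four tower kernel sizes

# Hypothetical standard towers: one n x n convolution each.
standard = sum(n * n * c_in * c_out for n in kernel_sizes)

# Asymmetric towers: an n x 1 followed by a 1 x n convolution each.
asym = sum(2 * n * c_in * c_out for n in kernel_sizes)
```

Under these assumptions the factorized towers use 12,288 weights versus 41,984 for standard square kernels, less than a third, which is consistent with the block being "relatively lighter" than conventional inception designs.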

Experimental Results
In this section, we first explain the construction of the training datasets and the model hyperparameters. Next, we compare quantitative and qualitative performance on five benchmark test datasets. Finally, we compare model complexity in terms of peak signal-to-noise ratio (PSNR) [56] versus the number of parameters.

Training Datasets
Many training datasets are available for single image super-resolution, but the most commonly used are Yang et al.'s [57] image dataset and the Berkeley Segmentation Dataset (BSDS) [58]. To evaluate the proposed method, we selected 91 images from [57] and another 200 images from [58]. Following [21], to take full advantage of the training dataset and avoid over-fitting, we applied data augmentation by randomly flipping all images and then rotating them to enlarge the training set [59]. All experiments were performed on the HR ground-truth images, with training samples randomly cropped and flipped from the original ground-truth images. For data processing, we used MATLAB 2018a and the Keras 2.2.1 framework [60] with a TensorFlow back-end, and LR images were generated with the built-in bicubic function. Several loss functions have been used in deep learning techniques; since most deep-neural-network-based SR methods use the mean squared error (MSE) loss, we adopted the same loss function for our model. The end-to-end mapping function requires estimating the network parameters θ, which consist of a set of weights and biases, obtained by minimizing the loss between the restored image F(Y, θ) and the corresponding original HR ground-truth image X. Let X_i denote the HR, high-quality images, Y_i their corresponding LR images, and m the number of samples in each batch during training; the MSE loss function can then be calculated as

L(θ) = (1/m) Σ_{i=1}^{m} || F(Y_i, θ) − X_i ||².

To minimize this loss, we used the adaptive momentum estimation (Adam) [61] optimizer with an initial learning rate of 0.0003 and a mini-batch size of 32. Training takes 100 epochs to converge properly, and all experiments were conducted on an NVIDIA Titan Xp GPU under Ubuntu 18.04, with a 3.5 GHz Intel i7-5960X CPU and 64 GB of RAM.
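A minimal sketch of the MSE objective is given below; it averages over all pixels rather than only over the batch, which is equivalent to the sum form above up to a constant factor.

```python
import numpy as np

def mse_loss(pred, target):
    """MSE objective: mean of (F(Y_i; theta) - X_i)^2 over batch samples
    and pixels (equal to the per-batch sum form up to a constant factor)."""
    return np.mean((np.asarray(pred) - np.asarray(target)) ** 2)

pred = np.array([[0.5, 0.5]])
target = np.array([[0.0, 1.0]])
loss = mse_loss(pred, target)  # mean of (0.25, 0.25) = 0.25
```

In Keras, passing `loss='mse'` to `model.compile` applies this same pixel-averaged objective.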
For a fast training procedure, we trained our model on a single channel only, i.e., the Y-channel; we therefore converted the RGB image into YCbCr and finally added the color channels, enlarged using the bicubic interpolation technique.
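The Y-channel extraction can be sketched as follows, assuming the ITU-R BT.601 studio-range convention used by MATLAB's rgb2ycbcr (the paper does not state which convention it uses).

```python
import numpy as np

def rgb_to_y(rgb):
    """Luma (Y) channel using ITU-R BT.601 studio-range coefficients,
    as in MATLAB's rgb2ycbcr; rgb values are expected in [0, 1]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b

white = np.ones((1, 1, 3))  # pure white -> Y = 235 in the studio range
```

The Cb and Cr channels, which carry color rather than structure, are then simply upscaled with bicubic interpolation and recombined with the network's Y-channel output.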

Testing Datasets
We evaluated our model's performance on five publicly available benchmark datasets: Set5 [62], Set14 [63], BSDS100 [58], Urban100 [23], and Manga109 [64]. The Set5 [62] dataset consists of five images with sizes between 228 × 228 and 512 × 512 pixels. Set14 [63] consists of 14 images, and the BSDS100 [58] dataset consists of 100 natural-scene images. The Urban100 [23] dataset consists of challenging images with many frequency bands and rich detail, and the Manga109 [64] dataset consists of many comic images with fine structure. For a fair comparison, our proposed method was evaluated against recently published results.

Comparison with Other Existing State-of-the-Art Methods
There are many ways to validate the effectiveness of the proposed model. In the image SR literature, it is common to use two quality metrics, i.e., PSNR and the structural similarity index (SSIM) [56]. Both metrics measure the difference between the upscaled or interpolated LR image and its original high-quality HR image. Higher PSNR and SSIM [56] values for two images correlate to a higher degree of similarity between them and indicate better reconstruction quality. PSNR [56] is measured in decibels (dB) and ranges from 0 to infinity. SSIM ranges from 0 to 1, with a value of 1 indicating perfect recovery of the LR image. The expressions for PSNR and SSIM [56] are shown in Equations (6) and (7), respectively:

PSNR(r, s) = 10 · log10( (2^k − 1)² / MSE ),     (6)

where k is the bit depth and MSE is the mean squared error, and

SSIM(r, s) = ( (2 µ_r µ_s + C_1)(2 σ_rs + C_2) ) / ( (µ_r² + µ_s² + C_1)(σ_r² + σ_s² + C_2) ),     (7)

where µ_r and µ_s denote the mean values of r and s, the variances of r and s are denoted by σ_r² and σ_s², and the covariance of r and s is represented by σ_rs. C_1 and C_2 are constants that maintain the formula's validity and prevent the denominator from being zero. Quantitative results of PSNR and SSIM were evaluated on the public benchmarks Set5 [62], Set14 [63], BSDS100 [58], Urban100 [23], and Manga109 [64], with scale factors 2×, 4×, and 8×, as shown in Table 2.
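The PSNR metric of Equation (6) can be sketched directly in code; this is an illustrative helper, not the paper's evaluation script.

```python
import numpy as np

def psnr(r, s, k=8):
    """PSNR in dB for k-bit images: 10 * log10((2^k - 1)^2 / MSE)."""
    mse = np.mean((r.astype(float) - s.astype(float)) ** 2)
    if mse == 0:
        return float('inf')  # identical images
    return 10.0 * np.log10((2 ** k - 1) ** 2 / mse)

a = np.zeros((4, 4))
b = np.full((4, 4), 255.0)  # maximally different 8-bit images -> 0 dB
```

Identical images give infinite PSNR, while a pair differing by the full 8-bit dynamic range at every pixel gives 0 dB, the bottom of the useful scale.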
For qualitative and quantitative comparison, we selected twelve different state-of-the-art algorithms, along with the baseline. PSNR and SSIM [56] are the most popular reference metrics, widely used in image SR tasks; they are applied directly to the image intensities. As can be seen from Table 2, our method achieves, on average, better PSNR and SSIM [56] than all existing methods. Furthermore, over the five datasets with upscale factor 2×, our MSISRD improves the average PSNR by 1.33 dB, 1.04 dB, 0.95 dB, 0.42 dB, 0.39 dB, 0.49 dB, 0.32 dB, 0.37 dB, 0.37 dB, 0.32 dB, 0.24 dB, and 0.16 dB in comparison with SRCNN [26], ESPCN [28], FSRCNN [15], VDSR [16], DCSCN [37], LapSRN [18], DRCN [17], SrSENet [41], SRMD [39], REDNet [30], DSRN [38], and CNF [36], respectively. Table 3 shows the quantitative comparison results for scale 4× on the Set5 [62] dataset, as PSNR/SSIM [56] versus the number of parameters. Our model yields higher performance with fewer parameters than other SR methods, which demonstrates the efficiency of our proposed model. Furthermore, the proposed method employs far fewer parameters than REDNet [30], DRCN [17], and SRMD [39]. For instance, our model uses up to 94% fewer parameters than REDNet [30], 86% fewer than DSRN [38], and 70% fewer than LapSRN [18]. Figure 5 shows the relationship between the number of parameters and PSNR [56]; our proposed model presents a favorable trade-off between model complexity and SR performance. Figures 6-9 show the perceptual quality on the Set5 [62], Set14 [63], BSDS100 [58], and Urban100 [23] datasets for 4× image SR. Figures 10-13 present the visual performance on the above datasets at scale factor 8×, including one image from the Manga109 [64] dataset. The results of Bicubic, SRCNN [26], and FSRCNN [15] look blurry and lack high-frequency details.
Image SR at scale 8× is a very challenging problem, but our method accurately reconstructs texture details, suppresses artifacts, and recovers the details of the LR image with sharp edges. Figure 10 clearly shows that our method accurately reconstructs fine texture details, such as the eyebrow of a baboon, leading to pleasing visual perceptual quality. Table 2. Quantitative evaluation of existing SR algorithms against our proposed approach; reported results are the average peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [56] for 2×, 4×, and 8× enlargement scale factors; bold red values indicate the best result, and underlined blue values indicate the second best.

Conclusions
In this paper, we proposed a multi-scale inception-based SR method using a deep learning approach. Our model uses locally residual asymmetric convolution blocks and an inception-based asymmetric convolution block architecture to directly extract short- and long-range feature information. For upscaling, we use a learned transposed convolution layer in the latent feature space. In the reconstruction part, asymmetric convolution kernels are applied for better reconstruction of vertical and horizontal edges. Furthermore, we used an inception module to obtain better feature reconstruction with less computational complexity. To our knowledge, this is the first network in which asymmetric convolution kernels are used throughout the whole architecture. The results show, both qualitatively and quantitatively, improved performance for the large upscaling factors of 2×, 4×, and 8×, together with a reduced number of parameters. The proposed method achieves highly competitive performance on five benchmark datasets. In the future, we will stack more residual and inception blocks to further improve the quality of SISR.