Deep Residual Dense Network for Single Image Super-Resolution

In this paper, we propose a deep residual dense network (DRDN) for single image super-resolution. Based on human perceptual characteristics, the residual-in-residual dense block (RRDB) strategy is exploited to implement various depths in the network architecture. The proposed model exhibits a simple sequential structure comprising residual and dense blocks with skip connections, which improves the stability of the network, reduces its computational complexity, and enhances perceptual quality. We adopt a perceptual metric to learn and assess the quality of the reconstructed images. The proposed model is trained with the Diverse2k dataset, and its performance is evaluated using standard datasets. The experimental results confirm that the proposed model exhibits superior performance, with better reconstruction results and perceptual quality than conventional methods.


Introduction
Single image super-resolution (SISR) is used to reconstruct a high-resolution (HR) image from a single low-resolution (LR) input image with better visual quality [1,2]. For instance, when multimedia content is shared and studied, access to the original data is often unavailable, and the quality of the received image cannot be estimated. If the image quality is poor, then it becomes difficult to restore the information; therefore, restoring the original image through super-resolution is important. Super-resolution is currently used in many applications, such as closed-circuit television surveillance [3], security systems [4], satellite remote sensing [5], medical imaging [6,7], atmospheric monitoring [8], and robotics [9].
Super-resolution methods can be broadly classified into two main categories: conventional [10] and deep learning methods [11]. Conventional computer vision approaches for super-resolution are interpolation-based methods such as bicubic, nearest-neighbor [12], and bilinear interpolation [13][14][15]. Deep learning methods yield better performance than conventional methods. We categorize the network architectures following Saeed et al. [11], where single image super-resolution models are classified according to their structures. Furthermore, deep learning methods for single image super-resolution can be divided into two types: peak signal-to-noise ratio (PSNR)-oriented and perception-oriented methods. In PSNR-oriented methods, deep neural networks [16][17][18] provide significantly improved performance in terms of the PSNR and structural similarity index (SSIM) [19]. In SISR problems, deep learning models are implemented using basic linear networks [20], which are simple structures with one path in which the signal flows sequentially from the first layer to the last. By contrast, residual networks [21] use dense and skip connections [22] as well as multiple branches for residual learning; residual learning with deep networks yields better performance. We briefly discuss recent deep residual-based methods [23][24][25] for improving visual quality that are relevant to our study. One of them is the enhanced deep residual network (EDSR) [26], which was derived from the super-resolution residual network by removing unnecessary modules, thereby reducing memory usage when training the model on a graphics processing unit. In the EDSR network, the batch normalization (BN) layers are removed, thereby improving image reconstruction.
AGTM: The aggregated residual transformation method (AGTM) is a modified version of the EDSR that reduces the number of parameters and the time complexity. The aggregated residual transformation [41][42][43] achieves the same level of performance as the EDSR.
PM-DAN: The perceptual-metric-guided deep attention network (PM-DAN) [44] is an attention-based encoder-decoder network that focuses on the visual quality of the reconstructed images. In this network, a residual spatial attention unit captures the key information.
ESRGAN: This network belongs to the family of generative adversarial networks (GANs), which comprise two main building components: a generator and a discriminator. ESRGAN is a modified version of SRGAN that improves the visual quality of the reconstructed image. In that work, the RRDB was introduced in place of the residual block, together with a perceptual loss function.

Proposed Methods
The EDSR yields better results in single image super-resolution, with improved perceptual quality compared with other models. To improve the perceptual quality further, the residual block of the EDSR was replaced with the RRDB. The RRDB was originally designed for the generator of the ESRGAN model, and it was built into the residual network of the proposed model. A comparison between the residual block of the EDSR and that of the proposed model is shown in Figure 1. The original EDSR residual block comprises two convolutional layers and a rectified linear unit (ReLU), as shown in Figure 1a; the convolution layers are used for feature extraction, whereas the ReLU activates the network. The main difference between the EDSR and the proposed model lies in the RRDB layers. Depending on the depth, we implemented two models, RRDB_20 and RRDB_28: a model with a depth of 20 contains 20 RRDB layers, and a model with a depth of 28 contains 28 RRDB layers. Each RRDB of the proposed model comprises three residual dense blocks (RDBs), as shown in Figure 1b, and multiple RDBs are concatenated with each other. Each RDB comprises a dense block with its own convolutional layers and stacked dense connections (as shown in Figure 2), and a leaky rectified linear unit (LReLU) as the activation function. After the residual block was modified, the proposed model appeared as illustrated in Figure 3. The model takes a low-resolution image x_lr as its input and produces y_sr as its output. The first convolutional layer extracts the features F_0 from the low-resolution input, as shown in (1) [45,46].
where H_fe(·) denotes the convolution operation of the first convolutional layer, and F_0 is the extracted feature map used as the input to the RRDBs. Suppose we have RD residual-in-residual dense blocks; the output F_rd of the rd-th RRDB can be obtained by (2).
where H_RRDB,rd denotes the operations of the rd-th RRDB. The inner layers of each RRDB comprise three RDBs, and the features of the RRDB can be calculated using (3).
where H_RDB,rd represents the operations of the rd-th RDB, which is a combined operation of convolution and the LReLU. F_RRDB is obtained using all the RDBs and residual-in-residual learning (F_rrl). The output of the d-th RDB can be obtained from the input F_(d−1), as shown in (4).
where F_rd represents the local features of the RDB. As F_rd is obtained using all the convolutional layers and the LReLU, the inner layers of the dense block are formulated using (5).
where F_GRL is the global residual learning feature, and F_0 is the feature extracted by the first convolutional layer. The final output of the DRDN can be obtained using (7).
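The display equations (1)–(7) referenced above do not appear in the text. Based on the definitions given around them, the forward pass can plausibly be reconstructed as follows; the composition of three RDBs in (3), the concatenation in (5), and the global residual sum in (6) are our assumptions, consistent with standard RRDB designs:

```latex
\begin{align}
F_0 &= H_{fe}(x_{lr}) && (1)\\
F_{rd} &= H_{RRDB,rd}(F_{rd-1}) && (2)\\
F_{RRDB} &= H_{RDB,3}\!\big(H_{RDB,2}(H_{RDB,1}(F_{rd-1}))\big) + F_{rrl} && (3)\\
F_d &= H_{RDB,d}(F_{d-1}) && (4)\\
F_{d,c} &= \sigma\!\big(W_{d,c}\,[F_{d-1}, F_{d,1}, \ldots, F_{d,c-1}]\big) && (5)\\
F_{GRL} &= F_0 + F_{RD} && (6)\\
y_{sr} &= H_{rec}(F_{GRL}) && (7)
\end{align}
```

Here σ is the LReLU, [·] denotes channel concatenation, and H_rec is the upsampling and reconstruction stage.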
The residual blocks of the EDSR and the proposed model are shown in Figure 4. Figure 4a shows the inner convolutional layers (as in Figure 1a) of the EDSR residual block as a building block: the block consists of two convolutional layers, both with 256-d input and output and a 3 × 3 kernel. Figure 4b shows the inner convolutional layers (as in Figure 2) of the proposed RRDB as a building block, along with the image transformation steps in one block. In the proposed model, the residual block (RRDB) has five convolutional layers for feature extraction; the first convolutional layer receives a 64-d input and produces a 32-d output with a 3 × 3 kernel, and the second to fifth convolutional layers have different input and output channel sizes owing to the dense connections. Finally, the super-resolution image is reconstructed. In addition, the proposed models were implemented at various depths, namely RRDB_20 and RRDB_28. These two models are distinguished by their depths, and the number of parameters involved is shown in Table 1, together with the model type, number of residual blocks, total number of parameters, residual scaling, and loss function. Based on the performances of the three models, RRDB_28 and RRDB_20 exhibited performance comparable to that of the EDSR in terms of the PSNR and SSIM, while in terms of perceptual quality, the RRDB_20 model outperformed the EDSR.
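The parameter count of such a dense block can be sketched as follows. The 64-d input, 32-d layer outputs, and 3 × 3 kernels are taken from the text; the assumption that the fifth layer fuses the concatenated features back to 64 channels is ours, following common RRDB designs:

```python
def conv_params(c_in, c_out, k=3, bias=True):
    """Parameters of a single k x k convolution: weights plus optional biases."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def dense_block_params(c_base=64, growth=32, n_layers=5, k=3):
    """Parameter count of one five-layer dense block.

    Layer i receives the base features concatenated with the outputs of all
    previous layers (dense connections), so its input width grows by `growth`
    each step; the last layer maps back to c_base channels (assumed fusion).
    """
    total = 0
    for i in range(n_layers - 1):  # layers 1..4 each emit `growth` channels
        total += conv_params(c_base + i * growth, growth, k)
    # final layer: concatenated features -> c_base channels
    total += conv_params(c_base + (n_layers - 1) * growth, c_base, k)
    return total
```

Under these assumptions, one dense block holds roughly 0.24 M parameters, which is what makes stacking many RRDBs feasible compared with the 256-d EDSR blocks.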

Experimental Results
In this section, we present the experimental procedures and results of the proposed model. A comparison with a state-of-the-art method is presented as well. For the experimental evaluation, the Diverse2k resolution (Div2K) training dataset [47] was used, and a quantitative evaluation was performed using a public benchmark dataset. Finally, perceptual quality was evaluated using the perception-based image quality evaluator (PIQE).

Training Datasets
For training, the Div2K dataset [47] was used; it is a 2K-resolution, high-quality image dataset comprising 800 training images, 100 validation images, and 100 test images. In addition, low-resolution (LR) bicubic images are available at scale factors of ×2, ×3, ×4, and ×8 for training and evaluating the proposed model.

Evaluation on Benchmark Datasets
To evaluate the performance of the model in terms of quantitative measures, we compared our model on publicly available benchmark datasets, namely set 5 [48], set 14 [13], BSD100 [49], and Urban100 [50]. The main purpose of these datasets is to test and predict the performance of newly designed network architectures; they also facilitate comparisons with existing conventional models.

Set 5 [48]: This is a standard dataset with five test images: a baby, bird, butterfly, head, and woman.
Set 14 [13]: This dataset covers more categories than set 5, with 14 images: a baboon, Barbara, bridge, coastguard, comic, face, flowers, foreman, Lenna, man, monarch, pepper, ppt3, and zebra.
BSD100 [49]: This is one of the classical datasets, consisting of 100 test images ranging from natural scenes to individual objects such as plants, people, food, animals, and devices.
Urban100 [50]: This is also a classical dataset, composed of 100 images like BSD100. It focuses on man-made structures, such as urban scenes.

PIQE
The PIQE [51] is a metric used for human perceptual quality assessment in the field of super-resolution image reconstruction. The PIQE can be expressed as

PIQE = (Σ_{k=1}^{N_SA} D_sk + C_1) / (N_SA + C_1), (8)

where N_SA represents the number of spatially active blocks in an image, C_1 is a positive constant, and D_sk is the distortion of the k-th spatially active block. The PIQE metric returns a positive scalar in the range 0–100; the score is inversely correlated with the perceived quality of an image, so a low score indicates good perceptual quality, whereas a high score indicates the opposite. The PIQE metric was evaluated on benchmark datasets, such as set 5 and set 14, with scale factors of ×2, ×3, ×4, and ×8, as presented in Table 2. As shown, for ×2, the PIQE values of RRDB_20 on set 5 and set 14 (56.3124 and 48.1648, respectively) were lower than those of the EDSR and RRDB_28; hence, RRDB_20 demonstrated better perceptual quality. Similarly, the other scale models on set 5 and set 14, namely ×3, ×4, and ×8, yielded values of 66.4755, 75.7457, and 76.2772 and 71.9378, 77.4722, and 79.5802, respectively. In addition, RRDB_28 yielded higher PIQE values than the conventional methods on set 14 (×2), BSD100 (×3), and urban100 (×2, ×3, ×4, ×8). Overall, it is evident that the proposed models exhibited better perceptual quality than the EDSR. The image quality for the human visual system (HVS) was measured using the universal image quality index (UQI) [52,53]. In particular, the UQI models the correlation loss, luminance distortion, and contrast distortion. The UQI metric is expressed as in (9).
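The final pooling step of the PIQE can be sketched as follows. This is an illustration of the score formula only; the per-block distortion scores D_sk and the count of spatially active blocks N_SA are assumed to have been computed by the earlier stages of the PIQE pipeline, and the function name is ours:

```python
import numpy as np

def piqe_score(block_distortions, n_sa, c1=1.0):
    """Pool per-block distortions into the final PIQE score.

    block_distortions: distortion scores D_sk of the spatially active blocks.
    n_sa: number of spatially active blocks N_SA.
    c1: positive stabilizing constant C_1.
    Lower scores indicate better perceptual quality.
    """
    d = np.asarray(block_distortions, dtype=float)
    return (d.sum() + c1) / (n_sa + c1)
```

Because the sum of distortions sits in the numerator, an image whose active blocks are more distorted receives a strictly larger (worse) score.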
UQI = (1/M) Σ_{j=1}^{M} Q_j, (9)

where UQI is the universal image quality index, Q_j is the local quality index, and M is the total number of steps. The UQI metric was used to evaluate the image quality for the HVS on benchmark datasets such as set 5, set 14, BSD100, and urban100 with scale factors of ×2, ×3, ×4, and ×8, as presented in Table 3. As shown, for ×2 on set 5, the UQI value of RRDB_20 (0.9951) was higher than those of the VDSR, RRDB_28, and EDSR; on set 14, it was 0.9920. Hence, RRDB_20 shows improved HVS image quality. Similarly, the other scale models, ×3, ×4, and ×8, yielded values of 0.9929, 0.9870, and 0.9666 on set 5 and 0.9889, 0.9820, and 0.9673 on set 14, respectively. In addition, the RRDB_28 value on urban100 (×3) is higher than those of the conventional methods in Table 3. Hence, the proposed model shows improved image quality compared with the other models.
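The UQI computation can be sketched as follows, using the standard Wang–Bovik local index averaged over sliding windows; the 8 × 8 window size is an assumption, as the text does not specify it:

```python
import numpy as np

def uqi_local(x, y):
    """Local quality index Q for two patches: the product of the
    correlation, luminance, and contrast comparison terms."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 4 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))

def uqi(img_a, img_b, win=8):
    """Equation (9): average the local index Q_j over all M sliding windows."""
    a = np.asarray(img_a, dtype=float)
    b = np.asarray(img_b, dtype=float)
    scores = [
        uqi_local(a[i:i + win, j:j + win], b[i:i + win, j:j + win])
        for i in range(a.shape[0] - win + 1)
        for j in range(a.shape[1] - win + 1)
    ]
    return float(np.mean(scores))
```

An identical pair of images scores 1; any correlation loss, luminance shift, or contrast change pulls the score below 1.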

Training Details
For training, we followed the training parameters given by Lim et al. [26]. We used RGB input patches of size 48 × 48 from the LR images, together with the corresponding HR patches. We trained our model using the Adam optimizer with β_1 = 0.9, β_2 = 0.999, and ε = 10^−8, and set the mini-batch size to 16. The learning rate was initialized to 10^−4 and decayed at step boundaries of 70,000 and 100,000 to 5 × 10^−5 and 1 × 10^−5, respectively. We implemented the proposed model using the TensorFlow framework and trained it on NVIDIA GeForce GTX GPUs. The proposed model architectures required 10 days to train.
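The piecewise-constant decay described above can be sketched as a plain function (a minimal illustration of the schedule; in TensorFlow this role is played by a piecewise-constant decay schedule):

```python
def learning_rate(step, boundaries=(70_000, 100_000),
                  values=(1e-4, 5e-5, 1e-5)):
    """Piecewise-constant learning-rate schedule: 1e-4 until step 70,000,
    5e-5 until step 100,000, and 1e-5 afterwards."""
    for boundary, value in zip(boundaries, values):
        if step < boundary:
            return value
    return values[-1]  # final rate past the last boundary
```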

PSNR (dB)/SSIM Evaluation
The mean PSNR and SSIM of the super-resolution methods, evaluated on benchmark datasets such as set 5, set 14, BSD100, and urban100 for super-resolution factors of ×2, ×3, ×4, and ×8, are compared between the proposed models and our simulated EDSR in Table 4. The simulated EDSR indicated better PSNR and SSIM scores than the reference EDSR [26]. For comparison, the PSNR and SSIM scores were measured on the Y channel. The validation PSNR and SSIM scores of the EDSR and the proposed models on the Div2K dataset are shown in Figure 5. In addition, visual comparisons of the super-resolution images from set 5 (×2 and ×4) and set 14 (×3 and ×8) are shown in Figures 6 and 7, respectively. As shown by the results in Table 3, the proposed model performed comparably to the EDSR while requiring fewer parameters and achieving better perceptual quality in the reconstruction of super-resolution images. In addition, RRDB_20 yielded better average perceptual quality than RRDB_28, although its performance varied with individual images, as shown in Figures 6 and 7.
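The Y-channel PSNR measurement can be sketched as follows. The BT.601 luma conversion shown is the one commonly used in SR evaluation; whether any border cropping is applied before measurement is not specified in the text:

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luma (Y channel) from an RGB image in the [0, 255] range."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr, peak=255.0):
    """PSNR between the Y channels of a super-resolved and reference image."""
    y_sr = rgb_to_y(np.asarray(sr, dtype=float))
    y_hr = rgb_to_y(np.asarray(hr, dtype=float))
    mse = np.mean((y_sr - y_hr) ** 2)
    if mse == 0:
        return float('inf')  # identical Y channels
    return 10.0 * np.log10(peak ** 2 / mse)
```

Measuring on Y rather than full RGB is the convention that makes scores comparable across the SR literature, since luma dominates perceived sharpness.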

Conclusions
In this paper, we proposed a DRDN for single image super-resolution. The proposed models were designed based on RRDBs; they required fewer parameters and demonstrated better stability and perceptual quality. The performance of the proposed model is limited as the number of layers increases, and the network must be optimized to train efficiently. Furthermore, we plan to investigate cascaded multiscale deep networks to improve the visual textures at a lower computational cost. The experimental results confirmed that the proposed model achieved better reconstruction results and perceptual quality than the conventional methods.