PSRGAN: Perception-Design-Oriented Image Super Resolution Generative Adversarial Network

: Among recent state-of-the-art realistic image super-resolution (SR) intelligent algorithms, generative adversarial networks (GANs) have achieved impressive visual performance. However, the perceptual quality of super-resolved images remains unsatisfactory, often marred by unpleasant artifacts. To address this issue and further improve visual quality, we propose a perception-design-oriented PSRGAN with double perception turbos for real-world SR. The first-perception turbo in the generator network has a three-level perception structure with different convolution kernel sizes, which can extract multi-scale features from four 1/4-size sub-images sliced from the original LR image. The slice operation expands the adversarial samples to four and can alleviate artifacts during GAN training. The extracted features are eventually concatenated in three later ×2 upsampling processes through pixel shuffle to restore the SR image with diversified delicate textures. The second-perception turbo in the discriminator has cascaded perception turbo blocks (PTBs), which can further perceive multi-scale features at various spatial relationships and promote the generator to restore subtle textures driven by GAN. Compared with recent SR methods (BSRGAN, real-ESRGAN, PDM_SR, SwinIR, LDL, etc.), we conducted extensive tests with a ×4 upscaling factor on various datasets (OST300, 2020track1, RealSR-Canon, RealSR-Nikon, etc.). A series of experiments show that our proposed PSRGAN, based on generative adversarial networks, outperforms current state-of-the-art intelligent algorithms on several evaluation metrics, including NIQE, NRQM, and PI. In terms of visualization, PSRGAN generates finer and more natural textures while suppressing unpleasant artifacts and achieves significant improvements in perceptual quality.


Introduction
Single-image super-resolution (SISR) aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) one. The traditional methods for solving the SR problem are mainly interpolation-based methods [1-4] and reconstruction-based methods [5-7]. Intelligent computing has also been applied in the field of image super-resolution: super-resolution methods based on genetic algorithms, guided by imaging models, use optimization techniques to seek the optimal estimate of the original image; at their core, these approaches transform the reconstruction of multiple super-resolved images into a linear system of equations. The convolutional neural network (CNN) has greatly promoted the vigorous development of the SR field and demonstrates vast superiority over traditional methods, mainly due to its strong capability to learn rich features from big data in an end-to-end manner [8]. CNN-based SR methods often use PSNR as the evaluation metric; although some SR methods achieve good PSNR results, they remain not entirely satisfactory in terms of perception.
The generative adversarial network (GAN) [9] has achieved impressive visual performance in the field of super-resolution (SR) since the pioneering work of SRGAN [10]. GANs have proven their capability to generate more realistic images with high perceptual quality. In pursuit of further enhancing visual quality, Wang et al. proposed ESRGAN [11]. Given the challenge of collecting well-paired datasets in real-world scenarios, unsupervised GANs have been introduced [12,13]. BSRGAN [14] and real-ESRGAN [15] are dedicated to simulating the practical degradation process to obtain better visual results on real datasets.
However, perceptual dissatisfaction accompanied by unpleasant artifacts still exists in GAN-based SR models because of insufficient design in either generators or discriminators. In GAN-based SR methods, the generator's decisive capability to recover naturally finer textures depends largely on the guidance of the discriminator during GAN training, yet discriminators are usually cloned from well-known networks (U-Net [16], VGG [17], etc.) suited to image segmentation or classification, which might not fully lead generators to restore subtle textures in SR. Moreover, the design of generators should be perceptive enough to extract multi-scale image features from low-resolution (LR) images and mitigate artifacts.
Research hypotheses and questions: Perceived quality improvement: how can we design the network structure of PSRGAN to suppress artifact generation in images, and how can we achieve this suppression? Generative adversarial network image quality assessment: which evaluation metrics should be used to assess the generated images to ensure their perceived quality is enhanced? Adversarial training stability: how can we ensure the stability and convergence of PSRGAN training? To address these issues and further improve the visual quality of the restored SR images, we redesigned both the generators and the discriminators; the contributions of this paper are mainly as follows:

• We present a novel perception-design-oriented PSRGAN with double perception turbos, which can generate real-world SR images with naturally finer textures while suppressing unpleasant artifacts at a ×4 upscaling factor (see Figure 1).

• We design the first-perception turbo in the generator network, characterized by a slice operation and a three-level perception structure, which can extract multi-scale features from sliced sub-images and mitigate artifacts.

• We propose the second-perception turbo in the discriminator network, with cascaded perception turbo blocks, which can further promote the generator to restore subtle textures.

Related Work
Single-image super-resolution: SRCNN [18] is the first method to apply deep learning to SR reconstruction, and a series of learning-based works were subsequently proposed [19-23]. ESPCN [24] introduces an efficient sub-pixel convolution layer to perform the feature-extraction stages in LR space instead of HR space. VDSR [19] uses a very deep convolutional network. EDSR [25] removes the batch normalization layers from the network. SRGAN [10] first applies a GAN to the SR problem and proposes a perceptual loss comprising adversarial loss and content loss. Based on human perceptual characteristics, the residual-in-residual dense block (RRDB) strategy is exploited to implement various depths in network architectures [11,26]; ESRGAN [11] introduces the RRDB into its generator. RealSR [27] estimates various blur kernels and real noise distributions to synthesize different LR images. CDC [28] proposes a divide-and-conquer SR network. Luo et al., in [29], propose a probabilistic degradation model (PDM). Shao et al., in [30], propose a sub-pixel convolutional neural network (SPCNN) for image SR reconstruction.
Perceptual-driven approaches: PSNR-oriented approaches lead to overly smooth results lacking high-frequency details, and the results sometimes disagree with subjective human perception. To improve the perceptual quality of SR results, perceptual-driven approaches have been proposed. Based on the idea of perceptual similarity [31], Li Feifei et al. propose the perceptual loss in [32]; texture matching loss [33] and contextual loss [34] were then introduced. ESRGAN [11] improves the perceptual loss by using the features before activation and wins the PIRM perceptual super-resolution challenge [35]. Christian Szegedy et al. propose Inception [36], which can extract more features with the same amount of computation, thereby improving training results. To extract multi-scale information and enhance feature discriminability, RFB-ESRGAN [8] applies the receptive field block (RFB) [37] to super-resolution and wins the NTIRE 2020 perceptual extreme super-resolution challenge. There is still plenty of room for perceptual quality improvement [38].
The design of discriminator networks: The discriminator in SRGAN is VGG-style and is trained to distinguish between SR images and GT images [10]. ESRGAN borrows ideas from the relativistic GAN to improve the discriminator in SRGAN [11]. Real-ESRGAN improves the VGG-style discriminator of ESRGAN to a U-Net design [15]. In [39], Alejandro et al. propose a novel convolutional network architecture named the "stacked hourglass", which captures and consolidates information across all scales of the image. Inspired by [39], we propose a new discriminator structure that can guide the generator to recover finer textures. Table 1 summarizes this related work.

Artifact suppression: The instability of GAN training often introduces many perceptually unpleasant artifacts while generating details in GAN-based SR networks [40]. Several SR models focus on solving this problem. Zhang et al. propose a supervised pixel-wise generative adversarial network (SPGAN) to obtain higher-quality face images [41]. Gong et al., in [42], overcome the effect of artifacts in the super-resolution of remote sensing images using a self-supervised hierarchical perceptual loss. Real-ESRGAN uses spectral normalization (SN) regularization to stabilize the training dynamics [15]. We propose an algorithm named "image slice and multi-scale feature extraction", which can generate more delicate textures and suppress artifacts.
The evaluation metrics: DCNN-based SR approaches have two main optimization objectives: distortion metrics (e.g., PSNR, SSIM, IFC, and VIF [43-45]) and perceptual quality (e.g., the human opinion score, and no-reference quality measures such as Ma's score [46], NIQE [47], BRISQUE [48], and PI [49]) [50]. Yochai et al., in [49], reveal that distortion and perceptual quality are contradictory and that there is always a trade-off between the two: algorithms superior in perceptual quality tend to be poorer in terms of, e.g., PSNR and SSIM. However, there is sometimes also inconsistency between the results observed by human eyes and these perceptual quality metrics. Because no-reference metrics do not always match perceptual visual quality [51], some SR models, such as SRGAN, perform mean-opinion-score (MOS) tests to quantify the perceptual ability of different methods [10]. We use NIQE, NRQM, and PI as our image quality metrics, which measure the perceptual quality of the reconstructed image without depending on the GT image [52]. Table 2 summarizes the related work on evaluation metrics.

Distortion metrics: simple calculation, but greater inconsistency with perceived quality.
Human opinion score: consistent with visual perception, but high labor costs.
No-reference quality measures: balance consistency with perceived quality against computational cost, with some inconsistency with visual perception.

The transformer: Vaswani et al. in [36] propose a simple network architecture, the transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. The transformer continues to show remarkable capabilities in the NLP domain, and much research has tried to apply its powerful modeling ability to the field of computer vision [53]. In [54], Yang et al. propose TTSR, in which LR and HR images are formulated as queries and keys in a transformer, respectively, to encourage joint feature learning across LR and HR images. The Swin transformer [55] combines the advantages of convolution and the transformer, and Liang et al. in [56] propose SwinIR based on the Swin transformer. Vision transformers are computationally expensive and consume considerable GPU memory, so Lu et al. in [57] propose ESRT, which uses efficient transformers (ET), a lightweight version of the transformer structure.

Proposed Methods
To further improve perceptual quality and mitigate artifacts in SISR, we propose a novel perception-design-oriented super-resolution generative adversarial network (PSRGAN) with double perception turbos. In this section, we first introduce the generator network containing the first-perception turbo (GPT) and then describe the construction of the discriminator network with the second-perception turbo (DPT). Finally, we discuss the perceptual loss function used.

Generator Network
The generator network consists of two components, the first-perception turbo and the feature blending and upsampling component (FBUC), as shown in Figure 2. The first-perception turbo has two major blocks: the image slice block (ISB) and the multi-scale feature-extraction block (MFEB). The ISB produces four 1/4-size sub-images (I_sub^1, I_sub^2, I_sub^3, and I_sub^4) from the low-resolution image I_LR via pixel reassembly. Specifically, suppose I_LR has a resolution of 2m × 2n pixels (or is padded to 2m × 2n pixels); the sliced sub-images are then m × n pixels. If the upper-left pixel is denoted (0, 0) and the lower-right pixel is denoted (2m − 1, 2n − 1), the relationship between the pixels of I_LR and the sub-images can be formulated as

I_sub^1(i, j) = I_LR(2i, 2j),
I_sub^2(i, j) = I_LR(2i, 2j + 1),
I_sub^3(i, j) = I_LR(2i + 1, 2j),
I_sub^4(i, j) = I_LR(2i + 1, 2j + 1),

for 0 ≤ i ≤ m − 1 and 0 ≤ j ≤ n − 1.
The slice method above has the following characteristics:

•
The slice splits the LR image into multiple smaller adversarial sub-images while preserving the pixel integrity of the LR image.

•
The subsequent MFEB can extract multi-scale features from smaller adversarial samples; thus, the generator is capable of generating diverse and delicate textures.

•
The slice weakens the correlations among noisy pixels in I_LR, which can effectively reduce noise and further alleviate artifacts in the restored SR image. Although the correlations among adjacent pixels might also be impaired, the meaningful semantic features are eventually recovered in the SR image through GAN training.
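The slice operation described above corresponds to the pixel reassembly commonly called pixel unshuffle; a minimal PyTorch sketch (tensor shapes chosen purely for illustration):

```python
import torch
import torch.nn.functional as F

# A small RGB batch with even height and width (pad beforehand otherwise).
lr = torch.arange(2 * 3 * 4 * 6, dtype=torch.float32).reshape(2, 3, 4, 6)

# pixel_unshuffle with factor 2 turns each 2x2 neighborhood into 4 channels,
# i.e., it stacks the four 1/4-size sub-images along the channel dimension.
subs = F.pixel_unshuffle(lr, downscale_factor=2)  # shape: (2, 12, 2, 3)

# Equivalently, the four sub-images can be obtained by strided indexing:
s1 = lr[..., 0::2, 0::2]  # I_sub^1: even rows, even cols
s2 = lr[..., 0::2, 1::2]  # I_sub^2: even rows, odd cols
s3 = lr[..., 1::2, 0::2]  # I_sub^3: odd rows, even cols
s4 = lr[..., 1::2, 1::2]  # I_sub^4: odd rows, odd cols
```

Because the operation only reorders pixels, pixel shuffle inverts it exactly, so no pixel information is lost by the slice.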
The multi-scale feature-extraction block (MFEB, Figure 3): It has been proven that each learned filter has its specific functionality and that a reasonably larger filter size can grasp richer structural information, which in turn leads to better results [18]. The MFEB is perceptually designed to extract diverse image features from the LR image through three groups of convolutional layers inspired by inception networks [36], as depicted in Figure 3. Please refer to Appendix B for more detail.
The first convolution group has a tiny receptive field (denoted k1-n64-s1) to retain subtle micro-features; the other two groups use larger kernels to perceive coarser, multi-scale structures.
The outputs of the three convolution groups are activated by the sigmoid-weighted linear unit (SiLU) and then ×2-upsampled via pixel shuffle to obtain the multi-scale features F_1, F_2, F_3. Letting x_sub denote the four sub-images I_sub^1, I_sub^2, I_sub^3, and I_sub^4 merged along the color-channel dimension, the process can be formulated as

F_i = SiLU(Convs_i(x_sub)) ↑_s, i ∈ {1, 2, 3},

where Convs_i denotes the i-th convolution group, SiLU is the activation function, ↑ denotes upsampling, s is the scale factor (s = 2 in this block), and F_i, i ∈ {1, 2, 3}, are the extracted three-scale feature maps. Subsequently, F_1, F_2, and F_3 are combined along the channel dimension, and a residual-in-residual dense block (RRDB) [11] is adopted to further capture semantic information and improve the recovered textures; its output is denoted F. The formal processing in the first-perception turbo is described in Algorithm 1.

Feature blending and upsampling component (FBUC, Figure 2): The FBUC reassembles the obtained multi-scale features to generate the I_SR counterpart of I_LR. In the upsampling phase, the FBUC takes the diversified features F as input, upsamples them via pixel shuffle with scale factor s = 2, and gradually blends in the features extracted by the MFEB: at each stage, the current features are concatenated ('+') with MFEB features and passed through a convolutional layer with SiLU activation (f_Conv-SiLU), yielding the final features F_final. F_final is passed through three convolutional layers with a 3 × 3 kernel size and finally outputs I_SR, which is ×4-upscaled relative to the original I_LR.
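A minimal PyTorch sketch of the MFEB idea: three parallel convolution groups with different receptive fields, SiLU activation, and ×2 pixel-shuffle upsampling. The channel counts and the kernel sizes 1, 3, and 5 are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MFEBSketch(nn.Module):
    """Illustrative multi-scale feature extraction: three conv groups with
    kernel sizes 1, 3, and 5, each followed by SiLU and x2 pixel shuffle."""
    def __init__(self, in_ch=12, feat=64):
        super().__init__()
        # Each group outputs feat*4 channels so PixelShuffle(2) yields feat.
        self.groups = nn.ModuleList([
            nn.Conv2d(in_ch, feat * 4, k, padding=k // 2) for k in (1, 3, 5)
        ])
        self.act = nn.SiLU()
        self.up = nn.PixelShuffle(2)

    def forward(self, x_sub):
        # x_sub: the four sub-images merged along the channel dimension.
        feats = [self.up(self.act(g(x_sub))) for g in self.groups]
        return torch.cat(feats, dim=1)  # F1, F2, F3 blended channel-wise

mfeb = MFEBSketch()
x_sub = torch.randn(1, 12, 32, 32)  # four RGB sub-images of a 64x64 LR image
f = mfeb(x_sub)                     # (1, 192, 64, 64): three 64-channel maps
```

Each feature map comes back at the original LR resolution because the sub-images are half-size and the pixel shuffle doubles the spatial dimensions.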

Discriminator Network
We propose a novel discriminator containing a pre-processing block, cascaded perception turbo blocks (PTBs), and a post-processing block. The structure of the discriminator is depicted in Figure 4. The pre-processing block is used for the initial feature perception of I_SR and I_HR. As shown in Figure 4, it includes a CSR block, two residual blocks, and a downsampling layer. The CSR block consists of a convolutional layer, an SN layer, and a ReLU activation function. The specific structures of the two residual blocks, Res1 and Res2, are shown in Figure 5.
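Under the common reading that the SN layer denotes spectral normalization applied to the convolution's weights, a CSR block can be sketched as follows; the channel sizes are illustrative:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def csr_block(in_ch, out_ch, stride=1):
    """Conv -> spectral normalization -> ReLU. Spectral norm constrains the
    Lipschitz constant of the convolution, stabilizing GAN training."""
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)),
        nn.ReLU(inplace=True),
    )

block = csr_block(3, 64)
y = block(torch.randn(1, 3, 64, 64))  # (1, 64, 64, 64)
```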
The second-perception turbo is the core structure of this discriminator and consists of cascaded PTBs. To further promote the generator to restore subtle textures, we propose the PTB structure with the following four improvements over the hourglass module [39]:

• As shown in Figure 5, we adopt the CSR structure instead of BRC, which consists of a BN layer, a ReLU activation function, and a convolutional layer. It has been proven that removing BN layers can prevent BN artifacts in SR images, improve performance, and reduce computational complexity in the SR task [25]. In addition, we improve the perceptual loss by using the features before activation, which provides stronger supervision for brightness consistency and texture recovery [11].

• In the upsampling procedure, we use pixel shuffle instead of nearest-neighbor interpolation, which may lose pixel information.

• In the downsampling layer, we use convolution instead of the MaxPool2d operation, which may compromise the integrity of the feature map.

• We enlarge the input channels of the PTB to 128, which improves the perceptive capabilities of the discriminator.
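The upsampling and downsampling substitutions above can be sketched side by side; the layer widths here are illustrative:

```python
import torch
import torch.nn as nn

# Upsampling: pixel shuffle rearranges channels into space without
# discarding pixel information, unlike nearest-neighbor interpolation.
up = nn.Sequential(nn.Conv2d(128, 128 * 4, 3, padding=1), nn.PixelShuffle(2))

# Downsampling: a strided convolution learns how to pool, instead of
# MaxPool2d, which keeps only one value per window.
down = nn.Conv2d(128, 128, 3, stride=2, padding=1)

x = torch.randn(1, 128, 16, 16)
up_out = up(x)      # (1, 128, 32, 32)
down_out = down(x)  # (1, 128, 8, 8)
```

Both substitutions keep the tensor shapes of the hourglass design while making the resampling steps learnable and information-preserving.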
The post-processing block consists of three convolutional layers to further learn features and outputs a feature map that benefits the computation of the adversarial loss.
Based on the above improvements, the discriminator could further perceive multi-scale features at various spatial relationships and promote the generator to restore subtle textures driven by GAN.

Perception Loss
We introduce a loss function similar to that of ESRGAN: a hybrid weighted loss that accounts for both pixel-level recovery and visual perception and achieves better super-resolution quality. The total loss function of the generator, L_G, is therefore a weighted combination of several losses: the adversarial loss L_GAN, the pixel loss L_Pixel, and the perceptual loss L_Percep; the loss function of the discriminator, L_D, is the adversarial loss. L_G is described as

L_G = α L_GAN + β L_Pixel + γ L_Percep,

where α, β, γ are coefficients balancing the different loss terms, and L_Pixel = E_{x_i} ||G(x_i) − y||_1 is the one-norm distance between the recovered image G(x_i) and the HR image y, evaluating the average pixel-wise approximation of I_SR to I_HR. L_Percep is obtained by introducing a fine-tuned VGG19 network φ to compute the one-norm distance between high-level features of the recovered image and of y,

L_Percep = E_{x_i} ||φ(G(x_i)) − φ(y)||_1,

which evaluates the approximation of I_SR to I_HR in human perception. L_GAN aims to distinguish the SR image from the HR image through the superior perceptive capability of the discriminator D, which helps the generator learn sharper edges and more detailed textures; for the generator it can be formulated as

L_GAN = −E_{x_i}[log D(G(x_i))],

while the discriminator minimizes L_D = −E_y[log D(y)] − E_{x_i}[log(1 − D(G(x_i)))].
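A hedged sketch of the combined generator objective; the feature extractor passed in below is a stand-in for the fine-tuned VGG19 features, and the non-saturating adversarial term is an assumption about the exact GAN formulation:

```python
import torch
import torch.nn.functional as F

def generator_loss(sr, hr, d_sr, feat_fn, alpha=0.1, beta=1.0, gamma=1.0):
    """Sketch of L_G = alpha*L_GAN + beta*L_Pixel + gamma*L_Percep.
    d_sr: discriminator logits for the SR image; feat_fn: a frozen feature
    extractor (e.g., fine-tuned VGG19 up to some layer, an assumption here)."""
    l_pixel = torch.mean(torch.abs(sr - hr))                      # one-norm pixel loss
    l_percep = torch.mean(torch.abs(feat_fn(sr) - feat_fn(hr)))   # one-norm feature loss
    # Non-saturating adversarial term: push D(sr) toward the "real" label.
    l_gan = F.binary_cross_entropy_with_logits(d_sr, torch.ones_like(d_sr))
    return alpha * l_gan + beta * l_pixel + gamma * l_percep

sr = torch.rand(1, 3, 8, 8, requires_grad=True)
hr = torch.rand(1, 3, 8, 8)
loss = generator_loss(sr, hr, torch.zeros(1, 1), lambda x: x)  # identity "VGG"
```

The weights alpha, beta, gamma mirror the 0.1/1/1 weighting of GAN, L1, and perceptual losses reported in the training details.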

Experiments
In this section, we discuss our PSRGAN model, trained on the three RGB channels.

Training Details
The experiments are performed with a scaling factor of ×4 between LR and HR images; we obtain the corresponding four-times-smaller LR images by degrading the HR pictures, which are cropped to a size of 400 × 400, using the high-order degradation algorithm [15]. Meanwhile, the patch size of the cropped HR images is 256 × 256, and the patch size of the LR images is 64 × 64. During training, the batch size is set to 12 × 2, meaning that we use two GPUs with a batch size of 12 per GPU.
The training process is divided into two stages: pre-training the generator and then conducting GAN training with the generator and discriminator combined. First, in the pre-training process, we train the generator alone with the L1 loss; the learning rate is 2 × 10−4, and the total number of iterations is 0.4 million. Then, we employ the pre-trained generator model as initialization for the generator. The GAN is trained with a combination of L1 loss, perceptual loss, and GAN loss, with weights of 1, 1, and 0.1, respectively; the learning rate is set to 1 × 10−4 for both the generator and discriminator, and the total number of iterations is 0.28 million. Pre-training with the L1 loss helps obtain more visually pleasing results by avoiding undesired local optima for the generator; moreover, it helps the discriminator focus more on the texture part, since it receives relatively better super-resolved images during GAN training.
For optimization, we use Adam [58] with β1 = 0.9 and β2 = 0.99. We alternately update the generator and discriminator networks until the model converges. We implement our models with the PyTorch framework and train them on NVIDIA GeForce RTX 3090 GPUs.
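The alternating update scheme with the stated Adam settings can be sketched as follows; the one-layer generator and discriminator and the L1-only generator loss are placeholders for the real networks and the full L_G:

```python
import torch
import torch.nn.functional as F

g = torch.nn.Conv2d(3, 3, 3, padding=1)  # placeholder generator
d = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder discriminator

# Adam with beta1=0.9, beta2=0.99 and lr=1e-4, as in the GAN training stage.
opt_g = torch.optim.Adam(g.parameters(), lr=1e-4, betas=(0.9, 0.99))
opt_d = torch.optim.Adam(d.parameters(), lr=1e-4, betas=(0.9, 0.99))

lr_img = torch.rand(1, 3, 8, 8)
hr_img = torch.rand(1, 3, 8, 8)

for step in range(2):  # alternate generator/discriminator updates
    # --- generator step ---
    opt_g.zero_grad()
    sr = g(lr_img)
    g_loss = torch.mean(torch.abs(sr - hr_img))  # stand-in for the full L_G
    g_loss.backward()
    opt_g.step()

    # --- discriminator step (generator output detached) ---
    opt_d.zero_grad()
    real = d(hr_img)
    fake = d(g(lr_img).detach())
    d_loss = (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
              + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))
    d_loss.backward()
    opt_d.step()
```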

Data
For training, we use the DIV2K dataset [59], the Flickr2K dataset [21], and the OutdoorSceneTraining (OST) dataset [60]. These large datasets with rich textures help to generate SR pictures with more natural and subtle textures [11].

Qualitative Results
Due to the accessibility of SR methods, we compare our PSRGAN with several state-of-the-art methods, including BSRGAN, PDM_SR, SwinIR [56], LDL [40], ESRGAN, and real-ESRGAN+. We show some representative qualitative results with NIQE in Figure 6 and Table 3; more detailed results calculated with NRQM and PI are presented in Tables 4-6. Although the NIQE score of PSRGAN is not always the best, we still believe that focusing on the human visual perception of real pictures is crucial for SR; after all, the existing perception indexes do not reflect all the problems. Please refer to Appendix C for more qualitative results.

Ablation Study
In order to study the effects of each component in the proposed PSRGAN, we gradually modify the discriminators of PSRGAN and compare their differences. The overall visual comparison is illustrated in Figure 7; each column represents a model, with its configuration shown at the top, and the red sign indicates the best performance. A detailed discussion follows, with the model configurations listed in Table 7. Number of PTBs: The discriminator with the optimal number of cascaded PTBs has a strong representation capacity to capture semantic information, which can further improve the recovered textures, especially for regular structures like the wall in image OST 278 in Figure 6. We experimented with 2, 3, 4, 5, 6, and 7 PTBs; for simplicity, we only demonstrate the results for 3, 5, and 7, depicted in Figure 7. As shown, with 5 PTBs the results are relatively sharper, with richer textures than the others; in some cases, a prominent difference can be observed in the second, third, and fifth columns of Figure 7.
Channel size of PTB: Different PTB channel sizes influence the perceptive capabilities of the discriminator. We tested 3, 128, and 256 channels; for simplicity, we only demonstrate the results for 128 and 256 channels, as shown in Figure 7. With a channel size of 128, the results are clearer and have fewer artifacts.
Cross verification between PTBs and U-net: Please refer to Appendix A for details.

Running Times
Our method achieves moderate GPU run times for both training and testing thanks to its design characteristics. Our model reaches a superior level of quality after a rigorous training regimen of 490k iterations and exhibits test times on multiple datasets comparable to those of existing state-of-the-art models; notably, compared with SwinIR and LDL, it demonstrates a significant advantage in test-time efficiency. The algorithms were trained and tested on a server with NVIDIA GeForce RTX 3090 GPUs. Tables 8 and 9 compare the running times of different state-of-the-art models. More encouragingly, the images generated by PSRGAN are closer to the high-resolution originals in terms of human perception. Limitations: Despite these satisfactory achievements, we must recognize some limitations of PSRGAN. Computational requirements: the training and inference of PSRGAN require substantial computational resources, which may be a challenge for some applications. Data diversity: while our model performs well on multiple datasets, performance may degrade in specific domains or with uneven data distributions.
In our view, SR networks will develop toward overcoming these current limitations, and the trend for super-resolution applications is to reduce the computational burden and to generalize to diversified datasets.

Conclusions
We have presented the PSRGAN model, which achieves superior perceptual quality in terms of both evaluation metrics and visual effects. According to the experimental results, our proposed PSRGAN, based on generative adversarial networks, outperforms current state-of-the-art intelligent algorithms (BSRGAN, real-ESRGAN, PDM_SR, SwinIR, LDL, etc.) on several evaluation metrics (NIQE, NRQM, and PI) with a ×4 upscaling factor on various datasets (OST300, DRealSR_Test_x4, RealSR-Canon, etc.). The PSRGAN model mainly consists of two kinds of perception turbo (PT): the GPT in the generator network and the DPT in the discriminator network. In terms of visual effects, the proposed image slice block mitigates the artifacts and noise in the reconstructed image, the three-level perception structure in the GPT extracts diversified textures, and the cascaded PTBs in the DPT further promote the generator to restore subtle textures.

Figure 2. Architecture of the generator network with the corresponding kernel size (k), number of feature maps (n), and stride (s) indicated for each convolutional layer, where F_1, F_2, and F_3 are the multi-scale features extracted by the MFEB described in Figure 3.

Figure 4. Discriminator network structure with the second-perception turbo. The structures of CSR, Res1, and Res2 are shown in Figure 5.

Figure 6. Qualitative results of PSRGAN. PSRGAN produces more subtle textures and clearer structures, e.g., animal texture and building structure, as well as fewer unpleasant artifacts, e.g., artifacts in fonts. Zoom in for the best view.

Figure 7. Visual comparisons of different configurations in PSRGAN. The red sign indicates the best performance.

Table 1 .
Related work on design of discriminator networks.

Table 2 .
Related work on evaluation metrics.

It can be observed from the figure that the results of our proposed PSRGAN outperform previous approaches in both detail and clearness, with fewer artifacts. For instance, PSRGAN produces clearer, more natural lion fur (see 0901) and more detailed wall structures (see OST 278) than BSRGAN and LDL, whose textures are unnatural, skewed, and contain unpleasing noise. Compared with PSRGAN, ESRGAN and real-ESRGAN+ fail to produce enough details. Moreover, PSRGAN is capable of boosting visual sharpness (see DSC 1454 x1), while other methods either produce blurry structures (ESRGAN, PDM SR, and SwinIR) or do not generate enough details (BSRGAN). In addition, previous GAN-based methods sometimes introduce unpleasant artifacts, such as BSRGAN and real-ESRGAN+; our PSRGAN eliminates these artifacts and obtains cleaner results (see Canon 40 x1).

Table 3 .
NIQE scores on diverse testing datasets (the lower, the better). Colors R, G, and B indicate the first-, second-, and third-best NIQE results among the models in each dataset row. The NIQE calculation is derived from the basic SR package of PyTorch 1.11.0 + cu113.

Table 4 .
NIQE scores on diverse testing datasets (the lower, the better). Colors R, G, and B indicate the first-, second-, and third-best NIQE results among the models in each dataset row. The NIQE calculation follows PIRM2018, derived from https://github.com/roimehrez/PIRM2018 (accessed on 1 June 2023).

Table 5 .
NRQM scores on diverse testing datasets (the higher, the better). Colors R, G, and B indicate the first-, second-, and third-best NRQM results among the models in each dataset row. The NRQM calculation follows PIRM2018, derived from https://github.com/roimehrez/PIRM2018 (accessed on 1 June 2023).

Table 6 .
PI scores on diverse testing datasets (the lower, the better). Colors R, G, and B indicate the first-, second-, and third-best PI results among the models in each dataset row. The PI calculation follows PIRM2018, derived from https://github.com/roimehrez/PIRM2018 (accessed on 1 June 2023).

Table 7 .
Model with different configurations.

Table 8 .
The GPU run times for training of different networks. The unit is the number of iterations, and k represents thousands. Since Bicubic is not an adversarial neural network, it has no iteration count.