Generative Adversarial Network for Image Super-Resolution Combining Texture Loss

Objective: Super-resolution reconstruction is an increasingly important area in computer vision. Super-resolution models based on generative adversarial networks are difficult to train and produce artifacts in their reconstruction results; to alleviate both problems, we propose an improved algorithm. Methods: This paper presents TSRGAN (Super-Resolution Generative Adversarial Networks Combining Texture Loss), a model that is also based on generative adversarial networks, with a redefined generator network and discriminator network. First, in the network structure, residual dense blocks without batch normalization layers form the generator network, and the Visual Geometry Group (VGG)19 network is adopted as the basic framework of the discriminator network. Second, in the loss function, a weighted combination of four losses (texture loss, perceptual loss, adversarial loss and content loss) is used as the objective function of the generator. Texture loss is proposed to encourage local information matching. Perceptual loss is enhanced by computing it on the features before the activation layer. Adversarial loss is optimized based on WGAN-GP (Wasserstein GAN with Gradient Penalty) theory. Content loss is used to ensure the accuracy of low-frequency information. During optimization, the target image information is thus reconstructed from both the high-frequency and the low-frequency perspective. Results: In our experiments, the reconstructed images reach an average Peak Signal to Noise Ratio of 27.99 dB and an average Structural Similarity Index of 0.778 without losing too much speed, which is superior to the other compared algorithms on objective evaluation indices. TSRGAN also significantly improves subjective visual quality, such as brightness information and texture details: it generates images with more realistic textures and more accurate brightness, which are more in line with human visual evaluation. Conclusions: Our improvements to the network structure reduce the model's computational cost and stabilize the training direction. In addition, the loss function we present for the generator provides stronger supervision for restoring realistic textures and achieving brightness consistency. The experimental results demonstrate the effectiveness and superiority of the TSRGAN algorithm.


Introduction
With the spread of the Internet and the development of information technology, the amount of information received by humans is growing explosively. Images, videos and audio are the main carriers of information. Related research [1] has pointed out that the information humans receive through vision accounts for 60%~80% of all media information, so visible images are an important way to obtain information. However, the quality of an image is often restricted by hardware such as the imaging system and by the bandwidth available during image transmission.
In this paper, we make two improvements. Firstly, we use the residual dense block (RDB) as the basic unit of the generator network and adopt the Visual Geometry Group (VGG)19 network as the basic framework of the discriminator network. This strengthens the reuse of forward features, reduces the number of training parameters and controls the training direction of the reconstructed images. Secondly, four losses are introduced to constitute the total objective function of the generator. We propose texture loss to encourage local information matching, enhance perceptual loss by computing it on the features before the activation layer, optimize adversarial loss based on WGAN-GP (Wasserstein GAN with Gradient Penalty) theory and use content loss to ensure the accuracy of low-frequency information. Experimental results show that the model achieves good results and generates images with more realistic textures.

Generative Adversarial Networks
The GAN is a network framework proposed by Ian Goodfellow et al. [22] that estimates a generative model through an adversarial process. Its basic idea is a zero-sum game, and the generator (G) and discriminator (D) constitute the main framework of the model. A GAN is trained through adversarial learning toward a Nash equilibrium [25], with the goal of estimating the underlying distribution of the data and generating new data samples.
G and D can be represented by any differentiable functions, taking a random variable z and real data x as input, respectively. G(z) is the sample generated by G, which should follow the distribution of real samples (p_data) as closely as possible. If D's input is a real sample, D outputs 1; otherwise D outputs 0. D thus acts as a binary classifier. The goal of G is to fool D, so that D gives generated samples a score closer to 1. G and D oppose each other and are optimized iteratively until D cannot distinguish whether an input sample comes from G or from the real data, at which point the target G is considered to have been obtained. The basic framework of this process is shown in Figure 1. The objective function of GAN is as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
where p_z is the prior distribution of z. G minimizes the objective function to generate samples that can better confuse D, and D maximizes the objective function so that it can better distinguish the authenticity of input samples.
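As a small numerical illustration of this objective, the sketch below estimates V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] from lists of discriminator outputs. The function name `gan_value` and the sample values are hypothetical and chosen only for illustration; they are not taken from the paper's experiments.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from fake well pushes the value toward 0,
# which is the upper bound of V because both log terms are non-positive.
confident = gan_value([0.99, 0.98], [0.01, 0.02])

# A discriminator that cannot tell the two apart outputs 0.5 everywhere,
# giving 2 * log(0.5) = -log(4).
confused = gan_value([0.5, 0.5], [0.5, 0.5])

print(confident, confused)
```

D tries to raise this value and G tries to lower it, so the gap between the two printed numbers is what the adversarial game is fought over.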

Dense Convolutional Network
In deep networks, the vanishing-gradient problem becomes more serious as the number of layers increases. ResNets [26], Highway Networks [27] and the stochastic depth structure [28] are all networks designed to address this problem. Although they differ in network structure and training process, their key idea is the same: create short paths from earlier feature layers to later ones. To maximize information flow between layers, Huang et al. [29] proposed the dense convolutional network (DenseNet), in which each layer receives the feature maps of all preceding layers as additional inputs and passes its own feature maps to all subsequent layers. DenseNet enables deeper and more efficient convolutional networks; its dense connection mechanism is shown in Figure 2. The network has clear advantages in mitigating vanishing gradients. Moreover, a structural design that enhances feature propagation and feature reuse can greatly reduce the number of parameters. DenseNet has been widely used in semantic segmentation [30], speech recognition [31] and image classification [29].
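The dense connection pattern can be made concrete by counting how many input channels each layer sees. The sketch below assumes that every layer emits a fixed number of feature maps (`growth_rate`) and that inputs are concatenated along the channel axis; the function name and the example sizes are hypothetical.

```python
def dense_block_input_channels(in_channels, growth_rate, num_layers):
    """Input channel count seen by each layer of a dense block.

    Layer l receives the block input concatenated with the outputs of all
    earlier layers, each of which contributes `growth_rate` feature maps.
    """
    return [in_channels + l * growth_rate for l in range(num_layers)]

# Hypothetical sizes: a 64-channel input and 4 layers that each emit 32 maps.
channels = dense_block_input_channels(in_channels=64, growth_rate=32, num_layers=4)
print(channels)  # [64, 96, 128, 160]
```

Each layer only has to produce a small number of new maps because everything computed earlier stays available to it, which is the source of the feature reuse and the parameter savings described above.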
Appl. Sci. 2020, 10, x FOR PEER REVIEW

Proposed Methods
This paper uses the generative adversarial network as the main framework, consisting of a generator network and a discriminator network. The overall structure of TSRGAN is shown in Figure 3. The generator takes an LR image as input, and its first convolutional layers extract features. The feature maps are then fed into the residual module for non-linear mapping, after which the image is reconstructed through the upsampling layers and a convolutional layer, and the network outputs the reconstruction result. Finally, the fake and real HR images are input separately into the discriminator network, which judges the authenticity of each image.

Generator Network
To further improve the quality of image reconstruction, this paper modifies the generator of the SRGAN model. Firstly, all BN layers in SRGAN are removed. BN tends to introduce artifacts and limits the generalization ability of the network. Studies on the SR task [23] and the deblurring task [32] have shown that removing the BN layers can improve reconstruction performance and reduce computational complexity. Secondly, Leaky Rectified Linear Unit (LeakyReLU) is used instead of Rectified Linear Unit (ReLU) as the network's non-linear activation function to avoid the vanishing-gradient problem:
$$y = \max(0, x) + a \cdot \min(0, x)$$
where x is the input, y is the output and a is a constant between 0 and 1. Finally, the research in [31,33,34] shows that deeper networks and multi-level connections can improve the performance of an algorithm. Therefore, we use the RDB instead of the Residual Block (RB) used in SRGAN as the basic network unit. The RDB has a deeper and more complex structure than the RB and combines the advantages of residual networks and dense connections: it increases the depth of the network while improving the reuse of image feature information, which ultimately improves the quality of the reconstructed images. The specific structure is shown in Figure 4. Our generator network is a deep model with 36 RDBs, which gives it larger capacity and a stronger ability to capture semantic information. Therefore, it can reduce noise in the reconstructed images and generate images with more realistic textures.
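The LeakyReLU activation, y = max(0, x) + a * min(0, x), can be sketched in a few lines. The slope a = 0.2 used below is an assumed value for illustration only; the text states only that a lies between 0 and 1.

```python
def leaky_relu(x, a=0.2):
    """LeakyReLU: y = max(0, x) + a * min(0, x), with 0 < a < 1.

    The default a = 0.2 is an assumption for this sketch, not a value
    reported in the paper.
    """
    return max(0.0, x) + a * min(0.0, x)

# Negative inputs are scaled by a instead of being zeroed, so their gradient
# is a rather than 0; positive inputs pass through unchanged.
outputs = [leaky_relu(v) for v in (-2.0, 0.0, 3.0)]
print(outputs)
```

Because the negative side keeps a non-zero slope, units that receive negative inputs still pass a gradient backwards, which is the property the paragraph above relies on.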

Discriminator Network
For the discriminator network, this paper uses the classic VGG19 network as the basic architecture, which can be divided into two modules: feature extraction and linear classification. The feature extraction module includes 16 convolutional layers, each followed by LeakyReLU as the activation function. In addition, a BN layer is used after every convolutional layer except the first to avoid the vanishing-gradient problem and enhance the model's stability. The discriminator then has to classify the input image. Most image classification models use fully connected layers for this step; we use Global Average Pooling (GAP) [35] instead, because fully connected layers would slow down training and increase the risk of overfitting. GAP computes the average value of each feature map, and these averages are linearly combined and passed to a sigmoid activation function. The network finally outputs D's judgement of the input sample. Training the discriminator network helps the generator network produce results that are closer to the ground-truth images.
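The classification head described above (GAP, linear fusion, sigmoid) can be sketched as follows. This is a minimal pure-Python illustration rather than the paper's implementation; the function name, the weights and the toy feature maps are hypothetical.

```python
import math

def gap_sigmoid_head(feature_maps, weights, bias=0.0):
    """Global average pooling, a linear combination, then a sigmoid.

    feature_maps: list of 2-D feature maps (lists of rows of floats).
    weights: one weight per feature map (hypothetical learned parameters).
    """
    # GAP: one scalar per feature map, the mean over all spatial positions.
    pooled = [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]
    # Linear fusion of the pooled values, squashed to (0, 1) by the sigmoid.
    logit = sum(w * p for w, p in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]
score = gap_sigmoid_head(maps, weights=[0.5, -1.0])
print(score)
```

The head needs only one weight per feature map, whereas a fully connected layer on the flattened maps would need one weight per spatial position of every map.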

Loss Functions
The loss function is an important factor that affects the quality of image reconstruction. To restore high-frequency information and improve the perceived visual quality of the image, this paper uses content loss L_con, adversarial loss L_adv, perceptual loss L_per and texture loss L_tex as the objective function of the generator network. The four losses are combined as a weighted sum, where λ and η are the coefficients used to balance the different loss functions.

Content Loss
Mean Square Error (MSE) loss is used as the model's content loss to ensure the consistency of low-frequency information between the reconstructed image and the LR image. It minimizes the squared error between corresponding pixels of the generated and real HR images. Reducing this pixel-wise distance ensures the accuracy of the reconstructed image information quickly and effectively, so that the results achieve a higher peak signal-to-noise ratio:
$$L_{con} = \frac{1}{N} \sum_{i=1}^{N} \left\| G(I_i^L, \theta) - I_i^H \right\|^2$$
where I_i^H represents the real HR image, I_i^L represents the LR image, N represents the number of training samples and G(x, θ) represents the mapping function between LR and HR images learned by the generator network.
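A minimal sketch of an MSE content loss of this kind is given below, under the assumption that the loss is the squared L2 distance between each generated image and its HR target, averaged over the N samples. Images are represented as flat lists of pixel values, and the function name and the toy batch are hypothetical.

```python
def content_loss(generated, targets):
    """Mean over N samples of the squared L2 distance between each
    generated image and its HR target (images as flat lists of pixels)."""
    total = 0.0
    for gen, hr in zip(generated, targets):
        total += sum((g - h) ** 2 for g, h in zip(gen, hr))
    return total / len(generated)

# Toy batch of N = 2 "images" with 2 pixels each.
gen_batch = [[0.0, 1.0], [2.0, 2.0]]
hr_batch = [[0.0, 0.0], [1.0, 4.0]]
loss = content_loss(gen_batch, hr_batch)
print(loss)  # 3.0
```

The loss is zero only when every generated pixel equals its target, which is why minimizing it tends to raise PSNR, a metric that is itself a function of the pixel-wise MSE.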

Adversarial Loss
In the adversarial game between the generator and discriminator networks, the discriminator has to produce the probability that an image output by the generator is real or fake. To maximize the probability that the reconstructed image deceives D, we adopt the adversarial loss proposed in the WGAN-GP [36] model in place of the one proposed in the original GAN model. The improved L_adv penalizes the norm of D's gradient with respect to its input, which stabilizes the training of the GAN architecture and generates higher-quality samples with faster convergence and little need for hyperparameter tuning.
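For reference, the objective published with WGAN-GP [36] is reproduced below as a sketch of what such a loss looks like; the exact form used in TSRGAN is not shown in this text. The symbols follow the WGAN-GP formulation and are assumptions here: P_r and P_g are the real and generated distributions, x̂ is sampled on straight lines between real and generated samples, and λ_gp is the penalty weight (written with a subscript to keep it distinct from the λ that balances the generator losses).

```latex
% Discriminator (critic) loss with gradient penalty
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
    + \lambda_{gp} \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \left[ \left( \left\| \nabla_{\hat{x}} D(\hat{x}) \right\|_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]

% Corresponding generator loss
L_{adv} = - \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
```

The penalty term is smallest when the gradient norm of D with respect to its input equals 1, which is how the method enforces the Lipschitz constraint without weight clipping.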

Perceptual Loss
To generate images with more accurate brightness and more realistic textures, L_per based on the VGG network is calculated using the feature information before the activation layer instead of after it. It is defined on a feature layer of the pre-trained deep network and minimizes the Euclidean distance between the two feature maps:
$$L_{per} = \frac{1}{W_{ij} H_{ij}} \sum_{x=1}^{W_{ij}} \sum_{y=1}^{H_{ij}} \left( \phi_{ij}(I^H)_{x,y} - \phi_{ij}(G(I^L, \theta))_{x,y} \right)^2$$
where W_ij and H_ij describe the dimensions of the respective feature maps within the VGG network, and φ_ij indicates the feature map obtained by the j-th convolution (taken before activation in this paper) before the i-th max-pooling layer within the network. The improved L_per overcomes two drawbacks of the original design. First, activated features are very sparse, especially deep in the network, and sparse activations provide weak supervision and thus lead to inferior performance. Second, using features after activation causes the reconstructed brightness to be inconsistent with the ground-truth image.
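The effect of taking features before rather than after the activation can be seen on a toy example. The sketch below uses hand-picked feature values and a plain ReLU in place of a real VGG network; all names and numbers are hypothetical.

```python
def relu(v):
    """Plain ReLU, standing in for the activation that follows a VGG convolution."""
    return max(0.0, v)

def feature_mse(feat_a, feat_b):
    """Mean squared difference between two flattened feature maps."""
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)) / len(feat_a)

# Hypothetical pre-activation features of an HR image and of a reconstruction.
hr_pre = [-1.0, 2.0, -3.0, 0.5]
sr_pre = [-2.0, 2.0, -1.0, 0.5]

# Before activation, the differences in the negative entries are visible.
loss_before = feature_mse(hr_pre, sr_pre)

# After activation, every negative entry becomes 0 in both maps, so the
# same differences contribute nothing to the loss.
loss_after = feature_mse([relu(v) for v in hr_pre], [relu(v) for v in sr_pre])

print(loss_before, loss_after)  # 1.25 0.0
```

In this example the post-activation loss is exactly zero although the two feature maps differ, a small-scale version of the weak supervision that sparse activated features provide.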

Texture Loss
Although perceptual loss can improve the overall quality of the reconstructed image, it can still introduce unnecessary high-frequency structures. We therefore incorporate the texture loss presented in [21] into the total loss function of G. L_tex encourages local matching of texture information: it extracts the feature maps generated by the intermediate convolutional layers of the generator and discriminator networks, computes the corresponding Gram matrices, and then applies the L2 loss to the resulting Gram matrix values:
$$L_{tex} = \left\| G(\phi(I^{gen})) - G(\phi(I^H)) \right\|_2^2$$
where I^gen indicates the images reconstructed by the generator, φ denotes the extracted feature maps, and G here indicates the Gram matrix (not the generator), G(F) = FF^T. Texture loss provides strong supervision to further reduce visually implausible artifacts and produce more realistic textures.
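A minimal sketch of a Gram-matrix texture loss is shown below, assuming that F is a feature matrix with one row per channel and one column per spatial position, so that G(F) = FF^T holds the channel-to-channel correlations. The function names and the toy features are hypothetical.

```python
def gram(features):
    """Gram matrix G(F) = F F^T for F given as a list of channel rows."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
        for row_i in features
    ]

def texture_loss(feat_gen, feat_hr):
    """Squared L2 (Frobenius) distance between the two Gram matrices."""
    g_gen, g_hr = gram(feat_gen), gram(feat_hr)
    return sum(
        (x - y) ** 2
        for row_g, row_h in zip(g_gen, g_hr)
        for x, y in zip(row_g, row_h)
    )

feat_hr = [[1.0, 0.0], [0.0, 1.0]]   # 2 channels, 2 spatial positions
feat_gen = [[1.0, 1.0], [0.0, 1.0]]
loss = texture_loss(feat_gen, feat_hr)
print(loss)  # 3.0
```

Because the Gram matrix sums over spatial positions, swapping the two positions in `feat_hr` leaves it unchanged; the loss compares texture statistics rather than exact pixel locations.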

Experimental Details
The experimental platform is an NVIDIA GeForce MX150 GPU and an Intel(R) Core(TM) i7-8550U CPU @ 2.00 GHz with 8 GB RAM. The software used is PyCharm 2017 and MATLAB 2018a, and the PyTorch deep learning toolbox is used to build and train the network. This paper uses the DIV2K dataset, which consists of 800 training images, 100 validation images and 100 testing images. We augment the training data with random horizontal flips and 90° rotations. We evaluate on three widely used benchmark datasets: Set5 [37], Set14 [38] and BSD100 [39]. All experiments are performed with a scale factor of 4× between low- and high-resolution images. The mini-batch size is set to 16. The spatial size of the cropped HR patch is 128 × 128.
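The augmentation step can be sketched as follows. This is an illustration only: it assumes a flip probability of 0.5 and a rotation by a random multiple of 90 degrees, neither of which is specified in the text, and the function names are hypothetical.

```python
import random

def hflip(img):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate an image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img, rng):
    """Random horizontal flip followed by a random multiple of 90 degrees.

    The 0.5 flip probability and the uniform choice of 0 to 3 quarter turns
    are assumptions made for this sketch.
    """
    if rng.random() < 0.5:
        img = hflip(img)
    for _ in range(rng.randrange(4)):
        img = rot90(img)
    return img

patch = [[1, 2], [3, 4]]
rotated = rot90(patch)                      # [[3, 1], [4, 2]]
out = augment(patch, random.Random(0))      # seeded for repeatability
print(rotated, out)
```

Flips and quarter turns only rearrange pixels, so the same transform has to be applied to an LR patch and its HR counterpart to keep the pair aligned.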
The training process is divided into two stages. First, we train a generative model with L1 loss as the objective function. Then, we use this pre-trained model as the initialization of G, and the generator is trained using the loss function in Equation (3). The initial learning rate is set to 1 × 10^−4. For optimization, we use Adam with β1 = 0.9 and β2 = 0.999. We alternately update the generator and discriminator networks until the model converges. In addition, we introduce a residual scaling [40] strategy, which scales down the residuals by multiplying them by a constant β between 0 and 1 before adding them to the main path to prevent instability. β is set to 0.2 in this paper.
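Residual scaling amounts to one multiplication before the skip connection is added. The sketch below uses β = 0.2 as stated in the text; the function name and the toy vectors are hypothetical.

```python
def scaled_residual_add(identity, residual, beta=0.2):
    """Return identity + beta * residual, element-wise.

    identity: values on the main path.
    residual: output of the residual branch.
    beta: scaling constant between 0 and 1 (0.2 in the text).
    """
    return [x + beta * r for x, r in zip(identity, residual)]

out = scaled_residual_add([1.0, 2.0], [10.0, -5.0])
print(out)
```

With β below 1, a residual branch that produces large values perturbs the main path less, which is the stabilizing effect the strategy is used for.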
To evaluate image quality accurately and demonstrate the effectiveness of the algorithm, Peak Signal to Noise Ratio (PSNR) and Structural Similarity Index (SSIM) are adopted as image quality evaluation indicators. In the definition of SSIM, µ_X and µ_Y represent the mean values of images X and Y, σ_X and σ_Y represent their standard deviations and σ_XY represents their covariance. PSNR measures image distortion from pixel-wise differences, and SSIM measures the similarity of images in terms of brightness, contrast and structure. The larger the two values, the closer the reconstruction result is to the ground-truth image.
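Both metrics can be sketched with their commonly used definitions: PSNR = 10 log10(MAX^2 / MSE), and SSIM = ((2 µ_X µ_Y + C1)(2 σ_XY + C2)) / ((µ_X^2 + µ_Y^2 + C1)(σ_X^2 + σ_Y^2 + C2)). The version below makes several assumptions that the text does not state: 8-bit images (MAX = 255), the usual constants C1 = (0.01 MAX)^2 and C2 = (0.03 MAX)^2, and a single SSIM value computed over the whole image instead of over sliding windows.

```python
import math

def psnr(x, y, max_value=255.0):
    """PSNR in dB between two equally sized images given as flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

def ssim_global(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM computed once over the whole image (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Small constants that keep the ratio stable when means or variances are near 0.
    c1, c2 = (k1 * max_value) ** 2, (k2 * max_value) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

ref = [0.0, 64.0, 128.0, 255.0]
noisy = [10.0, 54.0, 138.0, 245.0]   # every pixel off by 10, so MSE = 100
print(psnr(ref, noisy), ssim_global(ref, noisy))
```

Values produced by this sketch are not directly comparable with those in the paper's tables, since published SSIM figures are normally averaged over local windows and the evaluation protocol is not described here.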

Quantitative Evaluation
We have performed super-resolution experiments on Set5 and Set14 to analyze the effects on super-resolution performance of introducing the RDB structure and L_tex and of improving the original L_adv and L_per.
The PSNR values of the different model variants are shown in Table 1. Each of the four measures improves the super-resolution performance of the network, and the effect is best when all of them are used. In addition, we have tried different values of λ and η in Equation (3) in experiments on Set5. The reconstruction is best when λ = 3 × 10^−3 and η = 2 × 10^−2. Table 2 presents the average PSNR results on the Set5 dataset. For a fair comparison, the SISR methods compared are Bicubic [4], ScSR [8], SRGAN [17], EDSR [23] and ESRGAN [24], and all of them are tested on Set5, Set14 and BSD100. The average PSNR/SSIM values of these methods on the different datasets are recorded in Table 3, and their total running times are recorded in Table 4. Table 3 shows that the PSNR of TSRGAN is generally better than that of the other algorithms. Its SSIM is also superior to that of the other algorithms, except on Set14, where it is 0.009 lower than that of ESRGAN. Table 4 shows that Bicubic takes the shortest time because it only performs interpolation. ScSR takes longer because it learns sparse representation dictionaries between LR and HR image patch pairs. The SRGAN, EDSR, ESRGAN and TSRGAN models all need a longer time to train because they have many convolutional layers. SRGAN has the slowest reconstruction speed because the BN layers are not removed from its network structure, while TSRGAN is slightly slower than EDSR and ESRGAN due to its deeper network and the texture loss. Taking Tables 3 and 4 together, TSRGAN clearly improves the PSNR and SSIM indicators of reconstruction quality without losing too much speed, which verifies its effectiveness and superiority.
To compare visual quality, we select one image each from Set5 and Set14. The actual reconstruction results of each algorithm are shown in Figures 5 and 6. Bicubic and ScSR reconstruct too few details, and the generated images are very blurred. Although SRGAN and EDSR restore some high-frequency information, their edges are poorly sharpened. The overall effect of ESRGAN is better, but it introduces unpleasant artifacts and noise. The reconstruction results of TSRGAN are superior to those of the other algorithms in terms of sharpness and detail. As the enlarged details in Figure 5 show, TSRGAN generates clearer and more natural hat textures. In Figure 6, TSRGAN generates an image with more accurate brightness information and more pleasing texture details.

Conclusions
Based on the generative adversarial network framework, we have described a super-resolution model, TSRGAN. We remove the BN layers and introduce residual dense blocks to deepen the structure of the generator network. In addition, we use WGAN-GP to improve the adversarial loss, which provides stronger and more effective supervision for model training. Moreover, we enhance the perceptual loss by using the features before the activation layer, which offer stronger supervision and thus restore more accurate brightness and more realistic textures. Finally, we adopt the texture loss, which encourages the matching of local texture details to achieve better outcomes. The experimental results show that our method reaches an average PSNR of 27.99 dB and an average SSIM of 0.778 on the reconstructed images without losing too much speed, which is superior to the other compared algorithms on objective evaluation indices. TSRGAN also significantly improves subjective visual quality, such as brightness information and texture details, which further indicates that our algorithm can reconstruct more realistic images. In future work, we will consider super-resolution reconstruction of images in specific fields or scenes to improve the quality of image generation.
