Real Sample Consistency Regularization for GANs

Mode collapse has always been a fundamental problem in generative adversarial networks (GANs). The recently proposed Zero Gradient Penalty (0GP) regularization can alleviate mode collapse, but it exacerbates the discriminator's misjudgment problem, that is, the discriminator judges some generated samples to be more real than the real samples. In actual training, the discriminator directs the generated samples toward samples with higher discriminator outputs. Serious misjudgment by the discriminator causes the generator to produce unnatural images and reduces the quality of the generation. This paper proposes Real Sample Consistency (RSC) regularization. During training, we randomly divide the real samples into two parts and minimize the difference between the discriminator's outputs on these two parts, forcing the discriminator to output the same value for all real samples. We analyze the effectiveness of our method. The experimental results showed that our method alleviates the discriminator's misjudgment and performs better, with a more stable training process, than 0GP regularization. Our real sample consistency regularization improved the FID score for the conditional generation of Fake-As-Real GAN (FARGAN) from 14.28 to 9.8 on CIFAR-10. Our RSC regularization improved the FID score from 23.42 to 17.14 on CIFAR-100 and from 53.79 to 46.92 on ImageNet2012, and improved the average distance between the generated and real samples from 0.028 to 0.025 on synthetic data. With our regularization, the losses of the generator and discriminator in the standard GAN stayed close to their theoretical values and remained stable during training.


Introduction
Since Goodfellow [1] proposed the generative adversarial network in 2014, it has achieved great development [2,3] and has been applied in many areas [4-9], such as image inpainting, super-resolution reconstruction, style transfer, and image editing. However, researchers are still looking for ways to improve GANs, especially ways to solve mode collapse and training instability [10-12]. Thanh-Tung [13] argued that in the later training stage the generated samples and the real samples are very similar, yet the discriminator can still distinguish between them, resulting in a gradient explosion. In this case, the generator's gradient in the minibatch points to samples where the gradient explodes, and mode collapse occurs.
Thanh-Tung proved that when the generated distribution approaches the real distribution, the generator's gradient should tend to zero. Therefore, the author proposed 0GP regularization on the linear interpolation between the real samples and the generated samples to alleviate mode collapse. Mescheder [14] proved that 0GP regularization on real samples could guarantee convergence when initialized sufficiently close to equilibrium.
However, the experiments in this paper showed that both 0GP regularization on the linear interpolation between real and generated samples and 0GP regularization on real samples exacerbated the discriminator's misjudgment, that is, the discriminator outputs a higher value for some generated samples than for the real samples. The discriminator's misjudgment makes it more difficult for the generated samples to converge to the real samples and guides the generator to generate unnatural images, reducing the quality of the generation.
It is necessary for the discriminator to output higher values for some generated samples than for the real samples, because if the discriminator could perfectly distinguish the generated samples from the real samples, training would collapse. However, if there are massive numbers of generated samples for which the discriminator outputs higher values, the quality of the generation is reduced. In actual training, the discriminator directs the generated samples toward samples with higher discriminator outputs, regardless of whether they are real samples. If the discriminator judges that massive numbers of generated samples are more real than the real samples, then the generator's gradient within a minibatch will be directed more toward the generated samples with high discriminator outputs than toward the real samples. The result is that the generator generates many meaningless images, reducing the quality of the generation. 0GP regularization exacerbates this misjudgment problem.
Tao and Wang [15] proposed fake-as-real GAN based on 0GP regularization on real samples. When updating the discriminator, the generated samples with the lowest discriminator output in the minibatch should be regarded as real samples. However, the problem of the discriminator's misjudgment is still unresolved.
This paper focuses on solving the discriminator's misjudgment and achieving better performance with a more stable training process. Our contributions are as follows: 1. We analyze the discriminator's misjudgment: due to 0GP regularization, there are more cases during training where the discriminator's gradient at the real samples is smaller than its gradient at the generated samples; 2. We propose Real Sample Consistency (RSC) regularization, forcing the discriminator to output the same value for all real samples. For real samples, real sample consistency regularization reduces the proportion of samples for which the discriminator's output is less than 1/2. Experiments on synthetic and real-world datasets verified that our method achieves better performance than 0GP regularization.

Related Work
Researchers have been committed to improving generative adversarial networks. Reference [16] utilized the Wasserstein distance and weight clipping to regularize GANs. The Wasserstein distance can solve the vanishing gradient problem, but weight clipping decreases the model's fitting ability. Reference [17] proposed One Gradient Penalty (1GP) regularization, which improves the fitting ability of the model but does not guarantee convergence. Mescheder [14] proved that 0GP regularization on real samples can guarantee convergence when initialized sufficiently close to equilibrium. Reference [10] proposed spectral normalization to enforce Lipschitz continuity on the model and stabilize the training process. Reference [18] proposed consistency regularization, which makes the model insensitive to data augmentation and maintains consistency in the semantic feature space, thereby improving performance. Reference [15] proposed regarding the generated samples with the lowest discriminator output in the minibatch as real samples, thus achieving better generalization.
Reference [19] proposed replacing the sigmoid cross-entropy loss in the standard GAN with a mean-squared error loss to solve the vanishing gradient problem. The hinge loss [20] and ResNet [21] have also been applied to GANs to improve performance. Reference [22] utilized instance noise to alleviate the vanishing gradient problem. Reference [23] utilized an Exponential Moving Average (EMA) to update the generator, which stabilizes the training process. Reference [24] argued that considering only the current discriminator when updating the generator leads to mode collapse, so the author proposed also considering the discriminator after K further iterations. References [25-27] utilized multiple generators to alleviate mode collapse. References [11,12,28] achieved amazing results and can generate realistic high-resolution pictures.

Approach
In this section, we analyze the problem of misjudgment by the discriminator and propose real sample consistency regularization.

Background
The discriminator of the Standard GAN (SGAN) proposed in 2014 maximizes:

E_{x∼p_r}[log D(x)] + E_{y∼p_g}[log(1 − D(y))],

where p_r represents the real distribution and p_g represents the generated distribution.
Reference [1] proposed that when the generator is fixed, the optimal discriminator is:

D*(v) = p_r(v) / (p_r(v) + p_g(v)).

When the global optimum is reached, p_r = p_g and D*(v) = 1/2. Reference [15] mentioned that for any v ∈ supp(p_r) ∪ supp(p_g), when p_g approaches p_r, D*(v) approaches 1/2 and (∇D)_v → 0, and thereby (∇D)_x → 0 for x ∼ p_r. Therefore, there are two ways to perform 0GP regularization. One enforces a zero-centered gradient penalty of the form ||(∇D)_x||^2, where x ∼ p_r. The discriminator maximizes:

E_{x∼p_r}[log D(x)] + E_{y∼p_g}[log(1 − D(y))] − (µ/2) E_{x∼p_r}[||(∇D)_x||^2],

where µ > 0. The other enforces a zero-centered gradient penalty of the form ||(∇D)_v||^2, where v is a linear interpolation between real and generated samples. The discriminator maximizes:

E_{x∼p_r}[log D(x)] + E_{y∼p_g}[log(1 − D(y))] − (µ/2) E_v[||(∇D)_v||^2],

where µ > 0.
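As a quick numerical sanity check on the optimal discriminator D*(v) = p_r(v)/(p_r(v) + p_g(v)), the sketch below uses toy 1D Gaussian densities (not the paper's experimental setup): D* = 1/2 when p_g = p_r, and D* > 1/2 where p_r dominates.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def optimal_discriminator(x, p_r, p_g):
    """D*(x) = p_r(x) / (p_r(x) + p_g(x)) for a fixed generator."""
    return p_r(x) / (p_r(x) + p_g(x))

# Identical real and generated densities: the global optimum, D* = 1/2.
p_r = lambda x: gauss_pdf(x, 0.0, 1.0)
p_g = lambda x: gauss_pdf(x, 0.0, 1.0)
print(optimal_discriminator(0.7, p_r, p_g))  # 0.5

# Shifted generated density: D* > 1/2 where the real density dominates.
p_g2 = lambda x: gauss_pdf(x, 2.0, 1.0)
print(optimal_discriminator(0.0, p_r, p_g2) > 0.5)  # True
```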

Misjudgment by the Discriminator
However, no matter the form of the zero-centered gradient penalty, it is far from a perfect regularization.

Definition 1. For y_0 ∈ S_g, y_0 is a fake real sample if D(y_0) > E[D(S_r)], where S_r represents the set of real samples and S_g represents the set of generated samples.
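Definition 1 can be stated operationally: count the generated samples whose discriminator output exceeds the mean output on the real set. A minimal sketch (the numbers are purely illustrative):

```python
def count_fake_real(d_real_outputs, d_fake_outputs):
    """Count fake real samples (Definition 1): generated samples whose
    discriminator output exceeds the mean output on the real set."""
    threshold = sum(d_real_outputs) / len(d_real_outputs)
    return sum(1 for d in d_fake_outputs if d > threshold)

# Real outputs average ~0.7; two of the three fake outputs exceed it.
print(count_fake_real([0.6, 0.7, 0.8], [0.9, 0.5, 0.75]))  # 2
```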
Consider 0GP regularization on real samples. Although it can alleviate mode collapse, it leads to D(y) > D(x) for some generated samples y and real samples x, indicating that the discriminator believes some generated samples are more real than the real samples. We can infer that SGAN-0GP will show more fake real samples than SGAN. Since (∇D)_x → 0 only holds near the equilibrium point, yet we apply 0GP regularization from the beginning of training, the gradient of the discriminator at the real samples is pushed close to 0 throughout, while no restriction is imposed on the gradient at the generated samples. As a result, there are more cases during training where the discriminator's gradient at the real samples is smaller than its gradient at the generated samples, and in the end, the number of fake real samples in SGAN-0GP exceeds that in SGAN. The empirical discriminator guides the generated samples toward samples with higher discriminator outputs, regardless of whether they are real samples. Therefore, the discriminator guides the generator to generate more fake real samples. These fake real samples are often far from the real samples and eventually make convergence difficult.
0GP regularization on the linear interpolation between real and generated samples leads to D(y) > D(x) as well. Although, in this case, the zero-centered gradient penalty is applied on the interpolation between real and generated samples, the number of real samples is finite while the number of generated samples is effectively unbounded: |S_r| ≪ |S_g|, where S_r represents the set of real samples and S_g the set of generated samples. The total quantity of regularization is therefore spread over far more generated samples than real samples, so the penalty on the discriminator's gradient at any particular generated sample is much smaller than at the real samples. Therefore, we can infer that the number of fake real samples generated by SGAN with 0GP on real samples is similar to that with 0GP on the linear interpolation. Moreover, 0GP regularization on the interpolation is more complicated than that on real samples, and the interpolated points may not lie in supp(p_r) ∪ supp(p_g) [15]. In the rest of the paper, we use 0GP regularization on real samples by default.

Real Sample Consistency Regularization
In order to alleviate the problem of fake real samples, we can increase D(S_r) and decrease D(y_0). Assume that for x_c ∈ S_r, {x_c, y_i} is a close pair for every y_i ∈ {y_1, · · · , y_n}, n > 1. According to the definition of a close pair, we can approximate each y_i ∈ {y_1, · · · , y_n} as x_c; the discriminator then sees one real sample among n + 1 nearly identical points, so its optimal output satisfies

D(x_c) ≈ 1/(n + 1) < 1/2, since n > 1.

This shows that although x_c is a real sample, the discriminator's output for x_c is less than 1/2. Therefore, we propose Real Sample Consistency (RSC) regularization, which enforces the discriminator to output the same value for all real samples. Since the proportion of real samples whose discriminator output is less than 1/2 is low and the proportion greater than 1/2 is high, forcing a common output value alleviates the problem of outputs below 1/2. The discriminator in SGAN-RSC maximizes:

E_{x∼p_r}[log D(x)] + E_{y∼p_g}[log(1 − D(y))] − (µ/2) E_{x∼p_r}[||(∇D)_x||^2] − λ E[(D(x^s) − D(x^t))^2],

where µ > 0, λ > 0, and {x^s}, {x^t} are two random halves of the real minibatch. The training procedure is presented in Algorithm 1.
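The RSC penalty itself is easy to sketch: split the real batch's discriminator outputs into two random halves and penalize the squared differences. A minimal illustration in plain Python (`d_outputs` stands in for D(x) on a real minibatch; this is not the paper's training code):

```python
import random

def rsc_penalty(d_outputs):
    """Real Sample Consistency penalty (sketch): split the discriminator's
    outputs on a real minibatch into two random halves and average the
    squared differences, pushing D toward one value on all real samples."""
    outs = d_outputs[:]
    random.shuffle(outs)
    half = len(outs) // 2
    s, t = outs[:half], outs[half:2 * half]
    return sum((a - b) ** 2 for a, b in zip(s, t)) / half

# A discriminator that already outputs one constant on real samples pays no penalty.
print(rsc_penalty([0.5, 0.5, 0.5, 0.5]))  # 0.0
# Spread-out outputs are penalized.
print(rsc_penalty([0.9, 0.1, 0.8, 0.2]) > 0.0)  # True
```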

Algorithm 1: Minibatch stochastic gradient descent training of SGAN-RSC.
input: η, the learning rate; K, the number of training iterations; n_dis, the number of discriminator iterations per generator iteration; N, the batch size.
1 for k = 1, · · · , K do
2   for j = 1, · · · , n_dis do
3     Sample a minibatch of N real examples {x_1, x_2, · · · , x_N};
4     Sample a minibatch of N generated examples {y_1, y_2, · · · , y_N};
5     Randomly divide the N real examples equally into {x^s_1, x^s_2, · · · , x^s_{N/2}} and {x^t_1, x^t_2, · · · , x^t_{N/2}};
6     Update the discriminator by ascending its stochastic gradient: ∇_{θ_d} L_rsc;
7   end
8   Sample a minibatch of N generated examples {y_1, y_2, · · · , y_N};
9   Update the generator by ascending its stochastic gradient: ∇_{θ_g} E_{y∼p_g}[log D(y)];
10 end

Definition 2. For x ∈ S_r and y ∈ S_g, {x, y} is a close pair if x and y are very close to each other [15]. S_r represents the set of real samples, and S_g represents the set of generated samples.
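A hedged PyTorch sketch of one discriminator step of Algorithm 1 might look as follows. The helper name `d_loss_rsc`, the 1e-8 numerical-stability terms, and the exact weighting of the penalties are our assumptions, not the authors' code:

```python
import torch

def d_loss_rsc(D, x_real, y_fake, mu=10.0, lam=20.0):
    """Discriminator loss for SGAN-RSC (sketch). Combines the negated SGAN
    objective, a zero-centered gradient penalty on real samples (weight mu),
    and the RSC term over a random equal split of the real batch (weight lam)."""
    x_real = x_real.clone().requires_grad_(True)
    d_real, d_fake = D(x_real), D(y_fake)
    # Negated SGAN objective (we minimize the loss).
    sgan = -(torch.log(d_real + 1e-8).mean() + torch.log(1 - d_fake + 1e-8).mean())
    # Zero-centered gradient penalty on real samples.
    grad = torch.autograd.grad(d_real.sum(), x_real, create_graph=True)[0]
    gp = grad.reshape(grad.size(0), -1).pow(2).sum(dim=1).mean()
    # RSC: random equal split of the real batch, penalize output differences.
    perm = torch.randperm(d_real.size(0))
    half = d_real.size(0) // 2
    rsc = (d_real[perm[:half]] - d_real[perm[half:2 * half]]).pow(2).mean()
    return sgan + 0.5 * mu * gp + lam * rsc
```

The generator step is unchanged from the non-saturating SGAN update (ascend E[log D(y)]).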
Assume that for x_c ∈ S_r, x_c does not belong to any close pair; then the discriminator will output a high value for x_c. Our proposed RSC regularization alleviates this inconsistency as well. Due to adversarial learning, regularization on the real samples also affects the generated samples.

Experimental Results
To verify the effectiveness of our proposed real sample consistency regularization, we experimented on synthetic data, CIFAR-10, CIFAR-100, and ImageNet2012. The optimizer in all our experiments was RMSProp with α = 0.99 and ε = 1 × 10^−8. The learning rates of the generator and the discriminator were both set to 1 × 10^−4. The batch size was set to 64. The generator was updated once for each discriminator update. To achieve better results, instance noise [22] and an exponential moving average [23] were applied, with the β of the exponential moving average set to 0.999. To verify the effectiveness of our method on different network architectures and datasets, we applied both ResNet [21] and a conventional network architecture [29] on CIFAR-10 and CIFAR-100. The ResNet architecture was the same as [13], and the conventional architecture was the same as [11]. We did not use batch normalization. The FID score [30] was selected to evaluate the generated samples; a lower FID value represents better generation. The FID value was computed on 10k generated samples. We used PyTorch for development.
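The EMA update of the generator weights [23] mentioned above is a one-line update per parameter; a minimal sketch (the function name is ours, and the paper's β = 0.999 is the default):

```python
def ema_update(ema_params, params, beta=0.999):
    """Per-parameter EMA of the generator weights (sketch):
    ema <- beta * ema + (1 - beta) * current."""
    return [beta * e + (1 - beta) * p for e, p in zip(ema_params, params)]

# Shadow weights drift slowly toward the live weights.
shadow = [0.0, 2.0]
live = [1.0, 2.0]
print(ema_update(shadow, live, beta=0.9))
```

At evaluation time, samples are generated from the shadow (EMA) weights rather than the live generator.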

Synthetic Data
We sampled N examples {x_1, x_2, · · · , x_N} from the two-dimensional normal distribution N(0, 0; 1, 1; 0), denoted as X_sample. For each training run, X_sample was fixed. The synthetic data were obtained as X_synthesis = X_sample + ψZ, where ψ = 0.02 and Z ∼ N(0, 0; 1, 1; 0). In our experiments, there were three settings for N, namely 25, 50, and 100, denoted as 25 Gaussians, 50 Gaussians, and 100 Gaussians, respectively. Since the synthetic data are two-dimensional, we used an MLP as the network architecture; see Tables A1 and A2 in Appendix A for details. On this dataset, we set µ = 200 and λ = 500.
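The construction of the synthetic data can be sketched as follows (the helper names and the fixed seed are illustrative; as in the paper, the centers X_sample are fixed per training run and each sample is jittered by ψZ):

```python
import random

def make_synthetic(n, psi=0.02, seed=0):
    """25/50/100-Gaussians data (sketch): fix n centers drawn from a
    standard 2D normal, then jitter a randomly chosen center by psi * N(0, I)."""
    rng = random.Random(seed)
    centers = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    def sample():
        cx, cy = rng.choice(centers)
        return (cx + psi * rng.gauss(0, 1), cy + psi * rng.gauss(0, 1))
    return centers, sample

centers, sample = make_synthetic(25)
print(len(centers))  # 25
```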
We verified our previous analysis through experiments on the synthetic dataset. As shown in Figure 1, SGAN-0GP showed more fake real samples during training than SGAN, and the number of fake real samples generated by SGAN with 0GP on real samples was similar to that with 0GP on the linear interpolation between real and generated samples. Both 0GP variants resulted in more fake real samples than SGAN. These observations are consistent with our analysis in Section 3.

Figure 1 shows the qualitative results of SGAN, SGAN-0GP, and SGAN-RSC on the synthetic data. The ideal generation covers every real sample, with every generated sample close to the real samples. However, the green generated samples did not cover part of the orange real samples in Figure 1a, which indicates that mode collapse occurred. In Figure 1b, although mode collapse did not occur, many generated samples were distributed far away from the real samples. The generation in Figure 1c was the best of the three subfigures: not only was there no mode collapse, but fewer generated samples were far away from the real samples. This is consistent with Figure 2: although the number of fake real samples in SGAN was low, mode collapse occurred. According to Figures 1 and 2, our proposed SGAN-RSC led to fewer fake real samples, fewer generated samples far from the real samples, and no mode collapse.

The quantitative results were consistent with the qualitative results in Figure 1. As shown in Table 1, the average distance obtained by our method was smaller than that obtained by 0GP, and the improvement grew as the number of real samples increased.
Our method's improvement for 25 Gaussians was smaller than that for 50 Gaussians and 100 Gaussians because the 25 Gaussians dataset is simple and its average distance was already close to the ideal average distance.
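The exact form of the average-distance metric in Table 1 is not spelled out here; a plausible reading (our assumption, not the paper's stated definition) is the mean distance from each generated point to its nearest real point:

```python
import math

def avg_distance(generated, real):
    """Mean Euclidean distance from each generated 2D point to its nearest
    real point (assumed form of the Table 1 metric)."""
    total = 0.0
    for gx, gy in generated:
        total += min(math.hypot(gx - rx, gy - ry) for rx, ry in real)
    return total / len(generated)

real_pts = [(0.0, 0.0), (1.0, 0.0)]
gen_pts = [(0.1, 0.0), (0.9, 0.0)]
print(avg_distance(gen_pts, real_pts))  # ~0.1
```

Under this reading, the metric is small only when every generated sample lies near some real sample, matching the qualitative criterion above.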

CIFAR-10 and CIFAR-100
On CIFAR-10 and CIFAR-100, we set the image resolution to 32 × 32 and experimented on both the conventional network and ResNet. We set µ = 10, λ = 20 for the conventional network architecture and µ = 10, λ = 500 for ResNet; see Tables A3-A6 in Appendix A for details. The results are shown in Table 2.

We verified our previous analysis in Section 3 by experiments on CIFAR-10 and CIFAR-100. As shown in Figure 3, the discriminator's outputs with our regularization were more concentrated for both real and fake samples. For real samples, the proportion of outputs less than 1/2 was lower with our regularization than with 0GP regularization. For fake samples, the proportion of outputs greater than 1/2 was lower with our regularization than with 0GP regularization. This is consistent with Figure 2, which shows that the number of fake real samples in SGAN-RSC was lower than in SGAN-0GP.

To obtain the optimal parameter λ, we performed ablation experiments, as shown in Figure 5. Figure 5a shows that, for λ < 500, SGAN-RSC with ResNet achieved better results as λ increased; however, λ = 1000 gave a worse result than λ = 500. Consequently, we set λ = 500 for SGAN-RSC with ResNet. Similarly, we set λ = 20 for SGAN-RSC with the conventional network.

To verify the effectiveness of our proposed real sample consistency regularization with different network architectures, we compared SGAN-0GP and SGAN-RSC with the conventional network and ResNet, respectively, on CIFAR-10 and CIFAR-100. The results are shown in Figure 6. In all experiments, SGAN-RSC outperformed SGAN-0GP, especially with ResNet.
Note that although the FID value of SGAN-RSC with the conventional architecture on CIFAR-100 increased slowly in the late training period, its lowest FID value was still lower than that of SGAN-0GP with the conventional architecture on CIFAR-100. Figure 6 shows that our method was effective for different network architectures.

We also experimented with different GAN variants. In these experiments, 0GP on real samples instead of 1GP was applied in WGAN [16,17,31]. We set a = c = 1, b = 0 for LSGAN [19] and N = 64, M = 32, f = 8, and N_0 = 16 for FARGAN [15]. The results are shown in Figure 7. Real sample consistency regularization outperformed 0GP regularization for all GAN variants. Although LSGAN-RSC converged slowly in the early stages of training, it eventually reached an FID value similar to that of the other GAN variants with real sample consistency regularization. Figure 7 shows that our method was effective for different GAN variants.

The losses of the generator and discriminator with ResNet on CIFAR-10 are shown in Figure 8. The theoretical loss is ln 2 ≈ 0.693 for the generator and 2 ln 2 ≈ 1.386 for the discriminator. As training progressed, the loss of the generator increased and the loss of the discriminator decreased significantly in SGAN-0GP. In SGAN-RSC, however, both losses stayed close to the theoretical values and remained stable. Figure 8 shows that our method can stabilize the training of SGAN-0GP.

The qualitative results with ResNet on CIFAR-10 are shown in Figure 9. The images generated by SGAN-RSC were sharper with clearer structure, such as the images of horses, airplanes, and cars, while SGAN-0GP generated more unnatural and fuzzy images due to the fake real samples. Figures 9 and 10 show that our proposed regularization outperformed 0GP regularization qualitatively on CIFAR-10 and CIFAR-100.
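The theoretical values ln 2 and 2 ln 2 follow directly from the equilibrium discriminator D = 1/2 (using the non-saturating generator loss −log D(y), as in Algorithm 1); a quick check:

```python
import math

# At equilibrium, p_g = p_r and D(v) = 1/2 everywhere.
d_eq = 0.5
g_loss = -math.log(d_eq)                         # non-saturating generator loss: ln 2
d_loss = -(math.log(d_eq) + math.log(1 - d_eq))  # discriminator loss: 2 ln 2
print(round(g_loss, 3), round(d_loss, 3))  # 0.693 1.386
```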
More qualitative results are shown in Figures A1-A4. In Figure 11, we show images randomly generated by SGAN-RSC alongside the closest images in CIFAR-10. The results show that our model did not copy the training images but learned the distribution of the real images, which guarantees the diversity of the generation. We also compared against other state-of-the-art methods; the results are shown in Table 3. Real sample consistency regularization improved the FID score of FARGAN on CIFAR-10 from 14.28 to 9.8. Table 3. Comparison with state-of-the-art GAN models including SNGAN [10], BigGAN [6], CR-BigGAN [18], and FARGAN [15]. The FID value of FARGAN-RSC was obtained at 1200k iterations.

ImageNet
In order to verify the effectiveness of our method on challenging datasets, we experimented on ImageNet, which contains 1000 classes. We set the image resolution to 64 × 64 and experimented with ResNet; see Tables A7 and A8 in Appendix A for details. We set µ = 10 and λ = 20. The results are shown in Table 4.

Summary of the Experimental Results
As shown in Table 5, the results of RSC regularization on all datasets surpassed the results of 0GP regularization. This shows that our method works for different datasets.

Conclusions
This paper showed that 0GP regularization exacerbates the discriminator's misjudgment, that is, the discriminator outputs a higher value for some generated samples than for the real samples. We analyzed the discriminator's output for a real sample that forms close pairs with several generated samples and showed that it is less than 1/2. We proposed a new regularization that forces the discriminator to output the same value for all real samples. The experimental results showed that our proposed regularization reduces the number of fake real samples. Experiments on synthetic data showed that our method reduces the distance between the real and generated distributions and avoids mode collapse. Experiments on CIFAR-10, CIFAR-100, and ImageNet verified that our method stabilizes the training process and significantly improves performance.