DC-MMD-GAN: A New Maximum Mean Discrepancy Generative Adversarial Network Using Divide and Conquer

Abstract: Generative adversarial networks (GANs) have had a revolutionary influence on sample generation. Maximum mean discrepancy GANs (MMD-GANs) own competitive performance when compared with other GANs. However, the loss function of MMD-GANs is an empirical estimate of the maximum mean discrepancy (MMD) and is not precise in measuring the distance between sample distributions, which inhibits MMD-GAN training. We propose an efficient divide-and-conquer model, called DC-MMD-GANs, which constrains the loss function of MMD to a tight bound on the deviation between the empirical estimate and the expected value of MMD, and accelerates the training process. DC-MMD-GANs contain a division step and a conquer step. In the division step, we learn the embedding of training images with an auto-encoder and partition the training images into adaptive subsets through k-means clustering based on the embedding. In the conquer step, sub-models are fed with the subsets separately and trained synchronously. The loss function values of all sub-models are integrated to compute a new weight-sum loss function. The new loss function with a tight deviation bound provides more precise gradients for improving performance. Experimental results show that, with a fixed number of iterations, DC-MMD-GANs converge faster and achieve better performance than the standard MMD-GANs on the CelebA and CIFAR-10 datasets. Compared with MMD-GANs whose batch size is expanded to cm directly, DC-MMD-GANs reach a competitive score with less training time. DC-MMD-GANs accelerate the training process.


Introduction
Generative adversarial networks (GANs) [1] have developed into an admirable class of implicit generative models (IGMs), whose purpose is to force the distribution of generated data Q_s to mimic the target distribution P_r. GANs alternately train two networks, a generator and a discriminator, in adversarial form. The generator attempts to produce vivid artificial samples, while the discriminator distinguishes artificial samples from real ones. Since the original GAN [1] was proposed, diverse variants of GANs have appeared and shown impressive results on tasks including image generation [2][3][4][5][6], video generation [7], transfer learning [8], and so on.
From the perspective of the loss function, training GANs actually minimizes the distance M(P_r, Q_s). Selecting an appropriate distance M(P_r, Q_s) for the loss function of GANs is important for generation performance. A 'good' distance should make it easy for the generated distribution to converge to the target distribution; that is, the distance should be continuous even when the target and generated distributions do not have a non-negligible intersection [3]. The integral probability metric (IPM) [9], which offers a convincing discrepancy between two distributions [10], has become an important class of distances applied to GANs [10][11][12][13], namely IPM-GANs. Equipped with a class of functions F and given samples r ∼ P_r i.i.d. and s ∼ Q_s i.i.d., the IPM-type distance M_IPM(P_r, Q_s) between two distributions P_r and Q_s is defined as

M_IPM(P_r, Q_s) = sup_{f ∈ F} ( E_{r∼P_r}[f(r)] − E_{s∼Q_s}[f(s)] ),    (1)

where f ∈ F is the witness function used to maximize the IPM-type distance.
When F is the "unit ball" in a reproducing kernel Hilbert space (RKHS), the corresponding IPM instance is the maximum mean discrepancy (MMD). MMD has also been extended to a distance applied to GANs: the maximum mean discrepancy GANs (MMD-GANs) [14,15]. In practice, the loss function used in MMD-GANs is an empirical estimate of Equation (1). The estimate deviates from the expectation of MMD, and the deviation is large when the sample size is small [16]. The batch size (sample size) in MMD-GANs is usually no more than 128 because of the complex training data, which is insufficient for a precise estimate of MMD. Therefore, the deviation is large and incurs inaccurate predictions in each training iteration.
In this paper, we propose an efficient divide-and-conquer generative model, DC-MMD-GANs, which constrains the loss function to a tight bound on the deviation between the empirical estimate and the expected value of MMD. The loss value of DC-MMD-GANs can provide more precise gradients for updating the networks and accelerate the training process. The large deviation existing in MMD-GANs leads to slow convergence, but expanding the sample size efficiently is challenging. To alleviate this problem, we decompose the problem of expanding the sample size into sub-problems solved independently with a divide-and-conquer strategy. Our work is inspired by the divide-and-conquer strategy for kernel methods [17][18][19][20].
DC-MMD-GANs contain a division step and a conquer step. In the division step, we use a pre-trained auto-encoder to learn the embedding of images from the target distribution and conduct k-means on the latent code space of the encoder to divide the training images into adaptive subsets. The purpose of the division step is to minimize the correlation between different subsets and retain more information of the input data for DC-MMD-GANs to learn. Each sub-model can learn from its subset independently and efficiently. In the conquer step, sub-models are fed with the subsets separately and trained synchronously. The loss function values of all sub-models are integrated to compute a new weight-sum loss function. The new loss function of DC-MMD-GANs provides more precise gradients for accelerating training.
Our contributions can be summarized as follows: 1. Based on the deviation between the empirical estimate and the expected value in [16], we analyze and find that a large deviation exists in the original MMD-GANs, which shows that the sample size used in each training iteration is not sufficient for a precise evaluation. 2. We propose DC-MMD-GANs, a novel method that constrains the loss function to a tight bound on the deviation between the empirical estimate and the expected value of MMD. Compared with the original MMD-GANs, the loss function of DC-MMD-GANs with a tighter deviation bound measures the distance between the generated and target distributions more precisely and provides more precise gradients for updating the networks, which accelerates the training process. The multiple sub-models of DC-MMD-GANs occupy multiple distributed computing resources. Compared with expanding the batch size of the original MMD-GANs directly, DC-MMD-GANs reach a training score close to that of MMD-GANs in less time. 3. Experimental results show that DC-MMD-GANs can be trained efficiently compared with the original MMD-GANs.
The rest of the paper is outlined as follows. Related work is shown in Section 2. The proposed method is given in Section 3. In Section 4, we present the results of experiments. Section 5 is the conclusion of the paper.

MMD
As a class of IPM [9], MMD measures the distance between the mean embeddings of probability measures in an RKHS induced by a characteristic kernel [21,22]. MMD is defined as

MMD(P_r, Q_s) = sup_{f ∈ F} ( E_{r∼P_r}[f(r)] − E_{s∼Q_s}[f(s)] ),    (2)

where H_k is the RKHS and F is the "unit ball" in H_k, i.e., the set of functions f whose RKHS norm is at most 1. H_k is uniquely associated with a characteristic continuous kernel function k(·,·), such as the Gaussian radial-basis-function kernel

k(x, y) = exp(−||x − y||^2 / (2σ^2)),    (3)

where σ is the bandwidth.
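As a concrete illustration, the squared MMD between two sample sets can be estimated directly with a Gaussian kernel. The following is a minimal NumPy sketch (not the paper's code) of the unbiased U-statistic form of the estimator:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = x[:, None, :] - y[None, :, :]
    sq_dists = np.sum(diff ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2_unbiased(r, s, sigma=1.0):
    """Unbiased empirical estimate of squared MMD for samples r ~ P_r, s ~ Q_s."""
    m, n = len(r), len(s)
    k_rr = gaussian_kernel(r, r, sigma)
    k_ss = gaussian_kernel(s, s, sigma)
    k_rs = gaussian_kernel(r, s, sigma)
    # Exclude diagonal terms in the within-distribution sums (U-statistic form).
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ss = (k_ss.sum() - np.trace(k_ss)) / (n * (n - 1))
    term_rs = 2.0 * k_rs.mean()
    return term_rr + term_ss - term_rs

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(0, 1, (256, 2)), rng.normal(0, 1, (256, 2)))
diff = mmd2_unbiased(rng.normal(0, 1, (256, 2)), rng.normal(3, 1, (256, 2)))
print(same, diff)  # near 0 when the distributions match, clearly larger otherwise
```

The estimate fluctuates around 0 for identical distributions and grows when the distributions separate, which is exactly the behavior the deviation analysis in the next section quantifies.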

GANs
The framework of GANs is a combination of an implicit generative model and a two-sample test. The generator generates fake data; the discriminator attempts to decide whether the fake-data and real-data distributions are different, which is consistent with the goal of the two-sample test. The choice of distance applied in GANs is closely related to the generating power, based on which we divide current variants of GANs into two classes. The first class of GANs adopts statistical divergences as loss functions. The pioneering GANs use the Jensen-Shannon divergence [1]. The distance used in least-squares GANs [23] turns out to be a variant of the Pearson χ^2 divergence. Furthermore, [6] showed that all f-divergences can act as distances for GANs. IPMs form the second class of loss functions used in GANs, as stated in Section 1. Some canonical models are WGANs [4,10], Fisher GAN [24], Sobolev GAN [11], McGAN [13], and so on.

MMD-GANs
Generative models using MMD as a loss function come in many combinations [25][26][27][28]. MMD-GANs evaluate MMD on features extracted by the discriminator and obtain competitive performance [14,15]. MMD in Equation (2) is defined as an expectation, which is replaced by an unbiased empirical estimate based on the kernel function when used in MMD-GANs. The direct result of this unbiased empirical estimate is the squared form MMD_em^2; using MMD_em^2 or MMD_em is equivalent for training MMD-GANs. Given a target distribution P_r, MMD-GANs generate a fake distribution Q_s to mimic P_r. In MMD-GANs [14], the actual loss function is

min_G max_D MMD_em^2(R, G(Z)),    (4)

where R = {r_i}_{i=1}^n, r ∼ P_r i.i.d., and Z ∼ Uniform([−1, 1]^128). G is the generator and D the discriminator. Let G(Z) = S with S = {s_i}_{i=1}^m, s ∼ Q_s i.i.d.. According to [14], MMD_em^2 is calculated as

MMD_em^2(R, S) = (1/(m(m−1))) Σ_{i≠j} k_D(r_i, r_j) + (1/(m(m−1))) Σ_{i≠j} k_D(s_i, s_j) − (2/m^2) Σ_{i,j} k_D(r_i, s_j),    (5)

where k_D(·,·) = k(D(·), D(·)), k(·,·) is the kernel function, and m is the batch size. However, the training troubles of GANs also surround MMD-GANs unavoidably. The work of [29] devised a suitable method that imposes a precise and analytical regularizer on the gradients to stabilize and accelerate training without additional cost. For the same purpose, [30] defined a bounded Gaussian kernel and a repulsive loss function to stabilize the training of MMD-GANs.

Divide and Conquer MMD-GANs
In this section, we present the DC-MMD-GANs. First, we provide a detailed analysis of the bound on the deviation between the empirical estimate and the expected value of MMD, which shows that improving the accuracy of the loss function value is extremely necessary in MMD-GANs. Second, we propose a novel divide-and-conquer method, which expands the sample size of MMD-GANs indirectly and provides a more precise loss function value, to accelerate training.
There is a deviation between MMD_em^2 and MMD^2. According to [16],

Pr{|MMD_em^2 − MMD^2| > t} ≤ exp(−(t^2 m) / (8 K^2)),    (6)

where K is the maximum value of the kernel function k(·,·), Pr is the probability that the deviation exceeds t, and t is the deviation threshold. According to Equation (6), the deviation between MMD_em^2 and MMD^2 is closely related to the batch size m: MMD_em^2 approaches the expectation gradually as the sample size increases. Generally, the sample size in the two-sample-test setting is large enough, because the two-sample test usually deals with simple distributions of low-dimensional data, which guarantees the quality of the test. In contrast, the sample size in deep network models is generally not large, due to complex training data. According to [31], the deviation between MMD_em^2 and MMD^2 converges to a Gaussian distribution:

√m (MMD_em^2 − MMD^2) →_D N(0, σ_u^2),    (7)

where →_D means convergence in distribution and m is the batch size. σ_u^2 is a variance whose value depends only on Q_s and P_r. Suppose that the two distributions are completely different and that the data within each distribution are identical. The minimum value of the cross-kernel k(s_i, r_j) in Equation (5) is 0, which means the two distributions are unrelated, i.e., the minimum value of the cross term between the two distributions in Equation (5) is 0. The maximum value of the kernel function is 1, attained when two samples are identical, so the values of k(s_i, s_j) and k(r_i, r_j) equal 1. In this case, the maximum of MMD_em^2 is

MMD_em,max^2 = 1 + 1 − 0 = 2.    (8)

Figure 1 shows that the bound on the deviation between the empirical estimate and the expected value of MMD becomes tighter as the sample size increases. The deviations in Figure 1 (e.g., 0.1, 0.2, ..., 0.5) are non-negligible for MMD_em^2, whose maximum is only 2. Generally, the value of MMD_em^2 in MMD-GANs is less than 2 and gets smaller as the number of iterations increases, while the deviation does not vanish.
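To make the bound concrete, the following sketch evaluates the right-hand side of Equation (6) with K = 1. This functional form is our reading of the bound; it reproduces the point (64, 0.4868) from Figure 1 at threshold t = 0.3:

```python
import math

def deviation_bound(m, t, K=1.0):
    """Upper bound on Pr{|MMD_em^2 - MMD^2| > t} as a function of batch size m,
    following the concentration bound of Equation (6) with kernel maximum K."""
    return math.exp(-t ** 2 * m / (8.0 * K ** 2))

# The point (64, 0.4868) from Figure 1: batch size 64, threshold t = 0.3.
print(round(deviation_bound(64, 0.3), 4))  # 0.4868
for m in (64, 128, 256, 512):
    print(m, deviation_bound(m, 0.3))  # the bound tightens as m grows
```

Doubling the batch size squares the bound's decay factor, which is why the deviation shrinks only slowly at the batch sizes (32 to 128) that MMD-GANs can afford.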
Consequently, the impact of the deviation intensifies during training. We select one point of the curve in Figure 1 as an example: (64, 0.4868). Its meaning is: when the batch size in MMD-GANs is set to 64, there is a 48.68% probability that the deviation between MMD^2 and MMD_em^2 exceeds 0.3, which is unacceptable for training. Generally, the batch size in MMD-GANs does not exceed 128 (e.g., 32, 64, 128) [14,29]; with complex training data and a limited computational budget, it cannot be expanded directly, which leads to a large deviation. We find that the B-test [31] can obtain a more precise empirical estimate of MMD by averaging empirical estimates calculated on subsets. Inspired by the divide-and-conquer strategy for kernel methods [17][18][19][20], we propose a divide-and-conquer generative model, DC-MMD-GANs. We partition the training images into c subsets {R_1, R_2, ..., R_c} using an auto-encoder and k-means in the latent space. The sub-models conduct forward propagation on different subsets independently. The loss function values of the sub-models are integrated to calculate a new weight-sum loss, which is used to update all sub-models synchronously. The framework of the proposed method is shown in Figure 2. The purpose of dividing the training data is to reduce the loss of information and improve generation performance. The advantages of the division task are as follows: 1. Each subset of training images divided by k-means has minimum within-subset distance in the embedding space, so the correlation between subsets is reduced and MMD_em^2(R_i, R_j), i ≠ j, between subsets gets bigger, where R_i and R_j are batches of training images from different subsets. According to Equation (5), the cross term between different subsets, which cannot be computed by the independent sub-models, is thereby reduced.
From this perspective, the auto-encoder and k-means help reduce the loss of information from the training images during training. 2. Each subset of training images is used to train one sub-model, so all training images are learned more quickly compared with the baseline model. 3. According to the different clusters of embeddings, we divide the images into subsets, which can be viewed as different categories. The divided subsets contain different information from the clustered embeddings, which has been shown to improve the training of GANs [32]. A pre-trained model such as the combination of an auto-encoder and k-means has been shown to benefit the generator in producing high-quality images [33]. The target distribution P_r is primarily supported by low-dimensional manifolds [10]. We try to learn a useful feature representation that captures the essence of the target distribution. An auto-encoder has enough capacity to capture the prominent information of the target data. By mapping the training image space R^d into a low-dimensional embedding space R^e, the encoder outputs the most representative embedding e [34]. Furthermore, the auto-encoder disentangles complex variations of the training images, which is beneficial for the division task [35].
We train the auto-encoder on all training images by minimizing the mean squared error (MSE) between the training data and the reconstructed samples. The output dimension of the encoder was set to 256, with three convolutional layers and two fully connected layers. The auto-encoder is composed of an encoder and a decoder:

R_re = D_c(E_c(R)),    L_AE = (1/m) Σ_{i=1}^m ||r_i − r_re,i||^2,    (9)

where E_c(·) is the encoder, D_c(·) is the decoder, R_re is the reconstructed data, and m is the batch size. We freeze the auto-encoder parameters and pass all training images through the well-trained auto-encoder. Each image r_i is mapped to an embedding vector e_i. E_all is the set containing all embeddings e_i.
Based on the learned embedding representation, we conduct k-means on E_all and obtain c clusters E_i, i ∈ {1, ..., c}. The training images are thereby divided into c subsets R_i, i ∈ {1, ..., c}, according to the clustered sets of embeddings. Each subset is fed to one sub-model, and each sub-model is trained independently.
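A minimal sketch of the division step, assuming the encoder embeddings E_all have already been computed; a plain NumPy k-means stands in here for whatever clustering implementation is actually used:

```python
import numpy as np

def kmeans_partition(embeddings, c, iters=50, seed=0):
    """Plain k-means on encoder embeddings; returns one index set per subset.
    Illustrative sketch of the division step (the paper obtains the
    embeddings from a pre-trained auto-encoder)."""
    rng = np.random.default_rng(seed)
    centers = embeddings[rng.choice(len(embeddings), c, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest center, then recompute centers.
        d = np.linalg.norm(embeddings[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(c):
            if np.any(labels == j):
                centers[j] = embeddings[labels == j].mean(axis=0)
    return [np.where(labels == j)[0] for j in range(c)]

# Toy embeddings: two well-separated clusters in a 256-d latent space.
rng = np.random.default_rng(1)
e_all = np.vstack([rng.normal(0, 0.1, (50, 256)), rng.normal(5, 0.1, (50, 256))])
subsets = kmeans_partition(e_all, c=2)
print([len(s) for s in subsets])
```

Each returned index set R_i then feeds exactly one sub-model, so the subsets with minimal within-cluster distance in embedding space map one-to-one onto the sub-problems of the conquer step.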
For the conquer task, we adopt a weighted-sum unbiased empirical estimate of MMD^2 with a tighter deviation bound, MMD_dc^2, which can provide more precise gradients for training. After the sub-models are trained on their sub-problems independently, we integrate the empirical estimates {MMD_em^2(R_1, S_1), MMD_em^2(R_2, S_2), ..., MMD_em^2(R_c, S_c)}, where R_i is a batch of training images from the i-th subset and S_i is a batch of images generated by the i-th sub-model. We compute a weight-sum value of all estimates,

MMD_dc^2 = Σ_{i=1}^c w_i MMD_em^2(R_i, S_i),    (10)

which has a tighter deviation bound. We set the weight parameter to w_i = 1/c.
Each sub-model is updated synchronously by optimizing MMD_dc^2:

min_{G_sub} max_{D_sub} MMD_dc^2,    (11)

where G_sub is the generator and D_sub the discriminator of a sub-model. As shown in Theorem 1, the new loss function MMD_dc^2 is an unbiased estimator of the expected value MMD^2. Therefore, optimizing MMD_dc^2 gives the correct direction of convergence when training DC-MMD-GANs.
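The weight-sum of the conquer step is simple to state in code. A sketch with hypothetical per-sub-model estimates (the values below are illustrative, not from the experiments):

```python
import numpy as np

def mmd2_dc(mmd2_values, weights=None):
    """Weight-sum loss MMD_dc^2 = sum_i w_i * MMD_em^2(R_i, S_i).
    Defaults to the paper's choice w_i = 1/c."""
    mmd2_values = np.asarray(mmd2_values, dtype=float)
    c = len(mmd2_values)
    if weights is None:
        weights = np.full(c, 1.0 / c)
    # Theorem 1 requires the weights to sum to 1 for unbiasedness.
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return float(np.dot(weights, mmd2_values))

# Toy per-sub-model empirical estimates from c = 4 synchronized sub-models.
print(mmd2_dc([0.8, 1.0, 1.2, 1.0]))  # ≈ 1.0
```

Because each MMD_em^2(R_i, S_i) is itself unbiased, any convex combination of them keeps the expectation MMD^2, which is exactly the content of Theorem 1 below.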

Theorem 1.
Assuming that Σ_{i=1}^c w_i = 1, MMD_dc^2 is an unbiased estimator of MMD^2.
Proof of Theorem 1. According to [16], each MMD_em^2(R_i, S_i) is an unbiased estimator of MMD^2, i.e., E[MMD_em^2(R_i, S_i)] = MMD^2. Hence E[MMD_dc^2] = Σ_{i=1}^c w_i E[MMD_em^2(R_i, S_i)] = MMD^2 Σ_{i=1}^c w_i = MMD^2. The calculation of MMD_dc^2 can be seen as a type of unbiased estimator used in the B-test [31], which averages empirical estimates of MMD^2 computed on subsets. According to Equation (5), MMD_em^2 with batch size m is obtained by computing m^2 terms of k(·,·).
MMD_em^2 with batch size cm requires computing c^2 m^2 terms. Using the divide-and-conquer strategy, MMD_dc^2 is computed from c values of MMD_em^2 according to Equation (10), so MMD_dc^2 can be obtained by computing only c m^2 terms. Compared with computing MMD_em^2 at batch size cm, we avoid the calculation of the cross terms when calculating MMD_dc^2; these cross terms carry the information between different subsets, which has already been reduced by the division task. MMD_dc^2 has the same deviation bound as MMD_em^2 with batch size cm according to [31]:

Pr{|MMD_dc^2 − MMD^2| > t} ≤ exp(−(t^2 c m) / (8 K^2)).    (12)

Therefore, DC-MMD-GANs using MMD_dc^2 can be viewed as effectively expanding the batch size from m to cm with less computation than MMD_em^2 at batch size cm, while obtaining the tighter deviation bound of Equation (12) compared with MMD_em^2 at batch size m. Under the same probability Pr, MMD_dc^2 is more accurate than MMD_em^2 because the deviation threshold t between MMD_dc^2 and MMD^2 is smaller than that between MMD_em^2 and MMD^2.
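The kernel-evaluation counts above can be checked with a few lines; for c = 4 sub-models at batch size m = 64, the divide-and-conquer estimator skips the c(c−1)m^2 cross-block terms:

```python
def kernel_evals_full(c, m):
    """k(.,.) evaluations for one MMD_em^2 at batch size c*m: (c*m)^2 terms."""
    return (c * m) ** 2

def kernel_evals_dc(c, m):
    """k(.,.) evaluations for MMD_dc^2: c blocks of m^2 terms each."""
    return c * m ** 2

c, m = 4, 64
print(kernel_evals_full(c, m), kernel_evals_dc(c, m))   # 65536 16384
print(kernel_evals_full(c, m) - kernel_evals_dc(c, m))  # 49152 cross-block terms avoided
```

The saving grows quadratically in c, which is the source of the reduced training time reported in the experiments.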
According to [31], the deviation between MMD_dc^2 and MMD^2 converges to a Gaussian distribution with lower variance than that of the original model. As shown in Figure 3, MMD_dc^2 has a tighter deviation bound than MMD_em^2. As the number of sub-models increases, the deviation bound becomes tighter:

√(cm) (MMD_dc^2 − MMD^2) →_D N(0, σ_u^2),    (13)

where σ_u^2 is the variance whose value depends only on Q_s and P_r. P_r is the same between models, and Q_s is approximately the same because it is approaching P_r; therefore, σ_u^2 is the same as the variance in Equation (7).

Experiment
In this section, we conduct experiments on unsupervised image generation to evaluate the performance of the proposed DC-MMD-GANs. The experiments are conducted on two popular datasets: CIFAR-10 [36] (10 categories, 32 × 32) and CelebA [37] (face images, resized and cropped to 160 × 160). Models were trained on Nvidia GTX 1080Ti GPUs. For CelebA, we used a generator with a 10-layer ResNet as in [29] and a discriminator with a 5-layer DCGAN architecture [2]. For CIFAR-10, we used the standard CNN structure of [38], which contains a 4-layer generator and a 7-layer discriminator.
To compare the generation quality of GANs, we adopted the Fréchet Inception Distance (FID) [39] and Kernel Inception Distance (KID) [14] as evaluation metrics. FID measures the similarity between generated and real images: it fits intermediate representations of the generated and real images to two multidimensional Gaussian distributions and computes the Wasserstein-2 distance between them. KID adopts a squared polynomial-kernel MMD between generated and real images as the evaluation distance. The lower the FID and KID scores, the better the generation quality. The number of real images used in computing FID and KID was set to 10,000. According to [39], the FID is

d_FID((m_r, C_r), (m_s, C_s)) = ||m_r − m_s||^2 + Tr(C_r + C_s − 2 (C_r C_s)^{1/2}),

where (m_r, C_r) is the Gaussian distribution obtained from P_r and (m_s, C_s) the one obtained from Q_s. According to [29], the KID is

d_KID(R, S) = MMD_em^2(ψ(R), ψ(S)),

where ψ(·) is the representation of the generated and real images in the Inception model [40]. For CelebA, the auto-encoder was pre-trained for 20,000 iterations on 20,000 training images. We obtained the embeddings of all images from the auto-encoder. According to the k-means clusters on the learned embeddings, we divided 10,000 images into 2, 4, and 8 subsets. For CIFAR-10, the auto-encoder was pre-trained for 10,000 iterations on 60,000 images, and 30,000 images were divided into 2, 4, and 8 subsets. As a comparison to DC-MMD-GANs, we also trained the standard model, which is equivalent to the proposed model when c = 1.
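The FID defined above can be sketched directly from the two fitted Gaussians. In the sketch below, Tr((C_r C_s)^{1/2}) is computed via the symmetric product C_r^{1/2} C_s C_r^{1/2}, a standard trick; this is an illustrative implementation on toy Gaussians, not the official FID code:

```python
import numpy as np

def fid(mu_r, cov_r, mu_s, cov_s):
    """Frechet distance between Gaussians (m_r, C_r) and (m_s, C_s):
    ||m_r - m_s||^2 + Tr(C_r + C_s - 2 (C_r C_s)^{1/2})."""
    # Tr((C_r C_s)^{1/2}) equals the trace of the PSD root of C_r^{1/2} C_s C_r^{1/2}.
    w, v = np.linalg.eigh(cov_r)
    sqrt_r = (v * np.sqrt(np.clip(w, 0, None))) @ v.T
    inner = sqrt_r @ cov_s @ sqrt_r
    w_inner = np.linalg.eigvalsh(inner)
    trace_sqrt = np.sum(np.sqrt(np.clip(w_inner, 0, None)))
    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * trace_sqrt)

# Identical Gaussians give distance 0; shifting the mean by delta adds ||delta||^2.
mu, cov = np.zeros(3), np.eye(3)
print(fid(mu, cov, mu, cov))        # 0 up to floating-point error
print(fid(mu, cov, mu + 2.0, cov))  # ||(2, 2, 2)||^2 = 12
```

In practice mu and cov are the sample mean and covariance of Inception activations over 10,000 images per side, but the distance computation itself is exactly this closed form.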
We limited the number of training iterations to 15,000 for CelebA to compare the learning efficiency and evaluation scores of the different models, since the generation quality of DC-MMD-GANs is already high at 15,000 iterations. For CIFAR-10, we limited the number of training iterations to 5000 in DC-SMMD-GAN and DC-MMD-GAN. To show the performance differences between models clearly, we computed FID and KID every 500 generator iterations. The learning rate was set to 0.0001. The batch size of each sub-model was set to 64. Following [14], the discriminator was updated 5 times per generator iteration. Following [29], the dimension of the discriminator outputs was set to 1 in DC-SMMD-GAN and 16 in DC-MMD-GAN. DC-SMMD-GAN used a radial-basis-function kernel with bandwidth equal to 0.5. DC-MMD-GAN used a mixture of rational quadratic kernels as in [14]. In DC-MMD-GAN, the gradient penalty and L2 penalty on the discriminator layers were both set to 1. The Adam optimizer was used for all models with β1 = 0.5, β2 = 0.9. Figure 4 shows the FID and KID of CelebA trained on these models. According to the score curves in Figure 4, the curves of DC-MMD-GANs drop faster than those of the original models: the training process of DC-MMD-GANs is faster. For example, in Figure 4a, the FID score of DC-SMMD-GAN with 8 sub-models reaches 40 in about 6500 iterations on CelebA, whereas the original model reaches 40 in about 11,500 iterations, 5000 iterations more than DC-SMMD-GAN with 8 sub-models. The FID score reaches 40 in fewer iterations as the number of sub-models increases (about 8500 iterations with 4 sub-models and 9500 iterations with 2 sub-models). We observe the consistent trend in the other figures. Under the same number of iterations, the score of DC-MMD-GANs is in most cases better than that of the original models.
The DC-MMD-GANs with 4 sub-models perform better than those with 2 sub-models. The training results of DC-MMD-GANs with more sub-models are relatively better because of the more precise loss function value. However, compared with the improvement from the original models to the 4-sub-model DC-MMD-GANs, the improvement from 4 to 8 sub-models is not obvious, partly because the improvement in the tightness of the bound from 4 to 8 sub-models is itself not obvious, as can also be observed in Figure 1. Figures 5 and 6 show the generated CelebA images after 15,000 iterations. Under the same number of iterations, DC-MMD-GANs generate higher-quality images, while parts of the images from the original models are blurred. Under the limited iterations, DC-MMD-GANs learn more detail from the training images and generate more realistic images. Because the resolution of CIFAR-10 is low, it is hard to see visually how DC-MMD-GANs improve the generation quality, so we do not show generated CIFAR-10 images; the improvement can instead be observed in Figure 7, which presents the FID and KID of CIFAR-10 trained on these models. DC-MMD-GANs converge faster on CIFAR-10, which is consistent with the results on CelebA. For CelebA, the GAN training time is 14.70 h for DC-SMMD-GANs and SMMD-GANs, and 18.10 h for DC-MMD-GANs and MMD-GANs. The time for auto-encoder training and k-means (the division task) is 1.57 h, which is negligible compared with the GAN training time. For CIFAR-10, the training time is 0.70 h for DC-SMMD-GANs and 0.90 h for DC-MMD-GANs; the division task takes 0.10 h, which is negligible. The time for auto-encoder training and k-means can be neglected because: (1) compared with the GAN training time, the division task takes only a short period, as shown in the experiments; (2) the auto-encoder is trained once per dataset, and the time for conducting k-means is negligible.
We compared the training scores of 2-sub-model DC-MMD-GANs with batch size 32, 4-sub-model DC-MMD-GANs with batch size 16, 8-sub-model DC-MMD-GANs with batch size 8, and MMD-GANs with batch size 64. Figures 8 and 9 show the FID and KID of CelebA and CIFAR-10 trained on these models. According to the score curves, the training score of DC-MMD-GANs is close to that of MMD-GANs. Table 1 shows that, compared with MMD-GANs, the training time of DC-MMD-GANs is reduced as the number of sub-models increases. The training time of DC-MMD-GANs is reduced because: (1) the strategy of DC-MMD-GANs is parallel divide and conquer; (2) MMD_dc^2 can be obtained with less computation than MMD_em^2 with batch size cm, which accelerates the training process.

Conclusions
In this paper, we connected the improvement of the accuracy of a statistic with the efficient training of GANs, and proposed a new divide-and-conquer GAN model, DC-MMD-GANs, which trades computing resources for training time. Based on the insight that a large deviation exists in MMD-GANs, we showed that the loss function of DC-MMD-GANs has a tighter deviation bound, which may open the door to research on optimizing generative models from a statistical perspective. Experimental results have shown that DC-MMD-GANs can be trained efficiently compared with the standard MMD-GANs. However, DC-MMD-GANs still have some limitations: the division task takes up some pre-training time and resources. An interesting direction for future study is reducing the overhead of the division task and choosing the best number of sub-models.

Conflicts of Interest:
The authors declare no conflict of interest.