Improving Multi-Agent Generative Adversarial Nets with Variational Latent Representation

Generative adversarial networks (GANs), which are a promising type of deep generative network, have recently drawn considerable attention and made impressive progress. However, GAN models suffer from the well-known problem of mode collapse. This study focuses on this challenge and introduces a new model design, called the encoded multi-agent generative adversarial network (E-MGAN), which tackles the mode collapse problem by introducing the variational latent representations learned from a variable auto-encoder (VAE) to a multi-agent GAN. The variational latent representations are extracted from training data to replace the random noise input of the general multi-agent GANs. The generator in E-MGAN employs multiple generators and is penalized by a classifier. This integration guarantees that the proposed model not only enhances the quality of generated samples but also improves the diversity of generated samples to avoid the mode collapse problem. Moreover, extensive experiments are conducted on both a synthetic dataset and two large-scale real-world datasets. The generated samples are visualized for qualitative evaluation. The inception score (IS) and Fréchet inception distance (FID) are adopted to measure the performance of the model for quantitative assessment. The results confirmed that the proposed model achieves outstanding performances compared to other state-of-the-art GAN variants.


Introduction
Generative adversarial networks (GANs) [1], along with the rapid development of deep learning in various fields [2][3][4][5][6][7][8][9], have attracted worldwide attention in the fields of image generation [10,11], medical image analysis [12], natural language processing [13], speech emotion recognition [14,15], and others [16][17][18][19]. A GAN consists of two networks: a generator and a discriminator. The generator plays a "fraud" role by generating plausible samples from a random noise to simulate real samples, while the discriminator plays a "police" role by trying to differentiate generated fake samples from real samples [20]. These two networks compete with each other during training optimization. Thus, they form a zero-sum game that continues until Nash equilibrium is reached, at which point the generated samples are indistinguishable from real samples by the discriminator [21]. Based on this adversarial learning process, GANs can capture a complex distribution that is highly similar to the real data from a random distribution. The unique adversarial learning process causes GANs to produce sharper and more plausible samples than do other generative models. This characteristic makes GANs one of the most promising of the current deep generation models [20].
However, GAN models suffer from the well-known problem of mode collapse during the adversarial training process. With the goal of deceiving the discriminator, the generator tends to generate samples that the discriminator believes highly realistic [22]. During training, these generated the existing multi-agent GAN models simply generate samples from a random prior distribution without exploiting the latent information contained in the real data.
In this paper, a novel GAN framework is proposed, named the encoded multi-agent generative adversarial network (E-MGAN), which aims to generate higher quality and more diverse samples by introducing the variational latent representations learned from a VAE training (as indicated by the red box in Figure 1) to a multi-agent GAN. As shown in Figure 1, the proposed E-MGAN consists of four modules: (1) an encoder, E (red module), which abstracts latent feature representations z from real samples x; (2) a multi-agent generator, G m (green module), containing multiple generators that generate samples in different modes; (3) a classifier network, C (orange module), which assigns penalties with the goal of forcing the generator to discover scattered data modes; (4) and a discriminator, D (purple module), which distinguishes generated fake samples from real samples to maintain the adversarial learning process. Unlike the existing GAN variants, the proposed model not only makes good use of the latent feature representation condensed from real data but also uses a multi-agent generator during training to reconstruct synthetic samples in distinct modes. Experiments are conducted on both a synthetic dataset (a 2D mixture of 25 Gaussian distributions) and two diverse, large-scale, real-world datasets (CIFAR-10 [34] and STL-10 [35]). The results are evaluated by two widely used metrics, inception scores (IS) [22] and Fréchet inception distance (FID) [36]. Note that the proposed model is an unsupervised learning method that does not require any labeled data; thus, all the results discussed in this paper are learned in an unsupervised manner. Nevertheless, the proposed model outperforms other state-of-the-art GAN variants. A detailed analysis of the diversity and realism of the generated samples is shown in Section 4. A large number of experiments demonstrate that the proposed model not only overcomes the model collapse problem but also improves the quality of the generated samples.
The key contributions of this paper are as follows: • A novel GAN architecture is proposed, named E-MGAN, which makes good use of the advantages of both a VAE and a multi-agent GAN. The model capitalizes on the variational latent feature representations learned by VAE from real data to improve the quality of the generated samples. • The proposed E-MGAN model surmounts the mode collapse problem by incorporating a new multi-agent generator that coordinates training with the encoder and classifier. Its input is the latent variational feature representation learned by the encoder, and its output is constrained by the classifier through maximizing the Shannon entropy. Therefore, the multi-agent generator is encouraged to generate samples in discrete data modes. • We conducted experiments to validate the effectiveness of our model on a synthetic dataset (a 2D mixture of 25 Gaussian distributions) and two diverse, real-world datasets (CIFAR-10, and STL-10). The results illustrate that the proposed model not only overcomes the problem of model collapse, but also improves the anatomical structure of samples generated on large-scale diverse datasets.
The remainder of this paper is organized as follows. Section 2 introduces three preliminary models related to the proposed model. An in-depth study of the proposed model is provided in Section 3. Section 4 demonstrates the performance of the proposed model through a series of experiments. Finally, Section 5 provides a conclusion and briefly suggests the future work.

Preliminaries
This section provides the background regarding the proposed E-MGAN. Since the proposed model attempts to generate higher quality and more diverse samples by fully capitalizing on the advantages of a VAE and a multi-agent GAN, there related models, the original GAN, VAE, and a typical MGAN, are briefly introduced.

Generative Adversarial Networks (GANs)
A GAN consists of two networks that act as players in a game: a generator G and a discriminator D. G produces fake samples G(z) from a noise vector z, which is sampled from a prior distribution P(z). The generated fake samples G(z) imitate the real samples x ∼ P real (x) by maximizing the output of the discriminator D(G(z)). The output of the discriminator, which is denoted by D(x) ∈ [0, 1], is the probability that the input samples x come from the real sample distribution P real (x); however, its input samples could also be generated samples G(z). The notation D(x) = 1 means that the input sample x is a real sample. In contrast, D(x) = 0 means that the discriminator will view the input sample x as a fake sample. The task of the discriminator D is to distinguish generated fake samples G(z) from real samples x by minimizing the probability of fake samples (denoted as D(G(z))) while maximizing that of real samples (denoted as D(x)). In this adversarial training process, the value function of GAN can be described as follows: where P real (x) and P(z) are the real data distribution and a random prior distribution, respectively. From the above analysis, the training objective of GAN is a minimax game, min G max D V(G, D). The parameters of the generator and those of the discriminator are alternately updated through the minimax game until the discriminator can no longer distinguish whether an input sample x or G(z) comes from the real data distribution or is a fake sample. Mathematically, this can be denoted as where P ge (G(z)) is the distribution of generated samples. At this point, the GAN reaches Nash equilibrium, V(G, D) = −2 log 2.

Variational Auto-Encoder (VAE)
A VAE consists of two members: an encoder network En and a decoder network De. The encoder network En(x) compresses training data samples x into latent feature representation vectors z with a distribution P en ( z|x). Thus, the decoder network De( z) reconstructs synthetic samples x ∼ P de (x | z) from the abstracted latent representation vectors z. Mathematical explanations of those two networks are: En(x) = P en ( z | x), x ∼ P real (x), The latent feature representation P en ( z | x) learned from the encoder is constrained by the prior distribution P(z) ∼ N(0, 1). P en ( z | x) is the reconstructed sample distribution. Then, the training objective of VAE is to maximize its variational lower bound or evidence lower bound (ELBO) function: On the right side of Equation (2), the first term is a regular term for θ en that encourages the approximate posterior distribution P en ( z | x) to be close to the prior distribution P(z), where D KL (·||·) is the KL divergence. The second term is a reconstruction error. It is the maximum likelihood of the sample x ∼ De( z) reconstructed from the extracted features z ∼ P en ( z | x).

Mixture Generative Adversarial Nets (MGAN)
MGAN is one typical multi-agent GAN that employs multiple generators to enhance mode recovery. Assuming that there are K generators, each generator G k maps the prior z to a generated sample x = G k (z), which represents a generated distribution. Then, K generators can generate a mixed distribution covering K generated data distributions, denoted as P ge . An MGAN includes another new member, the classifier C. C k (x ) represents which generator G k the fake sample x comes from. Therefore, MGAN is a game composed of three members: a set of generators G 1:K (z), a discriminator D(x), and a classifier C 1:K (x ). The value function of MGAN is formulated as follows: where β is the diversity hyperparameter which is positive. MGAN training is a maximin game, min G 1:K ,C max D V(G 1:K , C, D).

Proposed Encoded Multi-agent GAN
This section introduces the proposed model, encoded multi-agent GAN (E-MGAN). Differently from the existing multi-agent GAN architectures, the proposed model contains a new member, an encoder network E, that abstracts variational latent feature representations from real data for the generator. As shown in Figure 1, the proposed model has four members: (1) an encoder network; (2) a multi-agent generator network; (3) a classifier network; and (4) a discriminator network.
As indicated by the red box in the E-MGAN structure shown in Figure 1, the combination of the encoder network E and the multi-agent generator G m is similar to a VAE, when the multi-agent generator is considered as a decoder. First, the encoder E extracts the variational latent representation distribution N(µ, σ 2 ) from the real samples x and provides this distribution to the generator. Then, the multi-agent generator G m generates a fake sample G m ( z) from the variational latent representation z sampled from the posterior distribution N(µ, σ 2 ). However, unlike the decoder in VAE, our multi-agent generator G m consists of K generators; thus, a generated fake sample G m ( z) is combined with K generated samples from K generators. After that, the generated sample G m ( z) is put into the classifier C that recognizes which generator the fake sample comes from. The classifier encourages the generator to discover different data modes by maximizing the information entropy. Finally, the function of the multi-agent generator G m and the discriminator D is the same as that in a standard GAN. The generator G m tries to capture the data distribution through the gradients of discriminator D, while the discriminator D tries to distinguish the generated fake samples G m ( z) from real samples x.
The remainder of this section presents the details of the proposed model in two parts. First, the formulation of the model is introduced in Section 3.1. Second, the training objective for the proposed model and the algorithm of the training process are provided in Section 3.2. Table 1 provides the notation and corresponding definitions used in the proposed model.

Notation Definition
Notation Definition x Real samples P real (x) Real data distribution z Random prior variable P(z) Random prior distribution µ, σ 2 Mean and variance of latent feature representations P en ( z|x) Latent feature distribution z Latent feature representations G m ( z) Output of the multi-agent generator K Number of generators in multi-agent generator.
Probability that x is a real sample L Gm Value function of the multi-agent generator

Formulation of E-MGAN
Differently from the existing multi-agent GANs, the proposed model samples from the latent feature distribution rather than from a random distribution. Recent work [37] argues that the encoder E of VAE can extract latent feature representations from the real data, while the extracted latent feature representations capture the semantic attributes of the training samples. Thus, E-MGAN makes good use of the advantage of a VAE that learns the variational latent feature representations from real data to improve the multi-agent GAN.
Our encoder and the selected generator form a VAE, as shown in the red box of Figure 1. The encoder E learns latent feature variables, the mean µ, and the covariance σ 2 , from real data x. The posterior distribution P en ( z|x) = N(µ, σ 2 ) is obtained through the reparameterization trick: 1). Then, the multi-agent generator G m generates fake samples x = G m ( z) from the latent feature representations z ∼ P en ( z|x).
As a special VAE variant, we adopt the KL divergence and the reconstruction error in the objective function of VAE to train our encoder E and multi-agent G m . The KL divergence aims to encourage the extracted feature distribution N(µ, σ 2 ) to be close to the random prior distribution P(z). We set the random prior distribution P(z) to a Gaussian distribution N(0, 1). The analysis of the KL divergence is shown in Equation (4). The reconstruction error is used to make the reconstructed image x more similar to the real image x. We calculate the reconstruction error L rec with the mean squared error (MSE) [38]. Therefore, the KL divergence and the reconstruction error in our model are as follows: The last equality of Equation (4) holds, because of P(z) = N(0, 1). The multi-agent generator G m consists of K generators, as shown in Figure 2, which is inspired from MGAN [33]. However, they are different; our multi-agent generator G m samples from the latent feature distribution P en ( z|x) learned by the encoder E, while the MGAN generator samples from a random prior distribution P(z). This change not only encourages our generator to capture the data distribution more quickly but also helps it generate more realistic samples that can fool the discriminator. As described in Figure 2, the generators in the multi-agent generator share their parameters except for the input layer and output layer, which helps reduce redundant computations while ensuring that the generated samples of each generator differ. Assuming that each generator G i captures one mode P G i from the learned latent feature representations z, then the multi-agent generator G m can theoretically capture K data modes, named P ge ( z). However, it cannot guarantee that those data modes will overlap with each other. The classifier C is a penalty to the generator, encouraging the generator to discover diverse data modes. It calculates the probability (C G i (x )) that the input artificial samples x come from the generator G i . We denote the output of our generator as is the output of the i-th generator and λ i is the weight of G i ( z). Therefore, maximizing the information entropy of the classifier output is organized as the objective function of the classifier L C : The Jensen-Shannon divergence (JSD) of multiple distributions that JSD(P 1 , ..., [39]. According to the above analysis, maximizing L C implies maximizing the JSD among P G 1 , P G 2 , · · · , P G K and the information entropy of λ i . The greater the JSD is, the more dispersed the generated sample modes are. Therefore, maximizing L C encourages our multi-agent generator to generate samples in relatively dispersed data modes. H(λ i ) reaches its maximum value when λ i = 1/K; thus, we set λ i = 1/K in our model. Clearly, the classifier functions as a penalty term to the generator.
The role of our discriminator D is the same as that in a classical GAN; its task is to distinguish the generated samples G m ( z) from real samples x. It calculates the probability D(x) ∈ [0, 1] that an input sample x (a real sample x or a generated sample x ) is sampled from the real data distribution P real (x). It accomplishes this task by maximizing the probability of real samples while reducing the probability of generated samples. Therefore, the objective function of the discriminator is to maximize the following function: This process motivates the generator to generate more realistic samples with the latent feature representation, thereby improving its ability to generate samples. According to the above analysis, the goal of our multi-agent generator G m is not only to generate more realistic samples that fool the discriminator but also to improve the mode recovery of generated samples to avoid the mode collapse problem. First, our multi-agent generator G m samples from the latent feature representation z ∼ P en ( z|x) provided by the encoder. Second, its generated samples x = G m ( z) are penalized by the reconstruction error L rec and the value function of the classifier Lc. Finally, the generated samples x must maximize the probability D(x ) to fool the discriminator. Therefore, the training objective function of the multi-agent generator is described as minimizing the value function L G m :

Objective of E-MGAN
The latent representation distribution P en ( z|x) extracted by the proposed model from real samples x needs to minimize the KL distance L KL . The proposed model samples from the learned distribution P en ( z|x) to generate samples. To encourage the generated samples x to be more plausible and realist samples, they are subject to reconstruction error L rec by minimizing Equation (5). The generated samples aim to deceive the discriminator by minimizing L D . Simultaneously, the discriminator tells the generated samples from real samples by maximizing L D . Therefore, the total training objective of E-MGAN can be summarized as follows: where the calculation formulas for the 3rd and 4th terms in Equation (9) are shown by Equation (4) and Equation (5), respectively. Algorithm 1 describes the training process of the proposed E-MGAN model. First, the K generators in the multi-agent generator, λ i (i = 0, 1, . . . , K), are initialized with the weights of the ith generator; and the parameters of the encoder network, multi-agent generator network, classifier network, and discriminator network are initialized with θ E , θ G , θ G , and θ D . Then, the encoder extracts the latent feature distribution at Step 4. The two parts of encoder loss L E , the KL divergence D KL (P en ( z|x) P(z)) between the learned latent feature distribution P en ( z|x) and the random prior distribution P(z), and the reconstruction error L rec , are calculated at Step 5 and Step 8, respectively. Thus, the encoder loss L E is obtained in Step 9. In addition, the loss functions of the classifier, generator, and discriminator are calculated at Steps 11, 13, and 14, respectively. Finally, Steps 15-18 update the parameters of the four networks.

Algorithm 1 The training process of the proposed E-MGAN algorithm.
Require: K, the number of generators in multi-agent generator; λ i , the i-th generator weight; P real (x), the real data distribution; P(z) ∼ N(0, 1), the random prior distribution. 1: Initialize:θ E , θ G , θ G , and θ D , the parameters of the encoder network, generator network, classifier network and discriminator network, respectively. 2: while not converged do 3: Sample x ∼ P real (x) a batch from the real dataset; 4: P en ( z|x) ← Enc(x);

5:
L KL ← D KL (P en ( z|x) P(z)); 6: Sample a minibatch of z ∼ P en ( z|x); put into multi-agent generator; 7: x ← G m ( z); 8: L rec ← x − x 2 2 ; 9: L E ← L KL + L rec ; 10: Put a minibatch of reconstruction samples x into the classifier and obtain C(x ); 11: Put a minibatch of training samples x and reconstruction samples x the discriminator and obtain D(x) and D(x ); 13:

Experiments
In this section, comprehensive experiments on both a synthetic dataset and two real-world datasets are described; they are done to validate the performance of the proposed E-MGAN model. The generated samples are shown for evaluation via visual inspection for qualitative assessment. Meanwhile, for quantitative assessment, the generated samples are evaluated by two of the currently most widely adopted metrics, namely, IS and FID. The experimental details and results are reported in the remainder of this section.

Datasets
A 2D synthetic dataset and two widely used large-scale real-world datasets, CIFAR-10 [34] and STL-10 [35] are adopted to demonstrate the performance of the proposed model through a series of experiments.
• 2D synthetic dataset adopted in these experiments consisted of 25 isotropic Gaussian distributions with a fixed standard deviation of 0.05. These 25 Gaussian distributions are arranged in a 5 × 5 grid, as shown by the red points in Figure 3. • CIFAE-10 [34] contains 60,000 32 × 32 color natural images and was collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. These images are balanced across the following 10 categories: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are 6000 images in each category. • STL-10 [35] consists of 100,000 unlabeled natural color images balanced across the following 10 categories: airplane, car, bird, cat, dog, deer, horse, monkey, ship, and struck. STL-10 is more diverse than the CIFAR-10 dataset, and it contains images with a resolution of 96 × 96. To ensure a fair comparison with other models, we compressed the images from 96 × 96 resolution to 48 × 48 resolution.

Evaluation Metrics
The evaluation of generated images is still a notoriously challenging issue, and there is no uniform standard. Among current metrics, IS [22] and FID [36] are the two most widely adopted evaluation metrics in various GAN variants. In this paper, for quantitative assessment, these two different and widely used metrics are adopted to estimate the quality and diversity of generated samples in the experiments.

•
Inception score (IS) [22] is widely used to measure sample qualities. It is wonderfully designed and based on Google's inception deep learning model. The IS is adopted to evaluate the results in the experiments because it assesses the generated images based on two image aspects: realism and diversity. The score is computed as follows: As shown in Equation (10), IS consists of two parts: p(y|x) and p(y). The former is a conditional label distribution for each given generated image x . Lower entropy means the generated sample x is closer to a certain category that contains meaningful objects. The latter p(y) = x p(y|x)dx, a marginal distribution, is the label distribution of all generated samples. Higher entropy implies the generated samples are scattered among different categories. In these experiments, the IS is adopted to evaluate the realism and diversity of generated samples using the code available from https://github.com/openai/improved-gan/tree/master/inception_score. [36] is another reasonable way to quantify the quality of generated images. FID is more consistent than IS because it evaluates the generated samples by calculating the Fréchet distance between the real samples and generated samples in the feature space, where the Fréchet distance is a Wasserstein-2 distance. Therefore, FID is better than IS at capturing the level of similarity between the generated samples and real samples. If the feature distribution of real samples and that of generated samples are respectively denoted as N(µ r , Σ r ) and N(µ g , Σ g ), the FID value between them can be calculated by
In addition, FID is proven sensitive to mode dropping, particularly in respect to intraclass mode dropping [40]. We supplemented the experiments with FID to evaluate the diversity of generated data modes. The pertinent code is abstracted from https://github.com/bioinf-jku/TTUR .

Experimental Settings
To ensure a fair comparison, we followed the experimental settings of previous works. All the experiments are implemented using Python 3.7 and TensorFlow [41] with two GT2080Ti GPUs and CUDA 10.0.
There are four networks in the proposed model. First, the encoder network E is designed by adopting a multi-layer perceptron, a simple neural network that extracts data characteristics. Second, as described in Figure 2, the multi-agent generator consists of K generators that shares all their parameters except for the input and output layers. Therefore, in fact, they share the same hidden layer of a single network except for their input and output layers. In addition, for fair comparison with previous works, we adopted DCGAN [42] to construct both our multi-agent generator G m and our discriminator D networks, allowing us to use the same network architecture and training procedure. DCGAN is a widely adopted modeling architecture that has been used in various GAN variants due to its stability during adversarial training using convolutional neural networks. Finally, because the classifier C and the discriminator D are essentially both classifiers, to reduce redundant calculations, they are designed to share parameters except in the last output layer, as shown in Figure 1. The difference between them is that the latter is a binary classifier, while the former is a multi-class classifier that calculates the probability that a particular generator generated a given sample; the goal is to identify which generator was the originator of each synthetic sample.

Experimental Results
The experiments are divided into two parts to validate the proposed model. On one hand, the experiments demonstrate that our model improves the quality of the generated samples with respect to the precision of synthetic data and the correct anatomy of real images. On the other hand, the experiments demonstrate that the proposed model effectively improves the diversity of the generated samples. The samples generated by the proposed model are more diverse not only among the categories but also within the categories. It is worth noting that our model requires no data labels. The proposed model is a completely unsupervised method. Therefore, all the results are generated in an unsupervised manner.

Results on Synthetic Samples
A Gaussian mixture distribution with 25 isotropic data modes is selected to validate the accuracy of the proposed model. The proposed E-MGAN model is compared with a typical multi-agent GAN, MGAN. The experimental results are shown in Figure 3. Each model is trained for 50, 000 epochs and the results of the different methods are presented (MGAN in the upper row and E-MGAN in the bottom row) at different epochs: 2k, 5k, 25k, and 50k.
As shown in Figure 3, both E-MGAN and MGAN [33] capture all 25 modes. However, E-MGAN converges much faster than does MGAN; it covers nearly all the modes as early as step 2k, while the generated points of MGAN display more freedom. Furthermore, E-MGAN captures the data modes more precisely than does MGAN from Epoch 25k to the end, while there are several modes that MGAN cannot exactly cover. This comparison between MGAN and E-MGAN on the 2D synthetic dataset demonstrates that E-MGAN captures data modes more accurately.  Results on Real-world Samples Figure 4 shows the samples generated by the proposed E-MGAN and those generated by MGAN trained on the CIFAR-10 dataset. We adopted 10 generators to train these two multi-agent GAN models and present the generated samples of different generators in the rows, as shown in Figure 4. The samples generated by MGAN are displayed in Figure 4a, while the samples generated by E-MGAN are displayed in Figure 4b. In the left-hand image, MGAN can identify horses, ships, cars or trucks, flying objects (birds or airplanes), and other unclear animals. However, there is clearly a generator collapse in MGAN. In the right-hand image, E-MGAN clearly identifies the outlines of horses, ships, birds, and automobiles. It can also roughly identify the cats and dogs, trucks, and even frogs. More importantly, no generator suffers from the mode collapse problem, although the deer could not be clearly identified. To quantitatively analyze the quality of the generated samples, Table 2 shows the IS results of samples generated by E-MGAN and other benchmark methods on both the CIFAR-10 and STL-10 datasets. It also lists the IS results of the real data, which act as the baseline for all the generated samples from the different models. The models compared on the CIFAR-10 dataset include several classic GAN variants (such as WGAN [25], MIX+WGAN [43], improved-GAN [22], DCGAN [26], and MAGAN [44]); models that combine a GAN and VAE (such as ALI [28] and BEGAN [45]); and several multi-agent GANs (such as GMAN [31], D2GAN [30], and MGAN [33]). Among the compared models, we chose the best performing MGAN for recovery. We obtained an IS of 8.12 ± 0.10 for MGAN through 500 epochs of 5000 samples in the tested implementation, which is slightly lower than the results published in [33], as shown in Table 2. By adopting 10 generators as in MGAN, we completed the experiment with the proposed E-MGAN model. The samples generated by E-MGAN achieved a slightly higher IS of 8.42 ± 0.09, which is closest to the real data, indicating that E-MGAN outperforms other models. Table 2 also presents the IS results of E-MGAN compared with other models on STL-10. The STL-10 dataset is more diverse than CIFAR-10 because it is a subset of the full ImageNet dataset. The compared models include DCGAN [26], D2GAN [30], and MGAN [33]. The images in the STL-10 dataset have a resolution of 96 × 96; however, DCGAN, D2GAN, and MGAN are all trained on images with a resolution of 48 × 48. To ensure a fair comparison with these previous works, we compressed the images of the STL-10 dataset from 96 × 96 to 48 × 48. We reproduced the experimental results of the MGAN model on the STL-10 dataset and obtained an IS score of 9.12 ± 0.10, which is somewhat lower than the experimental result of 9.22 reported in [33]. In the same experimental environment, the proposed model obtained an IS score of 9.35 ± 0.10, outperforming the other baselines (DCGAN and D2GAN). Similarly, on the STL-10 dataset, the E-MGAN model achieved an IS value as high as 9.35 ± 0.12. Obtaining the highest IS score indicates that the quality of samples generated by the proposed E-MGAN model is superior to that of the other models. Table 2. Inception scores on CIFAR-10 and STL-10 datasets. All the results are made in an unsupervised manner. The higher the IS value, the better the quality of generated samples is. A dash ("-") indicates unavailable data.

Diversity Analysis of the Generated Samples
Results on Synthetic Samples Figure 5 compares the experimental results of E-MGAN and WGAN trained on 25 independent Gaussian mixture distribution datasets (WGAN is in the upper row and E-MGAN is in the lower row). E-MGAN converges much faster than WGAN because it achieves an even distribution around all the targets (red points) and nearly covers most of the targets as early as epoch 2k. By epoch 10k, E-MGAN recognizes all modes. However, at epoch 2k, the points generated by WGAN are distributed around merely a few target modes, and multiple target modes in the middle are not found. Until the end, at epoch 50k, several targets are still not discovered (such as col 1, row 5; col 3, row 5; col 4, row 2 and 3; and col 5, row 4). Additionally, during the entire WGAN training process, most of the generated points wander between two adjacent target modes. The wandering points among the points generated by E-MGAN gradually decrease and gather around the target points. Finally, at epoch 50k, only a few points are floating between two adjacent target points. Experiments on the synthetic dataset of WGAN and E-MGAN proved that the proposed E-MGAN model generates more diverse data to discover all modes and thus overcomes the mode collapse problem.

Results on Real-world Samples
To qualitatively illustrate the performance of the proposed model regarding the diversity of the generated samples, we intuitively evaluated the image quality by exhibiting the generated samples trained on different real-word datasets. Figure 6 presents images generated by our model trained on the CIFAR-10 32 × 32 dataset. Some generated samples trained on the STL-10 48 × 48 dataset are shown in Figure 7.  Figure 6 shows the real samples (in Figure 6a) and the generated samples of the E-MGAN model (in Figure 6b) on the CIFAR-10 dataset. Figure 6a shows 10 examples of each category among the 10 categories of the real dataset. Figure 6b shows 10 samples of each of the categories generated by the E-MGAN model. On one hand, the generated samples of E-MGAN cover all modes; cars, deer, frogs, dogs, ships, birds, airplanes, horses, and trucks can be clearly identified. The cats are fuzzy because the samples in the real dataset are insufficient to be clearly identified. On the other hand, the generated samples in each category are different. The generated samples of cars, ships, dogs, horses, and trucks are viewed from different perspectives and appear in diverse colors. The generated data on the CIFAR-10 dataset demonstrate that E-MGAN achieves diversity both among classes and within classes. Figure 7 represents generated samples of the proposed E-MGAN model trained on STL-10 at different epochs. Recognizing all the objects in the generated images on STL-10 is somewhat difficult, but it is relatively easy to recognize the shapes of objects such as cars, trucks, ships, airplanes, birds, horses, and other animals. Furthermore, the colors and backgrounds (such as the sky, sea, and land) of the generated images become increasingly diverse as the number of training iterations increases.  To quantitatively evaluate the performance of the E-MGAN model regarding diversity, FID is a good measurement method [40]. The FID values are compared with previous works, including DCGAN [26], DCGAN + TTUR [36], WGAN-GP, WGAN-GP-TTUR [36], and MGAN, on the CIFAR-10 dataset to validate the ability of the proposed model to solve the mode collapse problem. The FID values of E-MGAN and MGAN are obtained through experiments and the FID results of other models are quoted as reported in [36]. As shown in Table 3, the proposed E-MGAN obtained the lowest FID value (24.4) among all models. Obtaining the lowest FID value means that E-MGAN is the model least affected by the mode collapse problem and that it can recover different modes. The samples generated by the proposed E-MGAN model are shown in Figure 6b. Table 3. FIDs of different models on CIFAR-10. The lower the FID value, the better the diversity of generated samples is.

Conclusions
This paper proposes a novel multi-agent GAN architecture E-MGAN. Differently from existing multi-agent GANs, the proposed model aims to generate higher quality and more diverse samples by making full use of the advantages of a VAE and a multi-agent GAN. The E-MGAN learns the variational latent representations from real data to improve the quality of the generated samples. Meanwhile, E-MGAN increases the diversity of generated samples via the coordinated training of a multi-agent generator with the encoder, the classifier, and the discriminator. Extensive experiments on both a synthetic dataset and two large-scale real-world datasets, (CIFAR-10 and STL-10), demonstrated that the proposed E-MGAN model not only improves the quality of the generated samples but also recovers diverse data modes. Future work should consider employing labeled data information or utilizing the statistical information of generated samples to increase the diversity of generative models.