Generative Adversarial Network for Class-Conditional Data Augmentation

We propose a novel generative adversarial network for class-conditional data augmentation (GANDA) to mitigate data imbalance problems in image classification tasks. The proposed GANDA generates minority class data by exploiting majority class information to enhance the classification accuracy of minority classes. For stable GAN training, we introduce a new denoising autoencoder initialization with explicit class conditioning in the latent space, which enables the generation of definite samples. The generated samples are visually realistic and have a high resolution. Experimental results on standard benchmark datasets (i.e., MNIST and CelebA) demonstrate that the proposed GANDA can considerably improve classification accuracy, especially when datasets are highly imbalanced. Our generated samples can be easily used to train conventional classifiers to enhance their classification accuracy.


Introduction
In imbalanced datasets, training data are not uniformly distributed over the different classes. Despite the imbalance, these datasets are widely used in many image classification problems because it is nontrivial to gather data evenly for all classes. In many cases, we inevitably have to train conventional classifiers using imbalanced datasets, which induces highly biased or imprecise classification results. This problem arises because the class ratio and the balance of the dataset are not taken into account.
Data augmentation can help mitigate this data imbalance problem by generating new synthetic data for imbalanced classes and improving the balance between classes. However, conventional data augmentation methods (e.g., mirroring, rotation, and geometric transformation) have several potential problems. For example, translation and random cropping may run the risk of changing image labels. Flipping or rotation may disrupt objects' orientation-related features. In particular, for domain-dependent applications such as medical image analysis, there is a more severe bias between the potentially generated (augmented) data and the training data than mere translational or positional variances.
Unlike geometric transformations, generative adversarial networks (GANs) [1] can improve the general balance among classes while remaining less affected by the application domain and dataset characteristics. GANs generate minority class images that can be used to restore the balance of classes in datasets. Because GANs can generate realistic samples that are not covered by the original datasets, they have shown outstanding performance in terms of data augmentation. However, despite their capabilities, GANs have difficulty in stably optimizing their objective functions. In addition, if minority data are very scarce, conventional GANs still face generalization problems. To overcome the aforementioned shortcomings of GANs for imbalanced datasets, Balanced GAN (BAGAN) [2] has been proposed, which is specialized for imbalanced datasets. Based on autoencoder initialization and class conditioning in the latent space, BAGAN attempts to describe the minority class distribution using data that are jointly learned from majority and minority classes. However, BAGAN also has several issues that need to be addressed. First, BAGAN cannot explicitly impose the class condition in the latent space, which yields unintended class samples, especially when different class regions significantly overlap with each other in the latent space. Second, the aforementioned problem makes it difficult to use BAGAN to train deep neural networks, and BAGAN fails to optimize the network.
In this paper, we solve the data imbalance problem for image classification tasks and propose a novel data augmentation method based on GANs (i.e., a generative adversarial network for class-conditional data augmentation (GANDA)), which can produce minority class data for accurate classification by effectively leveraging majority class information. To achieve stable learning in the GAN framework, we present a new denoising autoencoder initialization. Using this technique, we can learn a more accurate relative representation of each class in the generator latent space. Moreover, we introduce a conditional one-hot code, which can be used to generate definite samples. Experimentally, our proposed method can accurately generate minority samples from various imbalanced datasets where the BAGAN method has failed to do so.
Our generated samples are visually realistic and have a high resolution (128 × 128 × 3). When our generated samples are used for training conventional classifiers, the proposed GANDA significantly enhances the classification accuracy of existing classifiers and shows state-of-the-art performance on imbalanced datasets in various experiments. Figures 1 and 2 show the differences between the proposed GANDA and conventional methods: they provide a conceptual comparison between the balanced generative adversarial network (BAGAN) autoencoder and the proposed generative adversarial network for class-conditional data augmentation (GANDA) autoencoder. Random noises are inserted into an input image to diversify output images, and the conditional one-hot code is inserted into the proposed encoder and decoder to generate definite samples. The main contributions of this work are as follows.

• We present a novel data augmentation method based on GANs. Our proposed method can generate minority class data accurately in imbalanced datasets.
• For stable GAN training, we present a new denoising autoencoder initialization technique with explicit class conditioning in the latent space.
• We conduct various experiments showing underlying problems in conventional methodologies. We experimentally show that majority class data can help generate minority class data and considerably enhance its classification accuracy.

Data Augmentation Methods
There are several approaches for data augmentation [3]. In traditional approaches, images are geometrically transformed or distorted using rotation, scaling, white-balancing, or sharpening [4][5][6]. A relatively recent and powerful alternative is to use generative adversarial networks such as ACGAN [7] and BAGAN [2]. ACGAN can select the target class to be generated but does not consider minority classes in image classification.
In contrast to ACGAN, our method directly targets the imbalanced dataset problem in image classification tasks. Our idea mainly stems from BAGAN, which can jointly solve data generation and image classification problems. However, the proposed GANDA produces more accurate classification results and induces fewer overlaps among different classes in the latent space using our explicit class conditioning. Please note that it is nontrivial to perform class conditioning in the latent space for data augmentation because there are many very similar classes, and class overlaps can occur easily.

Methods for Imbalanced Datasets
There are two representative ways to solve imbalanced data problems. The first is to resample the data via either oversampling or undersampling [8][9][10][11][12][13]. The synthetic minority oversampling technique (SMOTE) [14] combined oversampling and undersampling: for oversampling, SMOTE applied data augmentation to minority samples to mitigate the overfitting issue. However, oversampling is susceptible to overfitting, whereas undersampling usually discards valuable information in majority samples. The second approach is cost-sensitive learning, which aims to avoid the aforementioned issues by assigning different costs to the misclassification of the minority class [15][16][17][18][19][20]. For example, dynamic curriculum learning [21] and active learning [22] have been proposed to tackle dataset imbalance. Some approaches [23,24] view the class-imbalance problem within the meta-learning framework. Max-margin learning [25] directly enforced max-margin constraints; similar to this method, our method also attempts to delineate the space between classes more prominently. However, in the aforementioned cost-sensitive learning, it is difficult to design proper cost functions for different problem settings or environments. To solve the imbalanced data problem, M2m [26] generated minority samples using majority samples.
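The core interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified illustration (interpolating from a minority sample toward one of its nearest minority neighbors), not the full published algorithm, and the function name `smote_oversample` is our own.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=3, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    a randomly chosen minority sample and one of its k nearest minority
    neighbors (a simplified sketch of the SMOTE idea)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from sample i to all other minority samples
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        dist[i] = np.inf                      # exclude the sample itself
        neighbors = np.argsort(dist)[:k]
        j = rng.choice(neighbors)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.stack(synthetic)
```

Because each synthetic point lies on the segment between two real minority samples, the new data stay inside the minority class's convex hull.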
In contrast to these methods, our proposed GANDA uses autoencoders and GANs to generate minority class data. Please note that DOPING [27] originally aimed to detect anomalous events, but it can be used for augmenting minority class data, because oversampling anomaly samples at the boundary of the latent distributions is equivalent to augmenting minority class data. While DOPING also adopted autoencoders and GANs, its latent space is not tailored by class conditioning for more accurate oversampling, unlike our proposed method.

GAN-Based Methods
After the seminal work proposed by Ian Goodfellow et al. [1], GANs have been actively researched to generate realistic fake data. For example, many studies have used GANs for imitation [28][29][30]. BAGAN [2] addressed imbalance problems by coupling autoencoders and GANs. It generated minority class data and jointly performed classification tasks to drive the generation of desired minority classes. BAGAN produced more accurate classification results compared to ACGAN [7], which separately handled data generation and image classification tasks. However, SCDL-GAN [31] shows that BAGAN cannot avoid mode collapse problems, in which its generator produces limited varieties of samples. SCDL-GAN produced impressive results by using a Wasserstein-based autoencoder in the representation of the class distributions.
In contrast, the proposed GANDA adopts explicit class conditioning to avoid mode collapse problems and to more accurately generate the minority classes.

Difficulties in Autoencoder Initialization
We observe that conventional autoencoder initialization with implicit class conditioning can make training of the latent space unstable [2]. In addition, conventional generators can generate wrong samples even when training converges successfully. We first examine why these problems occur through the following experiments. First and foremost, we carry out t-SNE visualization for class-conditional codes in the latent space, which are determined by the autoencoder initialization. Figure 3a illustrates the t-SNE of the conventional autoencoder initialization on the MNIST dataset. As shown in the figure, autoencoder initialization can maximize the inter-class distance while minimizing the intra-class distance. In addition, it causes visually similar classes to be distributed close to each other in the latent space. For example, the "9" (cyan) and "4" (red) classes share visually similar characteristics; thus, they are positioned relatively closer than other classes. Using autoencoder initialization, minority classes can also be recognized in the latent space if we examine their relative distance to majority class data. However, class-conditional noises learned by autoencoder initialization can cause unintended effects in the latent space if there is no explicit class conditioning. As shown in Figure 3a, many overlapping areas appear for visually similar classes, which potentially produces wrong class samples. For example, "4" and "9" are two different but very adjacent classes in the latent space. Thus, conventional approaches can frequently generate "9" when sampling "4". To mitigate this problem, the denoising autoencoder [32] was proposed, which widens the gap between different classes in the latent space by training robust features for each class. Figure 3b shows the t-SNE results using the denoising autoencoder.
We qualitatively observe that denoising autoencoders broaden the inter-class gap in the latent space and shrink the intra-class gap compared with original autoencoders. However, the denoising autoencoder cannot fully solve the inter-class overlapping problem, as there still exist overlapping areas among different classes. Thus, in terms of quantitative measures, denoising autoencoders are similar to original autoencoders. In addition, if we handle high-resolution images in a low-dimensional latent space, the aforementioned problem remains and even worsens, as explained in the next section. Please note that Figure 3c illustrates the t-SNE visualization of the proposed GANDA encoder representation. Although our GANDA uses a class-conditional one-hot code, it effectively disentangles different classes in the latent space, as shown in Figure 3c.

Difficulties in High-Resolution Data Generation
To show the difficulties in high-resolution data generation, we conduct experiments using BAGAN with autoencoder initialization on the CelebA dataset (128 × 128 × 3) and perform a simple binary classification of males and females. Figure 4 shows high-resolution samples generated for the female class using class-conditional noise. As shown in the figure, conventional methods typically generated correct samples but also created many unintended class samples; some samples even wrongly belong to the other class (male). We argue that this problem occurs because class conditioning is not explicitly imposed on adjacent areas between classes in the latent space. Moreover, this problem becomes even worse for minority classes, because we cannot accurately represent their latent spaces due to the lack of class data.
The problems of conventional approaches (e.g., BAGAN) can be summarized as follows. First, the conventional autoencoding process does not explicitly regard the imbalanced datasets. Therefore, it can incorrectly learn the latent space of the minority class compared to the majority class. Second, conventional approaches utilize the majority class data to generate minority class samples using implicit class conditioning in the latent space. Therefore, unintended class samples can be generated. Finally, existing approaches experimentally produce inaccurate results in that samples from other (majority) classes are used to generate images that belong to a certain (minority) class.

Class-Conditional GAN-Based DA
As shown in the experiments in Figure 3, denoising autoencoder initialization broadens the inter-class gap in the latent space and shrinks the intra-class gap compared with original autoencoder initialization.
Thus, autoencoder initialization helps GANs train stably on imbalanced datasets. However, if samples are generated only with class-conditional noises, training becomes unstable due to possible overlaps between classes in the latent space. To settle this problem, we propose a variant of the GAN architecture called GANDA. In GANDA, minority classes are effectively generated using data from both majority and minority classes in imbalanced datasets to restore the balance of the datasets. To generate a specific class sample, the proposed method explicitly feeds the class label to our generator as a condition. At the same time, class-conditional noises, which encode relative distances to other classes in the latent space, are used to leverage other class information. As the training proceeds, our method generates realistic minority class samples and simultaneously improves the classification accuracy of both majority and minority classes.
The proposed method consists of two main steps: conditional denoising autoencoder initialization and adversarial learning. We describe each step in detail, as follows.
• Conditional denoising autoencoder initialization: We first add noise to an input image x ∈ R^d, which diversifies the output images. Subsequently, we concatenate the noisy input image x̃ with a one-hot label y = {c_1, c_2, . . . , c_n} and feed it into the encoder φ ∈ R^(d×p), which extracts class-aware latent features z as follows:

z = φ([x̃; y]). (1)

The estimated latent feature is concatenated with the same one-hot label y once more to explicitly guide the generation process. The concatenated feature is then fed into the decoder ψ ∈ R^(p×d), which yields the reconstruction x̂:

x̂ = ψ([z; y]). (2)

We train the proposed conditional denoising autoencoder using the reconstruction loss L_recon:

L_recon = ‖x − x̂‖_2^2. (3)

To optimize the objective function in (3), we adopt the l_2 norm. Please note that the trained autoencoder describes the distribution of the latent codes of the classes. Thus, we initialize a part of the discriminator, D_e, and the generator, G, using the parameters of the encoder φ and the decoder ψ, respectively.
• Adversarial learning: After conditional denoising autoencoder initialization, we train the generator G and the discriminator D via adversarial training. Before adversarial learning proceeds, we use the encoder φ to determine the multinomial distribution of the class latent codes in the training data. Then, the generator is trained to produce a fake class sample x̂_y by selecting a latent code z_y, which is drawn from the aforementioned multinomial distribution determined by φ and concatenated with the corresponding one-hot label y:

x̂_y = G(z_y | y). (4)
The discriminator classifies the generated sample as belonging to one of n classes or as being fake.
D(x) = [l_1, . . . , l_(n+1)], (5)

where l_j denotes the j-th output of the discriminator. Please note that the discriminator has (n + 1) outputs in total, where n is the number of class labels and one output is for the fake label. The objective function for adversarial learning is the sparse categorical cross-entropy over these (n + 1) outputs. The advantages of the proposed method can be summarized as follows. Our class conditioning makes different classes well separated in the latent space even when minority class data are scarce. Thus, class conditioning enables accurate data augmentation and classification of the minority class data. Moreover, class conditioning allows our method to safely use features that are shared by different classes. For example, facial images of males and females include common features such as eyes, ears, mouth, and nose, whereas the two classes can be differentiated by hair length and make-up. Then, we can use the shared features from majority data (i.e., male) to generate minority data (i.e., female), while accurately separating these two types of data in the latent space. In Section 4.3, we verify that our method generates minority class data accurately using majority class data. Please note that the class information is available without additional cost in supervised learning.
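The conditional denoising autoencoder pass described above can be sketched with plain linear maps standing in for the convolutional encoder and decoder. This is only an illustration of the conditioning scheme (the noisy input and the latent code are each concatenated with the one-hot label); the dimensions, weights, and function names are ours, not the authors' network.

```python
import numpy as np

rng = np.random.default_rng(0)

d, p, n_classes = 16, 4, 3        # input dim, latent dim, number of classes
W_enc = 0.1 * rng.normal(size=(d + n_classes, p))   # linear stand-in for encoder phi
W_dec = 0.1 * rng.normal(size=(p + n_classes, d))   # linear stand-in for decoder psi

def one_hot(c, n):
    y = np.zeros(n)
    y[c] = 1.0
    return y

def forward(x, c, noise_weight=0.5):
    # Add noise to the input, then concatenate the one-hot label twice:
    # once before the encoder and once more before the decoder.
    y = one_hot(c, n_classes)
    x_noisy = (1 - noise_weight) * x + noise_weight * rng.standard_normal(d)
    z = np.concatenate([x_noisy, y]) @ W_enc    # class-aware latent code
    x_hat = np.concatenate([z, y]) @ W_dec      # class-conditioned reconstruction
    return x_hat, z

x = rng.standard_normal(d)
x_hat, z = forward(x, c=1)
recon_loss = np.sum((x - x_hat) ** 2)           # l2 reconstruction loss
```

The encoder weights would then initialize part of the discriminator and the decoder weights the generator, as described in the text.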

Implementation Details
Network architecture details: The proposed method consists of two main components for conditional denoising autoencoder initialization and adversarial learning. For stable learning, we add a spectral normalization layer [33] to the encoder (i.e., part of the discriminator) to impose the Lipschitz constraint. In addition, the decoder adopts transposed convolution layers and has the same structure as the generator. For more details about the network architecture, please refer to Table 1. Hyperparameters: The proposed network is trained with a batch size of 32. We use the ADAM optimizer with a fixed learning rate of 0.00005, β_1 = 0.5, and β_2 = 0.9999. For conditional denoising autoencoding, we train the encoder φ and decoder ψ for 150 epochs using the L2 loss. For denoising, we draw noise from a standard normal distribution N(0, I) and combine it with the input image using a weight of 0.5 to obtain a noisy image; a higher standard deviation yields more heavily corrupted and diverse noisy images. We initialize the weights of the discriminator (generator) with those of the encoder (decoder). Subsequently, we train the discriminator and generator via adversarial learning using the sparse categorical cross-entropy loss function.
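The noise mixing and the (n + 1)-way sparse categorical cross-entropy described above can be illustrated as follows. The 0.5 mixing weight follows the text; the helper function is a hypothetical stand-in for a framework's built-in loss, written out here to show the extra "fake" class index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy input for the denoising autoencoder: mix the image with
# standard normal noise N(0, I) using a weight of 0.5.
x = rng.random((28, 28))
noise = rng.standard_normal((28, 28))
x_noisy = 0.5 * x + 0.5 * noise

def sparse_categorical_cross_entropy(logits, label):
    """Cross-entropy over the discriminator's (n + 1) outputs
    (n real classes plus one fake class), given an integer label."""
    z = logits - logits.max()                  # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

n = 10                                         # number of real classes
logits = rng.standard_normal(n + 1)            # discriminator outputs
loss_real = sparse_categorical_cross_entropy(logits, label=3)   # a real class
loss_fake = sparse_categorical_cross_entropy(logits, label=n)   # index n = "fake"
```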
Datasets: We conduct experiments using the MNIST [34] and CelebA [35] datasets. MNIST contains around 50 K handwritten digit images of size 28 × 28 with 10 different classes. For CelebA, we randomly extract 10 K images of males and of females and resize them to a 128 × 128 resolution. We conduct an ablation study (i.e., an in-depth, component-wise evaluation of the proposed method) by changing the degree of class imbalance. For this experiment, we use the MNIST and CelebA datasets. For MNIST, we consider class 0 as a minority class and remove 60%, 80%, 90%, 95%, and 97.5% of its training images. For CelebA, we treat the class female as a minority class and remove 60%, 70%, 80%, and 90% of its training images.
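The class-removal protocol used to construct these imbalanced training sets can be sketched as follows; `make_imbalanced` is an illustrative helper, not the authors' code.

```python
import numpy as np

def make_imbalanced(X, labels, minority_class, removal_ratio, rng=None):
    """Drop removal_ratio of the minority class's training samples,
    keeping every other class intact."""
    rng = np.random.default_rng(rng)
    idx = np.arange(len(labels))
    minority = idx[labels == minority_class]
    n_drop = int(len(minority) * removal_ratio)
    drop = rng.choice(minority, size=n_drop, replace=False)
    keep = np.setdiff1d(idx, drop)
    return X[keep], labels[keep]
```

For example, applying this to MNIST with `minority_class=0` and `removal_ratio=0.9` reproduces the 90% removal setting described above.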
Evaluation metrics: To evaluate the image generation performance, two metrics are used, namely, inception score (IS) [36] and Fréchet inception distance (FID) [37]. To evaluate the classification performance, we use the average classification accuracy and validation score [38].

Ablation Study
To provide in-depth analysis and insights on the proposed GANDA, we conduct several ablation studies.
Data augmentation from majority to minority classes: We experimentally show that majority classes can help augment minority classes in the proposed GANDA framework. For this experiment, we examine latent code interpolation, which explains how the latent codes work in the GANDA framework. As shown in the generated images in Figure 5, the conditional one-hot code c_y for y ∈ {male, female} represents the main feature that determines a specific class, whereas the conditional noise z_y expresses detailed features (e.g., hair length, clothing, and skin color). Therefore, the experimental results demonstrate that the information obtained by learning majority classes through the mapping of the latent code z can also be used to express shared features of the minority classes. Figure 6 shows images generated by autoencoding and by the proposed method for the CIFAR-10 dataset. As shown in Figure 6, our method qualitatively outperforms conventional autoencoding approaches in terms of sample diversity. Degree of class imbalance: We verify the effectiveness of the proposed GANDA when there is a very small amount of training data in a minority class. Table 2 shows the IS and FID scores as we increase the degree of class imbalance by removing more training data from the minority class (60%, 80%, and 90%).
As shown in Table 2, even when we delete 90% of the training data from the minority class, the proposed GANDA outperforms the baseline BAGAN in terms of IS and FID. Thus, our proposed GANDA can augment minority class data accurately, which can be used for object classification.
Figure 6. Data augmentation using the CIFAR-10 dataset. For each block with three rows, the first, second, and third rows contain real images, generated images using autoencoding, and generated images using the proposed method, respectively.
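The latent code interpolation used in this ablation amounts to simple linear blending between two latent codes; decoding each intermediate code with the trained generator (not shown here) would produce transition images like those in Figure 5. The helper below is an illustrative sketch.

```python
import numpy as np

def interpolate_latents(z_a, z_b, steps=5):
    """Linearly interpolate between two latent codes z_a and z_b.
    Each row is one intermediate code; feeding these through the
    generator visualizes how the conditional noise z controls
    detailed features such as hair length or clothing."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1 - t) * z_a + t * z_b for t in ts])
```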

Data Augmentation Comparison
FID and IS: Table 2 quantitatively compares the proposed GANDA with other state-of-the-art methods. Although the data imbalance worsened, the proposed GANDA generates better quality samples than the baseline BAGAN.
High resolution: Figure 7 qualitatively compares the proposed GANDA with other state-of-the-art methods [2] using the CelebA [35] dataset. The proposed GANDA produces more realistic high-resolution images than BAGAN. The noise z obtained from class conditioning in the latent space can contain the shared features (e.g., hair length, clothing, and skin color). Thus, latent noise from female samples can be used to generate males with long hair, G(z_female | c_male), while latent noise from male samples can be used to produce short-haired females, G(z_male | c_female). In addition, skin color, dress, and background are irrelevant to gender; thus, these details do not change significantly even when the class changes.

Data Classification Comparison
Validation score: We evaluate the classification performance using the validation score (V-Score) [38], which measures clustering accuracy. To calculate the V-Score, we compute two terms, homogeneity and completeness. On one hand, homogeneity v_h determines whether each cluster contains only a single label. On the other hand, completeness v_c measures whether all data points of the same class label belong to the same cluster. They can be numerically evaluated as follows:

v_h = 1 − H(C|K) / H(C), v_c = 1 − H(K|C) / H(K), (7)

where C = {c_1, · · · , c_n} denotes a set of classes and K = {k_1, · · · , k_m} denotes a set of clusters. In (7), we normalize the conditional entropies H(C|K) and H(K|C) by H(C) and H(K), respectively, to remove class size dependencies. Then, the V-Score V_β is the weighted harmonic mean of v_h and v_c:

V_β = (1 + β) · v_h · v_c / (β · v_h + v_c), (8)

where the parameter β can be adjusted to favor either homogeneity or completeness. In our experiment, we set β to 1, giving the same weight to both terms. The V-Score results are obtained with the proposed conditional denoising autoencoder initialization only, without adversarial training. We apply the V-Score to the latent space obtained by the proposed conditional autoencoder and compare the latent space of our GANDA with that of BAGAN. As shown in Table 3, our method outperforms BAGAN. The improvement in V-Score can be attributed to the explicit class conditioning in the latent space. For this experiment, we use the k-means algorithm to cluster the latent space and MNIST as the testing dataset, which consists of 10 classes. Please note that the V-Score is independent of the number of class labels, the number of clusters, the size of the dataset, and the chosen clustering algorithm; thus, it can fairly evaluate the quantitative classification performance. Table 3. Comparison of the proposed GANDA with BAGAN in terms of V-Score.

Method | V-Score (k-means)
GANDA (conditional denoising autoencoder initialization) | 0.779
BAGAN (denoising autoencoder initialization) | 0.739

Classification score: Tables 4 and 5 show the effects of using the generated samples as augmented data for classification tasks. We conduct this experiment by changing the removal ratio for the minority class. For the MNIST [34] dataset, we compare existing methods (i.e., Vanilla GAN [1], ACGAN [7], and BAGAN [2]) with the proposed GANDA. Our GANDA outperforms state-of-the-art methods at various removal ratios. The CelebA [35] dataset consists of very high-resolution images, on which the proposed GANDA also outperforms state-of-the-art methods. The classification accuracies in Tables 4 and 5 are not empirically affected by slight changes to the specific network architecture in Table 1.
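The homogeneity, completeness, and V-Score formulas above can be computed directly from a class/cluster contingency table; the sketch below is a straightforward NumPy implementation of those standard definitions.

```python
import numpy as np

def entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def v_score(classes, clusters, beta=1.0):
    """V-Score from class labels and cluster assignments:
    v_h = 1 - H(C|K)/H(C), v_c = 1 - H(K|C)/H(K),
    V_beta = (1 + beta) * v_h * v_c / (beta * v_h + v_c)."""
    classes = np.asarray(classes)
    clusters = np.asarray(clusters)
    cs, ks = np.unique(classes), np.unique(clusters)
    # contingency table: rows = classes, cols = clusters
    A = np.array([[np.sum((classes == c) & (clusters == k)) for k in ks]
                  for c in cs], dtype=float)
    n = A.sum()
    H_C = entropy(A.sum(axis=1))
    H_K = entropy(A.sum(axis=0))
    # conditional entropies H(C|K) and H(K|C)
    H_C_K = -np.sum(A / n * np.log(np.where(A > 0, A / A.sum(axis=0, keepdims=True), 1)))
    H_K_C = -np.sum(A / n * np.log(np.where(A > 0, A / A.sum(axis=1, keepdims=True), 1)))
    v_h = 1.0 if H_C == 0 else 1 - H_C_K / H_C
    v_c = 1.0 if H_K == 0 else 1 - H_K_C / H_K
    return (1 + beta) * v_h * v_c / (beta * v_h + v_c)
```

A perfect clustering (each cluster contains exactly one class and each class exactly one cluster) scores 1.0, while collapsing everything into a single cluster scores 0.0.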

Conclusions
In this paper, we proposed a novel generative adversarial network for class-conditional data augmentation (GANDA) for image classification. The proposed GANDA effectively restores data balance in the GAN framework. To this end, we presented a denoising autoencoder initialization technique with explicit class conditioning in the latent space, which provides a good initial point for GANs while effectively utilizing the information learned from majority class data to generate minority class data. We demonstrated the effectiveness of the proposed method on classification tasks using imbalanced datasets, where the proposed GANDA outperforms other state-of-the-art methods.