Survey on Implementations of Generative Adversarial Networks for Semi-Supervised Learning

Abstract: Given recent advances in deep learning, semi-supervised techniques have seen a rise in interest. Generative adversarial networks (GANs) represent one recent approach to semi-supervised learning (SSL). This paper presents a survey of methods using GANs for SSL. Previous work in applying GANs to SSL is classified into pseudo-labeling/classification, encoder-based, TripleGAN-based, two-GAN, manifold regularization, and stacked discriminator approaches. A quantitative and qualitative analysis of the various approaches is presented. The R3-CGAN architecture is identified as the GAN architecture with state-of-the-art results. Given the recent success of non-GAN-based approaches for SSL, future research opportunities involving the adaptation of elements of SSL into GAN-based implementations are also identified.


Introduction
With recent advances in deep learning and its applications, research opportunities in the area have expanded and diversified in different directions. One of these directions is semi-supervised learning (SSL). As opposed to supervised learning, SSL can learn from incomplete data in which only some of the samples are labeled [1]. In supervised learning, the training data consist of a set of data points and a corresponding label for each of the points. Conversely, in unsupervised learning, the training data consist of only data points with no output provided, therefore requiring a process that discovers unknown structures and groupings within the data [2]. Semi-supervised learning is used in situations where a small number of labeled training samples is available along with a large number of unlabeled data points [3]. While supervised learning has been the dominant technique used for most classification tasks, labeled data can often be difficult to obtain, and the process of labeling data can be very expensive and time consuming [4]. Therefore, SSL obviates the need for large labeled datasets by using some labeled but mostly unlabeled data.
Semi-supervised learning relies on the assumption that the data distribution over the input space embeds significant information about the distribution of the labels in the output space [1]. Most SSL algorithms will break down if this assumption is not met as the input space would not contain any information about the actual labels and, therefore, improving accuracy with the help of unlabeled data would not be possible.
As [4] reports, if the sample distribution of the data does not embed significant information, then the resulting learning might not show improvement when compared to supervised learning, and may lead to an increase in false predictions. The basic assumption can further be sub-divided into three assumptions: the smoothness assumption, the low-density assumption, and the manifold assumption.
The first assumption, called the smoothness assumption, states that given two data points that are close by in the input space, the corresponding labels in the output space should be the same.
The second assumption, called the low-density assumption, states that the decision boundary in a classifier should pass through low-density regions in the input space [1]. This is also related to the cluster assumption, under which points in the same cluster belong to the same class: if the decision boundary were to pass through an area of high density, it would cut a cluster into different classes and therefore violate the cluster assumption [4]. Additionally, the low-density assumption is consistent with the smoothness assumption. In a low-density area of the input space, the probability of a data point existing is low, so a decision boundary placed there separates few neighboring points and rarely assigns different labels to points that are close together. This assumption can be visualized in Figure 1, where the optimal decision boundary is shown to be in the low-density area between the two well-defined clusters.
The points in a high-dimensional space can be mapped to low-dimensional structures known as manifolds. For example, a 3-dimensional input space where all data points lie on a sphere can be mapped to a 2-dimensional manifold [1]. The third assumption, called the manifold assumption, states that the input space of the data consists of multiple low-dimensional manifolds on which all data points lie. Furthermore, it states that any data points lying on the same manifold belong to the same class [4]. Therefore, if the manifolds are determined, and the unlabeled data points are distributed on these manifolds, the class labels can be inferred based on which manifold an unlabeled data point lies on.
A multitude of SSL algorithms based on these three assumptions have been proposed yielding excellent results on datasets commonly used for benchmarking such as CIFAR [6] and SVHN [7], with recent algorithms such as FixMatch [8] showing error rates as low as 4.26% for CIFAR-10 and 2.28% for SVHN.
Generative adversarial networks (GANs) represent another class of techniques employed for SSL. The next section discusses common SSL techniques and the viability of generative architectures in semi-supervised learning scenarios.

Common Techniques Used in Semi-Supervised Learning
A number of algorithms and approaches to semi-supervised learning have been proposed recently. These algorithms can be grouped into different classes depending on criteria like the assumptions they are based on, the way they make use of unlabeled data, and how they relate to supervised algorithms [1]. However, most algorithms use a common set of techniques including consistency regularization, pseudo-labeling, and entropy minimization. These techniques are briefly described below.

Consistency Regularization
Consistency regularization [4] is an important technique used in SSL and relies on the manifold and smoothness assumptions. The technique assumes that realistic perturbations of data points in the input space (via data augmentation, for example) should not significantly change the predicted labels of the model [9]. In simpler terms, if an input is disturbed in a way that preserves its semantics, using operations such as image flipping or cropping, the output label should be close to the output label for the original image. The idea is operationalized by adding a consistency regularization term to the loss function [4] that penalizes any sensitivity the model shows to the various perturbations [10].
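As an illustration of how such a term can enter the loss, the following minimal sketch penalizes the mean squared difference between the class probabilities predicted for an input and for a perturbed copy of it. The linear `toy_model`, the weighting factor `lam`, and all numeric values are assumptions made for this example and are not taken from any of the cited papers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_model(x, weights):
    """Hypothetical linear classifier: one weight row per class."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(logits)

def consistency_loss(p_clean, p_perturbed):
    """Mean squared difference between two class-probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p_clean, p_perturbed)) / len(p_clean)

def total_loss(supervised_loss, p_clean, p_perturbed, lam=1.0):
    """Supervised loss plus a weighted consistency penalty (lam is an assumed weight)."""
    return supervised_loss + lam * consistency_loss(p_clean, p_perturbed)

weights = [[1.0, -0.5], [-1.0, 0.5]]
x = [0.8, 0.2]
x_aug = [0.75, 0.25]  # stands in for a small, semantics-preserving perturbation
p = toy_model(x, weights)
p_aug = toy_model(x_aug, weights)
print(consistency_loss(p, p_aug))
```

The penalty is zero when the two predictions agree and grows as the model becomes more sensitive to the perturbation, which is the behavior the regularization term is meant to discourage.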
The initial implementation of consistency regularization for deep SSL is most commonly attributed to Sajjadi et al. [11], who applied random augmentations to the same data sample and forced the predictions to be similar through an unsupervised loss function that minimized the mean-squared difference between different passes of a single data point through the network. Additionally, another loss function called the mutual-exclusivity loss was used to ensure that the model's prediction vector had only one non-zero element, thereby forcing each prediction to be valid and non-ambiguous. Subsequently, the idea of temporal ensembling was introduced by Laine et al. [12], which used an exponential moving average of historical predictions at different epochs of training as one part of the output. However, the downside of this method was that predictions would change only after an entire epoch, which was troublesome in the case of large datasets. Therefore, the mean teacher model was proposed, which averaged model weights instead of previous predictions [13]. An alternate approach was proposed by Luo et al. [14], who added an additional regularization in the form of a contrastive loss on the predictions, thus forcing predictions to be different when the data points were from different classes. Another interesting approach was proposed by Miyato et al. [15], in which virtual adversarial training (VAT) was used to add perturbations to the data in order to achieve consistency regularization on the model predictions. Adversarial dropout was introduced by Park et al. [16], in which a dropout mask was learnt for data perturbation in a direction adversarial to the model's virtual label assignment. More recent work includes Verma et al. [17], who proposed interpolation consistency training, which encourages predictions at interpolated data sample pairs to be consistent with the interpolated predictions and thereby helps move the decision boundary to low-density regions of the data space. A recent approach involving the use of consistency regularization was proposed in ReMixMatch [18], which strongly augmented an input multiple times and trained the model so that the predictions for all strongly augmented images were consistent with the prediction for a weakly augmented version of the same image.
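The difference between temporal ensembling [12] and the mean teacher model [13] can be sketched with a single exponential moving average (EMA) helper applied to two different quantities. This is a simplified illustration: the decay values, the toy predictions, and the stand-in "training step" are assumptions, and details of the published methods such as bias correction are omitted.

```python
def ema_update(old, new, alpha):
    """Element-wise exponential moving average: alpha * old + (1 - alpha) * new."""
    return [alpha * o + (1.0 - alpha) * n for o, n in zip(old, new)]

# Temporal-ensembling style: average the *predictions* for one sample, once per epoch.
ensemble_pred = [0.0, 0.0, 0.0]
epoch_preds = [[0.2, 0.5, 0.3], [0.1, 0.7, 0.2], [0.1, 0.8, 0.1]]  # toy values
for pred in epoch_preds:
    ensemble_pred = ema_update(ensemble_pred, pred, alpha=0.6)

# Mean-teacher style: average the *weights*, after every training step.
student_weights = [0.0, 0.0]
teacher_weights = list(student_weights)
for step in range(5):
    student_weights = [w + 0.1 for w in student_weights]  # stand-in for a gradient step
    teacher_weights = ema_update(teacher_weights, student_weights, alpha=0.9)

print(ensemble_pred, teacher_weights)
```

Because the weight average is refreshed at every step rather than once per epoch, the teacher's targets can change within an epoch, which is the limitation of temporal ensembling noted above.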
Given the importance of the aforementioned techniques in the area of semi-supervised learning, numerous GAN-based SSL approaches have also leveraged them. Consistency regularization was used by a number of GAN-based solutions such as Wei et al. [19], who added a consistency term to the loss function inspired by temporal ensembling [12]. Similarly, Chen et al. [20] reported that GAN-based SSL techniques lagged behind other SSL techniques due to a lack of consistency in class probability predictions for the same image under local perturbations. The authors attempted to solve the issue by adding an auxiliary loss term to the discriminator, which accounted for consistency regularization by using an approach based on the Mean Teacher [13]. Zhang et al. [10] proposed CR-GAN, which added consistency regularization to the discriminator during training by randomly augmenting training images as they were passed to the discriminator and penalizing the sensitivity of the discriminator to the augmentations. Zhao et al. [21] argued that this approach was flawed, as the consistency was applied only to real images and not to generated images, which could result in the generator learning the augmentation features and introducing them into the generated images. They proposed an improved consistency regularization technique that added a consistency term to the discriminator for both real and generated images. Furthermore, they proposed an additional level of consistency by encouraging the generator to be sensitive to augmented latent vectors while encouraging the discriminator to be insensitive. Taken together, this recent work on adding consistency regularization to GANs should improve the suitability of GAN-based techniques for semi-supervised learning.

Pseudo-Labeling
Pseudo-labeling [22] is a simple technique involving training the model on the labeled data and using the model to make predictions for the unlabeled data. The model predictions are then used as labels for the unlabeled data for further supervised training. Pseudo-labels are produced by setting a predefined threshold for assigning a class to an unlabeled sample, which can then be used as targets for a standard supervised loss function [9].
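The thresholding step can be sketched as follows. This is a minimal example under stated assumptions: `toy_predict` is a hypothetical two-class model, and the threshold of 0.95 is an illustrative value rather than one prescribed by the cited work.

```python
import math

def pseudo_label(probabilities, threshold=0.95):
    """Return the most likely class index if it is confident enough, else None."""
    best = max(range(len(probabilities)), key=lambda k: probabilities[k])
    return best if probabilities[best] >= threshold else None

def select_pseudo_labeled(unlabeled, predict, threshold=0.95):
    """Keep only the unlabeled samples whose prediction passes the threshold."""
    pairs = []
    for x in unlabeled:
        label = pseudo_label(predict(x), threshold)
        if label is not None:
            pairs.append((x, label))
    return pairs

def toy_predict(x):
    """Hypothetical two-class model whose confidence grows with |x|."""
    p1 = 1.0 / (1.0 + math.exp(-4.0 * x))
    return [1.0 - p1, p1]

unlabeled = [-2.0, -0.1, 0.05, 1.5]
extra_training_pairs = select_pseudo_labeled(unlabeled, toy_predict)
print(extra_training_pairs)
```

The two samples near zero are discarded because neither class reaches the threshold, while the remaining pairs could be added to the labeled set and trained on with a standard supervised loss.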
While this is the simplest technique theoretically, a number of attempts have been made to adapt this approach as part of a more evolved algorithm for SSL. For example, Shi et al. [23] used class predictions as hard labels for the unlabeled data in addition to introducing an uncertainty weight for each sample loss. A more recent approach by Iscen et al. [24] employed a graph-based transductive label propagation method on the basis of the manifold assumption to make predictions on the entire data, and then used these predictions as pseudo-labels. This technique was also used in the FixMatch algorithm [8], which generated pseudo-labels by passing weakly augmented unlabeled data through the model and using the predictions as labels when training on strongly augmented versions of the same samples. A slightly different approach was proposed by Arazo et al. [25], who used soft pseudo-labels based on the network's latest predictions.
Pseudo-labeling has been used in GANs performing SSL. For example, one such implementation was TripleGAN [26], where pseudo-labels were generated for unlabeled data and used as a real sample for the discriminator. This was carried out to prevent the discriminator from memorizing the empirical distribution of the labeled data. Similarly, Dong et al. [27] implemented pseudo-labeling for both unlabeled and generated images, which was then used along with cross-entropy during the training process. Finally, Liu et al. [28] also used pseudo-labeling as part of the R3-CGAN model, where it was used to assign labels to the unlabeled samples.

Entropy Minimization
Entropy minimization is the process by which the network is encouraged to make high-confidence predictions on the unlabeled data regardless of the predicted class [3]. This technique discourages the decision boundary from passing near data points, as a boundary passing near data points would produce low-confidence predictions [9]. This idea is operationalized by adding a loss term that minimizes the entropy of the prediction function. While entropy minimization ideally discourages the decision boundary from passing close to data points, Oliver et al. [9] reported an issue seen in high-capacity models such as neural networks, where the decision boundary overfits to locally avoid a small number of data points. Therefore, Ouali et al. [3] suggested that, on its own, entropy minimization was not as effective in producing viable results. However, this technique could be used in combination with other semi-supervised learning techniques as part of an algorithm to produce state-of-the-art results.
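A minimal sketch of such a loss term is given below, assuming the model outputs a probability vector per unlabeled sample. The small constant `eps` and the example probabilities are assumptions added for numerical safety and illustration.

```python
import math

def entropy(probabilities, eps=1e-12):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p + eps) for p in probabilities)

def entropy_loss(batch_probabilities):
    """Average prediction entropy over a batch of unlabeled samples."""
    return sum(entropy(p) for p in batch_probabilities) / len(batch_probabilities)

confident_batch = [[0.98, 0.01, 0.01], [0.02, 0.97, 0.01]]
uncertain_batch = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]
print(entropy_loss(confident_batch), entropy_loss(uncertain_batch))
```

Adding this quantity to the training loss rewards confident predictions on unlabeled data; it is smallest when each prediction places nearly all of its mass on a single class, whichever class that is.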
The implementation of entropy minimization with GANs performing SSL has also been seen in the literature, albeit less commonly. One notable implementation was Dai et al. [29], where the authors reported adding a conditional entropy term to the discriminator's objective function in order to strengthen the discriminator's true/fake belief, following the approach of virtual adversarial training [15].

Literature Review of GANs for SSL
Taxonomy
Figure 2 shows a taxonomy of the surveyed papers. As Figure 2 shows, early approaches to SSL GANs generally involved extensions to existing GAN models by use of pseudo-labeling, or by adding a classifier component to the original GAN architecture. This approach was seen in numerous models such as CatGAN [30], SGAN [31], Improved GAN [32], GoodBadGAN [29], CT-GAN [19], and MatchGAN [33]. Many others used a conditional approach where the image as well as the label was fed into the GAN. This was seen in the case of EnhancedTGAN [34], MarginGAN [27], Triangle-GAN [35], Structured GAN [36], R3-CGAN [28], and EC-GAN [37]. A third approach consisted of encoder-based models, where an encoder was added to the GAN architecture to map images into a latent space, which subsequently helped in the training process. This approach was seen in the BiGAN [38], ALI [39], and Augmented BiGAN [40] models. More recent approaches have used manifold regularization techniques in order to make the model more resistant to perturbations in the input. Laplacian-based GAN [41], Monte Carlo-based GAN [42], SelfAttentionGAN [43], and SSVM-GAN [44] all fall into this category. Other unique approaches involved using two GANs, as seen in MCGAN [45], VTGAN [46], and IAGAN [47], and finally leveraging conditional GANs in a stacked discriminator approach, seen in SS-GAN [48], to discriminate between predicted attributes.

Notation
The notation and symbols used within this paper are defined in Table 1.

Extensions Using Pseudo-Labeling and Classifiers
GANs were introduced by Goodfellow et al. [49] as an architecture involving a generator and a discriminator competing against each other, with the generator generating fake images and the discriminator identifying them as fake. As Engelen et al. [1] note, GANs are good candidates for SSL because the generator is trained on unlabeled images and the discriminator's primary function is to assess the quality of the generator. While the original implementations focused on using the GAN framework for image generation, it was not long before CatGAN [30] was proposed in 2015, which added an unsupervised classifier to the model in order to enable categorical classification. This paper also added a cross-entropy loss term for the labeled samples that penalized misclassifications of real data. This approach was also used in SGAN [31], which leveraged a single discriminator/classifier network with N + 1 classifying neurons, where N is the number of classes and one neuron is added to identify fake samples.
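To make the N + 1 output layout concrete, the sketch below shows one way such a combined discriminator/classifier output could be read: a softmax over all N + 1 logits gives the probability that the input is fake, and renormalizing the first N entries gives the class distribution for an input assumed to be real. Placing the "fake" neuron last and the example logits are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def split_outputs(logits):
    """Interpret N + 1 logits; the last one is the 'fake' neuron (assumed ordering)."""
    probs = softmax(logits)
    p_fake = probs[-1]
    real_mass = sum(probs[:-1])  # equals 1 - p_fake up to rounding
    class_probs = [p / real_mass for p in probs[:-1]]
    return p_fake, class_probs

logits = [2.0, 0.5, 0.1, -1.0]  # N = 3 classes plus one fake neuron
p_fake, class_probs = split_outputs(logits)
predicted_class = class_probs.index(max(class_probs))
print(p_fake, predicted_class)
```

With this reading, one network serves both roles: the last output drives the adversarial real/fake decision, while the remaining outputs are trained with a supervised loss on the labeled samples.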
Improved GAN [32] introduced feature matching, which involved training the generator to produce images that match the expected value of the features at an intermediate layer of the discriminator rather than targeting the output of its final layer. This approach prevented the generator from overtraining on the specific discriminator. Mini-batch discrimination was also proposed, in which the discriminator predicted whether a mini-batch of images was real or fake instead of evaluating single images individually. This helped the generator produce more varied samples, since a generator judged one image at a time tended to race to the one point that the discriminator believed was realistic. Mini-batch discrimination generated better images; however, feature matching worked much better for the SSL component. In addition to the proposed techniques, the authors also argued that training GANs using gradient descent techniques was counterintuitive, as these techniques were designed to minimize a cost function rather than to find a Nash equilibrium. This argument is an important precursor to subsequent work that tried to reach a balance between generators and discriminators. For example, GoodBadGAN [29] was based on the premise that obtaining good classifier performance and an effective generator at the same time was difficult, and therefore the focus should be on achieving one outcome only. The authors based their argument on Salimans et al. [32] and noted that while mini-batch discrimination produced better images, it was feature matching that showed an improved performance for SSL. They also questioned training the discriminator and generator jointly, and demonstrated that a good discriminator could be produced by using a bad generator. This was carried out by increasing the generator entropy through an auxiliary cost and by forcing the generator to produce samples closer to the decision boundary, which was achieved by adding a term to the generator's objective function that penalized high-density samples. This pushed the generated samples towards low-density areas. The final generator objective function was defined as shown in Equation (1).
Another set of initial studies used Wasserstein GANs [50] as a baseline model for SSL. For example, CT-GAN [19] used a Wasserstein distance function, which seems to work better for learning distributions supported by low-dimensional manifolds than contemporary functions such as the Jensen-Shannon divergence used by many GANs. With the Wasserstein distance, the discriminator becomes a real-valued function constrained to be 1-Lipschitz instead of a classifier. In CT-GAN, the Wasserstein distance was used in conjunction with the consistency regularization of Laine et al. [12]. The authors used a discriminator similar to that of Salimans et al. [32] with an output size of K + 1 neurons, where K was the number of classes. Additionally, a consistency term was added to the loss function that forced consistency between multiple augmentations of the same data point. The objective function for the discriminator can be seen in Equation (2).
MatchGAN [33] also used Wasserstein distance and was a semi-supervised conditional GAN that made use of the label space in the target domain in conjunction with unlabeled samples to generate additional labeled samples. They reported using a system in which labels from the pool of labeled samples were assigned to unlabeled samples and passed through the generator that created synthetic versions of the images on the basis of the target labels. A match loss term was added, which compared the generated images to the original labeled image from which the target label was sampled.
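Returning to the feature matching idea described earlier in this section for Improved GAN [32], the generator's target can be sketched as the squared distance between the mean intermediate-layer features of a real batch and of a generated batch. This is a simplified illustration: the feature vectors below are made-up numbers standing in for the activations of a discriminator layer.

```python
def mean_features(batch):
    """Per-dimension mean of a batch of equally sized feature vectors."""
    n = len(batch)
    return [sum(row[j] for row in batch) / n for j in range(len(batch[0]))]

def feature_matching_loss(real_features, fake_features):
    """Squared Euclidean distance between the two batch-mean feature vectors."""
    mu_real = mean_features(real_features)
    mu_fake = mean_features(fake_features)
    return sum((a - b) ** 2 for a, b in zip(mu_real, mu_fake))

real_batch = [[1.0, 2.0], [3.0, 4.0]]   # assumed intermediate features of real images
close_fake = [[2.0, 3.0], [2.0, 3.0]]   # generated batch with the same mean features
far_fake = [[0.0, 0.0], [0.0, 0.0]]     # generated batch with very different statistics
print(feature_matching_loss(real_batch, close_fake))
print(feature_matching_loss(real_batch, far_fake))
```

Because only batch statistics are compared, the generator is not rewarded for fooling the final real/fake output of one particular discriminator, which is the overtraining problem feature matching was meant to avoid.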

Encoder-Based Approaches
The encoder-based approach was first presented as part of BiGAN [38], where the authors argued that while GANs were effective at taking a latent space and generating data, there was no technique for GANs to project the data back into the latent space. Therefore, they proposed an approach where an encoder was included as part of the GAN architecture to generate a latent space mapping from the input data. The architecture of the BiGAN model is shown in Figure 3.
Appl. Sci. 2022, 12, x FOR PEER REVIEW 7 of 21
The adopted approach involved the discriminator receiving a pair of latent space mapping and data as input wherein it discriminated jointly the data and latent space with the latent component either being the generator input z or the encoder output E(x). The training objective for this architecture is shown in Equation (3).
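For orientation, the minimax objective usually given for this kind of joint discriminator can be written as below. This is a sketch in assumed notation: $p_X$ for the data distribution and $p_Z$ for the latent prior are symbols introduced here, and the expression may differ in form from the Equation (3) referred to above.

```latex
\min_{G,E}\;\max_{D}\;
\mathbb{E}_{x \sim p_X}\big[\log D\big(x, E(x)\big)\big]
+ \mathbb{E}_{z \sim p_Z}\big[\log\big(1 - D\big(G(z), z\big)\big)\big]
```

The first term covers pairs built from real data and their encodings, and the second covers pairs built from generated data and the latent inputs that produced them, matching the two kinds of latent component described above.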
Adversarially learned inference (ALI) [39] also used an encoder, which the authors referred to as an "inference machine", that encoded training samples into the latent space, along with a discriminator trained to discriminate on the basis of joint samples consisting of the data and the corresponding latent variable. In this architecture, the generator acted as a decoder in mapping a latent distribution to the data distribution. The authors demonstrated the algorithm's utility for semi-supervised classification by leveraging the inference machine instead of the discriminator. In their experiments, they were able to train ALI in an unsupervised manner on labeled as well as unlabeled data, and then train a support vector machine (SVM) as a classifier on the latent encodings of a subset of the labeled data. Thus, using such a technique, the authors were able to demonstrate the utility of an adversarial approach for SSL.
The success of these techniques has resulted in a wider implementation of the bidirectional architecture. Kumar et al. [40] proposed an "Augmented BiGAN" model inspired by the BiGAN and ALI models. They argued that since trained GANs produced realistic images, it could be assumed that the generator obtained the tangent space of an image's manifold. Therefore, they leveraged these tangents to inject desirable invariances into the classifier to improve its performance. This is in contrast to techniques that apply assumed invariances such as rotating and flipping. Furthermore, they proposed an improvement to the BiGAN encoder, which they claimed caused "class switching", where the data generated from an encoded latent representation is of a different class from the original data. Therefore, they proposed a third input pair to feed to the discriminator, consisting of the latent representation obtained by encoding a data point and the result of passing that representation through the generator. This pair would also be labeled as fake, and an additional loss term would be added to complement this change. Using such an approach, the authors reported a quantitative as well as qualitative improvement in performance as compared to BiGAN.

The TripleGAN Approach
Another class of techniques for the implementation of semi-supervised GANs is based on the TripleGAN architecture proposed by Li et al. [26]. Addressing the issue that the generator and discriminator cannot be optimal at the same time, this paper took a different approach from that of Dai et al. [29]. TripleGAN introduced an additional classifier which, along with the generator, characterized the conditional distributions between images and labels, while the discriminator was limited to identifying fake image-label pairs. Figure 4 shows the architecture used for TripleGAN, where the discriminator outputs either an accept (A) or a reject (R), which serve as the adversarial losses, while the classifier produces the cross-entropy losses (CE) for the supervised part of the learning.
The discriminator in TripleGAN takes image-label pairs, of which there are three kinds: a true data-label pair from the labeled data (x, y), a generated data-label pair (G(z), y'), and an unlabeled data sample paired with a label obtained by passing it through the classifier (x_u, P(c)) using pseudo-labeling.
The resulting objective function is shown in Equation (4).
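The three kinds of discriminator input can be sketched as a small batch-building routine. Everything here is a toy stand-in: the scalar "images", the `classifier` and `generator` functions, and the accept/reject targets (1 for true pairs, 0 for the other two kinds) are assumptions about the basic adversarial game and not a statement of the full TripleGAN training procedure.

```python
def build_discriminator_batch(labeled, unlabeled, classifier, generator,
                              sampled_labels, noise):
    """Return ((image, label), target) items for the three kinds of pairs."""
    batch = []
    for x, y in labeled:
        batch.append(((x, y), 1))                    # true pair (x, y): accept
    for y_prime, z in zip(sampled_labels, noise):
        x_fake = generator(z, y_prime)               # generator assumed label-conditioned
        batch.append(((x_fake, y_prime), 0))         # generated pair: reject
    for x_u in unlabeled:
        batch.append(((x_u, classifier(x_u)), 0))    # pseudo-labeled pair: reject
    return batch

def classifier(x):
    """Hypothetical classifier on scalar inputs: class 1 if x >= 0, else class 0."""
    return 1 if x >= 0 else 0

def generator(z, y):
    """Hypothetical generator: shifts the noise toward the requested class."""
    return z + (1.0 if y == 1 else -1.0)

labeled = [(-1.2, 0), (0.9, 1)]
unlabeled = [-0.4, 0.3, 2.0]
batch = build_discriminator_batch(labeled, unlabeled, classifier, generator,
                                  sampled_labels=[0, 1], noise=[0.1, -0.2])
print(len(batch))
```

Keeping the label next to each image is what lets the discriminator judge whether an image and its label belong together, rather than judging image realism alone.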
EnhancedTGAN [34] was an extension of TripleGAN that redesigned the training targets of the generator and classifier. They designed the generator to produce images on the basis of a class distribution that was regulated by a feature-semantics matching term in the loss function. Furthermore, they added another classifier, which worked in collaboration to provide additional categorical information for the generator to train on.
Another notable extension of TripleGAN was MarginGAN [27] where the classifier increased the margin for real samples, and decreased the margin for fake samples. The generator tried to increase the margin for the fake samples only. This approach further helped prevent the drop in performance that typically happens due to the misclassification of a pseudo-label. They based their theory on [29] and aimed to implement the theory within the TripleGAN framework.
Δ-GAN [35] combined ideas from BiGAN and TripleGAN. The model consisted of two generators and two discriminators, with the generators providing a bidirectional mapping between domains and the discriminators distinguishing the real data pairs from the two kinds of fake data pairs. Structured GANs [36] were similar to Δ-GAN but assumed that generated data were conditioned on two independent latent variables, one of which encoded the designated semantics (y), while the other accounted for other factors of variation (z). Under this independence assumption, the authors proposed a set of two inference networks, one to map an input data point (x) to the designated semantics (y), and the other to map an input point to z. These networks were trained using two different adversarial games, one for each mapping of the input data.
R3-CGAN [28] was another GAN architecture based on Δ-GAN. This architecture was motivated by the observation that the classification network often gives incorrect yet confident predictions on unlabeled data while generating pseudo-labels. Furthermore, due to the imbalance between real and fake samples, the discriminator learns the real samples and rejects any unseen data even if they are real. The authors proposed using a regularization approach based on Random Regional Replacement in the learning process of the classification and discriminative networks. They implemented two discriminative networks in addition to the classifier and the generator. Fake sample pairs of two types were used, one consisting of synthesized data paired with the target label, and the other consisting of an unlabeled sample paired with its pseudo-label. One of the discriminators was trained to discriminate between real and fake images, while the other was trained to discriminate between the two fake sample types.
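The idea of replacing a random region of one sample with the corresponding region of another can be sketched as follows. This is a minimal CutMix-style illustration on nested lists; the Beta-distributed mixing ratio, the box sampling, and the function name are assumptions based on the commonly described CutMix recipe, and the way R3-CGAN applies the operation to its classifier and discriminators is not reproduced here.

```python
import random

def random_regional_replacement(image_a, image_b, rng, beta=1.0):
    """Paste a random rectangle of image_b into a copy of image_a (CutMix-style).

    Images are lists of rows with identical shape. Returns the mixed image and
    the fraction of pixels that still come from image_a.
    """
    height, width = len(image_a), len(image_a[0])
    lam = rng.betavariate(beta, beta)          # target share kept from image_a
    cut_ratio = (1.0 - lam) ** 0.5
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    center_y, center_x = rng.randrange(height), rng.randrange(width)
    top, bottom = max(center_y - cut_h // 2, 0), min(center_y + cut_h // 2, height)
    left, right = max(center_x - cut_w // 2, 0), min(center_x + cut_w // 2, width)

    mixed = [row[:] for row in image_a]        # leave the input image untouched
    for y in range(top, bottom):
        mixed[y][left:right] = image_b[y][left:right]
    kept = 1.0 - (bottom - top) * (right - left) / (height * width)
    return mixed, kept

rng = random.Random(0)
zeros = [[0] * 8 for _ in range(8)]
ones = [[1] * 8 for _ in range(8)]
mixed, kept = random_regional_replacement(zeros, ones, rng)
```

The returned fraction `kept` can be used to mix the two labels in the same proportion as the pixels, which is how mixed-sample augmentations usually keep the target consistent with the image.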
EC-GAN [37] is another recent GAN using ideas from Δ-GAN. In this architecture, a generator was trained to generate images, which were then immediately fed to the classifier to produce a pseudo-label. This combination of generated image and pseudo-label was then used to train the classifier, with the corresponding semi-supervised loss term being multiplied by a hyperparameter that controlled how much importance the classification of generated images was given. The authors emphasized that the classifier was a separate network from the discriminator and showed empirically that this was a better approach than a shared discriminator-classifier architecture. Furthermore, the use of CutMix [51] was noted as an augmentation strategy. The proposed architecture is shown in Figure 5.
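The weighted combination of the supervised and semi-supervised terms described above can be sketched as follows. The function names, the default `weight`, and in particular the confidence `threshold` used to filter pseudo-labels are assumptions for illustration and should not be read as the exact loss of [37].

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of a single prediction against an integer class label."""
    return -math.log(probs[label])

def classifier_loss(labeled_batch, generated_probs, weight=0.1, threshold=0.7):
    """Supervised loss plus a weighted pseudo-label loss on generated images.

    labeled_batch:   list of (class probabilities, true label) pairs
    generated_probs: class probabilities predicted for generated images
    weight:          assumed hyperparameter scaling the semi-supervised term
    threshold:       assumed confidence cut-off for keeping a pseudo-label
    """
    supervised = sum(cross_entropy(p, y) for p, y in labeled_batch) / len(labeled_batch)
    pseudo_terms = []
    for probs in generated_probs:
        pseudo = max(range(len(probs)), key=lambda k: probs[k])
        if probs[pseudo] >= threshold:          # keep confident predictions only
            pseudo_terms.append(cross_entropy(probs, pseudo))
    unsupervised = sum(pseudo_terms) / len(pseudo_terms) if pseudo_terms else 0.0
    return supervised + weight * unsupervised

labeled = [([0.8, 0.2], 0), ([0.3, 0.7], 1)]
generated = [[0.9, 0.1], [0.55, 0.45]]          # the second prediction is filtered out
loss = classifier_loss(labeled, generated)
supervised_only = classifier_loss(labeled, generated, weight=0.0)
```

Setting `weight` to zero recovers a purely supervised classifier, so the hyperparameter directly controls how much the generated images influence training.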

Manifold Regularization-Based Methods
Many recent GANs for SSL used manifold regularization. For example, Lecouat et al. [41] presented a methodology that used the ability of GANs to model the manifold of natural images to perform manifold regularization through a Monte Carlo approximation of the Laplacian norm. They claimed that this regularization would encourage classifier invariance to local perturbations of the image, as points close to each other on the manifold would be assigned similar labels. For their work, the authors used the feature matching semi-supervised GAN presented in [32] as the base GAN. The primary challenge in this approach is the estimation of the Laplacian norm, for which they presented an approach based on two assumptions: that GANs can model the distribution of images, and that they can model the manifold of images. Under the first assumption, the GAN was trained on a large number of unlabeled images, after which it was taken to approximate the marginal distribution over images. This distribution could then be used to estimate the Laplacian norm of a classifier using Monte Carlo integration with samples drawn from the latent space of the generator. The second assumption allowed the manifold in image space to be used to compute the gradient, in the form of a Jacobian matrix, with respect to the latent representations. Based on this, the classifier loss is shown in Equation (5). A similar approach was taken by Lecouat et al. [41,42], where Monte Carlo integration was used to estimate a variant of the Laplacian norm seen in Equation (6).
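A Monte Carlo estimate of this kind can be illustrated with a toy example. The sketch below uses a finite-difference stand-in for the Jacobian-based term: it samples latent points, perturbs each one by a small step in a random unit direction, and averages the squared change in the classifier output. The toy generator, the two toy classifiers, and the step size `epsilon` are all assumptions for illustration and do not correspond to the networks or to the exact form of Equations (5) and (6).

```python
import math
import random

def generator(z):
    """Toy stand-in for a trained generator: maps a 2-D latent to a 2-D 'image'."""
    return [math.tanh(z[0] + 0.5 * z[1]), math.tanh(z[1] - 0.5 * z[0])]

def smooth_classifier(x):
    """Toy classifier whose output changes slowly along the manifold."""
    return [0.1 * x[0] + 0.1 * x[1]]

def sharp_classifier(x):
    """Toy classifier whose output changes quickly along the manifold."""
    return [5.0 * x[0] - 5.0 * x[1]]

def manifold_penalty(classifier, n_samples=1000, epsilon=1e-2, seed=0):
    """Monte Carlo estimate of how much the classifier varies along the manifold."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        r = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        norm = math.hypot(r[0], r[1])
        # Move a small step along a random unit direction in latent space.
        z_shifted = [z[i] + epsilon * r[i] / norm for i in range(2)]
        out = classifier(generator(z))
        out_shifted = classifier(generator(z_shifted))
        total += sum((a - b) ** 2 for a, b in zip(out, out_shifted))
    return total / n_samples

smooth_penalty = manifold_penalty(smooth_classifier)
sharp_penalty = manifold_penalty(sharp_classifier)
```

Adding such a penalty to the classifier loss favours classifiers like `smooth_classifier`, whose predictions change little between nearby points on the generated manifold.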
More recently, the SelfAttentionGAN [43] made use of manifold regularization as part of a self-attention mechanism for a semi-supervised GAN. A variable attention unit was used as part of the attention-based GAN architecture, while manifold regularization based on [42] was added as an additional regularization term to the loss function to make full use of unlabeled samples using a Monte Carlo approximation.
An interesting technique was proposed in SVMGAN [44], which tried to address the sensitivity of GAN-based SSL models to local perturbations by introducing a discriminator built on a scalable support vector machine (SVM) classifier with manifold regularization. The SVM was chosen because it tends to perform well with small datasets, which fits the semi-supervised problem well. Furthermore, the use of manifold regularization was reported to force the discriminator to be resistant to local perturbations.

Two-GAN Approaches
MCGAN [45] attempted to solve the problem of GANs generalizing when two classes of images shared similar characteristics. To achieve this, a modification to the GAN training method was proposed. In this setting, a number of classes have labels, while one class does not. The approach suggested by the authors was to first separate the labeled classes from the unlabeled class, and then classify among the labeled classes. Two GANs were used with a training regime where the first discriminator was trained on images of the first class labeled as real and on the generator outputs labeled as fake. Furthermore, the authors passed images of the second class to the discriminator labeled as fake. This forced the generator not to generalize to the similar second class when learning features of the first class, as the discriminator flagged any generated images resembling the second class as fake. The authors then used the variation score proposed by AnoGAN [52] to classify a third class on the basis of the sum of the variation scores of the two GANs (one trained on the first class and the other trained on the second class). The architecture of the proposed GAN can be seen in Figure 6.
Vanishing Twin GAN (VTGAN) [46] was an improvement over MCGAN, which relied heavily on labeled samples being used to train the discriminator and would fail in cases of semi-supervised learning where one of the classes did not have adequate labeled samples. The idea behind VTGAN was to train two GANs in parallel: a normal GAN to be used for classification, and a weak GAN to be used to improve the normal GAN's classification performance. The goal was to train the weak twin in such a way that its generator remained stuck in the noisy image generation stage without falling into mode collapse. The resulting noisy generations from this weak GAN were used as input to the normal twin with fake labels. In order to weaken the weak twin, a number of strategies were used, such as making the network shallow, tuning the GAN's input noise dimension while decreasing the noise, and increasing the strides of the transpose convolution and the max pooling layers.
A different approach using two GANs relied on data augmentation: a data augmentation GAN was prepared first and was in turn used to train another GAN. Inception-Augmentation GAN (IAGAN) [47] augmented a given image in order to prepare it for use in training another GAN. The generator took in a batch of images and a Gaussian noise vector, and concatenated them after encoding the images to a smaller dimension using convolution and attention layers. A mix of inception and residual architectures was then used to enhance the generator's ability to capture details from the training space. The discriminator was simply a 4-layer CNN, which predicted whether an input image was a real image from the training data or an output of the generator. A generic objective function shown in Equation (7) was used.

GAN Using Stacked Discriminator
An interesting implementation leveraged Conditional GANs [53] in a semi-supervised setting in a model called Semi-Supervised GAN (SS-GAN) [48]. The approach gave the discriminator two tasks: detecting whether a given image was real or fake, and detecting whether a proposed attribute given to the image was real or fake. For the first task, both labeled and unlabeled samples were used in training; for the second task, however, only the labeled images were used. To perform these tasks, a stacked discriminator approach was used with one discriminator for each task. Figure 7 shows the architecture of the SS-GAN and the flow of the training data, which makes use of both labeled and unlabeled images for the unsupervised discriminator and only labeled images for the supervised discriminator.

Results
Tables 2-6 list the sources reviewed and the techniques each of them used as the baseline for comparison with its results. For ease of analysis, the works are grouped by the architecture they follow, as discussed in the framework section.
The discussed works were analyzed in terms of the results reported by the authors for their respective proposed models and are summarized chronologically in Tables 7-11, which detail the proposed model, the evaluation datasets, and the results.
Table 11. Assorted Approaches Results Summary.

Quantitative Analysis
A number of GANs were chosen as representative models from each technique. CatGAN was chosen from the initial implementations. Similarly, ALI was the model of choice for comparison among the encoder-based architectures, and TripleGAN was used as the baseline of choice for its class of models. Table 12 displays a summary of notable works across categories and their results in order to enable a deeper comparison. A natural progression can be seen where encoder-based architectures such as ALI outperformed CatGAN and were, in turn, outperformed by TripleGAN and its derivatives. The more recent manifold regularization-based approaches also outperformed TripleGAN. However, the most recent manifold regularization-based paper [44] reported a 4.54% error rate on SVHN using 1000 labeled samples and a 14.27% error rate on CIFAR-10 using 4000 labeled samples. This model was outperformed by the most recent TripleGAN-based approach [28], which reported a 2.79% error rate on SVHN and a 6.69% error rate on CIFAR-10 for the same number of labeled samples. Therefore, it is reasonable to claim that the R3-CGAN architecture holds the current state of the art, as none of the other papers surveyed had a similar evaluation process or a comparison to this model.
A number of aspects of R3-CGAN could have contributed to its success. While the underlying architecture was based on TripleGAN, a Random Regional Replacement regularization was applied by making use of the CutMix mixed-sample augmentation technique [51]. This technique has been used in non-generative semi-supervised learning techniques to achieve consistency regularization with good results. Its success in a generative architecture therefore suggests that other semi-supervised learning techniques could be adapted to GANs as well.
It is interesting to note that while R3-CGAN is seemingly the best performing GAN-based technique currently available, it pales in comparison to non-GAN state-of-the-art SSL techniques such as FixMatch [8], which reported error rates of 4.26% on CIFAR-10 with 4000 labeled samples and 2.28% on SVHN with 1000 labeled samples, in addition to showing a good performance of 11.39% error on CIFAR-10 and 3.96% error on SVHN with only 40 labeled samples each. The gap between GAN-based SSL and other state-of-the-art techniques is therefore apparent, and it would be interesting to apply some of the techniques used in other SSL algorithms to GANs in order to combine the enhanced performance seen in state-of-the-art SSL algorithms with the generative aspect that GANs are known for.
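The size of this gap can be read off directly from the error rates quoted above. The short sketch below only restates those reported figures and subtracts them; the dictionary layout and key names are arbitrary choices for illustration.

```python
# Error rates (%) as quoted in the text for the same numbers of labeled samples.
error_rates = {
    "R3-CGAN": {"CIFAR-10 (4000 labels)": 6.69, "SVHN (1000 labels)": 2.79},
    "FixMatch": {"CIFAR-10 (4000 labels)": 4.26, "SVHN (1000 labels)": 2.28},
}

# Absolute gap in percentage points between the GAN-based and non-GAN method.
gaps = {
    dataset: round(error_rates["R3-CGAN"][dataset] - error_rates["FixMatch"][dataset], 2)
    for dataset in error_rates["R3-CGAN"]
}
for dataset, gap in gaps.items():
    print(f"{dataset}: {gap:.2f} percentage points")
```

The difference amounts to 2.43 percentage points on CIFAR-10 and 0.51 percentage points on SVHN.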

Qualitative Analysis
The initial approaches, which involved pseudo-labeling and the addition of a classifier, had the advantage of being simple to implement without a heavy additional computational load. These techniques, however, were limited in performance, and other more complex techniques were seen to outperform this class of methods. Encoder-based techniques were introduced with the intention of leveraging the feature space in the training of the models. However, the success of these techniques depended on the latent representations being representative of the classification task at hand, an assumption that may not hold equally well in every target domain. Additionally, the addition of an encoder increased the computational requirements for training, which might be a limitation depending on the available hardware resources. Conditional approaches involved discrimination over pairs of data points and labels with the classifier acting as a third player, which has been seen as a good solution to help resolve the conflict between having a good generator and a good classifier. However, the reliance on the class label as an input to the discriminator can be a point of failure in cases where the class distribution in the dataset is imbalanced to the extent that the discriminator only learns the majority classes to be real. In such a case, the generator will be strongly biased towards the majority class. A number of recent approaches have used manifold regularization to ensure that the model remains resistant to perturbations of the input samples. However, such approaches rely on the assumption that any unseen data will lie on the same manifold as the perturbations used to perform the manifold regularization, which might fail in some application domains.
An interesting difference is also observed among the various techniques in terms of their training objectives. While one class of techniques focused on a strategy where the generator was weakened to boost the discriminator (e.g., GoodBadGAN [29]), a different class of techniques leveraged a good generator to boost the performance of the classifier (e.g., TripleGAN [26]). Li et al. [54] conducted a comparative analysis of these two techniques by training GoodBadGAN as the BadGAN approach and TripleGAN as the GoodGAN approach on the MNIST, CIFAR-10, and SVHN benchmark datasets with varying numbers of labeled samples. Their conclusion was that while GoodBadGAN outperformed TripleGAN when there was a medium number of labeled samples, TripleGAN performed better with less data, thus demonstrating a lack of sensitivity to the number of labeled samples. Furthermore, the authors also provided visualizations of the images generated by both techniques. Figure 8 is reproduced with permission from [54] and displays the generated images for both models. As can be seen in Figure 8, in the case of GoodBadGAN, the images produced by the generator were far from ideal, confusing the digits in the case of MNIST while failing entirely in the case of SVHN and CIFAR-10. The TripleGAN (GoodGAN) architecture, however, was able to produce clear, distinct images while also performing well with lower amounts of labeled data. The authors suggested that future work could involve both types of architectures being used in a complementary manner.

Future Directions
A number of future research directions can be explored. One direction concerns the model architecture and the training methodology itself. Given the success of R3-CGAN's use of CutMix, an interesting direction for research could be the implementation of further semi-supervised methods alongside GANs. This is not a new concept, as works including Chen et al. [20] have previously used SSL algorithms such as MeanTeacher to achieve consistency regularization; however, newer SSL techniques could also be investigated. For example, automated augmentation techniques like RandAugment [55] and AutoAugment [56], used by state-of-the-art SSL techniques like UDA [57] and FixMatch [8], can be explored. Another interesting direction could be to unify the currently dominant GAN-based SSL techniques by adding manifold regularization to the R3-CGAN implementation of TripleGAN. Since both techniques have the best results in recent works, combining them could be a step forward in the area of GAN-based SSL. On a similar note, future work can be carried out towards unifying the contrasting approaches of preparing a bad GAN for classification with approaches aiming to simultaneously improve both aspects of the GAN. This is a promising direction, as BadGAN approaches have been noted to perform better with larger amounts of data, while GoodGAN approaches have performed better with smaller amounts of data. A unified method would be able to take advantage of both to form a more robust model.
Finally, training with a lower number of labeled samples could be attempted in an effort to match the evaluation settings of state-of-the-art SSL techniques and to obtain a baseline for current GAN-based performance in situations where an extremely low number of labeled samples is available. In addition, efforts can be made to investigate the performance of existing techniques when applied to real-world problems across domains, many of which have their own peculiarities. An example of such characteristics is seen in situations where the data relevant to the domain exhibit a class imbalance, with the class of interest often being in the minority, such as in disease or fraud detection [58]. Investigations into how the existing solutions perform in these real-life domains will establish their viability and, in turn, can serve to further the field and improve the collective performance of semi-supervised learning techniques.

Conclusions
Given the increasing interest in the field of semi-supervised learning, and the rapid progress being made in generative learning, a survey was conducted to analyze recent research in using GANs for semi-supervised learning. The previous work was categorized based on the advancement being proposed, the model architecture, and the training procedures. Furthermore, the approach followed by each paper was discussed before a quantitative analysis was conducted based on the performance obtained by each of the works in their experimentation. Finally, a qualitative analysis of the various categories was also conducted to better understand the advantages and disadvantages of the diverse approaches, after which a number of possible directions for future work were identified in order to encourage advances in the field of using generative adversarial networks for semi-supervised learning.