# Variational Information Bottleneck for Semi-Supervised Classification


## Abstract


## Notations

## 1. Introduction

## 2. Related Work

**Regularization techniques in semi-supervised learning**: Semi-supervised learning aims to benefit from the large number of unlabeled samples available for training. The most common way to leverage unlabeled data is to add a dedicated regularization term or mechanism that improves generalization to unseen data. The recent work [8] identifies three ways to construct such a regularization: (i) entropy minimization, (ii) consistency regularization and (iii) generic regularization. Entropy minimization [9,10] encourages the model to output confident predictions on unlabeled data. More recent work [3] extends this concept to adversarially generated samples, or fakes, for which the entropy of the class label distribution was suggested to be maximized. Furthermore, the adversarial regularization of the label space was considered in [2], where a discriminator was trained to ensure that the labels produced by the classifier follow a prior distribution, defined to be a categorical one. Consistency regularization [11,12] encourages the model to produce the same output distribution when its inputs are perturbed. Finally, generic regularization encourages the model to generalize well and avoid overfitting the training data. It can be achieved by imposing regularizers and corresponding priors on the model parameters or feature vectors.
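To make (i) concrete, the entropy-minimization term can be sketched as follows (a minimal NumPy sketch; the function name and shapes are our illustrative choices, not taken from the cited works):

```python
import numpy as np

def entropy_minimization_loss(logits):
    """Mean Shannon entropy (in nats) of the predicted class distributions.

    Adding this term to the supervised loss pushes the classifier toward
    confident (low-entropy) predictions on unlabeled samples.
    """
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

# Unlabeled batch of 16 samples, 10 classes.
rng = np.random.default_rng(0)
reg = entropy_minimization_loss(rng.normal(size=(16, 10)))
```

The regularizer is bounded by $\log K$ for $K$ classes and vanishes for one-hot predictions, which is what makes it a confidence-encouraging penalty.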

**Information bottleneck:** In recent years, the IB framework [6] has been considered a theoretical framework for the analysis and explanation of supervised deep learning systems. However, as shown in [13], the original IB framework faces several practical issues: (i) for deterministic deep networks, the IB functional is either infinite for the network parameters, which leads to an ill-posed optimization problem, or piecewise constant, hence not admitting gradient-based optimization methods, and (ii) the invariance of the IB functional under bijections prevents it from capturing properties of the learned representation that are desirable for classification. In the same work, the authors demonstrate that these issues can be partly resolved for stochastic deep networks, for networks that include a (hard or soft) decision rule, or by replacing the IB functional with related but better-behaved cost functions. It is important to mention that the same authors also note that, rather than trying to repair the inherent problems of the IB functional, a better approach may be to design regularizers on the latent representation that enforce the desired properties directly.

**The closest works:** The proposed framework is closely related to several families of semi-supervised classifiers based on generative models. VAE (M1 + M2) [1] combines the latent-feature discriminative model M1 and the generative semi-supervised model M2. A new latent representation is learned using the generative model M1, and subsequently the generative semi-supervised model M2 is trained using embeddings from this first latent representation instead of the raw data. The semi-supervised AAE classifier [2] is based on the AE architecture, where the encoder outputs two latent representations: one representing the class and another the style. The latent class representation is regularized by an adversarial loss forcing it to follow a categorical distribution; this regularization is claimed to play an essential role in the overall classification performance. The latent style representation is regularized to follow a Gaussian distribution. In both VAE and AAE, the mean square error (MSE) metric is used for the reconstruction space loss. CatGAN [3] is an extension of GAN based on an objective function that trades off the mutual information between observed examples and their predicted categorical class distribution against the robustness of the classifier to an adversarial generative model.

**Summary:** The considered methods of semi-supervised learning can be differentiated based on: (i) the targeted tasks (auto-encoding, clustering, generation or classification, depending on the available labeled data); (ii) the architecture in terms of the latent space representation (a single representation vector or multiple representation vectors); (iii) the usage of the IB or other underlying frameworks (methods derived from the IB directly or using regularization techniques); (iv) the label space regularization (based on available unlabeled data, augmented labeled data, synthetically generated labeled and unlabeled data, or specially designed adversarial examples); (v) the latent space regularization (hand-crafted regularizers and priors, or learnable priors under the reconstruction and contrastive setups) and (vi) the reconstruction space regularization in the case of the reconstruction setup (based on unlabeled and labeled data, augmented data under certain perturbations, or synthetically generated examples).

## 3. IB with Hand-Crafted Priors (HCP)

#### 3.1. Decomposition of the First Term: Hand-Crafted Regularization

#### 3.2. Decomposition of the Second Term

#### 3.3. Supervised and Semi-Supervised Models with/without Hand-Crafted Priors

**(baseline)**: is based on the term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ in (6)

## 4. IB with Learnable Priors (LP)

#### 4.1. Decomposition of Latent Space Regularizer

#### 4.2. Decomposition of Reconstruction Space Regularizer

#### 4.3. Semi-Supervised Models with Learnable Priors

#### 4.4. Links to State-Of-The-Art Models

#### 4.4.1. Links to Unsupervised Models

**VAE and $\beta$-VAE**:

- The targeted tasks: auto-encoding and generation.
- The architecture in terms of the latent space representation: the encoder outputs two vectors representing the mean and standard deviation vectors that control a new latent representation $\mathbf{z}={f}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)+{\sigma}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)\odot \boldsymbol{\epsilon}$, where ${f}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)$ and ${\sigma}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)$ are outputs of the encoder and $\boldsymbol{\epsilon}$ is assumed to be a zero mean unit variance Gaussian vector.
- The usage of IB or other underlying frameworks: both VAE and $\beta$-VAE use the evidence lower bound (ELBO) and are not derived from the IB framework. However, it can be shown [15] that the Lagrangian (18) can be reformulated for VAE and $\beta$-VAE as:$${\mathcal{L}}_{\beta -\mathrm{VAE}}({\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathcal{D}}_{\mathrm{z}|\mathrm{x}}\right]+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}},$$
- The label space regularization: does not apply here due to the unsupervised setting.
- The latent space regularization: is based on the hand-crafted prior with Gaussian pdf.
- The reconstruction space regularization in case of reconstruction loss: is based on the mean square error (MSE) counterpart of ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ that corresponds to the Gaussian likelihood assumption.
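The stochastic encoding above is the usual reparameterization trick, and under the zero mean unit variance Gaussian prior the latent term ${\mathcal{D}}_{\mathrm{z}|\mathrm{x}}$ has a closed form. A minimal sketch, with illustrative names for the encoder outputs:

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """Sample z = f(x) + sigma(x) * eps with eps ~ N(0, I).

    `mu` and `log_sigma` stand in for the two encoder outputs
    f_phi(x) and log sigma_phi(x); the names are illustrative.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)) per sample."""
    var = np.exp(2.0 * log_sigma)
    return 0.5 * (var + mu**2 - 1.0 - 2.0 * log_sigma).sum(axis=1)

rng = np.random.default_rng(1)
mu, log_sigma = np.zeros((4, 8)), np.zeros((4, 8))
z = reparameterize(mu, log_sigma, rng)       # stochastic latent code
kl = kl_to_standard_normal(mu, log_sigma)    # zero when posterior = prior
```

The additive-noise form keeps the sampling differentiable in the encoder parameters, which is what makes gradient-based training of the Lagrangian possible.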

**Unsupervised AAE** [2]:

- The targeted tasks: auto-encoding and generation.
- The architecture in terms of the latent space representation: the encoder outputs one vector in stochastic or deterministic way as $\mathbf{z}={f}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)$.
- The usage of IB or other underlying frameworks: AAE is not derived from the IB framework. As shown in [15], the AAE equivalent Lagrangian (18) can be linked with the IB formulation and defined as:$${\mathcal{L}}_{\mathrm{AAE}}({\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})={\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}},$$
- The label space regularization: does not apply here due to the unsupervised setting.
- The latent space regularization: is based on the hand-crafted prior with zero mean unit variance Gaussian pdf for each dimension.
- The reconstruction space regularization in case of reconstruction loss: is based on the MSE.

**BIB-AE** [15]:

- The targeted tasks: auto-encoding and generation.
- The architecture in terms of the latent space representation: the encoder outputs one vector using any form of stochastic or deterministic encoding.
- The usage of IB or other underlying frameworks: the BIB-AE is derived from the unsupervised IB (18) and its Lagrangian is defined as:$${\mathcal{L}}_{\mathrm{BIB}-\mathrm{AE}}({\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathcal{D}}_{\mathrm{z}|\mathrm{x}}\right]-{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}.$$
- The label space regularization: does not apply here due to the unsupervised setting.
- The latent space regularization: is based on the hand-crafted prior with Gaussian pdf applied to both the conditional and unconditional terms. In fact, the prior for ${\mathcal{D}}_{\mathrm{z}}$ can be arbitrary, but ${\mathcal{D}}_{\mathrm{z}|\mathrm{x}}$ requires an analytical parametrisation.
- The reconstruction space regularization in case of reconstruction loss: is based on the MSE counterpart of ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ and the discriminator ${\mathcal{D}}_{\mathrm{x}}$. This is a distinctive feature in comparison to VAE and AAE.

#### 4.4.2. Links to Semi-Supervised Models

**Semi-supervised AAE** [2]:

- The targeted tasks: auto-encoding, clustering, (conditional) generation and classification.
- The architecture in terms of the latent space representation: the encoder outputs two vectors representing the discrete class and the continuous style. The class representation is assumed to follow a categorical distribution and the style representation a Gaussian one. Both constraints on the prior distributions are ensured using an adversarial framework with two corresponding discriminators. In its original setting, AAE does not use any augmented samples or adversarial examples. *Remark*: It should be pointed out that in our architecture we consider the latent space to be represented by the vector $\mathbf{a}$, which is fed to the classifier and the regularizer, which gives a natural treatment of the IB and the corresponding regularization and priors. In the case of the semi-supervised AAE, the latent space is represented by the class and style representations directly. Therefore, to make it coherent with our case, one should assume that the class vector of the semi-supervised AAE corresponds to the vector $\mathbf{c}$ and the style vector to the vector $\mathbf{z}$.
- The usage of IB or other underlying frameworks: AAE is not derived from the IB framework. However, as shown in our analysis, the semi-supervised AAE represents the learnable prior case in terms of the latent space regularization. The corresponding Lagrangian of the semi-supervised AAE is given by (16) and considered in Section 4.3.
- The label space regularization: is based on the adversarial discriminator under the assumption that the class labels follow a categorical distribution. This is applied to both labeled and unlabeled samples.
- The latent space regularization: is based on the learnable prior with Gaussian pdf of AE.
- The reconstruction space regularization in case of reconstruction loss: is only based on the MSE.
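The adversarial label space regularizer described above can be sketched as follows (a hedged NumPy sketch: the function names and the binary cross-entropy form are our illustrative choices, not the exact losses of [2]):

```python
import numpy as np

def sample_categorical_prior(n, num_classes, rng):
    """One-hot samples from the categorical prior that the label
    discriminator tries to match the classifier outputs to."""
    onehot = np.zeros((n, num_classes))
    onehot[np.arange(n), rng.integers(0, num_classes, size=n)] = 1.0
    return onehot

def discriminator_bce(d_prior, d_classifier):
    """Binary cross-entropy of the label discriminator: d_prior are its
    outputs on prior samples ("real"), d_classifier its outputs on the
    classifier's soft predictions ("fake")."""
    eps = 1e-12
    return float(-(np.log(d_prior + eps).mean()
                   + np.log(1.0 - d_classifier + eps).mean()))

rng = np.random.default_rng(0)
prior_labels = sample_categorical_prior(8, 10, rng)
```

Because the prior samples are exactly one-hot, fooling the discriminator drives the classifier's outputs toward one-hot, i.e., low-entropy, predictions.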

**CatGAN** [3]: extends the classical binary GAN discriminator, which distinguishes original images from fake images generated from the latent space distribution, to a multi-class discriminator. The author assumes one-hot-vector encoding of the class labels. The system is considered in unsupervised and semi-supervised modes. In the unsupervised mode, the system has access only to unlabeled data, and the output of the classifier is interpreted as a clustering into a predefined number of clusters/classes. The main idea behind the unsupervised training consists in training the discriminator so that any sample from the set of original images is assigned to one of the classes with high confidence, whereas any fake or adversarial sample is assigned to all classes almost equiprobably. The regularization in the label space thus extends the entropy minimization-based framework considered above. In the absence of fakes, this regularization coincides with the semi-supervised AAE label space regularization under the categorical distribution and adversarial discriminator, which is equivalent to enforcing the minimum entropy of the label space. The encoding of fake samples, however, is equivalent to a sort of rejection option expressed via the activation of classes with maximum entropy, i.e., a uniform distribution over the classes. Equivalently, the above types of encoding can be considered as the maximization of the mutual information between the original data and the encoded class labels, and the minimization of the mutual information between the fakes/adversarial samples and the class labels. The semi-supervised CatGAN model adds a cross-entropy term computed for the true labeled samples.

- The targeted tasks: auto-encoding, clustering, generation and classification.
- The architecture in terms of the latent space representation: there is no encoder as such; instead, the system has a generator/decoder that generates samples from a random latent space $\mathbf{a}$ following some hand-crafted prior. The second element of the architecture is a classifier with min/max entropy optimization for the original and fake samples. The encoding of classes is assumed to be one-hot-vector encoding.
- The usage of IB or other underlying frameworks: CatGAN is not derived from the IB framework. However, as shown in [15], one can apply the IB formulation to adversarial generative models such as CatGAN by assuming that the term ${I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{X};\mathbf{A})=0$ in (3) due to the absence of an encoder as such. The minimization problem (3) reduces to the maximization of the second term ${I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{C})$ expressed via its lower bound of the variational decomposition (6). The first term ${\mathcal{D}}_{\mathrm{c}}$ enforces that the class labels of unlabeled samples follow the defined prior distribution $p\left(\mathbf{c}\right)$ with the above property of entropy minimization under one-hot-vector encoding, whereas the second term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ reflects the supervised part for labeled samples. In the original CatGAN formulation, the author does not use this expression of the mutual information for the decoder/generator training but instead uses the decomposition of the mutual information via the difference of corresponding entropies (see the first two terms in (9) in [3]). As pointed out above, we do not include in our analysis the term corresponding to the fake samples of the original CatGAN. However, we do believe that this form of regularization plays an important role in semi-supervised classification; its impact requires additional studies.
- The label space regularization: is based on the above assumptions, with labeled samples included in the cross-entropy term, unlabeled samples in the entropy minimization term and fake samples in the entropy maximization term of the original CatGAN method.
- The latent space regularization: is based on the hand-crafted prior.
- The reconstruction space regularization in case of reconstruction loss: is based on the adversarial discriminator only.
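The min/max entropy encoding of real and fake samples described above can be sketched as follows (a NumPy sketch; the function names are ours, and this covers only the label-space part of the CatGAN objective):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy(p):
    """Per-sample Shannon entropy (nats) of class distributions."""
    return -(p * np.log(p + 1e-12)).sum(axis=1)

def catgan_label_regularizer(real_logits, fake_logits):
    """Low entropy on real samples, high entropy on fakes:
    the discriminator minimizes this quantity."""
    return float(entropy(softmax(real_logits)).mean()
                 - entropy(softmax(fake_logits)).mean())

# Confident prediction on a real sample, uniform prediction on a fake.
reg = catgan_label_regularizer(np.array([[12.0, 0.0, 0.0]]), np.zeros((1, 3)))
```

The regularizer is most negative, i.e., best for the discriminator, exactly when real samples get one-hot-like predictions and fakes get the uniform "rejection" encoding.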

**SeGMA** [18]: is a semi-supervised clustering and generative system with a single-latent-vector auto-encoder, similar in spirit to the unsupervised version of AAE, that can also be used for classification. The latent space of SeGMA is assumed to follow a mixture of Gaussians. Using a small labeled data set, classes are assigned to components of this mixture of Gaussians by minimizing the cross-entropy loss induced by the class posterior distribution of a simple Gaussian classifier. The resulting mixture describes the distribution of the whole data, and representatives of individual classes are generated by sampling from its components. In the classification setup, SeGMA uses the latent space clustering scheme for classification.

- The targeted tasks: auto-encoding, clustering, generation and classification.
- The architecture in terms of the latent space representation: a single vector representation following a mixture of Gaussians distribution.
- The usage of IB or other underlying frameworks: SeGMA is not derived from the IB framework, but a link to the regularized ELBO and other related auto-encoders with interpretable latent space is demonstrated. However, as with the previous methods, it can be linked to the considered IB interpretation of the semi-supervised methods with hand-crafted priors (16). An equivalent Lagrangian of SeGMA is:$${\mathcal{L}}_{\mathrm{SeGMA}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})={\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}},$$
- The label space regularization: is based on the cross-entropy term for the labeled samples, as discussed above.
- The latent space regularization: is based on the hand-crafted mixture of Gaussians pdf.
- The reconstruction space regularization in case of reconstruction loss: is based on the MSE.
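The class posterior of the simple Gaussian classifier used to assign mixture components to classes can be sketched as follows (an illustrative sketch assuming equal mixture weights and a shared isotropic covariance; SeGMA's exact parametrisation may differ):

```python
import numpy as np

def gaussian_class_posterior(z, means, log_var=0.0):
    """Posterior p(c|z) when each class c is the Gaussian component
    N(means[c], exp(log_var) I) with equal mixture weights."""
    # Squared distance to each class mean -> Gaussian log-likelihood.
    d2 = ((z[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    log_lik = -0.5 * d2 / np.exp(log_var)
    log_lik -= log_lik.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(log_lik)
    return p / p.sum(axis=1, keepdims=True)

# Two classes anchored at two mixture means in a 2-D latent space.
means = np.array([[0.0, 0.0], [5.0, 5.0]])
z = np.array([[0.1, -0.1], [4.9, 5.2]])
post = gaussian_class_posterior(z, means)
```

Minimizing the cross-entropy of this posterior on the small labeled set is what ties each mixture component to a class, while unlabeled data only has to fit the overall mixture.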

**VAE (M1 + M2)** [1]: is based on the combination of several models. The model M1 represents a vanilla VAE considered in Section 4.4.1; therefore, model M1 is a particular case of the considered unsupervised IB. The model M2 combines an encoder, producing a continuous latent representation that follows a Gaussian distribution, with a classifier that takes the original data as input, in parallel to model M1. The class labels are encoded using one-hot-vector representations and follow a categorical distribution with a hyper-parameter following the symmetric Dirichlet distribution. The decoder of model M2 takes as input the continuous latent representation and the output of the classifier, and is trained under the MSE distortion metric. It is important to point out that the classifier works with the input data directly and not with a common latent space as in the considered LP model. For this reason, there is an obvious analogy with the considered LP model (11) under the assumption that $\mathbf{a}=\mathbf{x}$, and all the performed IB analysis applies directly. However, as pointed out by the authors, the performance of model M2 in semi-supervised classification with a limited number of labeled samples is relatively poor. That is why the third, hybrid model M1 + M2 is considered, where the models M1 and M2 are used in a stacked way. At the first stage, the model M1 is learned as a usual VAE. Then the latent space of model M1 is used as an input to the model M2, trained in a semi-supervised way. Such a two-stage approach closely resembles the learnable prior architecture presented in Figure 2. However, our model is trained end-to-end with an explainable common latent space and an IB origin, while the model M1 + M2 is trained in two stages with the use of the regularized ELBO for the derivation of model M2.

- The targeted tasks: auto-encoding, clustering, (conditional) generation and classification.
- The architecture in terms of the latent space representation: the stacked combination of models M1 and M2 is used as discussed above.
- The usage of IB or other underlying frameworks: VAE M1 + M2 is not derived from the IB framework, but it is linked to the regularized ELBO with the cross-entropy for the labeled samples. The corresponding IB Lagrangian of the semi-supervised VAE M1 + M2 under the assumption of end-to-end training can be defined as:$${\mathcal{L}}_{\mathrm{SS}-\mathrm{VAE}\;\mathrm{M}1+\mathrm{M}2}^{\mathrm{LP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathcal{D}}_{\mathrm{z}|\mathrm{x}}\right]+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}.$$
- The label space regularization: is based on the assumption of categorical distribution of labels.
- The reconstruction space regularization in case of reconstruction loss: is only based on the MSE.

## 5. Experimental Results

#### 5.1. Experimental Setup

#### 5.2. Discussion MNIST

#### 5.3. Discussion SVHN

## 6. Conclusions and Future Work

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| IB | Information bottleneck |
| VAE | Variational autoencoder |
| AAE | Adversarial autoencoder |
| CatGAN | Categorical generative adversarial networks |
| KL-divergences | Kullback–Leibler divergences |
| MSE | Mean squared error |
| HCP | IB with hand-crafted priors |
| LP | IB with learnable priors |
| NN | Neural network |
| SS | Semi-supervised |

## Appendix A. Latent Space of Trained Models

## Appendix B. Supervised Training without Latent Space Regularization (Baseline)

**Figure A3.** Baseline classifier based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$. The blue shadowed regions are not used.

**Table A1.** The network parameters of the baseline classifier trained on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$. The encoder is trained with and without batch normalization (BN) after the Conv2D layers.

**Encoder**

| Size | Layer |
|---|---|
| 28 × 28 × 1 | Input |
| 14 × 14 × 32 | Conv2D, LeakyReLU |
| 7 × 7 × 64 | Conv2D, LeakyReLU |
| 4 × 4 × 128 | Conv2D, LeakyReLU |
| 2048 | Flatten |
| 1024 | FC, ReLU |

**Decoder**

| Size | Layer |
|---|---|
| 1024 | Input |
| 500 | FC, ReLU |
| 10 | FC, Softmax |

## Appendix C. Semi-Supervised Training without Latent Space Regularization and with Class Label Regularizer

**Figure A4.** Semi-supervised classifier based on the cross-entropy ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and the categorical class discriminator ${\mathcal{D}}_{\mathrm{c}}$. No latent space regularization is applied. The blue shadowed regions are not used.

**Table A2.** The network parameters of the semi-supervised classifier trained on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and ${\mathcal{D}}_{\mathrm{c}}$. The encoder is trained with and without batch normalization (BN) after the Conv2D layers.

**Encoder**

| Size | Layer |
|---|---|
| 28 × 28 × 1 | Input |
| 14 × 14 × 32 | Conv2D, LeakyReLU |
| 7 × 7 × 64 | Conv2D, LeakyReLU |
| 4 × 4 × 128 | Conv2D, LeakyReLU |
| 2048 | Flatten |
| 1024 | FC, ReLU |

**Decoder**

| Size | Layer |
|---|---|
| 1024 | Input |
| 500 | FC, ReLU |
| 10 | FC, Softmax |

**Discriminator ${\mathcal{D}}_{\mathbf{c}}$**

| Size | Layer |
|---|---|
| 10 | Input |
| 500 | FC, ReLU |
| 500 | FC, ReLU |
| 1 | FC, Sigmoid |

**Table A3.** The performance (percentage error) of the **deterministic** classifier based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ for the encoder with and without batch normalization, as a function of the Lagrangian multiplier ${\alpha}_{\mathrm{c}}$ and the number of labelled examples.

| Encoder Model | ${\mathit{\alpha}}_{\mathbf{c}}$ | Run 1 | Run 2 | Run 3 | Mean | Std |
|---|---|---|---|---|---|---|
| **MNIST 100** | | | | | | |
| without BN | 0 | 26.56 | 26.24 | 28.04 | 26.95 | 0.96 |
| | 0.005 | 20.44 | 21.93 | 18.98 | 20.45 | 1.48 |
| | 0.0005 | 18.55 | 20.43 | 20.59 | 19.86 | 1.14 |
| | 1 | 19.23 | 22.42 | 20.57 | 20.74 | 1.60 |
| with BN | 0 | 29.37 | 29.27 | 30.62 | 29.75 | 0.75 |
| | 0.005 | 27.97 | 28.02 | 26.27 | 27.42 | 1.00 |
| | 0.0005 | 25.99 | 23.70 | 24.47 | 24.72 | 1.17 |
| | 1 | 27.78 | 31.98 | 35.88 | 31.88 | 4.05 |
| **MNIST 1000** | | | | | | |
| without BN | 0 | 7.74 | 6.99 | 6.97 | 7.23 | 0.44 |
| | 0.005 | 5.62 | 6.06 | 5.60 | 5.76 | 0.26 |
| | 0.0005 | 6.30 | 6.12 | 6.02 | 6.15 | 0.14 |
| | 1 | 5.99 | 6.27 | 6.28 | 6.18 | 0.16 |
| with BN | 0 | 7.45 | 6.95 | 7.52 | 7.31 | 0.31 |
| | 0.005 | 5.57 | 5.08 | 5.22 | 5.29 | 0.25 |
| | 0.0005 | 5.60 | 6.05 | 6.22 | 5.96 | 0.32 |
| | 1 | 6.05 | 6.41 | 5.82 | 6.09 | 0.30 |
| **MNIST all** | | | | | | |
| without BN | 0 | 0.83 | 0.83 | 0.74 | 0.80 | 0.05 |
| | 0.005 | 0.83 | 0.82 | 0.88 | 0.84 | 0.03 |
| | 0.0005 | 0.86 | 0.92 | 0.82 | 0.87 | 0.05 |
| | 1 | 0.72 | 0.85 | 0.87 | 0.81 | 0.08 |
| with BN | 0 | 0.73 | 0.67 | 0.79 | 0.73 | 0.06 |
| | 0.005 | 0.72 | 0.73 | 0.70 | 0.72 | 0.02 |
| | 0.0005 | 0.75 | 0.77 | 0.72 | 0.75 | 0.03 |
| | 1 | 0.67 | 0.68 | 0.73 | 0.69 | 0.03 |

**Table A4.** The performance (percentage error) of the **stochastic** classifier with supervised noisy data (noise std = 0.1, number of noise realisations = 3) based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ for the encoder with and without batch normalization, as a function of the Lagrangian multiplier ${\alpha}_{\mathrm{c}}$ and the number of labelled examples.

| Encoder Model | ${\mathit{\alpha}}_{\mathbf{c}}$ | Run 1 | Run 2 | Run 3 | Mean | Std |
|---|---|---|---|---|---|---|
| **MNIST 100** | | | | | | |
| without BN | 0 | 25.75 | 26.61 | 26.59 | 26.32 | 0.49 |
| | 0.005 | 23.34 | 21.38 | 24.37 | 23.03 | 1.52 |
| | 0.0005 | 19.92 | 15.83 | 16.03 | 17.26 | 2.31 |
| | 1 | 22.51 | 20.48 | 21.28 | 21.42 | 1.02 |
| with BN | 0 | 30.26 | 31.24 | 29.3 | 30.27 | 0.97 |
| | 0.005 | 21.17 | 24.41 | 24.75 | 23.44 | 1.98 |
| | 0.0005 | 22.97 | 26.38 | 24.44 | 24.60 | 1.71 |
| | 1 | 26.62 | 30.43 | 28.44 | 28.50 | 1.91 |
| **MNIST 1000** | | | | | | |
| without BN | 0 | 7.68 | 7.30 | 7.23 | 7.4 | 0.24 |
| | 0.005 | 5.59 | 5.16 | 5.80 | 5.52 | 0.33 |
| | 0.0005 | 5.59 | 6 | 5.84 | 5.81 | 0.21 |
| | 1 | 6.66 | 6.8 | 7.62 | 7.03 | 0.52 |
| with BN | 0 | 6.97 | 7.06 | 7.66 | 7.23 | 0.38 |
| | 0.005 | 4.42 | 4.54 | 4.08 | 4.35 | 0.24 |
| | 0.0005 | 5.28 | 5.56 | 5.14 | 5.33 | 0.21 |
| | 1 | 5.77 | 5.88 | 5.72 | 5.79 | 0.08 |
| **MNIST all** | | | | | | |
| without BN | 0 | 0.8 | 0.91 | 0.87 | 0.86 | 0.06 |
| | 0.005 | 0.77 | 0.82 | 0.88 | 0.82 | 0.06 |
| | 0.0005 | 0.86 | 0.81 | 0.87 | 0.85 | 0.03 |
| | 1 | 0.93 | 0.85 | 0.92 | 0.90 | 0.04 |
| with BN | 0 | 0.65 | 0.67 | 0.71 | 0.68 | 0.03 |
| | 0.005 | 0.69 | 0.77 | 0.68 | 0.71 | 0.05 |
| | 0.0005 | 0.78 | 0.71 | 0.74 | 0.74 | 0.04 |
| | 1 | 0.71 | 0.64 | 0.62 | 0.66 | 0.05 |

## Appendix D. Supervised Training with Hand-Crafted Latent Space Regularization

**Figure A5.** Supervised classifier based on the cross-entropy ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and the hand-crafted latent space regularization ${\mathcal{D}}_{\mathrm{a}}$. The blue shadowed parts are not used.

**Table A5.** The network parameters of the supervised classifier trained on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and ${\mathcal{D}}_{\mathrm{a}}$. The encoder is trained with and without batch normalization (BN) after the Conv2D layers. ${\mathcal{D}}_{\mathrm{a}}$ is trained in the adversarial way.

**Encoder**

| Size | Layer |
|---|---|
| 28 × 28 × 1 | Input |
| 14 × 14 × 32 | Conv2D, LeakyReLU |
| 7 × 7 × 64 | Conv2D, LeakyReLU |
| 4 × 4 × 128 | Conv2D, LeakyReLU |
| 2048 | Flatten |
| 1024 | FC |

**Decoder**

| Size | Layer |
|---|---|
| 1024 | Input |
| 500 | FC, ReLU |
| 10 | FC, Softmax |

**Discriminator ${\mathcal{D}}_{\mathbf{a}}$**

| Size | Layer |
|---|---|
| 1024 | Input |
| 500 | FC, ReLU |
| 500 | FC, ReLU |
| 1 | FC, Sigmoid |

**Table A6.** The performance (percentage error) of the **deterministic** classifier based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}}$ for the encoder with and without batch normalization, as a function of the Lagrangian multiplier.

| Encoder Model | ${\mathit{\alpha}}_{\mathbf{a}}$ | Run 1 | Run 2 | Run 3 | Mean | Std |
|---|---|---|---|---|---|---|
| **MNIST 100** | | | | | | |
| without BN | 0 | 26.79 | 27.26 | 27.39 | 27.15 | 0.32 |
| | 0.005 | 28.05 | 25.95 | 30.72 | 28.24 | 2.39 |
| | 0.0005 | 26.67 | 27.69 | 28.46 | 27.61 | 0.89 |
| | 1 | 33.42 | 33.05 | 34.81 | 33.76 | 0.92 |
| with BN | 0 | 30.37 | 29.32 | 29.82 | 29.83 | 0.52 |
| | 0.005 | 28.02 | 31.49 | 30.80 | 30.11 | 1.84 |
| | 0.0005 | 34.54 | 31.92 | 29.82 | 31.09 | 2.36 |
| | 1 | 34.43 | 44.35 | 44.25 | 41.01 | 5.70 |
| **MNIST 1000** | | | | | | |
| without BN | 0 | 7.16 | 8.12 | 7.55 | 7.61 | 0.48 |
| | 0.005 | 7.02 | 6.34 | 6.59 | 6.65 | 0.34 |
| | 0.0005 | 6.73 | 6.34 | 6.82 | 6.63 | 0.26 |
| | 1 | 9.49 | 9.93 | 10.56 | 9.99 | 0.54 |
| with BN | 0 | 7.39 | 7.83 | 7.92 | 7.72 | 0.28 |
| | 0.005 | 7.94 | 7.15 | 8.53 | 7.88 | 0.69 |
| | 0.0005 | 8.00 | 9.62 | 9.51 | 9.05 | 0.91 |
| | 1 | 15.79 | 14.88 | 13.71 | 14.79 | 1.04 |
| **MNIST all** | | | | | | |
| without BN | 0 | 0.76 | 0.70 | 0.81 | 0.76 | 0.06 |
| | 0.005 | 1.07 | 1.03 | 1.13 | 1.08 | 0.05 |
| | 0.0005 | 0.84 | 0.78 | 0.89 | 0.84 | 0.06 |
| | 1 | 4.78 | 7.24 | 4.71 | 5.58 | 1.44 |
| with BN | 0 | 0.68 | 0.68 | 0.69 | 0.68 | 0.01 |
| | 0.005 | 0.90 | 0.81 | 1.12 | 0.94 | 0.16 |
| | 0.0005 | 0.87 | 0.80 | 0.89 | 0.85 | 0.05 |
| | 1 | 2.37 | 3.61 | 4.35 | 3.44 | 1.00 |

**Table A7.** The performance (percentage error) of the **stochastic** classifier with supervised noisy data (noise std = 0.1, number of noise realisations = 3) based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}}$ for the encoder with and without batch normalization, as a function of the Lagrangian multiplier.

| Encoder Model | ${\mathit{\alpha}}_{\mathbf{a}}$ | Run 1 | Run 2 | Run 3 | Mean | Std |
|---|---|---|---|---|---|---|
| **MNIST 100** | | | | | | |
| without BN | 0.005 | 28.13 | 25.16 | 29.9 | 27.73 | 2.40 |
| | 0.0005 | 28.05 | 30.03 | 28.11 | 28.73 | 1.13 |
| | 1 | 32.33 | 34.09 | 33.73 | 33.38 | 0.93 |
| with BN | 0.005 | 32.25 | 33.47 | 26.01 | 30.58 | 4.00 |
| | 0.0005 | 33.37 | 36.15 | 35.65 | 35.06 | 1.48 |
| | 1 | 33.37 | 42.37 | 32.46 | 36.07 | 5.48 |
| **MNIST 1000** | | | | | | |
| without BN | 0.005 | 7.37 | 7.17 | 6.65 | 7.06 | 0.37 |
| | 0.0005 | 7.48 | 6.68 | 6.67 | 6.94 | 0.46 |
| | 1 | 9.48 | 9.94 | 11.61 | 10.34 | 1.12 |
| with BN | 0.005 | 7.82 | 7.97 | 7.81 | 7.87 | 0.09 |
| | 0.0005 | 9.5 | 8.68 | 9.37 | 9.18 | 0.44 |
| | 1 | 12.99 | 10.52 | 9.98 | 11.16 | 1.60 |
| **MNIST all** | | | | | | |
| without BN | 0.005 | 1.19 | 1.09 | 1.06 | 1.11 | 0.07 |
| | 0.0005 | 0.79 | 0.88 | 0.82 | 0.83 | 0.05 |
| | 1 | 6.22 | 4.81 | 5 | 5.34 | 0.77 |
| with BN | 0.005 | 0.94 | 1.07 | 1.04 | 1.02 | 0.07 |
| | 0.0005 | 0.78 | 0.81 | 0.78 | 0.79 | 0.02 |
| | 1 | 4.49 | 3.35 | 2.18 | 3.34 | 1.16 |

## Appendix E. Semi-Supervised Training with Hand-Crafted Latent Space and Class Label Regularizations

**Table A8.** The network parameters of the semi-supervised classifier trained on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$, ${\mathcal{D}}_{\mathrm{a}}$ and ${\mathcal{D}}_{\mathrm{c}}$. The encoder is trained with and without batch normalization (BN) after the Conv2D layers. ${\mathcal{D}}_{\mathrm{a}}$ and ${\mathcal{D}}_{\mathrm{c}}$ are trained in the adversarial way.

**Encoder**

| Size | Layer |
|---|---|
| 28 × 28 × 1 | Input |
| 14 × 14 × 32 | Conv2D, LeakyReLU |
| 7 × 7 × 64 | Conv2D, LeakyReLU |
| 4 × 4 × 128 | Conv2D, LeakyReLU |
| 2048 | Flatten |
| 1024 | FC |

**Decoder**

| Size | Layer |
|---|---|
| 1024 | Input |
| 500 | FC, ReLU |
| 10 | FC, Softmax |

**Discriminator ${\mathcal{D}}_{\mathbf{c}}$**

| Size | Layer |
|---|---|
| 10 | Input |
| 500 | FC, ReLU |
| 500 | FC, ReLU |
| 1 | FC, Sigmoid |

**Discriminator ${\mathcal{D}}_{\mathbf{a}}$**

| Size | Layer |
|---|---|
| 1024 | Input |
| 500 | FC, ReLU |
| 500 | FC, ReLU |
| 1 | FC, Sigmoid |

**Figure A6.** Semi-supervised classifier based on the cross-entropy ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and the hand-crafted latent space regularization ${\mathcal{D}}_{\mathrm{a}}$. The blue shadowed parts are not used.

**Table A9.** The performance (percentage error) of the **deterministic** classifier based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ for the encoder with and without batch normalization.

Encoder Model | ${\mathit{\alpha}}_{\mathbf{a}}$ | ${\mathit{\alpha}}_{\mathbf{c}}$ | Run 1 | Run 2 | Run 3 | Mean | Std
---|---|---|---|---|---|---|---
**MNIST 100** | | | | | | |
without BN | 0.005 | 0.005 | 21.39 | 18.12 | 18.34 | 19.28 | 1.83
 | 0.0005 | 0.0005 | 15.33 | 22.36 | 13.80 | 17.16 | 4.56
 | 0.005 | 0.0005 | 25.66 | 26.25 | 28.81 | 26.91 | 1.67
 | 0.0005 | 0.005 | 9.82 | 13.44 | 13.06 | 12.11 | 1.99
with BN | 0.005 | 0.005 | 23.45 | 21.19 | 28.87 | 24.50 | 3.94
 | 0.0005 | 0.0005 | 28.57 | 19.06 | 26.37 | 24.67 | 4.98
 | 0.005 | 0.0005 | 26.18 | 26.18 | 25.49 | 25.95 | 0.40
 | 0.0005 | 0.005 | 8.96 | 13.82 | 14.76 | 12.52 | 3.11
**MNIST 1000** | | | | | | |
without BN | 0.005 | 0.005 | 3.91 | 4.21 | 3.70 | 3.94 | 0.26
 | 0.0005 | 0.0005 | 3.54 | 3.72 | 3.54 | 3.60 | 0.10
 | 0.005 | 0.0005 | 6.19 | 5.80 | 7.31 | 6.43 | 0.78
 | 0.0005 | 0.005 | 2.80 | 2.82 | 2.83 | 2.82 | 0.02
with BN | 0.005 | 0.005 | 3.30 | 2.94 | 2.93 | 3.06 | 0.21
 | 0.0005 | 0.0005 | 2.80 | 2.53 | 2.50 | 2.61 | 0.17
 | 0.005 | 0.0005 | 3.51 | 3.75 | 4.12 | 3.79 | 0.31
 | 0.0005 | 0.005 | 2.58 | 2.27 | 2.24 | 2.37 | 0.19
**MNIST all** | | | | | | |
without BN | 0.005 | 0.005 | 1.04 | 1.07 | 1.07 | 1.06 | 0.02
 | 0.0005 | 0.0005 | 0.86 | 0.90 | 0.88 | 0.88 | 0.02
 | 0.005 | 0.0005 | 1.08 | 0.92 | 1.09 | 1.03 | 0.10
 | 0.0005 | 0.005 | 0.85 | 0.93 | 0.93 | 0.90 | 0.05
with BN | 0.005 | 0.005 | 1.10 | 1.01 | 0.93 | 1.01 | 0.09
 | 0.0005 | 0.0005 | 0.84 | 0.88 | 0.83 | 0.85 | 0.03
 | 0.005 | 0.0005 | 1.10 | 1.12 | 0.93 | 1.05 | 0.10
 | 0.0005 | 0.005 | 0.76 | 0.82 | 0.79 | 0.79 | 0.03
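The objective ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ weighs the supervised cross-entropy against the two adversarial regularizers. A minimal sketch of the encoder-side loss follows, assuming a non-saturating GAN formulation for the adversarial terms; the function name is hypothetical, and the defaults ${\alpha}_{\mathrm{a}}=0.0005$, ${\alpha}_{\mathrm{c}}=0.005$ correspond to the best-performing setting in the table above.

```python
import torch
import torch.nn.functional as F

def encoder_loss(class_logits, labels, d_a_on_fake, d_c_on_fake,
                 alpha_a=0.0005, alpha_c=0.005):
    """Sketch of D_cc + alpha_a * D_a + alpha_c * D_c for the encoder/classifier.

    class_logits : (N, 10) pre-softmax class scores on labeled samples
    labels       : (N,) integer class labels
    d_a_on_fake  : sigmoid outputs of the latent discriminator D_a on encoder features
    d_c_on_fake  : sigmoid outputs of the label discriminator D_c on predicted labels
    """
    d_cc = F.cross_entropy(class_logits, labels)       # supervised cross-entropy term
    d_a = -torch.log(d_a_on_fake + 1e-8).mean()        # encoder tries to fool D_a
    d_c = -torch.log(d_c_on_fake + 1e-8).mean()        # classifier tries to fool D_c
    return d_cc + alpha_a * d_a + alpha_c * d_c

loss = encoder_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                    torch.rand(4, 1), torch.rand(4, 1))
```

The discriminators themselves would be updated with the opposite (real-vs-fake) objective in the usual alternating GAN scheme.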

**Table A10.** The performance (percentage error) of the **stochastic** classifier with noisy supervised data (noise std = 0.1, number of noise realisations = 3) based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ for the encoder with and without batch normalization.

Encoder Model | ${\mathit{\alpha}}_{\mathbf{a}}$ | ${\mathit{\alpha}}_{\mathbf{c}}$ | Run 1 | Run 2 | Run 3 | Mean | Std
---|---|---|---|---|---|---|---
**MNIST 100** | | | | | | |
without BN | 0.005 | 0.005 | 12.4 | 18.05 | 16.73 | 15.73 | 2.96
 | 0.0005 | 0.0005 | 15.01 | 11.16 | 14.74 | 13.64 | 2.15
 | 0.005 | 0.0005 | 23.31 | 26.61 | 25.41 | 25.11 | 1.67
 | 0.0005 | 0.005 | 9.21 | 9.02 | 10.12 | 9.45 | 0.59
with BN | 0.005 | 0.005 | 13.55 | 22.48 | 14.72 | 16.92 | 4.85
 | 0.0005 | 0.0005 | 8.37 | 15.01 | 26.92 | 16.77 | 9.40
 | 0.005 | 0.0005 | 32.12 | 30.27 | 31.44 | 31.28 | 0.94
 | 0.0005 | 0.005 | 5.46 | 17 | 11.54 | 11.33 | 5.77
**MNIST 1000** | | | | | | |
without BN | 0.005 | 0.005 | 3.9 | 4.25 | 4.02 | 4.06 | 0.18
 | 0.0005 | 0.0005 | 3.64 | 3.82 | 4.11 | 3.86 | 0.24
 | 0.005 | 0.0005 | 6.68 | 5.34 | 6.36 | 6.13 | 0.70
 | 0.0005 | 0.005 | 3.03 | 2.88 | 2.66 | 2.86 | 0.19
with BN | 0.005 | 0.005 | 2.96 | 3.37 | 2.98 | 3.10 | 0.23
 | 0.0005 | 0.0005 | 2.87 | 3.10 | 2.73 | 2.90 | 0.19
 | 0.005 | 0.0005 | 3.72 | 3.8 | 4.14 | 3.89 | 0.22
 | 0.0005 | 0.005 | 2.57 | 2.39 | 2.28 | 2.41 | 0.15
**MNIST all** | | | | | | |
without BN | 0.005 | 0.005 | 1.05 | 1.09 | 1.1 | 1.08 | 0.33
 | 0.0005 | 0.0005 | 0.94 | 0.96 | 0.9 | 0.93 | 0.03
 | 0.005 | 0.0005 | 1.16 | 1.14 | 1.13 | 1.14 | 0.02
 | 0.0005 | 0.005 | 0.88 | 0.92 | 0.91 | 0.90 | 0.02
with BN | 0.005 | 0.005 | 0.98 | 0.84 | 0.94 | 0.92 | 0.07
 | 0.0005 | 0.0005 | 0.79 | 0.96 | 0.82 | 0.86 | 0.09
 | 0.005 | 0.0005 | 1.04 | 1.05 | 1.03 | 1.04 | 0.01
 | 0.0005 | 0.005 | 0.74 | 0.78 | 0.84 | 0.79 | 0.05
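The stochastic classifier in Table A10 augments each supervised sample with additive Gaussian noise (std 0.1, three realisations per sample). A minimal sketch of this augmentation step; the helper name is hypothetical.

```python
import torch

def noisy_batch(x: torch.Tensor, noise_std: float = 0.1, n_real: int = 3) -> torch.Tensor:
    """Replicate each supervised sample with n_real additive Gaussian noise
    realisations (std = noise_std), as used for the stochastic classifier."""
    return torch.cat([x + noise_std * torch.randn_like(x) for _ in range(n_real)],
                     dim=0)

x = torch.zeros(2, 1, 28, 28)   # a toy batch of two "images"
xb = noisy_batch(x)             # (6, 1, 28, 28): 3 noisy copies of each sample
```

The corresponding labels are simply repeated `n_real` times, and training then proceeds on the enlarged noisy batch.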

## Appendix F. Semi-Supervised Training with Learnable Latent Space Regularization

**Table A11.** The encoder and decoder of the semi-supervised classifier trained on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$, ${\mathcal{D}}_{\mathrm{c}}$ and ${\mathcal{D}}_{\mathrm{z}}$. The encoder is trained with and without batch normalization (BN) after the Conv2D layers. ${\mathcal{D}}_{\mathrm{c}}$ and ${\mathcal{D}}_{\mathrm{z}}$ are trained in an adversarial way.

Size | Layer
---|---
**Encoder** |
28 × 28 × 1 | Input
14 × 14 × 32 | Conv2D, LeakyReLU
7 × 7 × 64 | Conv2D, LeakyReLU
4 × 4 × 128 | Conv2D, LeakyReLU
2048 | Flatten
1024 | FC, ReLU
10 | FC, Softmax (class output)
10 | FC (latent output)
**Decoder** |
10 + 10 | Input
7 × 7 × 128 | FC, Reshape, BN, ReLU
14 × 14 × 128 | Conv2DTrans, BN, ReLU
28 × 28 × 128 | Conv2DTrans, BN, ReLU
28 × 28 × 64 | Conv2DTrans, BN, ReLU
28 × 28 × 1 | Conv2DTrans, Sigmoid
**${\mathcal{D}}_{\mathbf{z}}$** |
10 | Input
500 | FC, ReLU
500 | FC, ReLU
1 | FC, Sigmoid
**${\mathcal{D}}_{\mathbf{c}}$** |
10 | Input
500 | FC, ReLU
500 | FC, ReLU
1 | FC, Sigmoid
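The decoder in Table A11 consumes the concatenated 10-d class and 10-d latent vectors and upsamples back to 28 × 28. A PyTorch sketch of that branch; kernel sizes and strides are assumptions chosen only to reproduce the listed output shapes, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Decoder of Table A11: (10 + 10) -> 7x7x128 -> 14x14x128 -> 28x28x128
    -> 28x28x64 -> 28x28x1 with BN + ReLU between layers."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(20, 7 * 7 * 128)   # 10-d label + 10-d latent input
        self.net = nn.Sequential(
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 128, 4, stride=2, padding=1),  # 14 x 14
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 128, 4, stride=2, padding=1),  # 28 x 28
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 3, stride=1, padding=1),   # 28 x 28 x 64
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 3, stride=1, padding=1),     # 28 x 28 x 1
            nn.Sigmoid(),
        )

    def forward(self, cz: torch.Tensor) -> torch.Tensor:
        h = self.fc(cz).view(-1, 128, 7, 7)    # FC + Reshape row of the table
        return self.net(h)

out = Decoder()(torch.randn(2, 20))            # (2, 1, 28, 28) reconstruction
```

With stride-2 transposed convolutions (kernel 4, padding 1) the spatial size doubles at each step, so 7 → 14 → 28 as in the table; the last two layers keep the resolution and only change the channel count.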

**Figure A7.** Semi-supervised classifier with learnable priors: the cross-entropy ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$, MSE ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$, class label regularization ${\mathcal{D}}_{\mathrm{c}}$ and latent space regularization ${\mathcal{D}}_{\mathrm{z}}$. The blue shadowed parts are not used.

**Table A12.** The performance (percentage error) of the **deterministic** classifier based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ for the encoder with and without batch normalization.

Encoder Model | Run 1 | Run 2 | Run 3 | Mean | Std
---|---|---|---|---|---
**MNIST 100** | | | | |
without BN | 2.15 | 2.05 | 1.78 | 1.99 | 0.19
with BN | 1.57 | 1.56 | 1.92 | 1.68 | 0.21
**MNIST 1000** | | | | |
without BN | 1.55 | 1.47 | 1.53 | 1.52 | 0.04
with BN | 1.37 | 1.34 | 1.73 | 1.48 | 0.22
**MNIST all** | | | | |
without BN | 0.78 | 0.7 | 0.82 | 0.77 | 0.06
with BN | 0.79 | 0.77 | 0.76 | 0.77 | 0.02

**Table A13.** The performance (percentage error) of the **stochastic** classifier with noisy supervised data (noise std = 0.1, number of noise realisations = 3) based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ for the encoder with and without batch normalization.

Encoder Model | Run 1 | Run 2 | Run 3 | Mean | Std
---|---|---|---|---|---
**MNIST 100** | | | | |
without BN | 1.55 | 3.19 | 2.11 | 2.28 | 0.83
with BN | 1.4 | 1.33 | 1.72 | 1.48 | 0.21
**MNIST 1000** | | | | |
without BN | 1.73 | 1.53 | 1.6 | 1.62 | 0.10
with BN | 1.28 | 1.43 | 1.2 | 1.30 | 0.12
**MNIST all** | | | | |
without BN | 0.94 | 0.86 | 0.86 | 0.89 | 0.05
with BN | 0.77 | 0.65 | 0.84 | 0.75 | 0.10

## Appendix G. Semi-Supervised Training with Learnable Latent Space Regularization and Adversarial Reconstruction

**Figure A8.** Semi-supervised classifier with learnable priors: the cross-entropy ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$, MSE ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$, adversarial reconstruction ${\mathcal{D}}_{\mathrm{x}}$, class label regularization ${\mathcal{D}}_{\mathrm{c}}$ and latent space regularizer ${\mathcal{D}}_{\mathrm{z}}$. The blue shadowed parts are not used.

**Table A14.** The network parameters of the semi-supervised classifier trained on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$, ${\mathcal{D}}_{\mathrm{c}}$, ${\mathcal{D}}_{\mathrm{z}}$ and ${\mathcal{D}}_{\mathrm{x}}$. The encoder is trained with and without batch normalization (BN) after the Conv2D layers. ${\mathcal{D}}_{\mathrm{c}}$, ${\mathcal{D}}_{\mathrm{z}}$ and ${\mathcal{D}}_{\mathrm{x}}$ are trained in an adversarial way.

Size | Layer
---|---
**Encoder** |
28 × 28 × 1 | Input
14 × 14 × 32 | Conv2D, LeakyReLU
7 × 7 × 64 | Conv2D, LeakyReLU
4 × 4 × 128 | Conv2D, LeakyReLU
2048 | Flatten
1024 | FC, ReLU
10 | FC, Softmax (class output)
10 | FC (latent output)
**${\mathcal{D}}_{\mathbf{z}}$** |
10 | Input
500 | FC, ReLU
500 | FC, ReLU
1 | FC, Sigmoid
**${\mathcal{D}}_{\mathbf{c}}$** |
10 | Input
500 | FC, ReLU
500 | FC, ReLU
1 | FC, Sigmoid
**Decoder** |
10 + 10 | Input
7 × 7 × 128 | FC, Reshape, BN, ReLU
14 × 14 × 128 | Conv2DTrans, BN, ReLU
28 × 28 × 128 | Conv2DTrans, BN, ReLU
28 × 28 × 64 | Conv2DTrans, BN, ReLU
28 × 28 × 1 | Conv2DTrans, Sigmoid
**${\mathcal{D}}_{\mathbf{x}}$** |
28 × 28 × 1 | Input
14 × 14 × 64 | Conv2D, LeakyReLU
7 × 7 × 64 | Conv2D, LeakyReLU
4 × 4 × 128 | Conv2D, LeakyReLU
4 × 4 × 256 | Conv2D, LeakyReLU
4096 | Flatten
1 | FC, Sigmoid
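The new component in Table A14 relative to Table A11 is the image discriminator ${\mathcal{D}}_{\mathrm{x}}$ used for adversarial reconstruction. A PyTorch sketch of it; kernel sizes, strides and the LeakyReLU slope are assumptions chosen only to match the listed shapes.

```python
import torch
import torch.nn as nn

# D_x image discriminator: 28x28x1 -> 14x14x64 -> 7x7x64 -> 4x4x128
# -> 4x4x256 -> flatten (4096) -> 1 with sigmoid
D_x = nn.Sequential(
    nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),    # 14 x 14 x 64
    nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),   # 7 x 7 x 64
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),  # 4 x 4 x 128
    nn.Conv2d(128, 256, 3, stride=1, padding=1), nn.LeakyReLU(0.2), # 4 x 4 x 256
    nn.Flatten(),                                                   # 4 * 4 * 256 = 4096
    nn.Linear(4096, 1), nn.Sigmoid(),
)

p = D_x(torch.randn(2, 1, 28, 28))   # (2, 1) real/fake probabilities
```

Note that the last convolution keeps the 4 × 4 resolution and only widens the channels, as in the table.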

**Table A15.** The performance (percentage error) of the **deterministic** classifier based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\alpha}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$ for the encoder with and without batch normalization.

Encoder Model | ${\mathit{\alpha}}_{\mathbf{x}}$ | Run 1 | Run 2 | Run 3 | Mean | Std
---|---|---|---|---|---|---
**MNIST 100** | | | | | |
without BN | 0.005 | 2.85 | 3.36 | 2.77 | 2.99 | 0.32
 | 0.0005 | 2.58 | 2.49 | 3.08 | 2.72 | 0.32
 | 1 | 19.62 | 19.96 | 15.97 | 18.52 | 2.21
with BN | 0.005 | 1.56 | 1.33 | 1.35 | 1.41 | 0.13
 | 0.0005 | 1.68 | 1.66 | 2.02 | 1.79 | 0.20
 | 1 | 20.85 | 13.6 | 21.67 | 18.71 | 4.44
**MNIST 1000** | | | | | |
without BN | 0.005 | 2.29 | 2.35 | 2.11 | 2.25 | 0.12
 | 0.0005 | 1.69 | 1.88 | 2.24 | 1.94 | 0.28
 | 1 | 3.47 | 3.30 | 4.12 | 3.63 | 0.43
with BN | 0.005 | 1.18 | 1.21 | 1.09 | 1.16 | 0.06
 | 0.0005 | 1.44 | 1.28 | 1.29 | 1.34 | 0.09
 | 1 | 4.14 | 2.94 | 2.48 | 3.19 | 0.86
**MNIST all** | | | | | |
without BN | 0.005 | 0.97 | 1.01 | 1.04 | 1.01 | 0.04
 | 0.0005 | 0.88 | 0.85 | 0.93 | 0.89 | 0.04
 | 1 | 1.31 | 1.28 | 1.47 | 1.35 | 0.10
with BN | 0.005 | 0.81 | 0.83 | 0.75 | 0.80 | 0.04
 | 0.0005 | 0.73 | 0.78 | 0.75 | 0.75 | 0.03
 | 1 | 0.88 | 0.86 | 1.27 | 1.00 | 0.23

**Table A16.** The performance (percentage error) of the **stochastic** classifier with noisy supervised data (noise std = 0.1, number of noise realisations = 3) based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\alpha}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$ for the encoder with and without batch normalization.

Encoder Model | ${\mathit{\alpha}}_{\mathbf{x}}$ | Run 1 | Run 2 | Run 3 | Mean | Std
---|---|---|---|---|---|---
**MNIST 100** | | | | | |
without BN | 0.005 | 2.45 | 3.04 | 2.67 | 2.72 | 0.30
 | 0.0005 | 2.63 | 2.3 | 2.45 | 2.46 | 0.17
with BN | 0.005 | 1.34 | 1.21 | 6.4 | 2.98 | 2.96
 | 0.0005 | 1.35 | 1.51 | 1.93 | 1.60 | 0.30
**MNIST 1000** | | | | | |
without BN | 0.005 | 2.31 | 2.26 | 2.2 | 2.26 | 0.06
 | 0.0005 | 1.71 | 2.16 | 1.86 | 1.91 | 0.23
with BN | 0.005 | 1.23 | 1.31 | 1.10 | 1.21 | 0.11
 | 0.0005 | 1.42 | 1.62 | 1.37 | 1.47 | 0.13
**MNIST all** | | | | | |
without BN | 0.005 | 0.93 | 1.01 | 1.05 | 1.00 | 0.06
 | 0.0005 | 0.92 | 0.83 | 0.88 | 0.88 | 0.05
with BN | 0.005 | 0.88 | 0.86 | 0.91 | 0.88 | 0.03
 | 0.0005 | 0.77 | 0.80 | 0.80 | 0.79 | 0.02

## References

1. Kingma, D.P.; Mohamed, S.; Rezende, D.J.; Welling, M. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems; MIT Press: Montreal, QC, Canada, 2014; pp. 3581–3589.
2. Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial autoencoders. arXiv **2015**, arXiv:1511.05644.
3. Springenberg, J.T. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv **2015**, arXiv:1511.06390.
4. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. arXiv **2020**, arXiv:2002.05709.
5. Federici, M.; Dutta, A.; Forré, P.; Kushman, N.; Akata, Z. Learning Robust Representations via Multi-View Information Bottleneck. arXiv **2020**, arXiv:2002.07017.
6. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5.
7. Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. **2018**, 40, 2897–2905.
8. Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems; MIT Press: Vancouver, BC, Canada, 2019; pp. 5049–5059.
9. Grandvalet, Y.; Bengio, Y. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems; MIT Press: Vancouver, BC, Canada, 2004; pp. 529–536.
10. Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop: Challenges in Representation Learning (WREPL); ICML: Atlanta, GA, USA, 2013; Volume 3.
11. Cireşan, D.C.; Meier, U.; Gambardella, L.M.; Schmidhuber, J. Deep, big, simple neural nets for handwritten digit recognition. Neural Comput. **2010**, 22, 3207–3220.
12. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation policies from data. arXiv **2018**, arXiv:1805.09501.
13. Amjad, R.A.; Geiger, B.C. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 42, 2225–2239.
14. Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv **2016**, arXiv:1612.00410.
15. Voloshynovskiy, S.; Kondah, M.; Rezaeifar, S.; Taran, O.; Holotyak, T.; Rezende, D.J. Information bottleneck through variational glasses. In NeurIPS Workshop on Bayesian Deep Learning; Vancouver Convention Center: Vancouver, BC, Canada, 2019.
16. Uğur, Y.; Zaidi, A. Variational Information Bottleneck for Unsupervised Clustering: Deep Gaussian Mixture Embedding. Entropy **2020**, 22, 213.
17. Maaløe, L.; Sønderby, C.K.; Sønderby, S.K.; Winther, O. Auxiliary deep generative models. arXiv **2016**, arXiv:1602.05473.
18. Śmieja, M.; Wołczyk, M.; Tabor, J.; Geiger, B.C. SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder. arXiv **2019**, arXiv:1906.09333.
19. Makhzani, A.; Frey, B.J. Pixelgan autoencoders. In Advances in Neural Information Processing Systems; MIT Press: Long Beach, CA, USA, 2017; pp. 1975–1985.
20. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
21. Kingma, D.; Welling, M. Auto-Encoding Variational Bayes. arXiv **2014**, arXiv:1312.6114.
22. Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv **2014**, arXiv:1401.4082.
23. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
24. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; MIT Press: Montreal, QC, Canada, 2014; pp. 2672–2680.
25. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning; NIPS Workshop: Granada, Spain, 2011; Volume 2011, p. 5.

**Table 1.** Semi-supervised classification performance (percentage error) for the optimal parameters (Appendices B–G) on MNIST and SVHN (D—deterministic; S—stochastic).

Model | | MNIST (100) | MNIST (1000) | MNIST (all) | SVHN (1000)
---|---|---|---|---|---
NN Baseline (${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$) | [D] | 26.31 (±0.91) | 7.50 (±0.19) | 0.68 (±0.05) | 36.16 (±0.77)
 | [S] | 26.78 (±1.66) | 7.54 (±0.25) | 0.70 (±0.05) | 36.28 (±0.93)
InfoMax [3] | [S] | 33.41 | 21.5 | 15.86 | -
VAE [5] | [S] | 14.26 | 8.71 | 5.02 | -
MV-InfoMax [5] | [S] | 13.22 | 7.39 | 6.07 | -
IB multiview [5] | [S] | 3.03 | 2.34 | 2.22 | -
VAE (M1 + M2) [5] | [S] | 3.33 (±0.14) | 2.40 (±0.02) | 0.96 | 36.02 (±0.10)
CatGAN | [S] | 1.91 (±0.10) | 1.73 (±0.18) | 0.91 | -
AAE | [D] | 1.90 (±0.10) | 1.60 (±0.08) | 0.85 (±0.02) | 17.70 (±0.30)
**No priors on latent space** | | | | |
${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}$ | [D] | 20.72 (±1.58) | 4.99 (±0.28) | 0.69 (±0.04) | 25.78 (±0.90)
 | [S] | 19.60 (±1.37) | 4.49 (±0.25) | 0.67 (±0.05) | 26.34 (±0.80)
**Hand-crafted latent space priors** | | | | |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}$ | [D] | 27.44 (±1.40) | 6.77 (±0.34) | 0.91 (±0.05) | 35.94 (±1.08)
 | [S] | 27.48 (±1.07) | 6.91 (±0.45) | 0.88 (±0.05) | 35.80 (±1.21)
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ | [D] | 12.04 (±4.46) | 2.43 (±0.12) | 0.81 (±0.05) | 24.70 (±0.46)
 | [S] | 11.80 (±3.82) | 2.40 (±0.10) | 0.82 (±0.04) | 24.62 (±0.54)
**Learnable latent space priors** | | | | |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ | [D] | 1.55 (±0.21) | 1.25 (±0.10) | 0.74 (±0.04) | 20.07 (±0.36)
 | [S] | 1.49 (±0.18) | 1.43 (±0.06) | 0.78 (±0.04) | 20.00 (±0.31)
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$ | [D] | 1.38 (±0.09) | 1.21 (±0.10) | 0.77 (±0.06) | 19.75 (±0.52)
 | [S] | 1.42 (±0.10) | 1.16 (±0.09) | 0.79 (±0.02) | 19.71 (±0.26)

**Table 2.** Execution time (hours) per 100 epochs on one NVIDIA GPU. For SVHN, the models with learnable latent space priors were trained with a learning rate of 0.0001, which explains the longer time, but without optimization of the Lagrangians, i.e., the Lagrangians were re-used from the pre-trained MNIST model. All other models were trained with a learning rate of 0.001.

Model | MNIST | SVHN
---|---|---
NN Baseline (${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$) | 0.47–0.65 | 0.85–0.92
**No priors on latent space** | |
${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}$ | 0.47–0.65 | 0.85–0.92
**Hand-crafted latent space priors** | |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}$ | 0.47–0.65 | 1–1.05
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ | 0.97–1.18 | 1.5–1.6
**Learnable latent space priors** | |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ | 1.23–1.6 | 2.25–2.3
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$ | 1.98–2.42 | 3.5–3.55

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Voloshynovskiy, S.; Taran, O.; Kondah, M.; Holotyak, T.; Rezende, D.
Variational Information Bottleneck for Semi-Supervised Classification. *Entropy* **2020**, *22*, 943.
https://doi.org/10.3390/e22090943
