# Variational Information Bottleneck for Semi-Supervised Classification

Department of Computer Science, University of Geneva, 1227 Carouge, Switzerland

DeepMind, London N1C 4AG, UK

Author to whom correspondence should be addressed.

Received: 22 July 2020 / Revised: 24 August 2020 / Accepted: 24 August 2020 / Published: 27 August 2020

(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)

In this paper, we consider an information bottleneck (IB) framework for semi-supervised classification with several families of priors on the latent space representation. We apply a variational decomposition to the mutual information terms of the IB. Using this decomposition, we analyze several regularizers and demonstrate in practice the impact of the different components of the variational model on classification accuracy. We propose a new formulation of semi-supervised IB with hand-crafted and learnable priors and link it to previous methods such as semi-supervised versions of VAE (M1 + M2), AAE, CatGAN, etc. We show that the resulting model allows a better understanding of the role of various previously proposed regularizers in the semi-supervised classification task in the light of the IB framework. The proposed IB semi-supervised model with hand-crafted and learnable priors is experimentally validated on MNIST under different amounts of labeled data.

We denote the joint generative distribution as ${p}_{\mathit{\theta}}(\mathbf{x},\mathbf{z})={p}_{\mathit{\theta}}\left(\mathbf{z}\right){p}_{\mathit{\theta}}\left(\mathbf{x}\right|\mathbf{z})$, where the marginal ${p}_{\mathit{\theta}}\left(\mathbf{z}\right)$ is interpreted as the targeted distribution of the latent space and the marginal ${p}_{\mathit{\theta}}\left(\mathbf{x}\right)={\mathbb{E}}_{{p}_{\mathit{\theta}}\left(\mathbf{z}\right)}\left[{p}_{\mathit{\theta}}\left(\mathbf{x}\right|\mathbf{z})\right]={\int}_{\mathbf{z}}{p}_{\mathit{\theta}}\left(\mathbf{x}\right|\mathbf{z}){p}_{\mathit{\theta}}\left(\mathbf{z}\right)\mathrm{d}\mathbf{z}$ as the generated data distribution, with the generative model described by ${p}_{\mathit{\theta}}\left(\mathbf{x}\right|\mathbf{z})$; $\mathbb{E}$ stands for the expected value. The joint data distribution is ${q}_{\mathit{\varphi}}(\mathbf{x},\mathbf{z})={p}_{\mathcal{D}}\left(\mathbf{x}\right){q}_{\mathit{\varphi}}\left(\mathbf{z}\right|\mathbf{x})$, where ${p}_{\mathcal{D}}\left(\mathbf{x}\right)$ denotes the empirical data distribution, ${q}_{\mathit{\varphi}}\left(\mathbf{z}\right|\mathbf{x})$ is the inference or encoding model, and the marginal ${q}_{\mathit{\varphi}}\left(\mathbf{z}\right)$ denotes the “true” or “aggregated” distribution of the latent space data. We denote the parameters of the encoders as ${\mathit{\varphi}}_{\mathrm{a}}$ and ${\mathit{\varphi}}_{\mathrm{z}}$, and those of the decoders as ${\mathit{\theta}}_{\mathrm{c}}$ and ${\mathit{\theta}}_{\mathrm{x}}$. The discriminators corresponding to Kullback–Leibler divergences are denoted as ${\mathcal{D}}_{\mathrm{x}}$, where the subscript indicates the space to which the discriminator is applied. The cross-entropy metrics are denoted as ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$, where the subscript indicates the corresponding vectors. $\mathbf{X}$ denotes a random vector, while its realization is denoted as $\mathbf{x}$.

Deep supervised classifiers demonstrate impressive performance when the amount of labeled data is large. However, their performance deteriorates significantly as the number of labeled samples decreases. Recently, semi-supervised classifiers based on deep generative models such as VAE (M1 + M2) [1], AAE [2], CatGAN [3], etc., along with several other approaches based on multi-view and contrastive metrics, to mention only the most recent ones [4,5], have been considered as a solution to this problem. Despite the remarkable reported results, the information-theoretic analysis of semi-supervised classifiers based on generative models, and the role of different priors aiming to compensate for the lack of labeled data, remain little studied. Therefore, in this paper we address these issues using the IB principle [6] and compare different priors in practice on the same classifier architecture.

Instead of treating the latent space of generative models such as VAE (M1 + M2) [1] and AAE [2], trained in an unsupervised way, as suitable features for classification, we start from the IB formulation of supervised classification, where we consider an encoder-decoder formulation of the classifier and impose priors on its latent space. Thus, we study an approach to semi-supervised classification based on an IB formulation with a variational decomposition of the IB compression and classification mutual information terms. To better understand the role and impact of the different elements of the variational IB on classification accuracy, we consider two types of priors on the latent space of the classifier: (i) hand-crafted and (ii) learnable priors. Hand-crafted latent space priors impose constraints on the distribution of the latent space by fitting it to some targeted distribution according to the variational decomposition of the compression term of the IB. This type of latent space prior is well known as information dropout [7]. One can also apply the same variational decomposition to the classification term of the IB, where the distribution of labels is supposed to follow some targeted class distribution to maximize the mutual information between the inferred labels and the targeted ones. This type of class label space regularization reflects the adversarial classification used in AAE [2] and CatGAN [3]. In contrast, learnable latent space priors aim at minimizing the need for human expertise in imposing priors on the latent space. Instead, the learnable priors are learned directly from unlabeled data using the auto-encoding (AE) principle. In this way, the learnable priors are supposed to compensate for the lack of labeled data in semi-supervised learning while minimizing the need for hand-crafted control of the latent space distribution.

We demonstrate that several state-of-the-art models such as AAE [2], CatGAN [3], VAE (M1 + M2) [1], etc., can be considered to be instances of the variational IB with the learnable priors. At the same time, the role of different regularizers in the hand-crafted semi-supervised learning is generalized and linked to known frameworks such as information dropout [7].

We evaluate our model on the standard MNIST dataset with both hand-crafted and learnable priors. Besides revealing the impact of the different components of the variational IB factorization, we demonstrate that the proposed model outperforms prior works on this dataset.

Our main contribution is three-fold: (i) We propose a new formulation of IB for semi-supervised classification and use a variational decomposition to convert it into a practically tractable setup with learnable parameters. (ii) We develop the variational IB for two classes of priors on the latent space of the classifier, hand-crafted and learnable, and show its link to the state-of-the-art semi-supervised methods. (iii) We investigate the role of these priors and of different regularizers in the classification, latent and reconstruction spaces for the same fixed architecture under different amounts of training data.

In this work, we implicitly use the concepts of all three forms of the considered regularization frameworks. However, instead of adding regularizers to the baseline classifier as suggested by the framework in [8], we derive the corresponding counterparts from a semi-supervised IB framework. In this way, we try to justify their origin and investigate their impact on the overall classification accuracy for the same system architecture.

In our work, we extend these ideas using the variational approximation approach suggested in [14], which was also applied to unsupervised models in previous work [15,16]. More particularly, we extend the IB framework to semi-supervised classification and, as discussed above, consider two different ways of regularizing the latent space of the classifier, i.e., using either traditional hand-crafted priors or the suggested learnable priors. Although we do not consider semi-supervised clustering and conditional generation in this work, the proposed findings can be extended to these problems in a way similar to prior works such as AAE [2], ADGM [17] and SeGMA [18].

In contrast to the above approaches and following the IB framework, we formulate the semi-supervised classification problem as the training of a classifier that compresses the input $\mathbf{x}$ to some latent data $\mathbf{a}$ via an encoding that is supposed to retain only class-relevant information, which is controlled by a decoder as shown in Figure 1. If the amount of labeled data is sufficiently large, the supervised classifier can achieve this goal. However, when the number of labeled examples is small, such an encoder-decoder pair representing an IB-driven classifier is regularized by latent space and adversarial label space regularizers to fill the gap in training data. The adversarial label space regularization was already used in AAE and CatGAN. The latent space regularization in the scope of the IB framework was reported in [7]. In this paper, we demonstrate that both label and latent space regularizations are instances of the generalized IB formulation developed in Section 3. At the same time, in contrast to the hypothesis that the considered label space and latent space regularizations are the driving factors behind the success of semi-supervised classifiers, we demonstrate that the hand-crafted priors considered in these models cannot fully compensate for the lack of labeled data and lead to relatively poor performance in comparison to a fully supervised system based solely on a cross-entropy metric. For these reasons, we analyze another mechanism of latent space regularization based on learnable priors, as shown in Figure 2 and developed in Section 4. Along this line, we provide an IB formulation of AAE and explain the driving mechanisms behind its success as an instance of IB with learnable priors. Finally, we present several extensions that explain the IB origin and role of adversarial regularization in the reconstruction space.

In this work, our main focus is the latent space regularization for the hand-crafted and learnable priors under the reconstruction setup within the IB framework. Our main task is the semi-supervised classification. We will not consider any augmentation and adversarial techniques besides a simple stochastic encoding based on the addition of data independent noise at the system input or even deterministic encoding without any form of augmentation. The regularization of the label space and reconstruction space is solely based on the terms derived from the IB framework and only includes available labeled and unlabeled data without any form of augmentation. In this way, we want to investigate the role and impact of the latent space regularization as such in the IB-based semi-supervised classification. The usage of the above mentioned techniques of augmentation should be further investigated and will likely provide an additional performance improvement.

We assume that a semi-supervised classifier has access to $N$ labeled training samples ${\left\{{\mathbf{x}}_{m},{\mathbf{c}}_{m}\right\}}_{m=1}^{N}$, where ${\mathbf{x}}_{m}\in {\mathbb{R}}^{D}$ denotes the ${m}^{th}$ data sample and ${\mathbf{c}}_{m}$ the corresponding encoded class label from the set $\{1,2,\cdots ,{M}_{c}\}$, generated from the joint distribution $p(\mathbf{c},\mathbf{x})$, and to non-labeled data samples ${\left\{{\mathbf{x}}_{j}\right\}}_{j=1}^{J}$ with $J\gg N$. To integrate the knowledge about the labeled and non-labeled data at training, one can formulate the IB as:

$${\mathcal{L}}^{\mathrm{HCP}}\left({\mathit{\varphi}}_{\mathrm{a}}\right)={I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{X};\mathbf{A})-{\beta}_{\mathrm{c}}{I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{A};\mathbf{C}),\tag{1}$$

where $\mathbf{a}$ denotes the latent representation, ${\beta}_{\mathrm{c}}$ is a Lagrangian multiplier and the IB terms are defined as ${I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{X};\mathbf{A})={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{x},\mathbf{a})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right)}\right]$ and ${I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{A};\mathbf{C})={\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})}{p\left(\mathbf{c}\right)}\right]\right]$.

According to the above IB formulation the encoder ${q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})$ is trained to minimize the mutual information between $\mathbf{X}$ and $\mathbf{A}$ while ensuring that the decoder ${q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})$ can reliably decide on labels $\mathbf{C}$ from the compressed representation $\mathbf{A}$. The trade-off between the compression and recognition terms is controlled by ${\beta}_{\mathrm{c}}$. Thus, it is assumed that the information retained in the latent representation $\mathbf{A}$ represents the sufficient statistics for the class labels $\mathbf{C}$.
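The compression-versus-relevance trade-off described above can be illustrated on a small discrete problem where both mutual information terms are computable exactly. The following is a minimal sketch, not the authors' implementation: the joint table `p_xc`, the two deterministic encoders and the helper names are hypothetical choices made only for this illustration, and the latent variable is assumed to be discrete.

```python
import math

def mutual_information(joint):
    """I(U;V) in nats for a joint probability table joint[u][v]."""
    pu = [sum(row) for row in joint]
    pv = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (pu[i] * pv[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0.0
    )

def ib_lagrangian(p_xc, q_a_given_x, beta_c):
    """Return I(X;A) - beta_c * I(A;C) for the Markov chain C - X - A."""
    n_x, n_c, n_a = len(p_xc), len(p_xc[0]), len(q_a_given_x[0])
    p_x = [sum(row) for row in p_xc]
    # Joint of (x, a): empirical p(x) times the encoder q(a|x).
    joint_xa = [[p_x[x] * q_a_given_x[x][a] for a in range(n_a)] for x in range(n_x)]
    # Joint of (a, c): marginalize x out of p(x, c) q(a|x).
    joint_ac = [
        [sum(p_xc[x][c] * q_a_given_x[x][a] for x in range(n_x)) for c in range(n_c)]
        for a in range(n_a)
    ]
    return mutual_information(joint_xa) - beta_c * mutual_information(joint_ac)

# Hypothetical toy data: inputs 0 and 1 are mostly class 0, inputs 2 and 3 mostly class 1.
p_xc = [[0.24, 0.01], [0.24, 0.01], [0.01, 0.24], [0.01, 0.24]]
identity_encoder = [[1.0 if a == x else 0.0 for a in range(4)] for x in range(4)]
merging_encoder = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

loss_identity = ib_lagrangian(p_xc, identity_encoder, beta_c=1.0)
loss_merging = ib_lagrangian(p_xc, merging_encoder, beta_c=1.0)
print(loss_identity, loss_merging)
```

In this toy case the encoder that merges inputs with the same dominant class keeps the same amount of label information as the identity encoder while halving the number of latent states, so it attains the lower Lagrangian value.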

However, since the optimal ${q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})$ is unknown, the second term ${I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{A};\mathbf{C})$ is lower bounded by ${I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{C})$ using a variational approximation ${p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})$:

$$\begin{array}{cc}\hfill {I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{A};\mathbf{C})& \triangleq {\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})}{p\left(\mathbf{c}\right)}\right]\right]={\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})}{p\left(\mathbf{c}\right)}\frac{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}\right]\right]\hfill \\ & ={\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}{p\left(\mathbf{c}\right)}\right]\right]+{\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})}{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}\right]\right]\hfill \\ & ={\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}{p\left(\mathbf{c}\right)}\right]\right]+{\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{D}_{\mathrm{KL}}({q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})\left|\right|{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a}))\right]\hfill \\ & \ge {\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}{p\left(\mathbf{c}\right)}\right]\right],\hfill \end{array}\tag{2}$$

where ${D}_{\mathrm{KL}}({q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})\left|\right|{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a}))={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})}{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}\right]$ and the inequality follows from the fact that ${D}_{\mathrm{KL}}({q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{c}\right|\mathbf{a})\left|\right|{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a}))\ge 0$. We denote the term ${I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{C})={\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}{p\left(\mathbf{c}\right)}\right]\right]$. Thus, ${I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{A};\mathbf{C})\ge {I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{C})$.
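The variational lower bound on the classification term can be checked numerically for discrete variables. The sketch below is an illustration under stated assumptions: the joint table `p_ac` over latent states and classes and the `imperfect_decoder` are hypothetical, and the latent variable is taken to be discrete so that the exact posterior is available in closed form.

```python
import math

def classification_bound(p_ac, decoder):
    """E_{p(a,c)}[log decoder(c|a) / p(c)] in nats for a joint table p_ac[a][c]."""
    n_a, n_c = len(p_ac), len(p_ac[0])
    p_c = [sum(p_ac[a][c] for a in range(n_a)) for c in range(n_c)]
    return sum(
        p_ac[a][c] * math.log(decoder[a][c] / p_c[c])
        for a in range(n_a)
        for c in range(n_c)
        if p_ac[a][c] > 0.0
    )

# Hypothetical joint distribution of latent state a (rows) and class c (columns).
p_ac = [[0.40, 0.10], [0.05, 0.45]]
# Exact posterior q(c|a): plugging it in recovers the true mutual information.
true_posterior = [[p / sum(row) for p in row] for row in p_ac]
# Hypothetical variational decoder p_theta(c|a) that is deliberately miscalibrated.
imperfect_decoder = [[0.7, 0.3], [0.3, 0.7]]

exact_mi = classification_bound(p_ac, true_posterior)
bound = classification_bound(p_ac, imperfect_decoder)

# The gap equals the expected KL divergence between q(c|a) and p_theta(c|a).
expected_kl = sum(
    p_ac[a][c] * math.log(true_posterior[a][c] / imperfect_decoder[a][c])
    for a in range(2)
    for c in range(2)
)
print(exact_mi, bound, expected_kl)
```

The difference between the exact value and the bound matches the expected KL term, which is why training the decoder to tighten the bound amounts to fitting ${p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})$ to the unknown posterior.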

Thus, the IB (1) can be reformulated as:

$${\mathcal{L}}^{{\mathrm{HCP}}_{L}}({\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}})={I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{X};\mathbf{A})-{\beta}_{\mathrm{c}}{I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{C}).\tag{3}$$

The considered IB is schematically shown in Figure 1 and we will proceed next with the detailed development of each component of the IB formulation.

The first mutual information term ${I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{X};\mathbf{A})$ in (3) can be decomposed using a factorization by a parametric marginal distribution ${p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)$ that represents a prior on the latent representation $\mathbf{a}$:

$$\begin{array}{cc}\hfill {I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{X};\mathbf{A})& ={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{x},\mathbf{a})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{x},\mathbf{a})}{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right){p}_{\mathcal{D}}\left(\mathbf{x}\right)}\right]={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{x},\mathbf{a})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right)}\frac{{p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)}\right]\hfill \\ & ={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\underset{{\mathcal{D}}_{\mathrm{a}|\mathrm{x}}}{\underbrace{\left[{D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{X}=\mathbf{x})\parallel {p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)\right)\right]}}-\underset{{\mathcal{D}}_{\mathrm{a}}}{\underbrace{{D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)\right)}}\phantom{\rule{4.pt}{0ex}},\hfill \end{array}\tag{4}$$

where the first term denotes the KL-divergence ${\mathcal{D}}_{\mathrm{a}|\mathrm{x}}\triangleq {D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{X}=\mathbf{x})\parallel {p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)\right)={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}{{p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)}\right]$ and the second term denotes the KL-divergence ${\mathcal{D}}_{\mathrm{a}}\triangleq {D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)\right)={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right)}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{a}}}\left(\mathbf{a}\right)}\right]$.
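The decomposition of the compression term into two KL divergences holds for any prior with full support, which can be verified on a discrete example. This is a minimal sketch under assumptions: the input distribution `p_x`, the stochastic encoder `q_a_given_x` and the prior `prior_a` are hypothetical values chosen only to exercise the identity.

```python
import math

def kl(p, q):
    """KL divergence D_KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical empirical input distribution, stochastic encoder and latent prior.
p_x = [0.5, 0.3, 0.2]
q_a_given_x = [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
prior_a = [0.5, 0.25, 0.25]

# Aggregated latent marginal q(a) = sum_x p(x) q(a|x).
q_a = [sum(p_x[x] * q_a_given_x[x][a] for x in range(3)) for a in range(3)]

# Direct definition: I(X;A) = E_x[ KL(q(a|x) || q(a)) ].
mi_direct = sum(p_x[x] * kl(q_a_given_x[x], q_a) for x in range(3))

# Decomposition through the prior: E_x[D_{a|x}] - D_a.
d_a_given_x = sum(p_x[x] * kl(q_a_given_x[x], prior_a) for x in range(3))
d_a = kl(q_a, prior_a)
mi_decomposed = d_a_given_x - d_a
print(mi_direct, mi_decomposed)
```

Because ${\mathcal{D}}_{\mathrm{a}}\ge 0$, the averaged conditional term alone upper bounds the mutual information, whatever prior is chosen.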

It should be pointed out that the encoding ${q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})$ can be either stochastic or deterministic. Stochastic encoding ${q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})$ can be implemented via: (a) multiplicative encoding applied to the input $\mathbf{x}$ as $\mathbf{a}={f}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{x}\odot \mathit{\epsilon})$ or in the latent space $\mathbf{a}={f}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{x}\right)\odot \mathit{\epsilon}$, where ${f}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{x}\right)$ is the output of the encoder, ⊙ denotes the element-wise product and $\mathit{\epsilon}$ follows some data-independent or data-dependent distribution as in information dropout [7]; (b) additive encoding applied to the input $\mathbf{x}$ as $\mathbf{a}={f}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{x}+\mathit{\epsilon})$ with data-independent perturbations, e.g., as in PixelGAN [19], or in the latent space with generally data-dependent perturbations of the form $\mathbf{a}={f}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{x}\right)+{\sigma}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{x}\right)\odot \mathit{\epsilon}$, where ${f}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{x}\right)$ and ${\sigma}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{x}\right)$ are outputs of the encoder and $\mathit{\epsilon}$ is assumed to be a zero-mean unit-variance vector, as in VAE [1]; or (c) concatenative/mixing encoding $\mathbf{a}={f}_{{\mathit{\varphi}}_{\mathrm{a}}}\left([\mathbf{x},\mathit{\epsilon}]\right)$ that is generally applied at the input of the encoder. Deterministic encoding is based on the mapping $\mathbf{a}={f}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{x}\right)$, i.e., no randomization is introduced, as in one of the encoding modalities of AAE [2].
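The deterministic, additive and multiplicative encoding modes listed above can be sketched in a few lines. Everything in this example is an assumption made for illustration: `f_phi` is a plain linear map standing in for a trained encoder network, the noise scale `sigma` is a fixed constant rather than an encoder output ${\sigma}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{x}\right)$, the multiplicative noise is drawn from a log-normal distribution, and the concatenative mode is omitted.

```python
import random

def f_phi(x, weights):
    """Hypothetical encoder: a linear map standing in for a trained network."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def encode(x, weights, mode, rng, sigma=0.1):
    """Return a latent vector a for input x under the chosen encoding mode."""
    if mode == "deterministic":
        return f_phi(x, weights)
    if mode == "additive_input":
        # a = f(x + eps), data-independent Gaussian noise at the input.
        return f_phi([xi + sigma * rng.gauss(0.0, 1.0) for xi in x], weights)
    if mode == "multiplicative_input":
        # a = f(x * eps), positive multiplicative noise at the input.
        return f_phi([xi * rng.lognormvariate(0.0, sigma) for xi in x], weights)
    mean = f_phi(x, weights)
    if mode == "additive_latent":
        # a = f(x) + sigma * eps; sigma is constant here by assumption.
        return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    if mode == "multiplicative_latent":
        # a = f(x) * eps, noise applied to the encoder output.
        return [m * rng.lognormvariate(0.0, sigma) for m in mean]
    raise ValueError(f"unknown encoding mode: {mode}")

weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
x = [1.0, 2.0, 3.0]
rng = random.Random(0)
modes = ("deterministic", "additive_input", "multiplicative_input",
         "additive_latent", "multiplicative_latent")
latents = {mode: encode(x, weights, mode, rng) for mode in modes}
for mode, a in latents.items():
    print(mode, a)
```

Passing an explicit `random.Random` instance keeps the stochastic encodings reproducible, which makes it easier to compare regularizers on the same noise realizations.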

In this section, we factorize the second term in (3) to address the semi-supervised training, i.e., to integrate the knowledge of both non-labeled and labeled data available at training:

$$\begin{array}{cc}\hfill {I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{C})& \triangleq {\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}\frac{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}{p\left(\mathbf{c}\right)}\frac{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)}\right]\right]\hfill \\ & =-{\mathbb{E}}_{p\left(\mathbf{c}\right)}\left[\mathrm{log}{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)\right]-{\mathbb{E}}_{p\left(\mathbf{c}\right)}\left[\mathrm{log}\frac{p\left(\mathbf{c}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)}\right]+{\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})\right]\right]\hfill \\ & =H(p\left(\mathbf{c}\right);{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right))-{D}_{\mathrm{KL}}\left(p\left(\mathbf{c}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)\right)-{H}_{{\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{C}\right|\mathbf{A}),\hfill \end{array}\tag{5}$$

with $H(p\left(\mathbf{c}\right);{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right))=-{\mathbb{E}}_{p\left(\mathbf{c}\right)}\left[\mathrm{log}{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)\right]$ denoting the cross-entropy between $p\left(\mathbf{c}\right)$ and ${p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)$, and ${\mathcal{D}}_{\mathrm{c}}\triangleq {D}_{\mathrm{KL}}\left(p\left(\mathbf{c}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)\right)={\mathbb{E}}_{p\left(\mathbf{c}\right)}\left[\mathrm{log}\frac{p\left(\mathbf{c}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)}\right]$ denoting the KL-divergence between the prior class label distribution $p\left(\mathbf{c}\right)$ and the estimated one ${p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)$. One can assume different forms of encoding of the labels $\mathbf{c}$, but one of the most often used forms is one-hot label encoding, which leads to the categorical distribution $p\left(\mathbf{c}\right)=\mathrm{cat}\left(\mathbf{c}\right)$.

Finally, the conditional entropy is defined as ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}\triangleq {H}_{{\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{C}\right|\mathbf{A})=-{\mathbb{E}}_{p(\mathbf{c},\mathbf{x})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[\mathrm{log}{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})\right]\right]$.

Since $H(p\left(\mathbf{c}\right);{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right))\ge 0$, one can lower bound (5) as ${I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{C})\ge {I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}^{\mathrm{L}}(\mathbf{A};\mathbf{C})$ where:

$$\begin{array}{c}\hfill {I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}^{\mathrm{L}}(\mathbf{A};\mathbf{C})\triangleq -\underset{{\mathcal{D}}_{\mathrm{c}}}{\underbrace{{D}_{\mathrm{KL}}\left(p\left(\mathbf{c}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)\right)}}-\underset{{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}}{\underbrace{{H}_{{\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{C}\right|\mathbf{A})}}\phantom{\rule{4.pt}{0ex}}.\end{array}\tag{6}$$
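The two terms of this lower bound can be estimated from classifier outputs on a mini-batch. The sketch below is an assumption-laden illustration rather than the procedure used in the paper: it estimates ${\mathcal{D}}_{\mathrm{c}}$ directly from the batch-averaged predicted class probabilities instead of using an adversarial discriminator, and all numerical values and function names are hypothetical.

```python
import math

def cross_entropy_term(labels, probs):
    """D_c_chat: average negative log-probability assigned to the true labels."""
    return -sum(math.log(p[c]) for c, p in zip(labels, probs)) / len(labels)

def label_prior_kl(prior, probs):
    """D_c: KL(p(c) || p_theta(c)), with p_theta(c) taken as the batch mean prediction."""
    n = len(probs)
    marginal = [sum(p[k] for p in probs) / n for k in range(len(prior))]
    return sum(pk * math.log(pk / mk) for pk, mk in zip(prior, marginal) if pk > 0.0)

# Hypothetical classifier outputs for two labeled and four unlabeled samples.
labels = [0, 1]
probs_labeled = [[0.9, 0.1], [0.2, 0.8]]
probs_unlabeled = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.4], [0.4, 0.6]]
uniform_prior = [0.5, 0.5]

d_cc = cross_entropy_term(labels, probs_labeled)
d_c = label_prior_kl(uniform_prior, probs_unlabeled)
lower_bound = -d_c - d_cc

# A classifier that collapses to one class is penalized by the D_c term.
d_c_collapsed = label_prior_kl(uniform_prior, [[0.9, 0.1]] * 4)
print(d_cc, d_c, lower_bound, d_c_collapsed)
```

The label-prior term only needs predictions, not labels, which is what allows the unlabeled samples to contribute to training.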

Summarizing the above variational decomposition of (3) with the terms (4) and (6), we will proceed with four practical scenarios.

Supervised training without latent space regularization (**baseline**) is based on the term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ in (6):

$${\mathcal{L}}_{\mathrm{S}-\mathrm{NoReg}}^{\mathrm{HCP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}})={\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}.\tag{7}$$

Semi-supervised training without latent space regularization is based on the terms ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and ${\mathcal{D}}_{\mathrm{c}}$ in (6):

$${\mathcal{L}}_{\mathrm{SS}-\mathrm{NoReg}}^{\mathrm{HCP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}})={\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}.\tag{8}$$

Supervised training with latent space regularization is based on the term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ in (6) and either the term ${\mathcal{D}}_{\mathrm{a}|\mathrm{x}}$ or ${\mathcal{D}}_{\mathrm{a}}$ or jointly ${\mathcal{D}}_{\mathrm{a}|\mathrm{x}}$ and ${\mathcal{D}}_{\mathrm{a}}$ in (4):

$${\mathcal{L}}_{\mathrm{S}-\mathrm{Reg}}^{\mathrm{HCP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathcal{D}}_{\mathrm{a}|\mathrm{x}}\right]+{\mathcal{D}}_{\mathrm{a}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}.\tag{9}$$

Semi-supervised training with latent space regularization deploys all terms in (4) and (6):

$${\mathcal{L}}_{\mathrm{SS}-\mathrm{Reg}}^{\mathrm{HCP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathcal{D}}_{\mathrm{a}|\mathrm{x}}\right]+{\mathcal{D}}_{\mathrm{a}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}.\tag{10}$$

The empirical evaluation of these setups on the MNIST dataset is given in Section 5. The same encoder and decoder architecture was used to establish the impact of each term as a function of the amount of available labeled data.
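The four hand-crafted-prior objectives differ only in which already-computed terms they combine, which can be summarized as a small dispatch function. This is a sketch for orientation only: the function name, the scenario keys and the numerical term values are hypothetical, and the terms are assumed to have been estimated elsewhere (for example by discriminators or closed-form KL expressions).

```python
def hcp_loss(scenario, d_cc, d_c=0.0, d_a_given_x=0.0, d_a=0.0, beta_c=1.0):
    """Combine precomputed terms into one of the four hand-crafted-prior objectives.

    d_cc        : cross-entropy term D_c_chat on labeled data
    d_c         : label-space KL term D_c
    d_a_given_x : averaged conditional latent KL term E[D_{a|x}]
    d_a         : marginal latent KL term D_a
    """
    if scenario == "S-NoReg":    # supervised baseline
        return d_cc
    if scenario == "SS-NoReg":   # adds the label-space regularizer
        return d_cc + d_c
    if scenario == "S-Reg":      # adds the latent-space regularizers
        return d_a_given_x + d_a + beta_c * d_cc
    if scenario == "SS-Reg":     # all terms
        return d_a_given_x + d_a + beta_c * d_cc + beta_c * d_c
    raise ValueError(f"unknown scenario: {scenario}")

# Hypothetical term values for one training step.
terms = dict(d_cc=0.30, d_c=0.05, d_a_given_x=1.20, d_a=0.10)
losses = {
    name: hcp_loss(name, beta_c=2.0, **terms)
    for name in ("S-NoReg", "SS-NoReg", "S-Reg", "SS-Reg")
}
print(losses)
```

Keeping the terms separate in this way makes it straightforward to switch individual regularizers on and off for the same architecture, which is the kind of ablation the four scenarios describe.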

In this section, we extend the results obtained for the hand-crafted priors to the learnable priors. Instead of applying the hand-crafted regularization of the latent representation $\mathbf{a}$ as suggested by the IB (3) and shown in Figure 1, we assume that the latent representation $\mathbf{a}$ is regularized by a specially designed AE as shown in Figure 2. The AE-based regularization has two components: (i) the latent space $\mathbf{z}$ regularization and (ii) the observation space regularization. The design and training of this latent space regularizer in the form of an AE is guided by its own IB. In the general case, all elements of the AE, i.e., its encoder-decoder pair and its latent and observation space regularizers, are conditioned on the learned class label $\mathbf{c}$. The resulting Lagrangian with the learnable prior is:

$${\mathcal{L}}^{\mathrm{LP}}({\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}})=\underset{\mathrm{A}}{\underbrace{{I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{Z}|\mathbf{C})}}-{\beta}_{\mathrm{x}}\underset{\mathrm{B}}{\underbrace{{I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})}}\phantom{\rule{4.pt}{0ex}}-{\beta}_{\mathrm{c}}\underset{\mathrm{C}}{\underbrace{{I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}^{\mathrm{L}}(\mathbf{A};\mathbf{C})}},\tag{11}$$

where ${\beta}_{\mathrm{x}}$ is a Lagrangian multiplier controlling the reconstruction of $\mathbf{x}$ at the decoder and ${\beta}_{\mathrm{c}}$ is the same as in (1). Formally, one should consider ${I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})$ for the term A. However, since ${I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})\le {I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{Z}|\mathbf{C})$ due to the Markovianity of the considered architecture (Data Processing Inequality, Theorem 2.8.1 in [20]), we consider the decomposition starting from $\mathbf{A}$.

The terms A and B, conditioned on the class $\mathbf{c}$, play the role of the latent space regularizer by imposing learnable constraints on the vector $\mathbf{a}$. These two terms correspond to the hand-crafted counterpart ${I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{X};\mathbf{A})$ in (3). The term C in the learnable IB formulation corresponds to the classification part of the hand-crafted IB in (3) and can be factorized along the same lines as in (6). Therefore, we proceed with the factorization of the terms A and B.

One can also consider the following IB formulation with the learnable priors with no conditioning on $\mathbf{c}$ in term A in (11) leading to an unconditional counterpart D below that can be viewed as an IB generalization of semi-supervised AAE [2]:

$${\mathcal{L}}_{\mathrm{AAE}}^{\mathrm{LP}}({\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}})=\underset{\mathrm{D}}{\underbrace{{I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}}}(\mathbf{A};\mathbf{Z})}}-{\beta}_{\mathrm{x}}\underset{\mathrm{B}}{\underbrace{{I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})}}\phantom{\rule{4.pt}{0ex}}-{\beta}_{\mathrm{c}}\underset{\mathrm{C}}{\underbrace{{I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}^{\mathrm{L}}(\mathbf{A};\mathbf{C})}}.\tag{12}$$

We will denote ${p}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{x},\mathbf{a},\mathbf{c},\mathbf{z})={p}_{\mathcal{D}}\left(\mathbf{x}\right){q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x}){p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a}){q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a},\mathbf{c})$ and decompose the term A in (11) using variational factorization as:

$$\begin{array}{cc}\hfill {I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{Z}|\mathbf{C})& ={\mathbb{E}}_{{p}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{x},\mathbf{a},\mathbf{c},\mathbf{z})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a},\mathbf{c})}{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{c})}\frac{{p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)}\right]\hfill \\ & ={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[{\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}\underset{{\mathcal{D}}_{\mathrm{z}|\mathrm{a},\mathrm{c}}}{\underbrace{\left[{D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{A}=\mathbf{a},\mathbf{C}=\mathbf{c})\parallel {p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)\right)\right]}}\phantom{\rule{4.pt}{0ex}}\right]\right]\hfill \\ & -{\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[{\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}\underset{{\mathcal{D}}_{\mathrm{z}|\mathrm{c}}}{\underbrace{\left[{D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{C}=\mathbf{c})\parallel {p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)\right)\right]}}\phantom{\rule{4.pt}{0ex}}\right]\right],\hfill \end{array}$$

where ${\mathcal{D}}_{\mathrm{z}|\mathrm{a},\mathrm{c}}\triangleq {D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a},\mathbf{c})\parallel {p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)\right)={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a},\mathbf{c})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a},\mathbf{c})}{{p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)}\right]$ and ${\mathcal{D}}_{\mathrm{z}|\mathrm{c}}\triangleq {D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{c})\parallel {p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)\right)={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{c})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{c})}{{p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)}\right]$ denote the KL-divergence terms and ${q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{c})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a},\mathbf{c})\right]\right]$.

Denoting ${p}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{x},\mathbf{a},\mathbf{c},\mathbf{z})={p}_{\mathcal{D}}\left(\mathbf{x}\right){q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x}){p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a}){q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a},\mathbf{c}){p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right|\mathbf{z},\mathbf{c})$, we decompose the term B in (11) as:

$$\begin{array}{cc}\hfill {I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})& ={\mathbb{E}}_{{p}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{x},\mathbf{a},\mathbf{c},\mathbf{z})}\left[\mathrm{log}\frac{{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right|\mathbf{z},\mathbf{c})}{{p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{c})}\frac{{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)}\right]\hfill \\ & ={\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)}\left[H({p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{c});{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right))\right]\hfill \\ & -{\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)}\underset{{\mathcal{D}}_{\mathrm{x}|\mathrm{c}}}{\underbrace{\left[{D}_{\mathrm{KL}}\left({p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{C}=\mathbf{c})\parallel {p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)\right)\right]}}\phantom{\rule{4.pt}{0ex}}-\underset{{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}}{\underbrace{{H}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{X}\right|\mathbf{Z},\mathbf{C})}}\phantom{\rule{4.pt}{0ex}},\hfill \end{array}$$

where ${p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})\right]\right]$. The terms are defined as $H({p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{c});{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right))=-{\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{c})}\left[\mathrm{log}{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)\right]$, ${\mathcal{D}}_{\mathrm{x}|\mathrm{c}}\triangleq {D}_{\mathrm{KL}}\left({p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{C}=\mathbf{c})\parallel {p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)\right)={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{c})}\left[\mathrm{log}\frac{{p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{c})}{{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)}\right]$ and ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}\triangleq {H}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{X}\right|\mathbf{Z},\mathbf{C})=-{\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[{\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a},\mathbf{c})}\left[\mathrm{log}{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right|\mathbf{z},\mathbf{c})\right]\right]\right]\right]$.
Since ${\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)}\left[H({p}_{\mathcal{D}}\left(\mathbf{x}\right|\mathbf{c});{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right))\right]\ge 0$, we can lower bound ${I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{X};\mathbf{Z}|\mathbf{C})\ge {I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}}}^{\mathrm{L}}(\mathbf{X};\mathbf{Z}|\mathbf{C})\triangleq -{\mathcal{D}}_{\mathrm{x}|\mathrm{c}}-{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$.

Summarizing the above variational decomposition of (11) with the terms (13) and (14), we will consider semi-supervised training with latent space regularization as:

$$\begin{array}{cc}\hfill {\mathcal{L}}_{\mathrm{SS}-\mathrm{Reg}}^{\mathrm{LP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}})& ={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[{\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}\left[{\mathcal{D}}_{\mathrm{z}|\mathrm{a},\mathrm{c}}\right]\right]\right]+{\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})}\left[{\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right|\mathbf{a})}\left[{\mathcal{D}}_{\mathrm{z}|\mathrm{c}}\right]\right]\right]\hfill \\ & +{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{x}}{\mathbb{E}}_{{p}_{{\mathit{\theta}}_{\mathrm{c}}}\left(\mathbf{c}\right)}\left[{\mathcal{D}}_{\mathrm{x}|\mathrm{c}}\right]+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}.\hfill \end{array}$$

To create a link to the semi-supervised AAE [2], we also consider (12), where all latent and reconstruction space regularizers are independent of $\mathbf{c}$, i.e., do not contain conditioning on $\mathbf{c}$.

Semi-supervised training with latent space regularization and MSE reconstruction based on (12):

$${\mathcal{L}}_{\mathrm{SS}-\mathrm{AAE}}^{\mathrm{LP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}})={\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}},$$

where ${\mathcal{D}}_{\mathrm{z}}\triangleq {D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)\right)={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right)}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)}\right]$.

Semi-supervised training with latent space regularization and with MSE and adversarial reconstruction based on (12) deploys all terms:

$${\mathcal{L}}_{\mathrm{SS}-{\mathrm{AAE}}_{\mathrm{complete}}}^{\mathrm{LP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}})={\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}},$$

where ${\mathcal{D}}_{\mathrm{x}}\triangleq {D}_{\mathrm{KL}}\left({p}_{\mathcal{D}}\left(\mathbf{x}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)\right)={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[\mathrm{log}\frac{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)}\right]$.

The considered HCP and LP models can be linked with several state-of-the-art unsupervised models such as VAE [21,22], $\beta $-VAE [23], AAE [2] and BIB-AE [15], and with semi-supervised models such as AAE [2], CatGAN [3], VAE (M1 + M2) [1] and SeGMA [18].

The proposed LP model (11) generalizes unsupervised models without the categorical latent representation. In addition, unsupervised models in the form of an auto-encoder are used as a latent space regularizer in the LP setup. For these reasons, we will briefly consider four models of interest, namely VAE, $\beta $-VAE, AAE, and BIB-AE.

Before we proceed with the analysis, we will define an unsupervised IB for these models. We will assume the fused encoders ${q}_{{\mathit{\varphi}}_{\mathrm{a}}}\left(\mathbf{a}\right|\mathbf{x})$ and ${q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{a})$ without conditioning on $\mathbf{c}$ in the inference model according to Figure 2. We also assume no conditioning on $\mathbf{c}$ in the generative model.

The Lagrangian of unsupervised IB is defined according to [15]:

$$\begin{array}{cc}\hfill {\mathcal{L}}^{{\mathrm{U}}_{L}}({\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})& ={I}_{{\mathit{\varphi}}_{\mathrm{z}}}(\mathbf{X};\mathbf{Z})-{\beta}_{\mathrm{x}}{I}_{{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{Z};\mathbf{X}),\hfill \end{array}$$

where similarly to the supervised counterpart (4), we define the first term as:

$$\begin{array}{cc}\hfill {I}_{{\mathit{\varphi}}_{\mathrm{z}}}(\mathbf{X};\mathbf{Z})& ={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}(\mathbf{x},\mathbf{z})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}(\mathbf{x},\mathbf{z})}{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right){p}_{\mathcal{D}}\left(\mathbf{x}\right)}\right]={\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}(\mathbf{x},\mathbf{z})}\left[\mathrm{log}\frac{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{x})}{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right)}\frac{{p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)}\right]\hfill \\ & ={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\underset{{\mathcal{D}}_{\mathrm{z}|\mathrm{x}}}{\underbrace{\left[{D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{X}=\mathbf{x})\parallel {p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)\right)\right]}}\phantom{\rule{4.pt}{0ex}}-\underset{{\mathcal{D}}_{\mathrm{z}}}{\underbrace{{D}_{\mathrm{KL}}\left({q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{z}}}\left(\mathbf{z}\right)\right)}}\phantom{\rule{4.pt}{0ex}},\hfill \end{array}$$

and similarly to (14) the second term is defined as:

$$\begin{array}{cc}\hfill {I}_{{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{Z};\mathbf{X})& ={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathbb{E}}_{{q}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{z}\right|\mathbf{x})}\left[\mathrm{log}\frac{{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right|\mathbf{z})}{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\frac{{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)}{{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)}\right]\right]\hfill \\ & =H({p}_{\mathcal{D}}\left(\mathbf{x}\right);{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right))-\underset{{\mathcal{D}}_{\mathrm{x}}}{\underbrace{{D}_{\mathrm{KL}}\left({p}_{\mathcal{D}}\left(\mathbf{x}\right)\parallel {p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right)\right)}}\phantom{\rule{4.pt}{0ex}}-\underset{{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}}{\underbrace{{H}_{{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{X}\right|\mathbf{Z})}}\phantom{\rule{4.pt}{0ex}},\hfill \end{array}$$

where the definition of all terms should follow from the above equations. Since $H({p}_{\mathcal{D}}\left(\mathbf{x}\right);{p}_{{\mathit{\theta}}_{\mathrm{x}}}\left(\mathbf{x}\right))\ge 0$, we can lower bound ${I}_{{\mathit{\varphi}}_{\mathrm{z}},{\mathit{\theta}}_{\mathrm{x}}}(\mathbf{Z};\mathbf{X})\ge -{\mathcal{D}}_{\mathrm{x}}-{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$.
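The decomposition of ${I}_{{\mathit{\varphi}}_{\mathrm{z}}}(\mathbf{X};\mathbf{Z})$ into ${\mathcal{D}}_{\mathrm{z}|\mathrm{x}}$ and ${\mathcal{D}}_{\mathrm{z}}$ can be checked numerically on a toy discrete example. The following is only a minimal sketch: the distributions `p_x`, `q_z_given_x` and `prior_z` and all function names are hypothetical and chosen for illustration, with small categorical variables standing in for the continuous ones of the paper.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) between two discrete distributions, in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(p_x, q_z_given_x):
    """I(X;Z) computed directly from the joint q(x,z) = p_D(x) q(z|x)."""
    joint = p_x[:, None] * q_z_given_x
    q_z = joint.sum(axis=0)                      # aggregated posterior q(z)
    product = p_x[:, None] * q_z[None, :]
    return kl(joint.ravel(), product.ravel())

def mutual_information_via_prior(p_x, q_z_given_x, prior_z):
    """I(X;Z) written as E_x[D_{z|x}] - D_z for an arbitrary prior p(z)."""
    q_z = p_x @ q_z_given_x
    d_z_given_x = sum(px * kl(row, prior_z) for px, row in zip(p_x, q_z_given_x))
    d_z = kl(q_z, prior_z)
    return d_z_given_x - d_z

# Hypothetical toy distributions (illustration only).
p_x = np.array([0.5, 0.3, 0.2])
q_z_given_x = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
prior_z = np.array([0.25, 0.25, 0.5])

direct = mutual_information(p_x, q_z_given_x)
via_prior = mutual_information_via_prior(p_x, q_z_given_x, prior_z)
```

Both routes give the same value whatever prior is used, because the prior cancels in the difference of the two KL terms. It only matters once one of the two terms is dropped or bounded.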

Having defined the unsupervised IB variational bounded decomposition, we can proceed with an analysis of the related state-of-the-art methods along the lines of analysis introduced in Summary part of Section 2.

VAE [21,22] and $\beta $-VAE [23]:

- The targeted tasks: auto-encoding and generation.
- The architecture in terms of the latent space representation: the encoder outputs two vectors representing the mean and standard deviation vectors that control a new latent representation $\mathbf{z}={f}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)+{\sigma}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)\odot \mathit{\epsilon}$, where ${f}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)$ and ${\sigma}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)$ are outputs of the encoder and $\mathit{\epsilon}$ is assumed to be a zero mean unit variance Gaussian vector.
- The usage of IB or other underlying frameworks: both VAE and $\beta $-VAE use the evidence lower bound (ELBO) and are not derived from the IB framework. However, it can be shown [15] that the Lagrangian (18) can be reformulated for VAE and $\beta $-VAE as: $${\mathcal{L}}_{\beta -\mathrm{VAE}}({\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathcal{D}}_{\mathrm{z}|\mathrm{x}}\right]+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}},$$
- The label space regularization: does not apply here due to the unsupervised setting.
- The latent space regularization: is based on the hand-crafted prior with Gaussian pdf.
- The reconstruction space regularization in case of reconstruction loss: is based on the mean square error (MSE) counterpart of ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ that corresponds to the Gaussian likelihood assumption.
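The $\beta $-VAE items above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the prior is assumed to be a zero mean unit variance Gaussian so that ${\mathcal{D}}_{\mathrm{z}|\mathrm{x}}$ has a closed form, and the MSE stands in for ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ up to the scale and constant of the Gaussian likelihood.

```python
import numpy as np

def reparameterize(mean, std, rng):
    """Sample z = f(x) + sigma(x) * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + std * eps

def gaussian_kl_to_standard_normal(mean, std):
    """Closed-form D_KL(N(mean, diag(std^2)) || N(0, I)) for each sample."""
    var = std ** 2
    return 0.5 * np.sum(var + mean ** 2 - 1.0 - np.log(var), axis=-1)

def beta_vae_loss(x, x_hat, mean, std, beta_x):
    """E[D_{z|x}] + beta_x * D_{x x_hat}, with MSE used for the second term."""
    d_z_given_x = np.mean(gaussian_kl_to_standard_normal(mean, std))
    d_x_xhat = np.mean(np.sum((x - x_hat) ** 2, axis=-1))
    return float(d_z_given_x + beta_x * d_x_xhat)

# Hypothetical encoder outputs for a batch of 4 samples and 2 latent dimensions.
rng = np.random.default_rng(0)
mean = np.ones((4, 2))
std = np.ones((4, 2))
z = reparameterize(mean, std, rng)
```

Setting `beta_x` so that the relative weight of the two terms matches the ELBO recovers the plain VAE objective. Other values trade off compression against reconstruction.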

AAE [2]:

- The targeted tasks: auto-encoding and generation.
- The architecture in terms of the latent space representation: the encoder outputs one vector in stochastic or deterministic way as $\mathbf{z}={f}_{{\mathit{\varphi}}_{\mathrm{z}}}\left(\mathbf{x}\right)$.
- The usage of IB or other underlying frameworks: AAE is not derived from the IB framework. As shown in [15], the AAE equivalent Lagrangian (18) can be linked with the IB formulation and defined as:$${\mathcal{L}}_{\mathrm{AAE}}({\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})={\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}},$$
- The label space regularization: does not apply here due to the unsupervised setting.
- The latent space regularization: is based on the hand-crafted prior with zero mean unit variance Gaussian pdf for each dimension.
- The reconstruction space regularization in case of reconstruction loss: is based on the MSE.

BIB-AE [15]:

- The targeted tasks: auto-encoding and generation.
- The architecture in terms of the latent space representation: the encoder outputs one vector using any form of stochastic or deterministic encoding.
- The usage of IB or other underlying frameworks: the BIB-AE is derived from the unsupervised IB (18) and its Lagrangian is defined as:$${\mathcal{L}}_{\mathrm{BIB}-\mathrm{AE}}({\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathcal{D}}_{\mathrm{z}|\mathrm{x}}\right]-{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}.$$
- The label space regularization: does not apply here due to the unsupervised setting.
- The latent space regularization: is based on the hand-crafted prior with Gaussian pdf applied to both conditional and unconditional terms. In fact, the prior for ${\mathcal{D}}_{\mathrm{z}}$ can be arbitrary, but ${\mathcal{D}}_{\mathrm{z}|\mathrm{x}}$ requires an analytical parametrisation.
- The reconstruction space regularization in case of reconstruction loss: is based on the MSE counterpart of ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ and the discriminator ${\mathcal{D}}_{\mathrm{x}}$. This is a distinctive feature in comparison to VAE and AAE.

In summary, BIB-AE includes VAE and AAE as two particular cases. In turn, it should be clear that the regularizer of the semi-supervised model considered in this paper resembles the BIB-AE model and extends it to the conditional case that will be considered below.
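To compare the three unsupervised Lagrangians side by side, the sketch below transcribes them exactly as stated above, given already-estimated scalar values of the individual terms. The function name, its arguments and the example numbers are hypothetical. How each term is estimated (analytically or with a discriminator) is outside the scope of this snippet.

```python
def unsupervised_lagrangian(model, d_z_given_x, d_z, d_x, d_x_xhat, beta_x):
    """Combine pre-computed terms into the Lagrangian of the chosen model.

    d_z_given_x : E_x[D_{z|x}], conditional latent KL term
    d_z         : D_z, KL between aggregated posterior and prior
    d_x         : D_x, KL between data and generated distributions
    d_x_xhat    : D_{x x_hat}, reconstruction term (e.g., MSE)
    """
    if model == "vae":      # also beta-VAE, depending on beta_x
        return d_z_given_x + beta_x * d_x_xhat
    if model == "aae":
        return d_z + beta_x * d_x_xhat
    if model == "bib-ae":
        return d_z_given_x - d_z + beta_x * d_x + beta_x * d_x_xhat
    raise ValueError(f"unknown model: {model}")

# Hypothetical term values, for illustration only.
terms = dict(d_z_given_x=2.0, d_z=0.5, d_x=0.75, d_x_xhat=1.0, beta_x=2.0)
losses = {name: unsupervised_lagrangian(name, **terms)
          for name in ("vae", "aae", "bib-ae")}
```

Written this way, the difference between the models comes down to which latent and reconstruction terms are switched on, which is the view taken in the comparison above.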

The proposed LP model (11) is also related to several state-of-the-art semi-supervised models used for classification. As pointed out in the introduction, we only consider available labeled and unlabeled samples in our analysis. The extension to the augmented samples, i.e., permutations, synthetically generated samples, i.e., fakes, and the adversarial examples for both latent space and label space regularizations can be performed along the same lines of analysis, but it goes beyond the scope and focus of this paper.

Semi-supervised AAE [2]:

- The targeted tasks: auto-encoding, clustering, (conditional) generation and classification.
- The architecture in terms of the latent space representation: the encoder outputs two vectors representing the discrete class and continuous type of style. The class is assumed to follow a categorical distribution and the style a Gaussian one. Both constraints on the prior distributions are ensured using an adversarial framework with two corresponding discriminators. In its original setting, AAE does not use any augmented samples or adversarial examples. Remark: It should be pointed out that in our architecture we consider the latent space to be represented by the vector $\mathbf{a}$, which is fed to the classifier and regularizer; this gives a natural consideration of IB and the corresponding regularization and priors. In the case of semi-supervised AAE, the latent space is represented by the class and style representations directly. Therefore, to make it coherent with our case, one should assume that the class vector of semi-supervised AAE corresponds to the vector $\mathbf{c}$ and the style vector to the vector $\mathbf{z}$.
- The usage of IB or other underlying frameworks: AAE is not derived from the IB framework. However, as shown in our analysis the semi-supervised AAE represents the learnable prior case in part of latent space regularization. The corresponding Lagrangian of semi-supervised AAE is given by (16) and considered in Section 4.3.
- The label space regularization: is based on the adversarial discriminator under the assumption that the class labels follow a categorical distribution. This is applied to both labeled and unlabeled samples.
- The latent space regularization: is based on the learnable prior with Gaussian pdf of AE.
- The reconstruction space regularization in case of reconstruction loss: is only based on the MSE.

Therefore, in summary, for CatGAN [3]:

- The targeted tasks: auto-encoding, clustering, generation and classification.
- The architecture in terms of the latent space representation: there is no encoder as such and instead the system has a generator/decoder that generates samples from a random latent space $\mathbf{a}$ following some hand-crafted prior. The second element of architecture is a classifier with the min/max entropy optimization for the original and fake samples. The encoding of classes is assumed to be a one-hot-vector encoding.
- The usage of IB or other underlying frameworks: CatGAN is not derived from the IB framework. However, as shown in [15], one can apply the IB formulation to adversarial generative models such as CatGAN by assuming that the term ${I}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{X};\mathbf{A})=0$ in (3) due to the absence of an encoder as such. The minimization problem (3) then reduces to the maximization of the second term ${I}_{{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\theta}}_{\mathrm{c}}}(\mathbf{A};\mathbf{C})$ expressed via the lower bound of its variational decomposition (6). The first term ${\mathcal{D}}_{\mathrm{c}}$ enforces that the class labels of unlabeled samples follow the defined prior distribution $p\left(\mathbf{c}\right)$ with the above property of entropy minimization under one-hot-vector encoding, whereas the second term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ reflects the supervised part for labeled samples. In the original CatGAN formulation, the author does not use the expression for the mutual information for the decoder/generator training as shown above but instead uses the decomposition of mutual information via the difference of the corresponding entropies (see the first two terms in (9) in [3]). As we have pointed out, we do not include in our analysis the term corresponding to the fake samples as in the original CatGAN. However, we do believe that this form of regularization plays an important role in semi-supervised classification. The impact of this term requires additional study.
- The label space regularization: is based on the above assumptions for labeled samples, which are included into the cross-entropy term, unlabeled samples included into the entropy minimization term and fake samples included into the entropy maximization term in the original CatGAN method.
- The latent space regularization: is based on the hand-crafted prior.
- The reconstruction space regularization in case of reconstruction loss: is based on the adversarial discriminator only.

Therefore, in summary, for SeGMA [18]:

- The targeted tasks: auto-encoding, clustering, generation and classification.
- The architecture in terms of the latent space representation: a single vector representation following mixture of Gaussians distribution.
- The usage of IB or other underlying frameworks: SeGMA is not derived from the IB framework, but a link to the regularized ELBO and other related auto-encoders with interpretable latent space is demonstrated. However, as in previous methods, it can be linked to the considered IB interpretation of the semi-supervised methods with hand-crafted priors (16). An equivalent Lagrangian of SeGMA is: $${\mathcal{L}}_{\mathrm{SeGMA}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{z}})={\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}},$$
- The label space regularization: is based on the above assumptions for labeled samples, which are included into the cross-entropy term as discussed above.
- The latent space regularization: is based on the hand-crafted mixture of Gaussians pdf.
- The reconstruction space regularization in case of reconstruction loss: is based on the MSE.

Semi-supervised VAE (M1 + M2) [1]:

- The targeted tasks: auto-encoding, clustering, (conditional) generation and classification.
- The architecture in terms of the latent space representation: the stacked combination of models M1 and M2 is used as discussed above.
- The usage of IB or other underlying frameworks: VAE M1 + M2 is not derived from the IB framework but it is linked to the regularized ELBO with the cross-entropy for the labeled samples. The corresponding IB Lagrangian of semi-supervised VAE M1 + M2 under the assumption of end-to-end training can be defined as:$${\mathcal{L}}_{\mathrm{SS}-\mathrm{VAE}\phantom{\rule{0.277778em}{0ex}}\mathrm{M}1+\mathrm{M}2}^{\mathrm{LP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}})={\mathbb{E}}_{{p}_{\mathcal{D}}\left(\mathbf{x}\right)}\left[{\mathcal{D}}_{\mathrm{z}|\mathrm{x}}\right]+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}.$$
- The label space regularization: is based on the assumption of categorical distribution of labels.
- The reconstruction space regularization in case of reconstruction loss: is only based on the MSE.

The tested system is based on either (i) a deterministic encoder and decoder or (ii) a stochastic encoder of type $\mathbf{a}={f}_{{\mathit{\varphi}}_{\mathrm{a}}}(\mathbf{x}+\mathit{\epsilon})$ with data-independent perturbations $\mathit{\epsilon}$ and a deterministic decoder. The density ratio estimator [24] is used to measure all KL-divergences. The results of semi-supervised classification on the MNIST dataset are reported in Table 1, where symbol D indicates the deterministic setup (i) and symbol S corresponds to the stochastic one (ii). To choose the optimal parameters of the systems, e.g., the Lagrangian multipliers in the considered models, we used 3-run cross-validation with randomly chosen labeled examples as shown in Appendix B, Appendix C, Appendix D, Appendix E, Appendix F and Appendix G. Once the model parameters were chosen, we ran 10-run cross-validation, and the average results are shown in Table 1.
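The idea behind measuring a KL-divergence with a density ratio estimator can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in, not the estimator of [24] or the discriminator networks used in the experiments: it trains a logistic-regression discriminator on one-dimensional samples, uses its logit as an estimate of $\mathrm{log}\left(p/q\right)$ (valid for equal sample sizes), and averages it over samples from $p$. The function name and all hyperparameters are hypothetical.

```python
import numpy as np

def estimate_kl_density_ratio(samples_p, samples_q, steps=2000, lr=0.5):
    """Estimate D_KL(p || q) from 1-D samples via a linear logistic discriminator.

    Assumes equally sized sample sets, so that the optimal logit equals log p(x)/q(x).
    """
    x = np.concatenate([samples_p, samples_q])
    y = np.concatenate([np.ones(len(samples_p)), np.zeros(len(samples_q))])
    w, b = 0.0, 0.0
    for _ in range(steps):
        logits = w * x + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        # Gradient descent on the binary cross-entropy loss
        w -= lr * np.mean((probs - y) * x)
        b -= lr * np.mean(probs - y)
    # KL(p || q) = E_p[log p/q], approximated by the mean logit on samples from p
    return float(np.mean(w * samples_p + b))

rng = np.random.default_rng(0)
samples_p = rng.normal(1.0, 1.0, size=20000)   # p = N(1, 1)
samples_q = rng.normal(0.0, 1.0, size=20000)   # q = N(0, 1)
kl_estimate = estimate_kl_density_ratio(samples_p, samples_q)
```

For these two Gaussians the true log-ratio is linear in $x$ and the exact divergence is $0.5$ nats, so the linear discriminator is sufficient here. For images or learned latent codes a more expressive discriminator would be needed.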

Additionally, we performed a 10-run cross-validation on the SVHN dataset [25]. We used the same architecture as for MNIST with the same encoders, decoders and discriminators. In contrast to VAE M1 + M2, we used normalized raw data without any pre-processing. Moreover, in contrast to AAE, where an extra set of 531,131 unlabeled images was used for the semi-supervised training, in our experiments only the train set of 73,257 images was used for training. The experiments were performed (i) with the optimal parameters chosen after 3-run cross-validation on the MNIST dataset, with no special adaptation to the SVHN dataset, and (ii) with network architectures using exactly the same number of filters as given in Appendix B, Appendix C, Appendix D, Appendix E, Appendix F and Appendix G for the MNIST dataset. In summary, our goal is to test the generalization capacity of the proposed approach rather than to achieve the best performance by fine-tuning the network parameters. The obtained results are reported in Table 1.

We compare the considered architectures with several state-of-the-art semi-supervised methods such as AAE [2], CatGAN [3], VAE (M1 + M2) [1], IB multiview [5], MV-InfoMax [5] and InfoMax [3] with 100, 1000 and 60,000 training labeled samples. The expected training times for the considered models are given in Table 2. The source code is available at https://github.com/taranO/IB-semi-supervised-classification. The analysis of the latent space of trained models for the MNIST dataset is given in Appendix A.

The deterministic and stochastic systems based on the learnable priors clearly demonstrate the state-of-the-art performance in comparison to the considered semi-supervised counterparts.

Baseline Neural Network (NN): the obtained results allow concluding that, if the amount of labeled training data is large, as shown in the “all” column (Table 1), the latent space regularization has no practically significant impact on the classification performance for both hand-crafted and learnable priors. The deep classifier is capable of learning a latent representation retaining only sufficient statistics in the latent space solely based on the cross-entropy component of the IB classification term decomposition, as shown in Table A1, row ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and column “all”. The classes appear to be well separable under this form of visualization. At the same time, decreasing the number of labeled samples degrades the classification accuracy, as shown in Table 1 for columns “1000” and “100”. This degradation is also clearly observed in Table A1, row ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and column “100”, where there is a larger overlap between the classes compared to the column “all”. The stochastic encoding via the addition of noise to the input samples does not enhance the performance with respect to the deterministic encoding for a small amount of labeled examples. One can assume that the presence of additive noise is not typical for the considered data, whereas the samples clearly differ in their geometrical appearance. Therefore, we can only assume that random geometrical perturbations would be a more interesting alternative to the additive noise perturbations/encoding.

No priors on latent space: to investigate the impact of unlabeled data, we add the adversarial regularizer ${\mathcal{D}}_{\mathrm{c}}$ to the baseline classifier based on ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$. The term ${\mathcal{D}}_{\mathrm{c}}$ enforces the distribution of class labels for the unlabeled samples to follow the categorical distribution. At this stage, no regularization of latent space is applied. The addition of the adversarial regularizer ${\mathcal{D}}_{\mathrm{c}}$, see “100” column (Table 1), allows reducing the classification error in comparison to the baseline classifier. Moreover, the stochastic encoder slightly outperforms the deterministic one for all numbers of labeled samples. However, the achieved classification error is far away from the performance of baseline classifier trained on the whole labeled data set. Thus, the cross-entropy and adversarial classification terms alone can hardly cope with the lack of labeled data, and proper regularization of the latent space is the main mechanism capable of retaining the most relevant representation.

Hand crafted latent space priors: along this line we investigate the impact of hand-crafted regularization in the form of the added discriminator ${\mathcal{D}}_{\mathrm{a}}$ imposing Gaussian prior on the latent representation $\mathbf{a}$. The sole regularization of latent space with the hand-crafted prior on the Gaussianity does not reflect the complex nature of latent space of real data. As a result the performance of the regularized classifier ${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}$ does not lead to a remarkable improvement in comparison to the non-regularized counterpart ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ for both stochastic and deterministic types of encoding. When in addition the label space regularization ${\mathcal{D}}_{\mathrm{c}}$ is added to the final classifier ${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$, it leads to the factor of 2 classification error reduction over the cross-entropy baseline classifier but it is still far away from the fully supervised baseline classifier trained on the fully labeled data set. At the same time, there is no significant difference between the stochastic and deterministic types of encoding.

Learnable latent space priors: along this line we investigate the impact of learnable priors by adding the corresponding regularizations of the latent space of the auto-encoder and of the data reconstruction. We investigate the role of reconstruction space regularization based on the MSE expressed via ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ and jointly via ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ and ${\mathcal{D}}_{\mathrm{x}}$. The addition of the discriminator ${\mathcal{D}}_{\mathrm{x}}$ slightly enhances the classification but requires almost doubled training time, as shown in Table 2. The stochastic encoding does not show any obvious advantage over the deterministic one in this setup. The separability of classes shown in Table A1, row ${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$ and column “100”, is very close to that of column “all” and row ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$, i.e., the semi-supervised system with 100 labeled examples is capable of closely approximating the fully supervised one. We show the t-SNE only for this setup since it practically coincides with ${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$. However, it should be pointed out that the learnable priors ensure the reconstruction of data from the compressed latent space, and the learned representation is a sufficient statistic for the data reconstruction task but not for the classification one. Since the entropy of the classification task is significantly lower than that of reconstruction, such a learned representation contains more information than actually needed for the classification task. A fraction of the retained information is irrelevant to the classification problem and might be a potential source of classification errors. This likely explains the gap in performance between the considered semi-supervised system and the fully supervised one.
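
As a rough worked bound, assuming the ten MNIST classes, the label entropy is at most that of a uniform distribution over the classes:

$$H(\mathbf{c})\le \log_{2}10\approx 3.32\ \mathrm{bits},$$

so a representation that suffices to reconstruct a 28 × 28 image can be expected to carry considerably more information than the few bits the classifier needs.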

In the SVHN test, we did not try to optimize the Lagrangian coefficients as it was done for MNIST. However, to compensate for a potential non-optimality, we perform the model training with the reduced learning rate as indicated in Table 2. As a result, the training time on the SVHN dataset is longer. Therefore, 10-run validation of the proposed framework on the SVHN dataset was done with the optimal Lagrangian multipliers determined on the MNIST dataset. In this respect, one might observe a small degradation of the obtained results compared to the state-of-the-art. Additionally, we did not apply any pre-processing such as PCA that was used in VAE M1 + M2 and we did not use the extended unlabeled dataset as it was done in case of AAE. One can clearly observe the same behavior of semi-supervised classifiers as for MNIST data set discussed in Section 5.2. Therefore, we can clearly confirm the role of learnable priors in the overall performance observed for both datasets.

We have introduced a novel formulation of the variational information bottleneck for semi-supervised classification. To overcome the problem of the original bottleneck and to compensate for the lack of labeled data in the semi-supervised setting, we considered two models of latent space regularization via hand-crafted and learnable priors. On the toy example of the MNIST dataset, we investigated how the parameters of the proposed framework influence the performance of the classifier. By end-to-end training, we demonstrate how the proposed framework compares to the state-of-the-art methods and approaches the performance of the fully supervised classifier.

The envisioned future work is along the lines of providing a stronger compression while preserving only the information relevant to the classification task, since retaining more task-irrelevant information does not provide distinguishable classification features, i.e., it only ensures reliable data reconstruction. In this work, we have considered IB for the predictive latent space model. We think that the contrastive multi-view IB formulation would be an interesting candidate for the regularization of latent space. Additionally, we did not use the adversarially generated examples to impose the constraint on the minimization of mutual information between them and class labels or equivalently to maximize the entropy of class label distribution for these adversarial examples according to the framework of entropy minimization. This line of “adversarial” regularization seems to be a very interesting complement to the considered variational bottleneck. In this work, we considered a particular form of stochastic encoding by the addition of data-independent noise to the input with the preservation of the same class labels. This also corresponds to the consistency regularization when samples can be more generally permuted including the geometrical transformations. It is also interesting to point out that the same form of generic permutations is used in the unsupervised contrastive loss-based multi-view formulations for the continual latent space representation as opposed to the categorical one in the consistency regularization. Finally, the conditional generation can be an interesting line of research considering the generation from discrete labels and continuous latent space of the autoencoder.

Conceptualization, S.V. and O.T.; methodology, O.T., M.K., T.H. and D.R.; software, O.T.; validation, O.T.; formal analysis, M.K., T.H. and D.R.; investigation, O.T.; writing—original draft preparation, S.V. and O.T., writing—review and editing, ALL; visualization, S.V. and O.T.; supervision, S.V.; project administration, S.V., All authors have read and agreed to the published version of the manuscript.

This research was funded by the Swiss National Science Foundation SNF No. 200021_182063.

The authors declare no conflict of interest.

The following abbreviations are used in this manuscript:

Abbreviation | Definition |
---|---|
IB | Information bottleneck |
VAE | Variational autoencoder |
AAE | Adversarial autoencoder |
CatGAN | Categorical generative adversarial networks |
KL-divergences | Kullback–Leibler divergences |
MSE | Mean squared error |
HCP | IB with hand-crafted priors |
LP | IB with learnable priors |
NN | Neural Network |
SS | Semi-supervised |

In this section, we consider the properties of the classifier’s latent space for both the hand-crafted and learnable priors under different numbers of training samples. Figure A1 and Figure A2 show t-SNE plots with perplexity 30 for 100, 1000 and 60,000 (“all”) training labels of the MNIST dataset.

The first row of Figure A1 with the label “${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$” corresponds to the classifier considered in Appendix B. The latent space $\mathbf{a}$ of the classifier with “all” labels demonstrates the perfect separability of classes. The classes are far away from each other and there are practically no outliers leading to misclassification. The decrease of the number of labels in the supervised setup, see the columns 1000 and 100, leads to a visible degradation of separability between the classes.

The regularization of class label space by the regularizer ${\mathcal{D}}_{\mathrm{c}}$ or by the hand-crafted latent space regularizer ${\mathcal{D}}_{\mathrm{a}}$, shown in rows “${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$” considered in Appendix C and “${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}}$” considered in Appendix D, does not significantly enhance the class separability with respect to “${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$” for the small number of training samples equal to 100.

At the same time, the joint usage of the above regularizers according to the model “${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}}$” considered in Appendix E leads to better separability of classes for 100 labels in comparison with the previous cases. However, the addition of these regularizers does not have any impact on the latent space for the “all” label case.

The introduction of learnable regularization of the latent space along with the class label regularization according to the model “${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\alpha}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$” considered in Appendix G enhances the class separability in the latent space of the classifier for the 100-label case, bringing it very close to the fully supervised case.

For comparison, we also visualize the latent space of the auto-encoder $\mathbf{z}$ for the above model in Figure A2.

The baseline architecture is based on the cross-entropy term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ (7) in the main part of paper and depicted in Figure A3:
$${\mathcal{L}}_{\mathrm{S}-\mathrm{NoReg}}^{\mathrm{HCP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}})={\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}.$$
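
The cross-entropy term can be sketched as follows. This is a minimal illustration that assumes the decoder outputs softmax probabilities and the labels are integer class indices; the function names and the toy three-class batch are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(probs[label], eps))

def batch_cross_entropy(batch_probs, labels):
    """Mean cross-entropy over a batch of softmax outputs and integer labels."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, labels)]
    return sum(losses) / len(losses)

# Toy batch with three classes (hypothetical values).
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
loss = batch_cross_entropy(batch_probs, labels)
```

Only labeled samples contribute to this term, which is why its effectiveness degrades as the number of labels shrinks.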

The parameters of encoder and decoder are shown in Table A1. The performance of baseline supervised classifier with and without batch normalization corresponds to the parameter ${\alpha}_{\mathrm{c}}=0$ in Table A3 (deterministic scenario) and Table A4 (stochastic scenario).

Size | Layer |
---|---|
**Encoder** | |
28 × 28 × 1 | Input |
14 × 14 × 32 | Conv2D, LeakyReLU |
7 × 7 × 64 | Conv2D, LeakyReLU |
4 × 4 × 128 | Conv2D, LeakyReLU |
2048 | Flatten |
1024 | FC, ReLU |
**Decoder** | |
1024 | Input |
500 | FC, ReLU |
10 | FC, Softmax |
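
The layer sizes in the encoder can be checked with a short shape calculation. The sketch assumes each Conv2D layer uses stride 2 with "same" padding; the table does not state these settings, but they are consistent with the listed sizes.

```python
import math

def conv_out(size, stride=2):
    """Spatial size after a strided convolution with 'same' padding (assumed)."""
    return math.ceil(size / stride)

size = 28  # MNIST input is 28 x 28 x 1
shapes = []
for channels in (32, 64, 128):
    size = conv_out(size)
    shapes.append((size, size, channels))

# Number of features after flattening the last convolutional output.
flat = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes, flat)
```

The result reproduces the 14 × 14 × 32, 7 × 7 × 64 and 4 × 4 × 128 feature maps and the 2048-dimensional flattened vector.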

This model is based on terms ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and ${\mathcal{D}}_{\mathrm{c}}$ in (8) in the main part of paper and schematically shown in Figure A4:

$${\mathcal{L}}_{\mathrm{SS}-\mathrm{NoReg}}^{\mathrm{HCP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}})={\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}.$$

The parameters of encoder, decoder and discriminator are shown in Table A2. The KL-divergence term ${\mathcal{D}}_{\mathrm{c}}$ is implemented in a form of density ratio estimator (DRE). In the considered practical implementation, the parameter ${\alpha}_{\mathrm{c}}$ controls the trade-off between the cross-entropy and class discriminator terms. The discriminator ${\mathcal{D}}_{\mathrm{c}}$ is trained in an adversarial way based on samples generated by the decoder and from targeted distribution.
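
The density ratio trick behind such a KL term can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces the trained discriminator with the analytically optimal one for two unit-variance 1D Gaussians, an assumption chosen so that the estimate can be compared with the closed-form KL divergence. All function names are hypothetical.

```python
import math
import random

def log_normal_pdf(x, mean, std=1.0):
    """Log-density of a 1D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def optimal_discriminator(x, mean_q, mean_p):
    """D*(x) = q(x) / (q(x) + p(x)) for two unit-variance Gaussians."""
    log_ratio = log_normal_pdf(x, mean_q) - log_normal_pdf(x, mean_p)
    return 1.0 / (1.0 + math.exp(-log_ratio))

def kl_from_discriminator(samples_q, discriminator):
    """Estimate KL(q || p) as the mean of log(D / (1 - D)) over samples from q."""
    logits = [math.log(d / (1.0 - d)) for d in map(discriminator, samples_q)]
    return sum(logits) / len(logits)

rng = random.Random(0)
mean_q, mean_p = 1.0, 0.0
samples_q = [rng.gauss(mean_q, 1.0) for _ in range(20000)]
kl_est = kl_from_discriminator(
    samples_q, lambda x: optimal_discriminator(x, mean_q, mean_p)
)
# Closed form for unit-variance Gaussians: (mean_q - mean_p) ** 2 / 2 = 0.5
```

With a trained discriminator in place of `optimal_discriminator`, the same average of logits gives only an approximate estimate, whose quality depends on how close the discriminator is to the optimal one.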

The performance of semi-supervised classifier with and without batch normalization is shown in Table A3 (deterministic scenario) and Table A4 (stochastic scenario).

Size | Layer |
---|---|
**Encoder** | |
28 × 28 × 1 | Input |
14 × 14 × 32 | Conv2D, LeakyReLU |
7 × 7 × 64 | Conv2D, LeakyReLU |
4 × 4 × 128 | Conv2D, LeakyReLU |
2048 | Flatten |
1024 | FC, ReLU |
**Decoder** | |
1024 | Input |
500 | FC, ReLU |
10 | FC, Softmax |
**${\mathcal{D}}_{\mathbf{c}}$** | |
10 | Input |
500 | FC, ReLU |
500 | FC, ReLU |
1 | FC, Sigmoid |

Encoder Model | ${\mathit{\alpha}}_{\mathbf{c}}$ | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|---|
**MNIST 100** | | | | | | |
without BN | 0 | 26.56 | 26.24 | 28.04 | 26.95 | 0.96 |
without BN | 0.005 | 20.44 | 21.93 | 18.98 | 20.45 | 1.48 |
without BN | 0.0005 | 18.55 | 20.43 | 20.59 | 19.86 | 1.14 |
without BN | 1 | 19.23 | 22.42 | 20.57 | 20.74 | 1.60 |
with BN | 0 | 29.37 | 29.27 | 30.62 | 29.75 | 0.75 |
with BN | 0.005 | 27.97 | 28.02 | 26.27 | 27.42 | 1.00 |
with BN | 0.0005 | 25.99 | 23.70 | 24.47 | 24.72 | 1.17 |
with BN | 1 | 27.78 | 31.98 | 35.88 | 31.88 | 4.05 |
**MNIST 1000** | | | | | | |
without BN | 0 | 7.74 | 6.99 | 6.97 | 7.23 | 0.44 |
without BN | 0.005 | 5.62 | 6.06 | 5.60 | 5.76 | 0.26 |
without BN | 0.0005 | 6.30 | 6.12 | 6.02 | 6.15 | 0.14 |
without BN | 1 | 5.99 | 6.27 | 6.28 | 6.18 | 0.16 |
with BN | 0 | 7.45 | 6.95 | 7.52 | 7.31 | 0.31 |
with BN | 0.005 | 5.57 | 5.08 | 5.22 | 5.29 | 0.25 |
with BN | 0.0005 | 5.60 | 6.05 | 6.22 | 5.96 | 0.32 |
with BN | 1 | 6.05 | 6.41 | 5.82 | 6.09 | 0.30 |
**MNIST all** | | | | | | |
without BN | 0 | 0.83 | 0.83 | 0.74 | 0.80 | 0.05 |
without BN | 0.005 | 0.83 | 0.82 | 0.88 | 0.84 | 0.03 |
without BN | 0.0005 | 0.86 | 0.92 | 0.82 | 0.87 | 0.05 |
without BN | 1 | 0.72 | 0.85 | 0.87 | 0.81 | 0.08 |
with BN | 0 | 0.73 | 0.67 | 0.79 | 0.73 | 0.06 |
with BN | 0.005 | 0.72 | 0.73 | 0.70 | 0.72 | 0.02 |
with BN | 0.0005 | 0.75 | 0.77 | 0.72 | 0.75 | 0.03 |
with BN | 1 | 0.67 | 0.68 | 0.73 | 0.69 | 0.03 |
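
The “Mean” and “std” columns can be reproduced from the three runs. The sketch below uses the “without BN”, ${\alpha}_{\mathrm{c}}=0$ row for MNIST 100; the reported value appears consistent with the sample standard deviation (division by n − 1), which is an inference from the numbers rather than something the paper states.

```python
import statistics

runs = [26.56, 26.24, 28.04]  # MNIST 100, without BN, alpha_c = 0

mean = statistics.mean(runs)
std = statistics.stdev(runs)  # sample standard deviation (n - 1 in the denominator)

print(f"{mean:.2f} {std:.2f}")  # prints "26.95 0.96"
```

With only three runs per configuration, the standard deviation is itself a noisy estimate, so small differences between neighboring rows should be read with caution.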

Encoder Model | ${\mathit{\alpha}}_{\mathbf{c}}$ | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|---|
**MNIST 100** | | | | | | |
without BN | 0 | 25.75 | 26.61 | 26.59 | 26.32 | 0.49 |
without BN | 0.005 | 23.34 | 21.38 | 24.37 | 23.03 | 1.52 |
without BN | 0.0005 | 19.92 | 15.83 | 16.03 | 17.26 | 2.31 |
without BN | 1 | 22.51 | 20.48 | 21.28 | 21.42 | 1.02 |
with BN | 0 | 30.26 | 31.24 | 29.30 | 30.27 | 0.97 |
with BN | 0.005 | 21.17 | 24.41 | 24.75 | 23.44 | 1.98 |
with BN | 0.0005 | 22.97 | 26.38 | 24.44 | 24.60 | 1.71 |
with BN | 1 | 26.62 | 30.43 | 28.44 | 28.50 | 1.91 |
**MNIST 1000** | | | | | | |
without BN | 0 | 7.68 | 7.30 | 7.23 | 7.40 | 0.24 |
without BN | 0.005 | 5.59 | 5.16 | 5.80 | 5.52 | 0.33 |
without BN | 0.0005 | 5.59 | 6.00 | 5.84 | 5.81 | 0.21 |
without BN | 1 | 6.66 | 6.80 | 7.62 | 7.03 | 0.52 |
with BN | 0 | 6.97 | 7.06 | 7.66 | 7.23 | 0.38 |
with BN | 0.005 | 4.42 | 4.54 | 4.08 | 4.35 | 0.24 |
with BN | 0.0005 | 5.28 | 5.56 | 5.14 | 5.33 | 0.21 |
with BN | 1 | 5.77 | 5.88 | 5.72 | 5.79 | 0.08 |
**MNIST all** | | | | | | |
without BN | 0 | 0.80 | 0.91 | 0.87 | 0.86 | 0.06 |
without BN | 0.005 | 0.77 | 0.82 | 0.88 | 0.82 | 0.06 |
without BN | 0.0005 | 0.86 | 0.81 | 0.87 | 0.85 | 0.03 |
without BN | 1 | 0.93 | 0.85 | 0.92 | 0.90 | 0.04 |
with BN | 0 | 0.65 | 0.67 | 0.71 | 0.68 | 0.03 |
with BN | 0.005 | 0.69 | 0.77 | 0.68 | 0.71 | 0.05 |
with BN | 0.0005 | 0.78 | 0.71 | 0.74 | 0.74 | 0.04 |
with BN | 1 | 0.71 | 0.64 | 0.62 | 0.66 | 0.05 |

This model is based on the cross-entropy term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and either term ${\mathcal{D}}_{\mathrm{a}|\mathrm{x}}$ or ${\mathcal{D}}_{\mathrm{a}}$ or jointly ${\mathcal{D}}_{\mathrm{a}|\mathrm{x}}$ and ${\mathcal{D}}_{\mathrm{a}}$ as defined by (9) in the main part of paper. In our implementation, we consider the regularization based on the adversarial term ${\mathcal{D}}_{\mathrm{a}}$ similar to AAE due to the flexibility of imposing different priors on the latent space distribution. The implemented system is shown in Figure A5 and the training is based on:

$${\mathcal{L}}_{\mathrm{S}-\mathrm{Reg}}^{\mathrm{HCP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}})={\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}},$$

where ${\alpha}_{\mathrm{a}}$ is a regularization parameter controlling a trade-off between the cross-entropy term and the latent space regularization term. In contrast to the original formulation (9) in the main part of paper, we place the Lagrangian multiplier in front of ${\mathcal{D}}_{\mathrm{a}}$. This is done to keep the term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ without a multiplier as the reference to the baseline classifier.

The parameters of encoder, decoder and discriminator are summarized in Table A5. The performance of this classifier without and with batch normalization is shown in Table A6 (deterministic scenario) and Table A7 (stochastic scenario).
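
The adversarial regularizer ${\mathcal{D}}_{\mathrm{a}}$ can be pictured as a GAN-style game between the discriminator and the encoder. The sketch below assumes the common binary cross-entropy formulation with a non-saturating encoder loss; this is an assumption rather than a detail confirmed by the text, and the function names and example discriminator outputs are hypothetical.

```python
import math

def discriminator_loss(d_prior, d_encoded):
    """BCE loss: samples from the prior are labeled 1, encoder outputs 0."""
    real = [-math.log(d) for d in d_prior]
    fake = [-math.log(1.0 - d) for d in d_encoded]
    return (sum(real) + sum(fake)) / (len(real) + len(fake))

def encoder_adversarial_loss(d_encoded):
    """Non-saturating loss: the encoder tries to make D output 1 on its codes."""
    return sum(-math.log(d) for d in d_encoded) / len(d_encoded)

# Hypothetical sigmoid outputs of the discriminator.
d_prior = [0.9, 0.8]    # on samples drawn from the Gaussian prior
d_encoded = [0.2, 0.1]  # on latent codes produced by the encoder

loss_d = discriminator_loss(d_prior, d_encoded)
loss_e = encoder_adversarial_loss(d_encoded)
```

In training, the encoder's adversarial loss would be scaled by ${\alpha}_{\mathrm{a}}$ and added to the cross-entropy term, while the discriminator is updated separately on its own loss.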

Size | Layer |
---|---|
**Encoder** | |
28 × 28 × 1 | Input |
14 × 14 × 32 | Conv2D, LeakyReLU |
7 × 7 × 64 | Conv2D, LeakyReLU |
4 × 4 × 128 | Conv2D, LeakyReLU |
2048 | Flatten |
1024 | FC |
**Decoder** | |
1024 | Input |
500 | FC, ReLU |
10 | FC, Softmax |
**${\mathcal{D}}_{\mathbf{a}}$** | |
1024 | Input |
500 | FC, ReLU |
500 | FC, ReLU |
1 | FC, Sigmoid |

Encoder Model | ${\mathit{\alpha}}_{\mathbf{a}}$ | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|---|
**MNIST 100** | | | | | | |
without BN | 0 | 26.79 | 27.26 | 27.39 | 27.15 | 0.32 |
without BN | 0.005 | 28.05 | 25.95 | 30.72 | 28.24 | 2.39 |
without BN | 0.0005 | 26.67 | 27.69 | 28.46 | 27.61 | 0.89 |
without BN | 1 | 33.42 | 33.05 | 34.81 | 33.76 | 0.92 |
with BN | 0 | 30.37 | 29.32 | 29.82 | 29.83 | 0.52 |
with BN | 0.005 | 28.02 | 31.49 | 30.80 | 30.11 | 1.84 |
with BN | 0.0005 | 34.54 | 31.92 | 29.82 | 31.09 | 2.36 |
with BN | 1 | 34.43 | 44.35 | 44.25 | 41.01 | 5.70 |
**MNIST 1000** | | | | | | |
without BN | 0 | 7.16 | 8.12 | 7.55 | 7.61 | 0.48 |
without BN | 0.005 | 7.02 | 6.34 | 6.59 | 6.65 | 0.34 |
without BN | 0.0005 | 6.73 | 6.34 | 6.82 | 6.63 | 0.26 |
without BN | 1 | 9.49 | 9.93 | 10.56 | 9.99 | 0.54 |
with BN | 0 | 7.39 | 7.83 | 7.92 | 7.72 | 0.28 |
with BN | 0.005 | 7.94 | 7.15 | 8.53 | 7.88 | 0.69 |
with BN | 0.0005 | 8.00 | 9.62 | 9.51 | 9.05 | 0.91 |
with BN | 1 | 15.79 | 14.88 | 13.71 | 14.79 | 1.04 |
**MNIST all** | | | | | | |
without BN | 0 | 0.76 | 0.70 | 0.81 | 0.76 | 0.06 |
without BN | 0.005 | 1.07 | 1.03 | 1.13 | 1.08 | 0.05 |
without BN | 0.0005 | 0.84 | 0.78 | 0.89 | 0.84 | 0.06 |
without BN | 1 | 4.78 | 7.24 | 4.71 | 5.58 | 1.44 |
with BN | 0 | 0.68 | 0.68 | 0.69 | 0.68 | 0.01 |
with BN | 0.005 | 0.90 | 0.81 | 1.12 | 0.94 | 0.16 |
with BN | 0.0005 | 0.87 | 0.80 | 0.89 | 0.85 | 0.05 |
with BN | 1 | 2.37 | 3.61 | 4.35 | 3.44 | 1.00 |

Encoder Model | ${\mathit{\alpha}}_{\mathbf{a}}$ | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|---|
**MNIST 100** | | | | | | |
without BN | 0.005 | 28.13 | 25.16 | 29.90 | 27.73 | 2.40 |
without BN | 0.0005 | 28.05 | 30.03 | 28.11 | 28.73 | 1.13 |
without BN | 1 | 32.33 | 34.09 | 33.73 | 33.38 | 0.93 |
with BN | 0.005 | 32.25 | 33.47 | 26.01 | 30.58 | 4.00 |
with BN | 0.0005 | 33.37 | 36.15 | 35.65 | 35.06 | 1.48 |
with BN | 1 | 33.37 | 42.37 | 32.46 | 36.07 | 5.48 |
**MNIST 1000** | | | | | | |
without BN | 0.005 | 7.37 | 7.17 | 6.65 | 7.06 | 0.37 |
without BN | 0.0005 | 7.48 | 6.68 | 6.67 | 6.94 | 0.46 |
without BN | 1 | 9.48 | 9.94 | 11.61 | 10.34 | 1.12 |
with BN | 0.005 | 7.82 | 7.97 | 7.81 | 7.87 | 0.09 |
with BN | 0.0005 | 9.50 | 8.68 | 9.37 | 9.18 | 0.44 |
with BN | 1 | 12.99 | 10.52 | 9.98 | 11.16 | 1.60 |
**MNIST all** | | | | | | |
without BN | 0.005 | 1.19 | 1.09 | 1.06 | 1.11 | 0.07 |
without BN | 0.0005 | 0.79 | 0.88 | 0.82 | 0.83 | 0.05 |
without BN | 1 | 6.22 | 4.81 | 5.00 | 5.34 | 0.77 |
with BN | 0.005 | 0.94 | 1.07 | 1.04 | 1.02 | 0.07 |
with BN | 0.0005 | 0.78 | 0.81 | 0.78 | 0.79 | 0.02 |
with BN | 1 | 4.49 | 3.35 | 2.18 | 3.34 | 1.16 |

This model is based on the cross-entropy term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$ and either term ${\mathcal{D}}_{\mathrm{a}|\mathrm{x}}$ or ${\mathcal{D}}_{\mathrm{a}}$ or jointly ${\mathcal{D}}_{\mathrm{a}|\mathrm{x}}$ and ${\mathcal{D}}_{\mathrm{a}}$ and the label class regularizer ${\mathcal{D}}_{\mathrm{c}}$ as defined by (10) in the main part of paper. In our implementation, we consider the regularization based on the adversarial term ${\mathcal{D}}_{\mathrm{a}}$ only as shown in Figure A6. The training is based on:

$${\mathcal{L}}_{\mathrm{S}-\mathrm{Reg}}^{\mathrm{HCP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\varphi}}_{\mathrm{a}})={\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\alpha}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\alpha}_{\mathrm{a}}{\mathcal{D}}_{\mathrm{a}}.$$

The parameters of encoder, decoder and both discriminators are shown in Table A8. The performance of this classifier without and with batch normalization is shown in Table A9 (deterministic scenario) and Table A10 (stochastic scenario).

Size | Layer |
---|---|
**Encoder** | |
28 × 28 × 1 | Input |
14 × 14 × 32 | Conv2D, LeakyReLU |
7 × 7 × 64 | Conv2D, LeakyReLU |
4 × 4 × 128 | Conv2D, LeakyReLU |
2048 | Flatten |
1024 | FC |
**Decoder** | |
1024 | Input |
500 | FC, ReLU |
10 | FC, Softmax |
**${\mathcal{D}}_{\mathbf{c}}$** | |
10 | Input |
500 | FC, ReLU |
500 | FC, ReLU |
1 | FC, Sigmoid |
**${\mathcal{D}}_{\mathbf{a}}$** | |
1024 | Input |
500 | FC, ReLU |
500 | FC, ReLU |
1 | FC, Sigmoid |

Encoder Model | ${\mathit{\alpha}}_{\mathbf{a}}$ | ${\mathit{\alpha}}_{\mathbf{c}}$ | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|---|---|
**MNIST 100** | | | | | | | |
without BN | 0.005 | 0.005 | 21.39 | 18.12 | 18.34 | 19.28 | 1.83 |
without BN | 0.0005 | 0.0005 | 15.33 | 22.36 | 13.80 | 17.16 | 4.56 |
without BN | 0.005 | 0.0005 | 25.66 | 26.25 | 28.81 | 26.91 | 1.67 |
without BN | 0.0005 | 0.005 | 9.82 | 13.44 | 13.06 | 12.11 | 1.99 |
with BN | 0.005 | 0.005 | 23.45 | 21.19 | 28.87 | 24.50 | 3.94 |
with BN | 0.0005 | 0.0005 | 28.57 | 19.06 | 26.37 | 24.67 | 4.98 |
with BN | 0.005 | 0.0005 | 26.18 | 26.18 | 25.49 | 25.95 | 0.40 |
with BN | 0.0005 | 0.005 | 8.96 | 13.82 | 14.76 | 12.52 | 3.11 |
**MNIST 1000** | | | | | | | |
without BN | 0.005 | 0.005 | 3.91 | 4.21 | 3.70 | 3.94 | 0.26 |
without BN | 0.0005 | 0.0005 | 3.54 | 3.72 | 3.54 | 3.60 | 0.10 |
without BN | 0.005 | 0.0005 | 6.19 | 5.80 | 7.31 | 6.43 | 0.78 |
without BN | 0.0005 | 0.005 | 2.80 | 2.82 | 2.83 | 2.82 | 0.02 |
with BN | 0.005 | 0.005 | 3.30 | 2.94 | 2.93 | 3.06 | 0.21 |
with BN | 0.0005 | 0.0005 | 2.80 | 2.53 | 2.50 | 2.61 | 0.17 |
with BN | 0.005 | 0.0005 | 3.51 | 3.75 | 4.12 | 3.79 | 0.31 |
with BN | 0.0005 | 0.005 | 2.58 | 2.27 | 2.24 | 2.37 | 0.19 |
**MNIST all** | | | | | | | |
without BN | 0.005 | 0.005 | 1.04 | 1.07 | 1.07 | 1.06 | 0.02 |
without BN | 0.0005 | 0.0005 | 0.86 | 0.90 | 0.88 | 0.88 | 0.02 |
without BN | 0.005 | 0.0005 | 1.08 | 0.92 | 1.09 | 1.03 | 0.10 |
without BN | 0.0005 | 0.005 | 0.85 | 0.93 | 0.93 | 0.90 | 0.05 |
with BN | 0.005 | 0.005 | 1.10 | 1.01 | 0.93 | 1.01 | 0.09 |
with BN | 0.0005 | 0.0005 | 0.84 | 0.88 | 0.83 | 0.85 | 0.03 |
with BN | 0.005 | 0.0005 | 1.10 | 1.12 | 0.93 | 1.05 | 0.10 |
with BN | 0.0005 | 0.005 | 0.76 | 0.82 | 0.79 | 0.79 | 0.03 |

Encoder Model | ${\mathit{\alpha}}_{\mathbf{a}}$ | ${\mathit{\alpha}}_{\mathbf{c}}$ | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|---|---|
**MNIST 100** | | | | | | | |
without BN | 0.005 | 0.005 | 12.40 | 18.05 | 16.73 | 15.73 | 2.96 |
without BN | 0.0005 | 0.0005 | 15.01 | 11.16 | 14.74 | 13.64 | 2.15 |
without BN | 0.005 | 0.0005 | 23.31 | 26.61 | 25.41 | 25.11 | 1.67 |
without BN | 0.0005 | 0.005 | 9.21 | 9.02 | 10.12 | 9.45 | 0.59 |
with BN | 0.005 | 0.005 | 13.55 | 22.48 | 14.72 | 16.92 | 4.85 |
with BN | 0.0005 | 0.0005 | 8.37 | 15.01 | 26.92 | 16.77 | 9.40 |
with BN | 0.005 | 0.0005 | 32.12 | 30.27 | 31.44 | 31.28 | 0.94 |
with BN | 0.0005 | 0.005 | 5.46 | 17.00 | 11.54 | 11.33 | 5.77 |
**MNIST 1000** | | | | | | | |
without BN | 0.005 | 0.005 | 3.90 | 4.25 | 4.02 | 4.06 | 0.18 |
without BN | 0.0005 | 0.0005 | 3.64 | 3.82 | 4.11 | 3.86 | 0.24 |
without BN | 0.005 | 0.0005 | 6.68 | 5.34 | 6.36 | 6.13 | 0.70 |
without BN | 0.0005 | 0.005 | 3.03 | 2.88 | 2.66 | 2.86 | 0.19 |
with BN | 0.005 | 0.005 | 2.96 | 3.37 | 2.98 | 3.10 | 0.23 |
with BN | 0.0005 | 0.0005 | 2.87 | 3.10 | 2.73 | 2.90 | 0.19 |
with BN | 0.005 | 0.0005 | 3.72 | 3.80 | 4.14 | 3.89 | 0.22 |
with BN | 0.0005 | 0.005 | 2.57 | 2.39 | 2.28 | 2.41 | 0.15 |
**MNIST all** | | | | | | | |
without BN | 0.005 | 0.005 | 1.05 | 1.09 | 1.10 | 1.08 | 0.33 |
without BN | 0.0005 | 0.0005 | 0.94 | 0.96 | 0.90 | 0.93 | 0.03 |
without BN | 0.005 | 0.0005 | 1.16 | 1.14 | 1.13 | 1.14 | 0.02 |
without BN | 0.0005 | 0.005 | 0.88 | 0.92 | 0.91 | 0.90 | 0.02 |
with BN | 0.005 | 0.005 | 0.98 | 0.84 | 0.94 | 0.92 | 0.07 |
with BN | 0.0005 | 0.0005 | 0.79 | 0.96 | 0.82 | 0.86 | 0.09 |
with BN | 0.005 | 0.0005 | 1.04 | 1.05 | 1.03 | 1.04 | 0.01 |
with BN | 0.0005 | 0.005 | 0.74 | 0.78 | 0.84 | 0.79 | 0.05 |

This model is based on the cross-entropy term ${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$, the MSE term representing ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$, the label class regularizer ${\mathcal{D}}_{\mathrm{c}}$ and either term ${\mathcal{D}}_{\mathrm{z}|\mathrm{x}}$ or ${\mathcal{D}}_{\mathrm{z}}$ or jointly ${\mathcal{D}}_{\mathrm{z}|\mathrm{x}}$ and ${\mathcal{D}}_{\mathrm{z}}$ as defined by (16) in the main part of paper. In our implementation, we consider the regularization of the latent space based on the adversarial term ${\mathcal{D}}_{\mathrm{z}}$ only to compare it with the vanilla AAE as shown in Figure A7. The encoder is also not conditioned on $\mathbf{c}$ as in the original semi-supervised AAE. Thus, the tested system is based on:

$${\mathcal{L}}_{\mathrm{SS}-\mathrm{AAE}}^{\mathrm{LP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}})={\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}.$$

We set the parameters ${\beta}_{\mathrm{x}}={\beta}_{\mathrm{c}}=1$ to compare our system with the vanilla AAE. However, these parameters can also be optimized in practice.
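
The weighted objective above can be written as a small helper. This is a sketch only: the function name is hypothetical, and the four inputs stand for already computed scalar values of the individual terms.

```python
def learnable_prior_loss(d_cc, d_c, d_z, d_xx, beta_c=1.0, beta_x=1.0):
    """Combine the loss terms as beta_c*D_cc + beta_c*D_c + D_z + beta_x*D_xx.

    d_cc: cross-entropy term, d_c: class-label discriminator term,
    d_z: latent-space discriminator term, d_xx: MSE reconstruction term.
    """
    return beta_c * d_cc + beta_c * d_c + d_z + beta_x * d_xx

# Hypothetical per-batch values of the four terms.
total_default = learnable_prior_loss(0.5, 0.25, 0.125, 1.0)
total_weighted = learnable_prior_loss(0.5, 0.25, 0.125, 1.0, beta_c=2.0)
```

With the default weights the helper reduces to a plain sum of the four terms, which corresponds to the setting ${\beta}_{\mathrm{x}}={\beta}_{\mathrm{c}}=1$ used in the experiments.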

The parameters of encoder and decoder are shown in Table A11. The performance of this classifier without and with batch normalization is shown in Table A12 (deterministic scenario) and Table A13 (stochastic scenario).

Size | Layer |
---|---|
**Encoder** | |
28 × 28 × 1 * | Input |
14 × 14 × 32 | Conv2D, LeakyReLU |
7 × 7 × 64 | Conv2D, LeakyReLU |
4 × 4 × 128 | Conv2D, LeakyReLU |
2048 | Flatten |
1024 | FC, ReLU |
10, 10 | FC, Softmax; FC (two parallel outputs) |
**Decoder** | |
10 + 10 | Input |
7 × 7 × 128 | FC, Reshape, BN, ReLU |
14 × 14 × 128 | Conv2DTrans, BN, ReLU |
28 × 28 × 128 | Conv2DTrans, BN, ReLU |
28 × 28 × 64 | Conv2DTrans, BN, ReLU |
28 × 28 × 1 | Conv2DTrans, Sigmoid |
**${\mathcal{D}}_{\mathbf{z}}$** | |
10 | Input |
500 | FC, ReLU |
500 | FC, ReLU |
1 | FC, Sigmoid |
**${\mathcal{D}}_{\mathbf{c}}$** | |
10 | Input |
500 | FC, ReLU |
500 | FC, ReLU |
1 | FC, Sigmoid |
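
The “10 + 10” decoder input can be read as the concatenation of the 10-dimensional class output and a 10-dimensional latent code. The sketch below assumes this reading; the helper name and the example values are hypothetical.

```python
def decoder_input(c, z):
    """Concatenate class probabilities c and latent code z into one vector."""
    if len(c) != 10 or len(z) != 10:
        raise ValueError("this sketch expects 10 class outputs and a 10-dimensional latent code")
    return list(c) + list(z)

c = [0.0] * 10
c[3] = 1.0                        # one-hot class vector, e.g., digit "3"
z = [0.1 * i for i in range(10)]  # hypothetical latent code

v = decoder_input(c, z)
fc_units = 7 * 7 * 128  # units of the first decoder layer before reshaping
```

The 20-dimensional vector is then mapped by the first fully connected layer to 6272 units, which are reshaped to 7 × 7 × 128.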

Encoder Model | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|
**MNIST 100** | | | | | |
without BN | 2.15 | 2.05 | 1.78 | 1.99 | 0.19 |
with BN | 1.57 | 1.56 | 1.92 | 1.68 | 0.21 |
**MNIST 1000** | | | | | |
without BN | 1.55 | 1.47 | 1.53 | 1.52 | 0.04 |
with BN | 1.37 | 1.34 | 1.73 | 1.48 | 0.22 |
**MNIST all** | | | | | |
without BN | 0.78 | 0.70 | 0.82 | 0.77 | 0.06 |
with BN | 0.79 | 0.77 | 0.76 | 0.77 | 0.02 |

Encoder Model | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|
**MNIST 100** | | | | | |
without BN | 1.55 | 3.19 | 2.11 | 2.28 | 0.83 |
with BN | 1.40 | 1.33 | 1.72 | 1.48 | 0.21 |
**MNIST 1000** | | | | | |
without BN | 1.73 | 1.53 | 1.60 | 1.62 | 0.10 |
with BN | 1.28 | 1.43 | 1.20 | 1.30 | 0.12 |
**MNIST all** | | | | | |
without BN | 0.94 | 0.86 | 0.86 | 0.89 | 0.05 |
with BN | 0.77 | 0.65 | 0.84 | 0.75 | 0.10 |

This model is similar to the previously considered model but in addition to the MSE reconstruction term representing ${\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ it also contains the adversarial reconstruction term ${\mathcal{D}}_{\mathrm{x}}$ as defined by (17) in the main part of paper. In our implementation, we consider the regularization of the latent space based on the adversarial term ${\mathcal{D}}_{\mathrm{z}}$ as shown in Figure A8. The training is based on:

$${\mathcal{L}}_{\mathrm{SS}-\mathrm{AAE}}^{\mathrm{LP}}({\mathit{\theta}}_{\mathrm{c}},{\mathit{\theta}}_{\mathrm{x}},{\mathit{\varphi}}_{\mathrm{a}},{\mathit{\varphi}}_{\mathrm{z}})={\mathcal{D}}_{\mathrm{z}}+{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}+{\alpha}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}.$$

The parameters of encoder and decoder are shown in Table A14. The performance of this classifier without and with batch normalization is shown in Table A15 (deterministic scenario) and Table A16 (stochastic scenario).

Size | Layer |
---|---|
**Encoder** | |
28 × 28 × 1 | Input |
14 × 14 × 32 | Conv2D, LeakyReLU |
7 × 7 × 64 | Conv2D, LeakyReLU |
4 × 4 × 128 | Conv2D, LeakyReLU |
2048 | Flatten |
1024 | FC, ReLU |
10, 10 | FC, Softmax; FC (two parallel outputs) |
**${\mathcal{D}}_{\mathbf{z}}$** | |
10 | Input |
500 | FC, ReLU |
500 | FC, ReLU |
1 | FC, Sigmoid |
**${\mathcal{D}}_{\mathbf{c}}$** | |
10 | Input |
500 | FC, ReLU |
500 | FC, ReLU |
1 | FC, Sigmoid |
**Decoder** | |
10 + 10 | Input |
7 × 7 × 128 | FC, Reshape, BN, ReLU |
14 × 14 × 128 | Conv2DTrans, BN, ReLU |
28 × 28 × 128 | Conv2DTrans, BN, ReLU |
28 × 28 × 64 | Conv2DTrans, BN, ReLU |
28 × 28 × 1 | Conv2DTrans, Sigmoid |
**${\mathcal{D}}_{\mathbf{x}}$** | |
28 × 28 × 1 | Input |
14 × 14 × 64 | Conv2D, LeakyReLU |
7 × 7 × 64 | Conv2D, LeakyReLU |
4 × 4 × 128 | Conv2D, LeakyReLU |
4 × 4 × 256 | Conv2D, LeakyReLU |
4096 | Flatten |
1 | FC, Sigmoid |

Encoder Model | ${\mathit{\alpha}}_{\mathbf{x}}$ | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|---|
**MNIST 100** | | | | | | |
without BN | 0.005 | 2.85 | 3.36 | 2.77 | 2.99 | 0.32 |
without BN | 0.0005 | 2.58 | 2.49 | 3.08 | 2.72 | 0.32 |
without BN | 1 | 19.62 | 19.96 | 15.97 | 18.52 | 2.21 |
with BN | 0.005 | 1.56 | 1.33 | 1.35 | 1.41 | 0.13 |
with BN | 0.0005 | 1.68 | 1.66 | 2.02 | 1.79 | 0.20 |
with BN | 1 | 20.85 | 13.60 | 21.67 | 18.71 | 4.44 |
**MNIST 1000** | | | | | | |
without BN | 0.005 | 2.29 | 2.35 | 2.11 | 2.25 | 0.12 |
without BN | 0.0005 | 1.69 | 1.88 | 2.24 | 1.94 | 0.28 |
without BN | 1 | 3.47 | 3.30 | 4.12 | 3.63 | 0.43 |
with BN | 0.005 | 1.18 | 1.21 | 1.09 | 1.16 | 0.06 |
with BN | 0.0005 | 1.44 | 1.28 | 1.29 | 1.34 | 0.09 |
with BN | 1 | 4.14 | 2.94 | 2.48 | 3.19 | 0.86 |
**MNIST all** | | | | | | |
without BN | 0.005 | 0.97 | 1.01 | 1.04 | 1.01 | 0.04 |
without BN | 0.0005 | 0.88 | 0.85 | 0.93 | 0.89 | 0.04 |
without BN | 1 | 1.31 | 1.28 | 1.47 | 1.35 | 0.10 |
with BN | 0.005 | 0.81 | 0.83 | 0.75 | 0.80 | 0.04 |
with BN | 0.0005 | 0.73 | 0.78 | 0.75 | 0.75 | 0.03 |
with BN | 1 | 0.88 | 0.86 | 1.27 | 1.00 | 0.23 |

Encoder Model | ${\mathit{\alpha}}_{\mathbf{x}}$ | Run 1 | Run 2 | Run 3 | Mean | std |
---|---|---|---|---|---|---|
**MNIST 100** | | | | | | |
without BN | 0.005 | 2.45 | 3.04 | 2.67 | 2.72 | 0.30 |
without BN | 0.0005 | 2.63 | 2.30 | 2.45 | 2.46 | 0.17 |
with BN | 0.005 | 1.34 | 1.21 | 6.40 | 2.98 | 2.96 |
with BN | 0.0005 | 1.35 | 1.51 | 1.93 | 1.60 | 0.30 |
**MNIST 1000** | | | | | | |
without BN | 0.005 | 2.31 | 2.26 | 2.20 | 2.26 | 0.06 |
without BN | 0.0005 | 1.71 | 2.16 | 1.86 | 1.91 | 0.23 |
with BN | 0.005 | 1.23 | 1.31 | 1.10 | 1.21 | 0.11 |
with BN | 0.0005 | 1.42 | 1.62 | 1.37 | 1.47 | 0.13 |
**MNIST all** | | | | | | |
without BN | 0.005 | 0.93 | 1.01 | 1.05 | 1.00 | 0.06 |
without BN | 0.0005 | 0.92 | 0.83 | 0.88 | 0.88 | 0.05 |
with BN | 0.005 | 0.88 | 0.86 | 0.91 | 0.88 | 0.03 |
with BN | 0.0005 | 0.77 | 0.80 | 0.80 | 0.79 | 0.02 |

- Kingma, D.P.; Mohamed, S.; Rezende, D.J.; Welling, M. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems; MIT Press: Montreal, QC, Canada, 2014; pp. 3581–3589.
- Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial autoencoders. arXiv **2015**, arXiv:1511.05644.
- Springenberg, J.T. Unsupervised and semi-supervised learning with categorical generative adversarial networks. arXiv **2015**, arXiv:1511.06390.
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. arXiv **2020**, arXiv:2002.05709.
- Federici, M.; Dutta, A.; Forré, P.; Kushman, N.; Akata, Z. Learning Robust Representations via Multi-View Information Bottleneck. arXiv **2020**, arXiv:2002.07017.
- Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015; pp. 1–5.
- Achille, A.; Soatto, S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. **2018**, 40, 2897–2905.
- Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; Raffel, C.A. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems; MIT Press: Vancouver, BC, Canada, 2019; pp. 5049–5059.
- Grandvalet, Y.; Bengio, Y. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems; MIT Press: Vancouver, BC, Canada, 2004; pp. 529–536.
- Lee, D.H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In ICML Workshop: Challenges in Representation Learning (WREPL); ICML: Atlanta, GA, USA, 2013; Volume 3.
- Cireşan, D.C.; Meier, U.; Gambardella, L.M.; Schmidhuber, J. Deep, big, simple neural nets for handwritten digit recognition. Neural Comput. **2010**, 22, 3207–3220.
- Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation policies from data. arXiv **2018**, arXiv:1805.09501.
- Amjad, R.A.; Geiger, B.C. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 42, 2225–2239.
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv **2016**, arXiv:1612.00410.
- Voloshynovskiy, S.; Kondah, M.; Rezaeifar, S.; Taran, O.; Hotolyak, T.; Rezende, D.J. Information bottleneck through variational glasses. In NeurIPS Workshop on Bayesian Deep Learning; Vancouver Convention Center: Vancouver, BC, Canada, 2019.
- Uğur, Y.; Zaidi, A. Variational Information Bottleneck for Unsupervised Clustering: Deep Gaussian Mixture Embedding. Entropy **2020**, 22, 213.
- Maaløe, L.; Sønderby, C.K.; Sønderby, S.K.; Winther, O. Auxiliary deep generative models. arXiv **2016**, arXiv:1602.05473.
- Śmieja, M.; Wołczyk, M.; Tabor, J.; Geiger, B.C. SeGMA: Semi-Supervised Gaussian Mixture Auto-Encoder. arXiv **2019**, arXiv:1906.09333.
- Makhzani, A.; Frey, B.J. Pixelgan autoencoders. In Advances in Neural Information Processing Systems; MIT Press: Long Beach, CA, USA, 2017; pp. 1975–1985.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
- Kingma, D.; Welling, M. Auto-Encoding Variational Bayes. arXiv **2014**, arXiv:1312.6114.
- Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv **2014**, arXiv:1401.4082.
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; MIT Press: Montreal, QC, Canada, 2014; pp. 2672–2680.
- Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning; NIPS Workshop: Granada, Spain, 2011; Volume 2011, p. 5.

Model | Type | MNIST (100) | MNIST (1000) | MNIST (all) | SVHN (1000) |
---|---|---|---|---|---|
NN Baseline (${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$) | [D] | 26.31 (±0.91) | 7.50 (±0.19) | 0.68 (±0.05) | 36.16 (±0.77) |
NN Baseline (${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$) | [S] | 26.78 (±1.66) | 7.54 (±0.25) | 0.70 (±0.05) | 36.28 (±0.93) |
InfoMax [3] | [S] | 33.41 | 21.5 | 15.86 | - |
VAE [5] | [S] | 14.26 | 8.71 | 5.02 | - |
MV-InfoMax [5] | [S] | 13.22 | 7.39 | 6.07 | - |
IB multiview [5] | [S] | 3.03 | 2.34 | 2.22 | - |
VAE (M1 + M2) [5] | [S] | 3.33 (±0.14) | 2.40 (±0.02) | 0.96 | 36.02 (±0.10) |
CatGAN | [S] | 1.91 (±0.10) | 1.73 (±0.18) | 0.91 | - |
AAE | [D] | 1.90 (±0.10) | 1.60 (±0.08) | 0.85 (±0.02) | 17.70 (±0.30) |
No priors on latent space | | | | | |
${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}$ | [D] | 20.72 (±1.58) | 4.99 (±0.28) | 0.69 (±0.04) | 25.78 (±0.90) |
${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}$ | [S] | 19.60 (±1.37) | 4.49 (±0.25) | 0.67 (±0.05) | 26.34 (±0.80) |
Hand crafted latent space priors | | | | | |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}$ | [D] | 27.44 (±1.40) | 6.77 (±0.34) | 0.91 (±0.05) | 35.94 (±1.08) |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}$ | [S] | 27.48 (±1.07) | 6.91 (±0.45) | 0.88 (±0.05) | 35.80 (±1.21) |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ | [D] | 12.04 (±4.46) | 2.43 (±0.12) | 0.81 (±0.05) | 24.70 (±0.46) |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ | [S] | 11.80 (±3.82) | 2.40 (±0.10) | 0.82 (±0.04) | 24.62 (±0.54) |
Learnable latent space priors | | | | | |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ | [D] | 1.55 (±0.21) | 1.25 (±0.10) | 0.74 (±0.04) | 20.07 (±0.36) |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ | [S] | 1.49 (±0.18) | 1.43 (±0.06) | 0.78 (±0.04) | 20.00 (±0.31) |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$ | [D] | 1.38 (±0.09) | 1.21 (±0.10) | 0.77 (±0.06) | 19.75 (±0.52) |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$ | [S] | 1.42 (±0.10) | 1.16 (±0.09) | 0.79 (±0.02) | 19.71 (±0.26) |

Model | MNIST | SVHN |
---|---|---|
NN Baseline (${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}$) | 0.47–0.65 | 0.85–0.92 |
No priors on latent space | | |
${\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{c}}$ | 0.47–0.65 | 0.85–0.92 |
Hand crafted latent space priors | | |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}$ | 0.47–0.65 | 1–1.05 |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\mathcal{D}}_{\mathrm{a}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}$ | 0.97–1.18 | 1.5–1.6 |
Learnable latent space priors | | |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}$ | 1.23–1.6 | 2.25–2.3 |
${\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}\widehat{\mathrm{c}}}+{\beta}_{\mathrm{c}}{\mathcal{D}}_{\mathrm{c}}+{\mathcal{D}}_{\mathrm{z}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}\widehat{\mathrm{x}}}+{\beta}_{\mathrm{x}}{\mathcal{D}}_{\mathrm{x}}$ | 1.98–2.42 | 3.5–3.55 |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).