1. Introduction
One of the distinct traits of human intelligence is the ability to imagine and synthesize. Generative modeling in machine learning aims to train algorithms to synthesize completely new data, such as audio, text, and images; it does so by estimating the density of the data, and then sampling from that estimated density. The deep learning [
1] revolution has led to breakthroughs in generative modeling with deep generative models such as variational autoencoders, generative stochastic networks, neural autoregressive models, and generative adversarial networks.
Variational autoencoders combine Bayesian variational inference with deep learning [
2]; like the autoencoder, the VAE has an encoder and a decoder, but it aims to learn the probability distribution of the data through amortized variational inference and the reparameterization trick. Information theory is a key component of variational inference because it involves minimizing the KL Divergence between the posterior distribution and the variational posterior. The generative adversarial network is a framework for training two models (a discriminator and a generator) simultaneously via an adversarial process.
There has been extensive research on deep generative models for image, text, and audio generation. Generative adversarial networks (GANs) tend to outperform variational autoencoders in image fidelity, while variational autoencoders are more stable to train and better at estimating the probability distribution itself. Neural autoregressive models are powerful at density estimation but often slower than VAEs at sampling. Often, these frameworks have been combined to complement their respective strengths and ameliorate their weaknesses, as with the adversarial autoencoder, VAE-GAN, VAE with inverse autoregressive flow, and PixelVAE.
Source separation, especially blind source separation, has long been a problem of interest in the signal processing community. The cocktail party problem is a common example: a listener at a cocktail party, where many people are speaking concurrently, must try to follow one of the conversations. This seems like an easy task for humans, but it is more difficult for computers. Source separation has applications in music/speech/audio data, EEG signals, ECG signals, and image processing. In the past, methods like Independent Component Analysis (ICA) and Non-negative Matrix Factorization (NNMF) were the state-of-the-art methods for this problem. More recently, deep learning methods, including variational autoencoders, generative adversarial networks, and recurrent neural networks, have improved upon these solutions.
In finance, a relatively new application area, VAEs are used to generate synthetic volatility surfaces for options trading.
Biosignal applications of the VAE include the detection of serious diseases from electrocardiogram (ECG) signals, data augmentation of biosignals, and the improvement of electroencephalography (EEG)-based speech recognition systems.
The key contributions of our paper include:
(1) A general overview of autoencoders and VAEs; (2) a comprehensive survey of applications of the variational autoencoder in speech source separation, data augmentation and dimensionality reduction in finance, and biosignal analysis; (3) a comprehensive survey of variational autoencoder variants; (4) experiments and analysis of results in speech source separation using various VAE models; (5) while multiple survey papers have covered the VAE [3,4], this paper has a special focus on time series/signal processing applications and information-theoretic interpretations.
The paper is organized as follows:
Section 2 provides general background information and
Section 3 discusses Variational Inference.
Section 4 presents the Variational Autoencoder while
Section 5 discusses Problems with the VAE. Several variants of the VAEs are presented in
Section 6. Three interesting applications of VAEs are discussed in
Section 7. Experimental results on speech source separation are discussed in
Section 8 and
Section 9 concludes the paper.
Notations
The following notation will be used throughout this paper:
 
Lower case p, q, f, $\gamma $, $\psi $, or $p(.)$, $q(.)$, $f(.)$, $\gamma (.)$, $\psi (.)$, to denote probability density functions (PDFs) or probability mass functions (PMFs).
 
Random variables are written in lower case italicized letters, for example, x, with the exception of $\epsilon $, which will represent both a random variable and its instance.
 
Deterministic scalar terms, including realizations of random variables, are written in lower case letters, for example, $x$. The Greek letters $\alpha $, $\beta $, $\mathit{\theta}$, and $\mathit{\varphi}$ will also denote deterministic scalar terms. Deterministic scalar terms that are not realizations of random variables can also be written in upper case letters, such as N.
 
Random vectors are written in lower case italicized bold letters, for example, $\mathit{x}$.
 
If we have a random vector $\mathit{x}$, then its $j$th component will be denoted ${x}_{j}$.
 
Ordinary vectors are written in lowercase bold letters or bold Greek letters $\mathit{\theta}$, $\mathit{\mu}$, and $\mathit{\varphi}$. A realization of a random vector $\mathit{x}$ will be written as $\mathbf{x}$.
 
Matrices are written in uppercase bold italics, such as $\mathit{W}$. $\mathit{I}$ denotes the identity matrix.
 
If we have two random variables x and y with probability functions $p\left(x\right)$ and $p\left(y\right)$, we can write their joint probability function as $p(x,y)$. Their conditional probability distribution function is written as $p(x \mid y)$.
 
If we have a PMF/PDF $p(x=\mathrm{x})$, $p\left(\mathrm{x}\right)$ is the shorthand notation. For $p(x \mid y=\mathrm{y})$, $p(x \mid \mathrm{y})$ is the shorthand. For $p(x=\mathrm{x} \mid y=\mathrm{y})$, $p(\mathrm{x} \mid \mathrm{y})$ is the shorthand.
 
If we have
$p(.;\mathit{\theta})$ or
${p}_{\mathit{\theta}}\left(\mathrm{x}\right)$, this denotes that the PDF/PMF
p is parameterized by
$\mathit{\theta}$—similarly with
${q}_{\mathit{\varphi}}\left(\mathrm{z}\right)$. However, there are exceptions in some contexts.
${p}_{x}\left(\mathrm{x}\right)$ will be shorthand for a distribution
$p\left(\mathrm{x}\right)$ associated with random variable
x;
${p}_{z}\left(\mathrm{z}\right)$ will be shorthand for a distribution
p associated with random variable
z; and
${p}_{x \mid z}(\mathrm{x}\mid {\mathrm{z}}_{i})$ denotes a conditional distribution of random variable
x given
z=
${\mathrm{z}}_{i}$. The term
${p}_{(x,y)}(\mathrm{x},\mathrm{y})$ is a joint PDF/PMF between random variables
x and
y. The term
${p}_{(x,z)}(\mathrm{x},\mathrm{z})$ is a joint PDF/PMF between random variables
x and
z. In
Section 2.8,
${p}_{g}\left(\mathrm{z}\right)$ would represent the distribution that is the input to the generator in the GAN, while
${p}_{d}\left(\mathrm{x}\right)$ represents the data distribution.
 
range(x) denotes the range of random variable x and dom(x) denotes the domain of random variable x.
 
If we have a dataset $\mathcal{X}={\left\{{\mathrm{x}}^{\left(i\right)}\right\}}_{i=1}^{N}$, where there are N independent and identically distributed (i.i.d.) realizations/observations of some continuous or discrete random variable x, the $i$th observation is denoted as ${\mathrm{x}}^{\left(i\right)}$.
 
If we have a dataset $\mathcal{X}={\left\{{\mathbf{x}}^{\left(i\right)}\right\}}_{i=1}^{N}$, where there are N i.i.d. realizations/observations of some continuous or discrete random vector $\mathbf{x}$, the $i$th observation is denoted as ${\mathbf{x}}^{\left(i\right)}$. If this is our dataset, then ${x}_{j}^{\left(i\right)}$ represents the $j$th component of the $i$th observation.
 
diag$\left(\mathbf{x}\right)$ is a diagonal matrix, with the diagonal values being the values of vector $\mathbf{x}$ and det ($\mathit{W}$) is the determinant of matrix $\mathit{W}$.
2. Background
In this section, we summarize the prerequisite information for understanding the variational autoencoder and its many extensions.
2.1. Distances and Information-Theoretic Measures
Information theory is key to machine learning, from the usage of information-theoretic measures in loss functions [
5,
6,
7] to its use for analysis through the information bottleneck framework [
8,
9,
10]. It is also key to the development of the VAE [
2]. We therefore review the key measures of information theory below.
2.1.1. Shannon’s Entropy
Given a random variable x with PDF $p\left(\mathrm{x}\right)$, Shannon's Entropy measures the uncertainty of x [11]. It can also be thought of as a way to measure the uncertainty in a random vector or random process $\mathit{x}$ with joint PDF $p\left(\mathbf{x}\right)$, and it can be viewed as a generalization of the variance of a process.
Discrete Shannon’s Entropy, for discrete random variable
x, with a PMF
$p\left(\mathrm{x}\right)$, is defined as:
Continuous (Differential) Shannon’s Entropy, with the PDF
$p\left(\mathrm{x}\right)$, for random variable
x, is given by
We can also rewrite both discrete and differential Entropy as:
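Written out in the notation above, the standard forms of these three definitions are:

```latex
% Discrete Shannon's Entropy, summing over the range of x
\mathbb{H}(x) = -\sum_{\mathrm{x}} p(\mathrm{x}) \log p(\mathrm{x})

% Continuous (differential) Shannon's Entropy
\mathbb{H}(x) = -\int p(\mathrm{x}) \log p(\mathrm{x}) \, d\mathrm{x}

% Both cases written as an expectation
\mathbb{H}(x) = -\,\mathbb{E}_{p(\mathrm{x})}\!\left[\log p(\mathrm{x})\right]
```

The logarithm base determines the units: base 2 gives bits, the natural logarithm gives nats.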
2.1.2. Shannon’s Joint Entropy
The Joint Entropy
$\mathbb{H}(x,y)$ of a pair of discrete random variables
$(x,y)$ with a joint distribution
${p}_{(x,y)}(\mathrm{x},\mathrm{y})$ is defined as
We can also write the Joint Entropy as
and
If we have
$\mathit{x}=({x}_{1},{x}_{2},\dots ,{x}_{n})\sim p({\mathrm{x}}_{1},{\mathrm{x}}_{2},\dots ,{\mathrm{x}}_{n})=p\left(\mathbf{x}\right)$, where
$\mathbf{x}$ is a discrete random vector, then
If we have
$\mathit{x}=({x}_{1},{x}_{2},\dots ,{x}_{n})\sim p({\mathrm{x}}_{1},{\mathrm{x}}_{2},\dots ,{\mathrm{x}}_{n})=p\left(\mathbf{x}\right)$, where
$\mathit{x}$ is a continuous random vector, then
2.1.3. Shannon’s Conditional Entropy
If we have
$(x,y)\sim {p}_{(x,y)}(\mathrm{x},\mathrm{y})$, where
x and
y are discrete random variables, the conditional Entropy
$\mathbb{H}(y \mid x)$ is:
The conditional Entropy
$\mathbb{H}(x \mid y)$ is:
Conditional Entropy can be thought of as the “
expected value of the entropies of the conditional distributions, averaged over the conditioning random variable” [
11].
If
$(x,y)$ are continuous random variables with the joint PDF
$p(x,y)$, the conditional differential Entropy
$\mathbb{H}(x \mid y)$ is
Since the joint PDF has the property $p(x \mid y)=p(x,y)/p(y)$, we have $\mathbb{H}(x \mid y)=\mathbb{H}(x,y)-\mathbb{H}(y).$
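This chain-rule identity can be checked numerically. The sketch below uses a hypothetical 2×2 joint PMF (all numbers are illustrative) and computes $\mathbb{H}(x \mid y)$ both via the chain rule and directly as the average entropy of the conditionals:

```python
import numpy as np

# Hypothetical 2x2 joint PMF p(x, y) of two binary random variables
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])

def entropy(p):
    """Shannon's Entropy (in nats) of a PMF given as an array."""
    p = p[p > 0]  # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p))

H_xy = entropy(p_xy)            # joint entropy H(x, y)
p_y = p_xy.sum(axis=0)          # marginal PMF of y
H_y = entropy(p_y)

# Chain rule: H(x | y) = H(x, y) - H(y)
H_x_given_y = H_xy - H_y

# Direct definition: entropies of the conditionals p(x | y), averaged over p(y)
H_direct = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(2))

assert np.isclose(H_x_given_y, H_direct)
```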
2.1.4. Kullback–Leibler (KL) Divergence
If you have two probability distributions,
p and
q, the KL Divergence measures how much the two distributions differ; however, unlike a true distance, it is asymmetric [
11]. It is also nonnegative.
For discrete random variables with PMFs
p and
q, the discrete KL Divergence is given by
If
p and
q are distributions of a continuous random variable
x, the continuous KL Divergence is given by
We can also write the KL Divergence as
KL Divergence can be used when we are approximating a probability distribution
p with another probability distribution
q. We can use
${\mathbb{D}}_{KL}(p\parallel q)$, called the forward
$KL$, or
${\mathbb{D}}_{KL}(q\parallel p)$, which is called the reverse KL. Minimizing the forward KL with respect to the approximate distribution
q is called moment projection [
12]. In the case where
p is positive but
q is 0, $\log\left(\frac{p}{q}\right)$ becomes ∞, so the support of
p is overestimated in the approximation
q. Minimizing the reverse KL with respect to the approximate distribution
q is called information projection. In the case where
q is positive but
p is 0,
$\log\left(\frac{q}{p}\right)$ becomes
∞, so the approximation
q does not include any input where
p is 0.
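A small numerical sketch (with hypothetical three-outcome PMFs p and q; the values are illustrative) shows that the forward and reverse KL are both nonnegative yet generally differ:

```python
import numpy as np

def kl(p, q):
    """Discrete KL Divergence D_KL(p || q) in nats (assumes q > 0 wherever p > 0)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Hypothetical PMFs over three outcomes
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

forward = kl(p, q)   # moment projection direction
reverse = kl(q, p)   # information projection direction

# KL is nonnegative and asymmetric
assert forward >= 0 and reverse >= 0
assert not np.isclose(forward, reverse)
```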
2.1.5. Mutual Information
Mutual Information of random variables
x and
y, denoted
$\mathbb{I}(x;y)$, measures the information that
x and
y share and the dependence between them [
11]. Intuitively, it is how much knowing one random variable decreases uncertainty in the other one; this can be seen by the following formula:
Mutual Information can also be written as
If
x and
y are discrete random variables, their discrete Mutual Information is given by
where
${p}_{(x,y)}(\mathrm{x},\mathrm{y})$ is the joint PMF between
x and
y. If
x and
y are continuous random variables, their continuous Mutual Information is
$\mathbb{I}(\mathrm{x};\mathrm{y})=0$ if and only if x and y are independent. Mutual Information is nonnegative and symmetric.
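The identity $\mathbb{I}(x;y)=\mathbb{H}(x)+\mathbb{H}(y)-\mathbb{H}(x,y)$, together with the independence property, can be verified on a toy joint PMF (all numbers illustrative):

```python
import numpy as np

def entropy(p):
    """Shannon's Entropy (in nats) of a PMF given as an array."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical joint PMF of two binary random variables
p_xy = np.array([[0.25, 0.25],
                 [0.10, 0.40]])

p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# I(x; y) = H(x) + H(y) - H(x, y)
mi = entropy(p_x) + entropy(p_y) - entropy(p_xy)
assert mi >= 0  # Mutual Information is nonnegative

# For the independent joint p(x)p(y), the Mutual Information is zero
indep = np.outer(p_x, p_y)
mi_indep = entropy(p_x) + entropy(p_y) - entropy(indep)
assert np.isclose(mi_indep, 0)
```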
2.1.6. Cross-Entropy
For two PMFs p and q, the Cross-Entropy is defined as:
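For discrete distributions, the Cross-Entropy is $\mathbb{H}(p,q)=-\sum_{\mathrm{x}}p(\mathrm{x})\log q(\mathrm{x})$, and it decomposes as the entropy of p plus the KL Divergence from p to q. A quick check with hypothetical PMFs:

```python
import numpy as np

# Hypothetical PMFs over three outcomes
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.4, 0.4, 0.2])

cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
H_p = -np.sum(p * np.log(p))             # H(p)
kl_pq = np.sum(p * np.log(p / q))        # D_KL(p || q)

# Decomposition: H(p, q) = H(p) + D_KL(p || q)
assert np.isclose(cross_entropy, H_p + kl_pq)
```

Since the KL Divergence is nonnegative, the Cross-Entropy is always at least the entropy of p.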
2.1.7. Jensen–Shannon (JS) Divergence
Given PDFs
p and
q, we have
$\psi =\frac{1}{2}(p+q)$. The JS Divergence is
This is also a symmetric measure.
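A short numerical sketch (hypothetical two-outcome PMFs) confirms the symmetry and the $\log 2$ upper bound in nats:

```python
import numpy as np

def kl(p, q):
    """Discrete KL Divergence D_KL(p || q) in nats."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Hypothetical PMFs over two outcomes
p = np.array([0.8, 0.2])
q = np.array([0.3, 0.7])
psi = 0.5 * (p + q)   # the mixture distribution

# JS(p, q) = (1/2) D_KL(p || psi) + (1/2) D_KL(q || psi)
js_pq = 0.5 * kl(p, psi) + 0.5 * kl(q, psi)
js_qp = 0.5 * kl(q, psi) + 0.5 * kl(p, psi)

assert np.isclose(js_pq, js_qp)     # symmetric
assert 0 <= js_pq <= np.log(2)      # bounded by log 2 in nats
```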
2.1.8. Renyi’s Entropy
Renyi’s Entropy is a generalization of Shannon’s Entropy [
6]. Discrete Renyi’s Entropy for PMF
$p\left(\mathrm{x}\right)$ is given by
Continuous Renyi’s Entropy for PDF
$p\left(\mathrm{x}\right)$ is given by
When
$\alpha \to 1$, Renyi’s Entropy converges to Shannon’s Entropy. When
$\alpha =2$, it becomes quadratic entropy. The term
${\mathbb{V}}_{\alpha}\left(x\right)$ is called information potential:
If
$\alpha $ = 2, it becomes the quadratic information potential (QIP):
We first look at the continuous Quadratic Entropy, which is the case of Renyi's $\alpha$-Entropy where $\alpha = 2$:
${\mathbb{V}}_{2}\left(x\right)={\mathbb{E}}_{p\left(x\right)}\left[p\left(x\right)\right]$ is the QIP; it is the expected value of the PDF if x is continuous, or PMF if x is discrete. From this point on, all information potentials will be QIPs. In addition, the subscripts will denote the PDF/PMF associated with the QIP; ${\mathbb{V}}_{p}$ will be the QIP associated with p.
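Both properties, convergence to Shannon's Entropy as $\alpha \to 1$ and the relation of quadratic entropy to the QIP, can be checked numerically for a discrete PMF (values illustrative):

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Discrete Renyi's Entropy of order alpha (alpha > 0, alpha != 1), in nats."""
    return np.log(np.sum(p ** alpha)) / (1 - alpha)

# Hypothetical PMF over three outcomes
p = np.array([0.5, 0.3, 0.2])
shannon = -np.sum(p * np.log(p))

# As alpha -> 1, Renyi's Entropy approaches Shannon's Entropy
assert np.isclose(renyi_entropy(p, 1.0001), shannon, atol=1e-3)

# alpha = 2 gives quadratic entropy, the negative log of the QIP
qip = np.sum(p ** 2)   # QIP: the expected value of the PMF
assert np.isclose(renyi_entropy(p, 2), -np.log(qip))
```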
2.1.9. Renyi’s CrossEntropy
Renyi’s CrossEntropy for two PDFs is given by:
The cross information potential is given by:
As stated above, all information potentials in this paper will be quadratic, and subscripts will denote the associated PDF/PMF: ${\mathbb{V}}_{p}$ is the QIP associated with p, ${\mathbb{V}}_{q}$ is the QIP associated with q, and ${\mathbb{V}}_{c}$ is the cross information potential associated with p and q.
2.1.10. Renyi’s $\alpha$-Divergence
For two PMFs,
$p\left(\mathrm{x}\right)$ and
$q\left(\mathrm{x}\right)$, the formula is:
When $\alpha \to 1$, it converges to $\mathrm{KL}$ Divergence.
For two PDFs
$p\left(\mathrm{x}\right)$ and
$q\left(\mathrm{x}\right)$, the formula is:
2.1.11. Euclidean Divergence
The Euclidean Divergence between PDFs
$p\left(\mathrm{x}\right)$ and
$q\left(\mathrm{x}\right)$ is given by the following formula:
If we want to express the Euclidean Divergence between
p and
q in terms of QIP, it is given by:
The Euclidean Divergence is a symmetric measure.
2.1.12. Cauchy–Schwarz Divergence
The Cauchy–Schwarz (CS) Divergence, for probability density functions
$p\left(\mathrm{x}\right)$ and
$q\left(\mathrm{x}\right)$, is given by the following formula [
6]:
If we have PDFs
$p\left(\mathrm{x}\right)$ and
$q\left(\mathrm{x}\right)$, and want to express the CS Divergence between them in terms of QIP, it is given by:
Unlike KL Divergence and Renyi’s $\alpha$-Divergence, CS Divergence is a symmetric measure.
2.2. Monte Carlo
Monte Carlo methods [
13,
14] are computational methods based on random sampling; they are often used to estimate integrals. This is useful in statistics for estimating expected values. In machine learning, it is especially useful for gradient estimation.
Given an integrable function
$\mathrm{g}:{[0,1]}^{\mathrm{d}}\mapsto \mathbb{R}$, we can look at the following integral [
14]:
If the domain of the integral is in ${\mathbb{R}}^{\mathrm{d}}$, a change of variables can map the domain to ${[0,1]}^{\mathrm{d}}$.
We can generate an i.i.d sequence
$\{{u}_{1},\dots ,{u}_{N}\}$ from a standard uniform distribution over
${[0,1]}^{\mathrm{d}}$. Then, the Monte Carlo estimator is given by
We can also use a more general Monte Carlo estimator for integration [
15]. We can find a PDF
f of random variable
$z\in {[0,1]}^{\mathrm{d}}$ such that
f > 0 on
${[0,1]}^{\mathrm{d}}$ and
${\int}_{{[0,1]}^{\mathrm{d}}}f\left(\mathrm{x}\right)d\mathrm{x}=1$.
Given that
$\mathrm{h}\left(\mathrm{x}\right)=\mathrm{g}\left(\mathrm{x}\right)/f\left(\mathrm{x}\right)$, our integral becomes
The Monte Carlo estimator is then given by
Thus, the steps are
 (1)
Sample an i.i.d. sequence $\{{x}_{1},{x}_{2},\dots ,{x}_{\mathrm{N}}\}\sim f$
 (2)
Plug i.i.d. sequence into the estimator given by Equation (
36).
If we set
$f\left(\mathrm{x}\right)$ =1 over the region
${[0,1]}^{\mathrm{d}}$, we arrive at Equation (
35). By the Law of Large Numbers, ${A}_{\mathrm{N}}$ for both Monte Carlo estimators converges to A as N $\to \infty$, and the convergence rate does not depend on the dimension d; this provides an advantage over traditional numerical integration.
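Both estimators can be sketched as follows; the integrand $g(x)=x^2$ (true integral $1/3$) and the density $f(x)=2x$ are illustrative choices, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Estimate A = integral of g(x) = x^2 over [0, 1] (true value 1/3)
g = lambda x: x ** 2
N = 200_000

# Basic estimator: sample u_i ~ Uniform[0, 1] and average g(u_i)
u = rng.uniform(0, 1, N)
A_basic = np.mean(g(u))

# General estimator: sample x_i ~ f and average h(x_i) = g(x_i) / f(x_i).
# Here f(x) = 2x on [0, 1]; its CDF is x^2, so inverse-CDF sampling gives x = sqrt(u).
x = np.sqrt(rng.uniform(0, 1, N))
A_general = np.mean(g(x) / (2 * x))

assert abs(A_basic - 1 / 3) < 0.01
assert abs(A_general - 1 / 3) < 0.01
```

Setting $f(\mathrm{x})=1$ in the general estimator recovers the basic one, as noted above.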
2.3. Autoencoders
The Autoencoder is a self-supervised learning algorithm that is used for lossy data compression [
16]. The compression is specific to the data that is used to train the model.
There is an encoder that creates a compressed representation; this representation goes into the decoder, which outputs a reconstruction of the input. The training target is thus the input itself. Autoencoders can be considered a nonlinear Principal Component Analysis (PCA). Often, the compressed representation has a smaller dimension than the input and output.
Figure 1 shows the architecture of an Autoencoder with the MNIST data set as the data.
The encoder and decoder are typically multilayered perceptrons (MLPs), but they can be replaced with convolutional neural networks, which yields a Convolutional Autoencoder; this variant is better at reconstructing image data. The use of convolution in deep learning actually refers to what is known as cross-correlation in signal processing terminology [
1]. The LSTMAutoencoder uses LSTMs instead of MLPs for the encoder and decoder.
One important variation of the Autoencoder is the Denoising Autoencoder (DAE) [
17]; the DAE is used to clean data that is corrupted by noise. Random noise is added to the input, but the reconstruction loss is between the clean input and the output. Some noise that can be added includes salt & pepper noise, additive white Gaussian noise (AWGN), and masking noise.
Discrete-time white noise is a zero-mean discrete-time random process with finite variance whose samples are serially uncorrelated random variables. AWGN is discrete-time white noise that is Gaussian and additive. Additive implies it is added to the original signal. We add AWGN to the original signal/image
$\mathbf{x}$. If our signal is a 1D discrete time series, the AWGN vector added to the signal can be represented as
$\mathit{w}\sim \mathcal{N}\left(0,{\sigma}^{2}\mathit{I}\right)$. To introduce masking noise into
$\mathbf{x}$, a certain fraction of the elements of
$\mathbf{x}$ are randomly chosen and set to 0. Salt & pepper noise is when a certain fraction of the elements of
$\mathbf{x}$ are randomly chosen and set to the minimum or maximum possible value, with the choice between the two made by a fair coin flip. We can also convolve the input
$\mathbf{x}$ with a Gaussian filter, blurring the input [
18]. In the context of the MNIST data set, we can corrupt the data by adding a block of white pixels to the center of the digits [18]. Salt & pepper noise and masking noise both corrupt a fraction of the elements in a signal significantly, while not affecting the others. By denoising, we attempt to recover the original values of the corrupted elements. This is only possible when there are dependencies between the dimensions of the high-dimensional data distribution; what we expect when training the DAE is that it learns these dependencies. Thus, the DAE approach makes less sense for low-dimensional distributions.
Convolutional Denoising Autoencoders (CDAEs) are DAEs with convolutional layers. Stacked denoising autoencoders (SAEs) are formed by stacking layers of DAEs.
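The corruption processes described above can be sketched as follows; the function name, corruption fraction, and noise level are illustrative assumptions, not from the source:

```python
import numpy as np

rng = np.random.default_rng(1)

def corrupt(x, kind, frac=0.3, sigma=0.1):
    """Corrupt a 1-D signal x for DAE training (illustrative parameters)."""
    x = x.copy()
    if kind == "awgn":
        # additive white Gaussian noise: w ~ N(0, sigma^2 I)
        x += rng.normal(0.0, sigma, size=x.shape)
    elif kind == "masking":
        # masking noise: set a random fraction of elements to 0
        idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
        x[idx] = 0.0
    elif kind == "salt_pepper":
        # salt & pepper: set a random fraction to the min or max value (fair coin flip)
        idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
        x[idx] = rng.choice([x.min(), x.max()], size=idx.size)
    return x

clean = rng.uniform(0, 1, 100)
noisy = corrupt(clean, "masking")
# In a DAE, the reconstruction loss compares the output to the CLEAN input:
# loss = ||decoder(encoder(noisy)) - clean||^2
```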
2.4. Bayesian Networks
For any joint probability distribution, its independence/dependence relationships can be depicted using Probabilistic Graphical Models. When the relationships are represented via directed acyclic graphs (DAGs), the graphical models are known as Bayesian Networks. Other names for Bayesian Networks include Directed Graphical Models and Belief Networks. To illustrate the use of Bayesian Networks, we will use an example from [
19].
A woman named Tracey notices that her lawn is wet in the morning. She wonders whether it is from the rain or her accidentally leaving the sprinklers on the previous night. She then sees that her neighbor Jack also has a wet lawn. Her conclusion was that it rained last night.
Our variables are:
$r\in \{0,1\}$, where $r=1$ denotes it was raining last night, 0 denotes it was not raining last night.
$s\in \{0,1\}$, where $s=1$ denotes Tracey left the sprinklers on the previous night, and 0 otherwise.
$j\in \{0,1\}$, where $j=1$ indicates that Jack’s lawn is wet, and 0 otherwise.
$t\in \{0,1\}$, where $t=1$ denotes that Tracey’s grass is wet, and 0 otherwise.
We can represent this with a joint probability function
$p(\mathrm{t},\mathrm{j},\mathrm{r},\mathrm{s})$. Using the chain rule of probability, we can decompose this into:
However, we can simplify this further by looking at the constraints. We know that the status of Tracey’s lawn does not depend on Jack’s; it depends only on whether it rained and whether she left the sprinkler on. Thus:
We can also assume that the only variable affecting the status of Jack’s lawn is whether it was raining the night before. Thus:
We assume that the rain is not affected by the sprinkler.
Thus, our simplified model is:
We can represent this with a Bayesian Network, as shown in
Figure 2. Each node in this graph represents a variable from the joint distribution. Notice that there is a directed edge from
r to
j. This means that r is the parent node of j, while j is the child node of r. If a variable is a parent of another variable, it appears on the right side of the conditional bar; for example, in $p(\mathrm{j} \mid \mathrm{r})$, r is on the right side of the conditional bar.
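Inference in this network can be carried out by enumeration. The sketch below uses illustrative conditional probability tables (the numbers are assumptions, not from the source) to compute the posterior probability that it rained given that both lawns are wet:

```python
# Hypothetical conditional probability tables for the wet-lawn example
# (all numbers are illustrative assumptions, not from the source)
p_r = {1: 0.2, 0: 0.8}      # p(r): probability of rain
p_s = {1: 0.1, 0: 0.9}      # p(s): probability the sprinkler was left on
p_j1 = {1: 1.0, 0: 0.2}     # p(j = 1 | r): Jack's lawn wet, given rain status
p_t1 = {(1, 1): 1.0, (1, 0): 1.0, (0, 1): 0.9, (0, 0): 0.0}  # p(t = 1 | r, s)

def joint(t, j, r, s):
    # Factorization from the Bayesian Network: p(t,j,r,s) = p(t|r,s) p(j|r) p(r) p(s)
    pt = p_t1[(r, s)] if t == 1 else 1 - p_t1[(r, s)]
    pj = p_j1[r] if j == 1 else 1 - p_j1[r]
    return pt * pj * p_r[r] * p_s[s]

# Posterior p(r = 1 | t = 1, j = 1), summing out the hidden variable s
num = sum(joint(1, 1, 1, s) for s in (0, 1))
den = sum(joint(1, 1, r, s) for r in (0, 1) for s in (0, 1))
posterior = num / den   # rain is by far the most likely explanation
```

With these assumed tables, the posterior is about 0.93, matching Tracey's conclusion that it rained.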
If we have a set of random variables
$\{{x}_{1},\dots ,{x}_{M}\}$ with certain conditional independence assumptions, we can represent their joint distribution as
Similarly, for a set of random vectors,
$\{{\mathbf{x}}_{1},\dots ,{\mathbf{x}}_{M}\}$ we can represent their joint distributions as:
For root nodes in the Bayesian Network, the set of parents is the empty set; thus, their distributions are marginal distributions. $Parents\left({x}_{j}\right)$ denotes the set of parent variables for node ${x}_{j}$ in the Bayesian Network.
Initially, the parameterization of each conditional probability distribution was done with a lookup table or a linear model. In deep learning, we can use neural networks to parameterize conditional distributions; this is more flexible. The meaning of a neural network parameterizing a PDF is that it is part of the function that computes the PDF [
20].
2.5. Generative Models vs. Discriminative Models
Given the PDF
$p(\mathbf{x},\mathrm{y})$, we generate a dataset
$\mathcal{D}={\left\{{\mathbf{x}}^{\left(i\right)},{\mathrm{y}}^{\left(i\right)}\right\}}_{i=1}^{N}$. We have the realizations of an i.i.d sequence
$\mathcal{X}={\left\{{\mathbf{x}}^{\left(i\right)}\right\}}_{i=1}^{N}$ where
${\mathbf{x}}^{\left(i\right)}\in {\mathbb{R}}^{d}$ and each
${\mathbf{x}}^{\left(i\right)}$ has a label
${\mathrm{y}}^{\left(i\right)}$ associated with it [
21]. A generative model would attempt to learn
$p(\mathbf{x},\mathrm{y})$. It would then generate new examples
$\mathbf{x}$ from the estimated distribution. The term $p(\mathrm{y} \mid \mathbf{x})$ would be a discriminative model; it attempts to estimate the label-generating probability distribution. A discriminative model can predict
y given examples
$\mathit{x}$, but it cannot generate a sample of
$\mathit{x}$.
There are three types of generative models typically used in deep learning [
22]: latent variable models, neural autoregressive models, and implicit models.
2.6. Latent Variable Models
Latent variables are underlying variables; often, they are not directly observable. An example given by Dr. Nuno Vasconcelos [
23] is the bridge example. There is a bridge with weight sensors measuring the weight of each car. However, there is no camera, so we do not know the type of car; it could be a compact, sedan, station wagon, pickup, or van. Thus, the hidden/latent variable, represented by random variable
z, is the type of the car, and the observed variable is the weight measured, represented by random variable
x.
Figure 3 shows the process of data generation. Our example can be represented by the Bayesian Network in
Figure 4a. The latent variables and observed variables can also be random vectors, denoted as
$\mathit{z}$ and
$\mathit{x}$.
Sampling our observation takes two steps. First, a sample
$\mathit{z}$ comes from the probability distribution
${p}_{\mathit{z}}\left(\mathbf{z}\right)$. Then, a sample
$\mathbf{x}$ comes from
${p}_{\mathit{x} \mid \mathit{z}}(\mathbf{x} \mid \mathbf{z})$. This is also called generation, represented by
$z\to x$ in
Figure 4a,b; it is represented by the solid arrow. With
${p}_{\mathit{z}}\left(\mathbf{z}\right)$ and ${p}_{\mathit{x} \mid \mathit{z}}(\mathbf{x} \mid \mathbf{z})$, we can obtain the joint density
${p}_{(\mathit{x},\mathit{z})}(\mathbf{x},\mathbf{z})$. Then, by marginalizing the joint density, we get
${p}_{\mathit{x}}\left(\mathbf{x}\right)$. Then, from there, we can get
$p(\mathbf{z} \mid \mathbf{x})$. Obtaining $p(\mathbf{z} \mid \mathbf{x})$ is represented by
$x\to z$. It is the inverse of generation, and is called inference. In
Figure 4b, inference is represented by the dotted arrow. Inference can be obtained using Bayes Rule:
Often times, calculating
${p}_{\mathit{x}}\left(\mathbf{x}\right)$ is intractable, making inference intractable through this method. This leads to approximation methods. Markov Chain Monte Carlo (MCMC) methods [
13] are a common collection of methods [
24] used for this. Variational inference is a quicker family of methods; however, unlike MCMC methods, it does not guarantee asymptotically exact samples from the target density function [
25].
If we parameterize our model with $\mathit{\theta}$, we use ${p}_{\mathit{\theta}}\left(\mathrm{x}\right)$; this means that the PDF $p\left(\mathrm{x}\right)$ is associated with random variable x and has parameters represented by $\mathit{\theta}$. We would attempt to learn $\mathit{\theta}$ using maximum likelihood estimation.
When we have a latent variable model
${p}_{\mathit{\theta}}(\mathbf{x},\mathbf{z})$, where a deep neural network parameterizes
${p}_{\mathit{\theta}}(\mathbf{x},\mathbf{z})$, it is called a deep latent variable model (DLVM) [
20].
An example of a DLVM is given by the following [
26]:
We have a random vector $\mathit{z}$, of length K, sampled from a multivariate Bernoulli distribution. This is then fed into a neural network, denoted by DNN, which outputs random vector $\mathbf{x}$. The neural network can have L output units with sigmoid activations.
Another example of DLVM is the following [
27]: If the observation data
$\mathbf{x}$ of size L is binary, the latent space is Gaussian, and the observation model is a factorized multivariate Bernoulli distribution, we have the following formulas:
where
${\mathrm{a}}_{j}$ is a value between 0 and 1 and
$\mathbf{a}$ is a vector with
${\mathrm{a}}_{j}$’s; it can be implemented by having the output layer of the neural network have sigmoid activation functions.
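A minimal sketch of such a Bernoulli DLVM decoder, with an assumed one-layer network and illustrative dimensions K and L:

```python
import numpy as np

rng = np.random.default_rng(2)

K, L = 4, 8   # latent and observation dimensions (illustrative)
# Hypothetical one-layer "DNN" decoder weights
W = rng.normal(0, 1, size=(L, K))
b = rng.normal(0, 1, size=L)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Step 1: sample z from the Gaussian latent space, z ~ N(0, I)
z = rng.normal(0, 1, size=K)

# Step 2: the network's sigmoid output layer gives a = sigmoid(Wz + b),
# the parameters of the factorized multivariate Bernoulli p(x | z)
a = sigmoid(W @ z + b)

# Step 3: sample the binary observation x component-wise
x = (rng.uniform(0, 1, size=L) < a).astype(int)
```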
A third example of a DLVM is where
$\mathit{z}$ has a Gaussian distribution, and $p(\mathbf{x} \mid \mathbf{z})$ can be a neural network with a softmax activation function for its output layer [
26]. Our generative model in the case of latent variable models learns the joint PDF
${p}_{\mathit{\theta}}(\mathbf{z},\mathbf{x})$.
Overall, latent variable training can be summarized by the following four steps [
26]:
 (1)
Sampling
$\mathit{z}\sim {p}_{z}\left(\mathbf{z}\right)$
$\mathit{x}\sim {p}_{\mathit{\theta}}(\mathbf{x} \mid \mathbf{z})$
 (2)
Evaluate likelihood ${p}_{\mathit{\theta}}\left(\mathbf{x}\right)={\sum}_{\mathbf{z}}{p}_{\mathit{z}}\left(\mathbf{z}\right){p}_{\mathit{\theta}}(\mathbf{x} \mid \mathbf{z})$
 (3)
Train $\underset{\mathit{\theta}}{argmax}{\sum}_{i=1}^{N}log{p}_{\mathit{\theta}}\left({\mathbf{x}}^{\left(i\right)}\right)={\sum}_{i}log{\sum}_{\mathbf{z}}{p}_{z}\left(\mathbf{z}\right){p}_{\mathit{\theta}}({\mathbf{x}}^{\left(i\right)} \mid \mathbf{z})$
 (4)
Representation $\mathit{x}\to \mathit{z}$
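Steps (1)–(3) above can be sketched for a tiny discrete latent variable model (all parameters are illustrative), where the likelihood is evaluated by summing over z:

```python
import numpy as np

rng = np.random.default_rng(3)

# Tiny latent variable model: discrete z in {0, 1}, binary x of length 3
# (all numbers are illustrative)
p_z = np.array([0.6, 0.4])
# Bernoulli parameters p_theta(x_j = 1 | z), one row per latent value
a = np.array([[0.9, 0.1, 0.5],
              [0.2, 0.8, 0.5]])

def likelihood(x):
    # Step 2: p_theta(x) = sum_z p(z) p_theta(x | z)
    px_given_z = np.prod(a ** x * (1 - a) ** (1 - x), axis=1)
    return np.sum(p_z * px_given_z)

# Step 3: the training objective is the total log likelihood of the dataset
data = (rng.uniform(size=(5, 3)) < 0.5).astype(int)
log_lik = sum(np.log(likelihood(x)) for x in data)

# Sanity check: the likelihoods of all 8 possible x sum to 1
total = sum(likelihood(np.array([i, j, k]))
            for i in (0, 1) for j in (0, 1) for k in (0, 1))
assert np.isclose(total, 1.0)
```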
Common latent variable models in deep learning include energy-based models, variational autoencoders, and flow models [
28,
29]. The VAE explicitly models the density of the distribution, so it has a prescribed Bayesian Network.
2.7. Neural Autoregressive Models
In time series analysis, an autoregressive model of order p is denoted as AR(p) [
30]. If we have a time series
$\{y[\mathrm{n}],y[\mathrm{n}-1],\dots ,y[\mathrm{n}-\mathrm{p}]\}$, it is an AR(p) process if it satisfies the following equation:
$y\left[n\right]$ denotes the value of
y (a scalar value) at time n.
$y\left[n\right]$ is a linear combination of the p past values of
y, weighted by scalar coefficients
${\mathrm{a}}_{j}$ plus some white noise
$w\left[n\right]$ and the mean
$\mu $ of the process (
$\mathbb{E}\left[y[\mathrm{n}]\right]=\mu $).
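A simulation sketch of a stable AR(2) process, using one common mean-deviation parameterization (the coefficients and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate y[n] = mu + a1 (y[n-1] - mu) + a2 (y[n-2] - mu) + w[n]
a = np.array([0.5, -0.3])   # stable AR(2) coefficients (illustrative)
mu, sigma = 2.0, 1.0        # process mean and white-noise std
N = 5000

y = np.full(N, mu)
for n in range(2, N):
    y[n] = (mu
            + a[0] * (y[n - 1] - mu)
            + a[1] * (y[n - 2] - mu)
            + rng.normal(0, sigma))

# For a stable AR process the sample mean is close to mu
assert abs(y.mean() - mu) < 0.2
```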
In neural networks, there is a subtype of the AR model, called the neural autoregressive model or autoregressive neural density estimators; in deep learning literature, this subtype is often just called an autoregressive model. The neural autoregressive model involves using a Bayesian Network structure where the conditional probabilities are set to neural networks.
If we are given a Bayesian Network representation for a model, we can get a tractable gradient for our log likelihood by setting the conditional probability distributions to neural networks [
31]:
Parent denotes the parent nodes of
${x}_{i}$. If we assume that our Bayesian Network is fully expressive, any joint probability distribution can be decomposed to a product of conditionals, using the probability chain rule and conditional independence assumptions:
This is called a neural autoregressive model. Common neural autoregressive models include Neural Autoregressive Distribution Estimation (NADE) [
32,
33], Masked Autoencoder for Distribution Estimation (MADE) [
34], Deep AutoRegressive Networks (DARN) [
35], PixelRNN [
36], PixelCNN [
36,
37], and WaveNet [
38].
Neural AR models are slower at sampling because they generate one dimension at a time, sequentially. They also tend to model local structure better than global structure.
2.8. Generative Adversarial Networks (GANs)
Two major implicit models are Generative Stochastic Networks (GSNs) and GANs [
39,
40].
A GAN trains a generative model $Gen$ and a discriminative model $Dis$ simultaneously. $Gen$ attempts to estimate the distribution of the data, while $Dis$ tries to estimate the probability that the data came from the training set rather than $Gen$. $Gen$ tries to maximize the probability that the discriminator makes a mistake. Typically for a GAN, $Dis$ is only used during training, but, afterwards, it is discarded.
The Bayesian Network in Figure 4a from Section 2.6 can also represent the generator of a GAN: a random noise vector $\mathbf{z}$, often sampled from a uniform distribution, is the input to $Gen$, which outputs synthetic data $Gen(\mathbf{z})$; we can also write $Gen(\mathbf{z})$ as $\widehat{\mathbf{x}}$. The GAN models the distribution implicitly, so it has an implicit Bayesian network [41]. Thus, the GAN is both an implicit model and a latent variable model.
The algorithm is outlined in Algorithm 1. The noise prior $p_{g}(\mathbf{z})$ is the input to the generator; it is typically a uniform distribution. The data-generating distribution is denoted $p_{\text{data}}(\mathbf{x})$; it is the distribution underlying our real data. The hyperparameter k denotes how many discriminator update steps are applied per generator update. $\theta_{G}$ represents the weights and biases of $Gen$, while $\theta_{D}$ represents the weights and biases of $Dis$; the subscripts denote their association with the generator and discriminator. Figure 5 shows the architecture.
Algorithm 1: Original GAN algorithm
for number of training iterations do
  for k steps do
    Sample minibatch $\{\mathbf{z}^{(1)},\dots,\mathbf{z}^{(m)}\}\sim p_{g}(\mathbf{z})$
    Sample minibatch $\{\mathbf{x}^{(1)},\dots,\mathbf{x}^{(m)}\}\sim p_{\text{data}}(\mathbf{x})$
    Update the weights of the discriminator by ascending its stochastic gradient
  end for
  Sample minibatch $\{\mathbf{z}^{(1)},\dots,\mathbf{z}^{(m)}\}\sim p_{g}(\mathbf{z})$
  Update the weights of the generator by descending its stochastic gradient
end for

Gradient updates can use any gradient-based rule. The GAN objective is closely related to the JS Divergence: when $Dis$ is optimal, updating the weights of $Gen$ minimizes the JS Divergence between the data distribution and the generator's distribution.
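This relationship can be checked numerically. The sketch below is a toy setup (Gaussian $p_{\text{data}}$ and $p_g$ are illustrative assumptions): it plugs the known optimal discriminator $Dis^{*}(x)=p_{\text{data}}(x)/(p_{\text{data}}(x)+p_{g}(x))$ into the value function. When $p_g=p_{\text{data}}$, the value is exactly $-\log 4$ (JS Divergence zero), and it rises as the generator's distribution moves away:

```python
import numpy as np

def npdf(t, mu):
    # unit-variance Gaussian density
    return np.exp(-0.5 * (t - mu) ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(0)
N = 400000

def value(mu_g):
    """Monte Carlo estimate of V(Dis*, Gen) for p_data=N(0,1), p_g=N(mu_g,1)."""
    x = rng.normal(0.0, 1.0, N)        # x ~ p_data
    x_g = rng.normal(mu_g, 1.0, N)     # Gen(z) ~ p_g
    D = lambda t: npdf(t, 0.0) / (npdf(t, 0.0) + npdf(t, mu_g))  # optimal Dis
    return np.mean(np.log(D(x))) + np.mean(np.log(1.0 - D(x_g)))

v_match = value(0.0)   # p_g = p_data: V = -log 4, JSD = 0
v_off = value(2.0)     # mismatched generator: V = -log 4 + 2*JSD > -log 4
```

Minimizing this value over the generator therefore drives $p_g$ toward $p_{\text{data}}$, which is the sense in which the optimal-discriminator objective minimizes JS Divergence.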
The original GAN had issues such as mode collapse, convergence difficulties, and vanishing gradients. The Wasserstein GAN [42] is a class of GAN meant to improve on these flaws; it uses the Wasserstein distance instead. Conditional GANs (CGANs) [43] are another type of GAN that attempts to alleviate some of these flaws.
The Deep Convolutional GAN (DCGAN) [44] is what many GANs are based on; it optimizes with Adam, uses an all-convolutional network [45], and applies batch normalization [46] in most layers of $Dis$ and $Gen$; $Gen$'s last layer and $Dis$'s first layer are not batch normalized. Other GAN methods include Periodic Spatial GAN (PSGAN) [47], InfoGAN [48], CycleGAN [49], StyleGAN [50], and Self-Attention GAN (SAGAN) [51].
2.9. Gradient Estimation
2.9.1. REINFORCE Estimator/Score Function
Two important gradient estimators are the score function estimator and the pathwise gradient estimator [22,26,52,53]. The score function estimator (also known as the REINFORCE estimator) can handle nondifferentiable functions; the downside is that it has high variance.
If we have a function $p_{\theta}(\mathbf{x})$, which is a PDF of random variable $\mathbf{x}$ parameterized by $\theta$, then the score function is $\nabla_{\theta}\log(p_{\theta}(\mathbf{x}))$, the derivative of the log of our PDF w.r.t. $\theta$. The score function can be written as:
$\nabla_{\theta}\log(p_{\theta}(\mathbf{x}))=\frac{\nabla_{\theta}p_{\theta}(\mathbf{x})}{p_{\theta}(\mathbf{x})}$
The score function's expectation is zero:
$\mathbb{E}_{p_{\theta}(\mathbf{x})}[\nabla_{\theta}\log(p_{\theta}(\mathbf{x}))]=\int p_{\theta}(\mathbf{x})\frac{\nabla_{\theta}p_{\theta}(\mathbf{x})}{p_{\theta}(\mathbf{x})}d\mathbf{x}=\nabla_{\theta}\int p_{\theta}(\mathbf{x})d\mathbf{x}=\nabla_{\theta}1=0$
The score function's variance is the Fisher information. The estimator for the gradient of an expectation $\mathbb{E}_{p_{\theta}(\mathbf{x})}[f(\mathbf{x})]$ can be derived as follows:
$\nabla_{\theta}\mathbb{E}_{p_{\theta}(\mathbf{x})}[f(\mathbf{x})]=\mathbb{E}_{p_{\theta}(\mathbf{x})}[f(\mathbf{x})\nabla_{\theta}\log(p_{\theta}(\mathbf{x}))]\approx\frac{1}{L}\sum_{l=1}^{L}f(\mathbf{x}^{(l)})\nabla_{\theta}\log(p_{\theta}(\mathbf{x}^{(l)})),\quad\mathbf{x}^{(l)}\sim p_{\theta}(\mathbf{x})$
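A minimal numerical sketch of this estimator, assuming (for illustration only) a one-dimensional Gaussian $p_{\theta}=\mathcal{N}(\mu,1)$ and $f(x)=x$, so that the true gradient $\nabla_{\mu}\mathbb{E}[f(x)]$ is 1:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.5                        # theta: mean of p_theta = N(mu, 1)
L = 200000                      # number of Monte Carlo samples
x = rng.normal(mu, 1.0, L)
f = x                           # f(x) = x, so the true gradient d/dmu E[f] = 1
score = x - mu                  # score: d/dmu log N(x; mu, 1) = (x - mu)
grad_est = np.mean(f * score)   # REINFORCE estimate, close to 1
```

The per-sample terms $f(x)(x-\mu)$ fluctuate widely around the true value, which is exactly the high-variance behavior noted above.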
2.9.2. Pathwise Gradient Estimator
The pathwise gradient estimator is also known as the reparameterization trick or the pathwise derivative estimator. Unlike the score function estimator, it has low variance, so it is a common choice; more details about this estimator are given in the next section. For a continuous distribution of $\mathbf{x}$, direct sampling has an equivalent indirect process:
$\mathbf{x}=g(\epsilon,\theta),\quad\epsilon\sim p(\epsilon)$
This statement means an indirect way to create samples $\mathbf{x}$ from $p_{\theta}(\mathbf{x})$ is to first sample from $p(\epsilon)$, a distribution independent of $\theta$, and then apply the deterministic transformation $g(\epsilon,\theta)$. This can be called a sampling path or a sampling process.
For indirect sampling from a Gaussian distribution denoted by $\mathcal{N}(\mathbf{x};\mathbf{\mu},\mathbf{C})$, we can reparameterize it by making $g(\epsilon,\theta)$ a location-scale transformation, given by $g(\epsilon,\theta)=\mathbf{\mu}+\mathbf{L}\epsilon$, where $\mathbf{L}\mathbf{L}^{T}=\mathbf{C}$. $\mathbf{L}$ is a lower triangular matrix with nonzero diagonal values, and $\epsilon$ is sampled from a standard isotropic multivariate Gaussian $p(\epsilon)=\mathcal{N}(\mathbf{0},\mathbf{I})$.
We can derive the gradient estimator as follows:
$\nabla_{\theta}\mathbb{E}_{p_{\theta}(\mathbf{x})}[f(\mathbf{x})]=\nabla_{\theta}\mathbb{E}_{p(\epsilon)}[f(g(\epsilon,\theta))]=\mathbb{E}_{p(\epsilon)}[\nabla_{\theta}f(g(\epsilon,\theta))]$
Thus, our pathwise gradient estimator where $\mathbf{x}$ is distributed according to $\mathcal{N}(\mathbf{x};\mathbf{\mu},\mathbf{C})$ is given by:
$\nabla_{\theta}\mathbb{E}_{p_{\theta}(\mathbf{x})}[f(\mathbf{x})]\approx\frac{1}{L}\sum_{l=1}^{L}\nabla_{\theta}f(g(\epsilon^{(l)},\theta)),\quad\epsilon^{(l)}\sim p(\epsilon)$
where $g(\epsilon,\theta)=\mathbf{\mu}+\mathbf{L}\epsilon$, $\mathbf{L}\mathbf{L}^{T}=\mathbf{C}$, and $p(\epsilon)=\mathcal{N}(\mathbf{0},\mathbf{I})$.
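A minimal sketch of this estimator; the 2-D covariance, $f(\mathbf{x})=\sum_i x_i^2$, and sample size are illustrative assumptions. For this $f$, $\nabla_{\mu}f(g(\epsilon,\theta))=2\mathbf{x}$ because $\partial\mathbf{x}/\partial\mathbf{\mu}=\mathbf{I}$, and the true gradient is $2\mathbf{\mu}$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])
L_mat = np.linalg.cholesky(C)       # lower triangular L with L L^T = C
n = 100000
eps = rng.standard_normal((n, 2))   # eps ~ N(0, I)
x = mu + eps @ L_mat.T              # g(eps, theta) = mu + L eps
# f(x) = sum(x**2); gradient w.r.t. mu of f(g(eps, theta)) is 2x
grad_est = np.mean(2 * x, axis=0)   # pathwise estimate, close to 2*mu
```

The per-sample gradients cluster tightly around $2\mathbf{\mu}$, illustrating the low variance relative to the score function estimator.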
3. Variational Inference
In Bayesian statistics, the parameters that we estimate are random variables instead of deterministic variables. For latent random vector $\mathbf{z}$ and observed variables/vector $\mathbf{x}$, $p_{\theta}(\mathbf{z}\mid\mathbf{x})$ is known as the posterior distribution; $p(\mathbf{z})$ is the prior distribution, $p_{\theta}(\mathbf{x})$ is the model evidence or marginal likelihood, and $p_{\theta}(\mathbf{x}\mid\mathbf{z})$ is the likelihood. We perform updates on the prior $p_{\mathbf{z}}(\mathbf{z})$ using Bayes' rule.
Variational inference is a particular method for approximating the posterior distribution. We approximate the posterior distribution $p_{\theta}(\mathbf{z}\mid\mathbf{x})$ with the approximate posterior $q_{\varphi}(\mathbf{z})$, also known as the variational posterior, where $\varphi$ represents the variational parameters. We optimize over $\varphi$ so we can fit the variational posterior to the real posterior. Any valid distribution for $q_{\varphi}(\mathbf{z})$ can be used as long as we can sample data from it and we can compute $\log(q_{\varphi}(\mathbf{z}))$ and $\nabla_{\varphi}\log(q_{\varphi}(\mathbf{z}))$. Thus, we want to solve the following for all $\mathbf{x}^{(i)}$:
$\min_{\varphi^{(i)}}\ \mathbb{D}_{KL}(q_{\varphi^{(i)}}(\mathbf{z})\parallel p_{\theta}(\mathbf{z}\mid\mathbf{x}^{(i)}))$
We are taking the reverse KL Divergence between $p$ and $q$. There is a different posterior for each datapoint $\mathbf{x}^{(i)}$, so we learn a different $\varphi$ for each datapoint. To make this calculation faster, we can use an amortized formulation for variational inference:
$q_{\varphi}(\mathbf{z}\mid\mathbf{x})\approx p_{\theta}(\mathbf{z}\mid\mathbf{x})$
In this formulation, we predict $\varphi$ with a neural network, called an inference network; the variational parameters $\varphi$ then refer to the parameters of this inference network. The downside of this formulation is less precision. With this, we can derive the Evidence Lower Bound (ELBO):
$\log(p_{\theta}(\mathbf{x}))=\mathcal{L}(\theta,\varphi;\mathbf{x})+\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p_{\theta}(\mathbf{z}\mid\mathbf{x}))$
$\mathrm{where}\ \mathcal{L}(\theta,\varphi;\mathbf{x})=\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}[-\log(q_{\varphi}(\mathbf{z}\mid\mathbf{x}))+\log(p_{\theta}(\mathbf{x}\mid\mathbf{z}))+\log(p(\mathbf{z}))]$.
Since $\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p_{\theta}(\mathbf{z}\mid\mathbf{x}))\ge 0$, we have $\log(p_{\theta}(\mathbf{x}))\ge\mathcal{L}(\theta,\varphi;\mathbf{x})$.
$\mathcal{L}(\theta,\varphi;\mathbf{x})$ is called the Evidence Lower Bound (ELBO), or variational lower bound.
We can derive a second formulation as follows:
$\mathcal{L}(\theta,\varphi;\mathbf{x})=\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\left[\log\left(\frac{p_{\theta}(\mathbf{x},\mathbf{z})}{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\right)\right]$
We can also derive a third formulation as follows:
$\mathcal{L}(\theta,\varphi;\mathbf{x})=\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\left[-\log(q_{\varphi}(\mathbf{z}\mid\mathbf{x}))+\log(p_{\theta}(\mathbf{x}\mid\mathbf{z}))+\log(p(\mathbf{z}))\right]$
$=\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}[\log(p_{\theta}(\mathbf{x}\mid\mathbf{z}))]+\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\left[\log\left(\frac{p(\mathbf{z})}{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\right)\right]$
$=\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}[\log(p_{\theta}(\mathbf{x}\mid\mathbf{z}))]-\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p(\mathbf{z}))$
We would train our model by maximizing the ELBO $\mathcal{L}(\mathit{\theta},\mathit{\varphi};\mathbf{x})$ with respect to $\mathit{\theta}$ and $\mathit{\varphi}$, which are our model parameters and variational parameters.
Taking the gradient of the ELBO w.r.t. $\theta$ is easily calculated:
$\nabla_{\theta}\mathcal{L}(\theta,\varphi;\mathbf{x})=\nabla_{\theta}\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\left[\log\left(\frac{p_{\theta}(\mathbf{x},\mathbf{z})}{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\right)\right]$
$=\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\left[\nabla_{\theta}\log(p_{\theta}(\mathbf{x},\mathbf{z}))\right]$
$\approx\frac{1}{L}\sum_{l=1}^{L}\nabla_{\theta}\log(p_{\theta}(\mathbf{x},\mathbf{z}^{(l)}))\quad\text{with}\ \mathbf{z}^{(l)}\sim q_{\varphi}(\mathbf{z}\mid\mathbf{x})$
Thus, we estimate the gradient w.r.t. $\theta$ by generating samples from $q$, calculating each $\nabla_{\theta}\log(p_{\theta}(\mathbf{x},\mathbf{z}^{(l)}))$, and averaging these individual gradients to estimate the gradient of the ELBO w.r.t. $\theta$.
Estimating $\nabla_{\varphi}\mathcal{L}(\theta,\varphi;\mathbf{x})$ is more difficult, because the expectation is itself a function of $\varphi$, so we cannot simply move the gradient inside the expectation:
$\nabla_{\varphi}\mathcal{L}(\theta,\varphi;\mathbf{x})=\nabla_{\varphi}\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\left[\log\left(\frac{p_{\theta}(\mathbf{x},\mathbf{z})}{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\right)\right]\ne\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}\left[\nabla_{\varphi}\left(\log(p_{\theta}(\mathbf{x},\mathbf{z}))-\log(q_{\varphi}(\mathbf{z}\mid\mathbf{x}))\right)\right]$
The score function and the pathwise gradient estimator are both possible methods to estimate this gradient. The score function applies to latent variables that are continuous or discrete. The pathwise gradient estimator applies to continuous latent variables and requires that the function being estimated is differentiable. For the pathwise gradient estimator, we reparameterize a sample from $q_{\varphi}(\mathbf{z}\mid\mathbf{x})$ by expressing it as a function of a sample $\epsilon$ from some fixed distribution $p(\epsilon)$:
$\mathbf{z}=g(\epsilon,\varphi,\mathbf{x}),\quad\epsilon\sim p(\epsilon)$
$p(\epsilon)$ is independent of $\mathbf{x}$ and $\varphi$. We can bring $\nabla_{\varphi}$ inside the expectation because $p(\epsilon)$ does not depend on $\varphi$. We assume $g(\epsilon,\varphi,\mathbf{x})$ is differentiable with respect to $\varphi$. Thus, our pathwise gradient estimator for $\nabla_{\varphi}\mathcal{L}(\theta,\varphi;\mathbf{x})$, where $\mathbf{z}$ is sampled from $\mathcal{N}(\mathbf{z};\mathbf{\mu},\mathbf{C})$, can be derived using Equation (45):
We can choose our $q_{\varphi}(\mathbf{z}\mid\mathbf{x})=\mathcal{N}(\mathbf{z};\mathbf{\mu},\mathrm{diag}(\sigma^{2}))$, which is a multivariate Gaussian distribution with a diagonal covariance matrix. $\mathbf{\mu}$ is the mean vector and $\sigma^{2}$ is the vector that forms the covariance matrix $\mathrm{diag}(\sigma^{2})$, which we can sample using
$\mathbf{z}=\mathbf{\mu}+\mathbf{\sigma}\odot\mathbf{\epsilon},\quad\mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
where $\mathrm{diag}(\sigma^{2})=\mathbf{L}\mathbf{L}^{T}$, and $\odot$ is an elementwise multiplication operator. Thus, if we have $\mathbf{z}\sim\mathcal{N}(\mathbf{\mu},\mathrm{diag}(\sigma^{2}))$, we can reparameterize it by $\mathbf{z}=\mathbf{\mu}+\mathbf{\sigma}\odot\mathbf{\epsilon}$.
If we are not using amortized variational inference, the formula for ELBO is different.
4. The Variational Autoencoder
The Variational Autoencoder uses an inference network as its encoder. The VAE has an MLP encoder and an MLP decoder [2]. The encoder is the variational posterior $q_{\varphi}(\mathbf{z}\mid\mathbf{x})$ and is an inference/recognition model. The decoder is a generative model, and it represents the likelihood $p_{\theta}(\mathbf{x}\mid\mathbf{z})$. A joint inference distribution can be defined as
$q_{\varphi}(\mathbf{x},\mathbf{z})=p_{\text{data}}(\mathbf{x})q_{\varphi}(\mathbf{z}\mid\mathbf{x})$
$q_{\varphi}(\mathbf{z})$ is called the aggregated posterior, given by the following:
$q_{\varphi}(\mathbf{z})=\mathbb{E}_{p_{\text{data}}(\mathbf{x})}[q_{\varphi}(\mathbf{z}\mid\mathbf{x})]$
The prior is a standard multivariate isotropic Gaussian distribution, $p_{\mathbf{z}}(\mathbf{z})=\mathcal{N}(\mathbf{z};\mathbf{0},\mathbf{I})$, while the likelihood function $p_{\theta}(\mathbf{x}\mid\mathbf{z})$ is a Gaussian distribution or a Bernoulli distribution. The Gaussian likelihood is $p_{\theta}(\mathbf{x}\mid\mathbf{z})=\mathcal{N}(\mathbf{x};\mathbf{\mu}_{decoder},\mathrm{diag}(\sigma_{decoder}^{2}))$, which is a multivariate Gaussian with a diagonal covariance matrix. The posterior distribution can be any PDF, but is assumed to be approximately a multivariate Gaussian with a diagonal covariance matrix. The variational posterior is also taken to be a multivariate Gaussian with a diagonal covariance matrix, given by $q_{\varphi}(\mathbf{z}\mid\mathbf{x})=\mathcal{N}(\mathbf{z};\mathbf{\mu},\mathrm{diag}(\sigma^{2}))$; $\mathbf{\mu}$ and $\sigma^{2}$ from the variational posterior are outputs of the encoder.
The weights and biases for the encoder are the variational parameters $\mathit{\varphi}$, while the weights and biases for the decoder are the model parameters $\mathit{\theta}$.
We sample $\mathbf{z}$ from the encoder using the reparameterization trick: the encoder outputs $\mathbf{\mu}$ and $\sigma^{2}$, we generate $\epsilon$ from $\mathcal{N}(\mathbf{0},\mathbf{I})$, and Equation (55) gives $\mathbf{z}$. The variable $\mathbf{z}$ is the input to the decoder, which then generates a new example of $\mathbf{x}$.
Note that, in practice, the encoder outputs $\log(\sigma^{2})$ instead of $\sigma^{2}$, which guarantees positive values for $\sigma^{2}$. We can then retrieve $\sigma^{2}$ by taking the exponential of $\log(\sigma^{2})$, and retrieve $\sigma$ by taking the exponential of $0.5\log(\sigma^{2})$. The literature often refers to this output as $\sigma$ instead of $\sigma^{2}$.
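A small sketch of this convention (the numerical values are arbitrary placeholders standing in for an encoder's output):

```python
import numpy as np

rng = np.random.default_rng(0)
log_var = np.array([-2.0, 0.0, 3.5])   # unconstrained encoder output
var = np.exp(log_var)                  # sigma^2, guaranteed positive
sigma = np.exp(0.5 * log_var)          # sigma = exp(0.5 * log sigma^2)
mu = np.zeros(3)
eps = rng.standard_normal(3)
z = mu + sigma * eps                   # reparameterized sample
```

Because the network outputs live on the whole real line, no activation constraint is needed to keep the variance positive.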
If we assume that our encoder has one hidden layer, and the distribution is a multivariate Gaussian with a diagonal covariance matrix, we can write the encoder and the sampling process as:
$\mathbf{h}=\tanh(\mathbf{W}_{1}\mathbf{x}+\mathbf{b}_{1})$
$\mathbf{\mu}=\mathbf{W}_{2}\mathbf{h}+\mathbf{b}_{2}$
$\log(\sigma^{2})=\mathbf{W}_{3}\mathbf{h}+\mathbf{b}_{3}$
$\mathbf{z}=\mathbf{\mu}+\mathbf{\sigma}\odot\mathbf{\epsilon},\quad\mathbf{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$
$\{\mathbf{W}_{1},\mathbf{W}_{2},\mathbf{W}_{3},\mathbf{b}_{1},\mathbf{b}_{2},\mathbf{b}_{3}\}$ are the weights and biases of the encoder MLP, so they are the variational parameters $\varphi$. An encoder with any number of hidden layers can be summarized with:
$(\mathbf{\mu},\log(\sigma^{2}))=\mathrm{EncoderMLP}_{\varphi}(\mathbf{x})$
For the decoder, we have two choices.
 (1)
Multivariate Bernoulli MLP for decoder:
The likelihood $p_{\theta}(\mathbf{x}\mid\mathbf{z})$ is a multivariate Bernoulli. With decoder input $\mathbf{z}$, the probabilities of the decoder are calculated with the MLP:
$\mathbf{y}=sig(\mathbf{W}_{2}\tanh(\mathbf{W}_{1}\mathbf{z}+\mathbf{b}_{1})+\mathbf{b}_{2})$
$\{\mathbf{W}_{1},\mathbf{W}_{2},\mathbf{b}_{1},\mathbf{b}_{2}\}$ are the weights and biases of the decoder MLP. The hidden layer has a tanh activation function, while the output layer has a sigmoid activation function $sig(.)$. The output is plugged into the log likelihood, giving a Cross-Entropy (CE) function:
$\log(p_{\theta}(\mathbf{x}\mid\mathbf{z}))=\sum_{j=1}^{d}x_{j}\log(y_{j})+(1-x_{j})\log(1-y_{j})$
The equations for more hidden layers can be written as
$\mathbf{y}=\mathrm{DecoderMLP}_{\theta}(\mathbf{z}),\quad p_{\theta}(\mathbf{x}\mid\mathbf{z})=\prod_{j=1}^{d}\mathrm{Bernoulli}(x_{j};y_{j})$
The variable $d$ is the dimensionality of $\mathbf{x}$, and Bernoulli$(.;y)$ is the Bernoulli PMF, with $\forall y_{j}\in\mathbf{y}:0\le y_{j}\le 1$. This $\mathbf{y}$ can be implemented by making the last layer of the decoder a sigmoid function. This is similar to the second example of a DLVM in Section 2.6.
 (2)
Gaussian MLP as decoder
This is the case where the decoder distribution is a multivariate Gaussian with a diagonal covariance structure:
$\mathbf{h}=\tanh(\mathbf{W}_{3}\mathbf{z}+\mathbf{b}_{3})$
$\mathbf{\mu}_{decoder}=\mathbf{W}_{4}\mathbf{h}+\mathbf{b}_{4}$
$\log(\sigma_{decoder}^{2})=\mathbf{W}_{5}\mathbf{h}+\mathbf{b}_{5}$
$p_{\theta}(\mathbf{x}\mid\mathbf{z})=\mathcal{N}(\mathbf{x};\mathbf{\mu}_{decoder},\mathrm{diag}(\sigma_{decoder}^{2}))$
where $\{\mathbf{W}_{3},\mathbf{W}_{4},\mathbf{W}_{5},\mathbf{b}_{3},\mathbf{b}_{4},\mathbf{b}_{5}\}$ are the weights and biases of the decoder MLP, so they are the model parameters $\theta$.
Those are the derivations for the forward propagation in the VAE.
Figure 6 shows the architecture of the VAE for a forward pass, excluding the ELBO calculation.
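The forward pass can be sketched end-to-end in a few lines. The layer sizes, tanh activations, random weights, and Bernoulli decoder below are toy assumptions for illustration, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, d_z = 8, 16, 2                 # toy dimensions (assumed)
x = (rng.random(d_x) > 0.5).astype(float)

# Encoder: one tanh hidden layer, outputs mu and log-variance
W1, b1 = 0.1 * rng.standard_normal((d_h, d_x)), np.zeros(d_h)
W2, b2 = 0.1 * rng.standard_normal((d_z, d_h)), np.zeros(d_z)
W3, b3 = 0.1 * rng.standard_normal((d_z, d_h)), np.zeros(d_z)
h = np.tanh(W1 @ x + b1)
mu, log_var = W2 @ h + b2, W3 @ h + b3

# Reparameterization trick
eps = rng.standard_normal(d_z)
z = mu + np.exp(0.5 * log_var) * eps

# Bernoulli decoder: tanh hidden layer, sigmoid output probabilities
W4, b4 = 0.1 * rng.standard_normal((d_h, d_z)), np.zeros(d_h)
W5, b5 = 0.1 * rng.standard_normal((d_x, d_h)), np.zeros(d_x)
y = 1.0 / (1.0 + np.exp(-(W5 @ np.tanh(W4 @ z + b4) + b5)))

recon = np.sum(x * np.log(y) + (1 - x) * np.log(1 - y))   # log-likelihood (-CE)
kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var)) # KL regularizer
elbo = recon - kl
```

In a real implementation, the same computation would be batched and the parameters trained by backpropagation through the ELBO.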
We obtain ELBO estimates from the third formulation:
$\mathcal{L}(\theta,\varphi;\mathbf{x})=\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}[\log(p_{\theta}(\mathbf{x}\mid\mathbf{z}))]-\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p(\mathbf{z}))$
$\mathbb{E}_{q_{\varphi}(\mathbf{z}\mid\mathbf{x})}[\log(p_{\theta}(\mathbf{x}\mid\mathbf{z}))]$ is a reconstruction loss, while $\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p(\mathbf{z}))$ is a regularizing term.
We derive the expression for $\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p_{\mathbf{z}}(\mathbf{z}))$ [2]. For the diagonal Gaussian variational posterior and standard Gaussian prior, adding the two terms gives the closed form:
$\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p_{\mathbf{z}}(\mathbf{z}))=-\frac{1}{2}\sum_{j=1}^{J}\left(1+\log(\sigma_{j}^{2})-\mu_{j}^{2}-\sigma_{j}^{2}\right)$
$\sigma_{j}^{2}$ and $\mu_{j}$ represent the $j$th components of the $\sigma^{2}$ and $\mathbf{\mu}$ vectors for a given datapoint inputted into the encoder.
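This closed form is cheap to compute; a direct sketch:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))

kl_zero = kl_diag_gaussian(np.zeros(4), np.zeros(4))    # posterior equals prior
kl_pos = kl_diag_gaussian(np.ones(4), np.full(4, -1.0)) # shifted, shrunk posterior
```

When the posterior equals the prior ($\mu=0$, $\sigma^{2}=1$), the term vanishes exactly, which is the posterior-collapse failure mode discussed later in this survey.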
Our total training set is $\mathcal{X}=\{\mathbf{x}^{(i)}\}_{i=1}^{N}$. Our Stochastic Gradient Variational Bayes (SGVB) estimator estimates the lower bound for one datapoint by:
$\tilde{\mathcal{L}}(\theta,\varphi;\mathbf{x}^{(i)})=\frac{1}{L}\sum_{l=1}^{L}\log\left(\frac{p_{\theta}(\mathbf{x}^{(i)},\mathbf{z}^{(i,l)})}{q_{\varphi}(\mathbf{z}^{(i,l)}\mid\mathbf{x}^{(i)})}\right),\quad\mathbf{z}^{(i,l)}=g(\epsilon^{(l)},\varphi,\mathbf{x}^{(i)}),\ \epsilon^{(l)}\sim p(\epsilon)$
We can then form our SGVB estimator using minibatches from $\mathcal{X}$: with $\mathcal{X}^{(M)}=\{\mathbf{x}^{(i)}\}_{i=1}^{M}$ a minibatch from $\mathcal{X}$ with M datapoints, we can estimate the ELBO over the full dataset $\mathcal{X}$, denoted $\mathcal{L}(\theta,\varphi;\mathcal{X})$:
$\mathcal{L}(\theta,\varphi;\mathcal{X})\approx\tilde{\mathcal{L}}^{(M)}(\theta,\varphi;\mathcal{X}^{(M)})=\frac{N}{M}\sum_{i=1}^{M}\tilde{\mathcal{L}}(\theta,\varphi;\mathbf{x}^{(i)})$
where $\tilde{\mathcal{L}}^{(M)}(\theta,\varphi;\mathcal{X}^{(M)})$ is a minibatch estimator. Empirically, it has been shown that L can be set to 1 if the size of the minibatch is large.
The method used to train the VAE to find the parameters is called the Auto-Encoding Variational Bayes (AEVB) algorithm. Algorithm 2 shows the steps of the AEVB algorithm.
Algorithm 2: AEVB algorithm using minibatches
$\theta,\varphi\leftarrow$ Initialize parameters
for number of training iterations do
  $\mathcal{X}^{(M)}\leftarrow$ Random minibatch of M datapoints from full dataset $\mathcal{X}$
  Sample $\epsilon\sim p(\epsilon)$
  Calculate minibatch estimate $\tilde{\mathcal{L}}^{(M)}(\theta,\varphi;\mathcal{X}^{(M)},\epsilon)$
  Calculate gradients of minibatch estimator $\nabla_{\theta,\varphi}\tilde{\mathcal{L}}^{(M)}(\theta,\varphi;\mathcal{X}^{(M)},\epsilon)$
  $\theta,\varphi\leftarrow$ Update parameters using gradients with methods like SGD or Adagrad
end for
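To make the loop concrete, the sketch below runs AEVB on a toy binary dataset with linear encoder/decoder maps and finite-difference gradients standing in for backpropagation; every size, learning rate, and architecture choice here is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, M = 6, 2, 8
X = (rng.random((M, d_x)) > 0.5).astype(float)   # toy binary minibatch

def elbo_minibatch(params, X, eps):
    """SGVB minibatch estimate with L = 1 noise sample per datapoint."""
    W_mu, W_lv, W_dec = params
    mu = X @ W_mu                                 # encoder mean (linear, for brevity)
    log_var = X @ W_lv                            # encoder log-variance
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterization trick
    y = np.clip(1.0 / (1.0 + np.exp(-(z @ W_dec))), 1e-9, 1 - 1e-9)  # decoder probs
    recon = np.sum(X * np.log(y) + (1 - X) * np.log(1 - y), axis=1)
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=1)
    return np.mean(recon - kl)

params = [0.1 * rng.standard_normal((d_x, d_z)),
          0.1 * rng.standard_normal((d_x, d_z)),
          0.1 * rng.standard_normal((d_z, d_x))]
eps_eval = np.zeros((M, d_z))                     # fixed noise for evaluation
elbo_before = elbo_minibatch(params, X, eps_eval)

for it in range(100):                             # AEVB training iterations
    eps = rng.standard_normal((M, d_z))           # sample eps ~ p(eps)
    for P in params:                              # finite differences stand in
        G = np.zeros_like(P)                      # for autodiff gradients
        for idx in np.ndindex(P.shape):
            P[idx] += 1e-5; up = elbo_minibatch(params, X, eps)
            P[idx] -= 2e-5; dn = elbo_minibatch(params, X, eps)
            P[idx] += 1e-5; G[idx] = (up - dn) / 2e-5
        P += 0.05 * G                             # SGD ascent step on the ELBO

elbo_after = elbo_minibatch(params, X, eps_eval)
```

In practice, the finite-difference inner loop is replaced by automatic differentiation, and the minibatch is resampled from the full dataset each iteration; the structure of the loop is otherwise the same.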

We have presented and derived the original variational autoencoder model; however, "variational autoencoder" often refers to a more general framework, where we can choose different prior, posterior, and likelihood distributions, along with many other variations. Thus, the VAE framework can refer to a continuous latent variable deep learning model that uses the reparameterization trick and amortized variational inference [22].
5. Problems/Tradeoffs with the VAE
While being a powerful model, the VAE has multiple problems and tradeoffs. We will cover variance loss, image blurriness, posterior collapse, disentanglement, the balancing issue, the origin gravity effect, and the curse of dimensionality. We then compare the VAE with the GAN.
5.1. Variance Loss and Image Blurriness
When comparing input data to generated data for generic autoencoders and the VAE, there is a variance loss, which was measured empirically in [54]. This phenomenon is possibly due to averaging.
When used to generate new images, VAEs tend to produce blurrier results than other generative models, and variance loss is a main cause [54]. In [55], the authors find that the maximum likelihood approach is not always the cause of blurriness; rather, the choice of the inference distribution is. They use a sequential VAE model. Choosing flexible inference models or flexible generative models in the architecture also helps to reduce this problem [27].
The VAE-GAN reduces image blurriness by replacing the reconstruction loss term with a discriminator [56]. The multi-stage VAE [57], deep residual VAE, and hierarchical VAEs such as the VAE with inverse autoregressive flows (IAF-VAE) [58] and Nouveau VAE (NVAE) [59] also improve image generation quality. PixelVAE [60], 2-Stage VAE [61], and VQ-VAE are also very effective at generating good quality images.
5.2. Disentanglement
The success of machine learning methods depends on data representation. In [62], the authors hypothesize that this dependence arises because multiple explanatory factors of variation in the data are entangled and hidden by the representation. Representation learning can be defined as learning data representations that make extracting useful information easier for input into predictors [62]. Three important goals of a good representation include being distributed, being invariant, and having disentangled the factors of variation. Disentanglement and disentangled representations do not have agreed-upon formal definitions. A commonly used intuitive definition is that “a disentangled representation should separate the distinct, informative factors of variations in the data” [63].
The vanilla VAE fails to learn disentangled representations. InfoVAE [64], $\beta$-VAE [65], $\beta$-TCVAE [66], AnnealedVAE [67], DIP-VAE-I/II [68], and FactorVAE [68] are VAE variants that attempt to obtain a disentangled representation, and many of them are the state of the art for disentanglement. However, according to a large-scale empirical study by Google AI, these state-of-the-art VAE models do not really learn disentangled representations in an unsupervised manner [63]; the choice of model is not as important as the random seed and hyperparameters, and these hyperparameters do not transfer across data sets.
5.3. The Balancing Issue
In the VAE loss function, in the context of images, the KL Divergence regularizes the latent space, while the reconstruction loss affects the quality of the image [69]. There is a tension between these two effects. If we emphasize the reconstruction loss, the reconstruction becomes more powerful, but the latent space shape is affected, so the VAE's ability to generate new examples is negatively affected. If we emphasize the regularizing term, the disentangling becomes better and the latent space is smoother and normalized; however, the images become blurrier. The 2-Stage VAE uses a balancing factor learned during training to balance these effects. In [69], they use a deterministic variable for the decoder variance to balance these factors.
5.4. Variational Pruning and Posterior Collapse
Generally, in variational inference, there is a problem called variational pruning. We can rewrite the ELBO as follows:
$\mathcal{L}(\theta,\varphi;\mathbf{x})=\log(p_{\theta}(\mathbf{x}))-\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p_{\theta}(\mathbf{z}\mid\mathbf{x}))$
The $\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p_{\theta}(\mathbf{z}\mid\mathbf{x}))$ term is known as the variational gap. When we maximize the ELBO, we either decrease the variational gap or increase the log evidence. Maximum likelihood training increases the log evidence. To decrease the variational gap, there are two options. The first is to update $\varphi$ to bring the variational posterior closer to the real posterior. The second is to update $\theta$ so that the real posterior moves closer to the variational posterior; this can reduce how well the model fits the data. This effect can be mitigated by using a more expressive posterior.
One possible consequence of this is called variational pruning: latent variables go unused by the model, and the posterior becomes the same as the prior. In variational autoencoders, this is called posterior collapse. Some researchers speculate that the KL Divergence term $\mathbb{D}_{KL}(q_{\varphi}(\mathbf{z}\mid\mathbf{x})\parallel p(\mathbf{z}))$ in the ELBO is a cause of this phenomenon, which has led to a focus on reducing the effect of this KL term. An overly powerful decoder is another cause. Lucas et al. [70] investigated posterior collapse, initially for a linear VAE, and then extended their results to nonlinear VAEs. They find that posterior collapse can still happen in cases where the decoder is not powerful. They also formally define posterior collapse and how to measure it.
The $\delta$-VAE [71], Variational Mixture of Posteriors prior VAE (VampPrior VAE) [72], 2-Stage VAE [61], epitomic VAE (eVAE) [73], and VQ-VAE [74] are some models that attempt to prevent posterior collapse.
5.5. Origin Gravity Effect
The origin gravity effect is an effect in low dimensions. Since the prior is a multivariate standard normal distribution, its probability mass is centered around the origin, which pushes points in the latent space toward the origin. Thus, even when the data are spread around multiple clusters, the Gaussian prior tends to push the cluster centers of the latent space toward the origin. Ideally, the latent space should have separate clusters, and the prior should not push the means toward the origin. We can exploit this clustering structure by using GMM-based models, such as VaDE and GMM-VAE [75,76].
5.6. Hidden Score Function
The pathwise gradient has a hidden score function that can lead to high variance; this is discussed more in depth in
Section 6.1.
5.7. Curse of Dimensionality
Since the Gaussian prior's density depends on the $L_{2}$ norm, it suffers from the curse of dimensionality [77]. In high dimensions, the mass of the Gaussian distribution is no longer concentrated around the mean. Instead of a bell curve, a higher-dimensional Gaussian resembles a uniform distribution on the surface of a hypersphere; most of the mass is in the shell of the hypersphere. This can cause inefficiencies when sampling in high dimensions [78]. The random-walk Metropolis algorithm tends to perform poorly when sampling in high dimensions; Hamiltonian Monte Carlo tends to perform better.
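This concentration is easy to verify empirically. In the sketch below (the dimension and sample count are arbitrary choices), the norms of standard Gaussian samples cluster tightly around $\sqrt{d}$ rather than near the density's mode at the origin:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
samples = rng.standard_normal((2000, d))   # 2000 draws from N(0, I_d)
norms = np.linalg.norm(samples, axis=1)    # distances from the mean
mean_norm = norms.mean()                   # close to sqrt(d), about 31.6
spread = norms.std()                       # thin shell: tiny relative spread
```

No sample lands anywhere near the origin, even though that is where the density is highest; the volume of the shell dominates.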
5.8. GANs vs. VAEs
VAEs and GANs both use generative models to generate new data. GANs tend to be better at generating images that humans perceive as good quality; however, they do not model the density very well with respect to the likelihood criterion [27]. VAEs are the opposite: they tend to produce blurrier images but model the density well with respect to the likelihood criterion. The VAE is also more stable to train than the GAN.
6. Variations of the VAE
There are many ways to extend the VAE model: changing the prior, changing the posterior/variational posterior, regularizing the posterior, and changing the architecture. Changing the architecture includes swapping layers for RNN/LSTM/CNN layers and using other divergence measures instead of the KL Divergence. Many of these variations include convolutional layers, even if not explicitly stated. In this section, we refer to the original VAE as the vanilla VAE.
6.1. VAE Using an STL Estimator
For the VAE, we can decompose the gradient of the lower bound w.r.t. $\varphi$ as follows [79]:
$\nabla_{\varphi}\mathcal{L}(\theta,\varphi;\mathbf{x})=\mathbb{E}_{p(\epsilon)}\left[\nabla_{\mathbf{z}}\left(\log(p_{\theta}(\mathbf{x},\mathbf{z}))-\log(q_{\varphi}(\mathbf{z}\mid\mathbf{x}))\right)\nabla_{\varphi}g(\epsilon,\varphi,\mathbf{x})-\nabla_{\varphi}\log(q_{\varphi}(\mathbf{z}\mid\mathbf{x}))\right]$
where the last term is a score function term. The score function term can lead to higher variance than necessary. One way to address this is to drop the score function term, leading to the following being used instead of the full gradient:
$\mathbb{E}_{p(\epsilon)}\left[\nabla_{\mathbf{z}}\left(\log(p_{\theta}(\mathbf{x},\mathbf{z}))-\log(q_{\varphi}(\mathbf{z}\mid\mathbf{x}))\right)\nabla_{\varphi}g(\epsilon,\varphi,\mathbf{x})\right]$
This does not affect the bias of the estimator because the expectation of the score function is 0. In some cases, dropping it can actually increase the variance, if the score function is correlated with the other terms. We call this the STL estimator after the paper that introduced it, "Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference".
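A hand-derived one-dimensional illustration (the Gaussian target and variational family are assumptions for this demo): with $q_{\varphi}=\mathcal{N}(\mu,1)$, target $p=\mathcal{N}(0,1)$, and $z=\mu+\epsilon$, the per-sample total gradient w.r.t. $\mu$ works out to $-z$, while the STL (path-only) gradient is $-z+(z-\mu)=-\mu$. At the optimum $\mu=0$, the STL estimator is exactly zero for every sample, i.e., zero variance, while the total gradient keeps variance near 1:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.0                          # variational mean, already at the optimum
eps = rng.standard_normal(100000)
z = mu + eps                      # reparameterized samples from q = N(mu, 1)

# Per-sample gradients w.r.t. mu, derived by hand for this Gaussian pair
total_grads = -z                  # path derivative + score function term
stl_grads = -z + (z - mu)         # score term dropped ("sticking the landing")
```

Both estimators are unbiased (mean approximately zero here), but only the STL version "sticks the landing" with zero variance at the optimum.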
6.2. $\rho$-VAE
Instead of an isotropic Gaussian approximate posterior, the $\rho$-VAE uses an AR(1) Gaussian distribution [80], so this is a posterior variant. Note that by autoregressive Gaussian we are referring to a traditional autoregressive model. The covariance matrix of the AR(1) process is given by
$[\mathbf{C}_{(\rho,s)}]_{ij}=s\,\rho^{|i-j|}$
The $\rho$ parameter is a scalar controlling the correlation, so it lies between −1 and 1; $s>0$ is a scalar scaling parameter. The subscript $(\rho,s)$ denotes that the covariance matrix depends on $\rho$ and $s$.
The vanilla VAE encoder outputs $\mathbf{\mu}$ and $\sigma^{2}$ (or $\log(\sigma^{2})$); the encoder in the $\rho$-VAE outputs $\mathbf{\mu}$, $\rho$, and $s$. The determinant of this matrix is
$\det(\mathbf{C}_{(\rho,s)})=s^{d}(1-\rho^{2})^{d-1}$
The regularization term in the loss function can be formulated with the standard Gaussian KL:
$\mathbb{D}_{KL}(\mathcal{N}(\mathbf{\mu},\mathbf{C}_{(\rho,s)})\parallel\mathcal{N}(\mathbf{0},\mathbf{I}))=\frac{1}{2}\left(\mathrm{tr}(\mathbf{C}_{(\rho,s)})+\mathbf{\mu}^{T}\mathbf{\mu}-d-\log\det(\mathbf{C}_{(\rho,s)})\right)$
We can take the Cholesky decomposition of the covariance matrix to obtain a lower triangular matrix $\tilde{\mathbf{C}}_{(\rho,s)}$ with $\tilde{\mathbf{C}}_{(\rho,s)}\tilde{\mathbf{C}}_{(\rho,s)}^{T}=\mathbf{C}_{(\rho,s)}$. Then $\mathbf{z}^{(i)}=\mathbf{\mu}^{(i)}+\tilde{\mathbf{C}}_{(\rho,s)}^{(i)}\mathbf{\epsilon}$ can be used to generate the latent codes. $\mathbf{z}^{(i)}$ is the d-dimensional latent code associated with the $i$th input; $\mathbf{\epsilon}$ is a d-dimensional vector sampled from a multivariate standard normal distribution, and $\tilde{\mathbf{C}}_{(\rho,s)}^{(i)}$ is a d × d lower triangular matrix for the $i$th input. Variations of the $\rho$-VAE include the $\rho\beta$-VAE and the INFO-$\beta$-VAE.
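A small numerical sketch of this sampling path; the covariance form $s\,\rho^{|i-j|}$ and all constants here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, rho, s = 8, 0.7, 1.5
i, j = np.indices((d, d))
C = s * rho ** np.abs(i - j)      # AR(1)-style Toeplitz covariance matrix
L_chol = np.linalg.cholesky(C)    # lower triangular C-tilde
mu = np.zeros(d)                  # encoder mean (zeros as a placeholder)
eps = rng.standard_normal(d)      # eps ~ N(0, I)
z = mu + L_chol @ eps             # latent code for one input
det_C = np.linalg.det(C)          # equals s**d * (1 - rho**2)**(d - 1)
```

Only three scalars per input ($\rho$, $s$, plus the mean vector) are needed to define the full d × d covariance, which is the parameter saving behind this posterior choice.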
6.3. Importance Weighted Autoencoder (IWAE)
6.3.1. Importance Sampling
Importance sampling is a Monte Carlo variance reduction method [
81], where you have the following integral to estimate
where $\mathrm{range}\left(x\right)\subseteq {\mathbb{R}}^{\mathrm{d}}$ is bounded; $\mathrm{g}:\mathrm{range}\left(x\right)\to {\mathbb{R}}^{\mathrm{d}}$ is bounded and integrable; f is the PDF of a random variable $x\in \mathrm{range}\left(x\right)$, with $f\ge 0$ on $\mathrm{range}\left(x\right)$, $f=0$ outside of $\mathrm{range}\left(x\right)$, and $\int f\left(\mathrm{x}\right)d\mathrm{x}=1$. We choose a probability density function $\gamma $ on $\mathrm{range}\left(x\right)$ with $\gamma \ne 0$ on $\mathrm{range}\left(x\right)$; this $\gamma $ is called the importance function. Then
${\int}_{\mathrm{range}\left(x\right)}\mathrm{g}\left(\mathrm{x}\right)f\left(\mathrm{x}\right)d\mathrm{x}={\int}_{\mathrm{range}\left(x\right)}\frac{\mathrm{g}\left(\mathrm{x}\right)f\left(\mathrm{x}\right)}{\gamma \left(\mathrm{x}\right)}\gamma \left(\mathrm{x}\right)d\mathrm{x}={\mathbb{E}}_{\gamma}\left[\frac{\mathrm{g}\left(y\right)f\left(y\right)}{\gamma \left(y\right)}\right]=\mathrm{J},\quad y\sim \gamma $
The importance sampling Monte Carlo estimator becomes
The algorithm is as follows:
 (1)
Generate an i.i.d. sequence $\{{y}_{1},\cdots ,{y}_{\mathrm{N}}\}\sim \gamma $.
 (2)
Compute ${\mathrm{J}}_{\mathrm{N}}=\frac{1}{\mathrm{N}}{\sum}_{i=1}^{\mathrm{N}}\frac{\mathrm{g}\left({y}_{i}\right)f\left({y}_{i}\right)}{\gamma \left({y}_{i}\right)}$.
This is an unbiased estimator.
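The two steps above can be sketched with a toy integrand (the choices of g, f, and γ below are assumptions made for illustration):

```python
import numpy as np

# Importance-sampling sketch: estimate J = E_f[g(x)] with f = N(0, 1)
# and g(x) = x**2 (true value 1), using importance function
# gamma = N(0, 2^2).
rng = np.random.default_rng(0)
N = 200_000
y = rng.normal(0.0, 2.0, size=N)                     # step (1): y ~ gamma

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

weights = normal_pdf(y, 0.0, 1.0) / normal_pdf(y, 0.0, 2.0)  # f / gamma
J_N = np.mean(y**2 * weights)                        # step (2): unbiased J_N
```

Here `J_N` lands close to the true value 1, and averaging over repeated runs would center on it exactly, since the estimator is unbiased.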
6.3.2. Importance Sampling for a Latent Variable Model
If we are trying to train a latent variable model to perform inference, with random vectors
$\mathbf{z}$ and
$\mathbf{x}$, we can use importance sampling in training the likelihood. If our
f is our prior distribution
${p}_{\mathbf{z}}\left(\mathbf{z}\right)$, and
$\mathrm{g}$ is the log conditional likelihood
${p}_{\mathit{\theta}}\left({\mathbf{x}}^{\left(\mathbf{i}\right)}\mid \mathbf{z}\right)$, the expected value we are estimating is
Then, the importance sampling estimator becomes
We would use importance sampling here if our ${p}_{\mathbf{z}}\left(\mathbf{z}\right)$ were difficult to sample from, or were not informative. When training the likelihood given by ${\sum}_{i}\mathrm{log}\left({\sum}_{\mathbf{z}}{p}_{\mathbf{z}}\left(\mathbf{z}\right){p}_{\mathit{\theta}}\left({\mathbf{x}}^{\left(\mathbf{i}\right)}\mid \mathbf{z}\right)\right)$, the inner sum
${\sum}_{\mathbf{z}}{p}_{\mathbf{z}}\left(\mathbf{z}\right){p}_{\mathit{\theta}}\left({\mathbf{x}}^{\left(\mathbf{i}\right)}\mid \mathbf{z}\right)$ can be estimated with
${\mathrm{J}}_{\mathrm{N}}$, so
We choose our importance function to be the variational posterior,
${q}_{\mathit{\varphi}}\left(\mathbf{z}\mid \mathbf{x}\right)$.
The IWAE [
82] is a variation that uses importance sampling for weighted estimates of the log probability. There are two important terms:
Term 1: ${\sum}_{i}\mathrm{log}\left(\frac{1}{\mathrm{L}}{\sum}_{l=1}^{\mathrm{L}}\frac{{p}_{z}\left({\mathbf{z}}^{(\mathbf{i},\mathbf{l})}\right)}{q\left({\mathbf{z}}^{(\mathbf{i},\mathbf{l})}\right)}{p}_{\mathit{\theta}}\left({\mathbf{x}}^{\left(\mathbf{i}\right)}\mid {\mathbf{z}}^{(\mathbf{i},\mathbf{l})}\right)\right)$ with ${\mathbf{z}}^{(\mathbf{i},\mathbf{l})}\sim q\left({\mathbf{z}}^{(\mathbf{i},\mathbf{l})}\right)$
Term 2: ${min}_{\mathit{\varphi}}{\sum}_{i}{\mathbb{D}}_{KL}\left({q}_{\mathit{\varphi}}\left(\mathbf{z}\mid {\mathbf{x}}^{\left(\mathbf{i}\right)}\right)\parallel {p}_{\mathit{\theta}}\left(\mathbf{z}\mid {\mathbf{x}}^{\left(\mathbf{i}\right)}\right)\right)$
We want to maximize Term 1 − Term 2. Thus, we end up with
We can do a form of the ELBO by taking
$\mathrm{L}$ samples of
${q}_{\mathit{\varphi}}\left(\mathbf{z}\mid \mathbf{x}\right)$.
Using Jensen’s Inequality, we can see that
The loss for the IWAE forms a tighter bound than the VAE's, and, as L increases, the bound becomes tighter. In the IWAE lower bound, the gradient weighs each datapoint by relative importance; in the VAE lower bound, the datapoints are weighted equally.
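A toy numeric check of this tightening (the model and the deliberately crude posterior below are assumptions chosen so the true log-likelihood is known in closed form):

```python
import numpy as np

# Model: p(z) = N(0,1), p(x|z) = N(z,1), and a crude q(z|x) = N(0,1).
# The marginal is then x ~ N(0,2), so log p(x) is known exactly.
rng = np.random.default_rng(0)
x = 1.0

def log_normal(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

def iwae_bound(L, n_rep=20_000):
    z = rng.standard_normal((n_rep, L))      # z ~ q(z|x) = N(0,1)
    # log importance weight: log p(z) + log p(x|z) - log q(z|x)
    # (prior and q cancel here because q equals the prior)
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, 0.0, 1.0)
    m = log_w.max(axis=1, keepdims=True)     # stable log-mean-exp over L
    return np.mean(m.squeeze() + np.log(np.mean(np.exp(log_w - m), axis=1)))

elbo = iwae_bound(1)            # L = 1 recovers the vanilla ELBO
tighter = iwae_bound(50)        # larger L tightens the bound
true_log_px = log_normal(x, 0.0, 2.0)
```

With this setup, `elbo < tighter <= log p(x)`, matching the claim that the IWAE bound interpolates toward the true log-likelihood as L grows.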
6.3.3. IWAE Variance Reductions
The gradient estimator of the IWAE can still have higher variance than desirable [
79,
83], due to a hidden score function. To eliminate this problem, you can drop the hidden score function, leading to the IWAE-STL [
79]. You can also apply the reparameterization trick to the hidden score function [
84]. This new estimator is called the doubly reparameterized gradient estimator (DReG); this leads to the IWAE-DReG.
6.4. Mixture-of-Experts Multimodal VAE (MMVAE)
The Multimodal VAE (MVAE) [
85] and MMVAE model [
86] address generative modeling of data across multiple modalities. In this context, examples of multimodal data include images with captions and video data with accompanying audio.
We have M modalities, denoted by
$m=1,\dots ,M$ of the form
where
${p}_{{\mathit{\theta}}_{m}}\left({\mathbf{x}}_{m}\mid \mathbf{z}\right)$ are likelihoods, each parameterized by a decoder. These decoders have parameters
$\Theta =\left\{{\mathit{\theta}}_{1},\dots ,{\mathit{\theta}}_{M}\right\}$.
The true joint posterior is denoted as
${p}_{\Theta}\left(\mathbf{z}\mid {\mathbf{x}}_{1:M}\right)$, and the variational joint posterior
${q}_{\Phi}\left(\mathbf{z}\mid {\mathbf{x}}_{\mathbf{1}:\mathbf{M}}\right)$,
where
${a}_{m}=\frac{1}{M}$ and
${q}_{{\mathit{\varphi}}_{m}}\left(\mathbf{z}\mid {\mathbf{x}}_{\mathbf{m}}\right)$ denotes a unimodal posterior.
We plug this into the
${\mathcal{L}}_{IWAE}$ to get
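Sampling from the mixture-of-experts posterior above can be sketched by first choosing a modality uniformly (weight $a_m = 1/M$) and then sampling its unimodal Gaussian; the means and variances below are assumed stand-ins for encoder outputs:

```python
import numpy as np

# Mixture-of-experts posterior sampling sketch: each modality m
# contributes a unimodal Gaussian q_m(z | x_m); the joint variational
# posterior mixes them with weights 1/M.
rng = np.random.default_rng(0)
M, d = 3, 4
mus = rng.normal(size=(M, d))            # per-modality encoder means
sigmas = np.full((M, d), 0.5)            # per-modality encoder std devs

def sample_moe_posterior(n):
    m = rng.integers(0, M, size=n)       # choose an expert with prob 1/M
    eps = rng.standard_normal((n, d))
    return mus[m] + sigmas[m] * eps      # reparameterized sample

z = sample_moe_posterior(10_000)
```

The empirical mean of the samples matches the average of the per-modality means, as expected for a uniform mixture.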
6.5. VR$\alpha $ Autoencoder and VRmax Autoencoder
We can also derive a variational lower bound for Rényi’s
$\alpha $Divergence, called variational Rényi
$\left(VR\right)$ bound [
87]. We approximate the exact posterior
${p}_{\mathit{\theta}}\left(\mathbf{z}\mid \mathbf{x}\right)$ for
$\alpha >0.$ When
$\alpha \ne 1$, it is equivalent to
This definition can be extended to $\alpha \le 0$. The VR$\alpha $ Autoencoder minimizes this VR bound. The VRmax Autoencoder is the special case of the Rényi Divergence where $\alpha =\infty $.
The IWAE can also be seen as the case of the Rényi Divergence where $\alpha =0$ and L < ∞, where L is the sample size of the Monte Carlo estimator. When $\alpha = 1$, the VR$\alpha $ Autoencoder becomes the vanilla VAE.
6.6. INFOVAE
The INFOVAE [
64], also known as the MMDVAE, is a posterior regularizing variant; it leads to better disentangled representations. However, the INFOVAE still suffers from the blurry image generation problem.
The term ${q}_{\mathit{\varphi}}\left(\mathbf{x}\mid \mathbf{z}\right)$ is the posterior to ${q}_{\mathit{\varphi}}\left(\mathbf{z}\mid \mathbf{x}\right)$, and ${p}_{\mathit{\theta}}\left(\mathbf{z}\mid \mathbf{x}\right)$ is the posterior to ${p}_{\mathit{\theta}}\left(\mathbf{x}\mid \mathbf{z}\right)$.
The Divergence between
q and
p,
${\mathbb{D}}_{KL}\left({q}_{\mathit{\varphi}}\left(\mathbf{z}\right)\parallel p\left(\mathbf{z}\right)\right)$, is multiplied by
$\lambda $, a scaling parameter. A mutual information between
x and
z under
q, denoted by
${\mathbb{I}}_{q}(x;z)$, is also added, and scaled by parameter
$\alpha $ (this
$\alpha $ is different from the
$\alpha $ in the Rényi Entropy and Divergence). This gives us the following loss function:
This objective function cannot be optimized directly; an equivalent form is
One typical architecture configuration for the INFOVAE involves using a DCGAN for both the encoder and the decoder.
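As a sketch of the MMD machinery behind the MMDVAE reading of the INFOVAE (the RBF kernel, bandwidth, and sample counts below are assumed choices, not the paper's): the MMD compares encoder samples against prior samples and is small only when the two distributions match.

```python
import numpy as np

# Biased MMD^2 estimate with an RBF kernel between "posterior" samples
# z_q and prior samples z_p.
rng = np.random.default_rng(0)

def rbf_kernel(a, b, bandwidth=1.0):
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * bandwidth**2))

def mmd(z_q, z_p, bandwidth=1.0):
    k_qq = rbf_kernel(z_q, z_q, bandwidth).mean()
    k_pp = rbf_kernel(z_p, z_p, bandwidth).mean()
    k_qp = rbf_kernel(z_q, z_p, bandwidth).mean()
    return k_qq + k_pp - 2 * k_qp

z_prior = rng.standard_normal((500, 2))            # samples from p(z)
z_match = rng.standard_normal((500, 2))            # a matching "posterior"
z_shift = rng.standard_normal((500, 2)) + 2.0      # a mismatched one
```

`mmd(z_match, z_prior)` is near zero while `mmd(z_shift, z_prior)` is clearly larger, which is what lets the term act as a regularizer.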
6.7. $\beta $VAE
The
$\beta $VAE [
65] is a posterior regularizing variant. We weight the regularizing term by
$\beta $, so the ELBO is modified to:
$\beta $ is typically greater than 1. The correct choice of
$\beta $ creates a more disentangled latent representation. However, a balancing issue comes into play; there is a tradeoff between reconstruction fidelity and the disentanglement of the latent code [
67]. In [
67,
88,
89], the
$\beta $VAE’s ability to disentangle has been analyzed. In [
67], the authors explore disentanglement through the information bottleneck perspective and propose modifications to increase the disentanglement capabilities of the
$\beta $VAE.
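The regularizing term that $\beta$ multiplies is the standard closed-form Gaussian KL (a textbook result, not specific to the paper); a minimal sketch with assumed encoder outputs:

```python
import numpy as np

# KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form, as used in the
# beta-VAE regularizer (log_var = log sigma^2).
def kl_to_standard_normal(mu, log_var):
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

beta = 4.0                                 # assumed weight, > 1
mu = np.array([0.3, -0.2])                 # stand-in encoder mean
log_var = np.array([-0.1, 0.2])            # stand-in encoder log-variance
penalty = beta * kl_to_standard_normal(mu, log_var)
```

When the encoder output matches the prior exactly (zero mean, unit variance), the penalty vanishes.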
6.8. PixelVAE
The PixelVAE [
60] is a VAE based model with a decoder based on the PixelCNN. Since the PixelCNN is a neural autoregressive model, the decoder of the PixelVAE is a neural autoregressive decoder.
The encoder and decoder both have convolutional layers. The encoder uses strided convolutions for downsampling, and the decoder uses transposed convolutions for upsampling.
Typically, a VAE decoder models each output dimension independently, so they use factorizable distributions. In the PixelVAE, a conditional PixelCNN is used in the decoder. The decoder is modeled by:
We model the distribution of x as the product of the distributions of each dimension of x, denoted by ${x}_{i}$, conditioned on z and all previous dimensions. The variable z is the latent variable.
The PixelCNN is great at capturing details but has no latent code. The VAE is great at learning latent representations and capturing global structure, but not at capturing details. The PixelVAE combines the strengths of both models: it has a latent representation and captures both global structure and small details. It can also have latent codes that are more compressed than the vanilla VAE's.
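A minimal sketch of the causal masking that makes a PixelCNN-style decoder autoregressive (this is the standard type-A mask construction, not code from the PixelVAE paper): each pixel's receptive field is restricted to pixels above it or to its left, which enforces the factorization over dimensions described above.

```python
import numpy as np

# Type-A causal mask for a k x k convolution kernel: a pixel may only
# depend on earlier pixels in raster-scan order (strictly before it).
def causal_mask(k):
    mask = np.ones((k, k))
    c = k // 2
    mask[c, c:] = 0.0        # zero the center and everything to its right
    mask[c + 1:, :] = 0.0    # zero all rows below the center
    return mask

mask = causal_mask(5)        # multiply element-wise with the conv kernel
```

Subsequent layers would use the type-B variant, which keeps the center position.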
Figure 7 shows the architecture of the PixelVAE.
The performance of VAEs can be improved by creating a hierarchy of random latent variables through stacking VAEs. This idea can also be applied to the PixelVAE.
The PixelVAE++ algorithm uses PixelCNN++ instead of PixelCNN in the decoder [
90]. It also uses discrete latent variables with a Restricted Boltzmann Machine prior.
6.9. HyperSpherical VAE/SVAE
The vanilla VAE often fails to model data whose latent structure is hyperspherical. The soap bubble effect and the gravity origin effect are also problems with Gaussian priors in the VAE. The HyperSpherical VAE [
77] attempts to deal with these problems.
The von Mises–Fisher ($\mathrm{vMF}$) distribution is parameterized by $\mu \in {\mathbb{R}}^{m}$ and $\kappa \in {\mathbb{R}}_{\ge 0}$; $\mu $ is the mean direction, and $\kappa $ is the concentration around the mean. The PDF of a vMF distribution for a random vector $z\in {\mathbb{R}}^{m}$ is:
${\mathcal{I}}_{\mathrm{v}}$ represents a modified Bessel function of the first kind at order v.
The hyperspherical VAE uses the vMF as the variational posterior. The primary advantage of this is the ability to use a uniform distribution as the prior. The KL Divergence term
${\mathbb{D}}_{KL}\left(\mathrm{vMF}(\mu ,\kappa )\parallel U\left({S}^{m-1}\right)\right)$ to be optimized is:
The KL term does not depend on $\mu $; this parameter is only optimized in the reconstruction term. The gradient with respect to $\kappa $ is
The sampling procedure for the vMF can be found in [
91]. The N-Transformation reparameterization trick can be used to extend the reparameterization trick to more distributions [
92]; it is used to reparameterize vMF sampling.
6.10. $\delta $VAE
${\mathbb{D}}_{KL}\left({q}_{\mathit{\varphi}}\left(\mathbf{z}\mid \mathbf{x}\right)\parallel p\left(\mathbf{z}\right)\right)$ is also called the rate term. The
$\delta $VAE [
71] attempts to prevent posterior collapse by keeping the rate term from going to 0 with a lower bound. It addresses the posterior collapse problem with structural constraints so that the
$\mathrm{KL}$ Divergence between the posterior and prior is lower bounded by design. This can be achieved by choosing families of distributions for
${p}_{\mathit{\theta}}\left(\mathbf{z}\right)$ and
${q}_{\mathit{\varphi}}\left(\mathbf{z}\mid \mathbf{x}\right)$ such that
The committed rate of the model is denoted by $\delta $. One way to do so is to select from a family of Gaussian distributions with variances that are fixed but different.
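A small numeric sketch of this committed-rate idea with assumed 1-D Gaussians: fixing the posterior and prior variances to different constants keeps the KL above a floor $\delta$ no matter what mean the encoder outputs, so the rate cannot collapse to 0.

```python
import numpy as np

# KL(N(mu_q, var_q) || N(mu_p, var_p)) for 1-D Gaussians.
def kl_gauss(mu_q, var_q, mu_p, var_p):
    return 0.5 * (var_q / var_p + (mu_q - mu_p) ** 2 / var_p
                  - 1.0 + np.log(var_p / var_q))

var_q, var_p = 0.5, 1.0                    # fixed, deliberately different
delta = kl_gauss(0.0, var_q, 0.0, var_p)   # the minimum over the mean
mus = np.linspace(-3, 3, 101)
kls = np.array([kl_gauss(m, var_q, 0.0, var_p) for m in mus])
```

For every encoder mean in `mus`, the KL stays at or above `delta`, which is positive whenever the two variances differ.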
6.11. Conditional Variational Autoencoder
The conditional VAE [
93] is a type of deep conditional generative model (CGM). In a deep CGM, there are three types of variables:
$\mathbf{x}$ denotes the input variables,
$\mathbf{y}$ denotes the output variables, and
$\mathbf{z}$ denotes the latent variables. The approximate posterior is
${q}_{\mathit{\varphi}}\left(\mathbf{z}\mid \mathbf{x},\mathbf{y}\right)$. The conditional prior is
${p}_{\mathit{\theta}}\left(\mathbf{z}\mid \mathbf{x}\right)$, and the conditional likelihood is
${p}_{\mathit{\theta}}\left(\mathbf{y}\mid \mathbf{x},\mathbf{z}\right).$ After
$\mathbf{x}$ is observed,
$\mathbf{z}$ is sampled from
${p}_{\mathit{\theta}}\left(\mathbf{z}\mid \mathbf{x}\right)$. Then,
$\mathbf{y}$ is generated from
${p}_{\mathit{\theta}}\left(\mathbf{y}\mid \mathbf{x},\mathbf{z}\right).$ The variational lower bound of the deep CGM is
For the CVAE, where L is the number of samples,
${\mathbf{z}}^{\left(l\right)}={g}_{\mathit{\varphi}}\left(\mathbf{x},\mathbf{y},{\boldsymbol{\epsilon}}^{\left(l\right)}\right),{\boldsymbol{\epsilon}}^{\left(l\right)}\sim \mathcal{N}(\mathbf{0},\mathit{I})$, the lower bound estimator is
The encoder is ${q}_{\mathit{\varphi}}\left(\mathbf{z}\mid \mathbf{x},\mathbf{y}\right)$, the conditional prior is ${p}_{\mathit{\theta}}\left(\mathbf{z}\mid \mathbf{x}\right)$, and the decoder is ${p}_{\mathit{\theta}}\left(\mathbf{y}\mid \mathbf{x},\mathbf{z}\right).$
6.12. VAEGAN
The VAEGAN architecture [
56] is influenced by both the VAE and the GAN; the decoder is also the generator, and the reconstruction loss term is replaced by a discriminator.
As shown in
Figure 8, we have the same VAE structure, but the sample coming out of the VAE is fed into a discriminator, along with the original training data.
In this model, $\mathbf{z}$ is the output of the encoder, denoted $\mathbf{z}\sim Enc\left(\mathbf{x}\right)=q\left(\mathbf{z}\mid \mathbf{x}\right)$, and $\tilde{\mathbf{x}}$ is the output of the decoder, denoted by $\tilde{\mathbf{x}}\sim Dec\left(\mathbf{z}\right)=p\left(\mathbf{x}\mid \mathbf{z}\right).$ ${Dis}_{l}\left(\mathbf{x}\right)$ denotes the representation of the $l$th layer of the discriminator.
The likelihood of the lth layer of the discriminator can be given by
It is a Gaussian distribution parametrized with mean
${Dis}_{l}\left(\tilde{\mathbf{x}}\right)$ and the identity matrix
$\mathbf{I}$ as its covariance. The likelihood loss for
${Dis}_{l}\left(\mathbf{x}\right)$ can be calculated as
The loss of the GAN is typically given by
${\mathcal{L}}_{\mathrm{GAN}}=\mathrm{log}(Dis\left(\mathbf{x}\right))+\mathrm{log}(1-Dis(Gen\left(\mathbf{z}\right)))$. Since the generator and decoder are the same for the VAEGAN, it can be rewritten as
The overall loss used for training the VAEGAN is
There are multiple practical considerations regarding training the VAEGAN. The first consideration is to limit propagation of error signals to only certain networks. ${\mathit{\theta}}_{\mathrm{Enc}\phantom{\rule{4.pt}{0ex}}},{\mathit{\theta}}_{\mathrm{Dec}\phantom{\rule{4.pt}{0ex}}},{\mathit{\theta}}_{\mathrm{Dis}\phantom{\rule{4.pt}{0ex}}}$ denote the parameters of each network.
The second consideration is to weigh the error signals that the decoder receives. The decoder receives these signals from both ${\mathcal{L}}_{\mathrm{llike}\phantom{\rule{4.pt}{0ex}}}^{\mathrm{Dis}{\phantom{\rule{4.pt}{0ex}}}_{l}}$ and ${\mathcal{L}}_{\mathrm{GAN}}$,
The parameter
$\eta $ is used as a weighting factor, and the update of the decoder's parameters looks like:
Empirically, the VAEGAN performs better if the discriminator input includes samples from both
$p\left(\mathbf{z}\right)$ and
$q\left(\mathbf{z}\mid \mathbf{x}\right)$. Therefore, the GAN loss can be rewritten as:
Algorithm 3 shows the VAEGAN training algorithm given practical considerations, and
Figure 9 shows the architecture given these modifications.
Algorithm 3: VAEGAN training
${\mathit{\theta}}_{\mathrm{Enc}},{\mathit{\theta}}_{\mathrm{Dec}},{\mathit{\theta}}_{\mathrm{Dis}}\leftarrow $ initialize network parameters for encoder, decoder, and discriminator networks
for the number of training iterations do:
  ${\mathcal{X}}^{\left(M\right)}\leftarrow $ random mini-batch
  ${\mathcal{Z}}^{\left(M\right)}\leftarrow Enc\left({\mathcal{X}}^{\left(M\right)}\right)$
  ${\mathcal{L}}_{\mathrm{prior}}\leftarrow {\mathbb{D}}_{KL}\left(q\left({\mathcal{Z}}^{\left(M\right)}\mid {\mathcal{X}}^{\left(M\right)}\right)\parallel p\left({\mathcal{Z}}^{\left(M\right)}\right)\right)$
  ${\tilde{\mathcal{X}}}^{\left(M\right)}\leftarrow Dec\left({\mathcal{Z}}^{\left(M\right)}\right)$
  ${\mathcal{L}}_{\mathrm{llike}}^{{\mathrm{Dis}}_{l}}\leftarrow -{\mathbb{E}}_{q\left({\mathcal{Z}}^{\left(M\right)}\mid {\mathcal{X}}^{\left(M\right)}\right)}\left[\mathrm{log}\,p\left({Dis}_{l}\left({\mathcal{X}}^{\left(M\right)}\right)\mid {\mathcal{Z}}^{\left(M\right)}\right)\right]$
  ${\mathcal{Z}}_{p}^{\left(M\right)}\leftarrow $ samples from prior $\mathcal{N}(\mathbf{0},\mathbf{I})$
  ${\mathcal{X}}_{p}^{\left(M\right)}\leftarrow Dec\left({\mathcal{Z}}_{p}^{\left(M\right)}\right)$
  ${\mathcal{L}}_{\mathrm{GAN}}\leftarrow \mathrm{log}\left(Dis\left({\mathcal{X}}^{\left(M\right)}\right)\right)+\mathrm{log}\left(1-Dis\left({\tilde{\mathcal{X}}}^{\left(M\right)}\right)\right)+\mathrm{log}\left(1-Dis\left({\mathcal{X}}_{p}^{\left(M\right)}\right)\right)$
  Update the network parameters with their stochastic gradients:
  ${\mathit{\theta}}_{\mathrm{Enc}}\stackrel{+}{\leftarrow}-{\nabla}_{{\mathit{\theta}}_{\mathrm{Enc}}}\left({\mathcal{L}}_{\mathrm{prior}}+{\mathcal{L}}_{\mathrm{llike}}^{{\mathrm{Dis}}_{l}}\right)$
  ${\mathit{\theta}}_{\mathrm{Dec}}\stackrel{+}{\leftarrow}-{\nabla}_{{\mathit{\theta}}_{\mathrm{Dec}}}\left(\eta {\mathcal{L}}_{\mathrm{llike}}^{{\mathrm{Dis}}_{l}}-{\mathcal{L}}_{\mathrm{GAN}}\right)$
  ${\mathit{\theta}}_{\mathrm{Dis}}\stackrel{+}{\leftarrow}-{\nabla}_{{\mathit{\theta}}_{\mathrm{Dis}}}{\mathcal{L}}_{\mathrm{GAN}}$
end for
Extensions of the VAEGAN include the ZeroVAEGAN [
94], FVAEGAND2 [
95], 3DVAEGAN [
96], and Hierarchical Patch VAEGAN [
97].
6.13. Adversarial Autoencoders (AAE)
The adversarial autoencoder is another architecture that takes inspiration from both the VAE and GAN [
98]. We denote
$\mathrm{x}$ as the input and
$\mathbf{z}$ as the latent code of the autoencoder. Then,
$p\left(\mathbf{z}\right)$ is the prior probability distribution over the latent code,
$q\left(\mathbf{z}\mid \mathbf{x}\right)$ is the probability distribution for the encoder,
$p\left(\mathbf{x}\mid \mathbf{z}\right)$ is the distribution for the decoder,
${p}_{\mathrm{d}}\left(\mathbf{x}\right)$ denotes the data generating distribution, and
$p\left(\mathbf{x}\right)$ is the model distribution. The encoder has an aggregated posterior distribution defined as
An adversarial network is connected to the latent code. We sample from both the aggregated posterior and the prior, and input both into the discriminator. The discriminator tries to distinguish whether z comes from the prior (real) or from the aggregated variational posterior (fake). This matches the prior with the aggregated variational posterior, which has a regularizing effect on the autoencoder. The encoder can also be considered the generator of the adversarial network because it generates the latent code. The autoencoder part of the AAE tries to minimize the reconstruction error.
The AE and the adversarial network are trained in two phases. The first phase is the reconstruction phase, where the autoencoder is trained to minimize the reconstruction loss. The second phase is the regularization phase. In this phase, the discriminative network is trained to discriminate between the real samples and the fake ones; then, the generator (the encoder of the AE) is trained to better fool the discriminator. Both phases use minibatch SGD.
There is a broad choice of functions for the approximate posterior. Some common choices are a deterministic function, a Gaussian probability distribution, or a universal approximator of the posterior.
Figure 10 shows the architecture of the AAE. We can adjust the architecture of the AAE to perform supervised learning, semi-supervised learning, unsupervised clustering, and dimensionality reduction.
6.14. InformationTheoretic Learning Autoencoder
The InformationTheoretic Learning Autoencoder (ITLAE) [
99] is similar to the VAE, with both encoder and decoder layers. There are two main differences. One is that it does not use the reparameterization trick.
The second difference is that, instead of using the KL Divergence, it uses alternate Divergence measures, like the CS Divergence and the Euclidean Divergence; these Divergences are estimated through kernel density estimation (KDE) [
100].
where
$\mathbb{L}$ is the reconstruction loss, and R is the regularization term, typically the Euclidean or CS Divergence. The chosen prior is
p,
$Enc$ is the encoder, and
$\lambda $ controls the magnitude of the regularization.
If we want to estimate the QIP
$\mathbb{V}$ for
$p\left(\mathbf{x}\right)$ using KDE, it is given by the formula
where there are N datapoints, and a Gaussian kernel
$\mathbb{G}$ with kernel bandwidth
${\sigma}^{2}$. Minimizing the QIP is equivalent to maximizing the quadratic Entropy.
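The double-sum KDE estimator of the QIP, $\widehat{\mathbb{V}} = \frac{1}{N^2}\sum_i\sum_j \mathbb{G}(\mathbf{x}_i - \mathbf{x}_j, 2\sigma^2)$, can be sketched directly (toy 1-D data; the bandwidth is an assumed choice):

```python
import numpy as np

# Gaussian kernel G(d, var) and the quadratic information potential
# estimator V_hat = (1/N^2) * sum_i sum_j G(x_i - x_j, 2*sigma^2).
def gauss_kernel(diff, var):
    return np.exp(-diff**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def qip(x, sigma2=0.25):
    diff = x[:, None] - x[None, :]
    return gauss_kernel(diff, 2 * sigma2).mean()

rng = np.random.default_rng(0)
tight = qip(rng.normal(0, 0.5, size=400))   # concentrated samples
spread = qip(rng.normal(0, 2.0, size=400))  # spread-out samples
```

Concentrated samples yield a larger QIP than spread-out ones, consistent with minimizing the QIP being equivalent to maximizing the quadratic Entropy.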
To get the kernel density estimator for the CS Divergence:
${\widehat{\mathbb{V}}}_{q}$ is the QIP estimator for PDF
$q\left(\mathbf{x}\right),{\widehat{\mathbb{V}}}_{p}$ is the QIP estimator for PDF
$p\left(\mathbf{x}\right)$, and ${\widehat{\mathbb{V}}}_{c}$ is the cross information potential estimator, given by
where
${N}_{q}$ is the number of observations for distribution
$q\left(\mathbf{x}\right)$,
${N}_{p}$ is the number of observations for distribution
$p\left(\mathrm{x}\right)$,
$\mathbb{G}\left({\mathbf{x}}_{{q}_{k}}-{\mathbf{x}}_{{p}_{j}},2{\sigma}^{2}\right)$ is a Gaussian kernel between points
${\mathbf{x}}_{{q}_{k}}$ and
${\mathbf{x}}_{{p}_{j}}$, with a kernel bandwidth
$2{\sigma}^{2}$.
The KDE-based Euclidean Divergence estimator is given by
We can choose q as the approximation distribution, and p as the prior distribution. When we try to minimize the information potential with respect to q, the samples that are generated from q would be spread out; when we try to maximize the cross information potential with respect to q, the samples from q and p are closer together. Thus, there is tension between these two effects.
The authors of [
99] experimented with three different priors, a Laplacian distribution, a 2D Swiss roll, and a Gaussian distribution, and tested MNIST data generation. The Euclidean Divergence did not perform as well as the CS Divergence when the data became high-dimensional; high dimensionality also means the batch size has to be larger for the ITLAE.
6.15. Other Important Variations
Important architectures we have not covered include
 (1)
The VRNN and VRAE: The VRNN [
101] is an RNN with a VAE at each timestep. There is also the Variational Recurrent AutoEncoder (VRAE) [
102].
 (2)
VaDE and GMVAE: Both methods use Gaussian Mixture Models; the specific use case is clustering and generation [
75,
76].
 (3)
VQVAE: This method combines Vector Quantization (VQ) with the variational autoencoder [
74]. Both the posterior and prior are categorical distributions, and the latent code is discrete. An extension of the VQVAE is the VQVAE2 [
103]. These methods are comparable to GANs in terms of image fidelity.
 (4)
VAEIAF: This uses inverse autoregressive flow with the VAE [
58].
 (5)
Wasserstein AutoEncoder (WAE):
The WAE minimizes a form of the Wasserstein distance between the model PDF and the target PDF [
104].
 (6)
2Stage VAE
The 2Stage VAE [
61] addresses multiple problems: image blurriness and the balancing issue. It can also tackle the problem of a mismatch between the aggregate posterior and the expected prior. It trains two different VAEs sequentially. The first VAE learns how to sample from the variational posterior without matching
$q\left(\mathrm{z}\right)=p\left(\mathrm{z}\right)$. The second VAE attempts to sample from the true
$q\left(\mathrm{z}\right)$ without using
$p\left(\mathrm{z}\right)$.
7. Applications
VAEs are typically used for generating data, including images and audio; another common application is dimensionality reduction. There are many example applications we could have chosen. However, we decided to focus on three: financial, speech source separation, and biosignal applications. Financial applications of VAEs are a new area of research with huge potential for innovation. Source separation has long been an important problem in the signal processing community, so it is important to survey how the VAE performs in this application. Innovations in biosignal research have great potential for positive impact on patients with disabilities and disorders; VAEs can help improve the performance of classifiers in biosignal applications through denoising and data augmentation.
7.1. Financial Applications
One application is described in [
105], where the
$\beta $VAE is used to complete volatility surfaces and generate synthetic volatility surfaces for options (in the context of finance). Volatility is the standard deviation of the return on an asset. In finance, an option is a contract between two parties that “
gives the holder the right to trade in the future at a previously agreed price but takes away the obligation” [
106]. This describes simple options; more complex options exist. A volatility surface is a volatility function based on moneyness and time to maturity. For moneyness, delta is used; in the context of finance, delta is the derivative of an option price with respect to the price of the underlying asset.
We sample N points from the volatility surface. There are two types of methods to generate volatility surfaces with the VAE: the grid-based approach and the pointwise approach. For the grid-based approach, the input to the encoder is the N-grid-point surface, flattened into an N-point vector. The encoder outputs z, which has d dimensions. The decoder uses z to reconstruct the original grid points.
Figure 11 and
Figure 12 show the architecture for the encoder and decoder for this approach.
For the pointwise approach, the input to the encoder is again the N-grid-point surface, flattened into an N-point vector. The encoder outputs z, which has d dimensions. The input to the decoder is z along with moneyness K and maturity T. The output of the decoder is one point on the volatility surface. We obtain all the points using batch inference.
Figure 13 and
Figure 14 show the architecture for the encoder and decoder for this approach.
In the experiments, each volatility surface was a 40-point grid, with eight times to maturity and five deltas. Six currency pairs were used. Only the pointwise approach was tested. For completing surfaces, it was compared with the Heston Model; the VAE predicts the surface faster than the Heston Model and, in some cases, outperforms it.
Table 1 shows the results from the paper comparing the Heston Model with a Variational Autoencoder.
The experiments also generated new volatility surfaces. One main use of generating these surfaces is for data augmentation to create more observations for deep learning models (Maxime Bergeron, Private Communications).
In [
107], the
$\beta $VAE was used in conjunction with continuous-time stochastic differential equation (SDE) models to generate arbitrage-free implied volatility (IV) surfaces. SDE models that were tested included Lévy additive processes and time-varying regime-switching models.
The method, shown in the chart in
Figure 15, has the following steps:
 (1)
Use historical market data to fit the arbitrage-free SDE model,
obtaining a collection of SDE model parameters.
 (2)
Train the VAE model on the SDE model parameters.
 (3)
Sample from the latent space of the VAE model using a KDE approach.
 (4)
Obtain a collection of SDE model parameters by decoding the samples.
 (5)
Use the SDE model with these parameters to get arbitrage-free surfaces.
In [
108], LSTM and LightGBM models were used to predict the hourly directions of eight banking stocks listed in the BIST 30 Index from 2011 to 2015. The first three years were used as the training set, and the last two years as the test set. The first experiment used the stock features as the input to the models. The second experiment first passed the stock features through a VAE for dimensionality reduction before inputting them into the models. The results showed that the two performed similarly, though the VAE-filtered input uses 16.67% fewer features. The third experiment added features from other stocks into the first and second experiments, to account for effects from other stocks.
In [
109], a deep learning framework was used for multi-step-ahead prediction of the stock closing price. The input features were the market open price, market high price, market low price, market close price, and market volume. This framework used the LSTMVAE to remove noise, then combined these reconstructed features with the original features; these were the input to a stacked LSTM Autoencoder, which output a prediction.
In [
110], the authors looked at the index tracking performance of various autoencoder models, including the sparse AE, contractive AE, stacked AE, DAE, and VAE. These were used to find the relationships between stocks and construct tracking portfolios. The results were compared to conventional methods, and showed that, for the deep learning methods to perform better, a higher number of stocks was needed in the tracking portfolio.
7.2. Speech Source Separation Applications
Deep learning has been applied to various aspects of speech processing, such as speech recognition and speech and speaker identification [
111]. In this subsection, we focus on speech source separation applications using variational autoencoders.
If you have
$\mathrm{N}$ signals,
${\mathrm{s}}_{\mathrm{i}}\left(\mathrm{t}\right)$, you can have mixed signal
$\mathrm{y}\left(\mathrm{t}\right)={\sum}_{i=1}^{N}{\mathrm{s}}_{\mathrm{i}}\left(\mathrm{t}\right);$ the goal of signal/source separation is to retrieve an estimate
$\widehat{{s}_{i}}\left(t\right)$ of each ${\mathrm{s}}_{\mathrm{i}}\left(\mathrm{t}\right)$. When the separation is unsupervised, it is known as blind source separation.
Figure 16 shows two speech signals, Signal 1 and Signal 2, mixed to create a mixed signal.
Signal to Distortion Ratio (SDR), Signal to Artifact Ratio (SAR), Signal to Interference Ratio (SIR), Signal to Noise Ratio (SNR), and Perceptual Evaluation of Speech Quality (PESQ) are common measures to evaluate speech source separation [
112,
113]. SDR, SAR, SIR, and SNR are all typically measured in decibels (dB). We will drop the
i subscript of a signal estimate
$\widehat{{s}_{i}}\left(t\right)$ for simplicity in the following formulas. Using the BSS Eval toolbox, we can decompose our estimate of a signal as
${s}_{\mathrm{target}}\left(t\right)$ is a deformed version of ${s}_{i}\left(t\right)$; ${e}_{\mathrm{artif}}\left(t\right)$ is an artifact term; ${e}_{\mathrm{interf}}\left(t\right)$ denotes the deformation of the signal due to interference from the unwanted signals; and ${e}_{\mathrm{noise}}\left(t\right)$ is a deformation due to the perturbing noise.
The SDR is then given by:
$$\mathrm{SDR}=10\,{\mathrm{log}}_{10}\frac{{\parallel {s}_{\mathrm{target}}\parallel}^{2}}{{\parallel {e}_{\mathrm{interf}}+{e}_{\mathrm{noise}}+{e}_{\mathrm{artif}}\parallel}^{2}},$$
the SIR is given by:
$$\mathrm{SIR}=10\,{\mathrm{log}}_{10}\frac{{\parallel {s}_{\mathrm{target}}\parallel}^{2}}{{\parallel {e}_{\mathrm{interf}}\parallel}^{2}},$$
the SNR is given by:
$$\mathrm{SNR}=10\,{\mathrm{log}}_{10}\frac{{\parallel {s}_{\mathrm{target}}+{e}_{\mathrm{interf}}\parallel}^{2}}{{\parallel {e}_{\mathrm{noise}}\parallel}^{2}},$$
and the SAR is:
$$\mathrm{SAR}=10\,{\mathrm{log}}_{10}\frac{{\parallel {s}_{\mathrm{target}}+{e}_{\mathrm{interf}}+{e}_{\mathrm{noise}}\parallel}^{2}}{{\parallel {e}_{\mathrm{artif}}\parallel}^{2}}.$$
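A sketch of the four metrics, assuming the estimate has already been decomposed into its BSS Eval components (the function names are our own, not from the toolbox):

```python
import numpy as np

def db_ratio(num, den):
    """10*log10(||num||^2 / ||den||^2): the energy ratio, in dB, shared by all four metrics."""
    return 10.0 * np.log10(np.sum(num**2) / np.sum(den**2))

def bss_metrics(s_target, e_interf, e_noise, e_artif):
    """SDR, SIR, SNR, SAR from the BSS Eval decomposition of a signal estimate."""
    sdr = db_ratio(s_target, e_interf + e_noise + e_artif)
    sir = db_ratio(s_target, e_interf)
    snr = db_ratio(s_target + e_interf, e_noise)
    sar = db_ratio(s_target + e_interf + e_noise, e_artif)
    return sdr, sir, snr, sar
```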
Spectrograms can be used to view a signal’s time-frequency representation; color intensity indicates the amplitude of each frequency over time. A common way of generating spectrograms is to take the Short Time Fourier Transform (STFT) of the signal; two important parameters for the STFT are the window size and the overlap size. The spectrogram is then found by taking the squared magnitude of the STFT; this representation is especially common in deep learning algorithms involving speech. Typically, in practice, the spectrogram is normalized before being fed into a neural network [
111]. Alternative inputs include log spectrograms and mel spectrograms.
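A minimal sketch of this pipeline, using a naive Hann-windowed STFT rather than a library routine; the parameter values are illustrative only:

```python
import numpy as np

def stft(x, win_size=256, hop=128):
    """Naive Short Time Fourier Transform with a Hann window.
    win_size and hop are in samples; the overlap is win_size - hop."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(x) - win_size) // hop
    frames = np.stack([x[i * hop : i * hop + win_size] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)          # shape: (n_frames, win_size // 2 + 1)

def normalized_spectrogram(x, win_size=256, hop=128):
    """Squared-magnitude STFT, scaled to [0, 1] before feeding it to a network."""
    spec = np.abs(stft(x, win_size, hop)) ** 2
    return spec / spec.max()
```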
In [
114], the VAE was compared with NNMF, the GAN, Gaussian WGANs (including Autoencoding WGANs (AE WGAN)), and autoencoders trained with maximum likelihood (MLAE) on the task of blind monaural source separation. This VAE did not use convolutional layers. The TIMIT data set was used [
115]. Each training set had a female and male speech signal mixed together. There is a VAE for each speaker; so, for a mixed signal that mixes a female and male speaker, we need two VAEs. The input to the VAE is a normalized spectrogram. The training label is the ground truth spectrogram of the original signal that we are trying to obtain. The signal is reconstructed via the Wiener filtering equation:
$${\widehat{\mathbf{x}}}_{k}\left(t\right)=\mathrm{iSTFT}\left(\frac{{\mathit{S}}_{k}^{2}}{{\mathit{S}}_{m}^{2}+{\mathit{S}}_{f}^{2}}\odot {\mathit{S}}_{\mathrm{mixed}}\odot \mathrm{phase}\right),\phantom{\rule{1em}{0ex}}k\in \left\{m,f\right\},$$
where ${\mathit{S}}_{\mathrm{mixed}}$ is the magnitude spectra of the mixed signal, phase is the phase of the mixture signal, and ${\mathit{S}}_{m}$ and ${\mathit{S}}_{f}$ are the reconstructed estimates of the male and female magnitude spectra, respectively. ${\widehat{\mathbf{x}}}_{k}\left(t\right)$ is the reconstructed signal, iSTFT denotes the inverse STFT, and ⊙ denotes element-wise multiplication. The experiments show that NNMF and the VAE methods result in similar SDR, SIR, and SAR; the Gaussian WGAN has a better SIR and SDR than the NNMF and the VAE; and the MLAE has a superior SAR to all the other models.
Figure 17 shows the results for the experiments.
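The Wiener-style masking step used to reconstruct each speaker's spectrogram can be sketched as follows; the function name, and the use of squared magnitudes as the soft mask, are our assumptions about the standard formulation:

```python
import numpy as np

def wiener_separate(S_m, S_f, X_mixed):
    """Wiener-style soft masking of a two-speaker mixture.
    S_m, S_f: magnitude-spectrogram estimates decoded by the two VAEs.
    X_mixed:  complex STFT of the mixture (magnitude and phase together)."""
    eps = 1e-12                                   # guard against division by zero
    denom = S_m**2 + S_f**2 + eps
    mask_m = S_m**2 / denom                       # male-source mask
    mask_f = S_f**2 / denom                       # female-source mask
    # Element-wise masking keeps the mixture phase; an iSTFT of each
    # result would give the two time-domain estimates.
    return mask_m * X_mixed, mask_f * X_mixed
```

Because the two masks sum to one at every time-frequency bin, the masked spectrograms add back up to the mixture.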
In [
116], the researchers used two data sets, TIMIT and VIVOS. TIMIT is a speech corpus for American English; it has 8 dialects and 630 speakers. Each speaker speaks 10 sentences. In the experiment, they used all eight dialects. Background noise was also used, particularly trumpet sounds, traffic sounds and water sounds.
The architecture of the algorithm involved taking the STFT of the mixed signal, feeding it into a complex-data-based VAE, applying a Chebyshev bandpass filter to the output, and taking an iSTFT to obtain the final reconstructed signal (
Figure 18). The metrics used were SIR and SDR. There are two ways to implement a VAE for this STFT-based method to account for the complex input. One is to assume the real and imaginary parts of the input are independent: there is a separate VAE for the real part and for the imaginary part, and the outputs of the two VAEs are rejoined before being put into an inverse STFT (Hao Dao, Private Communications).
There were four cases: one dialect vs. one background, many dialects vs. one background, one dialect vs. many backgrounds, and many dialects vs. many backgrounds. In one dialect vs. one background, one dialect was chosen at a time, with 10 people randomly chosen and 100 utterances total; for each dialect, the utterances were mixed with trumpet sounds and Gaussian noise. The results were good but not stable. In many dialects vs. one background, 100 utterances were chosen and mixed with trumpet sounds and Gaussian noise; the results were better than in the previous case, possibly due to a different data distribution. In one dialect vs. many backgrounds, the speech data were mixed in four ways: with each background sound individually or with all three of them. The difference in performance between the background sounds was not large, though there was a reduction when all three were mixed with the speech signal. Experiments were also run with VIVOS, a speech corpus for the Vietnamese language; the performance was slightly lower. The authors also indicate that performance depends on the network depth and the code size. They compare their model with ICA, the regular VAE, filter banks, and wavelets, and find that it has better SDR, SIR, and PESQ. The exact results are shown in
Table 2.
In [
117], source separation is achieved through class information instead of the source signals themselves using the
$\beta $VAE; this
$\beta $VAE had convolutional layers.
In [
118], Convolutional Denoising Autoencoders (CDAEs) were used for monaural source separation. Given the trained CDAEs, the magnitude spectrogram of the mixed signal is passed through all the trained CDAEs. The output of the CDAE of source
i is the estimate
${\tilde{\mathbf{S}}}_{i}$ of the spectrogram of source
i. The CDAE performs better than the MLP at separating drums but is similar in separating the other components.
In [
119], the multichannel conditional VAE (MCVAE) method was developed and used for semi-blind source separation with the Voice Conversion Challenge (VCC) 2018 data set. This technique has also been used for supervised determined source separation [
120].
The generalized MCVAE is used to deal with multichannel audio source separation under underdetermined conditions [
121,
122]. While the MCVAE has good performance in source separation, its computational complexity is high, and it does not have high source classification accuracy. The Fast MCVAE has been developed to deal with these issues [
123]. The MCVAE does not perform as well under reverberant conditions, and ref. [
124] works on extending the MCVAE to deal with this problem.
In [
125], variational RNNs (VRNNs) were used for speech separation on the TIMIT dataset, achieving superior results over the RNN, NNMF, and DNN for SDR, SIR, and SAR.
Autoencoders and VAEs are also used for speech enhancement [
126,
127,
128,
129,
130]. The goal of speech enhancement is to increase the quality of a speech signal, often by removing noise.
7.3. BioSignal Applications
VAEs can also be applied to biosignals, such as electrocardiogram (ECG) signals, electroencephalography (EEG) signals, and electromyography (EMG) signals.
7.3.1. ECG Related Applications
ECG machines measure the electrical signals from the heart; the signal recorded in an ECG machine is known as an ECG signal. The typical ECG has 12 leads; six on the arms/legs are called limb leads, and the six on the torso are called precordial leads. ECG waves can be defined as a “
positive or negative deflection from baseline that indicates a specific electrical event” [
131]. The common ECG waves are the P wave, Q wave, R wave, S wave, T wave, and U wave. A typical ECG waveform is shown in
Figure 19. The frequencies of an ECG signal are in the 0–300 Hz range, though most of the information is available in the 0.5–150 Hz range [
132].
Using ECGs, doctors can detect serious diseases by identifying distortions in the signal. One very important application is measuring the ECG signal of a fetus during pregnancy to help detect any heart problems the fetus may have. There are invasive and noninvasive methods of measuring this; due to the side effects of the invasive methods, the noninvasive method is preferred. However, the mother’s ECG (MECG) signal is mixed with the fetus’s ECG (FECG) signal, along with external respiratory noise. Thus, the two signals need to be separated, which is a blind source separation problem. The traditional methods for fetal source separation have been ICA methods and adaptive filters such as LMS and RLS. For GANs, VAEs, and AEs, training the models is more difficult because we do not have a ground truth for the FECG signals. This problem can be solved by generating synthetic FECG and MECG signals with libraries such as signalz [
134] and FECGSYN toolbox [
135]. The average heart rate of a pregnant woman is 80–90 beats per minute (BPM), while that of a fetus is 120–160 BPM. The MECG signal amplitude is also 2–10 times larger than the amplitude of an FECG signal. These are the key parameters needed to generate the necessary synthetic ECG signals.
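A deliberately crude sketch of such a synthetic mixture, using the BPM ranges and amplitude ratio above; a Gaussian pulse train stands in for real ECG morphology, which in practice would come from signalz or the FECGSYN toolbox:

```python
import numpy as np

def toy_ecg(bpm, amplitude, fs=500, seconds=10):
    """Very crude periodic stand-in for an ECG trace: one narrow Gaussian
    pulse per beat. Realistic waveforms would come from signalz or FECGSYN."""
    t = np.arange(int(fs * seconds)) / fs
    period = 60.0 / bpm                            # seconds per beat
    phase = (t % period) - period / 2              # pulse centered mid-cycle
    return amplitude * np.exp(-(phase / 0.02) ** 2)

fs = 500
mecg = toy_ecg(bpm=85, amplitude=1.0, fs=fs)       # maternal: 80-90 BPM
fecg = toy_ecg(bpm=140, amplitude=0.2, fs=fs)      # fetal: 120-160 BPM, 2-10x smaller
t = np.arange(len(mecg)) / fs
resp = 0.05 * np.sin(2 * np.pi * 0.3 * t)          # crude periodic respiratory noise
mixture = mecg + fecg + resp
```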
While the VAE itself has not been used for fetal source separation, a similar method, the cross adversarial source separation framework (CASS), has been used. CASS is a method mixing AE and GANs for source separation tasks [
136]. For each mixture component, there is an autoencoder and discriminator. The mixed signal goes into each autoencoder, whose output goes into a discriminator. Typically, each AE and GAN pair is trained independently. For each pair, the
$i\mathrm{th}\phantom{\rule{4.pt}{0ex}}$ signal is separated, and the rest of the components in the mixture are treated as noise. Cross adversarial training is used to share information between each of those components. This was done by letting the
$i\mathrm{th}\phantom{\rule{4.pt}{0ex}}$ discriminator reject samples from the other components.
Figure 20 shows the architectures. In their paper, the authors used two components in CASS for this particular problem. They generated synthetic FECG and MECG signals, mixed them, and added noise to simulate periodic respiratory noise. This noise consisted of random sinusoidal signals with varying frequencies and varying amplitudes. The synthetic data were converted into spectrograms before being inputted to the networks. The results are shown in
Table 3. The CASS is superior to the AE model but, for the MECG, CASS with cross adversarial training is inferior to CASS with each component trained independently.
Distortions in ECG signals are difficult to detect due to noise from disturbances such as baseline wandering, muscle shaking, and electrode movement. The VAE has been used to distinguish these ECG signals under noisy conditions [
137]. They used three data sets: AHA ECG database, the APNEA ECG database, and CHFDB ECG database. From these datasets, they obtained 30,000 ECG signals.
To evaluate how well the model denoised the ECG signals, noise was added to the ECG signal data, including AWGN, salt-and-pepper noise, and Poisson noise (also known as shot noise). Sinusoidal signals with different amplitudes were also added to the signal to imitate baseline wandering noise. First, the ECG signal is preprocessed by an algorithm that splits the waves into segments according to the cardiac cycle. After these steps are completed, the new data are input into a VAE. The results showed that the VAE remains robust in the noise scenarios presented.
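A sketch of three of these noise models (AWGN, salt-and-pepper spikes, and a sinusoidal baseline wander); the parameter values are illustrative, and Poisson noise is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_awgn(x, sigma=0.05):
    """Additive white Gaussian noise."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def add_salt_pepper(x, prob=0.01, magnitude=1.0):
    """Replace a random fraction of samples with +/- magnitude spikes."""
    noisy = x.copy()
    flips = rng.random(x.shape) < prob
    noisy[flips] = rng.choice([-magnitude, magnitude], size=int(flips.sum()))
    return noisy

def add_baseline_wander(x, fs=500, freq=0.3, amp=0.1):
    """Low-frequency sinusoid imitating baseline wandering."""
    t = np.arange(len(x)) / fs
    return x + amp * np.sin(2 * np.pi * freq * t)
```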
Morphological diagnosis, which is “
A diagnosis based on predominant lesion(s) in the tissue(s)” [
138], is one use of ECGs. Human experts typically perform better at ECG morphological recognition than deep learning methods because there is an insufficient number of positive samples for training the models. In [
139], a pipeline is used that involves the VQVAE to generate new positive samples for data augmentation purposes. A classifier was then trained using this additional synthetic data to identify ten ECG morphological abnormalities, which resulted in an increase in the F1 score for the classifier. These ten abnormalities are myocardial infarction (MI), left bundle branch block (LBBB), right bundle branch block (RBBB), left anterior fascicular block (LAFB), left atrial enlargement (LAE), right atrial enlargement (RAE), left ventricular hypertrophy (LVH), right ventricular hypertrophy (RVH), I
${}^{\circ}$ atrioventricular block (IAVB), and preexcitation syndrome (WPW).
Myocardial infarctions are also known as heart attacks, which can often lead to death and therefore need to be rapidly diagnosed. Conventional methods are not very reliable and also perform poorly when applied to 6-lead ECGs. In [
140], a deep learning algorithm was used to detect myocardial infarction using a 6-lead ECG. They found that a deep learning algorithm with a VAE for 6-lead ECGs performed better than traditional rule-based interpretation. A total of 425,066 ECGs from 292,152 patients were used in the study.
Deep learning can be used to classify the type of beat in an ECG signal. However, due to the black-box nature of deep learning algorithms and their complexity, they are not easily adopted into clinical practice. Autoencoders have been used to reduce the complexity; the neural network models then use a lower-dimensional embedding of the data. This solution still has problems with interpretability due to interactions between components of the embeddings. The
$\beta $VAE can be used to disentangle these interactions between components, leading to an interpretable and explainable beat embedding; this was done in [
141]. They used the
$\beta $VAE to create an interpretable embedding with the MITBIH Arrhythmia dataset. VAEs can also be used to generate an ECG signal for one cardiac cycle [
142]. This is useful for data augmentation purposes. This method is relatively simple but cannot generate whole ECG signals.
In electrocardiographic imaging, reconstructing the heart’s electrical activity from body surface potentials runs into numerical difficulties. A method using generative neural networks based on CVAEs has been used to tackle this problem [
143].
7.3.2. EEG Related Applications
EEG machines measure the electrical signals from the scalp; these are used to detect problems in the brain. An RNN-based VAE has been used to improve EEG-based speech recognition systems by generating new features from raw EEG data [
144]. VAEs have been used to find the latent factors for emotion recognition in multichannel EEG data [
145]. These latent factors are then used as input for a sequence based model to predict emotion. Similarly, the bilateral variational domain adversarial neural network (BiVDANN), which uses a VAE as part of its architecture, has been used for emotion recognition from EEG signals [
146]. Video games to assess cognitive abilities have been developed, with a task performance metric and EEG signals recorded; from these data, a deep learning algorithm for predicting task performance from the EEG signals has been developed [
147]. First, this involves feature extraction, followed by dimensionality reduction with the VAE. The outputs of the VAE are used as input to an MLP, which predicts task performance.
7.3.3. EMG Related Applications
An EMG machine “
measures muscle response or electrical activity in response to a nerve’s stimulation of the muscle” [
148]. They can be used to find neuromuscular problems. Upper limb prosthetics use myoelectric controllers; however, these controllers are susceptible to interface noise, which reduces performance [
149]. Latent representations of muscle activation patterns have been found using a supervised denoising VAE. These representations were robust to noise in single EMG channels, and latent-space-based deep learning classifiers have outperformed conventional LDA-based classifiers.
Brain-Machine Interfaces (BMIs) can be used to help paralyzed patients regain voluntary movement by extracting information from neural signals. However, the interface has performance issues with this method. Using latent representations obtained through methods like the VAE has improved the performance of the BMI [
150].
8. Experiments
8.1. Experiment Setup and Data
We focused specifically on speech source separation experiments. We used the TIMIT dataset for the input [
115]. Two types of speakers, male and female, were used; each of the recordings was 2–4 s long. We normalized the speech signals, then combined male signals with female signals to create 90 mixed signals; 80 were used for training and 10 for testing. We used three models: the VAE, ITLAE, and
$\beta $VAE. For all three models, we used fully connected layers.
First, we will go over the VAE setup. There is a VAE for each speaker; thus, for a mixed signal that mixes a female and male speaker, we need two VAEs. The input to the VAE is a normalized spectrogram. The training label is the ground truth spectrogram of the original signal that we are trying to obtain. The signal is reconstructed via the Wiener filtering equation, from Equation (
90). The metrics used were SDR, SIR, and SAR, evaluated using the BSS Eval Toolbox [
112]. This setup is repeated for the
$\beta $VAE and ITLAE.
8.2. Results
8.2.1. Hyperparameter Tuning
Normalizing the spectrogram was necessary to keep the loss function from exploding. We also found that the Gaussian distribution that generates $\epsilon $ had to have a standard deviation of 0.01. The parameters that we varied include the window and overlap size for the STFT, the latent variable code size, and the batch size.
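The $\epsilon $ scaling mentioned above can be sketched with the reparameterization trick; this is an illustrative NumPy fragment, not our actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, eps_std=0.01):
    """Reparameterization trick z = mu + sigma * eps, where eps is drawn from a
    Gaussian whose standard deviation (0.01 here) is shrunk for stability."""
    eps = rng.normal(0.0, eps_std, size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```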
Figure 21 shows the results of varying the window size and overlap size with violin plots. The encoding layers had the value [M, 256, 256, 128], and the decoding layers had the value [256, 256, M], with M being the number of spectral frames, which depends on the choice of the STFT parameters. Using a 64 ms window size fared worse than a 32 ms window and a 16 ms window. Using 64 ms window with 32 ms overlap size resulted in a better SAR than using 64 ms window with 16 ms overlap size. Using a 16 ms window size with 8 ms overlap size had the best overall results by a wide margin. Using a 16 ms window size with 4 ms overlap size had worse results than using a 16 ms window size with 8 ms overlap size. When the window size went down to 8 ms, the results became worse than 16 ms window size.
We varied the latent code size d. The encoder layers are [129, 128, d], and the decoding layers are [128, 129], with a window size of 256 samples and an overlap size of 128 samples. The experiment found no clear difference in SIR, SDR, and SAR. We also experimented with batch sizes of 1, 17, 34, and 70. For batch sizes 17, 34, and 70, there was no clear difference in performance; using batch size 1 increased the training time while not changing the performance.
For the $\beta $VAE, $\beta $ was a hyperparameter to be tuned. For $\beta >1$, there is no discernible difference between the various values of $\beta $.
For the ITLAE, the hyperparameter that we varied was the latent code size. We varied the latent code size in the range [32, 64, 128, 256]. There was no discernible difference between the various latent codes.
8.2.2. Final Results
In
Figure 22, we compare the results from the three different models using violin plots. The VAE and
$\beta $VAE have similar results. The ITLAE has a worse SAR, similar SDR, and a better SIR.
9. Conclusions
In this paper, first, we provided a detailed tutorial on the original variational autoencoder model. After that, we outlined the major problems of the vanilla VAE and the latest research on resolving these issues. These problems include posterior collapse, variance loss/image blurriness, disentanglement, the soap bubble effect, and the balancing issue.
Then, we comprehensively surveyed many important variations of the VAE. We can organize most of the variations into the following four groups:
 (1)
Regularizing Posterior Variants: We can regularize the posterior distribution to improve the disentanglement, as with the $\beta $VAE and INFOVAE.
 (2)
Prior/Posterior variants: We can change the prior or posterior, like with the hyperspherical VAE, $\rho $VAE, and the VQVAE. The hyperspherical VAE uses a vmF for the posterior and a uniform for the prior, which makes it superior for a hyperspherical latent space. The $\rho $VAE uses an AR(1) Gaussian process as its posterior, which leads to superior results over a vanilla VAE in terms of the loss function.
 (3)
Architectural Changes: There are many potential architectural changes. The VAEGAN and Adversarial Autoencoder take inspiration from both the VAE and GAN and, as a result, mitigate the downsides of both frameworks. Combining the VAE framework with neural autoregressive models has created more flexible inference networks; the PixelVAE combines the PixelCNN with the VAE, allowing it to capture both small details and global structure. The conditional VAE uses a conditional likelihood instead of the standard likelihood; it has been key to the MCVAE method for source separation. The MMVAE is better at dealing with data that have multiple modalities.
 (4)
Other Variations: Variance reduction of the VAE has been achieved through the STL estimator. Methods such as the IWAE use importance sampling to achieve a tighter lower bound for the loss function; variations of the IWAE using a DreG estimator or STL estimator have also reduced variance.
The applications of VAEs for generating images, speech/audio and text data are well known and well studied. In our article, we decided to focus on the less well known applications that VAEs can be used for, specifically in signal processing/time series analysis. In finance, we highlighted the use of the VAE in generation and completion of volatility surfaces for options, along with dimensionality reduction use cases. In the speech source separation subsection, we summarized the research on using the VAE framework for source separation in speech signals. We reviewed the use of the VAE framework for dimensionality reduction, generating disentangled interpretable features, and data generation for biosignals such as EEG, ECG, and EMG signals.
Some of the future potential areas of research for VAEs are:
 (1)
Disentanglement Representations: While many VAE extensions are state of the art for disentanglement, there is still the problem that they do not learn the disentangled representation in a truly unsupervised manner.
 (2)
Data Generation in Finance: VAEs are relatively unused in finance applications compared to other fields. For generating data, the VAE framework has been studied thoroughly and has had amazing results for natural images and audio data. Research into generating finance related data, such as volatility surfaces, is relatively unexplored, as we have only found two papers on this topic.
 (3)
Source Separation for Biosignals and Images: While speech source separation using the VAE framework has been heavily explored, the literature on biosignal source separation and image source separation with VAE is sparse. We see these as strong candidates for exploring the powerful capabilities of the VAE.