The Path from PCA to Autoencoders to Variational Autoencoders: Building Intuition for Deep Generative Modeling
Abstract
1. Introduction
2. The Goal and Taxonomy of Dimensionality Reduction
- Generative Capability: This criterion distinguishes descriptive models from generative models. Descriptive models, such as PCA and t-SNE, provide a compressed representation but cannot create new data. In contrast, generative models like VAEs explicitly learn the data distribution, enabling the generation of novel instances.
- Probabilistic vs. Deterministic: This distinction concerns the nature of the learned mapping. Deterministic methods (e.g., PCA) produce a fixed low-dimensional coordinate for each input. Probabilistic methods (e.g., VAEs) use probability distributions to model the latent space or the mapping. This quantifies uncertainty and provides a flexible, stochastic framework.
3. Principal Component Analysis (PCA)
3.1. Definition of PCA
3.2. PCA Space and Principal Components
- Orthonormal: Each component vector has unit length ($\|\mathbf{v}_i\| = 1$) and is perpendicular to all others. Mathematically, $\mathbf{v}_i^T \mathbf{v}_j = 0$ for $i \neq j$. To understand this, if $\mathbf{a} = (1, 0)$ and $\mathbf{b} = (0, 1)$, then $\mathbf{a}^T \mathbf{b} = 0$; hence, $\mathbf{a}$ and $\mathbf{b}$ are orthogonal (perpendicular) and, having unit length, orthonormal, while $\mathbf{a} = (1, 1)$ and $\mathbf{b} = (1, -1)$ also satisfy $\mathbf{a}^T \mathbf{b} = 0$; hence, they are orthogonal but not orthonormal, as they are not of unit length ($\|\mathbf{a}\| = \|\mathbf{b}\| = \sqrt{2}$).
- Uncorrelated: The principal components are uncorrelated. This means that the covariance between different components is zero, $\mathrm{Cov}(\mathbf{v}_i, \mathbf{v}_j) = 0$ for $i \neq j$, where $\mathrm{Cov}(\mathbf{v}_i, \mathbf{v}_j)$ represents the covariance between the $i$-th and $j$-th component vectors.
- Ordered by Variance: The components are ordered so that $\mathbf{v}_1$ captures the most variance, $\mathbf{v}_2$ captures the next most, and so on ($\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_M$).
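The three properties above can be checked numerically. The sketch below (our own example, not from the paper) computes the principal components of a random 2D dataset and verifies orthonormality, uncorrelatedness of the projections, and the variance ordering.

```python
import numpy as np

# Hypothetical 2D data: verify the three properties of principal components.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                      # center the data

C = X.T @ X / (X.shape[0] - 1)              # sample covariance matrix
eigvals, V = np.linalg.eigh(C)              # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]    # sort by variance, descending

# Orthonormal: V^T V = I
assert np.allclose(V.T @ V, np.eye(2))
# Uncorrelated: covariance of the projected data is diagonal
Z = X @ V
assert np.allclose(np.cov(Z.T), np.diag(eigvals))
# Ordered by variance
assert eigvals[0] >= eigvals[1]
```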
3.3. PCA as an Optimization Problem
3.3.1. Variance Maximization Perspective
3.3.2. Reconstruction Error Perspective
3.4. Covariance Matrix Method
3.4.1. Calculating the Covariance Matrix
3.4.2. Eigenvalue–Eigenvector Decomposition for Principal Components
3.5. Singular Value Decomposition (SVD)
- $\mathbf{U} \in \mathbb{R}^{M \times M}$ constitutes the left singular vectors; its columns form an orthonormal basis for the column space of $\mathbf{X}$ and are exactly the eigenvectors of $\mathbf{X}\mathbf{X}^T$—the principal components.
- $\boldsymbol{\Sigma} \in \mathbb{R}^{M \times N}$ is a rectangular diagonal matrix containing singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_r \geq 0$, where $r = \min(M, N)$. Singular values relate to the eigenvalues of $\mathbf{X}\mathbf{X}^T$ via $\sigma_i = \sqrt{\lambda_i}$.
- $\mathbf{V} \in \mathbb{R}^{N \times N}$ contains the right singular vectors (eigenvectors of $\mathbf{X}^T\mathbf{X}$).
3.5.1. Equivalence of SVD and Covariance-Based PCA
- The principal components (eigenvectors of the covariance matrix) are exactly the columns of $\mathbf{U}$.
- The eigenvalues relate to singular values via $\lambda_i = \sigma_i^2 / (N-1)$. Equivalently, if we define the scaled data matrix $\tilde{\mathbf{X}} = \mathbf{X}/\sqrt{N-1}$, then the singular values of $\tilde{\mathbf{X}}$ satisfy $\tilde{\sigma}_i^2 = \lambda_i$.
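The equivalence can be verified in a few lines. The sketch below assumes the convention used above: the centered data matrix $\mathbf{X}$ is $M \times N$ with samples as columns, so the covariance matrix is $\mathbf{X}\mathbf{X}^T/(N-1)$ and the principal components are the columns of $\mathbf{U}$.

```python
import numpy as np

# Check lambda_i = sigma_i^2 / (N - 1) and that the columns of U match the
# covariance eigenvectors (up to sign).
rng = np.random.default_rng(1)
M, N = 4, 100
X = rng.normal(size=(M, N))
X = X - X.mean(axis=1, keepdims=True)       # center each feature (row)

C = X @ X.T / (N - 1)                       # covariance route
eigvals, W = np.linalg.eigh(C)
eigvals, W = eigvals[::-1], W[:, ::-1]      # descending variance order

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # SVD route

assert np.allclose(s**2 / (N - 1), eigvals)         # eigenvalue relation
assert np.allclose(np.abs(np.sum(U * W, axis=0)), 1.0)  # same directions
```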
3.5.2. Computational and Numerical Considerations
3.6. Constructing the PCA Subspace
3.7. Data Reconstruction and Error Analysis
3.8. PCA Algorithms
- SVD-Based: $O(\min(NM^2, N^2M))$ for a data matrix with $N$ samples of dimension $M$.
- Covariance-Based: $O(NM^2)$ for the covariance computation plus $O(M^3)$ for the eigendecomposition.
Algorithm 1: Principal Component Analysis (PCA)
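A compact SVD-based implementation in the spirit of Algorithm 1 (function and variable names are ours, not the paper's; rows of `X` are samples here for convenience):

```python
import numpy as np

def pca(X, K):
    """Return (mean, components, projected data) for a K-dim PCA."""
    mu = X.mean(axis=0)
    Xc = X - mu                              # 1. center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # 2. SVD
    W = Vt[:K].T                             # 3. top-K principal components
    Z = Xc @ W                               # 4. project to K dimensions
    return mu, W, Z

def reconstruct(mu, W, Z):
    return Z @ W.T + mu                      # map latent codes back

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
mu, W, Z = pca(X, K=3)                       # K = M: reconstruction is lossless
assert np.allclose(reconstruct(mu, W, Z), X)
```

With $K < M$, the reconstruction error equals the variance in the discarded directions, as discussed in Section 3.7.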
3.9. Numerical Examples
3.9.1. First Example: 2D Example
3.9.2. Second Example: 4D Example
4. Autoencoder (AE)
1. Encoder: The encoder is a function $f_\phi$ with learnable parameters $\phi$ (weights and biases). It maps input data $\mathbf{x} \in \mathbb{R}^M$ to a lower-dimensional latent code $\mathbf{z} = f_\phi(\mathbf{x}) \in \mathbb{R}^K$, where $\mathbf{z}$ represents the latent feature representation. These “latent features” are a compressed summary of the input. For example, latent features for hand-written digits could encode information about the number of lines required to write each number, the angles of these lines, and how they connect. For a single hidden layer architecture, $$\mathbf{z} = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b}),$$ where $\sigma(\cdot)$ denotes a non-linear activation function like ReLU or tanh (for more details about activation functions, see Appendix A). These non-linearities allow the network to learn much more complex patterns than PCA. In deep architectures, the encoder has multiple layers (see Figure 6, where the encoder consists of two layers) with progressively decreasing dimensions: the first layer has $M$ neurons (matching the input size), and the number of neurons in each layer drops, with the middle (or bottleneck) layer having the smallest number of neurons: $$\mathbf{h}^{(l)} = \sigma\!\left(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right),$$ with $\mathbf{h}^{(0)} = \mathbf{x}$ and $l = 1, \dots, L_e$, where $L_e$ is the number of encoder layers [32].
2. Latent Space: The latent code $\mathbf{z}$ represents the output of the encoder and contains a compressed representation of the input data, where $K < M$. This bottleneck is the critical architectural element that prevents the network from learning a trivial identity mapping, forcing it to prioritize the most important features. Unlike the probabilistic latent space in VAEs, the standard Autoencoder’s latent space is deterministic, meaning that for each input, the encoder always produces exactly the same latent representation—there is no randomness or probability distribution involved in the encoding process [33].
3. Decoder: The decoder reconstructs data from the latent code through the transformation $\hat{\mathbf{x}} = g_\theta(\mathbf{z})$. The decoder $g_\theta$, parameterized by $\theta$, performs the inverse mapping from the latent space back to the original data space. For deep architectures, the decoder mirrors the encoder structure with progressively expanding layers: $$\mathbf{h}^{(l)} = \sigma\!\left(\mathbf{W}^{(l)}\mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right), \quad l = 1, \dots, L_d,$$ where $L_d$ denotes the number of decoder layers.
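The layer recursion above can be sketched at the shape level in NumPy. This is an untrained forward pass with random weights (layer sizes follow the 784 → 256 → 64 MNIST example discussed later; the helper names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(3)
sizes = [784, 256, 64]                      # encoder dims; decoder mirrors them

def init(dims):
    # One (W, b) pair per layer, mapping dims[l-1] -> dims[l].
    return [(rng.normal(scale=0.05, size=(n, m)), np.zeros(n))
            for m, n in zip(dims[:-1], dims[1:])]

enc = init(sizes)            # 784 -> 256 -> 64
dec = init(sizes[::-1])      # 64 -> 256 -> 784

def forward(layers, h, last_linear=False):
    for i, (W, b) in enumerate(layers):
        h = W @ h + b
        if not (last_linear and i == len(layers) - 1):
            h = relu(h)      # non-linearity between layers
    return h

x = rng.normal(size=784)
z = forward(enc, x)                         # latent code, dimension K = 64
x_hat = forward(dec, z, last_linear=True)   # reconstruction, dimension M = 784
assert z.shape == (64,) and x_hat.shape == (784,)
```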
4.1. Parameters and Training
- Network Architecture: Autoencoders typically employ symmetric architectures where the decoder mirrors the encoder. This means that the decoder network has the same number of layers as the encoder, with corresponding layers having matching sizes in reverse order, as shown in Figure 6. For example, with 784-pixel MNIST images, if the encoder goes from Input(784) → (256) → (64), the decoder usually goes from (64) → (256) → Output(784) [33].
- Latent Space Dimensionality: The size K of the latent code represents a critical trade-off. If K is too small, excessive information is lost, while if K is too large, the network can avoid meaningful compression.
- Loss Function: The selection of an appropriate loss function should be guided by the data characteristics, desired reconstruction properties, and the specific application requirements, as these fundamentally shape what the autoencoder learns to preserve and discard during compression. Different loss functions make different assumptions about the data distribution and error sensitivity, leading to substantially different learned representations and reconstruction behaviors. Here, we focus on how the characteristics of the data affect the choice of reconstruction loss $\mathcal{L}$. The most common choice for real-valued data is the Mean Squared Error (MSE) loss, also known as the $\ell_2$ loss. MSE measures the average squared difference between the original input $\mathbf{x}$ and its reconstruction $\hat{\mathbf{x}}$: $$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N} \left\|\mathbf{x}_i - \hat{\mathbf{x}}_i\right\|^2.$$
4.2. The Generative Limitation
4.3. Practical Considerations for Autoencoders
- Encoder Hidden Layers: The function of the encoder is to perform compression and feature extraction, mapping high-dimensional inputs to low-dimensional latent codes. This process relies on activations that promote sparsity, computational efficiency, and gradient flow for effective feature learning. As a result, the Rectified Linear Unit (ReLU) and its variants have become the standard choice for the hidden layers of the encoder. This preference stems from ReLU’s computational efficiency, which involves only a simple thresholding operation, and its effectiveness in mitigating the vanishing gradient problem in deep networks (more details on vanishing gradients and activation functions can be found in Appendix A). However, ReLU neurons can occasionally “die,” outputting zero for all inputs and resulting in inactive neurons. Alternatives such as Leaky ReLU and Exponential Linear Unit (ELU) can help address this issue by permitting small negative values.
- Decoder Hidden Layers: The function of the decoder is to perform reconstruction and generation, mapping the low-dimensional latent code back to a high-dimensional output. This task requires activations that facilitate smooth interpolation, stable gradient flow for reconstruction quality, and controlled output ranges—especially near the final layer. Both ReLU and hyperbolic tangent (tanh) activations are commonly used in the hidden layers of the decoder, and the choice depends on the nature of the data and the output requirements. ReLU is often preferred for its simplicity and non-saturating behavior, which can be beneficial for training deep decoders. However, for zero-centered data (e.g., normalized to $[-1, 1]$), tanh often outperforms ReLU in deeper decoders due to three key properties: (1) bounded outputs prevent extreme activations that can destabilize training, (2) smooth differentiability enhances gradient flow, and (3) zero-centered symmetry aids in reconstructing zero-centered data (more details about the zero-centered problem can be found in Appendix A). In practice, the selection of activation functions in the decoder is frequently influenced by empirical performance on the validation set.
- Output Layer: The activation function of the decoder’s output layer is chosen to match the statistical properties of the input data being reconstructed. This ensures that the reconstruction resides in the same domain as the original input .
  - Sigmoid Activation: Used for data where each element represents a probability or intensity normalized to $[0, 1]$. For example, binary-valued data (e.g., black-and-white images) and grayscale images normalized to $[0, 1]$ (e.g., MNIST, Fashion-MNIST).
  - Softmax Activation: Used for discrete categorical data or one-hot encoded features. This is particularly important when modeling multinomial distributions (e.g., pixel intensities discretized into 256 bins), reconstructing text sequences or one-hot encoded categorical variables, or when the output dimensions must sum to 1 across categories.
  - Tanh Activation: Used for data normalized to the range $[-1, 1]$, which is common when working with preprocessed image datasets where zero-centered data is beneficial.
  - Linear Activation (no activation): Used for unconstrained real-valued data where outputs can theoretically range over $(-\infty, \infty)$. Examples include regression tasks, scientific measurements without bounded ranges, and certain types of continuous sensor data.
- Latent Space: The activation function for the encoder’s output layer defines the latent space’s properties. A linear activation is most common, producing an unconstrained, interpretable real-valued vector analogous to PCA components. For specific applications requiring bounded latent codes (e.g., interpretable normalized features), a tanh or sigmoid activation can be used to restrict the latent values to (−1, 1) or (0, 1), respectively.
4.4. Numerical Example
- Input Layer: Two neurons (accepting 2D input vectors).
- Encoder Hidden Layer: One neuron with hyperbolic tangent (tanh) activation.
- Bottleneck Layer: One neuron with linear activation (producing the 1D latent code).
- Decoder Hidden Layer: One neuron with tanh activation.
- Output Layer: Two neurons with linear activation (reconstructing the original 2D input).
- Layer 1 (Encoder):
- Layer 2 (Bottleneck):
- Layer 3 (Decoder):
- Layer 4 (Output):
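The forward pass of this tiny 2–1–1–1–2 network can be written out directly. The weight and input values below are illustrative placeholders, not the paper's numbers:

```python
import numpy as np

x = np.array([0.5, -0.3])                   # 2D input vector

w1, b1 = np.array([0.4, 0.2]), 0.1          # encoder hidden layer (tanh)
w2, b2 = 0.6, 0.0                           # bottleneck (linear)
w3, b3 = 0.8, -0.1                          # decoder hidden layer (tanh)
W4, b4 = np.array([0.5, -0.7]), np.array([0.0, 0.2])  # output (linear)

h1 = np.tanh(w1 @ x + b1)                   # encoder hidden activation
z = w2 * h1 + b2                            # 1D latent code
h3 = np.tanh(w3 * z + b3)                   # decoder hidden activation
x_hat = W4 * h3 + b4                        # 2D reconstruction

loss = 0.5 * np.sum((x - x_hat) ** 2)       # squared reconstruction error
assert x_hat.shape == (2,) and loss >= 0.0
```

Backpropagation (Section 4.4.2) then differentiates `loss` with respect to each weight by the chain rule through these four steps.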
4.4.1. Forward Propagation
4.4.2. Backpropagation: Gradient Computation
4.5. The Fundamental Connection: When Autoencoders Generalize PCA
1. Linear Activations: All activation functions are identity functions.
2. Single Hidden Layer: One bottleneck layer of dimension $K < M$, where $M$ is the input dimension.
3. Mean Squared Error: The loss function is the standard MSE: $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{x}_i - \hat{\mathbf{x}}_i\|^2$.
4. No Regularization: No weight decay or other regularization terms are applied.
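One direction of this connection is easy to demonstrate: a linear autoencoder whose tied weights are set to the top-$K$ principal directions reproduces the PCA reconstruction exactly, so PCA is an optimum of the linear-AE objective (a sketch with our own construction, not a training run):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                # centered data, rows are samples
K = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:K].T                          # top-K principal directions (5 x 2)

Z = X @ W                             # linear encoder: z = W^T x
X_hat = Z @ W.T                       # linear decoder with tied weights

X_pca = (X @ W) @ W.T                 # PCA rank-K reconstruction
assert np.allclose(X_hat, X_pca)      # identical reconstructions
```

A trained linear AE converges to weights spanning this same subspace, though the learned basis may be an arbitrary rotation of the principal components rather than the components themselves.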
4.6. Variants of Autoencoders
4.6.1. Regularized Autoencoders
4.6.2. Architectural Variants for Specific Data Types
5. Variational Autoencoder (VAE)
5.1. Architectural Components: A Probabilistic Reinterpretation
1. Probabilistic Encoder (Recognition Model): The encoder is a neural network parameterized by $\phi$. It serves as an approximate variational inference model, meaning that it uses a single set of parameters to approximate the posterior for any input data point, enabling efficient learning on large datasets. Its input is a data point $\mathbf{x}$. Its outputs are the parameters of the approximate posterior distribution $q_\phi(\mathbf{z}|\mathbf{x})$ (see Figure 8). For a standard VAE, we assume that this distribution is a multivariate Gaussian with a diagonal covariance matrix. For a latent space of dimension $K$, the encoder’s output for a given input $\mathbf{x}$ consists of two $K$-dimensional vectors: $$\boldsymbol{\mu} = f_\mu(\mathbf{x}), \qquad \log\boldsymbol{\sigma}^2 = f_{\log\sigma^2}(\mathbf{x}),$$ where $f_\mu$ and $f_{\log\sigma^2}$ are the output layers (or branches) of the encoder network, producing the parameters that define the distribution $q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}\!\left(\mathbf{z}; \boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)\right)$. Here, $\boldsymbol{\mu}$ contains the means and $\boldsymbol{\sigma}^2$ contains the variances for each of the $K$ latent dimensions. We use log-variance for numerical stability. This formulation defines a $K$-dimensional multivariate Gaussian distribution with a diagonal covariance matrix $\mathrm{diag}(\sigma_1^2, \dots, \sigma_K^2)$, indicating that each latent dimension $k$ is modeled as an independent Gaussian with its own mean $\mu_k$ and variance $\sigma_k^2$. In the VAE’s probabilistic framework, $\phi$ represents the weights of the encoder (the inference network). It tries to approximate the true, intractable posterior $p_\theta(\mathbf{z}|\mathbf{x})$ by varying the parameters of a simpler distribution (a Gaussian) until it is as close as possible—this is the essence of variational inference. For example, if $K = 2$, the encoder for an input $\mathbf{x}$ outputs a mean vector $(\mu_1, \mu_2)$ and a variance vector $(\sigma_1^2, \sigma_2^2)$. This means the latent code is assumed to be drawn from a distribution where the first dimension has mean $\mu_1$ and variance $\sigma_1^2$, and the second dimension has mean $\mu_2$ and variance $\sigma_2^2$, with the dimensions being independent.
2. Latent Space Sampling: To obtain a latent vector for reconstruction or generation, we sample from the inferred distribution: $\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})$. However, the sampling operation ‘∼’ is a random process and its derivative is undefined; it “blocks gradient flow,” meaning that we cannot calculate gradients with respect to the encoder parameters $\phi$ through this operation, which prevents training via backpropagation. The reparameterization trick provides an elegant solution. Instead of sampling $\mathbf{z}$ directly, we express it as a deterministic, differentiable function of the encoder’s outputs and an auxiliary noise variable. We first sample a noise vector $\boldsymbol{\epsilon}$ from a standard normal distribution. Then, we compute the latent code as follows: $$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon},$$ where $\odot$ denotes element-wise multiplication and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Crucially, all randomness is now contained in $\boldsymbol{\epsilon}$, which is independent of $\phi$. The path from $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ (which depend on $\phi$) to $\mathbf{z}$ is now fully differentiable, allowing gradients to flow back to the encoder. Continuing the example, we sample $\boldsymbol{\epsilon}$ from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and compute $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$. The distribution $q_\phi(\mathbf{z}|\mathbf{x})$ describes the probability density over all possible $\mathbf{z}$ given $\mathbf{x}$; $\mathbf{z}$ is one specific sample from this distribution.
3. Probabilistic Decoder (Generative Model): The decoder, parameterized by $\theta$, is reinterpreted as defining a likelihood distribution over the input space. This represents the conditional likelihood $p_\theta(\mathbf{x}|\mathbf{z})$ in the Bayesian network. In the probabilistic framework, $\theta$ represents the weights of the decoder (the generative network). Its input is a latent sample $\mathbf{z}$ (obtained via the reparameterization trick during training, or from the prior during generation). The decoder neural network typically outputs the parameters of a distribution in data space. For real-valued data (e.g., images normalized to [0, 1]), a common choice is a Gaussian distribution with identity covariance, where the decoder’s output is interpreted as the mean of the reconstructed data: $$p_\theta(\mathbf{x}|\mathbf{z}) = \mathcal{N}\!\left(\mathbf{x};\, g_\theta(\mathbf{z}),\, \mathbf{I}\right).$$ Thus, for our sampled $\mathbf{z}$, the decoder outputs $g_\theta(\mathbf{z})$, which is the mean of the Gaussian distribution from which the reconstructed data is likely drawn. The decoder learns to map any latent point $\mathbf{z}$ to a distribution over possible data points $\mathbf{x}$.
- The prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ is what we believe about the latent space before seeing any data: a “clean map” (standard normal).
- The true posterior $p_\theta(\mathbf{z}|\mathbf{x})$ is what we would infer after seeing a specific input, but it is mathematically intractable to compute.
- The approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ is what the encoder actually produces: a “messy, specific region” of the map for that input.
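The encoder outputs and the reparameterization trick can be sketched in a few lines of NumPy. All values here are illustrative placeholders for a $K = 2$ latent space, not the paper's numbers:

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([0.8, -1.2])                  # encoder mean branch output
log_var = np.array([-0.5, 0.1])             # encoder log-variance branch output
sigma = np.exp(0.5 * log_var)               # std dev recovered from log-variance

eps = rng.standard_normal(2)                # eps ~ N(0, I): all randomness lives here
z = mu + sigma * eps                        # z = mu + sigma ⊙ eps (differentiable
                                            # in mu and sigma, not in eps)
assert z.shape == (2,)
```

Because `z` is a deterministic function of `mu`, `sigma`, and the externally sampled `eps`, gradients of any downstream loss can flow back to the encoder parameters.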
5.2. The Core Challenge and the Variational Inference Solution
1. We need to consider all possible $\mathbf{z}$ in a high-dimensional space.
2. Only a tiny fraction of $\mathbf{z}$ values produce a plausible $\mathbf{x}$, making random sampling hopelessly inefficient.
3. We cannot calculate the true posterior $p_\theta(\mathbf{z}|\mathbf{x}) = p_\theta(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})\,/\,p_\theta(\mathbf{x})$, because it requires the very term $p_\theta(\mathbf{x})$ that we are trying to compute.
1. We work with $\log p_\theta(\mathbf{x})$ instead of $p_\theta(\mathbf{x})$ because probabilities multiply but log-probabilities add, making the math tractable and preventing numerical underflow.
2. We introduce the expectation under $q_\phi(\mathbf{z}|\mathbf{x})$ because we need to bring the encoder into the equation to connect it to the parameters we want to optimize ($\phi$). Since $\log p_\theta(\mathbf{x})$ is constant with respect to $\mathbf{z}$, taking its expectation does not change its value: $$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x})\right].$$
3. Expand using Bayes’ rule: We rewrite $p_\theta(\mathbf{x})$ using the joint probability and the posterior: $$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{p_\theta(\mathbf{z}|\mathbf{x})}\right].$$
4. Introduce our encoder strategically: We multiply and divide by $q_\phi(\mathbf{z}|\mathbf{x})$ inside the expectation to prepare for splitting the expression: $$\log p_\theta(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log\left( \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} \cdot \frac{q_\phi(\mathbf{z}|\mathbf{x})}{p_\theta(\mathbf{z}|\mathbf{x})}\right)\right].$$
5. Split the logarithm and recognize the KL divergence: Using the property $\log(ab) = \log a + \log b$, we get: $$\log p_\theta(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p_\theta(\mathbf{x}, \mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right]}_{\mathrm{ELBO}} + \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x})\right)}_{\geq\, 0}.$$
- Increasing the Likelihood: We push up the value of $\log p_\theta(\mathbf{x})$, which is our original goal.
- Aligning Distributions: We simultaneously minimize the difference between our encoder $q_\phi(\mathbf{z}|\mathbf{x})$ and the true posterior distribution $p_\theta(\mathbf{z}|\mathbf{x})$.
- Reconstruction Loss: $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]$, which for a Gaussian decoder becomes the negative MSE (maximizing the log-likelihood is equivalent to minimizing the MSE).
- Regularization Loss: $D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)$. This KL divergence has a closed-form solution when both distributions are Gaussian, penalizing deviations of the encoder’s output from the standard normal prior $\mathcal{N}(\mathbf{0}, \mathbf{I})$.
5.3. The Variational Lower Bound (ELBO)
- Reconstruction Term ($\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]$): This term measures how well the model can reconstruct the input from a latent code sampled from the encoder’s posterior. It is the expected log-likelihood of $\mathbf{x}$ under the decoder’s distribution. Maximizing this term forces the decoder to produce outputs that are likely given the true data. For a Gaussian decoder $p_\theta(\mathbf{x}|\mathbf{z})$, maximizing the log-likelihood is equivalent to minimizing the MSE between the original input $\mathbf{x}$ and the reconstruction $\hat{\mathbf{x}}$. The expectation is approximated during training using the reparameterized latent sample $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$.
- Regularization Term ($-D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)$): This is the Kullback–Leibler (KL) divergence between the encoder’s posterior $q_\phi(\mathbf{z}|\mathbf{x})$ and the prior $p(\mathbf{z})$. It acts as a regularizer on the latent space. Minimizing this term (as we subtract it in the ELBO) pushes the distribution for each data point to be close to the standard normal prior. This has several crucial effects: it prevents the posterior distributions from collapsing into isolated point masses (overfitting), encourages smoothness and continuity in the latent space, and ensures that, for generation, sampling a random $\mathbf{z} \sim p(\mathbf{z})$ will land in a region of the latent space that the decoder understands. The KL divergence for Gaussian distributions has a convenient closed-form expression, making computation efficient.
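For a diagonal Gaussian posterior and a standard normal prior, the closed-form KL divergence mentioned above is the standard expression

```latex
D_{\mathrm{KL}}\!\left(\mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2)) \,\big\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\right)
= \frac{1}{2}\sum_{k=1}^{K}\left(\mu_k^2 + \sigma_k^2 - \log \sigma_k^2 - 1\right),
```

which is non-negative and equals zero exactly when $\mu_k = 0$ and $\sigma_k^2 = 1$ for all $k$, i.e., when the posterior matches the prior.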
1. The input $\mathbf{x}$ is passed through the encoder to get $\boldsymbol{\mu}$ and $\log\boldsymbol{\sigma}^2$.
2. A latent code is sampled via reparameterization: $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$, with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
3. The decoder maps $\mathbf{z}$ to the reconstruction parameters (e.g., the mean $\hat{\mathbf{x}} = g_\theta(\mathbf{z})$).
4. The loss is computed as follows: $$\mathcal{L}(\phi, \theta; \mathbf{x}) = \underbrace{\left\|\mathbf{x} - \hat{\mathbf{x}}\right\|^2}_{\text{reconstruction}} + \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)}_{\text{regularization}}.$$
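One pass through these four steps can be sketched with a toy linear encoder and decoder (all parameters are random and illustrative; a real VAE would use the deep networks described above):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.array([0.5, -0.3, 0.8])              # toy 3D input
K = 2                                       # latent dimension

W_mu = rng.normal(size=(K, 3))              # mean branch of the encoder
W_lv = rng.normal(size=(K, 3)) * 0.1        # log-variance branch
W_dec = rng.normal(size=(3, K))             # decoder

mu, log_var = W_mu @ x, W_lv @ x            # 1. encode x into (mu, log sigma^2)
eps = rng.standard_normal(K)
z = mu + np.exp(0.5 * log_var) * eps        # 2. reparameterized sample
x_hat = W_dec @ z                           # 3. decode: mean of p(x|z)

recon = np.sum((x - x_hat) ** 2)            # 4a. reconstruction loss (MSE)
kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)  # 4b. closed-form KL
loss = recon + kl
assert kl >= 0.0 and loss >= kl             # KL is non-negative; MSE adds on top
```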
5.4. From Deterministic to Probabilistic Latent Spaces
5.5. Key Differences from the Standard Autoencoder
- Latent Representation: The AE produces a deterministic point $\mathbf{z} = f_\phi(\mathbf{x})$. The VAE produces a probabilistic distribution $q_\phi(\mathbf{z}|\mathbf{x})$, from which $\mathbf{z}$ is sampled.
- Training Objective: The AE minimizes a simple reconstruction loss. The VAE maximizes the ELBO, which balances reconstruction accuracy and latent space regularization.
- Latent Space Structure: The AE’s latent space is unconstrained and can be discontinuous. The VAE’s latent space is explicitly regularized to be continuous and smooth, converging toward a known prior distribution. This structural difference is what unlocks the VAE’s generative power.
- Capability: AEs are primarily compression and representation learning models. VAEs are, by design, generative models capable of creating new data samples.
5.6. Numerical Example
5.6.1. Forward Propagation
5.6.2. Backpropagation: Gradient Computation
- Decoder parameters ($\theta$): Receive gradients only from the reconstruction loss, not from the KL term. Changing decoder weights does not affect $\boldsymbol{\mu}$ or $\boldsymbol{\sigma}$. Therefore, the reconstruction loss depends on the decoder parameters directly, while the KL loss depends only on the encoder outputs ($\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$).
- Encoder parameters ($\phi$): Receive gradients from both losses:
  - From the reconstruction loss: through the sampled $\mathbf{z}$ via the reparameterization trick.
  - From the KL loss: directly, since the KL divergence depends only on $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$.
- The reparameterization trick enables gradients to flow from the reconstruction loss back to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$.
6. Experimental Results and Discussions
1. PCA: Using the standard version to reduce the data to K dimensions.
2. AE and VAE: Both used a deep network with four layers in the encoder and four in the decoder.
   - Encoder layers: 2500 → 512 → 256 → 128 → K.
   - Decoder layers: K → 128 → 256 → 512 → 2500.
   - We used ReLU and batch normalization to make training faster and more stable.
   - Bottleneck dimensions: K = 2 or 3 in the first experiment, and K ranged from 2 up to 100 dimensions in the second experiment.
   - Training: Adam optimizer (with weight decay), MSE loss, 300–400 epochs, with a weighting coefficient to control the KL divergence regularization in the VAE.
6.1. First Experiment: Latent Space Visualization
- PCA in Figure 9 (first column): Shows a very “flat” or linear structure.
- AE in Figure 9 (second column): Learns much more complex, curved patterns, which help it separate different individuals better than PCA.
- VAE in Figure 9 (third column): Forces the points into a single, circular (Gaussian) cluster. This makes the space “smooth,” which is perfect for generating new faces through interpolation.
6.2. Second Experiment: Data Reconstruction
- Performance at extreme compression (2D).
- The “sweet spot” for balanced compression (5D–10D).
- Behavior at near-lossless reconstruction (100D).
6.3. Discussions
1. From Linear to Non-linear: PCA is a good baseline, but it is too simple for complex data like faces. The AE uses neural networks to capture curved patterns, greatly improving reconstruction.
2. From Reconstruction to Generation: While both AEs and VAEs compress and reconstruct data, only the VAE organizes the latent space. By accepting a small “penalty” in reconstruction accuracy, the VAE creates a smooth map that allows us to sample and create brand-new data.
3. Fundamental Trade-Offs: The experiments illuminate core design tensions: linearity versus expressivity, deterministic versus probabilistic representation, and reconstruction accuracy versus generative capability. The progression from PCA to AE to VAE navigates these tensions, each step introducing new functionality that addresses the limitations of the previous paradigm.
4. Choosing the Right Tool:
- PCA is best for quick, simple, and interpretable compression.
- AE is best for maximum-fidelity compression and reconstruction.
- VAE is the best choice when you need a structured latent space or want to generate new data.
7. PCA, AE, and VAE: Similarities and Differences
7.1. Fundamental Objectives: From Reconstruction to Generation
- PCA: Linear Compression—It finds the best straight-line directions to represent the data. Its goal is to summarize and visualize data.
- Autoencoder: Non-linear Reconstruction—It uses neural networks to learn flexible, curved patterns. Its main goal is to rebuild the data as accurately as possible.
- VAE: Probabilistic Generation—It models the actual “recipe” (probability distribution) of the data. Its main goal is to create brand-new data that looks like the original.
7.2. Mathematical and Architectural Frameworks
- Representation Learning:
  - PCA: Uses a linear projection $\mathbf{z} = \mathbf{W}^T \mathbf{x}$ to find the latent representation.
  - AE: Uses a non-linear encoder $\mathbf{z} = f_\phi(\mathbf{x})$ and decoder $\hat{\mathbf{x}} = g_\theta(\mathbf{z})$.
  - VAE: Uses a probabilistic encoder $q_\phi(\mathbf{z}|\mathbf{x})$ that predicts a mean and variance, and a probabilistic decoder $p_\theta(\mathbf{x}|\mathbf{z})$.
- How They Learn:
  - PCA: Closed-form solution via eigendecomposition/SVD. No iterative training; no parameters in the neural network sense.
  - AE: Gradient-based optimization (backpropagation) of reconstruction loss. Parameters = weights and biases of encoder/decoder networks.
  - VAE: Gradient-based optimization of the ELBO. Parameters include both network weights and distribution parameters.
- Loss Functions and Optimization:
  - PCA: Minimize reconstruction error.
  - AE: Minimize reconstruction loss.
  - VAE: Maximize the ELBO.
- Latent Space Characteristics: The nature of the latent space fundamentally distinguishes these methods, as illustrated in Table 3.
7.3. Parameter Complexity and Model Capacity
- PCA: Deterministic algorithm—no trainable parameters in the conventional sense. Complexity depends on data matrix operations ($O(NM^2 + M^3)$ for N samples of dimension M).
- AE: Parameter count grows with network architecture. For our example AE with architecture 2–1–1–1–2, there are 11 parameters. General formula for a fully connected AE: $$P = \sum_{l=1}^{L} \left(d_{l-1} + 1\right) d_l,$$ where $d_l$ is the dimension of layer l.
- VAE: Similar parameter count to an AE of the same architecture, plus distribution parameters. For our example VAE with separate and pathways, approximately double the encoder parameters of a comparable AE. In practice, VAEs often share most encoder layers and only branch at the end, minimizing this overhead.
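The parameter-count formula above is a one-liner to check (the function name is ours):

```python
# Each layer l contributes (d_{l-1} + 1) * d_l parameters: a weight matrix
# of size d_{l-1} x d_l plus d_l biases.
def ae_param_count(dims):
    return sum((d_in + 1) * d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

assert ae_param_count([2, 1, 1, 1, 2]) == 11        # the tiny example AE
assert ae_param_count([784, 256, 64, 256, 784]) == 435536  # the MNIST example
```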
7.4. Generative Capabilities
- PCA: Not generative. You can rebuild what you have, but you cannot easily create something new and realistic.
- Standard AE: Not inherently generative. Because the latent space is messy and has “gaps,” picking a random point usually results in a blurry or broken image.
- VAE: Explicitly generative. This is what it was built for. By forcing the latent space into a smooth Gaussian shape, the VAE ensures that almost any point you pick will decode into a realistic, new face or image. This turns a “compression tool” into a “generative tool”.
8. Conclusions
- Use PCA for simple, fast, and easy-to-understand data summaries.
- Use AEs when you need to compress and rebuild complex data with high accuracy.
- Use VAEs when you want to generate new data or have a very organized latent space.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Non-Linear Activation Functions
Appendix A.1. Theoretical Foundations and General Principles
- Large Derivatives ($|\sigma'(z)| \approx 1$): Facilitate strong, reliable gradient signals, leading to faster and more effective learning (e.g., ReLU for $z > 0$).
- Small Derivatives ($|\sigma'(z)| \ll 1$): Cause the gradient signal to diminish exponentially as it propagates backward through layers (e.g., sigmoid). This phenomenon, known as the vanishing gradient problem, can prevent learning, particularly in earlier layers.
- Zero Derivatives ($\sigma'(z) = 0$): Completely halt the flow of gradients, creating “dead neurons” that cease to learn (e.g., ReLU for $z < 0$).
Appendix A.2. The Vanishing/Exploding Gradient Problem
- Exponential Gradient Decay: Gradients shrink exponentially with network depth, making deep layers effectively untrainable.
- Hierarchical Learning Failure: Early layers (responsible for basic feature detection) stop learning while later layers continue adapting.
- Convergence Slowdown: Training becomes inefficient, requiring more data and iterations.
- Premature Saturation: Network parameters get stuck in suboptimal configurations.
- Hidden Layer Computation: $h_1 = \sigma(w_1 x + b_1)$ and $h_2 = \sigma(w_2 h_1 + b_2)$, where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function.
- Output Layer Computation: $\hat{y} = \sigma(w_3 h_2 + b_3)$.
- Loss Function: $\mathcal{L} = \frac{1}{2}\left(y - \hat{y}\right)^2$.
- Output Layer Gradient: $\frac{\partial \mathcal{L}}{\partial w_3} = (\hat{y} - y)\,\sigma'(z_3)\,h_2$.
- Hidden Layer Gradients: $\frac{\partial \mathcal{L}}{\partial w_2} = (\hat{y} - y)\,\sigma'(z_3)\,w_3\,\sigma'(z_2)\,h_1$; then $\frac{\partial \mathcal{L}}{\partial w_1} = (\hat{y} - y)\,\sigma'(z_3)\,w_3\,\sigma'(z_2)\,w_2\,\sigma'(z_1)\,x$.
- The weight will be updated as follows: $w \leftarrow w - \eta\,\frac{\partial \mathcal{L}}{\partial w}$, for learning rate $\eta$.
- Analysis: The Vanishing Gradient Problem
- Output Layer Gradient Magnitude: the largest of the three, taken as the 100% reference below.
- Hidden Layer Gradient Magnitudes: substantially smaller (≈12.5% and 4.7% of the output gradient in the sigmoid case).
- Sigmoid’s Severe Gradient Attenuation: The sigmoid activation demonstrates the most pronounced vanishing gradient problem, with hidden layer weights receiving only 4.7–12.5% of the gradient magnitude compared to the output layer. This occurs because $\sigma'(z) = \sigma(z)(1 - \sigma(z)) \leq 0.25$, causing exponential decay of gradients through multiple layers.
- Tanh’s Moderate Improvement: The tanh function provides significantly better gradient flow, with hidden layer gradients at 22.8–60.7% of output layer magnitude. This improvement stems from $\tanh'(z) = 1 - \tanh^2(z)$, which has a maximum value of 1.0 (when $z = 0$) compared to sigmoid’s maximum of 0.25.
- ReLU’s Superior Gradient Preservation: ReLU demonstrates the most balanced gradient distribution, with hidden layer gradients reaching 26.1–69.6% of output layer values. The constant derivative of 1 for positive inputs prevents gradient attenuation entirely in the forward pass, although it can suffer from the “dying ReLU” problem for negative inputs.
| Activation Function | Weight | Old Value | New Value | Change | Relative Change |
|---|---|---|---|---|---|
| Sigmoid | Output | 0.4000 | 0.4256 | +0.0256 | 100.0% |
| | Hidden 2 | 0.5000 | 0.5032 | +0.0032 | 12.5% |
| | Hidden 1 | 0.2000 | 0.2012 | +0.0012 | 4.7% |
| Tanh | Output | 0.4000 | 0.5525 | +0.1525 | 100.0% |
| | Hidden 2 | 0.5000 | 0.5925 | +0.0925 | 60.7% |
| | Hidden 1 | 0.2000 | 0.2347 | +0.0347 | 22.8% |
| ReLU | Output | 0.4000 | 0.5645 | +0.1645 | 100.0% |
| | Hidden 2 | 0.5000 | 0.6145 | +0.1145 | 69.6% |
| | Hidden 1 | 0.2000 | 0.2430 | +0.0430 | 26.1% |
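The mechanism behind these relative changes is the product of per-layer derivatives. The illustrative check below (our own numbers, not the table's) evaluates each activation's derivative at its most favorable point and shows how quickly the sigmoid's worst-case factor shrinks across layers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.0                                      # point of maximum derivative
d_sig = sigmoid(z) * (1 - sigmoid(z))        # sigmoid'(0) = 0.25 (its maximum)
d_tanh = 1 - np.tanh(z) ** 2                 # tanh'(0) = 1.0 (its maximum)
d_relu = 1.0                                 # ReLU'(z) = 1 for any z > 0

assert np.isclose(d_sig, 0.25) and np.isclose(d_tanh, 1.0)

# Backpropagating through 5 layers multiplies these factors:
# sigmoid shrinks the gradient by at least 0.25^5 ≈ 0.001, tanh/ReLU need not.
assert d_sig ** 5 < 1e-3 < d_tanh ** 5
```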
Appendix A.3. Zero-Centered Output Property
Appendix A.4. Sigmoid Activation
- Output Layer for Binary Classification: Its output range $(0, 1)$ naturally models probability scores, where the output $\sigma(z)$ represents $P(y = 1 \mid \mathbf{x})$.
- Saturating Neurons: The sigmoid’s smooth transition from 0 to 1, along with its asymptotic flattening at extreme inputs, was historically interpreted as modeling the continuous “firing rate” of biological neurons. In this analogy, an output near 0 represented an inactive neuron, near 1 represented maximum firing, and intermediate values represented varying degrees of activation. This saturation behavior was also thought to represent a neuron’s “confidence”—outputs approaching 0 or 1 indicated high certainty, while mid-range values indicated uncertainty.


Appendix A.5. Softmax Activation
- Valid Probability Distribution: All outputs are in $(0, 1)$ and sum to 1: $\sum_{i} \mathrm{softmax}(\mathbf{z})_i = 1$.
- Relative Scaling: The exponentiation amplifies differences between scores—larger values receive disproportionately higher probabilities.
Appendix A.6. Hyperbolic Tangent (tanh) Activation
- Hidden Layers in Deep Networks: Its zero-centered property and stronger gradients make it effective in hidden layers of various architectures.
- Normalized Data Processing: The output range $[-1, 1]$ naturally aligns with normalized input data, making it suitable for preprocessing pipelines that center data around zero.
- Moderately Deep Architectures: In networks of moderate depth, tanh can provide a good balance between expressive power and training stability.
Appendix A.7. Rectified Linear Unit (ReLU) Activation
- Leaky ReLU: Introduces a small, non-zero gradient (typically $\alpha = 0.01$) for negative inputs: $f(z) = \max(\alpha z, z)$.
- Parametric ReLU (PReLU): Makes the negative slope a learnable parameter, allowing the network to adaptively determine optimal behavior for negative inputs.
- Exponential Linear Unit (ELU): Provides smooth, non-linear transitions for negative inputs while maintaining linearity for positive inputs.
References
- Van Der Maaten, L.J.; Postma, E.O.; Van den Herik, H.J. Dimensionality Reduction: A Comparative Review. J. Mach. Learn. Res. 2009, 10, 1–41. [Google Scholar]
- Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef]
- Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202. [Google Scholar] [CrossRef] [PubMed]
- Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the NIPS’20: 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; pp. 6840–6851. [Google Scholar]
- Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: Cambridge, MA, USA, 2015; pp. 2256–2265. [Google Scholar]
- Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Burgess, C.P.; Higgins, I.; Pal, A.; Matthey, L.; Watters, N.; Desjardins, G.; Lerchner, A. Understanding disentangling in β-VAE. arXiv 2018, arXiv:1804.03599. [Google Scholar]
- Bond-Taylor, S.; Leach, A.; Long, Y.; Willcocks, C.G. Deep generative modelling: A comparative review of vaes, gans, normalizing flows, energy-based and autoregressive models. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7327–7347. [Google Scholar] [CrossRef]
- Gómez-Bombarelli, R.; Wei, J.N.; Duvenaud, D.; Hernández-Lobato, J.M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Adams, R.P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4, 268–276. [Google Scholar] [CrossRef]
- An, J.; Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2015, 2, 1–18. [Google Scholar]
- van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural Discrete Representation Learning. In Proceedings of the NIPS’17: 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
- Kingma, D.P.; Rezende, D.J.; Mohamed, S.; Welling, M. Semi-Supervised Learning with Deep Generative Models. In Proceedings of the NIPS’14: 28th International Conference on Neural Information Processing Systems, Volume 2, Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
- Shlens, J. A Tutorial on Principal Component Analysis. arXiv 2014, arXiv:1404.1100. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
- Doersch, C. Tutorial on Variational Autoencoders. arXiv 2016, arXiv:1606.05908. [Google Scholar] [CrossRef]
- Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392. [Google Scholar]
- Schölkopf, B.; Müller, K.R. Fisher Discriminant Analysis with Kernels. In Proceedings of the Neural Networks for Signal Processing IX, Madison, WI, USA, 25 August 1999; Volume 1, pp. 23–25. [Google Scholar]
- Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Henderson, P. Sammon mapping. Pattern Recognit. Lett. 1997, 18, 1307–1316. [Google Scholar] [CrossRef]
- Wold, S.; Esbensen, K.; Geladi, P. Principal Component Analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52. [Google Scholar] [CrossRef]
- Turk, M.; Pentland, A. Eigenfaces for Recognition. J. Cogn. Neurosci. 1991, 3, 71–86. [Google Scholar] [CrossRef]
- Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
- Hyvärinen, L. Principal Component Analysis. In Mathematical Modeling for Industrial Processes; Springer: Berlin/Heidelberg, Germany, 1970; pp. 82–104. [Google Scholar]
- Strang, G. Introduction to Linear Algebra; SIAM: Philadelphia, PA, USA, 2022. [Google Scholar]
- Alter, O.; Brown, P.O.; Botstein, D. Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling. Proc. Natl. Acad. Sci. USA 2000, 97, 10101–10106. [Google Scholar] [CrossRef]
- Strang, G. Differential Equations and Linear Algebra; Wellesley-Cambridge Press: Wellesley, MA, USA, 2014. [Google Scholar]
- Wall, M.E.; Rechtsteiner, A.; Rocha, L.M. Singular Value Decomposition and Principal Component Analysis. In A Practical Approach to Microarray Data Analysis; Springer: Berlin/Heidelberg, Germany, 2003; pp. 91–109. [Google Scholar]
- Abdi, H.; Williams, L.J. Principal Component Analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
- Tharwat, A. Principal component analysis-a tutorial. Int. J. Appl. Pattern Recognit. 2016, 3, 197–240. [Google Scholar] [CrossRef]
- Hinton, G.E.; Zemel, R. Autoencoders, minimum description length and Helmholtz free energy. In Proceedings of the NIPS’93: Neural Information Processing Systems, Denver, CO, USA, 29 November–2 December 1993; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993. [Google Scholar]
- Michelucci, U. An introduction to autoencoders. arXiv 2022, arXiv:2201.03898. [Google Scholar] [CrossRef]
- Baldi, P.; Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Netw. 1989, 2, 53–58. [Google Scholar] [CrossRef]
- Bourlard, H.; Kamp, Y. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 1988, 59, 291–294. [Google Scholar] [CrossRef] [PubMed]
- Plaut, E. From principal subspaces to principal components with linear autoencoders. arXiv 2018, arXiv:1804.10253. [Google Scholar] [CrossRef]
- Berahmand, K.; Daneshfar, F.; Salehi, E.S.; Li, Y.; Xu, Y. Autoencoders and their applications in machine learning: A survey. Artif. Intell. Rev. 2024, 57, 28. [Google Scholar] [CrossRef]
- Makhzani, A.; Frey, B. K-sparse autoencoders. arXiv 2013, arXiv:1312.5663. [Google Scholar]
- Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 833–840. [Google Scholar]
- Jia, K.; Sun, L.; Gao, S.; Song, Z.; Shi, B.E. Laplacian auto-encoders: An explicit learning of nonlinear data manifold. Neurocomputing 2015, 160, 250–260. [Google Scholar] [CrossRef]
- Masci, J.; Meier, U.; Cireşan, D.; Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning–ICANN 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 52–59. [Google Scholar]
- Yong, B.X.; Brintrup, A. Bayesian autoencoders with uncertainty quantification: Towards trustworthy anomaly detection. Expert Syst. Appl. 2022, 209, 118196. [Google Scholar] [CrossRef]
- Preechakul, K.; Chatthee, N.; Wizadwongsa, S.; Suwajanakorn, S. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10619–10629. [Google Scholar]
- Samaria, F.S.; Harter, A.C. Parameterisation of a Stochastic Model for Human Face Identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, Sarasota, FL, USA, 5–7 December 1994; pp. 138–142. [Google Scholar]
Reconstruction error by method and number of latent dimensions:

| Method | Latent Dimensions | Reconstruction MSE |
|---|---|---|
| PCA | 2 | 0.5932 |
| PCA | 3 | 0.5476 |
| AE | 2 | 0.3214 |
| AE | 3 | 0.2748 |
| VAE | 2 | 0.4123 |
| VAE | 3 | 0.3587 |
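For the PCA column, reconstruction MSE values of this kind can be computed directly from the SVD. The sketch below uses synthetic Gaussian data purely for illustration (it does not reproduce the dataset or numbers above) and shows that PCA reconstruction error shrinks monotonically as the latent dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))   # synthetic data: 500 samples, 64 features

# Center the data; the right singular vectors of the centered matrix
# are the principal components.
mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def pca_mse(k):
    W = Vt[:k]            # top-k principal directions (k x 64)
    Z = Xc @ W.T          # project each sample to k dimensions
    X_hat = Z @ W + mu    # reconstruct back in the original space
    return np.mean((X - X_hat) ** 2)

# More retained components means lower reconstruction error.
print(pca_mse(2) > pca_mse(3) > pca_mse(10))   # True
```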
Reconstruction MSE as a function of latent dimensionality:

| Latent Dim | PCA | AE | VAE |
|---|---|---|---|
| 2 | 0.60 | 0.43 | 0.34 |
| 5 | 0.45 | 0.25 | 0.10 |
| 10 | 0.33 | 0.20 | 0.06 |
| 100 | 0.06 | 0.02 | 0.04 |
Qualitative comparison of the three latent spaces:

| Property | PCA | Standard AE | VAE |
|---|---|---|---|
| Structure | Flat, linear subspace | Unstructured, messy | Smooth, organized map |
| Interpretability | High (principal components) | Low (black box) | Moderate (regularized) |
| Dimensionality | Bounded by the data rank | Chosen by the user | Chosen by the user |
| Constraints | Components must be perpendicular | None (unconstrained) | Must follow a Gaussian bell curve |
| Interpolation | Linear only | Possible, but often jumpy | Smooth and meaningful |
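The interpolation row refers to blending two latent codes along a straight line and decoding each intermediate point. The blending step itself is identical for all three methods (a minimal sketch; the decoder is whatever inverse mapping the model provides and is omitted here):

```python
import numpy as np

def interpolate(z1, z2, n_steps=5):
    # Linear interpolation between two latent codes:
    # z(t) = (1 - t) * z1 + t * z2 for t in [0, 1].
    ts = np.linspace(0.0, 1.0, n_steps)
    return np.stack([(1.0 - t) * z1 + t * z2 for t in ts])

z1 = np.array([0.0, 0.0])
z2 = np.array([1.0, 2.0])
path = interpolate(z1, z2)
# path[0] equals z1 and path[-1] equals z2; for PCA the decoded path
# stays on a flat plane, while the VAE's Gaussian regularization makes
# the decodes of intermediate points more likely to look realistic.
print(path)
```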
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Tharwat, A.; Eid, M.M. The Path from PCA to Autoencoders to Variational Autoencoders: Building Intuition for Deep Generative Modeling. Stats 2026, 9, 23. https://doi.org/10.3390/stats9020023
