Review

The Path from PCA to Autoencoders to Variational Autoencoders: Building Intuition for Deep Generative Modeling

1 Center for Applied Data Science (CfADS), Bielefeld University of Applied Sciences and Arts, 33619 Bielefeld, Germany
2 Faculty of Computers and Information Systems, Egyptian Chinese University, Cairo 11786, Egypt
* Authors to whom correspondence should be addressed.
Stats 2026, 9(2), 23; https://doi.org/10.3390/stats9020023
Submission received: 24 December 2025 / Revised: 14 February 2026 / Accepted: 20 February 2026 / Published: 28 February 2026
(This article belongs to the Section Applied Statistics and Machine Learning Methods)

Abstract

This tutorial provides a comprehensive and intuitive journey through the evolution of deep generative models, tracing a clear path from the foundations of Principal Component Analysis (PCA) to modern Variational Autoencoders (VAEs), showing how each method solves the limitations of the previous one. We begin with PCA, a linear tool for reducing data dimensions. Its inability to model non-linear patterns motivates the use of Autoencoders (AEs), which use neural networks to learn flexible, compressed representations. However, AEs lack a probabilistic framework, preventing them from generating new data. VAEs address this by treating the latent space as a probability distribution, enabling data generation. We compare the three methods through theoretical analysis, experiments, and step-by-step numerical examples that show exactly how each model compresses data—a detail often missing elsewhere. Unlike resources that treat these topics separately, we connect them into a single narrative, building intuition progressively from linear to probabilistic deep generative models.

1. Introduction

Dimensionality reduction (DR) stands as a cornerstone of machine learning (ML) and data science, solving the challenges of high-dimensional data. As data grows in scientific and industrial fields, the 'curse of dimensionality' becomes a widespread obstacle, causing data sparsity, high computational costs, and lower model performance due to overfitting. Consequently, DR techniques are not merely a preprocessing step but a critical paradigm for data visualization, efficient computation, noise reduction, and finding the hidden low-dimensional structure that often underlies complex, high-dimensional observations [1].
There are two categories of DR techniques: supervised and unsupervised. Supervised methods use class labels to find patterns; unsupervised methods discover hidden structure without labels. This makes them useful for tasks like data visualization, compression, and noise removal. Principal Component Analysis (PCA) is a leading example, alongside other well-known techniques. Beyond this, DR methods can be categorized by whether the transformation is linear or non-linear. Linear methods like PCA are simpler; non-linear methods learn complex manifolds. Another criterion is the geometric property that they preserve. Some methods maintain global structure and large-scale distances, while others preserve local geometry and neighborhood relationships (e.g., Locally Linear Embedding (LLE) [2]). Finally, a modern and powerful taxonomy considers a model’s generative capability: descriptive vs. generative. Descriptive models (e.g. PCA and t-SNE) provide a compressed representation of data, but they do not learn its underlying probability distribution; thus, they cannot create new data. In contrast, generative models learn a model of the data distribution. This is achieved by learning a probabilistic mapping from the latent space back to the data space, enabling the generation of new data.
This tutorial focuses on a connected series of methods with a shared goal of learning efficient data representations: PCA, Autoencoders (AEs), and Variational Autoencoders (VAEs). As illustrated in Figure 1, these approaches differ fundamentally despite sharing dimensionality reduction as a goal: PCA uses linear transformation without learned parameters; AEs and VAEs use nonlinear transformations. An AE minimizes reconstruction error using learned neural network weights. VAEs, in contrast, introduce a probabilistic framework that maximizes the Evidence Lower Bound (ELBO) to enable generative modeling. This progression is driven by the need to address specific limitations. We begin with PCA [3], a linear method whose mathematical elegance and interpretability provide an ideal starting point. However, its linearity severely restricts its ability to model complex, real-world datasets. This limitation motivates the first jump to (non-linear) Autoencoders [4], which use neural networks to learn highly non-linear, compressed representations, offering a powerful and flexible framework for descriptive dimensionality reduction. However, standard AEs lack a probabilistic foundation, and their latent spaces are often poorly structured, making them unsuitable for generating new data. This key shortfall motivates the final conceptual leap to Variational Autoencoders [5]. By introducing a probabilistic interpretation of the latent space, VAEs transition from being tools for compression to being truly generative models. Thus, the journey from PCA to AEs to VAEs is one from linear to non-linear, from deterministic to probabilistic, and from compression to creation.
Beyond their direct application, the conceptual framework of VAEs has become a cornerstone of modern ML. The core ideas of VAEs—namely, a learned latent space, optimized inference, and the encoder–decoder structure—have been used in many subsequent developments. For example, diffusion models, as discussed in [6,7], are closely related to VAEs. Furthermore, the VAE objective provides a natural framework for learning disentangled representations [8,9], where independent latent factors control distinct, interpretable attributes of the data—a critical goal for explainable AI. This powerful architecture enables a wide range of practical applications, such as image generation [10], drug discovery [11], and anomaly detection [12]. The desire to achieve higher-quality generation has also led to the development of advanced variants such as Vector-Quantized VAEs (VQ-VAEs) [13], which play a crucial role in state-of-the-art systems for image and audio synthesis. The impact extends to semi-supervised learning [14], where the generative model improves learning from limited labeled data. Thus, a solid understanding of VAEs is essential to grasp a vast segment of modern AI, from generative models to representation learning. This extensive utility and theoretical elegance are core motivations for our tutorial.
The ML community has many educational resources that explain PCA, AEs, or VAEs. For instance, the mathematical foundations of PCA are explained in detail in foundational texts like [3,15]. Similarly, the concept of Autoencoders is explained thoroughly in deep learning textbooks [16], while the technical nuances of VAEs are the focus of dedicated, well-known tutorials such as [17] and the comprehensive survey in [18]. However, these resources, while excellent, often treat these models in isolation or with a focus on their standalone technical intricacies. Our tutorial distinguishes itself by connecting these topics into a single, coherent narrative that emphasizes their conceptual evolution. We start from the well-established and intuitive foundation of PCA and progressively build upon its limitations to motivate the introduction of non-linear AEs and, subsequently, the probabilistic framework of VAEs. This approach—tracing the path from a classic linear method to a modern deep generative model—builds intuition step by step. By minimizing the assumed prior knowledge and emphasizing the conceptual links over dense mathematical formalism, we aim to make the powerful ideas behind VAEs accessible and intuitive not only for ML specialists but also for researchers and students from any quantitative background.
The rest of this paper is organized as follows: Section 2 presents the theoretical background of DR methods and their categorization. We then trace the evolution from linear to non-linear probabilistic models across three core sections: Section 3 covers PCA, Section 4 introduces Autoencoders, and Section 5 presents VAEs. Section 3 details both covariance matrix and Singular Value Decomposition (SVD) approaches for dimensionality reduction and data reconstruction. This section includes step-by-step numerical examples demonstrating eigenvector selection and latent space construction. Section 4 explores the theoretical foundations of AEs, also supported by a clear numerical walkthrough of their architecture and training. Section 5 presents VAEs, providing a comprehensive, step-by-step explanation of their probabilistic framework. To bridge theory and practice, Section 6 offers two complementary experiments that empirically demonstrate the practical differences and progression from PCA to AE to VAE. Section 7 compares the key similarities and differences between the three methods from multiple perspectives. Finally, Section 8 provides concluding remarks and discusses future directions.

2. The Goal and Taxonomy of Dimensionality Reduction

A fundamental challenge in data analysis and ML is the curse of dimensionality, where data is often in a high-dimensional space, but its true structure is simpler and lies on a much lower-dimensional manifold. Dimensionality reduction (DR) addresses this by transforming data from the high-dimensional observation space into a meaningful lower-dimensional representation, known as the latent space.
Consider a dataset $X = \{x_1, x_2, \dots, x_N\}$, where each observation $x_i \in \mathbb{R}^M$ is a vector in an M-dimensional feature space (M features), and N is the total number of samples. The goal of a DR technique is to learn a mapping $f: \mathbb{R}^M \to \mathbb{R}^K$ that transforms each data point $x_i$ into a latent representation $z_i \in \mathbb{R}^K$ (where $K \ll M$), while preserving the essential information of the original data. The choice of K is a critical trade-off: a lower value achieves more compression but may discard important information, while a higher value preserves more structure but provides less compression. For instance, with $K = 1$ we project the data onto a single dimension, compressing each data point into one scalar value. This mapping aims to find a low-dimensional representation that retains the maximum amount of meaningful information—such as statistical variance or shape—by eliminating redundancy. Because variables are often correlated, DR methods can eliminate this redundancy while keeping the most important patterns.
To understand different DR techniques, we can group them based on several key criteria. First, methods are either supervised or unsupervised. Supervised methods, such as Linear Discriminant Analysis (LDA) [19], use class labels to find the best low-dimensional subspace. In contrast, unsupervised methods, which are the focus of this tutorial, discover the hidden data structure without labels.
In addition, DR methods can be categorized by four critical features:
  • Linearity: This distinguishes methods that project data onto a linear subspace (e.g., PCA [3]) from those that learn complex, non-linear manifolds (e.g., t-SNE [20]).
  • Preservation Criterion: Methods that prioritize global structure and large-scale distances (e.g., Sammon mapping [21]) are contrasted with those that focus on preserving local neighborhoods (e.g., LLE [2]).
  • Generative Capability: This criterion distinguishes descriptive models from generative models. Descriptive models, such as PCA and t-SNE, provide a compressed representation but cannot create new data. In contrast, generative models like VAEs explicitly learn the data distribution, enabling the generation of novel instances.
  • Probabilistic vs. Deterministic: This distinction concerns the nature of the learned mapping. Deterministic methods (e.g., PCA) produce a fixed low-dimensional coordinate for each input. Probabilistic methods (e.g., VAEs) use probability distributions to model the latent space or the mapping. This quantifies uncertainty and provides a flexible, stochastic framework.
This taxonomy helps us see the conceptual evolution of DR methods, a progression defined by the nature of the mapping function. The journey begins with the simplicity of PCA, which uses linear projection. Since linear methods cannot capture complex relationships, we move to AEs, which employ non-linear functions—typically neural networks—to learn non-linear patterns. Finally, the need for a structured, generative latent space leads to VAEs, which introduce a probabilistic framework. In the following sections, we examine each method in detail, tracing this progression from linear compression to deep generative modeling.

3. Principal Component Analysis (PCA)

3.1. Definition of PCA

PCA is a standard linear dimensionality reduction technique. Its primary goal is to transform high-dimensional data into a lower-dimensional space while preserving as much information (variance) as possible [22].
PCA rotates the data onto new axes called principal components (PCs). These components define the PCA subspace $Z_{\mathrm{PCA}} \subseteq \mathbb{R}^K$ with $K \le M$. The first principal component (PC1) points in the direction of maximum variance. Each subsequent component is orthogonal to all previous ones and captures the next-largest variance (Figure 2). When we project the data onto PC1 alone, the projected points exhibit higher variance than when projected onto PC2. For instance, the points highlighted with a red circle—A and B—are farther apart when projected onto PC1 than onto PC2, because PC1 preserves more variance.

3.2. PCA Space and Principal Components

The PCA space ($Z_{\mathrm{PCA}}$) is constructed using a set of K principal components. These components possess three key mathematical properties:
  • Orthonormal: Each component vector has unit length ($\|v_i\| = 1$) and is perpendicular to all others. Mathematically,
    $$v_i^\top v_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}$$
    To understand this, if $v_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}$ and $v_2 = \begin{bmatrix} 0 \\ 3 \end{bmatrix}$, then $v_1^\top v_2 = (2)(0) + (0)(3) = 0$; hence, $v_1$ and $v_2$ are orthogonal (perpendicular), while
    $$\|v_1\| = \sqrt{2^2 + 0^2} = \sqrt{4} = 2 \neq 1, \qquad \|v_2\| = \sqrt{0^2 + 3^2} = \sqrt{9} = 3 \neq 1;$$
    hence, they are orthogonal but not orthonormal, as they are not of unit length.
  • Uncorrelated: The principal components are uncorrelated. This means that the covariance between different components is zero, $\mathrm{Cov}(v_i, v_j) = 0$ for $i \neq j$, where $\mathrm{Cov}(v_i, v_j)$ denotes the covariance between the $i$-th and $j$-th components.
  • Ordered by Variance: The components are ordered so that PC1 captures the most variance, PC2 captures the next most, and so on.
Due to the orthogonality of its components, PCA is an orthogonal transformation of the data. This can also be interpreted as a rotation of the original coordinate axes to align with the directions of the most spread-out data (i.e., maximal variance) [15,22].
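These properties are easy to verify numerically. The following NumPy sketch (the synthetic data and variable names are ours, purely for illustration) checks that the eigenvectors of a sample covariance matrix are orthonormal and that the projected coordinates are mutually uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))           # 2 features, 100 samples (columns)
X[1] += 0.8 * X[0]                      # introduce correlation between features
D = X - X.mean(axis=1, keepdims=True)   # mean-center each feature

Sigma = D @ D.T / (D.shape[1] - 1)      # sample covariance matrix
eigvals, V = np.linalg.eigh(Sigma)      # columns of V are the eigenvectors

# Orthonormal: V^T V = I
print(np.allclose(V.T @ V, np.eye(2)))  # True

# Uncorrelated: the covariance of the projected data is diagonal
Y = V.T @ D
Sigma_Y = np.cov(Y)
print(np.allclose(Sigma_Y, np.diag(np.diag(Sigma_Y))))  # True
```

The diagonal of `Sigma_Y` contains exactly the eigenvalues, illustrating the "ordered by variance" property as well.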

3.3. PCA as an Optimization Problem

PCA can be understood in two equivalent ways: maximizing the variance of projected data, or minimizing the reconstruction error.

3.3.1. Variance Maximization Perspective

Consider a dataset of mean-centered data points $\{d_1, d_2, \dots, d_N\} \subset \mathbb{R}^M$ (where $d_i = x_i - \mu$ and $\mu$ is the sample mean). We wish to find a unit vector $v$ (with $\|v\| = 1$) such that the projection of the data onto $v$ has the highest possible variance. For example, Figure 2 compares two vectors: assuming both are unit vectors, the projection onto $v_1$ has higher variance than the projection onto $v_2$.
The projection of a data point $d_i$ onto $v$ is the scalar $v^\top d_i$. Because the data is mean-centered, the mean of the projected data is zero, and the variance of the projected data is therefore
$$\mathrm{Variance} = \frac{1}{N} \sum_{i=1}^{N} (v^\top d_i)^2 = \frac{1}{N} \sum_{i=1}^{N} v^\top d_i d_i^\top v = v^\top \left( \frac{1}{N} \sum_{i=1}^{N} d_i d_i^\top \right) v = v^\top \Sigma v$$
where $\Sigma = \frac{1}{N} \sum_{i=1}^{N} d_i d_i^\top$ is the covariance matrix (replacing the scaling factor $\frac{1}{N}$ by $\frac{1}{N-1}$ has no effect on the optimization). To find the best vector (first principal component), we solve the constrained optimization problem:
$$\max_{v} \; v^\top \Sigma v \quad \text{subject to} \quad v^\top v = 1.$$
Using the method of Lagrange multipliers, we define $\mathcal{L}(v, \lambda) = v^\top \Sigma v - \lambda (v^\top v - 1)$, which leads to the condition that the optimal $v$ must satisfy $\Sigma v = \lambda v$. Collecting all such solutions yields
$$\Sigma V = V \Lambda \tag{1}$$
where $V$ is the matrix whose columns are the principal components (eigenvectors) of the covariance matrix $\Sigma$, ordered by descending eigenvalue (from largest to smallest). That is, $V = [v_1, v_2, \dots, v_M]$, where $v_i$ is the eigenvector corresponding to the $i$-th largest eigenvalue, and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_M)$ with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_M$ is the diagonal matrix of corresponding eigenvalues. Thus, Equation (1) represents the complete eigendecomposition of the covariance matrix, where the $i$-th column of $V$ satisfies $\Sigma v_i = \lambda_i v_i$.
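The variance-maximization result can be checked empirically: no unit vector attains a higher projected variance than the top eigenvector. A minimal sketch, assuming NumPy and synthetic data of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(3, 500))
D[2] += D[0] + D[1]                     # make the third feature correlated
D = D - D.mean(axis=1, keepdims=True)   # mean-center

Sigma = D @ D.T / (D.shape[1] - 1)
eigvals, V = np.linalg.eigh(Sigma)      # eigenvalues in ascending order
v_top = V[:, -1]                        # eigenvector of the largest eigenvalue

# Variance of the projection onto the top eigenvector equals its eigenvalue
var_top = v_top @ Sigma @ v_top
print(np.isclose(var_top, eigvals[-1]))  # True

# No random unit vector achieves a higher projected variance
for _ in range(1000):
    v = rng.normal(size=3)
    v /= np.linalg.norm(v)
    assert v @ Sigma @ v <= var_top + 1e-12
```

This is exactly the Rayleigh-quotient bound implied by the Lagrangian condition: $v^\top \Sigma v \le \lambda_{\max}$ for every unit vector $v$.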

3.3.2. Reconstruction Error Perspective

Similarly, PCA can be derived by minimizing the reconstruction error, which is the difference between the original data and the version we get after compressing and then decompressing it. Consider encoding data into a latent space via $z_i = Z_{\mathrm{PCA}}^\top d_i$ and decoding back via $\hat{d}_i = Z_{\mathrm{PCA}} z_i = Z_{\mathrm{PCA}} Z_{\mathrm{PCA}}^\top d_i$, where $Z_{\mathrm{PCA}}$ is the matrix of principal components.
The objective is to minimize the squared reconstruction error across all data points:
$$J(Z_{\mathrm{PCA}}) = \frac{1}{N} \sum_{i=1}^{N} \| d_i - \hat{d}_i \|^2 = \frac{1}{N} \sum_{i=1}^{N} \| d_i - Z_{\mathrm{PCA}} Z_{\mathrm{PCA}}^\top d_i \|^2 = \frac{1}{N} \sum_{i=1}^{N} \Big( d_i^\top d_i - 2\, d_i^\top Z_{\mathrm{PCA}} Z_{\mathrm{PCA}}^\top d_i + d_i^\top Z_{\mathrm{PCA}} \underbrace{Z_{\mathrm{PCA}}^\top Z_{\mathrm{PCA}}}_{=\,I} Z_{\mathrm{PCA}}^\top d_i \Big) = \underbrace{\frac{1}{N} \sum_{i=1}^{N} d_i^\top d_i}_{\text{const}} - \frac{1}{N} \sum_{i=1}^{N} d_i^\top Z_{\mathrm{PCA}} Z_{\mathrm{PCA}}^\top d_i = \text{const} - \mathrm{trace}\big(Z_{\mathrm{PCA}}^\top \Sigma Z_{\mathrm{PCA}}\big)$$
where $Z_{\mathrm{PCA}}^\top Z_{\mathrm{PCA}} = I$ (orthonormality constraint) and the first term $\frac{1}{N} \sum_{i=1}^{N} d_i^\top d_i$ is a constant independent of $Z_{\mathrm{PCA}}$. Therefore, minimizing the reconstruction error is equivalent to maximizing $\mathrm{trace}(Z_{\mathrm{PCA}}^\top \Sigma Z_{\mathrm{PCA}})$, which is exactly the total variance of the projected data. The matrix $Z_{\mathrm{PCA}}$ that maximizes this trace under the orthonormality constraint consists of the eigenvectors of $\Sigma$ corresponding to the K largest eigenvalues—the same principal components obtained from the variance maximization perspective. Thus, both viewpoints lead to the same solution.
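The identity above (reconstruction error equals total variance minus projected variance) can be confirmed numerically; the sketch below uses synthetic data and the $\frac{1}{N}$ scaling of the derivation:

```python
import numpy as np

rng = np.random.default_rng(2)
D = rng.normal(size=(4, 300))
D = D - D.mean(axis=1, keepdims=True)   # mean-center
N = D.shape[1]

Sigma = D @ D.T / N                     # 1/N scaling, as in the derivation
eigvals, V = np.linalg.eigh(Sigma)
Z = V[:, -2:]                           # top K=2 eigenvectors (ascending order)

# Mean squared reconstruction error after projecting onto Z
D_hat = Z @ (Z.T @ D)
J = np.mean(np.sum((D - D_hat) ** 2, axis=0))

# The identity: J = total variance - projected variance
const = np.mean(np.sum(D ** 2, axis=0))
print(np.isclose(J, const - np.trace(Z.T @ Sigma @ Z)))  # True
```

Choosing any other orthonormal `Z` keeps the identity valid but yields a larger `J`, since only the top eigenvectors maximize the trace term.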
There are two principal computational approaches for determining these principal components: one based on the eigendecomposition of the covariance matrix, and another utilizing the Singular Value Decomposition (SVD) of the data matrix.

3.4. Covariance Matrix Method

The covariance matrix method computes principal components in two steps: First, the covariance matrix of the data matrix X is computed. Second, an eigendecomposition is performed on this covariance matrix to find its eigenvalues and corresponding eigenvectors. The resulting eigenvectors form the principal components, ordered by the magnitude of their associated eigenvalues, which represent the amount of variance captured along each component. This computational pipeline is visualized in Figure 3.

3.4.1. Calculating Covariance Matrix ( Σ )

The covariance matrix helps us see how different variables (features) change together. For a single variable x, variance measures deviation from the mean: $\sigma^2 = E[(x - \mu)^2] = E[x^2] - \mu^2$, where $\mu = E[x]$ is the mean. For multiple variables, the covariance between variables $x_i$ and $x_j$ is
$$\Sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)] = E[x_i x_j] - \mu_i \mu_j$$
As shown in Figure 3-step (A), the computational procedure begins by mean-centering the data matrix $X$. We compute the mean vector $\mu \in \mathbb{R}^M$ over all samples and then subtract it from each sample:
$$D = \{d_1, d_2, \dots, d_N\} = \{x_1 - \mu,\; x_2 - \mu,\; \dots,\; x_N - \mu\}$$
where $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$.
Geometrically, the mean-centering step translates the coordinate system so that the data is centered around the origin, which now coincides with $\mu$ [15,23,24]. This translation ensures that (i) distances between points remain the same, (ii) angles between vectors remain the same, (iii) the shape of the data cloud remains the same, and (iv) the covariance structure remains the same. Only the coordinate system's origin moves to the data centroid, and the interpretation of coordinates changes: they now represent deviations from the mean. This step is crucial in PCA because PCA identifies directions of maximum variance, and variance is measured as the average squared distance from the mean: $\mathrm{Variance} = E[(x - \mu)^2]$. After centering, the mean becomes zero, so the variance simplifies to $E[x^2]$; similarly, the covariance computation simplifies to $\mathrm{Cov}(X, Y) = E[XY]$ when the means are zero. Geometrically, all principal components pass through the origin.
The covariance matrix is then computed from the outer products of the mean-centered data (Figure 3-step (B)):
$$\Sigma = \frac{1}{N-1} \sum_{i=1}^{N} d_i d_i^\top = \frac{1}{N-1} D D^\top \tag{2}$$
A valid covariance matrix is always (i) symmetric: $\Sigma = \Sigma^\top$, and (ii) positive semi-definite: $v^\top \Sigma v \ge 0$ for all $v \neq 0$. This guarantees non-negative eigenvalues; geometrically, the data cloud forms an ellipsoid whose axis lengths are proportional to the square roots of the eigenvalues.
The structure of the covariance matrix reveals key statistical relationships:
$$\Sigma = \begin{pmatrix} \mathrm{Var}(x_1) & \mathrm{Cov}(x_1, x_2) & \cdots & \mathrm{Cov}(x_1, x_M) \\ \mathrm{Cov}(x_2, x_1) & \mathrm{Var}(x_2) & \cdots & \mathrm{Cov}(x_2, x_M) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(x_M, x_1) & \mathrm{Cov}(x_M, x_2) & \cdots & \mathrm{Var}(x_M) \end{pmatrix}$$
The diagonal elements $\mathrm{Var}(x_i)$ represent the variances of the individual variables $x_i$ for $i = 1, \dots, M$, while the off-diagonal elements $\mathrm{Cov}(x_i, x_j)$ capture the pairwise covariances between different variables. A positive covariance indicates a positive correlation between variables, negative values suggest inverse relationships, and zero values denote that the variables are uncorrelated (which implies independence only in special cases, such as jointly Gaussian data) [15].
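As a sanity check, the formula above can be compared against a library implementation. A small sketch (assuming NumPy, whose `np.cov` also treats rows as variables and uses the $\frac{1}{N-1}$ scaling by default):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 50))            # M=3 features, N=50 samples (columns)

mu = X.mean(axis=1, keepdims=True)
D = X - mu                              # mean-centered data

Sigma = D @ D.T / (X.shape[1] - 1)      # Equation (2): (1/(N-1)) D D^T

# Matches NumPy's covariance (rows = variables, ddof = 1 by default)
print(np.allclose(Sigma, np.cov(X)))    # True

# Symmetric and positive semi-definite (non-negative eigenvalues)
print(np.allclose(Sigma, Sigma.T))      # True
print(bool(np.all(np.linalg.eigvalsh(Sigma) >= -1e-12)))  # True
```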

3.4.2. Eigenvalue–Eigenvector Decomposition for Principal Components

The fundamental mathematical operation underlying PCA involves solving the characteristic equation of the covariance matrix:
$$\Sigma v_i = \lambda_i v_i$$
where $\lambda_i$ denotes the $i$-th eigenvalue (a scalar quantity) and $v_i$ represents its corresponding eigenvector (a non-zero vector).
In PCA, the eigenvectors $v_i$ define orthogonal principal component directions of maximum variance, while their associated eigenvalues $\lambda_i$ quantify the amount of variance captured along each direction. We sort the eigenvectors from the largest eigenvalue to the smallest, because the eigenvector with the largest eigenvalue points in the direction with the most information. To form our lower-dimensional PCA space, as in Figure 3-step (C), we choose the top K eigenvectors corresponding to the K largest eigenvalues: $Z_{\mathrm{PCA}} = [v_1, v_2, \dots, v_K]$ [25,26].
Special Case—Degenerate Eigenvalues: When multiple eigenvalues are equal ($\lambda_i = \lambda_j$), the corresponding principal components explain identical variance (the data is equally spread in both directions). Geometrically, the data cloud forms a perfectly symmetric shape in the subspace spanned by those eigenvectors. Hence, any orthogonal rotation of the associated eigenvectors is equally valid, which makes it harder to choose a "best" single component [25]. Such cases typically occur in datasets with spherical symmetry or when multiple features exhibit identical correlation structures.
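In code, the eigendecomposition and top-K selection might look as follows (a sketch with synthetic data; note that NumPy's `eigh` returns eigenvalues in ascending order, so an explicit descending sort is needed):

```python
import numpy as np

rng = np.random.default_rng(4)
D = rng.normal(size=(5, 200))
D = D - D.mean(axis=1, keepdims=True)   # mean-center
Sigma = D @ D.T / (D.shape[1] - 1)

# np.linalg.eigh returns eigenvalues in ascending order; re-sort descending
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

K = 2
Z_pca = eigvecs[:, :K]                  # top-K principal components

# Each selected column satisfies the characteristic equation Sigma v = lambda v
for i in range(K):
    assert np.allclose(Sigma @ Z_pca[:, i], eigvals[i] * Z_pca[:, i])
print(bool(eigvals[0] >= eigvals[1]))   # True
```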

3.5. Singular Value Decomposition (SVD)

SVD provides an alternative, more numerically stable way of calculating the principal components. SVD represents a fundamental matrix factorization technique in linear algebra, providing optimal low-rank approximations while preserving maximal variance (see Figure 3-step (D)). For PCA computation, we apply SVD to the mean-centered data matrix $D \in \mathbb{R}^{M \times N}$ (where each column $d_i = x_i - \mu$ represents a centered data point), factorizing it into three constituent matrices:
$$D = L S R^\top \tag{4}$$
where
  • $L \in \mathbb{R}^{M \times M}$ constitutes the left singular vectors; its columns form an orthonormal basis for the column space of $D$ and are exactly the eigenvectors of $D D^\top$—the principal components.
  • $S \in \mathbb{R}^{M \times N}$ is a rectangular diagonal matrix containing the singular values $s_1 \ge s_2 \ge \dots \ge s_r > 0$, where $r = \min(M, N)$. The singular values $s_i$ relate to the eigenvalues via $\lambda_i = \frac{s_i^2}{N-1}$.
  • $R \in \mathbb{R}^{N \times N}$ contains the right singular vectors (eigenvectors of $D^\top D$).
$L$ and $R$ are orthonormal bases, satisfying $L^\top L = I$ and $R^\top R = I$.
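The factorization and its orthonormality properties can be verified directly; a NumPy sketch with synthetic data (using the reduced factorization, which keeps only the first $r = \min(M, N)$ singular vectors):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(4, 60))            # M=4 features, N=60 samples
D = X - X.mean(axis=1, keepdims=True)   # mean-center

L, s, Rt = np.linalg.svd(D, full_matrices=False)  # D = L @ diag(s) @ Rt

# The factorization reconstructs D exactly
print(np.allclose(L @ np.diag(s) @ Rt, D))  # True

# L has orthonormal columns, and the rows of Rt are orthonormal
print(np.allclose(L.T @ L, np.eye(4)))      # True
print(np.allclose(Rt @ Rt.T, np.eye(4)))    # True

# Singular values come sorted in descending order
print(bool(np.all(np.diff(s) <= 0)))        # True
```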

3.5.1. Equivalence of SVD and Covariance-Based PCA

This section establishes the fundamental relationship between the SVD and eigendecomposition approaches for PCA. Both methods yield identical principal components and eigenvalues, providing complementary mathematical perspectives on the same underlying structure. As we mentioned in Section 3.4, the principal components of the mean-centered data matrix $D$ are obtained through eigendecomposition of the covariance matrix $\Sigma = \frac{1}{N-1} D D^\top$, which satisfies $\Sigma V = V \Lambda$, where $V$ contains the eigenvectors and $\Lambda$ the eigenvalues. Alternatively, applying SVD to the mean-centered data matrix yields $D = L S R^\top$ (see Equation (4)).
The fundamental connection emerges when we express the covariance matrix in terms of the SVD components:
$$\Sigma = \frac{1}{N-1} D D^\top = \frac{1}{N-1} (L S R^\top)(L S R^\top)^\top = \frac{1}{N-1} (L S R^\top)(R S^\top L^\top) = \frac{1}{N-1} L S R^\top R S^\top L^\top$$
utilizing the orthonormality property $R^\top R = I$:
$$\Sigma = \frac{1}{N-1} L S^2 L^\top$$
This is the eigendecomposition $\Sigma = V \Lambda V^\top$ with $V = L$ and $\Lambda = \frac{1}{N-1} S^2$ (writing $S^2 = S S^\top$). Therefore:
  • The principal components (eigenvectors of $\Sigma$) are exactly the columns of $L$.
  • The eigenvalues relate to the singular values via $\lambda_i = \frac{s_i^2}{N-1}$. Equivalently, if we define the scaled data matrix $\tilde{D} = D / \sqrt{N-1}$, then the singular values $\tilde{s}_i$ of $\tilde{D}$ satisfy $\lambda_i = \tilde{s}_i^2$.
The proportion of variance explained by the $i$-th principal component is identical in both approaches:
$$\frac{\lambda_i}{\sum_{j=1}^{r} \lambda_j} = \frac{s_i^2}{\sum_{j=1}^{r} s_j^2}$$
where the total number of non-zero components is limited by $r = \min(M, N)$, the smaller of the number of features and the number of samples.
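Both routes can be compared numerically. The sketch below (synthetic data) checks that the eigenvalues of the covariance matrix equal the squared singular values divided by $N-1$, and that the explained-variance ratios coincide:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(3, 80))
D = X - X.mean(axis=1, keepdims=True)   # mean-center
N = D.shape[1]

# Covariance route
Sigma = D @ D.T / (N - 1)
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]

# SVD route
s = np.linalg.svd(D, compute_uv=False)          # singular values, descending
lam_from_svd = s ** 2 / (N - 1)
print(np.allclose(eigvals, lam_from_svd))       # True

# Explained-variance ratios are identical in both formulations
ratio_eig = eigvals / eigvals.sum()
ratio_svd = s ** 2 / (s ** 2).sum()
print(np.allclose(ratio_eig, ratio_svd))        # True
```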

3.5.2. Computational and Numerical Considerations

A key advantage of the SVD approach lies in its superior numerical stability, particularly for challenging datasets. Unlike the covariance method, which explicitly forms and decomposes $D D^\top$, SVD works directly on the centered data matrix $D$. This avoids numerical precision issues that commonly arise with the covariance method in three scenarios: when features have very different measurement scales, when variables are highly correlated, or when the number of features approaches or exceeds the number of observations. The SVD approach is generally preferred for high-dimensional data where $M \gg N$ due to its better numerical stability and efficiency. It is especially valuable for datasets with linear dependencies, where some features can be perfectly predicted from others. In such cases, the data matrix $X \in \mathbb{R}^{M \times N}$ has reduced rank (lower than $\min(M, N)$), making the covariance matrix singular and causing traditional eigendecomposition to become unstable. SVD handles these situations gracefully without relying on matrix inversion, providing a more reliable path to the principal components [27,28,29].

3.6. Constructing the PCA Subspace

The PCA subspace ($Z_{\mathrm{PCA}}$) is constructed by selecting the K most significant principal components—those with the largest eigenvalues—as illustrated in Figure 3.
This selection criterion ensures maximum variance preservation from the original dataset, while excluding less informative components. The resulting low-dimensional subspace is defined as $Z_{\mathrm{PCA}} = [v_1, \dots, v_K]$, where each $v_i$ represents a principal component direction.
Dimensionality reduction is achieved by projecting the mean-centered original data onto this optimized subspace:
$$Y = Z_{\mathrm{PCA}}^\top D$$
where $Y \in \mathbb{R}^{K \times N}$ denotes the transformed data in the reduced PCA space, as illustrated in Figure 4. This projection effectively discards $(M - K)$ dimensions of the original feature space while retaining the most statistically significant information. The total variance preserved is $\text{Retained Variance} = \sum_{i=1}^{K} \lambda_i$, and the variance lost is $\text{Lost Variance} = \sum_{i=K+1}^{M} \lambda_i$.

3.7. Data Reconstruction and Error Analysis

The original data can be reconstructed from its principal component representation through the inverse transformation:
$$\hat{X} = Z_{\mathrm{PCA}} Y + \mu = Z_{\mathrm{PCA}} Z_{\mathrm{PCA}}^\top D + \mu$$
where $\hat{X}$ is the reconstructed approximation of the original dataset $X$ (column-wise: $\hat{x}_i = Z_{\mathrm{PCA}} y_i + \mu$). The reconstruction error is typically measured as the sum of squared Euclidean distances between the original and reconstructed points:
$$\mathrm{Error} = \sum_{i=1}^{N} \| x_i - \hat{x}_i \|^2.$$
Error decreases as we retain more principal components (i.e., larger K), because more variance is preserved. Crucially, the error exhibits an inverse relationship with the total variance captured by the selected principal components.
The quality of the PCA representation is often quantified by the explained variance ratio:
$$\text{Explained Variance} = \frac{\text{Total Variance of } Z_{\mathrm{PCA}}}{\text{Total Variance}} = \frac{\sum_{i=1}^{K} \lambda_i}{\sum_{i=1}^{M} \lambda_i}.$$
This ratio measures the proportion of the original data variance retained in the K-dimensional subspace [30]. A ratio of 1.0 means 100% variance is kept—this occurs when K = M or when the discarded components have zero variance (perfectly correlated features).
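The behavior of the reconstruction error as K grows can be demonstrated numerically. The sketch below (synthetic data) verifies that the error decreases monotonically and that, for any K, the total squared error equals $(N-1)$ times the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(5, 100))           # M=5 features, N=100 samples
mu = X.mean(axis=1, keepdims=True)
D = X - mu
N = D.shape[1]

Sigma = D @ D.T / (N - 1)
eigvals, V = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

errors = []
for K in range(1, 6):
    Z = V[:, :K]
    X_hat = Z @ (Z.T @ D) + mu          # reconstruct with K components
    errors.append(np.sum((X - X_hat) ** 2))

# Error decreases monotonically as K grows, reaching ~0 at K = M
print(bool(np.all(np.diff(errors) <= 1e-9)))  # True
# Total squared error = (N-1) * sum of discarded eigenvalues (here K = 2)
print(np.isclose(errors[1], (N - 1) * eigvals[2:].sum()))  # True
```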

3.8. PCA Algorithms

The first step in the PCA algorithm is to construct a data matrix $X$, where each column represents a single sample and each row represents a specific feature. The detailed steps for calculating the principal components using both the covariance and SVD methods are summarized in Algorithm 1. Note that, in the SVD method, we use the scaled matrix $\tilde{D} = \frac{D}{\sqrt{N-1}}$ so that the squared singular values $\tilde{s}_i^2$ directly equal the eigenvalues $\lambda_i$.
The computational complexity of PCA is as follows:
  • SVD-Based: $O(\min(M N^2, M^2 N))$ for a data matrix $D \in \mathbb{R}^{M \times N}$.
  • Covariance-Based: $O(M^2 N + M^3)$ for the covariance computation plus the eigendecomposition.
Algorithm 1 Principal Component Analysis (PCA)
  1: Input: Data matrix $X = [x_1, x_2, \dots, x_N]$, where $x_i \in \mathbb{R}^M$ is the $i$-th sample.
  2: Compute the sample mean vector: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
  3: Center the data matrix: $D = [d_1, d_2, \dots, d_N]$, where $d_i = x_i - \mu$
  4: Compute the principal components using one of the following methods:
  5: if using the covariance matrix then
  6:    Compute the covariance matrix: $\Sigma = \frac{1}{N-1} D D^\top$
  7:    Perform the eigendecomposition: $\Sigma V = V \Lambda$
  8:    Sort the eigenvectors in $V$ by their corresponding eigenvalues in $\Lambda$ (descending order).
  9:    $Z_{\mathrm{PCA}} = [v_1, \dots, v_K]$
 10: else if using the SVD then
 11:    Construct the scaled matrix: $\tilde{D} = \frac{1}{\sqrt{N-1}} D$
 12:    $[L, S, R] = \mathrm{SVD}(\tilde{D})$
 13:    The principal components are the columns of $L$; the variances are $\lambda_i = \tilde{s}_i^2$.
 14:    $Z_{\mathrm{PCA}} = [l_1, \dots, l_K]$
 15: end if
 16: Project the centered data onto the lower-dimensional space: $Y = Z_{\mathrm{PCA}}^\top D$.
 17: Output: Projected data $Y$ and projection matrix $Z_{\mathrm{PCA}}$.
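Algorithm 1 translates almost line by line into NumPy. The function below is a minimal sketch (the name `pca` and its signature are ours, not a library API); it implements both branches and confirms that they agree up to the arbitrary sign of each component:

```python
import numpy as np

def pca(X, K, method="cov"):
    """Minimal sketch of Algorithm 1. X holds one sample per column."""
    mu = X.mean(axis=1, keepdims=True)           # step 2: sample mean
    D = X - mu                                   # step 3: center the data
    N = X.shape[1]
    if method == "cov":                          # steps 5-9
        Sigma = D @ D.T / (N - 1)
        eigvals, V = np.linalg.eigh(Sigma)
        order = np.argsort(eigvals)[::-1]        # sort descending
        eigvals, V = eigvals[order], V[:, order]
        Z = V[:, :K]
    else:                                        # steps 10-14 (SVD)
        L, s, _ = np.linalg.svd(D / np.sqrt(N - 1), full_matrices=False)
        eigvals = s ** 2                         # variances of the components
        Z = L[:, :K]
    Y = Z.T @ D                                  # step 16: project
    return Y, Z, eigvals[:K]

rng = np.random.default_rng(8)
X = rng.normal(size=(4, 120))
Y_cov, Z_cov, lam_cov = pca(X, K=2, method="cov")
Y_svd, Z_svd, lam_svd = pca(X, K=2, method="svd")

# Both routes agree up to the arbitrary sign of each component
print(np.allclose(np.abs(Y_cov), np.abs(Y_svd)))  # True
print(np.allclose(lam_cov, lam_svd))              # True
```

The sign ambiguity is expected: each eigenvector $v_i$ and its negation $-v_i$ span the same direction, so different solvers may flip individual components.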

3.9. Numerical Examples

This section explains the steps of calculating a lower-dimensional space using the covariance matrix method through two numerical examples. The first example, adapted from [31], uses a two-feature dataset to visually illustrate the derivation of principal components. It details the computation of eigenvalues and eigenvectors and demonstrates the data projection and reconstruction processes. The second example employs a four-feature dataset to examine how PCA procedures are affected by increased dimensionality and to explore the influence of a constant variable with zero variance.

3.9.1. First Example: 2D Example

In this section, the lower-dimensional PCA space is constructed using the covariance matrix method. Consider a dataset with N = 8 samples and M = 2 features, as follows:
X = [ 1.00  1.00  2.00  0.00  5.00  4.00  5.00  3.00
      3.00  2.00  3.00  3.00  4.00  5.00  5.00  4.00 ]
where each column vector x_i ∈ R^2 represents the i-th sample. The sample mean, μ = [ 2.63  3.63 ]^T, is subtracted from every sample, d_i = x_i - μ, giving the centered data matrix D:
D = [ -1.63  -1.63  -0.63  -2.63  2.38  1.38  2.38  0.38
      -0.63  -1.63  -0.63  -0.63  0.38  1.38  1.38  0.38 ]
The covariance matrix Σ is then computed according to Equation (2). Subsequently, its eigenvalues Λ and eigenvectors V are calculated. The resulting matrices are as follows:
Σ = (1/(N-1)) D D^T = [ 3.71  1.70
                        1.70  1.13 ],
Λ = [ 0.28  0.00
      0.00  4.54 ],   and
V = [ -0.45  0.90
       0.90  0.45 ]
The results indicate that the second eigenvalue, λ_2 = 4.54, is larger than the first, λ_1 = 0.28. It accounts for approximately 4.54 / ( 0.28 + 4.54 ) ≈ 94.19% of the total variance, while λ_1 accounts for only 5.81%. This distribution reflects the greater significance of the second eigenvector compared to the first. Consequently, the second eigenvector (i.e., the second column of V) defines the direction of maximum variance and therefore constitutes the first principal component of the PCA space (PC1).
To reduce the dimensionality from two to one, the mean-centered data matrix D is projected onto a one-dimensional space spanned by a single eigenvector. Which eigenvector should we select? Naturally, the one with the maximum eigenvalue, since its direction preserves the most variance.
If we project the data onto PC 1 (using v 2 ) and PC 2 (using v 1 ), we get
Y_v1 = v_1^T D,   and   Y_v2 = v_2^T D,
where Y v 1 and Y v 2 represent the projection of D onto v 1 (second principal component) and v 2 (first principal component), respectively. The resulting projection vectors are as follows:
Y_v1 = [ 0.16  -0.73  -0.28  0.61  -0.72  0.62  0.18  0.17 ]
Y_v2 = [ -1.73  -2.18  -0.84  -2.63  2.29  1.84  2.74  0.50 ]
When data are projected onto all eigenvectors and subsequently reconstructed, no information is lost, and the reconstruction error is zero. However, dimensionality reduction necessitates discarding information, as we did in this example by selecting only one eigenvector as the projection space (i.e., the projection space is one-dimensional) and discarding the other. If the discarded eigenvectors carry information, this removal introduces a reconstruction error, i.e., a residual between the original and reconstructed data. The magnitude of this error depends on the number of selected eigenvectors K and the variance they explain, as quantified by the corresponding eigenvalues. The reconstruction from the i-th eigenvector is computed as follows:
X̂_i = v_i Y_vi + μ .
The resulting reconstructions, X ^ 1 and X ^ 2 , are presented below.
X̂_1 = v_1 Y_v1 + μ = [ 2.55  2.95  2.75  2.35  2.95  2.35  2.55  2.55
                        3.77  2.97  3.37  4.17  2.98  4.18  3.78  3.78 ]
X̂_2 = v_2 Y_v2 + μ = [ 1.07  0.67  1.88  0.27  4.68  4.28  5.08  3.08
                        2.85  2.66  3.25  2.46  4.65  4.45  4.84  3.85 ]
The reconstruction errors between the original data and the data reconstructed from the first and second eigenvectors are denoted by E v 1 and E v 2 , respectively. The values of these errors are as follows:  
E_v1 = X - X̂_1 = [ -1.55  -1.95  -0.75  -2.35  2.05  1.65  2.45  0.45
                    -0.77  -0.97  -0.37  -1.17  1.02  0.82  1.22  0.22 ]
E_v2 = X - X̂_2 = [ -0.07  0.33  0.12  -0.27  0.32  -0.28  -0.08  -0.08
                     0.15  -0.66  -0.25  0.54  -0.65  0.55  0.16  0.15 ]
When we reconstruct the data using only PC1, the squared error for the first sample is ‖ [ 1.00  3.00 ]^T - [ 1.07  2.85 ]^T ‖^2 = (-0.07)^2 + (0.15)^2 ≈ 0.03. However, if we had mistakenly used PC2 (the component with less variance), the error would be 3.00. The total squared error over the dataset using PC1 is 1.98, while using PC2 it is 31.77. This confirms that PC1 preserves the most information.
Figure 5 provides a visual comparison of the two principal components. The original data samples, represented as red stars in Figure 5 (left), are points in a two-dimensional space ( x i R 2 ). The two eigenvectors ( v 1 and v 2 ) derived from the data in this example are depicted as lines. The solid line represents the second eigenvector ( v 2 ), which corresponds to the first principal component ( P C 1 ), while the dotted line represents the first eigenvector ( v 1 ), which corresponds to the second principal component ( P C 2 ). The projection of the original data onto these two principal components is shown in Figure 5 (right). This figure visualizes the individual sample errors E v 1 and E v 2 with green and blue lines, respectively. As shown, the error associated with the second eigenvector ( E v 2 ) is consistently and substantially lower than that of the first ( E v 1 ).
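Every number in this example can be reproduced with a few lines of NumPy. The sketch below is ours (`recon_error` is a hypothetical helper name); it recomputes the eigendecomposition and the two total reconstruction errors:

```python
import numpy as np

# The 2 x 8 data matrix from the example (columns are samples).
X = np.array([[1.0, 1, 2, 0, 5, 4, 5, 3],
              [3.0, 2, 3, 3, 4, 5, 5, 4]])
D = X - X.mean(axis=1, keepdims=True)
Sigma = D @ D.T / (X.shape[1] - 1)
eigvals, V = np.linalg.eigh(Sigma)      # ascending order: ~[0.28, 4.54]

def recon_error(v):
    """Total squared error when keeping only the direction v."""
    P = np.outer(v, v)                  # rank-1 projector onto v
    return np.sum((D - P @ D) ** 2)

err_pc1 = recon_error(V[:, 1])          # keep the dominant direction: ~1.98
err_pc2 = recon_error(V[:, 0])          # keep the minor direction:   ~31.77
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the first principal component is the last column of `V`.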

3.9.2. Second Example: 4D Example

In this example, each sample is represented by four features ( M = 4 ), with the second feature being constant across all observations (the second row of X):
X = [ 1.00  1.00  2.00  0.00  7.00  6.00  7.00  8.00
      2.00  2.00  2.00  2.00  2.00  2.00  2.00  2.00
      5.00  6.00  5.00  9.00  1.00  2.00  1.00  4.00
      3.00  2.00  3.00  3.00  4.00  5.00  5.00  4.00 ]
The covariance matrix Σ is calculated as follows:
Σ = [ 10.86   0.00  -7.57   2.86
       0.00   0.00   0.00   0.00
      -7.57   0.00   7.55  -2.23
       2.86   0.00  -2.23   1.13 ]
From the above covariance matrix, the zero values in the second row and column reflect the zero variance of the second feature due to its constant value.
The eigenvalues Λ and eigenvectors V of the covariance matrix are as follows:
Λ = [ 17.75  0.00  0.00  0.00
       0.00  1.46  0.00  0.00
       0.00  0.00  0.33  0.00
       0.00  0.00  0.00  0.00 ],
V = [  0.76  0.62  -0.20  0.00
       0.00  0.00   0.00  1.00
      -0.61  0.79   0.10  0.00
       0.21  0.05   0.98  0.00 ]
Several key observations emerge from these results. First, the first eigenvector accounts for 17.75 17.75 + 1.46 + 0.33 + 0 90.84 % of the total variance, making it the first principal component. Second, the first three eigenvectors collectively represent 100 % of the total variance. This makes the fourth eigenvector, which is associated with an eigenvalue of zero, mathematically redundant, as it corresponds to a constant dimension across all samples. Third, as in Equation (18), projection onto the fourth eigenvector captures no meaningful variation and results in a high reconstruction error of ≈136.75. In contrast, projection onto the first three eigenvectors yields errors of ≈ 12.53 , 126.54 , and  134.43 , respectively, directly reflecting their decreasing contribution to the explained variance.
Y_v1 = [ -2.95  -3.78  -2.19  -6.16  4.28  3.12  4.49  3.20 ]
Y_v2 = [ -1.20  -0.46  -0.58  1.32  -0.58  -0.39  -0.54  2.39 ]
Y_v3 = [ 0.06  -0.82  -0.14  0.64  -0.52  0.75  0.45  -0.43 ]
Y_v4 = [ 0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00 ]
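The effect of the constant feature can be checked numerically. The short NumPy sketch below (ours) confirms the exactly zero eigenvalue and the roughly 90.8% variance share of the first component:

```python
import numpy as np

# The 4 x 8 data matrix: the second feature is constant (zero variance).
X = np.array([[1.0, 1, 2, 0, 7, 6, 7, 8],
              [2.0, 2, 2, 2, 2, 2, 2, 2],
              [5.0, 6, 5, 9, 1, 2, 1, 4],
              [3.0, 2, 3, 3, 4, 5, 5, 4]])
D = X - X.mean(axis=1, keepdims=True)
Sigma = D @ D.T / (X.shape[1] - 1)
eigvals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]  # descending

# The constant row of D is exactly zero, so one eigenvalue is exactly zero
# and the first principal component explains ~90.8% of the total variance.
ratio = eigvals[0] / eigvals.sum()
```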

4. Autoencoder (AE)

PCA is mathematically elegant, but its linearity limits its ability to model complex data distributions. This leads us to a more expressive framework: the Autoencoder (AE). Autoencoders were first proposed by Rumelhart, Hinton, and Williams in 1986, with the goal of learning to reconstruct input data with as little error as possible [5,32].
The Autoencoder is a neural network architecture designed to learn efficient non-linear data encodings in an unsupervised manner. Its goal is not just to reduce dimensions but to find the hidden, “deep” structure of the data.
The Autoencoder architecture consists of three fundamental components:
1.
Encoder: The encoder f ϕ : R M R K is a function with learnable parameters ϕ = { W E , b E } (weights and biases). It maps input data ( x i ) to a lower-dimensional latent code h i = f ϕ ( x i ) , where h i R K represents the latent feature representation. These “latent features” are a compressed summary of the input. For example, latent features for hand-written digits could encode information about the number of lines required to write each number, the angles of these lines, and how they connect. For a single hidden layer architecture,
Z AE = f ϕ ( X ) = σ ( W E X + b E ) ,
where σ denotes a non-linear activation function like ReLU, tanh (for more details about activation functions, see Appendix A). These non-linearities allow the network to learn much more complex patterns than PCA.
In deep architectures, the encoder has multiple layers (see Figure 6, where the encoder consists of two layers) with progressively decreasing dimensions: the first layer has n 1 = M neurons (matching the input size), and the number of neurons in each layer drops, with the middle (or bottleneck) layer having the smallest number of neurons.
h^(1) = σ( W_E^(1) x + b_E^(1) ),   h^(l) = σ( W_E^(l) h^(l-1) + b_E^(l) ),   l = 2, … , L_E
with Z AE = h ( L E ) , where L E is the number of encoder layers [32].
2.
Latent Space: The latent code Z AE R K represents the output of the encoder and contains a compressed representation of the input data, where K M . This bottleneck is the critical architectural element that prevents the network from learning a trivial identity mapping, forcing it to prioritize the most important features. Unlike the probabilistic latent space in VAEs, the standard Autoencoder’s latent space is deterministic, meaning that for each input, the encoder always produces exactly the same latent representation—there is no randomness or probability distribution involved in the encoding process [33].
3.
Decoder: The decoder g θ : R K R M reconstructs data from the latent code through the transformation x ^ i = g θ ( h i ) = g θ ( f ϕ ( x i ) ) . The decoder g θ , parameterized by θ = { W D , b D } , performs the inverse mapping from the latent space back to the original data space:
x ^ = g θ ( Z AE ) = σ ( W D Z AE + b D ) .
For deep architectures, the decoder mirrors the encoder structure with progressively expanding layers:
h^(L_E+1) = σ( W_D^(1) z + b_D^(1) ),   x̂ = σ( W_D^(L_D) h^(L_E+L_D-1) + b_D^(L_D) )
where L D denotes the number of decoder layers.
The learning objective is to optimize parameters ϕ and θ by minimizing a reconstruction loss:
min ϕ , θ L ( x , x ^ ) = min ϕ , θ L ( x , g θ ( f ϕ ( x ) ) )
where x ^ is the reconstructed output. The network learns a function that, for a given input x , produces a reconstruction x ^ from a compressed latent code. Critically, this learning is constrained by a bottleneck: the low-dimensional latent space (with dimension K M ). By minimizing this loss, the network learns to keep only the features that are absolutely necessary to rebuild the input.

4.1. Parameters and Training

The complete parameter set of an Autoencoder is Θ = { ϕ , θ } . Key architectural hyperparameters include the following:
  • Network Architecture: Autoencoders typically employ symmetric architectures where the decoder mirrors the encoder. This means that the decoder network has the same number of layers as the encoder, with corresponding layers having matching sizes in reverse order, as shown in Figure 6. For example, with 784-pixel MNIST images, if the encoder goes from Input(784) → FC(256) → FC(64), the decoder usually goes from FC(64) → FC(256) → Output(784) [33].
  • Latent Space Dimensionality: The size K of the latent code Z AE represents a critical trade-off. If K is too small, excessive information is lost, while if K is too large, the network can avoid meaningful compression.
  • Loss Function: The selection of an appropriate loss function should be guided by the data characteristics, the desired reconstruction properties, and the specific application requirements, as these fundamentally shape what the Autoencoder learns to preserve and discard during compression. Different loss functions make different assumptions about the data distribution and error sensitivity, leading to substantially different learned representations and reconstruction behaviors. Here, we focus on how the nature of the data affects the choice of the reconstruction loss L(x, x̂). The most common choice for real-valued data is the Mean Squared Error (MSE) loss, also known as the L2 loss, which measures the average squared difference between the original input x and its reconstruction x̂:
    L_MSE = (1/N) Σ_{i=1}^{N} ‖ x_i - x̂_i ‖^2
Parameters Θ are optimized via gradient-based methods (e.g., Adam, SGD) through backpropagation to minimize the reconstruction loss over the training dataset.

4.2. The Generative Limitation

Although Autoencoders excel at compression, they are not effective at generating new data. This is because the latent space is deterministic and unstructured, which often results in “gaps”. If you select a random point in the latent space that the network did not encounter during training, the decoder will probably generate nonsense rather than a realistic image. This lack of a smooth, probabilistic structure is why VAEs are necessary.

4.3. Practical Considerations for Autoencoders

In Autoencoder design, the choice of activation function depends on the specific layer and data characteristics:
  • Encoder Hidden Layers: The function of the encoder is to perform compression and feature extraction, mapping high-dimensional inputs to low-dimensional latent codes. This process relies on activations that promote sparsity, computational efficiency, and gradient flow for effective feature learning. As a result, the Rectified Linear Unit (ReLU) and its variants have become the standard choice for the hidden layers of the encoder. This preference stems from ReLU’s computational efficiency, which involves only a simple thresholding operation, and its effectiveness in mitigating the vanishing gradient problem in deep networks (more details on vanishing gradients and activation functions can be found in Appendix A). However, ReLU neurons can occasionally “die,” outputting zero for all inputs and resulting in inactive neurons. Alternatives such as Leaky ReLU and Exponential Linear Unit (ELU) can help address this issue by permitting small negative values.
  • Decoder Hidden Layers: The function of the decoder is to perform reconstruction and generation, mapping the low-dimensional latent code back to a high-dimensional output. This task requires activations that facilitate smooth interpolation, stable gradient flow for reconstruction quality, and controlled output ranges, especially near the final layer. Both ReLU and the hyperbolic tangent (tanh) are commonly used in the decoder's hidden layers. ReLU is often preferred for its simplicity and non-saturating behavior, which can be beneficial for training deep decoders. However, tanh often outperforms ReLU in deeper decoders due to three key properties: (1) bounded outputs prevent extreme activations that can destabilize training, (2) smooth differentiability enhances gradient flow, and (3) zero-centered symmetry aids in reconstructing zero-centered data, such as data normalized to [ -1, 1 ] (more details about the zero-centered problem can be found in Appendix A).
    In practice, the selection of activation functions in the decoder is frequently influenced by empirical performance on the validation set.
  • Output Layer: The activation function of the decoder’s output layer is chosen to match the statistical properties of the input data being reconstructed. This ensures that the reconstruction x ^ resides in the same domain as the original input x .
    -
    Sigmoid Activation: Used for data where each element represents a probability or intensity normalized to [ 0 , 1 ] . For example, binary-valued data (e.g., black-and-white images) and grayscale images normalized to [ 0 , 1 ] (e.g., MNIST, Fashion-MNIST).
    -
    Softmax Activation: Used for discrete categorical data or one-hot encoded features. This is particularly important when modeling multinomial distributions (e.g., pixel intensities discretized into 256 bins), when reconstructing text sequences or one-hot encoded categorical variables, or when the outputs must sum to 1 across categories.
    -
    Tanh Activation: Used for data normalized to the range [ 1 , 1 ] , which is common when working with preprocessed image datasets where zero-centered data is beneficial.
    -
    Linear Activation (no activation): Used for unconstrained real-valued data where outputs can theoretically range over ( , + ) . Examples include regression tasks, scientific measurements without bounded ranges, and certain types of continuous sensor data.
  • Latent Space: The activation function for the encoder’s output layer defines the latent space’s properties. A linear activation is most common, producing an unconstrained, interpretable real-valued vector analogous to PCA components. For specific applications requiring bounded latent codes (e.g., interpretable normalized features), a tanh or sigmoid activation can be used to restrict the latent values to (−1, 1) or (0, 1), respectively.
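The range constraints above are easy to check numerically. The short NumPy sketch below (ours) evaluates each common output activation and the interval it maps into:

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 101)          # arbitrary pre-activation values

sigmoid = 1.0 / (1.0 + np.exp(-z))       # (0, 1): data normalized to [0, 1]
tanh_out = np.tanh(z)                    # (-1, 1): zero-centered data
softmax = np.exp(z) / np.exp(z).sum()    # non-negative, sums to 1: categorical
linear = z                               # unbounded: real-valued targets
```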

4.4. Numerical Example

To illustrate this, we will follow one sample through a simple Autoencoder. We will train this network using the same first sample from the PCA example, x_1 = [ 1.00  3.00 ]^T, to facilitate comparison between the two dimensionality reduction methods. The goal is to demonstrate, step by step, how Autoencoders compress input data into a lower-dimensional latent representation and then reconstruct it, all while minimizing reconstruction error through gradient-based optimization. This process highlights how neural networks can learn similar low-dimensional representations through iterative optimization.
Consider a symmetric Autoencoder designed to learn a one-dimensional representation from two-dimensional input data. This architecture mirrors the dimensionality reduction objective of PCA but achieves it through a different, parameterized approach. The network consists of five layers (see Figure 7) arranged as follows:
  • Input Layer: Two neurons (accepting 2D input vectors).
  • Encoder Hidden Layer: One neuron with hyperbolic tangent (tanh) activation.
  • Bottleneck Layer: One neuron with linear activation (producing the 1D latent code).
  • Decoder Hidden Layer: One neuron with tanh activation.
  • Output Layer: Two neurons with linear activation (reconstructing the original 2D input).
For clarity in tracking computations, we initialize all parameters with simple values:
  • Layer 1 (Encoder):  W^(1) = [ 0.3  0.4 ], b^(1) = 0.1
  • Layer 2 (Bottleneck):  W^(2) = 0.5, b^(2) = 0.2
  • Layer 3 (Decoder):  W^(3) = 0.6, b^(3) = 0.3
  • Layer 4 (Output):  W^(4) = [ 0.7  0.8 ]^T, b^(4) = [ 0.4  0.5 ]^T
In this example, we used the tanh activation function to introduce non-linearity, while linear activation functions were employed in the bottleneck and output layers.

4.4.1. Forward Propagation

The forward propagation process demonstrates how the Autoencoder transforms input data through its layers, producing both a compressed representation and a reconstruction. This process consists of four sequential computations, each corresponding to one layer of the network.
Step 1: Encoder Hidden Layer Computation: The first step compresses the 2D input into the hidden representation using a linear transformation followed by the tanh activation:
z^(1) = W^(1) x_1 + b^(1) = [ 0.3  0.4 ] [ 1.00  3.00 ]^T + 0.1 = 1.6
a^(1) = tanh( z^(1) ) ≈ 0.9217
Here, z ( 1 ) represents the pre-activation value, and a ( 1 ) is the activated output. The tanh function squashes this value to the range ( 1 , 1 ) , introducing non-linearity that allows the network to learn complex mappings.
Step 2: Bottleneck Layer Computation: This step produces the one-dimensional latent representation, which serves as the compressed code for the input:
a ( 2 ) = z ( 2 ) = W ( 2 ) a ( 1 ) + b ( 2 ) = 0.5 × 0.9217 + 0.2 = 0.6609
where the bottleneck layer uses linear activation, meaning a ( 2 ) = z ( 2 ) . This single value (0.6609) constitutes the compressed representation of our original 2D input, analogous to the projected value in PCA but learned through network parameters.
Step 3: Decoder Hidden Layer Computation: The decoder begins reconstructing the original input from the latent code:
z ( 3 ) = W ( 3 ) a ( 2 ) + b ( 3 ) = 0.6 × 0.6609 + 0.3 = 0.6965 , a ( 3 ) = tanh ( z ( 3 ) ) 0.6029
Step 4: Output Layer Computation: The final step produces the reconstructed output, which should ideally match the original input:
z^(4) = a^(4) = x̂_1 = W^(4) a^(3) + b^(4) = [ 0.7  0.8 ]^T × 0.6029 + [ 0.4  0.5 ]^T = [ 0.8220  0.9823 ]^T
where the output x ^ 1 represents the network’s attempt to reconstruct the original input x 1 from the compressed latent representation. From the above equation, it is clear that we are using the linear activation function in this layer.
Step 5: Loss Computation: To quantify how well the AE has reconstructed the input, we compute the MSE loss as follows:
L = (1/2) ‖ x_1 - x̂_1 ‖^2 = (1/2) [ (0.1780)^2 + (2.0177)^2 ] ≈ 2.0514
The factor 1/2 simplifies gradient computation during backpropagation. The MSE loss of approximately 2.0514 quantifies the discrepancy between input and reconstruction, analogous to the reconstruction error in PCA but computed differently.
For reference, the complete forward transformation—from input to reconstruction—can be expressed compactly as follows:
x_1 → z^(1) = W^(1) x_1 + b^(1) → a^(1) = tanh( z^(1) ) → z^(2) = W^(2) a^(1) + b^(2) = a^(2) → z^(3) = W^(3) a^(2) + b^(3) → a^(3) = tanh( z^(3) ) → z^(4) = W^(4) a^(3) + b^(4) → x̂_1 = z^(4)
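The forward pass above can be reproduced with the NumPy sketch below (variable names are ours; small rounding differences relative to the hand calculation are expected):

```python
import numpy as np

x1 = np.array([1.0, 3.0])
W1, b1 = np.array([0.3, 0.4]), 0.1                   # encoder hidden layer
W2, b2 = 0.5, 0.2                                    # bottleneck (linear)
W3, b3 = 0.6, 0.3                                    # decoder hidden layer
W4, b4 = np.array([0.7, 0.8]), np.array([0.4, 0.5])  # output layer (linear)

a1 = np.tanh(W1 @ x1 + b1)              # tanh(1.6) ~ 0.9217
a2 = W2 * a1 + b2                       # latent code ~ 0.6609
a3 = np.tanh(W3 * a2 + b3)              # ~ 0.60
x_hat = W4 * a3 + b4                    # reconstruction ~ [0.822, 0.982]
loss = 0.5 * np.sum((x1 - x_hat) ** 2)  # ~ 2.05
```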

4.4.2. Backpropagation: Gradient Computation

Backpropagation computes how each parameter contributed to the reconstruction error, allowing us to update the network to reduce future errors. This process applies the chain rule of calculus backward through the network, starting from the output layer. We denote the sensitivity of the loss L with respect to the pre-activation of layer l as follows:
δ^(l) ≡ ∂L/∂z^(l) .
We update all parameters using gradient descent with learning rate η = 0.01 . Each parameter moves in the direction opposite to its gradient, proportionally to the learning rate. For any layer with affine mapping
z^(l) = W^(l) a^(l-1) + b^(l)   ( where a^(0) ≡ x_1 for l = 1 ),
the gradients with respect to the weights and biases are
∂L/∂W^(l) = δ^(l) ( a^(l-1) )^T ,   ∂L/∂b^(l) = δ^(l) .
Applying this to each layer,
∂L/∂W^(4) = δ^(4) ( a^(3) )^T ,   ∂L/∂b^(4) = δ^(4) ,
∂L/∂W^(3) = δ^(3) ( a^(2) )^T ,   ∂L/∂b^(3) = δ^(3) ,
∂L/∂W^(2) = δ^(2) ( a^(1) )^T ,   ∂L/∂b^(2) = δ^(2) ,
∂L/∂W^(1) = δ^(1) x_1^T ,   ∂L/∂b^(1) = δ^(1) .
We now follow this chain step by step, using the numerical values already computed in the forward pass.
Output Layer (Layer 4).
Forward:  x ^ 1 = z ( 4 ) (linear activation).
The gradient of the loss with respect to the output is the reconstruction error itself:
δ^(4) = ∂L/∂z^(4) = x̂_1 - x_1 = [ -0.1780  -2.0177 ]^T .
Using this, we compute the gradients for W ( 4 ) and b ( 4 ) :
∂L/∂W^(4) = δ^(4) ( a^(3) )^T = [ -0.1780  -2.0177 ]^T × 0.6029 = [ -0.1073  -1.2165 ]^T
∂L/∂b^(4) = δ^(4) = [ -0.1780  -2.0177 ]^T .
Decoder Hidden Layer (Layer 3).
Forward:  z ( 4 ) = W ( 4 ) a ( 3 ) + b ( 4 ) . Thus,
∂L/∂a^(3) = ( W^(4) )^T δ^(4) = [ 0.7  0.8 ] [ -0.1780  -2.0177 ]^T = -1.7388 .
The activation function is a^(3) = tanh( z^(3) ), whose derivative is
tanh′( z^(3) ) = 1 - tanh^2( z^(3) ) = 1 - ( a^(3) )^2   ( element-wise ).
With a^(3) = 0.6029, we obtain tanh′( z^(3) ) = 1 - 0.6029^2 = 0.6365.
Therefore,
δ^(3) = ∂L/∂z^(3) = ∂L/∂a^(3) · tanh′( z^(3) ) = ( -1.7388 ) × 0.6365 ≈ -1.1065 .
From this, we obtain the parameter gradients:
∂L/∂W^(3) = δ^(3) ( a^(2) )^T = -1.1065 × 0.6609 ≈ -0.7310 ,   ∂L/∂b^(3) = δ^(3) = -1.1065 .
Similarly, all the other gradients of the other earlier layers will be calculated. With all gradients computed, we now have a complete picture of how each parameter contributes to the reconstruction error.
From the above calculations, the output layer parameters update will be as follows:
W_new^(4) = W^(4) - η ∂L/∂W^(4) = [ 0.7  0.8 ]^T - 0.01 × [ -0.1073  -1.2165 ]^T = [ 0.7011  0.8122 ]^T
b_new^(4) = b^(4) - η ∂L/∂b^(4) = [ 0.4  0.5 ]^T - 0.01 × [ -0.1780  -2.0177 ]^T = [ 0.4018  0.5202 ]^T
Similarly, the gradients of the earlier layers yield the updated parameters W^(3) ≈ 0.6073, b^(3) ≈ 0.3111, W^(2) ≈ 0.5061, b^(2) ≈ 0.2066, W^(1) ≈ [ 0.3005  0.4015 ], and b^(1) ≈ 0.1005. After this single update, all parameters have been adjusted slightly to reduce the reconstruction error for this training sample.
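The entire training step, including the gradients of the earlier layers that were only summarized above, can be verified with the following NumPy sketch (variable names are ours); it reproduces the updated parameters to about three decimal places:

```python
import numpy as np

x = np.array([1.0, 3.0])
W1, b1 = np.array([0.3, 0.4]), 0.1
W2, b2 = 0.5, 0.2
W3, b3 = 0.6, 0.3
W4, b4 = np.array([0.7, 0.8]), np.array([0.4, 0.5])
eta = 0.01                                    # learning rate

# Forward pass
a1 = np.tanh(W1 @ x + b1)
a2 = W2 * a1 + b2                             # linear bottleneck
a3 = np.tanh(W3 * a2 + b3)
x_hat = W4 * a3 + b4                          # linear output

# Backward pass: d{l} holds dL/dz^(l)
d4 = x_hat - x                                # output layer (linear)
d3 = (W4 @ d4) * (1 - a3 ** 2)                # through the decoder tanh
d2 = W3 * d3                                  # bottleneck (linear)
d1 = W2 * d2 * (1 - a1 ** 2)                  # through the encoder tanh

# Gradient-descent updates
W4, b4 = W4 - eta * d4 * a3, b4 - eta * d4
W3, b3 = W3 - eta * d3 * a2, b3 - eta * d3
W2, b2 = W2 - eta * d2 * a1, b2 - eta * d2
W1, b1 = W1 - eta * d1 * x,  b1 - eta * d1
```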

4.5. The Fundamental Connection: When Autoencoders Generalize PCA

The relationship between Autoencoders and PCA is not merely conceptual but mathematically precise under specific conditions. This connection provides critical insight into how neural networks extend and generalize classical linear methods.
Consider an Autoencoder with the following specific characteristics:
1.
Linear Activations: All activation functions are identity functions.
2.
Single Hidden Layer: One bottleneck layer of dimension K < M , where M is the input dimension.
3.
Mean Squared Error: The loss function is the standard MSE: L = (1/2) ‖ x - x̂ ‖^2 .
4.
No Regularization: No weight decay or other regularization terms are applied.
Under these conditions, the Autoencoder learns an optimal solution that spans the same subspace as PCA. Specifically, let the encoder and decoder be defined by weight matrices W ( E ) R K × M and W ( D ) R M × K , with biases b ( E ) and b ( D ) . The network computes
z = W^(E) x + b^(E) ,   x̂ = W^(D) z + b^(D) = W^(D) ( W^(E) x + b^(E) ) + b^(D)
The optimization problem is to minimize the reconstruction error:
min_{W^(E), W^(D), b^(E), b^(D)}  E[ ‖ x - W^(D) W^(E) x - ( W^(D) b^(E) + b^(D) ) ‖^2 ]
To understand the optimal solution, it is helpful to reframe the problem. Define the combined transformation matrix C = W^(D) W^(E) ∈ R^{M×M} and the combined bias vector d = W^(D) b^(E) + b^(D). The reconstruction becomes x̂ = C x + d, and the objective simplifies to minimizing E[ ‖ x - ( C x + d ) ‖^2 ].
Setting the derivative with respect to d to zero shows that the optimal bias simply centers the data, playing the same role as the mean-centering step in PCA. If we center the data beforehand ( x̃ = x - μ ), the biases can be set to b^(E) = 0 and b^(D) = 0, and the problem reduces to
min_{W^(E), W^(D)}  E[ ‖ x̃ - W^(D) W^(E) x̃ ‖^2 ] .
Here, the product C = W^(D) W^(E) is a matrix of at most rank K, because W^(E) projects from M to K dimensions. The optimization problem is therefore to find the optimal rank-K linear approximation to the centered data. This is precisely the problem solved by PCA. More specifically, at the global optimum, the columns of W^(D) span the same space as the first K principal components of the data covariance matrix. Furthermore, W^(E) = ( W^(D) )^+ (the Moore-Penrose pseudoinverse), making the encoder perform the projection onto this subspace, and the latent representation z contains the first K principal component scores of the centered input. This means that a linear Autoencoder with MSE loss performs exactly the same dimensionality reduction as PCA: it finds the K-dimensional linear subspace that minimizes the squared reconstruction error. This equivalence, first proven by [34], reveals that PCA is actually just a special case of the Autoencoder framework. The only difference is that while PCA gives a unique set of eigenvectors, the Autoencoder might find different weights that span the same subspace, depending on how the training starts [35,36].
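This equivalence can also be observed empirically. The sketch below is our own illustration (the learning rate, initialization scale, and iteration count are arbitrary choices, not from the text); it trains a bias-free linear Autoencoder by gradient descent on centered synthetic data and compares the column space of the learned decoder with the top-K principal subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 5, 2, 500

# Correlated, centered synthetic data (columns are samples).
A = 0.5 * rng.normal(size=(M, M))
X = A @ rng.normal(size=(M, N))
X -= X.mean(axis=1, keepdims=True)

# PCA: orthogonal projector onto the top-K principal subspace.
eigvals, V = np.linalg.eigh(X @ X.T / (N - 1))
Vk = V[:, np.argsort(eigvals)[::-1][:K]]
P_pca = Vk @ Vk.T

# Linear Autoencoder without biases (the data are already centered),
# trained with plain gradient descent on the MSE reconstruction loss.
WE = 0.1 * rng.normal(size=(K, M))
WD = 0.1 * rng.normal(size=(M, K))
lr = 0.02
for _ in range(8000):
    Z = WE @ X
    R = WD @ Z - X                    # reconstruction residual
    WD = WD - lr * (R @ Z.T) / N
    WE = WE - lr * (WD.T @ R @ X.T) / N

# The learned decoder spans (approximately) the principal subspace.
Q, _ = np.linalg.qr(WD)
P_ae = Q @ Q.T
```

At convergence, P_ae and P_pca agree up to optimization error, even though the individual weight matrices need not equal the eigenvectors themselves.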

4.6. Variants of Autoencoders

While the standard Autoencoder (Section 4) provides a flexible framework for unsupervised representation learning, numerous architectural variants have emerged to address specific limitations or application requirements. These variants typically modify one or more aspects of the basic AE: the architecture (e.g., replacing fully connected layers), the loss function (adding regularization terms), or the training objective (e.g., robustness to noise). Each variant tailors the AE to particular data modalities or desired representation properties. Notably, the VAE represents a foundational probabilistic extension that will be thoroughly examined in Section 5. More details and other variants can be found in [37].

4.6.1. Regularized Autoencoders

These versions add an extra “penalty” to the loss function to give the latent space specific shapes.
Sparse Autoencoder (SAE): This version uses a “less is more” approach. It forces most of the neurons in the latent space to stay at zero (inactive). This forces the network to find the most unique and important features that can represent the data with only a few active neurons [38]. This helps the network ignore small noise and focus on the real structure.
Contractive Autoencoder (CAE): This version makes the network “stiff” or robust. It adds a penalty that ensures that if you change the input just a little bit, the latent code does not change much [39].
Laplacian Autoencoder (LAE): This version focuses on “neighborhoods.” It tries to make sure that if two data points were neighbors in the original space, they stay neighbors in the latent space [40].

4.6.2. Architectural Variants for Specific Data Types

Convolutional Autoencoder (CAE): Instead of standard connections, this version uses “filters” (convolutional layers). This is the gold standard for image data because it understands that pixels near each other are related [41].
Bayesian and Diffusion Autoencoders: These are advanced versions that deal with uncertainty and high-quality generation. A Bayesian AE uses probability for its weights, while a Diffusion AE learns by “cleaning” noisy data [42].
Diffusion Autoencoder: This integrates the AE framework with diffusion models, where the encoder maps an input to a latent code, and a diffusion-based decoder learns to reconstruct the input by progressively denoising. The training involves optimizing a diffusion loss alongside reconstruction terms [43].
Other notable variants include Robust Autoencoders (designed to handle noisy inputs), Recurrent Autoencoders (using RNNs/LSTMs for sequential data), and Graph Autoencoders (operating on graph-structured data). Each of these versions shows how we can adapt the simple idea of “compress and reconstruct” to work for almost any kind of data.

5. Variational Autoencoder (VAE)

While Autoencoders excel at learning compressed, non-linear representations, their deterministic nature and unstructured latent space, as discussed in Section 4.2, fundamentally limit their utility as generative models. The Variational Autoencoder, introduced by Kingma and Welling (2013), addresses this core limitation by re-imagining the Autoencoder within a probabilistic framework [5]. Think of it this way: A standard Autoencoder maps an input to a single, fixed point in the latent space. The VAE’s “Variational” twist is that it maps an input to a fuzzy “region” or a probability distribution over that space.
The VAE accomplishes this by combining three foundational principles: (1) probabilistic models that define a generative process, (2) variational inference for approximating complex math, and (3) deep neural networks for learning and parameterizing complex, non-linear relationships. The VAE is not just a simple encoder–decoder architecture; it is a deep latent variable model designed to learn the underlying probability distribution p ( x ) of the data.

5.1. Architectural Components: A Probabilistic Reinterpretation

The VAE retains the encoder–decoder structure but redefines their functions. We represent all learnable parameters (weights and biases) as Θ = { ϕ , θ } , the encoder parameters as ϕ and the decoder parameters as θ .
1.
Probabilistic Encoder (Recognition Model): The encoder is a neural network parameterized by ϕ . It serves as an approximate variational inference model, meaning that it uses a single set of parameters ϕ to approximate the posterior for any input data point, enabling efficient learning on large datasets. Its input is a data point x . Its outputs are the parameters of the approximate posterior distribution q ϕ ( z | x ) (see Figure 8). For a standard VAE, we assume that this distribution is a multivariate Gaussian with a diagonal covariance matrix. For a latent space of dimension K, the encoder’s output for a given input x consists of two K-dimensional vectors:
μ = f ϕ μ ( x ) , μ R K
log σ 2 = f ϕ σ ( x ) , σ 2 R K
where f ϕ μ and f ϕ σ are the output layers (or branches) of the encoder network, producing the parameters that define the distribution q ϕ ( z | x ) = N ( z ; μ , diag ( σ 2 ) ) . Here, μ contains the means and σ 2 contains the variances for each of the K latent dimensions. We use log-variance for numerical stability. This formulation defines a K-dimensional multivariate Gaussian distribution with a diagonal covariance matrix diag ( σ 2 ) , indicating that each latent dimension z k is modeled as an independent Gaussian with its own mean μ k and variance σ k 2 .
In the VAE’s probabilistic framework, ϕ represents the weights of the encoder (the inference network). It tries to approximate the true, intractable posterior p θ ( z | x ) by varying the parameters of a simpler distribution (a Gaussian) until it is as close as possible—this is the essence of variational inference.
For example, if K = 2 , the encoder for an input x 1 outputs μ 1 = [ μ 1 , 1 , μ 1 , 2 ] and σ 1 2 = [ σ 1 , 1 2 , σ 1 , 2 2 ] . This means the latent code z 1 is assumed to be drawn from a distribution where the first dimension has mean μ 1 , 1 and variance σ 1 , 1 2 , and the second dimension has mean μ 1 , 2 and variance σ 1 , 2 2 , with the dimensions being independent.
2.
Latent Space Sampling: To obtain a latent vector for reconstruction or generation, we sample from the inferred distribution: z ∼ q ϕ ( z | x ) . However, sampling is a random operation whose derivative is undefined; it “blocks gradient flow,” meaning that we cannot calculate gradients with respect to the encoder parameters ϕ through this operation, which prevents training via backpropagation.
The reparameterization trick provides an elegant solution. Instead of sampling z directly, we express it as a deterministic, differentiable function of the encoder’s outputs and an auxiliary noise variable. We first sample a noise vector ϵ N ( 0 , I ) from a standard normal distribution. Then, we compute the latent code as follows:
z = μ + σ ϵ ,
where ⊙ denotes element-wise multiplication and σ = √ ( σ 2 ) . Crucially, all randomness is now contained in ϵ , which is independent of ϕ . The path from μ and σ (which depend on ϕ ) to z is now fully differentiable, allowing gradients to flow back to the encoder.
Continuing the K = 2 example, we sample ϵ = [ ϵ 1 , ϵ 2 ] from N ( 0 , I ) and compute z 1 = [ μ 1 , 1 + σ 1 , 1 · ϵ 1 , μ 1 , 2 + σ 1 , 2 · ϵ 2 ] . The distribution q ϕ ( z | x 1 ) describes the probability density over all possible z given x 1 ; z 1 is one specific sample from this distribution.
3.
Probabilistic Decoder (Generative Model): The decoder, parameterized by θ , is reinterpreted as defining a likelihood distribution p θ ( x | z ) over the input space. This represents the conditional likelihood in the Bayesian network. In the probabilistic framework, θ represents the weights of the decoder (the generative network). Its input is a latent sample z (obtained via the reparameterization trick during training, or from the prior p ( z ) = N ( 0 , I ) during generation). The decoder neural network typically outputs the parameters of a distribution in data space. For real-valued data (e.g., images normalized to [0, 1]), a common choice is a Gaussian distribution with identity covariance, where the decoder’s output g θ ( z ) is interpreted as the mean of the reconstructed data:
p θ ( x | z ) = N x ; g θ ( z ) , I .
Thus, for our sampled z 1 , the decoder outputs g θ ( z 1 ) , which is the mean of the Gaussian distribution from which the reconstructed data x ^ 1 is likely drawn. The decoder learns to map any latent point z to a distribution over possible data points x .
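As a minimal sketch of these three components, the snippet below implements the encoder branches, the reparameterization step, and the decoder as single linear maps in numpy. The shapes, weights, and function names are illustrative assumptions, not the networks used in this paper; real VAEs use deep non-linear networks for both maps.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, b_mu, W_lv, b_lv):
    """Probabilistic encoder: maps x to the parameters (mu, log sigma^2)
    of the diagonal-Gaussian approximate posterior q_phi(z|x)."""
    return W_mu @ x + b_mu, W_lv @ x + b_lv

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); differentiable in (mu, sigma)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z, W_dec, b_dec):
    """Decoder mean g_theta(z) of the Gaussian likelihood p_theta(x|z)."""
    return W_dec @ z + b_dec

# Toy sizes and random weights (illustrative only): 4-dimensional data, K = 2.
x = np.array([1.0, 3.0, 0.5, -1.0])
W_mu, b_mu = rng.normal(size=(2, 4)), np.zeros(2)
W_lv, b_lv = rng.normal(size=(2, 4)), np.zeros(2)
W_dec, b_dec = rng.normal(size=(4, 2)), np.zeros(4)

mu, logvar = encode(x, W_mu, b_mu, W_lv, b_lv)
z = reparameterize(mu, logvar, rng)   # one sample from q_phi(z|x)
x_hat = decode(z, W_dec, b_dec)       # mean of p_theta(x|z)
print(mu.shape, z.shape, x_hat.shape)
```

Note that each call to reparameterize draws a different z for the same input, while mu and logvar stay fixed: the input is mapped to a distribution, not a point.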
To train this probabilistic model, we need an objective function that serves two distinct yet complementary purposes. First, the decoder must learn to accurately reconstruct the input data from the latent code, similar to a standard autoencoder. Second, the encoder’s learned posterior distributions must be regularized to match a simple prior distribution, typically chosen as a standard normal distribution p ( z ) = N ( 0 , I ) . This prior acts as a template, encouraging the latent codes for all data points to be distributed in a well-behaved, continuous region. Without this, the encoder might learn to map different inputs to disjoint, tightly clustered latent distributions, creating “holes” in the latent space from which the decoder cannot generate coherent outputs. To clarify these important terms:
  • The prior  p ( z ) is what we believe about the latent space before seeing any data: a “clean map” (standard normal).
  • The true posterior  p θ ( z | x ) is what we would infer after seeing a specific input, but it is mathematically intractable to compute.
  • The approximate posterior  q ϕ ( z | x ) is what the encoder actually produces: a “messy, specific region” of the map for that input.
The training goal is to make this approximation both accurate (close to the true posterior) and well-structured (aligned with the prior).

5.2. The Core Challenge and the Variational Inference Solution

At its core, training a VAE means teaching it to model the probability distribution of our data. For any given data point x (like an image), we want our model parameters θ —which primarily govern the decoder—to assign a high probability to that data. This probability is expressed as p θ ( x ) , and our goal is to maximize it across our entire dataset.
The challenge comes from how we calculate p θ ( x ) . Since x is generated from the latent variable z , we must consider all possible z that could have produced it. Mathematically, we integrate over the latent space:
p θ ( x ) = ∫ p θ ( x , z ) d z = ∫ p θ ( x | z ) p ( z ) d z .
This is the fundamental problem: this integral is intractable—meaning that it is impossible to compute exactly for complex models like neural networks. This is because of the following:
1.
We need to consider all possible z in a high-dimensional space.
2.
Only a tiny fraction of z values produce a plausible x , making random sampling hopelessly inefficient.
3.
We cannot calculate the true posterior p θ ( z | x ) = p θ ( x | z ) p ( z ) / p θ ( x ) , because it requires the very term p θ ( x ) that we are trying to compute.
The term “Variational” refers to Variational Inference. In statistics, when a distribution is too complex to calculate, we pick a family of simpler distributions (like Gaussians) and “vary” their parameters (mean and variance) until they look as much like the complex one as possible. The VAE uses a neural network to do that “varying” automatically. In VAE, instead of computing the intractable p θ ( x ) (complex distribution) directly, we approximate the true posterior p θ ( z | x ) with a simpler, parameterized distribution q ϕ ( z | x ) —our encoder. We then construct a lower bound on log p θ ( x ) that we can actually compute and maximize.
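This “varying” can be made concrete with a toy 1-D example: starting from N ( 0 , 1 ) , we adjust μ and σ by gradient descent on the closed-form KL divergence to a target Gaussian. The target N ( 2 , 0.5 2 ) , step size, and iteration count are arbitrary illustrative choices (with a Gaussian target the fit becomes exact, which makes the mechanics easy to see).

```python
import math

# Target distribution, here itself a Gaussian N(2, 0.5^2); m_t and s_t
# are illustrative values, not quantities from the paper.
m_t, s_t = 2.0, 0.5

def kl(mu, sigma):
    """KL( N(mu, sigma^2) || N(m_t, s_t^2) ), closed form for 1-D Gaussians."""
    return math.log(s_t / sigma) + (sigma**2 + (mu - m_t)**2) / (2 * s_t**2) - 0.5

mu, sigma, lr = 0.0, 1.0, 0.05
for _ in range(2000):
    # Analytic gradients of the KL with respect to mu and sigma.
    g_mu = (mu - m_t) / s_t**2
    g_sigma = -1.0 / sigma + sigma / s_t**2
    mu, sigma = mu - lr * g_mu, sigma - lr * g_sigma

print(round(mu, 3), round(sigma, 3), round(kl(mu, sigma), 6))
```

A VAE does exactly this "varying", except that a neural network predicts μ and σ for every input x at once, instead of optimizing them separately per distribution.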
Here’s how we derive this bound step by step:
1.
We work with log p θ ( x ) instead of p θ ( x ) because probabilities multiply but log-probabilities add, making the math tractable and preventing numerical underflow.
2.
We introduce the expectation under q ϕ because we need to bring the encoder into the equation to connect it to the parameters we want to optimize ( ϕ ).
log p θ ( x ) = E q ϕ ( z | x ) [ log p θ ( x ) ] .
Since log p θ ( x ) is constant with respect to z , taking its expectation does not change its value.
3.
Expand using Bayes’ rule: We rewrite p θ ( x ) using the joint probability and the posterior:
log p θ ( x ) = E q ϕ ( z | x ) [ log ( p θ ( x , z ) / p θ ( z | x ) ) ] .
4.
Introduce our encoder q ϕ strategically: We multiply and divide by q ϕ ( z | x ) inside the expectation to prepare for splitting the expression:
log p θ ( x ) = E q ϕ ( z | x ) [ log ( p θ ( x , z ) / q ϕ ( z | x ) · q ϕ ( z | x ) / p θ ( z | x ) ) ] .
5.
Split the logarithm and recognize the KL divergence: Using the property log ( a b ) = log a + log b , we get:
log p θ ( x ) = E q ϕ ( z | x ) [ log ( p θ ( x , z ) / q ϕ ( z | x ) ) ] + D K L ( q ϕ ( z | x ) ‖ p θ ( z | x ) ) , where the first term is the ELBO L ( θ , ϕ ; x ) and the second term is a KL divergence, which is always ≥ 0 .
Because the KL divergence D KL ( q ϕ ‖ p θ ) is always non-negative (it is 0 only if q ϕ equals p θ exactly), this decomposition implies:
log p θ ( x ) L ( θ , ϕ ; x ) .
L is called the Evidence Lower BOund (ELBO). We can rewrite L ( θ , ϕ ; x ) by expanding the joint probability p θ ( x , z ) = p θ ( x | z ) p ( z ) :
L ( θ , ϕ ; x ) = E q ϕ ( z | x ) [ log p θ ( x | z ) + log p ( z ) − log q ϕ ( z | x ) ] .
Separating the terms gives us:
L ( θ , ϕ ; x ) = E q ϕ ( z | x ) [ log p θ ( x | z ) ] + E q ϕ ( z | x ) [ log p ( z ) − log q ϕ ( z | x ) ] .
Notice that the second term is exactly the negative KL divergence:
E q ϕ ( z | x ) [ log p ( z ) − log q ϕ ( z | x ) ] = − D K L ( q ϕ ( z | x ) ‖ p ( z ) ) .
Therefore, we arrive at the final, interpretable form of the ELBO:
L ( θ , ϕ ; x ) = E q ϕ ( z | x ) [ log p θ ( x | z ) ] − D K L ( q ϕ ( z | x ) ‖ p ( z ) ) = L E L B O ( x ; ϕ , θ ) , where the expectation is the Reconstruction Term and the KL divergence is the Regularization Term.
By maximizing this lower bound L , we achieve two objectives:
  • Increasing the Likelihood: We push up the value of log p θ ( x ) , which is our original goal.
  • Aligning Distributions: We simultaneously minimize the difference between our encoder q ϕ ( z | x ) and the true posterior distribution.
The ELBO naturally decomposes into the two components that we intuitively need:
  • Reconstruction Loss:  E q ϕ ( z | x ) log p θ ( x | z ) , which for a Gaussian decoder becomes the negative MSE (equivalent to maximizing log-likelihood).
  • Regularization Loss:  D KL ( q ϕ ( z | x ) ‖ p ( z ) ) . This KL divergence has a closed-form solution when both distributions are Gaussian, penalizing deviations of the encoder’s output from the standard normal prior N ( 0 , I ) .
Training a VAE is a balancing act between these two goals. The reconstruction loss asks “Does the output look like the input?” The regularization loss (KL divergence) forces the latent space distributions to look like a standard normal distribution. Without this regularizer, the network would give every input its own isolated “island” in latent space, preventing smooth interpolation and generation of new data.

5.3. The Variational Lower Bound (ELBO)

The training procedure aims to maximize the ELBO, thereby approximately maximizing the data log-likelihood. For a dataset of N points, we sum the individual bounds:
max ϕ , θ ∑ i = 1 N L E L B O ( x i ; ϕ , θ ) .
In practice, we equivalently minimize the negative ELBO, L VAE ( ϕ , θ ; x ) = − L E L B O ( x ; ϕ , θ ) :
L VAE ( ϕ , θ ; x ) = − E q ϕ ( z | x ) [ log p θ ( x | z ) ] + D K L ( q ϕ ( z | x ) ‖ p ( z ) ) ,
which serves as our loss function for gradient descent. Let’s examine its two critical components:
  • Reconstruction Term ( E q ϕ ( z | x ) log p θ ( x | z ) ): This term measures how well the model can reconstruct the input x from a latent code z sampled from the encoder’s posterior. It is the expected log-likelihood of x under the decoder’s distribution. Maximizing this term forces the decoder to produce outputs g θ ( z ) that are likely given the true data. For a Gaussian decoder p θ ( x | z ) = N ( x ; g θ ( z ) , I ) , maximizing the log-likelihood is equivalent to minimizing the MSE between the original input x and the reconstruction x ^ = g θ ( z ) . The expectation is approximated during training using the reparameterized latent sample z = μ + σ ϵ .
  • Regularization Term ( D K L ( q ϕ ( z | x ) ‖ p ( z ) ) ): This is the Kullback–Leibler (KL) divergence between the encoder’s posterior q ϕ ( z | x ) and the prior p ( z ) = N ( 0 , I ) . It acts as a regularizer on the latent space. Minimizing this term (as we subtract it in the ELBO) pushes the distribution q ϕ ( z | x ) for each data point to be close to the standard normal prior. This has several crucial effects: it prevents the posterior distributions from collapsing into isolated point masses (overfitting), encourages smoothness and continuity in the latent space, and ensures that, for generation, sampling a random z ∼ p ( z ) will land in a region of the latent space that the decoder understands. The KL divergence for Gaussian distributions has a convenient closed-form expression, making computation efficient.
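Both terms can be sketched in a few lines, assuming a Gaussian decoder (so the reconstruction term reduces to a squared error, up to constants) and the closed-form Gaussian KL; the function name vae_loss is an illustrative choice.

```python
import numpy as np

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO for a Gaussian decoder and an N(0, I) prior:
    squared-error reconstruction term plus the closed-form Gaussian KL."""
    recon = 0.5 * np.sum((x - x_hat) ** 2)  # -log p_theta(x|z) up to constants
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1.0)
    return recon + kl

x = np.array([1.0, 3.0])

# A posterior that already matches the prior (mu = 0, log sigma^2 = 0),
# together with a perfect reconstruction, incurs zero loss; any mismatch
# in either term increases it.
print(vae_loss(x, x, np.zeros(1), np.zeros(1)))
print(vae_loss(x, x + 0.1, np.array([0.5]), np.array([0.2])))
```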
Therefore, the total VAE loss for a single data point is computed and backpropagated as follows:
1.
The input x 1 is passed through the encoder to get μ 1 and σ 1 2 .
2.
A latent code is sampled via reparameterization:
z 1 = μ 1 + σ 1 ϵ .
3.
The decoder maps z 1 to the reconstruction parameters (e.g., the mean x ^ 1 = g θ ( z 1 ) ).
4.
The loss L VAE is computed as follows:
L VAE = MSE ( x 1 , x ^ 1 ) + 1 2 ∑ k = 1 K ( μ 1 , k 2 + σ 1 , k 2 − log ( σ 1 , k 2 ) − 1 ) , where the first term is the Reconstruction Loss and the sum is the KL Divergence in closed form.
As we see above, in practice, the expectation is approximated using a single sample of z , and the KL divergence for a Gaussian prior and posterior has a convenient closed-form expression.
The total loss is backpropagated to update both the encoder parameters ( ϕ ) and decoder parameters ( θ ). Gradients of this loss with respect to these parameters are calculated and used for model updates.
In summary, the VAE employs variational inference to transform an intractable maximum likelihood problem into the optimization of the ELBO (Evidence Lower Bound). This differentiable objective balances the dual goals of accurate data reconstruction and a structured, generative latent space, enabling the VAE to learn a powerful and useful model of the data distribution.

5.4. From Deterministic to Probabilistic Latent Spaces

The critical conceptual leap from the AE to the VAE is the treatment of the latent space. In a standard AE, the encoder outputs a single, deterministic latent vector z for each input x . In the VAE, the encoder maps an input x to a probability distribution over the latent space, typically modeled as a multivariate Gaussian characterized by a mean vector μ and a variance vector σ 2 .
This probabilistic formulation has two powerful consequences. First, it acts as a regularization mechanism: the loss function includes the KL divergence term that pushes the learned distribution q ϕ ( z | x ) toward a standard normal distribution N ( 0 , I ) . This constraint creates a continuous, well-structured latent space where interpolation between points corresponds to meaningful changes in the decoded output. Second, by sampling from this prior and passing the sample through the trained decoder, the VAE can generate novel, realistic data points that resemble the training data.

5.5. Key Differences from the Standard Autoencoder

  • Latent Representation: The AE produces a deterministic point z . The VAE produces a probabilistic distribution q ϕ ( z | x ) , from which z is sampled.
  • Training Objective: The AE minimizes a simple reconstruction loss. The VAE maximizes the ELBO, which balances reconstruction accuracy and latent space regularization.
  • Latent Space Structure: The AE’s latent space is unconstrained and can be discontinuous. The VAE’s latent space is explicitly regularized to be continuous and smooth, converging toward a known prior distribution. This structural difference is what unlocks the VAE’s generative power.
  • Capability: AEs are primarily compression and representation learning models. VAEs are, by design, generative models capable of creating new data samples.

5.6. Numerical Example

Building on the AE example in Section 4.4, we now trace a VAE forward and backward pass using the same input x 1 = [ 1.00 , 3.00 ] ⊤ and a similar architecture. The key difference is that the VAE encoder has two output branches—one for the mean μ and one for the log-variance log σ 2 —and uses the reparameterization trick to sample z . The network architecture of the VAE is:
  • Input Layer: 2 neurons.
  • Encoder Hidden Layer: 1 neuron with tanh activation.
  • Bottleneck Distribution Parameters: (i) Mean ( μ ): 1 linear neuron; (ii) Log-Variance ( log σ 2 ): 1 linear neuron.
  • Sampling Layer: uses the reparameterization trick z = μ + σ ϵ .
  • Decoder Hidden Layer: 1 neuron with tanh activation.
  • Output Layer: 2 linear neurons (reconstruction).
For clarity, we initialize all parameters with simple values. For the encoder pathway (to μ ),
W μ ( 1 ) = [ 0.3 0.4 ] , b μ ( 1 ) = 0.1 , W μ ( 2 ) = 0.5 , b μ ( 2 ) = 0.2
where the encoder pathway (to log σ 2 ) will be:
W σ ( 1 ) = [ 0.25 0.35 ] , b σ ( 1 ) = 0.15 , W σ ( 2 ) = 0.45 , b σ ( 2 ) = 0.25
and the decoder pathway will be
W ( 3 ) = 0.6 , b ( 3 ) = 0.3 , W ( 4 ) = [ 0.7 , 0.8 ] ⊤ , b ( 4 ) = [ 0.4 , 0.5 ] ⊤
Note on Reparameterization: For reproducibility, we fix ϵ = 0.5 (normally ϵ N ( 0 , I ) ).

5.6.1. Forward Propagation

Step 1: Encoder Hidden Layer Computation: The input first passes through the shared encoder hidden layer (identical computation to AE). First, the path to μ will be:
z μ ( 1 ) = W μ ( 1 ) x 1 + b μ ( 1 ) = [ 0.3 , 0.4 ] · [ 1.00 , 3.00 ] ⊤ + 0.1 = 1.6 , a μ ( 1 ) = tanh ( z μ ( 1 ) ) ≈ 0.9217
and the path to log σ 2 will be:
z σ ( 1 ) = W σ ( 1 ) x 1 + b σ ( 1 ) = [ 0.25 , 0.35 ] · [ 1.00 , 3.00 ] ⊤ + 0.15 = 1.45 , a σ ( 1 ) = tanh ( z σ ( 1 ) ) ≈ 0.8957
Step 2: Bottleneck Distribution Parameters: Compute the mean ( μ ) and log-variance ( log σ 2 ), as follows:
z μ ( 2 ) = W μ ( 2 ) a μ ( 1 ) + b μ ( 2 ) = 0.5 × 0.9217 + 0.2 ≈ 0.6609 , so μ = z μ ( 2 ) = 0.6609
z σ ( 2 ) = W σ ( 2 ) a σ ( 1 ) + b σ ( 2 ) = 0.45 × 0.8957 + 0.25 ≈ 0.6531 , so log σ 2 = z σ ( 2 ) = 0.6531
Step 3: Variance and Standard Deviation: Convert log-variance to variance and standard deviation:
σ 2 = exp ( log σ 2 ) = exp ( 0.6531 ) ≈ 1.9215 , σ = √ ( σ 2 ) ≈ 1.3862
Step 4: Latent Code Sampling (Reparameterization Trick): Sample the latent code using the reparameterization trick:
z = μ + σ · ϵ = 0.6609 + 1.3862 × 0.5 = 1.3540
Key Insight: The reparameterization trick allows us to backpropagate through the random sampling by making the randomness an input ( ϵ ) rather than part of the computation graph.
Step 5: Decoder Hidden Layer Computation: The decoder begins reconstructing from the sampled latent code:
z ( 3 ) = W ( 3 ) z + b ( 3 ) = 0.6 × 1.3540 + 0.3 = 1.1124 , a ( 3 ) = tanh ( z ( 3 ) ) ≈ 0.8046
Step 6: Output Layer (Reconstruction): Produce the final reconstruction:
x ^ 1 = a ( 4 ) = z ( 4 ) = W ( 4 ) a ( 3 ) + b ( 4 ) = [ 0.7 , 0.8 ] ⊤ × 0.8046 + [ 0.4 , 0.5 ] ⊤ ≈ [ 0.9632 , 1.1437 ] ⊤
Loss Computation: The VAE has two loss components: reconstruction loss and KL divergence loss. First, the reconstruction loss, as in an AE, measures how well we reconstruct the input:
L recon = 1 2 ‖ x 1 − x ^ 1 ‖ 2 = 1 2 [ ( 0.0368 ) 2 + ( 1.8563 ) 2 ] ≈ 1.7236
and the KL divergence loss measures how much the learned distribution q ( z | x ) = N ( μ , σ 2 ) deviates from the prior p ( z ) = N ( 0 , I ) :
L K L = 1 2 ∑ k = 1 K ( μ k 2 + σ k 2 − log σ k 2 − 1 ) = 1 2 ( ( 0.6609 ) 2 + 1.9215 − 0.6531 − 1 ) ≈ 0.3526
Then, the total VAE loss will be the weighted sum of both losses (typically with weight β = 1 ):
L total = L recon + L K L = 1.7236 + 0.3526 = 2.0762
Key Insight: The KL divergence term acts as a regularizer, encouraging the learned distribution to be close to the standard normal prior.
For reference, the complete forward transformation—from input to reconstruction—can be expressed compactly as:
x 1 → z μ ( 1 ) = W μ ( 1 ) x 1 + b μ ( 1 ) → a μ ( 1 ) = tanh ( z μ ( 1 ) ) → μ = W μ ( 2 ) a μ ( 1 ) + b μ ( 2 )
x 1 → z σ ( 1 ) = W σ ( 1 ) x 1 + b σ ( 1 ) → a σ ( 1 ) = tanh ( z σ ( 1 ) ) → log σ 2 = W σ ( 2 ) a σ ( 1 ) + b σ ( 2 ) → σ 2 = exp ( log σ 2 ) → σ = √ ( σ 2 )
( μ , σ ) → z = μ + σ ϵ
z → z ( 3 ) = W ( 3 ) z + b ( 3 ) → a ( 3 ) = tanh ( z ( 3 ) ) → z ( 4 ) = W ( 4 ) a ( 3 ) + b ( 4 ) = x ^ 1
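The full forward pass can be reproduced with a short numpy script using exactly the parameter values fixed at the start of this example; the computed quantities agree with the hand calculations to about three decimal places (small differences arise because the text rounds intermediate values).

```python
import numpy as np

# Parameters fixed at the start of Section 5.6, with eps = 0.5.
x = np.array([1.00, 3.00])
W1_mu, b1_mu, W2_mu, b2_mu = np.array([0.3, 0.4]), 0.1, 0.5, 0.2
W1_s, b1_s, W2_s, b2_s = np.array([0.25, 0.35]), 0.15, 0.45, 0.25
W3, b3 = 0.6, 0.3
W4, b4 = np.array([0.7, 0.8]), np.array([0.4, 0.5])
eps = 0.5

a1_mu = np.tanh(W1_mu @ x + b1_mu)   # Step 1: encoder hidden (mu path)
mu = W2_mu * a1_mu + b2_mu           # Step 2: mean
a1_s = np.tanh(W1_s @ x + b1_s)      # Step 1: encoder hidden (log-var path)
logvar = W2_s * a1_s + b2_s          # Step 2: log-variance
sigma = np.sqrt(np.exp(logvar))      # Step 3: variance -> std
z = mu + sigma * eps                 # Step 4: reparameterization trick
a3 = np.tanh(W3 * z + b3)            # Step 5: decoder hidden
x_hat = W4 * a3 + b4                 # Step 6: reconstruction

recon = 0.5 * np.sum((x - x_hat) ** 2)
kl = 0.5 * (mu**2 + np.exp(logvar) - logvar - 1.0)
print(float(mu), float(z), float(recon + kl))
```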

5.6.2. Backpropagation: Gradient Computation

Before computing gradients, we highlight how they flow through the VAE:
  • Decoder parameters: ( W ( 4 ) , b ( 4 ) , W ( 3 ) , b ( 3 ) ) : Receive gradients only from L recon , not from L KL . Changing decoder weights does not affect μ or σ .
    Therefore, L recon depends on decoder parameters directly, while L KL depends only on encoder parameters ( μ and σ ). As a result:
∂ L total / ∂ W ( 4 ) = ∂ L recon / ∂ W ( 4 ) + ∂ L KL / ∂ W ( 4 ) = ∂ L recon / ∂ W ( 4 ) , since ∂ L KL / ∂ W ( 4 ) = 0
  • Encoder parameters ( W μ , b μ , W σ , b σ ): Receive gradients from both losses:
    -
    From L recon : Through the sampled z via the reparameterization trick,
    -
    From L KL : Directly, since KL depends only on μ and σ .
  • The reparameterization trick enables gradients to flow from the reconstruction loss back to μ and σ .
Step 1: Decoder Gradients (Reconstruction Only). Identical to AE:
∂ L recon / ∂ z ( 4 ) = x ^ 1 − x 1 = [ − 0.0368 , − 1.8563 ] ⊤ , ∂ L recon / ∂ W ( 4 ) = ( ∂ L recon / ∂ z ( 4 ) ) ( a ( 3 ) ) ⊤ = [ − 0.0368 , − 1.8563 ] ⊤ × 0.8046 ≈ [ − 0.0296 , − 1.4936 ] ⊤ , ∂ L recon / ∂ b ( 4 ) = ∂ L recon / ∂ z ( 4 ) = [ − 0.0368 , − 1.8563 ] ⊤
Step 2: Decoder Hidden Layer Gradients (Reconstruction):
∂ L recon / ∂ z ( 3 ) = ( W ( 4 ) ) ⊤ ( ∂ L recon / ∂ z ( 4 ) ) ⊙ ( 1 − ( a ( 3 ) ) 2 ) = ( [ 0.7 , 0.8 ] · [ − 0.0368 , − 1.8563 ] ⊤ ) × ( 1 − 0.8046 2 ) ≈ − 0.5328
Step 3: Gradient Through Sampling ( z ): The latent code z receives its gradient from the reconstruction loss (the KL term acts on μ and σ directly):
∂ L recon / ∂ z = ( W ( 3 ) ) ⊤ ( ∂ L recon / ∂ z ( 3 ) ) = 0.6 × ( − 0.5328 ) ≈ − 0.3197
Gradient for Decoder Hidden Weights  W ( 3 )
∂ L total / ∂ W ( 3 ) = ( ∂ L recon / ∂ z ( 3 ) ) · z = ( − 0.5328 ) × 1.3540 ≈ − 0.7214
Similarly, ∂ L total / ∂ b ( 3 ) = ∂ L recon / ∂ z ( 3 ) = − 0.5328 .
Step 4: Gradients for Distribution Parameters ( μ and log σ 2 ): Gradients flow to μ and σ from both reconstruction (through z ) and KL divergence:
z = μ + σ · ϵ , so ∂ z / ∂ μ = 1
and from KL:
∂ L K L / ∂ μ = μ = 0.6609
then, the total will be:
∂ L total / ∂ μ = ( ∂ L recon / ∂ z ) · ( ∂ z / ∂ μ ) + ∂ L K L / ∂ μ = ( − 0.3197 ) · 1 + 0.6609 = 0.3412
The gradient with respect to σ , from sampling:
z = μ + σ · ϵ , so ∂ z / ∂ σ = ϵ = 0.5
and from KL:
∂ L K L / ∂ σ = σ − 1 / σ = 1.3862 − 1 / 1.3862 ≈ 0.6649
then, the total:
∂ L total / ∂ σ = ( ∂ L recon / ∂ z ) · ( ∂ z / ∂ σ ) + ∂ L K L / ∂ σ = ( − 0.3197 ) · 0.5 + 0.6649 ≈ 0.5050
For the gradient with respect to log σ 2 , we need:
∂ σ / ∂ log σ 2 = σ / 2 = 1.3862 / 2 = 0.6931 , ∂ L total / ∂ log σ 2 = ( ∂ L total / ∂ σ ) · ( ∂ σ / ∂ log σ 2 ) = 0.5050 × 0.6931 ≈ 0.3500
Step 5: Encoder Parameter Gradients: For the μ pathway:
∂ L total / ∂ z μ ( 2 ) = ∂ L total / ∂ μ = 0.3412
then,
∂ L total / ∂ W μ ( 2 ) = ( ∂ L total / ∂ z μ ( 2 ) ) a μ ( 1 ) = 0.3412 × 0.9217 ≈ 0.3145 , ∂ L total / ∂ b μ ( 2 ) = ∂ L total / ∂ z μ ( 2 ) = 0.3412
Back-propagation through μ encoder hidden layer will be:
∂ L total / ∂ a μ ( 1 ) = ( W μ ( 2 ) ) ⊤ ( ∂ L total / ∂ z μ ( 2 ) ) = 0.5 × 0.3412 = 0.1706 , ∂ L total / ∂ z μ ( 1 ) = ( ∂ L total / ∂ a μ ( 1 ) ) · ( 1 − ( a μ ( 1 ) ) 2 ) = 0.1706 × ( 1 − 0.9217 2 ) ≈ 0.0257
For log σ 2 pathway:
∂ L total / ∂ z σ ( 2 ) = ∂ L total / ∂ log σ 2 = 0.3500
then,
∂ L total / ∂ W σ ( 2 ) = ( ∂ L total / ∂ z σ ( 2 ) ) a σ ( 1 ) = 0.3500 × 0.8957 ≈ 0.3135 , ∂ L total / ∂ b σ ( 2 ) = ∂ L total / ∂ z σ ( 2 ) = 0.3500
Backpropagation through log variance encoder hidden layer:
∂ L total / ∂ a σ ( 1 ) = ( W σ ( 2 ) ) ⊤ ( ∂ L total / ∂ z σ ( 2 ) ) = 0.45 × 0.3500 ≈ 0.1575 , ∂ L total / ∂ z σ ( 1 ) = ( ∂ L total / ∂ a σ ( 1 ) ) · ( 1 − ( a σ ( 1 ) ) 2 ) = 0.1575 × ( 1 − 0.8957 2 ) ≈ 0.0311
Step 6: First Layer Parameter Gradients: The first layer receives gradients from both μ and log σ 2 pathways. For W μ ( 1 ) and b μ ( 1 ) :
∂ L total / ∂ W μ ( 1 ) = ( ∂ L total / ∂ z μ ( 1 ) ) x 1 ⊤ = 0.0257 × [ 1.00 , 3.00 ] = [ 0.0257 , 0.0771 ] , ∂ L total / ∂ b μ ( 1 ) = ∂ L total / ∂ z μ ( 1 ) = 0.0257
and for W σ ( 1 ) and b σ ( 1 ) :
∂ L total / ∂ W σ ( 1 ) = ( ∂ L total / ∂ z σ ( 1 ) ) x 1 ⊤ = 0.0311 × [ 1.00 , 3.00 ] = [ 0.0311 , 0.0933 ] , ∂ L total / ∂ b σ ( 1 ) = ∂ L total / ∂ z σ ( 1 ) = 0.0311
Parameter updates via gradient descent: using learning rate η = 0.01 , the decoder updates will be:
W new ( 4 ) = [ 0.7 , 0.8 ] ⊤ − 0.01 × [ − 0.0296 , − 1.4936 ] ⊤ = [ 0.7003 , 0.8149 ] ⊤ , b new ( 4 ) = [ 0.4 , 0.5 ] ⊤ − 0.01 × [ − 0.0368 , − 1.8563 ] ⊤ = [ 0.4004 , 0.5186 ] ⊤ , W new ( 3 ) = 0.6 − 0.01 × ( − 0.7214 ) ≈ 0.6072 , b new ( 3 ) = 0.3 − 0.01 × ( − 0.5328 ) ≈ 0.3053
The encoder parameters (e.g., W μ ( 1 ) , b μ ( 1 ) , W σ ( 1 ) , b σ ( 1 ) , etc.) are updated in the same way using the gradients computed in steps 5 and 6.
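As a sanity check on the hand-derived gradients, we can approximate ∂ L total / ∂ μ and ∂ L total / ∂ σ by central finite differences of the total loss, with ϵ held fixed at 0.5; the results match the chain-rule values 0.3412 and 0.5050 up to rounding of the hand-computed intermediates.

```python
import numpy as np

# Decoder parameters and fixed noise from the Section 5.6 example.
x = np.array([1.00, 3.00])
W3, b3 = 0.6, 0.3
W4, b4 = np.array([0.7, 0.8]), np.array([0.4, 0.5])
eps = 0.5

def total_loss(mu, sigma):
    """Total VAE loss as a function of (mu, sigma), with eps held fixed."""
    z = mu + sigma * eps                     # reparameterization trick
    a3 = np.tanh(W3 * z + b3)                # decoder hidden layer
    x_hat = W4 * a3 + b4                     # reconstruction
    recon = 0.5 * np.sum((x - x_hat) ** 2)
    kl = 0.5 * (mu**2 + sigma**2 - np.log(sigma**2) - 1.0)
    return recon + kl

# Central finite differences around the values computed in the forward pass.
mu, sigma, h = 0.6609, 1.3862, 1e-6
g_mu = (total_loss(mu + h, sigma) - total_loss(mu - h, sigma)) / (2 * h)
g_sigma = (total_loss(mu, sigma + h) - total_loss(mu, sigma - h)) / (2 * h)
print(g_mu, g_sigma)
```

This kind of numerical check is exactly what automatic differentiation frameworks perform internally when verifying custom gradients.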

6. Experimental Results and Discussions

This section presents tests designed to illustrate the practical differences between PCA, AEs, and VAEs. All experiments used the Olivetti Faces dataset [44], chosen for its high-dimensional structure and clear semantic classes. The dataset contains 400 grayscale facial images of 40 distinct individuals (10 images per person). For computational efficiency while preserving essential features, all images were resized to 50 × 50 pixels, resulting in a 2500-dimensional input space, and standardized to zero mean and unit variance per pixel [3]. These experiments show the shift from linear compression to non-linear reconstruction, and finally to probabilistic generation.
The settings for our models were as follows:
1.
PCA: Using the standard version to reduce the data to K dimensions.
2.
AE and VAE: Both used a deep network with four layers in the encoder and four in the decoder.
  • Encoder layers: 2500 → 512 → 256 → 128 → K.
  • Decoder layers: K → 128 → 256 → 512 → 2500.
  • We used ReLU and batch normalization to make training faster and more stable.
  • Bottleneck dimensions: K = 2 or 3 in the first experiment, and K was { 2 , 5 , 10 , 100 } dimensions in the second experiment.
  • Training: Adam optimizer ( η = 0.001 , weight decay = 10 5 ), MSE loss, 300–400 epochs, β = 1.0 to control the KL divergence regularization in the VAE.
All models in both experiments were trained on the entire dataset without a train–test split, in order to focus on reconstruction capabilities and latent space structure.

6.1. First Experiment: Latent Space Visualization

This experiment looks at how each model organizes the data in a 2D or 3D “map” (latent space). This helps us to see whether the models group similar faces together.
Quantitative Reconstruction: Table 1 shows that non-linear models (AE and VAE) reconstruct the faces much better than the linear PCA. The AE has the lowest error because it is purely focused on reconstruction. The VAE has a slightly higher error than the AE because it spends some of its “effort” on organizing the latent space into a smooth Gaussian shape (the KL penalty). As expected, moving from 2D to 3D improves the results for all methods.
Visual Analysis: Figure 9 shows the resulting maps.
  • PCA in Figure 9 (first column): Shows a very “flat” or linear structure.
  • AE in Figure 9 (second column): Learns much more complex, curved patterns, which help it separate different individuals better than PCA.
  • VAE in Figure 9 (third column): Forces the points into a single, circular (Gaussian) cluster. This makes the space “smooth,” which is perfect for generating new faces through interpolation.

6.2. Second Experiment: Data Reconstruction

In this experiment, we tested the models using different latent dimensions: K ∈ { 2 , 5 , 10 , 100 } . This shows how the data can be reconstructed from the compressed latent space with minimal error. The core methodological implementations (PCA, AE, VAE) remained consistent with Experiment 1. The key variation was the evaluation across four latent dimensions: 2, 5, 10, and 100. This design allowed us to observe the following:
  • Performance at extreme compression (2D).
  • The “sweet spot” for balanced compression (5D–10D).
  • Behavior near near-lossless reconstruction (100D).
The experimental workflow comprised three stages: First, data standardization was applied using X scaled = ( X − μ ) / σ . Second, each model was trained with an appropriate methodology: PCA utilized direct eigendecomposition of the covariance matrix, the AE was trained for 300 epochs with the Adam optimizer ( η = 0.001 ), and the VAE was trained for 400 epochs with KL-divergence weighting ( β = 1.0 ). Third, reconstruction performance was evaluated using MSE.
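The PCA branch of this workflow can be sketched as follows. As an illustrative assumption, synthetic Gaussian data stands in for the Olivetti faces (so the matrix sizes and MSE values are not those reported here), and PCA is computed by direct eigendecomposition of the covariance matrix, as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the standardized image matrix (the experiments
# use 400 x 2500 Olivetti faces); sizes here are illustrative only.
X = rng.normal(size=(100, 64))

# Stage 1: per-feature standardization, X_scaled = (X - mu) / sigma.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Stage 2: PCA via direct eigendecomposition of the covariance matrix.
cov = np.cov(X_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]   # sort descending

# Stage 3: reconstruction MSE for several bottleneck sizes K.
mses = {}
for k in (2, 5, 10):
    W = eigvecs[:, :k]              # top-K principal directions
    X_rec = (X_scaled @ W) @ W.T    # project to K dims, then back
    mses[k] = float(np.mean((X_scaled - X_rec) ** 2))
print(mses)
```

As in Table 2, the reconstruction error decreases monotonically as K grows, since each additional component captures more of the total variance.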
Reconstruction Performance: Table 2 and Figure 10 show that all methods get better as we give them more dimensions (neurons in the bottleneck). PCA is consistently the weakest because it is restricted to linear transformations. Interestingly, in this specific dataset, the VAE’s regularization actually helped it outperform the AE at lower dimensions (5D and 10D). At 100D, the AE wins slightly, as the VAE’s KL penalty starts to limit its reconstruction perfectness.
While all three methods can reconstruct faces, the VAE achieved the lowest MSE score and, visually, the faces reconstructed using the VAE appear more natural than those generated by the other two methods (PCA and AE). The 10D space seems to be the ‘sweet spot’ for this dataset, providing a compression rate of 99.6% (reducing the original 50 × 50 = 2500 dimensions to just 10) while still enabling the VAE to reconstruct the faces almost perfectly (MSE = 0.06).

6.3. Discussions

These experiments show a clear path of improvement:
1.
From Linear to Non-linear: PCA is a good baseline, but it is too simple for complex data like faces. The AE uses neural networks to capture curved patterns, greatly improving reconstruction.
2.
From Reconstruction to Generation: While both AEs and VAEs compress and reconstruct data, only the VAE organizes the latent space. By accepting a small “penalty” in reconstruction accuracy, the VAE creates a smooth map that allows us to sample and create brand-new data.
3.
Fundamental Trade-Offs: The experiments illuminate core design tensions: linearity versus expressivity, deterministic versus probabilistic representation, and reconstruction accuracy versus generative capability. The progression from PCA to AE to VAE navigates these tensions, each step introducing new functionality that addresses the limitations of the previous paradigm.
4.
Choosing the Right Tool:
  • PCA is best for quick, simple, and interpretable compression.
  • AE is best for maximum-fidelity compression and reconstruction.
  • VAE is the best choice when you need a structured latent space or want to generate new data.
In summary, the journey from PCA to AE to VAE is a move from simple compression to deep creation. Each step fixes a problem in the previous model, resulting in the powerful generative AI tools that we use today.

7. PCA, AE, and VAE: Similarities and Differences

We have now traveled from the simple math of PCA to the deep learning of AEs and the probability of VAEs. This comparative analysis reveals both the conceptual continuity and the fundamental innovations that define this evolutionary path in representation learning and generative modeling.

7.1. Fundamental Objectives: From Reconstruction to Generation

All three methods try to compress data into a small bottleneck, but they have different reasons for doing so:
  • PCA: Linear Compression—It finds the best straight-line directions to represent the data. Its goal is to summarize and visualize data.
  • Autoencoder: Non-linear Reconstruction—It uses neural networks to learn flexible, curved patterns. Its main goal is to rebuild the data as accurately as possible.
  • VAE: Probabilistic Generation—It models the actual “recipe” (probability distribution) of the data. Its main goal is to create brand-new data that looks like the original.
In short: PCA describes, AEs rebuild, and VAEs generate.

7.2. Mathematical and Architectural Frameworks

  • Representation Learning:
    -
    PCA: Uses a linear projection z = Z PCA x to find the latent z .
    -
    AE: Uses a non-linear encoder z = f ϕ ( x ) and decoder x ^ = g θ ( z ) .
    -
    VAE: Uses a probabilistic encoder q ϕ ( z | x ) = N ( z ; μ ϕ , σ ϕ 2 ) that predicts a mean and variance, and a probabilistic decoder p θ ( x | z ) .
  • How They Learn:
    -
    PCA: Closed-form solution via eigendecomposition/SVD. No iterative training; no parameters in the neural network sense.
    -
    AE: Gradient-based optimization (backpropagation) of reconstruction loss. Parameters = weights and biases of encoder/decoder networks.
    -
    VAE: Gradient-based optimization of the ELBO. Parameters include both network weights and distribution parameters.
  • Loss Functions and Optimization:
    -
    PCA: Minimize L 2 reconstruction error.
    -
    AE: Minimize reconstruction loss.
    -
    VAE: Maximize ELBO.
  • Latent Space Characteristics: The nature of the latent space fundamentally distinguishes these methods, as illustrated in Table 3.

7.3. Parameter Complexity and Model Capacity

As we saw in our numerical examples, the models get more complex as we go:
  • PCA: Deterministic algorithm—no trainable parameters in the conventional sense. Complexity depends on data matrix operations (O(min(N M², N² M)) for N samples of dimension M).
  • AE: Parameter count grows with network architecture. For our example AE with architecture 2–1–1–1–2, this gives 11 parameters. General formula for a fully connected AE:
    P_AE = Σ_{l=1}^{L−1} (d_l × d_{l+1} + d_{l+1})
    where d_l is the dimension of layer l and L is the number of layers.
  • VAE: Similar parameter count to an AE of the same architecture, plus distribution parameters. For our example VAE with separate μ and log σ 2 pathways, approximately double the encoder parameters of a comparable AE. In practice, VAEs often share most encoder layers and only branch at the end, minimizing this overhead.
Key Insight: While PCA has zero “learnable parameters,” its solution is mathematically optimal for linear tasks. AEs and VAEs trade this optimality for flexibility, learning parameters that enable non-linear transformations at the cost of non-convex optimization.
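The parameter-count formula above translates directly into code; the helper below (the name `ae_param_count` is ours, purely illustrative) sums weights and biases over consecutive layer pairs and reproduces the 11-parameter count for the example architecture.

```python
def ae_param_count(dims):
    # For each consecutive layer pair: d_l * d_{l+1} weights plus d_{l+1} biases
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# The 2-1-1-1-2 example architecture from the text:
n_ae = ae_param_count([2, 1, 1, 1, 2])   # → 11
```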

7.4. Generative Capabilities

The following is the most important difference in how they are used:
  • PCA: Not generative. You can rebuild what you have, but you cannot easily create something new and realistic.
  • Standard AE: Not inherently generative. Because the latent space is messy and has “gaps,” picking a random point usually results in a blurry or broken image.
  • VAE: Explicitly generative. This is what it was built for. By forcing the latent space into a smooth Gaussian shape, the VAE ensures that almost any point you pick will decode into a realistic, new face or image. This turns a “compression tool” into a “generative tool”.
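Mechanically, VAE generation is just "sample from the prior, then decode". The sketch below uses an untrained toy decoder with random weights, so its outputs are meaningless; it only illustrates the sampling pipeline that a trained VAE would use.

```python
import numpy as np

rng = np.random.default_rng(42)

# An untrained toy decoder: 1-D latent -> 2-D "data"; weights are arbitrary.
W1, b1 = rng.standard_normal((3, 1)), np.zeros(3)
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)

def decode(z):
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2

# VAE generation: sample z from the prior N(0, I) and decode it.
z = rng.standard_normal(1)
x_new = decode(z)    # a brand-new sample (meaningful only after training)
```

Because the KL penalty keeps the encoder's posteriors close to this same prior during training, points drawn from N(0, I) land in regions the decoder has actually learned to handle.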

8. Conclusions

In this tutorial, we followed a clear path from Principal Component Analysis (PCA) to Variational Autoencoders (VAEs). We have shown how these models are part of the same family, with each one building on the last.
We began with PCA, a simple and elegant method that works in straight lines. While it is great for summarizing data, its linearity limits its power. Autoencoders (AEs) took the next step by adding neural networks, allowing us to capture complex and curved patterns. However, AEs are mostly meant for rebuilding what they have already seen. They lack the organized structure needed to create brand-new samples. The VAE solved this by adding the language of probability. By using the reparameterization trick and the ELBO loss, VAEs transformed the Autoencoder into a true generative model. The KL divergence penalty acts like a “manager” for the latent space, keeping it smooth and organized so that we can easily create new, realistic data. This progression illustrates a fundamental principle in machine learning: complex modern methods often extend simpler, well-understood foundations. PCA serves as the linear baseline, AEs introduce non-linear flexibility, and VAEs add probabilistic semantics. Each step addresses limitations of the previous approach while maintaining conceptual continuity.
For researchers and students, this path offers a guide on which tool to use:
  • Use PCA for simple, fast, and easy-to-understand data summaries.
  • Use AEs when you need to compress and rebuild complex data with high accuracy.
  • Use VAEs when you want to generate new data or have a very organized latent space.
Looking forward, the journey continues to even more advanced AI models—like diffusion models used in modern image generators. However, the foundational principles that we have covered here remain the same. By building intuition through this step-by-step path, we hope to have made deep generative modeling easy to understand for everyone.

Author Contributions

Conceptualization, A.T. and M.M.E.; methodology, A.T.; formal analysis, M.M.E.; investigation, A.T. and M.M.E.; resources, A.T.; writing—original draft preparation, A.T. and M.M.E.; writing—review and editing, A.T. and M.M.E.; visualization, A.T.; supervision, A.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was conducted within the framework of the project “SAIL: SustAInable Lifecycle of Intelligent SocioTechnical Systems” (grant no. NW21-059B). SAIL is receiving funding from the programme “Netzwerke 2021”, an initiative of the Ministry of Culture and Science of the State of North Rhine-Westphalia. The sole responsibility for the content of this publication lies with the authors.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Non-Linear Activation Functions

Activation functions represent the fundamental non-linear transformations that enable neural networks (NNs) to learn complex, hierarchical representations beyond the capabilities of linear models. The strategic placement of non-linear activations between linear transformations enables NNs to approximate any continuous function. This section covers important theoretical and practical background that explains the limitations and advantages of these activation functions, with a particular focus on their derivatives, which are critical for training.

Appendix A.1. Theoretical Foundations and General Principles

The fundamental purpose of activation functions is to break linearity, enabling deep networks to learn hierarchical features across multiple abstraction levels. In a feed-forward neural network layer, the computation is as follows:
a(l) = σ(z(l)) = σ(W(l) a(l−1) + b(l))
where σ is the activation function, a(l−1) is the output of the previous layer (with a(0) = x for the input layer), and z(l) denotes the pre-activation. Without this non-linear transformation, the entire network would collapse to a single linear transformation, regardless of depth, severely limiting its representational capacity.
The derivative of the activation function, σ′, is arguably as important as the function itself because it directly governs learning during backpropagation. During backpropagation, gradients are computed recursively via the chain rule (see the numerical example in Section 4.4).
The Critical Role of σ′: The term σ′(z(l)) appears in every layer’s gradient computation during backpropagation. Its behavior critically determines training dynamics because it modulates the error signal in every layer. Consequently, the magnitude of this derivative directly controls the flow of gradients and has a profound impact on both the size and stability of weight updates during training.
The practical implications for activation function design are clear:
  • Large Derivatives (|σ′(x)| ≈ 1): Facilitate strong, reliable gradient signals, leading to faster and more effective learning (e.g., ReLU for x > 0).
  • Small Derivatives (|σ′(x)| ≪ 1): Cause the gradient signal to diminish exponentially as it propagates backward through layers (e.g., sigmoid). This phenomenon, known as the vanishing gradient problem, can prevent learning, particularly in earlier layers.
  • Zero Derivatives (σ′(x) = 0): Completely halt the flow of gradients, creating “dead neurons” that cease to learn (e.g., ReLU for x < 0).
This analysis explains why modern activation functions, such as ReLU and its variants, have largely replaced traditional sigmoidal functions in deep networks. This is because they maintain stronger gradient signals across a wider range of inputs, while still providing the necessary non-linearity for complex function approximation. Further details regarding issues with activation functions can be found in the subsequent sections.
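The effect of derivative magnitude on backpropagated signals can be made concrete in a few lines. This is a sketch: the depth-10 figures below are the idealized best case (every pre-activation at the derivative's maximum), not a measurement of a real network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)                      # bounded above by 0.25

def d_relu(x):
    return (np.asarray(x) > 0).astype(float)  # 1 for x > 0, 0 otherwise

# Backpropagating through L layers multiplies L activation derivatives together.
# Even at sigmoid's best case (x = 0, derivative exactly 0.25) the signal decays
# fast, while ReLU's derivative of 1 on active units preserves it:
sigmoid_signal_10 = d_sigmoid(0.0) ** 10      # ~9.5e-7 after 10 layers
relu_signal_10 = d_relu(2.0) ** 10            # stays 1.0 after 10 layers
```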

Appendix A.2. The Vanishing/Exploding Gradient Problem

As shown in the chain rule analysis, the accumulation of activation derivatives is the primary cause of the vanishing or exploding gradient problem, with several detrimental effects:
  • Exponential Gradient Decay: Gradients shrink exponentially with network depth, making deep layers effectively untrainable.
  • Hierarchical Learning Failure: Early layers (responsible for basic feature detection) stop learning while later layers continue adapting.
  • Convergence Slowdown: Training becomes inefficient, requiring more data and iterations.
  • Premature Saturation: Network parameters get stuck in suboptimal configurations.
To ground these concepts, we analyze a simple network with different activations. Parameters: inputs x(0) = [0.8, 0.3], hidden-layer weights W(1) = [0.5, 0.2], output weight W(2) = [0.4], target output y = 0.9, and learning rate η = 0.5. Biases are omitted for simplicity.
The forward pass and loss calculation follow the standard equations. The critical step is backpropagation:
  • Hidden Layer Computation:
    z(1) = W(1) x(0) = 0.5 × 0.8 + 0.2 × 0.3 = 0.46,  a(1) = σ(z(1)) = σ(0.46) = 1/(1 + e^(−0.46)) = 0.613
    where σ is the sigmoid function.
  • Output Layer Computation:
    z(2) = W(2) a(1) = 0.4 × 0.613 = 0.245,  a(2) = σ(z(2)) = σ(0.245) = 1/(1 + e^(−0.245)) = 0.561
  • Loss Function:
    L = ½ (y − a(2))² = ½ (0.9 − 0.561)² = 0.0575
with σ′(z(1)) = σ′(0.46) = 0.613 × (1 − 0.613) = 0.237 and σ′(z(2)) = σ′(0.245) = 0.561 × (1 − 0.561) = 0.246.
To calculate the backpropagation steps, following the chain rule derivation, we compute the following:
  • Output Layer Gradient:
    ∂L/∂W(2) = ∂L/∂a(2) · ∂a(2)/∂z(2) · ∂z(2)/∂W(2) = (a(2) − y) · σ′(z(2)) · a(1) = (0.561 − 0.9) × 0.246 × 0.613 = −0.0511
  • Hidden Layer Gradients:
    ∂L/∂W(1) = ∂L/∂a(2) · ∂a(2)/∂z(2) · ∂z(2)/∂a(1) · ∂a(1)/∂z(1) · ∂z(1)/∂W(1) = (a(2) − y) · σ′(z(2)) · W(2) · σ′(z(1)) · x(0)
    then
    ∂L/∂w11 = (−0.339) × 0.246 × 0.4 × 0.237 × 0.8 = −0.00633,  ∂L/∂w12 = (−0.339) × 0.246 × 0.4 × 0.237 × 0.3 = −0.00237
  • The weight will be updated as follows:
    W(2)_new = 0.4 − 0.5 × (−0.0511) = 0.4256 (Δ = +0.0256),  w11_new = 0.5 − 0.5 × (−0.00633) = 0.50317 (Δ = +0.00317),  w12_new = 0.2 − 0.5 × (−0.00237) = 0.20119 (Δ = +0.00119)
  • Analysis: The Vanishing Gradient Problem
This example clearly illustrates the vanishing gradient problem with sigmoid activations. The gradient computations reveal the critical role of the activation derivative σ′(z(k)):
∂L/∂W(2) = (a(2) − y) [error] · σ′(z(2)) [≤ 0.25] · a(1),  ∂L/∂W(1) = (a(2) − y) [error] · σ′(z(2)) [≤ 0.25] · W(2) · σ′(z(1)) [≤ 0.25] · a(0)
With our numerical values,
∂L/∂W(2) = (a(2) − y) [= −0.339] · σ′(z(2)) [= 0.246] · a(1),  ∂L/∂W(1) = (a(2) − y) [= −0.339] · σ′(z(2)) [= 0.246] · W(2) · σ′(z(1)) [= 0.237] · a(0)
The compounding effect of multiple σ′(z(k)) ≤ 0.25 factors causes the gradients to diminish rapidly in earlier layers, as in the above calculation of ∂L/∂W(1), as evidenced by
  • Output Layer Gradient Magnitude: |∂L/∂W(2)| = 0.0511.
  • Hidden Layer Gradient Magnitudes: |∂L/∂w11| = 0.00633 and |∂L/∂w12| = 0.00237 (≈12% and 5% of the output-layer gradient).
This demonstrates mathematically why earlier layers learn much more slowly when using sigmoid activations: each additional backward layer multiplies the gradient by σ′(z(k)) ≤ 0.25, causing exponential decay in gradient magnitude.
This gradient decay stems from a fundamental issue with saturating activation functions like sigmoid. A neuron saturates when its activation function’s output becomes insensitive to input changes, occurring when inputs have large magnitude and the function approaches its asymptotic limits. Mathematically, saturation causes the derivative to approach zero: for sigmoid, σ′(x) = σ(x)(1 − σ(x)) → 0 as σ(x) → 0 or 1. During backpropagation, these near-zero derivatives multiply with incoming error signals, drastically shrinking gradients. In deep networks with multiple saturated layers, this vanishing gradient problem prevents early layers from receiving meaningful learning signals.
Although our example uses moderate inputs that do not exhibit extreme saturation (derivatives of 0.237 and 0.246), the sigmoid’s maximum derivative of 0.25 inherently limits gradient flow. In deeper networks, multiplying several such values causes exponential decay, explaining why deep networks with sigmoid activations suffer from slow convergence.
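The arithmetic of this worked example can be verified directly. The following NumPy transcription is a sketch of this specific two-weight network only, not a general trainer; it reproduces the forward pass and gradients computed above.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = np.array([0.8, 0.3]); W1 = np.array([0.5, 0.2]); W2 = 0.4; y = 0.9

# Forward pass
z1 = W1 @ x                  # 0.46
a1 = sigmoid(z1)             # ≈ 0.613
z2 = W2 * a1                 # ≈ 0.245
a2 = sigmoid(z2)             # ≈ 0.561
loss = 0.5 * (y - a2) ** 2   # ≈ 0.0575

# Backward pass (chain rule, as in the text)
d2 = (a2 - y) * a2 * (1 - a2)     # error * sigma'(z2)
grad_W2 = d2 * a1                 # ≈ -0.0511
d1 = d2 * W2 * a1 * (1 - a1)      # propagate through the hidden layer
grad_W1 = d1 * x                  # ≈ [-0.00633, -0.00237]
```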
To demonstrate how activation function choice impacts gradient flow, we now compare the sigmoid results with tanh and ReLU activations using the same network architecture and initial parameters. The comparative results in Table A1 reveal critical insights into how the choice of activation function fundamentally affects gradient propagation and training dynamics (each function is discussed in detail in the following subsections):
  • Sigmoid’s Severe Gradient Attenuation: The sigmoid activation demonstrates the most pronounced vanishing gradient problem, with hidden layer weights receiving only 4.7–12.5% of the gradient magnitude compared to the output layer. This occurs because σ′(z) = σ(z)(1 − σ(z)) ≤ 0.25, causing exponential decay of gradients through multiple layers.
  • Tanh’s Moderate Improvement: The tanh function provides significantly better gradient flow, with hidden layer gradients at 22.8–60.7% of output layer magnitude. This improvement stems from tanh′(z) = 1 − tanh²(z), which has a maximum value of 1.0 (when z = 0) compared to sigmoid’s maximum of 0.25.
  • ReLU’s Superior Gradient Preservation: ReLU demonstrates the most balanced gradient distribution, with hidden layer gradients reaching 26.1–69.6% of output layer values. The constant derivative of 1 for positive inputs prevents gradient attenuation entirely along active paths, although ReLU can suffer from the “dying ReLU” problem for negative inputs.
This comparative analysis underscores why ReLU and its variants have become standard in deep learning: they preserve gradient flow much more effectively than saturating activations like sigmoid and tanh, enabling successful training of very deep networks.
Table A1. Comparison of weight updates across activation functions.

Activation Function | Weight | Old Value | New Value | Change  | Relative Change
Sigmoid             | w21    | 0.4000    | 0.4256    | +0.0256 | 100.0%
                    | w11    | 0.5000    | 0.5032    | +0.0032 | 12.5%
                    | w12    | 0.2000    | 0.2012    | +0.0012 | 4.7%
Tanh                | w21    | 0.4000    | 0.5525    | +0.1525 | 100.0%
                    | w11    | 0.5000    | 0.5925    | +0.0925 | 60.7%
                    | w12    | 0.2000    | 0.2347    | +0.0347 | 22.8%
ReLU                | w21    | 0.4000    | 0.5645    | +0.1645 | 100.0%
                    | w11    | 0.5000    | 0.6145    | +0.1145 | 69.6%
                    | w12    | 0.2000    | 0.2430    | +0.0430 | 26.1%
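The updates in Table A1 can be reproduced by swapping the activation function in the same one-step computation; the sketch below does exactly that (the function name `one_step` is ours, and the values match the table to rounding).

```python
import numpy as np

def one_step(act, dact, eta=0.5):
    """One gradient-descent step on the worked 2-1-1 example (no biases)."""
    x = np.array([0.8, 0.3]); W1 = np.array([0.5, 0.2]); W2 = 0.4; y = 0.9
    z1 = W1 @ x; a1 = act(z1)
    z2 = W2 * a1; a2 = act(z2)
    d2 = (a2 - y) * dact(z2)                 # output-layer error term
    W2_new = W2 - eta * d2 * a1
    W1_new = W1 - eta * d2 * W2 * dact(z1) * x
    return W2_new, W1_new

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
updates = {
    "sigmoid": one_step(sig, lambda z: sig(z) * (1.0 - sig(z))),
    "tanh": one_step(np.tanh, lambda z: 1.0 - np.tanh(z) ** 2),
    "relu": one_step(lambda z: np.maximum(0.0, z),
                     lambda z: (z > 0).astype(float)),
}
# e.g. updates["relu"] ≈ (0.5645, [0.6145, 0.2430]), as in Table A1
```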

Appendix A.3. Zero-Centered Output Property

Beyond gradient magnitude, the symmetry of the activation function’s output range is critical for optimization efficiency; this is primarily due to its effect on gradient dynamics during backpropagation.
When activation functions produce outputs that are not zero-centered—such as sigmoid (outputs in [0, 1]) or ReLU (outputs in [0, ∞))—every activation a_i from the previous layer is strictly positive. This leads to several optimization difficulties. Because each gradient component ∂L/∂w_ij = δ_j · a_i is a product of the error signal δ_j and a positive a_i, the sign of every gradient for a given neuron j is determined solely by δ_j. Consequently, all weights connected to the same neuron must be updated in the same direction—either all increased or all decreased—during each optimization step. The network cannot fine-tune individual weights by increasing some while decreasing others within the same neuron, which forces the optimization path into an inefficient zig-zag trajectory and slows convergence. In contrast, a zero-centered function like tanh (output range (−1, +1)) allows activations a_i to be both positive and negative. This breaks the sign correlation, enabling the optimizer to adjust weights in opposing directions simultaneously, facilitating a more direct path to the solution and more stable training.
Consider a neuron j in layer l with two incoming weights, w1j and w2j, from the previous layer. Let the corresponding activations from layer l − 1 be a1 = 0.7 and a2 = 0.3 (both positive, as with sigmoid or ReLU). Suppose the error signal for neuron j is δj = −0.5. The gradients are
∂L/∂w1j = δj · a1 = (−0.5) × 0.7 = −0.35
∂L/∂w2j = δj · a2 = (−0.5) × 0.3 = −0.15
Both gradients are negative, regardless of the individual values of a1 and a2, because δj = −0.5 is negative and a1, a2 > 0. The sign of δj alone controls the sign of all weight gradients for neuron j.
Assume that we use gradient descent with a learning rate η = 0.1. The weight updates are
Δw1j = −η · (−0.35) = +0.035 (increase)
Δw2j = −η · (−0.15) = +0.015 (increase)
Both weights are increased because both gradients were negative. The network cannot increase w1j while decreasing w2j in the same step, even if that would be a more effective adjustment. This is a direct result of all a_i > 0. Contrast this with a zero-centered activation function (e.g., tanh, output in (−1, 1)): if a1 = 0.7 and a2 = −0.3, with the same δj = −0.5,
∂L/∂w1j = (−0.5) × 0.7 = −0.35 ⇒ Δw1j = +0.035
∂L/∂w2j = (−0.5) × (−0.3) = +0.15 ⇒ Δw2j = −0.015
Now, w 1 j increases while w 2 j decreases. This flexibility allows for more efficient, coordinated optimization, as weights can move independently in the direction that best minimizes the loss.
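This sign argument is easy to check numerically; the toy values below are the ones used in the example.

```python
import numpy as np

delta = -0.5                   # shared error signal for neuron j

# Non-zero-centered activations (sigmoid/ReLU): previous-layer outputs all positive
a_pos = np.array([0.7, 0.3])
grads_pos = delta * a_pos      # both negative -> both weights move the same way

# Zero-centered activation (tanh): previous-layer outputs can differ in sign
a_zc = np.array([0.7, -0.3])
grads_zc = delta * a_zc        # mixed signs -> weights can move independently
```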

Appendix A.4. Sigmoid Activation

The sigmoid function, also known as the logistic function, maps any real-valued number to the range ( 0 , 1 ) . It is defined as follows:
σ(x) = 1 / (1 + e^(−x)).
Its derivative, which is central to backpropagation, has a convenient form:
σ′(x) = σ(x)(1 − σ(x))
Historically, the sigmoid was the default activation function in NNs, primarily due to its interpretable, s-shaped curve and bounded output. As shown in Figure A1, it is steepest at x = 0 and saturates for inputs with large magnitude. This shape gives it two primary historical applications:
  • Output Layer for Binary Classification: Its ( 0 , 1 ) output range naturally models probability scores, where σ ( x ) represents P ( y = 1 | x ) .
  • Saturating Neurons: The sigmoid’s smooth transition from 0 to 1, along with its asymptotic flattening at extreme inputs, was historically interpreted as modeling the continuous “firing rate” of biological neurons. In this analogy, an output near 0 represented an inactive neuron, near 1 represented maximum firing, and intermediate values represented varying degrees of activation. This saturation behavior was also thought to represent a neuron’s “confidence”—outputs approaching 0 or 1 indicated high certainty, while mid-range values indicated uncertainty.
Despite its historical importance, the sigmoid function suffers from critical limitations that make it unsuitable for most modern deep learning applications, particularly in hidden layers. As detailed in Appendix A.2 and Appendix A.3, these limitations include (1) severe vanishing gradient problems due to saturation, (2) optimization inefficiency from non-zero-centered outputs, and (3) higher computational cost compared to simpler functions like ReLU. These drawbacks have led to sigmoid being largely replaced by more modern activation functions in hidden layers, although it remains useful in output layers for binary classification.
Figure A1. Comparison of common activation functions: sigmoid, hyperbolic tangent (tanh), and Rectified Linear Unit (ReLU). Sigmoid: σ(x) = 1/(1 + e^(−x)) maps inputs to (0, 1), providing smooth saturation. Tanh: tanh(x) = (e^x − e^(−x))/(e^x + e^(−x)) maps to (−1, 1) and is zero-centered. ReLU: max(0, x) provides linear behavior for positive inputs while being computationally efficient. The sigmoid and tanh functions exhibit saturation regimes (flat regions) where gradients vanish, while ReLU maintains a constant gradient for positive inputs.
Figure A2. Derivatives of activation functions shown in Figure A1: Sigmoid derivative σ′(x) = σ(x)(1 − σ(x)) is bounded in (0, 0.25] and approaches zero for large |x|, causing vanishing gradients. Tanh derivative 1 − tanh²(x) is bounded in (0, 1] and also suffers from vanishing gradients but has stronger gradients near zero. ReLU derivative is 0 for x < 0 and 1 for x > 0, eliminating vanishing gradients for positive inputs but causing the “dying ReLU” problem for negative inputs. The piecewise nature of ReLU’s derivative enables more efficient backpropagation in deep networks.

Appendix A.5. Softmax Activation

The softmax function extends the sigmoid concept to multi-class classification problems. While sigmoid produces independent probabilities for binary classification, softmax converts a vector of real-valued scores (logits) into a proper probability distribution over multiple mutually exclusive classes. For an input vector z = [z_1, z_2, …, z_K], the softmax function is defined as follows:
softmax(z)_i = e^(z_i) / Σ_{j=1}^{K} e^(z_j),  for i = 1, …, K
This formulation ensures two critical properties:
  • Valid Probability Distribution: All outputs are in (0, 1) and sum to 1: Σ_{i=1}^{K} softmax(z)_i = 1.
  • Relative Scaling: The exponentiation amplifies differences between scores—larger z i values receive disproportionately higher probabilities.
Example: For logits z = [ 2.0 , 1.0 , 0.1 ] :
softmax(z) = [e^2.0, e^1.0, e^0.1] / (e^2.0 + e^1.0 + e^0.1) ≈ [0.659, 0.242, 0.099]
The first class receives the highest probability because its logit (2.0) is the largest, but all probabilities sum to 1.
Softmax is almost exclusively used in the final output layer for multi-class classification, typically paired with cross-entropy loss. Its limitations include computational expense (exponentiation and summation across all classes) and saturation effects similar to the sigmoid for extreme inputs. Unlike sigmoid, softmax is never used in hidden layers, due to its dependence on all inputs and computational overhead.
Relationship to Sigmoid: The sigmoid function is actually a special case of the more general softmax function for binary classification. When there are only two classes, softmax with two outputs [z_1, z_2] reduces to sigmoid when we set z_2 = 0. Specifically,
softmax(z)_1 = e^(z_1) / (e^(z_1) + e^(z_2)) = e^(z_1) / (e^(z_1) + 1) = 1 / (1 + e^(−z_1)) = σ(z_1)
This is why their plots look identical—for binary classification, they are mathematically equivalent. However, sigmoid is typically used for binary problems, while softmax extends to multi-class problems where outputs must sum to 1 across multiple mutually exclusive classes.
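Both the worked softmax example and the binary special case can be verified in a few lines. The max-subtraction inside `softmax` below is a standard numerical-stability trick, not part of the definition; it leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract max for numerical stability
    return e / e.sum()

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

p = softmax(np.array([2.0, 1.0, 0.1]))   # ≈ [0.659, 0.242, 0.099]

# Binary special case: softmax over [z1, 0] reproduces sigmoid(z1)
z1 = 1.3
p_binary = softmax(np.array([z1, 0.0]))
```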

Appendix A.6. Hyperbolic Tangent (tanh) Activation

The hyperbolic tangent (tanh) function addresses several limitations of the sigmoid while maintaining smooth, bounded behavior. The function maps inputs to the range (−1, 1) and is defined as follows:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x)) = 2σ(2x) − 1.
Its derivative exhibits an elegant mathematical form:
tanh′(x) = 1 − tanh²(x)
The tanh function offers several key advantages for deep learning applications. As a zero-centered function with range (−1, 1), it enables more efficient optimization by allowing gradients to flow in opposing directions. Its derivative reaches a maximum of 1.0 at x = 0, providing stronger gradient signals than sigmoid (maximum of 0.25). The function is perfectly symmetric about the origin and provides smooth gradient transitions, with its steepest slope at x = 0 creating an optimal region for learning. However, tanh retains important limitations: like all saturating functions, it gradually saturates for large-magnitude inputs (|x| > 2), leading to vanishing gradients (though less severe than sigmoid); its exponential computations make it more expensive than non-saturating alternatives like ReLU; and its bounded output range may not suit applications requiring non-negative outputs or specific value ranges.
Tanh finds particular utility in several deep learning scenarios:
  • Hidden Layers in Deep Networks: Its zero-centered property and stronger gradients make it effective in hidden layers of various architectures.
  • Normalized Data Processing: The (−1, 1) output range naturally aligns with normalized input data, making it suitable for preprocessing pipelines that center data around zero.
  • Moderately Deep Architectures: In networks of moderate depth, tanh can provide a good balance between expressive power and training stability.
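The identity tanh(x) = 2σ(2x) − 1 and the derivative bound discussed above can be checked numerically:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
xs = np.linspace(-3.0, 3.0, 201)

# tanh is a scaled, shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
lhs = np.tanh(xs)
rhs = 2.0 * sigmoid(2.0 * xs) - 1.0

# The derivative 1 - tanh^2(x) peaks at 1.0 at x = 0 (vs 0.25 for sigmoid)
d_tanh = 1.0 - np.tanh(xs) ** 2
```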

Appendix A.7. Rectified Linear Unit (ReLU) Activation

The Rectified Linear Unit (ReLU) has become the default activation function for many deep learning architectures due to its simplicity and effectiveness in addressing gradient vanishing. It is defined as follows:
ReLU(x) = max(0, x) = { x if x > 0;  0 if x ≤ 0 }.
The derivative is
ReLU′(x) = { 1 if x > 0;  0 if x ≤ 0 }.
ReLU provides several beneficial characteristics for deep learning. For positive inputs, ReLU maintains a constant gradient of 1, completely avoiding the vanishing gradient problem that affects saturating functions. This non-saturating behavior, combined with the simple max ( 0 , x ) operation, makes ReLU computationally inexpensive compared to exponential functions, resulting in high efficiency for both forward and backward passes. Additionally, by producing exact zero outputs for negative inputs, ReLU naturally creates sparse activations, which can improve model generalization and representational efficiency.
Despite these advantages, ReLU has significant limitations. The most prominent is the “dying ReLU” problem, where neurons receiving consistently negative inputs become permanently inactive, outputting zero for all inputs. Once a neuron’s input becomes negative, the gradient becomes exactly zero, preventing recovery through gradient-based learning. This irreversible deactivation can lead to substantial portions of deep networks becoming inactive, effectively reducing model capacity. Beyond this, ReLU’s exclusively non-negative outputs (being non-zero-centered) introduce optimization inefficiencies, as discussed in Appendix A.3. The unbounded positive output range also makes ReLU unsuitable for layers requiring specific value ranges, such as probability outputs or normalized reconstructions. Finally, ReLU is sensitive to weight initialization. Poor initialization can make the dying ReLU problem worse, and careful parameter setup is required.
Several variants have been developed to address the dying ReLU problem while maintaining ReLU’s benefits:
  • Leaky ReLU: Introduces a small, non-zero gradient (α ≈ 0.01) for negative inputs: LeakyReLU(x) = max(αx, x).
  • Parametric ReLU (PReLU): Makes the negative slope a learnable parameter, allowing the network to adaptively determine optimal behavior for negative inputs.
  • Exponential Linear Unit (ELU): Provides smooth, non-linear transitions for negative inputs while maintaining linearity for positive inputs.
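The variants above (except PReLU, whose negative slope is learned during training) can be sketched as simple element-wise functions; the α defaults below follow common conventions rather than any single reference implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # small non-zero slope for x < 0 keeps gradients alive
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth exponential transition for x < 0, identity for x > 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```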

References

  1. Van Der Maaten, L.J.; Postma, E.O.; Van den Herik, H.J. Dimensionality Reduction: A Comparative Review. J. Mach. Learn. Res. 2009, 10, 1–41. [Google Scholar]
  2. Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science 2000, 290, 2323–2326. [Google Scholar] [CrossRef]
  3. Jolliffe, I.T.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202. [Google Scholar] [CrossRef] [PubMed]
  4. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
  5. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  6. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the NIPS’20: 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; pp. 6840–6851. [Google Scholar]
  7. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; PMLR: Cambridge, MA, USA, 2015; pp. 2256–2265. [Google Scholar]
  8. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  9. Burgess, C.P.; Higgins, I.; Pal, A.; Matthey, L.; Watters, N.; Desjardins, G.; Lerchner, A. Understanding disentangling in β-VAE. arXiv 2018, arXiv:1804.03599. [Google Scholar]
  10. Bond-Taylor, S.; Leach, A.; Long, Y.; Willcocks, C.G. Deep generative modelling: A comparative review of vaes, gans, normalizing flows, energy-based and autoregressive models. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7327–7347. [Google Scholar] [CrossRef]
  11. Gómez-Bombarelli, R.; Wei, J.N.; Duvenaud, D.; Hernández-Lobato, J.M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T.D.; Adams, R.P.; Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4, 268–276. [Google Scholar] [CrossRef]
  12. An, J.; Cho, S. Variational autoencoder based anomaly detection using reconstruction probability. Spec. Lect. IE 2015, 2, 1–18. [Google Scholar]
  13. van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural Discrete Representation Learning. In Proceedings of the NIPS’17: Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017. [Google Scholar]
  14. Kingma, D.P.; Rezende, D.J.; Mohamed, S.; Welling, M. Semi-Supervised Learning with Deep Generative Models. In Proceedings of the NIPS’14: Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 2, Montreal, QC, Canada, 8–13 December 2014; MIT Press: St. Cambridge, MA, USA, 2014. [Google Scholar]
  15. Shlens, J. A Tutorial on Principal Component Analysis. arXiv 2014, arXiv:1404.1100. [Google Scholar] [CrossRef]
  16. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  17. Doersch, C. Tutorial on Variational Autoencoders. arXiv 2016, arXiv:1606.05908. [Google Scholar] [CrossRef]
  18. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392. [Google Scholar]
19. Schölkopf, B.; Müller, K.R. Fisher Discriminant Analysis with Kernels. In Proceedings of the Neural Networks for Signal Processing IX, Madison, WI, USA, 25 August 1999; Volume 1, pp. 23–25. [Google Scholar]
  20. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  21. Henderson, P. Sammon mapping. Pattern Recognit. Lett. 1997, 18, 1307–1316. [Google Scholar] [CrossRef]
  22. Wold, S.; Esbensen, K.; Geladi, P. Principal Component Analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52. [Google Scholar] [CrossRef]
  23. Turk, M.; Pentland, A. Eigenfaces for Recognition. J. Cogn. Neurosci. 1991, 3, 71–86. [Google Scholar] [CrossRef]
  24. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  25. Hyvärinen, L. Principal Component Analysis. In Mathematical Modeling for Industrial Processes; Springer: Berlin/Heidelberg, Germany, 1970; pp. 82–104. [Google Scholar]
  26. Strang, G. Introduction to Linear Algebra; SIAM: Philadelphia, PA, USA, 2022. [Google Scholar]
  27. Alter, O.; Brown, P.O.; Botstein, D. Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling. Proc. Natl. Acad. Sci. USA 2000, 97, 10101–10106. [Google Scholar] [CrossRef]
  28. Strang, G. Differential Equations and Linear Algebra; Wellesley-Cambridge Press: Wellesley, MA, USA, 2014. [Google Scholar]
  29. Wall, M.E.; Rechtsteiner, A.; Rocha, L.M. Singular Value Decomposition and Principal Component Analysis. In A Practical Approach to Microarray Data Analysis; Springer: Berlin/Heidelberg, Germany, 2003; pp. 91–109. [Google Scholar]
  30. Abdi, H.; Williams, L.J. Principal Component Analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459. [Google Scholar] [CrossRef]
  31. Tharwat, A. Principal component analysis-a tutorial. Int. J. Appl. Pattern Recognit. 2016, 3, 197–240. [Google Scholar] [CrossRef]
  32. Hinton, G.E.; Zemel, R. Autoencoders, minimum description length and Helmholtz free energy. In Proceedings of the NIPS’93: Neural Information Processing Systems, Denver, CO, USA, 29 November–2 December 1993; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1993. [Google Scholar]
  33. Michelucci, U. An introduction to autoencoders. arXiv 2022, arXiv:2201.03898. [Google Scholar] [CrossRef]
  34. Baldi, P.; Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Netw. 1989, 2, 53–58. [Google Scholar] [CrossRef]
  35. Bourlard, H.; Kamp, Y. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 1988, 59, 291–294. [Google Scholar] [CrossRef] [PubMed]
  36. Plaut, E. From principal subspaces to principal components with linear autoencoders. arXiv 2018, arXiv:1804.10253. [Google Scholar] [CrossRef]
  37. Berahmand, K.; Daneshfar, F.; Salehi, E.S.; Li, Y.; Xu, Y. Autoencoders and their applications in machine learning: A survey. Artif. Intell. Rev. 2024, 57, 28. [Google Scholar] [CrossRef]
  38. Makhzani, A.; Frey, B. K-sparse autoencoders. arXiv 2013, arXiv:1312.5663. [Google Scholar]
  39. Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011; pp. 833–840. [Google Scholar]
  40. Jia, K.; Sun, L.; Gao, S.; Song, Z.; Shi, B.E. Laplacian auto-encoders: An explicit learning of nonlinear data manifold. Neurocomputing 2015, 160, 250–260. [Google Scholar] [CrossRef]
  41. Masci, J.; Meier, U.; Cireşan, D.; Schmidhuber, J. Stacked convolutional auto-encoders for hierarchical feature extraction. In Artificial Neural Networks and Machine Learning–ICANN 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 52–59. [Google Scholar]
  42. Yong, B.X.; Brintrup, A. Bayesian autoencoders with uncertainty quantification: Towards trustworthy anomaly detection. Expert Syst. Appl. 2022, 209, 118196. [Google Scholar] [CrossRef]
  43. Preechakul, K.; Chatthee, N.; Wizadwongsa, S.; Suwajanakorn, S. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10619–10629. [Google Scholar]
  44. Samaria, F.S.; Harter, A.C. Parameterisation of a Stochastic Model for Human Face Identification. In Proceedings of the Second IEEE Workshop on Applications of Computer Vision, Sarasota, FL, USA, 5–7 December 1994; pp. 138–142. [Google Scholar]
Figure 1. Comparison of PCA, AEs, and VAEs in five aspects: linearity, optimization objective, parameter learning, determinism, and generative capability. PCA: linear, maximizes explained variance, no learned parameters, deterministic, non-generative. AEs: non-linear, minimize reconstruction error, learned parameters, deterministic, non-generative. VAEs: non-linear, maximize ELBO, learned parameters, probabilistic, generative.
Figure 2. PCA illustration on two-dimensional data: (Left) Original data in the feature space (x1, x2), shown as black stars. The spread of the data along each axis represents its variance. The direction of maximum variance (first principal component, PC1) is highlighted. (Right) Blue stars show projections onto PC1 (the maximal-variance direction), while green stars show projections onto PC2 (orthogonal to PC1).
Figure 3. Computation of principal components via the covariance-matrix and SVD methods. Step (A): Mean-centering of the original data matrix X. Step (B): Calculation of the covariance matrix Σ = (1/(N − 1)) DᵀD from the mean-centered data D. Step (C): Eigendecomposition of Σ to obtain the principal components (eigenvectors) and their relative importance (eigenvalues). Step (D): Alternative approach using SVD applied directly to the mean-centered data matrix D to obtain the same principal components and their relative importance.
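The two routes in Figure 3 can be sketched in a few lines of NumPy. The toy data below are illustrative (not the values from the paper's numerical example); the sketch checks that eigendecomposition of the covariance matrix (Steps B–C) and SVD of the centered data (Step D) recover the same principal directions up to sign, and ends with the projection step of Equation (5).

```python
import numpy as np

# Toy data: 5 samples, 2 features (illustrative values only)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Step A: mean-center the data
D = X - X.mean(axis=0)

# Step B: covariance matrix with the (N - 1) denominator
Sigma = D.T @ D / (len(X) - 1)

# Step C: eigendecomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]              # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step D: SVD of the centered data yields the same directions
U, S, Vt = np.linalg.svd(D, full_matrices=False)
# Singular values relate to eigenvalues via lambda_i = s_i^2 / (N - 1)
assert np.allclose(S**2 / (len(X) - 1), eigvals)
# Rows of Vt match the eigenvectors up to a global sign flip
for i in range(2):
    assert np.allclose(np.abs(Vt[i]), np.abs(eigvecs[:, i]))

# Projection onto the first principal component (Equation (5))
Z = D @ eigvecs[:, :1]
```

Since the sample covariance is DᵀD/(N − 1) and the SVD gives D = USVᵀ, the two decompositions are algebraically equivalent; in practice the SVD route is preferred for numerical stability.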
Figure 4. Visualization of the data projection step in PCA according to Equation (5), demonstrating the mapping from the original high-dimensional space to the PCA subspace.
Figure 5. A visualization of our numerical example illustrating PCA steps: (Left) The original data (red stars) with the first (PC1, solid line) and second (PC2, dotted line) principal components. The blue and green lines connecting the data points to their reconstructions represent the error when using PC1 and PC2, respectively. (Right) The projected data, where blue and green stars represent the projections onto the first and second principal components, respectively.
Figure 6. Autoencoder architecture, which consists of an encoder network that compresses the input into a lower-dimensional latent representation (bottleneck), and a decoder network that reconstructs the original input from this latent code. The objective is to minimize the reconstruction loss between the input x and the output x̂.
Figure 7. Visualization of the forward pass through a simple autoencoder in our numerical example. The figure shows the transformation from the input [x1, x2] to the latent representation z via the encoder, and then to the reconstructed output [x̂1, x̂2] via the decoder. Weights are shown in red.
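A forward pass of the kind shown in Figure 7 can be sketched as follows. The weights, biases, and input values here are illustrative placeholders, not the exact numbers from the paper's numerical example; the point is the shape of the computation: encode to a one-dimensional bottleneck, decode back, and measure the reconstruction loss that training would minimize.

```python
import numpy as np

def sigmoid(a):
    # Standard logistic activation
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative values (not those of the paper's example)
x = np.array([1.0, 0.5])        # input [x1, x2]
W_enc = np.array([0.6, -0.4])   # encoder weights: 2 inputs -> 1 latent unit
b_enc = 0.1
W_dec = np.array([0.8, -0.3])   # decoder weights: 1 latent unit -> 2 outputs
b_dec = np.array([0.05, 0.02])

# Encoder: compress the 2-D input to a single latent value z
z = sigmoid(W_enc @ x + b_enc)

# Decoder: reconstruct [x1_hat, x2_hat] from z
x_hat = sigmoid(W_dec * z + b_dec)

# Reconstruction loss (MSE) minimized during training
loss = np.mean((x - x_hat) ** 2)
```

In training, this loss would be backpropagated through both networks to update W_enc, W_dec, and the biases; with linear activations and this one-unit bottleneck, the learned subspace coincides with the first principal subspace of PCA.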
Figure 8. Visualization of the architecture of the VAE, consisting of the encoder q_ϕ(z|x) mapping input data to a latent distribution, the latent space Z_VAE, and the decoder p_θ(x|z) reconstructing the output from the latent representation.
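The sampling step implied by Figure 8 can be sketched with the reparameterization trick. The encoder outputs μ and log σ² below are illustrative numbers rather than actual network outputs; the sketch shows how z is drawn differentiably from q_ϕ(z|x), and the closed-form KL term that regularizes the latent space toward the standard normal prior in the ELBO.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative encoder outputs for one input x: mean and log-variance
# of q_phi(z|x). In a real VAE these come from the encoder network.
mu = np.array([0.3, -0.1])
log_var = np.array([-1.0, -0.5])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# which keeps the sample differentiable w.r.t. mu and log_var
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL divergence between q_phi(z|x) = N(mu, diag(sigma^2)) and the
# prior N(0, I): the regularization term of the ELBO, in closed form
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

The sampled z would then be fed to the decoder p_θ(x|z), and the training objective combines the reconstruction term with this KL penalty.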
Figure 9. Latent spaces for the first 10 people in the dataset. Each color is a different person. (Left): PCA. (Middle): AE. (Right): VAE. The top row shows the 2D projections and the bottom row shows the 3D projections.
Figure 10. Comparing how models reconstruct faces in 2D and 5D. Row 1: Original. Row 2: PCA. Row 3: AE. Row 4: VAE. Note how VAE faces look more complete and natural.
Table 1. Reconstruction error (MSE) for 2D and 3D spaces. Lower is better.
Method | Dimensions | MSE
PCA    | 2 | 0.5932
PCA    | 3 | 0.5476
AE     | 2 | 0.3214
AE     | 3 | 0.2748
VAE    | 2 | 0.4123
VAE    | 3 | 0.3587
Table 2. Reconstruction performance (MSE) comparison across different latent dimensions.
Latent Dim | PCA  | AE   | VAE
2          | 0.60 | 0.43 | 0.34
5          | 0.45 | 0.25 | 0.10
10         | 0.33 | 0.20 | 0.06
100        | 0.06 | 0.02 | 0.04
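The PCA column of Table 2 (reconstruction error falling as the latent dimension grows, with exact reconstruction at full rank) can be reproduced with a short sketch. The data here are synthetic random matrices rather than the face images used in the paper's experiments, so the absolute numbers differ, but the monotone trend is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the experiment's data: 400 samples, 64 features
X = rng.standard_normal((400, 64)) @ rng.standard_normal((64, 64))

mean = X.mean(axis=0)
D = X - mean
U, S, Vt = np.linalg.svd(D, full_matrices=False)

def pca_mse(k):
    """Project onto the top-k components, reconstruct, and return the MSE."""
    W = Vt[:k].T                    # top-k principal directions as columns
    X_hat = (D @ W) @ W.T + mean    # project, map back, undo centering
    return np.mean((X - X_hat) ** 2)

# MSE shrinks as the latent dimension grows, mirroring Table 2's trend
errors = [pca_mse(k) for k in (2, 5, 10, 64)]
assert errors == sorted(errors, reverse=True)
assert errors[-1] < 1e-10           # full rank reconstructs exactly
```

The same experiment with an AE or VAE requires training a network per latent dimension, which is why the deep models in Table 2 can beat PCA at small dimensions: they are not restricted to linear projections.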
Table 3. Comparison of the latent space (bottleneck).
Property         | PCA                         | Standard AE               | VAE
Structure        | Flat, linear surface        | Unstructured, messy       | Smooth, organized map
Interpretability | High (principal components) | Low (black-box)           | Moderate (regularized)
Dimensionality   | Fixed by the data rank      | Chosen by the user        | Chosen by the user
Rules            | Must be perpendicular       | None (unconstrained)      | Must follow a Gaussian bell curve
Interpolation    | Linear only                 | Possible, but often jumpy | Smooth and meaningful