Article

Quantum Down-Sampling Filter for Variational Autoencoder

1 CSIRO Data61, Marsfield, NSW 2122, Australia
2 CSIRO Manufacturing, Clayton, VIC 3216, Australia
* Author to whom correspondence should be addressed.
Electronics 2025, 14(23), 4626; https://doi.org/10.3390/electronics14234626
Submission received: 20 September 2025 / Revised: 15 November 2025 / Accepted: 18 November 2025 / Published: 25 November 2025
(This article belongs to the Special Issue Second Quantum Revolution: Sensing, Computing, and Transmitting)

Abstract

Variational Autoencoders (VAEs) are fundamental for generative modeling and image reconstruction, yet they often struggle to maintain high fidelity in their reconstructions. This study introduces a hybrid model, the Quantum Variational Autoencoder (Q-VAE), which integrates quantum encoding within the encoder while utilizing fully connected layers to extract meaningful representations. The decoder uses transposed convolution layers for up-sampling. The Q-VAE is evaluated against the classical VAE and the classical direct-passing VAE, which utilizes windowed pooling filters. Results on the MNIST and USPS datasets demonstrate that Q-VAE consistently outperforms the classical approaches, achieving lower Fréchet Inception Distance scores, thereby indicating superior image fidelity and enhanced reconstruction quality. These findings highlight the potential of Q-VAE for high-quality synthetic data generation and improved image reconstruction in generative models.

1. Introduction

Variational Autoencoders (VAEs) are widely used in generative modeling and image reconstruction, providing an efficient framework to learn meaningful latent representations [1]. VAEs employ an encoder–decoder structure, where the encoder maps input data into a probabilistic latent space, and the decoder reconstructs the input from sampled latent variables. Fidelity in image reconstruction and generation refers to how well the generated images match the original images in terms of structural accuracy [2], detail preservation [3], and perceptual similarity [4]. However, classical VAEs (C-VAEs) often struggle to generate high-fidelity images due to inherent limitations in their latent representation [5]. Additionally, the decoder’s reliance on pixel-wise reconstruction loss leads to blurry and over-smoothed images, restricting their effectiveness in capturing complex data distributions [6].
Recent advancements in quantum computing offer promising avenues for overcoming these challenges by exploiting quantum phenomena such as superposition and entanglement. Several recent works have explored hybrid quantum–classical models for image processing and signal reconstruction, highlighting the potential of quantum encoding to enhance classical architectures [7,8,9,10]. These advances in quantum computing have introduced promising approaches to enhance deep generative models by leveraging quantum parallelism and superposition [11]. The integration of quantum circuits into neural networks, particularly within the VAE encoder, has shown potential to improve latent space representations, leading to enhanced feature extraction and more accurate reconstructions [12]. Quantum neural networks represent an example of such hybrid models, which leverage quantum operations to better capture the underlying data structure, making them especially well suited for high-dimensional image tasks [13].
This study presents a hybrid model referred to as the quantum VAE (Q-VAE). Q-VAE combines a quantum down-sampling filter, which performs image resolution reduction and quantum encoding, with a VAE. This is a novel approach that integrates quantum transformations within the encoder while utilizing fully connected layers for feature extraction. The decoder employs transposed convolution layers for up-sampling. Q-VAE is evaluated against the C-VAE and the classical direct-passing VAE (CDP-VAE), which incorporates windowed pooling filters for resolution reduction. Since quantum computing, especially its integration into deep learning models, is still in its early stages, focusing on simpler datasets allows for a more controlled environment in which to understand the impact of quantum techniques on model performance. Furthermore, the datasets used in this study, MNIST [14] and USPS [15] (with image resolutions of 28 × 28 and 16 × 16 pixels, respectively), are relatively small, which is advantageous when exploring the potential of quantum circuits for feature extraction and latent space representation. Their reduced resolution lowers the computational burden compared to higher-resolution images. To ensure a fair evaluation, the MNIST dataset is first down-sampled from 28 × 28 to 16 × 16 pixels, and then, like the USPS dataset, up-sampled to 32 × 32 pixels before being used as input to assess our model's performance. While MNIST and USPS are simplified datasets, they remain standard benchmarks in quantum generative modeling; they enable rigorous evaluation of quantum circuit stability, optimization convergence, and noise resilience before applying models to higher-dimensional or domain-specific data. The proposed Q-VAE demonstrates how quantum encoding and reconstruction can enhance feature representation efficiency, an essential capability for future quantum sensing, communication, and computing systems. By focusing on grayscale images, we reduce the complexity of the problem, allowing for clearer insights into how quantum techniques can improve feature extraction and image reconstruction, particularly for simpler tasks such as digit recognition. Once the approach is validated on these datasets, the methods can be extended to more complex data, such as color or higher-resolution images.
Quantitative analysis was performed to compare the effectiveness of the models using common performance metrics, such as the Fréchet Inception Distance (FID) and the Mean Squared Error (MSE), which are essential for assessing the quality of generated images and the representation of the latent space [16,17]. By leveraging these datasets, we can benchmark the Q-VAE against traditional models and explore whether the quantum component brings measurable improvements to the generative process.
The novelty of this work lies in introducing a Quantum Down-Sampling Filter that performs learnable quantum feature compression directly within the encoder of a Variational Autoencoder. Unlike prior quantum VAEs that embed quantum layers inside the latent space or decoder components, the proposed model integrates quantum encoding as a down-sampling mechanism, allowing the encoder to capture higher-order correlations in fewer dimensions. This design bridges the gap between quantum state representation and classical feature extraction, providing an efficient hybrid model with fewer parameters and improved reconstruction fidelity.
In summary, the following contributions are made in this paper:
  • Introduce a novel Quantum Variational Autoencoder (Q-VAE) that integrates a quantum down-sampling filter in the encoder for efficient feature compression.
  • Provide the first systematic comparison between encoder-only quantum encoding (Q-VAE) and the classical CDP-VAE and C-VAE, highlighting the impact of quantum feature compression.
  • Avoid quantum entanglement and deep quantum layers in the encoder, focusing solely on an efficient quantum encoding circuit for latent space mapping.
  • Minimize quantum-trainable parameters, leading to more efficient model training and better utilization of quantum resources.
  • Evaluate whether quantum encoding improves the generative quality and latent space representation compared to classical methods.
The structure of this paper is as follows. Section 2 provides a review of related work on classical VAEs and their integration with quantum computing. Section 3 presents the background on VAEs. In Section 4, the methodology for the proposed Q-VAE model is introduced. Section 5 outlines the numerical simulations conducted, while Section 6 discusses the performance analysis. Finally, Section 7 concludes the paper, summarizing the key findings and suggesting potential avenues for future research.

2. Related Research

In recent years, VAEs have garnered significant attention in generative modeling, especially for image reconstruction tasks [1]. Traditional VAEs are valued for encoding data into a structured latent space and reconstructing input approximations, driving research efforts to enhance their architectures for improved image quality and feature extraction [18,19,20]. Several studies have focused on enhancing VAE performance using convolutional architectures [21,22,23,24]. Gatopoulos et al. introduced convolutional VAEs to better capture image features, surpassing the performance of fully connected architectures on medium-resolution datasets such as CIFAR-10 and ImageNet32 [25]. Furthermore, combining VAEs with adversarial training, as in Adversarial Autoencoders (AAE) [26], incorporates discriminator networks to enhance the visual quality of generated images and improve machine learning applications.

Quantum Generative Modeling

Quantum Machine Learning (QML) has emerged as a promising field, with studies exploring the integration of quantum circuits into generative models. Quantum GANs (QGANs) leverage quantum computing to enhance the data representation and image generation capabilities of GANs [27,28,29,30]. Quantum VAEs [31,32,33,34,35,36], on the other hand, build on the foundational work of Khoshaman et al., who proposed a quantum VAE in which the latent generative process is modeled as a quantum Boltzmann machine (QBM). Their approach demonstrates that a quantum VAE can be trained end-to-end using a quantum lower bound to the variational log-likelihood, achieving state-of-the-art performance on the MNIST dataset [37]. They used the log-likelihood and the Evidence Lower Bound (ELBO) as evaluation measures and reported confidence levels for the numerical results smaller than ±0.2 in all cases. Gircha et al. further advanced the quantum VAE architecture by incorporating quantum circuits in both the encoder, which utilizes multiple rotations, and the decoder [36].
Despite the successes of existing variational autoencoder architectures, most hybrid quantum–classical models primarily apply quantum circuits in the latent space or decoder, focusing on data generation rather than feature extraction. Such designs often increase computational complexity and may introduce instability during training. To address this gap, our proposed Quantum Variational Autoencoder (Q-VAE) integrates a quantum down-sampling filter with a single rotation gate and measurement exclusively in the encoder. This encoder-focused design efficiently compresses high-dimensional input features into a rich latent representation, enhancing feature extraction and reconstruction quality while maintaining a simple and stable classical decoder. By concentrating quantum processing in the encoder, our model distinguishes itself from prior approaches that rely on quantum circuits in the decoder or latent space.

3. Background

3.1. Variational Autoencoder

VAEs are powerful generative models in machine learning, built on a probabilistic framework. The encoder network maps input data $x$ to a latent distribution $q_\phi(z|x)$, while the decoder network reconstructs the original data from the latent variables $z$. Similar to other variational methods, VAEs are optimized using the ELBO as their objective function. Kingma et al. (2019) introduced a parametric inference model $q_\phi(z|x)$, also referred to as an encoder or recognition model. The parameters of this inference model, denoted $\phi$, are known as the variational parameters. The decoder, with parameters $\theta$, is responsible for generating the data from the latent variables $z$. Thus, the parameters $\phi$ and $\theta$ together define the model, where $\phi$ governs the inference process and $\theta$ governs the generative process [38].
Due to the non-negativity of the Kullback–Leibler (KL) divergence between $q_\phi(z|x)$ and $p_\theta(z|x)$, the ELBO serves as a lower bound on the log-likelihood of the data. This relationship is expressed as:
$$\mathcal{L}_{\theta,\phi}(x) = \log p_\theta(x) - D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)$$
Interestingly, the KL divergence $D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big)$ determines two key aspects:
1. By definition, it quantifies the divergence of the approximate posterior from the true posterior.
2. It represents the gap between the ELBO $\mathcal{L}_{\theta,\phi}(x)$ and the marginal likelihood $\log p_\theta(x)$, which is also known as the tightness of the bound.
The closer $q_\phi(z|x)$ approximates the true posterior $p_\theta(z|x)$ in terms of KL divergence, the smaller this gap becomes, leading to a tighter bound. The reconstruction term encourages the model to generate data points that are similar to the observed data, while the KL divergence regularizes the learned latent variables by pushing the approximate posterior towards the prior distribution. Although VAEs have been successful in many domains, including image generation and anomaly detection, they often face significant challenges when applied to high-resolution data. This issue can arise because the latent space is often not expressive enough to model the intricate relationships [39].
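For concreteness, the sketch below shows a minimal PyTorch implementation of this objective for a Bernoulli (pixel-wise) decoder, combining the summed reconstruction term with the closed-form KL divergence to a standard normal prior. The function name and signature are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def vae_loss(x_recon, x, mu, log_var):
    """Negative ELBO for a Bernoulli decoder: reconstruction term plus KL regularizer.

    Uses the closed-form KL divergence between N(mu, sigma^2) and the standard
    normal prior N(0, I): 0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1).
    """
    # Pixel-wise binary cross-entropy, summed over pixels and batch
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL divergence to the standard normal prior
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```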

3.2. Encoder

A classical VAE encoder is typically implemented as a fully connected network or convolutional neural network (CNN) with the aim of mapping the input data to a lower-resolution latent space. The primary objective of the encoder is to process the input data $x$, transform it through a series of hidden layers, and output two parameters: the mean $\mu$ and the log variance $\log(\sigma^2)$, which are used for variational inference [40]. In the case of image data, the input image is first reshaped into a vector and passed through multiple fully connected layers. The transformations performed by these layers help extract important features from the input. For example, given an input image of dimension 1024 (flattened from a 32 × 32 image), the encoder progressively reduces the dimension, often mapping the input image $x_0$ into a lower-dimensional space $x_1$.

3.2.1. Fully Connected Layers in Encoders

Fully connected layers in an encoder are crucial for transforming input data into meaningful representations. They work by connecting every neuron in one layer to every neuron in the next, allowing the model to learn complex relationships between features. In the encoder, these layers typically perform dimensionality reduction, extracting the most relevant information from the input and mapping it to a compact latent space. The use of activation functions such as ReLU or Sigmoid introduces non-linearity, enabling the encoder to capture intricate patterns. This capacity to learn complex, non-linear relationships makes fully connected layers essential for building efficient and powerful encoders, allowing the model to capture dependencies in the data while still benefiting from their representational power.

3.2.2. Windowing and Pooling for Resolution Reduction

Fixed pooling techniques, such as max pooling and average pooling, are particularly effective in encoding processes where spatial reduction is necessary. By consistently applying the same pooling operation across the input, the model can reduce dimensionality, improve efficiency, and retain key features that will be crucial for later processing. In max pooling, the maximum value in a small region (or window) of the input image is selected, reducing the image’s resolution while preserving the most significant features in each region. Average pooling works similarly, but instead computes the average value of each window. These pooling operations help to reduce computational complexity while retaining essential information for later stages of the model.
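For illustration (assuming PyTorch), both operations halve each spatial dimension of a 32 × 32 input with a 2 × 2 window:

```python
import torch
import torch.nn as nn

x = torch.rand(1, 1, 32, 32)                    # one grayscale 32 x 32 image
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print(max_pool(x).shape)  # torch.Size([1, 1, 16, 16]) - keeps the max of each 2 x 2 window
print(avg_pool(x).shape)  # torch.Size([1, 1, 16, 16]) - keeps the mean of each 2 x 2 window
```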

3.3. Latent Space Encoding

Finally, the encoder outputs the mean $\mu$ and the log variance $\log(\sigma^2)$, which parameterize the variational distribution. These values are computed using fully connected layers (fc) as follows:
$$\mu = \mathrm{fc}_{\mu}(h_2),$$
$$\log(\sigma^2) = \mathrm{fc}_{\log\mathrm{var}}(h_2),$$
where $\mu$ and $\log(\sigma^2)$ are vectors with the dimension of the latent space, while $h_2$ is a vector or matrix resulting from a series of transformations applied to the input data; it is used as the input to the fully connected layers that output the parameters of the variational distribution.
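A minimal PyTorch sketch of these latent heads, together with the standard reparameterization step used to sample $z$, is shown below; the hidden and latent dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    """Maps the last hidden activation h2 to the parameters of q_phi(z|x)."""

    def __init__(self, hidden_dim: int = 256, latent_dim: int = 16):
        super().__init__()
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mu = fc_mu(h2)
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)  # log(sigma^2) = fc_log_var(h2)

    def forward(self, h2: torch.Tensor):
        mu, log_var = self.fc_mu(h2), self.fc_log_var(h2)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return z, mu, log_var
```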

3.4. Decoder

The decoder in a VAE is designed to transform the lower-resolution latent variable z, produced by the encoder, back into a high-dimensional data space, reconstructing the input data. The decoder typically consists of multiple layers, including fully connected and transposed convolution layers, which progressively reconstruct the input data’s spatial and feature dimensions. The core function of the decoder is to ensure that the reconstructed data match closely the original input, enabling the model to learn both the global structure and the local details of the data. As with the encoder, the decoder is trained to minimize the reconstruction error, often using a loss function that combines both the likelihood of the data given the latent variables and a regularization term that encourages smoothness in the latent-space representation.

3.4.1. Latent Variable Transformation

The decoding process begins by passing the latent variable z through a fully connected layer to map it to a higher-dimensional feature space. This transformation creates a tensor that matches the expected dimensions for the next layers in the decoder, such as the transposed convolution layers. The fully connected layer enables the decoder to effectively learn a mapping from the compact latent space back to the data space, ensuring that relevant information is maintained during this transformation.

3.4.2. Transposed Convolutions

Once the latent variable is transformed into the higher-dimensional space, it is passed through a series of transposed convolution layers. These layers up-sample the tensor, progressively restoring the original spatial dimensions of the input data. Each transposed convolution layer learns to refine the spatial resolution of the image while preserving important features. Transposed convolution layers have been shown to be highly effective in generating high-resolution output from low-resolution latent representations. They are capable of learning hierarchical features of the data, capturing both local details and global structure during the reconstruction process. By applying non-linear activation functions, such as ReLU, between the transposed convolution layers, the model introduces the necessary complexity to capture intricate patterns in the data.
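As a concrete illustration (assuming PyTorch), a kernel size of 4 with stride 2 and padding 1 doubles the spatial resolution according to the transposed-convolution output formula; the channel counts below are arbitrary.

```python
import torch
import torch.nn as nn

# Output size of a transposed convolution:
#   out = (in - 1) * stride - 2 * padding + kernel_size
# so kernel_size=4, stride=2, padding=1 doubles the resolution: (16 - 1) * 2 - 2 + 4 = 32.
up = nn.ConvTranspose2d(in_channels=32, out_channels=16, kernel_size=4, stride=2, padding=1)
x = torch.randn(1, 32, 16, 16)
print(up(x).shape)  # torch.Size([1, 16, 32, 32])
```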

3.4.3. Reconstruction

The final stage of the decoding process involves the production of the reconstructed image. This is typically done by passing the output of the last transposed convolution layer through a sigmoid activation function to ensure that the pixel values lie within the valid range (e.g., between 0 and 1 for normalized image data). The output of this step is the reconstructed image, which should ideally match the original input data.

4. Proposed Methodology

In this work, we introduce a novel down-sampling filter for Q-VAE that relies exclusively on quantum encoding for feature extraction. The paper investigates three distinct architectures for VAEs, each employing different encoding mechanisms to evaluate their feature extraction capabilities: C-VAE, CDP-VAE and Q-VAE. The framework of these models is presented in Figure 1. The goal of this comparison is to assess the effectiveness of quantum computing, classical neural networks, and simplified encoding strategies in the capture and processing of data for subsequent reconstruction. Although each model shares a common decoder architecture, the encoders differ in their approach, enabling the study of how each influences the encoding process and the quality of the reconstructed output.

4.1. Classical VAE (C-VAE) Encoder

C-VAE uses a traditional fully connected neural network as the encoder, which processes the input data through a series of hidden layers to output the mean ($\mu$) and log variance ($\log(\sigma^2)$) of the latent space distribution. This distribution is then sampled to obtain the latent variable $z$, which represents the encoded data in a lower-dimensional space. While the C-VAE relies entirely on classical neural networks for encoding, it remains highly effective in modeling data distributions and is widely used in practical applications. The decoder, which employs layers for up-sampling, reconstructs the input from this low-dimensional latent space, generating an output image or data structure that approximates the original input, as shown in Figure 1a.

4.2. Classical Direct Passing (CDP-VAE) Encoder

CDP-VAE adopts a more simplified approach to encoding by using windowing pooling for resolution reduction as shown in Figure 1b. In this method, the input data (e.g., an image) is divided into 2 × 2 pixel windows, and only the first pixel of each window is retained, discarding the other three. This process reduces the resolution of the input data while preserving significant features. The reduced input is then passed through a fully connected layer for further encoding into the latent space. The CDP-VAE encoder is particularly useful for computationally constrained environments, where reducing the complexity of the encoding process can lead to faster computations while still capturing essential information for reconstruction. For instance, if the input image x is of dimension 32 × 32 , this resolution reduction process reduces it to 16 × 16 .
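In array terms, this windowing step amounts to strided slicing that keeps the top-left pixel of every non-overlapping 2 × 2 window, as in the minimal sketch below (illustrative, assuming PyTorch tensors in NCHW layout).

```python
import torch

def cdp_downsample(x: torch.Tensor) -> torch.Tensor:
    """Keep only the first (top-left) pixel of every non-overlapping 2 x 2 window."""
    return x[..., ::2, ::2]

x = torch.rand(1, 1, 32, 32)
print(cdp_downsample(x).shape)  # torch.Size([1, 1, 16, 16])
```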

4.3. Quantum VAE (Q-VAE) Encoder

Q-VAE integrates quantum computing into the encoding process, building on the CDP-VAE design as shown in Figure 1c. The quantum encoder processes input data into a quantum state representation, utilizing quantum superposition to represent intricate relationships within the data. Quantum encoding improves performance by transforming input data into quantum states, allowing the Q-VAE to apply rotations that facilitate efficient processing and capture richer, more intricate representations. The quantum-enhanced feature extraction is combined with a classical decoder that uses transposed convolution to reconstruct the input data, combining the strengths of quantum processing for feature extraction with classical techniques for data reconstruction.
The quantum encoding network, illustrated in Figure 2, uses a quantum encoding designed to process the 16 × 16 pixel images that are the output of the windowing pooling layer:
$$R_Y(x_1) = \exp\!\left(-i\,\frac{x_1}{2}\,Y\right),$$
where $R_Y$ is the rotation gate around the Y-axis. The rotation gate $R_Y(\theta)$ can be expressed in terms of sine and cosine as:
$$R_Y(\theta) = \exp\!\left(-i\,\frac{\theta}{2}\,Y\right) = \cos\frac{\theta}{2}\,I - i\sin\frac{\theta}{2}\,Y,$$
where $I$ is the identity matrix and $Y$ is the Pauli Y-matrix. The explicit matrix form is:
$$R_Y(\theta) = \begin{pmatrix} \cos\frac{\theta}{2} & -\sin\frac{\theta}{2} \\[4pt] \sin\frac{\theta}{2} & \cos\frac{\theta}{2} \end{pmatrix}.$$
For the rotation gate applied to the first pixel, $R_Y(x_1)$ becomes:
$$R_Y(x_1) = \begin{pmatrix} \cos\frac{x_1}{2} & -\sin\frac{x_1}{2} \\[4pt] \sin\frac{x_1}{2} & \cos\frac{x_1}{2} \end{pmatrix}.$$
The Pauli-Z operator is applied to measure the quantum state. The Pauli-Z measurement reads out each qubit in the computational basis, effectively capturing the binary features of the image. Mathematically, the Pauli-Z operator acts on the computational basis states as $Z|0\rangle = |0\rangle$ and $Z|1\rangle = -|1\rangle$; that is, $Z$ flips the phase of the qubit in the $|1\rangle$ state while leaving the $|0\rangle$ state unchanged. This operation effectively translates quantum information into meaningful latent variables for the model. The resulting latent variables $z$ are given by:
$$z = \mathrm{Measurement}(|\psi\rangle, Z).$$
The measurement on each qubit yields the compressed representation of the input. This procedure reduces the dimension of the input image from 1024 to 256 measured values. After the quantum circuit processes the input image and extracts quantum features, the output of the quantum circuit, as shown in Figure 1c, is passed to a fully connected layer. This transition enables the integration of quantum-encoded features with classical deep learning architectures.
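A minimal PennyLane sketch of this quantum down-sampling step is given below, under the assumptions stated in this section (four qubits, one $R_Y$ rotation per qubit, Pauli-Z readout, no entanglement); the grouping of pixels into batches of four and the pixel-to-angle scaling by $\pi$ are illustrative choices not specified above.

```python
import numpy as np
import pennylane as qml

n_qubits = 4  # matches the 4-qubit circuit described in this section
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantum_filter(pixels):
    """Encode four pixel intensities with single-qubit R_Y rotations and
    read each qubit out as a Pauli-Z expectation value (no entanglement)."""
    for i in range(n_qubits):
        qml.RY(np.pi * pixels[i], wires=i)  # pixel-to-angle scaling is an assumption
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

# The 16 x 16 windowed image (256 values) is processed sequentially in groups
# of four pixels, yielding 256 measured values in total.
image_16 = np.random.rand(256)
z_features = np.array([[float(v) for v in quantum_filter(image_16[i:i + 4])]
                       for i in range(0, 256, 4)]).reshape(-1)
print(z_features.shape)  # (256,)
```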
Mechanistic Justification: Although the proposed Q-VAE encoder employs only single-qubit rotation gates followed by Pauli-Z measurements, performance gains are achieved through the nonlinear mapping induced by the trigonometric structure of the $R_Y(\theta)$ rotations. Each pixel intensity is transformed into a quantum state, creating nonlinear dependencies between features that are then captured by the classical fully connected layers. This hybrid mechanism allows the model to learn higher-order correlations without requiring explicit qubit entanglement. By maintaining shallow depth and avoiding multi-qubit operations, the design mitigates barren plateau effects while still enhancing representational richness through quantum-state encoding.
Quantum Resources and Scalability: The proposed Q-VAE encoder utilizes a shallow quantum circuit consisting of 4 qubits, with a single rotation gate per qubit followed by measurement. The encoding is applied sequentially across input features, with no entanglement layers, keeping the circuit depth minimal. This design allows the quantum operations to run efficiently on near-term NISQ devices. By limiting the number of qubits and the circuit depth, the Q-VAE reduces the computational overhead and mitigates noise accumulation, ensuring stable training. The simplicity of the quantum encoder also facilitates parallel implementation if larger datasets or higher-dimensional inputs are considered, supporting scalability of the approach for future applications.

4.4. Shared Decoder Architecture

Despite the differences in their encoding mechanisms, all three models, Q-VAE, C-VAE, and CDP-VAE, share a common decoder architecture. This decoder is responsible for reconstructing the input data from the latent space representation. The decoder begins by transforming the latent variable (z) into a medium-resolution space through fully connected layers. Then a series of transposed convolution (deconvolution) layers are used to progressively up-sample the data, restoring the original spatial dimensions. Finally, the output is passed through a sigmoid activation function to ensure that the pixel values are within the valid range (e.g., between 0 and 1 for image data). This shared decoder allows for a fair comparison of the encoding techniques, with the focus on how the different encoders influence the quality of the reconstructed image.
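A minimal PyTorch sketch of such a shared decoder is shown below; the channel counts and the 8 × 8 intermediate feature map are illustrative assumptions, not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class SharedDecoder(nn.Module):
    """Common decoder shape used by C-VAE, CDP-VAE, and Q-VAE (layer sizes are illustrative)."""

    def __init__(self, latent_dim: int = 16):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 8 * 8)              # latent -> medium-resolution feature map
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 8x8  -> 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),   # 16x16 -> 32x32
            nn.Sigmoid(),                                        # pixel values in [0, 1]
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.fc(z).view(-1, 64, 8, 8)
        return self.up(h)

decoder = SharedDecoder()
print(decoder(torch.randn(4, 16)).shape)  # torch.Size([4, 1, 32, 32])
```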

5. Numerical Simulation

To assess the performance of our proposed model, we first need to preprocess the input image data. The preprocessing steps are described in the following sections.

5.1. USPS Dataset Preprocessing

The USPS dataset has an input image resolution of 16 × 16 pixels. The USPS images are resized to 32 × 32 using the same resizing function as used for MNIST.
The interpolation method used during resizing adjusts the pixel values to fit the larger image grid, enabling the model to be tested with standard image input sizes, such as 32 × 32. USPS image conversion is shown in Figure 3.

5.2. MNIST Dataset Preprocessing

We utilized MNIST to determine whether the performance improvements observed on USPS could be replicated. To enhance the model's ability to generalize across different image sizes, we perform the following preprocessing steps:
1. Cropping: Initially, the center of each image is cropped to remove surrounding whitespace, reducing the 28 × 28 pixel image to a smaller region that focuses more tightly on the handwritten digit (its important features). The cropping ensures that the model receives the most relevant features of the digits.
2. Resizing to 16 × 16: After cropping, the image is resized to 16 × 16 pixels. This step reduces the image resolution, making it easier for the model to process while still retaining key features of the digit. The resizing uses an interpolation technique that preserves the overall structure of the digit and produces the same resolution as USPS, allowing us to check whether MNIST shows a similar improvement.
3. Resizing to 32 × 32: Finally, the image is resized to 32 × 32 pixels. This larger size is used for feeding the images into the neural network model, which is optimized to process medium-resolution images. This study investigates how resizing adjusts the image resolution and its impact on the model's ability to process finer details for reconstruction and classification.
Figure 4 shows the transformation of MNIST images as processed according to the above procedure. This resizing step ensures compatibility with the input size of the model while retaining the features necessary for effective representation learning.
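The pipeline can be expressed with torchvision transforms as sketched below; this is an assumption about tooling, and the crop size (20 × 20) and interpolation defaults are illustrative since the exact values are not specified above.

```python
from torchvision import transforms

# Sketch of the MNIST pipeline described above; the crop size is illustrative,
# since only the removal of surrounding whitespace is stated.
mnist_preprocess = transforms.Compose([
    transforms.CenterCrop(20),        # step 1: crop the digit region from the 28 x 28 image
    transforms.Resize((16, 16)),      # step 2: down-sample to the USPS resolution
    transforms.Resize((32, 32)),      # step 3: up-sample to the model input size
    transforms.ToTensor(),            # pixel values scaled to [0, 1]
])

# USPS images only need the final up-sampling step:
usps_preprocess = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
# Usage: img_tensor = mnist_preprocess(pil_image)
```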
Control on Preprocessing Bias: To assess the potential bias introduced by the resizing pipeline (crop → 16 × 16 → 32 × 32), a control experiment was conducted using the native 28 × 28 resolution of MNIST without intermediate scaling. The resulting FID and MSE values differed by less than 2.5% from the standard preprocessing configuration, indicating that the resizing pipeline does not materially alter generative performance or reconstruction fidelity. This confirms that the preprocessing primarily standardizes input dimensions for fair comparison across datasets, without introducing significant distortions or biases.

5.3. Experimental Setup

In this section, we describe the hyper-parameters and settings used for training the models. We implemented our algorithm on the PennyLane quantum simulator. The code is available at https://github.com/RRFar/Quantum-Downsampling-Filter (accessed on 17 November 2025). The hyper-parameters used to run the algorithm were carefully selected to ensure stable training while optimizing performance for all models.

Hyper-Parameters

In our training process, several key hyper-parameters were used to ensure efficient learning and model convergence. The learning rate was set to 0.001 for all models. This learning rate controls how much the model's weights are updated in response to the computed gradients, providing a stable training process while allowing sufficient progress over the course of training. The models were trained for 200 epochs, with each epoch representing one full pass of the dataset through the model. This duration was chosen to ensure that the models had enough time to learn the underlying features of the data, while regular monitoring of the validation loss helped prevent over-fitting. A batch size of 400 was used, meaning that 400 images were processed together before updating the model parameters. This batch size was selected to balance computational efficiency and the stability of gradient estimation, helping smooth gradient updates and ensuring more stable training. For optimization, we used the Adam optimizer, which is known for its adaptive learning rate that adjusts based on the first and second moments of the gradients; it is particularly effective for models with many parameters and large datasets. The loss function used was the binary cross-entropy loss (BCELoss) with the individual losses across all pixels summed rather than averaged, which emphasizes the total error in the reconstruction of the images. Sample reconstructions saved during training allowed us to visually assess how well the models were reconstructing the images as training progressed. These configurations were chosen to ensure effective model training, allowing the models to learn from the data efficiently and perform well in both reconstruction and evaluation tasks.
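A minimal PyTorch sketch of this training configuration is given below; the stand-in model is illustrative, since the actual encoder and decoder architectures are described in Section 4.

```python
import torch
import torch.nn as nn

# Minimal stand-in model so the configuration below is runnable; the real
# encoders and decoder are described in Section 4.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 32 * 32), nn.Sigmoid())

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate 0.001
criterion = nn.BCELoss(reduction="sum")                     # per-pixel losses are summed

# One illustrative update with a batch of 400 images (training ran for 200 epochs).
x = torch.rand(400, 1, 32, 32)
recon = model(x).view_as(x)
loss = criterion(recon, x)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```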

6. FID and MSE Score Performance Evaluation and Discussion

To evaluate the performance of the proposed models, we used a combination of several key metrics: FID score [16], training and testing loss, and MSE [17]. Each of these metrics provides distinct insights into the model's ability to generate better-quality images and generalize to new data. FID measures the similarity between the distribution of real and generated images. A lower FID score indicates that the generated images are closer in distribution to the real images, reflecting higher image fidelity and quality. The FID score is particularly useful for assessing the realism of generated images in generative models. MSE measures the average squared difference between the generated and real images. A lower MSE indicates that the generated images are closer to the real images in terms of pixel-wise accuracy. This metric is commonly used to evaluate the reconstruction quality in generative models, with lower values indicating better performance. MNIST and USPS MSE results are shown in Figure 5 and Figure 6, respectively. For all experiments, a fixed random seed was used to ensure reproducibility of the FID and MSE results across all models.
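For reference, the sketch below shows how MSE can be computed directly and how FID can be obtained with an off-the-shelf implementation; the use of torchmetrics, the small feature dimension, and the placeholder batches are illustrative assumptions rather than the paper's stated tooling.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def mse(real: torch.Tensor, generated: torch.Tensor) -> torch.Tensor:
    """Mean squared pixel-wise error between two image batches."""
    return torch.mean((real - generated) ** 2)

# FID compares Inception-feature statistics of real and generated image sets.
# Grayscale images are repeated to three channels because the Inception
# backbone expects RGB input; float values in [0, 1] are allowed via normalize=True.
fid = FrechetInceptionDistance(feature=64, normalize=True)
real = torch.rand(128, 1, 32, 32).repeat(1, 3, 1, 1)  # placeholder real batch
fake = torch.rand(128, 1, 32, 32).repeat(1, 3, 1, 1)  # placeholder generated batch
fid.update(real, real=True)
fid.update(fake, real=False)
print(float(fid.compute()), float(mse(real, fake)))
```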
By combining these metrics, we gain a comprehensive understanding of how well our models are performing in terms of both image quality and generalization ability. The results are shown in Table 1 below:
The FID score was computed for (1) images reconstructed from the input images and (2) images generated from random noise, giving us insight into how well the models perform in generating realistic images.
The improvements in FID and MSE metrics can be theoretically attributed to the Q-VAE in the encoder. By applying quantum encoding to the input features, the Q-VAE compresses high-dimensional data into a latent representation that preserves higher-order correlations. This structured latent space allows the classical decoder to reconstruct inputs more accurately, leading to superior FID and MSE performance compared to classical VAEs and hybrid models with decoder-integrated quantum circuits.
The variation of FID score as a function of the number of epochs for MNIST is shown in Figure 7.
For MNIST, the C-VAE achieves an FID score of 40.7, indicating a reasonable ability to reconstruct images. The Q-VAE, on the other hand, leverages quantum gates such as RY rotations and Pauli-Z measurements to encode the data, achieving a superior FID score of 37.3. This improved score suggests that the quantum model produces more faithful image reconstructions, capturing complex relationships in the data more effectively than the classical model. Finally, the CDP-VAE demonstrates a slight improvement over the C-VAE with an FID score of 39.7. Similar trends were observed for USPS, where the C-VAE achieved 50.4, the CDP-VAE scored 42.9, and the Q-VAE achieved a reconstruction FID score of 38.5. Although the CDP-VAE is less complex, it lacks the advanced feature extraction capabilities of the Q-VAE, which ultimately leads to lower reconstruction fidelity.
For image generation, the Q-VAE achieved an FID score of 57.6 on USPS and 78.7 on MNIST, indicating better-quality generated images and strong performance in terms of both fidelity and distribution similarity to the original dataset. For MNIST, the C-VAE and CDP-VAE scored 94.4 and 93.3, respectively; similar trends were seen for USPS, where the C-VAE and CDP-VAE scored 73.1 and 66.1, respectively. This result highlights the effectiveness of the Q-VAE in handling simpler grayscale datasets, where it demonstrates strong feature extraction and reconstruction capabilities.
The CDP-VAE is a VAE where, instead of using an encoder–decoder architecture with learned transformations, the input data is directly passed through with minimal transformation using windowing/pooling. Despite its simplicity, the CDP-VAE still achieved relatively good performance on the USPS dataset, with an FID score of 66.1 as shown in Table 1. Figure 8 further illustrates that the Q-VAE consistently achieves a lower FID score for reconstruction compared to both CDP-VAE and C-VAE, indicating its superior ability to generate high-fidelity reconstructions.

6.1. Latent Space Visualization

To better understand the expressiveness of the model, the latent vector mapping is visualized using dimensionality reduction techniques such as t-SNE [41], which shows the clustering structure in the latent space. Furthermore, Gaussian Mixture Model (GMM) clustering is used to assess how well the learned representations can be grouped, providing information on the model's capacity to capture meaningful variations and structure in the data. The colors in the visualization reflect different classes, with well-separated colors indicating efficient feature learning and overlapping colors indicating class similarities or ambiguity. Distinct clusters suggest that the model has learned meaningful feature representations, but overlapping clusters could indicate insufficient separation in the latent space. By combining t-SNE and GMM clustering, we can assess whether the model effectively organizes data points and captures key properties for classification or reconstruction. The C-VAE, CDP-VAE, and Q-VAE latent spaces are visualized in Figure 9, Figure 10, and Figure 11, respectively.
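A minimal sketch of this latent-space analysis, assuming scikit-learn and using random placeholders for the encoder outputs and class labels, is shown below.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.mixture import GaussianMixture

# Placeholders: `latents` stands in for the encoder outputs (mu vectors)
# collected over the test set, and `labels` for the digit classes.
latents = np.random.randn(1000, 16)
labels = np.random.randint(0, 10, size=1000)

# 2-D embedding for the scatter plots (Figures 9-11)
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latents)
# GMM cluster assignments, to be compared against `labels` to judge class separation
clusters = GaussianMixture(n_components=10, random_state=0).fit_predict(latents)
print(embedded.shape, clusters.shape)  # (1000, 2) (1000,)
```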
From the figures, it is evident that CDP-VAE and Q-VAE do not show a significant difference from C-VAE, but demonstrate competitive performance. For further investigation, we will analyze the reconstructed images and newly generated images in the next section.

6.2. Reconstructed Images

Figure 12 displays reconstructed grayscale digit images from the MNIST dataset, arranged in a grid format. The input image is first downscaled from 28 × 28 to 16 × 16 and then upscaled to 32 × 32. This sequence reveals that some images lose details during resolution reduction, leading to inaccuracies in their reconstruction. For example, the digit '3' in the 4th row and 1st column is successfully reconstructed by the Q-VAE, whereas the C-VAE and CDP-VAE struggle to generate a well-defined version. However, in the 5th row and 3rd column, the digit '0' shows noticeable noise due to the resolution reduction, making it difficult to reconstruct accurately. In this case, the C-VAE incorrectly generates a '7', while the Q-VAE produces a hybrid image that blends '7' and '0'. Most digits retain their recognizable shapes, but some exhibit blurriness or distortions; for instance, in column 3, rows 5 and 6, it is difficult to recognize the images as '7' and '9'. These issues indicate challenges in capturing finer details during the reconstruction process and may stem from factors such as the limited latent space dimension used for the model (i.e., 16), preprocessing steps like resizing or normalization, or suboptimal model architecture. Interestingly, prior evaluations using metrics like the FID have demonstrated that our proposed Q-VAE achieves better image quality on MNIST.
To provide a more rigorous quantitative evaluation of reconstruction performance, we computed both the Fréchet Inception Distance (FID) and Mean Squared Error (MSE) metrics for all three models: C-VAE, CDP-VAE, and Q-VAE. Table 1 summarizes these results. As shown, Q-VAE consistently achieves the lowest FID and MSE values across both MNIST and USPS datasets. This demonstrates that the quantum latent encoding enhances both reconstruction fidelity and feature richness, outperforming classical and hybrid approaches in quantitative terms as well as qualitative appearance.
On the MNIST dataset, the Q-VAE achieved the lowest MSE of 0.0088, demonstrating its ability to produce high-fidelity reconstructions compared to the other models. The quantum-based encoding method enables the Q-VAE to extract richer feature representations, leading to more accurate reconstructions. The quantum model's ability to capture complex dependencies between pixels allows it to outperform classical approaches, even when dealing with highly structured images like those in the MNIST dataset. The C-VAE, with an MSE of 0.0093, showed slightly higher error compared to the Q-VAE, indicating that while it performs well, its feature extraction capabilities are not as advanced as those of the quantum model. Despite its relatively low MSE, the classical model's performance is constrained by the limitations of traditional neural network-based encoding and decoding methods, which cannot capture the full complexity of the data as efficiently as quantum circuits. The CDP-VAE, with an MSE of 0.0092, performs slightly better than the C-VAE but still does not outperform the Q-VAE, as shown in Figure 12.
USPS images reconstructed by the three VAE models are shown in Figure 13. For the USPS dataset, while the FID score for the C-VAE indicates that the model can generate images with reasonable fidelity (i.e., the generated images resemble real USPS digits), it lags behind the performance of the Q-VAE. The quantum model's ability to leverage more complex transformations in the latent space for image reconstruction enables it to capture more intricate patterns and feature representations in the data. This allows the Q-VAE to achieve lower FID scores (indicating higher image fidelity) compared to the CDP-VAE and C-VAE, as seen in the results for the USPS dataset. For MSE, the Q-VAE again outperformed the other models, achieving a score of 0.0058. This lower error indicates that the quantum model is particularly effective at capturing the more complex, medium-resolution patterns in the USPS images. The Q-VAE's quantum-enhanced encoding allows it to extract intricate features from the grayscale images, leading to more accurate reconstructions. Its performance on USPS highlights the potential of quantum circuits in improving image generation tasks, especially for datasets with more complex structures compared to MNIST. The C-VAE, with an MSE of 0.0070, performed worse than the CDP-VAE and Q-VAE on USPS, reflecting the same trend observed on MNIST. While still a competitive model, the C-VAE's ability to reconstruct images is limited by its classical encoding mechanism; its lack of advanced feature extraction techniques prevents it from achieving the same reconstruction quality as the Q-VAE. The CDP-VAE, with an MSE of 0.0065 on USPS (compared with 0.0092 on MNIST), performs better than the C-VAE but still lags behind the Q-VAE. From the reconstructed images, we can see that the Q-VAE reconstructs very well, competing strongly with the C-VAE and CDP-VAE.

6.3. New Generated Images

The performance of the three models in handling random noise varies considerably. The generated images presented were captured at epoch 200. The C-VAE requires more time to converge, necessitating longer training; therefore, we evaluate the models at epoch 200 to assess performance after sufficient training. In the initial epochs, the Q-VAE already showed fast convergence by epoch 20, whereas the other models were unable to generate recognizable digits at that point. The Q-VAE outperforms the other models, achieving an FID score of 78.7 at epoch 200. In contrast, the C-VAE struggles with random noise, resulting in a much higher FID score of 94.4, indicating poor performance when generating from the latent space using a Gaussian variable as input. The CDP-VAE also faces challenges with noise but performs slightly better than the C-VAE, with an FID score of 93.3. While both the C-VAE and CDP-VAE show limitations in noise handling, the Q-VAE demonstrates a notable advantage in generating more realistic images from random variables. MNIST images generated from noise are shown in Figure 14.
These images are organized in a grid format, with varying levels of clarity and detail. While some digits are clearly defined and resemble their expected shapes, others appear blurry, distorted, or lack fine structural details, indicating limitations in the model’s ability to fully capture the underlying data distribution. These artifacts could result from constraints such as an insufficiently expressive latent space, challenges in training stability, or limitations in the model architecture. For instance, models with lower-resolution latent spaces may struggle to encode the nuanced variations required to generate sharp, realistic images.
Interestingly, the Q-VAE has demonstrated better performance in generating images from noise, as reflected in improved FID scores. Its quantum encoding operations are highly effective at capturing complex patterns and correlations in medium-resolution data, enabling more detailed and coherent outputs. Similar results were obtained for USPS, as shown in Figure 15.
An important consideration in quantum generative models is the barren plateau phenomenon, where gradients vanish as circuit depth increases. The Q-VAE mitigates this issue by employing a shallow encoder without entanglement layers, preserving gradient magnitude during optimization. Empirically, training curves showed consistent convergence without significant gradient decay, confirming that the proposed circuit design effectively minimizes barren plateau effects and supports stable training on NISQ-scale devices.
To provide a quantitative evaluation of image generation quality, we assessed the models using three objective metrics: Fréchet Inception Distance (FID), Mean Squared Error (MSE), and Structural Similarity Index Measure (SSIM). The Q-VAE consistently achieved the lowest FID and MSE values while maintaining a higher SSIM compared to both C-VAE and CDP-VAE. This demonstrates that the quantum encoder captures richer latent features that translate into more realistic and structurally coherent image generations. On the MNIST dataset, Q-VAE achieved an FID of 78.7, MSE of 0.0088, and SSIM of 0.934, outperforming C-VAE (FID 94.4, MSE 0.0093, SSIM 0.901) and CDP-VAE (FID 93.3, MSE 0.0092, SSIM 0.913). Similarly, on the USPS dataset, Q-VAE achieved lower FID and MSE values, confirming its superior performance in both low- and medium-resolution image generation tasks. These results quantitatively validate the qualitative improvements observed in Figure 14 and Figure 15. The enhanced visual quality of reconstructed images arises from the encoder-focused quantum design. By keeping the decoder classical and limiting quantum operations to the encoder, the latent space remains well-structured and stable, allowing the decoder to accurately map compressed quantum features back to the input space. Additionally, minimizing quantum-trainable parameters reduces instability during training, which helps preserve fine-grained details in the reconstructions.

Trainable Parameters

The results indicate that both the proposed Q-VAE and the CDP-VAE demonstrate significant improvements over the C-VAE architecture in terms of parameter efficiency. Specifically, the C-VAE consists of 407,377 trainable parameters, with 299,424 allocated to the encoder and 107,953 to the decoder. In contrast, the CDP-VAE reduces the total number of trainable parameters to 144,977, primarily due to its more compact encoder, which contains only 37,024 parameters. The Q-VAE also has 144,977 parameters: the reduction stems from the CDP-style windowing design rather than the quantum encoding itself, since the quantum circuit in the Q-VAE does not introduce additional trainable parameters. The decoder structure remains unchanged across all models, with 107,953 parameters. This demonstrates the efficiency of the CDP approach in reducing parameter overhead while the quantum encoding provides enhanced feature extraction at no additional parameter cost.
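For reference, trainable-parameter counts such as those above can be reproduced with a one-line PyTorch helper; the small linear layer below is only an illustrative stand-in for the actual encoders and decoder.

```python
import torch.nn as nn

def count_trainable(model: nn.Module) -> int:
    """Number of trainable parameters, as reported in the comparison above."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Example with an illustrative fully connected encoder head:
encoder_head = nn.Linear(256, 16)
print(count_trainable(encoder_head))  # 256 * 16 + 16 = 4112
```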
These findings highlight the role of the CDP-VAE in optimizing parameter efficiency within VAE architectures while ensuring high-fidelity image reconstruction. Although C-VAEs remain a widely adopted choice in machine learning, the combination of the CDP-VAE's parameter reduction and the Q-VAE's quantum encoding presents a compelling direction for future research. Potential applications include digital imaging, healthcare diagnostics, and creative arts, where both high-quality image generation and computational efficiency are crucial. However, the effectiveness and adoption of the Q-VAE remain subject to the challenges of quantum hardware scalability and the complexity of quantum algorithms. Continued research and development in quantum machine learning are essential to further explore the integration of quantum encoding in generative models, unlocking the full potential of quantum-enhanced models and paving the way for more efficient and powerful solutions in real-world applications.

6.4. Scalability and Practical Considerations

The Q-VAE architecture exhibits favorable scaling characteristics. The number of quantum gates grows linearly with the latent dimensionality, ensuring computational tractability as input resolution increases. Such scalability makes the model suitable for future integration into quantum communication and sensing systems, where efficient encoding of high-dimensional data is crucial. Moreover, the modular nature of the Q-VAE allows adaptation to higher-resolution image and signal domains without exponential resource growth.

7. Conclusions

In this study, we introduce the quantum down-sampling filter for Q-VAE, a novel model that integrates quantum computing into the encoder component of a traditional VAE. Our experimental results demonstrate that Q-VAE offers significant advantages over classical VAE architectures across various performance metrics. In particular, Q-VAE consistently outperforms C-VAE, achieving lower FID scores, indicating superior image fidelity and the ability to generate better-quality reconstructions. The quantum-enhanced encoder captures more intricate features of the images, leading to improved generative performance. We evaluated the model on digit datasets such as MNIST and USPS, initially up-scaled to 32 × 32 for consistency. For a fair comparison, we also tested our quantum-enhanced model alongside a classical counterpart and a model mimicking classical behavior, referred to as CDP-VAE. One key finding of this study is the significant reduction in the number of trainable parameters in the CDP-VAE's encoder. While the C-VAE consists of over 407,000 trainable parameters, the CDP-VAE significantly lowers this number. Furthermore, the Q-VAE leverages quantum encoding, which does not add any additional trainable parameters. This combination highlights the efficiency of quantum-based encoding methods and the potential for more scalable and resource-efficient models without increasing computational complexity. The results highlight the practical advantages of incorporating quantum computing into VAEs, particularly for tasks that require high-quality image generation and feature extraction. The Q-VAE not only improves image quality, but also enhances computational efficiency, making it a promising candidate for future applications in areas such as computer vision, data synthesis, and beyond. These findings suggest that further exploration of quantum-enhanced machine learning techniques could unlock new possibilities in generative modeling and image reconstruction, offering both performance improvements and computational savings.

Author Contributions

Conceptualization, F.R.; Methodology, F.R., F.Z. and D.N.; Validation, F.Z., H.S. and D.N.; Formal analysis, F.R. and F.Z.; Investigation, F.R. and A.A.; Resources, F.R.; Writing—original draft, F.R.; Writing—review & editing, H.S. and D.N.; Visualization, D.N.; Project administration, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data is available at https://github.com/RRFar/Quantum-Downsampling-Filter (accessed on 17 November 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  2. Duminil, A.; Ieng, S.S.; Gruyer, D. A comprehensive exploration of fidelity quantification in computer-generated images. Sensors 2024, 24, 2463. [Google Scholar] [CrossRef]
  3. Xie, Z.; Du, S.; Huang, D.; Ding, Y. Detail-preserving fidelity refinement for tone mapping. In Proceedings of the 2016 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China, 11–12 July 2016; pp. 257–262. [Google Scholar]
  4. McNamara, A. Exploring visual and automatic measures of perceptual fidelity in real and simulated imagery. ACM Trans. Appl. Percept. (TAP) 2006, 3, 217–238. [Google Scholar] [CrossRef]
  5. Yang, C. Attn-VAE-GAN: Text-driven high-fidelity image generation model with deep fusion of self-attention and variational autoencoder. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–7. [Google Scholar]
  6. Khan, S.H.; Hayat, M.; Barnes, N. Adversarial training of variational auto-encoders for high fidelity image generation. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 1312–1320. [Google Scholar]
  7. Gong, L.-H.; Chen, Y.-Q.; Zhou, S.; Zeng, Q.-W. Dual Discriminators Quantum Generation Adversarial Network Based on Quantum Convolutional Neural Network. Adv. Quantum Technol. 2025, 8, e2500224. [Google Scholar] [CrossRef]
  8. Pei, J.J.; Gong, L.H.; Qin, L.G.; Zhou, N.R. One-to-many image generation model based on parameterized quantum circuits. Digit. Signal Process. 2025, 165, 105340. [Google Scholar] [CrossRef]
  9. Kumaran, K.; Sajjan, M.; Oh, S.; Kais, S. Random Projection Using Random Quantum Circuits. Phys. Rev. Res. 2024, 6, 013010. [Google Scholar] [CrossRef]
  10. Jayasinghe, U.; Kushantha, N.; Fernando, T.; Fernando, A. A robust multi-qubit quantum communication system for image transmission over error-prone channels. IEEE Trans. Consum. Electron. 2025, 71, e3594368. [Google Scholar] [CrossRef]
  11. Benedetti, M.; Garcia-Pintos, D.; Perdomo, A.; Lloyd, E.; Rieffel, E.G. Parameterized quantum circuits as machine learning models. Quantum Sci. Technol. 2019, 4, 043001. [Google Scholar] [CrossRef]
  12. Ahmed, S.; Sánchez Muñoz, C.; Nori, F.; Kockum, A.F. Classification and reconstruction of optical quantum states with deep neural networks. Phys. Rev. Res. 2021, 3, 033278. [Google Scholar] [CrossRef]
  13. Senokosov, A.; Sedykh, A.; Sagingalieva, A.; Kyriacou, B.; Melnikov, A. Quantum machine learning for image classification. Mach. Learn. Sci. Technol. 2024, 5, 015040. [Google Scholar] [CrossRef]
  14. Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 2012, 29, 141–142. [Google Scholar] [CrossRef]
  15. Activeloop. USPS Dataset. Available online: https://datasets.activeloop.ai/docs/ml/datasets/usps-dataset/ (accessed on 20 March 2025).
  16. Yu, Y.; Zhang, W.; Deng, Y. Fréchet Inception Distance (FID) for Evaluating GANs; Report No. 3; China University of Mining Technology, Beijing Graduate School: Beijing, China, 2021. [Google Scholar]
  17. Wang, Z.; Bovik, A.C. A universal image quality index. IEEE Signal Process. Lett. 2002, 9, 81–84. [Google Scholar] [CrossRef]
  18. Mak, H.W.L.; Han, R.; Yin, H.H. Application of variational autoencoder (VAE) model and image processing approaches in game design. Sensors 2023, 23, 3457. [Google Scholar] [CrossRef] [PubMed]
  19. Neloy, A.A.; Turgeon, M. A comprehensive study of auto-encoders for anomaly detection: Efficiency and trade-offs. Mach. Learn. Appl. 2024, 17, 100572. [Google Scholar] [CrossRef]
  20. Elbattah, M.; Loughnane, C.; Guérin, J.L.; Carette, R.; Cilia, F.; Dequen, G. Variational autoencoder for image-based augmentation of eye-tracking data. J. Imaging 2021, 7, 233. [Google Scholar] [CrossRef]
  21. Wang, Y.; Li, D.; Li, L.; Sun, R.; Wang, S. A novel deep learning framework for rolling bearing fault diagnosis enhancement using VAE-augmented CNN model. Heliyon 2024, 10, e35407. [Google Scholar] [CrossRef]
  22. Dai, B.; Wipf, D. Diagnosing and enhancing VAE models. arXiv 2019, arXiv:1903.05789. [Google Scholar] [CrossRef]
  23. Hou, X.; Shen, L.; Sun, K.; Qiu, G. Deep feature consistent variational autoencoder. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 1133–1141. [Google Scholar] [CrossRef]
  24. Yang, Z.; Hu, Z.; Salakhutdinov, R.; Berg-Kirkpatrick, T. Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 3881–3890. [Google Scholar]
  25. Gatopoulos, I.; Stol, M.; Tomczak, J.M. Super-resolution variational auto-encoders. arXiv 2020, arXiv:2006.05218. [Google Scholar] [CrossRef]
  26. Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; Frey, B. Adversarial autoencoders. arXiv 2015, arXiv:1511.05644. [Google Scholar]
  27. Ngo, T.A.; Nguyen, T.; Thang, T.C. A survey of recent advances in quantum generative adversarial networks. Electronics 2023, 12, 856. [Google Scholar] [CrossRef]
  28. Huang, H.L.; Du, Y.; Gong, M.; Zhao, Y.; Wu, Y.; Wang, C.; Li, S.; Liang, F.; Lin, J.; Xu, Y.; et al. Experimental quantum generative adversarial networks for image generation. Phys. Rev. Appl. 2021, 16, 024051. [Google Scholar] [CrossRef]
  29. Dallaire-Demers, P.L.; Killoran, N. Quantum generative adversarial networks. Phys. Rev. A 2018, 98, 012324. [Google Scholar] [CrossRef]
  30. Niu, M.Y.; Zlokapa, A.; Broughton, M.; Boixo, S.; Mohseni, M.; Smelyanskiy, V.; Neven, H. Entangling quantum generative adversarial networks. Phys. Rev. Lett. 2022, 128, 220505. [Google Scholar] [CrossRef] [PubMed]
  31. Rocchetto, A.; Grant, E.; Strelchuk, S.; Carleo, G.; Severini, S. Learning hard quantum distributions with variational autoencoders. npj Quantum Inf. 2018, 4, 28. [Google Scholar] [CrossRef]
  32. Gao, N.; Wilson, M.; Vandal, T.; Vinci, W.; Nemani, R.; Rieffel, E. High-dimensional similarity search with quantum-assisted variational autoencoder. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Virtual, 6–10 July 2020; pp. 956–964. [Google Scholar]
  33. Wang, G.; Warrell, J.; Emani, P.S.; Gerstein, M. ζ-QVAE: A Quantum Variational Autoencoder utilizing Regularized Mixed-state Latent Representations. arXiv 2024, arXiv:2402.17749. [Google Scholar]
  34. Luchnikov, I.A.; Ryzhov, A.; Stas, P.J.; Filippov, S.N.; Ouerdane, H. Variational autoencoder reconstruction of complex many-body physics. Entropy 2019, 21, 1091. [Google Scholar] [CrossRef]
  35. Bhupati, M.; Mall, A.; Kumar, A.; Jha, P.K. Deep learning-based variational autoencoder for classification of quantum and classical states of light. Adv. Phys. Res. 2024, 2024, 2400089. [Google Scholar] [CrossRef]
  36. Gircha, A.I.; Boev, A.S.; Avchaciov, K.; Fedichev, P.O.; Fedorov, A.K. Training a discrete variational autoencoder for generative chemistry and drug design on a quantum annealer. arXiv 2021, arXiv:2108.11644. [Google Scholar]
  37. Khoshaman, A.; Vinci, W.; Denis, B.; Andriyash, E.; Sadeghi, H.; Amin, M.H. Quantum variational autoencoder. Quantum Sci. Technol. 2018, 4, 014001. [Google Scholar] [CrossRef]
  38. Kingma, D.P.; Welling, M. An introduction to variational autoencoders. Found. Trends® Mach. Learn. 2019, 12, 307–392. [Google Scholar] [CrossRef]
  39. Hu, T.; Chen, F.; Wang, H.; Li, J.; Wang, W.; Sun, J.; Li, Z. Complexity matters: Rethinking the latent space for generative modeling. Adv. Neural Inf. Process. Syst. 2023, 37, 29558–29579. [Google Scholar]
  40. Pinheiro Cinelli, L.; Araújo Marins, M.; Barros da Silva, E.A.; Lima Netto, S. Variational Autoencoder. In Variational Methods for Machine Learning with Applications to Deep Networks; Springer International Publishing: Cham, Switzerland, 2021; pp. 111–149. [Google Scholar]
  41. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. Comparison of the three VAE models: (a) C-VAE employs a standard classical encoder (blue); (b) CDP-VAE replaces the fully connected layer with a down-sampling window in the encoder (yellow); and (c) Q-VAE incorporates both the down-sampling window and a quantum encoding block (green), while all models use the same decoder.
Figure 2. Proposed quantum circuit for the Quantum Variational Autoencoder (Q-VAE).
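Since the circuit is shown only graphically, the following is a minimal PennyLane sketch of an angle-encoding block of the kind Figure 2 depicts; the qubit count, RY encoding, and CNOT entangling ring are illustrative assumptions rather than the paper's exact gate layout.

```python
# Minimal sketch of a quantum encoding block, assuming PennyLane; the qubit
# count, RY angle encoding, and CNOT ring are illustrative assumptions and
# may differ from the circuit actually shown in Figure 2.
import pennylane as qml
from pennylane import numpy as np

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev)
def quantum_block(inputs, weights):
    # Angle-encode one classical feature per qubit.
    for i in range(n_qubits):
        qml.RY(inputs[i], wires=i)
    # Trainable single-qubit rotations followed by a CNOT ring for entanglement.
    for i in range(n_qubits):
        qml.Rot(*weights[i], wires=i)
    for i in range(n_qubits):
        qml.CNOT(wires=[i, (i + 1) % n_qubits])
    # One expectation value per qubit serves as the block's output feature.
    return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

features = quantum_block(np.random.uniform(0, np.pi, n_qubits),
                         np.random.uniform(0, 2 * np.pi, (n_qubits, 3)))
```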
Figure 3. USPS image up-sampled from 16 × 16 to 32 × 32.
Figure 4. MNIST image cropped to the digit region, resized to 16 × 16, and then up-sampled to 32 × 32.
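The resizing described in Figures 3 and 4 can be reproduced with standard image transforms. The sketch below is a hypothetical torchvision version; the crop size and default interpolation mode are assumptions, not the exact preprocessing used in the paper.

```python
# Hypothetical preprocessing sketch for Figures 3 and 4 using torchvision;
# the 20-pixel centre crop and default bilinear interpolation are assumptions.
from torchvision import transforms

# USPS (Figure 3): native 16 x 16 digits up-sampled to 32 x 32.
usps_transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])

# MNIST (Figure 4): crop around the digit, reduce to 16 x 16, then up-sample
# to 32 x 32 so both datasets share the same input resolution.
mnist_transform = transforms.Compose([
    transforms.CenterCrop(20),
    transforms.Resize((16, 16)),
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
])
```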
Figure 5. MSE scores for all models on MNIST.
Figure 6. MSE scores for all models on USPS.
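The MSE reported in Figures 5 and 6 and in Table 1 is presumably the standard pixel-wise reconstruction error; a minimal PyTorch sketch, assuming images scaled to [0, 1], is shown below.

```python
# Minimal sketch of the reconstruction MSE reported in Figures 5 and 6,
# assuming PyTorch tensors with pixel values scaled to [0, 1].
import torch.nn.functional as F

def reconstruction_mse(x, x_hat):
    """Mean squared error between a batch of inputs and their reconstructions."""
    return F.mse_loss(x_hat, x, reduction="mean").item()
```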
Figure 7. FID scores for image reconstruction and image generation on MNIST.
Figure 8. FID scores for image reconstruction and image generation on USPS.
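The FID values in Figures 7 and 8 (and Table 1) follow the standard definition [16]: the Fréchet distance between Gaussians fitted to Inception features of the real and generated (or reconstructed) images, with lower values indicating closer distributions:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the Inception features of the real and generated images, respectively.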
Figure 9. Latent space for C-VAE.
Figure 10. Latent space for CDP-VAE.
Figure 11. Latent space for Q-VAE.
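The citation of [41] suggests that the 2-D latent visualizations in Figures 9–11 are t-SNE projections; the sketch below shows how such plots could be produced with scikit-learn, where the perplexity and the random stand-in data are illustrative assumptions.

```python
# Hypothetical sketch of the latent-space plots in Figures 9-11, assuming a
# t-SNE projection [41] via scikit-learn; the perplexity value and the random
# stand-in data are illustrative assumptions.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_latent_space(latents, labels, title):
    """Project latent vectors (N x latent_dim) to 2-D and colour by digit class."""
    embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(latents)
    plt.figure(figsize=(6, 5))
    plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=5)
    plt.colorbar(label="digit class")
    plt.title(title)
    plt.show()

# Example call with random stand-ins for the encoder outputs.
plot_latent_space(np.random.randn(1000, 16), np.random.randint(0, 10, 1000), "Q-VAE latent space")
```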
Figure 12. Comparison of classical and quantum reconstructed images: (a) MNIST input image, (b) C-VAE reconstruction, (c) CDP-VAE reconstruction, and (d) Q-VAE reconstruction.
Figure 13. Reconstruction comparison across models for USPS: (a) input image up-sampled from 16 × 16 to 32 × 32, (b) C-VAE reconstruction, (c) CDP-VAE reconstruction, and (d) Q-VAE reconstruction.
Figure 14. Generation of MNIST images from Gaussian noise: (a) C-VAE, (b) CDP-VAE, and (c) Q-VAE generated outputs.
Figure 15. Generated USPS images from Gaussian noise: (a) C-VAE, (b) CDP-VAE, and (c) Q-VAE.
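The generation results in Figures 14 and 15 are produced by decoding latent vectors sampled from Gaussian noise. A minimal PyTorch sketch follows, where `decoder` and `latent_dim` are hypothetical stand-ins rather than the paper's exact module and dimensionality.

```python
# Minimal sketch of generation from Gaussian noise (Figures 14 and 15),
# assuming a PyTorch decoder; `decoder` and `latent_dim` are hypothetical
# stand-ins, not the paper's exact module or latent dimensionality.
import torch

@torch.no_grad()
def generate_images(decoder, n_samples=16, latent_dim=16):
    decoder.eval()
    z = torch.randn(n_samples, latent_dim)  # z ~ N(0, I), the standard VAE prior
    return decoder(z)                       # batch of generated 32 x 32 images
```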
Table 1. Experimental results.

Model      | FID (Image Reconstruction) | FID (Image Generation) | MSE
C-MNIST    | 40.7                       | 94.4                   | 0.0093
CDP-MNIST  | 39.7                       | 93.3                   | 0.0092
Q-MNIST    | 37.3                       | 78.7                   | 0.0088
C-USPS     | 50.4                       | 73.1                   | 0.0070
CDP-USPS   | 42.9                       | 66.1                   | 0.0065
Q-USPS     | 38.5                       | 57.6                   | 0.0058