Article

Learning the N-Input Parity Function with a Single-Qubit and Single-Measurement Sampling

by Antonia Tsili 1,2,*, Georgios Maragkopoulos 1,2, Aikaterini Mandilara 1,2 and Dimitris Syvridis 1,2
1 Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Panepistimioupolis, 15784 Ilisia, Greece
2 Eulambia Advanced Technologies, Agiou Ioannou 24, Building Complex C, 15342 Ag. Paraskevi, Greece
* Author to whom correspondence should be addressed.
Electronics 2025, 14(5), 901; https://doi.org/10.3390/electronics14050901
Submission received: 26 January 2025 / Revised: 15 February 2025 / Accepted: 24 February 2025 / Published: 25 February 2025
(This article belongs to the Special Issue Advances in Quantum Machine Learning)

Abstract:
The parity problem, a generalization of the XOR problem to higher-dimensional inputs, is a challenging benchmark for evaluating learning algorithms, since its complexity grows with the dimension of the feature space. In this work, a single-qubit classifier is developed which can efficiently learn the parity function from input data. Despite the qubit model's simplicity, the solution landscape created in the context of the parity problem offers an attractive test bed for exploring optimization methods for quantum classifiers. We propose a new optimization method, Ensemble Stochastic Gradient Descent (ESGD), in which density matrices describing batches of quantum states are incorporated into the loss function. We demonstrate that ESGD outperforms both Gradient Descent and Stochastic Gradient Descent on the aforementioned problem. Additionally, we show that applying ESGD with only one measurement per data input does not lead to any performance degradation. Our findings not only highlight the potential of a single-qubit model, but also offer valuable insights into the use of density matrices for optimization. Further, we complement these results with interesting findings arising from the employment of Doubly Stochastic Gradient Descent for training quantum variational circuits.

1. Introduction: The Parity Problem

Representing the XOR function is a fundamental problem for neural networks (NNs), as it involves solving a nonlinearly separable classification problem. Unlike the other problems represented by logical functions with two-bit inputs, the XOR problem requires the use of two layers of neurons [1]. The parity function extends the XOR function from 2 to $N$ inputs, forming a proportionally challenging binary classification problem over $2^N$ inputs, an illustration of which in three-dimensional space ($N = 3$) is depicted in Figure 1. According to the universal approximation theorem, any function, including the parity function, can be represented by a deep NN. However, it is always preferable to employ an NN architecture with minimal cost in terms of the number of synapses, neurons, and layers. Another key consideration is the morphology of the solution landscape, as well as the representational capacity and extrapolation capabilities of the chosen NN model. An optimal NN architecture using perceptrons with one hidden layer of size $N$ is described in [2,3], while a modular NN architecture is presented in [3] and a single-product-unit approach is discussed in [4].
Variational quantum circuits (VQCs) [5,6,7] combine the properties of quantum circuits and classical methods to solve classification tasks in a hybrid way. In particular, quantum circuits perform data processing, which encompasses data encoding, label extraction, etc., whereas optimization is handled by classical, typically local, optimization algorithms. The data flow follows a sequential combination of the processes, meaning that the data collected via measurements on the output quantum state of the VQC are subsequently exploited to perform the optimization task. The purpose of this hybrid formulation is to optimally harvest the best of both worlds in an attempt to create new efficient learning methods. However, quantum measurement results are intrinsically stochastic, presenting a challenge in developing suitable optimization algorithms [8].
In this work, our aim is to employ an optimized VQC model in place of an NN and demonstrate that it can be efficiently trained to solve the N-bit parity problem. It is widely known that all Boolean logical functions can be represented by a quantum circuit [9]. Considering this, one can easily construct a quantum Boolean parity function using N qubits as input and an ancilla qubit to register the output, as shown in Figure 2. While the problem of computing the parity of an oracle function has been explored in the literature, it has been shown [10] that at least N / 2 oracle queries are required, offering no quantum advantage. Finally, another class of parity problems has been used to evaluate the effectiveness of different VQC models in [11], further highlighting the importance of finding an efficient way of solving the parity problem.
The remainder of the manuscript is organized around two main concepts: the first concerns the presentation of the quantum variational model for solving the parity problem; the second follows the methodology for training this model. More specifically, in Section 2, we propose a variational single-qubit classifier and investigate the solution landscape for the parity problem. In Section 3, we present the technical details of the training procedure's preparation, and we showcase the application of optimization methods by employing Gradient Descent (GD) and obtaining some initial results. In Section 4, we propose and test a stochastic gradient method that is, to the best of our knowledge, new, in which batches are represented by density matrices. This method can be transformed into a doubly stochastic approach if single stochastic measurement outcomes are considered and averaged over the batches. We present results from numerical tests on both methods, demonstrating the effectiveness of the approach.

2. A Single-Qubit Classifier for the Parity Problem

Given a series of $N$ bits, $\mathbf{x} = (x_1, x_2, \ldots, x_N)$, where $x_j \in \{0, 1\}$, $j \in [N]$, the parity function determines whether the value "1" appears an even or odd number of times in the series. The parity problem can, thus, be considered a classification problem, where the $n = 2^N$ data are divided into two equally populated classes: class A (even parity) and class B (odd parity), as shown in Figure 1. This problem can also be extended to real numbers, i.e., noisy data $\tilde{\mathbf{x}}$, generated by adding Gaussian noise $\mathcal{N}(\mu, \sigma^2)$ to the integer values of $\mathbf{x}$.
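As a concrete illustration, the dataset described above can be generated with a short script. This is a sketch of our own (the paper's code is written in Mathematica); the helper name `parity_dataset` is hypothetical.

```python
# Sketch: generate the 2^N parity-labelled inputs and, optionally, their
# noisy counterparts, as described in the text.
import itertools
import numpy as np

def parity_dataset(N, sigma=0.0, rng=None):
    """Return all 2^N bit-strings X and labels y = +1 (even) / -1 (odd).

    With sigma > 0, Gaussian noise N(0, sigma^2) is added to each bit,
    producing the "noisy data" mentioned in the text.
    """
    rng = rng or np.random.default_rng(0)
    X = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
    y = 1 - 2 * (X.sum(axis=1).astype(int) % 2)   # +1 even parity, -1 odd
    if sigma > 0:
        X = X + rng.normal(0.0, sigma, size=X.shape)
    return X, y
```

For $N = 3$ this yields the eight points of Figure 1, split into two classes of four points each.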
Following the logic of [12], where a general model of a variational qudit classifier was introduced, we specify the following model for the parity problem:
$$\exp\left(i \sum_{k=1}^{3} s_k \pi \hat{\sigma}_k\right) \exp\left(i \sum_{j=1}^{N} w_j x_j \pi \hat{\sigma}_1\right) |0\rangle, \qquad (1)$$
where $x_j$ is the $j$th component of an input data point. We will refer to this model as the single-qubit classifier (SQC). In Equation (1), $\hat{\sigma}_1, \hat{\sigma}_2, \hat{\sigma}_3$ represent the Pauli operators, $|0\rangle$ is the eigenstate of $\hat{\sigma}_3$ with eigenvalue $1$, and $\mathbf{s} = (s_1, s_2, s_3)$, $s_j \in \mathbb{R}$, $j \in [3]$, and $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, $w_j \in \mathbb{R}$, $j \in [N]$, are the variational parameters, which are optimized during training. The training procedure updates the variational parameters given data points $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iN})$, $x_{ij} \in \{0, 1\}$, with each data point labeled as $y_i = +1$ for even or $y_i = -1$ for odd parity. The classes are, hence, represented by labels that can be directly matched to the mean measurement outcome of the operator $\hat{\sigma}_3$, $\langle \hat{\sigma}_3 \rangle_i$, where positive is classified as even parity and negative as odd parity.
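A minimal NumPy sketch of the SQC with $\mathbf{s} = \mathbf{0}$ may clarify the model: the circuit reduces to a single-qubit rotation generated by $\hat{\sigma}_1$, and the class score is $\langle \hat{\sigma}_3 \rangle$. This is our own illustration, not the paper's code, and the overall sign convention in the exponent is our assumption (it does not affect $\langle \hat{\sigma}_3 \rangle$).

```python
# Sketch of the SQC of Equation (1) with s = 0: the output state is
# exp(-i * pi * (w . x) * sigma_1)|0>, and the score is <sigma_3>.
import numpy as np

S1 = np.array([[0, 1], [1, 0]], dtype=complex)   # Pauli sigma_1
S3 = np.array([[1, 0], [0, -1]], dtype=complex)  # Pauli sigma_3

def sqc_expectation(w, x):
    """<sigma_3> on exp(-i*pi*(w.x)*sigma_1)|0>, i.e. g(w, x)."""
    phi = np.pi * np.dot(w, x)
    U = np.cos(phi) * np.eye(2) - 1j * np.sin(phi) * S1  # matrix exponential
    psi = U @ np.array([1, 0], dtype=complex)
    return np.real(np.conj(psi) @ (S3 @ psi))            # equals cos(2*phi)
```

At the global solution $w_j = 1/2$ discussed below, the score is $+1$ for even-parity inputs and $-1$ for odd-parity inputs.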

Solution Landscape

The SQC in Equation (1) is inspired by the quantum digital circuit shown in Figure 2; the relation between the two can be directly derived. For noiseless data, a periodic (global) solution emerges, allowing the SQC to classify all inputs with perfect accuracy when $\bar{w}_j = 1/2 + l$ and $s_k = 0$ for all $j$ and $k$, with $l \in \mathbb{Z}$. In this solution, the SQC reproduces the conditional evolution of the ancillary qubit in Figure 2: an $X$ gate is applied whenever $x_j = 1$. An even number of gate applications results in the final state $|0\rangle$, while an odd number of applications leads to the state $|1\rangle$. A measurement of the observable $\hat{\sigma}_3$ at the end of the variational circuit can perfectly distinguish the two cases, resulting in the accurate classification of the input.
Even if the parameters converge to a solution with some deviation $\delta w$ around the global solution $\bar{\mathbf{w}}$, the model classifies noiseless data successfully. Correspondingly, if the parameters are fixed at $\bar{\mathbf{w}}$, the classifier can tolerate noise of the same order in the data. We observe that, for $N = 2$, the classifier can tolerate a maximum deviation $\delta w_{\max} \approx 1/8$, or equivalently $\sigma_{\max} \approx 1/24$. For $N$ inputs, the total noise in the exponent of Equation (1) increases as $N\sigma$, meaning that the tolerable noise for each individual input scales as $\sigma_{\max} \approx 1/(24N)$. Finally, the second rotation of the SQC's state, represented by the second term of Equation (1), is added so that the model can tolerate recurrent additive errors of the form $\mathcal{N}(\mu_0, \sigma^2)$ in the input data. In the remainder of this work, we will ignore this term and set $\mathbf{s} = \mathbf{0}$ in Equation (1).
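The claimed tolerance $\delta w_{\max} \approx 1/8$ for $N = 2$ can be checked numerically. The following is our own quick check, using the closed form $\langle \hat{\sigma}_3 \rangle = \cos(2\pi \, \mathbf{w} \cdot \mathbf{x})$ for the SQC with $\mathbf{s} = \mathbf{0}$:

```python
# Perturb both weights away from the global solution w_j = 1/2 and see
# whether every input is still classified correctly.
import itertools
import numpy as np

def classifies_all(delta, N=2):
    """True if the SQC with weights 1/2 + delta classifies all inputs."""
    w = np.full(N, 0.5 + delta)
    for x in itertools.product([0, 1], repeat=N):
        g = np.cos(2 * np.pi * np.dot(w, x))     # <sigma_3> for the SQC
        y = 1 - 2 * (sum(x) % 2)                 # +1 even, -1 odd
        if y * g <= 0:
            return False
    return True
```

The worst case for $N = 2$ is the input $(1, 1)$, where $\langle \hat{\sigma}_3 \rangle = \cos(4\pi\delta)$ first changes sign at $\delta = 1/8$, in agreement with the text.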
We initially employ the square loss function $L(\mathbf{w}) = \sum_{j=1}^{n} (y_j - \langle \hat{\sigma}_3 \rangle_j)^2$ for evaluating the result of the learning procedure of the VQC; however, we observe that local minima appear in this case. The solution landscape for $N = 2$ can be seen in Figure 3a over two periods. Since illustrations of the landscape's morphology can only be extracted as cross-sections when $N > 2$, we plot an example for $N = 2, 6$, and $10$, where the cross-section is taken along the principal hyper-diagonal, namely where $x_1 = x_2 = \cdots = x_N = z$, shown in Figure 3b. We notice that, as $N$ increases, the number of local minima along the diagonal increases moderately. Nonetheless, the existence of local minima persists if $z \to 1 - z$ for some inputs. This suggests that the number of minima within a period increases at least linearly with $N$, while the global solution remains unique.
On the other hand, our numerical investigation shows that input sets containing a subset of points whose coordinates coincide with those of the global solution, while the remaining points have coordinates of local minima, also correspond to local minima of the loss function. Reaching such minima still yields a perfect classification of the input. Based on the analysis of the numerical results in the following section, we conclude that both the local minima that achieve accuracy 1, meaning that all points are correctly classified, and those that do not, increase in number proportionally as $N$ grows. A final observation on Figure 3b, supported by our numerical investigations, is that, as $N$ increases, the area of the parametric space covered by barren plateaus also increases. This makes the approach to a local minimum more time consuming or, in some cases, impossible.
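The diagonal cross-section of Figure 3b can be reproduced with a few lines. This is our own sketch, under our reading that the diagonal lies in parameter space, i.e., all weights are set to the common value $z$:

```python
# Square loss of the SQC evaluated with w_1 = ... = w_N = z, summed over
# all 2^N noiseless parity inputs.
import itertools
import numpy as np

def square_loss_on_diagonal(z, N):
    """L(w) = sum_j (y_j - <sigma_3>_j)^2 with every weight equal to z."""
    loss = 0.0
    for x in itertools.product([0, 1], repeat=N):
        g = np.cos(2 * np.pi * z * sum(x))       # <sigma_3> for the SQC
        y = 1 - 2 * (sum(x) % 2)                 # +1 even, -1 odd
        loss += (y - g) ** 2
    return loss
```

Sweeping $z$ over one period and plotting the result for $N = 2, 6, 10$ reproduces the qualitative picture of the text: the loss vanishes at $z = 1/2$ and acquires more local minima as $N$ grows.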

3. Configuration of the Training Procedure

The versatile landscape of the parametric space, combined with the fact that there is at least one known solution, renders the parity problem an attractive test bed for investigating different optimization techniques for the training of the SQC. We will first provide a description of the common steps involved in training a variational quantum classifier, followed by an introduction to the basic optimization technique for adjusting the weights, namely the Gradient Descent (GD) method. It is essential to gain a deeper understanding of the solution landscape through the application of GD before more advanced optimization techniques can be explored.
Let us restrict the discussion to the case of binary classification tasks, where the data $\mathbf{x}_j$ are to be classified into two classes, A and B, consisting of $N_A$ and $N_B$ points, respectively, where $N_A + N_B = n$. We denote the output circuit state by $|\psi(\mathbf{w}, \mathbf{x}_j)\rangle$ and assume that a measurement of a single observable $\hat{G}$ (specifically, $\hat{\sigma}_3$ in our case), which has two eigenvalues $\pm 1$, is performed on the quantum state. For fixed values of the parameters $\mathbf{w}$, the standard procedure involves running the quantum circuit multiple times to estimate
$$g(\mathbf{w}, \mathbf{x}_j) = \langle \psi(\mathbf{w}, \mathbf{x}_j) | \hat{G} | \psi(\mathbf{w}, \mathbf{x}_j) \rangle \qquad (2)$$

$$\tilde{g}_M(\mathbf{w}, \mathbf{x}_j) = \frac{1}{M} \sum_{m=1}^{M} G_m(\mathbf{w}, \mathbf{x}_j) \qquad (3)$$

where $G_m(\mathbf{w}, \mathbf{x}_j) \in \{-1, +1\}$ represents the stochastic outcome of the $m$-th (independent) experiment. For $M \gg 1$, we have $\tilde{g}_M(\mathbf{w}, \mathbf{x}_j) \to g(\mathbf{w}, \mathbf{x}_j)$.
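The estimator of Equation (3) can be simulated classically by drawing single-shot outcomes $G_m = \pm 1$ with $P(+1) = (1 + g)/2$. A sketch of our own:

```python
# Finite-sample estimator of Equation (3): the average of M stochastic
# +/-1 outcomes converges to the exact expectation g as M grows.
import numpy as np

def sampled_expectation(g, M, rng=None):
    """Average of M simulated measurement outcomes with mean g."""
    rng = rng or np.random.default_rng(1)
    outcomes = rng.choice([1, -1], size=M, p=[(1 + g) / 2, (1 - g) / 2])
    return outcomes.mean()
```

The standard deviation of the estimate scales as $\sqrt{(1 - g^2)/M}$, which is why single-measurement ($M = 1$) estimates are so rough and motivate the averaging over batches introduced in Section 4.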
The step of estimating $g(\mathbf{w}, \mathbf{x}_j)$ via the quantum circuit and the corresponding measurements is followed by a classical procedure to update the parameters $\mathbf{w}$ using a classical optimization method. The first step in this procedure is to construct a loss function $L(\mathbf{w})$ based on $g(\mathbf{w}, \mathbf{x}_j)$ and the information about the classes. Here, we use the negative log-likelihood loss, which is commonly employed to learn the weights in NNs. If the classes A and B, corresponding to even and odd parity in our case study, respectively, are labeled as $y = \pm 1$, the probability that the training data point $\mathbf{x}_j$ belongs to the class $y$ can be evaluated as follows:
$$P(y | \mathbf{x}_j) = \frac{y \, g(\mathbf{w}, \mathbf{x}_j) + 1}{2}. \qquad (4)$$
By defining the class label $y_j$ for each input $\mathbf{x}_j$, the loss function can be constructed by summing the negative log-likelihoods of the probabilities as follows:
$$L(\mathbf{w}) = -\frac{1}{n} \sum_{j=1}^{n} \log P(y_j | \mathbf{x}_j) \qquad (5)$$

or,

$$L(\mathbf{w}) = -\frac{1}{N_A} \sum_{j \in A} \log P(y = 1 | \mathbf{x}_j^A) - \frac{1}{N_B} \sum_{j \in B} \log P(y = -1 | \mathbf{x}_j^B). \qquad (6)$$
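Given the scores $g(\mathbf{w}, \mathbf{x}_j)$ and labels $y_j = \pm 1$, the loss of Equations (4) and (5) is a one-liner. Our own sketch; the small `eps` guard against $\log 0$ is our addition:

```python
# Negative log-likelihood loss built from the class probabilities of
# Equation (4).
import numpy as np

def nll_loss(g_values, labels, eps=1e-12):
    g = np.asarray(g_values, dtype=float)
    y = np.asarray(labels, dtype=float)
    p = (y * g + 1.0) / 2.0          # P(y_j | x_j), Equation (4)
    return -np.mean(np.log(p + eps)) # Equation (5)
```

Perfectly confident correct scores ($g_j = y_j$) give zero loss, while an uninformative score ($g_j = 0$) gives $\log 2$ per point.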
Using GD, the weights are updated according to
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \alpha \nabla L(\mathbf{w}_t) \qquad (7)$$
where $\alpha$ is the learning rate. The numerical differentiation that produces the gradient $\nabla L(\mathbf{w})$ of the loss function requires at least $2N$ evaluations of the mean value $g$ of Equation (2) via the quantum circuit, where $N$ is the dimension of each input data vector. The analytical formulas [13,14] for the partial derivatives of the gradient vector can relate these derivatives to the mean values of other observables in the quantum state, offering an improvement in the precision of the optimization procedure. In our case, where the model is relatively simple, the analytic formula is straightforward and involves the mean value of only one observable:
$$\frac{\partial g(\mathbf{w}, \mathbf{x}_i)}{\partial w_j} = 2 x_j^{(i)} \pi \, \langle \psi(\mathbf{w}, \mathbf{x}_i) | \hat{\sigma}_2 | \psi(\mathbf{w}, \mathbf{x}_i) \rangle = 2 x_j^{(i)} \pi \, h(\mathbf{w}, \mathbf{x}_i) \qquad (8)$$
where $x_j^{(i)}$ denotes the $j$th component of the vector $\mathbf{x}_i$. The elements $\partial L(\mathbf{w}) / \partial w_j$ of the gradient can be accurately derived using the chain rule in combination with Equations (2) and (8).
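Equation (8) can be checked against a finite-difference derivative. This is our own verification sketch, using the closed forms $g = \cos(2\pi \, \mathbf{w} \cdot \mathbf{x})$ and $\langle \hat{\sigma}_2 \rangle = -\sin(2\pi \, \mathbf{w} \cdot \mathbf{x})$ that hold for the SQC with $\mathbf{s} = \mathbf{0}$ under our sign convention:

```python
# Analytic derivative of Equation (8) versus a central finite difference.
import numpy as np

def g_value(w, x):
    return np.cos(2 * np.pi * np.dot(w, x))

def grad_g_analytic(w, x):
    h = -np.sin(2 * np.pi * np.dot(w, x))        # <sigma_2> on the output state
    return 2 * np.pi * np.asarray(x, float) * h  # Equation (8)

def grad_g_numeric(w, x, d=1e-6):
    w = np.asarray(w, float)
    grad = np.zeros_like(w)
    for j in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[j] += d
        wm[j] -= d
        grad[j] = (g_value(wp, x) - g_value(wm, x)) / (2 * d)
    return grad
```

The two gradients agree to numerical precision, which is the sense in which the analytic formula replaces the $2N$ circuit evaluations of numerical differentiation.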
We optimize the parameters of the SQC given by Equation (1), designed to learn the solution to the parity problem with $N = 6, \ldots, 10$-bit input, using the GD method. For a given $N$, the learning rate is optimized empirically and then kept constant for all epochs, while the number of epochs is fixed in all simulations. The gradient is calculated analytically using Equations (2) and (8). We report two important metrics concerning the classification results: the average accuracy $A = (\#\text{ inputs correctly classified}) / (\#\text{ all inputs})$, listed in Table 1, as well as the percentage of runs in which the accuracy reached 1, i.e., where all inputs were correctly classified after training. The metrics are reported over 50 runs, with the weight vector randomly initialized for each run. Our implementation is available on GitHub (https://github.com/sanantoniochili/QSGD.git, accessed on 23 February 2025).
The results of the GD optimization, shown in Table 1, as well as the evolution of the observed loss during optimization, lead to the following conclusions. The difficulty of the problem increases rapidly with the input dimension $N$, as expected, highlighting the need for more fine-tuned optimization methods. For $N = 6, 7, 8, 9$, the few occurrences of unsuccessful training are mainly due to the optimization method being trapped in the small neighborhood around local minima. However, for $N = 10$, the presence of barren plateaus becomes equally problematic, and the loss function appears to oscillate randomly over these regions. Since $N = 10$ presents the first non-trivial case for the solution of the parity problem, in the following we employ it to numerically test the performance of more sophisticated gradient optimization methods.
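The GD training loop described above can be sketched end to end. This is our own compact Python reimplementation (the paper's published code is in Mathematica); hyperparameter values and the optional `w0` initializer are illustrative:

```python
# Plain GD on the SQC for noiseless N-bit parity, with the negative
# log-likelihood loss of Equation (5) and the analytic gradient of
# Equation (8) folded in via the chain rule.
import itertools
import numpy as np

def train_gd(N=4, lr=0.02, epochs=300, seed=0, w0=None):
    rng = np.random.default_rng(seed)
    X = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
    y = 1 - 2 * (X.sum(axis=1).astype(int) % 2)          # +1 even, -1 odd
    w = np.array(w0, float) if w0 is not None else rng.uniform(0, 1, N)
    for _ in range(epochs):
        phi = X @ w
        g = np.cos(2 * np.pi * phi)                      # scores <sigma_3>
        p = (y * g + 1) / 2 + 1e-12                      # Equation (4)
        # dL/dw = -(1/n) * sum_j (y_j / (2 p_j)) * dg_j/dw
        dg = -2 * np.pi * np.sin(2 * np.pi * phi)[:, None] * X
        grad = -np.mean((y / p)[:, None] * dg, axis=0) / 2
        w -= lr * grad
    acc = float(np.mean(np.sign(np.cos(2 * np.pi * X @ w)) == y))
    return w, acc
```

Starting near the global solution, GD converges to accuracy 1; from a random initialization it may instead be trapped by the local minima and plateaus discussed above, consistent with Table 1.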

4. The ESGD Method: Towards an Optimization with Single Measurement Sampling

Inspired by recent studies [8,15,16] on exploiting the intrinsic stochasticity of quantum measurements in gradient optimization methods and, at the same time, motivated by the need to reduce quantum resources, we explore whether the SQC can be trained using just one measurement (one repetition of the quantum circuit) for a very rough estimate of $g(\mathbf{w}, \mathbf{x})$ and $\partial g(\mathbf{w}, \mathbf{x}) / \partial w_j$. The imbalance caused by a single measurement calls for some averaging during the calculation of the partial derivatives, which induces a desirable smoothness to the loss function. Intuition suggests that this average should be taken over different input data $\mathbf{x}_j$. In the following, we present a logical procedure to justify such averaging, while keeping the negative log-likelihood as the loss function, contrary to previous works. We begin by reviewing and numerically testing Stochastic Gradient Descent (SGD), known to provide better optimization results, i.e., fewer epochs and a higher probability of converging to global minima, in variational quantum circuits. Then, we introduce the ESGD method, in which batches of input vectors are used, similarly to SGD, but are now treated as quantum ensembles represented by density operators. The loss function in ESGD involves a double summation over stochastic measurement outcomes and batch input data. Even with single measurements, the sought-after smoothness of the loss function is provided by the summation over batch input data, leading naturally to the Doubly Stochastic Gradient Descent (DSGD) method described next. We will show that, according to our simulations, both ESGD and DSGD outperform GD and SGD for the $N = 10$ case.

4.1. Application of the Stochastic Gradient Descent (SGD) Method

In SGD, the n data points are randomly split into k batches, each batch containing n k = n / k points. During an epoch, the weights are updated k times using the gradient of the loss function of the k-th batch:
$$L^{(k)}(\mathbf{w}) = -\frac{1}{n_k} \sum_{j=(k-1) n_k + 1}^{k n_k} \log P(y_j | \mathbf{x}_j), \qquad (9)$$
which is constructed analogously to Equation (5).
Focusing, now, on the specific conditions of the parity problem, where $N_A = N_B = n/2$, we introduce a "balanced" cost functional with an equal number of points ($n_k/2$) from each class, $\mathbf{x}_j^A$ and $\mathbf{x}_j^B$, in each batch:
$$L^{(k)}(\mathbf{w}) = -\frac{2}{n_k} \sum_{j=(k-1) n_k/2 + 1}^{k n_k/2} \log P(y = 1 | \mathbf{x}_j^A) - \frac{2}{n_k} \sum_{j=(k-1) n_k/2 + 1}^{k n_k/2} \log P(y = -1 | \mathbf{x}_j^B). \qquad (10)$$
We schematically present the steps of evaluating $L^{(k)}(\mathbf{w})$ via the SQC of Equation (1) in Figure 4a.
We apply the SGD method for $N = 10$ features and for a varied number of points $n_k$ per batch by simulating the procedure described in Figure 4a, with the results shown in Table 2. As for GD, we use the analytical expressions of Equation (8) for the evaluation of the gradient, and we perform statistics over 25 random initializations of $\mathbf{w}$. Comparing the results of Table 2 with the results of Table 1 corresponding to the $N = 10$ case, we can conclude that SGD performs better than GD, especially as the number of batches increases for a fixed number of total input points.
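The class-balanced batching behind Equation (10) can be sketched as follows. This is our own helper, not part of the paper's code, and it assumes `n_k` is even and divides each class size:

```python
# Split the data into batches holding n_k/2 even-parity and n_k/2
# odd-parity points each, as required by the balanced cost functional.
import numpy as np

def balanced_batches(X, y, n_k, rng):
    """Yield (X_batch, y_batch) pairs with n_k/2 points from each class."""
    idx_a = rng.permutation(np.where(y == 1)[0])    # class A, even parity
    idx_b = rng.permutation(np.where(y == -1)[0])   # class B, odd parity
    for s in range(0, len(idx_a), n_k // 2):
        idx = np.concatenate([idx_a[s:s + n_k // 2], idx_b[s:s + n_k // 2]])
        yield X[idx], y[idx]
```

One SGD epoch then performs one weight update of Equation (7) per yielded batch, with the gradient of $L^{(k)}$ in place of the full-batch gradient.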

4.2. The Ensemble Stochastic Gradient Descent (ESGD) Method

Quantum mechanics introduces quantum ensembles, a novel perspective for studying data in machine learning, which has been partially explored in a different context in [17] and which we apply here. Let us take the batches of quantum states in SGD and construct $2k$ density matrices based on them, as follows:
$$\hat{\rho}_{+1}^{k} = \frac{2}{n_k} \sum_{j=(k-1) n_k/2 + 1}^{k n_k/2} |\psi(\mathbf{w}, \mathbf{x}_j^A)\rangle \langle \psi(\mathbf{w}, \mathbf{x}_j^A)| \qquad (11)$$

$$\hat{\rho}_{-1}^{k} = \frac{2}{n_k} \sum_{j=(k-1) n_k/2 + 1}^{k n_k/2} |\psi(\mathbf{w}, \mathbf{x}_j^B)\rangle \langle \psi(\mathbf{w}, \mathbf{x}_j^B)|. \qquad (12)$$
We can derive an expectation value of the observable $\hat{G}$ for each of the batch ensembles, as in
$$F(\mathbf{w}, \{\mathbf{x}_j\}_k^{A,B}) = \mathrm{Tr}\left(\hat{\rho}_{\pm 1}^{k} \hat{G}\right) = \frac{2}{n_k} \sum_{j=(k-1) n_k/2 + 1}^{k n_k/2} g(\mathbf{w}, \mathbf{x}_j^{A,B}). \qquad (13)$$
We can, then, define the corresponding probability $P_k(\pm 1 | \{\mathbf{x}_j\}) = \frac{\pm F(\mathbf{w}, \{\mathbf{x}_j\}_k^{A,B}) + 1}{2}$ and estimate the ensemble's loss accordingly, as in
$$\tilde{L}^{(k)}(\mathbf{w}) = -\log P_k(y = 1 | \{\mathbf{x}_j^A\}) - \log P_k(y = -1 | \{\mathbf{x}_j^B\}). \qquad (14)$$
We will, hereafter, elaborate on a gradient optimization method where the weights are updated according to expectation values of batch ensembles represented by density matrices, as in Equation (14). This is called Ensemble Stochastic Gradient Descent (ESGD), and its relation to SGD when applied to our SQC is illustrated in Figure 4b. Its geometric interpretation can be easily understood as follows: if a Bloch vector is assigned to each of the density matrices of Equations (11) and (12), then the aforementioned gradient method guides the ensembles of Bloch vectors associated with the two classes towards opposite directions on the Bloch sphere ($\pm \hat{z}$). In the edge case of $n_k = 2$, ESGD reduces to SGD, since Equations (11) and (12) reduce to
$$\hat{\rho}_{+1}^{k} = |\psi(\mathbf{w}, \mathbf{x}_j^A)\rangle \langle \psi(\mathbf{w}, \mathbf{x}_j^A)| \qquad (15)$$

$$\hat{\rho}_{-1}^{k} = |\psi(\mathbf{w}, \mathbf{x}_j^B)\rangle \langle \psi(\mathbf{w}, \mathbf{x}_j^B)| \qquad (16)$$
and the label assignment becomes
$$F(\mathbf{w}, \{\mathbf{x}_j\}_k^{A,B}) = \mathrm{Tr}\left(\hat{\rho}_{\pm 1}^{k} \hat{G}\right) = g(\mathbf{w}, \mathbf{x}_j^{A,B}). \qquad (17)$$
In the opposite edge case, where $n_k = n$, ESGD serves the same purpose as the learning procedure in [17]; it adjusts the parameters in the quantum circuit so that the distance between the two Bloch vectors representing the classes is maximized. Finally, as with GD, the components of the gradient can be calculated either analytically, by deriving equations similar to Equation (14), or by using a numerical derivative. We implemented ESGD using numerical derivatives, and the statistics over 25 runs are reported in Table 3.
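The ensemble quantities of Equations (11)-(13) are simple to verify numerically: the batch density matrix is the uniform mixture of the batch's output states, and its expectation of $\hat{G} = \hat{\sigma}_3$ equals the average of the individual expectations $g(\mathbf{w}, \mathbf{x}_j)$. A sketch of our own, again using the $\mathbf{s} = \mathbf{0}$ form of the SQC:

```python
# Batch density matrix of Equations (11)-(12) and the ensemble
# expectation F = Tr(rho G) of Equation (13).
import numpy as np

S1 = np.array([[0, 1], [1, 0]], dtype=complex)
S3 = np.array([[1, 0], [0, -1]], dtype=complex)

def output_state(w, x):
    phi = np.pi * np.dot(w, x)
    U = np.cos(phi) * np.eye(2) - 1j * np.sin(phi) * S1
    return U @ np.array([1, 0], dtype=complex)

def ensemble_expectation(w, batch):
    rho = np.zeros((2, 2), dtype=complex)
    for x in batch:
        psi = output_state(w, x)
        rho += np.outer(psi, np.conj(psi)) / len(batch)  # uniform mixture
    return np.real(np.trace(rho @ S3))                   # F = Tr(rho G)
```

By linearity of the trace, this coincides with the right-hand side of Equation (13), which is what lets ESGD treat a whole batch as a single quantum ensemble.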

4.3. The Doubly Stochastic Gradient Descent (DSGD) Method

Let us now proceed with an approximate version of Equation (17) by considering a finite number M of measurements:
$$F(\mathbf{w}, \{\mathbf{x}_j\}_k^{A,B}) = \mathrm{Tr}\left(\hat{\rho}_{\pm 1}^{k} \hat{G}\right) \approx \tilde{F}_M(\mathbf{w}, \{\mathbf{x}_j\}_k^{A,B}) = \frac{2}{M n_k} \sum_{j=(k-1) n_k/2 + 1}^{k n_k/2} \sum_{m=1}^{M} G_m(\mathbf{w}, \mathbf{x}_j^{A,B}). \qquad (18)$$
If the batch size $n_k$ is large enough, the double summation involved in Equation (18) allows one to reduce $M$ without introducing discontinuities to the mean value function $F$, thus avoiding non-differentiability. In this work, we simulate the most challenging case of a single measurement ($M = 1$) per input for batch sizes $n_k \geq 8$, and we list the outcomes in Table 4. The described version of ESGD, where only a small number of quantum measurements is performed per input, will be referred to as the Doubly Stochastic Gradient Descent (DSGD) method, since it incorporates both the processing of random input batches and the stochasticity of a single (or small number of) measurements. For more clarity on the points of contact of the four gradient optimization methods employed in this work, Figure 5 provides a schematic view clarifying their relation with respect to the number of batches and the number of measurements performed on the quantum system.
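The doubly stochastic estimate of Equation (18) can be simulated with a few lines. This is our own sketch: one or a few $\pm 1$ shots per batch input, averaged over the whole batch, so that for a large batch the estimate approaches the exact ensemble expectation $F$ even at $M = 1$:

```python
# Doubly stochastic estimator of Equation (18): M single-shot outcomes
# per input, averaged over both shots and the batch.
import numpy as np

def dsgd_estimate(g_values, M=1, rng=None):
    """Shot-and-batch average approximating F = Tr(rho G)."""
    rng = rng or np.random.default_rng(2)
    g = np.asarray(g_values, dtype=float)
    p_plus = (1 + g) / 2                                  # P(outcome = +1)
    shots = np.where(rng.random((M, g.size)) < p_plus, 1.0, -1.0)
    return shots.mean()
```

The fluctuation of the estimate shrinks as $1/\sqrt{M n_k}$, which is the smoothing mechanism that the text attributes to the double summation.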

4.4. Results of Optimization Methods

As mentioned previously, we evaluate each method according to its accuracy over a set of configurations. The results are gathered in Table 2, Table 3 and Table 4, listing those of the SGD, ESGD, and DSGD optimization methods for the same classification problem, namely the parity problem for an input dimension of $N = 10$, using the SQC. For the SGD and ESGD methods, the exact expressions of the mean values of the observables have been employed, while the random outcomes of the single measurements needed for DSGD have been simulated according to the analytical probability distributions.
We will now elaborate on some key technical details of the implementation of the methods. The random weight-vector initialization differs for each point in Figure 6. The performance of the methods is naturally influenced by the choice of hyperparameters, such as the learning rate and the number of epochs; hence, for all three methods and each batch size, we roughly optimized the learning rate to achieve the best outcomes. For all methods, the learning rate was also reduced by a factor of $10^2$ if the loss function fell below a certain pre-fixed threshold value, constant for all numbers of batches. The latter practice improved the results of the ESGD and DSGD methods but not those of SGD. For comparison purposes, the number of function evaluations is kept constant at 1024, meaning the number of epochs increases with $n_k$ in order to accommodate all points in the input dataset. Note that the $n_k = 2, 4$ cases are not included in Table 4 because the doubly stochastic derivation introduces excessive noise, preventing the model parameters from converging to a solution vector. Finally, in the ESGD and DSGD methods, where we numerically evaluate the gradient vector, we average the components of the gradient vector over the last four steps in order to achieve smoother progress. The code is available at https://github.com/sanantoniochili/QSGD.git (accessed on 23 February 2025) for further testing and potential optimization.
From Figure 6, it is evident that both ESGD and DSGD outperform SGD for all batch sizes. Specifically, the mean accuracy of ESGD and its stochastic counterpart, DSGD, increases as the batch size grows, whereas SGD experiences a mild performance degradation, as shown in Figure 6. SGD eventually converges to the performance of GD (see Table 1 for $N = 10$). The close relationship between SGD and ESGD for $n_k = 2$ is also apparent, with both showing poor performance due to the continuous, uncorrelated treatment of the data, which fails to collectively guide the weight vector toward a solution. In contrast, ESGD consistently achieves high accuracy, close to 100%, for any $n_k \geq 32$. DSGD, being stochastic, does not easily achieve 100% accuracy, as it tends to oscillate even once average convergence is reached.

5. Conclusions

This work presents two key contributions that advance both the theoretical understanding and practical application of quantum machine-learning techniques. First, we demonstrate that the N-bit parity problem can be efficiently solved using a single qubit, highlighting the remarkable potential of qubits as powerful learning units in quantum circuits. This result underscores the fact that any Boolean logical function, including parity, can be implemented within a quantum framework, opening an exciting avenue for future work. The second significant contribution of this work is the introduction of a novel optimization method, ESGD, which utilizes batches represented by density matrices. Our numerical investigations show that ESGD outperforms traditional SGD in terms of accuracy and efficiency, particularly for the parity problem. Furthermore, ESGD maintains its effectiveness even when reduced to the extreme scenario of using a single measurement per input, a result that highlights its robustness in noisy environments.
Overall, we believe that our methods and numerical results further motivate the exploration of incorporating quantum mechanics logic into learning tasks and classical optimization algorithms, hence creating new research directions. Limiting resources to one qubit, as carried out here, allows for more accessible, easily tested, and easily distributed implementations. This is a crucial property, especially in the field of machine learning, as it facilitates the increase in the range of practical applications, it promotes parallel execution, and it accelerates the production of novel and significant results. Furthermore, our preliminary investigation of a more universal logical function representation with variational quantum circuits suggests that quantum systems beyond qubits need to be considered, opening another exciting avenue for future work.

Author Contributions

Conceptualization, A.M. and A.T.; methodology, A.M. and G.M.; formal analysis, A.T. and A.M.; writing—original draft preparation, A.M., A.T. and G.M.; writing—review and editing, A.M., A.T., G.M. and D.S.; supervision, D.S.; funding acquisition, D.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the project Hellas QCI co-funded by the European Union under the Digital Europe Programme grant agreement No.101091504. A.M. acknowledges partial support from the European Union’s Horizon Europe research and innovation program under grant agreement No.101092766 (ALLEGRO Project).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The programs (GD, SGD, ESGD, DSGD methods) are written with Wolfram Mathematica and are available on GitHub (https://github.com/sanantoniochili/QSGD.git accessed on 23 February 2025).

Acknowledgments

A.M. is grateful to Uwe Jaekel and Babette Dellen for helpful discussions.

Conflicts of Interest

All authors were employed by the company Eulambia Advanced Technologies. The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ESGD: Ensemble Stochastic Gradient Descent
NN: Neural Network
VQC: Variational Quantum Circuit
SQC: Single-Qubit Classifier
GD: Gradient Descent
SGD: Stochastic Gradient Descent
DSGD: Doubly Stochastic Gradient Descent

References

Figure 1. Binary classification of the data points produced for the parity problem with 3-bit input ( N = 3 ).
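The labelled point set of Figure 1 can be reproduced classically; a minimal sketch (the function name `parity_dataset` is ours, not from the paper):

```python
from itertools import product

def parity_dataset(n_bits):
    """Enumerate all n-bit Boolean inputs with their parity labels
    (label 1 when the number of ones is odd, 0 otherwise)."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n_bits)]

dataset = parity_dataset(3)  # the 8 labelled points of Figure 1
```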
Figure 2. A quantum circuit representing a solution to the N-bit parity problem. N qubits encode the Boolean input x_1, x_2, …, x_N, and N CNOT gates act on an ancilla qubit, which registers the output.
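On computational-basis inputs, the circuit of Figure 2 acts classically: each CNOT flips the ancilla if and only if its control bit is 1, so the ancilla accumulates the XOR of all inputs. A sketch of this classical emulation (assuming basis-state inputs; `parity_circuit` is a hypothetical name):

```python
def parity_circuit(bits):
    """Emulate the CNOT chain of Figure 2 on a computational-basis input:
    the ancilla starts in |0> and each CNOT flips it iff its control is 1."""
    ancilla = 0
    for x in bits:
        ancilla ^= x  # a CNOT with basis-state control x flips the target iff x == 1
    return ancilla   # the parity of the input string
```

For example, `parity_circuit((1, 0, 1))` returns 0 (even parity).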
Figure 3. The square loss function L(w) for N = 2. (a) Contour plot of the objective function of Equation (1) versus the parameters w_1 and w_2, over two periods. (b) Plot of the objective function along the cross-section x_1 = x_2 = ⋯ = x_N for N = 2, 6, 10.
Figure 4. (a) SGD: A schematic depiction of the evaluation of the gradient ∇L^(k)(w_t) via the SQC defined by Equation (1), using the kth batch of the input. For an analytic evaluation of ∇L^(k)(w_t), the whole procedure needs to be repeated, measuring the observable σ̂_2 (instead of σ̂_3) so that h(w_t, x_j) of Equation (8) is estimated as well. (b) ESGD: The batch is treated as an ensemble and, contrary to SGD, the measurement outcomes are first averaged over the batch to estimate F(w_t, x_j) and then used as inputs in the loss L̃^(k)(w_t). DSGD is represented by sub-figure (b) with M = 1.
Figure 5. A schematic view of the four gradient optimization methods employed in this work: GD, SGD, ESGD, and DSGD. SGD is represented by a dotted (black) line indicating the number k of batches, with k ∈ [1, n/2] (see Equation (10)). For k = 1, SGD reduces to GD. ESGD is indicated by a dotted (orange) line, where k now decreases from left to right. For k = n/2, ESGD coincides with SGD. Above ESGD there is a (yellow) trapezoidal area describing DSGD, which depends on k but also on the number M of measurements per input. With increasing M, DSGD approaches ESGD, and this convergence is faster (in M) when the number of batches is small. The direction of the arrows in the figure shows the direction of increase of the respective value.
Figure 6. The evolution of the mean accuracy A with respect to the size n_k of the batches for each method: ESGD, DSGD, and SGD. The batch size is measured as the number of data points used for each loss and gradient calculation. The x-axis, showing the size of the batch, is on a logarithmic scale with base 2. DSGD evaluations start at n_k = 8.
Table 1. The results of GD optimizations for learning the parity problem using the SQC model. For all runs, the number of epochs was set to 200. Notation is explained as follows. N: the dimension of the input for the parity problem. α : the learning rate. A : the average accuracy over 50 runs with random initializations of w . Δ A : the standard deviation of the accuracy. % A = 1 : the percentage of runs where accuracy reached 1.
| N       | 6         | 7           | 8           | 9           | 10          |
|---------|-----------|-------------|-------------|-------------|-------------|
| α       | 3 × 10^−3 | 7.8 × 10^−4 | 2.9 × 10^−4 | 1.2 × 10^−4 | 7.3 × 10^−5 |
| A       | 0.97      | 0.97        | 0.8         | 0.7         | 0.6         |
| ΔA      | 0.10      | 0.11        | 0.3         | 0.2         | 0.2         |
| % A = 1 | 0.96      | 0.94        | 0.6         | 0.34        | 0.16        |
Table 2. The results of SGD optimizations for learning the parity problem with N = 10 and the loss function given by Equation (10). The results concern 25 runs, each with random initial w and analytical derivatives. The notation is explained as follows. A : the average accuracy; Δ A : the standard deviation of the accuracy; % A = 1 : the percentage of runs where accuracy reached 1. For all runs, the number of function evaluations was set to 1024.
| n_k      | 2    | 4    | 8    | 16   | 32   | 64   | 128  | 256  | 512  |
|----------|------|------|------|------|------|------|------|------|------|
| # epochs | 2    | 4    | 8    | 16   | 32   | 64   | 128  | 256  | 512  |
| A        | 0.7  | 0.7  | 0.6  | 0.6  | 0.7  | 0.58 | 0.56 | 0.50 | 0.55 |
| ΔA       | 0.23 | 0.23 | 0.2  | 0.2  | 0.2  | 0.18 | 0.17 | 0.02 | 0.15 |
| % A = 1  | 0.24 | 0.28 | 0.28 | 0.24 | 0.32 | 0.16 | 0.12 | 0.0  | 0.10 |
Table 3. The results of ESGD optimizations for learning the parity problem with N = 10 and the loss function given by Equation (14). The results concern 25 runs, each with random initial w and numerical derivatives. For all runs, the number of function evaluations was set to 1024. The notation is explained as follows. A : the average accuracy; Δ A : the standard deviation of the accuracy; % A = 1 : the percentage of runs where accuracy reached 1.
| n_k      | 2    | 4    | 8    | 16   | 32    | 64    | 128   | 256  | 512  |
|----------|------|------|------|------|-------|-------|-------|------|------|
| # epochs | 2    | 4    | 8    | 16   | 32    | 64    | 128   | 256  | 512  |
| A        | 0.70 | 0.86 | 0.97 | 0.96 | 0.997 | 0.998 | 0.998 | 0.97 | 0.98 |
| ΔA       | 0.25 | 0.23 | 0.10 | 0.14 | 0.008 | 0.004 | 0.006 | 0.10 | 0.10 |
| % A = 1  | 0.40 | 0.72 | 0.84 | 0.72 | 0.96  | 0.96  | 0.92  | 0.84 | 0.96 |
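The operational difference between the SGD and ESGD losses (Figure 4) can be sketched with a toy stand-in for the single-qubit expectation value. Here cos(w·x) is only an illustrative model, not the paper's Equation (1), and averaging outcomes over the whole batch is one plausible reading of the density-matrix construction:

```python
import numpy as np

def expectation(w, x):
    # toy stand-in for the single-qubit expectation value of sigma_3;
    # the paper's actual encoding (Equation (1)) may differ
    return np.cos(np.dot(w, x))

def sgd_batch_loss(w, X, y):
    # SGD/GD: average the per-sample square losses over the batch
    return np.mean([(expectation(w, x) - t) ** 2 for x, t in zip(X, y)])

def esgd_batch_loss(w, X, y):
    # ESGD (Figure 4b, schematically): average the measurement outcomes over
    # the batch first, as for an ensemble described by a density matrix, then
    # evaluate a single square loss on the batch averages
    F = np.mean([expectation(w, x) for x in X])
    return (F - np.mean(y)) ** 2
```

For a batch with mixed labels the two losses differ: the ESGD form compares batch averages, while the SGD form penalizes every sample individually.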
Table 4. The results of DSGD optimizations, i.e., using a single measurement (M = 1) per input, for learning the parity problem with N = 10 and the loss function given by the revised version of Equation (14). The results concern 25 runs, each with random initial w and numerical derivatives. For all runs, the number of function evaluations was set to 1024. The notation is explained as follows. A : the average accuracy; Δ A : the standard deviation of the accuracy; % A = 1 : the percentage of runs where accuracy reached 1.
| n_k      | 8    | 16   | 32    | 64   | 128   | 256   | 512   |
|----------|------|------|-------|------|-------|-------|-------|
| # epochs | 8    | 16   | 32    | 64   | 128   | 256   | 512   |
| A        | 0.75 | 0.91 | 0.967 | 0.98 | 0.98  | 0.988 | 0.999 |
| ΔA       | 0.25 | 0.19 | 0.10  | 0.01 | 0.016 | 0.016 | 0.004 |
| % A = 1  | 0.44 | 0.76 | 0.60  | 0.32 | 0.44  | 0.64  | 0.96  |
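The single-measurement regime of Table 4 can be mimicked numerically by sampling one ±1 outcome per input from the Born-rule distribution of the measured observable; again, cos(w·x) is only a hypothetical stand-in for the circuit's expectation value, and `sample_sigma3` is our name:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sigma3(w, x, shots=1):
    """Sample +/-1 outcomes of a sigma_3 measurement for a toy single-qubit
    model with expectation cos(w.x); shots=1 mimics the DSGD (M = 1) setting."""
    p_plus = 0.5 * (1.0 + np.cos(np.dot(w, x)))  # Born-rule probability of outcome +1
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p_plus, 1.0 - p_plus])
    return outcomes.mean()  # shot-averaged estimate of the expectation value
```

With many shots the estimate converges to the expectation value; with a single shot each evaluation is a noisy ±1 sample, which is exactly the stochasticity DSGD feeds into the optimization.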