Quantum Circuit Learning with Error Backpropagation Algorithm and Experimental Implementation

Abstract: Quantum computing has the potential to outperform classical computers and is expected to play an active role in various fields. In quantum machine learning, a quantum computer has been found useful for enhanced feature representation and high-dimensional state or function approximation. Quantum–classical hybrid algorithms have been proposed in recent years for this purpose under the noisy intermediate-scale quantum computer (NISQ) environment. Under this scheme, the role played by the classical computer is the parameter tuning, parameter optimization, and parameter update for the quantum circuit. In this paper, we propose a gradient descent-based backpropagation algorithm that can efficiently calculate the gradient in parameter optimization and update the parameter for quantum circuit learning, which outperforms the current parameter search algorithms in terms of computing speed while presenting the same or even higher test accuracy. Meanwhile, the proposed theoretical scheme was successfully implemented on the 20-qubit quantum computer of IBM Q, ibmq_johannesburg. The experimental results reveal that the gate error, especially the CNOT gate error, strongly affects the derived gradient accuracy. The regression accuracy performed on the IBM Q becomes lower with the increase in the number of measurement shot times due to the accumulated gate noise error.


Introduction
The noisy intermediate-scale quantum computer (NISQ) is a quantum computer that possesses considerable quantum errors [1]. Under the NISQ circumstance, it is necessary to develop noise-resilient quantum computation methods. There are two solutions to this problem. One is to perform quantum computing while correcting quantum errors as they occur. Another approach is to develop a hybrid quantum-classical algorithm that completes the quantum part of computing before the quantum error becomes fatal and shifts the rest of the task to the classical computer. The latter approach has prompted the development of many algorithms, such as the quantum approximate optimization algorithm (QAOA) [2], the variational quantum eigensolver (VQE) [3], and many others [4][5][6]. The quantum-classical algorithms aim to seek the 'quantum advantage' rather than 'quantum supremacy' [7]. Quantum supremacy states that a quantum computer must prove that it can achieve a level, either in terms of speed or solution finding, that can never be achieved by any classical computer. It has been considered that quantum supremacy may appear in several decades and that instances of 'quantum supremacy' reported so far are either overstated or lack fair comparison [8,9]. From this point of view, the quantum advantage is a more realistic goal, and it aims to find concrete and beneficial applications of the NISQ devices. Within the scope of quantum advantage, the application of quantum computers can be expanded far beyond computing speed racing to usage in various fields, such as representing wavefunctions in quantum chemistry [10][11][12][13][14] or working as a quantum kernel to represent enhanced high-dimensional features in the field of machine learning [15][16][17][18].
In QAOA, VQE, or other hybrid NISQ algorithms, the task of optimizing the model parameter is challenging. In all these algorithms, the parameter search and updating are performed in the classical computer. In a completely classical approach, the optimal parameter search is usually categorized as a mathematical optimization problem, where various methods, both gradient-based and non-gradient-based, have been widely utilized. For quantum circuit learning, most parameter searching algorithms so far are based on non-gradient methods such as the Nelder-Mead method [19] and quantum-inspired metaheuristics [20,21]. Recently, however, gradient-based ones such as SPSA [22] and a finite difference method have been reported [23].
In this article, we propose an error backpropagation algorithm on quantum circuit learning to calculate the gradient required in parameter optimization efficiently. The purpose of this work is to develop a gradient-based circuit learning algorithm with superior learning speed to the ones reported so far. The error backpropagation method is known as an efficient method for calculating gradients in the field of deep neural network machine learning for updating parameters using the gradient descent method [24]. Further speed improvement can be easily realized through using the GPGPU technique, which is again well established and under significant development in the field of deep learning [25].
The idea behind our proposal is described as follows: As depicted in Figure 1, if the input quantum state is $|\psi_{\mathrm{in}}\rangle$ and a specific quantum gate $U(\theta)$ is applied, then the output state can be expressed as the product of the quantum gate with the input state, $|\psi_{\mathrm{out}}\rangle = U(\theta)|\psi_{\mathrm{in}}\rangle$, where $\theta$ stands for the parameters of the gate $U(\theta)$. On the other hand, the calculation process of a fully connected neural network without activation function can be written as $Y = W \cdot X$, where $X$ is the input vector, $W$ is the weight matrix of the network, and $Y$ is the output. The quantum gate $U(\theta)$ is remarkably similar to the network weight matrix $W$. This shows that backpropagation algorithms that are used for deep neural networks can be modified to some extent to be applied to the simulation process of quantum circuit learning.
Quantum Rep. 2021, 3
Figure 1. Example of three-gate quantum circuit and its corresponding fully connected quantum network, showing similarity to a four-layer neural network with equal numbers of nodes in the input layer, middle layer, and output layer. Note that the amplitude value is not normalized for better eye-guiding illustration.
The method we propose makes it possible to reduce the time significantly for gradient calculation when the number of qubits is increased or the depth of the circuit (the number of gates) is increased. Meanwhile, by taking advantage of GPGPU, it is expected that using gradient-based backpropagation in the NISQ hybrid algorithms will further facilitate parameter search when many qubits and deeper circuits are deployed.

Quantum Backpropagation Algorithm
As shown in Figure 1, a quantum circuit can be effectively represented by a fully connected quantum network with significant similarity to the conventional neural network except for two facts: (1) there is no activation function applied upon each node, so the node is not considered as a neuron (or, equivalently, the activation function is assumed to be the identity); (2) the numbers of nodes are equal among the input layer, middle layer, and output layer, since the dimensionality of each layer is the same, which is quite different from the conventional neural network where the dimensionality in the middle layers can be freely tailored. Notice that the state shown as input in the quantum circuit is only one of the $2^n$ basis states ($n$ is the number of qubits), with an amplitude of '1' (not normalized) (see Figure 1 for details). The network similarity implies that a learning algorithm such as backpropagation, which is heavily used in the field of deep machine learning, can be shared by the quantum circuit as well.
In general, the backpropagation method uses the chain rule of partial differentiation to propagate the gradient back from the network output and calculate the gradient of the weights. Owing to the chain rule, the backpropagation requires only the input/output relationship of each node, at the computation cost of a node [24]. In the simulation of quantum computing by error backpropagation, the quantum state $|\psi\rangle$ and the quantum gates are represented by complex values. Here we will show the derivation details regarding the quantum backpropagation in complex-valued vector space. When the input of $n$ qubits is $|\psi_{\mathrm{in}}\rangle$ and the quantum circuit parameter network $W(\theta)$ is applied, the output $|\psi_{\mathrm{out}}\rangle$ can be expressed as $|\psi_{\mathrm{out}}\rangle = W(\theta)|\psi_{\mathrm{in}}\rangle$.
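The chain-rule idea above can be illustrated with a minimal NumPy sketch. It is not the derivation of this paper; it assumes a single qubit, a hypothetical gate sequence of RY and RZ rotations, and the loss $L = \langle\psi_{\mathrm{out}}|Z|\psi_{\mathrm{out}}\rangle$. The function names (`forward`, `backprop_grad`, and so on) are illustrative only. The error vector is propagated backwards by applying the conjugate transpose of each gate, and each parameter gradient is read off from the stored intermediate state.

```python
import numpy as np

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def d_ry(t):
    # Derivative of ry(t) with respect to t.
    c, s = np.cos(t / 2), np.sin(t / 2)
    return 0.5 * np.array([[-s, -c], [c, -s]], dtype=complex)

def rz(t):
    return np.diag([np.exp(-0.5j * t), np.exp(0.5j * t)])

def d_rz(t):
    # Derivative of rz(t) with respect to t.
    return np.diag([-0.5j * np.exp(-0.5j * t), 0.5j * np.exp(0.5j * t)])

GATES = {"ry": (ry, d_ry), "rz": (rz, d_rz)}
Z = np.diag([1.0, -1.0]).astype(complex)

def forward(kinds, thetas, psi_in):
    # Store every intermediate state, as a neural network stores activations.
    states = [psi_in]
    for kind, t in zip(kinds, thetas):
        states.append(GATES[kind][0](t) @ states[-1])
    return states

def expect_z(kinds, thetas, psi_in):
    psi = forward(kinds, thetas, psi_in)[-1]
    return float(np.real(np.vdot(psi, Z @ psi)))

def backprop_grad(kinds, thetas, psi_in):
    states = forward(kinds, thetas, psi_in)
    # Error at the output: derivative of <psi|Z|psi> w.r.t. conj(psi_out).
    delta = Z @ states[-1]
    grads = np.zeros(len(thetas))
    for k in reversed(range(len(thetas))):
        gate, dgate = GATES[kinds[k]]
        # Gradient of gate k uses its input state and the error at its output.
        grads[k] = 2.0 * np.real(np.vdot(delta, dgate(thetas[k]) @ states[k]))
        # Propagate the error to the input of gate k.
        delta = gate(thetas[k]).conj().T @ delta
    return grads

kinds = ["ry", "rz", "ry"]
thetas = np.array([0.3, 0.7, -0.4])
psi0 = np.array([1, 0], dtype=complex)
print(backprop_grad(kinds, thetas, psi0))
```

One forward pass and one backward pass yield all parameter gradients at once, which is the property the proposed method relies on.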

Simulation Results
To verify the validity of the proposed quantum backpropagation algorithm, we conducted the experiments for the supervised learning tasks, including both regression and classification problems.
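The quantum circuit consists of a unitary input gate $U_{\mathrm{in}}(x)$ that creates an input state from classical input data $x$ and a unitary gate $W(\theta)$ with parameters $\theta$. For the unitary input gate, we use the form proposed in [23], as shown in Figure 2a. We use $W(\theta) = U_{\mathrm{loc}}^{(l)}(\theta_l)\, U_{\mathrm{ent}} \cdots U_{\mathrm{loc}}^{(1)}(\theta_1)\, U_{\mathrm{ent}}\, U_{\mathrm{loc}}^{(0)}(\theta_0)$ as proposed in [27], where each $U_{\mathrm{loc}}^{(k)}(\theta_k) = \bigotimes_{j} U(\theta_{j,k})$ comprises local single-qubit rotations. We only use Y and Z rotations, so $U(\theta_{j,k}) = R_Z(\theta^{Z}_{j,k})\, R_Y(\theta^{Y}_{j,k})$. The parameters satisfy $\theta_k \in \mathbb{R}^{2n}$ and $\theta_{j,k} \in \mathbb{R}^{2}$. $U_{\mathrm{ent}}$ is the entangling gate, for which we use controlled-Z (CZ) gates. The overall quantum circuit is shown in Figure 2c.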
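As a rough illustration of this layered structure, the following NumPy sketch applies alternating layers of local $R_Z R_Y$ rotations and CZ entanglers to a state vector. Several choices here are assumptions rather than details taken from the text: the CZ gates are placed on nearest-neighbour pairs in a chain, qubit 0 is treated as the most significant bit, and the names `apply_w`, `u_loc`, and `u_ent` are hypothetical.

```python
import numpy as np

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(t):
    return np.diag([np.exp(-0.5j * t), np.exp(0.5j * t)])

def apply_1q(psi, gate, q, n):
    # Contract a 2x2 gate with axis q of the n-qubit state tensor.
    psi = psi.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=([1], [q]))
    psi = np.moveaxis(psi, 0, q)
    return psi.reshape(-1)

def apply_cz(psi, q1, q2, n):
    # CZ flips the sign of amplitudes where both qubits are 1.
    psi = psi.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[q1] = 1
    idx[q2] = 1
    psi[tuple(idx)] *= -1
    return psi.reshape(-1)

def u_loc(psi, theta_k, n):
    # theta_k has shape (n, 2): column 0 is the Y angle, column 1 the Z angle.
    for j in range(n):
        psi = apply_1q(psi, rz(theta_k[j, 1]) @ ry(theta_k[j, 0]), j, n)
    return psi

def u_ent(psi, n):
    # Assumed entangler layout: CZ on each nearest-neighbour pair.
    for j in range(n - 1):
        psi = apply_cz(psi, j, j + 1, n)
    return psi

def apply_w(psi, theta, n):
    # theta has shape (l + 1, n, 2); layers are applied in order k = 0..l.
    psi = u_loc(psi, theta[0], n)
    for k in range(1, theta.shape[0]):
        psi = u_ent(psi, n)
        psi = u_loc(psi, theta[k], n)
    return psi

n, l = 3, 3
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=(l + 1, n, 2))
psi0 = np.zeros(2 ** n, dtype=complex)
psi0[0] = 1.0
psi_out = apply_w(psi0, theta, n)
print(theta.size, np.linalg.norm(psi_out))
```

With $n = 3$ and $l = 3$ this layout has $2n(l+1) = 24$ parameters, and the output state stays normalized because every layer is unitary.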

Regression
In regression tasks, the circuit parameters were set to $n = 3$ and $l = 3$; that is, the number of qubits is 3 and the depth of the circuit is 4. The expected value of the observable Pauli Z for the first qubit was obtained from the output state $|\psi_{\mathrm{out}}\rangle$ of the circuit. One-dimensional data $x$ is input by setting circuit parameters as in Equations (10) and (11). The target function $f(x)$ was regressed with the output of twice the Z expected value. We performed three regression tasks to verify the effectiveness of the proposed approach. A conventional least square loss function is adopted in the current regression tasks (Equation (12)), and its first derivative (Equation (13)) gives the error $\delta$ used for the backpropagation. The expectation value of Z is given by Equation (14).
Here we provide a more detailed explanation of how the expectation value in Equation (14) is obtained. There are two ways to obtain the probability $p_{|i\rangle_1,\theta}$ in Equation (14). The first is measurement through observation. For example, a quantum circuit of 3 qubits has a probability for each of its eight basis states. If the measurement is performed at the first qubit, as shown in Figure 3, the probabilities $p_{|0\rangle_1,\theta}$ and $p_{|1\rangle_1,\theta}$ represent the probability of the first qubit being observed in the state $|0\rangle$ or $|1\rangle$, respectively. The second approach is to calculate the probability using the quantum simulator. In this case, $p_{|0\rangle_1,\theta}$ and $p_{|1\rangle_1,\theta}$ are mathematically equivalent to a marginalization of the eight state probabilities over the remaining qubits. By completing this calculation, the probability needed in Equation (14) can be worked out, and thus $\langle Z\rangle$ is obtained.
Figure 4 shows the regression results for three typical tasks to verify the validity of the proposed algorithm.
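The marginalization route can be sketched in a few lines of NumPy. The sketch assumes that the first qubit is the most significant bit of the basis-state index, that $\langle Z\rangle = p_{|0\rangle_1,\theta} - p_{|1\rangle_1,\theta}$, and that the loss is the squared error $\tfrac{1}{2}(2\langle Z\rangle - f(x))^2$, so the error is $\delta = 2\langle Z\rangle - f(x)$. The factor $\tfrac{1}{2}$ and all function names are assumptions made for illustration; the paper's Equations (12)-(14) may use a different normalization.

```python
import numpy as np

def first_qubit_marginals(psi, n):
    # Probabilities of all 2**n basis states.
    probs = np.abs(psi) ** 2
    # Axis 0 indexes the first qubit (assumed most significant bit);
    # summing over axis 1 marginalizes the remaining n - 1 qubits.
    p0, p1 = probs.reshape(2, 2 ** (n - 1)).sum(axis=1)
    return p0, p1

def expect_z_first(psi, n):
    p0, p1 = first_qubit_marginals(psi, n)
    return p0 - p1

def regression_error(psi, n, target):
    # Model output is twice the Z expectation; delta is the derivative of
    # the assumed loss 0.5 * (output - target)**2 w.r.t. the output.
    output = 2.0 * expect_z_first(psi, n)
    return output - target

# Example state: sqrt(0.8)|000> + sqrt(0.2)|111>
psi = np.zeros(8, dtype=complex)
psi[0] = np.sqrt(0.8)
psi[7] = np.sqrt(0.2)
print(first_qubit_marginals(psi, 3), regression_error(psi, 3, target=1.0))
```

For this example state the first-qubit marginals are 0.8 and 0.2, so $\langle Z\rangle = 0.6$ and the model output is 1.2.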
In Figure 4a-c, three target functions representing both linear and nonlinear regression were chosen as follows: $f_1(x)$, which represents a typical linear function; $f_2(x) = x^2$, which represents a single concave profile nonlinear problem; and $f_3(x) = \sin x$, which represents a multi-concave-convex wavy profile for more complex problems. Noise was also added to the target function for realistic purposes, and the number of training data was chosen as 100 in circuit learning for the three target functions. It can be seen that the quantum circuit based on error backpropagation performs very well in the regression task. For instance, the values of $R^2$ for the regression of $x^2$ and $\sin x$ are as high as 0.989 and 0.992, respectively. At the initial learning stage, the results show large deviation from the target function, and at the final learning stage the regressed curve captures the main feature of the training data and shows a very reasonably fitted curve. In Figure 4a, the fitted curve shows deviation at the left edge of the regression profile. This deviation is attributed to a lack of training data at the boundary and can be improved by either increasing the number of training data or adding a regularization term in the loss function, which is regularly used in conventional machine learning tasks.

Classification
In the classification problem, we have modified the quantum circuit architecture to accommodate the increased number of parameters for both qubit and circuit depth. The initial parameter set for the classification problem was n = 4 and l = 6 (number of layers is 7). Here we show only the results for nonlinear classification problems. The example of binary classification of the two-dimensional data is used in the experiment.
Here the dataset was prepared by referring to a similar dataset from scikit-learn [28]. We consider two representative nonlinear examples: one is the make_circles dataset, and the other is make_moons. We consider make_moons to possess more complicated nonlinear features than make_circles. It should be noted that the data presented here are results from samples without added noise. Due to the shortage of space, classification results for noisy training data are given in the Supplementary Materials. The two-dimensional input data $x$ was prepared by setting the circuit parameters accordingly. For training, a typical cross-entropy loss function was adopted to generate the error, which was further backpropagated to update the learning parameter.
The cross-entropy formula looks complicated, but its first derivative with respect to the probability $y_1$ reduces to the form of error backpropagation similar to the regression tasks.
For the output state $|\psi_{\mathrm{out}}\rangle$, we calculated the expected values $\langle Z_1\rangle$ and $\langle Z_2\rangle$ of the observable Z using the first and second qubits, as shown in Figure 5. Similar to the process adopted in the regression task, the final probabilities for the first and second qubits can be defined in the same way, here assuming a 3-qubit quantum circuit, and the expected values $\langle Z_1\rangle$ and $\langle Z_2\rangle$ by observation measurement follow from these probabilities.
Meanwhile, for the classification problem, a SoftMax function was applied to the outputs $\langle Z_1\rangle$ and $\langle Z_2\rangle$ to obtain continuous probabilities $y_1$ and $y_2$ between 0 and 1. Again, this treatment is the same as the method used in neural network-based machine learning classification. The obtained $y_1$ or $y_2$ can be used to calculate the loss function defined in Equation (18). Here, for binary classification, there exists a linear relation between $y_1$ and $y_2$, as shown in Equations (25)-(27).
For the proof of concept, a limited number of training data was used and was set as 200. Half of the data were labelled as '0'; the remaining half were labelled as '1'. For comparison, we also applied the classical support vector machine (SVM) provided in the scikit-learn package to the same datasets. The results from SVM serve as a rigorous reference for verifying the validity of the proposed approach. Figure 6 shows the test results for the two nonlinear classification tasks. In Figure 6a,e, two-dimensional training data with values ranging between −1 and 1 were chosen as the training dataset. Here the noise was not added for simplicity, and the training data with added noise are presented in the Supplementary Materials (S.D). Figure 6b shows the test results based on the learned parameter from the training dataset shown in Figure 6a. A multicolored contour-line-like classification plane was found in Figure 6b. The multicolored value corresponds to the continuous output of the probability from the SoftMax function. A typical two-valued region can be easily determined by taking the median of the continuous probability as the classification boundary, and it is shown in Figure 6b with the dashed pink line. Reference SVM results simulated using scikit-learn are shown in Figure 6c. Since the SVM simulation treats the binary target discretely, the output shows the exact two-value-based colormaps of the test results. It can be easily seen here that the results shown in Figure 6b are highly consistent with the SVM results. In particular, the location of the median boundary (dashed pink line) corresponds precisely to the SVM results. For the make_moons dataset, the situation becomes more complicated due to the increased nonlinearity in the training data. Figure 6d-f shows the same simulation sequence as for the make_circles data.
However, the results from both the error backpropagation approach proposed here and SVM showed misclassification. The classification mistake usually occurs near the terminal edge area where the label '0' and label '1' overlap with each other. Taking a closer look at the test results shown in Figure 6e,f, it can be found that the misclassification presents differently. For quantum circuit learning, the misclassification occurs mostly at the left side of the label '0' in the overlapping area. For SVM, the misclassification is roughly equally distributed for both label '0' and label '1', indicating the intrinsic difference between these two simulation algorithms.

Learning Efficiency Improvement
As shown in Figure 6d-f, both the backpropagation-based quantum learning algorithm and the classical SVM algorithm failed to provide excellent test accuracy on the make_moons classification dataset. Further investigation aiming at improving the test accuracy for the make_moons data was conducted. Here we adopted two approaches for this purpose: (i) adjusting the depth of the quantum circuit and (ii) adjusting the scaling parameter γ. The results are summarized as follows:
(i) Varying the depth of the quantum circuit: We consider that one of the reasons for the misclassification in Figure 6e is the limited representation ability due to the limited depth of the quantum circuit. Therefore, we investigated the effect of quantum circuit depth on the learning accuracy, and the results are shown in Figure 7a-c. The depth of the quantum circuit was set to 4, 7, and 10 layers. Four layers of the circuit showed an almost linear separation plane, indicating insufficient representation of the nonlinear feature in the training data. However, with the increase in the circuit depth, the classification boundary (separation plane) becomes more nonlinear, as shown in Figure 7b, where the depth of the quantum circuit was set as six layers. Figure 7c shows the results obtained at a depth of 10 layers, and it can be clearly seen that the separation plane is almost identical to that at the depth of 6 layers shown in Figure 7b. This observation indicates the existence of a critical depth, where the learning efficiency is saturated and no further improvement can be obtained for any depth beyond the critical depth. For the current experimental condition of a 4-qubit system with a training dataset of 200 data points, the critical depth is estimated to be around six layers.
(ii) Varying the scaling parameter γ: Before we present the results obtained by varying the parameter γ, we first provide a detailed description about the tuning principle of γ since it is extremely important when dealing with the learning process under a large-scale quantum computing environment.
Parameter γ appears in the SoftMax function, which is used to convert the expectation values $\langle Z_1\rangle$ and $\langle Z_2\rangle$ to continuous probabilities $y_1$ and $y_2$ between 0 and 1. The SoftMax function takes the same form as shown in Equations (25) and (26), except that the expectation values are scaled by γ. In other words, for all the learning results shown so far, we have assumed the parameter γ = 1. The effect of γ on the probability value $y$ is illustrated by increasing the value of γ from 1 to 3 and 5. Let us assume that we have obtained two values for $\langle Z_1\rangle$ and $\langle Z_2\rangle$ as 0.3 and 0.1, respectively. The difference between these two values is very small. However, the difference between $\langle Z_1\rangle$ and $\langle Z_2\rangle$ can be mathematically magnified by increasing the value of the parameter γ. As shown in Equations (31)-(33), an increase in the parameter γ significantly enhances the difference between the converted probabilities $y$. The enlarged difference is expected to improve the learning efficiency in the classification problem, since it makes it easier to determine the separation plane between the binary training data.
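The magnifying effect of γ can be checked numerically with a short sketch. It assumes the scaled SoftMax has the common form $y_i = e^{\gamma \langle Z_i\rangle} / \sum_j e^{\gamma \langle Z_j\rangle}$, which may differ in detail from the paper's equations; `scaled_softmax` is a hypothetical helper name.

```python
import numpy as np

def scaled_softmax(z, gamma=1.0):
    # Scale the expectation values by gamma before the usual softmax.
    a = gamma * np.asarray(z, dtype=float)
    a = a - a.max()  # subtract the maximum for numerical stability
    e = np.exp(a)
    return e / e.sum()

# Expectation values from the example in the text.
z_values = [0.3, 0.1]
for gamma in (1, 3, 5):
    y1, y2 = scaled_softmax(z_values, gamma)
    print(f"gamma={gamma}: y1={y1:.3f}, y2={y2:.3f}, y1-y2={y1 - y2:.3f}")
```

Under this assumed form, $y_1$ grows from about 0.55 at γ = 1 to about 0.65 at γ = 3 and about 0.73 at γ = 5, so the gap between the two class probabilities widens as γ increases.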
To verify the effect of the scaling parameter γ, we performed further experiments on the make_moons data. The results obtained by tuning the scaling parameter γ are summarized in Figure 7d-f, showing the results from three cases: γ = 1, γ = 3, and γ = 5. In all the experiments, the number of qubits was kept at 4. It can be clearly seen that the scaling parameter γ exerts a significant effect on the learning efficiency. The classification accuracy is dramatically improved when γ is set to 5, as shown in Figure 7f. By checking the contour separation line shown in Figure 7f, it can be easily confirmed that the classification accuracy has reached almost 100%, indicating the effectiveness of the scaling parameter γ in improving learning efficiency. It is also worthwhile to mention here that the probability of each quantum state has to be normalized to ensure the summation $\sum_i p_i = 1$. This constraint strongly suppresses the probability of each state, and the final probability difference between states at the initial learning stage tends to become extremely small due to the exponential increase in the $2^{N_{\mathrm{qubit}}}$ states in large-scale quantum computing systems. We claim that it is extremely important to tune the scaling parameter γ for NISQ systems involving large numbers of qubits for good learning performance.

Computation Efficiency
Having confirmed the validity of the proposed error backpropagation on various regression and classification problems, we show one great advantage of using backpropagation to perform parameter optimization over other approaches. It has been rigorously demonstrated in the deep neural network-based machine learning field that the error backpropagation method is several orders of magnitude faster than the conventional finite difference method in gradient descent-based learning algorithms. In this work, we conducted a benchmark test to verify whether there is a decisive advantage of using a backpropagation algorithm in quantum circuit learning. Figure 8 shows the computation cost comparison among three methods: a finite difference method proposed in [22], the popular SPSA method that is currently used in complicated quantum circuit learning [27], and the proposed method based on backpropagation. The execution time in seconds per 100 iterations is selected for a fair comparison. The number of parameters is determined by the quantum circuit depth $l$ and the number of qubits $O_{\mathrm{qubit}}$.
The result of the comparison by varying both the depth of the quantum circuit and the number of qubits is presented in Figure 8. We implemented the three methods on the same make_moons dataset and recorded the computation time cost per 100 iterations. Figure 8a shows the dependence of computation cost on the depth of the quantum circuit. In this experiment, we fixed the number of qubits $O_{\mathrm{qubit}}$ at 4. The depth of the quantum circuit was varied from 5 to 20 at intervals of 5. It can be clearly seen that there is a dramatic difference in computation time cost for 100 iteration learning steps. The finite difference method and the SPSA method showed poor computation efficiency, as has been mentioned above and demonstrated in the deep neural network-related machine learning field.
The computation costs rise exponentially as the depth of the circuit increases, limiting their application in large-scale and deep quantum circuits. In contrast, the backpropagation method proposed here showed a dramatic advantage over all other methods by showing an almost constant dependence on the depth of the quantum circuit. The computation time recorded at a depth of 20 layers was 3.2 s, which is almost negligible when compared to the value of 458 s obtained by using the finite difference method and the value of 696 s obtained by using the SPSA method at the same 20-layer depth. Figure 8b shows the dependence of computation cost on the number of qubits. In this experiment, we fixed the depth of the quantum circuit at 10 layers. The number of qubits was varied from 2 to 6 at intervals of 1. Similar to the tendency found in Figure 8a, there is a dramatic difference in computation time cost for 100 iteration learning steps. The finite difference method and the SPSA method showed poor computation efficiency, and the profile was similar to those shown in Figure 8a. The computation costs rise exponentially as $O_{\mathrm{qubit}}$ increases, limiting their application in large-scale and deep quantum circuits. In contrast, the backpropagation method proposed here showed a dramatic advantage over all other methods by showing an almost constant dependence on $O_{\mathrm{qubit}}$. The computation time recorded at 6 qubits was around 4.1 s, which is almost negligible compared to the value of 393 s obtained by using the finite difference method and the value of 752 s obtained by using the SPSA method at the same number of qubits.
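One reason a finite difference gradient becomes expensive can be seen by counting loss evaluations. The sketch below is not the benchmark of Figure 8. It uses a toy loss in place of a circuit simulation, and it assumes the parameter count $2\,n\,(l+1)$ implied by $\theta_k \in \mathbb{R}^{2n}$ with layers $k = 0, \ldots, l$. The names `num_params`, `CountingLoss`, and `forward_difference_grad` are hypothetical.

```python
import numpy as np

def num_params(n_qubit, l):
    # Assumed count: 2 angles per qubit in each of the l + 1 rotation layers.
    return 2 * n_qubit * (l + 1)

class CountingLoss:
    """Wraps a loss function and counts how often it is evaluated."""
    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, theta):
        self.calls += 1
        return self.fn(theta)

def forward_difference_grad(loss, theta, eps=1e-6):
    # One baseline evaluation plus one shifted evaluation per parameter.
    base = loss(theta)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        shifted = theta.copy()
        shifted[i] += eps
        grad[i] = (loss(shifted) - base) / eps
    return grad

# Toy loss standing in for a full circuit simulation.
toy = CountingLoss(lambda t: float(np.sum(np.sin(t))))
theta = np.linspace(0.0, 1.0, num_params(4, 5))
grad = forward_difference_grad(toy, theta)
print(theta.size, toy.calls)
```

Each finite difference gradient needs one loss evaluation per parameter plus a baseline, and each evaluation is a full circuit simulation, whereas backpropagation obtains all gradients from a single forward and backward pass.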

Experimental Implementation Using IBM Q
So far, we have presented results obtained with the quantum simulator. The implementation architecture for a real machine such as an NISQ device is described in Figure 9. To use the error backpropagation method, it is necessary to prepare not only the expected value ⟨Z⟩ but also the quantum state |ψ⟩. Therefore, as shown in the figure, a quantum circuit with the same configuration as the real quantum circuit must be prepared as a quantum simulator on a classical computer. It should be noted that this should not be regarded as an additional burden for the quantum computing scientist. Unlike a classical computer, a quantum computer cannot be disturbed while it is operating, so it needs a counterpart quantum circuit simulator to monitor and diagnose qubit and gate errors and to characterize the advantage of quantum computers over classical computers [29][30][31][32][33][34]. Therefore, a real quantum computer always requires a quantum simulator ready for use at any time. This means that we can always access the quantum simulator, as shown on the right-hand side of Figure 9, to examine and obtain detailed information about the performance of the corresponding real quantum computer. The observation probability for each state ψ_j can be estimated by running R shots on the real quantum computer. The observation probability obtained from the real quantum machine is then passed to the classical computer, where the simulated quantum circuit is used. The parameter θ can be updated using backpropagation because all the intermediate information is available on the simulator side. After the updated parameter θ* is obtained on the simulator side, it is returned to the real quantum machine for the next iteration.
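The division of labour described above can be illustrated with a deliberately small toy, given here as a sketch rather than the authors' implementation. It assumes a one-qubit circuit RY(θ)|0⟩, for which ⟨Z⟩ = cos θ, and a squared-error loss. The function names `measure_z_on_device` and `simulator_dz_dtheta` are hypothetical stand-ins for the real machine and for the simulator-side backpropagation, which in this toy reduces to a closed-form derivative.

```python
import math
import random


def measure_z_on_device(theta, shots, rng):
    """Stand-in for the real machine: sample `shots` outcomes of RY(theta)|0>."""
    p0 = math.cos(theta / 2) ** 2              # probability of observing |0>
    n0 = sum(1 for _ in range(shots) if rng.random() < p0)
    return (2 * n0 - shots) / shots            # estimate of <Z> = p0 - p1


def simulator_dz_dtheta(theta):
    """Stand-in for the simulator: exact d<Z>/dtheta, since <Z> = cos(theta)."""
    return -math.sin(theta)


def train(target, theta=2.0, lr=0.5, steps=200, shots=2048, seed=0):
    """Gradient descent on 0.5 * (<Z> - target)**2 with a device/simulator split."""
    rng = random.Random(seed)
    for _ in range(steps):
        z_measured = measure_z_on_device(theta, shots, rng)          # device side
        grad = (z_measured - target) * simulator_dz_dtheta(theta)    # simulator side
        theta -= lr * grad                     # updated parameter returns to the device
    return theta


theta_star = train(target=0.3)
print(f"learned theta = {theta_star:.3f}, cos(theta) = {math.cos(theta_star):.3f}")
```

The measured expectation value enters only through the loss residual, while the derivative comes from the classical side, which mirrors the data flow of Figure 9 in a much reduced form.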
Next, we implemented the architecture shown in Figure 9 and conducted a regression experiment on a real machine. The number of qubits and the depth of the circuit were set to n = 3 and l = 4, as in Section 3.1. For the circuit parameters, the one-dimensional data x were substituted as in Equations (10) and (11). As before, the target function f(x) was regressed with twice the expected value of Pauli Z. The loss function and its derivative were calculated in the same way as in Equations (12) and (13), and the expected value of Pauli Z was calculated as in Equation (14). Since we were using a real machine this time, we measured only the first qubit of the quantum circuit and statistically obtained the probabilities p_{|0⟩_1,θ} and p_{|1⟩_1,θ} of observing |0⟩ and |1⟩ on that qubit, as shown in Figure 3. The estimated expected value of Pauli Z is expected to approach the exact value as the number of measurements R becomes large. We used a 20-qubit quantum computer of IBM Q, ibmq_johannesburg, in our experiments [35]. Of the 20 qubits, we used 3 qubits for constructing the algorithm and multiple auxiliary qubits. Figure 10 shows the results of regression using the proposed method on a real machine.
In this experiment, we performed only linear regression and set the target function to f(x) = x. Unlike the experiment in Section 3.1, we performed circuit learning using 50 training data points that did not contain noise. For the results in Figure 10a-c, the numbers of measurements M_shot of the quantum circuit were 2048, 4096, and 8192, respectively. In all three cases, both the initial and final learning results are jagged lines rather than smooth curves. We concluded that this was because the observed values deviated from the correct values owing to noise or errors in the qubits of the real machine. It may be possible to obtain more accurate results by combining the proposed method with a noise-reduction algorithm or by using a machine with a lower noise rate. Comparing the results of the three experiments with the regression curve before learning shows that the regression was successful in all cases. However, the R² values for the regressions in Figure 10a-c were 0.933, 0.900, and 0.895, respectively, which were lower than those in the experiment in which regression was performed using only the simulator. This is because the error rate of the qubits is larger than the value of the gradient of the loss function. This is verified by the probability comparison for x = 0.5 shown in Figure 10b, where a large deviation was found between the probabilities measured directly on ibmq_johannesburg and those derived from the simulator. The fitted value is calculated as 2⟨Z⟩, where ⟨Z⟩ is calculated using Equation (14). It can be easily confirmed that the fitted value derived from ibmq_johannesburg is 2(0.339 − 0.661) = −0.644, while the value from the simulator is 2(0.192 − 0.808) = −1.236, which deviates further from the target value of −0.5. This is because, during learning, the model adapted to the noisy environment to some extent but ultimately failed to reach a satisfactory level of accuracy.
Therefore, the noise-free simulator shows a much worse regression value than ibmq_johannesburg when it uses the parameters learned on ibmq_johannesburg.
Figure 10. Regression results obtained with the proposed method on ibmq_johannesburg. (a) M_shot = 2048. (b) M_shot = 4096; the probability comparison for x = 0.5 is also shown, with the measurement on the IBM Q computer on the left and the values derived from the quantum circuit simulator on the right. (c) M_shot = 8192.
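The fitted values quoted above can be recomputed from the reported probabilities. The sketch below assumes, as the numbers in the text suggest, that Equation (14) reduces to ⟨Z⟩ = p0 − p1 for the measured first qubit; the helper name `fitted_value` is illustrative.

```python
def fitted_value(p0, p1):
    """Model output 2<Z>, taking <Z> = p0 - p1 for the measured first qubit."""
    return 2.0 * (p0 - p1)


target = -0.5                               # target value quoted in the text
device = fitted_value(0.339, 0.661)         # probabilities measured on ibmq_johannesburg
simulator = fitted_value(0.192, 0.808)      # probabilities derived from the simulator

print(f"device:    {device:.3f} (distance to target {abs(device - target):.3f})")
print(f"simulator: {simulator:.3f} (distance to target {abs(simulator - target):.3f})")
```

With the rounded probabilities given in the text, the simulator expression evaluates to about −1.232 rather than the reported −1.236; the small difference presumably comes from rounding of the probabilities. In either case the simulator value lies further from the target than the device value.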
The error rates of the single-qubit gates and of the CNOT gate of the machine used in this experiment are about 10^−4 and 10^−3, respectively (see Figure 9), while the gradients of the loss function are about 10^−17 or 10^−18. We cannot calculate the exact value of the gradients owing to insufficient precision. Therefore, we consider that the regression accuracy was inevitably lower on the current quantum computer than with the simulator alone. Furthermore, the R² value decreased as the number of measurements of the quantum circuit increased. We attribute this to the influence of errors and noise, which increased each time the quantum circuit was measured. There is thus a trade-off: the measured value becomes statistically more accurate as the number of measurements increases, whereas the noise in the measured value is reduced as the number of measurements decreases.
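The statistical side of this trade-off can be quantified with the standard binomial error of a ±1-valued measurement. The sketch below covers only sampling error on an ideal, noiseless device; it does not model the gate or readout errors discussed above, and the helper name `z_standard_error` is illustrative.

```python
import math


def z_standard_error(z, shots):
    """Standard error of <Z> estimated from `shots` outcomes that are each +1 or -1."""
    # A +/-1 outcome with mean z has variance 1 - z**2.
    return math.sqrt((1.0 - z * z) / shots)


# Worst case <Z> = 0 for the three shot counts used in Figure 10.
for shots in (2048, 4096, 8192):
    print(f"M_shot = {shots}: standard error of <Z> <= {z_standard_error(0.0, shots):.4f}")
```

Under this idealized model, the sampling error shrinks only as the inverse square root of the number of shots, so quadrupling the number of shots halves it.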
A concern may be raised about the feasibility of the proposed approach on a quantum circuit with hundreds or thousands of qubits. We do need a storage capacity of 2^{N_qubit} to accommodate all the states in order to perform the error backpropagation, and this becomes extremely challenging when N_qubit is very large. For an 'authentic' quantum algorithm, that is, a complete end-to-end quantum algorithm, the algorithm may indeed be designed in such a way that we do not need 2^{N_qubit} memory to record all the states, because most of the amplitudes of the states vanish during the quantum operation. However, as mentioned in [29][30][31][32][33][34], quantum computing and algorithm design must be guided by an understanding of what tasks we can hope to perform. This means that an efficient, scalable quantum simulator is always vital for the 'authentic' quantum algorithm. Since the error backpropagation is performed layer by layer through matrix operations, a more advanced GPGPU-based algorithm, tensor contraction, or the path integral-based sum-over-histories method could be used effectively to tackle the 2^{N_qubit} operation [35][36][37][38][39][40][41]. Therefore, the concern raised above will be relieved or eliminated through progress in quantum computing, GPGPU computing, and other surrounding techniques.
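To give a sense of scale for the 2^{N_qubit} storage requirement, the sketch below estimates the memory of a dense state vector. It assumes one double-precision complex amplitude (16 bytes) per basis state, which is a common but not the only possible choice; the helper name `statevector_bytes` is illustrative.

```python
def statevector_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory for 2**n_qubits complex amplitudes (16 bytes each in double precision)."""
    return (2 ** n_qubits) * bytes_per_amplitude


for n in (3, 20, 30, 40, 50):
    gib = statevector_bytes(n) / 2 ** 30
    print(f"{n:2d} qubits: {gib:.3e} GiB")
```

Each additional qubit doubles the requirement, which is why the dense representation becomes impractical well before hundreds of qubits and why the alternative simulation strategies mentioned above matter.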

Conclusions
We proposed a backpropagation algorithm for quantum circuit learning. The proposed algorithm succeeded in both linear and nonlinear regression and in classification problems. The computation efficiency was improved dramatically by using error backpropagation-based gradient circuit learning rather than other gradient-based methods such as the finite difference method or the SPSA method. With a quantum simulator, the computing time was reduced by up to several orders of magnitude compared with the conventional methods. Furthermore, the proposed theoretical scheme was successfully implemented on the 20-qubit quantum computer of IBM Q, ibmq_johannesburg, and it was revealed that the gate error, especially the CNOT gate error, strongly affects the accuracy of the derived gradient. Given that we do not need 2^{N_qubit} memory to record all the states, because most of the amplitudes of the state vanish during the quantum operation, a further computing advantage would be expected from combining backpropagation with the GPGPU technique. We therefore claim that gradient descent using error backpropagation is an efficient quantum circuit learning tool not only in the NISQ era but also for more mature quantum computers with deeper circuits and thousands of quantum bits.

Data Availability Statement:
The data and scripts that support the findings of this study are available from the corresponding author upon reasonable request.