Quantum Adversarial Transfer Learning

Adversarial transfer learning is a machine learning method that employs an adversarial training process to learn the datasets of different domains. Recently, this method has attracted attention because it can efficiently decouple the requirements of tasks from insufficient target data. In this study, we introduce the notion of quantum adversarial transfer learning, where data are completely encoded by quantum states. A measurement-based judgment of the data label and a quantum subroutine to compute the gradients are discussed in detail. We also prove that our proposal has an exponential advantage over its classical counterparts in terms of computing resources such as the gate number of the circuits and the size of the storage required for the generated data. Finally, numerical experiments demonstrate that our model can be successfully trained, achieving high accuracy on certain datasets.


Introduction
Machine learning (ML) methods have successfully been applied to various fields such as speech recognition, visual object recognition, and object detection [1,2]. In recent years, ML research has been extended to applications in more complicated but ordinary scenarios, such as situations involving datasets belonging to different domains. The predicament in these scenarios is that an ML model trained on a dataset in one domain often does not work for tasks in a different domain of interest. One resource-consuming strategy used to solve the above issue is transfer learning (TL) [3][4][5][6][7][8], which estimates the usefulness of the knowledge learned in a source domain and transfers it to help the learning task in a target domain. However, conventional TL usually lacks efficiency when the distribution of the target domain data is completely different from that of the source domain. Recently, an adversarial transfer learning (ATL) method has been proposed to solve this issue. Such an ML scheme has the potential for a broad range of applications; it has been proven to be beneficial in areas such as natural language processing, autonomous cars, robotics, and image understanding [9][10][11][12][13][14][15][16][17][18][19].
The basic idea of the ATL method is to introduce generative models to bridge the gap between the datasets of different domains. This method is a new version of ML and cannot be obtained by simply combining methods such as generative adversarial networks (GANs) [20][21][22] and TL. As schematically shown in Figure 1a, ATL works with samples of a given source domain dataset and a target domain dataset, denoted as $X_{R_s}$ and $X_{R_t}$, respectively. A generator G is employed to produce a fake sample $X_G$ using $X_{R_s}$ and a noise vector $\vec{z}$. $X_G$ and $X_{R_t}$ are then sent to a discriminator D, which judges how likely each sample is to be real. The training of G and D takes the form of an adversarial game. The generator G is required to generate $X_G$ as close as possible to the target data sample $X_{R_t}$, cheating the discriminator D. The discriminator D attempts to accurately distinguish $X_G$ from $X_{R_t}$ and avoid being cheated. The game finally reaches the Nash equilibrium [23] after the parameters of the model are optimized. In cases of classification, a classifier T is applied to label $X_G$ according to the label of $X_{R_s}$. In certain cases, the source domain dataset is also employed as the input of T, which increases the efficiency of the training. The key ingredient in such a scheme is the cost function of the model, which finally converges to an equilibrium point.
Figure 1. (a) The ATL scheme. The generator G produces a data sample $X_G$ from a source data sample $X_{R_s}$ and a noise vector $\vec{z}$. The discriminator D distinguishes $X_G$ from the target data sample $X_{R_t}$, assigning the judgement "real" or "fake". The classifier T assigns task-specific labels {λ} to the fake data sample $X_G$. Note that the source data sample $X_{R_s}$ is passed on to T only when it is used to increase the efficiency of the training, as marked by the dashed arrow. (b) The corresponding QATL scheme. The data samples ($X_{R_s}$, $X_{R_t}$, and $X_G$) are encoded by the quantum states ($\rho_{R_s}$, $\rho_{R_t}$, and $\rho_G$), respectively. The functioning of G, D, and T is implemented by the quantum operators $\hat{G}$, $\hat{D}$, and $\hat{T}$, respectively. The judgement of $\hat{D}$ is given by a quantum state ($|\mathrm{real}\rangle$ or $|\mathrm{fake}\rangle$) and the label is also encoded by a quantum state $|\lambda\rangle$. (c) The quantum circuit of QATL. The qubit numbers of the registers Bath D, Out G|$R_{t|s}$, and Bath T are d, n, and p, respectively. The registers Out D and Test both contain one qubit. Their qubits are initialized as $|0\rangle$. The register Bath G stores an m-qubit random state $|z\rangle$ generated by the environmental coupling. The register Label has the initial state $|\lambda\rangle$.
Despite the success of the above-mentioned method, the increasing requirements of data processing present a significant challenge to all computing strategies, including ML. As quantum computing provides an opportunity to overcome this challenge, researchers have considered addressing ML tasks using quantum machines. Proposed quantum machine learning algorithms include quantum support vector machines [29,30], quantum deep learning [32][33][34][35][36][37][38], quantum Boltzmann machines [39,40], quantum generative adversarial learning [41][42][43][44][45], and quantum transfer learning (QTL) [47][48][49][50][51]. A quantum counterpart of ATL, which could exhibit an advantageous performance in cross-domain learning tasks, has not yet been provided. Proposing such a counterpart is difficult because a suitable quantum cost function must be established so that the quantum adversarial game for knowledge transfer can finally converge to an equilibrium point, as in classical cases. How to develop such a function in the quantum regime and how to obtain a quantum version of ATL remain unknown.
In this paper, we demonstrate how we solved the above problem, and we propose quantum adversarial transfer learning (QATL) schemes. The training process of our QATL is equivalent to an adversarial game between a quantum generator and a quantum discriminator, and we demonstrate that an equilibrium point also exists in the model. Specifically, a quantum cost function for adversarial training is provided; a measurement-based judgment of the data label and a quantum subroutine to compute the gradients are also discussed in detail. We prove that our proposal has an exponential advantage over classical counterparts in terms of computing resources such as the gate number of the circuits and the size of the storage for the generated data. This benefits the transfer of complicated knowledge, during which a module is extensively called upon and a large amount of data is generated. We applied this scheme to classification tasks based on the Iris dataset and on the states of a Bloch sphere to show that a high classification accuracy can be achieved using this method.

Materials and Methods
Our QATL scheme is shown in Figure 1b. Here, the density matrix of the state that encodes a source (target) domain data sample is denoted by $\rho_{R_s}$ ($\rho_{R_t}$). A unitary operator is employed as the quantum generator $\hat{G}$, whose parameters are denoted by a vector $\vec{\theta}_G$. It operates on the state $\rho_{R_s}$ and a quantum noise state $|z\rangle\langle z|$, outputting a fake data sample $\rho_G$. Another unitary operator, parameterized by the vector $\vec{\theta}_D$, is used as the quantum discriminator $\hat{D}$. It operates on either the state $\rho_G$ generated by $\hat{G}$ or the state $\rho_{R_t}$. Ideally, it outputs the state $|\mathrm{real}\rangle\langle\mathrm{real}|$ when it operates on $\rho_{R_t}$ and the state $|\mathrm{fake}\rangle\langle\mathrm{fake}|$ when it operates on $\rho_G$. As with ATL, the general training purpose of QATL is to maximize the probability of $\rho_G$ passing the test of the discriminator $\hat{D}$ while simultaneously minimizing the probability of $\hat{D}$ being cheated by the fake sample $\rho_G$. The optimization of the parameters $\vec{\theta}_G$ and $\vec{\theta}_D$ is achieved when the above quantum adversarial game reaches the Nash equilibrium. We considered the classification task and applied a unitary operator parameterized by $\vec{\theta}_T$ as a quantum classifier $\hat{T}$ [58][59][60][61]. This provided a mapping of the data sample ($\rho_G$ or $\rho_{R_s}$) to a label state $L' = |\lambda'\rangle\langle\lambda'|$; $\vec{\theta}_T$ was updated during the training.
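Next, we discuss the state evolution of the above scheme on a circuit and demonstrate how an equilibrium point of the game was reached. The quantum circuit of Figure 1b is displayed in Figure 1c. The whole circuit ran on seven quantum registers. The register Out G|$R_{t|s}$, containing n qubits, encoded the target data, source data, or generated data. The encoding of the target (source) data was described by the operator $\hat{R}_t$ ($\hat{R}_s$), which acted on the register Out G|$R_{t|s}$. The register Bath G, containing m qubits, encoded the quantum noise state $|z\rangle\langle z|$. The generator $\hat{G}(\vec{\theta}_G)$ operated on Out G|$R_{t|s}$ and Bath G, producing the generated data. The register Bath D, containing d qubits, was employed as the internal workspace of the discriminator $\hat{D}$, whose outputs ($|\mathrm{real}\rangle\langle\mathrm{real}|$ or $|\mathrm{fake}\rangle\langle\mathrm{fake}|$) were stored by the register Out D. The register Bath T contained p qubits and was employed as the workspace of the classifier $\hat{T}$. The label state $|\lambda'\rangle$ outputted by $\hat{T}$ was compared with the single-qubit label state $|\lambda\rangle$ stored by the register Label. The register Test was used to estimate the closeness of $|\lambda'\rangle$ and $|\lambda\rangle$, so that $\rho_G$ could be properly labeled. All qubits were initialized to $|0\rangle$, except for those in Bath G and Label. The initial state of the quantum circuit shown in Figure 1c could then be denoted by
$$\rho_0 = (|0\rangle\langle 0|)^{\otimes d}_{\text{Bath D}} \otimes (|0\rangle\langle 0|)_{\text{Out D}} \otimes (|0\rangle\langle 0|)^{\otimes n}_{\text{Out G}|R_{t|s}} \otimes (|0\rangle\langle 0|)^{\otimes m}_{\text{Bath G}} \otimes (|0\rangle\langle 0|)^{\otimes p}_{\text{Bath T}} \otimes (|\lambda\rangle\langle\lambda|)_{\text{Label}} \otimes (|0\rangle\langle 0|)_{\text{Test}},$$
where ⊗ is the tensor product, the subscripts name the registers, and $(|0\rangle\langle 0|)^{\otimes p}$ is the tensor product of p copies of $|0\rangle\langle 0|$; this is also applicable for other similar terms. The unitary operators on such a state that encode the source domain data, target domain data, and generated data were denoted by $U_{R_s}$, $U_{R_t}$, and $U_G(\vec{\theta}_G)$, respectively. $U_{R_s}$ ($U_{R_t}$) applies $\hat{R}_s$ ($\hat{R}_t$) to Out G|$R_{t|s}$, and $U_G(\vec{\theta}_G)$ applies $\hat{G}(\vec{\theta}_G)$ to Out G|$R_{t|s}$ and Bath G; each acts on every remaining qubit as I, the 2 × 2 identity matrix. The states $\rho_{R_s}$ and $\rho_{R_t}$ were then given by
$$\rho_{R_s} = U_{R_s}\,\rho_0\,U_{R_s}^{\dagger}, \qquad \rho_{R_t} = U_{R_t}\,\rho_0\,U_{R_t}^{\dagger}.$$
We set Bath G to be $|0\rangle^{\otimes m}$ instead of rotating it to a random quantum state with extra control. A random quantum state can naturally be generated by the entanglement of the register state and environmental degrees of freedom [44].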
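The register bookkeeping described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the register order, the register sizes (taken here to match a five-qubit data/label layout with no bath qubits), the stand-in encoder built from Hadamard gates, and the helper names `initial_state` and `embed` are all assumptions made for illustration.

```python
from functools import reduce

import numpy as np

ZERO = np.array([[1.0, 0.0], [0.0, 0.0]])  # |0><0|
ONE = np.array([[0.0, 0.0], [0.0, 1.0]])   # |1><1|

# Hypothetical register layout (name, number of qubits); order and sizes are assumptions.
REGISTERS = [("BathD", 0), ("OutD", 1), ("OutGR", 2), ("BathG", 0),
             ("BathT", 0), ("Label", 1), ("Test", 1)]


def initial_state(registers, label_state):
    """Density matrix with every qubit in |0><0| except the Label qubit."""
    blocks = [np.eye(1)]  # neutral element, so empty registers are harmless
    for name, n_qubits in registers:
        single = label_state if name == "Label" else ZERO
        blocks.extend([single] * n_qubits)
    return reduce(np.kron, blocks)


def embed(op, target, registers):
    """Operator acting as `op` on register `target` and as the identity elsewhere."""
    blocks = [op if name == target else np.eye(2 ** n_qubits)
              for name, n_qubits in registers]
    return reduce(np.kron, blocks)


# Example: a stand-in encoder (Hadamard on both data qubits) plays the role of R_s.
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
rho_0 = initial_state(REGISTERS, ONE)
U_Rs = embed(np.kron(H, H), "OutGR", REGISTERS)
rho_Rs = U_Rs @ rho_0 @ U_Rs.conj().T  # rho_{R_s} = U rho_0 U^dagger
```

Padding with identities keeps every operator defined on the full circuit, so the generator, discriminator, and classifier can be composed by plain matrix multiplication in a simulation of this kind.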
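After $|z\rangle\langle z|$ was generated according to the requirements, the state $\rho_G$ could be expressed by
$$\rho_G = U_G(\vec{\theta}_G)\,\rho_{R_s}\,U_G^{\dagger}(\vec{\theta}_G),$$
where the register Bath G is understood to hold the noise state $|z\rangle\langle z|$. The discriminator $\hat{D}(\vec{\theta}_D)$ was applied to estimate the likelihood of $\rho_G$ and $\rho_{R_t}$. Its unitary operator $U_D(\vec{\theta}_D)$ applies $\hat{D}(\vec{\theta}_D)$ to the registers Bath D, Out D, and Out G|$R_{t|s}$ and the identity to the remaining qubits. Therefore, the resultant states of $\rho_{R_t}$ and $\rho_G$ after being operated on by $U_D(\vec{\theta}_D)$ were given by
$$\rho^{D}_{R_t} = U_D(\vec{\theta}_D)\,\rho_{R_t}\,U_D^{\dagger}(\vec{\theta}_D), \qquad \rho^{D}_{G} = U_D(\vec{\theta}_D)\,\rho_{G}\,U_D^{\dagger}(\vec{\theta}_D).$$
The expectation value of the operator $Z \equiv |\mathrm{real}\rangle\langle\mathrm{real}| - |\mathrm{fake}\rangle\langle\mathrm{fake}|$ could be measured on the register Out D. Such a value was close to 1 when $\hat{D}(\vec{\theta}_D)$ generated the state $|\mathrm{real}\rangle\langle\mathrm{real}|$; it was close to −1 when $\hat{D}(\vec{\theta}_D)$ outputted the state $|\mathrm{fake}\rangle\langle\mathrm{fake}|$.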
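As a minimal sketch of the measurement on Out D, the expectation value of Z can be computed from a density matrix as follows. The identification of $|\mathrm{real}\rangle$ with $|0\rangle$ and $|\mathrm{fake}\rangle$ with $|1\rangle$, the qubit indexing, and the function name `expect_z_out_d` are assumptions of this example, not details stated in the text.

```python
import numpy as np


def expect_z_out_d(rho, out_d_index, n_qubits):
    """Return Tr(Z rho) for the qubit at `out_d_index` (0 is the leftmost qubit).

    Assumes |real> = |0> and |fake> = |1>, so Z = |0><0| - |1><1|.
    """
    z = np.array([[1.0, 0.0], [0.0, -1.0]])
    left = np.eye(2 ** out_d_index)
    right = np.eye(2 ** (n_qubits - out_d_index - 1))
    z_full = np.kron(np.kron(left, z), right)
    return float(np.real(np.trace(z_full @ rho)))


# Example: Out D is the first of two qubits; the second qubit is maximally mixed.
real = np.array([[1.0, 0.0], [0.0, 0.0]])
fake = np.array([[0.0, 0.0], [0.0, 1.0]])
mixed = np.eye(2) / 2.0
z_real = expect_z_out_d(np.kron(real, mixed), 0, 2)
z_fake = expect_z_out_d(np.kron(fake, mixed), 0, 2)
```

Since $\langle Z\rangle = P(\mathrm{real}) - P(\mathrm{fake})$, the probability of the judgement "real" follows as $(1 + \langle Z\rangle)/2$.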
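The quantum classifier $\hat{T}$ we considered was forced to output the label of $\rho_G$ according to its closeness to $\rho_{R_s}$. We also introduced a method to judge whether the label was correct. The unitary operator $U_T(\vec{\theta}_T)$ of $\hat{T}$ was composed of $\hat{T}_1$ and $\hat{T}_2$, two unitary operators on Out G|$R_{t|s}$ and Bath T, respectively, with the identity acting on the remaining qubits. Their parameters were denoted by $\vec{\theta}_{T_1}$ and $\vec{\theta}_{T_2}$ correspondingly. Thus, $\vec{\theta}_T = (\vec{\theta}_{T_1}, \vec{\theta}_{T_2})$. The resultant states of $\rho_{R_s}$ and $\rho_G$ after being operated on by $U_T(\vec{\theta}_T)$ were given by
$$\rho^{T}_{R_s} = U_T(\vec{\theta}_T)\,\rho_{R_s}\,U_T^{\dagger}(\vec{\theta}_T), \qquad \rho^{T}_{G} = U_T(\vec{\theta}_T)\,\rho_{G}\,U_T^{\dagger}(\vec{\theta}_T).$$
One qubit of the above output state was the label $|\lambda'\rangle$ predicted by $\hat{T}$. It was compared with the label $|\lambda\rangle$ of the input $\rho_{R_s}$ by the swap test circuit [62] (which was composed of two Hadamard gates and one Fredkin gate, as shown in Figure 1c), and the label of $\rho_G$ was set to be $|\lambda\rangle$ if $|\lambda'\rangle$ and $|\lambda\rangle$ were close enough. The closeness of $|\lambda'\rangle$ and $|\lambda\rangle$ could be estimated by the measurement of the register Test. The probability of the register Test being in the state $|0\rangle$ was given by
$$P_0 = \frac{1 + P_c}{2},$$
where $P_c$ is the overlap of $|\lambda\rangle$ and $|\lambda'\rangle$, defined by $P_c = \mathrm{Tr}\left(|\lambda\rangle\langle\lambda|\,L'\right)$. Therefore, $P_c$ could be estimated by $P_c = 2P_0 - 1$. In specific cases, we could set a suitable boundary for accepting $|\lambda\rangle$ to be the label of $\rho_G$ (see the example below). Finally, the training of the quantum model could be performed by finding the minimax point of a cost function $V(\vec{\theta}_D, \vec{\theta}_G, \vec{\theta}_T)$,
$$\min_{\vec{\theta}_G,\,\vec{\theta}_T}\ \max_{\vec{\theta}_D}\ V(\vec{\theta}_D, \vec{\theta}_G, \vec{\theta}_T).$$
In addition to the discriminator terms for the target domain data and the generated data and the classifier term for the generated data, V contains a term built from the label outputted by $\hat{T}$ when the input was the source domain data sample $\rho_{R_s k}$, where k indexes the data samples. Such a term provides a bias; this could increase the training efficiency. The angle φ is the parameter used to adjust the weight of the target domain data and the generated data in the cost function, and it could be set according to the requirements of the specific tasks. A usual form was given by considering that they were equally weighted, which corresponds to setting φ = π/4. The parameters were then updated by gradient steps, ascending in $\vec{\theta}_D$ and descending in $\vec{\theta}_G$ and $\vec{\theta}_T$:
$$\vec{\theta}_D^{\,l+1} = \vec{\theta}_D^{\,l} + \chi_D^{l}\,\nabla_{\vec{\theta}_D} V, \qquad \vec{\theta}_G^{\,l+1} = \vec{\theta}_G^{\,l} - \chi_G^{l}\,\nabla_{\vec{\theta}_G} V, \qquad \vec{\theta}_T^{\,l+1} = \vec{\theta}_T^{\,l} - \chi_T^{l}\,\nabla_{\vec{\theta}_T} V,$$
where l labels the training step and $\chi_G^{l}$, $\chi_D^{l}$, and $\chi_T^{l}$ are the learning rates. In the above scheme, we used the variational quantum circuits in [44] and [58] to implement the generator $\hat{G}$, the discriminator $\hat{D}$, and the classifier $\hat{T}$, as shown in Figure 1c.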
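The swap-test relation $P_c = 2P_0 - 1$ used above can be checked numerically. The following sketch simulates the textbook swap test (Hadamard, Fredkin, Hadamard) on a Test qubit and two single-qubit pure label states; the function name `swap_test_p0` and the choice of pure input states are assumptions of this example.

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
SWAP = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0]])
# Fredkin (controlled-SWAP) gate with the Test qubit as the first, controlling qubit.
FREDKIN = np.block([[np.eye(4), np.zeros((4, 4))],
                    [np.zeros((4, 4)), SWAP]])
H_TEST = np.kron(H, np.eye(4))  # Hadamard on the Test qubit only


def swap_test_p0(label_a, label_b):
    """Probability of measuring the Test qubit in |0> after the swap test."""
    psi = np.kron(np.array([1.0, 0.0]), np.kron(label_a, label_b))
    psi = H_TEST @ FREDKIN @ H_TEST @ psi
    return float(np.sum(np.abs(psi[:4]) ** 2))  # amplitudes with Test = |0>


ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])
ket_plus = np.array([1.0, 1.0]) / np.sqrt(2.0)

p0 = swap_test_p0(ket0, ket_plus)
p_c = 2.0 * p0 - 1.0  # estimated overlap |<0|+>|^2
```

For pure states the overlap $P_c$ equals $|\langle\lambda|\lambda'\rangle|^2$, so identical labels give $P_0 = 1$ and orthogonal labels give $P_0 = 1/2$; on hardware, $P_0$ would be estimated from repeated measurements of the Test register.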
In [44], each quantum gate of the circuit corresponded to one parameter of the circuit. The number of quantum gates was polynomial in the number of qubits, namely $\frac{3}{2}Lq$, where L was the number of layers of the circuit and q was the number of qubits. Therefore, the number of parameters was also $\frac{3}{2}Lq$. In [58], each quantum gate of the circuit had three parameters. The number of quantum gates was again polynomial in the number of qubits, namely 2Lq + 1. Therefore, the number of parameters was 6Lq + 3. In our scheme, the total number of quantum gates was bounded by those operating on the register Out G|$R_{t|s}$, which was the sum of the numbers of quantum gates employed by the generator-discriminator sequence or by the classifier. Based on the gate numbers of the above two circuits for the unitary operations we considered, this sum remained polynomial in the number of qubits. We added two registers, Label and Test. Their gate number was constant and their impact on the complexity was negligible. In the process of obtaining the quantum cost function, the number of gates and the number of parameters of our scheme were therefore polynomial in the number of qubits, which required exponentially fewer resources than the classical counterparts [10]. This property potentially benefits future applications for tasks involving complicated data structures. In general, if the information in a target dataset is insufficient, a large amount of data must be generated to effectively transfer the knowledge required to improve the training on the target dataset. This implies a large storage requirement for the data and extensive calls of the modules. Otherwise, the gap between the features of different datasets is not bridged and the helpful information required to improve the training process is not captured and transferred. The improvement in computing resources provided by our QATL could loosen the requirements on storage and boost the efficiency of running the subroutines of the module, facilitating the execution of tasks.
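The gate and parameter counts quoted above for the circuits of [44] and [58] can be tabulated against the dimension $2^q$ of the q-qubit state space. This is a simple arithmetic sketch; the function names are hypothetical and the formulas are taken directly from the text.

```python
def counts_ref44(n_layers, n_qubits):
    """(gates, parameters) for the circuit of [44]: 3Lq/2 gates, one parameter each."""
    gates = 3 * n_layers * n_qubits / 2
    return gates, gates


def counts_ref58(n_layers, n_qubits):
    """(gates, parameters) for the circuit of [58]: 2Lq + 1 gates, three parameters each."""
    gates = 2 * n_layers * n_qubits + 1
    return gates, 3 * gates  # 3 * (2Lq + 1) = 6Lq + 3


# Compare the polynomial counts with the exponentially growing state dimension.
for q in (2, 4, 8, 16):
    gates, params = counts_ref58(n_layers=2, n_qubits=q)
    print(f"q={q:2d}  gates={gates:3d}  params={params:3d}  dimension={2 ** q}")
```

The counts grow linearly in q for a fixed number of layers, whereas the number of amplitudes that q qubits can encode grows as $2^q$.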
To provide a more direct connection between the computation of the cost function and the updates of the parameters, we also applied two quantum circuits to calculate the gradients. Their theoretical scheme and circuit design are shown in Section SI of the Supplementary Materials. This scheme could be implemented on current quantum experimental platforms because it is mainly based on variational quantum algorithms (VQAs) [63][64][65][66].
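The gradient circuits themselves are given in the Supplementary Materials and are not reproduced here. As a loosely related illustration of how gradients of a variational circuit can be obtained from expectation values alone, the sketch below uses the parameter-shift rule, which is a different and commonly used technique, on a single-qubit toy circuit. The function names and the toy circuit are assumptions of this example.

```python
import numpy as np


def expval_z(theta):
    """<Z> after applying RY(theta) to |0>, which equals cos(theta)."""
    state = np.array([np.cos(theta / 2.0), np.sin(theta / 2.0)])
    z = np.diag([1.0, -1.0])
    return float(state @ z @ state)


def parameter_shift_grad(f, theta):
    """Exact derivative of f for a rotation gate generated by a Pauli operator."""
    shift = np.pi / 2.0
    return (f(theta + shift) - f(theta - shift)) / 2.0


theta = 0.3
grad = parameter_shift_grad(expval_z, theta)  # analytically -sin(theta)
```

Unlike a finite-difference estimate, the two shifted evaluations give the exact derivative for this class of gates, which is why this rule is popular in VQA implementations.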

Results
To demonstrate the feasibility of QATLs, we considered their application in cases in which the probability distributions of the data in the two domains were different. The examples used the Iris dataset [67] and the states of the Bloch sphere.
The main task we considered was a classification of the Iris dataset in a numerical simulation. We chose two types of Iris flowers from the Iris dataset: versicolor and virginica. Each type contained 50 data samples. All data samples contained four attributes: sepal length (SL), sepal width (SW), petal length (PL), and petal width (PW). We encoded the four attributes in the amplitudes of a two-qubit state (the amplitude-encoding strategy [47]). The source domain dataset was composed of 40 data samples; 20 were picked from the Iris versicolor data samples and labeled $|0\rangle\langle 0|$. The other 20 were picked from the Iris virginica data samples and labeled $|1\rangle\langle 1|$. The target domain dataset was also a 20/20 set, chosen from samples other than those used for the source domain dataset and left unlabeled. The last 10 data samples of both types were used as a test dataset to check the performance of the model. The classification task was to label the target domain data samples using the differently distributed source domain data. To enlarge the statistical difference between the source domain dataset and the target domain dataset and to increase the difficulty of classifying the Iris flowers, we preprocessed the data samples by reducing the SL by 7 cm, the SW by 3 cm, and the PL by 4 cm, and by setting the value of PW to 0. An illustration of the distribution of the data can be found in [67].
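A minimal sketch of amplitude encoding for a four-attribute sample is shown below. The normalization by the Euclidean norm and the ordering of the attributes over the basis states $|00\rangle, |01\rangle, |10\rangle, |11\rangle$ are assumptions of this example (the text does not specify them), and the feature vector is purely illustrative.

```python
import numpy as np


def amplitude_encode(features):
    """Map a length-4 feature vector to the amplitudes of a two-qubit pure state."""
    x = np.asarray(features, dtype=float)
    if x.shape != (4,):
        raise ValueError("expected exactly four attributes (SL, SW, PL, PW)")
    norm = np.linalg.norm(x)
    if norm == 0.0:
        raise ValueError("cannot encode the zero vector")
    return x / norm  # amplitudes of |00>, |01>, |10>, |11>


# Illustrative attribute values in cm (SL, SW, PL, PW); not taken from the paper.
sample = [6.4, 3.2, 4.5, 1.5]
state = amplitude_encode(sample)
rho = np.outer(state, state.conj())  # density matrix of the encoded sample
```

Because the state must be normalized, only the direction of the feature vector is encoded; two samples that differ by an overall scale map to the same quantum state.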
To show the convergence of the terms of the cost function on the above datasets, we plotted the cost function V and its components as a function of the training step in Figure 2a. The cost function V converged to −0.217. The component $V_{DG}$ was the probability of $\hat{D}$ assigning $|\mathrm{real}\rangle\langle\mathrm{real}|$ to the generated data, weighted by −1/4. The quantity $V_{GAN} \equiv 1/2 + V_{DG} + V_{DR}$ reflected the capability of $\hat{D}$ to identify the generated data. $V_{DG}$, $V_{DR}$, and $V_{GAN}$ converged to −1/4, 1/4, and 1/2, respectively; thus, $\hat{D}$ could no longer provide a fair judgement and designated all the density matrices as $|\mathrm{real}\rangle$. The component associated with the average probability of matching the label with the generated data eventually reached −0.739. As Figure 2 demonstrates, the cost function and its components converged to stable values in only 1500 training steps. We tested the trained model on the classification using the test dataset mentioned above. The classification accuracy reached 95%. The results demonstrated that an equilibrium point existed for the proposed cost function and that the QATL model was workable on the dataset. Due to the generative model and adversarial training, the gap of knowledge transfer between the differently distributed datasets was effectively bridged; therefore, the accuracy of the classification was largely improved compared with previous quantum algorithms [51]. The method is described in Appendix A. We also applied our scheme to the classification of the states of the Bloch sphere; detailed descriptions are given in Section SII of the Supplementary Materials. A high classification accuracy was also observed.

Conclusions
We demonstrated an efficient quantum machine learning scheme for cross-domain learning problems, QATL, which is a quantum counterpart of the recent and well-received ATL. Using a well-defined quantum cost function, an adversarial training process was applied to the transfer of knowledge, which was independent of specific tasks. Our numerical experiments demonstrated that the QATL model could be successfully trained and outperformed previous quantum algorithms on the same tasks. The numbers of quantum gates and training parameters of the algorithm are polynomial in the number of qubits, an exponential saving in resources compared with ATL.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
To complete the numerical simulation task of the theoretical scheme, we used a variational quantum circuit with eight entangled qubits from [58] to implement $\hat{G}(\vec{\theta}_G)$, $\hat{D}(\vec{\theta}_D)$, and $\hat{T}(\vec{\theta}_T)$. The quantum circuit is shown in Section SI of the Supplementary Materials. In our task, the register Out G|$R_{t|s}$ required two qubits. As the data sample was encoded by a pure state, Bath G did not need to generate entropy. The generator $\hat{G}(\vec{\theta}_G)$, involving 15 variational parameters, could generate the data for this task. The register Bath D was not necessary. Therefore, $\hat{D}(\vec{\theta}_D)$, involving 42 variational parameters, operated on three qubits; one qubit served as Out D and the other two belonged to Out G|$R_{t|s}$.
For the classifier $\hat{T}(\vec{\theta}_T)$, we noted that Bath T was not necessary. Thus, $\hat{T}(\vec{\theta}_T)$, which had 15 variational parameters, only operated on the two qubits of Out G|$R_{t|s}$. The registers Label and Test each required only one qubit. An additional qubit was employed to store the information of the gradient. Therefore, the workspace of the whole scheme was composed of six qubits in total: two for Out G|$R_{t|s}$ and one each for Out D, Label, Test, and the gradient.
In this scenario, we performed the gradient steps for this scheme according to the update rules of Equation (17). At the beginning of the training sequence, the parameters were randomly chosen. We adopted RMSProp [68] to update the learning rate. The idea of RMSProp is to divide the learning rate of a weight by a running average of the magnitudes of recent gradients for that weight. With the help of RMSProp, the effective step is enlarged in flat directions and finely tuned in steep directions. This speeds up the training process, and it has shown a good adaptation of learning rates in different applications. Here, the initial learning rates of $\hat{G}$, $\hat{D}$, and $\hat{T}$ were set to 0.001, 0.001, and 0.005, respectively. The training was terminated once the cost function had converged to an equilibrium point. The numerical simulation results are shown in Figure 2.
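The RMSProp rule described above can be sketched in a few lines. The decay factor 0.9 and the small constant `eps` are common default choices and are assumptions here, as is the function name `rmsprop_step`; the text only specifies the initial learning rates.

```python
import numpy as np


def rmsprop_step(theta, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp descent step; returns the new parameters and running average.

    `decay` and `eps` are assumed defaults, not values given in the text.
    """
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg_sq) + eps)
    return theta, avg_sq


# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
avg_sq = np.zeros_like(theta)
for _ in range(2000):
    theta, avg_sq = rmsprop_step(theta, 2.0 * theta, avg_sq, lr=0.01)
```

For parameters that maximize the cost function, such as those of the discriminator in a minimax game, the same step would be applied with the sign of the gradient reversed.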