Model-Free Deep Recurrent Q-Network Reinforcement Learning for Quantum Circuit Architectures Design

Abstract: Artificial intelligence (AI) technology leads to new insights into the manipulation of quantum systems in the Noisy Intermediate-Scale Quantum (NISQ) era. Classical agent-based artificial intelligence algorithms provide a framework for the design or control of quantum systems. Traditional reinforcement learning methods are designed for the Markov Decision Process (MDP) and, hence, have difficulty dealing with partially observable or quantum observable decision processes. Because building or inferring a model of a specified quantum system is difficult, a model-free control approach is more practical and feasible than a model-based one. In this work, we apply a model-free deep recurrent Q-network (DRQN) reinforcement learning method to qubit-based quantum circuit architecture design problems. This paper is the first attempt to solve the quantum circuit design problem with a recurrent reinforcement learning algorithm and a discrete policy. Simulation results suggest that our long short-term memory (LSTM)-based DRQN method is able to learn quantum circuits for entangled Bell–Greenberger–Horne–Zeilinger (Bell–GHZ) states. However, we also observe unstable learning curves in the experiments; the DRQN could thus be a promising method for AI-based quantum circuit design, but more investigation of the stability issue is required.


Introduction
Recent advances in artificial intelligence (AI) and Noisy Intermediate-Scale Quantum (NISQ) technology open new perspectives in quantum artificial intelligence [1,2]. The control of quantum systems by a classical agent has been studied in various settings [3–5]. Reinforcement learning (RL) [6–12] has been successfully applied to control problems [11,13] of classical systems and fully observable Markov Decision Process (MDP) environments [14]. However, the control and learning of a Partially Observable Markov Decision Process (POMDP) [15–19] is more difficult due to indirect access to the state information. Both planning [20] and learning [21] methods for POMDPs have been proposed. For a POMDP system, the underlying state transition is classical Markovian and is different from quantum dynamics. The quantum counterpart of the POMDP, the Quantum Observable Markov Decision Process (QOMDP) [22–24], has been studied theoretically. The implementation of a QOMDP planning method for quantum circuits [2,25–33] was studied in a previous work [34]. Compared to state tomography-based methods, which require an exponentially large number of measurement shots with respect to the circuit width, QOMDP-based approaches have favorable sample complexity on quantum circuits. However, an exact QOMDP planning method requires exponentially expensive classical computing, so it is desirable to explore approximation methods that reduce the cost of computational resources.
In this work, we implement a deep recurrent Q-learning agent for model-free reinforcement learning [47–50] to design quantum circuits. The DRQN is based on long short-term memory (LSTM) [51–53] networks that encode the action-observation history time series for partially observable environments [49,50]. The fidelity achieved by the DRQN learning agent improves over the learning episodes, showing the effectiveness of the proposed algorithm. However, we also observe unstable learning curves in the experiments. These observations suggest that the DRQN could be a promising method for AI-based quantum circuit design applications, but more investigation of the stability issue is required.
Many previous works on quantum control using different approaches can be found in the literature [35–46]. Borah et al. [46] use an LSTM for policy-gradient learning. All these works [35,37–42,46] perform control at the Hamiltonian level rather than at the circuit architecture level [34,43–45]. The work of Sivak et al. [36] has several similarities and differences compared to ours. Sivak et al. applied an actor-critic policy gradient method to a quantum optical system with a continuous action space, whereas our work applies deep recurrent Q-learning to a qubit system with a discrete action set. Both Sivak et al.'s method and our method are model-free and use LSTM. Sivak et al. use LSTM for the policy network and the value network over a continuous action space, while we use LSTM for a history-dependent Q-function over a discretized action space, which is more practical for field applications.
This work is organized as follows. Section 2 introduces the LSTM-based DRQN reinforcement learning method for quantum circuit architecture design. Section 3 presents the simulation results. Section 4 provides some discussion. Section 5 concludes.

MDP, POMDP, and QOMDP
A POMDP problem instance is described by a set of states $S$, a set of actions $A$, a set of observations $\Omega$, a state transition probability $P$, an observation probability $O$, a reward function $R$, and a discount rate $\gamma \in [0, 1]$. At each time step $t$, the agent in state $s_t \in S$ takes an action $a_t \in A$ and moves to a new state $s_{t+1} \sim P(s'|s_t, a_t)$. The agent also receives an observation $o_t \sim O(o|s_t)$, $o_t \in \Omega$, and a reward $r_t = R(s_t, a_t, s_{t+1}) \in \mathbb{R}$. The action-observation history time series is $h_t = \{a_1, o_1, a_2, o_2, \ldots, a_t, o_t\}$. The goal is to find a policy $\pi(a|h)$ that maximizes the expected future reward $\mathbb{E}_\pi \sum_{i=t}^{T} \gamma^{i-t} r_i$. In contrast to the MDP setting, a POMDP agent does not have access to the state sequence $\{s_t\}$.
A QOMDP problem instance is described by a Hilbert space $\mathcal{S}$, a set of action super-operators $\mathcal{A}$, a set of observations $\Omega$, a set of reward operators $\mathcal{R}$, a discount rate $\gamma \in [0, 1]$, and an initial quantum state $|s_0\rangle$. The set of actions consists of super-operators $\mathcal{A} = \{A^{a_1}, \ldots, A^{a_{|\mathcal{A}|}}\}$, where each super-operator $A^a = \{A^a_{o_1}, \ldots, A^a_{o_{|\Omega|}}\}$ has $|\Omega|$ Kraus matrices. At each time step $t$, the agent takes an action $a_t$, which changes the state of the quantum system to $|s_{t+1}\rangle = A^{a_t}_{o_t}|s_t\rangle / \sqrt{\langle s_t| A^{a_t\dagger}_{o_t} A^{a_t}_{o_t} |s_t\rangle}$ upon observing outcome $o_t$. The agent also receives an observation $o_t \sim \Pr(o \mid |s_t\rangle, a_t) = \langle s_t| A^{a_t\dagger}_{o} A^{a_t}_{o} |s_t\rangle$, $o_t \in \Omega$, and a reward $r_t = \langle s_t| R^{a_t} |s_t\rangle \in \mathbb{R}$, where $R^{a_t} \in \mathcal{R}$. Similar to POMDP and MDP, the goal is to find a policy that maximizes the expected future reward.
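As a concrete illustration of a QOMDP action, the following NumPy sketch applies a super-operator given by its Kraus matrices: it samples an observation with the Born probability above and returns the normalized post-measurement state. The projective Z-measurement used as the example action is purely illustrative and is not part of the action set used in this work.

```python
import numpy as np

def apply_qomdp_action(state, kraus_ops, rng):
    """Apply one QOMDP action given as a list of Kraus matrices A^a = {A^a_o}.

    Samples an observation o with probability <s|A_o^dag A_o|s> and returns
    the normalized post-measurement state together with o.
    """
    probs = np.array([np.real(state.conj() @ (K.conj().T @ K) @ state)
                      for K in kraus_ops])
    probs = probs / probs.sum()                 # guard against rounding error
    o = rng.choice(len(kraus_ops), p=probs)
    new_state = kraus_ops[o] @ state
    return new_state / np.linalg.norm(new_state), o

# Illustrative action: a projective Z-measurement on a single qubit.
P0 = np.array([[1, 0], [0, 0]], dtype=complex)
P1 = np.array([[0, 0], [0, 1]], dtype=complex)
plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
state, obs = apply_qomdp_action(plus, [P0, P1], np.random.default_rng(0))
print(obs, state)
```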

LSTM-Based Deep Recurrent Q-Network
LSTM is a type of recurrent neural network that can be used to model sequential data. The hidden state $h_t$ and output $c_t$ are computed by the recurrence $(h_t, c_t) = \mathrm{LSTM}(h_{t-1}, c_{t-1}, x_{t-1})$ for a time-dependent input signal $x_t$. Traditional Q-learning for an observable MDP uses a state-action Q-function $Q(s_t, a_t)$ to represent the value of an action $a_t$ at a known state $s_t$. To deal with partially observable environments, in which $s_t$ is unknown, a history-dependent Q-function $Q(a_t, h_{t-1})$ is used instead of the state-action Q-function. By treating the action-observation pair as the input $x_t = (a_t, o_t)$, the LSTM enables the encoding of the history-dependent Q-function $Q(a_t, h_{t-1})$. A feed-forward neural network (FNN) is concatenated with the LSTM output to represent the Q-function. The FNN is a simple linear transformation, and its output gives the Q-values $Q(:, h_{t-1}) = W c_{t-1} + b$, where $W \in \mathbb{R}^{|A| \times |h|}$ is a trainable weight matrix, $b \in \mathbb{R}^{|A|}$ is a bias vector, and $|h|$ denotes the size of the LSTM hidden state. The LSTM-FNN structure is shown in Figure 1a. The Q-function is updated by optimizing a loss function whose gradients are computed by back-propagation through time. The implementation uses the PyTorch package [54]. The hyperparameters can be found in Table 1.
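A minimal PyTorch sketch of such an LSTM-FNN Q-network is shown below. The input dimension, hidden size, and number of actions are illustrative placeholders (not the hyperparameters of Table 1), and the sketch feeds the last LSTM output into the linear layer to produce the $|A|$ Q-values.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """LSTM followed by a linear layer mapping the recurrent output to Q-values."""

    def __init__(self, obs_action_dim: int, hidden_size: int, num_actions: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=obs_action_dim, hidden_size=hidden_size,
                            batch_first=True)
        self.fnn = nn.Linear(hidden_size, num_actions)  # W x + b, with b in R^{|A|}

    def forward(self, history, hidden=None):
        # history: (batch, time, obs_action_dim) sequence of encoded (a_t, o_t) pairs
        out, hidden = self.lstm(history, hidden)
        q_values = self.fnn(out[:, -1, :])  # Q-values from the last recurrent output
        return q_values, hidden

# Example with placeholder sizes: 3 one-hot action dims + 1 observation bit,
# hidden size 64, and 3 candidate actions.
net = DRQN(obs_action_dim=4, hidden_size=64, num_actions=3)
q, _ = net(torch.zeros(1, 5, 4))
print(q.shape)  # torch.Size([1, 3])
```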

RL Method
The proposed method is depicted in Figure 1b. The RL environment is the quantum circuit to be designed. The classical agent receives a 0-1 observation from the measurement result of the ancillary qubit. The action-observation pair is used to update the DRQN, and the agent then decides the next action to control the circuit. The reward is the fidelity with respect to the target state, $r_t = \langle s_t | s_{\text{target}}\rangle \langle s_{\text{target}} | s_t\rangle$. The policy is epsilon-greedy, and experience replay is used to stabilize the training. Using the convention that the Hilbert space is ordered as ancilla ⊗ system, the operator in Figure 1b is $U(a_t) = U_{\text{ent}}(H \otimes U_{\text{action}})$, where $H$ is the single-qubit Hadamard gate acting on the ancillary qubit. The action unitary $U_{\text{action}}$ is chosen from a discrete action set of CNOT gates and single-qubit rotations. Here, $\mathrm{CNOT}_{i,j}$ denotes the controlled-NOT gate for which the $i$-th qubit is the control qubit and the $j$-th qubit is the target qubit, and $R_{d,i}(\theta)$ denotes a single-qubit rotation of the $i$-th qubit around the $d$-axis. The system-ancilla entangler $U_{\text{ent}} = \prod_{i \in \text{system}} \mathrm{CNOT}_{i,\text{ancilla}}$ computes the system parity function and outputs the result to the ancilla qubit. The setup is similar to that of [34], but the classical agent in this work is an RL agent instead of a planning agent.
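As an illustration of the interaction loop described above, the following skeleton sketches the epsilon-greedy action selection and episode-level experience replay; it assumes the DRQN module from the previous sketch, and the buffer size and epsilon value are arbitrary illustrative choices rather than the authors' settings.

```python
import random
from collections import deque

import torch

# Episode-level replay buffer: whole action-observation sequences are stored
# so that back-propagation through time can unroll the LSTM over them.
replay_buffer = deque(maxlen=10_000)
EPSILON = 0.1

def select_action(net, history: torch.Tensor, num_actions: int) -> int:
    """Epsilon-greedy action from the history-dependent Q-values Q(:, h_{t-1})."""
    if random.random() < EPSILON:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values, _ = net(history)     # history: (1, time, obs_action_dim)
    return int(q_values.argmax(dim=-1).item())

def store_episode(transitions: list) -> None:
    """Append one finished episode (list of (a_t, o_t, r_t) tuples) to the buffer."""
    replay_buffer.append(transitions)
```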

Results
Numerical simulations are conducted to test the applicability of the proposed method. The simulation code is based on the packages NumPy [55], Matplotlib [56], PyTorch [54], and Qiskit [57]. We test the state generation task for the 2-qubit Bell state and the 3-qubit Greenberger–Horne–Zeilinger (GHZ) state [58]. The target state is considered reached when the fidelity is larger than a threshold value of 0.99. The maximum number of steps for each episode is set to 100. The PyTorch hyperparameters are listed in Table 1.

Figure 2 shows the learning curves for the 2-qubit Bell state. The received reward and the number of steps to reach the target state are plotted with respect to the number of learning episodes. Each curve is the moving average over 2000 episodes and 10 independent runs. The error bar denotes one standard deviation over the 10 independent runs. Over 30,000 episodes, we observe that the average reward increases from below 0.3 to above 0.4. The maximum of the one-sigma error bar is close to 0.65. The average number of steps to reach the goal decreases from above 95 to below 90. The minimum of the one-sigma error bar is close to 60.

Figure 3 shows the learning curves for the 3-qubit GHZ state. Over 30,000 episodes, we observe that the reward increases from below 0.15 to above 0.3. The maximum of the one-sigma error bar can be larger than 0.45. The average number of steps to the goal stays above 99 throughout the learning episodes.

Figure 4 shows the city diagrams of the density matrices generated by the RL agent. Each result is the highest-fidelity state over 10 independent training runs and 100 test steps per run, obtained with the policy of the last (30,000th) training episode. The fidelity of the obtained density matrix is 0.9698 for the Bell state and 0.6710 for the GHZ state.
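For reference, the success criterion described above (fidelity with the Bell–GHZ target exceeding 0.99) can be evaluated with Qiskit as in the following sketch; the hand-built preparation circuit is purely illustrative and is not an output of the RL agent.

```python
import numpy as np
from qiskit import QuantumCircuit
from qiskit.quantum_info import Statevector, state_fidelity

def bell_ghz_target(n_qubits: int) -> Statevector:
    """Target state (|0...0> + |1...1>)/sqrt(2): Bell for n=2, GHZ for n=3."""
    vec = np.zeros(2 ** n_qubits, dtype=complex)
    vec[0] = vec[-1] = 1 / np.sqrt(2)
    return Statevector(vec)

# Illustrative hand-built preparation circuit (not one found by the agent).
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)
prepared = Statevector.from_instruction(qc)

fidelity = state_fidelity(prepared, bell_ghz_target(2))
print(fidelity, fidelity > 0.99)  # the target counts as reached above 0.99
```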

Discussion
From the experimental data in Figures 2 and 3, we observe that the fidelities of the 2-qubit Bell state and the 3-qubit GHZ state are improved by the proposed learning algorithm. However, since these values remain mostly well below the stopping criterion of 0.99, the number of steps is not improved significantly. The best output state has high fidelity with respect to the target for the 2-qubit case, while the 3-qubit case provides moderate fidelity. These results demonstrate that the learning algorithm is effective, but the performance in our experiments is not satisfactory. More learning episodes and fine-tuning of hyperparameters could potentially improve the performance. The fidelity achieved in the 2-qubit Bell experiments is generally better than that of the 3-qubit GHZ experiments. This is reasonable, since the possible action space for the 2-qubit system is smaller, and the required action sequence to produce a 2-qubit Bell state is shorter than that for a 3-qubit GHZ state.
The city diagram in Figure 4 allows us to visualize the states produced by the agent. The Bell–GHZ target state is $\frac{1}{\sqrt{2}}(|00\rangle + |11\rangle)$ for two qubits and $\frac{1}{\sqrt{2}}(|000\rangle + |111\rangle)$ for three qubits. The ideal city diagram has peaks at the four corners of the real part. For the 2-qubit case, the experimental data resemble the ideal case, and, hence, the fidelity is higher. On the other hand, the 3-qubit city diagram has many sub-peaks, which implies low fidelity.
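A city diagram such as Figure 4 can be reproduced with Qiskit's visualization utilities, as in the sketch below; it draws the ideal 2-qubit Bell state rather than the agent's output, and the output file name is arbitrary.

```python
from qiskit import QuantumCircuit
from qiskit.quantum_info import DensityMatrix
from qiskit.visualization import plot_state_city

# Build the ideal Bell state and draw its density-matrix city diagram.
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)
rho = DensityMatrix.from_instruction(qc)
fig = plot_state_city(rho, title="Ideal Bell state")
fig.savefig("bell_city.png")
```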
To further understand the reasons behind the limitation of our method, the test fidelity distribution histogram over 10 independent runs is plotted in Figure 5. It is observed that all samples lie in the region Fidelity > 0.4 for both the 2-qubit and 3-qubit cases. However, the 2-qubit result has its highest-fidelity sample in the interval Fidelity ∈ [0.9, 1.0), while the 3-qubit result has its highest-fidelity sample in the interval Fidelity ∈ [0.6, 0.7). The 2-qubit result not only has better best-case performance but also has its distribution maximum at Fidelity ∈ [0.6, 0.7), which is better than the peak location of the 3-qubit result at Fidelity ∈ [0.4, 0.5). The problem is that a learning method that is successful for small problem instances does not necessarily scale to larger problem instances. We are encountering a scalability issue that arises commonly in the application of machine learning methodologies to optimization problems [59]. To the best of our knowledge, this is still an unresolved issue in the community, so further investigation in this direction is desirable.

Conclusions
In this work, we propose and implement a deep recurrent Q-network algorithm for quantum circuit design. Experimental results show that the agent is able to learn to produce better quantum circuits for entangled state preparation. However, the learned fidelity is not satisfactory. Future research and development are required to improve the quality of the state-generation task. In particular, scalability to larger problem instances should be tackled. It would also be desirable to explore other applications, for example, the energy minimization task [26,34,60–62].

Data Availability Statement:
The data and scripts that support the findings of this study are available from the corresponding author upon reasonable request.