Figure 1.
Two-level system. Ω is the Rabi frequency, proportional to the envelope of the applied electromagnetic field, and Δ is the detuning between the qubit frequency and the frequency of the applied field.
Figure 2.
MDP and agent–environment interplay.
Figure 3.
DQN neural network architectures. (a) DQN architecture 1. (b) DQN architecture 2. The first architecture has a discrete state space, and the second has a hybrid state space of the form (6), with a discrete Rabi frequency and continuous density matrix elements. In both architectures, the output layer produces the Q-values of all available actions for the input state. The greedy policy selects the action with the highest value, breaking ties randomly.
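For concreteness, a minimal Keras sketch of such a Q-network is given below; the layer widths follow Table 4, while `state_dim` and `n_actions` are illustrative placeholders rather than the exact specification used in the paper. The greedy step selects among the maximal Q-values, breaking ties randomly.

```python
import numpy as np
import tensorflow as tf

state_dim, n_actions = 5, 9   # illustrative sizes, not the paper's exact spec

# Q-network: the output layer returns one Q-value per available action.
q_net = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(75, activation="relu"),
    tf.keras.layers.Dense(n_actions),        # Q(s, a) for every action a
])

def greedy_action(state):
    """Return an action with maximal Q-value, breaking ties randomly."""
    q_values = q_net(state[None, :]).numpy()[0]
    best = np.flatnonzero(q_values == q_values.max())
    return int(np.random.choice(best))
```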
Figure 4.
REINFORCE neural network architectures. (a) REINFORCE policy NN architecture 1. (b) REINFORCE policy NN architecture 2. The first architecture has a discrete state space, and the second has a continuous state space in which the Rabi frequency is also continuous. In the first case, the output is the discrete probability distribution over the actions, while in the second, the output consists of the parameters of a continuous probability distribution, for example the mean and standard deviation of the parameterized normal distribution of Equation (23).
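The two output conventions can be sketched in Keras as follows; the layer sizes and the single continuous action dimension are illustrative assumptions. Architecture 1 ends in a softmax over the discrete actions, while architecture 2 outputs the mean and (log) standard deviation of a normal distribution from which the continuous action is sampled.

```python
import tensorflow as tf

state_dim, n_actions = 5, 9   # illustrative sizes

# (a) Discrete actions: output a probability distribution over the actions.
policy_discrete = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(75, activation="relu"),
    tf.keras.layers.Dense(n_actions, activation="softmax"),  # pi(a | s)
])

# (b) Continuous action: output the mean and log standard deviation of a
# normal distribution, as in Equation (23), and sample the action from it.
inputs = tf.keras.Input(shape=(state_dim,))
h = tf.keras.layers.Dense(100, activation="relu")(inputs)
h = tf.keras.layers.Dense(75, activation="relu")(h)
mean = tf.keras.layers.Dense(1)(h)        # mu(s)
log_std = tf.keras.layers.Dense(1)(h)     # log sigma(s)
policy_continuous = tf.keras.Model(inputs, [mean, log_std])

def sample_action(state):
    mu, log_sigma = policy_continuous(state[None, :])
    return float(mu + tf.exp(log_sigma) * tf.random.normal(tf.shape(mu)))
```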
Figure 5.
Actor-critic architecture.
Figure 6.
Parameterized policy actor NN. The input layer consists of the state of the quantum system (the necessary density matrix components), and the output layer consists of the parameters of the truncated trigonometric series for the two controls, the Rabi frequency Ω(t) and the detuning Δ(t).
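A minimal sketch of such an actor network is shown below; the number of harmonics, the resulting parameter count, and the input dimension are assumptions for illustration (the layer widths follow Table 11), not the paper's exact parameterization.

```python
import tensorflow as tf

K = 3                        # harmonics per control (Figure 18 mentions 3)
n_params = 2 * (2 * K + 1)   # a0 + K cosine + K sine coefficients, for both controls
state_dim = 3                # illustrative number of density-matrix components

# Actor network of Figure 6: quantum state in, series coefficients out.
actor = tf.keras.Sequential([
    tf.keras.Input(shape=(state_dim,)),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(n_params),   # coefficients of the truncated series
])
```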
Figure 7.
Results for Q-learning with 7 actions at the early training stage, after 2000 training episodes: (a) optimal normalized Rabi frequency as the external control of the system, (b) fidelity (population of excited state |2⟩) as a metric of the performance of the population transfer, (c) expected return (cumulative rewards) during the training episodes, (d) populations of states |1⟩ and |2⟩. The pulse shape does not succeed in transferring the population, because the agent has not yet learned sufficiently.
Figure 8.
Results for Q-learning with 7 actions at the final training stage, after 20,000 training episodes: (a) optimal normalized Rabi frequency as the external control of the system, (b) fidelity (population of excited state |2⟩) as a metric of the performance of the population transfer, (c) expected return (cumulative rewards) from the training episodes, (d) populations of states |1⟩ and |2⟩. The solution successfully inverts the population between the two states.
Figure 9.
Results for expected SARSA with 7 actions at the early training stage, after 1000 training episodes: (a) optimal normalized Rabi frequency as the control of the system, (b) fidelity (population of excited state |2⟩) as a metric of the performance of the population transfer, (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. The agent attains a suboptimal solution even at this early stage of the training process.
Figure 10.
Results for expected SARSA with 7 actions at the final training stage, after 10,000 training episodes: (a) optimal normalized Rabi frequency, (b) fidelity (population of excited state |2⟩) as a metric of the performance of the population transfer, (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. The agent provides the optimal solution (π-pulse).
Figure 11.
Results for the DQN algorithm with 9 actions and a discrete state space, after 4000 training iterations: (a) optimal normalized Rabi frequency as the control of the system, (b) fidelity (population of excited state |2⟩) as a metric of the population transfer, (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. The agent attains the optimal pulse shape for this problem (π-pulse).
Figure 12.
Results for the DQN algorithm with 9 actions and the hybrid state space of the form (6), with discrete Rabi frequency and continuous density matrix elements, after 2000 training iterations: (a) optimal normalized Rabi frequency, (b) fidelity (population of excited state |2⟩), (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. In the hybrid setup, the agent obtains the optimal solution with faster convergence than in the discrete case.
Figure 13.
Results for the REINFORCE algorithm with 9 actions and a discrete state space, after 2000 training iterations: (a) optimal normalized Rabi frequency as the external control, (b) fidelity (population of excited state |2⟩) as the metric of the population transfer, (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. The agent succeeds in finding the optimal solution (π-pulse).
Figure 14.
Results for the REINFORCE algorithm with continuous action and state spaces for the resonant case (Δ = 0), after 2000 training iterations: (a) optimal normalized Rabi frequency as the external control, (b) fidelity (population of excited state |2⟩) as the metric of the population transfer, (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. The optimal π-pulse shape is successfully obtained by the training process.
Figure 15.
Results for the REINFORCE algorithm with continuous action and state spaces and an additional detuning control, after 3000 training iterations: (a) optimal normalized Rabi frequency (blue) and detuning (orange) as the two external controls of the system, (b) fidelity (population of excited state |2⟩), (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. The agent successfully produces an optimal solution that approximates the resonant pulse.
Figure 16.
Results for the PPO algorithm with continuous action and state spaces for the resonant case (Δ = 0), after 1500 training iterations: (a) optimal normalized Rabi frequency as the external control, (b) fidelity (population of excited state |2⟩) as the metric of the state transfer, (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. The agent successfully solves the problem in the optimal way (π-pulse).
Figure 17.
Results for the PPO algorithm with continuous action and state spaces and an additional detuning control, after 2000 training iterations: (a) optimal normalized Rabi frequency and detuning as the two external control functions, (b) fidelity (population of excited state |2⟩) as the metric of the state transfer, (c) expected return (cumulative rewards) from the training episodes, averaged over 10 episode samples, (d) populations of states |1⟩ and |2⟩. The agent utilizes both controls to produce an optimal solution.
Figure 18.
Results for the TSOA–PPO algorithm, achieving a fidelity of 0.99999: (a) optimal Rabi frequency and detuning, (b) fidelity (population of excited state |2⟩), (c) populations of states |1⟩ and |2⟩. The agent uses a finite trigonometric series consisting of 3 harmonics to produce a very-high-fidelity solution.
Table 1.
Software versions.
| Software | Version |
| --- | --- |
| TensorFlow | 2.15.1 |
| tf-agents | 0.19.0 |
| QuTiP | 4.7.3 |
| Python | 3.11.8 |
Table 2.
Computer hardware specifications.
| Component | Model |
| --- | --- |
| CPU | AMD Ryzen 5 5600X 6-Core Processor |
| Memory (RAM) | 32 GB |
Table 3.
Tabular methods parameters—7 actions.
| Parameter | Value |
| --- | --- |
| Maximum time t | 5 |
| Maximum time steps | 15 |
| Discount factor | 0.99 |
| Minimum | 0.05 |
| Detuning | 0 |
| Rabi frequency | |
| Actions | {−2, −1, −, 0, , 1, 2} |
| Target fidelity | 0.99 |
| Training time | ≈30 mins |
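As a reminder of how these parameters enter the tabular algorithms, here is a minimal sketch of the ε-greedy policy and of the Q-learning and expected-SARSA updates. The table sizes and the learning rate α are illustrative assumptions, and the 0.05 entry of Table 3 is assumed here to be a minimum exploration rate ε.

```python
import numpy as np

gamma, eps_min = 0.99, 0.05    # discount factor (Table 3); 0.05 assumed to be min epsilon
n_states, n_actions = 200, 7   # illustrative sizes of the tabular problem
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(s, eps):
    """Behaviour policy: random action with probability max(eps, eps_min)."""
    if np.random.rand() < max(eps, eps_min):
        return np.random.randint(n_actions)
    return int(np.argmax(Q[s]))

def q_learning_update(s, a, r, s_next, alpha=0.1):
    # Q-learning bootstraps on the greedy (maximal) value of the next state.
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

def expected_sarsa_update(s, a, r, s_next, eps, alpha=0.1):
    # Expected SARSA averages the next-state values over the eps-greedy policy.
    eps = max(eps, eps_min)
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[s_next])] += 1.0 - eps
    Q[s, a] += alpha * (r + gamma * probs @ Q[s_next] - Q[s, a])
```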
Table 4.
DQN parameters—Discrete state space.
| Parameter | Value |
| --- | --- |
| Max time steps (N) | 35 |
| | 1 |
| End time (T) | |
| Time step | |
| Discount factor | 0.99 |
| | 0.1 |
| Detuning | 0 |
| Rabi frequency | |
| Actions | |
| Target fidelity | 0.9999 |
| Training iterations | 4000 |
| Hidden layers (2) | (100, 75) |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Training time | ≈45 mins |
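Given the software stack of Table 1 (tf-agents), a DQN agent with these hyperparameters might be assembled roughly as in the sketch below; `env` stands for an assumed TF-Agents environment wrapping the two-level-system dynamics and is not defined here, and the 0.1 entry of Table 4 is assumed to be the ε-greedy exploration rate.

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network

def build_dqn_agent(env):
    """Sketch of a DQN agent with the Table 4 hyperparameters.

    `env` is an assumed tf_agents TFPyEnvironment exposing the quantum-control
    problem (discrete observations, 9 discrete actions).
    """
    q_net = q_network.QNetwork(
        env.observation_spec(),
        env.action_spec(),
        fc_layer_params=(100, 75),                 # hidden layers from Table 4
    )
    agent = dqn_agent.DqnAgent(
        env.time_step_spec(),
        env.action_spec(),
        q_network=q_net,
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        epsilon_greedy=0.1,                        # assumed exploration rate
        gamma=0.99,                                # discount factor
    )
    agent.initialize()
    return agent
```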
Table 5.
DQN parameters—hybrid state space.
| Parameter | Value |
| --- | --- |
| Max time steps (N) | 35 |
| | 1 |
| End time (T) | |
| Time step | |
| Discount factor | 0.99 |
| | 0.1 |
| Detuning | 0 |
| Rabi frequency | |
| Actions | |
| Target fidelity | 0.9999 |
| Training iterations | 2000 |
| Hidden layers (2) | (100, 75) |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Training time | ≈45 mins |
Table 6.
REINFORCE with baseline parameters—discrete action and state space.
| Parameter | Value |
| --- | --- |
| Max time steps (N) | 35 |
| | 1 |
| End time (T) | |
| Time step | |
| Discount factor | 0.95 |
| Detuning | 0 |
| Rabi frequency | |
| Actions | |
| Target fidelity | 0.9999 |
| Training iterations | 2000 |
| Actor hidden layers (2) | (100, 75) |
| Value hidden layers (2) | (100, 75) |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Training time | ≈20 mins |
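In tf-agents terms, a REINFORCE agent with a value-network baseline and the Table 6 settings might look like the sketch below; `env` is again an assumed environment wrapper and is not defined here.

```python
import tensorflow as tf
from tf_agents.agents.reinforce import reinforce_agent
from tf_agents.networks import actor_distribution_network, value_network

def build_reinforce_agent(env):
    """Sketch of REINFORCE with a baseline, using the Table 6 hyperparameters."""
    actor_net = actor_distribution_network.ActorDistributionNetwork(
        env.observation_spec(), env.action_spec(), fc_layer_params=(100, 75))
    baseline = value_network.ValueNetwork(
        env.observation_spec(), fc_layer_params=(100, 75))
    agent = reinforce_agent.ReinforceAgent(
        env.time_step_spec(),
        env.action_spec(),
        actor_network=actor_net,
        value_network=baseline,      # baseline for variance reduction
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        gamma=0.95,                  # discount factor from Table 6
    )
    agent.initialize()
    return agent
```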
Table 7.
REINFORCE with baseline parameters—continuous action and state spaces—resonant case (Δ = 0).
| Parameter | Value |
| --- | --- |
| Max time steps (N) | 35 |
| | 1 |
| End time (T) | |
| Time step | |
| Discount factor | 0.95 |
| Detuning | 0 |
| Rabi frequency | |
| Actions | |
| Target fidelity | 0.9999 |
| Training iterations | 2000 |
| Actor hidden layers (2) | (100, 75) |
| Value hidden layers (2) | (100, 75) |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Training time | ≈30 mins |
Table 8.
REINFORCE with baseline parameters—continuous action and state spaces—Additional detuning control.
| Parameter | Value |
| --- | --- |
| Max time steps (N) | 30 |
| | 1 |
| | 0.5 |
| End time (T) | |
| Time step | |
| Discount factor | 0.99 |
| Detuning | |
| Rabi frequency | |
| Actions | |
| Target fidelity | 0.9999 |
| Training iterations | 3000 |
| Actor hidden layers (2) | (100, 75) |
| Value hidden layers (2) | (75, 50) |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Training time | ≈30 mins |
Table 9.
PPO parameters—continuous action and state space—resonant case (Δ = 0).
| Parameter | Value |
| --- | --- |
| Max time steps (N) | 35 |
| | 1 |
| End time (T) | |
| Time step | |
| Detuning | 0 |
| Rabi frequency | |
| Actions | |
| Target fidelity | 0.9999 |
| Training iterations | 1500 |
| Actor hidden layers (2) | (100, 75) |
| Value hidden layers (2) | (100, 50) |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Training time | ≈45–60 mins |
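A corresponding tf-agents sketch for PPO with the Table 9 settings is given below; `env` is an assumed environment with a continuous action (the pulse amplitude), and the clipping parameter and epoch count are illustrative since they are not listed in the table.

```python
import tensorflow as tf
from tf_agents.agents.ppo import ppo_clip_agent
from tf_agents.networks import actor_distribution_network, value_network

def build_ppo_agent(env):
    """Sketch of a PPO (clipped surrogate) agent with the Table 9 settings."""
    actor_net = actor_distribution_network.ActorDistributionNetwork(
        env.observation_spec(), env.action_spec(), fc_layer_params=(100, 75))
    value_net = value_network.ValueNetwork(
        env.observation_spec(), fc_layer_params=(100, 50))
    agent = ppo_clip_agent.PPOClipAgent(
        env.time_step_spec(),
        env.action_spec(),
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
        actor_net=actor_net,
        value_net=value_net,
        importance_ratio_clipping=0.2,   # illustrative clipping value
        num_epochs=10,                   # illustrative number of epochs
    )
    agent.initialize()
    return agent
```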
Table 10.
PPO parameters—continuous action and state space—additional detuning control.
| Parameter | Value |
| --- | --- |
| Max time steps (N) | 30 |
| | 1 |
| | 0.5 |
| End time (T) | |
| Time step | |
| Detuning | |
| Rabi frequency | |
| Actions | |
| Target fidelity | 0.9999 |
| Training iterations | 2000 |
| Actor hidden layers (2) | (100, 75) |
| Value hidden layers (2) | (100, 50) |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Training time | ≈45–60 mins |
Table 11.
Parameters for TSOA—PPO algorithm.
| Parameter | Value |
| --- | --- |
| Simulation time steps | 300 |
| | 1 |
| End time (T) | 3.15 |
| MDP time step | 1 |
| Actions | |
| Target fidelity | 0.9999 |
| Actor hidden layers (3) | (100, 100, 50) |
| Value hidden layers (3) | (75, 75, 50) |
| Learning rate | 0.002 |
| Optimizer | Adam |
| Training time | ≈4 mins |
Table 12.
Optimal trigonometric series parameters for fidelity > 0.9999.
| i | | |
| --- | --- | --- |
| 0 | 0.87517912 | 0.07352462 |
| 1 | 0.20610334 | −0.13624175 |
| 2 | 0.16243254 | −0.09679438 |
| 3 | 0.02755164 | 0.02831635 |
| 4 | −0.03201709 | 0.22532735 |
| 5 | −0.18376116 | −0.10505782 |
| 6 | 0.11923808 | 0.13957184 |
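A pulse built from such a truncated trigonometric series can be checked by direct simulation with QuTiP (Table 1). The sketch below is purely illustrative: the coefficient values, the series convention, and the rotating-frame Hamiltonian H = (Δ(t)/2)σz + (Ω(t)/2)σx are assumptions and need not match the model or the column assignments behind Table 12.

```python
import numpy as np
import qutip as qt

def trig_series(coeffs, t, T):
    """a0 + sum_k [a_k cos(2*pi*k*t/T) + b_k sin(2*pi*k*t/T)] (assumed form)."""
    a0, rest = coeffs[0], np.asarray(coeffs[1:]).reshape(-1, 2)
    val = a0
    for k, (a, b) in enumerate(rest, start=1):
        val += a * np.cos(2 * np.pi * k * t / T) + b * np.sin(2 * np.pi * k * t / T)
    return val

T = 3.15                                   # end time, as in Table 11
tlist = np.linspace(0.0, T, 300)           # 300 simulation time steps
omega_coeffs = [1.0, 0.1, -0.05]           # hypothetical coefficients for Omega(t)
delta_coeffs = [0.0, 0.05, 0.02]           # hypothetical coefficients for Delta(t)

# Assumed two-level Hamiltonian in the rotating frame.
H = [[0.5 * qt.sigmax(), lambda t, args: trig_series(omega_coeffs, t, T)],
     [0.5 * qt.sigmaz(), lambda t, args: trig_series(delta_coeffs, t, T)]]

psi0 = qt.basis(2, 0)                      # start in |1>
proj2 = qt.ket2dm(qt.basis(2, 1))          # projector onto |2>
result = qt.sesolve(H, psi0, tlist, e_ops=[proj2])
print("final population of |2> (fidelity):", result.expect[0][-1])
```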