Quantum Reinforcement Learning with Quantum Photonics

Quantum machine learning has emerged as a promising paradigm that could accelerate machine learning calculations. Within this field, quantum reinforcement learning aims at designing and building quantum agents that may exchange information with their environment and adapt to it, with the aim of achieving some goal. Different quantum platforms have been considered for quantum machine learning and specifically for quantum reinforcement learning. Here, we review the field of quantum reinforcement learning and its implementation with quantum photonics. This quantum technology may enhance quantum computation and communication, as well as machine learning, via the fruitful marriage between these previously unrelated fields.


Introduction
The field of quantum machine learning promises to employ quantum systems for accelerating machine learning [1] calculations, as well as employing machine learning techniques to better control quantum systems. In the past few years, several books as well as reviews on this topic have appeared [2][3][4][5][6][7][8].
Inside artificial intelligence and machine learning, the area of reinforcement learning designs "intelligent" agents capable of interacting with their outer world, the "environment", and adapting to it via reward mechanisms [9], see Figure 1. These agents aim at achieving a final goal that maximizes their long-term rewards. This kind of machine learning protocol is, arguably, the most similar one to the way the human brain learns. The field of quantum machine learning has recently been exploring the fruitful combination of reinforcement learning protocols with quantum systems, giving rise to quantum reinforcement learning.
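As a classical point of reference, the agent-environment loop of Figure 1 can be sketched with tabular Q-learning on a toy task (the two-state environment, the function name, and all parameter values below are hypothetical illustrations, not taken from the reviewed works):

```python
import random

# Minimal sketch of the agent-environment loop: the agent acts, the
# environment returns a reward and a next state, and the agent updates
# its long-term value estimates (hypothetical toy environment: two
# states, two actions; action 1 taken in state 0 is the rewarded one).
def run_q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    n_states, n_actions = 2, 2
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(10):  # short episode
            # balance exploration and exploitation (epsilon-greedy policy)
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: q[state][a])
            # environment responds with a reward and a next state
            reward = 1.0 if (state == 0 and action == 1) else 0.0
            next_state = action
            # temporal-difference update of the long-term value estimate
            best_next = max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q

q = run_q_learning()
```

After training, the rewarded action in state 0 carries the larger value estimate, illustrating how iterated reward feedback shapes the agent's policy.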
Different quantum platforms are being considered for the implementation of quantum machine learning. Among them, quantum photonics seems promising because of its good integration with communication networks, information processing at the speed of light, and the possible realization of quantum computations with integrated photonics [40]. Moreover, in the scenario with a reduced number of measurements, quantum reinforcement learning with quantum photonics has been shown to perform better than standard quantum tomography [16]. Quantum reinforcement learning with quantum photonics has been proposed [15,17,20] and implemented [16,19] in diverse works. Even before these articles, a pioneering experiment on quantum supervised and unsupervised learning with quantum photonics was carried out [41].
In this review, we first give an overview of the field of quantum reinforcement learning, focusing mainly on quantum devices employed for reinforcement learning algorithms [10][11][12][13][14][15][16][17][18][19][20], in Section 2. Later on, we review the proposal for a measurement-based adaptation protocol with quantum reinforcement learning [15] and its experimental implementation with quantum photonics [16], in Section 3. Subsequently, we describe further works in quantum reinforcement learning with quantum photonics [19,20], in Section 4. Finally, we give our conclusions in Section 5.

Figure 1. Reinforcement learning protocol. A system, called the agent, interacts with its external world, the environment, carrying out some action on it while receiving information from it. Afterwards, the agent acts accordingly in order to achieve some long-term goal, via feedback with rewards, iterating the process several times.
In Ref. [10], a pioneering proposal for reinforcement learning using quantum systems was put forward. It employed a Grover-like search algorithm, which could provide a quadratic speedup in the learning process as compared to classical computers [10].
Ref. [11] provided a quantum algorithm for reinforcement learning in which a quantum agent, possessing a quantum processor, can couple classically with a classical environment, obtaining classical information from it. The speedup in this case would come from the quantum processing of the classical information, which could be done faster than with classical computers. This is also based on Grover search, with a corresponding quadratic speedup.
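The quadratic gain from Grover-type search underlying Refs. [10,11] can be illustrated with a small statevector simulation of plain database search (a sketch only; those references embed Grover-like amplification inside the learning loop rather than using it for bare search):

```python
import numpy as np

# Grover amplitude amplification on a statevector: the marked item among
# N = 2^n is found with high probability after only O(sqrt(N)) oracle
# calls, instead of the O(N) classical queries.
def grover_search(n_qubits, marked):
    N = 2 ** n_qubits
    state = np.full(N, 1 / np.sqrt(N))           # uniform superposition
    n_iter = int(round(np.pi / 4 * np.sqrt(N)))  # ~ (pi/4) sqrt(N) iterations
    for _ in range(n_iter):
        state[marked] *= -1                      # oracle: flip marked amplitude
        state = 2 * state.mean() - state         # inversion about the mean
    return n_iter, np.abs(state[marked]) ** 2

n_iter, p = grover_search(8, marked=3)
# With 2^8 = 256 items, only ~sqrt(256) oracle calls are needed.
```

The inversion-about-the-mean step is the diffusion operator 2|s⟩⟨s| − I applied to the statevector, which for a uniform reference state reduces to reflecting each amplitude about the average.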
In Ref. [12], a quantum algorithm considers a quantum agent coupled to a quantum oracular environment, attaining a proven speedup with this kind of configuration, which can be exponential in some situations. The quantum algorithm could be applied to diverse kinds of learning: not only reinforcement learning, but also supervised and unsupervised learning.
Refs. [10][11][12] achieve speedups with respect to classical algorithms. While the first two rely on a polynomial gain due to a Grover-like algorithm, the latter achieves its proven speedup via a quantum oracular environment.
The series of articles in Refs. [13][14][15][16][17][18] studies quantum reinforcement learning protocols with basic quantum systems coupled to small quantum environments. These works focus mainly on proposals for implementations [13][14][15][17] as well as experimental realizations in quantum photonics [16] and superconducting circuits [18]. In the theoretical proposals, small few-qubit quantum systems are proposed both as quantum agents and as quantum environments. In Ref. [13], the aim of the agent is to achieve a final state that cannot be distinguished from the environment state, even if the latter has to be modified, as it is a single-copy protocol. In order to achieve this goal, measurements are allowed, as well as classical feedback inside the coherence time. Ref. [14] extends the previous protocol to the case in which measurements are not considered; instead, further ancillary qubits, coupled via entangling gates to agent and environment, are employed and later disregarded. In Ref. [15], several identical copies of the environment state are considered, such that the agent, via trial and error or, equivalently, a balance between exploration and exploitation, iteratively approaches the environment state. This proposal was carried out in a quantum photonics experiment [16] as well as with superconducting circuits [18]. In Ref. [17], a further extension of Ref. [15] to operator estimation, instead of state estimation, was proposed and analyzed.
Ref. [16] also obtained a speedup with respect to standard quantum tomography in the scenario with a reduced amount of resources, in the sense of a reduced number of measurements.
Finally, Ref. [20] considered different paradigms of learning inside a reinforcement learning framework, including projective simulation [42], together with a possible implementation with quantum photonics devices. Such devices, with their high repetition rates, high bandwidth, and low crosstalk, as well as the possibility of propagation over long distances, make this quantum platform an attractive one for this kind of protocol.

Theoretical Proposal
In this section, we review the proposal of Ref. [15], which introduces a quantum agent that can obtain information from several identical copies of an unknown environment state via measurement, feedback, and the application of partially random unitary gates. The randomness of these operations is reduced as the agent approaches the environment state, ideally converging to it.
The detailed protocol was as follows. One assumes a quantum system, the agent (A), and many identical copies of an unknown quantum state, the environment (E). An auxiliary system, the register (R), which interacts with E, was also considered. Then, information about E is obtained by measuring R, and the result is employed as input to a reward function (RF). Finally, one carries out a partially random unitary operation on A, depending on the output of the RF. The aim is to increase the overlap between A and E, without measuring the state of A.
In the rest of the description of the protocol we use the following notation: the subscripts A, R, and E refer to the respective subsystems, while the superscripts denote the iteration, such that O^(k)_α denotes the operator O acting on subsystem α during the kth iteration. When we do not employ indices, we refer to a general entity across iterations and subsystems.
Here we review the case in which the subsystems are described by single-qubit states [15]; for the multilevel description we refer to Ref. [15]. One considers that A (R) is encoded in |0⟩_A (|0⟩_R), while E is represented by an arbitrary state

|E⟩ = cos(θ^(1)/2)|0⟩_E + e^(iφ^(1)) sin(θ^(1)/2)|1⟩_E,

such that the initial state is given by

|ψ_0⟩^(1) = |0⟩_A |0⟩_R [cos(θ^(1)/2)|0⟩_E + e^(iφ^(1)) sin(θ^(1)/2)|1⟩_E].

One subsequently introduces the ingredients of the reinforcement learning protocol: the policy, the RF, and the value function (VF). We point out that the definition of the VF considered here differs from that of standard reinforcement learning, but it fulfills our purposes. To carry out the policy, one performs a controlled-NOT (CNOT) gate U^NOT_{E,R}, where E is the control and R the target (namely, the interaction with the environment system), in order that the information of E is transferred into R, achieving

|ψ_1⟩^(1) = |0⟩_A [cos(θ^(1)/2)|0⟩_E |0⟩_R + e^(iφ^(1)) sin(θ^(1)/2)|1⟩_E |1⟩_R].

One then measures the register qubit in the {|0⟩, |1⟩} basis, with probabilities p^(1)_0 = cos²(θ^(1)/2) and p^(1)_1 = sin²(θ^(1)/2) of obtaining the state |0⟩ or |1⟩, respectively (namely, extraction of information). If the outcome is |0⟩, one has collapsed E onto A, such that one does nothing, while if the outcome is |1⟩, one has measured the projection of E orthogonal to A, such that one accordingly updates the agent. Given that no further information about the environment is available, one carries out a partially random unitary gate on A (action), U^(1)(α^(1), β^(1)), built from exponentials of the spin components S_k, where α^(1) and β^(1) are random angles drawn from [−Δ^(1)/2, Δ^(1)/2], with Δ^(1) the random angle range.
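The policy and measurement steps above can be checked numerically; the snippet below (an illustrative sketch with a hypothetical function name) verifies that the CNOT with E as control and R as target yields the register statistics p_0 = cos²(θ^(1)/2) and p_1 = sin²(θ^(1)/2) stated in the text:

```python
import numpy as np

# Policy step of the protocol: a CNOT with E as control and R as target
# transfers the basis information of E into R, so that measuring R
# reproduces the overlap statistics of the environment state.
def register_probabilities(theta, phi):
    env = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
    reg = np.array([1.0, 0.0])           # register initialized in |0>
    state = np.kron(env, reg)            # qubit ordering |E>|R>
    cnot = np.array([[1, 0, 0, 0],       # |00> -> |00>
                     [0, 1, 0, 0],       # |01> -> |01>
                     [0, 0, 0, 1],       # |10> -> |11>
                     [0, 0, 1, 0]])      # |11> -> |10>
    state = cnot @ state
    # probability of finding the register in |0> resp. |1>
    p0 = np.abs(state[0]) ** 2 + np.abs(state[2]) ** 2
    p1 = np.abs(state[1]) ** 2 + np.abs(state[3]) ** 2
    return p0, p1
```

For example, θ = π/3 gives p_0 = cos²(π/6) = 3/4, independently of the phase φ, as expected from the protocol.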
Subsequently, one initializes the register qubit and considers a new copy of E, achieving the following initial state for the second iteration:

|ψ_0⟩^(2) = |0⟩^(2)_A |0⟩_R |E⟩, with |0⟩^(2)_A = [m^(1) U^(1) + (1 − m^(1)) I] |0⟩_A.

Here we denote by m^(1) ∈ {0, 1} the result of the measurement, while I is the identity operator, and we express the new agent state in the form |0⟩^(2)_A, i.e., U^(1) is applied only when the outcome is m^(1) = 1. Subsequently, one considers the RF to change the exploration range of the kth iteration, Δ^(k), as

Δ^(k) = R Δ^(k−1) if m^(k−1) = 0, Δ^(k) = P Δ^(k−1) if m^(k−1) = 1, (5)

where we denote by m^(k−1) the result of the (k − 1)th iteration and by R and P the reward and punishment ratios, respectively. Equation (5) expresses the fact that Δ is changed by RΔ for the subsequent iteration whenever m = 0 and by PΔ whenever the result is m = 1. In the described protocol, one takes for the sake of simplicity R = ε < 1 and P = 1/ε > 1, in such a way that the value of Δ is reduced every time |0⟩ is measured and grows otherwise. Moreover, given that R · P = 1, reward and punishment have similar weight or, equivalently, if the algorithm provides an equal number of results 0 and 1, the exploration range is not modified. Additionally, one defines the VF as the value of Δ^(n) after all n iterations have taken place. Thus, Δ^(n) → 0 whenever the algorithm converges to a maximal overlap between A and E.
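The reward function of Equation (5) can be written compactly; taking R = ε < 1 and P = 1/ε > 1, so that R · P = 1 as stated in the text (the helper name and the default value of ε below are illustrative choices):

```python
# Reward function of Equation (5) (helper name and default epsilon are
# illustrative): the exploration range Delta shrinks by R = epsilon < 1
# on a reward (m = 0) and grows by P = 1/epsilon > 1 on a punishment
# (m = 1), so that R * P = 1.
def update_range(delta, m, epsilon=0.8):
    ratio = epsilon if m == 0 else 1.0 / epsilon
    return delta * ratio
```

Because R · P = 1, one reward followed by one punishment leaves Δ unchanged, which is exactly the balance between reward and punishment described above.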
In order to show in further detail how the algorithm works, we consider the kth step. The initial state in the protocol reads

|ψ_0⟩^(k) = |0⟩^(k)_A |0⟩_R |E⟩,

where |0⟩^(k)_A is the agent state after k − 1 iterations and ⟨0^(j)_A|1^(j)_A⟩ = 0. One can then express the state of system E in the Bloch basis, employing |0⟩^(k) as reference,

|E⟩ = cos(θ^(k)/2)|0⟩^(k) + e^(iφ^(k)) sin(θ^(k)/2)|1⟩^(k),

where the states |0⟩^(k) and |1⟩^(k) can be expressed by means of the initial logical vectors |0⟩ and |1⟩ as well as the angles θ^(k), θ^(1), φ^(k), and φ^(1). Accordingly, the unitary gate U^(k)† carries out the rotation required to map |0⟩^(k) → |0⟩ and |1⟩^(k) → |1⟩, and acting with it on E one achieves

U^(k)†_E |E⟩ = cos(θ^(k)/2)|0⟩ + e^(iφ^(k)) sin(θ^(k)/2)|1⟩.

Subsequently, one applies the operation U^NOT_{E,R}, and later one measures R, with respective probabilities p^(k)_0 = cos²(θ^(k)/2) and p^(k)_1 = sin²(θ^(k)/2) for the results m^(k) = 0 and m^(k) = 1. Then, one applies the RF provided by Equation (5). We remark that, statistically, Δ → 0 when p^(k)_0 → 1, and Δ → 4π when p^(k)_1 → 1. With respect to the exploration-exploitation tradeoff, this implies that whenever exploitation decreases (one obtains |1⟩ many times), the exploration increases (the value of Δ grows) so that the probability of a beneficial change increases, while whenever exploitation grows (one obtains |0⟩ often), the exploration range is reduced so as to permit only small subsequent modifications.
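Putting the steps together, a minimal classical simulation of the single-qubit protocol might look as follows. This is a sketch under stated assumptions: the agent basis is tracked as a state vector, the register statistics are sampled directly from the overlap, the partially random action is taken as z- and y-rotations with angles in [−Δ/2, Δ/2], and the initial range and parameter values are illustrative choices, not those of Refs. [15,16]:

```python
import numpy as np

# Toy simulation of the measurement-based adaptation protocol: each
# iteration consumes a fresh environment copy, the register outcome is
# sampled from p0 = |<0_A|E>|^2, a punished agent is kicked by a
# partially random unitary, and the exploration range Delta follows
# Equation (5) with R = epsilon, P = 1/epsilon.
def adaptation_protocol(theta_env=2.0, phi_env=0.5, n_iter=300,
                        epsilon=0.8, seed=1):
    rng = np.random.default_rng(seed)
    env = np.array([np.cos(theta_env / 2),
                    np.exp(1j * phi_env) * np.sin(theta_env / 2)])
    agent = np.array([1.0 + 0j, 0.0])            # agent starts in |0>
    delta = 2 * np.pi                            # initial exploration range (assumed)
    for _ in range(n_iter):
        p0 = np.abs(np.vdot(agent, env)) ** 2    # register outcome statistics
        m = 0 if rng.random() < p0 else 1
        if m == 1:                               # punishment: explore
            alpha, beta = rng.uniform(-delta / 2, delta / 2, size=2)
            rz = np.array([[np.exp(-1j * alpha / 2), 0],
                           [0, np.exp(1j * alpha / 2)]])
            ry = np.array([[np.cos(beta / 2), -np.sin(beta / 2)],
                           [np.sin(beta / 2),  np.cos(beta / 2)]])
            agent = ry @ rz @ agent
            delta *= 1.0 / epsilon               # punishment ratio P
        else:                                    # reward: exploit
            delta *= epsilon                     # reward ratio R
        delta = min(delta, 4 * np.pi)            # cap at 4*pi, as in the text
    return np.abs(np.vdot(agent, env)) ** 2, delta

fidelity, final_delta = adaptation_protocol()
```

Runs of reward outcomes shrink Δ and freeze the agent near the environment state, while runs of punishments re-expand Δ, reproducing the exploration-exploitation balance described above.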

Implementation with Quantum Photonics
In Ref. [16], an experiment implementing the proposal described in Section 3.1 was carried out. The experiment aimed to estimate an unknown quantum state in a photonic system, in a scenario with a reduced number of copies. A partially quantum reinforcement learning protocol was used to adapt a qubit state, the "agent", to the given unknown quantum state, the "environment", via iterated single-shot projective measurements followed by feedback, with the aim of achieving maximum fidelity. The experimental setup, consisting of a quantum photonics device, could modify the available parameters to change the agent system according to the measurement results "0" and "1" on the environment state, namely, reward/punishment feedback. The experimental results showed that the protocol provides a technique for estimating an unknown photonic quantum state whenever only a limited number of copies is given, and that it can be extended to higher-dimensional, multipartite, and density-matrix quantum state cases. The fidelities achieved in the protocol for the single-photon case were above 88% with an appropriate reward/punishment ratio after 50 iterations.

Further Developments of Quantum Reinforcement Learning with Quantum Photonics
Without the aim of being exhaustive, here we briefly describe some other works that have appeared in the literature on the field of quantum reinforcement learning with quantum photonics.
In Ref. [19], it was argued that future breakthroughs in experimental quantum science will involve dealing with complex quantum systems and will therefore require complex and expensive experiments. Designing such complicated experiments is hard and could be enhanced with the aid of artificial intelligence. In this reference, the authors presented an automated learning system that learns to create complex quantum experiments without relying on previous knowledge or on sometimes incorrect intuition. Their device not only learned how to design quantum experiments better than previous works, but in the process it also discovered nontrivial experimental tools. The conclusion they reached is that learning devices can provide crucial advances in the design and generation of novel experiments.
In Ref. [20], the authors introduced a blueprint for a quantum photonics realization of active learning devices employing machine learning algorithms such as SARSA, Q-learning, and projective simulation. They carried out numerical calculations to evaluate the performance of their algorithm in customary reinforcement learning situations, finding that reasonable amounts of experimental error could be tolerated or could, sometimes, even benefit the learning protocol. Among other features, they showed that their designed device would enable abstraction and generalization, two aspects considered crucial for artificial intelligence. For their model they consider a quantum photonics platform which, they argue, is both scalable and simple, such that proof-of-principle integration in quantum photonic devices seems feasible with near-term platforms.
More specifically, the main novelties in Ref. [20] are twofold. (i) Firstly, the authors describe a quantum photonics platform that allows reinforcement learning algorithms to work directly in combination with optical applications. For this aim, they focus on linear optics for its simplicity and well-established fabrication technology when compared to solid-state processors. They mention as an example that nanosecond-scale reconfigurability and routing have already been achieved, and, furthermore, photonic platforms allow for decision-making at the speed of light, the fastest possible, constrained only by generation and detection rates. Energy efficiency in memories is a further bonus that photonic technologies may provide [20]. (ii) The second achievement of the article is the analysis of a variant of projective simulation based on binary decision trees, which is closely connected to standard projective simulation and suitable for photonic circuit implementations. Finally, they discuss how this development would enable key aspects of artificial intelligence, namely generalization and abstraction [20].
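A minimal two-layer projective-simulation agent, in the spirit of Ref. [42], can be sketched as follows (the damping parameter, the toy task, and all names are hypothetical illustrations; the binary-decision-tree variant of Ref. [20] is more elaborate):

```python
import random

# Two-layer projective simulation sketch: percept and action "clips" are
# connected by h-values; an excitation hops from percept to action with
# probability proportional to h, rewarded edges are strengthened, and all
# edges are damped toward h = 1 by a factor gamma (forgetting).
class PSAgent:
    def __init__(self, n_percepts, n_actions, gamma=0.02, seed=0):
        self.h = [[1.0] * n_actions for _ in range(n_percepts)]
        self.gamma = gamma
        self.rng = random.Random(seed)

    def act(self, percept):
        weights = self.h[percept]
        r = self.rng.uniform(0, sum(weights))
        acc = 0.0
        for action, w in enumerate(weights):
            acc += w
            if r <= acc:
                return action
        return len(weights) - 1

    def learn(self, percept, action, reward):
        for row in self.h:                       # damping of all edges
            for a in range(len(row)):
                row[a] -= self.gamma * (row[a] - 1.0)
        self.h[percept][action] += reward        # reinforcement of the used edge

# Toy task (hypothetical): percept i is rewarded only for action i.
agent = PSAgent(n_percepts=2, n_actions=2)
for step in range(2000):
    percept = step % 2
    action = agent.act(percept)
    agent.learn(percept, action, reward=1.0 if action == percept else 0.0)
```

After training, the h-values of the rewarded percept-action edges dominate, so the agent's stochastic policy concentrates on the correct actions; the damping term is what allows such an agent to re-adapt when the environment changes.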

Conclusions
In this article, we reviewed the field of quantum reinforcement learning with quantum photonics. Without the goal of being exhaustive, we first reviewed the area of quantum reinforcement learning in general, showing that automated quantum agents can sometimes provide enhancements with respect to classical computers. Later on, we described a theoretical proposal for a quantum reinforcement learning protocol for state estimation and its experimental realization with quantum photonics. This protocol has been shown to provide a speedup with respect to standard quantum tomography in the reduced-resource scenario. Finally, we briefly reviewed some other works in the field of quantum reinforcement learning with quantum photonics that have appeared in the literature.
The field of quantum reinforcement learning may provide quantum systems with a larger degree of autonomy and independence. Quantum photonics is among the quantum platforms where this kind of technology could be highly fruitful. Even though the number of qubits is often not as large as in other platforms, such as trapped ions and superconducting circuits, quantum photonics processes information at the speed of light and can be suitably interfaced with long-distance quantum communication protocols. A long-term goal of quantum reinforcement learning could be to combine this paradigm with quantum artificial life [8]. This could allow one to achieve fully autonomous quantum individuals that can reproduce, evolve, and interact with and adapt to their environment. Further benefits in areas such as neuroscience could emerge as a consequence of this promising avenue.