Basic Reinforcement Learning Techniques to Control the Intensity of a Seeded Free-Electron Laser

Abstract: Optimal tuning of particle accelerators is a challenging task. Many different approaches have been proposed in the past to solve two main problems: attainment of an optimal working point and performance recovery after machine drifts. The most classical model-free techniques (e.g., Gradient Ascent or Extremum Seeking algorithms) have some intrinsic limitations. To overcome those limitations, Machine Learning tools, in particular Reinforcement Learning (RL), are attracting more and more attention in the particle accelerator community. We investigate the feasibility of RL model-free approaches to align the seed laser, as well as other service lasers, at FERMI, the free-electron laser facility at Elettra Sincrotrone Trieste. We apply two different techniques: the first, based on episodic Q-learning with linear function approximation, for performance optimization; the second, based on the continuous Natural Policy Gradient REINFORCE algorithm, for performance recovery. Despite the simplicity of these approaches, we report satisfactory preliminary results that represent the first step toward a new, fully automatic procedure for the alignment of the seed laser to the electron beam. Such an alignment is, at present, performed manually.


Introduction
In a seeded Free-Electron Laser (FEL) [1][2][3][4], the generation of the FEL process is based on the overlap of a ∼1 ps-long bunch of relativistic electrons with a ∼100 fs pulse of photons of an optical laser, which takes place inside a static magnetic field generated by specific devices called undulators. Both the longitudinal (temporal) and the transverse superposition are crucial for attaining the FEL process, and therefore they must be controlled precisely. The former is adjusted by means of a single mechanical delay line placed in the laser path, while the latter has several degrees of freedom, as it involves the trajectories of the electron and laser beams inside the undulators. A shot-to-shot feedback system based on position sensors [5] has been implemented at the Free Electron laser Radiation for Multidisciplinary Investigations (FERMI) to keep the electron trajectory stable, while the trajectory of the laser has to be continuously readjusted, being subject to thermal drifts, or restored whenever the laser transverse profile changes because the FEL operators modify the laser wavelength.
During standard operations, the horizontal and vertical transverse position and angle (pointing) of the laser beam inside the undulators are kept optimal by an automatic process exploiting the correlation of the FEL intensity with the natural noise of the trajectory [6]. Whenever the natural noise is not sufficient to determine in which direction to move the pointing of the laser, artificial noise can be injected. Classical model-free optimization techniques of this kind, however, suffer from some intrinsic limitations:
1. the need to evaluate the gradient of the objective function, which can be difficult to estimate when the starting point is far from the optimum;
2. the difficulty of determining the hyper-parameters, whose appropriate values depend on the environment and on the noise of the system;
3. the lack of "memory" with which to exploit past experience.
Modern algorithms like Reinforcement Learning (RL), which belong to the category of Machine Learning (ML), are able to automatically discover the hidden relationship between input variables and objective function without human supervision. Although they usually require large data sets and long learning times, they are becoming popular in the particle accelerator community thanks to their capability to work with no knowledge of the system.
In order to optimize the FEL's performance, different approaches have been adopted in recent years [9]. In 2011, OCELOT [10], a multi-physics simulation toolkit designed for the study of FELs and synchrotron light sources, was developed at the European XFEL GmbH. In addition to some common generic optimization algorithms (Extremum Seeking, Nelder-Mead), the framework implements Bayesian optimization based on Gaussian processes. This tool is routinely employed for tuning quadrupole currents at the Stanford Linear Accelerator Center (SLAC) [11,12] and for optimizing the power of self-amplified spontaneous emission (SASE) at the Free electron LASer in Hamburg (FLASH) at the Deutsches Elektronen-SYnchrotron (DESY) [13]. A different approach is described in Reference [14], where the authors advocate the use of artificial neural networks to model and control particle accelerators; they also mention applications based on the combination of neural networks and RL methods. Finally, recent works [15][16][17][18] have presented RL methods used in the context of particle accelerators. In References [15] and [16], which are simulation studies, the FEL model and the policy are defined by neural networks. In Reference [17] the authors present an application of RL on a real system; the study concerns a beam alignment problem faced with a deep Q-learning approach in which the state is defined as the beam position.
The present paper is an extended version of Reference [18], in which Q-learning with linear function approximation was used to perform the alignment of the seed laser. Here, we use an additional well-known RL technique, namely the Natural Policy Gradient (NPG) version of the REINFORCE algorithm [19] (NPG REINFORCE). It allows us to operate on a continuous action space and to adapt to an underlying model that changes over time. In fact, while in Reference [18] the goal was to control the overlap of the electron and laser beams starting from random initial conditions, in this paper we also deal with the problem of machine drifts; for the latter, we use NPG REINFORCE. The target of our study is the FERMI FEL (Section 3), one of the few 4th-generation light source facilities available in the world. Due to its intensive use, its availability for testing the algorithms is very limited; therefore, some preliminary experiments have been conducted on a different system, namely the Electro-Optical Sampling station (EOS) (Section 2). Despite the differences between the two systems, they lead to similar RL problem formulations. Both techniques have finally been implemented on the FERMI FEL.
The rest of the article is organized as follows: Sections 2 and 3 introduce the physical systems of the EOS and the FEL. Basic information on our implementation of the RL algorithms is provided in Section 4, while the experimental configuration and the achieved results are described in Section 5. Finally, conclusions are drawn in Section 6.

EOS Alignment System
The considered optical system is part of the EOS station, located upstream of the second line of the free-electron laser. The EOS is a non-destructive diagnostic device designed to perform on-line, single-shot longitudinal profile and arrival time measurements of the electron bunches using a UV laser [20][21][22]. Since the aim of the present work is to control a part of the laser trajectory, we will not explain the EOS process in detail, but rather focus on the parts of the device relevant for our purpose.
The device, shown in simplified form in Figure 1, is a standard optical alignment system composed of two planar tip-tilt mirrors [23] (TTs), each driven by two pairs of motors. Coarse positioning is obtained via two coarse-motors, while two piezo-motors are employed for fine-tuning; in the optimization process, only the piezo-motors are considered. Two charge-coupled devices (CCDs) detect the position of the laser beam in two different places along the path (the CCDs do not intercept the laser beam thanks to the use of semi-reflecting mirrors). The ultimate goal is to steer and keep the laser spot inside a pre-defined region of interest (ROI) of each CCD. To achieve this result, a proper voltage has to be applied to each piezo-motor. The product of the two light intensities detected by the CCDs in the ROIs can be used as an evaluation criterion for the correct positioning of the laser beam; in particular, it can be interpreted as an AND logic operator that is "true" when the laser is inside both of the ROIs.
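The product criterion described above can be sketched in a few lines. This is a minimal illustration assuming the CCD frames arrive as 2-D NumPy arrays and the ROIs as pixel-index bounds; the function and variable names are ours, not part of FERMI's control system:

```python
import numpy as np

def roi_intensity(frame, roi):
    """Sum of pixel intensities inside a rectangular region of interest.

    `frame` is a 2-D array from one CCD; `roi` is (row0, row1, col0, col1).
    """
    r0, r1, c0, c1 = roi
    return frame[r0:r1, c0:c1].sum()

def alignment_score(frame1, roi1, frame2, roi2):
    """Product of the two ROI intensities: non-zero only when the laser
    deposits light inside BOTH regions (the AND-like criterion)."""
    return roi_intensity(frame1, roi1) * roi_intensity(frame2, roi2)
```

A spot missing either ROI drives the score to zero, which is what makes the product usable as a single scalar objective for the alignment.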

FEL Alignment System
In a seeded FEL, an initial seed signal, provided by a conventional high peak-power pulsed laser, is temporally synchronized to overlap the electron bunches inside a first undulator section called the modulator. In the transverse alignment process, two Yttrium Aluminum Garnet (YAG) screens equipped with CCDs are properly inserted and extracted in order to measure the electron beam transverse position before and after the modulator [24]. After inhibiting the electron beam, the seed laser position is measured using the same YAG screens, and the correct positions of the two tip-tilt mirrors are found manually by moving the coarse-motors so as to overlap the electron beam. This destructive procedure (a screen has to be inserted) is repeated several times and, at the end, the screens are removed to switch on the FEL. The simplified scheme of the alignment setup is shown in Figure 2. After the coarse tuning described above, a further optimization is carried out by moving the tip-tilt mirrors to maximize the FEL intensity measured by the I_0 monitor. The working principle of this monitor is the atomic photo-ionization of a rare gas at low particle density, in the range of 10^11 cm^-3 (p ≈ 10^-5 mbar). The FEL photon beam, traveling through the rare-gas-filled chamber, generates ions and electrons, which are extracted and collected separately. From the resulting currents it is possible to derive the absolute number of photons per pulse, shot by shot.

Reinforcement Learning
In RL, data collected through experience are employed to select future inputs of a dynamical system [25,26]. An environment is a discrete dynamical system whose model can be defined by:

x_{k+1} = f(x_k, u_k),   (1)

in which x_k ∈ X and u_k ∈ U are, respectively, the environment state and the external control input (the action) at the k-th instant, while f is the state-transition function. A controller, or agent, learns a suitable state-action map, also known as a policy (π(u_k|x_k)), by interacting with the environment through a trial-and-error process. For each action u_k ∈ U chosen in state x_k ∈ X, the environment provides a reward r(x_k, u_k). The aim of the learning process is to find an optimal policy π* with respect to the maximization of an objective function J, which is a design choice.
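The interaction just described (state-transition function f, policy, and reward) can be summarized by a generic rollout loop; a minimal sketch with illustrative names:

```python
def rollout(f, policy, reward, x0, horizon):
    """Run one agent-environment interaction of `horizon` steps:
    u_k = policy(x_k), x_{k+1} = f(x_k, u_k), accumulating r(x_k, u_k)."""
    x, total = x0, 0.0
    for _ in range(horizon):
        u = policy(x)           # action chosen by the current policy
        total += reward(x, u)   # reward for the (state, action) pair
        x = f(x, u)             # state transition
    return x, total
```

Every technique in this paper fits this loop; what changes is how `policy` is represented and how the collected rewards are used to improve it.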

Q-Learning
Among the different approaches to the RL problem, approximate dynamic programming aims at solving the problem

max_π J = max_π E[ Σ_k γ^k r(x_k, u_k) ],

in which γ ∈ [0, 1[ is the discount factor, by iteratively estimating an action-value function (or Q-function) from data. Here, J takes the form of an expected discounted cumulative reward. Assuming that there exists a stationary optimal policy (stationarity is a consequence of the infinite time horizon, i.e., N → ∞, and implies that the optimal action for a given state x at time k depends only on x and not on k), the Q-function is defined as the optimum value of the expected discounted reward when action u is selected in state x. Therefore, given the action-value function Q(x, u), the optimal policy is

π*(x) = argmax_{u ∈ U} Q(x, u).

In other words, estimating the Q-function amounts to solving the learning problem. An attractive and well-known method for estimating the Q-function is the Q-learning algorithm [27].
In the present work, we employ Q-learning in an episodic framework (meaning that the learning is split into episodes that end when some terminal condition is met). The choice of Q-learning among other RL approaches is due to its simplicity and to the fact that the problem admits a non-sparse reward, which is beneficial for speeding up the learning [28]. During learning, the exploration of the state-action space can be achieved by employing a so-called ε-greedy policy:

π(u|x) = { a uniformly random u ∈ U, with probability ε;  argmax_{u ∈ U} Q(x, u), with probability 1 − ε },   (2)

in which ε defines the probability of a random choice (exploration). The Q-learning update rule is:

Q(x_k, u_k) ← Q(x_k, u_k) + α δ,   δ = r(x_k, u_k) + γ max_u Q(x_{k+1}, u) − Q(x_k, u_k),   (3)

where α is the learning rate and δ is the temporal-difference error, the difference between the discounted optimal Q in the state x_{k+1} and the value Q(x_k, u_k) (see Algorithm 1 for more details). Defining the state set as X ⊂ R^n (where n is the dimension of the state vector), since the actions are finite (u ∈ U = {u^(1), ..., u^(N)}), the action-value function can be represented as a collection of maps Q(x, u^(1)), ..., Q(x, u^(N)) from X × U to R. In order to work with a continuous state space, we employ a linear function approximation version of the Q-learning algorithm. More precisely, we parametrize

Q(x, u^(j)) = θ_j^T φ(x),

where φ(x) is a vector of features and θ_j a weight vector associated with the j-th input u^(j). Thus, the whole Q(x, u) is specified by the vector of parameters θ = [θ_1^T, ..., θ_N^T]^T, and the corresponding policy will be identified by π_θ. In particular, we employ Gaussian Radial Basis Functions (RBFs) as features; given a set of centers c_i, the i-th feature is

φ_i(x) = exp( −‖x − c_i‖^2 / (2 σ_i^2) ),

where σ_i determines the decay rate of the RBF. The pseudo-code of the Q-learning with linear function approximation is reported in Algorithm 1.
Algorithm 1 Q-learning algorithm with linear function approximation [29]
Initialize θ and set α, γ
For each episode:
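As an illustration, the following is a minimal sketch of the episodic Q-learning with linear function approximation described above: Gaussian RBF features, one weight vector per discrete action, ε-greedy exploration, and a temporal-difference update. All hyper-parameter values and the environment interface are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_features(x, centers, sigma):
    """Gaussian RBF features: phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2))."""
    d2 = ((centers - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

class LinearQLearner:
    """Q(x, u_j) = theta_j . phi(x): one weight vector per discrete action."""

    def __init__(self, centers, sigma, n_actions, alpha=0.1, gamma=0.9, eps=0.2):
        self.centers, self.sigma = centers, sigma
        self.theta = np.zeros((n_actions, len(centers)))
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def q_values(self, x):
        return self.theta @ rbf_features(x, self.centers, self.sigma)

    def act(self, x):
        # epsilon-greedy exploration over the discrete action set
        if rng.random() < self.eps:
            return int(rng.integers(len(self.theta)))
        return int(np.argmax(self.q_values(x)))

    def update(self, x, u, r, x_next, done):
        # temporal-difference update of the weights of the chosen action
        phi = rbf_features(x, self.centers, self.sigma)
        target = r if done else r + self.gamma * np.max(self.q_values(x_next))
        delta = target - self.theta[u] @ phi
        self.theta[u] += self.alpha * delta * phi
```

Because the features are shared across actions while the weights are per-action, the learner generalizes across nearby continuous states but still selects among a finite action set, exactly the structure the paper requires.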

Policy Gradient REINFORCE
An alternative approach consists of directly learning a policy, without relying on value functions. In this regard, given a policy π(u|x, θ) parametrized by a vector θ, Policy Gradient (PG) methods [26] aim at finding an optimal θ* which ensures the maximization of

J(θ) = E[ R(ξ) ],   (5)

where ξ = (x_0, u_0, x_1, u_1, ..., x_N) is a state-input trajectory obtained by following a particular π, and R(ξ) = Σ_{k=0}^{N−1} r(x_k, u_k) is the reward accumulated along it. The trajectory ξ can be thought of as a random variable that has a probability distribution P(ξ|θ).
We employ the REINFORCE algorithm [19], which aims at finding the optimal θ* for P(ξ|θ), solution of the optimization problem (5), by updating θ along the gradient of the objective function. More precisely, a stochastic gradient ascent is performed:

θ ← θ + α ∇_θ J(θ) = θ + α E[ ∇_θ log P(ξ|θ) R(ξ) ],   (6)

where α ∈ [0, 1] is the learning rate. Since

P(ξ|θ) = p(x_0) Π_{k=0}^{N−1} π(u_k|x_k, θ) p(x_{k+1}|x_k, u_k),

where p(x_{k+1}|x_k, u_k) is the transition probability from x_k to x_{k+1} when the action u_k is applied, the update rule (6) becomes:

θ ← θ + α [ Σ_{k=0}^{N−1} ∇_θ log π(u_k|x_k, θ) ] R(ξ).   (7)

Such an update is performed every time a path ξ is collected. In order to reduce the variance of the gradient estimates, typical of PG approaches [30], we employ an NPG version [31] of the REINFORCE algorithm, in which a linear transformation of the gradient is adopted by using the inverse Fisher information matrix F^{-1}(θ). The pseudo-code of the REINFORCE is reported in Algorithm 2.
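A natural-gradient REINFORCE step can be sketched as follows: the vanilla gradient estimate is preconditioned by the inverse of an empirical Fisher matrix built from the same score vectors. This is a sketch under the assumption that a batch of trajectories has been collected; the damping term is our addition for numerical stability, not part of the paper's algorithm:

```python
import numpy as np

def npg_update(theta, grads, returns, alpha=0.05, damping=1e-3):
    """One natural-gradient REINFORCE step.

    `grads` holds one score vector per trajectory, i.e. the sum over k of
    d/dtheta log pi(u_k | x_k, theta); `returns` the corresponding R(xi).
    The vanilla gradient g = E[score * R] is preconditioned by the inverse
    of the empirical Fisher matrix F = E[score score^T]."""
    G = np.asarray(grads)            # shape (n_traj, dim)
    R = np.asarray(returns)          # shape (n_traj,)
    g = (G * R[:, None]).mean(axis=0)
    F = (G[:, :, None] * G[:, None, :]).mean(axis=0)
    F += damping * np.eye(len(theta))   # regularize before inversion
    return theta + alpha * np.linalg.solve(F, g)
```

Rescaling the gradient by F^{-1} makes the step size independent of how the policy is parametrized, which is what tames the variance issues mentioned above.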

Implementation and Results
In the following we apply the two RL techniques described above to address two problems:
• the attainment of an optimal working point, starting from random initial conditions;
• the recovery of the optimal working point when drifts, or changes in working conditions, occur.
The former problem is tackled with Q-learning, the latter with the NPG REINFORCE algorithm; both were chosen for their simplicity. In particular, for the problem of target recovery, the policy gradient algorithm REINFORCE has been chosen because it employs a continuous action space, which allows fine adjustments to compensate small drifts. Both algorithms have been tested on the EOS system before being deployed on the FEL system. In the present section we describe the experimental protocols and report the results, which will be discussed in Section 5.3.

Optimal Working Point Attainment Problem
The problem of defining a policy able to lead the plant to an optimal working point, starting from random initial conditions, requires splitting the experiments into two phases: (i) a training phase, which allows the controller to learn a proper policy, and (ii) a test phase, to validate the ability of the learned policy to behave properly, possibly in conditions not experienced during training.
In both optical systems (the EOS and the FEL), the state x is a 4-dimensional vector that contains the current voltage values applied to each piezo-motor (two values for the first mirror and two values for the second mirror). We neglect the dynamics of the piezo-motors, since their transients are much shorter than the time between shots. The input u is also a 4-dimensional vector; denoting the component index as a superscript, the update rule is:

x_{k+1} = x_k + u_k,

that is, the input is the incremental variation of the state itself. The action space is discrete, thus the modulus of each i-th component of u is set equal to a fixed value. Moreover, the state x can only assume values that satisfy the physical constraints of the piezo-motors [23]:

x_min ≤ x_k ≤ x_max,   (8)

hence we allow, for each state x of both systems, only those inputs u for which the component-wise inequality (8) is not violated. In the following, when referring to the intensity of the EOS, we will actually refer to the product of the two intensities detected in the ROIs when the laser hits both ROIs; by FEL intensity, we will mean the intensity measured by the I_0 monitor. Finally, for both systems, we will denote the target intensity (computed as explained below) as I_T.
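The constraint handling described above (allowing only inputs that keep the state within the piezo-motor limits) can be sketched as follows; the voltage limits and step size are hypothetical values, not FERMI's actual ranges:

```python
import numpy as np
from itertools import product

V_MIN, V_MAX = 0.0, 100.0   # hypothetical piezo voltage limits, not the real range

def admissible_actions(x, step, v_min=V_MIN, v_max=V_MAX):
    """Enumerate the discrete inputs u, with each component equal to -step
    or +step, for which x + u satisfies the component-wise voltage limits."""
    actions = []
    for u in product((-step, step), repeat=len(x)):
        u = np.asarray(u)
        if np.all(x + u >= v_min) and np.all(x + u <= v_max):
            actions.append(u)
    return actions
```

Far from the limits, all 2^4 = 16 sign combinations are admissible; near a bound, the set shrinks so the agent can never command an infeasible voltage.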

EOS
The training of the EOS alignment system consists of 300 episodes. The number of episodes was chosen after preliminary experiments on an EOS device simulator; however, based on the results obtained on the real device, the number of episodes can actually be reduced (see Figure 3). At the beginning of the training, the ROIs are selected, and therefore so is the target value I_T. They remain the same for all the training episodes. At each time step k, the input provided by the agent is applied, and the new intensity I_D(x_{k+1}) is compared with the target I_T. The episode ends in two cases:
• when the detected intensity in the new state, I_D(x_{k+1}), is greater than or equal to a certain percentage p_T of the target (p_T I_T);
• when the maximum number of allowed time steps is reached.
When the first condition occurs, the goal is achieved. During the training procedure, the values of ε (exploration) of (2) and α (learning rate) of (3) decay according to the rules (9) [32,33], in which the value N_0 is set empirically. In addition, the reward is shaped according to (10) [28], where r̄ is taken equal to 1 if the target is reached and 0 otherwise; the values of γ_rs > 0 and k > 0 are set empirically. The specific design of (10) allows rewarding the agent for state-action pairs that lead to a sufficiently increased detected intensity, γ_rs I_D(x_{k+1}) > I_D(x_k) (r(x_k, u_k) > 0), and penalizing it otherwise (r(x_k, u_k) < 0). At the end of each episode, a new one begins from a new, randomly selected initial state, until the maximum number of episodes is reached. Then, a test (with random initial states) is carried out under the same target conditions as the training, but with a fixed ε = 0.05, as in Reference [34], and α = 0. We repeat the training and test 10 times (i.e., we perform 10 different runs) and report in the following the results in terms of average duration of each episode. The parameter values employed during the experiments are reported in Table 1; they result from offline experiments on a simulator of the EOS system. The average number of time steps per episode for the whole training phase is reported in Figure 3. The steep decrease of the average number of time steps shows that a few episodes are sufficient to get a performance close to the one obtained after a whole training phase. Indeed, thanks to the reward shaping (10), the Q-function is updated at each step of each episode instead of just at the end of the episode (see Section 5.3 for further details). The average number of time steps per episode during the test phase is visible in Figure 4 and is consistent with the training results.
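Since the exact decay rules (9) and reward shaping (10) are not reproduced here, the sketch below implements one plausible form consistent with the surrounding description: a hyperbolic decay governed by N_0, and a reward that is positive when γ_rs I_D(x_{k+1}) > I_D(x_k), plus a bonus r̄ = 1 once the target fraction p_T is reached. All functional forms and coefficients are assumptions:

```python
def decayed(value0, episode, n0):
    """Hyperbolic decay schedule (an assumed form of the paper's rule (9)):
    value_n = value0 * n0 / (n0 + n)."""
    return value0 * n0 / (n0 + episode)

def shaped_reward(i_prev, i_next, i_target, gamma_rs=1.05, k=1.0, p_t=0.95):
    """Shaped reward in the spirit of the paper's (10): the bonus r_bar is 1
    once the target fraction p_t is reached, and the continuous term is
    positive exactly when gamma_rs * i_next > i_prev.  Coefficients assumed."""
    r_bar = 1.0 if i_next >= p_t * i_target else 0.0
    return r_bar + k * (gamma_rs * i_next - i_prev)
```

The continuous term is what delivers a learning signal at every step, not only at episode end, which is the property the discussion in Section 5.3 credits for the fast initial learning.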

FEL
The experiment carried out on the FEL system consists of a training of 300 episodes and a test of 50 episodes. The chosen target value I_T is kept constant throughout the whole training and test. At the beginning of each episode, a random initialization is applied. Each episode ends when the same conditions defined in Section 5.1.1 occur. The ε and α values decay according to (9), and the reward is shaped in the same way as in the EOS case. The parameter values are reported in Table 2. The results are reported in Figures 5 and 6, for training and test respectively. It can be observed that the overall behaviors, in training and test, resemble those in Figures 3 and 4.

Recovery of Optimal Working Point
In particle accelerator facilities, the working conditions are constantly subject to fluctuations. Indeed, thermal drifts or wavelength variations requested by users are common, and they result in a displacement of the optimal working point. Therefore, a controller must be able to quickly and properly adapt its policy to such drifts. For this purpose, we adopt the NPG REINFORCE algorithm (Section 4.2), which is able to work with a continuous action space and, thus, allows for precise fine-tuning. Here, we employ the learning itself as an adaptive mechanism to face machine drifts; thus, in this case, a test phase would be meaningless, since adaptation occurs during learning only.
For both optical systems, the EOS and the FEL, the state is a four-dimensional vector of the voltage values applied to each piezo-motor (two values for the first mirror and two values for the second mirror), and the action is composed of four references, one for each piezo-motor actuator, on which the new state depends.
The agent consists of four independent parametrized policies, one for each element of the action vector (u^(i)_k, i ∈ {1, 2, 3, 4}), which are shaped according to the von Mises distribution (such a distribution is a convenient choice when the state and action spaces are bounded, since it is null outside a bounded region):

π(u^(i)|θ_i) = exp( ψ_i cos(u^(i) − μ_i) ) / ( 2π I_0(ψ_i) ),

where ψ_i = e^{φ_i} is a concentration measure, μ_i is the mean, I_0(ψ_i) is the modified Bessel function of the first kind [35], and θ_i = [μ_i, φ_i] is the i-th policy parameter vector, updated at each step of the procedure. At each training step k, when the system is in state x_k, the agent performs an action u_k according to the current policy, thus leading the system to a new state x_{k+1}. Then, the intensity I_D(x_{k+1}) is detected and the reward is computed according to

r(x_k, u_k) = I_D(x_{k+1}) − I_T,   (11)

where I_T is the target intensity. In the EOS system, in order to emulate drifts of the target condition, I_T is initialized by averaging values collected at the beginning of the training procedure and then updated, each time I_D(x_{k+1}) turns out to be greater than I_T, according to the rule (12). In the FEL, instead, we initialize the system in a manually found optimal setting (including both the state and I_T) and impose some disturbances manually. The possibility to update the target intensity is still enabled, though.
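A sketch of one of the four independent von Mises policies follows: sampling an action and computing the score ∇_θ log π needed by the REINFORCE update. The series-based Bessel helper and all parameter values are illustrative assumptions; NumPy's `Generator.vonmises` is used for sampling:

```python
import numpy as np
from math import factorial

def bessel_i(n, z, terms=30):
    """Modified Bessel function of the first kind, I_n(z), via its power
    series (adequate for the moderate concentrations used here)."""
    return sum((z / 2.0) ** (2 * m + n) / (factorial(m) * factorial(m + n))
               for m in range(terms))

def vm_sample(mu, phi, rng):
    """Draw an action from the von Mises policy with mean mu and
    concentration psi = exp(phi)."""
    return rng.vonmises(mu, np.exp(phi))

def vm_score(u, mu, phi):
    """Score d/dtheta log pi(u | mu, phi) for theta = [mu, phi], given
    log pi = psi cos(u - mu) - log(2 pi I_0(psi)), psi = exp(phi):
      d/dmu  = psi sin(u - mu)
      d/dphi = psi (cos(u - mu) - I_1(psi) / I_0(psi))   (chain rule, psi' = psi)
    """
    psi = np.exp(phi)
    ratio = bessel_i(1, psi) / bessel_i(0, psi)
    return np.array([psi * np.sin(u - mu), psi * (np.cos(u - mu) - ratio)])
```

Parametrizing the concentration as ψ = e^φ keeps it positive under unconstrained gradient steps on φ; as φ grows, the distribution sharpens toward its mean, which is the "close to a Dirac delta" stopping condition used below.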

EOS
The NPG REINFORCE experiment performed on the EOS system consists of a single training phase, at the beginning of which the EOS system is randomly initialized, as is I_T. The learning rate α of (7) is kept constant and equal to 0.1 (an empirical setting). Only when the detected intensity I_D(x_{k+1}) assumes a value greater than I_T is the latter updated, according to (12), and the algorithm continues with the new target to be reached. The procedure is stopped when the θ vectors have brought each von Mises distribution close enough to a Dirac delta function, after no target update has been performed for a predefined time. Figure 7 shows the detected intensity I_D(x_{k+1}) (blue line), its moving average (green line) and the target intensity I_T (red dashed line) during the experiment. In Figure 8, the reward (blue line) is reported along with its moving average (green line) and the target I_T (red dashed line). By comparing the two figures, it can be seen that once the target stops changing, the reward approaches zero and the variance of the detected intensity shrinks, evidence that the optimal working point is close.

FEL
Also in this case, the experiment consists of a single training phase, at the beginning of which, however, the system is set on an optimal working point, manually found by experts. During the experiment, some misalignments are forced by manually changing the coarse-motor positions. The learning rate of (7) is kept constant and equal to an empirically set value (α = 0.5). Figures 9 and 10 report the detected intensity and the reward, together with the target, during the execution of the NPG REINFORCE algorithm on the FEL. It can be seen that, contrary to the EOS experiment, the target intensity is not significantly updated; indeed, in this case the system is initialized on an optimal working point. Two drift events took place, the first around time step 120 and the second around time step 210. Both plots (detected intensity and reward) clearly show the capability to recover an optimal working point.

Discussion
The results obtained on the EOS and FEL systems during the experiments presented above deserve some further comments, which are provided here. The Q-learning algorithm has been applied to face the problem of finding an optimal working point, starting from a random initialization. The results are reported in Section 5.1. The enlarged portions reported in Figures 3 and 5 show that a few episodes are sufficient to drastically reduce the number of steps required to reach the goal. In other words, the exploration carried out during the first episodes provides valuable information for the estimation of the Q-function and, as a consequence, of an appropriate policy. We believe that the main reason is the effectiveness of the reward shaping (10), which allows for obtaining a reward at each time step, as opposed to a sparse reward occurring only at the end of the episodes. Such a shaping seems reasonable for the problem at hand, and is based on the assumption that the observed intensity change between two subsequent steps is significant for guiding the learning. On the other hand, during the test phase, we have observed that some unsuccessful trials occur. Although some further investigation is needed, this might be due to either (i) the occurrence of unexpected drifts of the target during the test or (ii) the discrete set of actions employed, consisting of fixed steps that can prevent reaching the goal from some random initial conditions.
The NPG REINFORCE algorithm has been applied to restore the optimal working point in case of drifts. The results are reported in Section 5.2. In particular, Figures 9 and 10 show the response to manual perturbations of the FEL operating conditions, starting from an optimal working point. It is possible to observe how the algorithm quickly reacts to disturbances of the environment settings (marked by negative reward spikes), by learning a policy able to recover the optimal pointing of the laser.

Conclusions
Two tasks of particle accelerator optimal tuning have been addressed in this paper, namely (i) the attainment of the optimal working point and (ii) its recovery after a machine drift. Accordingly, two appropriate RL techniques have been employed: the episodic Q-learning with linear function approximation, to reach the optimal working point starting from a random initialization, and the non-episodic NPG REINFORCE, to recover the performance after machine drifts or disturbances. Both approaches have been applied to the service laser alignment in the electro-optical sampling station before being successfully implemented on the FERMI free-electron laser at Elettra Sincrotrone Trieste.
Based on the promising results, further investigation on free-electron laser optimization via automatic procedures will be carried out. Among some other approaches that could be investigated, we mention the normalized advantage functions [36], a continuous variant of the Q-learning, and the iterative linear quadratic regulator [37].

Conflicts of Interest:
The authors declare no conflict of interest.
