1. Introduction
In a seeded Free-Electron Laser (FEL) [1,2,3,4], the FEL process originates from the overlap of a ∼1 $\mathrm{ps}$-long bunch of relativistic electrons with a ∼100 $\mathrm{fs}$ pulse of photons from an optical laser, which takes place inside a static magnetic field generated by specific devices called undulators. Both the longitudinal (temporal) and the transverse superposition are crucial for attaining the FEL process, and therefore both must be controlled precisely. The former is adjusted by means of a single mechanical delay line placed in the laser path, while the latter has several degrees of freedom, as it involves the trajectories of the electron and laser beams inside the undulators. At the Free Electron laser Radiation for Multidisciplinary Investigations (FERMI), a shot-to-shot feedback system based on position sensors [5] has been implemented to keep the electron trajectory stable, while the trajectory of the laser, being subject to thermal drifts, has to be continuously readjusted, or restored whenever the laser transverse profile changes because the FEL operators modify the laser wavelength.
During standard operations, the horizontal and vertical transverse position and angle (pointing) of the laser beam inside the undulators are kept optimal by an automatic process exploiting the correlation of the FEL intensity with the natural noise of the trajectory [6]. Whenever the natural noise is not sufficient to determine in which direction to move the pointing of the laser, artificial noise can be injected. This method improves the convergence of the optimization, but the injected noise can affect the quality of the FEL radiation. Model-free optimization techniques of this kind (e.g., Gradient Ascent and Extremum Seeking [7,8]) are widely used in FEL facilities, but have some intrinsic disadvantages:
- the need to evaluate the gradient of the objective function, which can be difficult to estimate when the starting point is far from the optimum;
- the difficulty of determining the hyperparameters, whose appropriate values depend on the environment and the noise of the system;
- the lack of “memory” to exploit past experience.
Modern algorithms like Reinforcement Learning (RL), which belongs to the category of Machine Learning (ML), are able to automatically discover the hidden relationship between input variables and the objective function without human supervision. Although they usually require large amounts of data and long learning times, they are becoming popular in the particle accelerator community thanks to their capability to work with no prior knowledge of the system.
In order to optimize the FEL’s performance, different approaches have been adopted in recent years [9]. In 2011, OCELOT [10], a multiphysics simulation toolkit designed for the study of FELs and synchrotron light sources, was developed at the European XFEL GmbH. In addition to some common generic optimization algorithms (Extremum Seeking, Nelder-Mead), the framework implements Bayesian optimization based on Gaussian processes. This tool is routinely employed in the tuning of quadrupole currents at the Stanford Linear Accelerator Center (SLAC) [11,12] and in the optimization of the self-amplified spontaneous emission (SASE) power for the Free electron LASer in Hamburg (FLASH) at the Deutsches Elektronen-SYnchrotron (DESY) [13]. A different approach is described in Reference [14], where the authors advocate the use of artificial neural networks to model and control particle accelerators; they also mention applications based on the combination of neural networks and RL methods. Finally, recent works [15,16,17,18] have presented RL methods used in the context of particle accelerators. In References [15] and [16], both performed through simulations, the FEL model and the policy are defined by neural networks. In Reference [17] the authors present an application of RL on a real system; the study concerns a beam alignment problem faced with a deep Q-learning approach in which the state is defined as the beam position.
The present paper is an extended version of Reference [18], in which Q-learning with linear function approximation was used to perform the alignment of the seed laser. Here, we use an additional well-known RL technique, namely the Natural Policy Gradient (NPG) version of the REINFORCE algorithm [19] (NPG REINFORCE). It allows us to operate on a continuous space of actions that adapts itself to an underlying model changing over time. In fact, while in Reference [18] the goal was to control the overlap of the electron and laser beams starting from random initial conditions, in this paper we also deal with the problem of machine drifts, for which we use NPG REINFORCE. The target of our study is the FERMI FEL (Section 3), one of the few 4th-generation light source facilities available in the world. Due to its intensive use, its availability for testing the algorithms is very limited. Therefore, some preliminary experiments have been conducted on a different system, namely the Electro-Optical Sampling station (EOS) (Section 2). Despite the differences between the two systems, they lead to similar RL problem formulations. Both techniques have finally been implemented on the FERMI FEL.
The rest of the article is organized as follows: Section 2 and Section 3 introduce the physical systems of the EOS and the FEL; basic information on our implementation of the RL algorithms is provided in Section 4, while the experimental configuration and the achieved results are described in Section 5; finally, conclusions are drawn in Section 6.
2. EOS Alignment System
The considered optical system is part of the EOS station, located upstream of the second line of the free-electron laser. The EOS is a non-destructive diagnostic device designed to perform online single-shot longitudinal profile and arrival time measurements of the electron bunches using a UV laser [20,21,22]. Since the aim of the present work is to control a part of the laser trajectory, we will not explain the EOS process in detail, but rather focus on the parts of the device relevant for our purpose.
The device, shown in simplified form in Figure 1, is a standard optical alignment system composed of two planar tip-tilt mirrors [23] (TTs), each driven by two pairs of motors. Coarse positioning is obtained via two coarse-motors, while two piezo-motors are employed for fine-tuning. In the optimization process, only the piezo-motors are considered. Two charge-coupled devices (CCDs) detect the position of the laser beam at two different places along the path (the CCDs do not intercept the laser beam, thanks to the use of semi-reflecting mirrors).
The ultimate goal is to steer and keep the laser spot inside a predefined region of interest (ROI) of each CCD. To achieve this result, a proper voltage has to be applied to each piezomotor. The product of the two light intensities detected by the CCDs in the ROIs can be used as an evaluation criterion for the correct positioning of the laser beam. In particular, it can be interpreted as an AND logic operator, that is “true” when the laser is inside both of the ROIs.
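To make the AND-like evaluation criterion concrete, the following sketch computes the product of the intensities detected in the two ROIs. The image sizes, ROI coordinates, and spot values are purely illustrative, not the actual CCD configuration of the EOS.

```python
import numpy as np

def roi_intensity(image, roi):
    """Sum of pixel intensities inside a rectangular ROI (r0, r1, c0, c1)."""
    r0, r1, c0, c1 = roi
    return float(image[r0:r1, c0:c1].sum())

def alignment_score(ccd1, ccd2, roi1, roi2):
    """Product of the two ROI intensities: it acts like an AND operator,
    since it is non-negligible only when the laser spot lies in BOTH ROIs."""
    return roi_intensity(ccd1, roi1) * roi_intensity(ccd2, roi2)

# Toy frames: a bright spot inside the ROI on CCD 1 but outside it on CCD 2.
ccd1 = np.zeros((100, 100)); ccd1[20:25, 20:25] = 1.0
ccd2 = np.zeros((100, 100)); ccd2[70:75, 70:75] = 1.0
roi = (10, 40, 10, 40)
score = alignment_score(ccd1, ccd2, roi, roi)   # spot missing on CCD 2 -> 0.0
```

A nonzero score therefore indicates that the laser hits both regions of interest at once, which is exactly the condition the optimization seeks.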
3. FEL Alignment System
In a seeded FEL, an initial seed signal, provided by a conventional high-peak-power pulsed laser, is temporally synchronized to overlap the electron bunches inside a first undulator section called the modulator. In the transverse alignment process, two Yttrium Aluminum Garnet (YAG) screens equipped with CCDs are properly inserted and extracted in order to measure the electron beam transverse position before and after the modulator [24]. After the electron beam is inhibited, the seed laser position is measured using the same YAG screens, and the correct positions of two tip-tilt mirrors are found manually by moving the coarse-motors in order to overlap the electron beam. This destructive procedure (a screen has to be inserted) is repeated several times and, at the end, the screens are removed to switch on the FEL. The simplified scheme of the alignment setup is shown in Figure 2. After the above-described coarse tuning, a further optimization is carried out by moving the tip-tilt mirrors to maximize the FEL intensity measured by the ${I}_{0}$ monitor. The working principle of this monitor is the atomic photoionization of a rare gas at low particle density, in the range of $10^{11}/\mathrm{cm}^{3}$ ($p\approx 10^{-5}$ mbar). The FEL photon beam, traveling through a rare-gas-filled chamber, generates ions and electrons, which are extracted and collected separately. From the resulting currents it is possible to derive the absolute number of photons per pulse, shot by shot.
4. Reinforcement Learning
In RL, data collected through experience are employed to select future inputs of a dynamical system [25,26]. An environment is a discrete dynamical system whose model can be defined by:
$${x}_{k+1} = f({x}_{k},{u}_{k}), \qquad (1)$$
in which ${x}_{k}\in \mathcal{X}$ and ${u}_{k}\in \mathcal{U}$ are, respectively, the environment state and the external control input (the action) at the $k$th instant, while $f$ is the state-transition function. A controller, or agent, learns a suitable state-action map, also known as a policy ($\pi({u}_{k} \mid {x}_{k})$), by interacting with the environment through a trial-and-error process. For each action ${u}_{k}\in \mathcal{U}$ chosen in state ${x}_{k}\in \mathcal{X}$, the environment provides a reward $r({x}_{k},{u}_{k})$. The aim of the learning process is to find an optimal policy ${\pi}^{*}$ with respect to the maximization of an objective function $J$, which is a design choice.
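The agent-environment interaction described above can be sketched as a minimal simulation loop. The toy 1-D environment, the naive policy, and the reward below are illustrative assumptions, not the systems studied in this paper.

```python
class Environment:
    """Toy 1-D environment: x_{k+1} = f(x_k, u_k) = x_k + u_k."""
    def __init__(self, x0=5.0):
        self.x = x0

    def step(self, u):
        self.x = self.x + u            # state-transition function f
        return self.x, -abs(self.x)    # next state and reward r(x_k, u_k)

def policy(x):
    """A naive stationary policy pi(u | x): always step toward x = 0."""
    return -1.0 if x > 0 else 1.0

env = Environment()
rewards = []
for k in range(10):                    # trial-and-error interaction loop
    u = policy(env.x)
    x_next, r = env.step(u)
    rewards.append(r)
```

Here $J$ could be, for instance, the (discounted) sum of the collected rewards; the learning algorithms in the next subsections differ in how they improve the policy from such data.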
4.1. Q-Learning
Among the different approaches to the RL problem, approximate dynamic programming aims at solving the problem
$$\underset{\pi}{\max}\; J = \mathbb{E}\Big[\sum_{k=0}^{\infty}{\gamma}^{k}\, r({x}_{k},{u}_{k})\Big],$$
in which $\gamma \in [0,1[$ is the discount factor, by iteratively estimating an action-value function (or Q-function) from data. Here, $J$ takes the form of an expected discounted cumulative reward. Assuming that there exists a stationary optimal policy (stationarity is a consequence of the infinite time horizon, i.e., $N\to \infty$, and implies that the optimal action for a given state $x$ at time $k$ depends only on $x$ and not on $k$), the Q-function is defined as the optimal value of the expected discounted reward when action $u$ is selected in state $x$. Therefore, given the action-value function $Q(x,u)$, the optimal policy is
$${\pi}^{*}(x) = \underset{u\in \mathcal{U}}{\arg\max}\; Q(x,u).$$
In other words, estimating the Q-function amounts to solving the learning problem. An attractive and well-known method for estimating the Q-function is the Q-learning algorithm [27].
In the present work, we employ Q-learning in an episodic framework (meaning that the learning is split into episodes that end when some terminal conditions are met). The choice of Q-learning among other RL approaches is due to its simplicity and to the fact that the problem admits a non-sparse reward, which is beneficial for speeding up the learning [28]. During learning, the exploration of the state-action space can be achieved by employing a so-called $\epsilon$-greedy policy:
$$\pi(x) = \begin{cases} \underset{u\in \mathcal{U}}{\arg\max}\; Q(x,u) & \text{with probability } 1-\epsilon,\\ \text{a random } u\in \mathcal{U} & \text{with probability } \epsilon,\end{cases} \qquad (2)$$
in which $\epsilon$ defines the probability of a random choice (exploration). The Q-learning update rule is:
$$Q({x}_{k},{u}_{k}) \leftarrow Q({x}_{k},{u}_{k}) + \alpha\,\delta, \qquad (3)$$
where $\alpha$ is the learning rate and $\delta$ is the temporal difference error, that is, the difference between the reward plus the discounted optimal $Q$ in the state ${x}_{k+1}$ and the value $Q({x}_{k},{u}_{k})$ (see Algorithm 1 for more details). Defining the state set as $\mathcal{X}\subset {\mathbb{R}}^{n}$ (where $n$ is the dimension of the state vector), since the actions are finite ($u\in \mathcal{U}=\{{u}^{(1)},\dots ,{u}^{(N)}\}$), the action-value function can be represented as a collection of maps $Q(x,{u}^{(1)}),\dots ,Q(x,{u}^{(N)})$ from $\mathcal{X}\times \mathcal{U}$ to $\mathbb{R}$. In order to work with a continuous state space, we employ a linear function approximation version of the Q-learning algorithm. More precisely, we parametrize each $Q(x,{u}^{(j)})$ as $Q(x,{u}^{(j)})={\theta}_{j}^{T}\phi(x)$, where $\phi(x)$ is a vector of features and ${\theta}_{j}$ a weight vector associated with the $j$th input ${u}^{(j)}$. Thus, the whole $Q(x,u)$ is specified by the vector of parameters $\theta ={[{\theta}_{1}^{T},\dots ,{\theta}_{N}^{T}]}^{T}$, and the corresponding policy will be denoted by ${\pi}_{\theta}$. In particular, we employ Gaussian Radial Basis Functions (RBFs) as features; given a set of centers $\{{c}_{i}\in \mathcal{X},\; i=1,\dots ,d\}$, we set $\phi(x)={[{\phi}_{1}(x),\dots ,{\phi}_{d}(x)]}^{T}$, in which ${\phi}_{i}(x):{\mathbb{R}}^{n}\to \mathbb{R}$ is:
$${\phi}_{i}(x) = \exp\left(-\frac{\|x-{c}_{i}\|^{2}}{2{\sigma}_{i}^{2}}\right), \qquad (4)$$
where ${\sigma}_{i}$ determines the decay rate of the RBF. The pseudo-code of Q-learning with linear function approximation is reported in Algorithm 1.
Algorithm 1 Q-learning with linear function approximation [29]
  Initialize $\theta$ and set $\alpha$, $\gamma$
  For each episode:
    Set $k=0$, initialize ${x}_{0}$
    Until ${x}_{k+1}$ is terminal:
      Choose ${u}^{(j)}\in \mathcal{U}$ using ${\pi}_{\theta}$
      Perform ${u}_{k}={u}^{(j)}$
      Observe ${x}_{k+1}$ and $r({x}_{k},{u}_{k})$
      $i\leftarrow \underset{l}{\arg\max}\;{\theta}_{l}^{T}\phi({x}_{k+1})$
      $\delta \leftarrow r({x}_{k},{u}_{k})+\gamma\,{\theta}_{i}^{T}\phi({x}_{k+1})-{\theta}_{j}^{T}\phi({x}_{k})$
      ${\theta}_{j}\leftarrow {\theta}_{j}+\alpha\,\delta\,\phi({x}_{k})$
      ${x}_{k}\leftarrow {x}_{k+1}$
      $k\leftarrow k+1$

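A minimal, self-contained sketch of Algorithm 1 is given below on a toy 1-D task. The environment (drive the state to zero), the action steps, the RBF centers, and the terminal condition are illustrative assumptions, not the EOS or FEL setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian RBF features phi(x) over a 1-D state space.
centers = np.linspace(-1.0, 1.0, 9)   # centers c_i
sigma = 0.25                          # decay rate sigma_i (shared here)

def phi(x):
    return np.exp(-((x - centers) ** 2) / (2.0 * sigma ** 2))

actions = [-0.1, 0.1]                 # finite action set U (fixed-step inputs)
N, d = len(actions), len(centers)
theta = np.zeros((N, d))              # one weight vector theta_j per action
alpha, gamma, eps = 0.1, 0.9, 0.1

def q(x, j):                          # Q(x, u^(j)) = theta_j^T phi(x)
    return float(theta[j] @ phi(x))

def eps_greedy(x):
    if rng.random() < eps:            # explore with probability epsilon
        return int(rng.integers(N))
    return int(np.argmax([q(x, j) for j in range(N)]))

# Toy task: drive the state to 0; an episode ends near the "target".
for episode in range(200):
    x = float(rng.uniform(-1.0, 1.0))
    for k in range(50):
        j = eps_greedy(x)
        x_next = float(np.clip(x + actions[j], -1.0, 1.0))
        r = -abs(x_next)
        terminal = abs(x_next) < 0.05
        if terminal:
            delta = r - q(x, j)       # no bootstrap from a terminal state
        else:
            i = int(np.argmax([q(x_next, l) for l in range(N)]))
            delta = r + gamma * q(x_next, i) - q(x, j)
        theta[j] += alpha * delta * phi(x)   # theta_j update of Algorithm 1
        x = x_next
        if terminal:
            break
```

After training, acting greedily with respect to the learned $Q$ steers the toy state toward the target, mirroring how the real agent steers the piezo-motor voltages.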
4.2. Policy Gradient REINFORCE
An alternative approach consists of directly learning a policy, without relying on value functions. In this regard, given a policy $\pi(u \mid x,\theta)$ parametrized by a vector $\theta$, Policy Gradient (PG) methods [26] aim at finding an optimal ${\theta}^{*}$ which solves
$$\underset{\theta}{\max}\; J(\theta) = \mathbb{E}\big[R(\xi)\big], \qquad (5)$$
in which $\xi =({x}_{0},{u}_{0},{x}_{1},{u}_{1},\cdots ,{x}_{T-1},{u}_{T-1},{x}_{T})$ is a state-input trajectory obtained by following a particular $\pi$, and $R(\xi)={\sum}_{k=0}^{T-1}r({x}_{k},{u}_{k})$, $({x}_{k},{u}_{k})\in \xi$, is the corresponding cumulative reward. The trajectory $\xi$ can be thought of as a random variable with probability distribution $P(\xi \mid \theta)$. We employ the REINFORCE algorithm [19], which aims at finding the optimal ${\theta}^{*}$ for $P(\xi \mid \theta)$, solution of the optimization problem (5), by updating $\theta$ along the gradient of the objective function. More precisely, a stochastic gradient ascent is performed:
$$\theta \leftarrow \theta + \alpha\,{\nabla}_{\theta}J(\theta), \qquad (6)$$
where $\alpha \in [0,1]$ is the learning rate. Since $P(\xi \mid \theta)=p({x}_{0}){\prod}_{k=0}^{T-1}\pi({u}_{k} \mid {x}_{k},\theta)\,p({x}_{k+1} \mid {x}_{k},{u}_{k})$, where $p({x}_{k+1} \mid {x}_{k},{u}_{k})$ is the transition probability from ${x}_{k}$ to ${x}_{k+1}$ when the action ${u}_{k}$ is applied, the update rule (6) becomes:
$$\theta \leftarrow \theta + \alpha \sum_{k=0}^{T-1} r({x}_{k},{u}_{k})\,{\nabla}_{\theta}\log \pi({u}_{k} \mid {x}_{k},\theta). \qquad (7)$$
Such an update is performed every time a path $\xi$ is collected. In order to reduce the variance of the gradient estimates, typical of PG approaches [30], we employ an NPG version [31] of the REINFORCE algorithm, in which a linear transformation of the gradient is adopted by using the inverse Fisher information matrix ${F}^{-1}(\theta)$. The pseudo-code of REINFORCE is reported in Algorithm 2.
Algorithm 2 REINFORCE
  Set $\theta =0$ and set $\alpha$
  While True:
    Obtain $\xi =({x}_{0},{u}_{0},{x}_{1},{u}_{1},\cdots ,{x}_{T-1},{u}_{T-1},{x}_{T})$ applying $\pi(u \mid x,\theta)$
    Observe $r({x}_{k},{u}_{k})$ for each $({x}_{k},{u}_{k})\in \xi$
    $\theta \leftarrow \theta +\alpha\,{F}^{-1}(\theta)\sum_{k=0}^{T-1}r({x}_{k},{u}_{k})\,{\nabla}_{\theta}\log \pi({u}_{k} \mid {x}_{k},\theta)$

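The natural-gradient step of Algorithm 2 can be sketched on a toy one-step problem as follows. For brevity this sketch uses a 1-D Gaussian policy (the experiments in this paper use a Von Mises policy), estimates the Fisher matrix from the sampled score vectors, and averages the gradient over the trajectory; all of these simplifications are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

# Parametrized policy pi(u | theta): a Gaussian with theta = [mu, log_sigma].
theta = np.array([2.0, 0.0])          # start far from the optimum u = 0

def sample_action():
    mu, sigma = theta[0], np.exp(theta[1])
    return rng.normal(mu, sigma)

def grad_log_pi(u):
    mu, sigma = theta[0], np.exp(theta[1])
    z = (u - mu) / sigma
    return np.array([z / sigma, z ** 2 - 1.0])   # d log pi / d[mu, log_sigma]

alpha, T = 0.05, 20
for it in range(300):
    us = [sample_action() for _ in range(T)]     # one-step "trajectory"
    rs = [-(u ** 2) for u in us]                 # reward: optimum at u = 0
    gs = [grad_log_pi(u) for u in us]
    # Fisher information estimated from the score vectors (plus a small ridge).
    F = sum(np.outer(g, g) for g in gs) / T + 1e-3 * np.eye(2)
    g = sum(r * gl for r, gl in zip(rs, gs)) / T # averaged REINFORCE gradient
    theta = theta + alpha * np.linalg.solve(F, g)  # natural-gradient ascent
```

The preconditioning by $F^{-1}(\theta)$ rescales the step in parameter space, which is what makes the update robust to the choice of policy parametrization.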
5. Implementation and Results
In the following we apply the two RL techniques described above to address two problems:
- the attainment of an optimal working point, starting from random initial conditions;
- the recovery of the optimal working point when drifts, or changes in the working conditions, occur.
The former is addressed with Q-learning, the latter with the NPG REINFORCE algorithm; both were chosen for their simplicity. In particular, for the problem of target recovery, the policy gradient algorithm REINFORCE has been chosen because it employs a continuous action space, which allows fine adjustments to compensate for small drifts. Both algorithms have been tested on the EOS system before being deployed on the FEL system. In the present section we describe the experimental protocols and report the results, which will be discussed in Section 5.3.
5.1. Optimal Working Point Attainment Problem
The problem of defining a policy able to lead the plant to an optimal working point, starting from random initial conditions, requires splitting the experiments into two phases: (i) a training phase, which allows the controller to learn a proper policy, and (ii) a test phase, to validate the ability of the learned policy to behave properly, possibly in conditions not experienced during training.
In both optical systems, the EOS and the FEL, the state $x$ is a 4-dimensional vector that provides the current voltage values applied to each piezo-motor (two values for the first mirror and two values for the second mirror). We neglect the dynamics of the piezo-motors, their transients being much shorter than the time between shots. The input $u$ is also a 4-dimensional vector; denoting the component index as a superscript, the update rule is:
$${x}_{k+1}^{(i)} = {x}_{k}^{(i)} + {u}_{k}^{(i)}, \quad i=1,\dots ,4,$$
that is, the input is the incremental variation of the state itself. The action space is discrete, thus the modulus of each $i$th component of $u$ is set equal to a fixed value. Moreover, the state $x$ can only assume values that satisfy the physical constraints of the piezo-motors [23]:
$${x}_{\min}^{(i)} \le {x}_{k}^{(i)} \le {x}_{\max}^{(i)}, \quad i=1,\dots ,4, \qquad (8)$$
hence we allow, for each state $x$ of both systems, only those inputs $u$ for which the component-wise inequality (8) is not violated. In the following, when referring to the intensity of the EOS, we will actually refer to the product of the two intensities detected in the ROIs when the laser hits both ROIs; by FEL intensity, we will mean the intensity measured by the ${I}_{0}$ monitor. Finally, for both systems, we will denote the target intensity (computed as explained below) as ${I}_{T}$.
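The constrained incremental update can be sketched as follows; the voltage limits and the step size are illustrative assumptions, not the actual piezo-motor specifications.

```python
import numpy as np

V_MIN, V_MAX = 0.0, 100.0   # piezo-motor voltage limits (illustrative values)
STEP = 1.0                  # fixed modulus of each action component

def apply_action(x, u):
    """Incremental update x_{k+1} = x_k + u_k; an input that would violate
    the component-wise constraint (8) is not allowed, so the state is kept."""
    x_next = x + u
    if np.any(x_next < V_MIN) or np.any(x_next > V_MAX):
        return x
    return x_next

x = np.array([50.0, 99.5, 10.0, 0.2])      # voltages of the four piezo-motors
u = np.array([STEP, STEP, -STEP, -STEP])   # one fixed-step increment each
```

Excluding the violating inputs from the admissible action set, rather than clipping the state, keeps the discrete action semantics of the Q-learning formulation intact.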
5.1.1. EOS
The training of the EOS alignment system consists of 300 episodes. The number of episodes was chosen after preliminary experiments on an EOS device simulator. However, based on the results obtained on the real device, the number of episodes could actually be reduced (see Figure 3). At the beginning of the training, the ROIs are selected, and therefore also the target value ${I}_{T}$; they remain the same for all the training episodes. At each time step $k$, the input provided by the agent is applied, and the new intensity ${I}_{D}({x}_{k+1})$ is compared with the target (${I}_{T}$). The episode ends in two cases:
- when the detected intensity in the new state ${I}_{D}({x}_{k+1})$ is greater than or equal to a certain percentage ${p}_{T}$ of the target (${p}_{T}{I}_{T}$);
- when the maximum number of allowed time steps is reached.
When the first condition occurs, the goal is achieved. During the training procedure, the values of $\epsilon$ (exploration) of (2) and $\alpha$ (learning rate) of (3) decay according to rules (9) [32,33], where the ${N}_{0}$ value is set empirically. In addition, the reward is shaped according to (10) [28], where $\overline{r}$ is equal to 1 if the target is reached and 0 otherwise; the values of ${\gamma}_{rs}>0$ and $k>0$ are set empirically. The specific design of (10) rewards the agent for state-action pairs that lead to a sufficiently increased detected intensity, ${\gamma}_{rs}{I}_{D}({x}_{k+1})>{I}_{D}({x}_{k})$ (so that $r({x}_{k},{u}_{k})>0$), and penalizes it otherwise ($r({x}_{k},{u}_{k})<0$). At the end of each episode a new one begins from a new, randomly selected initial state, until the maximum number of episodes is reached. Then, a test (with random initial states) is carried out for the same target conditions as the training, but with a fixed $\epsilon =0.05$, as in Reference [34], and $\alpha =0$. We repeat the training and test 10 times (i.e., we perform 10 different runs) and report in the following the results in terms of the average duration of each episode. The parameter values employed during the experiments are reported in Table 1; they result from offline experiments on a simulator of the EOS system.
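The decay schedules and the shaped reward can be sketched as follows. Since the exact expressions of (9) and (10) are not reproduced here, the hyperbolic decay, the additive form of the shaped reward, and all constant values below are assumptions of this sketch, chosen only to be consistent with the properties stated in the text.

```python
N0 = 50.0        # empirical constant of the decay rules (value assumed)
K_RS = 1.0       # scaling constant k > 0 of the shaped reward (assumed)
GAMMA_RS = 1.05  # gamma_rs > 0 (assumed)

def decayed(value0, episode):
    """A common hyperbolic decay, consistent with a rule governed by an
    empirical constant N0: the exact form of (9) is an assumption here."""
    return value0 * N0 / (N0 + episode)

def shaped_reward(i_prev, i_next, target_reached):
    """Positive iff gamma_rs * I_D(x_{k+1}) > I_D(x_k), plus a terminal
    bonus r_bar = 1 when the target is reached; the form of (10) is assumed."""
    r_bar = 1.0 if target_reached else 0.0
    return r_bar + K_RS * (GAMMA_RS * i_next - i_prev)
```

The key property is that the agent receives a non-sparse signal at every step, proportional to the observed intensity change, rather than only a terminal reward.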
The average number of time steps per episode for the whole training phase is reported in Figure 3. The steep decrease of the average number of time steps shows that a few episodes are sufficient to reach a performance close to the one obtained after a whole training phase. Indeed, thanks to the reward shaping (10), the Q-function is updated at each step of each episode instead of just at the end of the episode (see Section 5.3 for further details). The average number of time steps per episode during the test phase is visible in Figure 4 and is consistent with the training results.
5.1.2. FEL
The experiment carried out on the FEL system consists of a training phase of 300 episodes and a test phase of 50 episodes. The chosen target value ${I}_{T}$ is kept constant throughout the whole training and test. At the beginning of each episode, a random initialization is applied. Each episode ends when the conditions defined in Section 5.1.1 occur. The $\epsilon$ and $\alpha$ values decay according to (9), and the reward is shaped in the same way as in the EOS case. The parameter values are reported in Table 2.
The results are reported in Figure 5 and Figure 6, for training and test respectively. It can be observed that the overall behaviors, in training and test, resemble those in Figure 3 and Figure 4.
5.2. Recovery of Optimal Working Point
In particle accelerator facilities, the working conditions are constantly subject to fluctuations. Indeed, thermal drifts or wavelength variations requested by users are common and result in a displacement of the optimal working point. Therefore, a controller must be able to quickly and properly adapt its policy to such drifts. For this purpose, we adopt the NPG REINFORCE algorithm (Section 4.2), which is able to work with a continuous action space and thus allows for precise fine-tuning. Here, we employ learning as an adaptive mechanism to face the machine drifts. Thus, in this case, a test phase would be meaningless, since adaptation occurs during learning only.
For both optical systems, the EOS and the FEL, the state is a four-dimensional vector of the voltage values applied to each piezo-motor (two values for the first mirror and two values for the second mirror), and the action is composed of four references, one for each piezo-motor actuator, on which the new state depends.
The agent consists of four independent parametrized policies, one for each element of the action vector (${u}_{k}^{(i)},\; i\in \{1,2,3,4\}$), which are shaped according to the Von Mises distribution (a convenient choice when the state and action spaces are bounded, since it is null outside a bounded region):
$$\pi\big({u}^{(i)} \mid {\theta}_{i}\big) = \frac{e^{{\psi}_{i}\cos\left({u}^{(i)}-{\mu}_{i}\right)}}{2\pi\,{\mathcal{I}}_{0}({\psi}_{i})},$$
where ${\psi}_{i}={e}^{{\varphi}_{i}}$ is a concentration measure, ${\mu}_{i}$ is the mean, ${\mathcal{I}}_{0}({\psi}_{i})$ is the modified Bessel function of the first kind [35], and ${\theta}_{i}=[{\mu}_{i},{\varphi}_{i}]$ is the $i$th policy parameter vector, updated at each step of the procedure.
At each training step $k$, when the system is in state ${x}_{k}$, the agent performs an action ${u}_{k}$ according to the current policy, thus leading the system to a new state ${x}_{k+1}$. Then, the intensity ${I}_{D}({x}_{k+1})$ is detected and the reward is computed according to (11), where ${I}_{T}$ is the target intensity. In the EOS system, in order to emulate drifts of the target condition, ${I}_{T}$ is initialized by averaging values collected at the beginning of the training procedure and then updated, each time ${I}_{D}({x}_{k+1})$ turns out to be greater than ${I}_{T}$, according to (12).
In the FEL, instead, we initialize the system in a manually found optimal setting (including both the state and ${I}_{T}$) and impose some disturbances manually. The possibility to update the target intensity remains enabled, though.
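Since the exact expressions of (11) and (12) are not reproduced here, the following sketch only illustrates their stated behavior; both functional forms below, including the moving-average constant, are assumptions chosen to match the described properties (reward near zero at the target, negative below it; target raised only when exceeded).

```python
def drift_reward(i_detected, i_target):
    """A plausible form of reward (11): zero at the target, increasingly
    negative as the detected intensity falls below it (form assumed)."""
    return -abs(i_target - i_detected) / i_target

def update_target(i_detected, i_target, beta=0.9):
    """Target update in the spirit of (12), applied only when I_D > I_T;
    modelled here as a moving average toward the higher intensity (assumed)."""
    if i_detected > i_target:
        return beta * i_target + (1.0 - beta) * i_detected
    return i_target
```

Under this scheme the target can only ratchet upward, so the agent keeps chasing the best intensity observed so far while drifts show up as negative reward excursions.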
5.2.1. EOS
The NPG REINFORCE experiment performed on the EOS system consists of a single training phase, at the beginning of which the EOS system is randomly initialized, as is ${I}_{T}$. The learning rate $\alpha$ of (7) is kept constant and equal to $0.1$ (an empirical setting). Only when the detected intensity ${I}_{D}({x}_{k+1})$ assumes a value greater than ${I}_{T}$ is the latter updated according to (12), and the algorithm continues with the new target to be reached. The procedure is stopped when the $\theta$ vectors bring each Von Mises distribution close enough to a Dirac delta function, after no target update has been performed for a predefined time.
Figure 7 shows the detected intensity ${I}_{D}({x}_{k+1})$ (blue line), its moving average (green line), and the target intensity ${I}_{T}$ (red dashed line) during the experiment. In Figure 8, the reward (blue line) is reported along with its moving average (green line) and the target ${I}_{T}$ (red dashed line). By comparing the two figures, it can be seen that once the target stops changing, the reward approaches zero and the variance of the detected intensity shrinks, evidence that the optimal working point is close.
5.2.2. FEL
Also in this case, the experiment consists of a single training phase, at the beginning of which, however, the system is set at an optimal working point, manually found by experts. During the experiment, some misalignments are forced by manually changing the coarse-motor positions. The learning rate of (7) is kept constant and equal to an empirically set value ($\alpha =0.5$).
Figure 9 and Figure 10 report the detected intensity and the reward, together with the target, during the execution of the NPG REINFORCE algorithm on the FEL. It can be seen that, contrary to the EOS experiment, the target intensity is not significantly updated; indeed, in this case the system is initialized at an optimal working point. Two drift events took place, the first around time step 120 and the second around time step 210. Both plots (detected intensity and reward) clearly show the capability to recover an optimal working point.
5.3. Discussion
The results obtained on the EOS and FEL systems during the experiments presented above deserve some further comments, provided here. The Q-learning algorithm has been applied to face the problem of finding an optimal working point, starting from a random initialization. The results are reported in Section 5.1. The enlarged portions of Figure 3 and Figure 5 show that a few episodes are sufficient to drastically reduce the number of steps required to reach the goal. In other words, the exploration carried out during the first episodes provides valuable information for the estimation of the Q-function and, as a consequence, of an appropriate policy. We believe that the main reason is the effectiveness of the reward shaping (10), which allows a reward to be obtained at each time step, as opposed to a sparse reward occurring only at the end of the episodes. Such a shaping seems reasonable for the problem at hand, and is based on the assumption that the observed intensity change between two subsequent steps is significant for guiding the learning. On the other hand, during the test phase, we observed that some unsuccessful trials occur. Although further investigation is needed, this might be due either to (i) the occurrence of unexpected drifts of the target during the test or to (ii) the discrete set of actions employed, consisting of fixed steps that can prevent reaching the goal from certain random initial conditions.
The NPG REINFORCE algorithm has been applied to restore the optimal working point in case of drifts. The results are reported in Section 5.2. In particular, Figure 9 and Figure 10 show the response to manual perturbations of the FEL operating conditions, initially set at an optimal working point. It is possible to observe how the algorithm quickly reacts to disturbances of the environment settings (marked by negative reward spikes), by learning a policy able to recover the optimal pointing of the laser.
6. Conclusions
Two tasks of particle accelerator optimal tuning have been addressed in this paper, namely (i) the attainment of the optimal working point and (ii) its recovery after a machine drift. Accordingly, two appropriate RL techniques have been employed: episodic Q-learning with linear function approximation, to reach the optimal working point starting from a random initialization, and non-episodic NPG REINFORCE, to recover the performance after machine drifts or disturbances. Both approaches have been applied to the service laser alignment in the electro-optical sampling station before being successfully implemented on the FERMI free-electron laser at Elettra Sincrotrone Trieste.
Based on the promising results, further investigation of free-electron laser optimization via automatic procedures will be carried out. Among the other approaches that could be investigated, we mention normalized advantage functions [36], a continuous variant of Q-learning, and the iterative linear quadratic regulator [37].