1. Introduction
It is desirable to shorten the observation time needed by a terrestrial telescope to reach a given signal-to-noise ratio. In a diffraction-limited scenario, this time is inversely proportional to the fourth power of the aperture diameter. Hence, there is strong motivation for constructing larger telescopes. The successful construction of telescopes of 8 m and larger has been possible, to a large extent, thanks to the introduction of segmented mirrors. Building telescopes with monolithic mirrors of the same size would have been impractical for financial and physical reasons. However, segmenting the reflective surface also introduces new complications. One drawback is the large increase in the number of parts and in the complexity of the system. Another is that piston errors introduce phase shifts between segments. Adjusting the degrees of freedom to mitigate these errors is problematic since piston errors do not produce a slope in the wavefront.
Several methods have been developed in recent decades to tackle the problem of piston misalignment. The most commonly used ones are based on Shack–Hartmann wavefront sensors [1,2]. These methods rely on intensity images measured at the pupil plane. They have proven to be reliable and precise. However, they require each segment edge to be aligned on a lenslet grid, and this process can be very time-consuming. Another family of methods uses curvature sensors. These methods measure intensities at intermediate planes between pupil and focus, and they are used to crosscheck the measurements obtained by the main methods. They are robust and require little extra hardware, but their capture range is limited and they are strongly constrained by atmospheric conditions [3,4].
Other recently proposed methods employ convolutional neural networks. This machine learning paradigm has found a large number of applications in many different areas. Some of the methods proposed so far are only suitable for extended objects [5] or have not been proven to be robust under atmospheric turbulence [6].
To the best of our knowledge, all machine learning applications to piston sensing so far rely on the supervised paradigm [7,8,9]. In this setting, both input data and targets have to be supplied. This requires the algorithm to be trained on simulated image data together with the exact piston values. Given enough training data and enough model capacity, the correspondence between the two sets is eventually found. Nevertheless, the probability distributions of the synthetic data and the real-world data must agree with each other for the model to generalize well in a real environment.
The technique presented here takes a reinforcement learning (RL) approach. This means that the learning process is driven by experience in an environment rather than by training on a previously labeled dataset. It is therefore suitable for scenarios where ground-truth-labeled data are scarce or difficult to obtain. In the optical phasing problem, real telescope diffraction images might be available; however, they lack the exact piston values that gave rise to those images. The RL algorithm learns in place with data provided by the telescope mirror in real time. Furthermore, it relies on an external physical quantity rather than on labels. The method employs a convolutional neural network that takes as input an intensity image measured at an intermediate plane at four different wavelengths. The network outputs a probability distribution over the actions that the piston actuators can take to reach an optimum Strehl ratio at the intersection. The agent then executes on the environment an action sampled from that distribution. Additionally, an image of the PSF of the portion of the wavefront at the given intersection is needed, but only during training. Once the network has been trained, the method gives fast piston correction that can be applied at any time during the observation. In supervised learning approaches, by contrast, the PSF images are not needed; synthetic diffraction images are used instead.
The method has also been tested under atmospheric turbulence. The diffraction image was filtered with the long exposure optical transfer function. A Fried parameter of   was considered in the simulations.
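As an illustration of this kind of filtering, the following minimal sketch multiplies the Fourier transform of an image by Fried's long-exposure atmospheric OTF, exp[-3.44 (λf/r0)^(5/3)]. The function names, the pixel scale, the wavelength and the Fried parameter used here are hypothetical values chosen only for the example; they are not the ones used in the simulations.

```python
import numpy as np

ARCSEC = np.pi / (180.0 * 3600.0)  # one arcsecond in radians

def long_exposure_otf(n, pixel_scale_rad, wavelength, r0):
    """Fried long-exposure OTF sampled on an n x n FFT frequency grid."""
    f = np.fft.fftfreq(n, d=pixel_scale_rad)   # angular frequency, cycles/rad
    fx, fy = np.meshgrid(f, f)
    rho = wavelength * np.hypot(fx, fy)        # equivalent pupil separation, m
    return np.exp(-3.44 * (rho / r0) ** (5.0 / 3.0))

def apply_long_exposure(image, pixel_scale_rad, wavelength, r0):
    """Filter a square image with the long-exposure OTF."""
    otf = long_exposure_otf(image.shape[0], pixel_scale_rad, wavelength, r0)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * otf))

# Hypothetical values, for illustration only
n = 64
wavelength = 700e-9          # m
r0 = 0.2                     # m, assumed Fried parameter
pixel_scale = 0.05 * ARCSEC  # rad per pixel

image = np.zeros((n, n))
image[n // 2, n // 2] = 1.0  # point-like test image
blurred = apply_long_exposure(image, pixel_scale, wavelength, r0)
```

Because the OTF equals one at zero frequency, the filter preserves the total flux while lowering and broadening the peak.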
Large-scale metrology approaches can be used jointly with this RL fine-tuning approach. The former allow the position and orientation of each mirror segment to be characterized by means of photogrammetry or laser tracker technology [10,11,12,13].
The paper is organized as follows. First, the physical details of the problem and its mathematical considerations are briefly introduced. Then, the optical setup and the procedure used to generate the simulations are described, followed by an introduction to the policy gradients method and the architecture of the network. Finally, conclusions and final remarks are given at the end of the paper.
3. Results and Methodology
Reinforcement learning is a subfield of machine learning in which an algorithm gets feedback from the environment in the form of reward or punishment. Under this approach, the optimization problem becomes how to design an agent that acts optimally to obtain the highest long-term expected reward from the environment. In the specific domain of optical phasing, the question is how to move piston actuators A and B in Figure 2, given a diffraction image of an intersection of three adjacent segments, in order to maximize the Strehl ratio of the wavefront at the intersection.
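In a reinforcement learning setup, it is helpful to model the problem as a Markov Decision Process (MDP), for which the following elements have to be defined: the set of states, the set of actions, the transition probability, the reward signal and the discount factor.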
The states the agent can be in are the diffraction images of the intersection of the segments taken at four different wavelengths. The imaged region is shown with a dashed line in Figure 2. It is a   pixels image around the center of the intersection. It is assumed that tip-tilt values have been corrected for each segment in a previous stage, hence only piston values remain.
The actions, U, are the set of all possible piston movements that can be commanded to segments A and B. These are pairs of values with units of length. They are limited to a distance equivalent to  , expressed in terms of the reference wavelength. Since only piston steps in the range   are considered, those action values should be enough to correct the piston misalignment completely. After movements have been applied to the pistons of both segments A and B, the final piston error between them might lie outside the   limit. Using four wavelengths helps to distinguish states outside the ambiguity range.
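The role of the additional wavelengths can be illustrated with a small sketch. Two wavefront steps that differ by exactly one reference wavelength give the same wrapped phase at that wavelength, yet different phases at the others. The wavelength values and the function name below are hypothetical, and `opd` denotes the optical path step in the wavefront (roughly twice the mechanical piston for a reflective surface at normal incidence).

```python
import numpy as np

def wrapped_phase(opd, wavelength):
    """Phase step in radians, wrapped to the interval [-pi, pi)."""
    phase = 2.0 * np.pi * opd / wavelength
    return (phase + np.pi) % (2.0 * np.pi) - np.pi

# Hypothetical wavelengths in meters; the first one acts as the reference
wavelengths = np.array([700e-9, 750e-9, 800e-9, 850e-9])
reference = wavelengths[0]

opd_a = 100e-9              # a small wavefront step
opd_b = opd_a + reference   # same step plus one full reference wavelength

phase_a = wrapped_phase(opd_a, wavelengths)
phase_b = wrapped_phase(opd_b, wavelengths)

# Compare phasors rather than angles to avoid wrap-around artifacts
same_at_reference = np.isclose(np.exp(1j * phase_a[0]), np.exp(1j * phase_b[0]))
differ_elsewhere = ~np.isclose(np.exp(1j * phase_a[1:]), np.exp(1j * phase_b[1:]))
```

With a single wavelength the two steps would be indistinguishable; the remaining wavelengths break the tie.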
The element P is the transition probability. It is the probability of ending at a final state conditioned on both an initial state and a certain action. It encodes the stochastic rules that govern the physics of the environment, i.e., the probability distribution over the states that can be reached from the initial state when a certain action is taken. It corresponds to the dynamics of the system and is learned implicitly by the algorithm through experience.
The reward signal, R, is obtained when taking an action at a given state. The sum of the maximum intensity values of the PSF images for each wavelength at the intersection is used as the reward. This value is proportional to the Strehl ratio. In order to produce the intersection PSF, a circular mask is first placed at the center of the junction of the three segments. That mask isolates a circular portion of the wavefront. The diameter of the mask is equivalent to   at the pupil. The reward is deterministic in the simulations. However, it can be considered stochastic in a general RL setup, as long as the expected long-term reward defines the agent's final goal. The algorithm aims to maximize the expected long-term value of these reward outcomes.
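A minimal sketch of such a reward is given below, under simplifying assumptions that are not taken from the paper: the three segments are modeled as 120-degree sectors meeting at the center of the mask, the mask is ten pixels across, the wavelengths are hypothetical, and the result is normalized so that a perfectly phased intersection gives 1.

```python
import numpy as np

def intersection_reward(opd_steps, wavelengths, n=10, pad=8):
    """Normalized sum of PSF peaks for a circularly masked 3-segment junction.

    opd_steps: wavefront steps (m) of the three segments.
    """
    c = (n - 1) / 2.0
    y, x = np.mgrid[0:n, 0:n] - c                 # pixel coordinates about the center
    mask = np.hypot(x, y) <= n / 2.0              # circular mask
    angle = np.arctan2(y, x)                      # in [-pi, pi]
    sector = np.floor((angle + np.pi) / (2.0 * np.pi / 3.0)).astype(int) % 3
    opd = np.asarray(opd_steps, dtype=float)[sector]
    size = n * pad                                # zero padding for finer PSF sampling
    total = 0.0
    for lam in wavelengths:
        field = mask * np.exp(2j * np.pi * opd / lam)
        psf = np.abs(np.fft.fft2(field, s=(size, size))) ** 2
        total += psf.max() / mask.sum() ** 2      # 1.0 when all segments are in phase
    return total / len(wavelengths)

wavelengths = [700e-9, 750e-9, 800e-9, 850e-9]    # hypothetical values
phased = intersection_reward([0.0, 0.0, 0.0], wavelengths)
misaligned = intersection_reward([0.0, 150e-9, -100e-9], wavelengths)
```

The phased case reaches the maximum because all masked pixels add coherently; any piston mismatch lowers the PSF peak and therefore the reward.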
Lastly, the discount factor is used in a sequential task to indicate how valuable it is to achieve the rewards as soon as possible. In the one-step MDP case, this hyperparameter is set to zero, which means that the agent only cares about the immediate reward. A one-step MDP is shown diagrammatically in Figure 1.
Figure 3 represents the phase of the wavefront centered at the intersection after the circular mask has been applied. Note that the diameter of the wavefront is sampled by only ten pixels in the simulation. On the right-hand side of the same figure, the PSF image of that part of the wavefront is shown. In a physical setup, this can be achieved by placing a microlens array centered at each segment junction. The PSF is obtained for each of the four wavelengths. The mask would only be needed during training.
The goal is to find the best possible policy such that the final expected reward is maximized. The policy gives a probability distribution over the actions that the piston actuators can take from a given state. It can be represented with a three-layer convolutional neural network whose weights are the set of parameters to be tuned. Each convolutional layer has 16 filters with weights and a ReLU activation function. The sizes of the filters are  ,  ,   at each layer respectively. The depth of each filter matches the depth of the previous activation. A trainable bias parameter for each filter is also included in the network. A fully connected layer is placed at the end to compute the final scores. The output of the network defines the mean of a bivariate normal probability distribution over actions. This mean is the action that is most likely to achieve the highest long-term reward from the current state, according to the agent's experience. An action sampled from that distribution comprises two length components to be commanded to pistons A and B. Sampling the action from the normal distribution, rather than selecting the mean predicted by the network, allows the agent to explore nearby actions that might turn out to be better than the prediction itself. Since the absolute piston positions are unknown, the actions represent the relative piston movement from the initial to the final state.
The quantity to be optimized is called the utility. It is the expected reward, i.e., the reward weighted by the probability of the state and the action under a particular policy. The gradient of the utility with respect to the policy parameters can be approximated with an empirical estimate over m samples [17], where m is the number of trials used in the estimation of the gradient.
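As a hedged sketch, the utility and its empirical gradient are commonly written in the likelihood-ratio form below, which matches the definitions given in the text. The symbols $J(\theta)$ for the utility, $s$ for the state, $u$ for the action and $\theta$ for the policy parameters are notation assumed here and may differ from the authors' own.

```latex
J(\theta) = \sum_{s,u} P(s,u;\theta)\, R(s,u)

\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m}
    \nabla_\theta \log P\big(s^{(i)},u^{(i)};\theta\big)\, R\big(s^{(i)},u^{(i)}\big)
```

In a one-step MDP the distribution of initial states does not depend on $\theta$, so $\nabla_\theta \log P(s,u;\theta)$ reduces to $\nabla_\theta \log \pi_\theta(u \mid s)$, the gradient of the log-policy.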
A Gaussian model is used to describe a stochastic policy over the continuous action space. The mean of the Gaussian is the action that the agent considers most likely to give the highest long-term expected reward from the current state. The variance of the Gaussian quantifies the uncertainty about that prediction. Since the stochastic policy is Gaussian, the gradient of the log-probability with respect to the mean has a simple closed form: the difference between the sampled action and the mean, divided by the variance.
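Written out, and assuming an isotropic covariance $\sigma^2 I$ for the bivariate Gaussian (an assumption made here for simplicity), the log-probability and its gradient take the following form. The symbols $\mu_\theta(s)$ for the network output, $u$ for the sampled action and $\sigma$ for the fixed standard deviation are assumed notation.

```latex
\log \pi_\theta(u \mid s) = -\frac{\lVert u - \mu_\theta(s) \rVert^2}{2\sigma^2} + \text{const}

\nabla_\theta \log \pi_\theta(u \mid s) =
    \frac{1}{\sigma^2}
    \left( \frac{\partial \mu_\theta(s)}{\partial \theta} \right)^{\!\top}
    \big( u - \mu_\theta(s) \big)
```

The Jacobian $\partial \mu_\theta(s) / \partial \theta$ is the part obtained through backpropagation.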
The CNN returns two scalar values for each diffraction image that it takes as input. The mean of the bivariate distribution is precisely the output of the convolutional network. The gradient can be computed with respect to the network parameters through backpropagation in the usual way. The variance of the distribution is fixed to a small value, although it could also be parameterized and learned from experience. The action to be taken by the agent is sampled from that distribution. Selecting actions randomly allows the agent to explore potentially better actions while exploiting the current policy. With the expression of the gradient at hand, the optimum value of the parameters can be found by gradient ascent. The Adam algorithm, a variant of gradient ascent, was used in the simulations [18].
In addition, a second convolutional network is used as a function approximator to represent the value function [19]. It takes as input the state, i.e., the diffraction image of the intersection, and returns the expected reward from that state under the current policy. The set of parameters that best approximates the value function is learned in a supervised manner from experience.
The value function can be used as a state-dependent baseline to reduce the variance of the algorithm [20]. Using the advantage estimator, i.e., the reward minus the baseline, rather than simply the reward in Equation (2) makes the learning process more stable. The baseline decreases the variance without changing the expectation of the gradient.
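The effect of the baseline can be checked numerically on a toy problem. The sketch below uses a one-dimensional Gaussian policy and a hypothetical quadratic reward (not the PSF reward of the paper), with a constant baseline standing in for the value function. Both estimators should agree on average, while the one using the advantage has a smaller variance.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5        # toy policy mean and fixed standard deviation
n_samples = 200_000

def reward(u):
    # Hypothetical reward with its optimum at u = 2
    return 10.0 - (u - 2.0) ** 2

u = rng.normal(mu, sigma, n_samples)      # actions sampled from the policy
score = (u - mu) / sigma ** 2             # d/dmu of log N(u; mu, sigma^2)
r = reward(u)

# Analytic expected reward under the policy, used as a constant baseline
baseline = 10.0 - ((mu - 2.0) ** 2 + sigma ** 2)

grad_plain = score * r                    # reward-weighted estimator
grad_adv = score * (r - baseline)         # advantage-weighted estimator

# The exact gradient of the expected reward is -2 * (mu - 2) = 4 at mu = 0
true_grad = -2.0 * (mu - 2.0)
print(grad_plain.mean(), grad_adv.mean(), true_grad)
print(grad_plain.var(), grad_adv.var())
```

Subtracting the baseline leaves the mean untouched because the score function has zero expectation under the policy.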
The capture range defines the interval of possible piston jump values that the agent is trained to detect and act upon. The capture ranges considered here are suitable for fine-tuning the piston positions after a previous coarse piston alignment stage has been carried out. Each intersection involves two piston jumps, one to segment A and one to segment B. Every combination of piston step values generates a distinctive diffraction pattern. The broader the capture range, the more patterns the agent is required to recognize in order to perform the proper action.
Algorithm 1 shows the complete learning sequence. The initial state is the diffraction image of an intersection with piston jumps from the bottom left segment to segments A and B. They can take any random value within the capture range; see Figure 2 for clarity. The policy network takes the diffraction image as input and predicts the mean of a Gaussian over the continuous action space. Then, an action is sampled from that distribution in step 4. Next, in step 5, the state and reward are recorded once the action has been performed on the environment. The advantage is computed in step 6; it quantifies how good or bad that reward is with respect to the average reward achieved from that state. The expected reward from a state following the policy is given by the value function. In order to be self-consistent, the value of the initial state must be close on average to the immediate reward plus the value of the final state. The squared distance between the two quantities is a loss function to be minimized. The value function parameters are updated to minimize this loss function in step 7. Finally, in step 8, the policy network parameters are updated in the direction of the log-policy gradient by an amount given by the advantage.
Figure 4 shows the training process while running Algorithm 1. The vertical axis represents the sum of the PSF maximum intensity values for each wavelength after taking the action predicted by the policy. This magnitude has been normalized by the value attained with a perfectly phased intersection. The learning process plateaus at around  . This is also the average maximum reward the agent achieves from any initial state. There is an upper bound on this quantity that is imposed by the small fixed variance set in the design of the Gaussian policy. Two learning processes are shown in Figure 4. The continuous line represents the agent's learning process over a capture range of  . The dashed line shows the learning process for piston steps within the range  .
      
Algorithm 1 Policy gradients with value function baseline
1: Initialize policy and baseline parameters
2: while   do
3:     Initialize state with random piston values
4:     Sample next action from current policy
5:     Apply piston displacement and get reward from PSF
6:     Compute advantage between value function and immediate reward
7:     Update value function parameters
8:     Update policy with gradient
9: end while
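The steps of Algorithm 1 can be sketched on a toy problem as follows. This is not the authors' implementation: the state is a pair of piston values instead of a diffraction image, the policy and the value function are linear models instead of convolutional networks, the reward is a hypothetical quadratic stand-in for the normalized PSF peak, plain gradient ascent replaces Adam, and all hyperparameters are assumed. In this one-step setting, the advantage is taken as the reward minus the value of the initial state.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.2                      # fixed policy standard deviation (assumed)
lr_policy, lr_value = 0.02, 0.1  # learning rates (assumed)

theta = np.zeros((2, 2))         # step 1: linear policy, mean action = theta @ state
w = np.zeros(4)                  # step 1: linear value function weights

def features(s):
    # Quadratic features for the value function baseline
    return np.array([1.0, s[0] ** 2, s[1] ** 2, s[0] * s[1]])

def toy_reward(s, a):
    # Stand-in for the normalized PSF peak: 1 when the residual piston is zero
    return 1.0 - np.sum((s + a) ** 2)

def mean_residual(theta, n=1000):
    # Mean squared residual piston after applying the mean action
    s = np.random.default_rng(1).uniform(-1.0, 1.0, (n, 2))
    return np.mean(np.sum((s + s @ theta.T) ** 2, axis=1))

before = mean_residual(theta)
for step in range(5000):                       # step 2: training loop
    s = rng.uniform(-1.0, 1.0, 2)              # step 3: random piston values
    mu = theta @ s                             # policy mean for this state
    a = mu + sigma * rng.standard_normal(2)    # step 4: sample action
    r = toy_reward(s, a)                       # step 5: act and get reward
    phi = features(s)
    advantage = r - w @ phi                    # step 6: reward minus baseline
    w += lr_value * advantage * phi            # step 7: value function update
    # step 8: move along the log-policy gradient, scaled by the advantage
    theta += lr_policy * advantage * np.outer(a - mu, s) / sigma ** 2
after = mean_residual(theta)
print(before, after)
```

In this toy setting the ideal policy is `theta = -I`, which cancels the piston values exactly, so the residual after training should be well below its initial value.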
It is interesting to visualize how accurately the network corrects the piston misalignment during training. It is important to point out that the RL agent does not predict the piston mean squared error; rather, it learns how to minimize it through the optimization of the PSF. In a real scenario, it would not be possible to know the ground-truth piston misalignment in the initial state that gave rise to the diffraction image. However, it is known in a simulated environment. Figure 5 shows the evolution of the mean squared error of the predictions over training steps, measured in units of  . A straight line is plotted at the threshold  . Below this value, the intersection is known to have a Strehl ratio greater than   [21]. Eventually, the agent aligns the intersection with an accuracy of   on both capture ranges,   and  .
The two graphs shown in Figure 4 and Figure 5 are related. In general, a higher reward means a better estimation of the piston misalignment. The graphs have been smoothed with a moving average over the last five steps.
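A moving average over the last five steps can be computed as in the following minimal sketch. The function name is hypothetical, and the handling of the first few points, where fewer than five samples are available, is an assumption.

```python
import numpy as np

def trailing_moving_average(values, window=5):
    """Average each point with up to window - 1 preceding points."""
    values = np.asarray(values, dtype=float)
    csum = np.cumsum(np.insert(values, 0, 0.0))   # prefix sums
    out = np.empty_like(values)
    for i in range(len(values)):
        start = max(0, i - window + 1)            # shorter window at the start
        out[i] = (csum[i + 1] - csum[start]) / (i + 1 - start)
    return out

smoothed = trailing_moving_average([1, 2, 3, 4, 5, 6], window=5)
# smoothed is [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
```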
Every step requires a real piston movement in the telescope, so the number of steps the agent needs to learn the task is an important aspect to consider. In that sense, the agent takes fewer steps to learn the task for the capture range  . A long-exposure image is the result of a large number of atmospheric perturbation realizations. It is necessary to use an exposure time much larger than  , depending on wind velocity, in order to capture the time-averaging effect of the atmosphere. An exposure time of   per image sets a lower bound of   on the time needed to reach   in capture range  . Likewise, a lower bound of   is set for the agent to reach the same rms value in capture range  . The training, though, can be carried out in parallel at several intersections simultaneously. There are 10 of them to train on in a 36-segment mirror telescope. This reduces the previous training times to   h and   h, respectively.
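The arithmetic behind such lower bounds can be sketched as follows, assuming that each training step costs one exposure and that training on several intersections in parallel divides the wall-clock time evenly. The step count and exposure time below are hypothetical placeholders, not the values of the paper.

```python
def training_hours(n_steps, exposure_s, n_parallel=1):
    """Lower bound on the wall-clock training time, in hours."""
    return n_steps * exposure_s / n_parallel / 3600.0

# Hypothetical numbers, for illustration only
serial = training_hours(n_steps=36_000, exposure_s=1.0)
parallel = training_hours(n_steps=36_000, exposure_s=1.0, n_parallel=10)
print(serial, parallel)
```

The bound ignores actuator settling time and readout overheads, which would add to the total in practice.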