Towards Piston Fine Tuning of Segmented Mirrors through Reinforcement Learning

Featured Application: Piston alignment of segmented optical mirror telescopes through an algorithm that learns by itself how to maximize a physical quantity of the system.

Abstract: Unlike supervised machine learning methods, reinforcement learning allows an entity to learn how to perform a task from experience rather than from labeled data. This approach is used in this paper to correct piston misalignment between segments in a segmented mirror telescope. Simulations show that the algorithm converges to a point where it learns how to move the piston actuators in order to maximize the Strehl ratio of the wavefront at the intersection.


Introduction
It is desirable to shorten the observation time a ground-based telescope needs to reach a given signal-to-noise ratio. In the diffraction-limited regime, this time is inversely proportional to the fourth power of the aperture diameter, so there is strong motivation for building larger telescopes. The successful construction of telescopes of 8 m and larger has been possible to a large extent thanks to the introduction of segmented mirrors; building monolithic mirrors of the same size would have been impractical for financial and physical reasons. However, segmenting the reflective surface also introduces new complications, among them a large increase in the number of parts and in system complexity. Piston errors introduce phase shifts between segments, and adjusting the corresponding degrees of freedom is difficult because these errors do not produce a slope in the wavefront.
Several methods have been developed in recent decades to tackle the problem of piston misalignment. The most widely used are based on Shack–Hartmann wavefront sensors [1,2]. These methods rely on intensity images measured at the pupil plane and have proven reliable and precise. However, they require each segment edge to be aligned on a lenslet grid, a process that can be very time-consuming. Another family of methods uses curvature sensors, which measure intensities at intermediate planes between pupil and focus and are used to cross-check the measurements obtained by the main methods. They are robust and require little extra hardware, but their capture range is limited and they are strongly constrained by atmospheric conditions [3,4].
Other methods proposed recently employ convolutional neural networks, a machine learning paradigm that has nowadays found a great host of applications in many different areas.
Some of the methods proposed so far are only suitable for extended objects [5] or have not been proven to be robust under atmospheric turbulence [6].
To the best of our knowledge, all machine learning applications to piston sensing so far follow the supervised paradigm [7][8][9]. In this setting, both input data and targets must be supplied, which requires the algorithm to be trained on simulated image data together with the exact piston values. Given enough training data and enough model capacity, the correspondence between the two sets is eventually found. Nevertheless, the probability distributions of the synthetic and the real-world data must match each other for the model to generalize well in a real environment.
The technique presented here takes a reinforcement learning approach: the learning process is driven by experience in an environment rather than by training on a previously labeled dataset. It is therefore suitable for scenarios where ground-truth-labeled data are scarce or difficult to obtain. In the optical phasing problem, real telescope diffraction images may be available, but they lack the exact piston values that gave rise to them. The RL algorithm learns in place with data provided by the telescope mirror in real time, and it relies on an external physical quantity rather than on labels. The method employs a convolutional neural network that takes as input an intensity image measured at an intermediate plane at four different wavelengths. The network outputs a probability distribution over the actions that the piston actuators can take to reach an optimum Strehl ratio at the intersection, and the agent executes an action sampled from that distribution on the environment. Additionally, an image of the PSF of the portion of the wavefront at the given intersection is needed, but only during training. Once the network has been trained, the method provides immediate piston correction that can be used at any time during the observation. In supervised learning approaches, the PSF images are not needed; synthetic diffraction images are used instead.
The method has also been tested under atmospheric turbulence. The diffraction image was filtered with the long exposure optical transfer function. A Fried parameter of 0.2 m was considered in the simulations.
Large scale metrology approaches can be used jointly with this RL fine tuning approach. The former allows characterizing the position and orientation of each mirror segment by means of photogrammetry or laser tracker technology [10][11][12][13].
The paper is organized as follows. First, the physical details of the problem and its mathematical considerations are introduced briefly. Then the optical setup and the procedure used to generate the simulations are described, followed by an introduction to the policy gradients method and the architecture of the network. Finally, conclusions and final remarks close the paper.

Background
The electromagnetic field emitted by a distant point source of light such as a star reaches the pupil of the telescope in the form of a plane wave. However, aberrations from either the propagation medium or the telescope itself make the wavefront depart from this ideal view. For simplicity, only the region of the wavefront reflected by three adjacent segments of a three-ring segmented hexagonal mirror is considered. Piston misalignment between the segments introduces discontinuities in the phase of the wavefront. Figure 2 shows an intensity image recorded at the detector plane produced by the wavefront just described.
The intensity of the field is recorded at a distance d = 9 m away from the focus at four different wavelengths. This distance is chosen so that the full peak width of the diffraction pattern is twice the image blur due to atmospheric turbulence, as explained in [14]. The focal length of the telescope is set to f = 170 m.
Four different wavelengths are considered to give the network the ability to distinguish piston errors that surpass the ambiguity range [15]. The largest, λ0 = 700 nm, is taken as the reference wavelength, with three shorter ones for disambiguation: λ1 = 0.930λ0, λ2 = 0.860λ0 and λ3 = 0.790λ0. All piston values throughout the paper are measured at the wavefront. If a single wavelength were used instead, the diffraction patterns would be periodic in the piston step values and the algorithm would not be able to tell which value gave rise to a given image.
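The role of the extra wavelengths can be illustrated with a minimal numpy sketch (the piston value below is an arbitrary illustration, not a value from the paper): at the reference wavelength alone, a piston step p and p + λ0 produce the same wrapped phase, whereas the four-wavelength set yields distinct phase signatures.

```python
import numpy as np

lam0 = 700e-9  # reference wavelength (m)
lams = [lam0, 0.930 * lam0, 0.860 * lam0, 0.790 * lam0]

def step_phases(piston):
    """Wrapped phase of a wavefront piston step at each wavelength."""
    return np.array([(2 * np.pi * piston / lam) % (2 * np.pi) for lam in lams])

p = 150e-9                                             # arbitrary piston step
same = step_phases(p)[0] - step_phases(p + lam0)[0]    # ~0: ambiguous at lam0 alone
diff = step_phases(p) - step_phases(p + lam0)          # nonzero at the shorter wavelengths
```

The shorter wavelengths break the λ0-periodicity, which is what lets the network recognize states outside the single-wavelength ambiguity range.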
The wavefront at the pupil is propagated to the observation plane by means of the Fresnel diffraction integral. The observation plane is located at a distance z from the pupil, which can be related to the focal length f and the defocus distance d through Newton's lens formula. The intensity distribution at the observation plane is the squared magnitude of the complex field after propagation. Eventually, an image like the one in Figure 2 is produced.
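As a rough illustration of this propagation step, the following numpy sketch propagates a toy two-segment pupil with a piston step using the Fresnel transfer function in the Fourier domain. The grid size, sampling, and piston value are illustrative assumptions, not the paper's actual simulation parameters.

```python
import numpy as np

def fresnel_propagate(field, wavelength, z, dx):
    """Propagate a sampled complex field a distance z via the Fresnel transfer function."""
    n = field.shape[0]
    fx = np.fft.fftfreq(n, d=dx)
    fxx, fyy = np.meshgrid(fx, fx)
    h = np.exp(-1j * np.pi * wavelength * z * (fxx**2 + fyy**2))
    return np.fft.ifft2(np.fft.fft2(field) * h)

lam, z, dx, n = 700e-9, 9.0, 1e-3, 128          # illustrative sampling choices
pupil = np.ones((n, n), dtype=complex)
pupil[:, n // 2:] *= np.exp(1j * 2 * np.pi * (lam / 4) / lam)  # piston step of lam/4
intensity = np.abs(fresnel_propagate(pupil, lam, z, dx))**2    # defocused diffraction image
```

Because the transfer function has unit modulus, the propagation conserves the total energy of the field, which is a quick sanity check on the implementation.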
The study also takes atmospheric turbulence effects into account. The simulated intensity image at the defocus plane is filtered with the long-exposure transfer function of the atmosphere [16], with a Fried parameter of r0 = 0.2 m. Table 1 displays a summary of the parameters used in the simulation.

Results and Methodology
Reinforcement learning is a subfield of machine learning in which an algorithm gets feedback from the environment in the form of reward or punishment. According to this approach, the optimization problem becomes how to design an agent that acts optimally to get the highest long-term expected reward from the environment. Or alternatively, in the specific domain of optical phasing, given a diffraction image of an intersection of three adjacent segments, how to move piston actuators A and B in Figure 2 in order to get the maximum Strehl ratio of the wavefront at the intersection.
In a reinforcement learning setup, it is helpful to model the problem as a Markov Decision Process (MDP), for which the tuple of elements {S, U, P, R, γ} has to be defined. The set of states the agent can be in, S, consists of the diffraction images of the intersection of the segments taken at four different wavelengths, shown with a dashed line in Figure 2: a 24 × 24 pixel image around the center of the intersection. It is assumed that tip-tilt has been corrected for each segment in a previous stage, so only piston errors remain.
The actions, U, are the set of all possible piston movements that can be commanded to segments A and B. These are pairs of values with units of length, limited to ±λ0/2, λ0 being the reference wavelength. Since only piston steps in the range ±λ0/2 are considered, these action values should suffice to correct the piston misalignment completely. It might still happen that, after movements have been applied to pistons A and B, the final piston error between them lies outside the λ0 limit; using four wavelengths helps to distinguish states outside the ambiguity range.

Figure 2. Diffraction image of three segments with piston errors between them. Distinctive intensity ripples at the borders between segments are caused by wavefront discontinuities, which are in turn due to piston errors. Movement orders are commanded to segments A and B to minimize the effects of the errors between the three.

The element P is the transition probability: the probability of ending in a final state s_t+1 conditioned on both an initial state s_t and an action u_t. These are the stochastic rules that govern the physics of the environment, i.e., the probability distribution over the states that can be reached from an initial state when a certain action is taken. It corresponds to the dynamics of the system and is implicitly learned by the algorithm through experience.
The reward signal, R, is obtained when taking action u_t in state s_t. The sum over wavelengths of the maximum intensity of the PSF at the intersection is used as the reward; this value is proportional to the Strehl ratio. To produce the intersection PSF, a circular mask is first placed at the center of the junction of the three segments, isolating a circular portion of the wavefront. The diameter of the mask is equivalent to 0.2 m at the pupil. The reward is deterministic in the simulations, although it can be stochastic in a general RL setup, as long as the expected long-term reward defines the agent's final goal. The algorithm aims to maximize the expected long-term value of these rewards.
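The reward computation can be sketched in numpy as follows: a circular mask isolates the wavefront at the junction, the PSF at each wavelength is the squared magnitude of the Fourier transform of the masked field, and the PSF maxima are summed. The grid size and mask radius below are illustrative; in the paper the mask corresponds to 0.2 m at the pupil, sampled by about ten pixels.

```python
import numpy as np

def intersection_reward(opd, radius_px, wavelengths):
    """Sum over wavelengths of the PSF maximum of the masked wavefront.

    opd: optical path difference map (m) around the segment junction.
    """
    n = opd.shape[0]
    y, x = np.mgrid[:n, :n] - n // 2
    mask = (x**2 + y**2) <= radius_px**2          # circular mask at the junction
    reward = 0.0
    for lam in wavelengths:
        field = mask * np.exp(1j * 2 * np.pi * opd / lam)
        psf = np.abs(np.fft.fft2(field))**2       # PSF of the masked wavefront
        reward += psf.max()
    return reward

lam0 = 700e-9
lams = [lam0, 0.930 * lam0, 0.860 * lam0, 0.790 * lam0]
n = 32
flat = np.zeros((n, n))                 # perfectly phased junction
stepped = flat.copy()
stepped[:, n // 2:] = lam0 / 4          # a piston step across the junction
r_flat = intersection_reward(flat, 5, lams)      # phased reference value
r_step = intersection_reward(stepped, 5, lams)   # misaligned: strictly lower reward
```

The phased reference r_flat is the normalization used later for the learning curves: any piston step lowers the PSF maxima, so the normalized reward is at most 1.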
And last, the discount factor γ indicates, in a sequential task, how valuable it is to achieve the rewards as soon as possible. In the one-step MDP case, this hyperparameter is set to zero, meaning that the agent only cares about the immediate reward. A one-step MDP is shown diagrammatically in Figure 1.

Figure 3 represents the phase of the wavefront centered at the intersection after the circular mask has been applied. It is interesting to notice that the diameter of the wavefront is sampled by only ten pixels in the simulation. On the right-hand side of the same figure, the PSF image of that part of the wavefront is shown. In a physical setup, this can be achieved by placing a microlens array centered at each segment junction. The PSF is obtained for each of the four wavelengths. The mask is only needed during training.

The goal is to find the best possible policy π_θ(u|s) such that the final expected reward is maximized. The policy gives a probability distribution over the actions that the piston actuators can take from a given state. It is represented by a three-layer convolutional neural network, where θ is the set of parameters to be tuned. Each convolutional layer has 16 filters with ReLU activation; the filter sizes are 7 × 7, 5 × 5 and 3 × 3, respectively, and the depth of each filter matches the depth of the previous activation. A trainable bias parameter for each filter is also included, and a fully connected layer at the end computes the final scores. The output of the network defines the mean of a bivariate normal distribution over actions. This mean is the action most likely to achieve the highest long-term reward from the current state, according to the agent's experience. An action sampled from that distribution comprises the two length displacements to be commanded to pistons A and B.
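A shape-level sketch of this policy network can be written as plain arithmetic. The paper does not state padding or stride, so the sketch assumes stride-1 "valid" (no-padding) convolutions; under that assumption, the 24 × 24 × 4 input shrinks layer by layer, and the parameter counts follow from the stated filter sizes and depths.

```python
def valid_out(n, k):
    """Spatial size after a stride-1 'valid' convolution with a k x k filter."""
    return n - k + 1

spatial, depth = 24, 4          # 24 x 24 input, one channel per wavelength
params = 0
for k in (7, 5, 3):             # filter sizes of the three conv layers
    params += (k * k * depth + 1) * 16   # 16 filters per layer, each with a bias
    spatial = valid_out(spatial, k)
    depth = 16
flat = spatial * spatial * depth         # flattened activation fed to the FC layer
params += (flat + 1) * 2                 # fully connected layer to the 2 action means
```

Under these assumptions the activations go 24 → 18 → 14 → 12 pixels across, and the whole network stays small (a few tens of thousands of parameters), consistent with a policy that must be evaluated in a single fast forward pass.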
Sampling the action from the normal distribution rather than selecting the mean predicted by the network allows the agent to explore nearby actions that might end up being a better option than the prediction itself. Since the absolute piston positions are unknown, the actions represent the relative piston movement from initial to final state.
The quantity to be optimized is called the utility and can be expressed mathematically as

U(θ) = Σ_t P(s_t, u_t; θ) R(s_t, u_t),    (1)

where P(s_t, u_t; θ) is the probability of the state and the action under a particular policy and R(s_t, u_t) is the reward. The gradient of the utility with respect to the policy parameters, ∇_θ U(θ), can be approximated with an empirical estimate over m samples [17]:

∇_θ U(θ) ≈ (1/m) Σ_{i=1..m} ∇_θ log π_θ(u_i | s_i) R(s_i, u_i),    (2)

where m is the number of trials used in the estimation of the gradient.
A Gaussian model is used to describe a stochastic policy over the continuous action space. The mean of the Gaussian is where, according to the agent, the action most likely to give the highest long-term expected reward from the current state lies; the variance of the Gaussian quantifies the uncertainty about that prediction. Since the policy is Gaussian, the gradient of the log-probability takes the form

∇_θ log π_θ(u|s) = ((u − μ_θ(s)) / σ²) ∇_θ μ_θ(s).

The CNN returns two scalar values for each diffraction image it takes as input: the mean of the bivariate distribution, μ_θ, is precisely the output of the convolutional network, and its gradient with respect to the parameters θ can be computed through backpropagation in the usual way. The variance of the distribution, σ², is fixed to a small value, although it could also be parameterized and learned from experience. The action to be taken by the agent is sampled from that distribution; selecting actions randomly allows the agent to explore new optimal actions while exploiting the current policy. With this expression for the gradient, the optimum value of the parameters can be found by gradient ascent; Adam, a variant of the latter, was used in the simulations [18].
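The score-function update can be demonstrated on a one-dimensional toy bandit (the reward shape, optimal action, and hyperparameters below are invented for illustration): actions are sampled from N(μ, σ²), each is weighted by its baselined reward through the score (u − μ)/σ², and gradient ascent drives μ toward the reward-maximizing action.

```python
import numpy as np

rng = np.random.default_rng(0)
u_star = 0.7                  # hypothetical optimal (unknown to the agent) action
mu, sigma, lr = -0.5, 0.2, 0.05
for _ in range(300):
    u = rng.normal(mu, sigma, size=256)    # sample a batch of actions
    r = -(u - u_star)**2                   # toy reward, maximal at u_star
    score = (u - mu) / sigma**2            # grad of log N(u; mu, sigma^2) w.r.t. mu
    grad = np.mean(score * (r - r.mean())) # REINFORCE estimate, batch-mean baseline
    mu += lr * grad                        # gradient ascent on the expected reward
```

After a few hundred updates μ sits close to the optimum, even though the reward function is never differentiated directly, which is the key property exploited when the reward comes from a physical measurement such as the PSF maximum.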
On the other hand, a second convolutional network is used as a function approximator for the value function [19], V_φ(s_t). It takes the state as input, i.e., the diffraction image of the intersection, and returns the expected reward from that state under the current policy π_θ. The parameters φ that best approximate the value function are learned from experience in a supervised manner.
The value function can be used as a state-dependent baseline to reduce the variance of the algorithm [20]. Using the advantage estimator Â = r_t − V_φ(s_t), rather than simply the reward, in Equation (2) makes the learning process more stable: the baseline decreases the variance without changing the expectation of the gradient.
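That variance reduction can be checked numerically on a toy Gaussian policy (the reward and constants are illustrative): with and without a baseline, the gradient estimates agree on average, but the baselined estimator has far lower variance, especially when the reward carries a large constant offset.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 0.0, 0.2

def reward(u):
    return 5.0 - (u - 0.3)**2   # large offset inflates the unbaselined variance

ests_plain, ests_base = [], []
for _ in range(2000):
    u = rng.normal(mu, sigma, size=32)
    score = (u - mu) / sigma**2             # score function of the Gaussian policy
    r = reward(u)
    ests_plain.append(np.mean(score * r))            # raw REINFORCE estimate
    ests_base.append(np.mean(score * (r - r.mean())))  # baselined estimate
```

The true gradient of the expected reward with respect to μ here is −2(μ − 0.3) = 0.6; both estimators center on it, but the baselined one does so with a variance orders of magnitude smaller, which is exactly why the advantage is used in the update.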
The capture range defines the interval of possible piston jump values that the agent is trained to detect and act upon. Capture ranges considered here are suitable for fine tuning the piston positions after a previous coarse piston alignment stage has been carried out. The intersection has two piston jumps to segments A and B. Every combination of piston step values generates a distinctive diffraction pattern. The broader the capture range, the more patterns the agent is required to recognize to be able to perform the proper action.
Algorithm 1 shows the complete learning sequence. The initial state is the diffraction image of an intersection with piston jumps from the bottom-left segment to segments A and B; these can be any random values within the capture range (see Figure 2 for clarity). The policy network takes the diffraction image as input and predicts the mean of a Gaussian over the continuous action space. An action is then sampled from that distribution in step 4. Next, state and reward are recorded once the action has been performed on the environment in step 5. The advantage is computed in step 6; it quantifies how good or bad the obtained reward is with respect to the average reward achieved in that state. The expected reward from a state s_t under the policy π_θ is given by the value function. To be self-consistent, the value of the initial state, V_φ(s_t), must be close on average to the immediate reward r_t plus the value of the final state, V_φ(s_t+1); the squared distance between the two quantities is a loss function to be minimized. Updating the value function parameters φ to minimize this loss is done in step 7. Finally, in step 8, the policy network parameters θ are updated in the direction of the log-policy gradient, scaled by the advantage.

Algorithm 1: Piston fine tuning through policy gradients.
4: u_t ∼ π_θ(s_t)  ▷ Sample next action from current policy
5: s_t+1, r_t ← u_t  ▷ Apply piston displacement and get reward from PSF
6: Â = r_t − V_φ(s_t)  ▷ Compute advantage between value function and immediate reward
7: Update value function parameters φ
8: θ ← θ + α ∇_θ log π_θ(u_t|s_t) Â  ▷ Update policy with gradient
9: end while

In Figure 4, we can see the training process while running Algorithm 1. The vertical axis represents the sum over wavelengths of the maximum PSF intensity after taking the action predicted by the policy, normalized by the value attained with a perfectly phased intersection. The learning process plateaus at around 0.98.
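Algorithm 1 can be condensed into a toy numpy actor-critic that keeps the same step structure while replacing the CNN policy and diffraction-image states with one-dimensional stand-ins. The scalar "piston error" state, linear policy mean, quadratic value features, and Gaussian Strehl-like reward below are all invented for illustration; only the sequence of steps mirrors the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
w, b = 0.0, 0.0                        # policy mean: mu(p) = w*p + b (ideal: w = -1, b = 0)
v = np.zeros(3)                        # value function V(p) = v0 + v1*p + v2*p^2
sigma, lr_pi, lr_v = 0.2, 0.02, 0.05   # fixed policy variance and learning rates
rewards = []
for _ in range(15000):
    p = rng.uniform(-0.5, 0.5)         # random piston error within the capture range
    mu = w * p + b
    u = rng.normal(mu, sigma)          # step 4: sample action from current policy
    r = np.exp(-(p + u)**2 / 0.1)      # step 5: Strehl-like reward, maximal when u = -p
    feats = np.array([1.0, p, p * p])
    adv = r - v @ feats                # step 6: advantage w.r.t. the value baseline
    v += lr_v * adv * feats            # step 7: regress value toward the observed reward
    score = (u - mu) / sigma**2        # step 8: score-function policy update
    w += lr_pi * score * p * adv
    b += lr_pi * score * adv
    rewards.append(r)
```

The learned policy ends up commanding u ≈ −p, i.e., it cancels the piston error, and the average reward climbs toward its σ-limited plateau, qualitatively matching the learning curves the paper reports.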
This is also the average maximum reward the agent achieves from any initial state. There is an upper bound on this quantity imposed by the small fixed variance chosen in the design of the Gaussian policy. Two learning processes, one per capture range, are shown in Figure 4. It is interesting to visualize how accurately the network rectifies the piston misalignment during training. It is important to point out that the RL agent does not predict the piston error itself; rather, it learns how to minimize it through the optimization of the PSF. In a real scenario, it would not be possible to know the ground-truth piston misalignment in the initial state that gave rise to the diffraction image, yet it is known in a simulated environment. Figure 5 shows the evolution of the mean squared error of the predictions over training steps, measured in units of λ0². A straight line is plotted at the threshold rms = 50 nm; below this value, the intersection is known to have a Strehl ratio greater than 0.8 [21]. Eventually, the agent aligns the intersection with an accuracy of 0.00082λ0² for both capture ranges, ±λ0/2 and ±λ0/4. The two graphs in Figures 4 and 5 are related: a higher reward generally means a better estimation of the piston misalignment. The graphs have been smoothed with a moving average over the last five steps.
Every step requires a real piston movement in the telescope, so the number of steps the agent needs to learn the task is an important aspect to consider. In that sense, the agent takes fewer steps to learn the task for the ±λ0/4 capture range. A long-exposure image is the result of a large number of atmospheric perturbation realizations; it is necessary to use an exposure time much larger than τ_c = 10 ms, depending on wind velocity, in order to capture the time-averaging effect of the atmosphere.

Conclusions and Future Work
In this paper, we have presented a novel approach for training convolutional neural networks to cophase segmented mirrors. Unlike supervised learning approaches, it does not need labeled data; the maximum of the PSF image of the intersection is used instead. The method is able to correct piston step values in the range [−λ0/2, +λ0/2]. The narrower the capture range, the faster the agent learns, which is why the method is most appropriate for piston fine tuning.
This technique requires applying a circular mask centered at the junction of every three segments to obtain the PSF of that part of the wavefront. However, this optical setup aligned with the intersections is only needed during training: once trained, the agent is able to correct piston misalignments in a single forward pass of the network using the diffraction image alone. The accuracy attained in predicting the optical path difference between segments was rms = 20.04 nm for a reference wavelength λ0 = 700 nm. This wavefront accuracy suffices for adaptive optics to be applied [22].
Quantitative analysis on how seeing variations can influence the training of the RL agent will be treated in future work.