A parameter refinement method for Ptychography based on Deep Learning concepts

X-ray Ptychography is an advanced computational microscopy technique which is delivering exceptionally detailed quantitative imaging of biological and nanotechnology specimens. However coarse parametrisation in propagation distance, position errors and partial coherence frequently menaces the experiment viability. In this work we formally introduced these actors, solving the whole reconstruction as an optimisation problem. A modern Deep Learning framework is used to correct autonomously the setup incoherences, thus improving the quality of a ptychography reconstruction. Automatic procedures are indeed crucial to reduce the time for a reliable analysis, which has a significant impact on all the fields that use this kind of microscopy. We implemented our algorithm in our software framework, SciComPty, releasing it as open-source. We tested our system on both synthetic datasets and also on real data acquired at the TwinMic beamline of the Elettra synchrotron facility.


Introduction
In the last decade, computational microscopy methods based on phase retrieval 1 have been extensively used to investigate the microscopic nature of thin, noncrystalline materials. In particular, quantitative information is provided by ptychography 2, 3 , which combines lateral scanning with Coherent Diffraction Imaging (CDI) methods 4,5 . Similar to Computed Tomography (CT) 6,7 , the technique tries to solve an inverse problem, as the object is reconstructed from its effects impinged on the incident beam. The result is a high-detail absorption and quantitative phase image of large specimens 8 . In a transmission setup (e.g., in a synchrotron beamline microscope 9, 10 ), this reconstructed object is a complex-valued 2D transmission function O(x, y), which describes the absorption and scattering behaviour 11 of the sample. made possible as a loss function was derived by explicitly taking into account all the setup parameters, which were added to the optimisation pool for a joint regression/reconstruction. It has to be noted that also the probe positions were refined in this way, without using an expensive Fourier-transform-based approach. Indeed, a deep-learning-inspired strategy was employed, rooted in the spatial transformer network literature 19 . The software has been released as open-source 20 .

Manuscript Organisation
This paper is organised as follows: in Section 2, we provide a brief introduction to the ptychography forward model, defining the major flaws we wanted to correct, as well as a brief description of the autograd technology. In Section 3, our computational methodology is described, introducing the designed loss function and the spatial transform components. In Section 4, we present our main results, while Section 5 concludes the paper.  Iterative-gradient-based optimisation: all the latent quantities are updated by using the information from the loss function computed between simulated and measured quantities. The update estimation is based on the reconstruction error gradient, which is calculated automatically in an automatic differentiation framework.

Background
In ptychography 3 (Figure 3), an extended object is placed onto a sample stage and is illuminated with a conic beam of monochromatic and coherent light. The radiation illuminates a limited region of the specimen (see the dotted circles in Figure 3), and a diffraction pattern is recorded by a detector placed at some distance z sd = z d − z s (see Figure 3). To reconstruct the object, both the set of J recorded diffraction patterns I j with j ∈ [1...J] and the J (x j , y j ) positions are required. Each jth acquisition can be considered independent of the others. By illuminating adjacent areas with a high overlap factor, diversity is introduced to the acquisition, thus creating a robust set of constraints, which greatly improve the convergence of the phase retrieval procedure 3 . Diversity is what helps the optimisation problem regress the setup parameters. A typical ptychography setup used in synchrotron laboratories: a virtual point source illuminates (probe) a well-defined region on the sample, which is mounted on a motorised stage. The scattered field intensity is recorded at a distance z d .

Ptychography Model
The image formation model schematised in Figure 1 can be described more formally by the following expression: where P(x, y; z s ) is the 2D illumination on the sample plane (z = z s ), typically referred to as the "probe"; O(x j , y j ; z s ) is the 2D transmission function of the sample in the local reference system (x j , y j ), centred at the known jth scan position; D z d −z s is an operator that describes the observation of the pattern at a known distance z d − z s . Knowing the illumination on a region of the object is crucial for the factorisation of the exit wave Ψ s (x, y; z s ): Ψ s (x, y; z s ) = P(x, y; z s ) · O(x j , y j ; z d ).
In a modern ptychography reconstruction, P(x, y) is automatically found thanks to the diversity in the dataset 3,21,22 , and the "probe retrieval" procedure has been extended also to the case of a partial coherence 3, 23-26 . Indeed, in order to take into account the independent propagation of M mutually incoherent probes, Equation (1) is modified in the following manner: where the exit wave produced by the modulation of each P p (x, z; z s ) mode is summed in intensity on the observation plane z d .

Parameter Refinement
We can denote as the "setup incoherences" all the deviations of a real setup from the a priori model defined in Equation (1). Even if in the literature, many solutions have been proposed, in most cases, the problems are typically tackled independently. Partial coherence and mixed-state ptychography have been extensively reviewed (e.g., in 3,27,28 ); the same is valid for the position refinement problem [29][30][31][32][33][34][35] , with methods that are quite similar to what is performed in, e.g., super-resolution imaging 36 or CT 37 . To understand the importance of positions for ptychography, Figure 4 illustrates a slightly exaggerated condition: note that the entire object computational box changes format, and this is detrimental, especially from the implementation point of view. Much more scarce is the literature on axial correction: in 13 , an evolutionary algorithm was used to cope with the uncertainty of the source-to-sample distance in electron ptychography. In 38 , the authors proposed using a position refinement scheme also to correct for the axial parameters, as the latter is responsible for modifying the distances between adjacent probes, while in a work published in January 2021, the authors of 39 used an autograd environment to directly infer the propagation distance.
In an even more recent work (March 2021), the authors in 18 proposed also a unifying approach to parameter refinement similar to the one used in our manuscript: we genuinely only became aware of these two references while writing the final version of this manuscript. It is undoubted that the automatic differentiation methods (Section 2.3) are becoming of interest for computational imaging techniques such as ptychography, and many research teams are approaching the methods as they represent the future of the technique. Figure 5 shows the effect of the wrong propagation distance on the probe retrieval procedure: in Panel A, speckle-like patterns typically appear; position errors produce instead a typical dotted artefact. An incorrectly retrieved probe P(x, y) will produce a severely wrong object reconstruction O(x, y).  Typical artefacts in a probe retrieval procedure (simulated data) in the case of the wrong propagation distance (Panel A, speckle patterns) and no position refinement (Panel B, cluster of dots). Panel C shows the real illumination (magnitude).

Automatic Differentiation
An unconstrained optimisation problem aims at minimising a real-valued loss function f (x) of N variables. This problem can be formally expressed by the following way: where x is the sought N-vector solution. One of the common methods to iteratively minimise the function f (x) is the gradient descent procedure, which relies on the gradient of the loss function to define an update step: which will provide a new vector estimate x k+1 from the previous estimate x k , which at convergence will be equal to x.
The handmade symbolic computation of ∇ x f (x) becomes increasingly tedious and error-prone as the complexity of the expression increases. Numerical differentiation automatically provides an estimate of the point derivative of a function by exploiting the central difference scheme, but while this method is particularly effective for a few dimensions, it becomes progressively slow as N increases. On the other side, a Computer Algebra System (CAS) generates a symbolic expression through symbolic computation, but often, the output results in expression swell. Automatic differentiation [40][41][42] is a way to provide an accurate gradient calculated at a point, thus lying in between numerical differentiation and handmade calculation. In

4/17
one of the currently used methods to automatically compute the gradients 43 , when a mathematical expression is evaluated, each temporary result constitutes a node in a computational acyclic graph that records the story of the expression, from the input variables to the generated result. The gradient is simply calculated following the graph backwards (backward mode differentiation 42 ), from the results to the input variables, only applying the chain rule to the gradient of each nuclear differentiable operator 40 .

Computational Methodology
The ptychography forward model is defined in terms of a complex probe vector P ∈ C K interacting with a complex object transmission function, which can also be arranged as a vector O ∈ C D .

Loss Function
Loss functions are typically designed around simple dissimilarity metrics such as quadratic norms, which can be real functions of a complex variable. As represented in Figure 2, the loss function takes into account the simulated ( I j ) and real (I j ) quantities of the same type (diffraction patterns). When all the conditions defined in Section 2 hold, a loss function L for all the j ∈ [1...J] diffraction patterns and positions in the dataset can be written as a data fidelity term: As can be seen, Equation (6) is a function of the 2K + 2D + 1 real variables, which when stacked up, constitute the optimisation vector pool. The simulated diffraction pattern I j relative to the current jth computational box is calculated by Equation (1) or by its extension to a multimode illumination (Equation (3)). The square root of the recorded data can be computed once for all the diffractions to fasten the implementation. D z is the angular spectrum propagator 44 defined by the expression: which relates the input field Ψ s (x, y; z s ) (defined in Equation (2)) to the output field at the detector plane Ψ d (x, y; z d ). Fixing the wavelength λ of the incident radiation, the 2D Fourier transform of the propagation filter h is defined by 44 :

Complex-Valued AD
Current DL autograd tools are not conceived of to work with complex numbers, so a basic complex library needs to be written; the natural way to introduce them is to just add an extra dimension to each 2D tensor and incorporate the real and the imaginary part in the same object, basically duplicating the number of actual variables. The automatic backward operation is completely acceptable for this kind of custom-made data type, and the resulting gradient is simply: where x = a + jb. The actual gradient is represented in the same complex data type of the variables, made of a real and an imaginary part. As can be seen in Equation (9), the result of the automatic differentiation can be written in the Wirtinger formalism 45 , just as the derivative with respect to the conjugate of the differentiation variable (except for the constant), which is the typical gradient expression exploited for functions of complex variables.

Regularisation
To increase the quality of the reconstruction, the data fidelity term in a loss function is usually paired with a regularisation term (the method of Lagrange multipliers), whose role is to penalise ad hoc solutions in the parameter space: a pixel value ought to fit the physical nature of the model beneath, without only accommodating the dissimilarity measure; this translates into a mere energy-conservation constraint. Other regularisation methods can be employed, especially the one based on priors on the image frequencies; however, they can be more difficult to tune and require a different forward model. Different from other works (e.g., 17,18 ), in this work, especially for O and P, we decided to use energy-based regularisation, paired only with a large penalisation of quantities out of the desired range. In this way, we practically constrained the minimisation even if a nonconstrained optimisation framework was used.

Spatial Transform Layer
If as seen in Equations (6) and (7), writing a mathematical expression in the parameters O, P and z directly permits the gradient generation, the same does not hold for the spatial shift correction defined on a discrete sampling grid. Indeed, in the typical ptychography reconstruction algorithm, the computational box for a given jth position is defined by exploiting a simple crop operator on a 2D tensor. While this method is simple and computationally fast (pointer arithmetics), it is not differentiable, as integer differentiation is nonsense. In this work, an approach borrowed from the DL community was then explored. Convolutional neural networks are not invariant to geometric transforms applied to their input. To cope with this problem, a spatial transform layer 19 is introduced; the new learnable model is trained by inferring the spatial transformation that, applied to the input feature map, maximises the task metric. This approach greatly improves classification and recognition performances (e.g., in face recognition 46 ).
In this work, instead, the parameters of the affine transform were directly learned as the position refinement coefficients, which minimise the objective function (Equation (6)) during the reconstruction. To do so, two components were used ( Figure  6): (I) a grid generator and (II) a grid sampler. The first element transforms a regular input sampling grid (x i , y i ) T into an output coordinates grid (x o , y o ) T , by applying an affine transform to the former; this mapping (Equation 10) is fully defined by six degrees of freedom, but as concerns rigid translation, only the last column (t x ,t y ) T is optimised. The scaling factors (s x , s y ) T are thus constants used to crop the central region.
Then, for each jth diffraction pattern and each jth shift-vector (t x ,t y ) T , the jth output grid (x o , y o ) T is generated. From the latter, the grid sampler thus outputs a warped version of the object. Following the formalism introduced in 19 , the cropped portion of the object V j is generated from the entire object U by: A well-performed sampling (with an antialiasing filter) adds a regularisation term that can help during the optimisation. The bilinear sampling exploits a triangular (separable) kernel K, defined in Equation (12), which does not present dangerous overshooting artefacts: where: Figure 6 shows the structure of the proposed approach, applied to the ptychography framework. The same spatial transform is enforced on the two channels of the tensor that represent the real and imaginary parts of the cropped region of the object. Within this setup, gradient propagation is possible because derivative expressions can be calculated with respect to both the affine grid and the output pixel values 19 .

Results and Discussion
In this section, reconstructions obtained from a soft X-ray experiment are presented. Additional details on the performance analysis for many synthetic datasets can be found instead in the Supplementary Material of this publication. The reconstruction obtained through the proposed method is confronted with the output of the EPIE 22 and RPIE 47 algorithms. The virtual propagation distance of 0.037 mm was chosen by reconstructing at many z and selecting the best reconstruction. All the computational experiments were written on PyTorch 1.2 43 and executed on a computer equipped with an Intel Xeon (R) E3-1245 v5 CPU running at 3.50 GHz. The entire code was implemented on a GPU (Nvidia Quadro P2000), which is essential for this heavy-duty computational imaging.
The imaging experiment was performed at the TwinMic spectromicroscopy beamline 9, 10 at the Elettra synchrotron facility. TwinMic can operate in three imaging modalities: (I) STXM; (II) full-field TXM/CDI; (III) scanning CDI (ptychography). Clearly, the latter (the one used for this work) is obtained by combining the optic setup of the second modality and the control of the sample stage from the first one.
Similar to other ptychography experiments performed at the beamline (e.g., 10,48 ), X-ray data were collected using a 1020 eV X-ray synchrotron beam 9 focused with a 600 µm diameter Fresnel Zone Plate (FZP) with an outer zone width of 50 nm. Figure 6. Schematics of the differentiable components used in our method. For each positions vector, a grid generator takes as the input the corresponding shift transform expressed in an affine transform formalism; the sampling grid is then generated using this information. The object is then sampled at each coordinate defined by the sampling grid, producing the correct cropped region of the object.
The zone plate was placed approximately at 2 m downstream a 25 µm aperture that defines a secondary source. This is crucial to increase the beam coherence, at the expenses of brightness. A Peltier-cooled Charge-Coupled Device (CCD) detector (Princeton MT-MTE) with 1300 × 1340 20 µm × 20 µm px 2 was placed roughly 72 cm downstream of the FZP. According to the Abbe theory 48 , the limit of the resolution for coherent illumination is Γ = 0.82λ /NA = 50 nm. The resolution in Fresnel CDI is a function of the experimental geometry 49 , namely the distance from the focal point of the FZP to the detector and the physical size of the detector itself, rather than the focusing optics.
From the ptychography configuration point of view, the situation is similar to the one presented in 50 , where a point source (obtained by idealising the focus of an FZP through an order sorting aperture) illuminates the sample. Following the approach of 50 , during the data analysis, the beam was parallelised by using the Fresnel scaling theorem 44 . In the following experiment, as pointed out in Section 2, the sample was considered sufficiently thin 21,51 to be modelled by a multiplicative complex transmission function O defined by the expression: where r is the planar coordinate on the sample plane, n(r) is the complex refraction index and t(r) a real R 2 → R 1 function defining the local thickness. From the reconstruction-inferred O(r) (the objective of the reconstruction), the magnitude map corresponds to: where I o is the flat field intensity and I the sample intensity. The phase map instead corresponds to: If the properties of the material (β and δ ) are known, it is simple to infer its thickness and vice versa. Diffraction data were acquired in the form of a 16-bit multipage tiff file. The nominal positions were directly acquired from the shift vectors provided to the control system of the mechanical stage. A series of dark field images was acquired for the dark field correction. Figure 7 shows a group of chemically fixed mesothelial cells: Mesenchymal-Epithelial Transition (Met5A) cells were grown on silicon nitride windows and exposed to asbestos fibres 10 . The absorbing diagonal bar is indeed an asbestos fibre included in the sample. The reconstructed sample is shown in magnitude (second columns) and phase (third column), where the first and the fourth ones are the magnification of the red (magnitude) and green (phase) area denoted in the figure. Each row shows the reconstruction obtained with a different algorithm (EPIE, RPIE and the proposed method "SciComPty autodiff").

7/17
Due to the fact that for each algorithm, many iterations (10,000) were needed for this dataset, in order to reduce the computation time, each diffraction pattern was scaled to 256 × 256 px, giving a resulting pixel size of roughly 36 nm (~4 × 9 nm = 36 nm) on a 846 × 847 px reconstructed image. As in all the simulated experiments, for each diffraction pattern, the correct value of the padding was inferred by the propagation routine, taking into account the wavelength and the current propagation distance value.
Observing the quality of the results, the proposed method (Fig. 7c) clearly surpassed all the aforementioned ones: Figure 7 shows the red insets of the top left fibre, which was correctly reconstructed with the highest resolution only by the proposed algorithm; in a multimode EPIE reconstruction, many ringing artefacts are visible. RPIE 47 provided the best result among the typical reconstruction algorithms, with an object with a large field of view, which extended also into sparse sampled areas. The proposed method (fourth row) reconstructed the fibre and the cell at the highest resolution in both magnitude and phase (see Figure 8). A second inset (green colour) shows the texture in the phase reconstruction, where again, the proposed method outperformed the others; cell structures were corrupted by fewer artefacts and visible in their entire length.
The higher reconstruction quality can be directly attributed to the combined action of both the automatic inference of the virtual propagation distance (Fresnel scaling theorem) of 0.24 mm instead of the 0.37 mm (as obtained by an exhaustive manual search for the other algorithms) and the use of an advanced optimisation algorithm (Adam 52 ) in which the choice of the batch size represents a new hyperparameter that can be tuned in grain steps (therefore easily). In all the other methods, the position refinement was also enabled. The final scan positions retrieved by the algorithm are shown in Figure 9. We stress that the use of a DL-inspired position refinement routine was here not in competition with other methods, but it was necessary to carry out the reconstruction within an AD framework.
The downscaling is required not only for speed reasons, but also due to the high GPU memory consumption, which is currently a drawback of the method: as the gradient are calculated per batch, increasing the batch size produces a faster computation (less gradients are calculated for the entire set of diffraction patterns), but the memory consumption is greatly increased.
The proposed method thus provides a good reconstruction of both the magnitude and phase of the object transmission function. Figure 8 shows the line profile for each magnitude reconstruction with an FWHM resolution that is clearly twice better than the other methods.

Conclusions
Automatic Differentiation (AD) methods are rapidly growing in popularity in the scientific computing world. This is especially true for difficult computational imaging problems such as ptychography or CT. In this paper, an AD-optimisation-based ptychography reconstruction algorithm was presented, which retrieves at the same time the object, the illumination and the set of setup-related quantities. We proposed a solution to solve at the same time for partial coherence, position errors and sensitivity to setup incoherences. In this way, accurate X-ray measurements become possible, even in the presence of large setup incoherences. This kind of automatic refinement not only allows improving the viability of a ptychography experiment, but also is particularly crucial to reduce the time for a reliable analysis, which currently is highly hand-tuned. Extended tests were performed on synthetic datasets and on a real soft-X-ray dataset acquired at the Elettra TwinMic spectromicroscopy beamline, resulting in a very noticeable quality increase. We implemented our algorithm in our modular ptychography software framework, SciComPty, which is provided to the research community as open-source 20 .

Appendix: experiments on synthetic data
A ptychography simulator, implemented in the SciComPty modular framework, is used to create 5 synthetic datasets, starting from a set of standard test images and setup parameters which spans different propagation distances, from 0.065 m to 0.2 m. The wavelength is kept constant at 1 nm. All the computational experiments are written on PyTorch 1.2 43 and executed on a computer equipped with an Intel Xeon (R) E3-1245 v5 CPU running at 3.50 GHz. The entire code is implemented on GPU (Nvidia Quadro P2000), which is essential for this heavy-duty computational imaging. During the tests on synthetic data, the resolution for each diffraction pattern is limited to 128x128 pixels to reduce the computation time. The overlap factor is kept constant at around 70%, producing a scan pattern of 11x11 positions. A known random jitter is added to the ideal grid-based scan pattern to avoid the "raster scan pathology" 21 . The resulting total object size has a field of view of around 512x512 pixels (roughly 200 µ m). Within the context of this paper, the term "epoch", which is used interchangeably with "iteration", defines how many times the entire set of J diffraction patterns is processed through the algorithm. The use of an autograd environment allows to easily experiment with the batch-size parameters: EPIE and DM are completely antipodal in this sense, as the first is a sequential algorithm (stochastic gradient descent), while the second employs all the measured data at once (gradient descent). The batch size hyper-parameter (here set at 5 probes) allows to span between the two worlds. Within this framework, new first-order optimisers such as Adam 52 become readily available, providing a considerable acceleration to the plain old gradient descent method described by eq. 4 and 5 in the main text.  To investigate the effects of our correction routine, for each dataset the propagation distance is initialised to a value corrupted by a 30% error, while each position is perturbed by a random jitter with a standard deviation of 10 pixels. Both the propagation distance and the position vectors are added to the optimisation pool. Apart from eye inspection, to validate the method, reconstruction quality can be analysed observing the behaviour of (I) the optimisation dissimilarity (eq. 6 in the main text), calculated between the simulated and the measured diffraction pattern, and (II) a truth-aware similarity metric, SSIM 53 , which obviously can be applied for simulated data only. In the latter case, the positions error and the propagation distance estimate can be monitored at each iteration, producing informative graphics which are the base for the following analysis. Fig. 10 and 11 show the convergence of the algorithm tested on one example dataset, depicted in Fig. 12: O(r) is made by the "Pepperoni" (magnitude) and "Peaches" (phase) images, while P(r) is composed by the "Chelsea cat" (magnitude) and "Astronaut" (phase) pictures. It can be seen that in around 500 epochs both the positions and the distance have recovered more than 90% of initial error, producing the object and illumination estimate which can be seen in Fig. 12. Note that the during the reconstruction of the illumination (the cat and astronaut insets in Fig. 12), no mask is used. The circular pattern is automatically retrieved by the procedure. The loss value (Fig. 10, top panel) is used as the guiding metrics to optimise the object, the illumination and the setup parameters. The use of the grid sampler is essential to increase the convergence, which tends to be instead slower for the case of optimisation-like reconstruction algorithm based on the a typical crop operator; the acceleration effect can be traced back to the inherent regularisation action of the interpolator. In the uppermost panel of Fig. 11, the distribution of the position error is shown for each epoch, denoting a Rayleight-like distribution, which tends to be narrower as the epochs increase; the median, indeed decay towards 0, indicating a good correction. The desired single-modality form of the distribution is an effect of a refined version of the problem expression: we have to optimise only additive correction factors for the positions, not just the positions. In this manner, an L 1 metric can be used to create a regularisation term on these correction factors. In the bottom panel of Fig. 10 the convergence of the z is shown: similarly to the other two panels in Fig. 10, the optimisation is moving fast towards the exact value within the first 1000 iterations of the algorithm, denoting how the convergence of each trained variable is aligned with the others, manifesting an ensemble behaviour. To better analyse this aspect, Fig. 13 panel A shows how during the reconstruction, the SSIM value increases as the propagation distance (panel A) approaches to the correct value. Each color is relative to a different dataset. The longest trails are the one for which the propagation distance is larger, then the optimisation, initialised with a 30% error, spans for an extended range of z.
Conversely, for small propagation distances, the regression of the correct value tends to be faster. Panel B shows a similar graph but for the median of the positions error. Convergence speed in the various phases can be guessed by observing the sparsity of the points for each error value. Figure 10. The convergence of our combined optimisation is analysed by observing: the loss value, the ground truth position refinement error and the propagation distance z inferred by the algorithm. In the bottom panel, the correct distance is denoted by the red horizontal line at z = 0.1m Figure 11. The uppermost graph is the zoomed version of the central panel in Fig. 10 (denoted by an asterisk). Here, the distribution of the positions errors is shown as epochs increase. It can be seen that the median decay towards 0 as the reconstruction proceeds. Panel A,B and C shows the ground truth error for each of the 11x11 positions. 14/17