Automatic Differentiation for Inverse Problems in X-ray Imaging and Microscopy

Computational techniques allow breaking the limits of traditional imaging methods, such as time restrictions, resolution, and optics flaws. While simple computational methods can be enough for highly controlled microscope setups or just for previews, an increased level of complexity is instead required for advanced setups, acquisition modalities or where uncertainty is high; the need for complex computational methods clashes with rapid design and execution. In all these cases, Automatic Differentiation, one of the subtopics of Artificial Intelligence, may offer a functional solution, but only if a GPU implementation is available. In this paper, we show how a framework built to solve just one optimisation problem can be employed for many different X-ray imaging inverse problems.


Introduction
Inverse problems such as nanotomography [1], compressive sensing [2], superresolution [3] and ptychography [4] are solved by exploiting the relation between data simulated through a model and what is detected during the measurements. Common iterative reconstruction algorithms make use of gradient-based optimisation to reconstruct the specimen x, which is the solution to the system of equations governing the model [5]. These approaches are powerful but can be problematic [6,7]: if a sophisticated method can simulate realistic behaviours, adding complexity results in error-prone expressions to be derived manually; this is especially true when the parameters must also be refined (e.g., in tomography [8,9]). If the object unknowns lie in a transform domain (e.g., wavelets), the complexity is also further increased. Pushed by Artificial Intelligence and machine learning research, Automatic Differentiation (AD) [10][11][12] methods represent an alternative in computational imaging, as they provide the key to prototypes in a simple manner for state-of-the-art reconstruction methods. Apart from being mathematically correct and powerful, resulting reconstruction algorithms are also fast if GPU/parallel implementations can be readily coded.

Paper Scope
The main scope of this manuscript is to showcase how AD-based methods can be used to open the black box of many computational imaging problems. Following our framework, it is possible to rapidly prototype state-of-the-art solutions for computational microscopy methods. Possible examples are: a compressive sensing reconstruction algorithm for sparse datasets, a single image super-resolution algorithm for recovering missing information from a low-resolution image, a 2D/3D tomography reconstruction algorithm that can deal with the missing wedge problem, sparse acquisition and axial misalignment, and a ptychography algorithm, which can retrieve many geometry parameters [13]. Even if the aforementioned solutions are purely "conventional", the use of such an AI component as AD blurs out the distinction between classical and AI methods [12,14,15].
In Section 1.2, we introduce the concept of Automatic Differentiation; then, in Section 1.3, we describe the generic computational imaging problem statement. We also provide an introduction to the four imaging techniques discussed in this manuscript (that can eventually be skipped by experienced readers). In the method section, we describe the proposed AD-based computational framework, and finally, present the results that can be obtained.

Potential Uses of Automatic Differentiation
Calculating derivatives is a need for many numerical techniques that require numerical optimisation. While gradient-less methods [16] are used to explore solution spaces of a few dimensions (e.g., Bayesian methods to find hyper-parameters), gradients are essential in minimisation problems that involve a huge number of parameters, such as the training of neural networks [11,14,17] or a reconstruction of a computational imaging framework. Any use of finite difference methods would result in a very long process. On the other side, as the computational pipelines and the models become more complicated, computing gradients by hand becomes a challenging, time-consuming, difficult and error-prone task [12]. For many problems, computing derivatives analytically is simply impossible, and only an evaluation at certain points is feasible [12].
Hand-coded analytical gradients provide an exact expression (when available) but can be time-consuming to code properly (e.g., on hardware accelerators), error-prone, and not applicable to problems with implicit solutions [10,11].
Finite differentiation, on the other hand, is extremely easy to implement but is subject to fixed numerical precision and is slow, as the method requires one evaluation for each dimension [12]. This immediately makes it infeasible for many tasks involving direct image optimisation.
Symbolic differentiation addresses the weaknesses of both the manual and numerical methods but often results in complex and cryptic expressions plagued with the problem of "expression swell" [11], referring to the phenomenon of a much larger representation of the derivative as opposed to the representation of the original function [18]. In addition to this, symbolic differentiation tends to also be memory intensive, slow and cannot handle common statements (e.g., unbounded loops) [12].
Automatic Differentiation is a set of techniques aiming at evaluating the derivative of a mathematical function specified in a code. AD interprets the program by incorporating derivative values at each node of a computational graph and propagates their values following the chain rule of differential calculus [10]. In a differentiable programming language such as PyTorch [19], indeed, one can write a differentiable expression acting on a multi-dimensional numerical array, and a computational graph is built node-by-node as the temporary results are calculated. When the gradient of the leaf with respect to a particular node is requested, the computational graph is followed by reverse-concatenating the sequence of known partial derivatives, following the chain rule [10,19]. This constitutes the additional computational step that makes AD methods slower than a well-coded analytical gradient, which can be immediately calculated from a hand-typed and potentially well-simplified expression. However, this problem can be consistently alleviated if GPU-accelerated frameworks such as PyTorch are used.

Computational Imaging Problem Statement
Inverse problems are frequently solved through iterative procedures that refine the estimate of a desired latent quantity x; this refinement can be formalised as the minimisation of an error function L(x), which measures how dissimilar the simulationsŷ are with respect to the acquired data points y. These simulations are produced by applying to the latent quantity x an operator A describing the computational model of the particular technique at hand. Intuitively, a good estimate of x thus permits to simulate quantities that closely follow the measured dataset.
The complete loss expression for a generic image inverse problem is shown in Equation (1) The Limited-memory-Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) [20] or the Orthant-Wise Limited-memory Quasi-Newton method (OWL-QN) [21] are commonly used solvers, but a gradient expression is required. While for simple expressions one can manually calculate it, such as in Equation (2) (to refine x) and in Equation (3) for cumbersome models (especially if complex-valued quantities are involved), AD methods come in handy, providing exact gradient expressions for the actual A, x, y values [10]. The only requirement is to write a differentiable expression as a function of the tensors that should be optimised.

Compressive Sensing
Compressive sensing (CS) [2,22,23] refers to a particular sampling modality that explicitly violates the Nyquist rate. Any discrete signal x of dimension N can be expressed as a linear combination of N basis vectors. In many practical cases, the signal x is not sparse in the sampling domain, but it can be considered K-sparse (K N coefficients are nonzero) if another set of basis vectors, e.g., from Discrete Cosine Transform (DCT) or Discrete Wavelet Transform (DWT), is chosen. When this condition is met, the signal is also compressible [22], and one can effectively reconstruct the whole latent image, even from the set of sparse measurements in the sampling domain [2].
During Scanning Transmission X-ray Microscopy (STXM) or X-ray Fluorescence (XRF) experiments [24], CS has been employed to dramatically reduce the time acquired for a single scan [25,26]: based on a rough STXM measurement, indeed, a spatial mask can be used to discriminate areas of sparse or fine scanning (for the subsequent XRF/STXM fine-measure). A different application is in Fourier Holography, where a CS framework is employed to solve for the values of the diffraction pattern, which are covered by a beam-stop at the detector plane [27].
The computational model for a CS reconstruction can be formalised exactly as in Equation (1), where A is substituted by the matrix product ΦΨ as in the following expression: Ψ denotes the 2D inverse cosine or wavelet transform operator applied on the transform coefficients vector x; in this latent space x, one can take into account energy considerations by leveraging the L1 regularisation. To reconstruct the latent image Ψx, the minimisation problems of Equation (4) have to be set. Finally, Φ represents the sparse sampling operation, where only a fraction of the image pixels is measured. The taxonomy [28,29] of the CS reconstruction methods (all iterative) involves typically three classes of algorithms: (1) iterative-thresholding algorithms based on Douglas-Rachford (DR) splitting [30,31]; (2) greedy algorithms such as Orthogonal Matching Pursuit (OMP) [32][33][34]; (3) gradient-based optimisations based on Equation (1).
When comparing these algorithms, gradient-based optimisation provides the best reconstruction accuracy, but at the cost of very high computational complexity. Greedy algorithms can be fast and very accurate but tend to stagnate for high-dimensional space and regular sampling, while the thresholding algorithms are fast and frequently used due to their simplicity but can provide severely blurred results [28].
Image inpainting is another common way to "fill the blanks" in the measured image [35]; This kind of method essentially solves a Partial Differential Equation describing how grey levels propagate inside missing regions [36], by continuing the lines of equal grey value (isophotes); in [35], a corrupted image is restored by solving the 2D Poisson Equation employing biharmonic functions; the Telea algorithm [36] aims at producing a very fast interpolation by taking into account a slow-varying "baseline" and the estimated value of the 2D gradient. The Naivier-Stokes (NS) method [37] works by assessing many similarities between 2D image quantities (intensity, isophote direction, smoothness, anisotropic diffusion) and fluid-related ones (stream function, fluid velocity, vorticity, fluid viscosity). Frequency Selective Reconstruction (FSR) [38] is a block-based algorithm that approximates the known samples by a weighted superposition of Fourier basis functions to estimate the lost pixels [39]; in some ways, it follows the reconstruction paradigm in CS.

Single Image Super Resolution
Single Image Super Resolution (SISR) [3] is a class of computational methods that aims at retrieving a High Resolution (HR) image from a single degraded Low Resolution (LR) one; indeed, finer details that are typically within the size of the former spatial sampling period can eventually be made visible.
One of the most used degradation models [40] is: where y is the degraded LR frame, x is, as usual, the latent HR image, H is the blur operator, B is the down-sampling operator and n the noise. While the traditional super-resolution approach employs multiple frames to recover the missing information (e.g., in order to obtain a readable license plate image from a surveillance video [41]), the SISR inverse problem is extremely ill-posed as many more different HR images can be projected to the same LR one [42]. This issue clearly emerges when simple interpolation methods are used, leading to the blurring or the generation of artefacts around edges and corners [43]. The literature on SISR evolved (a recent review is in [40]) mainly in four different branches: (i) reconstruction-based methods [44] are the oldest ones and aim at recovering the latent HR image by approximating the degradation process via a composition of blurring, downsampling, and noise injection; the method in [45] solves a Maximum Likelihood optimisation problem by employing a set of well-behaving image priors for the noise and image distribution. Turbozoom [46] is a recent reconstruction-based algorithm that employs a fast Conjugate Gradient optimisation to solve the degradation problem in Equation (5). However, in the majority of works, many solutions adopt a simpler degradation model, e.g., directly downscaling an HR image using a gaussian or a bicubic kernel to generate the corresponding LR one [40] .
(ii) Sparse-based algorithms [47] are another class of reconstruction methods where convergence is improved by exploiting the sparsity characteristics in the transform domain of the latent image, similar to a CS problem. One notable example is [48] where a CS framework is adapted for super-resolution by introducing a fixed blur kernel. The complete degradation model is thus composed of a blur operator and regular sparse scanning (sub-sampling) in the spatial domain. The regular sampling produces a structured mask, defining a critical condition for CS. By introducing the blur kernel, the (Restricted Isometry property) RIP condition is again met, making an OMP algorithm converge again.
(iii) Example/patch-based algorithms [40] historically provided the best results from a purely aesthetic point of view [49]; the inner working relies on generating a self-dictionary of high-resolution patches, which can be used to generate an HR frame [50]. As their use is questionable in the scientific, clinical or forensic cases, these kinds of algorithms are not considered in this manuscript.
SISR is currently receiving lots of attention from both the industry and the Image Processing research community, especially after the advent of (iv) Deep-Learning-based methods, which produce so-called "hallucinations" [40,42]: during a learning phase, a trainable model learns to estimate the most probable HR patch. Recent notable examples of these kinds of state-of-the-art methods involve the use of Convolutional Neural Networks (CNN), such as ESDR [51], ESPCN [52], FSRCNN [53], LAPSRN [54]) or Noise2Noise [42]; this is similar to the case of example-based SISR techniques where for natural images, the aesthetic factor can be privileged; in microscopy, instead, we need to assure that the reconstructed image holds high fidelity with the ground truth (the sample). The same reasoning can be applied to CNNs, if we are uncertain about their inner working [55], as the hallucinated images are strongly biased by the nature of the training dataset. As an example, the model used in [42] is provided pre-trained for different types of datasets, e.g., faces and walls, even if the task for what the model is trained is the same. The field of explainable AI [56] is growing in response to these assumptions [57], creating both less-opaque models and techniques to inspect the decisional process of AI [56,58].

Tomography
X-ray Computer tomography (CT) [59,60] is used to analyse biological samples in their native environment [61] in a non-destructive way. In its simplest form [62], we aim at reconstructing a 3D volume starting from a series of 2D projections y i , acquired at different rotation angles θ i , where i denotes the ith observation [1]. This inverse problem is formally still another formulation of Equation (1), where A is substituted with MR θ i , such as in Equation (6) L The latent 3D object x is illuminated by a high-dose parallel field and is rotated on a vertical axis by the operator R θ i ; the rotation axis lays exactly in the middle of the object, its projection is in the middle of the detector and is perpendicular to the beam. A Riemann sum operator M describes the ray-tracing within the sample, following the projection approximation [63] and the Lambert-Beer law [64]. The Crowther criterion [65] establishes the maximum angle step size that is required to fully recover the Fourier space of the sample. While the Filtered Back Projection (FBP)/gridrec [59,66] and the Algebraic Reconstruction (ART) [67] provides satisfactory reconstructions, in many micro/nano-CT experiments different technical difficulties can arise (here are just a few examples): (i) Radiation damage sets limits on the beam intensity for each projection [68]; when the number of photons is scarce (e.g., due to dose fractionation criteria [69] or photon starvation [70]), common single-shot algorithms produce very noisy images [60]. Ex-post noise removal increases the output Signal-to-Noise-Ratio (SNR) at the expense of spatial resolution. A meaningful reconstruction can be obtained by employing a Maximum-Likelihood algorithm (e.g., with Expectation-Maximisation (MLEM) [71]), which exploits a priori information on the noise process.
(ii) Satisfying the Crowther criterion might not always be possible, e.g., due to time/dose restrictions as well as mechanical limitations; iterative algorithms [64], such as SIRT [72], PSIRT [73], or MLEM [71], make use of the same concepts of regularisation and sparsity discussed so far to restrict the solution space to likely values.
(iii) In micro/nano-CT, various sample configurations limit the range of accessible tilt angles to 100-120 • (missing wedge [74]) due to crowded experimental chambers (e.g., cryo stage in cryo-nano-tomography [61,75]), nanopositioning stage limits or large absorption in the sample; while single-shot algorithms would produce severely corrupted reconstructions, iterative algorithms exploiting regularisation, and a CS approach [76] can at least provide good spatial resolution in a plane parallel to the detector [74,77].
The problems discussed so far produce very noticeable and characteristic artefacts in the reconstructions: while FBP output images are noisy, have poor contrast and are blurry [60], iterative methods might generate sub-optimal image texture, often referred to as "plastic" or "blotchy" [78]. In this context, AI can provide a huge help, as it can learn the correspondence between the latent object and the characteristic artefact/defect triggered in a well-defined reconstruction algorithm by a well-defined problem [79]. A review of methods is in [60].
In cryo-nanotomography [75,80], the aforementioned problems appear at the same time and are also exacerbated by unwanted and unknown movements in the sample stage [9,81,82]. A simple correction model employs serial cross-correlation between projections [83], also with cosine-stretching, reducing the risk of potentially propagating drifts in the correction [84]. Even if, in some cases, an algorithm might be able to create a set of landmarks from image features alone [85][86][87][88], marker-based alignment methods [84,[89][90][91][92][93] provide the de-facto solution to the problem; at the cost of decreasing the sample visibility, gold-nanobeads are added to the sample solutions, providing a set of landmarks that are relatively easy to detect and track automatically (but manual tracking is often required for a variety of reasons [75,94]). Stretched cross-correlation is still used as a pre-alignment step. A different class of algorithms is formalised around the concept of tomography self-consistency, which tries to infer the alignment parameters while reconstructing the volume. These methods can exploit both gradient-less [81,94] or gradient-based optimisation [95,96] to infer the parameters of a model significantly more complex than the one in Equation (6).

Ptychography
Ptychography [4,97] aims to dramatically increase the space-bandwidth product of a microscopy experiment by exploiting a computational imaging approach [98]. In a transmission setup, an object is raster scanned with a spatially coherent, monochromatic and time-stationary illumination p; the field is a spherical wave emerging from a pinhole or a Fresnel Zone Plate (FZP) [4]. In the thin-sample approximation [99], the input field is simply modulated in magnitude and phase by the object transmission function x. A 2D detector, typically placed in the far-field [4] at a distance z, records the diffraction pattern |d i | 2 of the transmitted field; the use of this regime is used to ease the computational part, as the computational propagation D z can be approximated with a 2D Fourier transform [4]. If only one diffraction pattern is collected, this approach corresponds to Keyhole-CDI [100], but when multiple diffraction patterns are acquired in an over-scan fashion, the overlap between different beam positions provides the diversity [98], which is required to overconstrain the computational problem [98], described by Equation (7) |d where, differently from the previous cases, d i , p and x i are complex-valued quantities. The approach allows extending the Field Of View (FOV) of the acquired area ideally without limits, as long as a good overlap is maintained, paired with constancy in the illumination conditions. While the first proposed algorithm was a single-shot process [101], in conventional experiments, iterative algorithms based on the concepts of "projection" are extensively used: Ptychography Iterative Engine (PIE) [4], extended-PIE (ePIE) [102], Refined-PIE (rPIE) [103] and Alternating Projection (AP) [104] employ a sequential approach that is considered essential to exploring the solution space as much as possible; the Differential Map (DM) algorithm [105] instead makes an averaged update taking into account all the experiment data. A different approach based on a form of Equation (1) was also proposed [6] but initially considered too slow compared to projection-based methods. Apart from PIE, thanks to the diversity in the dataset, all the aforementioned algorithms can automatically solve for an estimate of the illumination field, similar to a blind-deconvolution process. The complications, which can be accounted for by extending the simple ptychography model of Equation (7), closely follow what has been described for tomography (here, just a few examples): (i) In the case of a diffraction pattern with low SNR, knowledge on the noise statistics is essential; a Maximum Likelihood approach based on Gaussian or Poissonian noise has been described in [7], also with regularisation [106]; (ii) A trade-off between flux and spatial coherence of the beam is often required. When multiple propagating modes interfere, the resulting diffraction pattern appears blurred [107]. Based on the decomposition of the mutual intensity function [108], probe decomposition techniques such as [109][110][111][112] are often employed to computationally disentangle multiple modes.
(iii) Mechanical precision of the sample stage limits the spatial resolution of the reconstruction. Being a scanning technique, ptychography relies on accurate knowledge of the sample position (at the pixel level) for each acquired diffraction pattern |d i | 2 . On many ptychography beamlines, an ex-post approach employing computational position refinement algorithms is inevitable. Many methods have been proposed, based both on a gradient-less, e.g., [113][114][115][116][117][118], or gradient-based approach, e.g., [6,13,119,120] (iv) Ptychography relies on many other model parameters such as propagation distance, beam defocus and beam energy that allows correctly describing the experiment at hand. In [? ], beam defocus is corrected through a complex genetic-algorithm optimisation; in [122], the authors demonstrate how position correction can mitigate beam energy and propagation distance errors, while in [123], a variant of the PIE algorithm is proposed (z-PIE) that automatically refines the propagation distance.
Similarly to the previous case, AI can be used to correct or to provide a good initialisation for a subsequent common iterative algorithm [124].

Materials and Methods
In this work, we show how an AD framework can be employed to rapidly design state-of-art solutions to computational imaging problems, starting from a set of acquired data y.

Compressive Sensing
In Section 1, we have seen how Equation (1) is adapted to model Compressive Sensing, resulting in Equation (4). A typical Python implementation of such gradient-base optimisation is shown in Listing 1, where Numpy [125] and Scipy [126] Python libraries are used for numerical computing. An optimiser object is initialised by providing the initial values of the latent array x, containing the transform values in the space of the 2D DCT. As can be seen, both a loss function and a hand-typed gradient expression are required (see line 20 of Listing 1). The loss comprises both a data fidelity term (the L2 norm of the difference y − ΦΨx) and a regularisation term (L1 norm of the transform coefficients x), which are defined by using the functional representation of the matrix operators Ψ and Φ; these matrix operators are indeed substituted by functions (2D IDCT [127] and an element-wise multiplication). The use of the gradient-less algorithms specified in the method parameter in line 21 of Listing 1 would be impracticable. Note that this process is carried out on the CPU-single core-even if multiple CPUs are available. Porting this simple code on GPU is a non-trivial task.  The equivalent code in a GPU-accelerated environment such as PyTorch is reported in Listing 2 , where no gradient expression at all has to be manually calculated and coded. The loss function is defined as a closure function for the optimiser: in that function, the user describes all the steps required to produce the computational graph of the variable that has to be optimised (see params list in line 13 of Listing 2). At any iteration, the optimiser will (i) clear out the old gradient values; (ii) simulate the experiment in the forward pass; (iii) calculate the gradients by unrolling the computational graph from the loss value to the tensor x; (iv) calculate and apply a suitable update step for the variable in the list of params, following the particular algorithm at hand. The entire process is computed on the GPU with no memory transfers in the loop, as the functions (DCT transform [127] and generic float-array algebra) are defined and written to be executed through CUDA kernels and all the arrays have been moved to the GPU memory before the start of the optimisation loop (see line 8 and 12 in Listing 2). The resulting program is even faster than the previous single-threaded code, notwithstanding the hand-coded gradient in listing 1. While the optimiser itself is part of PyTorch [19], the interface used-pytorch-minimise [128]-allows writing a code, which is very familiar to experienced Numpy [125] users.

Single Image Super Resolution
The general degradation model in Equation (5) for Single Image Super Resolution can be formalised as an optimisation process similar to Equation (1), such as in Equation (8): where x is the unknown HR image, H is a 2D convolution operator that models the blur of the system with a convolution kernel h, and finally, B represents a warp operator (downscale) modelling the pixel size increase. As reported in Section 1.5, this model describes the low-resolution image as the output of blur with an unknown convolution kernel (13 × 13 in our experiments) and a fixed downscale (x4). Note that without operator B, Equation (8) is a blind deconvolution problem [129]. In [48], the blur and warp filter are modelled by only one filter, which is considered known (a gaussian with a width defined by the up-sample factor [46,129]); this represents a limiting factor.
In our SISR algorithm, we optmise both the HR image x and the deconvolution kernel H. To improve the quality of the result and the convergence, instead of the simple L2 norm ||ŷ − y||| 2 2 , we used the Huber loss function [130], which is typically employed in Deep Learning problems: whereŷ = BHx. This loss function allows to speed up the convergence when the loss value is smaller than a parameter δ, which, in our case, is set to 1. This process-guided switch between the L1 and the L2 norm would be very difficult to implement in a conventional optimisation procedure. As in this case we are directly optimising the values of the HR image, a good choice for the regularisation term is the Total Variation [131] V(x), defined by Equation (10): which acts by reducing the variations along adjacent pixels for any row i and any column j. By favouring slowly varying borders, we avoid the generation of high-frequency artefacts. The amount of regularisation in the loss function is controlled by the hyperparameters λ 1 and λ 2 . The total loss function used to optimise the HR image x and blur kernel h values is thus: which is implemented following the same reasoning in Listing 2.

Tomography
Even if the computational problem for CT is formally another formulation of Equation (1), for a real AD implementation the involved expressions start having a considerable complexity. For a parallel beam setup, the A matrix is indeed the composition of many operators, as shown in Equation (12): The main component is still the L2 norm of the difference between each simulation and the acquired data. As discussed in Section 1.6, a minimisation in the space of the transform coefficients x is required, especially in the case of peculiar sampling conditions (e.g., limited angle and/or sparse acquisitions) [75]; by applying the Ψ operator on each slice, the full 3D volume is obtained. R θi acts on the 3D volume and rotates it along a vertical axis. Finally, M describes the Riemann sum of the sample slice, simulating the X-ray matter interactions within the pure projection approximation [1]. The regularisation terms are weighted by the hyper-parameters λ 1 and λ 2 . In our implementation, similarly to the previous case, the Ψ matrix operator is substituted by a 2D IDWT [132] (Bi-orthogonal 4.4 [133]); R i employs an affine transform [13,17,134] to describe the rotation of θ i degrees along an axis of parameterised coordinates (r x , r y ). The Riemann sum is approximated with a sum along the propagation axis. All the functions are implemented in PyTorch and are GPU accelerated.
While the model described so far is sufficient for datasets that are dose-limited or with missing projections, it does not allow for geometrical parameter correction. One way to partially address the projection misalignment [81,94] described in Section 1.6 is to search for a set of detector shifts and a common detector tilt angle. To do this in AD, one can implement the model in Equation (13): where the operator C i applies an affine transform, which describes for each ith radiography a shift in x, y and a rotation that is global for the entire dataset. This operator is coded as R θi . The total number of parameters for a dataset of N projections is indeed a 3D array containing the volume, 2N floats defying the shifts and an additional float for the common detector angle.

Ptychography
By raster-scanning the sample with a constant illumination p and by recording the corresponding ith diffraction pattern, one can reconstruct the complex refraction index of the total illuminated area of the sample. To do so, a computational method is required, as the missed phase for each diffraction pattern has to be retrieved. The model for this acquisition modality is even more intricate than the one in Equation (12) and is shown in Equation (14): where, again, the main loss component (what drives the reconstruction) is the sum of the L2 norms calculated for the difference between the simulations and the acquired data for each diffraction pattern. In this case, however, x is a complex-valued matrix describing the entire sample transmission function (entire Field Of View), p m is the multi-mode [109] complex valued illumination [109], C i is a cropping operator, which models the spatial scanning at the ith position r xi , r yi [13]. Finally, D z is an operator that describes the wave propagation towards the detector up to the distance z [13]. Note that the complex differentiation is performed by AD in PyTorch employing intrinsically Wirtinger derivatives, similarly to other ptychography AD methods [119,120,123,135,136]. To do so, we wrote a small complex-algebra library [13], which can operate on a duplet datatype, where each 2D complex array is stored in its real and imaginary parts. To help the procedure correctly factorise the probe and the object, we included an L1 regularisation term, which takes into account energy conservation constraints: the object x cannot behave as a source, so the transmission function cannot present values larger than 1 in magnitude. ||x x>1 || 1 indeed selectively penalises only the element with a magnitude greater than 1. The key element for such regularisation is the fancy indexing capability in PyTorch, inherited from Numpy. The operator D z can be any single-pass wave-propagation routine; for the far-field case, the 2D Fourier Transform alone (implemented in PyTorch as a GPU function) reasonably approximates the field, as the ad hoc phase factors are automatically included in the probe p itself [4,103] during the reconstruction, thanks to the diversity in the dataset. In the case of near-field [137], the angular spectrum is a good compromise as it is relatively easy to implement, and it is also the most accurate solution [138]. To increase the execution speed, we cached the prefactors as much as possible [138], leaving only the Fourier Transform and the final element-wise multiplication to be calculated at run-time [13].
The model described so far can optimise the object x and the probe p, also in a multimode fashion [109]; to include the refinement of the prorogation distance z, we have to calculate more prefactors in D z at each iteration, making the algorithm inevitably slower.
In a typical ptychography code, the crop is simply obtained by slicing the array at a predefined set of coordinates and with fixed dimensions; this operation defines the socalled "computational box". Slicing, however, is not differentiable as it involves integer quantities. Position refinement indeed requires a different type of crop operator, which can be emulated by an affine transform (which is differentiable). Such an operator exploits the same component described previously in the text and computes the 2D translation and rescale needed to produce the particular computational box at hand. More details on the generation of the affine transform matrix for each diffraction pattern can be found in [13,17]. Inevitably, the price to pay for the optimisation of new parameters is the use of a formalism that slows down the entire algorithm.
The final computational model is thus written as the composition of differentiable functions in the optimisable arrays x, p, r xi , r yi , z, which are the parameters appearing on the left-hand member of Equation (14).

Results
In this section, we present the reconstructions obtained for each of the aforementioned techniques. The CS and ptychography datasets were acquired during an experiment carried out at TwinMic, the soft-X-ray spectromicroscopy beamline [24,139] of the Elettra synchrotron facility. The sample is composed of a group of chemically fixed mesothelial cells (Mesenchymal-Epithelial Transition Met5A) grown on silicon nitride windows and exposed to asbestos fibres [13,24,140]. An energy of 1260 eV was used, paired with a secondary source [24] of 15 µm.

CS Reconstructions
Differently from the work reported in [25,26], here we show an application of CS for a fast STXM measurement performed by using a regular but sparse sampling mask. This kind of acquisition can be employed to further speed up the acquisition of the final XRF sampling mask, which typically depends upon an inspection of the STXM map (see Section 2.1). Such scan modality is compatible with the position control system of the microscope, as the input can be a list of sorted coordinates. The total scan area spans a range of 80 × 80 µm, for a total of 400 × 400 pixels, providing a pixel size of 200 nm. Using such a large area and high-resolution setup requires employing a dwell time of 80 ms per scan point, totalling more than 4 h for an acquisition. By using CS, we can sample just 10% of the area (within one-tenth of the time) and recover the sample with a limited deterioration in quality. Figure 1 displays the results of many CS algorithms described in Section 1.4, compared to the proposed AD method: panel (a) shows the ground truth STXM acquired with a dense raster scan. From this image, we can realistically simulate the effects of sparse sampling by employing a regular sensing mask-a zoomed version is in Figure 1b-where only 10% of the pixel values are actually measured. Figure 1c shows the output of a Douglas-Rachford (DR) iterative thresholding algorithm; even after 1000 iterations, the final result is extremely blurred. While the use of a regular mask allows for a computationally simple interpolation scheme [26], it mines the applicability of a regular CS algorithm such as OMP (Section 1.4); the Restricted Isometry Property (RIP condition) indeed requires complete incoherence (randomness) between the sampling matrix and the underlying signal. When this condition is not met, the OMP algorithm stagnates, producing completely unusable results, such as the one in Figure 1d. A similar result can be observed in the output produced by the Naiver-Stokes (NS) inpainting algorithm; see Figure 1e.  Figure 1f shows the output of our proposed AD-based algorithm, which provides the least amount of artefacts, also compared to Biharmonic inpainting (panel g), Telea inpainting (panel h) and FSR inpainting (panel i). While the output of the Telea algorithm appears sharp, it presents a very blocky appearance. Nearest-neighbours-interpolation artefacts can also be spotted for panel (g), where small black features are grown with respect to the ground truth (see Figure 2). A very unnatural appearance with blurred borders can be seen in panel (i), where artefacts build up, especially in the external region.
The likelihood of having instabilities during an experiment increases with the time required for the data acquisition procedure. STXM and XRF experiments are often cursed by this (note the stripes in each panel of Figure 1). Within an AD framework, one can easily implement a new computational model, which takes into account the presence of a slowly varying function (beam intensity), which modulates the latent object in magnitude. The new computational model is presented in Equation (15): where the array w appears as a new trainable variable, with the same shape of x. In order to enforce a slow-varying behaviour that approximates slowly varying characteristics of the beam intensity, we apply a 1D version of the Total Variation regularisation V() on the serialised version of w. The output of this procedure is shown in Figure 1 panel (j), where the striping artefacts have been removed automatically during the reconstruction and not as the output of a post-processing filter; the estimated background, reshaped as x is shown in Figure 1k.  Figure 2 shows a detailed view of the CS reconstruction in Figure 1; note how the small circled dot in the ground truth (panel a) is blurred and increased in size for the Biharmonic and the Telea method (respectively panel b and c); the proposed method provides a good reconstruction of the fine detail both in the uncorrected (panel d) and background corrected version (panel e).

SISR Reconstructions
Single Image Super Resolution has been used to increase the level of detail of an STXM dataset, where the pixel size of 218 nm-fixed by the focusing characteristics of the FZP at 1260 eV-was not small enough to resolve the finer details. The sample is a different region of the Met5A sample, where many asbestos fibres aggregated, forming a cluster. Figure 3a shows the original STXM map of 26 × 26 pixels and the results of different SISR methods applied to that image; each recovered HR image has a resolution of 104 × 104 pixels (4× up-sample factor). In Figure 3a, each reconstruction is normalised and shown at the same level of contrast. Neither the bilinear or the bicubic interpolation provide a good reconstruction: a bogus feature builds up, appearing as a "cross-like" artefact between the top leftmost fibres (red rectangle). A recent very fast reconstruction-based SISR method, "turbozoom" [46], is unfortunately tricked into generating the "cross-like" artefact and fails at recovering the correct rod profile. We also tried a blind deconvolution method (implemented in the "miplib" package [129]) applied on a bilinear interpolation, but no apparent increase in quality was observed.
On the contrary, the proposed method based on Equation (11) correctly separates the three rods in the top red rectangle, enhancing the individual contrast characteristics of each fibre in the cluster. This avoids the generation of the "cross-like" artefact. Furthermore, the end of the rods in the purple rectangle is better resolved. Moreover, during the STXM scan, a strong drift of the sample was observed, much bigger than the one scan step. The proposed method correctly disentangles this spurious movement from the object, estimating a Point Spread Function (PSF) that closely describes a motion blur.
SISR is currently receiving lots of attention from both the industry and the Image Processing research community, especially after the advent of Deep-Learning-based methods. Indeed, we also decided to test four different state-of-the-art Convolutional Neural Networks (CNN)-SISR techniques (ESDR [51], ESPCN [52], FSRCNN [53], and LAP-SRN [54]). Being trained on natural images, these tested methods generalise poorly on X-ray microscopy images (see the blocking artefacts in FSRCNN, likely triggered by the intrinsic nature of the noise of a scanning technique). However, ESDR is somehow resolving the top-left features of the rods (red rectangle) while still blurring the bottom-right ends (purple rectangle).
The Fourier Ring Correlation [129] can be used to estimate the increase in resolution; for each tested method, the corresponding curve has been calculated and is plotted on the same graph of Figure 3b: as can be seen, the curve corresponding to the proposed AD-based reconstruction method (blue) intercepts the 1 bit threshold at the highest spatial frequency, meaning that at that spatial frequency, the signal-to-noise ratio is the highest. The FRC curves are not monotonic, but after roughly half of the Nyquist rate, many curves rise again: this behaviour can likely be explained by the build-up of high-frequency artefacts; among the tested method, the proposed algorithm (blue curve) provides the finest details, while limiting the artefacts components.

Micro/Nano-Tomography Reconstructions
We tested our tomography reconstruction algorithm based on Equation (12) on two synchrotron radiation datasets and compared it against state-of-the art solutions (SIRT and MLEM, running on the GPU), implemented in the advanced TomoPy [141] software package. The first dataset is the "tooth" sinogram included in the TomoPy [141], also showcased in [142]. The original micro-CT dataset consists of 180 projections acquired with a step of 1°. The gridrec reconstruction of this dataset is shown in Figure 4a. To emulate a limited angle and sparsely acquired tomogram, we subsampled the dataset by a factor of two and reduced the rotation range to 120°, producing a resulting dataset of 60 projections. SIRT [72] (Figure 4b) and MLEM [71] (Figure 4c) were used to compare the performance of the proposed algorithm. To produce the reconstruction in panel (b), the SIRT algorithm was iterated for 500 iterations (20 s), but the algorithm reached a reasonably good convergence already after 20 iterations. The MLEM algorithm was used for 12 iterations (2 s); this number was chosen to reduce the number of artefacts in the reconstruction, which build up rapidly from iteration 13. The proposed method formalised in Equation (10) produced the reconstruction in panel (d), which appears sharp and with the highest degree of similarity to the ground truth. Note how the use of the CS-inspired minimisation in the transform domain (DWT) allowed for a good reconstruction of the global object shape; this can be seen especially at the border in the bottom-right part of the tooth, which was clearly in the unobserved angle range. The second dataset is a different micro-CT scan performed at the Syrmep beamline [143] of the Elettra Synchrotron facility and used in [94,144]. A mouse femur [144] was acquired with an angle ranging between 0 and 180°, with steps of 0.4°, totalling 450 projections. The ground truth has been produced by using all the available projections with the SIRT algorithm, which was run for 100 iterations (Figure 5a). Similarly to the previous case, we simulated a sparsely sampled and limited-angle dataset by taking one projection of every three (step of 1.2°), again limiting the observations to 120°. Figure 5b shows the output of the SIRT reconstruction algorithm, where the effects of the missing wedge problem are highly visible (smearing at the borders). MLEM (Figure 5c) produces an higher quality reconstruction compared to panel (b); similarly to the previous case, the proposed algorithm in panel (d) produces a reconstruction with crisper details and higher contrast. To test the reconstruction method described by Equation (12), we used an electron tomography dataset published in [80,85] and well-known in the nanotomography community, being one of the tutorial datasets released for the Tomoj reconstruction software [145]. The Pyrodictium abyssi cell strain TAG11 dataset is composed of 91 (512 × 512) projections, with a tilt step of 1.5°. The pixel size of each image is 3.24 nm. As previously described, limited angle and large misalignment are co-occurring in the same nanotomography experiment, producing a dataset that can not be reconstructed in a straightforward manner. Figure 6 shows different reconstructions obtained with three alignment/reconstruction methods, in both the axial and longitudinal plane. The black dots are nanobeads, which are typically added to the sample before the cryostate, in order to create a set of landmarks for the semi-automatic registration of the tilt series [75,80]. To reduce the computation time, each projection was rescaled to 256 × 256 pixel. Panel (a) displays the output of a manual alignment procedure carried out with Tomoj [80], where the automatically tracked landmarks were refined manually. This type of correction is considered the best we can obtain from this dataset; volume slices are well separated, and the gold nano-beads appear spherical in the axial plane. The aligned set of projections was processed by SIRT for 30 iterations.
Panel (b) instead displays the output of a fully automatic reconstruction procedure carried out by employing the joint alignment-reconstruction method described in [81,94] (for details, see Section 1.6): in this case, the correction model solely takes into account a set of shifts at the detector plane. The correction is indeed way simpler than the full 3D refinement model in [80], but it is the only algorithm of this type that is readily available in a CT reconstruction framework. SIRT was used as the base algorithm for the reconstruction, and in total, the optimisation routine iterated for 100 steps. In the axial plane, the beads appear elongated, which is a sign of an incomplete alignment.
Panel (c) shows the output of the proposed method (Equation (12)), which refines not only 2D shifts at the detector plane but also a detector tilt angle, which is common for all the projections. Compared to panel (b), the amount of misalignment is greatly reduced (see the axial plane). This kind of result can suffice for a preview, as the landmark detection and tracking alone (in the semi-automatic positions, refinement in panel (a)) takes the same time as the full reconstruction with the proposed method (5 min).

Ptychography Reconstructions
A different region of the Met5A sample has been observed through ptychography. The detector was placed in the far field (z sd = 70 cm), and the sample was illuminated with a curved wavefront produced by a defocused zone plate, with a focus-to-sample distance z f s of 350 µm. The dataset is composed of a set of 121 diffraction patterns acquired with a Princeton CCD camera, with a regular scan path [24]. We demonstrated [118] that during real experiments, we do not need to use special sampling schemes to avoid the raster grid pathology, as the mechanical position errors of the sample stage add a pseudo-random jitter on each position. Diffraction patterns are dark subtracted, centred and cropped to a set of (1024 × 1024) pixel images. The resulting pixel size is of roughly 36 nm.
As already mentioned, the technique is very sensitive to geometrical parameter estimation. Indeed, the data analysis is typically performed by manually estimating the optical distances; a set of coarse reconstructions are performed by sweeping the parameters, then each reconstruction is carefully inspected to choose the best value. The positions vector is instead automatically refined [114][115][116]118]. Here we show the output of a ptychography reconstruction algorithm, which is automatically able to estimate the correct propagation distance [13]: Figure 7a shows the phase reconstruction of the sample. From a manually tuned virtual distance of 0.24 mm, the method estimated instead a distance of 0.37 mm. This final estimated value takes into account not only a mere distance correction but also positions and beam energy, as the exponential factor in the propagation operator is directly proportional to all these three parameters; this final value is the one corresponding to a minimum in the parameter-loss space. Figure 7b shows the effect of different reconstruction algorithms on a particular ROI of the sample. The algorithms involved are, respectively, DM [105], M-ePIE [102], M-rPIE [118] and, finally, the proposed AD-based algorithm [13].

Conclusions
In this manuscript, we described the results of an ongoing research on Automatic Differentiation methods applied to inverse microscopy problems. By realising that many computational imaging techniques can be solved by a common approach based on sampling, sparsity and error minimisation, we developed a modular framework that can be adapted to solve compressive sensing, super-resolution, tomography and ptychography. This allows the prototyping of ready-to-be-used algorithms in a fast manner, blurring out the border between the development and production phases. Fast prototyping allows us to experiment with multiple concepts across many fields and techniques. We showed how the proposed AD-based algorithms are not only easy to implement and reasonably fast but also provide comparable results, and in many cases superior to many state-of-the-art solutions. Prototypes for such reconstruction algorithms can be written in a reduced amount of time by non-experts and can also be adapted to parameter optimisation, mitigating various setup flaws. Even if an optimisation routine written by a software engineer will be the fastest solution, thanks to advanced GPU-enabled AD frameworks such as PyTorch the gap in speed can be underlooked if compared with the gain in reconstruction quality. If combined, these aspects make these methods available for a drop-in replacement within many analysis pipelines.
Funding: This research has been carried out within the Advanced Integrated Imaging Initiative (AI3)-project P2017004 of Elettra Sincrotrone Trieste.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Tomography dataset 1 is part of the Tomopy [141] software suite. Tomography dataset 2 was kindly donated by Dr. Francesco Brun and is available from [94]. Tomography dataset 3 has been used in [80] and is part of the Tomoj software suite tutorial [145]. The ptychography dataset/code is available from [13]. CS/SISR/CT code is available from [146].

Acknowledgments:
We are grateful to Roberto Borghes for his fundamental work on the TwinMic microscope control system and to Iztok Gregori for his work on the HPC solution.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: