Improving a Rapid Alignment Method of Tomography Projections by a Parallel Approach

The high resolution of synchrotron cryo-nano tomography can easily be undermined by setup instabilities and sample-stage deficiencies such as runout or backlash. At the cost of limiting sample visibility, especially in the case of bio-specimens, high-contrast nano-beads are often added to the solution to provide a set of landmarks for manual alignment. However, the spatial distribution of these reference points within the sample is difficult to control, leaving many datasets without a sufficient number of such critical features for tracking. Fast automatic methods based on tomography consistency are thus desirable, especially for biological samples, where regular, high-contrast features can be scarce. Current off-the-shelf implementations of this class of algorithms are slow when used on real-world high-resolution datasets. In this paper, we present a fast implementation of a consistency-based alignment algorithm especially tailored to a multi-GPU system. Our implementation is released as open source.


Introduction
Soft X-ray cryo-nano tomography is an effective imaging tool for analysing biological samples in their native environment, providing ultrastructural three-dimensional information on cells [1]. The high resolution of this technique, typically in the tens of nanometres [2], is paired with a favourable contrast behaviour in a specific energy range, the water window; moreover, it permits imaging in the hydrated state without staining, as sample preparation requires only the vitrification of water via a cryo-fixation process [3]. Phase-contrast techniques [4] are especially used in the hard X-ray regime, as the contrast due to phase shift is typically three orders of magnitude larger than the corresponding absorption [4].
To obtain a reconstruction from a series of 2D projections, tomography relies on a corpus of a priori information which constitutes an image formation model [5]. In its simplest form, the rotation axis is fixed in 3D space and its 2D projection lies in the central column (or row) of the detector, which is also fixed and locked to the source; the relative motion of the sample with respect to the detector is then described by a pure rotation.
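The simplest model above admits a compact numerical illustration: under ideal parallel-beam geometry, a projection of a 2D slice is obtained by rotating the slice and integrating along the beam direction, with the rotation axis projecting onto the central detector column. The following is a minimal, purely illustrative sketch (assuming a NumPy/SciPy environment, not part of the released software):

```python
import numpy as np
from scipy.ndimage import rotate

def forward_project(slice2d, angles_deg):
    # Ideal parallel-beam model for one slice: rotate the object by the
    # projection angle and integrate (sum) along the beam direction; the
    # rotation axis projects onto the central detector column throughout.
    return np.stack([rotate(slice2d, a, reshape=False, order=1).sum(axis=0)
                     for a in angles_deg])

# a centred disc phantom: its projections look the same at every angle
n = 64
y, x = np.mgrid[:n, :n] - n / 2
phantom = (x**2 + y**2 < 20**2).astype(float)
sino = forward_project(phantom, [0.0, 45.0, 90.0])
```

Any deviation from this model (a per-angle detector offset, for instance) makes the measured projections inconsistent with the reconstruction geometry, which is the misalignment problem discussed next.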

Projection Misalignment Problem
If any of the previous assumptions is not met, a dataset acquired under these conditions is defined as misaligned, and exploiting the ideal model introduced earlier produces several artefacts in the reconstruction. While for a constant systematic error [6] the misalignment can be corrected by a relatively easy determination of the rotation centre [7,8] (typically sufficient for µ-CT, e.g., in [9]), the mechanical imperfections of the setup become clearly detectable at the nanoscale [10]: temperature-dependent asynchronous errors [6] can be unpredictable at each projection angle. The physical cause of this issue lies in backlash and non-constant roundness of the bearings used for the sample stage, leading to eccentricity [11], runout, or spindle errors [2,6,12]. At the detector, the main observable effect of these misalignments is a jitter along x and y [12,13], the coordinate axes of the detector. Without proper correction, a dataset acquired under these circumstances is completely unusable, as a severe point spread function [14] is introduced in the reconstruction, jointly with a peculiar artefact [13]. Even if in some cases the error component can be measured and corrected [15] (also with interferometric encoding of the specimen position [11]), this is not the case for special setups, such as those employing cryo-stages [12,16], for which opto-mechanical breakthroughs even more advanced than the ones described in [17] are required to reach a sub-10 nm resolution. Post-acquisition alignment is thus usually necessary [12].
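The per-angle x, y jitter described above is straightforward to emulate on a clean dataset, which is also how misalignment-correction methods are commonly validated (a 10-pixel standard deviation is used in Section 3). A minimal sketch, with a hypothetical helper not taken from the released code:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def add_detector_jitter(projs, sigma=10.0, seed=0):
    # Emulate per-angle stage/detector jitter: every projection receives an
    # independent random (y, x) shift; zero padding mimics frame content
    # leaving the field of view.
    rng = np.random.default_rng(seed)
    shifts = rng.normal(0.0, sigma, size=(len(projs), 2))
    jittered = np.stack([nd_shift(p, s, order=1, mode='constant')
                         for p, s in zip(projs, shifts)])
    return jittered, shifts

clean = np.ones((200, 64, 64))         # stand-in for a projection stack
jittered, shifts = add_detector_jitter(clean, sigma=10.0)
```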

Post-Acquisition Alignment
Marker-based alignment methods are a common solution to the problem, especially in the case of biological samples, which exhibit low absorption contrast and/or not well-defined features; at the cost of decreasing the sample visibility (a measure of how critical this step is), gold nano-beads are added to the sample solution before cryo-fixation [2] and are tracked among the projections in a post-acquisition procedure. IMOD [18][19][20] and many other software frameworks [14,[21][22][23][24][25][26] reviewed in, e.g., [27] are used to manage this delicate pre-processing step. However, this approach is questionable, as for many biological specimens it is extremely difficult to control the spatial distribution of such fiducials. Large areas of the sample can be completely void of markers, or, conversely, the distribution can be so dense that many beads agglomerate with one another, yielding no reliable tracking information. Automatic feature-based approaches have been proven to work only in the presence of high-quality images [28].
Tomography self-consistency or bootstrap methods [29,30] instead try to infer the alignment parameters while reconstructing the volume (Section 2). Paired with a coarse alignment [31], these methods are usually the best choice to solve the misalignment problem. While largely automatic, these techniques are by their nature extremely demanding from a computational point of view and can require both commercial software and highly experienced users [16]. The rapid projection alignment method presented in [12] belongs to this class of algorithms, is implemented in the advanced Tomopy [7] software framework, and can be introduced in the pipeline by invoking just one line of code. Unfortunately, the implementation can be slow on high-resolution datasets such as those acquired with newer setups [4], undermining its use as a fast method.

Proposed Solution
Here, we propose a solution that improves the speed of the algorithm [12] by an order of magnitude by extensively exploiting a parallel approach in the entire processing pipeline. The algorithm is structured in a modular fashion, allowing us to use each accelerated component separately (e.g., the tomography reconstruction/synthesis module) through an API. The proposed solution, which is tailored for parallel-beam geometry, has also been implemented on a multi-GPU system and is tested against a multi-GPU-ready version of Tomopy. In Section 2, the method is described, while Section 3 presents the results by showing the acceleration capability of each module. Our software (provided as open source at [32]) has been tested on artificially misaligned µ-CT data, but also on a real high-resolution dataset acquired at MISTRAL, the soft X-ray transmission microscopy beamline at the ALBA synchrotron light source (Barcelona, Spain) [1][2][3].

Computational Methods
The main idea of reconstruction "self-consistency" [33] is that, at convergence, one can simulate from the reconstructed volume a set of projections which are virtually indistinguishable from the original dataset. This is exactly what is sought during a reconstruction in a deterministic, MSE-based iterative algorithm [5,13,34].
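The consistency criterion can be stated compactly: given the measured projections and those synthesised from the current volume estimate, the residual to be driven towards zero is a mean-squared error. A trivial sketch, for illustration only:

```python
import numpy as np

def consistency_error(measured, synthesised):
    # Self-consistency residual: at convergence, the projections simulated
    # from the reconstructed volume should be virtually indistinguishable
    # from the measured dataset, so this MSE should approach zero.
    return float(np.mean((np.asarray(measured) - np.asarray(synthesised)) ** 2))

# identical stacks are perfectly consistent; a shifted copy is not
a = np.zeros((4, 8, 8)); a[:, 2:5, 2:5] = 1.0
b = np.roll(a, 1, axis=2)
err_same, err_shifted = consistency_error(a, a), consistency_error(a, b)
```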
The setup parameters Θ may be made to concur with the object x to reduce the total reconstruction error; this is a recurring theme in many computational imaging techniques (e.g., in [28,34]). Indeed, in [13], an optimisation problem is cast involving the 2D detector offset and tilt for each projection. At the cost of increased complexity, more complete correction can be employed in the algorithms as described in [16,33].

Joint Reconstruction-Reprojection Method
Including many parameters directly within the optimisation pool can sometimes be detrimental, as the solution space is filled with local minima (the curse of dimensionality [35]), where the solution may stagnate [36].
When only the detector shift parameters are of interest [12], a different path is possible [29]: after having reconstructed the object with a conventional algorithm, a set of synthetic projections is generated from the reconstructed volume; these reprojections are then registered with the actual tomography dataset (Figure 1). The procedure is iterated until the shift parameters are nullified. The actual alignment is retrieved by using phase correlation [37] as a meaningful similarity metric for the two sets.
Figure 1. Illustration of the reprojection algorithm concept: a real projection (A) is affected by a severe misalignment in the x axis. A tomography reconstruction (not shown) will exhibit blurred details due to this defect. The synthesised projection (B) calculated for the same projection angle as (A) will be severely blurred but centred. A similarity measure is used to infer the parameters of the geometry transform which realigns (A) on (B), iteratively producing an aligned dataset.
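To make the loop concrete, a deliberately simplified version of the reconstruction-reprojection cycle is sketched below; `recon` and `reproject` are user-supplied stand-ins for the tomography operators (hypothetical signatures, not the released API), and registration uses the phase correlation of Scikit-Image:

```python
import numpy as np
from scipy.ndimage import shift as nd_shift
from skimage.registration import phase_cross_correlation

def align_by_reprojection(projs, angles, recon, reproject, n_iter=5):
    # Schematic reconstruction-reprojection loop: reconstruct, synthesise
    # projections, register each measured frame against its synthetic
    # counterpart, accumulate the shifts, and warp the ORIGINAL frames by
    # the accumulated value so frame information is not lost to padding.
    total = np.zeros((len(projs), 2))
    aligned = np.array(projs, dtype=float)
    for _ in range(n_iter):
        vol = recon(aligned, angles)
        synth = reproject(vol, angles)
        for i in range(len(projs)):
            d = phase_cross_correlation(synth[i], aligned[i],
                                        upsample_factor=10)[0]
            total[i] += d
            aligned[i] = nd_shift(projs[i], total[i], order=1)
    return aligned, total

# toy check with idealised operators: the "reprojection" is the true frame
ref = np.zeros((64, 64)); ref[20:40, 25:45] = 1.0
true_jitter = np.array([3.0, -2.0])
measured = nd_shift(ref, true_jitter, order=1)[None]
aligned, est = align_by_reprojection(measured, [0.0],
                                     recon=lambda p, a: p,
                                     reproject=lambda v, a: ref[None],
                                     n_iter=2)
```

In the real method the synthetic projections are of course blurred versions of the measured ones, and convergence takes many more iterations; the toy operators above only exercise the loop mechanics.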

Implementation Details
In recent times, faster execution of an algorithm is achieved mainly not by increasing the performance of a single processing unit, but by enlarging the number of execution units [38][39][40]. In a highly parallelisable problem such as parallel-beam tomography [41], the typical approach consists of exploiting the inherent data parallelism [40], meaning that the whole computation can be divided into many smaller problems which can be solved concurrently [38]. Not all computational imaging techniques are so blessed. Massively parallel accelerators such as GPUs are currently extensively used, thanks to a plethora of available GPU-ready algorithms (e.g., [42][43][44]) that in turn are implemented in several reconstruction frameworks such as [7,42,45].
From the analysis of the alignment algorithm [12], we identified three main areas for global acceleration (Figure 2): (1) a tomography reconstruction and projection component is required to obtain the volume estimate and the synthetic projections; (2) a motion estimation procedure is employed to determine the registration parameters; and (3) a warp interpolator is used to generate the newly registered dataset. While for the two latter modules we also studied a CPU-only parallel solution (Figure 2), for the CT component it was clear from the outset that a GPU implementation would be the only viable option. In this work, we deconstructed the algorithm by implementing each step with a highly parallelised version of each computational block; this is crucial to utilise all the computational resources at hand. The result is a modular structure that can be used not only to solve the tomography alignment problem as a whole: each module can also be used separately in any image-processing pipeline (e.g., Hough transform for pattern matching and computer vision applications, image registration, etc.).

Tomography Module
Even if several multi-GPU algorithms have been proposed in the literature [45][46][47][48][49][50], off-the-shelf, ready-to-use alternatives are currently scarce. This is especially true for iterative algorithms such as SIRT [51], which are essential in the case of cryo-nano tomography, as both the missing wedge problem [52] and the angular under-sampling can be severe. We released a simple but effective CT module (Figure 3) that allows the distribution of both the tomography forward and backward operators among n GPUs. The custom implementation is realised on top of the advanced ASTRA Toolbox framework [13,42] and consists of n CT servers which are configured to accept commands from an easy-to-use Python API; the API automatically slices the dataset into n portions along the row axis and uses a memory-mapped array for the entire data-scattering procedure, and each portion is processed concurrently. The results of each executor are gathered and finally composed along the row axis to produce the entire output (Figure 3). Through the ASTRA Toolbox, two 3D iterative reconstruction algorithms are available: SIRT3D_CUDA and CGLS3D_CUDA.
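The slicing strategy lends itself to a compact illustration: in parallel-beam geometry, each detector row belongs to an independent sinogram, so the projection stack [angles, rows, cols] can be cut along the row axis, processed concurrently, and re-composed. The sketch below is a simplified stand-in for the CT servers, using threads instead of GPU processes; `work_fn` is a hypothetical per-block operator:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def split_rows_dispatch(projs, n_workers, work_fn):
    # Cut the [angles, rows, cols] stack into row-blocks (one per worker,
    # i.e., one per CT server in the real module), process them
    # concurrently, and re-assemble the results along the row axis.
    blocks = np.array_split(projs, n_workers, axis=1)
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        results = list(ex.map(work_fn, blocks))
    return np.concatenate(results, axis=1)

data = np.arange(4 * 8 * 6, dtype=float).reshape(4, 8, 6)
out = split_rows_dispatch(data, 4, lambda b: b * 2.0)   # dummy operator
```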

Motion Estimation Module
As mentioned above, a similarity metric is required to estimate the registration parameters. Even if an iterative alignment algorithm was initially taken into consideration [53], a one-shot procedure such as phase correlation is desirable as a component of a fast method; for each iteration, only three DFTs and an element-wise multiplication are required, at least for its coarse-scale version. DFTs are certainly expensive operations, but accelerated algorithms are nowadays pervasive. Still, sub-pixel information is crucial, as hundreds of iterations of the entire algorithm [12] are likely to be performed: the procedure indeed accumulates the history of the shifts to produce the current registration value used to warp the dataset. In this way, the frame information is not progressively lost iteration after iteration due to padding. The use of the particular methodology presented in [37] is thus essential to obtain this fine-grained information at a reasonable speed.
The motion estimation procedure is carried out on each projection pair (acquired, synthesised); thus, this entire computational block can again be implemented with data parallelism concepts in mind, by splitting the whole problem into per-angle sub-procedures. We implemented two different pathways to achieve the algorithm acceleration: an accelerated CPU-only version is written by encapsulating the implementation of [37] already present in Scikit-Image [54] in a multi-threaded pipeline. The second solution instead consists of writing the sub-pixel phase correlation algorithm [37] in PyTorch [55], which is here used not as a deep learning tool but as an easy path to GPU programming. Similar to the previous case, a dispatcher splits the problem on n GPUs for concurrent execution. As will be shown in Section 3, the acceleration is large even in the case of the CPU-only multi-threaded solution.
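For illustration, a coarse (integer-pixel) phase correlation written in PyTorch might look as follows; this is only a sketch of the principle, with the sub-pixel matrix-multiplication refinement of [37] omitted, and it runs identically on CPU or GPU tensors:

```python
import torch

def phase_correlation_coarse(ref, mov, eps=1e-8):
    # Coarse phase correlation: two forward DFTs, an element-wise
    # multiplication forming the normalised cross-power spectrum, one
    # inverse DFT, and an argmax of the correlation surface.
    R = torch.fft.fft2(ref) * torch.conj(torch.fft.fft2(mov))
    R = R / (R.abs() + eps)
    corr = torch.fft.ifft2(R).real
    idx = int(torch.argmax(corr))
    h, w = corr.shape
    dy, dx = idx // w, idx % w
    # wrap peak coordinates into signed displacements
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return dy, dx

ref = torch.zeros(32, 32)
ref[10:20, 8:18] = 1.0
mov = torch.roll(ref, shifts=(5, -3), dims=(0, 1))  # displace content by (5, -3)
dy, dx = phase_correlation_coarse(ref, mov)         # shift that re-registers mov
```

Moving such a kernel to the GPU only pays off because, as discussed in Section 3, the per-pair workload (DFTs and element-wise products) is genuinely compute-bound.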

Warp Module
The warp module is used to transform the dataset by applying an affine transform to each projection. As the problem can again be split into several independent sub-problems, we can accelerate the algorithm both by moving the computation to the GPU and by exploiting the data parallelism concept; the tuple composed of (projection, parameters) thus represents the input of each subroutine, which is automatically dispatched among the available executors, which can be CPU cores or GPU cards. The GPU implementation is written in the PyTorch Python dialect and makes use of a parametrisable grid generator and grid sampler, the essential components originally developed for the 2D interpolator of a spatial transformer network [56]. In the results section, we will discuss the effects of the data transfer to the GPU.
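As an illustration of this mechanism, a per-projection translation can be expressed through the grid generator/sampler pair; the sketch below is a simplification of the warp module that handles pure shifts only, using `torch.nn.functional.affine_grid` and `grid_sample`:

```python
import torch
import torch.nn.functional as F

def warp_shift(projs, shifts):
    # Translate each projection with the grid generator / grid sampler pair
    # of the spatial transformer interpolator; a shift of d pixels
    # corresponds to 2*d/size in the normalised [-1, 1] grid coordinates.
    n, h, w = projs.shape
    theta = torch.zeros(n, 2, 3, dtype=projs.dtype)
    theta[:, 0, 0] = 1.0
    theta[:, 1, 1] = 1.0
    theta[:, 0, 2] = -2.0 * shifts[:, 1] / w   # x translation (columns)
    theta[:, 1, 2] = -2.0 * shifts[:, 0] / h   # y translation (rows)
    grid = F.affine_grid(theta, (n, 1, h, w), align_corners=False)
    return F.grid_sample(projs[:, None], grid, align_corners=False,
                         padding_mode='zeros')[:, 0]

# a single bright pixel moved down by 1 row and right by 3 columns
projs = torch.zeros(1, 8, 8)
projs[0, 2, 2] = 1.0
out = warp_shift(projs, torch.tensor([[1.0, 3.0]]))
```

The `padding_mode='zeros'` choice reproduces the black padding visible at the sinogram borders after alignment (Section 3); a full affine transform only requires filling the rotation/scale entries of `theta` instead of the identity.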

Results and Discussion
To measure the algorithm acceleration, we benchmarked each module against the corresponding block in the algorithm implementation [12] in Tomopy [7]. The system used for the testing is an HPC node, whose hardware and software configuration is summarised in Table 1.

Reconstruction Baseline
While initially Tomopy offered GPU reconstruction algorithms solely through interfaces [47,50] to single-GPU solutions, in a very recent version, a native 3D multi-GPU reconstruction algorithm can be invoked during the alignment procedure. However, no other parts of the algorithm are currently parallelised. This recent configuration represents the baseline we used in our manuscript.

Observation Variables
To verify the working principle of our method, it is crucial to check for a reduction in the execution time of each step of the proposed algorithm; this is done through benchmarks. As mentioned, to generate an aligned dataset, the procedure requires the steps described in Section 2 (see Figure 2): the reconstruction, the motion estimation, and the frame warp. This is why this section is divided into (1) "CT module benchmark"; (2) "Motion estimation benchmark"; and, finally, (3) "Warp module benchmark". Due to the complexity of its inner workings, the CT module would be difficult to use by calling the raw functions directly; an easy-to-import API has therefore been written to ease its integration into custom code, and a test code will be presented.
If no further processing is required (e.g., bead removal, deconvolution, etc.), the reconstruction generated by the method (as a by-product of the alignment procedure) can eventually be considered final.
As we are primarily concerned with execution speed, and considering that our implementation works in a numerically accurate manner that follows the original algorithm [12], our results are mainly based on speed comparisons.

CT Module Benchmark
Figure 4 shows the speed performance of the proposed solution (panel a) tested against the reference Tomopy multi-GPU setup (panel b). We reconstructed a set of projections with a size of 1024 × 1024 pixels each; each curve represents the time required to reconstruct a dataset with 57, 113, 225, or 450 angles and thus the same number of images. By looking at the highest number of angles (blue curve), it can be seen that the proposed solution is faster by a factor close to one order of magnitude. As our solution is modular, it has to be noted that our parallel CT module can be used as a stand-alone program which uses n GPUs. Embedding it in a custom code is quite straightforward: with the CT servers running (Section 2), a simple API can be called within a Python program, as shown in Listing 1.

Motion Estimation Module Benchmark
The motion estimation module has been tested by varying the number of projections in the dataset (x-axis on each panel of Figure 5), parametrising the curves on the size of each projection image. Compared with the baseline configuration (panel a), the performance gain is large even for the multi-threaded, CPU-only solution (panel b), but it can be improved even further by dispatching the load on the four GPUs (panel c). In the present case, we can effectively measure a speed improvement by switching to GPU computation because the actual workload is computationally intensive (DFTs, element-wise multiplications, and matrix multiplications [37,54]) and parallelisable [38], despite the additional latency introduced in the system by moving large arrays from the central memory to the GPU RAM.

# common requirements
import numpy as np
angles = np.arange(0, np.pi, np.pi/180)
# a 3D array of projections [angles, height, width]
prjs = np.random.randn(len(angles), 300, 600)
# load the SciCompCT API
from ct_server_lib_reinit import parallel_gpu_recon, prepare_angle, parallel_gpu_proj
# angles array in memory mapped I/O
basemempath = '/storage/fast/tmp_ct_server/'
anglepath = prepare_angle(angles, basemempath=basemempath)
# To generate projections from a 3D array
out = parallel_gpu_proj(rec, anglepath, nserv=4)
...
# To reconstruct a projection dataset
outrec, recerr = parallel_gpu_recon(prjs, angles, iterat=20, nserv=4,
                                    anglepath=anglepath, basemempath=basemempath)

Listing 1. SciCompCT module API usage example.

Warp Module Benchmark
Figure 6 shows the speed gain that can be obtained by accelerating the warping procedure. Similar to the previous case, panel a shows the baseline configuration, while panels b and c show, respectively, the speed of a parallel CPU-only implementation and of the multi-GPU approach.
Here, the latency associated with the GPU implementation is even larger due to the gathering of the results; indeed, this operation involves copying the results from the GPU back to the main memory. Despite this, we can still measure a considerable performance gain, as the computation of an affine transform is again computationally intensive, especially for a large number of high-resolution projections.

Entire Algorithm Test
To test the algorithm at work, we used an older µ-CT dataset acquired at the Syrmep beamline of the Elettra synchrotron facility [9,21]. Each CCD image is 600 × 300 pixels, and the dataset contains 450 projections of a mouse femur spanning a [0°, 180°] angular range, acquired with a parallel-beam setup. To simulate the misalignment, we added a random x, y jitter to each projection with a standard deviation of 10 pixels, obtaining a severely misaligned dataset, as can be seen in Figure 7, panel a; the use of the alignment algorithm allows correcting for the jitter, producing the aligned dataset in panel b. An off-axis feature seems completely missed in panel (a), but it is simply faint; after the alignment procedure, it becomes quite evident within the sinogram.

Figure 8 shows a real cryo-nano tomography dataset acquired at the MISTRAL beamline of the ALBA synchrotron facility; a biological sample of eukaryotic cell debris has been cryo-fixated on a gold grid and imaged at 900 eV [2]. Although the condenser optics of the soft X-ray microscope focuses the beam onto the sample, it is typical for this setup to assume a simplified geometry with an incoming parallel beam [1,2]. The objective Fresnel zone plate (FZP) lens collects the transmitted beam, producing a magnified image (×1000) at the back-illuminated CCD detector (Princeton Pixis XO, pixel size of 13 µm) [2]. The resulting dataset is a set of 121 images, acquired at a resolution of 1024 × 1024 pixels over a [−60°, 60°] angular range. The rotation run-out is about 300 nm, which is currently the state of the art [2], but due to the extremely complex setup and the large magnification factor (effective pixel size of 13 nm), a severe misalignment problem is inevitably present (Section 1). As can be seen in panel a, the sinogram of a particular off-axis feature is extremely jittered, and no "sine" can be recognised. Conversely, in panel b, the alignment makes it clearly pronounced.
The effect of the correction is more evident at the border, observing how each line is shifted along the axis; the black part is the result of the padding operation, whose effects are reduced by employing a shift-cumulation procedure. The alignment process in Figure 8 has been performed by exploiting a multi-scale approach, aligning the dataset iteratively at different scales. The reconstruction is re-initialised after each change of scale and starts from the set of projections aligned at the previous one. This kind of correction is completely unfeasible with the reference software configuration, as the required time would increase even further.
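The multi-scale schedule just described can be sketched as follows; `align_pass` is a hypothetical stand-in for one complete alignment run at a given scale, and the shifts found on the downsampled data are rescaled to seed the next, finer level:

```python
import numpy as np
from scipy.ndimage import zoom

def multiscale_align(projs, align_pass, scales=(4, 2, 1)):
    # Coarse-to-fine schedule: align heavily downsampled projections first,
    # rescale the recovered shifts to full resolution, and refine; the
    # reconstruction would be re-initialised at every change of scale.
    total = np.zeros((len(projs), 2))
    for s in scales:
        small = zoom(projs, (1, 1 / s, 1 / s), order=1)  # in-plane downsampling
        total += align_pass(small) * s                   # rescale shifts
    return total

# toy check: a pass that always reports a unit shift accumulates 4 + 2 + 1
projs = np.zeros((4, 32, 32))
total = multiscale_align(projs, lambda d: np.ones((len(d), 2)))
```

The coarse passes are cheap (the data shrink quadratically with the scale factor), which is why this refinement only becomes practical once each single pass is accelerated.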

Conclusions
In this paper, we proposed a modular framework for (nano)tomography alignment. In biological samples, regular, high-contrast features are often scarce, and this represents a problem if nano-beads are not in view either. If no robustly trackable features are present, the dataset is useless. Thus, an automatic alignment algorithm, such as the bootstrap method, becomes extremely useful. A well-known and well-performing CT bootstrap-based algorithm [12] has been analysed and deconstructed to isolate and determine its three essential components. This algorithm provides good results, but it can be slow, as its implementation is sequential and CPU-based. The entire algorithm has therefore been re-implemented: the modular approach we designed allows us to adopt parallelisation paradigms both for a multi-threaded CPU implementation and for a multi-GPU solution. As a result, the entire algorithm is globally accelerated. For each module, the benchmarks reported a performance gain close to one order of magnitude, allowing for a rapid correction of high-resolution datasets. The correction can be made even more accurate, as a multi-scale approach is now feasible. The software is released as open source and can be downloaded from [32]. Thanks to its modular structure, each component can easily be incorporated in a custom code. As a by-product, we provide "SciCompCT", a ready-to-use solution for a parallel-beam multi-GPU SIRT algorithm. We believe that the proposed approach could be particularly useful for low-density matrices such as biological samples.
Funding: This research has been partially developed under the Advanced Integrated Imaging Initiative (AI3), project P2017004, of Elettra Sincrotrone Trieste in agreement with the University of Trieste.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The µ-CT dataset is available at [32].