CLAIRE—Parallelized Diffeomorphic Image Registration for Large-Scale Biomedical Imaging Applications

We study the performance of CLAIRE—a diffeomorphic multi-node, multi-GPU image-registration algorithm and software—in large-scale biomedical imaging applications with billions of voxels. At such resolutions, most existing software packages for diffeomorphic image registration are prohibitively expensive. As a result, practitioners first significantly downsample the original images and then register them using existing tools. Our main contribution is an extensive analysis of the impact of downsampling on registration performance. We study this impact by comparing full-resolution registrations obtained with CLAIRE to lower resolution registrations for synthetic and real-world imaging datasets. Our results suggest that registration at full resolution can yield a superior registration quality—but not always. For example, downsampling a synthetic image from 10243 to 2563 decreases the Dice coefficient from 92% to 79%. However, the differences are less pronounced for noisy or low contrast high resolution images. CLAIRE allows us not only to register images of clinically relevant size in a few seconds but also to register images at unprecedented resolution in reasonable time. The highest resolution considered are CLARITY images of size 2816×3016×1162. To the best of our knowledge, this is the first study on image registration quality at such resolutions.


Introduction
3D diffeomorphic image registration (also known as "image alignment" or "matching") is a critical task in biomedical image analysis [1,2]. For example, it enables the study of morphological changes associated with the progression of neurodegenerative diseases over time or in imaging studies of patient populations. The process of image registration involves finding a spatial transformation which maps corresponding points in an image to those in another [1]. In mathematical notation, we are given two images m 0 (x) (the template/moving image) and m 1 (x) (the reference/fixed image; here x ∈ Ω ⊂ R 3 ) and we seek a spatial transformation y : R 3 → R 3 , such that the deformed template image m 0 (y(x)) is similar to the reference image m 1 (x) for all x (see Figure 1 for an illustration) [3,4]. Image registration methods can be categorized based on the parameterization for y [3]. We seek a diffeomorphic map y, i.e., y is a differentiable bijection and has a differentiable inverse. Approaches that parameterize y in terms of a smooth, time-varying velocity field v : R 3 × [0, 1] → R 3 belong to a class of methods referred to as large-deformation diffeomorphic metric mapping (LDDMM) [5][6][7]. In this study, we consider a related class of methods that use stationary velocity fields v : R 3 → R 3 . This diffeomorphic registration problem is expensive to solve because the problem is infinite-dimensional, and upon discretization results in a nonlinear system with millions of unknowns-even for stationary velocity fields. For example, solving the registration problem for two images of resolution 256 3 (a typical size for clinical scans) requires solving for approximately 50 million unknowns in space (three vector components per image grid point). Furthermore, image registration is a highly nonlinear, ill-posed inverse problem [8], resulting in ill-conditioned inversion operators. Consequently, running registration on multi-core high-end CPUs can take several minutes.
There exist various algorithms and software packages for fast registration of images at standard clinical resolution (e.g., 256 3 ) [9][10][11][12][13][14][15][16]. This includes CLAIRE, which can execute image registration in parallel on multi-node multi-core CPUs and GPUs [17][18][19][20][21][22]. We note that there is little work on scalable image registration. One application that requires this scalability is the registration of CLARITY images [23][24][25][25][26][27] with a resolution in the order of 20 K × 20 K × 1 K. This corresponds to a problem with approximately 1.2 trillion unknowns. In [22], we extended CLAIRE to support GPU-accelerated scalable image registration, which can process high resolution images using multiple GPUs. We demonstrated the scalability of our solver using synthetic images with a resolution up to 2048 3 and CLARITY mouse brain images of size 768 × 768 × 1024. In this work, we scale registration to an even higher resolution, e.g., CLARITY images of size 2816 × 3016 × 1162. This corresponds to an increase of 16× in problem size. In our previous work [17,19,[28][29][30][31][32][33], we have extensively studied the algorithmic side of image registration within the framework of CLAIRE. In this paper, we pay closer attention to the quality of the registration results. We study the effect of different input parameters, including the quality and resolution of the input images, on the accuracy of the registration. : Image registration is the task of computing spatial correspondences between two images of the same object. These correspondences, denoted as y, are indicated by the red arrows for example points within a single axial slice of the template and reference image data shown in panel (A). Before we execute CLAIRE, we compensate for the global mismatch between the considered images by performing an affine registration. In panel (C), we show an axial slice of the volume shown in panel (A) after an affine pre-registration step has been carried out; we execute CLAIRE on these images. Panel (D): CLAIRE outputs a diffeomorphic deformation map y that matches each point in the template image m 0 to its corresponding point in the reference image m 1 . We show a typical deformation map y in the leftmost image and the corresponding determinant of the deformation gradient (encodes volume change) in the second and third image from the left (axial and coronal slice). In CLAIRE, we invert for a stationary velocity field v that parameterizes y (second and third figure from the right; color denotes orientation). The last figure in panel (D) shows the point-wise residual after applying CLAIRE.

Contributions
We build on our prior work on scalable deformable image registration [17,[19][20][21][22][28][29][30][31]33] using CLAIRE and analyze the effect of image resolution on image registration accuracy. The present work analyzes image registration performance. We do not propose any major improvements in our methodology, with the exception of additional advice for hyperpa-rameter tuning. We outline our past contributions on formulations, algorithms, and their parallel implementation below. Our major contributions in the present work are: • We evaluate CLAIRE on high resolution synthetic and real image datasets. We demonstrate that image registration when performed at native high resolution results in higher accuracy (measured in terms of the Dice coefficient of the labeled structures in the images). We conduct experiments to show that downsampling the images and then registering them result in loss of registration accuracy. • We design scalable image registration experiments to explore the effect of solver parameters-the number of time steps n t in the semi-Lagrangian scheme, and regularization parameters β v and β w -on the registration performance at different image resolutions. • We present an extension of the regularization parameter continuation scheme first presented in [28] by searching for β w in addition to β v , thereby removing the need for selecting an additional resolution-dependent solver parameter. • We study the performance of our scalable registration solver CLAIRE for applications in high-resolution mouse and human neuroimage registration. We perform image registration for two pairs of CLARITY mouse brain images at a resolution of 2816 × 3016 × 1162 voxels. To the best of our knowledge, images of this scale have not been registered before at full resolution in under 30 min.

Related Work
The current work builds upon the open source framework CLAIRE [17,[19][20][21][22][28][29][30][31][32]34]. Our formulation for diffeomorphic image registration has been described in [28,29]. Our Newton-Krylov solver was originally developed in [28]. We proposed efficient numerical implementations for evaluating forward and adjoint operators in [17,30,33]. We designed various methods for preconditioning in [19,22,28,30]. The computational kernels of the parallel CPU implementation of our solver were introduced in [17,19,32]. More recently, we ported CLAIRE to GPU architectures [20,22]. In summary, our work paved the way towards real-time applications of diffeomorphic image registration and its deployment to a high-resolution medical imaging application. To the best of our knowledge, this is the only existing software for diffeomorphic image registration with these capabilities. We have integrated our framework with biophysical modeling in [31,32,[35][36][37]. None of these works explore registration performance in large-scale biomedical imaging applications.
None of the GPU-accelerated LDDMM methods mentioned above, except for CLAIRE [17,[19][20][21][22][28][29][30][31][32]34], use second-order numerical optimization. Many of the available methods solve the registration problem by reducing the number of unknowns either through a coarse parameterization or by using a coarse grid and use simplified algo-rithms. These crude approximations and simplifications can result in inferior registration quality [19,20].
The work in [25] focuses on annotating CLARITY brain images by registering them to the Allen Institute's Mouse Reference Atlas (ARA). They use a "masked" LDDMM approach. They also consider the registration of CLARITY-to-CLARITY brain images and compare different mismatch terms for the registrations. However, they downscale the images to a lower resolution for conducting all experiments. In [25], mutual information is used for the registration of CLARITY to the ARA dataset but at an approximately one hundred times downsampled resolution (at an original in-plane isotropic resolution of 0.58 µm). The authors in [67] analyze registration performance on high-resolution mouse brain images of size 2560 × 2160 × 633 obtained using the CUBIC protocol [68]. They report results using different software packages including ANTs and Elastix. They did not observe a relationship between registration accuracy at different resolutions. For their high-resolution runs using ANTs, they report a wall clock time of over 200 h on a single compute node (2.66 GHz 64bit Intel Xeon processor with 256 GB RAM) while the same run with elastix [69] took approximately 30 h. The authors in [70] register high-resolution images of mouse brains to the ARA dataset [71]. They perform nonlinear registration using ANTs at coarse resolution (10 µm for the ARA) and apply the deformation at high-resolution. In the current work, we do not downsample high-resolution images but register them at the original resolution. We can register CLARITY images of resolution 2816 × 3016 × 1162 in less than 30 min using 256 GPUs. In addition to that, we study the effect of resolution on the registration quality.

Outline
We summarize the overall formulation in Section 2.1 and the algorithms in Section 2.2 for completeness. We note, that all of the material presented in Sections 2.1 and 2.2 has been discussed in detail in [19,22]. In Section 3, we present our kernels and parallel algorithms and discuss key solver parameters. We also introduce a new scheme to automatically identify adequate parameters of our solver for unseen data. This scheme extends on our prior work in [19]. We conclude with the main scalability experiments in Section 5, and present conclusions in Section 6.

Limitations
CLAIRE currently only supports mono-modal similarity measures, which limits our study to registrations for images acquired with the same imaging modality. Moreover, CLAIRE only supports periodic boundary conditions, i.e., we require that the image data be embedded in a larger background domain. In most medical imaging applications, the images are embedded in a zero background and, therefore, naturally periodic. If the images are not periodic, they can be zero-padded and mollified. CLAIRE uses stationary velocities, which improves computational efficiency, but is suboptimal from a theoretical point of view. In [28], we found no qualitative differences in registration mismatch when registering two images using stationary velocities. This observation is in line with the work of other groups using stationary velocity fields [72,73]. Regarding computational performance, one issue is the memory requirement of our method. We have optimized memory allocation for the core components of CLAIRE. Additional optimizations by reusing and sharing memory across external libraries to further reduce the memory load remain subject to future work.

Methods
Before discussing our enhancements in Section 3, we shortly introduce the underlying mathematical formulation of the image registration problem utilized in CLAIRE as well as the discretization and the numerical algorithms. The following exposition is only included for completeness, and is based on material described in our prior work on efficient algorithms for diffeomorphic image registration [17,19,20,22,[28][29][30]32,33]. Consequently, we keep this section brief.

Formulation
We summarize our notation in Table 1. CLAIRE uses an optimal control formulation. We parameterize the deformation map y(x) through a smooth, stationary velocity field v(x). The optimization problem is: Given two images m 0 (x) (template image; image to be deformed) and m 1 (x) (reference image), we seek a stationary velocity field v(x) by solving on a rectangular domain Ω ⊂ R 3 with periodic boundary conditions on ∂Ω. The first term in Equation (1) is a squared L 2 image similarity metric, which measures the distance between the deformed template image m(x, t = 1) and the reference image m 1 (x). The objective functional in Equation (1) additionally consists of two regularization models that act on the controls v and w with regularization parameters β v > 0 and β w > 0, respectively. The regularization operators are introduced to prescribe sufficient regularity requirements on v and its divergence ∇ · v. Smoothness of the velocity guarantees that the computed map is diffeomorphic [5][6][7]. We refer to [29] for details about our regularization scheme. The default configuration of CLAIRE is an H 1 -Sobolev-seminorm for v and H 1 -Sobolevnorm for w [19,29]. The transport equation in Equation (3) represents the geometrical deformation of m 0 (x) by advecting the intensities forward in time.
To solve Equation (1) subject to Equations (2)-(4), we apply the method of Lagrange multipliers to obtain the Lagrangian functional with state, adjoint, and control variables (m, λ, p, v, w) := φ, respectively.  To derive the first order optimality conditions, we take the variations of L with respect to the state variable m, the adjoint variables λ and p, and the control variable v. This results in a set of coupled, hyperbolic-elliptic PDEs in 4D (space-time). CLAIRE uses a reduced-space approach, in which one iterates only on the reduced-space of v. We require g(v ) = 0 for an admissible solution v , where

Symbol Description
is the so-called reduced gradient. The operator A corresponds to the first variation of the regularization model for v (i.e., reg v in Equation (1)) and the operator K projects v onto the space of near-incompressible velocity fields (see [29] for details). To evaluate Equation (6), we first solve the forward problem in Equation (3) and then the adjoint problem given by with final condition λ(x, t) = m 1 (x) − m(x, t) in Ω × {1} and periodic boundary conditions on ∂Ω.

Discretization
We discretize the forward and adjoint PDEs in the space-time interval Ω × [0, 1], Ω := [0, 2π) 3 ⊂ R 3 , with periodic boundary conditions on ∂Ω, on a regular grid with N = N 1 N 2 N 3 grid points x ijk ∈ R 3 in space and n t + 1 grid points in time. We use a semi-Lagrangian time-stepping method to solve the transport equations that materialize in the optimality system [17,30]. Key computational subcomponents of this scheme are 2nd-order Runge-Kutta time integrators and spatial interpolation kernels [17,19,22,30].
To solve the transport equation given by Equation (7) and to evaluate the reducedgradient g in Equation (6), we need to apply gradient and divergence operators. We use an 8th finite difference (FD) scheme for these first-order differential operators [20,22]. The reduced gradient in Equation (6) also involves the vector-Laplacian A and the Leray-like operator K (see [29]). In spectral methods, inversion and application of higher-order differential operators come at the cost of two FFTs and one Hadamard product in Fourier space.

Gauss-Newton-Krylov Solver
CLAIRE uses a Gauss-Newton-Krylov method globalized with an Armijo line search to solve the non-linear problem g(v ) = 0 [19,28]. The iterative scheme is given by where H ∈ R 3N,3N is the discretized reduced-space Hessian operator,ṽ k ∈ R 3N the search direction, g k ∈ R 3N a discrete version of the gradient in Equation (6), α k > 0 a line search parameter, and k ∈ N the Gauss-Newton iteration index. We have to solve the linear system in Equation (8) at each Gauss-Newton step. We do not form or assemble H; we use a matrix-free preconditioned conjugate gradient (PCG) method to solve Hṽ k = −g k forṽ k . This only requires an expression for applying H to a vector that we term Hessian matvec.
In continuous form, the Gauss-Newton approximation of this matvec is given by Similarly to the evaluation of the reduced gradient in Equation (6), the application of the Hessian to a vector in Equation (9) requires the solution of two PDEs to find the spacetime fieldλ (see [19,28,29] for details). Consequently, solving the linear system with H in Equation (8) is the most expensive part of CLAIRE. Preconditioning of the reduced-space Hessian system can be used to alleviate these computational costs.
In [22], we have introduced a zero velocity approximation for H as a preconditioner. This preconditioner can be applied at full resolution and through a two-level coarse grid approximation (see [22] for details). The latter variant represents the default considered in the present work.

Computational Kernels and Parallel Algorithms
At each Gauss-Newton step, we have to solve the forward and the adjoint equations for the reduced gradient and the Hessian matvecs. The main computational cost in CLAIRE constitute FFTs for (inverse) differential operators, scattered data interpolation (IP) for the semi-Lagrangian solver, and FD for computing first order derivatives (see [17,20,22,30] for a detailed description of these computational components). The distributed memory CPU implementation of CLAIRE uses AccFFT [74,75] for spectral operations [17,32]. In the single GPU setup, we use the highly optimized 3D FFT operations provided by NVIDIA's cuFFT library. In the multi-node multi-GPU setup, we use a 2D slab decomposition to leverage 2D cuFFT functions. We decompose the spatial domain in x 1 direction, which is the outer-most dimension, and the spectral domain in the x 2 direction. Let p be the number of MPI tasks. Then, each MPI task gets (N 1 /p) × N 2 × N 3 grid points, where N 1 , N 2 , N 3 are the image dimensions. We have discussed the implementation details and shown scalability of the FFT kernel in [22].
The parallel implementation of our IP kernel on CPUs was introduced in [17] and improved in [32]. In [20], we explored linear, cubic Lagrange, and cubic B-spline interpolation schemes for the interpolation kernel on a single GPU setup. In [22], we ported these kernels to the multi-node multi-GPU setup and made several optimizations. In the present study, we use linear interpolation to evaluate the image intensities at the off-grid points (also called characteristic points) in our semi-Lagrangian scheme. Depending on the image data layout and the velocity field, the IP kernel requires scattered peer-to-peer communication of off-grid points between the owner and the worker processors.
The CPU version of CLAIRE uses FFTs for spatial derivatives [17,19,32]. In [20], we introduced the 8th order FD kernel to evaluate first order derivatives, i.e., spatial gradients and divergence operators on a single GPU. In [22], we ported the FD kernel to the multi-GPU setup. We use the FD kernel for computing first-order derivatives throughout the registrations performed in this paper.
CLAIRE uses CUDA-aware MPI in the multi-node multi-GPU setup, thereby avoiding unnecessary CPU-GPU communication and automatically utilizing the high-speed on-node NVLink interconnect bus between GPUs if it is available.

Compute Hardware and Libraries
All runs reported in this study were executed on TACC's Longhorn system in single precision. Longhorn hosts 96 NVIDIA Tesla V100 nodes. Each node is equipped with four GPUs and 16 GB GPU RAM each (i.e., 64 GB per node) and two IBM Power 9 processors with 20 cores (40 cores per node) at 2.3 GHz with 256 GB memory. Our implementation uses PETSc [76,77] for linear algebra, the PETSc TAO package for nonlinear optimization, CUDA [78], thrust [79], cuFFT for FFTs [80], niftilib [81] for serial I/O for small images and PnetCDF [82] for parallel I/O for large scale images, IBM Spectrum MPI [83], and the IBM XL compiler [84].

Key Solver Parameters
Here, we summarize the key parameters of CLAIRE and discuss their effect on the solver and previous strategies to choose suitable values. In Section 3.4, we present our algorithm to choose these parameters in a combined continuation approach.
• β v -regularization parameter for the velocity field v. Large values for β v result in very smooth velocities and, thus, maps that are typically associated with a large final image mismatch. Smaller values of β v allow complex deformations but lead to a solution that might be close to being non-diffeomorphic due to discretization issues. From a user application point of view, we are interested in computing velocity fields, for which the Jacobian determinant, i.e., the determinant of the deformation gradient F := ∇y, is strictly positive for every image voxel. This guarantees a locally diffeomorphic transformation (subject to numerical accuracy). In [28,85], we determined the regularization parameter β v based on a binary search algorithm. The search is constrained by the bounds on J = detF. That is, we choose β v such that J is bounded from below by J min and bounded from above by 1/J min , where J min ∈ (0, 1) is a userdefined parameter. The binary search is expensive because we solve the inverse problem repeatedly. For each trial β v , we iterate until the convergence criteria for the Gauss-Newton-Krylov solver is met then use the previous velocity field as an initial guess for the next trial β v . • β w -regularization parameter for the divergence of the velocity field w = ∇ · v.
The choice of β w , along with β v , is equally critical. Small values can result in extreme values of J and make the deformations locally non-diffeomorphic. As discussed above, in our previous work [28], we do parameter continuation in β v and keep β w fixed. This is sub-optimal for two reasons: (i) Both β v and β w depend on the resolution, so keeping β w fixed for all resolutions can result in deformations with undesirable properties, and (ii) doing continuation in β v alone does not ensure we get close enough to the set Jacobian bounds. Therefore, adding continuation in β w , which also affects the Jacobian, is necessary. • J min -lower bound for the determinant J of the deformation gradient. The choice of this parameter is typically driven by dataset requirements, i.e., one has to decide how much volume change is acceptable. CLAIRE uses a default value of 0.25 [19]. Tighter bound on the Jacobian, i.e., J min close to unity, will result in large β v and β w values leading to simple deformations and sub-par registration quality. Relaxing the Jacobian bound in combination with our continuation schemes for β v and β w can result in very small regularization parameters and extremely complex deformations. • n t -number of time steps in the semi-Lagrangian scheme. The semi-Lagrangian scheme is unconditionally stable and outperforms RK2 time integration schemes in terms of runtime for a given accuracy tolerance [30]. The choice of n t is based on the adjoint error, which is the error measured after solving Equation (3) forward and then backward in time. In [30], we conducted detailed experiments for 2D image registration and found, that even for problems of clinical resolution n x = 256 2 , n t = 3 (CFL = 10) did not cause issues in solver convergence. Increasing n t beyond a certain value will introduce additional discretization errors from the interpolation scheme.
• Resolution of v. We use the same spatial discretization for v as given for the input images. There exist image registration algorithms that approximate the registration deformation in a low-dimensional bandlimited space without sacrificing accuracy, resulting in dramatic savings in computational cost [15]. We have not explored this within the framework of CLAIRE. Note that [15] uses higher order regularization operators, which leads to smoother velocities compared to the ones CLAIRE produces, therefore enabling a representation on a coarser mesh. Moreover, CLAIRE uses a stationary velocity field, i.e., v is constant in time. In our previous work [28], we have demonstrated that stationary and time-varying velocity fields yield similar registration accuracy for registration between two real medical images of different subjects. More precisely, we did not observe any practically significant quantitative differences in registration accuracy for a varying number of coefficient fields in the case of timevarying velocity fields. Using a stationary velocity field is significantly cheaper and has a smaller memory overhead from a computational cost perspective.

Parameter Identification
Our algorithm to choose solver parameters proceeds as follows:

Resolution-Dependent Choice of the Interpolation Order and n t
As our GPU implementation is only available in single precision (unlike the CPU implementation [19], which is available both in single and double precision), we use cubic interpolation (B-splines/Lagrange polynomials) with n t = 4 (n t = 8 for linear interpolation) for resolutions up to n x = 256 3 . For higher resolutions, we use linear interpolation to save computational cost and increase n t proportionately to n x to keep the CFL number fixed.

Parameter Search Scheme for β v and β w
We perform a two-stage search scheme: (i) In the first part of the parameter search, we fix β w = β w,init (β w,init = 1 × 10 −5 ) and search for β v . The registration problem is first solved for a large value of β v = β v,init so that we under-fit the data. In our experiments, we set β v,init = 1. Subsequently, β v is reduced by one order of magnitude in every continuation step and the registration problem is solved again with the new β v . We repeat the reduction of β v until we breach the Jacobian bounds [J min , 1/J min ]. When this happens, we do a binary search for β v between the last two values and terminate the binary search when the relative change in β v is less than 10% of the previous valid β v . In addition, we put a lower bound β v,min = 1 × 10 −5 on β v . This lower bound is set purely to minimize computational cost. We denote the final value of β v as β * v . (ii) In the second part of the search, we do a simple reduction search for β w by fixing Starting with a given value β w,init , we reduce β w by one order of magnitude and repeat solving the registration problem with β * v and the respective value for β w until we reach J min . We put a lower bound β w,min = 1 × 10 −7 on β w in order to minimize computational cost. We take the last valid value of β w , for which the Jacobian determinant was within bounds and denote it as β * w . We fixed the value of β w,init = 1 × 10 −5 for all experiments and resolutions. We determined this value empirically by running image registration on a couple of image pairs at resolution 640 × 880 × 880 and 160 × 220 × 220 (see Section 5.4 for the images) for different values of β w,init . We report these runs in Table A1 (see Appendix B).
We evaluate the parameter search scheme for real world brain images and report the performance in Section 5.2. Furthermore, we use it as the default parameter search scheme for all the experiments presented in this paper.

Parameter Continuation Scheme for β v and β w
If we want to use target β * v and β * w values for a new registration problem, we can perform a parameter continuation which is exactly like the parameter search except that we neither perform the binary search for β v nor check for the bounds on J. In the first stage of the continuation, we solve the registration problem for successively smaller values of β v starting from β v = 1 and reducing it by one order of magnitude until we reach β v = 1ek where k = log 10 (β * v ) . Then we do an additional registration solve at β v = β * v . We fix β w = β w,init in the first stage. In the second stage, we fix β v = β * v and reduce β w from β w,init to β * w in steps of one order of magnitude. Whereas the expensive parameter search allows us to identify an optimal set of regularization parameters for unseen data, we use the parameter continuation scheme to speed up convergence. The combination of both is particularly efficient, for example in cohort studies, where we identify optimal regularization parameters for one image pair in the cohort and use the obtained parameters for all the other images.

Materials
We use publicly available image datasets for carrying out the image registration experiments in this paper (see Section 5). We summarize these datasets in Table 2. We discuss these datasets in detail. Table 2. We list the image datasets we use in our scalable registration experiments (see Section 5). All the datasets are accessible publicly and further discussed in Section 4. We list the dataset name tag (which we use to refer to them throughout the rest of the paper), the imaging modality, the number of images, the spatial resolution and the image resolution in voxels. For datasets with an isotropic spatial resolution, we only provide a single value. For datasets with anisotropic spatial resolution, we list the resolution in all three dimensions. For the SYN dataset, spatial resolution does not carry a physical meaning, so we only list the image resolution.

Dataset
Image

MUSE
This dataset consists of five real brain T 1 -weighted MRIs of different individuals. These images were segmented into 149 functional brain regions in a semi-automated manner, including manual corrections by expert radiologists [86]. We visualize this data in Section 5.2. These images are part of a bigger set of template images that were used for the development of the MUSE [87] segmentation algorithm. The original image size is 256 × 256 × 256 at a spatial resolution of 1 mm. This dataset is available for download through the neuromorphometrics website [86].

NIREP
Ref. [88] is a standardized repository for assessing registration accuracy that contains 16 T 1 -weighted MR neuroimaging datasets (na01-na16) of different individuals at an isotropic resolution of 1 mm. The original image size is 256 × 300 × 256 voxels. We resample these images to an isotropic image size of 256 × 256 × 256. We use the images na01-na10 for our experiments. This dataset is available for download through the GitHub link https://github.com/andreasmang/nirep (accessed on 5 January 2022).

SYN
We create four sets of synthetic template and reference images to assess image registration accuracy as a function of resolution. We create a set of synthetic reference images m 1 by solving Equation (3) using a given synthetic template image m 0 and a synthetic velocity field v. To construct the template image m 0 , we use a linear combination of high-frequency spherical harmonics. To be precise, we define the template image m 0 (x) as and image coordinates x := (x, y, z) ∈ (−π, π] 3 . In Equation (10), Y m l represents spherical harmonics of the form with parameters m, l, angular directions θ ∈ [0, π] and φ ∈ [0, 2π], and associated Legendre functions P m l . We choose m = 6, l = 8 for our setup.θ i andφ i are random perturbations in integer multiples of π/2 andx i ∈ [−0.4π, 0.4π] 3 is a random offset from the origin. The reference image m 1 (x) is generated by solving Equation (3) with initial condition m 0 (x) and velocity field v( where K = {4, 8, 12, 16}. We set the template and the reference base image size to n x = n = (1024, 1024, 1024). It is important to note that m 0 and m 1 possess only the discrete intensities i ∈ {1, 2, . . . , 10}. This allow us to naturally define ten labels l i 0 and l i 1 , corresponding to m 0 and m 1 , respectively, for all image voxels with intensity i for each i ∈ {1, 2, . . . , 10}. We show a 2D slice of the template m 0 and reference m 1 images for the case K = 4 in Section 5.3. The scripts for generating the template image m 0 and the synthetic velocity field v can be found at https://github.com/naveenaero/scala-claire (accessed on 5 January 2022). The reference image m 1 can be generated using CLAIRE [21,34].

MRI250
Ref. [89] is an in-vivo 250 µm human brain MRI image which consists of a T 1 -weighted anatomical data acquired at an isotropic spatial resolution of 250 µm. The original image size is 640 × 880 × 880 voxels. This image can be downloaded from [90]. We skull strip the dataset by downsampling it to 128 × 128 × 128 using linear interpolation and then manually create the brain mask in ITK-SNAP [91]. We upsample this brain mask back to the original resolution and then apply it to the original image. We use the tool fast [92] from the FSL toolkit [93][94][95] to segment the T 1 -weighted MRI into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF) to be able to evaluate the registration performance using Dice score (see Equation (13)) between the image labels before and after registration.

CLARITY
We use the dataset from [27,[96][97][98] which consists of 12 mouse brain images acquired using CLARITY-Optimized Light-sheet Microscopy (COLM). This dataset is available for download from [99]. These images have low contrast and are noisy. The in-plane resolution is 0.585 µm × 0.585 µm and the cross-plane resolution is 5 to 8 µm. The images are stored at eight different resolution levels with level zero being the full resolution and level seven being the lowest resolution. We use the images at resolution levels three and six in our experiments. These levels correspond to an in-plane resolution of 4.68 µm × 4.68 µm and 37.44 µm × 37.44 µm, respectively, which translates to images of size n = (2816, 3016) and n/8 = (328, 412) voxels. The cross-plane resolution is constant at all levels and corresponds to 1162 voxels. We select Control182, Fear197, and Cocaine178 as the test images in our experiments.

Results and Discussion
We test the image registration on real-world (see Sections 5.4 and 5.5) and synthetic registration problems (see Section 5.3). The measures to analyze the registration performance are summarized in Section 5.1. We evaluate the parameter search scheme (see Section 3.4) on a set of real brain images and present the results in Section 5.2. Furthermore, we explore the following questions in the context of scalable image registration: Question Q1: Do we need large-scale high-resolution image registration? Does the registration quality degrade when the registration is performed at a downsampled resolution when compared to performing registration at the original high resolution?
Question Q2: How does registration perform and scale for real, noisy, and highresolution medical images of human and mouse brains?

Measures of Performance
In our experiments, we evaluate both runtime performance (in terms of solver wall clock time) and the registration quality in terms of accuracy. For the latter, we use the following metrics: Let l 0 and l 1 be the binary label maps associated with the images m 0 and m 1 , respectively. Then, the Dice score D between the two label maps is given by where | · | denotes the cardinality of a set, and ∩ denotes the intersection of the two sets, respectively. We define D(l 0 , l 1 ) to be the Dice score pre-registration and D(l(t = 1), l 1 ) post-registration, where l(t = 1) is the label map that corresponds to the deformed template image m(t = 1). Furthermore, for a set of discrete labels l i , i = {1, 2, . . . , M}, where i corresponds to the label index, we define the volume fraction Using this definition, we compute the following statistics for the Dice coefficient: The Dice coefficient average D a given by the volume weighted average of the Dice coefficient given by and the inverse of the volume weighted average Dice coefficient given by Note that D vw gives more weight to labels with higher volume fractions while D ivw gives more weight to labels with smaller volume fractions.

Relative Residual r
This metric corresponds to the ratio of the image mismatch before and after the registration. It is given by

Characteristic Parameters
For each image registration, we also report the regularization parameters and the obtained minimum and maximum values of the determinant of the deformation gradient J := det F, i.e., the determinant of the Jacobian of the deformation map.

Visual Analysis
We visually support this quantitative analysis with snapshots of the registration results. The registration accuracy can be visually judged from the residual image, which corresponds to the absolute value of the pointwise difference between m(t = 1) and m 1 . The regularity of the deformations can be assessed from the pointwise maps of the determinant of the deformation gradient.

Experiment 1: Evaluation of the Parameter Search Scheme
We evaluate the parameter search scheme on a set of real brain images and compare the registration performance with a state-of-the-art SyN deformable registration tool in the ANTs toolkit.

Dataset
We use the MUSE dataset (see Section 4) for this experiment. After registration of the original T 1 -weighted images from this dataset, we use the image labels to evaluate the registration performance in terms of the volume weighted average Dice score D vw .

Procedure
Out of the five T 1 images, we select Template27 as the reference image m 1 and register the other four images to m 1 . For the registration, we use the parameter search scheme (see Section 3.4) to identify best regularization. We use linear interpolation and n t = 8 time steps in the semi-Lagrangian solver. For the Jacobian bound, we select J min = 0.1. In the parameter search, for each trial β v and β w , we drive the relative gradient norm g 2,rel = g 2 / g 0 2 to 1 × 10 −2 . Once we have found adequate β v and β w for each image pair, we rerun the image registrations using only parameter continuation. For a baseline performance comparison, we also perform registration on the same image pairs using the SyN tool in ANTs [39]. For ANTs, we use the "MeanSquares" (i.e., squared L 2 -) distance measure. We run CLAIRE on a single NVIDIA V100 GPU with 16GB of memory on TACC's Longhorn supercomputer. We run ANTs on a single node of the TACC Frontera supercomputer (system specs: Intel Xeon Platinum 8280 ("Cascade Lake") processor with 56 cores on 2 sockets (base clock rate: 2.7 GHz)). We use all 56 cores. We report the parameters used for ANTs in Appendix A.

Results
We report the obtained estimates for β v and β w as well as results for registration quality in Table 3. In Figure 2, we provide a representative illustration of the obtained registration results. We report baseline registration performance using ANTs in Table 4. We compare the Dice scores obtained for CLAIRE and ANTs in Figure 3. Table 3. Experiment 1: Performance of the parameter search scheme implemented in CLAIRE. We report results for the registration of four template images to the reference image Template27. We consider the squared L 2 -distance measure as image similarity metric. We restrict the Jacobian determinant J ∈ [0.1, 10] for these registrations. We report the following quantities of interest: (i) optimal regularization parameters β * v and β * w , (ii) minimum J min and maximum J max Jacobian determinant achieved, (iii) solver wall clock time in seconds, and (iv) label volume weighted Dice average D vw pre and post registration.  . We also visualize the residual before and after the registration along with the determinant of the deformation gradient and an orientation map for the velocity field.  Table 3 and ANTs in Table 4.

Observations
CLAIRE allows us to precisely control the properties of the deformation without having to tune any parameters manually. The only free parameters are the Jacobian bounds, which depend on the overall workflow related to the dataset. The volume weighted Dice scores D vw obtained for CLAIRE (see Table 3) are competitive to those produced by ANTs (see Table 4). The average runtime for ANTs for all the registrations reported in Table 4 is 201 s (≈3 min). For CLAIRE, the average wall clock time of CLAIRE in the parameter search mode is 9.8 min (3× slower than ANTs; we search for adequate regularization parameters), while, in the continuation mode, the runtime of CLAIRE is 64 s (3× faster than ANTs; we apply the optimal regularization parameter and do not search for them).

Experiment 2A: High Resolution Synthetic Data Registration
In this experiment, we answer Q1. We attempt this by executing our registration algorithm on synthetic imaging data. The advantages of using such images over real datasets are as follows: • They are noise-free, high contrast, and sharp, unlike real-world images. • There is a scarcity of high resolution real image data because it is expensive and time-consuming to acquire. We can control the resolution of synthetic data because the images are created using analytically known functions. • We can control the number of discrete image intensity levels, i.e., labels. Because these labels are available as ground truth, we can use them to precisely quantify registration accuracy through the Dice coefficient, avoiding inter-and intra-observer variabilities and other issues associated with establishing ground truth labels in real imaging data.
By performing image registration at different resolutions (and applying the resulting velocity to transform the high resolution original images), we want to check whether the registration at higher resolutions is more accurate than performing the registration at a lower resolution.

Dataset
We use the SYN dataset (see Section 4) for this experiment.

Procedure
We execute registration at different resolutions for the original resolution images and quantify the accuracy using the Dice coefficient for labels before and after the registration. We compare the Dice statistics for different resolutions. More specifically, we take the following steps:

1.
We register the template image m 0 to the reference image m 1 at the base resolution n to get the velocity field v n . We transport m 0 using the velocity v n to get the deformed template image m(t = 1) by solving Equation (3). Then, we compute the Dice score between l i (t = 1) and l i 1 , i ∈ 1, . . . , 10 which are discrete labels for m(t = 1) and m 1 , respectively, using Equation (13).

2.
We downsample m 0 and m 1 using nearest neighbor interpolation to half the base resolution (for example, n/2 = (512, 512, 512). Notice that we treat n x = (N 1 , N 2 , N 3 ) as a tuple. When we say n x /2, we mean n x /2 = (N 1 /2, N 2 /2, N 3 /2)) and register the downsampled images to get the velocityv n/2 . We upsamplev n/2 to the base resolution n using spectral prolongation and call it v n/2 . We transport m 0 using v n/2 by solving Equation (3) to get the deformed template image m(t = 1) and then compute the Dice score for this new deformed template image.

3.
We repeat the procedure in step 2 for resolutions n/4 and n/8 and compute the corresponding Dice scores.
For the registration, we fix the determinant J of the deformation gradient to be within [5 × 10 −2 , 20] and search for the regularization parameters using the proposed parameter search scheme as described above in Section 2. Note that we perform a search for an optimal regularization parameter for each individual dataset because we want to obtain the best result for each pair of images. In practical applications, this is not necessary (see comments below; we also refer to [19] for a discussion). We fix the tolerance for the reduction of the gradient to 5 × 10 −2 , which we have found to be sufficiently accurate for most image registration problems (see [19]). We use linear interpolation in the semi-Lagrangian scheme. Another hyperparameter in our registration solver is the number of time steps n t for the semi-Lagrangian (SL) scheme. We consider two cases for selecting n t :

1.
n t changes with resolution: We use n t = 4 time steps for the coarsest resolution n x = n/8 and double n t when we double the resolution in order to keep the CFL number fixed. All other solver parameters, except for the regularization parameters, are the same at each resolution.

2.
n t fixed with resolution: In order to study the effect of n t on the Dice score we keep n t fixed for each n x , instead of increasing n t proportionately to n x .

Results
In Figure 4, we visualize the template, reference and deformed template images for the synthetic problem constructed with K = 4. We report quantitative results for CLAIRE in Tables 5 and 6, respectively. In Figure 5, we compare the Dice score for individual labels as a function of their volume fraction α. In Figure 6, we visualize box plots of the Dice score for the registrations reported in Table 5.  Table 5. The value in the parentheses in column 1 indicates the resolution at which registration was done. The visualization is done at the original resolution n = (1024, 1024, 1024). In columns 2 and 3, we visualize cropped portions of the images shown in column 1 for specific label values. In column 2, we show label 1, in column 3, we show the union of labels with intensity value ≥ 5. Note that higher label values have smaller volumes and more fine-grained features. We plot the label boundaries for the reference image in green to visualize the registration errors.

Table 5. Experiment 2A: Registration performance for CLAIRE for case 1 (n t changes proportionally to the image resolution, see Section 5.3).
Comparison of registration accuracy based on the Dice score at different resolutions for the synthetic dataset SYN. K denotes the frequency of the synthetic velocity field in Equation (12). n = (1024, 1024, 1024) is the base image resolution. We fix the tolerance for the reduction of the gradient to 5 × 10 −2 and use linear interpolation. The Jacobian bounds for the parameter search are [0.05, 20]. We report β * v and β * w (the optimal regularization parameters obtained with the proposed parameter search scheme), and J min and J max (the minimum and maximum values for the determinant of the deformation gradient). For the Dice score, we report average Dice (D a ), the volume weighted average Dice (D vw ), and the inverse volume weighted average Dice (D ivw ), pre and post registration. We also report the wall clock time for the parameter search.  Table 6. Experiment 2A: Registration performance for CLAIRE for case 2 (n t independent of the image resolution). Comparison of registration accuracy using Dice at different resolutions for the synthetic dataset SYN. K denotes the frequency of the synthetic velocity field in Equation (12). n = (1024, 1024, 1024) is the base image resolution. We fix the tolerance for the reduction of the gradient to 5 × 10 −2 and use linear interpolation. The Jacobian bounds for parameter search is [0.05, 20]. For each value of n t , we report results for different resolutions. We report β * v and β * w (the optimal regularization parameters obtained with the proposed parameter search scheme), and J min and J max (the minimum and maximum values for the determinant of the deformation gradient). For the Dice score, we report average Dice (D a ), the volume weighted average Dice (D vw ), and the inverse volume weighted average Dice (D ivw ), pre and post the registration. We also report the wall clock time for the parameter search. The missing cases for K = 8 failed to finish in a reasonable time frame. We only report a couple of cases for K = 16 and expect a behavior similar to K = 8 for the rest.   Table 5 for K = 4.  We show box plots of the Dice scores for the individual labels before and after registration for different resolutions. We consider the synthetic test problem SYN. This figure corresponds to the registration results reported in Table 5.

Observations
The most important observations are: (i) The Dice score averages are better for registrations performed at the base resolution n with progressively worse Dice scores for registrations done at coarser resolutions; (ii) the difference between Dice scores for registrations done at successively coarser resolutions for K = 16 (rougher velocity field) is higher than at K = 4 (smoother velocity field); (iii) keeping n t fixed for the base and coarser resolutions does not affect the Dice score trend, i.e., the Dice decreases as n x is decreased. In the following, we give more details for these general observations.
Regarding Dice score averages in Table 5, we observe that D a , the arithmetic mean of the Dice scores of individual labels, drops by as much as 7% between run #13 and #14. However, the percentage drop in volume weighted Dice average D vw is smaller than in D a . This indicates that labels with higher volume are still easier to register at coarser resolutions. The inverse weighted Dice average D ivw , which gives more weight to smaller labels, features a more pronounced decrease because smaller regions contribute to the high frequency content in the image; this information is lost when the images are downsampled. We observe a 21.8% difference in D ivw for the high frequency images in run #13 and #14 for K = 16. As we increase the frequency K of the synthetic ground truth velocity, we see that the difference in all Dice score averages between successive resolutions increases. As K increases, we get increasingly rougher velocity fields, which we can not recover by registering the original images at coarser resolutions.
In Table 6, the Dice scores behave the same way even when n t is fixed for different n x , indicating that the loss in accuracy is primarily because of the reduction in the spatial resolution (and not the temporal resolution). We also observe that for the full resolution of n x = n, using n t < 32 results in slow solver convergence; the run did not finish in under 2 h. We attribute this slow convergence rate to the loss in numerical accuracy in the computation of the reduced gradient in Equation (6). If we compare run #1 and #9 in Table 6, we see that the difference in D a is marginal in comparison to the run time cost overhead for run #9. However, the accuracy difference increases as K is increased, and the images get less smooth (see runs #13 and #14).
These quantitative observations are confirmed by the visual analysis in the figures shown: From Figure 4, we observe that at lower resolutions (top to bottom), the alignment of the outlines (green lines; reference image) with the structures (white areas; deformed template image) is less accurate. Figure 5: shows that the Dice score is worse for labels with smaller volume fractions, i.e., fine structures are matched less accurately at coarse resolutions. Looking at Figure 6, we observe that the average registration accuracy decreases as we decrease the resolution.
We use 32 GPUs for registration at n x = (1024, 1024, 1024), 4 GPUs for n x = (512, 512, 512), and a single GPU for n x = (256, 256, 256) and n x = (128, 128, 128). Registration for n x = (1024, 1024, 1024) takes on average 44 min wall-clock time. It is important to note that this includes the time spent in the search for optimal regularization parameters (i.e., we solve the inverse problem multiple times using warm starts; see Section 3.4 for details regarding the scheme). For the large-scale runs that use multiple GPUs, the overall runtime of the solver is dominated by communication between MPI processes [20]. Adding more resources does not necessarily reduce the runtime because of this increase in communication cost. Registrations for n x = (512, 512, 1512) and lower resolutions are much quicker and run in the order of 10 min or less. In the present work, we perform the parameter search for each individual case because we want to obtain the best result for each pair of images. However, in practice where a medical imaging pipeline requires registrations for several similar images, we suggest running the parameter search scheme on one pair of images and use the obtained regularization parameters to run the cohort registration for all images, as we have done in our previous work [19]. This strategy reduces the computational cost drastically. One downside to this strategy is that some images in the cohort will not be registered as accurately as others.
Our experiment with synthetic images suggests that Dice scores are better when registrations are done in the original high resolution at which the labels were created. Registration accuracy is affected more significantly if high frequency velocity fields are considered. The images used in this experiment are synthetic and free of noise. We use these images for both registration and evaluation of performance using Dice scores. Because the ground truth labels for these images at the highest resolution are known with certainty, we have high confidence in our observations regarding registration accuracy: the Dice scores become worse when registration is conducted at lower resolutions. However, in practical applications, images have noise and low contrast. To evaluate the registration accuracy for real images using Dice scores, we first evaluate their segmentation using external segmentation tools. This segmentation step is prone to errors (not only due to noise and a lack of contrast but also due to inherent limitations in segmentation software themselves). These errors result in a misalignment between the structures present in the original image and its segmentation, which complicates our analysis. Having said this, we conduct experiments on real brain MRIs in the next section to explore if we can provide experimental evidence that at least partially confirms the observations we have made in this section.

Experiment 2B: High Resolution Real Data Registration
In this experiment, we aim at answering Q1 as well as Q2. We do this by registering real human brain MRI datasets instead of synthetic images. Unlike synthetic images, these images are not noise-free. Moreover, they lack high contrast.

Datasets
We use the NIREP and the MRI250 image datasets (see Section 4) for this experiment.

Procedure
We designate the MRI250 image as the template image m 0 . We generate the reference images m 1 from the images na01-na10 from the NIREP dataset since we do not have access to other T 1 -weighted MRI from a different subject at the original resolution of 250 µm. The acquired spatial resolution of the NIREP data is 1 mm, which is 4× larger than 250 µm. Therefore, in order to generate a reference image m 1 that are 250 µm in spatial resolution, we take the following steps:

2.
Register MRI250 to the upsampled NIREP image using CLAIRE and transport m 0 (which corresponds to the MRI250 image) using the resulting velocity v and solving Equation (3) to obtain the deformed template image m 1 = m(t = 1). We set the tolerance for the relative gradient norm to g tol = 1 × 10 −2 . We lower the tolerance compared to other runs to obtain a potentially more accurate registration result. We use the default regularization parameters β v = 1 × 10 −2 and β w = 1 × 10 −4 . Consequently, we do not perform a parameter search to estimate an optimal regularization parameter for this registration. We want to keep the downstream registration performance analysis, where we will use parameter search, oblivious to the process of generating the high-resolution reference image.
To generate a segmentation that we can use to compute Dice scores (not for the registration itself, which is done on the original unsegmented images), we use the tool fast from FSL [95] both on the template image m 0 and on the reference image m 1 . We generate labels WM, GM, and CSF. The remaining steps for this experiment are the same as described in experiment 2A in Section 5.3 except that here we are registering real T 1 -weighted images instead of noise-free synthetic images. The base resolution for this experiment is n x = n = (640, 880, 880). We consider n x = n/2 and n x = n/4 for the downsampled resolutions. We also consider the two sub-cases for selecting n t as we did in Section 5.3. For the case where n t changes with resolution, we use n t = 4 for n x = n/4, n t = 8 for n x = n/2 and n t = 16 for n x = n.

Results
We report the solver parameters for our registration with CLAIRE along with the relative residual r and Dice score averages for GM, WM, and CSF before and after the registration in Table 7. The relative residual r and the Dice score are always computed at the base resolution n = (640, 880, 880). The respective results with n t fixed independent of the resolution are given in Table 8 for na01. We visualize the image registration results for the reference image na01 in Figure 7.  Table 7). The base resolution is n x = n = (640, 880, 880). In row 1, from left to right, we show the T 1 -weighted MRI250 datasets (template image m 0 ), the upsampled na01 dataset (reference image m 1 ) from the NIREP data repository, and the deformed template images obtained from registration at resolutions n x , n x /2, and n x /4, respectively. In row 2, we show a cropped portion of the images from row 1. In rows 3 and 4, we show the label maps consisting of white matter (WM; white), gray matter (GM; light gray), and cerebro-spinal fluid (CSF; dark gray) and their cropped versions, respectively. In row 5, we show the image residuals before and after registration with respect to each resolution level. Table 7. Experiment 2B: Registration performance for CLAIRE for case 1 (n t changes proportional to the image resolution). Comparison of registration accuracy using Dice and relative residual r at different resolutions for the registration of the MRI250 brain image to templates generated from ten real MRI scans from the NIREP dataset. We consider three resolution levels n x = {n, n/2, n/4} where n = (640, 880, 880). We fix the tolerance for the relative gradient to 5 × 10 −2 . We use linear interpolation in the semi Lagrangian scheme. The bounds for the determinant of the deformation gradient for the parameter search are [0.05, 20]. We report the regularization parameters β * v and β * w obtained through the proposed parameter search scheme, the minimum and maximum determinant of the deformation gradient (J min and J max ), the relative residual (r), the average Dice (D a ), pre and post the registration, as well as the wall clock time for the parameter search.   Table 8. Experiment 2B: Registration performance for CLAIRE for case 2 (n t independent of the image resolution). Comparison of registration accuracy using Dice and relative residual r for a fixed number of time steps n t at different resolutions for the registration of the real MRI datasets MRI250 and the reference image m 1 generated from na01 from the NIREP repository. We consider three resolution levels n x = {n, n/2, n/4} where n = (640, 880, 880). We fix the tolerance for the relative gradient to 5 × 10 −2 . We use linear interpolation in the semi Lagrangian schreme. The bounds for the determinant of the deformation gradient for the parameter search are [0.05, 20]. We keep the time step n t fixed. We report the regularization parameters β * v and β * w obtained through the proposed parameter search scheme, the minimum and maximum determinant of the deformation gradient (J min and J max ), the relative residual (r), the average Dice (D a ), pre and post the registration, as well as the wall clock time for the parameter search. The case with n x = n and n t = 4 failed to finish in under 4 h.

Run
NIREP n x n t β *

Observations
The most important observation is that the relative residual r increases and Dice score averages decrease for registrations done at coarser resolutions irrespective of whether we increase n t proportionally to the resolution, see Table 7 or keep n t fixed for different n x , see Table 8. This observation is in line with the experiment for the synthetic dataset SYN in Section 5.3. Except for the case of na04 (see runs #10 and #11 in Table 7), all other cases exhibit increasingly worse registration performance at coarser resolutions.
In Section 5.3, we used synthetic, noise-free, high-contrast images for assessing the registration accuracy at different resolutions. Here, we repeat the same experiment with real world images-T1-weighted MR images of the human brain. We used an external software to segment these images to provide the necessary labels to be able to quantify registration performance in terms of Dice score. Notice that this additional segmentation step will inevitably introduce additional errors to our analysis. Due to these additional errors at the native resolution, we expect that the improvement in registration performance at high resolution may not be as pronounced as for the synthetic images considered in Section 5.3 (which did not require this additional segmentation step). This hypothesis is confirmed if we compare the average Dice score D a across experiments. In particular, if we reduce the resolution from n to n/4 in Sections 5.3 (see Table 5) and 5.4 (see Table 7), the Dice score drops by 15.25% compared to 9.5%, respectively.
In Table 8, the case with n t = 4 and n x = n took very long to converge (>4 h). For this case the CFL number is 15.66 during the inverse solve while for n t = 16, the CFL number is 4. The larger CFL number for n t = 4 yields a higher adjoint error in the SL scheme. This leads to higher errors in the computation of the reduced gradient, which results in worse convergence of the inverse solver for n t = 4. The run time overhead associated with using n t = 16 against n t = 4 is easily compensated by better solver convergence. We refer to [30] for a thorough study on the effect of n t on the numerical accuracy of the reduced gradient.

Experiment 3: Registration of Mouse Brain CLARITY Images
This experiment aims to answer both Q1 and Q2 by examining the performance of our scalable registration solver on ultra-high resolution mouse brain images acquired using the CLARITY imaging technique [26,99]. As opposed to the previous datasets, the dataset in this experiment does not provide any real metrics for its assessment other than the relative residual (nor are we aware of any segmentation software that would work on these data).

Dataset
We use the CLARITY dataset (see Section 4) for this experiment.

Procedure.
Preprocessing: For all unprocessed images, the background intensity is non-zero. We normalize the image intensities such that they lie in the range [0,1] with the background intensity re-scaled to zero. Next, we affine register all images to Control182 at 8× downsampled resolution using the SyN tool in ANTs. We report the parameter settings for the affine registration in the appendix. Subsequently, we zero-pad the images to ensure that periodic boundary conditions are satisfied for CLAIRE. After preprocessing, the base image resolution is n x = n where n = (2816, 3016, 1162) and n/8 = (328, 412, 1162), respectively. We only conduct the parameter search for a single pair of images (at both resolutions independently) for these sets of images and then perform the parameter continuation on the entire dataset. We only report wall clock times for the parameter continuation and not for the parameter search.
Deformable Registration: We register all images to the reference image Control182 using CLAIRE. We use the proposed parameter continuation scheme. We set J min to 0.05. We do this for both resolution levels. To compare the registration accuracy between each resolution level, we follow the same steps from Section 5.4. We compare the registration performance using the relative residual r. We do not have access to image segmentation for this dataset and, therefore, we cannot evaluate accuracy using Dice scores.

Results
We report the quantitative results for the registration of the CLARITY data in Table 9. We showcase exemplary registration results in Figure 8. Table 9. Experiment 3: Registration performance for CLAIRE for the CLARITY imaging data at resolutions n = (2816, 3016, 1162) and n/8 = (328, 412, 1162). Control182 is the fixed (reference) image. All other images selected from the CLARITY dataset are registered to Control182 using a parameter continuation scheme. We fix the tolerance for the relative gradient to 5 × 10 −2 . We use linear interpolation for the semi Langrangian scheme. The bounds on the determinant J of the deformation gradient for the parameter search are [0.05, 20]. We report the estimated regularization parameters β * v and β * w , the minimum and maximum values for the determinant of the deformation gradient (J min and J max ), the relative residual (r), as well as the wall clock time for the parameter continuation.  Figure 8. Experiment 3: Illustration of the registration performance for CLAIRE for the CLARITY mouse brain imaging data. We report registration results for the Cocaine178 dataset registered to the Control182 dataset. In row 1 (from left to right), we have the template image m 0 (Cocaine178), the reference image m 1 (Control182) and the deformed template image. The resolution of the images is n = (2816, 3015, 1162). In row 2, we show the determinant of the deformation gradient and the image residuals before and after registration.

Observations
The most important observation is that we can register high resolution real medical images reasonably well in under 2 h (see run #1 and #3 in Table 9). Unlike the previous experiments in Sections 5.3 and 5.4, the reported wall clock time in Table 9 is for performing the parameter continuation and not the parameter search. The average time spent for the regularization parameter search for resolution n x = n is ∼2 h. Another observation, which is in agreement with the results reported for the experiments carried out in Sections 5.3 and 5.4, is that the registration performed at downsampled resolution (see Table 9) results in a larger relative residual and, therefore, worse registration accuracy. We had a maximum of 256 GPUs (64 nodes, 4 GPUs per node) available to us at the TACC Longhorn supercomputer. Because of this resource constraint, our solver ran out of memory for certain parameter configurations (for example, for run #1 and #3, we could not use n t > 16 time steps). Moreover, for all the runs in Table 9, we used the zero velocity approximation of H as the preconditioner and applied it at full resolution. We did not use the two-level coarse grid approximation to apply the preconditioner because it requires additional memory for the coarse grid spectral operations.

Conclusions
In this publication, we apply our previously developed multi-node, multi-GPU 3D image registration solver [22] to study and analyze large-scale image registration. This work builds upon our former contributions on constrained large deformation diffeomorphic image registration [17,19,28,30,34]. The main observations are: (i) We are able to register CLARITY mouse brain images of unprecedented ultra-high spatial resolution (2816 × 3016 × 1162) in 23 min using parameter continuation. To the best of our knowledge, images of this scale have not been registered in previous work [22,97,98]. (ii) We conduct detailed experiments to compare image registration performance at full and downsampled resolutions using synthetic and real images. We find that image registration at higher (native) image resolution is more accurate. To quantify the accuracy, we use Dice coefficients wherever image segmentation is available and relative residuals otherwise. We also do a sensitivity analysis for the overall solver accuracy with respect to the number of time steps n t in the SL scheme. Overall, CLAIRE performed as expected; fully automatic parameter tuning works well, and higher image resolutions result in improved image similarity compared to the registration results in lower resolution. We note that these improvements in registration accuracy are less pronounced for real imaging data compared to synthetic data for the experiments conducted in this study. We attribute these observations to uncertainties and errors introduced during the registration and segmentation steps due to noise and low contrast. We discuss this in more detail in Section 5.  Institutional Review Board Statement: Ethical review and approval were waived for this study, because we conduct a retrospective analysis. All the datasets used in this study are either publicly available as mentioned in Section 4 or can be generated analytically, such as the SYN dataset. The ethics statements for those studies can be found in the corresponding references mentioned in Section 4.

Data Availability Statement:
We list the datasets we used for the experiments in this paper and the references and weblinks to download them. MUSE-This dataset consists of five T 1 -weighted MR images acquired at a spatial resolution of 1 mm. The image dataset is available for download at http: //www.neuromorphometrics.com (accessed on 5 January 2022). NIREP [88]-This dataset consists of 16 T 1 -weighted MR images of different individuals imaged at a spatial resolution of 1 mm. This image dataset is available for download at the github repo https://github.com/andreasmang/nirep (accessed on 5 January 2022). SYN-In Section 4, we described the exact process to generate the synthetic images and can be reproduced exactly using CLAIRE [21,34] and the scripts available at https://github.com/naveenaero/scala-claire (accessed on 5 January 2022). Due to the large size of these images (∼4 GB), we have not hosted them on a public server and these can be made available on request. MRI250 [89]-This is a dataset consisting of a single high resolution brain MRI acquired at a spatial resolution of 250 µm. This image can be downloaded from https://datadryad.org/ stash/dataset/doi:10.5061/dryad.38s74 (accessed on 5 January 2022). CLARITY [27,[96][97][98]-This is a dataset consisting of ultra-high resolution mouse brain images acquired using the CLARITYoptimized Light Sheet Microscopy (COLM) at a intra-planar spatial resolution of 0.585 µm and inter-planar resolution of 5 to 8 µm. This dataset is available for download at https://neurodata. io/data/tomer15/ (accessed on 5 January 2022). CLAIRE [19,21,34] is available publicly on github at https://github.com/andreasmang/claire (accessed on 5 January 2022) under the GNU General Public License v3.0.

Conflicts of Interest:
The authors declare no conflicts of interest.

Appendix A. Deformable Registration Parameters for ANTs
We report the deformable registration parameters for ANTs which were used for comparison with CLAIRE in the parameter search experiment Section 5.2.

Appendix B. Determining β w,init
We report the runs for comparison of runtime and Dice scores for different values of β w,init for the experiment conducted in Section 5.4. Table A1. Experiment 1b: effect of β w,init on registration performance for real brain images: Comparison of registration accuracy using Dice and relative residual r for different values of β w,init at different resolutions for registration of MRI250 brain image to na01 and na02 from NIREP dataset. We fix β w,min = 1 × 10 −9 . We consider n x = {N, N/4} where N = 640 × 880 × 880. We fix the tolerance for the relative gradient to 5 × 10 −2 . We use linear interpolation. The Jacobian bounds for parameter search are [0.05, 20]. We increase the number of time steps n t proportionately with increase in resolution. We report β * v and β * w , the regularization parameters from the parameter search scheme, J min and J max , the minimum and maximum Jacobian determinant the relative residual r, average Dice D a pre and post the registration and the wall clock time for the parameter search for the registration. We highlight the best Dice scores for each resolution and for each NIREP image.