An Unsupervised Learning-Based Multi-Organ Registration Method for 3D Abdominal CT Images

Medical image registration is an essential technique for achieving spatial consistency between the geometric positions of different medical images obtained from single or multiple sensors, such as computed tomography (CT), magnetic resonance (MR), and ultrasound (US) images. In this paper, an improved unsupervised learning-based framework is proposed for multi-organ registration of 3D abdominal CT images. First, coarse-to-fine recursive cascaded network (RCN) modules are embedded into a basic U-net framework to achieve more accurate multi-organ registration results on 3D abdominal CT images. Then, a topology-preserving loss is added to the total loss function to avoid distortion of the predicted transformation field. Four public databases are selected to validate the registration performance of the proposed method. The experimental results show that the proposed method is superior to several existing traditional and deep learning-based methods and is promising for meeting the real-time and high-precision clinical registration requirements of 3D abdominal CT images.


Introduction
Sensors and sensing systems play important roles in various medical applications, including disease diagnosis, monitoring, preoperative planning, surgical navigation, and so on [1][2][3][4]. Registration is one of the fundamental technologies that enable sensing systems to be used in the above-mentioned applications [5][6][7][8]. Abdominal images taken from different individuals vary in shape and texture, owing to the complex intensity distribution of multiple organs, susceptibility to respiratory motion, and so on. Most existing methods cannot simultaneously meet the clinical requirements of high accuracy and real-time performance for full abdominal image registration. To solve these problems, many researchers have turned to segmentation-based abdominal image registration methods. For example, Li et al. [9] proposed a liver MR image registration method based on respiratory sliding-motion segmentation that achieves more accurate registration results. Xie et al. [10] proposed a lung and liver 4D-CT image registration method based on tissue features and ROI segmentation, which can be applied to clinical data.
In clinical practice, experts usually delineate the regions of interest (ROIs) of the target organ and organs at risk for treatment planning, and thereby the multi-organ ROIs of abdominal medical images can be naturally obtained. In the registration stage, traditional registration methods usually obtain the optimized deformation field by iteratively minimizing a custom energy function comprising data and regularization terms [11][12][13][14], and the deformation field of each pair of fixed and moving images is calculated independently. The registration time of the traditional methods is always substantial, especially when the pair-wise images exhibit large anatomical differences. For example, the registration time of methods such as Demons [15], Elastix [16], and free-form deformation with B-splines [17] on medical images ranges from minutes to hours.
To address this issue, many researchers have begun to pay more attention to learning-based registration methods implemented with convolutional neural networks (CNNs), which can be divided into supervised and unsupervised approaches [18][19][20][21]. These learning-based methods use a CNN model to obtain good initialization parameters for medical image registration. Most supervised methods require ground-truth deformation fields or anatomical segmentation masks, which are obtained from traditional registration tools or manual delineation. These approaches entail considerable data-labeling effort, and the registration performance is influenced by the quality of the labels.
The unsupervised methods can directly use unlabeled data to train CNN models, avoiding the expensive and time-consuming labeling work. For example, Lei et al. [22] presented a multi-scale unsupervised registration framework for abdominal 4D-CT images; it uses three loss functions, the similarity, adversarial, and regularization losses, to train its global and local subnetworks. Heinrich et al. [23] used a discrete displacement layer to improve the accuracy of an unsupervised learning-based 3D abdominal CT image registration framework. Balakrishnan et al. [24] proposed a U-net unsupervised registration framework, namely Voxelmorph, for 3D brain MR images; the regular similarity and regularization loss functions are used to train the framework, and an auxiliary data loss function is added in the testing stage. Zhao et al. [25] proposed a U-net unsupervised registration framework for liver CT and brain MR images, namely VTN; specifically, the affine transformation is integrated into the framework as a subnetwork to reduce the preprocessing time. Subsequently, the recursive cascaded networks (RCNs) by Zhao et al. [26] can be embedded into any base network as a general architecture. Both Kuang et al. [27] and Mok et al. [28] emphasized that distortion of the transformation field is non-negligible, and they integrated a topology-preserving loss into their total loss functions to prevent it.
Aiming to avoid time-consuming pre-processing and to maintain the topology-preserving property of the transformation, we developed an unsupervised learning-based registration framework for segmented multi-organ regions of 3D abdominal CT images. First, the recursive cascaded network (RCN) modules are embedded into a basic U-net framework to promote unsupervised end-to-end learning. Secondly, the affine transformation subnetwork is cascaded with the subsequent fine registration subnetworks to integrate the transformation field prediction from coarse to fine. Then, a topology-preserving loss is added to the total loss function for training the registration framework, and the resulting transformation field is used for abdominal multi-organ registration.
The main contributions of this paper are as follows. First, an unsupervised learning-based registration framework is proposed, which can automatically learn from unlabeled data, avoiding time-consuming expert labeling work. Secondly, the coarse-to-fine RCN modules are embedded into the framework, which can efficiently deal with large-scale deformation and improve the accuracy of the transformation field prediction. Moreover, an additional loss is integrated into the total loss function, which ensures that the registration preserves topology. Finally, the proposed method is shown to be more precise and faster than several existing registration methods on multi-organ regions of 3D abdominal CT images.

Methods
The essence of image registration is to find the mapping relationship between the fixed image I_f and the moving image I_m that aligns I_m to I_f through a reasonable transformation T. Generally, the energy function for optimizing T is as follows [29]:

T* = arg min_T [ L_S(I_f, I_m ∘ T) + λ L_R(T) ],

where L_S denotes the similarity term, L_R represents the regularization term that constrains L_S, and λ is an empirical constant. The uniform image domain of I_m, I_f, and T is Ω ⊂ R^D, where D = 3.

Optimization Problem Formulation
In this paper, the optimal parameters of the transformation T are estimated by an improved unsupervised learning-based model. The task of the model is to use a flow prediction function F(I_m, I_f) to obtain the transformation field T(I_m, I_f) through a recursive procedure. As shown in Figure 1, n modules are cascaded in the model, and each module contains a subnetwork that predicts a transformation field. Therefore, the warped moving image is obtained recursively as [25]:

I_m^(k) = I_m^(k−1) ∘ T_k, with I_m^(0) = I_m,

so the final warped image is I_m^(n) = I_m ∘ T_1 ∘ T_2 ∘ … ∘ T_n, and the output of the model is the composite:

T(I_m, I_f) = T_1 ∘ T_2 ∘ … ∘ T_n,

where T_1, …, T_n represent the transformation fields of the 1st to the nth modules, respectively. For instance, the kth predicted transformation field T_k can be represented as

T_k = f_k(I_m^(k−1), I_f),

where f_k is the kth prediction function of F. Therefore, the moving image is gradually warped by the modules.
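The recursive warping procedure can be sketched in a few lines. The following toy 1D NumPy illustration is a sketch, not the paper's implementation: `warp` is a nearest-neighbor stand-in for the trilinear spatial transformer, and the `predictors` are hypothetical stand-ins for the subnetworks f_k.

```python
import numpy as np

def warp(image, flow):
    """Toy 1D warp: warped[x] = image[x + flow[x]] with nearest-neighbor
    sampling (a stand-in for the trilinear spatial transformer)."""
    n = image.shape[0]
    coords = np.clip(np.round(np.arange(n) + flow).astype(int), 0, n - 1)
    return image[coords]

def recursive_register(moving, fixed, predictors):
    """Cascade of n flow predictors f_k: each T_k is predicted from the
    currently warped moving image, which is then warped again, so the
    moving image is gradually aligned to the fixed image."""
    warped = moving
    flows = []
    for f in predictors:
        flow = f(warped, fixed)       # T_k = f_k(I_m^(k-1), I_f)
        warped = warp(warped, flow)   # I_m^(k) = I_m^(k-1) composed with T_k
        flows.append(flow)
    return warped, flows
```

With two constant-shift predictors, the moving image is shifted twice, illustrating how each module refines the output of the previous one.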

Architecture of the Unsupervised Learning-Based Networks
First, we embed the coarse-to-fine modules of RCNs into the basic U-net framework, intending to improve its registration performance on multi-organ regions of 3D abdominal CT images. The RCN modules successively predict their corresponding flow fields, as shown in Figure 1. Then, we integrate a topology-preserving loss into the total loss function to avoid distortion of the predicted transformation T of the registration framework.

Coarse Registration
Affine transformation is widely applied to coarsely register pair-wise medical images as pre-processing because it can reduce the registration errors caused by the difficulty of predicting a large deformation between the two images. Researchers commonly use conventional software to perform this step in a traditional way; however, it is time-consuming and requires manual operations. To solve these problems, the framework of the proposed method assigns the first subnetwork (namely, the coarse-subnetwork) to predict the affine transformation field with a small computational burden. The coarse-subnetwork contains a series of downsampling operations followed by a fully connected layer, and its architecture is the same as that of [25]. First, the input images are sequentially resampled to 64³, 32³, 16³, 8³, and 4³ by convolution layers with a uniform kernel size of 3³ and a stride of 2. Then, the flow transformation field is predicted according to the output parameters of the fully connected layer. The flow transformation field is represented as [25]:

T(x) = Ax + b,

where x denotes a voxel in the image domain Ω, A is a transform matrix, and b is a displacement vector. The moving image I_m is first warped by the predicted affine transformation parameters and then becomes the initial input for the fine registration subnetworks (namely, the fine-subnetworks).
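The dense affine flow field can be built directly from the 12 predicted parameters (A, b). A minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def affine_flow(A, b, shape):
    """Dense affine flow u(x) = A x + b evaluated at every voxel of a 3D
    grid of the given shape; A is 3x3 and b is a 3-vector, the 12 outputs
    of the coarse-subnetwork's fully connected layer."""
    grid = np.stack(np.meshgrid(*[np.arange(s, dtype=float) for s in shape],
                                indexing="ij"), axis=-1)  # grid[i,j,k] = [i,j,k]
    return grid @ A.T + b  # shape (*shape, 3)
```

A spatial transformer would then sample the moving image at x + u(x) to produce the coarsely registered input for the fine-subnetworks.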

Fine Registration
For fine registration, each subnetwork has the same encoder-decoder architecture as U-net [30]. Each contains five resolution levels (64³, 32³, 16³, 8³, and 4³) to obtain multiple receptive fields. Skip connections concatenate the features of the same resolution from the encoding to the decoding stages, enabling the subnetwork to obtain more accurate predictions of the transformation field. For a pair of images, the transformation prediction is implemented by first extracting features in the encoding stage and then restoring the image resolution in the decoding stage. The uniform kernel size and stride of the subnetworks are 3³ and 2, respectively. All the subnetworks except the coarse-subnetwork progressively predict the flow transformation field for fine registration.

Loss Functions
The unsupervised learning-based networks are trained by minimizing the following loss function:

L_Total = L_Coarse + L_Fine,
L_Coarse = L_S + λ_1 L_R1 + λ_2 L_R2,
L_Fine = L_S + λ_3 L_R3 + λ_4 L_R4,

where L_Total denotes the total loss, and L_Coarse and L_Fine represent the subtotal losses for the coarse- and fine-subnetworks, respectively. L_S is the similarity loss in both L_Coarse and L_Fine. L_R1 to L_R4 are the regularization terms of the corresponding subtotal loss functions, and λ_1 to λ_4 are empirical constants.
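The weighting scheme can be written down directly. The sketch below assumes scalar loss values and the weight settings reported later in the paper (λ_1 = λ_2 = 10⁻¹, λ_3 = λ_4 = 10⁻⁵); in the actual framework, the fine subtotal is accumulated over all fine-subnetworks rather than a single term.

```python
def total_loss(sim_coarse, r1, r2, sim_fine, r3, r4,
               lam1=1e-1, lam2=1e-1, lam3=1e-5, lam4=1e-5):
    """Total training loss: coarse subtotal (similarity + orthogonality
    + determinant regularizers) plus fine subtotal (similarity + smoothness
    + topology-preserving regularizers), with empirical weights lam1..lam4."""
    l_coarse = sim_coarse + lam1 * r1 + lam2 * r2
    l_fine = sim_fine + lam3 * r3 + lam4 * r4
    return l_coarse + l_fine
```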

Similarity Loss
For mono-modal medical images, the correlation coefficient (CC) is a suitable similarity loss L_S for both L_Coarse and L_Fine, which is defined as:

CC(I_f, I_m ∘ T) = σ[I_f, I_m ∘ T] / sqrt( σ[I_f, I_f] σ[I_m ∘ T, I_m ∘ T] ),

where σ[·,·] denotes the covariance. The value of CC ranges from −1 to 1, indicating that the degree of correlation between the two images changes from completely anti-correlated to completely correlated.
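Since CC is maximized for well-aligned images, the training loss is its negative. A minimal NumPy sketch (the small epsilon guards against division by zero and is our addition, not from the paper):

```python
import numpy as np

def cc_loss(a, b, eps=1e-8):
    """Negative correlation coefficient of two images (flattened):
    -1 for perfectly correlated inputs, +1 for anti-correlated ones."""
    a, b = a.ravel(), b.ravel()
    cov = np.mean((a - a.mean()) * (b - b.mean()))
    return -cov / (a.std() * b.std() + eps)
```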

Regularization Terms for the Coarse-Subnetwork
To deal with large-scale deformation, we first employ the coarse-subnetwork to implement the affine transformation of the moving image I_m. To prevent the predicted transformation from deviating excessively from a rigid one, the orthogonality loss L_R1 and determinant loss L_R2 are used to regularize the similarity loss L_S in the coarse-subnetwork. L_R1 and L_R2 are respectively defined as [25,26]:

L_R1 = −6 + Σ_{i=1}^{3} ( μ_i² + μ_i^{−2} ),
L_R2 = ( −1 + det(A + I) )²,

where μ_i (i = 1, 2, 3) are the singular values of A + I, I is the identity matrix, and A is the affine transform matrix.
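Both regularizers vanish exactly when A + I is orthogonal with determinant 1 (i.e., a rotation), which is what drives the affine prediction toward a near-rigid transform. A NumPy sketch of the two losses as defined above:

```python
import numpy as np

def ortho_det_losses(A):
    """Orthogonality loss L_R1 = -6 + sum(mu_i^2 + mu_i^-2) over the
    singular values mu_i of M = A + I, and determinant loss
    L_R2 = (det(M) - 1)^2; both are zero when M is a rotation."""
    M = A + np.eye(3)
    mu = np.linalg.svd(M, compute_uv=False)
    l_r1 = -6.0 + np.sum(mu**2 + mu**-2)
    l_r2 = (np.linalg.det(M) - 1.0) ** 2
    return l_r1, l_r2
```

For A = 0 (identity transform), both losses are zero; any stretching or flipping of the axes makes them grow.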

Regularization Terms for the Fine-Subnetworks
Generally, researchers use only a regular regularization term, such as the L1 norm, L2 norm, or total variation, to penalize the similarity measure; such a term maintains the smoothness of the transformation field. We therefore first combine the similarity loss L_S and the smoothness loss L_R3 as a conventional group for transformation field prediction. However, another desirable property of the transformation field is often ignored, namely topology preservation; an appropriate regularization term imposed on the similarity measure can prevent distortion of the transformation field. Therefore, we integrate the topology-preserving loss L_R4 into the conventional group, aiming to obtain a more physically plausible and accurate transformation field. The L2 variation loss L_R3 is defined as [31]:

L_R3 = (1/N) Σ_{x∈Ω} Σ_{i=1}^{3} || T(x + ρ_i) − T(x) ||²,

where N is the total number of voxels x, and ρ_i (i = 1, 2, 3) form the basis of R³. Since the Jacobian determinant and its variants are commonly used to evaluate the topology preservation of a dense vector field, they can also serve as a regularization loss that avoids distortion of the transformation field. Therefore, we use the negative Jacobian determinants as the topology-preserving loss L_R4 to further constrain the similarity loss of the fine-subnetworks. L_R4 is defined as [27,28]:

L_R4 = (1/N) Σ_{x∈Ω} σ( −J_T(x) ),

where N is the total number of voxels x, J_T(x) denotes the Jacobian determinant of the transformation at x, and σ(·) = max(0, ·) linearly activates positive values and sets negative values to zero. Hence, σ(·) decides whether L_R4 penalizes x: if the orientation at x is inconsistent with those of its neighbors, L_R4 is activated, and vice versa.
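The two fine-registration regularizers can be sketched with finite differences over a displacement field of shape (D, H, W, 3). This is a NumPy sketch under our own discretization choices (forward differences for L_R3, `np.gradient` for the Jacobian), not the paper's exact implementation:

```python
import numpy as np

def smooth_loss(flow):
    """L2 variation loss L_R3: mean squared difference of the flow
    between neighboring voxels along each of the three axes."""
    total = 0.0
    for axis in range(3):
        d = np.diff(flow, axis=axis)
        total += np.mean(d**2)
    return total / 3.0

def neg_jacobian_loss(flow):
    """Topology-preserving loss L_R4: average of max(0, -det J), where
    J is the Jacobian of the mapping x -> x + flow(x); folded voxels
    (negative determinant) are penalized."""
    grads = [np.gradient(flow[..., i]) for i in range(3)]  # d(flow_i)/d(x_j)
    D, H, W, _ = flow.shape
    J = np.zeros((D, H, W, 3, 3))
    for i in range(3):
        for j in range(3):
            J[..., i, j] = grads[i][j] + (1.0 if i == j else 0.0)
    det = np.linalg.det(J)
    return np.mean(np.maximum(0.0, -det))
```

A zero flow gives zero for both losses; a folding flow such as u(x) = -2x (which maps x to -x) yields det J = -1 everywhere and activates the topology penalty.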
Datasets

BTCV is a dataset for performance comparisons of 3D abdominal CT image segmentation methods. Its training set contains 30 subjects with liver, left kidney, right kidney, and spleen segmentation masks. All of them were selected for this study.
LiTS is a challenging dataset for liver tumor segmentation. A total of 130 training subjects are provided with liver and tumor segmentation masks. In this study, our experts randomly selected 22 subjects and manually supplemented the corresponding left kidney, right kidney, and spleen masks.
Sliver07 is a challenging dataset for liver segmentation. Its training set includes 20 subjects with liver segmentation masks. One subject was excluded owing to the unclear structure of the spleen; the remaining 19 subjects were selected for this study, and our experts manually supplemented their left kidney, right kidney, and spleen masks.
3Dircadb is a 3D abdominal CT image dataset. It contains 20 training subjects with segmentation masks of multiple structures, including the liver, left kidney, liver tumors, and so on. In this study, the six subjects that simultaneously include liver, left kidney, right kidney, and spleen segmentation masks were selected.
Therefore, 76 abdominal multi-organ CT volumes were formed by combining the selected original volumes with their masks. One of them, randomly selected from LiTS, was used as the atlas (i.e., the fixed image). A total of 15 volumes were evenly selected from LiTS, BTCV, and Sliver07 and paired with the atlas as the testing groups. The remaining 60 volumes were paired with the atlas as the training groups. The training and testing data were resampled to a uniform size of 128³ owing to the limited memory of the GPU (NVIDIA RTX 2080 Ti, 11 GB).

Evaluation Indexes
For registration analysis, the global intensity differences between two images are first evaluated by the root mean squared error (RMSE) and the peak signal-to-noise ratio (PSNR) [36]:

RMSE = sqrt( (1/N) Σ_{x∈Ω} ( I_f(x) − (I_m ∘ T)(x) )² ),
PSNR = 10 × log₁₀( MAX_I² / MSE ),

where N is the total number of voxels x, MSE is the mean squared error, and MAX_I is the maximum possible voxel value of the image. Secondly, the local intensity differences between two images are evaluated by the structural similarity index (SSIM) [36]:

SSIM = ( (2 μ_{I_f} μ_{I_m∘T} + c_1)(2 σ_{I_f I_m∘T} + c_2) ) / ( (μ_{I_f}² + μ_{I_m∘T}² + c_1)(σ_{I_f}² + σ_{I_m∘T}² + c_2) ),

where μ_{I_f} and μ_{I_m∘T} are the mean values, σ²_{I_f} and σ²_{I_m∘T} are the variances, and σ_{I_f I_m∘T} is the covariance of I_f and I_m ∘ T; c_i = (k_i L)² are constants, where i = {1, 2}, k_1 = 0.01, k_2 = 0.03, and L is the range of voxel values.
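These intensity metrics are straightforward to compute. A NumPy sketch (note the SSIM here uses a single global window, matching the closed form in the text; practical SSIM implementations average the statistic over local windows):

```python
import numpy as np

def rmse(a, b):
    """Root mean squared error between two images."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def psnr(a, b, max_i=255.0):
    """Peak signal-to-noise ratio in dB (undefined for identical images)."""
    mse = np.mean((a - b) ** 2)
    return float(10.0 * np.log10(max_i**2 / mse))

def ssim_global(a, b, L=255.0, k1=0.01, k2=0.03):
    """Single-window (global) SSIM with c_i = (k_i * L)^2."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = np.mean((a - mu_a) * (b - mu_b))
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2)) /
                 ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2)))
```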
Moreover, the Dice coefficient (DICE), Hausdorff distance (HD), contour mean distance (CMD), intersection over union (IOU), sensitivity (SS), and specificity (SC) are also included [37]:

DICE = 2|X ∩ Y| / (|X| + |Y|),
IOU = |X ∩ Y| / |X ∪ Y|,
SS = TP / (TP + FN),
SC = TN / (TN + FP),

where X and Y are the multi-organ segmentation masks of the fixed and warped images, respectively, and TP, TN, FN, and FP are the true positive, true negative, false negative, and false positive voxels of the masks. HD is the maximum distance between the boundaries X_b and Y_b of X and Y, and CMD is the mean distance between them.
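The four voxel-overlap metrics reduce to counting the confusion-matrix entries of the two binary masks. A NumPy sketch (the boundary-based HD and CMD are omitted, as they require a distance transform):

```python
import numpy as np

def overlap_metrics(X, Y):
    """DICE, IOU, sensitivity (SS), and specificity (SC) between two
    binary masks X (fixed) and Y (warped)."""
    X, Y = X.astype(bool), Y.astype(bool)
    tp = np.sum(X & Y)    # voxels labeled in both masks
    tn = np.sum(~X & ~Y)  # voxels labeled in neither mask
    fp = np.sum(~X & Y)
    fn = np.sum(X & ~Y)
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    ss = tp / (tp + fn)
    sc = tn / (tn + fp)
    return dice, iou, ss, sc
```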

Internal Comparisons
We use the basic U-net (namely, Base-Net) integrated with our total loss function to explore the optimal number of RCN modules. First, the coarse-subnetwork-related module is integrated into the Base-Net without being counted. Then, the Base-Net is cascaded with 1, 3, 5, and 7 RCN modules, namely Base-Net-1, Base-Net-3, Base-Net-5, and Base-Net-7, respectively, for experimental analysis. The maximum number of RCN modules is determined by the memory of the GPU. Subsequently, we chose the best-performing model, i.e., Base-Net-7, as the proposed method for the external comparisons. The values of λ_1 and λ_2 are the same as those in [25], and the value of λ_4 is the same as that in [27], i.e., λ_1 = 10⁻¹, λ_2 = 10⁻¹, and λ_4 = 10⁻⁵. The optimal value of λ_3 is selected according to the experimental results. The uniform batch size, number of epochs, and learning rate of these comparison models are set as 2, 5, and 10⁻⁴, respectively.
Figure 2 shows the total loss trends of the different models on the training dataset. The curves of the Baseline and Baseline-RCNs-Topology are obtained from the Base-Net and the Base-Net-7, respectively. The curve of the Baseline-RCNs is acquired from the same model as the Base-Net-7 but without the topology-preserving loss. Compared with the Baseline, the Baseline-RCNs has lower total loss values, indicating that the RCN modules are effective in decreasing the registration errors; meanwhile, the Baseline-RCNs-Topology shows that the topology-preserving loss can further improve the multi-organ registration performance on 3D abdominal CT images. Furthermore, Figure 3 displays the total loss trends of the Base-Net-7 with different values of λ_3. It can be seen that the Base-Net-7 with λ_3 = 10⁻⁵ achieves the optimal performance on the training dataset. Therefore, the value of λ_3 is set as 10⁻⁵ in the proposed method.
Table 1 presents the internal comparisons of embedding different numbers of RCN modules into our Base-Net. As observed, the average values of the RMSE, HD, and CMD obtained by the Base-Net-1, 3, 5, and 7 are smaller than those of the Base-Net. Meanwhile, the values of the PSNR, SSIM, DICE, IOU, SS, and SC obtained by the Base-Net-1, 3, 5, and 7 are higher than those of the Base-Net. This may be because embedding the RCN modules changes the architecture of the Base-Net into a recursive one, and the improved Base-Net can progressively predict a more accurate transformation field. In addition, the RMSE, HD, and CMD from the Base-Net-7 are 54.01%, 31.45%, and 82.01% lower than those of the Base-Net, respectively, and the PSNR, SSIM, DICE, IOU, SS, and SC from the Base-Net-7 are 38.16%, 7.92%, 33.81%, 65.66%, 17.52%, and 48.29% higher, respectively. Moreover, the Base-Net-7 outperforms all of the other models on multi-organ registration from 3D abdominal CT images. It can therefore be concluded that seven is the optimal number of RCN modules to embed into the Base-Net.
Figure 4 displays the registration progress of the proposed method on the 41st, 57th, 65th, 73rd, and 81st slices of a randomly selected testing group, where T_1 is the affine transformation field, and T_2 to T_8 are the transformation fields obtained by the 1st to 7th RCN modules, respectively. Large intensity differences exist between the fixed image I_f and the moving image I_m; however, the RCN modules help the model progressively obtain the final warped images, which closely resemble the fixed images. Therefore, the proposed method performs well on multi-organ registration from 3D abdominal CT images.

External Comparisons
We compared our method against three traditional methods (Demons [38], Hybrid [39], and MSI [40]) and two state-of-the-art unsupervised learning-based methods (VTN [25] and Voxelmorph [24]). The traditional and learning-based methods were run on an Intel i5 CPU with a GTX 1060 and on an NVIDIA RTX 2080 Ti GPU, respectively.
Table 2 shows the external comparisons of the abdominal multi-organ registration results of the different methods. As observed, the proposed method performs comparably to MSI in terms of RMSE and is superior to Demons, Hybrid, VTN, and Voxelmorph. Moreover, the average values of PSNR, SSIM, DICE, IOU, SS, SC, HD, and CMD from the proposed method are 29.5584, 0.9732, 0.9775, 0.9562, 0.9982, 0.9578, 12.2646, and 3.9296, respectively, which are better than those of the traditional and unsupervised learning-based methods. The registration times in Table 2 illustrate that the unsupervised learning-based methods are obviously faster than the traditional ones. Although VTN achieves the shortest registration time, its performance on the other evaluation indicators is barely satisfactory. Overall, the proposed method offers a good compromise that meets the real-time and high-accuracy clinical requirements.
Figures 5 and 6 display the histograms and boxplots, respectively, of the evaluation metrics of the 15 pair-wise testing groups' registration results with the different methods. The distributions of the RMSE, PSNR, DICE, CMD, and IOU values in Figures 5 and 6 are consistent with those in Table 2. Hence, the proposed method has stable registration performance and is superior to the other competing methods on the above evaluation indicators.
Figure 7 presents the intensity differences between the fixed and moving images from four randomly selected testing groups. The intensity values of the images range from 0 to 255, and the corresponding colors vary from blue to red. As shown in Figure 7, all pair-wise groups have significant differences before registration.
Then, the Demons, VTN, and Voxelmorph methods produce considerable differences, indicating that their performances may be influenced by the noise and artifacts in the abdominal CT images.
Moreover, the proposed method produces the smallest differences and outperforms the other competing methods. It can be concluded that our method can mitigate the impact of noise and artifacts and performs more accurately and robustly on multi-organ registration for 3D abdominal CT images.
Figure 7. The intensity differences between the fixed and moving images from the paired testing groups with different methods.

Conclusions
In this paper, we presented an improved unsupervised learning-based framework for multi-organ registration from 3D abdominal CT images. The coarse-to-fine RCN modules are embedded into a basic U-net model, which hence inherits the advantages of the recursive model and achieves better performance on multi-organ registration from 3D abdominal CT images. In addition, a topology-preserving loss is added to the total loss function, which penalizes the similarity loss to avoid distortion of the predicted transformation field. The experimental results show that the proposed method achieves the best average PSNR, SSIM, DICE, IOU, SS, SC, HD, and CMD values among the compared traditional and unsupervised learning-based methods, and therefore has the potential to be used in clinical practice.