The Successive Next Network as Augmented Regularization for Deformable Brain MR Image Registration

Deep-learning-based registration methods not only save time but also automatically extract deep features from images. To obtain better registration performance, many scholars use cascade networks to realize coarse-to-fine registration. However, cascading n networks multiplies the number of network parameters by n and entails long training and testing stages. In this paper, we use a cascade network only in the training stage. Unlike in other work, the role of the second network is to improve the registration performance of the first network, functioning as an augmented regularization term throughout the process. In the training stage, a mean squared error loss between the dense deformation field (DDF) output by the second network and the zero field is added to constrain the learned DDF toward 0 at every position, compelling the first network to predict a better deformation field and improving its registration performance. In the testing stage, only the first network is used to estimate the DDF; the second network is not used again. The advantages of this design are twofold: (1) it retains the good registration performance of the cascade network; (2) it retains the time efficiency of a single network in the testing stage. Experimental results show that the proposed method effectively improves registration performance compared to other state-of-the-art methods.


Introduction
Image registration is one of the basic tasks in medical image processing. It involves estimating a dense deformation field (DDF) that matches a moving image to a fixed image so that the two images and their corresponding anatomical structures are aligned accurately in space [1]. Traditional registration methods optimize a cost function through a large number of iterations, a process that usually requires significant computation and time [2]. With the popularization of deep learning in medical image registration, learning-based methods are now faster than traditional ones: once a neural network is trained to generate deformation fields from moving and fixed images, registration requires only a single forward pass in the testing stage. Fan et al. [3] studied the computational costs of seven different deformable registration algorithms; the assessed deep-learning network (BIRNet), which requires no iterative optimization, needed the least time. Registration accuracy also improves with deep learning. For example, Cao et al. [4] proposed a deep learning method for registering brain MRI images and showed that its Dice coefficients improved for white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF).
Unsupervised learning image registration has been widely applied because gold-standard registrations are difficult to obtain [5]. Balakrishnan et al. [6] proposed VoxelMorph, an unsupervised registration network. Other methods cascade networks to predict a coarse deformation field and a fine deformation field, respectively, so as to achieve accurate registration. GANs showed excellent performance in the aforementioned studies. In a previous study, a GAN based on dual attention mechanisms was proposed, which showed good registration performance in areas with relatively flat edges but poor performance in narrow and long edge areas. To this end, building on previous research, this paper proposes a method to help GANs register the long and narrow regions at the peripheries of the brain, which differs from coarse-registration/fine-registration schemes. Our main contributions are summarized as follows:
1. During training, the cascade networks are trained simultaneously to save network training time.
2. The second network functions as a loss term: a mean squared error loss added to the second network constrains the deformation field it outputs such that it tends to 0. Only the first network is used during testing, which saves testing time.
3. Coupled with the adversarial training of GANs, the registration performance of the first network is further improved.
The rest of this paper is organized as follows. Section 2 introduces the networks proposed in this paper in detail. Section 3 introduces the experimental datasets and evaluation indicators. Section 4 introduces the experimental results obtained from the HBN and ABIDE datasets. In Section 5, we provide a discussion. Finally, the conclusions are given in Section 6.

Methodology
This paper proposes a method combining adversarial learning with cascade learning. Joint training of the cascaded networks allows them to predict more accurate deformation fields. The first (registration) network is used to learn the deformation field φ1. The second (augmented) network enables the first network to learn more deformations. A discrimination network improves the first network's performance through adversarial training. The structure of each cascaded network is similar to that of VoxelMorph [6]. The proposed overall learning framework is illustrated in Figure 1.

First (Registration) Network
The registration network is the first network in the cascading framework. Its inputs are the fixed image F and the moving image M. Its output is the deformation field φ1, i.e., φ1 = G(F, M). This network realizes the alignment from M to F, i.e., F = M(φ1), where M(φ1) is the warped image. Subsequently, the loss function between M(φ1) and F is calculated to drive the training process. This loss function includes three parts: intensity similarity loss L_sim, adversarial loss L_adv, and smooth regularization term L_smooth.
The adversarial loss function of the registration network is:

L_adv = −log(p), with p = D(c),

where p is the output value of the discrimination network and c indicates the registration network input. The local cross-correlation (CC) metric is used to calculate the intensity similarity between the fixed image F and the warped image M(φ1). The specific formula is:

CC(F, M(φ1)) = Σ_{p∈Ω} [ Σ_{p_i} (F(p_i) − F̄(p)) (M(φ1(p_i)) − M̄(φ1(p))) ]² / ( [Σ_{p_i} (F(p_i) − F̄(p))²] [Σ_{p_i} (M(φ1(p_i)) − M̄(φ1(p)))²] ),

where p_i iterates over the n³ volume centered at voxel p, and Ω represents the three-dimensional image domain. In this paper, n = 9. F(p_i) and M(φ1(p_i)) denote the voxel intensities of F and M(φ1) at p_i, respectively; F̄(p) and M̄(φ1(p)) are the local mean values over the n³ volume. A higher CC indicates a more accurate alignment. According to the definition of CC, the intensity similarity loss L_sim is defined as:

L_sim(F, M(φ1)) = −CC(F, M(φ1)).

Additionally, L2 regularization is implemented to smooth the deformation field φ1:

L_smooth(φ1) = Σ_{p∈Ω} ‖∇φ1(p)‖².
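The correlation-based similarity loss can be sketched in a simplified, single-window form. Note this is an illustrative global version only: the paper uses local n³ = 9³ windows centered at every voxel, and the function names here are our own.

```python
def cc(f, m, eps=1e-8):
    """Squared normalized cross-correlation between two flattened volumes.

    Single-window stand-in for the local CC in the text: one window covers
    the whole volume instead of an n^3 = 9^3 window per voxel. `eps` guards
    against division by zero for constant images.
    """
    n = len(f)
    mf = sum(f) / n
    mm = sum(m) / n
    num = sum((a - mf) * (b - mm) for a, b in zip(f, m)) ** 2
    den = sum((a - mf) ** 2 for a in f) * sum((b - mm) ** 2 for b in m) + eps
    return num / den


def l_sim(f, m):
    # Maximizing correlation minimizes the loss, hence the negation.
    return -cc(f, m)
```

A perfectly aligned pair (identical intensities) yields a CC of 1 and therefore the minimum similarity loss of −1.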

Successive (Augmented) Network
The inputs of the successive network are F and M(φ1); the output is the DDF φ2. φ2 is used to deform M(φ1) to obtain φ2(M(φ1)). Simultaneously, to sharpen the warped image, we perform a composed operation on φ1 and φ2, i.e., M(φ1 ∘ φ2) is obtained by warping the moving image M with the composed DDF. Next, two intensity loss functions, namely, L_sim(F, M(φ1 ∘ φ2)) and L_sim(F, φ2(M(φ1))), are calculated between M(φ1 ∘ φ2) and F and between φ2(M(φ1)) and F, respectively. The DDF φ2 is also constrained to approach the zero deformation field through the following MSE loss function, allowing the deformation field φ1 to learn more accurate deformations.
The MSE loss function is defined as:

L_MSE(φ2) = (1/|Ω|) Σ_{p∈Ω} ‖φ2(p) − 0‖²,

which drives φ2 toward the zero field. Through this function, the output of the first network alone can achieve the fine registration that would otherwise require the two networks connected in series.
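The MSE-to-zero constraint on φ2 can be sketched as follows; the function name and the nested-list representation of the field are illustrative, not the paper's implementation.

```python
def mse_to_zero(ddf):
    """MSE between a dense displacement field and the all-zero field.

    `ddf` is a nested list of per-voxel displacement vectors. Penalizing
    the squared magnitude of every component pushes the second network's
    output toward zero, forcing the first network to explain the
    deformation on its own.
    """
    flat = [component for voxel in ddf for component in voxel]
    return sum(v * v for v in flat) / len(flat)
```

A zero field gives zero loss; any residual displacement in φ2 is penalized quadratically.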
The loss function for the registration network is:

L_G1 = L_adv + L_sim(F, M(φ1)) + λ L_smooth(φ1).

In addition, the loss function used by the second network is:

L_G2 = L_sim(F, M(φ1 ∘ φ2)) + L_sim(F, φ2(M(φ1))) + L_MSE(φ2) + λ L_smooth(φ2).

The total loss function is:

L_G = L_G1 + L_G2.

Discrimination Network
The discrimination network consists of four convolutional layers, each followed by a LeakyReLU activation layer; a final sigmoid activation outputs the probability value. The discrimination network is shown in Figure 2. It judges whether its input is a real (fixed) image or a warped image: the better the registration, the harder it is for the discrimination network to distinguish the warped image from the fixed image.


Experimental Details
Python and TensorFlow were used to implement the experiments. The program was trained and tested on an NVIDIA GeForce RTX 2080 Ti GPU [30].
In the training process, a patch-based training method is adopted to reduce memory usage. Herein, 127 blocks are obtained from each image of size 182 × 218 × 182. Each block is 64 × 64 × 64 voxels, and the stride is 32. The learning rates for training the registration and discrimination networks are set to 0.00001 and 0.000001, respectively.
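One plausible way to tile a 182 × 218 × 182 volume into 64³ patches with stride 32 is sketched below; the paper does not spell out its exact sampling scheme, so the function names and the border-flush final window are our assumptions.

```python
def tile_starts(size, patch=64, stride=32):
    """Start offsets tiling one axis with `patch`-sized windows.

    Strides by `stride` and appends a final window flush with the border
    so the whole axis is covered (an assumed convention, not necessarily
    the paper's).
    """
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts


def patch_grid(shape=(182, 218, 182), patch=64, stride=32):
    """Per-axis start offsets and total patch count for a 3D volume."""
    grids = [tile_starts(s, patch, stride) for s in shape]
    count = 1
    for g in grids:
        count *= len(g)
    return grids, count
```

With these conventions the grid is 5 × 6 × 5 = 150 exhaustive positions, more than the 127 blocks reported per image, so the paper presumably subsamples (e.g., drops background-only patches); the sketch only illustrates the stride-32 geometry.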
The traditional methods Demons and SyN are used as comparison baselines. The deep learning model VoxelMorph, an unsupervised medical image registration model, is also trained and selected as the deep-learning baseline. The Dice score, structural similarity, and Pearson's correlation coefficient are used as evaluation indicators to verify the experimental results. Moreover, the influence of the MSE and L_sim loss functions on the experimental results is investigated.

Datasets
To demonstrate the flexibility and performance of the proposed method, the HBN [31] and ABIDE [32] datasets are used for training and testing. The HBN dataset consists of brain data obtained from patients with ADHD (aged 5-21 years); 496 and 31 T1-weighted brain images are selected for training and testing, respectively. ABIDE is a dataset of brain images from patients with autism (aged 5-64 years); 928 and 60 T1-weighted brain images are used for training and testing, respectively. Training pairs are randomly selected from the training set, and each image is linearly aligned to the fixed image. The image size in both the HBN and ABIDE datasets is 182 × 218 × 182 voxels with a resolution of 1 × 1 × 1 mm³. Both datasets contain segmentation label images for CSF, GM, and WM.

Evaluation Indicators
The Dice coefficient (Dice) evaluates the degree of overlap between a warped segmentation image and the segmentation image of the fixed image, reflecting the similarity between the experimental and standard segmentation images. It is defined as:

Dice(X_seg, Y_seg) = 2 |X_seg ∩ Y_seg| / (|X_seg| + |Y_seg|),

where X_seg and Y_seg represent the standard and warped segmentation images, respectively. Dice ranges from 0 to 1: the closer the result is to 1, the more similar the warped segmentation image is to the standard segmentation image, and the better the registration result.
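The overlap measure above can be computed directly on binary label volumes; this minimal sketch (function name ours) treats each segmentation as a flat 0/1 sequence for one anatomical structure at a time.

```python
def dice(x_seg, y_seg):
    """Dice = 2|X ∩ Y| / (|X| + |Y|) for binary label volumes.

    `x_seg`, `y_seg` are flat sequences of 0/1 labels; 1.0 means perfect
    overlap, 0.0 means no overlap.
    """
    inter = sum(1 for a, b in zip(x_seg, y_seg) if a == 1 and b == 1)
    total = sum(x_seg) + sum(y_seg)
    # Two empty segmentations overlap trivially; report 1.0 by convention.
    return 2.0 * inter / total if total else 1.0
```

In practice the score would be computed per tissue class (CSF, GM, WM) and averaged across the test images.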

Structural Similarity
The structural similarity index measure (SSIM) [33] measures the similarity of two images. It is calculated as:

SSIM(X, Y) = (2 μ_X μ_Y + c1)(2 σ_XY + c2) / ((μ_X² + μ_Y² + c1)(σ_X² + σ_Y² + c2)),

where X and Y are the two input 3D images; μ_X and μ_Y are their mean values; σ_X² and σ_Y² are their variances; σ_XY is their covariance; and c1 and c2 are constants used to avoid errors caused by a denominator equal to 0. An SSIM value close to 1 indicates that the two images have a high degree of similarity.
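The index can be sketched in its single-window (global) form as below; the constants c1 and c2 are small illustrative values, and practical SSIM implementations usually slide a local window instead.

```python
def ssim(x, y, c1=1e-4, c2=9e-4):
    """Global SSIM over two flat intensity sequences.

    Computes means, variances, and covariance over the whole volume and
    plugs them into the SSIM formula; c1, c2 guard against a zero
    denominator (values here are illustrative).
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )
```

Identical inputs make the numerator and denominator equal, so the index is exactly 1.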

Pearson's Correlation Coefficient
Pearson's correlation coefficient (PCC) measures the similarity between two 3D images X and Y. It is calculated as:

PCC(X, Y) = Σ_p (X(p) − X̄)(Y(p) − Ȳ) / sqrt( Σ_p (X(p) − X̄)² · Σ_p (Y(p) − Ȳ)² ),

where X̄ and Ȳ represent the mean values of X and Y, respectively. The closer the value of PCC is to 1, the greater the correlation; a PCC of 0 indicates no correlation.
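The coefficient follows directly from the definition; this small sketch (function name ours) operates on flattened volumes.

```python
def pcc(x, y):
    """Pearson's correlation coefficient between two flattened volumes."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    ) ** 0.5
    return num / den
```

Any affine intensity relationship with positive slope (e.g., y = 2x) yields a PCC of exactly 1, so the metric measures linear correlation rather than intensity equality.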

Results
The proposed methodology is compared with the following approaches: (1) Demons and SyN, two traditional registration methods; (2) Voxelmorph (VM), an unsupervised deep learning registration method; and (3) VM + A, a method consisting of a simultaneously trained registration network and augmented network.
First, the proposed GAN method (VM + A + GAN) is compared with the two traditional methods, Demons and SyN. Tables 1 and 2 summarize the test results on the two datasets; all indicators show that our experimental results are the best. Figure 3 compares the test results on the two datasets: the first row shows the original image from the HBN dataset, the second row the corresponding segmentation image, the third row the original image from the ABIDE dataset, and the fourth row the corresponding segmentation image. Compared with Demons and SyN, the image obtained by the proposed GAN method is closer in appearance to the fixed image; the differing regions are shown in the enlarged image on the right.

Second, the proposed GAN method is compared with the VM and VM + A methods. Figure 4 shows the registered moving image and the fixed image, with the rows organized as in Figure 3. The enlarged figure on the right shows that the result of training the registration, augmented, and discrimination networks together is closer to the fixed image. The experimental results verify that the performance of the registration, augmented, and discrimination networks trained together is better than that of the registration network trained individually and of the registration and augmented networks trained simultaneously. To highlight the effectiveness of the proposed method more clearly, Figure 5 shows the experimental results for the three brain tissues on the HBN dataset, and Figure 6 shows the corresponding results on the ABIDE dataset. The dotted circles in the figures mark the results obtained by the proposed method.

Tables 3 and 4 summarize the Dice, SSIM, and PCC indices for the different datasets. In Table 3, for the HBN dataset, the proposed method improves the Dice values by 0.030, 0.032, and 0.034 compared with the VM method; for the ABIDE dataset, the improvements are 0.008, 0.004, and 0.004. In Table 4, for the HBN dataset, the proposed method increases the SSIM and PCC indices by 0.02 and 0.008, respectively, compared with the VM method; for the ABIDE dataset, the improvements are 0.006 and 0.003.

Discussion
The use of registration and discrimination networks for image registration is a common approach and has been investigated experimentally in previous work [34]. However, this adversarial method of training a GAN improves a registration network's performance only to a limited extent, and the registration capacity in some narrow and long edge areas needs further improvement. Therefore, this paper proposes training three networks together so that the registration network learns more deformations, further improving registration performance. When the three networks are trained together, the choice of loss functions has a certain impact on the experimental results, as discussed in the following subsections.

Importance of MSE
When the two networks (VM + A) were trained together, both the L_smooth loss function of the deformation field φ2 and the MSE loss function were computed. An experiment without the MSE loss function (VM + A − MSE) was also performed to verify its effectiveness. Likewise, when the three networks (VM + A + GAN) were trained together, the MSE loss function was removed (VM + A + GAN − MSE) to verify its impact on the experimental results. The comparison shows that the best registration is achieved when the three networks are trained together with the MSE loss function. The results are shown in Figure 7.
Figure 4 shows the comparison of the experimental results after removing the MSE loss function when two networks were trained together (VM + A − MSE) and when three networks were trained together (VM + A + GAN − MSE). Evidently, the proposed method obtained a result that is closer to the fixed image, confirming the effectiveness of training three networks simultaneously; moreover, the proposed method intuitively shows a good registration effect in the narrow and long regions at the peripheries of the brain images. The first row of the resulting images shows the original image from the HBN experiments, the second row the corresponding segmentation image, the third row the original image from the ABIDE experiments, and the fourth row the corresponding segmentation image.

Table 5 summarizes the experimental results for removing the MSE loss function when two networks were trained together (VM + A − MSE) and when three networks were trained together (VM + A + GAN − MSE). Comparing the results, removing the MSE loss function reduces registration accuracy, verifying that registration performance can be improved by adding the MSE loss function when the three networks are trained together. Comparing the SSIM and PCC metrics in Table 6, the loss function used by the proposed method achieves good results.

Importance of L sim
When the three networks (VM + A + GAN) are trained together, the L_sim loss functions between the φ2(M(φ1)) image and the fixed image F and between the M(φ1 ∘ φ2) image and the fixed image F are removed for experimental comparison. After removing these two L_sim loss functions, the registration accuracy decreases significantly. This analysis shows that the L_sim loss function constrains the similarity among the images to a certain extent, which proves the effectiveness of adding it. The histograms in Figure 8 show that the proposed method improves the Dice, SSIM, and PCC indices. In Figure 8, (a) verifies the importance of the L_sim loss function on the HBN dataset; (b) shows the Dice difference between the proposed method and the variant without the L_sim loss function on the ABIDE dataset; (c) shows the impact of removing the L_sim loss function on the SSIM and PCC indices for the HBN dataset; and (d) shows the same for the ABIDE dataset.
