Correcting Susceptibility Artifacts of MRI Sensors in Brain Scanning: A 3D Anatomy-Guided Deep Learning Approach

Echo planar imaging (EPI), a fast magnetic resonance imaging technique, is a powerful tool in functional neuroimaging studies. However, susceptibility artifacts, which cause misinterpretations of brain functions, are unavoidable distortions in EPI. This paper proposes an end-to-end deep learning framework, named TS-Net, for susceptibility artifact correction (SAC) in a pair of 3D EPI images with reversed phase-encoding directions. The proposed TS-Net comprises a deep convolutional network to predict a displacement field in three dimensions to overcome the limitation of existing methods, which only estimate the displacement field along the dominant-distortion direction. In the training phase, anatomical T1-weighted images are leveraged to regularize the correction, but they are not required during the inference phase to make TS-Net more flexible for general use. The experimental results show that TS-Net achieves favorable accuracy and speed trade-off when compared with the state-of-the-art SAC methods, i.e., TOPUP, TISAC, and S-Net. The fast inference speed (less than a second) of TS-Net makes real-time SAC during EPI image acquisition feasible and accelerates the medical image-processing pipelines.


Introduction
Echo planar imaging is a fast magnetic resonance imaging (MRI) technique that has served as an important non-invasive tool in cognitive neuroscience [1]. EPI is widely used to record the functional magnetic resonance imaging (fMRI) data for studying human brain functions [2]. It is also the technique of choice to acquire the diffusion-weighted imaging (DWI) data for analyzing brain connection patterns [3]. Despite its popularity, EPI is prone to susceptibility artifacts (SAs) [4,5] and eddy-current artifacts [6,7], which consist of geometric distortions. The geometric distortions cause misalignments between the functional image and the underlying structural image, subsequently leading to errors in brain analysis, e.g. incorrect localization of neural activities in the functional brain studies. Therefore, an accurate geometric distortion correction method is crucial for applications that rely on EPI images.
In this study, we investigate the susceptibility artifact correction (SAC), as SAs are inevitable in EPI [5]. Interestingly, two EPI images, which are acquired using identical sequences but with reversed phase-encoding (PE) directions, have opposite patterns of geometric distortions caused by SAs [8,9]. Consequently, the middle version of the reversed-PE image pair is considered the distortion-free image. Chang and Fitzpatrick proposed to 1. We design a deep convolutional network to estimate the 3D displacement field. The deep network is designed to make TS-Net robust to different sizes, resolutions, and modalities of the input image by using batch normalization (BN) layers and sizenormalized layers.

2.
We estimate the displacement field in all three dimensions instead of only along the phase-encoding direction. In other words, TS-Net predicts the displacement field that captures the 3D displacements for every voxel. This, to our knowledge, is a significant improvement compared to most existing SAC methods [10,16], which estimate the distortions only along the PE direction and ignore the distortions along with the other two directions.

3.
We introduce a learning method that leverages T 1w images in the training of TS-Net. The motivation is that the T 1w image is widely considered as a gold standard representation of a subject's brain anatomy [17], and it is readily available in brain studies [18]. To make TS-Net more applicable for general use, the T 1w image is used only in training for network regularization, but not in the inference phase. 4.
We provide an extensive evaluation of the proposed TS-Net on four large public datasets from the Human Connectome Project (HCP) [19]. First, an ablation study is conducted to analyze the effects of using different similarity measures to train TS-Net, the effects of various components in the TS-Net framework, and the effects of using a pre-trained TS-Net when training a new dataset. Second, TS-Net is compared with three state-of-the-art SAC methods, i.e., TOPUP [10], TISAC [15], and S-Net [16], in terms of correction accuracy and processing time.
The remainder of this paper is organized as follows. Section 2 describes the materials and the proposed method. Section 3 presents the experimental results, and Section 4 discusses the proposed method and results. Finally, Section 5 summarizes our work.

Materials and Methods
In this section, Section 2.1 describes the EPI datasets used for experiments. Section 2.2 introduces the proposed TS-Net method. Section 2.3 presents the methods used for conducting experiments.

EPI Datasets
To evaluate the SAC methods, we used four EPI datasets (fMRI-3T, DWI-3T, fMRI-7T, and DWI-7T), which are the unprocessed data of the Subjects with 7T MR Session from the public Human Connectome Project repository. The functional and diffusion MRI datasets were used to study functional connectivity of the human brain and reconstruct the complex axonal fiber architecture, respectively [20,21]. These four datasets were acquired using different acquisition sequences, imaging modalities, field strengths, resolutions, and image sizes; thus, the datasets are diverse in size and distortion property. Table 1 shows a summary of the four datasets. Note that the apparent diffusion coefficient map was not acquired in the DWI datasets. The b-values were 1000, 2000, and 3000 s/mm 2 for the DWI-3T dataset, and 1000 and 2000 s/mm 2 for the DWI-7T dataset.

The Proposed TS-Net Method
This section introduces a 3D anatomy-guided deep learning framework, called TS-Net, to correct the susceptibility artifacts in a 3D reversed-PE image pair (see Figure 1). The proposed TS-Net includes a deep convolutional network to map the 3D image pair to the 3D displacement field U. It also has a 3D spatial transform unit to unwarp the input-distorted images with the predicted displacement field, providing the corrected images. In contrast to existing SAC methods [15,16], TS-Net estimates the 3D displacement field, or three displacement values for each voxel. Thus, the displacement field U can be represented as [U x , U y , U z ], where U d is the displacement field in the d direction. The 3D spatial transform unit is the interpolation operator to unwarp or resample the input images by the estimate displacement field [22]. Let U denote the displacement field of image I 1 to the corrected image, then −U is the displacement field of image I 2 to the corrected image because of the inverse distortion property of the reversed-PE image pair. The spatial transform unit produces the corrected images, expressed as E 1 = I 1 ⊗ (G + U) , and E 2 = I 2 ⊗ (G − U) , where ⊗ is the linear interpolation and G = [G x , G y , G z ] is the regular grids in the x, y, and z directions.
The deep convolutional network can be considered as a mapping function f θ : (I 1 , I 2 ) → U, where θ is the set of network parameters. The deep network, which is inspired by S-Net [16], U-Net [23], and DL-GP [24], is U-Net-like architecture with an encoder and a decoder (see Figure 2). The encoder takes a two-channel input (which is the reverse PE image pair) and extracts the latent features. The decoder takes the latent features to predict the displacement field.
Both the encoder and the decoder use a kernel size of 3 × 3 × 3 voxels for their convolutional layers to extract information from the neighboring voxels. This kernel size is selected because it requires fewer trainable parameters than larger kernel sizes, thereby improving computational efficiency. Each convolutional layer is followed by a BN layer to mitigate changes in the distribution of the convolutional layer's input [25]. To make TS-Net cope with different input image sizes, we add a size-normalization layer before the encoder and a size-recovery layer after the decoder. The size-normalization layer uses zero-padding so that each input dimension is divisible by 16. The size-recovery layer crops the decoder output to the size of the input image. To resize images, TS-Net uses zero-padding instead of interpolation to maintain the spatial resolution of the input images. Maintaining the original spatial resolution is critical in SAC because the displacements in the EPI images are small and sensitive to image interpolation. Note that the configuration of the introduced convolutional encoder-decoder, e.g. the number of layers, batch normalization, and upsampling layers, was experimentally selected, see Section 3.1.
In our previous deep-learning-based SAC method [16], the network parameters θ are estimated by optimizing the objective function that promotes the similarity between the pair of corrected images and enforces the local smoothness of the predicted displacement field. In this study, we regularize the training by introducing a T 1w -based regularizer to the loss function. This regularizer can improve the TS-Net training as the T 1w image is widely considered a gold standard representation of a subject's brain anatomy [17]. Note that T 1w images are used in the training phase, not in the testing phase.
The T 1w -based regularizer penalizes the distances from the corrected images to the corresponding T 1w structural image. Since T 1w and EPI are in different modalities, we use the normalized mutual information (NMI) to measure the similarity between the output images and the T 1w image because it is effective for multi-modal images. Let A denote the T 1w image, and the T 1w -based regularizer is then defined as The loss for TS-Net training is where L sim is the dissimilarity between the pair of corrected images. L smooth is the diffusion regularizer, denoting the non-smoothness of the predicted displacement field. The positive and user-defined regularization parameters λ and γ represent the trade-off between the similarity of the corrected images, the smoothness of the displacement field, and the similarity of the T 1w image to the output images.
Since the corrected images E 1 and E 2 have the same modality, we investigate three possible unimodal similarity metrics: mean squared error (MSE), local cross-correlation (LCC) [26], and local normalized cross-correlation (LNCC) [27] (refer to Appendix A.1 for a detailed description of the metrics). We experimentally found that LNCC metric is the best choice in terms of the trade-off between training accuracy and processing time (see the analysis in Section 3.1). Thus, LNCC is used as the L sim .

Experimental Methods
To evaluate TS-Net, for each dataset, we first split the subjects randomly into two parts: A and B. Then, the training set was formed by randomly selecting reversed-PE image pairs of each subject in Part A; this strategy reduces the data repetition of subjects. The test set was formed from all reversed-PE pairs of each subject in Part B. The training sets were used to select the hyper-parameters and train the TS-Net models, and the test sets were used to evaluate the correction accuracy of the TS-Net models. The training set of each dataset was further divided into a training set and a validation set with a ratio of 9:1. Table 2 summarizes the training, validation, and test sets of the four datasets. The proposed TS-Net was implemented using Keras [28] deep learning library. For training TS-Net, the Adam optimizer was used with the learning rate α = 0.001 and the exponential decay rates β 1 = 0.9 and β 2 = 0.999, as suggested by Kingma and Ba [29]. The Tree of Parzen Estimator algorithm was used to select suitable values for regularization parameters λ and γ [30][31][32]. In training each dataset, we selected the maximum batch size that could fit into the available GPU memory to reduce the training time. The batch sizes and regularization parameters used in training TS-Net are shown in Table 3. We then compared the proposed TS-Net with two iterative-optimization methods, i.e., TOPUP and TISAC, and a state-of-the-art deep learning method, i.e., S-Net. The comparison is in terms of the correction accuracy and processing speed. To evaluate the correction accuracy of the proposed method, we trained S-Net and TS-Net for 1500 epochs with each dataset. The trained models were used to compute the corrected image pairs of the test sets. For TOPUP (we used the TOPUP implementation in the FSL package, website: fsl.fmrib.ox.ac.uk/fsl/fslwiki/topup, accessed on 1 May 2020) and TISAC, the corrected image pairs were obtained by implementing the iterative-optimization algorithms. Here, the correction accuracy is measured in terms of LNCC similarity between the pair of reversed-PE images.
The experiments were conducted using images from the datasets directly, without any pre-processing step. The experiments for evaluating processing times were performed on a system that has an Intel Core i5-9600K CPU at 3.6 GHz, 32 GB of RAM, and an NVIDIA GeForce RTX2080 GPU with 8 GB memory. The other experiments were performed on a system that has an Intel Xero Gold 5115 CPU at 2.4 GHz and an NVIDIA GeForce GTX Titan Xp with 12 GB memory.

Results
In this section, Section 3.1 presents the results of the ablation study. Section 3.2 shows the results of the proposed method and other representative SAC methods in terms of correction accuracy and processing time.

Ablation Study of the Proposed Method
This section analyzes the proposed TS-Net method in five aspects: (i) the effects of using different similarity measures; (ii) the effects of the different network configurations in TS-Net; (iii) the effects of using the 3D distortion model and T 1w regularization; (iv) the effects of using a pre-trained TS-Net in training other datasets; (v) the visualization of the predicted displacement field.    [34]; (iv) TS-Net with BN and UL (the proposed method). The validation loss during the training phase was computed as the average LNCC measure between the output image pairs, across subsets of the training sets. This validation loss was then used to compare different network configurations. Figure 4a shows the validation loss versus the training time on three datasets: fMRI-3T, DWI-3T, and DWI-7T; each subfigure includes the validation loss for the four network configurations. Several observations can be made. First, using batch normalization (the proposed TS-Net, green curve) provides a lower validation loss compared to not using batch normalization (blue curve). Second, using batch normalization (the proposed TS-Net, green curve) provides a similar or lower validation loss compared to using instance normalization (orange curve). Third, using the upsampling layer (the proposed TS-Net, green curve) has a similar validation loss compared to using the transpose convolution (magenta curve). These results justify our selected configuration for TS-Net.

Effects of using the 3D distortion model and anatomical guidance by T 1w :
In this experiment, we trained three types of networks: (i) TS-Net with the 1D distortion model as used in S-Net [16]; (ii) TS-Net with the 3D distortion model and without T 1w guidance; (iii) TS-Net with the 3D distortion model and T 1w guidance (the proposed method). Figure 4b shows the validation loss versus the training time on three datasets: fMRI-3T, DWI-3T, and DWI-7T. Several observations can be made. First, the proposed TS-Net with T 1w guidance (green solid curve) has lower validation losses than the TS-Net without T 1w guidance (brown dash-dotted curve). This result shows that incorporating T 1w guidance can improve the correction accuracy. Second, the proposed TS-Net using the 3D distortion model (green solid curve) produces significantly lower validation losses than TS-Net using the 1D distortion model (magenta dashed curve). This result shows that the 3D distortion model used in the proposed TS-Net provides more accurate correction than the 1D distortion model (i.e., only along the phase-encoding direction), which is used in S-Net and existing iterative-optimization SAC methods.

Effects of using a pre-trained TS-Net:
In this experiment, we explored whether using a TS-Net model pre-trained on one dataset can reduce the training time on another dataset, compared to a randomly initialized TS-Net. To this end, we trained two TS-Net models: (i) from scratch; (ii) using an initial network, which had been pre-trained for 1500 epochs on the fMRI-3T dataset. Figure 4c shows the validation loss versus training time on three datasets: DWI-3T, fMRI-7T, and DWI-7T. The figure shows that the validation loss when training TS-Net using a pre-trained model (cyan dash-dotted curve) is much lower than when training from scratch (green solid curve). The result suggests that TS-Net is able to learn generalized features for correcting the susceptibility artifacts from one dataset. Subsequently, adopting the learned features in training other datasets leads to a faster converge. Figure 5 shows the samples of the displacement field estimated by the trained TS-Net for the four test sets. The displacement field is shown in three directions (left-right, anterior-posterior, and superior-inferior). TS-Net can estimate the geometric distortions along the directions that are not the dominant PE direction. The visual results indicate that TS-Net is able to predict realistic 3D displacement fields, i.e., the displacements in the phase-encoding direction are more dominant than those in the other two directions.

Comparison with Other Methods
This section compares TS-Net with three SAC methods, i.e., TOPUP, TISAC, and S-Net. Figure 6 shows sample slices of uncorrected and corrected images from each of the four test sets. Each example includes two reversed-PE images (Rows 1 and 2) and the absolute difference between the two images (Row 3). The arrows indicate the regions where TS-Net produces significantly improved correction in comparison with three other SAC methods. It can be seen that TS-Net removes distortions in the uncorrected images significantly. In general, TS-Net produces the output images that are comparable to or better than the outputs of TOPUP, TISAC, and S-Net. Note that the SAC methods work with 3D images; however, for visualization, 2D slices are presented in the figures. For a larger view of the TS-Net outputs, see Figure A1 in Appendix B. Table 4 summarizes the accuracy of uncorrected and corrected images in terms of LNCC on four different test sets. Paired t-tests were performed on the LNCC measures between TS-Net outputs and each of four image types: uncorrected images, TOPUP outputs, TISAC outputs, and S-Net outputs. The null hypothesis is H 0 : m S-Net = m other .
All computed P values are smaller than 0.001; this indicates that the null hypothesis is rejected at a confidence level of 99.9%. In other words, TS-Net produces image pairs with significant differences (i.e., improvements) in terms of accuracy compared to the output image pairs of other methods. For visual clarity, Figure 7 shows the box plots for comparing the LNCC measures of the four SAC methods. The results in Table 4 and Figure 7 show three notable observations. First, TS-Net produces output images that have significantly higher LNCC measures than the uncorrected images; in other words, TS-Net does reduce the susceptibility artifacts. Second, TS-Net produces output images that have higher LNCC measures than the outputs of the TISAC method in four out of four datasets, and the outputs of the TOPUP methods in three out of four datasets. This means that TS-Net has better correction accuracy compared to the two iterative-optimization methods, i.e., TISAC and TOPUP. Third, TS-Net also produces higher LNCC measures than S-Net in four out of four datasets. Compared to S-Net, the proposed TS-Net has several differences, one of which is its use of T 1w images in training. This result demonstrates that including the gold-standard representation of a subject's brain anatomy helps regularize the susceptibility artifact correction in TS-Net. Note that TS-Net does not require the T 1w image in the inference phase, which explains its comparable processing speed with S-Net, as analyzed next.
To compare the processing speed, we first randomly selected 50 distorted image pairs for each of the four datasets. We then recorded the time for correcting the selected image pairs by four SAC methods: TOPUP, TISAC, S-Net, and TS-Net. Table 5 shows the average processing time per image pair of TS-Net and the three SAC methods. Over the four datasets, TS-Net is 396.72 times faster than TOPUP, 29.45 times faster than TISAC, and only 1.05 times slower than S-Net. Both deep learning-based SAC methods (TS-Net and S-Net) can be accelerated by five times when using the GPU instead of the CPU. Note that, in the experiments for all datasets, the proposed TS-Net has 260,187 trainable parameters, whereas the S-Net model has 259,241 trainable parameters. In other words, the proposed TS-Net requires only 0.36% more trainable parameters than S-Net. , Figure 6. Sample visual results of SAC methods from the four test sets. In each subfigure, Column 1: input uncorrected images. Columns 2, 3, 4, and 5: output corrected images produced by TOPUP, TISAC, S-Net, and TS-Net, respectively. Rows 1 and 2: reversed phase-encoding EPI images. Row 3: the color bar of the absolute different maps. Row 4: the absolute difference between the image pair. The results of TS-Net over the four datasets show that the inference time of TS-Net is linearly proportional to the size of the input images. To correct an image pair with a size of 90 × 104 × 72, TS-Net takes 0.65 s using CPU and 0.14 s using GPU. On average, the inference speed of TS-Net is approximately 1.08 million voxels per second with CPU and 5.98 million voxels per second with GPU.

Discussion
This section discusses the proposed TS-Net in three aspects: robustness, generalizability, and feasibility. In terms of robustness, TS-Net can predict realistic 3D displacement fields, i.e., the most dominant displacements in the phase-encoding direction regardless of the PE direction order, resulting in high-quality corrected images. The experiments conducted on four different datasets show that TS-Net performed consistently on different image resolutions, image sizes, image modalities, and training set sizes. Furthermore, it can cope with different phase-encoding directions.
In terms of generalizability, TS-Net is able to learn the generalized features of the susceptibility artifacts in reversed-PE image pairs from one dataset. A trained TS-Net can be easily transferred to a new dataset, effectively reducing the training time. This observation is similar to the generalization capability of the deep networks [35]. Therefore, TS-Net can employ the network initialization techniques, e.g. MAML [36] and Reptile [37], to address the problem of a long training time, which is a common bottleneck in deep learning algorithms.
In terms of feasibility, TS-Net can produce higher accuracy than the state-of-the-art SAC methods, while having a fast processing time. To correct a pair of distorted images, TS-Net only takes less than 5 s using CPU or less than 1 s using GPU. These high-accuracy and high-speed capabilities allow TS-Net to be applied in many applications. For example, TS-Net can be integrated into the MRI scanner to correct SAs in real time; this is typically not possible with the traditional reversed-PE SAC methods because they are slow.

Conclusions
This paper presents an end-to-end 3D anatomy-guided deep learning framework, TS-Net, to correct the susceptibility artifacts in reversed phase-encoding 3D EPI image pairs. The proposed TS-Net contains a deep convolutional network to predict the displacement field in all three directions. The corrected images are then generated by feeding the predicted displacement field and input images into a 3D spatial transform unit. In the training phase, the proposed TS-Net additionally utilizes T 1w images to regularize the susceptibility artifact correction. However, the T 1w images are not used in the inference phase to simplify the use of TS-Net.
The visual analysis shows that TS-Net is able to estimate the realistic 3D displacement field, i.e., the displacements are dominant in the phase-encoding direction, compared with the other two directions. Evaluation on the four large datasets also demonstrates that the proposed TS-Net provides higher correction accuracy than TISAC and S-Net in all four datasets, and TOPUP in three out of four datasets. Over the four datasets, TS-Net runs significantly faster than the iterative-optimization SAC methods: 396.72 times faster than TOPUP and 29.45 times faster than TISAC. TS-Net is slightly slower than S-Net, but it still meets the real-time correction requirement of MRI scanners. Furthermore, the training time of TS-Net on a new dataset can be reduced by using a pre-trained model. Data Availability Statement: This research used the EPI data provided by the Human Connectome Project, which can be accessed via https://www.humanconnectome.org/study/hcp-young-adult/ document/1200-subjects-data-release (accessed on 11 November 2019), with appropriate data usage agreement.
A higher LNCC indicates higher similarity between two output images. We now can express the L sim loss based on the LNCC measure as L LNCC sim (E 1 , E 2 ) = 1 − LNCC(E 1 , E 2 ).