Contrast-Enhanced Liver Magnetic Resonance Image Synthesis Using Gradient Regularized Multi-Modal Multi-Discrimination Sparse Attention Fusion GAN

Simple Summary Contrast-enhanced MR has been used in diagnosing and treating liver patients. Recently, development in MR-guided radiation therapy calls for daily contrast MR for tumor targeting. However, frequent contrast injection is risky to patients. We developed a deep learning model (GRMM-GAN) to synthesize contrast-enhanced MR from pre-contrast images. GRMM-GAN adopts gradient regularization and multi-discrimination mechanisms. It shows superior performance compared with state-of-the-art deep learning models. Abstract Purposes: To provide abdominal contrast-enhanced MR image synthesis, we developed an gradient regularized multi-modal multi-discrimination sparse attention fusion generative adversarial network (GRMM-GAN) to avoid repeated contrast injections to patients and facilitate adaptive monitoring. Methods: With IRB approval, 165 abdominal MR studies from 61 liver cancer patients were retrospectively solicited from our institutional database. Each study included T2, T1 pre-contrast (T1pre), and T1 contrast-enhanced (T1ce) images. The GRMM-GAN synthesis pipeline consists of a sparse attention fusion network, an image gradient regularizer (GR), and a generative adversarial network with multi-discrimination. The studies were randomly divided into 115 for training, 20 for validation, and 30 for testing. The two pre-contrast MR modalities, T2 and T1pre images, were adopted as inputs in the training phase. The T1ce image at the portal venous phase was used as an output. The synthesized T1ce images were compared with the ground truth T1ce images. The evaluation metrics include peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and mean squared error (MSE). A Turing test and experts’ contours evaluated the image synthesis quality. Results: The proposed GRMM-GAN model achieved a PSNR of 28.56, an SSIM of 0.869, and an MSE of 83.27. The proposed model showed statistically significant improvements in all metrics tested with p-values < 0.05 over the state-of-the-art model comparisons. The average Turing test score was 52.33%, which is close to random guessing, supporting the model’s effectiveness for clinical application. In the tumor-specific region analysis, the average tumor contrast-to-noise ratio (CNR) of the synthesized MR images was not statistically significant from the real MR images. The average DICE from real vs. synthetic images was 0.90 compared to the inter-operator DICE of 0.91. Conclusion: We demonstrated the function of a novel multi-modal MR image synthesis neural network GRMM-GAN for T1ce MR synthesis based on pre-contrast T1 and T2 MR images. GRMM-GAN shows promise for avoiding repeated contrast injections during radiation therapy treatment.


Introduction
The American Cancer Society has estimated 41,260 newly diagnosed liver cancer and intrahepatic bile duct cancer patients in the United States and 30,520 related deaths [1]. Medical imaging is essential in both the diagnosis and treatment of liver cancer. Magnetic resonance (MR) imaging provides excellent soft tissue contrast with its versatile and functional imaging sequences. Compared to computed tomography (CT), multi-phase contrast-enhanced MR has shown improved sensitivity in detecting hepatocellular carcinoma (HCC), the most common type of primary liver cancer [2]. Hence, several guidelines recommend multi-phase contrastenhanced MR as the standard imaging modality for liver cancer diagnosis [3][4][5][6]. Treatments of liver cancer include surgery, liver transplant, thermal ablation, chemo-radioembolization, external beam radiation therapy (EBRT), targeted drug therapy, or immunotherapy. EBRT has the advantage of being geometrically targeted and non-invasive within these modalities. However, the outcome of conventional fractionated EBRT is unsatisfactory due to tumor radio-resistance and the risk of radiation-induced liver disease [7]. Stereotactic body radiation therapy (SBRT), a more precise, ablative type of radiation, has gained popularity in the recent decade in overcoming the radio-resistance of various cancers, including liver cancer [8]. SBRT is an aggressive form of EBRT delivering highly hypo fractionated, thus more biologically potent, doses. The success of SBRT hinges on geometrically accurate tumor targeting and rapid dose drop-off to spare the surrounding normal tissues. In other words, SBRT requires a more stringent tumor definition and reduced geometrical margin. To this end, multi-phase contrast-enhanced MR has been increasingly used to register with CT for SBRT planning to better define tumor and normal anatomy interfaces. However, deformable image registration between CT and MR can be error-prone [9].
More recently, MR-guided linear accelerators (LINACs) have been commercialized, making MR-only SBRT planning and adaptation an appealing alternative circumventing MR-CT registration. Besides being more sensitive in detecting liver tumors for more accurate treatment, contrast-enhanced MR imaging could be used for daily treatment response assessment by providing better visibility and richer details in the region of interest as the contrast agents flow with blood vessels over time. However, obtaining daily contrastenhanced images on MR-LINAC can be prohibitive due to the increased risk of side effects from repeated contrast injections [10][11][12]. The side effects can be severe for patients with compromised kidney function, a condition more commonly observed in patients with liver cancer [13]. Safety concerns regarding administering gadolinium-based contrast agents, nephrogenic systemic fibrosis, and additional procedure time [14][15][16] also preclude more frequent contrast-enhanced MR acquisition. The limitation on contrast usage thus severely diminishes the value of contrast-enhanced MR in MR-guided radiotherapy for daily tumor targeting and early response assessment. Therefore, there is a clinical need for predicting contrast-enhancement MR information without repeated contrast injection.
Medical image synthesis [17] is a rapidly developing area benefitting from deep learning (DL) methods, among which the generative adversarial network (GAN) [18,19] was mainly designed for image synthesis. The state-of-the-art GAN methods include pix2pix [20], which enforced the L1 norm paired image similarity and was efficient in paired image synthesis, and CycleGan [21], which learned the high-level features in the source domain and then applied them for style translation. GAN medical imaging contributions include medical image enhancement [22][23][24], super-resolution [25,26], cross-modal MR-CT image synthesis [27,28], and multi-contrast MR synthesis [29][30][31].
Multi-modal learning has attracted increasing interest in MR image synthesis due to the available salient and mutual complementary information. However, inherent redundancy and noisy artifacts exist across multiple modalities, making the efficacious fused learning from multi-modal MR images difficult. Several studies attempted to tackle the challenge. MM-Syns [32] learned a shared latent feature space for multi-modal data and synthesized the multi-output brain an MR image via an encoder-decoder structure. Similarly, LR-cGAN [33] adopted the encoder-decoder design in a GAN-based model for brain MR image synthesis. Hi-Net [34] proposed a mixed fusion block with element-wise operations for multi-modal feature fusion and brain MR image synthesis. MMgSN-Net [35] adopted the structure of Hi-Net with improved self-attention fusion for nasopharyngeal carcinoma MR image synthesis. MustGAN [36] designed a flexible multi-stream fusion framework to aggregate information across multiple sources and predicted missing modality data.
However, abdominal contrast-enhanced MR image synthesis from multi-modal inputs (e.g., T2, T1pre) is still challenging. To the best of our knowledge, there is currently no reported work for multi-modal contrast-enhanced liver MR image synthesis. We hypothesize that three difficulties account for this gap: (1) the inherent heterogeneity across different tumor types and patients in abdominal MR images. Compared with MR image synthesis for other anatomical sites, there is more significant variation in abdominal MR image characteristics due to different tissue composition, vascularization, perfusion, and motion artifacts. The interpatient heterogeneity poses substantial challenges to the discriminative feature learning of multi-modal fusion methods. (2) Co-registration error across abdominal MR modalities. Inevitable breathing motion during the collection of abdominal MR images results in mismatches between different sequences. The uncertainties in the co-registration could confuse a synthesis method, especially for the deep learning model, because of strong memory. (3) Blurring and over-smoothing effect of MR image synthesis. The blurring and over-smoothing effects of synthesized images are common and recognized as an issue in GAN-based models [37,38].
This study presents a novel image gradient-regularized multi-modal multi-discrimination sparse attention fusion generative adversarial network (GRMM-GAN), a non-intrusive, efficient, and cost-saving clinical tool for contrast-enhanced abdominal MR image synthesis. The GRMM-GAN produces synthetic contrast-enhanced abdominal MR images to enable more accurate tumor delineation and response assessment for adaptive liver radiation therapy.

Data and Preprocessing
With IRB (21-33858) approval, we randomly solicited 165 MR studies of 61 liver patients from our institutional database. The patients' demographic and clinical information is summarized in Table 1. Each study included three modalities, T2, T1 pre-contrast (T1pre), and T1 contrast-enhanced (T1ce) at the portal venous phase. The portal venous phase was used to test the MR synthesis idea. In theory, different GRMM-GAN can be built for the arterial and delayed phases. The three modalities of each study were rigidly registered in VelocityAI TM (Varian, a Siemens Healthiness company, Palo Alto, CA, USA). Various resolutions, including 640 × 640, 512 × 512, and 320 × 320, were used in the original MR images. In practice, we downsampled the MR images to 256 × 256 to balance computational cost and image quality.

Overall Pipeline
The GRMM-GAN MR synthesis pipeline consists of a conditional GAN baseline and three modules: sparse attention fusion, gradient regularization (GR) mechanism, and multi-scale multi-smoothness discriminators (MMD), as shown in Figure 1a. Specifically, the sparse attention fusion module was designed to extract modal-specific features from each input modality (T1pre and T2) while eliminating the inherent redundancy and noisy artifacts and discovering the salient and mutual complementary information. The image gradient regularization mechanism preserves the crucial texture and edge features of the abdominal organs and blood vessels such that the inherent heterogeneous knowledge of the abdominal MRs can be well studied. The multi-scale multi-smoothness discriminators examine the synthesis images in different scales and smoothness levels to overcome the blurring and over-smoothing effect. The following sections explain the baseline model and each network module in detail.

Baseline Conditional GAN Model
The GRMM-GAN method is derived from a conditional GAN framework consisting of a generator and a discriminator . The generator's objective is to fuse the multimodal pre-contrast MR inputs, T1pre and T2, and predict the contrast-enhanced T1ce. At the same time, the discriminator focuses on the discrimination of the synthetic MR

Baseline Conditional GAN Model
The GRMM-GAN method is derived from a conditional GAN framework consisting of a generator G and a discriminator D. The generator's objective G is to fuse the multi-modal pre-contrast MR inputs, T1pre and T2, and predict the contrast-enhanced T1ce. At the same time, the discriminator D focuses on the discrimination of the synthetic MR from the real MR. The prime conditional GAN objective function L cGAN is given by: where the multi-modal inputs and target reference are denoted as x 1 , x 2 , and y, respectively. Following the pix2pix method [20], the L1 loss is employed to promote the structural consistency between the real and the synthetic output and to avoid blurring. The L1 loss is defined as:

Sparse Attention Fusion
Although the current conditional GAN-based method shows promise for MR image synthesis, the inherent redundancy and noisy artifacts across multiple modalities present challenges for the effective fused learning of multi-modal abdominal MR images. To achieve superior fusion by discovering the salient and mutual complementary information in multiple MR modalities while discriminating redundancy and noise from distortion and ambiguous co-registration, we designed a sparse attention fusion network, as shown in Figure 1b. This module was inspired by the previously proposed L 1 sparsity-regularized attention feature fusion work (L1-attention fusion) [39]. Specifically, a sparse regularization term λ c i was introduced to the learned attention (c i ) where the sparsity regularizer (λ) controls the sparsity level of the attention weights. The sparse attention updates were realized using soft thresholding after the forward inference at each iteration, shown as follows: The sparsity constraint, applied to the attention estimated for multi-modal fusion, eliminates the inherent redundancy across modalities and improves discriminative ability.

Gradient Regularization Mechanism
We added an image gradient regularization to manage the inter-patient heterogeneity and capture salient high-level features. At the same time, the network discriminated the redundancy and artifacts from the noisy inputs. The gradient regularization mechanism [40] introduced a gradient preservation loss, shown as the following: where g is the Sobel gradient operator [41], and ⊗ denotes the convolution operation, as shown in Figure 2a. A similar idea was recently applied as Ea-GANs for brain MR image synthesis [42]. The gradient regularization mechanism enforced the fidelity of the firstorder information between the synthetic and real T1ce images. It thus preserved the crucial texture and edge features of the abdominal organs and blood vessels while increasing the module robustness to heterogeneous redundancy and noise. Figure 2a shows resultant gradient maps for the synthetic and real T1ce images after the Sobel operation.
where is the Sobel gradient operator [41], and ⊗ denotes the convolution operation, as shown in Figure 2a. A similar idea was recently applied as Ea-GANs for brain MR image synthesis [42]. The gradient regularization mechanism enforced the fidelity of the first-order information between the synthetic and real T1ce images. It thus preserved the crucial texture and edge features of the abdominal organs and blood vessels while increasing the module robustness to heterogeneous redundancy and noise. Figure 2a shows resultant gradient maps for the synthetic and real T1ce images after the Sobel operation.

Multi-Discrimination Mechanism
In GAN-based models for image synthesis, blurring and over-smoothing effects are commonly observed since the optimization of the fidelity loss could easily fall on a local optimum [37,38]. This problem becomes more severe and visible due to the heterogeneous nature and ambiguous co-registration of multi-modal abdominal MR images. Inspired by the multi-scale discriminators [43,44] that enhanced the resolution of the image synthesis, as well as the illumination and scale-invariant SIFT feature learning [45,46], we propose multi-scale multi-smoothness discriminators (MMD) to improve the discriminative ability and counter over-smoothness in synthesized images. The multi-modal inputs, synthesis, and ground truth images were downsampled twice into half of their original size. Then a Gaussian filter was applied to blur the synthetic image to simulate the over-smoothness effect. In total, there were three scales (256'256, 128´128, and 64´64), creating two pairs each with the original image downsampled over smoothed image for the discriminator. Three discriminators were trained given the paired blurred synthetic and real images, as shown in Figure 2b. The module thus learned blurring and over-smoothing effects in the synthesis and enforced the discrimination against these effects in adversarial training. The multi-scale multi-smoothness discrimination function is given as follows:

Multi-Discrimination Mechanism
In GAN-based models for image synthesis, blurring and over-smoothing effects are commonly observed since the optimization of the fidelity loss could easily fall on a local optimum [37,38]. This problem becomes more severe and visible due to the heterogeneous nature and ambiguous co-registration of multi-modal abdominal MR images. Inspired by the multi-scale discriminators [43,44] that enhanced the resolution of the image synthesis, as well as the illumination and scale-invariant SIFT feature learning [45,46], we propose multi-scale multi-smoothness discriminators (MMD) to improve the discriminative ability and counter over-smoothness in synthesized images. The multi-modal inputs, synthesis, and ground truth images were downsampled twice into half of their original size. Then a Gaussian filter was applied to blur the synthetic image to simulate the over-smoothness effect. In total, there were three scales (256 256, 128 128, and 64 64), creating two pairs each with the original image downsampled over smoothed image for the discriminator. Three discriminators were trained given the paired blurred synthetic and real images, as shown in Figure 2b. The module thus learned blurring and over-smoothing effects in the synthesis and enforced the discrimination against these effects in adversarial training. The multi-scale multi-smoothness discrimination function is given as follows: where G is the generator for image synthesis, D k , k = 1, 2, 3 are the three discriminators for over-smoothing discrimination, and L GAN is the conditional GAN loss function.

Objective and Optimization
After incorporating the sparse attention fusion, image gradient regularization, and the multi-scale multi-smoothness discriminators into the prime conditional GAN model, we wrote the final objective of the proposed GRMM-GAN formula as follows: where λ 1 and λ 2 are the trade-off hyper-parameters to balance the influence of different loss terms.
The proposed network was implemented on a workstation with Intel i9 7900x CPU and NVIDIA RTX 2080Ti 4 GPU under the PyTorch 1.4 and Ubuntu 18.04 environment. For the contrast-enhanced MR image synthesis task, the training procedure of our proposed model took 150 epochs, and we applied the Adam optimization algorithm with a batch size of 16 to update the network parameters. The balance hyper-parameter λ 2 was set to 100. λ 1 took a linear increase from 0 to 50 in the first 50 epochs and was frozen to 50 afterward. The learning rate for the first 50 epochs was fixed at 0.0002 and then linearly decreased to 0 in the following 100 epochs.

Model Evaluation
The performance of the proposed GRMM-GAN was evaluated quantitatively and compared with four state-of-the-art image synthesis networks: pix2pix [20], LR-cGAN [33], Hi-Net [34], and MMgSN-Net [35]. Pix2pix is the widely applied single modal image-toimage translation model in which both T2 and T1pre were set as the input. The other three comparisons are all state-of-the-art multi-modal MR image synthesis methods. In addition, 6 radiation oncologists (RadOnc 1-6), including 2 board-certified and 4 residents, performed Turing [47] tests to determine the authenticity of the synthesized T1ce images against the real images on 100 randomly (1:1 ratio) selected axial slices.
Three widely applied statistics metrics for evaluating the medical imaging synthesis include peak signal-to-noise ratio (PSNR), the structural similarity index (SSIM), and mean squared error (MSE). The definitions of these metrics are presented as follows: where y and G(x) are the ground truth and synthetic images, respectively. N represents the total number of pixels in each image slice. µ y , µ G(x) and σ y , σ G(x) are the mean and variance of the ground truth image and the synthesis image, and σ yG(x) is the covariance between y and G(x). c 1 = k 1 L) 2 and c 2 = k 2 L) 2 are two variables introduced to stabilize the SSIM division index with a weak denominator; here, the default setting of k 1 = 0.01 and k 2 = 0.03 are adopted, and L = 255 is the dynamic range of the pixel intensity.
To evaluate the synthesis performance within the tumor region, we retrieved the clinical diagnosis report of the patients and selected twenty patients with confirmed tumors. According to the diagnosis report and tumor markers depicted by the radiologists, we manually drew three bonding boxes on the axial slice containing the tumor, i.e., the tumor region, the normal liver tissue region, and the background noise region. The tumor contrastto-noise ratio (CNR) is defined as: where µ T and µ N are the average values of the tumor region and the normal tissue region, respectively, and σ B is the standard deviation of the background ground noise region.
To evaluate the impact of image synthesis on liver tumor delineation, we invited 2 additional physicians (1 radiation oncologist attending with >20 years of experience and 1 senior medical resident trained by the same attending) to independently contour 21 liver tumors from 10 testing patients. The DICE coefficient (defined as the intersection over an average of two volumes) and Hausdorff distance (HD, measuring how far two volumes are from each other) were used as analysis metrics. Both physicians first performed tumor delineation on the real T1ce MRs. The DICE coefficients were calculated from the two physicians' contours for each tumor. The average DICE (RadOnc7 vs. RadOnc 8) served as the baseline. The attending performed delineation on both the real T1ce MRs and synthetic T1ce MRs. The DICE (real vs. synth) coefficient and Hausdorff distance were calculated from volumes contoured on real and synthetic MRs for each tumor.
We performed tumor center shift analysis to further evaluate the potential effects of image-guided radiation therapy. This analysis extracted the tumor center of mass coordinates from real and synthetic volumes. The difference in the coordinates indicated the shifts in the superior-inferior (SI), right-left (RL), and anterior-posterior (AP) directions between real and synthetic tumor volumes.

Results
The synthesized T1ce was compared with the ground truth T1ce for thirty random patients. The overall performance is presented in Table 2. GRMM-GAN achieved a PSNR of 28.56 ± 0.87, an SSIM of 0.869 ± 0.028, and an MSE of 83.27 ± 15.42. GRMM-GAN outperformed all state-of-the-art multi-modal MR synthesis models and the single-modal pix2pix method. GRMM-GAN significantly improved all metrics over the comparison methods (p-value < 0.05). Figure 3 presents the synthesis results for one example patient. GRMM-GAN is shown to maintain rich details and textural consistency. The detail preservation is evident in the tumor region denoted by the red box.  In comparison, the other models, including both the single-input Pix2pix models and multi-input fusion models (LR-cGAN, Hi-Net, and MMgSN-Net), resulted in a substantial loss of anatomical details due to the lack of complementary feature mining and discrimination. The pix2pix methods with single input could not exploit the complementary information across different MR modalities. Although the pix2pix with T1pre input, as shown in Figure 3e, roughly predicted the tumor contour, it lost fine details, such as small vessels in the liver. The synthesis quality improved using LR-cGAN with added T2 input, as shown in Figure 3f, but the improvement on small vessel structures was modest. Hi-Net and MMgSN-Net more substantially improved fine structural preservation at the cost of compromising the contour integrity of the hypodense liver tumors. In contrast, our method, GRMM-GAN, retained the tumor integrity and preserved the high-contrast fine features, as shown in Figure 3i.
The Turing test results further substantiated the image quality preservation shown in Table 3. The average Turing test score from the 6 radiation oncologists was 52.33% ± 6.06, which is close to random guessing, indicating comparable visual quality between the In comparison, the other models, including both the single-input Pix2pix models and multi-input fusion models (LR-cGAN, Hi-Net, and MMgSN-Net), resulted in a substantial loss of anatomical details due to the lack of complementary feature mining and discrimination. The pix2pix methods with single input could not exploit the complementary information across different MR modalities. Although the pix2pix with T1pre input, as shown in Figure 3e, roughly predicted the tumor contour, it lost fine details, such as small vessels in the liver. The synthesis quality improved using LR-cGAN with added T2 input, as shown in Figure 3f, but the improvement on small vessel structures was modest. Hi-Net and MMgSN-Net more substantially improved fine structural preservation at the cost of compromising the contour integrity of the hypodense liver tumors. In contrast, our method, GRMM-GAN, retained the tumor integrity and preserved the high-contrast fine features, as shown in Figure 3i.
The Turing test results further substantiated the image quality preservation shown in Table 3. The average Turing test score from the 6 radiation oncologists was 52.33% ± 6.06, which is close to random guessing, indicating comparable visual quality between the synthetic and real images. The contribution of each module was evaluated in an ablation test. The performance by different statistics metrics, PSNR, MSE, and SSIM, showed a consistent trend of additive value. Therefore, only MSE is described and discussed here for brevity. Figure 4 provides a visual evaluation of the ablation study. After removing the multi-scale and multismoothness (MMD) component, the average performance indicated by MSE increased from 83.27 ± 15.42 to 105.43 ± 16.15, highlighting the contribution of MMD in the discrimination of the low-quality synthesis. The exclusion of the GR model further increased the MSE to 121.65 ± 16.32, indicating the contribution of GR to structural preservation. Additionally, the average PSNR, SSIM, and MSE for the synthesized tumor region are 28.40, 0.856, and 88.71, respectively. The multiple performance evaluations show that the real and synthesized MR images for the specific tumor region are very similar, leading to the conclusion that the synthesized MR images could be a suitable surrogate for real contrast MR when the latter is unobtainable.
The average CNR of 20 patients for the real MR tumor region was 26.18 ± 21.13. In contrast, the average CNR for the synthesized MR tumor regions was 24.53 ± 20.08 with a p-value of 0.401 with no significant difference between the real and synthesized images for tumor analysis, suggesting similar tumor conspicuity provided by synthetic and real T1ce MR images.

Discussion
Contrast-enhanced MR images provide improved visualization of liver tumors, which is essential for MR-guided radiotherapy and SBRT. However, repeated injections of contrast may not be clinically viable. As an alternative, we synthesized contrast-enhanced MR images from multi-modality pre-contrast MR images using a novel GRMM-GAN model. The synthesized virtual contrast-enhanced MR images closely mimicked the ground truth contrast-enhanced images in quantitative image analysis and human expert Turing tests. The success indicates that the pre-contrast T1 and T2 images have a substantial predictive value for the post-contrast MR. The latent information in T1 and T2 images is intricate for human

Discussion
Contrast-enhanced MR images provide improved visualization of liver tumors, which is essential for MR-guided radiotherapy and SBRT. However, repeated injections of contrast may not be clinically viable. As an alternative, we synthesized contrast-enhanced MR images from multi-modality pre-contrast MR images using a novel GRMM-GAN model. The synthesized virtual contrast-enhanced MR images closely mimicked the ground truth contrast-enhanced images in quantitative image analysis and human expert Turing tests. The success indicates that the pre-contrast T1 and T2 images have a substantial predictive value for the post-contrast MR. The latent information in T1 and T2 images is intricate for human operators to appreciate but can be distilled via image synthesis into a format, e.g., post-contrast MR, familiar to human operators. Our technical innovations in building the GAN network overcame multiple challenges in existing image synthesis methods.
Several deep learning models have been developed for brain MR synthesis [34,48]. Compared with brain images, abdominal MR images are considerably more heterogeneous in soft tissue composition, MR relaxation properties, size, shape, and textures. The heterogeneity is further compounded by substantial motion in this anatomical region, leading to mismatches among different MR sequence acquisitions. Therefore, abdominal MR synthesis is a more challenging problem.
The challenge is highlighted by prior efforts using single-modal deep learning models for abdominal MR image synthesis [49,50], which fail to learn the salient knowledge across MR modalities, leading to a substantial loss in the synthesized image quality. The proposed GRMM-GAN is the first network to discover and fuse mutual complementary information, which markedly improves the synthesis performance, showing realistic contrast enhancement style translation, precise contours, and textures.
Our technical innovations are summarized as follows. First, a multi-modal fusion model was developed to exploit the salient and complementary information in multiple MR modalities. A sparse attention fusion module was investigated to distinguish the redundancy incurred. Second, we adopted the previously proposed image gradient regularization mechanism to avoid the loss of anatomical details in the GAN-based model and presented the novel multi-scale and multi-smoothness discrimination. In addition to these technical innovations, we prepared a valuable multi-modal abdominal database for contrast enhancement MR synthesis with patient data (165 scans from 61 patients) to support model development and validation.
The exclusion of the GR and MMD components decreases the average MSE to 105.43 ± 16.15 (p-value < 0.05) and 121.65 ± 16.32 (p-value < 0.05), respectively. Figure 4 provides an intuitive understanding of the essential contribution of each module. The absence of the MMD module degrades the discriminative ability of the network in over-smoothing and blurring effects, which can be visually observed in Figure 4c. The further removal of the GR module decreases the perception of the network to textural and structural details, as illustrated in Figure 4d. The specific tumor region synthesis evaluation also solidifies the proposed method in potential clinical application. The CNR and other performance evaluation metrics (PSRN, SSIM, and MSE) indicate that the salience and similarity of the synthesized specific tumor of interest are good enough compared with the real tumor data.
Similar to our GRMM-GAN model, the three multi-modal fusion generative methods, LR-cGAN, Hi-Net, and MMgSN-Net, generally performed better than the single modality synthesis model (pix2pix). The better performance could be attributed to the theory of multi-modal fusion to exploit the complementary information from different MR modalities. Specifically, MMgSN-Net was derived from Hi-Net and achieved the second-best overall performance boosted by the self-attention fusion module multi-modal feature learning. However, these methods still failed to discriminate the redundancy that existed accordingly, resulting in inferior synthesis quality with over-smoothing effects. For the single modal synthesis comparison, the pix2pix model yielded better performance with the T1pre as the input than with T2, likely because T1pre is more similar to the output T1ce modality. On the contrary, Li et al. [35] reported that the crucial knowledge from multi-modalities (e.g., T2) in revealing the hyper-intensity or pathological inhomogeneity characteristics could not be neglected.
Despite the excellent image synthesis quality and expert evaluation results, our work has the following limitations. We downsampled the original MR images from 256 × 256 to balance the image quality and computation complexity. This preprocessing could compromise the fidelity and precision of the synthesis results compared to the original resolution. A superresolution [43,51] or interpolation [52,53] could be incorporated. Although quantitative metrics, such as PSNR, MSE, and SSIM, are suited for comparative image quality tests, they are not directly linked to more clinically relevant dosimetric measures for adaptive planning applications. We have performed an exploratory contouring study and showed satisfactory