DC 2 Anet: Generating Lumbar Spine MR Images from CT Scan Data Based on Semi-Supervised Learning

Abstract: Magnetic resonance imaging (MRI) plays a significant role in the diagnosis of lumbar disc disease. However, the use of MRI is limited because of its high cost and significant operating and processing time. More importantly, MRI is contraindicated for some patients with claustrophobia or cardiac pacemakers due to the possibility of injury. In contrast, computed tomography (CT) scans are much less expensive, are faster, and do not face the same limitations. In this paper, we propose a method for estimating lumbar spine MR images based on CT images using a novel objective function and a dual cycle-consistent adversarial network (DC 2 Anet) with semi-supervised learning. The objective function includes six independent loss terms to balance quantitative and qualitative losses, enabling the generation of a realistic and accurate synthetic MR image. DC 2 Anet is also capable of semi-supervised learning, and the network is general enough for supervised or unsupervised setups. Experimental results prove that the method is accurate, being able to construct MR images that closely approximate reference MR images, while also outperforming four other state-of-the-art methods.


Introduction
Computed tomography (CT) scanning is a medical imaging technique that is widely used for diagnostic and therapeutic purposes in a variety of clinical applications. Magnetic resonance imaging (MRI) is another imaging technique that visualizes anatomical details and is used in radiology and nuclear medicine. A comparison of the strengths and weaknesses of these imaging approaches is shown in Table 1. Unlike CT scans, MRI can detect slight differences in soft tissue, ligaments, and organs, which is beneficial for diagnosis. However, MRI is not only much more expensive, but also requires more time to produce its results, meaning that patients often prefer CT scans to MRI.
Lumbar disc herniation is common among the elderly and people who sit for long periods. The use of MRI to observe the spinal cord and the disc signals of the lumbar spine is of great importance in the treatment of this condition. However, some patients with claustrophobia or cardiac pacemakers are prevented from receiving an MRI due to possible injury. Thus, the ability to generate a reliable magnetic resonance (MR) image from a CT scan for these patients is vital. This would not only

Table 1. Comparison of the strengths and weaknesses of CT scans and MRI.
Benefits
CT: Faster and can provide images of tissue, organs, and skeletal structure
MRI: Produces more detailed images
Risks
CT:
• Harmful for unborn babies
• A very small dose of radiation
• A potential reaction to the use of dyes
MRI:
• Possible reactions to metals due to magnets (e.g., artificial joints, eye implants, intrauterine devices, pacemakers)
• Loud noises from the machine can cause hearing issues
• Increase in body temperature during long MRIs
• Claustrophobia

In recent years, researchers have increasingly searched for ways to replace CT scans with MRI when planning for radiation therapy [4][5][6]. However, CT-based MR image construction has received little attention. It is challenging to generate an MR image directly from a CT image using a linear model because it is difficult to produce high-level image domains based on low-level ones. In response to this, we propose a synthesis method based on convolutional neural networks (CNNs) [1] with adversarial training [3] to produce a lumbar spine MR image from a CT scan. In this process, the development of an objective function for the deep neural network is essential [7,8]. An objective function is a combination of loss terms that maps the state of a predefined network to a real number, which intuitively represents the "cost" associated with the performance of the network in that state. The optimization process seeks to minimize this cost by updating trainable variables to determine an optimal network. Synthetic images in medical imaging need to be not only realistic, i.e., indistinguishable from genuine images by human experts, but also very similar to reference images. In this study, we propose a novel objective function that balances two quantitative loss terms and three qualitative loss terms to construct lumbar spine MR images from CT images. A dual cycle-consistent loss is also included for semi-supervised learning, which alternates between supervised and unsupervised optimization in order to seek a global minimum for the optimal network.
Experimental results based on quantitative and qualitative evaluations prove the superiority of the proposed method compared with other state-of-the-art methods. The main contributions of this study are as follows:
• An objective function is proposed to balance quantitative and qualitative loss terms to construct a realistic and accurate synthetic MR image. This function consists of adversarial, dual cycle-consistent, voxel-wise, gradient difference, perceptual, and structural similarity losses. Using ablation analysis, the importance and effectiveness of each of these loss terms are investigated.
• The dual cycle-consistent adversarial network (DC 2 Anet) is proposed as a general synthesis system for semi-supervised learning. Due to its dual cycle-consistent structure, DC 2 Anet can be applied to both supervised and unsupervised learning.
This paper first summarizes previous research on the synthesis of medical images in Section 2. The proposed algorithm is outlined in Section 3, while Section 4 reports the experimental results and discussion. A conclusion is presented in Section 5.

Literature Review of Medical Imaging Synthesis
In medical imaging, a number of methods have been proposed for generating one image domain from another, e.g., constructing a CT image from MRI data or a positron emission tomography (PET) image from CT data. Existing methods can be divided into three categories: tissue-segmentation, learning, and atlas-based methods. Tissue segmentation first divides MR image voxels into different tissue classes, such as air, fat, soft tissue, and bone, and then, the segmentation classes are refined manually [5,9]. However, tissue segmentation is difficult, and its performance strongly depends on segmentation accuracy and the quality of the manual input. Learning-based methods extract features that represent two different domains and then construct a non-linear map between them. However, these methods depend on the quality of the feature extraction in terms of how well they can represent the different domains. Additionally, generating one image domain from another is not as simple as one-to-one mapping [10,11]. Atlas-based methods apply image registration to align an MR image with an atlas MR image to approximate the correspondence matrix. The matrix can then be used to warp the associated atlas CT image to generate the query CT image [12][13][14]. However, the performance of atlas-based methods is closely associated with the registration accuracy for the two image domains. Furthermore, it is difficult to cover pathological differences or significant anatomical variations using atlas data.
In recent years, CNNs [15,16] have demonstrated outstanding performance in various computer vision tasks. In particular, several studies have proven that CNNs are useful in medical imaging [17], such as skin cancer classification [18], X-ray organ segmentation [19], retinal vessel segmentation [20], and brain lesion detection [21]. In these applications, CNN-based medical image synthesis can be considered a form of regression in which non-linear mapping functions are stacked from one image domain to another. For example, Han [22] applied a U-Net [23] architecture consisting of an encoder network and a decoder network in which some layers were connected by skip connections to construct a synthetic CT image from an MR image. To train the deep CNN model using a limited dataset, Han [22] employed transfer learning by initializing the encoder network using a pretrained 16-layer VGG (VGG16) network [24]. The objective function of the network used voxel-wise loss only to minimize the difference between the synthetic and reference images. However, because voxel-wise loss is minimized by averaging all plausible outputs, simply minimizing this loss may produce blurry results. Additionally, the slight voxel-wise misalignment of training data may further lead to a blurry constructed image. Designing objective functions that force the CNN to operate as required, e.g., to generate sharp, realistic, and accurate synthetic images, remains an unsolved problem and generally requires both prior knowledge and experimental observations.
In image generation, generative adversarial networks (GANs) [25], which are a form of generative model [26,27], have been widely employed to produce state-of-the-art, realistic images [28,29] for applications such as GAN-based image inpainting [30] and video generation [31,32]. An adversarial loss of GAN learns to satisfy a high-level goal, such as generating an output image indistinguishable from reality. A discriminator network of GAN is used to distinguish whether an image is real or synthesized while simultaneously training a generator network to minimize the adversarial loss. For example, Bi et al. [33] presented a multi-channel GAN with an objective function consisting of adversarial loss and voxel-wise loss to generate a synthetic PET image from a CT image. Similarly, Ben-Cohen et al. [34] independently applied the advantages of a fully-convolutional network (FCN) [35] and a pixel-to-pixel (pix2pix) model [36] to synthesize a realistic PET image from a CT image. This method also used adversarial loss and voxel-wise loss together. Extending the above method, Nie et al. [37] proposed a context-aware GAN that utilized a 3D CNN [38,39] and an auto-context model [40] to generate a CT sequence from an MR sequence. The voxel-wise loss of the 3D CNN learns both the spatial and temporal information of the sequence data. In contrast, Wolterink et al. [41] applied a cycle-consistent GAN (cycleGAN) [42] with least-squares adversarial loss [43] in which the loss term leads to the stable optimization of the network when synthesizing an MR image using a CT image. A cycle-consistent loss [42,44] produces not only a synthetic image that looks real, but also one that is similar to the input under unsupervised learning. A CT-based MR image estimation method was first proposed by Jin et al. [45]. They proposed a synthesis system referred to as MR-GAN using a dual cycle-consistent structure. Their MR-GAN is trainable with paired and unpaired data together to improve performance. 
In addition to dual cycle-consistent loss, their objective function includes two other loss terms: adversarial loss and voxel-wise loss. Table 2 presents a comparison of the network architectures and objective functions of the deep-neural-network-based medical synthesis methods. Compared to the methods in Table 2, this study proposes a more general system for cross-modality synthesis, DC 2 Anet, that supports both supervised and unsupervised learning, together with a new objective function that balances quantitative and qualitative loss terms.

[Table 2. Comparison of the network architectures and objective functions of deep-neural-network-based medical synthesis methods. Models listed: pretrained VGG16 [24] with U-Net [23]; pix2pix [36]; 3D ConvNet [38,39] and auto-context model [40]; cycleGAN [42]; DiscoGAN [44]; MR-GAN [45]. Objective functions listed: least-squares adversarial [43] and cycle-consistent [42]; adversarial [25] and cycle-consistent [42]; adversarial [25], voxel-wise, and dual cycle-consistent [45]. Discriminator entries listed: No; PatchGAN [36]; PatchGAN [36]. Final row: No. of layers in the discriminator.]

The voxel-wise loss function measures the difference between the synthetic and the reference images, but it cannot reflect the perceptual difference between the two images. For example, even when two otherwise identical images carry the same perceptual information, they will have very different voxel-wise loss measurements if they are offset from each other by just one pixel. Recent work has shown that high-quality synthetic images can be produced using perceptual loss based on differences between high-level feature representations extracted from a pretrained CNN [48,49]. Gatys et al. [49] conducted artistic style transfer by jointly minimizing feature reconstruction loss [50] and style reconstruction loss based on features extracted from a pretrained VGG16 network [24]. Johnson et al. [48] produced visually-pleasing results using image style transformation and single-image super-resolution, with voxel-wise loss replaced by perceptual loss.
Structural similarity (SSIM) [51,52] is another qualitative measurement approach that is based on the human visual system and is used to compare local patterns of structural information that have been normalized for luminance and contrast. In our study, a similar structural loss term was proposed to retain the structural patterns of lumbar spine CT scans in the synthesis of MR images. Additionally, to balance quantitative and qualitative performance, gradient difference loss and perceptual loss were included based on adversarial, voxel-wise, and dual cycle-consistent loss.

Converting Supervised Learning to Semi-Supervised Learning
Semi-supervised learning is a form of machine learning that makes use of a small amount of aligned data and a large volume of unaligned data. It thus represents a combination of supervised learning (which utilizes completely aligned data) and unsupervised learning (which does not include aligned data). In our study, the paired CT and MR images were aligned, meaning that the CT image and its corresponding MR image were from the same slice of the same patient, with some post-processing such as image registration for coordinate offsetting and manual correction by neuroradiologists. For image registration, we utilized the contours of the body vertebra in CT and MR images to estimate the parameters of the affine transformation to register the two images. In contrast, unaligned data included CT and MR images that were captured from different slices or even different patients.
In medical image synthesis, supervised learning can easily be converted to unsupervised learning. Supervised learning applies aligned training data in which an output image corresponds to each input image. Conversely, by disconnecting the aligned data so that training uses a separate input set and output set, medical synthesis becomes an unsupervised learning task. A semi-supervised learning framework can also be constructed to utilize both supervised and unsupervised learning together. Figure 1 illustrates the conversion from supervised learning to semi-supervised learning. The left-hand side of the figure displays supervised learning using aligned data. The squares and circles represent the image domains X and Y, respectively. The three aligned points are indicated by different colors (red, green, and blue) with parentheses. By disconnecting the aligned data and recombining the different domains, unaligned data are generated, as shown on the right-hand side of Figure 1. In this manner, the three aligned data points can be converted into six unaligned data points; in general, N aligned pairs yield N(N − 1) unaligned pairs, so the volume of unpaired data grows quadratically. Semi-supervised learning is thus conducted by combining supervised learning with aligned data and unsupervised learning with unaligned data. The advantage of this approach is that supervised learning uses the averages of all plausible outputs to reduce the bias of the domain translation, while unsupervised learning focuses on the structural pattern of the two image domains, reducing the variance of the model estimation process. Additionally, semi-supervised learning can more efficiently use a limited volume of paired data by combining the three paired data points with the six unpaired data points shown in Figure 1. The proposed DC 2 Anet is capable of semi-supervised learning, and the network is general enough for both supervised learning and unsupervised learning.
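The recombination of aligned pairs into unaligned pairs can be sketched in a few lines. This is only an illustration: the function name `make_unaligned_pairs` and the string placeholders standing in for images are assumptions, not part of the published code.

```python
from itertools import permutations

def make_unaligned_pairs(aligned_pairs):
    """Recombine aligned (ct, mr) pairs into unaligned (ct, mr) pairs.

    Each CT image is matched with the MR image of every *other* pair,
    so N aligned pairs give N * (N - 1) unaligned pairs.
    """
    return [(ct_a, mr_b)
            for (ct_a, _), (_, mr_b) in permutations(aligned_pairs, 2)]

# Three aligned pairs, mirroring the red, green, and blue points of Figure 1.
aligned = [("ct_red", "mr_red"), ("ct_green", "mr_green"), ("ct_blue", "mr_blue")]
unaligned = make_unaligned_pairs(aligned)
print(len(unaligned))
```

With three aligned pairs, the list holds the six cross-combinations and none of the original aligned pairs.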

Dual Cycle-Consistent Adversarial Network
A GAN [25] is a generative model that is designed to generate synthetic samples directly from the desired data distribution without the need to model the underlying probability density function explicitly. It consists of two different networks that are trained simultaneously, with the generator network focused on image generation and the discriminator network used to distinguish between real samples and the synthetic images. The idea of using a cycle-consistent approach to regularize structural data has a long history in visual tracking [53] and structure from motion [54]. The cycle-consistent structure employed in the GAN (cycleGAN) [44] enables unsupervised learning, stitching two generator networks together head to toe so that the synthetic images can be translated into a forward cycle. In addition to the forward cycle, the cycleGAN also has a backward cycle to stabilize the training process and prevent mode collapse. The forward cycle enforces the translation from the CT domain to the MR domain, while the backward cycle moves from the MR domain to the CT domain.
The proposed DC 2 Anet also applies a cycle-consistent structure for its unsupervised learning setup. However, the proposed network has a dual cycle-consistent structure for the adoption of semi-supervised learning: one cycle-consistent structure for supervised learning with aligned data and the other for unsupervised learning with unaligned data. A diagram of DC 2 Anet is presented in Figure 2. Because the forward and backward cycle-consistent networks with aligned data or unaligned data are similar, we only illustrate a forward cycle-consistent adversarial network with unaligned learning in Figure 2a and a backward cycle-consistent adversarial network with aligned learning in Figure 2b.
In the forward cycle-consistent adversarial network with unaligned learning, the Syn MR network generates a synthetic MR image from a CT image, and this MR image is then used by the Syn CT network to reconstruct the original CT image in order to learn the domain structures. The input to the MR discriminator network is either a sample MR image from the real MR data or a synthetic MR image. The objective function for unaligned learning includes both cycle-consistent and adversarial loss. In the backward cycle-consistent adversarial network with aligned learning, a synthetic CT image is generated from an MR image, and this CT image is employed by the Syn MR network to reconstruct the original MR image. The CT discriminator is used to distinguish between the synthetic CT and reference CT images. In aligned learning, a reference image is matched with the synthetic image to constrain the generated structure of the output. In addition to the cycle-consistent and adversarial loss, the objective function of aligned learning therefore also includes the losses that require a reference image: voxel-wise, gradient difference, structural, and perceptual loss, the last of which is computed with the pretrained VGG16 network [24]. Four switches control the data flow from the reference image; they are connected in aligned learning and disconnected in unaligned learning. It is also important to note that the Syn MR and Syn CT networks utilized in the forward and backward cycles share the same weights.

Objective Function
Our goal is for the mapping functions between the CT image and MR image domains to be learned using the given aligned training data. As illustrated in Figure 1, aligned data (I CT , I MR ) are converted into unaligned data {I CT , I MR }, and the aligned and unaligned data points are utilized together in semi-supervised learning. DC 2 Anet includes two synthesis networks, Syn MR : CT → MR and Syn CT : MR → CT, and two discriminator networks, Dis MR and Dis CT , where Dis MR aims to distinguish between the reference MR image I MR and the synthetic MR image Syn MR (I CT ); in the same way, Dis CT aims to distinguish between the reference CT image I CT and the synthetic CT image Syn CT (I MR ). Moreover, the four networks are optimized with different objectives for supervised and unsupervised learning, depending on whether the input data are aligned or unaligned. Additionally, to measure the high-level perceptual and semantic differences between two images, a VGG16 perceptual network pretrained on the ImageNet dataset [55] is employed in DC 2 Anet. Our objective function contains six loss terms in total: adversarial, dual cycle-consistent, voxel-wise, gradient difference, perceptual, and structural similarity. A summary of the strengths and weaknesses of each loss term is given in Table 3.
We apply adversarial loss [3] to both the supervised and unsupervised setups. For the forward and backward mappings Syn MR : CT → MR and Syn CT : MR → CT and the discriminators Dis MR and Dis CT , the supervised adversarial loss L sup−adver (Equation (1)) consists of four terms: the first two terms are the forward adversarial loss and the last two terms are the backward adversarial loss. The network Syn MR attempts to synthesize images Syn MR (I CT ), which look similar to images from the MR domain, while Dis MR aims not only to discriminate between synthetic MR and reference MR images, but also to ensure that the synthetic images are generated from the corresponding CT images I CT . For the backward adversarial loss, the synthesis network Syn CT generates the synthetic CT images Syn CT (I MR ), which look similar to images from the CT domain, while Dis CT aims to distinguish between synthetic and reference CT images based on the MR images I MR . The synthesis networks Syn MR and Syn CT attempt to minimize this objective function, while the adversarial discriminator networks Dis MR and Dis CT aim to maximize it, i.e.,

$$\mathrm{Syn}^{*}_{MR}, \mathrm{Syn}^{*}_{CT} = \arg\min_{\mathrm{Syn}_{MR},\,\mathrm{Syn}_{CT}} \; \max_{\mathrm{Dis}_{MR},\,\mathrm{Dis}_{CT}} \; \mathcal{L}_{\text{sup-adver}}(\mathrm{Syn}_{MR}, \mathrm{Dis}_{MR}, \mathrm{Syn}_{CT}, \mathrm{Dis}_{CT})$$

In Equation (2), we introduce a similar form of adversarial loss for unsupervised learning for the synthesis networks and discriminators. However, the discriminators for unsupervised learning only need to distinguish whether the images are real or synthetic; the source domain images are not input into the discriminators.

Table 3. A summary of the strengths and weaknesses of each loss term used in DC 2 Anet. The last column indicates whether the loss term requires aligned training data.

In unsupervised learning, adversarial loss alone cannot guarantee that the learned synthesis network can map an input image to the desired output image. To reduce the possible mapping space between these two domains, we utilized a dual cycle-consistent structure for aligned and unaligned data.
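Assuming the standard log-likelihood GAN loss of [3], with supervised discriminators that also receive the source-domain image as the description above suggests, the two adversarial losses could take the following form. The expectation notation is an assumption introduced for this sketch, and the names P_aligned and P_unaligned are borrowed from Algorithm 1.

```latex
\begin{aligned}
\mathcal{L}_{\text{sup-adver}}
  &= \mathbb{E}_{(I_{CT}, I_{MR}) \sim P_{\text{aligned}}}
     \big[\log \mathrm{Dis}_{MR}(I_{CT}, I_{MR})\big]
   + \mathbb{E}_{(I_{CT}, I_{MR}) \sim P_{\text{aligned}}}
     \big[\log\big(1 - \mathrm{Dis}_{MR}(I_{CT}, \mathrm{Syn}_{MR}(I_{CT}))\big)\big] \\
  &\quad
   + \mathbb{E}_{(I_{CT}, I_{MR}) \sim P_{\text{aligned}}}
     \big[\log \mathrm{Dis}_{CT}(I_{MR}, I_{CT})\big]
   + \mathbb{E}_{(I_{CT}, I_{MR}) \sim P_{\text{aligned}}}
     \big[\log\big(1 - \mathrm{Dis}_{CT}(I_{MR}, \mathrm{Syn}_{CT}(I_{MR}))\big)\big] \\[4pt]
\mathcal{L}_{\text{unsup-adver}}
  &= \mathbb{E}_{\{I_{CT}, I_{MR}\} \sim P_{\text{unaligned}}}
     \big[\log \mathrm{Dis}_{MR}(I_{MR})\big]
   + \mathbb{E}_{\{I_{CT}, I_{MR}\} \sim P_{\text{unaligned}}}
     \big[\log\big(1 - \mathrm{Dis}_{MR}(\mathrm{Syn}_{MR}(I_{CT}))\big)\big] \\
  &\quad
   + \mathbb{E}_{\{I_{CT}, I_{MR}\} \sim P_{\text{unaligned}}}
     \big[\log \mathrm{Dis}_{CT}(I_{CT})\big]
   + \mathbb{E}_{\{I_{CT}, I_{MR}\} \sim P_{\text{unaligned}}}
     \big[\log\big(1 - \mathrm{Dis}_{CT}(\mathrm{Syn}_{CT}(I_{MR}))\big)\big]
\end{aligned}
```

In both expressions, the first two terms form the forward adversarial loss and the last two the backward adversarial loss; the only structural difference is that the unsupervised discriminators see a single image rather than an image paired with its source.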
For an image I CT from the CT domain, the forward cycle-consistent network should be able to bring I CT back to the original image, i.e., I CT → Syn MR (I CT ) → Syn CT (Syn MR (I CT )) ≈ I CT . Similarly, the backward cycle-consistent network should bring an image I MR from the MR domain back to the original image, i.e., I MR → Syn CT (I MR ) → Syn MR (Syn CT (I MR )) ≈ I MR . These requirements are enforced by the cycle-consistent losses L sup−cycle and L unsup−cycle for supervised and unsupervised learning, respectively. Each cycle-consistent loss has two terms: a forward cycle-consistent and a backward cycle-consistent term.
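Assuming the L1 reconstruction penalty used by cycleGAN [42], a cycle-consistent loss with a forward and a backward term could be written as follows. The choice of norm and the expectation notation are assumptions made for this sketch; under this reading, the supervised and unsupervised losses share the same form and differ only in whether the images are drawn from the aligned or the unaligned data.

```latex
\begin{aligned}
\mathcal{L}_{\text{cycle}}(\mathrm{Syn}_{MR}, \mathrm{Syn}_{CT})
  &= \mathbb{E}_{I_{CT}}
     \big[\lVert \mathrm{Syn}_{CT}(\mathrm{Syn}_{MR}(I_{CT})) - I_{CT} \rVert_{1}\big]
     && \text{(forward cycle)} \\
  &\quad + \mathbb{E}_{I_{MR}}
     \big[\lVert \mathrm{Syn}_{MR}(\mathrm{Syn}_{CT}(I_{MR})) - I_{MR} \rVert_{1}\big]
     && \text{(backward cycle)}
\end{aligned}
```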
In general, adversarial loss produces visually-appealing results. However, using only adversarial loss to match synthetic and reference MR images may cause the model to generate unseen structures. Voxel-wise loss helps to overcome this problem if aligned data are available. The goal of the discriminator networks remains unchanged, but the synthesis networks are tasked with not only cheating the discriminator networks, but also producing images that are close to the reference image in terms of L1 distance. The voxel-wise loss is accordingly defined as the L1 distance between the synthetic and reference images in the forward and backward cycle-consistent networks. Direct optimization of voxel-wise loss produces a suboptimal (i.e., blurry) result by minimizing the average loss for all plausible outputs. To deal with the inherently blurry results obtained from voxel-wise loss, a gradient difference loss is imposed on the synthesis networks. This loss compares the x- and y-direction gradients of each synthetic image with those of its reference image, where I MR and I CT are the reference images in the forward and backward cycle-consistent networks, respectively; the gradients are calculated to emphasize the boundaries of the structural shape.
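A small NumPy sketch can make the two quantitative terms concrete. It assumes mean absolute error for the voxel-wise loss and, for the gradient difference loss, finite differences with a squared penalty on the absolute gradients; the exact norm and normalization used in the paper are not recoverable here, and the function names are hypothetical.

```python
import numpy as np

def voxel_wise_loss(synthetic, reference):
    """Mean absolute (L1) difference between two images."""
    return float(np.mean(np.abs(synthetic - reference)))

def gradient_difference_loss(synthetic, reference):
    """Penalize mismatched edge strength along the y- and x-directions.

    Gradients are approximated by finite differences (an assumption).
    """
    loss = 0.0
    for axis in (0, 1):
        grad_syn = np.abs(np.diff(synthetic, axis=axis))
        grad_ref = np.abs(np.diff(reference, axis=axis))
        loss += float(np.mean((grad_syn - grad_ref) ** 2))
    return loss

# A sharp vertical edge and a blurred version of the same edge.
reference = np.zeros((4, 4))
reference[:, 2:] = 1.0
blurred = np.tile(np.array([0.0, 0.25, 0.75, 1.0]), (4, 1))

print(voxel_wise_loss(blurred, reference))
print(gradient_difference_loss(blurred, reference))
```

The blurred image has a nonzero gradient difference loss because its edge is spread over several pixels, which is the kind of blur this term is meant to discourage.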
A pretrained VGG16 network is incorporated into the optimization of the synthesis networks to ensure perceptual similarity. We aim for the synthetic and reference images to have similar feature representations when computed by the pretrained VGG16 network φ. Let φ j (I CT ) and φ j (I MR ) be the activations of the j th convolutional layer of the network φ when processing CT image I CT and MR image I MR , respectively. The perceptual loss measures the difference between these reference activations and φ j (Syn CT (I MR )) and φ j (Syn MR (I CT )), the activations of the synthetic images in the backward and forward cycles, respectively; H j × W j × C j is the shape of the activations from the j th convolution layer, and K is the number of layers in the VGG16 network. By utilizing the activations of the higher layers in the VGG16 network, the synthetic images can preserve the overall spatial structure of the reference images, but not the texture and exact shape. Perceptual loss causes the synthetic images to become more perceptually similar to the reference images, but does not lead to an exact match.

The vertebra, spinal nerves, and ligaments in spinal images contain strong interdependencies. Structural similarity (SSIM) [51] is a perceptually-motivated metric that considers the human visual system and performs better in terms of visual pattern recognition than do quantitative metrics, e.g., mean-based metrics. To enhance the structural and perceptual similarity of the synthetic and reference images, the structural similarity loss is based on the SSIM index:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

$$\mathrm{SSIM}(x, y) = 1 \ \text{if and only if} \ x = y \tag{9}$$

where C 1 and C 2 are constants that stabilize the division when the denominator is weak, and µ x , µ y , σ x , σ y , and σ xy represent the means, standard deviations, and cross-covariance of the synthetic and reference images.
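The SSIM index described above can be sketched with NumPy. Two simplifications are assumed here: the statistics are computed once over the whole image rather than over local windows, and the constants follow the common convention C1 = (0.01 L)^2 and C2 = (0.03 L)^2 for a dynamic range L. Turning the index into a loss as 1 − SSIM is likewise an assumption, not a confirmed detail of the paper.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM computed from whole-image statistics (no sliding window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

def structural_similarity_loss(synthetic, reference):
    """Assumed loss form: zero when the two images are identical."""
    return 1.0 - global_ssim(synthetic, reference)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
print(global_ssim(image, image))
print(structural_similarity_loss(1.0 - image, image))
```

An image compared with itself reaches the maximum index of 1, while its intensity-inverted copy scores lower and therefore incurs a positive loss.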
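The objective functions for supervised and unsupervised learning are defined as follows:

$$\mathcal{L}_{\text{sup}}(\mathrm{Syn}_{MR}, \mathrm{Syn}_{CT}, \mathrm{Dis}_{MR}, \mathrm{Dis}_{CT}) = \mathcal{L}_{\text{sup-adver}}(\mathrm{Syn}_{MR}, \mathrm{Syn}_{CT}, \mathrm{Dis}_{MR}, \mathrm{Dis}_{CT}) + \lambda_{\text{cycle}} \mathcal{L}_{\text{sup-cycle}} + \lambda_{\text{voxel}} \mathcal{L}_{\text{voxel}} + \lambda_{\text{grad}} \mathcal{L}_{\text{grad}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}} + \lambda_{\text{struc}} \mathcal{L}_{\text{struc}} \tag{10}$$

$$\mathcal{L}_{\text{unsup}}(\mathrm{Syn}_{MR}, \mathrm{Syn}_{CT}, \mathrm{Dis}_{MR}, \mathrm{Dis}_{CT}) = \mathcal{L}_{\text{unsup-adver}}(\mathrm{Syn}_{MR}, \mathrm{Syn}_{CT}, \mathrm{Dis}_{MR}, \mathrm{Dis}_{CT}) + \lambda_{\text{cycle}} \mathcal{L}_{\text{unsup-cycle}} \tag{11}$$

where λ cycle , λ voxel , λ grad , λ perc , and λ struc are hyper-parameters that balance the importance of the cycle-consistent, voxel-wise, gradient difference, perceptual, and structural similarity losses relative to the adversarial loss. In summary, the training objective function can be expressed mathematically as:

$$\mathrm{Syn}^{*}_{MR}, \mathrm{Syn}^{*}_{CT} = \arg\min_{\mathrm{Syn}_{MR},\,\mathrm{Syn}_{CT}} \; \max_{\mathrm{Dis}_{MR},\,\mathrm{Dis}_{CT}} \; \big[\mathcal{L}_{\text{sup}} + \mathcal{L}_{\text{unsup}}\big] \tag{12}$$

where Syn MR and Syn CT minimize the objective function, while Dis MR and Dis CT maximize it. During the inference process, only the Syn * MR network is used to produce a synthetic MR image from an input CT image.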
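The weighted combination of loss terms can be illustrated with plain Python. The weights are those reported in the implementation details of this paper; the dictionary layout, the function names, and the toy loss values are hypothetical and serve only to show which terms enter each objective.

```python
# Weights reported in the implementation details of this paper.
LAMBDAS = {"cycle": 10.0, "voxel": 100.0, "grad": 100.0, "perc": 1.0, "struc": 0.05}

def supervised_objective(losses, lambdas=LAMBDAS):
    """Adversarial term plus every weighted term available with aligned data."""
    return losses["adver"] + sum(lambdas[name] * losses[name] for name in lambdas)

def unsupervised_objective(losses, lambdas=LAMBDAS):
    """Adversarial term plus the weighted cycle-consistent term only."""
    return losses["adver"] + lambdas["cycle"] * losses["cycle"]

# Toy loss values, made up for illustration.
toy = {"adver": 0.7, "cycle": 0.02, "voxel": 0.01, "grad": 0.005, "perc": 0.3, "struc": 0.4}
print(supervised_objective(toy))
print(unsupervised_objective(toy))
```

The large weights on the voxel-wise and gradient difference terms mean that even small reference mismatches contribute noticeably to the supervised objective, whereas the unsupervised objective ignores every term that needs a reference image.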

Optimization of DC 2 Anet with Semi-Supervised Learning
DC 2 Anet with semi-supervised learning can be optimized in two different ways, with joint or alternating optimization:
• Joint optimization: For each training iteration, both the synthesis and discriminator networks are updated with regard to the objective function using supervised and unsupervised learning as defined in Equation (12). A pair of aligned data points and a pair of unaligned data points are sampled from the dataset and fed to DC 2 Anet to update the networks.
• Alternating optimization: For each training iteration, supervised and unsupervised learning for the objective function are alternated as defined in Equations (10) and (11). In this case, only the weights that correspond to the synthesis networks and the particular layers of the discriminators are updated. This form of training maintains a more stable convergence of the optimization, and it is easy to balance the synthesis and discriminator networks with Jensen-Shannon divergence [3]. However, the computational load required for alternating optimization is nearly twice as high as that of joint optimization in the training stage.
A major difficulty in adversarial training is that one network may become much stronger than the other; in most cases, this proved to be the discriminator network. When the discriminator network becomes too strong, the synthetic images are much easier to distinguish from the reference images. In this case, the gradients from the discriminator network approach zero, which leaves no guidance for the further training of the synthesis network. To overcome this issue, alternating optimization is an effective approach for DC 2 Anet. DC 2 Anet with semi-supervised learning is described in Algorithm 1.

Algorithm 1. DC 2 Anet with semi-supervised learning.
Require: The batch size m, the number of alternating iterations between supervised learning and unsupervised learning n sup and n unsup , the learning rate α, and Adam hyperparameters β 1 and β 2 .
1: Construct unaligned data P unaligned {I CT , I MR } based on aligned data P aligned (I CT , I MR ).
2: for number of training iterations do
3:   for n sup steps do
4:     Sample a batch from the aligned data ∼ P aligned (I CT , I MR )
5:     Update the discriminator networks Dis MR and Dis CT by ascending their stochastic gradient: sup−adver Syn MR , Syn CT , Dis MR , Dis CT , I
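The alternation between supervised and unsupervised steps in Algorithm 1 can be sketched as a simple schedule. Only the opening steps of the algorithm are shown above, so the unsupervised inner loop below is an assumption that mirrors the supervised one, and the function name `alternating_schedule` is hypothetical.

```python
def alternating_schedule(num_iterations, n_sup, n_unsup):
    """Yield the learning mode of each update step, in order.

    Every training iteration runs n_sup supervised steps on aligned data,
    followed by n_unsup unsupervised steps on unaligned data (assumed order).
    """
    for _ in range(num_iterations):
        for _ in range(n_sup):
            yield "supervised"
        for _ in range(n_unsup):
            yield "unsupervised"

steps = list(alternating_schedule(num_iterations=2, n_sup=1, n_unsup=2))
print(steps)
```

In a real training loop, each yielded mode would select which batch to sample and which objective to optimize for the discriminator and synthesis updates of that step.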

Network Architecture
The synthesis networks Syn MR and Syn CT in DC 2 Anet adopt the same architecture as used in the network reported by Johnson et al. [48], who produced impressive results in real-time style transfer and single-image super-resolution. The network contained two stride-one convolutions at the beginning and the end, two stride-two convolutions, nine residual blocks [46,47], and two fractionally-strided convolutions with a stride of 0.5. Each residual block included two convolutions with 256 filters of a size of 3 × 3 and a stride of one. Instance normalization [56] and a rectified linear unit (ReLU) [57] activation function followed each convolution except in the final convolutional layer. The hyperbolic tangent (Tanh) activation function followed the final convolution to guarantee that the output was within [−1, 1]. A detailed description of the synthesis network is presented in Table 4. For the discriminator networks Dis MR and Dis CT , we used a patch-based GAN (PatchGAN) [36] architecture, which aims to classify small overlapping image patches as either real or synthetic, rather than whole images. This patch-level discriminator architecture has fewer parameters than a whole-image discriminator and can emphasize detailed information in local areas. DC 2 Anet with semi-supervised learning has two different input flows, aligned and unaligned, with different shapes for the input data. The flow size of a volume of aligned data is (N, H, W, 2). N is the batch size; H and W are the image height and width, respectively; and 2 represents a concatenation of the synthetic and input images. The flow size of a volume of unaligned data is (N, H, W, 1), in which only a synthetic image can be used as input (indicated as 1). Therefore, a hybrid discriminator model was designed that consisted of two input stages, a shared stage, and two output stages. 
To balance the capability between synthesis and discriminator networks, the discriminator network was designed to be much shallower than the synthesis network because generating images is much more difficult than merely distinguishing real from synthetic images. Based on the related works presented in Table 2, the number of discriminator layers was fixed at five, and the variant architectures of the hybrid discriminator are presented in Figure 3. Models A-F represent variations of the input, shared, and output stages. Model G is the independent discriminator for unaligned and aligned data flows. Each aligned data flow consisted of a 300 × 200 synthetic image and a corresponding 300 × 200 input domain image, and each unaligned data flow had only a 300 × 200 synthetic image as input to the discriminator. All convolutions in the discriminator used 4 × 4 filters with a stride of 2. A leaky rectified linear unit (LeakyReLU) [58] followed each of the convolutions as the activation function, except for the final convolution.
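The spatial sizes implied by the strides described above can be traced with a few lines of arithmetic. The sketch assumes "same"-style padding, so that a stride-s convolution maps a length n to ceil(n / s); the padding scheme of the actual networks may differ, and the function names are hypothetical.

```python
import math

def conv_out(size, stride):
    """Output length of a 'same'-padded convolution (assumed padding)."""
    return math.ceil(size / stride)

def synthesis_shapes(height, width):
    """Spatial size after each resolution-changing stage of a synthesis network."""
    shapes = [(height, width)]
    for _ in range(2):  # two stride-two convolutions halve the resolution
        h, w = shapes[-1]
        shapes.append((conv_out(h, 2), conv_out(w, 2)))
    for _ in range(2):  # two fractionally-strided convolutions double it
        h, w = shapes[-1]
        shapes.append((h * 2, w * 2))
    return shapes

def discriminator_grid(height, width, num_layers=5):
    """Size of the patch-decision grid after five stride-two convolutions."""
    h, w = height, width
    for _ in range(num_layers):
        h, w = conv_out(h, 2), conv_out(w, 2)
    return h, w

print(synthesis_shapes(300, 200))
print(discriminator_grid(300, 200))
```

Under these assumptions, the residual blocks of the synthesis network operate at a quarter of the input resolution, and the discriminator emits a small grid of real-or-synthetic decisions, one per overlapping patch, rather than a single score per image.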

Implementation Details
To stabilize the DC 2 Anet training process, we used an image pooling technique [59] that updates the discriminator networks Dis MR and Dis CT using a history of synthetic images rather than only the ones generated by the latest synthesis networks. We maintained an image pool buffer that stored 50 previously synthesized images. We also conducted data augmentation on the training images using random horizontal flipping, random rotation between −5 and 5 degrees, and random translation of up to 15 pixels in each spatial dimension. DC 2 Anet was trained with mini-batch stochastic gradient descent (SGD) [60] with a mini-batch size of one. All weights were initialized from a zero-centered truncated normal distribution with a standard deviation of 0.02. All networks were trained with a learning rate of 0.0002 for the first 100,000 iterations and a linearly decaying rate that went to zero over the next 100,000 iterations. Adam is one of the most pervasive and robust optimizers and is used in various fields [61,62]. The model was optimized using the Adam optimizer [63] with β 1 = 0.5 and β 2 = 0.999, as suggested in [28]. For all experiments, the following empirical values were used to train the synthesis networks: λ cycle = 10, λ voxel = 100, λ grad = 100, λ perc = 1, and λ struc = 0.05.
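The image pool can be sketched as a fixed-size buffer that is queried each time the discriminator is updated. The text specifies only the buffer size of 50; the replacement policy below (return a stored image with probability 0.5 and overwrite it with the new one) is an assumption borrowed from common practice with this technique, and the class name `ImagePool` is hypothetical.

```python
import random
from typing import Any, List, Optional


class ImagePool:
    """History buffer of synthetic images used to update a discriminator."""

    def __init__(self, capacity: int = 50, seed: Optional[int] = None):
        self.capacity = capacity
        self.images: List[Any] = []
        self.rng = random.Random(seed)

    def query(self, image: Any) -> Any:
        # While the buffer is filling, store the new image and return it as is.
        if len(self.images) < self.capacity:
            self.images.append(image)
            return image
        # ASSUMED policy: with probability 0.5, hand back a stored image and
        # replace it with the new one; otherwise return the new image directly.
        if self.rng.random() < 0.5:
            index = self.rng.randrange(self.capacity)
            old_image = self.images[index]
            self.images[index] = image
            return old_image
        return image


# Integers stand in for synthetic images in this toy run.
pool = ImagePool(capacity=50, seed=0)
returned = [pool.query(i) for i in range(200)]
```

Because the discriminator sometimes sees older outputs, it cannot adapt only to the most recent state of the synthesis network, which is the stabilizing effect the technique aims for.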
In LeakyReLU, the slope of the leak was set to 0.2. Reflection padding was used to reduce artifacts instead of zero padding in the convolution layers. The model took about 48 h to train for 200,000 iterations using a single GeForce GTX 1080Ti GPU. The code and pretrained models are available at https://github.com/ChengBinJin/SpineC2M.
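The learning rate schedule described above (constant at 0.0002 for 100,000 iterations, then a linear decay to zero over the next 100,000) can be written as a small function. This is a sketch under the assumption that the decay starts exactly at iteration 100,000 and reaches zero at iteration 200,000; the function and parameter names are illustrative.

```python
def learning_rate(
    iteration: int,
    base_lr: float = 2e-4,
    constant_iters: int = 100_000,
    decay_iters: int = 100_000,
) -> float:
    """Constant learning rate followed by a linear decay to zero."""
    if iteration < constant_iters:
        return base_lr
    # Iterations left until the end of training, clipped at zero.
    remaining = max(constant_iters + decay_iters - iteration, 0)
    return base_lr * remaining / decay_iters


for step in (0, 100_000, 150_000, 200_000):
    print(step, learning_rate(step))
```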

Data Acquisition
Our lumbar spine dataset consisted of 641 patients, each with CT and MR images. The CT image was acquired helically on a GE Revolution CT scanner with a tube voltage of 120 kV, an exposure of 450 mAs, and a slice thickness of 1.00 mm. The MR image for each patient was obtained using a Siemens 3.0T Trio TIM MR scanner with T2 3D (with a repetition time of 4320 ms, an echo time of 95 ms, and a flip angle of 150°). To allow the voxel-wise comparison of the synthetic and reference MR images, the CT image was manually aligned to the MR image to produce voxel-level correspondence. After alignment, the CT and MR images from the same patient had the same image size and spacing. Because only the lumbar spine region was considered, we cropped the aligned CT and MR images to reduce the computational burden, producing a final preprocessed image size of 300 × 200 × 40 (40-48 slices depending on the alignment quality) with the same voxel size (1.00 × 1.00 × 1.00 mm). We randomly separated the 641 patients into two groups: 549 patients for the training set and 92 patients for the test set. Table 5 presents a summary of our dataset, while Figure 4 displays several sample images.
Table 5. Summary of the lumbar vertebra dataset used in the experiments.
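The random separation of the 641 patients into training and test sets can be outlined as follows. This is only a sketch: the integer patient identifiers, the seed, and the function name `split_patients` are placeholders, and the actual split used in the experiments is not reproduced by it.

```python
import random


def split_patients(patient_ids, num_test: int, seed: int = 0):
    """Shuffle patient identifiers and split them into (train, test) lists."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    return ids[num_test:], ids[:num_test]


# Placeholder identifiers 0..640 stand in for the 641 patients.
train_ids, test_ids = split_patients(range(641), num_test=92)
print(len(train_ids), len(test_ids))
```

Splitting by patient rather than by slice keeps all slices from one patient in the same set, so no test slice has a near-duplicate neighbor in the training data.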

Evaluation Metrics
The synthetic and reference MR images were compared using the mean absolute error (MAE) and root mean squared error (RMSE), which are defined as follows:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| I_{MR}(i) - Syn_{MR}\left(I_{CT}(i)\right) \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( I_{MR}(i) - Syn_{MR}\left(I_{CT}(i)\right) \right)^{2}}$$
where N is the total number of image slices in the aligned voxel. MAE and RMSE measure the average distance between each pixel in the synthetic and reference MR images. In addition, the voxel-wise peak signal-to-noise ratio (PSNR) can also be calculated:
$$\mathrm{PSNR} = 10\log_{10}\frac{H^{2}}{\mathrm{MSE}}$$
where H is the maximum possible intensity of the pixel and MSE is the mean squared error, which represents the squared difference between I MR and Syn MR (I CT ). MAE, RMSE, and PSNR were based on the correct alignment of test images I CT and I MR .
Because of the enormous differences between the two image domains, it is difficult to achieve perfect image alignment. Therefore, the structural similarity (SSIM) index and the Pearson correlation coefficient (PCC) should also be calculated for patch-wise statistical comparisons, e.g., mean, variance, and correlation. The definition of the SSIM is given in Equation (9), and the PCC is defined as follows:
$$\mathrm{PCC}(i) = \frac{\mathbb{E}\left[\left(I_{MR}(i) - \mu_{I_{MR}(i)}\right)\left(Syn_{MR}\left(I_{CT}(i)\right) - \mu_{Syn_{MR}(I_{CT}(i))}\right)\right]}{\sigma_{I_{MR}(i)}\,\sigma_{Syn_{MR}(I_{CT}(i))}}$$
where µ and σ are the mean and standard deviation of the i-th image slice. Lower values for the MAE and RMSE are preferable, while the reverse is true for the PSNR, SSIM, and PCC.
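The MAE, RMSE, PSNR, and PCC described above can be sketched in a few lines for a single pair of images. The sketch assumes that each image has been flattened into a list of pixel intensities, that the maximum intensity H is 255, and that the PCC uses population statistics; these choices and the function names are illustrative and do not come from the released code.

```python
import math


def mae(ref, syn):
    """Mean absolute error between reference and synthetic intensities."""
    return sum(abs(r - s) for r, s in zip(ref, syn)) / len(ref)


def mse(ref, syn):
    """Mean squared error between reference and synthetic intensities."""
    return sum((r - s) ** 2 for r, s in zip(ref, syn)) / len(ref)


def rmse(ref, syn):
    """Root mean squared error."""
    return math.sqrt(mse(ref, syn))


def psnr(ref, syn, max_intensity=255.0):
    """Peak signal-to-noise ratio in dB; H = 255 is an ASSUMED default."""
    error = mse(ref, syn)
    if error == 0:
        return float("inf")
    return 10.0 * math.log10(max_intensity ** 2 / error)


def pcc(ref, syn):
    """Pearson correlation coefficient (undefined for constant images)."""
    n = len(ref)
    mu_r, mu_s = sum(ref) / n, sum(syn) / n
    cov = sum((r - mu_r) * (s - mu_s) for r, s in zip(ref, syn)) / n
    sd_r = math.sqrt(sum((r - mu_r) ** 2 for r in ref) / n)
    sd_s = math.sqrt(sum((s - mu_s) ** 2 for s in syn) / n)
    return cov / (sd_r * sd_s)


# Toy intensities, not data from the experiments.
reference = [0.0, 50.0, 100.0, 150.0]
synthetic = [10.0, 40.0, 110.0, 140.0]
print(mae(reference, synthetic), rmse(reference, synthetic))
```

Every pixel in the toy example differs by exactly 10, so the MAE and RMSE coincide; on real images the RMSE is at least as large as the MAE because it weights large errors more heavily.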

Analysis of DC 2 Anet
Based on our lumbar spine dataset and the metrics described in the previous section, we quantitatively evaluated the performance of our model in generating an MR image from a CT image. In Table 6, we compare the performance of DC 2 Anet with supervised, unsupervised, and semi-supervised learning. Supervised learning, which uses aligned tuples of corresponding images, produced a much higher accuracy than unsupervised learning, while semi-supervised learning with alternating optimization was better than both supervised and unsupervised learning. The joint optimization of semi-supervised learning produced substantially weaker results than alternating optimization. Therefore, we concluded that the alternating optimization of DC 2 Anet led to a more stable convergence and was critical to effective performance.
Table 7 presents a comparison of the performance of the variant architectures for the discriminator previously displayed in Figure 3. Models A, B, and C had a different number of convolution layers in the input and output stages, and more than one convolution layer in the shared stage. In contrast, Models D, E, and F had only one convolution layer in the shared stage with a different number of convolution layers in the input and output stages. For the two different data flows, an independent discriminator design was employed in Model G. From the experimental results, three significant observations are worth noting. First, the independent discriminator architecture (Model G) exhibited higher performance than Models D, E, and F, due to the high discriminatory capability of the independent network. Second, Models A, B, and C outperformed the other models. This is because the deep weight-sharing constraint in the shared stage allows the discriminator to learn the joint distribution of the aligned and unaligned data.
Finally, Model C, which consists of two convolutions in the shared stage and two layers in the output stage, exhibited the most effective discriminatory capability, outperforming all other models in all metrics except for SSIM.
As demonstrated in [48,49], synthesizing an image by minimizing the perceptual loss for the early layers of the pretrained network tends to focus on low-level information, such as intensity, texture, and shape. Perceptual loss is helpful when there is misalignment in the training and test datasets. Layer selection for perceptual loss is a task-oriented problem. We considered the five ReLU layers that precede max-pooling in the pretrained VGG16 network, as in [48,49]. The performance of the different perceptual layers is summarized in Table 8. The five layers were ReLUs 1_2, 2_2, 3_3, 4_3, and 5_3, with each higher-layer configuration also including all of the earlier layers. Table 8 indicates that perceptual loss defined by high layers produced more accurate output than did the early layers. We also observed that the perceptual loss from ReLU {1_2, 2_2, 3_3, 4_3} and ReLU {1_2, 2_2, 3_3, 4_3, 5_3} had minor quantitative differences.
Our objective function contained six independent loss terms. The experiments reported above used all of the loss terms. To investigate the strength of each loss term, we employed ablation analysis to determine how performance was affected by each loss term. We trained each network with a different objective function five times using different initialization weights and report the average of the five trials for each objective function. The evaluation results are shown in Table 9. Beginning with adversarial loss alone, the loss terms were added one by one. In this process, five metrics were used to analyze the change in performance, and relative MAE improvement was also calculated. We considered adversarial loss alone to represent 0%, and the inclusion of all loss terms (the final row) was considered to be 100% when calculating relative MAE improvement.
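The relative MAE improvement can be illustrated with a short calculation. The sketch assumes a linear rescaling in which the MAE for adversarial loss alone maps to 0% and the MAE with all loss terms maps to 100%, with the contribution of each added term taken as the difference between consecutive cumulative values. The MAE numbers below are made-up placeholders chosen for easy arithmetic; they are not the values in Table 9.

```python
def relative_improvement(maes):
    """Rescale a sequence of MAEs so the first is 0% and the last is 100%.

    Returns (cumulative, increments): the cumulative improvement after each
    configuration and the share contributed by each newly added loss term.
    """
    baseline, final = maes[0], maes[-1]
    span = baseline - final
    cumulative = [100.0 * (baseline - m) / span for m in maes]
    increments = [
        cumulative[i] - cumulative[i - 1] for i in range(1, len(cumulative))
    ]
    return cumulative, increments


# PLACEHOLDER MAEs for: adversarial only, + cycle, + voxel-wise,
# + gradient difference, + perceptual, + structural similarity.
placeholder_maes = [30.0, 28.0, 22.0, 21.0, 20.5, 20.0]
cumulative, increments = relative_improvement(placeholder_maes)
print(cumulative)
print(increments)
```

By construction, the per-term increments sum to 100%, which makes the contributions of the individual loss terms directly comparable.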
The synthesis results for the ablation analysis are presented in Figure 5. The performance of DC 2 Anet generally improved with the addition of each loss term, with voxel-wise loss the most useful in terms of relative improvement. This is because voxel-wise loss and MAE are both per-pixel mean-error-based measures. Dual cycle-consistent and gradient difference loss exhibited relative improvement of 12.19% and 6.95%, respectively, while perceptual loss and structural similarity loss had a limited effect on the improvement of performance compared to the other forms of loss. However, when all terms were used, the occurrence of unnatural features in the synthetic MR images was significantly reduced. Based on these results, in the remainder of the experiments, we employed DC 2 Anet with the following characteristics: alternating optimization of semi-supervised learning, Model C for the discriminator architecture, perceptual loss from ReLUs {1_2, 2_2, 3_3, 4_3, 5_3}, and the inclusion of all loss terms.
Figure 5. Ablation analysis of the proposed method. From left to right: the input CT, adversarial loss alone, the addition of dual cycle-consistent loss, the addition of voxel-wise loss, the addition of gradient difference loss, the addition of perceptual loss, the addition of structural similarity loss, and the reference MR image.

Comparison with Baselines
To compare the synthetic MR images produced using different methods quantitatively, we present box plots in Figure 6 representing the MAE, RMSE, PSNR, SSIM, and PCC resulting from the use of multi-channel GAN [33], deep MR-to-CT [41], DiscoGAN [44], MR-GAN [45], and our proposed method. Each circle next to a box plot represents a single image slice from the test dataset. The bottom and top box limits correspond to Q25 and Q75, respectively. The green triangles and the horizontal lines denote the average and the median, respectively. The range of the box plot whiskers is given by [Q25 − 1.5 × (Q75 − Q25), Q75 + 1.5 × (Q75 − Q25)]. Any data point that falls outside of this range is considered an outlier and indicated by a red cross. The averages and standard deviations displayed in Table 10 indicate that our proposed method outperformed the other methods for all measures, with the lowest MAE and RMSE and the highest PSNR, SSIM, and PCC, thus further verifying the utility of our architecture. In addition, t-tests were conducted on the results in Table 10, finding that agreement with the reference MR images was significantly lower (p < 0.05) for images obtained using the MR-GAN method than for the images obtained using the DC 2 Anet model.
The MAE and standard deviation for the first 20 of the 92 subjects are plotted in Figure 7, comparing DC 2 Anet with multi-channel GAN [33], deep MR-to-CT [41], DiscoGAN [44], and MR-GAN [45]. It can be seen that DC 2 Anet generated a smaller MAE than the other approaches for most of the subjects. However, for some subjects, the MR-GAN [45] approach produced a smaller MAE than did DC 2 Anet, though MR-GAN [45] was unstable for some subjects, such as Subject 04 and Subject 08.
Figure 8 presents three examples of synthetic images produced by the proposed DC 2 Anet method, alongside the corresponding CT and MR images.
The results for multi-channel GAN [33], deep MR-to-CT [41], DiscoGAN [44], and MR-GAN [45] are also presented for comparison purposes. The spinal cord region in the central area of the image, the most important element of the image, is enlarged to evaluate the reconstruction capability of each method. DC 2 Anet learned to differentiate between structures, such as vertebrae, fat tissue, and disc signals, that have similar intensity values in CT images but not in MR images. DC 2 Anet also preserved the continuity, smoothness, and semantics of the original images in the synthetic results because our objective function with semi-supervised learning led the synthetic MR images to be similar to the reference images. In CT-based MR image generation, the accurate reconstruction of the disc signal, the degree of disc protrusion, the degree of stenosis, and the thecal sac is essential in the analysis of the lumbar vertebrae. We can see that the disc signal and thecal sac in the synthetic MR image obtained using the proposed DC 2 Anet looked more similar to the reference MR image than those produced by the other methods. The structures of the muscle and fat tissue had the highest similarity. However, the proposed method exhibited limitations in the reconstruction of the degree of disc protrusion and the degree of stenosis.
Figure 8. Synthetic MR images for three examples, comparing multi-channel GAN [33], deep MR-to-CT [41], DiscoGAN [44], MR-GAN [45], the proposed DC 2 Anet, and the reference MR image.
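The whisker rule used for the box plots in Figure 6 can be sketched as follows. The text specifies only the 1.5 × (Q75 − Q25) rule; the linear-interpolation quantile used here is an assumption, and the function names and sample values are illustrative rather than taken from the experiments.

```python
def quantile(sorted_values, q):
    """Quantile by linear interpolation between order statistics (ASSUMED)."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction


def whisker_range(values):
    """Return [Q25 - 1.5 * IQR, Q75 + 1.5 * IQR] for a list of values."""
    ordered = sorted(values)
    q25, q75 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    iqr = q75 - q25
    return q25 - 1.5 * iqr, q75 + 1.5 * iqr


def outliers(values):
    """Values outside the whisker range, i.e., the points drawn as crosses."""
    low, high = whisker_range(values)
    return [v for v in values if v < low or v > high]


# Toy per-slice errors with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
low, high = whisker_range(sample)
print(low, high, outliers(sample))
```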

Conclusions
In this work, we proposed an objective function and a general synthesis system, DC 2 Anet, that employs semi-supervised learning to generate lumbar spine MR images from single-sequence CT scans. Our objective function included six independent loss terms. Using ablation analysis, we assessed in detail the effectiveness and relative importance of each loss term. Performance was improved by adding each loss term because each had its own particular strengths and weaknesses. DC 2 Anet using semi-supervised learning significantly outperformed the supervised and unsupervised learning approaches. In seeking the global minimum of the objective function, alternating optimization was much more effective than the joint optimization of DC 2 Anet. We applied our method to generate MR images from their corresponding CT images, demonstrating that our proposed method significantly outperformed four state-of-the-art approaches, thus demonstrating its suitability for cross-modality image synthesis. It therefore represents a very promising method that can be employed in the diagnosis of lumbar disc conditions for patients who are prevented from receiving an MRI due to claustrophobia or the presence of a cardiac pacemaker. In future research, we intend to further validate the quality of the synthesis results for downstream tasks such as segmentation or classification. Extending the method to handle cross-sectional views (axial, sagittal, and coronal) and multi-sequence CT images will also be considered in future work.

Conflicts of Interest:
The authors declare no conflict of interest.