Deep Learning-Based Delayed PET Image Synthesis from Corresponding Early Scanned PET for Dosimetry Uptake Estimation

The acquisition of in vivo radiopharmaceutical distribution through imaging is time-consuming due to dosimetry, which requires the subject to be scanned at several time points post-injection. This study aimed to generate delayed positron emission tomography images from early images using a deep-learning-based image generation model to mitigate the time cost and inconvenience. Eighteen healthy participants were recruited and injected with [18F]Fluorodeoxyglucose. A paired image-to-image translation model, based on a generative adversarial network (GAN), was used as the generation model. The standardized uptake value (SUV) mean of the generated image of each organ was compared with that of the ground-truth. The least square GAN and perceptual loss combinations displayed the best performance. As the uptake time of the early image became closer to that of the ground-truth image, the translation performance improved. The SUV mean values of the nominated organs were estimated reasonably accurately for the muscle, heart, liver, and spleen. The results demonstrate that the image-to-image translation deep learning model is applicable for the generation of a functional image from another functional image acquired from normal subjects, including predictions of organ-wise activity for specific normal organs.


Introduction
The internal dosimetry of radiopharmaceuticals is necessary to predict their toxicity and develop treatment plans [1,2]. Radiopharmaceuticals that accumulate sufficient radioactivity are potential therapeutic substitutes, as they share the same targeting vectors, such as peptides or monoclonal antibodies; thus, the internal dosimetry of diagnostic radiopharmaceuticals has been conducted in various studies. To estimate the absorbed dose, defined as the deposited energy per unit mass of the region of interest, it is necessary to acquire a spatial map of the radiopharmaceutical and the S-value, defined as the mean dose rate absorbed by the target region from unit radioactivity in the source region [2,3]. Although some studies have aimed to calculate personalized S-values based on deep learning, the general approach is to use data pre-calculated with voxel phantom models covering various clinical conditions, including age and sex.
The in vivo distribution of radiopharmaceuticals can be determined through reconstructive three-dimensional (3D) quantitative imaging, such as positron emission tomography (PET) and single-photon emission computed tomography (SPECT) [4]. Radioisotopes within the body gradually decay; therefore, estimating the cumulative radioactivity of each organ in the body is required. This is typically calculated by integrating a time-activity curve. Given the exponential behavior of radioactive decay, which arises from its random nature, a time-activity curve model can be composed of a single exponential basis function or a linear combination of such functions. The model is estimated via curve fitting to the radioactivity in the region at several time points [5,6]. Hence, functional imaging must be acquired at a minimum of three time points to perform dosimetry. This is a burdensome task in both clinical applications and research, as radiopharmaceuticals require at least an hour to be distributed within the body after administration. Determining the in vivo distribution of a diagnostic radiopharmaceutical at later time points post-injection using a generation method can mitigate this burden. Therefore, we propose a deep learning model that generates a PET image of a diagnostic radiopharmaceutical at a post-injection point later than that of an early scanned image.
Image-to-image (I2I) translation is a deep learning scheme in the computer vision field that aims to transfer an image presented in one domain to another with a distinctive style or characteristic. It has also been widely used in medical imaging [7,8] for reconstruction, segmentation [9-11], and cross-modality conversion [12-16]. However, most studies have focused on I2I translation between anatomical images, including those obtained through magnetic resonance imaging (MRI) and computed tomography (CT); there have been few approaches to generating functional images. Among the numerous I2I methodologies, we focused on those based on generative adversarial networks (GANs) [17], which have been used for data synthesis tasks for PET [18,19]. Building on conditional GANs (cGANs) [20], which enable GANs to be exploited in translation problems, many studies [21,22] on paired I2I translation have proposed models in which an image similarity term is added to the loss function.
This study aimed to investigate a method of translating a PET image of healthy participants scanned at an early uptake point to a delayed point via a paired I2I model based on a cGAN. We used paired axial two-dimensional (2D) slices of the 3D PET images as the input, with a neural network designed for 2D input. Since the style difference between the image domains is relatively small compared with cross-modality conversion, we compared various combinations of adversarial and image similarity loss functions to avoid the gradient vanishing problem and enhance the generation performance. The performance under several conditions, including various uptake times and reconstruction methods, was also compared. Furthermore, regarding the functional I2I translation, the standardized uptake values (SUVs) of several organs in the delayed PET image were estimated and compared with the ground-truth SUVs.

Materials and Methods
This study was approved by the Institutional Review Board of KIRAMS (IRB No. KIRAMS 2020-09-003) and was carried out in accordance with the Declaration of Helsinki. All participants provided informed consent. Eighteen healthy volunteers (ten males; aged 18-24 years) without a history of malignancy were injected with 10 mCi of [18F]Fluorodeoxyglucose (18F-FDG). PET data were acquired using a Biograph 6 PET/CT scanner (Siemens Healthineers, Erlangen, Germany).
After 3.5 min, whole-body PET images were acquired at 5, 14, 31, and 52 min after the injection. The acquisition times of the projections at 5 and 14 min were 0.5 and 2 min per bed, respectively, while those at 31 and 52 min were both 3 min per bed. All methods were performed in accordance with the relevant guidelines and regulations. Among the participant data, 4668 (12 × 389) slices were used for training, 1556 (4 × 389) slices for validation, and 778 (2 × 389) slices for the test. The projections were reconstructed into 3D tomographic images using filtered back-projection (FBP); ordered subset expectation maximization (OSEM) in 2D and 3D; and the iterative TrueX algorithm. Each reconstructed PET image was used as 2D axial slices, with 389 axial slices per image. The size of the slices was 128 × 128. The dataset was normalized to the range [−1, 1] to enhance the training performance.
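As an illustrative sketch of the preprocessing step above, the following snippet rescales a stack of axial slices to the [−1, 1] range; the min-max convention and the function name are assumptions, since the text states only the target range.

```python
import numpy as np

def normalize_slices(volume):
    """Rescale a PET volume (e.g., shape (389, 128, 128)) to [-1, 1]
    via min-max normalization (assumed convention)."""
    vmin, vmax = float(volume.min()), float(volume.max())
    return 2.0 * (volume - vmin) / (vmax - vmin) - 1.0
```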
The network architectures of the generator and discriminator are illustrated in Figure 1. The architecture of the generator was mainly based on the U-Net structure. The overall arrangement of the convolution blocks and the layer-related hyperparameters, such as kernel size, was based on the study by Radford et al. [23]. A PatchGAN discriminator [24] was employed as the architecture. Unlike general convolutional neural network (CNN)-based classifiers, which return the probability of whether the input is real or fake, the output is a 2D array whose elements indicate the realness of local patches of the input. This has the advantage of conserving the high-frequency information of the generated image.
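The patch-wise discriminator described above can be sketched in PyTorch (the framework used in this study). The text cites the PatchGAN architecture [24] without listing exact layer counts, so the channel widths, depth, and the two-channel conditional input (early slice concatenated with delayed slice) below are assumptions.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Sketch of a PatchGAN-style discriminator: the output is a 2D map
    whose elements score the realness of local input patches, rather
    than a single scalar probability."""
    def __init__(self, in_ch=2):  # early + delayed slice, an assumption
        super().__init__()
        def block(ci, co):
            return nn.Sequential(
                nn.Conv2d(ci, co, 4, 2, 1),
                nn.InstanceNorm2d(co),
                nn.LeakyReLU(0.2),
            )
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            block(64, 128),
            block(128, 256),
            nn.Conv2d(256, 1, 4, 1, 1),  # patch realness map
        )

    def forward(self, x):
        return self.net(x)
```

For a 128 × 128 input slice pair, this configuration yields a 15 × 15 patch map.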
The adversarial loss function of the original GAN [17] was formulated based on the binary cross-entropy (BCE) of the output of the discriminator. For the generator G, discriminator D, distribution of real data $P_d$, and distribution of generated data $P_g$ from the generator, the loss function is expressed as (1):

$L_{GAN} = -\mathbb{E}_{x \sim P_d}[\log D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[\log(1 - D(\tilde{x}))]$ (1)

The discriminator was updated to minimize the loss, which decreased the cross-entropy between the ideal and current discriminators. Simultaneously, the generator was trained to maximize the loss to make the generated data 'look like' they were sampled from the distribution of ground-truth data. Solving this minimax problem corresponds to minimizing the Jensen-Shannon divergence (JSD) between the distribution of the real data and that of the generated artificial data. However, the training of vanilla GAN suffers from a gradient vanishing problem, owing to the divergence of the JSD in the early stage of training. Various approaches have been proposed to tackle this instability.
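The BCE-based adversarial loss of Eq. (1) can be written compactly for batches of discriminator outputs. This is a sketch with an assumed function name, operating on scores already passed through a sigmoid.

```python
import numpy as np

def gan_bce_loss(d_real, d_fake, eps=1e-12):
    """BCE-style adversarial loss on discriminator outputs in (0, 1).
    The discriminator is trained to minimize this quantity, while the
    generator is trained to maximize the second (fake) term."""
    return float(-np.mean(np.log(d_real + eps))
                 - np.mean(np.log(1.0 - d_fake + eps)))
```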
Using the Wasserstein distance as a metric between the distributions was suggested by Arjovsky et al. [25] to mitigate this instability. From the Kantorovich-Rubinstein duality, the loss function of the Wasserstein GAN (WGAN) is expressed as the difference between the expectations over each distribution under a Lipschitz condition. In this study, WGAN-GP (gradient penalty) [26], which adds a GP term to the WGAN loss to satisfy the Lipschitz condition, was used. The loss function of WGAN-GP is formulated as (2), where $\hat{x}$ is sampled along lines between real and generated samples:

$L_{WGAN\text{-}GP} = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_d}[D(x)] + \lambda_{gp}\,\mathbb{E}_{\hat{x}}\!\left[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2\right]$ (2)

Additionally, Mao et al. [27] argued that the gradient vanishing problem of the GAN is due to fake samples lying far from the decision boundary of the discriminator. To penalize these samples and pull them toward the decision boundary, they proposed the least-squares loss as an adversarial loss function (LSGAN), which has no singular point, unlike the cross-entropy-based function. The LSGAN is formulated as (3) and (4):

$L_{LSGAN}(D) = \tfrac{1}{2}\mathbb{E}_{x \sim P_d}[(D(x) - 1)^2] + \tfrac{1}{2}\mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})^2]$ (3)

$L_{LSGAN}(G) = \tfrac{1}{2}\mathbb{E}_{\tilde{x} \sim P_g}[(D(\tilde{x}) - 1)^2]$ (4)

In studies of paired I2I translation using a GAN framework, researchers have used the adversarial loss together with an additional loss term accounting for the similarity between the generated image and its corresponding ground-truth image [21,22]. In our study, two kinds of additional losses were considered to improve the training performance. L1-loss was used to reflect the similarity between the generated data and the ground-truth data in the image domain. Compared to the mean squared error (L2-loss), L1-loss is known to cause less blurring. The L1-loss is formulated as (5):

$L_{L1} = \mathbb{E}_{x,y}\!\left[\|y - G(x)\|_1\right]$ (5)

Perceptual loss is an assessment metric that measures the distance between two images as perceived by humans. Because it is estimated from the feature information of the images, the perceptual loss is less affected by subtle displacements of objects between the two images. It is composed of two terms: the feature content loss and the style loss. The feature content loss was calculated as the weighted sum of the normalized L2-norms of the differences between the features of the two images extracted by convolutional blocks. The style loss was related to the style differences, such as texture or pattern, between two images. We used a Gram matrix of the features from each convolutional block as the style representation [28]. The style loss was acquired by calculating the Frobenius norm of the difference between the style representations of the generated image and its ground-truth image. Combining the two feature losses, the perceptual loss was calculated as their weighted sum; every weight in the loss functions was set to 1. The feature information was calculated by exploiting the feature extraction layers of a pre-trained VGG19 CNN. The feature content, style, and perceptual losses are formulated as (6), (7), and (8), respectively, where $w^c_i$, $w^s_i$, $\alpha$, and $\beta$ are the hyperparameters of the weighted sums, $\phi_i$ denotes the feature map of the $i$-th convolutional block (of size $C_i \times H_i \times W_i$), and $\mathcal{G}_i$ denotes its Gram matrix:

$L_{content} = \sum_i \frac{w^c_i}{C_i H_i W_i} \left\|\phi_i(G(x)) - \phi_i(y)\right\|_2^2$ (6)

$L_{style} = \sum_i w^s_i \left\|\mathcal{G}_i(G(x)) - \mathcal{G}_i(y)\right\|_F^2$ (7)

$L_{perceptual} = \alpha L_{content} + \beta L_{style}$ (8)
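The Gram-matrix style representation and its Frobenius-norm loss can be sketched as follows; this is a simplified, single-block NumPy version (the actual implementation uses multiple VGG19 blocks with weights $w^s_i$, and the normalization convention is an assumption).

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a feature map with shape (C, H, W), normalized by
    the number of elements (assumed convention)."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feat_gen, feat_gt):
    """Squared Frobenius norm of the Gram-matrix difference."""
    diff = gram_matrix(feat_gen) - gram_matrix(feat_gt)
    return float(np.sum(diff ** 2))
```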
With the various loss functions described above, the total loss function used to optimize the generator with a fixed discriminator is expressed as (9), where $L_{Adv} \in \{L_{GAN}, L_{WGAN\text{-}GP}, L_{LSGAN}\}$ is an adversarial loss function, $L_{Sim} \in \{L_{L1}, L_{perceptual}\}$ is an image similarity loss function, and $\lambda_{Sim}$ is the weight of the image similarity loss function:

$L_{total} = L_{Adv} + \lambda_{Sim} L_{Sim}$ (9)
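A minimal sketch of Eq. (9) combined with the LSGAN generator term of Eq. (4) (function names are assumed; in this study, the weight was 200 for the L1 loss and 0.1 for the perceptual loss):

```python
import numpy as np

def lsgan_g_loss(d_fake):
    """Least-squares generator loss, Eq. (4)."""
    return float(0.5 * np.mean((d_fake - 1.0) ** 2))

def total_generator_loss(adv_loss, sim_loss, lam_sim):
    """Eq. (9): adversarial loss plus weighted image-similarity loss."""
    return adv_loss + lam_sim * sim_loss
```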
The generated images were evaluated using the Fréchet inception distance (FID) [29]. FID is widely used to assess the similarity between the distributions of generated and real images. It indicates the Fréchet distance between two multivariate normal distributions of the features of the real and generated data, encoded by the inception block of the Inception V3 CNN, and is calculated as (10), where µ and C are the mean and covariance, respectively, and the indices d and g indicate the ground-truth and generated data, respectively:

$FID = \|\mu_d - \mu_g\|_2^2 + \mathrm{Tr}\!\left(C_d + C_g - 2(C_d C_g)^{1/2}\right)$ (10)

A classic metric was also used to evaluate the similarity between a given image and its ground-truth image. The axial slice images generated from the model were individually assessed using the peak signal-to-noise ratio (PSNR), calculated as (11), where MSE is the mean squared error of the two images and R is the maximum pixel intensity of the pixel data type; R was unity here since the pixel intensities were normalized:

$PSNR = 10 \log_{10}\!\left(\frac{R^2}{MSE}\right)$ (11)

The size of the mini-batch in training was set to 4. Adam was used as the optimizer of both the generator and discriminator, with an initial learning rate of 0.0002 and parameters β1 = 0.5 and β2 = 0.999. The learning rate was set to decay each epoch by a factor of 0.98, i.e., the initial learning rate was multiplied by 0.98 per epoch. The hyperparameters used in the weighted sums, namely λ_L1, λ_perceptual, and λ_gp, were set to 200, 0.1, and 10, respectively. Training was implemented in Python 3.8.6, and PyTorch 1.10.0 [30] was used as the deep learning framework.
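The PSNR of Eq. (11) and the per-epoch learning-rate decay described above can be sketched as follows (function names assumed):

```python
import numpy as np

def psnr(img, ref, data_range=1.0):
    """Peak signal-to-noise ratio, Eq. (11); R (data_range) is unity
    here because the pixel intensities were normalized."""
    mse = np.mean((np.asarray(img, float) - np.asarray(ref, float)) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(data_range ** 2 / mse)

def learning_rate(epoch, lr0=2e-4, decay=0.98):
    """Initial learning rate multiplied by 0.98 per epoch."""
    return lr0 * decay ** epoch
```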

Image Generation
To compare the performance of the translation algorithm according to the loss functions, we used pairs of early and delayed images obtained 14 and 52 min after the injection, reconstructed using the OSEM 3D algorithm. The generation results are plotted in Figure 2. Each plotted slice was set to have identical window and level settings. The models trained with vanilla GAN produced pattern artifacts that lowered the image quality and blurred the image, regardless of the image similarity loss function, whereas the translations via WGAN-GP or LSGAN were less degraded. However, there was a slight pattern artifact in the result of the combination of least square and L1-loss.
The corresponding quantitative evaluation results are listed in Table 1. For comparison, we calculated the FID between the early and delayed image distributions, which was 22.67. Vanilla GAN caused the distribution of the early image to move further from that of the ground-truth image, despite improved performance when adding perceptual loss compared to adding L1-loss.
The loss functions using WGAN-GP and LSGAN translated the input image to be more similar to the ground-truth image. Among the combinations of adversarial and image similarity loss functions, the sum of LSGAN and perceptual loss showed the greatest performance in both FID and PSNR.

The results of the I2I translation according to the uptake time of the early image are compared in Figure 3 and Table 1. Each image was reconstructed using OSEM 3D, and the networks were optimized with the combination of LSGAN and perceptual loss. Regarding the FID and classic metrics, the translation of the image scanned 5 min after injection, with an FID score of 21.36 and a PSNR of 53.29 dB, displayed relatively worse performance, mainly for PSNR, due to the greater distance between the image distributions before and after translation. However, the inception distance between the ground-truth and generated images was relatively insensitive to the uptake time of the input.

To determine the influence of the reconstruction method, we evaluated the performance of the model trained using the images reconstructed via several methods, including FBP, OSEM 2D, OSEM 3D, and TrueX. The comparison outcomes are shown in Figure 4 and Table 1. The models trained using the images reconstructed via TrueX and OSEM 3D displayed superior outcomes. The image generated from the model trained with FBP images was too blurred to identify the structures, due to the streak artifacts generated during reconstruction. In a quantitative sense, the results generated from the early images reconstructed via the TrueX method demonstrated the best performance in FID and PSNR, at 10.43 and 56.87 dB, respectively. The FID and PSNR of the FBP translation were 85.93 and 40.46 dB, respectively, far inferior to the others.

SUVmean Estimation

The integrated uptake values of 18F-FDG in the delayed images generated from the 14 min post-injection scans were evaluated in two participants in the test dataset by comparing them with those in the delayed ground-truth PET image (Figure 5). The absolute errors of the SUVmean for the muscle, heart, liver, and spleen were approximately 0.2 or less, while the differences in SUVmean in the kidney were relatively large. The SUVmean difference in the brain displayed different trends between the two participants, implying that the accuracy of the uptake prediction in the brain depended on the individual. In addition, the absolute errors of the SUVmean of the bladder for the two participants were 9.055 and 10.63; the model could not estimate the amount of uptake in the bladder.
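The organ-wise comparison above reduces to averaging SUVs over a segmented organ mask. A minimal sketch follows (the delineation of the masks is not described here, and the function name is an assumption):

```python
import numpy as np

def suv_mean(suv_volume, organ_mask):
    """Mean SUV over a boolean organ mask."""
    return float(suv_volume[organ_mask].mean())
```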

Figure 4. Translation results of the early PET image according to the reconstruction method. Every model was trained with LSGAN + perceptual loss as its loss function. 2D, two-dimensional; FBP, filtered back-projection; PET, positron emission tomography; OSEM, ordered subset expectation maximization; LSGAN, least square generative adversarial network.


Discussion
We hypothesized that the selection of the loss function would be significant, as the style difference between the early and delayed PET images is not pronounced, and GANs are widely known to suffer from unstable convergence during training. Our study found that using LSGAN or WGAN-GP mitigated the convergence issue relatively well, resulting in better performance than vanilla GAN. The quantitative evaluation demonstrated that the choice of the image similarity function had less of an effect than that of the adversarial loss. However, the perceptual loss containing the style loss was more effective than the L1-loss, although the style difference between the two image domains was not dominant.
Radiopharmaceuticals injected into the body are gradually distributed via active transport processes or through the bloodstream. Hence, the temporal change in activity differs along the axial direction according to the characteristics of the organs. To examine the image generation performance according to the axial position, the PSNR of the generated images was evaluated against the ground-truth image acquired 52 min after injection. The slices between the neck and bladder improved in voxel similarity after translation (Figure 6). However, there was little improvement in the slices belonging to the head, the legs, and especially the bladder, where the activity changed the most over time.

Likewise, the estimated SUV values of the generated images indicated that training with whole-body slices resulted in the effective estimation of the SUV of organs, including the heart and liver, and a fine translation performance. However, model improvements are necessary to estimate the uptake in particular organs, such as the brain and kidney.
This study demonstrates that the deep-learning-based generation model can predict the biodistribution of a radiopharmaceutical from an early scanned image. This is clinically significant, as a high-performing deep learning model has the potential to alleviate the cost of PET or SPECT image acquisition. For instance, conventional dosimetry of radiopharmaceuticals requires multiple acquisitions for the fitting of a time-activity curve. In addition, since the distribution of a monoclonal antibody within the body is time-consuming compared to that of a peptide, the application of this model is expected to be more effective for radiopharmaceutical radioimmunoconjugates (RICs). Therefore, validating the model with PET images of a diagnostic RIC is a future plan.
Moreover, even though this study examined the correspondence between the functional images of the same radiopharmaceutical acquired at different time points post-injection, the model has the potential to be generalized to the biodistribution of two radiopharmaceuticals; it could be utilized to generate the in vivo distribution of a therapeutic radiopharmaceutical from a diagnostic one. This is significant, as the distribution of a therapeutic radiopharmaceutical is hard to image, especially for α-emitting radioisotopes, including 211At and 225Ac. Therefore, we also plan to examine the deep learning model for this purpose. Additionally, acquiring paired images of therapeutic and diagnostic radiopharmaceuticals is difficult; therefore, an unsupervised, or unpaired, I2I generation model is expected to be required.
There are some limitations to our study. First, since the model was trained with a dataset of healthy participants, it is not capable of predicting the activity of malignant tumors during delayed imaging. Therefore, additional validation carried out in patients with cancer is necessary. Second, in a methodological sense, applying the I2I model to this problem was based on the assumption that the early and delayed PET images for the same participant were 'paired.' However, due to slight motion changes, including respiratory motion between two separate scans, the performance of the model may be impaired. Thus, a state-of-the-art unpaired I2I translation model or a semi-supervised I2I model should be considered for application [31,32], treating the early and delayed images as unpaired, or partially paired, assuming that the slices containing the respiratory system and bladder are not paired while the others are paired. Third, there is a downside to the 2D I2I translation model for 3D tomographic images in that the structural properties along the axial direction are neglected. Therefore, we aim to improve the generation model by adopting 3D convolution layers in the generator and discriminator. However, this would simultaneously increase the number of parameters in the layers and decrease the effective dataset size, so technical improvements to avoid overparameterization will be required.
Several studies have generated medical images using 3D GAN-based generation models, addressing this problem with patch-based schemes [32–34].
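The core of such a patch-based scheme is to tile the 3D volume into small cubes, process each cube independently, and place the results back at their original positions. The following is an illustrative numpy sketch of this tiling step (the helper names `extract_patches` and `reassemble` are hypothetical, not taken from the cited works):

```python
import numpy as np

def extract_patches(volume, patch=32, stride=32):
    """Split a 3D volume into non-overlapping cubic patches (hypothetical helper)."""
    z, y, x = volume.shape
    patches, coords = [], []
    for k in range(0, z - patch + 1, stride):
        for j in range(0, y - patch + 1, stride):
            for i in range(0, x - patch + 1, stride):
                patches.append(volume[k:k + patch, j:j + patch, i:i + patch])
                coords.append((k, j, i))
    return np.stack(patches), coords

def reassemble(patches, coords, shape, patch=32):
    """Place (generated) patches back at their original positions."""
    out = np.zeros(shape, dtype=patches.dtype)
    for p, (k, j, i) in zip(patches, coords):
        out[k:k + patch, j:j + patch, i:i + patch] = p
    return out

vol = np.random.rand(64, 64, 64).astype(np.float32)
patches, coords = extract_patches(vol)
recon = reassemble(patches, coords, vol.shape)
assert np.allclose(recon, vol)  # non-overlapping tiling is lossless
```

In practice, overlapping strides with averaging at the seams are often preferred to suppress blocking artifacts, at the cost of extra computation.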

Conclusions
In this study, we applied a paired cGAN to PET images obtained after a relatively short uptake time of the radiopharmaceutical to perform I2I translation to images acquired after the standard uptake time. The proposed model achieved satisfactory performance with various combinations of loss functions, except those using the cross-entropy adversarial loss. We demonstrated that a longer uptake time for the early image results in improved generation performance. Additionally, the model trained with whole-body images effectively estimated the SUV mean of the heart, liver, muscle, and spleen; however, the estimated SUV mean of the brain, kidney, and bladder was inaccurate. The model is expected to shorten the required uptake time of radiopharmaceuticals and to generate time-dependent activities from a single scan.
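For reference, the least-squares adversarial objectives that distinguish LSGAN from the cross-entropy formulation can be written compactly. The following is a minimal numpy sketch of these loss terms only, not our training code:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    # Discriminator: push outputs on real images toward 1 and on fakes toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Generator: push discriminator outputs on generated images toward 1.
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

# A perfectly fooled discriminator (d_fake == 1) gives zero generator loss,
# while a perfect discriminator (d_real == 1, d_fake == 0) gives zero d-loss.
assert lsgan_g_loss(np.ones(4)) == 0.0
assert lsgan_d_loss(np.ones(4), np.zeros(4)) == 0.0
```

Unlike the cross-entropy loss, the quadratic penalty also pushes correctly classified samples toward the decision boundary, which yields non-vanishing gradients for the generator.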

Figure 3. Translation results of early PET images according to the uptake time. The first row presents real images, while the images in the second row were generated by the model. Every model was trained with LSGAN + perceptual loss as its loss function. 3D, three-dimensional; LSGAN, least square generative adversarial network; PET, positron emission tomography; OSEM, ordered subset expectation maximization.


Figure 4. Translation results of early PET images according to the reconstruction method. Every model was trained with LSGAN + perceptual loss as its loss function. 2D, two-dimensional; FBP, filtered back-projection; PET, positron emission tomography; OSEM, ordered subset expectation maximization; LSGAN, least square generative adversarial network.

Figure 5. Absolute error of the SUV mean of organs in generated PET images at 14 min post-injection, using the I2I model trained with the combination of LSGAN and perceptual loss functions. I2I, image-to-image; LSGAN, least square generative adversarial network; PET, positron emission tomography; SUV, standardized uptake value.
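The SUV underlying this comparison normalizes the measured activity concentration by body weight and injected dose. A minimal sketch of the organ-wise SUV mean and its absolute error follows, with toy arrays standing in for the generated and ground-truth volumes (all names, masks, and values are hypothetical illustrations, not study data):

```python
import numpy as np

def suv_mean(activity_kbq_ml, mask, weight_g, injected_dose_kbq):
    """Mean SUV over an organ ROI: concentration * body weight / injected dose."""
    suv = activity_kbq_ml * weight_g / injected_dose_kbq
    return float(suv[mask].mean())

# Toy volumes standing in for ground-truth and generated delayed PET.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 5.0, size=(8, 8, 8))          # kBq/mL
gen = gt + rng.normal(0.0, 0.1, size=gt.shape)      # generated image with noise
organ_mask = np.zeros(gt.shape, dtype=bool)         # hypothetical organ ROI
organ_mask[2:6, 2:6, 2:6] = True

err = abs(suv_mean(gen, organ_mask, 70000, 300000)
          - suv_mean(gt, organ_mask, 70000, 300000))
```

The per-organ absolute errors reported in Figure 5 correspond to `err` computed with the actual segmented organ masks.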


Figure 6. Translation performance along the axial position of the body in terms of PSNR for the validation and test datasets, with images at 14 min post-injection as early images and those at 55 min post-injection as delayed images. PSNR, peak signal-to-noise ratio.
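PSNR, the metric in this figure, is the log-scaled ratio of the squared peak signal value to the mean squared error between the two images. A minimal sketch (the `data_range` handling follows a common convention and is not taken from this study):

```python
import numpy as np

def psnr(reference, generated, data_range=None):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    if data_range is None:
        data_range = reference.max() - reference.min()
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.linspace(0.0, 1.0, 64).reshape(8, 8)
noisy = ref + 0.01
# data_range = 1 and mse = 1e-4, so PSNR = 10 * log10(1 / 1e-4) = 40 dB
assert abs(psnr(ref, noisy, data_range=1.0) - 40.0) < 1e-9
```

Computing this per axial slice against the ground-truth delayed image yields the axial PSNR profile shown in the figure.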

Table 1. Generation performance for various cases of loss functions, uptake times, and reconstruction methods.