Deep-Learning-Based Framework for PET Image Reconstruction from Sinogram Domain

Abstract: High-quality and fast reconstructions are essential for the clinical application of positron emission tomography (PET) imaging. Herein, a deep-learning-based framework is proposed for PET image reconstruction directly from the sinogram domain to achieve high-quality and high-speed reconstruction at the same time. In this framework, conditional generative adversarial networks are constructed to learn a mapping from sinogram data to a reconstructed image and to generate a well-trained model. The network consists of a generator that utilizes the U-net structure and a whole-image strategy discriminator, which are alternately trained. Simulation experiments are conducted to validate the performance of the algorithm in terms of reconstruction accuracy, reconstruction efficiency, and robustness. Real patient data and Sprague Dawley rat data are used to verify the performance of the proposed method under complex conditions. The experimental results demonstrate the superior performance of the proposed method in terms of image quality, reconstruction speed, and robustness.


Introduction
Positron emission tomography (PET) is an in vivo nuclear medicine imaging technique that is widely used in clinical trials and medical practice. Over the years, much effort has been devoted to promoting the wider application of PET in clinical and scientific research and to enhancing PET imaging quality. In addition to improving the performance of PET imaging and acquisition systems, researchers have mainly focused on developing algorithms for image reconstruction.
In PET, the raw tomographic data are typically stored in projection data histograms called sinograms, which cannot be interpreted directly by observers but, as the projection-slice theorem shows, can be mapped to images that reveal the radioactivity concentration distribution. However, owing to random noise in the data, PET image reconstruction is an ill-posed inverse problem [1]. Many solutions have been proposed to address this problem. In addition to the traditional analytic method, i.e., filtered backprojection (FBP) [2], various iterative strategies based on statistical data models have been widely used. Maximum-likelihood expectation maximization (MLEM) [3] is a classic iterative method that is remarkably robust to noise, but it converges slowly and has a large computational requirement. Compared with MLEM, the maximum a posteriori (MAP) method [4] adds penalty terms that preserve regional smoothness and sharp variations at edges. Several studies have attempted to design penalty terms using low-rank constraints [5], total variation (TV) [6], spatiotemporal spline modeling [7], tracer kinetics, and signal subspaces [8]. Regardless of the method chosen, it is necessary to assume a statistical model that produces a distribution close to the real data distribution. Inconsistencies between the two distributions cause a deviation between the reconstructed and real images. In addition, the system matrix plays an important role in these methods. To eliminate the dependence on hypothetical models and system matrices and to reduce the complexity of manual settings and parameter tuning, deep-learning-based image reconstruction directly from the sinogram data is a promising option to investigate.
In recent years, deep learning has demonstrated promising potential in the field of medical image reconstruction [9][10][11][12] and has successfully generated high-quality images for PET [13][14][15]. Current medical image reconstruction methods using deep learning can be broadly classified into two categories: (category A) those achieving reconstruction from the image domain to the image domain and (category B) those completing reconstruction from the sinogram domain to the image domain. In recent years, there have been interesting changes in the approach and research of deep learning in image-to-image work. Deep learning was primarily used for postprocessing [16] or for accelerating [17,18] the reconstruction task. Kang et al. [19] used a deep convolutional neural network (CNN) to remove noise from low-dose computed tomography images. Wang et al. [20] designed an offline CNN to accelerate the reconstruction of magnetic resonance images. Wu et al. [21] proposed a cascaded CNN to remove the artifacts induced by denoising.
As deep learning has achieved satisfactory results in image processing [22,23], many researchers have attempted to introduce deep learning in PET image processing [24][25][26]. Most of these works belong to category A. Gong et al. [27] proposed a framework combining a residual convolutional network with MLEM, and the dynamic data of prior patients were used to train a network for PET denoising. Kim et al. [15] combined a denoising CNN with a local linear fitting function and trained the network using full-dose images as the ground truth and low-dose images reconstructed from downsampled data as the input. An unsupervised deep learning framework for direct reconstruction with an MR image as prior information was later proposed by Gong [28]. GapFill-Recon Net [29] is also a domain transform reconstruction network based on a CNN. Recently, the deep image prior (DIP) framework [30] showed that CNNs have the intrinsic ability to regularize various ill-posed inverse problems without pretraining. No prior training pairs are required, and random noise can be employed as the network input to generate denoised images. Inspired by this work, Gong et al. [31] achieved PET reconstruction guided by MRI using the DIP framework. Song et al. [32] designed a CNN-based PET reconstruction framework with multichannel prior inputs, including high-resolution magnetic resonance images and the radial and axial coordinates of each voxel. However, as the DIP method does not require pretraining [31], it is similar to an online training process for each reconstruction. Moreover, several parameters must be trained, so each reconstruction takes a long time to complete. This drawback limits the application of the method in clinical practice. The difficulty in selecting the stopping point of the iterative training is another limitation. With the addition of prior knowledge from other modalities, image registration has also become a technical difficulty.
Compared with category A, category B makes full use of the information in the sinogram domain and implements direct reconstruction using deep learning. Häggström et al. [33] focused on direct reconstruction and adopted a deep encoder-decoder network to learn the mapping from the sinogram to the image. However, they mainly focused on the reconstruction speed rather than the image quality. Their proposed method was verified by simulation as opposed to real clinical data.
In this study, a direct reconstruction deep learning scheme for the recovery of radioactivity maps from sinogram data was built using conditional generative adversarial networks (cGANs) [34], which have obtained excellent results for the image-to-image translation task [35]. In the training phase, the input is an image pair that includes a reconstructed image and a sinogram image. After alternate training of the generator and discriminator, a well-trained model is established. In the testing phase, the input is only a sinogram image. After it passes through the trained model, the reconstructed image is obtained from the output. There are two contributions of this study. First, simulation experiments were used to verify the performance of the proposed method in terms of accuracy, robustness, and runtime. Second, considering that the testing data and training data of simulation experiments may have similar structures, the method was verified on a Sprague Dawley (SD) rat dataset that includes 12 different subjects and a real patient dataset that contains nine subjects. Both kinds of experimental results demonstrate the feasibility of the proposed method. Our preliminary cGANs-based direct network result was published at MICCAI 2019 [36].
The remainder of this paper is organized as follows. In Section 2, the proposed methodology is presented. The design of the simulation experiments and real data experiments of PET image reconstruction is described in Section 3, followed by the experimental results in Section 4. Finally, the results are discussed, and the paper is concluded in Section 5.

Problem Definition
In PET imaging, photons emitted by radioactive tracers follow a random Poisson process. Therefore, it is assumed that the imaging model of the PET system obeys a Poisson distribution. The measured PET data $y$, referred to as "projection data", equal the sum of the coincidence events captured by each detector and are usually stored in sinogram mode,

$$y \sim \mathrm{Poisson}(\bar{y}) \quad (1)$$

where $y \in \mathbb{R}^{I}$ is a collection of detected events, $I$ is the number of lines of response, and $\bar{y}$ is the mean of the Poisson distribution. In addition, the projection data can be described as a projective transform of the unknown activity image $x \in \mathbb{R}^{J}$,

$$\bar{y} = Gx \quad (2)$$

where $G \in \mathbb{R}^{I \times J}$ is the system matrix, and $J$ is the number of pixels in the image space. In reality, $y$ includes not only the true coincidences, but also scattered coincidences $s$ and accidental coincidences $r$. In the specific acquisition process of PET, the delayed coincidence window method is usually used to correct for accidental coincidences, which can effectively remove accidental events from the data. Therefore, one can generally describe the PET measurement model as follows:

$$\bar{y} = Gx + r + s \quad (3)$$

The task can then be defined as a process of generating an activity map $x$ from sinogram data $y$. However, this task is an ill-posed optimization problem in theory. Traditional solutions primarily involve iterative methods.
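As an illustration of the measurement model described above, the following minimal sketch draws Poisson-distributed counts whose mean is the projection of an activity image plus scatter and accidental terms. The function name `simulate_sinogram`, the toy sizes, and the random system matrix are assumptions made for this example only; they do not represent the scanner geometry used in this work.

```python
import numpy as np

def simulate_sinogram(G, x, scatter=None, randoms=None, seed=0):
    """Draw Poisson counts with mean G @ x (+ scatter + randoms).

    G: (I, J) system matrix, x: (J,) activity image.
    Returns the noisy counts and their Poisson mean.
    """
    rng = np.random.default_rng(seed)
    mean = G @ x
    if scatter is not None:
        mean = mean + scatter      # scattered coincidences s
    if randoms is not None:
        mean = mean + randoms      # accidental coincidences r
    return rng.poisson(mean), mean

# Toy example with hypothetical sizes: I = 6 lines of response, J = 4 pixels.
toy_rng = np.random.default_rng(1)
G_toy = toy_rng.uniform(0.0, 1.0, size=(6, 4))   # hypothetical system matrix
x_toy = np.array([10.0, 0.0, 5.0, 20.0])         # hypothetical activity image
y_toy, y_mean = simulate_sinogram(
    G_toy, x_toy, scatter=np.full(6, 0.5), randoms=np.full(6, 0.2)
)
```

The counts `y_toy` are what a reconstruction algorithm would see; recovering `x_toy` from them is the inverse problem addressed in the rest of the paper.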

Framework Based on Conditional Generative Adversarial Network
The iterative method is based on a hypothetical initial image. Using a step-by-step approximation, the theoretical projection value is compared with the actual measured projection value, and the optimal solution is determined under the guidance of some optimization criterion. Compared with the analytical method, the iterative method provides better image quality and higher image resolution in PET imaging with relative undersampling and low counts. However, its disadvantage is that the calculation is complex and slow. In the proposed method, the approximation and optimization processes are replaced with the training process of the neural network.
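For readers unfamiliar with the iterative approach described in this paragraph, the sketch below implements the classic MLEM update as one well-known instance of it. It is included for illustration only and is not the method proposed here; the function name `mlem`, the uniform initial image, and the iteration count are assumptions of this example.

```python
import numpy as np

def mlem(G, y, n_iter=50, eps=1e-12):
    """Classic MLEM update: x <- x / (G^T 1) * G^T (y / (G x))."""
    n_lors, n_pix = G.shape
    x = np.ones(n_pix)                       # hypothetical uniform initial image
    sens = G.T @ np.ones(n_lors)             # sensitivity image G^T 1
    for _ in range(n_iter):
        proj = G @ x                         # theoretical projection of the estimate
        ratio = y / np.maximum(proj, eps)    # compare with the measured projection
        x = x / np.maximum(sens, eps) * (G.T @ ratio)
    return x
```

Each iteration requires one forward projection and one backprojection with the system matrix, which is the source of the computational cost mentioned above.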

Network Design
Unlike traditional generative adversarial networks (GANs) [37], in which the only input is noise z, cGANs can provide extra information y to guide the training of the network. In this work, the input data include both noise z and sinogram data y. The cGANs consist of two models: generator G and discriminator D. The generator adopts the U-net architecture [38] and generates the PET image directly from a sinogram image. The whole-image strategy is introduced to the discriminator, which attempts to differentiate between the real sinogram and PET image pairs from the database and the fake image pairs generated by G. The entire framework involves training G and D alternately until a balance is reached in the convergence stage. Figure 1 illustrates the structure of the entire reconstruction network. In the discriminator part, both the whole-image strategy and the Patch-GAN strategy are utilized.

(2) Decoder: the symmetrical expanding path (shown on the right-hand side) is used to upsample the feature maps and enable precise localization. The basic module of both paths is a convolution layer, a batch normalization (batchnorm) layer, and a ReLU layer. However, the encoder and decoder differ in some details. There is no batchnorm in the first layer of the encoder. All rectified linear units (ReLUs) in the encoder are leaky, whereas the ReLUs in the decoder are not leaky. All convolutions use 16 spatial filters and downsample with stride 2. Dropout is applied only to the first three layers of the decoder. Skip connections that concatenate activations from the ith layer in the contracting path to the (n − i)th layer in the expanding path are built, where n is the total number of layers. For the proposed method, n = 12. Finally, a tanh function is applied after the last layer of the decoder to complete the generator model, and the output image has the same size as the input image. The structure of the generator is illustrated in Figure 2.
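The layer bookkeeping described above can be sketched as follows. The example assumes that the n = 12 layers split evenly into six stride-2 encoder layers and six stride-2 decoder layers and that the input is a 64 × 64 image, as in the simulation data; the even split and the function name `unet_plan` are assumptions of this illustration rather than details confirmed by the text.

```python
def unet_plan(input_size=64, n_layers=12):
    """Return encoder sizes, decoder sizes, and skip pairs (i, n - i)."""
    half = n_layers // 2          # assumed even encoder/decoder split
    size = input_size
    enc_sizes = []
    for _ in range(half):
        size //= 2                # a stride-2 convolution halves each spatial dimension
        enc_sizes.append(size)
    dec_sizes = []
    for _ in range(half):
        size *= 2                 # a stride-2 transposed convolution doubles it
        dec_sizes.append(size)
    # Layer i in the contracting path is concatenated with layer n - i.
    skips = [(i, n_layers - i) for i in range(1, half)]
    return enc_sizes, dec_sizes, skips

enc_sizes, dec_sizes, skips = unet_plan()
for i, j in skips:
    # Paired layers must share the same spatial size for concatenation.
    assert enc_sizes[i - 1] == dec_sizes[j - len(enc_sizes) - 1]
```

Under these assumptions, the bottleneck reaches a 1 × 1 feature map and the final layer restores the 64 × 64 input size, which is consistent with the statement that the output image has the same size as the input image.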
Patch-GAN is a very efficient strategy in image-to-image translation tasks. The Patch-GAN only penalizes structure at the scale of patches. This strategy tries to classify whether each N × N patch in an image is real or fake. It can be understood as a form of texture/style loss. However, for reconstruction, a local region of the sinogram does not correspond to a local region of the image, as shown in Figure 3, which means that patch-based image-to-image translation may not be suitable for the reconstruction task. The abscissa and ordinate of each pixel in the sinogram image have their own physical meaning. In fact, each pixel in the sinogram is related to the full activity image. Therefore, we apply the discriminator to the full space, which is called the whole-image strategy in this paper. The loss of the full image is adopted instead of the loss of each patch. The whole-image and Patch-GAN discriminators were compared in this study. Both types of discriminators were tested in the simulation and SD rat experiments. Because the whole-image discriminator obtained a higher score in the comparison experiments, all experimental results are based on the whole-image discriminator. The Adam solver was used to optimize the networks. The number of training iterations was n = 120, the learning rate was α = 0.0002, and the weight of the L1 term was λ = 100.
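The non-local relation between image space and sinogram space can be illustrated with a toy projector. The sketch below uses a simplified nearest-bin parallel-beam geometry (an assumption of this example, not the geometry of the scanners used here) and shows that a single image pixel contributes to every angular row of the sinogram, tracing a sinusoid across detector bins. The function name `parallel_projection` is hypothetical.

```python
import numpy as np

def parallel_projection(image, n_angles=60):
    """Nearest-bin parallel-beam projection of a square image.

    Rows of the result are projection angles, columns are detector bins.
    """
    n = image.shape[0]
    center = (n - 1) / 2.0
    ys, xs = np.nonzero(image)
    n_bins = int(np.ceil(np.sqrt(2) * n)) + 1    # covers the image diagonal
    sino = np.zeros((n_angles, n_bins))
    angles = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    for a, theta in enumerate(angles):
        # Signed distance of each nonzero pixel from the central ray.
        s = (xs - center) * np.cos(theta) + (ys - center) * np.sin(theta)
        bins = np.round(s + (n_bins - 1) / 2.0).astype(int)
        np.add.at(sino[a], bins, image[ys, xs])
    return sino

point = np.zeros((64, 64))
point[20, 40] = 1.0                     # a single active pixel
sino_point = parallel_projection(point)
rows_hit = int(np.count_nonzero(sino_point.sum(axis=1)))
bins_hit = len(set(np.argmax(sino_point, axis=1).tolist()))
```

Here `rows_hit` equals the number of angles, so a discriminator that inspects only small sinogram or image patches sees a fragment of the information tied to one image location, which motivates judging the whole image instead.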

Objective Function
The conditional restriction can make the results closer to the desired ones. Therefore, the input includes both random noise z and condition y, and the objective of the cGAN can be represented as

$$L_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(y, x)] + \mathbb{E}_{y,z}[\log(1 - D(y, G(z, y)))] \quad (4)$$

where $D(y, \cdot)$ scores an image paired with its conditioning sinogram $y$. As shown in Equation (4), generator G attempts to adjust its parameters to minimize $\log(1 - D(y, G(z, y)))$, while D tries to maximize it; thus, they are two adversarial models. Training with the adversarial loss alone makes the generated images sharp but does not ensure similarity with the label image. To enhance the accuracy of low-frequency information, the L1 regularizer is introduced in the proposed model. Thus, the entire objective function can be described as

$$G^{*} = \arg\min_{G}\max_{D} L_{cGAN}(G, D) + \lambda L_{L1}(G) \quad (5)$$

where $L_{L1}(G)$ is the L1-norm term, which is represented as

$$L_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert x - G(z, y) \rVert_{1}] \quad (6)$$
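The way the adversarial term and the weighted L1 term combine can be sketched numerically. The function below takes discriminator outputs as probabilities in (0, 1) and averages the L1 distance per pixel; both choices, together with the name `cgan_losses`, are assumptions of this illustration and not a description of the training code used in this work.

```python
import numpy as np

def cgan_losses(d_real, d_fake, x_true, x_gen, lam=100.0, eps=1e-12):
    """Return (discriminator loss, generator loss) for one batch.

    d_real: D outputs for real (sinogram, image) pairs.
    d_fake: D outputs for (sinogram, generated image) pairs.
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    # D maximizes log D(real) + log(1 - D(fake)); minimize the negative instead.
    d_loss = -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))
    # G minimizes log(1 - D(fake)) plus the weighted L1 distance to the label image.
    adversarial = np.mean(np.log(1.0 - d_fake))
    l1 = np.mean(np.abs(x_true - x_gen))
    g_loss = adversarial + lam * l1
    return d_loss, g_loss
```

With λ = 100, even a small mean absolute error dominates the adversarial term, so the generator is pushed first toward the label image and only then toward fooling the discriminator.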

Experiments
Simulation experiments, including the Zubal thorax phantom and Zubal brain phantom, SD rat experiments, and real patient experiments were designed to verify the performance of the algorithm.

Simulation Experiments
Simulated data of the Zubal thorax phantom with 64Cu-ATSM and the Zubal brain phantom with 11C-Acetate were generated using Monte Carlo simulations with the GATE toolbox, which can simulate the true environment of the PET scan and generate realistic sinogram data. The simulated PET scanner was the Hamamatsu SHR74000, which contains six rings with 48 detector blocks. The ring diameter is 826 mm. The size of the field of view is 576 mm in the transaxial direction and 18 mm in the axial direction [39]. To obtain the original PET data for the Monte Carlo simulation, a two-compartment model was used to simulate the radiation concentration of different regions of interest (ROIs) in different phantoms. Figure 4 shows that each phantom is composed of several ROIs, in which the kinetic parameters of the two-compartment model are different. Details regarding the kinetic parameter settings can be found in the literature [40][41][42]. Each phantom was simulated with three data sampling times. For each sampling time, three levels of counting rates were set: $5 \times 10^{6}$, $1 \times 10^{7}$, and $5 \times 10^{7}$. For each count level, each reconstructed dynamic image had 18 frames with a resolution of 64 × 64 pixels. Thus, the entire database consisted of two phantoms, and each phantom dataset contained 162 (9 × 18) images. One third of each phantom dataset was randomly selected as the testing set, which included 162 images, with the others making up the training set, which included 324 images. The detailed sampling interval of the 18 frames is shown in Table 1. For example, for the 20 min simulated scanning of the Zubal thorax phantom, the first 14 frames were taken every 50 s, then two frames were taken every 100 s, and the last two frames were taken every 150 s. In this study, the bias and variance were adopted for comparison.
where $n$ denotes the overall number of pixels in the ROI, $\hat{x}_i$ denotes the reconstructed value at voxel $i$, $x_i$ denotes the true value at voxel $i$, and $\bar{x}_i$ is the mean value of $x_i$.
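The exact expressions for the bias and variance are not reproduced in this text. As a hedged illustration only, the sketch below uses one relative form that is common in PET reconstruction studies: bias as the mean of $(\hat{x}_i - x_i)/x_i$ and variance as $\frac{1}{n-1}\sum_i ((\hat{x}_i - \bar{x})/x_i)^2$, with $\bar{x}$ taken as the mean true value in the ROI. These formulas and the function name `roi_bias_variance` are assumptions and may differ from the definitions actually used in this study.

```python
import numpy as np

def roi_bias_variance(recon, truth, roi_mask):
    """Relative bias and variance inside an ROI (assumed definitions)."""
    x_hat = recon[roi_mask].astype(float)    # reconstructed values in the ROI
    x = truth[roi_mask].astype(float)        # true values in the ROI (must be nonzero)
    n = x.size
    x_bar = x.mean()                         # assumed: mean of the true ROI values
    bias = np.sum((x_hat - x) / x) / n
    variance = np.sum(((x_hat - x_bar) / x) ** 2) / (n - 1)
    return bias, variance
```

Because both quantities are normalized by the true voxel value, they can be compared across ROIs with different activity levels.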

SD Rat Experiments
To verify the reliability of the algorithm, twelve nine-week-old SD rats injected with 18F-FDG were scanned using a Siemens Inveon micro-PET scanner. Sinograms of 128 pixels × 160 pixels × 130 slices and activity images of 128 pixels × 128 pixels × 130 slices were obtained. The reconstruction method used was the ordered subset expectation maximization 3D algorithm (OSEM 3D) packaged in the Siemens scanning software. Three rats were randomly selected for testing, and the other nine rats were used for training. Thus, there were 1170 image pairs in the training dataset and 390 image pairs in the testing dataset. Here, the images reconstructed using OSEM 3D were treated as the ground truth.
For the SD rat dataset, the relative root mean squared error (rRMSE) was used to evaluate the image quality:

$$\mathrm{rRMSE} = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{x}_i - x_i)^{2}}}{\bar{x}}$$

where $x$ is the ground truth, $\hat{x}$ is the reconstructed image, $\bar{x}$ is the ground truth average pixel value, and $n$ is the number of image pixels.
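A minimal sketch of the rRMSE computation is given below, assuming that the metric is the root mean squared error over all pixels divided by the ground-truth average pixel value, as the symbol definitions suggest. The function name `rrmse` is hypothetical.

```python
import numpy as np

def rrmse(recon, truth):
    """Root mean squared error normalized by the ground-truth mean pixel value."""
    recon = np.asarray(recon, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rmse = np.sqrt(np.mean((recon - truth) ** 2))
    return rmse / truth.mean()
```

Normalizing by the mean makes the score independent of the absolute activity scale, so slices with different total counts can be compared.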

Real Patient Experiments
To test the feasibility of the proposed method on human datasets, the method was also evaluated using a human brain PET dataset from nine real patients. The tracer injected into the patients was 18F-FDG. For each patient, 93 slices of reconstructed images and the corresponding sinogram images were obtained. The reconstruction method was OSEM. The sizes of the images were set to 320 × 320. The last two patients were selected as the testing dataset, and the other seven patients were used as the training dataset. In addition to rRMSE, structural similarity (SSIM) was used to evaluate the quality of the results.
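$$\mathrm{SSIM}(x, \hat{x}) = \frac{(2u_{x}u_{\hat{x}} + c_{1})(2\sigma_{x\hat{x}} + c_{2})}{(u_{x}^{2} + u_{\hat{x}}^{2} + c_{1})(\sigma_{x}^{2} + \sigma_{\hat{x}}^{2} + c_{2})}$$

where $u_{x}$ is the mean value of $x$, $u_{\hat{x}}$ is the mean value of $\hat{x}$, $\sigma_{x}^{2}$ is the variance of $x$, $\sigma_{\hat{x}}^{2}$ is the variance of $\hat{x}$, and $\sigma_{x\hat{x}}$ is the covariance of $x$ and $\hat{x}$.
Here, $c_{1} = (k_{1}L)^{2}$, $c_{2} = (k_{2}L)^{2}$, $k_{1} = 0.01$, and $k_{2} = 0.03$; both $c_{1}$ and $c_{2}$ are constants used to maintain stability. In addition, $L$ is the dynamic range of the pixel values.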
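The SSIM statistics defined above can be computed as in the following sketch. It evaluates a single global SSIM over the whole image, whereas common implementations average the index over local sliding windows; this simplification and the function name `global_ssim` are assumptions made for illustration.

```python
import numpy as np

def global_ssim(x, x_hat, dynamic_range, k1=0.01, k2=0.03):
    """Single-window SSIM computed from whole-image statistics."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    c1 = (k1 * dynamic_range) ** 2           # stabilizes the luminance term
    c2 = (k2 * dynamic_range) ** 2           # stabilizes the contrast/structure term
    mu_x, mu_h = x.mean(), x_hat.mean()
    var_x, var_h = x.var(), x_hat.var()
    cov = np.mean((x - mu_x) * (x_hat - mu_h))
    numerator = (2 * mu_x * mu_h + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_h ** 2 + c1) * (var_x + var_h + c2)
    return numerator / denominator
```

The index equals 1 only when the two images are identical, and it decreases as their means, variances, or structure diverge.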

Simulation Experiments
The Zubal thorax phantom was used to select a suitable discriminator and show the convergence of the algorithm. The Zubal head phantom was also used to verify the robustness and testing time of the proposed method.

Discriminator Comparison Experiments
The whole-image discriminator and the Patch-GAN discriminator were both tested on the Zubal thorax dataset. Three dynamic objects were used for testing, each with 18 frames. During the scanning period, as the time after tracer injection increases, the organs accumulate more photons and the activity image becomes clearer. In current clinical practice, doctors usually pay attention to static data, that is, one clear activity image from the later frames. Thus, the later frames are the most useful clinically. We compared the two strategies on the dynamic testing data, and the results are shown in Figure 5. Although the two strategies obtained similar quantitative results in the first 10 frames, the whole-image strategy obtained much lower bias and variance than Patch-GAN in the last 8 frames. With the whole-image discriminator, the proposed network is better able to improve the reconstructed image quality, especially for frames with higher photon counts. To further illustrate that the whole-image strategy is more suitable for PET reconstruction, we selected a real patient to show the generation results under the two strategies at the same epochs of the training phase. The generation results of the 2nd, 20th, 40th, 60th, 80th, and 100th epochs are shown in Figure 6. As the number of epochs increases, the global features become clearer for both strategies. However, the network with Patch-GAN may generate local features that do not exist in the ground truth, as shown in the red box. Moreover, compared with Patch-GAN, the whole-image discriminator generates more accurate reconstructed images in fewer epochs.

Accuracy
Three frames of the Zubal thorax testing dataset were extracted to exhibit the reconstructed results, as shown in Figure 8. The method was also compared with the MLEM and TV algorithms. As shown in Figure 8, the reconstruction results of the proposed method are highly consistent with the ground truth in terms of the sharp edges and high pixel values of the ROIs. The results of the MLEM method contain excessive noise and artifacts because of the chessboard effect, to the point that the boundaries cannot be clearly observed. The TV method provides a clearer and sharper result than the MLEM method. However, its reconstruction of the small structure in the ROI3 area is poor, as indicated by the pink rectangular box. The detailed quantitative results are provided in Table 2. Over the full image, the images generated by cGANs had less than one tenth of the bias and less than one percent of the variance of the MLEM results. Compared with TV, cGANs also yielded much lower bias and variance. However, the performance of cGANs differs considerably between frames. For the early frames, such as the third and seventh frames, although cGANs obtained better quantitative results than TV on the full image, the bias values of ROI2 (a tiny area in the Zubal thorax phantom) are very close. For the late frames, such as the 12th and 18th frames, the bias values of cGANs are almost one tenth of those of TV. Even for ROI2, cGANs achieved surprisingly good results. cGANs indeed perform much better on frames with higher photon counts and less noise.
Figure 9 shows the mean values of the bias and variance of all 18 frames of the Zubal thorax phantom. For both the bias and the variance, the cGAN framework obtained a satisfactory score. The TV method performs much better than the MLEM method because it can better suppress the noise in the images. However, compared with the other two areas, the reconstruction results of ROI2 were unsatisfactory, and the values of the bias and variance of the three methods were higher than those of the other areas. This outcome is likely because ROI2 is too small to be distinguished from the other two parts.

Robustness and Runtime Analysis
The Zubal thorax dataset was chosen to validate the robustness of the algorithm under different counting rates.The values of the evaluation parameters for reconstruction are shown in Figure 10.The three curves at the top of both the bias graph and the variance graph show the properties of MLEM, the three curves in the middle show the properties of TV, and the curves at the bottom show the properties of cGANs.The three solid curves at the bottom of the graph have the highest coincidence, which means that the cGAN method is the least affected by the count value.For the other two methods, as the count increases, both the variance and bias of the MLEM and TV methods decrease distinctly.Therefore, one can conclude that these two methods are significantly influenced by the count.This may be because both methods are based on a probabilistic statistical model; therefore, when the count increases, the probabilistic statistical characteristics of the data can be better guaranteed.Compared with the other two methods, the proposed frame-

Robustness and Runtime Analysis
The Zubal thorax dataset was chosen to validate the robustness of the algorithm under different counting rates. The values of the evaluation parameters for reconstruction are shown in Figure 10. In both the bias graph and the variance graph, the three curves at the top show the results of MLEM, the three curves in the middle show those of TV, and the curves at the bottom show those of cGANs. The three solid curves at the bottom of each graph nearly coincide, which means that the cGAN method is the least affected by the count level. For the other two methods, both the variance and the bias decrease distinctly as the count increases. Therefore, one can conclude that these two methods are significantly influenced by the count. This may be because both methods are based on a probabilistic statistical model; when the count increases, the statistical characteristics of the data are better satisfied. Compared with the other two methods, the proposed framework has the lowest variance and bias and the highest stability.
The Zubal head phantom is more complex than the Zubal thorax phantom. It was used to verify whether the proposed algorithm can predict the reconstructed images of the last six frames after training on the first 12 frames.
The concentration distribution of the first 12 frames is shown in Figure 11a. Four of the subsequent six frames were chosen to show the reconstruction results, as presented in Figure 11b. Although the training dataset did not include the last six frames, the test results are highly consistent with the true concentration distribution. Even an extremely small area, such as ROI3, can be reconstructed clearly, as indicated by the pink rectangular boxes in Figure 11b.
Tables 3 and 4 list the reconstruction results of the three methods. As a classic iterative method, MLEM took about 0.2 s to reconstruct each image. TV adds a postprocessing step to the iterative method to improve the image quality and takes almost twice as long as MLEM. The cGAN framework took only 0.007 s per image because the model was trained in advance, so the test time is very short and almost negligible.
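The per-image times quoted above can be turned into relative costs with a few lines of arithmetic. The snippet below uses only the figures stated in the text (about 0.2 s for MLEM, roughly twice that for TV, and 0.007 s for the cGAN). The TV time of 0.4 s is an approximation derived from "almost twice as long", and the batch of 54 images is a hypothetical workload chosen purely for illustration.

```python
# Per-image reconstruction times in seconds, taken from the text.
mlem_time = 0.2
tv_time = 2 * mlem_time  # "almost twice as long as MLEM" (approximation)
cgan_time = 0.007

# Fraction of the MLEM time needed by the trained cGAN at test time.
cgan_vs_mlem = cgan_time / mlem_time
speedup_over_mlem = mlem_time / cgan_time
speedup_over_tv = tv_time / cgan_time

# Total time for a hypothetical batch of 54 test images.
n_images = 54
totals = {
    "MLEM": n_images * mlem_time,
    "TV": n_images * tv_time,
    "cGAN": n_images * cgan_time,
}

for name, seconds in totals.items():
    print(f"{name}: {seconds:.2f} s for {n_images} images")
print(f"cGAN needs {cgan_vs_mlem:.1%} of the MLEM time per image")
```

These figures cover inference only. The training cost of the network is paid once in advance and is not included in the per-image time.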

SD Rat Experiments
To verify the method in a faster and more familiar way, the sinograms and the ground truth images, which were generated with the OSEM method, were extended to a uniform size of 192 × 192 by zero padding. First, the two discriminators were compared on the real datasets. As shown in Figure 12, the rRMSE of the whole-image strategy is lower than that of PatchGAN, which is consistent with the results of the simulations.
Three images in the test dataset were randomly chosen to show the reconstructed results. As shown in Figure 13, the reconstructed images produced by the cGAN have contour structures that are similar to the ground truth. However, the details are not clear, and the rRMSE value is slightly high. This may be caused by two factors: a more complex real scan situation and large individual differences among samples. A larger amount of data is required, but this is always a major challenge in medical image processing.
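The zero-padding step mentioned at the start of this section can be sketched as follows. The text states only that the sinograms and ground truth images were extended to 192 × 192 by zero padding; centering the original data on the padded canvas, the helper name `zero_pad_to`, and the 128 × 160 input size are assumptions made for this example.

```python
def zero_pad_to(image, target_rows=192, target_cols=192):
    """Center a 2D list-of-lists image on a zero canvas of the target size."""
    rows, cols = len(image), len(image[0])
    if rows > target_rows or cols > target_cols:
        raise ValueError("image is larger than the target size")
    top = (target_rows - rows) // 2
    left = (target_cols - cols) // 2
    padded = [[0.0] * target_cols for _ in range(target_rows)]
    for r, row in enumerate(image):
        # Copy one source row into the centered window of the canvas.
        padded[top + r][left:left + cols] = row
    return padded


# Hypothetical 128 x 160 sinogram filled with ones.
sinogram = [[1.0] * 160 for _ in range(128)]
padded = zero_pad_to(sinogram)
print(len(padded), len(padded[0]))
```

Padding the sinogram and the image to the same square size lets the generator take an input and produce an output of identical dimensions, which matches the requirement that the output image has the same size as the input.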

Real Patient Experiments
Several different learning rates and numbers of epochs were applied to a real human brain dataset. The best results were obtained with lr = 0.002 and 120 epochs. The training dataset contained seven objects, comprising 651 image pairs. The testing dataset contained two objects, comprising 186 image pairs. Finally, five slices were selected from the test dataset, as shown in Figure 14. The quantitative results, including the SSIM and rRMSE, are shown in Figure 15. Compared with the SD rat whole-body dataset, the human brain dataset yielded better reconstructed image quality. The SSIM of the test images was close to 0.94, and the mean rRMSE was approximately 2.74. A possible reason is that the whole-body dataset of the SD rats has finer and more complex structures than the human brain dataset. However, for a detailed structure, such as that in the green rectangle in Figure 14, it is difficult to obtain an accurate result. MLEM and TV reconstruct a single object through a finite number of iterations, whereas cGANs are a deep-learning-based method that achieves reconstruction by first learning from other objects. If there are not enough training objects, the learning result is naturally unsatisfactory, and if a new object has structures that the model has never seen before, it is very difficult to reconstruct it accurately.
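The rRMSE quoted above is not defined in this section, and its scale (for example, whether it is reported as a percentage) is not stated here. As a hedged illustration, the sketch below assumes one common form, the root of the summed squared error divided by the summed squared reference, and averages it over several slices. The function names and the toy data are hypothetical.

```python
import math


def rrmse(recon, truth):
    """Relative RMSE of a flattened image against its reference (assumed form)."""
    if len(recon) != len(truth):
        raise ValueError("images must have the same number of pixels")
    squared_error = sum((r - t) ** 2 for r, t in zip(recon, truth))
    reference_energy = sum(t ** 2 for t in truth)
    return math.sqrt(squared_error / reference_energy)


def mean_rrmse(recon_slices, truth_slices):
    """Average the per-slice rRMSE over a test set."""
    scores = [rrmse(r, t) for r, t in zip(recon_slices, truth_slices)]
    return sum(scores) / len(scores)


# Two toy slices: the first has one wrong pixel, the second is perfect.
truth_slices = [[3.0, 4.0, 0.0, 0.0], [1.0, 2.0, 2.0, 0.0]]
recon_slices = [[3.0, 4.0, 0.5, 0.0], [1.0, 2.0, 2.0, 0.0]]

score = rrmse(recon_slices[0], truth_slices[0])
average = mean_rrmse(recon_slices, truth_slices)
print(f"slice rRMSE = {score:.3f}, mean rRMSE = {average:.3f}")
```

Averaging per-slice scores, as done here, weights every slice equally regardless of how much activity it contains. Pooling all pixels before taking the ratio would be an alternative convention.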


Discussion
PET has become an indispensable tool in clinical trials in recent years. The quality of the reconstructed image is crucial for the development of PET. However, traditional reconstruction methods have many limitations, as discussed above. Therefore, deep learning was adopted in this study to avoid the problems encountered in traditional methods. Specifically, an attempt was made to learn the correspondence between sinogram images and reconstructed images. Considering that cGANs have obtained outstanding results in other fields, a cGAN was chosen as the main network. The results on the simulation, SD rat, and human brain datasets suggest that cGANs can outperform traditional methods.
For the simulation datasets, the reconstructed image quality is greatly improved compared with MLEM and TV, and the quantitative results support this. For the first few frames, the bias of cGANs is only about 10% of that of the MLEM method and about 30% of that of TV. The variance is less than 1% of that of the other methods. For the late frames, which contain less noise and more photons, the bias and variance are much lower. In terms of reconstruction time, the deep-learning-based cGAN takes only about 3% of the time of MLEM.
However, although satisfactory results were obtained for the simulation dataset, the performance on the SD rat and real patient datasets was not satisfactory. There are two reasons for this. First, the two real datasets had a more complex data acquisition environment. Second, there were significant differences between individuals. There-


Figure 1. Training framework of the direct reconstruction process.

The generator consists of two paths. (1) Encoder: the contracting path (shown on the left-hand side of the top of Figure 1) is used to compress the input image and obtain context information. (2) Decoder: the symmetrical expanding path (shown on the right-hand side) is used to expand the feature maps and localize accurately. The basic module of both paths is a convolution layer followed by a batchnorm layer and a ReLU layer. However, the encoder and decoder differ in some details of their settings. There is no batch normalization (batchnorm) in the first layer of the encoder. All rectified linear units (ReLUs) in the encoder are leaky, whereas the ReLUs in the decoder are not leaky. All convolutions use 16 spatial filters and downsample with stride 2. Dropout is applied only to the first three layers of the decoder. Skip connections that concatenate activations from the ith layer in the contracting path to the (n − i)th layer are built in the expanding path, where n is the total number of layers. For the proposed method, n = 12. Finally, a tanh function is applied after the last layer of the decoder to complete the generator model, and the output image has the same size as the input image. The structure of the generator is illustrated in Figure 2.
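The skip-connection rule described above (layer i concatenated with layer n − i, with n = 12) can be checked with a small bookkeeping sketch. It assumes that the 12 layers are split evenly into six encoder and six decoder layers, that every layer changes the spatial size by a factor of 2, and that the input is 192 × 192. None of these details is confirmed by the text beyond n = 12 and stride 2, and the function name `unet_layout` is hypothetical.

```python
def unet_layout(input_size=192, n_layers=12):
    """Return per-layer output sizes and skip pairs for a symmetric stride-2 U-Net."""
    half = n_layers // 2
    if input_size % (2 ** half):
        raise ValueError("input size must be divisible by 2 ** (n_layers // 2)")
    sizes = {}
    size = input_size
    # Encoder layers 1..half: each stride-2 convolution halves the size.
    for i in range(1, half + 1):
        size //= 2
        sizes[i] = size
    # Decoder layers half+1..n: each stride-2 layer doubles the size.
    for j in range(half + 1, n_layers + 1):
        size *= 2
        sizes[j] = size
    # Layer i in the contracting path is paired with layer n - i.
    skips = [(i, n_layers - i) for i in range(1, half)]
    return sizes, skips


sizes, skips = unet_layout()
for enc, dec in skips:
    print(f"layer {enc} ({sizes[enc]} px) -> layer {dec} ({sizes[dec]} px)")
```

Under these assumptions, each pair (i, n − i) has matching spatial sizes, which is what makes channel-wise concatenation possible, and the last layer restores the input resolution.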

Figure 3. The sinogram image and the corresponding activity image. There is no simple pixel-to-pixel relationship between (a,b). Each pixel of the sinogram has its own physical meaning. The abscissa of the pixel refers to the distance between the line of response (LOR) and the center of the detector. The ordinate of the pixel refers to the angle between the LOR and the standard surface. The value of the pixel refers to the number of coincidence events recorded by the detector at the corresponding position.


Figure 5. Comparison of the reconstruction results of the whole-image discriminator and PatchGAN: (a) bias; (b) variance. The bias and variance of the whole-image strategy are much lower than those of PatchGAN for the later frames, as shown between the dashed blue lines.

To better illustrate that the whole-image strategy is more suitable for PET reconstruction, we selected a real patient to show the generation results under the different strategies at the same epochs in the training phase. The generation results of the 2nd, 20th, 40th, 60th, 80th, and 100th epochs are shown in Figure 6. As the epoch number increases, the global features become clearer for both strategies. However, the network with PatchGAN may generate some local features that do not exist in the ground truth, as shown in the red box. Moreover, compared with PatchGAN, the whole-image discriminator yields more accurate reconstructed images in fewer epochs.

Figure 6. The images generated in the training phase under two different strategies. The first column shows the real brain images generated with PatchGAN. The second column shows the brain images generated with the whole-image method. From left to right: the images generated in the 2nd, 20th, 40th, 60th, 80th, and 100th epochs.

4.1.2. Convergence of the Algorithm
As shown in Table 1, the Zubal thorax dataset is divided according to sampling time. The 20 and 30 min scanning datasets form the training set, which contains 108 PET images and the corresponding sinogram images. The 40 min scanning data were chosen as the testing set, which includes 54 image pairs. The entire network was trained for 120 iterations, and the convergence curves are shown in Figure 7. The left curve shows the convergence of the discriminator, and the right curve shows that of the generator. The x-axis indicates the iteration step, and the y-axis indicates the loss value. Both the generator and the discriminator converge quickly to a stable value.


Figure 7. Convergence curves of the discriminator and generator. Both g-loss and d-loss converge quickly after approximately 1000 steps.

Figure 8. Reconstruction results for the Zubal thorax phantom with 40 min scanning and 1 × 10^7 counts. From left to right: MLEM results, TV results, cGAN results, and ground truth. From top to bottom: the 3rd, 7th, and 12th frames.


Figure 9. Bias and variance of whole images for ROI1, ROI2, and ROI3 of the Zubal thorax phantom.


Figure 10. Bias and variance of the three methods (MLEM, TV, and cGANs) for the Zubal thorax phantom with different counts. The top three curves represent the results of the MLEM method, the middle curves represent the results of the TV method, and the bottom curves correspond to the proposed method. With increasing counts, the bias and variance values of the MLEM and TV methods decrease, whereas the cGAN method is only slightly affected.

Figure 11. Robustness verification graphs of the Zubal head phantom for different frames. (a) Concentration distribution of the training images (the first 12 frames). (b) Reconstruction results for the Zubal head phantom with a 70 min scan and 5 × 10^7 counts. From top to bottom: MLEM results, TV results, cGAN results, and ground truth. From left to right: the 13th, 14th, 16th, and 17th frames.

Figure 12. Comparison of reconstruction results using the whole-image discriminator and PatchGAN on the SD rat dataset.


Figure 13. Reconstruction results of different slices of SD rats. From top to bottom: sinogram images, ground truth (OSEM), and reconstruction results of cGANs.


Figure 14. Reconstruction results of five selected images of the human brain dataset: 1-59 depicts the 59th slice of the first test object, 2-34 depicts the 34th slice of the second test object, etc.

Appl. Sci. 2022, 12, x FOR PEER REVIEW 16 of 18

Figure 15. Reconstruction results of five selected images of the human brain dataset.

Table 1. Dataset for simulation experiments.
