Visual and Quantitative Evaluation of Amyloid Brain PET Image Synthesis with Generative Adversarial Network

Abstract: Conventional data augmentation (DA) techniques, which have been used to improve the performance of predictive models when balanced training data sets are lacking, require effort to define the proper repeating operations (e.g., rotation and mirroring) according to the target class distribution. Although DA using a generative adversarial network (GAN) has the potential to overcome the disadvantages of conventional DA, this technique has rarely been applied to medical images, and in particular, quantitative evaluation has rarely been used to determine whether the generated images have enough realism and diversity to be used for DA. In this study, we synthesized 18F-Florbetaben (FBB) images using a conditional GAN (CGAN). The generated images were evaluated using various measures, and we report the state of the images and the similarity values of the quantitative measurements at which generated images can be expected to successfully augment data. The method includes (1) conditional WGAN-GP to learn the axial image distribution extracted from pre-processed 3D FBB images, (2) pre-trained DenseNet121 and model-agnostic metrics for visual and quantitative measurements of the generated image distribution, and (3) a machine learning model for observing the improvement in generalization performance obtained with the generated dataset. The Visual Turing test showed similarity in the descriptions of typical patterns of amyloid deposition for each of the generated images. However, differences in similarity and classification performance per axial level were observed, which did not agree with the visual evaluation. Experimental results demonstrated that quantitative measurements were able to detect the similarity between two distributions and observe mode collapse better than the Visual Turing test and t-SNE.


Introduction
Approximately 50 million people worldwide have dementia, and nearly 10 million new cases occur each year. This number is expected to increase to 82 million by 2030 and 152 million by 2050 [1,2]. Alzheimer's disease (AD), which is present in 70% of patients with dementia, is the most prevalent dementia-causing illness. It degrades memory and thinking skills and eventually renders a person unable to maintain an independent life [3]. From a neuropathological point of view, the main factors responsible for the symptoms of AD are intracellular neurofibrillary tangles and extracellular amyloid plaques [4][5][6][7]. Positron emission tomography (PET) is an ultrasensitive and non-invasive molecular imaging technique used to detect functional activity within organs that are expected to be [...]

Therefore, to successfully drive a stable GAN-based DA, determining whether to enhance the training set using the generated set based on quantitative evaluation could help to preserve and reliably improve the performance of the target model. However, no GAN-synthesized brain images have been evaluated using recently reported quantitative measurements [30] instead of traditional approaches [29,31], although some simple results of visual evaluations have been reported [32]. Furthermore, to the best of our knowledge, no study on GAN-based synthetic medical imaging has combined a quantitative evaluation with a visual assessment of brain PET for the diagnosis and prognosis of dementia. In this study, we created a conditional GAN to improve the Aβ estimation model and performed a quantitative evaluation to confirm the degree of similarity at which an improvement in generalization performance can be expected.
This method includes (1) conditional WGAN-GP to learn each axial image distribution extracted from pre-processed 3D FBB images, (2) pre-trained DenseNet121 and model-agnostic metrics to visually and quantitatively measure the generated images, and (3) ML models such as the support vector machine (SVM) and neural network (NN) for observing generalization performance after using the generated images for DA. Finally, we will upload the weights of the GAN models and the source code for our experiments (https://github.com/kang2000h/GAN_evaluation) so that our experiments are reproducible and remain available for related work (Supplementary Materials).

Experiment
A data flow diagram is shown in Figure 1. It illustrates the process of obtaining a GAN to improve the target model using augmented FBB amyloid brain PET image data, and of measuring the realism of the generated data and its suitability for use in data augmentation. First, raw PET images obtained from a picture archiving and communication system (PACS) running at Dong-A University Hospital (DAUH) undergo pre-processing. The pre-processed images were examined, and 3D images from patients that were Aβ negative or Aβ positive were divided using a 1:1 ratio into training and test sets. The training set was used to select and train GAN models that generated images of both groups. In this experiment, the similarity between the images generated from the trained GAN model and the real images was evaluated using a Visual Turing test, the feature distribution visualized with t-SNE, and 3 quantitative metrics. The metrics selected to measure the similarity of a given data distribution were recently reported model-agnostic metrics [30,33], including maximum mean discrepancy (MMD), Fréchet inception distance (FID), and the 1-nearest neighbor classifier (1-NN) leave-one-out (LOO) accuracy, instead of traditional approaches whose limitations have been reported [34]. Finally, the generalization performance of the target model was measured by comparing the performance of a target model that was trained using only the training set with that of the augmented target model that was trained using both the training and generated sets.

The tool used in this experiment was written in Python 3.6.9 (Python Software Foundation, Wilmington, DE, USA), mainly using the Keras 2.2.4 and OpenCV 4.1.2.30 libraries. DenseNet was used as a feature extractor, and pre-trained weights were provided by the Keras library. The experimental environment ran on Linux Ubuntu 16.04 LTS with 4 NVIDIA GeForce GTX TITAN XP GPUs.
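The 1:1 division into training and test sets described above can be illustrated with a minimal sketch. It assumes that the split is made per subject and separately within each Aβ class, which the text does not state explicitly; the function name `split_half_per_class`, the subject identifiers, and the seed are hypothetical, while the class sizes (160 and 138) are those reported in the data acquisition section.

```python
import random

def split_half_per_class(subject_ids_by_class, seed=0):
    """Split the subjects of each class 1:1 into training and test lists."""
    rng = random.Random(seed)
    train, test = [], []
    for label, ids in subject_ids_by_class.items():
        ids = list(ids)
        rng.shuffle(ids)                      # random assignment per class
        half = len(ids) // 2
        train += [(s, label) for s in ids[:half]]
        test += [(s, label) for s in ids[half:]]
    return train, test

# Hypothetical subject identifiers; only the class sizes come from the text.
subjects = {
    "negative": ["neg_%03d" % i for i in range(160)],
    "positive": ["pos_%03d" % i for i in range(138)],
}
train, test = split_half_per_class(subjects)
print(len(train), len(test))  # 149 149
```

Splitting by subject rather than by slice keeps all axial planes of one patient on the same side of the split, which avoids leakage between the training and test sets.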

Data Acquisition and Pre-Processing
The FBB PET/CT images used in this study were collected retrospectively from images taken at the Department of Nuclear Medicine, Dong-A University Hospital (DAUH), from November 2015 to May 2018. The Institutional Review Board of Dong-A University Hospital reviewed and approved this study protocol (DAUHIRB-17-108). Each FBB image was confirmed by a nuclear medicine physician after collection to ensure that the Aβ distribution labels were accurate. The labeling of our experimental data was based on the brain amyloid plaque load (BAPL) scoring system used for reading FBB images. Four areas of the brain, namely the frontal lobe, temporal lobe, parietal lobe, and posterior cingulate, were observed in the axial plane and scored based on the amount of Aβ deposited in the gray matter relative to the white matter [35,36]. All subjects scanned in this study received a clinical diagnosis from a neurologist at DAUH. There were 298 participants in the data group, comprising 160 typical Aβ negatives and 138 typical Aβ positives. Detailed demographic data are presented in Table 1.

The FBB PET images used in this experiment were acquired using a Biograph 40mCT Flow PET/CT scanner (Siemens Healthcare, Knoxville, TN, USA) and reconstructed via UltraHD-PET (TrueX-TOF). The participants were scanned 90 min after intravenous injection of 300 MBq of FBB (NeuraCeq, Piramal, Mumbai, India), and images were taken 20 min after a helical CT with a 0.5 s rotation time at 100 kVp and 228 mAs. The raw PET images were resliced from a field of view of 408 × 408 × 168 (mm) and stored in DICOM format in the DAUH PACS. The pre-processing steps, including co-registration and spatial and count normalization of the brain images, were performed with statistical parametric mapping 8 [37]. Rigid co-registration was first performed on each PET image and the corresponding CT image with respect to the center.
An in-house PET template was created using CT images from 21 patients without typical AD and 9 patients with typical AD in MNI space. Spatial normalization was performed on each PET image using the generated PET template [38][39][40]. Then, stochastic cerebellar masks for the PET templates were obtained from PMOD3.6 (PMOD Technologies Ltd., Zurich, Switzerland) and the Hammers brain atlas [41], and these were used to perform count normalization based on cerebellar intensity [42]. After pre-processing, only the 15th to 50th axial images were extracted as input data for the Aβ classifier, so that only the axial planes read by the nuclear medicine physician were examined. Finally, a 95 × 79 × 36 image representing the Aβ distribution of each subject was used as an input for the GAN and target classifier models.
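The last two pre-processing steps, count normalization by cerebellar intensity and extraction of the 15th to 50th axial images, can be sketched as follows. This is only an illustration on synthetic data: the helper names, the use of the mean uptake inside a binary mask as the reference, the 1-based slice numbering, and the depth of 68 slices before cropping are all assumptions that the text does not confirm.

```python
import numpy as np

def count_normalize(volume, cerebellar_mask):
    """Divide every voxel by the mean uptake inside the cerebellar mask."""
    reference = volume[cerebellar_mask > 0].mean()
    return volume / reference

def select_axial_range(volume, first=15, last=50):
    """Keep axial slices first..last (1-based, inclusive) along the last axis."""
    return volume[:, :, first - 1:last]

# Synthetic stand-in for a spatially normalized PET volume (x, y, z).
rng = np.random.default_rng(0)
volume = rng.uniform(0.5, 2.0, size=(95, 79, 68))

# Hypothetical cerebellar region; a real mask would come from an atlas.
mask = np.zeros_like(volume)
mask[30:60, 10:30, 5:15] = 1.0

normalized = count_normalize(volume, mask)
cropped = select_axial_range(normalized)
print(cropped.shape)  # (95, 79, 36)
```

After normalization the mean intensity inside the reference region equals 1, so voxel values can be read as uptake ratios relative to the cerebellum.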

Target Model to Enhance with Generated Set
Before elucidating the design of the generative model in Section 2.4, we first define the target model that will be trained with the images created by the generative model in this experiment. Previous studies [43,44] have shown that the performance of the ML/DL-based classification system for the Aβ distribution on FBB amyloid PET image data obtained from DAUH was 92.38% and 93.37%, respectively. Brain images generated using various modalities such as Magnetic Resonance Imaging (MRI), CT, and PET retain spatial and, depending on the conditions, temporal information in three or more dimensions. Therefore, various designs can be adopted depending on the features and pathological characteristics of the target lesion [45]. When evaluating PET images for the presence of FBB amyloid, the nuclear medicine physician makes a reading decision based on the contrast of gray matter observed through the axial plane of the FBB Aβ PET. Therefore, in previous studies [43,44,46], the BAPL score of a given FBB PET was estimated based on the Aβ distributions found at each axial level, also known as regional cortical tracer uptake (RCTU), in accordance with the current process used by physicians, instead of extracting features from the 3D information.
In this experiment, we used the main method described in previous studies as the target classifier to observe the effectiveness of GAN-based DA together with visual and quantitative similarity. The target classifier therefore consists of a feature extractor that reduces features from the 2D axial plane and a classifier that predicts the Aβ distribution from the extracted features. We used DenseNet [47], a well-known convolutional neural network (CNN) structure, as the feature extractor, and the support vector machine (SVM) [48] and a neural network (NN) as the classifiers. Figure 2 shows the simplified structure of the target model used in our experiment. Transfer learning is a technique that applies a model trained on data from a specific field to similar or completely different fields, and it has produced interesting results in medical image classification using DL-based classifiers [13,14]. It reuses the weights of a pre-trained CNN model, most often trained with the ImageNet dataset [49]. In particular, when the input medical image is originally grayscale (e.g., ultrasonography, MRI, and PET), previous studies that converted grayscale to RGB reported feasible performance even with unknown artifacts and increased complexity [50,51]. Although the input of the target model was FBB PET images, which are originally grayscale, the input data were converted to color channels using the OpenCV-python library to match the channel size of the pre-trained DenseNet model, for continuity with and reproducibility of previous studies [43,44,46]. Target model training and model selection were performed using 4-fold nested cross-validation, with Bayesian optimization for the SVM.

The search space for the SVM hyper-parameters was set to kernel functions in (Linear, RBF, Poly), C in [1, 100], and gamma in [0.0001, 0.1]. The NN was fixed to 3 hidden layers with 128, 128, and 64 nodes, respectively; the Adam optimizer with β1 = 0.9 and β2 = 0.999, without decay; 300 epochs; and a learning rate of 0.00005.
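The grayscale-to-color conversion mentioned above amounts to replicating the single intensity channel three times. A minimal NumPy sketch is given below; the function name `gray_to_rgb` and the synthetic slice are illustrative assumptions. In OpenCV-python, `cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)` performs the same channel replication for supported data types, but whether the authors used exactly this call is not stated in the text.

```python
import numpy as np

def gray_to_rgb(slice_2d):
    """Replicate a single-channel slice into three identical channels."""
    return np.repeat(slice_2d[:, :, np.newaxis], 3, axis=2)

# Synthetic 95 x 79 axial slice standing in for a pre-processed FBB image.
axial_slice = np.linspace(0.0, 1.0, 95 * 79, dtype=np.float32).reshape(95, 79)
rgb = gray_to_rgb(axial_slice)
print(rgb.shape)  # (95, 79, 3)
```

The conversion adds no information; it only makes the input compatible with a network whose first layer expects three channels.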
Appl. Sci. 2020, 3, x FOR PEER REVIEW

Generative Adversarial Network for Data Augmentation
The target classifier we chose performs inference on 2D axial images, while the prediction of the Aβ distribution for a subject should reflect the 36 axial planes placed along the transverse axis. Therefore, to assist the target classifier, the GAN model must learn all of the Aβ deposition patterns for the 36 axial plane levels, with the anatomical information matched to each level. Thus, we followed the structure of the discriminator and generator of the Deep Convolutional Generative Adversarial Network (DCGAN) [52] to learn and infer the Aβ deposition patterns at the axial plane level, and used additional condition labels [53], from 0 to 35, to indicate the axial level to which each input image belongs. Repeatedly stacked blocks were used to construct the network structure of both the generator and the discriminator, and the input of each network was multiplied by the encoded vector for its axial level label, stored in an embedding matrix, before entering the stacked network. The generator produced an image from a noise vector of size 300 sampled from a normal distribution, and the discriminator, acting as a critic, estimated a score for the similarity of the two distributions of real and generated images. The generator consisted of a first hidden layer with 8192 nodes connected to the noise vector, followed by 5 blocks, each comprising up-sampling, convolution, batch normalization, and an activation function. The activation function of the last block was tanh, whereas the other blocks used ReLU. The discriminator had 4 blocks, each comprising convolution, batch normalization, leaky ReLU (α = 0.2), and a dropout layer (p = 0.25), followed by global average pooling and a dense layer. The structure of the GAN model used in this experiment is shown in Figure 3.

The GAN learns a function that maps the input distribution directly to the target distribution without any estimation of the probability density function for the target domain.
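The label conditioning described for the generator input can be sketched in NumPy as a lookup in an embedding matrix followed by a multiplication with the noise vector. The noise size (300) and the number of axial levels (36) come from the text; the element-wise form of the multiplication, the embedding dimension being equal to the noise dimension, and the random embedding values are assumptions made for illustration only.

```python
import numpy as np

NOISE_DIM = 300   # noise vector size stated in the text
NUM_LEVELS = 36   # axial level labels 0..35

rng = np.random.default_rng(0)

# Hypothetical embedding matrix: one row of size NOISE_DIM per axial level.
# In a trained model these rows would be learned parameters.
embedding = rng.normal(size=(NUM_LEVELS, NOISE_DIM))

def condition_noise(z, axial_labels, embedding_matrix):
    """Multiply each noise vector element-wise by the embedding of its label."""
    return z * embedding_matrix[axial_labels]

z = rng.normal(size=(4, NOISE_DIM))    # batch of 4 noise vectors
labels = np.array([0, 17, 17, 35])     # requested axial levels
conditioned = condition_noise(z, labels, embedding)
print(conditioned.shape)  # (4, 300)
```

The conditioned vectors would then be passed to the first dense layer of the generator, so the same noise sample yields different inputs for different axial levels.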
A GAN has a mechanism in which two models, the Generator G and the Discriminator D, learn from each other competitively [29] via Equation (1):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim \mathbb{P}_r}[\log D(x)] + \mathbb{E}_{z \sim \mathbb{P}_z}[\log(1 - D(G(z)))] \tag{1}$$

Discriminator D predicts the probability that the received data belong to the real distribution $\mathbb{P}_r$. To maximize V(D, G), D should ideally predict 0 for the generated data G(z) from Generator G, which in turn learns its parameters such that D(G(z)) = 1 in order to minimize V(D, G).

In our experiments implementing the Wasserstein GAN (WGAN) and its loss, the input image X of the discriminator $f_w$ parameterized by $w$, called the critic in the original paper [54], and the input vector Z of the Generator $G_\theta$ parameterized by $\theta$, are in the real spaces $\mathbb{R}^{D \times D}$ and $\mathbb{R}^{d}$, respectively. Since the image synthesized by the Generator $G_\theta$ follows the distribution $\mathbb{P}_g$, and $\mathbb{P}_g$ is the same as $G_\theta(\mathbb{P}_z)$, a generated sample can be written as $x_g = G_\theta(z)$ with $z \sim \mathbb{P}_z$. Let the given PET image samples be $S_r = \{x_{r_1}, \ldots, x_{r_m}, \ldots, x_{r_n}\}$, which are i.i.d. The training set for the GAN and the target model consists of the axial planes $x_{r_p} = \{x^{\text{axial}}_{r_1}, \ldots, x^{\text{axial}}_{r_p}\}$ extracted from $S_r$, which lie on $\mathbb{P}_r$. Each $x^{\text{axial}}_{r_p}$ has two labels: an Aβ class $y_{A\beta} \in \{0, 1\}$ and an axial level class $y_{\text{Slice}} \in \{0, \ldots, 35\}$. We aimed to obtain a generative model $G_\theta$ that produces a $\mathbb{P}_g$ that is sufficiently close to $\mathbb{P}_r$.
In the previously reported WGAN [54], weight clipping was used as a simple way to make the Discriminator $f_w$ satisfy the 1-Lipschitz constraint, under which the gradient between any two points is at most 1. The gradient penalty (GP) loss [55] was proposed to reduce the time needed to reach optimality when the weights are too large or too small. We encouraged the model to satisfy the constraint by adding a regularization term (Equation (10)) to the Wasserstein loss (Equation (9)), so that the gradient norm is 1 at points obtained as weighted averages of samples from $\mathbb{P}_r$ and $\mathbb{P}_g$ via Equation (6). Model optimization was performed using the joint loss L (Equation (7)) for both $f_w$ and $G_\theta$, and the parameters $w$ and $\theta$ were each optimized by RMSProp [56] via Equations (10) and (11).

To augment the Aβ-negative and Aβ-positive images of each class, two independent GAN models were constructed, and the generated images were used as training data for the target model to improve its generalization.
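For reference, the gradient-penalty objective discussed in this section is commonly written as below, following the general formulation of WGAN-GP [55]. This is the standard textbook form and is not recovered from the original equations of this paper: the symbols $\tilde{x}$, $\hat{x}$, $\epsilon$, and the penalty coefficient $\lambda$ are assumptions, and no correspondence to the equation numbers cited in the text is implied.

```latex
\begin{aligned}
\hat{x} &= \epsilon\, x + (1 - \epsilon)\, \tilde{x},
  \qquad x \sim \mathbb{P}_r,\ \tilde{x} \sim \mathbb{P}_g,\ \epsilon \sim U[0, 1] \\
L_{W} &= \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}\big[f_w(\tilde{x})\big]
       - \mathbb{E}_{x \sim \mathbb{P}_r}\big[f_w(x)\big] \\
L_{GP} &= \lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} f_w(\hat{x}) \rVert_2 - 1\big)^2\Big] \\
L &= L_{W} + L_{GP}
\end{aligned}
```

In this form the critic $f_w$ minimizes $L$, while the generator $G_\theta$ minimizes $-\mathbb{E}_{z \sim \mathbb{P}_z}[f_w(G_\theta(z))]$; the penalty term pulls the gradient norm toward 1 at the interpolated points $\hat{x}$.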

Performance Metrics
Visual and quantitative metrics were used to evaluate the degree of similarity between the images generated by the generator and the test set. The generated images were visually evaluated by comparing them with real images in the test set using the Visual Turing test [32] and by observing the distribution of image features using t-SNE [57]. In the quantitative evaluation, features were extracted from the images, and the similarity between the extracted feature distributions was measured using 3 model-agnostic metrics. All feature extraction was performed using the pre-trained DenseNet121 model. Because each image contained many slices, representative images were selected at equal intervals from all 36 images, starting at the low levels (beginning in the region where the cerebellum was observed). We selected 6 representative axial planes to cover the four brain regions required by the BAPL scoring system.

Visual Turing Test and Feature Visualization
In the Visual Turing test, 2 parameters, a specific Aβ group (negative or positive) and the axial level to be evaluated, were determined, and then 40 samples were randomly extracted from each of the real and generated image sets. Using a GUI program written with the Python tkinter library, the randomly extracted real and generated images were presented at the same time to the evaluator, who was asked to select the images that they thought were real. The GUI program was developed and tested on Windows 10 and installed in the Department of Nuclear Medicine, DAUH, to allow physicians and researchers to participate in the Visual Turing test. The result of the test was the accuracy, estimated from the number of real images that the evaluator identified correctly.
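The scoring of one Visual Turing test session can be sketched as follows. The sketch assumes that the accuracy is the fraction of the 40 real images that the evaluator marked as real, which is one reading of the description above; the function name and the simulated evaluator are hypothetical and no actual test results are reproduced.

```python
import random

def turing_test_accuracy(is_real, selected_as_real):
    """Fraction of real images that the evaluator marked as real."""
    n_real = sum(is_real)
    hits = sum(1 for real, picked in zip(is_real, selected_as_real)
               if real and picked)
    return hits / n_real

# 40 real and 40 generated images, shuffled before presentation.
is_real = [True] * 40 + [False] * 40
random.Random(0).shuffle(is_real)

# Hypothetical evaluator who recognizes the first 24 real images shown
# and selects nothing else.
real_seen = 0
selected = []
for real in is_real:
    if real:
        real_seen += 1
        selected.append(real_seen <= 24)
    else:
        selected.append(False)

print(turing_test_accuracy(is_real, selected))  # 0.6
```

Under this definition, a value near 50% suggests that the evaluator could not reliably tell real images from generated ones.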
Feature visualization begins by extracting features from the real and generated images for each class label. The features extracted by the DenseNet121 model were 1024-dimensional and were reduced to 2 dimensions using t-SNE (perplexity = 40.0). Feature extraction was followed by centering the mean to zero and scaling to unit variance. This process was performed for both Aβ groups and at each axial level for the training, test, and generated sets to observe their distributions.
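The centering and scaling step mentioned above can be sketched in NumPy as a per-feature standardization. The feature matrix here is synthetic, with 1024 columns to match the stated feature size; the function name, the small constant `eps`, and the choice to compute the statistics over all rows passed in are assumptions. The subsequent 2-D reduction would be carried out by a t-SNE implementation, which is not shown.

```python
import numpy as np

def standardize(features, eps=1e-12):
    """Center each feature to zero mean and scale it to unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)   # eps guards constant features

# Synthetic stand-in for 80 feature vectors of size 1024.
rng = np.random.default_rng(0)
features = rng.normal(5.0, 3.0, size=(80, 1024))
scaled = standardize(features)
print(scaled.shape)  # (80, 1024)
```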

Quantitative Measure
We used the model-agnostic metrics reviewed in Xu et al. [30] to quantitatively measure the similarity of the distributions of real and generated images in our experiments. The 3 metrics $\rho$ used in this experiment measure the similarity between $\mathbb{P}_r$ and $\mathbb{P}_g$. It has been reported that the feature space is more advantageous than the pixel space for measuring the similarity of distributions, and that the selection of the features to be extracted is also crucial [30]. Thus, for an arbitrary feature extractor $\Phi(\cdot)$, the metric can be described as $\rho(\Phi(\mathbb{P}_r), \Phi(\mathbb{P}_g))$.
MMD measures how different $\mathbb{P}_r$ and $\mathbb{P}_g$ are for a given kernel function $k$. The higher the measured value, the more the two inputs are interpreted as being different [58]. We used Gaussian functions as kernel functions:

$$\mathrm{MMD}^2(\mathbb{P}_r, \mathbb{P}_g) = \mathbb{E}_{x_r, x_r' \sim \mathbb{P}_r,\; x_g, x_g' \sim \mathbb{P}_g}\big[k(x_r, x_r') - 2k(x_r, x_g) + k(x_g, x_g')\big]$$

FID is the Fréchet distance (d) between the Gaussian distribution with mean and covariance $(m_r, C_r)$ obtained from $\mathbb{P}_r$ and the Gaussian distribution with mean and covariance $(m_g, C_g)$ obtained from $\mathbb{P}_g$ [59,60]. FID uses features extracted from a trained network structure, such as an inception network, to measure the similarity between the distributions. The FID is defined as

$$d^2\big((m_r, C_r), (m_g, C_g)\big) = \lVert m_r - m_g \rVert_2^2 + \mathrm{Tr}\big(C_r + C_g - 2(C_r C_g)^{1/2}\big)$$

The 1-NN classifier, proposed by [61] as a binary classifier for two-sample tests, sets the label of the real images to 0 and the label of the generated images to 1, and measures the similarity of the generated images by estimating the LOO accuracy. The closer the LOO accuracy is to 50%, the closer $\mathbb{P}_r$ is to $\mathbb{P}_g$. As shown in [30], the LOO accuracy of 1-NN can be used to detect a tendency toward mode collapse, which is difficult to detect with the human eye without special training and careful model selection. It can also robustly measure the similarity between distributions under small transformations in the feature space.
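The three metrics described above can be sketched in NumPy as follows. This is an illustrative implementation on synthetic feature vectors, not the authors' code: the biased MMD estimator, the fixed kernel bandwidth `sigma`, the eigenvalue-based computation of the trace term in FID, and all function names are assumptions.

```python
import numpy as np

def _sq_dists(a, b):
    """Pairwise squared Euclidean distances between rows of a and b."""
    d = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(d, 0.0)

def _gaussian_kernel(a, b, sigma):
    return np.exp(-_sq_dists(a, b) / (2.0 * sigma ** 2))

def mmd2(real, fake, sigma=1.0):
    """Biased estimate of the squared MMD with a Gaussian kernel."""
    return (_gaussian_kernel(real, real, sigma).mean()
            - 2.0 * _gaussian_kernel(real, fake, sigma).mean()
            + _gaussian_kernel(fake, fake, sigma).mean())

def fid(real, fake):
    """Frechet distance between Gaussians fitted to the two feature sets."""
    m_r, m_g = real.mean(0), fake.mean(0)
    c_r, c_g = np.cov(real, rowvar=False), np.cov(fake, rowvar=False)
    # Tr((C_r C_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_r C_g, which are real and non-negative for
    # positive semi-definite covariances.
    eig = np.linalg.eigvals(c_r @ c_g).real
    tr_sqrt = np.sqrt(np.clip(eig, 0.0, None)).sum()
    return ((m_r - m_g) ** 2).sum() + np.trace(c_r) + np.trace(c_g) - 2.0 * tr_sqrt

def one_nn_loo_accuracy(real, fake):
    """Leave-one-out accuracy of a 1-NN classifier (real = 0, generated = 1)."""
    x = np.vstack([real, fake])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(fake))])
    d = _sq_dists(x, x)
    np.fill_diagonal(d, np.inf)   # a sample may not be its own neighbour
    return (y[d.argmin(1)] == y).mean()

# Synthetic features: "same" follows the real distribution, "shifted" does not.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 8))
same = rng.normal(0.0, 1.0, size=(200, 8))
shifted = rng.normal(3.0, 1.0, size=(200, 8))

for name, fake in [("same", same), ("shifted", shifted)]:
    print(name, mmd2(real, fake), fid(real, fake), one_nn_loo_accuracy(real, fake))
```

On such data, MMD and FID should be smaller for the matching set than for the shifted set, and the 1-NN LOO accuracy should be near 0.5 for the matching set and near 1.0 for the shifted set. In practice the inputs would be the DenseNet121 features of real and generated images rather than random vectors.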

Statistical Analysis
The data collected in this experiment were statistically analyzed using MedCalc software version 18.9.1. First, for the experimental data collected retrospectively, we examined whether any bias was involved in the formation of the Aβ groups, other than the diagnosis or the results of the cognitive function test, which can be expected to depend on the Aβ deposited in the cerebrum. After applying the GAN-based DA, we statistically evaluated differences in the generalization performance of the ML-based model for each axial plane.
Discrete variables such as age, education, and K-MMSE, which were used in the demographic data, were first analyzed using the Kolmogorov-Smirnov normality test, and then the Mann-Whitney U test or t-test was applied to determine whether the distributions differed between the Aβ groups. For continuous variables such as generalization performance, the difference in the distribution of the accuracy measured per axial level before and after GAN-based DA was analyzed using the same statistical tests that were used for the discrete variables. Categorical variables, such as diagnostic results, were examined using the Chi-squared test. The statistical significance level α was 0.01, and two-sided tests were performed.

Results
First, we statistically confirmed that, in the experimental dataset used in this study, there was no bias in any variable other than each patient's diagnosis and its dependent variable (K-MMSE). Then, we used quantitative measurements to examine the generalization performance of ML models. Figure 4 shows the real pre-processed images and the GAN-based generated images that were randomly extracted without cherry picking.


Demographic Data
Table 1 summarizes the demographic data together with the p-values for differences between the two groups in age, sex, education, K-MMSE, and diagnosis. The Mann-Whitney test was performed on the age, education, and K-MMSE data because the Kolmogorov-Smirnov test did not show normality ($p_{\text{K-MMSE}}$ < 0.0001, $p_{\text{age}}$ = 0.0916, $p_{\text{education}}$ = 0.0802). The chi-squared test showed no significant difference in the sex ratio between the two groups ($p_{\text{sex}}$ = 0.105), and only the distribution of diagnosis was significantly different between the two groups ($p_{\text{diagnosis}}$ < 0.0001). Therefore, it can be assumed that the FBB image data sets used in the experiments were collected without any bias in age, gender, or years of education, apart from the actual disease diagnosis and cognitive function.

Visual Turing Test
We performed the Visual Turing test [32] on real and generated images to evaluate how similar the FBB images generated by the GAN were to real images; the results are shown in Figure 5. The proportions of matched answers did not exceed 50% at the FL, PP2, and PL2 levels in the Aβ-negative group or at the FL and PP2 levels in the Aβ-positive group. The axial level that visually demonstrated the greatest similarity in the Aβ-negative group was the TL level (60%), and it was the PP1 level (67%) in the Aβ-positive group. Although there were differences in relative similarity among the representative axial planes, all of them were shown to be similar to the real images.

Feature Visualization
To observe the overall distribution of real and generated images in each of the Aβ-negative and Aβ-positive groups, we acquired the image features extracted using the DenseNet121 model from the input image observed at each axial plane level. The t-SNE technique was used to observe a two-dimensionally reduced distribution. Figure 6 shows scatter plots that visualize the features reduced by t-SNE. The distributions of training and test images in the Aβ-negative group almost overlapped in all representative axial planes. However, although the FBB images used in the experiments were representative typical Aβ-negative and Aβ-positive cases, the Aβ-positive images in the training or test datasets rarely appeared in the distribution of the negative group (FL, PP1, PP2, and PL2). In addition, the real images from the Aβ-negative group were not included in the Aβ-positive distribution. The distribution of GAN-generated images used to augment the training dataset primarily overlapped with the distribution of real images at the PP1 and PP2 levels for both the Aβ-negative and Aβ-positive datasets, and no sample invading the distribution of the other class was seen.

Quantitative Measurements
Figure 7 shows the similarity between the real and generated sets, $\rho(\Phi(\mathbb{P}_{r\_test}), \Phi(\mathbb{P}_g))$, and between the training and test sets, $\rho(\Phi(\mathbb{P}_{r\_train}), \Phi(\mathbb{P}_{r\_test}))$, measured over all axial levels. Instead of rescaling to [0, 1], the values on the graphs in Figure 7 are those directly calculated from each metric. $\rho(\Phi(\mathbb{P}_{r\_test}), \Phi(\mathbb{P}_g))$ and $\rho(\Phi(\mathbb{P}_{r\_train}), \Phi(\mathbb{P}_{r\_test}))$ were compared for both the Aβ and axial level classes. Contrary to the results of the visual evaluation, the similarity of the GAN-based synthetic images differed across axial level classes regardless of the metric used, and in general the similarity was lower at the lowest and highest axial levels. Ideally, the similarity between real distributions should not vary with the Aβ or axial level class, but the amount of variation differed according to the metric used. In MMD and FID, the variation of $\rho(\Phi(\mathbb{P}_{r\_test}), \Phi(\mathbb{P}_g))$ within its range was greater than that of $\rho(\Phi(\mathbb{P}_{r\_train}), \Phi(\mathbb{P}_{r\_test}))$. Meanwhile, for the 1-NN LOO accuracy, the difference between the ranges of $\rho(\Phi(\mathbb{P}_{r\_train}), \Phi(\mathbb{P}_{r\_test}))$ and $\rho(\Phi(\mathbb{P}_{r\_test}), \Phi(\mathbb{P}_g))$ appeared relatively small. For MMD and 1-NN LOO accuracy, $\rho(\Phi(\mathbb{P}_{r\_train}), \Phi(\mathbb{P}_{r\_test}))$ and $\rho(\Phi(\mathbb{P}_{r\_test}), \Phi(\mathbb{P}_g))$ appeared similar across the axial levels; for FID, however, similar values were obtained only at or near the 22nd axial level.

Figure 7. Quantitative measurements to estimate synthetic similarity between real and generated images with respect to each axial level of pre-processed FBB imaging.

Table 2 compares the similarity values of $\rho(\Phi(\mathbb{P}_{r\_train}), \Phi(\mathbb{P}_{r\_test}))$ and $\rho(\Phi(\mathbb{P}_{r\_test}), \Phi(\mathbb{P}_g))$ measured in the representative axial planes using MMD, FID, and 1-NN LOO accuracy. When evaluating Aβ-negative images, the representative axial levels that appeared most similar to real images were PL1 (MMD: 0.2284), PP2 (FID: 6.8253), and PP1 (1-NN LOO accuracy: 0.8562); that is, each metric identified a different level as the most similar. The generated Aβ-positive images, meanwhile, were most similar to real images at PP2 regardless of the quantitative metric (MMD: 0.1865, FID: 5.8919, 1-NN LOO accuracy: 0.7391). For all 3 metrics, the GAN model used in the experiment produced more realistic synthetic images for the Aβ-positive class than for the Aβ-negative class.

Generalization Test
To statistically evaluate the differences in generalization performance at each axial level before and after GAN-based DA, we built one model (non-augmented) trained using only the training set and another model (augmented) trained using both the training and generated sets. Each model was evaluated independently for each axial level class using the same test set. The Mann-Whitney U test was used because the given distribution did not exhibit normality. Figure 8 compares the generalization performance of the ML-based target model with and without the generated data in the training set. Regardless of augmentation, the classification performance of the target model tended to decrease at both ends of the axial range. For both the SVM and NN models, GAN-based DA performed independently at each axial level resulted in a statistically significant improvement in generalization performance, and the evidence was stronger for the NN-based model than for the SVM-based model (median-SVM: 0.943 to 0.956, p < 0.0454; median-NN: 0.946 to 0.963, p < 0.0047).
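The comparison described above can be sketched as a two-sided Mann-Whitney U test on the per-axial-level scores of the two models. The sketch below is a plain-Python version that uses the normal approximation with tie and continuity corrections; it is an illustration under these assumptions rather than the study's code, and in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used. The function name and the score lists in the usage example are hypothetical placeholders, not results from the paper.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U statistic of sample x, approximate two-sided p-value).
    Tied values receive average ranks and the variance is tie-corrected.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0      # mean of ranks i+1 .. j
        t = j - i                          # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u_x, 1.0                    # all values identical
    # Continuity-corrected z score, clipped at zero.
    z = max((abs(u_x - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p_value = min(math.erfc(z / math.sqrt(2.0)), 1.0)
    return u_x, p_value


# Hypothetical per-axial-level scores (placeholders, not data from the study).
non_augmented = [0.91, 0.93, 0.94, 0.95]
augmented = [0.96, 0.97, 0.98, 0.99]
u_stat, p = mann_whitney_u(non_augmented, augmented)
```

The normal approximation is only a rough guide for very small samples, where an exact test is preferable.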

Medical Image Synthesis with Quantitative Measurements
A previous study [62] on the synthesis of structural brain MRI proposed a GAN structure that appropriately augments the input image domain, and several related studies have compared the generalization performance of various classifiers whose training was enhanced with generated images [27,28,32,63-65]. These previous studies on GAN-based DA in the medical imaging field have emphasized the design of the applied GAN and the improved generalization of the target model trained on the augmented dataset. However, these studies included only qualitative visual evaluation. The reasons for quantitatively evaluating the generated images before applying GAN-based DA include:
1. Unlike conventional DA, the practitioner cannot predict what the samples generated by a GAN will look like until they are inspected.
2. It is not easy to visually evaluate how similar the real distribution is to the generated distribution.
3. Models trained on augmented data that has not been validated may learn from samples that exhibit characteristics of the disease but fall outside the given class and therefore carry an arbitrary label.
In particular, because medical images can be interpreted differently depending on the diverse distribution of diseases, quantitative evaluation of generated medical images is important. In our study, we did not compare various classifiers; instead, we focused on the need for visual and quantitative evaluation of the data generated by the GAN.
Evaluating synthetic data for DA means examining how close the generated distribution is to the real distribution, rather than how close individual generated samples are to real ones. The classic approach is to estimate the real distribution using Parzen window estimation and to measure the average log-likelihood of the generated samples [29]. This method has the advantage of being intuitive, but a later study showed that log-likelihood estimates in high-dimensional spaces are unreliable and, above all, that they do not correlate meaningfully with the realism of the given samples [34]. Another widely known method is the inception score (IS) [31], which measures quality and diversity using the average KL divergence between P(y|x) and P(y), where the class label distribution is estimated from the input images by an arbitrary fine-tuned model (e.g., an inception network). This method is also intuitive and is known to correlate with human judgment, but it cannot detect overfitting when the model simply memorizes the samples, cannot detect mode collapse when part of the distribution is not learned, and does not account for a model trapped in a bad mode [66]. To overcome these shortcomings, some variants of the inception score with KL divergence have been reported, including modified IS [67], mode score [68], and AM score [66]. Several approaches for defining new distances in feature space have also been reported [30], including MMD [58], FID [60], and Wasserstein distance [54].
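As a small illustration of the inception score described above, the sketch below computes exp(E_x[KL(P(y|x) || P(y))]) from a list of per-image class probability vectors. The function name, the `eps` smoothing constant, and the input format are assumptions made for this example; in practice the probability vectors would come from a pre-trained classifier, which is not included here.

```python
import math


def inception_score(class_probs, eps=1e-12):
    """Inception score from per-image class probabilities P(y|x).

    `class_probs` is a list of probability vectors, one per generated
    image. `eps` is a small constant (an assumption of this sketch)
    that avoids taking the logarithm of zero.
    """
    n = len(class_probs)
    k = len(class_probs[0])
    # Marginal label distribution P(y), averaged over all images.
    marginal = [sum(p[c] for p in class_probs) / n for c in range(k)]

    mean_kl = 0.0
    for p in class_probs:
        # KL divergence between P(y|x) for this image and P(y).
        mean_kl += sum(
            pc * (math.log(pc + eps) - math.log(marginal[c] + eps))
            for c, pc in enumerate(p)
        )
    mean_kl /= n
    return math.exp(mean_kl)
```

The score is 1 when every image yields the same label distribution and approaches the number of classes when predictions are confident and evenly spread over the classes. Because it depends only on the classifier outputs for generated images, it never compares them with real images, which is consistent with the limitations noted above.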
In general, the expected effects of GAN-based DA techniques include (1) generating samples that follow the same distribution as the real images to compensate for insufficient data, or (2) generating samples that differ from the real images but remain realistic, so that the model is trained on novel patterns.
In our experiment, we demonstrated the quantitative similarity of the generated images, so effect (1) can be expected from the GAN, which was trained with a loss that minimizes the Wasserstein distance between the real and generated distributions. Evaluating medical image synthesis or DA using quantitative measurements may be useful for providing a baseline for future studies, or for determining the direction of future experiments in practical studies.

Comparison between t-SNE and Quantitative Measurements
The feature distribution visualized with t-SNE in Figure 6 shows that the Aβ-positive samples of the training set infiltrated the negative distribution at specific axial levels (PP1, PP2, and PL2). However, in the quantitative evaluation of all the axial levels of the real images, there was some variance within each metric and axial level but a consistent overall similarity (Figure 7). The quantitative metrics used in this experiment represent the similarity between the given datasets as a scalar value ρ(Φ(P_r), Φ(P_g)), so it can be difficult to explain the similarity and distribution of a few outliers or individual samples, whereas t-SNE has the advantage of providing intuitive information about the distribution of individual samples. On inspection, however, the Aβ-positive training set samples found in the negative distribution were typical Aβ-positive images, contrary to what the visualization suggested. The similarity between the real images ρ(Φ(P_r_train), Φ(P_r_test)), represented by the quantitative evaluation (MMD, FID, and 1-NN LOO accuracy) used in this experiment, seems to reflect physicians' visual assessment better than t-SNE does, in that it is measured in the same feature space using DenseNet121.
In the Visual Turing test (Figure 5), the generated set P_g was quite similar to the real test set P_r_test over all axial levels. In contrast, t-SNE showed dissimilar distributions at the lower and higher axial levels, and the results of the quantitative evaluation agreed with the trend shown by t-SNE (Figure 7, Table 2). This suggests that t-SNE and quantitative measures can be used to identify a tendency toward mode collapse in generated medical images that is difficult to find or define by visual assessment. Therefore, when comparing real and generated sets of medical images, analysis using t-SNE on features from DenseNet121 pre-trained on ImageNet still appears to be useful alongside the quantitative evaluation methods.

Comparison between Model-Agnostic Metrics
As shown in Figure 7, MMD and FID were able to distinguish the similarity between the real and generated sets ρ(Φ(P_r_test), Φ(P_g)) at the middle axial levels from that at both end levels. In contrast, the 1-NN LOO accuracy exhibited a smaller variance in ρ(Φ(P_r_test), Φ(P_g)) across axial levels than MMD and FID did, and for Aβ-negative images the variance of ρ(Φ(P_r_test), Φ(P_g)) was even greater than that of ρ(Φ(P_r_train), Φ(P_r_test)) (Table 2). On identical datasets, the similarity estimated by 1-NN LOO accuracy shows a larger variance than that estimated by MMD or FID, because it is an estimate of classification performance and is therefore sensitive to the number of samples [69].
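The 1-NN LOO accuracy discussed above can be sketched as a leave-one-out two-sample test: real and generated feature vectors are pooled, each sample is classified by the label of its nearest neighbor among all other samples, and the fraction of correct decisions is reported. The function name and the use of squared Euclidean distance are assumptions of this sketch, not details confirmed by the paper.

```python
def one_nn_loo_accuracy(real, generated):
    """Leave-one-out accuracy of a 1-nearest-neighbor real-vs-generated classifier.

    `real` and `generated` are lists of feature vectors of equal length.
    Squared Euclidean distance is assumed as the distance measure.
    """
    # Label real samples 1 and generated samples 0, then pool them.
    samples = [(v, 1) for v in real] + [(v, 0) for v in generated]

    correct = 0
    for i, (vec_i, label_i) in enumerate(samples):
        best_dist, best_label = None, None
        for j, (vec_j, label_j) in enumerate(samples):
            if i == j:
                continue  # leave the query sample out
            dist = sum((a - b) ** 2 for a, b in zip(vec_i, vec_j))
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, label_j
        if best_label == label_i:
            correct += 1
    return correct / len(samples)
```

An accuracy near 0.5 means the classifier cannot separate the two sets, which is the desired outcome for a generated set that matches the real one; an accuracy near 1.0 means the sets are easily separated. Because the result is a proportion over a finite number of samples, it changes in steps of 1/N, which is one way to see why the estimate is sensitive to the number of samples.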
MMD and 1-NN LOO accuracy showed clear differences between ρ(Φ(P_r_train), Φ(P_r_test)) and ρ(Φ(P_r_test), Φ(P_g)), whereas with FID there were some axial levels at which no difference between the two was observed (Figure 7). This suggests that the feature distribution of these medical image samples may not be well suited to FID, which models the features with a Gaussian distribution, and that proper care should be taken when measuring similarity for medical image synthesis.
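For reference, FID is commonly written as the Fréchet distance between two Gaussians fitted to the real and generated feature sets. The symbols below (mean and covariance of the real features, \mu_r and \Sigma_r, and of the generated features, \mu_g and \Sigma_g) are notation chosen for this illustration rather than notation taken from the paper:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Only the first two moments of each feature set enter this expression, so two feature distributions that are far from Gaussian can differ in shape and still produce similar FID values. This may be one reason why FID behaved differently from MMD and 1-NN LOO accuracy at some axial levels.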

Role of Quantitative Measurements in Future Generative Data Augmentation Work
After applying GAN-based DA to the Aβ predictive model and observing statistical evidence that the GAN used in our experiment can usually improve the generalization performance at a given axial level, we found some challenges that may represent directions for future work. The generalization performance after DA shown in Figure 8 was obtained by applying DA regardless of the similarity of the generated set. Consequently, both increases and decreases in performance were observed when the generalization performance was measured along the axial levels. In a previous study, the performance of the target model decreased when the training data was augmented using a GAN [24]. This may be caused by mode collapse, which can confuse the target model. In our experiment, however, we cannot exclude the possibility that the variance in estimating the performance of the target classifiers was large because of the small dataset. Accordingly, we statistically verified the difference in generalization performance. As a result, performance appears to be improved over the slices as a whole after applying GAN-based DA, although some variance due to the small dataset may remain.
In terms of stable DA, we need proper means to prevent or predict situations in which performance is reduced by the applied DA. This is required when the generated set is produced not by simple user-defined operations, as in conventional DA, but by a complex function that is difficult to predict. In other words, excluding generated data that is not suitable for DA may be advantageous for stable performance improvement. Quantitative evaluation of DA therefore seems to play an important role in detecting factors that degrade generalization performance and in assessing the suitability of the training dataset for augmentation.

Conclusions
In this study, we synthesized 18F-Florbetaben Aβ PET images using a GAN and evaluated the real and generated images both visually and quantitatively. The similarity of the generated images that could statistically augment Aβ images was quantitatively measured for Aβ-negative images (MMD: 0.2284, FID: 6.8253, 1-NN LOO accuracy: 0.8562) and Aβ-positive images (MMD: 0.01865, FID: 5.8919, 1-NN LOO accuracy: 0.6233). We improved SVM- and NN-based classifiers using Aβ images generated by the GAN (median-SVM: 0.943 to 0.956; median-NN: 0.946 to 0.963). The experimental results demonstrated that quantitative measurements were able to detect the similarity between the two distributions and to observe mode collapse better than the Visual Turing test and t-SNE.
Supplementary Materials: The source code, model structure, weights, and figures used in this study and paper are available online at https://github.com/kang2000h/GAN_evaluation.