Anomaly Analysis of Alzheimer’s Disease in PET Images Using an Unsupervised Adversarial Deep Learning Model

Abstract: In this study, the anomaly analysis of Alzheimer's disease in positron emission tomography (PET) images using a proposed unsupervised adversarial model is investigated. The model consists of three parts: a parallel-network encoder, comprised of a convolutional pipeline and a dilated convolutional pipeline that extract local and global features and concatenate them; a decoder that reconstructs the input image from the obtained feature vector; and a discriminator that distinguishes whether the input image is real or fake. The hypothesis is that if the proposed model is trained with only normal brain images, the corresponding reconstruction loss for normal images should be minimal. However, if the input image belongs to a class designated as an anomaly, which the model is not trained with, then the reconstruction loss will be high. This difference is reflected in the anomaly score comparison between the normal and the anomalous image. A multi-case analysis is performed for the three major classes in the Alzheimer's Disease Neuroimaging Initiative dataset: Alzheimer's disease, mild cognitive impairment, and normal control. The base parallel-encoder network shows better classification accuracy than the benchmark models, and the proposed model built on the parallel model outperforms the benchmark anomaly detection models, achieving 96.03% classification accuracy and a 75.21% area under the curve score. Additionally, a qualitative evaluation using the Fréchet inception distance gave a score three points better than the state-of-the-art.


Introduction
Alzheimer's disease (AD) is by far the most common type of dementia, generally seen in elderly people. In the majority of cases, AD symptoms begin to appear during the mid-60s, although early-onset cases may occur as early as the 30s. It is estimated that around 106 million people worldwide will be diagnosed by 2050 due to the growth of the aging population [1].
During the progression of AD, the brain structure changes due to the deposition of amyloid-β (Aβ) plaques and hyperphosphorylated tau. The initial damage starts at the hippocampus [2], which handles episodic and spatial memory as well as working as a relay between the brain and the rest of the body, so the damage disrupts these pathways. The symptoms of AD are the result of these Aβ plaques and intracellular neurofibrillary tangles [3,4]. This decreases the brain's metabolism of both glucose and oxygen, leading to progressive memory loss and, in the late stages, an inability to move [5]. These changes start to form years before the initial clinical symptoms of AD are seen. Mild cognitive impairment (MCI) causes cognitive changes that are noticeable by the person's family. In anomaly detection, the majority of the normal case data is used for training, and abnormal case data together with the remainder of the normal case data is used to find outlier cases during inference. Different approaches to anomaly detection [25][26][27] have been proposed in the past for a variety of domains [22][23][24]. It is generally assumed that anomalies differ in lower dimensions as well as in high-dimensional space, meaning the latent space mapping is a vital point in anomaly detection. Recent studies that include generative adversarial networks (GANs) [28] in their models are highly effective in mapping the data distribution. The efficiency of GANs in mapping both high- and low-dimensional features with very little information loss has sparked new interest in anomaly detection [29].
In medical imaging, GANs are generally used for AD diagnosis [30] and for generating data to train deep learning models [31]. The use of anomaly detection to differentiate normal brain images from abnormal brain images, however, is not well researched.
In this study, the analysis of Alzheimer's disease as an anomaly using the proposed model is researched. In the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, there are three classes: AD, MCI, and normal control (NC). The proposed model is trained using an unsupervised method. The anomaly analysis is performed on these classes in the following cases: AD-NC, MCI-NC, and AD-MCI, where the first class is the anomaly class and the latter class is the normal class. The area under the curve (AUC) is calculated for each of these comparisons to evaluate the performance of the model, as per previous work in the field [32,33], along with the Fréchet inception distance as a qualitative evaluation. The contributions of this paper are as follows:
• Novelty-the Alzheimer's disease anomaly analysis of PET images using a proposed unsupervised adversarially trained model with a unique feature extractor. To the authors' knowledge, there are no anomaly detection studies of Alzheimer's disease cases using adversarial deep learning models.
• Effectiveness-the proposed model quantitatively and qualitatively outperforms the state-of-the-art models.

Related Anomaly Detection Works
For better observation of the changes in the brain caused by AD, there has been a tremendous effort in the medical imaging field. Multiple machine learning applications have been proposed to classify different stages of AD using brain images [34][35][36]. Different imaging techniques such as PET [37,38], structural magnetic resonance imaging (sMRI) [39,40], and functional magnetic resonance imaging (fMRI) [18,41,42] have been used in AD diagnosis. It has also been shown that multi-modal use may increase performance compared to a single modality [43]. For example, using the ADNI dataset, Lie et al. [44] propose a multi-modality CNN model for binary classification of AD vs. NC with a classification score of 93.26%. Singh et al. [45] achieved a remarkable 97.37% F1 score with their proposed method. Various implementations of GANs in medical imaging have also opened new possibilities, from image segmentation [46] to image translation [47].
GANs are a type of generative machine learning framework that includes a generator that creates realistic images, usually from random noise, and a discriminator that identifies whether the input image is real or fake. The generator is usually a decoder that learns the input data distribution from the latent vector, and the discriminator generally has a classic architecture that reads the input image and discriminates fake images from real images. The comparison between the original GAN framework and the proposed model can be seen in Figure 1.
Due to their potential uses, GANs have been examined in great depth [48], and many approaches have been proposed to improve their stability [49]. A well-known work, Deep Convolutional GAN (DCGAN) [50], removes fully connected layers and makes use of convolutional layers and batch normalization [51]. With the use of the Wasserstein loss, the performance is improved even further [52]. The proposed model takes a real input image, obtains the latent vector using the embedded parallel network, which is comprised of a conventional convolutional neural network (CNN) pipeline and a dilated convolutional neural network (DCN) pipeline, and reconstructs the image. The discriminator then distinguishes between the normal input image and the reconstructed image. More detailed information about the proposed model can be found in Section 3, and the details of the proposed architecture can be seen in Figure 2.

Figure 1. (a) The original GAN architecture. The input is a random noise vector that is built into an image with the generator. The constructed image and a real image from the dataset are then fed to the discriminator, where they can be classified as real or fake. (b) The proposed model architecture. The input is an actual image that is sent through the parallel network to generate a latent vector, which is then used to reconstruct the image. This reconstructed image and the corresponding real image are fed to the discriminator, where the discrimination between the real and the fake image can be made.
Recent attention on anomaly detection tends to be reconstruction-based. One study from Ravanbaksh et al. [53] takes advantage of image-to-image translation [54] to detect abnormality in crowded scenes. The approach uses two conditional GANs; the first generator constructs an optical flow from input frames, and the second one generates frames from the generated flow. The main breakthrough comes from Schlegl et al.'s work [23], where it is hypothesized that the end latent vector is a direct representation of the data distribution; however, mapping it is not a straightforward process. The first step is training a generator and a discriminator using only normal images, and afterward, remapping to the latent vector by freezing the weights based on the z vector. During inference, the model pinpoints an anomaly through the anomaly score. Furthering this study, Zenati et al. [32] (EGBAD) use BiGAN [55], examining joint training to map the data distribution from the image space and the latent space, respectively. Akcay et al. [29] propose the use of a conventional autoencoder (GANomaly) with the addition of another encoder after the decoder, jointly training the model and the discriminator with an additional latent space loss. Furthering their work, Akcay et al. [56] propose a U-Net-like [57] autoencoder model (Skip-GANomaly) with skip connections, trained jointly with a discriminator.

Figure 2. The architecture of the proposed model. The encoder part of the generator G is built on a unique parallel model that is capable of extracting local features from an input image x through the convolutional neural network (CNN), and global features through the DCN. The concatenation of these features, z, is used as the input for the generator, which in turn reconstructs the original input image as x̂. The discriminator D is an encoder that inputs both the original and the reconstructed image and outputs a class label indicating whether the image is real or fake. Its fully connected layer ẑ is also used as a part of the main loss function during training.
The human brain has a structure with many unique features that can be extracted by different CNN models. However, the majority of the works focus on creating a single complicated pipeline to extract these features. This study uses a parallel model that has been proven [58] to extract more features than a single pipeline, which are reflected in its class activation maps during the inference. Furthermore, while the conventional GANs generate images from random input noise, the proposed model generates images from an input brain image, resulting in more realistic images. The findings of the experiment are explained and shown in the following sections.

Proposed Model
The proposed model uses an unsupervised adversarial training scheme. It has two major components:
• The generator (G) learns the dataset distribution from the input image, encodes it into a latent vector, and reconstructs the image by upsampling. The uniqueness of the generator is that the encoder uses a parallel model comprised of a convolutional pipeline (CNN) and a dilated convolutional network pipeline (DCN). The CNN pipeline is eight layers deep; each layer uses 3 × 3 convolutional filters, a rectified linear unit (ReLU) activation function, and a batch normalization operation. After every two identical layers, a max-pooling operation is used for spatial dimension reduction and for doubling the depth of the tensor. The DCN pipeline is also eight layers deep; each layer uses 3 × 3 convolutional filters with a dilation factor of 2, a ReLU activation function, and a batch normalization operation, with the same max-pooling scheme after every two identical layers. Each pipeline generates a latent vector of the input image, and the concatenation of these features gives the optimal feature vector of the input image [58]. The class activation map for a given input image is shown in Figure 3.
• The discriminator (D) predicts the class of the input (whether it is fake or not) based on learned features. The discriminator uses an encoder-type architecture.
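To illustrate why the dilated pipeline captures global context while the plain pipeline stays local, the receptive field of each stack can be computed. The sketch below is illustrative only: the function name and layer tuples are our own, and it deliberately ignores the max-pooling stages, which enlarge both receptive fields further.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    Each layer is a (kernel_size, stride, dilation) tuple.
    """
    rf, jump = 1, 1  # rf: receptive field; jump: spacing between outputs
    for k, s, d in layers:
        eff_k = d * (k - 1) + 1      # effective kernel size after dilation
        rf += (eff_k - 1) * jump
        jump *= s
    return rf

# Eight 3x3 layers: CNN pipeline (dilation 1) vs. DCN pipeline (dilation 2)
cnn_rf = receptive_field([(3, 1, 1)] * 8)   # local features
dcn_rf = receptive_field([(3, 1, 2)] * 8)   # global features
```

With these assumptions the dilated stack sees nearly twice the spatial extent of the plain stack for the same parameter count, which matches the local/global split described above.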
The mathematical definition and formulation of the problem are as follows. The dataset is split into a training set D comprised of N normal images, D = {x_1, . . . , x_N}, and a testing set D̂ of A normal and abnormal images combined, D̂ = {(x̂_1, y_1), . . . , (x̂_A, y_A)}, where y_i ∈ {0, 1} denotes the normal and abnormal class labels, respectively. The task is to train the proposed model f on D and perform inference on D̂. Ideally, the training set should be much larger than the testing set, |D| ≫ |D̂|. Training helps to map the distribution of the dataset D in all vector spaces. This enables the network to learn both higher- and lower-level features that differ from those of abnormal samples.
As shown in Figure 2, the proposed model consists of a generator G and a discriminator D. The generator uses an autoencoder-type structure to generate an image x̂ through the latent vector z extracted from the input image x, such that G_E : x → z, where x ∈ R^(w×h×c) and z ∈ R^d. The input image x is fed to both pipelines, where the conventional CNN extracts local features and the DCN extracts global features. The convolutional pipeline consists of eight convolutional layers; each layer is created using 3 × 3 convolutional filters, a ReLU activation, and a batch normalization operation. The DCN consists of eight dilated convolutional layers, each comprised of 3 × 3 convolutional filters with a dilation factor of 2, a ReLU activation, and a batch normalization operation. At the end of the two pipelines, the extracted image features are concatenated to create the latent vector z.
The decoder network consists of four upsampling layers and eight convolutional layers, with 3 × 3 convolutional filters and a ReLU activation on each layer. Its task is to upsample the latent vector z back to the original input image dimension, producing the reconstruction denoted as x̂.
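A minimal sketch of how four ×2 upsampling stages recover the 64 × 64 input resolution from a small spatial feature map. Nearest-neighbour upsampling on a plain 2-D list is used for simplicity; the layer count follows the text, while the 4 × 4 starting size is our assumption, not stated in the paper.

```python
def upsample2x(feat):
    """Nearest-neighbour upsampling: double height and width of a 2-D list."""
    rows = []
    for row in feat:
        doubled = [v for v in row for _ in range(2)]  # duplicate each column
        rows.extend([doubled, list(doubled)])         # duplicate each row
    return rows

feat = [[0.0] * 4 for _ in range(4)]   # assumed 4x4 spatial feature map
for _ in range(4):                     # four x2 upsampling layers
    feat = upsample2x(feat)
# feat is now 64 x 64, matching the downscaled input image resolution
```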
Unsupervised training is performed with the majority of the normal class in the proposed GAN-based anomaly detection. In all three cases, the train-test split is the same: 80% of the normal-class data is used to train the model, and the remaining 20% is used together with a similar number of anomaly-class images during inference. Thus, AD is detected as the anomaly against NC, MCI as the anomaly against NC, and, for the AD-MCI case, the model is trained with only MCI data while AD data are considered abnormal.

The Dataset
The dataset used in this study was obtained from ADNI (http://adni.loni.usc.edu) (accessed on 22 December 2020). The ADNI study contains AD, MCI, and NC subjects, and its main purpose is to establish clinical, imaging, biochemical, and genetic markers for the early detection of AD. For this study, participants with baseline plus one-year follow-up 18F-fluorodeoxyglucose (FDG)-PET images were used. Of the 256 subjects in total, there are 148 NC subjects, 83 MCI subjects, and 25 AD subjects. Demographic and clinical information (Clinical Dementia Rating-CDR, Mini-Mental State Examination-MMSE) is shown in Table 1. For the training stage, the normal-class dataset is split 80-20% for training and inference (e.g., AD vs. NC, MCI vs. NC, AD vs. MCI). One hundred percent of the abnormal class is used for inference, meaning the model is never trained with the abnormal class.
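The split described above can be sketched as follows. The helper function and its names are hypothetical, not from the paper's code; it only demonstrates that training ever sees normal-class images, while the held-out normal remainder and the entire abnormal class form the inference set.

```python
import random

def anomaly_split(normal, abnormal, train_frac=0.8, seed=0):
    """80-20 split of the normal class; 100% of the abnormal class is
    reserved for inference. Labels: 0 = normal, 1 = abnormal."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    shuffled = list(normal)
    rng.shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    train = shuffled[:cut]                       # normal images only
    test = shuffled[cut:] + list(abnormal)       # unseen normal + all abnormal
    labels = [0] * (len(shuffled) - cut) + [1] * len(abnormal)
    return train, test, labels

# toy example: 100 "normal" ids and 25 "abnormal" ids
train, test, labels = anomaly_split(range(100), range(100, 125))
```

With 100 normal and 25 abnormal samples, this yields 80 training images and a 45-image inference set containing all 25 anomalies.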
Detailed information about the dataset split and the numbers can be seen in Table 2. Each patient's axial scan has 96 images, and each grayscale image has a size of 160 × 160 pixels. The images were downscaled to 64 × 64 to reduce the computational cost.

Training the Model
The recent trend in anomaly detection models such as Skip-GANomaly [56] and AnoGAN [23] is to train the model on the normal dataset by separating it in a train-test fashion, and then perform inference with both unseen normal and unseen abnormal samples. Ideally, normal inference samples will have similar latent vectors, and the reconstructed image will be similar to the input image. On the other hand, abnormal samples that the model has not been trained on are expected to fail in both cases. A higher loss value can be expected in such cases, which is used for the anomaly score calculation. During training, the proposed model optimizes three loss functions: contextual loss, latent vector loss, and adversarial loss.
1. Contextual Loss: To learn the distribution of the dataset, the L1 norm is applied between the input x and the output x̂. This helps the generation of contextually similar images from the normal samples and is proven to produce less blurry images than the L2 norm [54]. The loss is defined as:

L_con = E_{x∼pX} ‖x − x̂‖_1

2. Adversarial Loss: Taken from [28], this loss ensures that the generator G can reconstruct an image x as realistically as possible while the discriminator D can differentiate between the normal and fake images. The task is to minimize this objective for G and maximize it for D to achieve the min-max equilibrium, where it is defined as:

L_adv = E_{x∼pX}[log D(x)] + E_{x∼pX}[log(1 − D(x̂))]

3. Latent Vector Loss: This loss ensures that the generator and the discriminator produce similar latent representations for sampling, using the concatenated features z = f(x) and the fully connected layer of the discriminator ẑ = f(x̂). The loss becomes:

L_lat = E_{x∼pX} ‖z − ẑ‖_2

Thus, the total loss of the model becomes:

L = w_con L_con + w_adv L_adv + w_lat L_lat

where w_con, w_adv, and w_lat are the weights of the respective loss terms. For training the proposed model f, stochastic gradient descent (SGD) with lambda decay and the Nesterov momentum optimizer with an initial learning rate of 10^−2 is used. The model is trained until convergence, which took 60 epochs. To perform patient-wise training and inference, the batch size is set to 96, the total number of axial brain images of a single patient. The same hyperparameters, the same dataset, and the same 80-20% split were used to train the benchmark models and in the ablation study. The implementation uses TensorFlow v2.0, Python 3.6.8, CUDA 10.0, and CuDNN 7.6.5. The experiments are conducted on an NVIDIA P5000 GPU.
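The three losses can be sketched numerically as below. This is an illustrative pure-Python version over flat feature lists, not the training implementation; the weighting values (w_con, w_lat, w_adv) are placeholders of our own, as the paper does not state its exact weights.

```python
import math

def l1_contextual(x, x_hat):
    # contextual loss: mean absolute difference between input and reconstruction
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def l2_latent(z, z_hat):
    # latent loss: Euclidean distance between generator and discriminator features
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, z_hat)))

def adversarial_d(d_real, d_fake, eps=1e-7):
    # discriminator maximizes log D(x) + log(1 - D(x_hat)); we minimize the negative
    return -(math.log(d_real + eps) + math.log(1.0 - d_fake + eps))

def generator_total(x, x_hat, z, z_hat, d_fake,
                    w_con=50.0, w_lat=1.0, w_adv=1.0, eps=1e-7):
    # placeholder weights; non-saturating adversarial term for the generator
    l_adv = -math.log(d_fake + eps)
    return (w_con * l1_contextual(x, x_hat)
            + w_lat * l2_latent(z, z_hat)
            + w_adv * l_adv)
```

A perfect reconstruction with matching latent vectors and a fully fooled discriminator drives all three terms toward zero, which is the behaviour expected of normal-class samples at convergence.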

Model Evaluation
Model performance is evaluated through multiple methods. First, a classification inference is performed with the parallel model to check whether the parallel encoder module works accurately. The trained parallel model weights are frozen, a classification (softmax) layer is added, and fine-tuning is performed with the same training hyperparameters and the categorical cross-entropy loss. The same hyperparameters are used to train the benchmark classification models. The benchmark results against other classification models are given in Table 3, and the confusion matrix of the parallel model can be seen in Figure 4. Class activation maps are helpful for highlighting the regions where the model's attention is focused. The regions shown in Figure 3 are relevant to each different class; thus, the greater the variance between these regions, the more accurately the model performs. Looking at the model's class activation maps, the CNN and DCN activation areas are complementary to each other in a given brain image. The four activation images on the left-hand side are from the CNN, whereas the four activation images on the right-hand side are from the DCN. Although the DCN has a lower activation area overall, the values in these areas are higher than the CNN activation values, indicating that the DCN features are distinct and unique. The second evaluation method is the area under the curve (AUC) [63]. It is a function that uses true positive rates (TPR) and false positive rates (FPR) with varying threshold values from the inference data:

TPR = TP / (TP + FN), FPR = FP / (FP + TN)
where TP is a true positive, FN is a false negative, FP is a false positive, and TN is a true negative. The model's performance is compared with recent anomaly detection works such as GANomaly [29], Skip-GANomaly [56], EGBAD [32], and AnoGAN [23]. Additionally, an ablation study is performed on the CNN pipeline and DCN pipeline networks. The comparison can be seen in Table 4. A qualitative evaluation method, the Fréchet inception distance (FID) [64], is applied as a metric. This method evaluates the quality of the generated images by calculating the distance between the feature vectors of the real and the generated images. The estimation is done via the Inception-V3 model [60], built on the original GoogLeNet architecture [65], which classifies the generated image. The conditional class probability and the confidence of each image are combined. This is defined as [64]:

FID = ‖µ_r − µ_g‖^2 + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2})

where X_r ∼ N(µ_r, Σ_r) and X_g ∼ N(µ_g, Σ_g) are the 2048-dimensional activations of the pool3 layer for real and generated samples. If both image sets are identical, the score should ideally be 0. Since no clearly acceptable metric value for unsupervised learning is established in the literature, judging what counts as an acceptable value can be problematic. A total of 1000 constructed images gathered from each class were compared with the corresponding 1000 real images to obtain an FID score for each class separately. The score comparison for different classes and models can be seen in Table 5, and reconstructed sample images for all cases and the ablation study results can be seen in Figure 5.

Table 5. Fréchet inception distance (FID) score comparison for the three classes AD, MCI, and NC using GANomaly, Skip-GANomaly, EGBAD, and AnoGAN, along with the ablation study. Note that a lower score generally means better performance.

An anomaly score [32] is used to detect the anomalies in a given test image. For an input image ẋ, the corresponding anomaly score is calculated as:

A(ẋ) = λ R(ẋ) + (1 − λ) D(ẋ)
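The AUC can also be computed directly from anomaly scores without sweeping thresholds, via its rank-statistic interpretation: the probability that a randomly chosen abnormal sample scores above a randomly chosen normal one. A minimal sketch with names of our own choosing:

```python
def roc_auc(scores, labels):
    """AUC as a rank statistic: P(anomaly score > normal score),
    counting ties as half. labels: 1 = abnormal, 0 = normal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Perfectly separated scores give an AUC of 1.0, indistinguishable scores give 0.5, which is why AUC is a natural threshold-free metric for anomaly score distributions like those in Figure 6.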

where R(ẋ) is the L1 loss between the input image and the corresponding reconstructed image, D(ẋ) is the latent vector score given in Equation (3), and λ is the weighting parameter emphasizing the relative importance of the two scores. This experiment was performed with λ = 0.5. The anomaly scores were then normalized. The anomaly score comparisons for the three scenarios, with frequency graphs, are shown in Figure 6.

Figure 6. Anomaly score distributions for test images. Note that the anomaly scores are normalized to [0, 1]; images with higher anomaly scores are reflected with values closer to 1, while normal images are closer to 0. (a) AD-NC case. AD is considered the anomaly, which is reflected as a higher anomaly score in the distribution. (b) MCI-NC case. MCI is considered the anomaly, which is reflected as a higher anomaly score in the distribution. (c) AD-MCI case. AD is considered the anomaly, which is reflected as a higher anomaly score in the distribution.
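The per-image score and its normalization can be sketched as follows, with λ = 0.5 as in the experiment; the function names and the toy (reconstruction, latent) pairs are ours, for illustration only.

```python
def anomaly_score(recon_loss, latent_score, lam=0.5):
    # A(x) = lam * R(x) + (1 - lam) * D(x)
    return lam * recon_loss + (1 - lam) * latent_score

def minmax_normalize(scores):
    # map raw scores to [0, 1]: anomalous images toward 1, normal toward 0
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# toy (reconstruction loss, latent score) pairs for three test images
raw = [anomaly_score(r, d) for r, d in [(0.9, 0.7), (0.2, 0.1), (0.5, 0.3)]]
normed = minmax_normalize(raw)
```

After normalization the most anomalous image maps to 1 and the least anomalous to 0, matching the [0, 1] score axis used in the Figure 6 distributions.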

Conclusions
In this work, an analysis of Alzheimer's disease as an anomaly in PET images is conducted using an unsupervised anomaly detection model featuring a parallel feature extractor. The parallel feature extractor consists of a CNN and a DCN. The latent vectors obtained from both pipelines are concatenated and used as the main feature vector to reconstruct the input image. The discriminator takes both the original input image and the reconstructed image as input and labels them as either real or fake. As shown in Tables 4 and 5, the proposed model outperforms the previous anomaly models both quantitatively and qualitatively, with a classification accuracy of 96.03%, 0.59% higher than DenseNet169, and an AUC score of 75.21%, 3.79% higher than Skip-GANomaly. The ablation study shows that the proposed model also outperforms its sub-networks. The narrow activation areas with high values seen in the DCN class activation maps provide a boosting effect to the CNN's wider but much lower-valued areas. The justification of the superior classification performance of the parallel feature extractor model is given in Table 3, with supporting evidence in the form of class activation maps, as shown in Figure 3. Although there are no solid metric values for the optimal performance of unsupervised models, the comparison based on the FID score shows that the model produces images similar to the real ones. The score comparisons are given in Table 5, and a sample batch of images is shown in Figure 5. There are multiple directions for future studies:
• Among the three loss functions, the importance of each can be evaluated. A genetic algorithm or a grid search can be used to assign weights to each loss function and observe their effect on the model.
• The depth of the parallel model can be altered using other parameter search algorithms. A deeper model may improve accuracy at an increased computational cost.
• The skip connections used in autoencoders have shown promising results [57]. The possibility and feasibility of skip connections will be investigated to further improve the performance of the model.
• Probable ways to further improve the AUC score will be investigated.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: