Computer-Aided Diagnosis of Alzheimer’s Disease via Deep Learning Models and Radiomics Method

Abstract: This paper focuses on the diagnosis of Alzheimer's disease via a combination of deep learning and radiomics methods. We propose a classification model for Alzheimer's disease diagnosis based on improved convolutional neural network models and an image fusion method, and compare it with existing network models. We collected data from 182 subjects in the ADNI and PPMI databases to classify Alzheimer's disease, reaching an AUC of 0.906 when training with single-modality images and 0.941 when training with fusion images, which shows that the proposed method performs better with fusion images. This research may promote the application of multimodal images in the diagnosis of Alzheimer's disease: a fusion-image dataset built from multi-modality images yields higher diagnostic accuracy than a single-modality dataset, and deep learning and radiomics methods significantly improve the accuracy of Alzheimer's disease diagnosis.


Introduction
As the population ages in China, the prevalence of Alzheimer's disease (AD) increases year by year. Diagnosis of Alzheimer's disease is critical, and multimodal diagnosis is an indispensable diagnostic method for it [1]. Multimodal image fusion techniques combine different modalities of medical images into a single image that contains more information. Because specific diagnostic features are lacking, convolutional neural networks are trained to assist in the diagnosis of Alzheimer's disease [2].
Alzheimer's Disease

Status of Alzheimer's Disease
Alzheimer's disease, commonly known as senile dementia, is a cognitive and behavioral disorder with an insidious onset, and its morbidity rate keeps increasing. Alzheimer's disease is one of the most common types of dementia [3]. The main clinical symptoms are psychiatric symptoms and behavioral disorders, including progressive memory loss and cognitive impairment [4]. Although Alzheimer's disease was described more than a century ago, research institutions still lack reliable diagnostic technology and effective treatments for it; consequently, it is regarded as a global problem [5]. The pathogenesis of Alzheimer's disease is very complicated; the disease seriously affects the elderly's quality of life and places a heavy burden on families and society. Hence, diagnosis is of great significance in delaying the progression of the disease and improving quality of life. Figure 1 compares a healthy brain with the brain of an Alzheimer's disease patient, which is characterized by neural amyloid plaques and neurofibrillary tangles; these result in severe neurodegeneration, including shrinkage of the hippocampus and other cortical regions [6].

Figure 1. Difference between healthy brain and AD (Alzheimer's disease) patients (adapted from ref [6]).

Multimodality Diagnosis of Alzheimer's Disease
Multimodal diagnosis presents the structural and functional imaging of the brain through a variety of imaging devices. By comparing the differences between the various images and healthy-brain image templates, the result reflects the imaging characteristics of Alzheimer's disease. It has been shown that, in the diagnosis and prognosis of Alzheimer's disease, using multiple modalities may improve performance more than a single modality [7]. In 2019, Zhou et al. [8] focused on how to make the best use of multimodal neuroimaging and genetic data for Alzheimer's disease diagnosis; they proposed a three-stage deep feature learning and fusion framework and demonstrated the superiority of multimodality data through experiments. In 2021, for Alzheimer's disease diagnosis with incomplete modalities, Liu et al. [9] proposed an auto-encoder that can complement the missing data in the kernel space, which helps solve the problem of incomplete medical data. Such research promotes the development of multimodal technology in the field of Alzheimer's disease.
This project uses T2-weighted Magnetic Resonance Imaging (T2-MRI) and positron emission tomography (PET). In Alzheimer's disease, the lesions visible on T2-MRI are concentrated in atrophy of brain regions such as the hippocampus, amygdala, and entorhinal cortex. Studies have confirmed [10] that hippocampal atrophy precedes the other clinical symptoms of the disease; it is an early and critical imaging characteristic of Alzheimer's disease and can be used as an early, specific, and sensitive predictor. The brain derives its energy almost exclusively from glucose uptake, so the extent of Alzheimer's disease can be assessed from differences in glucose uptake across brain regions. 18F-fluorodeoxyglucose (18F-FDG) PET imaging of patients with Alzheimer's disease shows a symmetrical decrease in glucose metabolism in both temporal and parietal lobes, and as the condition deteriorates, this hypometabolism spreads from the temporal lobe to other cortical areas [11]. PET presents the metabolic changes of glucose in the brain and can be used to map the distribution of the lesions. Figure 2 displays two different modality images of Alzheimer's disease patients.

Multimodal Image Fusion Technology
Multimodal image fusion technology aggregates medical images of different modalities, produced by different imaging systems, into a single new output image [12]. Medical images of different modalities emphasize different content, and multimodal fusion can integrate functional and anatomical images through a designed algorithm. A fused image contains more useful information and is more suitable for human visual processing; for this reason it plays an important role in the diagnosis and treatment of diseases.
Multimodal image fusion is divided into three levels: pixel-level fusion, feature-level fusion, and decision-level fusion. Pixel-level fusion is the lowest level, fusing the images directly pixel by pixel. Figure 3 illustrates how the fusion process works and shows an example. Pixel-level fusion has clear advantages: compared with the other levels, it preserves more of the original information and detail of the image, so it is the most widely applied of the three fusion levels.

Deep Learning and Convolution Neural Networks
Recently, deep learning has become a hot research topic in machine learning and has produced many achievements in computer vision and speech recognition [13][14][15]. Various deep learning methods, such as the deep neural network (DNN), convolutional neural network (CNN), and recurrent neural network (RNN), have been applied to diagnose human neurological disorders [14].
Deep learning is derived from artificial neural networks; its essence is to simulate the structure of human brain neural networks and to form high-level feature representations by combining low-level features, in order to process complex data efficiently [15]. In 1962, Hubel [16] first proposed the concept of a receptive field; in 1980, Fukushima [17] proposed a neural perceptron based on the receptive field and implemented the first CNN model. Subsequently, LeCun et al. [18] applied the Back Propagation (BP) algorithm to CNN training and successfully completed image recognition. CNN has attracted extensive attention because of its parallelism and its robustness to distortion. One example is the application of deep learning to diagnosis [19]. Additionally, CNN and RNN have been found to give better results than other deep learning methods in diagnosing Alzheimer's disease [14].
A CNN is a feedforward neural network with local connections and weight sharing. Each layer consists of pairs of two-dimensional planes, each pair comprising a convolution layer and a down-sampling layer. The convolutional layer, also called the feature extraction layer, consists of multiple independent neurons; its function is to extract features from the input data, an approach closer to how the human visual system processes images. Figure 4 shows the basic structure of a convolutional neural network.

Image Data
A total of 102 Alzheimer's disease patients, with their corresponding T2-MRI and PET images, and 80 subjects with normal cognition as the control group were collected from the ADNI and PPMI databases. The dataset was composed of the corresponding MRI and PET images of each patient, all labeled by professionals in Alzheimer's disease research. Figure 5 shows one sample from the project dataset: the T2-MRI and PET images of a patient.

Image Preprocessing
We collected images from the ADNI and PPMI databases and preprocessed them to improve the recognition accuracy. The preprocessing is divided into three parts: gray level transformation, histogram equalization, and wavelet soft-threshold denoising. The flow chart of the preprocessing is shown in Figure 6; we discuss each step in turn below.

Gray level transformation linearly maps the image gray scale to the range 0 to 255. After the linear gray level transformation, we apply histogram equalization to adjust the contrast, which increases the local contrast of the original image. Histogram equalization can be expressed by the following formula:

s_k = ((L - 1) / MN) * Σ_{j=0}^{k} n_j,  k = 0, 1, ..., L - 1,

where L is the number of possible gray levels in the image, MN is the number of image pixels, and n_j is the number of pixels at gray level r_j. Figure 7 demonstrates the effect of histogram equalization.

After the gray level transform comes the denoising process. The denoising algorithm implemented in this paper is based on the wavelet soft-threshold algorithm [20], which can be expressed by the following piecewise function:

w'(x) = sign(x)(|x| − α) if |x| ≥ α;  w'(x) = 0 if |x| < α,

where x represents the wavelet coefficient of the pixel's gray value and α is the threshold of the wavelet soft-threshold algorithm, defined by:

α = σ sqrt(2 ln N),

where N is the signal dispersion number, and through experiment we set σ = 25. Figure 8 shows the denoising effect of the algorithm.

The next step is contrast stretching. We used a three-stage contrast stretching method in this paper. The basic form of the piecewise linear function, with breakpoints (x1, y1) and (x2, y2), is:

g(x) = (y1 / x1) x for 0 ≤ x < x1;  g(x) = ((y2 − y1) / (x2 − x1))(x − x1) + y1 for x1 ≤ x < x2;  g(x) = ((255 − y2) / (255 − x2))(x − x2) + y2 for x2 ≤ x ≤ 255.
The specific gray-scale restricting rule is shown in Figure 9, and in Figure 10 we give the image before and after contrast stretching.
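The preprocessing steps above can be sketched in NumPy. This is a minimal illustration under our own naming, not the paper's implementation; in particular, the wavelet decomposition that surrounds the soft-threshold rule is omitted and only the threshold rule itself is shown.

```python
import numpy as np

def gray_level_transform(img):
    """Linearly map the image gray scale to the full [0, 255] range."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros_like(img, dtype=np.uint8)
    return ((img - lo) / (hi - lo) * 255).astype(np.uint8)

def histogram_equalize(img, L=256):
    """s_k = (L-1)/MN * sum_{j<=k} n_j, applied as a lookup table."""
    hist = np.bincount(img.ravel(), minlength=L)
    cdf = np.cumsum(hist)
    lut = np.round((L - 1) * cdf / img.size).astype(np.uint8)
    return lut[img]

def soft_threshold(coeffs, alpha):
    """Wavelet soft-threshold rule: shrink coefficients toward zero by alpha."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - alpha, 0.0)

def universal_threshold(sigma, N):
    """alpha = sigma * sqrt(2 ln N), the threshold used in the paper."""
    return sigma * np.sqrt(2.0 * np.log(N))
```

In a full pipeline, `soft_threshold` would be applied to the detail coefficients of a wavelet decomposition (e.g., from PyWavelets) before reconstruction.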

Image Registration
Image registration is the basis of fusion technology; it matches the same area in two images. It can be defined by the following formula:

R(x, y) = h(F(f(x, y))),

where R is the reference image and F is the floating image. After the pixels of image F are transformed to correspond to the pixels of image R, the matching is completed [21]. In this correspondence, f(x, y) represents the two-dimensional spatial transformation and h(·) represents the one-dimensional gray-scale transformation. Image registration is the crucial step in realizing image fusion; to obtain reliable diagnosis results, the preceding registration must be as accurate as possible.
In this paper, the affine transformation was used as the spatial geometric transformation for registration. An affine transformation of two-dimensional Euclidean space is defined as:

S(x, y) = (a x + b y + c, d x + e y + f),

where a–f are real numbers and (x, y) is the coordinate of each pixel. The transform S is an affine transform, which has the characteristic that a finite pixel is mapped to a finite pixel [22]. A two-dimensional affine transformation is linear and includes translation, rotation, and scaling. A coordinate point transformed by an affine transformation composed of these operations can be expressed as:

x' = k (x cos θ − y sin θ) + Δx,
y' = k (x sin θ + y cos θ) + Δy,

where (x', y') is the pixel coordinate after the affine transformation, and k, θ, Δx, Δy (scale, rotation angle, and translations) are the registration parameters of the two images. This study divides the registration into four steps according to the requirements, as described below.
Step 2: initial registration (coarse registration). The Powell algorithm is selected as the optimizer, and the maximum mutual information method as the similarity measure.
Step 3: improve the registration accuracy by changing the optimizer's step size and increasing the number of iterations.
Step 4: use the maximum mutual information as a reference to adjust the step size and iteration count, in order to improve the accuracy.

As an optimization strategy for image registration, the Powell algorithm is a multi-parameter local optimization search algorithm that does not need to compute derivatives [23]. It divides the optimization into an iterative process consisting of n + 1 one-dimensional searches. First, an extremum point is obtained after searching along n different conjugate directions, and a search is made along the direction connecting the initial point and that extremum point. Then, the newest search direction replaces the direction of the extreme points obtained in the previous n searches, and the iteration continues until the function stops decreasing. The maximum mutual information method is a registration similarity measure [24]. Based on the gray levels of the images, the mutual information of the two images is taken as the similarity measure, and the optimal transform [23] is obtained by maximizing it:

I(A, B) = H(A) + H(B) − H(A, B),  S* = arg max_S I(A, S(B)),

where H(A) and H(B) are the entropies of the images, which indicate their gray-level distributions, H(A, B) is their joint entropy, and S* is the updated transform result.
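The mutual information similarity measure can be sketched in a few lines of NumPy using histogram estimates of the entropies; this is an illustrative version with our own function names, not the registration code used in the paper.

```python
import numpy as np

def entropy(img, bins=256):
    """Shannon entropy H of an image's gray-level histogram."""
    p = np.bincount(img.ravel(), minlength=bins) / img.size
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(a, b, bins=256):
    """I(A, B) = H(A) + H(B) - H(A, B), the registration similarity measure."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pj = joint / joint.sum()
    pj = pj[pj > 0]
    h_ab = -np.sum(pj * np.log2(pj))
    return entropy(a, bins) + entropy(b, bins) - h_ab
```

An optimizer such as Powell's method would repeatedly transform the floating image and maximize `mutual_information` between the transformed result and the reference.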

Multimodal Pixel-Level Image Fusion
The core of wavelet fusion is multi-resolution fusion, which resembles the multi-channel, spatial-frequency characteristics of human vision; for this reason wavelet fusion has become a hot topic. The two input images are decomposed by K-layer wavelet decomposition to obtain 3K + 1 sub-images: one sub-image is the low-frequency image of the highest layer K, and the other 3K are high-frequency sub-images with different frequency characteristics, namely the high-frequency content of the original image in the horizontal, vertical, and diagonal directions of each of the K layers. In this paper, we used the traditional wavelet weighting algorithm as a standard for comparison, and we also proposed a modified wavelet fusion algorithm that is more suitable for MRI and PET fusion.

Traditional Wavelet Weighting Algorithm
The traditional wavelet algorithm decomposes the registered images by the wavelet method and obtains wavelet coefficient matrices of different frequencies. Then, both the low-frequency and the high-frequency coefficient matrices are weighted and averaged using weighting coefficients set according to the characteristics of medical images; the weighting coefficient used in this study is 0.5. After the weighted average, the low- and high-frequency coefficient matrices of the fused image are obtained. Finally, the coefficient matrices are transformed by the inverse wavelet transform to obtain the fused image. The formula is as follows:

C_F(i, j) = ω_A C_A(i, j) + ω_B C_B(i, j),  F = W^{-1}(C_F),

where C_A and C_B are the corresponding (low- or high-frequency) coefficient matrices of the source images A and B, ω_A and ω_B are the weight parameters of the matrices (both 0.5 here), C_F is the fused coefficient matrix, and W^{-1} denotes the inverse wavelet transform producing the fused image F.

Frequency Weighted Wavelet Fusion Algorithm
The frequency weighted wavelet fusion algorithm is similar to the traditional method in the wavelet transform and in the weighted averaging of the low-frequency coefficients. However, for the high-frequency coefficients it calculates the local mean square deviation and, at each position, takes the coefficient with the maximum value to form the fused high-frequency coefficient matrix. Finally, the coefficient matrices are transformed by the inverse wavelet transform to obtain the fused image. The formula is as follows:

C_F^L(i, j) = ω_A C_A^L(i, j) + ω_B C_B^L(i, j),
C_F^H(i, j) = C_A^H(i, j) if D_A(i, j) ≥ D_B(i, j), otherwise C_B^H(i, j),

where C^L and C^H are the low- and high-frequency coefficient matrices of the source images A and B, ω_A and ω_B are the weight parameters, and D_A and D_B are the local mean square deviations of the high-frequency coefficients; the coefficient with the larger deviation is retained.
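The two fusion rules, weighted averaging for low-frequency sub-bands and selection by local mean square deviation for high-frequency sub-bands, can be sketched as below. This is an illustrative NumPy sketch of the coefficient-selection logic only; the wavelet decomposition and reconstruction themselves (e.g., via PyWavelets) are omitted, and the function names are ours.

```python
import numpy as np

def fuse_low(cA1, cA2, w=0.5):
    """Low-frequency sub-bands: weighted average (w = 0.5 in this paper)."""
    return w * cA1 + (1 - w) * cA2

def local_msd(c, size=3):
    """Local mean square deviation of each coefficient in a size x size window."""
    pad = size // 2
    p = np.pad(c, pad, mode='reflect')
    out = np.empty_like(c, dtype=np.float64)
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            win = p[i:i + size, j:j + size]
            out[i, j] = np.mean((win - win.mean()) ** 2)
    return out

def fuse_high(cH1, cH2):
    """High-frequency sub-bands: keep the coefficient with larger local activity."""
    return np.where(local_msd(cH1) >= local_msd(cH2), cH1, cH2)
```

In a full pipeline, `fuse_low`/`fuse_high` would be applied band by band to the 3K + 1 sub-images before the inverse wavelet transform.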

Multimodal Image Fusion Results
Taking a patient's data from the database as an example, the results of three different fusion images are displayed below in Figure 11.

Evaluation and Analysis of Fused Images
Although a fused image cannot directly assist diagnosis the way a doctor can, a basic judgment can be made from the properties of the image. In order to evaluate the fusion algorithms of this paper comprehensively, we selected six image evaluation parameters, displayed in Table 1.
According to the parameter evaluation formulas above, the parameters of each algorithm were calculated. The results are shown in Table 2, and the following analysis was made from the calculated results. We also used a significance test to check whether there is a significant difference; the p-value is less than 0.05, so the improvement is significant.
On the basis of the data analysis in the table, the improved wavelet fusion is superior to the traditional wavelet fusion in spatial frequency, mean gradient, information entropy, mutual information, cross entropy, and peak signal-to-noise ratio. Overall, the improved wavelet fusion implemented in this research performs better than the traditional wavelet fusion.

Spatial Frequency (SF)
Spatial frequency reflects the overall activity level of an image space and evaluates the degree of image clarity:

SF = sqrt(RF^2 + CF^2),

where RF and CF are the row frequency and the column frequency, respectively. A larger SF indicates a more active image and a better fusion effect.

Information Entropy (IE)
Information entropy describes the probability distribution of pixels of different gray levels in space and measures the detail expressiveness of the image:

IE = −Σ_{m=0}^{L−1} P_m log2 P_m,

where P_m is the probability that gray level m appears in the image. A larger IE indicates richer detail and better quality of the fused image.
Mean Gradient (MG)
The mean gradient is the average of the image gradient and reflects the change of the gray level of the image:

MG = (1 / ((M − 1)(N − 1))) Σ_i Σ_j sqrt((ΔxF(i, j)^2 + ΔyF(i, j)^2) / 2),

where ΔxF(i, j) and ΔyF(i, j) are the differences in the x and y directions. A higher MG indicates higher image contrast.

Cross Entropy (CE_RF)
Cross entropy measures the difference in information between two images, reflecting the difference between their gray-level distributions:

CE_RF = Σ_{m=0}^{L−1} P_R(m) log2(P_R(m) / P_F(m)),

where P_R and P_F are the gray-level distributions of the reference and fused images. A smaller cross entropy indicates that the fusion method extracts more information from the source image.

Peak Signal-to-Noise Ratio (PSNR)
The peak signal-to-noise ratio measures the fidelity of the image:

PSNR = 10 log10(255^2 / MSE),

where MSE is the mean square error between the reference and fused images. A higher PSNR indicates smaller distortion and a better fusion effect.

Figure 12 is the flow chart of the data from the end of preprocessing to entering the network and finally realizing classification.
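Several of the objective evaluation parameters above can be computed with a few lines of NumPy; the following sketch uses our own function names and the standard definitions given in this section.

```python
import numpy as np

def spatial_frequency(f):
    """SF = sqrt(RF^2 + CF^2): row and column first-difference energy."""
    f = f.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(f, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(f, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)

def information_entropy(img, bins=256):
    """IE = -sum_m P_m log2 P_m over the gray-level histogram."""
    p = np.bincount(img.ravel(), minlength=bins) / img.size
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mean_gradient(f):
    """MG: average of sqrt((dx^2 + dy^2)/2) over interior pixels."""
    f = f.astype(np.float64)
    dx = f[1:, :-1] - f[:-1, :-1]
    dy = f[:-1, 1:] - f[:-1, :-1]
    return np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0))

def psnr(ref, test, peak=255.0):
    """PSNR = 10 log10(peak^2 / MSE); higher means less distortion."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```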

Construction of Convolution Neural Network
In this paper, the project group designed and improved two CNN models and used them as diagnosis models in training. In addition, four other deep learning models for image classification (ResNet, GoogLeNet, Inception Net, and AlexNet) were implemented for reference and comparison. First, the specific model structures and training processes of the two CNN models we designed are introduced; then, the MNIST dataset is used to test their performance.

Convolution Neural Network Structure
The first convolutional neural network designed in this paper was based on the classical convolutional neural network model LeNet-5 [18]. This CNN model had 10 layers; Figure 13 shows its structure. The first layer is the input layer. The input image then passes through three stages, each consisting of a convolution layer and a down-sampling layer. The convolution kernel is 9 × 9, and the down-sampling layer samples over 2 × 2 regions. Each stage has the same structure and operations. The first stage outputs six 60 × 60 feature maps, the second stage outputs twelve 26 × 26 feature maps, and the third stage, after down-sampling, produces eighteen 9 × 9 feature maps, each with a corresponding bias term. The eighth layer is a convolution layer containing 120 feature maps, each element of which is connected to the feature maps of the previous layer; after the 9 × 9 convolution, the feature map size in this layer is 1 × 1, constituting a full connection with the previous layer. The ninth, fully connected layer contains 84 neurons.
The final output layer consists of Euclidean Radial Basis Function (RBF) units, each representing a category. In this project, Alzheimer's disease diagnosis is a binary classification problem (ill or not), so the output layer has two units.
Before CNN training, parameter initialization is needed. Parameter initialization is very important for the gradient descent algorithm: if the error surface is relatively flat, training converges very slowly. In general, the initial weights are distributed as follows:

W ~ U[ −sqrt(6) / sqrt(P^(l) + P^(l+1)), sqrt(6) / sqrt(P^(l) + P^(l+1)) ].

This is the Xavier initialization, where P^(l) is the number of units in layer l, and each parameter takes its value from the interval specified by U.
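A minimal sketch of drawing weights from the Xavier uniform interval, where `fan_in` and `fan_out` stand for P^(l) and P^(l+1) (the function name and seeding are ours):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Draw weights from U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]."""
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```

Keeping the weights within this interval keeps the variance of activations roughly constant across layers, which mitigates the slow convergence noted above.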
CNN model training uses the back propagation (BP) algorithm, which consists of forward propagation and back propagation. First, the training data are input into the CNN-10 model and passed through the convolution filtering, down-sampling, and activation functions; then the error is computed, the gradients of the weights and biases are calculated in reverse, and the weights and biases are updated.

Improved Deformable U-Net (DeU-Net)
U-Net is a special CNN [25] that excels in image segmentation. We use U-Net to process data in this research as an approach to computer-aided diagnosis of Alzheimer's disease in PET images. Inspired by the network proposed by Dai in 2019 [26], we replace the typical convolution kernel with deformable convolution in the network. The improved structure of the network is shown in Figure 14. This CNN architecture has a symmetrical structure and deformable convolution kernels. It consists of an encoder path and a decoder path, each formed by three layers. Each encoder layer has two 3 × 3 deformable convolution kernels and a max-pooling down-sampling operator; the base layer is likewise formed by two deformable convolution kernels. Copy-and-crop connections between the encoder and the decoder help preserve information for localization [27]. The activation function is the Rectified Linear Unit (ReLU). With the Adam optimizer, based on gradient descent, the parameters remain relatively stable while being adjusted dynamically.
The deformable convolution kernel is defined as:

Z^(l+1)(p_0) = Σ_{p_k ∈ K} w^(l+1)(p_k) · Z^(l)(p_0 + p_k + Δp_k),

where Z^(l) represents the input of convolution layer l and Z^(l+1) its output, Z(p) is the value of the pixel p on the feature map, and K is the regular sampling grid of the kernel, determined by the kernel size f, the stride s_0, and the padding. The learned offsets Δp_k = (Δx, Δy) can be fractional, so the value at a fractional position p is obtained by bilinear interpolation:

Z(p) = Σ_q G(q, p) · Z(q),

where p represents the arbitrary fractional position, q enumerates the integral pixel positions in the feature map, and G(q, p) is the bilinear interpolation kernel. In every layer, we used deformable kernels in the same way rather than implementing different kernels for each layer, which improves network performance. Using the back-propagation method, the network was trained end-to-end with labeled PET/MRI/fusion images.
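The bilinear interpolation step, which lets a deformable kernel sample the feature map at fractional offsets, can be sketched as follows. This is an illustrative NumPy version for a single 2-D feature map, with our own naming; G(q, p) is the product of two one-dimensional triangle kernels.

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Sample a feature map at fractional (x, y): weighted sum of the 4 integer
    neighbours q with weights G(q, p) = max(0, 1-|qx-x|) * max(0, 1-|qy-y|)."""
    h, w = fmap.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    val = 0.0
    for qx in (x0, x0 + 1):
        for qy in (y0, y0 + 1):
            if 0 <= qx < w and 0 <= qy < h:
                g = max(0.0, 1 - abs(qx - x)) * max(0.0, 1 - abs(qy - y))
                val += g * fmap[qy, qx]
    return val
```

Because G is piecewise linear in x and y, this sampling is differentiable almost everywhere, which is what allows the offsets Δp_k to be learned by back propagation.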
The energy function is a pixel-wise soft-max over the final feature map combined with a cross-entropy loss. The classifier is:

p_k(x, y) = exp(A_k(x, y)) / Σ_{k'=1}^{K} exp(A_{k'}(x, y)),

where A_k(x, y) represents the activated feature pixel in channel k at position (x, y), and K is the number of channels in the feature map. The energy function is:

E = −Σ_{(x, y)} w(x, y) log p_{ℓ(x, y)}(x, y),

where ℓ(x, y) is the true label of each pixel and w(x, y) is the pixel weight map, inspired by Ronneberger [25]:

w(x, y) = w_c(x, y) + w_0 · exp(−(d_1(x, y) + d_2(x, y))^2 / (2σ^2)),

where w_c balances the class frequencies and d_1, d_2 are the distances to the nearest and second-nearest region borders. To avoid excessive activation of some pixels and to reduce errors, the weights of the U-Net need initialization. We set σ = 8 and w_0 = 10 to obtain better performance. The number of epochs was set to 10,000 and the learning rate to 0.01.
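The pixel-wise soft-max and weighted cross-entropy energy can be sketched as below; this is an illustrative NumPy version with our own names, omitting the distance-based computation of the weight map, which is simply passed in.

```python
import numpy as np

def pixel_softmax(logits):
    """Soft-max over the channel axis k for every pixel (x, y).
    logits shape: (K, H, W)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def weighted_cross_entropy(logits, labels, weight_map):
    """E = -sum_{x,y} w(x,y) log p_{label(x,y)}(x,y).
    labels shape: (H, W) integer class map; weight_map shape: (H, W)."""
    p = pixel_softmax(logits)
    h, w = labels.shape
    picked = p[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    return -np.sum(weight_map * np.log(picked + 1e-12))
```

Subtracting the per-pixel maximum before exponentiating keeps the soft-max numerically stable for large activations.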

Performance Testing of Convolutional Neural Networks
After the construction of the convolutional neural networks, their performance needed to be tested before formal training for diagnosis. Network performance was validated on 60,000 handwritten digits downloaded from the MNIST database.
The specific test method was as follows: first, 60,000 images of 128 × 128 handwritten digits and their corresponding labels were read into the CNN model. In view of the large amount of data, the 10-fold cross validation method was used to assess the generalization ability of the CNN model. Ten-fold cross validation is an instance of k-fold cross validation: by stratified sampling from the dataset D, k mutually exclusive subsets of similar size are separated. Each cycle takes one subset as the test set and the remaining k − 1 subsets as the training set. After k rounds of training and testing, the average accuracy over the k test groups represents the accuracy of the deep learning model [28]. The accuracy results of the different training tests are shown in Table 3. CNN and DeU-Net are the networks that the project group built; the remaining networks are preset architectures used as reference. Increasing the number of training iterations improves the accuracy of the convolutional neural network. The high accuracy of the network model tests shows that the models perform well and that the hidden-layer parameter settings were reasonable.
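The k-fold splitting procedure described above (without the stratification step) can be sketched as follows; the function name and seeding are ours.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices and split them into k near-equal folds; each fold serves
    once as the test set while the remaining k-1 folds form the training set."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

Averaging the model's accuracy over the k test folds then gives the cross-validated accuracy reported in Table 3.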

Training and Testing Process
In order to validate the significance of image fusion, single-modality images and multimodal fusion images were trained in both the CNN and U-Net models, and the performance of each model was compared to determine which has the relatively better diagnostic effect. Figure 15 illustrates the composition of the single-modality image dataset and the fusion image dataset.

Single Modality Image Training Model
T2-MRI and PET images of the same parts of the brain from 101 patients with Alzheimer's disease and from 80 people with normal cognition as the control group, in total 5430 images per modality, were collected from the ADNI database and stored separately in two three-dimensional matrices. Each matrix was read into the network with its corresponding binary classification label. Because of the limited training and test data, it was necessary to use the hold-out (set-aside) method to assess the generalization ability of the CNN.
The hold-out method divides the dataset D into two mutually exclusive sets: the training set S and the test set T. After model training on S is completed, T is used to estimate the test error [27].

Fusion Image Training Model
Among all the fusion algorithms, the improved multi-modal fusion algorithm achieved the highest objective evaluation, so the T2-MRI and PET images of the same subjects used in the single-modality training model were automatically fused with it. The training and verification process was the same as for the MRI and PET image training models.

Single Modality Training Model
By the hold-out method, 3008 images were used as the training set and 1534 images as the evaluation set. After training and testing, the diagnostic accuracy using MRI-only images reached 80.65% for CNN and 84.17% for U-Net. By a similar method, results for training with PET-only images were also obtained. The mean and SD of the AUC are shown in Table 4, and the detailed training results in Table 5. In Table 5, AUC means the area under the receiver operating characteristic curve; it represents the degree of separation between categories, and its value is proportional to the probability that the model classifies correctly. ACC means accuracy, whose value is proportional to the overall prediction performance. SENS means sensitivity, also known as the recall rate: the numerator is the number of correctly predicted positive samples and the denominator is the total number of actual positive samples, so it indicates how many of the positive samples are predicted correctly. SPEC means specificity, the counterpart of SENS: it indicates how many of the negative samples are predicted correctly. PPV means positive predictive value, which is based on the prediction results: it represents how many of the samples predicted positive were correct; a sample predicted positive is either a true positive (TP) or a false positive (FP). NPV means negative predictive value, the counterpart of PPV: it represents how many of the samples predicted negative were correct. Test means the training cohort, Validation the validation cohort, and Prove the independent proof cohort. The factors above were calculated by the following formulas:

ACC = n_right / (n_right + n_false),
SENS = T_positive / (T_positive + F_negative),
SPEC = T_negative / (T_negative + F_positive),
PPV = T_positive / (T_positive + F_positive),
NPV = T_negative / (T_negative + F_negative),

where n_right is the number of correctly predicted samples and n_false the number of falsely predicted samples; T_positive is a correctly predicted positive sample, T_negative a correctly predicted negative sample, F_positive a falsely predicted positive sample, and F_negative a falsely predicted negative sample.
In Figure 16, we also provide the confusion matrix in this case.
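The cohort metrics defined above follow directly from the four confusion-matrix counts; a small sketch with our own function name:

```python
def classification_metrics(tp, fn, fp, tn):
    """ACC, SENS (recall), SPEC, PPV, NPV from confusion-matrix counts:
    tp/tn = correctly predicted positives/negatives, fp/fn = false positives/negatives."""
    total = tp + fn + fp + tn
    return {
        'ACC': (tp + tn) / total,
        'SENS': tp / (tp + fn),
        'SPEC': tn / (tn + fp),
        'PPV': tp / (tp + fp),
        'NPV': tn / (tn + fn),
    }
```

Given a confusion matrix such as the one in Figure 16, reading off the four counts and calling this function reproduces the columns of Table 5.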

Fusion Image Training Model
According to the objective evaluation of the fused images, the best algorithm is the frequency weighted wavelet fusion algorithm; therefore, the fused image set produced by it was selected to train the different models. In order to compare the performance of different models in diagnosing Alzheimer's disease, multiple deep learning models were implemented. The mean and SD of the AUC are shown in Table 6, and the detailed training results in Table 7. In Table 7, AUC means the area under the receiver operating characteristic curve, ACC accuracy, SENS sensitivity, SPEC specificity, PPV positive predictive value, and NPV negative predictive value; Test means the training cohort, Validation the validation cohort, and Prove the independent proof cohort. We also recorded the training loss curve of DeU-Net, in which orange represents the training set and blue the test set. As Figure 17 shows, the training loss decreases steadily, indicating good training behavior. In Figure 18, we also provide the confusion matrix for training with fusion images. Table 8 compares the accuracy of the different neural networks between single-modality and fusion images. Note that the proposed fusion method yields better performance than using only single-modality images for training: all of the fusion-image-trained models outperformed the single-modality-trained models, and in the training, testing, and validation cohorts, fusion images yielded significantly higher AUC than single-modality images.

Discussion
In this study, we developed and validated a pixel-based image fusion method using conventional brain T2-weighted MRI and 18F-FDG PET-CT images for the prediction of Alzheimer's disease. This method showed significantly better diagnostic performance in distinguishing Alzheimer's disease patients from cognitively normal subjects than any single-modality method.
The diagnostic information in a fused image is more abundant because it integrates anatomical information with metabolic (functional) information. From the results above, it can be seen that fusion images improved the performance of the two CNN models built by the project group, as CNN and DeU-Net showed significant improvements in diagnostic accuracy; meanwhile, DeU-Net processed the medical images more accurately than the traditional CNN. However, training the other pretrained models on fusion images produced irregular performance. These results demonstrate that multimodal fusion can improve the diagnostic rate of Alzheimer's disease, confirm that multimodal fusion technology is of great significance in its diagnosis, and show that multimodal image fusion based on convolutional neural networks has important research value. At the same time, one limitation of this paper is that the accuracy of the diagnostic models still needs improvement. The number of samples in the Alzheimer's disease diagnosis task is 182 (102 patients and 80 people without the disease), far fewer than the 60,000 samples of the MNIST digit recognition set. One reason for the limited accuracy is that the number of training samples is too small, so the extracted features are not distinctive.

Conclusions
In order to assist doctors in the accurate diagnosis of Alzheimer's disease, this paper conducted an in-depth study of multi-modal image fusion technology. First, traditional wavelet fusion of T2-MRI and PET modal images of Alzheimer's disease patients was studied, and then a wavelet fusion algorithm more suitable for medical image fusion was proposed. Based on the objective evaluation parameters, we found that the improved wavelet fusion algorithm outperforms the traditional wavelet fusion algorithm in testing accuracy.
This paper first improved and implemented the traditional convolutional neural network and U-Net for the diagnosis of Alzheimer's disease in order to explore their diagnostic characteristics. It turned out that this modification of the CNN significantly improved its performance in processing Alzheimer's disease medical images. Additionally, this paper first applied a fusion image technique to Alzheimer's disease diagnosis and demonstrated the efficiency of diagnosing with fusion images by implementing different deep learning models, including some pretrained models.
This paper trained two image sets with six different deep learning structures in the same training process. MRI, PET, and fusion images of Alzheimer's disease were trained and tested, respectively. The results demonstrate that the accuracy with fusion images is higher than with a single modality. In summary, pixel-based multi-modal fusion images applied to convolutional neural networks can have an outstanding effect on the diagnosis of Alzheimer's disease.
Similar to the problems encountered by Liu et al. [9], the authors are also actively exploring solutions to the problem of incomplete data, such as using semi-supervised or unsupervised learning. The authors also note that the experiment focuses on binary classification; multi-class classification has not been addressed and is planned as a focus of future research.

Informed Consent Statement: According to ADNI protocols, all procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and national research committee and with the Helsinki declaration.

Data Availability Statement: The data were collected from the open databases ADNI and PPMI.

Conflicts of Interest:
The authors declare no conflict of interest.