Image Classification of Alzheimer’s Disease Based on External-Attention Mechanism and Fully Convolutional Network

Automatic and accurate classification of Alzheimer’s disease is a challenging and promising task. A Fully Convolutional Network (FCN) can classify images at the pixel level, and adding an attention mechanism to a fully convolutional network can effectively improve the classification performance of the model. However, the self-attention mechanism ignores the potential correlation between different samples. To address this problem, we propose a new method for image classification of Alzheimer’s disease based on the external-attention mechanism. The external-attention module is added after the fourth convolutional block of the fully convolutional network model. At the same time, the double normalization method of Softmax and the L1 norm is introduced to obtain better classification performance and richer feature information of the disease probability map. The activation function Softmax can increase the degree of fitting of the neural network to the training set, transforming linearity into nonlinearity and thereby increasing the flexibility of the neural network. The L1 norm can prevent the attention map from being affected by especially large (or especially small) eigenvalues. The experiments in this paper use 550 three-dimensional MRI images with five-fold cross-validation. The experimental results show that the proposed image classification method for Alzheimer’s disease, combining the external-attention mechanism with double normalization, can effectively improve the classification performance of the model. With this method, the accuracy of the MLP-A model is 92.36%, the accuracy of the MLP-B model is 98.55%, and the accuracy of the fusion model MLP-C is 98.73%. This classification performance is higher than that of similar models without any attention mechanism, and better than other comparison methods.


Introduction
Alzheimer's disease (AD) is a progressively developing degenerative disease of the brain and nervous system. With the global escalation of the aging process, the incidence of Alzheimer's disease is increasing every year. As the disease worsens, elderly people with Alzheimer's disease experience a series of impairments such as gradual memory loss, reduced mobility, declining language expression and cognitive difficulties [1]. A large number of clinical studies have shown that drug intervention and care for early AD patients can delay the development of the disease and stabilize the patient's condition. Therefore, the early and accurate diagnosis of patients with suspected AD has important practical significance.
In the field of computer vision, attention mechanisms can effectively extract image features. Attention mechanisms have various implementations, roughly divided into soft attention and hard attention [32][33][34]. An attention mechanism selects the focal position of the image, yielding more discriminative feature representations and bringing continuous performance improvements to the model. The soft attention mechanism computes a weighted average over the N pieces of input information rather than selecting only one of them, and then feeds the result into the neural network [35]. The hard attention mechanism, in contrast, selects the information at a specific position of the input sequence, for example by randomly sampling a piece of information or selecting the one with the highest probability. The visual attention mechanism can thus be used to focus on key areas of the image to obtain high-level feature information.
The self-attention mechanism (SA) was proposed by Zhang et al. [36], who used the weight matrices of three branches to capture the internal feature correlation of a single sample, thereby reducing the dependence on external information. But the self-attention mechanism has quadratic complexity and ignores the potential correlation between different samples. Guo et al. [37] therefore proposed an external attention mechanism in 2021 to solve this problem, adopting two external matrices, M_k and M_v, to model the potential correlation between samples. Meanwhile, the external attention mechanism has linear complexity and implicitly considers the correlation among all data samples. Recently, Jiao et al. [38] proposed a feature fusion model for AD classification, which can comprehensively utilize multiple types of data to improve the classification performance.
Based on the fully convolutional network, this paper proposes a new method for image classification of Alzheimer's disease that combines the external-attention mechanism with double normalization [39]. First, we obtain the feature information of the disease probability map through the FCN model, then select the region of interest (ROI) according to the MCC heatmap of the FCN model, and finally combine it with age, gender and MMSE as the input of the MLP model to classify AD and NC images. The contributions of this paper are: (1) we propose a method for image classification of Alzheimer's disease based on the external-attention mechanism and a fully convolutional network; (2) we add a self-attention module to the FCN model as a comparative experiment to highlight the effectiveness and efficiency of the external-attention mechanism; (3) in the normalization of the attention map, the double normalization of Softmax and the L1 norm replaces the original Softmax, which improves the classification performance by a small margin.

Datasets
In order to prove the effectiveness of the self-attention mechanism and the external-attention mechanism, this paper uses T1-weighted Alzheimer's disease MRI images from the ADNI dataset for the experiments. The specific details of the dataset are shown in Table 1. The experiments select scans of individuals over 55 years old, comprising a total of 550 1.5 T three-dimensional MRI images, of which 307 images belong to Alzheimer's disease (AD) patients and 243 images belong to normal cognitive persons.

Image Preprocessing
We use the FLIRT tool in the FSL software package to align the brain magnetic resonance images with the MNI152 public template. An affine transformation matrix determines the parameters of the transformation function, and the original image is then transformed into a standard form according to these parameters; the final image size is 182 × 218 × 182. FLIRT uses coordinate rotation, translation, scaling, and shearing to match two images together, and the cost function O(w) is expressed as a sum of squares, with the intensity difference between the input image and the public template as the optimization objective:

O(w) = Σ_i [f(x_i) − g(x_i')]²

where f(x_i) represents the intensity of the public template, g(x_i') represents the intensity of the input image, and x_i' represents the value of x_i after affine transformation.
After the image is registered, we normalize the voxel values of all images. Then we control these voxel values and other outliers within a certain range to avoid the interference of background information. The flowchart of image preprocessing is shown in Figure 1.
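The normalization and outlier-control step can be sketched as below. The paper does not give exact formulas, so this assumes min-max scaling to [0, 1] followed by clipping; the function name and clip range are hypothetical.

```python
# Hypothetical sketch of the voxel normalization and outlier-clipping step.
# Assumption: min-max scaling to [0, 1], then clipping extreme values so that
# background noise and outliers stay within a fixed range.

def normalize_and_clip(voxels, clip_low=0.0, clip_high=0.99):
    """Scale voxel intensities to [0, 1], then clip values outside the range."""
    lo, hi = min(voxels), max(voxels)
    scaled = [(v - lo) / (hi - lo) for v in voxels]          # min-max normalization
    return [min(max(v, clip_low), clip_high) for v in scaled]  # clip outliers

volume = [0.0, 50.0, 100.0, 400.0]  # toy 1-D stand-in for a 3-D volume
print(normalize_and_clip(volume))
```

In real preprocessing the same operation would be applied to every voxel of each registered 182 × 218 × 182 volume.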



Experimental Settings and Evaluation Criteria
The experiments in this paper use the Pytorch deep learning framework and GeForce RTX 3090 GPU processor. The FCN model uses Adam optimizer and cross-entropy loss function. In addition, the experimental parameters are set: batch size is 10, learning rate is 0.0001, and the number of training iterations is 3000. The validation set is verified every 20 iterations, then the optimal model and weights are saved. Finally, the optimal model is tested with the test set to obtain the classification performance of the FCN model and the feature information of the disease probability map.
We use accuracy and Matthews correlation coefficient (MCC) to evaluate the classification performance of the FCN model. In addition, we also record the age, gender and MMSE of AD patients and normal cognitive persons in this dataset, as the input of the MLP model's classification experiments. As for the MLP model, we use accuracy (marked as Accu), sensitivity (marked as Sens), specificity (marked as Spec), F1 score and MCC to evaluate its classification performance. F1 score is an indicator used to measure the two-class model in statistics. It takes into account the accuracy and recall of the classification model at the same time, and can be regarded as the harmonic average of the model's accuracy and recall. MCC comprehensively considers true positives, true negatives, false positives, and false negatives, and is a relatively balanced indicator in deep learning. In order to ensure the accuracy and reliability of the experiments, we use five-fold cross-validation for the experiments, repeat the experiments five times for the FCN model, and three times for the MLP model. The final classification performance of the models is represented by the mean and standard deviation.

Among them, true positives (TP) represent the correct predictions of positive samples, true negatives (TN) represent the correct predictions of negative samples, false negatives (FN) represent the false predictions of positive samples, and false positives (FP) represent the false predictions of negative samples.

Self-Attention Mechanism
A simplified diagram of the self-attention mechanism is shown in Figure 2. W_f, W_g and W_h are the weight matrices of the 1 × 1 × 1 convolutional layers, which learn the local dependencies of the image as well as the long-distance global dependencies. We take the feature map after the last convolutional block of the FCN model and pass it through three branches of 1 × 1 × 1 convolutional layers f(x), g(x), h(x); the number of channels of the three branches is C, where H, W and D represent the length, width, and depth of the feature map respectively. After that, we transpose the output of matrix f(x) and multiply it with the output of matrix g(x), then apply Softmax for normalization to obtain the attention map. Finally, we multiply the attention map with the output of matrix h(x) and pass it through a 1 × 1 × 1 convolutional layer to integrate the output into a self-attention feature map.
f(x) = W_f x, g(x) = W_g x, h(x) = W_h x, where x ∈ R^{C×D×H×W} is the original feature map before the input of the three branches.
f(x_i) denotes the values of all channels at the i-th pixel position, and g(x_j) denotes the values of all channels at the j-th pixel position. The attention weight β_{j,i}, which means the degree of attention of the model to the i-th region when synthesizing the j-th region, is obtained by applying Softmax to the similarities s_{i,j} = f(x_i)^T g(x_j):

β_{j,i} = exp(s_{i,j}) / Σ_i exp(s_{i,j})
The output of the self-attention module is defined as:

o_j = Σ_i β_{j,i} h(x_i)

We multiply the output o_j of the self-attention module by a weight coefficient γ and then add it to the input feature map x_i to get the final output y_i of the self-attention module:

y_i = γ o_i + x_i

Among them, γ is a learnable parameter, and its function is to enable the network to learn by itself the proportion of global dependence in the feature map.
The changes in tensor dimensions of the feature map in the self-attention module are shown in Figure 3. In the process of feature map convolution, in order to let the eigenvalues in the image interact with each other, we change the number of channels and the matrix dimensions of the convolution. D × H × W means the total pixels of the feature map.
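As a toy illustration of the self-attention computation just described, the sketch below works on a tiny flattened feature map with N = 3 pixel positions and C = 2 channels; the weight matrices and the value of γ are made-up stand-ins for the learned 1 × 1 × 1 convolutions, not values from the paper.

```python
import math

# Toy sketch of self-attention: s_ij = f(x_i)^T g(x_j), beta = softmax(s),
# o = beta * h(x), y_i = gamma * o_i + x_i. All weights are illustrative.

def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # N=3 pixel positions, C=2 channels
Wf = [[1.0, 0.0], [0.0, 1.0]]              # stand-ins for the conv branches W_f,
Wg = [[1.0, 0.0], [0.0, 1.0]]              # W_g and W_h
Wh = [[0.5, 0.0], [0.0, 0.5]]

f, g, h = matmul(x, Wf), matmul(x, Wg), matmul(x, Wh)
scores = matmul(f, [list(col) for col in zip(*g)])  # s_ij = f(x_i)^T g(x_j)
beta = [softmax(row) for row in scores]             # attention map, rows sum to 1
o = matmul(beta, h)                                 # weighted sum of h(x)
gamma = 0.1                                         # learnable scale in the paper
y = [[gamma * ov + xv for ov, xv in zip(orow, xrow)] for orow, xrow in zip(o, x)]
print(y)
```

Note the residual form y = γ o + x: with γ initialized near zero, the module starts close to an identity mapping and gradually mixes in global dependencies.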

External-Attention Mechanism
A simplified diagram of the external attention mechanism is shown in Figure 4. Here, the linear transformation refers to the 1 × 1 × 1 convolutional layer processing, and Norm means double normalization; M_k and M_v are two external matrices.

Because the self-attention mechanism considers only the values within a single sample when constructing the attention map, the external-attention mechanism instead uses two external matrices M_k and M_v (linear layers without bias), to model the similarity between the i-th pixel and the j-th pixel. These matrices are learnable, and as training progresses they can also model the potential connections between different samples across the whole dataset.
F ∈ R^{C×D×H×W} is the output feature map of the last convolutional block, Norm means to normalize F M_k^T, and F_out is the output feature map of the FCN model after adding the external-attention module:

A = Norm(F M_k^T), F_out = A M_v

In the self-attention module, Softmax is used to normalize the attention map so that Σ_j α_{i,j} = 1. However, the attention map is calculated by matrix multiplication, which is very sensitive to the size of the input features: when a certain eigenvalue is very large or very small, its dot product with other eigenvalues will also become very large or very small. Therefore, the external-attention module uses the double normalization of Softmax and the L1 norm to solve this problem; that is, it first applies Softmax to the columns of the attention map, and then applies the L1 norm to the rows.
a_{i,j} represents the input feature map F multiplied by the transposed external matrix M_k, α̃_{i,j} represents normalizing the columns of the attention map a_{i,j} by Softmax, and α_{i,j} represents normalizing the rows of the attention map α̃_{i,j} by the L1 norm:

α̃_{i,j} = exp(a_{i,j}) / Σ_k exp(a_{k,j}), α_{i,j} = α̃_{i,j} / Σ_k α̃_{i,k}

Among them, the k in a_{k,j} and α̃_{i,k} represents the number of channels in the linear layer.
The changes in tensor dimensions of the feature map in the external-attention module are shown in Figure 5. First, we use a 1 × 1 × 1 convolutional layer to change the tensor dimension, and then multiply it with the transposed external matrix M_k. Here, k means the number of channels of the external matrix and D × H × W means the total pixels of the feature map. After that, we apply double normalization to the attention map. Then, we multiply the attention map with the external matrix M_v and feed it into a new 1 × 1 × 1 convolutional layer. Finally, we obtain the external-attention feature map with the same tensor dimension as the original feature map.
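A minimal sketch of the double-normalization step, assuming a toy 2 × 3 attention map (the values are illustrative, not from the paper): Softmax is applied down each column, then each row is divided by its L1 norm, so a single very large score cannot dominate the whole map.

```python
import math

# Double normalization from the external-attention module:
# step 1: Softmax over columns, step 2: L1 normalization over rows.
# The 2 x 3 attention map below is an illustrative toy value.

def double_norm(a):
    rows, cols = len(a), len(a[0])
    # Step 1: Softmax over columns: alpha~_{i,j} = exp(a_ij) / sum_k exp(a_kj)
    col_soft = [[0.0] * cols for _ in range(rows)]
    for j in range(cols):
        col = [a[i][j] for i in range(rows)]
        m = max(col)
        e = [math.exp(v - m) for v in col]
        s = sum(e)
        for i in range(rows):
            col_soft[i][j] = e[i] / s
    # Step 2: L1 norm over rows: alpha_{i,j} = alpha~_{i,j} / sum_k alpha~_{i,k}
    return [[v / sum(row) for v in row] for row in col_soft]

attn = [[1.0, 2.0, 100.0],   # one extremely large score in the first row
        [0.0, 1.0, 0.0]]
normed = double_norm(attn)
print(normed)                # every row sums to 1
```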

Model's Framework
The FCN model consists of four convolutional blocks and two fully connected layers. Each convolutional block includes a 3D convolutional layer, a 3D max-pooling layer, 3D batch normalization, Leaky ReLU and Dropout, as shown in Figure 6. The last two fully connected layers improve the efficiency of the model in the classification task. The network is trained with randomly initialized weights. As shown in Figure 7, we adopt a method of randomly sampling patches of the 3D-MRI images to train the FCN model, that is, training on patches of size 47 × 47 × 47 randomly sampled from the 3D-MRI images. The size of each patch is the same as the receptive field of the FCN model.

Because the convolution operation reduces the output size of the input layer, each patch generates two scalar values after being trained by the FCN model.
Then they are converted into the Alzheimer's disease probability and normal cognitive probability of the corresponding pixel under the action of the activation function Softmax. The disease state of the brain's local structure is displayed through each pixel's risk probability value of Alzheimer's disease. The corresponding feature information of the disease probability map will be used as auxiliary information of the MLP model for classification experiments.
In order to conduct comparative experiments with the FCN model, we also use the same network for the CNN model, as shown in Figure 8. The specific parameter settings of the hidden layer of the CNN model are shown in Table 2. After that, we build the MLP model's framework, as shown in Figure 9. The MLP model consists of two fully connected layers, with batch normalization, Leaky ReLU and Dropout. For the MLP model, we select the probability value of Alzheimer's disease from the feature information of the disease probability map, and select the region of interest (ROI) based on the MCC value of the FCN model, then combine these with the age, gender and MMSE of the MRI image as the input of the MLP model to reclassify MRI images.
Among them, the MCC value can show the overall classification performance of the FCN model, and the MCC heatmap can show that the FCN model has a higher classification accuracy for certain pixel positions of the 3D-MRI image. Therefore, we select these high-accuracy regions as regions of interest (ROI).
According to Figure 9, the MLP-A model indicates that only the feature information of the disease probability map of the FCN model is used to classify MRI images; The MLP-B model indicates that only the image information of age, gender and MMSE are used to classify MRI images; The MLP-C model indicates combining the feature information of the disease probability map of the FCN model with age, gender, MMSE to classify MRI images.
In addition, the specific FCN model parameter settings and the changes of the output patch size in our experiments are shown in Table 3.
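The random patch sampling described above can be sketched as follows; the helper name random_patch_origin is hypothetical, and real code would use the returned corner to slice the MRI array.

```python
import random

# Sketch of the random patch sampling used to train the FCN: a 47 x 47 x 47
# patch is drawn from a 182 x 218 x 182 registered volume. This helper only
# computes a valid patch corner; slicing the actual array is left out.

VOLUME = (182, 218, 182)
PATCH = 47

def random_patch_origin(shape=VOLUME, size=PATCH, rng=random):
    """Return a (z, y, x) corner such that the patch lies fully inside the volume."""
    return tuple(rng.randrange(dim - size + 1) for dim in shape)

corner = random_patch_origin()
print(corner)
```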

Experiments and Results
First, this paper conducts experiments on the original FCN model and MLP model, then adds the self-attention module and the external-attention module respectively for multiple experiments. Finally, we use the mean and standard deviation to represent the classification performance of the model. The double normalization of Softmax and L1 norm are used to replace Softmax in the external-attention module. The MCC heatmap of the FCN model is shown in Figure 10.

The feature information of the disease probability map generated by the FCN model is shown in Figure 11. Red and blue indicate the probability of suffering from Alzheimer's disease in different parts of the brain, with 0.5 as the dividing line between the two.
Figure 11. (A) The disease probability map generated by the FCN model highlights the brain regions at high risk of Alzheimer's disease. The first two samples were clinically diagnosed as patients with Alzheimer's disease, and the latter two samples were clinically confirmed as normal cognitive persons. (B-D) show the axial, coronal and sagittal disease probability maps of patients clinically diagnosed with Alzheimer's disease. Red indicates that the risk of Alzheimer's disease is > 0.5, and blue indicates < 0.5.
We have summarized the changes in the accuracy of the FCN models and the MLP models in different situations. The changes in the accuracy of the FCN models and the MLP models after adding the self-attention mechanism and the external-attention mechanism are shown in Figure 12.
In detail, the changes in the MLP models' accuracy after adding the self-attention mechanism and the external-attention mechanism, as well as the double normalization, are shown in Figure 13. We also plot the changes in the MLP models' MCC value after adding the self-attention mechanism, the external-attention mechanism and double normalization, as shown in Figure 14. Next, we list the various experimental results of the FCN models and the MLP models, which are the final classification performance (mean and standard deviation), including accuracy, sensitivity, specificity, F1 score and MCC. The classification performance of the FCN model without any attention module is shown in Table 4. As a comparative experiment, the experimental results of the CNN model and the MLP fusion model are shown in Table 5. Among them, the fusion model means that the MLP model combines the feature information of the CNN model with age, gender, and MMSE to classify MRI images. Comparing the experimental results of the MLP-C in Table 4 with the fusion model in Table 5, it can be found that selecting the region of interest (ROI) of the MLP model through the MCC heatmap of the FCN model, and combining it with the feature information of the disease probability map, improves the classification performance more than the MLP fusion model.
We use accuracy and MCC to evaluate the classification performance of the FCN model, as shown in Table 6. Here, the MCC is calculated by treating each pixel in the 3D-MRI image as a sample. After each pixel in the 3D-MRI image is trained by the FCN model, a predicted probability value of Alzheimer's disease is generated. The prediction of each pixel is compared with the input label, and the corresponding pixel is then marked as TP, TN, FP or FN.
The classification performance of the MLP models after adding the self-attention module is shown in Table 7. The results in Table 6 show that after the self-attention module is integrated into the FCN model, the accuracy increases by about 1.155% and the MCC value by about 2.454%. Comparing the experimental results in Tables 4 and 7, it can be found that after adding the self-attention module, the accuracy of the MLP models increases by about 0.57% to 2.62%, and the MCC value by about 0.60% to 3.03%. This shows the effectiveness of the self-attention mechanism in improving the classification performance of the model.
On the other hand, the experimental results in Table 6 show that after adding the external-attention module and double normalization, compared with the original FCN model, the accuracy increases by about 3.366% and the MCC value by about 5.195%. Furthermore, compared with Softmax alone, double normalization increases the accuracy by about 0.803% and the MCC value by about 0.432%.
After adding the external-attention module, the classification performance of the MLP models using double normalization or Softmax respectively is shown in Tables 8 and 9, which compare the classification performance of double normalization against Softmax alone in the external-attention mechanism. Experimental results show that double normalization can increase the accuracy of the MLP models by about 0.22% to 1.12%, and the MCC value by about 0.21% to 2.39%. This shows that double normalization can improve the classification performance by a small margin, highlighting its effectiveness.
The classification index sensitivity represents the proportion of all positive samples that are correctly identified, and measures the model's ability to discriminate positive samples. From the experimental results, after adding the external-attention mechanism with double normalization to the FCN model, the classification sensitivity of the MLP-A model is 92.6%, that of the MLP-B model is 99.02%, and that of the MLP-C model is 99.29%. This also shows that our proposed model has a high discriminative ability for images of Alzheimer's disease patients.
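Sensitivity (also called recall) follows directly from the pixel- or subject-level confusion counts. A minimal sketch, with the function name chosen for illustration:

```python
def sensitivity(tp, fn):
    """Sensitivity (recall): the fraction of all positive samples
    that the model correctly identifies, TP / (TP + FN)."""
    if tp + fn == 0:
        return 0.0  # no positive samples to evaluate
    return tp / (tp + fn)
```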

Discussion
The above experimental results show that the external-attention mechanism can generate richer feature information of the disease probability map for the MLP models, thereby improving their classification performance. In addition, combining the MLP models with age, gender, and MMSE scores is more conducive to accurate image classification when the region of interest (ROI) is selected according to the FCN model.
We chose to add the self-attention module to the FCN model as a comparative experiment because the external-attention mechanism modifies the weight matrix of the self-attention mechanism. The self-attention mechanism calculates the direct interaction between any two locations, allowing the network to focus on areas scattered across different locations. However, self-attention only considers the correlations within a single sample and ignores the potential connections between samples. The external-attention mechanism therefore uses a learnable external matrix to establish potential correlations between samples. In addition, the Softmax in the self-attention module normalizes the attention map, but because the attention map is computed by matrix multiplication, it is sensitive to the scale of the input features and susceptible to particularly large or small feature values. The double normalization in the external-attention module solves this problem by first applying Softmax to the columns and then applying L1 normalization to the rows. The experimental results show that after adding the external-attention mechanism with double normalization to the fully convolutional network, the classification performance of the MLP models is better than the other comparison methods. However, owing to the many unknown factors in deep learning models, a limitation of our study is that it is difficult to quantify and visually analyze the correspondence between the model and its results, which hinders clinical application. We therefore expect further research on the interpretability of deep learning models in order to improve confidence in the classification results.
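The double-normalized external-attention step described above can be sketched as follows. This is an illustrative NumPy sketch rather than the authors' implementation: the names `F`, `Mk`, `Mv`, the number of memory slots `S`, and the 2D feature layout are assumptions.

```python
import numpy as np

def external_attention(F, Mk, Mv):
    """External attention with double normalization (sketch).

    F  : (n, d) input features (n pixels, d channels)
    Mk : (S, d) learnable external key memory, shared across samples
    Mv : (S, d) learnable external value memory
    """
    attn = F @ Mk.T  # (n, S) similarity between features and external memory
    # Double normalization, step 1: Softmax over the columns
    # (subtracting the column max keeps exp() numerically stable).
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)
    # Step 2: L1 normalization over the rows, so no single large
    # (or small) feature value dominates the attention map.
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ Mv  # (n, d) attended features
```

Because the memory matrices `Mk` and `Mv` are shared across all inputs, the attention map can capture correlations between different samples, which is the property self-attention lacks.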
Finally, we compare our method with models from other references on Alzheimer's disease, as shown in Table 10. Different classification techniques have their own advantages and disadvantages. For example, SVM uses an inner-product kernel function in place of an explicit nonlinear mapping to a high-dimensional space, and its goal is to find the optimal classification hyperplane dividing the feature space [40]. However, its disadvantages are that training on large-scale samples is difficult, and it is sensitive to the choice of parameters and kernel function.

Conclusions and Future Work
In this paper, the self-attention mechanism models the correlations within each sample to obtain the corresponding attention feature maps, which plays a certain role in improving the classification performance of the model. Moreover, this paper proposes a new method for the image classification of Alzheimer's disease based on the external-attention mechanism and double normalization, which embeds the external-attention module after the last convolutional block of the FCN model. The detailed experimental results show that the external-attention mechanism is effective and efficient in improving classification performance. We used the double normalization of Softmax and L1 norm to replace the original Softmax, which increases the classification accuracy of the MLP models by about 0.22% to 1.12%. The core of this paper is to integrate the external-attention mechanism into the FCN model to obtain richer and more detailed feature information of the disease probability map.
In the future, we will consider using other deep learning methods or new efficient attention mechanisms to continuously tap the potential of fully convolutional networks. In the image preprocessing, we will try to use different image denoising and smoothing methods to effectively remove noise from the original MRI image, and further improve the classification performance of the model. In addition, we will try different experimental approaches using 2D slices and 3D patches to compare the classification performance of the two [43], which has special implications for Alzheimer's disease classification research.