Multi-View Based Multi-Model Learning for MCI Diagnosis

Mild cognitive impairment (MCI) is the early stage of Alzheimer’s disease (AD). Automatic diagnosis of MCI by magnetic resonance imaging (MRI) images has been the focus of research in recent years. Furthermore, deep learning models based on 2D view and 3D view have been widely used in the diagnosis of MCI. The deep learning architecture can capture anatomical changes in the brain from MRI scans to extract the underlying features of brain disease. In this paper, we propose a multi-view based multi-model (MVMM) learning framework, which effectively combines the local information of 2D images with the global information of 3D images. First, we select some 2D slices from MRI images and extract the features representing 2D local information. Then, we combine them with the features representing 3D global information learned from 3D images to train the MVMM learning framework. We evaluate our model on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. The experimental results show that our proposed model can effectively recognize MCI through MRI images (accuracy of 87.50% for MCI/HC and accuracy of 83.18% for MCI/AD).


Introduction
Alzheimer's disease (AD), a neurodegenerative brain disease caused by multiple factors, is one of the most common chronic diseases in old age [1]. This disease usually causes progressive and disabling impairments of cognitive function, including memory, language, understanding and attention [2]. In 2015, it was estimated that about 47 million people worldwide had AD, and the number is expected to reach 141 million by 2050 [3]. At present, there is no practical method to cure AD [4], so early diagnosis of AD is needed to obtain treatment time. Mild Cognitive Impairment (MCI) is an intermediate state between normal aging and dementia [5], and one study showed that 32% of MCI converted to AD within five years [6]. Therefore, early diagnosis and intervention of Alzheimer's disease is very important.
In the past few decades, neuroimaging has been widely used to study brain diseases [7][8][9]. Neuroimaging technology provides anatomical and functional images of the brain, such as Positron Emission Computed Tomography (PECT), Structural Magnetic Resonance Imaging (SMRI), Diffusion Magnetic Resonance Imaging (DMRI), Functional Magnetic Resonance Imaging (FMRI), Electroencephalogram (EEG), and Magnetoencephalography (MEG) [10][11][12]. Among them, SMRI is often used for the characterization and prediction of AD due to its relatively low cost and good imaging quality. Previous studies have shown that the volume and thickness of the brain are closely related to AD [13], the hippocampus region of AD patients is one third smaller than that of healthy subjects [14], and the medial temporal lobe region is the most effective region of the brain for identifying patients with MCI [15].

Materials and Methods
We propose a new method of MCI diagnosis based on a multi-view based multi-model (MVMM) framework. The MVMM framework mainly includes a 3D model for extracting global features and a 2D model for extracting local features. The 3D model is the Dilated Residual Network (DRN), which adds an Efficient Spatial Pyramid (ESP) module to a residual network. The 2D model is the Dual Attention Inception Network (DAIN), which adds a dual attention mechanism to the Inception network. The flow of our MVMM framework is shown in Figure 1. First, the gray matter (GM) image of each subject is provided in two forms, the whole-brain gray matter volume and a set of selected two-dimensional slices, which are fed into the 3D and 2D models, respectively. Finally, the local features and global features are concatenated for integrated training.

Data Acquisition
In this work, the MRI images we use are obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu/) [42]. ADNI started in 2004 under the leadership of Dr. Michael W. Weiner. ADNI is a longitudinal multicenter study with the primary goal of early detection of AD and the use of biomarkers to track disease progression. ADNI has launched three phases so far, namely ADNI1 (2004-2009), ADNI2/GO (2010-2016) and ADNI3. At each phase of ADNI, new participants are recruited and agree to complete various imaging acquisitions and clinical evaluations. Later phases include follow-up scans of some previously scanned subjects and scans of new subjects. In this paper, we select the SMRI data acquired at 1.5 Tesla [43]. The data we use are obtained from 649 subjects, which include 175 scans of AD, 214 scans of healthy controls (HC), and 260 scans of MCI. The demographic and clinical characteristics of all subjects are reported in Table 1. The raw data usually contain much noise, so we preprocess the MRI data first. In this paper, we use the voxel-based morphometry preprocessing method. Specifically, we use the CAT12 toolbox, an extension to SPM12 [44] for computational anatomy. First, we register the MRI images to the standard space through the DARTEL (Diffeomorphic Anatomical Registration Through Exponentiated Lie Algebra) algorithm [45]. Second, we use the maximum a posteriori and partial volume estimation segmentation techniques [46] to segment the image into gray matter, white matter, and cerebrospinal fluid. Then, the Jacobian determinant is used to modulate the gray matter image nonlinearly. Finally, the gray matter image is spatially smoothed with an 8 mm Gaussian smoothing kernel. The size of each gray matter image in the standard space is 121 × 145 × 121; we then use the scikit-image package to resample it to a size of 96 × 96 × 96.
It is noted that gray matter loss in the medial temporal lobe is characteristic of MCI [47], so we use gray matter images to analyze in this paper. The MRI images before and after preprocessing are shown in Figure 2.

DRN Model Based on a 3D View
For the whole-brain three-dimensional view, we take the preprocessed whole gray matter image directly as input. To learn the 3D global information more comprehensively, we use 3D convolutional neural networks to extract global features related to AD. The 3D gray matter image contains the entire brain and is therefore very informative, so learning useful features comprehensively is a challenge. Convolutional neural networks have developed rapidly. Since the introduction of AlexNet in 2012 [48], advanced convolutional neural network models have grown progressively deeper. But as the depth increases, the vanishing gradient problem during training becomes more serious. To avoid it, the residual network (ResNet) [49] introduces an identity shortcut connection that skips one or more layers directly. Assuming that layer l − 1 is connected to layer l, the output $x_l$ of layer l is:
$$x_l = H(x_{l-1}) + x_{l-1}$$
where H(·) represents a non-linear transformation function, including batch normalization (BN), rectified linear unit (ReLU), and convolution operation. ResNet reduces the difficulty of training deep networks by adding shortcut connections. For AD-related information, ResNet uses linear activation to obtain identity mapping. In contrast, ResNet uses non-linear activation for redundant information not related to AD. Since the non-linear activation is applied to redundant information, less useful information is lost. ResNet effectively solves the problem of network degradation caused by excessive depth, so that the model can learn more powerful high-level features.
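The shortcut connection described above can be illustrated with a minimal sketch. It is not the network used in this paper: `toy_transform` is a hypothetical stand-in for H(·) (a scaling followed by ReLU instead of BN, ReLU and convolution), and feature maps are reduced to flat lists of numbers.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def toy_transform(values, weight=0.5):
    """Hypothetical stand-in for H(.): scale each value, then apply ReLU.

    The real H(.) combines batch normalization, ReLU and convolution.
    """
    return relu([weight * v for v in values])


def residual_block(x_prev, transform=toy_transform):
    """Return x_l = H(x_{l-1}) + x_{l-1}."""
    transformed = transform(x_prev)
    return [h + x for h, x in zip(transformed, x_prev)]


x_prev = [1.0, -2.0, 4.0]
x_next = residual_block(x_prev)
```

If the transformation outputs zeros, the block reduces to the identity mapping, which is why stacking such blocks does not make a deeper network harder to train than a shallower one.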
In this paper, ResNet-18 is used as the basic model to train on gray matter images from the 3D view to obtain disease characteristics related to AD. However, the traditional ResNet model only uses 3 × 3 × 3 convolution kernels, and the receptive field of a single convolution kernel can only reach 27 voxels. To give the convolution kernel a larger receptive field, and to allow the model to learn 3D features more comprehensively, we add dilated convolution [50] to the ResNet network. Assuming that the size of the convolution kernel is k and the dilation rate is r, the receptive field (RF) of the dilated convolution is:
$$RF = \left[k + (k - 1)(r - 1)\right]^{3}$$
When r = 1, it is ordinary convolution, and the receptive field is $k^3$. When r = 2, taking a 2D 3 × 3 convolution as an example (where the exponent is 2 instead of 3), the dilated convolution increases the receptive field from 9 to 25 without adding parameters, because the weights at the positions inserted by dilation are 0. Although dilated convolutions increase the receptive field, the gridding (mesh) effect is prone to occur if they are misused. Therefore, we use an efficient spatial pyramid (ESP) [51] module to avoid the mesh effect. Specifically, a 1 × 1 × 1 convolution is performed on the input to obtain a feature map of n × n × n × f, and then four parallel dilated convolutions are used. Finally, the hierarchical features are merged to obtain a feature map of the same size. The structure of the ESP module is shown in Figure 3, where r1, r2, r3 and r4 are different dilation rates. It can be seen that the residual operation is also used in the ESP module, and the information from different receptive fields is concatenated before the residual operation to ensure the output quality. We replace one layer in the original ResNet-18 network with an ESP module to learn the 3D global features of the brain image more effectively. The overall structure of the final three-dimensional DRN model is shown in Figure 1.
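The receptive field arithmetic above can be checked with a few lines of Python. The function names are illustrative and not taken from the paper; the `dims` argument is an added convenience so that the 2D example (9 to 25) and the 3D case (27) can both be reproduced.

```python
def effective_kernel_size(k, r):
    """Side length spanned by a kernel of size k with dilation rate r."""
    return k + (k - 1) * (r - 1)


def receptive_field(k, r, dims=3):
    """Number of input positions spanned by one dilated kernel in `dims` dimensions."""
    return effective_kernel_size(k, r) ** dims


def weight_count(k, dims=3):
    """Number of learnable weights per kernel; it does not depend on the dilation rate."""
    return k ** dims


rf_plain_3d = receptive_field(3, 1)            # ordinary 3 x 3 x 3 convolution
rf_plain_2d = receptive_field(3, 1, dims=2)    # ordinary 3 x 3 convolution
rf_dilated_2d = receptive_field(3, 2, dims=2)  # 3 x 3 convolution with r = 2
```

The number of weights stays at `weight_count(3, dims=2)` for both 2D cases, which is the point of using dilation rather than a larger kernel.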
The pre-processed whole-brain gray matter image contains much information. In order to effectively learn useful information, we choose 3D deep convolutional neural network for training. We use ResNet-18 with shortcut connections as our base model. At the same time, we also use dilated convolution to expand the receptive field in order to learn AD-related features more comprehensively.

DAIN Model Based on 2D View
For the 2D local view, we select some slices from the preprocessed gray matter image for training. A large number of 2D slices can be extracted from the 3D gray matter image, and choosing the best training data is very important for the success of the entire method. In this paper, we select 2D images based on image entropy [52] and extract the most informative slices to train the network. Generally, for a set of M gray values with probabilities $p_1, p_2, \ldots, p_M$, the entropy can be calculated as follows:
$$H = -\sum_{i=1}^{M} p_i \log p_i$$
The higher the entropy, the more information the image contains. A remaining concern is that MRI images generally contain much noise, so blindly selecting the images with the most information may select some useless images dominated by noise. However, the images we study are preprocessed by CAT12. Compared with the original MRI image, the image we use is standardized and smoothed, and the skull is removed. Therefore, we sort the slices in descending order of entropy and select the first 32 images for training to provide robustness, following previous research [53].
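The entropy-based slice selection can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: slices are nested lists of integer gray values (real gray matter maps would first need to be binned), the logarithm is taken to base 2, and ties are broken by slice index.

```python
import math
from collections import Counter


def image_entropy(slice_2d):
    """Shannon entropy (in bits) of the gray-value histogram of one 2D slice."""
    values = [v for row in slice_2d for v in row]
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_slices(volume, n_slices=32):
    """Return the indices of the n_slices highest-entropy slices, best first."""
    scored = [(image_entropy(s), i) for i, s in enumerate(volume)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:n_slices]]


# Toy volume with three 2 x 2 slices (made-up values).
volume = [
    [[0, 0], [0, 0]],  # constant slice, entropy 0
    [[0, 1], [2, 3]],  # four distinct values, entropy 2 bits
    [[0, 0], [1, 1]],  # two equally likely values, entropy 1 bit
]
top_two = select_slices(volume, n_slices=2)
```

With the default `n_slices=32`, the function mirrors the choice of the first 32 slices in descending order of entropy.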
After obtaining the selected 2D slices from the 3D MR image, we use them as the inputs of the 2D model to learn AD features. Assume that the image set of each subject is $X = \{x_1, x_2, \ldots, x_{32}\}$, where $x_j$ denotes the j-th selected slice; this set forms the input of the two-dimensional model. The thirty-two slices are selected from the axial view of each subject's brain image. To enhance the representation ability of the model, we use the Inception structure [54] to obtain information on different scales and fuse the features learned by convolution kernels of different sizes. In the 2D model, we pay more attention to the local information, as the selected slices are comparatively informative. Finding the local information that can distinguish MCI from other categories is the focus of our research. The attention mechanism addresses this problem by enabling the model to consider the global context while focusing on the more critical local information. We use the dual attention network (DAN) proposed by Fu et al. [55], which combines channel attention and spatial attention. The spatial (position) attention module (PAM) in DAN first performs three 1 × 1 convolutions on the feature map A to obtain three feature maps B, C, and D of the same size (h × w × c). These three feature maps are then reshaped to the size n × c (n = h × w), and the spatial attention map S of size n × n is calculated from B and C. We multiply the spatial attention map by D, and then multiply the result by the scale coefficient α (initialized to 0). Finally, the output of PAM is obtained by reshaping the product to h × w × c and adding the original feature map A:
$$E_{PAM} = \alpha \, (S D) + A$$
The channel attention module (CAM) in DAN first reshapes the feature map A into $A_r$ of size c × n, and multiplies $A_r$ and $A_r^{T}$ to obtain the channel attention map X with the size of c × c. We multiply the channel attention map by $A_r$ and then multiply the result by the scale coefficient β (initialized to 0). The product is reshaped to the size of h × w × c. Finally, the output of CAM is obtained by adding the original feature map A:
$$E_{CAM} = \beta \, (X A_r) + A$$
Finally, the results of the two attention modules are added together to form the output of DAN:
$$E_{DAN} = E_{PAM} + E_{CAM}$$
We combine DAN with Inception to form the final two-dimensional model. The structure of our proposed DAIN model is shown in Figure 1. The proposed DAIN model first obtains multi-scale AD information through the Inception structure. Then, the dual attention mechanism is used to obtain more significant local information. Finally, the important local information is combined with the fused multi-scale information to obtain the AD features that represent the 2D view.
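The two attention branches can be sketched in plain Python on a small feature matrix. This is a simplified illustration, not the DAIN implementation: the 1 × 1 convolutions are omitted (so B = C = D = A, an assumption made only to keep the example short), the feature map is already reshaped to n × c, and both attention maps are normalized with a row-wise softmax as in Fu et al. [55].

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def row_softmax(a):
    """Softmax over each row, shifted by the row maximum for numerical stability."""
    out = []
    for row in a:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def scale_add(scale, update, base):
    """Return scale * update + base, element-wise."""
    return [[scale * u + b for u, b in zip(ur, br)] for ur, br in zip(update, base)]


def position_attention(a, alpha):
    """PAM on an (n x c) matrix; B = C = D = a in this sketch."""
    b, c, d = a, a, a
    attention = row_softmax(matmul(c, transpose(b)))  # n x n attention map
    return scale_add(alpha, matmul(attention, d), a)


def channel_attention(a, beta):
    """CAM on the same (n x c) matrix, using A_r = a^T of size (c x n)."""
    a_r = transpose(a)
    attention = row_softmax(matmul(a_r, transpose(a_r)))  # c x c attention map
    return scale_add(beta, transpose(matmul(attention, a_r)), a)


def dual_attention(a, alpha, beta):
    """Element-wise sum of the PAM and CAM outputs."""
    pam = position_attention(a, alpha)
    cam = channel_attention(a, beta)
    return [[p + q for p, q in zip(pr, qr)] for pr, qr in zip(pam, cam)]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 positions, c = 2 channels
at_init = dual_attention(features, alpha=0.0, beta=0.0)
trained = dual_attention(features, alpha=0.5, beta=0.5)
```

With α = β = 0, as at initialization, each branch returns A unchanged, so the summed output is simply 2A; the attention terms only contribute once the scale coefficients move away from zero during training.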

Combination of 2D and 3D Views
Before constructing the multi-view model, we first integrate the outputs of the DAIN model. This step is needed because the DAIN model above is trained with 32 slices selected for each subject, with each slice treated as a separate sample. In this study, we use the above DAIN model as the pre-trained model of the final 2D model. In the final 2D model, the features extracted from each of the 32 images are concatenated together for classification. The final fully connected layer gives the MCI feature representing the local information extracted from the 2D slices.
After integrating the 2D features, we combine the features extracted from the 2D model with the features extracted from the 3D model. Specifically, the DAIN model produces 32 fully connected feature vectors of size 32 for every MRI image, one per slice. Then, in the final 2D model, the 32 vectors of the same subject are concatenated to obtain a feature of size 32 × 32. After this layer, we add another fully connected layer of size 32. That is, the information learned from the same subject by the DAIN model is nonlinearly integrated, and the integrated features are taken as the final two-dimensional AD features. The 2D model and the 3D model are trained separately. Finally, the fully connected features of the 2D model and of the 3D model are concatenated for training. This combination of 2D and 3D models is our proposed MVMM framework, which can learn both the global information of 3D images and the local information of 2D slices.
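The feature sizes in this integration step can be traced with a small sketch. Only the numbers 32 (slices) and 32 (features per slice and per integrated vector) come from the text; the size of the 3D feature vector, the placeholder weights and the use of ReLU in the integrating layer are assumptions made for illustration.

```python
N_SLICES = 32         # slices per subject (from the text)
SLICE_FEATURES = 32   # feature size per slice after DAIN (from the text)
LOCAL_FEATURES = 32   # size of the integrated 2D feature (from the text)
GLOBAL_FEATURES = 32  # hypothetical size of the 3D feature vector


def concatenate(vectors):
    """Flatten a list of feature vectors into one long vector."""
    return [value for vector in vectors for value in vector]


def dense_relu(vector, weights, bias):
    """Fully connected layer with ReLU (assumed non-linearity): one output per weight row."""
    return [max(0.0, sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, bias)]


# Placeholder values standing in for learned features and weights.
slice_features = [[0.1] * SLICE_FEATURES for _ in range(N_SLICES)]
stacked = concatenate(slice_features)  # 32 x 32 = 1024 values per subject
weights = [[0.01] * len(stacked) for _ in range(LOCAL_FEATURES)]
local_features = dense_relu(stacked, weights, [0.0] * LOCAL_FEATURES)
global_features = [0.5] * GLOBAL_FEATURES
fused = concatenate([local_features, global_features])
```

The fused vector is what a final classifier would receive in this sketch; in the framework itself the 2D and 3D branches are trained separately before this concatenation.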

Experimental Setup
We implement the MVMM method in TensorFlow and train the model on an NVIDIA TITAN V GPU. The loss function we adopt is binary cross-entropy. We use He normal initialization for the weights of the model. The learning rate of the 2D model is 0.001, while that of the 3D model and MVMM is 0.0001. The evaluation of the proposed method in this paper is based on the following two tasks: (1) T1: MCI/HC classification; (2) T2: MCI/AD classification.
To evaluate the classification performance, we adopt the 10-fold cross-validation strategy 10 times. We randomly select ten percent of the images from each class as the test set, and the remaining images are further randomly divided into 10 subsets for each category. In the process of cross-validation, each subset is taken as the validation set in turn, and the rest are used as the training set. This process is repeated ten times to get the final result. In this paper, the accuracy (ACC), sensitivity (SEN) and specificity (SPE) are used for evaluation. The three classification performance measures are calculated as follows:
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}, \quad SEN = \frac{TP}{TP + FN}, \quad SPE = \frac{TN}{TN + FP}$$
Among them, TP represents true positives, TN represents true negatives, FP represents false positives and FN represents false negatives. In addition, we use the area under the receiver operating characteristic curve (AUC) to measure classification performance. The value with the best result for each measure is shown in bold in all tables of experiments.
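The three measures can be computed directly from the confusion counts, as in the sketch below. The counts in the example are made up, and which class is treated as positive in each task is not stated in the text, so that choice is left to the caller.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, sensitivity and specificity from confusion counts."""
    total = tp + tn + fp + fn
    return {
        "ACC": (tp + tn) / total,  # share of all subjects classified correctly
        "SEN": tp / (tp + fn),     # share of positives that are detected
        "SPE": tn / (tn + fp),     # share of negatives that are rejected
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=20, tn=15, fp=5, fn=10)
```

A model can reach a high ACC with a low SPE when the classes are imbalanced, which is why the three measures are reported together with AUC.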

Experimental Results
In this section, we present the respective results of the single-view and multi-view models. For the T1 task, it can be observed from Table 2 that the classification performance of the MVMM model based on multiple views (87.50% for ACC) is better than that of the DRN model based on the 3D view (81.67% for ACC) and the DAIN model based on the 2D view (83.96% for ACC). For the T2 task, it can be observed from Table 3 that although the sensitivity of the MVMM model is lower than that of the DAIN model, the ACC, SPE and AUC of the MVMM model (83.18%, 70.56% and 0.8124) are higher than those of the DRN and DAIN models. Therefore, the performance of our MVMM model is the best overall. In addition, we perform Spearman's rank-order correlation analysis to examine the relationship between the misclassification rate of the MVMM model and the MMSE scores of the subjects. As shown in Figure 4, the probability of MCI being misclassified as AD is negatively correlated with the MMSE score (r = −0.418, p = 0.033).
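Spearman's rank-order correlation is the Pearson correlation of the ranks, which the following sketch makes explicit. The MMSE scores and probabilities are made-up numbers chosen to be perfectly monotone, not data from this study, and the sketch computes only the coefficient, not the p-value.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


mmse_scores = [24, 26, 27, 29, 30]            # made-up MMSE scores
misclass_prob = [0.9, 0.7, 0.6, 0.3, 0.1]     # made-up probabilities
rho = spearman(mmse_scores, misclass_prob)
```

A negative coefficient, as reported above, means that subjects with lower MMSE scores tend to have a higher probability of being labeled AD.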

Discussion
To verify the rationality of the model proposed in this article, we discuss it from the following seven aspects: (1) selection of 3D models; (2) selection of 2D models; (3) selection of the number of 2D slices; (4) selection of combination methods; (5) performance on another dataset; (6) comparison with the whole-brain image without segmentation; (7) comparisons with related studies.

Selection of 3D Models
In this section, we compare the DRN 3D model used in this article with the VGGNet (VN) and DenseNet (DN) models. As shown in Tables 4 and 5, VN and DN model perform poorly in both tasks (T1: 76.04% and 78.33% for ACC; T2: 75.91% and 76.59% for ACC). This is because the 3D MRI images contain a large amount of information, and direct use of the existing network structure cannot get a good effect. The basic model we selected is the ND-DRN model that does not use dilated convolution, which can reduce the redundancy of data by continuously adding shortcut connections in the learning process. Therefore, the ACC obtained by ND-DRN is better than the previous two models (T1: 79.38%; T2: 76.82%). Moreover, for more comprehensive learning, we add dilated convolution to expand the receptive field in the DRN model. Compared with the ND-DRN model, the final DRN model improves the accuracy rate by about 2% to 4%. The ACC, SPE and AUC of our DRN model in T1 and T2 tasks are the highest, while SEN is the second highest. Therefore, from a comprehensive perspective, our DRN model has the best classification performance.

Selection of 2D Models
As in the selection of 3D models above, we compare the 2D model used in this article with classic convolutional network models (AlexNet (AN) and MobileNet (MN)). It can be seen from Tables 6 and 7 that the AUC of the AN and MN models is lower than that of the NA-DAIN model, which does not use the attention mechanism. Especially for the T2 task, the SPE of these two models is lower than 50%, which means that they cannot learn the deep local features of the 2D slices. The dual attention mechanism combines spatial and channel attention from a non-local perspective and makes full use of high-level brain features. In this way, more important information can be selected from the 2D data to diagnose MCI better. After adding the attention mechanism, the SPE of the DAIN model reaches 63.34%. At the same time, the ACC and AUC of the DAIN model are also the highest among these models (80.45% and 77.82%).

Selection of the Number of 2D Slices
In the two-dimensional model, we select the 32 slices with the largest entropy from each gray matter image. To verify the effect of the number of slices on the model performance, we select 8, 16, 24, 32, 40, 48, and 56 slices from each gray matter image for comparison. It can be seen from Tables 8 and 9 that the ACC of the model with eight slices is the worst (T1: 73.13%; T2: 70.68%), which means that too few 2D images cannot represent the brain information of the subject. When the number of slices gradually increases to 32, the ACC improves continuously (T1: from 73.13% to 83.96%; T2: from 70.68% to 80.45%). This indicates that, up to this point, more 2D slices allow more disease-related information to be learned. Nevertheless, it is worth noting that when the number of 2D slices exceeds 32, the ACC no longer improves. It can be inferred that 32 slices are sufficient to represent the subject's local brain information, and that adding further slices leads the model to learn local information that is not related to the disease, resulting in a decrease in accuracy.

Selection of Combination Methods
After obtaining 2D local features and 3D global features, we combine DRN and DAIN for training. Tables 10 and 11 demonstrate that the model has obtained good results, no matter if it is a combination of sum or concatenation. The concatenation fusion method we adopt combines 2D and 3D information from different spaces to achieve higher performance (accuracy of 87.50% for T1 and 83.18% for T2).
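The difference between the two fusion methods compared here can be shown in a short sketch. The feature vectors and their size are made up; the only point is that sum fusion needs vectors of equal length and mixes the two views element by element, whereas concatenation keeps the 2D and 3D features in separate positions.

```python
def fuse_by_sum(local_features, global_features):
    """Element-wise sum; both vectors must have the same length."""
    if len(local_features) != len(global_features):
        raise ValueError("sum fusion needs feature vectors of equal length")
    return [a + b for a, b in zip(local_features, global_features)]


def fuse_by_concatenation(local_features, global_features):
    """Concatenation; the two views occupy separate positions in the output."""
    return list(local_features) + list(global_features)


local_features = [0.2, 0.4, 0.6]   # made-up 2D features
global_features = [1.0, 0.0, 0.5]  # made-up 3D features
summed = fuse_by_sum(local_features, global_features)
joined = fuse_by_concatenation(local_features, global_features)
```

Concatenation doubles the input size of the following layer but lets that layer weight the two views independently.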

Performance on Another Dataset
In this section, we validate the generality of our proposed MVMM model on the OASIS (Open Access Series of Imaging Studies) dataset [56]. This dataset contains 3 or 4 individual MRI scans of 416 subjects aged 18 to 96. We use baseline MRI images of 198 subjects aged 60 or older, including 100 AD patients and 98 healthy controls. As shown in Table 12, we can also achieve good performance using the same preprocessing method and MVMM model on OASIS dataset. In other words, the MVMM model we proposed in this paper has potential to be used for wider research.

Comparison with the Whole-Brain Image without Segmentation
In this paper, the input of our model is the gray matter image after segmentation. In this section, we compare the classification performance using the whole-brain image (without segmentation) and using the gray matter image. It can be seen from Tables 13 and 14 that all classification measures of the model using whole-brain images are lower than those of the model using gray matter images. This result arises because whole-brain images contain more information than gray matter images, and we may need more data to learn the features associated with AD.

Comparisons with Related Studies
Refs. [58,59] studied the deep features of MRI from a 3D view through novel convolutional neural networks. Although the related studies improved MRI classification from 2D or 3D views, their results on the data used in this paper are not as good as those of our proposed MVMM model. Our proposed MVMM model takes into account the extraction of 2D local information as well as 3D global information, which is more comprehensive than information extracted from a single view. Furthermore, we perform the Student's t test on the ACC of different methods to compare the performance. As shown in Tables 15 and 16, our MVMM method is superior to the other methods for all classification tasks at P < 0.05.

Conclusions
In summary, we develop a multi-view based multi-model learning framework for the early diagnosis of Alzheimer's disease. First, we comprehensively learn global information from the 3D view through residual networks and dilated convolutions. Then, for the 2D view, we select the most representative slices through information entropy and use the Inception network and the dual attention mechanism to learn the more crucial local information. Finally, we combine the models from the 2D and 3D views to train for classification. In this paper, the deep features of MCI are thus studied both locally and globally by the MVMM learning framework. The experimental results show that the proposed method is effective and is expected to be used in the diagnosis of MCI. However, the different models in this paper are trained separately, and the features are only concatenated at the end. In future research, more sophisticated combination methods, such as combining texture features, can be considered to make full use of the data for MCI diagnosis.