Feature Selection and Transfer Learning for Alzheimer’s Disease Clinical Diagnosis

Abstract: Background and Purpose: Most studies on the diagnosis of Alzheimer’s Disease (AD) assume that the training and testing data are drawn from the same distribution. However, in the diagnosis of AD and mild cognitive impairment (MCI), this identical-distribution assumption may not hold. To address this problem, we apply transfer learning to the diagnosis of AD. Methods: Magnetic resonance (MR) images were segmented using the SPM-DARTEL toolbox and registered to the Automatic Anatomical Labeling (AAL) atlas; the gray matter (GM) tissue volume of each anatomical region was then computed as a feature. Information gain was used for feature selection. The TrAdaboost algorithm was used to classify AD, MCI, and normal control (NC) data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, and the “knowledge” learned from ADNI was transferred to AD samples from a local hospital. Classification accuracy, sensitivity, and specificity were calculated and compared with four classical algorithms. Results: In the transfer task from AD to MCI, 177 AD and 40 NC subjects were grouped as training data, and 245 MCI and the 45 remaining NC subjects were combined as testing data. The highest accuracy reached 85.4%, higher than that of the other four classical algorithms. Feature selection based on information gain reduced the number of features from 90 to 7, which effectively controlled redundancy. In the transfer task from ADNI to local hospital data, the highest accuracy reached 93.7% and the specificity reached 100%. Conclusions: The experimental results show that our algorithm has a clear advantage over classic classification methods, with higher accuracy and less fluctuation.


Introduction
Alzheimer's Disease (AD) is the most prevalent neurodegenerative dementia worldwide. It is reported that the number of affected patients is expected to double in the next 20 years, and that one in 85 people will be affected by 2050. Thus, the early and accurate diagnosis of AD is crucial. Mild cognitive impairment (MCI) has been identified as an intermediate stage between normal age-related cognitive decline and dementia [1]. It may provide a window of opportunity for treatment and intervention before the disease advances. However, MCI does not always lead to AD, and the MCI group is very heterogeneous. It is therefore highly relevant to differentiate MCI subjects who have a greater risk of developing AD within the next few years from those who will remain stable or even improve [2], that is, to identify MCI and to predict its risk of progressing to AD. Accumulated evidence demonstrates that individuals with AD have both functional and structural changes in their brains, such as loss of gray matter volume [3]. However, these findings are mainly based on group-level statistical comparison, and are thus of limited value for individual-based disease diagnosis. Therefore, some recent studies have tried to solve this problem by using magnetic resonance imaging (MRI) and machine learning algorithms [4,5]. Computer-based decision making has shown its efficiency in detecting diseases at an early stage.
Recently, many machine learning classification methods have been used for the early diagnosis of AD, including methods based on the structural brain atrophy measured by MRI [6][7][8]. Data mining approaches, such as principal component analysis (PCA) [9], independent component analysis [10], and support vector machine [11,12], can also be applied to functional neuroimaging data. Currently, most computer-aided AD diagnostic methods are based on traditional machine learning methods. Traditional machine learning techniques follow a basic assumption: the training and testing data should follow the same distribution. However, in many cases, this assumption does not hold. This excludes a great number of useful labeled samples because they follow similar but different distributions.
Transfer learning was proposed to make full use of labeled samples from similar distributions, that is, to gain knowledge from one context and benefit learning tasks in other contexts [13][14][15]. Earlier applications of transfer learning methods have addressed some important issues, including learning how to learn [16], learning one more thing [17], and multi-task learning [18]. Ben-David and Schuller provided a theoretical justification for multi-task learning [19]. Daumé III and Marcu studied the domain-transfer problem in statistical natural language processing using a specific Gaussian model [20]. Transfer learning has also been used in automatic text categorization [21], and some papers have discussed its application in robotic soccer systems [22,23]. Bickel and Scheffer introduced a statistical formulation of this problem in terms of a simple mixture model and presented an instance of this framework for maximum entropy classifiers and their linear chain counterparts [24]. Although transfer learning has been widely used in many areas, few studies have applied it to the diagnosis of AD.
To establish a transfer learning model for computer-aided clinical diagnosis of AD, five stages of analysis are outlined in this paper. The first stage is the analysis of MRI results that were obtained from ADNI and a local hospital. Ninety anatomical volumes of interest (AVOI) were studied within the AD, MCI, and NC groups. In the next stage, we explored the between-group variance within the sample collection, using the information gain scales. The third stage was to establish the computer model using transfer learning methods. The data from AD and NC were collected as the reference source, which we then transferred to the MCI group for data classification. The information gain scales were used for feature optimization, to explore the influence of different features on the classification results. The transfer learning model was then established with the optimized features. The final stage involved applying the model to transfer knowledge to clinical diagnosis in the local hospital.
In this paper, we first outline background information and the purpose of this study. Secondly, the material and methods are illustrated with comparisons between traditional machine learning methods and the transfer learning method. The analysis and results are explained, followed by discussion and conclusion.

Materials
In this paper, we collect and study two datasets to verify the performance of the proposed method. In this section, we first introduce the details of the two datasets. Then, we present the image processing, the feature extraction, and the feature selection processes.
The first dataset of 507 subjects was obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) public database (http://www.loni.ucla.edu/ADNI). The ADNI was initiated in 2003 by the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), pharmaceutical companies, and nonprofit organizations for the development of diverse biomarkers for the early detection of AD [25].
All ADNI brain MRI examinations were performed on a 1.5T MRI scanner. We acquired high-resolution T1-weighted Magnetization Prepared Rapidly Acquired Gradient echo (MP-RAGE) three-dimensional (3D) sequences for the classification analysis of 507 subjects, including 177 AD patients, 245 MCI patients, and 85 NC. Table 1 lists the demographics of all these subjects. The second dataset was sourced from clinical diagnostic results in a local hospital, following the same procedure as ADNI. The brain MRI examinations were also performed on a 1.5T MRI scanner, and high-resolution T1-weighted MP-RAGE 3D sequences were obtained, as detailed in Table 2. We must point out that the AD patients in the ADNI database had a mean Mini-Mental State Examination (MMSE) score of 23.07 and a mean CDR of 0.78, and were classified as early-stage AD. The patients from the local hospital had a mean MMSE score of 10.7 and a mean CDR of 1.3, and were classified as moderate AD. These figures suggest that the AD patients at the local hospital suffered more severe symptoms than the patients in the ADNI database.

MR Image Processing
The previous section introduced the datasets used in this paper. Because the datasets consist of images, image processing is necessary before the feature extraction and feature selection steps.
We first segmented the Magnetization Prepared Rapidly Acquired Gradient echo (MP-RAGE) MR images into gray matter (GM), white matter (WM), and cerebrospinal fluid (CSF). Then, the Diffeomorphic Anatomical Registration through Exponentiated Lie Algebra (DARTEL) algorithm was utilized for spatial normalization [26]. In practice, the SPM8 package was applied to segment the structural brain images, and the DARTEL algorithm was used to create an average-shaped template and to normalize the segmented images to Montreal Neurological Institute (MNI) space. The features are the GM probability maps in the MNI space. The DARTEL procedure is shown in Figure 1.

Feature Extraction
After image processing, we utilized the Automatic Anatomical Labeling (AAL) binary atlas for the 3D case in the anatomical registration [27]. An anatomical parcellation of the spatially normalized single-subject high-resolution T1 volume provided by the Montreal Neurological Institute (MNI) was performed [28]. The main sulci of the MNI single subject were first delineated and then used as landmarks for the 3D definition of 45 anatomical volumes of interest (AVOI) in each hemisphere. The 90 AVOI were then reconstructed and each was assigned a label. Based on these results, we computed the GM tissue volume of each anatomical region, producing 90 regional features for classification analysis.

Feature Selection
The previous feature extraction step yields 90 features. However, not all of these features are useful for Alzheimer's Disease clinical diagnosis. Features that are not discriminative do not help diagnosis and may even reduce diagnostic accuracy. To deal with this situation, we propose in this section a feature selection method based on information gain.
The concept of information gain was discussed in The Mathematical Theory of Communication by C.E. Shannon, 1949.
Let $p_{ij}$ denote the anatomical gray matter volume of a single subject ($j = 1, 2, 3, \ldots, 90$), and let $m$ be the number of classes. The entropy $H$ of an attribute is defined over the class distribution. If the number of subjects per class is equal for both classes, $H = 1$. In the case of unbalanced class sizes, the entropy of the class distribution is smaller than one, and it approaches zero if there are many more instances of one class than of the other.
Based on the definition of entropy, we can define the information gain of a voxel, where $A$ is the attribute value of the feature examples and $B$ is the corresponding class label. The information gain ranges between 0 and 1. If $IG(B) = 0$, the corresponding voxel provides no information on the class label of the subjects. On the other hand, if $IG(B) = 1$, the class labels of all subjects can be derived from the corresponding voxel without any errors.
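For reference, the description above matches the standard Shannon definitions. The following is a sketch under stated assumptions: base-2 logarithms (consistent with $H = 1$ for two balanced classes), $p_i$ as the fraction of subjects in class $i$, and $p(a)$ as the fraction of subjects whose attribute takes value $a$. The symbols $p_i$ and $p(a)$ are illustrative and are not necessarily the notation of the paper's Formulas (1)-(3).

```latex
\begin{align}
H(B) &= -\sum_{i=1}^{m} p_i \log_2 p_i \\
H(B \mid A) &= \sum_{a} p(a)\, H(B \mid A = a) \\
IG(B) &= H(B) - H(B \mid A)
\end{align}
```

Under these definitions, $0 \le IG(B) \le H(B) \le 1$ for a two-class problem, which agrees with the stated range of the information gain.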

In relation to information gain, the C4.5 statistical classifier developed by Ross Quinlan [29] was used in this study. The decision tree is built from large sets of cases belonging to known classes. The cases, described by a mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of rules. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets that are enriched in one class or the other. The splitting criterion is the normalized information gain: the attribute with the highest normalized information gain is chosen to make the decision. This is the process of feature selection, which learns from the training data a mapping from the attribute values to a predicted class.
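The splitting criterion can be illustrated with a small sketch. It assumes a numeric feature (for example, one regional GM volume) that is split at a threshold, and it computes the information gain and the normalized information gain (gain ratio) of that split. The function names and the toy volumes are hypothetical, and the threshold search is a simplification that omits the further refinements of Quinlan's C4.5.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_gain(values, labels, threshold):
    """Information gain and gain ratio of the binary split value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(labels) - conditional
    # Split information: entropy of the partition sizes, used as the normalizer.
    split_info = entropy(["left"] * len(left) + ["right"] * len(right))
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


def best_split(values, labels):
    """Try midpoints between sorted distinct values; keep the highest-gain one."""
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        gain, ratio = split_gain(values, labels, threshold)
        if best is None or gain > best[1]:
            best = (threshold, gain, ratio)
    return best


# Hypothetical regional volumes (arbitrary units) for three AD and three NC subjects.
volumes = [2.1, 2.3, 2.2, 3.0, 3.2, 3.1]
classes = ["AD", "AD", "AD", "NC", "NC", "NC"]
threshold, gain, ratio = best_split(volumes, classes)
```

In this toy case the two classes are perfectly separable, so the best split reaches the maximum information gain for two balanced classes.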

Methods
The previous feature extraction and selection steps yield the optimal discriminant features for the MRI images. These features can be used for Alzheimer's Disease clinical diagnosis with traditional diagnosis methods (e.g., k-nearest neighbor [30] and support vector machine [31]). However, these traditional methods all assume that the training and testing data are drawn from the same distribution. This identical-distribution assumption may not hold in the diagnosis of AD and MCI, because the MRI data of AD patients at different stages differ significantly and can therefore be viewed as data from different distributions. If a model trained on a dataset from one distribution is applied directly to a dataset from another distribution, the accuracy of the diagnosis cannot be guaranteed, and traditional diagnosis methods may produce larger diagnostic errors. To solve this problem, this section applies the transfer learning method to the diagnosis of AD with the selected features.

Problem Definition
Before we introduce the transfer learning method, we first define symbols for patients at different stages. In particular, we let AD denote patients with diagnosed Alzheimer's Disease and MCI denote patients with mild cognitive impairment. MCI may develop into AD, but it can also improve and recover. Hence, we consider AD and MCI as two different datasets. AD is used as training data and MCI as testing data: the classification model trained on the AD data is transferred to the MCI data for classification. Furthermore, the first group of MRI samples, from the ADNI database, represents a western population, whereas the second group of MRI samples, obtained from the local hospital, represents an eastern population. The differences between these two groups of samples were also analyzed with the transfer learning method. The transfer learning model was established on the ADNI samples (training data) and applied to the local samples (testing data) for diagnostic analysis.
Here, we define the symbols used in the transfer learning method. Let $T_d$ represent the diff-distribution training data (source domain) and $T_s$ represent the same-distribution training data (target domain). Both $T_d$ and $T_s$ are labeled datasets, and $T = T_d \cup T_s$. $S$ denotes the test dataset. Our goal is to learn a model from the labeled dataset $T$ for the diagnostic tasks on $S$.

Transfer Learning Method for Alzheimer's Disease Clinical Diagnosis
To address the problem that, in Alzheimer's Disease clinical diagnosis, the training and testing data are drawn from different distributions, we propose a transfer learning method for diagnosis that uses importance weights. In our method, we first assign a weight to each image in the training dataset. We then adjust the weights to filter out images that differ greatly from the testing data distribution. In this way, we can train a suitable model on the re-weighted data for the diagnostic tasks on the testing data. Specifically, we let $w_i$ denote the importance weight of the $i$-th image, and use the weight vector $W = (w_1, w_2, \ldots, w_{n+m})$ to denote the weights of the training data. Because the training dataset may include a few labeled data from the target domain, we use $w_1, \ldots, w_n$ as the weights for the source domain and $w_{n+1}, \ldots, w_{n+m}$ as the weights for the target domain.
We now present the TrAdaboost method in detail. To obtain suitable image weights, TrAdaboost initializes the weight vector as $W^1 = (w_1^1, \ldots, w_{n+m}^1)$. In the first step, the weight vector $W^1$ is used to re-sample the images from the training dataset. In the second step, we use a traditional method to learn a base model $h_1$ on the re-sampled dataset. In the third step, we calculate the error $e_1$ of $h_1$ on $S$, where $c(x_i)$ denotes the real label of image $x_i$. In the fourth step, to filter out images that differ greatly from the testing data distribution, the weight vector is updated using $\beta_1 = e_1/(1 - e_1)$. The new weight vector is then used to repeat steps 1 to 4 until the change in $e$ is less than a threshold $\varepsilon$. For a test image $x$, the output hypothesis combines the learned base models, where $N$ denotes the maximum number of iterations.
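The iteration described above can be sketched in code. This is a minimal illustration of the commonly published TrAdaBoost formulation, not the exact procedure of this paper. It rests on several assumptions: binary labels 0/1, a weighted decision stump as the base learner instead of re-sampling, the error measured on the labeled target-domain training data, a fixed number of iterations instead of a convergence threshold, and a source-domain factor $\beta = 1/(1 + \sqrt{2 \ln n / N})$. All function names and toy data are hypothetical.

```python
import math


def stump_predict(stump, row):
    """Predict 1 when the chosen feature lies on the positive side of the threshold."""
    j, t, polarity = stump
    return 1 if polarity * (row[j] - t) > 0 else 0


def train_stump(X, y, p):
    """Exhaustively pick the (feature, threshold, polarity) with least weighted error."""
    best = (float("inf"), 0, 0.0, 1)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                err = sum(pi for row, yi, pi in zip(X, y, p)
                          if stump_predict((j, t, polarity), row) != yi)
                if err < best[0]:
                    best = (err, j, t, polarity)
    return best[1:]


def tradaboost(Xd, yd, Xs, ys, n_iter=10):
    """Xd, yd: source-domain data; Xs, ys: labeled target-domain training data."""
    n, m = len(Xd), len(Xs)
    X, y = Xd + Xs, yd + ys
    w = [1.0] * (n + m)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n) / n_iter))
    stumps, betas = [], []
    for _ in range(n_iter):
        total = sum(w)
        p = [wi / total for wi in w]
        h = train_stump(X, y, p)
        miss = [abs(stump_predict(h, row) - yi) for row, yi in zip(X, y)]
        # Weighted error on the target-domain part only, clamped to stay in (0, 0.5).
        e = sum(w[i] * miss[i] for i in range(n, n + m)) / sum(w[n:])
        e = min(max(e, 1e-10), 0.499)
        beta_t = e / (1.0 - e)
        for i in range(n):              # shrink misclassified source weights
            w[i] *= beta ** miss[i]
        for i in range(n, n + m):       # grow misclassified target weights
            w[i] *= beta_t ** (-miss[i])
        stumps.append(h)
        betas.append(beta_t)
    return stumps, betas


def predict(model, row):
    """Weighted vote (in log space) over the second half of the iterations."""
    stumps, betas = model
    start = math.ceil(len(stumps) / 2) - 1
    lhs = sum(-math.log(b) * stump_predict(h, row)
              for h, b in zip(stumps[start:], betas[start:]))
    rhs = sum(-0.5 * math.log(b) for b in betas[start:])
    return 1 if lhs >= rhs else 0


# Hypothetical one-feature data: the source domain labels x = 4.0 as class 0,
# whereas the target-domain samples place the class boundary near 3.
Xd, yd = [[1.0], [2.0], [4.0], [6.0], [7.0]], [0, 0, 0, 1, 1]
Xs, ys = [[2.5], [3.5], [1.5], [5.0]], [0, 1, 0, 1]
model = tradaboost(Xd, yd, Xs, ys)
```

In this toy run, the source sample that conflicts with the target distribution is misclassified at each iteration and its weight shrinks, so the final model follows the target-domain boundary.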

Data Setting
In this study, we analyzed a total of 507 AD, MCI, and NC subjects from the ADNI database using Statistical Parametric Mapping (SPM), the DARTEL procedure, and the AAL template. The MR images from ADNI, comprising 177 AD patients, 245 MCI patients, and 85 normal controls (NC), served as experimental samples. All of the images were preprocessed with the DARTEL toolbox and then mapped to the standard MNI space neurological coordinate system and Automated Anatomical Labeling (AAL). The 90 AVOI were extracted as features and depicted graphically with box-plots. The mean of each of the 90 AVOI was calculated for the AD, MCI, and NC groups and compared in a line chart. Then, the information gains of the 90 AVOI were calculated for feature selection.
The transfer learning algorithm was introduced into the classification analysis of AD, MCI, and NC, with AD used as training data and MCI as testing data. The testing results indicated that rather high accuracy and sensitivity were achieved. Furthermore, the information gains of the 90 AVOI were used for feature selection, and the impact of feature selection on classification accuracy was calculated. Finally, the local AD subjects were analyzed with the transfer learning model based on the ADNI training data.

Evaluation Measure
The accuracy of each algorithm was used to evaluate its performance in the experimental results. For a more detailed analysis of the results, additional evaluation measures were used. For example, the fluctuation rate is given by Formulas (8) and (9), where $ACC_{max}$ represents the maximum accuracy of each Src group, $ACC_{min}$ represents the minimum, and $ACC_{group}$ is the accuracy obtained with AD/NC1 as Src in our algorithm.
The change rate of accuracy of every tested classification algorithm was also calculated for the various Tartrain sample sizes. It is given by Formula (10), where $ACC_{max}$ is the maximum accuracy of each classification algorithm and $ACC_{min}$ is the minimum accuracy. In addition, the sensitivity and specificity were calculated in this study.
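Accuracy, sensitivity, and specificity can be computed from predicted and true labels as in the sketch below, which uses their standard definitions. The function name, the choice of 1 as the positive (patient) class, and the toy labels are assumptions for illustration. The fluctuation rate and change rate of Formulas (8)-(10) are not reproduced here.

```python
def diagnostic_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: fraction of true patients detected; undefined without patients.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true controls recognized; undefined without controls.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity


# Hypothetical labels: 1 = patient, 0 = normal control.
acc, sen, spe = diagnostic_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 0])
```

A specificity of 100%, as reported for the local hospital task, corresponds to no normal control being classified as a patient.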

Feature Selection Results
Box-plots display the variation in the 90 AVOI of the AD, MCI, and NC groups. The gray matter box-plots are shown in Figure 2 for the AD group, in Figure 3 for the MCI group, and in Figure 4 for the NC group. The figures show that the NC group has the lowest degree of dispersion, whereas the MCI group has the highest. The high degree of dispersion in MCI confirms the statistically significant difference in the volumes of gray matter, which is also consistent with the clinical characteristics of MCI: some MCI subjects may progress to AD, while others may remain stable or even revert to NC.

The mean gray matter volumes of the 90 AVOI in the AD, MCI, and NC classes are shown in Figure 5. Figure 5 shows that the AD group has the smallest mean gray matter volume for every AVOI, whereas the NC group has the largest. This is also consistent with clinical evidence of significant gray matter atrophy in AD patients and mild atrophy in MCI patients. Based on these features, the gray matter volume data of the 90 AVOI were divided into two groups and analyzed with information entropy. The first group is AD vs. NC, including 177 AD and 85 NC subjects; the other group is MCI vs. NC, including 245 MCI and 85 NC subjects.
The information gains of the 90 AVOI were calculated with Formulas (1)-(3). The values were sorted, and the anatomical regions with the top 15 information gains are shown in Table 3. Based on information gain, we selected just seven anatomical regions as features and obtained good classification performance. These seven common regions, highlighted in bold, appear among the top 15 anatomical regions by information gain in both the AD and MCI groups. The seven selected regions are consistent with the gray matter volume atrophy in AD and MCI subjects that has been reported in the literature [32]. For example, both AD and MCI patients appeared to have various degrees of atrophy in the hippocampus and amygdala regions. The results in Figure 2 showed that atrophy of the common regions was closely related to AD and MCI, for example the hippocampus (No. 37-No. 38), parahippocampal gyrus (No. 40), and amygdala (No. 41-No. 42). The atrophy in these seven regions was more severe in AD patients than in MCI patients. We therefore selected these seven anatomical regions as features and applied them in the transfer learning algorithm from AD to MCI subjects.
Among the anatomical regions with the 15 highest information gains shown in Table 3, a total of seven regions were present in both compared groups: the left and right hippocampus, the left and right amygdala, the right parahippocampal gyrus, the left angular gyrus, and the inferior temporal gyrus. These regions have also been reported in other studies, which suggest brain volume loss in these regions in both the AD and MCI groups [33]. For example, both AD and MCI patients appeared to have various degrees of volume loss in the left and right hippocampus, the left and right amygdala, and the right parahippocampal gyrus. In subsequent experiments, we also examined the classification accuracy obtained using the AVOI of the seven common regions as features.
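The selection of common regions can be sketched as follows: rank the regions by information gain within each comparison group, keep the top k of each ranking, and retain the regions that appear in both. The function names, region labels, and scores below are hypothetical placeholders and are not the values of Table 3.

```python
def top_k(scores, k):
    """Names of the k regions with the highest information gain."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def common_top_regions(scores_a, scores_b, k=15):
    """Regions in the top k of both rankings, in the order of the first ranking."""
    top_b = set(top_k(scores_b, k))
    return [name for name in top_k(scores_a, k) if name in top_b]


# Hypothetical information gains for the AD vs. NC and MCI vs. NC groups.
ig_ad_nc = {"Hippocampus_L": 0.30, "Amygdala_L": 0.25,
            "Angular_L": 0.20, "Precuneus_L": 0.10}
ig_mci_nc = {"Hippocampus_L": 0.12, "Amygdala_L": 0.05,
             "Angular_L": 0.10, "Precuneus_L": 0.02}
shared = common_top_regions(ig_ad_nc, ig_mci_nc, k=2)
```

With k = 15 and the 90 AVOI scores of each group, the same intersection would yield the set of common regions used as features.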

Transfer Learning Results
We defined "Src" as training data in the source domain, "Tartrain" as training data in the target domain, and "Tartest" as testing data in the target domain. In this study, the transfer learning experiments include three parts. In the first part, the classification accuracy was explored with AD as training data and MCI as testing data. In the second part, the anatomical regions common to both information gain rankings, AD vs. NC and MCI vs. NC, were selected as features. These features were then examined using iterative feature optimization for their impact on accuracy. In the final part, the ADNI subjects were set as the training data and the testing data came from the local hospital. There were large differences in ethnicity and sample size between these two datasets. In this context, we explored whether transfer learning was still valid for transferring the knowledge learned from ADNI data to local hospital data.

Results on Transfer Task: AD → MCI
177 AD and 40 NC subjects were grouped as training data; 245 MCI and the 45 remaining NC subjects were combined as testing data. Since some studies have indicated that imbalanced sample sizes may lead to inadequate classification, we divided the AD data into four groups, the MCI data into two groups, and the NC data into three groups. Details of each group are shown in Table 4 (the last AD group contained all 177 AD subjects). We set AD1 and NC1, AD2 and NC1, AD3 and NC1, and AD4 and NC1 as Src, respectively; MCI1 and NC3 as Tartrain; and MCI2 and NC2 as Tartest. The sample sizes of MCI1 and NC3 were then increased in the following experiments, in order to examine the assumed relationship between the size of Tartrain and the classification accuracy. The proposed transfer learning method was compared with four classic classification algorithms, i.e., SVM, KNN, ITML, and Linear-MSVM. For the sake of fairness, the same Src, Tartrain, and Tartest were used in all methods. The classification performances of all five methods are listed in Table 5. As shown in Figure 6a, the accuracy of our algorithm is 80.44%, which is significantly higher than that of the other four algorithms. Furthermore, we studied how classification accuracy changes with the Tartrain size. Following the division method in Table 4, we ran the test with MCI1/NC3 (Tartrain) sizes of 15, 20, 25, and 30, respectively (see Lab2-Lab5). The classification accuracy of the five methods for the different Tartrain sizes is shown in Figure 6b-e. Figure 6a-e shows that the proposed algorithm achieved the highest accuracy among all five methods. Meanwhile, there was no significant difference in the TrAdaboost algorithm among the Src groups, which suggests that Src grouping has little impact on the accuracy of the TrAdaboost algorithm.
In comparison, the AD/NC1 group as Src appeared to have higher accuracy in the other four classic methods, but it obviously lacked stability. Based on the above results, we concluded that dividing the training data has no significant influence on the accuracy of transfer learning method. It also suggests that the results are more reliable with a bigger sample size. Consequently, the training data was no longer divided in the following experiments, and we just used AD/NC1 as Src data.
Comparing Figure 6a-e, the accuracy of our algorithm fluctuated the most in Figure 6b; the details are listed in Table 6. The classification performance on the AD/NC1 group for the various sizes of Tartrain data is shown in Figure 6e and Table 7. The results show that our algorithm had the highest accuracy among all five methods. Moreover, Figure 6e shows that the classification accuracy of the TrAdaboost algorithm increased as the sample size of Tartrain increased.
In the final part of the study, we constructed the classification module based on ADNI samples, and then transferred the knowledge to the local hospital subjects. Meanwhile, the Accuracy (ACC), Sensitivity (SEN), and Specificity (SPE) were calculated in order to assess the validation of this module in clinical diagnosis.
The ACC, SEN, and SPE results are shown in Table 8. The transfer learning module achieved high accuracy, sensitivity, and specificity, which suggests that this method is efficient and valuable in assisting clinical diagnosis.
In the classification of local AD subjects using the "knowledge" learned from ADNI data and the TrAdaboost algorithm (Table 8), classification accuracy exceeded 90% for every Tartrain group. The highest accuracy, 93.75%, was obtained with Tartrain AD1/NC1 set at 2/2, with a sensitivity of 80% and a specificity of 100%. The proposed transfer learning module was therefore effective. The Youden index of each Tartrain group was calculated from Table 9: the values were 0.875, 0.857, 0.833, 0.8, 0.875, and 0.833, respectively, all above 0.7. Therefore, we conclude that the TrAdaboost algorithm is valuable in assisting clinical diagnosis.
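The relationship between ACC, SEN, SPE, and the Youden index (J = SEN + SPE - 1) can be illustrated with a minimal Python sketch. The function name and the confusion-matrix counts below are hypothetical: the counts are not taken from Table 8 or Table 9 and were chosen only because they reproduce the rates reported for the best-performing group (93.75% accuracy, 80% sensitivity, 100% specificity).

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity, specificity, and the Youden index
    from confusion-matrix counts (positives = AD, negatives = NC)."""
    total = tp + fn + tn + fp
    acc = (tp + tn) / total   # fraction of all subjects classified correctly
    sen = tp / (tp + fn)      # fraction of AD subjects detected
    spe = tn / (tn + fp)      # fraction of NC subjects correctly rejected
    youden = sen + spe - 1    # Youden index J
    return {"ACC": acc, "SEN": sen, "SPE": spe, "Youden": youden}


# Hypothetical counts (an assumption of this sketch, not data from the study).
metrics = diagnostic_metrics(tp=4, fn=1, tn=11, fp=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With a specificity of 100%, the Youden index reduces to the sensitivity, which is consistent with the value of 0.8 listed among the reported indexes.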

The Efficiency of TrAdaboost Algorithm
In this study, we tested and verified that the transfer learning algorithm had higher accuracy and stability than some classic algorithms.
To keep the balance of sample sizes for better classification performance, we divided the AD, MCI, and NC data into small groups, as shown in Table 4. Then, the classification accuracy was calculated based on AD as training data and MCI as testing data with different grouping combinations.
To better understand the efficiency of the TrAdaboost algorithm, we also applied classic algorithms, including SVM (support vector machine), KNN (K-nearest neighbor), ITML (information-theoretic metric learning), and Linear-MSVM (linear metric based support vector machine). The accuracy of each algorithm was compared with MCI1/NC3 as training data in the target domain (Tartrain), with its sample size set at 10/10, and with AD1/NC1, AD2/NC1, AD3/NC1, AD4/NC1, and AD/NC1 as training data in the source domain (Src), respectively. As shown in Table 5 and Figure 6a, the TrAdaboost algorithm clearly outperformed the other algorithms, with an accuracy of 80.44%, compared with 42.5% for KNN, 74.07% for SVM, 56.3% for ITML, and 72.1% for Linear-MSVM.
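To make the roles of Src and Tartrain concrete, the following minimal Python sketch shows one reweighting step of the standard TrAdaBoost formulation. It is not the authors' implementation, which is not shown in this section; the function name, argument names, and the clamping of the error are assumptions of this sketch. Source instances that the current weak learner misclassifies are down-weighted, while misclassified target instances are up-weighted.

```python
import math


def tradaboost_reweight(weights, errors, n_source, n_rounds):
    """Apply one TrAdaBoost reweighting step.

    weights: current instance weights, source (Src) instances first,
             followed by target training (Tartrain) instances.
    errors:  per-instance loss |h(x_i) - y_i| in [0, 1] for the weak
             learner trained in this round.
    """
    target_w = weights[n_source:]
    target_e = errors[n_source:]
    # Weighted error of the weak learner, measured on target instances only.
    eps = sum(w * e for w, e in zip(target_w, target_e)) / sum(target_w)
    # Practical guard (an assumption of this sketch): keep eps inside (0, 0.5).
    eps = min(max(eps, 1e-10), 0.499)
    beta_t = eps / (1.0 - eps)
    # Fixed multiplier for source instances; depends only on n_source and n_rounds.
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_source) / n_rounds))
    new_weights = []
    for i, (w, e) in enumerate(zip(weights, errors)):
        if i < n_source:
            new_weights.append(w * beta_src ** e)   # shrink misclassified source
        else:
            new_weights.append(w * beta_t ** (-e))  # grow misclassified target
    return new_weights, beta_t


# Toy example: 2 source and 3 target instances, one error in each domain.
new_w, beta_t = tradaboost_reweight(
    weights=[1.0, 1.0, 1.0, 1.0, 1.0],
    errors=[1, 0, 1, 0, 0],
    n_source=2,
    n_rounds=10,
)
print(new_w, beta_t)
```

In a full training loop, the weights would be normalized before each weak learner is fitted, and the final prediction would combine the weak learners from the later rounds.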
Using Formulas (8) and (9) and the data shown in Table 6, rate_top and rate_bott were calculated as 5.77% and 3.25%, respectively. This shows that using AD/NC1 as Src does not differ significantly from using the other Src groups, i.e., AD1/NC1, AD2/NC1, AD3/NC1, and AD4/NC1. Since the test results are more stable with a larger sample size, we no longer considered the other Src groups and used only AD/NC1 as Src in the experiments on selected features and local hospital subjects.
This study also examined the classification accuracy of the different algorithms with different Tartrain sample sizes. The results in Figure 6a-e and Table 7 showed that, with Tartrain MCI1/NC3 at 15/15 and AD/NC1 as Src, the classification accuracies of SVM, KNN, ITML, Linear-MSVM, and the TrAdaboost algorithm were 0.7523, 0.4808, 0.623077, 0.753036, and 0.804615, respectively. The TrAdaboost algorithm achieved the highest accuracy of all tested methods. Similarly, our algorithm had the highest accuracy among all the tested methods when the sample size of MCI1/NC3 was changed to 20/20, 25/25, and 30/30 (Table 7).
The data in Table 7 and Formula (10) were used to calculate the change rate of accuracy. The change rates of SVM, KNN, ITML, Linear-MSVM, and our algorithm were 12.79%, 14.69%, 51.09%, 4.37%, and 4.85%, respectively. Linear-MSVM and our algorithm clearly had lower fluctuation in accuracy and were more stable than the other three algorithms. Taking accuracy and stability together, our algorithm outperformed the other tested methods.
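The combined reading of accuracy and stability can be made explicit with a short Python sketch. It uses only the values reported in this section (the accuracies at Tartrain 15/15 and the change rates); the 5% stability threshold and the variable names are assumptions of this sketch, not part of the original analysis.

```python
# (accuracy at Tartrain 15/15, change rate of accuracy in %), as reported in the text.
reported = {
    "SVM": (0.7523, 12.79),
    "KNN": (0.4808, 14.69),
    "ITML": (0.623077, 51.09),
    "Linear-MSVM": (0.753036, 4.37),
    "TrAdaboost": (0.804615, 4.85),
}

# Assumed cut-off for calling a method "stable" (hypothetical threshold).
STABLE_BELOW_PERCENT = 5.0

# Step 1: keep the methods whose accuracy fluctuates little.
stable = [name for name, (_, rate) in reported.items()
          if rate < STABLE_BELOW_PERCENT]

# Step 2: among the stable methods, pick the most accurate one.
best = max(stable, key=lambda name: reported[name][0])

print("Stable methods:", stable)
print("Most accurate stable method:", best)
```

Under this reading, Linear-MSVM has a slightly lower change rate, but TrAdaboost is the more accurate of the two stable methods.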

Conclusions
This research introduced a transfer learning module into the classification of AD, MCI, and NC. The results show that the TrAdaboost algorithm has a clear advantage over classic classification methods, with higher accuracy and less fluctuation. Moreover, when "knowledge" is transferred from the ADNI database to a local database, the TrAdaboost algorithm achieves high accuracy, sensitivity, and specificity, which indicates its value in clinical application. In addition, we applied the information gain method to optimize feature selection in transfer learning. The results show that classification accuracy is improved with the optimized feature selection, which indicates that the information gain method can be used to select the more sensitive anatomical regions in AD and MCI diagnosis. In future work, better feature optimization will be explored in the transfer learning module in order to improve its accuracy and consistency. Besides better feature optimization, the challenges that arise when diverse databases are involved will also be considered, for example, how to filter redundant information in different databases and how to migrate knowledge-assisted diagnostics between heterogeneous data. These are all questions worth studying. Furthermore, our research only distinguishes AD from normal individuals, but many other conditions manifest as dementia in clinical practice, such as Lewy body disease and vascular dementia. How to use artificial intelligence-assisted radiological diagnosis to discriminate AD from other types of dementia will also be considered in future work.

Conflicts of Interest:
The authors declare no conflicts of interest.