Machine Learning for the Classification of Alzheimer’s Disease and Its Prodromal Stage Using Brain Diffusion Tensor Imaging Data: A Systematic Review

Abstract: Alzheimer’s disease is notoriously the most common cause of dementia in the elderly, affecting an increasing number of people. Although widespread, its causes and progression modalities are complex and still not fully understood. Through neuroimaging techniques, such as diffusion Magnetic Resonance (MR), more sophisticated and specific studies of the disease can be performed, offering a valuable tool for both its diagnosis and early detection. However, processing large quantities of medical images is not an easy task, and researchers have turned their attention towards machine learning, a set of computer algorithms that automatically adapt their output towards the intended goal. In this paper, a systematic review of recent machine learning applications on diffusion tensor imaging studies of Alzheimer’s disease is presented, highlighting the fundamental aspects of each work and reporting their performance score. A few examined studies also include mild cognitive impairment in the classification problem, while others combine diffusion data with other sources, like structural magnetic resonance imaging (MRI) (multimodal analysis). The findings of the retrieved works suggest a promising role for machine learning in evaluating effective classification features, like fractional anisotropy, and in possibly performing on different image modalities with higher accuracy.


Introduction
Alzheimer's disease (AD), or Alzheimer's, is a neurodegenerative disorder representing the most common cause of dementia in the elderly population of developed countries. Currently, about fifty million people are affected by Alzheimer's, and this number is expected to triple by 2050 due to population aging [1]. Alzheimer's disease is characterized by a progressive and irreversible neurologic deterioration, leading to the decline of cognitive functions and eventually to patient death [2]. Mild cognitive impairment (MCI) is an intermediate pathological condition in which patients show heterogeneous symptoms. MCI can represent the prodromal stage of AD, but can also progress to other types of dementia [3]. AD diagnosis is very complex because of the different symptoms that patients might show, at both the cognitive and the behavioral level. Furthermore, the disease progression modalities are as subjective as the therapeutic responses. Within this framework, the most challenging goal is to develop innovative diagnostic tools that help detect the disease from its early stages, including MCI. In this context, computer aided diagnosis (CAD) systems are desirable, in order to [...]
AD is characterized by a loss of the brain barriers that restrict water motion, which compromises the integrity of white matter (WM) and leads to abnormal diffusivity patterns, resulting in a measurable difference in the diffusion of water molecules [20]. It has been suggested that such changes precede macroscopic atrophy [21] and, while they are not visible on conventional structural MRI sequences, they can be detected by diffusion tensor imaging (DTI). Moreover, the literature suggests that WM integrity alterations detected by DTI could be complementary to volumetric alterations [22].
Several studies have applied the DTI technique to characterize WM integrity in AD (for a review see [23]). In particular, DTI-based studies have shown that AD patients exhibit aberrant fractional anisotropy (FA) and mean diffusivity (MD) values in the white matter of specific cerebral regions [24]. Furthermore, other studies have found similar, yet less severe, changes of these values in MCI patients [25]. Specifically, voxel-based studies showed that AD and MCI subjects have reduced FA in multiple posterior WM regions [26] and increased MD in the posterior occipital-parietal cortex and right parietal supramarginal gyrus [27]. ROI-based studies demonstrated higher MD and/or lower FA in the hippocampus [28][29][30] and posterior cingulate [31,32]. Notably, the results of a previous study showed that measures of diffusivity extracted from the hippocampus are better predictors of MCI conversion to AD than its volume [32]. Altogether, these results suggest that the biomarkers obtained from the DTI technique can be used for AD classification through advanced classification methods [33].
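As a reminder of how the scalar indices used throughout this review relate to the diffusion tensor, the following Python sketch computes MD, FA, axial diffusivity (DA) and radial diffusivity (DR) from the three tensor eigenvalues, using the standard textbook definitions. The function name and the example eigenvalues are illustrative assumptions and are not taken from any of the reviewed studies.

```python
import math

def dti_indices(l1, l2, l3):
    """Return (MD, FA, DA, DR) from tensor eigenvalues sorted so that l1 >= l2 >= l3."""
    md = (l1 + l2 + l3) / 3.0                        # mean diffusivity
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    # Fractional anisotropy: 0 for isotropic diffusion, 1 for diffusion along one axis
    fa = math.sqrt(1.5 * num / den) if den > 0 else 0.0
    da = l1                                          # axial diffusivity
    dr = (l2 + l3) / 2.0                             # radial diffusivity
    return md, fa, da, dr

# Hypothetical eigenvalues (arbitrary units) for an isotropic and an anisotropic voxel
isotropic = dti_indices(1.0, 1.0, 1.0)
anisotropic = dti_indices(1.7, 0.3, 0.2)
```

In this sketch, a voxel in which WM integrity is compromised would typically show eigenvalues that are closer to each other, hence a lower FA, and a larger average, hence a higher MD, which is the pattern reported in the studies cited above.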
For these reasons, combining DTI data with machine learning (ML) classification algorithms looks promising for detecting specific AD and MCI biomarkers. In this paper, we present a systematic review of studies on CAD models that integrate DTI data (or the combination of DTI with other MRI techniques) and ML methods to classify healthy controls and patients affected by AD or MCI.
The main goal of this review is to examine the benefits and the issues of applying DTI combined with ML algorithms in the detection of AD/MCI and to suggest future lines of research. To the authors' knowledge, this is the only review in the existing literature focusing on studies that perform DTI-based classification to detect AD and its early stage.

Materials and Methods
A systematic literature review covering the period from 2010 through 2019 was conducted in PubMed according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [34]. Articles published before 2010 were not taken into account, because knowledge of DTI was still limited at that time. The search strategy was ("machine learning" OR "artificial intelligence" OR "classification") AND "diffusion tensor imaging" AND ("alzheimer's disease" OR alzheimer's OR alzheimer).
To reduce the risk of bias, two authors (L.Bi. and A.B.) independently screened paper abstracts and titles and analyzed the full papers that met the inclusion criteria, as suggested by the PRISMA guidelines.
Overall, the search was limited to articles pertaining to studies that used supervised machine learning methods on data derived from DTI or from other neuroimaging techniques combined with DTI. Moreover, we included only studies that classified AD patients compared to healthy controls, or that also included a sample of MCI subjects. We excluded articles that included only MCI patients and controls without an AD sample, for two reasons: this review is mainly focused on the automatic diagnosis of AD, and we wanted to evaluate the benefits and the issues of using DTI combined with ML methods, according to the literature so far, in a sample that is more uniquely characterized and more homogeneous than the MCI group. This search led to 51 articles, 36 of which were selected. Among these, 15 articles were excluded: three were not focused on AD or MCI disorders, nine did not consider any AD sample, and three (one systematic review and two studies) did not involve DTI-based classification. From the remaining 21 articles, a consistent set of information was extracted: the neuroimaging techniques involved, the number of pathologic patients and healthy controls, the list of features, the classification algorithm(s) and the results (accuracy-ACC, sensitivity-SEN, specificity-SPE). When multiple classifiers were tested, only the performance of the one that achieved the best result is reported in Tables 1 and 2. In Figure 1, the general procedure for data analysis and classification applied in the selected articles is represented.
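For reference, the three performance scores extracted from each article can be computed from the entries of a binary confusion matrix. The sketch below uses the standard definitions with patients as the positive class; the function name and the counts are hypothetical and do not correspond to any reviewed study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity, with patients as the positive class."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # fraction of all subjects correctly classified
    sen = tp / (tp + fn)                    # fraction of patients correctly identified
    spe = tn / (tn + fp)                    # fraction of controls correctly identified
    return acc, sen, spe

# Hypothetical confusion counts for 20 patients and 25 controls
acc, sen, spe = classification_metrics(tp=17, tn=22, fp=3, fn=3)
print(f"ACC = {acc:.1%}, SEN = {sen:.1%}, SPE = {spe:.1%}")
```

Because accuracy depends on the relative size of the two groups, sensitivity and specificity are reported alongside it wherever the original articles provide them.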
In Appendix A, a comprehensive list of the acronyms and abbreviations used throughout the paper can be found, while Appendix B contains a brief description of the ML approaches mentioned in this paper.

Results
The 21 articles selected (Figure 2) are separated into two groups: classification considering only AD patients and healthy controls (HC) (n = 11) and classification including MCI patients (n = 10). For each article, when multiple classification approaches were tested, the best performance is reported in bold. Since some studies did not provide all the exact values of accuracy, sensitivity or specificity, these values have been deduced from plots.

AD/HC Classification
The articles included in this review have been further classified depending on the type of neuroimaging technique used. The information extracted is shown in Table 1. Among the eleven studies of Table 1, four analyzed only DTI scans (DTI analysis), while the remaining seven also involved other neuroimaging modalities such as sMRI and rs-fMRI (multimodal analysis).

DTI Analysis
Graña et al. [35] trained an SVM using DTI measures to classify AD patients and HC. Images from DTI scans were preprocessed in order to extract FA and MD. Different methods of cross-validation were employed, and the most accurate prediction was obtained by the leave-one-out method: with FA features, 100% accuracy, sensitivity and specificity were achieved, while MD features achieved lower values.
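The leave-one-out scheme mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a nearest-centroid rule is used as a simple stand-in for the SVM, and the feature values are hypothetical regional FA values, not data from [35].

```python
def nearest_centroid_predict(train_X, train_y, x):
    """Assign x to the class whose training mean is closest (stand-in for the SVM)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(X, y):
    """Hold out each subject once, train on all the others, and average the hits."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)

# Hypothetical FA values in two regions for 4 controls (0) and 4 patients (1)
X = [[0.48, 0.52], [0.50, 0.55], [0.47, 0.51], [0.49, 0.54],
     [0.38, 0.41], [0.40, 0.44], [0.36, 0.42], [0.39, 0.40]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
loo_acc = leave_one_out_accuracy(X, y)
```

With so few, well-separated subjects every held-out case is classified correctly, which illustrates why a perfect leave-one-out score on a small sample should be interpreted with caution.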
Patil et al. [36] identified specific white matter regions which might represent AD markers. Classification between AD and HC was performed with the Adaptive Boosting (AdaBoost) algorithm. Using FA measures and a set of 10 features selected by a genetic algorithm, the accuracy, sensitivity and specificity scores were 84.5%, 80.2% and 85.2%, respectively. When the feature set was not reduced, these values decreased due to overfitting (ACC = 75.3%), indicating that feature reduction improves classification accuracy by removing redundancy. Replacing FA with MD produced no significant change in accuracy, suggesting that FA is an effective parameter for AD/HC classification.
Patil and Ramakrishnan, in a subsequent study [37], focused on the correlation between the DTI indices and the mini-mental state examination (MMSE) score. FA, MD, DR and DA measures were obtained from DTI images of AD-damaged cerebral areas and then fed, individually or together with MMSE, to an SVM, decision stumps and a simple logistic classifier. The best results were achieved with the SVM on the combination of FA and MMSE score (ACC = 94.2%). Although there was no significant correlation between the DTI indices and the MMSE score, the latter improved classification accuracy for each parameter.
Schouten et al. [38] differentiated between AD and HC through four DTI measures: FA, MD, DR and DA. As a first step, voxel-wise measures (FA, MD, DR, DA) were extracted via tract-based spatial statistics (TBSS); these voxel measures were then separately clustered with independent component analysis (ICA). Then, probabilistic tractography applied to the clustering results made it possible to determine a structural connectivity network and graph measures. Using TBSS, the best accuracy was reached by DR (ACC = 84.8%), closely followed by the other DTI measures. ICA reached an accuracy of 85.1% with FA, while its other performance scores were similar to those of TBSS. The ICA method allowed a significant reduction of features, while structural connectivity-based classification showed its best results on the connectivity graph (ACC = 85.0%) compared to other measures. Lastly, the Sparse Group Lasso (SGL) was used to assess the performance of combined parameters: although the combination reached good classification accuracy, the best values were achieved by single parameters. Nevertheless, SGL showed that the most important contribution is given by the TBSS and ICA measures, the connectivity graph and the strength parameters. This finding suggests that DTI and graph theory provide complementary information.

Multimodal Analysis
Mesrob et al. [39] developed a multimodal method to classify AD and HC based on data from both DTI and structural MRI (sMRI). The model identified 73 anatomical cerebral regions of interest (ROIs) and extracted several parameters from each of them. The most distinctive regions for discriminating between subjects were selected using both univariate (t-test) and multivariate (SVM-based recursive feature elimination (SVM-RFE)) methods and then used to train an SVM for classification. FA and MD from DTI were considered, while gray matter concentration (GMC) was obtained from sMRI. Moreover, two multimodal parameters were used: MD/GMC and MD/FA. The best accuracy (ACC = 99.6%) was achieved by the multimodal parameter MD/GMC on the 15 regions chosen through the multivariate feature selection method. Interestingly, GMC alone obtained a higher accuracy (76.5%) than any other single parameter. However, classification with the multimodal parameter in the selected regions outperformed all other parameters combined.
Dyrba et al. [40] combined data originating from different kinds of scanners to classify AD patients and controls by considering FA and MD from DTI and the densities of white matter and gray matter (WMD, GMD) from sMRI. The processed data served as the training set for an SVM and a naïve Bayes (NB) classifier. Furthermore, two different methods of cross-validation (CV) were employed: pooled CV and scanner-specific CV. The entropy-based information gain (IG) criterion, which identifies the features that are most useful for separating the data, was used for feature selection. As expected, the SVM was more accurate than the NB classifier: the best results were achieved by the SVM using a pooled CV method on GMD data, with an accuracy of 89.3%. Interestingly, DTI data yielded inferior accuracy compared to GMD data.
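The idea behind an entropy-based information gain criterion can be illustrated as follows. The sketch assumes that each feature has already been discretized into bins; the binning, the function names and the toy labels are assumptions made for illustration and do not reproduce the exact procedure of [40].

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_bins, labels):
    """Reduction in label entropy obtained by splitting on a discretized feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_bins):
        subset = [y for f, y in zip(feature_bins, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

labels = ["AD", "AD", "AD", "HC", "HC", "HC"]
informative = ["low", "low", "low", "high", "high", "high"]    # separates the groups
uninformative = ["low", "high", "low", "high", "low", "high"]  # nearly unrelated to labels
ig_good = information_gain(informative, labels)
ig_bad = information_gain(uninformative, labels)
```

Features are then ranked by their gain and only the highest-ranking ones are passed to the classifier.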
Li et al. [41] combined DTI and sMRI indices to assess their discriminatory power in AD/HC classification. FA was measured from both tract- and voxel-based DTI, while gray matter volume (GMV) was obtained from sMRI. The best classification outcome resulted from the combination of tract-based FA and GMV (ACC = 94.3%). Considering only DTI indices, tract-based FA yielded better accuracy than voxel-based FA.
Dyrba et al. [42] compared data derived from three different neuroimaging techniques: DTI, sMRI and resting-state functional MRI (rs-fMRI). The selected diffusion indices were FA, MD and mode of anisotropy (MO). GMV was obtained from sMRI, while two parameters were extracted from rs-fMRI: "local clustering coefficient" and "shortest path length". Both single and multimodal parameters were used to train and test an SVM. A multiple kernel SVM (MK-SVM), which allows for the combination of different imaging modalities, was also tested. High accuracy values were reached using single DTI indices (ACC = 85.0%) and GMV alone (ACC = 81.0%) as inputs for the SVM, while for multimodal analysis, accuracy was 85.0% when combining DTI measures and GMV. The multimodal results did not differ significantly from the results of the single modalities. In addition, the MK-SVM did not improve the results.
Chen et al. [43] hypothesized that combining DTI and DKI (diffusion kurtosis imaging) data could improve Alzheimer's detection compared to single modalities. Diffusion indices (FA, MD, DA, DR) were measured from both DTI and DKI, while kurtosis indices (mean kurtosis-MK, axial kurtosis-AK, radial kurtosis-RK) were obtained from DKI. Two different methods of feature selection were employed: SVM-RFE and correlation coefficients with MMSE score (CORR-MMSE). SVM-RFE ranking led to high scores in the occipital white matter, whereas CORR-MMSE ranking selected the splenium of the corpus callosum and the posterior limb of the internal capsule, which were omitted in the scoring of diffusivity indices. According to these results, different regions are most predictive of the condition in different parametric maps, indicating that the metrics differ in their sensitivity to pathology. The results show that DKI-diffusion indices (Diff-DKI) yielded a better performance than DTI-diffusion indices (Diff-DTI) (ACC = 92.4% vs. ACC = 81.1%). Moreover, the highest performance (ACC = 96.2%) resulted from the combination of kurtosis and diffusion indices from DKI (ALL-DKI), highlighting that kurtosis provided additional information in the detection of abnormalities.
Cai et al. [44] selected 330 participants from the ADNI (Alzheimer's Disease Neuroimaging Initiative) database and developed a classifier based on structural brain network modeling through the rich-club hierarchical network paradigm. Both the Automated Anatomical Labeling (AAL) atlas and the Harvard-Oxford Atlas (HOA) were considered for the construction of the structural networks, performed on DTI and b0 (sMRI) images, aligned with the PANDA pipeline tool, for each individual included in the study. The classification between AD and HC was performed through linear discriminant analysis (LDA) on the following topologic parameters extracted from the resulting structural brain networks: "betweenness centrality (BC)" and "connection strength". The classification accuracy of both BC and connection strength was compared with common measures in AD diagnosis: hippocampal volume and MMSE. The study reported significant differences in BC and connection strength between AD and controls for some brain regions, which were specific to each atlas (AAL or HOA). These relevant connections were used as classification features to distinguish AD from controls. The best results were obtained using the AAL atlas, in particular with BC of the left putamen and left precuneus (ACC = 84.62%).
Tang et al. [45] closely examined the feasibility of AD/HC classification through volumetric, morphometric and DTI-based features specifically extracted from the hippocampus and amygdala. T1 sMRI images of the participants were segmented with a two-level diffeomorphic multi-atlas likelihood-fusion algorithm and the help of an expert neuroanatomist, in order to calculate the volume of the hippocampus and amygdala. The T1 images were also segmented in 3D, creating triangulated surfaces of the regions of interest, and the shrinking or expansion of local surface vertices relative to the corresponding template was estimated through large deformation diffeomorphic metric mapping (LDDMM). DTI images were processed and segmented to obtain FA and MD values of the hippocampus and amygdala. The feature set thus included volumetric measures, DTI indices and the deformation degree at each vertex of the modeled surfaces. Given the high number of vertices (over 1200), feature reduction through principal component analysis (PCA, retaining 95% of variance) and t-test was explored. Classification was performed with both LDA and SVM, validated through leave-one-out cross-validation on 37 total subjects, with SVM achieving the best results and reaching an accuracy of 94.6% in the best-case scenario with the most significant feature set. Even though the feature reduction process significantly improved the performance of the LDA classifier, while not substantially affecting SVM, the SVM classifier still outperformed the LDA. Given the complexity of the results of this study, in Table 1 we only report the performance for the right hippocampus using SVM, for which the best performance was obtained, showing how the results change according to the combination of the image modalities used.

AD/MCI/HC Classification
In Table 2, ten articles that include MCI classification are summarized. All these studies employ only DTI analysis.
Shao et al. [46] proposed individual structural connectivity networks (ISCNs) to distinguish predementia and AD from healthy aging in individual scans. For each connection, three attributes were calculated: fiber density (FD), the mean value of FA and the mean value of MD across all voxels for all connection fibers. Once the structure of the ISCNs was identified, three classifiers, namely SVM, k-nearest neighbor (k-NN) and NB, were trained to classify subjects based on selected connections. Among the considered ML models, SVM yielded the best accuracy. Patients with AD were distinguished from healthy control subjects with an accuracy of 100% using FD and MD, while patients with MCI were distinguished from healthy controls with an accuracy higher than 90%. This result is in line with previous findings of widely distributed FA decreases and MD increases in MCI. Furthermore, groups of MCI and AD patients were separated with an accuracy of about 85%, suggesting that ISCN alterations increase during the course of AD. These findings suggested that ISCNs may have the potential of providing an imaging- and white matter-based biomarker for distinguishing between healthy subjects, aging subjects and patients with very early AD.
Nir et al. [47] investigated white matter integrity via a novel tract clustering and registration method that combines the strengths of voxel-wise and tractography-based methods, offering a compact representation of fiber bundles. In the proposed method, maximum density paths (MDPs) were computed from whole-brain tractography. Differences in WM microstructure were determined by comparing FA and MD along each MDP. Significant MD and FA differences between AD patients and HC subjects were found, as well as MD differences between HC and late MCI subjects. Significant associations between FA, MD and MDP measures and cognitive deficits, as measured by MMSE scores, were also observed across all subjects. To discern between HC and AD groups, FA and MD values were tested along all the mean MDP points (1080 points). The subset of significant FA points (FAFDR CvA = 214 points) and the subset of significant MD points (MDFDR CvA = 641 points) were further tested. To distinguish between HC and MCI, all the MD values along all the MDP points (1080 points) were used, as well as the subset of significant MD points (MDFDR CvL = 12 points). Only MD measures were sensitive enough to detect MCI differences, and they revealed more widespread associations than FA in all analyses. The features interpolated along full mean MDPs were robust enough to reach high classification accuracies (~80%), so that reducing dimensionality by including only statistically significant MDP points did not dramatically increase classification accuracy (~85%).
Demirhan et al. [48] combined FA and MD measures from DTI to train an SVM classifier for the classification of HCs, AD and MCI patients. Good performance was reached in distinguishing AD from HC (87.8%) and MCI from HC (85.9%), while a lower value (78.4%) was obtained in separating MCI from AD subjects. Through ReliefF, an algorithm that makes it possible to identify the most discriminative voxels in the white matter map, a best feature set consisting of 1500 elements was extracted. Selecting a subset of these features did not provide a noticeable improvement in classification accuracy if the disease was at late stages. On the other hand, the selection of specific cerebral regions considerably improved the AD/MCI and MCI/HC classification.
Prasad et al. [49] compared an ensemble of different anatomical connectivity measures, using both fiber and flow connectivity methods, that may help in detecting AD patients. These features were fed into a repeated, stratified 10-fold cross-validation design, using SVMs to classify controls vs. AD, controls vs. early MCI (eMCI), controls vs. late MCI (L-MCI), and eMCI vs. L-MCI. The results exhibit a significant difference in the accuracy of the various feature sets used to distinguish between the diagnostic groups. In each of these classification problems, nine different sets of features were used: the fiber connectivity matrix (FI(M)), the flow connectivity matrix (FL(M)), the fiber network measures (FI(N)), the flow network measures (FL(N)), and combinations of these sets, namely FI(N+M), FL(N+M), FI(N)+FL(N), FI(M)+FL(M) and FI(N+M)+FL(N+M). All of these connectivity measures were derived solely from diffusion images. The emphasis of the study was to explore and understand which diffusion-based network measures are predictive of Alzheimer's disease, rather than to optimize classification accuracy as in previous studies. Accordingly, classification accuracy was adopted as the metric to evaluate different types of brain connectivity features, and to understand which ones may have an advantage in predicting the onset of MCI or AD.
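A repeated, stratified k-fold design of the kind described above can be sketched as follows. This is a minimal, library-free illustration: the round-robin assignment, the function name, the group sizes and the number of repetitions are assumptions, not details reported in [49].

```python
import random

def repeated_stratified_kfold(labels, k, repeats, seed=0):
    """Yield (train_idx, test_idx) pairs; every fold keeps the class proportions."""
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = [[] for _ in range(k)]
        for cls in sorted(set(labels)):
            idx = [i for i, lab in enumerate(labels) if lab == cls]
            rng.shuffle(idx)                    # new random partition at each repetition
            for pos, i in enumerate(idx):
                folds[pos % k].append(i)        # deal subjects of this class round-robin
        for f in range(k):
            test = sorted(folds[f])
            train = sorted(i for g in range(k) if g != f for i in folds[g])
            yield train, test

# Hypothetical cohort: 20 controls and 10 patients
labels = ["HC"] * 20 + ["AD"] * 10
splits = list(repeated_stratified_kfold(labels, k=10, repeats=3))
```

Stratification keeps the patient-to-control ratio of each test fold equal to that of the whole sample, and repeating the partition reduces the dependence of the estimated accuracy on a single random split.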
Ebadi et al. [50] investigated the diagnostic potential of brain connectivity models regarding AD and MCI, applying graph theory to DTI measures. Graphs represented connections between different cerebral areas; once the graph measures were extracted, the best features were selected in order to optimize the classifier's performance and reduce overfitting. Classification was conducted with different methods (logistic regression, random forest, NB, k-NN and SVM) and by combining their outputs to improve the performance of the whole model (Ensemble). They also tested a k-best feature selection method in which the features are ranked based on their power in performing the classification, and the top K features are then selected for the given estimator. The Ensemble with feature selection obtained the best performance. AD patients and HC were classified with an accuracy of 80.0%, while MCI patients were separated from controls with an accuracy of 66.7%; the AD/MCI classification reached an accuracy of 76.7%.
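The two ingredients described above, k-best feature ranking and the combination of several classifiers, can be illustrated with a short sketch. Majority voting is assumed here as the combination rule, since the exact rule is not stated in this review; the predictions and relevance scores are hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per subject by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

def top_k_features(scores, k):
    """Indices of the k features with the highest relevance score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Hypothetical predictions of three base classifiers for five subjects
model_a = ["AD", "HC", "AD", "HC", "HC"]
model_b = ["AD", "AD", "AD", "HC", "HC"]
model_c = ["HC", "HC", "AD", "HC", "AD"]
ensemble = majority_vote([model_a, model_b, model_c])

# Hypothetical relevance scores of four graph measures
selected = top_k_features([0.10, 0.72, 0.05, 0.44], k=2)
```

With an odd number of base classifiers and two classes no ties can occur, and an error made by a single classifier is outvoted by the other two.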
Maggipinto et al. [51] demonstrated the effect of feature selection bias (FSB) in DTI-based AD classification, which leads to an overestimation of performance metrics. FA and MD maps were extracted and registered to the same reference, and the regions corresponding to white matter were isolated through the TBSS algorithm, extracting the skeleton of white matter fiber tracts for each patient. Feature selection was performed via the Wilcoxon rank sum test and the ReliefF algorithm in both a "nested" (unbiased) and a "non-nested" way: in the former, feature selection is repeated inside each cross-validation round using only the training data, while in the latter it is performed only once, on the whole dataset, before cross-validation. The classification task was accomplished by a random forest with B = 300 trees trained with bootstrap aggregating. Performance was assessed with 100 rounds of 5-fold cross-validation. The results showed that performance diminished when using the nested approach. For example, for FA, accuracy dropped from a maximum mean value of 87% (non-nested) to 75% (nested) in AD/HC discrimination, while for MCI/HC it dropped from 81% to 59%. The same behavior was observed considering MD, where ACC decreased from 83% to 76% and from 79% to 66% for the AD/HC and MCI/HC classification, respectively.
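The mechanism behind feature selection bias can be reproduced on pure noise. The sketch below is an illustration under simplifying assumptions and not the pipeline of [51]: features are ranked by the absolute difference of class means instead of the Wilcoxon test or ReliefF, a nearest-centroid rule replaces the random forest, and leave-one-out replaces the repeated 5-fold cross-validation. All names and sizes are hypothetical.

```python
import random

def centroid_predict(train_X, train_y, x, feats):
    """Nearest-centroid prediction restricted to the selected feature indices."""
    best, best_d = None, float("inf")
    for label in (0, 1):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        d = sum((x[j] - sum(r[j] for r in rows) / len(rows)) ** 2 for j in feats)
        if d < best_d:
            best, best_d = label, d
    return best

def select_features(X, y, k):
    """Rank features by absolute difference of class means and keep the top k."""
    def gap(j):
        m0 = [r[j] for r, lab in zip(X, y) if lab == 0]
        m1 = [r[j] for r, lab in zip(X, y) if lab == 1]
        return abs(sum(m0) / len(m0) - sum(m1) / len(m1))
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def loo_accuracy(X, y, k, nested):
    """Leave-one-out accuracy with nested or non-nested feature selection."""
    fixed = None if nested else select_features(X, y, k)  # non-nested: uses test subjects too
    hits = 0
    for i in range(len(X)):
        tr_X, tr_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = select_features(tr_X, tr_y, k) if nested else fixed
        hits += centroid_predict(tr_X, tr_y, X[i], feats) == y[i]
    return hits / len(X)

random.seed(0)
n_subjects, n_features = 30, 400
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_subjects)]
y = [i % 2 for i in range(n_subjects)]   # labels are unrelated to the features
acc_non_nested = loo_accuracy(X, y, k=10, nested=False)
acc_nested = loo_accuracy(X, y, k=10, nested=True)
```

Since the features carry no information about the labels, an unbiased estimate should stay close to chance level. The non-nested estimate is expected to be clearly higher, because the selected features were chosen with knowledge of the held-out subjects.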
Eldeeb et al. [52] proposed a novel method to extract relevant markers associated with FA and MD. After preprocessing of the DTI data, FA and MD maps of regions of interest were determined using a "bag-of-words" model. This model was used to represent the patterns of the hippocampal diffusivity maps by clustering the extracted hippocampal features, whose number changes from one slice to another. Both speeded up robust features (SURF) and scale invariant feature transform (SIFT) features were extracted. With these FA and MD maps, an SVM was then trained to classify the different groups of subjects. Classification was performed for each pair of groups, and then among all of the classes, solving a multiclass problem. The best accuracies were obtained with the MD map using a SIFT feature descriptor and are reported as follows: 98.3% AD/HC, 93.6% MCI/HC, 92.0% AD/MCI and 89.0% multiclass.
Ye et al. [53] conducted a connectome-wide association study (CWAS) on AD, stable MCI (sMCI), MCI converting to AD (cMCI) and healthy subjects selected from the ADNI database to explore the alterations in structural connectivity networks of white matter without any a priori hypothesis on pathologic alterations. Whole-brain connectomes were generated through probabilistic fiber tracking of registered T1 images and DTI scans, parcellated into 90 regions according to the AAL atlas. Multivariate distance matrix regression (MDMR) paired with the delta method was applied to assess the variation of distance in connectivity patterns, highlighting the brain regions that displayed greater differences between the study groups. The discriminatory power of the connectivity features isolated by the MDMR analysis was tested by comparing the classification performance obtained with them against that of the whole-brain connectivity features, using a partial least squares discriminant analysis (PLS-DA) classifier with five-fold cross-validation on 161 subjects. For cMCI/HC classification, considering MDMR-selected features over whole-brain ones, SEN increased from 54.7% to 71.3%, while SPE decreased from 85.0% to 79.3%; regarding AD/HC classification, SEN went from 71.9% to 67.0%, while SPE grew from 70.1% to 76.2%.
Dalboni da Rocha et al. [54] classified AD, MCI and HC through an SVM applied to the patients' FA maps obtained through DTI, focusing on brain areas frequently associated with AD abnormalities. The analysis was repeated for the whole brain and in specific brain areas, both with and without a feature selection stage based on the Fisher Score. As expected, the accuracies obtained without feature selection were lower. Among all the considered brain areas, two showed greater discriminatory power (consistently lower FA) between AD and HC: the bilateral cingulum in the hippocampal formation and the parahippocampal gyrus, in accordance with previous studies on AD indicating parahippocampal white matter modifications. Repeating the analysis of both regions by requiring the voxels to have a minimum Fisher Score (0.4/0.8) led to a maximum ACC of 93% in AD/HC classification considering the cingulum in the hippocampal formation and 90% for the parahippocampal gyrus. However, MCI/HC classification showed lower accuracy, in some cases close to chance level, possibly due to the inability to assess FA on a submillimeter scale.
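Voxel selection with a minimum Fisher Score can be sketched as follows. The two-class form of the score used here (squared difference of the group means divided by the sum of the group variances) is one common definition and is an assumption, as are the function names and the FA values; the exact formulation adopted in [54] may differ.

```python
from statistics import mean, pvariance

def fisher_score(values_a, values_b):
    """Two-class Fisher score: squared mean difference over summed variances."""
    spread = pvariance(values_a) + pvariance(values_b)
    return (mean(values_a) - mean(values_b)) ** 2 / spread if spread > 0 else 0.0

def select_voxels(group_a, group_b, threshold):
    """Keep the voxel indices whose Fisher score reaches the threshold.

    group_a and group_b are lists of subjects, each a list of per-voxel FA values.
    """
    n_voxels = len(group_a[0])
    kept = []
    for v in range(n_voxels):
        score = fisher_score([s[v] for s in group_a], [s[v] for s in group_b])
        if score >= threshold:
            kept.append(v)
    return kept

# Hypothetical FA values of 3 voxels for 4 controls and 4 patients
hc = [[0.50, 0.30, 0.45], [0.52, 0.28, 0.44], [0.49, 0.33, 0.47], [0.51, 0.29, 0.46]]
ad = [[0.40, 0.31, 0.44], [0.42, 0.27, 0.46], [0.39, 0.32, 0.45], [0.41, 0.30, 0.47]]
kept = select_voxels(hc, ad, threshold=0.4)
```

In this toy example only the first voxel, where FA is consistently lower in patients, passes the threshold. In a real analysis the scores should be computed on training subjects only, to avoid the selection bias discussed for [51].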
Dou et al. [55] evaluated the integrity of whole-brain WM structure using automated fiber quantification (AFQ) for AD, amnestic MCI (aMCI) and healthy subjects. The corrected, b0-aligned DTI images of the participants were processed with the AFQ toolkit in order to identify 20 major fiber tracts that have been shown to be relevant in AD progression, first by estimating the fiber tractography and then by segmenting the fiber tracts of interest. FA, MD, DR and DA were determined at each point. Three classifiers were tested on a set of 1440 features per patient: SVM, LDA and extreme gradient boosting (XGB). Performance was evaluated with both 10-fold cross-validation and leave-one-out cross-validation. The results of this study summarized in Table 2 refer to SVM with leave-one-out cross-validation, for which the best results were obtained. Patients were divided into a discovery dataset and a replication dataset, and the statistical analysis, model learning and validation were repeated for both datasets, obtaining consistent results: ACC = 82.56%-83.72% for AD/HC classification, 77.78%-82.28% for AD/aMCI classification and 52.02%-51.25% for aMCI/HC classification.

Discussion
In this review article, we identified twenty-two studies applying ML techniques for the classification of AD based on DTI imaging data, used alone or in combination with other imaging techniques. Some of the reviewed studies only differentiated between AD patients and healthy controls, while others also included a group of MCI patients for the identification and differentiation of the prodromal stage of the disease.
To the best of our knowledge, this is the first study that systematically reviewed classification approaches in AD with a focus on DTI. The attention to this specific technique is due to the fact that DTI is sensitive to microstructural white matter changes that are not visible with conventional volumetric techniques, and thus may contribute to the search for early biomarkers of the disease [56].
Studies discussed in this review have highlighted the role of DTI data as biomarkers of AD and MCI. Combining ML approaches with features extracted from DTI scans can provide a customized diagnosis for the early identification of AD, MCI and healthy subjects. Importantly, one of the great advantages of applying classification algorithms to neuroimaging data is their potential use for detecting AD at the prodromal stages, even well before clinical manifestation [57], which could find application in routine clinical settings in the future. In particular, the early detection of MCI is fundamental, since existing AD therapies show better results if the disease is still at earlier stages.
As regards the binary classification between AD and HC, very high performance in terms of accuracy (>90%) was achieved by several studies ([35,37,39,41,43,46,52]), among which, two even obtained 100% accuracy ([35,46]) (Figure 3). However, it should be noted that the sample size of these studies, in particular of the ones obtaining an accuracy of 100%, is quite limited (15-35 subjects per group), thus, the model could have been overfitted and could lack generalizability.
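The caution about small samples can be made concrete with a short calculation. The sketch below is an illustration and not part of any reviewed study: it computes the exact two-sided (Clopper-Pearson) lower confidence bound on the true accuracy when every test case is classified correctly, assuming independent test cases, which cross-validated estimates satisfy only approximately. The function name `perfect_score_lower_bound` is hypothetical.

```python
def perfect_score_lower_bound(n_subjects: int, alpha: float = 0.05) -> float:
    """Exact two-sided (Clopper-Pearson) lower confidence bound on the true
    accuracy when all n_subjects test cases were classified correctly.

    With k = n successes the bound p solves p**n = alpha / 2.
    Assumption: the test cases are independent.
    """
    if n_subjects <= 0:
        raise ValueError("n_subjects must be positive")
    return (alpha / 2.0) ** (1.0 / n_subjects)


# Total sample sizes for two groups of 15 and two groups of 35 subjects.
for n in (30, 70):
    print(n, round(perfect_score_lower_bound(n), 3))
```

Under these assumptions, an observed accuracy of 100% on 30 subjects is still statistically compatible with a true accuracy below 90%.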
Studies reported in this review show evidence that automated DTI-based classifications of both MCI/HC and MCI/AD provide considerably inferior results than AD/HC separation (accuracy: ~80%). Only two studies obtained an accuracy higher than 90% [46,52], but also in this case, the limited sample size needs to be considered as a potential bias (Figure 4). Lower accuracy in these classifications is probably due to less marked differences between the features extracted. In addition, it is worth mentioning that, also from a clinical point of view, there is less confidence in the underlying pathology in MCI patients. Indeed, MCI itself is a heterogeneous group, which is not always screened for primarily amnestic type or amyloid biomarkers that would increase the probability of prodromal AD.
Only one work [52] investigated the ternary problem (AD vs. MCI vs. HC) and reached a good performance (accuracy = 89%). Thus, from this study, it seems that the integration of DTI with ML can be a valuable instrument for the AD vs. MCI vs. HC classification also in clinical practice.
Interestingly, one study [49] also compared early MCI (eMCI) vs. late MCI (L-MCI), obtaining a quite low accuracy (63.4%). Thus, the problem of detecting subtle differences between subgroups needs to be further investigated.
Importantly, the reviewed studies differed by several factors including the sample sizes, the imaging analysis approach (i.e., voxel-based vs. tract-based), different features extracted, different feature selection methods and classification approaches. For this reason, it is difficult to quantitatively compare the different studies, while a qualitative analysis of the results can be performed.
Few studies have compared different classification approaches [40,45,46,55], all of them finding that SVM outperformed the other classifiers. However, in future studies, it would be useful to perform a more extensive comparison of the performance of diverse classification algorithms.
Another important factor that influences the performance concerns the extracted features. The first important distinction is between studies that computed voxel-based or ROI-based features (e.g., [35-37]) and studies relying on tract-based features (e.g., [38,49]). In the first case, diffusion features are computed in each voxel of the whole brain, or in specific ROIs, while in the second case, white matter fiber tracts are estimated and, for each tract, the mean value of the desired diffusion feature is calculated. Moreover, while most of the studies computed quite common and simple features like fractional anisotropy, mean diffusivity, betweenness centrality, radial or axial diffusivity and connectivity strength (e.g., [35,37,44,51]), few studies extracted more complex features [38,49].
Most of these studies showed that FA represents the best diffusion feature for classification models and provides valuable information to distinguish between AD and healthy subjects [35,37,38,51], while others obtained better results using other features like MD [47,52]. Concerning MCI vs. HC classification, some studies [46,47,49,52] reached better performances using mean diffusivity and fiber density as features.
In one study [41], the performances using voxel- and tract-based features were compared. According to the results of this study, tract features seem to perform better in differentiating between AD and HC. This could be due to the fact that clustering voxels into tracts reduces dimensionality by grouping voxels with similar anatomic and functional characteristics.
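The dimensionality reduction obtained by grouping voxels into tracts can be illustrated with a minimal sketch. The FA values, tract assignments and the helper `tract_means` are hypothetical and serve only to show how per-voxel values collapse into one mean value per tract; real pipelines derive the tract labels from tractography.

```python
from collections import defaultdict
from statistics import mean


def tract_means(fa_values, tract_labels):
    """Average a per-voxel diffusion value within each tract label."""
    by_tract = defaultdict(list)
    for value, label in zip(fa_values, tract_labels):
        by_tract[label].append(value)
    return {label: mean(values) for label, values in by_tract.items()}


# Hypothetical FA values for eight voxels and the tract each voxel belongs to.
fa = [0.42, 0.40, 0.44, 0.31, 0.29, 0.30, 0.55, 0.53]
labels = ["cingulum", "cingulum", "cingulum",
          "fornix", "fornix", "fornix",
          "corpus_callosum", "corpus_callosum"]

features = tract_means(fa, labels)
print(len(fa), "voxel features ->", len(features), "tract features")
```

Here eight voxel-level values are reduced to three tract-level features, one per tract.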
In addition, two studies [37,44] found that including other types of features, such as clinical parameters like the MMSE score, can also improve classification performance.
In addition to classification and feature extraction, feature selection is also important for identifying discriminating features. The selection of appropriate features not only removes the non-informative signal, but also reduces the computational time involved in classification. The two most adopted methods for feature selection are biologically informed and automated feature selection methods.
The former relies on prior biological knowledge about the discriminating ability of certain regions, generally obtained from existing literature, whereas the latter selects features based on general data characteristics, without prior knowledge.
The automated methods applied in the reviewed studies included genetic algorithm [36], t-test [39,45], recursive feature elimination [39,43], PCA [45], Wilcoxon rank sum test [51], ReliefF algorithm [48,51], multivariate distance matrix regression [53], false discovery rate [47] and k-best method [50]. Although it is difficult to say which is the best feature selection algorithm, since a comparison study is missing and the studies differ in multiple factors, it is evident from all these studies that selecting the most discriminant features improves the performance of the classifier by eliminating redundant or less useful features from the dataset. In particular, [51] shows that a feature selection that is not blind to the test set leads to overoptimistic results (10% up to 30% relative increase in area under the curve (AUC)).
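One well-known way in which feature selection can inflate performance is when features are chosen using all subjects before cross-validation, so that the held-out subjects have already influenced the selection. The following sketch illustrates this on pure noise. It does not reproduce the pipeline of any reviewed study: the mean-difference score, the nearest-centroid classifier and all names are assumptions chosen for brevity.

```python
import random


def top_k_features(X, y, k):
    """Rank features by absolute difference of class means; keep the k best."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return sorted(range(n_features), key=lambda j: -scores[j])[:k]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Assign x to the class whose mean (on the chosen features) is closest."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        rows = [r for r, t in zip(X_train, y_train) if t == label]
        dist = 0.0
        for j in features:
            centroid_j = sum(r[j] for r in rows) / len(rows)
            dist += (x[j] - centroid_j) ** 2
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, k, nested):
    """Leave-one-out accuracy; selection runs inside each fold only if nested."""
    global_features = None if nested else top_k_features(X, y, k)
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        features = top_k_features(X_train, y_train, k) if nested else global_features
        correct += nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]
    return correct / len(X)


rng = random.Random(0)
n_per_class, n_features, k = 20, 500, 10
# Pure noise: the labels carry no information about the features.
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(2 * n_per_class)]
y = [0] * n_per_class + [1] * n_per_class

biased = loo_accuracy(X, y, k, nested=False)
unbiased = loo_accuracy(X, y, k, nested=True)
print(f"selection on all data: {biased:.2f}; selection inside each fold: {unbiased:.2f}")
```

Because the data contain no real signal, any accuracy well above chance in the first setting is produced by the selection step alone, whereas repeating the selection inside each fold keeps the estimate near chance level.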
Some studies applied a biologically informed selection method and focused only on regions that are known to be compromised in AD, in particular the hippocampus [44,45], the parahippocampal gyrus and hippocampal cingulum [54] or the amygdala [45]. Indeed, the hippocampus and the amygdala are among the anatomical structures of particular interest to the study of AD, mainly because of their active involvement in memory [58]. Both the global volume and the local shape of the hippocampus and the amygdala have been found to be compromised in AD [59,60]. The performance obtained by these studies is comparable to that obtained using automated methods. In particular, diffusion features from the right hippocampus [38,45] or from the parahippocampal gyrus [54] provided the best results in discriminating between AD/HC or AD/MCI. Indeed, it has previously been suggested that automated feature selection will not improve classification accuracy as compared to biologically informed feature selection, driven by prior biological knowledge of regions typically affected by AD, such as the hippocampus, amygdala, thalamus and caudate [61]. Notably, in the MCI/HC classification, whole-brain analysis performed better in [54], possibly due to the more subtle and sparse alterations in the prodromal stage of the disease.
The last important point to be considered when discussing the reviewed studies concerns the application of unimodal versus multimodal images. For AD/HC classification, five studies integrated DTI with sMRI [39-41,44,45], while only one also added fMRI [42]. One study also combined DTI with a more recent technique, DKI [43]. Notably, none of those studies applied a multimodal approach for the classification of MCI compared with AD.
All but one study [40] found that the results obtained using DTI measures outperformed those obtained with volumetric images. The contradictory results in [40] could be due to the advanced stage of the patients included in the study, whose brain volume was highly compromised by cortical atrophy. Another possible explanation for this contradictory result is the multi-centric nature of the study. Indeed, it has been pointed out that DTI is more affected than volume measures by site effects due to differences in acquisition parameters [62]. For this reason, combining images from different sites could have compromised the classification accuracy mostly for DTI images.
In addition, most of these studies found that the combination of multimodal features outperformed the results obtained by using one single technique. Indeed, DTI-based features serve as a complementary tool to volume-based features, as the two imaging techniques reflect tissue changes associated with AD that correspond to pathological evidences in the gray matter and white matter, respectively. Thus, from the results of this review, it seems that combining several neuroimaging modalities is promising for further understanding the underlying disease mechanisms. However, it must be noted that [42] found that combining parameters from different neuroimaging modalities does not significantly improve AD/HC separation. Thus, future studies need to assess whether multimodal imaging, including functional (or metabolic) imaging methods, provides additional diagnostic accuracy for the classification of AD clinical labels, which could only be obtained from pathology.
In addition to the above-mentioned future lines of research, including the testing and fair comparison of different classifiers and different feature extraction/selection approaches and a more systematic evaluation of the benefits of multimodal imaging compared with unimodal imaging, other future directions can be suggested. First, it would be important to include larger samples of subjects, since most of the reviewed studies deal with quite small study groups. Larger samples from different sites, together with better pooling analysis methods, may improve the statistical power of the analysis, making it possible to obtain more reliable information [63].
Second, future works should focus more on the integration of heterogeneous data sources, since promising results have been obtained so far in this direction. Such data should importantly include physiological and functional parameters that can aid in constructing diagnostic tools with higher sensitivity and specificity, for more effective analysis of brain diseases [8]. Moreover, data other than neuroimaging could improve the classification of AD, including cognitive measures, risk factors associated with AD or cerebrospinal fluid measures [64].
Another important future direction is the implementation of longitudinal studies that include different stages of AD, for a better understanding of the progression of the disease from the earliest to the most advanced stages. Indeed, a better understanding of the progression of neuronal deterioration and its correlation with psychological symptoms may help in setting up new tailored treatments, such as real-time neurofeedback [65] and brain-computer interface training [66].
Finally, the application of deep learning methods and their comparison with ML approaches should be better investigated in the future. With respect to conventional ML methods, deep learning algorithms require little or no image pre-processing, and can automatically infer an optimal representation of the data from the raw images without requiring prior feature selection, thus resulting in a more objective and less biased process [67]. A few papers on the application of deep learning approaches, and in particular convolutional neural networks, to the classification or prediction of AD using DTI data have recently been published, achieving good results [68,69]. More comprehensive studies are needed to evaluate the advantages of these methods compared with more traditional approaches.

Conclusions
To summarize, the results of this review showed that ML algorithms can be successfully applied to DTI or multimodal imaging data to deepen the current understanding of structural and functional connectivity mechanisms of AD and MCI, representing one of the ultimate goals of future AD-related research.
According to existing studies, the classification between AD and HC performs better than that between AD and MCI or MCI and HC, probably because research on MCI is less advanced and because this group is heterogeneous. Support vector machine appears to outperform the other classifiers, although in this domain other approaches (e.g., random forest) are promising. Regarding selected features, FA provided the most powerful results in AD/HC classification, possibly due to the high disruption of WM integrity, while in the detection of MCI, other features could be more reliable, in particular MD. Focusing on specific ROIs, in particular the hippocampus and the amygdala, which are known to be compromised in AD, might not decrease the performance compared with a whole-brain analysis, at least in the classification between AD and HC. Multimodal approaches that look for patterns of neurodegeneration across different kinds of bioimages are gaining increasing attention and seem to be promising for a better classification of AD or MCI. Multimodal imaging approaches, MCI biomarkers, characterization of different stages of the disease, testing and comparing different types of classifiers, including deep learning algorithms, feature selection algorithms and bigger sample sizes, are important strategies that are likely to be emphasized in future studies.

Machine learning (ML) is a broad term referring to an ensemble of computer algorithms that adapt their output through experience to match a desired outcome. Generally, an ML algorithm returns an output value determined by its input variables, called features. In order to refine the aptness of the computed output, the program first learns on a training dataset, while its performance is evaluated on one or more validation datasets. The size of the data involved in both steps is crucial, as small samples could lead to unreliable results.
ML models are most often grouped into three categories, depending on the nature of the learning process: in supervised learning, the program learns on a labelled dataset where the desired outcome is known, adjusting its output to replicate as best as possible the desired one; in unsupervised learning, the data is not labelled and the algorithm looks for similarities in the inputs by modeling their probability densities, highlighting the standing relations between them; in reinforcement learning, the algorithm discovers the desired outcome in a process of trial and error, and adapts its output to maximize the correct decisions that lead to it. The machine learning aspects of this review specifically concern one of the four kinds of learning problems, classification, where the output belongs to a discrete range (AD, HC and/or MCI) and with a supervised learning process. Consequently, these ML models are classifiers, i.e., objects that assign each feature vector x (a patient) to one of the c classes or groups. A brief description of each method mentioned in this paper follows. For additional background, see [70][71][72].
y and x for the projection into component space. This algorithm finds the latent variables with the maximum covariance with the y variable, instead of seeking directions that explain only the most variance. Considering all available directions would correspond to a conventional least square estimate, while selecting only a subset of them leads to a reduced regression with lower chances of overfitting. The conversion of the continuous value of y into its corresponding categorical value (i.e., turning a regressor into a classifier) can be done by comparing, for each new observation x, the c class values resulting from the PLS regression: the observation is then assigned to the class that showed the highest probability.

Appendix B.7. K-Nearest Neighbors
The nearest neighbor family of classifiers processes new observations x depending on the outcome of the closest data points. In its most elementary form, the k-nearest neighbors (k-NN) classifier assigns the observation x to the most common class among its k nearest neighbors, where k is a user-defined parameter. Distance can be determined with various metrics, the most common one being the Euclidean distance. Several versions of supervised k-NN exist, where the object of the learning process is usually the definition of the metric that best sorts the training inputs into their respective groups. This means finding the matrix M which, placed in $d(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)$, minimizes the classification error.
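As a minimal sketch of how the metric matrix M enters the classification, the example below implements k-NN with the distance (x_i − x_j)^T M (x_i − x_j) and a majority vote. It only applies a given M and does not learn it. The toy data and the two example matrices are made up for illustration.

```python
from collections import Counter


def metric_distance(a, b, M):
    """Squared distance (a - b)^T M (a - b) for a positive semi-definite M."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    return sum(diff[r] * M[r][c] * diff[c]
               for r in range(len(diff)) for c in range(len(diff)))


def knn_predict(X_train, y_train, x, k, M):
    """Majority vote among the k training points closest to x under M."""
    order = sorted(range(len(X_train)),
                   key=lambda i: metric_distance(X_train[i], x, M))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data with two features per subject (values are made up).
X_train = [[0.0, -2.0], [0.1, -2.5], [0.2, -1.5],
           [1.0, 2.0], [1.1, 2.5], [0.9, 1.5]]
y_train = ["HC", "HC", "HC", "AD", "AD", "AD"]

identity = [[1.0, 0.0], [0.0, 1.0]]     # M = I gives the Euclidean distance
first_only = [[1.0, 0.0], [0.0, 0.0]]   # M that weights only the first feature

x_new = [0.3, 2.0]
print(knn_predict(X_train, y_train, x_new, k=3, M=identity))
print(knn_predict(X_train, y_train, x_new, k=3, M=first_only))
```

With the identity matrix the three nearest neighbors of `x_new` all belong to the AD group, while with the second matrix they all belong to the HC group, which shows that the choice of M determines which training points count as neighbors.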

Appendix B.8. Random Forest
The random forest is a regression and classification technique based on bagging (bootstrap aggregating): a large ensemble of decision trees with low correlation between them is trained, and their outputs are then averaged. A decision tree, often represented as a flowchart, is a model consisting of successive binary splits of the input space. A tree consists of its root, the first split; its branches, the subsequent splits; and its leaves, representing the predicted value (whether continuous or categorical). Building a tree corresponds to partitioning the input space into rectangles with lines that are parallel to the coordinate axes. In a decision tree, learning (growing) means deciding, at each node, the splitting threshold for the n-th input feature, which can be done by exhaustive search, minimizing an error function: for classification, two common measures are cross-entropy and the Gini index. After a sufficiently large tree is built, it is pruned by removing some of its branches, balancing the error function against a measure of model complexity (cost-complexity pruning). In a random forest, several trees are built, each time selecting a subset of the input variables. After the desired number of classification trees has been trained, the output classification is the result of a majority vote. By bagging the trees, instead of considering a single, larger tree, the overall variance of the model is decreased, although its bias is unchanged.
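The ingredients described above (Gini index, exhaustive threshold search, bootstrap samples, random feature subsets and majority vote) can be sketched in a few lines. This is a deliberate simplification: each tree is limited to a single split and no pruning is performed. The data are hypothetical, with two informative features and one uninformative one.

```python
import random
from collections import Counter


def gini(labels):
    """Gini index of a list of class labels: 1 - sum_c p_c^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y, feature_ids):
    """Exhaustive search for the (feature, threshold) with lowest weighted Gini."""
    best = None  # (impurity, feature, threshold)
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [label for row, label in zip(X, y) if row[j] <= t]
            right = [label for row, label in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or impurity < best[0]:
                best = (impurity, j, t)
    return best


def fit_stump(X, y, feature_ids):
    """One-split tree: returns (feature, threshold, left_label, right_label)."""
    majority = Counter(y).most_common(1)[0][0]
    split = best_split(X, y, feature_ids)
    if split is None:  # no valid split, e.g. all sampled rows identical
        return (0, float("inf"), majority, majority)
    _, j, t = split
    left = Counter(l for row, l in zip(X, y) if row[j] <= t).most_common(1)[0][0]
    right = Counter(l for row, l in zip(X, y) if row[j] > t).most_common(1)[0][0]
    return (j, t, left, right)


def fit_forest(X, y, n_trees, n_sub_features, rng):
    """Bagging: each stump sees a bootstrap sample and a random feature subset."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        feats = rng.sample(range(len(X[0])), n_sub_features)
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, x):
    """Majority vote over the individual tree predictions."""
    votes = Counter(left if x[j] <= t else right for j, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Hypothetical data: two FA-like informative features and one uninformative one.
X = [[0.30, 0.31, 1.0], [0.32, 0.29, 2.0], [0.28, 0.33, 3.0],
     [0.31, 0.30, 4.0], [0.29, 0.32, 5.0],
     [0.45, 0.44, 1.0], [0.47, 0.46, 2.0], [0.44, 0.48, 3.0],
     [0.46, 0.45, 4.0], [0.48, 0.47, 5.0]]
y = ["AD"] * 5 + ["HC"] * 5

forest = fit_forest(X, y, n_trees=25, n_sub_features=2, rng=random.Random(0))
print(predict_forest(forest, [0.27, 0.28, 3.0]))
print(predict_forest(forest, [0.50, 0.49, 3.0]))
```

Restricting each tree to a random subset of features is what keeps the trees weakly correlated, so that the vote averages out their individual errors.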

Appendix B.9. Boosting Techniques
The term boosting refers to a technique where several weak classifiers, with performance slightly above chance level, are combined to form a powerful committee, able to get very close to the target classification performance. Adaptive boosting (AdaBoost) is one of the most popular boosting algorithms, formulated for the two-class problem, where the weak classifiers are trained sequentially, the performance of each one influencing the training of the next. Each of the M training data points x is given a weight $w_m$, initially set to 1/M. The first weak classifier is then trained, using the data to produce a class prediction y ∈ {−1, 1}. The next weak classifiers are trained after the weights are updated, giving more relevance to misclassified data. When the desired number of weak classifiers has been trained, the committee is formed: each one contributes to the class prediction through a second set of weights $a_j$, one for each base classifier, determined by minimizing an exponential loss function. One of the simplest forms of base learner that can be adopted is the decision stump, a single-level decision tree: the discrimination between two classes is done by comparing a feature to a single threshold. Gradient boosting is a numerical development of the boosting method, often applied to decision trees. Through a differentiable loss function, the successive weak learners are trained in the gradient direction of minimal loss (gradient descent), fitting them to the negative gradient values of the chosen function. For classification, such a loss function can be the multinomial deviance, constructing at each iteration a number of trees equal to the total number of groups c, even though for binary classification a single tree for each iteration is sufficient.
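The AdaBoost procedure outlined above can be sketched with decision stumps as weak classifiers. The code follows the standard discrete AdaBoost update, in which the committee weight of each stump is 0.5·ln((1 − err)/err), with err its weighted error. This is the value that minimizes the exponential loss. The one-dimensional toy data are made up and chosen so that no single stump can separate the classes.

```python
import math


def fit_stump(X, y, w):
    """Weighted decision stump: pick (feature, threshold, sign) with the lowest
    weighted error. Predicts sign if x[feature] <= threshold, else -sign."""
    best = None  # (error, feature, threshold, sign)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[j] <= t else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best


def stump_predict(stump, x):
    _, j, t, sign = stump
    return sign if x[j] <= t else -sign


def adaboost_fit(X, y, n_rounds):
    """Discrete AdaBoost for labels in {-1, +1}; returns [(alpha, stump), ...]."""
    m = len(X)
    w = [1.0 / m] * m  # uniform initial weights
    committee = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-12), 1.0 - 1e-12)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        committee.append((alpha, stump))
        # Up-weight misclassified points, down-weight the rest, renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return committee


def adaboost_predict(committee, x):
    """Sign of the alpha-weighted sum of the stump predictions."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in committee)
    return 1 if score >= 0 else -1


# Toy one-feature data: the -1 class sits between two groups of the +1 class.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]

one_round = adaboost_fit(X, y, n_rounds=1)
three_rounds = adaboost_fit(X, y, n_rounds=3)
print([adaboost_predict(one_round, x) for x in X])
print([adaboost_predict(three_rounds, x) for x in X])
```

On these data a single stump misclassifies two of the six points, while the committee of three stumps reproduces all training labels.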